  1. Definitions
  2. Download and Clean Data
  3. Statistics
  4. Feature Importances

1. Definitions

The goal of this series of articles is to explore various aspects of plate discipline for hitters, including building multiple machine learning models to predict strikeout and walk rates from plate discipline data. First, let us define the plate discipline variables we use; a short worked example follows the list. From FanGraphs:

  • O-Swing% – swings at pitches outside the zone / pitches outside the zone
  • Z-Swing% – swings at pitches inside the zone / pitches inside the zone
  • Swing% – swings / total pitches
  • O-Contact% – number of pitches on which contact was made on pitches outside the zone / swings on pitches outside the zone
  • Z-Contact% – number of pitches on which contact was made on pitches inside the zone / swings on pitches inside the zone
  • Contact% – number of pitches on which contact was made / swings
  • Zone% – pitches in the strike zone / total pitches
  • F-Strike% – first pitch strikes / plate appearances
  • SwStr% – swings and misses / total pitches
  • BB% – walks / plate appearances
  • K% – strikeouts / plate appearances
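
To make the definitions concrete, here is a small worked example that computes some of these rates from made-up pitch counts for a single hitter (all of the numbers below are invented for illustration only):

# hypothetical pitch counts for one hitter, for illustration only
pitches_out_zone = 12   # pitches outside the strike zone
swings_out_zone = 3     # swings at those pitches
pitches_in_zone = 18    # pitches inside the strike zone
swings_in_zone = 11     # swings at those pitches
contact_out_zone = 2    # contact made on out-of-zone swings
contact_in_zone = 10    # contact made on in-zone swings

total_pitches = pitches_out_zone + pitches_in_zone
total_swings = swings_out_zone + swings_in_zone

o_swing_pct = 100 * swings_out_zone / pitches_out_zone    # O-Swing% = 25.0
z_swing_pct = 100 * swings_in_zone / pitches_in_zone      # Z-Swing% ~ 61.1
swing_pct = 100 * total_swings / total_pitches            # Swing% ~ 46.7
o_contact_pct = 100 * contact_out_zone / swings_out_zone  # O-Contact% ~ 66.7
z_contact_pct = 100 * contact_in_zone / swings_in_zone    # Z-Contact% ~ 90.9
contact_pct = 100 * (contact_out_zone + contact_in_zone) / total_swings  # Contact% ~ 85.7
zone_pct = 100 * pitches_in_zone / total_pitches          # Zone% = 60.0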

2. Download and Clean Data

We downloaded the data as csv files from FanGraphs. To download a single season, follow these instructions:

  1. Go to FanGraphs -> Leaders -> Batting Leaders -> choose a year. For 2018 you should end up here.
  2. The following options should be enabled: Leaderboards, Player Stats, Batting, League: All Leagues, Team: All Teams, All, Single Season (choose a year), Split: Full Season, Min PA (we used 150 for a full season and 80 for a half season, which is what we used for 2019)
  3. The Dashboard will be displayed for the options chosen above. Click Plate Discipline.
  4. Go toward the bottom of the page. Under the header Custom Leaderboards, in the left-hand column, click BB%, then click the single right-pointing arrow. Do the same for K%.
  5. Scroll down, click: Create Custom Table.
  6. Scroll up to the top of the displayed table, above the rightmost column, click: Export Data.

Here is a sample of the raw data:

Name         | Team      | O-Swing% | Z-Swing% | Swing% | O-Contact% | Z-Contact% | Contact% | Zone%  | F-Strike% | SwStr% | BB%    | K%     | playerid
Mike Trout   | Angels    | 21.8 %   | 59.1 %   | 37.6 % | 69.0 %     | 91.7 %     | 84.1 %   | 42.4 % | 58.1 %    | 6.0 %  | 20.1 % | 20.4 % | 10155
Juan Soto    | Nationals | 21.9 %   | 60.7 %   | 38.8 % | 68.1 %     | 85.7 %     | 80.1 %   | 43.6 % | 57.5 %    | 7.7 %  | 16.0 % | 20.0 % | 20123
Jose Ramirez | Indians   | 22.3 %   | 62.3 %   | 38.5 % | 79.4 %     | 92.1 %     | 87.7 %   | 40.4 % | 53.0 %    | 4.7 %  | 15.2 % | 11.5 % | 13510

Individual files for 2015 to 2019 were downloaded.

The code below reads a csv file into a pandas DataFrame, drops the Name and Team columns, sets the playerid column as the index, and strips the trailing ' %' from the numeric entries so they can be stored as floats.

import pandas as pd

def read_clean_csv(csv_file):

    dfpd = pd.read_csv(csv_file)

    # drop Name and Team columns
    dfpd.drop(columns=['Name', 'Team'], inplace=True)

    # set playerid as the index
    dfpd.set_index('playerid', drop=True, inplace=True)

    # delete ' %' from the percentage columns and convert to float
    for col in dfpd.columns:
        if '%' in col:
            dfpd[col] = dfpd[col].replace({' %': ''}, regex=True).astype(float)

    return dfpd

The raw csv files can be downloaded from here.
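
As a quick usage sketch (the file name below is a placeholder for wherever a season's export was saved), a single season can be read and cleaned in one call:

# hypothetical file name; substitute the path of the exported FanGraphs csv
df2018 = read_clean_csv('fangraphs_plate_discipline_2018.csv')

print(df2018.shape)       # should be (number of hitters meeting the PA minimum, 11 columns)
print(df2018.loc[10155])  # Mike Trout's 2018 line, indexed by playerid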

3. Statistics

After putting the data into the format above, we combine 2015-2018 into a single DataFrame which we will refer to as the calibration data. Later, this will be shuffled and split into parts for use in various machine learning algorithms.
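
A minimal sketch of that step, reusing read_clean_csv from above (the file names are placeholders, and the 80/20 train_test_split call is only an illustration of the shuffle-and-split step, not the exact split used later):

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical file names for the 2015-2018 downloads
season_files = ['fangraphs_2015.csv', 'fangraphs_2016.csv',
                'fangraphs_2017.csv', 'fangraphs_2018.csv']

# stack the four seasons into a single calibration DataFrame
dfcalib = pd.concat([read_clean_csv(f) for f in season_files])

# later: shuffle and split, e.g. an 80/20 train/test split for the models
df_train, df_test = train_test_split(dfcalib, test_size=0.2, shuffle=True, random_state=0)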

To compute descriptive statistics we use

import numpy as np

dfstats = dfcalib.describe().T
dfstats['std error'] = dfcalib.std()/np.sqrt(dfcalib.shape[0])
dfstats['skew'] = dfcalib.skew()
dfstats['kurtosis'] = dfcalib.kurtosis()
dfstats = dfstats.round(decimals=3)

dfstats.to_csv(results_dir + 'calibration_statistics.csv')

           | count  | mean   | std   | min  | 25%    | 50%   | 75%  | max  | std error | skew   | kurtosis
O-Swing%   | 1566.0 | 30.426 | 5.884 | 14.2 | 26.2   | 30.2  | 34.2 | 53.2 | 0.149     | 0.208  | -0.032
Z-Swing%   | 1566.0 | 67.432 | 5.78  | 49.3 | 63.6   | 67.4  | 71.3 | 85.1 | 0.146     | 0.006  | -0.026
Swing%     | 1566.0 | 46.793 | 4.913 | 32.5 | 43.3   | 46.65 | 50.0 | 61.1 | 0.124     | 0.172  | -0.133
O-Contact% | 1566.0 | 63.789 | 8.904 | 30.3 | 58.125 | 64.0  | 70.1 | 86.1 | 0.225     | -0.198 | -0.131
Z-Contact% | 1566.0 | 86.297 | 4.924 | 67.3 | 83.2   | 86.9  | 90.0 | 97.5 | 0.124     | -0.477 | 0.018
Contact%   | 1566.0 | 78.212 | 5.95  | 59.1 | 74.2   | 78.3  | 82.6 | 92.6 | 0.15      | -0.211 | -0.272
Zone%      | 1566.0 | 44.237 | 2.522 | 36.1 | 42.525 | 44.1  | 46.0 | 53.3 | 0.064     | 0.039  | -0.108
F-Strike%  | 1566.0 | 60.273 | 3.958 | 48.2 | 57.425 | 60.3  | 63.0 | 74.4 | 0.1       | 0.112  | -0.203
SwStr%     | 1566.0 | 10.24  | 3.168 | 3.1  | 7.9    | 10.1  | 12.3 | 23.8 | 0.08      | 0.31   | -0.148
BB%        | 1566.0 | 8.327  | 3.121 | 1.3  | 6.1    | 8.0   | 10.3 | 20.6 | 0.079     | 0.6    | 0.391
K%         | 1566.0 | 21.085 | 6.045 | 6.4  | 16.825 | 20.8  | 24.9 | 42.2 | 0.153     | 0.317  | -0.122

The code to compute and plot a correlation coefficient matrix:

import seaborn as sb
import matplotlib.pyplot as plt

sb.set(font_scale=0.6)
hm = sb.heatmap(dfcalib.corr(), annot=True, fmt=".3f", cmap=sb.color_palette("Blues"))
hm.set_xticklabels(hm.get_xticklabels(), rotation=35)
plt.ioff()
plt.savefig(results_dir + 'new_correlation.png')
plt.clf()
plt.close()

[Figure: Plate discipline correlation coefficient matrix]

4. Feature Importances

We use XGBoost to compute several measures of feature importance. In decision tree algorithms, a feature's importance reflects how much the splits made on that feature affect the samples that reach each leaf of a tree. XGBoost provides five importance types: weight, cover, total cover, gain, and total gain. In “The Multiple faces of ‘Feature importance’ in XGBoost”, Amjad Abu-Rmileh gives a good discussion of what the different importance types mean and how they can be misleading if not interpreted correctly. We found the following passage to be of particular interest:

Suppose that you have a binary feature, say gender, which is highly correlated with your target variable. Furthermore, you observed that the inclusion/removal of this feature from your training set highly affects the final results. If you investigate the importance given to such feature by different metrics, you might see some contradictions:
 
Most likely, the variable gender has a much smaller number of possible values (often only two: male/female) compared to other predictors in your data. So this binary feature can be used at most once in each tree, while, let's say, age (with a higher number of possible values) might appear much more often on different levels of the trees. Therefore, such a binary feature will get a very low importance based on the frequency/weight metric, but a very high importance based on both the gain, and coverage metrics!

The code below computes feature importances and plots them as bar charts.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor
import matplotlib.pyplot as plt

dfcalib = pd.read_pickle(data_dir + 'df_calibration.pkl')

# feature importances
header_attr = ['O-Swing%', 'Z-Swing%', 'Swing%', 'O-Contact%',
               'Z-Contact%', 'Contact%', 'Zone%', 'F-Strike%', 'SwStr%']
x = dfcalib[header_attr].values
list_header_target = ['BB%', 'K%']
list_importance = ['weight', 'gain', 'total_gain', 'cover', 'total_cover']
for target in list_header_target:
    y = dfcalib[target].values
    for importance in list_importance:
        xgb = XGBRegressor(n_jobs=-1, importance_type=importance,
                           objective='reg:squarederror')
        xgb.fit(x, y)

        y_pos = np.arange(len(header_attr))
        importance_values = xgb.feature_importances_

        fn = 'feature_importance_' + importance + '_' + target + '.png'
        list_of_colors = ['darkblue']*len(header_attr)

        plt.ioff()  # prevents plot from showing; use plt.show() to show or delete ioff
        fig, ax = plt.subplots()
        ax.bar(y_pos, importance_values, align='center', color=list_of_colors)
        ax.set_facecolor('white')
        ax.axhline(y=0, linewidth=0.5, color='black')  # solid horizontal line on x-axis
        ax.set_axisbelow(True)
        plt.xlabel('Features', fontsize=10)
        ax.tick_params(axis='y', labelsize=10)
        plt.ylabel('Feature Importance: ' + importance.title(), fontsize=10)
        plt.xticks(y_pos, header_attr, fontsize=10, rotation=20)
        plt.title(target, fontsize=10)
        ax.grid(linestyle='--', linewidth=0.5, color='black')

        fig.savefig(results_dir + fn)
        plt.close(fig)
        plt.rcParams.update({'figure.max_open_warning': 0})
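
As a side note, the loop above refits a model for every importance type. A single fitted booster can also be queried for all five metrics through XGBoost's Booster.get_score method; the short sketch below (reusing x and header_attr from above and taking K% as the target) is only an illustration of that alternative, not the code used to produce the charts:

xgb = XGBRegressor(n_jobs=-1, objective='reg:squarederror')
xgb.fit(x, dfcalib['K%'].values)

booster = xgb.get_booster()
# get_score returns a {feature: score} dict; with a plain numpy feature matrix
# the keys are 'f0', 'f1', ..., in the same order as header_attr
for importance in ['weight', 'gain', 'total_gain', 'cover', 'total_cover']:
    print(importance, booster.get_score(importance_type=importance))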

To see the bar charts, click on Feature Importance Bar Charts.
