1. Definitions and Data
2. Statistics and Feature Importances

1. Definitions and Data

This article is the first in a series in which we build machine learning models to predict wins and losses in the NBA. We start by defining the variables that we use as features. Non-percentage statistics, such as assists, are per-game averages. The data comes from Basketball Reference.

We used the per-game team offense and team defense tables as well as the miscellaneous statistics table. Below are the first three rows of each for the 2018-2019 season.

2018-2019 Team Offense Per Game

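| Rk | Team | G | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Milwaukee Bucks* | 82 | 241.2 | 43.4 | 91.1 | 0.476 | 13.5 | 38.2 | 0.353 | 29.9 | 52.9 | 0.565 | 17.9 | 23.2 | 0.773 | 9.3 | 40.4 | 49.7 | 26 | 7.5 | 5.9 | 13.9 | 19.6 | 118.1 |
| 2 | Golden State Warriors* | 82 | 241.5 | 44 | 89.8 | 0.491 | 13.3 | 34.4 | 0.385 | 30.8 | 55.3 | 0.557 | 16.3 | 20.4 | 0.801 | 9.7 | 36.5 | 46.2 | 29.4 | 7.6 | 6.4 | 14.3 | 21.4 | 117.7 |
| 3 | New Orleans Pelicans | 82 | 240.9 | 43.7 | 92.2 | 0.473 | 10.3 | 29.9 | 0.344 | 33.4 | 62.4 | 0.536 | 17.8 | 23.4 | 0.761 | 11.1 | 36.2 | 47.3 | 27 | 7.4 | 5.4 | 14.8 | 21.1 | 115.4 |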

2018-2019 Team Defense Per Game

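| Rk | Team | G | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Indiana Pacers* | 82 | 240.3 | 38.7 | 86.1 | 0.45 | 11.4 | 32.1 | 0.354 | 27.4 | 54 | 0.507 | 15.8 | 21 | 0.756 | 10.5 | 33.1 | 43.7 | 24.5 | 7.4 | 5.2 | 15.7 | 20 | 104.7 |
| 2 | Miami Heat | 82 | 240.6 | 38.3 | 86.8 | 0.441 | 11.9 | 33.1 | 0.358 | 26.4 | 53.6 | 0.493 | 17.5 | 22.7 | 0.769 | 10.1 | 34 | 44.2 | 23.3 | 7.5 | 4.7 | 14.1 | 20.1 | 105.9 |
| 3 | Memphis Grizzlies | 82 | 242.4 | 37.6 | 83.4 | 0.451 | 11.6 | 32.2 | 0.36 | 26 | 51.2 | 0.508 | 19.3 | 24.7 | 0.784 | 9.5 | 35.2 | 44.7 | 23.4 | 7.6 | 4.9 | 15.2 | 21.4 | 106.1 |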

2018-2019 Team Miscellaneous Statistics

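| Rk | Team | Age | W | L | PW | PL | MOV | SOS | SRS | ORtg | DRtg | NRtg | Pace | FTr | 3PAr | TS% | eFG% | TOV% | ORB% | FT/FGA | Opp eFG% | Opp TOV% | DRB% | Opp FT/FGA | Arena | Attend. | Attend./G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Milwaukee Bucks* | 26.9 | 60 | 22 | 61 | 21 | 8.87 | -0.82 | 8.04 | 113.8 | 105.2 | 8.6 | 103.3 | 0.255 | 0.419 | 0.583 | 0.55 | 12 | 20.8 | 0.197 | 0.503 | 11.5 | 80.3 | 0.162 | Fiserv Forum | 721692 | 17602 |
| 2 | Golden State Warriors* | 28.4 | 57 | 25 | 56 | 26 | 6.46 | -0.04 | 6.42 | 115.9 | 109.5 | 6.4 | 100.9 | 0.227 | 0.384 | 0.596 | 0.565 | 12.6 | 22.5 | 0.182 | 0.508 | 11.7 | 77.1 | 0.205 | Oracle Arena | 803436 | 19596 |
| 3 | Toronto Raptors* | 27.3 | 58 | 24 | 56 | 26 | 6.09 | -0.6 | 5.49 | 113.1 | 107.1 | 6 | 100.2 | 0.247 | 0.379 | 0.579 | 0.543 | 12.4 | 21.9 | 0.198 | 0.509 | 13.1 | 77.1 | 0.19 | Scotiabank Arena | 812822 | 19825 |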

For offense and defense, we kept the following data:
3P — 3-Point Field Goals
3PA — 3-Point Field Goal Attempts
3P% — 3-Point Field Goal Percentage
2P — 2-Point Field Goals
2PA — 2-Point Field Goal Attempts
2P% — 2-Point Field Goal Percentage
FT — Free Throws
FTA — Free Throw Attempts
FT% — Free Throw Percentage
ORB — Offensive Rebounds
DRB — Defensive Rebounds
AST — Assists
STL — Steals
BLK — Blocks
TOV — Turnovers
PF — Personal Fouls
PTS — Points

For miscellaneous statistics, we kept:
W — Wins
L — Losses
We added win fraction (wins/(wins + losses)).
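
As a small illustration of the win fraction defined above, the following sketch computes it for one team. The function name `win_fraction` is our own choice for this example; the input values are the Milwaukee Bucks' 60 wins and 22 losses from the miscellaneous statistics snippet.

```python
def win_fraction(wins: int, losses: int) -> float:
    """Return wins / (wins + losses), the win fraction defined in the text."""
    games = wins + losses
    if games == 0:
        raise ValueError("team has no recorded games")
    return wins / games


# 2018-2019 Milwaukee Bucks: 60 wins, 22 losses (from the snippet above)
bucks = win_fraction(60, 22)
print(round(bucks, 3))  # 0.732
```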

The data covers the 2008-2009 through 2018-2019 seasons. We will build machine learning models to predict wins and losses for the 2018-2019 season and use the ten earlier seasons (2008-2009 through 2017-2018) to calibrate them. The datasets can be downloaded here.
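
One way to separate the calibration seasons from the 2018-2019 season is sketched below. The row layout (one dict per team-season with a `season` key), the team names, and the function name `split_by_season` are assumptions made for this example; the downloadable files may be organized differently.

```python
# Hypothetical rows: one dict per team-season. Values are made up for illustration.
rows = [
    {"season": "2016-2017", "team": "Team A", "W": 48, "L": 34},
    {"season": "2017-2018", "team": "Team A", "W": 50, "L": 32},
    {"season": "2018-2019", "team": "Team A", "W": 45, "L": 37},
    {"season": "2018-2019", "team": "Team B", "W": 30, "L": 52},
]

TEST_SEASON = "2018-2019"


def split_by_season(rows, test_season=TEST_SEASON):
    """Return (calibration_rows, test_rows), split on the season label."""
    calibration = [r for r in rows if r["season"] != test_season]
    test = [r for r in rows if r["season"] == test_season]
    return calibration, test


calibration_rows, test_rows = split_by_season(rows)
print(len(calibration_rows), len(test_rows))  # 2 2
```

Splitting on the season label rather than at random keeps every 2018-2019 row out of the calibration set, which matches the goal of predicting a season the model has not seen.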

2. Statistics and Feature Importances

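We compute basic statistics for the calibration data. Each of the 300 rows is one team-season, and the O- and D- prefixes mark variables from the team offense and team defense tables.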

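| Variable | count | mean | std | min | 25% | 50% | 75% | max | std error | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| O-3P | 300.0 | 7.736 | 2.017 | 3.8 | 6.2 | 7.6 | 9.0 | 15.3 | 0.116 | 0.516 | 0.246 |
| O-3PA | 300.0 | 21.666 | 5.279 | 11.3 | 17.975 | 21.1 | 25.15 | 42.3 | 0.305 | 0.564 | 0.342 |
| O-3P% | 300.0 | 0.356 | 0.019 | 0.295 | 0.344 | 0.356 | 0.369 | 0.416 | 0.001 | 0.014 | 0.318 |
| O-2P | 300.0 | 30.047 | 1.917 | 23.4 | 28.9 | 30.1 | 31.3 | 35.3 | 0.111 | -0.265 | 0.341 |
| O-2PA | 300.0 | 61.323 | 4.531 | 41.9 | 58.5 | 61.9 | 64.425 | 71.6 | 0.262 | -0.551 | 0.731 |
| O-2P% | 300.0 | 0.491 | 0.021 | 0.439 | 0.477 | 0.489 | 0.504 | 0.56 | 0.001 | 0.543 | 0.572 |
| O-FT | 300.0 | 17.7 | 1.911 | 12.2 | 16.5 | 17.55 | 18.9 | 24.1 | 0.11 | 0.334 | 0.695 |
| O-FTA | 300.0 | 23.288 | 2.454 | 16.6 | 21.5 | 23.15 | 24.8 | 31.1 | 0.142 | 0.314 | 0.438 |
| O-FT% | 300.0 | 0.76 | 0.029 | 0.66 | 0.745 | 0.761 | 0.779 | 0.828 | 0.002 | -0.538 | 0.695 |
| O-ORB | 300.0 | 10.751 | 1.242 | 7.6 | 10.0 | 10.8 | 11.7 | 14.6 | 0.072 | -0.073 | -0.193 |
| O-DRB | 300.0 | 31.806 | 1.881 | 27.2 | 30.275 | 31.8 | 33.2 | 36.5 | 0.109 | 0.046 | -0.692 |
| O-AST | 300.0 | 21.901 | 1.878 | 17.4 | 20.6 | 21.75 | 23.0 | 30.4 | 0.108 | 0.854 | 2.055 |
| O-STL | 300.0 | 7.597 | 0.85 | 5.5 | 7.0 | 7.55 | 8.2 | 10.0 | 0.049 | 0.153 | -0.094 |
| O-BLK | 300.0 | 4.877 | 0.754 | 2.5 | 4.3 | 4.8 | 5.4 | 8.2 | 0.044 | 0.647 | 1.391 |
| O-TOV | 300.0 | 14.326 | 1.088 | 11.2 | 13.6 | 14.3 | 15.0 | 17.7 | 0.063 | 0.015 | 0.182 |
| O-PF | 300.0 | 20.296 | 1.445 | 16.6 | 19.3 | 20.3 | 21.3 | 24.8 | 0.083 | 0.07 | -0.23 |
| O-PTS | 300.0 | 100.996 | 4.978 | 87.0 | 97.5 | 101.0 | 104.225 | 115.9 | 0.287 | 0.243 | -0.056 |
| D-3P | 300.0 | 7.736 | 1.558 | 4.6 | 6.5 | 7.4 | 8.725 | 12.1 | 0.09 | 0.631 | -0.282 |
| D-3PA | 300.0 | 21.664 | 4.156 | 14.2 | 18.5 | 20.7 | 24.5 | 32.7 | 0.24 | 0.621 | -0.474 |
| D-3P% | 300.0 | 0.357 | 0.015 | 0.308 | 0.347 | 0.357 | 0.367 | 0.411 | 0.001 | 0.098 | 0.55 |
| D-2P | 300.0 | 30.043 | 1.677 | 25.0 | 28.875 | 29.9 | 31.2 | 34.7 | 0.097 | 0.282 | -0.101 |
| D-2PA | 300.0 | 61.324 | 3.529 | 50.8 | 59.1 | 61.4 | 63.7 | 70.0 | 0.204 | -0.221 | 0.029 |
| D-2P% | 300.0 | 0.49 | 0.019 | 0.442 | 0.477 | 0.49 | 0.503 | 0.536 | 0.001 | 0.006 | -0.284 |
| D-FT | 300.0 | 17.701 | 1.799 | 13.7 | 16.4 | 17.7 | 18.8 | 23.9 | 0.104 | 0.388 | 0.144 |
| D-FTA | 300.0 | 23.288 | 2.299 | 18.2 | 21.6 | 23.2 | 24.7 | 30.2 | 0.133 | 0.271 | -0.117 |
| D-FT% | 300.0 | 0.76 | 0.014 | 0.725 | 0.751 | 0.76 | 0.769 | 0.803 | 0.001 | 0.036 | 0.019 |
| D-ORB | 300.0 | 10.748 | 0.936 | 8.0 | 10.2 | 10.8 | 11.4 | 14.2 | 0.054 | 0.041 | 0.559 |
| D-DRB | 300.0 | 31.805 | 1.984 | 26.7 | 30.3 | 31.7 | 33.1 | 37.7 | 0.115 | 0.239 | -0.224 |
| D-AST | 300.0 | 21.9 | 1.731 | 17.8 | 20.575 | 21.85 | 23.3 | 26.1 | 0.1 | 0.085 | -0.581 |
| D-STL | 300.0 | 7.598 | 0.705 | 5.6 | 7.1 | 7.6 | 8.0 | 9.6 | 0.041 | 0.049 | 0.185 |
| D-BLK | 300.0 | 4.879 | 0.738 | 3.0 | 4.4 | 4.9 | 5.4 | 6.9 | 0.043 | 0.065 | -0.011 |
| D-TOV | 300.0 | 14.322 | 1.152 | 11.3 | 13.5 | 14.3 | 15.1 | 17.6 | 0.067 | 0.109 | -0.063 |
| D-PF | 300.0 | 20.294 | 1.351 | 16.2 | 19.4 | 20.3 | 21.1 | 24.3 | 0.078 | 0.111 | 0.54 |
| D-PTS | 300.0 | 100.995 | 5.004 | 88.2 | 97.375 | 101.0 | 104.425 | 113.3 | 0.289 | 0.028 | -0.41 |
| Win Fraction | 300.0 | 0.5 | 0.156 | 0.106 | 0.378 | 0.512 | 0.61 | 0.89 | 0.009 | -0.142 | -0.672 |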
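
The last three columns of the table can be reproduced along the lines of the sketch below. The standard error column is consistent with the sample standard deviation divided by the square root of the count. The article does not say which skew and kurtosis estimators were used, so this sketch assumes the plain moment-based definitions, with excess kurtosis (a normal distribution scores 0). The function name `extra_stats` and the sample values are hypothetical.

```python
import math
import statistics


def extra_stats(values):
    """Return std error, skew, and excess kurtosis for a list of numbers.

    Needs at least two values that are not all equal.
    """
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

    # Central moments with an n denominator (assumed estimator, see lead-in)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n

    return {
        "std_error": std / math.sqrt(n),
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical per-game values for one variable
sample = [6.2, 7.6, 9.0, 7.1, 8.4, 10.3]
print(extra_stats(sample))

# Check against the O-3P row of the table: std 2.017 over 300 rows
table_check = round(2.017 / math.sqrt(300), 3)
print(table_check)  # 0.116
```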

Correlation coefficient matrix.
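
A correlation coefficient matrix of this kind can be built as in the sketch below, assuming Pearson correlation between every pair of columns. The column values are made up for illustration, and the helper names `pearson` and `correlation_matrix` are our own.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for two features and the target
columns = {
    "O-PTS": [98.0, 101.5, 104.0, 110.2],
    "D-PTS": [103.0, 101.0, 100.5, 99.0],
    "Win Fraction": [0.40, 0.50, 0.55, 0.70],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The row or column for Win Fraction is the one of most interest here, since it shows how strongly each feature moves with the target.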

We use the scikit-learn wrapper for XGBoost to compute feature importances. See Plate Discipline for Hitters – Data Exploration for an explanation of the different types of feature importances.





The code for all of the calculations above can be found at Plate Discipline for Hitters – Data Exploration.

The graphs are included with the dataset files here.
