Imputing Statcast’s missing data

MLB’s new Statcast system is a fantastic way to study baseball. The hybrid camera-plus-radar system tracks the location and movement of every object on the field, from the exit velocity of the ball to the position of the umpire. But Statcast is also a work in progress: as I detailed in a recent article, the system loses tracking on about 10% of all batted balls. In this post, I describe a way to ameliorate the system’s missing data problems using another source of information, the Gameday stringer coordinates.

Long before Statcast existed, there were stringers sitting in stadiums and manually recording various characteristics of batted balls, including–most importantly–the location at which each ball was fielded. As a data collection method, stringers are less accurate than a radar or camera, prone to park effects and other biases. But they have the advantage of completeness: every single batted ball is recorded by a stringer, while only 90% of batted balls are tracked by Statcast.

I had the idea of combining these two sources of information to provide more accurate and complete estimates of batted ball velocity than either system could provide alone. Each data source helps fix the weakness of the other: the stringer data is complete, but inaccurate, while Statcast is accurate, but incomplete. The stringer coordinates are recorded in the same files in which MLB provides batted ball data, making this idea exceptionally easy to execute.

I’ve come to rely on two main variables in my use of Statcast data: exit velocity, or the speed off the bat of every batted ball; and launch angle, or the vertical direction of the batted ball. I regressed each of these variables on the stringer coordinates using a Random Forest model, also including the outcome of each play and a park effect to further improve the accuracy. (For the statistically initiated, I fit the model on 20,000 batted balls and predicted the remaining ~100,000 as a form of out-of-sample validation.)

Exit Velocity

Here we’re guessing exit velocity based on the stringer coordinates and the result of the play (for example, single, double, or lineout). The results are strong: the predicted values correlate with the actual numbers at r=.57. The median absolute error is only 8.4 mph, suggesting that Gameday coordinates are at least capable of distinguishing hard-hit balls from soft ones. The RMSE is a bit higher (10.9 mph), because there are some outliers with unusual exit velocities given their characteristics–for example, deflections. Manually inspecting some of these outliers convinced me that there are also some cases where the Statcast data is inaccurate. For example, there are line drive singles in the data with improbably low exit velocities (30-50 mph). In these cases, the imputed exit velocities may be more accurate than the measured ones.
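To make the three accuracy figures concrete, here is a minimal sketch of how they can be computed from paired actual and predicted values. The function names and the five example values are made up for illustration and are not taken from the Statcast data; the last pair mimics a badly mismeasured ball to show why one outlier inflates the RMSE far more than the median absolute error.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def median_absolute_error(actual, predicted):
    """Median of |actual - predicted|; barely moved by a few outliers."""
    return statistics.median([abs(a - p) for a, p in zip(actual, predicted)])


def rmse(actual, predicted):
    """Root mean squared error; squaring makes large misses dominate."""
    return math.sqrt(statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


# Illustrative exit velocities in mph (invented, not real measurements).
actual = [95.0, 88.0, 102.0, 70.0, 35.0]
predicted = [92.0, 90.0, 97.0, 75.0, 85.0]

print("r     =", round(pearson_r(actual, predicted), 3))
print("MedAE =", median_absolute_error(actual, predicted))
print("RMSE  =", round(rmse(actual, predicted), 2))
```

The gap between the two error measures is itself a rough diagnostic: when the RMSE sits well above the median absolute error, a small number of large misses is usually responsible.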

Launch Angle


The imputation works even better with launch angle. (You’ll notice a kind of banding pattern for the imputed launch angles. I believe this comes from using the recorded batted ball types (line drive, groundball, etc.) and then integrating the coordinates as a secondary factor.) The correlation between predicted and actual is even higher, at r=.9. And while the error statistics are numerically about the same (RMSE=10.9 degrees, MAE=8.0 degrees), the range of launch angles is about three times larger, so the relative prediction error is substantially less than for exit velocity.

The results for exit velocity and launch angle suggest that we can impute both quite accurately using the Gameday stringer coordinates. To further verify that these imputed numbers are an improvement on raw Statcast, I calculated the average imputed exit velocity for each hitter and compared that to the wOBA (weighted on-base average, a rate measure of offensive production) values for the same hitters.

Unsurprisingly, the raw exit velocities correlate slightly worse with wOBA (r=.55) than the imputed exit velocities (r=.6). Interestingly, that holds true even if you focus only on the 90% of batted balls that Statcast successfully tracked (r=.55 for the imputed, r=.51 for the raw), which suggests that using the stringer coordinates acts to smooth out some of the measurement error in Statcast, even when it’s not missing data (see the example above concerning 40 mph line drives).
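As a rough sketch of the per-hitter step, the snippet below fills untracked batted balls with model predictions and then averages by hitter. The helper names and the tiny data set are hypothetical, and filling only the missing values is one possible reading of the procedure; using the model's prediction for every batted ball would be the other.

```python
from collections import defaultdict


def fill_missing(measured, imputed):
    """Use the measured value where Statcast tracked the ball, else the imputed one."""
    return [m if m is not None else i for m, i in zip(measured, imputed)]


def mean_by_hitter(hitters, values):
    """Average value per hitter, skipping batted balls with no value at all."""
    sums, counts = defaultdict(float), defaultdict(int)
    for hitter, value in zip(hitters, values):
        if value is None:
            continue
        sums[hitter] += value
        counts[hitter] += 1
    return {hitter: sums[hitter] / counts[hitter] for hitter in sums}


# Invented example: None marks a batted ball that Statcast failed to track.
hitters = ["a", "a", "b", "b"]
measured = [100.0, None, 80.0, 90.0]
imputed = [98.0, 90.0, 82.0, 88.0]

raw_means = mean_by_hitter(hitters, measured)
filled_means = mean_by_hitter(hitters, fill_missing(measured, imputed))
print(raw_means)
print(filled_means)
```

Either set of per-hitter averages can then be correlated with each hitter's wOBA to compare how well the raw and imputed versions track offensive production.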

These are pretty encouraging results. They suggest that it’s possible to accurately impute the missing Statcast data, thus overcoming the radar’s tracking problems. Even better, doing that imputation tends to improve the underlying data’s reliability. In hindsight, it’s not surprising that fusing two sources of data would result in a more accurate set of numbers. And there are certainly other ways to improve the imputation procedure, for example taking into account plays where there has been a deflection. But for now, it’s good to know that Statcast’s missing data issue is easily solvable by integrating the Gameday stringer coordinates, which should improve a lot of downstream work that depends on Statcast.


On Sequencing

I wrote something up at Baseball Prospectus about pitch sequencing.  This time, I scaled up the initial analysis I did before, wherein I just looked at Clayton Kershaw and Joe Saunders.  In that limited sample, I found very little evidence of non-random sequencing for 2-pitch sequences.

For Baseball Prospectus, I examined all of the starting pitchers (>100 IP) using a similar approach, but applied to 3-pitch sequences (specifically, the three pitches which start an at-bat).  I found these longer sequences to be much less random than 2-pitch sequences, and that variation in the level of randomness correlated with some elements of pitcher skill (see the article for more details).  The results are largely uncertain at this early stage, but point in any case towards sequencing being important for pitchers (which was probably obvious before).

Many caveats apply, this analysis still being very young, so I won’t try to claim that I’ve solved sequencing or somesuch.  However, I do think that, as with my earlier work using entropy, we can apply some cool tricks from information theory to figuring out how pitchers harness variation in their endless quest to confuse, befuddle, and out-think the batter.

Predicting Injury Status

This post will focus on predicting a baseball player’s future injuries, something I have written about in the past.  It is directed at refuting a particular line of criticism leveled at me in response to previous entries.  As before, I’ll use Jeff Zimmerman’s carefully procured and curated injury data.

This was prompted by a comment I received from a couple of visitors, which went something along these lines:

“Well sure, a player’s previous injuries may be related to their future probability of injury, but the difference is slight.  What’s more, in either case, the R² value of the logistic regression is minuscule.  Therefore injuries are effectively random.”

Boiling this line of argument down to its core, it relies on the difference between a finding being statistically significant and its being actionable or useful.  The claim is that statistical significance alone does not make a result practically helpful for prediction.

While I have some sympathy for the argument in its general form, it was misapplied in this case.  I used logistic regression in the previous post not because logistic regression is the optimal algorithm for prediction–it usually isn’t–but rather to make the inference concerning prediction clear.  Logistic regression allows one to easily and simply determine to what extent some variable (e.g. age) increases the probability of some outcome (e.g. injury).  For example, in the case of the previous post, the logistic regression model spit out a slope of ~2% higher injury risk per year.

To address the comment further, I redid my predictions with Random Forests, a machine learning algorithm which is, if not the best, certainly among the best machine learning tools currently available.  Random Forests and similarly high-performing algorithms are better at prediction than simple logistic regression approaches, but the tradeoff is that they are much more difficult to understand.  Because the classifiers are so complex, the relationship between the predictor variables and the outcome can be difficult to cleanly parse.  As a result, it can be impossible to formulate simple heuristics such as “2% higher injury risk per year of age”.  However, because they are better at prediction, complex machine learning tools such as these can show just how significantly predictor variables impact accuracy.  Since that’s what I wanted to show, I brought in a Random Forest (which, on an unrelated tangent, has to be among the best names for a statistical technique ever invented).

Methodologically, I randomly divided the set of players with data from 2012 and 2011 into two subsets: a training set and a testing set.  I trained the algorithm using the training set, considering injury status in 2011, age, and position as predictor variables.  I then used the trained Random Forest to predict players in the testing set, and counted how many the algorithm got right and wrong (giving me an accuracy score).  Each combination of variables went through the cycle (with different training/testing sets each time) 1000 separate times.

To test whether a given variable has a significant impact on prediction, I performed predictions with and without that variable included as a predictor.  If accuracy dropped when the variable was left out, I considered that variable to significantly increase accuracy.  I can never disprove the notion that the changes in accuracy aren’t big enough to be useful, but I’ll argue against it (see below).
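The with-and-without comparison can be sketched as a small harness that repeats random train/test splits and averages the accuracy. Everything here is an assumption made for illustration: the field names, the 50/50 split, the synthetic players, and especially the classifier, which is a simple majority-vote lookup table standing in for the Random Forest so that the example runs with the standard library alone.

```python
import random
from collections import Counter, defaultdict


def fit_lookup(rows, labels):
    """Stand-in classifier (NOT a Random Forest): majority label per feature tuple."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[row][label] += 1
    fallback = Counter(labels).most_common(1)[0][0]
    table = {row: c.most_common(1)[0][0] for row, c in votes.items()}
    return lambda row: table.get(row, fallback)


def mean_accuracy(players, features, target="injured_2012",
                  n_rounds=1000, test_frac=0.5, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        shuffled = list(players)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_frac)
        test, train = shuffled[:cut], shuffled[cut:]
        predict = fit_lookup(
            [tuple(p[f] for f in features) for p in train],
            [p[target] for p in train],
        )
        correct = sum(predict(tuple(p[f] for f in features)) == p[target] for p in test)
        total += correct / len(test)
    return total / n_rounds


# Synthetic players: 2012 injury status matches 2011 status for 9 players in 10.
players = [
    {"age_group": i % 3,
     "injured_2011": (i // 3) % 2,
     "injured_2012": (i // 3) % 2 if i % 10 else 1 - (i // 3) % 2}
    for i in range(200)
]

full = mean_accuracy(players, ["age_group", "injured_2011"], n_rounds=100)
reduced = mean_accuracy(players, ["age_group"], n_rounds=100)
print(f"with injury history: {full:.3f}   age only: {reduced:.3f}")
```

Swapping in a real Random Forest only means replacing `fit_lookup` with a function that trains the model and returns a predictor; the split-and-compare loop stays the same.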

Results In One Graphic

Prediction Accuracy Increases Substantially with Injury History

On the y-axis here is prediction accuracy.  On the x-axis are the different combinations of predictor variables I used: just age; age and injury history; and age, injury history, and position.  The most predictive model is the full model, with all of the variables included.  Note here that I mixed the two classes–injured in 2012 and not injured in 2012–in equal proportion, so that “random” accuracy is set at 50%.  Just age produces an accuracy of ~52%, which is not great.  Adding injury history to age improves on that by a full 5 percentage points.  Now, those 5 points may not seem like a big deal, until you realize that millions of dollars are riding on them.  What’s more, the prediction accuracy is heterogeneous; the model becomes more accurate as the player becomes older.  For players above 30, prediction accuracy improves by a further ~2 percentage points–again, it may not seem a significant difference, but 30 is around the age at which players first become free agents.  Those contracts are the ones where injury probability is most important.
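One way to mix the two classes in equal proportion is to downsample the larger class before splitting, sketched below. Whether downsampling was the method actually used is an assumption, as are the function name and the `injured_2012` field.

```python
import random


def balance_classes(players, label_key, seed=0):
    """Downsample the larger class so both classes appear equally often."""
    rng = random.Random(seed)
    positives = [p for p in players if p[label_key]]
    negatives = [p for p in players if not p[label_key]]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Invented example: 30 injured and 70 healthy players.
players = [{"id": i, "injured_2012": i < 30} for i in range(100)]
balanced = balance_classes(players, "injured_2012")
print(len(balanced), sum(p["injured_2012"] for p in balanced))
```

With the classes balanced, a classifier that always guesses the same label scores exactly 50%, so any accuracy above that reflects information in the predictors and not the base rate of injury.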

One last thing to note is that although knowledge of position (CF/SS/C, etc.) increases prediction accuracy, the magnitude of that increase is fairly small–less than .5%.  However, as in the last post, several caveats apply.  Most prominent among these is the fact that players tend to shift positions with age.  A more nuanced classifier might take into account where a player has played the most games in their career, or where they started.  This approach may be applied in a future post.

Surgery Is Significant

This approach revealed one more thing.  By using the more powerful Random Forest model, I was able to show that an additional variable was significant for predicting future injury.  That variable is whether the player had surgery during their trip to the Disabled List, and it improves accuracy by something like 2%.  Again, this boost in accuracy is not tremendous, but it’s worth noting here that the combined classifier, with all the variables included, has pushed its accuracy up to ~60%, which I would argue is high enough to be useful.

As expected (and hypothesized in the last post), players who had surgeries are especially likely to have another trip to the DL in the next year.  Surgery is generally invasive and only applied in cases of severe injury, so it makes sense that the procedure would be linked to further injury.  This result may suggest that injury type generally could be influential for predicting future injuries–but the jury is still out on that until more data becomes available.

Knowledge of Surgery Improves Prediction Accuracy

Small Differences in Injury Probability Make Big Differences in Money

Injury history is significant for predicting future injuries, along with a handful of other factors I’ve discovered so far: age, position, and whether the player had a surgery.  All of these things result in better prediction accuracy when they are considered in the model than when they aren’t in the model, which shows that they are predictive.  To the broader point which prompted this post: my guess is that anything that produces any statistically significant increase in predictive accuracy would be capitalized on by the teams, if for no other reason than the fact that they have princely fortunes riding on even their mediocre role players.  In this case, using a relatively naive classifier, we can improve prediction accuracy beyond chance by something like 10 percentage points, which I expect is significant enough for even a skeptical GM to pay attention.

It is of paramount importance that they know the probability that those players will be injured, not only in terms of calculating the proper amount of money to offer the player, but also in terms of projecting the team’s future performance and the insurance policy they need to take out.  The bottom line is that any difference in injury probability likely compounds to significant levels over the several years a player is employed, and this, together with the size of the contracts, will conspire to make even the most infinitesimal of predictive edges extremely important.
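A back-of-the-envelope sketch of how a small yearly difference adds up over a contract follows. The probabilities and contract length are purely hypothetical, and the second function assumes that seasons are independent with a constant injury probability, which real careers certainly violate.

```python
def expected_injured_seasons(p_injury, years):
    """Expected number of seasons with an injury (linearity of expectation)."""
    return p_injury * years


def prob_at_least_one_injury(p_injury, years):
    """Chance of one or more injured seasons, ASSUMING independent seasons."""
    return 1.0 - (1.0 - p_injury) ** years


# Hypothetical comparison: two players whose yearly risk differs by 2 points.
for p in (0.30, 0.32):
    print(p,
          round(expected_injured_seasons(p, 5), 2),
          round(prob_at_least_one_injury(p, 5), 3))
```

The expected count grows linearly with contract length, so a gap of a couple of points per year becomes a larger gap in expected injured seasons over a multi-year deal, and that gap is then multiplied by the salary at stake.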