Imputing Statcast’s missing data

MLB’s new Statcast system is a fantastic way to study baseball. The hybrid camera-plus-radar system tracks the location and movement of every object on the field, from the exit velocity of the ball to the position of the umpire. But Statcast is also a work in progress: as I detailed in a recent article, the system loses tracking on about 10% of all batted balls. In this post, I describe a way to ameliorate the system’s missing data problems using another source of information, the Gameday stringer coordinates.

Long before Statcast existed, there were stringers sitting in stadiums and manually recording various characteristics of batted balls, including–most importantly–the location at which each ball was fielded. As a data collection method, stringers are less accurate than a radar or camera, prone to park effects and other biases. But they have the advantage of completeness: every single batted ball is recorded by a stringer, while only 90% of batted balls are tracked by Statcast.

I had the idea of combining these two sources of information to provide more accurate and complete estimates of batted ball velocity than either system could provide alone. Each data source helps fix the weakness of the other: the stringer data is complete, but inaccurate, while Statcast is accurate, but incomplete. The stringer coordinates are recorded in the same files in which MLB provides batted ball data, making this idea exceptionally easy to execute.

I’ve come to rely on two main variables in my use of Statcast data: exit velocity, or the speed off the bat of every batted ball; and launch angle, or the vertical direction of the batted ball. I regressed the stringer coordinates against both of these variables using a Random Forest model, also including the outcome of each play and a park effect to further improve the accuracy. (For the statistically initiated, I fit the model on 20,000 batted balls and predicted the remaining ~100,000 as a form of out-of-sample validation.)

Exit Velocity

Here we’re guessing exit velocity based on the stringer coordinates and the result of the play (for example, single, double, lineout, etc.). The results are strong: the predicted values correlate with the actual numbers at r=.57. The median absolute error is only 8.4 mph, suggesting that Gameday coordinates are at least capable of distinguishing hard hit balls from soft ones. The RMSE is a bit higher (10.9), because there are some outliers with unusual exit velocities given their characteristics–for example, deflections. Manually inspecting some of these outliers convinced me that there are also some cases where the Statcast data is inaccurate. For example, there are line drive singles in the data with improbably low exit velocities (30-50 mph). In these cases, the imputed exit velocities may be more accurate than the measured ones.

Launch Angle


The imputation works even better with launch angle. (You’ll notice a kind of banding pattern for the imputed exit velocities. I believe this comes from using the recorded batted ball types (line drive, groundball, etc.) and then integrating the coordinates as a secondary factor.) The correlation between predicted and actual is even higher, at r=.9. And while the error statistics are about the same (RMSE=10.9, MAE=8.0), the range of launch angles is about three times larger, so the relative prediction error is substantially less than for exit velocity.

The results for exit velocity and launch angle suggest that we can impute both quite accurately using the Gameday stringer coordinates. To further verify that these imputed numbers are an improvement on raw Statcast, I calculated the average imputed exit velocity for each hitter and compared that to the wOBA (weighted on-base average, a rate measure of offensive production) values for the same hitters.

Unsurprisingly, the raw exit velocities correlate slightly worse with wOBA (r=.55) than the imputed exit velocities (r=.6). Interestingly, that holds true even if you focus only on the 90% of batted balls that Statcast successfully tracked (r=.55 for the imputed, r=.51 for the raw), which suggests that using the stringer coordinates acts to smooth out of some the measurement error in Statcast, even when it’s not missing data (see the example above concerning 40 mph line drives).

These are pretty encouraging results. They suggest that it’s possible to accurately impute the missing Statcast data, thus overcoming the radar’s tracking problems. Even better, doing that imputation tends to improve the underlying data’s reliability. In hindsight, it’s not surprising that fusing two sources of data would result in a more accurate set of numbers. And there are certainly other ways to improve the imputation procedure, for example taking into account plays where there has been a deflection. But for now, it’s good to know that Statcast’s missing data issue is easily solvable by integrating the Gameday stringer coordinates, which should improve a lot of downstream work that depends on Statcast.


One comment

  1. Pingback: Spin Rate and Soft Contact Probabilities | Exploring Baseball Data with R

Post a comment

You may use the following HTML:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>