This post will focus on predicting a baseball player’s future injuries, something I have written about in the past. It is directed at refuting a particular line of criticism leveled at me in response to previous entries. As before, I’ll use Jeff Zimmerman’s carefully procured and curated injury data.
This was prompted by a comment I received from a couple of visitors, which went something along these lines:
“Well sure, a player’s previous injuries may be related to their future probability of injury, but the difference is slight. What’s more, in either case, the R² value of the logistic regression is minuscule. Therefore injuries are effectively random.”
Boiled down to its core, this line of argument relies on the difference between a finding being statistically significant and its being actionable or useful. The claim is that statistical significance alone does not make something practically helpful for prediction.
While I have some sympathy for the argument in its general form, it was misapplied in this case. I used logistic regression in the previous post not because logistic regression is the optimal algorithm for prediction (it usually isn’t), but rather to make the inference concerning prediction clear. Logistic regression allows one to easily and simply determine to what extent some variable (e.g. age) increases the probability of some outcome (e.g. injury). For example, in the case of the previous post, the logistic regression model spit out a slope of ~2% higher injury risk per year of age.
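To make the "slope" idea concrete, here is a minimal sketch of how a logistic model turns age into an injury probability. The intercept and slope below are hypothetical values picked purely for illustration; they are not the fitted coefficients from the previous post.

```python
import math

def injury_probability(age, intercept, slope_per_year):
    """Logistic model: P(injury) = 1 / (1 + exp(-(intercept + slope * age)))."""
    log_odds = intercept + slope_per_year * age
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients, for illustration only (NOT fitted to the real data).
INTERCEPT = -3.0
SLOPE = 0.09

p28 = injury_probability(28, INTERCEPT, SLOPE)
p29 = injury_probability(29, INTERCEPT, SLOPE)

print(f"P(injury | age 28) = {p28:.3f}")
print(f"P(injury | age 29) = {p29:.3f}")
print(f"Change for one extra year of age = {p29 - p28:.3f}")
```

With these made-up numbers, one extra year of age raises the predicted probability by about two percentage points. The appeal of logistic regression is exactly that a single coefficient can be read off this way.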
To address the comment further, I redid my predictions with Random Forests, a machine learning algorithm which is, if not the best, certainly among the best machine learning tools currently available. Random Forests and similarly high-performing algorithms are better at prediction than simple logistic regression approaches, but the tradeoff is that they are much more difficult to understand. Because the classifiers are so complex, the relationship between the predictor variables and the outcome can be difficult to cleanly parse. As a result, it can be impossible to formulate simple heuristics such as “2% higher injury risk per year of age”. However, because they are better at prediction, complex machine learning tools such as these can show just how much each predictor variable affects accuracy. Since that’s what I wanted to show, I brought in a Random Forest (which, on an unrelated tangent, has to be among the best names for a statistical technique ever invented).
Methodologically, I randomly divided the set of players with data from both 2011 and 2012 into two subsets: a training set and a testing set. I trained the algorithm using the training set, considering injury status in 2011, age, and position as predictor variables. I then used the trained Random Forest to predict 2012 injury status for the players in the testing set, and counted how many the algorithm got right and wrong (giving me an accuracy score). Each combination of variables went through the cycle (with different training/testing sets each time) 1000 separate times.
To test whether a given variable has a significant impact on prediction, I performed predictions with and without that variable included as a predictor. If accuracy dropped when the variable was left out, I considered it to significantly increase accuracy. I can never disprove the notion that the changes in accuracy aren’t big enough to be useful, but I’ll argue against it (see below).
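For readers who want to see the shape of this procedure, here is a rough sketch using scikit-learn. Everything about the data is an assumption: the players are synthetic, the relationship between predictors and injury is invented, and the 50/50 split, 50 trees, and 10 repeats are placeholder settings chosen to run quickly (the analysis described above used 1000 repeats). The function name mean_accuracy is mine, not from any library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def mean_accuracy(X, y, columns, n_repeats=10, seed=0):
    """Average test-set accuracy over repeated random train/test splits."""
    scores = []
    for i in range(n_repeats):
        # A fresh random split on every repeat, using only the chosen columns.
        X_train, X_test, y_train, y_test = train_test_split(
            X[:, columns], y, test_size=0.5, random_state=seed + i)
        model = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))  # fraction predicted correctly
    return float(np.mean(scores))

# Synthetic stand-in data (NOT the real injury data).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(22, 40, size=n)
injured_2011 = rng.integers(0, 2, size=n)
position = rng.integers(0, 9, size=n)  # positions coded 0..8
# Invented relationship, purely so the example has some signal to find.
logit = -2.5 + 0.06 * age + 1.0 * injured_2011
injured_2012 = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([age, injured_2011, position])

# Compare accuracy with and without injury history as a predictor.
acc_age = mean_accuracy(X, injured_2012, [0])
acc_age_history = mean_accuracy(X, injured_2012, [0, 1])
print(f"age only:             {acc_age:.3f}")
print(f"age + injury history: {acc_age_history:.3f}")
```

The comparison of the two printed accuracies is the whole test: if leaving a variable out lowers the average accuracy, that variable is carrying predictive information.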
Results In One Graphic
On the y-axis here is prediction accuracy. On the x-axis are the different combinations of predictor variables I used: just age, age & injury history, and age, injury history and position. The most predictive model is the full model, with all of the variables included. Note here that I mixed the two classes (injured in 2012 and not injured in 2012) in equal proportion, so that “random” accuracy is set at 50%. Just age produces an accuracy of ~52%, which is not great. Adding injury history to age improves on that by a full 5 percentage points of prediction accuracy. Now, those 5 points may not seem like a big deal, until you realize that millions of dollars are riding on them. What’s more, the prediction accuracy is heterogeneous; the model becomes more accurate as players get older. For players above 30, prediction accuracy improves by ~2%; again, that may not seem like a significant difference, but 30 years old is when free agents are first available. Those contracts are the ones where injury probability is most important.
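The point about mixing the classes in equal proportion can be sketched in a few lines. This is one plausible way to do it (randomly downsampling the larger class), which is my assumption rather than a description of the exact procedure used; the records and the helper name balance_classes are hypothetical.

```python
import random

def balance_classes(records, label_key, seed=0):
    """Downsample the larger class so both classes appear equally often."""
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced

# Toy records (hypothetical): 25 injured players out of 100.
players = [{"id": i, "injured_2012": 1 if i % 4 == 0 else 0} for i in range(100)]

balanced = balance_classes(players, "injured_2012")
share_injured = sum(r["injured_2012"] for r in balanced) / len(balanced)
print(len(balanced), share_injured)  # 50 0.5
```

Once the two classes are equally common, a classifier that guesses blindly is right half the time, so any accuracy above 50% reflects real information in the predictors.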
One last thing to note is that although knowledge of position (CF/SS/C, etc.) increases prediction accuracy, the magnitude of that increase is fairly small: less than 0.5%. However, as in the last post, several caveats apply. Most prominent among these is the fact that players tend to shift positions with age. A more nuanced classifier might take into account where a player has played the most games in their career, or where they started. This approach may be applied in a future post.
Surgery Is Significant
This approach revealed one more thing. By using the more powerful Random Forest model, I was able to show that an additional variable was significant for predicting future injury. That variable is whether the player had surgery during their trip to the Disabled List, and it improves accuracy by something like 2%. Again, this boost in accuracy is not tremendous, but it’s worth noting here that the combined classifier, with all the variables included, has pushed its accuracy up to ~60%, which I would argue is high enough to be useful.
As expected (and hypothesized in the last post), players who had surgeries are especially likely to have another trip to the DL in the next year. Surgery is generally invasive and only applied in cases of severe injury, so it makes sense that the procedure would be linked to further injury. This result may suggest that injury type generally could be influential for predicting future injuries–but the jury is still out on that until more data becomes available.
Small Differences in Injury Probability Make Big Differences in Money
Injury history is significant for predicting future injuries, along with a handful of other factors I’ve discovered so far: age, position, and whether the player had a surgery. All of these things result in better prediction accuracy when they are considered in the model than when they aren’t in the model, which shows that they are predictive. To the broader point which prompted this post: my guess is that anything that produces any statistically significant increase in predictive accuracy would be capitalized on by teams, if for no other reason than the fact that they have princely fortunes riding on even their mediocre role players. In this case, using a relatively naive classifier, we can improve prediction accuracy beyond chance by something like 10%, which I expect is significant enough for even a skeptical GM to pay attention.
It is of paramount importance that teams know the probability that those players will be injured, not only in terms of calculating the proper amount of money to offer the player, but also in terms of projecting the team’s future performance and the insurance policy they need to take out. The bottom line is that any difference in injury probability likely compounds to significant levels over the several years a player is employed, and this, together with the size of the contracts, will conspire to make even the smallest of predictive edges extremely important.
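A back-of-the-envelope sketch shows how a small difference in annual injury probability adds up over a contract. All of the numbers here are hypothetical: the contract length, the salary, the share of a season lost to a DL trip, and the two probabilities. The sketch also assumes the annual probability stays constant over the contract, which is a simplification.

```python
def expected_dl_seasons(annual_prob, years):
    """Expected number of seasons with a DL trip, assuming a constant annual probability."""
    return annual_prob * years

def expected_salary_lost(annual_prob, years, annual_salary, share_of_season_lost):
    """Expected salary paid for time missed, under the simplifying assumptions above."""
    return expected_dl_seasons(annual_prob, years) * annual_salary * share_of_season_lost

# Hypothetical contract terms, for illustration only.
YEARS = 5
ANNUAL_SALARY = 10_000_000
SHARE_OF_SEASON_LOST = 0.25  # assumed fraction of a season missed per DL trip

low = expected_salary_lost(0.30, YEARS, ANNUAL_SALARY, SHARE_OF_SEASON_LOST)
high = expected_salary_lost(0.32, YEARS, ANNUAL_SALARY, SHARE_OF_SEASON_LOST)
extra_cost = high - low
print(f"Extra expected cost from a 2-point higher annual risk: ${extra_cost:,.0f}")
```

Under these made-up terms, a two-point difference in annual injury risk works out to roughly $250,000 in expected cost on a single contract, and a front office is making this kind of bet on dozens of players at once.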