Link roundup for 1/27/2017

I first tried the experiment of a roughly weekly link roundup (including both my own stuff and my favorite reads) in November, and I haven’t done one since. So, weekly it is not (or has not been, at least). But that’s mostly because of the baseball off-season, during which my work schedule slows down dramatically. With work ramping back up now (pitchers and catchers report in just over two weeks!), I intend to make this a more regular feature at my blog.

With that preamble out of the way, here’s what I’ve been working on recently:

http://fivethirtyeight.com/features/barry-bonds-and-roger-clemens-are-benefiting-from-public-hall-of-fame-ballots/

I was happy to write up this piece summarizing a recently accepted paper I worked on with Greg Matthews and an undergraduate student of his. (I was responsible for a very, very small portion of that paper, so I am kind of mooching off their work. Forgive me.) In it, we analyzed the ways BBWAA voters seem to cluster their votes, and found (predictably enough) that the major split is between voters who support PED-linked candidates and those who don’t. You can find the paper linked in the article, but I thought it was a cool example of how you can use the summary statistics and the public portion of a partially anonymous dataset to infer characteristics of the anonymous portion. That has applications as far afield as genetics, where some patients may choose to participate anonymously while others reveal their data. If a small enough portion is anonymous, and you have the overall statistics, you can effectively “de-anonymize” the remaining portion with a method like this.
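A toy illustration of the general idea, with made-up numbers (the paper itself uses a more sophisticated clustering model, not this): if you know the published vote totals and the ballots that voters made public, simple subtraction tells you what the anonymous voters did in aggregate, and the known voting clusters narrow the possibilities from there.

```python
from collections import Counter

# Hypothetical published totals for three candidates across all 20 voters
published = Counter({"Bonds": 11, "Clemens": 11, "Raines": 19})

# Hypothetical ballots from the 18 voters who revealed theirs
public_ballots = (
    [["Bonds", "Clemens", "Raines"]] * 10   # the pro-PED cluster
    + [["Raines"]] * 8                      # the anti-PED cluster
)

public = Counter(name for ballot in public_ballots for name in ballot)

# What the two anonymous voters did, in aggregate
print(published - public)  # Counter({'Bonds': 1, 'Clemens': 1, 'Raines': 1})
```

With only two private ballots, those aggregate counts, combined with the fact that Bonds and Clemens votes travel together, leave very few plausible combinations.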

https://fivethirtyeight.com/features/sabermetrics-helped-put-tim-raines-in-the-hall-of-fame/

I wrote this short piece in reaction to Tim Raines finally being elected to the Hall. I never quite embraced the Raines campaign, and that’s mostly because I simply can’t muster up much outrage about the Hall of Fame. It’s never been consistent or objective, and it never will be. We know a lot more about baseball now than we did in 1970, and our metrics have changed, and it doesn’t bother me that we elect a different kind of player. Similarly, I will never understand the moral relativism that goes on in these Hall of Fame debates. That the fans of the 1930s tolerated Ty Cobb launching himself into their midst and throwing punches does not mean that we need to–or should–put up with much less dickish behavior today. As far as I’m concerned, it’s OK (and indeed unavoidable) that our standards for inclusion have changed over time. I guess this opinion is not a very hot take.

https://theathletic.com/34872/2017/01/23/can-cubs-avoid-a-world-series-induced-hangover/

For the Athletic, I did a short piece about whether the Cubs’ hitters and pitchers would suffer at all from the long postseason. As with many effects in sabermetrics these days, there was a small hint that playing deep into October could make a difference, but it did not pass any stringent statistical threshold. As all the low-hanging fruit gets picked, we should expect this to become a more and more common pattern.

https://fivethirtyeight.com/features/chicagos-murder-rate-is-rising-but-it-isnt-unprecedented/

Have to say: never did I think I would write an entire article in response to a tweet from the President, but here we are. The point is simple: Chicago’s murder rate is high, but no higher than it was in the 1990s. In fact, it’s not higher than New York City’s rate was in the 1990s. It’s still too many deaths (as any number of murders would be too many), but the notion that this murder rate represents some radical departure from the past is wrong.


And here’s what I’ve read.

https://www.washingtonpost.com/news/monkey-cage/wp/2017/01/25/we-asked-people-which-inauguration-crowd-was-bigger-heres-what-they-said/?utm_term=.745a256e328b

This article is incredibly demoralizing for me. In it, they show Drumpf supporters photos of the Drumpf and Obama crowds, asking which one is larger–a simple, obvious question with only one reasonable answer. And yet, a surprising number of Drumpf supporters pick his crowd as the larger one, defying all rational belief.

So why was this demoralizing? Photographic evidence is a kind of gold standard in my mind. If we can’t convince people with side-by-side photos, then what hope does a more sophisticated and nuanced argument have? I think about this especially in regard to journalism, where we are often trying to make points using words or (in my case) numbers, both of which are abstract representations of data from the real world. If people can ignore what is immediately in front of their eyes, why would they ever think through a reasoned but complex argument that they disagree with? So this piece left me feeling hopeless.

https://www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy

This article was just fantastic, one of my favorites of the last few months. The history-of-statistics material is interesting, if necessarily incomplete (it barely touches on the important role science played in developing statistical knowledge). But where it really came through for me was the ending, which describes a future (perhaps even a present) in which people don’t buy into government-supplied statistics.

There are a lot of reasons for the current state of the electorate and its overall disbelief in objective knowledge. Some of it has to do with identity politics: there is a fraction of the electorate that doesn’t believe even photographic evidence when it is provided, as detailed above.

But some of the problem is undoubtedly due to the ways statistics has been used and described. In reading the article, one problem that occurred to me is how statisticians consistently describe the average of a group as being representative of the experience of the group as a whole. In other words, if I am describing the population of some town in Nebraska, I may summarize its wealth by the median (or mean) income. But that will grate on individuals in the town who live on the poverty line, and incorrectly describe those who are upper class there.

This is a long-term issue with how statistics are discussed and written about. The noise or variation around the mean is just as important as the mean itself. I think too often statisticians (and writers like myself who translate statistics to a larger audience) stop at the mean, assuming incorrectly that it is a sufficient description of the population.

While the mean should be representative, humans are inclined to disregard information if it doesn’t accord with their prior beliefs (or lived experience). So if you describe a population by the mean, and the reader is at one end of a distribution, they assume (falsely) that not only the mean but the whole dataset is flawed. As a profession (or group of professions), I think we statisticians have to develop a language and a framework to describe the average in the context of the variation around it, in such a way that readers intuitively understand the idea of a range of outcomes–a distribution. At the end of the day, it’s the distribution that matters the most, and the idea of describing that curve has often been abandoned in favor of the simplification of a single number. That’s a mistake.
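To make that concrete, here is a minimal sketch with made-up numbers: a right-skewed income distribution where the mean sits well above what most residents actually earn, so a handful of percentiles tells the story far better than the average alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up incomes for a hypothetical town: mostly modest earners plus a
# handful of very high incomes (a right-skewed, lognormal distribution)
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=1000)

print(f"mean income:   ${incomes.mean():,.0f}")
print(f"median income: ${np.median(incomes):,.0f}")  # noticeably lower

# Reporting the spread alongside the center gives readers a place to
# locate their own experience in the numbers
for q in (10, 25, 50, 75, 90):
    print(f"{q}th percentile: ${np.percentile(incomes, q):,.0f}")
```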


Imputing Statcast’s missing data

MLB’s new Statcast system is a fantastic way to study baseball. The hybrid camera-plus-radar system tracks the location and movement of every object on the field, from the exit velocity of the ball to the position of the umpire. But Statcast is also a work in progress: as I detailed in a recent article, the system loses tracking on about 10% of all batted balls. In this post, I describe a way to ameliorate the system’s missing data problems using another source of information, the Gameday stringer coordinates.

Long before Statcast existed, there were stringers sitting in stadiums and manually recording various characteristics of batted balls, including–most importantly–the location at which each ball was fielded. As a data collection method, stringers are less accurate than a radar or camera, prone to park effects and other biases. But they have the advantage of completeness: every single batted ball is recorded by a stringer, while only about 90% of batted balls are tracked by Statcast.

I had the idea of combining these two sources of information to provide more accurate and complete estimates of batted ball velocity than either system could provide alone. Each data source helps fix the weakness of the other: the stringer data is complete, but inaccurate, while Statcast is accurate, but incomplete. The stringer coordinates are recorded in the same files in which MLB provides batted ball data, making this idea exceptionally easy to execute.

I’ve come to rely on two main variables in my use of Statcast data: exit velocity, or the speed of the ball off the bat; and launch angle, or the vertical direction of the batted ball. I predicted both of these variables from the stringer coordinates using a random forest model, also including the outcome of each play and a park effect to further improve the accuracy. (For the statistically initiated: I fit the model on 20,000 batted balls and predicted the remaining ~100,000 as a form of out-of-sample validation.)
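Here’s a rough sketch of that kind of model. The file and column names (hc_x, hc_y, event, park, launch_speed) are hypothetical stand-ins for the Gameday/Statcast fields, and the details of the real model surely differ:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical field names: hc_x/hc_y are the stringer hit coordinates,
# event is the play outcome, park is the ballpark, and launch_speed is the
# Statcast exit velocity (missing on the ~10% of untracked balls)
df = pd.read_csv("batted_balls.csv")
X = pd.get_dummies(df[["hc_x", "hc_y", "event", "park"]],
                   columns=["event", "park"])

tracked = df["launch_speed"].notna()
train_idx = df[tracked].sample(20000, random_state=1).index

rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=1)
rf.fit(X.loc[train_idx], df.loc[train_idx, "launch_speed"])

# Out-of-sample check on the remaining tracked balls...
holdout_idx = df[tracked].index.difference(train_idx)
ev_pred = rf.predict(X.loc[holdout_idx])

# ...and a model-based estimate for every batted ball, tracked or not
df["launch_speed_imputed"] = rf.predict(X)
```

The same recipe, swapping launch angle in as the target, gives the launch angle imputation below.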

Exit Velocity

[Figure: imputed vs. actual exit velocity]
Here we’re guessing exit velocity based on the stringer coordinates and the result of the play (single, double, lineout, and so on). The results are strong: the predicted values correlate with the actual numbers at r=.57. The median absolute error is only 8.4 mph, suggesting that Gameday coordinates are at least capable of distinguishing hard-hit balls from soft ones. The RMSE is a bit higher (10.9) because there are some outliers with unusual exit velocities given their characteristics (deflections, for example). Manually inspecting some of these outliers convinced me that there are also cases where the Statcast data is inaccurate. For example, there are line-drive singles in the data with improbably low exit velocities (30-50 mph). In those cases, the imputed exit velocities may be more accurate than the measured ones.
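For reference, the error numbers quoted here can be computed in a few lines, continuing the hypothetical holdout from the sketch above:

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(actual, predicted):
    """Correlation, median absolute error, and RMSE for an imputation."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    r, _ = pearsonr(actual, predicted)
    med_ae = np.median(np.abs(actual - predicted))
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return r, med_ae, rmse

# e.g. evaluate(df.loc[holdout_idx, "launch_speed"], ev_pred)
```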

Launch Angle

[Figure: imputed vs. actual launch angle]

The imputation works even better with launch angle. (You’ll notice a kind of banding pattern in the imputed values. I believe this comes from the model leaning on the recorded batted ball types (line drive, groundball, etc.) and treating the coordinates as a secondary factor.) The correlation between predicted and actual is even higher, at r=.9. And while the error statistics are about the same (RMSE=10.9, MAE=8.0), the range of launch angles is about three times larger, so the relative prediction error is substantially smaller than for exit velocity.

The results for exit velocity and launch angle suggest that we can impute both quite accurately using the Gameday stringer coordinates. To further verify that these imputed numbers are an improvement on raw Statcast, I calculated the average imputed exit velocity for each hitter and compared that to the wOBA (weighted on-base average, a rate measure of offensive production) values for the same hitters.

Unsurprisingly, the raw exit velocities correlate slightly worse with wOBA (r=.55) than the imputed exit velocities do (r=.6). Interestingly, that holds true even if you focus only on the 90% of batted balls that Statcast successfully tracked (r=.55 for the imputed, r=.51 for the raw), which suggests that using the stringer coordinates smooths out some of the measurement error in Statcast even when data isn’t missing (see the example above concerning 40 mph line drives).
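The hitter-level check looks something like this sketch, again with hypothetical column names (batter, woba) tacked onto the data frame from above:

```python
# Average each hitter's exit velocity (raw and imputed) and compare both
# to his wOBA; a stronger correlation suggests a more reliable measurement
hitters = df.groupby("batter").agg(
    raw_ev=("launch_speed", "mean"),        # NaN-aware mean of tracked balls
    imputed_ev=("launch_speed_imputed", "mean"),
    woba=("woba", "mean"),
)

print(hitters["raw_ev"].corr(hitters["woba"]))      # raw Statcast vs. wOBA
print(hitters["imputed_ev"].corr(hitters["woba"]))  # imputed vs. wOBA
```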

These are pretty encouraging results. They suggest that it’s possible to accurately impute the missing Statcast data, thus overcoming the radar’s tracking problems. Even better, doing that imputation tends to improve the underlying data’s reliability. In hindsight, it’s not surprising that fusing two sources of data would result in a more accurate set of numbers. And there are certainly other ways to improve the imputation procedure, for example taking into account plays where there has been a deflection. But for now, it’s good to know that Statcast’s missing data issue is easily solvable by integrating the Gameday stringer coordinates, which should improve a lot of downstream work that depends on Statcast.


Link roundup for 9/23/2016

No matter how complete an article feels at the time of publication, there are always a handful of interesting details that slip through the cracks or don’t fit under the word limit. On top of that, I tend to receive a ton of feedback post-publication, some of which is even worth addressing.

Twitter isn’t the ideal medium to respond or provide those additional details. So I wanted to experiment with a kind of weekly, link roundup-style blog post, summarizing the articles I’ve done in the past week and highlighting a few pieces from other authors that are worth reading as well. As mentioned, this will be a trial run for now; your comments and criticisms are welcome.

My Articles

http://fivethirtyeight.com/features/baseballs-savviest-and-crappiest-bullpen-managers/

At FiveThirtyEight, I wrote a followup to last week’s piece on optimal bullpen management with Rian Watt. We extended our metric, which measured the extent to which managers used their best relievers in the highest-leverage spots, to grade individual skippers. Better still, we established a run value for the skill, allowing us to say how many additional wins optimal bullpen management was worth.

The metric itself, which we called weighted Reliever Management+ (wRM+), is best thought of as a retrospective yardstick of a manager’s decisions. It is limited in that it does not factor in fatigue (both day-to-day and cumulative effects), matchups, or how bullpens change over the year. Much of the criticism of the piece focused on the fact that we didn’t account for these issues.

All of that criticism is, of course, fair. But given how difficult it is to judge bullpen management in any sort of rigorous, quantitative way, I think this piece was a significant step forward.

The optimal metric would probably appraise bullpen decisions in a dynamic way, that is to say, on an inning-by-inning basis according to what the manager knows at the time of the decision. (The distinction between retrospective and dynamic measurements was suggested to me by BP writer and all-around good guy Rob Mains.)

So, for example, rather than aggregating season-level statistics as we did, you could build a system to grade every individual call to the bullpen according to which relievers were available in that game, their statistics to date in the season, their projections, the matchup (who they’d be facing), and so on. In this way, you could say whether a manager made the optimal decision based on the information he had at the time, and price in the effects of fatigue and availability.
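To make that concrete, here is a toy sketch of what grading a single bullpen call might look like. Everything in it (the data structure, the fatigue cutoff, the projection field) is hypothetical, not anything we actually built:

```python
from dataclasses import dataclass

@dataclass
class Reliever:
    name: str
    projected_woba_allowed: float   # projection vs. the upcoming hitters
    pitches_last_3_days: int        # crude fatigue proxy

def grade_call(available: list[Reliever], chosen: Reliever,
               fatigue_limit: int = 45) -> float:
    """Score one call to the bullpen against the best option available.

    Returns 0.0 when the manager picked the best rested reliever for the
    matchup; positive values are the projected wOBA given away.
    """
    rested = [r for r in available if r.pitches_last_3_days < fatigue_limit]
    pool = rested or available           # if everyone is tired, use them all
    best = min(pool, key=lambda r: r.projected_woba_allowed)
    return chosen.projected_woba_allowed - best.projected_woba_allowed

# A manager who goes to his rested setup man over a gassed closer grades out
# as having made the optimal call
pen = [Reliever("Closer", 0.270, 52), Reliever("Setup", 0.285, 10)]
print(grade_call(pen, chosen=pen[1]))  # 0.0
```

Aggregating those per-call scores over a season, presumably weighted by the leverage of each spot, would get you the dynamic version of the metric.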

Such a system would be exceedingly difficult to create, however. You’d need game-by-game information, and you’d have to make a lot of assumptions about when relievers were tired and how much to weigh matchups. With that said, I have full faith that eventually someone is going to build this kind of dynamic scoring system. It’s going to be awesome, and probably more accurate and insightful than wRM+ (although by how much, I do not know). In the meantime, I think of Rian’s and my metric as a step in the right direction, an approximation that works better over longer managerial careers, where factors like bullpen quality tend to even out.

Three Key Variables that Affect the Cubs’ Playoff Hopes

For The Athletic, I wrote about some of the ways October baseball is different from the regular season, and how those factors may affect the overwhelming postseason-favorite Cubs.

It’s striking how distinct playoff baseball is from the rest of the year. On top of the weather and the better caliber of opponent, you have very different patterns of pitcher usage. As managers get more sabermetrically savvy, I think October is going to get even weirder and more tactically separated from the rest of the year. Ned Yost pioneered a new style of employing his relievers to fuller effect in the postseason, and that kind of usage will only grow more pronounced. The increase in pitching quality, both in terms of higher-caliber starting pitchers and more bullpen action, is probably the single biggest factor that separates October from the rest of the year.

Long-term, I think that means there will be a premium on hitters who can maintain their performance against the highest-quality opposition. That is, if those hitters really exist; so far, sabermetrics hasn’t found much evidence for there being a kind of hitter who is less susceptible to the quality of the opposing pitcher. (Of course, that doesn’t mean that front offices can’t find those hitters better than public analysts.)

Other links

The Most Extraordinary Team Statistic


Looking at some of the team-level records being broken this year. More on the Cubs’ BABIP here: http://cybermetric.blogspot.com/2016/09/cubs-have-allowed-historically-low-babip.html
It’s probably the biggest deviation from the league average of all time (at least for BABIP). So what explains it? Defense? A new kind of positioning or shifting? Pitchers who can suppress batted-ball velocity?

http://www.baseballprospectus.com/article.php?articleid=30420
You’ll never guess the luckiest team in baseball this year.

https://www.theguardian.com/us-news/2016/sep/19/us-gun-ownership-survey
3% of American adults own half of the guns in the United States. Think about that for a minute. The article is worth a full read.

http://www.cbssports.com/mlb/news/heres-how-the-as-went-from-small-market-heroes-to-present-day-zeroes/
From R.J. Anderson, on how the Oakland front office has failed to navigate the modern age of sabermetric equality.

http://www.gq.com/story/a-word-for-donald-trump-voters
A distillation of the righteous anger many feel when thinking about a Drumpf voter. I think I’m more insulated from Drumpf voters than most people; only one person on my Facebook feed ever posts pro-Drumpf propaganda. As a result, I’m more bewildered and confused than angry.