Link roundup for 1/27/2017

I first tried the experiment of a roughly weekly link roundup (including both my own stuff and my favorite reads) in November, and I haven’t tried it since. So, weekly it is not (or has not been, at least). But that’s because of the baseball off-season, during which my work schedule slows down dramatically. With work ramping back up now (pitchers and catchers report in just over two weeks!), I intend to make this a more regular feature on my blog.

With that preamble out of the way, here’s what I’ve been working on recently:

I was happy to write up this piece summarizing a recently accepted paper I worked on with Greg Matthews and an undergraduate student of his. (I was responsible for a very, very small portion of that paper, so I am kind of mooching off their work. Forgive me.) In it, we analyzed the ways that the BBWAA voters seem to cluster their votes, and found (predictably enough) that the major split is between voters who support PED-linked players and voters who don’t. You can find the paper linked in the article, but I thought it was a cool example of how you can use the summary statistics and public portion of partially anonymous datasets to infer characteristics about the anonymous portion. That fact has applications as far afield as genetics, where some patients may choose to participate anonymously while others reveal their data. If a small enough portion is anonymous, and you have the overall statistics, you can effectively “de-anonymize” that anonymous portion with a method like this.
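
As a toy illustration of that last point (this is not the method from the paper, just the simplest version of the idea): if you know each candidate’s overall vote total and you also have the ballots that were made public, subtracting one from the other gives you the exact tallies among the anonymous ballots. The player names and counts below are made up for the example.

```python
from collections import Counter

def anonymous_tallies(total_votes, public_ballots):
    """Votes per candidate among anonymous ballots = overall totals - public tallies."""
    public = Counter()
    for ballot in public_ballots:
        public.update(set(ballot))  # a ballot can name each candidate at most once
    return {name: total_votes[name] - public[name] for name in total_votes}

# Hypothetical data: 5 voters in total, 3 of whom published their ballots.
total_votes = {"Player A": 5, "Player B": 1, "Player C": 3}
public_ballots = [
    ["Player A", "Player C"],
    ["Player A", "Player B"],
    ["Player A", "Player C"],
]

hidden = anonymous_tallies(total_votes, public_ballots)
n_anonymous = 5 - len(public_ballots)
print(hidden, "across", n_anonymous, "anonymous ballots")
```

With only two anonymous ballots left, the subtraction says both of them included Player A, neither included Player B, and exactly one included Player C. The smaller the anonymous share, the less room there is to hide.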

I wrote this short piece in reaction to Tim Raines finally being elected to the Hall. I never quite embraced the Raines campaign, and that’s mostly because I simply can’t muster up much outrage about the Hall of Fame. It’s never been consistent or objective, and it never will be. We know a lot more about baseball now than we did in 1970, and our metrics have changed, and it doesn’t bother me that we elect a different kind of player. Similarly, I will never understand the moral relativism that goes on in these Hall of Fame debates. That the fans of the 1910s tolerated Ty Cobb launching himself into their midst and throwing punches does not mean that we need to–or should–put up with even far less dickish behavior today. As far as I’m concerned, it’s OK (and indeed unavoidable) that our standards for inclusion have changed over time. I guess this opinion is not a very hot take.

For the Athletic, I did a short piece about whether the Cubs’ hitters and pitchers would suffer at all from the long postseason. As with many effects in sabermetrics these days, there was a small hint that playing deep into October could make a difference, but it did not pass any stringent statistical threshold. As all the low-hanging fruit gets picked, we should expect this to be a more and more common pattern.

Have to say: never did I think I would write an entire article in response to a tweet from the President, but here we are. The point is simple: Chicago’s murder rate is high, but no higher than it was in the 1990s. In fact, it’s not higher than New York City’s rate was in the 1990s. It’s still too many deaths (as any number of murders would be too many), but the notion that this murder rate represents some radical departure from the past is wrong.


And here’s what I’ve read.

This article is incredibly demoralizing for me. In it, they show Trump supporters photos of the Trump and Obama crowds, asking which one is larger–a simple, obvious question with only one reasonable answer. And yet, a surprising number of Trump supporters pick his crowd as the larger one, defying all rational belief.

So why was this demoralizing? Photographic evidence is a kind of gold standard in my mind. If we can’t convince people with side-by-side photos, then what hope does a more sophisticated and nuanced argument have? I think about this especially with regard to journalism, where we are often trying to make points using words or (in my case) numbers, both of which are abstract representations of data from the real world. If people can ignore what is immediately in front of their eyes, why would they ever choose to think through a reasoned but complex argument that they disagree with? So this piece left me feeling hopeless.

This article was just fantastic, one of my favorites of the last few months. The history-of-statistics material is interesting, if necessarily incomplete (they barely touch on the important role science played in developing statistical knowledge). But where it really came through for me was the ending, and how they described a future–perhaps even a present–in which people don’t buy into government-supplied statistics.

There are a lot of reasons for the current state of the electorate, and its overall disbelief in objective knowledge. Some of it has to do with identity politics: There is a fraction of the electorate that doesn’t believe even photographic evidence when it is provided, as detailed above.

But some of the problem is undoubtedly due to the ways statistics have been used and described. In reading the article, one problem that occurred to me is how statisticians consistently describe the average of a group as being representative of the experience of the group as a whole. In other words, if I am describing the population of some town in Nebraska, I may summarize its wealth by the median (or mean) income. But that number will grate on individuals in the town who live on the poverty line, and it will misrepresent those who are upper class there.

This is a long-term issue with how statistics are discussed and written about. The noise or variation around the mean is just as important as the mean itself. I think too often statisticians (and writers like myself who translate statistics to a larger audience) stop at the mean, assuming incorrectly that it is a sufficient description of the population.

While the mean should be representative, humans are inclined to disregard information if it doesn’t accord with their prior beliefs (or lived experience). So if you describe a population by the mean, and the reader is at one end of a distribution, they assume (falsely) that not only the mean but the whole dataset is flawed. As a profession (or group of professions), I think we statisticians have to develop a language and a framework to describe the average in the context of the variation around it, in such a way that readers intuitively understand the idea of a range of outcomes–a distribution. At the end of the day, it’s the distribution that matters the most, and the idea of describing that curve has often been abandoned in favor of the simplification of a single number. That’s a mistake.
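
To make that concrete, here is a minimal sketch of what reporting a range alongside the average could look like. The incomes are invented for the example, and the choice of the 10th and 90th percentiles is just one reasonable convention, not a standard.

```python
import statistics

def describe(incomes):
    """Summarize a group by its median plus the 10th and 90th percentiles."""
    # quantiles(n=10) returns the nine cut points between deciles
    deciles = statistics.quantiles(incomes, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "median": statistics.median(incomes),
        "p90": deciles[-1],
        "mean": statistics.mean(incomes),
    }

# Hypothetical household incomes in a small town, in thousands of dollars
town = [12, 15, 18, 22, 30, 41, 45, 52, 60, 75, 250]
summary = describe(town)
print(
    f"Typical household: {summary['median']}k "
    f"(most fall between {summary['p10']:.0f}k and {summary['p90']:.0f}k)"
)
```

A single high earner pulls the mean well above the median here, and neither number says anything to the household at 15k or the one at 75k. Reporting the range at least tells both of them that they are in the data.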


