Towards Heuristics of Heuristics

The world is complicated, nearly immeasurably so. Even the workings of the smallest bits of matter in the universe are incredibly complex, and so to make progress in understanding the world, we have to neglect some of this vast complexity. We must impose simplifications on our mental models of the world in order to comprehend it.

Following the definition of William Wimsatt, these simplifications can be called heuristics. By his reckoning, not only are heuristics necessary, but the choice of particular heuristics guides the path our knowledge takes.

Heuristics are necessary because human beings are finite. In the old Enlightenment-era schemes, the world could be perceived as Laplacean in nature: a finite series of determinate computations. Now that we are aware of the true scope of the universe, the notion of a determinate universe, while perhaps theoretically interesting, lacks utility. Even if we were to possess perfect information, the scale of the universe would not yield to any accurate calculation.

These arguments apply just as well when scaled down to the level of realistic scientific inference. There are too many permutations of gene regulatory networks to model them as the interactions of molecules. Combinatorial explosions abound in dealing with genomes of even a few thousand genes. We must idealize them.

So we require simplifications, and a change in the framework of our thought. Absent a single, unifying conception of the universe, we are allowed a choice of heuristics. For example, one choice of heuristics for gene regulatory networks is to consider the network in the context of graph theory, in the manner of Eric Davidson.

You end up with something that looks like this.

Davidson-type wiring diagram of a gene regulatory network. Labelled transcription cartoons (e.g. FoxA) are genes; lines connecting them represent regulatory interactions.

In addition to being (sometimes) visually appealing, this heuristic frames the problem in a space of mathematics which seems to apply well to gene regulation. That space is network theory. By considering regulatory interactions (Gene X activates Gene Y) as edges and genes as nodes, we can perhaps gain some useful knowledge by applying the well-understood axioms and practices of network theory to this new, empirical problem of gene regulation. This perspective is intriguing, and also simplifying: under the hood, regulatory interactions are not simple mathematical relationships, they are incredibly complex processes performed by elaborate molecular machinery. Perhaps, however, these incredibly complex machines perform their operations in ways that are similar enough to those simple mathematical relationships that we can, for a moment, neglect the considerable intricacy of those molecular machines.
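As a minimal sketch of what this heuristic buys us (the gene names and interactions below are invented for illustration, not a real Davidson network), a regulatory network can be written down as a directed graph and interrogated with ordinary graph operations:

```python
# A toy gene regulatory network as a directed graph: nodes are genes,
# edges are regulatory interactions ("X regulates Y"). The gene names
# here are invented placeholders.
network = {
    "geneA": ["geneB", "geneC"],   # geneA regulates geneB and geneC
    "geneB": ["geneC"],
    "geneC": [],
}

def out_degree(net, gene):
    """Number of genes this gene directly regulates."""
    return len(net[gene])

def in_degree(net, gene):
    """Number of genes that directly regulate this gene."""
    return sum(gene in targets for targets in net.values())

# With the graph in hand, standard network-theory questions become
# mechanical: which genes are hubs, which are terminal effectors, etc.
print(out_degree(network, "geneA"))  # 2
print(in_degree(network, "geneC"))   # 2
```

All of the molecular machinery is hidden inside those edges; that is precisely the simplification the heuristic makes.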

Supposing that the network theory heuristic is an interesting and useful one, I now want to broaden the scope of this post to a larger, more significant question. If some heuristics are useful, and others aren’t, can we come up with a general theory for which heuristics apply in which situations?

I had this thought while re-reading William Wimsatt’s excellent tome on heuristics.  He advocates for them, and for an escape from physics-style thinking (in terms of Laplacean demons and absolute rule-sets).  I agree on both scores, but the trouble with heuristics is that they are frighteningly* arbitrary.

For instance, I could choose to use one set of heuristics to simplify gene regulation, and you could choose a different set.  Our two sets would differ in terms of their simplifying assumptions, and, if we advanced the study of each of them sufficiently far, they might produce different predictions about the behavior of some particular gene regulatory systems.  How would we know which one was right?

This scenario is something of a false problem, because we could always do experiments to test the predictions of each, and then disregard (or assimilate) the one whose predictions turned out to be correct less often.  Even so, as long as there were two seemingly correct systems of thinking about gene regulatory networks, there would be tension between them.  And one could envision scenarios in which people proposed wildly inappropriate heuristics to tackle gene regulatory problem-agendas, producing false results but being unprovably wrong (for at least a short time).

In the long run, to avoid these kinds of problems, it would be desirable to have a guidebook of sorts which prescribed the kind of heuristics which are most likely to be useful in each situation.  Wimsatt doesn’t provide that guidebook, although he seems to hint at its utility (and perhaps, its future existence).

The Book of Heuristics

In thinking about heuristics, I always tend to come back to the concept of statistical models which, under my reading of Wimsatt, are perfect examples of heuristics. A statistical model describes the way that two or more variables relate to each other. To work properly, it has to assume some structure between the two or more variables. When we then apply the model to actual data, we fit the data into the structure, and in so doing, perhaps learn something about the problem at hand.

For example, linear models assume that two variables are linear functions of each other, that is:

y = ax + b

That equation should be familiar to the reader, as it’s a fairly straightforward relationship. Verbally, it means that for every one-unit increase in x, y increases or decreases by some steady amount, a. When we fit a linear model to data, we take a series of y‘s and x‘s and, using some algorithm, estimate the values of a and b. By applying this heuristic, we are able to learn something, we think, about the underlying link between x and y (e.g. each unit of x is worth a units of y).
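Here is what that fitting amounts to, written out by hand for the ordinary least-squares case (the data points are made up so that the relationship is exact):

```python
# A minimal least-squares fit of y = a*x + b, spelled out to show what
# estimating a and b involves. The data are invented for illustration.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope a is the covariance of x and y divided by the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x  # the intercept follows from the means
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # exactly y = 2x + 1
a, b = fit_linear(xs, ys)
print(a, b)  # 2.0 1.0
```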

Linear models work great in all sorts of places, but they fail when applied to some datasets.  If one were to attempt to model the amount of sunlight as a function of the hour of the day (with a 10-year dataset of each) using a linear model, one would get something like a flat line. Moreover, the fit to the data would be terrible. We know that there isn’t a linear relationship between these two variables, and so to assume that there is defies both common sense and good statistical practice.
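To make the failure concrete, here is a toy version of that experiment, with a simulated daylight curve standing in for the 10-year dataset:

```python
import math

# Toy "hour of day vs. sunlight" data: a daylight curve peaking at
# midday. This is simulated, not real sunlight data.
hours = list(range(24))
sunlight = [max(0.0, math.cos(math.pi * (h - 11.5) / 12)) for h in hours]

# The same hand-rolled least-squares fit as before.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x

a, b = fit_linear(hours, sunlight)

# R^2 measures how much of the variance the fitted line explains;
# near zero means the heuristic has learned essentially nothing.
ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(hours, sunlight))
ss_tot = sum((y - sum(sunlight) / 24) ** 2 for y in sunlight)
r2 = 1 - ss_res / ss_tot
print(a, r2)  # slope and R^2 both vanishingly small: a flat, useless line
```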

The answer to the question I posed earlier (could we build a guidebook for heuristics?) seems to hinge on whether we could identify, in advance, whether a certain heuristic would be likely to fail when applied to a given problem.

To a certain degree, that objective may be achievable, in that we can let intuition and the “eye test” guide us. If, for example, we were to plot hour of the day vs. amount of sunlight, it would be very plain that the dataset is not amenable to a linear fit, although there is clearly a pattern present. But this ‘eye test’ really just seems like mental model-fitting.

I wonder if we couldn’t do better still, and without gathering the data beforehand. As a parting quandary, I pose the following question. Is it possible that, by leveraging our intuition about the architecture of a system, we could rule out certain heuristics as being likely to apply poorly to that system?

I think the answer is yes, and I think this kind of heuristic choice is exercised successfully all the time (albeit without a firm theoretical basis for its use). For example, imagine some highly complex system, in which different parts of the system are strongly interconnected. The system functions in such a fashion that changes to any single part ripple outwards in unpredictable ways, sometimes being buffered by the other parts of the system, sometimes causing catastrophic deviations from the system’s normal functioning.

Intuitively, when I envisage such a system, I think of it as being a poor fit for any simple heuristic like a linear model, at least for most questions. Because of the strong interdependence between parts of the system, and the overall intricacy of the structure as a whole, I imagine most perturbations to such a system are unlikely to result in linear effects on any appreciable scale. Again, as a matter of intuition, I would prescribe the use of (for example) more sophisticated statistical models, those with fewer constraints and more flexibility, in order to better agree with the inherent characteristics of the system.

If my intuition is correct (perhaps it’s not!), then it suggests that there could yet be a guidebook of heuristics: a way to tell in advance which heuristic approaches are likely to be most fruitful. In this way, we could build a set of heuristics of heuristics, i.e. meta-heuristics, which shaped not the particular simplifications we apply in any one inquiry, but the kinds of simplifications we reach for.


*Frightening only to some. I think we shouldn’t be frightened of their inconsistency with each other; as Emerson’s line goes, “A foolish consistency is the hobgoblin of little minds.” Instead, I think it’s one of the universe’s most forgiving and helpful properties that multiple schemes of thinking about a problem can converge on the same correct answer. We should embrace and enjoy the chaotic, diverse world of heuristics, and set about determining which ones are the best and why.

Entropy is Magic

Information theory is a discipline of applied mathematics and computer science which purports to describe how ‘information’ can be transferred, stored, and communicated.  This science has grand ambitions: merely defining information rigorously is difficult enough.  Information theory goes several steps further than that.

The most important single metric in information theory, and arguably the basis for the whole discipline, is entropy.  Nobody agrees precisely on what entropy means: entropy has been called a quantification of information, but Norbert Wiener famously thought that information was better defined as the negative of entropy (‘negentropy’).

Whatever entropy is, it’s important.  It keeps appearing across science: in statistics, as a fundamental limit on what is learnable; in physics and chemistry, where it was first discovered, as the heart of the second law of thermodynamics.  Entropy is everywhere, so let’s see what it is.

Some History

One of the most famous and important scientists you may never have heard of is a man named Claude Shannon.  A mathematician by training, Shannon is best thought of as the inventor of, and simultaneously the primary explorer within, information theory.  He holds a status within information theory akin to Charles Darwin’s in evolutionary biology: a mythical figure who at once created an entire discipline and then figured out most of that discipline, before anyone else had a chance to so much as finish reading.

In 1948, Shannon published a paper called “A Mathematical Theory of Communication”.  I don’t think there are many more ambitiously titled papers, although, to further my analogy, On the Origin of Species is certainly a contender.  In this paper, later expanded into a book, Shannon begins by defining entropy.  Of note, he writes:

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point… The significant aspect is that the actual message is one selected from a set of possible messages.

Thus the stage is immediately set for entropy.  It relates to the idea of sets of signals, one of which may be chosen at a time.  A priori, it seems some sets may be more ‘informational’ than others.  Books communicate more information than traffic lights.  To this point, he says,

If the number of messages in the set is finite then this number [the number of possible messages] or any monotonic function of this number can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.

So clearly, part of the reason books are more informational is because there’s simply a larger ‘bookspace’, that is, number of possible books, than there is a space of possible traffic lights.

That only works, however, if all of the choices are equally likely.  To get a more general formulation of entropy–which to Shannon is information–Shannon creates a list of requirements.  They are as follows.

1. Entropy needs to be continuous in the probabilities.  In other words, a small change in the probability p of some message (anywhere from 0 to 1) ought to produce only a small change in the metric.

2. As above, if all the possible messages are equally likely, then entropy should increase monotonically with the number of possible messages.  We already covered this: books are more entropic, and thus more informational, than traffic lights.

3. Entropies ought to be additive in a weighted fashion.  If a single choice among several messages is broken down into two successive choices, the total entropy should remain the same: the original H should equal the appropriately weighted sum of the H’s of the sub-choices.  This requirement is the hardest to grasp, but it relates essentially to the idea that information can be re-encoded.  I can translate a signal in an 8-letter alphabet into the same signal in a 2-letter alphabet without the entropy changing, simply by recoding each letter as a sequence of binary choices.

Three (relatively) simple rules, and yet, as it turns out, Shannon proves that there is one and only one function (up to a constant factor) that obeys all three.  That function is entropy, the very same function Gibbs had found earlier in statistical thermodynamics, now recruited in service of information…

H = −Σ pᵢ log pᵢ

H is Shannon’s name for entropy; the sum runs over the n possible messages.

There’s something fascinating about not just the formula itself, but the way Shannon derives it.  He sets out a series of requirements, detailed above, and realizes that there’s only a single mathematical relationship which obeys all three requirements.  All of the requirements are straightforward, common-sense attributes which any characterization of information must obey.  Strike out any of the three, and one is no longer discussing information.  It’s an elegant way of making an argument.

Some Theory

Shannon calls entropy H before he even invents the formula for it.  A couple of pieces of background information are useful in understanding the formula.  One is that elements of the set–remember, communication is just choosing from a set of signals–are numbered or indexed with i, going from 1, the first element, to n, the last.  Any formulation of entropy must take into account each such element, so we sum (Σ) over all the elements.

pᵢ refers to the probability of the i-th element.  In practice, we rarely know this probability, but it can be estimated (as the frequency of the i-th message divided by the total number of observed messages).  If an element occurs 0 times, its estimated probability is 0, and it drops out of the calculation (by convention, 0 · log 0 is taken to be 0).

For a given element, we multiply pᵢ by the logarithm of pᵢ.  Interestingly, the base of the logarithm is arbitrary: changing it simply rescales every entropy by a constant factor, so the actual number that entropy assumes is also arbitrary.  Traditionally, the base is two, which expresses entropy in bits, universally familiar now that computers are ubiquitous.  But entropy could just as easily be computed in base 10 or 100 or 42, and it would make no practical difference.  No other number in science so significant is also so fluid.

Which highlights an important point: entropy means relatively little in and of itself.  Entropy is most useful when comparing two or more ‘communications’.  When applied to a single source, it can be difficult to grasp what a given entropy means.  The clearest formulation is in terms of bits, in which a message can be understood as a series of yes/no questions.  For example, a message or source with an entropy of 1 bit can be described by a single binary question.  But even this is really describing one source of information as another (that is, language).

But back to calculation: one must repeat the pᵢ log pᵢ calculation for each of the n possible elements, and sum the results.  One now has a negative number, because log(x) is negative when 0 < x < 1.  To reverse this sorry situation (after all, how could information be negative?*), we multiply by −1.  Then we have a positive quantity called entropy that measures the uncertainty of the n outcomes.
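The whole recipe fits in a few lines.  This is a sketch of the standard calculation, not anything of Shannon's own, obviously:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy: -sum of p_i * log(p_i), with zero-probability
    elements dropping out of the calculation, as described above."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin (two equally likely messages) carries exactly one bit.
print(entropy([0.5, 0.5]))            # 1.0

# A certain outcome carries no uncertainty at all.
print(entropy([1.0]) == 0)            # True

# Four equally likely messages: two bits, i.e. two yes/no questions.
print(entropy([0.25] * 4))            # 2.0
```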

Entropy is Magic

When entropy is high, we don’t know what the next in a series of messages will be.  When it is low, we have more certainty.  When it is 0, we have perfect knowledge of the identity of the next message.  When it is maximal–when each of the pᵢ‘s is equiprobable, and equal to 1/n–we have no knowledge as to the next in a series of messages.  Were we to guess, we’d perform no better than random chance.

We can see then that entropy is a (relatively) simple, straightforward mathematical expression which measures something like information.  It’s not clear, as I mentioned above, what the exact relationship between entropy and information is.  There’s something fascinating about the fact that entropy (as a formula) appears vitally important, but is simultaneously difficult to translate back into language.

In any case, entropy is important not only on its own but because of the paths it opens up.  With an understanding of entropy, one can approach a whole set of questions related to information which would have appeared unanswerable before.  For example, conditional entropy builds on entropy by introducing the concept of dependency between variables–the idea that knowing X could inform one’s knowledge of Y.  From there, one can develop ever-more complex measures to capture the relationships between different sources of information.
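As a sketch of that idea (the joint distributions below are invented for illustration), conditional entropy weights the entropy of Y under each value of X by the probability of that X:

```python
import math

def entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X): the uncertainty left in Y once X is known.
    `joint` maps (x, y) pairs to probabilities summing to 1."""
    h = 0.0
    for x in {x for x, _ in joint}:
        p_x = sum(p for (x2, _), p in joint.items() if x2 == x)
        # distribution of Y given this particular value of X
        cond = [p / p_x for (x2, _), p in joint.items() if x2 == x]
        h += p_x * entropy(cond)
    return h

# X determines Y exactly: no residual uncertainty.
determined = {("a", 0): 0.5, ("b", 1): 0.5}
print(conditional_entropy(determined))   # 0.0

# X tells us nothing about Y: a full coin-flip's worth remains.
independent = {("a", 0): 0.25, ("a", 1): 0.25,
               ("b", 0): 0.25, ("b", 1): 0.25}
print(conditional_entropy(independent))  # 1.0
```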

*That’s a rhetorical question, of course.  It could be negative.  So much of this (most important) metric is arbitrary, up to the discretion of the user, which reinforces the fact that entropy is only meaningful in relative terms.

The Criminal Topography of Chicago

Lantern slide of the street and boulevard system, "present and proposed," from the Plan of Chicago

A Map of Chicago, in Crimes

I recently stumbled upon this website, which is a huge repository of the City of Chicago’s formidable data-gathering.  This page was put together as part of the city’s new commitment to greater transparency, and includes everything from CTA bus records to Tax-Increment Financing districts to registers of public employees and their salaries.  One of the files was a listing of every crime report filled out in the city for the last 13 years.

Let me note here that such a file is a staggering achievement.  For each crime, there is recorded the location, the time, the address, the kind of crime, and a few other bits of miscellaneous information.  In total, the file encompasses 4 million crime reports spread over 13 years: an average of roughly 300,000 per year.

This spreadsheet is colossal, not only in terms of its sheer size (1 GB in total), but in its implications.  How many questions can now be asked with this data that were previously impossible, or accessible only to a handful of academics?


This picture is a map of the City of Chicago, in crimes.  Each dot here is a single crime.  The darker green bits are where multiple crimes have piled up in roughly the same place.  You can see that it recapitulates perfectly the known geography of the city.  Where there are people, there are crimes.

There are also some gaps in the map.  Some of these blank spaces are parks; presumably the report is assigned to the nearest street location, instead of properly putting it where it actually occurred.  Other gaps are rivers or industrial zones, where presumably little crime is committed.

Different Neighborhoods, Different Crimes

Yet, not all areas have the same crimes.  I separated the data by offense, and noticed immediately that some crimes were disproportionately committed in certain areas (I’m sure you can imagine some hypotheses).  Here’s an example of that phenomenon, in which I’ve focused on three kinds of crimes: battery (as in assault and battery), deceptive practices, and narcotics.
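The separation step itself is simple tallying.  Here is a sketch of it; the records and field names are invented stand-ins, and the real file’s schema should be checked against the data portal:

```python
from collections import Counter

# Tally reports by offense type, and by (offense, area). These records
# are invented for illustration; the actual Chicago file uses its own
# column names for offense type and location.
reports = [
    {"type": "BATTERY",            "area": "southwest"},
    {"type": "NARCOTICS",          "area": "southwest"},
    {"type": "NARCOTICS",          "area": "southwest"},
    {"type": "DECEPTIVE PRACTICE", "area": "loop"},
]

by_type = Counter(r["type"] for r in reports)
by_type_area = Counter((r["type"], r["area"]) for r in reports)

print(by_type["NARCOTICS"])                      # 2
print(by_type_area[("NARCOTICS", "southwest")])  # 2
```

Mapping is then just plotting each tally at its coordinates.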


You can still see the outline of the city here, and its idiosyncratic borders.  But you’ll note that battery is found mostly on the southwest side of the city, along with much of the narcotics reports.  There are pockets of narcotics reports on the north side, in Rogers Park and other neighborhoods, but narcotics is primarily a southwestern offense.

In strong contrast to the narcotics pattern is the grouping for “Deceptive Practices.” Wikipedia (behold, my complete ignorance of the law) informs me that Deceptive Practices includes things like fraud, false advertising, and misrepresentation.  Such offenses are most often the province of businesses, and so the Loop is home to the greatest cluster of Deceptive Practices reports.  You can see rays of Deceptive Practices reports emanating from the Loop along the major thoroughfares (Milwaukee Avenue, for instance).  This pattern presents a corollary to the above law: where there are businesses, there are deceptive practices.

The Inequality of Arrests

*In an earlier version of this post, I wrote about no arrest crime reports, which I erroneously assumed were tickets.  They are not tickets, but it is not clear exactly what they are.

Another variable recorded in the dataset is called “arrest”.  In fact, this field doesn’t record arrests, but rather whether a crime report has been marked “cleared”.  Cleared reports are those for which an offender has been found and successfully prosecuted.  For some crimes, clearance rates are near 100%; for others, they appear to be much more variable.

One of the more variable categories of crime is narcotics possession.  This variability is curious, since I wouldn’t imagine officers devoting a lot of resources to filing reports on “unsolved” narcotics crimes.  However, I have very limited information as to why a low-level narcotics crime would go uncleared (see updates below).  I decided to model clearance probability for marijuana possession as a function of location.

I subset the data by kind of crime and by year (2013).  I selected narcotics reports, specifically those involving less than 30g of cannabis (as described in the report; this was the lowest level).  I built a logistic regression model incorporating latitude, longitude, and an interaction term between them.  There are fancier and cleverer ways of doing this modelling; I am not striving for mathematical precision but rather a rough overview.  Here are the fitted probabilities of clearance, depending on where the crime took place.
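For the curious, here is a bare-bones sketch in the same spirit: a logistic regression on latitude, longitude, and their interaction, fit by gradient descent.  The coordinates, outcomes, and settings below are all simulated for illustration (arranged so clearance is likelier to the south and west); the actual model was fit to the real crime reports.

```python
import math

def fit_logistic(rows, lr=0.5, steps=2000):
    """Fit P(cleared) ~ intercept + lat + lon + lat*lon by gradient
    descent. rows: list of (lat, lon, cleared) with centered coords."""
    w = [0.0, 0.0, 0.0, 0.0]  # intercept, lat, lon, lat*lon
    for _ in range(steps):
        grad = [0.0] * 4
        for lat, lon, y in rows:
            x = [1.0, lat, lon, lat * lon]
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(4):
                grad[j] += (p - y) * x[j]   # logistic-loss gradient
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def predict(w, lat, lon):
    z = w[0] + w[1] * lat + w[2] * lon + w[3] * lat * lon
    return 1 / (1 + math.exp(-z))

# Simulated, centered coordinates: negative lat = south, negative lon =
# west. Clearance (1) is assigned to the south/west half of the grid.
rows = [(la / 10, lo / 10, 1 if la + lo < 0 else 0)
        for la in range(-5, 6) for lo in range(-5, 6)]
w = fit_logistic(rows)
print(predict(w, -0.4, -0.4) > 0.5)  # True: far southwest, likely cleared
print(predict(w, 0.4, 0.4) < 0.5)    # True: far northeast, unlikely
```

Centering the coordinates before fitting keeps the gradient steps well-behaved; with raw latitudes near 42 and longitudes near −88, plain gradient descent converges very slowly.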

arrest probability

I built a fully 3-dimensional version of this graph, which can be rotated and zoomed, here.  You should go play with it.  If you view the visualization head on, with longitude as the x-axis and latitude as the y, you’ll see a map very similar to the first graph on this page.  If you tilt it slightly, you’ll see that this graph can be thought of as a criminal topography of Chicago (as above), but warped or deformed by the probability of an arrest.

The takeaways of the model are largely as expected.  There are significant effects of latitude, longitude, and their interaction, such that the further south or west the crime occurs, the more likely it is to be cleared.  The difference is not overwhelming, but it is stark nonetheless: on the northeast side of the city, the probability of clearance falls to ~80%, while anywhere in the southern or western sides of the city, it is close to 100%.  That’s probably not an artifact or an accident: we know that African-Americans and Latinos are much more likely to be arrested overall, and the southern and western parts of the city are where many African-American and Latino people live.

Even so, I ought to note that I haven’t (and can’t) control for all the necessary covariates.  As always, the task is more complex than it initially appears.  Police must arrest (and so have more motivation to clear a report) when the offense is committed by someone under 17, so it may be that more clearances occur on the south and west sides because more of the offenders there are young.  As well, the police reports do not offer enough granularity to know whether the amounts of marijuana involved are higher in the southern and western parts of the city: the lowest level recorded is simply less than 30g.  Since the technical cutoff for a ticket is only 15g, perhaps there are simply more offenders in the 15-30g window on the south and west sides.

I doubt that’s the case, though.  According to a good deal of published research, blacks and whites use marijuana at approximately equal rates.  So we are left with other, more uncomfortable explanations for the observed differences in clearance rates.

Whatever the cause, the pattern is clear, and serves as an example of the sort of discovery this data can give rise to.  I am usually skeptical of the notion of “big data” as any kind of transformative phenomenon, on the theory that big data is really just lots of small data put together.  But to the extent that governments embrace it, there may yet be some sea change, in that big data reduces the inherent information asymmetry between the people and their representatives.  The government of Chicago has a thousand maps of the city at their disposal at any moment, some flattering, some despicable; the thought that all of those maps might be laid bare and examined is exciting.


Update #1

In the first draft of this post, I assumed that many of the marijuana crime reports (<30g) without clearances corresponded to citations that were issued instead.  I made this assumption based on three facts: 1) the marijuana reports without arrests increased suddenly in the crime logs in the same year the decriminalization measure was passed; 2) the number of no-arrest low-level marijuana reports was quite similar to the reported number of tickets issued; and 3) I couldn’t think of a reason why there would be crime reports filed for such small amounts of weed unless there was also an offender present to prosecute.

Having now found a database of some of those (~300) tickets, that assumption appears to be wrong: the issued tickets do not seem to be recorded in the crime database, meaning that the no-arrest marijuana reports are potentially something different.  This raises the question: what, then, are the no-arrest marijuana reports?  As of right now, I don’t know.  I have emailed the appropriate contact at the Chicago data portal to find out.

The fact remains, however, that whatever the no-arrest marijuana reports are, they are distributed non-randomly throughout the city.  I have changed some language in the post to make clear that these no-arrest reports are not necessarily tickets, and I will update again when I find out what the reports actually are.

Update #2

I sought clarification about the no-arrest reports from the contact listed at the City of Chicago data portal.  Apparently, these no arrest reports are not tickets, and have little to do with whether there was an arrest made.  Instead, they denote whether the report was marked “Cleared-Closed”, as would happen after the incident went through the court system.  The body of the post has been updated accordingly.

As a result, the no arrest reports could be cases in which the offender couldn’t be found (or died?).  According to the person with whom I spoke, there may be additional reasons why the report would not be marked Cleared-Closed, but I wasn’t able to get an exhaustive list.

And that’s where I’m leaving it.  I’m not a journalist, and I don’t really know where to go next with this line of inquiry.  It does seem strange that these no-arrest narcotics reports would show the pronounced geographic pattern that I describe in the post; I doubt police on the north side are finding unattended joints, and writing up reports on them, at a different rate than on the south side.  But in the absence of more and better information, I’ll leave these no-arrest reports as an unsolved, albeit intriguing, riddle.  Hopefully someone with more time and the skills to pursue the question will pick it up at some point.


I reproduce the full email chain with the City of Chicago contact below.

I wrote this email:

“I’m curious about what a certain classification of crime report corresponds to.  Specifically, I have noticed that not all offenses described as “30 or less grams of marijuana” also show arrests (arrest = true).  Are these cases in which there was a ticket issued instead?  If not, what occurred in these instances?”

And received this email in response:

“The field indicating whether an arrest was made is based on whether the incident has been marked as “Cleared – Closed” not necessarily whether an arrest was made. When a citation is issued instead of arrest for under 15 grams, no case report is generated.  ”

To which I replied:

“Thank you so much for your response. So in cases where the report was not “Cleared/Closed”, the perpetrator has not been found or charged? Is there any other reason why a case would not be cleared/closed?”

And got this back:

“There are actually a few other statuses, such as cleared/exceptionally closed and cleared open.

Cleared/exceptionally closed could be that we know who the offender is but the victim does not want to sign complaints, the offender is dead and therefore cannot be charged, or some other reason the CPD could not charge the offender.
Cleared open means we know who the offender is but do not have the person in custody yet.

Additionally, while looking at narcotic incidents which were not marked as having an arrest, some of those case reports are still in preliminary status and should not have been visible/available on the dataportal.”