Information theory is a discipline of computer science which purports to describe how ‘information’ can be transferred, stored, and communicated. This science has grand ambition: merely to define information rigorously is difficult enough. Information theory takes several steps further than that.

The most important single metric in information theory, and arguably the basis for the whole discipline, is entropy. Nobody agrees precisely on what entropy means; whereas entropy has been called a quantification of information, Norbert Wiener famously thought that information was better defined as the opposite of entropy (‘negentropy’).

Whatever entropy is, it’s important. It keeps appearing in science: in statistics, as a fundamental limit on what is learnable; in physics, as a potentially conserved quantity like matter/energy; in chemistry, where it was first discovered, as the subject of a law of thermodynamics. Entropy is everywhere, so let’s see what it is.

## Some History

One of the most famous and important scientists you may never have heard of is a man named Claude Shannon. A mathematician by training, Shannon is best thought of as the inventor of, and simultaneously the primary explorer within, information theory. He holds a status within information theory akin to that of Charles Darwin within evolutionary biology: a mythical figure who at once created an entire discipline and then figured out most of that discipline, before anyone else had a chance to so much as finish reading.

In 1948, Shannon wrote a paper called “*A Mathematical Theory of Communication*”. I don’t think there are many more ambitiously titled papers, although, to further my analogy, *On the Origin of Species* is certainly a contender. In this paper, now a book, Shannon begins by defining entropy. Of note, he writes:

> The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point… The significant aspect is that the actual message is one selected from a set of possible messages.

Thus the stage is immediately set for entropy. It relates to the idea of sets of signals, one of which may be chosen at a time. *A priori*, it seems some sets may be more ‘informational’ than others. Books communicate more information than traffic lights. To this point, he says,

> If the number of messages in the set is finite then this number [the number of possible messages] or any monotonic function of this number can be regarded as a measure of the information produced when one message is chosen from the set, all choices being equally likely.

So clearly, part of the reason books are more informational is that there’s simply a larger ‘bookspace’ (that is, a larger number of possible books) than there is a space of possible traffic-light signals.

That only works, however, if all of the choices are equally likely. To get a more general formulation of entropy–which to Shannon is information–Shannon creates a list of requirements. They are as follows.

1. Entropy needs to be continuous in the probabilities. In other words, a small change in the probability p of any message, anywhere from 0 to 1, ought to produce only a small change in entropy.

2. As above, if all the possible messages are equally likely, then entropy should increase monotonically with the number of possible messages. We already covered this–books are more entropic, and thus informational, than traffic lights.

3. Entropies ought to be additive in a weighted fashion. So if a single choice among three messages is broken down into two sequential choices (first between groups of messages, then within the chosen group), the total entropy remains the same: it is the entropy of the first choice plus the entropy of the second, weighted by how often the second choice has to be made. This one is the hardest to grasp, but relates essentially to the idea that information can be translated into different languages. I can translate a signal in an 8-letter language into the same signal in a 2-letter language without the entropy changing, simply by spelling each of the 8 letters as a sequence of three letters in the 2-letter language.

Three (relatively) simple rules, and yet, as it turns out, Shannon proves that there is one and exactly one function (up to a constant multiplier) that obeys all three rules. That function is entropy, the very same function Gibbs found earlier in reference to chemistry, now recruited in service of information…

There’s something fascinating about not just the formula itself, but the way Shannon derives it. He sets out a series of requirements, detailed above, and realizes that there’s only a single mathematical relationship which obeys all three requirements. All of the requirements are straightforward, common-sense attributes which any characterization of information must obey. Strike out any of the three, and one is no longer discussing information. It’s an elegant way of making an argument.
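The third requirement is easier to believe with numbers in hand. Here is a rough sketch in Python using the example Shannon gives in the paper: a choice among three messages with probabilities 1/2, 1/3 and 1/6, compared with the same choice made in two stages. The helper name `entropy` and the choice of base 2 are my own assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits (hypothetical helper); zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# One direct choice among three messages with probabilities 1/2, 1/3, 1/6.
direct = entropy([1/2, 1/3, 1/6])

# The same choice in two stages: first a fair coin flip, then, half of the
# time, a second choice with probabilities 2/3 and 1/3.
staged = entropy([1/2, 1/2]) + (1/2) * entropy([2/3, 1/3])

print(direct, staged)
```

Both routes come out to about 1.459 bits, which is the weighted additivity the third rule demands.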

## Some Theory

Shannon calls entropy H before he even invents the formula for it. A couple of pieces of background information are useful in understanding the formula. One is that elements of the set–remember, communication is just choosing from a set of signals–are numbered or indexed with *i*, going from 1, the first element, to *n*, the last. Any formulation of entropy must take into account each such element, so we sum (Σ) over all the elements.
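Written out in that notation, the formula Shannon arrives at takes the following form. In this rendering I have assumed the constant multiplier is 1 and left the base of the logarithm unspecified; the remaining symbols are explained in the paragraphs that follow.

$$
H = -\sum_{i=1}^{n} p_i \log p_i
$$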

$p_i$ refers to the probability of the *i*th element. In practice, we rarely know this probability, but it can be estimated (as the frequency of the *i*th message divided by the total number of observed messages). If an element occurs 0 times, its probability is estimated as 0, causing it to drop out of the calculation.
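The frequency estimate described above might be sketched like this. The function name `estimate_probabilities` and the list of observations are hypothetical, invented purely for illustration.

```python
from collections import Counter

def estimate_probabilities(messages):
    """Estimate each message's probability as its relative frequency."""
    counts = Counter(messages)
    total = sum(counts.values())
    return {message: count / total for message, count in counts.items()}

# Hypothetical observations of a traffic light.
observed = ["red", "green", "green", "red", "amber", "green", "red", "green"]
probs = estimate_probabilities(observed)
print(probs)
```

A message that never shows up in the observations never enters the dictionary at all, which is the code's version of an element dropping out of the calculation.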

For a given element, we multiply $p_i$ by the logarithm of $p_i$. Interestingly, the base of the logarithm is arbitrary, meaning that the actual number that entropy assumes is also arbitrary. Traditionally, the base is two, which outputs entropy in terms of bits, which are universally familiar now that computers are ubiquitous. But entropy could just as easily be computed in base 10 or 100 or 42, and it would make no difference beyond rescaling every value by the same constant. No other number in science so significant is also so fluid.

Which highlights an important point: entropy means relatively little in and of itself. Entropy is most useful when comparing two or more ‘communications’. When applied to a single source, it can be difficult to grasp what a given entropy means. The clearest formulation is in terms of bits, in which a message can be understood as a series of yes/no questions. For example, a message or source with an entropy of 1 bit can be described by a single binary question. But even this is really describing one source of information as another (that is, language).
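The arbitrariness of the base can be made precise with the ordinary change-of-base rule for logarithms. Here $H_b$ stands for entropy computed with base-$b$ logarithms; the subscript is notation I am introducing for illustration, not Shannon's.

$$
H_b = -\sum_{i=1}^{n} p_i \log_b p_i = -\sum_{i=1}^{n} p_i \, \frac{\log_2 p_i}{\log_2 b} = \frac{H_2}{\log_2 b}
$$

Every source's entropy is divided by the same constant, so comparisons between sources come out the same in any base. One bit, for instance, is $\ln 2 \approx 0.693$ of the natural-log unit.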

But back to calculation–one must repeat the $p_i \log p_i$ calculation for each of the *n* possible elements, and sum the results. One now has a negative number, because log(x) when 0 < x < 1 is negative. To reverse this sorry situation (after all, how could information be negative?*), we multiply by -1. Then we have a positive quantity called entropy that measures the uncertainty of the *n* outcomes.
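Step by step, the calculation might look like the following sketch. The four probabilities are made up for illustration, and base 2 is assumed so that the answer comes out in bits.

```python
import math

# Hypothetical probabilities for a source with four possible messages.
probs = [0.5, 0.25, 0.125, 0.125]

# One p_i * log2(p_i) term per element; zero-probability elements are skipped.
terms = [p * math.log2(p) for p in probs if p > 0]

raw_sum = sum(terms)   # negative, since log2(p) < 0 whenever 0 < p < 1
entropy = -raw_sum     # multiply by -1 to get a positive quantity

print(raw_sum, entropy)  # -1.75 1.75
```

An entropy of 1.75 bits says that, on average, a little under two yes/no questions are enough to identify which of the four messages was sent.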

## Entropy is Magic

When entropy is high, we don’t know what the next in a series of messages will be. When it is low, we have more certainty. When it is 0, we have perfect knowledge of the identity of the next message. When it is maximal–when the messages are equiprobable, each $p_i$ equal to 1/*n*–we have *no* knowledge as to the next in a series of messages. Were we to guess, we’d perform no better than random chance.
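These extremes are easy to check numerically. A small sketch, again assuming base 2 and using a hypothetical `entropy` helper with made-up distributions over four messages:

```python
import math

def entropy(probs):
    """Shannon entropy in bits (hypothetical helper); zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

n = 4
certain = entropy([1.0, 0.0, 0.0, 0.0])   # one message always occurs
uniform = entropy([1 / n] * n)            # all n messages equally likely
skewed = entropy([0.7, 0.1, 0.1, 0.1])    # somewhere in between

print(certain, uniform, skewed)
```

The certain source has an entropy of 0, the uniform one reaches the maximum of log2(4) = 2 bits, and the skewed source lands between the two.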

We can see then that entropy is a (relatively) simple, straightforward mathematical expression which measures something like information. It’s not clear, as I mentioned above, what the exact relationship between entropy and information is. There’s something fascinating about the fact that entropy (as a formula) appears vitally important, but is simultaneously difficult to translate back into language.

In any case, entropy is important not only on its own but because of the paths it opens up. With an understanding of entropy, one can approach a whole set of questions related to information which would have appeared unanswerable before. For example, conditional entropy builds on entropy by introducing the concept of dependency between variables–the idea that knowing X could inform one’s knowledge of Y. From there, one can develop ever-more complex measures to capture the relationships between different sources of information.
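To give a flavor of conditional entropy, here is a minimal sketch that uses the standard definition of H(Y|X) as the average uncertainty left in Y once X is known. The joint distribution (weather and umbrella-carrying) and the function names are invented for illustration.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits (hypothetical helper); zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) in bits, given joint probabilities {(x, y): p(x, y)}."""
    # Marginal p(x), obtained by summing the joint over y.
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    # H(Y|X) = sum over (x, y) of -p(x, y) * log2(p(x, y) / p(x)).
    return sum(-p * math.log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

# Hypothetical joint distribution: X is the weather, Y is whether I carry an umbrella.
joint = {
    ("rain", "umbrella"): 0.4,
    ("rain", "none"): 0.1,
    ("sun", "umbrella"): 0.1,
    ("sun", "none"): 0.4,
}

# Marginal p(y), obtained by summing the joint over x.
p_y = defaultdict(float)
for (_, y), p in joint.items():
    p_y[y] += p

h_y = entropy(p_y.values())
h_y_given_x = conditional_entropy(joint)
print(h_y, h_y_given_x)
```

Knowing the weather cuts the uncertainty about the umbrella from 1 bit to roughly 0.72 bits, which is the sense in which X informs one’s knowledge of Y.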

*That’s a rhetorical question, of course. It could be negative. So much of this (most important) metric is arbitrary, up to the discretion of the user, which reinforces the fact that entropy is only meaningful in relative terms.