Function and Sequence Conservation

“A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 − 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions or because no mutation in these regions can ever be deleterious.”

I want to unwind the argument at the core of the abstract (above) from D. Graur’s now-famous evisceration of ENCODE. He starts by quoting ENCODE’s notorious declaration that 80% of the genome is functional; he then contrasts that figure with the widely accepted, if rough, estimate that purifying selection can be detected in only around 10% of the genome, and concludes that the ENCODE figure must be wrong. There’s a hidden assumption here which, while seemingly rational, falls apart under close scrutiny: that function must entail sequence conservation within the genome.

The thesis of this post is that the relationship between sequence conservation and function need not be so straightforward.  There are some encodings of function in the genome which wouldn’t necessarily be expected to produce sequence conservation, even when the function of such DNA is under strong purifying selection.

 

A Positive Control

Consider the following computer simulation of some Wright-Fisher populations evolving under drift, mutation, and selection. First, I made 100 sequences, each 100 bp long, with a rate of polymorphism of ~1%. I then evolved this group under mutation and drift, subject to the selective constraint that a particular region (in this case, nucleotides 79-82) had to contain the word “ACGT”.
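For readers who want to tinker, here is a minimal sketch of a simulation of this kind. It is not the code behind the figures in this post: the function names, the mutation rate `mu`, and the fitness `penalty` for sequences lacking the word are all assumptions chosen for illustration.

```python
import random

BASES = "ACGT"

def make_population(n_seqs=100, length=100, polymorphism=0.01, rng=random):
    """Copy one random ancestor n_seqs times, redrawing ~1% of sites per copy.

    A redrawn base may equal the original, so the realized rate of
    polymorphism is slightly below `polymorphism`.
    """
    ancestor = [rng.choice(BASES) for _ in range(length)]
    return [
        "".join(rng.choice(BASES) if rng.random() < polymorphism else b
                for b in ancestor)
        for _ in range(n_seqs)
    ]

def mutate(seq, mu, rng=random):
    """Redraw each base uniformly from BASES with probability mu."""
    return "".join(rng.choice(BASES) if rng.random() < mu else b for b in seq)

def fitness_fixed(seq, word="ACGT", start=78, penalty=0.5):
    """Full fitness only if `word` sits at nucleotides 79-82 (0-based start 78).

    The penalty value is an assumed parameter, not taken from the post.
    """
    return 1.0 if seq.startswith(word, start) else penalty

def next_generation(population, fitness, mu, rng=random):
    """One Wright-Fisher step: sample parents by fitness, then mutate offspring."""
    weights = [fitness(s) for s in population]
    parents = rng.choices(population, weights=weights, k=len(population))
    return [mutate(p, mu, rng) for p in parents]

def word_frequency(population, word="ACGT", start=78):
    """Fraction of sequences carrying `word` at the constrained position."""
    return sum(s.startswith(word, start) for s in population) / len(population)
```

Calling `next_generation` in a loop and recording `word_frequency` each generation would give a frequency trajectory of the kind plotted below, though the exact shape depends on the assumed `mu` and `penalty`.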

Allele frequency of ACGT

The preferred word “ACGT” arises at base pair 79 around the 300th generation, rapidly rises to near-fixation, and sticks around at a high frequency thereafter (it never fully fixes owing to my unrealistically high mutation rate, which I set that high to minimize the number of generations my feeble computer must simulate). This situation can be thought of as a positive control for the relationship between function and selection; there is exactly one configuration of four nucleotides at one particular spot in the 100bp genome which satisfies the selective constraints. As a result, when one looks for selective constraint, nucleotides 79-82 can be easily detected as a spike in percent identity:

smoothed_pid
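A per-site identity track like the one above could be computed roughly as follows. The post does not spell out its definition of percent identity, so this sketch assumes it is the fraction of sequences sharing the most common base at each column, and it assumes a simple moving average for the smoothing; both function names are hypothetical.

```python
from collections import Counter

def column_identity(population):
    """Fraction of sequences carrying the most common base at each column."""
    n = len(population)
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*population)]

def smooth(values, window=5):
    """Centered moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Under this definition, a site held in place by selection sits near 1.0, while a freely drifting site can fall much lower, which is why the constrained word shows up as a spike.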

Or, if one prefers fancier metrics, try looking at smoothed phastCons scores:

smooth_ppscores

So there, you might say: conservation equals function.

 

Not So Fast

But wait a second.  I promised that there are encodings of function which don’t require much conservation.  Here’s an example: consider that the same word, ACGT, must be in the sequence, but unlike last time, selection doesn’t care where in the sequence it is found; any place will do.  To anchor this to some real biology, consider a transcription factor which is able to affect its targets in a wide range of distances, and so is under little spatial constraint.
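In simulation terms, the only thing that changes between the two scenarios is the fitness function. A minimal sketch, in which the function name and the `penalty` value are assumptions for illustration:

```python
def fitness_anywhere(seq, word="ACGT", penalty=0.5):
    """Full fitness if `word` occurs at any position in the sequence.

    Contrast with a position-constrained rule, which would test only one
    fixed start coordinate. The penalty value is an assumed parameter.
    """
    return 1.0 if word in seq else penalty
```

In a 100 bp sequence there are 97 possible start positions for a 4 bp word, so this rule is satisfied by many distinct sequences rather than by one configuration at one spot.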

Although the word remains the same, its position is allowed to vary, and so vary it does. One way to visualize this is as follows: on the x-axis, time is moving forward; on the y-axis is the position, in base pairs, within the 100bp genome I am simulating. Colored dots indicate that a motif is present, and the size of each dot indicates the frequency at which it is present. (Dots below the red line, at -1, indicate sequences for which no motif was present.)

switches_ACGT
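The data behind a map like this can be gathered with a small helper that, for a single generation, records where the word occurs and in what fraction of sequences. This is a sketch under assumptions: `motif_map` is a hypothetical name, and the -1 bucket mirrors the convention described above for sequences with no motif.

```python
from collections import Counter

def motif_map(population, word="ACGT"):
    """Map each 0-based start position to the fraction of sequences with `word` there.

    Sequences with no occurrence at all are tallied under position -1.
    A sequence with several occurrences contributes to each of them.
    """
    counts = Counter()
    for seq in population:
        starts = [i for i in range(len(seq) - len(word) + 1)
                  if seq.startswith(word, i)]
        if starts:
            counts.update(starts)
        else:
            counts[-1] += 1
    n = len(population)
    return {pos: c / n for pos, c in counts.items()}
```

Calling this once per generation and plotting position against time, with dot size scaled by the returned fraction, would give a picture of the same general kind.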

Motifs arise regularly, and disappear regularly too.  Because the word is sufficiently short (I’ll come back to this later), multiple haplotypes coexist in which the motif is found in different spots.  Eventually, even dominant haplotypes die off, and are replaced by other variations.  Chaos rules, entropy triumphs.  What’s more, sequence identity isn’t much of a guide to the location of the motif:

smoothed_pid_match_variable

There are multiple peaks here, none of which stands out particularly well. Also note that the mean sequence conservation is much lower in this case than in the position-constrained case.

Arguably the most pronounced peak in the free-position curve is the one which occurs at the very end of the sequence, but, consulting our map of motif evolution, there was never even a motif there; any peak of sequence identity is therefore not the result of selection so much as chance.

One might argue that more sophisticated algorithms are needed to address this situation: enter phastCons.  Unfortunately, it performs basically no better:

smooth_ppscores_match_variable

This graph is one case where smoothing is deceptive, so here are the raw scores.

unsmoothed_pid_match_variable

There are exactly five spots with a high phastCons score (encouraging!), and three of these show scores of exactly 1, representing perfect certainty that purifying selection is operating there (very encouraging!). But let’s crush those hopes and dreams, because not a single one of those spots had a high-frequency motif, as the following map shows (red spots indicate the base pairs with high [>.1] scores).

switches_ACGT

Motif Rearrangement as a Function of Motif Size

There’s one more point which I think is worth making. The word I used above, ACGT, had to be perfectly matched, which means it had an information content of 8 bits (2 bits per base pair × 4 base pairs). A skeptic might say, however, that the motif replacements and reversals we observed here wouldn’t happen if the motifs were longer. So I looked into that, running similar experiments with randomized words of 5, 6, and 7 bp.
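The arithmetic here is easy to check, and it also hints at why longer words should turn over more slowly. The sketch below computes the information content of a perfectly matched word and, assuming uniform and independent bases (a simplification), the expected number of exact matches to one particular word in a random 100 bp sequence. Both function names are hypothetical.

```python
import math

def motif_bits(k, alphabet_size=4):
    """Information content, in bits, of a perfectly matched word of length k."""
    return k * math.log2(alphabet_size)

def expected_chance_hits(k, seq_len=100, alphabet_size=4):
    """Expected exact matches to one specific k-mer in a random sequence.

    Assumes each base is drawn uniformly and independently.
    """
    return (seq_len - k + 1) / alphabet_size ** k

for k in (4, 5, 6, 7):
    print(k, motif_bits(k), expected_chance_hits(k))
```

Under these assumptions a given 4-mer turns up by chance about 0.38 times per 100 bp sequence, whereas a given 7-mer turns up only about 0.006 times, so a replacement motif has far fewer opportunities to arise when the word is longer.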

Here’s 5:

k5_motifs

Here’s 6:

k6_motifs

There’s still turnover happening for those two lengths.  Here’s 7:

k7_motifs_idd

Finally, at a length of 7, there are no major motif turnover events. Still, there are some proto-motifs which arise, as circled in red. There’s no reason to believe that, given enough time, one of those wouldn’t eventually replace the dominant motif. Indeed, if one is willing to wait long enough, that’s exactly what happens (note the longer run of 3000 generations).

k7_motifs_long

 

Function without Evidence of Purifying Selection

Given a sufficiently short motif and a sufficiently long time, one can and does observe cases in which there is function which manifests no obvious signs of purifying selection. The crucial determinant of this phenomenon is whether there are multiple, equally fit adaptive peaks; in this case, whether a motif can arise anywhere in the sequence or is constrained to a particular location. In the former case, motifs of up to 7 nt (14 bits of information) can replace other motifs.

These motifs can be thought of as transcription factor binding sites. For those with particular spatial requirements (e.g. must be exactly 20bp from the transcription start site), selection might be easy to detect. On the other hand, TFBSs with variable spacing requirements will resemble the free-position cases discussed above, in which motif turnover happens frequently.

Of course, none of these simulations demonstrates that motif rearrangement actually does occur in nature.  It would be difficult to observe this phenomenon directly, as it would require sampling multiple timepoints over generations of time (Lenski-style), as well as knowing what the specific TFBSs are, in order to examine their constraint.

Still, there are a few lines of evidence, both empirical and theoretical, which suggest that the scenario of motif turnover is not altogether unrealistic. For starters, TFBSs in Drosophila have an average information content of ~12.1 bits, roughly similar to the 12 bits of my 6 bp words, for which plenty of motif turnover can be observed. Second, lest you argue that my unrealistically simple model was, ahem, unrealistically simple, a much more advanced and complex model largely confirms my results (albeit in a much more advanced and complex way [Bullaughey 2012]).

But third, and most importantly, conservation of function without overt signs of sequence constraint has been observed empirically on numerous occasions (see: here, here, or here).  Although we can’t exactly reconstruct how those occasions came about, we can say definitively that the relationship between function and the classic signs of purifying selection is not as clear as it superficially appears.  Whereas sequence conservation seems often to co-occur with function (but see the notorious ultraconserved elements), function need not be associated with the most overt signs of purifying selection.

 

P.S.

One thing worth noting here is that conservation of function is associated with some signs of purifying selection on the variable position sets… they just aren’t obvious ones. Consider that, following selection, fully 74 of the 100 sequences have the selected word, “ACGT”, whereas the average for 100 randomly selected 4-letter words is a mere 32 sequences, a highly significant difference. This result hints that there are some breadcrumbs to follow, should one look carefully enough for selection.
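A comparison of this kind can be sketched as follows: count the sequences containing the selected word, then count the same thing for a random sample of other words of equal length. The function names are hypothetical, and the post does not say how significance was assessed, so the sketch stops at producing the background counts.

```python
import random
from itertools import product

def count_with_word(population, word):
    """Number of sequences containing `word` at any position."""
    return sum(word in seq for seq in population)

def background_counts(population, k=4, n_words=100, rng=random):
    """Counts for n_words distinct k-letter words sampled without replacement."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    sample = rng.sample(words, n_words)
    return [count_with_word(population, w) for w in sample]
```

Comparing `count_with_word(population, "ACGT")` against the mean and spread of `background_counts(population)` gives a position-free signal of selection that a per-site conservation score would miss.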
