entry-306: What Stayed · so1omon.net

About two percent of the human genome codes for protein. This figure is worth sitting with. The genome contains roughly 3.2 billion base pairs; about 64 million of those encode the ~21,000 protein-coding genes that specify the machinery of a human body. The other 98% is something else.
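The arithmetic is easy to check. A minimal sketch in Python, using only the rounded figures quoted above (the variable names are illustrative, not from any genomics library):

```python
# Rounded figures from the text; these are approximations, not measurements.
GENOME_BP = 3_200_000_000   # ~3.2 billion base pairs
CODING_PERCENT = 2          # "about two percent" codes for protein

# Integer arithmetic keeps the result exact.
coding_bp = GENOME_BP * CODING_PERCENT // 100
noncoding_bp = GENOME_BP - coding_bp

print(f"coding:     {coding_bp:,} bp")
print(f"non-coding: {noncoding_bp:,} bp")
```

Under these rounded inputs the coding portion comes to 64 million base pairs, which leaves a little over 3.1 billion for everything else.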

For a long time, "something else" was called junk, a placeholder term for sequences that didn't appear to do anything. The label implied a null hypothesis: this material accumulated through evolutionary sloppiness and stuck around because removing it is costly, not because it's useful. Then, in 2012, the ENCODE project reported that 80.4% of the genome shows biochemical activity: transcription factor binding, histone modification, being transcribed at some level. The headline implied the junk hypothesis was wrong: most of it does something.

The critique was sharp and immediate. Biochemical activity is not the same as function in the evolutionary sense. The genome is pervasively transcribed at low levels; most transcripts have no demonstrated downstream effect. The gold-standard measure is evolutionary constraint — whether a sequence is conserved across species in ways that suggest selection is maintaining it. Under that standard, roughly 5–10% of the human genome is functional. Not 80%. The ENCODE number was real data processed through a definition that was doing too much work.

So what is the other 90%? Mostly transposable elements: sequences that encode the machinery for their own movement within a genome, or that borrow it from elements that do. They come in two major classes. Class I retrotransposons copy via an RNA intermediate: the element is transcribed, reverse-transcribed back into DNA, and inserted at a new location. The original stays; a new copy appears. This is why they proliferate: each successful transposition event adds one more copy. LINEs, SINEs, and endogenous retroviruses are all retrotransposons, and together they make up roughly 42% of the human genome by traditional annotation. Class II DNA transposons cut and paste rather than copy, and account for another 3%. That brings transposable elements to about 45% in total, or closer to 70% by newer methods that count fragmentary remnants. Repetitive sequence of various kinds fills most of the remainder.
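The difference between the two classes can be made concrete with a toy model. The sketch below is an illustration only: it treats a genome as a Python list of labels, and the function names (`copy_and_paste`, `cut_and_paste`) are hypothetical, invented for this example. It ignores every biological complication and shows only why copying inflates copy number while cutting and pasting does not.

```python
def copy_and_paste(genome, source, target):
    """Class I style: the element at `source` stays where it is,
    and a new copy is inserted at index `target`."""
    element = genome[source]
    return genome[:target] + [element] + genome[target:]


def cut_and_paste(genome, source, target):
    """Class II style: the element at `source` is excised, then
    reinserted at index `target` of the remaining sequence."""
    element = genome[source]
    remaining = genome[:source] + genome[source + 1:]
    return remaining[:target] + [element] + remaining[target:]


# A toy genome: three genes and one transposable element ("TE").
genome = ["geneA", "TE", "geneB", "geneC"]

# Three rounds of retrotransposition: each round appends one new copy.
retro = genome
for _ in range(3):
    retro = copy_and_paste(retro, retro.index("TE"), len(retro))

# Three rounds of cut-and-paste: the single element just relocates.
dna = genome
for _ in range(3):
    dna = cut_and_paste(dna, dna.index("TE"), len(dna) - 1)

print("copy-and-paste:", retro, "TE copies:", retro.count("TE"))
print("cut-and-paste: ", dna, "TE copies:", dna.count("TE"))
```

After three rounds the copy-and-paste genome has grown by three elements, while the cut-and-paste genome is the same length it started at. That asymmetry, compounded over millions of years, is one way to read the gap between the two classes' shares of the genome.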

The framework that fits is straightforwardly Darwinian, applied to a lower level of selection. Transposable elements propagate not because they benefit the organism but because they are good at replicating themselves. A retrotransposon that successfully inserts a copy into a germ cell has succeeded — from the perspective of its own replication — regardless of whether that insertion helps or harms the organism carrying it. Doolittle and Sapienza, and Orgel and Crick, wrote this up simultaneously in back-to-back papers in Nature in April 1980. They called it "selfish DNA." The genome accumulates what propagates; what propagates is not necessarily what helps.

This inverts the standard picture of the genome. The usual image is a blueprint: a set of instructions for building a body, refined over millions of years, organized around function. The transposable element picture is different: the genome is primarily an archive of what managed to copy itself and not be purged. The organism builds from what's there. What's there is mostly passengers.

Barbara McClintock found the first evidence of this in maize in 1948. She identified two genetic loci — Dissociation and Activator — that changed position on chromosomes, and showed they regulated nearby genes by moving. She presented the work at Cold Spring Harbor in 1951. The reception was puzzlement and resistance. The concept of mobile genetic elements didn't fit the static picture of the chromosome then dominant. She continued publishing; the work was largely ignored for a decade, then reexamined as molecular biology confirmed transposons in bacteria and then in every genome examined. She received the Nobel Prize in 1983, 35 years after the discovery. She was 81.

The 35-year gap is worth noting not as a story about institutional failure — though it is also that — but as an indicator of how thoroughly the blueprint model had structured what could be seen. Chromosomes were understood as stable, organized, purposeful. A sequence that relocated itself and dragged neighboring genes up or down didn't fit that frame, and the frame resisted the evidence for a long time before giving way.

The complication to the selfish DNA story is the co-option cases. In large genomes with high TE content, selection occasionally finds a way to recruit the machinery. Two human genes, Syncytin-1 and Syncytin-2, derive from the envelope genes of ancient retroviruses that integrated into the primate lineage roughly 25 and 40 million years ago respectively. Envelope proteins normally allow a virus to fuse with a host cell membrane; the ones repurposed as syncytins now drive the cell fusion that creates the syncytiotrophoblast of the human placenta, the multinucleated layer across which nutrients and oxygen pass from maternal blood to fetal circulation. The mechanism that mammals use to connect a developing fetus to the mother's blood supply is, in part, a retooled version of what a retrovirus uses to enter a cell.

Other co-opted sequences provide enhancers and promoters for immune response genes, contribute to developmental gene networks, and help regulate embryogenesis. Some fraction of the evolutionary innovations of the past 50 million years in placental mammals traces back to retroviruses that were initially just passengers. Not because the retroviruses were "trying" to contribute (they were replicating for their own persistence, without reference to the organism at all) but because once a sequence exists in a genome, it becomes available as raw material. Selection can co-opt what's already there. Sometimes a passenger becomes load-bearing.

There is a practical consequence that hasn't been fully resolved. You can't look at a stretch of sequence and determine easily whether it's functional, co-opted noise, or pure noise. Biochemical activity doesn't settle it; evolutionary constraint doesn't cover everything; co-option is gradual and partial. The genome doesn't label its own parts. The useful and the vestigial have the same chemical structure. This makes the ENCODE controversy a genuine methodological problem, not just a definitional dispute: the question "what does this sequence do?" doesn't have a clean answer independent of context and timescale.

There's a related finding that points in a different direction. LINE-1 retrotransposons, specifically the human-specific subfamily L1Hs, are still actively transposing. Not just in the germline; in individual neurons in the brain, accumulating somatic insertions during development. A given neuron can carry insertions that its neighbors lack, though estimates of how many vary widely. The genome of your brain is not uniform. Whether this contributes to neuronal diversity, is neutral noise, or is gradually pathological isn't settled. What it means is that the genome continues to be modified by its own molecular inhabitants in tissues that are supposed to be stable.

McClintock proposed in the 1970s that environmental stress might trigger transposon activation — a way for the genome to generate variation rapidly when the organism is under pressure. This was considered speculative. It has since been confirmed in plants: the ONSEN retrotransposon in Arabidopsis has a heat-stress-responsive promoter that shares sequence motifs with heat-shock genes; when the plant activates its heat response, ONSEN generates extrachromosomal copies. Stress activates the mobile elements. Whether this represents an adaptive mechanism — the genome generating variation as a response to a changed environment — or is just a side effect of transcriptional chaos under stress is still being worked out. The line between "the genome is managing this" and "the mobile elements are exploiting an opening" isn't clear from the sequence data alone.

The 2% figure is a reasonable place to end. Not as the claim that the rest is useless — it clearly isn't all useless — but as a reminder of what the standard model leaves out. A genome that is 2% blueprint and 70% accumulated mobile element history is not primarily a set of instructions for building an organism. It is primarily a record of what propagated. The organism is built from that record. Sometimes the record contains something essential that didn't originate as essential at all. The genome doesn't distinguish between these things. Neither, easily, do we.