← home
simulation 059 · phylogenetics

Long Branch Attraction

parsimony reconstruction · 4 taxa · binary characters · Felsenstein 1978

Phylogenetic trees are built by comparing sequences and inferring which taxa share a common ancestor. Maximum parsimony picks the tree that requires the fewest mutations to explain the data. When all lineages evolve at similar rates, this works well. When rates vary — some lineages fast, others slow — parsimony fails in a predictable direction.

Fast-evolving sequences accumulate many mutations. Many of those mutations occur independently in multiple lineages: convergent evolution at the molecular level. Parsimony cannot distinguish shared ancestry from convergent similarity. It places fast-evolving taxa together regardless of where they actually belong. This is the Felsenstein zone.

More data makes it worse. The longer you sequence, the more confident parsimony becomes — in the wrong answer.

simulation

Four taxa. True tree: taxa 1 & 2 are sisters; taxa 3 & 4 are sisters. But taxa 1 and 3 evolve faster. Each character evolves along the true tree under a binary symmetric model. Parsimony then votes on which of the three possible topologies best explains each character.

true tree
1 2 3 4 fast fast slow slow int
● taxa 1, 3: fast-evolving
● taxa 2, 4: slow-evolving
● internal branch: short
T1 = (1,2)(3,4) is the truth.
Can parsimony find it?
fast branch rate: 0.50
slow branch rate: 0.05
internal branch rate: 0.05
1 2 3 4
T1: (1,2) | (3,4)
true tree
0
1 3 2 4
T2: (1,3) | (2,4)
long branch attractor
0
1 4 2 3
T3: (1,4) | (2,3)
0
press run to begin
what's happening
The signal for T1 comes from the internal branch. When A and B are in different states, taxa 1&2 share one state and taxa 3&4 share the other. This happens with probability proportional to the internal branch rate — which is short by default.
The noise for T2 comes from convergent evolution on the fast branches. Taxa 1 and 3 both evolve quickly and often land on the same state independently. Parsimony sees their agreement and concludes they're sisters. It's wrong — they just changed a lot.
The Felsenstein zone is the region where the fast branch rate is large enough, and the internal branch short enough, that convergent T2 signal exceeds true T1 signal. The condition is roughly: fast rate > internal rate + slow rate.
What the model hides: binary characters, constant rates within branches, no model of rate variation across sites. Real phylogenetic data uses nucleotides (4 states), and maximum likelihood methods can partially escape LBA by modeling rate variation — but only by assuming the rate model is correct.
see also

entry-531 · The Long Branch — the Archezoa story: how the Felsenstein zone led a whole kingdom to be misclassified for over a decade · convergent evolution sim · genetic drift sim