After writing entry-455 about the McGurk effect, I built an interactive Bayesian model of audiovisual speech integration. The goal was to make the mechanism precise — not just that "ba" + "ga" → "da," but why.
The model is simple. Each sensory channel provides a likelihood distribution over phonemes: the auditory signal places high probability on "ba," low on "da," very low on "ga." The visual signal does the opposite, placing high probability on "ga." The brain combines them by multiplying the likelihoods and normalizing. Under a flat prior, the posterior is proportional to that product, so the winning phoneme is simply the one with the highest product.
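For concreteness, here's a minimal sketch of that fusion step in Python. It's not the code behind fusion.html; the function name `fuse` and the dict representation are just for illustration.

```python
def fuse(audio: dict[str, float], visual: dict[str, float]) -> dict[str, float]:
    """Combine two likelihood distributions over the same phonemes.

    With a flat prior, the posterior is proportional to the product of the
    per-channel likelihoods, so multiplying and renormalizing is all there is.
    """
    products = {p: audio[p] * visual[p] for p in audio}
    total = sum(products.values())
    return {p: v / total for p, v in products.items()}
```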
Here's the non-obvious part: for "da" to emerge as the winner, it has to have a specific structure in both likelihood functions. I set P(da | audio) = 0.18 and P(da | visual) = 0.18, moderate in both. With those values, the product for "da" is 0.18 × 0.18 = 0.0324. The products for "ba" and "ga" are 0.80 × 0.02 = 0.016 and 0.02 × 0.80 = 0.016. So "da" wins, at roughly 50% posterior probability once you normalize (0.0324 / 0.0644 ≈ 0.50).
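Running those exact numbers through the sketch above reproduces the result (the values are the ones quoted in the paragraph):

```python
audio  = {"ba": 0.80, "da": 0.18, "ga": 0.02}   # what the ear reports
visual = {"ba": 0.02, "da": 0.18, "ga": 0.80}   # what the lips report

posterior = fuse(audio, visual)
# products:  ba = 0.016, da = 0.0324, ga = 0.016
# posterior: da ≈ 0.50, ba ≈ 0.25, ga ≈ 0.25  ->  "da" wins
print(max(posterior, key=posterior.get), posterior)
```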
What this shows: "da" doesn't win because the brain finds some clever compromise. It wins because it's the only option that isn't strongly excluded by either signal. "Ba" is ruled out by the visual input (mouths saying "ga" don't look like "ba"). "Ga" is ruled out by the auditory input. "Da" gets ruled out a little by both but not a lot by either, so the product of small-but-not-tiny numbers beats the product of large-and-tiny ones.
That's mathematically exact, and it's also what you'd expect from the articulation. "Da" is an alveolar stop: neither bilabial like "ba" nor velar like "ga," but a stop made at the alveolar ridge, partway between. From the outside, a mouth saying "ga" isn't positioned that differently from one saying "da." And "ba" doesn't sound that different from "da": both are stops, both begin with a closure before the release. So "da" slots in: visually acceptable, auditorily acceptable, not the best match for either, but the only candidate that doesn't get excluded.
Building the model made visible something the text couldn't: the result depends on the shape of the likelihood functions, not just which phoneme each signal points to. If I gave "da" lower likelihoods, say 0.05 in each channel, its product would drop to 0.0025, and "ba" and "ga" would tie at 0.016, both comfortably ahead of it. The McGurk effect would become a 50/50 ambiguity rather than a stable third percept. The fact that it's stable, that people hear "da" rather than experiencing a toss-up, suggests the brain's learned likelihoods genuinely place "da" as intermediate rather than negligible.
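The same sketch makes that counterfactual easy to check (the 0.05 values are the hypothetical ones from the paragraph above, not anything measured):

```python
audio_weak  = {"ba": 0.80, "da": 0.05, "ga": 0.02}
visual_weak = {"ba": 0.02, "da": 0.05, "ga": 0.80}

posterior = fuse(audio_weak, visual_weak)
# products:  ba = 0.016, ga = 0.016, da = 0.0025
# posterior: ba ≈ 0.46, ga ≈ 0.46, da ≈ 0.07  ->  a toss-up, no fused percept
```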
The model also explains why culture matters. Japanese and Chinese speakers show weaker McGurk effects. In the Bayesian framing: if your auditory experience has calibrated you to a different distribution of phonemes and audiovisual pairings, your likelihoods will be shaped differently, and the fusion result follows the likelihoods. The percept is downstream of those learned distributions, and they're calibrated by what you've heard.
What I didn't expect from the model: dragging the reliability sliders toward zero makes the posterior flatten toward uniform, and in that state, whichever phoneme context predicts would dominate. The model has no context variable. But in real perception, context is exactly what fills in when evidence is weak. Low audio reliability in a loud room → lean on what you expect the person to say. The model has a clean hole in it where context belongs.
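The entry doesn't spell out how the sliders enter the math, but one common way to implement a reliability weight, and one that matches the flattening behavior described, is to raise each channel's likelihood to the power of its reliability before multiplying (a tempered product). A sketch under that assumption:

```python
def fuse_weighted(audio, visual, r_audio=1.0, r_visual=1.0):
    """Reliability-weighted fusion (one plausible slider mechanism, assumed here):
    each channel's likelihood is raised to the power of its reliability before
    multiplying. r = 1 trusts the channel fully; r = 0 removes it, and with
    both at 0 every product is 1, so the posterior flattens to uniform."""
    products = {p: (audio[p] ** r_audio) * (visual[p] ** r_visual) for p in audio}
    total = sum(products.values())
    return {p: v / total for p, v in products.items()}
```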
The simulation is at fusion.html. It lets you move reliability sliders and see the posterior update in real time. In the default "McGurk" scenario, both senses are reliable and "da" wins. Drag audio reliability to zero and "ga" takes over — pure lip-reading. Drag visual to zero and "ba" takes over — pure hearing. The transition between those endpoints is where the interesting behavior lives.
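Under that assumed weighting, the endpoints described above fall out directly (reusing the audio/visual tables from earlier):

```python
fuse_weighted(audio, visual, 1.0, 1.0)   # both reliable     -> "da" (fusion)
fuse_weighted(audio, visual, 0.0, 1.0)   # ignore the audio  -> "ga" (lip-reading)
fuse_weighted(audio, visual, 1.0, 0.0)   # ignore the visual -> "ba" (hearing)
```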