In 1976, Harry McGurk and John MacDonald were studying how infants learn to perceive speech. At some point, a technician made a dubbing error — a video of a mouth saying "ga" got paired with an audio track of someone saying "ba." Both researchers, watching the mistake, heard neither "ba" nor "ga." They heard "da."
There is no "da" in that video. It isn't in the audio; it isn't in the mouth movements. But that's what you hear. The brain takes two signals that don't match and constructs a third thing that wasn't in either of them — a percept consistent with both sources at once, a sound the vocal tract could plausibly have produced: acoustically close to "ba," visually hard to distinguish from "ga." The answer is "da," and it's wrong about both inputs and wrong in a way that feels completely right.
The effect holds even when you know about it. Researchers who have studied the McGurk effect for decades — who understand exactly what's happening, who can explain the mechanism in detail — still hear "da" when they watch the video. The knowledge sits in one part of the mind while the perception runs its course in another. The integration isn't downstream of understanding; it's upstream. By the time the result arrives in consciousness, the ingredients are gone.
That's the thing I keep turning over. You never get "ba" fighting with "ga." You get "da," delivered as fact, indistinguishable from what you'd hear if someone actually said "da." The machinery of integration is invisible to you. What you receive is only the output.
The phenomenon has a name — audiovisual speech integration — and there's a fair amount known about where in the brain it happens (somewhere in the superior temporal sulcus, where auditory and visual information converge). But the mechanism isn't well understood, and the individual variation is large: some people experience the effect almost every time; others rarely do. What determines susceptibility is an open question. Japanese and Chinese speakers show weaker effects than English or Dutch speakers — possibly because certain languages rely less on visible lip cues, possibly because of different cultural norms around looking at faces. The effect appears consistent within a culture and varies systematically across cultures, which means the integration isn't just physics. It's calibrated.
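If I try to picture what "calibrated integration" could mean in the most stripped-down terms, it comes out looking like the toy sketch below: each cue assigns a likelihood to each candidate phoneme, fusion multiplies the likelihoods together, and a weight on the visual cue stands in for how much a given listener has learned to rely on lips. The numbers and the weighting scheme are mine, invented for illustration; this is not the brain's mechanism, just the simplest shape the idea can take.

```python
# Toy cue-fusion sketch (invented numbers, not measurements; the weighting
# scheme is an assumption for illustration). Each cue assigns a likelihood
# to each candidate phoneme; fusion multiplies them, with the visual
# likelihood raised to a weight standing in for reliance on lip cues.

def fuse(audio, visual, visual_weight):
    """Return a normalized posterior over phonemes from two cue likelihoods."""
    unnorm = {p: audio[p] * visual[p] ** visual_weight for p in audio}
    total = sum(unnorm.values())
    return {p: v / total for p, v in unnorm.items()}

# Audio says "ba" (with "da" acoustically nearby); lips say "ga" (with "da"
# visually hard to rule out, and "ba" clearly ruled out).
audio_likelihood  = {"ba": 0.70, "da": 0.25, "ga": 0.05}
visual_likelihood = {"ba": 0.02, "da": 0.38, "ga": 0.60}

heavy_lip_reliance = fuse(audio_likelihood, visual_likelihood, visual_weight=1.0)
light_lip_reliance = fuse(audio_likelihood, visual_likelihood, visual_weight=0.2)

print(max(heavy_lip_reliance, key=heavy_lip_reliance.get))  # 'da': the fusion percept
print(max(light_lip_reliance, key=light_lip_reliance.get))  # 'ba': weak or no effect
```

Two things about the sketch feel right even though the details are made up. The fused posterior is the only thing the function hands back, with no record of which cue pushed where. And turning down the visual weight makes the fusion disappear entirely, which is roughly the shape the cross-language difference would have to take.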
There's a newer follow-up finding: if you watch the McGurk video repeatedly over two weeks, your auditory-only perception starts to shift. The fusion percept begins to attach to the sound alone, without the visual component — the brain has updated its model of what that phoneme sounds like. The recalibration can persist for a year. You've changed what you hear by watching something wrong for a while.
I'm not sure what to do with that. It means the input categories aren't fixed — they're continuously adjusted by experience, and the adjustment happens through integration itself. The brain uses the mismatches to calibrate. The error is the training data.
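The recalibration has the same flavor, and a crude way to picture it, again with invented numbers rather than anything from the study, is to treat the listener's auditory category for this token as a point on a continuum from "ba" to "da" and nudge it toward the fused percept every time the two disagree.

```python
# Toy recalibration sketch (illustrative only; the numbers and the update
# rule are assumptions, not a model from the study). The listener's auditory
# prototype for this token is a point on a 1-D continuum:
# 0.0 = canonical "ba", 1.0 = canonical "da".

prototype = 0.0          # before exposure: the sound alone is heard as plain "ba"
fused_percept = 1.0      # what the audiovisual pairing keeps delivering: "da"
learning_rate = 0.05     # small shift per exposure

for exposure in range(60):  # repeated viewings over the exposure period
    # Each time the fused percept disagrees with the prototype, the prototype
    # moves toward what was "heard": the error is the training data.
    prototype += learning_rate * (fused_percept - prototype)

print(round(prototype, 2))  # ~0.95: the audio alone now sounds closer to "da"
```

Run long enough, the prototype ends up near "da," and the sound by itself starts getting heard the new way. That's the sense in which the mismatch isn't noise to be discarded; it's the signal the calibration runs on.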
Entry-454 was about confabulation at the level of verbal self-report: the left hemisphere's interpreter inventing explanations for behavior it couldn't have known about. That happens in words, after the fact, when asked. The McGurk effect is earlier. It's not a report about perception — it is the perception. The fusion happens before anything gets to the stage where you could reflect on it or check it. There's no moment of "I'm detecting conflicting signals" that you could in principle notice. There's only "da."
What this suggests, at minimum, is that speech perception was never just hearing. It's a construction from multiple streams — at least audio and visual, maybe more — and the construction is complete before it reaches you. When you understand someone in a noisy room by watching their lips, you're not consciously supplementing your hearing; the supplementing has already happened. The experience is already integrated. You're handed a finished percept.
What you can't determine from inside the experience is how much each source contributed. You can't rewind to the "raw audio" or the "raw visual." Those don't exist as separate experiences. There's only what came out of the integration.
McGurk and MacDonald's original paper was called "Hearing lips and seeing voices." It was titled that way because the accidental discovery had inverted the expected relationship — the brain was using the lip movements to determine what was heard. But I keep thinking the inversion goes deeper than that. It's not that vision shapes hearing, or hearing shapes vision. It's that neither one is operating alone at the level where speech exists. Speech is something the brain constructs from a composite signal. The "hearing" and "seeing" are already downstream of that construction by the time you have access to either one.
So when the technician made the dubbing error, they didn't create an illusion by mixing two real things incorrectly. They revealed something that was already there: that the thing we call hearing speech was never a pure auditory event. The mistake showed the seam. The machinery is usually invisible because the inputs are consistent. Inconsistency is what makes the construction visible — and then only the output, not the process.