Localization...the holy grail of recording (and not just for binaural). However, people seem to have higher expectations for binaural (as compared to conventional two-microphone stereo) in terms of its realism, and probably with good reason.
Binaural often gets a bad rap because its detractors say that front-to-back confusions destroy the realism. Others say that for it to be as realistic as possible, head tracking must be employed. However, there is another component to all of this that is often (and logically) glossed over by audio folks like us, namely, the importance of the visual cortex and the role that it plays in localization. By that I mean that when you consider the physiology of hearing and how we localize, the eyes and the ears are in constant communication. Take this simple example:
When you hear a sound - regardless of the direction of origin - and you are not sure where it's coming from, the first response your brain suggests is to look around you. If the source is identified (by looking around) then your brain resolves the ambiguity of where the sound is located - suddenly, everything in the sensory puzzle fits - there is no cognitive dissonance. On the other hand, if you cannot locate the source (in the dark, say, or because it is obscured from view by something else) then you instinctively turn your head. This simple act changes the arrival time difference and the intensity difference between your two ears, both of which depend on the shape of your head and ears, the angle / elevation of your head, and so forth, and that change gives your brain a fresh cue to work with.
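To put rough numbers on the head-turn idea, here is a minimal sketch using Woodworth's spherical-head approximation for the arrival time difference between the ears. This is a textbook simplification, not a model of any particular mannequin or listener: the head radius (8.75 cm), the speed of sound (343 m/s), and the function name `itd_seconds` are all assumptions chosen for illustration, and the intensity difference is ignored entirely.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed average head radius, not a measured value
SPEED_OF_SOUND_M_S = 343.0    # assumed speed of sound in air at about 20 C


def itd_seconds(source_azimuth_deg, head_yaw_deg=0.0):
    """Approximate interaural time difference (Woodworth spherical-head model).

    Azimuth convention (an assumption of this sketch): 0 = straight ahead,
    90 = directly right, 180 = directly behind. Positive result means the
    sound reaches the right ear first.
    """
    # Angle of the source relative to where the nose is pointing.
    relative = math.radians(source_azimuth_deg - head_yaw_deg)
    # Fold front and back onto the same lateral angle: in this model a source
    # in front and its mirror image behind produce the same time difference.
    lateral = math.asin(math.sin(relative))
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (lateral + math.sin(lateral))


if __name__ == "__main__":
    for label, azimuth in (("front", 0.0), ("back", 180.0)):
        still = itd_seconds(azimuth) * 1e6
        turned = itd_seconds(azimuth, head_yaw_deg=20.0) * 1e6
        print(f"{label}: head still {still:+.0f} us, head turned 20 deg right {turned:+.0f} us")
```

With the head still, a source dead ahead and one directly behind both give essentially zero time difference, which is the front-to-back confusion. After a 20 degree turn to the right, the front source leads at the left ear and the rear source leads at the right ear, so the sign of the change tells the two apart.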
As a tangential example of the inter-dependence of the visual and auditory cortexes, watch this quick video (McGurk Effect):
Interesting, eh? If nothing else this is a pretty good demo of the interdependence of the visual and auditory cortexes.
Back to localization...so, when most people hear binaural recordings, they are audio-only recordings, absent the visual cues that would be present in a live performance. However, when you provide synchronous video coupled with the binaural audio, taken from the same position as the mannequin head microphone, the visual component helps to 'anchor' the placement of sound coming from directly in front of you.
Exercise #1: Here's a simple example (and yes, you need to use headphones for this). What I would like you to do is cue up this link and advance the time to around 1:50. After that, press PLAY, keep your eyes closed for a while (maybe a minute or so), and then open your eyes. Here is the link:
As you watch the video, alternately keep your eyes open and closed.
What's interesting here is that the binaural was acquired at the center of the stage; the mannequin was above the crowd, facing the band. The audio that you hear is just the binaural signal (no board feeds). The main PA (directly in front of the mannequin) contributes an amplified and time-shifted component (think of the path time from each instrument to the mannequin mic - it is longer than the path from the PA to the mannequin head mic), and yet you can still localize the instruments quite well - not perfectly, but still quite well. Notice how when the camera angle is direct (i.e. facing the musicians) the audio seems to make more 'sense' to your brain, because the visual perspective being presented to you is closely aligned with the auditory stimulus (the binaural component).
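For a feel of how large that time shift can be, here is a back-of-the-envelope sketch. The distances are made-up placeholders, not measurements from the Sumkali show, and the function names and the `pa_latency_ms` parameter are hypothetical. The sketch treats the PA signal as leaving the speaker at the same instant the instrument sounds, unless you lump the stage mic distance and any console latency into `pa_latency_ms`.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air at about 20 C


def arrival_ms(path_m, extra_latency_ms=0.0):
    """Travel time in milliseconds over a path, plus any fixed latency."""
    return path_m / SPEED_OF_SOUND_M_S * 1000.0 + extra_latency_ms


def pa_lead_ms(instrument_to_head_m, pa_to_head_m, pa_latency_ms=0.0):
    """How far ahead of the direct acoustic sound the PA copy arrives.

    Positive means the PA copy reaches the mannequin head first.
    """
    direct = arrival_ms(instrument_to_head_m)
    amplified = arrival_ms(pa_to_head_m, pa_latency_ms)
    return direct - amplified


if __name__ == "__main__":
    # Hypothetical geometry: instrument 6 m from the head, PA speaker 4 m away.
    print(f"PA leads the direct sound by {pa_lead_ms(6.0, 4.0):.2f} ms")
```

Even a couple of meters of path difference works out to several milliseconds, which is large compared with the sub-millisecond arrival time differences between the two ears.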
Now here's another interesting element of this video. I made the point about some sound coming through the main PA as opposed to the natural, acoustic radiation of the instrument. Go back to the Sumkali video, and watch - and listen - VERY carefully right around the 3:50 mark. Watch the singer at far right. As he sings (roughly at the 3:52 mark), he tilts his head upward and pulls back from the microphone. When he does this, two things happen. First, the amplified portion of his voice is momentarily attenuated (by moving away from the mic). Second, by tilting his head upward, he is effectively singing directly towards the right ear of the mannequin mic; in essence, this is his acoustical radiation component (un-amplified). As you listen and watch, you will hear the apparent location of his voice momentarily move from the center, more towards the right. This happens a couple more times in the video, and once you have seen / noticed this at the 3:50 mark, it becomes easier to spot in the rest of the video (I think this also happens to varying degrees when he is singing around the 1:25 mark).
If the Sumkali show had been purely acoustic (no PA whatsoever) with the musicians situated in the same arrangement (and the mannequin head located in the same place), then there would be even more natural localization of the musicians and their instruments. However, it's the PA itself that causes much of the smearing of the image, again due to the path differences between the acoustically radiated sound of the instruments and the sound of the instruments that came through the PA.
What's interesting (to me anyway) is that when I just listen to the recording, the image is stable. However, when I watch the video, the image seems most realistic when the camera angle is the direct one (pointed at the stage), and when I see the video from the roving camera, then I actually experience a bit of cognitive dissonance - because the camera angle and perspective are not consistent with the auditory element.
It's an interesting subject...and it would be very interesting indeed to compare how well people localize recordings made at essentially the same location using (for example) ORTF, Blumlein, binaural or other formats, by asking listeners to rate them in terms of accuracy of localization - first based on the audio alone, and then with accompanying video. I'm really curious to see how our perceived localization is influenced by the two-channel technique of choice unto itself, as compared to being presented with synchronous and representative visual stimuli. Would realism be perceived as 'better' for all formats? That is, would people say the localization for all types of two-channel sound is better when they are watching the video as compared to when they only hear the audio, but don't see the video? I wonder...and I wonder if any one method would be perceived as having the highest (statistically significant) realism as compared to the other formats.