My phrasing was indeed not clear.
I was referring to both recording techniques reproduced with headphones (so the listening room is out of the equation) and with playback DSP (a head-related transfer function that transforms multichannel content into an idiosyncratic two-channel output; granted, the measured playback room reference has its influence here, but the DSP is able to deal with that).
How are two ears able to sense sound sources in a 3D field? Suppose a single sound source (like a bird) sits somewhere on an imaginary sphere around the listener. Roughly: a) interaural delays explain horizontal displacement (azimuth cues); b) tonal modulation from the head, torso and outer ear explains vertical displacement (elevation cues*); and c) reverberation explains source distance (whether that single source sits on a nearer or farther imaginary sphere, in other words, a different radius).
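To make cue (a) concrete, here is a minimal Python sketch of the interaural time difference using the classic Woodworth spherical-head approximation. The 8.75 cm head radius is an assumed average, not a measured value; real heads vary, which is part of why these cues end up idiosyncratic.

```python
import numpy as np

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Approximate interaural time difference (ITD) for a distant source:
    ITD = (a / c) * (theta + sin(theta)), the Woodworth spherical-head model.
    head_radius_m ~ 8.75 cm is an assumed average; individual heads differ."""
    theta = np.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + np.sin(theta))

# A source 30 degrees off-center arrives roughly 0.26 ms earlier at the near ear.
print(round(itd_woodworth(30.0) * 1e3, 3), "ms")
```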
Is it possible to fix all those cues in mass-distributed audio content? I believe it is not. We have several problems: at least two of those cues are very idiosyncratic and one is very room-dependent. So an XY microphone pattern, or even the Neumann KU-100, is not an ideal solution.
With 2-, 5.1- or 7.1-channel content you are able to reconstruct horizontal displacement (azimuth cues) through crosstalk in your listening room. As you pointed out, source distance is more problematic, given that your listening room imprints its own reverberation on top. You may add ambience to the recording with a Neumann KU-100, but this will not translate into precise elevation cues, which are, I believe, very idiosyncratic.
So the Realiser comes into the playback chain of regular 2-, 5.1- or 7.1-channel content. You are able to capture your idiosyncratic azimuth and elevation cues and the reverberation of your ideal listening room. Then a function transforms your multichannel audio stream into a two-channel headphone output. What do you have here? You will hear a very convincing out-of-the-head circle on the horizontal plane. Do you have a sphere? Do you have a 3D sound field? Nope.
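For the curious, the core of that transform is essentially convolution: each speaker feed is convolved with the left-ear and right-ear impulse responses measured for that speaker position, and the results are summed. A rough Python sketch of the idea, not the Realiser's actual algorithm (function and variable names here are hypothetical):

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(channels, brirs):
    """Fold a multichannel stream down to two headphone channels.
    channels: dict name -> mono signal (1-D array)
    brirs:    dict name -> (left_ir, right_ir), binaural room impulse
              responses measured for that speaker position.
    Illustrates the general principle only."""
    n = max(len(sig) + len(brirs[name][0]) - 1
            for name, sig in channels.items())
    out = np.zeros((2, n))
    for name, sig in channels.items():
        left_ir, right_ir = brirs[name]
        l = fftconvolve(sig, left_ir)   # what this speaker does at your left ear
        r = fftconvolve(sig, right_ir)  # ...and at your right ear
        out[0, :len(l)] += l
        out[1, :len(r)] += r
    return out
```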
Then you take such a function and add variables that let you place the virtual speakers (a fixed base comes from your recorded content and feeds your HRTF computation). Believe it or not, the Realiser does that: it allows the user to change the azimuth and elevation of each virtual speaker (see the Realiser A8 manual, page 27). The reverberation of the playback room and even the speakers' proximity(!) can also be altered (see the Realiser A8 manual, pages 55 and 56). If the recorded content has two layers (the NHK example) or a 3D omnidirectional pattern (SPS200), voilà, now you are able to place your virtual speakers in the right virtual spots and reproduce a 3D sound field.
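Conceptually, moving a virtual speaker just means swapping in the ear responses for the new direction. A toy nearest-neighbour lookup over a measured grid gives the flavor (hypothetical names; a real renderer would interpolate between neighbouring measurements rather than snap to the closest one):

```python
import numpy as np

def nearest_hrtf(grid, target_az, target_el):
    """Pick the measured HRTF pair closest to a requested virtual-speaker
    direction. grid: list of (azimuth_deg, elevation_deg, (left_ir, right_ir))."""
    def angular_distance(az, el):
        # great-circle angle between the stored and requested directions
        a1, e1 = np.radians(az), np.radians(el)
        a2, e2 = np.radians(target_az), np.radians(target_el)
        return np.arccos(np.clip(
            np.sin(e1) * np.sin(e2) +
            np.cos(e1) * np.cos(e2) * np.cos(a1 - a2), -1.0, 1.0))
    return min(grid, key=lambda p: angular_distance(p[0], p[1]))[2]
```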
But why does NHK need 22.2 channels instead of 4? When the target for your audio content is not only people with a DSP and headphone playback but also a movie-theater audience, that many original tracks may be useful to reduce the influence of the sweet spot in the latter. In home theaters, fewer channels are needed. NHK mentions that:
I was trying to say that while playing back a regular two-channel recording over speakers may add an artificial soundstage, playing back a 3D sound field, which I consider a faithful playback method, also relies on some kind of crosstalk (interaural cues). Two channels via speakers is an artificial reconstruction of reality, but it is acceptable at the current state of the art.
They are all very interesting technologies.
Gosh, we should start a new thread for this subject. Forgive me.
*Azimuth and elevation cues are at their worst directly in front (0° azimuth). Unconsciously we slightly turn our head to refresh the cues and localize sources at such spots. That's why some sort of gyroscope on the listener's head might be useful. The Realiser has head tracking... Outstanding.
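The head-tracking idea is simple to state: when the head turns, rotate the virtual sources the opposite way in head coordinates so they stay fixed in the room, then redo the HRTF lookup. A one-line hypothetical sketch (yaw only; not the Realiser's implementation):

```python
def world_to_head_azimuth(source_az_deg, head_yaw_deg):
    """If the head turns right by head_yaw_deg, a source fixed in the room
    moves left by the same amount in head coordinates. Result wrapped to
    (-180, 180] degrees. Hypothetical helper for illustration."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
```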