The playback chain is very specific.
1. Dummy head and torso binaural microphone (stationary);
2. Dipole speakers separated by 10 degrees or a beam forming line array;
3. A DSP to cancel crosstalk or the mentioned beam forming array that inherently avoids crosstalk.
4. A more or less dead room or speakers with high directivity.
Well, I've actually tried some of this: the dummy head, the closely spaced speakers, and crosstalk comp. The effect can be startling, but there are some significant problems. While you may cancel at least some crosstalk, you must also translate the recorded HRTF to the listener's HRTF based on the brand new angle of incidence (from the speakers) that you haven't accounted for anywhere in the system. That's very hard to do. What you will have is big, spacious, and dimensional, but not a solid palpable image. And any movement of the listener's head, even very slight, severely alters the presentation. You literally need to clamp the head. Accurate and effective acoustic crosstalk cancellation is highly location specific. Many directional cues involve frequencies where the acoustic wavelength is very short, and cancellation depends on highly precise phase relationships. Just moving 20 degrees of phase away from perfect cancellation reduces the null to 30dB. 20 degrees at 1kHz is 0.75" of path length. And it's obviously less as frequency goes up. You need a head clamp.
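To put rough numbers on the phase sensitivity, here is a minimal Python sketch. It assumes an idealized case that is not in the post: two equal-level signals that would cancel perfectly except for a pure phase error, and a speed of sound of about 343 m/s. The function names are made up for illustration. In this simplified model a 20 degree error leaves the residual only about 9 dB below the uncancelled signal, so the exact null depth depends on how it is defined and measured; the point about tiny path differences holds either way.

```python
import math

# Assumption: about 343 m/s, expressed in inches per second.
SPEED_OF_SOUND_IN_PER_S = 13504.0


def null_depth_db(phase_error_deg):
    """Residual level in dB (relative to one signal alone) when two
    equal-level signals cancel with the given phase error."""
    theta = math.radians(phase_error_deg)
    residual = abs(2.0 * math.sin(theta / 2.0))  # equals |1 - e^(j*theta)|
    if residual == 0.0:
        return float("-inf")  # perfect cancellation
    return 20.0 * math.log10(residual)


def path_error_inches(phase_error_deg, freq_hz):
    """Path-length difference that produces the given phase error."""
    wavelength = SPEED_OF_SOUND_IN_PER_S / freq_hz
    return wavelength * phase_error_deg / 360.0


if __name__ == "__main__":
    for freq in (1000.0, 4000.0, 8000.0):
        inches = path_error_inches(20.0, freq)
        print(f"{freq:6.0f} Hz: 20 degrees = {inches:.3f} in")
    print(f"idealized null depth at 20 degrees: {null_depth_db(20.0):.1f} dB")
```

The allowed movement shrinks in proportion to frequency, which is why the cues carried by short wavelengths are the first to fall apart.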
If the listener looks straight ahead, even if his/her ILD and ITD are not exactly those of that dummy head, such a playback chain may still produce a front horizontal image detached from the plane between the speakers.
Yes, it does. In fact, the image will include an area beyond the speakers in several directions, and even ambience from behind will be sometimes perceivable. But if you want to specifically place a sound source in a location and hold it there, this doesn't work.
The more the ILD and ITD mismatch between the dummy head and the listener's HRTFs, the more the sources are misplaced, but not necessarily collapsed into the horizontal plane between the speakers.
Yes, this is also true, but it will never "collapse" into an area between the speakers (assuming the listener's head doesn't move). Everything will image larger than the speakers (especially at 10 degrees apart!!!) even with a very poor HRTF match. In fact, all non-binaural stereo material will image outside the speakers, just with some relatively simple crosstalk comp. I did it first in 1980 with a tiny handful of opamps in a small box with one "Image" control that varied the effect. The effect is pleasing, but very ambiguous. You can't ever actually reach out and "touch" a source. Everything ends up huge and blurry. But even stereo will present a larger, more dimensional image with crosstalk comp.
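For anyone wondering what "relatively simple crosstalk comp" can look like, here is a minimal digital sketch in Python. It is not the 1980 opamp box; it is one common textbook approach, where each channel gets an attenuated, delayed, inverted copy of the opposite channel's output. The `image` parameter is only named after the control mentioned above, and the `delay` value is a hypothetical stand-in for the extra path to the far ear. A real canceller would use HRTF-shaped filters rather than a plain gain and delay.

```python
def crosstalk_comp(left, right, image=0.5, delay=3):
    """Recursive crosstalk compensation sketch.

    left, right: lists of samples of equal length.
    image: feedback gain, 0.0 (off) up to but not including 1.0.
    delay: whole samples, hypothetical extra travel time to the far ear.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    if not 0.0 <= image < 1.0:
        raise ValueError("image must be in [0, 1) to stay stable")
    if delay < 1:
        raise ValueError("delay must be at least one sample")

    out_l = [0.0] * len(left)
    out_r = [0.0] * len(right)
    for i in range(len(left)):
        # Delayed outputs of the opposite channel (zero before the start).
        past_r = out_r[i - delay] if i >= delay else 0.0
        past_l = out_l[i - delay] if i >= delay else 0.0
        out_l[i] = left[i] - image * past_r
        out_r[i] = right[i] - image * past_l
    return out_l, out_r


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    silence = [0.0] * len(impulse)
    print(crosstalk_comp(impulse, silence, image=0.5, delay=2))
```

A single click in the left channel produces a decaying train of alternating-polarity clicks that bounces between the two outputs: each one is meant to cancel the leakage of the previous one at the opposite ear.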
If the spectral cues imprinted by the dummy head HRTF are close enough to the ones the listener's own HRTF adds, and one keeps the reflections from the recording venue (which include the dummy torso reflections) and avoids the reflections from the playback room, one could also have a believable elevation image.
My experiments showed there were height cues, and sometimes they were believable, but they were never accurate. And frankly, the HRTF match is never "good enough" unless by random chance your HRTF happens to be exactly that of the recording head. But those chances are very small. I found that embedding tiny mics in my own ears resulted in a good HRTF match since I used my own head and torso, but the problem was I then had no reference, since shoving mics in my ears required them to be glued to ear plugs, and I could never hear the original. Again, a mismatch in recording/listening HRTF doesn't always eliminate height (or any direction), but it won't be correct either.
But again the recorded sound sources, although not collapsed into a horizontal plane, might have their elevation misplaced compared to the real elevation they had while they were being recorded.
Pretty much happens all the time, especially with the speaker setup. You can do a bit better with headphones.
Please do not ask me what is close enough... And yes I know there are a lot of ifs...
I can tell you, you can't ever get close enough for enough people. The average is the best you can do, and that average results in a reasonable binaural presentation. Just don't expect accuracy.
Since you do not know the real positions of the sources at the moment they were recorded, the perceived front image may be more believable than standard stereo that does not avoid crosstalk.
It's not a question of "knowing" the real source positions. Playback on speakers has many, many variables. You've tried to control some of them in your description of the theoretical system. You haven't covered all of them, and when extending the model beyond a single, idealized, theoretical system configuration to, well, reality, it all falls apart. You cannot expect every listener to have an acoustically treated room, a head clamp (or the desire to use one), and so his head won't be locked in the calibrated sweet spot for the crosstalk and HRTF compensation to work. Beamforming speakers won't help you there either, as that technology is also very location specific. Heck, even the general frequency response of the speakers has an impact.
But none of this actually means much in practice since every recording is really a means of suspension of disbelief rather than a replication of an event. It doesn't take much to get "pleasing", it takes much more to get "accurate", if it's even possible.
Again, the microphone is stationary. If and only if you avoid crosstalk at playback, turning your head to one side would cause an acoustic shadow in the same side pinna.
In that situation the spectral cues from the generic HRTF get attenuated at the left pinna and highly distorted at the right pinna, so elevation collapses. This was the answer I understood @spruce music gave to my question.
Not only elevation collapses, everything about the stereo image collapses. You cannot allow head turning!
Add a PRIR measurement and the DSP filter can deal with the head turning and perhaps relax the dead room condition.
No, not in a room with two speakers at 10 degrees. The PRIR the Realiser develops is translated to headphones. If you turn your head wearing headphones, the processor can track your head turn and counter-rotate the image so it stays put. It works because the relationship of the transducers to your ears is still fixed. You're not going to do that with two speakers in front of you.
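The headphone head-tracking idea can be sketched in a few lines of Python. This is a generic illustration, not the Realiser's actual algorithm: the function names, the sign convention (positive angles to one side, in degrees) and the set of measured angles are all hypothetical. The processor subtracts the tracked head yaw from the virtual source's room angle and then renders with whichever measured response is closest to that head-relative angle.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def source_azimuth_re_head(virtual_source_deg, head_yaw_deg):
    """Angle of a room-fixed virtual source as seen from the turned head."""
    return wrap_deg(virtual_source_deg - head_yaw_deg)


def nearest_measured(angle_deg, measured_deg):
    """Pick the measured angle closest to the requested one (hypothetical
    stand-in for choosing or interpolating between stored responses)."""
    return min(measured_deg, key=lambda m: abs(wrap_deg(angle_deg - m)))


if __name__ == "__main__":
    measured = [-30.0, 0.0, 30.0]  # hypothetical set of measured angles
    for yaw in (0.0, 10.0, 25.0):
        rel = source_azimuth_re_head(30.0, yaw)
        print(yaw, rel, nearest_measured(rel, measured))
```

This only works because the headphones move with the head, so the path from each driver to its ear never changes. With two fixed speakers the head turn also changes the speaker-to-ear paths, which is the part the sketch has no way to fix.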
That is what I imagine the BACCH filter does, otherwise Dr. Choueiri would not claim the results he is describing.
I'm skeptical that he's actually saying that, and that it could be done with speakers. Pretty tall order. Link to the paper perhaps?
I didn't mean analog electronic circuit, but that kind of acoustic second HRTF filtering that crosstalk cancellation or beam forming allows.
Care to tell us how you plan to do any sort of HRTF, crosstalk cancellation or beamforming without a DSP?
That's enough of binaural through loudspeakers.
So you could ask: if all personal HRTFs are so "close enough" to a generic dummy head HRTF, why does binaural playback with headphones not work for all listeners, and why in those cases does everything collapse inside or at the back of the listener's head?
If Dr. Choueiri's and Dr. Smyth's claims are both correct, there must be at least three reasons.
The first is that such "acoustic second HRTF filtering" may not be cumulative, but rather a kind of acoustic translation, and it may partially cause some of the effects a real PRIR measurement causes in the Realiser.
The second is that the Realiser head tracking avoids a moving sound field by heavily filtering ILD and ITD, while in the BACCH filter the customization may be used mainly to increase crosstalk cancellation efficiency and to tackle the acoustic distortion of the spectral cues in the pinna that faces the speakers as the listener turns his/her head, and in the other pinna, which is in the head's acoustic shadow.
The third is that the HPEQ may heavily counteract the strong filtering effect the headphones themselves imprint on the playback chain.
Again, this is the only way I found, as a layman, to explain how both products' claims can be true.
I think we are all saying the same thing in different ways. The whole idea has problems, some can be partially remedied, others cannot.
I think you are describing the whole 360 degrees azimuth and elevation HRTF in all its complexity, but as far as I understood, spruce music had those specific conditions in mind when answering my question.
When you think of music you also must include the acoustic space around it. It's always there, even if artificial. For the reproduction to be real and accurate you must include a 360 degree sphere.