How do we hear height in a recording with earphones

Discussion in 'Sound Science' started by vidal, Mar 20, 2017.
  1. pinnahertz
    The variations are enough to make one HRTF incompatible with another's HRTF to the extent that true 3D imaging is not believable.
    I believe that could be found in literature. And I don't think that peaks/dips necessarily stay at the same frequency.
    It varies with all the other variables that are out of control in that scenario, results from completely to not as much.
    Elevation perception is one of the more fragile things to translate. It's worse than just an error, it can completely collapse.
     
    Look, this gets at why binaural, and especially binaural on speakers, has never really worked universally, which is part of why it has not been generally accepted (a small part, but nonetheless). Capture has to be done with a fixed generic HRTF, and then it's played into specific and variable HRTFs. DSP translation between HRTFs is non-trivial. There are many vectors that image well using a generic HRTF, like behind and slightly above, but frontal elevation, particularly that of a dead-center source, is really difficult. That's largely because humans don't perceive frontal elevation well to begin with without visual support.
     
  2. gregorio
     
    While I can't provide proof to absolutely rule it out, I'm highly sceptical. The whole point of a coincident pair is that the mics are so close together that cancellations and other potential phase issues are minimised/almost eliminated. Additionally, the "resolution" of our aural ability to identify vertical position is many times lower than our ability to identify horizontal position, I seem to remember about half a degree accuracy on the horizontal plane and only 7 degree accuracy in the vertical plane. Combining these two facts, I'd need some very compelling evidence to convince me there's anything even close to useful/audible height info from the output of a coincident pair.
     
    G
     
  3. castleofargh Contributor
    intuitively I agree. a lot of the panning detection is done using the variations between left and right, so it makes sense that recording with 2 mics would provide a lot of similar cues. but if the vertical axis is guessed mainly thanks to the shape of the outer ear, I wonder what recording technique could help with that aside from using the actual ear's shape (or a good simulation of it)?
     
  4. spruce music

    I agree with you, but other than Calrec or similar Tetra mics, coincident pairs are not coincident vertically.  Due to microphone size and/or sloppy technique, it isn't uncommon to see coincident pairs that have somewhere from 1 to 2 inches between the diaphragms in the up and down sense.  Any comb filtering for off-axis sounds at that spacing drops you right into the 6-11 kHz range.  It wouldn't be accurate nor extremely strong, but it makes it possible that some height perception would get triggered that way.
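As a rough check on the spacing argument above, here is a minimal sketch of where the first comb-filter notch would fall. It assumes the two capsule signals are summed at equal level, a plane wave arriving from a given elevation, and a speed of sound of 343 m/s; the function name and constants are illustrative, not taken from the thread.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value, dry air near 20 degrees C
INCH_M = 0.0254             # metres per inch


def first_comb_notch_hz(spacing_in, elevation_deg):
    """First cancellation frequency for two vertically offset capsules.

    The path difference is spacing * sin(elevation). The first notch
    falls where that difference equals half a wavelength.
    """
    path_diff_m = spacing_in * INCH_M * math.sin(math.radians(elevation_deg))
    if path_diff_m <= 0:
        return math.inf  # no path difference, so no notch
    return SPEED_OF_SOUND_M_S / (2.0 * path_diff_m)


if __name__ == "__main__":
    for elevation in (40, 60, 90):
        print(elevation, round(first_comb_notch_hz(1.0, elevation)))
```

Under these assumptions a 1 inch vertical offset puts the first notch between roughly 6.7 kHz (source directly above) and 10.5 kHz (source 40 degrees up), which is consistent with the range quoted above.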
     
  5. jgazal

    But what about ILD and ITD mismatch and the stability of a horizontal image?

    Does that HRTF translation need to rely on DSP?

    The microphone is stationary. If and only if you avoid crosstalk at playback, turning your head to one side would cause an acoustic shadow in the same side pinna.

    Anyway, my gut feeling is that you are right: if the spectral cues differ too much, perception of elevation could even collapse.
     
  6. spruce music

    You mostly only have height perception straight ahead or close to it.  Not much at all other than vaguely to the sides and rear.  You also won't get much if there are no higher frequencies in the source of the sound. 
     
  7. jgazal

    Thank you very much.

    It is so difficult to think about the results without having heard any crosstalk cancellation (XTC), beamforming line arrays or working headphone externalization units.

    With your answer in mind I can now speculate that if someone listens to a customized bacch-sp filter with head tracking and the elevation image does not collapse, then crosstalk cancellation is just half of the story, and Dr. Choueiri may use the PRIR not only to maximize the crosstalk cancellation but also to further improve the stability of the 3D image when the listener turns his/her head, by modelling the spectral cues.

    Now with a Realiser set to avoid crosstalk, I still can't figure out the best measuring arrangement to simulate with headphones the playback performance of the bacch-sp loudspeakers: dipole speakers separated by 10 degrees (best arrangement to match the straight-ahead dummy HRTF and the listener's HRTF) or the regular stereo triangle (best measurement points to feed the constants/weights of the HRTF interpolation function for head tracking)? I don't know...

    As far as I understood, I guess your answer also sheds light on the performance limits of non-customized beamforming line arrays that naturally avoid crosstalk.

    So now I can agree with pinnahertz that those new devices without personalisation might not have good enough performance to really proliferate.

    Sorry to disturb everybody, especially pinnahertz and gregorio. It takes some time for me to understand these things. Now I can go back to silent mode.
     
  8. pinnahertz
    Not sure what you mean by "stability", but yes the others define the issue.
    It's highly unlikely you could whip up an HRTF translator as an analog circuit. It would be very complex, and very difficult to design. I played with things like that a few decades ago, tried my hand at acoustic crosstalk cancellers, it's almost impossible to nail it with a mountain of analog filters.
    I'm sorry, you've lost me. What is the situation here?
    I do hope you understand the complexity of the problem. It's way more than spectral cues. Way, way more.
     
  9. pinnahertz
    Actually, any point where ITD and ILD result in identical signals at both ears results in an inability to accurately localize.  So directly straight ahead and behind are ambiguous, with reduced directional acuity in the vertical plane as well.  Just slightly off that axis and directional acuity picks up remarkably in all directions.  Yes, we're better at localizing in the forward direction, and localizing at 90 degrees, directly to the sides, also results in a "cone of confusion".  But we do have the ability to resolve height in many directions; dead center isn't one of them.
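The ambiguity described above can be illustrated with a deliberately crude ITD model. This is only a sketch: it treats the ears as two points in free space (no head diffraction, no ILD, no pinna cues), and the ear spacing of 0.18 m and the angle convention (azimuth 0 = straight ahead, 90 = directly to the right) are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed
EAR_SPACING_M = 0.18        # hypothetical ear-to-ear distance


def itd_seconds(azimuth_deg, elevation_deg):
    """Interaural time difference for a distant source, two-point ear model.

    Only the component of the source direction along the interaural
    axis contributes, so many different directions share one ITD.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    lateral = math.sin(az) * math.cos(el)
    return EAR_SPACING_M * lateral / SPEED_OF_SOUND_M_S


if __name__ == "__main__":
    # Front-right and back-right sources give the same ITD.
    print(itd_seconds(30, 0), itd_seconds(150, 0))
    # Anything on the median plane gives zero ITD, whatever the elevation.
    print(itd_seconds(0, 0), itd_seconds(0, 60))
```

In this model every direction on the median plane produces zero ITD, so height there has to come from other cues, and mirrored front/back directions are indistinguishable by ITD alone.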
     
    Localizing a height vector doesn't just depend on high frequencies.  In fact it works best when we have reflections to process.
     
  10. gregorio
     
    Admittedly, it's more correctly called a "near-coincident pair", as it's obviously impossible for two different mic capsules to occupy the exact same position. On the other hand, one of the main, if not the main, considerations with different stereo mic'ing techniques is phase issues between the mic's and this is why a near coincident pair is so popular, the capsules are placed so close together that differences in phase are minimised and any resultant problems are well above the critical band. Typically for a near-coincident pair we'd use small diameter condensers with only a fraction of an inch between them, 2" would be sloppy indeed and with the exception of the occasional newbie/student, I can't remember ever seeing anything so sloppy professionally. Additionally, any info in the signal arriving at the mics due to height (ceiling or floor reflections for example) is going to be off-axis and therefore less accurately represented (attenuated and/or less flat). Lastly, I've experienced this height perception on popular genre recordings, where we're dealing most commonly with close mic'ed mono sources and even with acoustic genres, I don't know if there are any commercial recordings which employ only a near-coincident pair and of course, once we add other mic sources to the mix we're changing the phase relationships within the mix. I think you're clutching at straws here.
     
    G
     
  11. spruce music

    Well everything you wrote contradicts research by universities into the matter.
     
    Height information is provided by the shape of our ears. If a sound of fairly high frequency arrives from the front, a small amount of energy is reflected from the back edge of the ear lobe. This reflection is out of phase for one specific frequency, so a notch is produced in the spectrum. The elongated shape of the lobe causes the notch frequency to vary with the vertical angle of incidence, and we can interpret that effect as height. Height detection is not good for sounds originating to the side or back, or lacking high frequency content.
    Peter Elsea 1996
    http://artsites.ucsc.edu/ems/music/tech_background/te-03/teces_03.html
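The notch mechanism in the quoted passage can be sketched as a direct sound plus one delayed, attenuated copy of itself. This is a simplified illustration, not a pinna model: the delay, the reflection level of 0.5 and the function names are all assumptions.

```python
import cmath
import math


def reflection_gain_db(freq_hz, delay_s, reflection_level=0.5):
    """Level in dB of a direct sound plus one delayed copy of itself.

    reflection_level must be below 1.0, otherwise the notch is
    infinitely deep and the logarithm is undefined.
    """
    h = 1.0 + reflection_level * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return 20.0 * math.log10(abs(h))


def notch_frequencies_hz(delay_s, max_hz=20000.0):
    """Frequencies up to max_hz where the copy arrives exactly out of phase."""
    notches = []
    k = 0
    while (2 * k + 1) / (2.0 * delay_s) <= max_hz:
        notches.append((2 * k + 1) / (2.0 * delay_s))
        k += 1
    return notches


if __name__ == "__main__":
    delay = 62.5e-6  # hypothetical extra travel time of the reflection
    print(notch_frequencies_hz(delay))
    print(reflection_gain_db(8000.0, delay))
```

A shorter delay moves the notch up in frequency, which is the sense in which a reflection path that changes with the vertical angle of incidence would shift the notch.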
     
    From William Yost, from his text book, Directional Hearing:
     
    Both man and monkey require high frequencies for localization in the vertical plane.
     
    From Brian C.J. Moore, Basic Aspects of Hearing:
     
    Spatial stream segregation vertically demonstrated high acuity in conditions during which interaural cues were negligible.  (which is straight ahead)
     
    Numerous articles on hearing aids and people with damaged hearing document that if you have significantly depressed hearing at 8 kHz and above you lose the ability to locate sounds vertically.
     
  12. spruce music

    No, not really. I have already acknowledged and emphasized the rarity of such recordings.  Nor am I claiming this effect is common in recordings. There are plenty of mics that have baskets around the capsule that prevent spacing as close as only a fraction of an inch.  Yet I have seen them used this way. I personally don't think it important or common.  It can, however, occur. Chesky is one example of commercial recordings that would have the chance to represent this effect. Early on they used AKG 24 stereo mics, which appear to have at least an inch between the capsules.   Again, I am not saying it is common or even important.  Just that to dismiss it as impossible is not quite the truth.
     
  13. pinnahertz
    Perhaps you misinterpret what I'm saying.  Yes, HF is required, but it's not the whole story.  Not much to disagree with in the above, but there is more to it. 
     
  14. jgazal


    The playback chain is very specific.

    1. Dummy head and torso binaural microphone (stationary);

    2. Dipole speakers separated by 10 degrees or a beam forming line array;

    3. A DSP to cancel crosstalk, or the mentioned beam forming array that inherently avoids crosstalk.

    4. A more or less dead room, or speakers with high directivity.

    If the listener looks straight ahead, even if his/her ILD and ITD are not exactly those of that dummy head, such a playback chain may still produce a front horizontal image detached from the plane between the speakers. The greater the ILD and ITD mismatch between the dummy head's and the listener's HRTFs, the more the sources are misplaced, but they are not necessarily collapsed into the horizontal plane between the speakers.

    If the spectral cues imprinted by the dummy head HRTF are close enough to the ones the listener's HRTFs add, and one keeps the reflections from the recording venue (which include the dummy torso reflections) and avoids the reflections from the playback room, one could also have a believable elevation image.

    But again the recorded sound sources, although not collapsed into a horizontal plane, might have their elevation misplaced compared to the real elevation they had while they were being recorded.

    Please do not ask me what is close enough... And yes I know there are a lot of ifs...

    Since you do not know the real positions of the sources at the moment they were recorded, the perceived front image may be more believable than that of standard stereo, which does not avoid crosstalk.

    Again, the microphone is stationary. If and only if you avoid crosstalk at playback, turning your head to one side would cause an acoustic shadow in the same side pinna.

    In that situation the spectral cues from the generic HRTF get attenuated at the left pinna and highly distorted at the right pinna, so elevation collapses. This was the answer I understood spruce music gave to my question.

    Add a PRIR measurement and the DSP filter can deal with the head turning and perhaps relax the dead room condition.

    That is what I am imagining the bacch filter does, otherwise Dr. Choueiri would not claim the results he is describing.


    I didn't mean analog electronic circuit, but that kind of acoustic second HRTF filtering that crosstalk cancellation or beam forming allows.

    That's enough of binaural through loudspeakers.

    So you could ask: if all personal HRTFs are so "close enough" to a generic dummy head HRTF, why does binaural playback with headphones not work for all listeners, and why in those cases does everything collapse inside or at the back of the listener's head?

    If Dr. Choueiri's and Dr. Smyth's claims are both correct, there must be at least three reasons.

    The first is that such "acoustic second HRTF filtering" may not be cumulative, but somehow an acoustic translation, and it may partially cause some of the effects a real PRIR measurement causes in the Realiser.

    The second is that the Realiser head tracking avoids a moving sound field by heavily filtering ILD and ITD, while in the bacch filter the customization may be used mainly to increase crosstalk cancellation efficiency and to tackle the acoustic distortion of the spectral cues, both in the pinna that faces the speakers as the listener turns his/her head and in the other pinna, which sits in the head's acoustic shadow.

    The third is that the HPEQ may heavily counteract the strong filtering effect the headphones themselves imprint on the playback chain.

    Again, this is the only way I found as a layman to explain how both products' claims can be true.


    I think you are describing the whole 360 degrees azimuth and elevation HRTF in all its complexity, but as far as I understood, spruce music had those specific conditions in mind when answering my question.
     
  15. pinnahertz
    Well, I've actually tried some of this. The dummy head, the closely spaced speakers, and crosstalk comp. The effect can be startling, but there are some significant problems. While you may cancel at least some crosstalk, you must also translate the recorded HRTF to the listener's HRTF based on the brand new angle of incidence (from the speakers) that you haven't accounted for anywhere in the system. That's very hard to do. What you will have is big, spacious, and dimensional, but not a solid palpable image. And any movement of the listener's head, even very slightly, severely alters the presentation. You literally need to clamp the head.  Accurate and effective acoustic crosstalk cancellation is highly location specific.  Many directional cues involve frequencies where the acoustic wavelength is very short, and cancellation depends on highly precise phase relationships.  Just moving 20 degrees of phase away from perfect cancellation reduces the null to 30dB.  20 degrees at 1kHz is 0.75".  And it's obviously less as frequency goes up.  You need a head clamp.
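The phase-to-distance conversion in the paragraph above can be checked with a short sketch. It assumes a speed of sound of 343 m/s and only converts a phase offset into a path length; it does not model the cancellation depth itself. The function name is illustrative.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed
INCH_M = 0.0254             # metres per inch


def phase_to_path_inches(phase_deg, freq_hz):
    """Path length, in inches, that corresponds to a phase offset at freq_hz."""
    wavelength_m = SPEED_OF_SOUND_M_S / freq_hz
    return (phase_deg / 360.0) * wavelength_m / INCH_M


if __name__ == "__main__":
    for freq in (1000.0, 5000.0, 10000.0):
        print(freq, round(phase_to_path_inches(20.0, freq), 3))
```

At 1 kHz a 20 degree offset comes out at about 0.75 inch, and the same 20 degrees shrinks in proportion as frequency rises (about 0.075 inch at 10 kHz), which is why small head movements matter so much at high frequencies.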
    Yes, it does. In fact, the image will include an area beyond the speakers in several directions, and even ambience from behind will be sometimes perceivable. But if you want to specifically place a sound source in a location and hold it there, this doesn't work.
    Yes, this is also true, but it will never "collapse" into an area between the speakers (assuming the listener's head doesn't move), everything will image larger than the speakers (especially at 10 degrees apart!!!) even with very poor HRTF mismatch. In fact, all non-binaural stereo material will image outside the speakers, just with some relatively simple crosstalk comp. I did it first in 1980 with a tiny handful of opamps in a small box with one "Image" control that varied the effect. The effect is pleasing, but very ambiguous. You can't ever actually reach out and "touch" a source. Everything ends up huge and blurry.  But even stereo will present a larger more dimensional image with crosstalk comp.
    My experiments showed there were height cues, and sometimes they were believable, but they were never accurate. And frankly, the HRTF match is never "good enough" unless by random chance your HRTF happens to be exactly that of the recording head. But those chances are very small. I found that embedding tiny mics in my own ears resulted in a good HRTF match since I used my own head and torso, but the problem was I then had no reference, since shoving mics in my ears required them to be glued to ear plugs, and I could never hear the original.  Again, a mismatch in recording/listening HRTF doesn't always eliminate height (or any direction), but it won't be correct either.
    Pretty much happens all the time, especially with the speaker setup. You can do a bit better with headphones.
    I can tell you, you can't ever get close enough for enough people. The average is the best you can do, and that average results in a reasonable binaural presentation. Just don't expect accuracy.
    It's not a question of "knowing" the real source positions. Playback on speakers has many, many variables. You've tried to control some of them in your description of the theoretical system. You haven't covered all of them, and when extending the model beyond a single, idealized, theoretical system configuration to, well, reality, it all falls apart. You cannot expect every listener to have an acoustically treated room, a head clamp (or the desire to use one), and so his head won't be locked in the calibrated sweet spot for the crosstalk and HRTF compensation to work. Beamforming speakers won't help you there either, as that technology is also very location specific. Heck, even the general frequency response of the speakers has an impact.
     
    But none of this actually means much in practice since every recording is really a means of suspension of disbelief rather than a replication of an event. It doesn't take much to get "pleasing", it takes much more to get "accurate", if it's even possible. 
    Not only elevation collapses, everything about the stereo image collapses. You cannot allow head turning!
    No, not in a room with two speakers at 10 degrees. The PRIR the Realiser develops is translated to headphones. If you turn your head wearing headphones the processor can track your head turn and rotate the image to track it. It works because the relationship of the transducers to your ears is still fixed. You're not going to do that with two speakers in front of you.
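The head-tracking step described above can be sketched as a simple angle correction. This is only the bookkeeping, assuming yaw-only tracking and azimuth measured in degrees with positive values to the right; it is not how the Realiser is actually implemented, and the function names are hypothetical.

```python
def wrap_degrees(angle_deg):
    """Wrap an angle into the range [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def rendered_azimuth_deg(source_azimuth_deg, head_yaw_deg):
    """Azimuth, relative to the head, at which to render a room-fixed source.

    Subtracting the tracked yaw keeps the virtual source fixed in the
    room while the head (and the headphones on it) rotates.
    """
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


if __name__ == "__main__":
    # Head turned 30 degrees to the right: a frontal source is now
    # rendered 30 degrees to the left of the nose.
    print(rendered_azimuth_deg(0.0, 30.0))
```

A processor would then pick or interpolate the filter for that head-relative angle; that part is left out of this sketch.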
    I'm skeptical that he's actually saying that, and that it could be done with speakers. Pretty tall order. Link to the paper perhaps?
    Care to tell us how you plan to do any sort of HRTF, crosstalk cancellation or beamforming without a DSP?
    I think we are all saying the same thing in different ways. The whole idea has problems, some can be partially remedied, others cannot.
    When you think of music you also must include the acoustic space around it. It's always there, even if artificial. For the reproduction to be real and accurate you must include a 360 degree sphere. 
     
