Pink noise volume pan for evaluating stereo imaging

Jul 9, 2024 at 1:13 AM Thread Starter Post #1 of 63

MrHaelscheir

As a mere hobbyist with an interest in measurements and binaural head-tracking, I wish to clarify my assumptions regarding how stereo imaging works, or what constitutes "objectively correct imaging", and hence how those assumptions should guide the imaging evaluation method I describe below.

tl;dr: Is stereo in the ideal case supposed to image perfectly along a line between the channels, and if so, is panning pink noise a reasonable means for assessing the precision of that imaging?

Firstly, am I right to assume the following about normal/traditional stereo mixing with volume pans and no special HRTF DSP or phase tricks: as much as the microphones might capture timbral or loudness cues for distance, when the stereo mix is played through speakers in a well-treated room if not an anechoic chamber, from the sweet spot, sounds would image nigh perfectly along a line between the two speaker channels, maybe with slight height variations depending on one's HRTF or imperfections in one's sound localization faculties? That is, would it be conceivable, if not correct, for there to exist people for whom panned sound sources image purely on the line between the two channels, while still hearing the timbral distance effects independently of the actual perceived localization? (This matches my present understanding that surround sound and ambisonics are founded on adding channels to create more lines or planes in 3D space along which to locate, if not "pan", sounds.) In effect, is "soundstage depth/layering/height" or "3D holography" merely an illusion of spatial properties that are not actually in the (traditional stereo) recording, technically erroneously created by room acoustics, happenstance HRTF interactions with said room or one's headphones, or the inability to separate tonal cues from one's main localization faculties?

Now, one exception to my experience of imaging for classical recordings might be the case of off-stage trumpets in some Mahler and other works, where I don't know if there was a special mixing trick or if it was down to mic'ing or a combination thereof; perhaps it was just a good timbral cue and a wide reverb otherwise technically largely smeared across the 1D stereo line ahead of me, with some height from bass cues. Maybe some classical recordings seemed like they had actually captured and imaged ceiling reflections or the height of the choir loft. Otherwise, I suspect it is good stereo mic'ing and mixing that allows string sections in some recordings to image along realistic diffuse lines rather than being squished into points on the left and right.

Given this, is it fair, if not an existing practice, to use a volume pan of a pink noise signal between the left and right channels to assess a stereo system's imaging accuracy and coherence? Here, "accuracy" to me implies that a linear pan should incur the perception of a linear rate of motion of the pink noise source along a line from one channel to the other, with no variations in height. Then "coherence" refers to all frequencies within said pink noise being perceived as imaging from the same point. That is, in an anechoic chamber, I would expect all frequencies during the pan to image from the same, single moving point, at least as far as one's HRTF and localization faculties allow (e.g. any physiological asymmetries or HRTF "edge cases" the brain didn't adapt to, if that's even a thing). If one hears different noise bands shifting up or down, or lagging or leading, this incurs what I call "imaging incoherence", which for music could cause the same instrument or sound to image from multiple directions as different parts of its spectrum interact with your HRTF differently, else cause different parts of the mix to attain spatial height variations that are not necessarily intentional or accurate. From my experience, this can be caused by errors in the HRTF measurement or binaural decoder implementation (see https://www.head-fi.org/threads/rec...-virtualization.890719/page-121#post-18027627 (post #1,812)), by sometimes inescapable issues with how the headphones interact with your ears (e.g. without DSP correction I almost always hear the treble imaging high), or, for real speakers, by the effects of room reflections skewing the image, like causing some vocals to image higher and left of center where they image perfectly ahead through a headphone simulation of anechoic stereo speakers or in an actual treated room.
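For concreteness, here is roughly the test signal I have in mind as a Python sketch (numpy/scipy; the constant-power law and the durations are arbitrary choices of mine, not a standard):
Code:
import numpy as np
from scipy.io import wavfile

FS = 48000
DUR = 10.0                           # seconds for one full left-to-right pan

rng = np.random.default_rng(0)
n = int(FS * DUR)

# pink (1/f) noise via FFT shaping of white noise
white = rng.standard_normal(n)
spec = np.fft.rfft(white)
f = np.fft.rfftfreq(n)
f[0] = f[1]                          # avoid divide-by-zero at DC
mono = np.fft.irfft(spec / np.sqrt(f), n)
mono /= np.max(np.abs(mono))

p = np.linspace(0.0, 1.0, n)         # pan position: 0 = hard left, 1 = hard right

# constant-power (-3 dB) pan law; substitute other laws to taste
left = mono * np.cos(p * np.pi / 2)
right = mono * np.sin(p * np.pi / 2)

stereo = (0.5 * np.stack([left, right], axis=1)).astype(np.float32)
wavfile.write("pink_pan.wav", FS, stereo)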

https://www.audiosciencereview.com/...out-headphone-measurements.18451/post-2016279 (post #1,278) documents how I hear pink noise pans, and hence the imaging of music, through nigh all my headphones regardless of shape, size, frequency response, and cost or reputation. This is the "terrible imaging imperfection" I hear through headphones, and probably heard through the Stax SR-X9000 and Sennheiser HE-1 when I encountered them, same as any other, whereas, along the lines of https://www.head-fi.org/threads/can-poor-soundstage-and-imaging-be-fixed-digitally.949757/, personalized HRTF measurements and binaural head-tracking DSP allow me, with the click of a button, to switch to exquisitely coherent and linear imaging.
 
Jul 9, 2024 at 3:14 AM Post #2 of 63
Balanced response that eliminates masking problems helps a lot. I don’t care at all for head tracking with my iPhone’s Spatial Audio. Sometimes the Spatial Audio helps. But not as much as DSPs with my speaker system. I get the best placement with speakers.
 
Jul 9, 2024 at 6:10 AM Post #3 of 63
I often referred to this http://recherche.ircam.fr/equipes/salles/listen/sounds.html over the years. It has become a PITA to use because I always need to go get some app that can handle FTP (and usually by the time I want to use the page again, years have gone by, and I've forgotten what app was doing it or have changed my computer). :sweat_smile:

But the BZZZ BZZZ they call modulated noise is perhaps the best sound I've ever heard for locating something. Sadly, the demos already have someone else's head baked into them. But maybe you're better than me and have some idea of how to make that noise? Or I guess pink noise can do the job too and is readily accessible. ^_^
 
Jul 9, 2024 at 6:19 AM Post #4 of 63
Firstly, am I right to assume the following about normal/traditional stereo mixing with volume pans and no special HRTF DSP or phase tricks: as much as the microphones might capture timbral or loudness cues for distance, when the stereo mix is played through speakers in a well-treated room if not an anechoic chamber, from the sweet spot, sounds would image nigh perfectly along a line between the two speaker channels, maybe with slight height variations depending on one's HRTF or imperfections in one's sound localization faculties?
I’m not sure I really understand the question. In “normal/traditional” stereo mixing, the left/right positioning is achieved with volume panning; however, our hearing doesn’t rely only on volume panning for location perception, it also relies on time-of-arrival differentials between signals (psychoacoustic panning) as well as frequency content. There are a lot of variables you appear not to be considering. For example, different stereo mic’ing techniques are typically a trade-off between better time-of-arrival and better level differentiation, i.e. as we improve timing coherency, we lose level differentials and vice versa. Technically the best stereo mic pattern is the “Blumlein Pair”, but it’s extremely rarely ever used, because we don’t record music in anechoic chambers and the sound/reflections from the rear of the pair are equal to the sound/reflections from in front of the pair, which is virtually always undesirable. In practice this is irrelevant though, because we do not make commercial recordings using only a stereo pair; we use various different mic sources, and therefore the timing is whatever the mix engineer/producer decides.

Likewise, it also depends on what “Panning Law” has been chosen. Typically for music recording it’s 2.5dB, but for film sound it’s more commonly 3.0dB, and different engineers/studios may choose something different. Additionally, the frequency content is also chosen by the engineers/producer, and this too can affect the perception of location, of both distance and height (as well as the perception of loudness). So too does the width between the speakers in a particular studio, which obviously will also affect the positioning, as will the width of the consumers’ speakers.
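To make the pan-law figures concrete, here's a rough Python sketch of how a pan law maps pan position to channel gains. The exponent parametrization is just one simple way to express it, and the dB figures refer to how far each channel sits below unity at dead center:
Code:
import numpy as np

def pan_gains(p, law_db=3.0):
    """Channel gains for pan position p in [-1 (left), +1 (right)].

    law_db is the pan law: how far each channel is attenuated at
    center so the combined level stays roughly constant.  Hard pans
    always come out at unity gain."""
    a = law_db / (20 * np.log10(2))   # exponent giving law_db drop at center
    x = (1 - p) / 2                   # 1 at hard left ... 0 at hard right
    return x ** a, (1 - x) ** a

for law in (2.5, 3.0, 4.5, 6.0):
    l, r = pan_gains(0.0, law)        # gains at dead center
    print(f"-{law} dB law: center gain = {20 * np.log10(l):.2f} dB per channel")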

So in practice, with all the variables above, we only have a fairly vague idea of how consumers are going to perceive our panning decisions, and how we deal with that knowledge varies from engineer to engineer and circumstance to circumstance. Some choose to concentrate on areas rather than specific points on the line, others are more exacting and just hope for the best; it also varies according to music genre. In other words, even if you somehow figure out the exact/perfect reproduction, it’s not going to work with some/many recordings, although you will probably never know, unless you get access to the master and the studio in which it was created.
Now, one exception to my experience of imaging for classical recordings might be the case of off-stage trumpets in some Mahler and other works, where I don't know if there was a special mixing trick or if it was down to mic'ing or a combination thereof; perhaps it was just a good timbral cue and a wide reverb otherwise technically largely smeared across the 1D stereo line ahead of me, with some height from bass cues.
I can’t answer that, because there are several ways of skinning that particular cat and it also depends on the exact circumstances. For example, are the trumpets backstage, on a balcony behind the orchestra, or maybe in a box to the side or sides of the orchestra? Also, I’m not sure what you mean by “bass cues”? Except in the case of headphones and hard-panned bass (which is very rarely the case), we are insensitive to location cues in the bass freqs. And that is a potential loophole with your pink noise reference: different freqs can be perceived in different locations (or no specific location at all), and playing all the audible freqs at once means we can’t hear specific freqs and harmonics as we can with music. For simple balance purposes pink noise is a good choice, say if one speaker is louder than another, but measurements would be preferable IMHO.

Not sure this response has been of much help.

G
 
Jul 9, 2024 at 11:13 AM Post #5 of 63
I am not familiar with the extent to which time-difference pans are used, but from playing with them in the Reaper DAW, I found the effect rather unnatural if not jarring, at least when used to the extreme or without sensible combination with volume pans. "Frequency content" to me sounds like deliberate simulation of HRTF localization cues, or perhaps whatever trick allowed even a cheap CRT TV years ago to make it sound like someone was knocking at the door. Maybe that was a time-delay pan that allowed said pan to extend beyond the speakers?

As for the panning law, I only hear the changes in level while the sound source continues to image right on the line between the speakers.

Nonetheless, is my description of plain volume panning of individual sources imaging along a line between the channels sensible? I guess I am asking how sound engineers hear these pans and their effects, and hence wish to gauge the sanity of a number of "audiophiles'" flowery descriptions of "3D holographic soundstaging" compared to the actual recording content, stereo imaging theory, or the mixer's intent within their mixing environment. E.g. would it not be insensible for one to claim to hear differences in the resolving of "depth" of different instruments between different playback systems, and thus deem one system or piece of gear "better", if the reality is that the recording and mixing technically force the audio onto a 1D line, such that one only really discerns "distance" the way one does within a 2D photo? What I am supposing is that for each channel in isolation, one's HRTF, namely the interaural frequency response, level, and time differences, allows one to localize that speaker's direction and to an extent its distance, whereas once panning between two speakers, at least in my experience, things image on the line between those two speakers at the distance at which those speakers were placed. If so, are there conventional mixing or mic'ing techniques that allow those two speakers to "actually" reproduce the HRTF cues to image something beyond or in front of the speakers without the assistance of room acoustics, or that could pan a sound source further and closer while keeping the volume level constant?

By "bass cues" I mean how I perceive localization of bass through binaural head-tracking if not sometimes at the concert hall in the sense of imaging "larger" than the closer to point-like higher frequencies or having the sense of filling the space, hence ascribing a sense of "height". As for localizing deep bass drums situated left or right of center, I guess the higher frequencies of the transient have a bearing on that, but I still like the idea of full-range stereo sub-bass in an anechoic chamber or simulation thereof.
 
Jul 9, 2024 at 11:52 AM Post #6 of 63
I am not familiar with the extent to which time-difference pans are used, but from playing with them in the Reaper DAW, I found the effect rather unnatural if not jarring, at least when used to the extreme or without sensible combination with volume pans.
ILD and ITD panning should go hand in hand, and it is far from simple. Both ILD and ITD are frequency dependent, and as if that weren't complex enough, mixing for speakers and mixing for headphones are very different in principle. Headphones "want" binaural sound incorporating the properties of an HRTF, while speaker spatiality doesn't have to do that, because the listener's torso/head is there to do the HRTF business. That's why there are three options for creating spatiality:

- For speakers (headphone users can use processing that makes the sound more "binaural")
- For headphones (speaker user can try crosstalk canceling if that helps)
- For both speakers and headphones (a compromise that hopefully works okay for both).

In the third option, low frequencies should be mixed mono-like (ILD: 0–3 dB, but ITD: 0–800 µs can be used to pan the sound). ILD should increase with frequency (plus a corresponding ITD). The peak ILD frequency itself is a function of angle! For example, the peak ILD happens at 2000 Hz if the sound comes from a 57° angle, and if the angle increases, the ILD at 2000 Hz decreases! I have to say I'm struggling to get my head around it all...
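A very rough Python sketch of a combined ILD+ITD pan, just to show the flavour (the crossover, ILD and ITD values here are illustrative, not from any reference):
Code:
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000

def ild_itd_pan(mono, ild_db=6.0, itd_us=300.0, xover_hz=700.0):
    """Pan a mono signal toward the right with separate level and
    time cues: the far (left) ear gets the signal delayed by the
    ITD, and attenuated by the ILD only above a crossover, so the
    low band stays mono-like."""
    d = int(round(itd_us * 1e-6 * FS))               # ITD in whole samples
    far = np.concatenate([np.zeros(d), mono])[:len(mono)]
    sos_lo = butter(4, xover_hz, btype="low", fs=FS, output="sos")
    sos_hi = butter(4, xover_hz, btype="high", fs=FS, output="sos")
    g = 10 ** (-ild_db / 20)
    left = sosfilt(sos_lo, far) + g * sosfilt(sos_hi, far)
    return np.stack([left, mono], axis=1)            # (left, right)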
 
Jul 9, 2024 at 12:39 PM Post #7 of 63
For this thread, I am focusing on the expectations for proper stereo speaker imaging and hence claims about the perceived spatial properties in music mixed for speakers, whether played through speakers or through headphones (with some folks claiming that certain headphones or gear sound like studio monitors).

If thinking about the entire HRTF, or what one's in-ear mics would capture from measurement chirps, then certainly ILD and ITD change in complex ways with respect to frequency for any actual direction. But I am uncertain of the actual necessity of doing a full binaural pan using a generic HRTF (unless you mean manually simulating general HRTF trends with filters) when mixing for speakers, where I find volume pans to give sufficiently vivid localization along the line between the speaker channels. That is, I am supposing that a plain volume pan induces a linear movement of the sound source along the line between the speakers, as opposed to inducing a rotation about the listener as a binaural pan could, though I suppose an actual sound source (or a personalized binaural simulation thereof) translating along that line would have a subjectively different imaging and tonal quality from the equivalent plain stereo volume pan. Or is it that approximations similar to the filters and delays used in traditional crossfeed (e.g. the stuff covered in https://bs2b.sourceforge.net/) are sometimes employed as a solution for getting speakers to image things further or closer than the line between the channels?
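By crossfeed I mean something in this spirit; a minimal sketch only loosely inspired by bs2b (the cutoff, feed level and delay are placeholder values, not the actual bs2b coefficients):
Code:
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000

def crossfeed(stereo, cutoff_hz=700.0, feed_db=-8.0, delay_us=300.0):
    """Speaker-style crossfeed for headphones: each ear also hears a
    lowpassed, attenuated, slightly delayed copy of the opposite
    channel, roughly as it would from a speaker across the room."""
    d = int(round(delay_us * 1e-6 * FS))
    g = 10 ** (feed_db / 20)
    sos = butter(2, cutoff_hz, btype="low", fs=FS, output="sos")

    def leak(x):                                 # opposite-ear path
        return g * np.concatenate([np.zeros(d), sosfilt(sos, x)])[:len(x)]

    l, r = stereo[:, 0], stereo[:, 1]
    out = np.stack([l + leak(r), r + leak(l)], axis=1)
    return out / np.max(np.abs(out))             # headroom for the summed feeds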

For the last point, I like to just accept the idea of an in-ear frequency response graph having its peaks and dips move around as they please as one turns one's head. On Reaper with my binaural head-tracking, I can play pink noise and watch the FFT output of the binaural decoder as I move the head-tracking unit or manually input a binaural rotation.
 
Jul 9, 2024 at 5:28 PM Post #8 of 63
For this thread, I am focusing on the expectations for proper stereo speaker imaging and hence claims about the perceived spatial properties in music mixed for speakers, whether played through speakers or through headphones (with some folks claiming that certain headphones or gear sound like studio monitors).

Imaging soundstage on speakers is entirely different than using headphones. Speakers interact with the room which is responsible for a lot of the spatial cues.
 
Jul 9, 2024 at 6:37 PM Post #9 of 63
It has become a PITA to use because I always need to go get some app that can handle FTP (and usually by the time I want to use the page again, years have gone by, and I've forgotten what app was doing it or have changed my computer). :sweat_smile:
Always wget :)

But the BZZZ BZZZ they call modulated noise is perhaps the best sound I've ever heard for locating something. Sadly, the demos already have someone else's head baked into them. But maybe you're better than me and have some idea of how to make that noise? Or I guess pink noise can do the job too and is readily accessible. ^_^
This sounds similar enough, imo:
Code:
# multiply a ~20 Hz modulator by white noise to get a 0.4 s burst of
# "modulated noise", then add a 0.1 s gap and repeat for 6 bursts total
SOX_OPTS="-r 44.1k" sox --combine multiply \
    "|sox -n -p synth 0.4 sin 20 norm -36 synth square fmod 20 norm -6" \
    "|sox -n -p synth 0.4 white" \
    -b16 BZZZ.flac pad 0 0.1 repeat 5
(result in attachment)
 


Jul 10, 2024 at 5:25 AM Post #10 of 63
I am not familiar with the extent to which time-difference pans are used, but from playing with them in the Reaper DAW, I found the effect rather unnatural if not jarring, at least when used to the extreme or without sensible combination with volume pans.
Actually, it’s more natural, but you have to get it right, which can be tricky. Time-difference pans are quite commonly used, though very rarely explicitly. Any stereo mic pair (other than a coincident pair) inherently has at least some timing difference, and larger mic arrays or more widely spaced stereo pairs have significantly more. Other than this, which is a combination of level panning and psychoacoustic panning, we rarely deliberately use psychoacoustic panning, because it relies on the listener being positioned exactly between the two speakers in order to experience the panning effect, while level panning is much more forgiving of non-ideal listening positions.
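To put a rough number on that inherent timing difference (illustrative figures for a simple spaced pair):
Code:
import numpy as np

# A spaced pair 40 cm apart with a source 30 degrees off-axis: the
# extra path to the far mic is roughly spacing * sin(angle).
spacing_m, angle_deg, c = 0.40, 30.0, 343.0
dt_ms = spacing_m * np.sin(np.radians(angle_deg)) / c * 1000
print(f"inter-mic time difference ~ {dt_ms:.2f} ms")   # ~0.58 ms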
"Frequency content" to me sounds like deliberate simulation of HRTF localization cues, or perhaps whatever trick was used to allow even a cheap CRT TV years ago to make it sound like someone was knocking at the door. Maybe that was a time delay pan that allowed said pan to extend beyond the speakers?
Manipulating the frequency content is done simply for tonal reasons or to help create a perception of distance, for example less HF to imply greater distance, or more HF (or higher harmonics) to imply proximity, but it is not done as a deliberate simulation of anything to do with HRTFs. There may be some exceptions, but none I’m aware of; trying to do so is very flaky/inconsistent. “Shuffling” is the effect which allows a sound to appear to be “beyond the speakers”, although it was extremely rarely ever used in TV as it virtually always adversely affected mono compatibility (which was a mandated requirement). The effect was most likely just a poor acoustic result of your room, or of your vision/expectation overriding your hearing. Certainly we could make a door-knocking sound appear quite present and surprising/shocking, but not deliberately beyond the speakers.
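For anyone curious, a crude mid/side sketch of the shuffling idea (a toy version with made-up corner frequency and boost; classic Blumlein shuffling is more careful about the filter's phase and amplitude response):
Code:
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000

def shuffler(stereo, below_hz=600.0, boost_db=4.0):
    """Widen the image by boosting the side (L-R) signal below a
    corner frequency, leaving the mid untouched."""
    m = 0.5 * (stereo[:, 0] + stereo[:, 1])      # mid
    s = 0.5 * (stereo[:, 0] - stereo[:, 1])      # side
    sos = butter(2, below_hz, btype="low", fs=FS, output="sos")
    g = 10 ** (boost_db / 20) - 1.0
    s = s + g * sosfilt(sos, s)                  # low-shelf boost on side only
    return np.stack([m + s, m - s], axis=1)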
Nonetheless, is my description of plain volume panning of individual sources imaging along a line between the channels sensible? I guess I am asking how sound engineers hear these pans and their effects, and hence wish to gauge the sanity of a number of "audiophiles'" flowery descriptions of "3D holographic soundstaging" compared to the actual recording content, stereo imaging theory, or the mixer's intent within their mixing environment.
As much as I like to debunk “audiophile flowery descriptions”, there is unfortunately some truth to them in this instance. It is not a case of panning sources “imaging along a line between the channels” or more precisely, that is the case technically but not perceptually, and that perception is entirely deliberately manipulated by sound engineers, and has been ever since the mono age. Relative amounts of volume, HF, reverb/reflections and compression on the different individual sources in a mix will create the impression of depth/distance. Imagine a trapezoid starting just in front of the speakers and extending almost infinitely behind them, narrowing as the intended distance increases. This trapezoid is our “soundstage” with stereo, while with mono the soundstage is a line, starting just in front of the speaker and extending to almost infinity behind. This “depth” is a vital part of virtually all audio mixing (music and sound for film/TV). With stereo our soundstage is therefore two dimensional, not a line between the speakers, which would of course be one dimensional.

However, the audiophile description of “3D holographic” is also incorrect; it indicates a very poor room acoustic, with strong reflections from the rear wall or ceiling interacting severely enough with the speaker dispersion characteristics to give the impression of sound coming from behind and/or above. Some audiophiles appear to like this effect, which is up to them of course, but it is not inherently in the recording, is not the intention of the recording, and is therefore not high fidelity.

G
 
Jul 10, 2024 at 4:12 PM Post #11 of 63
Thank you for these clarifications. So I presume most of the time, any "time difference panning" is inherent to what was captured by the respective stereo mic pair? And that indeed, any height cues are usually unintentional and induced by room acoustics.

Otherwise, regarding imaging and any controls for the manipulation of perceived depth or layering, is a distinction made between "real" depth and "illusory" depth?
  • For analogy, I would describe "illusory" depth as the cues that allow us to clearly distinguish foreground and background and other things in a 2D image/photograph, or, perhaps more analogously, taking pictures of individual subjects and compositing them in Photoshop, else painting a scene to have perspective, doing colour and brightness manipulations and the like (analogous to panning and EQ or level changes) to simulate optical distance.
  • Then "real" depth would be the case of actually seeing the live view, else using a VR or 3D camera and actual stereoscopic parallax to judge depth and layering, analogous to binaural capture and playback, or to approximated ILD and ITD manipulation/simulation.
To me, "illusory" depth would be painting the depth cues along the 1D line between the channels to create only the illusion of hearing a trapezoidal stage extending further back, whereby maybe some listeners can notice that they are "listening at" that 1D sonic picture rather than an "actual" 2D trapezoid that might be created with a binaural recording, whether or not a stereo pair can on its own actually capture and relay that horizontal 2D image. Maybe clarify the "technically" in "It is not a case of panning sources “imaging along a line between the channels” or more precisely, that is the case technically but not perceptually..."?

Maybe if you have examples of recordings that employ one or the other method of creating depth and layering, or something like one recording with just pans and EQ, and another maybe going as far as just a pure stereo pair. In the context of my own music listening, for orchestral music, per my live experience, I expect to hear depth cues whereby the woodwinds, brass, and percussion can be clearly discerned as sounding from behind the string sections, yet so far with the majority of recordings, while I may be hearing timbral cues for distance, I still largely hear everything along the same 1D line between the channels, per what I described as purely "illusory" depth. So would this be a limitation of my imaging perception (e.g. a tendency to hear EQ tricks independently of the actual perceived image location), of my playback system (be it through my Genelecs or my personalized HRTF rendering; think also of those folks who describe one headphone or piece of gear as being better at resolving depth and layering), or of the available mixing methods for non-binaural recording?
 
Jul 11, 2024 at 4:19 AM Post #12 of 63
So I presume most of the time, any "time difference panning" is inherent to what was captured by the respective stereo mic pair? And that indeed, any height cues are usually unintentional and induced by room acoustics.
Correct. The only exceptions would be height cues when mixing in Dolby Atmos, because of course it actually has overhead speakers, and there are some binauralisation plugins specifically for creating binaural headphone mixes, which can include height panning.
Maybe clarify the "technically" in "It is not a case of panning sources “imaging along a line between the channels” or more precisely, that is the case technically but not perceptually..."?
Sure: Technically there isn’t even a line between the two speakers, there are just two point sources, the speakers, and that’s it. The line between the speakers and the rest of the trapezoid, which includes depth, are all illusory. We don’t really think in those terms when mixing though; we think about and use it as if it were real. In the case of depth, we nearly always have to deliberately create it, by emulating those parameters of sound that our hearing uses to perceive depth: the loss of volume with increasing distance of course, but also the additional increasing loss of HF with distance (due to the higher absorption of HF by air), the longer initial delay of reflections, and the width, dispersion, density, FR and other parameters of the reverb.

We also use compression for positional purposes; more (full range) compression makes sound sources appear closer/more “present”. In addition, all of these are relative: more compression, less HF reduction and less reverb will only make an instrument or sound appear closer than a different instrument or sound if the other channel has less compression, more HF reduction and more reverb. It can get a bit complicated how these parameters interact, for example if we add compression while also reducing HF and applying reverb.

And lastly, there is of course a time delay with increasing distance, due to the relatively slow speed of sound in air. However, there are only very specific and quite unusual situations where we employ or account for this in music production, because in larger ensembles, where distances are enough to make a noticeable difference, the musicians are typically accustomed to compensating, e.g. “anticipating the beat”.
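As a caricature of those distance parameters in code, a toy Python sketch (the gain law and the HF-corner mapping are crude stand-ins, not how any studio actually does it; reverb and compression are left out entirely):
Code:
import numpy as np
from scipy.signal import butter, sosfilt

FS = 48000

def distance_cues(x, d_m, ref_m=1.0):
    """Apply crude versions of the depth cues above to a dry mono
    source: inverse-distance gain, extra HF loss with distance, and
    propagation delay."""
    g = ref_m / d_m                                  # -6 dB per doubling
    cutoff = float(np.clip(20000.0 * ref_m / d_m, 500.0, 20000.0))
    sos = butter(1, cutoff, btype="low", fs=FS, output="sos")
    y = g * sosfilt(sos, x)
    delay = int(round(d_m / 343.0 * FS))             # speed of sound ~343 m/s
    return np.concatenate([np.zeros(delay), y])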
Maybe if you have examples of recordings that employ one or the other method of creating depth and layering, or something like one recording with just pans and EQ, and another maybe going as far as just a pure stereo pair.
Again, this is somewhat of an artificial distinction to my mind. For example, “a pure stereo pair” is just mono until you pan each of the two mics left and right, and how much you pan them determines the perceived width of the stereo image. We commonly build a full-width stereo image from various panned mono sources and narrower-width stereo pairs. A common example would be almost all rock and pop recordings: the drum kit would virtually always be recorded by an overhead stereo pair (as well as various individual “spot” mics), but that stereo pair will usually be panned quite narrowly, because the real width of a drum kit is of course limited to the reach of the drummer. The other most common example would be the piano: although typically recorded with a stereo pair, its stereo image is usually significantly narrower than the available stereo image width.
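Narrowing a pair like that amounts to re-panning its two legs inward; a minimal sketch, using a -3 dB sine/cosine law as an arbitrary choice:
Code:
import numpy as np

def narrow_pair(stereo, width=0.3):
    """Re-pan a stereo pair (e.g. drum overheads) inward so it only
    occupies part of the image.  width=1 leaves it full width,
    width=0 collapses it to mono in the center."""
    def gains(p):                                # p in [-1, 1]
        t = (p + 1) * np.pi / 4
        return np.cos(t), np.sin(t)
    ll, lr = gains(-width)                       # left mic, panned part-left
    rl, rr = gains(+width)                       # right mic, panned part-right
    l, r = stereo[:, 0], stereo[:, 1]
    return np.stack([ll * l + rl * r, lr * l + rr * r], axis=1)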

To answer your question though, there are hardly any commercial recordings I know of that use just a stereo pair; some of the earlier Sheffield Labs recordings, and maybe one or two other esoteric labels who make that a marketing point. The depth you could perceive on those recordings would be entirely captured by the mics. In most more typical orchestral recordings, the depth is really somewhat of a mixture: it is all recorded, but then some of it is manipulated. For example, we have depth information recorded by the main array (typically a stereo pair or a Decca tree), but also various spot mics, both for room acoustics (reverb) and for individual instruments or groups of instruments. Some/many of those instrument spot mics will probably have to be time delayed, and possibly even have the HF reduced and some reverb added, to create a perception of distance that roughly matches what was captured by the main array, and of course they’ll also need to be panned to match the left/right position as well. It’s really hard to think of any commercial recordings that comprise only “real depth” as you define it. It’s virtually always a mixture as just described, or an almost entirely manufactured depth and width in the case of non-acoustic ensembles (rock and all other popular genres). Film/TV sound is even more extreme; we’re commonly dealing with far larger distances between sound sources, and virtually the entire soundstage is manufactured.
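The spot-mic time alignment is simple propagation arithmetic, for example (distances made up):
Code:
# A spot mic much closer to the instrument than the main array hears
# it much earlier; delaying the spot by the path difference keeps it
# consistent with the main pair.
C = 343.0                                 # speed of sound, m/s
main_dist_m, spot_dist_m = 12.0, 0.5
delay_ms = (main_dist_m - spot_dist_m) / C * 1000
print(f"delay the spot mic by ~{delay_ms:.1f} ms")   # ~33.5 ms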

G
 
Jul 12, 2024 at 4:11 AM Post #13 of 63
The earliest RCA Living Stereo albums were recorded with just two mics. Later on, they added a third center channel. But that was probably because of the limitations of the recording medium. There isn’t much reason to record with two mics today…. At least for commercial music.
 
Jul 14, 2024 at 4:12 AM Post #14 of 63
Thank you for these clarifications. Given this, beyond excellent channel matching, excellent room acoustics, or happenstance tonal similarity to the original monitoring system, what, if anything, would you consider to be the objective hallmarks of a playback system that can deliver an ideal rendering of a recording's depth cues? Does an "ideal" perception of a recording's imaging and depth/layering even exist, or does it ultimately depend on the listener, even to the level of some getting perceptual enhancements by the wildest audiophile means?

Anyways, going back to the main thread topic, we have castleofargh's mention of the signals at http://recherche.ircam.fr/equipes/salles/listen/sounds.html, and danadam's FLAC file, which sounds to me something like a pulsing, roughly white-spectrum buzz. I guess that can be reasonable for quick assessment of direction, but I think it wouldn't be as helpful as constant pink noise for honing in on whether specific frequency bands are imaging from the wrong place as a result of HRTF errors or otherwise. Maybe I am describing an uncommon imaging incoherence phenomenon, but I would imagine it to be easily discernible in an untreated room, or like how I described my own experience of stock headphone imaging in the link within the OP.
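For per-band checking, one could also pan octave bands of pink noise one at a time; a quick sketch of what I mean (band centers, edges and durations arbitrary):
Code:
import numpy as np
from scipy.signal import butter, sosfilt
from scipy.io import wavfile

FS = 48000

def octave_band_pan(center_hz, dur=8.0):
    """One octave of pink noise panned hard left to hard right, for
    checking whether a single band drifts in height or lags during
    the pan."""
    n = int(FS * dur)
    white = np.random.default_rng(1).standard_normal(n)
    spec = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]
    x = np.fft.irfft(spec / np.sqrt(f), n)              # pink noise
    band = [center_hz / np.sqrt(2), center_hz * np.sqrt(2)]
    x = sosfilt(butter(4, band, btype="bandpass", fs=FS, output="sos"), x)
    x /= np.max(np.abs(x))
    p = np.linspace(0, 1, n)
    out = np.stack([x * np.cos(p * np.pi / 2),
                    x * np.sin(p * np.pi / 2)], axis=1)
    wavfile.write(f"pan_{int(center_hz)}Hz.wav", FS, (0.5 * out).astype(np.float32))

for c in (250, 1000, 4000, 8000):
    octave_band_pan(c)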

If talking about just a volume pan, would it be reasonable to conclude that, independent of the chosen panning law, the pink noise stimulus should image evenly along the line between the two speakers, with no frequency bands seeming to vary in height or to lag or lead left or right through the pan? I.e. to an extent, since no depth cue was intentionally mixed, ideally no image depth variation should be perceived. Though I partly realize that an ideal panning law simulating a source of constant loudness should actually simulate the loudness cue of a sound source moving along the line between the speakers, as opposed to along a constant-radius arc, such that it would be quietest at the flanks and loudest at the middle, where it is necessarily closest to the listener. Actual binaural pans/rotations as I know them have no such "panning law" parameter, with perceived variations in loudness through the pan depending on one's HRTF and the like.
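How big would that loudness cue be? A back-of-the-envelope computation for a nominal setup (speakers at ±30° and 2 m; my assumption):
Code:
import numpy as np

# If a source really moved along the straight line between the
# speakers at constant power, how much quieter would it be at a
# speaker than at the center of the line?
d = 2.0                                      # listener-to-speaker distance, m
center_dist = d * np.cos(np.radians(30))     # ~1.73 m, closest point on the line
flank_dist = d                               # at either speaker
print(f"flank vs center: {20 * np.log10(flank_dist / center_dist):.2f} dB")  # ~1.25 dB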

This thread is partly for validating the utility of my described pink noise pan for demonstrating to headphone users the limitations of headphone imaging, or for letting them test their claims of hearing a "speaker presentation" through a given headphone or amp, else for coming up with a better exercise that demonstrates this.
 
Jul 14, 2024 at 6:58 AM Post #15 of 63
Anyways, going back to the main thread topic, we have castleofargh's mention of the signals at http://recherche.ircam.fr/equipes/salles/listen/sounds.html, and danadam's FLAC file, which sounds to me something like a pulsing, roughly white-spectrum buzz. I guess that can be reasonable for quick assessment of direction, but I think it wouldn't be as helpful as constant pink noise for honing in on whether specific frequency bands are imaging from the wrong place as a result of HRTF errors or otherwise. Maybe I am describing an uncommon imaging incoherence phenomenon, but I would imagine it to be easily discernible in an untreated room, or like how I described my own experience of stock headphone imaging in the link within the OP.
I was specifically talking about localization, and for that, I find that a sound going on and off is easier to place than something at least subjectively more stable. For matters related to frequency response, it's not going to do great. :smile_cat:

About depth perception, keep in mind that the worst part is us. We're not very good at estimating distances by ear, and if there is any visual cue, we're likely to trust that more for our experience.
There is a graph somewhere showing that estimates of a sound source's distance under 1 m tended to be overestimated, while sources beyond 1 m tended to be underestimated. I'm talking about real sound sources, and I just assume that proper simulation would give similar results. But that's just my guess.
 
