A layman multimedia guide to Immersive Sound for the technically minded (Immersive Audio and Holophony) | Headphone Reviews and Discussion - Head-Fi.org

jgazal · Jul 3, 2018

This post was written to try to answer the following questions:

I see a new product being advertised and I was wondering about your take on it. I'm driven to post in this forum because "tube magic" is mentioned four times on the product description page and it seemed hokey. Based on my reading, it seems that more scientific explanations prevail here than elsewhere.

The company claims their new amp "corrects the fundamental spatial distortion in recordings" and "increases the width of the soundstage beyond that of the speaker placement." At the push of a button, an extra 30 degrees of soundstage can be recreated.

Question 1: Is there a fundamental spatial distortion in recordings?

Question 2: What is happening to my ears when (if) I experience depth or width to a recording? I believe that I do experience a soundstage effect that differs between headphones, but I'm at a loss to describe what is happening to the audio or to my perception of the audio. It is similar to the difference I hear between open back and closed back headphones.

Since immersive audio is a broad subject, I believe it deserves its own thread.

Question 1

Fundamental

There would be something fundamentally wrong if all types of recordings suffered from one and the same distortion. But each kind of recording has its own peculiar set of distortions.

Spatial distortion

Spatial distortion could refer to:

a) what makes stereo recordings played back with loudspeakers different from stereo recordings played back with headphones;

OR

b) what makes our perception while listening to real sound sources different from listening to virtual sound sources reproduced via electro-acoustical transducers.

Even if we define spatial distortion as the difference in perception of space of a real sound versus a reproduced sound, chances are every recording format played on every system would suffer spatial imprecision varying by degree, as described below.

But do not interpret the word distortion in this post as a negative alteration. Although they vary by degree, these are not "fatal" types of distortion, as recordings were made with full knowledge that the reproducing systems probably were not going to match the creative environment exactly. Perhaps with cheaper digital signal processing we have a chance to make both environments match.

Recordings

As per item 1 you need to know how your recording was made and what level of spatial precision you want to achieve, because you need to know which kind of distortion you need to address.

A layman multimedia guide to Immersive Sound for the technically minded (Immersive Audio and Holophony)

Before you continue to read (and I don’t think anyone will read a post so long), a warning: the following description may not be scientifically and completely accurate, but may help to illustrate the restrictions of sound reproduction. Others more knowledgeable may chime in, filling in the gaps and correcting misconceptions.

First of all you need to know how your perception works.

How do you acoustically differentiate someone striking the 440 Hz note on a piano from someone blowing an oboe and pressing its 440 Hz key?

It is because the emitted sound: a) has the fundamental frequency accompanied by different overtones, partials that are harmonic and inharmonic to that fundamental, resulting in a unique timbre (frequency domain) and b) has an envelope with peculiar attack time and characteristics, decay, sustain, release and transients (time domain).

When you hear an acoustical sound source, how can your brain perceive its location (azimuth, elevation and distance)?

Your brain uses several cues, for instance:

A) interaural time difference - for instance sound coming from your left will arrive first at your left ear and a bit after at your right ear;

B) interaural level difference - for instance sound coming from your left will arrive at your left ear with a higher level than at your right ear;

For frequencies below 1000 Hz, mainly ITDs are evaluated (phase delays), for frequencies above 1500 Hz mainly IIDs are evaluated. Between 1000 Hz and 1500 Hz there is a transition zone, where both mechanisms play a role.

Sound localization - Wikipedia

C) spectral cues - sound coming from above or below the horizontal plane that crosses your ears will probably have not only a fundamental frequency but also a complex set of partials, and your outer ear (pinna) and your torso will change part of those frequencies in a very peculiar pattern related to the shape of your pinna and the size of your torso;

D) head movements - each time you make a tiny movement with your head you change the cues, and your brain tracks those changes against its head position to resolve ambiguous cues;

E) level ratio between direct sound and reverberation - for instance distant objects will have lower ratio and near sources will have higher ratio;

F) visual cues - yes visual cues and sound localization cues interact in the long and the short term;

G) etc.

A, B and C are mathematically described as a head related transfer function - HRTF.
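As a rough illustration of cue A, the interaural time difference can be estimated with the classic spherical-head (Woodworth) approximation. This is only a sketch: the head radius, the speed of sound and the function name are assumptions for illustration, and a real HRTF also carries the level and spectral cues (B and C) that this formula ignores.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature air
HEAD_RADIUS = 0.0875    # m, assumed average head radius


def itd_woodworth(azimuth_deg, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Approximate ITD in seconds for a distant source.

    azimuth_deg: 0 = straight ahead, 90 = fully to one side.
    Spherical-head model: ITD = (r / c) * (theta + sin(theta)).
    """
    theta = math.radians(azimuth_deg)
    return (radius / c) * (theta + math.sin(theta))


for azimuth in (0, 30, 60, 90):
    microseconds = itd_woodworth(azimuth) * 1e6
    print(f"{azimuth:3d} deg -> ITD of about {microseconds:6.1f} microseconds")
```

With these assumed values, the largest difference (source fully to one side) comes out at roughly two thirds of a millisecond, which gives an idea of how small the time cues are that your brain works with.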

Advanced “watch it later” - if you want to learn more about psychoacoustics and particularly about the precedence effect and neuroplasticity, watch the following brilliant lectures from Chris Brown (MIT OpenCourseWare, Sensory Systems, Fall 2013, published in 2014):

So you may ask which kind of recording is able to preserve such cues (spatial information that allows a lifelike sound field to be reconstructed) or how to synthesize them accurately.

One possible answer to that question could be dummy head stereo recordings, made with microphone diaphragms placed where each eardrum would be in a human being (Michael Gerzon - Dummy Head Recording).

The following video from @Chesky Records contains a state of the art binaural recording (try to listen at least until the saxophonist plays around the dummy head).

Try to listen with loudspeakers in the very near field, at more or less +10 and -10 degrees, and with two pillows, one in front of your nose in the median plane and the other on top of your head to avoid ceiling reflections (or get an iPad Air with stereo speakers, touch your nose to the screen and direct the loudspeakers' sound towards your ears with the palms of your hands; your own head will shadow the crosstalk).

Chesky Records

Just post in this thread if you perceive the singer displacing his head while he sings and the saxophonist walking around you.

However, each human being has an idiosyncratic head related transfer function, and the dummy head stays fixed while the listener turns his/her head.

What happens when you play binaural recordings through headphones?

The cues from the dummy head HRTF and your own HRTF cues don’t match, and as you turn your head the cues remain the same and the 3D sound field collapses.

What happens when you play binaural recordings through loudspeakers without a pillow or a mattress between the transducers?

Imagine a sound source placed to the left of a dummy head in an anechoic chamber and that, for didactic reasons, a very short pulse coming from that sound source is fired into the chamber and arrives at the dummy head diaphragms. It will arrive first at the left diaphragm and then, later and at a lower level, at the right diaphragm. End of it. Only two pulses are recorded, because it is an anechoic chamber with fully absorptive walls: one intended for your left ear and the other for your right ear.

When you play back such a pulse in your listening room, first the left loudspeaker fires the pulse into the room; it arrives first at your left eardrum, but also, later and at a lower level, at your right eardrum. When the right speaker fires its pulse (the second arrival, at the right dummy head diaphragm), it arrives first at your right ear and then, later and at a lower level, at your left ear.

So you were supposed to receive just two pulses, but you end up receiving four. If you now give up the idea of a very short pulse and think about sounds, you can see that there is an acoustic crosstalk distortion intrinsically related to loudspeaker playback.
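The pulse bookkeeping above can be sketched in a few lines. The extra delay and attenuation of the path from each loudspeaker to the opposite ear are made-up numbers (assumptions), as is the recorded pulse pair; the point is only to count the arrivals at each ear.

```python
def ear_arrivals(recording, cross_delay_ms=0.25, cross_gain=0.7):
    """List what reaches each ear when a two-channel recording is played back.

    recording: {"L": [(time_ms, level), ...], "R": [...]} pulses per channel.
    cross_delay_ms / cross_gain: assumed extra delay and attenuation of the
    path from a loudspeaker to the opposite ear (0 gain = no crosstalk).
    """
    ears = {"L": [], "R": []}
    for channel, pulses in recording.items():
        opposite = "R" if channel == "L" else "L"
        for time_ms, level in pulses:
            ears[channel].append((time_ms, level, "intended"))
            if cross_gain > 0:
                ears[opposite].append(
                    (time_ms + cross_delay_ms, level * cross_gain, "crosstalk")
                )
    return {ear: sorted(hits) for ear, hits in ears.items()}


# Hypothetical dummy head recording of one pulse coming from the left:
# first and louder in the left channel, later and lower in the right one.
recording = {"L": [(0.0, 1.0)], "R": [(0.6, 0.5)]}

speakers = ear_arrivals(recording)
headphones = ear_arrivals(recording, cross_gain=0.0)

for ear in ("L", "R"):
    for time_ms, level, kind in speakers[ear]:
        print(f"speakers, {ear} ear: t={time_ms:.2f} ms level={level:.2f} ({kind})")
print("arrivals with speakers:", sum(len(h) for h in speakers.values()))
print("arrivals with headphones:", sum(len(h) for h in headphones.values()))
```

Note that in this toy case the crosstalk from the left speaker even reaches the right ear before the pulse that was intended for it.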

Since the pinna filtering from the dummy head is fired into the listening room and interacts with its acoustics, there is a tonal coloration even when you do not attempt to tackle acoustic crosstalk. Engineers try to compensate for it to make such recordings more compatible with loudspeaker playback:

Some previous reporting has seemed to indicate the "+" is related to filters developed by Professor Edgar Choueiri of the 3-D Audio and Applied Acoustics (3D3A) Laboratory at Princeton University.
This is not the case. The "+" indicates that the EQ changes due to the pinna effects on the tonal character imparted on the sound has been restored to neutral EQ using carefully chosen compensation curves.
Read more at https://www.innerfidelity.com/conte...dphone-demonstration-disc#IzEmD8iz8lEHGfSo.99

To make things worse, typical listening rooms have boundaries that produce early reflections. Early reflections arrive at your eardrums close enough in time to the direct sound to confuse your brain: more “phantom pulses”, or distortions in the time domain. A short video explaining room acoustics:

There is also one more variable, which is speaker directivity. One concise explanation:

sander99 said:
The whole thing makes perfect sense to me. It is consistent with what I know about vibration sources and wave propagation.
For example one thing that is useful to know is that a vibration source that is considerably larger than the wavelength it produces will beam, a vibration source that is considerably smaller than the wavelength it produces will spread the waves.
A vertical long and narrow (compared to the wavelength it is producing) shaped vibration source will beam vertical and spread horizontal. With the inverse square law you refer to the rate of propagation loss as a function of the distance?
That law only holds for spheric shaped wavefronts from point sources. For example line sources have cylindrical shaped wavefronts and have propagation loss linear with the distance.
Vertical line sources beam in the vertical and have wide dispersion in the horizontal.
An horizontal array of small drivers all playing in phase acts as a horizontal line array and will in the horizontal have a narrower dispersion pattern (getting narrower with higher frequencies).
By manipulating the relative phase between the drivers the dispersion pattern can be made narrower (or wider), and the direction can be changed.
For example delaying the inner drivers compared to the outer drivers will further narrow the beam. (The classic Quad ESL63 electrostatic loudspeaker with concentric circular stators uses the opposite principle: to decrease directivity the outer rings receive a delayed signal). (...)
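A minimal sketch of the array behaviour described in the quote above, treating each small driver as an ideal point source (an assumption; real drivers have their own directivity). The driver count, spacing and function names are hypothetical, and the sign convention for the steering delays is a choice made for this example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def steering_delays(n_drivers, spacing_m, steer_deg, c=SPEED_OF_SOUND):
    """Per-driver delays (seconds) that point the main lobe towards steer_deg."""
    return [n * spacing_m * math.sin(math.radians(steer_deg)) / c
            for n in range(n_drivers)]


def array_response(freq_hz, angle_deg, n_drivers=8, spacing_m=0.05,
                   delays_s=None, c=SPEED_OF_SOUND):
    """Normalised far-field level (0..1) of a straight row of point sources.

    angle_deg is measured from the axis perpendicular to the row.
    With delays_s=None all drivers play in phase.
    """
    if delays_s is None:
        delays_s = [0.0] * n_drivers
    omega = 2 * math.pi * freq_hz
    total = 0j
    for n in range(n_drivers):
        # Path length difference of driver n towards angle_deg, as a time.
        geometric_delay = n * spacing_m * math.sin(math.radians(angle_deg)) / c
        total += cmath.exp(1j * omega * (geometric_delay - delays_s[n]))
    return abs(total) / n_drivers


for freq in (500, 1000, 2000, 4000):
    level = array_response(freq, angle_deg=30)
    print(f"{freq:5d} Hz, 30 deg off axis, in phase: {level:.2f}")

steered = array_response(4000, 30, delays_s=steering_delays(8, 0.05, 30))
print(f" 4000 Hz, 30 deg off axis, steered to 30 deg: {steered:.2f}")
```

With all drivers in phase, the level 30 degrees off axis drops as frequency rises (the beam narrows); with the steering delays, the full level is moved to 30 degrees instead.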

A rather long video, but Anthony Grimani, while talking at Home Theater Geek by Scott Wilkinson about room acoustics, gives a good explanation about speaker directivity (around 30:00):

Higher or lower speaker directivity may be preferred according to your aim.

Finally, the sum of two HRTF filterings (the dummy head's and yours) may also introduce comb filtering distortions (constructive and destructive interactions of sound waves).

With "binaural recordings to headphones" and "binaural recordings to loudspeakers" both falling short, what could be an alternative?

Try something easier than a dummy head, such as the ORTF microphone pattern:

For didactic reasons I will avoid going deeper into diaphragm pick up directivity and the myriad of microphone placing angles that one could use in a recording like this (for further details have a look at The Stereo Image - John Atkinson - stereophile.com).

So for the sake of simplicity, think about such an ORTF pattern, as depicted above: just two diaphragms spaced roughly as far apart as human ears (17 cm in the ORTF standard), so you can skip mixing and record directly to the final audio file.

Spectral cues are gone. Some recording engineers place foam disks between the microphones to keep the ILD closer to what would happen with human heads.

An example:

What happens when you play such “ORTF direct to audio file” with loudspeakers? You still have acoustic crosstalk, but you have the illusion that the sound source is between the loudspeakers and to your left.

What happens when you play such “ORTF direct to audio file” with headphones? You don’t have acoustic crosstalk distortion, but ILD and ITD cues from the microphone arrangement don’t match your HRTF and as you turn your head cues remain the same and the horizontal stage collapses.

Are there more alternatives?

Yes, there are several. Some are:

A) Close microphones to stereo mix.

Record each track with a microphone close to the sound source and mix all of them into two channels using ILD to place them in the horizontal soundstage (panning - pan pot).

Reverberation from the recording venue needs to be captured in two other tracks, from a microphone arrangement that allows the preservation of such cues and that is mixed into those two channels.

The ILD of each instrument track (more precisely, the level such track will have in the right and left channels) and the ratio between instrument tracks and reverberation tracks are chosen by the engineer and do not necessarily match what a dummy head would register if all instruments were playing together around it during the recording.
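As an illustration of level panning, here is a sketch of one common pan pot law (constant power, sine/cosine). It is an assumption that any given mixing desk or DAW uses exactly this law, since others exist, and the function names are made up for the example. It shows that the engineer's pan position alone sets the level difference between channels, with no time difference at all.

```python
import math


def constant_power_pan(sample, pan):
    """Split a mono sample into (left, right) channel levels.

    pan: -1.0 = hard left, 0.0 = centre, +1.0 = hard right.
    Sine/cosine law keeps left**2 + right**2 constant.
    """
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0 .. pi/2
    return sample * math.cos(angle), sample * math.sin(angle)


def level_difference_db(left, right):
    """Channel level difference in dB (positive = louder on the left)."""
    return 20.0 * math.log10(abs(left) / abs(right))


for pan in (-0.75, -0.5, 0.0, 0.5):
    left, right = constant_power_pan(1.0, pan)
    print(f"pan {pan:+.2f}: L={left:.3f} R={right:.3f} "
          f"difference={level_difference_db(left, right):+.1f} dB")
```

A hard-panned track (pan of -1.0 or +1.0) ends up in one channel only, which is the situation that sounds so unnatural over headphones.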

You still have acoustic crosstalk when playing back with speakers and soundstage will collapse with headphones.

In this case, if the ILD/ITD levels are unnatural, when using headphones, as @71 dB advocates, adding electronic crossfeed may avoid the unnatural perception that sound sources are only at the left diaphragm, at the right diaphragm and right in the center:

71 dB said:
(...). I have given the "magic" number 98 %.

A lot of recordings of that era were recorded hard panned so that "half" of the instruments were on the left and "half" on the right and perhaps some in the middle. Example: Dave Brubeck: Jazz Impressions of Eurasia.

How many recordings have unnatural ILD and ITD? IDK.
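For the curious, a bare-bones crossfeed could look like the sketch below: each channel receives a delayed, attenuated and low-pass filtered copy of the opposite channel. The delay, gain and cutoff values are illustrative assumptions, not the settings @71 dB recommends, and actual crossfeed designs differ in detail.

```python
import math


def crossfeed(left, right, sample_rate=44100, delay_s=0.0003,
              gain=0.3, cutoff_hz=700.0):
    """Return (left_out, right_out) with simple electronic crossfeed added.

    left, right: lists of samples of equal length.
    All parameter values are assumed, for illustration only.
    """
    delay = int(round(delay_s * sample_rate))
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)

    def lowpass(samples):
        # One-pole low-pass: the head shadows mostly the high frequencies.
        out, state = [], 0.0
        for s in samples:
            state += alpha * (s - state)
            out.append(state)
        return out

    def delayed(samples):
        # Shift by `delay` samples, keeping the original length.
        return ([0.0] * delay + list(samples))[:len(samples)]

    feed_from_left = delayed(lowpass(left))
    feed_from_right = delayed(lowpass(right))
    left_out = [l + gain * r for l, r in zip(left, feed_from_right)]
    right_out = [r + gain * l for r, l in zip(right, feed_from_left)]
    return left_out, right_out


# A source hard panned to the left: nothing at all in the right channel.
left_in = [1.0] * 1000
right_in = [0.0] * 1000
left_out, right_out = crossfeed(left_in, right_in)
print("right channel before:", right_in[-1], "after:", round(right_out[-1], 3))
```

After crossfeed the right ear is no longer completely silent, so the source is heard off to the left rather than glued to the left transducer.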

B) Mix multichannel with level panning.

In stereo, only sources intended to emanate from the exact speaker locations are present correctly, as they are hard-panned and you localize the speaker and the source in the same place. Every other position depends on phantom imaging, which in stereo, in order to work, your head must be precisely equidistant between and from both speakers. That makes the listening window for stereo a single point, which is itself imperfect, because every phantom image has acoustic crosstalk built in. Multichannel reduces that problem by placing an actual speaker at the primary source locations. And more channels works even better.
Credit: @pinnahertz

@pinnahertz describes some historical facts about multichannel recording here.

The main mixing stage at DTS in Calabasas. Ty Pendlebury/CNET

Even with more channels, there is still some level of acoustic crosstalk when playing back through loudspeakers in a regular room, unless all sounds/tracks are hard-panned.

You just can’t convey proximity as one would in reality (a bee flying close to your head; or the saxophonist from the Chesky recording above if the speakers are placed further away than the distance at which the saxophonist was playing when he was recorded...).

C) Record with Ambisonics.

This is interesting as the recording is made with a tetrahedron microphone (or an eigenmike, as close as possible to measure a sound field at a single point) and the spherical harmonics are decoded to be played back into an arrangement of speakers around the listener.

So with Ambisonics the spatial effect is not derived from two microphones but from at least four microphones that also encode height information (The Principles of Quadraphonic Recording Part Two: The Vertical Element By Michael Gerzon), and the sound is acoustically filtered by the user's own HRTF at playback.
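To make the encoding side concrete, a sketch of traditional first-order B-format encoding of a single mono source follows. It assumes the classic convention in which W is scaled by 1/sqrt(2) and azimuth is measured counterclockwise from the front; other normalisations exist, and the function names are made up for this example. It also shows how the channel count of full-sphere ambisonics grows with the order, as (order + 1) squared.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into first-order B-format (W, X, Y, Z).

    Assumed convention: azimuth 0 = front, +90 = left; elevation +90 = up;
    W carries the omnidirectional part, scaled by 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(az) * math.cos(el)  # front-back
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down (height)
    return w, x, y, z


def channels_for_order(order):
    """Number of spherical harmonic channels for a full-sphere order."""
    return (order + 1) ** 2


for name, az, el in (("front", 0, 0), ("left", 90, 0), ("above", 0, 90)):
    w, x, y, z = encode_first_order(1.0, az, el)
    print(f"{name:5s}: W={w:.3f} X={x:.3f} Y={y:.3f} Z={z:.3f}")

for order in (1, 2, 3):
    print(f"order {order}: {channels_for_order(order)} channels")
```

The Z channel is what a plain horizontal stereo or surround mix lacks: it is the only one that changes when the source moves up or down.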

There are two problems.

The first one is that you need a decoder.

The second one is that at high frequencies the math proves that you need too many loudspeakers to be practical:

Unfortunately, arguments from information theory can be used to show that to recreate a sound field over a two-metre diameter listening area for frequencies up to 20kHz, one would need 400,000 channels and loudspeakers. These would occupy 8GHz of bandwidth, equivalent to the space used up by 1,000 625-line television channels!
Surround-sound psychoacoustics - Michael Gerzon

Although the spherical harmonics seem more mathematically elegant, I still have not figured out how acoustic crosstalk in listening rooms - or instead auralization with headphones without adding electronic crosstalk - affects the possibility of conveying sound fields and proximity, nor whether crosstalk cancellation is feasible in high order ambisonics ready listening rooms with high directivity loudspeakers.

Localisation of Elevated Sources in Higher-Order Ambisonics
Paul Power, Chris Dunn, Bill Davies, Jos Hirst
BRITISH BROADCASTING CORPORATION

Let’s hope third order ambisonics, eigenmikes and clever use of psychoacoustics are good enough! See item 2.C of the post #2 below or here to have an idea of such path.

If ambisonics spherical harmonics decoding already solves acoustic crosstalk at low and medium frequencies, then the only potential negative variables would be the number of channels for high frequencies and the listening room early reflections. That would in fact be an advantage of convolving a high density/resolution HRTF/HRIR (when you can decode to an arbitrarily high number of playback channels) instead of an interpolated low density/resolution HRIR or BRIR (of sixteen discrete virtual sources, for instance) when binauralizing ambisonics over headphones. A current, state of the art example of High-Order-Ambisonics-to-binaural rendering DSP:

BACCH-hoa

https://www.theoretica.us/bacch-dsp/

The problem is that, currently, HRTFs are acoustically measured in anechoic chambers, a costly and time consuming procedure:

How Immersive Sound Brings Mixed Reality to Life - Alice Bonasio

The Quietest Place on Earth?

Veritasium - Can silence actually drive you crazy?

In the following video Professor Choueiri demonstrates a high density/resolution HRTF acquisition through acoustical measurements:

That is one of the reasons why this path may benefit from easier ways to acquire high density/resolution HRTF such as capturing biometrics and searching for close enough HRTF in databases:

jgazal said:
castleofargh said:

all in all it's only mysterious because we're lacking the tools to look at your head and say "you need that sound", but the mechanisms for the most part are well understood and modeled with success by a few smart people.

3D audio is the secret to HoloLens' convincing holograms

(...)

The HoloLens audio system replicates the way the human brain processes sounds. "[Spatial sound] is what we experience on a daily basis," says Johnston. "We're always listening and locating sounds around us; our brains are constantly interpreting and processing sounds through our ears and positioning those sounds in the world around us."

The brain relies on a set of aural cues to locate a sound source with precision. If you're standing on the street, for instance, you would spot an oncoming bus on your right based on the way its sound reaches your ears. It would enter the ear closest to the vehicle a little quicker than the one farther from it, on the left. It would also be louder in one ear than the other based on proximity. These cues help you pinpoint the object's location. But there's another physical factor that impacts the way sounds are perceived.

Before a sound wave enters a person's ear canals, it interacts with the outer ears, the head and even the neck. The shape, size and position of the human anatomy add a unique imprint to each sound. The effect, called Head-Related Transfer Function (HRTF), makes everyone hear sounds a little differently.

These subtle differences make up the most crucial part of a spatial-sound experience. For the aural illusion to work, all the cues need to be generated with precision. "A one-size-fits-all [solution] or some kind of generic filter does not satisfy around one-half of the population of the Earth," says Tashev. "For the [mixed reality experience to work], we had to find a way to generate your personal hearing."

His team started by collecting reams of data in the Microsoft Research lab. They captured the HRTFs of hundreds of people to build their aural profiles. The acoustic measurements, coupled with precise 3D scans of the subjects' heads, collectively built a wide range of options for HoloLens. A quick and discreet calibration matches the spatial hearing of the device user to the profile that comes closest to his or hers.

(...)

A method for efficiently calculating head-related transfer functions directly from head scan point clouds
Authors: Sridhar, R., Choueiri, E. Y.
Publication: 143rd Convention of the Audio Engineering Society (AES 143)
Date: October 8, 2017

A method is developed for efficiently calculating head-related transfer functions (HRTFs) directly from head scan point clouds of a subject using a database of HRTFs, and corresponding head scans, of many subjects. Consumer applications require HRTFs be estimated accurately and efficiently, but existing methods do not simultaneously meet these requirements. The presented method uses efficient matrix multiplications to compute HRTFs from spherical harmonic representations of head scan point clouds that may be obtained from consumer-grade cameras. The method was applied to a database of only 23 subjects, and while calculated interaural time difference errors are found to be above estimated perceptual thresholds for some spatial directions, HRTF spectral distortions up to 6 kHz fall below perceptual thresholds for most directions.

Errata:

In section 3.2 on page 4, the last sentence of the first paragraph should read “…and simple geometrical models of the head…”.

(...)
In the past, the way to acquire unique HRTF profiles was to fit miniature microphones in your ears. You would have to remain completely still inside an anechoic chamber for an extended period of time to take the necessary measurements. Thanks to dedicated computer modelling, those days are finally over. Using their smartphone’s camera, users will be able to scan themselves, gathering sufficient data for IDA Audio and Genelec to accurately 3D model and then create the unique HRTF filter set for personal rendering of 3D audio. Users will also be able to choose to undertake the scan with a designated third party if preferred. Based on years of comprehensive research, IDA Audio’s modelling algorithms provide precision that matches Genelec’s dedication to the accuracy and acoustic transparency of reproduced sound.
(...)

jgazal said:

Fascinating research in 3D Audio and Applied Acoustics (3D3A) Laboratory at Princeton University:

Models for evaluating navigational techniques for higher-order ambisonics - Joseph G. Tylka and Edgar Y. Choueiri
Virtual navigation of three-dimensional higher-order ambisonics sound fields (i.e., sound fields that have been decomposed into spherical harmonics) enables a listener to explore an acoustic space and experience a spatially-accurate perception of the sound field. Applications of sound field navigation may be found in virtual-reality reproductions of real-world spaces. For example, to reproduce an orchestral performance in virtual reality, navigation of an acoustic recording of the performance may yield superior spatial and tonal fidelity compared to that produced through acoustic simulation of the performance. Navigation of acoustic recordings may also be preferable when reproducing real-world spaces for which computer modeling of complex wave-phenomena and room characteristics may be too computationally intensive for real-time playback and interaction.
Recently, several navigational techniques for higher-order ambisonics have been developed, all of which may degrade localization information and induce spectral coloration. The severity of such penalties needs to be investigated and quantified in order to both compare existing navigational techniques and develop novel ones. Although subjective testing is the most direct method of evaluating and comparing navigational techniques, such tests are often lengthy and costly, which motivates the use of objective metrics that enable quick assessments of navigational techniques.

D) Wavefield synthesis.

This one is also interesting, but complex as the transducers tend to infinity (just kidding, but there are more transducers). You will need to find details somewhere else, like here. And it is obviously costly!

E) Pure object based.

Record each sound source on its own track and don’t mix them before distribution. Just tag them with metadata describing their coordinates.

Let the digital player at the listening room mix all tracks considering the measured high density/resolution HRTF of the listener (or lower density/resolution with better interpolation algorithms) and room modeling to calculate room reflections and reverberation (“Accurate real-time early reflections and reverb calculations based on user-controlled room geometry and a wide range of wall materials” from bacch-dsp binaural synthesis).

Play back with crosstalk cancellation (or binaural beamforming with a horizontal array of transducers such as the yarra sound bar; more details further on in this post) or use headphones with head-tracking without adding electronic crossfeed.

This is perfect for the realms of virtual environments for video games, in which the user interacts with the graphic and narrative context, determining the future states of sound objects.
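The first step a player has to take with such objects can be sketched as follows: turn the coordinates in the metadata into a direction and distance relative to the listener, then pick the closest direction available in the listener's measured HRTF set. The field names, the axis convention and the tiny list of measured directions are all hypothetical; no real object-audio format is being described here.

```python
import math
from typing import NamedTuple


class SoundObject(NamedTuple):
    """A track tagged with position metadata (metres, listener at origin)."""
    name: str
    x: float  # +x = to the right of the listener
    y: float  # +y = in front of the listener
    z: float  # +z = above the listener


def direction_and_distance(obj):
    """Return (azimuth_deg, elevation_deg, distance_m) of an object.

    Assumed convention: azimuth 0 = front, +90 = right.
    """
    distance = math.sqrt(obj.x ** 2 + obj.y ** 2 + obj.z ** 2)
    azimuth = math.degrees(math.atan2(obj.x, obj.y))
    elevation = math.degrees(math.asin(obj.z / distance)) if distance > 0 else 0.0
    return azimuth, elevation, distance


def unit_vector(azimuth_deg, elevation_deg):
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.sin(az) * math.cos(el), math.cos(az) * math.cos(el), math.sin(el))


def nearest_measured_direction(azimuth_deg, elevation_deg, measured_directions):
    """Pick the measured (azimuth, elevation) pair closest to the target."""
    target = unit_vector(azimuth_deg, elevation_deg)

    def closeness(direction):
        # Dot product of unit vectors: larger means a smaller angle between them.
        return sum(a * b for a, b in zip(target, unit_vector(*direction)))

    return max(measured_directions, key=closeness)


# A very coarse, made-up HRTF grid; a dense set needs far less interpolation.
measured = [(0, 0), (90, 0), (180, 0), (-90, 0), (0, 90)]
bee = SoundObject("bee", x=0.3, y=0.05, z=0.0)

azimuth, elevation, distance = direction_and_distance(bee)
print(f"{bee.name}: azimuth {azimuth:.0f} deg, elevation {elevation:.0f} deg, "
      f"distance {distance:.2f} m")
print("closest measured direction:", nearest_measured_direction(azimuth, elevation, measured))
```

The coarser the measured grid, the larger the jump between the true direction and the closest filter, which is why density/resolution (or good interpolation) matters.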

What are the problems with a pure object based approach? As we have just seen, it is costly and time consuming to measure the HRTF of the listener. It is computationally intensive to mix those tracks and calculate room reflections and reverberation. You just can’t calculate complex rooms. So you miss the acoustic signature of really unique venues.

Atmos and other hybrid multichannel object based codecs also use beds to preserve some cues. But such beds and the panning of objects between speakers also introduce the distortions from the chains mentioned before (unless they also rely on spherical harmonics computation?).

Going the DSP brute force route to binauralization and to cancel (or avoid) crosstalk

Before we start talking about DSPs, you may want to grasp how they work mathematically, and one fundamental concept for that is the Fourier transform. I haven’t found better explanations than the ones made by Grant Sanderson (YouTube channel 3blue1brown):
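If you prefer to see it in code, here is a tiny, deliberately slow discrete Fourier transform that recovers the partials of a synthetic 440 Hz tone, which ties back to the piano versus oboe question at the start of this post. The partial levels are made up (they are not measurements of any instrument) and the sample rate is chosen only so that each partial falls exactly on an analysis bin.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Amplitude of each frequency bin up to half the sample rate (naive DFT)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        magnitudes.append(2.0 * abs(acc) / n)
    return magnitudes


SAMPLE_RATE = 4000  # Hz, assumed
N = 400             # samples, so each bin is SAMPLE_RATE / N = 10 Hz wide

# Hypothetical timbre: a 440 Hz fundamental plus two weaker harmonics.
partials = {440.0: 1.0, 880.0: 0.5, 1320.0: 0.25}
signal = [sum(level * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
              for freq, level in partials.items())
          for t in range(N)]

bin_hz = SAMPLE_RATE / N
peaks = [(round(k * bin_hz), round(m, 2))
         for k, m in enumerate(dft_magnitudes(signal)) if m > 0.1]
for freq, level in peaks:
    print(f"{freq:5d} Hz: level {level}")
```

The time-domain signal is a single wiggly line, but the transform separates it back into the partials that give the sound its timbre. Real DSPs use the fast Fourier transform, which gives the same result far more efficiently.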

A. The crosstalk cancellation route of binaural masters

So what does Professor Edgar Choueiri advocate?

Use binaural recordings and play them back with his Bacch crosstalk cancellation algorithm (his processor also measures a binaural room impulse response to enhance the filter, and uses head tracking and interpolation to relieve the head movement range restrictions one would otherwise have).
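To give an idea of what "cancelling crosstalk" means mathematically, here is a textbook sketch at a single frequency: the paths from the two speakers to the two ears form a 2x2 matrix, and the canceller is its inverse. This is plain matrix inversion under a made-up symmetric model (assumed crosstalk gain and delay, no room, no head movement); it is not the Bacch filter, whose design is not described in this post.

```python
import cmath
import math


def acoustic_matrix(freq_hz, cross_gain=0.7, cross_delay_s=0.00025):
    """Speaker-to-ear transfer matrix at one frequency.

    Rows are ears (left, right), columns are speakers (left, right).
    The direct paths are normalised to 1; the crosstalk paths are an
    attenuated, delayed copy (both values are assumptions).
    """
    cross = cross_gain * cmath.exp(-2j * math.pi * freq_hz * cross_delay_s)
    return [[1.0 + 0j, cross], [cross, 1.0 + 0j]]


def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def multiply_2x2(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def canceller(freq_hz):
    """Filter matrix to apply to the two channels before the speakers."""
    return invert_2x2(acoustic_matrix(freq_hz))


# Speakers and room after the canceller: ideally each channel reaches
# only its own ear, i.e. the product is the identity matrix.
result = multiply_2x2(acoustic_matrix(1000.0), canceller(1000.0))
for row in result:
    print([f"{abs(value):.3f}" for value in row])
```

In practice the matrix changes with frequency, with the room and with every head movement, which is why the measured BRIR and the head tracking mentioned above matter.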

Before we continue with the Bacch filter, a few notes about room impulse responses and a reminder about speaker directivity.

There is a room impulse response (RIR) for each pair of (Length x Width x Height) coordinates in a given room. A RIR can be measured by playing a chirp sweep from 20 Hz to 20 kHz from a source at a given coordinate. The microphone will capture early reflections and reverberation at another given coordinate. Change those source and microphone coordinates/spots and the RIR is going to be different. Room enhancement DSPs use the RIR to compute equalization for a given listening spot (Digital Room Equalisation - Michael Gerzon), but if you want even bass response across more listening spots then you probably need more subwoofers (Subwoofers: optimum number and locations - Todd Welti).

A BRIR also depends on the coordinates at which it is measured (and on the looking angle!). But here you measure with two microphones at the same time, at the ear canal entrances of a human head or dummy head. That is one of the reasons why you need to capture one BRIR for each listening spot where you want the crosstalk cancellation filter to work. Such a BRIR integrates not only the combined acoustic signature of loudspeakers and room, but also the HRTF of the dummy head or of the human wearing the microphones.

Here a speaker with higher directivity may improve the performance of the algorithm.

The dummy head HRTF used in the binaural recording and your own HRTF don’t match, but the interaction between speakers/room and your head and torso “sums” (filters in) your own HRTF.

The “sum” (combined filtering) of two HRTF filterings may also introduce distortions (maybe negligible unless you want absolute localisation/spatial precision?).

Stereo recordings with natural ILD and ITD, like the ORTF described above, render an acceptable 180 degrees horizontal sound stage. Read about the concept of proximity in the Bacch Q&A Professor Choueiri has on the 3D3A website at Princeton.

Must watch videos of Professor Choueiri explaining crosstalk, his crosstalk cancellation filter and his flagship product:

Professor Choueiri explaining sound cues, binaural synthesis, headphone reproduction, ambisonics, wave field synthesis, among other concepts:

B. The crosstalk avoidance binauralization route

B.1 The crosstalk avoidance binauralization with headphones

And what can you do with DSPs like the Smyth Research Realiser A16?

Such processing circumvents the difficulty in acquiring an HRTF (or head related impulse response - HRIR) by measuring the user's binaural room impulse responses - BRIR (or personal room impulse response - PRIR, when you want to say that it refers to the unique BRIR of the listener), which also include the playback room acoustic signature.

Before we continue with the Realiser processor, a few notes about personalization of BRIRs. The Smyth Research Exchange site will allow you to use your PRIR made with a single tweeter to personalize BRIRs made by other users in rooms you may be interested in acquiring. So the performance will be better than just using a BRIR that may poorly match yours.

So after you measure a PRIR, the Realiser processor convolves the inputs with such PRIR, applies a filter to take out the effects of wearing the headphones you have chosen, adds electronic/digital crossfeed and dynamically adjusts the cues (headtracking plus interpolation) during headphone playback, to emulate/mimic virtual speakers as you would hear them in the measured room, with the measured speaker(s), at the measured coordinates. Bad room acoustics will result in bad acoustics in the emulation.
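The core of such processing, convolution with a pair of impulse responses selected by head angle, can be sketched as below. This is only a toy under loud assumptions: the impulse responses are two samples long and invented, the interpolation is a plain linear crossfade between the two nearest measured head angles, and none of the names come from Smyth Research. How the Realiser actually interpolates is not described in this post.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def interpolate_ir(ir_a, ir_b, weight):
    """Linear crossfade between two equally long impulse responses."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(ir_a, ir_b)]


def render_virtual_speaker(signal, prir_by_angle, head_angle_deg):
    """Render one speaker feed to (left_ear, right_ear) headphone signals.

    prir_by_angle: {head_angle_deg: (left_ear_ir, right_ear_ir)} as measured
    with the listener looking in a few different directions.
    """
    angles = sorted(prir_by_angle)
    head = min(max(head_angle_deg, angles[0]), angles[-1])  # clamp to range
    lower = max(a for a in angles if a <= head)
    upper = min(a for a in angles if a >= head)
    weight = 0.0 if upper == lower else (head - lower) / (upper - lower)
    ears = []
    for ear in (0, 1):
        ir = interpolate_ir(prir_by_angle[lower][ear],
                            prir_by_angle[upper][ear], weight)
        ears.append(convolve(signal, ir))
    return ears[0], ears[1]


# Invented two-sample responses for head angles of -30 and +30 degrees.
toy_prir = {
    -30: ([1.0, 0.0], [0.0, 0.5]),
    30: ([0.0, 0.5], [1.0, 0.0]),
}
click = [1.0]
for angle in (-30, 0, 30):
    left_ear, right_ear = render_virtual_speaker(click, toy_prir, angle)
    print(f"head at {angle:+3d} deg: left ear {left_ear}, right ear {right_ear}")
```

As the tracked head angle changes, the filters change with it, so the virtual speaker stays put in the room instead of turning with your head.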

But you can avoid the addition of electronic/digital crossfeed to emulate what beamforming or a crosstalk cancellation algorithm would do with real speakers (see also here and here). This feature is interesting for playing back binaural recordings, particularly those made with microphones in your own ears (or here).

The Realiser A16 also allows equalization in the time domain (which is very useful to tame bass overhang).

Add tactile/haptic transducers and you feel bone conducting bass not affected by the acoustics of your listening room. Note also that the power requirements to feed the headphone cavity are lower than feeding your listening room (thanks to brilliant @JimL11 for elaborating the power concept here) and that the intermodulation distortion characteristics of the speakers amplifier may then be substituted by those from the headphone amplifier (potentially lower IMD).

In the following (must hear) podcast interview (in English), Stephen Smyth explains concepts of acoustics, psychoacoustics and the features and compromises of the Realiser A16, like bass management, PRIR measurement, personalization of BRIRs, etc.

He also goes further and describes the lack of an absolute neutral reference for headphones and the convenience of virtualizing a room with state-of-the-art acoustics, for instance “A New Laboratory for Evaluating Multichannel Audio Components and Systems at R&D Group, Harman International Industries Inc.”, with your own PRIR and HPEQ for counter-filtering your own headphones (@Tyll Hertsens, a method that personalizes room/pinnae and pinnae/headphones idiosyncratic coupling/filtering and keeps the acoustic basis for the Harman Listening Target Curve).

Extra interview at CanJam SoCal 2018 (by @kp297):

kp297 said:

Stephen Smyth at Canjam SoCal 2018 introduces the Realiser A16 (thanks to innerfidelity hosted by @Tyll Hertsens):

Stephen Smyth once again, but now introducing the legacy Realiser A8:

Is there a caveat? A small one, but yes: visual cues and sound cues interact, and there is neuroplasticity:

VIRTUALISATION PROBLEMS

Conflicting aural and visual cues

Even if headphone virtualisation is acoustically accurate, it can still cause confusion if the aural and visual impressions conflict. [8] If the likely source of a sound cannot be identified visually, it may be perceived as originating from behind the listener, irrespective of auditory cues to the contrary. Dynamic head-tracking strengthens the auditory cues considerably, but may not fully resolve the confusion, particularly if sounds appear to originate in free space. Simple visible markers, such as paper speakers placed at the apparent source positions, can help to resolve the remaining audio-visual perceptual conflicts. Generally the problems associated with conflicting cues become less important as users learn to trust their ears.

http://www.smyth-research.com/articles_files/SVSAES.pdf

Kaushik Sunder, while talking about immersive sound at Home Theater Geek by Scott Wilkinson, mentions how we learn from childhood to analyze our very own HRTF and that such analysis is a constant learning process as we grow older and our pinnae get larger. But he also mentions the short-term effects of neuroplasticity at 32:00:

Let’s hope Smyth Research can integrate the Realiser A16 with virtual reality headsets displaying stereoscopic photographs of measured rooms, for visual training purposes and virtual reality.

B.1 The crosstalk avoidance binauralization with a phased array of transducers

So what is binaural beamforming?

It is not crosstalk cancellation but the clever use of a vertical phased array of transducers to control sound directivity, resulting in a similar effect.

One interesting description of binaural beamforming (continuation from @sander99 post above):

sander99 said:
(...) A beam can be shifted [Edit: I mean the angle can be shifted] to the left or the right by relative delaying drivers from one side to the other.
More independent beams can be created by superimposing the required combinations of driver signals upon eachother.
(With 'one combination of driver signals' I meant the 12 driver signals for one beam. So to get more beams more combinations (plural) are added together.) (Each driver contributes to all the beams).
By the way: the Yarra only beams in the horizontal, in the vertical it has wide dispersion (narrowing down with increasing frequency of course).
Low frequencies are harder to beam in narrow beams. This is consistant with the fact that in the video with the three different sources over three beams they used mainly mid and high frequencies probably to avoid low frequencies "crossing over".
For the normal intended use of the Yarra this should not cause a problem because the lower the frequency the less vital role it plays for localisation, so don't fear that the Yarra will sound equally thin as in that video.
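
The delay-and-superimpose idea in the quote above can be sketched roughly as follows. This is only a textbook delay-and-sum illustration under my own assumptions (a straight line of equally spaced drivers, far-field listener, delays rounded to whole samples, no per-frequency filtering), not the Yarra's actual processing. The 12 drivers come from the quote; the 3 cm spacing, the sample rate and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature


def steering_delays(num_drivers, spacing_m, angle_deg):
    """Per-driver delays in seconds that tilt the beam by angle_deg.

    Sign convention (mine): a positive angle steers the beam toward the
    higher-numbered drivers, so those drivers fire later.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    delays = [n * step for n in range(num_drivers)]
    earliest = min(delays)  # shift so that no driver needs a negative delay
    return [d - earliest for d in delays]


def driver_feeds(beams, num_drivers, spacing_m, sample_rate):
    """Superimpose several beams; every driver contributes to every beam.

    beams: list of (samples, angle_deg) pairs.
    Returns one list of samples per driver.
    """
    plans = []
    for samples, angle_deg in beams:
        delays = steering_delays(num_drivers, spacing_m, angle_deg)
        plans.append((samples, [round(d * sample_rate) for d in delays]))
    length = max(len(samples) + max(delays) for samples, delays in plans)
    feeds = [[0.0] * length for _ in range(num_drivers)]
    for samples, delays in plans:
        for driver, delay in enumerate(delays):
            for i, x in enumerate(samples):
                feeds[driver][delay + i] += x
    return feeds


# One beam aimed 20 degrees to one side with 12 drivers spaced 3 cm apart.
for n, d in enumerate(steering_delays(12, 0.03, 20.0)):
    print(f"driver {n:2d}: {d * 1e6:7.1f} microseconds")
```

Two beams aimed at the left and right ear would simply be two entries in `beams`, one carrying the left binaural channel and one the right.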

Peter Otto interview about binaural beamforming at Home Theater Geek by Scott Wilkinson:

Unfortunately the Yarra product does not offer a method to acquire a PRIR. So you can enjoy binaural recordings and stereo recordings with natural ILD and ITD. But you will not experience precise localization without inserting some personalized PRIR or HRTF.

Living in a world with (or without) crosstalk

If you want to think about the interactions between the way the content is recorded and the playback rig and environment, and read another very long post, visit this thread: To crossfeed or not to crossfeed? That is the question...

Question 2

I find this question somewhat harder.

You already know that mixing engineers place sound sources between speakers using level differences. Some may also use ITD in conjunction, which helps with a better rendering when you use binaural beamforming or crosstalk cancellation with loudspeaker playback.
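
As a rough illustration of placing a source with level differences, here is a sketch of the common constant-power (sine/cosine) pan law, with an optional delay on the quieter channel standing in for a crude time difference. The pan law is one convention among several and the function names are my own; no particular console or DAW is being described.

```python
import math


def pan_gains(position):
    """Constant-power gains for position in [-1 (hard left), +1 (hard right)]."""
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must be between -1 and +1")
    angle = (position + 1.0) * math.pi / 4.0  # 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def level_difference_db(position):
    """Right-minus-left level difference in dB (infinite when hard panned)."""
    left, right = pan_gains(position)
    if left < 1e-12:
        return math.inf
    if right < 1e-12:
        return -math.inf
    return 20.0 * math.log10(right / left)


def pan(samples, position, delay_samples=0):
    """Pan a mono track; optionally delay the quieter channel as well."""
    left_gain, right_gain = pan_gains(position)
    left = [left_gain * x for x in samples]
    right = [right_gain * x for x in samples]
    pad = [0.0] * delay_samples
    if position > 0:      # source on the right: the left channel arrives late
        left, right = pad + left, right + pad
    elif position < 0:    # source on the left: the right channel arrives late
        left, right = left + pad, pad + right
    else:                 # centered: no time difference, keep lengths equal
        left, right = left + pad, right + pad
    return left, right


for p in (-0.5, 0.0, 0.5):
    print(p, pan_gains(p), level_difference_db(p))
```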

Recordings made with coincident microphones straight to the final audio file incorporate the early reflections and reverberation at a given spot in the venue.

Instruments recorded with close microphones on several tracks and then mixed with two other tracks, or with digital processing, that incorporate the early reflections and reverberation at a given spot in the venue may theoretically give a sense of depth with two-loudspeaker playback. But what is the level ratio between those tracks?

Can you mix them keeping the same ratio one would have with a binaural recording alone?

But leaving early reflections and bass acoustic problems untreated in the playback room may also nullify any intelligibility of the depth you could theoretically convey.

I am curious to know how the processors that increase soundstage depth, which @bigshot mentioned here, work. An example would be the Illusonic Immersive Audio Processor, from Switzerland, a company that also provides the upmixing algorithm in the Realiser A16, which extracts direct sound, early reflections and diffuse field from the original content and plays them separately over different loudspeakers:

sander99 said:
Here is a link to a video about it:

Anyway, without a binaural recording and crosstalk cancellation (or binaural beamforming), playing back plain vanilla stereo recordings, you could only hear an impression (and probably a wrong one) of elevation by chance, in other words, if some distortion downstream is similar to your very own pinna spectral filtering. With multichannel you can try the method @bigshot describes here.

Going further in the sound science forum.

The following threads are also illustrative of concepts involved in the questions raised:

A) Accuracy is subjective (importance of how content is produced);
B) How do we hear height in a recording with earphones (the role of spectral cues);
C) Are binaural recordings higher quality than normal ones? (someone knowledgeable recommended books you want to read);
D) About SQ (rooms and speakers variations)
E) Correcting for soundstage? (the thread where this post started)
F) The DSP Rolling & How-To Thread (excellent thread about DSP started and edited by @Strangelove424)

To read more about the topics:

Table of Contents

Contributors

Foreword

WIESLAW WOSZCZYK

Acknowledgements

Introduction

AGNIESZKA ROGINSKA AND PAUL GELUSO

1 Perception of Spatial Sound

ELIZABETH M. WENZEL, DURAND R. BEGAULT, AND MARTINE GODFROY-COOPER

Auditory Physiology

Human Sound Localization

Head-Related Transfer Functions (HRTFs) and Virtual Acoustics

Neural Plasticity in Sound Localization

Distance and Environmental Context Perception

Conclusion

2 History of 3D Sound

BRAXTON BOREN

Introduction

Prehistory

Ancient History

Space and Polyphony

Spatial Separation in the Renaissance

Spatial Innovations in Acoustic Music

3D Sound Technology

Technology and Spatial Music

Conclusions and Thoughts for the Future

3 Stereo

PAUL GELUSO

Stereo Systems

Creating a Stereo Image

Stereo Enhancement

Summary

4 Binaural Audio Through Headphones

AGNIESZKA ROGINSKA

Headphone Reproduction

Binaural Sound Capture

HRTF Measurement

Binaural Synthesis

Inside-the-Head Locatedness

Advanced HRTF Techniques

Quality Assessment

Binaural Reproduction Methods

Headphone Equalization and Calibration

Conclusions

Appendix: Near Field

5 Binaural Audio Through Loudspeakers

EDGAR CHOUEIRI

Introduction

The Fundamental XTC Problem

Constant-Parameter Regularization

Frequency-Dependent Regularization

The Analytical BACCH Filter

Individualized BACCH Filters

Conclusions

Appendix A: Derivation of the Optimal XTC Filter

Appendix B: Numerical Verification

6 Surround Sound

FRANCIS RUMSEY

The Evolution of Surround Sound

Surround Sound Formats

Surround Sound Delivery and Coding

Surround Sound Monitoring

Surround Sound Recording Techniques

Perceptual Evaluation

Predictive Models of Surround Sound Quality

7 Height Channels

SUNGYOUNG KIM

Background

Fundamental Psychoacoustics of Height-Channel Perception

Multichannel Reproduction Systems With Height Channels

Recording With Height Channels

Conclusion

8 Object-Based Audio

NICOLAS TSINGOS

Introduction

Spatial Representation and Rendering of Audio Objects

Advanced Metadata and Applications of Object-Based Representations

Managing Complexity of Object-Based Content

Audio Object Coding

Capturing Audio Objects

Tradeoffs of Object-Based Representations

Object-Based Loudness Estimation and Control

Object-Based Program Interchange and Delivery

Conclusion

9 Sound Field

ROZENN NICOL

Introduction

Development of the Sound Field

Higher Order Ambisonics (HOA)

Sound Field Synthesis

Sound Field Formats

Conclusion

Appendix A: Mathematics and Physics of Sound Field

Appendix B: Mathematical Derivation of W, X, Y, Z

Appendix C: The Optimal Number of Loudspeakers

10 Wave Field Synthesis

THOMAS SPORER, KARLHEINZ BRANDENBURG, SANDRA BRIX, AND CHRISTOPH SLADECZEK

Motivation and History

Separation of Sound Objects and Room

WFS Reproduction: Challenges and Solutions

WFS With Elevation

Audio Metadata and WFS

Applications Based on WFS and Hybrid Schemes

WFS and Object-Based Sound Production

11 Applications of Extended Multichannel Techniques

BRETT LEONARD

Source Panning and Spreading

An Immersive Overhaul for Preexisting Content

Considerations in Mixing for Film and Games

Envelopment

Musings on Immersive Mixing

Index

https://www.routledge.com/Immersive...el-Audio/Roginska-Geluso/p/book/9781138900004

P.s.: I cannot thank @pinnahertz enough for all the knowledge he shares!

jgazal · Dec 17, 2017

More examples of attempts at immersive audio:

jgazal said:
gregorio said:

4. It's not a particular challenge but that's irrelevant anyway because almost no one is trying to create a "real world" sound recording or reproduction because in the real world no one perceives the actual sound entering their ears!

You just have to keep coming back to referencing the "real world", a real world which usually does not exist and even when it does, that's not what we're trying to record or reproduce anyway!

I am glad you used the word almost.

It is always a matter of reference or perspective, isn’t it?

But “almost no one” may soon become, or may already be, the wrong estimate.

Let’s see a few examples:

1. Sennheiser Ambeo (first order ambisonics)

2. YouTube VR

YouTube VR recommends content with first order ambisonics, probably downmixed to binaural with a generic HRTF. Potentially better with a Realiser crossfeed-free PRIR.

Attention: use YouTube app to rotate your visual point of view in the monoscopic 360 degree video!

2.A. Cindy Crawford closet - Vogue

Important demonstration for wives!

Cindy’s voice needs to derotate in order to enhance the immersion.

Mobile devices and tablets feed the tracking. Probably better with crosstalk cancellation with speakers or crossfeed-free externalization with headphones.

You may want to follow these instructions:

jgazal said:

Try to listen with loudspeakers in the very near field at more or less +10 and -10 degrees, with two pillows, one in front of your nose in the median plane and the other at the top of your head to avoid ceiling reflections (or get an iPad Air with stereo speakers, touch your nose to the screen and direct the loudspeakers’ sound towards your ears with the palms of your hands; your own head will shadow the crosstalk).

2.B. Bewitched love - Manuel de Falla - Orchestre national d'Île-de-France

Audio content that derotates according to the visual point of view! This is a spot-microphone mix, probably with binaural synthesis.

2.C. Showcase/Showdown Eigenbeams/Ambisonics - mh acoustics

Audio content that derotates according to the visual point of view! This is an eigenmike probably downmixed to first order ambisonics and then downmixed again to binaural with a generic HRTF and finally streamed through YouTube.

With HOA streaming, a personalized HRTF and a Realiser A16, this would be the lowest-distortion path available (what Professor Choueiri calls “being fooled by audio”).
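
For readers wondering what “derotates” means in practice, here is a minimal sketch of counter-rotating a horizontal first-order ambisonic signal by the viewer's head yaw. The conventions are my assumptions (azimuth counted counter-clockwise with 0 degrees in front and +90 degrees to the left, W carrying the unweighted signal, height channel omitted), and I do not know how YouTube implements it internally; this is just the textbook rotation.

```python
import math


def encode_first_order(sample, azimuth_deg):
    """Horizontal first-order encoding of one sample: returns (W, X, Y)."""
    az = math.radians(azimuth_deg)
    return sample, sample * math.cos(az), sample * math.sin(az)


def derotate(w, x, y, head_yaw_deg):
    """Counter-rotate the sound field by the listener's head yaw.

    W (the omnidirectional part) is unaffected by a rotation about the
    vertical axis; only X and Y are mixed.
    """
    yaw = math.radians(head_yaw_deg)
    x_new = x * math.cos(yaw) + y * math.sin(yaw)
    y_new = -x * math.sin(yaw) + y * math.cos(yaw)
    return w, x_new, y_new


def apparent_azimuth_deg(w, x, y):
    """Direction a single encoded source appears to come from."""
    return math.degrees(math.atan2(y, x))


# A source hard left; the viewer turns the head 90 degrees to the left,
# so the source should now appear straight ahead.
field = encode_first_order(1.0, 90.0)
print(apparent_azimuth_deg(*derotate(*field, 90.0)))
```

The derotated W, X, Y would then go to the binaural downmix, so the order of operations is rotate first, apply the HRTF afterwards.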

Can you imagine that with a stereoscopic video?

Ping pong - compare with Ricoh below...

3. Netflix VR

Probably Atmos bed and objects downmixed to binaural with a generic HRTF. Potentially better with a Realiser crossfeed-free PRIR.

4. Google Daydream “Fantastic Beasts” VR

Maybe a mix of first order ambisonics and objects? Or maybe full binaural synthesis (similar to BACCH-dSP)? IDK.

5. Ricoh Theta V with “spatial audio”

Some frustrating content done with 360 monoscopic cameras in which the audio does not derotate.

Probably not Ambisonics.

Shown just to demonstrate the potential of home-made, music-oriented video distribution with spatial accuracy:

Ping pong - compare with eigenmike above...

You may see that those are not minor players in the entertainment market...

gregorio said:

(...)
3. Again, (...). Maybe I would think that producer was an idiot too, maybe the producer really is an idiot but maybe I've just missed the point. Almost all musical developments throughout the history of music (and it's production) have been negatively received by some/many. Regardless, I want to hear what the producer/artists intended, even if they were all idiots!
(...)
Make up your mind, which is it, are we too puristic or spatially ignorant? We can't be both! In actual fact we are neither. (...)
G

If you think about Google Daydream VR, Netflix VR, Samsung VR, Facebook virtual hangouts etc., you will realize that they are whole highways through which immersive sound content formats can reach consumers’ virtual environments.

You just need to use them.

I guess consumers don’t care about the bit depth or the sample rate of distributed content, but I believe they are going to click on sponsored pages and happily pay for streaming/downloads that render audio spatial immersion that is accurately correlated with visual cues.

That is a huge incentive for mastering engineers to pursue spatial accuracy with reference to the original sound-field or virtual designed visual cues, don’t you think?

I know the number of streams on Spotify, Deezer, Pandora, Tunein Radio etc. is higher than that of music streamed in video formats, but is that an intrinsic barrier?

Is there any other objection to distributing music content through such distribution channels and consumers’ virtual environments?

There is a reason why Google has chosen ambisonics to distribute VR content on YouTube VR and Daydream VR.

Perhaps you may financially benefit from reducing your negativity about shifting your reference to the real sound-field or virtual designed visual cues.

gregorio said:

Particularly in this case, as I've been talking about commercial stereo music releases!
G

I liked that!

Added a new video, now with music oriented content.

bigshot · Dec 9, 2017

I actually recently stumbled across some recordings that seem to be trying to recreate a totally natural realistic sound. And they did it in a very non-intuitive way that doesn't follow the rules.

I posted in the lounge about a deal at JPC on multichannel SACDs (4 euros apiece!). I got a big batch of them and after I listened, I immediately went back and ordered the rest. For $300 including shipping, I got about 70 SACDs. The performances are top notch. They were recorded in the 1990s as the Royal Philharmonic was preparing for a 50 year anniversary tour of the world. But as good as the performances are, that isn't what makes them interesting for us here.

The liner notes say that they used as many as 45 microphones to record these. That normally would terrify me because I look to the Living Stereo recordings recorded with two or three mikes as the standard of quality for recording classical music. But they've done something very interesting with all those mikes. I've listened to several of them so far and they appear to be trying to recreate the sound of an orchestra from the audience in a great concert hall, which makes sense for multichannel. But there are differences between the way an orchestra sounds in a concert hall and how it sounds on a record.

DYNAMICS

They have chosen somewhere towards the front of the orchestra section as their optimal point for the recording. It's close enough to get a very broad dynamic range, yet out in the open enough to pick up all of the hall ambience. The dynamics are very wide. These push the range of what I expect the dynamics to be on a classical recording. My spit in the wind estimate is somewhere between 60 and 70dB. This means that you have to listen to these LOUD. At a more moderate listening level, stuff in the quiet parts starts to fade into the distance, and the hall ambience isn't nearly as clearly presented. If you adjust it to the sweet spot, at least 10dB louder than most classical recordings, in a particularly dynamic piece, like Tchaikovsky's Pathetique, the peaks are right at the edge of being too loud. But the conductor modulates the performance so it isn't tiring to listen to. I think the ideal with most classical recordings is based on a theoretical listening position ten or fifteen rows back from the band where the hall has the opportunity to suck up some of the dynamics and the level of the dynamics is more in the 55 to 60dB range. But at the proper listening level, a little more dynamics certainly adds to the natural presence of the sound.

DISTANCE

The engineers on this recording have poured a lot of attention to the secondary depth cues. Instruments in the front of the band have less hall reverberation than ones in the back of the band. The percussion has a great deal of reverb on it- more than in any classical recording I've ever heard in fact. The percussion is also allowed to fall into the dynamics of the band. When there is a triangle hit, you don't hear it because of the volume of it, you hear it because it's a different frequency. Most classical recordings will mike things like that a little closer so they can get the fundamental frequencies of the triangle to cut through. But this recording allows the triangle to sound like a triangle from a distance. The same is true of the woodwinds. Even though you can hear the finest points of the phrasing, you don't hear any clacking of the keys. A lot of audiophiles judge the quality of a recording by whether they can hear tiny details like that, but I've never heard anything like that in a concert hall. The same goes for the sounds the conductor makes or squeaking of chairs... all those things are often audible in recordings, but when you're sitting 40 or 50 feet away from the band, there's no way you can hear any of that. The sounds that are projected sound like they are projected, the ones that are small and localized don't come through. Even though it seems counter-intuitive, details like this are what sound like close miking overlaid over distant miking. It doesn't sound natural. These recordings seem to have a very realistic presentation of depth.

DIRECTIONALITY

I'm still trying to get a grip on this aspect. I suspect that the unique sound of these recordings when it comes to directionality has something to do with the way distance is presented. The soundstage isn't as pinpoint clear as some classical recordings, for instance the three channel Living Stereo recordings. The sound is more diffuse at lower volumes and more directional at louder ones. I'm guessing that is a natural effect of the hall ambience. When something is loud, it cuts through like a laser beam, and when it is quieter, it tends to spread out into the massed sound of the band more. I've heard recordings with "spotlighting" where they will record a solo with a dedicated directional mike so they can mix it in clearly in position. This doesn't sound like that. A solo can go from loud to soft and the directionality will become more focused as it gets louder and more diffuse as it gets softer. Again, it goes against that old audiophile adage that everything in the soundstage must be pinpoint accurate at all times. But it works and it sounds VERY natural. I'm going to listen carefully and see if I can understand this better.

AMBIENCE

The rear channels are almost silent at a moderate listening level. But when you get the volume up to a loud peak level, it blooms out and makes the back of the room feel live. It tends to pull upper mids and high frequencies out into the room a bit. For instance, trumpets seem to jump a little forward when they blare, and the triangle taps, even though they are low level, are echoed a little in the rear. But the lower frequencies tend to mass together into an undifferentiated reverberation. It's all gradated through the frequency range perfectly. I'm guessing this is also a very natural sort of way of handling the ambience. It definitely isn't the sort of ping pong surround that is common on a lot of multichannel mixes, but it sounds a little different than the way Pentatone or my blu-rays of classical music sound. The reverberation isn't as sharply defined below the midrange. It tends to bloom out into a sound that makes the room feel full, rather than echoing. I'm not sure how to describe it, except to say that it sounds very natural, even though it's probably not specific to the venue they recorded in. (C.T.S. Studios in London http://www.malonedigital.com/studios-cts.htm#.WixiGSPMyV4 ) I'm guessing that they analyzed some real concert halls and synthesized this sound from scratch, which is astounding considering how natural it sounds!

So putting all these aspects together... The sound is more ambient and diffuse in some ways than most classical recordings, but the clarity is gradated along with the dynamics in a natural way. And the multiple channels provide more definition to put across a more sophisticated and natural soundstage like this.

I'm guessing that they used so many microphones because they miked each section of the orchestra from several different distances with both broad and shotgun mikes. That way, if they needed to go in and fix the balance of a particular part, they could do it with an ambience that matched the perspective of the dynamics. I suspect that most of the time, engineers just take a wide, broad perspective of the orchestra as the foundation of the mix, and punch in with close miked details if they need to adjust anything. These engineers seem to be able to maintain the distance cues better when they do that so nothing sounds "spotlighted" and everything feels like a consistent, coherent distance. That would require several mikes on each section at different distances with different polar patterns optimized for what they were trying to isolate.

I haven't listened to anything with a small band yet. I'm interested to see how they handle that. I would bet that they don't follow the big hall approach they did with the full orchestral recordings. I am interested to see what sort of environment they create for a more intimate scale.

jgazal · Dec 10, 2017

Yes, @bigshot, I was also surprised that the Orchestre national d'Île-de-France recording linked above was mixed from several spot microphones.

I was expecting coincident microphones, because I used to share the same fear you mentioned: the more microphones, the more variables that can go wrong.

Nevertheless, as Professor Choueiri said in one of the videos linked above, room acoustics and HRTFs are governed by linear physics, so they can be mathematically modeled. Even the HRTF can be mathematically modeled from the geometry of our heads and torsos (though that is computationally intensive...).

But I believe elevation is the worst part (The Principles of Quadraphonic Recording Part Two: The Vertical Element -Michael Gerzon). As I see it, spectral cues act together with ILD and ITD intricately.

With binaural recordings, the dummy head HRTF, and particularly its spectral cues, is convolved/convoluted (which is the right term?) into the recording, and I guess the post-equalization for speakers is an averaging. You would need to know the dummy head HRTF that was convolved/convoluted in order to neutralize the second HRTF filtering, by the listener's own head at playback, and that would be too computationally intensive.

That is why I guess the Smyth Research service to personalize BRIRs may be practically limited to ILD and ITD. Spectral cues may not be treated, which may explain the second-tier performance.

That is also why I believe acoustic crosstalk prevents sources from being perceived inside the limits of the surrounding speakers if you don’t use spherical harmonics. You can hard pan 12 instruments to 12 speakers in the horizontal plane around you, but that is not very practical.

With coincident microphones elevation information is encoded up to an optimal resolution depending on the order of microphones. And then the HRTF convolution happens at playback.

Why did I say all that?

Because, limiting my comments to a horizontal soundstage within the loudspeakers and with channels in phase, I agree that, even without crosstalk cancellation: (i) synthetic mixing of direct sound and reverberation can theoretically render distance and stage depth in the recording you mentioned; (ii) level and time differences can render directionality of instruments in the recording you mentioned; (iii) tracks from spaced microphones can add ambience.

Nevertheless, I am sorry to say that I have never perceived depth in such circumstances in regular rooms without acoustic treatment. Until now I have only perceived depth with binaural recordings played over speakers in the very near field with two pillows, one blocking crosstalk and the other blocking ceiling reflections... I visited two rooms with acoustic treatment, but it was so long ago that I don’t remember...

On the podcast above and in the Realiser A8 manual, the Smyth professors (I guess it is fair to call both brothers professors, given their academic background) explain how we perceive distance: a mix of how planar the wavefront reaching our head is (so notice that arrival times and envelope do matter) and the ratio between direct sound and reverberation (so in theory we could perceive distance even with continuous sound waves). And the Realiser alters the speaker distance by varying the second aspect (the direct sound to reverberation ratio), just as you describe mixing engineers doing.
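
To put a number on the second cue, here is a small sketch that estimates the direct-to-reverberant ratio of an impulse response and then rescales the reverberant tail, which is the kind of manipulation described above. It is only my illustration, not the Realiser's algorithm: the split point (a short window around the strongest peak, 2.5 ms by default) is a convention I am assuming, and the function names are hypothetical.

```python
import math


def _direct_window(impulse_response, sample_rate, window_ms):
    """Index range [start, end) treated as direct sound around the main peak."""
    peak = max(range(len(impulse_response)),
               key=lambda i: abs(impulse_response[i]))
    half = int(round(window_ms * 1e-3 * sample_rate))
    return max(0, peak - half), min(len(impulse_response), peak + half + 1)


def direct_to_reverberant_db(impulse_response, sample_rate, window_ms=2.5):
    """Energy ratio (dB) between the direct window and everything after it."""
    start, end = _direct_window(impulse_response, sample_rate, window_ms)
    direct = sum(x * x for x in impulse_response[start:end])
    reverberant = sum(x * x for x in impulse_response[end:])
    if reverberant == 0.0:
        return math.inf
    return 10.0 * math.log10(direct / reverberant)


def scale_reverberation(impulse_response, sample_rate, gain, window_ms=2.5):
    """Copy of the response with the part after the direct window scaled."""
    _, end = _direct_window(impulse_response, sample_rate, window_ms)
    head = list(impulse_response[:end])
    tail = [gain * x for x in impulse_response[end:]]
    return head + tail


# Toy response at 1 kHz sample rate: one direct spike, two "reverb" taps.
toy = [0.0, 1.0, 0.0, 0.0, 0.5, 0.5]
drier = scale_reverberation(toy, 1000, 0.5, window_ms=1.0)
wetter = scale_reverberation(toy, 1000, 2.0, window_ms=1.0)
for name, ir in (("original", toy), ("drier", drier), ("wetter", wetter)):
    print(name, direct_to_reverberant_db(ir, 1000, window_ms=1.0))
```

A lower ratio (more reverberation relative to direct sound) is the version one would expect to sound farther away.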

But it seems to me from your description that your listening room has well-treated reflections, which preserves the ratio of direct sound to reverberation. I don’t know whether or how crosstalk alters that ratio.

I have not yet studied how binaural synthesis mixing engines can change the perceived size of sound sources (sound objects). That is curious, because in some recordings I find the piano huge and in others I find it thin and razor sharp. I will dig deeper.

It is also interesting that you find the dynamics somewhat different from what you hear live. I don’t know if that alters the localization.

I don’t know why @71 dB sees the crossfeed debate so negatively. It was one of the reasons I became interested in comparing surround and ambisonics. I learned about the effect of reducing crosstalk by hard panning to surround speakers from @pinnahertz. I had already read Daniel Levitin (This Is Your Brain on Music), but it was @pinnahertz again who alerted me: “don’t forget the envelope”.

So I find the debates here really constructive.

To sum up, I guess that, whether with coincident microphones or synthetic mixing, we need to test to say whether the rendering is realistic. @Erik Garci likes a PRIR that emulates the Hafler circuit, something that I would previously have feared (perhaps because I don’t understand the mathematics of microphone directivities and angles). Visual cues do help to achieve a more secure reference (but stereoscopic headsets are another can of worms, as @pinnahertz also alerted me).

Please post the SACD booklet so others can test it too. One of the recording engineers of the studio you linked did some impressive work on motion picture soundtracks with Maurice Jarre. I found this comment hilarious:

Typically, Tomlinson would use a touch of artificial reverberation however “Maurice Jarre hated it, as he used to call it ‘the faucet.’ ‘Turn off the faucet, Eric. I don’t like it,’” laughed Tomlinson, imitating a French accent. “And I would try to enhance the strings with a little bit and he would come in and he would say ‘No, too much faucet.’ So we had to cool it. He always wanted a drier string sound than I would have normally gone for but that was his choice, he was paying!”
Eric Tomlinson Recording Engineer

But to be fair, I don’t know if post recording synthetic reverberation is as good as the reverberation in real time during the recording that would be correlated to the excitation music signal. If they play a sweep tone into the hall it is still related to one single spot, but the orchestral instruments are in several spots. Maybe a mix of several RIRs in several spots. I really don’t know... I just know that Maurice Jarre had a really strong opinion on the subject. :laughing:

Perhaps he would change his opinion with RIR convolution reverbs instead of algorithmic reverbs that were available in his time?

I would like to confess that I just can’t understand all those acoustical and psychoacoustical variables together... And I would also like to confess that I find everything that Rayleigh, Michael Gerzon, Choueiri and the Smyths wrote and engineered really amazing. It is just unbelievable how humans can understand those patterns and digitally process them. To me such achievements are very similar to Max Planck solving the ultraviolet catastrophe, Einstein showing Newtonian mechanics to be a special case of the more general relativity, the fissioning of atoms in a chain reaction, etc...

Finally, thank you very much for replying to this thread.

bigshot · Dec 10, 2017 at 3:34 AM

I'm talking about speakers. None of that head stuff applies because I'm using my own head.

jgazal · Dec 10, 2017 at 3:57 AM

bigshot said:
I'm talking about speakers. None of that head stuff applies because I'm using my own head.

Yes, I do agree. Either ambisonics through speakers or multichannel through speakers just convolve/convolute your head acoustically at playback, they are both personalized to your very own head.

What I am trying to say is that your head would be fine if and only if the recordings were made with coincident microphones and the rendering had spherical harmonics computed for each speaker in order to recreate the original sound field at a single point (the center of the listener’s head).
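
To show what “spherical harmonics computed for each speaker” can look like in the simplest case, here is a sketch of a basic first-order decode to a regular horizontal ring of loudspeakers. It is a simplification under my own assumptions (horizontal only, equally spaced speakers, W carrying the unweighted signal, azimuth counter-clockwise from the front, no near-field or psychoacoustic corrections); real decoders are more elaborate and the function names are made up.

```python
import math


def encode_horizontal(sample, azimuth_deg):
    """First-order horizontal components (W, X, Y) of a single source."""
    az = math.radians(azimuth_deg)
    return sample, sample * math.cos(az), sample * math.sin(az)


def decode_to_ring(w, x, y, speaker_azimuths_deg):
    """Basic decode: each speaker samples the field in its own direction.

    Assumes the speakers are equally spaced on a circle around the listener.
    """
    n = len(speaker_azimuths_deg)
    feeds = []
    for az_deg in speaker_azimuths_deg:
        az = math.radians(az_deg)
        feeds.append((w + 2.0 * (x * math.cos(az) + y * math.sin(az))) / n)
    return feeds


# A source straight ahead, decoded to four speakers at front, left, back, right.
square = [0.0, 90.0, 180.0, 270.0]
feeds = decode_to_ring(*encode_horizontal(1.0, 0.0), square)
print(feeds)
```

Note that every speaker receives some signal, including an out-of-phase contribution from the one opposite the source: the decode reconstructs the field at the center point rather than hard panning the source to one speaker.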

But usually recordings are made with coincident pairs or mixed and playback adds acoustic crosstalk and reflections so that the ratio between direct sound and reverberation may not be as realistic as the first technique. But I do agree with you and @pinnahertz that multichannel has an edge over regular stereo, because the less you pan, the lower the crosstalk you introduce.

bigshot · Dec 10, 2017 at 4:21 AM

Most mixes are constructed from mono tracks potted to where the engineer wants to place them. The shape of your noggin, crosstalk, or pattern of the mike are all irrelevant to it. You sit in front of the speakers and you get the sound. It works fine that way.

jgazal · Dec 10, 2017

bigshot said:
Most mixes are constructed from mono tracks potted to where the engineer wants to place them. The shape of your noggin, crosstalk, or pattern of the mike are all irrelevant to it. You sit in front of the speakers and you get the sound. It works fine that way.

So the mixing engineer adjusts the ratio between the “convolution reverb” and the dry spot-microphone object to define depth. With headphones the ratio is maintained. But with speakers, my gut feeling is that crosstalk and reflections just don’t allow the exact same ratio to be kept. Sure, synthetic ILD and ITD may be enough for your brain to perceive an acceptable auditory illusion, so I will trust your experience.

Nevertheless, consider the following two descriptions, and then tell me whether you are really sure that the interaction between “convolution reverb”, direct sound and crosstalk, and the point in the audio reproduction chain at which you convolve with an HRTF, are all completely “irrelevant” to where “the engineer wants to place” the object:

John Atkinson Comments: I have not been impressed by earlier attempts at crosstalk cancellation. The result seemed unstable, colored, and was limited to such a small sweet spot that it was impracticable for a comfortable listening experience. But after Edgar Choueiri had calibrated the system for my ears and listening position, and played back some binaurally recorded music over a pair of KEF LS50s reinforced by a subwoofer, I was impressed. Not only did the soundstage now wrap almost to my sides and was not affected by my moving my head from side to side and back and forth, what I found most convincing was that the ambience, the reverberation on the recordings, was now a stable, solid halo around the performers, just as it is in reality.

(...)
The listening experience in the Bacch 3D room was unusual to say the least. The system was modest by audiophile standards and consisted of the lovely KEF LS50s, a Hsu subwoofer, and a pair of Sanders Magtech amps.
(...)
With Professor Choueiri at the iPad controls, I was treated to 3D demo material one consisting of David Chesky in a large church. David starts out about 30' away to your left/center and he proceeds to walk closer and closer until he whispers in your ear. Let me just say this was 100% convincing and kinda creepy in its intimacy, no offense to David intended.
(...)
The presentation literally dissolved both the speakers and room, the later felt more like it had been blown open. Sound was not only coming at me from nearly all directions in such a way as I could very easily pinpoint each instrument and singer's physical location within the recorded space, but the natural reverb and decay into the space of the recording was simply astonishing.
(...)
Read more at BACCH 3D Sound - AudioStream

Since you may be one of the most experienced with multichannel, I would like to ask you a couple of questions. Have you found multichannel recordings in which all sound objects are hard panned to each speaker? Do they convey the object size accurately? Are you able to perceive the object inside the boundaries of the speakers around you? In other words, have you perceived proximity (see How Does BACCH™ 3D Sound differ from surround sound?)?

Now consider something like the Microsoft HoloLens, and suppose you want to connect the holographic world with the real world. Object sizes must be visually coherent with the real objects in the scene:

Size
Consideration of the size of a mixed reality hologram is more involved than it is in VR. In virtual reality, the developer has control over the entire “world” and so can ensure that everything is the right size relative to everything else. With mixed reality, holographic objects must be adjusted to the same size that they would be in the real world, because they must fit in with the real-world objects in the room. If holographic representation of a real object is not the right size relative to its surroundings, the illusion of reality is lost. This means that you must pay attention to the scale of the holograms you create. Not only must the holograms in a scene all be proportionally sized to each other, but also they must be appropriately sized for the real-world surroundings.
Develop Microsoft HoloLens Apps Now - Allen G. Taylor, p. 164

But the size of sounds and their proximity must also be coherent.

Do you believe that multichannel is able to render size and proximity of audio objects with better coherence than (i) binaural through two loudspeakers with crosstalk cancellation, (ii) binaural through a beamforming phased array of transducers, (iii) binaural with personalized headphone externalization and headtracking, (iv) Ambisonics through multiple loudspeakers or (v) Ambisonics downmixed to binaural using your own HRTF and played back with headphones, two loudspeakers with crosstalk cancellation or a beamforming phased array of transducers?

To be honest, I still don’t know if option “iv” (coincident soundfield microphones or eigenmikes* and Ambisonics decoding; the “pattern of the mic” variable in your terms) is also capable of conveying proximity in the sense Professor Choueiri describes. Dealing with reflections becomes an increasing challenge as the number of speakers grows. Perhaps option “v” with your own HRTF and headphone externalization with headtracking and without crossfeed? Maybe.

* in other words, coincident microphones with 3 axis spatial information.

bigshot · Dec 10, 2017

Studio engineers mix on speaker systems, not headphones. If you play it back on speakers that are set up properly you aren't getting any distortion. It's the way they heard it in the studio. If you listen with headphones you're getting a certain amount of distortion by definition, because that wasn't what it was mixed for.

There are ping pong recordings where everything comes from a different speaker. That was more common in the early days of quad when they didn't really know how to use multichannel yet. And yes, when something is thrown to just one speaker, it does sound smaller. Almost all current mixes use either the "Pink Floyd" approach (a bed of bass and drums and perhaps rhythm guitar in a natural acoustic and solo elements potted around), the "Sound Field" approach (a natural acoustic with fixed positions for the instruments in two dimensional space between front and back) or the "Sound Stage" approach (fixed positions up front and just hall ambience in the rears). There are blends of the different approaches as well. Perhaps the most common approach for rock is a sound field up front with background vocals and fills thrown occasionally to the rears ala "Pink Floyd".

If you want to associate a sound with an image in three dimensional space (like a hologram), it would require more speakers and more carefully meshed channels. At this point Atmos is the technology that can do that. Atmos is scalable up to 128 channels and 68 speaker feeds. In the home it's much more limited. It can support up to 35 channels there.

Multichannel can do just about anything better than two channel speakers or binaural. Object size is determined by the number and placement of the speakers. You can use DSPs to optimize the room acoustics and synthesize any size space you want, from a tiny box around you all the way up to an arena. Atmos would be the most effective at this because it includes the vertical dimension, not just right and left and front and back. The advantage to speakers is that you are creating real sound in real space. You don't have to worry about head shapes or head tracking because all that is handled naturally by the listener.

You're misunderstanding what I'm talking about with the number and types of microphones. The purpose isn't to capture a multichannel ambience like binaural. It's to capture sound objects that can be assembled into a multichannel ambience in the mixing board. When I talk about the pattern and placement of the microphones, I'm referring to the ability to isolate specific groups of instruments and the ability to capture secondary distance cues from the actual recording venue. Secondary cues are natural and real and add a level of verisimilitude to the ambience. Distance cues like reverb and changes in frequency response can be subtle and very specific. When you build a synthetic ambience using digital reverb, it can tend to create monochromatic ambiences where all instruments feel like they are the same distance away from the listener.

You can also get disconnects in secondary sound cues when you mix them haphazardly. Say you want to boost the woodwinds. You take a close miked isolated track of them and mix them into the wide perspective to pull it up in the mix. Perhaps you add a digital reverb to make the close miked feed sound more distant. This certainly works, but it isn't quite the same as having a more distant isolated track with natural secondary distance cues to mix in. When you have these natural details recorded, you can sprinkle them into the mix and get an even more vivid perception of depth.

I think what they did in these Royal Philharmonic SACDs was to mike the band a dozen different ways- close and distant, sections isolated, wide panoramas of the whole band, mikes dedicated to room ambience, etc. Then they took the 45 different channels and started assigning them to slots in the mixing board, which at the studio they recorded at had a whopping 96 channels! That would allow them to add a specific stereo ambience to each one of the 45 discrete channels and still have six channels left for 5.1.

I'm guessing they created the depth of the mix in layers, isolating each section- percussion in the rear, woodwinds in the middle, strings in front of that, soloists in concertos right up front- then they applied digital hall ambience combined with natural secondary distance cues to each layer according to how far away they wanted it to feel. When they had the depth worked out for the three front channels, they synthesized a rear channel hall ambience with a 3 or 4 second decay and delays varied by frequency to allow higher frequencies to bounce clearer than lower ones.

In the ideal world, all you would really need is five microphones placed in the positions of the speakers for the listener and a room big enough to have the natural ambience you are going for. That would give you a pretty exact recreation of the sound- just like the three channel Living Stereo sessions back in the 1950s. But if you did it that way you would have no flexibility at all to correct level imbalances or change an instrument's perspective or make it sound like a different hall ambience. By recording from 45 different vantage points on the band in the room and bringing it into a mixing board capable of adding a unique digital stereo ambience to almost all of those discrete channels, they would have complete control over the way distance, direction and ambience was rendered. It's like a film shoot with three cameras where one shoots the closeup, one shoots the two shot and one shoots the wide... you can capture a single performance three different ways so it can be assembled into a filmed presentation. Except here they are recreating a single perspective in great detail, not switching between different perspectives. Another analogy would be HDR photography. They take three different photos at three different exposure settings and merge them together into a photo that has the best elements of all three.

It has nothing to do with head shapes or crosstalk or headphones. They are creating a synthetic auditory environment in the mixing board by pulling the sound apart into small bits, reassembling them in the mixing board layer by layer and sprinkling in real depth cues to make the synthetic ambience sound vividly real. Then they are channeling the sound out into three front speakers, two rears and a sub to create real dimensional sound surrounding the listener. This is the holographic sound you're talking about. If they upped the complexity and did an 8.2.6 mix like this, it would be even better than 5.1.

By the way, I haven't listened to the 2 channel mixes on these SACDs, but if they are simple fold downs of the 5.1, I bet they sound really thick and awful. You need the multichannel to separate out all the sound objects properly.

jgazal · Dec 16, 2017

bigshot said:
Studio engineers mix on speaker systems, not headphones. If you play it back on speakers that are set up properly you aren't getting any distortion. It's the way they heard it in the studio. If you listen with headphones you're getting a certain amount of distortion by definition, because that wasn't what it was mixed for.

I see that, like @gregorio and @pinnahertz, you take the mastering room as your reference. I cannot argue with that; it is a more down-to-earth reference given the current mainstream equipment in consumer environments.

So, with that particular premise in mind, I agree with all you say, except the following statement:

bigshot said:
If you want to associate a sound with an image in three dimensional space (like a hologram), it would require more speakers and more carefully meshed channels. At this point Atmos is the technology that can do that. Atmos is scalable up to 128 channels and 68 speaker feeds. In the home it's much more limited. It can support up to 35 channels there.

When Professor Choueiri explains his crosstalk cancellation algorithm, he likes to describe it as a “stereo purifier”.

And he does that because he states that the standard Blumlein stereo is corrupted by crosstalk.

So he does not even consider the word holophony and simply restores the meaning of the word stereo:

15 Why call this “BACCH™ 3D Sound”?
The word “stereo” was always associated with three-dimensional objects or effects until its later use, in the 1950s, in the word stereophony, which, ironically, is now a term that does not invoke true three-dimensional sound in the popular mind20. In fact, the earliest use of “stereo”, which comes from the Greek word στερεóς (stereos), which means solid, goes back to the 16th century when the term stereometry was coined to denote the measurement of solid or three-dimensional objects. This was followed by stereographic (17th c.), stereotype (18th c.), stereoscope (19th c.) (a viewer for producing 3D images), and stereophonic (circa 1950). Stereophonic sound, alas, remained a poor approximation of 3D audio until the recent advent of BACCH™ 3D Sound, which restores to the word stereo its original 16th century 3D connotation.

The epithet “pure” refers to the purifying action of the BACCH filters, which are at the heart of BACCH™ 3D Sound. A BACCH filter “purifies” the sound from crosstalk for playback on loudspeakers, without adding coloration, and purifies it also from the detrimental effects of spatial comb filtering and non-idealities of the listening room, the loudspeakers and the playback chain.

So thinking about your analogy to holograms, I am inclined to believe that holophony would be a better word than expressions like “immersive audio” or “immersive sound”.

I didn’t want to use the word holophonia or holophonics since it was coined by Hugo Zuccarelli to describe his system that, AFAIK, is binaural recording without crosstalk cancellation.

Apparently, Helmut Oellers in Germany is using the word to refer to Wave Field Synthesis: Holophony - the path to true spatial audio reproduction.

Anyway, I am not an audio engineer or researcher, so I will leave the thread title as is, since it was based on the “Immersive Sound” book.

Having said that, I feel the urge to insist on this:

jgazal said:
Do you believe that multichannel or current object-based proprietary codecs, like Atmos, Auro and DTS:X, are able to render size and proximity of audio objects with better coherence than (i) binaural through two loudspeakers with crosstalk cancellation, (ii) binaural through a beamforming phased array of transducers, (iii) binaural with personalized headphone externalization and headtracking, (iv) Ambisonics through multiple loudspeakers or (v) Ambisonics downmixed to binaural using your own HRTF and played back with headphones, two loudspeakers with crosstalk cancellation or a beamforming phased array of transducers?

To be honest, I still don’t know if option “iv” (coincident soundfield microphones or eigenmikes* and Ambisonics decoding; the “pattern of the mic” variable in your terms) is also capable of conveying proximity in the sense Professor Choueiri describes. Dealing with reflections becomes an increasing challenge as the number of speakers grows. Perhaps option “v” with your own HRTF and headphone externalization with headtracking and without crossfeed? Maybe.

* in other words, coincident microphones with 3 axis spatial information.

Are you really sure that current object-based proprietary codecs, like Atmos, Auro and DTS:X, are capable of such holophony?

Do you know why Professor Choueiri does not mention those codecs among the most promising methods for generating 3D soundfields?

BACCH™ Filters:
Optimized Crosstalk Cancellation for 3D Audio over Two Loudspeakers

An Introduction

There are a number of methods for generating 3D soundfields from loudspeakers. The three most promising are 1) Ambisonics, 2) Wave Field Synthesis and 3) Binaural Audio through Two Loudspeakers (BA2L). The first two methods rely on using a large number of microphones/recording channels for recording, and a large number of loudspeakers for playback, and are thus incompatible with existing stereo recordings. The third method, BA2L, relies on only two recorded channels and two loudspeakers only, and is compatible with the vast majority of existing stereo recordings (recorded with or without a dummy head).

(...)
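
To illustrate what crosstalk cancellation means in the third method, here is a minimal sketch at a single frequency. It assumes a symmetric setup in which the path to the opposite ear is just an attenuated, delayed copy of the path to the same-side ear. The gain, the delay and all function names are illustrative assumptions of mine. This is a naive matrix inversion, not the optimized BACCH filter design, which the quoted text does not detail.

```python
import cmath
import math

def acoustic_matrix(freq_hz, gain_c=0.7, delay_c_s=0.00025):
    """2x2 speaker-to-ear transfer matrix at one frequency.

    Rows are ears (left, right), columns are speakers (left, right).
    gain_c and delay_c_s describe the assumed crosstalk path only.
    """
    h_i = 1.0 + 0.0j  # same-side path, normalized
    h_c = gain_c * cmath.exp(-2j * math.pi * freq_hz * delay_c_s)  # opposite-side path
    return [[h_i, h_c], [h_c, h_i]]

def invert_2x2(m):
    """Plain inverse of a 2x2 complex matrix (no regularization)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul_2x2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

h = acoustic_matrix(1000.0)
canceller = invert_2x2(h)            # filter applied before the speakers
at_ears = matmul_2x2(h, canceller)   # what finally reaches the two ears

print(abs(h[0][1]))         # crosstalk level without the canceller
print(abs(at_ears[0][1]))   # crosstalk level with the canceller, close to zero
```

With the canceller in place the left channel reaches only the left ear and the right channel only the right ear, which is the condition binaural playback needs. In this toy model the inverse filter's gain grows wherever the determinant gets small, so the naive version is not something I would expect to sound good as is.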

@gregorio and @pinnahertz, you both also know the music industry and mastering engineering: do you agree with @bigshot that Atmos, Auro and DTS:X are capable of conveying proximity (or holophony in a broader concept) without relying on crosstalk cancellation and convolution of objects with a generic HRTF?

I think that the ability to convey proximity is a very important engineering specification for such codecs, and that needs to be really clear here.

bigshot · Dec 10, 2017

Atmos is an object based system. You take an individual track (like a specific musical instrument) as a discrete channel and the system processes it to place it anywhere within the three dimensional sound field. The channel doesn't necessarily relate to a specific speaker, rather it's plotted to groups of speakers that represent a specific point in space. The size and definition of the sound field is governed by the size of the room and the number of speakers in the installation, but the mix is the same regardless of the number of speakers. The more speakers you have, the more precise the placement in space. It's like being in a rectangular cubic grid of sound.
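
A minimal sketch of the object-based idea, assuming a horizontal ring of evenly spaced speakers and simple constant-power panning between the two speakers nearest to the object. This is not Dolby's renderer; the function name and the layout are hypothetical. It only shows how one object position can be rendered to layouts of different sizes.

```python
import math

def pan_object(azimuth_deg, n_speakers):
    """Return one gain per speaker for an object at the given azimuth.

    Speakers are assumed to sit at 0, 360/n, 2*360/n, ... degrees.
    Only the two speakers adjacent to the object receive signal.
    """
    spacing = 360.0 / n_speakers
    pos = (azimuth_deg % 360.0) / spacing
    lo = int(pos) % n_speakers       # speaker just before the object
    hi = (lo + 1) % n_speakers       # speaker just after the object
    frac = pos - int(pos)            # 0 at speaker lo, approaching 1 at speaker hi
    gains = [0.0] * n_speakers
    gains[lo] = math.cos(frac * math.pi / 2.0)
    gains[hi] += math.sin(frac * math.pi / 2.0)
    return gains

# The same object position, rendered to a 4-speaker and an 8-speaker ring.
for n in (4, 8):
    print(n, [round(g, 3) for g in pan_object(30.0, n)])
```

The object's position is stored once and the gains are worked out per installation. With more speakers the two active ones sit closer to the intended direction.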

Then they can add distance cues in three dimensions- say for instance a plot of how reflections, reverberation and decay work in a Gothic cathedral- and they can wrap that ambient envelope around the objects. That creates scale. The rectangular cube of sound defined by the size of the listening room is now able to create an environment of any size, shape or acoustic. Again, the precision is based on the number of speakers, and the mix is the same for a small installation as a big one.

Now if that isn't holographic sound, I don't know what is. I'm sure all the research into crosstalk, phase and all that stuff can be a part of the processing of Atmos as well. It's a lot easier to get two speakers to mesh to create a coherent phantom center than it is to get a cube covered with 64 speakers to create coherent phantom centers between every possible pair of speakers. I'm sure there is a lot of room tuning going on to make it work in the real world. But since the placement and ambience is completely object based, rather than baked into the individual channels, it allows for infinite flexibility. You could theoretically take a Rolling Stones album recorded in Atmos and have them performing in an arena acoustic on stage a hundred yards from you, then change a DSP and have them standing in a recording studio six feet in front of you. I would bet that eventually, they could even create alternate mixes using the same recorded channels to make the instruments move around you in a circle or fly over your head. Since it's object based, the placement can be anything they want.

Someone once mentioned to me that the order of magnitude of improvements to the quality of a sound field is related to the doubling of channels. So stereo is an order of magnitude better than mono, quad is an order of magnitude better than that, 8 channel, 16, 32, 64, etc. It's based on evenly spaced channels defining the four walls. You have to have each of the four walls equal to create a sound field that is significantly better.

jgazal · Dec 12, 2017

bigshot said:
Atmos is an object based system. You take an individual track (like a specific musical instrument) as a discrete channel and the system processes it to place it anywhere within the three dimensional sound field. The channel doesn't necessarily relate to a specific speaker, rather it's plotted to groups of speakers that represent a specific point in space. The size and definition of the sound field is governed by the size of the room and the number of speakers in the installation, but the mix is the same regardless of the number of speakers. The more speakers you have, the more precise the placement in space. It's like being in a rectangular cubic grid of sound.

Then they can add distance cues in three dimensions- say for instance a plot of how reflections, reverberation and decay work in a Gothic cathedral- and they can wrap that ambient envelope around the objects. That creates scale. The rectangular cube of sound defined by the size of the listening room is now able to create an environment of any size, shape or acoustic. Again, the precision is based on the number of speakers, and the mix is the same for a small installation as a big one.

Now if that isn't holographic sound, I don't know what is. I'm sure all the research into crosstalk, phase and all that stuff can be a part of the processing of Atmos as well. It's a lot easier to get two speakers to mesh to create a coherent phantom center than it is to get a cube covered with 64 speakers to create coherent phantom centers between every possible pair of speakers. I'm sure there is a lot of room tuning going on to make it work in the real world. But since the placement and ambience is completely object based, rather than recorded into the individual channels, it allows for infinite flexibility. You could theoretically take a Rolling Stones album recorded in Atmos and have them performing in arena acoustic on stage a hundred yards from you, then change a DSP and have them standing in a recording studio six feet in front of you. I would bet that eventually, they could even create alternate mixes using the same recorded channels to make the instruments move around you in a circle or fly over your head if they want to. Since it's object based, the placement can be anything they want.

I admit it is somewhat difficult to describe what I mean. But my next two questions are very objective.

I will quote Professor Choueiri and Lavorgna again to describe my doubt:

2 How Does BACCH™ 3D Sound differ from surround sound?

Pure Stereo has nothing to do with surround sound. Surround sound, which was originally conceived to make the sound of movies more spectacular, does not (and cannot) attempt to reproduce a 3D soundfield. What 5.1 or 7.1 surround sound aims to do is provide some degree of sound envelopment for the listener by surrounding the listener with five or seven loudspeakers. For serious music listening of music recorded in real acoustic spaces, audio played through a surround sound system can at best give a sense of simulated hall ambiance but cannot offer an accurate 3D representation of the soundfield.

In contrast, Pure Stereo’s primary goal is accurate 3D soundfield reproduction. It gives the listener the same 3D audio perspective as that of the ideal listener in the original recording venue2. Soundstage “depth” and “width”, concepts often used liberally in hi-end audio literature to describe an essentially flat image (relative to that in Pure Stereo), become literal terms in Pure Stereo. If, for instance, in the original soundfield a fly circles the head of the ideal listener during the recording, a listener of that recording played back through the two loudspeakers of a Pure Stereo system will hear, simply and naturally, the same fly circling his or her own head. If, in contrast, the same recording is played through standard stereo or surround sound systems the fly will be perceived to be inside the loudspeakers or, through the artifice of the phantom image, in the limited vertical plane between the loudspeakers.

Fortunately (and perhaps to some, unfortunately) flies do not generally buzz around during the recording of great musical events3. However, an acoustically recorded real soundfield is replete with the 3D cues, if not buzzing insects, that give the brain of the listener the proper information it needs to correctly perceive true depth and width of a sound image, locate sound sources in 3D space, and hear the reflections of sound and the reverberation that occur naturally in the space where the recording was made. For instance, recorded applause in a concert hall, or laughter or chatter in a jazz club, will be reproduced with uncanny accuracy, and would appear as near to the listener as they were in the original venue during recording.

Pure Stereo allows the transmission of these recorded cues (which are critical for the perception of a realistic 3D space) by removing an artifice that occurs during playback through loudspeakers (see Q&A 10) and which would otherwise corrupt the natural reception of these important cues by the listener.

Surround sound does not even attempt to do that. Furthermore, surround sound, like standard stereo, is inherently plagued by so-called comb filtering problems, which are caused by the mixing of the sound waves emanating from the loudspeakers and arriving at the ears of the listener4, even if the listener is sitting in the “sweet spot”. Pure Stereo, in addition to its primary role of reproducing the 3D soundfield, automatically corrects these comb filtering problems and flattens the frequency response at the ears of the listeners, as well as other (spectral and temporal) non-idealities of the loudspeakers, the playback hardware, the listening room (see Q&A 19). It even compensates for the individual features of the listeners outer ears, head shape, and torso (see Q&A 11 on customized Pure Stereo filters), which affect the spatial fidelity of the reproduced sound.

______________
2 By the “ideal listener in the recording venue” we mean the actual main stereo recording microphones, or the left and right channels of the stereo master recording, which represent the left and right ear of the ideal listener in the original soundfield.

3 Recordings of natural sounds such as insects, birds, crowds, moving vehicles and sound sources are often used, along with music recordings, to demonstrate the shocking realism of Pure Stereo.

4 This is the well-known phenomenon of constructive and destructive interference, which causes the frequency response at the ears to be far from flat due to the creation of frequency- dependent peaks and valleys in sound pressure that severely color the sound.

(...)
The listening experience in the Bacch 3D room was unusual to say the least. The system was modest by audiophile standards and consisted of the lovely KEF LS50s, a Hsu subwoofer, and a pair of Sanders Magtech amps.
(...)
With Professor Choueiri at the iPad controls, I was treated to 3D demo material one consisting of David Chesky in a large church. David starts out about 30' away to your left/center and he proceeds to walk closer and closer until he whispers in your ear. Let me just say this was 100% convincing and kinda creepy in its intimacy, no offense to David intended.
(...)
The presentation literally dissolved both the speakers and room, the later felt more like it had been blown open. Sound was not only coming at me from nearly all directions in such a way as I could very easily pinpoint each instrument and singer's physical location within the recorded space, but the natural reverb and decay into the space of the recording was simply astonishing.
(...)
Read more at BACCH 3D Sound - AudioStream
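
The comb filtering mentioned in the Q&A above (footnote 4) can be sketched with a toy model. I assume both speakers play the same signal, and that the opposite speaker's sound reaches a given ear as an attenuated copy arriving a fraction of a millisecond later. The gain, the delay and the function names are illustrative assumptions of mine, not measured values.

```python
import cmath
import math

def ear_response_db(freq_hz, gain_c=0.7, delay_c_s=0.00025):
    """Level at one ear when a direct and a delayed crosstalk copy are summed."""
    h = 1.0 + gain_c * cmath.exp(-2j * math.pi * freq_hz * delay_c_s)
    return 20.0 * math.log10(abs(h))

def first_notch_hz(delay_c_s=0.00025):
    """Lowest frequency at which the two copies arrive in opposite phase."""
    return 1.0 / (2.0 * delay_c_s)

for f in (0.0, 1000.0, 2000.0, 3000.0, 4000.0):
    print(f, round(ear_response_db(f), 2))
print(first_notch_hz())
```

In this model the response swings between a peak where the two arrivals add in phase and a dip where they partly cancel, and the pattern repeats up the spectrum. Those are the frequency-dependent peaks and valleys the footnote describes.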

So your horizontal Atmos speakers are sitting on an imaginary circle 2 meters in diameter. You are right at the center.

The first question is: can you perceive David Chesky’s voice as if he were whispering next to your ear?

By whispering next to your ear I mean more or less 10 centimeters away, and not 1 meter away (since you are in the center of that imaginary circle).
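
To show why 10 centimeters and 1 meter are not the same thing for the ears, here is a rough worked example. It assumes a point source in free field placed on the line through both ears, an ear spacing of 0.175 m, and the simple inverse-distance law. Head shadow and everything else an HRTF captures are ignored, so take the numbers only as an order of magnitude.

```python
import math

EAR_SPACING_M = 0.175  # assumed straight-line distance between the two ears

def level_difference_db(distance_to_near_ear_m):
    """Level difference between near and far ear from distance alone."""
    far = distance_to_near_ear_m + EAR_SPACING_M
    return 20.0 * math.log10(far / distance_to_near_ear_m)

print(round(level_difference_db(0.10), 1))  # whisper about 10 cm from the ear
print(round(level_difference_db(1.00), 1))  # source about 1 m away
```

Under these assumptions the distance-only level difference is several times larger for the 10 cm source than for the 1 m source. That difference between the ears is one of the cues a speaker sitting 1 meter away does not produce by itself.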

The second question is: is crosstalk cancellation currently part of Atmos?

Edit - this is what Dolby claims about Atmos:

I just don’t understand how they do that without crosstalk cancellation or beamforming...

bigshot · Dec 10, 2017

I think the difference between what you're talking about and the way music is recorded is that you are talking about capturing a real acoustic and perspective. That is rarely the goal in music recording. Recorded music is about creating an optimized acoustic and perspective. It's not trying to capture small things to make real sound more real. It's about creating a synthetic reality that sounds real but is better than real.

If you want to have a sound be right next to your ear, you have to have a speaker right next to your ear. Headphones can do that easily.

I'm not sure if Atmos incorporates that particular technique, but I'm sure a VST plugin could be created if there was some way to apply it.

jgazal · Dec 10, 2017

bigshot said:
I think the difference between what you're talking about and the way music is recorded is that you are talking about capturing a real acoustic and perspective. That is rarely the goal in music recording. Recorded music is about creating an optimized acoustic and perspective. It's not trying to capture small things to make real sound more real. It's about creating a synthetic reality that sounds real but is better than real.

Yes that is the difference and the reason why I say that your reference is the mastering room.

bigshot said:
If you want to have a sound be right next to your ear, you have to have a speaker right next to your ear. Headphones can do that easily.

It was not a question about what I want, but what is currently possible to do with speakers.

Most headphones, if not all, don’t externalize sounds without DSP.

So do you think Professor Choueiri and Lavorgna were not precise when they described the proximity?

What I want, in case you wonder, is to emulate with headphones what Professor Choueiri claims to do with speakers. That is currently possible.

But I wish the Realiser allowed just a little more freedom in the mix block (or ILD) and in the ITD of the PRIRs (both with two-speaker PRIRs and with PRIRs of Ambisonics speaker arrangements).

But I haven’t received any answer from them, so I am losing hope.

jgazal · Jan 26, 2018

This is a concise counter argument to what I have been writing in this thread:

castleofargh said:
I'll answer for myself. the album was mastered using speakers, so I consider them to be the first meaningful reference. ideally I would want to have the sound of the speakers in the very studio the master was done, while sitting where the guy was. that is my own idea of the sound like the artist intended. not some often poor sound coming out from giant speakers at a live concert. live event is the true sound like the band is playing in front of me. but it is not what I wish to replicate. because I don't really like that, and also because it is typically impossible when using a mixed and mastered album. so I aim for the next best thing, the sound like the guy heard it when doing the mastering in the studio. I also don't get that, but I try to get close to it and it starts with speaker sound.
I do believe that one day(soon) it will be a reality and albums will have the data(one way or another) to use on our devices and blend in the result with our own HRTF. and maybe when those tech are everywhere, making an album will change too, and the released albums will become the sound like someone in a VIP seat heard it at the live event in some glorious room without spectators. or the sound like you're next to the singer(although I don't think that would sound great). with the potential for good mimicry of a given space or given speakers, comes all the potential to produce differently. so you should IMO see the Realiser as the step in the door of future audio.
for now if I can get at night on headphones, the sound I get on my speakers during the day, that would already be mighty cool. anything beyond that will be bonus to me. ^_^

There are certainly at least two drivers to keep the reference locked to the mastering room when we are dealing with music content. One is objective and the other is, let’s say, volitional: a) the consumer environment; b) preference for such best seat in the audience.

@bigshot, @gregorio and @pinnahertz can correct me, but they seem to agree with the first one.

Someone who shares the opinion of @castleofargh regarding the second driver:

Jason Victor Serinus adds: I, too, found the wrap-around aspects of this system amazing. What threw me off, however, was the perspective, which replicated the sound “from the microphone’s ears.” On the orchestral and choral recordings I auditioned, which were recorded with main mike(s) positioned close to and above the conducting platform, I may have listened from a similar sonic perspective as would a conductor, but it did not resemble anything I’ve ever heard from a seat farther back in the orchestra. Thus, I was simultaneously fascinated and puzzled by the experience.
Read more at https://www.stereophile.com/content/bacch-sp-3d-sound-experience#S4vPgthTw6wF2U5P.99

But there may be a third driver: apprehension that one single coincident microphone could detract from the creative intent of artists, producers, and recording, mixing and mastering engineers.

I don’t believe that shifting the reference can in any way detract from the creative intent:

jgazal said:
I agree that what really needs to be preserved from a real recorded event are mainly (a) the acoustic relationships between the hemispheres that are divided by the human sagittal plane (crosstalk cancellation with speakers or crossfeed-free externalization), (b) spectral cues (personal HRTF/HRIR/BRIR or, in a lower performance tier, a generic HRTF/HRIR/BRIR) and (c) bass response (for instance, the A16 allows direct bass, phase-delayed bass and even equalization in the time domain). But once those three factors are controlled to maintain localization fidelity, the overall frequency response and seat perspective do not necessarily need to be referenced to the real event.

Then it will be up to the creator to deliberately define the art, by creative intent, not only by carefully altering some of the frequency bands of real recorded events (dummy heads, baffled spaced microphones, ambisonics, eigenmikes), but also in synthetic recreations (stereophonic mixing or binaural synthesis with synthetic reverberations).

By “carefully” I mean choosing the recording room, where to place the musician and instrument, where to place the microphone etc. Anyway, if your aim is to keep elevation precision, I would be extra careful when altering some specific frequency bands with post equalization either of recordings made with microphone patterns that encode spatial info (dummy heads, sound field microphones or eigenmikes) or with synthetic mixing.

Giving the user access to 360 degrees of freedom in the x, y and z axes of a sound field may sound challenging, but it does not detract from creativity. It may just be a new language. This analogy with cinema’s limited angle of view versus VR’s freedom of view might be useful:
