Recording Impulse Responses for Speaker Virtualization

jaakkopasanen · Oct 9, 2018 at 10:53 AM

Speakers can be virtualized (simulated) with headphones very convincingly with impulse responses and convolution software. This however requires the impulse responses to be measured for the individual listener and headphones. I'm trying to achieve this.

I made impulse response recordings by playing sine sweep on left and right speaker separately and measuring it with two ear canal blocking microphones. I turned these sweep recordings into impulse responses with Voxengo deconvolver software. I also measured my headphones the same way and compensated their frequency response with an EQ by inverting the frequency response as heard by the same microphones. Impulse responses are quite good and certainly better than any other out of the box impulse response I have ever heard. However they are suffering a little by coarseness, sound signature is a bit bright and sound localization is a tad fuzzy.

When listening on headphones a music recording which was recorded with the mics in ears while playing the music on speakers the result is practically indistinguishable from actually listening to speakers. My impulse responses and convolution come close but still leaves me wanting for better. I think the main problem might be the noise introduced by my motherboard mic input.

I thought about using digital voice recorder like Zoom H1n for the job. This model can do overdub recordings with zero delay between the playback and recording making it possible even to record each speaker separately. I'm also assuming that the mic input on this thing is quite a bit better than my PCs motherboard.

Does is seem like sensible idea to use voice recorder and are there better options? Can you think of other sources of error than the noise from the mic input? Should I do some digital noise filtering on the sine sweep recording before running the deconvolution? Any other ideas for improving the impulse responses?

71 dB · Oct 10, 2018 at 6:11 AM

What about phase? Phase/time delay between ears? How long sweep do you use? Doubling the sweep duration theoretically increases signal-to-noise ratio by 3 dB. Is the system linear enough? Increasing level of measured signal increases signal-to-noise ratio too, but easily introduces more distortion (loudpeakers!). Broadband noise can be filtered out of the response using sweeping band-pass filter that follows the frequency of the sweep. If this filtering is done in "filtfilt" style (first normally and then again for the reverved response), no additional phase shifts are introdused. The filter should be asymmetrical, steep for higher frequencies, because you don't expect higher than f frequencies when measuring frequency f, but you do expect lower frequencies, because they are attenuating away.

jaakkopasanen said:
When listening on headphones a music recording which was recorded with the mics in ears while playing the music on speakers the result is practically indistinguishable from actually listening to speakers.

Tell that to bigshot. :beyersmile:

gregorio · Oct 10, 2018 at 6:15 AM

There are a number of issues and potential issues with what you're trying to achieve, most of which are related to the fact that the impulse response you're recording largely isn't an impulse of your speakers. It's an impulse response of your room (acoustics) and of your recording chain at least as much as it is of your speakers and additionally, it's a response of your "impulse". In no order of preference/importance:

1. Your room acoustics: Human perception works on the basic principle of actually discarding/ignoring most of the sensory input we receive and combining what's left, along with prior experience/knowledge, to create a "perception". There's simply too much sensory data for our brain to process in any reasonable amount of time and evolution has come up with this method as the most practical way of making sense of the world around us. In effect, our perception is an educated guess/judgement of what's really occurring, rather than an accurate representation of it and music (and pretty much all art) absolutely depends on this difference between reality and perception. In other words, certain aspects of what our senses are telling us can be (and usually are) changed somewhat by our brain, in order to make sense of it all. Hence why optical and aural illusions exist and why two different witnesses to an event can truthfully describe that event significantly differently. In the particular scenario you're talking about we have a sensory conflict, the brain will typically change (the perception of) some of that sensory input to remove the conflict and make sense of it. Let's take the example of a recording of a symphony; when you play that recording back on your speakers, what you're hearing is the acoustic space of a large concert hall but your knowledge and eyes are telling you that you're in (for example) your sitting room, we have a sensory conflict. The music producer and engineers compensate for this as much as possible (as they too are creating the recording/mix/master in acoustic spaces which are significantly different to a concert hall) but nevertheless, there's still somewhat of a conflict which the brain will try to resolve. So even with a theoretically perfect recording, perfect speakers and perfect home acoustics, the reproduced recording is never going to sound the same as the original performance in the concert hall, although it might be close enough to fool some/many/most people. In addition, what you're attempting to achieve is a faithful reproduction of your speakers/room in a different room/environment, an additional sensory conflict. In other words, even if it were possible to achieve a perfect impulse response and convolution, when listening to your symphony recording on your headphones, you're effectively hearing a concert hall in your sitting room while your knowledge and eyes are telling your brain that you're actually in (say) a bus! How convincing is that going to be? Maybe almost completely convincing to you personally, but who knows. It might be interesting to see if it's more convincing listening on your convolved headphones when you're actually in your sitting room (or whatever room your speakers are in) than when listening in a significantly different environment. I assume it would be more convincing but whether that makes enough of a difference to you personally I obviously can't say.

2. Your recording chain: Microphones, being transducers, are relatively inaccurate. Measurement mics are the most accurate as far as freq response is concerned but unless you buy very expensive ones, even measurement mics are still relatively inaccurate. A more favoured solution these days is to buy cheaper measurement mics, have a "correction file" created by a calibration lab for each mic and software which allows you to apply them. However this is not a perfect solution and additionally, measurement mics typically gain their frequency accuracy at the expensive of a lot more self noise, which is why measurement mics are never used in studios for recording music. Music mics have far less self noise but are typically far more inaccurate, each brand/model of mic has it's own "colouration" which is desirable for commercial music/audio recording but not when what you're specifically trying to record is the "colouration" itself, of a different transducer (your speakers)! There's also the issue of "off-axis" mic response. Then there's the rest of the chain, the mic pre-amps and noise introduced by say your computer/motherboard. The Zoom H1n should have little/no motherboard noise but it does have rather poor mic pre-amps. It's effectively cheap consumer grade electronics, which is OK for a quick, dirty record of an event but a long way from higher-end pro units. Of course all of this is relative, if your recordings suffer from a great deal of motherboard noise then the H1n could be a considerable improvement.

3. Your impulse: Does a sine sweep fully characterise your speakers? How do your speakers respond to sharp, loud transients rather than a continuous sine wave?

The things I've mentioned above can each be fairly insignificant on their own or quite noticeable, depending on what equipment you've got and your personal perception. Additionally, even if they are relatively insignificant on their own, the cumulation of them might not be.

G

Speedskater · Oct 10, 2018 at 8:12 AM

To expand on the above great reply:
a sine-sweep test will only tell you about the sustain response of the system, which is mostly room response. This type of test tells you little about the transit or impulse of the speakers (or direct response).

jaakkopasanen · Oct 13, 2018

71 dB said:
What about phase? Phase/time delay between ears? How long sweep do you use? Doubling the sweep duration theoretically increases signal-to-noise ratio by 3 dB. Is the system linear enough? Increasing level of measured signal increases signal-to-noise ratio too, but easily introduces more distortion (loudpeakers!). Broadband noise can be filtered out of the response using sweeping band-pass filter that follows the frequency of the sweep. If this filtering is done in "filtfilt" style (first normally and then again for the reverved response), no additional phase shifts are introdused. The filter should be asymmetrical, steep for higher frequencies, because you don't expect higher than f frequencies when measuring frequency f, but you do expect lower frequencies, because they are attenuating away.

Tell that to bigshot.

Not sure what you mean by those phase and time delay between ears questions. They should be mapped correctly by having the mics in ears, no? Sweeping bandpass filter is probably just the thing I was looking for. I think Smyth Realizer A16 does this because it sounds like the sweeps from different channels are overlapping some. Controlling the bandpass filter steepness (bass part of it) would allow control of reverberation time if I haven't understood things wrong. It might be possible to have better room acoustics in the impulse response than in the real room. Thanks for the hint!

gregorio said:
There are a number of issues and potential issues with what you're trying to achieve, most of which are related to the fact that the impulse response you're recording largely isn't an impulse of your speakers. It's an impulse response of your room (acoustics) and of your recording chain at least as much as it is of your speakers and additionally, it's a response of your "impulse". In no order of preference/importance:

1. Your room acoustics: Human perception works on the basic principle of actually discarding/ignoring most of the sensory input we receive and combining what's left, along with prior experience/knowledge, to create a "perception". There's simply too much sensory data for our brain to process in any reasonable amount of time and evolution has come up with this method as the most practical way of making sense of the world around us. In effect, our perception is an educated guess/judgement of what's really occurring, rather than an accurate representation of it and music (and pretty much all art) absolutely depends on this difference between reality and perception. In other words, certain aspects of what our senses are telling us can be (and usually are) changed somewhat by our brain, in order to make sense of it all. Hence why optical and aural illusions exist and why two different witnesses to an event can truthfully describe that event significantly differently. In the particular scenario you're talking about we have a sensory conflict, the brain will typically change (the perception of) some of that sensory input to remove the conflict and make sense of it. Let's take the example of a recording of a symphony; when you play that recording back on your speakers, what you're hearing is the acoustic space of a large concert hall but your knowledge and eyes are telling you that you're in (for example) your sitting room, we have a sensory conflict. The music producer and engineers compensate for this as much as possible (as they too are creating the recording/mix/master in acoustic spaces which are significantly different to a concert hall) but nevertheless, there's still somewhat of a conflict which the brain will try to resolve. So even with a theoretically perfect recording, perfect speakers and perfect home acoustics, the reproduced recording is never going to sound the same as the original performance in the concert hall, although it might be close enough to fool some/many/most people. In addition, what you're attempting to achieve is a faithful reproduction of your speakers/room in a different room/environment, an additional sensory conflict. In other words, even if it were possible to achieve a perfect impulse response and convolution, when listening to your symphony recording on your headphones, you're effectively hearing a concert hall in your sitting room while your knowledge and eyes are telling your brain that you're actually in (say) a bus! How convincing is that going to be? Maybe almost completely convincing to you personally, but who knows. It might be interesting to see if it's more convincing listening on your convolved headphones when you're actually in your sitting room (or whatever room your speakers are in) than when listening in a significantly different environment. I assume it would be more convincing but whether that makes enough of a difference to you personally I obviously can't say.

2. Your recording chain: Microphones, being transducers, are relatively inaccurate. Measurement mics are the most accurate as far as freq response is concerned but unless you buy very expensive ones, even measurement mics are still relatively inaccurate. A more favoured solution these days is to buy cheaper measurement mics, have a "correction file" created by a calibration lab for each mic and software which allows you to apply them. However this is not a perfect solution and additionally, measurement mics typically gain their frequency accuracy at the expensive of a lot more self noise, which is why measurement mics are never used in studios for recording music. Music mics have far less self noise but are typically far more inaccurate, each brand/model of mic has it's own "colouration" which is desirable for commercial music/audio recording but not when what you're specifically trying to record is the "colouration" itself, of a different transducer (your speakers)! There's also the issue of "off-axis" mic response. Then there's the rest of the chain, the mic pre-amps and noise introduced by say your computer/motherboard. The Zoom H1n should have little/no motherboard noise but it does have rather poor mic pre-amps. It's effectively cheap consumer grade electronics, which is OK for a quick, dirty record of an event but a long way from higher-end pro units. Of course all of this is relative, if your recordings suffer from a great deal of motherboard noise then the H1n could be a considerable improvement.

3. Your impulse: Does a sine sweep fully characterise your speakers? How do your speakers respond to sharp, loud transients rather than a continuous sine wave?

The things I've mentioned above can each be fairly insignificant on their own or quite noticeable, depending on what equipment you've got and your personal perception. Additionally, even if they are relatively insignificant on their own, the cumulation of them might not be.

G

1. I've noticed this myself. Listening to a PRIR with speakers far away is quite a weird experience when sitting close to a computer monitor. Brains don't really know how to reconcile the auditory cue for distant sounds and visual cue for near sounds. Works a lot better when both match. This could work the other way around too by making the impulse response sound better than what it actually is if it's recorded in the exact same spot as the listener sits when listening to headphones. I recorded impulse response from my own speakers sitting in my regular spot so it's quite easy for my brain to believe that what it's hearing is actually the real deal because my brains have been conditioned for some time now for this environment having that sound.

2. If I'm not wrong the frequency response of the microphones doesn't really matter. I'm recording frequency response of my headphones with the same mics in ears and whatever that result is it also contains the frequency response of the mics. So when I compensate for the headphone frequency response with EQ I'm actually compensating for the mics' response too. Mic preamps on my motherboard are probably beyond abhorrent. Zoom H1n isn't known for it's mic pre-amps but should be significantly better than motherboard. Anything is better than motherboard really. If Zoom H1n mic pre-amps are not sufficient I will try separate pre-amps and feed the signal into the recorder by line input.

3. Maybe it's good to make clear that I'm not actually trying to imitate speakers perfectly. The goal is to have realistic audio reproduction with headphones.

Speedskater said:
To expand on the above great reply:
a sine-sweep test will only tell you about the sustain response of the system, which is mostly room response. This type of test tells you little about the transit or impulse of the speakers (or direct response).

If I understand correctly this would actually be a good thing. I don't really want speakers' transient response in there messing with my music experience. I think one could have significantly better transient response performance with headphones than with speakers (at least when considering price) so this speaker/room virtualization could sound better than the recorded speakers.

I'm also thinking that it might be possible to do better room correction for the virtual room being simulated than for the actual physical room. Room correction can only go so far because not all acoustic phenomena are easy to handle just with DSP but since headphones don't have that problem (standing waves etc) it could just be so that the impulse response can be edited to have better room acoustics than what would be normally possible.

71 dB · Oct 14, 2018 at 11:14 AM

jaakkopasanen said:
1. Not sure what you mean by those phase and time delay between ears questions. They should be mapped correctly by having the mics in ears, no?
2. Sweeping bandpass filter is probably just the thing I was looking for. I think Smyth Realizer A16 does this because it sounds like the sweeps from different channels are overlapping some.
3. Controlling the bandpass filter steepness (bass part of it) would allow control of reverberation time if I haven't understood things wrong. It might be possible to have better room acoustics in the impulse response than in the real room. Thanks for the hint!

1. Yes. I'm not sure what I meant when asking it… :face_palm:

2. Hopefully it helps…
3. Yes. Logarithmic sweeps greats a response of "natural" shorter reverberation time whereas linear sweep creates "unnaturally" decaying shorter reverberation (faster initial decay and then slower decay).

bigshot · Oct 14, 2018 at 3:37 PM

Can you simulate 5.1, 7.1 or Atmos?

RRod · Oct 14, 2018 at 10:58 PM

I use my Roland R-05 all the time for this task; works just fine. As 71dB noted, you can both extend the sweep and up your speaker volume to get better SNR. After deconvolution you will see miniature IRs before the main IR that correspond to the orders of harmonic distortion; these can be windowed-off to get to the linear part of the decomposition. As far as errors, one of the big ones can be binaural mic placement, so it's good to do several sweeps and pick one that seems reasonable. You might check out the Aurora plugins, made by a researcher who is big into this kind of thing. After it's all said and done you won't get something that sounds perfect. For me, just having something clamping on my head seems to prevent a real sense of sitting in front of speakers, and you won't have head movement accounted for. But satisfactory results don't take a huge amount of effort.

castleofargh · Oct 15, 2018 at 2:59 AM

on the SNR idea, it can be interesting to have some notions of which loudness levels the mic can handle without distorting much. I've ruined a bunch of recordings/measurements myself trying to get the very best SNR possible without thinking of checking for distortions.

jaakkopasanen · Oct 15, 2018 at 11:10 AM

bigshot said:
Can you simulate 5.1, 7.1 or Atmos?

This is my main use case. I'm using HeSuVi on Windows and that can do 7.1. I should be able to turn stereo setup into 7.1 when I receive the Zoom H1n because overdubbing allow me to record channels separately. Atmos however is out of reach because Windows doesn't have decoder for that and I'm not super optimistic that there would be one in the near future.

RRod said:
I use my Roland R-05 all the time for this task; works just fine. As 71dB noted, you can both extend the sweep and up your speaker volume to get better SNR. After deconvolution you will see miniature IRs before the main IR that correspond to the orders of harmonic distortion; these can be windowed-off to get to the linear part of the decomposition. As far as errors, one of the big ones can be binaural mic placement, so it's good to do several sweeps and pick one that seems reasonable. You might check out the Aurora plugins, made by a researcher who is big into this kind of thing. After it's all said and done you won't get something that sounds perfect. For me, just having something clamping on my head seems to prevent a real sense of sitting in front of speakers, and you won't have head movement accounted for. But satisfactory results don't take a huge amount of effort.

Very cool! What I've read the Roland R-05 has quite good mic pre-amps but doesn't have overdubbing. That windowing thing seems like an idea worth trying. What mics are you using? I have Sound Professional SP-TFB-2. Are you compensating your headphones? I would imagine having mics in different position when measuring the room impulse response and when measuring headphones would be problematic. But doing multiple sweeps is a good tip. And thanks a lot for the Aurora plugins link, maybe I don't have to write all the code myself after all.

castleofargh said:
on the SNR idea, it can be interesting to have some notions of which loudness levels the mic can handle without distorting much. I've ruined a bunch of recordings/measurements myself trying to get the very best SNR possible without thinking of checking for distortions.

Indeed. However I don't know how easy would it be to detect loudspeaker distortion from the sine sweep measurement.

RRod · Oct 15, 2018 at 7:16 PM

jaakkopasanen said:
Very cool! What I've read the Roland R-05 has quite good mic pre-amps but doesn't have overdubbing. That windowing thing seems like an idea worth trying. What mics are you using? I have Sound Professional SP-TFB-2. Are you compensating your headphones? I would imagine having mics in different position when measuring the room impulse response and when measuring headphones would be problematic. But doing multiple sweeps is a good tip. And thanks a lot for the Aurora plugins link, maybe I don't have to write all the code myself after all.

Yeah I have binaurals from sound professionals as well, I think the 'masters' series of that same set. I am not currently doing a full inversion to match my speakers to headphones, simply because the software I use and like doesn't have a convolver. I use the speaker measurements to adjust my headphone EQ and to set a crossfeed, though. I DO plan to use the Kirkeby filter in Aurora to help make a 1024 tap filter to add to my speaker chain on my miniDSP.

castleofargh · Oct 15, 2018 at 7:41 PM

jaakkopasanen said:
Indeed. However I don't know how easy would it be to detect loudspeaker distortion from the sine sweep measurement.

manufacturer information, people with the same mic and a good experience of it, or yourself after you go and measure a bunch of sound sources at various levels and find out some pattern in the distortions that correlate well with levels at certain frequencies.
but I don't want to worry you for nothing, using speakers at a distance in fairly realistic conditions, you probably don't get all that loud and most mics should handle that just fine as they were made for such conditions. I was just bringing it up to make clear that while SNR is an obvious concern for us all, the ideal measurement conditions are probably not going to be when we manage to push the speakers to 1337dB ^_^. be it for the mics or for the speakers.

Glmoneydawg · Oct 15, 2018 at 8:18 PM

I feel like SNR is likely to be the least of my concerns at 1337db...noise complaints from other cities would be up there along with the local electricity supplier melting down

bigshot · Oct 16, 2018 at 1:35 AM

I'd rather listen to noise than signal at that level! At least it would be quieter!

71 dB · Oct 16, 2018 at 8:41 AM

RRod said:
After deconvolution you will see miniature IRs before the main IR that correspond to the orders of harmonic distortion; these can be windowed-off to get to the linear part of the decomposition.

For this to work the sweep must be logarithmic. Linear sweeps have the distortion products scattered all over the impulse response. Also, the miniture distortion IRs are folded in time to the end of the whole IR (but they are easy easy to window away anyway).

Latest Thread Images

Featured Sponsor Listings

Recording Impulse Responses for Speaker Virtualization

jaakkopasanen

100+ Head-Fier

71 dB

Headphoneus Supremus

gregorio

Headphoneus Supremus

Speedskater

500+ Head-Fier

jaakkopasanen

100+ Head-Fier

71 dB

Headphoneus Supremus

bigshot

Headphoneus Supremus

RRod

Headphoneus Supremus

castleofargh

Sound Science Forum Moderator

jaakkopasanen

100+ Head-Fier

RRod

Headphoneus Supremus

castleofargh

Sound Science Forum Moderator

Glmoneydawg

500+ Head-Fier

bigshot

Headphoneus Supremus

71 dB

Headphoneus Supremus

Users who are viewing this thread