A misconception about measurements
Sep 3, 2010 at 3:24 PM Post #46 of 52
I think in many (most) contexts, accuracy is defined without reference to perception.  This is not even considering all of science, engineering, statistics, etc.  If you want to get some produce at the grocery store, you need to weigh the contents in a bag.  What is the accuracy of the scale you are using?  It's just the difference between the actual weight and the reading from the scale, which is just a single numerical value that doesn't rely on how the produce looks or smells.  Maybe somebody else's bananas look yellower or seem bigger or heavier, but that has no bearing on how accurate the scale was.  If you have more than one type of measurement or more than one sample, fine.  You can include all of the measurements, or do some data reduction and summarize the key points via some statistics (i.e., functions of those measurement samples).

There are many types of signals in the real world, electrical and not.  One such type of signal is an audio waveform.  The statistics we use to describe any type of signal can also be applied to audio.  There are well-defined quantities such as the mean value, peak-to-peak amplitude, and variance--as well as other, higher-order statistics--that we can use to convey information about the signal.  When comparing the input and output signals, there are even more statistics that describe the differences between the two, such as mean squared error.
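For instance, a rough sketch in Python (the 440 Hz test tone here is just an invented stand-in for an audio signal):

```python
import numpy as np

# One second of a 440 Hz tone sampled at 44.1 kHz, standing in for an audio signal
fs = 44100
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)

mean_value = np.mean(x)       # DC offset; ~0 for a pure tone
peak_to_peak = np.ptp(x)      # maximum minus minimum amplitude
variance = np.var(x)          # average power for a zero-mean signal

print(mean_value, peak_to_peak, variance)
```

None of these numbers depends on who computes them or on what the signal sounds like.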

If we have a list of all the time-domain measurement samples of A and B as you suggested, at 44.1 kHz for a minute's duration, then we can generate some statistics on the data.  These will describe A and B in a meaningful way.  The mean squared error would be one indicator of the accuracy of B (given A as a reference).  This value would be one way of describing the real-world difference between A and B.  Audibility and perception are not relevant to this value, since the statistic is a function of the samples only.  This statistic has a far-from-trivial meaning for many different applications and many different types of signals.  Distortion likewise means something.
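As a sketch of the idea (the tone and the particular corruption of B are made up purely for illustration):

```python
import numpy as np

fs = 44100
t = np.arange(60 * fs) / fs                          # one minute of samples at 44.1 kHz
A = np.sin(2 * np.pi * 440 * t)                      # reference signal A
rng = np.random.default_rng(0)
B = 0.99 * A + 0.001 * rng.standard_normal(A.size)   # hypothetical imperfect copy B

# Mean squared error: one objective indicator of B's accuracy given A
mse = np.mean((A - B) ** 2)
print(mse)
```

The number comes out the same no matter who runs it, which is the whole point.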

These objective statistics are good even if we are concerned with human experiences because they will be the same no matter who calculates them.  

Many of these statistics correlate well with perception, too.  If they had no bearing on what people heard, then we would still be able to calculate them, though we wouldn't bother.  If the correlation coefficient (well, there are multiple ways to mathematically define correlation, but any should do) between A and B is close to 1 rather than close to 0.8, then they will sound much more similar.  Of course, audio is perceived in both frequency and time, so frequency-domain statistics are generally more interesting than time-domain statistics.
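A quick illustration (the signals and noise levels are invented for the sake of example):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100
t = np.arange(fs) / fs
A = np.sin(2 * np.pi * 440 * t)

B_close = A + 0.05 * rng.standard_normal(A.size)   # lightly corrupted copy
B_far = A + 0.8 * rng.standard_normal(A.size)      # heavily corrupted copy

r_close = np.corrcoef(A, B_close)[0, 1]            # near 1: very similar to A
r_far = np.corrcoef(A, B_far)[0, 1]                # much lower: less similar
r_inverted = np.corrcoef(A, -A)[0, 1]              # exactly -1: 180-degree flip

print(r_close, r_far, r_inverted)
```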

But, as you say, to relate the statistics to average human perception, you need outside knowledge in the form of psychoacoustics.  If the correlation between A and B were close to -1 (i.e., B is a 180-degree phase shift of A), then they would actually sound the same.  Second-harmonic distortion is the hardest to perceive and the least offensive.

Somebody could use all this information from psychoacoustics to create an aggregate statistic that predicts how similar B sounds to A in terms of average human perception, or what we're calling "perceived accuracy."  In a way, this has already been done by those working on lossy music encoding (e.g. LAME), among other things.  It's not perfect, but a VBR -V0 encoding has a consistent quality level or accuracy throughout, according to some metric.  Encoding algorithms got better over time because the statistics used for accuracy better predicted human perception.  More precisely, the new encoding algorithms created files perceived as closer to the original at equivalent bit rates, whether constant or variable.  It's definitely not that the entropy of the same source music files changed.
 
Sep 3, 2010 at 4:59 PM Post #47 of 52
Quote:
 
To me this is a different issue. You are talking about attempting to correlate perception to the external world. For example, trying to judge whether the lines are straight by eye.
 
That's the same mistake as thinking that a subjective description of a headphone as "fast" is the same thing as the measured transient response.

It's not a different issue; it's trying to relate reality to perception and realizing they don't always correlate.  As for the subjective description of "fast"--it's not my fault they're using the wrong word.  When talking about speed, we have transients and decay.
 
 
Quote:
 
What I'm trying to get you to do is give me a completely empirical, numerical, and universally complete way of judging which of two devices is more accurate. What you have written is not even close. It looks like you are guessing at things.
 
Let's try it this way. Consider these two plots of FR and THD vs frequency for two different transducers. Tell me which one, green or red, is more accurate.
 
[plots of FR and THD vs. frequency for the two transducers were attached here]
The red one had better THD but seemingly worse FR, so it has more accurate THD and less accurate FR.
 
Here's what you're missing:  some headphones, transports, and amps measure better than others across a multitude of tests.  At that point we can say one is more accurate than the other by the sheer number of tests it does better at with regard to reproduction.
 
Honestly though, most decent headphones have below 1% THD.  Considering that, even if one has worse THD (0.5% vs. 0.75%), another test may be of greater importance.  At this point, a weighted algorithm producing a single quantifiable number that takes all the others into account may be necessary.
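A sketch of what such a weighted aggregate might look like--all the metrics and weights below are hypothetical, and choosing the weights is precisely where the psychoacoustics would come in:

```python
# Hypothetical aggregate "accuracy score": lower is better.
# The metric names, values, and weights are invented for illustration only.
def accuracy_score(measurements, weights):
    """Weighted sum of measurement errors across different tests."""
    return sum(weights[name] * value for name, value in measurements.items())

headphone_a = {"thd_percent": 0.5, "fr_deviation_db": 3.0}
headphone_b = {"thd_percent": 0.75, "fr_deviation_db": 1.5}

# Example weighting: FR deviation counts more than THD already below audibility
weights = {"thd_percent": 1.0, "fr_deviation_db": 2.0}

print(accuracy_score(headphone_a, weights))  # 0.5 + 6.0 = 6.5
print(accuracy_score(headphone_b, weights))  # 0.75 + 3.0 = 3.75
```

With this particular weighting, headphone B comes out ahead despite its worse THD.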
 
Sep 3, 2010 at 9:01 PM Post #48 of 52

The important issue here is that models always have a purpose. And we can always ask: is this model good enough for the purpose?
 
 
Quote:
Originally Posted by mikeaj 

I think in many (most) contexts, accuracy is defined without reference to perception.

 
That may be, but it is usually defined for a purpose. In many contexts, perception is not related to that purpose.
 
 
Quote:
If you want to get some produce at the grocery store, you need to weigh the contents in a bag.  What is the accuracy of the scale you are using?  It's just the difference between the actual weight and the reading from the scale, which is just a single numerical value that doesn't rely on how the produce looks or smells.

 
The accuracy of the scale is not a single number. Let's say it's a spring scale. It has some non-linearities and a bias. The error will be different for each weight. We can model this. Our model would have some parameters, that is, some specific numbers. For example, the scale has a bias (a particular form of error in which the scale is not aligned at zero, but instead reads an extra amount heavy, like one ounce heavy---remembering that the nonlinearities also contribute to the error).
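As a toy model of that (the one-ounce bias comes from the example above; the quadratic nonlinearity coefficient is invented):

```python
# Toy model of the spring scale's error: a fixed bias plus a small nonlinearity.
# The bias is the one-ounce example from the text; the 0.002 coefficient is made up.
def scale_reading(true_weight_oz, bias_oz=1.0, nonlin=0.002):
    return true_weight_oz + bias_oz + nonlin * true_weight_oz ** 2

# The error is different at each weight, so no single number captures it
for w in (8.0, 16.0, 32.0):
    error = scale_reading(w) - w
    print(f"{w} oz reads {scale_reading(w):.2f} oz (error {error:.2f} oz)")
```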
 
Now, what is the purpose of an accurate scale? I would say it's to prevent either the customer or the store from being cheated.
 
The factory that makes the scale could perform measurements to fill in the parameters of the model. It would be easy to choose a model and set limits on the measurements so that no one feels cheated. That's the purpose. Easy to achieve with measurements.
 
Now let's consider a situation in which the weight of a food item is not always good enough: meat with bones in it. Let's say you pay for the total weight of meat plus bones, but the only useful part is the meat. Let's also say that the proportion of bones varies a lot. In that case, modeling it as a single weight may not be good enough for the purpose.
 
 
 
Quote:
Somebody could use all this information from psychoacoustics to create an aggregate statistic that predicts how similar B sounds to A in terms of average human perception, or what we're calling "perceived accuracy."  In a way, this has already been done by those working on lossy music encoding (e.g. LAME), among other things.  It's not perfect, but a VBR -V0 encoding has a consistent quality level or accuracy throughout, according to some metric.  Encoding algorithms got better over time because the statistics used for accuracy better predicted human perception.

 
Yes, there has been work on relating distortion to perception in the MP3 encoders. By the way, Shike would have a hard time reconciling the fact that audio engineers working on codecs DO try to relate distortion to perception and his insistence that audio engineers don't do anything of the kind.
 
I'm interested in high-end audio and its ability to recreate the experience of a reference musical event. For that purpose, current measurements and models are useless.
 
They are more useful if you are talking about some kind of aggregate model of many unsophisticated listeners.
 
Again, let's come back to the purpose of a model. What is the ultimate purpose of codecs like MP3? I would say it's to shorten downloads without angering the average consumer with noticeable distortion. That's what it is. It exists so the average unsophisticated consumer is satisfied with their purchase and so media companies can save money.
 
Sep 3, 2010 at 9:05 PM Post #49 of 52


Quote:
The red one had better THD but seemingly worse FR, so it has a more accurate THD and an inaccurate FR.
 

Now you're really in "useless land." You apparently can't tell me which one is more accurate.  Given a list of measurements, you can tell me which numbers are larger than the other numbers. Wow.
 
Apparently you think that the peak in the green THD is more significant than the overall much greater area under the curve of the red THD. Peaks trump average apparently. Do you have any objective justification for this, any at all?
 
Sep 3, 2010 at 10:26 PM Post #50 of 52
Quote:
The accuracy of the scale is not a single number. Let's say it's a spring scale. It has some non-linearities and a bias. The error will be different for each weight. We can model this. Our model would have some parameters, that is, some specific numbers. For example, the scale has a bias (a particular form of error in which the scale is not aligned at zero, but instead reads an extra amount heavy, like one ounce heavy---remembering that the nonlinearities also contribute to the error).  
 
Now let's consider a situation in which the weight of a food item is not always good enough: meat with bones in it. Let's say you pay for the total weight of meat plus bones, but the only useful part is the meat. Let's also say that the proportion of bones varies a lot. In that case, modeling it as a single weight may not be good enough for the purpose.


Upon reflection, that scale example was really stupid; you're right.  Also, I meant to say that the accuracy of the scale's (single) measurement could be expressed easily and totally by the difference between the actual and measured weights.  The accuracy of the scale as a measuring device in general is much more complicated, since that depends on many parameters relating to the construction and calibration of the device.  Anyhow, it was a poor example and nothing but a diversion from the real issue at hand.
 
Quote:
Originally Posted by mike1127 
I'm interested in high-end audio and its ability to recreate the experience of a reference musical event. For that purpose, current measurements and models are useless.
 
They are more useful if you are talking about some kind of aggregate model of many unsophisticated listeners.
 
Again, let's come back to the purpose of a model. What is the ultimate purpose of codecs like MP3? I would say it's to shorten downloads without angering the average consumer with noticeable distortion. That's what it is. It exists so the average unsophisticated consumer is satisfied with their purchase and so media companies can save money.

 
First off, I think this is a rather condemning view of lossy encoding.  Storage capacity as well as network capacity are limited resources in many scenarios.  I'm rather happy that I have the ability to compress music to store on a portable player, and that I get more TV channels from cable or satellite because of lossy encoding.  For portable audio, and even to a much smaller extent in home audio, I personally don't think there's much point in going strictly lossless in non-ideal listening circumstances, especially for non-amazing recordings.  Of course, I can see where others would disagree.  But now I'm kind of curious whether lots of video enthusiasts are angry at having H.264 or other video codecs on Blu-ray.
 
However, the end of the bolded line is to me the most significant part.  What evidence do you have that current measurements and models are useless with regard to high-end audio?  Now that's crazy talk.
 
I also think it's highly presumptuous to think that psychoacoustical models have such narrow scopes that they don't apply to fancy listeners.  (I'm not disagreeing that some listeners have developed a stronger ability to identify and interpret certain effects, but wow.)
 
Sep 4, 2010 at 12:02 AM Post #51 of 52
Quote:
Now you're really in "useless land." You apparently can't tell me which one is more accurate.  Given a list of measurements, you can tell me which numbers are larger than the other numbers. Wow.

I told you which was more accurate: you're comparing two different criteria, which can only be treated as two different criteria unless we agree upon a weighted scale between them.  Neither one is clearly better than the other, though, so we could sum it up by saying both suck, honestly.  Feel better about that answer?
 
Realistically, your THD graph wouldn't even happen, as it's not showing the fundamental frequencies and the harmonics properly.  It's not really done in the proper format for interpretation either, which makes your example fit well within this supposed "useless land" you speak of.
 
 
Quote:
Apparently you think that the peak in the green THD is more significant than the overall much greater area under the curve of the red THD. Peaks trump average apparently. Do you have any objective justification for this, any at all?
 
It's similar to SNR.  You apply the fundamental frequency and look for harmonics that crop up.  You then figure out how many decibels down they are from the fundamental.  The higher the SPL, generally the higher the distortion, so 90 dB and sometimes 95 dB are common levels for measurements.  The goal is to have a low percentage of distortion across all frequencies, though it depends on what you consider an acceptable threshold.  Many speaker engineers would probably go with around 3%, since there seem to be studies showing that distortion has relatively low to no audibility around such a level.  Many headphones fall well within that criterion today, let alone the 1% THD level.  Considering this, FR would probably carry greater weight on the scale unless distortion climbed above that level (3%).  I myself like 1%, though, as it shows careful attention to the engineering of the transducer, IMO.
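The dB-to-percent conversion itself is straightforward; for example, in Python:

```python
# Convert a harmonic's level in dB below the fundamental to percent distortion
def db_to_percent(db_below_fundamental):
    return 100 * 10 ** (-db_below_fundamental / 20)

print(db_to_percent(40))    # -40 dB -> 1.0%
print(db_to_percent(50))    # -50 dB -> ~0.32%, i.e. below 0.5%
print(db_to_percent(30.5))  # ~3%, around the threshold mentioned above
```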
 
Of course, anyone that's more familiar with distortion measurements is free to chime in if I have some misunderstandings.
 
A real example of a THD measurement:
 
[THD measurement plot attached in the original post]
 
Three fundamental tones were used: 100 Hz, 1 kHz, and 10 kHz.  At -40 dB you have 1%; at -50 dB you have <0.5%.  Below 100 Hz it seems like it might be a limit of the hardware, going by the site author's other measurements (same with the slope up to 200 Hz).  At 200 Hz it's down a total of what, 50 dB?  It only gets lower from there, as you can see.
 
Sep 9, 2010 at 3:09 AM Post #52 of 52
Excellent reference to the 3-frequencies intermodulation distortion measurements. Thank you!
 
I looked at virtually all ryumatsuba measurements and mentally compared them to my impressions of the headphones I've heard. Not a very scientific conclusion, yet it appears that the flatter the FRC and the lower the IMD, the "smoother" and more "transparent" I perceive the headphones to be. Other characteristics seem to have much weaker influence, for me personally.
 
I'm pretty satisfied with my HD650, and they measure like a champ too. I will keep them as the primary open cans for the time being. But the ryumatsuba data shows that high-end closed Denons may be noticeably better than my DT770, so there is an action item for me - compare DT770 to D5000 at home - stemming from this abstract (at times :) thread.
 
Once more, thanks.
 