
The case for caution in interpreting null results

post #1 of 20
Thread Starter 
I've learned a lot and clarified some of my thinking, so I thought I would write a bit of a "summary" post.

First of all, I am not opposed to doing blind tests. I would evaluate all new components blind if it were possible. I think that blind tests are needed to determine whether differences exist between the sound of components.

My issue is with the way blind tests are usually run, and the way they are interpreted.

As objectivists are fond of telling us, all humans are subject to illusions. We know that when a person compares two components while knowing their identities, that person may "hallucinate" (for lack of a better word) a difference between those components.

But I think people can also "hallucinate" two components to be the same. Why do objectivists feel it only works the one way? I give some reasons below.

This debate is often framed as between "those who trust their ears (i.e. non-controlled, sighted listening)" and "those who trust blind tests, science and measurements".

But there are other dividing lines. For instance, I see "those who treat their brains as a black box" and "those who introspect to figure out what is going on while they listen." Personally, I trust that introspection on my listening process is a valid source of information. Valid in what sense? Valid in the sense of suggesting what kinds of test protocols need to be explored, and suggesting the limitations of existing protocols.

Now, let's consider how a test subject performs in an ABX test. If A and B are dramatically different, the subject will get the right answer every time. We could express that by saying their probability of answering correctly is 1.0, or "p=1.0". If A and B are sonically identical, the test subject can only guess randomly. In that case p=0.5.

In the really interesting blind tests, the differences will be subtle, and we can safely assume that p is not 1.0. It is somewhat lower than that. That's equivalent to saying "The test subject makes mistakes sometimes, but the difference is real and audible."

What kinds of factors would drag down p? There are two:
  1. Factors which cause a test subject to "hallucinate" that two components are the same, even when they aren't.
  2. Factors which introduce distraction, causing the test subject to focus on the wrong things, and "hallucinate" the wrong differences.

Introspecting on my own listening process has suggested some reasons these would be factors. In the first case, "hallucinating" that two components are the same, I see the following as possible causes:
  1. Imagination contamination. (See http://www.head-fi.org/forums/f133/i...nation-431573/)
  2. Fatigue from listening to the same music many times. If the critical differences between components lie in musical factors, like emotion, excitement, pace and rhythm, then it's pretty obvious to a musician that your sensitivity to those factors will fatigue rapidly. Music is not meant to be listened to over and over; it loses its freshness very quickly.
  3. A context in which you simply can't pick up on musical factors. The best example of this is listening to small snippets of music.

Examples of distracting factors are:
  1. The simple fact that when the test music is very rich, we tend to hear different things each time we listen. On listen #1 I may hear the cymbals prominently. On listen #2 I may notice the bass drum. This can give rise to the illusion that the music has changed.
  2. Some people compare components by switching between them while the music is in progress. This doesn't even remotely make sense to me. The music you hear just after you switch is not the same music you heard just before you switched. I count this as a distracting factor: it may cause you to focus on the wrong differences.

Some blind tests are done quick-switch or short-snippet style. Others are done with long listening and/or long breaks between switching. Either style is going to be affected by some of the factors above.

So, we can safely assume that p is somewhat lower than 1.0 (even if you just say that people make mistakes). The factors I have described above will drag p down. Here is a major problem: as p gets lower, the probability of Type II error gets very high.

What is Type II error? This is the error of failing to reject the null hypothesis when it is actually false. Or in English, this is the mistake of finding no difference between the components where there really is a difference.

To give an example, let's say we are talking about a 16-trial test, which is common. If a person gets 12 or more right, they have "passed" the test at roughly the 4% level. That means their chance of getting a score that good purely by guessing is just under 4%.
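
Here is a minimal sketch of that calculation for anyone who wants to check it (plain Python, standard library only, assuming independent trials):

Code:
# Chance of getting k or more right out of n trials by pure guessing (p = 0.5).
from math import comb

def guessing_tail(n, k, p=0.5):
    """P(X >= k) for a binomial(n, p) listener."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(guessing_tail(16, 12))   # ~0.038, i.e. just under 4%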

Now consider that p is about 0.6. That is, there is a real difference, but due to distracting and hallucinatory factors, the person only stands a 60% chance of giving the right answer on each trial. In that case, if you run 16 trials, the person probably won't get more than 9 or 10 right. The test will fail to reject the null hypothesis, and yet the null hypothesis was false! This is called Type II error.

I'm no statistician, but for the standard 16-trial test with 12 of 16 as the pass criterion, the chance of a Type II error comes out to roughly 0.8 at p=0.6, and it is still above 0.5 at p=0.7. In other words, for subtle differences you are more likely than not to get a null result, even though there is a real difference.
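
And the rough calculation behind those numbers, under the same binomial assumptions (independent trials, a fixed per-trial probability p, 12 of 16 as the pass criterion):

Code:
# Type II error (beta) for a 16-trial ABX with a 12-of-16 pass criterion.
from math import comb

def tail(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for p in (0.6, 0.7, 0.8):
    power = tail(16, 12, p)            # chance the subject actually "passes"
    print(f"p = {p}: power = {power:.2f}, Type II error = {1 - power:.2f}")
# roughly: p = 0.6 -> Type II ~0.83, p = 0.7 -> ~0.55, p = 0.8 -> ~0.20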

The other point I want to make is that Type II error goes down as you do more trials, but you have to do a LOT more trials to get it below about 10%. For p=0.75 you need something like 30 to 40 trials, and that many is very uncommon in blind testing.
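
The scan itself, same assumptions, with the pass criterion recomputed for each trial count so that the guessing probability stays at or below 5%:

Code:
# How many trials to get Type II error down near 10% when p = 0.75?
from math import comb

def tail(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def pass_criterion(n, alpha=0.05):
    """Smallest score whose probability under pure guessing is <= alpha."""
    return next(k for k in range(n + 1) if tail(n, k, 0.5) <= alpha)

p_true = 0.75
for n in range(16, 41, 4):
    k = pass_criterion(n)
    beta = 1 - tail(n, k, p_true)
    print(f"n = {n}: pass at >= {k} correct, Type II error ~ {beta:.2f}")
# Type II error only settles down near 10% once n is well into the 30s.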

Something else rarely acknowledged by objectivists is that differences between components may fall into distinct categories. Some differences are clearly audible under quick-switch conditions; lots of ABX tests have been run that give good data about codecs. But that doesn't mean all differences are audible under quick-switch conditions. From introspection on my listening process, I deduce that musical differences are inaudible under quick-switch conditions, or when switching many times.

In other words, p has been dragged down all the way to 0.5.

Finally, let me address the claim we frequently see that "no one has ever passed a blind test proving that cables (or DACs, or amps, depending on who's talking) sound different." This is false on its face, for a simple reason: if you run tests at a 5% significance level, then about 5% of them will reject the null hypothesis through guessing alone. That doesn't prove there is a difference, but what it does prove is that the person making this claim is cherry-picking or dismissing results. There most certainly have been tests that rejected the null hypothesis, but the objectivist making the claim has dismissed those tests as improper in some way. For a person with an agenda, it is easy to dismiss everything they don't want to hear.
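
A quick back-of-the-envelope sketch of that point (assuming the tests are independent, each is run at a 5% significance level, and there is genuinely no difference in any of them):

Code:
# Chance that at least one of N independent "null" tests rejects the null
# purely by luck, when each test is run at alpha = 0.05.
alpha = 0.05
for n_tests in (10, 20, 50):
    print(n_tests, "tests:", round(1 - (1 - alpha) ** n_tests, 2))
# 10 tests -> ~0.40, 20 tests -> ~0.64, 50 tests -> ~0.92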

To summarize, I'm not opposed to blind testing and I don't dismiss blind test results as completely irrelevant, but I do think there is reason to be cautious in interpreting null results.
post #2 of 20
No offense but this to me is an example of the complicated, convoluted, splitting hairs, storm in a teacup arguments that I see in favor of cables. Contrast this to the very simple cut and dried argument against them and you have one of the main reasons that I am a non-believer.
If a $3K cable does not give P=1.0 then it is overpriced (IMO).
If expensive cables = grand improvements it would happen fairly regularly and as far as I know it never has.
post #3 of 20
Thread Starter 
Quote:
Originally Posted by Real Man of Genius View Post
No offense but this to me is an example of the complicated, convoluted, splitting hairs, storm in a teacup arguments that I see in favor of cables. Contrast this to the very simple cut and dried argument against them and you have one of the main reasons that I am a non-believer.
First of all, this doesn't have to be about cables. All the arguments I supplied are generic to any kind of device.

But are you dismissing what I write here simply because it looks similar to other arguments that you have already dismissed? I think your position is fair, but it would be nice to see a point-by-point rebuttal rather than a blanket dismissal.

Should the answers be cut and dried? I think interpreting statistics is complicated, and that audio testing is complicated for the reason that it sits at the intersection of the subjective (first-person experience) and the objective (measurements and observing test subjects in the third-person).

Quote:
If a $3K cable does not give P=1.0 then it is overpriced (IMO).
If expensive cables = grand improvements it would happen fairly regularly and as far as I know it never has.
I would agree that a $3K cable should give p=1.0. But the caveat is that the test conditions have to be right, or else p will be dragged down through no fault of the cable.

Note that you are repeating the claim "as far as I know it never has." As I stated, we can be sure that it HAS happened, even if the test subject was guessing. So anyone who says that no one has ever passed a cable/DAC/whatever blind test is just not looking at enough tests to get a fair picture.
post #4 of 20
Well, YOU are the one who referenced the cable argument in your original post and I think it is simple: you either hear it or you don't.
post #5 of 20
Thread Starter 
Quote:
Originally Posted by Real Man of Genius View Post
Well, YOU are the one who referenced the cable argument in your original post and I think it is simple: you either hear it or you don't.
Which argument is "the cable argument"?

I would be interested to know if you've ever observed yourself noticing different things each time on repetitions of a musical selection.
post #6 of 20
You mention objectivists and cables in your original post. That was the argument I was referring to so you tell me.

Yes I have noticed different things upon repetitive listening. I also will sometimes focus on certain aspects of a track such as the drums or vocals.
post #7 of 20
Quote:
Originally Posted by mike1127 View Post
I would agree that a $3K cable should give p=1.0. But the caveat is that the test conditions have to be right, or else p will be dragged down through no fault of the cable.
I think that if the differences are subtle enough to elude quick ABXing, then the difference is relatively small and probably not worth pursuing. If I need test conditions to be perfect to be able to ABX a $20 cable from a $2000 cable then there are more worthwhile areas to spend (or save) my money.

Even if it would produce p=1.0 in ideal conditions, I can't imagine that nonideal conditions would drag p down to the region of "no difference" even if perhaps it would be slightly lower (so instead of say a .1% chance of guessing, a 1% chance of guessing).

Also, I think the Type II error you mention can be minimized by consciously controlling your focus on different areas. For example, when I try to ABX MP3, I often focus on cymbals to listen for high-frequency cutoff.
post #8 of 20
Thread Starter 
Quote:
Originally Posted by AtomikPi View Post

Also, I think the Type II error you mention can be minimized by consciously controlling your focus on different areas. For example, when I try to ABX MP3, I often focus on cymbals to listen for high-frequency cutoff.
You're not talking about Type II error. Type II error is not an "error" made by the test subject during the test; it's a property of the statistical analysis of the results as a whole.
post #9 of 20
Great post mike1127, with many fine points.

The correct name for the statistical property you are referring to is "power", and indeed it goes up as the number of trials goes up when testing a single individual. The calculations you are making are standard ones for the binomial test, and I surely know how to make them, but don't bother. There are so many assumptions behind them -- like independence from one trial to the next -- that are not satisfied here. I don't care about the exact values the calculation yields -- the important point is the one you made so well: it takes a lot of trials to reliably reject the null hypothesis when the effect size is not large.

And you are oh so right -- you haven't convinced AtomikPi, I'll take him on later -- but surely the standard A/B/X protocol and environment mutes to some degree an effect that is potentially already not large.

It turns looking for a baseball bat in a haystack into looking for a needle.

On top of the many problems you point out, let me add this: A/B/X asks the wrong question. It asks an unnatural question. It asks a question that is harder than the simple "are these two samples different?" and introduces response bias.

I actually like "Do you prefer the first or second" better than either of these questions -- minimizes a type of response bias where you try to please the experimenter because you think you know what he wants, or just the opposite, you try to foil him. Statistical control is introduced in my type of test by using swindles (both A and B are secretly the same). Statistical power is very high, because in a swindle you KNOW the correct answer for everyone by definition (the probability model is not binomial anymore, since four answers are allowed: Prefer A, Prefer B, Hear a difference but don't have a preference, Don't hear a difference).

Folks have argued with me that "Difference? Yes/No" is simpler than "Preference?", but I do not think that makes it right -- response bias is higher with the forced choice of Yes/No, as opposed to the gentle bailout of "difference but no preference". When you get that answer a lot on swindles, you can quickly conclude there is no difference for this subject and she is guessing.

It is ironic that A/B/X was first introduced to cure the bias that people thought a simple 2-way choice introduced.

AtomikPi -- I think the whole point is that our listening rooms at home do present nearly perfect conditions. If I can reliably hear a difference between two cables 6 out of 10 times (it takes a lot of replications to prove that), then this means (roughly) that I will enjoy the music more with the better cable 10% of the time at home. Out of every 10 CDs I play, one will have a sound-quality improvement that I can hear. I will gladly pay $200 over $20 for that cable. $2000 ... maybe not. But if I had more money, yes.
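
If it helps make the swindle idea concrete, here is a toy sketch of the bookkeeping (the responses are entirely made up, and the labels are just mine -- not any standard protocol):

Code:
# Toy tally of a preference test with swindle trials (A and B secretly identical).
# Responses: "prefer_first", "prefer_second", "different_no_pref", "no_difference".
from collections import Counter

def summarize(trials):
    """trials: list of (is_swindle, response) pairs; tally real vs swindle trials."""
    real = Counter(resp for is_swindle, resp in trials if not is_swindle)
    swin = Counter(resp for is_swindle, resp in trials if is_swindle)
    return real, swin

# Hypothetical subject who reports a preference on swindles about as often as on
# real trials -- a strong hint that she is guessing rather than hearing anything.
example = ([(False, "prefer_first")] * 6 + [(False, "no_difference")] * 4
           + [(True, "prefer_first")] * 5 + [(True, "no_difference")] * 5)
real, swindle = summarize(example)
print("real trials:   ", dict(real))
print("swindle trials:", dict(swindle))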
post #10 of 20
Quote:
Originally Posted by Real Man of Genius View Post
No offense but this to me is an example of the complicated, convoluted, splitting hairs, storm in a teacup arguments that I see in favor of cables. Contrast this to the very simple cut and dried argument against them and you have one of the main reasons that I am a non-believer.
If a $3K cable does not give P=1.0 then it is overpriced (IMO).
If expensive cables = grand improvements it would happen fairly regularly and as far as I know it never has.
Mike's arguments are valid. It's all about refining the test method.

About the whole p = 1.0 thing: statistically speaking, you don't need a 100% score to establish a significant difference. No researcher sets a 100% score as the threshold for significance from the get-go. If 100% were the criterion, then a single mistake during testing (and mistakes will happen) would make the test come out non-significant (you fail to reject H0 = no difference between the components), and a Type II error would occur (failing to reject H0 when you should have, in favor of H1 = a difference between the components). Most use an α of 0.05 (5%, i.e. 95% confidence), and in practice many researchers accept even higher values of α (so results become significant more easily). It depends on the research you're doing, but 100% is never used.

About the Type II error: is it so important that we need to run a lot of trials (thereby risking effects, like fatigue, that could influence the results, and making results reach statistical significance more easily) or increase α (and with it the chance of a Type I error)?
I think not. It's about the consequences of the results here. Minimizing the Type II error (the chance of concluding "no difference" when there really is one) is, in my opinion, less important than minimizing the Type I error (the chance of rejecting H0 when I shouldn't). A Type I error in this case could mean 'recommending' expensive audio components (the results say they make a difference, right!?) when in fact they don't, which wastes money. A Type II error would mean 'recommending' inexpensive audio components (the results say there is no difference, right!?) when there actually is one, which at least saves money.
It's IMO better to start from a small α (0.05 or even 0.01) and a number of trials that won't introduce effects such as fatigue that significantly influence the results. Sure, the chance of a Type II error might be high, but I see it as more important to keep the chance of a Type I error down, to avoid claims that push people into buying expensive gear when they shouldn't.
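
To make the trade-off concrete, here is a quick sketch (same binomial model as earlier in the thread; the 16-trial test and a listener at p = 0.7 are just illustrative assumptions):

Code:
# Trade-off between Type I (alpha) and Type II (beta) error for a 16-trial
# test, assuming a listener with a real but subtle difference at p = 0.7.
from math import comb

def tail(n, k, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p_true = 16, 0.7
for k_pass in range(11, 15):
    alpha = tail(n, k_pass, 0.5)        # false-positive rate under pure guessing
    beta = 1 - tail(n, k_pass, p_true)  # miss rate when the difference is real
    print(f"pass at >= {k_pass}/16: alpha = {alpha:.3f}, beta = {beta:.2f}")
# Tightening alpha from ~0.10 down to ~0.002 pushes beta from ~0.34 up to ~0.90.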
post #11 of 20
Quote:
Originally Posted by wavoman View Post
On top of the many problems you point out, let me add this: A/B/X asks the wrong question. It asks an unnatural question. It asks a question that is harder than the simple "are these two samples different?" and introduces response bias.

I actually like "Do you prefer the first or second" better than either of these questions -- minimizes a type of response bias where you try to please the experimenter because you think you know what he wants, or just the opposite, you try to foil him.
I don't see that A/B/X inherently asks such a question.

The only thing it asks is to identify X as either A or B. It places no specific demands on how the listener comes to that identification and does not force a "same/different" paradigm.

A listener may switch between A and B and choose one or the other based purely on preference. Once they've done that, then they can do the same between their preference and X, again going purely by preference. And once that's done, then the identification of X takes care of itself by way of simple logic.

For example, let's say the listener prefers A to B. If, when comparing between A and X they prefer A to X, then logic says that X is B. If they find they have no preference between A and X, then logic says that X is A.

So again, I don't see that A/B/X inherently demands any sort of same/different paradigm.
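
Spelled out as a little sketch (my own pseudo-code; prefer() is just a stand-in for the listener's judgment, not any real tool):

Code:
def identify_x(prefer):
    """Guess which of "A"/"B" the hidden X is, using only preference judgments."""
    ab = prefer("A", "B")
    if ab == "none":
        return None                  # no preference between A and B: this route stalls
    fav = "A" if ab == "first" else "B"
    other = "B" if fav == "A" else "A"
    ax = prefer(fav, "X")
    if ax == "first":
        return other                 # still prefer the favorite over X -> X must be the other one
    return fav                       # no preference between favorite and X -> X is the favorite

# Toy run: a listener who genuinely prefers A, with X secretly being B.
answers = {("A", "B"): "first", ("A", "X"): "first"}
print(identify_x(lambda x, y: answers.get((x, y), "none")))   # -> B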

k
post #12 of 20
Your analysis is correct; however, it only holds for tests with a sample size of one. This is a non-issue for most (I say most to cover exceptions, but in my experience it's been all) tests done in peer-reviewed journals. Those tests are done by people who have the resources to test over a large sample size, at which point statistical power increases such that the risk of committing a type 2 error is negligible.

Your analysis holds more so for experiments with a single subject, in which case failing the test would not necessarily indicate that the amps are audibly indistinguishable. However, it does indicate that you can't confidently say that there's an audible difference. The real question: if you can't confidently say that there's an audible difference, why are you buying cables (or amps, DACs, etc.) in the first place?

A couple of things to point out:

Quote:
Originally Posted by mike1127 View Post
There most certainly have been tests that rejected the null hypothesis, but the objectivist making the claim has dismissed those tests as improper in some way. For a person with an agenda, it is easy to dismiss everything they don't want to hear.
Where are these tests? If you're certain that there are tests that have rejected the null, you should be able to find them.
post #13 of 20
Note that with one subject, n = number of trials.

With multiple subjects (randomly chosen from the population), n = number of subjects.
post #14 of 20
Quote:
Originally Posted by royalcrown View Post
...test over a large sample size, at which point statistical power increases such that the risk of committing a type 2 error is negligible
Only true under the assumption that the people in the test are interchangeable sample units. But they are not -- I am a huge fan of sample-size-one tests with large n (many trials), replicated over many individuals (large m, say), but the results are NOT pooled.

Much harder analysis, and not in the standard stats packages which is why it is not done, but failure to do it wrecks most experiments.

Statisticians would say "between-subject variability swamps within-subject variability".

This is but one of many things wrong with standard "group" DBT. The correct protocols do not pool, but rather cull results, known as "play the winner" sampling. But there is then a selection-fallacy trap (you accidentally conclude that a very lucky guesser can tell the difference); however, using swindles will smoke that out.

I have not seen even one well done test published in AES. Some of the ones from Europe that you need Google translator for seem much better -- individual results are not pooled. Note that these well done tests still have not found individuals who can hear differences that many of us claim exist. The tests may have other flaws.

I consider the issues still open. And also not very relevant to buying and listening to gear in the real world, as I have argued elsewhere (in essence: audition everything at home, and keep what sounds better to you, even if it is placebo, because, well, it sounds better to you).

But from a scientific standpoint, not settled IMHO. I am trying hard to get equipment built for a good long-term multi-individual not-pooled blind test of analog interconnects, with swindles. Very tough set of issues.
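
For what it's worth, the per-individual bookkeeping I have in mind looks roughly like this (a sketch only -- the scores are made up, and it ignores the multiple-comparisons correction you would want when screening many subjects):

Code:
# Analyze each subject separately instead of pooling everyone into one big test.
from math import comb

def binom_tail(n, k, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical results: subject -> (correct answers, total trials)
results = {"subject_1": (22, 30), "subject_2": (16, 30), "subject_3": (14, 30)}

for subject, (correct, total) in results.items():
    p_value = binom_tail(total, correct)   # chance of doing this well by guessing
    verdict = "worth retesting" if p_value < 0.05 else "no evidence"
    print(f"{subject}: {correct}/{total} correct, p = {p_value:.3f} -> {verdict}")
# Pooling asks "can the average listener hear it?"; this asks, for each person,
# "can THIS listener hear it?" -- and a lucky guesser will occasionally look
# good, which is exactly what retests and swindles are there to catch.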
post #15 of 20
Quote:
Originally Posted by wavoman View Post
Only true under the assumption that the people in the test are interchangeable sample units. But they are not -- I am a huge fan of sample-size-one tests with large n (many trials), replicated over many individuals (large m, say), but the results are NOT pooled.

Much harder analysis, and not in the standard stats packages which is why it is not done, but failure to do it wrecks most experiments.

Statisticians would say "between-subject variability swamps within-subject variability".

This is but one of many things wrong with standard "group" DBT. The correct protocols do not pool, but rather cull results, known as "play the winner" sampling. But there is then a selection-fallacy trap (you accidentally conclude that a very lucky guesser can tell the difference); however, using swindles will smoke that out.

I have not seen even one well done test published in AES. Some of the ones from Europe that you need Google translator for seem much better -- individual results are not pooled. Note that these well done tests still have not found individuals who can hear differences that many of us claim exist. The tests may have other flaws.

I consider the issues still open. And also not very relevant to buying and listening to gear in the real world, as I have argued elsewhere (in essence: audition everything at home, and keep what sounds better to you, even if it is placebo, because, well, it sounds better to you).

But from a scientific standpoint, not settled IMHO. I am trying hard to get equipment built for a good long-term multi-individual not-pooled blind test of analog interconnects, with swindles. Very tough set of issues.
I have personally seen several AES articles, all done in the US, that only pool subjects into broad categories, such as "expert" versus "novice", etc. No test to date has shown a difference between so-called "golden ears" and regular folk. While it's true that musicians and experienced listeners can distinguish differences in patterns or rhythms, to date I have yet to see any evidence that people vary in their ability to distinguish between audio components.