
The validity of ABX testing

post #1 of 109
Thread Starter 
This is a pretty simple thread, but I decided to make a separate thread because I want to get as many answers as possible without threadjacking.

I think it's safe to assume that at least a handful of people on this forum believe that double blind testing is invalid as a form of testing, and argue from there that a negative result of an ABX test, as currently performed, with all of its downfalls (no pooling, large sample size, listening fatigue, quick-switching, imagination contamination, etc.), does not mean much. The position, if I've summarized it properly, is that ABX is too flawed a testing method to draw reliable conclusions from.

Now say I were to perform an ABX test, with quick-switching, a large sample size, and no pooling. Say that in this hypothetical the test comes back positive - people can, according to this test methodology, distinguish between, say, cables (what's being tested isn't that important; you can insert DACs or amps if you want).

The question is: would you accept this test?

Assuming that it was done properly, most "skeptics" would, I imagine, take it as sufficient proof that at least some people can tell the difference between two cables. This is my personal belief. However, this belief slices two ways - we can't say that positive results are compelling without also saying that negative tests are compelling as well (even if they're far less compelling), or at least that they hold some sort of validity.

If, however, you are not a skeptic, is this positive result notable? If not, why not? And if so, how do you reconcile the apparent contradiction between shunning negative results and embracing positive results (given that ABX testing is so flawed to begin with)?
post #2 of 109
ABX testing is the gold standard for scientific drug trials. If it is the best tool we use to make drugs that save millions of lives, why is it not good enough for audio cables and amplifiers? That is a serious question. Maybe I am missing something here.

To answer your second question, I am a total skeptic about cables, but if a test with a large enough sample size showed that people could distinguish between them, then yes of course I would change my mind.
post #3 of 109
Quote:
Originally Posted by royalcrown View Post
If, however, you are not a skeptic, is this positive result notable? If not, why not? And if so, how do you reconcile the apparent contradiction between shunning negative results and embracing positive results (given that ABX testing is so flawed to begin with)?
Actually it's rational to be suspicious of negative results but accept positive results. In a Student's t-test, or whatever the official name is, we never say we have proven the null hypothesis. We always say we've failed to reject it.

But to give some more detail:

Consider that we perform an ABX test with two devices A and B. For the moment we adopt a simple model of the listener---we assume they are capable of telling the difference between A and B, but under the conditions of the test occasionally make mistakes or get confused.

Let p be the probability that the listener can correctly identify X=A or X=B. If the difference is easy to hear and the listener never makes mistakes, then p=1.0. But p is probably somewhat less than 1.0 because the listener will make mistakes. Let's say p is 0.75.

What kinds of mistakes would they make?
  • Maybe they just space out?
  • Maybe they are superimposing some imagination or "hallucination" on the experience of listening. We know expectations can affect what they hear, so if they develop an expectation in the middle of the test and that expectation actually leads them astray, they will give the wrong answer.
  • Maybe they hear so many different things that they can't sort them out in their mind, and just start to feel everything is the same. The idea is that a more skilled listener, or the same listener under better conditions (with better test music, with more training, etc.), would be able to keep their attention focused.

As p gets lower, it takes more and more trials to reject the null hypothesis and the chance of type II error increases.

A DBT is designed to show to a certain confidence that p is greater than 0.5. (p=0.5 would be random guessing). We can never say how much greater than 0.5, just that it's not likely to be 0.5.

As I've said, if p is something just a little bigger than 0.5, then it will take many, many trials to reject the null hypothesis. Since most published tests used something like 16 trials, this is clearly not sufficient for cases in which p is only a little above chance, like 0.6.
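To make that concrete, here is a rough Python sketch using scipy's exact binomial functions (the target power of 0.80 and the example values of p are just assumptions for illustration) that estimates how many trials an ABX test needs before it has a reasonable chance of rejecting the guessing hypothesis:

```python
from scipy.stats import binom

def trials_needed(p_true, alpha=0.05, power=0.80):
    # Smallest number of trials for which an exact binomial test of
    # "guessing" (p = 0.5) detects a listener whose true hit rate is
    # p_true with at least the requested power.
    for n in range(5, 2000):
        # smallest score that rejects guessing at significance level alpha
        k = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) <= alpha)
        if binom.sf(k - 1, n, p_true) >= power:
            return n
    return None

for p in (0.9, 0.75, 0.6):
    print(p, trials_needed(p))
# A listener at p = 0.9 needs only a handful of trials, but at p = 0.6
# the required count runs well past a hundred - far more than the usual 16.
```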

Furthermore some subjectivists theorize that p is greatly reduced under quick-switch conditions when the difference between A and B is a "musical" difference (for lack of a better word)---that is, when the difference is most obvious while perceiving musical elements such as dance, emotion, large-scale shape, etc. In fact, no one seems to have invented a test that accommodates these conditions while still controlling the listener's use of their attention, but I'm working on it.
post #4 of 109
Thread Starter 
mike: I know the statistics behind testing, but that's not what I'm concerned with. If an experiment is flawed (as you have argued at least on one occasion) then the statistical analysis doesn't matter - statistical analysis works with the assumption that the scientific test is sound. What I'm asking is, in the hypothetical that your objections are indeed valid, would you still accept a positive result from a flawed test? If an ABX test is designed poorly then a positive result should be, statistically, as meaningless as a negative result.
post #5 of 109
Quote:
Originally Posted by tvrboy View Post
ABX testing is the gold standard for scientific drug trials. If it is the best tool we use to make drugs that save millions of lives, why is it not good enough for audio cables and amplifiers? That is a serious question. Maybe I am missing something here.
FWIW, here's an article (also see the comments that follow) that attempts to identify some of the issues. If you do a search, there have been several threads on this sub-forum where some members who know quite a bit about DBTs have addressed some of the differences between medical trials and audio trials. Hope this helps a little. You ask a legitimate question.
post #6 of 109
Quote:
Originally Posted by royalcrown View Post
mike: I know the statistics behind testing, but that's not what I'm concerned with. If an experiment is flawed (as you have argued at least on one occasion) then the statistical analysis doesn't matter - statistical analysis works with the assumption that the scientific test is sound.
I'm not sure what you mean by "the assumption that the scientific test is sound." You probably know more about statistics than me. My understanding was that a test can show p is greater than 0.5 to some degree of confidence. The "unsoundness" of the test can be modeled by saying it reduces p. Explain to me what I'm missing.

Quote:
What I'm asking is, in the hypothetical that your objections are indeed valid, would you still accept a positive result from a flawed test? If an ABX test is designed poorly then a positive result should be, statistically, as meaningless as a negative result.
Again, I don't see this question as separate from the statistics. If we run 100 tests to a 5% confidence level and a few of them reject the null hypothesis, then I would be suspicious those tests were a result of guessing.
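As a quick sanity check (a sketch, assuming the 100 tests are independent and every subject really is guessing), the binomial distribution says a few rejections are exactly what chance alone would produce:

```python
from scipy.stats import binom

n_tests, alpha = 100, 0.05
# If every subject is purely guessing, each test still "succeeds" with
# probability alpha, so some spurious rejections are expected.
print("expected false rejections:", n_tests * alpha)                      # 5.0
print("P(3 or more by chance):", round(binom.sf(2, n_tests, alpha), 2))   # ~0.88
```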
post #7 of 109
Quote:
Originally Posted by mike1127 View Post
Actually it's rational to be suspicious of negative results but accept positive results. In a Student's t-test, or whatever the official name is, we never say we have proven the null hypothesis. We always say we've failed to reject it.
This is not entirely accurate. When we fail to reject the null, we say that we "accept the null" as it is the more likely hypothesis given the evidence.

Such results (if significant) stand as relatively strong evidence against the test hypothesis.

It is true that it is methodologically very difficult to prove a negative, but remember that the scientific method "proves" nothing.
post #8 of 109
Thread Starter 
Quote:
Originally Posted by mike1127 View Post
I'm not sure what you mean by "the assumption that the scientific test is sound." You probably know more about statistics than me. My understanding was that a test can show p is greater than 0.5 to some degree of confidence. The "unsoundness" of the test can be modeled by saying it reduces p. Explain to me what I'm missing.
For example, if you were to do the test sighted, you can't do any statistical analysis on it because the data wasn't gathered in a proper manner. Likewise, if, say, quick switching was in fact an improper way to reveal differences, statistical analysis would be pointless on tests that used quick switching because the testing methodology was unsound.


Quote:
Originally Posted by mike1127 View Post
Again, I don't see this question as separate from the statistics. If we run 100 tests to a 5% confidence level and a few of them reject the null hypothesis, then I would be suspicious those tests were a result of guessing.
The confidence level doesn't matter. It's about the validity of blind testing in the first place. Some people, including you, have objected to ABX testing on the grounds that, among other things, listening to "sound as sound" makes it impossible to tell two cables apart, and that quick switching encourages that kind of listening. Let's use that as an example, though this is an open question. Say I ran a test where people listened to sound as sound and only used quick-switching. Now say that, somehow, they were actually able to detect a difference. Having just discredited ABX testing as an incorrect method of testing, and with those objections actually being true, would you still say that the positive result obtained by that test matters?

I say this because, even granting that a negative test is not proof by itself, the primary objections to ABX testing (at least at some point in the debate's history) were that ABX testing itself is flawed to begin with. If that were the case, though, what would it imply about positive results? That's what I'm interested in.
post #9 of 109
Quote:
Originally Posted by royalcrown View Post
For example, if you were to do the test sighted, you can't do any statistical analysis on it because the data wasn't gathered in a proper manner. Likewise, if, say, quick switching was in fact an improper way to reveal differences, statistical analysis would be pointless on tests that used quick switching because the testing methodology was unsound.
I don't get this nebulous concept that "data wasn't gathered in a proper manner."

A sighted test isn't a test on sound alone. As long as you recognize that, sure you can do statistical analysis on it. You would be testing more whether people need glasses, but to a first approximation there still is a 'p', that is, there still is a probability that someone gets the right answer, and you can reject the null hypothesis that p=0.5.



Quote:
The confidence level doesn't matter. It's about the validity of blind testing in the first place.
I don't get your point. I accept that blind tests are tests on sound alone. So there's some value of p (in a crude model) and it's based only on what someone can hear... and here's the kicker... under those test conditions. It's not whether the whole concept is "valid" in some nebulous way. It's about how the test conditions affect p.

Quote:
Some people, including you, have objected to ABX testing on the grounds that, among other things, listening to "sound as sound" makes it impossible to tell two cables apart, and that quick switching encourages that kind of listening. Let's use that as an example, though this is an open question. Say I ran a test where people listened to sound as sound and only used quick-switching. Now say that, somehow, they were actually able to detect a difference. Having just discredited ABX testing as an incorrect method of testing, and with those objections actually being true, would you still say that the positive result obtained by that test matters?
Sure it matters. Under those conditions, and with respect to the devices under test, we have rejected the null hypothesis.
post #10 of 109
Quote:
Originally Posted by ph0rk View Post
This is not entirely accurate. When we fail to reject the null, we say that we "accept the null" as it is the more likely hypothesis given the evidence.


Such results (if significant) stand as relatively strong evidence against the test hypothesis.
Okay, you may know more about statistics than I do, but this isn't really about wording; it's about the underlying assumptions.

p could be anything between 0.5 and 1.0. A significant result is good evidence that p is NOT 0.5. It doesn't, however, tell you what p is; p could be anywhere in that range. All we know is that it's probably not 0.5.

A null result could be looked at as evidence p is 0.5, but on the other hand a null result is consistent with a range of p's. A 16-trial null result is consistent with p=0.6, or p=0.7. That is, a null result is the most likely thing to happen when p=0.7 or below.

Actually this requires a bit of calculation, and I'm not sure I've got it right, but I'm just trying to elucidate this idea that we aren't just dealing with funky wording, but with mathematical models.
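For what it's worth, the calculation is easy to run (a sketch assuming a 16-trial test analyzed with an exact binomial test at the 5% level):

```python
from scipy.stats import binom

n, alpha = 16, 0.05
# Smallest score that rejects "guessing" (p = 0.5) at the 5% level: 12 of 16.
threshold = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) <= alpha)

for p in (0.5, 0.6, 0.7, 0.8):
    p_null = binom.cdf(threshold - 1, n, p)  # chance of a non-significant score
    print(f"p = {p}: P(null result) = {p_null:.2f}")
# A null result remains the single most likely outcome up to roughly p = 0.7.
```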
post #11 of 109
To give another example: a DBT is a test of a human being as an instrument.

Another kind of instrument is an SPL meter with a microphone of some sort. Let's say we want to detect a sonic event with our SPL meter. Well, we first have to know the characteristics of this meter. For example, its microphone probably has some directivity and a characteristic frequency response.

This SPL meter might be a very good instrument for detecting loud noises in the midrange band in the direction it's facing.

On the other hand, it would be a very insensitive instrument for detecting bass rumbles in the other direction.

A human in a quick-switch DBT is a fairly decent instrument at detecting certain kinds of differences. One category would be, simply, large differences. Another category would be certain kinds of differences between codecs.

My argument is that just because a human is a good instrument for some kinds of signals under certain conditions doesn't mean those conditions are ideal for all kinds of signals.

A positive DBT result obviously means that, to the stated level of confidence, the subject was not guessing, and therefore the difference between A and B was audible in that context. So it always means something.

When you try to run a quick-switch DBT on two devices which don't have audible distinctions under those conditions, you have set up a situation in which the "noise" in this human-being-instrument dominates the "signal". The "noise" comes from imagination (where by "imagination" I mean the same thing bias and expectation produces), lapses of attention, and confusion.
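One crude way to model that (purely a toy assumption of mine, not an established result): suppose that on some fraction of trials the listener effectively guesses because of a lapse, imagination, or confusion, and otherwise responds with their true discrimination ability. The lapse rate drags the effective p toward 0.5, which is exactly the regime where a 16-trial test tells you almost nothing:

```python
def effective_p(p_true, lapse_rate):
    # On a lapse the listener guesses (50/50); otherwise they respond with
    # their true discrimination ability p_true.
    return (1 - lapse_rate) * p_true + lapse_rate * 0.5

for lapse in (0.0, 0.25, 0.5, 0.75):
    print(f"lapse rate {lapse:.2f} -> effective p = {effective_p(0.9, lapse):.3f}")
```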
post #12 of 109
From all the answers I have read, none has been directly related to royalcrown's question. The answers keep moving away from the initial question and we get nowhere...

He wants to know whether the people who claim DBTs are an invalid method for testing differences, and who reject any test concluding that cables (in this case) don't make a difference, would accept the test if some people passed the DBT and showed that, in their case, different cables do make a difference.
post #13 of 109
I think I would, depending on who did the test (not if it was some cable manufacturer).
The problem with DBT is that people don't just listen with their ears; looks and thoughts are just as important. So if things are inaudible, that says nothing about a difference in perceived sound. You can tell people a thousand times that they sound the same, but in the end it is about your own perception. If there is a clear DBT difference, then yes, it probably actually sounds different, so there is no reason to reject it. But of course, that test would be repeated and argued about until the difference is close to 0.
I don't think there will ever be a positive result from DBTs... which doesn't make me enjoy the Equinox cable any less
post #14 of 109
Quote:
Originally Posted by royalcrown View Post
However, this belief slices two ways - we can't say that positive results are compelling without also saying that negative tests are compelling as well (even if they're far less compelling), or at least that they hold some sort of validity.
The basic premise of ABX is that a positive result proves that a difference exists, and that a negative result doesn't prove anything.

Quote:
Originally Posted by ph0rk View Post
Such results (if significant) stand as relatively strong evidence against the test hypothesis.
That holds in medicine because tests are run on statistically representative samples of the population, in order to evaluate the average effect, while in hi-fi, tests are run with one subject, which is anything but representative, in order to find the smallest possible effect.

Under these conditions, a negative doesn't prove the null hypothesis.

Quote:
Originally Posted by ph0rk View Post
It is true that is methodologically it is extremely difficult to prove a negative, but remember that the scientific method "proves" nothing.
Sophism. It all depends on the context.

Quote:
Originally Posted by royalcrown View Post
Now say I were to perform an ABX test, with quick-switching, a large sample size, and no pooling. Say that in this hypothetical the test comes back positive - people can, according to this test methodology, distinguish between, say, cables (what's being tested isn't that important; you can insert DACs or amps if you want).

The question is: would you accept this test?
It depends on the context. A lot of conditions are required for a test to be valid. For example, I do not accept the Ionostat test as proof (Coin audio hi-fi: Ionostat schematics, construction plans, and listening-test report; DIY projects for music-loving tinkerers).

If 19 failed tests have been run in the past by other people, and this one succeeds with an error probability of 0.05, this is not a success.
If the test involved 20 listeners, and one succeeds with an error probability of 0.05, this is not a success.
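(To put a number on that second case - a sketch assuming the 20 listeners are independent and each would pass by chance with probability 0.05:)

```python
alpha, n_listeners = 0.05, 20
# Probability that at least one of 20 purely guessing listeners "passes".
p_at_least_one = 1 - (1 - alpha) ** n_listeners
print(round(p_at_least_one, 2))   # ~0.64
```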

If amplifiers have been compared with volume matched within 0.1 dB, but not the balance setting, it is interesting, but not a final proof. Volume controls can have slight left-right imbalance that may be audible.

If CD players are compared by playing two identical CDs at the same time, with left and right output levels precisely matched, pausing each player at the beginning, then starting playback at exactly the same time and using an ABX switch for instant comparison, it doesn't work either, because we can hear extremely short skips or lags, and we can't start playback with millisecond precision by hand.

If the cables compared are plugged in by an operator who can be seen by the listeners, the test is not double blind, because of the "clever Hans" effect (see Wikipedia). What a horse can do, a listener can do too.
If the operator can't be seen, what about the noise made by the plugs? In the headphone amplifier test that we did with Nitri, I could hear, in spite of the music playing in the headphones I was wearing, the dull sound of the plastic plug that Nitri put down on the desk behind me after plugging the metal one into the amplifier.

And Nitri, wearing the headphones, could distinguish between the electrical "clicks" the amplifiers made when the plug was inserted, even while the music was playing.

If everything is perfectly set up, there is one thing left: the statistical context. Statistical significance is nothing in itself. It has to be weighed against the prior plausibility of the tested hypothesis.
Analyzing the result of a test, you face two hypotheses: the difference was heard, or the score was obtained by chance. And you pick the one that seems more probable.
If we compare two speakers and get a guessing probability of 0.05, which is the more probable hypothesis? That the difference was heard.
If we compare two identical power cords, one with an antistatic product on it (not on the plugs, just on the plastic), and get a guessing probability of 0.05, which is the more probable hypothesis? That the score was obtained by guessing while no difference was heard.
For this kind of test, the probability of guessing would have to be something more like 0.0001.
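Put differently (a rough Bayesian sketch; the priors and the assumed power of 0.8 are made-up numbers, only meant to illustrate why the same 0.05 carries very different weight in the two cases):

```python
def posterior_real_difference(prior, alpha=0.05, power=0.8):
    # P(difference is real | positive result), for a test with false-positive
    # rate alpha that detects a real difference with the given power.
    p_positive = prior * power + (1 - prior) * alpha
    return prior * power / p_positive

print(round(posterior_real_difference(prior=0.5), 2))    # two speakers: ~0.94
print(round(posterior_real_difference(prior=0.001), 3))  # treated power cord: ~0.016
```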
post #15 of 109
I forgot one last thing: the possibility that the test was faked. That's a third hypothesis that must be compared to the two previous ones.

For extremely unlikely results, two or three blind tests, run by independent people, are unlikely to all be faked.