The validity of ABX testing
Jul 22, 2009 at 9:00 PM Post #76 of 109
Quote:

Originally Posted by wavoman
This is turning into a SmellyGas love fest, but the following, already quoted by mike and others, is just perfect:
Thus, issues with the methodology of the experiment may have prevented actual differences from being detected.

The solution is simple. Repeat the experiment in such a way that complaints about the methodology are mitigated.....
...
For testing cables I have a solution. Blind testing at home, relaxed, with no partner.



That would be amazing. I appreciate your effort.

One question. Because listeners differ in discriminatory ability, wouldn't you have to test individuals many times? Let's say 75% of the listeners either lack discriminatory ability or self-choose a poor method of evaluating the differences. Wouldn't that "drag down" the power of the test, and overwhelm the 25% who can tell a difference?

By the way, I am more concerned with people self-choosing a poor method of evaluating the differences than I am with lack of discriminatory ability. My hypothesis is that cable differences are in fact real, and that people who try different cables and live with them long-term have an unconscious way of settling on what they like. But I wince when I see them trying to compare cables consciously. They seem to have no ability to control the conditions of listening. Not their fault... it is damn difficult to do. I think I may have found a way to do it, in my current protocol. Of course, only for me. Other people may have to find their own way.

I see this as probably the most difficult problem in testing subtle differences. How does the listener control their attention? How does the tester give instructions so the listener uses their attention in a consistent way? The best way of doing so might be different for every listener.

EDIT: one way of running this test might be to mail one cable at a time and have the participant rate it relative to their usual setup. So they could answer: (1) Much better than my usual setup (2) Slightly better (3) about the same (4) Slightly worse (5) Much worse. This has many problems, but just throwing it out there as a suggestion.
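The "drag down" worry above can be sketched with a toy simulation (all numbers are hypothetical: say 25% of listeners answer correctly 70% of the time and the rest guess at chance; the function name is mine):

```python
import random

random.seed(0)

def pooled_hit_rate(n_listeners=100, trials=20, p_skilled=0.25, acc_skilled=0.70):
    """Fraction of correct same/different calls when all listeners are pooled.
    Hypothetical numbers: 25% of listeners answer correctly 70% of the time,
    the rest guess at 50%."""
    hits = total = 0
    for _ in range(n_listeners):
        acc = acc_skilled if random.random() < p_skilled else 0.5  # guessers at chance
        hits += sum(random.random() < acc for _ in range(trials))
        total += trials
    return hits / total

# Pooling dilutes the skilled minority toward chance: expected rate is
# 0.25 * 0.70 + 0.75 * 0.50 = 0.55, well below the skilled listeners' 0.70.
print(pooled_hit_rate())
```

This is exactly why testing individuals repeatedly, rather than pooling one-shot answers, matters for a weak effect.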
 
Jul 23, 2009 at 12:22 AM Post #77 of 109
There are a lot of fantastic ideas here on how to test whether there are audible differences between cables. I'm pretty sure a series of trials could be designed and executed to satisfy most of the objections. Certainly you could test a range of cables, from cheap to exotic, $1 to $10,000, precisely level-match, go single- or double-blind, vary the response types (A/B, same/different, A/B/no pref/dunno, etc.), compare fast switching vs. long-term listening, use trained and/or motivated listeners who think they can hear a difference, etc.

Let's say such a large, EXPENSIVE, comprehensive, and scientifically sound trial could be done. Let's say the study fails to show any difference... or perhaps 40% of listeners could reliably pass a blind test under some conditions. Then what?

The problem is, there is already a pervasive belief that cables make dramatic improvements in sound. It is perpetuated by audio salesmen, big cable displays at Best Buy, all over the internet, and even in otherwise respectable audio magazines (not journals). Furthermore, people WANT to believe that cables make a difference. I did, and I still do.

The results of a big experiment using a methodology that most non-research/science people have absolutely no idea how to interpret would probably not impress the average audio listener. A person who just spent $350 on his Gorilla Cables and is convinced his soundstage has been kicked up two notches is certainly not going to believe the results of a blind experiment published online.

The Gorilla Cable owner will make petty objections like "well duh, they didn't use $100,000 electrostatic speakers, no wonder those buffoons couldn't hear the differences"... or the quintessential "I can hear differences, so the study must be flawed." Conspiracy theorists will simply assume you fabricated all of the data.

In psychology, there is a concept known as "cognitive dissonance." I'll just quote the wikipedia article: "A powerful cause of dissonance is an idea in conflict with a fundamental element of the self-concept, such as 'I am a good person' or 'I made the right decision.' The anxiety that comes with the possibility of having made a bad decision can lead to rationalization, the tendency to create additional reasons or justifications to support one's choices. A person who just spent too much money on a new car might decide that the new vehicle is much less likely to break down than his or her old car. This belief may or may not be true, but it would likely reduce dissonance and make the person feel better. Dissonance can also lead to confirmation bias, the denial of disconfirming evidence, and other ego defense mechanisms."

Thus, even if people are so inclined as to take an interest in reading the study and its results, there's a good chance they'll still just dismiss them. "Oh, I don't believe in blind listening tests. I trust my ears."

Perhaps you could consider getting such a great study published in a peer-reviewed journal. Maybe the Journal of the AES? At least you know your audience is scientifically oriented and might be less likely to make petty objections to the methodology. Furthermore, a peer-reviewed article is authoritative, will likely be cited by others, and is not likely to be subject to the "oh, they made all that data up" objection.
 
Jul 23, 2009 at 12:30 AM Post #78 of 109
Quote:

Originally Posted by SmellyGas
The problem is, there is already a pervasive belief that cables make dramatic improvements in sound. It is perpetuated by audio salesmen, big cable displays at Best Buy, all over the internet, and even in otherwise respectable audio magazines (not journals). Furthermore, people WANT to believe that cables make a difference. I did, and I still do.


I think the study should address our own interest in evaluating cables and we should accept we are going to have zero impact on the industry.
 
Jul 23, 2009 at 2:29 AM Post #79 of 109
Quote:

Originally Posted by royalcrown
Out of curiosity, what benefit does this question set have over a question set of, say, "I find no difference" and "I find a difference" ? Would you still perform analysis on a Yes/No paradigm (lumping "I prefer one" and "I find a difference" together mathematically but still offering the choice for response reasons) or are you going to perform analysis on all answers? Basically I'm just wondering why you would ask 4 questions instead of 2, because it intuitively doesn't click with me (which is sometimes the case when it comes to statistics).

It at least outwardly appears that you could still incorporate swindles into the protocol by just asking if there is a difference or not. I can understand why that might still be an unnatural question, but it seems more unnatural to me the way you have it worded, because it imparts onto the subject that the difference between preferring a cable and hearing a difference actually matters, when (as far as I see it) the objective of the test is just to find out if someone can distinguish between given components, as opposed to whether or not they like one or the other more. I guess it just seems more complicated to me than it needs to be.




All your points are right on, and others have raised these concerns as well.

Why four questions, and not simply "diff or no diff"?

You could all be right. And technically everything you say about the statistical testing, and swindles, is 100% correct.

The issue is: avoiding response bias. I have observed people being uncomfortable with forced-choice. The food industry runs on forced-choice, but I think it might be a problem.

You see, I do not believe the null hypothesis. I think there are small differences between cables that may, at times and for some people, be audible. I think we have a weak effect, maybe a very weak effect.

The cables measure differently electrically, so physics does not contradict my belief.

With a small difference, we need a way for people to believe in what they say, to have a bail-out when they are unsure, and so on. I want to tease out of lay people a difference when it exists but is small. If most of the time they hear no difference, and they only have two choices, then perhaps they will be reluctant to say "diff" when they think maybe, just maybe, there is one.

That is textbook response bias.

My four answers are designed to eliminate this behavior. We tip the fact that one cable can be better than another. We allow a weak "I hear a difference but have no preference" for people who are unsure of themselves -- but really hear something!

I have no idea if I am right. I am investigating the protocol as much as I am investigating cables. I have some interesting math behind it -- not ready for prime time yet.

Again, your reasoning is 100% correct. But I think you can see what is in my head.

And with swindles, we quickly expose the people who can't hear but have response bias the other way, always claiming they can hear something (like a fellow I work with, who has spent $250,000 on his rig, not counting the construction to build the addition to his house, no joke).

Added: one way to do "diff or no diff" but eliminate the pain of a forced choice is to have a third answer, "I am not sure". The trouble is, people won't choose it, even when it is true, since they think it makes them look stupid. This is really what my "I hear a difference but have no preference" is -- a safe choice, not forcing them into a strong statement -- which I am hoping they use when they think they hear something but are unsure (the effect is small, remember).

Let me say this another way: the two-answer "diff or no diff" is stiff and strong and rigid -- it lays out a real dichotomy, and lay people will really need to hear a big difference before they say "diff" (or at least that is what I think) so we will miss the small effect. One of my answers is squishy soft on purpose ... by putting in the "I prefer A" and "I prefer B" which are strong, hard statements, I leave the "Difference, but no preference" as an answer that the unsure people will find acceptable and select.

My fondest hope is: we weed out the jokers and the bullies with the swindles, then we get a statistically significant number of "I hear a difference, but no preference" when there really is a difference, and "I hear no difference" when we present swindles. What a huge win! Nobel prize for me! Well not really, but we will have proven a solid point that all the AES papers missed. I live for this kind of turn-the-tables triumph, don't we all? We will have proven that cables can potentially make a difference, which audiophiles claim, and explained as well why A/B/X tests fail to show it.

I can dream, can't I?
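The response-bias argument above maps onto the criterion shift of signal detection theory: a cautious listener only reports "diff" when internal evidence is strong, and a safe middle answer effectively lowers that criterion. A toy simulation (the d' = 0.8 "weak effect" and both criterion values are invented for illustration):

```python
import random

random.seed(1)

def hit_rate(criterion, d_prime=0.8, trials=10_000):
    """Share of real-difference trials on which the listener's internal
    evidence (modeled as a Gaussian with mean d', unit variance) clears
    their report criterion -- an equal-variance signal-detection sketch."""
    return sum(random.gauss(d_prime, 1.0) > criterion
               for _ in range(trials)) / trials

strict = hit_rate(criterion=1.5)  # forced "diff"/"no diff": cautious reporting
soft = hit_rate(criterion=0.5)    # a safe middle answer lets the criterion drop
print(strict, soft)
```

Under these made-up numbers the strict criterion catches roughly a quarter of real-difference trials, while the relaxed one catches well over half -- the same weak effect, hidden or revealed purely by how the question is asked.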
 
Jul 23, 2009 at 2:53 AM Post #80 of 109
Quote:

Originally Posted by mike1127
Because listeners differ in discriminatory ability, wouldn't you have to test individuals many times? Let's say 75% of the listeners either lack discriminatory ability or self-choose a poor method of evaluating the differences. Wouldn't that "drag down" the power of the test, and overwhelm the 25% who can tell a difference?...


You make an important point, and here is my answer.

We do not pool the results of listeners willy-nilly. The swindles are used to eliminate the people who are useless to us -- they claim to hear a difference when none exists. Then we "play the winner" -- going back for more tests (you are right about that) with the people who do well on the swindles.

With enough tests for a given individual, we can treat him as his own block -- not pool results.

To avoid what is known as the "selection fallacy" we then indeed have to go back for more tests with the ultimate winners. We've got time, lots of heat shrink in different colors, a pot of postage money. I said this would take a few years.
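Treating a screened listener as their own block, as described above, reduces to an exact binomial test of their answers against guessing. A minimal sketch (the function name and trial counts are mine):

```python
from math import comb

def binom_p_value(hits, trials, p_chance=0.5):
    """One-sided exact binomial p-value: the probability of getting at
    least `hits` correct answers out of `trials` by pure guessing."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(hits, trials + 1))

# One listener, tested as their own block. Note: to avoid the selection
# fallacy, these must be fresh trials, not the ones used to screen them.
print(round(binom_p_value(15, 20), 4))  # 15/20 correct -> 0.0207
```

So 15 out of 20 correct already beats the conventional 0.05 threshold for a single listener, without pooling anyone.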

I believe there are no audible differences in digital cables, because modern receiving firmware in DACs uses PLLs and buffers to compensate for any timing or catch-the-edge-transition errors a poor cable introduces (via reflection or whatever). I do believe the cables make errors -- "bits are bits" is false when there is a hard real-time clock constraint, as there is in digital audio -- but the errors are easy to correct without any re-transmission protocol, since there is so much extra time and so many redundant bits. And I failed my own blind tests pitting a $1 digital cable (actually 75-ohm video) against a $1000 digital cable (using an excellent DAC). Lesson learned, for me at least.

However I do believe that analog audio cables can affect sound, and they don't all measure the same electrically, do they? I think the effects are very small, and may not be audible most of the time, in most systems, and with most people. But I think maybe there are careful listeners out there who might pick up -- often enough to prove it statistically -- on small differences.

I am not sure, but that's why these tests are very interesting.

You are ahead of me with your experiments, and you are passing some blind tests, so I am encouraged!
 
Jul 23, 2009 at 2:55 AM Post #81 of 109
SmellyGas -- great post! Yes, people will ignore the results of our experiments.

As mike1127 says, we're doing it for ourselves.

A toast to us (pun alert): Hear Hear!
 
Jul 23, 2009 at 2:58 AM Post #82 of 109
Quote:

Originally Posted by wavoman
My fondest hope is: we weed out the jokers and the bullies with the swindles, then we get a statistically significant number of "I hear a difference, but no preference" when there really is a difference, and "I hear no difference" when we present swindles. What a huge win! Nobel prize for me! Well not really, but we will have proven a solid point that all the AES papers missed. I live for this kind of turn-the-tables triumph, don't we all? We will have proven that cables can potentially make a difference, which audiophiles claim, and explained as well why A/B/X tests fail to show it.

I can dream, can't I?



I do hope you don't take too seriously the idea of changing the minds of AES fellows, because most of them would say that, under the laws of physics and the best understanding of hearing, cable differences are an order of magnitude (or three) below anything that can be detected.

Note: are you implying that you plan to send many pairs of cables to each participant? Because isn't that the only way you can "smoke out" the guessers?

I do believe that whether cable differences are real or not, people can perceive huge differences between things that are the same. People can be fooled this way. That's one potential problem with this test. I think you are going to get a lot of people perceiving large differences in "swindles" because they can't control their own biases or perceptual process. I guess that's why you need to present each person with both swindles and real differences.
 
Jul 23, 2009 at 3:19 AM Post #83 of 109
Quote:

Originally Posted by mike1127
...you plan to send many pairs of cables to each participant? Because isn't that the only way you can "smoke out" the guessers?... I guess that's why you need to present each person with both swindles and real differences.


Exactly! You understand what I am pitching at perfectly!

The actual sequence of what I mail you is complicated and is not fully randomized (that would take too long) -- I use "keep the winners, terminate the losers" but have to avoid the selection fallacy by mixing it up sometimes.

I am not done with the math. It is taking a while to work it out, and I am old and tired (plus I have a day job).

Statisticians, just like the data they present, are broken down by age and sex.
 
Jul 23, 2009 at 6:01 AM Post #84 of 109
I think many here are being seriously misled by the large-sample tests used in things like drug trials, which are the type of human experimentation with which most people have any familiarity.

Most testing of auditory process is done with small subject samples, sometimes very small samples, where an individual subject is tested repeatedly in various conditions. Some of this type of testing is amenable to statistical testing, some is not.

Large group testing has not shown itself to be very sensitive. Certainly if a large group experiment shows statistically significant differences, that is meaningful. The problem is that this type of testing is very crude and is more likely than not to give a "null" result.

One method widely used in human sensory testing, which is not even subject to statistical testing, is the method of "limits." Von Bekesy used the method with his patented audiometer. In that application a subject's threshold of hearing is tested by running either several specific tones or a slowly swept frequency, so as to cover the audible spectrum. The subject runs the volume down until the sound disappears for him, at which point he runs the volume up until he hears it again. The threshold of hearing is the midpoint between the turnover points.

There is no obvious control condition here although anyone who thinks this is unscientific should be reminded that Bekesy received a Nobel prize for his work in audition.
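The tracking procedure just described can be sketched as a toy simulation (the threshold, step size, and noise values are invented, and a real Bekesy audiometer sweeps frequency continuously; this only models the up/down tracking at one tone):

```python
import random

random.seed(2)

def bekesy_track(true_threshold=40.0, start=60.0, step=1.0,
                 n_reversals=8, noise=1.5):
    """Toy Bekesy-style tracking: the level descends while the simulated
    listener hears the tone and ascends once it is lost; the threshold
    estimate is the mean of the turnover points."""
    level, direction, turns = start, -1, []
    while len(turns) < n_reversals:
        # The listener's momentary sensitivity jitters around the true threshold.
        heard = level + random.gauss(0, noise) > true_threshold
        new_direction = -1 if heard else +1
        if new_direction != direction:
            turns.append(level)  # record a turnover point
            direction = new_direction
        level += direction * step
    return sum(turns) / len(turns)

print(bekesy_track())  # lands near the simulated 40 dB threshold
```

The turnover points straddle the true threshold, so their mean converges on it quickly, with no explicit statistics required -- which is edstrelow's point about sensitivity.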

However, I think procedures like this are far more likely to clarify what can and cannot be detected than large-group, randomized testing. I will give an example of a test that just about anyone here can do, using similar adjustments with a visual stimulus, to see what I am getting at about sensitivity.

Take a Photoshop-type program and adjust one of the picture parameters, such as color saturation, until you first detect a change.

Now if you were to take prints or store images of the picture at the initial and final settings, you might very well not be able to distinguish these pictures, or say which has more saturation than the other, even though the difference was initially detectable.

Random testing adds a great confusion factor which must be overcome before the eye/ear/brain can tell what is going on. Sure, if the test shows a difference, that is great, but often that will not happen.

Before I would worry too much about human testing, I would be happy just to see some physical measurements of some SQ features, to see if anything can be physically measured, before trying to establish that a human subject can detect a difference. I did this some years back with a simple pink noise analysis of different cables and found slight but consistent measured frequency differences between some cables.
 
Jul 23, 2009 at 3:26 PM Post #85 of 109
Quote:

Originally Posted by edstrelow
I would be happy just to see some physical measurements of some SQ features, to see if anything can be physically measured, before trying to establish that a human subject can detect a difference. I did this some years back with a simple pink noise analysis of different cables and found slight but consistent measured frequency differences between some cables.


How slight and how consistent?

I have done this myself

http://www.head-fi.org/forums/f21/my...rprise-405217/

and indeed you can find small measurable differences between cables generally, though I found that the differences were in the 0.001 to 0.07 dB range until you went over 20 kHz or below -60 dB. On a very few occasions the difference hit 0.1 dB for individual frequencies, but this was rare, the FR patterns were always near identical, and there was certainly never a consistent character difference.
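To put those numbers in perspective, a level difference in decibels converts to a linear amplitude ratio as 10^(dB/20). A tiny sketch (the function name is mine):

```python
from math import log10  # imported only to document the inverse: dB = 20*log10(ratio)

def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a linear amplitude (voltage) ratio."""
    return 10 ** (db / 20)

for db in (0.001, 0.07, 0.1):
    print(f"{db} dB -> amplitude ratio {db_to_amplitude_ratio(db):.6f}")
# Even the rare 0.1 dB outlier is only about a 1.2% amplitude change.
```

So the measured differences correspond to amplitude changes of roughly 0.01% to 1%, which frames how small an effect any listening test would have to resolve.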
 
Jul 23, 2009 at 4:25 PM Post #86 of 109
Quote:

Originally Posted by nick_charles
and indeed you can find small measurable differences between cables generally, though I found that the differences were in the 0.001 to 0.07 dB range until you went over 20 kHz or below -60 dB. On a very few occasions the difference hit 0.1 dB for individual frequencies, but this was rare, the FR patterns were always near identical, and there was certainly never a consistent character difference.


One other question is whether differences will be consistent for one brand/model of cable. For example, if you find that brand/model X cable (let's say the Zu Mobius) has a 0.001 dB reduction at 100 Hz, will every brand/model X cable (every Zu Mobius) have a similar reduction at 100 Hz?

If not, then you have no idea whether these differences come from the cables themselves, from measurement-to-measurement variation, or from random variation between individual cables.
 
Jul 23, 2009 at 5:49 PM Post #87 of 109
Quote:

Originally Posted by odigg
One other question is whether differences will be consistent for one brand/model of cable. For example, if you find that brand/model X cable (let's say the Zu Mobius) has a 0.001 dB reduction at 100 Hz, will every brand/model X cable (every Zu Mobius) have a similar reduction at 100 Hz?

If not, then you have no idea whether these differences come from the cables themselves, from measurement-to-measurement variation, or from random variation between individual cables.



Good question. I cannot make any assertions about sample variation, i.e. some examples of cable X might be different from others (not having unlimited funds), but I did test each cable 10 times to average out random measurement-to-measurement variation.
 
Jul 23, 2009 at 6:38 PM Post #88 of 109
Aside from all possible technical explanations, isn't the main issue with ABX testing simply that our powerful human brain, while being fed repeatedly with two slightly different yet substantially similar sounds, will eventually recognize them as being one and the same? (For what purpose, I do not know... maybe saving resources/energy!)

Just to give another example of A/B testing where our senses are deceived: have you ever tried to compare perfumes? The first two, OK, no problem. Try a few more and it becomes more complicated to distinguish one from the other. After 15 minutes at most of intensive testing, you won't be able to smell the difference between them...
 
Jul 23, 2009 at 7:12 PM Post #89 of 109
Quote:

Originally Posted by s_nyc
Aside from all possible technical explanations, isn't the main issue with ABX testing simply that our powerful human brain, while being fed repeatedly with two slightly different yet substantially similar sounds, will eventually recognize them as being one and the same? (For what purpose, I do not know... maybe saving resources/energy!)


I think the goal of ABX testing is to assess whether two different pieces of gear are (or become) indistinguishable from one another, or whether they are different enough that you can always perceive it.

If I had two DACs, say, that I was considering buying, and one was substantially more expensive than the other but they became or were the same to me (as you described above), I would have a hard time justifying paying a premium for the more expensive one.
 
Jul 23, 2009 at 8:23 PM Post #90 of 109
Quote:

Originally Posted by s_nyc
Aside from all possible technical explanations, isn't the main issue with ABX testing simply that our powerful human brain, while being fed repeatedly with two slightly different yet substantially similar sounds, will eventually recognize them as being one and the same? (For what purpose, I do not know... maybe saving resources/energy!)

Just to give another example of A/B testing where our senses are deceived: have you ever tried to compare perfumes? The first two, OK, no problem. Try a few more and it becomes more complicated to distinguish one from the other. After 15 minutes at most of intensive testing, you won't be able to smell the difference between them...



Good point. And another point is that music is supposed to be fresh... that's how it works... generally people don't listen to the same music many times in a row, but prefer to listen to new songs... so you "fatigue" quickly to the musical factors (such as the quality of articulation, the delicacy of tone colors, ...). Or, if the test snippets are short enough, you can't hear musical factors at all.

EDIT: for those who don't believe in "fatiguing" to "musical factors," that is a real phenomenon at least in the world of musicians. In fact, a lot of music is composed or performed with a sensitivity to when repetition becomes fatiguing and walking that fine line. A cool thing to do as a composer is repeat some basic idea, but with just enough variation that it stays fresh. I am an amateur composer, and whether justified or not, I feel some authority in claiming that this phenomenon, "fatiguing to musical factors," is real.
 
