I've learned a lot and clarified some of my thinking, so I thought I would write a bit of a "summary" post.
First of all, I am not opposed to doing blind tests. I would evaluate all new components blind if it were possible. I think that blind tests are needed to determine whether differences exist between the sound of components.
My issue is with the way blind tests are usually run, and the way they are interpreted.
As objectivists are fond of telling us, all humans are subject to illusions. We know that when a person compares two components while knowing their identities, that person may "hallucinate" (for lack of a better word) a difference between those components.
But I think people can also "hallucinate" two components to be the same. Why do objectivists feel it only works the one way? I give some reasons below.
This debate is often framed as between "those who trust their ears (i.e. non-controlled, sighted listening)" and "those who trust blind tests, science and measurements".
But there are other dividing lines. For instance, I see "those who treat their brains as a black box" and "those who introspect to figure out what is going on while they listen." Personally, I trust that introspection on my listening process is a valid source of information. Valid in what sense? Valid in the sense of suggesting what kinds of test protocols need to be explored, and suggesting the limitations of existing protocols.
Now, let's consider how a test subject performs in an ABX test. If A and B are dramatically different, the subject will get the right answer every time. We could express that by saying their probability of answering correctly is 1.0, or "p=1.0". If A and B are sonically identical, the test subject can only guess randomly. In that case p=0.5.
In the really interesting blind tests, the differences will be subtle, and we can safely assume that p is not 1.0. It is somewhat lower than that. That's equivalent to saying "The test subject makes mistakes sometimes, but the difference is real and audible."
What kinds of factors would drag down p? There are two:
- Factors which cause a test subject to "hallucinate" that two components are the same, even when they aren't.
- Factors which introduce distraction, causing the test subject to focus on the wrong things, and "hallucinate" the wrong differences.
Introspecting on my own listening process has suggested some reasons these would be factors. In the first case, "hallucinating" that two components are the same, I see the following as possible causes:
- Imagination contamination. (See http://www.head-fi.org/forums/f133/i...nation-431573/)
- Fatigue from listening to the same music many times. If the critical differences between components are in musical factors, like emotion, excitement, pace and rhythm, etc., then it's pretty obvious to a musician that you are going to fatigue to these factors rapidly. Music is not meant to be listened to repeatedly. It loses its fresh quality very quickly.
- A context in which you simply can't pick up on musical factors. The best example of this is listening to small snippets of music.
Examples of distracting factors are:
- The simple fact that when the test music is very rich, we tend to hear different things each time we listen. On listen #1 I may hear the cymbals prominently. On listen #2 I may notice the bass drum. This can give rise to the illusion that the music has changed.
- Some people compare components by switching between them while the music is in progress. This doesn't even remotely make sense to me. The music you hear just after you switch is not the same music you heard before you switched. I count this as a distracting factor... it may cause you to focus on the wrong differences.
Some blind tests are done quick-switch or short-snippet style. Others are done with long listening and/or long breaks between switching. Either style is going to be affected by some of the factors above.
So, we can safely assume that p is somewhat lower than 1.0 (even if you just say that people make mistakes). The factors I have described above will drag p down. Here is a major problem: as p gets lower, the probability of Type II error gets very high.
What is Type II error? This is the error of failing to reject the null hypothesis when it is actually false. Or in English, this is the mistake of finding no difference between the components where there really is a difference.
To give some examples, let's say we are talking about a 16-trial test, which is common. If a person gets 12 or more right, then they have "passed" the test at a significance level of about 4%. That means their chance of scoring that well purely by guessing is about 4%.
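The guessing probability for a 16-trial test can be checked with a short binomial calculation. This is only a sketch in Python; the function name `tail_prob` is my own label for illustration and does not come from any ABX tool.

```python
from math import comb

def tail_prob(n, k, p):
    """Probability of k or more correct answers out of n trials,
    when each trial is answered correctly with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of scoring 12 or more out of 16 by pure guessing (p = 0.5)
print(round(tail_prob(16, 12, 0.5), 4))  # 0.0384
```

Scoring 11 or more by guessing comes out above 10%, which is why 12 is the usual pass mark for 16 trials at the 5% level.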
Now consider that p is about 0.6. That is, there is a real difference, but due to distracting and hallucinatory factors, the person only stands a 60% chance of giving the right answer. In that case, if you run 16 trials, the person probably won't get more than 9 or 10 right. The test will fail to reject the null hypothesis. And yet it was false! This is called Type II error.
A binomial calculation for the standard 16-trial test, with a pass mark of 12, puts the chance of Type II error at about 0.83 for p=0.6, about 0.55 for p=0.7, and about 0.20 for p=0.8. That means that toward the low end of that range you are likely to get a null result, even if there is a real difference.
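For anyone who wants to run the numbers on Type II error for the 16-trial test, here is a minimal sketch. It assumes a pass mark of 12 correct out of 16 and treats every trial as independent with the same p, which is a simplification of real listening. The function names are mine, chosen for illustration.

```python
from math import comb

def tail_prob(n, k, p):
    """Probability of k or more correct answers out of n trials,
    when each trial is answered correctly with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def type_ii_error(n, pass_mark, p):
    """Chance that a listener with true per-trial accuracy p
    scores below pass_mark, so the test reports a null result."""
    return 1.0 - tail_prob(n, pass_mark, p)

# Assumed setup: 16 trials, pass mark of 12
for p in (0.6, 0.7, 0.8):
    print(p, round(type_ii_error(16, 12, p), 2))
# prints roughly 0.83, 0.55 and 0.2 for the three values of p
```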
I wanted to do some calculations, but the web page I usually use is down right now. The point I wanted to make was that Type II error goes down as you do more trials. But you have to do a LOT more trials to get Type II error down to something below 10%. I estimate that for p=0.75 you would have to do 30 or 40 trials. This is very uncommon in blind testing.
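The trial-count question can be explored the same way. The sketch below assumes a 5% significance level, a 10% ceiling on Type II error, and independent trials with a fixed p. The helper names `pass_mark` and `trials_needed` are hypothetical. Because scores are whole numbers, the error rate does not fall smoothly as trials are added, so the function reports the first trial count that gets under the ceiling.

```python
from math import comb

def tail_prob(n, k, p):
    """Probability of k or more correct answers out of n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def pass_mark(n, alpha=0.05):
    """Smallest score that a pure guesser reaches with probability <= alpha."""
    for k in range(n // 2, n + 2):
        if tail_prob(n, k, 0.5) <= alpha:
            return k

def trials_needed(p, alpha=0.05, max_type_ii=0.10, n_max=500):
    """First trial count at which Type II error is at or below max_type_ii.
    Returns None if that does not happen within n_max trials."""
    for n in range(1, n_max + 1):
        if 1.0 - tail_prob(n, pass_mark(n, alpha), p) <= max_type_ii:
            return n
    return None

print(trials_needed(0.75))
```

The printed value can be compared against the 30 to 40 estimate. Lowering p in the call shows how quickly the required number of trials grows.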
Something else rarely acknowledged by objectivists is that differences between components may fall into distinct categories. Some differences are clearly audible under quick-switch conditions. Lots of ABX tests have been run giving good data about codecs. But that doesn't mean all differences are audible under quick-switch conditions. From introspection on my listening process, I deduce that musical differences are inaudible under quick-switch conditions, or when switching many times.
In other words, p has been dragged down all the way to 0.5.
Finally, let me address the claim we frequently see that "no one has ever passed a blind test proving that cables (or DACs or amps, depending on who's talking) sound different." This is false on its face, for a simple reason. If you are running tests at a significance level of 5%, then about 5% of them will reject the null hypothesis through guessing alone. That doesn't prove there is a difference, but what it does prove is that the person making this claim is cherry picking or dismissing results. There most certainly have been tests that rejected the null hypothesis, but the objectivist making the claim has dismissed those tests as improper in some way. For a person with an agenda, it is easy to dismiss everything they don't want to hear.
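To put a rough number on the "guessing alone" point, here is a small sketch. It assumes every test is an independent 16-trial ABX with a pass mark of 12 and that every listener is purely guessing. The test counts in the loop are made-up illustrations, not data about tests that have actually been run.

```python
from math import comb

def tail_prob(n, k, p):
    """Probability of k or more correct answers out of n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def chance_of_any_pass(n_tests, n_trials=16, pass_mark=12):
    """Chance that at least one of n_tests independent tests is 'passed'
    when every listener is guessing (p = 0.5 on each trial)."""
    alpha = tail_prob(n_trials, pass_mark, 0.5)
    return 1.0 - (1.0 - alpha) ** n_tests

# Hypothetical numbers of tests, for illustration only
for n_tests in (10, 50, 100):
    print(n_tests, round(chance_of_any_pass(n_tests), 2))
```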
To summarize, I'm not opposed to blind testing and I don't dismiss blind test results as completely irrelevant, but I do think there is reason to be cautious in interpreting null results.