Starting with this article by Robert Harley, first published in 2008 in The Absolute Sound
http://www.avguide.com/forums/blind-listening-tests-are-flawed-editorial?page=1
The main part being -
"The Blind (Mis-) Leading the Blind
Every few years, the results of some blind listening test are announced that purportedly “prove” an absurd conclusion. These tests, ironically, say more about the flaws inherent in blind listening tests than about the phenomena in question.
The latest in this long history is a double-blind test that, the authors conclude, demonstrates that 44.1kHz/16-bit digital audio is indistinguishable from high-resolution digital. Note the word “indistinguishable.” The authors aren’t saying that high-res digital might sound a little different from Red Book CD but is no better. Or that high-res digital is only slightly better and not worth the additional cost. Rather, they reached the rather startling conclusion that CD-quality audio sounds exactly the same as 96kHz/24-bit PCM and DSD, the encoding scheme used in SACD. That is, under double-blind test conditions, 60 expert listeners over 554 trials couldn’t hear any differences between CD, SACD, and 96/24. The study was published in the September, 2007 Journal of the Audio Engineering Society.
Harley offers no methodological critique of this study, merely asserting that it must be absurd because it disagrees with his views.
I contend that such tests are an indictment of blind listening tests in general because of the patently absurd conclusions to which they lead. A notable example is the blind listening test conducted by Stereo Review that concluded that a pair of Mark Levinson monoblocks, an output-transformerless tubed amplifier, and a $220 Pioneer receiver were all sonically identical. (“Do All Amplifiers Sound the Same?” published in the January, 1987 issue.)
Again, Harley offers no methodological critique of the study, merely asserting that it must be absurd because it disagrees with his views.
Most such tests, including this new CD vs. high-res comparison, are performed not by disinterested experimenters on a quest for the truth but by partisan hacks
At this point I will stop being polite. Harley calling anyone else a hack is a case of gross hypocrisy. Remember that Harley's audio engineering credentials are exactly nil: he got his job at Stereophile by writing an essay, and his technological knowledge is so poor that for several issues The Audio Critic had to print corrections to his "technical reviews". It got so bad that they even commissioned an article from Bob Adams (of Analog Devices) to correct the egregious mistakes Harley had made in an article on jitter. The Audio Critic even started a SHEESH fund (Send Harley to EE School). For Harley to call anyone else a hack, including Meyer and Moran, is utter hypocrisy!
on a mission to discredit audiophiles. But blind listening tests lead to the wrong conclusions even when the experimenters’ motives are pure. A good example is the listening tests conducted by Swedish Radio (analogous to the BBC) to decide whether one of the low-bit-rate codecs under consideration by the European Broadcast Union was good enough to replace FM broadcasting in Europe.
I bought the paper which Harley cites (somewhat inaccurately, as it happens), and it is not a particularly good example of a DBT. The test protocols involve large delays, use tape, and have a predetermined presentation order, and the listeners cannot go back and forward at will. Worse still, it is not a "this is the same as this" test: it is a test of listeners' generic opinions of degradation in the sound, with listeners rating the sound between undegraded and very degraded. Consequently even the reference was never graded at a full 5 (undegraded). The methods used to analyse the data allow for the misleading conclusion that the reference and the codec were indistinguishable, but this is a statistical matter: in the 1990 tests the authors admit that the codec is not good enough, and in the 1991 tests the difference between reference and codec is still there (the reference is graded higher) but is now considered not significant. That was a policy decision on the side of SR, and a flawed one.
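To illustrate the statistical point, here is a minimal sketch in Python (invented numbers, not SR's actual data or their actual analysis) of how grades that consistently favour the reference can still be reported as "not significant" when the spread of grades is large relative to the difference and the sample is small:

```python
# Minimal sketch with invented numbers - not SR's data or their analysis.
# The reference is graded higher than the codec on average, yet a plain
# two-sample t-test on a small, noisy sample fails to reach significance.
from scipy import stats

# Hypothetical degradation grades (5 = undegraded), ten trials per condition.
reference_grades = [4.5, 4.8, 4.2, 4.9, 4.4, 4.7, 4.3, 4.8, 4.5, 4.6]
codec_grades     = [4.4, 4.7, 4.1, 4.8, 4.3, 4.6, 4.2, 4.6, 4.4, 4.5]

t_stat, p_value = stats.ttest_ind(reference_grades, codec_grades)
print(f"mean reference grade: {sum(reference_grades) / len(reference_grades):.2f}")
print(f"mean codec grade:     {sum(codec_grades) / len(codec_grades):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")  # p > 0.05 with these numbers

# With p above the usual 0.05 cut-off the difference gets reported as
# "not significant", even though the reference scores higher overall.
```

Note that with these invented numbers a paired analysis would flag the difference easily; where to put the cut-off, which analysis to run, and whether "not significant" may then be read as "indistinguishable" are exactly the kind of policy decisions I object to.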
That said, Harley may have a point here: this test clearly did not identify an artifact that was found later, and that is not good. But here is the thing: we know nothing about whether any of the samples on the tape sent to Bart Locanthi were actually used in the tests, nor do we know if it was a clean recording. In fact we know almost nothing about this part of the story: we don't know the level of the artifact, and we don't know under what conditions SR actually detected it, given that they were primed to hear it. Measurements would have been really valuable here.
Swedish Radio developed an elaborate listening methodology called “double-blind, triple-stimulus, hidden-reference.” A “subject” (listener) would hear three “objects” (musical presentations); presentation A was always the unprocessed signal, with the listener required to identify if presentation B or C had been processed through the codec.
The test involved 60 “expert” listeners spanning 20,000 evaluations over a period of two years. Swedish Radio announced in 1991 that it had narrowed the field to two codecs, and that “both codecs have now reached a level of performance where they fulfill the EBU requirements for a distribution codec.” In other words, Swedish Radio said the codec was good enough to replace analog FM broadcasts in Europe. This decision was based on data gathered during the 20,000 “double-blind, triple-stimulus, hidden-reference” listening trials.
Not quite true: only the 1990 tests involved 20,000 trials; the 1991 tests, which led to the SR decision, used far fewer.
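For concreteness, here is a minimal sketch of one triple-stimulus hidden-reference trial as Harley describes it (my own illustration, not SR's software; note also my point above that the published tests graded degradation rather than asking for a pure identification):

```python
import random

# Minimal sketch (my illustration, not SR's actual software) of one
# "double-blind, triple-stimulus, hidden-reference" trial as Harley
# describes it: presentation A is always the unprocessed reference and
# the listener must say whether B or C went through the codec.
def run_trial(reference_clip, codec_clip, listener_answer_fn):
    processed_slot = random.choice(["B", "C"])  # hidden assignment
    presentations = {
        "A": reference_clip,
        "B": codec_clip if processed_slot == "B" else reference_clip,
        "C": codec_clip if processed_slot == "C" else reference_clip,
    }
    answer = listener_answer_fn(presentations)  # listener returns "B" or "C"
    return answer == processed_slot

# A purely guessing listener is right half the time, so identification
# rates have to be judged against a 50% chance floor.
guesser = lambda presentations: random.choice(["B", "C"])
hits = sum(run_trial("ref", "codec", guesser) for _ in range(1000))
print(f"guessing listener: {hits}/1000 correct (expect ~500)")
```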
(The listening-test methodology and statistical analysis are documented in detail in “Subjective Assessments on Low Bit-Rate Audio Codecs,” by C. Grewin and T. Rydén, published in the proceedings of the 10th International Audio Engineering Society Conference, “Images of Audio.”)
After announcing its decision, Swedish Radio sent a tape of music processed by the selected codec to the late Bart Locanthi, an acknowledged expert in digital audio and chairman of an ad hoc committee formed to independently evaluate low-bit rate codecs. Using the same non-blind observational-listening techniques that audiophiles routinely use to evaluate sound quality, Locanthi instantly identified an artifact of the codec. After Locanthi informed Swedish Radio of the artifact (an idle tone at 1.5kHz), listeners at Swedish Radio also instantly heard the distortion.
Indeed, this is a valid point: the SR listeners were primed to hear the artifact.
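Beyond priming, one plausible statistical mechanism for how so many trials could miss the artifact (and I stress this is my speculation, not anything documented by SR) is dilution: if the idle tone is clearly audible on only a few programme items, pooling grades across all items buries it. A toy illustration:

```python
# Toy illustration with invented numbers (my speculation, not SR data):
# an artifact audible on only one programme item is diluted away when
# grades are pooled across all items.
item_grades = {
    "speech":       [4.7, 4.6, 4.8, 4.7],
    "orchestral":   [4.6, 4.7, 4.5, 4.6],
    "glockenspiel": [3.2, 3.5, 3.1, 3.4],  # hypothetical idle-tone item
    "pop":          [4.8, 4.6, 4.7, 4.7],
    "jazz":         [4.7, 4.8, 4.6, 4.7],
}
for item, grades in item_grades.items():
    print(f"{item:>12}: mean grade {sum(grades) / len(grades):.2f}")

pooled = [g for grades in item_grades.values() for g in grades]
print(f"      pooled: mean grade {sum(pooled) / len(pooled):.2f}")
# The pooled mean (4.40 here) still looks respectable while one item is
# plainly degraded; a listener homing in on that item hears it at once.
```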
(Locanthi’s account of the episode is documented in an audio recording played at a workshop on low-bit-rate codecs at the 91st AES convention.)
Quote:
Unfortunately, his recorded speech didn't make it onto the cassettes of the workshop, so I'll have to rely on my memory and notes of the event.
How is it possible that a single listener, using non-blind observational listening techniques, was able to discover—in less than ten minutes—a distortion that escaped the scrutiny of 60 expert listeners, 20,000 trials conducted over a two-year period, and elaborate “double-blind, triple-stimulus, hidden-reference” methodology, and sophisticated statistical analysis?
The answer is that blind listening tests fundamentally distort the listening process and are worthless in determining the audibility of a certain phenomenon.
As exemplified by yet another reader letter published in this issue, many people naively assume that blind listening tests are somehow more rigorous and honest than the “single-presentation” observational listening protocols practiced in product reviewing. There’s a common misperception that the undeniable value of blind studies of new drugs, for example, automatically confers utility on blind listening tests.
I’ve thought quite a bit about this subject, and written what I hope is a fairly reasoned and in-depth analysis of why blind listening tests are flawed. This analysis is part of a larger statement on critical listening and the conflict between audio “subjectivists” and “objectivists,” which I presented in a paper to the Audio Engineering Society entitled “The Role of Critical Listening in Evaluating Audio Equipment Quality.” You can read the entire paper here
http://www.avguide.com/news/2008/05/28/the-role-of-critical-listening-in-evaluating-audio-equipment-quality/. I invite readers to comment on the paper, and discuss blind listening tests, on a special new Forum on AVguide.com. The Forum, called “Evaluation, Testing, Measurement, and Perception,” will explore how to evaluate products, how to report on that evaluation, and link that evaluation to real experience/value. I look forward to hearing your opinions and ideas.
Robert Harley"
So Harley has found one rather old, rather badly done test with a dodgy policy decision, and from this he concludes that all blind tests are flawed.