Starting with this article by Robert Harley, first published in 2008 in The Absolute Sound
http://www.avguide.com/forums/blind-listening-tests-are-flawed-editorial?page=1
The main part being -
"The Blind (Mis-) Leading the Blind
Every few years, the results of some blind listening test are announced that purportedly “prove” an absurd conclusion. These tests, ironically, say more about the flaws inherent in blind listening tests than about the phenomena in question.
The latest in this long history is a double-blind test that, the authors conclude, demonstrates that 44.1kHz/16-bit digital audio is indistinguishable from high-resolution digital. Note the word “indistinguishable.” The authors aren’t saying that high-res digital might sound a little different from Red Book CD but is no better. Or that high-res digital is only slightly better and not worth the additional cost. Rather, they reached the rather startling conclusion that CD-quality audio sounds exactly the same as 96kHz/24-bit PCM and DSD, the encoding scheme used in SACD. That is, under double-blind test conditions, 60 expert listeners over 554 trials couldn’t hear any differences between CD, SACD, and 96/24. The study was published in the September, 2007 Journal of the Audio Engineering Society.
Harley offers no methodological critique of this study, merely asserting that it must be absurd because it disagrees with his views.
I contend that such tests are an indictment of blind listening tests in general because of the patently absurd conclusions to which they lead. A notable example is the blind listening test conducted by Stereo Review that concluded that a pair of Mark Levinson monoblocks, an output-transformerless tubed amplifier, and a $220 Pioneer receiver were all sonically identical. (“Do All Amplifiers Sound the Same?” published in the January, 1987 issue.)
Again, Harley offers no methodological critique of the study, merely asserting that it must be absurd because it disagrees with his views.
Most such tests, including this new CD vs. high-res comparison, are performed not by disinterested experimenters on a quest for the truth but by partisan hacks
At this point I will stop being polite. Harley calling anyone else a hack is a case of gross hypocrisy. Remember that Harley's audio engineering credentials are exactly nil: he got his job at Stereophile by writing an essay, and his technological knowledge is so poor that for several issues The Audio Critic had to print corrections to his "technical reviews". It got so bad that they even commissioned an article from Bob Adams (of Analog Devices) to correct the egregious mistakes Harley had made in an article on jitter. The Audio Critic even started a SHEESH fund (Send Harley to EE School). For Harley to call anyone else a hack, including Meyer and Moran, is utter hypocrisy!
on a mission to discredit audiophiles. But blind listening tests lead to the wrong conclusions even when the experimenters’ motives are pure. A good example is the listening tests conducted by Swedish Radio (analogous to the BBC) to decide whether one of the low-bit-rate codecs under consideration by the European Broadcast Union was good enough to replace FM broadcasting in Europe.
I bought the paper which Harley cites (somewhat inaccurately, as it happens), and it is not a particularly good example of a DBT. The test protocols involve large delays, use tape, and have a predetermined presentation order, and the listeners cannot go back and forward at will. Worse still, it is not a "this is the same as this" test: it is a test of listeners' generic opinions of degradation in the sound, with listeners rating the sound between undegraded and very degraded. Consequently even the reference was never graded at a full 5 (undegraded). The methods used to analyse the data allow for the misleading conclusion that the reference and the codec were indistinguishable, but this is a statistical matter: in the 1990 tests the authors admit that the codec is not good enough, and in the 1991 tests the difference between reference and codec is still there (the reference is graded higher) but is now considered not significant. That was a policy decision on the side of SR, and a flawed one.
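To illustrate the statistical point, here is a minimal sketch in Python (invented numbers, not SR's actual data or their actual analysis) of how grades that consistently favour the reference can still be reported as "not significant" when the spread of grades is large relative to the difference and the sample is small:

```python
# Minimal sketch with invented numbers - not SR's data or their analysis.
# The reference is graded higher than the codec on average, yet a plain
# two-sample t-test on a small, noisy sample fails to reach significance.
from scipy import stats

# Hypothetical degradation grades (5 = undegraded), ten trials per condition.
reference_grades = [4.5, 4.8, 4.2, 4.9, 4.4, 4.7, 4.3, 4.8, 4.5, 4.6]
codec_grades     = [4.4, 4.7, 4.1, 4.8, 4.3, 4.6, 4.2, 4.6, 4.4, 4.5]

t_stat, p_value = stats.ttest_ind(reference_grades, codec_grades)
print(f"mean reference grade: {sum(reference_grades) / len(reference_grades):.2f}")
print(f"mean codec grade:     {sum(codec_grades) / len(codec_grades):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")  # p > 0.05 with these numbers

# With p above the usual 0.05 cut-off the difference gets reported as
# "not significant", even though the reference scores higher overall.
```

Note that with these invented numbers a paired analysis would flag the difference easily; where to put the cut-off, which analysis to run, and whether "not significant" may then be read as "indistinguishable" are exactly the kind of policy decisions I object to.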
That said, Harley may have a point here: this test clearly did not identify an artifact that was found later, and that is not good. But here is the thing: we know nothing about whether any of the samples on the tape sent to Bart Locanthi were actually used in the tests, nor do we know if it was a clean recording. In fact we know almost nothing about this part of the story: we don't know the level of the artifact, and we don't know under what conditions SR actually detected it, given that they were primed to hear it. Measurements would have been really valuable here.
Swedish Radio developed an elaborate listening methodology called “double-blind, triple-stimulus, hidden-reference.” A “subject” (listener) would hear three “objects” (musical presentations); presentation A was always the unprocessed signal, with the listener required to identify if presentation B or C had been processed through the codec.
The test involved 60 “expert” listeners spanning 20,000 evaluations over a period of two years. Swedish Radio announced in 1991 that it had narrowed the field to two codecs, and that “both codecs have now reached a level of performance where they fulfill the EBU requirements for a distribution codec.” In other words, Swedish Radio said the codec was good enough to replace analog FM broadcasts in Europe. This decision was based on data gathered during the 20,000 “double-blind, triple-stimulus, hidden-reference” listening trials.
Not quite true: only the 1990 tests involved 20,000 trials; the 1991 tests, which led to the SR decision, used far fewer.
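For concreteness, here is a minimal sketch of one triple-stimulus hidden-reference trial as Harley describes it (my own illustration, not SR's software; note also my point above that the published tests graded degradation rather than asking for a pure identification):

```python
import random

# Minimal sketch (my illustration, not SR's actual software) of one
# "double-blind, triple-stimulus, hidden-reference" trial as Harley
# describes it: presentation A is always the unprocessed reference and
# the listener must say whether B or C went through the codec.
def run_trial(reference_clip, codec_clip, listener_answer_fn):
    processed_slot = random.choice(["B", "C"])  # hidden assignment
    presentations = {
        "A": reference_clip,
        "B": codec_clip if processed_slot == "B" else reference_clip,
        "C": codec_clip if processed_slot == "C" else reference_clip,
    }
    answer = listener_answer_fn(presentations)  # listener returns "B" or "C"
    return answer == processed_slot

# A purely guessing listener is right half the time, so identification
# rates have to be judged against a 50% chance floor.
guesser = lambda presentations: random.choice(["B", "C"])
hits = sum(run_trial("ref", "codec", guesser) for _ in range(1000))
print(f"guessing listener: {hits}/1000 correct (expect ~500)")
```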
(The listening-test methodology and statistical analysis are documented in detail in “Subjective Assessments on Low Bit-Rate Audio Codecs,” by C. Grewin and T. Rydén, published in the proceedings of the 10th International Audio Engineering Society Conference, “Images of Audio.”)
After announcing its decision, Swedish Radio sent a tape of music processed by the selected codec to the late Bart Locanthi, an acknowledged expert in digital audio and chairman of an ad hoc committee formed to independently evaluate low-bit rate codecs. Using the same non-blind observational-listening techniques that audiophiles routinely use to evaluate sound quality, Locanthi instantly identified an artifact of the codec. After Locanthi informed Swedish Radio of the artifact (an idle tone at 1.5kHz), listeners at Swedish Radio also instantly heard the distortion.
Indeed, this is a valid point: the SR listeners were primed to hear the artifact.
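Beyond priming, one plausible statistical mechanism for how so many trials could miss the artifact (and I stress this is my speculation, not anything documented by SR) is dilution: if the idle tone is clearly audible on only a few programme items, pooling grades across all items buries it. A toy illustration:

```python
# Toy illustration with invented numbers (my speculation, not SR data):
# an artifact audible on only one programme item is diluted away when
# grades are pooled across all items.
item_grades = {
    "speech":       [4.7, 4.6, 4.8, 4.7],
    "orchestral":   [4.6, 4.7, 4.5, 4.6],
    "glockenspiel": [3.2, 3.5, 3.1, 3.4],  # hypothetical idle-tone item
    "pop":          [4.8, 4.6, 4.7, 4.7],
    "jazz":         [4.7, 4.8, 4.6, 4.7],
}
for item, grades in item_grades.items():
    print(f"{item:>12}: mean grade {sum(grades) / len(grades):.2f}")

pooled = [g for grades in item_grades.values() for g in grades]
print(f"      pooled: mean grade {sum(pooled) / len(pooled):.2f}")
# The pooled mean (4.40 here) still looks respectable while one item is
# plainly degraded; a listener homing in on that item hears it at once.
```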
(Locanthi’s account of the episode is documented in an audio recording played at a workshop on low-bit-rate codecs at the 91st AES convention.)
Quote:
Unfortunately, his recorded speech didn't make it onto the cassettes of the workshop, so I'll have to rely on my memory and notes of the event.
How is it possible that a single listener, using non-blind observational listening techniques, was able to discover—in less than ten minutes—a distortion that escaped the scrutiny of 60 expert listeners, 20,000 trials conducted over a two-year period, and elaborate “double-blind, triple-stimulus, hidden-reference” methodology, and sophisticated statistical analysis?
The answer is that blind listening tests fundamentally distort the listening process and are worthless in determining the audibility of a certain phenomenon.
As exemplified by yet another reader letter published in this issue, many people naively assume that blind listening tests are somehow more rigorous and honest than the “single-presentation” observational listening protocols practiced in product reviewing. There’s a common misperception that the undeniable value of blind studies of new drugs, for example, automatically confers utility on blind listening tests.
I’ve thought quite a bit about this subject, and written what I hope is a fairly reasoned and in-depth analysis of why blind listening tests are flawed. This analysis is part of a larger statement on critical listening and the conflict between audio “subjectivists” and “objectivists,” which I presented in a paper to the Audio Engineering Society entitled “The Role of Critical Listening in Evaluating Audio Equipment Quality.” You can read the entire paper here
http://www.avguide.com/news/2008/05/28/the-role-of-critical-listening-in-evaluating-audio-equipment-quality/. I invite readers to comment on the paper, and discuss blind listening tests, on a special new Forum on AVguide.com. The Forum, called “Evaluation, Testing, Measurement, and Perception,” will explore how to evaluate products, how to report on that evaluation, and link that evaluation to real experience/value. I look forward to hearing your opinions and ideas.
Robert Harley"
So Harley has found one rather old, rather badly done test with a dodgy policy decision, and from this he concludes that all blind tests are flawed.