That makes sense...
If you would like to suggest a particular test, I'll be glad to try to point out at least the obvious flaws.
I'm going to start with the Meyer and Moran study.
One of the most common errors I see is the MISAPPLICATION of statistics to situations where they don't apply.
The asserted purpose of the Meyer and Moran test was to determine whether people could reliably detect an audible difference when a "CD loop" was inserted into the audio signal chain. Most of us here interpret that to specifically mean "whether any human being can reliably detect the difference". Assuming that to be the case, here is the first (MAJOR) error. Because of the nature of the testing and reporting involved, THE PERFORMANCE OF EACH INDIVIDUAL MUST BE ANALYZED STATISTICALLY. The only way to tell whether an individual can reliably tell the difference is to run a bunch of test trials and see whether their responses score significantly better than would be expected by random chance.
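To make that concrete, here is a minimal sketch of a per-individual analysis in Python using scipy; the trial counts are purely hypothetical, not numbers from the study:

```python
# Minimal sketch: testing ONE listener's results against chance.
# Hypothetical numbers: 14 correct answers out of 16 forced-choice trials.
from scipy.stats import binomtest

correct = 14   # hypothetical: trials this listener got right
trials = 16    # hypothetical: total trials for this listener
# Under the null hypothesis ("just guessing"), p = 0.5 per trial.
result = binomtest(correct, trials, p=0.5, alternative='greater')
print(f"p-value = {result.pvalue:.4f}")  # ~0.0021: very unlikely to be chance
```

Run that for each subject separately; a subject with a tiny p-value is a candidate "reliable detector", regardless of how everyone else scored.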
However, if you apply a bit of logic, it becomes obvious that there is no reason to apply statistics to the OVERALL results, and doing so would be improper in light of the result desired. If the purpose of our test is to determine whether ANY humans can reliably detect a difference, then, as soon as we find a single human who can do so, we have our answer: yes. We have a human who can do so, therefore "some humans can do so".
That is a simple YES/NO question and NOT a statistical question. (It doesn't matter how many can do it; once we find one who can, we're done, and the answer is known: yes. We might go ahead and see how many can do it, and probably would, but that is outside the scope of the original test.)
If Meyer and Moran were REALLY trying to determine "whether most people would hear a difference" or "how many people would hear a difference", then statistically analyzing the overall results was a perfectly valid way to do so. I strongly suspect that this is what they were trying to do, since it is what most of the public, as well as most CD vendors, would really be interested in. Therefore, claiming that their test "showed that nobody could do it" is simply a misinterpretation of their results. Their results showed only that they were unable to demonstrate that a statistically significant percentage of the people they tested could reliably detect a difference a statistically significant percentage of the time.
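For contrast with the per-individual test above, here is what that pooled analysis looks like. Again, every number is hypothetical, chosen only to illustrate the point:

```python
# Sketch of the POOLED analysis (answers "how many/most people", not "anyone").
# Hypothetical: 60 subjects x 10 trials each, 312 correct answers in total.
from scipy.stats import binomtest

total_correct = 312   # hypothetical pooled tally across all subjects
total_trials = 600    # 60 subjects * 10 trials each (hypothetical)
result = binomtest(total_correct, total_trials, p=0.5, alternative='greater')
print(f"pooled p-value = {result.pvalue:.2f}")  # ~0.17: looks like guessing
# A near-chance pooled result like this can easily hide ONE genuinely
# reliable listener whose trials are swamped by 59 guessers.
```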
(It is also absolutely true that "a statistically significant percentage of the human population is not able to run a mile in four minutes". However, that is FAR different from "proving that nobody can do it", along several different axes.)
Here is the proper way to run a test "to determine whether ANYONE can reliably detect a difference".
You conduct the first level of the test exactly as they did, using statistical analysis ON EACH INDIVIDUAL'S RESULTS to determine whether each test subject scored above what might reasonably be expected by random chance. However, you treat this first level as a screening round. You assume that all the subjects whose results fell very close to random are "uninteresting" and send them home. HOWEVER, if any subjects produced individual results that appeared significant, you conduct further test runs with them to determine whether their results were random or not. Typically you would set some sort of threshold for "subjects to be passed on to round two" - either a direct threshold like "everyone who scores above 7/10" or a fraction like "the best-scoring 10%". (And, if nobody scored above that threshold, you would either declare a null result or find a new set of test subjects and try again.)
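A rough sketch of that two-round protocol. The thresholds, trial counts, and scores here are made up purely for illustration; they are not values from any actual study:

```python
# Two-round screening sketch: round one screens, round two confirms.
from scipy.stats import binomtest

PASS_SCORE = 8       # assumed screening threshold: 8/10 or better in round one
ROUND2_TRIALS = 40   # longer confirmation run for the survivors
ALPHA = 0.01         # stricter significance bar in round two

def screen(round1_scores):
    """Return the subjects who pass round one (scores are out of 10 trials)."""
    return [s for s, score in round1_scores.items() if score >= PASS_SCORE]

def confirm(round2_correct):
    """Round two: a proper per-individual binomial test against chance."""
    p = binomtest(round2_correct, ROUND2_TRIALS, 0.5, alternative='greater').pvalue
    return p < ALPHA

# Hypothetical round-one data: most subjects near chance, one outlier.
round1 = {"S01": 5, "S02": 6, "S03": 9, "S04": 4}
finalists = screen(round1)      # -> ["S03"]
print(finalists, confirm(31))   # 31/40 in round two: p ~ 0.0003 -> confirmed
```

If even one subject survives round two, the yes/no question is answered: yes, someone can do it.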
At this point you should also note and report two obvious potential causes of error:
1) Your sample didn't contain all humans - so you could simply have missed one who could do it.
2) It's possible that one of your subjects could do so - and his or her failure this time was the anomaly. (A quick calculation, sketched below, shows how easily that can happen with a short screening round.)
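To put a number on that second point, assume (hypothetically) a listener who genuinely detects the difference on 70% of trials. A short screening round misses them surprisingly often:

```python
# Sketch: how easily a REAL detector slips through a short screening round.
from scipy.stats import binom

p_true = 0.7        # assumed true per-trial detection rate (hypothetical)
n = 10              # trials in the screening round
pass_score = 8      # same illustrative 8/10 threshold as above
p_fail = binom.cdf(pass_score - 1, n, p_true)
print(f"P(genuine detector fails the screen) = {p_fail:.2f}")  # ~0.62
```

In other words, a listener who truly hears the difference most of the time still fails an 8/10 screen more often than not, which is exactly why a single short run proves nothing in either direction.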
So far, that constitutes either an error of "performing a test that is inappropriate to the desired results" or an attempt to misinterpret the results to prove something the test was never intended to measure.
The test itself also contains numerous errors of methodology and procedure.
We all know that, whether it's audible or not, the CD signal chain fails to exactly reproduce the original audio in many measurable ways. It adds some noise, some distortion, and some errors in frequency response and impulse response, as well as possibly other unknown differences. They should have provided detailed analyses of their test content, both "before and after". They were attempting to determine whether their sample listeners could detect differences - YET THEY FAILED TO PROVIDE DETAILED INFORMATION ABOUT THE DIFFERENCES PRESENT IN THE TEST SAMPLES. For all we know, they used low-quality original material, the CD signal loop made no measurable difference to it, and there were no measurable differences between the samples at all. They should have provided detailed data, like product disc numbers and specifications, as well as test results like spectrum analyses, showing what was present in each sample and what the measured differences were. In fact, since the "item of interest" was the differences, they probably should have provided an actual difference file containing those differences for easy analysis and confirmation. They should also have included measurements showing that the differences involved were being accurately presented to their listeners - and not being obscured by any of the components used for the test. Then we would know precisely what differences the subjects were or were not able to hear.
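The difference-file idea is trivial to implement. Here is a sketch; the file names are invented, the alignment is crude, and real material would need sample-accurate alignment and level matching before the subtraction means anything:

```python
# Sketch: compute a "difference file" by subtracting the looped signal
# from the original. Assumes both files are the same format/channel count.
import numpy as np
import soundfile as sf  # third-party library: pip install soundfile

original, rate1 = sf.read("original_master.wav")   # hypothetical file names
looped, rate2 = sf.read("after_cd_loop.wav")
assert rate1 == rate2, "sample rates must match before subtracting"

n = min(len(original), len(looped))                # crude length matching
difference = original[:n] - looped[:n]             # the residual signal

# The residual's level tells you how big the differences actually are.
rms = np.sqrt(np.mean(difference**2))
db = 20 * np.log10(rms) if rms > 0 else float('-inf')
print(f"difference RMS: {db:.1f} dBFS")
sf.write("difference.wav", difference, rate1)      # a publishable null file
```

Publishing something like that alongside the paper would have let anyone verify what the subjects were actually being asked to hear.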
I should note, in their defense, that they DID report the anomalous results obtained from several test subjects.
It was always my impression that Meyer and Moran simply set out to "provide a simple test to demonstrate that, when used to reproduce typical consumer-quality content, most consumers didn't notice any audible degradation caused by CDs".
I’ve seen repeated commentary here suggesting the tests documented in the first post and others may be flawed. That may or may not be true, but if that assertion is going to be made, then it’s incumbent upon the person suggesting the tests were flawed to be specific about those flaws, not just throw a turd in the punch bowl.
General statements that “there might be flaws” aren’t helpful and could be construed as deflection, particularly from those averse to participating in reasonable testing. While it may not be conclusive, an ABX via Foobar is quite simple to construct, and enough data may be indicative - particularly if statistics aren’t abused and single-run results aren’t presented as significant. If someone can score 8/10 or better on 20 test runs, then we have something to discuss.
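The arithmetic behind that bar is worth spelling out, since it shows why one lucky run means little while twenty consistent ones mean a lot:

```python
# Why single-run results shouldn't be over-sold: chance alone produces
# "8/10 or better" fairly often, but not run after run.
from scipy.stats import binom

p_one_run = binom.sf(7, 10, 0.5)        # P(>= 8/10 by guessing) ~ 0.055
p_any_of_20 = 1 - (1 - p_one_run)**20   # at least one lucky run in 20 ~ 0.68
p_all_20 = p_one_run**20                # EVERY run >= 8/10 by luck ~ 6e-26
print(f"{p_one_run:.3f}  {p_any_of_20:.2f}  {p_all_20:.0e}")
```

So a guesser will hit 8/10 at least once in 20 runs about two times out of three, but hitting it on every run by luck is effectively impossible. That is the difference between cherry-picking a run and actually demonstrating something.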