Absolutely.
I just wanted to point out that there's a distinct difference between statistical significance and a fact that can be generalized with absolute certainty... and many audiophiles in particular seem to miss that distinction.
The number of people with a potentially fatal peanut allergy is almost certainly "statistically insignificant".
And this even extends to the point that, if you see someone fall over after eating a peanut, the likelihood is that they DIDN'T collapse from a peanut allergy.
Yet neither of those statistical probabilities rules out the POSSIBILITY that he or she has a serious peanut allergy.
Unfortunately, many audio companies use less than credible science to sell their products, and many audiophiles seem too eager to believe what they read (or give in to their biases).
This leads to the quite reasonable suggestion that much of what audiophiles claim to hear is, in fact, the product of their own biases.
My point was basically that many people don't understand "how to read the statistics" and "how to design the tests".
For example, let's say I'm trying to prove that "the difference between FLAC and WAV files of the same bit depth and sample rate is inaudible".
(First off, the only real way to do this is to attempt to falsify that claim - attempt to prove that some people do hear a difference and then fail in that attempt.)
I could test a thousand people, with twenty files each, and perhaps produce a result that "there was no statistically significant correlation" when people attempted to tell which they were listening to.
However, by doing it that way, I would have failed to prove that there IS a difference, and I still could not PROVE that there isn't one; I could only suggest that there is NO difference.
(And only on the particular test equipment, with the particular sample files, and under the particular test conditions, I chose.)
The test protocols often used in audio generally fail to pick out small groups of outliers (for example the small percentage of people who have "absolute pitch").
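To make that concrete, here is a minimal simulation sketch (the numbers are my own illustration, not from any real test): 1000 listeners do 20 trials each; 990 guess at chance and 10 hypothetical "golden ears" are right 90% of the time. In a typical run the pooled hit rate stays close to 50%, while several of the outliers are individually significant.

```python
# Illustrative only: pooled vs. per-listener analysis of a forced-choice test.
import random
from math import comb, sqrt

random.seed(1)

TRIALS = 20
accuracies = [0.5] * 990 + [0.9] * 10   # true per-trial accuracy per listener
scores = [sum(random.random() < acc for _ in range(TRIALS)) for acc in accuracies]

def p_at_least(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Pooled view: the overall hit rate sits near 50% (normal-approximation z-score).
total, n_total = sum(scores), TRIALS * len(accuracies)
z = (total - 0.5 * n_total) / sqrt(n_total * 0.25)
print(f"pooled: {total}/{n_total} correct ({total / n_total:.1%}), z = {z:.2f}")

# Per-listener view: the outliers' scores are individually very unlikely under chance.
flagged = sum(p_at_least(s, TRIALS) < 0.001 for s in scores)
print(f"listeners individually significant at p < 0.001: {flagged}")
```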
From a practical point of view, if I wanted much more conclusive results, here's how I would run the test....
I would put out a public call for test subjects.
I would offer a prize of $500 to anyone who can "get 17 out of 20 correct" when trying to guess whether they're listening to a FLAC or a WAV file.
I would invite them to use the audio system of their choice.
(If I expected a lot of applicants, I might have a self-administered "screening round", to weed out those who obviously couldn't tell, before the "cash round".)
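For reference, that 17-out-of-20 bar is already quite hard to clear by luck; a quick sketch of the arithmetic (mine, not part of the original proposal):

```python
# Chance of hitting the proposed pass mark by pure guessing (p = 0.5 per trial).
from math import comb

p_pass = sum(comb(20, k) for k in range(17, 21)) / 2 ** 20
print(f"P(>= 17/20 correct by guessing) = {p_pass:.5f}")   # ~0.00129, roughly 1 in 780
```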
This protocol would:
- self-select for people likely to actually be able to hear a difference (at least by their own evaluation)
- provide those people an incentive to participate
- provide an incentive for them to try their hardest to succeed
- ensure that they were tested under optimal test conditions (within limitations)
I would now see if any of the applicants were INDIVIDUALLY able to distinguish which file was which format to a statistically significant degree.
And, if a statistically significant number of the participants "beat the odds", then I would conclude that I had a positive result.
And, if even a few participants "beat the odds", I would conclude that the result appeared significant, but statistically COULD still be due to random chance.
So I would RETEST my successful candidates - with a longer list of files.
And, if they were AGAIN able to "beat the odds" I would conclude that those few candidates were actually able to hear the difference.
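A rough sketch of why the retest matters (the 30-out-of-40 retest threshold and the applicant count are assumptions of mine, purely for illustration): with enough applicants, a lucky pass of round one is quite plausible, but a lucky pass of both rounds is not.

```python
# Illustrative thresholds: 17/20 in round one, then a hypothetical 30/40 retest.
from math import comb

def p_at_least(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), i.e. pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

p_round1 = p_at_least(17, 20)          # ~1.3e-3
p_round2 = p_at_least(30, 40)          # ~1.1e-3
print(f"one guesser passes round 1 by luck:     {p_round1:.2e}")
print(f"one guesser passes both rounds by luck: {p_round1 * p_round2:.2e}")

# With many applicants, someone passing round one by luck becomes quite likely;
# someone passing BOTH rounds by luck does not.  (Applicant count is assumed.)
APPLICANTS = 200
print(f"at least one of {APPLICANTS} guessers passes round 1: "
      f"{1 - (1 - p_round1) ** APPLICANTS:.1%}")
print(f"at least one of {APPLICANTS} passes both rounds:      "
      f"{1 - (1 - p_round1 * p_round2) ** APPLICANTS:.3%}")
```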
And, if, UNDER CONDITIONS CHOSEN BY THE APPLICANTS, their guesses were still random, I would have a pretty solid justification for claiming that there was probably no audible difference.
In short, I would have offered every reasonable opportunity for a positive result...
All I need is one person who can consistently and reliably hear a difference to state with certainty that "at least some humans can hear a difference".
However, BECAUSE I PROVIDED EVERY POSSIBLE OPPORTUNITY FOR A POSITIVE RESULT, if I FAIL to produce a positive result, then my failure gives credibility to my claim that the negative result is probably correct.
(Note that it is impossible to ever prove a negative in most situations.)
I would still have failed to test against the possibility that some specific small group could actually hear a difference but didn't participate (perhaps only children below the age of five can hear it...)
But that is probably a minor consideration.
However, I would have given every person who believes that there is in fact a difference the opportunity to "win their point" under their chosen conditions.
Therefore, I can assert that "I have given everyone a fair opportunity to prove me wrong in my claim that there is no difference - and nobody has succeeded in doing so."
@KeithEmo you're right, but with one small proviso. The more outcome measures you have, the higher the likelihood of some outcome reaching statistical significance. This is called P-hacking, and is a big problem in medicine and psychology at present. If you use a P-value of 0.05 (i.e. a 5% chance of the result being due to chance alone), you only need to measure about ten outcomes to have a pretty high likelihood that one reaches statistical significance (chances are 1-(19/20)^10, I think...)
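(That figure works out to about 40%; a one-line check, not part of the quoted post:)

```python
# With ten independent outcome measures, each tested at alpha = 0.05, the chance
# that at least one looks "significant" purely by chance is 1 - (19/20)^10.
alpha, m = 0.05, 10
print(f"{1 - (1 - alpha) ** m:.3f}")   # ~0.401, i.e. roughly a 40% false-alarm rate
```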
A better indicator of significance in your example would be clusters of people at one end of the normal distribution curve.