nwavguy was nice enough to share his raw data with me for statistical analysis, so I thought I'd post the results of the test. There are enough flaws in the test to diminish the validity of these results, but let's sweep those under the rug.
For the tough-minded, some of the flaws are listed below.
1- Results were not independent. People could see other people's guesses and be swayed by them.
2- People answered different questions and so did not make the same comparisons, e.g. some ranked all the tracks, some chose only the best or only the worst.
3- There are not nearly enough subjects to satisfy a standard β < 0.25 statistical power, so a negative result has a very high chance of being a false negative.
4- Because of #3, choices were collapsed over songs to try to increase the sample size, so the averaged results mix choices made over different tracks.
5- It's too late in the evening for me to finish this list. Look, a pony.
Statistical tests assessed the likelihood of the null hypothesis (that all choices were random), modeled as a binomial distribution of random discrete choices. A p-value below the α = .05 threshold would mean the choices were most likely not random and people really could distinguish between the dacs.
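The test above amounts to a one-sided binomial tail probability: the chance of k or more matching choices out of n if every choice were random. A minimal sketch in Python (the function name is mine, not from the original analysis):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    'matching' choices out of n if every choice were purely random."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Sanity check: a result of 0 or more successes is certain.
print(binom_tail(0, 12, 1/3))  # ~1.0
```

A small tail probability (below the significance threshold) is evidence against the choices being random.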
Did any of the three dacs get chosen as the 'favorite' more often than would be predicted by pure chance (33%)?
Benchmark got chosen the favorite 7 out of 12 times.
The cumulative probability of people making 7 or more equivalent choices using the binomial distribution is p = 0.0664.
Nuforce got chosen the favorite 3 out of 12 times. p=0.819
Behringer got chosen the favorite 2 out of 12 times. p=0.946
Did any of the sources get chosen as the 'worst' more often than would be predicted by chance? This test includes the surprise CD tracks, so under the null hypothesis there is a 25% chance of choosing any one source.
Behringer was chosen the worst 4 out of 12 times. This has a probability of p=0.286
Original CD track was chosen the worst 6 out of 12 times. This has a cumulative probability of p=0.054
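The CD figure checks out against the same binomial tail, now with four sources (three dacs plus the CD) under the null, so p = 1/4:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# CD track picked as worst by 6 of 12 listeners, null chance p = 1/4.
print(round(binom_tail(6, 12, 0.25), 3))  # 0.054
```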
In each of the two tests, multiple comparisons were made, so the significance threshold should be divided by the number of tests (Bonferroni correction). Even so, p=0.066 and p=0.054 are pretty close to significant and suggest real trends in the data.
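The correction is just a division of the threshold; a sketch, where m = 3 is my assumption (three dacs compared in the 'favorite' test):

```python
# Bonferroni: with m comparisons, each individual test is judged
# against alpha / m instead of alpha.
alpha, m = 0.05, 3  # m = 3 is an assumed comparison count, not from the post
threshold = alpha / m
print(round(threshold, 4))  # 0.0167
for p in (0.066, 0.054):
    print(p, p < threshold)  # both False: trends, but not significant
```

So the two near-significant p-values fall even further from the corrected threshold, which is why they are called trends rather than significant results.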
I think we could reasonably conclude that there are two strong, but not statistically significant, trends:
People chose the Benchmark as the best dac, and people chose the CD tracks as the worst-sounding source. So people have combined gold and tin ears.
Edited by eucariote - 4/9/11 at 4:26pm