NuForce uDAC-2 Listening Challenge
Mar 12, 2011 at 7:13 PM Post #31 of 60


Quote:
^ Me too.  Don't give away the answer yet or I'll have to use the computer with both eyes closed ;).  8/10 pairwise comparisons don't quite reach significance, with about a 0.055 chance of happening at random.  9/10 does, with about a 0.011 chance.  7/8 is the lowest bar to clear (p = 0.035).



With so few trials per tester, I don't think a 95% confidence level is sufficient.  There's still too high a probability that the result is chance: in this case, 1 in 20 tests would show a false positive if the user is just guessing.
 
Use 99% or higher, and better yet, run more than 10 trials.  But determine the number of trials before beginning.
 
This should explain it a bit:
http://www.hydrogenaudio.org/forums/index.php?showtopic=16295&
http://www.graphpad.com/library/BiostatsSpecial/article_151.htm
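 
The binomial arithmetic behind these thresholds is easy to check. A minimal Python sketch (assuming scipy is available) reproduces the figures quoted above:

from scipy.stats import binom

# P(scoring k or better out of n by pure 50/50 guessing) = binomial tail
for n, k in [(10, 8), (10, 9), (8, 7), (10, 10)]:
    p = binom.sf(k - 1, n, 0.5)   # sf(k - 1) = P(X >= k)
    print(f"{k}/{n}: p = {p:.4f}")

# prints 8/10: p = 0.0547, 9/10: p = 0.0107, 7/8: p = 0.0352, 10/10: p = 0.0010.
# The 95% confidence criterion is exactly the "1 in 20 false positives" above.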
 
Mar 12, 2011 at 7:14 PM Post #32 of 60


 
Quote:
I'll attempt to ABX Harrison and Monroe later. (I'll edit this post)
 
EDIT: I was able to ABX with a 10/10, although it took a few trials and I failed a lot beforehand. I listened to a very short part of the track which sounded different. . . I don't think I'd notice this difference normally.
 
This is a no-ABX forum, I believe. . .


This is the one place (the science subforum) that does allow discussion of DBT. 10/10 is significant at p < 0.05 (the exact figure is p ≈ 0.001).  DBT is great for introducing reality checks: the big differences that seem obvious when sighted are sometimes much harder to pick out in a DBT.
 
 
 
Mar 12, 2011 at 7:19 PM Post #33 of 60
Quote:
I'll attempt to ABX Harrison and Monroe later. (I'll edit this post)
 
EDIT: I was able to ABX with a 10/10, although it took a few trials and I failed a lot beforehand. I listened to a very short part of the track which sounded different. . . I don't think I'd notice this difference normally.
 
This is a no-ABX forum, I believe. . .


Wait, what do you mean "it took a few trials and I failed a lot beforehand"?  Does that mean you did several ABX tests, failed those, and passed only the last one?  That does not at all mean you've positively identified a difference.
 
 
Mar 12, 2011 at 7:23 PM Post #34 of 60


Quote:
Wait, what do you mean "it took a few trials and I failed a lot beforehand"?  Does that mean you did several ABX tests, failed those, and passed only the last one?  That does not at all mean you've positively identified a difference.
 


Very well. Technically you are supposed to do something like 20 trials with no more than one wrong to be definitive. Oh well.
 
"You don't have permissions to create attachments."
 
weeeeeeeak.
 
Here it is anyway:
 
foo_abx 1.3.4 report
foobar2000 v1.1.1
2011/03/12 18:50:04

File A: C:\Users\[my name, you can't have]\Downloads\Brick House Harrison.flac
File B: C:\Users\[my name, you can't have]\Downloads\Brick House Monroe.flac

18:50:04 : Test started.
18:50:44 : 01/01  50.0%
18:50:59 : 02/02  25.0%
18:51:11 : 03/03  12.5%
18:51:30 : 04/04  6.3%
18:51:42 : 05/05  3.1%
18:52:04 : 06/06  1.6%
18:52:44 : 07/07  0.8%
18:53:08 : 08/08  0.4%
18:53:25 : 09/09  0.2%
18:53:47 : 10/10  0.1%
18:54:19 : Test finished.

 ----------
Total: 10/10 (0.1%)
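 
For what it's worth, the percentage column in a perfect foo_abx run like this is just the binomial guessing probability 0.5^n, which a couple of lines of Python (with scipy) reproduce:

from scipy.stats import binom

# cumulative probability of n/n correct answers by pure guessing
for n in range(1, 11):
    p = binom.sf(n - 1, n, 0.5)   # equals 0.5 ** n for a perfect score
    print(f"{n:02d}/{n:02d}  {100 * p:.2f}%")

# 50.00, 25.00, 12.50, 6.25, 3.12, 1.56, 0.78, 0.39, 0.20, 0.10 -
# the log's column, which rounds the same values to one decimal.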
 
 
 
 
Mar 12, 2011 at 7:39 PM Post #35 of 60


Quote:
Originally Posted by Satellite_6
 
 
Very well. Technically you are supposed to do something like 20 trials with no more than one wrong to be definitive. Oh well.
 
"You don't have permissions to create attachments."
 
weeeeeeeak.
 
Here it is anyway:
 
foo_abx 1.3.4 report
foobar2000 v1.1.1
2011/03/12 18:50:04

File A: C:\Users\[my name, you can't have]\Downloads\Brick House Harrison.flac
File B: C:\Users\[my name, you can't have]\Downloads\Brick House Monroe.flac

18:50:04 : Test started.
18:50:44 : 01/01  50.0%
18:50:59 : 02/02  25.0%
18:51:11 : 03/03  12.5%
18:51:30 : 04/04  6.3%
18:51:42 : 05/05  3.1%
18:52:04 : 06/06  1.6%
18:52:44 : 07/07  0.8%
18:53:08 : 08/08  0.4%
18:53:25 : 09/09  0.2%
18:53:47 : 10/10  0.1%
18:54:19 : Test finished.

 ----------
Total: 10/10 (0.1%)
 
 
 


Yes, that's all fine and dandy.  But what about the other trials you did?  That's what I'm asking about.  I don't doubt that you scored 10/10, and even if I did suspect you had faked it, the results file isn't exactly difficult to counterfeit...
 
Mar 12, 2011 at 8:12 PM Post #36 of 60


 
Quote:
Yes, that's all fine and dandy.  But what about the other trials you did?  That's what I'm asking about.  I don't doubt that you scored 10/10, and even if I did suspect you had faked it, the results file isn't exactly difficult to counterfeit...


For these tests it is okay to spend a few trials getting familiar and practising, for instance finding a section where the difference is more apparent. What is less okay is, say, doing 20 sets of 10 and only reporting the one where you manage 10/10. But once you are confident, you can always go back and repeat it.
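 
That cherry-picking point is worth quantifying: if you allow yourself many sets of 10 and only report a 10/10, the chance of getting at least one such set by pure guessing climbs quickly. A small Python illustration:

p_single = 0.5 ** 10                    # P(10/10 in one set by guessing), ~0.001
for sets in (1, 5, 20, 100):
    p_any = 1 - (1 - p_single) ** sets  # P(at least one 10/10 across the sets)
    print(f"{sets:3d} sets: {p_any:.3f}")

# 1 set: 0.001; 5 sets: 0.005; 20 sets: 0.019; 100 sets: 0.093 -
# which is why you report every set, or decide in advance which run counts.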
 
 
 
Mar 12, 2011 at 8:22 PM Post #37 of 60
I did three:
 
The first was epic fail. . . the second I got a few right until I got a bunch wrong. . . which happens pretty often -_-' . . . and then I focused on a tiny part that sounded different and I was actually able to get 10/10 pretty fast. I didn't do it like 20 times.
 
Mar 12, 2011 at 8:31 PM Post #38 of 60


Quote:
I did three:
 
The first was epic fail. . . the second I got a few right until I got a bunch wrong. . . which happens pretty often -_-' . . . and then I focused on a tiny part that sounded different and I was actually able to get 10/10 pretty fast. I didn't do it like 20 times.


Okay, so you didn't focus on the same difference each time.  I figured that was the case, but I wanted to confirm it.
 
All I can say at this point is: try it a few more times on that same difference.  Perhaps do three (or five, or whatever) more sets of ten, rather than one long run, to avoid fatigue.  Just decide how many more before you start.  That should clear things up a bit.  It's not really rigorous, but it's better than just one 10/10 set, which is good but doesn't by itself mean there is for sure an audible difference - especially with other failed trials around it.
 
Mar 12, 2011 at 8:46 PM Post #39 of 60
I got 10/10 ABX testing Harrison vs Wilson just now.  It took too long, though, as I listened to all 4 in their entirety for each trial, so I heard the entire section at least 40 times just to do 10 trials.  Sometimes I had to repeat because I started daydreaming and forgot which of A vs B or X vs Y was the better one.  The only way to do 20 trials in a row is probably to take a break and come back to it later. 
 
 
 
Mar 12, 2011 at 9:32 PM Post #40 of 60


Quote:
I got 10/10 ABX testing Harrison vs Wilson just now.  It took too long, though, as I listened to all 4 in their entirety for each trial, so I heard the entire section at least 40 times just to do 10 trials.  Sometimes I had to repeat because I started daydreaming and forgot which of A vs B or X vs Y was the better one.  The only way to do 20 trials in a row is probably to take a break and come back to it later. 
 
 


Yeah, 20 is impossible; the best ABX runs I've done were like 14/15 (in the past, I mean, not this stuff), and by the time I get that far I'm bored to tears. Funny how you think Harrison sounds the worst and I think it sounds the best. I want to know which is which!
 
 
Mar 12, 2011 at 9:41 PM Post #41 of 60


Quote:
With so few trials per tester, I don't think a 95% confidence level is sufficient.  There's still too high a probability that the result is chance: in this case, 1 in 20 tests would show a false positive if the user is just guessing.
 
Use 99% or higher, and better yet, run more than 10 trials.  But determine the number of trials before beginning.
 
This should explain it a bit:
http://www.hydrogenaudio.org/forums/index.php?showtopic=16295&
http://www.graphpad.com/library/BiostatsSpecial/article_151.htm

 
True enough, one listener will almost definitely have too few trials to safely avoid both type 1 and type 2 errors.  But by convention, the probability of a type 2 error is kept under 20% while type 1 is kept under 5%.  And type 2 errors are not usually controlled by tightening alpha further; they are addressed by sampling an adequate number of subjects*.  Since the simplest answer anyone would give is a ranking of 4 tracks, I was thinking that the OP would analyze results by song, collapsing all the listeners' results into 3 better/worse judgments per song per subject, thereby minimizing both false positives and negatives.
 
*This calculator says that 32 samples are needed, assuming H0 = 50%, H1 = 100%, SD = 1, alpha = 0.05 & power = 0.8.  So let's say 32. :P
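 
The calculator and its exact model aren't specified, so as an illustration only, here is an exact-binomial version of the same kind of sample-size search in Python. H1 = 0.75 is an assumed effect size picked for the sketch; with the H1 = 100% quoted above, power is trivially 1 for any n >= 5 in this framing:

from scipy.stats import binom

alpha, target_power, h0, h1 = 0.05, 0.8, 0.5, 0.75  # h1 = 0.75 is illustrative

for n in range(5, 200):
    # smallest critical score k with P(X >= k | pure guessing) <= alpha
    k = next(k for k in range(n + 1) if binom.sf(k - 1, n, h0) <= alpha)
    power = binom.sf(k - 1, n, h1)   # chance a genuine h1 listener clears k
    if power >= target_power:
        print(f"n = {n} trials, critical score {k}/{n}, power = {power:.2f}")
        break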

 
Mar 15, 2011 at 11:44 PM Post #43 of 60
Here are my guesses.  Despite its terrible compression, I used the Lady Gaga track because it has more treble, where I hear DAC differences more clearly.  Or rather, where I imagine I hear differences more clearly.
 
jefferson >= jackson > lincoln > adams
 
Apr 9, 2011 at 12:50 AM Post #44 of 60
nwavguy was nice enough to share his raw data with me for statistical analysis, so I thought I'd post the results of the test.   There are enough flaws in the test to diminish the validity of these results, but let's sweep those under the rug.
 
For the tough-minded, some of the flaws are behind this bar.
 
1- Results are not independent.  People could see other people's guesses and be swayed in their opinion.
2- People answered very different questions, so they did not make the same comparisons; e.g. some ranked all the tracks, some chose only the best, or the worst, etc.
3- There are not nearly enough subjects to satisfy the standard statistical power requirement (β < 0.2), so a negative result has a very high chance of being a false negative.
4- Because of #3, choices were collapsed across songs to try to increase the sample size; again, the pooled results mix choices made on different tracks.
5- It's too late in the evening for me to finish this list.  Look, a pony.
 
Statistical tests were made of the likelihood of the null hypothesis (that all choices were random) as described by a binomial distribution of random discrete choices.  A result of p < 0.05 means the choices were most likely not random and people really could distinguish between the DACs.
 
Question 1:
Did any of the three DACs get chosen as the 'favorite' more often than would be predicted by pure chance (33%)?

Benchmark got chosen as the favorite 7 out of 12 times.
The cumulative probability of 7 or more such choices under the binomial distribution is p = 0.0664.

NuForce got chosen as the favorite 3 out of 12 times.  p = 0.819

Behringer got chosen as the favorite 2 out of 12 times.  p = 0.946
 
Question 2: 
Did any of the tracks get chosen as the 'worst' more often than would be predicted by chance?  This test includes the surprise CD tracks, so the null hypothesis is a 25% chance of choosing any one source.

Behringer was chosen the worst 4 out of 12 times.  This has a probability of p = 0.286.

The original CD track was chosen the worst 6 out of 12 times.  This has a cumulative probability of p = 0.054.

In the two tests, multiple comparisons were made, so the significance threshold should be divided by the number of tests (Bonferroni correction).  Regardless, p = 0.066 and p = 0.054 are pretty close to significant and count as real trends in the data.
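
Those tail probabilities are straightforward to re-derive; a short Python check (with scipy), using the counts reported above:

from scipy.stats import binom

# (label, picks, answers, chance level)
tests = [("Benchmark favorite", 7, 12, 1/3),
         ("NuForce favorite",   3, 12, 1/3),
         ("Behringer favorite", 2, 12, 1/3),
         ("CD track worst",     6, 12, 1/4)]

for name, k, n, chance in tests:
    p = binom.sf(k - 1, n, chance)   # P(k or more picks under pure chance)
    print(f"{name}: {k}/{n}, p = {p:.3f}")

# Benchmark: p = 0.066; NuForce: p = 0.819; Behringer: p = 0.946; CD worst: p = 0.054.
# With the Bonferroni correction over the 3 favorite-comparisons, the per-test
# threshold drops from 0.05 to about 0.017.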

I think we could reasonably conclude that there are two strong, but not statistically significant, trends:
People chose the Benchmark as the best DAC, and people chose the CD tracks as the worst sounding.  So people have combined gold and tin ears.
 
 
Apr 9, 2011 at 12:08 PM Post #45 of 60
People chose the Benchmark as the best DAC, and people chose the CD tracks as the worst sounding.  So people have combined gold and tin ears.
 


Awesome closing words. :D Thanks for the analysis.
 
