
NuForce uDAC-2 Listening Challenge - Page 3  

post #31 of 60
Quote:
Originally Posted by eucariote View Post

^ me too.  Don't give away the answer yet or I'll have to use the computer with both eyes closed.  8/10 pairwise comparisons don't quite reach significance, with about a 0.055 probability of happening at random.  9/10 do, with about a 0.011 probability.  7/8 is the lowest bar to clear (p=0.035).



With so few trials by each tester, I don't think a 95% confidence level is sufficient.  There's still too high a probability of the result being chance.  In this case, 1 in 20 tests would show a false positive if the user is guessing.

 

Use 99% or higher, and better yet, run more than 10 trials.  But determine the number of trials before beginning.

 

This should explain it a bit:

http://www.hydrogenaudio.org/forums/index.php?showtopic=16295&

http://www.graphpad.com/library/BiostatsSpecial/article_151.htm
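 

For concreteness, the exact binomial p-values eucariote quotes are easy to check in a few lines of Python (a minimal sketch; the function name is mine):

from math import comb

def abx_p_value(correct, trials):
    # Chance of scoring `correct` or better out of `trials` by
    # guessing alone (one-sided binomial, chance = 1/2)
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(8, 10))   # ~0.0547 -- just misses the 0.05 bar
print(abx_p_value(9, 10))   # ~0.0107 -- clears it
print(abx_p_value(7, 8))    # ~0.0352 -- the smallest run that clears it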


Edited by BlackbeardBen - 3/12/11 at 4:15pm
post #32 of 60



 

Quote:
Originally Posted by Satellite_6 View Post

I'll attempt to ABX Harrison and Monroe later. (I'll edit this post)

 

EDIT: I was able to ABX with a 10/10, although it took a few trials and I failed a lot beforehand. I listened to a very short part of the track which sounded different. . . I don't think I'd notice this difference normally.

 

This is a no-ABX forum, I believe. . .


This is the one place (the Sound Science subforum) that does allow discussion of DBT. 10/10 is significant at p < 0.05. DBT is great for introducing reality checks: the big differences that seem obvious when sighted are sometimes much harder to pick out in a DBT.
 

 

post #33 of 60
Quote:
Originally Posted by Satellite_6 View Post

I'll attempt to ABX Harrison and Monroe later. (I'll edit this post)

 

EDIT: I was able to ABX with a 10/10, although it took a few trials and I failed a lot beforehand. I listened to a very short part of the track which sounded different. . . I don't think I'd notice this difference normally.

 

This is a no-ABX forum, I believe. . .


Wait, what do you mean "it took a few trials and I failed a lot beforehand"?  Does that mean that you did several ABX tests and failed those while passing only the last one?  That does not at all mean you've positively identified a difference.

 

post #34 of 60
Quote:
Originally Posted by BlackbeardBen View Post


Wait, what do you mean "it took a few trials and I failed a lot beforehand"?  Does that mean that you did several ABX tests and failed those while passing only the last one?  That does not at all mean you've positively identified a difference.

 


Very well. Here it is anyway. Technically you are supposed to do something like 20 trials and get no more than one wrong for it to be definitive. Oh well.

 

"You don't have permissions to create attachments."

 

weeeeeeeak.

 

Here it is anyway:

 

foo_abx 1.3.4 report
foobar2000 v1.1.1
2011/03/12 18:50:04

File A: C:\Users\[my name, you can't have]\Downloads\Brick House Harrison.flac
File B: C:\Users\[my name, you can't have]\Downloads\Brick House Monroe.flac

18:50:04 : Test started.
18:50:44 : 01/01  50.0%
18:50:59 : 02/02  25.0%
18:51:11 : 03/03  12.5%
18:51:30 : 04/04  6.3%
18:51:42 : 05/05  3.1%
18:52:04 : 06/06  1.6%
18:52:44 : 07/07  0.8%
18:53:08 : 08/08  0.4%
18:53:25 : 09/09  0.2%
18:53:47 : 10/10  0.1%
18:54:19 : Test finished.

 ----------
Total: 10/10 (0.1%)
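 

For reference, the percentage column is just the chance of the run so far coming up under pure coin-flip guessing: 0.5^n for a perfect streak. A few lines of Python reproduce it (my own sketch, not foo_abx's code):

from math import comb

def streak_p(correct, trials):
    # One-sided binomial: chance of at least `correct` right out of
    # `trials` when guessing at 50/50
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

for n in range(1, 11):
    print(f"{n:02d}/{n:02d}  {streak_p(n, n):.1%}")   # 50.0%, 25.0%, ..., 0.1%

# The "20 trials, at most one wrong" bar mentioned above is stricter still:
print(streak_p(19, 20))   # ~2e-5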

 

 

 


Edited by Satellite_6 - 3/12/11 at 4:28pm
post #35 of 60
Quote:

Originally Posted by Satellite_6 View Post
 

 

Very well. Here it is anyway. Technically you are supposed to do something like 20 trials and get no more than one wrong for it to be definitive. Oh well.

 

"You don't have permissions to create attachments."

 

weeeeeeeak.

 

Here it is anyway:

 

foo_abx 1.3.4 report
foobar2000 v1.1.1
2011/03/12 18:50:04

File A: C:\Users\[my name, you can't have]\Downloads\Brick House Harrison.flac
File B: C:\Users\[my name, you can't have]\Downloads\Brick House Monroe.flac

18:50:04 : Test started.
18:50:44 : 01/01  50.0%
18:50:59 : 02/02  25.0%
18:51:11 : 03/03  12.5%
18:51:30 : 04/04  6.3%
18:51:42 : 05/05  3.1%
18:52:04 : 06/06  1.6%
18:52:44 : 07/07  0.8%
18:53:08 : 08/08  0.4%
18:53:25 : 09/09  0.2%
18:53:47 : 10/10  0.1%
18:54:19 : Test finished.

 ----------
Total: 10/10 (0.1%)

 

 

 


Yes, that's all fine and dandy.  But what about the other trials you did?  That's what I'm asking about - I don't doubt that you scored 10/10, and even if I did suspect you faked it, the results file isn't exactly difficult to counterfeit...

post #36 of 60



 

Quote:
Originally Posted by BlackbeardBen View Post




Yes, that's all fine and dandy.  But what about the other trials you did?  That's what I'm asking about - I don't doubt that you scored 10/10, and even if I did suspect you faked it, the results file isn't exactly difficult to counterfeit...


For these tests it is okay to spend a few trials getting familiar and practising, for instance finding a section where the difference is more apparent. What is less okay is, say, doing 20 sets of 10 and only reporting the one where you manage 10/10; but once you are confident, you can always go back and repeat the test.
 

 

post #37 of 60

I did three:

 

The first was an epic fail. . . in the second I got a few right until I got a bunch wrong. . . which happens pretty often -_-' . . . and then I focused on a tiny part that sounded different and was actually able to get 10/10 pretty fast. I didn't do it like 20 times.


Edited by Satellite_6 - 3/12/11 at 6:33pm
post #38 of 60
Quote:
Originally Posted by Satellite_6 View Post

I did three:

 

The first was an epic fail. . . in the second I got a few right until I got a bunch wrong. . . which happens pretty often -_-' . . . and then I focused on a tiny part that sounded different and was actually able to get 10/10 pretty fast. I didn't do it like 20 times.


Okay, so you didn't focus on the same difference each time.  I figured that was the case, but I wanted to confirm it.

 

All I can say at this point is: try it a few more times on that same difference.  Perhaps do three (or five, or whatever) more sets of ten trials, to avoid fatigue.  Just decide how many more before you start.  That should clear things up a bit.  It's not really rigorous, but it's better than just one set at 10/10 - which is good, but really doesn't mean there is for sure an audible difference - especially with other failed trials around it.
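 

To see why deciding the count up front matters: sets that are all declared in advance can be pooled into a single binomial test, instead of cherry-picking the best run. A quick sketch, with hypothetical scores of my own invention:

from math import comb

def binom_tail(correct, trials):
    # P(at least `correct` right out of `trials`) under 50/50 guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# Hypothetical: three sets of ten declared in advance, scored 9/10,
# 8/10 and 10/10 -- pool them as one test of 27/30:
print(binom_tail(27, 30))   # ~4e-6, far stronger than a lone 10/10 (~0.001)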

post #39 of 60

I got 10/10 ABX testing Harrison vs Wilson just now.  It took too long though, as I listened to all 4 in their entirety for each trial.  So I heard the entire section at least 40 times just to do 10 trials.  Sometimes I had to repeat because I started daydreaming and forgot which of A vs B or X vs Y was the better one.  The only way to do 20 trials in a row is probably to take a break and come back to it later. 

 

 

post #40 of 60
Quote:
Originally Posted by bcwang View Post

I got 10/10 ABX testing Harrison vs Wilson just now.  It took too long though, as I listened to all 4 in their entirety for each trial.  So I heard the entire section at least 40 times just to do 10 trials.  Sometimes I had to repeat because I started daydreaming and forgot which of A vs B or X vs Y was the better one.  The only way to do 20 trials in a row is probably to take a break and come back to it later. 

 

 


Yeah, 20 is impossible; the best ABX runs I've done were like 14/15 (in the past I mean, not this stuff), and by the time I get that far I'm bored to tears. Funny how you think Harrison sounds the worst and I think it sounds the best. I want to know which is which!

 

post #41 of 60
Quote:
Originally Posted by BlackbeardBen View Post





With so few trials by each tester, I don't think a 95% confidence level is sufficient.  There's still too high a probability of the result being chance.  In this case, 1 in 20 tests would show a false positive if the user is guessing.

 

Use 99% or higher, and better yet, run more than 10 trials.  But determine the number of trials before beginning.

 

This should explain it a bit:

http://www.hydrogenaudio.org/forums/index.php?showtopic=16295&

http://www.graphpad.com/library/BiostatsSpecial/article_151.htm

 

True enough, one listener will almost definitely have too few trials to safely avoid both type 1 and type 2 errors.  But by convention, the probability of a type 2 error is kept under 20% while a type 1 error is kept under 5%.  And type 2 errors are not usually avoided by reducing the latter (alpha); they are addressed by sampling an adequate number of subjects*.  Since the simplest answer anyone would give is a ranking of 4 tracks, I was thinking that the OP would analyze the results by song, collapsing all the listeners' results into 3 better/worse judgments per song per subject, thereby minimizing both false positives and false negatives.

 

*This calculator says that 32 samples are needed, assuming H0 = 50%, H1 = 100%, SD = 1, alpha = 0.05 & power = 0.8.  So let's say 32.
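 

The same sort of power calculation can be sketched with an exact binomial, rather than whatever approximation that calculator uses; the 70% detection rate below is my own assumed effect size, not a figure from the thread:

from math import comb

def tail(k, n, p):
    # P(X >= k) for a binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, p1, alpha=0.05):
    # Smallest passing score that keeps the false-positive rate under alpha,
    k = next(k for k in range(n + 1) if tail(k, n, 0.5) <= alpha)
    # then the chance that a listener who is right p1 of the time reaches it
    return tail(k, n, p1)

# Assumed effect size: a listener who hears the difference 70% of the time.
# Power climbs (with a sawtooth, since scores are discrete) from roughly
# 0.15 at n=10 to past 0.8 by n=50 under these assumptions:
for n in (10, 20, 30, 50):
    print(n, round(power(n, 0.7), 2))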

post #42 of 60
Harrison > Monroe = Wilson > Taft, didn't like Taft's sound signature, seemed a tad bit bassy (I prefer neutrality)
post #43 of 60

Here are my guesses.  Despite its terrible compression, I used the Lady Gaga track because it has more treble, which is where I hear DAC differences more clearly.  Or rather, where I imagine I hear differences more clearly.

 

Spoiler:

jefferson >= jackson > lincoln > adams

post #44 of 60

nwavguy was nice enough to share his raw data with me for statistical analysis, so I thought I'd post the results of the test.   There are enough flaws in the test to diminish the validity of these results, but let's sweep those under the rug.

 

For the tough-minded, some of the flaws are behind this bar.

Spoiler:

 

1- Results are not independent.  People could see other people's guesses and be swayed in their opinions.

2- People answered very different questions, and so did not make the same comparisons: e.g. some ranked all the tracks, some chose only the best or the worst, etc.

3- There are not nearly enough subjects to reach a standard statistical power (β < 0.25), so a negative result has a very high chance of being a false negative.

4- Because of #3, choices were collapsed over songs to try to increase the sample size.  Again, the averaged results reflect choices over different tracks.

5- It's too late in the evening for me to finish this list.  Look, a pony.

 

Statistical tests were made of the likelihood of the null hypothesis (that all choices were random) as described by a binomial distribution of random discrete choices.  A probability of p < .05 means that the choices were most likely not random and that people really could distinguish between the DACs.

 

Question 1:

Did any of the three DACs get chosen as the 'favorite' more often than would be predicted by pure chance (33%)?


The Benchmark was chosen as the favorite 7 out of 12 times.
The cumulative probability of people making 7 or more equivalent choices under the binomial distribution is p = 0.0664.

The NuForce was chosen as the favorite 3 out of 12 times.  p = 0.819

The Behringer was chosen as the favorite 2 out of 12 times.  p = 0.946

 

Question 2: 

Did any of the tracks get chosen as the 'worst' more often than would be predicted by chance?  This test includes the surprise CD tracks, so the null hypothesis is a 25% chance of choosing any one source.

The Behringer was chosen as the worst 4 out of 12 times.  This has a probability of p=0.286

The original CD track was chosen as the worst 6 out of 12 times.  This has a cumulative probability of p=0.054
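 

These tail probabilities are easy to reproduce; a minimal sketch using only the counts quoted above (the code is mine, not part of the original analysis):

from math import comb

def binom_tail(k, n, p):
    # P(X >= k): chance of k or more picks landing on one source when
    # each of n independent picks does so with probability p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Question 1: favorite of three DACs, chance = 1/3
print(binom_tail(7, 12, 1/3))   # Benchmark: ~0.0664
print(binom_tail(3, 12, 1/3))   # NuForce:   ~0.819
print(binom_tail(2, 12, 1/3))   # Behringer: ~0.946

# Question 2: worst of four sources, chance = 1/4
print(binom_tail(6, 12, 1/4))   # CD track:  ~0.054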

In the two tests, multiple comparisons were made, so the significance threshold should be divided by the number of tests (Bonferroni correction).  Regardless, p=0.066 and p=0.054 are pretty close to significant and count as real trends in the data.

I think we could reasonably conclude that there are two strong, but not significant, trends:
people chose the Benchmark as the best DAC, and people chose the CD tracks as the worst-sounding track.  So people have combined gold and tin ears.

 


Edited by eucariote - 4/9/11 at 4:26pm
post #45 of 60
Quote:
Originally Posted by eucariote View Post

People chose the Benchmark as the best DAC, and people chose the CD tracks as the worst-sounding track.  So people have combined gold and tin ears.

 


Awesome closing words. Thanks for the analysis.