Quote:
Originally Posted by nick_charles
Yep, you could do some preliminary DBTs such as level differences, distortion and so on and use that to weed out the merely average listeners..
Yes, that's a good idea. It would be useful in public tests. But in private tests, it depends on whether our friends are OK with being left out, especially when they claim to hear the difference.
Quote:
Originally Posted by wavoman
1. Many experiments reported in threads here -- and many discussions here -- do in fact suggest averaging results across the sample subjects. I am not tilting at windmills. And most experiments I have read do not re-test the winners.
Some experiments do indeed use incorrect statistical analysis. The most common error (in the Prism paper about the audibility of CD pressing quality, for example) is to present individual results, then pick the most significant among them and present them as successful because they are above the threshold.
Actually, if the threshold is one chance in 20 of getting a false positive, and you test 20 subjects, and one of them is positive... well, you see what I mean.
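To put a number on it, here is a quick sketch (plain Python, my own illustration, not from any of the papers discussed) of how likely it is that at least one of 20 purely guessing subjects clears a 1-in-20 threshold:

```python
# Chance that at least one of 20 at-chance subjects passes a 5% threshold.
p_threshold = 0.05   # per-subject false-positive probability
n_subjects = 20

p_at_least_one = 1 - (1 - p_threshold) ** n_subjects
print(f"{p_at_least_one:.2f}")  # ~0.64: one "positive" subject is no surprise
```

With 20 subjects, a single "significant" individual result is roughly what pure chance predicts.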
The strangest example I've seen was an AES presentation about high-definition digital audio. The authors used such a complicated and obscure statistical model that they ended up proving statistically that group A scored better than group B! Which could be seen immediately by reading their scores... both of which were way below the significance threshold!
Quote:
Originally Posted by wavoman
2. Standard A/B/X protocol does not allow switching/listening over and over before making a choice. X is picked at random from A or B in each trial. Subjects answer, and the next trial starts. I think this is a terrible protocol.
In this case, I agree. When I set up an ABX test, I let the subjects have complete freedom in the choice of what they want to listen to: in each trial, A, B and X can be freely listened to, as many times as they want, in the order they like, for as long as they need.
About correcting previous answers, it depends on the protocol. In a protocol where the solutions are only revealed after all the trials are done, I let people go back and change their previous answers. This is especially useful if someone discovers, during the test, a detail that makes the correct source easy to identify: they can start over with much more confidence.
However, I personally prefer, by far, having my answer checked immediately after each trial. I need to know if I am right or wrong, so that if I am wrong, I can compare A and B more carefully, in order not to make any more mistakes.
This way of testing imposes two requirements: I can't correct my previous answers, since the solutions have already been revealed to me, and the total number of trials must be decided in advance and respected, in order to avoid "probability picking".
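"Probability picking" is what statisticians call optional stopping. A small simulation (my own sketch, not part of any protocol described here) shows how checking the running score after every trial, and stopping as soon as it looks significant, inflates a pure guesser's false-positive rate:

```python
import random
from math import comb

def binom_p(k, n):
    """One-sided p-value: P(>= k correct out of n) under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def guesser_passes(max_trials=15, alpha=0.05, peek=True):
    correct = 0
    for n in range(1, max_trials + 1):
        correct += random.random() < 0.5       # the guesser flips a coin
        if peek and binom_p(correct, n) <= alpha:
            return True                        # stops on a lucky streak
    return binom_p(correct, max_trials) <= alpha

runs = 100_000
print(sum(guesser_passes(peek=True) for _ in range(runs)) / runs)   # ~0.08, above alpha
print(sum(guesser_passes(peek=False) for _ in range(runs)) / runs)  # ~0.02, below alpha
```

Fixing the number of trials in advance keeps the guesser's passing probability at or below the threshold; peeking pushes it well above.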
If a listener seems more at ease with other protocols, like A/B preference, I see no problem with it, as long as there is randomization and we can estimate the probability of a false positive.
For me, the test must disturb the listener as little as possible. Otherwise, the protocol always gets blamed in case of failure.
In the tests I have run with some forumers, the question of averaging across the subjects was a very difficult one. On the one hand, if we sum everybody's results, we can prove a difference otherwise unseen, thanks to the statistical weight of all the answers combined. On the other hand, since the listeners are rather untrained, there is a high probability that one or two listeners score well and the others don't. Summing the answers then leads to a failure, even though those listeners actually hear the difference.
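To illustrate the dilution (with numbers of my own choosing, not from any of our actual tests): a single listener scoring 14/15 is individually very significant, but pooled with three at-chance listeners, the group result can fail a 1% threshold:

```python
from math import comb

def binom_p(k, n):
    """One-sided p-value: P(>= k correct out of n) under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(binom_p(14, 15))              # one good listener alone: ~0.0005
print(binom_p(14 + 8 + 8 + 8, 60))  # pooled with three 8/15 guessers: ~0.026
```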
That's why I'd rather first define a target probability of false positive, according to the hypothesis under test: for example 1% for amplifiers, or 0.1% for interconnects, because "extraordinary claims need extraordinary evidence".
Then I take the maximum number of listeners and divide my target probability by this number. This gives me, roughly, the individual threshold such that the probability that at least one of them (to be precise, exactly one of them, but it's nearly the same) scores a false positive matches my target.
Then I estimate the minimum number of trials needed so that this per-listener threshold can still be met with one error from the listener. I thus give each of them the right to make at most one mistake.
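A sketch of that computation (my own rendering of the procedure just described, with illustrative numbers): with a 1% overall target shared among, say, 5 listeners, each listener is tested at 0.2%, and we search for the smallest trial count at which "at most one error" is still significant:

```python
from math import comb

def binom_p(k, n):
    """One-sided p-value: P(>= k correct out of n) under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

target = 0.01                # overall false-positive probability (amplifiers)
listeners = 5                # assumed number of listeners, for illustration
alpha = target / listeners   # per-listener threshold

# Smallest n where n-1 correct answers out of n (one mistake) still passes.
n = 1
while binom_p(n - 1, n) > alpha:
    n += 1
print(n, binom_p(n - 1, n))  # 13 trials: 12/13 gives p ~ 0.0017 <= 0.002
```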
Then there is still a risk of biasing the statistics by repeating ABXes over and over until the right score is met. So I require any listener to first score a modest success, like 6/6, in order to be allowed to proceed to the real ABX. In practice, at our last meeting in Paris, we agreed instead that the 15-trial ABX would be divided into two parts: the first 7 trials, then the rest. If there is more than one mistake after 7 trials, the test ends, and another listener, group of listeners, or device under test can proceed. If there are fewer than two errors, the ABX goes on until the 15 trials are done.
I did not check whether it was as good as my method at avoiding repetition bias, but I was not the only one organising the test.
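For what it's worth, a guesser's chance of passing that two-stage design is easy to compute, if we assume the final criterion is at most one error over all 15 trials (the post above doesn't fix the final score, so that criterion, matching the "one mistake at most" rule, is my assumption):

```python
from math import comb

# Fixed design: at most one error in 15 trials.
p_fixed = sum(comb(15, k) for k in (14, 15)) / 2 ** 15

# Two-stage design: continue past trial 7 only with at most one error so far,
# then still require at most one error overall. Enumerate a guesser's paths.
p_two_stage = 0.0
for e1 in (0, 1):                  # errors in the first 7 trials
    p1 = comb(7, e1) / 2 ** 7
    for e2 in range(2 - e1):       # errors still allowed in the last 8 trials
        p_two_stage += p1 * comb(8, e2) / 2 ** 8

print(p_fixed, p_two_stage)  # both 16/32768 ~ 0.00049
```

Under that assumption, the early gate doesn't change the false-positive probability at all, since any run that can still pass overall necessarily had at most one error in its first 7 trials; it only saves time on hopeless runs. By itself, though, it doesn't address the bias from repeating the whole procedure.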