Originally Posted by jaddie
The thing about trying to get definitive results out of an ABX test is that you have to get all variables under tight control, then you have to collect a lot of data from randomized subjects. This is not as easy as the whole Foobar/ABX concept might make it seem. We'd all like to run a couple of ABX tests, get the data and call it a day.
The problems I see with ABX testing of different files, with the goal of determining the audibility of different sample rates and bit depths, are:
1. Source material. Assuming we don't use digitally synthesized test signals, we need to sample analog first. There are three general methods: sample at the high rate and down-sample to obtain the low-rate file; sample at the low rate and up-sample to obtain the high-rate file; or sample the analog original simultaneously with two ADCs running at different rates but otherwise identical in every way possible. The problem with the first two is that they involve up- or down-sampling, which is a variable in itself (different algorithms, settings, etc.) and could have an impact. The problem with the third is that, of necessity, the lower-rate ADC will have a different anti-aliasing filter than the high-rate ADC, which could be considered either a variable or part of the fingerprint. What you end up with isn't quite a definitive sample-rate comparison, but a definitive comparison of the results of a specific ADC design running at different rates, which, when you get down to it, is not quite unassailable, but probably better than at least some of the prior art.
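Just to make that resampling variable concrete before I get to my own run-in with it: here's a minimal sketch of "method one" (record high-rate, then down-sample) done two equally defensible ways. This uses Python/scipy rather than iZotope, and the sample rates and Kaiser-window values are purely illustrative assumptions, not anyone's recommended settings.

```python
import numpy as np
from scipy.signal import resample_poly

# Stand-in for a 1-second high-rate capture at 96 kHz (a real test would use
# actual recorded program material, not noise).
rng = np.random.default_rng(0)
fs_high, fs_low = 96_000, 48_000
x_high = rng.standard_normal(fs_high)

# "Method one" from the quote: record high-rate, then down-sample.  Two
# different anti-alias filter choices (Kaiser windows of different sharpness),
# both perfectly reasonable, give two different "low-rate" files.
gentle = resample_poly(x_high, up=1, down=2, window=("kaiser", 5.0))
sharp  = resample_poly(x_high, up=1, down=2, window=("kaiser", 14.0))

# The gap between the two candidates is the SRC variable in question: it is
# baked into the test files before any listener ever hears them.
diff_rms   = np.sqrt(np.mean((gentle - sharp) ** 2))
signal_rms = np.sqrt(np.mean(gentle ** 2))
print(f"filter-choice difference: {diff_rms:.6f} RMS vs signal {signal_rms:.6f} RMS")
```

Both outputs are legitimate 48 kHz versions of the same capture, yet they aren't identical, so the SRC settings ride along into whatever ABX comparison gets built on top of them.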
I found out right away when I opened the iZotope SRC that "upsampling" was a procedure with plenty of choices right there, AND trade-offs between them, AND differing recommendations about how to use the settings. And then all over again with "downsampling", so the possible outcomes get pretty deep pretty quickly. I bolded and red-ed your line above, because it goes to just those points. I'm not sure, even, what an experimental design which *entirely* isolated ONLY the effects of sample rate would have to look like. As soon as one is playing back at different sample rates, the playback chain is of necessity different, with all the attendant possible effects in the elements of the playback chain. [As you address below]:
2. The second half of the chain also contains unknown elements. DAC performance differs significantly, which introduces another unknown variable. What happens after the DAC is also a bit unknown, especially when you're crowd-sourcing, but would end up contributing a bit of random statistical noise rather than a bias.
This is getting really speculative, but--there are meta-analysis techniques which *might* allow a [well-resourced] researcher to cross-match the results of many thousands of experiments on non-identical equipment, to try to sort out which variables are more likely to be responsible for which results. There would need to be a wide variety of well-designed test files--that part would have to be common across trials--which would include both test tones designed for the trials, and a wide variety of program material. The only thing that potentially helps the researcher get something out of this morass of data is the very large numbers of trials for each of the test files. And of course, there would need to be some demographic data (like age), and maybe even a short frequency-range hearing test worked into the front end of the game somehow.
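To show the shape of what I mean (and only the shape), here's a toy sketch: every crowd-sourced trial pooled into one table and fed to a plain logistic regression. Every column name, effect size, and data point below is invented for illustration; a real analysis would want mixed-effects models with per-listener and per-file terms, not this.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000  # pretend we crowd-sourced a couple of thousand trials

# Hypothetical pooled results table, one row per ABX trial.
trials = pd.DataFrame({
    "condition": rng.choice(["16_44", "24_96"], size=n),      # which file pair was tested
    "age":       rng.integers(18, 70, size=n),
    "hf_limit":  rng.uniform(10.0, 18.0, size=n),              # kHz, from a front-end hearing check
    "dac_class": rng.choice(["onboard", "external"], size=n),
})

# Simulated responses: mostly chance (0.5), with a small made-up bump for
# listeners with good HF hearing on the hi-res pair, so the model has
# something to find.
p_correct = 0.5 + 0.06 * ((trials["condition"] == "24_96") & (trials["hf_limit"] > 16.0))
trials["correct"] = (rng.random(n) < p_correct).astype(int)

# Plain logistic regression: probability of a correct answer as a function of
# test condition, listener demographics, and playback gear.
model = smf.logit("correct ~ C(condition) * hf_limit + age + C(dac_class)",
                  data=trials).fit(disp=False)
print(model.summary())
```

The point is just that with thousands of rows, terms like the condition coefficient start to separate "what was in the file" from "who listened, on what."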
So what we would need to do is set up tests for each of several conditions, collect data from several dozen trials from as many testers as we can get for each condition, then analyze what we have. Clearly, that's not a project undertaken lightly, though it's fascinating to consider the crowd-sourcing possibilities. If whoever was running the project could be funded well enough to generate test files that were carefully prepared, with the variables known, and then distribute them, we might actually have something here. But until then, I wouldn't put a whole lot of confidence in the data obtained from casual ABX testing, though it's clearly far more relevant than casual sighted A/B testing.
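As a back-of-the-envelope on the "several dozen trials" figure, here's a quick check (Python/scipy; the p < 0.05 cutoff is just the usual convention, nothing sacred) of how many correct answers a single listener needs before "guessing" stops being a plausible explanation for a run of forced-choice ABX trials.

```python
from scipy.stats import binomtest

# How many correct answers out of N forced-choice ABX trials before
# "just guessing" (p = 0.5) becomes an unlikely explanation (one-sided p < 0.05)?
for n_trials in (10, 16, 24, 36, 50):
    for n_correct in range(n_trials + 1):
        result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
        if result.pvalue < 0.05:
            print(f"{n_trials:3d} trials: at least {n_correct} correct "
                  f"(one-sided p = {result.pvalue:.3f})")
            break
```

That's per listener, per condition, and it says nothing yet about pooling across rigs and rooms; it just puts a floor under how many trials each tester has to sit through.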
Getting the design right would be an awful lot of work, as would connecting theory to methodology in a rigorous way. But, I have to say, if I were a journal editor looking at a submitted paper with this sort of innovative testing approach, and I thought it was done well (within the boundaries of what's possible), there's **no way** I would not accept it as the centerpiece of an upcoming issue.