There's fault to be found on both sides in the ABX debate.
On the one hand you can point to
this discussion by David Clark, inventor of the ABX switcher, posted in response to a critique of his methodology. He claims that his approach is magically immune to
Type II errors (and it's a little worrying that the term seems novel to him) and states, with
colossal disingenuousness, "we never formally conclude that any difference is inaudible". Yeah ... right ... He goes on to display difficulty in identifying what defines 'a trial' and attempts to inflate his numbers accordingly: "our tests
do use a very large number of trials, we simply do not report each of them individually" (ORLY?). In an attempt to tug on our heartstrings, he bewails the difficulties that are faced: "large numbers of qualified listeners are hard to find." Sorry, but science doesn't award sympathy votes because your experiment was too hard to do properly. All he manages to do is demonstrate why engineers should stick to building things and leave the science to people who've been properly trained.
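Clark's dismissal of Type II errors is easy to make concrete with a power calculation. The sketch below is plain Python (stdlib only); the 70% hit rate and the trial counts are illustrative assumptions of mine, not anyone's published numbers. It finds the score needed to reach p < 0.05 on an n-trial ABX run, then asks how often a listener who genuinely hears the difference would still fall short:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the exact upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def type_ii_error(n, p_true, alpha=0.05):
    """Probability that an ABX run of n trials fails to reach
    significance at level alpha, for a listener whose true hit
    rate is p_true (chance performance is 0.5)."""
    # Smallest hit count that would be declared significant:
    k_crit = next(k for k in range(n + 1) if binom_sf(k, n, 0.5) <= alpha)
    # Type II error: probability of scoring below that threshold anyway.
    return 1 - binom_sf(k_crit, n, p_true)

# An assumed listener who genuinely discriminates 70% of the time:
for n in (10, 16, 50, 100):
    print(n, round(type_ii_error(n, 0.7), 3))
```

With only 10 trials, such a listener fails to reach significance roughly 85% of the time; push the run to 100 trials and the miss rate collapses to well under 1%. This is exactly why "we simply do not report each of them individually" is not a substitute for reporting trial counts.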
But while David Clark just has me rolling my eyes, John Atkinson can get me spitting in fury with pieces like
this steaming pile of dung, in which he states that blind testing "represents the actual point where two opposed faiths clash," then plays the 'teach the controversy' card by moaning about those who "wrap themselves in the flag of 'objectivity'". If Atkinson wants to preach faith-based audio then he needs to come clean about it rather than pretending to any rationality, and plant his flag in the camp of the flying spaghetti monster.
There
are areas in which the classic ABX test could be critiqued; the problem is that the critiques exist only as hints and outlines.
The first is the 'golden ears' approach, which claims that some people have better acuity, either through natural variation or training, and are able to hear differences that others can't. The notion that humans vary in their perceptual abilities is hardly controversial - I have no doubt that a 16yr-old can hear, see, taste and smell better than me in my decrepitude. But if there's a difference, the first thing you should want to do is measure it. For instance, we have claims like
this (from John Atkinson again) that a fellow editor had a 100% success rate on a blind test and could even identify the brand of amplifier 80% of the time. What's his reaction? Is it, 'Golly, this is an interesting result, let's repeat it with a lot more trials so we can be sure, and then do it with a range of other people to see if we can estimate how this variance is distributed'? No. This tiny result is taken as a 'win' for the home team and trumpeted from his editorial throne without thought of further exploration. Not even remotely good enough, John. Of course, one suspects that hifi magazines have to be careful with the whole 'golden ears' thing, as their primary function is to sell merchandise to consumers, and a result saying that a difference audible to some audio crackerjack may be imperceptible to the plebs with money in their pockets is probably not what they're looking for.
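Since the trial count behind that 100% score isn't reported, the numbers below are purely illustrative assumptions of mine, but they show why 'a lot more trials' matters. An exact binomial p-value for an n-trial forced-choice run under the chance hypothesis:

```python
from math import comb

def abx_p_value(hits, n):
    """One-sided exact binomial p-value: the probability of scoring
    at least `hits` out of `n` ABX trials by pure guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(hits, n + 1)) / 2**n

# A perfect score over a handful of trials is weaker evidence than a
# merely good score over many (trial counts here are hypothetical):
print(abx_p_value(5, 5))     # 1/32 ~ 0.031: barely clears p < 0.05
print(abx_p_value(60, 100))  # ~ 0.03: similar p-value, far more data
```

Five-for-five scrapes past the conventional threshold and could happen by luck about once in every 32 guessing runs; that's a reason to run the experiment again at length, not to declare victory in an editorial.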
The second, and far more interesting, approach is to question whether ABX is testing the right thing.
Modern psycho-physiology is founded on an interactionist framework, descended from early Gestalt ideas, in which we create models of the external world and continuously refine those models through integration of sensory stimuli. If I listen to an instrument I'm not hearing the fundamentals and overtones, the modulations and bilateral phase discrepancies; I'm hearing my interior model of what that instrument sounds like, and, subconsciously, I'm hearing how far the sensory stimuli differ from what my model tells me the instrument
should sound like. The catch is that we know our models are wrong, and so our perceptual systems engage in a constant process of refining them based on the data coming in. The difficulty is that there's no easy way to judge which data should be incorporated into the model and which should be rejected as malformed or erroneous. If you know what a clarinet sounds like through long experience and then get presented with a recording in which everything above 5kHz has been removed, your system should be robust enough to recognise that there's something wrong with the new data, so that you don't start expecting clarinets to sound dull and lifeless in the future.
Testing this integrate-or-reject mechanism is hard, not least because there are still gaping holes in our understanding of perceptual physiology. It's possible to pursue indirect methods, however, in which we attempt to measure how activation of error signals affects other psychological mechanisms. One such example (which has been trotted out by a variety of manufacturers) is described
here (Stereophile again, no surprise) in which a German psychologist apparently discovered marked differences in responses to a mood questionnaire between those listening to analog and digital systems in a blind test. This is certainly very interesting, but those looking for more details about this experiment will be frustrated. Searching for anything by Jürgen Ackermann yields a lot of articles on car steering, and there's a
Jürgen Ackermann listed as a practising psychologist in Frankfurt, but no sign of any audio publication and nothing on the AES site, even though this was (apparently) performed as part of a PhD thesis - it would be rare, to say the least, for a reputable university to award a PhD for work that was not worthy of publication, and rarer still for a student not to seek to get published. I'd certainly be very interested if anyone manages to track this down but ... 'Show me the money!' is the phrase that comes to mind. I should also note that searching for 'Jürgen Ackermann' brings up a 'Some results may have been removed under data protection law in Europe' flag on Google in the UK which is ... strange, though this may relate to someone else.
There's certainly work that
could be done outside the arena of ABX testing; it's just that no-one seems to be bothering to do it. Instead we have scraps and fragments that are accorded a significance far beyond their actual value. There are certainly interesting questions to ask, such as the significance of accuracy versus euphony - and what is euphonic, anyway? But you do actually have to do the experiment. The value of ABX testing is that, being fairly simple, it
has been done (and there are those who have done it properly instead of letting engineers kludge it). At the end of the day you have to go with the data that you've got. You always need to be open to new data, and it's always possible that new methods of testing will expose flaws in our underlying hypotheses, but until we actually
have those data we don't really have anything to consider.