That's actually the standard criterion for statistical hypothesis testing.
Equations and numbers. :)
That jogs the memory. I think I'm going to get the hang of how to do this stuff again.
Looking at both images at the same time, you concluded you preferred B. Say I am the image store and you choose to buy B to take home and enjoy. You pay the money and I slip B into the sack and off you go. At home, you take it out. As it happens, I accidentally gave you A. How will you ever know if you can't tell the difference when looking at them separately?
If there were no swap delay, anyone who could see the differences side by side would get it right 100% of the time. The long delay just makes this test silly, because it's like listening to one headphone, rapidly taking it off your ears, and then plugging in the next one to find the differences. This is not standard ABX at all.
So do you think the ABX/blind tests I linked to at the start of this thread are properly executed?
To me, delay is an issue that depends on how well you think you can remember sound. If it's familiar test tracks and you're listening for specifics, I'd say you can go a long time between switches. If you are unfamiliar with the music, then the less time the better.
In any case, if the differences really were as audiophiles describe them, hearing them should be easy.
Not only that, but you only get to see A and B at the very beginning and cannot refer back to A and B during the test. So you get to see A and B at the beginning (better commit them perfectly to memory) and then get to see 20 versions of X one at a time for the duration of the test. A proper ABX test would allow you to refer back to A and B at any time that you want.
An ABX test should be designed so that you are testing whether the subject can detect a difference and properly match X with A or B. The test should be designed to minimize other factors that may influence the test.
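To put a number on "can detect a difference", here is a rough sketch of how a 20-trial ABX run like the one described above could be scored against pure guessing. The function names are my own, and the 0.05 cutoff is an assumption (the usual convention, not something stated in the test itself). It only uses the standard library.

```python
from math import comb
from typing import Optional


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` matches out of `trials`
    when every answer is a coin flip (p = 0.5 per trial)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # One-sided binomial tail: sum of C(trials, k) for k >= correct, over 2^trials.
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def min_correct_for_significance(trials: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest score whose guessing probability is at or below `alpha`.
    Returns None if even a perfect score cannot reach it (too few trials)."""
    for correct in range(trials + 1):
        if abx_p_value(correct, trials) <= alpha:
            return correct
    return None


# Assumed example: 20 presentations of X, 0.05 cutoff.
print(min_correct_for_significance(20))   # 15
print(round(abx_p_value(15, 20), 4))      # 0.0207
```

So under these assumptions you would need 15 of 20 right before guessing becomes an unlikely explanation. Note that this only tells you how to score the answers. It says nothing about whether the test itself was fair to the subject, which is the design question.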
The ABX test by Sieveking introduces a source of error due to forcing you to memorize the colors of A and B for the duration of the test. The test becomes more of a psychology experiment on color memory than a proper ABX test to show whether you can or cannot detect a difference between A and B.
The Sieveking ABX test is an example of how not to design a proper ABX test.
There is always going to be some psychology-experiment component to an ABX test, relating to the memory of audio (or in this case color) and other psychological factors. The goal when designing an ABX experiment should be to minimize those other factors, not maximize them. The Sieveking test chooses to maximize the influence of color memory rather than minimize it.