That's another good point.
It does, however, depend on what your actual goals are when performing the test. If you're looking for a real scientific answer, taking into account that some people may be consciously unsure whether they hear a difference, then a forced-choice A/B/X protocol will probably give you more accurate results. However, if you're doing market research for a new product, then what you probably really want to know is whether a significant number of test subjects clearly and consciously notice a difference. If that's your goal, then there's no reason to spend the extra effort to resolve small uncertainties.
I should also point out yet another factor that sometimes confounds these sorts of tests: it's called "self-selection bias". What that basically means is that, if you're asking for volunteers, your test population is limited to people who want to take your test. In simplest terms, most of the people who are already certain that a difference is - or is not - audible aren't going to bother to take it. Some may do so because they honestly want to contribute to science, and some others may believe they do or don't hear a difference and be looking for confirmation of that belief, but most simply won't be interested enough. As a result, your test sample does NOT represent a true cross-section of the general population; instead, it has been self-selected for "those who are interested but unsure - and consider the question important enough to show up".
One solution, which is often employed in serious testing, is to use truly random samples. You get a letter in the mail stating "your name has been chosen at random to take this test", or someone at the mall invites you to come into a back room and sample three new soft drinks. Another is what we might term "motivated selection". We have a pretty good idea who can run the mile the fastest - because there is a major incentive for fast runners to try out for sports events. We probably have no idea how fast the fastest human can run up five flights of stairs - because nobody has any motivation to find out. (And, if you were to try to find out, unless you offered a cash prize, nobody would show up to compete.)
One solution there IS to offer a prize of some sort.
For example, if you REALLY want to find out if ANYBODY can reliably tell the difference with those compressed files... offer a public contest, where people can submit their own samples for you to encode,
AND OFFER A PRIZE FOR ANYONE WHO CAN SUCCESSFULLY PROVE THAT THEY CAN RELIABLY TELL THE DIFFERENCE. The prize offers an incentive for people who are already convinced that they hear an obvious difference to participate. And, if nobody can hear a difference, then you won't end up having to pay out the prize anyway. I could imagine a booth at an audio show, promoting some new sort of compression. Visitors would be encouraged to bring in their own song, which would be compressed on the spot and inserted into a fancy "A/B test machine" where they could listen to the samples in random order. They would be offered a choice of several popular premium headphones - or they could bring their own. They would be offered a $500 prize if they could tell which samples were compressed and which ones weren't at least 18 times out of 20. I suspect you'd get plenty of participants, and a negative result under those circumstances would be quite compelling.
(This might also be an interesting event to offer to raise interest for a local audio club.)
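In case anyone wonders where a criterion like 18 out of 20 comes from, it's just binomial arithmetic. Here's a quick Python sketch (purely illustrative, the function name is my own):

```python
from math import comb

def p_value_at_least(k: int, n: int) -> float:
    """One-sided binomial p-value: the chance of getting k or more
    correct out of n trials by guessing alone (50/50 per trial)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(p_value_at_least(18, 20))  # ~0.0002 - roughly 1 in 5000 by pure guessing
```

In other words, a pure guesser would pass that contest about once in 5000 attempts, so the $500 prize is quite safe unless somebody genuinely hears a difference.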
Another point I don't really see discussed much is the difficulty of the task involved in these listening tests. When I do them, I'm usually not sure whether I hear a difference. It's not that I'm sure "they sound the same" or "I definitely hear X difference"; it's more like "I'm not really noticing a clear difference" or "I think I might have noticed X difference, but I'm not sure". This can be partly remedied by a forced-choice ABX test, where the listener has to guess when unsure. If the listener scores at a statistically above-chance level, that suggests they likely do notice a difference, but as far as I know, the stats alone won't answer these questions:
- What differences do they notice?
- How consistently do they notice those differences? (small differences likely won't be noticed anywhere near 100% of trials, and may not be much above 50% - see the sketch below)
- How big are the differences? (effect size)
These are all important questions from a practical standpoint.
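For the second question, one crude but common approach is a correction-for-guessing model: assume that on each trial the listener either truly detects the difference or guesses at random, so P(correct) = d + (1 - d)/2, and solve for d. That turns a raw ABX score into an estimated per-trial detection rate. A minimal sketch, assuming that simple all-or-nothing model (which real listening almost certainly violates):

```python
def detection_rate(k: int, n: int) -> float:
    """Estimate the per-trial detection rate d from an ABX score,
    assuming P(correct) = d + (1 - d)/2, i.e. d = 2*(k/n) - 1.
    Clipped at zero, since a negative rate isn't meaningful."""
    return max(0.0, 2 * (k / n) - 1)

print(detection_rate(14, 20))  # 70% correct -> detects on only ~40% of trials
print(detection_rate(18, 20))  # 90% correct -> detects on ~80% of trials
```

This illustrates the point in the list above: a listener scoring "only" 70% may still genuinely hear something, just nowhere near every trial. It says nothing about the first and third questions, though; which differences are heard, and how big they are, still require the listener's own descriptions and some measurement of the actual signals.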