You bring up an excellent point - and one which many people seem to ignore.
Various sorts of blind tests can largely eliminate the effects of an expectation bias to hear a difference if there isn't any.
However, it's impossible to completely eliminate an expectation bias to NOT hear a difference.
It's quite possible that people become less likely to hear or report subtle differences if they don't expect a difference to be present.
There is also a widely known tendency for humans to respond to peer pressure when publicly reporting their experiences.
And there are even more interesting and subtle possibilities for error.
For example, we humans have a negative reaction to "failed expectations".
We tend to get frustrated when our expectations aren't met.
So, for example, someone who is expecting to hear "a big obvious difference", and fails to hear an obvious difference, may be less likely to notice a subtle difference.
(Because, after being frustrated at not hearing the obvious difference they expected, they are less carefully focused on noticing subtle ones.)
There are ways in which some of this COULD be tested statistically... if anyone were willing to bother.
Here's one suggestion for how to do so.
(In order to produce valid test results you would want a large number of test subjects to take the test.)
Test files could be made up with known flaws - perhaps different amounts of deliberately added noise or distortion.
The basic test procedure would be to run a bunch of trials to determine at what level each test subject could reliably detect and report the presence of the distortion.
HOWEVER, the test would be run multiple times, with different groups of test subjects, with each group subjected to a DIFFERENT EXPECTATION BIAS.
(Using some sort of pretext, perhaps by being told that something else was being tested, one group would EXPECT the files to be different,
one group would EXPECT them NOT to be different, and a third group would have no particular expectation either way - they would be told that some files might be different.)
It would be VERY interesting to see how the "ability to notice and report a difference" would differ between the neutral group and the two groups with "pre-loaded biases".
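As a rough illustration of the statistics involved, here is a sketch (in Python, with entirely invented counts) of how the three groups' detection rates might be compared using a chi-square test of independence:

```python
# Hypothetical sketch: comparing "detected a difference" rates across the
# three expectation groups with a chi-square test of independence.
# All counts below are invented for illustration, NOT real data.

def chi_square(table):
    """Chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: [detected, did not detect] for each group of 50 (invented numbers).
counts = [
    [38, 12],  # primed to EXPECT a difference
    [19, 31],  # primed to expect NO difference
    [27, 23],  # neutral group
]

stat = chi_square(counts)
# df = (rows - 1) * (cols - 1) = 2; critical value at alpha = 0.05 is 5.991
print(f"chi-square = {stat:.2f}, significant: {stat > 5.991}")
```

With 2 degrees of freedom, a statistic above 5.991 would suggest the groups' detection rates really do differ, i.e., that the induced expectations changed what people reported.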
There is also another sort of bias which needs to be accounted for - and which is often used to major advantage in group situations: peer pressure.
Put someone in a room full of people, and ask people to "raise their hands if they hear a difference".
As soon as a few people raise their hands, it creates a desire to "raise your hands and become part of the group".
This both biases people to raise their hands, even if they don't hear a difference, and actually creates a bias to WANT and EXPECT to hear a difference.
And, in the exact converse, place someone in a room full of skeptics, most of whom don't raise their hands, and there is a bias NOT to "raise your hand and go against the group".
(Anyone who runs demonstrations knows how effective it is to place a few shills in the room to raise their hands at the appropriate time and "get the ball rolling".)
This effect is widely known... and described in many textbooks on the subject... for example, Cialdini's textbook "Influence", which is course material at Harvard Business School.
Both of these effects are well known... and both need to be accounted for.
The "group effect" can be accounted for by doing the tests in isolation... where each person takes the test separately, reports their results separately, and is NOT allowed to see other results until after the total is tallied.
Note how this is the exact OPPOSITE of running an online study where everyone gets to see a running total of the results their peers have already turned in.
When you do that you are introducing TWO distinct problems:
- you are introducing an EXPECTATION in each new subject to experience what the majority of previous subjects have already reported
- you are creating peer pressure to WANT to both experience and report results similar to what most others have already reported
I might also suggest an interesting way to test for that last sort of bias... which is simply to create a phony bias and see how it affects the results.
The way to do that is relatively simple...
Create some sort of fair test and present it to three groups of test subjects; you could use BigShot's test of "which lossy compressed files are audible".
(The only requirement is that the range of differences is wide enough that it is unlikely to be "obvious to everyone".)
One group is told that "fifty people have already taken the test, and 92% of them heard an obvious difference"...
(You have now created both an expectation bias and a peer pressure bias in that group to expect and want to hear a difference.)
The other group is told "fifty people have already taken the test, and the results were statistically random"...
(You have now created both an expectation bias and a peer pressure bias in that group to expect and want to NOT hear a difference.)
The third group is told that they are the first ones to take the test - and they won't get to see the results tallied until their results are all turned in.
(This group is truly neutral in terms of bias.... except, of course, for any biases they may already have.)
If the three groups' results are significantly different, then you may infer that those differences were due to the induced bias.
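For a pairwise comparison, one simple sketch (again with invented numbers) is a pooled two-proportion z-test between one of the biased groups and the neutral group:

```python
# Hypothetical sketch: two-proportion z-test comparing the group primed to
# EXPECT a difference against the neutral group.  Counts are invented.
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two proportions (pooled SE)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# "Heard a difference" counts (invented): 42/50 in the primed group
# versus 27/50 in the neutral group.
z = two_proportion_z(42, 50, 27, 50)
print(f"z = {z:.2f}, significant at 5%: {abs(z) > 1.96}")
```

A |z| above 1.96 would be significant at the usual 5% level; with these made-up counts the primed group's higher rate would not plausibly be chance, which is exactly the fingerprint of the phony bias working.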
Yep, I and many others have had the same experience. It's amazing how we can hear clear and consistent differences between gear, and they're consistent with what other people report, yet those differences seem to disappear when doing controlled blind testing.
Matching volumes is important to avoid something a bit louder seeming more 'dynamic' or whatever.
Matching music segments is important to avoid applying perception to different raw material and getting different perceptual results for that reason alone (e.g., one segment has much different bass content than another).
Minimizing switching time is important to reduce the effects of rapidly fading sensory memory.
Blinding is important to reduce the effects of perceiving differences that we expect to perceive; but we can also fail to perceive differences because we expect things to sound the same, and blinding doesn't solve that problem at all (i.e., if you expect things to sound the same, you may perceive them that way without really trying to find differences, simply guess in the trials, and produce a null result because it was expected).
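On the "just guess in the trials" point: a null result from guessing is exactly what the binomial statistics predict. A small sketch (trial counts chosen only for illustration) of the chance that a pure guesser clears a typical ABX pass mark:

```python
# Hypothetical sketch: in an ABX-style blind test, a listener who just
# guesses is right about 50% of the time, so results are judged against
# the binomial distribution.  This computes the probability of getting
# at least k correct out of n trials by pure guessing.
from math import comb

def guess_tail_probability(k, n):
    """P(at least k correct out of n trials) when each guess is 50/50."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Example: 12 or more correct out of 16 trials by chance alone.
p = guess_tail_probability(12, 16)
print(f"P(>=12/16 by guessing) = {p:.3f}")
```

Since that chance comes out under 5%, 12/16 is commonly treated as evidence of a real audible difference; a listener who guesses because they expect "no difference" will land near 8/16 and produce exactly the null result they expected.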
I still think all listening tests are subject to problems related to 'trusting our ears' and 'trusting our memories', and results of one type of listening test may not necessarily generalize to other listening situations, but I'm inclined to think that controlled blind tests should be helpful in ruling out the possibility of large differences caused by expectation of large differences.