Erm, I still don't understand the objections. Some people seem to think that it's really difficult to set up a "proper" test. Environment, moods, etc. need to be controlled for. OK, I get that, but two points need to be made:
1. With a sufficiently large sample and random allocation, group differences average out. You'll have about the same number of golden-eared people, the same number of depressed people, the same number of people tested at each time of day, etc. Sure, matching for every conceivable confound is better, but it's unnecessary. Very, very few scientists do that. If you reject a DBT on the grounds that some variables weren't matched, you have to reject lots and lots of scientific research.
2. If the differences between two systems are really THAT difficult to discern... then who cares if there "really is" a difference? I mean, then there'd be "practically" no difference, even if there is technically a difference... one that can only be discerned in a dark, silent room, on a cool day, by a person in a perfectly calm mood wearing a felt hat. I care about the truth of the matter as much as anyone else, but our primary concern is whether or not it's worth it to pay for cables etc., isn't it? It's a practical consideration.
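
For anyone who wants to see point 1 in action, here's a rough Python sketch. It randomly splits listeners into two groups and measures how far apart the groups end up in their share of "golden-eared" people. The 10% trait rate, the group sizes, and the function names are all numbers and labels I made up for illustration, not anything from a real test.

```python
import random

def allocation_imbalance(n, trait_rate=0.1, seed=0):
    """Randomly split n listeners into two equal groups and return the
    absolute difference in the share of trait carriers between them.
    trait_rate (share of "golden-eared" listeners) is an assumed value."""
    rng = random.Random(seed)
    # Each listener independently has the trait with probability trait_rate.
    has_trait = [rng.random() < trait_rate for _ in range(n)]
    # Random allocation: shuffle, then cut the list in half.
    rng.shuffle(has_trait)
    half = n // 2
    group_a, group_b = has_trait[:half], has_trait[half:]
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

def mean_imbalance(n, trials=200, trait_rate=0.1):
    """Average the imbalance over many simulated allocations."""
    total = sum(allocation_imbalance(n, trait_rate, seed=s) for s in range(trials))
    return total / trials

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, round(mean_imbalance(n), 3))
```

The average gap between the groups should shrink as the sample gets bigger, and the same logic applies to mood, time of day, or any other confound you didn't explicitly match.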