Quote:
Originally Posted by Hirsch 
| Rather than communication, a still more accurate word would be interpretation... By running the test using DBT, you eliminate that alternate hypothesis, and your interpretation of your results as something that you actually heard is likely to be more accurate. This is DBT working the way that it should. |
True, but I don't quite understand applying the appellation "interpretation" to a protocol.
Quote:
| Again, why guess? If you’ve got known differences built into your test (positive controls), you can look at your data and tell just how many people are “golden ears” that can really distinguish known auditory differences, with real data to back it up. If those people who hear the known differences reliably can’t hear a difference between the test stimuli, it’s much more meaningful than testing a group of people who claim to hear differences, but whose actual auditory discriminatory capability is an unknown. |
Is it? Or, at least, is it meaningful to everybody? If we had a ton of golden ears fail a DBT of something like interconnects or high res, even in the presence of some quite insane positive control (like 1 degree changes in speaker position, or 0.1 dB Q=1 frequency response deviations, or what have you) - what exactly does it mean if we only establish inferiority of effect relative to that positive control?
I think such a result would be of great use to skeptical designers/manufacturers/listeners who wish to prioritize their investments - from a statistical point of view, it's clearly very plausible (and very useful) to establish a ranking of level of audibility for these sorts of things. But aren't the sorts of debates that bring up these DBT questions in the first place already trading at the limits of audibility to begin with? People can yammer all they want about how highres/interconnect changes/etc are such drastic night and day differences, but when you actually get down to testing them, well, the effect is so subtle that measurements may not pick it up, you need a sufficiently high quality system to tell the difference, etc.
Any positive control applied against such an effect in the DBT will do nothing more than place constraints on the scale of the effect, and those constraints can (and will) be freely dismissed by those advocating the existence of the effect, on the grounds that the positive control was far too audible to adequately constrain it. And of course, because people tend to overestimate how well they can hear even simple-sounding stuff, like +1 dB Q=1 EQ changes or 1% THD, I think that's going to be a really hard charge to dismiss.
I get the extremely strong notion that any serious attempt at a positive control is just going to provide more fodder for people whose minds are already made up, in the positive, for these sorts of audibility anyway. We already have Robert Harley of TAS lambasting Meyer/Moran for the fundamental reason that the results run so utterly and blatantly contradictory to (his) experience. I'm not disputing that positive controls are fundamentally a good idea - they are - just not for the problem we are talking about.
I guess I'm looking at this debate too single-mindedly from the eyes of a non-statistician. I'm not really interested in making persuasive (and accurate!) statements about audio quality from a position like this, because I just don't think that's possible. The questions about relating results to positive controls need to be answered, but I don't think it will do a thing to truly resolve these issues.
Quote:
| Incidentally, what I am saying is that a threshold shift in blind testing is a real possibility. Yet another reason to insure that people in a blind testing situation can make normal sensory discriminations (back to those positive control groups I keep harping about). |
Well, you assert it is a real possibility, using analogous proven results in other sensory perceptions, but the magnitude and detail of this threshold shift is entirely up in the air. How well quantified are sighted forms of bias, in comparison to this shift? Depending on the nature of the shift, it might not be a significant bias in a DBT. Again, I'm not disputing that these things can exist - but until compelling evidence is provided, I don't think it's logical to consider this bias to be universally important in all audio DBTs. Just like it's not logical to argue that price-induced observer bias is alone a compelling reason to discount sighted tests - of course it's there, but sometimes it's not significant or runs contrary to expectations, and the larger issues are more important.
Quote:
| Of course your hypothesis is a valid experimental hypothesis. However, you’re moving far from the original question. That is, the hypothesis you’re testing now is “Can people hear what they think they can”? So, the experiment is moving away from perceived differences in gear, encoding, or any other feature of the auditory stimulus, and moving into the realm of psychoacoustics. It’s an interesting question, but now you don’t even need unknown stimuli (the gear), because unknown stimuli are simply going to introduce random variance that can mask effects of interest. You’re going to want to use stimuli with known and measured characteristics, because you’re testing the relationship between what is expected and what is actually heard. To do that, you’re going to manipulate the differences between the stimuli, and see if the subject can track the known differences. |
The hypothesis is more like "Group G of audiophiles literally hears distortion effect X." I think the hypothesis is intrinsically tied to that effect X, and changing it - especially to couple the hypothesis to threshold tests that may be of a substantially higher magnitude of distortion - compromises the meaning of the results versus the questions surrounding effect X. That said, I do think your method would be much superior to ABX in finding exact threshold levels.
I agree that I'm changing the hypothesis a bit - at least, from how it is commonly stated. But I don't see my hypothesis ("group of audiophiles claim they hear effect but don't") as being substantially less meaningful than the "usual" hypothesis ("nobody can hear effect"). If one can prove that no more than a certain percentage of a test population has a certain property, but the entire test population believes they have that property - assuming the test population is evenly distributed - doesn't that put a damper on anybody else deciding they have that property? (without additional, and rightful, justification?)
I think this is the crux of the type II discussion here, and of my complaints about testing of the individual vs testing of the group. I agree that calculating pd is more or less a shot in a dark room, but said room is certainly not pitch black. To argue that pd should be low, for an effect that is asserted to be obvious and/or significant, presumes either some test sensitivity issue; or some sort of exceptionalism on the part of the test population. While you do bring up the possibility of the former, I simply cannot believe the latter to be true for most audiophile blind tests unless specific plausible proof is cited.
I guess what I'm saying is that I don't see how this sort of beta analysis - even post-hoc pd-based beta analysis - can be called "guesswork". In fact, given the problems I think are always going to exist in the interpretation of positive controls with this sort of testing, I daresay that post-hoc beta analysis is equally valid, if not superior. ABX testing is pretty straightforward to evaluate on a trial by trial basis. Either the listener didn't hear a difference and guessed (getting it right 50% of the time), or the listener heard the difference and got it right - and these probabilities match the meaning of pd exactly; the literal interpretation of the three possible trial states (heard it, guessed right, guessed wrong) matches the exact meaning of pd. In that context: what is so wrong about running the results through different values of pd and interpreting the meaning of the results? The value of pd - for an ABX test - has clear, well-defined, and comprehensive meaning that encapsulates all the potential negative biases of the test.
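To make "running the results through different values of pd" concrete, here is a rough Python sketch. It assumes the simple ABX model described above (a trial is detected with probability pd, otherwise it is a 50/50 guess) and a one-sided exact binomial test at alpha = 0.05. The function names and the 16-trial example are my own choices for illustration, not anything taken from a published protocol.

```python
from math import comb

def p_correct(pd):
    """Per-trial success probability: heard it (pd), else a coin flip."""
    return pd + (1.0 - pd) * 0.5

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, alpha=0.05):
    """Smallest score k such that pure guessing reaches k with prob <= alpha."""
    for k in range(n + 2):
        if binom_tail(n, k, 0.5) <= alpha:
            return k

def beta(n, pd, alpha=0.05):
    """Type II error: chance of missing the critical score when the
    true detection probability is pd."""
    k = critical_value(n, alpha)
    return 1.0 - binom_tail(n, k, p_correct(pd))

if __name__ == "__main__":
    n = 16  # hypothetical number of ABX trials
    for pd in (0.05, 0.1, 0.25, 0.5, 0.75):
        print(f"pd={pd:<5} beta={beta(n, pd):.3f}")
```

The loop at the bottom just tabulates beta against a few assumed pd values for one fixed test length; the argument is then about which of those pd values are plausible for the population tested.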
Let's say that beta<0.05 on some results can be achieved with pd=0.001 (and therefore will yield even lower values of beta for larger pd). This requires a success fraction very close to 0.5 (with many many trials) and I think would represent a substantial statistical result. But what if one objects and claims that pd was in fact below 0.001? It's certainly possible, and if so would increase beta beyond the 0.05 level. But is it plausible? For many issues - but of course not all issues - I think that one can confidently state that such an assertion of a low pd is simply not what exists in reality, based on knowledge of the test population. I'd call this sort of line of reasoning post-hoc, but I don't really see the issue.
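For a sense of scale on "many many trials", here is a back-of-the-envelope sketch using the usual normal approximation for a one-sided binomial test. The assumptions are mine: the same pd model as above (per-trial success = 0.5 + pd/2), alpha = 0.05, and a target beta of 0.05. The function name is hypothetical and the result is only an order-of-magnitude estimate.

```python
from math import ceil, sqrt
from statistics import NormalDist

def trials_needed(pd, alpha=0.05, beta=0.05):
    """Approximate ABX trial count for a one-sided test to reach the
    given alpha and beta, if the true detection probability is pd."""
    p0 = 0.5                 # guessing
    p1 = 0.5 + pd / 2.0      # detect with prob pd, else guess
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    num = z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

if __name__ == "__main__":
    for pd in (0.5, 0.1, 0.01, 0.001):
        print(f"pd={pd:<6} trials ~ {trials_needed(pd)}")
```

Under these assumptions the pd=0.001 case comes out on the order of ten million trials, so the pd=0.001 example is best read as a limiting case rather than something a real listening panel would run.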
Quote:
| This was real data, and you’re absolutely correct that issues with type II error are going to be horrific. However, in a scientific setting, the data is the data. You can’t ignore it because it poses methodological issues that are horrific. You’ve got to deal with the issues instead. This is not just true in audio, but all areas of science where blind testing is used. I am aware of at least one proposed clinical trial that the FDA killed because the expected incidence of a possible side-effect was so low that even a large-scale study was not likely to have sufficient power for the FDA to make a decision based on the results. |
A very interesting example, but wouldn't such a situation invalidate all testing, then, not just blind testing? I mean, you'd be down to case studies....
I guess that's your point: that some stuff is just unknowable?