Quote:
Originally Posted by Publius
Quite true. My way of thinking has come around significantly in the past several weeks on the matter - instead of seeing a DBT result as a fact, I see it as a method of communication - a way of taking my subjective perceptions, and grounding them in a statistical basis that has meaning to other people. It does not necessarily mean that something is or is not audible - it merely grounds subjective results in statistical meaning.
|
Actually, DBT is about experimental design, not statistics. This is an important distinction.
Quote:
That's true for negative results but I think not true in the general case. Clearly there is very little knowledge of type II error and how to handle it a priori. But besides that, I think a priori analysis is done all the time - by the ABX software, and not by the user. Users are pretty well trained to know that p<0.05 is considered "statistically valid" (assuming they keep the number of trials fixed). How is that not a power analysis? |
Just to bring those who might not know into the loop, statistically:
Type I error is the probability of obtaining a positive result when there is in fact no real difference, i.e., a false positive. You normally see this reported as alpha. So, when we say that alpha < 0.05, what we are really saying is that the odds are less than one in twenty that a result this strong would arise by chance alone.
Type II error is the probability of obtaining a negative result when there is in fact a real difference between the experimental conditions, i.e., a false negative. The statistic is known as beta. When you have a negative result, it is only meaningful in a statistical sense if you can report beta < 0.05.
The 0.05 level is of course subjective, but has become the standard used by science over the years.
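To make the alpha side concrete, here is a minimal sketch, in Python, of the arithmetic an ABX program does behind the scenes (my own illustration, not taken from any particular ABX tool): for a forced-choice test where pure guessing gives a 50% hit rate, the probability of a given score arising by chance is just a binomial tail.

```python
from math import comb

def abx_p_value(correct: int, trials: int, p_chance: float = 0.5) -> float:
    """Probability of getting `correct` or more right out of `trials`
    purely by guessing (the one-sided binomial tail, i.e. the p-value)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
        for k in range(correct, trials + 1)
    )

# e.g. 12 correct out of 16 trials: p is about 0.038, just under the 0.05 criterion
print(abx_p_value(12, 16))
```

That is why a score like 12/16 is commonly treated as just clearing the 0.05 bar.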
A power analysis is a calculation, performed before the actual test, that predicts the number of subjects (N) needed to achieve a target beta for a given, known effect size. The smaller the difference you are trying to detect, the larger N must be to reach that beta and to draw inferences from a negative result.
In science, negative results are often considered boring, and frequently not published. As a result, you rarely see beta reported, and many scientists don't even bother to do the a priori analysis needed to calculate N. Note that you need pilot data before you can even perform a power analysis. The more common method is to choose a large enough N so that if the result is not significant (p < 0.05), the experiment is considered a failure, and the lab moves on to the next study. Lazy in many ways, but beta is a royal pain to figure out, and often a waste of time.
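For readers who want to see what an a priori power analysis actually looks like, here is a rough sketch using the standard normal approximation for comparing a proportion against chance. It is my own illustration, not a prescription; the 80% power target and the single-listener framing are my assumptions.

```python
from math import sqrt, ceil
from statistics import NormalDist

def trials_needed(p_true, p_chance=0.5, alpha=0.05, power=0.80):
    """Approximate number of ABX trials needed so that a listener who truly
    answers correctly with probability p_true reaches significance at the
    given alpha with the given power (normal approximation to the binomial)."""
    z_a = NormalDist().inv_cdf(1 - alpha)   # one-sided test
    z_b = NormalDist().inv_cdf(power)
    num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p_true * (1 - p_true))
    return ceil((num / (p_true - p_chance)) ** 2)

# A listener who is right 70% of the time needs roughly 37 trials;
# one who is right only 55% of the time needs more than 600.
print(trials_needed(0.70), trials_needed(0.55))
```

The numbers make the point above: the subtler the ability you are trying to confirm or rule out, the faster the required number of trials explodes.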
Quote:
Strictly speaking... so is a positive DBT result. A positive result is not a certain guarantee of audibility (in fact for p<0.05 there is a rather significant probability that it's due to chance!). That we treat the positive results as proof of audibility is "subjective" in the same sense as your example with negative results. It merely involves different probability levels and a subtle logical shift (reject null hypothesis vs failure to reject null hypothesis, as opposed to accept proposition A or accept proposition B). But it's unsupported by the statistics all the same. |
No, although there is always the possibility of experimental error, the entire purpose of inferential statistics is to produce an objective estimate of the probability of that error. So, if you've gotten to the point where you can calculate alpha and beta, you've got a pretty good idea of whether or not your effect is significant (which of course does not guarantee that it's real). All you can do is estimate the probability that you're making an error. The only way to have 100% certainty is to measure the entire population of interest.
You need to keep the experimental and the statistical hypotheses separate, for they are in fact different. The experimental hypothesis is that two [insert audio item here] sound different. The null hypothesis is a statistical concept. If the null hypothesis is rejected and the difference is statistically significant, we still have to interpret the data within the context of the experimental design. GIGO. A friend of mine is color blind, and cannot distinguish blue from green. If I present obvious (to me) color differences, he won't see them. If he's the subject of a visual discrimination test, and no differences between blue and green are reported, it doesn't mean that blue is the same as green. It means that I haven't designed my experiment to account for a possible inability to distinguish color.
Here's one to think about. Suppose that the auditory system is not independent of the visual system, and that visual cues can affect auditory thresholds (and in the real world, there is a lot of data to support this). When you do a DBT or ABX, you are removing some of the cues a person uses to interpret auditory information...and may be changing an auditory threshold. So, in the larger scheme, a DBT or ABX may be measuring an artificial situation that does not reflect how we interpret auditory data in the real world. If so, we've designed the experiment poorly, and the data may not mean what we would normally expect.
Quote:
Honestly, I don't really believe such an interpretation of a negative test is all that subjective. Just like one can come up with good, logical, persuasive reasons to treat a positive DBT result as proof of audibility, even though it is unsupported by the math, one can also come up with good, logical, persuasive reasons to treat a negative DBT result as proof of inaudibility. The persuasiveness of such arguments obviously rests on a subjective basis, but the arguments themselves can be quite logically debated - except that the debate occurs outside the scope of the DBT itself. To some degree I think it requires discussion completely outside the realm of listening. That doesn't mean the discussion is subjective! |
I think we're going to have to disagree here. We can treat a positive result as meaningful because we can calculate alpha. Our conclusions may be in error, but we know the probability that we are making that error through an objective mathematical analysis. However, if we cannot calculate beta, we can't draw the same conclusions about probability of error with a negative result. While we may believe that the negative result is meaningful, there is no objective error calculation to back it up.
Quote:
In other words, if you get a gaggle of audiophiles together, who self-purport to clearly hear an audibility in a particular room, and then you throw them into an ABX and they fall flat on their ass at 60/120 or something like that, that is not a meaningless result. There is a lot of meaning to it. It may not intrinsically mean the audibility doesn't exist, but it rules out a wide variety of other possibilities - chief among them the idea that it is a "clear" audibility, or that a large number and/or majority of audiophiles are capable of hearing the effect, etc. And, of course, it may have quite a lot of meaning to the people conducting the test: it places pretty tight bounds on how many people out of the gaggle actually could hear the difference, those people who scored highly can retake their test, etc. |
Not necessarily. We need to ensure that those "clear" differences remain audible in the test system. It goes back to design principles. If we're going to run an experiment, we don't care what people think they can hear. We need to be able to measure what they actually can hear, and ensure that results from our test system generalize in some way to the real world. The best test we could run would be a truly blind test where the subject had no idea that a test was occurring. If we could sneak into someone's house and switch out his cables with others that appear identical but may be made differently, and that person then noted in casual conversation over the next several days or weeks that his system sounded particularly good or bad, that might be the ideal test, although it can't really happen.
Quote:
I agree with the jist you're saying with all of these things, but - especially in the case of self-selecting audiophiles - I think you are exaggerating the importance of calibrating test sensitivity. Such controls are most certainly necessary to establish the absolute importance of an effect, against other baseline effects. But for establishing relative importance - say, against effects that the listeners themselves claim they hear, for instance? - test sensitivity is less necessary. |
I cannot overemphasize the importance of controls if you're trying to interpret an experiment. Negative controls are critical to establish baseline variance. You absolutely have to know, in any experiment that is going to claim a scientific basis, what the baseline variance is in that sample using that equipment on that day. If the variance of the negative control situation is high, then you know that even important and meaningful positive results may be masked by sample variance. If the variance of the negative control is extremely low, you know that even a significant positive result may reflect a difference that is trivial.
Positive controls are needed to ensure that if a real difference is present, it is being reported. This is an often overlooked control, and yet without it there is no way to know whether a negative result is due to the absence of perceived differences, or whether the subject was effectively hearing impaired, perhaps due to fatigue in the experimental situation. This control can also turn up a subject bias against a positive response.
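A toy simulation (entirely hypothetical numbers: 16-trial blocks, a 90% hit rate for the "known audible" positive control, 200 blocks, all chosen by me just for illustration) shows what the two controls tell you: the negative-control block reveals how much scores wander when there is nothing to hear, and the positive-control block reveals whether the test system lets a known, real difference through at all.

```python
import random

random.seed(1)

def run_block(p_correct, trials=16):
    """Simulate one ABX block: number correct when the listener's true
    probability of answering correctly is p_correct."""
    return sum(random.random() < p_correct for _ in range(trials))

# Negative control: A and B are identical, so p_correct is exactly 0.5.
# The spread of these scores is the baseline variance of the setup.
negative = [run_block(0.5) for _ in range(200)]

# Positive control: a difference everyone agrees is audible (assume 90% hits).
# If these scores are NOT high, the test system itself is hiding differences.
positive = [run_block(0.9) for _ in range(200)]

print("negative control scores span", min(negative), "to", max(negative))
print("positive control scores span", min(positive), "to", max(positive))
```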
Quote:
Put another way: Lemme go back to that gaggle of audiophiles again. If your listening audience is composed of audiophiles who believe they can already hear the effect easily, an ABX test of said audiophiles in a listening environment of their choosing has a proportion of distinguishers of exactly 1. Therefore, type II error is exactly zero. A firmly negative result of that ABX test would, in fact, have profound statistical power: the ease at which the audiophiles could detect the effect is firmly disproven. Almost all possible audibility might be disprovable, depending on how critical and/or sensitive the listeners, environment, and listening samples were. It may not have shown that the effect didn't exist to begin with, but ya know what, it would still be a fine, meaningful result all the same. |
There is no such thing as a 0 probability of either type I or type II error unless the entire population is measured. Then you don't need inferential statistics at all, because you've measured everyone. Statistical power is a function of variance, the size of the difference in question, and the number of subjects. That's the foundation of power analysis. You need an estimate of the population variance for a given effect size. With those, you can compute the N needed for a negative result to have statistical meaning. Bear in mind that we don't really care what people believe they can hear. We want to know what they can actually hear. We don't know who can distinguish real differences just because they claim to be audiophiles; we have to determine it in our test system.
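To put numbers on why power depends on effect size and N, you can turn the question around: for a test of a given length, what true hit rate does it actually have decent power against? Here is a rough sketch using the same normal approximation as before; the 80% power target and the 0.001 search step are my own assumptions.

```python
from math import sqrt
from statistics import NormalDist

def detectable_proportion(n_trials, p_chance=0.5, alpha=0.05, power=0.80):
    """Smallest true probability of a correct answer that a test of
    n_trials would flag as significant with the requested power
    (normal approximation; a rough planning figure, not an exact one)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p = p_chance
    while p < 1.0:
        p += 0.001
        num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p * (1 - p))
        if (num / (p - p_chance)) ** 2 <= n_trials:
            return round(p, 3)
    return None

# With 120 trials, a negative result only has 80% power against listeners
# who are right about 61-62% of the time; subtler abilities slip through.
print(detectable_proportion(120))
```

So even a 120-trial session that comes back negative says very little about listeners whose true hit rate is, say, 55%.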
Quote:
Now, if a bunch of these tests get done by self-selecting audiophiles... and they come back all negative... and numeric measurements do not show any differences, or the differences that do exist are conclusively found to be unrelated or immaterial.... and there are strong psychoacoustic reasons to doubt the existence of the effect in the first place... I think I can reasonably come to the conclusion that the effect does not exist. Obviously I can always have that opinion, but there really is a totality of evidence in some of these cases and transcends mere preference. |
The self-selection is the problem. Again, it doesn’t matter what people think they can hear. What matters is what they can actually hear. Do those people actually hear real differences in the test system? That’s why positive controls are mandatory. If those people report negative results in the test, but also report negative results when there is a known and measurable difference, we can safely conclude that the negative results we have obtained are an artifact of our test system, and can ignore them. If we have not run the appropriate controls, all we can safely say is that our test system did not show differences…even if real ones were present.
Quote:
This is the main mechanism why I have no problem agreeing with the conclusions reached in certain controversial but large blind tests, which I will not name in order to avoid derailing the thread. (although it sure would be nice if those rat bastards actually did publish the trial breakdowns.)
[snip]
While it is true that such a test may not necessarily prove that the effect is inaudible.... how much does that really matter? I'm not a golden ears. And like I said above, the entire notion of deriving universals from a statistical result is a little daft in the first place. I believe Pio2001 has mentioned that some of the most brutally persuasive arguments he has made in support of ABX testing (and in opposition to specific claims of audibility) is to simply invite the audiophile opposition to an ABX test, conduct it fairly, and hand them the results. And ultimately, that's what matters. |
Why are audiophiles the “opposition”?
At the 2006 Rocky Mountain Audio Fest, I participated in a blind study where a listening audience had to identify which of two amplifiers was playing (one amplifier was tubed, and the other solid state). I was able to do so consistently. I was about the 200th person tested; however, the tester reported to me that I was only the fifth person to correctly identify the amplifiers 100% of the time. To me, the differences were profound, but they also happened to be along audio dimensions with which I was familiar and used to listening. Based on the numbers, only about 2.5% of the people tested to that point had that kind of accuracy. Was the low number due to the way the test was conducted, or was it real? I don't know. I do know that if that kind of result is real, then the number of people that would need to be tested in a DBT of that nature to obtain statistical significance, even of a known, real difference, is staggering. Somewhere in the tens of thousands.
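To show where that "staggering" figure comes from, here is a back-of-the-envelope sketch using the same normal-approximation sample-size formula as above. The 2.5% figure comes from the anecdote; the assumptions that the few distinguishers answer correctly 100% (or 90%) of the time, that everyone else guesses at random, and that each listener contributes a single trial are mine.

```python
from math import sqrt, ceil
from statistics import NormalDist

def trials_needed(p_true, p_chance=0.5, alpha=0.05, power=0.80):
    """Same normal-approximation sample-size sketch as earlier in the post."""
    z_a, z_b = NormalDist().inv_cdf(1 - alpha), NormalDist().inv_cdf(power)
    num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p_true * (1 - p_true))
    return ceil((num / (p_true - p_chance)) ** 2)

# If only 2.5% of listeners can truly tell the amps apart (and they are right
# every time) while everyone else guesses, the audience as a whole answers
# correctly with probability 0.025 * 1.0 + 0.975 * 0.5 = 0.5125.
print(trials_needed(0.5125))                      # roughly 10,000 single-trial listeners
# If those few distinguishers are right only 90% of the time:
print(trials_needed(0.025 * 0.9 + 0.975 * 0.5))   # well over fifteen thousand
```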
Alternatively, you can use the reverse logic: how many people need to obtain a positive result in a DBT/ABX to prove a difference is real? If enough trials are run, and there is no possibility of cheating, then the answer is one. If anyone can reliably hear a difference, then the difference has to be real. At that point, you have to switch to a new experimental hypothesis, and ask why only a small percentage of people are hearing the real differences in the test system.
Quote:
The reverse placebo thing is something I struggle with considerably on certain kinds of tests, albeit only on a philosophical level. Even on the last 128kbps test conducted on HA, once I successfully ABX all the samples, and I spot like 5 completely independent artifacts for each sample, how the hell am I supposed to objectively rate the quality of each sample? If I'm doing threshold detection testing for a particular distortion, with an independent parameter (like switching out DACs, or if the distortion has a polarity associated with it), it's really easy for me to get some wild idea in my head that one configuration is more sensitive than another, and this might turn into a self-fulfulling prophecy by entirely emotional means. |
Therein lies the real problem. In any test situation, we're trying to analyze what we hear. We break it down, and try to identify differences. So, we may listen for a particular artifact, or listen to bass response, or whatever floats our boat. If we identify a "tag", we'll know which is which, although the jury may still be out on which is better. However, does this bear any resemblance to the process we use when we listen to a favorite piece of music? Once again, we're faced with the test situation distorting the listening process, to the point that the generalizability of any results to the real world comes into question.
Quote:
Ultimately this is an indictment against the use of plain vanilla ABX testing for more complicated protocols than it was originally designed for. It's awesome for two speakers, one codec vs lossless, etc. But for other protocols it's a different ball game. |
My own experience goes back to my first post. ABX and DBT do fine if you ask a question they can answer. If you have a positive difference, you know that it was not due to an expectancy effect, which is all that these controls were ever intended to show. If you go beyond that, and try to judge the quality of the difference, rather than its existence, or try to draw inferences about negative results, you’re putting too much strain on the experimental design.