Quote:
Originally Posted by Publius
Quite true. My way of thinking has come around significantly in the past several weeks on the matter - instead of seeing a DBT result as a fact, I see it as a method of communication - a way of taking my subjective perceptions, and grounding them in a statistical basis that has meaning to other people. It does not necessarily mean that something is or is not audible - it merely grounds subjective results in statistical meaning.
|
Actually, DBT is about experimental design, not statistics. This is an important distinction.
Quote:
That's true for negative results but I think not true in the general case. Clearly there is very little knowledge of type II error and how to handle it a priori. But besides that, I think a priori analysis is done all the time - by the ABX software, and not by the user. Users are pretty well trained to know that p<0.05 is considered "statistically valid" (assuming they keep the number of trials fixed). How is that not a power analysis? |
Just to bring those who might not know into the loop, statistically:
Type I error is the probability of obtaining a positive result when there is in fact no real difference, i.e., a false positive. You normally see this reported as alpha. So, when we say that alpha < 0.05, what we are really saying is that the odds are less than one in twenty that a result this strong would arise by chance alone.
Type II error is the probability of obtaining a negative result when there is in fact a real difference between the experimental conditions, i.e., a false negative. The statistic is known as beta. When you have a negative result, it is only meaningful in a statistical sense if you can report beta < 0.05.
The 0.05 level is of course subjective, but has become the standard used by science over the years.
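To make the alpha side concrete, here is a minimal sketch, in Python, of the arithmetic an ABX program does behind the scenes (my own illustration, not taken from any particular ABX tool): for a forced-choice test where pure guessing gives a 50% hit rate, the probability of a given score arising by chance is just a binomial tail.

```python
from math import comb

def abx_p_value(correct: int, trials: int, p_chance: float = 0.5) -> float:
    """Probability of getting `correct` or more right out of `trials`
    purely by guessing (the one-sided binomial tail, i.e. the p-value)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
        for k in range(correct, trials + 1)
    )

# e.g. 12 correct out of 16 trials: p is about 0.038, just under the 0.05 criterion
print(abx_p_value(12, 16))
```

That is why a score like 12/16 is commonly treated as just clearing the 0.05 bar.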
A power analysis is a calculation, performed before the actual test, that predicts the number of subjects (N) needed to achieve a target beta for a given, known effect size. The smaller the difference you are trying to detect, the larger N must be to reach that beta and to draw inferences from a negative result.
In science, negative results are often considered boring, and frequently not published. As a result, you rarely see beta reported, and many scientists don't even bother to do the a priori analysis needed to calculate N. Note that you need pilot data before you can even perform a power analysis. The more common method is to choose a large enough N so that if the result is not significant (p < 0.05), the experiment is considered a failure, and the lab moves on to the next study. Lazy in many ways, but beta is a royal pain to figure out, and often a waste of time.
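For readers who want to see what an a priori power analysis actually looks like, here is a rough sketch using the standard normal approximation for comparing a proportion against chance. It is my own illustration, not a prescription; the 80% power target and the single-listener framing are my assumptions.

```python
from math import sqrt, ceil
from statistics import NormalDist

def trials_needed(p_true, p_chance=0.5, alpha=0.05, power=0.80):
    """Approximate number of ABX trials needed so that a listener who truly
    answers correctly with probability p_true reaches significance at the
    given alpha with the given power (normal approximation to the binomial)."""
    z_a = NormalDist().inv_cdf(1 - alpha)   # one-sided test
    z_b = NormalDist().inv_cdf(power)
    num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p_true * (1 - p_true))
    return ceil((num / (p_true - p_chance)) ** 2)

# A listener who is right 70% of the time needs roughly 37 trials;
# one who is right only 55% of the time needs more than 600.
print(trials_needed(0.70), trials_needed(0.55))
```

The numbers make the point above: the subtler the ability you are trying to confirm or rule out, the faster the required number of trials explodes.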
Quote:
Strictly speaking... so is a positive DBT result. A positive result is not a certain guarantee of audibility (in fact for p<0.05 there is a rather significant probability that it's due to chance!). That we treat the positive results as proof of audibility is "subjective" in the same sense as your example with negative results. It merely involves different probability levels and a subtle logical shift (reject null hypothesis vs failure to reject null hypothesis, as opposed to accept proposition A or accept proposition B). But it's unsupported by the statistics all the same. |
No, although there is always the possibility of experimental error, the entire purpose of inferential statistics is to produce an objective estimate of the probability of that error. So, if you've gotten to the point where you can calculate alpha and beta, you've got a pretty good idea of whether or not your effect is significant (which of course does not guarantee that it's real). All you can do is estimate the probability that you're making an error. The only way to have 100% certainty is to measure the entire population of interest.
You need to keep the experimental and the statistical hypotheses separate, for they are in fact different. The experimental hypothesis is that two [insert audio item here] sound different. The null hypothesis is a statistical concept. If the null hypothesis is rejected and the difference is statistically significant, we still have to interpret the data within the context of the experimental design. GIGO. A friend of mine is color blind, and cannot distinguish blue from green. If I present obvious (to me) color differences, he won't see them. If he's the subject of a visual discrimination test, and no differences between blue and green are reported, it doesn't mean that blue is the same as green. It means that I haven't designed my experiment to account for a possible inability to distinguish color.
Here's one to think about. Suppose that the auditory system is not independent of the visual system, and that visual cues can affect auditory thresholds (and in the real world, there is a lot of data to support this). When you do a DBT or ABX, you are removing some of the cues a person uses to interpret auditory information...and may be changing an auditory threshold. So, in the larger scheme, a DBT or ABX may be measuring an artificial situation that does not reflect how we interpret auditory data in the real world. If so, we've designed the experiment poorly, and the data may not mean what we would normally expect.
Quote:
Honestly, I don't really believe such an interpretation of a negative test is all that subjective. Just like one can come up with good, logical, persuasive reasons to treat a positive DBT result as proof of audibility, even though it is unsupported by the math, one can also come up with good, logical, persuasive reasons to treat a negative DBT result as proof of inaudibility. The persuasiveness of such arguments obviously rests on a subjective basis, but the arguments themselves can be quite logically debated - except that the debate occurs outside the scope of the DBT itself. To some degree I think it requires discussion completely outside the realm of listening. That doesn't mean the discussion is subjective! |
I think we're going to have to disagree here. We can treat a positive result as meaningful because we can calculate alpha. Our conclusions may be in error, but we know the probability that we are making that error through an objective mathematical analysis. However, if we cannot calculate beta, we can't draw the same conclusions about probability of error with a negative result. While we may believe that the negative result is meaningful, there is no objective error calculation to back it up.
Quote:
In other words, if you get a gaggle of audiophiles together, who self-purport to clearly hear an audibility in a particular room, and then you throw them into an ABX and they fall flat on their ass at 60/120 or something like that, that is not a meaningless result. There is a lot of meaning to it. It may not intrinsically mean the audibility doesn't exist, but it rules out a wide variety of other possibilities - chief among them the idea that it is a "clear" audibility, or that a large number and/or majority of audiophiles are capable of hearing the effect, etc. And, of course, it may have quite a lot of meaning to the people conducting the test: it places pretty tight bounds on how many people out of the gaggle actually could hear the difference, those people who scored highly can retake their test, etc. |
Not necessarily. We need to ensure that those "clear" differences remain audible in the test system. It goes back to design principles. If we're going to run an experiment, we don't care what people think they can hear. We need to be able to measure what they actually can hear, and ensure that results from our test system generalize in some way to the real world. The best test we could run would be a truly blind test where the subject had no idea that a test was occurring. If we could sneak into someone's house and switch out his cables with others that appear identical but may be made differently, and that person then noted in casual conversation over the next several days or weeks that his system sounded particularly good or bad, that might be the ideal test, although it can't really happen.
Quote:
I agree with the jist you're saying with all of these things, but - especially in the case of self-selecting audiophiles - I think you are exaggerating the importance of calibrating test sensitivity. Such controls are most certainly necessary to establish the absolute importance of an effect, against other baseline effects. But for establishing relative importance - say, against effects that the listeners themselves claim they hear, for instance? - test sensitivity is less necessary. |
I cannot overemphasize the importance of controls if you're trying to interpret an experiment. Negative controls are critical to establish baseline variance. You absolutely have to know, in any experiment that is going to claim a scientific basis, what the baseline variance is in that sample using that equipment on that day. If the variance of the negative control situation is high, then you know that even important and meaningful positive results may be masked by sample variance. If the variance of the negative control is extremely low, you know that even a significant positive result may reflect a difference that is trivial.
Positive controls are needed to ensure that if a real difference is present, it is being reported. This is an often overlooked control, and yet without it there is no way to know whether a negative result is due to the absence of perceived differences, or whether the subject was effectively hearing impaired, perhaps due to fatigue in the experimental situation. This control can also turn up a subject bias against a positive response.
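A toy simulation (entirely hypothetical numbers: 16-trial blocks, a 90% hit rate for the "known audible" positive control, 200 blocks, all chosen by me just for illustration) shows what the two controls tell you: the negative-control block reveals how much scores wander when there is nothing to hear, and the positive-control block reveals whether the test system lets a known, real difference through at all.

```python
import random

random.seed(1)

def run_block(p_correct, trials=16):
    """Simulate one ABX block: number correct when the listener's true
    probability of answering correctly is p_correct."""
    return sum(random.random() < p_correct for _ in range(trials))

# Negative control: A and B are identical, so p_correct is exactly 0.5.
# The spread of these scores is the baseline variance of the setup.
negative = [run_block(0.5) for _ in range(200)]

# Positive control: a difference everyone agrees is audible (assume 90% hits).
# If these scores are NOT high, the test system itself is hiding differences.
positive = [run_block(0.9) for _ in range(200)]

print("negative control scores span", min(negative), "to", max(negative))
print("positive control scores span", min(positive), "to", max(positive))
```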
Quote:
Put another way: Lemme go back to that gaggle of audiophiles again. If your listening audience is composed of audiophiles who believe they can already hear the effect easily, an ABX test of said audiophiles in a listening environment of their choosing has a proportion of distinguishers of exactly 1. Therefore, type II error is exactly zero. A firmly negative result of that ABX test would, in fact, have profound statistical power: the ease at which the audiophiles could detect the effect is firmly disproven. Almost all possible audibility might be disprovable, depending on how critical and/or sensitive the listeners, environment, and listening samples were. It may not have shown that the effect didn't exist to begin with, but ya know what, it would still be a fine, meaningful result all the same. |
There is no such thing as a 0 probability of either type I or type II error unless the entire population is measured. Then you don't need inferential statistics at all, because you've measured everyone. Statistical power is a function of variance, the size of the difference in question, and the number of subjects. That's the foundation of power analysis. You need an estimate of the population variance for a given effect size. With those, you can compute the N needed for a negative result to have statistical meaning. Bear in mind that we don't really care what people believe they can hear. We want to know what they can actually hear. We don't know who can distinguish real differences just because they claim to be audiophiles; we have to determine it in our test system.
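To put numbers on why power depends on effect size and N, you can turn the question around: for a test of a given length, what true hit rate does it actually have decent power against? Here is a rough sketch using the same normal approximation as before; the 80% power target and the 0.001 search step are my own assumptions.

```python
from math import sqrt
from statistics import NormalDist

def detectable_proportion(n_trials, p_chance=0.5, alpha=0.05, power=0.80):
    """Smallest true probability of a correct answer that a test of
    n_trials would flag as significant with the requested power
    (normal approximation; a rough planning figure, not an exact one)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p = p_chance
    while p < 1.0:
        p += 0.001
        num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p * (1 - p))
        if (num / (p - p_chance)) ** 2 <= n_trials:
            return round(p, 3)
    return None

# With 120 trials, a negative result only has 80% power against listeners
# who are right about 61-62% of the time; subtler abilities slip through.
print(detectable_proportion(120))
```

So even a 120-trial session that comes back negative says very little about listeners whose true hit rate is, say, 55%.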
Quote:
Now, if a bunch of these tests get done by self-selecting audiophiles... and they come back all negative... and numeric measurements do not show any differences, or the differences that do exist are conclusively found to be unrelated or immaterial.... and there are strong psychoacoustic reasons to doubt the existence of the effect in the first place... I think I can reasonably come to the conclusion that the effect does not exist. Obviously I can always have that opinion, but there really is a totality of evidence in some of these cases and transcends mere preference. |
The self-selection is the problem. Again, it doesn’t matter what people think they can hear. What matters is what they can actually hear. Do those people actually hear real differences in the test system? That’s why positive controls are mandatory. If those people report negative results in the test, but also report negative results when there is a known and measurable difference, we can safely conclude that the negative results we have obtained are an artifact of our test system, and can ignore them. If we have not run the appropriate controls, all we can safely say is that our test system did not show differences…even if real ones were present.
Quote:
This is the main mechanism why I have no problem agreeing with the conclusions reached in certain controversial but large blind tests, which I will not name in order to avoid derailing the thread. (although it sure would be nice if those rat bastards actually did publish the trial breakdowns.)
[snip]
While it is true that such a test may not necessarily prove that the effect is inaudible.... how much does that really matter? I'm not a golden ears. And like I said above, the entire notion of deriving universals from a statistical result is a little daft in the first place. I believe Pio2001 has mentioned that some of the most brutally persuasive arguments he has made in support of ABX testing (and in opposition to specific claims of audibility) is to simply invite the audiophile opposition to an ABX test, conduct it fairly, and hand them the results. And ultimately, that's what matters. |
Why are audiophiles the “opposition”?
At the 2006 Rocky Mountain Audio Fest, I participated in a blind study where a listening audience had to identify which of two amplifiers was playing (one amplifier was tubed, and the other solid state). I was able to do so consistently. I was about the 200th person tested; however, the tester reported to me that I was only the fifth person to correctly identify the amplifiers 100% of the time. To me, the differences were profound, but they also happened to be along audio dimensions with which I was familiar and used to listening. Based on the numbers, only about 2.5% of the people tested to that point had that kind of accuracy. Was the low number due to the way the test was conducted, or was it real? I don't know. I do know that if that kind of result is real, then the number of people that would need to be tested in a DBT of that nature to obtain statistical significance, even of a known, real difference, is staggering. Somewhere in the tens of thousands.
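To show where that "staggering" figure comes from, here is a back-of-the-envelope sketch using the same normal-approximation sample-size formula as above. The 2.5% figure comes from the anecdote; the assumptions that the few distinguishers answer correctly 100% (or 90%) of the time, that everyone else guesses at random, and that each listener contributes a single trial are mine.

```python
from math import sqrt, ceil
from statistics import NormalDist

def trials_needed(p_true, p_chance=0.5, alpha=0.05, power=0.80):
    """Same normal-approximation sample-size sketch as earlier in the post."""
    z_a, z_b = NormalDist().inv_cdf(1 - alpha), NormalDist().inv_cdf(power)
    num = z_a * sqrt(p_chance * (1 - p_chance)) + z_b * sqrt(p_true * (1 - p_true))
    return ceil((num / (p_true - p_chance)) ** 2)

# If only 2.5% of listeners can truly tell the amps apart (and they are right
# every time) while everyone else guesses, the audience as a whole answers
# correctly with probability 0.025 * 1.0 + 0.975 * 0.5 = 0.5125.
print(trials_needed(0.5125))                      # roughly 10,000 single-trial listeners
# If those few distinguishers are right only 90% of the time:
print(trials_needed(0.025 * 0.9 + 0.975 * 0.5))   # well over fifteen thousand
```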
Alternatively, you can use the reverse logic: how many people need to obtain a positive result in a DBT/ABX to prove a difference is real? If enough trials are run, and there is no possibility of cheating, then the answer is one. If anyone can reliably hear a difference, then the difference has to be real. At that point, you have to switch to a new experimental hypothesis, and ask why only a small percentage of people are hearing the real differences in the test system.
Quote:
The reverse placebo thing is something I struggle with considerably on certain kinds of tests, albeit only on a philosophical level. Even on the last 128kbps test conducted on HA, once I successfully ABX all the samples, and I spot like 5 completely independent artifacts for each sample, how the hell am I supposed to objectively rate the quality of each sample? If I'm doing threshold detection testing for a particular distortion, with an independent parameter (like switching out DACs, or if the distortion has a polarity associated with it), it's really easy for me to get some wild idea in my head that one configuration is more sensitive than another, and this might turn into a self-fulfulling prophecy by entirely emotional means. |
Therein lies the real problem. In any test situation, we're trying to analyze what we hear. We break it down, and try to identify differences. So, we may listen for a particular artifact, or listen to bass response, or whatever floats our boat. If we identify a "tag", we'll know which is which, although the jury may still be out on which is better. However, does this bear any resemblance to the process we use when we listen to a favorite piece of music? Once again, we're faced with the test situation distorting the listening process, to the point that the generalizability of any results to the real world comes into question.
Quote:
Ultimately this is an indictment against the use of plain vanilla ABX testing for more complicated protocols than it was originally designed for. It's awesome for two speakers, one codec vs lossless, etc. But for other protocols it's a different ball game. |
My own experience goes back to my first post. ABX and DBT do fine if you ask a question they can answer. If you have a positive difference, you know that it was not due to an expectancy effect, which is all that these controls were ever intended to show. If you go beyond that, and try to judge the quality of the difference, rather than its existence, or try to draw inferences about negative results, you’re putting too much strain on the experimental design.