The validity of ABX testing - Page 2

post #16 of 109
The original question is interesting. I'm one of those who doesn't rate ABX tests - I wouldn't go so far as to say that they were completely invalid, more that they are a less reliable way of spotting certain kinds of differences.

So I would expect most ABX tests of neutral components (flat frequency response, negligible distortion) to give a negative result, because the stress, illusions, confusions, etc. will swamp the listener's ability to spot (objectively) small differences.

If an ABX test does return a positive result, then I would cautiously accept that the result is probably valid and certainly worthy of further investigation. This would be especially the case if there was a consensus on what the difference was (e.g. "A's bass was tighter than B's") when that characteristic was never previously mentioned.

I said "cautiously accept" because I haven't thought about it this way round to the same degree as for a negative ABX result, so I could well be guilty of oversimplifying the situation.
post #17 of 109
Quote:
Originally Posted by Bullseye View Post
From all the answers I have read, none have been directly related to royalcrown's question. The answers always move away from the initial question and we get nowhere...

He wants to know whether the people who claim DBTs are an invalid method for testing differences, and who reject any test concluding that cables - in this case - don't make a difference, would accept the test if some people passed the DBT and showed that, in their case, different cables do make a difference.
I've answered that several times.

Maybe some of the confusion comes from the fact that I don't think quick-switch ABX is an "invalid method"---my hypothesis is that it's not very sensitive under certain conditions. That doesn't make it "invalid" in some general sense; it means that the "noise" in the "human test instrument" swamps the "signal".
post #18 of 109
Nick, thanks for putting it so clearly.

Then I think the question royalcrown wrote in the first place was not directed towards you. It was directed at someone who thinks in a different way.

Let's see what other people think as well.
post #19 of 109
I think I agree pretty much with the points made by Pio2001. Doesn't a negative test really prove nothing, in light of the null hypothesis?

Nevertheless, if adequate (and that's part of the issue, I guess) tests were conducted that failed to yield a positive result, I would consider this pretty persuasive evidence of the lack of audible differences, depending, of course, on what was being tested. But I would like to see, among other things, some confirmation of the validity of the test methodology and I would like to see addressed some of the issues raised by wavoman and others in other threads. Moreover, I would like to see if the test yields a positive result regarding something we can all agree sounds different. (I guess we'd have to determine what that is.)

But I'm not sure I understand the logic behind the notion: "If we design a krappy DBT and the skeptics are prepared to accept a positive result, then you subjectivists should agree to accept a negative result if the result is, in fact, negative." I'm not sure that is what is being suggested, but that's sort of what I took from the initial post. And I think it is illogical. Maybe someone could explain better what the question is or how I'm misinterpreting it. (The second paragraph's description of "downfalls" is a little bit confusing; is "quick switching" good or bad?)

It seems to me the better question is: "If we can design a test that the skeptics and objectivists can agree upon, are both sides prepared to accept the result, or at least give it some credence?"
post #20 of 109
Quote:
Originally Posted by TheAttorney View Post
The original question is interesting. I'm one of those who doesn't rate ABX tests - I wouldn't go so far as to say that they were completely invalid, more that they are a less reliable way of spotting certain kinds of differences.

So I would expect most ABX tests of neutral components (flat frequency response, negligible distortion) to give a negative result, because the stress, illusions, confusions, etc. will swamp the listener's ability to spot (objectively) small differences.
I agree with you. You have just put succinctly the same thing I was saying with math in my first post. Yet royalcrown seemed to think I had no point, saying that he understood the math.

I still don't know what royalcrown is really asking. He seemed to be directing his question at a group that includes me.

-Mike
post #21 of 109
Thread Starter 
Quote:
Originally Posted by mike1127 View Post
I don't get this nebulous concept that "data wasn't gathered in a proper manner."

I don't know how to make this any clearer for you. If you do not collect the data in a proper manner, any statistical analysis is useless.

Quote:
Originally Posted by mike1127 View Post
A sighted test isn't a test on sound alone. As long as you recognize that, sure you can do statistical analysis on it. You would be testing more whether people need glasses, but to a first approximation there still is a 'p', that is, there still is a probability that someone gets the right answer, and you can reject the null hypothesis that p=0.5.
If you do a sighted test, every subject tested will get 100% all of the time (even after 1,000 trials) unless they press the wrong button (highly unlikely). This will happen because if they know the identities of A, B, X, and Y, they'll be able to identify X as A or B in each trial without even listening to the music. You can't run statistical analysis and conclude anything meaningful when the subjects already know the answers to the questions you're giving them. You can do the math, but at the end of the day your hypotheses will either be meaningless or outright wrong.


Quote:
Originally Posted by mike1127 View Post
I don't get your point. I accept that blind tests are tests on sound alone. So there's some value of p (in a crude model) and it's based only on what someone can hear.. and here's the kicker.. under those test conditions. It's not whether the whole concept is "valid" in some nebulous way. It's about how the test conditions affect p.
Define P in this context, because you're not using a standard definition of a P-value here.

Quote:
Originally Posted by mike1127 View Post
Sure it matters. Under those conditions, and with respect to the devices under test, we have rejected the null hypothesis.
What would h1 be in this instance? You're begging the question that I'm asking you: would that test be sufficient evidence that cables have a causal effect on the audio system?

Quote:
Originally Posted by Pio2001 View Post
The basics of ABX is that a positive proves that a difference exists, and that a negative doesn't prove anything.
This is only true in some instances. It depends on what hypotheses you are testing.

If h0 is "cables A and B are audibly indistinguishable" and h1 is "cables A and B are audibly distinguishable," then rejecting h1 is compelling evidence that h0 is likely true. This has nothing to do with statistics and everything to do with logic: you cannot assert "not P" and "P" at the same time, taking P to be h1. With statistics we are not dealing with logical truths, but the same constraint holds if you substitute "not P is likely": you still cannot conclude that "not P is unlikely".

Quote:
Originally Posted by Pio2001 View Post
Because in medicine, tests are run on statistically representative samples of the population, in order to evaluate the average result, while in hifi, tests are run with one subject, which is anything but representative, in order to find a minimum possible effect.

Under these conditions, a negative doesn't prove the null hypothesis.
Which is why I specified in my first post (in retrospect, rather redundantly, wrt quick-switching) that the samples would be representative.

Quote:
Originally Posted by mike1127 View Post
I've answered that several times.

Maybe some of the confusion comes from the fact that I don't think quick-switch ABX is an "invalid method"---my hypothesis is that it's not very sensitive under certain conditions. That doesn't make it "invalid" in some general sense; it means that the "noise" in the "human test instrument" swamps the "signal".
Which is why I brought up the hypothetical in the first place: would you accept a DBT if it turned out that a positive result was ascertained, even if all of the factors that you claim reduce the sensitivity of the test were still there? To avoid being redundant see my response to PhilS's post (since this response is getting long enough as is).

Quote:
Originally Posted by PhilS View Post

But I'm not sure I understand the logic behind the notion: "If we design a krappy DBT and the skeptics are prepared to accept a positive result, then you subjectivists should agree to accept a negative result if the result is, in fact, negative." I'm not sure that is what is being suggested, but that's sort of what I took from the initial post. And I think it is illogical. Maybe someone could explain better what the question is or how I'm misinterpreting it. (The second paragraph's description of "downfalls" is a little bit confusing; is "quick switching" good or bad?)
Take mike's arguments (just because they're the most accessible, seeing as there are whole threads about them): all of his objections reduce to "something is messing with ABXs that reduces their sensitivity, and therefore they cannot be used to distinguish between cables." If, however, an ABX comes out with a positive result, despite the fact that ABX testing is insensitive (I'm granting the arguments here), and mike (or whoever - I'm just using an example; this applies to many similar arguments) accepts these results as proof that cables make an audible difference, then in the real world (i.e. where ABXs have not in fact shown a difference) mike's argument is circular: ABXs fail because they're insensitive, but if they showed that cables made a difference (in a hypothetical, i.e. a counterfactual) then they wouldn't be insensitive. At best this is just a heavily biased argument, but at worst this is circular. I'm not accusing him of anything, but I think that this logical structure either needs clarification or revision, which is why I made the thread.

Yes, I know, that big blob of text doesn't make much sense. I'll try to clarify it in a later post (one reason why I didn't reply for a while).

Quote:
Originally Posted by PhilS View Post
It seems to me the better question is: "If we can design a test that the skeptics and objectivists can agree upon, are both sides prepared to accept the result, or at least give it some credence?"
That's a fair question, and worthy of discussion, it's just that I'm trying to get at something else in this thread.
post #22 of 109
Quote:
Originally Posted by royalcrown View Post
Define P in this context, because you're not using a standard definition of a P-value here.
Lower-case p is the probability that the test subject gets the right answer.

Quote:
Take mike's arguments (just because they're the most accessible, seeing as there are whole threads about them): all of his objections reduce to "something is messing with ABXs that reduces their sensitivity, and therefore they cannot be used to distinguish between cables."
Not quite right. Something is messing with quick-switch ABXs that reduces their sensitivity; therefore the Type II error rate is high.

Quote:
If, however, an ABX comes out with a positive result, despite the fact that ABX testing is insensitive (I'm granting the arguments here), and mike (or whoever - I'm just using an example; this applies to many similar arguments) accepts these results as proof that cables make an audible difference, then in the real world (i.e. where ABXs have not in fact shown a difference) mike's argument is circular: ABXs fail because they're insensitive, but if they showed that cables made a difference (in a hypothetical, i.e. a counterfactual) then they wouldn't be insensitive.
I'm starting to suspect you don't really understand the statistics. My arguments boil down to "p is close to 0.5". That's all that has to be said, really, if you understand the statistics.

If the significance level is, say, 5%, and we run 100 tests on cables and only about 5 reject the null hypothesis, then that would not be good evidence.

Or to put it another way, if one single cable test out of the blue rejects the null hypothesis (assuming many people have run such tests along the way) that is not good evidence.

However, if 15 or 20 tests reject the null hypothesis, then that is good evidence to reject the null hypothesis that p=0.5.

This does not in any way contradict my suggestion. My suggestion is that p is only slightly above 0.5 - that the effect is small - not that p equals 0.5. So a test that rejects p=0.5 does not contradict my suggestion.
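
To put rough numbers on that (a sketch of my own, assuming each of the 100 hypothetical tests is an independent one-sided exact binomial test at the 5% level, computed with scipy):

Code:
from scipy.stats import binom

n_tests = 100   # hypothetical number of independent cable ABX tests
alpha = 0.05    # per-test significance level

# If every subject is purely guessing (p = 0.5), each test still "passes" with
# probability alpha, so we expect about 5 chance passes out of 100.
print("expected chance passes:", n_tests * alpha)

# Probability of 15 or more passes out of 100 under pure guessing - on the order
# of 10**-4, which is why 15 or 20 passes would be good evidence against guessing.
print("P(15+ passes | guessing):", binom.sf(14, n_tests, alpha))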

Quote:
At best this is just a heavily biased argument, but at worst this is circular. I'm not accusing him of anything, but I think that this logical structure either needs clarification or revision, which is why I made the thread.
It's not a "logical structure." It's statistics.
post #23 of 109
Quote:
Originally Posted by royalcrown View Post

Take mike's arguments (just because they're the most accessible, seeing as there are whole threads about them): all of his objections reduce to "something is messing with ABXs that reduces their sensitivity, and therefore they cannot be used to distinguish between cables." If, however, an ABX comes out with a positive result, despite the fact that ABX testing is insensitive (I'm granting the arguments here), and mike (or whoever - I'm just using an example; this applies to many similar arguments) accepts these results as proof that cables make an audible difference, then in the real world (i.e. where ABXs have not in fact shown a difference) mike's argument is circular: ABXs fail because they're insensitive, but if they showed that cables made a difference (in a hypothetical, i.e. a counterfactual) then they wouldn't be insensitive. At best this is just a heavily biased argument, but at worst this is circular. I'm not accusing him of anything, but I think that this logical structure either needs clarification or revision, which is why I made the thread.
I guess I still don't get it. The believers say there's a flaw in most DBT's. Let's say it's quick switching. If you do a test with quick switching, and you get a positive result, that would appear to mean that people can identify the cables (for example) as sounding different, notwithstanding the test conditions are not ideal (at least to the believers). (Presumably the test is structured in a way to eliminate the possibility of a positive result from pure guessing). In the absence of some other explanation for the positive result, I would think the skeptics would have to concede that the test is somewhat probative. And I don't think it's unreasonable for the believers to hold that opinion also.

OTOH, if it yields a negative result, I understand why the folks who think DBT's are flawed would say: "Well, it's a bad DBT, because you used quick switching. We told you that is one of the things that is screwing the test up, and you didn't fix it."

To use a somewhat imperfect analogy (but one that I think adequately illustrates my point), let's say I claim I am a good putter and routinely can make ten putts in a row from three feet on the greens at my local club. You offer to test me provided that while I am putting at my club, you can shoot bullets over my head. If I nevertheless make ten putts in a row with bullets whizzing by, I will think that's probative of my abilities on the green from three feet. If I don't make ten in a row, and only make 1 or 2, presumably because I fear for my life, I don't think it is inconsistent or hypocritical for me to say, "Well, the test was not a fair test."
post #24 of 109
Thread Starter 
Quote:
Originally Posted by mike1127 View Post
Lower-case p is the probability that the test subject gets the right answer.
Statistical hypothesis testing - Wikipedia, the free encyclopedia

Scroll down to "definition of symbols" and point out which p you're referring to (note that there are many, and always used in lowercase). Or, instead, find a website with a standard definition of the "p" you are using.

Quote:
Originally Posted by mike1127 View Post
Not quite right. Something is messing with quick-switch ABXs that reduces their sensitivity; therefore the Type II error rate is high.
Re-read my first post: the fictional test uses a large sample size. I specifically put that there so that the discussion would not revolve around statistical power (Type II errors are non-issues when large sample sizes are involved).
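
To put the sample-size point in numbers (my own sketch, assuming a one-sided exact binomial test against guessing and an arbitrary true proportion of 0.6): the Type II error rate falls quickly as the number of trials grows.

Code:
from scipy.stats import binom

theta, alpha = 0.6, 0.05   # assumed true proportion correct; per-test significance level

for n in (16, 50, 100, 400):
    # smallest score k that rejects "pure guessing" (p = 0.5) at level alpha
    k = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) <= alpha)
    # Type II error: probability that a listener with true proportion theta scores below k
    type2 = binom.cdf(k - 1, n, theta)
    print(f"n={n:4d}  min passing score={k:3d}  Type II error={type2:.3f}")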
post #25 of 109
Quote:
Originally Posted by royalcrown View Post
Scroll down to "definition of symbols" and point out which p you're referring to (...). Or, instead, find a website with a standard definition of the "p" you are using.
In the ABC/HR software, the probability that Mike refers to is called "Effect size" and denoted theta.

It is the proportion of right answers. For example, if I get 80 right answers out of 100, my theta (what Mike calls p) is 0.8.

Given the expected theta, the target Type I error (alpha), and the target Type II error (beta), the software gives you the total number of trials and the minimum score needed.
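
A rough sketch of that calculation (not the actual ABC/HR code - just the textbook version, assuming a one-sided exact binomial test against p = 0.5):

Code:
from scipy.stats import binom

def trials_and_min_score(theta=0.8, alpha=0.05, beta=0.05, max_n=1000):
    """Smallest number of trials n, and minimum score k, such that scoring k or more
    rejects pure guessing (p = 0.5) at level alpha, while a listener whose true
    proportion of right answers is theta reaches k with probability >= 1 - beta."""
    for n in range(1, max_n + 1):
        for k in range(n + 1):
            if binom.sf(k - 1, n, 0.5) <= alpha:   # P(score >= k | guessing) <= alpha
                break
        else:
            continue   # no passing score exists yet at this n
        if binom.sf(k - 1, n, theta) >= 1 - beta:  # power requirement met
            return n, k
    return None

print(trials_and_min_score(theta=0.8, alpha=0.05, beta=0.05))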
post #26 of 109
Quote:
Originally Posted by royalcrown View Post
Statistical hypothesis testing - Wikipedia, the free encyclopedia

Scroll down to "definition of symbols" and point out which p you're referring to (note that there are many, and always used in lowercase). Or, instead, find a website with a standard definition of the "p" you are using.
I don't think any formula on that page applies, because that page is about hypothesis testing for continuous values, while an ABX test is testing a discrete value (right/wrong). A guy on Audio Asylum and also pio2001 explained this to me: they said that for ABX tests we need to use the "binomial mass function." A calculator for it is here:

P-values

Note that this page uses upper-case P and lower-case p differently.

We are modeling the test subject as someone who answers correctly with probability p. Some sources are careful to distinguish this from upper-case P. The Wikipedia page does not distinguish them.
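
The kind of calculation such a calculator performs can be sketched in a few lines (my own example score, assuming a one-sided exact binomial test):

Code:
from scipy.stats import binom

trials, correct = 16, 12   # e.g. 12 right answers out of 16 ABX trials

# Upper-case P here is the p-value: the probability of scoring at least this well
# by pure guessing, i.e. when the lower-case p of the model is 0.5.
P = binom.sf(correct - 1, trials, 0.5)
print(P)   # about 0.038, so guessing would be rejected at the 5% level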

Quote:
Re-read my first post: the fictional test uses a large sample size. I specifically put that there so that the discussion would not revolve around statistical power (Type II errors are non-issues when large sample sizes are involved).
Dude, you are just driving me crazy. In your hypothetical test there was a positive result, so there couldn't possibly be a Type II error! So you didn't need a large sample size to reduce Type II error.

I think my first post (follow-up to your first post) explains this all adequately. So re-read it, get back to me with questions.
post #27 of 109
Quote:
Originally Posted by Pio2001 View Post
In the ABC/HR software, the probability that Mike refers to is called "Effect size" and denoted theta.

It is the proportion of right answers. For example, if I get 80 right answers out of 100, my theta (what Mike calls p) is 0.8.

Given the expected theta, the target Type I error (alpha), and the target Type II error (beta), the software gives you the total number of trials and the minimum score needed.
Okay, I'll start calling it theta. However, we do need to recognize there is some subtlety here. The proportion of right answers in a given test is not the same thing as our model of that proportion. We are modeling the listener as someone who answers correctly with probability theta. The null hypothesis is that theta=0.5. A successful test rejects that hypothesis, which means theta is probably not 0.5. Note that we still have no idea what theta is.



Or to make this even clearer: let's say I hypothesize that theta is "really" 0.6, meaning that the test subject answers correctly 60% of the time. We must recognize this is a crude model. The test subject and test conditions are astronomically complex things. So it's not at all clear what it even means to say theta is 0.6.

However, we are clear on what it means to reject the null hypothesis. That means it's not very likely that the model of theta=0.5 is true. What, then, is true? Who knows? EDIT: Actually we can say it is true the listener was influenced by the sound; i.e., the cables were audibly different. What I meant to say was that we have very little power to extrapolate this test to how the listener would perform in other conditions; that is, we don't have much of a model of that listener.

EDIT: now that I reflect, if you are saying, Pio2001, that your software claims 80/100 means theta=0.8, then theta is definitely not what I call p. p is the model, not the measurement.
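
To illustrate that last distinction with numbers (my own sketch, assuming a Clopper-Pearson exact interval): a score of 80/100 only brackets the model parameter, it doesn't pin it down.

Code:
from scipy.stats import beta

n, k, alpha = 100, 80, 0.05   # 80 right answers out of 100 trials

# Exact (Clopper-Pearson) 95% confidence interval for the model parameter theta
lower = beta.ppf(alpha / 2, k, n - k + 1)
upper = beta.ppf(1 - alpha / 2, k + 1, n - k)
print(f"observed {k}/{n}, 95% CI for theta: ({lower:.2f}, {upper:.2f})")   # roughly (0.71, 0.87)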
post #28 of 109
Thread Starter 
Quote:
Originally Posted by PhilS View Post

To use a somewhat imperfect analogy (but one that I think adequately illustrates my point), let's say I claim I am a good putter and routinely can make ten putts in a row from three feet on the greens at my local club. You offer to test me provided that while I am putting at my club, you can shoot bullets over my head. If I nevertheless make ten putts in a row with bullets whizzing by, I will think that's probative of my abilities on the green from three feet. If I don't make ten in a row, and only make 1 or 2, presumably because I fear for my life, I don't think it is inconsistent or hypocritical for me to say, "Well, the test was not a fair test."
I think the analogy is flawed because it's resting on a different assumption than with audio. In your analogy, you mention flying bullets. For one, we know that the flying bullets exist. Moreover, we know what the effects of flying bullets are outside of playing golf - for example, we know that flying bullets cause people to tense up and fear for their lives in almost any situation.

However, with the objections raised such as imagination contamination, it's a completely different story. We don't know if imagination contamination exists. We also don't have any conclusive proof that this condition manifests itself in anything outside of double-blind audio testing, nor do we know where we would find it or how to isolate it (unlike flying bullets, which we've witnessed in tons of situations and whose effects we can isolate very easily). The only proof that imagination contamination even conceivably exists is that people are failing ABX tests where they "should" be passing them. Thus, the proof of its existence relies upon ABX tests yielding negative results. However, if ABX tests yielded positive results, and if one accepts those positive results as valid, then the only proof for imagination contamination's existence is gone. That means that, ultimately, the argument is circular:

If imagination contamination exists, ABX testing is flawed.
Imagination contamination exists only if ABX testing produces a negative result.
Therefore,
ABX testing is flawed because ABX testing produces a negative result.
post #29 of 109
royalcrown, I noticed you edited out this portion of my reply, but I believe it is the answer you seek:

Quote:
Originally Posted by mike1127
If the significance level is, say, 5%, and we run 100 tests on cables and only about 5 reject the null hypothesis, then that would not be good evidence.

Or to put it another way, if one single cable test out of the blue rejects the null hypothesis (assuming many people have run such tests along the way) that is not good evidence.

However, if 15 or 20 tests reject the null hypothesis, then that is good evidence that cables matter.

This does not in any way contradict my suggestion. My suggestion is that p is only slightly above 0.5 - that the effect is small - not that p equals 0.5. So a test that rejects p=0.5 does not contradict my suggestion.
To put this another way, I am suggesting that p might be close to 0.5. That amounts to saying that test subjects are almost purely guessing. You are actually right that a single non-null result doesn't mean a whole lot. You would need quite a few to suggest that the subjects weren't purely guessing.
post #30 of 109
Quote:
Originally Posted by royalcrown View Post
I think the analogy is flawed ...

However, with the objections raised such as imagination contamination, it's a completely different story. We don't know if imagination contamination exists. ...

If imagination contamination exists, ABX testing is flawed.
Imagination contamination exists only if ABX testing produces a negative result.
Therefore,
ABX testing is flawed because ABX testing produces a negative result.
Now you are going in a totally different direction than your original question. Your original question is whether I would accept a positive result as meaningful in some way. I've answered that. A conditional "yes". I've shown why it is rational to think that.

Arguing for or against the existence of "imagination contamination" or any effect that reduces sensitivity in an ABX is an entirely different issue.