I don't like the Burden of Proof Argument.
Dec 28, 2015 at 1:22 PM Post #46 of 151
   
The issue is having enough trials to differentiate between dishonesty and true inability, especially within the context of a test like ABX where more trials can cause other human issues to crop up. The 10-trial tests that seem to be the online norm are fine with respect to Type I error but have essentially no power against low true probabilities of detection, and I doubt 10 trials is enough to detect dishonesty in any meaningful way; I guess I should do some math on the subject. All this just highlights the importance of letting real researchers have some $$ to allow meaningful sample sizes.

Yes, I agree RRod - perceptual testing should be left to those trained in the field of perceptual testing - I consider home-based blind tests to be anecdotes passed off as objective & scientific. The fact is that most people have no clue how to run these tests in a manner which minimises skew in the results. The minimum number of trials for a statistically acceptable result is 16, I believe?
 
Dec 28, 2015 at 1:31 PM Post #47 of 151
  Yes, I agree RRod - perceptual testing should be left to those trained in the field of perceptual testing - I consider home-based blind tests to be anecdotes passed off as objective & scientific. The fact is that most people have no clue how to run these tests in a manner which minimises skew in the results. The minimum number of trials for a statistically acceptable result is 16, I believe?

 
That all depends on the alpha level you want. For 5% and a 1-sided test, 9/10 trials gives you a significant result (exact binomial p ≈ 0.011). However, that criterion has only about 24% power against an alternative of a 75% chance of discrimination; even relaxing the cutoff to 8/10 (which pushes alpha to about 5.5%) only gets you to roughly 53% power.
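To make that concrete, here's a minimal sketch of the binomial arithmetic (Python with scipy; the 75% discrimination rate is just the alternative hypothesis under discussion, not a measured figure):

```python
# Exact binomial significance and power for a 10-trial forced-choice test.
from scipy.stats import binom

n = 10          # number of trials
p_null = 0.5    # pure guessing
p_alt = 0.75    # hypothetical true chance of a correct answer

for k in (8, 9):
    p_value = binom.sf(k - 1, n, p_null)  # P(X >= k) under guessing
    power = binom.sf(k - 1, n, p_alt)     # P(X >= k) if p really is 0.75
    print(f"{k}/{n} correct: p-value = {p_value:.4f}, power = {power:.3f}")

# Output:
# 8/10 correct: p-value = 0.0547, power = 0.526
# 9/10 correct: p-value = 0.0107, power = 0.244
```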
 
Dec 28, 2015 at 1:31 PM Post #48 of 151
  I guess you are right - sighted evaluation can get 100% all the time, as the guy only needs to know how to read and, ultimately, to agree with himself.

 
 
 "ok I will click on the file that says 24/96 and listen to it, then I will click on the file that says mp3 that I know for a fact is different and inferior in resolution, and try to find out if I can guess that I that I know it". 
Do you define this as a test? I don't know about you, but the tests I've had to pass in my life were missing that nice part where you get the answer along with the question.

 
-"now close one eye and read this, it says D K N P W M N.  can you see it? do you need me to repeat it slower?"

 
Come on, let's be serious. You're trying to compare a test with a bad joke.

 
I asked you to back up your claim "personal blind test. we're only saying how it's more accurate" but all you provided was a diatribe on sighted listening & called it a "bad joke". Does this, in any way, prove your claim? What evidence have you presented which can be examined by unbiased observers?
 
As per the thrust of this thread - does the burden of proof not extend to all claims, including yours above?    
 
Dec 28, 2015 at 1:34 PM Post #49 of 151
   
That all depends on the alpha level you want. For 5% and a 1-sided test, 9/10 trials gives you a significant result (exact binomial p ≈ 0.011). However, that criterion has only about 24% power against an alternative of a 75% chance of discrimination; even relaxing the cutoff to 8/10 (which pushes alpha to about 5.5%) only gets you to roughly 53% power.

Sorry, I should have stated for a forced-choice test - typically an ABX test, as that is the most common form of blind testing demanded of people
 
Dec 28, 2015 at 1:49 PM Post #50 of 151
  Sorry, I should have stated for a forced-choice test - typically an ABX test, as that is the most common form of blind testing demanded of people

 
Then we'd have to specify the exact model type and I'd have to look up the stats anyway. Online you'll find Bernoulli-based results to be the norm for how people analyze their ABX results.
 
Dec 28, 2015 at 2:55 PM Post #51 of 151
  I asked you to back up your claim "personal blind test. we're only saying how it's more accurate" but all you provided was a diatribe on sighted listening & called it a "bad joke". Does this, in any way, prove your claim? What evidence have you presented which can be examined by unbiased observers?
 
As per the thrust of this thread - does the burden of proof not extend to all claims, including yours above?    

 
You're asking me to prove nonsense. A sighted test without controls is not a test, because no controls = no proof, and no proof = a meaningless result I can't trust. It's the very same as just asking for my own opinion: as long as I believe I'm hearing it, I will say "yes, I heard it", because, hey, I do believe I heard it. But did I? Who knows - I can't confirm it.
What good is that to me? My question is "can I really hear a difference?", not "do I want to hear a difference?".
 
That's my old example of me tweaking my EQ ever so slightly, 1 dB here and there, until I felt I had achieved something nice, only to notice later that the EQ was OFF the whole time. All along I thought it was ON and felt that I had improved my EQ. The only reason I know I was fooled is the ON/OFF control button; without it my delusion would never have stopped. If something like that happens to me from time to time (and it does), how can I trust myself in a sighted evaluation where, again, I have no control? I think the files are different and I hear a difference - OK, but did I really? I don't know, and I have zero way of checking. And what is something you believe but cannot verify? Faith.
 
 
Sighted evaluation is so bad, you can't even test it. You're asking me to prove that the absence of controls is inferior, when I would need to add controls to the listening test to prove it... I could make a test where some files aren't really what they say they are and run a sighted evaluation, but that would already be a blind test. Sighted evaluation isn't a test at all; the only question we should answer with it is "does it look good?".
 
Dec 28, 2015 at 3:16 PM Post #52 of 151
   
You're asking me to prove nonsense. A sighted test without controls is not a test, because no controls = no proof, and no proof = a meaningless result I can't trust. It's the very same as just asking for my own opinion: as long as I believe I'm hearing it, I will say "yes, I heard it", because, hey, I do believe I heard it. But did I? Who knows - I can't confirm it.
What good is that to me? My question is "can I really hear a difference?", not "do I want to hear a difference?".
 
That's my old example of me tweaking my EQ ever so slightly, 1 dB here and there, until I felt I had achieved something nice, only to notice later that the EQ was OFF the whole time. All along I thought it was ON and felt that I had improved my EQ. The only reason I know I was fooled is the ON/OFF control button; without it my delusion would never have stopped. If something like that happens to me from time to time (and it does), how can I trust myself in a sighted evaluation where, again, I have no control? I think the files are different and I hear a difference - OK, but did I really? I don't know, and I have zero way of checking. And what is something you believe but cannot verify? Faith.
 
 
Sighted evaluation is so bad, you can't even test it. You're asking me to prove that the absence of controls is inferior, when I would need to add controls to the listening test to prove it... I could make a test where some files aren't really what they say they are and run a sighted evaluation, but that would already be a blind test. Sighted evaluation isn't a test at all; the only question we should answer with it is "does it look good?".

OK, can you tell us what controls are used for detecting false positives & false negatives in your typical self-administered blind test? Let's make it even more specific so everyone can participate - let's say Foobar ABX testing. Can you tell us any statistics with regard to these controls & what levels of false positives & false negatives have been reported from such tests? Both of these statistics would go a long way towards beginning to prove your "claim" of accuracy.
 
Dec 29, 2015 at 1:26 AM Post #53 of 151
 
  I asked you to back up your claim "personal blind test. we're only saying how it's more accurate" but all you provided was a diatribe on sighted listening & called it a "bad joke". Does this, in any way, prove your claim? What evidence have you presented which can be examined by unbiased observers?
 
As per the thrust of this thread - does the burden of proof not extend to all claims, including yours above?    

 
You're asking me to prove nonsense. A sighted test without controls is not a test, because no controls = no proof, and no proof = a meaningless result I can't trust. It's the very same as just asking for my own opinion: as long as I believe I'm hearing it, I will say "yes, I heard it", because, hey, I do believe I heard it. But did I? Who knows - I can't confirm it.
What good is that to me? My question is "can I really hear a difference?", not "do I want to hear a difference?".
 
That's my old example of me tweaking my EQ ever so slightly, 1 dB here and there, until I felt I had achieved something nice, only to notice later that the EQ was OFF the whole time. All along I thought it was ON and felt that I had improved my EQ. The only reason I know I was fooled is the ON/OFF control button; without it my delusion would never have stopped. If something like that happens to me from time to time (and it does), how can I trust myself in a sighted evaluation where, again, I have no control? I think the files are different and I hear a difference - OK, but did I really? I don't know, and I have zero way of checking. And what is something you believe but cannot verify? Faith.
 
 
Sighted evaluation is so bad, you can't even test it. You're asking me to prove that the absence of controls is inferior, when I would need to add controls to the listening test to prove it... I could make a test where some files aren't really what they say they are and run a sighted evaluation, but that would already be a blind test. Sighted evaluation isn't a test at all; the only question we should answer with it is "does it look good?".

 
LOL, I can't tell you how many times I've done that myself...
 
In fact, I've found that small changes like the ones you mention are almost impossible to detect unless they exceed the JND, which varies from person to person.
 
One of the tricks I used was to put my mouse pointer on the checkbox, look away from the monitor, and click the mouse rapidly until I didn't know whether the EQ was on or off. Then I would try to see if my most recent tweak was actually detectable. Often it was not. So I don't EQ much, or often, with headphones.
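If you want to make that trick scorable, here's a hypothetical sketch (Python; the trial count is arbitrary, and switching the EQ to match the hidden schedule is left to a helper or a plugin you toggle without looking):

```python
# Generate a hidden ON/OFF schedule, collect guesses, and score them
# with an exact one-sided binomial test against pure guessing.
import random
from scipy.stats import binom

def run_blind_toggle_test(n_trials=10, seed=None):
    rng = random.Random(seed)
    schedule = [rng.choice(["ON", "OFF"]) for _ in range(n_trials)]
    guesses = []
    for i in range(n_trials):
        # In a real session, the EQ would be set to schedule[i] here,
        # without the listener seeing which state was chosen.
        guesses.append(input(f"Trial {i + 1}: ON or OFF? ").strip().upper())
    correct = sum(g == s for g, s in zip(guesses, schedule))
    p_value = binom.sf(correct - 1, n_trials, 0.5)  # P(X >= correct) by chance
    print(f"{correct}/{n_trials} correct; one-sided p = {p_value:.3f}")
    print("Hidden schedule:", schedule)

run_blind_toggle_test()
```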
 
Dec 29, 2015 at 5:08 AM Post #54 of 151
   
The issue is having enough trials to differentiate between dishonesty and true inability, especially within the context of a test like ABX where more trials can cause other human issues to crop up. The 10-trial tests that seem to be the online norm are fine with respect to Type I error but have essentially no power against low true probabilities of detection, and I doubt 10 trials is enough to detect dishonesty in any meaningful way; I guess I should do some math on the subject. All this just highlights the importance of letting real researchers have some $$ to allow meaningful sample sizes.

 
Any attempt to tackle dishonesty will require a controlled environment first of all. Once that is established you can take further measures, such as interspersing real trials with probes containing samples that should definitely be distinguishable. The number of trials used is merely one component of the overall experimental design, and simply increasing the sample size on its own will have little value.
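A rough sketch of what that probe screening might look like (hypothetical trial counts and an invented pass/fail rule, purely for illustration):

```python
# Mix genuinely hard trials with "catch" probes whose difference should be
# unmistakable, then discard listeners who miss probes: failing a probe
# suggests the listener wasn't really doing the test.
import random

def build_schedule(n_real=16, n_probes=4, seed=0):
    rng = random.Random(seed)
    trials = ["real"] * n_real + ["probe"] * n_probes
    rng.shuffle(trials)  # listener can't predict which trials are probes
    return trials

def passes_screening(trials, correct_flags, max_probe_misses=0):
    probe_misses = sum(1 for t, ok in zip(trials, correct_flags)
                       if t == "probe" and not ok)
    return probe_misses <= max_probe_misses

schedule = build_schedule()
print(schedule)
```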
 
The experimental objectives and overall design must be considered well before statistics are applied. A lot of this discussion seems to be in danger of putting the cart before the horse. The ABX test is definitely a rough instrument, but that's appropriate because it's designed to test comprehensive and broad claims: I've yet to see a hi-fi ad claiming that only 5% of the populace will gain any benefit, or that you might only hear the difference a fraction of the time. If we want to investigate 'low true probabilities of detection' we need a far more complex design that takes into account interpersonal variability as well as variations in perception and acuity within a single subject. The success of any such experiment would depend on the measures taken to control for sources of variability that lie outside the factor being tested.
 
There's a reason real researchers aren't being given $$ to investigate whether such subtle and elusive effects are real: no-one's really interested in the results.
 
 
I believe you are mixing up two concepts here - one being what registers as a statistically significant result (95% or whatever significance level is deemed acceptable for the question being posed) & the other being how close we can get to "truly random numbers" generated by algorithms. Two completely separate & distinct notions, separated, as RRod stated, by the number of trials being run

 
As above, experimental design comes first. Consideration of statistical tests and the number of trials required to yield an acceptable result only has meaning once you are sure of your overall design and what it is actually testing.
 
No, it wouldn't show up because, by their own admission on Hydrogenaudio, they don't even listen to the A/B samples in the following trials - they simply hit a random button each time the trial starts - so there is no mechanism whereby the audio samples could influence the result subconsciously.
 

Certainly, if they aren't listening to the audio then they aren't doing the test and those results are invalid. But a couple of anecdotal accounts of invalid tests aren't sufficient to throw out the multitude of negative results that have been reported.
 
Dec 29, 2015 at 8:34 AM Post #55 of 151
 
Any attempt to tackle dishonesty will require a controlled environment first of all. Once that is established you can take further measures, such as interspersing real trials with probes containing samples that should definitely be distinguishable. The number of trials used is merely one component of the overall experimental design, and simply increasing the sample size on its own will have little value.
 
The experimental objectives and overall design must be considered well before statistics are applied. A lot of this discussion seems to be in danger of putting the cart before the horse. The ABX test is definitely a rough instrument, but that's appropriate because it's designed to test comprehensive and broad claims: I've yet to see a hi-fi ad claiming that only 5% of the populace will gain any benefit, or that you might only hear the difference a fraction of the time. If we want to investigate 'low true probabilities of detection' we need a far more complex design that takes into account interpersonal variability as well as variations in perception and acuity within a single subject. The success of any such experiment would depend on the measures taken to control for sources of variability that lie outside the factor being tested.
 
There's a reason real researchers aren't being given $$ to investigate whether such subtle and elusive effects are real: no-one's really interested in the results.

Dishonesty can run both ways, so we are not just talking about false positives. False negatives can arise for many more reasons than dishonesty, & your probes are definitely one way to address this.
 
I don't think the differences now being perceived in audio amount to "comprehensive & broad claims", & I have never seen it stated that ABX tests are really only suitable for medium to large impairments in audio.
 
I do agree that research into this area is of no interest to academia.

As above, experimental design comes first. Consideration of statistical tests and the number of trials required to yield an acceptable result only has meaning once you are sure of your overall design and what it is actually testing.
 
Certainly, if they aren't listening to the audio then they aren't doing the test and those results are invalid. But a couple of anecdotal accounts of invalid tests aren't sufficient to throw out the multitude of negative results that have been reported.

Well, the lack of honesty controls & false-negative controls is a pretty good reason to doubt the validity of all negative results - false-positive controls are already built into the ABX test by its very design.

 
Dec 29, 2015 at 10:24 AM Post #56 of 151
I get the whole idea of wishing for truly accurate data and facts; it's the ideal we all aspire to. But let's be honest, nothing provides that, not even real scientists doing real science. The best we get is closer to the truth and further away from fooling ourselves. You can always find some flaw in any test; there is always the potential for something we didn't think of, or don't even know exists. There is always a limit to the resolution of the measurements...
What should matter is whether, on average, one method gives more reliable results than another. Shooting down ABX because of the risk of false negatives or false positives looks to me like someone saying that cars and bikes shouldn't be used for transportation because they suck at climbing stairs. One problem doesn't mean it doesn't work great for everything else. You use a test for what it's relatively good at, and for the rest, you try to find a better test. If you think the reliability is poor, then you just don't draw the conclusions you shouldn't draw, and we're back to the burden of proof and how people should always avoid making claims for half-baked reasons. And "I failed to pass the ABX test" isn't saying the same thing as "there is no difference"; on that we agree very much. We go as far as we confidently can. That doesn't mean we shouldn't use a test that isn't 100% reliable; even 80% reliable is better than nothing.
 
Right now, ABX is available to all curious people, and it can help find out a number of things on a number of subjects. Any time I use an obviously audible difference, I get 100% in an ABX, so it's at least as good as sighted evaluation for obvious stuff - not inferior, and not the lottery you're trying to depict.
How do I know if I pick an amp because it sounds better or because I like how it looks? Well, it's simple: you hide how it looks and I test again. There is nothing wrong with having a preference for a given look, and it's very OK to buy a product for that reason. But when pretending to test only for sound, we should at least try to remove as many external variables as possible. Sighted evaluation fails to offer that, 100% of the time.
 
So I wish for better than blind testing, or simply better than ABX, and I'm also OK with not drawing weird conclusions or giving too much credibility to some half-controlled personal test. But sighted evaluation isn't an alternative to blind testing, and all those arguing against it while offering no other choice don't know what testing sound means. So any claim about sound made from sighted evaluation should be called up for verification when you feel it might be wrong, and checked with whatever available audio test removes at least some external variables. And if someone made a claim about how something sounds, then the burden of proof says he should either retract his statement or do what he can to properly confirm that he didn't make stuff up.
If he can't, then most likely he should retract his statement, as nobody forced him to claim something he couldn't try to verify.
 
Maybe more than talking about the burden of proof, I should have talked about how silly it is to make statements we aren't entitled to make. But parenting should have made that clear long ago, and we all know you can't stop people from talking nonsense, so instead I pointed to the burden of proof, which is, IMO, a proper arguing method against illegitimate claims.
 
Dec 29, 2015 at 1:24 PM Post #57 of 151
 
I get the whole idea of wishing for truly accurate data and facts; it's the ideal we all aspire to. But let's be honest, nothing provides that, not even real scientists doing real science. The best we get is closer to the truth and further away from fooling ourselves. You can always find some flaw in any test; there is always the potential for something we didn't think of, or don't even know exists. There is always a limit to the resolution of the measurements...
What should matter is whether, on average, one method gives more reliable results than another. Shooting down ABX because of the risk of false negatives or false positives looks to me like someone saying that cars and bikes shouldn't be used for transportation because they suck at climbing stairs. One problem doesn't mean it doesn't work great for everything else. You use a test for what it's relatively good at, and for the rest, you try to find a better test. If you think the reliability is poor, then you just don't draw the conclusions you shouldn't draw, and we're back to the burden of proof and how people should always avoid making claims for half-baked reasons. And "I failed to pass the ABX test" isn't saying the same thing as "there is no difference"; on that we agree very much. We go as far as we confidently can. That doesn't mean we shouldn't use a test that isn't 100% reliable; even 80% reliable is better than nothing.

Any test needs some measure of calibration or qualification to check its suitability for the role it is being used in. In a test which relies on statistical analysis, such as ABX testing, this is the power of the test, which depends on the acceptable level of false positives, the level of false negatives, the sample size, etc. The power of the test is directly related to its accuracy.
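As an illustration of how a power requirement pins down the number of trials, here's a small sketch (Python with scipy; the 75% true discrimination rate and 80% power target are example choices, not figures from this thread):

```python
# Smallest n for which the 5%-level one-sided binomial test has at least
# 80% power against a true discrimination rate of 0.75.
from scipy.stats import binom

def trials_needed(p_alt=0.75, alpha=0.05, target_power=0.80, n_max=200):
    for n in range(5, n_max + 1):
        # Smallest k with P(X >= k | guessing) <= alpha:
        k = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) <= alpha)
        power = binom.sf(k - 1, n, p_alt)
        if power >= target_power:
            return n, k, power
    return None

print(trials_needed())  # (23, 16, 0.804): 23 trials, 16 must be correct
```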
 
You made claims about the accuracy of home-based blind testing & I asked for statistics to back up this claim. I'm not shooting down home-based ABX testing - I find that without these statistics, it is just another anecdote about listening (one that seems to be skewed towards not hearing any differences). If you can present false-negative statistics for home-based ABX tests that show my opinion is wrong, then I will change my view.
 
Essentially you are asking me to accept a tool that you claim is accurate, yet you can't show me any calibration results for the tool. Instead you talk about how all the other tools are not good enough - I'm not convinced, just as I would not be convinced if you claimed you were more accurate at target practice than others & tried to prove it by telling me about one guy's missing eye, another's twitch, or another's lack of balance :)

Right now, ABX is available to all curious people, and it can help find out a number of things on a number of subjects. Any time I use an obviously audible difference, I get 100% in an ABX, so it's at least as good as sighted evaluation for obvious stuff - not inferior, and not the lottery you're trying to depict.
How do I know if I pick an amp because it sounds better or because I like how it looks? Well, it's simple: you hide how it looks and I test again. There is nothing wrong with having a preference for a given look, and it's very OK to buy a product for that reason. But when pretending to test only for sound, we should at least try to remove as many external variables as possible. Sighted evaluation fails to offer that, 100% of the time.

Yes, that's one of the problems of ABX testing - its availability to all curious people, so we see all sorts of results of varying quality without any ability to judge the quality of the test. It's just a curiosity, nothing more!!

So I wish for better than blind testing, or simply better than ABX, and I'm also OK with not drawing weird conclusions or giving too much credibility to some half-controlled personal test. But sighted evaluation isn't an alternative to blind testing, and all those arguing against it while offering no other choice don't know what testing sound means. So any claim about sound made from sighted evaluation should be called up for verification when you feel it might be wrong, and checked with whatever available audio test removes at least some external variables. And if someone made a claim about how something sounds, then the burden of proof says he should either retract his statement or do what he can to properly confirm that he didn't make stuff up.
If he can't, then most likely he should retract his statement, as nobody forced him to claim something he couldn't try to verify.
 
Maybe more than talking about the burden of proof, I should have talked about how silly it is to make statements we aren't entitled to make. But parenting should have made that clear long ago, and we all know you can't stop people from talking nonsense, so instead I pointed to the burden of proof, which is, IMO, a proper arguing method against illegitimate claims.

Yes, I'm glad you wish for a better test, but that doesn't mean we should demand someone run a curiosity test that produces results of unknown quality.

 
Dec 29, 2015 at 3:59 PM Post #58 of 151
I've shown the signals and gotten positive ABX results with the Foobar2000 plugin - Ethan Winer changed his position on audibility of phase/polarity when he got positive results
 
so it simply isn't the case that coming into a DBT ABX controlled listening test with a skeptical preconception somehow guarantees nothing will be heard
 
Dec 29, 2015 at 5:21 PM Post #59 of 151
  I've shown the signals and gotten positive ABX results with the Foobar2000 plugin - Ethan Winer changed his position on audibility of phase/polarity when he got positive results
 
so it simply isn't the case that coming into a DBT ABX controlled listening test with a skeptical preconception somehow guarantees nothing will be heard

 
Another thing to try is to lower a track to an obviously bad bit depth/sample rate and then ramp back up. What happens for me is that, as the bit depth gets high enough to support the dynamic range of the song and the sample rate gets near 2x my hearing limit, it gets harder and harder to pass an ABX. It would seem weird to me that, when I hit those marks, some "skeptic switch" suddenly gets turned on, as opposed to there just being a natural asymptotic effect.
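If you want to build such a ramp of degraded files to ABX against the original, here's a rough sketch (Python with numpy/scipy, assuming a 16-bit PCM WAV input; filenames and parameter values are just examples):

```python
# Re-quantize a track to a lower bit depth (with TPDF dither) and resample
# it to a lower rate, writing the result back out as a 16-bit WAV.
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def degrade(in_path, out_path, bits=8, target_rate=22050):
    rate, data = wavfile.read(in_path)        # expects 16-bit integer PCM
    x = data.astype(np.float64) / 32768.0     # normalize to [-1, 1)
    g = gcd(target_rate, rate)
    x = resample_poly(x, target_rate // g, rate // g, axis=0)
    step = 2.0 ** (1 - bits)                  # quantizer step for `bits` bits
    # TPDF dither: difference of two uniforms gives a triangular distribution.
    dither = (np.random.rand(*x.shape) - np.random.rand(*x.shape)) * step
    q = np.round((x + dither) / step) * step
    q = np.clip(q, -1.0, 1.0 - step)
    wavfile.write(out_path, target_rate, (q * 32768.0).astype(np.int16))

degrade("track.wav", "track_8bit_22k.wav", bits=8, target_rate=22050)
```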
 
Dec 29, 2015 at 5:21 PM Post #60 of 151
  I've shown the signals and gotten positive ABX results with the Foobar2000 plugin - Ethan Winer changed his position on audibility of phase/polarity when he got positive results
 
so it simply isn't the case that coming into a DBT ABX controlled listening test with a skeptical preconception somehow guarantees nothing will be heard

I've seen that happen before - negative bias preventing audible differences from being heard, & ABX null result after null result produced until something changes the negative bias.
 
In Winer's case his beliefs very much determine what he hears - the trigger for him appeared to be the AES paper that stated phase shift was audible. Up to then he hadn't heard any audible effects from normal-range phase differences &, as with his many claims about other issues in audio, stated: "Phase shift per se is inaudible in typical amounts. It's a total non-issue, a bogeyman invented by audiophile magazine writers to explain stuff they don't understand"
 
I've seen others repeatedly report null ABX tests until someone posts positive results & then differences become audible.
 
One has to ask - are ABX tests simply a test of one's motivation, belief system, focus, training, etc.? They certainly don't seem to qualify as the accurate test tool they are purported to be.
 
