ABX Blind testing - the Ins & outs
Jan 20, 2016 at 10:11 AM Post #46 of 64
Hi MM,

I often like your posts, not because I often agree, but because your role is that of being skeptical of the “skeptic side”. You don’t fear science, but rather embrace some interesting ideas such as auditory scene analysis, and the need to know *how* to measure something. But for years (and in this thread), you have made some incorrect statements that must be addressed.
 
First, let’s clear up an important definition: “blind” testing does not relate to the use of the visual system; it means certain *information* about the test is withheld from the subject. Because we often use the visual system to obtain information, and because we often borrow visual terms for lack of more general terms (the sound is “bright” or “colored”), “blind” is used. Note that an ABX test where the subject is blindfolded but your software announces “A is playing” or “B is playing”, even for the X playback, is “sighted”, while an open-eyed, lights-on test where you see everything EXCEPT which device is playing is “blind”. I know you already know this, but your arguments about the importance of multisensory integration for perception (my field) don’t extend to “blind testing”. Blind testing in no way inhibits multisensory integration.
 
Second, trained scientists are like a master carpenter or chef: one expects their outcomes/results to be better than a lay-person’s, but that is not always so. Some lay people obtain excellent results, while results from “professionals” are often lacking. There is nothing magical about doing it right. As you point out, one must know which pitfalls to avoid, and avoid them. In science, this is true not only for experimental design and execution, but also in critical evaluation of others’ work. Which leads to…
 
Third, there is no symmetry between sighted and blind tests, as you claim:
It boils down to two methods of evaluation - both of which can skew the results but in completely opposite directions - sighted towards false positives & blind towards false negatives.


No. Of course the quality of any design depends completely upon what question(s) you are hoping to answer. But when testing audibility of differences between {devices, sources, formats,…}, choosing to make the test sighted is always a flaw/weakness. You confound your results with known biases, and therefore complicate interpretation of results by adding variables. Perhaps this results in false positives, but it can also lead to false negatives, depending on exactly what you are testing. 
 
Making the blind/sighted choice is independent of other choices.
I have never seen blind ABX testing repeated over a week with different music, at different times of the day, etc. I have always seen a one-shot test of maybe 16 or 20 trials. Maybe repeated a couple of times but that tends to be it.


Choices of music, duration, time of day, number of trials, etc. can/should be made independent of blind/sighted. You can’t disparage blind testing just because the other choices made are not to your liking.
 
Also, I would argue that an ABX blind test can’t give you a false negative. It is not designed to give any info on negatives. It can give a true positive, a false positive (with p telling you how likely such a score would be from pure guessing) and a “failure to demonstrate the effect”. I know you believe (probably correctly) that many people interpret a test with p>0.05 as a negative result, but that is their problem in interpretation, not a problem with the test. A p>0.05 result is simply an “I don’t know”. The fact that SETI has not received any evidence of life on other planets does NOT show there is none, even though MANY people have participated.
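To put numbers on that: here is a minimal sketch (Python with scipy; the example scores are mine, picked purely for illustration) of what p does and does not report for an ABX run.

[code]
# Exact one-sided binomial p-value for an ABX score.
# p answers "how likely is a score at least this good from pure guessing?",
# NOT "how likely is it that no audible difference exists".
from scipy.stats import binom

def abx_p_value(correct, trials):
    """P(at least `correct` right out of `trials`) under guessing (p = 0.5)."""
    return binom.sf(correct - 1, trials, 0.5)

for correct, trials in [(12, 16), (9, 16), (14, 20)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.3f}")

# 12/16 -> p ~ 0.038 (a positive at the 5% level)
# 9/16  -> p ~ 0.40  (an "I don't know", not a demonstrated negative)
# 14/20 -> p ~ 0.058 (still an "I don't know", despite 70% correct)
[/code]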
 
But that does not mean the test should never be done sighted. All tests have flaws, some grave, some negligible. In the design, you have to choose your trade-offs. For example, although choosing to make a test "blind" won’t weaken or damage the results, it may make the test prohibitively expensive, time-consuming, or difficult. I can’t see many people going through the effort of duplicating Harman’s setup just so they can test two sets of speakers before buying. You may just have to accept the flaw of being sighted.
 
I know you will keep challenging objectivists who have closed minds, as well you should, but you may want to reevaluate some of your criticisms.
Happy listening!
 
Cheers,
SAM
 
Jan 20, 2016 at 1:15 PM Post #47 of 64
Also, I would argue that an ABX blind test can’t give you a false negative. It is not designed to give any info on negatives. It can give a true positive, a false positive (with p telling you how likely such a score would be from pure guessing) and a “failure to demonstrate the effect”. I know you believe (probably correctly) that many people interpret a test with p>0.05 as a negative result, but that is their problem in interpretation, not a problem with the test. A p>0.05 result is simply an “I don’t know”. The fact that SETI has not received any evidence of life on other planets does NOT show there is none, even though MANY people have participated.

 
Just a nitpick on this:
 
If someone has an actual probability of discrimination > 50%, then failing to reject the null hypothesis based on their choices would constitute a false negative. The problem here is statistical power, which can only be so high if we're both limiting trials and setting the level at 5%. In a 10-trial ABX test where the subject's actual discrimination probability is 60%, the test has only 16% power, meaning an 84% chance of failing even though they can discriminate better than a coin-flip.
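RRod's figures can be checked directly; a minimal sketch (Python with scipy, assuming an 8-of-10 passing criterion, whose exact false-positive rate of ~5.5% is the closest a 10-trial test gets to the 5% level):

[code]
# Power of a 10-trial ABX test for a listener who truly discriminates 60% of
# the time, using "8 or more correct" as the passing score.
from scipy.stats import binom

n, p_true = 10, 0.6
size_8  = binom.sf(7, n, 0.5)    # false-positive rate of the 8/10 rule: ~0.055
power_8 = binom.sf(7, n, p_true) # chance the 60% listener passes: ~0.167

print(f"8/10 rule: size = {size_8:.3f}, power = {power_8:.3f}")
# ~16% power, i.e. ~84% chance of "failing" despite genuine ability.
# Tightening to 9/10 (exact size ~1.1%) drops power further, to ~4.6%.
[/code]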
 
Jan 20, 2016 at 1:37 PM Post #48 of 64
Hi RRod,
You bring up a very important and interesting point IF(!!!) you know the effect size (you call it discrimination probability), which we pretty much never do. It is not a nitpick: effect size is quite difficult, or impossible, to obtain in this context (listening tests). It would take a whole battery of tests (much longer than the ABX) to obtain it. You point out that >50% is important and provide a speculative example of 60%. I've seen others say "if the difference is 'night and day' the effect size should be XX, whereas with a subtle difference it should be YY". The speculations can be quite instructive, but they are not real world. I have often struggled with the useless tautology: if only I knew everything about your brain, I'd know exactly how to test your brain! :wink:
Cheers,
SAM
 
Jan 20, 2016 at 1:58 PM Post #49 of 64
Hi RRod,
You bring up a very important and interesting point IF(!!!) you know the effect size (you call it discrimination probability), which we pretty much never do. It is not a nitpick: effect size is quite difficult, or impossible, to obtain in this context (listening tests). It would take a whole battery of tests (much longer than the ABX) to obtain it. You point out that >50% is important and provide a speculative example of 60%. I've seen others say "if the difference is 'night and day' the effect size should be XX, whereas with a subtle difference it should be YY". The speculations can be quite instructive, but they are not real world. I have often struggled with the useless tautology: if only I knew everything about your brain, I'd know exactly how to test your brain! :wink:

Cheers,
SAM

 
You never know the parameter value, that's why you're doing statistics. You are right in that someone claiming "night and day" is essentially saying they have a high probability of success, and thus power should be sufficient. But part of mmerrill's argument is that often people don't have a high probability, due to various reasons he posits (lack of training, lack of interest), and that your typical "9/10" ABX test will never eke out their small probability. A true-blue controlled test can handle this by training, compensation, and setup consistency. I should stop there b/c I told myself I wouldn't get in on this thread ^_^
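To illustrate how fast the trial count grows for small effects (my numbers, a rough sketch under the same exact-binomial assumptions as above; exact power zig-zags with the trial count, so treat the results as approximate):

[code]
# Smallest number of ABX trials giving >= 80% power at an exact 5% one-sided
# level, for various true discrimination probabilities.
from scipy.stats import binom

def trials_needed(p_true, power=0.8, alpha=0.05):
    for n in range(10, 2001):
        k = int(binom.isf(alpha, n, 0.5)) + 1  # smallest passing score at level alpha
        while binom.sf(k - 1, n, 0.5) > alpha: # guard against boundary rounding
            k += 1
        if binom.sf(k - 1, n, p_true) >= power:
            return n
    raise ValueError("not reachable within 2000 trials")

for p in (0.9, 0.7, 0.6, 0.55):
    print(f"true discrimination {p}: ~{trials_needed(p)} trials")

# "Night and day" (0.9) needs only about a dozen trials; a subtle 0.55 needs
# several hundred, far beyond any casual 10- or 16-trial session.
[/code]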
 
Jan 20, 2016 at 2:19 PM Post #50 of 64
Hi MM,


I often like your posts, not because I often agree, but because your role is that of being skeptical of the “skeptic side”. You don’t fear science, but rather embrace some interesting ideas such as auditory scene analysis, and the need to know *how* to measure something. But for years (and in this thread), you have made some incorrect statements that must be addressed.
Thanks S&M - I always enjoy your well-thought-out & informed posts - it certainly makes me evaluate my line of reasoning & clarify my thoughts. It is very easy to lose the kernel of the reasoning when not an expert in auditory processing & when trying to address the wide & varied counter-arguments. I welcome your expertise in these matters as I see it as an opportunity to learn so please continue your participation, if you can.

First, let’s clear up an important definition: “blind” testing does not relate to the use of the visual system; it means certain *information* about the test is withheld from the subject. Because we often use the visual system to obtain information, and because we often borrow visual terms for lack of more general terms (the sound is “bright” or “colored”), “blind” is used. Note that an ABX test where the subject is blindfolded but your software announces “A is playing” or “B is playing”, even for the X playback, is “sighted”, while an open-eyed, lights-on test where you see everything EXCEPT which device is playing is “blind”. I know you already know this, but your arguments about the importance of multisensory integration for perception (my field) don’t extend to “blind testing”. Blind testing in no way inhibits multisensory integration.
Sure, it's knowledge of what's being tested that is crucial & the kernel of what blind testing means.

In trying to explain my understanding of auditory perception, I was trying to convey that this perception has the job of attempting to solve a continually changing problem in real time. The problem is that the pressure waves hitting our eardrums do not usually contain sufficient data for a concrete determination of the auditory scene - we need to call on extraneous sources of data to help resolve this indecision. So my point about sight was that it was another source of contemporaneous data that is normally used to correlate with the auditory data & narrow down the number of possible outcomes that go to creating the what/where/when auditory scenes on a moment-to-moment basis. In my posts I may have confused this point & I thank you for the correction & for keeping the message accurate. The data from sight is just one (although usually a very important one) of a number of other data points used in the attempts by auditory processing to "solve" this intractable problem of auditory scene analysis - stored auditory models & experience of the behaviour of the auditory world are just two more I can think of - you can probably nominate other ways that the "problem" is resolved to a level of residual uncertainty that evolution has made accurate enough for our survival.

Second, trained scientists are like a master carpenter or chef: one expects their outcomes/results to be better than a lay-person’s, but that is not always so. Some lay people obtain excellent results, while results from “professionals” are often lacking. There is nothing magical about doing it right. As you point out, one must know which pitfalls to avoid, and avoid them. In science, this is true not only for experimental design and execution, but also in critical evaluation of others’ work. Which leads to…
Indeed, I wasn't going there or appealing to authority - my criticism was of home-run blind testing - my point was that the area of perceptual testing is rife with pitfalls, & only knowledge of these pitfalls, & addressing them, can lead to anything approaching a set of results that is worth looking at.

Third, there is no symmetry between sighted and blind tests, as you claim:
It boils down to two methods of evaluation - both of which can skew the results but in completely opposite directions - sighted towards false positives


No. Of course the quality of any design depends completely upon what question(s) you are hoping to answer. But when testing audibility of differences between {devices, sources, formats,…}, choosing to make the test sighted is always a flaw/weakness. You confound your results with known biases, and therefore complicate interpretation of results by adding variables. Perhaps this results in false positives, but it can also lead to false negatives, depending on exactly what you are testing.
My statement - that the difference between sighted & blind testing is that one is skewed towards false positives & the other towards false negatives - may not be symmetrically balanced, as we don't know the level of false negatives or false positives for each, but as a generalisation I still think it holds.

Making the blind/sighted choice is independent of other choices.
I have never seen blind ABX testing repeated over a week with different music, at different times of the day, etc. I have always seen a one-shot test of maybe 16 or 20 trials. Maybe repeated a couple of times but that tends to be it.



Choices of music, duration, time of day, number of trials, etc. can/should be made independent of blind/sighted. You can’t disparage blind testing just because the other choices made are not to your liking.
Sure, but again, I'm not disparaging blind testing - just the home-run blind tests that are suggested/demanded on audio forums, particularly in the science/objectivist sections. My thread title should probably have made that clear, but I think reading the first couple of posts makes it clear enough: I give examples from ultmusicsnob's description of his procedures, in which he achieved positive ABX results, & of just how difficult this is.

One of the criticisms of blind "tests" Vs casual listening (both of which people use to evaluate audio devices) is that, in casual listening, I (& I assume others) would take our time to evaluate a device, become accustomed to it, tease out its characteristics by listening to a variety of music over a week or so. The alternative of using a single blind test in which I have to differentiate audible aspects seems to me to require a certain expertise in identifying & isolating those aspects, as well as motivation & care in undertaking the blind test. It just seems to me that blind testing of any value requires so many aspects to be gotten right.

Also, I would argue that an ABX blind test can’t give you a false negative. It is not designed to give any info on negatives. It can give a true positive, a false positive (with p telling you how likely such a score would be from pure guessing) and a “failure to demonstrate the effect”. I know you believe (probably correctly) that many people interpret a test with p>0.05 as a negative result, but that is their problem in interpretation, not a problem with the test. A p>0.05 result is simply an “I don’t know”. The fact that SETI has not received any evidence of life on other planets does NOT show there is none, even though MANY people have participated.
Sure & the use of false negative is misleading because I'm not talking about the overall result of the test (or its statistical power) - I'm really talking about how many trials in an ABX test would report no difference when a really audible difference was used in some trials (unknown to the participant). Other blind testing methods use these hidden references & anchors as a means of testing the test & validating it to some extent, but ABX testing lacks this possibility.

As a result of this lack of internal self-checking, we can have null ABX results posted which really have no validity, but we have no way of telling them from the "valid" null results. An example of this comes from Arny Krueger himself, who claims to be the originator of the ABX test. He posted a set of null ABX results & it transpired that he hadn't actually listened to most of the trials - he just randomly guessed A or B - no actual listening. So, I know that no solid conclusions are meant to be drawn from null results, but that isn't the reality when it comes to the non-scientific use of blind tests on audio forums.
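To make concrete what such an internal self-check could look like in a home ABX run, here is a hypothetical sketch (my own construction, not a feature of any existing ABX tool): interleave "catch" trials using a known, clearly audible difference, so a posted null carries at least some evidence the listener was actually listening.

[code]
# Hypothetical: an ABX schedule with hidden catch trials (e.g. a 3 dB level
# change) mixed in among the real trials.
import random

def build_schedule(n_real=16, n_catch=4, seed=None):
    rng = random.Random(seed)
    trials = [{"kind": "real"} for _ in range(n_real)]
    trials += [{"kind": "catch"} for _ in range(n_catch)]
    rng.shuffle(trials)
    for t in trials:
        t["x_is"] = rng.choice("AB")  # hidden assignment of X for this trial
    return trials

def null_is_interpretable(results, max_missed=1):
    """A null result only means something if the catch trials were passed."""
    catch = [r for r in results if r["kind"] == "catch"]
    missed = sum(1 for r in catch if r["answer"] != r["x_is"])
    return missed <= max_missed

schedule = build_schedule(seed=1)
print(sum(t["kind"] == "catch" for t in schedule), "catch trials hidden among",
      len(schedule), "total trials")
[/code]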

But that does not mean the test should never be done sighted. All tests have flaws, some grave, some negligible. In the design, you have to choose your trade-offs. For example, although choosing to make a test "blind" won’t weaken or damage the results, it may make the test prohibitively expensive, time-consuming, or difficult. I can’t see many people going through the effort of duplicating Harman’s setup just so they can test two sets of speakers before buying. You may just have to accept the flaw of being sighted.

I know you will keep challenging objectivists who have closed minds, as well you should, but you may want to reevaluate some of your criticisms.
Happy listening!

Cheers,

SAM
Again, sure, & I said it before - with a sighted test, there is a higher possibility that we mistakenly think one device is better than another (after some extended listening time - up to that cut-off point we can usually return the device) & only later discover that it isn't. We can sell it on at some loss & move on.

With blind testing there is a higher possibility that we miss an actual difference between devices & miss out on improving our listening experience.

I think that it divides into two brain groupings seen in life, politics, etc. - some look on unknowns as an opportunity to discover something new; others look on unknowns as a threat.

Edit: And again I look forward to your input & yes, to your corrections of my thinking - it's really what forums should be about: better understanding of our own thinking & learning new perspectives on matters.
 
Jan 22, 2016 at 8:09 AM Post #51 of 64
In trying to explain my understanding of auditory perception, I was trying to convey that this perception has the job of attempting to solve a continually changing problem in real time. The problem is that the pressure waves hitting our eardrums do not usually contain sufficient data for a concrete determination of the auditory scene - we need to call on extraneous sources of data to help resolve this indecision. So my point about sight was that it was another source of contemporaneous data that is normally used to correlate with the auditory data & narrow down the number of possible outcomes that go to creating the what/where/when auditory scenes on a moment-to-moment basis. In my posts I may have confused this point & I thank you for the correction & for keeping the message accurate. The data from sight is just one (although usually a very important one) of a number of other data points used in the attempts by auditory processing to "solve" this intractable problem of auditory scene analysis - stored auditory models & experience of the behaviour of the auditory world are just two more I can think of - you can probably nominate other ways that the "problem" is resolved to a level of residual uncertainty that evolution has made accurate enough for our survival.

Your understanding of auditory perception is impressive for someone not in that field. The role the visual system plays is as you describe it, but only when the visual data can be “used to correlate with the auditory data”…”on a moment-to-moment basis”. You mentioned the cocktail party effect earlier; that and the McGurk effect are great examples of visual data used in auditory perception, because we see the sound-producer (mouth) moving in time, in ways we know produce certain sounds. These influence what we hear; the ventriloquist effect influences where we locate the sound source, using vision. Here vision helps using “experience of the behaviour of the auditory world”. But what role do you see for vision when listening to a stereo? What should a stack of lit-up electronic boxes with a couple of cloth-covered wooden boxes sound like? What does the visual system offer “on a moment-to-moment basis”? I can’t imagine a role other than informing us about which pretty item is magically making sounds. This provides cognitive information, but nothing to help with auditory scene analysis. And therefore the visual sensory data plays no role, good or bad, in our auditory perception for either sighted or blind tests, other than cognitive information that could/will produce cognitive biases.
My statement - that the difference between sighted & blind testing is that one is skewed towards false positives & the other towards false negatives - may not be symmetrically balanced, as we don't know the level of false negatives or false positives for each, but as a generalisation I still think it holds.

That’s your answer to what I wrote? You are stubborn, but that’s okay. I read what you are saying as: an incompetently-performed/analysed blind test and an incompetently-performed/analysed sighted test are, in your view, equally bad and flawed, so one is allowed to pick their poison. I won’t argue. But add competence to both and, with all else being equal, the blind test is always better (in a better position to deliver valid results).
 
One of the criticisms of blind "tests" Vs casual listening (both of which people use to evaluate audio devices) is that, in casual listening, I (& I assume others) would take our time to evaluate a device, become accustomed to it, tease out its characteristics by listening to a variety of music over a week or so. The alternative of using a single blind test in which I have to differentiate audible aspects seems to me to require a certain expertise in identifying & isolating those aspects, as well as motivation & care in undertaking the blind test. It just seems to me that blind testing of any value requires so many aspects to be gotten right.
 

You treat it as though there are only two alternatives. You bundle one set of conditions and call it “sighted” and bundle a different set of conditions and call it “blind”. Blind/sighted is one condition, and anyone is free to choose the other conditions. Criticizing the bundle many people choose is not a valid criticism of one component of that bundle. Don’t you agree that: “It just seems to me that all testing of any value requires so many aspects to be gotten right”?
Sure & the use of false negative is misleading because I'm not talking about the overall result of the test (or its statistical power) - I'm really talking about how many trials in an ABX test would report no difference when a really audible difference was used in some trials (unknown to the participant).

Oh, the best answer here will take too long! For months, I’ve been wanting to write a nano-tutorial on thresholds, both absolute (detection) and differential (discrimination), but you know…time! But first, you are right: “false negative” refers to the whole test, not individual trials. But I know that you have been concerned with the results of individual trials for a long time. You are one of the reasons I’ve wanted to make the nano-tutorial. So as a preview, let’s look at the one “missed” trial in a 19/20 positive result. What do you think it means? (rhetorical) Is it a “failure” in your mind? You say “report no difference when a really audible difference was used” and it is possible that it was {distraction, wrong button-push, evil intent}, but by far the most likely reason, ‘cuz that’s how perception works, is that for that one trial (and maybe a couple of others where they “guessed” right) it really was “inaudible”! Thresholds behave as probability density functions, not binary switches. So that 1/20 was not “wrong”.
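A toy illustration of that last point, with a made-up logistic psychometric function (the threshold and slope numbers are purely illustrative, not measured from anyone):

[code]
# Near threshold, each trial is a draw from a probability, not a yes/no switch:
# the same listener with the same stimulus can genuinely hear a difference on
# one trial and not on the next.
import math

def p_correct(level_db, threshold_db=1.0, slope=0.4):
    """Two-alternative psychometric function: chance (0.5) rising to 1.0."""
    return 0.5 + 0.5 / (1.0 + math.exp(-(level_db - threshold_db) / slope))

for level in (0.4, 0.8, 1.0, 1.2, 2.0):
    print(f"{level:.1f} dB difference -> P(correct) = {p_correct(level):.2f}")

# At the conventional "threshold" the listener is right only ~75% of the time,
# so one miss in 20 trials need not mean anything went wrong on that trial.
[/code]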
As a result of this lack of internal self-checking, we can have null ABX results posted which really have no validity, but we have no way of telling them from the "valid" null results. An example of this comes from Arny Krueger himself, who claims to be the originator of the ABX test. He posted a set of null ABX results & it transpired that he hadn't actually listened to most of the trials - he just randomly guessed A or B - no actual listening. So, I know that no solid conclusions are meant to be drawn from null results, but that isn't the reality when it comes to the non-scientific use of blind tests on audio forums.

I’m familiar with that case. But you seem immune to the need for quality and care in sighted tests too. Can you agree that: As a result of this lack of internal self-checking we can have sighted results posted which really have no validity but we have no way of telling the "valid" results. … So, I know that no solid conclusions are meant to be drawn from sighted results but that isn't the reality when it comes to the non-scientific use of sighted tests on audio forums.
it's really what forums should be about: better understanding of our own thinking & learning new perspectives on matters

Yes, and I'm sorry to admit that at this point, I do more learning than contributing in forums... In about 10 years I'll be forced to retire, so unless I find a way around the rules, I promise to contribute more then... at the latest.
 
Keep me honest and fire away!
 
Jan 22, 2016 at 10:12 AM Post #52 of 64
In trying to explain my understanding of auditory perception, I was trying to convey that this perception has the job of attempting to solve a continually changing problem in real time. The problem is that the pressure waves hitting our eardrums do not usually contain sufficient data for a concrete determination of the auditory scene - we need to call on extraneous sources of data to help resolve this indecision. So my point about sight was that it was another source of contemporaneous data that is normally used to correlate with the auditory data

Your understanding of auditory perception is impressive for someone not in that field. The role the visual system plays is as you describe it, but only when the visual data can be “used to correlate with the auditory data”…”on a moment-to-moment basis”. You mentioned the cocktail party effect earlier; that and the McGurk effect are great examples of visual data used in auditory perception, because we see the sound-producer (mouth) moving in time, in ways we know produce certain sounds. These influence what we hear; the ventriloquist effect influences where we locate the sound source, using vision. Here vision helps using “experience of the behaviour of the auditory world”. But what role do you see for vision when listening to a stereo? What should a stack of lit-up electronic boxes with a couple of cloth-covered wooden boxes sound like? What does the visual system offer “on a moment-to-moment basis”? I can’t imagine a role other than informing us about which pretty item is magically making sounds. This provides cognitive information, but nothing to help with auditory scene analysis. And therefore the visual sensory data plays no role, good or bad, in our auditory perception for either sighted or blind tests, other than cognitive information that could/will produce cognitive biases.
Yes, I totally agree & I have criticised others for using the McGurk effect as some "proof" about the bias of sighted evaluations - it has nothing to do with that - it undoubtedly does influence our listening at a concert - when that guitarist leans towards the crowd, does his guitar actually change timbre, or is it our perception being influenced? :)

I'm not sure I used that idea of sight in regard to blind listening &, if I did, I was wrong - I may have got carried away with myself :).

I may have used it to try to show how all senses are just best-guess fits to the data & very much follow the principles of science - this is the best fit to the data unless new data comes along which changes our model - whether that new data comes from vision, experience, or simply new auditory signals.

I was listening to David Eagleman last night on a TV documentary about the brain & he was giving the example of the visual system, where the signals from the eyes arrive at the relay station of the thalamus & the thalamus communicates with the visual cortex, BUT the data conveyed from the visual cortex to the thalamus are about 6 times greater than the signals from the thalamus to the visual cortex. In other words, the visual cortex is sending an expected visual model to the thalamus, which is comparing it with the data signals, & any corrections are being sent back to the visual cortex. I know this is grossly simplified, so forgive my layman's attempt at description, but the point is that the top-down processing (visual cortex signals to thalamus) plays a hugely important role (6 times the bottom-up signals) in this visual sense. Auditory perception shares much overlapping functionality with visual perception & this is just one example.

My whole line of argument is that what we are doing in blind listening is still best-fit guesswork - just in case people think that there is some magic purity to it.
My statement that the difference between sighted…

That’s your answer to what I wrote? You are stubborn, but that’s okay. I read what you are saying as: an incompetently-performed/analysed blind test and an incompetently-performed/analysed sighted test are, in your view, equally bad and flawed, so one is allowed to pick their poison. I won’t argue. But add competence to both and, with all else being equal, the blind test is always better (in a better position to deliver valid results).
OK, I'll read back over what you wrote & see if my reply was off-topic, but first let me answer this. I'm saying that it is very, very difficult to do a competent blind test & so I treat them as anecdotal reports, just like the sighted anecdotal reports that we see all the time on audio forums. In fact, if someone gives a detailed example of what they heard in a sighted report, I find it much more useful than a null report from an ABX test. If there is a positive report from an ABX test & the descriptions of what was heard & differentiated are given, then again I am much more interested in this - hence I gave ultmusicsnob's examples at the top of the thread.

So, again, my general point is that in home-based listening, I consider detailed sighted evaluations of much more value & believability than null blind tests. If such blind tests had internal self-checks built in, there would be something to hang hope onto, but ABX tests don't, so all we have is a null result with no idea of the quality of the testing.


One of the criticisms of blind "tests" Vs casual listening (both of which people use to evaluate audio devices) is that, in casual listening, I…

You treat it as though there are only two alternatives. You bundle one set of conditions and call it “sighted” and bundle a different set of conditions and call it “blind”. Blind/sighted is one condition, and anyone is free to choose the other conditions. Criticizing the bundle many people choose is not a valid criticism of one component of that bundle. Don’t you agree that: “It just seems to me that all testing of any value requires so many aspects to be gotten right”?
I'm sorry if my communication in this is lax, but in all cases where I mention blind tests, I'm talking about the ones that are home-administered & either insisted on as "proof" that something is heard & not imagined, or offered in forums as some sort of higher level of evidence.

If you've read what I've said, I have always maintained that properly run blind tests are indeed worthwhile, but they require expertise in perceptual testing. So, I'm not against blind testing at all - I'm against the pseudo-science posing that many engage in, trying to elevate a sham of a test into the realm of believability by using phrases like "double blind testing".

It's these pretenders that I'm addressing, not yourself & all the experts in perceptual testing.


Oh, the best answer here will take too long! For months, I’ve been wanting to write a nano-tutorial on thresholds, both absolute (detection) and differential (discrimination), but you know…time! But first, you are right: “false negative” refers to the whole test, not individual trials. But I know that you have been concerned with the results of individual trials for a long time. You are one of the reasons I’ve wanted to make the nano-tutorial. So as a preview, let’s look at the one “missed” trial in a 19/20 positive result. What do you think it means? (rhetorical) Is it a “failure” in your mind? You say “report no difference when a really audible difference was used” and it is possible that it was {distraction, wrong button-push, evil intent}, but by far the most likely reason, ‘cuz that’s how perception works, is that for that one trial (and maybe a couple of others where they “guessed” right) it really was “inaudible”! Thresholds behave as probability density functions, not binary switches. So that 1/20 was not “wrong”.
Yes & this is the tricky part - statistics - & I can easily get my knickers in a twist over it. Rather than get into the statistics side of it, my intent was to look for some internal checks in the test that gave some self-calibration of the test itself. I know MUSHRA & other forms of blind testing use hidden references & anchors as a means of cross-checking the results (I presumed) - I felt that this was the great flaw in ABX testing, particularly home testing - we have no idea what quality the results actually have.

As an aside, I would be interested to know if any studies were ever done on ABX test results as to whether there were more false negative trial results in the second half of the test compared to the first half. In other words, what I'm interested in finding out is: do people get tired & less accurate as the number of trials increases, & how much does this affect the overall outcome?
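I don't know of such a study offhand, but the check itself would be easy to run on any logged ABX session; a hypothetical sketch (the 20-trial result below is made up):

[code]
# Compare first-half vs second-half accuracy of a logged ABX run with
# Fisher's exact test. Tiny runs will rarely reach significance, which is
# itself part of the statistical-power problem discussed above.
from scipy.stats import fisher_exact

def fatigue_check(trial_results):
    """trial_results: list of booleans in presentation order, True = correct."""
    half = len(trial_results) // 2
    first, second = trial_results[:half], trial_results[half:]
    table = [[sum(first), len(first) - sum(first)],
             [sum(second), len(second) - sum(second)]]
    odds, p = fisher_exact(table, alternative="greater")  # first half better?
    return table, p

# Made-up example: 9/10 correct early, 5/10 correct late.
results = [True] * 9 + [False] + [True] * 5 + [False] * 5
print(fatigue_check(results))
[/code]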
As a result of this lack of internal self-checking, we can have null ABX results posted which really have no validity, but we have no way of telling them from the "valid" null results. An example of this comes from Arny Krueger himself, who claims to be the originator of the ABX test. He posted a set of null ABX results

I’m familiar with that case. But you seem immune to the need for quality and care in sighted tests too. Can you agree that: As a result of this lack of internal self-checking we can have sighted results posted which really have no validity but we have no way of telling the "valid" results. … So, I know that no solid conclusions are meant to be drawn from sighted results but that isn't the reality when it comes to the non-scientific use of sighted tests on audio forums.
Well, I qualified above what I thought were valuable sighted results Vs "the rest". The thing about sighted results is that they vary - there is a range of descriptions & music tracks, etc. - I know this applies to ABX tests too - but when people all start to coalesce around similar attributes of the sound, it signifies to me that this might be worth personally checking into (sure, I can be influenced by these reports, but I have had enough examples where I didn't find the same as others to know this influence effect is not an overriding bias).

it's really what forums should be about: better understanding of our own thinking

Yes, and I'm sorry to admit that at this point, I do more learning than contributing in forums... In about 10 years I'll be forced to retire, so unless I find a way around the rules, I promise to contribute more then... at the latest.


 


Keep me honest and fire away!
I expend too much energy on forums at times &, at the end of the day, few, if any, have been changed by what I've said, so one could consider it a waste of time, but I find that in trying to answer/explain my thinking, I find out what's wrong with it & clarify it somewhat.

Just to re-iterate - I'm not denigrating the use of blind tests, per se - just the opposite - I find that the purveyors of blind testing in forums do a great disservice to science, to the understanding of auditory perception & to the rigour needed for valid perceptual testing.

I believe that the posts of ultmusicsnob that I copied at the top of the thread should be made into a sticky for all those who want to do home ABX testing, to identify just what is needed to counteract the great bias ABX tests have towards the null result.
 
Jan 22, 2016 at 10:40 AM Post #53 of 64
SoundAndMotion said:
 No. Of course the quality of any design depends completely upon what question(s) you are hoping to answer. But when testing audibility of differences between {devices, sources, formats,…}, choosing to make the test sighted is always a flaw/weakness. You confound your results with known biases, and therefore complicate interpretation of results by adding variables. Perhaps this results in false positives, but it can also lead to false negatives, depending on exactly what you are testing.

My statement - that the difference between sighted & blind testing is that one is skewed towards false positives & the other towards false negatives - may not be symmetrically balanced, as we don't know the level of false negatives or false positives for each, but as a generalisation I still think it holds.

Right, I can see how you think I'm just being stubborn in my reply - let me try to explain.

My perspective is completely confined to home-based ABX testing & not the laboratory-based, formal tests that you may well have more familiarity with. So my generalisation was specifically about the sort of blind tests I see encountered on forums. I know that seems a contradiction - how can a generalisation be specific?

Your experience of blind testing is far broader than mine, so this may be why, in this instance, we appear to be talking past one another?

What I find disingenuous (not on your part, as you are stating the scientific viewpoint) is the notion that removing sighted bias is therefore necessarily a better indicator of what is really audible & what isn't (this is the question that ABX testing attempts to answer). My whole argument has been that this is not the case when it comes to home ABX tests - I would expect it to be correct when rigorous blind testing is done according to the MUSHRA or BS.1116 guidelines for such perceptual testing, but not for home testing where 90% of these guidelines are considered irrelevant. Just look at one commonly stated viewpoint: "well, if you have to be trained to hear these differences then they are not worth bothering about", or the usual "night & day" one - "if they claim night & day differences then it should be obvious in a blind test".

BTW, I'm sure you have read Leventhal's papers on the issue of the preponderance of null results in audio DB tests?
How Conventional Statistical Analyses Can Prevent Finding Audible Differences in Listening Tests
Author: Leventhal, Les
Affiliation: University of Manitoba, Winnipeg, Manitoba, Canada
AES Convention 79 (October 1985), Paper Number 2275

Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man. R3T 2N2, Canada
JAES Volume 34 Issue 6 pp. 437-453; June 1986

Statistically Significant Poor Performance in Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Manitoba, Canada
JAES Volume 42 Issue 7/8 pp. 585-587; July 1994

Analyzing Listening Tests with the Directional Two-Tailed Test
Authors: Leventhal, Les; Huynh, Cam-Loi
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man., Canada
JAES Volume 44 Issue 10 pp. 850-863; October 1996

And worth reading is Leventhal's interesting exchange in Stereophile with David Clark (one of the inventors of the ABX box along with Krueger, I believe). http://www.stereophile.com/features/141/index.html#6DTXFlthhwOK8v6M.97
 
Jan 22, 2016 at 11:31 AM Post #54 of 64
BTW, I'm sure you have read Leventhal's papers on the issue of the preponderance of null results in audio DB tests?
How Conventional Statistical Analyses Can Prevent Finding Audible Differences in Listening Tests
Author: Leventhal, Les
Affiliation: University of Manitoba, Winnipeg, Manitoba, Canada
AES Convention 79 (October 1985), Paper Number 2275

Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man. R3T 2N2, Canada
JAES Volume 34 Issue 6 pp. 437-453; June 1986

Statistically Significant Poor Performance in Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Manitoba, Canada
JAES Volume 42 Issue 7/8 pp. 585-587; July 1994

Analyzing Listening Tests with the Directional Two-Tailed Test
Authors: Leventhal, Les; Huynh, Cam-Loi
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man., Canada
JAES Volume 44 Issue 10 pp. 850-863; October 1996

And worth reading is Leventhal's interesting exchange in Stereophile with David Clark (one of the inventors of the ABX box along with Krueger, I believe). http://www.stereophile.com/features/141/index.html#6DTXFlthhwOK8v6M.97

 
You can see further discussion by Arny and others here:
https://hydrogenaud.io/index.php/topic,108127.0.html
 
Much in there sounds familiar.
 
Jan 22, 2016 at 3:59 PM Post #55 of 64
yeah let's make the false negative something huge when it isn't, and let's not talk about the "I'm always right because I'm reading the answer" illusion of truth that cripples all sighted evaluation, or all the added biases that can bend the truth in one or another direction.
because obviously when an abx fails and we don't draw any conclusion outside of "I wasn't able to pass", that is soooo important for the factual conclusions we're not making from it. so be careful and don't do abx kids, because sometimes you might fail and... :confused: and what? decide that what you tested was a matter small enough not to care much about it? which is probably true if it's small enough to fail an abx. :wink_face:

 
Jan 22, 2016 at 5:37 PM Post #56 of 64
yeah let's make the false negative something huge when it isn't,
Any form of blind testing is subject to a number of issues that have the potential to produce errors, errors that can be so large that they completely invalidate the actual test and make it useless.

These range from Experimenter's Bias through the Nocebo Effect to simple Statistical Errors and a lack of "testing the test" against known stimuli. All my arguments against the validity of a given test may be refuted by demonstrating the ability of the test (that is, the whole test set-up, the chosen test subjects, test program etc.) with phenomena that have been confirmed as giving positives in similar tests.
and let's not talk about the "I'm always right because I'm reading the answer" illusion of truth that cripples all sighted evaluation, or all the added biases that can bend the truth in one or another direction.
But if all you can do to try to validate these flawed ABX tests is to point to the flaws in sighted listening, then your argument is weak in the extreme & you have nothing else to commend it.
because obviously when an abx fails and we don't draw any conclusion outside of "I wasn't able to pass", that is soooo important for the factual conclusions we're not making from it.
It's a nice attempt at trying to fool people that really you don't pay any attention to "null" results, but we all know you do, so it's time to drop this pretense.
so be careful and don't do abx kids, because sometimes you might fail and...:confused: and what? decide that what you tested was a matter small enough not to care much about it? which is probably true if it's small enough to fail an abx. :wink_face:

Yes kids, let's all use a "test" that is designed to return a null result & let's pretend that this null result really means nothing....hehe, it's a great scam but we can get away with it. This will allow us to continue to ask for blind tests as "proof". Every now & then we might slip in that 15 years of blind tests have not given a positive result, so therefore it's as good as "proven", & nobody will notice our contradictory statements that we don't pay any attention to null results.

And let's pretend that anything which fails this test is therefore so small as to be of no audible consequence, because when we stick to this line it becomes a clever self-deception that others buy into without noticing - this test of indeterminate quality becomes the final arbiter of what matters in audio & what should be ignored as inconsequential - it all revolves around the results of the "test" - clever, eh?

The arrogance of this position is seen when it's suggested that such flawed tests are no better than sighted tests & should be considered as just another anecdotal report - the rejection of this notion & the utter horror are evident. It's not even being suggested that long-term sighted testing is better than these flawed blind tests, just that they both be considered with equal skepticism.

So let's keep pretending that this "test" has superior validity by continuing to cloak it in "Scientific Vestments" & proclaim it to be "Double Blind" to enhance its credibility while, at the same time, maximising the biasing that it introduces - biases like challenging the participants to "prove" what they hear, biases that suggest what's described as "night & day" should be simple to differentiate, biases that this isn't a test that relies on statistics, it's just about hearing, biases that ignore the intricacies of perceptual testing & instead go for the man-in-the-street simplistic view of such testing in order to get the unavoidable null result.

Again, I have to remind you that what I'm saying is about home-based ABX testing & there are a number of main arguments being made, as far as I can tell:
- the continual grouping of all blind tests together as being of equal validity
- the claim that home ABX is a superior test. I will accept that it is if anybody can show me the evidence - I don't mean talk about sighted bias - that only applies when doing "real" perceptual tests by people who know how to do them & don't introduce the biases I mentioned above. I mean evidence.
 
Jan 22, 2016 at 6:40 PM Post #57 of 64
So, no difference between any naïve, misdirected, untrained subject being "ambushed" with "bad DBT" missing a sonic feature that others may hear, that they may learn to pick up with guided focus, training

and a person making the claim that they do hear difference "X", can give a wordy description of what "X" sounds like, insists "X" as a property imparted to the sound by a component/step in the signal chain that they can hear when many other elements in the chain are changed?

I think most will make an inference from failed DBT test(s) of "X" by the latter, when the subject is given the opportunity to approve the samples, timing, switching, to train with the protocol, & to use the source & equipment they agree they hear the difference with

since they "know what X sounds like" the situation is very different from the first case
 
Jan 22, 2016 at 7:09 PM Post #58 of 64
So, no difference between any naïve, misdirected, untrained subject being "ambushed" with "bad DBT" missing a sonic feature that others may hear, that they may learn to pick up with guided focus, training
Well, from a collection of such null results, I have no way to know what level of hearing, lack of training, bias, stress the listeners are under in their tests


and a person making the claim that they do hear difference "X", can give a wordy description of what "X" sounds like, insists "X" as a property imparted to the sound by a component/step in the signal chain that they can hear when many other elements in the chain are changed?
I would never take one such report alone as interesting, but if there were many reports along the same lines coming from different areas then, if interested, I could personally check this out for myself & see if it concurs with what's reported.


I think most will make an inference from failed DBT test(s) of "X" by the latter, when the subject is given the opportunity to approve the samples, timing, switching, to train with the protocol, & to use the source & equipment they agree they hear the difference with


since they "know what X sounds like" the situation is very different from the first case

And that's why I posted ultmusicsnob's descriptions of just how he managed to do an ABX test with positive results - to show just what's involved in this. So I think that the individual in your second example is actually typical of the "naïve, misdirected, untrained subject being 'ambushed' with 'bad DBT', missing a sonic feature" - this is usually the way they are hijacked by the psychology of "prove you can hear it" - an immediate biasing factor in any psychological test.

So, again, we have a good example here of just how much is wrong in the setup & running of these tests. It's exactly why I say it requires people who know about perceptual testing, because they would immediately know that your example was flawed.
 
Jan 22, 2016 at 7:19 PM Post #59 of 64
you don't see ultmusicsnob's claims, which you keep pushing in our face, as actually a counter-example to your arguments - it does require motivation, persistence, & willingness to work with DBT though
 
given that competitiveness is a human motivation, why shouldn't it be harnessed, with those skeptical of "Sound Science" "limited views" challenged to blow through those limits while "handicapped" with DBT?
 
isn't that essentially what you're claiming ultmusicsnob has done - so it seems to be possible
 
which is of course to be expected when practice, even "the stress" of competition is seen as improving human performance throughout science, sport and the arts
 
Jan 22, 2016 at 7:32 PM Post #60 of 64
 
because obviously when an abx fails and we don't draw any conclusion outside of "I wasn't able to pass", that is soooo important for the factual conclusions we're not making from it.

It's a nice attempt at trying to fool people that really you don't pay any attention to "null" results, but we all know you do, so it's time to drop this pretense.

of course I pay attention to null results, in the sense that they aren't positive results and so aren't proof of difference. but they're also not proof of the absence of difference, so a null is proof of nothing. I don't think many people in favor of abx would pretend that a null is proof of something. that's just how people who want a difference to be there, tend to misinterpret ABX null results.
  if I get a positive, then it's worth investigating further and maybe look for a better test with some people around to control the test and get serious proof once and for all. if I get a null, then maybe I should stop claiming that I can hear a "night and day difference" like an idiot.
if only for that, ABX is an amazing tool for audiophiles.
 
 
 
with a false negative in ABX you're left with doubt. with a false positive in sighted evaluation you're wrong and have no reason to suspect you are.
 
