DBT problems -- methodology
Aug 24, 2009 at 6:44 AM Thread Starter Post #1 of 63

aristos_achaion

It strikes me that a lot of the problems around DBT discussion revolve around a misunderstanding of what constitutes a DBT and, really, of scientifically valid methodology in general. I'm not pro- or con-cable, but I've heard about so many "DBTs" that don't use statistically valid sample sizes, have their data misinterpreted, or lack even a well-defined control.

So, I'm curious...how would you use a DBT to determine whether, say, the Zu Mobius cable on the HD650 was snake oil or not? How would you interpret the data? Why do (or don't) you see this as a valid way of addressing this question?
 
Aug 24, 2009 at 7:30 AM Post #2 of 63
I think the bigger issue is convincing test participants that the results are valid. No matter how carefully planned the tests are, believers refuse to accept them. I think the only tests that would work are ones that emphasize placebo, suggestion and the weaknesses of human perception.

Scientific testing, with very advanced equipment, fails to change minds. So do ordinary DBT tests. Even evidence of garden hose cables with Home Depot wire - outright fraud - isn't enough to change minds.

I think the only thing that would convince would be carefully planned tests with a psychology department and an engineering department at a university. You'd need both to untangle the human element from the engineering one.

Though that probably wouldn't be enough to change the minds of believers. It would be enough, however, to convince those without an opinion on the subject.
 
Aug 24, 2009 at 11:12 PM Post #3 of 63
Quote:

Originally Posted by aristos_achaion
So, I'm curious...how would you use a DBT to determine whether, say, the Zu Mobius cable on the HD650 was snake oil or not?


Showing that it's not is easier than the opposite.

You need to find some people who can hear the improvement.
Then you have to design an experimental setup in consultation with the listeners. The cable must not be detectable by the listener in any way: weight, stiffness, noise... The operator must not put the headphones on the listener's head, because then the test wouldn't be double blind. You must design a setup where the listener can take the headphones and put them on without touching the cable, or even moving it. Ideally the headphones should not move at all: the listener should put his head under the headphones instead of putting the headphones on.
Switching cables is also a challenge: the listener must hear absolutely nothing, neither the noise of the plugs nor the noise of the cables being put down.

These constraints must respect the conditions in which the listener can hear the difference. The DBT setup must not introduce any variable that may prevent the listener from hearing the difference, such as the inability to move the head or to set the headphones in the right position on the head...

The listener may then decide what test setup suits him or her best: ABX or any other random sequence. A fixed success criterion (the acceptable probability of guessing the right answers) must be agreed on by all people involved, taking into account the expectations of the people who are going to read the result.
The expected number of listeners, and all other published tests, must also be taken into account, as their existence drags down the significance of the result: the more attempts there are, the greater the risk of a false success. The real probability of a lucky pass must be evaluated in light of these data.

If the test is supposed to be a "classic" that other forumers may repeat, then we have no control at all over the number of listeners and sessions, and therefore no control over the real significance of the result.
In this case, significance can be drastically restored by introducing trial sessions. Any listener who wants to take the challenge must first pass a trial session, and the result of that trial session is not counted, whatever it is. If your target p value is 0.001, introducing a trial session raises the significance by a factor of 1000, which means that you don't have to worry about significance until 1000 other people have tried.
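
To make the numbers concrete, here is a quick Python sketch (my own illustration, with invented scores, not part of the protocol above) of the exact binomial p-value for one ABX session, of how the chance of a lucky pass grows when many people attempt the same test, and of how requiring a qualifying trial session pulls it back down:

```python
from math import comb

def abx_p_value(correct, trials, p_guess=0.5):
    """Exact binomial tail: probability of scoring at least
    `correct` out of `trials` by guessing alone."""
    return sum(comb(trials, k) * p_guess**k * (1 - p_guess)**(trials - k)
               for k in range(correct, trials + 1))

# One listener scoring 14/16 in an ABX session:
p_single = abx_p_value(14, 16)             # ~0.002

# If many independent listeners attempt the same test, the chance that
# at least one of them passes by luck grows with the number of attempts:
attempts = 100
p_family = 1 - (1 - p_single) ** attempts  # ~0.19 -- no longer convincing

# Requiring a qualifying session first means a pure guesser must get
# lucky twice in a row (a simplification: this assumes the trial session
# uses the same pass criterion as the real one):
p_with_trial_session = p_single ** 2
print(p_single, p_family, p_with_trial_session)
```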

Then, when you get a success, you have to make some measurements in order to find the origin of the difference. It may have been a loose contact, for example.
Then you need to reproduce the success. Things are much easier here, because we now know what kind of music sample, sample duration, listening volume and repetition sequence can work.
It is also possible to record the signal at both ends of both cables, provided the measurement device has a high enough impedance, in order to see which cable is the most transparent. If the difference is still audible when listening to the recordings, that greatly simplifies reproduction: the samples can be uploaded and anyone can try to ABX them without having to set up a physical double-blind test.
The way the samples were captured might still need to be reproduced, though, in order to eliminate the possibility that something is wrong with the setup (something broken in the cable that was not obvious during measurements, for example).

That is if you want to prove that the cables sound different. If you want to prove that one of them is snake oil, that's another matter.
You might begin by looking for claimed technical superiority, and measure it in order to check. But that won't change the opinion of those who can hear a difference.
The only way would be, through a very long process, to organize DBTs with a representative selection of trusted people who can hear the difference: reviewers, forumers, etc.
All of them must have been able to find the difference easily. Then each DBT must be set up according to that listener's habits and requirements for an optimal listening experience. And every failed DBT must be repeated, with training of the listeners, until they can clearly tell whether the difference is there or not, and why. Every suggestion that may explain a failure must lead to a new DBT setup that takes the objection into account.
Once all loopholes have been addressed, and all the people who claimed to hear the difference have recognized that the test conditions were perfect and that the difference was thus just a product of their imagination, then we can conclude that the cable does not have the properties everyone thought it had.
It still might not be snake oil, but it would mean that no one has yet found what it improves over a standard cable.

In order to confirm a difference with several independent DBTs and some subsequent measurements, expect about six months, with, say, between 2 and 10 meetings.
In order to prove that the sonic properties attributed to a given device are not real, expect several years of work.
 
Aug 25, 2009 at 3:31 AM Post #4 of 63
Pio, you have gone a little too far here.

You do not have to worry about the infinite sequence of subjects who might take an infinite sequence of similar tests. You don't have to pool results -- you can do valid statistical work with a non-pooled sequence of "sample size one" trials.

But even with pooled results your experiment can stand on its own. It is nice to replicate it -- to have others replicate it -- and that is necessary for long-term acceptance by scientists, but this does not affect the actual significance of your test.

You can combine the pre-screening with the regular testing.

You do not need everyone to agree on significance levels -- you just publish your results and people are guided by the statistical analyses according to their own needs and loss functions.

And you can control response bias in many ways -- you can allow people to touch the equipment, the cables don't have to be identical, and you can use single blind experiments. Just be careful to understand and control appropriately.

So I would proceed as follows -- a single blind test. I would have a cable maker make me a junk Senn cable with cheap wire but dress it nice with techflex and put a fancy heatshrink logo on it ... make it look as professional as the Zu, and be roughly the same weight. I would put new shrink-wrap over the Zu logo with another fake logo. Now I would find an audience of musicians, people with (we assume) decent hearing, who know nothing about headphones. I know these people exist -- I have personally met many. They don't know Zu from Zoo. The feel of the cables will mean little to them, as long as both look professional.

You will need many Zu's and many fakes.

Prepare several kits, each with two sets of 650s -- with Zu (but a fake logo) and with the junk bling cable (another fake professional logo). Set up a nice listening room and invite lots of musicians for an afternoon of refreshments and listening. Tell them openly they are evaluating two different headphone builds for Sennheiser -- don't mention the cables. Tell them that Sennheiser has sponsored this event, paid for the food, etc. Have a door prize of PX100s, and some Sennheiser brochures lying around.

Have several listening booths set up -- with a decent source -- maybe the Shanling CDP which has an acceptable HP amp built in, and is easy to operate. Musicians go off and take a pair (I mean both HPs here) with them, and listen as long as they like, and vote: prefer A, prefer B, no preference.

If the room is roughly evenly divided, consistent with binomial 50-50 chance, we don't reject the null. If the number of musicians is large enough, and there is a strong majority for the Zu, that would say something ... assuming that the fake cable really looks and feels as professional as the (masked) Zu.
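
A minimal sketch of that final check, with invented vote counts (the numbers and the exact two-sided method are my assumptions, not part of the proposal above): drop the no-preference votes and test the A-vs-B split against a fair coin.

```python
from math import comb

def two_sided_binom_p(k, n, p=0.5):
    """Two-sided exact binomial p-value for k successes out of n:
    sum the probabilities of all outcomes at least as extreme."""
    pk = comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= pk + 1e-12)

# Hypothetical afternoon: 40 musicians vote, 8 report no preference,
# 22 prefer the (masked) Zu and 10 prefer the fake bling cable.
p = two_sided_binom_p(22, 32)
print(p)   # ~0.05: borderline -- a strong majority, but a small room
```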
 
Aug 25, 2009 at 10:58 AM Post #5 of 63
Hello Wavoman,
I think that your setup has a much bigger chance of producing a false negative than mine.

If you ask for preference, you will get a negative result under the hypothesis that the cables sound different but listeners have various tastes and preferences. The result of all your efforts in that case: people will rant "your test proves nothing, the differences are obvious, but of course not everyone has the same taste / the cable compensates for defects on some recordings while increasing the defects on others", etc. And they would be right.

And by summing the individual results, you get a false negative under the hypothesis that only some listeners can hear the difference. The bad scores of the ones who can't hear it ruin the performance of the others.

If every listener undergoes a complete ABX (or other randomized) session, and the statistical significance is evaluated from the sum of all answers, you can even get some 16/16 scores and still find no statistical significance once all the individual scores are summed. It happened once in an audio magazine.
In that case, when the test manager concludes that no statistical evidence was found, his credibility is permanently ruined. That's why I always divide my target probability of false success by the maximum number of sessions expected to happen in the whole test. This way, any listener who is better trained than the others can achieve a success that is valid for the whole test. I think that as a test manager, I must not deny an individual success to anyone. That wouldn't be fair to the listeners, who invest a lot of time and effort in listening.
Of course, if every listener only gives one answer, that doesn't apply. I personally prefer having each listener undergo a whole session, because of the big sensitivity differences from one person to another.
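
A toy illustration of that dilution, with made-up scores: one trained listener goes 16/16, nine others guess at chance, and the pooled total loses the significance the individual score clearly had.

```python
from math import comb

def binom_tail(correct, trials, p=0.5):
    """P(score >= correct | pure guessing)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# One trained listener: 16/16 in their own session.
print(binom_tail(16, 16))          # ~1.5e-5: individually decisive

# Pooled with nine listeners who each guessed about 8/16:
pooled_correct = 16 + 9 * 8        # 88
pooled_trials = 10 * 16            # 160
print(binom_tail(pooled_correct, pooled_trials))  # ~0.12: "no evidence"
```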

Last, about the possibility of taking into account the infinite number of other tests that other people may run: that's just a personal choice. I like to publish my results and discuss them on forums. I'm also archiving, together with Chaud7 and MrBear, all the audio blind tests that I can find on the Internet ( Post-it: Annuaire des tests ABX ). Thus taking into account anything that can happen on any other forum at any other time makes sense to me: it lets us ask the question "has the difference between this and that ever been proven or not?"

For example, the problem has already arisen in the matrix-hifi blind test repository, on a smaller scale. They have run so many ABX tests that the rare successes they have obtained are now void of any statistical significance, because such scores were statistically expected to happen somewhere in such a long series of tests.
 
Aug 25, 2009 at 1:13 PM Post #6 of 63
The setup used has to be very sterile and unengaging so people can listen for changes more easily. And two things have to be addressed: 1. lingering memory or imagination filling in the difference between cables, and 2. whether the test subjects are comfortable or not. Human senses are always less reliable when people feel uncomfortable, excited, etc.
 
Sep 4, 2009 at 1:50 PM Post #7 of 63
Are either of you statisticians? I've done a lot of statistics: parametric statistics, non-parametric statistics, maximum likelihood statistics, Bayesian statistics, simulation statistics, and randomization statistics. I've developed new statistical methods for analyzing complex biological data. I've been the head teaching assistant for a 3rd-year biostatistics course. I'm not a statistician, but I do have a pretty strong background in statistics relative to someone without a degree in statistics or mathematical statistics.

That said, Pio -- I have absolutely no idea what you're talking about. There are an awful lot of words there, but you don't end up saying very much. How will wavoman's setup produce more false negatives than yours? People are either going to hear a difference and report it, or not hear a difference. The idea of a false negative here doesn't really make sense. Yes, there might be a difference that people can't hear, or there may be no difference that people nonetheless report hearing, but that's sort of irrelevant. Averaged over a large enough randomly selected population, that sort of noise doesn't really matter: it won't affect your mean response, it just increases your variance -- unless of course your data are fairly non-normal, but again, as long as you have enough data, in practice that doesn't really matter. Think Law of Large Numbers.

First of all, it seems to me that there are two distinctly different questions here. One is "On average, can a population tell the difference between cable A and cable B?". That question is interesting because we want to know whether members of a population can reliably tell two cables apart. The second question is "Can individuals detect a difference between cable A and cable B?". This differs from the first question because there may be significant heterogeneity in the data. It could be that some large fraction of the population can't hear a difference between the cables, while others can reliably hear one. The first question only tells us whether the difference between the cables is large enough for a population as a whole to detect -- but it is possible that while a population on average can't hear the difference, some members of the population can.

For a moment, let's imagine there are differences between two cables. 50% of the population cannot hear the difference between the cables, but 50% can. These are not statistical estimates; I am invoking my omniscient powers as a god to know that these are the exact answers. If that were the population and we ran the population-level test, we'd come back with half the responses showing a difference and half not, and we might falsely conclude that no difference exists between the cables.

What you really want to do, is design an experiment around something like a repeated measures two-way ANOVA. Every individual is going to be measured multiple times with both headphones. I can imagine this being done in two ways. In one case, we let them listen to A and then B, and ask them which they prefer- they have to decide- they cannot re-hear them before making a decision. Then we present them with the two headphones again- in a random order. So maybe they get A and then B again, or maybe they get B and then A. This pattern is repeated many times over, with many different pieces of music. You record which they prefer.

The other way is to give the listener a headphone with a particular cable- play a song for them, and ask them to rate it on a scale of 1-10. Remove the headphones, change cables (or don't change cables), and give them back, and ask them to rate the music again. The nice thing about this is that it isn't paired. They don't know that they're getting Cable 1 and then Cable 2, or Cable 2 and then Cable 1. They could be hearing Cable 1, then Cable 1, then Cable 1, then Cable 1. Each individual listener does not have the expectation of listening to Cable 1 50% of the time and Cable 2 50% of the time- although averaged over the whole experiment, each cable should be used equally.
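
A small sketch of how the presentation schedule for that second design could be generated (my illustration; the trial count is arbitrary): each presentation independently gets one cable, the listener never knows the order, and the experiment stays balanced overall.

```python
import random

random.seed(42)

# Trial schedule for the unpaired rating design: each presentation gets
# one cable, balanced across the whole experiment but in a random order.
n_trials = 20
schedule = ['cable_1'] * (n_trials // 2) + ['cable_2'] * (n_trials // 2)
random.shuffle(schedule)

# The listener rates each presentation 1-10 without seeing the schedule;
# consecutive presentations may well repeat the same cable.
print(schedule)
```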

There are various other experimental setups you could use to test individual preferences; these are just the first two that popped into my head.

Now for each trial, you record what they listened to and what they preferred, and you do this with lots of people (random population sample), and save the data.

Now you can ask both questions: Can the population as a whole tell the difference between the cables? Yes or no. Are there specific individuals that are capable of reliably telling the cables apart? Yes or no.
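
One simple way to get at both questions from data like this, treating each trial as a correct/incorrect call for simplicity (invented scores; the repeated-measures ANOVA described above would be the fuller treatment): pool everything for the population question, and test each listener separately, with a multiple-comparison correction, for the individual question.

```python
from math import comb

def binom_tail(correct, trials, p=0.5):
    """P(score >= correct | pure guessing)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical results: (correct calls, trials) for each listener.
listeners = [(14, 20), (9, 20), (11, 20), (18, 20), (10, 20)]

# Population-level question: pool everything.
total_correct = sum(c for c, _ in listeners)
total_trials = sum(n for _, n in listeners)
print("pooled p =", binom_tail(total_correct, total_trials))

# Individual-level question: test each listener, Bonferroni-corrected
# for the number of listeners tested.
alpha = 0.05 / len(listeners)
for i, (c, n) in enumerate(listeners):
    p = binom_tail(c, n)
    print(f"listener {i}: p = {p:.4f}", "<-- reliable" if p < alpha else "")
```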

You could also go out of your way to design cables that should sound bad and compare them against specifically well designed cables and do the same experiment as above. The point of this would be- if we give them cables that have known qualities- some of which we think may be good, and some of which we think may be bad- are those detectable? Personally, that's how I'd start.

Actually, I wouldn't do it this way at all. I'd do it with a real experiment using a scope, and signal generator. Run the signal through the cable- read it out on the scope. Do that for your various cables. Can you see any difference? That combined with the above statistical experiments would satisfy me.
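
For the physical side, a minimal sketch of the kind of comparison that could follow the scope readings (synthetic stand-in waveforms here, since no real captures exist in this thread): time-align and level-match the two captured signals, null them against each other, and express the residual relative to the signal.

```python
import numpy as np

# Stand-in for captured waveforms: the same 1 kHz test tone "recorded"
# through two cables, already time-aligned and at the same sample rate.
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 1000 * np.arange(48000) / 48000)
capture_a = signal + rng.normal(0, 1e-5, signal.size)
capture_b = signal + rng.normal(0, 1e-5, signal.size)

# Null the two captures and express the residual relative to the signal.
residual = capture_a - capture_b
residual_db = 10 * np.log10(np.mean(residual**2) / np.mean(signal**2))
print(f"residual level: {residual_db:.1f} dB relative to the signal")
```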
 
Sep 5, 2009 at 9:26 AM Post #9 of 63
Quick answer to Clutz. I haven't read the recent posts, no time yet. Sorry -- I'll get to it.

But yes, I am a real statistician. AB in Statistics, summa cum laude, from Princeton, 1970. PhD in Statistics, Yale, 1974. Papers in The Annals of Statistics, Econometrica, and of course the Journal of the American Statistical Association.

Winner of the Theory and Methods Award from the American Statistical Association. Inventor of the now-classic multivariate nonparametric two-sample test, which bears my name (and the name of my co-author), is included in most stat packages, gets thousands of hits if you Google it, and is described in dozens of books and on Wikipedia.

PM me if you need my name, but I think you can figure it out.

Only skimmed your answer, but indeed you are right -- each person has to be a block. However, "play the winner" is more effective than randomized blocks. And swindles will increase power -- you have swindles built in ("don't change cables") -- but we can be a bit more surgical and not just mix 50-50. The designs here are tricky; they are sequential. I don't think any work on this has been published.

I do not agree that physical measurement is the way to go. The food industry abandoned that long ago in favor of sensory tests. I have read everything I can on food-industry sensory tests (GIYF, as is Ammy here) and things don't carry over directly to audio, but they are way ahead of the sorry state of experiments as represented by the awful work in AES and A/B/X, which is worthless IMO. It asks the wrong questions, measures the answers wrong, and does not analyze the responses correctly, again IMO. I've given up on taking the A/B/X'ers on ... just doing my own thing. I don't care if anyone agrees; I'm doing this for myself.
 
Sep 5, 2009 at 11:19 PM Post #10 of 63
DBT and ABX can EASILY show that there are differences in how some things, like loudspeakers, sound. Nobody seems to complain about those studies.

However, to date, I am unaware of any controlled and reputable DBT that demonstrates that people can reliably detect differences in things like cables, amplifiers, DACs, etc.

Rather than conclude that perhaps any differences in cables, amplifiers, and DACs are small (or inaudible), people criticize the DBTs as invalid. It's simply amazing.

People forget that if there really WERE large differences in cables, amplifiers, and DACs, they should be EASILY detectable in a blinded listening test. EASILY. Yet the opposite is true.
 
Sep 6, 2009 at 12:47 AM Post #11 of 63
Well Smelly, I think that's the point. Folks believe that the differences are small, audible to a select few, but very meaningful nonetheless. And if that's the case (it might be, it might not be), then it is very hard to design a blind test to prove or disprove the claim.

What we can do is use "sample size 1" experiments, where the individual is the block, to use the correct statistical term.

Through the use of swindles you quickly smoke out the people who are giving random answers. You focus then on those that seem to actually hear a difference -- "play the winner". Through further replication you can prove or disprove something about that particular individual.
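
A back-of-the-envelope sketch of why that screening is cheap (my numbers, not part of wavoman's design): how many consecutive correct calls are needed before a pure guesser is unlikely to survive the screen.

```python
# How many consecutive correct identifications before a pure guesser
# survives the screening stage with probability below alpha?
alpha = 0.01
k = 0
p_survive = 1.0
while p_survive >= alpha:
    k += 1
    p_survive = 0.5 ** k
print(k, p_survive)   # 7 trials: a guesser survives ~0.8% of the time
```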

It is hard to pull this type of test off under comfortable, long term listening conditions. Not impossible. But it has not been done a lot. The famous power cord test was pretty close, and came up with no one who could tell the difference blinded. Sample size was not large, however.

I am fairly convinced that I personally cannot hear the difference between a $1 digital cable and a $1000 digital cable, playing redbook PCM audio through a Benchmark DAC, A QESLabs HP amp, and balanced 880/600s. So as a consequence I don't spend a lot on digital spdif coax cables. I don't spend $1, since in my real listening room I might have more interference problems than in my test, so I buy Blue Jeans Cables or equivalent basically, since these speak to the issue of shielding.

I am trying to get blinded analog interconnects built, like in the Power Cord trial, for use by members of Hi Fi ... it is hard.

But you are correct: if the differences were huge, we would have proven that already.

BTW, I will continue to claim that single blind is good enough in many cases, DBT is not needed, and that ABX is far from the correct protocol, in that it introduces too much response bias -- the food industry's sensory-comparison practice doesn't use ABX since it asks the wrong question. But I have argued this at length elsewhere and don't want to re-open it.
 
Sep 6, 2009 at 2:33 AM Post #12 of 63
Quote:

Originally Posted by wavoman
What we can do is use "sample size 1" experiments, where the individual is the block, to use the correct statistical term.


Seriously, repeated measures ANOVA is exactly what we want here, isn't it? Multiple individuals, and each individual is repeatedly exposed to both the 'good' cable and the 'poor' cable- only they don't know what they're getting. Play a track- listen with cable x, then cable y- which do they like better? Then repeat- randomly assigning the cables to x and y.

Quote:

Through the use of swindles you quickly smoke out the people who are giving random answers. You focus then on those that seem to actually hear a difference -- "play the winner". Through further replication you can prove or disprove something about that particular individual.


Repeated measures ANOVA lets you do the same thing though, without first separating the population into individuals who 'can' hear the difference and those who 'can't', doesn't it? Using repeated measures gives us a lot of data about the individuals who can hear differences; the individuals who can't are irrelevant. The only downside I can see is that we're putting wasted effort into testing individuals who can't detect a difference according to the swindle, but if the swindle is a single round, then it has a limited degree of sensitivity anyway. It seems to me that what you're getting at here is the possibility of mixed populations -- some people who can hear a difference and some who cannot -- so you break the test up into two parts. But when you do that, you're also giving up all sorts of statistical power by greatly reducing your sample size. With a repeated measures fixed effects ANOVA you'll be able to separately measure population-level responses and individual-level responses, without 'smoking out' anyone. I'm really curious about the swindles angle here -- do you mind providing a few academic references for it? I'm genuinely curious.

Quote:

The famous power cord test was pretty close, and came up with no one who could tell the difference blinded. Sample size was not large, however.


Erasing this irrelevant bit, since you're more of a statistician than I am.

Quote:

BTW, I will continue to claim that single blind is good enough in many cases, DBT is not needed, and that ABX is far from the correct protocol, in that it introduces too much response bias -- the food industry sensory comparison industry doesn't use ABX since it asks the wrong question. But I have argued this at length elsewhere and don't want to re-open.


Single blind is fine, but double blind really isn't that much harder to do. I don't really like ABX either, fwiw.

My reason for wanting to do the physical test of the cables is that I like being able to add real experimental data to statistical data -- to provide physical evidence for our population-level statistical results.
 
Sep 6, 2009 at 3:10 AM Post #13 of 63
Quote:

Originally Posted by wavoman
Only skimmed your answer, but indeed you are right -- each person has to be a block. However "play the winner" is more effective than randomized blocks. And swindles will increase power -- you have swindles built in ("don't change cables") -- but we can be a bit more surgical and not just mix 50-50. The designs here are tricky, they are sequential. I don't think any work on this has been published.


I would think the best way to do it would be, with each round, to randomly assign a cable and ask them to give it a rating from 1 to 10. They won't have an expectation of Cable 1 then Cable 2, or Cable 2 then Cable 1 -- just a series of one-off preferences over a variety of different pieces of music, which you then average out. It's not a paired test, which would have more power, but it also means the testee doesn't have any expectation about the frequency of occurrence of each type of cable.

Quote:

I do not agree that physical measurement is the way to go. The food industry abandoned that long ago in favor of sensory tests. I have read everything I can on food industry sensory tests


I would just like to know, if we do get a positive result (some people can detect a difference), what the physical basis of that difference is. If it's real, it should be measurable -- hopefully with the technology we already have, but maybe not.
 
Sep 6, 2009 at 1:44 PM Post #14 of 63
Clutz -- again, my answers have to be quick, from the hip, sorry.

You understand the statistical principles perfectly, except something you said about the power of play-the-winner. Here's an overview:

Repeated Measures is just another name for Individual as the Block. We agree 100%.

Of course I won't really do an ANOVA since we are not measuring a continuous quantity.

Randomization and keeping all individuals in the trial will work, you are right. But it is sub-optimal for sure. You are correct that I get more leverage from my trials by dropping, early on, the people we prove can't discriminate. But you are wrong when you say this loses evidence or weakens the power of the test. It really does not. It is the clever way to test. The sample size for the individual block (for the individuals we care about) is not reduced in any way. We gain a lot of bang for the buck.

Doing this test single-blind means we can be surgically precise about how we balance A and B, balancing by design instead of randomizing. Again, I choose the design to be super efficient. This is what the design of experiments is all about. I choose a design to try to minimize response bias, minimize the effects of guessing, etc. Random works, but a balanced presentation order unknown to the subject works too.

"Swindles" is my own picturesque name for telling the subject he/she is getting to choose between A and B, but secretly presenting A and A, or B and B. This is not done just once, and not done only at the beginning. If you randomize, this type of comparison should be one of the choices. In my precise sequential designs, I salt them in. (I think systematic use of swindles will be my real contribution to the state-of-the-art here, so if you want to write publicly about it please PM me).

The swindle concept is as old as the hills. There are several names for swindles in the statistical sensory-evaluation literature -- you will easily find zillions of references -- these names are "blind duplicates", "blind replicates", "control samples", "control against itself" ... the general name for anything like this is "catch trial". Typically in the food industry A is the current product, and B is the "new improved" product (often cheaper to make, and the idea is to see if the consumer can perceive any quality difference). "A" is called the control, and telling the panel they are getting A and B but giving them A and A is the "control against itself" or the "control sample". This is done more often than B vs B; B vs B is what's usually called a "blind duplicate".
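
To make the catch-trial idea concrete, a small sketch with invented responses (assuming a same/different response format, which is my simplification, not the design described above): score the real A-vs-B trials as usual, and separately track how often the listener calls a swindle pair "different".

```python
from math import comb

def binom_tail(correct, trials, p=0.5):
    """P(score >= correct | pure guessing)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical session: 12 real A-vs-B presentations and 6 catch trials
# (A-vs-A or B-vs-B) salted into the sequence.
real_correct = 10          # correct calls on the 12 real comparisons
catch_reported_diff = 1    # "different" calls on the 6 catch trials

p_real = binom_tail(real_correct, 12)          # ~0.02
catch_false_alarm_rate = catch_reported_diff / 6

# A listener who scores well on the real trials AND stays quiet on the
# catch trials is far more convincing than the real-trial score alone.
print(p_real, catch_false_alarm_rate)
```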

As far as I know I am the first to suggest systematically including A-A or B-B in the design sequence in a precise way, but I have not yet done a full literature search. Let me explain why before I get flamed. We were taught this in graduate school -- a little-known approach, not widely practiced: when you hit a statistical testing situation that you think is not being done right, or can be improved, do a quick literature scan to understand the state of the art, the various positions on the subject, etc. Then stop, and start your own research. Innovate.

When you are done, then go and do the real, scholarly, comprehensive lit search. You may have re-invented someone else's work -- very likely, in fact -- so you can't publish a new theory paper, but you have not wasted your time (well, maybe only a little), since you now have a deep appreciation of what the issues are, and you will understand the paper(s) you found perfectly. In fact, you might well have discovered a slight twist and can push the theory along a little -- a research note instead of a major paper (or at least a nice interaction with the original author and maybe a follow-up paper, perhaps jointly). But sometimes you find you have come up with something entirely new and novel -- bang! A major contribution to the field. The point is that you have not polluted your mind by reading so much of the accepted ways of doing things that you fail to innovate. This is a fascinating approach to research, and it works.

I am a long way from publishing this stuff -- nowhere near even finished with the theory or computational routines. I do believe I have a new likelihood model that generates the probabilities of correct and incorrect answers to regular and swindle choices, given two underlying and unobservable parameters: (1) the amount of SQ difference between A and B, and (2) the ability of this particular listener to detect these differences. I have a way to fit this model, to estimate the parameters, but I am not sure I have any of it right yet. Also a way to generate the optimal order in which to present the trials and swindles, given assumptions about the two parameters.
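
For readers who want a feel for what a two-parameter likelihood of this general shape could look like, here is a toy version -- emphatically not the unpublished model described above, just an illustration with assumed functional forms and invented scores:

```python
import math
from itertools import product

# Toy two-parameter likelihood (illustration only). d = size of the audible
# difference between A and B, s = this listener's sensitivity. The listener
# truly detects the difference with probability q = 1 - exp(-d * s) and
# guesses 50-50 otherwise, so P(correct on a real trial) = q + (1 - q) / 2.

def p_correct(d, s):
    q = 1.0 - math.exp(-d * s)
    return q + (1.0 - q) / 2.0

def log_likelihood(d, s, correct, trials):
    p = p_correct(d, s)
    return correct * math.log(p) + (trials - correct) * math.log(1.0 - p)

correct, trials = 13, 16          # hypothetical score for one listener

# Crude grid-search fit. Note that with a single listener only the product
# d * s is identifiable; separating the two parameters needs several
# listeners and several cable pairs.
grid = [i / 20 for i in range(1, 61)]                 # 0.05 .. 3.00
d_hat, s_hat = max(product(grid, grid),
                   key=lambda ds: log_likelihood(ds[0], ds[1], correct, trials))
print(d_hat, s_hat, p_correct(d_hat, s_hat))
```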

This is just not my day job, and most nights I would rather listen to music than do statistics. I am in no rush, I will finish this and publish it some year, maybe when I retire, which is soon I hope. Along with thousands of other mathematical statisticians, I was trained to improve existing statistical methods and invent new ones, whenever and wherever we believe current methods are lacking. This goes on every day. I used to do it, then I switched careers, but it is like riding a bicycle. People here and in the world at large think somehow statistical methods are known and fixed, that there are a number of correct methods that you have to use, etc. This is not correct.

It is like master chefs. Sure, for years all the fine restaurants cooked exactly the same things -- coq au vin, duck a l'orange, etc. -- with a few minor variations. But then the great cooking explosion occurred, and now we expect every real chef to invent new dishes and innovate. They do, based on sound cooking principles and respect for the past. Now that's not the greatest analogy, since we like variety and chefs will change just for the sake of change and give us something new, but they also change to improve, like Keller did with lobster -- the cooking equivalent of a major breakthrough paper.

When I got back into high-end audio (after a multi-decade hiatus) in late 2007, and started to see all this DBT chatter, it became clear to me that A/B/X was wrong from many standpoints, the need to A/B/X mostly driven by the requirement to be double-blind (not necessary, SBT can be made to work), and/or the requirement to allow self-experimentation (which can be done in other ways). It asks a question not quite on point, and one that does not control for response bias. So I cast about, discovered the large literature on sensory evaluation and blind tests in the food industry, bought and read most of the major books in the field (GIYF and Ammy, as I have said before), and studied the statistical techniques used there. A lot of nice stuff, some pretty clever, but I decided (a) changes were needed for audio, and (b) I could do something better here. So (unfortunately on a very slow schedule) I have started.

The feedback and discussion here on head-fi has been incredibly valuable, and has helped me a lot.

Just like headphones improve, even though there are classics, so does statistical practice.
 
Sep 6, 2009 at 2:07 PM Post #15 of 63
Clutz -- just saw your latest post (after writing the long answer I just posted). Quick additional answers:

1. Your non-paired, random-order, 1-to-10 rating experiment is a very interesting design. I agree it has less power than a paired test, but it might have lower response bias. I am a little worried about not hearing the same music ... maybe use a later passage in the same piece? If you did this, then you would have a classic randomized repeated-measures design, and a simple two-sample test would work (use a non-parametric one) on the implied two samples (separate the A and B responses).

I worry too about the 1-to-10 scale vs a simple "Prefer A to B". Still, you have an idea worth exploring -- see the sketch after point 2 below.


2. At this stage, I don't care about physical measurements at all. The discovery of physical effects often follows the discovery of a sensory difference through (proper) statistics, or simple observation. Read up on homing pigeons (science's attempt to understand how they do what they do).
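
Going back to point 1: a minimal sketch of the non-parametric two-sample analysis suggested there (separate the A and B ratings and compare), using SciPy's Mann-Whitney U test on invented ratings.

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-to-10 ratings from one afternoon of blind listening,
# already separated by which (masked) cable was actually in use.
ratings_a = [7, 6, 8, 7, 5, 7, 8, 6, 7, 9]   # stock cable
ratings_b = [6, 7, 6, 5, 7, 6, 8, 6, 5, 7]   # boutique cable

# Non-parametric two-sample test: no normality assumption on the ratings.
stat, p = mannwhitneyu(ratings_a, ratings_b, alternative='two-sided')
print(stat, p)
```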
 
