wavoman
Headphoneus Supremus
- Joined
- Jan 19, 2008
- Posts
- 1,873
- Likes
- 45
Hirsch and Nick -- fantastic posts. Right on the money IMHO.
I am a huge fan of Sample Size One experiments (this means one subject, not one test. Also called Single Subject Trials).
Think how the entire world as we know it changes if one person is proven to really have ESP.
I love the analogy with perfect pitch. Brilliant.
If you check my earlier rants on DBT and A/B/X, you will see I insist on swindles -- false comparisons, lying to the subject, throwing in 64 kbps MP3s (or AM radio) along with 24/96 uncompressed audio, etc. This is exactly Hirsch's "known difference" point.
You don't need 8/10 replicated 2 or 3 times. I will run some numbers (these are called "power calculations" in statistics) using a variety of assumptions with an eye towards "ruling out chance". But (I have argued this before) there is the issue of "effect size". If I always prefer Treatment 1 to Treatment 2 just 6 times out of 10, and I can always repeat this, never fewer, never the other way 'round, then the effect is real, but small.
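To put rough numbers on "ruling out chance", here is a minimal sketch of the exact one-sided binomial calculation. It assumes each trial is independent and that a subject who hears no difference picks either treatment with probability 0.5. The function name `chance_tail` is made up for illustration.

```python
from math import comb

def chance_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): probability of k or more picks by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(chance_tail(8, 10))    # 8 or more out of 10
print(chance_tail(6, 10))    # 6 or more out of 10
print(chance_tail(60, 100))  # same 60% rate, ten times as many trials
```

Under these assumptions, 8 of 10 comes out at roughly 0.055 and 6 of 10 at roughly 0.38, so a single 6/10 run says very little. The same 60% rate held over 100 trials drops below 0.05, which is the sense in which a small but repeatable effect can still be shown to be real.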
Although not too small for me to care. If Treatment 1 costs $50 more than Treatment 2, I'm buying it, 'cause my perceived SQ will never be worse, and will be better every once in a while ... good enough for me. If it costs $5000 more I would not buy it, but Bill Gates might.
For a given effect size (which in real life is unknown), power calculations tell you how many trials you need to prove an effect of that size exists beyond reasonable doubt (where "reasonable doubt" is set at a probability-of-error level, by convention 5%, but in the real world that is too strict).
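A power calculation of this kind can be sketched as follows. This is only one possible setup, assuming an exact one-sided binomial test, a 5% error level, and a target power of 80% (the 80% figure is a common convention, not something stated above). All function names are hypothetical.

```python
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, alpha=0.05):
    """Smallest k such that P(X >= k | guessing) <= alpha; None if even n/n is not enough."""
    total, k_crit = 0.0, None
    for k in range(n, -1, -1):          # accumulate the upper tail from the top down
        total += comb(n, k) * 0.5**n
        if total > alpha:
            break
        k_crit = k
    return k_crit

def power(n, p_true, alpha=0.05):
    """Chance of reaching the critical count when the true preference rate is p_true."""
    k = critical_count(n, alpha)
    return 0.0 if k is None else tail(k, n, p_true)

def trials_needed(p_true, alpha=0.05, target=0.8, max_n=500):
    """First n (up to max_n) whose power reaches the target; None if none does."""
    for n in range(1, max_n + 1):
        if power(n, p_true, alpha) >= target:
            return n
    return None

for p_true in (0.6, 0.7, 0.8, 0.9):
    print(p_true, trials_needed(p_true))
```

Exact binomial power does not rise smoothly with n, so the first n that clears the target is a guide rather than a guarantee for every larger n. The pattern to look for is that small effects (0.6) need far more trials than large ones (0.9).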
That's classical statistics anyway, which most of us don't believe much anymore. Rather, we look at the economic gain or loss in trying to decide if something is true, we start with a guess based on scientific theory (engineering measurement of cables in this case) as to whether it is true, and the magnitude of the effect (we actually write down our probability beliefs for many different sizes of effects), and then revise our beliefs and recommend courses of action as we accumulate evidence from the trials.
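The "write down beliefs, then revise them" step can be illustrated with a small grid of candidate effect sizes. This is a minimal sketch: the prior weights below are invented for illustration, and the economic gain/loss part of the decision is left out.

```python
from math import comb

def posterior(prior, k, n):
    """Update beliefs after k picks of Treatment 1 in n trials.

    prior maps p (probability of preferring Treatment 1) to a belief weight.
    """
    unnorm = {p: w * comb(n, k) * p**k * (1 - p)**(n - k) for p, w in prior.items()}
    total = sum(unnorm.values())
    return {p: u / total for p, u in unnorm.items()}

# Hypothetical prior: measurements suggest "no difference" (p = 0.5) is most plausible.
prior = {0.5: 0.90, 0.6: 0.05, 0.7: 0.03, 0.8: 0.02}
post = posterior(prior, k=8, n=10)

for p in prior:
    print(f"p = {p}: prior {prior[p]:.2f} -> posterior {post[p]:.3f}")
```

With a skeptical prior like this one, a single 8/10 result shifts belief toward a real effect but leaves "no difference" as the most likely explanation. Further trials keep moving the weights in whichever direction the data point.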
But classical methods are OK to start.
In these trials, which are called "paired comparisons" (i.e. you pick the one of two musical samples you like, not A/B/X!), the typical model assumes an effect size theta (theta is zero if there is no difference), and we also assume a convenient mathematical model that tells us the probability of actually picking Treatment 1 over Treatment 2 for every possible value of theta.
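One common choice for the "convenient mathematical model" is the logistic form, where the probability of picking Treatment 1 is 1 / (1 + exp(-theta)). That choice is an assumption here, since the post does not say which model is meant. A minimal sketch, with illustrative function names:

```python
from math import exp, log

def pick_prob(theta):
    """Probability of picking Treatment 1 under a logistic model; theta = 0 gives 0.5."""
    return 1.0 / (1.0 + exp(-theta))

def theta_hat(k, n):
    """Maximum-likelihood estimate of theta from k picks of Treatment 1 in n trials."""
    if not 0 < k < n:
        raise ValueError("estimate is infinite when k is 0 or n")
    return log(k / (n - k))

print(pick_prob(0.0))                 # no difference
print(theta_hat(6, 10))               # 6 picks out of 10
print(pick_prob(theta_hat(6, 10)))    # maps back to the observed rate
```

Under this model, picking Treatment 1 six times out of ten corresponds to theta = log(1.5), about 0.41, and testing "theta is non-zero" is the same as testing that the pick rate differs from 50%.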
In Single Subject Trials we assume each person has their very own theta, and we try to estimate it, and/or test that it is non-zero. In standard population trials we assume all people have the same theta, or that everyone has a different theta but we are all related in that there is an average theta (and a deviation around that) across the population, which in general is reasonably consistent.
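The difference between the two views can be sketched with a small simulation. Everything here is hypothetical: the logistic model, the normal spread of theta across people, and the values of `mu`, `sigma`, the number of subjects, and the number of trials are all invented for illustration.

```python
import random
from math import exp, log

def pick_prob(theta):
    """Logistic model (an assumption): probability of picking Treatment 1."""
    return 1.0 / (1.0 + exp(-theta))

def simulate_subject(theta, n_trials, rng):
    """Number of times a subject with effect size theta picks Treatment 1."""
    p = pick_prob(theta)
    return sum(rng.random() < p for _ in range(n_trials))

def estimate_theta(k, n):
    """Per-subject estimate; the +0.5 keeps it finite when k is 0 or n."""
    return log((k + 0.5) / (n - k + 0.5))

rng = random.Random(0)
mu, sigma = 0.4, 0.3            # hypothetical population mean and spread of theta
n_subjects, n_trials = 20, 50

thetas = [rng.gauss(mu, sigma) for _ in range(n_subjects)]            # each person's own theta
counts = [simulate_subject(t, n_trials, rng) for t in thetas]
estimates = [estimate_theta(k, n_trials) for k in counts]             # single-subject view
population_mean = sum(estimates) / len(estimates)                     # population view

for k, est in zip(counts, estimates):
    print(f"{k}/{n_trials} picks -> theta estimate {est:.2f}")
print(f"average theta across subjects: {population_mean:.2f}")
```

The per-subject lines correspond to Single Subject Trials, where each theta is estimated on its own. The final average corresponds to the population view, where individual thetas are treated as draws around a common mean.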
When we get real data I can easily model both. What's new here is exactly what Hirsch said -- the concomitant variable of "can this subject tell the difference between two samples of music that we know are different". Adding this variable to the model makes for a more complex analysis, but at first we can just throw out the tests for subjects who fail that check, or just not worry about them, since we are focusing on individuals, not the group.
Note that the published DBT tests on high res vs redbook did none of this stuff.
mcsamms, you are right as rain. I tried to pull tests together at a meet, but nobody cared. I have said "let's do this at Can Jam '09", but no uptake.
If we did this right we would break new ground, confirm or refute the published studies re hi res, settle the cable issue once and for all, etc. Not curing cancer, but worth doing I think.
I will eventually get some of this organized with a small group of golden ears (not mine) in NJ. Will take several months to build out the test venue, but we are moving slowly in the right direction.