Scientific Listening Experiment: Can we tell the difference? Let's find out.

Dec 16, 2004 at 9:03 PM Post #61 of 70
Quote:

Originally Posted by Publius
hey hey hey hey hey hey hey!

foul ball! foul ball!



Check out the replay... Look carefully…it hit the post : )!


JF
 
Dec 16, 2004 at 9:26 PM Post #62 of 70
Quote:

I think you have been confused. There are 2 commonly used meanings of p, one for statistics and one for probability.


Lower-case p is for the statistical definition, which I noted, and upper-case P is for the probability. Someone correct me if I'm wrong.
 
Dec 16, 2004 at 9:28 PM Post #63 of 70
Quote:

Originally Posted by toor
I think you have been confused. There are 2 commonly used meanings of p, one for statistics and one for probability.

You are using the p-value from a regression, or in other words the statistical significance of some value that was measured.

The other meaning of p, in probability theory, is the likelihood of something happening, i.e. the probability (p) of someone hearing a difference in headphones is 80%.

People might be freely alternating between these two (which would be incorrect) but it should be pointed out that they are not necessarily using your definition.

edit...
Sorry to confuse you more, John



Sorry for the statistical butchering. I was using the second meaning - i.e., when I referred to p < 0.7, I meant that I would actually perceive a difference less than about 40% of the time (0.4 + 0.6*0.5 = 0.7).
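For what it's worth, here is a minimal sketch of that arithmetic in plain Python, assuming nothing beyond the 50/50 guessing model described above (the detection rates in the loop are just illustrative values):

```python
def expected_correct(p_detect: float) -> float:
    """Expected proportion correct when non-detected trials are pure 50/50 guesses."""
    return p_detect + (1.0 - p_detect) * 0.5

for p_detect in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"true detection rate {p_detect:.1f} -> expected correct {expected_correct(p_detect):.2f}")
# e.g. a 40% true detection rate corresponds to roughly 70% correct answers,
# which is the p < 0.7 figure referred to above.
```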

I do in fact have a statistics-for-engineers book at home but I don't have it on me atm. I probably ought to study it more.
 
Dec 17, 2004 at 5:40 AM Post #64 of 70
Quote:

Originally Posted by TWIFOSP

I really think you keep mistaking this process for something else, like a control test. This is a measurement test. We are simply testing and collecting data on the ability of our testers' ears and listening experience (the measuring device) to detect an unknown variable.



No, I understand what you're doing. There's simply no reason to do it, unless you're going to use it to screen out outlying subjects. It's not going to alter the data in any way, and the statistical analysis is going to account for the variance between subjects. You could find out that people can't detect measurable differences, and can the whole thing. Possibly you could try and use the data from the pretest as a covariate if you're planning some sort of ANCOVA as your statistical test.

The experiment itself will produce a given variance. An estimate of that variance, known in advance, can be used in a power analysis to tell you how many subjects you'd need for an effect of a given size to be statistically significant. If you're not going to use it to weed out bad subjects, determine your N, or use the data as a covariate, the MSA is essentially wasted effort. It will NOT affect the confidence you have in your results, unless used as described. That's what your data analysis is for.
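To make the power-analysis point concrete, here is a hedged, simulation-based sketch: it asks how many trials would be needed for an exact one-sided binomial test to reject "pure guessing" most of the time, assuming a listener who answers correctly 70% of the time. The 0.70 rate, alpha = 0.05, and the candidate trial counts are illustrative assumptions, not measured values.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
true_rate, alpha, n_sims = 0.70, 0.05, 10_000   # assumed, not measured

for n_trials in (10, 20, 30, 50, 80):
    # simulate many hypothetical experiments of n_trials each
    correct = rng.binomial(n_trials, true_rate, size=n_sims)
    # one-sided exact p-value: P(X >= observed) under the chance model p = 0.5
    p_values = binom.sf(correct - 1, n_trials, 0.5)
    power = np.mean(p_values < alpha)
    print(f"N = {n_trials:3d} trials -> estimated power {power:.2f}")
```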
 
Dec 17, 2004 at 6:45 AM Post #65 of 70
Once you get away from thinking about the MSA, you can start thinking about your experimental design. An ABX test is a forced-choice experimental design, with some good and some bad points. If you can get away from trying to identify cables, and focus just on cable differences, a signal detection analysis might give you more power. The subject is given two headphones (same headphone, different or identical cables, blinded), and asked to say whether the cables are the same or different. This leads to four possible classes of response:

Hit: The subject says the cables are different, and they are.
Miss: The subject says the cables are the same, and they are different.
False Alarm: The subject says the cables are different, and they are the same.
Correct Rejection: The subject says the cables are the same, and they are the same.

Now, the easier it is to tell if the cables are different, the more hits and correct rejections you'll see. These are measures of perception. If (hits + correct rejections) > (misses + false alarms), you've got some sort of effect, depending on stats. However, if there is biased responding (a subject is trying to hear phantom differences, or doesn't believe cables can sound different), you'll see changes in hits and false alarms. If (hits + false alarms) > or < (misses + correct rejections), there is bias present. So, the experimental design has built-in measures for both perception and subject bias.
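A rough sketch of how those four response classes would be summarized with standard signal-detection measures (d' for sensitivity, c for response bias). The trial counts below are made-up numbers for illustration only:

```python
from scipy.stats import norm

hits, misses = 14, 6               # "different" trials: said different / said same
false_alarms, correct_rej = 8, 12  # "same" trials: said different / said same

def clamp(p, eps=0.01):
    # keep rates away from 0 and 1 so the z-transform stays finite
    return min(max(p, eps), 1 - eps)

hit_rate = clamp(hits / (hits + misses))
fa_rate = clamp(false_alarms / (false_alarms + correct_rej))

d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)              # sensitivity
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))   # response bias

print(f"hit rate {hit_rate:.2f}, false-alarm rate {fa_rate:.2f}")
print(f"d' = {d_prime:.2f}  (0 means no discrimination)")
print(f"c  = {criterion:.2f} (0 means no bias; > 0 is a bias toward saying 'same')")
```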

Now throw in another kicker. For some trials, use a positive control, where you keep the cables the same, but make some sort of auditory change that is known to be just above a perceptual threshold. For these trials, you'd better see hits. That gives you your measure of quality control built right into your experimental design.
 
Dec 17, 2004 at 4:05 PM Post #66 of 70
Quote:

Originally Posted by Hirsch
Now throw in another kicker. For some trials, use a positive control, where you keep the cables the same, but make some sort of auditory change that is known to be just above a perceptual threshold. For these trials, you'd better see hits. That gives you your measure of quality control built right into your experimental design.


Now we're talking. This is precisely what I want to do.

How do you propose we make some sort of auditory change that is known to be just above the perceptual threshold? What defines the threshold? Is it different for every person? How do we know who can hear it and who can't? Do we pick an arbitrary change? If so, how can we separate out who should have heard it from who can't hear the change at all? If they don't key in, they'll say "not a cable change" not because they detected the change and decided it wasn't a cable change, but because they didn't detect the change at all.

If you're designing this experiment and you choose changes that are barely above your own perceptual threshold, in the end you won't know whether your testers were better or worse listeners. What's worse, you'll also never have quantified how much of a change a cable swap really is.

The answer? MSA.

And yes, part of the post analysis I thought would include an ANOVA across listeners, cables, and other changes, to see which are statistically different by variation or by variable, probably using Tukey-Kramer.
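A hypothetical sketch of that per-factor analysis: a two-factor ANOVA over detection scores plus a Tukey-Kramer style post-hoc comparison across listeners. The file name, column names, and data layout are assumptions for illustration; the real experiment would define its own.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("results.csv")   # assumed columns: listener, cable, score

# which factors explain a significant share of the variance?
model = ols("score ~ C(listener) + C(cable)", data=df).fit()
print(anova_lm(model, typ=2))

# pairwise listener comparisons with Tukey-Kramer adjustment
print(pairwise_tukeyhsd(df["score"], df["listener"], alpha=0.05))
```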

With the MSA data, we could also construct a stepwise, full factorial fit model from which we could draw some conclusions: how much change (X) does a cable really introduce? What type of hearing is required to detect X change? We could of course take wild stabs in the dark as to how significant a change in setup really is. Or we could perform the MSA, whereby we introduce a change and see how many of our listeners are capable of detecting it. We can then use the results to devise a scale of how significant said changes are. Then, once we run the cable experiment, we can basically stack rank, using the model, how significant the cable changes were (assuming they are detected) and what type of hearing you need to detect the change.

I don't know about you, but I'd like to come away from this with a bit more data than "in our test group X number of people could detect a cable change X% of the time". If I read experiment results that concluded that, and only that, I'd start asking questions like: "How well do the testers detect changes other than cables?" "Are they capable of detecting minor changes in the first place?" etc., etc.

Maybe I'm being overzealous with the attempt, but I do not want a simple binary conclusion that determines whether testers can or cannot hear a change introduced with cables.

If I could design the uber experiment, I'd have more testers, they'd spend more time with the equipment, have a wider variety of equipment, and give us opinionated feedback such as "Better, Worse, or The Same", with an attempt to rate on a scale of 1 to 10. This would allow us to construct several models that gave us an idea of what an "uber rig" might contain, and what kind of listening habits would gravitate towards different "uber rigs", as I suspect there'd be more than one depending on your tastes. In theory, there'd be groupings of the population and only a few uber rigs. But that would take entirely too much time and require entirely too many people in order to overcome the bounds of the subjective opinions of "better" and "worse". It'd still be neat, though.
 
Dec 17, 2004 at 5:12 PM Post #67 of 70
Quote:

Originally Posted by TWIFOSP
How do you propose we make some sort of auditory change that is known to be just above the perceptual threshold?

...

I don't know about you, but I'd like to come away from this with a bit more data than "in our test group X number of people could detect a cable change X% of the time". If I read experiment results that concluded that, and only that, I'd start asking questions like: "How well do the testers detect changes other than cables?" "Are they capable of detecting minor changes in the first place?" etc., etc.

Maybe I'm being overzealous with the attempt, but I do not want a simple binary conclusion that determines whether testers can or cannot hear a change introduced with cables.



There has never been a properly done study that can irrefutably show that people can detect differences between cables. I'd say stick with the original hypothesis, and work toward that. If you try and get too much additional information out of the experiment, you can lose the key point. If you can show a statistically significant ability to detect a difference in cables exists in a blinded study, you've done more than anybody else so far. You don't need to quantify how big a difference a cable swap makes. You simply need to demonstrate that people can detect it in a properly designed study. Ideally, you'd be able to also make inferences about a negative result, but there simply isn't enough a priori data to determine the N needed to do that. You could approximate that using a surrogate endpoint in your MSA, but it's not really going to fly (you really need to know the variability in your actual cable experiment before you can start doing power analyses).

As far as perceptual thresholds, I'd get them from the literature. Auditory thresholds were published years ago. As a caution, though, if you make a deliberate difference too large relative to the differences between cables, there could be a masking effect present, where people will tend to under-report differences they actually hear (because they may be small relative to the positive control). Different people have different thresholds, but the statistics will account for that variability. Since you don't know whether any threshold that you choose to vary will be an important one for cable differences, there's no need to go overboard. Just have some sort of control built in to ensure that your subjects are performing normally. If you don't get hits on your positive control, you've just demonstrated that a cable test setting alters auditory thresholds from those reported in the scientific literature...interesting possibility, isn't it?
 
Dec 17, 2004 at 5:30 PM Post #68 of 70
Quote:

Originally Posted by Hirsch
There has never been a properly done study that can irrefutably show that people can detect differences between cables. I'd say stick with the original hypothesis, and work toward that. If you try and get too much additional information out of the experiment, you can lose the key point. If you can show a statistically significant ability to detect a difference in cables exists in a blinded study, you've done more than anybody else so far. You don't need to quantify how big a difference a cable swap makes. You simply need to demonstrate that people can detect it in a properly designed study. Ideally, you'd be able to also make inferences about a negative result, but there simply isn't enough a priori data to determine the N needed to do that. You could approximate that using a surrogate endpoint in your MSA, but it's not really going to fly (you really need to know the variability in your actual cable experiment before you can start doing power analyses).

As far as perceptual thresholds, I'd get them from the literature. Auditory thresholds were published years ago. As a caution, though, if you make a deliberate difference too large relative to the differences between cables, there could be a masking effect present, where people will tend to under-report differences they actually hear (because they may be small relative to the positive control). Different people have different thresholds, but the statistics will account for that variability. Since you don't know whether any threshold that you choose to vary will be an important one for cable differences, there's no need to go overboard. Just have some sort of control built in to ensure that your subjects are performing normally. If you don't get hits on your positive control, you've just demonstrated that a cable test setting alters auditory thresholds from those reported in the scientific literature...interesting possibility, isn't it?



Good input and discussion, let's keep it up.

No, we don't need to quantify it, but I'd like to. In addition to that, I just don't see the test as being conclusive unless we have a real world listening test to benchmark it to.

"you really need to know the variability in your actual cable experiment before you can start doing power analyses"

Yes! And how do you propose we go about detecting this? Well, comparing it to the results of the MSA is one method. Do you know of a simpler one?

In addition to power analysis, we can full factorial it within a model so long as we have the MSA data. Without it, we have no scale of change to compare it to, therefore we don't have any inference of what the cable variation is. I'm not sure of another way to measure it.

So, correct me if I'm wrong, but you're proposing introducing positive-control changes based on published auditory thresholds. What kind of changes would these be exactly? From what I can understand about those studies, they are the type that say "the average human can hear in X range of frequencies". Well, that's all fine and good, but I don't think we can just change the frequency of the music playing through the headphones. We can adjust the EQ slightly here and there, but I think we're missing the point.

This isn't a hearing sensitivity test so much as it's a detection test. There's a fine line, but a distinct difference. I could have the best hearing on the planet and be able to hear at insane frequency ranges, but if I don't also have the experience to detect nuanced changes in music, or am not aware of what to look for, it's all moot.

I just think there is a difference between good hearing and good listening. I'm not aware of any listening studies that we can leverage.

Thoughts?

I think we need to take a step back and design our measurement and conclusions scope. I still think between us we could design an experiment with real conclusions.
 
Dec 17, 2004 at 9:50 PM Post #69 of 70
Quote:

Originally Posted by TWIFOSP
"you really need to know the variability in your actual cable experiment before you can start doing power analyses"

Yes! And how do you propose we go about detecting this? Well, comparing it to the results of the MSA is one method. Do you know of a simpler one?



My suggestion would be to not bother. You lose the ability to draw conclusions from negative results, but little else. In order to keep this actually feasible to do, I'd suggest keeping it as simple as possible within a tight experimental design.


Quote:

Originally Posted by TWIFOSP
In addition to power analysis, we can full factorial it within a model so long as we have the MSA data. Without it, we have no scale of change to compare it to, therefore we don't have any inference of what the cable variation is. I'm not sure of another way to measure it.

So, correct me if I'm wrong, but you're proposing introducing positive-control changes based on published auditory thresholds. What kind of changes would these be exactly? From what I can understand about those studies, they are the type that say "the average human can hear in X range of frequencies". Well, that's all fine and good, but I don't think we can just change the frequency of the music playing through the headphones. We can adjust the EQ slightly here and there, but I think we're missing the point.



I'd suggest a slight change in gain. Simple and should be detectable.
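As a sketch of that kind of positive control: apply a small, fixed gain change to one presentation of the same signal. The 0.5 dB figure and the synthetic tone below are illustrative assumptions, not published thresholds or the actual test material.

```python
import numpy as np

def apply_gain_db(samples: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale an audio signal by gain_db decibels."""
    return samples * (10.0 ** (gain_db / 20.0))

# synthetic 1 kHz tone, just to show the arithmetic
sr = 44_100
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)

louder = apply_gain_db(tone, 0.5)   # the "changed" presentation, +0.5 dB
print(np.max(np.abs(louder)) / np.max(np.abs(tone)))  # ~1.059, i.e. +0.5 dB
```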

Quote:

Originally Posted by TWIFOSP
This isn't a hearing sensitivity test so much as it's a detection test. There's a fine line, but a distinct difference. I could have the best hearing on the planet and be able to hear at insane frequency ranges, but if I don't also have the experience to detect nuanced changes in music, or am not aware of what to look for, it's all moot.

I just think there is a difference between good hearing and good listening. I'm not aware of any listening studies that we can leverage.



It's necessary to keep in mind that any test for differences in components is hopefully not going to be representative of the way in which we listen to music. At least, it's not the way I listen. My main awareness of my equipment when I'm listening is when something sounds wrong, and jolts me out of the music. Otherwise, I couldn't care less about the sound of the gear. Any test in which I have to make a judgement about gear is not going to be anything like the way I listen to music. With any luck, I'm not even conscious of the gear during real listening. The ability to discriminate between stimuli is something that is going to produce variability in the experiment. This is the strongest argument in favor of the MSA. If you can show this variability in a pretest, then you can use it as a covariate later on. It's chancy, because it essentially doubles the run time of the experiment (the MSA is going to take pretty much the same effort as the cable test), and the gain is uncertain.

Detection of a difference is enough. Do it in a blinded experiment to make people happy. If something can be detected, it is there. Prove that in a controlled experiment, and all of the engineers who are telling people that they can't hear what they do can go back and try to figure out what's really going on. The question is simple: can people detect cable differences in a blinded experiment? There's a lot more information that can be gotten, but that's the key question, and it should be the focus. With that in mind, the simpler the final design, the likelier it is that it will actually be executed.
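If the final design really does reduce to "can people detect cable differences above chance?", the headline analysis can be this small: an exact one-sided binomial test of the pooled correct answers against p = 0.5. The counts below are placeholders, not results.

```python
from scipy.stats import binom

n_trials, n_correct = 100, 62                      # hypothetical pooled results
p_value = binom.sf(n_correct - 1, n_trials, 0.5)   # P(X >= n_correct | pure guessing)
print(f"{n_correct}/{n_trials} correct, one-sided p = {p_value:.4f}")
```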

For cables, since there was an argument about it, how about headphone cables? Sennheiser stock cable and Moon Audio Silver Dragon (I'm suggesting that one since I own it, but take your pick). Two choices. Simple 'can you hear a difference?' type question. To help keep things blinded, maybe Drew at Moon Audio would sheath a stock Sennheiser cable in his Techflex, so that it would be difficult to tell it apart from his own cable by weight or feel. It can't hurt to ask him.

Note: I think the difference between these cables is so profound that even I think I can tell the difference in a blind test, and I don't like blind tests or perform well in them. I do have prejudices here. Without aftermarket cables, I would have ditched my Sennheisers long ago.
 
Dec 17, 2004 at 9:57 PM Post #70 of 70
Between two different cables, the difference in sound is, say, delta. Delta is the same between two cables for every single listener. What changes is our perception of delta. What any experiment should be trying to do is not to measure delta, but to measure people's variance in approximating delta.

I am getting the sense, TWIFOSP, that you are trying to do something else (though I am not quite sure what). But maybe I am just not following you.
 
