"Things that people think are 'special' are not so special after all when knowledge of the origin is taken away"
Jun 16, 2014 at 5:23 PM Post #16 of 47
No doubt, that's why I mentioned world class.
 
Jun 16, 2014 at 6:14 PM Post #17 of 47
Hmm. I have no iron in this fire but am a bit worried about the statistical validity of the study, though it certainly seems interesting and well-formulated.

Just from reading the abstract, it appears there were 10 violinists, i.e. the sample size (n) is 10. The design appears to have been extended from a simple binomial design (is it A or is it B?) to take in qualitative differences that these violinists might have detected. Greater complexity means a larger sample size is needed. Nevertheless, setting this aside and opting for the simple case, n = 10 provides almost no statistical power for any type of quantitative study.

That is, a very large 'effect size' (detectable difference) would be needed for the violinists to be able to distinguish these instruments. The statement "Soloists failed to distinguish new from old at better than chance levels" is dubious, as the confidence interval (margin of error) of an n = 10 study is so large one could equally state "Soloists failed to *not* distinguish new from old at better than chance levels". Neither the null nor the alternate hypothesis can be excluded, meaning this study has no power to establish anything.
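To make that concrete, here is a minimal sketch (mine, not from the paper; Python/scipy, and the 7/10 score is a hypothetical result chosen for illustration) of how wide an exact binomial confidence interval is at n = 10:

# How wide is an n = 10 binomial confidence interval? The 7/10
# "correct identifications" below is a hypothetical score, for illustration.
from scipy.stats import binomtest

result = binomtest(k=7, n=10, p=0.5)                  # H0: guessing at chance
ci = result.proportion_ci(confidence_level=0.95)      # exact (Clopper-Pearson)

print(f"p-value vs. chance: {result.pvalue:.3f}")              # ~0.344, not significant
print(f"95% CI for true rate: {ci.low:.2f} to {ci.high:.2f}")  # ~0.35 to 0.93

The interval spans both chance (0.5) and near-perfect discrimination, which is the point: at n = 10, neither conclusion is excluded.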

This is a pity as the qualitative aspect of the study shows the authors allowed for the alternate hypothesis (that the violinists could distinguish) and expected in this case to be able to characterise what these differences were, not merely state they were there.

Because of this, non-parametric tests were presumably used (these have less power to detect differences than parametric tests) and - reading between the lines - it seems likely preference rankings were indeed used (four attributes, presumably used as distinct measures, are explicitly mentioned: playability, articulation, projection, timbre; I can't tell whether these were expected to relate to a single construct or several).

It's possible distinct measures (at least 4) were used to boost the n (to at least 4 x 10 = 40), but this is dubious in itself: correlated measures from the same person are not independent observations. An area I'm very familiar with is the development of psychometric tests, and for sure administering a 100-item (measure) test to 10 people does not allow me to claim an n of 1000. A quick simulation below shows what goes wrong.
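Here is a sketch of that pitfall (my own illustration with an assumed within-person correlation, not anything from the paper): treating 4 correlated measures from 10 people as if they were 40 independent observations inflates the false-positive rate well past the nominal 5%.

# Why 4 correlated measures from 10 people aren't n = 40.
# Simulation under the null (no real effect); the 0.8/0.2 mix is an assumption.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
reps, false_pos = 10_000, 0
for _ in range(reps):
    subject = rng.normal(size=(10, 1))                         # shared per-person component
    scores = 0.8 * subject + 0.2 * rng.normal(size=(10, 4))    # 4 correlated measures each
    false_pos += ttest_1samp(scores.ravel(), 0).pvalue < 0.05  # naive test at "n = 40"
print(f"false-positive rate: {false_pos / reps:.2f}")          # ~0.3, not the nominal 0.05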

Proper sizing of studies to achieve adequate power is the first issue in any research, and usually starts with assessing effect size from previous studies in the literature. However, it appears there was only one previous study of this specific kind - the 2010 one. In this case the scientifically proper conclusion (and for all I know it may be there, as I didn't read the full paper) is that a large difference between old and new violins was not found, but small or medium differences cannot be excluded.

For comparison, I consulted for a resident surgeon at a large local hospital keen to establish the efficacy of a new dressing. Whilst I gather there are no perfect dressings, it certainly seems the dressings in use are all pretty good. The hoped-for advantage of the new dressing was some 5%. Unfortunately, this 'small' effect size (though a large benefit in utility terms) meant the simple A-versus-B ("two arm") study needed to recruit some 2200 subjects :eek: Needless to say, the study did not proceed.
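For the curious, the standard two-proportion sample-size formula reproduces numbers of that order. This is a sketch with assumed inputs (the post gives neither the baseline success rate nor the power), so the exact total moves around with those choices:

# Two-arm sample size for detecting a 5% absolute improvement.
# Baseline rate (0.80) and power (0.90) are assumptions for illustration.
from scipy.stats import norm

p1, p2 = 0.80, 0.85                                   # assumed control rate, +5% hoped for
alpha, power = 0.05, 0.90
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)

n_per_arm = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"~{2 * n_per_arm:.0f} subjects total")         # ~2400; the same ballpark as the ~2200 quoted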
 
Jun 16, 2014 at 6:28 PM Post #18 of 47
Quote (AiDee, Post #17): "Hmm. I have no iron in this fire but am a bit worried about the statistical validity of the study [...]" (full post quoted above)
 

Well, if you think you've got a valid point, I encourage you to write the National Academy of Sciences. Good luck!
 
Cheers
 
Jun 16, 2014 at 6:31 PM Post #19 of 47
disingenuous much? - what statistical power is needed to verify commonplace claims of "obvious" differences? the audiophile "night and day", "anyone can hear...", "you'd have to be deaf not to notice"
 
we expect a 10/10 result when the differences are really that obvious
 
it is a strawman to project that the conclusion is "no one can hear..." and then critique the strawman you introduced
 
 
and it is often used as a rhetorical distraction from the very high-confidence result that the study does support
 
Jun 16, 2014 at 6:53 PM Post #20 of 47
Thanks ab initio, I don't need any such luck, as my career as a scientist and statistical consultant is doing fine. You appear woefully unaware of the role of critical review in science (which is what makes science progress) - was this in fact a peer-reviewed, published paper or merely an online proceeding? - and to be substituting the rather anti-scientific notion of authority (the National Academy of Sciences, indeed!) for intellectual rigor. Good luck to you!

jcx, huh? How did my mild critique of the statistical foundations of a paper about distinguishing violins become about audiophile claims? My comments were not directed at, and have nothing to do with, any studies concerning the latter, as none appeared in this thread.

I see I made a mistake briefly leaving the sane world of scientific and academic discourse to participate in what I thought was the more dispassionate side of head-fi.

I've got more important stuff to do. Cheers!
 
Jun 17, 2014 at 12:24 AM Post #21 of 47
Quote (AiDee, Post #20): "Thanks ab initio, I don't need any such luck as my career as a scientist and statistical consultant is doing fine [...]" (full post quoted above)


I'm not sure if I understand the animosity here?
 
The Proceedings of the National Academy of Sciences (PNAS) is one of the highest-impact peer-reviewed scientific journals. If you have a legitimate critique of the paper, it would make quite the line in your CV to publish a paper correcting the supposed scientific shortcomings of the article, especially considering that you are a career scientist! I was encouraging you to participate in the scientific process. If you have a legitimate point, I encourage you to elaborate... we are all here to learn.
 
The Sound Science forum is an open forum where well-intentioned folks come to discuss and share knowledge on topics related to audio. Certainly, one should proceed with humility and an openness to honest critique. If your point has merit, it will be revealed by the quality of your arguments. Perhaps you should recall the role of critical review in science.
 
Cheers
 
Jun 17, 2014 at 12:49 AM Post #22 of 47
Quote (AiDee, Post #20): "Thanks ab initio, I don't need any such luck as my career as a scientist and statistical consultant is doing fine [...]" (full post quoted above)

 
No offense meant, but damn statisticians, always trying to use fancy words like "effect size" and "statistical significance" to basically say whatever story they want to tell.
 
 
 
You are missing the entire point here, probably without even having read the paper. I'm not going to fall prey to the fallacy of appealing to authority (PNAS, bastion of great studies *eek*), but let me put it this way:
 
1. n = 10 world-class violinists is probably a fair sample of the population of "world class violinists" - don't quote me on this, but there really aren't too many of those around. More is always better, but it's not like you're going to get Joshua Bell to anonymously do something like this. Or maybe he was one of the folks. Who knows.
2. The whole point of Stradivarius and Guarneri instruments is that they are supposedly so much better than the best modern instruments (let me spell it out for you here: a large effect size).
3. You probably need to revisit the reasons for using non-parametric measures - players' evaluations of a musical instrument are ordinal judgments (rankings and ratings), not quantitative measurements on an interval scale. There is no reason why one would need parametric statistics to compare qualities of musical instruments as evaluated by players.
4. You're not the only one around here with a knowledge of statistics and a career in the sciences. Your very defensive reaction to criticism and ad hominem remarks do not lend credibility to your arguments, nor do they lend credibility to your scientific credentials, though we all know that that's how some scientists interact with their peers.
5. Don't let the blind pursuit of the "statistical significance" game in science brainwash you into thinking that it is the be-all and end-all. Statistical significance is a means to an end - showing that some difference one is examining is real. It says nothing about the actual size of that difference (effect size), nor does it contain inherent truth. Not statistically significant means no detectable difference, hence the term null hypothesis. Now, whether one is measuring the right things is of course up for debate, but that doesn't mean the paper is wrong in saying that the players couldn't distinguish the old/new instruments based on the recorded metrics, which may not be stated as such but is of course the actual claim.
 
J
 
P.S. All of this drivel, and the fact that the old instruments actually are harder to play and much more fragile to keep, doesn't mean that I no longer want a Stradivarius/Guarneri cello. I still long to touch one or own one. If you have one for cheap, please please please please PM me. kthxbye.
 
Jun 17, 2014 at 1:30 AM Post #23 of 47
Seems quite valid to me to critique how significant the results are. This is the Sound Science forum. Statistically speaking, the article under discussion is pretty flimsy. As a casual description I would call it suggestive, but not conclusive. Accepting results that fit your preconceptions is not what science is about.
 
AiDee made specific comments about what he was referencing. He did tell us he has a background in such things, but not as a way to hammer us into accepting his writing due to his stature, merely as an indication that he knows the area. One could rationally reply to his specific concerns. He has a point, which I think is why no one has replied to refute the questions he posed.
 
Now, 10/10 correct responses would also be marginal in telling us they heard a difference. The basic formula says 9/10 is the minimum for a 95% confidence level. I personally think for most things we need many more samples in such tests, and that 3-sigma results would lead to fewer false positives. I have seen almost this level of correct responses attacked on this forum as being too few to be meaningful. And rightfully so. But it works both ways. A quick check of that 9/10 threshold is below.
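Here is the arithmetic behind that threshold (a sketch; exact one-sided binomial against pure guessing):

# Exact one-sided binomial p-values for k correct out of 10 vs. chance (p = 0.5).
from scipy.stats import binom

for k in (8, 9, 10):
    p = binom.sf(k - 1, 10, 0.5)     # P(X >= k) under pure guessing
    print(f"{k}/10: p = {p:.4f}")
# 8/10: p = 0.0547 (just misses 0.05); 9/10: p = 0.0107; 10/10: p = 0.0010

So 9/10 is indeed the smallest score that clears 95% confidence, and even 10/10 is only about 1 in 1000 under guessing.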
 
Large studies with large sample sizes are tough and inconvenient. In the long run, though, they are much more illuminating than any number of tests with inadequately small sample sizes.
 
Incidentally here is the full article:
 
http://www.pnas.org/content/109/3/760.full
 
Here is the test organizer's write-up on her web page. It has more details of who took part, including pictures of the test conditions.
 
http://www.lam.jussieu.fr/Membres/Fritz/HomePage/Indianapolis_paper.html
 
Again, this is suggestive, not conclusive. It would suggest the differences are not large enough to be observed easily, the way, say, a 6 dB loudness difference would be in playing back an audio recording. Since the old violins cost so much, it suggests you get little for the extra money in performance. It would be nice to see the same people perform the same test again. With such small samples you can expect results to differ. For instance, the least-preferred violin would likely not be the least preferred a second time if the instruments are more or less equal. Or it might be, which would add to the sample size and the usefulness of such testing. But without more testing we don't know; it is only conjecture. Proper statistical controls are about keeping us from fooling ourselves in both directions.
 
Jun 17, 2014 at 2:05 AM Post #24 of 47
Quote (AiDee, Post #17): "Hmm. I have no iron in this fire but am a bit worried about the statistical validity of the study [...]" (full post quoted above)


I'm not anti-you; I also believe that our basic job when looking at a study is first to try to destroy its validity, to make sure it is what it pretends to be. And even though several other trials have been conducted - sometimes the subjects were playing, sometimes they were just spectators - all going in the same direction, you're right that we don't really have statistical evidence that it is not possible to tell a Strad apart from another violin.
BUT!
In this case, the point never really was to prove it is impossible to discriminate a Strad. It was only to show that Strads didn't have the claimed superiority, and in the process to reveal the impact of bias. Bias was really the subject of the study, not Strads.
That is something we here are all "lucky" to be very familiar with, but a soloist might not have thought about bias much in his career, beyond "they didn't pick me because the other guy had connections". And the axiom for the test was more like "of course I can tell I'm playing/listening to a Strad" - something an expert in wine can do while trying tens of different wines, without any need for statistical significance.
These tests were done on people living with violins; most of them have probably heard a lot of Strads, if only on records. If the superiority of such violins were clear, this test should have been enough to show it. It wasn't.
That's all the study is really saying.
 
Jun 17, 2014 at 6:40 AM Post #25 of 47
This is a test conducted with specialized people; statistical significance is irrelevant here. If the study had been conducted with just one world-class violinist, it would still be very much relevant.
 
Jun 17, 2014 at 7:41 AM Post #26 of 47
Quote (AiDee, Post #17): "Hmm. I have no iron in this fire but am a bit worried about the statistical validity of the study [...]" (full post quoted above)


An alternative point of view is part of the scientific method.
 
Jun 17, 2014 at 7:42 AM Post #27 of 47
Quote (Post #23): "Seems quite valid to me to critique how significant the results are [...] Proper statistical controls are about keeping us from fooling ourselves in both directions." (full post quoted above)


Agreed.
AiDee was not using a sledgehammer to drive his point home, just adding some additional analysis and another point of view.
 
Jun 17, 2014 at 12:18 PM Post #28 of 47
Quote (Post #25): "This is a test conducted with specialized people, statistical significance is irrelevant here. If the study had been conducted with just one world class violinist it would still be very much relevant."


This is true only if the number of trials is large enough. 3 of 3 for the world's finest violinist means no more than 3 of 3 from anyone else - pure guessing gives 3/3 one time in eight - so both are statistically inadequate.
 
There were 21 participants, so it is not quite as flimsy as I have portrayed it, apart from how the test was done. It would have had more statistical validity if they had made comparisons between only two violins with more trials. Each person evaluated the same pair of violins twice, giving effectively only 21 trials of whether a person picks the same preference both times. The result was 11/21, which is certainly near chance. I like to see at least 30 trials of such things, as that is enough to fill out a distribution curve. And this was muddied by the repeat pairing not being the same pair for each participant; it was instead split among 9 different pairings. The arithmetic on that 11/21 is below.
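For what it's worth, here is that 11/21 run through an exact binomial test (a sketch):

# 11 consistent preferences out of 21 repeat trials, tested against chance.
from scipy.stats import binomtest

res = binomtest(k=11, n=21, p=0.5)                    # two-sided by default
ci = res.proportion_ci(confidence_level=0.95)         # exact (Clopper-Pearson)

print(f"p-value vs. chance: {res.pvalue:.2f}")                  # 1.00: dead-on coin-flip territory
print(f"95% CI on consistency: {ci.low:.2f} to {ci.high:.2f}")  # ~0.30 to 0.74

The interval runs from well below chance to about 0.74, which supports both halves of the point: the observed result is indistinguishable from coin flips, yet the test is too small to pin the true consistency rate down.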
 
It is easy to criticize someone after the fact.  Getting such qualified people and such expensive instruments together for the test was not simple.  I appreciate the testing done, and it is suggestive.  When taken with other similar testing by listener panels it fits with the idea the old instruments are not so superior as the reputation.  A better test protocol would have been more conclusive. 
 
Jun 18, 2014 at 10:59 AM Post #29 of 47
the criticism of just the significance-test numbers ignores effect size - what's your model of what is being tested? we can know from the study as presented, to a very high confidence level, that skilled players/listeners do not discriminate these violins with 99%, 90%, or even 80% probability
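a quick sketch of that claim, reusing the 11/21 repeat-preference count cited earlier in the thread: if the true discrimination probability were high, a result this low would be wildly unlikely:

# How likely is a result as low as 11/21 if players could really discriminate
# with probability p? (Sketch based on the repeat-preference count above.)
from scipy.stats import binom

for p in (0.80, 0.90, 0.99):
    prob = binom.cdf(11, 21, p)      # P(X <= 11) given true skill p
    print(f"true p = {p}: P(result <= 11/21) = {prob:.1e}")
# roughly 4e-3 at p = 0.80, 1e-5 at p = 0.90, 3e-15 at p = 0.99

so while small or medium differences can't be ruled out, "night and day" levels of discriminability are excluded with very high confidence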
 
Jun 18, 2014 at 11:15 AM Post #30 of 47
The title of this thread is a quote by Professor Dale Purves that appeared in an NPR story.


I thought maybe there are some more examples (in audio or outside audio) where "Things that people think are 'special' are not so special after all when knowledge of the origin is taken away" applies. Anyone have examples?

Cheers


 
