elmoe
Formerly known as Jashugan
No doubt, that's why I mentioned world class.
Hmm. I have no iron in this fire but am a bit worried about the statistical validity of the study, though it certainly seems interesting and well-formulated.
Just from reading the abstract, it appears there were 10 violinists, i.e. the sample size (n) is 10. The design appears to have been extended from a simple binomial (is it A or is it B?) design to take in qualitative differences that these violinists might have detected. Greater complexity means a larger sample size is needed. Nevertheless, setting this aside and opting for the simple case, n = 10 provides almost no statistical power for any type of quantitative study.
That is, a very large 'effect size' (detectable difference) would be needed for the violinists to be able to distinguish these instruments. The statement "Soloists failed to distinguish new from old at better than chance levels" is dubious, as the confidence interval (margin of error) of an n = 10 study is so large one could equally state "Soloists failed to *not* distinguish new from old at better than chance levels". Neither the null nor the alternate hypothesis can be excluded, meaning this study has no power to establish anything.
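To put a rough number on "very large effect size", here is a sketch of an exact binomial power calculation. The specific choices are mine, not the paper's: I assume chance is rejected only at 9 or more correct out of 10, and aim for the conventional 80% power.

```python
from math import comb

def power(p_true, n=10, k_crit=9):
    """P(at least k_crit correct out of n) given a true hit rate p_true."""
    return sum(comb(n, i) * p_true**i * (1 - p_true)**(n - i)
               for i in range(k_crit, n + 1))

# Scan for the true hit rate needed to reach 80% power with n = 10.
p = 0.5
while power(p) < 0.8:
    p += 0.001
print(round(p, 3))  # ~0.917
```

On these assumptions the violinists would need to identify the instruments correctly over 90% of the time before an n = 10 study could reliably detect it; anything subtler is invisible at this sample size.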
This is a pity as the qualitative aspect of the study shows the authors allowed for the alternate hypothesis (that the violinists could distinguish) and expected in this case to be able to characterise what these differences were, not merely state they were there.
Because of this, non-parametric tests were presumably used (these have less power to detect differences than parametric tests), and - reading between the lines - it seems likely preference rankings were indeed used (four attributes, presumably used as distinct measures, are explicitly mentioned: playability, articulation, projection, timbre; I can't tell whether these were expected to relate to a single construct or several).
It's possible distinct measures (at least four) were used to boost the n (to at least 4 x 10 = 40), but this is dubious in itself. An area I'm very familiar with is the development of psychometric tests, and administering a 100-item (measure) test to 10 people certainly does not allow me to claim an n of 1000.
Proper sizing of studies to achieve adequate power is the first issue in any research, and usually starts with assessing effect size from previous studies in the literature. However, it appears there was only one previous study of this specific kind - the 2010 one. In this case the scientifically proper conclusion (and for all I know it may be there, as I didn't read the full paper) is that a large difference between old and new violins was not found, but small or medium differences cannot be excluded.
For comparison, I consulted for a resident surgeon at a large local hospital keen to establish the efficacy of a new dressing. Whilst I gather there are no perfect dressings, it certainly seems the dressings in use are all pretty good. The hoped-for advantage of the new dressing was some 5%. Unfortunately, this 'small' effect size (though a large benefit in utility terms) meant the simple A versus B ("two-arm") study needed to recruit some 2200 subjects. Needless to say, the study did not proceed.
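For what it's worth, the standard two-proportion approximation lands in that ballpark if one assumes (my assumption, purely for illustration - I don't know the actual baseline) a ~75% success rate for existing dressings:

```python
from math import ceil
from statistics import NormalDist

def two_arm_n(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

# A hoped-for 5-point improvement over an assumed 75% baseline:
per_arm = two_arm_n(0.75, 0.80)
print(per_arm, 2 * per_arm)  # roughly 1100 per arm, ~2200 in total
```

The point is the denominator: halving the hoped-for difference quadruples the required n, which is why small effects demand enormous studies.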
Thanks ab initio, I don't need any such luck as my career as a scientist and statistical consultant is doing fine. You appear woefully unaware of the role of critical review in science (which is what makes science progress) - was this in fact a peer-reviewed, published paper or merely an online proceeding? - and to be substituting the rather anti-scientific notion of authority (the national academy of sciences indeed!) for a lazy lack of intellectual rigor. Good luck to you!
jcx, huh? How did my mild critique of the statistical foundations of a paper about distinguishing violins become about audiophile claims? My comments were not directed at and have nothing to do with any studies concerning the latter, as none appeared in this thread.
I see I made a mistake briefly leaving the sane world of scientific and academic discourse to participate in what I thought was the more dispassionate side of head-fi.
I've got more important stuff to do. Cheers!
Seems quite valid to me to critique how significant the results are. This is the sound science forum. Statistically speaking, the article under discussion is pretty flimsy. Just as a casual description I would call it suggestive, but not conclusive. Accepting results that fit your pre-conceptions is not what science is about.
AiDee made specific comments about what he was referencing. He did tell us he has a background in such things, but not as a way to hammer us into accepting his writing due to his stature; merely as an indication that he knows the area. One could rationally reply to his specific concerns. But he has a point, which I think is why no one has replied to refute the questions he posed.
Now even 10/10 correct responses would be marginal in telling us they heard a difference. The basic formula says 9/10 is the minimum for a 95% confidence level. I personally think for most things we need many more samples in such tests, and that 3-sigma results would lead to fewer false positives. I have seen almost this level of correct responses attacked on this forum as being too few to be meaningful. And rightfully so. But it works both ways.
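The arithmetic behind that 9/10 threshold is just the one-sided binomial tail under 50% guessing (sketched here; the threshold is exact for n = 10):

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """One-sided binomial p-value: P(X >= k correct) under chance guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(round(p_at_least(9, 10), 4))  # 0.0107 -> significant at the 0.05 level
print(round(p_at_least(8, 10), 4))  # 0.0547 -> just misses, so 9/10 is the minimum
```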
Large studies with large sample sizes are tough, and inconvenient. In the long run though they are much more illuminating than any number of tests with inadequately small sample sizes.
Incidentally here is the full article:
http://www.pnas.org/content/109/3/760.full
Here is the test organizer's write-up on her web page. It has more details of who took part, including pictures of the test conditions.
http://www.lam.jussieu.fr/Membres/Fritz/HomePage/Indianapolis_paper.html
Again, this is suggestive, not conclusive. It would suggest the differences are not large enough to be observed easily, the way, say, a 6 dB loudness difference would be in playing back an audio recording. Since the old violins cost so much, it suggests you get little extra performance for the money. It would be nice to see the same people perform the same test again. With such small samples you can expect results to differ. For instance, the least-preferred instrument is likely not to be so a second time if things are more or less equal between these instruments. Or it might be, which would add to the sample size and the usefulness of such testing. But without more testing we don't know; it is only conjecture. Proper statistical controls are about keeping us from fooling ourselves in both directions.
This is a test conducted with specialized people; statistical significance is irrelevant here. If the study had been conducted with just one world-class violinist, it would still be very much relevant.
The title of this thread is a quote by Professor Dale Purves that appeared in an NPR story.
I thought maybe there are some more examples (in audio or outside audio) where "Things that people think are 'special' are not so special after all when knowledge of the origin is taken away" applies. Anyone have examples?
Cheers