Investigating: frequency response vs. subjective report
Jan 1, 2014 at 2:03 PM Thread Starter Post #1 of 7

vid (Headphoneus Supremus)
I gathered up a corpus of Head-Fi user reviews (the non-forum ones) of five different headphones: Audeze LCD-2, AKG K 701, AT M50S, Senn HD 598, and Senn HD 650, and harvested frequency response data for those phones from Tyll's measurements at InnerFidelity. The aim was to map out, using a corpus-based approach, whether any correlation exists between a headphone's frequency response and the subjective reports of its acoustic properties.
 
The analysis was done for a paper for course credit in the field of corpus linguistics. I'm not sure what my uni's stance is on students posting bits of papers not yet handed in, so I'll keep this cursory enough. I'm not sure this line of study entirely qualifies as a linguistic one, either, but luckily that's not much of a concern here.
 
In any case, the main aspects of the data, ranked in descending order by the third column:
 
Phone     Median price, $   % of reviews with keyword "accurate"   2+3+4 kHz total dB vs 1 kHz level
LCD-2     995               11                                     13
HD 598    205               7                                      17
HD 650    400               6                                      19
M50S      130               3                                      21
K 701     266               2                                      22
 
The corpus I built included 79 reviews for the M50S, while the other phones each had about 40% fewer reviews available. These 'missing' reviews were filled in via curve-fitting in SPSS - but even without the fitting, analyzing strictly the first 40 reviews of each phone, the overall results shown above held.
 
The third column is the percentage of reviews for that phone containing the word "accurate" - or "accuracy"/"accurately" - in reference to the phone (whether describing the mids, the treble, the soundstage, etc.). A second keyword, "transparent", along with its derivatives, was also tried, but it failed to show up at all for three of the five phones, so it's ignored here.
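To give a concrete picture of the keyword tally, here's a minimal Python sketch. The actual tooling behind the paper isn't described here, so the names and the regex are illustrative, and the sketch skips the (manual) step of checking that the keyword actually refers to the phone:

```python
import re

# Matches "accurate", "accurately" or "accuracy"; purely illustrative.
KEYWORD = re.compile(r"\b(?:accurate(?:ly)?|accuracy)\b", re.IGNORECASE)

def pct_with_keyword(reviews):
    """Percentage of reviews (plain-text strings) containing the keyword."""
    hits = sum(1 for text in reviews if KEYWORD.search(text))
    return 100.0 * hits / len(reviews)
```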
 
The fourth column gives a number estimating the amount of energy arriving at the listener's eardrum in the mid frequencies of 2–4 kHz. The number was derived by eye from Tyll's PDF graphs: mentally averaging the grey (raw) lines at each of the three kHz markers relative to the 1 kHz level, then summing the three averaged values together. Besides the mids, the bass and treble frequencies were evaluated in the same manner (40+100+200 Hz, and 8+9+10 kHz).
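The arithmetic behind a column-four value is then just a sum of averaged deviations; a minimal sketch, where the per-marker readings are hypothetical (only the summed totals appear in the table):

```python
# Each inner list holds the raw-curve dB levels read off at one marker
# (e.g. 2, 3, 4 kHz), already expressed relative to the 1 kHz level.
def band_total_db(readings_per_marker):
    return sum(sum(r) / len(r) for r in readings_per_marker)

# e.g. two raw curves (left/right) per marker, hypothetical numbers:
lcd2_mids = band_total_db([[4.0, 4.5], [4.5, 5.0], [4.0, 4.0]])  # -> 13.0
```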
 
As the table shows, there was a strong correlation between the third and fourth columns: the less energy a phone had at the mid frequencies, the more likely reviewers were to describe it as accurate. Not listed in the table are the data for the bass and treble frequencies - scatterplots showed no clear-cut linear correlation for these.
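That correlation is easy to sanity-check from the table values alone (Pearson's r, for illustration - the paper's actual statistics may have differed):

```python
from scipy.stats import pearsonr

# Columns three and four of the table above.
accurate_pct  = [11, 7, 6, 3, 2]      # % of reviews with keyword "accurate"
mid_energy_db = [13, 17, 19, 21, 22]  # 2+3+4 kHz total dB vs 1 kHz level

r, p = pearsonr(accurate_pct, mid_energy_db)
print(f"r = {r:.2f}")  # ~ -0.99 on these five points: strongly negative
```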
 
There were some uncertain facets and some correlational mystery in the data, but I won't go into that now. As may be apparent, there will be some degree of error in the dB numbers, since they were based on estimation. Also, the results could well be due to random chance (small sample size) or a confounding variable that I failed to find.
 
I did a post-hoc analysis replacing the fourth column's values with the sum of just 2+3 kHz and still found a reasonable correlation with the third column. Doing the same with 3+4 kHz removed any appreciable correlation.
 
Jan 2, 2014 at 12:36 AM Post #2 of 7
Interesting.
 
As a suggestion, I think you might want to check the energy transfer in the mid band in relation to the bass and treble bands - more like the ratio of energies in these bands. That would give a more holistic view of what's going on.
 
This is because according to your analysis, a headphone with a V-shaped FR would be the most accurate, which isn't the case.
 
Also, if you can normalize the headphones' FR, that would be better.
 
As I estimate from looking at the FR graphs, the LCD2 will have closer proportions across the region, which is why it's the most accurate.
 
 

 
Jan 2, 2014 at 11:31 AM Post #3 of 7
By no means do I mean to predict that less energy in the midrange equals more accuracy - the analysis concerns the dataset only. As you suggest, even if there were a correlation in reality, you wouldn't expect it to be linear. That it's linear with this data might suggest Audeze didn't design their frequency response blind, but that again is speculation rather than analysis.
 
Part of the aim of the study was to evaluate the applicability of the corpus approach to something like this - but of course, as you suggest, you run into the problem that you're evaluating a very context-sensitive issue from data that is superficial and immutable. You've no idea what the informants' HRTFs were like, let alone what gear they used, how their particular phones measured, or even how they were plastered onto the person's head. From what I can see, that the data nonetheless produced such a strong correlation is telling of either a strong trend or an artifact.
 
I like the approach of evaluating the bigger picture (the entire spectrum), but then there's the question of how to parameterize it. Even simplifying it to the interaction of bass, mids, and treble, I don't know that there's a strict definition of where the bass stops and the mids start, or whether such a definition would apply to every listener. From a few older studies I've had the chance to glance at, there's plenty of person-to-person variation in HRTF at higher frequencies, so I'm not sure how reliably you could investigate the treble anyway without access to your informants' anatomy (mind you, Tyll's measurements represent a single anatomical instance as well). Indeed, even the mids seem very fickle in terms of HRTF variation, which again makes me suspicious of the correlation I mentioned above.
 
Jan 2, 2014 at 1:26 PM Post #4 of 7
This is indeed very interesting. Subjective impressions are very difficult to measure unless you have a large enough sample size. You should also add the sample size as a column. For example, the K701 has accuracy mentioned in 2% of reviews; if the sample size is 50, then accuracy was mentioned only once. If another phone has a 20:1 ratio of accuracy mentions, it will skew the result. The other factor that might affect the result is herd mentality, particularly in a forum setting: later reviews might just be echoing the earlier ones.
 
One other comment: how do people experience accuracy without a reference? I think the perception of accuracy needs to be drilled down a little more. The percentages listed are quite small - did reviewers who made no mention of accuracy not experience an "accurate" sound? Maybe a list of other impressions that describe a warm or bright (inaccurate) sound should also be included. Just out of curiosity, in your research did you encounter any descriptions of a warm and accurate sound, or any other contradictory description?
 
Jan 2, 2014 at 3:36 PM Post #5 of 7
vid said: (post #1 quoted in full)
 
What is the possibility that the reviewers of the headphones were influenced by the published headphone response plots?
 
It would be more interesting to see the results only from reviews that were written prior to the published measurement data (a similar effect to the "herd mentality" that dvw mentions). Certainly, the reduced sample size would be an issue.
 
I can appreciate that combing through hundreds of headphone reviews to do this study is no small task. It's a cool idea!
 

Cheers
 
Jan 3, 2014 at 9:20 AM Post #6 of 7
dvw said: (post #4 quoted in full)

 
I'll keep some specifics, e.g. the sample size, to myself at least until I've handed in the paper, just to cover myself on the plagiarism front. (Though since the results were estimated via a trend curve, the exact number of instances of the word 'accurate' can't be derived from the percentages I listed - you'd end up with fractions.) In any case, you're on the ball in that there weren't enough reviews available to call the results certain.
 
For the most part, I wanted to avoid a subjective interpretation of the reviews where possible, as I feel I might not have much business telling an informant they heard the sound this or that way. In any case, the problems you and ab initio raise are valid: first, there's no guarantee that the meaning of the word 'accurate' was universally shared; second, the notion of accuracy is always with reference to something else, and the referent is unlikely to be the same from person to person; and third, informants having access to other informants' answers (reviews, and even the frequency response charts) prior to entering their own can't be a good thing. It's a self-report study, pretty much, and has the problems of one.
 
On the qualitative front, there's a difference between the phrases "fairly accurate" and "very accurate" - this wasn't considered at all in the quantitative stats. I haven't done a full analysis on this front yet, but simply glancing at the admittedly limited data, I see a trend of the former (the magnitude of accuracy slightly downplayed where mentioned) with the M50S and of the latter (the degree of accuracy emphasized) with the LCD-2.
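If one were to quantify it, a simple starting point would be tallying the word immediately preceding the keyword; a hypothetical sketch (the paper didn't do this):

```python
import re
from collections import Counter

# Tally the word just before "accurate" to surface hedges ("fairly")
# vs. intensifiers ("very"); crude but a workable first pass.
MODIFIER = re.compile(r"\b(\w+)\s+accurate\b", re.IGNORECASE)

def modifier_counts(reviews):
    counts = Counter()
    for text in reviews:
        counts.update(m.lower() for m in MODIFIER.findall(text))
    return counts
```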
 
May 2, 2014 at 7:13 AM Post #7 of 7
The paper was completed and graded, so I can give some more info.
 
The sample size was 262 reviews across the five headphones (K 701: 42, M50S: 79, LCD-2: 43, HD 598: 43, HD 650: 55). It's a decent enough number of reviews, but I found it wasn't quite enough to get a good estimate of the keyword's frequency, i.e., sound quality - thus the use of an inverse regression curve in SPSS.
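For anyone without SPSS: assuming the standard SPSS inverse model y = b0 + b1/x, the fit can be reproduced with scipy. The data pairs below are placeholders, since the exact variables behind the regression aren't given here:

```python
import numpy as np
from scipy.optimize import curve_fit

# SPSS-style inverse curve, y = b0 + b1/x; data points are hypothetical.
def inverse_model(x, b0, b1):
    return b0 + b1 / x

x = np.array([10.0, 20.0, 40.0, 55.0, 79.0])  # e.g. reviews analyzed
y = np.array([20.0, 14.0, 11.0, 10.5, 10.0])  # e.g. running keyword %

(b0, b1), _ = curve_fit(inverse_model, x, y)
print(f"y = {b0:.2f} + {b1:.2f}/x")
```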
 
Sound quality vs. frequency response for the five data points (headphones) below.
 
[figure: scatterplots of sound quality vs. bass, midrange, and treble frequency response]
 
The LCD-2 data point (sound quality = 13.5) differs from the table I posted in the first message (where sound quality ≈ 11); this is because the basis of the regression curve used earlier for the LCD-2 wasn't exactly the same as for the other headphones. This was fixed, and the new sound quality value is in the figure.
 
The headphone sample size of five is too small to make any claims, and in any case there was still some reasonable doubt about the absolute accuracy of the data; but with that in mind, there are some trends in the figure: a relatively obvious linear correlation between sound quality and the midrange frequency response, for one, and what could be an exponential correlation between sound quality and the bass response. In both cases, listener preference seems to lean towards the least variation from the 1 kHz level, i.e., towards 0 dB in the figure. If I remember right, person-to-person variation in HRTF is generally smallest in the bass frequencies and largest in the treble - so I'm not sure how much meaningful information can be had from the treble response graph given the sample size.
 
One thing I'd note is that you could look at the probability density function (PDF) of HRTF variation between people and say that, if the frequency response at certain ranges corresponded to certain perceived qualities of sound, a scatterplot between the response and the graded sound quality might follow the PDF of the HRTF variation of a group of listeners. That is, the probability that a given frequency response value produces an averaged sound quality score of x might depend on how many listeners in the group have an HRTF that modifies that frequency response value to fall within the spectral region corresponding to x. Whether this phenomenon is visible in the figure I leave up to you; I didn't touch on it in the paper, since the experimental setup wasn't targeting this kind of thing.
 
As I mentioned in earlier posts, I had some doubts about the magnitude of the correlation between FRm and sound quality, given that there should indeed be variation in listener HRTF at those frequencies. But since the sound quality variable was in essence a group average, any noise in the scatterplots due to HRTF variation would have to come from differences between the groups, and the lack of such noise could be because each group exhibited roughly similar degrees of HRTF variation - which isn't an unreasonable assumption.
 
