The fact that you can see some fairly distinct differences in the measurements between each one of the headphones also indicates that there is at least some commonality between the way different people's ears measure on the same headphone. IOW, we all have ears that are at least somewhat similar in shape, that will measure somewhat similarly at the eardrum. If this were not the case, and the measurements of different individuals were less consistent, then it would probably be harder to separate the sound signatures of the different headphones from one another in a random study like this, because the measurements would probably be more all over the map.
My understanding is that the problem is that the deviation between listeners is inconsistent between the over-ears that were tested.
If the study were about six different near-field speakers placed in front of the listeners in an anechoic room, for example, you'd expect variation for the same speaker between individuals (obviously), and you'd expect variation between speakers for the same individual (obviously), but you would also expect no variation in how each listener's HRTF modulates all six speakers' responses. In other terms, you would observe that at, for example, 4000 Hz, listener A's HRTF results in a curve 4 dB above listener B's for all six speakers. Or that at 6317 Hz, listener B's HRTF results in a response 6.7 dB lower than listener C's for all six speakers.
In other words, if you know how any speaker measures at listener A's DRP, and how listener A's and B's HRTFs deviate from each other, you'd be able to calculate how any speaker would measure at listener B's DRP (in the context of the aforementioned near-field situation).
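To make the speaker case concrete, here is a minimal sketch in Python. It assumes responses are expressed in dB at a handful of frequencies, so that the difference between two listeners' HRTFs acts as a simple additive offset. All the numbers are invented for illustration (only the 4 dB gap at 4000 Hz echoes the example above), and the function name is hypothetical.

```python
# Illustrative only: every value below is made up, in dB, keyed by frequency in Hz.

# Hypothetical offset between the two listeners' HRTFs (B minus A).
# Listener A sits 4 dB above listener B at 4000 Hz, hence -4.0 there.
hrtf_b_minus_a = {1000: 0.5, 4000: -4.0, 6317: 2.0}


def predict_at_listener_b(response_at_a, offset_b_minus_a):
    """Predict a speaker's response at listener B's DRP from listener A's.

    Assumes the inter-listener offset is the same for every speaker,
    i.e. the near-field, anechoic speaker scenario described above.
    """
    return {f: response_at_a[f] + offset_b_minus_a[f] for f in response_at_a}


# Hypothetical speaker as measured at listener A's DRP.
speaker_at_a = {1000: 0.0, 4000: 6.0, 6317: 3.0}
speaker_at_b = predict_at_listener_b(speaker_at_a, hrtf_b_minus_a)
print(speaker_at_b)
```

The whole point is that `hrtf_b_minus_a` does not depend on which speaker is playing, so one offset table would serve for all six.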
This is not what we see in this study for headphones. The variation between individuals is not constant across all headphones tested, even just for the over-ears. That may be for reasons other than the listeners' ear shape (head size, neck width, pad compression, position over the ears, whatever, I have no idea).
Put differently, even if you know how headphones X measure at listener A's DRP, you can't calculate how they'll measure at listener B's DRP, since there seems to be no such thing as a constant headphone transfer function.
Likewise, if you measure that headphones X and Y differ by 4 dB at 6000 Hz at listener A's DRP, you can't assume that they'll differ by the same amount at listener B's DRP.
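A rough way to picture the difference between the two situations is to look at how much the listener-to-listener offset moves from one device to the next. The sketch below uses made-up levels at a single frequency (6000 Hz); none of it comes from the study, and the helper names are hypothetical.

```python
def inter_listener_offsets(levels_a, levels_b):
    """Per-device difference (listener B minus listener A), in dB."""
    return {name: levels_b[name] - levels_a[name] for name in levels_a}


def offset_spread(offsets):
    """Range of the B-minus-A offset across devices; 0 means a constant offset."""
    values = list(offsets.values())
    return max(values) - min(values)


# Made-up levels in dB at 6000 Hz, measured at each listener's DRP.
speakers_a = {"S1": 2.0, "S2": 6.0}
speakers_b = {"S1": -1.0, "S2": 3.0}    # same -3 dB offset for both speakers
headphones_a = {"X": 2.0, "Y": 6.0}     # X and Y differ by 4 dB at listener A
headphones_b = {"X": 1.0, "Y": 0.0}     # but not by 4 dB at listener B

speaker_spread = offset_spread(inter_listener_offsets(speakers_a, speakers_b))
headphone_spread = offset_spread(inter_listener_offsets(headphones_a, headphones_b))
print(speaker_spread, headphone_spread)
```

In the invented speaker data the spread is zero, so listener A's measurement plus one offset predicts listener B's. In the invented headphone data the offset changes from X to Y, which is the kind of inconsistency I'm reading into the study's over-ear results.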
My interpretation is that even over-ears will struggle to excite our ears the way speakers or natural sound sources would, and that the headphone transfer function is unlikely to ever match the head-related transfer function, at least with passive means only.
I don't feel that the magnitude of the problem is that significant in an age when we're still dealing with headphones whose FR curves are so rubbish that they can only be justified by having been designed for aliens. But it may be sufficiently significant for the last mile in fidelity. I'm not convinced, for example, that truly convincing surround sound simulation can be achieved without tackling it.
I've started measuring headphones on my own head with in-concha mics for the response below 1 kHz, and I'm actually starting to make my own DIY mics to try to get meaningful measurements above that, including early attempts at making DIY tube microphones of the kind used in that study. I have no idea whether I'll get measurements that make sense, and I'm not qualified to make it anything that serious and rigorous, but I'm quite interested in comparing various means of measuring headphones on head (for example blocked ear canal measurements vs. tube mics near the eardrum), and perhaps headphones vs. speakers.