Of course, defining the frequency response of headphones can be the hard part. If I recall correctly, Tyll Hertsens (InnerFidelity) measures in the ear canal and then corrects that spectrum to produce a mains-speaker-equivalent spectrum. So in theory, flat on a speaker and flat on that particular headphone should give you the same (but not flat) spectrum inside your ear. If this is really done correctly, and if all heads are close enough to the same, then this really does represent the flat-in, flat-out model I was referring to. Whatever your nerves do to that is then the same as what they would have done at Niagara Falls. This is at least the goal, imperfectly attained of course.
The moral here is that it's very important that the "measurements" are made well by somebody who understands this stuff. In a sense we're agreeing that one should factor these things in appropriately. My point is that while it is hard to do measurements and corrections perfectly at this level, it is actually possible, at least to within the variations in human ears, to define what flat reproduction means for a particular headphone and to measure and plot deviations from that. Existing measurements already strive for that.
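To make the "define flat, then plot deviations from it" idea concrete, here is a minimal sketch. The numbers are made up for illustration and are not real measurement data, and the names (`compensate`, `flat_speaker_at_ear_db`) are my own. I'm also assuming the correction is a simple per-frequency subtraction in dB, which may not be exactly how any particular site does it.

```python
# Hypothetical values for illustration only; NOT real measurement data.
freqs_hz = [100, 1000, 3000, 10000]

# Raw level measured in the ear canal with the headphone on (dB).
raw_headphone_db = [2.0, 0.0, 9.0, -3.0]

# Level a flat speaker would produce at the same point in the ear canal (dB).
# This is the "same (but not flat)" spectrum that counts as flat reproduction.
flat_speaker_at_ear_db = [0.0, 0.0, 10.0, -1.0]


def compensate(raw_db, target_db):
    """Return the deviation from 'flat' (raw minus target) per frequency."""
    if len(raw_db) != len(target_db):
        raise ValueError("raw and target must cover the same frequencies")
    return [r - t for r, t in zip(raw_db, target_db)]


deviation_db = compensate(raw_headphone_db, flat_speaker_at_ear_db)

for f, d in zip(freqs_hz, deviation_db):
    print(f"{f:>6} Hz: {d:+.1f} dB")
```

A headphone whose deviation is 0 dB everywhere would be "flat" in this sense, even though the raw ear-canal spectrum is nowhere near a straight line.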
I just don't see how your equal-loudness statement stands as useful generic advice on what differences from flat one should want in those measurements. InnerFidelity tries very hard, probably as hard as anyone knows how, give or take some debate, to make flat mean exactly what it should mean and exactly what is needed for accurate (not necessarily your favorite) reproduction. Sure, they aren't perfect and there are person-to-person variations left over, but none of that equates to saying "we should have some equal loudness contour" in the resulting spectrum. Equal loudness is maybe what you want if you want to play at low volume, and even then the important thing is not the shape itself but the difference in shape at different volumes.
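Here is a rough sketch of what I mean by "the difference in shape". The contour values are placeholders I invented, not ISO 226 data, and `loudness_correction` is a hypothetical helper. Both contours are expressed relative to 1 kHz.

```python
# Placeholder equal-loudness contours, in dB relative to 1 kHz.
# These are invented numbers for illustration, NOT ISO 226 data.
contour_quiet_db = {100: 20.0, 1000: 0.0, 10000: 12.0}  # low playback level
contour_loud_db = {100: 10.0, 1000: 0.0, 10000: 12.0}   # original level


def loudness_correction(quiet, loud):
    """Boost needed at low volume: quiet contour minus loud contour."""
    return {f: quiet[f] - loud[f] for f in quiet}


correction_db = loudness_correction(contour_quiet_db, contour_loud_db)
print(correction_db)
```

With these placeholder numbers the whole 12 dB at 10 kHz cancels, because it is common to both contours, and only the extra 10 dB at 100 Hz survives. Anything the two contours share is something your ear would have applied to the live sound too, so it has no business being baked into the headphone's response.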