my very last attempt at explaining that different things are different. in case someone still cares about the topic and wishes to think about it, here are the various systems to consider and try to understand:
model 1: human psychoacoustics. the use of ILD (interaural level difference) and ITD (interaural time difference) to locate one physical sound source somewhere around us. freedom of movement, with head movements even helping us get a more accurate triangulation (more reference points, more cues), and vision helping to confirm the correlation between the sound and some visual source causing it. in short, us in our daily lives in a real environment, with the cat making cat sounds coming from the actual position of the cat.
model 2: stereo speaker playback. 2 physical sound sources are tasked with making us feel like one virtual sound source (or many) is located somewhere in a general direction between the 2 speakers. the main trick involves sending the signal to both speakers and making one louder (panning). free head movements allow ILD and ITD cues from the actual sound field to let us place the 2 speakers as having a fixed position in the room (which is the case, so that's cool). we get room reverb, giving a few more cues about the position of the speakers, but also about the room itself. over time the acoustic and visual cues keep agreeing with our expectations and with model 1 for that room and those speakers. but the actual track and its virtual sound sources are another story. they may carry their own psychoacoustic cues, which may or may not agree with model 1 for that room and those speakers from our listening position. it's the fake that kind of feels real.
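to make the panning trick concrete, here's a quick sketch of a constant-power pan law, one common convention among several (the exact curve is my assumption, not anything from a specific mixing console):

```python
import math

def constant_power_pan(pan: float) -> tuple[float, float]:
    """pan in [0, 1]: 0 = hard left, 0.5 = center, 1 = hard right.
    returns (left_gain, right_gain); total power stays constant."""
    angle = pan * math.pi / 2
    return math.cos(angle), math.sin(angle)

# dead center: both channels at about 0.707 (-3 dB each),
# so the summed power is the same as a hard-panned source
gl, gr = constant_power_pan(0.5)
```

the point is just that "one virtual source somewhere between the speakers" is created by nothing more than a level ratio between 2 real sources.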
model 3: headphone playback. 2 physical sound sources stuck on our ears. moving our head tends to give us the idea that the physical sound sources are on our ears, because they are (good job, senses!). if we naturally rely on model 1, that's when we use ILD and ITD to place the physical sound sources on our own head. depending on the audio cues related to the virtual sound source (say, some instrument on the track), we might try to make sense of the impossible by placing that instrument inside our head, somewhere between the 2 drivers or at one of the drivers. or maybe our brain understands that it's physically impossible and keeps pushing almost every virtual sound source near each ear, even when the signal is almost mono. or to some place on the head, or right outside of it. maybe the FR of the headphone (why not some clear channel imbalance, or some funny phase games) happens to remind our brain of the acoustic cues for a given direction, and we'll feel like the singer is not at the horizon but somewhere up or down (still probably in the vicinity of our head).
as the headphone sends sound from a position that bypasses some of what we use to locate a sound source (our HRTF), a direct consequence is that one listener might not place a virtual sound at the same position as another listener would with the same headphone. our interpretation is affected in a personal way, whereas in model 1 or model 2 (if seated at the same position), we all tend to point toward the same direction when asked where a sound is coming from.
on the bright side, headphones often have much lower distortion than speakers, sit right on the ears, and don't add room reverb, so they do deliver a very clear signal where it's easy to notice details, if we care about that.
model 4: the idea behind crossfeed. we imagine 2 physical sound sources at a distance (speakers), and a dummy head facing them. I say dummy head because the notion of head movement is entirely ignored. now we doodle this and draw lines between the speakers and the ears. each speaker sends one line toward the left ear and one toward the right ear, so we end up with 4 lines. in that model, a little geometry (head size, speaker angle, speed of sound at room temperature) gives us the delay between the left and right ears for the sound coming from the left speaker (and the same for the right speaker). we can also measure how the sound from one speaker gets affected by the dummy head blocking some frequencies more than others on its way to the ear opposite to that speaker.
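that "little geometry" step can be sketched in a few lines. the head radius and speaker angle below are illustrative assumptions, and the formula is the classic Woodworth rigid-sphere approximation, not a measurement of anyone's actual head:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly at room temperature

def woodworth_itd(head_radius_m: float, azimuth_deg: float) -> float:
    """interaural time difference for a rigid spherical head.
    azimuth is measured from straight ahead; returns seconds.
    path difference = r * (theta + sin(theta)) around the sphere."""
    theta = math.radians(azimuth_deg)
    return head_radius_m * (theta + math.sin(theta)) / SPEED_OF_SOUND

# a typical 30 degree speaker placement and an ~8.75 cm head radius
# land in the range of a few hundred microseconds of delay
itd = woodworth_itd(0.0875, 30.0)
```

that sub-millisecond delay (plus the frequency-dependent head shadow) is the entire cue set this model has to work with.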
we can now try to simulate this so that when placing a headphone on the dummy head, the left channel will be copied, EQed, and delayed before being sent to the right ear, as if the sound source were still the pair of speakers. if done with a really specific EQ based on accurate measurements from the dummy head, we can get an impulse response corresponding to each of the 4 "lines of sound" in our initial doodle, and that could count as part of the dummy head's HRTF (head related transfer function) for that specific situation where nothing ever moves at all. that's about the best objective approach we can take to properly simulate speaker sound for that dummy head while using headphones. we'd still need to compensate for the headphone's own response, but that would be very convincing as far as audio cues go. if that dummy head could talk (but still not move at all!!!! ever!!!), it would say that it works great.
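those 4 "lines of sound" amount to 4 impulse responses and a 2-in/2-out convolution. a minimal numpy sketch, with made-up toy impulse responses standing in for real dummy-head measurements:

```python
import numpy as np

def speaker_sim(left, right, h_ll, h_lr, h_rl, h_rr):
    """mix 2 channels through 4 impulse responses (speaker-to-ear paths).
    h_xy = path from speaker x to ear y; real ones would come from
    dummy-head measurements, these are just placeholders."""
    out_l = np.convolve(left, h_ll) + np.convolve(right, h_rl)
    out_r = np.convolve(left, h_lr) + np.convolve(right, h_rr)
    return out_l, out_r

# toy impulse responses: the direct path is a unit spike, the crossed
# path is quieter (head shadow) and arrives a couple of samples later (ITD)
direct = np.array([1.0, 0.0, 0.0, 0.0])
crossed = np.array([0.0, 0.0, 0.5, 0.0])
out_l, out_r = speaker_sim(np.array([1.0]), np.array([0.0]),
                           direct, crossed, crossed, direct)
```

feeding an impulse into the left channel only, the left ear gets it immediately and the right ear gets a delayed, attenuated copy, which is exactly the doodle turned into arithmetic.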
the basic principle of crossfeed is inspired by this. it also completely neglects head movements, as if that was no big deal. it doesn't bother with the headphone's own FR at all. it doesn't apply the correct EQ for your head to the sound coming from the opposite channel before mixing it in, because it doesn't know what the correct EQ should be for your own head. it also doesn't know the size of your skull, but hopefully you can set that yourself correctly.
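to show how bare-bones that principle is, here's a naive sketch: copy the opposite channel, knock down its highs with a one-pole lowpass (a crude stand-in for head shadowing), delay it a few samples, and mix it in. the cutoff, delay and level here are arbitrary assumptions on my part, not anyone's "correct" values:

```python
import numpy as np

def lowpass(x, sr, cutoff_hz):
    """one-pole lowpass: y[n] = (1 - a) * x[n] + a * y[n-1]."""
    a = np.exp(-2 * np.pi * cutoff_hz / sr)
    y = np.zeros_like(x)
    acc = 0.0
    for n, s in enumerate(x):
        acc = (1 - a) * s + a * acc
        y[n] = acc
    return y

def simple_crossfeed(left, right, sr=44100, delay_samples=13,
                     cutoff_hz=700.0, level=0.3):
    """each output ear = its own channel + a delayed, lowpassed,
    attenuated copy of the opposite channel.
    13 samples is roughly 300 microseconds at 44.1 kHz."""
    def feed(x):
        shadowed = level * lowpass(x, sr, cutoff_hz)
        return np.concatenate([np.zeros(delay_samples), shadowed])

    pad = np.zeros(delay_samples)
    out_l = np.concatenate([left, pad]) + feed(right)
    out_r = np.concatenate([right, pad]) + feed(left)
    return out_l, out_r
```

notice how much is hard-coded and generic here: one delay, one filter, one level, and no knowledge of your head, your headphone, or where it's pointing.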
so in practice, even if you couldn't feel the headphone on your head, even if your eyes weren't your main and most trusted source of intel about the world around you, chances are you could still find issues with that processing of the music.
forget the 2 speakers + 2 ears doodle for a sec and consider real life circumstances instead. that's really all you need to do to find the many issues in that system and in @71 dB's "demonstrations of a factual improvement".
an unrealistic model does not characterize a real system. it's that simple.
@71 dB if crossfeed is an objective approach that factually improves sound over default headphone listening (and not just something you and a few people like subjectively), I want to propose a similar approach to the topic of masturbation. masturbation isn't as good as sex; I believe this is an accepted opinion for most people. and here is another fact: a human partner is mostly water. so I postulate that masturbating while holding a bottle of water is an objective improvement over not holding that bottle of water, and that's why people will like it more. it is my sincere belief that both our positions and arguments are exactly as factual and just as easily refutable, because they rely on obvious logical fallacies and a complete disregard for anything besides the variables we present.