The room reflections are excluded naturally (you only use the first few ms
of data before any reflection from a wall comes back).
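To make the gating idea concrete, here is a minimal sketch in Python with NumPy. It is only an illustration, not the processing used for the plots discussed here: the function name `gate_impulse_response`, the 5 ms gate, and the 0.5 ms fade-out are all assumed values, and the impulse response is synthetic (a direct sound plus one fake wall reflection at 10 ms).

```python
import numpy as np

def gate_impulse_response(ir, fs, gate_ms, fade_ms=0.5):
    """Keep only the first gate_ms of an impulse response.

    Hypothetical helper: the last fade_ms of the kept part is faded out
    with a half-cosine so the truncation is not an abrupt step.
    """
    ir = np.asarray(ir, dtype=float)
    n_gate = min(len(ir), int(round(gate_ms * 1e-3 * fs)))
    n_fade = min(n_gate, int(round(fade_ms * 1e-3 * fs)))
    gated = ir[:n_gate].copy()
    if n_fade > 0:
        # Half-cosine going from 1 down to 0 over the fade region.
        fade = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n_fade)))
        gated[-n_fade:] *= fade
    return gated

# Synthetic example (assumed values): direct sound at t = 0,
# one wall reflection arriving 10 ms later.
fs = 48000
ir = np.zeros(4800)
ir[0] = 1.0
ir[480] = 0.5  # reflection at 10 ms

gated = gate_impulse_response(ir, fs, gate_ms=5.0)
```

With a 5 ms gate, the reflection at 10 ms never enters the data, while anything happening inside the gate (such as the headphone's own cup reflections) is kept.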
Reflections from the headphone enclosure are included, although Marv's rig minimizes those inside the earcup with his sound-absorbing baffle mount.
Even in Marv's rig, reflections from the outer cup are still there unaltered, though how visible they are depends on the design (closed or open).
The perfect result shown was just that: a headphone free of driver or frame resonance and cavity wall reflection.
The analogy to room acoustics is very interesting though: in that case we talk about reverberation time (RT). And indeed, I believe that for a concert hall the ideal RT does not change with frequency. I think it's the same for a classroom, where RT must not exceed a certain value to get reasonable speech intelligibility.
Interestingly though, we never use this type of CSD processing for room characterization. Instead we compute the reverberation time (T30, T60) from a room response that has first been passed through octave or third-octave band filters.
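For anyone who has not seen how a reverberation time comes out of a room response, here is a rough sketch using Schroeder backward integration and a line fit between -5 and -35 dB (the usual T30 evaluation range). The assumptions are mine: the function names are hypothetical, the octave or third-octave band filtering step is left out, and the "room response" is just a synthetic exponential decay with a known T60 of 0.5 s.

```python
import numpy as np

def schroeder_decay_db(ir):
    """Energy decay curve in dB via backward integration of ir squared."""
    energy = np.asarray(ir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    # Floor avoids log10(0) if the tail of the response is exactly zero.
    edc = np.maximum(edc / edc[0], np.finfo(float).tiny)
    return 10.0 * np.log10(edc)

def reverberation_time_t30(ir, fs):
    """Fit the decay between -5 and -35 dB and extrapolate to -60 dB."""
    edc_db = schroeder_decay_db(ir)
    t = np.arange(len(edc_db)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # slope in dB per second
    return -60.0 / slope

# Synthetic decay (assumed values): amplitude drops 60 dB in t60_true seconds.
fs = 8000
t60_true = 0.5
t = np.arange(2 * fs) / fs
ir = np.exp(-3.0 * np.log(10.0) * t / t60_true)

t30_est = reverberation_time_t30(ir, fs)
```

In a real measurement the response would be band-filtered before this step, one band at a time, which is exactly where the filter ringing comes into play.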
I always wondered if I'd get similar results by processing a room impulse response through my CSD routine. I assume I would, unless the room is dead quiet and all I saw was the ringing of the filters and of other equipment whose dynamics are in the data (typically the speaker, amp...). Actually, this artificial ringing introduced by the filters is probably a good reason why you want to do a CSD rather than a T60 analysis of such a fast-decaying system as a headphone...
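To show what I mean by a CSD here, this is a bare-bones sketch, not my actual routine: each slice is the spectrum of the impulse response with a bit more of its beginning removed. The function name `csd`, the slice count, the step, and the FFT size are assumed values, and there is no window shaping at all (real routines usually smooth the cut edges). The test signal is a single synthetic resonance at 3 kHz with a 1 ms time constant.

```python
import numpy as np

def csd(ir, fs, n_slices=10, step_s=0.0005, n_fft=4096):
    """Cumulative spectral decay: one magnitude spectrum (dB) per start delay."""
    ir = np.asarray(ir, dtype=float)
    step = max(1, int(round(step_s * fs)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    slices = np.empty((n_slices, freqs.size))
    for k in range(n_slices):
        # Drop the first k*step samples, then zero-pad back to n_fft.
        part = ir[k * step : k * step + n_fft]
        seg = np.zeros(n_fft)
        seg[: len(part)] = part
        spectrum = np.fft.rfft(seg)
        slices[k] = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    return freqs, slices

# Synthetic single resonance (assumed values): 3 kHz, 1 ms time constant.
fs = 48000
n = np.arange(4096)
tau = 0.001
ir = np.exp(-n / (fs * tau)) * np.sin(2.0 * np.pi * 3000.0 * n / fs)

freqs, slices = csd(ir, fs)
bin_3k = int(np.argmin(np.abs(freqs - 3000.0)))
decay_per_slice_db = np.diff(slices[:, bin_3k])
```

For this test signal the level at the resonance should drop by about 4.3 dB per 0.5 ms slice, which is just the 1 ms time constant showing up directly, with no band filter in the chain to add ringing of its own.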