All soundstaging, imaging, etc. is delicious artifice. Our ears and brains interpret the sounds coming out of speakers as a representation of actual performers, based upon the cues we have learned from everyday experience.
I record orchestras and chamber ensembles. Let's consider a simple microphone setup recording an orchestra: two mics, 15' in the air, a foot apart, cardioid capsules, the left mic pointing far left and the right far right.
Stereo positioning/imaging is a result of differences in timing and sound pressure. A violin to the immediate left will be louder in the left mic and will arrive there sooner than at the right mic. A tympani's sound from the far left will arrive at the right mic even later. The sound of the flutes in the middle arrives at both mics at the same time and at the same intensity. When speakers reproduce these recorded sounds, the positional cues allow our brains to place the images in the sound stage from left to right.
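To put rough numbers on the timing part, here is a minimal sketch in Python. It is not how anyone actually sets up a recording, just arithmetic under some assumptions of mine: a flat floor-plan view (the 15' height is ignored), a speed of sound of about 343 m/s, and made-up instrument positions in metres measured from the midpoint between the mics. The function name `arrival_difference_ms` is hypothetical, and the level difference from the cardioid pickup pattern is not modelled.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air (assumption)
MIC_SPACING = 0.3048    # one foot between the two capsules, in metres


def arrival_difference_ms(source_x, source_y, spacing=MIC_SPACING):
    """Return how many milliseconds later a sound reaches the right mic
    than the left mic. Positive means the left mic hears it first.

    source_x: metres left (negative) or right (positive) of the mic pair
    source_y: metres in front of the mic pair
    """
    # Left mic sits at (-spacing/2, 0), right mic at (+spacing/2, 0).
    dist_left = math.hypot(source_x + spacing / 2, source_y)
    dist_right = math.hypot(source_x - spacing / 2, source_y)
    return (dist_right - dist_left) / SPEED_OF_SOUND * 1000.0


if __name__ == "__main__":
    # Illustrative positions only, not measured from a real stage.
    for name, x, y in [("flutes (centre)", 0.0, 5.0),
                       ("violin (left)", -5.0, 5.0),
                       ("tympani (far left)", -10.0, 2.0)]:
        print(f"{name}: {arrival_difference_ms(x, y):.3f} ms")
```

With mics a foot apart the difference can never exceed the spacing divided by the speed of sound, a bit under 0.9 ms, yet that is enough for the brain to work with.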
Depth is slightly more complicated. Sounds that are further away are less loud. They also contain less treble energy (consider how a band at a distance sounds very bass heavy, thumping). The further away sounds will also be accompanied by more sound of the room (close sounds contain little echo/reverberation - far sounds much more room reverberation). The sound of a violin in the front row is brighter, louder and has less room ambiance than a violin in back. We also rely on our experience. We know a trumpet will drown out a flute. Thus our brains will place the trumpet further back if we can still hear the flute.
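For the "less loud" part, a rough sketch of the arithmetic, assuming the textbook inverse-square law for direct sound in open space. That is my simplification: in a real hall the reverberation fills in some of the loss, so the actual drop is smaller. The function name `level_drop_db` and the distances are only for illustration.

```python
import math


def level_drop_db(near_m, far_m):
    """Drop in direct-sound level (dB) when moving a source from near_m
    to far_m metres away, assuming the free-field inverse-square law."""
    if near_m <= 0 or far_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(far_m / near_m)


if __name__ == "__main__":
    # Hypothetical distances: front-row violin at 3 m, back-row violin at 6 m.
    print(f"{level_drop_db(3.0, 6.0):.1f} dB quieter")
```

Each doubling of distance costs about 6 dB of direct sound, which is one of the cues the brain reads as "further back".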
Headphones have difficulty reproducing imaging and depth. For example, our brains rely on each ear hearing both speakers, whereas with headphones each ear hears only its own channel. As another example, the best imaging typically occurs when the listener and the two speakers form an equilateral triangle: the listener is as far from each speaker as the speakers are from each other. This timing relationship is not maintained with headphones.
The relative positioning of "width" and "depth" does not change with headphones. It is just harder to discern, particularly with multi-tracked studio recordings, as there is no real-life counterpart - we do not know what is real.
I hope this makes sense.