I couldn't resist writing down some fun (I think?) thoughts in reaction to your post.
Indeed, trying to apply optical reasoning is very wise, as light and sound can be described by the same formalism (and ultimately are just waves, wave-particle duality aside). But if you want to compare a driver to an optical system, I think it has to be something like a set of lenses. You have to keep digital considerations like resolution (number of pixels) out of the equation, because the sonic equivalent is sampling frequency/bit depth, which we aren't considering here. You should compare pure analog to pure analog.
The equivalent of a transducer is a lens. The equivalent of a speaker/headphone is a projector, and the equivalent of a microphone is a camera (both non-digital, with film and so on). And yes, the notion of resolution exists for lenses, but it's very different from digital resolution and is in fact limited by diffraction (the upper bound is given by a simple equation). That limit is unrelated to the quality of the lenses.
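To put a number on that kind of diffraction bound: the post doesn't say which equation is meant, so this little sketch assumes the Rayleigh criterion for a circular aperture (smallest resolvable angle ≈ 1.22 × wavelength / aperture diameter). The function name and the example values are made up for illustration.

```python
def rayleigh_limit_rad(wavelength_m: float, aperture_m: float) -> float:
    """Smallest resolvable angle in radians for a circular aperture.

    Assumes the Rayleigh criterion; the original post does not name
    the equation it has in mind.
    """
    return 1.22 * wavelength_m / aperture_m


# Illustrative values only (assumptions): green light through a 50 mm opening.
theta = rayleigh_limit_rad(550e-9, 0.05)
print(f"diffraction limit: {theta:.3e} rad")

# Doubling the opening halves the limit, whatever the glass quality is.
theta_wide = rayleigh_limit_rad(550e-9, 0.10)
print(f"with twice the aperture: {theta_wide:.3e} rad")
```

Only the wavelength and the size of the opening appear in the formula, which is the point: polishing the lens better doesn't move this bound.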
Note the following fun fact: the equivalent of an LCD screen is a sound system with one speaker for each frequency (the step between adjacent frequencies corresponds to the definition). Anyone ready to build this with thousands of BAs?
Also, now that I think of it, CRTs might be a very good equivalent for speakers/headphones... In fact, stereo audio would be like doing VR with CRTs. This thought really made my day.
So let's consider our projector. If it has bad lenses, you will observe geometric errors, blur, and speckle noise on the image (the lenses are not well shaped and not well polished). This is the exact equivalent of distortion and noise in audio.
The amount of detail is always the same no matter the system, but it's rendered more or less distorted. So to judge the quality of the projector, we would like to see whether the lens deviates every ray of light the right way (according to Snell's law) no matter what angle the ray comes from. Believe it or not, the angle parameter is called spatial frequency and is the exact equivalent of frequency in audio. (Optics is magic because you can check how the projector behaves super easily (by adding other lenses assumed to be perfect, lol) and actually see with your own eyes a figure that shows how the system behaves for every single spatial frequency. This is indeed a spectrum, and is no different from a frequency response in audio.)
Now keep in mind that the details of the image printed on the film diffract the light widely (they spread the light a lot because they are small obstacles), so the rays of light carrying the information for details reach the outer part of the lens: they come in at high angles and therefore correspond to high spatial frequencies!
In the end, to see if the projector is "detailed" we just have to look at the spectrum and see whether the high frequencies are handled correctly. In practice this is not so easy to do, but in theory that's it. If you have the power to change the spectrum (with filters), you can correct its behavior to make it better (exactly what you said about photo editing). If the filters are perfect and you can make them infinitely complex, you can fully correct the lenses.
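Here is a toy version of that "correct the spectrum with a filter" idea, on the audio side of the analogy. It's only a sketch under idealized assumptions: the system response `gain` is a made-up roll-off that attenuates high frequencies but never reaches zero, there is no noise, and the correction filter is the exact inverse of the response. All names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(N^2)), fine for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse of dft()."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


N = 64
# Test signal: a low-frequency tone plus a weaker high-frequency "detail".
signal = [math.sin(2 * math.pi * 2 * t / N) + 0.3 * math.sin(2 * math.pi * 20 * t / N)
          for t in range(N)]


def gain(k):
    """Hypothetical system response: high frequencies are attenuated, never zeroed."""
    f = min(k, N - k)  # fold the upper half of the bins back onto low frequencies
    return 1.0 / (1.0 + (f / 8.0) ** 2)


H = [gain(k) for k in range(N)]

# What the imperfect system outputs: each frequency scaled by its gain.
degraded = [v.real for v in idft([s * h for s, h in zip(dft(signal), H)])]

# Correction filter: divide each frequency by the same gain (exact inverse).
corrected = [v.real for v in idft([d / h for d, h in zip(dft(degraded), H)])]

err_before = max(abs(a - b) for a, b in zip(signal, degraded))
err_after = max(abs(a - b) for a, b in zip(signal, corrected))
print(f"max error before correction: {err_before:.3f}")
print(f"max error after correction:  {err_after:.2e}")
```

After correction the remaining error is down at floating-point rounding level, which is the "perfect filter" case. It relies on the assumption that the response never drops to zero; a frequency that is completely lost can't be divided back in.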
So you're right about everything except when you say that there is a quality that is inherent to the driver/lenses and is out of reach/hidden.
Regarding soundstage, I think the problem is not how faithfully the signal is reproduced. It's more about what the signal should be, and/or how much of our ears' distinctive features is preserved as the waves travel to our eardrums.
I omitted a lot of things (phase, coherence...) and I doubt I explained this well... I just wanted to show that using analogies to understand sound (you could do it with electricity as well) is both fun and revealing. The links are deeper than most people probably think, so I really respect that you took this path to explain your point.