I may not know a lot about audio, but I am a photographer. I know for a fact that downsampling an image reduces the quality of that image. There is a reason that the technical term for downsampling is "decimation". You lose data. How can you lose data and say that the sound quality is unaffected?
Here's the (very approximate) photographic analogy. You have a photo that is 6000 pixels wide by 4000 pixels high. You make a print without resampling. If you make it big enough, and stand close enough, you'll eventually see individual pixels. But if you make an 8x10 and hold it at arm's length, it's impossible to see pixels no matter how good your eyes are. Now you make a 6x4 print. Can you see any of the 6000 pixels in that 6x4 at any distance? If you down-sampled that image to, say, 2000 x 1300 pixels and made a 6x4 print, could you see pixels at any normal distance? Our hearing is like a 6x4 print. You can have more data in it, but if you can't see it, what's the point? You can even increase the dynamic range of the original by making it 16 bits per channel, but a print can't reproduce that dynamic range, not even close, so again, what's the point? (Don't beat me up over keeping 16 bits/channel and more resolution for post-processing a photo; that's NOT what we're talking about... we're talking about the viewable print.)
Now, that analogy is really bad, because hearing and vision do not work the same way. You also have a big problem with where that 24/96 file came from in the first place. The bulk of them, and I mean really most of them, came from some form of up-sampling of a lower-res original. You create more data that way, but not more perceivable information. It's like taking that 2000x1300 pixel image you printed at 6x4, up-sampling it to 6000x4000, and printing it again. You couldn't see the pixels in the first place, so what have you done? When you up-sample an image, do you create more detail? Nope, you just interpolate between pixels. Again, not the best analogy, but you're a photographer and might make the connection.
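To make the interpolation point concrete, here's a quick sketch (my own toy numbers, not from any real file) in Python. It "up-samples" a small signal 4x by linear interpolation, the same basic thing a resizer does between pixels: you get more data points, but every new point is just a blend of its neighbours, and the original samples sit in there unchanged.

```python
import numpy as np

# A "low-res" signal: 8 samples standing in for 8 pixels.
rng = np.random.default_rng(0)
low_res = rng.standard_normal(8)

# "Up-sample" 4x by linear interpolation.
x_low = np.arange(8)
x_high = np.linspace(0, 7, 29)
high_res = np.interp(x_high, x_low, low_res)

# More data points now...
print(len(low_res), len(high_res))  # 8 29

# ...but every original sample is still there untouched, and the new
# points are pure interpolation between them: no new detail was created.
assert np.allclose(high_res[::4], low_res)
```

Fancier interpolators (bicubic, sinc, whatever the mastering software used) change the blend, not the conclusion: the information content is still whatever the low-res original had.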
But let's not spend any energy beating up the analogy and why it's flawed. The three things everyone should know about hi-res audio are:

1. Most of it is not traceable to a high-res original at all, and those that are may still not contain any useful audio information, because the original was limited by things like microphone performance, etc.

2. It's almost impossible to deliver ultrasonic information to the listener. Speakers, even those with "usable response" to 30kHz, aren't really doing it; the ultrasonic dispersion pattern is like a pencil beam. If you do get it to hit your ears, you'll soon move out of the beam. Everything in the room is an effective ultrasonic absorber too, so there's no spraying it around, no off-axis response. Headphone/earphone response must follow a non-flat target curve. What's that curve in the range we can't hear? Nobody has any idea.

3. The elephant in the room: is it really audibly different? Scientific support for high-res audibility is extremely thin, questionable, and in many test cases not repeatable.
So, that high-res data may just be interpolation of low-res data, we can't get ultrasonic info to our ears (certainly not at any "correct" level) even if there was any in the first place, and we can't prove audibility anyway.
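The flip side is also easy to show: for audio that is already band-limited to the audible range, throwing away the extra sample rate loses nothing. A rough sketch (again my own illustration): a 1 kHz tone sampled at 96 kHz, "decimated" to 48 kHz. Because the tone has no content above the new Nyquist limit (24 kHz), simply keeping every other sample is exact, and the spectrum shows the identical single peak either way.

```python
import numpy as np

fs_high, fs_low = 96_000, 48_000
t = np.arange(fs_high) / fs_high            # one second at 96 kHz
tone = np.sin(2 * np.pi * 1_000 * t)        # 1 kHz tone, well within hearing

# Decimate by 2. No anti-alias filter needed here: the signal has
# nothing above the new 24 kHz Nyquist limit to fold back down.
decimated = tone[::2]

# With a 1-second window, FFT bin index equals frequency in Hz
# for both versions. Same single peak at 1 kHz.
peak_96k = np.argmax(np.abs(np.fft.rfft(tone)))
peak_48k = np.argmax(np.abs(np.fft.rfft(decimated)))
print(peak_96k, peak_48k)  # 1000 1000
```

Real-world decimation does low-pass filter first (that's where the "you lose data" is true: you deliberately discard the band above the new Nyquist limit), but what's lost is exactly the ultrasonic content we can't deliver or hear anyway.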
So what are "we" "hearing"? Absent overwhelming audibility data, but given a well-established understanding of expectation bias and placebo, we're hearing what we want and expect to hear.
Compare that to the clear and unmistakable audibility of mono vs stereo, for example. See what I mean?