This is not easy to explain, but when I talk about timing I am not talking about linear timing effects due to frequency response, phase and time delays. These are linear effects and are not subjectively important. To give you an example, you talk about 40 kHz and the delay being about 1 cm. Quite correctly, you comment that you can't hear that shift, and I absolutely agree; a linear (unvarying) shift of 1 cm would be hard (impossible) to hear. But this is not where the problem occurs. If the shift is non-linear (in other words, the 1 cm delay is constantly changing randomly) then you absolutely would hear the change, as the brain uses timing information to perceive pretty much everything: pitch, timbre, sound-stage, instrument separation etc. So if the timing on each channel is constantly and randomly changing, it becomes a massive perceptual problem.
The problem we have is down to the interpolation filter within the DAC. This filter is crucial, as it converts (together with the analogue filters) the sampled data back into the continuous signal that originally went into the ADC. Now, to perfectly reconstruct the original you need a Whittaker-Shannon interpolation filter, which is discussed here:
https://en.wikipedia.org/wiki/Whittaker–Shannon_interpolation_formula
Basically, to reconstitute the original timing information perfectly, you need an infinite tap length sinc function. This is a mathematical fact, and absolutely no other filter will do it. My insight was the realisation that using a limited number of taps would degrade timing, and we would have effectively random timing errors (where the reconstituted signal would constantly change the timing of transients - sometimes too early, sometimes too late), and that these timing errors would be audible, as I knew how important timing is perceptually.
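To make the effect of a limited tap count concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it uses a plain truncated sinc (the Whittaker-Shannon sum simply cut off after a fixed number of taps), not the WTA filter, and the test signal, TAU and the helper names are hypothetical choices made for this example. The signal is a band-limited impulse whose peak falls between two sample instants, and the script reports how far the truncated reconstruction is from the true value at a few points between samples.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def reconstruct(sample_at, t, half_taps):
    """Truncated Whittaker-Shannon sum over samples n = -half_taps..half_taps."""
    return sum(sample_at(n) * sinc(t - n)
               for n in range(-half_taps, half_taps + 1))

# Hypothetical test signal (an assumption for this sketch): a band-limited
# impulse whose peak sits between sample instants, at t = TAU.
# The sample period is taken as 1.
TAU = 0.3

def transient(t):
    return sinc(t - TAU)

def worst_error(half_taps, points=9):
    """Largest reconstruction error at a few instants between samples 0 and 1."""
    instants = [(k + 1) / (points + 1) for k in range(points)]
    return max(abs(reconstruct(transient, t, half_taps) - transient(t))
               for t in instants)

if __name__ == "__main__":
    for half_taps in (16, 64, 256, 1024, 4096):
        err = worst_error(half_taps)
        taps = 2 * half_taps + 1
        print(f"{taps:5d} taps: worst error {err:.2e} "
              f"(about {-math.log2(err):.1f} bits)")
```

For this particular signal the error shrinks steadily as taps are added, and it only reaches zero in the limit of an infinite number of taps. How a real filter behaves will depend on its windowing and on the signal, so the numbers printed here should not be read as figures for any actual DAC.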
So my quest over the last 25 years has been to design interpolation filters that get as close as possible to minimising these non-linear timing errors. Today, with the M scaler, I have an interpolation filter that matches an ideal sinc filter to better than 16-bit accuracy; that is, the coefficients are identical to the ideal ones to within 16 bits. But to do this I needed 1,015,808 taps, which is a huge tap length. And anybody who has heard the M scaler will say one thing: it is not a small change. In the case of Hugo 2 it's about 13-bit accurate; that is, the coefficients in the WTA filter are the same as those of an ideal sinc filter to about 13-bit accuracy. Every time you double the tap length, accuracy improves by 1 bit...
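As a rough worked example of that doubling rule, the sketch below takes the 1,015,808-tap figure as the reference point and treats "better than 16 bits" as exactly 16 bits, which is a simplifying assumption. The function name and the target accuracies are hypothetical, and the results are simply what the rule of thumb implies if taken at face value, not figures for any real design.

```python
def taps_for_bits(target_bits, ref_taps=1_015_808, ref_bits=16):
    """Taps needed if each extra bit of accuracy costs a doubling of tap length.

    ref_taps and ref_bits default to the M scaler figures quoted above,
    with "better than 16 bits" rounded to 16 (an assumption).
    """
    return ref_taps * 2 ** (target_bits - ref_bits)

if __name__ == "__main__":
    for bits in (16, 17, 20, 24):
        print(f"{bits} bits -> {taps_for_bits(bits):,} taps")
```

On this rule each additional bit doubles the filter length, so the tap count grows exponentially with the accuracy you ask for.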