I'd ask your question the other way around, as it seems to me - both subjectively and objectively - that multibit DACs add less of their own character to the music than D-S DACs do.
So: 'Why do D-S DACs sound worse?' Leaving the digital filters aside, there are at least a couple of technical reasons, related to the back-end of the DAC - the modulator and the low-bit multibit DAC itself.
The first reason is that the quantizer can't be correctly dithered because it's inside a feedback loop, so the optimum level of dither can't be established. Non-optimal dither levels result in noise modulation - signal-correlated shifts in the noise floor. I suspect this is an issue that ESS worked hard on in their 'Hyperstream' DACs - reducing noise modulation in the modulator - at least it's hinted at in Martin Mallinson's RMAF presentation.
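To see what dither does (and what its absence costs), here's a toy Python/numpy sketch - not ESS's modulator, just a plain quantizer, with the test tone, 16-bit step size and signal levels all picked purely for illustration:

```python
import numpy as np

# Quantize a 1 kHz sine at several levels, with and without TPDF dither
# of 2 LSB peak-to-peak. Undithered, the error is correlated with the
# signal (distortion tones whose level shifts with the input level);
# properly dithered, it collapses to a constant, signal-independent
# noise floor.

def quantize(x, lsb, dither):
    if dither:
        # TPDF dither: sum of two uniform sources, 2 LSB peak-to-peak
        x = x + (np.random.rand(x.size) - np.random.rand(x.size)) * lsb
    return np.round(x / lsb) * lsb

n = 48000                                  # 1 second, so 1 kHz is bin-centered
t = np.arange(n) / 48000.0
lsb = 2.0 / (1 << 16)                      # 16-bit step, full scale +/-1

for amp_db in (-60, -80, -100):
    sig = 10 ** (amp_db / 20.0) * np.sin(2 * np.pi * 1000 * t)
    for dith in (False, True):
        err = quantize(sig, lsb, dith) - sig
        peak = 20 * np.log10(2 * np.abs(np.fft.rfft(err)).max() / n + 1e-30)
        print(f"{amp_db:5d} dBFS, dither={dith}: peak error component {peak:7.1f} dBFS")
```

Run it and the undithered error tones wander around with the signal level (at -100 dBFS the signal gets truncated away entirely), while the dithered error floor sits still regardless of level - which is exactly the property the modulator's feedback loop makes hard to guarantee.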
The second issue I don't believe Mallinson talked about at all - the fact that the low-bit DAC used isn't a very good one in terms of element matching. The apparent ability to use a not-so-good DAC is the whole point of designing D-S converters; it's what makes them cheap to produce. The old multibit DACs needed laser trimming of their resistors, which takes time on very expensive hardware and hence translates into considerably higher prices. In order to get around the limitations of using a DAC with poorer than 10-bit precision, a lot of signal-processing 'tricks' have to be used, otherwise the measured THD would look very bad. The tricks used reduce to something quite simple: conversion of harmonic distortion into noise. I take it the assumption being made is that 'harmonic distortion' = bad and 'noise' = benign, but this looks to me to be a questionable assumption for audio. So long as the noise remains totally constant with signal level it's reasonable, but that's the rub - 'linearizing' a poor DAC by turning its distortion into noise ISTM generates non-constant noise levels, because its distortion isn't constant with signal level. It's this effect, I believe, which is responsible for the 'bump' in the THD+N vs signal level graph around -35dB seen on some plots from ES9018 devices.
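To make the 'distortion into noise' conversion concrete, here's a toy sketch of one common form of the trick - dynamic element matching, i.e. scrambling which of the mismatched DAC elements fire for a given code. The 2% mismatch and everything else here is invented for illustration:

```python
import numpy as np

# A 7-element unary DAC with mismatched elements converts a 3-bit sine.
# Thermometer coding always fires the same elements for a given code, so
# the mismatch appears as harmonic distortion; picking the elements at
# random spreads the same error energy out as noise instead.

rng = np.random.default_rng(0)
elements = 1.0 + 0.02 * rng.standard_normal(7)    # ~2% element mismatch

def convert(codes, scramble):
    out = np.empty(len(codes))
    for i, k in enumerate(codes):                 # k of 7 elements 'on'
        idx = rng.choice(7, k, replace=False) if scramble else np.arange(k)
        out[i] = elements[idx].sum()
    return out

n = 4096                                          # 16 cycles, bin-centered
codes = np.round(3.5 + 3.4 * np.sin(2 * np.pi * 16 * np.arange(n) / n)).astype(int)
ideal = codes.astype(float)                       # perfectly matched elements

for scramble in (False, True):
    err = convert(codes, scramble) - ideal
    spec = 20 * np.log10(2 * np.abs(np.fft.rfft(err - err.mean())) / n + 1e-30)
    print(f"{'scrambled' if scramble else 'fixed':9s} selection: peak error tone {spec.max():6.1f} dB")
```

Note what the scrambled case doesn't fix: the mismatch error power still depends on how many elements are on, so the resulting noise rides up and down with the signal - which is precisely the non-constant noise I'm complaining about.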
All of the technical facts you stated seem to be true and accurate, but I do question your interpretation and evaluation of some of them. There is a common tendency (especially among fans of R2R DACs) to describe Delta-Sigma DACs as "using low precision internal DACs and tricks to get good measurements", making it sound as if this is somehow a way to "foist" inferior products on audiophiles by the use of clever tricks. The first part of the sentence is entirely correct - the whole idea of modern Delta-Sigma DACs is to use some clever tricks to allow an internal function block with five or six bit precision to deliver an analog output with 24 bit precision. (And that is a pretty neat trick.) However, it could also be restated as: "A Delta-Sigma DAC that has circuitry with internal conversion precision of only five or six bits can deliver performance equal to or better than an R2R DAC with 24 bit precision - and do so at a much lower cost." (When you state it that way it sounds more like a cool idea that lets you get better performance using cheaper parts - which sounds more like a good thing.)
Also, when you talk about "converting distortion into noise", you need to be very careful of the context to avoid being misled... The whole subject of how a Delta-Sigma DAC works is quite a bit more complicated than many people think... In general, you can ALWAYS "trade" bit depth against sample rate. This is what DSD does when compared to PCM: instead of a 16 bit (or 24 bit) signal at a certain sample rate, you instead have a one bit signal at a much higher sample rate. However, the process of "trading one for the other" isn't some sort of shady business deal conducted in a back alley somewhere. Rather, it is perfectly legitimate math, and the "trade" really is fair and equal. You really CAN use fewer bits at a higher sample rate and get the same performance (within the limitations of what you're doing). A modern Delta-Sigma DAC isn't "doing something sneaky" either - it's simply using some clever math to balance sample rate against bit depth, because it happens to be a lot easier to get DAC function blocks that can convert with five or six bits of precision at very high sample rates than ones with 24 bit precision at lower sample rates. (You can think of it as "dividing" that 24 bit sample into several smaller pieces, converting each piece very precisely and quickly using a DAC with fewer bits, then carefully putting all the pieces of output back together afterwards.) So, given that the precision of the results will be equal among those choices, it sort of makes obvious sense to choose the (equal) option that costs the least - right? Other than bragging rights, there's no technical benefit to doing something the more difficult and more expensive way unless it actually works better - right? (And arguing that a "real 24 bit R2R" DAC could do a better job than a 24 bit Delta-Sigma one is not only not necessarily true, but sort of moot - even the high-end R2R DAC used by the Yggdrasil "only" has 20 or 21 bits of precision, not the full - and arguably unnecessary - 24 bits.)
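If you want to see that trade in action, here's a minimal first-order, one-bit modulator in Python - far cruder than the higher-order, five-or-six-bit modulators in real chips, but the same basic idea:

```python
import numpy as np
from scipy.signal import firwin, lfilter

# A 1-bit quantizer running 64x faster than the audio rate, with the
# quantization error fed back (first-order noise shaping). Low-pass
# filtering the bitstream recovers the input far more precisely than
# "one bit" would suggest.

osr = 64
fs = 48000 * osr
t = np.arange(1 << 16) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)        # input kept inside +/-1

y = np.empty_like(x)
acc = 0.0
for i, s in enumerate(x):
    acc += s - (y[i - 1] if i else 0.0)       # integrate input minus output
    y[i] = 1.0 if acc >= 0 else -1.0          # 1-bit quantizer

# Filter the bitstream back down to the audio band and compare.
lp = firwin(1023, 20000, fs=fs)
rec = lfilter(lp, 1.0, y)
delay = 511                                    # FIR group delay in samples
err = rec[delay:] - x[:len(x) - delay]
print(f"in-band error RMS: {20 * np.log10(np.sqrt(np.mean(err[2000:] ** 2))):.1f} dBFS")
```

Even this crude toy lands its in-band error far below the roughly 6 dB signal-to-noise you'd naively expect from one bit; real converters stack higher-order noise shaping and more quantizer levels on top to push it lower still.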
What the fellow from Sabre was referring to is that, because of the way the process works, you sometimes end up with a noise floor that varies depending on the content of the signal you're converting (the noise floor is modulated by the content). Since most people agree that a smooth, consistent noise floor is in general less annoying than one that is correlated with the signal in some way, this is something worth avoiding (by careful attention to the details of how that mathematical trick is actually accomplished). We can leave the question of whether you can hear the difference between a smooth noise floor and a bad one, and whether different types of noise modulation sound audibly different - when the noise floor in question is more than 120 dB down - for another discussion. (This is only relevant and meaningful if you actually DO notice that the noise between low level passages really is audible - and sounds audibly different between different DACs.)
Another thing that seems to need clarification is the subject of digital filters.
ALL oversampling DACs require digital filters - and this includes BOTH Delta-Sigma DACs and other types of oversampling DACs as well. Since the oversampling is tied in intimately with the Delta-Sigma process, most Delta-Sigma DACs have an internal oversampling filter. Yggdrasil uses an R2R type DAC CHIP, yet it still oversamples, and still uses a digital filter to do so. (Schiit has developed their own digital filter, which functions somewhat differently than the one included in most oversampling DAC chips, and which they claim is audibly superior. Their oversampling is implemented outside the DAC chip.)
In this context, perhaps I should also clarify what is meant by "digital filter". The process of "oversampling" consists of converting a digital audio stream recorded at a certain sample rate into an equivalent digital audio stream at a higher sample rate. The way this is done is to create more samples. (The process may create all new samples, or keep the original samples and create new ones to "drop" between them at the appropriate times.) Note that the process CANNOT create new information. The ideal goal is to create new samples that contain the exact same information as the original audio stream, without changing it in any way (except to express it at a higher sample rate). You can think of it conceptually as taking the original samples as points on a graph, drawing a line through them in precisely the correct place, then picking NEW points on that same line (but more of them, spaced more closely in time). If you get this all just right, then your new points will define the exact same line as your original points. In practice, this can be done by calculating approximately where the new points should be, then applying a filter. By "filtering out the errors", the filter "forces the new samples into their proper values". Basically, if you make a guess, then eliminate the errors from your guess, the result will be the correct answer. And, yes, that's a horrible oversimplification. However, it can also theoretically be done in other ways.
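As a rough sketch of the "make a guess, then filter out the errors" idea, here's 8x oversampling done the textbook way - zero-stuffing followed by a low-pass interpolation filter. The rates and filter length are just illustrative:

```python
import numpy as np
from scipy.signal import firwin, lfilter

# 8x oversampling: insert zeros between the original samples, then
# low-pass filter. The filter removes the spectral images created by
# the zero-stuffing, leaving new samples that lie on the same "line"
# the original points defined.

fs, ratio = 48000, 8
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t)        # original 48 kHz samples

up = np.zeros(len(x) * ratio)
up[::ratio] = x * ratio                 # zero-stuff (gain restores level)

lp = firwin(255, fs / 2, fs=fs * ratio) # keep only the original 0-24 kHz band
y = lfilter(lp, 1.0, up)                # the same signal at 384 kHz

# Check against the same sine sampled directly at the higher rate,
# allowing for the filter's group delay of (255 - 1) / 2 = 127 samples.
hi_t = np.arange(len(up)) / (fs * ratio)
ref = np.sin(2 * np.pi * 1000 * (hi_t - 127 / (fs * ratio)))
print("max deviation:", np.abs(y[500:] - ref[500:]).max())
```

The inserted zeros are the "guess"; the low-pass filter is what "forces the new samples into their proper values" - it removes everything that differs from the line through the original points.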
The purpose of all this is that raising the sample rate raises the frequency of the errors introduced by the "steps" in the conversion process, which in turn makes them easier to filter out without altering the desired audio signal. My basic point, however, is that the term "oversampling filter" may be somewhat misleading to some people... and thinking of it as an "oversampling process" (which is usually done using a special sort of digital filter) may make the concept easier to grasp.