What I don’t quite get is how noise shaping then works with 16/44. [1] Firstly, I don’t get why noise shaping is required with 16 bits when its noise floor is already around -96dB. [2] Secondly, where does the noise-shaped signal go if the bandwidth is limited to 22.05kHz?
[3] ... why would noise-shaped 8 bits require a higher bandwidth than 22kHz?
1. TBH, in most cases it isn't. If I explain it in a little more detail, perhaps that would help: with 16 bit, the noise floor with (standard, TPDF) dither is about -92dB, as dither typically uses 1 LSB, while un-dithered 16 bit would actually have a noise floor of -98.08dB (16 x 6.02dB + 1.76dB). The vast majority of music has a dynamic range of around 48dB or less. Popular/non-acoustic genres will hit near 0dB numerous times, be relatively loud overall and require a relatively low output level on playback. The dither noise floor at -92dB is going to be 100 times or more below the noise floor of the recording itself and therefore, even at loud playback volumes, the dither noise is going to be completely inaudible. Even with classical and jazz, a -92dB noise floor is going to be inaudible in the vast majority of cases.

However, there is a potential set of circumstances where it *could* be audible. For example, the 1812 Overture (Tchaikovsky) has cannons near the end, and it could be that they produce transient peaks say 18dB above any other peak in the rest of the overture. If you wanted to play back such a recording so that the rest of the overture (excluding the cannons) sounded roughly the same volume as other classical recordings, then you'd have to increase your playback level by that 18dB, and our dither noise floor would therefore be 18dB higher (effectively at -74dBFS). Assuming you have a high quality playback system (capable of playing 18dB louder than normal), normally listen quite loudly and have a listening environment with a low noise floor, then the dither noise floor could potentially become audible. The 1812 Overture is an obvious example, but there are other examples which are not so obvious. For example, a hard hit on an orchestral bass drum produces a large amount of energy.
It's not obvious because much of that energy is below 50Hz, where our hearing is insensitive, so it doesn't sound particularly loud, but we could have peaks up to about 12dB higher than normal. All of the above requires a quite extreme set of circumstances and only applies to a tiny number of recordings, because most recordings with such unusual peaks would have those peaks reduced (compressed/limited), so the recording is suitable for consumers with good equipment/listening environments rather than only for those with excellent ones. Having said all this, it's been standard mastering practice to apply noise-shaped dither to all 16 bit releases for the last 20+ years, as it only takes about a minute to apply and then you're covered, regardless of the music and playback scenario.
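If you want to see that dithered noise floor for yourself, here's a minimal sketch (my own illustration, not anything from a mastering tool): it quantises a test tone to 16 bits with TPDF dither and measures the residual error relative to a full-scale sine. All the parameter choices (tone frequency, level, duration) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100
n = fs * 10                                   # 10 seconds of audio
t = np.arange(n) / fs
signal = 0.5 * np.sin(2 * np.pi * 997 * t)    # -6dBFS test tone

lsb = 1.0 / 32768                             # 1 LSB at 16 bit, full scale +/-1.0
# TPDF dither: sum of two uniform sources, +/-1 LSB in total
dither = (rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)) * lsb
quantized = np.round((signal + dither) / lsb) * lsb

# Total error (quantization error + dither), relative to a full-scale sine
noise = quantized - signal
noise_rms_db = 20 * np.log10(np.sqrt(np.mean(noise**2)) / (1 / np.sqrt(2)))
print(f"noise floor: {noise_rms_db:.1f} dBFS")
```

You should get a figure in the low -90s dBFS (theory says about -93dB for 1 LSB TPDF), matching the "about -92dB" above to within a dB or so.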
2. That's not entirely fixed; it depends on the noise-shaping algorithm and there's sometimes some user (mastering engineer) adjustment available. In general though, the shaped dither noise starts ramping up from around 10kHz and is at its peak by about 17kHz. This deliberately mirrors human hearing: we're most sensitive at around 3kHz, have a roll-off in sensitivity starting around 5-7kHz and a steeper roll-off around 12-14kHz. With noise-shaping we don't get less dither noise, we get the same amount (or typically slightly more), so as far as RMS dither noise is concerned we've still got a dither noise floor of say -90dB, but most of that noise now sits outside the range of hearing sensitivity, giving us a perceived noise floor of around -120dB. This graph of noise-shaping might help you visualise the situation:
The X-axis is frequency and the Y-axis represents relative dB. So with 16 bit the "0dB" line represents about -96dB, and the fairly flat blue (ID=99) line covering it represents standard (TPDF) dither. The other "IDs" represent different user-selectable noise-shaping algorithms. You'll notice a couple of things:

A. We're not actually losing any noise energy, just redistributing it. As we reduce it in one area of freqs, we must increase it elsewhere, so we end up with the same amount of RMS noise energy (exactly how this works is laid out in the Gerzon-Craven Noise-Shaping Theorem).

B. From about 600Hz upwards the shaping curve is roughly an inverse of the Fletcher-Munson equal loudness curves. For example, ID=16 (the strongest noise-shaping) gives us about 26dB less noise at 3kHz, i.e. a noise floor of about -122dB with 16 bit (-96dB - 26dB), but by around 17kHz we've got about 30dB more noise, a noise floor of about -66dB (-96dB + 30dB). However, assuming perfect hearing, our sensitivity is down by about 50-60dB at 17kHz and down by over 100dB at 20kHz (where our redistributed noise peaks at around +36dB). Additionally, our sensitivity rolls off in the lower freqs, starting from around 800Hz. Therefore, as far as human hearing is concerned, the noise-shaped dither noise floor of ID=16 would never sound higher than about -122dB.
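To make the "redistributing, not removing" point concrete, here's a toy sketch (mine, far cruder than the ID algorithms in the graph): a first-order error-feedback quantiser, about the simplest noise shaper there is. It compares noise power in the sensitive low/mid band and at the top of the band, with and without shaping.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, n = 44100, 1 << 15
lsb = 1.0 / 32768                        # 16-bit step, full scale +/-1.0
x = 0.25 * np.sin(2 * np.pi * 1000 * np.arange(n) / fs)

def dithered_quantize(signal, shaped):
    """TPDF-dithered 16 bit quantiser, optionally with first-order error
    feedback (noise transfer function 1 - z^-1, a gentle high-pass)."""
    out = np.empty_like(signal)
    e = 0.0                              # previous sample's total error
    for i, s in enumerate(signal):
        d = (rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)) * lsb  # TPDF
        w = s - e if shaped else s       # subtract the fed-back error
        out[i] = np.round((w + d) / lsb) * lsb
        e = out[i] - w                   # error to feed back next sample
    return out

freqs = np.fft.rfftfreq(n, 1 / fs)
stats = {}
for shaped in (False, True):
    noise = dithered_quantize(x, shaped) - x
    power = np.abs(np.fft.rfft(noise)) ** 2
    stats[shaped] = (
        20 * np.log10(np.std(noise)),                # total RMS noise, dB
        10 * np.log10(power[freqs < 4000].mean()),   # near peak hearing sensitivity
        10 * np.log10(power[freqs > 17000].mean()),  # top of the band
    )
print(stats)
```

Running this shows exactly the trade described above: total RMS noise is the same or slightly higher when shaped, but noise below 4kHz drops by several dB while noise above 17kHz rises by a similar amount. The real mastering algorithms just use much steeper, psychoacoustically weighted filters instead of this simple first difference.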
3. Using my explanation and graph from point #2, let's substitute 16 bit with 8 bit. Our ID=99 (0dB) line now represents -48dB. ID=16 therefore represents a perceptual noise floor of -74dB (26dB lower than -48dB), while peak noise (at around 20kHz) would be at about -12dB (-48dB + about 36dB @ 20kHz).

However, let's say for illustration purposes that we want a perceptual noise floor of -92dB (roughly the same as standard dithered 16 bit). First of all, we're going to need another algorithm, one that is 18dB more aggressive than ID=16, so that at peak hearing sensitivity (about 3kHz) it is removing 44dB of noise instead of about 26dB (-48dB - 44dB = -92dB). Unfortunately though, this means that the peak noise level (at about 20kHz) is likewise going to be about 18dB higher than ID=16: -48dB + 36dB + 18dB = +6dBFS, which is impossible. The solution would be to increase the sample rate, say double it to 88.2kS/s. With ID=16 the highest redistributed noise levels cover a 5kHz freq band (17kHz to 22kHz). With a sample rate of 88.2kS/s we could spread that same amount of redistributed noise energy over a much larger frequency band, a 27kHz band (that 5kHz + the additional 22.05kHz), and thereby significantly lower its level.

From all this, I hope you can see that the lower the noise floor we wish to achieve, the more dither noise has to be redistributed, and in addition, the fewer bits we have to play with, the higher the dither noise we've got to start with. Hence why SACD, with just one bit plus a desired noise floor of about -120dB, needs a sample rate of 2.8MHz to redistribute the massive amount of resultant noise.
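The dB bookkeeping in point 3 can be spelled out as simple arithmetic. A sketch (the 26dB and 36dB figures are just the rough ID=16 values read off the graph, as used above):

```python
# The dB arithmetic from point 3, spelled out. All figures are the rough
# working values from the text, not properties of any specific algorithm.
floor_8bit = -48.0            # 8 bit "0dB" reference line (approx. 8 x 6dB)
id16_cut_3khz = 26.0          # dB of noise removed at peak hearing sensitivity
id16_boost_20khz = 36.0       # dB of noise added at the top of the band

perceptual_floor = floor_8bit - id16_cut_3khz                  # -74dB
peak_noise = floor_8bit + id16_boost_20khz                     # -12dB

target_floor = -92.0          # roughly standard dithered 16 bit
extra_shaping = (floor_8bit - target_floor) - id16_cut_3khz    # 18dB more needed
new_peak = peak_noise + extra_shaping                          # +6dBFS: impossible

print(perceptual_floor, peak_noise, extra_shaping, new_peak)   # -74.0 -12.0 18.0 6.0
```

The +6dBFS result is the whole problem: there's no headroom left to put the redistributed noise, which is why the only way out is more bandwidth (a higher sample rate).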
BTW, in the example I quoted previously (Lipshitz & Vanderkooy), they achieved a noise floor of -120.4dB with 8 bits, and they didn't follow the Fletcher-Munson curve; they simply reduced the noise by 72dB throughout the 0Hz-20kHz band. Obviously that would result in a lot of noise needing to be redistributed (all of it above 20kHz), so they used a sample rate of 176.4kS/s (44.1kS/s x 4), with the redistributed noise (at -19dBFS) occupying the 20kHz-88.2kHz band.
G
PS. I'm not sure how easy my explanation is to follow, so do ask if anything is unclear.