However, I tend to lean towards Rob's point that current neuroscience - and with it our understanding of intelligence and audio processing - is in its infancy.
Slide 3. Each sound might be separated out by the brain - if you're concentrating on it. You and Rob seem to be in agreement here.
Slide 4. It seems perfectly reasonable to require some margin of safety.
Slide 5. In all music (except perhaps electronic music), the ADSR (attack, decay, sustain, release) is exactly what helps us distnguish one instrument from another. This is how synthesizers trick the brain into thinking a keyboard is actually a trumpet. I don't deny there are overtones that affect overall timbre, but these aren't that relevant for short, staccato hemidemisemiquavers where there is very little time for the brain to process them.
Slide 6. Imagine I'm on a chromatic run with (for whatever reason) no transients at all in the reproduction of my notes - i.e., I'm slowly fuzzing from one note to the next. (Not because I'm not an outstanding bass player, but because of a poor digital reproduction.) At any point in time between me playing an E and a G, the pitch of the reproduced note is anywhere between 83.4 Hz and ~98 Hz. Are you hearing the start of the G, or a 87.3 Hz F, or a 92.5 Hz F#? I can see what he might be alluding to. Without a clear start/stop of each note, all you have is a continuous shift in pitch.
Slide 7. Rob is absolutely right on this one. ADSR.
Slide 8. First four points I agree with completely. Pitch and timbre we've covered already. Starting and stopping is obvious (you're not really calling that a lie, are you?!). Yep - he's right about soundstage too - the timing differences reaching the left and right ear are a crucial part of what allow us to place sounds in 3D space. As for the last 4 points - well, I'm not a neuroscientist, but we agreed to roll with the 4 microseconds, right? "Very small" and "very big" can be subjective, so this isn't necessarily wrong. You then say "Timing accuracy between channels is perfect at 44.1, there is zero relative shift!" If we're dealing with sound that doesn't reach the left and right ears at the exact same time, then we'd better have a relative shift, or we'll have unphysically altered the soundstage. I believe what Rob is shooting for is a consistent reproduction of the timing (to L and R channels) from a given source (but the details he gives are hazy - see my later comment on this).
[9] I'm sure we all appreciate Dirac delta functions don't usually exist in music. But what if I were to record a percussive instrument like a snare - or maybe even a blast wave on my next album? To all intents and purposes, these are like the initial spike of the dirac delta to a 44 kHz sampling rate. I would absolutely want a DAC that could take care of these extreme cases, even if I only spent most of my time listening to something with a slower attack, like piano.
I don't disagree that we have limited understanding of how the brain creates perception. However Rob Watts quoted a figure of 90% of what we perceive is created by the brain, made various claims about how the brain perceives sound and then said we have "no understanding" of how the brain processes the ears data, thereby making a lie of all the claims.
Slide 3: No, there is some separation but it's rather limited. Are you saying that when you listen to a symphony orchestra you can separate all the 18 or so individual violins during a tutti section, even if you concentrate solely on the violins? If not, then the brain cannot "separate each sound".
Slide 4: But how much is "some"? Is a thousand or more times below audibility not enough? Neurons firing with a resolution of 4um does not necessarily imply that's the resolution the brain can hear, it also does not indicate the brain samples at 250kHz. In fact, if we take his previous figure of the brain only using 10% of the data it receives, then only 1 in 10 of those neuron impulses is processed, implying a resolution more like 40um.
Slide 5: Exactly, but transients only account for the A of the ADSR! And, the Attack contributes to the sound/timbre of an instrument only to a very limited degree, for example the attack of brass instruments is very similar, what happens in the Sustain part of the envelope plays a larger part in differentiating say a french horn from a trumpet. Also, even hemi-demi-semi quavers still have a duration of many milli-seconds, not micro-secs!
Slide 6: Transients have virtually no impact on the detection of pitch. It's mainly the fundamental and harmonics, of the sustain portion of the sound envelope. It's not difficult to test for yourself; take a note and cut off the transient. Listen to that transient and see if you can determine pitch, listen to the sustain without the transient and try again. Notes blending into each other is a function of performance, the resonance/s of the instrument and the acoustics, digital accuracy is way beyond any of these factors.
Slide 7: But Rob Watts is not saying ADSR, he's effectively saying timbre depends on A (Attack)!
Slide 8: I do NOT agree with the first 4 points: We do NOT perceive pitch from transients. Transients play a part in soundstage but only a part, in 3D space reflections are vital. Transients play a limited part in timbre. The last of the four points is rather vague but is not entirely/always true. The last 4 points seem more ridiculous: I'm not sure the 4um is an accurate figure and it's not clear what he means by timing resolution. If he's talking about jitter, CD is down in the tens of pico secs, a hundred thousand times or so below even his 4um detection. If he's talking about timing inaccuracy between left and right channels, there is none on CD. How does a hundred thousand times below the 4us he quotes and no timing error between left and right channels have any impact at all, let alone a "big impact"?
9. No. A mic capsule has mass, it cannot respond instantaneously, same with the reproduction transducer, same with ear drum and then we've also got air and materials absorption. Regardless of what say a snare drum rimshot actually produces, what you are actually going to hear is already highly distorted, even with no digital conversion in the chain (for example, mic preamp output direct to speaker amplifier). Not to mention that a rimshot is not just an initial transient!
G