I love your summary, AVI. 
Just my contribution: I've dabbled in upmixing my music myself, and I often like the result. You might laugh at me for saying this, but there is a subtle difference: it seems like I hear more detail. I determined this with ABX testing.
Let me explain:
To use a photography analogy: I prefer to process my photos with Photoshop's sharpening filters. Much of that helps with interpolation and therefore reduces the effective lens fuzziness. Also, higher contrast and edgier edges make details easier to notice. Is the latter preserving the original picture's information? Nope. We are altering it. But it makes details easier to notice, which is good for perception. Sometimes, to help our imprecise senses, the original data needs to be "helped."
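To make the "edgier edges" point concrete, here is a minimal sketch of a generic unsharp mask on a single row of pixels. It is not Photoshop's actual filter; the function name `unsharp_mask`, the 3-pixel blur, and the sample values are all assumptions for illustration.

```python
def unsharp_mask(row, amount=1.0):
    """Sharpen a 1-D row of pixel values by adding back the
    difference between each pixel and a 3-pixel box blur."""
    out = []
    for i, v in enumerate(row):
        left = row[max(i - 1, 0)]               # repeat the edge pixel at the borders
        right = row[min(i + 1, len(row) - 1)]
        blurred = (left + v + right) / 3.0
        out.append(v + amount * (v - blurred))  # exaggerate local contrast
    return out

edge = [50, 50, 50, 200, 200, 200]              # a soft dark-to-bright edge
print(unsharp_mask(edge))                       # [50.0, 50.0, 0.0, 250.0, 200.0, 200.0]
```

The flat areas are untouched, but the step across the edge grows from 150 to 250. That is new information invented by the filter, not something the lens recorded, yet the edge is easier to see.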
Now my audio theory. I tend to prefer the "linear phase brickwall" interpolation. From what I can understand, it will sometimes make the waveforms more pointy (more triangular vs. curvy and sine-wave-ish). More pointy waves (ever listen to a pure triangular tone vs. a sine wave?) would give a slightly harsher (to use the meaner word) or crisper (to use a nicer word) sound, giving the illusion of more detail, or perhaps emphasising details. Either or both could be true.
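For anyone who wants to poke at this, here is a rough sketch of 2x upsampling with a symmetric (and therefore linear-phase) windowed-sinc filter. It is only an approximation of a brickwall filter; the Hann window, the `half_width` of 32, and the function names are my assumptions, not what any particular player or converter uses. It shows one thing that can be checked: near a sharp edge, the interpolated points overshoot the original sample values.

```python
import math

def halfband_taps(half_width):
    """Symmetric Hann-windowed sinc taps for 2x interpolation.
    Symmetry around the centre tap is what makes the filter linear phase."""
    taps = []
    for n in range(-half_width, half_width + 1):
        x = n / 2.0
        sinc = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 * (1.0 + math.cos(math.pi * n / half_width))
        taps.append(sinc * window)
    return taps

def upsample_2x(samples, half_width=32):
    """Insert a zero after every sample, then low-pass filter.
    The filter is applied centred, so there is no net delay."""
    taps = halfband_taps(half_width)
    stuffed = []
    for s in samples:
        stuffed.extend([s, 0.0])
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half_width - k
            if 0 <= j < len(stuffed):   # treat everything outside the clip as silence
                acc += h * stuffed[j]
        out.append(acc)
    return out

# A hard step from -1 to +1, like one edge of a square wave.
step = [-1.0] * 64 + [1.0] * 64
y = upsample_2x(step)
print(max(step), max(y))   # the interpolated peak is above 1.0
```

The original samples come through essentially unchanged (every second output value), while the in-between values ring and overshoot around the edge. Whether that overshoot is what I am hearing as "crisper" is still just my theory.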
By the way, I absolutely hated the tube amps I've tried, even quite expensive ones. Still, tube amps, which will have slower slew rates/reaction times (and therefore will round off the tips of triangular and square waves), are liked by some. I'd like to propose that using files higher than 44.1 kHz with a tube amp would be pointless, as its slew rate is so slow that it wouldn't make a difference. However, higher resolution files can't hurt, and I always say that hard disk space is cheap, so just go with the most information possible. Peace of mind, and it might come in handy some day. You never know.
Now, increasing bit depth. 24-bit is much better for mastering and recording, as you have more leeway and flexibility when editing files. In digital photography, using 12-bit colour has saved pictures for me: to use a very simple example, I can clip off the top 4 bits of an underexposed file (those brightest bits were wasted in this case) and still have 8 bits of data left. Sometimes you have to do it, and you should be prepared for when you have to rescue mistakes.
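The 12-bit rescue can be sketched in a few lines. The pixel values below are made up (a badly underexposed gradient that only uses the bottom 8 of its 12 bits), and both function names are hypothetical. The comparison is against what an 8-bit capture of the same scene would have stored.

```python
def rescue_from_12bit(raw):
    """Underexposed 12-bit data: the top 4 bits are all zero,
    so the low 8 bits already form a full 8-bit image."""
    assert all(0 <= v < 256 for v in raw), "not underexposed enough for this trick"
    return [v & 0xFF for v in raw]

def rescue_from_8bit(raw):
    """Same scene captured at 8 bits (the low 4 bits are lost),
    then brightened 16x to fill the 8-bit range."""
    captured = [v >> 4 for v in raw]
    return [min(c << 4, 255) for c in captured]

# Hypothetical underexposed gradient: 12-bit samples that never exceed 255.
raw = list(range(256))

print(len(set(rescue_from_12bit(raw))))  # 256 distinct levels survive
print(len(set(rescue_from_8bit(raw))))   # only 16 distinct levels survive
```

Same mistake, same fix, but the 8-bit capture is left with 16 shades and visible banding. That is the leeway I mean, and 24-bit audio gives you the same kind of headroom when editing.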
For playback, humans can usually only hear about 60 dB of dynamic range (I'd rather round up to 80 dB for playback to be safe). 16-bit has over 95 dB, so it's good enough for our poor little ears. My cat can probably sense a larger range, but we aren't talking about cats here.
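The 16-bit figure comes from the usual rule of thumb for linear PCM, about 6 dB per bit, or 20·log10(2^bits). A quick check (this ignores dither and noise shaping, so treat it as the theoretical number only):

```python
import math

def dynamic_range_db(bits):
    """Theoretical dynamic range of linear PCM: the ratio of
    full scale to one quantisation step, in decibels."""
    return 20 * math.log10(2 ** bits)

for bits in (8, 16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
# 8-bit: 48.2 dB
# 16-bit: 96.3 dB
# 24-bit: 144.5 dB
```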
Sorry for tl;dr.