1. You're of course free to think whatever you want, but the actual reality/facts prove that there IS a "final answer". I can understand how/why you might think there isn't though, which is why I'll respond to point 3 before point 2 ...
3. There are many similarities between digital photography and digital audio, as well as many similarities between how we see/perceive images and how we hear/perceive sound, and therefore we can potentially draw many valid analogies between the two. However, there are also many differences between the two (some of which are quite profound), and therefore potentially many analogies that are only partially valid, and some that are quite profoundly invalid! I believe this is the trap you may have fallen into, which explains your conclusion that there is "no final answer". Unfortunately, I am NOT an authority on digital photography/computer graphics, so my terminology and description of digital imaging may not be entirely correct, but I'm going to try to give a couple of examples:
One of the biggest differences is what we're actually converting into digital data in the first place. With photography, our source "format" is light: waves/packets (photons) of electromagnetic energy, which we can convert into digital data with sensors. With audio, our source "format" is sound pressure waves: mechanical/kinetic energy travelling through a medium. BUT we can only convert electromagnetic energy into digital data with sensors, not mechanical energy, so we CANNOT convert sound directly into digital data! The solution is simple in theory and didn't need discovering because it already existed, nearly 150 years ago and 50 years before digital audio was first conceived: we first convert this mechanical energy into electromagnetic energy (specifically electricity), a process called transduction; then we can convert that electrical signal into digital data, and of course perform the reverse conversion and transduction to reproduce the sound waves. However, this has consequences/limitations compared to digital imaging (which doesn't involve transduction), because transduction is highly inefficient (due to the laws of motion/kinetics) and therefore requires relatively massive amounts of amplification, which in turn causes even more limitations (due to other laws of physics, such as thermal noise).
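To make that chain concrete, here's a minimal NumPy sketch of the capture path just described. All the constants (microphone sensitivity, gain, noise level) are made-up illustrative values, not real hardware figures; the point is only the order of operations: transduce, amplify (noise and all), then quantise.

```python
import numpy as np

# Toy model of the capture chain described above, NOT a real ADC:
# sound pressure -> transducer (tiny voltage) -> amplification (which
# also amplifies thermal noise) -> 16bit quantisation.

fs = 48_000                                # sample rate (Hz)
t = np.arange(fs) / fs                     # one second of sample times
pressure = np.sin(2 * np.pi * 1000 * t)    # 1 kHz sound pressure wave

mic_sensitivity = 0.01                     # transduction is inefficient: tiny output voltage
voltage = pressure * mic_sensitivity       # microphone output, on the order of millivolts

gain = 100.0                               # massive amplification is required...
noise = np.random.normal(0.0, 1e-4, fs)    # ...but thermal noise gets amplified too
amplified = (voltage + noise) * gain

# Quantise the *voltage* (not the frequency!) into 16bit integer samples
pcm16 = np.clip(np.round(amplified * 32767), -32768, 32767).astype(np.int16)
```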
Another major difference is the different response of our eyes and ears. For example: our ears have a frequency response range of about 20 kHz and can resolve that range into about 10,000 different pitches. Our eyes have a frequency response range of about 320 THz and can resolve that range into about 10,000,000 different colours. So 16bit, which can represent ~65,000 different values, offers about 150 times fewer colours than the human eye can differentiate but about 6.5 times more pitches than the human ear can differentiate. So 16bit is definitely "low-res" for the human eye, and 24bit, with ~16.8 million values, is roughly 1.7 times more than required for visual "hi-res". However, 16bit for the human ear is already 6.5 times more than required for audio "hi-res"! This isn't an entirely fair comparison though, because in digital audio we do not use bits to directly represent the frequencies (pitches/colours) but to represent the amplitude of the transduced electrical voltage (from which frequency is derived), which is another of the differences between digital audio and digital imaging. So, as mentioned above, this is effectively a "partially valid" analogy and demonstrates the dangers of visual/audio analogies!
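For anyone who wants to check the arithmetic, here are those ratios computed directly. The ~10,000,000 colour and ~10,000 pitch figures are the rough approximations used in this post, not precise psychophysical constants:

```python
eye_colours = 10_000_000      # distinct colours the eye can differentiate (approx.)
ear_pitches = 10_000          # distinct pitches the ear can differentiate (approx.)

steps_16bit = 2 ** 16         # 65,536 values
steps_24bit = 2 ** 24         # 16,777,216 values

print(eye_colours / steps_16bit)   # ~152.6 -> 16bit is "low-res" for the eye
print(steps_24bit / eye_colours)   # ~1.68  -> 24bit exceeds visual "hi-res"
print(steps_16bit / ear_pitches)   # ~6.55  -> 16bit already exceeds aural "hi-res"
```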
2. There is no 24bit upscaling in digital audio! Maybe in digital imaging there is; maybe you can interpolate colours between the ~65,000 available values and write those interpolated colours to the ~16.8 million (24bit) available values, but digital audio doesn't work that way. If you "upscale" 16bit audio to 24bit, nothing changes; there is no "upscaling", you just get 16bit audio in a 24bit container, with the extra 8bits (LSBs) padded with zeros.
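Here's a tiny NumPy sketch of what that "upscaling" actually amounts to, assuming the samples are left-justified in the wider container (which is how 16bit content normally sits in a 24bit stream): a bit shift that pads eight zero LSBs and creates no new information.

```python
import numpy as np

# Four example 16bit samples, including the extremes
samples_16 = np.array([12345, -32768, 0, 32767], dtype=np.int16)

# "Upscale" to 24bit: shift each sample up by 8 bits, i.e. pad the
# 8 least significant bits with zeros
samples_24 = samples_16.astype(np.int32) << 8

# The audio is unchanged: shifting back recovers the originals exactly
assert np.array_equal(samples_24 >> 8, samples_16.astype(np.int32))
```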
G