Suppose you have a system where the values are 0, 1, 2, 3; this is effectively 2 bit. Now suppose you want a system that does 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5. In your mind, you'd have doubled the precision of the quantization.
Let's say the digital values 0, 1, 2, 3 represent 0, 1, 2, 3 mV out of your microphone. If you want to record 0.5 mV steps, you simply double the preamp gain for your mike, making the output values go from 0 to 7 mV in 1 mV steps, which means you have doubled the dynamic range necessary to record the data. Effectively, to diminish the quantization error, you simply increase the dynamic range. You now have to use a 3 bit encoding with 0, 1, 2, 3, 4, 5, 6, 7 as values.
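If it helps to see the numbers, here is a small Python sketch of that mike example. The `quantize` helper, the 1 mV step and the gain of 2 are just my illustration of the idea (a made-up ideal converter that rounds half up and clips), not how any real ADC is programmed.

```python
import math

def quantize(mv, bits, step_mv=1.0):
    """Hypothetical ideal converter: round to the nearest step, clip to the code range."""
    code = math.floor(mv / step_mv + 0.5)   # round half up
    return max(0, min(code, 2 ** bits - 1))

# Mike output in mV, including the half steps we would like to capture
signal_mv = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

# 2 bit converter, no extra gain: codes 0..3, each worth 1 mV
codes_2bit = [quantize(v, bits=2) for v in signal_mv]
error_2bit = max(abs(v - c * 1.0) for v, c in zip(signal_mv, codes_2bit))

# Preamp gain doubled, 3 bit converter: codes 0..7, each worth 0.5 mV of mike signal
gain = 2
codes_3bit = [quantize(gain * v, bits=3) for v in signal_mv]
error_3bit = max(abs(v - c / gain) for v, c in zip(signal_mv, codes_3bit))

print(codes_2bit, error_2bit)
print(codes_3bit, error_3bit)
```

The 3 bit version covers twice the range at the converter input, and referred back to the mike signal each step is worth half as much. Same thing seen from two sides.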
Anyway, most recordings are 24 bit today, which allows the smallest step to be set very low. Once digitized, every track (one for the piano, one for the singer...) is fed to the DAW (digital audio workstation), which performs its internal calculations in 32 or even 64 bit. Finally, when you output your mastered song, you decide the values of your loudest and softest sounds, considering that a variation smaller than the softest sound wouldn't be heard. For a CD this is 16 bit, i.e. about 96 dB of variation between the loudest and the softest sound, i.e. 65,536 values. When you go up to high-def 24 bit, you get about 144 dB, which means that either your loudest sound gets louder or your softest sound gets softer. Increasing dynamic range is the same as decreasing quantization error.
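For reference, the 96 dB and 144 dB figures come from the ratio between the largest value and one step, roughly 6 dB per bit. A quick Python check (the function name is mine, and this is the plain ratio formula, ignoring dither and the usual small correction terms):

```python
import math

def dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24):
    print(bits, "bit:", 2 ** bits, "values,", round(dynamic_range_db(bits), 1), "dB")
```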
Except that your listening room has a noise floor. Even in an anechoic room, if your hearing is very good, you'd hear your heartbeat, the flow of your blood... So the softest sound, the smallest variation, doesn't have to be that small; there's a limit. Now consider your 16 bit CD. Your room noise floor is 25 dB (that's a very quiet room) and you set the volume so that the loudest sound reaches 120 dB (that's really loud): your softest sound is played at 24 dB, i.e. lost in the noise floor. If you play so that the loudest sound is 90 dB, the softest sound plays below 0 dB (about -6 dB), which is below the audibility threshold. Of course this does not take dithering into account, which brings the perceivable dynamic range up to about 120 dB with a 16 bit encoding; with the loudest level of playback at 120 dB, the smallest detail is then right at the audibility threshold.
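The playback arithmetic above is just a subtraction, but here it is spelled out in Python. I'm using the rounded 96 dB figure for 16 bit, the 25 dB room from my example, and 0 dB as the hearing threshold; the names are mine.

```python
CD_RANGE_DB = 96          # rounded dynamic range of 16 bit PCM, no dither
ROOM_NOISE_DB = 25        # assumed: a very quiet listening room
HEARING_THRESHOLD_DB = 0  # assumed: nominal audibility threshold

def softest_level_db(peak_db, range_db=CD_RANGE_DB):
    """Playback level of the smallest step when the loudest sound plays at peak_db."""
    return peak_db - range_db

for peak_db in (120, 90):
    softest = softest_level_db(peak_db)
    masked = softest < ROOM_NOISE_DB
    inaudible = softest < HEARING_THRESHOLD_DB
    print(f"peak {peak_db} dB -> softest {softest} dB, "
          f"below room noise: {masked}, below hearing threshold: {inaudible}")
```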
As an addendum: your question was mostly answered in the first two paragraphs (the 2 bit versus 3 bit example); the rest was mostly about why 16 bit is usually enough for playback in household conditions. That's with PCM encoding; SACD is a different story.
Originally Posted by Vitor Machado
There is one thing I don't understand though:
Wouldn't it make more sense to use the extra bits not for increased dynamic range, but for more gradual steps in the quantization?
Suppose we have 3 bits (possible values: 000, 001, ... 111), and they represent, in order, -30, -20, -10, 0, 10, 20, 30, 40.
If we add one more bit, instead of going like -70, -60, ... , 70, 80 (which seems to be the analogous case for 16 vs. 24 bits if I understand correctly), why not -30, -25, -20, ... , 30, 35, 40, adding more steps instead of bigger range?
If someone could clarify things with some practical example like this it would be great! 
EDIT: Better numbers for my example.