Well, that bit at the end was something of a simplification, and only half the picture. By "consistency" I mean that when you listen to A multiple times, you give similar responses (ratings) for A each time, and likewise for B or anything else.
It's really a measure of rating things that are different as different, while rating things that are the same as the same. It's a ratio, so it goes up (and with it the statistical power) when (1) you rate things that are different as even more different or (2) you are even more consistent in rating things that are the same as the same. The low positioning of the non-trained listeners could be due to (1) not distinguishing much between things that are different, (2) not being consistent in rating things that are the same, or (3) a combination of the two.
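If it helps, here's a rough sketch of the kind of ratio I mean. I'm assuming it works like the usual one-way ANOVA F ratio (variance between speakers divided by variance within each speaker's repeated ratings); I haven't checked that this is exactly what the study computed. The function name `f_ratio` and all the ratings below are made up purely for illustration.

```python
from statistics import mean

def f_ratio(ratings_by_speaker):
    """One-way ANOVA-style F ratio (assumed metric, not confirmed by the study).

    Numerator: how far apart the speakers' average ratings are.
    Denominator: how much the ratings scatter for the same speaker.
    Note: divides by zero if every repeat rating is identical.
    """
    groups = list(ratings_by_speaker.values())
    k = len(groups)                                  # number of speakers
    n_total = sum(len(g) for g in groups)            # total number of ratings
    grand_mean = mean([x for g in groups for x in g])

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical ratings on a 1-10 scale, three repeats per speaker.
trained = {"A": [8, 8, 7], "B": [6, 6, 5], "D": [2, 3, 2]}
untrained = {"A": [7, 5, 8], "B": [6, 7, 4], "D": [5, 6, 3]}

print(f_ratio(trained), f_ratio(untrained))
```

With these invented numbers the "trained" listener gets a much larger ratio, both because the speaker averages are spread further apart and because the repeat ratings for each speaker agree more closely. Either factor alone would also raise it.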
They seemed to just use a 1-10 preference scale, as far as I could tell from a quick skim. That doesn't really have anything to do with hearing the same note, unless I misunderstand what you mean.
Maybe this graph is easier to understand:
(graph from http://seanolive.blogspot.com/2012/05/more-evidence-that-kids-even-japanese.html)
though that just shows the averages, so it does not give you information about the variability within a given group / loudspeaker combination. You see at least there that the trained listeners distinguish more between the different speakers. Their score for A is much higher than their score for D.
Edited by mikeaj - 10/19/12 at 7:49pm