The relationship of statistics to "knowledge of the real world" is quite subtle and easily misunderstood. I usually assume most people on the sound science forum know more statistics than I do, because I know only a little of the math. However, I may have a better "feel" for what statistics mean than most people, because my day job is writing software that does something called "statistical orbit determination." That is, navigating spacecraft on the basis of noisy and incomplete measurements. At this point you are probably wondering why I claim to know little about statistics! I don't actually program the mathematical part of the software. But I have stood behind navigators, looking over their shoulders, and heard them thinking out loud.
Statistics is a kind of black art, because it involves manipulating models of the world, when you often don't know if those models are correct (even remotely correct)! Your conclusions can be right. They can be wrong. And they can be dangerous. (Spacecraft have been lost due to mistakes in navigation.)
They can be right for the right reason, or right for the wrong reason.
In the case of an ABX test, we start with a null hypothesis: the hypothesis that the listener cannot hear any difference and is picking an answer based on some arbitrary criterion or on random guessing. We then model this listener as someone who answers correctly 50% of the time, or to put it another way, p=0.5.
In other words, we are modeling the listener as essentially a fair coin. That's a bit of a funny thing to do, considering a listener is a brain and a body and so forth. For all that to reduce to a simple coin toss is amusing in a way, although it is an accurate model if there is no audible difference.
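To make the coin-toss model concrete, here is a minimal sketch in Python. It computes how likely a pure guesser (p=0.5) is to get at least a given number of answers right. The 16 trials and 12 correct answers are numbers I picked for illustration, not results from any actual test, and the function name is my own.

```python
from math import comb

def chance_by_guessing(correct, trials, p=0.5):
    """Probability of scoring at least `correct` out of `trials`
    when each answer is right with probability p (p=0.5 is the fair coin)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Illustrative numbers only: 12 right out of 16 trials.
print(chance_by_guessing(12, 16))  # about 0.038
```

So a fair coin would do this well a little under 4% of the time. That number says something about the coin, and nothing yet about the listener.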
Now, you might wonder, why do we take this approach?
Here's an analogy.
If someone claims to be a bad golfer, then goes out and gets a bad score, that is consistent with being a bad golfer. If he gets a good score, that is inconsistent with being a bad golfer.
Now suppose we have reason to suspect that someone who is actually a good golfer might fake being a bad one. If he claims to be a good golfer and then gets a bad score, you could say that is inconsistent. But we know a bad score is easy to fake. So in certain contexts, we cannot reject his claim of being a good golfer on the basis of a bad score.
On the other hand, if he claims to be a bad golfer, then gets a good score, we can categorically reject his claim. We intuitively know this, but a way to put it mathematically is to say there is very little probability he could achieve a good score while being a bad golfer.
What ABX testing analysis does is start by assuming he's a very bad golfer and seeing if he can prove otherwise. Or rather, it starts by assuming the test subject is randomly guessing, and waits for them to prove otherwise. Call the hypothesis that they are randomly guessing H0.
A good test result can demonstrate H0 is unlikely to be true. That's because you can't fake a good result, except by getting lucky. And if you have only a 3% chance of "getting lucky", and yet you can repeat this feat several times, you have pretty much conclusively demonstrated you weren't lucky. Rather, H0 is false.
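Here is a rough sketch of why repeating the feat matters. It assumes each session is independent and that a guesser has the same 3% chance of passing each one. Both are simplifying assumptions of mine, and the function name is made up for the example.

```python
def chance_of_repeated_luck(p_single, repeats):
    """Probability that a pure guesser passes `repeats` independent
    sessions in a row, given a chance p_single of passing any one."""
    return p_single ** repeats

# The 3% figure from the text, repeated one to three times.
for n in (1, 2, 3):
    print(n, chance_of_repeated_luck(0.03, n))
```

Under those assumptions, three passes in a row would happen by luck fewer than 3 times in 100,000.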
So if H0 is false, then what is true? In a sense, we still don't know. All we know is that H0 is false. Almost anything else could be true.
For example, we can demonstrate the test subject isn't well modeled by a 50% chance of guessing right, but does that mean they are well modeled by some other chance of guessing right? Not really, because people and test situations are way too complex to be reduced to a number.
Also, note that I said the golfer can easily be faking a bad score. The analogy is that a test can be insensitive for any number of reasons, and not just the ones Royalcrown thinks I'm "making up." Subject isn't paying attention, bad choice of music, incorrect or no listener training, wrong choice of test subjects, etc.
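One way to see how a test can be insensitive is the sketch below. It imagines a listener who really does hear something, but only well enough to be right 60% of the time, and asks how often that listener would pass a 16-trial test at the 5% level. The 60%, the 16 trials and the 5% cutoff are all hypothetical values I chose, and as I said above, reducing a person to a single number is itself a crude model.

```python
from math import comb

def tail(correct, trials, p):
    """Probability of at least `correct` right answers out of `trials`
    when each answer is right with probability p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )

def passing_score(trials, alpha=0.05):
    """Smallest score a pure guesser (p=0.5) reaches with probability
    at most alpha, or None if no score in this test is that unlikely."""
    for correct in range(trials + 1):
        if tail(correct, trials, 0.5) <= alpha:
            return correct
    return None

def chance_to_pass(true_p, trials, alpha=0.05):
    """Probability that a listener with hit rate true_p reaches the passing score."""
    needed = passing_score(trials, alpha)
    if needed is None:
        return 0.0
    return tail(needed, trials, true_p)

# Hypothetical listener: right 60% of the time, 16 trials, 5% cutoff.
print(passing_score(16))
print(chance_to_pass(0.6, 16))
```

With these made-up numbers the listener fails the test more often than they pass it, even though H0 is false for them. That is the golfer turning in a bad score.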
Now, you might think at this point I'm trying to use this argument to attack all skeptics and that I'll never let go. If you think that, well, sorry to disappoint. There is a time and a place to accept H0. It is reasonable to do that when you feel that adequate testing has been done under good conditions, using all available evidence to design the tests.