I agree that it could be done a little more clearly (I like how Foobar ABX does it, with the two reference clips A and B both playable separately, and the two test clips X and Y also playable separately), but it seems understandable to me. AB is the reference clip, which consists of a higher quality and lower quality sample in some order (A being the first sample in the reference, and B being the second sample in the reference). You have to determine if the test clip, designated as XY, has the two samples in the same order as the original reference (A first, then B, designated as AB), or if the order is reversed (B first, then A, designated as BA). At the end, the final question then asks whether A (the first clip in the reference sample) or B (the second clip in the reference sample) is the higher quality clip. This does seem like a reasonable test to me, since it looks for both the ability of the listener to distinguish between the two clips, and the ability of the listener to audibly determine which clip contains the higher resolution sample.
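For anyone who finds the AB/XY wording confusing, here is how I read the test logic, written as a rough Python sketch. The function names and the "high"/"low" labels are my own invention for illustration, not anything taken from the test page itself.

```python
import random

def make_reference(rng):
    """Pick the reference order. A is the first sample, B is the second."""
    samples = ["high", "low"]
    rng.shuffle(samples)
    return tuple(samples)

def make_trial(reference, rng):
    """Build one test clip (XY) and return (correct_answer, clip).

    The correct answer is "AB" if the test clip keeps the reference order,
    or "BA" if the two samples are swapped.
    """
    if rng.random() < 0.5:
        return "AB", reference
    return "BA", (reference[1], reference[0])

def score(answers, truths):
    """Count how many of the listener's AB/BA answers were correct."""
    return sum(a == t for a, t in zip(answers, truths))

def higher_quality_label(reference):
    """Correct answer to the final question: is A or B the better sample?"""
    return "A" if reference[0] == "high" else "B"
```

So the AB/BA trials measure whether you can tell the samples apart at all, and the last question measures whether you can tell which one is actually better.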
All that said, I don't think I could tell apart any two well-encoded files with that test. Because the reference clip has the two samples back to back, instead of having two separate reference samples (A and B), it is very difficult to jump back and forth between a small subsection of A and the matching subsection of B, which is the only way anyone has a hope of hearing a small, subtle difference. Foobar ABX lets you jump back and forth between two clips while preserving your location in the clip (so when you go to clip B from ten seconds into clip A, it automatically starts clip B at ten seconds in, leaving almost no gap in the music). That kind of switching makes it possible to perceive very subtle audible differences that you normally wouldn't remember well enough to find.
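To make the position-preserving switching idea concrete, here is a minimal sketch of it in Python. This is not how Foobar ABX is actually implemented (I have no idea what its internals look like). The class name `SyncedSwitcher` and its methods are hypothetical, and it assumes both clips are already decoded to equal-length sample lists at the same sample rate.

```python
class SyncedSwitcher:
    """Two clips sharing one playhead, so switching never loses your place."""

    def __init__(self, clip_a, clip_b, sample_rate):
        if len(clip_a) != len(clip_b):
            raise ValueError("clips must be the same length to stay in sync")
        self.clips = {"A": clip_a, "B": clip_b}
        self.sample_rate = sample_rate
        self.active = "A"
        self.position = 0  # shared playhead, measured in samples

    def read(self, n):
        """Return the next n samples from the active clip and advance the playhead."""
        chunk = self.clips[self.active][self.position:self.position + n]
        self.position += len(chunk)
        return chunk

    def switch(self, name):
        """Change the active clip. The playhead is deliberately left untouched."""
        if name not in self.clips:
            raise KeyError(name)
        self.active = name

    def seconds(self):
        """Current playhead position in seconds."""
        return self.position / self.sample_rate
```

The whole trick is that `switch` only changes which clip is being read and never resets `position`. With back-to-back samples in a single reference clip there is no shared playhead, so you have to seek manually and rely on memory.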