You are not doing the ABX test correctly. The problem with relying on feedback during the test is not that you check your guess after each answer; the problem is that you change what you do based on what you see and stop right at the result you want to see, effectively cherry-picking the test. If you test this way, you can simply keep going until the reported chance of not guessing reaches 95%: you guess randomly, wait for a lucky streak, and stop testing the moment you hit 95% by pure chance.
An even worse way of cheating would be to do some undetermined number of trials while getting feedback about the confidence, and then report the highest number you ever saw as the figure for how likely it is that you weren't just guessing. If you do 50 trials and at some point you are at 20/30 but you end up at 25/50, the chance that you weren't guessing is not high just because you had a lucky streak around trial 30. The parts of the run that say you were probably just guessing shouldn't be ignored just because that is not what you want to see.
Again, the problem isn't that you check your results; the problem is that you decide when to stop based on them. If you decide up front that you will do x trials no matter what, then checking the result after each trial won't skew the outcome. As an example, flip a fair coin 100 times: the chance that the running count looks heavily skewed towards heads (or tails) at some point during the 100 flips is much higher than the chance that it is still that skewed after the 100th flip (see the sketch below). If you don't fix the number of trials in advance, you are effectively doing the test the first way I described, and then the equation the developer uses to calculate how likely it is that the tester was guessing no longer applies. This is not "mumbo jumbo" as you put it, it's how probability works.
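Here is a quick sketch of that coin flip point (my own illustration, separate from the script linked below): simulate runs of 100 fair flips and count how often the running tally ever looks "non-random" at the 95% level, versus how often it still looks that way after the last flip. The 95% cutoff and the run length are just the numbers from this discussion.

// Chance of getting at least k heads out of n fair flips:
// sum of (n choose i) / 2^n for i = k..n (the same binomial tail as the tester, see below).
function pGuessAtLeast(k, n) {
  let binom = 1, tail = 0;              // binom walks through C(n, i)
  for (let i = 0; i <= n; i++) {
    if (i >= k) tail += binom;
    binom = binom * (n - i) / (i + 1);  // C(n, i+1) from C(n, i)
  }
  return tail / Math.pow(2, n);
}

const RUNS = 2000, FLIPS = 100;
let everLooksNonRandom = 0, endLooksNonRandom = 0;

for (let r = 0; r < RUNS; r++) {
  let heads = 0, ever = false;
  for (let n = 1; n <= FLIPS; n++) {
    if (Math.random() < 0.5) heads++;
    if (pGuessAtLeast(heads, n) < 0.05) ever = true;   // ">=95% confidence" right now
  }
  if (ever) everLooksNonRandom++;
  if (pGuessAtLeast(heads, FLIPS) < 0.05) endLooksNonRandom++;
}

console.log("looked non-random at some point:", everLooksNonRandom / RUNS);
console.log("looked non-random after flip 100:", endLooksNonRandom / RUNS);
// The first number typically comes out several times larger than the second.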
I've made a script because I wanted to check how much this skews the result. It simulates the outcomes of 500 trials with a random number generator and, after each trial, reports how likely it is that the results so far were not chosen randomly. It uses the same formula as your ABX tester: the chance of getting at least k correct out of n by guessing is the sum of (n choose i) / 2^n for i = k up to n, and the reported confidence is one minus that. You can fill in the outcomes yourself and compare it to the tester.
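Written out as code, that formula is the same pGuessAtLeast helper from the coin flip sketch above. Here it is on its own with the 20/30 vs. 25/50 example from earlier, so you can compare against what the tester shows (this is my reading of the formula from its output, not the tester's actual source, so treat it as an assumption):

// Chance of getting at least k answers correct out of n trials by pure guessing:
//   P(X >= k) = sum of (n choose i) / 2^n for i = k..n
// The "chance of not guessing" reported by the tester would then be 1 minus this.
function pGuessAtLeast(k, n) {
  let binom = 1, tail = 0;              // binom walks through C(n, i)
  for (let i = 0; i <= n; i++) {
    if (i >= k) tail += binom;
    binom = binom * (n - i) / (i + 1);  // C(n, i+1) from C(n, i)
  }
  return tail / Math.pow(2, n);
}

// The 20/30 vs. 25/50 example from above:
console.log(1 - pGuessAtLeast(20, 30));  // ~0.95 -> looks "likely not random" mid-run
console.log(1 - pGuessAtLeast(25, 50));  // ~0.44 -> the full run says "probably guessing"

If these numbers match what the tester shows for the same counts, then my script and the tester are doing the same math.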
The script highlights the rows where the formula reports that the guesses were "likely not random" (>95%) and prints them to the console. The console also reports the highest chance of not guessing seen during the run.
https://jsbin.com/lalativife/1/edit?js,console,output
All I have to do is run the JS a couple of times to get not just single trials but whole streaks at a 95% chance of not guessing. In almost every run the highest chance of not guessing is well above 70%. The outcomes come from a random number generator, yet it's still easy to get long stretches where it looks like the trial results were probably not random.
It looks like if the ABX test is done the way you do it, it is fairly easy to reach 95% confidence of not guessing just by picking answers at random, doing enough trials, and stopping at just the right time (see the sketch below).
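To put a rough number on "fairly easy", here is a sketch of that exact stopping rule under the same assumptions as my script: a listener who is purely guessing answers up to 500 trials and stops the moment the reported confidence first reaches 95%. The fraction of simulated listeners who get to stop that way estimates how often the cheat works.

// Same binomial tail helper as in the earlier sketches.
function pGuessAtLeast(k, n) {
  let binom = 1, tail = 0;
  for (let i = 0; i <= n; i++) {
    if (i >= k) tail += binom;
    binom = binom * (n - i) / (i + 1);
  }
  return tail / Math.pow(2, n);
}

const LISTENERS = 500, MAX_TRIALS = 500;
let passed = 0;

for (let l = 0; l < LISTENERS; l++) {
  let correct = 0;
  for (let n = 1; n <= MAX_TRIALS; n++) {
    if (Math.random() < 0.5) correct++;                        // a pure guess
    if (pGuessAtLeast(correct, n) < 0.05) { passed++; break; } // stop at ">=95% confidence"
  }
}

console.log("pure guessers who get to report >=95%:", passed / LISTENERS);

With a fixed, pre-committed number of trials, only about 5% of pure guessers would end up at 95% or better; with this stopping rule the fraction is far higher.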
Of course, there wouldn't be any discussion about conditional probabilities if you clearly heard a difference and didn't pick the wrong answer so often. Picking the correct answer about 20 times out of 33 is still quite close to random guessing, not the near-deterministic outcome the test should produce if the difference is clearly audible.
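For what it's worth, plugging a single 20-out-of-33 result into the same formula (again assuming that is what the tester computes) gives:

// Chance of 20-or-more correct out of 33 trials by pure guessing,
// using the same binomial tail sum as above.
let tail = 0, binom = 1;
for (let i = 0; i <= 33; i++) {
  if (i >= 20) tail += binom;
  binom = binom * (33 - i) / (i + 1);
}
console.log(tail / Math.pow(2, 33));   // ~0.15, i.e. only about 85% "confidence"

That is nowhere near the 95% threshold, even before accounting for the stopping problem described above.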