So we did the first two trials on Sunday, and more will follow. We may change the protocol as we learn which conditions seem more sensitive. Strictly speaking, we should only count trials done under the same protocol, but since this is a learning experience we may have to change it several times. Our accuracy may improve with time.
If you cross-test many parameters (duration, pause length, musical choice, headphones, etc.), the probability that, among a lot of random answers, you can pick one sequence with a good score and attribute it to a given condition is not negligible at all.
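To illustrate the size of this effect, here is a minimal sketch (the numbers are hypothetical): assume pure guessing at 50%, runs of 5 trials, and a "good score" meaning 4/5 or better. With 6 conditions tested independently, the chance that at least one of them produces a good score by luck alone is already large.

```python
from math import comb

def p_at_least(n, k, p=0.5):
    # Probability of scoring k or more correct out of n pure guesses,
    # i.e. the upper tail of a Binomial(n, p) distribution.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a single run of 5 guesses scores 4/5 or better.
p_single = p_at_least(5, 4)          # 6/32 = 0.1875

# Chance that at least one of 6 independently tested conditions does so.
p_any = 1 - (1 - p_single) ** 6      # roughly 0.71
```

So with six cross-tested conditions, a "good score" appears somewhere about 71% of the time even when every answer is a coin flip, which is why such a result proves nothing.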
To avoid this, I think it is better to dismiss any result until the most relevant test conditions are fixed.
Then you can evaluate your reliability: for example, one mistake in five trials.
Then you can set the number of trials for the real test, given the confidence level you aim to achieve and the error rate you expect to make. I can calculate it for you, if you need.
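The calculation can be sketched as follows (a simple binomial model, assuming 50% chance of guessing each trial correctly; the function name and the search cap are my own choices, not anything fixed by the thread):

```python
from math import comb, floor

def binom_tail(n, k, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p): the chance a pure guesser
    # gets k or more answers right out of n trials.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def trials_needed(error_rate, alpha=0.05, p_chance=0.5, n_max=500):
    # Smallest number of trials n such that a score with at most
    # `error_rate` mistakes would be this unlikely (< alpha) under
    # pure guessing. Returns None if no n up to n_max suffices.
    for n in range(1, n_max + 1):
        k = n - floor(error_rate * n)   # required number of correct answers
        if binom_tail(n, k, p_chance) < alpha:
            return n
    return None
```

For instance, with the one-mistake-in-five reliability mentioned above (error rate 0.2) and a 5% significance target, this sketch yields 8 trials: scoring 7/8 or better has probability 9/256, about 3.5%, under pure guessing.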
This method is very robust, and it avoids the statistical bias that occurs when many tests are run in search of a single confirmation.
On the other hand, if you get good results during the training runs and bad ones during the real test, the final result is still a negative.