When I first joined, I set up the blind listening tests per instructions from you and big shot (level matching, the whole works).
In my experience, the biggest factor was relaxing.
If you manage to get statistical significance for a reasonably set up test, aren't you curious as to why you're hearing a difference?
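For what "statistical significance" means in practice here, a rough sketch: the chance of scoring at least that many correct answers by pure guessing. The function name is mine, and it assumes every trial is an independent 50/50 guess (as in a plain ABX run).

```python
from math import comb

def abx_p_value(correct, trials):
    """One-sided probability of getting at least `correct` answers right
    out of `trials` when every answer is a coin flip (p = 0.5)."""
    # Count every outcome with `correct` or more hits, divide by all outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# 12 correct out of 16 trials: 2517 / 65536, a bit under 4%
print(abx_p_value(12, 16))
```

A low number only says guessing is an unlikely explanation. It says nothing about what you were actually picking up on, which is the whole point of the question above.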
Even the research papers that support some more or less clear impact of hi-res on listeners also keep showing that listeners failed to pass the blind test. Nearly all of the research rejects conscious detection of a sound difference.
If I were really getting reliable clues while testing CD against hi-res, I would wonder about the actual cause. And there could be quite a few, depending on how little effort was put into setting up the listening conditions.
I could have one of those DACs that roll off so early that the treble takes a hit of several dB when playing 44.1kHz. That is easy to check by converting the hi-res file to 16/44 and then back to its original resolution. That way, while the converted track is missing all the ultrasonic content, the DAC applies the same filter to both tracks, so there is no chance of treble roll-off on only one of them. Other possibilities can be excluded the same way, like the idea that a track with a bigger bandwidth/sample rate has better timing than 44.1kHz, or the idea that the DAC (if delta-sigma) resamples to a fixed sample rate and so does less resampling on the hi-res file than on 44.1. We could come up with any sort of idea, potentially correct or ludicrous; so long as we can correctly isolate the hypothesis and test for it, that's part of the process. And if maybe you have special ears that do perceive higher frequencies (youngsters, genetic rarity, or robot in a flesh suit), that's something you surely would want to test on the side.
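The round trip described above could be sketched like this, assuming NumPy and SciPy are installed and the audio is already loaded as floats in the -1..1 range. The function name and the test signal are made up for illustration, and the 16-bit step is done without dither to keep it short, which a proper converter would not do.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def round_trip_16_44(x, hires_rate):
    """Take a hi-res signal down to 16 bit / 44.1 kHz and back to its
    original sample rate, so both versions hit the DAC at the same rate."""
    g = gcd(44100, hires_rate)
    up, down = 44100 // g, hires_rate // g       # 147 and 320 for 96 kHz
    cd = resample_poly(x, up, down)               # down to 44.1 kHz
    # Quantize to 16-bit steps (no dither: a simplification for this sketch)
    cd = np.clip(np.round(cd * 32767.0), -32768, 32767) / 32767.0
    back = resample_poly(cd, down, up)            # back up to the hi-res rate
    return back[:len(x)]

# One second at 96 kHz: an audible 1 kHz tone plus an ultrasonic 30 kHz tone
rate = 96000
t = np.arange(rate) / rate
x = 0.4 * np.sin(2 * np.pi * 1000 * t) + 0.4 * np.sin(2 * np.pi * 30000 * t)
y = round_trip_16_44(x, rate)

# With one second of audio, FFT bin k corresponds to k Hz
spec_x = np.abs(np.fft.rfft(x))
spec_y = np.abs(np.fft.rfft(y))
print("1 kHz kept:", spec_y[1000] / spec_x[1000])
print("30 kHz kept:", spec_y[30000] / spec_x[30000])
```

The 1 kHz tone should come through essentially untouched and the 30 kHz tone should be almost gone, while both files now play at the same sample rate. If you can still tell them apart, the ultrasonic content is a candidate; if you can't, the difference was coming from how the DAC or the PC handles 44.1kHz.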
Maybe the files are played from a PC that is poorly configured and applies a crappy resampling to only one of the files. Maybe that's audible, or maybe it just adds enough delay to let you unconsciously identify the tracks that way.
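The delay idea is something you can check on captured output rather than guess at. A minimal sketch in plain Python, assuming you already have the two versions as lists of samples at the same sample rate (the function name is hypothetical, and the brute-force loop is only practical on short excerpts):

```python
import random

def best_lag(a, b, max_lag):
    """Return the shift in samples that best lines b up with a.
    A positive result means b is late relative to a."""
    def corr(lag):
        # Sum a[i] * b[i + lag] over the indices that exist in both lists
        start = max(0, -lag)
        stop = min(len(a), len(b) - lag)
        return sum(a[i] * b[i + lag] for i in range(start, stop))
    return max(range(-max_lag, max_lag + 1), key=corr)

# Fake data: noise, and the same noise delayed by 5 samples
rng = random.Random(0)
a = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
b = [0.0] * 5 + a[:-5]

lag = best_lag(a, b, 20)
print(lag, "samples, or", 1000.0 * lag / 44100, "ms at 44.1 kHz")
```

If the two files come out with a consistent offset, that alone could be the tell, without any difference in the sound itself.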
If the discussion were about whether you can feel getting hit by a baseball bat, we'd have no reason to suspect another cause when you reply yes. But here, those who set up serious experiments suggest we won't hear a difference, or at the very least not a conscious one that lets us identify the track consistently. If someone then comes along and says he can, isn't it completely normal to suspect some confounding variables creating a false positive in his test?