For what variables are components potentially relevant yet below the threshold of human perception? Frequency? Timing? Loudness? Signal/noise and distortion? Channel separation? Timbre? I think there is a mishmash here: classic bench measures like frequency response, THD, and jitter are being conflated with the aggregate sonic performance of a piece of gear in a system, as perceived by a listener in a room or through a pair of headphones. The original Benchmark DAC 1 measured very well, yet nearly every reviewer reported that it sounded noticeably different from the Benchmark DAC 2 and DAC 3 variants, even though the designer said the “improvements” should be generally inaudible.
But what is the credibility of someone saying he's hearing differences between anything and anything else? When do you decide the guy has both great ears and real listening skills, and when do you decide he's a fool, the victim of his own defective test method, one that would find sound differences from putting a pet rock near a cable?
The intuitive way is to just take the guy who agrees with me and decide that he knows his stuff (because he keeps saying I'm right!). I bet that's what most people do, and surely you can see how flawed that rationale is when my own views on gear come entirely from sighted, badly controlled experiences.
The only thing strongly suggested here is that we both tend to fall for the same mistakes and biases (like leaving one DAC playing +2 dB louder than the other and then talking about how much better its soundstage and bass are).
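That +2 dB trap is easy to quantify. Here is a minimal Python sketch (the signals and the 2 dB offset are invented for illustration, not from any real comparison) of measuring a level mismatch and correcting it before comparing anything:

```python
import numpy as np

def rms_db(x):
    """RMS level of a signal, in dB relative to full scale (1.0)."""
    return 20 * np.log10(np.sqrt(np.mean(np.square(x))))

def match_gain(reference, other):
    """Linear gain to apply to `other` so its RMS level matches `reference`."""
    return 10 ** ((rms_db(reference) - rms_db(other)) / 20)

rng = np.random.default_rng(0)
a = rng.normal(scale=0.1, size=48000)   # stand-in for DAC A's output
b = a * 10 ** (2 / 20)                  # same signal, but +2 dB hotter
print(round(rms_db(b) - rms_db(a), 2))  # prints 2.0 -- the mismatch
b_matched = b * match_gain(a, b)        # now a fair comparison is possible
```

Level-matching to within a fraction of a dB is the usual prerequisite for any serious comparison, since broadband level differences of roughly half a dB can already be audible and tend to get misattributed to "soundstage" or "bass".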
Whether the question is this or hearing thresholds, only a controlled test can give us a statistically valid answer. Not having one doesn't mean we should take the laziest, least reliable method ever and declare it fully conclusive. The correct behavior is simply not to claim to know with certainty what we do not.
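"Statistically valid" can be made concrete. In an ABX-style test, the listener's score is compared against pure guessing with a one-sided binomial test; a sketch (the trial counts are just an example):

```python
from math import comb

def abx_p_value(correct, trials):
    """One-sided p-value: chance of scoring at least `correct` out of
    `trials` in an ABX test if the listener were purely guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# e.g. 12 right answers out of 16 trials
print(round(abx_p_value(12, 16), 4))  # prints 0.0384 -- unlikely to be pure guessing
```

That single number is what casual listening can never produce: a bound on how likely the result is to be a fluke.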
Believing something doesn't make it real. Most people think they are above average drivers. Simple math disagrees.
The closer we are to hearing thresholds, the more likely we are to find people who start "hearing" entirely with their eyes. When a sound difference is very clear and consistently noticed, the brain doesn't need any trick. It will still be influenced by extra variables, but the result should land close enough to the truth about the sound.
So now comes the zero-dollar question: is a listener usually capable of telling when a difference is too subtle to trust his brain about it, or when a sighted experience is too messed up to conclude anything about a device's sound? Of course not. If people could be trusted on this, the entire world wouldn't be using blind testing to get reliable data on subjective questions.
You brought up earlier the possible influence of an unfamiliar setting and stress in a blind test, and of course that is real. But it is in no way comparable to conditions that allow for false positives.
Actually, for something simple with a single task and focus, like a listening test, a little stress is expected to improve performance (not that it changes the thresholds themselves, but it helps focus). There's plenty of work on stress and task performance, and we keep finding similar inverted-U curves (performance goes up, then down). What changes is how complex the task is and, of course, the inherent level of stress (soldiers and surgeons in action probably shouldn't be compared to a guy in a chair being asked to check a letter when he thinks he hears something change).
Obviously, I'm not saying here that people are always wrong and that there are never audible differences between anything. I'm saying that casual listening doesn't let us verify whether people are right or terribly wrong. What is the value of an answer you cannot trust?
As most mistakes in judgment come from biases working on humans in a systematic way, why would a bigger number of casual testimonies improve credibility? A thousand testimonies sharing the same bias mostly build a more credible picture of human bias.
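This is the statistical core of the argument: averaging more samples shrinks random error, but does nothing to a shared systematic bias. A toy Python simulation (all numbers are invented: a true difference of zero plus a fixed expectation bias):

```python
import random

random.seed(1)
TRUE_DIFFERENCE = 0.0   # assume the two devices actually sound identical
SHARED_BIAS = 0.5       # expectation bias common to sighted listeners
NOISE = 1.0             # honest per-listener random variation

def sighted_rating():
    """One listener's reported 'difference' under sighted conditions."""
    return TRUE_DIFFERENCE + SHARED_BIAS + random.gauss(0, NOISE)

for n in (10, 1000, 100000):
    mean = sum(sighted_rating() for _ in range(n)) / n
    print(n, round(mean, 2))  # converges toward the bias (0.5), never toward 0.0
```

More testimonies make the estimate more precise, but precisely wrong: the average homes in on the bias, not on the sound.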
I get the general idea: if we can't do the right thing for whatever reason, we fall back on the easier stuff. But we're already doing that too much for everything in this hobby. Someone posts a graph of a pair of headphones, and now that's how the model measures as far as the community is concerned. That's super dumb. What mic/coupler was used? What compensation, smoothing, placement protocol? Any other pair of the same model almost surely differs audibly from the first pair. Why would a sample of one become the reference for all? Because nobody is going to get 100 pairs and measure them all before publishing results. Well, "nobody" isn't quite correct: the manufacturer surely knows better and even has a list of defined tolerances verified in a fairly strict process. But that data isn't for us peons.
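Unit-to-unit variation is easy to illustrate. A hypothetical Python sketch (the target curve and the size of the manufacturing variation are made up, not measurements of any real headphone) of why two "identical" pairs can graph differently:

```python
import numpy as np

rng = np.random.default_rng(42)
freqs = np.logspace(np.log10(20), np.log10(20000), 200)  # 20 Hz .. 20 kHz

def unit_response(rng):
    """Hypothetical per-unit frequency response in dB: a shared target
    curve plus smooth, random unit-to-unit manufacturing variation."""
    target = -3 * np.log10(freqs / 1000) ** 2            # toy target curve
    drift = np.cumsum(rng.normal(0, 0.15, freqs.size))   # smooth random walk
    return target + drift - drift.mean()

unit_a, unit_b = unit_response(rng), unit_response(rng)
worst = np.max(np.abs(unit_a - unit_b))
print(round(float(worst), 1))  # max dB deviation between two "identical" units
```

Even before asking about couplers, compensation, or smoothing, a single measured sample only tells you about that sample, within the manufacturer's (unpublished) tolerance band.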
And obviously it's no better when a guy with his one pair claims that the model in general sounds this or that way. He's both generalizing for no legitimate reason and making claims he has no supporting evidence for.
The right answer is simple. Don't brag about stuff we don't know. Dare to say "IDK" when we have no good reason to be sure. That's the right way. We can always keep on sharing our feelings, but at least let's not pretend that our subjective experience of uncontrolled conditions is the very best way to know about sound. Because that will always be complete BS. And anytime an audiophile pretends otherwise, the rest of the world laughs at the hobby and its mindset from 1750 or so.