do blind tests with X levels of variation. failing to clearly pass such a test with statistical significance = I failed to hear it. success = audible.
in both cases it might be worth looking into the test to make as sure as possible that no external variable was the cause of those results.
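to make "pass with statistical significance" concrete, here is a minimal sketch of how a forced-choice blind test (ABX style) could be scored. the function names, the 50% guessing rate and the 0.05 cutoff are my assumptions for illustration, not a fixed standard; it simply computes how likely the score would be from pure guessing.

```python
from math import comb


def abx_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided probability of scoring at least `correct` out of `trials`
    by pure guessing (binomial tail). `chance` = 0.5 assumes two choices."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


def passed(correct: int, trials: int, alpha: float = 0.05) -> bool:
    """True if guessing alone would rarely do this well.
    alpha = 0.05 is a common convention, assumed here."""
    return abx_p_value(correct, trials) < alpha


if __name__ == "__main__":
    for score in (9, 12, 14):
        p = abx_p_value(score, 16)
        print(f"{score}/16 correct: p = {p:.4f}, passed = {passed(score, 16)}")
```

with 16 trials, 12 correct comes out around p = 0.038 (a pass at the assumed 0.05 cutoff), while 9 correct is around p = 0.40, which is indistinguishable from guessing. a fail only means this listener did not demonstrate hearing it in this test.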
asking people whether they think they heard a difference, or what difference they heard, in a listening test = wrong test. testing an opinion is not like testing an actual ability.
sighted tests are meaningless for such a purpose, as audibility should concern the ears only. (thanks captain obvious! ^_^)
now the conclusions can be very specific and apply to a specific subject in a specific test (he failed today, maybe he'll pass tomorrow, or with another sort of test), or they can be general and apply to a group of people on a statistical level.
understanding the difference is important! "a human being has 2 arms" is a claim based on statistical averaging. some humans do not have 2 arms, and that's easy to demonstrate, but it's no reason to burn every book where a human is described as having 2 arms, if you get my weak analogy.
based on a big enough sample of tests and people, we can try to estimate an average threshold. once that is done, it gives us a lot of confidence about other magnitudes of variation. for example, variations of less than 0.1 dB in overall loudness would be very, very hard to notice in a blind test, but variations of more than 0.2 dB can be enough to make people feel a difference. so that's a fairly good threshold area: people often fail to notice 0.2 dB, and I don't know if anybody can consistently pass below 0.1 dB. based on those experiences, we're not taking much of a risk when we claim that a 0.001 dB variation is inaudible. once we have a fair idea of a threshold, there is no real need to test extensively for stuff happening orders of magnitude below it. likewise, I don't need to wait for extensive testing on millions of humans to claim that no human can run 100m in under 1 second.
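to get a feel for how far apart those numbers are, here is a small sketch converting a level change in dB into a relative change in signal amplitude. it assumes the usual amplitude convention (ratio = 10^(dB/20)); the function names are mine, for illustration only.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio for a level change in dB (assumes the 20*log10 convention)."""
    return 10 ** (db / 20)


def percent_change(db: float) -> float:
    """Relative amplitude change, in percent, for a level change in dB."""
    return (db_to_amplitude_ratio(db) - 1) * 100


if __name__ == "__main__":
    for db in (0.2, 0.1, 0.001):
        print(f"{db} dB = {percent_change(db):.4f}% change in amplitude")
```

0.2 dB works out to roughly a 2.3% amplitude change and 0.1 dB to roughly 1.2%, while 0.001 dB is about 0.0115%, so about a hundred times smaller than the area where people already struggle.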
magnitudes have meaning.