Reading the linked article, I don't believe from their description that their test can be described as ABX. It was certainly a blind test, but not an ABX: it was not literally comparing A with B, it sought preferences rather than whether a difference could be detected, and it also included comfort judgements. As I mentioned earlier, the senses are all linked, so how can we be sure that the more comfortable headphones weren't, at least in part, judged as sounding better? Once we're talking about preference rather than just identifying whether something is detectably different, we also have to contend with how the individual experience of the listeners affected the results: in what way were they experienced listeners? And how can we be sure that none of the listeners could identify, or at least make a good guess at, which model of headphones they were testing just from wearing them? If they were experienced listeners, there's a fair chance they'd tried some of those headphones in the past.
While it's possible to tear substantial holes in the methodology of this test, we've got to consider what is actually possible. For example, I can't see how in practice one could ABX the sound signature of two different headphones without actually putting them on, and therefore I can't see how any potential bias caused by the level of comfort could be eliminated. The authors of the article appear to have gone to decent lengths to eliminate some biases, and arguably they eliminated as many as were realistically feasible. For that reason the test is laudable and I would certainly take its conclusions seriously. However, this type of preference-based blind test does not carry nearly as much scientific weight for me personally as a straightforward "is there a detectable difference" ABX test.
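Just to illustrate why the straight detection test carries more weight for me: in a plain ABX you can put a hard number on the result with nothing more than a binomial calculation, whereas a preference test offers no equally direct "were they just guessing?" yardstick. A minimal sketch in Python (the trial counts are invented purely for illustration):

```python
# Minimal sketch: scoring a plain "is there a detectable difference" ABX test.
# The listener hears A, B, then an unknown X and must say whether X was A or B.
# Under the null hypothesis (no audible difference) each trial is a 50/50 guess,
# so the number of correct answers follows a binomial distribution.

from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value: probability of getting at least `correct`
    answers right out of `trials` by pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Example: 12 correct answers out of 16 trials (numbers made up for illustration)
print(f"p = {abx_p_value(12, 16):.4f}")  # ~0.038, below the usual 0.05 threshold
```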
The loudspeaker test is even more troubling. As Tyll mentioned, put even a hypothetically perfectly flat speaker in an average room and what you'll hear will be nowhere near flat. While in general there will usually be an overall boost in the low frequencies, the exact response is entirely dependent on the room's dimensions, construction materials and furnishings. Forget the hundredths or thousandths of a dB difference in cables; the phase cancellation and summing in an average room will cause several peaks and dips which can differ by as much as 30dB, along with a considerable number of smaller ones. In fact, high quality commercial recording studios with considerable acoustic treatment consider it quite an achievement to have only a few of these peaks and dips, and to keep them within 6dB or so! Exactly where these peaks and dips occur in the audible frequency spectrum varies completely from one room to another, so constructing a speaker which even sounds roughly the same from one untreated room to another is impossible, let alone one which produces a flat response in different rooms. I get the distinct impression that at least some of these tests are marketing disguised as serious scientific research, which isn't entirely unheard of in the audiophile world!
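To give a rough idea of how strongly the low-frequency response depends on the room itself, the modal (standing wave) frequencies of an idealised rigid rectangular room follow the standard formula f = (c/2) * sqrt((nx/Lx)^2 + (ny/Ly)^2 + (nz/Lz)^2). A minimal sketch in Python with an arbitrary example room; change any one dimension and every peak and dip moves:

```python
# Minimal sketch: modal frequencies of an idealised rigid rectangular room.
# Each mode produces position-dependent peaks and dips in the bass response,
# and the frequencies depend entirely on the room dimensions.
# The room size below is an arbitrary example, not from the article.

from itertools import product

C = 343.0  # speed of sound in air, m/s

def room_modes(lx: float, ly: float, lz: float, max_order: int = 3):
    """Return sorted modal frequencies (Hz) up to the given mode order."""
    modes = []
    for nx, ny, nz in product(range(max_order + 1), repeat=3):
        if (nx, ny, nz) == (0, 0, 0):
            continue
        f = (C / 2) * ((nx / lx) ** 2 + (ny / ly) ** 2 + (nz / lz) ** 2) ** 0.5
        modes.append(round(f, 1))
    return sorted(modes)

# Example: a 5 m x 4 m x 2.5 m living room; the lowest modes land around
# 34, 43 and 55 Hz, and a different room puts them somewhere else entirely.
print(room_modes(5.0, 4.0, 2.5)[:10])
```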
G