Be careful not to confuse "accurate" with "reliable", and note that neither of those is the same as "valid".
I trust the Headroom graphs to be reasonably accurate. I don't think they make any significant errors in measuring or presenting their results, although they're only human and I suppose one graph here or there could occasionally turn out just plain wrong. They'd probably catch it eventually, though.
They do everything possible to make them "reliable", but that's a real bugaboo with headphone testing. Because the headphone couples with the test "head" slightly differently every time it's seated, it's actually darned hard to get perfectly reproducible results if you measure the same headphone over and over, even on the same setup. I'd bet they do it as well as anyone could, but if you peer at every little zig and zag of a whole bunch of those graphs, you're probably trying to read them closer than the basic reliability of this kind of test will admit.
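To make that reliability point concrete, here's a little Python sketch. All the numbers are invented for illustration (the seating-error size is an assumption, not HeadRoom's actual figure): it simulates re-measuring the same headphone several times, each seating adding a small random coupling error, and shows that the run-to-run spread can easily be bigger than the tiny zigzags you might be tempted to read off a single graph.

```python
import random
import statistics

def measure_spl(true_db: float, seating_sigma_db: float = 0.8) -> float:
    """One simulated measurement: the 'true' level plus a random
    seating/coupling error (hypothetical 0.8 dB standard deviation)."""
    return true_db + random.gauss(0.0, seating_sigma_db)

random.seed(42)  # fixed seed so the sketch is repeatable

true_level = 90.0  # pretend true SPL in dB at some frequency
runs = [measure_spl(true_level) for _ in range(10)]

spread = statistics.stdev(runs)
print(f"run-to-run std dev: {spread:.2f} dB")
# If one graph shows a 0.5 dB "zig" that another lacks, and the
# test-retest spread is on this order, that zig tells you nothing.
```

The point isn't the exact numbers, just that any feature on a single measured curve smaller than the measurement's own repeatability shouldn't be treated as a real difference between headphones.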
Then there's the question of "valid". That means whether what is on those graphs matches up to what you're trying to find out when you examine a bunch of them. The answer is: not really. Graphs like this, made with a dummy head and so forth, are great for pointing out where the objective differences between various models of headphone exist under this specific setup. But hearing a headphone is such a subjective thing, and that, coupled with person-to-person variation in ear and head shapes, means you can't map details from a graph onto what you are actually going to perceive. Validity is just not possible from a laboratory measurement process if the final question is "how will it sound to me". That's not a knock on Headroom, just a fact of life.