Thanks all for the interesting responses. I'd like to get back to the meat of the testing, so let me address them collectively:
Maybe I should back up first: the purpose of posting as I go along is to (a) get tips for improvement and (b) encourage others to reproduce my results. Maybe with your ears and/or equipment you can do better. I will address a few objections as a courtesy, but I don't want to get too sidetracked from (a) and (b). I would welcome more ideas on whether to listen to short clips or long ones, what to listen for, better tracks to try (as long as they are normally accessible in the US over a regular network), and so on.
1. Number of trials and looking at the results
So I hear your input, and I can see where you are coming from. Actually I never put any thought into this; it was the default setting and off I went. Plus it was late and I wanted success sooner rather than later, if a difference was possible to discern. However, I have taken a couple of 2nd year courses in probability, enough to know the binomial distribution and its application. It seems the ABX plugin is doing a straight binomial distribution calculation on the number of successes and trials. It doesn't care what you are thinking or doing, what your intent is, or whether you are looking at the result or not. That's the beauty of ABX testing. The only thing that would be invalid is to throw out failing trials, but the tool won't let you do that ... you can start again from the beginning or continue, but that's it. Note that 3/4 yields a very different % than 30/40 - that is baked into the binomial. The % is valid no matter what you do or when you quit. I promise you. However, note that 10% means exactly that - a 1 in 10 chance of scoring at least this well by flipping coins to decide.
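For anyone who wants to check the numbers by hand, here is a minimal sketch of that calculation. It assumes the plugin reports the one-sided binomial tail for a pure guesser (p = 0.5); the function name `abx_p_value` is made up for illustration and is not taken from the plugin.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by guessing (p = 0.5)."""
    # Count every outcome with `correct` or more hits, then divide by all 2**trials outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Same 75% hit rate, very different evidence.
print(abx_p_value(3, 4))    # 0.3125
print(abx_p_value(30, 40))  # roughly 0.0011
```

So 3/4 is entirely unremarkable for a coin flipper, while 30/40 would be hard to explain by luck.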
To humor everyone, if I get what I consider solid results (5% or better) I'll redo it both ways and post the logs. See how nice I am? But I did want to set the record straight for our dear readers.
RRod>> There are subtle differences in test statistics when using different stopping criteria. Getting 9/12 binomial trials right does not yield the same frequentist result as taking 12 trials to get 9 successes from a negative binomial. So it's valid for people to worry about things like stopping early and choosing best runs.
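To put a rough number on the stopping-early worry, here is a small sketch. It is an illustration under stated assumptions, not the plugin's logic: a pure guesser looks at the running one-sided binomial p-value after every trial and stops the moment it reaches 5% or better. The function names and the 40-trial cap are hypothetical choices. The calculation is exact (no random simulation), tracking the probability of each running score.

```python
from math import comb

def tail(correct: int, trials: int) -> float:
    """One-sided binomial p-value for a guesser (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_of_false_pass(max_trials: int, alpha: float = 0.05) -> float:
    """Probability that a guesser who peeks after every trial ever sees p <= alpha."""
    running = {0: 1.0}   # number correct so far -> probability of still testing
    passed = 0.0
    for n in range(1, max_trials + 1):
        after_trial = {}
        for correct, prob in running.items():
            for new_correct in (correct, correct + 1):  # wrong or right, 50/50
                after_trial[new_correct] = after_trial.get(new_correct, 0.0) + prob / 2
        running = {}
        for correct, prob in after_trial.items():
            if tail(correct, n) <= alpha:
                passed += prob            # guesser stops here and reports a "pass"
            else:
                running[correct] = prob   # keeps going
    return passed

print(chance_of_false_pass(5))   # 0.03125, only 5/5 qualifies this early
print(chance_of_false_pass(40))  # noticeably above the nominal 0.05
```

With a fixed number of trials decided in advance, the reported percentage means what it says; with stop-when-it-looks-good, the chance of a lucky pass is higher than the number displayed.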
2. Dynamic range, and whether this Vivaldi is a good track to get a positive result
Firstly, people often confuse available dynamic range (e.g. 96 dB for redbook CD) with the actual dynamic range of a section of music (ratio of loudest to softest part). I was trying to select a piece of music with a high value for the second. I didn't measure it (someone provide me a SoX incantation and I'll gladly do it). Whether Spring can have a wide dynamic range depends on the music to an extent, yes I agree, but also on how it was recorded. With a sensitive mic really close to a violin, and if the player plays both very softly and very loud, you could get a wide range. The rest is the mix and how much dynamic range compression is applied. My expectation is that at the most this is 50 dB, more likely less, compared to 15 dB for a lot of popular music, I'm told. I heard one engineer claim 60 dB on a big orchestra; that is the highest claim I've ever read. The reason I want high range is to look for softer passages, where the relative delta of each step of quantization is highest. This is where 24 bit might shine.
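As a back-of-the-envelope check on that last point, here is a minimal sketch of how many quantization steps a quiet passage actually spans. The -50 dBFS level is just the guess from above, not a measurement of this track, and the function names are made up for illustration. Dither and noise shaping are ignored.

```python
import math

def theoretical_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)

def steps_at_level(bits: int, level_dbfs: float) -> float:
    """Quantization steps between zero and the peak of a passage at `level_dbfs`."""
    full_scale = 2 ** (bits - 1)             # signed samples: half the codes per polarity
    return full_scale * 10 ** (level_dbfs / 20)

for bits in (16, 24):
    print(bits, "bit:",
          round(theoretical_range_db(bits), 1), "dB available,",
          round(steps_at_level(bits, -50.0)), "steps at -50 dBFS")
```

At -50 dBFS a 16 bit file has only about a hundred steps per polarity to describe the waveform, while 24 bit has 256 times as many. Whether that is audible is exactly what the ABX test is for.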
RRod>> The largest RMS range I've found in my CDs so far is about 65dB, and the quiet parts sound great.
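For reference, one way such an RMS range could be measured is sketched below. This is an assumption about the method, not necessarily how the 65dB figure above was obtained: split the samples into fixed windows, take the RMS level of each, and report loudest minus quietest. The synthetic two-level tone stands in for real audio, and the names and window length are hypothetical.

```python
import math

def window_levels_db(samples, window):
    """RMS level in dB of each complete, non-silent window."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(x * x for x in chunk) / window)
        if rms > 0:                      # skip digital silence, log of zero is undefined
            levels.append(20 * math.log10(rms))
    return levels

def rms_range_db(samples, window):
    """Difference between the loudest and quietest window, in dB."""
    levels = window_levels_db(samples, window)
    return max(levels) - min(levels)

# Synthetic check: one second of a 441 Hz tone, then the same tone 60 dB quieter.
rate = 44100
tone = [math.sin(2 * math.pi * 441 * n / rate) for n in range(rate)]
signal = tone + [0.001 * x for x in tone]
print(round(rms_range_db(signal, 2100), 1))  # 60.0
```

The result depends heavily on the window length and on how fades and silence are treated, so figures from different tools are not directly comparable.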
3. Difficulty downloading to reproduce
I'm really really sorry HDtracks doesn't support Linux. But if you find another track we can both download legitimately, I'm game. It was suggested that I use the HDtracks 24 bit Random Access Memories, but I thought Vivaldi was a better starting point. I can't get into the Pono store yet, and the European place only starts in January. Where else?
RRod>> See here: http://www.linn.co.uk/christmas?day=12
4. Dither and reconverting to 24 bit
I intentionally started without dither. I want the best chance for success first, then I'll add dither and see if it makes me fail.
>sox -V4 24to16bit.flac -b 24 16to24.flac -- interesting suggestion, but I don't think upconverting is legitimate, is it? The dither will be on the 8th LSB and not the 0th LSB when you get back to 24. The DAC natively handles 16 and 24 and converts to multi-segment sigma-delta anyway, so I think my method is legit. This incantation was suggested after some discussion.
RRod>> Upconverting is totally legitimate; you're just padding with 0s to embed the 16bit content into 24, so the process is completely reversible with no content change. It is very possible that your DAC gives no subtle cues when bit-depth switching, but it's always best to remove variables.
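The padding claim is easy to check on plain integers. This is a minimal sketch that assumes the upconversion is a straight shift of each signed sample by 8 bits (no dither or scaling on the way up); the helper names are made up for illustration and this is not SoX's actual code.

```python
def pad_16_to_24(sample: int) -> int:
    """Embed a signed 16 bit sample in 24 bits by appending eight zero bits."""
    return sample << 8

def truncate_24_to_16(sample: int) -> int:
    """Drop the low eight bits again."""
    return sample >> 8

all_16_bit = range(-32768, 32768)

# Every possible 16 bit value survives the round trip unchanged...
assert all(truncate_24_to_16(pad_16_to_24(s)) == s for s in all_16_bit)
# ...and the low byte of the padded sample is always zero.
assert all(pad_16_to_24(s) & 0xFF == 0 for s in all_16_bit)
print("round trip ok")
```

Under that assumption the 24 bit container carries exactly the 16 bit content, so any dither applied during the earlier 24 to 16 step stays at the 16 bit LSB, which is the 8th bit up in the padded file.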
One word about success: I am an admitted skeptic about 24 bit. My intellectual bias says that 16 bit redbook may be the be-all and end-all of music, and all we need is better reproduction hardware. But I'm not 100% sure of my position. So why do I want to succeed in telling them apart? The weakness I find in ABX testing is when someone does a whole series of comparisons and they all come up negative. It is easily attacked, and it is hard to prove the negative (i.e. if you had just done X you would have passed). But if you pass one and gradually change parameters till you fail, it shows where the knee in the curve is for your ears and setup. IMHO.
RRod>> You are right in that it's a bit weird to do only 1-sided confidence intervals for the test, as it would seem someone who can consistently only get 10% right has something special going on (in fact, this would happen with someone who can get 90% correct but was told to choose the option that *didn't* match). Be that as it may, I haven't really seen such a result come up in the examples I've seen. It is worth mulling over the theory a bit, though.