“(5) (automated) double-blind testing ... yes please! DACs and amps for sure. (My use of "double-blind" in this context means you'd know which 2 sets of gear are in the test, but some automated process pseudo-randomly feeds you A or B, and you don't know which one was Sample #7 until the end, when you can mark up your scorecard/notes, and discover you can, or cannot, tell the difference. Or that the results are statistically ambiguous.)”
This is an interesting concept, yet the definition of “double blind” testing is that neither the recipient nor the provider knows what is being tested until the results are in. I set up such a test for Schiit. The tubes were concealed and had only a random number on the outside; after the test was completed, a sealed letter was opened to tell them which tubes were which.
My group does single-blind testing: the test subjects know the gear after the results are in.
I do like the idea of automating, but of course you have to know which gear was chosen at a specific time as it randomly switches.
Warning: geeky. Double-blind testing ("DBT"). Going through the weeds in search of rabbit holes to dive into.
I'd argue that having a non-human "test administrator" doing the switching fully meets the INTENT of double-blind testing, if not the standard wording that tries to implement that intent, given one added set-up requirement, and therefore "the way I would do it", as described below, does qualify as double-blind testing.
The intent of double-blind testing is to eliminate opportunities for human biases to confound the results.
IMO this process I'm describing would work for substitutions of most of the gear in the signal path, but not for:
* headphones and/or the cables connecting them
* the music source, unless those can be time-synched
* speakers. (May be possible with a lot of work. Not going there today.)
Disclaimer: As a person who has worked as a data analyst for almost 40 years, I am counter-intuitively very comfortable with decision-making based on (partially) subjective factors. That said, there's a lot of power/confidence to be gained by more rigorous/objective/repeatable evaluations. With the exception of Paladin79's group, audio consumers don't do DBTs very often because it's so difficult. But IMO the Schiitr could have some gear set up to make DBT of selected components pretty darn easy, and my blue sky dream is that the Schiit community could and should take advantage of that opportunity. Jason did ask for blue sky dreams.
The "one added set-up requirement" is that the A/B switch is the LAST piece in the signal path before the transducers of choice, and that the signal is always running in parallel through the entirety of both the A-path and the B-path. The listeners can see all the gear, but cannot visually detect whether what they are hearing is from Path A or from Path B, because in fact both paths are always active, at some set volume. Tubes are glowing, VU meters are moving, the DAC/s are signaling 24/96 bits or whatever, etc. The switchbox itself cannot have a visible indicator of A vs B.
Here's the NIH's National Cancer Institute's online definition of a double-blind study:
"A type of clinical trial in which neither the participants nor the researcher knows which treatment or intervention participants are receiving until the clinical trial is over. This makes results of the study less likely to be biased. This means that the results are less likely to be affected by factors that are not related to the treatment or intervention being tested."
One of the most important of the biases is confirmation bias, where one's pre-test expectations tilt the evaluations. For example, More Expensive is More Better. The expectations of the person administering the test, if s/he knows which of the alternatives is being tested ... say Bifrost vs Yggy ... can be manifested in subtle facial expressions or body language. And humans are exquisitely evolved to detect body language in others. So the test admin's feeling of "I really like this Yggy that's hiding behind the sheets" can be inadvertently communicated ... aka leaked ... to the listener. Maybe not to every listener. Probably many listeners would not be consciously aware. But even a subtle "this other person likes this choice better" becomes a factor when the listener's brain compiles all of its multi-dimensional inputs AND PRIOR KNOWLEDGE into a reduced binary choice on that particular sample, i.e., is it A or is it B.
Let's call my automated test administrator Mr. Robo Schiitr. A dedicated public servant. Let's assume in this experiment that Yggy is "Effect A" and Bifrost is "Effect B". While both pieces of gear are in plain sight of the human participants, Mr Robo doesn't recognize a Bifrost, doesn't recognize an Yggy, has no knowledge of what they do, has no personal impression of either, has never read a review or talked to anyone about either of them. The listener/s can't ascertain Mr Robo's preconceptions because (a) Mr Robo has none, (b) Mr Robo has no body or indicators to use for non-verbal communications, and (c) Mr Robo's process & timing are identical on every iteration. There is no communications path between Mr Robo's subjective mind and the listeners; the listeners are completely blinded with regard to Mr Robo's knowledge. Unlike when a human does this part.
Mr Robo's job has these phases:
1 - when the "start the listening session" button is pushed, he runs some process that equalizes the final output signal strength in electronics-path-A and electronics-path-B. We can't use this version of double-blind on headphones or speakers, so the sensitivity of whatever's being listened to is a constant, and no additional adjustments are needed here to balance perceived loudness. This deals with another well-known/well-studied contributor to listener bias.
2 - display "A", and play the music selection through path A as a reference.
3 - display "B", and play the music selection through path B as a reference.
4 - then consult a pre-computed random numbers table, read the next entry which will say either A or B, set the output signal path accordingly, display "sample #1", and after a small gap start the music selection
5 - do phase 4 nine more times.
6 - When that's done, when they are ready, the listener/s hit a "give us the sequence" button, and Mr Robo confesses that the sequence used for this specific test was AABABBBABA.
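For the code-minded, here is a rough Python sketch of phases 2 through 6. Everything named here is my own invention for illustration: `make_balanced_sequence`, `run_session`, `reveal`, and especially the `play` callback, which stands in for whatever hardware actually flips the switchbox and updates the display. Phase 1 (level matching) is not covered, and the forced 5-5 balance is an assumption that gets questioned further down.

```python
import random


def make_balanced_sequence(n_trials=10, seed=None):
    """Build the pre-computed table: half 'A', half 'B', in shuffled order.

    The forced balance is an assumption of this sketch, not a requirement
    of blind testing in general.
    """
    if n_trials % 2 != 0:
        raise ValueError("n_trials must be even for a balanced sequence")
    rng = random.Random(seed)  # seed is only there to make dry runs repeatable
    sequence = ["A", "B"] * (n_trials // 2)
    rng.shuffle(sequence)
    return sequence


def run_session(sequence, play):
    """Phases 2-5: two labelled references, then the unlabelled samples.

    play(path, label) is a hypothetical callback: it routes the output to
    `path` and shows only `label` to the listeners.
    """
    play("A", "A")  # phase 2: reference A, openly labelled
    play("B", "B")  # phase 3: reference B, openly labelled
    for number, path in enumerate(sequence, start=1):
        # phases 4-5: listeners see "sample #n", never the path itself
        play(path, f"sample #{number}")


def reveal(sequence):
    """Phase 6: the confession, e.g. 'AABABBBABA'."""
    return "".join(sequence)


def console_play(path, label):
    # Stand-in for real hardware: prints only what the listeners would see.
    print(f"Now playing: {label}")


table = make_balanced_sequence(10)
run_session(table, console_play)
print("Sequence used:", reveal(table))
```

The point of keeping `play` as a separate callback is that the listeners' display only ever receives the label, so the path choice has no route to leak out before the reveal.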
The listeners then score themselves. "I recognized A or B correctly 6 of 10 times" is relatively easy to score and interpret, using a handy-dandy poster on the wall. If you're doing more elaborate scoring on a variety of sound attributes like the Paladin79 semi-secret society, bring a pre-built spreadsheet that will do the calculations.
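The handy-dandy poster could be generated with something like the sketch below. It assumes the simplest possible model: every guess is an independent coin flip with a 50% chance of being right. That is only an approximation when the sequence is forced to balance 5-5 and the listener knows it, so treat the numbers as a rough guide. The function names `score` and `p_at_least` are mine.

```python
from math import comb


def score(guesses, truth):
    """Count how many guesses match the revealed sequence."""
    if len(guesses) != len(truth):
        raise ValueError("scorecard and sequence must be the same length")
    return sum(g == t for g, t in zip(guesses, truth))


def p_at_least(k, n=10, p=0.5):
    """Chance of k or more correct out of n by pure guessing.

    Assumes independent guesses, each correct with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Poster: how often would a pure guesser do at least this well?
for k in range(5, 11):
    print(f"{k}/10 or better by luck: {p_at_least(k):.3f}")
```

Under that coin-flip model, 6 of 10 or better happens by luck about 38% of the time (386 of 1024 outcomes), so a 6/10 scorecard lands squarely in "statistically ambiguous" territory. It takes 9 of 10 or better to get the luck explanation down to about 1%.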
Q: Why test sequences of 10 plays?
A: That's arbitrary. You have to find a happy balance between sample size (more is always better, at least up to some point) vs how long the test/s will take.
Q: Why balance 5x 'A' and 5x 'B'? It's supposed to be a random draw. Random draws of an extended binary series usually do not balance exactly, so forcing a 5-5 balance means it's not fully random.
A: True. But if you allow pure random, sooner or later you'll get an AAAAAAAAAA sequence, which isn't going to be helpful. If forced balance bothers you, you need to consult with a better statistician than me.
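To put rough numbers on "sooner or later": the sketch below assumes pure random means 10 independent fair coin flips, and counts how often each A/B split would turn up. The function name `prob_a_count` is mine.

```python
from math import comb


def prob_a_count(k, n=10):
    """Probability of exactly k 'A' plays in n independent fair draws."""
    return comb(n, k) / 2**n


n = 10
all_same = prob_a_count(0, n) + prob_a_count(n, n)  # AAAAAAAAAA or BBBBBBBBBB
exact_balance = prob_a_count(n // 2, n)

print(f"all one letter: {all_same:.4f}")
print(f"exact 5-5 split: {exact_balance:.4f}")
```

Under that assumption, a single-letter sequence turns up in 2 of 1024 draws, about once in 512 sessions. An exact 5-5 split only happens about a quarter of the time (252 of 1024), so unforced draws are lopsided to some degree more often than not.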
Q: Same piece of music played 10 times, or 5 different music selections each played twice?
A: That's tricky. With 5 musical selections, I think you'd have to play them in order. So now the experiment changes from "is this 1 of 10 soundbites Path A or Path B", to "is this 1 of 2 soundbites Path A or Path B", replicated 5 times. Because if Sample #3 is B, then Sample #4 must be A. To me that additional constraint means (loosely stated) there are fewer degrees of freedom, and therefore my gut is that same piece * 10 yields more information than the 5 tests of pairs. But OTOH, if it's bring-your-own music, you may need multiple soundbites to hear all the kinds of things you are listening for. Again, maybe an actual statistician should be consulted. Or mimic an existing A-B test protocol designed by a statistician.
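One crude way to check that gut feeling is to count how many different answer keys each design allows. This is just a counting sketch, not a proper power analysis. It assumes the 10-play design is balanced 5-5 and that the paired design plays each selection once through A and once through B in random order. The function names are mine.

```python
from math import comb, log2


def answer_keys_balanced(n_trials=10):
    """Number of distinct sequences with equal counts of A and B."""
    return comb(n_trials, n_trials // 2)


def answer_keys_paired(n_pairs=5):
    """Each pair is either AB or BA, so there are two options per pair."""
    return 2**n_pairs


for name, keys in [
    ("same piece x 10, balanced", answer_keys_balanced(10)),
    ("5 pairs, one A and one B each", answer_keys_paired(5)),
]:
    print(f"{name}: {keys} possible keys, about {log2(keys):.2f} bits")
```

That comes to 252 possible keys (a bit under 8 bits) for the balanced 10-play design versus 32 keys (exactly 5 bits) for the five pairs. Put another way, a pure guesser nails all five pairs 1 time in 32, but has only a 1 in 252 shot at a perfect card on the balanced 10.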
Q: How long before listener fatigue sets in, and that starts contaminating the results?
A: Personally, I have no idea. But that will absolutely be a factor at some point. I'm sure that's been studied. Google is your friend.
Q: In doing A/B on different tubes, you're assuming that say Valhalla #1 is identical to Valhalla #2, and so that any differences you think you hear are purely caused by the different tubes. But there will inevitably be some unit-to-unit variations between the Valhallas.
A: Yep. Can't control for everything. You have to assume that 2 working Valhallas introduce sound variations that are much smaller than the difference between the tubes you are testing. But consider that you are also assuming that the Sylvania tubes you happen to have in hand are good representatives of what's typical for their class, and the KenRads you happen to have are likewise typical of their class. Note also that all the tubes are somewhat used, and all are marching towards the end of their life cycles. (With enough gear and time, you could measure both Valhallas to ensure they are operating close to design specs.) If those are assumptions you are not willing to make, you should state your published or internal-to-you conclusions in precise language to prevent generalizations you believe are inappropriate, as in "On 6-July-2025 at the Schiitr, I preferred these specific 4 used KenRad tubes in Valhalla3 #123456 over these other specific 4 used Sylvania tubes in Valhalla3 #123501 ... YMMV."
TL;DR? Your wisdom & discipline is impressive.
Sorry for any grammatical errors I missed.