Blind cable testing: initial report
Jul 17, 2009 at 5:12 PM Post #61 of 128
Quote:

Originally Posted by Real Man of Genius
I assumed you meant the other as well. Seems to me that the rapid loss of music memory over even a short time lapse between switches would trump "imagination contamination" or lack of "freshness of mind".
Given this new understanding of what you mean by "quick-switching" I must say that I find your argument weak.



I'm skeptical about the "rapid loss of music memory"; see http://www.head-fi.org/forums/f133/r...-sound-430023/.

In that thread I describe a test in which I give myself a few minutes to get used to something. Sometimes people listen to the same system configuration for months, so they are very used to it. When you change one item in their system, they are sensitive to the change.

By the way, which "argument" is "weak"? I've made so many arguments, and they've been so misunderstood, that I need to establish what we're dealing with here.
 
Jul 18, 2009 at 12:22 AM Post #62 of 128
Quote:

Originally Posted by mike1127
I'm comparing my own impressions of two devices. And it's common for devices to differ in beauty.


I never heard this term connected to music before. What do you mean?
confused_face(1).gif


USG
 
Jul 18, 2009 at 1:07 AM Post #63 of 128
Quote:

Originally Posted by Real Man of Genius
Seems to me that the rapid loss of music memory over even a short time lapse between switches . . . .


I would submit that the notion of "rapid loss of music memory" is a fallacy. Something like it is often repeated on this forum (although most people use the rather vague and ambiguous phrase "aural memory"), with no real evidence offered to support the notion that people cannot remember musical sounds for more than a few seconds, and that, therefore, quick switching is necessary.
 
Jul 18, 2009 at 2:48 AM Post #64 of 128
The embedded two questions are not independent, you can't multiply the probabilities, sorry. You are not at 3% significance.

I agree with Pio that you should not change the experimental conditions as you continue to test.

I disagree with Pio (comment in the other thread) that this protocol:

"can I tell ABAB from ABBA?"

is any different statistically from ABX. It is not. The trial always starts with A then B. Then a third sample is played, which the listener has to try to identify as A or B. Having done that, the final answer is forced, since it must be the other letter.

Therefore, from a statistical standpoint, "ABAB or ABBA, which was it?" is identical to "ABX, what is X?" And the analysis statistically is the same.

ABAB / ABBA is better than ABX from an "eliminate bias" perspective, since you get two listens after the initial AB, not just one.

And IMHO the OP is on the right track in doing relaxed, long term tests.

He got three right out of four in the "what is the third sample (and therefore fourth sample, which is forced)" question: the statistical significance level of that is 5/16 (that is the probability of getting three or more correct by chance alone). OP was correct in stating that his significance level after getting two right out of two is 1/4 or 25%.
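For anyone who wants to check these figures, here is a minimal Python sketch of the arithmetic. It assumes every answer is an independent 50-50 guess under the "can't hear a difference" hypothesis; the helper name `tail_probability` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def tail_probability(n, k, p=Fraction(1, 2)):
    """Chance of k or more correct answers in n independent guesses,
    each correct with probability p (illustrative helper)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


print(tail_probability(4, 3))  # three or more right out of four: 5/16
print(tail_probability(2, 2))  # two right out of two: 1/4
```

Exact fractions are used instead of floats so the results can be compared directly with the 5/16 and 1/4 quoted above.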

Now consider the second test -- "is A the Cardas or the Radio Shack?". I thought I read that OP got this correct also 3 out of 4 times. But he says "7 out of 8" is his total across both types of questions, so I am confused.

Anyway I have no idea how to combine these two different types of questions, since they are dependent. We need more observations to analyze whether errors in "what is the third sample?" correlate with errors in "what is A?".

BTW -- I hate all these protocols. They are artificial. The correct protocol in my opinion is to play two samples, over and over, blindly selecting each time the two samples from the four possible orderings (note the swindles):

Good then Bad
Bad then Good
Good then Good
Bad then Bad

and asking: "which do you prefer, first or second, or no preference? ... and if "no preference", do you think they are the same or different?".

A simpler protocol is just to ask: "same or different", but I believe the more involved protocol defends against certain types of response bias.
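A rough sketch of how a set-up man might draw the pairs for the two-sample protocol above, swindles included. This is only one way to do it: it assumes each of the four orderings is equally likely on every trial, and the names `ORDERINGS`, `draw_schedule` and `is_swindle` are invented for the example.

```python
import random

# The four possible orderings; the last two are the swindles.
ORDERINGS = [
    ("Good", "Bad"),
    ("Bad", "Good"),
    ("Good", "Good"),  # swindle: same sample played twice
    ("Bad", "Bad"),    # swindle: same sample played twice
]


def draw_schedule(n_trials, seed=None):
    """Pick one ordering per trial, independently and uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(ORDERINGS) for _ in range(n_trials)]


def is_swindle(pair):
    """True when both samples in the pair are the same cable."""
    return pair[0] == pair[1]


# The set-up man keeps this list private until the session is over.
for first, second in draw_schedule(6, seed=1):
    label = "swindle" if is_swindle((first, second)) else "real pair"
    print(first, "then", second, "(" + label + ")")
```

Drawing each trial independently means the listener cannot infer anything about the current pair from the previous ones.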

Hats off to OP for doing controlled, at-home tests.
 
Jul 18, 2009 at 3:23 AM Post #65 of 128
Quote:

Originally Posted by wavoman
The embedded two questions are not independent, you can't multiply the probabilities, sorry. You are not at 3% significance.


Perhaps. But see below.

Quote:

"can I tell ABAB from ABBA?"

is any different statistically from ABX. It is not. The trial always starts with A then B. Then a third sample is played, which the listener has to try to identify as A or B. Having done that, the final answer is forced, since it must be the other letter.

Therefore, from a statistical standpoint, "ABAB or ABBA, which was it?" is identical to "ABX, what is X?" And the analysis statistically is the same.


I don't know the identity of A and B. They are assigned randomly. So there are actually four possible arrangements.

If R is the Rat Shack and C is the Cardas, then these arrangements are possible:

RCRC
RCCR
CRCR
CRRC



I have to run now but I'll answer the rest of this later.
 
Jul 18, 2009 at 5:02 AM Post #66 of 128
Quote:

Originally Posted by mike1127
...then these arrangements are possible:
RCRC
RCCR
CRCR
CRRC



Of course. I understood that. Nonetheless, once you identify the third sample, the fourth is a forced choice. You reveal no more information in your answer to "what is the fourth sample".

In some A/B/X trials, they don't tell you which is A and which is B. Call that "unknown A/B/X". Then my point is this: statistically your protocol and the "unknown A/B/X" protocol are identical. Experimentally they are not, yours is (probably) better, since you get a second listen. I say "probably" because my argument that it is better rests on unproven but fairly plausible assumptions about response error and bias (both of which routinely get ignored around here but in fact are the essence of testing -- the stats are trivial).

Both protocols (A/B/X, and yours) are inferior IMO to protocols that include swindles, and present only two choices, no X, nor "two more". Just two samples.

Now back to the math. You are trying to say there are four states of nature (RCRC, RCCR, CRCR, CRRC) and therefore your chance of getting it right at random is 1/4, and you passed 7 out of 8 tests. But you didn't do 8 independent trials. You might be able to argue that the chance of getting it right is 1/4 (two guesses on each trial, both with probability 1/2), and that you passed 3 out of 4. You would have 13/256 significance, just about the magic 5% level. But that still rests on an unverified assumption.

Better is the view that there are two guessing games going on here, both 50-50, and you have done each 4 times, and you passed each 3 times, which gets you significance 5/16 on each ... which we don't really know how to combine.
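The two views above can be put into numbers with a short Python sketch. It assumes pure guessing and independent trials, which is exactly the assumption in question, so treat it as a check of the arithmetic only. The helper name `at_least` is made up for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

# The four states of nature: which cable is A, crossed with ABAB vs ABBA.
states = []
for first, pattern in product("RC", ("ABAB", "ABBA")):
    second = "C" if first == "R" else "R"
    states.append(pattern.replace("A", first).replace("B", second))
print(states)  # ['RCRC', 'RCCR', 'CRCR', 'CRRC']


def at_least(n, k, p):
    """Chance of k or more successes in n independent tries (illustrative)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# View 1: one four-way guess per trial, 3 of 4 trials passed.
print(at_least(4, 3, Fraction(1, len(states))))  # 13/256

# View 2: each 50-50 question taken on its own, 3 of 4 right.
print(at_least(4, 3, Fraction(1, 2)))  # 5/16
```

13/256 is about 5.1% and 5/16 is about 31%, which is why the choice of view matters so much here.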

Keep testing. Have your set-up man do some swindles. We will soon find out if you can tell the difference. Keep it simple. Forget this "which is which" nonsense. Play two samples and tell the set-up man if you hear a difference (and which you prefer). That's all -- test like real life. In real life people listen and make choices, nobody plays the silly A/B/X guessing game of "which is which". Audio is the last field to use A/B/X -- taste testers never use it. Think for yourself, don't do "what is this?" testing just 'cause everyone else does.

You want to conclude "can I hear a difference". So test "can I hear a difference". Simple, no? Before you spend your money you want to know "do I like this one better, don't want to be fooled by placebo". So test "do I like this one better". Simple, no? And swindles are great, because the power of the significance test is high (if you claim to hear a difference between two listens to the same cable, after just a few trials you are toast, statistically).

There are historical reasons -- having to do with building hardware or software for testing -- why A/B/X became popular. You can give yourself a blind test, very clever. But we don't need to do this. Also A/B/X is true double blind, while the methods I describe are single blind. But if you use physical isolation and/or neutral testers, single blind is fine.

Blind Testing is like sex -- the two person version is better than the one person version.
 
Jul 18, 2009 at 5:35 AM Post #67 of 128
Quote:

Originally Posted by wavoman
Keep testing. Have your set-up man do some swindles. We will soon find out if you can tell the difference. Keep it simple. Forget this "which is which" nonsense. Play two samples and tell the set-up man if you hear a difference (and which you prefer). That's all -- test like real life. In real life people listen and make choices, nobody plays the silly A/B/X guessing game of "which is which".


In real life, when people listen for enjoyment, they also tend not to repeat a track. I usually don't. Usually when I want to enjoy my music, I pick something fresh. I feel my sensitivity goes down on repeated hearings. That's an artificial situation. That's why, in part, my protocol is designed to produce as much contrast as possible.

My newer idea involves hearing each test track only once. I am looking forward to seeing what happens.

Quote:

Blind Testing is like sex -- the two person version is better than the one person version.


Sometimes I use toys during blind testing.
 
Jul 18, 2009 at 12:17 PM Post #68 of 128
Quote:

Originally Posted by mike1127
In real life, when people listen for enjoyment, they also tend not to repeat a track. I usually don't. Usually when I want to enjoy my music, I pick something fresh. I feel my sensitivity goes down on repeated hearings. That's an artificial situation. That's why, in part, my protocol is designed to produce as much contrast as possible.

My newer idea involves hearing each test track only once....



I think that's a very good point, and a good change to the protocol.

Anything that moves the experiment closer to the actual listening experience (and the desire to optimize that experience is what drives this whole investigation in the first place) is a plus.

Your original thesis that repeated listenings to a single track might blunt one's ability to hear differences cannot be dismissed out of hand, and therefore has to be tested.

I need to modify slightly my "agreement" with Pio that you need to keep experimental conditions the same. Only the uncontrolled ones! Experimental Design is all about varying factors systematically and taking them into account along with the direct effect you are measuring (so-called "factorial" designs). So you will need to do things both ways: single track repeat vs. new track every time, and keep careful records.

Sometimes I am busy (yea, as you say, toys) but in general I am always willing to compute a confidence interval or a particular probability for you. Keeps me young.

Careful, serious at-home relaxed blind testing -- you are to be congratulated. I am pitching at the same thing, but you are ahead of me.
 
Jul 18, 2009 at 5:57 PM Post #69 of 128
I have a couple observations:

1. How many bong hits does it take to make you take 30 mins to change a cable?
wink.gif


2. We're all anonymous usernames here so "authority" or "expertise" on a subject is pretty much assumed based on "content" of the post. This too is subjective.

3. Is it possible "aural memory", being basically a charge, can be stored in one "state" but retrieved in another due to flux of brain chemistry?
 
Jul 19, 2009 at 3:34 AM Post #70 of 128
Quote:

Originally Posted by CodeToad
I have a couple observations:

1. How many bong hits does it take to make you take 30 mins to change a cable?
wink.gif


2. We're all anonymous usernames here so "authority" or "expertise" on a subject is pretty much assumed based on "content" of the post. This too is subjective.

3. Is it possible "aural memory", being basically a charge, can be stored in one "state" but retrieved in another due to flux of brain chemistry?



I'll PM you now my credentials in statistics. BTW, "couple" means exactly 2 (you had 3 observations).
 
Jul 19, 2009 at 6:45 AM Post #71 of 128
Quote:

Originally Posted by wavoman
Of course. I understood that. Nonetheless, once you identify the third sample, the fourth is a forced choice. You reveal no more information in your answer to "what is the fourth sample".


I don't give separate answers to "what is the third" or "what is the fourth". The hope is that the back-to-back contrast makes it easier to identify ABBA or ABAB.

Initially each trial involved only one answer to a binary question: ABBA or ABAB.

Here's the reasoning that each trial involved two answers:

The first two sub-trials presented A and B. So it is natural at that point for me to wonder whether A was the Cardas or Radio Shack. Strictly speaking, it wasn't necessary in this protocol. It was actually supposed to make the test easier by avoiding that decision. I wanted to stick with the sound in front of me, in the moment, without a preconception of what each cable sounded like.

However, it was impossible for me not to have a judgment about the sonic characteristics of each cable. I just couldn't avoid it.

So in each trial, the original type of answer was ABBA or ABAB. Note that is equivalent to asking the ordering of the last two sub-trials.

The second answer is: in the first presentation, which is A and which is B?

So there are two answers associated with each trial. But this is a post-hoc analysis and carries some danger. For example, there is one thing that is a bit arbitrary. Why am I considering the answer about the ordering of the first two sub-trials so important that it is independent from my answer about the second two sub-trials, and why am I considering this post-hoc (i.e. it wasn't in the original protocol directions)? Because of my theory that I am most sensitive in the first or second listen, and because I was very confident about my answers. This does not convince you, of course.

Repeat: I am not trying to convince you. This post-hoc analysis is more interesting to myself, and my own confidence that the cables are different.

In the last three trials, I was right about the identity in the first two sub-trials. In the first trial, I was wrong. Except you have to realize I had not listened to these cables critically in a sighted fashion. I had no reference. And it turns out the Cardas I used for that test was defective so it was actually worse than the Radio Shack, with respect to the specific music I used. (Repeat: WRT the specific music I used. I was using some very brilliant and detailed brass music and the Cardas made it vague and fuzzy.) So there are two ways to regard that answer. Probably it should be dropped from the test. It was really an introduction to the specific sound of each cable.

In that case, I'm 6/7.

Now, this is not meant to convince anyone else, but because I was so confident in the sonic characteristics of each cable, and because I used those same characteristics to judge the ordering of the second trial (which also used the defective Cardas), I feel that I picked up on the sonic identity of the cable.

So I consider myself 7/8.


Quote:

Both protocols (A/B/X, and yours) are inferior IMO to protocols that include swindles, and present only two choices, no X, nor "two more". Just two samples.


It seems to me your idea of "superior" or "inferior" protocols has to do with our confidence in hearing differences under certain conditions, but nothing to do with the question "Can they be heard under ideal conditions?"

Once an objectivist tried to convince me that playback has evolved to a state that is indistinguishable from live sound. His "proof" was this: once he recited a poem in front of a group. At one point, he started a recording of himself on the stereo, stopped talking, while continuing to mouth the words.

Afterward he asked if anyone noticed the moment he switched. No one did. Proof? No, because all it proves is what magicians already know: if you distract people, they miss a lot of details.
 
Jul 19, 2009 at 12:05 PM Post #72 of 128
Quote:

Originally Posted by mike1127
Each trial in this test really consists of two trials, or rather two questions that each have two answers. The first question is:

- Considering just the first two listens, what is the identity of A and B?

The second question is:

- Considering the second two listens, would I choose the overall order as ABBA or ABAB?



Should our quantum for statistical purposes be each answer, or should it be an entire trial (we consider a trial a 'success' if and only if you answer both questions correctly)?

Or will you be changing the protocol shortly?
 
Jul 19, 2009 at 8:13 PM Post #73 of 128
About to start the new protocol...

I have picked 8 tracks. For each track, I listened carefully last night and identified five or six key musical factors that are particularly enjoyable about that track. I made a spreadsheet listing the tracks down the left side, and the factors across from left to right. (Different factors for each track.) The factors are placed left to right in the order they occur in the track.

That's my attempt to control where my attention goes: having a list of factors left-to-right in the same order they occur in the track.

My assistant will randomly choose A or B for each track. I will take notes on what I hear, identifying overall quality for each factor.

So the order of cables today might be

ABBAAABA

We will repeat the test next week, switching the cables for each track.

BAABBBAB

I don't know what the cable ordering is until the test is over, and my assistant will hold onto the list for safekeeping. (I.e. no cheating by me!)

There are some compromises. I didn't get enough time to familiarize myself with each track last night. I just was too busy and had to rush through it.
 
Jul 20, 2009 at 4:19 AM Post #74 of 128
Bad protocol, really bad protocol.

DO NOT flip to the other letter for the next week. That UNBLINDS the test ... you know that you have a different cable.

Each week, pull another random selection of A or B for the 8 tracks.

Do this for several weeks.

NOW you have a protocol!

Sometimes you will be listening to the same cable on the same track -- but you won't know -- so your answers will be most revealing.
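To make the difference concrete, here is a small Python sketch contrasting the two schemes. It assumes the assistant picks A or B for each track independently with equal probability; the function names `draw_week` and `flip` and the seed value are invented for the example.

```python
import random


def draw_week(n_tracks=8, rng=random):
    """Fresh, independent A/B assignment for each track."""
    return "".join(rng.choice("AB") for _ in range(n_tracks))


def flip(assignment):
    """The 'switch the cable on every track' scheme for the following week."""
    return assignment.translate(str.maketrans("AB", "BA"))


# Flipped scheme: week two is fully determined by week one, so the listener
# knows every track now has the other cable.
print(flip("ABBAAABA"))  # BAABBBAB

# Independent scheme: each week is a new draw, so some tracks repeat a cable
# and the listener cannot tell which ones.
rng = random.Random(2009)  # arbitrary seed, only to make the demo repeatable
weeks = [draw_week(rng=rng) for _ in range(4)]
for number, assignment in enumerate(weeks, start=1):
    print("week", number, assignment)
```

With independent draws, a given track gets the same cable in two consecutive weeks about half the time, and those hidden repeats are where the answers become most revealing.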
 
Jul 20, 2009 at 4:29 AM Post #75 of 128
Continuing our discussion of the last trial -- you're not 7/8. You do not have 8 independent trials.

You have 4 trials, each of which asks two questions that are NOT independent of each other.

You have a two-measurement, four-replicate experiment. Stop trying to make it an 8-replicate experiment. You do NOT have the statistical power against incorrect conclusions that 8 replicates afford.

And to answer sohels -- the experimental unit is the trial, and there are four of them. It is not correct to think of "success if both questions are right". Rather, there are two answers for each trial -- the observation is a vector. The two elements of the vector are NOT independent.

You can do two significance tests, one for each question, each based on 4 trials. You can combine the answers into one significance level, but the answer won't be right since the correlation is unknown. There is NO WAY -- I repeat NO WAY -- you can do a significance test based on n=8.
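A small Python sketch of why the dependence matters. It compares two extreme assumptions under pure guessing: the eight answers are fully independent, or the two answers within a trial are perfectly linked (both right or both wrong together). The real correlation is unknown and lies somewhere in between or elsewhere, so neither number is "the" significance level; the function name is made up for illustration.

```python
from fractions import Fraction
from itertools import product


def chance_of_at_least(threshold, n_coin_flips, answers_per_flip):
    """Enumerate every equally likely guessing outcome and return the share
    in which the number of correct answers reaches the threshold."""
    outcomes = list(product((0, 1), repeat=n_coin_flips))
    hits = sum(1 for o in outcomes if answers_per_flip * sum(o) >= threshold)
    return Fraction(hits, len(outcomes))


# Eight independent 50-50 answers: chance of 7 or more correct.
independent = chance_of_at_least(7, n_coin_flips=8, answers_per_flip=1)

# Four trials whose two answers always succeed or fail together.
linked = chance_of_at_least(7, n_coin_flips=4, answers_per_flip=2)

print(independent, float(independent))  # 9/256, about 3.5%
print(linked, float(linked))            # 1/16, 6.25%
```

Treating the data as n=8 gives the smaller figure only if the independence assumption holds, which is the point in dispute.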

Please believe me on this.
 
