From: John Thaden
Date: Tue, 13 Apr 1999 14:36:46 -0500
Subject: Materials Test: Reanalysis of SPAH '97 Data [LONG]
Effects of Harmonica Comb Material on Harmonica Sound: A Monte Carlo Analysis
John J. Thaden
Department of Geriatrics, University of Arkansas, 4300 W. 7th St. RSCH-151, Little Rock Arkansas USA 72205. jjthad~lash.net
Instrument choice can be perplexing for harmonica players. One variable is comb composition. Available comb materials include pear wood, various plastics, and metals such as aluminum, titanium, and brass. Do comb materials have detectable effects on the sound of a harmonica?

The null hypothesis--that comb materials do not affect sound--was tested at the 1997 Society for the Preservation and Advancement of the Harmonica (SPAH) convention by asking a group of harmonica players/enthusiasts to identify comb material by hearing the instruments played. Test instruments were diatonic harmonicas, all with unmodified Hohner Big River reedplates and covers, but with combs fabricated of different materials. The harmonicas were played in three separate test series. During each, four harmonicas were first identified and played, then played in random order without identification. In the first two test series, the same brief test melody (the first and last phrases of 'Summertime') was played eight times by British professional John Walden; in the third series, a vacuum pump actuated the reeds for five tests. Thus, there were 8 + 8 + 5 = 21 tests total. A fourth series tested chromatic harmonica covers; it will not be discussed.

The 1997 conclusion was that participants' harmonica identification scores did not differ significantly from scores that might result from random guessing, and thus, that the null hypothesis (no materials effect on sound) cannot be rejected with acceptable confidence (95%). In other words, if the whole experiment could be repeated 100 times, and if indeed comb materials make no difference, then test scores as good as or better than those achieved by the SPAH 1997 audience would be likely to occur by chance alone more than five times in those 100 experiments. An active and prolonged debate ensued on Harp-L. Some faulted the test design and/or execution.
On February 18, 1999, the raw data were published on Harp-L [Vern Smith, SPAH97 Materials Test Raw Data], excluding a few listeners who "turned in test papers on which only a few or no selections were made". This letter describes a new analysis of the SPAH'97 data that has addressed a problem in the original analysis caused by missing guesses, and has used resampling statistical methods to empirically discover the expected distribution of scores assuming no effect of comb material on harmonica sound, rather than merely assuming it is normal. The conclusions differ from the 1997 conclusion.
RESULTS (Note: Use a nonproportional font to view tables)
o If missing data are scored the same as incorrect selections, mean and median scores are lower, but so are the mean and median scores one expects by chance.
Some listeners left blanks indicating to me that they could not distinguish among the harps played. I scored that the same as an incorrect selection. [Smith, op. cit.]
Table 1 shows the extent and pattern of omitted responses, runs of four or more omitted responses, and duplicated guesses, made by 27 respondents after each of 21 harmonica test sounds.
Listener #3 left all answers blank (but did participate in the final, cover-materials test not discussed here). It is not clear if these 21 omissions were scored as wrong in the 1997 analysis. Respondent #9 answered all queries in the first two test series, but none in the third series or the fourth (cover-materials) series. Listener #6 omitted the final six guesses of the second test series. Instead of inability to distinguish harmonicas, these omissions may have been due to the individuals' absence from the room, or simple lack of interest. The remaining omissions were scattered, none occurring in runs of four or more. Nonetheless, one cannot rule out explanations unrelated to ability to distinguish harmonicas, for instance, a broken pencil, or an intrusive background noise.

To discover the expected score distribution assuming no materials effect, a Monte Carlo simulation was done with 10,000 replications, each structured like the actual experiment (21 tests x 26 respondents, 32 omissions, 23 double-guesses).
Table 2 shows summary statistics based on the analysis method of 1997, excluding listener #3. Omissions were scored as wrong (0), correct doubled guesses as half-right (0.5), and correct single guesses as right (1). The scores are percentages because the sum has been divided by the total number of tests (21) and multiplied by 100%. The same scoring method was used for 10,000 replications in a Monte Carlo experiment, each reproducing the structure of the actual experiment (including omissions and doubled guesses), but with all guesses being strictly random choices of one of four possibilities, i.e., assuming no materials effects. Table 2 also shows the exact probability (P) that the actual mean, median, and maximum scores came from distributions revealed by the Monte Carlo experiment. A P-value less than 0.05 generally is considered to indicate a significant difference between actual and expected values, and thus that respondents' efforts to identify these harmonicas were aided by differences in how they sound.
Table 2. Omissions scored as wrong: summary of data from 1997 SPAH and a Monte Carlo experiment

           SPAH'97    Monte Carlo        P
Mean        27.75     23.52 +/- 2.91   0.0104
Median      28.57     23.10 +/- 4.17   0.0363
Maximum     57.14     43.59 +/- 8.33   0.0278
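The scoring rule behind Table 2 (0 for an omission, 0.5 for a correct double guess, 1 for a correct single guess, expressed as a percentage of the tests) can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name, the tuple encoding of a listener's answers, the `omission_score` parameter, and the material labels in the toy example are all assumptions made here.

```python
def score_listener(answers, key, omission_score=0.0):
    """Percentage score for one listener.

    answers: one tuple per test: () for an omission, (m,) for a single
             guess, (m1, m2) for a double guess (hypothetical encoding).
    key:     the material actually played in each test.
    """
    total = 0.0
    for guess, truth in zip(answers, key):
        if len(guess) == 0:
            total += omission_score          # omission: wrong by default
        elif len(guess) == 1:
            total += 1.0 if guess[0] == truth else 0.0
        elif len(guess) == 2 and truth in guess:
            total += 0.5                     # correct double guess: half-right
    return 100.0 * total / len(key)


# Toy example with four tests and made-up material labels.
key = ["wood", "brass", "plastic", "aluminum"]
answers = [("wood",), ("wood", "brass"), (), ("brass",)]
print(score_listener(answers, key))  # 37.5
```

With 21 tests, a total of 12 points gives 100 x 12/21 = 57.14%, which is the maximum score reported in Table 2.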
The Monte Carlo experiment showed that the expected mean and median are not 25%, as one might think based on the fact that each test involved four possible harmonicas, only one of which could be the correct answer, but 23.5% and 23.1%, respectively. This difference is real, and can be attributed to the scoring method: omitted answers are never right, while purely random guesses will be right 25% of the time. Because the mean and median distributions are shifted downward, the actual mean and median scores at SPAH'97, though little different from 25%, are significantly different from the prediction of chance.
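The downward shift of the expected mean can be checked with simple arithmetic and a small simulation. The sketch below is not the original analysis code; it assumes only the counts quoted earlier (21 tests, 26 respondents, 32 omissions, 23 double guesses). Because the group mean depends on how many answers of each kind there are and not on which listener gave them, it can be simulated without Table 1. The median and maximum would need the per-listener pattern and are not attempted here. All function and variable names are illustrative.

```python
import random
import statistics

N_TESTS, N_LISTENERS = 21, 26
N_OMITTED, N_DOUBLE = 32, 23
N_CELLS = N_TESTS * N_LISTENERS               # 546 listener-by-test answers
N_SINGLE = N_CELLS - N_OMITTED - N_DOUBLE     # 491 single guesses

# Expected points per answer under purely random guessing:
#   single guess : 1.0 point with probability 1/4  -> 0.25
#   double guess : 0.5 point with probability 1/2  -> 0.25
#   omission     : always 0 when scored as wrong   -> 0.00
expected_mean = 100.0 * 0.25 * (N_SINGLE + N_DOUBLE) / N_CELLS


def simulate_null_means(n_reps=10_000, omission_score=0.0, seed=None):
    """Group mean score (%) for n_reps experiments of random guessing."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        # Number of single guesses that happen to be right.
        singles = sum(rng.random() < 0.25 for _ in range(N_SINGLE))
        # Number of double guesses that happen to contain the right answer.
        doubles = sum(rng.random() < 0.50 for _ in range(N_DOUBLE))
        total = singles + 0.5 * doubles + omission_score * N_OMITTED
        means.append(100.0 * total / N_CELLS)
    return means


if __name__ == "__main__":
    print(f"expected mean by arithmetic: {expected_mean:.2f}%")
    sim = simulate_null_means(n_reps=2_000, seed=1997)
    print(f"simulated mean: {statistics.mean(sim):.2f}%")
```

The arithmetic gives 0.25 x 514/546, or about 23.5%, in line with the Monte Carlo mean in Table 2. Setting `omission_score=0.25` gives every answer an expected 0.25 points, so the expected mean returns to 25%.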
The method for handling missing data chosen in the present analysis was to score them no better than expected by chance (0.25 points). Table 3 shows the SPAH'97 results scored in this manner (again omitting listener #3), and also the empirical results of a Monte Carlo experiment.
Table 3. Omissions scored as random guesses: summary of data from 1997 SPAH and a Monte Carlo experiment

           SPAH'97    Monte Carlo        P
Mean        29.17     25.00 +/- 2.88   0.0122
Median      30.36     24.62 +/- 3.87   0.0055
Maximum     57.14     44.00 +/- 7.74   0.0253
Regardless of the method for handling missing data, the mean, median and maximum scores from the SPAH'97 all exceed expectations if guessing were random, provided the random score distributions are expressed exactly via a resampling technique like the Monte Carlo experiments.
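One common way to turn a Monte Carlo distribution into a P-value is to count the fraction of simulated statistics that are at least as large as the observed one. This letter does not spell out how its P-values were computed, so the one-sided rule below is an assumption, and the function name and toy numbers are made up for illustration.

```python
def empirical_p(null_values, observed):
    """One-sided Monte Carlo P: share of null statistics >= observed."""
    hits = sum(1 for v in null_values if v >= observed)
    return hits / len(null_values)


# Toy null distribution (not the SPAH'97 data).
null_values = [20, 21, 22, 23, 23, 24, 24, 25, 26, 28]
print(empirical_p(null_values, 26))  # 0.2
```

With 10,000 replications, the smallest nonzero P this rule can report is 0.0001. A variant that adds one to both the count and the number of replications avoids reporting a P of exactly zero.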
o Double guesses were appropriately treated in the 1997 analysis:
We allowed the listeners to name two comb materials if they chose. In that case the probability of guessing correctly rises [from 25%] to 50% so they should be scored 1/2 correct if they hit on two selections. [Smith, op. cit.]
Table 1 also shows the distribution of double entries among respondents. Duplicate guesses may be considered 'sampling without replacement' from four comb materials, and indeed, the probability of making a correct response with two random guesses is 0.5. Scores therefore included 1/2 point for a pair of guesses where one was correct (Table 2). Alternative analyses were also done wherein double guesses were treated essentially as single guesses on two separate tests and missing data were handled as described for Tables 2 and 3; the results were similar and led to identical conclusions [data not shown].
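The 0.5 figure for a random double guess can be verified by enumerating every pair of distinct materials. This is a small check written for illustration; the four material labels are placeholders.

```python
from fractions import Fraction
from itertools import combinations

materials = ["A", "B", "C", "D"]   # placeholder labels for the four combs
truth = "A"                        # whichever harmonica was actually played

pairs = list(combinations(materials, 2))        # 6 possible double guesses
hits = [p for p in pairs if truth in p]         # 3 of them contain the truth

p_correct = Fraction(len(hits), len(pairs))     # 3/6 = 1/2
expected_points = Fraction(1, 2) * p_correct    # half a point per correct pair

print(p_correct, expected_points)  # 1/2 1/4
```

A random double guess therefore earns an expected 0.25 points, the same as a random single guess, so half credit keeps the two kinds of answer on an equal footing under the null hypothesis.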
o Mean, median, and maximum scores at SPAH 1997 exceeded the predictions of chance.
Tables 2 and 3 both illustrate significant ability in the group of respondents to outperform mere chance guessing when trying to identify harmonicas by sound.
CONCLUSION:
Something about the different harmonicas allowed respondents to identify which of four harmonicas was being played. The ability to do so was not strong, but influenced the mean, median and maximum scores enough that they deviated significantly from the mean, median and maximum values that would be obtained by random guessing.
Were respondents actually hearing differences in harmonica sounds caused by differences in comb material? The test definitely fails to prove this. This conclusion is ruled out because each harmonica also had its own reedplates and cover plates. Though these components were all from the same model harmonica (Hohner Big River), it is common knowledge among harmonica players that at least reedplates (and the attached reeds) can differ markedly for different harmonicas of the same model. No attempt was made to normalize the reeds' musical pitches, timbre, rest position with respect to the reedplates (offset), responsiveness, etc.