Paper read at the 1998 Mid-America Linguistics Conference
John E. McLaughlin
Utah State University
and LingNet Resources
DON'T PANIC! These words, written in large friendly letters(1), should allay your fears about my approach to the issues of chance, probability, randomness, and genetic relationship in languages. I'm not a mathematician and it takes a good night's sleep, the wife out of the house for the day, and a comfortable recliner in order to ferret out any meaning at all in the complex formulas presented in most papers on this subject. In this way, I probably reflect the knowledge and interest of most linguists doing comparative and historical linguistics. But this issue is not one that can be ignored despite our difficulty in tackling the math involved. It is the subject of a growing body of writing and is beginning to form one of the most critical elements of the debate over Nostratic, Greenberg's Amerind, Ruhlen's Proto-World, and even the basic question of how far back in time we can demonstrate a genetic relationship.
As I began reading this body of literature, I quickly realized that there was an increasingly acrimonious debate developing over the mathematical assumptions. The debate has hinged on two issues: 1) Is there really a difference in the rates of retention between the so-called "basic vocabulary" and the rest of the lexicon? and 2) Are multilateral sets less likely to be affected by chance resemblances or not? It is the latter issue which sparked my curiosity. The problem with the debates over this matter is that, to demonstrate the validity of any of the hypotheses, pairs of real languages were used. The first problem is that one can never categorically rule out a relationship. The absence of a proven relationship does not automatically prove that the two languages cannot be related. There is also a multitude of areal features to consider even between unrelated languages. Therefore there is always an unknown element when using real languages to demonstrate the factor of chance. The second problem is the element of semantic content. Can one legitimately compare 'daughter' with 'girl', or 'be' with 'become', or 'door' with 'entry'? Most comparative linguists, working outside the realm of mathematics, have no real problem with comparing any of these obviously very closely related concepts. A great many words in any dictionary contain multiple meanings for each item. Which one do we choose as the primary one for comparison? Yet each additional semantic possibility or semantic ambiguity that we add to a possible comparative group increases the chance of random matches.
With these problems in mind, I decided to approach the issue of demonstrating a factor of chance from a different perspective. Instead of using real languages, which are always subject to the possibility (no matter how remote) of actual relationship, I designed a computer program in Visual Basic 5.0 to produce a random lexicon for ten languages and, using strict rules of correspondence between sounds, had the computer find all the binary pairs in these languages that would count as cognates to a typical comparative linguist. Since these computer-generated (CG) languages have no possibility of genetic relationship, and since the semantic content can be precisely controlled at various levels, the program provides very reliable information about the chances of random cognate sets between unrelated languages. In essence, it provides an experimentally derived basis of comparison rather than a mathematically derived one.
In designing the ten CG languages, I divided the set of languages into four groups: three languages with small consonant inventories (fewer than 20), three languages with medium-sized consonant inventories (20-30), two languages with large inventories (30-40), and two languages with very large inventories (over 40). I based the phonologies of these CG languages on real-world languages, using both the actual phonemic inventories and the frequency of occurrence for each of the phonemes. In addition, as a control measure, two of the small CG languages were based on a single real-world language, as were both of the very large CG languages. The two identical small languages were based on Shoshone; the other small language on Zuni. The three medium-sized languages were based on reconstructed Proto-Indo-European, Hungarian, and English. The two large languages were based on Eastern Keres and Lushootseed, and the two identical very large languages were based on Heiltsuk. Except for English and Proto-Indo-European, all these languages represent unrelated language families, and each has a different consonant and vowel inventory. Table 1 illustrates the inventories for each of the eight real-world languages and matches them to the ten CG languages.
The program constructed a random vocabulary of 1,000 words for each of the ten languages. Each of these words consisted of a CVC sequence. I chose a CVC sequence since that tends to be the most commonly used sequence in comparisons and led to a two-tiered comparison of the forms by the computer. First, do the two consonants match, and second, does the vowel also match?
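The generation step can be sketched roughly as follows. This is a minimal Python illustration, not the original Visual Basic 5.0 program: the function names (`make_lexicon`, `compare`) are hypothetical, the toy inventories and frequency weights are invented for the example rather than taken from Table 1, and drawing each phoneme independently by its frequency is an assumption.

```python
import random

def make_lexicon(consonants, vowels, size=1000, seed=None):
    """Return `size` random CVC words as (C1, V, C2) tuples.

    `consonants` and `vowels` map each phoneme to a relative frequency
    weight. Each slot is drawn independently (an assumption).
    """
    rng = random.Random(seed)
    c_syms, c_wts = zip(*consonants.items())
    v_syms, v_wts = zip(*vowels.items())
    c1 = rng.choices(c_syms, weights=c_wts, k=size)
    v = rng.choices(v_syms, weights=v_wts, k=size)
    c2 = rng.choices(c_syms, weights=c_wts, k=size)
    return list(zip(c1, v, c2))

def compare(w1, w2):
    """Two-tiered comparison: (both consonants match, vowel matches too)."""
    consonants_match = w1[0] == w2[0] and w1[2] == w2[2]
    full_match = consonants_match and w1[1] == w2[1]
    return consonants_match, full_match

# Toy inventory with invented weights (not the values from Table 1).
toy_consonants = {"p": 5, "t": 9, "k": 8, "m": 6, "n": 10, "s": 7}
toy_vowels = {"a": 10, "i": 6, "u": 4}
lexicon = make_lexicon(toy_consonants, toy_vowels, size=1000, seed=1)
```

The position of a word in the list serves as its "meaning", so the same index in two lexicons stands for the same lexical item.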
I then constructed a table of correspondences to use in comparing the disparate phonologies to one another. I basically made sure, using commonly found correspondences, that each sound in each language was part of a regular correspondence set. Thus, in the languages without glottalized consonants, for example, the plain versions matched both the plain versions and the glottalized versions in the languages that have them. The same basic principles were used for correlating the matches between uvulars and velars, lateral and rhotic approximants, fricatives, etc. Table 2 shows the Table of Correspondences.
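One possible way to encode such a table is sketched below in Python. The structure is an assumption made for illustration: each phoneme in each language is mapped to the set of correspondence labels it may represent, and two sounds correspond when their label sets overlap. The languages "A" and "B", the labels, and the function names are all hypothetical and do not reproduce Table 2.

```python
# Hypothetical fragment of a correspondence table. Language "A" has no
# glottalized series, so its plain stops stand for both the plain and the
# glottalized correspondence sets; language "B" keeps them apart.
CORRESPONDENCES = {
    "A": {"k": {"K", "K'"}, "t": {"T", "T'"}, "l": {"L"}},
    "B": {"k": {"K"}, "k'": {"K'"}, "t": {"T"}, "t'": {"T'"}, "r": {"L"}},
}

def sounds_correspond(lang1, p1, lang2, p2, table=CORRESPONDENCES):
    """True if the two phonemes share at least one correspondence label."""
    return bool(table[lang1][p1] & table[lang2][p2])

def words_match(lang1, w1, lang2, w2, table=CORRESPONDENCES):
    """Two-consonant match on (C1, V, C2) tuples; the vowel is ignored."""
    return (sounds_correspond(lang1, w1[0], lang2, w2[0], table)
            and sounds_correspond(lang1, w1[2], lang2, w2[2], table))
```

Under this encoding, plain `k` in language A matches both `k` and `k'` in language B, and the lateral in A matches the rhotic in B, which mirrors the kinds of correspondences described above.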
The final process in the construction of the program was to decide how to deal with semantics. The problem was solved quite simply: each word was numbered as it was generated. Exact semantic matches (as in comparing 'eat' in Language A with 'eat' in Language B) were simply a case of comparing Word 1 in L1 to Word 1 in L2, etc. Dealing with non-exact semantic matches (as in comparing 'girl' with 'daughter') was a more complicated issue. I solved the problem by using a moving vector approach and taking advantage of what we may call the "Thesaurus Effect". The Thesaurus Effect is starting with a word and then moving through the choices in a thesaurus until one arrives at a word which has a completely different, unrelated meaning to the original word. We've all played this game at one time or another and are always amazed at the permutations we can come up with. I used this Thesaurus Effect in the program by comparing Word 1 in L1 with Words 1-10 in L2 for a semantic latitude typical of most long-range comparisons. The program then compared Word 2 in L1 with Words 2-11 in L2, etc. Thus, a semantic latitude of 1 represented extremely tight 'girl' equals 'girl' comparisons, a semantic latitude of 5 represented typical 'girl' equals 'girl', 'child', or 'daughter' comparisons, and a semantic latitude of 10 represented looser 'girl' equals 'girl', 'child', 'daughter', 'sister', 'niece', 'female', 'woman', 'sibling' comparisons.
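The moving-window comparison can be sketched as follows. This is a hypothetical Python illustration of the idea, not the original program: the function name `count_matches` is invented, the default matcher simply requires identical consonants, and the handling of the end of the list (the window is truncated rather than wrapped around) is an assumption, since the text does not say what happens there.

```python
def count_matches(lex1, lex2, latitude=1, match=None):
    """Count matching pairs between two lexicons of (C1, V, C2) tuples.

    Word i of lex1 is compared with words i .. i+latitude-1 of lex2,
    so latitude=1 is an exact semantic match. Every matching pair is
    counted. The window is truncated at the end of lex2 (an assumption).
    """
    if match is None:
        # Default: exact match on both consonants, vowel ignored.
        match = lambda a, b: a[0] == b[0] and a[2] == b[2]
    total = 0
    for i, w1 in enumerate(lex1):
        for w2 in lex2[i:i + latitude]:
            if match(w1, w2):
                total += 1
    return total

lex_a = [("k", "a", "t"), ("p", "i", "n"), ("m", "u", "s")]
lex_b = [("p", "a", "n"), ("k", "u", "t"), ("m", "a", "s")]
tight = count_matches(lex_a, lex_b, latitude=1)   # only mus/mas lines up
loose = count_matches(lex_a, lex_b, latitude=2)   # kat/kut now falls in range
```

Widening the window can only add candidate pairs, which is why the number of chance matches grows with semantic latitude.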
The program reported several pieces of information:
First, we'll examine the most restrictive of the charts. Look at the top right chart on Table 3. This chart illustrates the results of finding exact matches between all three elements of each word and allowing no semantic variation. This is equivalent to comparing Shoshone kimma 'come' to Panamint kimma 'come'. Two general rules begin to stand out. The first generalization is that the greater the difference between the phonological inventories of the two languages, the lower the number of matches found. So L1 and L2, which have identical phonologies, show the greatest number of matches using these restrictive criteria. Notice also that L9 and L10, which also have identical phonologies, show more matches with each other than with any other language they are compared to. The second generalization is that the larger the phonological inventory, the fewer matches will be found. While the identical L1 and L2 have a small phonological inventory and average three matches out of 1,000, the identical L9 and L10, with their very large phonological inventories, show only one match out of 1,000.
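A rough way to see why inventory size matters, without heavy mathematics, is to compute how often two independent draws from the same frequency table land on the same phoneme. The Python sketch below does this for two languages with identical inventories. It assumes each slot of a CVC word is drawn independently by frequency, the function names are hypothetical, and the uniform toy inventories are invented for illustration; the figures it produces are not the ones reported in Table 3.

```python
def collision_probability(weights):
    """Chance that two independent draws from `weights` give the same phoneme."""
    total = sum(weights.values())
    return sum((w / total) ** 2 for w in weights.values())

def expected_exact_matches(consonants, vowels, words=1000):
    """Expected number of exact CVC matches between two random lexicons
    built from the same inventory, comparing word i only with word i."""
    pc = collision_probability(consonants)
    pv = collision_probability(vowels)
    return words * pc * pv * pc

# Toy inventories with equal frequencies (invented for illustration).
small = {f"c{i}": 1 for i in range(10)}    # 10 consonants
large = {f"c{i}": 1 for i in range(40)}    # 40 consonants
vowels = {f"v{i}": 1 for i in range(5)}    # 5 vowels

small_expected = expected_exact_matches(small, vowels)
large_expected = expected_exact_matches(large, vowels)
```

With equal frequencies, quadrupling the consonant inventory cuts the expected number of exact matches by a factor of sixteen. Uneven frequencies raise the collision probability, so real inventories would yield more chance matches than a uniform one of the same size.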
Compare this chart, which requires exact phonological matches, with the top right chart on Table 4, which uses the Table of Correspondences, but otherwise with the same tight restrictions on semantics and matching all three elements of each word. While the number of matches between L1 and L2 and between L9 and L10 remains the same, since these pairs do not use the Table of Correspondences, all the other comparisons between languages show more matches. The second of our two generalizations still holds: the languages with smaller inventories have more matches than the languages with larger inventories. Once we begin to use the Table of Correspondences, however, the radical difference between identical phonologies and non-identical phonologies is not as great, although it can still be seen in the charts requiring only a two-consonant match.
We've now seen the number of matches in the most restrictive circumstances: 12 pairs out of a possible 45,000 for an exact phonological match and 107 pairs out of a possible 45,000 for a Table of Correspondences match, or 0.03% and 0.24%, respectively. Now we turn to the least restrictive circumstances for a match. Look at the second chart on the left side of Table 3. This shows the average number of matches between two consonants allowing a semantic range of 10, but requiring an exact phonological match. Notice how much more the identical languages (L1 and L2, and L9 and L10) stand out in terms of number of matches between them. The number of matches between L1 and L2 is consistently about twice as many as between either of these languages and L3, which has another small, but non-identical, inventory. The same is true for the number of matches between L9 and L10 compared to the number of matches between either of these languages and L8, which has a large phonological inventory. Yet the number of matches between L1 and L2 is up to 11 times higher than the number of matches between L9 and L10, thus clearly demonstrating our two generalizations that the chance of random matches increases with similar and smaller phonologies when not using a Table of Correspondences.
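The arithmetic behind these percentages is simple enough to check directly. The short Python sketch below reproduces it: ten languages give 45 binary pairings, and with 1,000 lexical items each that yields the 45,000 possible pairs used as the denominator. The variable names are only illustrative.

```python
from math import comb

n_languages = 10
words_per_lexicon = 1000

language_pairs = comb(n_languages, 2)            # 45 binary pairings
possible = language_pairs * words_per_lexicon    # 45,000 possible pairs

# Match counts quoted in the text for the most restrictive settings.
exact_pct = f"{12 / possible:.2%}"
correspondence_pct = f"{107 / possible:.2%}"
print(exact_pct, correspondence_pct)
```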
Now look at the corresponding chart on Table 4. This chart is probably the most typical of the type of comparison practiced by most comparative linguists, especially those seeking to demonstrate long-range groupings. It recognizes a semantic leeway of 10 and matches just the two consonants on the Table of Correspondences. Notice that the generalization about smaller phonologies still holds true here, with the languages classed as small (fewer than 20 consonants) having matches with other languages five to six times as often as the languages classed as large or very large (30 or more consonants). Now look at the number of matched pairs-4,024 or 8.94% of a possible 45,000. With 1,000 words in each of the lexicons, this means that there should be an average of four pairs per lexical item. This may be expressed in one of two ways or a combination of the two. The first way that this might show up is in pairs illustrating the same sounds. With four pairs in a lexical item, this may mean an interlocking set of at least three languages showing the same correspondences in each of the forms. The second way that this might show up is in four unrelated pairs of words between eight of the languages. Usually, a combination of the two types of pairings is seen.
Now look at the bottom left chart on Table 4. This is where the maximum values are given for each of the language pairings out of 100 iterations with a semantic leeway of 10 and only matching the two consonants on the Table of Correspondences. Note that the numbers are at least 20% higher than they are in the averages chart we were just looking at. Also note that the number of pairs has risen to 5,184, or an average of five matches for each of the 1,000 lexical items. Obviously, this is quite relevant to the question of how likely it is to find multilateral comparisons based on chance alone. Table 5 shows the sets that came up for lexical items 900-1,000 during one run of the program. Rather than showing the individual pairs, I have lumped the related pairs together to form cognate sets that illustrate correspondence sets for each of the two consonants. The first 20 columns show the word and number for the forms in the ten languages. The final two columns show the so-called "proto-consonants" (one for each correspondence set on the Table of Correspondences) and the number of languages represented in each of the cognate sets. This table began as approximately 341 pairs. The full number of pairs for this iteration was 3,411, actually 613 pairs fewer than the average. There is a good deal more collapsing of sets that could be done, but the current chart was done very precisely according to rule. Note that each of the sounds of the "proto-language" is illustrated by multiple cognate sets and in both initial and final position. Looking at this chart as it stands, many linguists would see at least a suggestive start for further research into a genetic relationship.
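Lumping binary pairs into larger sets can be sketched with a simple union-find grouping, shown below in Python. This is only one plausible reading of the procedure: the rules actually used to build Table 5 are not spelled out in the text, the function name `lump_pairs` is hypothetical, and the (language, word number) pairs in the example are invented.

```python
def lump_pairs(pairs):
    """Group matched pairs into sets of mutually linked forms.

    Each form is a (language, word_number) tuple. Two forms end up in
    the same set if a chain of matched pairs connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Invented example: three pairs collapse into two candidate cognate sets.
example_pairs = [
    (("L1", 900), ("L3", 900)),
    (("L3", 900), ("L7", 901)),
    (("L2", 950), ("L5", 950)),
]
cognate_sets = lump_pairs(example_pairs)
```

Note that chaining pairs in this way can link forms that do not match each other directly, so a stricter rule would be needed to guarantee that every member of a set shows the same correspondence.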
What happens to the chances of random matches when we loosen the bonds of comparison even more? For example, if we only compared initial consonants then imagine what Table 5 would look like. I found that sets with five to six languages illustrating the correspondence are common and there is even a significant number with seven and eight languages. What if we used longer words? I have used CVC as a standard form, but we often find an initial syllable compared to a final syllable and vice versa. What this does to the numbers in the charts in Table 3 and Table 4 is to double them.
In summary, I haven't given any rock-hard figure or calculation to determine whether a particular comparison exceeds the threshold of chance possibility. Instead, I have found two generalities: the more similar and the smaller the phonological inventories of the languages being compared, the greater the likelihood of random matches. I have also found that multilateral comparison increases the chance of finding multiple languages showing two-consonant correspondences in particular lexical forms, giving the overall impression of a solid linguistic grouping with a full range of proto-forms.
1. Written on the cover of the fictional Hitchhiker's Guide to the Galaxy, described by Douglas Adams in his book of the same name.