The first rule is, you must not fool yourself. And you are the easiest person to fool.
--Richard Feynman
On sci.lang we are often presented with lists of resemblances between far-flung languages (e.g. Basque and Ainu, Welsh and Mandan, Hebrew and Quechua, Hebrew and every other language, Basque and every other language), along with the claim that such resemblances "couldn't be due to chance", or are "too many" to be due to chance.
Linguists dismiss these lists, for several reasons. Often a good deal of work has gone into them, but little linguistic knowledge. Borrowings and native compounding are not taken into account; the semantic equivalences proffered are quirky; and there is no attempt to find systematic sound correspondences. And linguists know that chance correspondences do happen.
All this is patiently explained, but it doesn't always convince those with no linguistic training-- especially the last point. Human beings have been designed by evolution to be good pattern matchers, and to trust the patterns they find; as a corollary their intuition about probability is abysmal. Lotteries and Las Vegas wouldn't function if it weren't so.
So, even one resemblance (one of my favorites was gaijin vs. goyim) may be taken as portentous. More reasonably, we may feel that one resemblance may be due to chance; but some compilers have amassed dozens of resemblances. Such lists may be criticized on other grounds, but even linguists may not know if the chance argument applies. Could a few dozen resemblances be due to chance? If not, what is the approximate cutoff?
The same question comes up in evaluating the results of Greenbergian mass comparisons; or proposals relating language families (e.g. Japanese and Tibeto-Burman) based on very small numbers of cognates. Again, it would be useful to know how many chance resemblances to expect.
I will propose a simple but linguistically informed statistical model for estimating the probability of such resemblances, and show how to adjust it to match the specific proposal being evaluated.
Let's start with a simplified case (we'll complicate it later). We will compare two unrelated languages A and B, each of which has 1,000 lexemes of the form CVC, and an identical semantics and phonology. That is, if there is a lexeme a in A with some meaning M, there will be a lexeme bp in B phonetically identical to a, and a lexeme bs with the same meaning as a.
What is the probability that bp is bs?-- that is, that there is a chance resemblance with a? It can be read off from the phonology of the typical root. Supposing there are 14 consonants and 5 vowels, it is 1/14 * 1/5 * 1/14, or 1 in 980. (This assumes that the vowels and consonants are equiprobable, which of course they are not.) For ease of calculation we'll round this to 1 in 1000.
How many chance resemblances are there? As a first approximation we might note that with a thousand chances at 1 in 1000 odds, there's a good chance of getting one match.
However, this is not enough. How likely is it exactly that we get one match? What's the chance of two matches? Would three be quite surprising? Fortunately probability theory has solved this problem for us; the chance that you'll find r matches in n words where the probability of a single match is p is
(n! / (r! (n-r)!)) p^r (1 - p)^(n-r)

or in this case

(1000! / (r! (1000-r)!)) .001^r .999^(1000-r)

For the first few r:
p(1) = .368
p(2) = .184
p(3) = .0613
p(4) = .0153
p(5) = .00305
p(6) = .000506
So the probability of between 1 and 6 matches is .368 + .184 + .0613 + .0153 + .00305 + .000506 = .632, or about 63%. It would be improbable, in other words, if we found no exact matches in the entire dictionary. (But not very improbable; p(0), which we can find by subtracting the above p's from 1.0, is 37%.)
Proffered resemblances are rarely exact. There is always some phonetic and semantic leeway. Either can be seen as increasing the set of words in B we would consider as a match to a given word a in A.
For instance, suppose for each consonant we would accept a match with 3 related consonants, and for each vowel, 3 related vowels. Since we're assuming a CVC root structure, this gives 3*3*3 = 27 words in B which might match any given a.
And suppose for each word a we will accept 10 possible meanings for b. This must be applied to each of the 27 phonetic matches; so a can now match a pool of 27*10 = 270 lexemes. The probability that it does so is of course 270 in 1000, or .27. Every lexeme in A, in other words, has a better than 1 in 4 chance of having a random match in B!
How many chance resemblances are there now? The same formula can be used, with the revised estimate for p:
(1000! / (r! (1000-r)!)) .27^r .73^(1000-r)

There is a significant probability for very high numbers of matches, so we must continue calculating for r well into the hundreds. The results can be summarized as follows:
p( 1 to 50) = 5.85 * 10^-74
p( 51 to 100) = 1.22 * 10^-40
p( 101 to 150) = 8.62 * 10^-20
p( 151 to 200) = 1.70 * 10^-07
p( 201 to 250) = .082
p( 251 to 300) = .903
p( 301 to 350) = .016
p( 351 to 400) = 1.17 * 10^-08
p( 401 to 450) = 2.11 * 10^-19
p( 451 to 500) = 1.30 * 10^-34
In other words there's a 90% chance that we'll find between 250 and 300 matches, an 8% chance of finding fewer, and a 2% chance of finding more.
Our rule of thumb would have suggested 270 matches, and this is in fact the number with the highest probability (2.84%).
I will suggest refinements to this model below, but the basic features are in place: a probability for a single match; a calculation for number of expected matches; and an adjustment for phonetic and semantic leeway.
Suppose we want to check for random matches between Quechua and Chinese.
First, we need to decide what constitutes a phonetic match between the two languages. One way of doing this is to decide for each Quechua phoneme what Chinese phonemes we'll accept as matches. (Think of it this way: is Qu. runa a match for Ch. rén? Is chinchi a match for chong? Is chay a match for zhè?)
We might decide as follows. The criterion here is obviously phonetic similarity. We could certainly improve on this by requiring a particular phonological distance; e.g. a difference of no more than two phonetic features, such as voicing or place of articulation. The important point, as we will see, is to be clear about what we count or do not count as a match; or if we are evaluating someone else's work, to use the same phonetic criteria they do.
Qu. | Ch.
p | p, b
t | t, d
ch | ch, zh, j, q, c, z
k | k, g
s | s, sh, c, z, x, zh
h | h
q | h, k
m | m, n
n | m, n, ng
ñ | m, n, ng, y
l | l, r
ll | l, r, y
r | l, r
w | w, u
y | y, i
a | a, e, o
i | i, e, y
u | u, o, w
We will next need to know the frequency with which each phoneme occurs in each language. This can be calculated using a simple program operating on sample texts. For Quechua we find:
  | initial | medial | final |
a | 5.291005 | 25.906736 | 40.211640 |
b | 2.645503 | 0 | 0 |
d | 0 | 0.310881 | 0 |
g | 0.529101 | 0.103627 | 0 |
h | 5.820106 | 0 | 0 |
i | 2.645503 | 8.808290 | 5.291005 |
k | 14.814815 | 5.595855 | 3.174603 |
l | 0.529101 | 0.414508 | 0 |
m | 7.407407 | 4.145078 | 3.703704 |
n | 1.587302 | 6.528497 | 25.396825 |
p | 7.936508 | 6.010363 | 0 |
q | 4.232804 | 3.108808 | 8.465608 |
r | 4.232804 | 5.077720 | 0 |
s | 6.349206 | 4.145078 | 2.645503 |
t | 7.407407 | 6.424870 | 0 |
u | 3.703704 | 11.398964 | 2.645503 |
w | 11.111111 | 1.450777 | 0.529101 |
y | 3.174603 | 4.145078 | 7.936508 |
ch | 6.878307 | 3.108808 | 0 |
ñ | 1.058201 | 1.243523 | 0 |
rr | 0.529101 | 0 | 0 |
ll | 2.116402 | 1.865285 | 0 |
And for Chinese:

  | initial | medial | final |
a | 1.400000 | 21.494371 | 7.739308 |
b | 7.000000 | 1.432958 | 0 |
c | 0.600000 | 0.102354 | 0 |
d | 12.800000 | 1.228250 | 0 |
e | 0.200000 | 8.904811 | 15.885947 |
f | 2.000000 | 0.614125 | 0 |
g | 3.200000 | 1.842375 | 0 |
h | 3.400000 | 2.149437 | 0 |
i | 0 | 17.195496 | 29.327902 |
j | 4.600000 | 1.944729 | 0 |
k | 2.200000 | 0.204708 | 0 |
l | 6.000000 | 2.149437 | 0 |
m | 2.600000 | 1.330604 | 0 |
n | 3.800000 | 6.038895 | 11.608961 |
o | 0.400000 | 7.881269 | 9.368635 |
p | 1.000000 | 0.102354 | 0 |
q | 2.000000 | 1.842375 | 0 |
r | 0.800000 | 0.307062 | 1.629328 |
s | 0.800000 | 1.023541 | 0 |
t | 3.800000 | 1.228250 | 0 |
u | 0 | 8.495394 | 12.016293 |
w | 7.800000 | 0.716479 | 0 |
x | 4.200000 | 0.614125 | 0 |
y | 9.600000 | 0.511771 | 0 |
z | 4.200000 | 1.023541 | 0 |
ch | 2.200000 | 0.716479 | 0 |
ng | 0 | 5.834186 | 12.016293 |
sh | 7.800000 | 1.330604 | 0 |
zh | 5.600000 | 1.740020 | 0 |
(The reader who knows Chinese may wonder how we can have medial consonants at all. The answer is that I am using Chinese lexemes, not single characters (zì), so that, for instance, Zhongguó 'China' is one word, not two.)
Now we're in a position to calculate the probability for a match. Let's start by assuming that there must be a match (within the phonetic categories established above) in all three positions: initial, medial, and final.
To calculate the probability pi for a match in the initial, we go down the list of Quechua initials, multiplying its probability times the probability of finding the matching sound(s) in that same position in Chinese. For instance, the probability of a match on initial p is the probability of initial p in Quechua (.0794) times the probability of a match on initial p or b (.07 + .01 = .08), or .00635.
I show the entire calculations below, because some of them are quite eloquent, and show the value of taking a frequency approach. If you're looking for a match for a Quechua word in s-, for instance, you have a 23% chance of matching any of the sounds we've judged as similar in Chinese. You're likely to match medial -a- 38% of the time; final -a 33% of the time, final -n 24% of the time.
(The boldface letter is the Quechua sound; it's followed by the Chinese sounds we said would be a match. The first number is the probability of the Quechua phoneme; the second is the sum of the probabilities of the matching Chinese sounds; the third is the multiplication of the first two.)
Initials
a aeo | .05291 * .020 = | .00106 |
h h | .05820 * .034 = | .00198 |
i iey | .02646 * .098 = | .00259 |
k kg | .14815 * .054 = | .00800 |
l lr | .00529 * .068 = | .00036 |
m mn | .07407 * .064 = | .00474 |
n mn ng | .01587 * .160 = | .00254 |
p pb | .07937 * .080 = | .00635 |
q hk | .04228 * .056 = | .00237 |
r lr | .04228 * .068 = | .00288 |
s s sh c z x zh | .06349 * .232 = | .01473 |
t td | .07407 * .166 = | .01230 |
u uow | .03704 * .082 = | .00304 |
w wu | .11111 * .078 = | .00867 |
y yi | .03174 * .096 = | .00305 |
ch ch zh jqcz | .06883 * .192 = | .01322 |
ñ mn ng y | .01058 * .160 = | .00169 |
ll lry | .02121 * .164 = | .00348 |
Probability for an initial match = .09305 = 9.3 %
Medials
a aeo | .25907 * .3828 = | .09917 |
i iey | .08808 * .2661 = | .02344 |
k kg | .05596 *.0205 = | .00114 |
l lr | .00415 * .0246 = | .00010 |
m mn | .04145 * .0737 = | .00305 |
n mn ng | .06528 * .1320 = | .00862 |
p pb | .06010 * .0153 = | .00092 |
q hk | .03109 * .0235 = | .00073 |
r lr | .05078 * .0246 = | .00125 |
s s sh c z x zh | .04145 * .0582 = | .00241 |
t td | .06425 * .0246 = | .00158 |
u uow | .11399 * .1710 = | .01949 |
w wu | .01451 * .0921 = | .00134 |
y yi | .04145 * .1771 = | .00734 |
ch ch zh jqcz | .03109 * .0736 = | .00229 |
ñ mn ng y | .01244 * .1371 = | .00170 |
ll lry | .01865 * .0297 = | .00055 |
Probability for a medial match = .17514 = 17.5 %
Finals
a aeo | .40212 * .3299 = | .13266 |
i iey | .05291 * .4522 = | .02393 |
k kg | .03175 * 0 = | 0 |
m mn | .03704 *.116 = | .00430 |
n mn ng | .25397 *.236 = | .05994 |
q hk | .08466 * 0 = | 0 |
s s sh c z x | .02646 * 0 = | 0 |
u uow | .02646 * .2139 = | .00566 |
w wu | .00529 * .1202 = | .00064 |
y yi | .07937 * .2933 = | .02328 |
Probability for a final match = .25039 = 25.0 %
So, the probability of finding a random match on a single word (with no semantic leeway) is .0931 * .1751 * .2504 = 0.0041, or 1 in 244.
Two lessons may be drawn. First, phoneme frequency matters. Both Quechua and Chinese have very many medial a sounds, and final nasals, and initial affricates. That makes random matches involving those sounds much more likely.
Second, seemingly minor points of procedure have a huge impact on our results. We are used to situations where rough calculations do not lead us far astray. But in this area differing assumptions or methodologies lead to very different results. Very careful attention to both is warranted.
Obviously the initial-medial-final calculation is still a simplification. Quechua, for instance, can have both initial and final consonant clusters; both languages have some two-phoneme roots; and of course a vague "medial" category is not a good way of handling multisyllabic words.
We might decide to allow a Quechua medial to match either a Chinese medial or final, to catch resemblances like runa/rén and chinchi/chong. To do this we need to compute the chance that a Quechua medial matches a Chinese final, as follows. (We can skip Quechua medials for which none of the corresponding Chinese sounds can end a word.)
Medial-to-final
a aeo | .25907 * .3300 = | .08549 |
i iey | .08808 * .4521 = | .03982 |
l lr | .00415 * .0163 = | .00007 |
m mn | .04145 *.1161 = | .00481 |
n mn ng | .06528 * .2363 = | .01543 |
r lr | .05078 * .0163 = | .00083 |
u uow | .11399 * .2138 = | .02437 |
w wu | .01451 * .1202 = | .00174 |
y yi | .04145 * .2933 = | .01216 |
ñ mn ng y | .01244 * .2363 = | .00294 |
ll lry | .01865 * .0163 = | .00030 |
This can be added to the previous medial-to-medial estimate, on the grounds that when a medial doesn't match another medial, we're giving it another chance to match a final. However, the additional chance should be discounted by the probability (30% in my sample Chinese text) that the initial and final are the same (that is, that the word is just two phonemes long). So the medial-to-medial-or-final probability is .1751 + (.1880 * .70) = .3067.
The probability of finding a random match on a single word (no semantic leeway) can now be given as .0931 * .3067 * .2504 = 0.0071.
This estimate could be revised still further to take account of such things as metathesis (switched consonants), or Quechua's initial consonant clusters. Note that both examples allow additional matches, and thus will increase p even more.
Since this probability is obviously going to be much higher, I don't recommend trying to combine both types of match into a single p, which would understate the difficulty of finding 3-phoneme matches and overstate that of 2-phoneme matches.
We can estimate the probability of a 2-phoneme match by using the probability of a match on initials times that of a Quechua medial matching a Chinese medial or final: .0931 * .3067 = .0285 or about 1 in 35.
This could be refined by adding the probability that a Quechua final matches a Chinese medial or final, this time discounted by the probability that the Quechua medial is also the final.
If you want to avoid phonetic calculations entirely, there's an alternative approach: We pick a word a in A, then pick the word b in B which most closely resembles it phonetically. To handle phonetic looseness, we pick the n words in B which most closely resemble it phonetically.
The advantage is that we don't have to mess with phonetic details or how to match the phonologies of different languages. We can proceed quickly to an estimate of how many matches we can expect to find in general between two languages.
The disadvantage is that this approach doesn't lend itself to evaluating other people's claims. You can picture (say) Greenberg & Ruhlen examining the n words in Tfaltik that most closely resemble maliq'a. But what is their n? To give a reasonable estimate we have to dive back into phonetic details and probabilities.
About all that can be done is to emphasize that most actual attempts to find "cognates" accept very considerable semantic leeway, and that this greatly increases the chance of random matches.
It's helpful to see just what a semantic leeway of 10 or 100 meanings looks like. For instance, let's take the word "eat".
If we allow 10 matches, we'd accept something like:
eat, dine, bite, chew, swallow, feed, hungry, edible, meal, meat
If we allow 25 matches, we'd accept:
eat, dine, bite, nibble, chew, munch
consume, devour, taste, drink, gulp, swallow,
meal, dinner, supper, snack, meat, food
hungry, thirsty, fast, edible, tasty, mouth,
corrode
If we allow 100 matches, we'd accept:
eat, dine, sup, stuff, gorge, nibble, ruminate, take in, tuck in, gobble, bolt, swill
chew, munch, slurp, bite, gnaw, masticate, -vorous
consume, devour, ingest, nourish, regale, taste, partake
drink, guzzle, suck, gulp, quaff, swallow, imbibe
morsel, mouthful, serving, helping, entree
food, victuals, fare, fodder, provender, diet
meal, repast, dinner, lunch, luncheon, supper, breakfast, snack, feast
meat, soup, fowl, vegetables, bread, trough, plate, hearth,
eat (of animals), feed, provide, forage, grub, chow
cuisine, board, table, mess, restaurant, cook
hungry, thirsty, peckish, famine, glutton, fast, full
edible, comestible, potable, tasty, delicious
mouth, cheek, teeth, tongue, stomach
deplete, waste, destroy, ravage, corrode, etch, erode, subsist (on), live, prey
Ten matches may seem like a lot; but nothing in even the list of 100 matches is a real stretch. The real question is: looking for matches for 'to eat' in A, would our language comparer take a word meaning 'nourish' or 'delicious' or 'gnaw' as a match--indicating that his semantic leeway is more like 100 than like 10? The temptation is almost irresistible.
It's worth pointing out that semantic leeway is really multidimensional. The 100-word list shows several directions we can explore starting from the word 'eat': types of eating; other types of ingestion; types of food; types of feeding or food preparation; qualities or states associated with eating; associated parts of the body; differences in connotation or register; differences in the type of eater; metaphorical extensions. You may accept only a little variation along each dimension, but because there are many dimensions the total variation is high.
What happens to the number of expected matches as semantic leeway increases?
We are traipsing through language A, word by word. For each word a, the probability p of finding a match in B is, say, .01.
But now suppose the semantic leeway m is 10. So for word a, we have to check for a match 10 times.
What does this mean numerically? It depends on how we count matches.
The easiest way to compute this is backwards: find the probability that we won't find a match, which is simply (1 - p)^m. To see this, think of checking the 3rd word. To not find a match on words 1, 2, 3, we must find no match on word b1 (probability .99), no match on word b2 (prob .99), and no match on word b3 (prob .99). Since these events are independent, the cumulative probability is .99 * .99 * .99--that is, (1 - p)^m.
So the probability of finding a match is 1 - ((1 - p)^m). In our example, this is 1 - .99^10 = .095618.
I find the concept of multiple matches per word rather bizarre, and it doesn't give us a single probability we can plug into our general formula, so I won't pursue it here.
Does the size of the vocabulary matter? If the number of words is about the same as the number of meanings, the answer is 'not much'. For instance, take our original search for exact matches in two phonologically identical languages. The probability of finding one or more random matches hardly varies at all by vocabulary size:
Lexicon size | Probability |
100 | .6340 |
1000 | .6323 |
10000 | .6321 |
The reason is not hard to find; as the lexicon size increases, the chance of a given match goes down, but we get more chances.
As the vocabulary grows into the thousands, the assumption that there are an equivalent number of "meanings" becomes increasingly dubious. Or to be more precise, fine distinctions may exist between all the words in the language; but the finer distinctions are liable to be ignored by the language comparer. A language of 10,000 words may well distinguish 'eat (of humans)' vs. 'eat (of animals)' vs. 'devour' vs. 'bite' and so on. Yet the language comparer, looking for a match for 'eat (of humans)', is not going to skip over 'eat (of animals)'.
Phonetic leeway is not independent of lexicon size either. If we are considering vocabularies of 10,000 words rather than 1000, phonological complexity is likely to be greater, and we will want to perform a more rigorous analysis, along the lines suggested above.
To evaluate a specific claim, such as G&R's maliq'a comparison, we will need to know how much phonetic and semantic leeway the comparers allow.

To estimate the amount of phonetic leeway they allow, we can simply count the correspondences they allow. For instance, the vowels seem to be completely ignored. The middle consonant can be one of l, ly, lh, n, r, or zero-- (at least) 6 possibilities. The end consonant can be one of g, j, d, k, q, q', kh, k', X, zero-- (at least) 10 possibilities.
The initial consonant must, it seems, be m. There is a reason for that, I think. Our brains, as psycholinguists have found, respond very strongly to initials. Find words in A and B that begin with the same letter, and the battle is half-won-- the brain is predisposed to find them very similar. Indeed, posters to sci.lang sometimes offer "resemblances" that correspond in nothing but the first letter. (The fact that they find the coincidence remarkable is more evidence for human beings' lousy intuition about probabilities.)
(It's also worth pointing out that a very high 7% of all words in my Quechua sample text begin with m. This initial isn't particularly common in Chinese, but one has to wonder if this choice of initial doesn't increase the possibility of false cognates, simply by being more common.)
Let's consider (arbitrarily) that G&R's languages have about 25 phonemes each. Then the minimum probability of a single, exact-meaning random match is 1/25 * 6/25 * 10/25 or .004. It must be emphasized that this is a minimum; if we saw more of G&R's potential cognates we might find that they allow more leeway than this. And of course it's only a loose estimate, because it's an abstract measure intended to cover any two arbitrary languages.
As for semantic matches, we can also form a lower bound on the semantic leeway they allow by counting the separate meanings in their glosses. I count at least 12; but I would consider it highly misleading not to at least double this number, and based on our lists of words related to 'eat' I'd consider 100 to be quite reasonable. If you can accept 'breast', 'cheek', 'swallow', 'drink', 'milk', and 'nape of the neck' as cognates, it's hard to seriously claim that you wouldn't accept 'eat', 'mouth', 'guzzle', 'vomit', 'suckle', and 'stomach'.
Let's say the semantic leeway is 25 words. Then the probability that we'll find at least one match for a word is 1 - ((1 - p)^m) = 1 - 0.996^25 = .0953.
How many random matches can G&R expect to find?
Let's assume the lexicon is 2000 words-- hopefully a good estimate of the size of the lexicons available for many of the obscure languages they work with. Our formula produces (omitting ranges with minimal probabilities):
p( 126 to 150 ) = .0080
p( 151 to 175 ) = .1222
p( 176 to 200 ) = .6509
p( 201 to 225 ) = .2214
p( 226 to 250 ) = .0047
In other words we should expect close to 200 random matches, or a tenth of the lexicon.
Of course, that's between two languages. What's the likelihood of finding matches when many languages are compared at once?
If the languages are related, of course, we would expect many non-random resemblances (though with G&R-style phonetic and semantic laxity we will find many random matches as well). We should not be cowed by the size of G&R's cognate list; there are multiple entries per family. They are not really comparing hundreds of languages, but a much smaller number of language families.
There's no general number of random matches to expect this way, since language families vary in number of languages, and in how similar their lexicons are. But even closely related languages, like Spanish and French, have significant numbers of non-cognate words (chien vs. perro, chapeau vs. sombrero, manger vs. comer, regarder vs. mirar, etc.). So if you don't find enough resemblances in language A1, you can try A2, A3, etc. Or even search dialects for cognates, as it seems they've done with Quechua and Aymara. The list of random resemblances between families will be larger than the list of random resemblances between languages.
Another key question is how many families they're comparing. If G&R are right about certain high-level categories, such as Amerind and Eurasiatic, this turns out to be "three or four". If families A and B have 500 random resemblances, they'll have about 125 with family C and 31 with family D-- a quite respectable listing for "proto-World".
Quechua | meaning | Semitic | meaning |
llama | llama | gamal | camel (Heb.) | |
qollana | leader | kohen | priest (Heb.) | |
t'eqe | rag doll | degem | model, specimen (Heb.) | |
qhapa | rich, powerful | gabar | become strong (Heb.) | |
qhoyu | group | goi | nation, people (Heb.) | |
qhoruy | cut off | garaz | cut (Heb.) | |
qhasay | burp | gasah | burp (Heb.) | |
q'enti | shorten, shrink | gamad | shrink (Heb.) | |
amaru | snake, sage | amaru | know, teach (Assyr.) | |
anta | copper | homnt | copper (Coptic) | |
atoq | fox | bachor | fox (Coptic) | |
aysana | basket | tsina | basket (Aramaic) | |
ch'olqe | wrinkle | chorchi | wrinkles (Coptic) | |
charki | jerky | charke | drought (Heb.) | |
cholo | Andean person | chlol | folk (Coptic) | |
wanaqo | guanaco | anaqate | she-camel (Assyr.) | |
churi | father's son | chere | son (Coptic) | |
illa | light, jewel | ille | brightness, light (Coptic) | |
k'ayrapin | pancreas | kaire | gullet, belly (Coptic) | |
kinuwa | quinoa | knaao | sheaf (Coptic) | |
k'ayra | frog | krur | frog (Coptic) | |
kutuna | blouse | kutunet | garment, tunic | |
onqoy | sickness, illness | thomko | ill use, afflict | |
punku | door | brg | be open (Coptic) | |
tarpuy | planting | sirpad | plant (Heb.) | |
hamuna | entrance | amumuna | city gate (Assyr.) | |
hillu | sweet tooth | akkilu | glutton (Assyr.) | |
huku | owl | akku | owl (Assyr.) | |
qoleq | silver | purku | gold (Assyr.) | |
p'uru | bladder, gourd | buru | vessel (Assyr.) | |
ch'enqo | small thing | enegu | suck (Assyr.) | |
watuq | diviner | baru | seer (Assyr.) | |
waliq | abundant | baru | become full (Assyr.) | |
ch'aphra | brush, twigs | abru | brush pile (Assyr.) | |
raphra | wing | abru | wing, fin (Assyr.) | |
apu | god, mountain lord | abu | father (Assyr.) | |
hatarichiy | build, incite, disturb | adaru | worried, disturbed (Assyr.) | |
hayk'api | how much? | akka'iki | how much? (Assyr.) | |
taruka | type of deer | barih'a | antelope (Assyr.) | |
umiqa | jewel | banu | headgear, diadem (Assyr.) | |
wawa | baby, child | babu | child (Assyr.) | |
p'uytu | well, puddle | buninnu | pond (Assyr.) | |
walla | domineering person, soldier | baxalu | ripe, youthful, manly (Assyr.) | |
wayra | wind, air | billu | low wind (Assyr.) | |
wanqara | drum | balangu | kettle-drum (Assyr.) | |
phiri | thick; dish made of flour and water | buranu | meal? (Assyr.) | |
perqa | wall | birtu | fetter; fortress (Assyr.) | |
phasi | steamed | bashalu | boil, roast (Assyr.) | |
maqt'a | young man | batulu | youth (Assyr.) | |
patana | stone seat | apadana | throne room (Assyr.) | |
qhapaq | rich, powerful | gabshu | massive, powerful (Ethiopic) | |
qocha | lake, pond | gubshu | mass of water (Assyr.) | |
llantin | type of herb | gam-gam | a herb (Assyr.) |
There are a number of obvious problems with this list.
Such criticisms cannot tell us, however, whether so many resemblances could all be due to chance; the statistical model developed here can.
First let's estimate the degree of phonetic laxness the compiler is allowing. I'll use my Quechua frequency table, but I don't have similar data for the Semitic languages.
What's the level of semantic laxness? This is hard to say. Some matches are quite close (burp, copper, basket, son, frog, owl); others are fairly remote (leader/priest; snake/teach, jerky/drought, pancreas/belly; sickness/afflict; door/open; sweet tooth/glutton; small thing/suck; god/father; build/disturbed; domineering/youthful; wall/fortress; stone seat/throne room; two herb names). I think it would be quite conservative to assume 1 Quechua word can match 20 Semitic words in the compiler's mind.
Given this semantic leeway, the probability of a match on a single Quechua word is 1 - ((1 - p)^m) = 1 - 0.911^20 = .845. That's quite telling right there-- it means that, given his phonetic and semantic laxness, the comparer is ordinarily going to find a random match for almost every Quechua word.
My own Quechua dictionary has about 2000 non-Spanish roots. Our comparer will very likely find more than 1500 chance resemblances.
With a vowel match-- which, I emphasize, the comparer bothers with only half the time-- the match probability becomes .302, for roughly 600 chance resemblances.
It's just not that hard to match just two consonants. How about the three-consonant matches? Given the initial and medial consonant match probabilities calculated above, the probability of a single 3-consonant exact-semantic match is .17 * .524 * .524 = 0.047. With the same semantic leeway of 20 words, the probability of a match per word becomes 1 - 0.953^20 = .618.
Our formula gives:
p( 1101 to 1200 ) = .051
p( 1201 to 1300 ) = .947
p( 1301 to 1400 ) = .014
Is it a mere coincidence that there are so many correspondences between these languages? It is chance, but no surprise: what would be really surprising would be if there weren't a thousand more of the same quality.
With a vowel match, the match probability becomes .171 and we'd expect over 300 resemblances. However, there are only eight words in the list that can be described as matching three consonants and a vowel. Three of those are ruined by including -na as part of the root, and the remaining five include such uninspiring matches as q/t and ll/g.
The actual variation is almost linear. That is, allow word a to match n words phonetically and m words semantically, and you very nearly increase the expected number of matches by n * m.
Thus, the results depend almost entirely on the value of n and m, and small variations in either n or m become very large variations in the number of matches.
When calculating an expected number of matches, then, it is essential to estimate phonetic and semantic laxness from the claim being evaluated. There is no such thing as a general "number of random matches expected". It depends almost entirely on what you count as a match.
Equally, it's necessary to carefully evaluate any probability calculations offered by the comparer. Comparers typically present calculations for extremely narrow matches, and then give very loose matches in their word lists. And they often ignore semantic looseness entirely. Such errors are not trivial in this game; they can produce numbers that are off by several orders of magnitude.
Why are we so easy to fool?
Why do people fool themselves so easily in this area? Why is it so hard for even highly intelligent people to convince themselves that random matches will be few? I think there are several reasons.
(I have looked for random matches, and found plenty of them; see my lists of Chinese/Quechua and Chinese/English pseudo-cognates.)
The important parameters are the lexicon size n, the probability of a phonetic match p, and the semantic leeway semN (how many meanings count as a match).
Below is the program, followed by the output it generates when the parameters are set as shown. Note that stopat and bin are reporting parameters: stopat = 70 tells the program to calculate probabilities for r = 1 to 70; bin = 10 tells it to report those probabilities in groups of 10. With a little trial and error you can adjust these values for the most useful reporting of results.

#include <stdio.h>
#include <math.h>
int main( void )
{
double n = 2000; // lexicon size
double p = 0.002; // match probability
int semN = 10; // semantic leeway
int stopat = 70; // stop calculating at this r
int bin = 10; // report probs within ranges of this size
int r;
double unp = 1 - p;
double rfac = 1.0;
double nfac_nrfac = 1.0;
double cump = 0.0;
double pr;
double ptor = 1.0;
p = 1 - pow(1 - p, semN);
printf( "Probability %7.5lf\n\n", p );
unp = 1 - p;
for (r = 1; r <= stopat; r++)
{
rfac *= r;
nfac_nrfac *= n - r + 1;
ptor *= p;
pr = nfac_nrfac / rfac * ptor * pow( unp, n - r );
cump += pr;
if (bin > 1)
{
if (r % bin == 0)
{
printf( "p( %3i to %3i ) = %le\n",
r - bin + 1, r, cump );
cump = 0.0;
}
}
else
printf( "p(%i) = %le\n", r, pr );
}
if (bin == 1)
printf( "\nCumulative p is %le\n", cump );
}
Probability 0.01982
p( 1 to 10 ) = 1.695872e-08
p( 11 to 20 ) = 4.004021e-04
p( 21 to 30 ) = 6.627584e-02
p( 31 to 40 ) = 4.979988e-01
p( 41 to 50 ) = 3.904203e-01
p( 51 to 60 ) = 4.403604e-02
p( 61 to 70 ) = 8.649523e-04
The probability reported at the top is the phonetic probability, adjusted to reflect the given semantic leeway.
When the probability (after the semantic adjustment) is very low, .002 or less, you should set the bin size to 1 for best results.