How likely are chance resemblances between languages?

The first rule is, you must not fool yourself. And you are the easiest person to fool.
--Richard Feynman

On sci.lang we are often presented with lists of resemblances between far-flung languages (e.g. Basque and Ainu, Welsh and Mandan, Hebrew and Quechua, Hebrew and every other language, Basque and every other language), along with the claim that such resemblances "couldn't be due to chance", or are "too many" to be due to chance.

Linguists dismiss these lists, for several reasons. Often a good deal of work has gone into them, but little linguistic knowledge. Borrowings and native compounding are not taken into effect; the semantic equivalences proffered are quirky; and there is no attempt to find systematic sound correspondences. And linguists know that chance correspondences do happen.

All this is patiently explained, but it doesn't always convince those with no linguistic training-- especially the last point. Human beings have been designed by evolution to be good pattern matchers, and to trust the patterns they find; as a corollary their intuition about probability is abysmal. Lotteries and Las Vegas wouldn't function if it weren't so.

So, even one resemblance (one of my favorites was gaijin vs. goyim) may be taken as portentous. More reasonably, we may feel that one resemblance may be due to chance; but some compilers have amassed dozens of resemblances. Such lists may be criticized on other grounds, but even linguists may not know if the chance argument applies. Could a few dozen resemblances be due to chance? If not, what is the approximate cutoff?

The same question comes up in evaluating the results of Greenbergian mass comparisons; or proposals relating language families (e.g. Japanese and Tibeto-Burman) based on very small numbers of cognates. Again, it would be useful to know how many chance resemblances to expect.

I will propose a simple but linguistically informed statistical model for estimating the probability of such resemblances, and show how to adjust it to match the specific proposal being evaluated.

A trivial model: Abstract languages sharing a phonology

Let's start with a simplified case (we'll complicate it later). We will compare two unrelated languages A and B, each of which has 1,000 lexemes of the form CVC, and an identical semantics and phonology. That is, if there is a lexeme a in A with some meaning M, there will be a lexeme bp in B phonetically identical to a, and a lexeme bs with the same meaning as a.

What is the probability that bp is bs?-- that is, that there is a chance resemblance with a? It can be read off from the phonology of the typical root. Supposing there are 14 consonants and 5 vowels, it is 1/14 * 1/5 * 1/14, or 1 in 980. (This assumes that the vowels and consonants are equiprobable, which of course they are not.) For ease of calculation we'll round this to 1 in 1000.

How many chance resemblances are there? As a first approximation we might note that with a thousand chances at 1 in 1000 odds, there's a good chance of getting one match.

However, this is not enough. How likely is it exactly that we get one match? What's the chance of two matches? Would three be quite surprising? Fortunately probability theory has solved this problem for us; the chance that you'll find r matches in n words where the probability of a single match is p is

(n! / (r! (n-r!))) p^r (1 - p)^(n-r)

or in this case

(1000! / (r! (1000-r!))) .001^r .999^(1000-r)

For the first few r:

p(1) = .368
p(2) = .184
p(3) = .0613
p(4) = .0153
p(5) = .00305
p(6) = .000506

So the probability of between 1 and 6 matches is .368 + .184 + .0613 + .0153 + .00305 + .000506 = .632, or about 63%. It would be improbable, in other words, if we found no exact matches in the entire dictionary. (But not very improbable; p(0), which we can find by subtracting the above p's from 1.0, is 37%.)

Proffered resemblances are rarely exact. There is always some phonetic and semantic leeway. Either can be seen as increasing the set of words in B we would consider as a match to a given word a in A.

For instance, suppose for each consonant we would accept a match with 3 related consonants, and for each vowel, 3 related vowels. Since we're assuming a CVC root structure, this gives 3*3*3 = 27 words in B which might match any given a.

And suppose for each word a we will accept 10 possible meanings for b. This must be applied to each of the 27 phonetic matches; so a can now match a pool of 27*10 = 270 lexemes. The probability that it does so is of course 270 in 1000, or .27. Every lexeme in A, in other words, has a better than 1 in 4 chance of having a random match in B!

How many chance resemblances are there now? The same formula can be used, with the revised estimate for p:

(1000! / (r! (1000-r!))) .27^r .973^(1000-r).

There is a significant probability for very high numbers of matches, so we must continue calculating for r well into the hundreds. The results can be summarized as follows:

p( 1 to 50) = 5.85 * 10^-74
p( 51 to 100) = 1.22 * 10^-⁴⁰
p( 101 to 150) = 8.62 * 10^-²⁰
p( 151 to 200) = 1.70 * 10^-⁰⁷
p( 201 to 250) = .082
p( 251 to 300) = .903
p( 301 to 350) = .016
p( 351 to 400) = 1.17 * 10^-⁰⁸
p( 401 to 450) = 2.11 * 10^-¹⁹
p( 451 to 500) = 1.30 * 10^-³⁴

In other words there's a 90% chance that we'll find between 250 and 300 matches, an 8% chance of finding less, and a 2% chance of finding more.

Our rule of thumb would have suggested 270 matches, and this is in fact the number with the highest probability (2.84%).

I will suggest refinements to this model below, but the basic features are in place: a probability for a single match; a calculation for number of expected matches; and an adjustment for phonetic and semantic leeway.

Real phonologies

We'd like to remove the unrealistic assumptions in this model, starting with the absurdly simplified phonologies. Fortunately this is not hard to do; it amounts to finding a better p-- an estimate for the chance of a random match which takes into account the actual phonologies of languages A and B.

Suppose we want to check for random matches between Quechua and Chinese.

First, we need to decide what constitutes a phonetic match between the two languages. One way of doing this is to decide for each Quechua phoneme what Chinese phonemes we'll accept as matches. (Think of it this way: is Qu. runa a match for Ch. rén? Is chinchi a match for chong? Is chay a match for zhè?)

We might decide as follows. The criterion here is obviously phonetic similarity. We could certainly improve on this by requiring a particular phonological distance; e.g. a difference of no more than two phonetic features, such as voicing or place or articulation. The important point, as we will see, is to be clear about what we count or do not count as a match; or if we are evaluating someone else's work, to use the same phonetic criteria they do.

Qu. Ch.
p p, b
t t, d
ch ch, zh, j, q, c, z
k k, g
s s, sh, c, z, x, zh
h h
q h, k
m m, n
n m, n, ng
ñ m, n, ng, y
l l, r
ll l, r, y
r l, r
w w, u
y y, i
a a, e, o
i i, e, y
u u, o, w

We will next need to know the frequency with which each phoneme occurs in each language. This can be calculated using a simple program operating on sample texts. For Quechua we find:

initial medial final
a 5.291005 25.906736 40.211640
b 2.645503 0 0
d 0 0.310881 0
g 0.529101 0.103627 0
h 5.820106 0 0
i 2.645503 8.808290 5.291005
k 14.814815 5.595855 3.174603
l 0.529101 0.414508 0
m 7.407407 4.145078 3.703704
n 1.587302 6.528497 25.396825
p 7.936508 6.010363 0
q 4.232804 3.108808 8.465608
r 4.232804 5.077720 0
s 6.349206 4.145078 2.645503
t 7.407407 6.424870 0
u 3.703704 11.398964 2.645503
w 11.111111 1.450777 0.529101
y 3.174603 4.145078 7.936508
ch 6.878307 3.108808 0
ñ 1.058201 1.243523 0
rr 0.529101 0 0
ll 2.116402 1.865285 0
And for Chinese we get:

initial medial final
a 1.400000 21.494371 7.739308
b 7.000000 1.432958 0
c 0.600000 0.102354 0
d 12.800000 1.228250 0
e 0.200000 8.904811 15.885947
f 2.000000 0.614125 0
g 3.200000 1.842375 0
h 3.400000 2.149437 0
i 0 17.195496 29.327902
j 4.600000 1.944729 0
k 2.200000 0.204708 0
l 6.000000 2.149437 0
m 2.600000 1.330604 0
n 3.800000 6.038895 11.608961
o 0.400000 7.881269 9.368635
p 1.000000 0.102354 0
q 2.000000 1.842375 0
r 0.800000 0.307062 1.629328
s 0.800000 1.023541 0
t 3.800000 1.228250 0
u 0 8.495394 12.016293
w 7.800000 0.716479 0
x 4.200000 0.614125 0
y 9.600000 0.511771 0
z 4.200000 1.023541 0
ch 2.200000 0.716479 0
ng 0 5.834186 12.016293
sh 7.800000 1.330604 0
zh 5.600000 1.740020 0

(The reader who knows Chinese may wonder how we can have medial consonants at all. The answer is that I am using Chinese lexemes, not single characters (zì), so that, for instance, Zhongguó 'China' is one word, not two.)

Now we're in a position to calculate the probability for a match. Let's start by assuming that there must be a match (within the phonetic categories established above) in both initial, medial, and final.

To calculate the probability pi for a match in the initial, we go down the list of Quechua initials, multiplying its probability times the probability of finding the matching sound(s) in that same position in Chinese. For instance, the probability of a match on initial p is the probability of initial p in Quechua (.0794) times the probability of a match on initial p or b (.07 + .01 = .08), or .00635.

I show the entire calculations below, because some of them are quite eloquent, and show the value of taking a frequency approach. If you're looking for a match for a Quechua word in s-, for instance, you have a 23% chance of matching any of the sounds we've judged as similar in Chinese. You're likely to match medial -a- 38% of the time; final -a 33% of the time, final -n 24% of the time.

(The boldface letter is the Quechua sound; it's followed by the Chinese sounds we said would be a match. The first number is the probability of the Quechua phoneme; the second is the sum of the probabilities of the matching Chinese sounds; the third is the multiplication of the first two.)

Initials

a aeo .05291 * .020 = .00106
h h .05820 * .034 = .00198
i iey .02646 * .098 = .00259
k kg .14815 * .054 = .00800
l lr .00529 * .068 = .00036
m mn .07407 * .064 = .00474
n mn ng .01587 * .160 = .00254
p pb .07937 * .080 = .00635
q hk .04228 * .056 = .00237
r lr .04228 * .068 = .00288
s s sh c z x .06349 * .232 = .01473
t td .07407 * .166 = .01230
u uow .03704 * .082 = .00304
w wu .11111 * .078 = .00867
y yi .03174 * .096 = .00305
ch ch zh jqcz .06883 * .192 = .01322
ñ mn ng y .01058 * .160 = .00169
ll lry .02121 * .164 = .00348
Probability for an initial match = .09305 = 9.3%

Medials

a aeo .25907 * .3828 = .09917
i iey .08808 * .2661 = .02344
k kg .05596 *.0205 = .00114
l lr .00415 * .0246 = .00010
m mn .04145 * .0737 = .00305
n mn ng .06528 * .1320 = .00862
p pb .06010 * .0153 = .00092
q hk .03109 * .0235 = .00073
r lr .05078 * .0246 = .00125
s s sh c z x .04145 * .0582 = .00241
t td .06425 * .0246 = .00158
u uow .11399 * .1710 = .01949
w wu .01451 * .0921 = .00134
y yi .04145 * .1771 = .00734
ch ch zh jqcz .03109 * .0736 = .00229
ñ mn ng y .01244 * .1371 = .00170
ll lry .01865 * .0297 = .00055

Probability for a medial match = .17514 = 17.5 %

Finals

a aeo .40212 * .3299 = .13266
i iey .05291 * .4522 = .02393
k kg .03175 * 0 = 0
m mn .03704 *.116 = .00430
n mn ng .25397 *.236 = .05994
q hk .08466 * 0 = 0
s s sh c z x .02646 * 0 = 0
u uow .02646 * .2139 = .00566
w wu .00529 * .1202 = .00064
y yi .07937 * .2933 = .02328

Probability for a final match = .25039 = 25.0 %

So, the probability of finding a random match on a single word (with no semantic leeway) is .0931 * .1751 * .2504 = 0.0041, or 1 in 244.

Was all that worth it?

It's worthwhile comparing this to our original seat-of-the-pants estimate (based on 14 equiprobable consonants and 5 equiprobable vowels, and allowing 3 phonetic matches per sound) of 27 in 980, or 0.027-- 6.5 times the above frequency.

Two lessons may be drawn. First, phoneme frequency matters. Both Quechua and Chinese have very many medial a sounds, and final nasals, and initial affricates. That makes random matches involving those sounds much more likely.

Second, seemingly minor points of procedure have a huge impact on our results. We are used to situations where rough calculations do not lead us far astray. But in this area differing assumptions or methodologies lead to very different results. Very careful attention to both is warranted.

Additional types of match

We can also answer the question posed above: with the phonetic criteria as given, neither runa/rén, nor chinchi/chong, nor chay/zhè are matches. Yet a comparer would probably set great store on each of them.

Obviously the initial-medial-final calculation is still a simplification. Quechua, for instance, can have both initial and final consonant clusters; both languages have some two-phoneme roots; and of course a vague "medial" category is not a good way of handling multisyllabic words.

We might decide to allow a Quechua medial to match either a Chinese medial or final, to catch resemblances like runa/rén and chinchi/chong. To do this we need to compute the chance that a Quechua medial matches a Chinese final, as follows. (We can skip Quechua medials for which none of the corresponding Chinese sounds can end a word.)

Medial-to-final

a aeo .25907 * .3300 = .08549
i iey .08808 * .4521 = .03982
l lr .00415 * .0163 = .00007
m mn .04145 *.1161 = .00481
n mn ng .06528 * .2363 = .01543
r lr .05078 * .0163 = .00083
u uow .11399 * .2138 = .02437
w wu .01451 * .1202 = .00174
y yi .04145 * .2933 = .01216
ñ mn ng y .01244 * .2363 = .00294
ll lry .01865 * .0163 = .00030
Probability for a medial-to-final match = .18796 = 18.8 %

This can be added to the previous medial-to-medial estimate, on the grounds that when a medial doesn't match another medial, we're giving it another chance to match a final. However, the additional chance should be discounted by the probability (30% in my sample Chinese text) that the initial and final are the same (that is, that the word is just two phonemes long). So the medial-to-medial-or-final probability is .1751 + (.1880 * .70) = .3067.

The probability of finding a random match on a single word (no semantic leeway) can now be given as .0931 * .3067 * .2504 = 0.0071.

This estimate could be revised still further to take account of such things as metathesis (switched consonants), or Quechua's initial consonant clusters. Note that both examples allow additional matches, and thus will increase p even more.

Matching just two phonemes

We still haven't really taken account of chay/zhè (nor of runa/rén, since we decided above that u and e don't match). The probabilities calculated so far require three phonemes to match. It might be interesting to know the probability that just two phonemes match.

Since this probability is obviously going to be much higher, I don't recommend trying to combine both types of match into a single p, which would understate the difficulty of finding 3-phoneme matches and overstate that of 2-phoneme matches.

We can estimate the probability of a 2-phoneme match by using the probability of a match on initials times that of a Quechua medial matching a Chinese medial or final: .0931 * .3066 = .0285 or about 1 in 35.

This could be refined by adding the probability that a Quechua final matches a Chinese medial or final, this time discounted by the probability that the Quechua medial is also the final.

An alternative approach

If you want to avoid phonetic calculations entirely, there's an alternative approach: We pick a word a in A, then pick the word b in B which most closely resembles it phonetically. To handle phonetic looseness, we pick the n words in B which most closely resemble it phonetically.

The advantage is that we don't have to mess with phonetic details or how to match the phonologies of different languages. We can proceed quickly to an estimate of how many matches we can expect to find in general between two languages.

The disadvantage is that this approach doesn't lend itself to evaluating other people's claims. You can picture (say) Greenberg & Ruhlen examining the n words in Tfaltik that most closely resemble maliq'a. But what is their n? To give a reasonable estimate we have to dive back into phonetic details and probabilities.

Semantic matches

Unfortunately there's no rigorous way to improve our estimate of semantic leeway-- the number of meanings that will be considered to match a given word. Classifying meanings has been a parlor pastime for several centuries, but this has never produced any generally useful way of analyzing meanings, much less determining whether two meanings are close or not.

About all that can be done is to emphasize that most actual attempts to find "cognates" accept very considerable semantic leeway, and that this greatly increases the chance of random matches.

It's helpful to see just what a semantic leeway of 10 or 100 meanings looks like. For instance, let's take the word "eat".

If we allow 10 matches, we'd accept something like:

eat, dine, bite, chew, swallow, feed, hungry, edible, meal, meat

If we allow 25 matches, we'd accept:

eat, dine, bite, nibble, chew, munch
consume, devour, taste, drink, gulp, swallow,
meal, dinner, supper, snack, meat, food
hungry, thirsty, fast, edible, tasty, mouth,
corrode

If we allow 100 matches, we'd accept:

eat, dine, sup, stuff, gorge, nibble, ruminate, take in, tuck in, gobble, bolt, swill
chew, munch, slurp, bite, gnaw, masticate, -vorous
consume, devour, ingest, nourish, regale, taste, partake
drink, guzzle, suck, gulp, quaff, swallow, imbibe
morsel, mouthful, serving, helping, entree
food, victuals, fare, fodder, provender, diet
meal, repast, dinner, lunch, luncheon, supper, breakfast, snack, feast
meat, soup, fowl, vegetables, bread, trough, plate, hearth,
eat (of animals), feed, provide, forage, grub, chow
cuisine, board, table, mess, restaurant, cook
hungry, thirsty, peckish, famine, glutton, fast, full
edible, comestible, potable, tasty, delicious
mouth, cheek, teeth, tongue, stomach
deplete, waste, destroy, ravage, corrode, etch, erode, subsist (on), live, prey

Ten matches may seem like a lot; but nothing in even the list of 100 matches is a real stretch. The real question is: looking for matches for 'to eat' in A, would our language comparer take a word meaning 'nourish' or 'delicious' or 'gnaw' as a match--indicating that his semantic leeway is more like 100 than like 10? The temptation is almost irresistible.

It's worth pointing out that semantic leeway is really multidimensional. The 100-word list shows several directions we can explore starting from the word 'eat': types of eating; other types of ingestion; types of food; types of feeding or food preparation; qualities or states associated with eating; associated parts of the body; differences in connotation or register; differences in the type of eater; metaphorical extensions. You may accept only a little variation along each dimension, but because there are many dimensions the total variation is high.

What happens to the number of expected matches as semantic leeway increases?

We are traipsing through language A, word by word. For each word a, the probability p of finding a match in B is, say, .01.

But now suppose the semantic leeway m is 10. So for word a, we have to check for a match 10 times.

What does this mean numerically? It depends on how we count matches.

We want to know if a matches any word. We check for matches for a 10 times. What are the chances that we'll find a match?
The easiest way to compute this is backwards: find the probability that we won't find a match, which is simply (1 - p)^m. To see this, think of checking the 3rd word. To not find a match on words 1, 2, 3, we must find no match on word b1 (probability .99), no match on word b2 (prob .99), and no match on word b3 (prob .99). Since these events are independent, the cumulative probability of .99 * .99 * .99--that is, (1 - p)^m.
So the probability of finding a match is 1 - ((1 - p)^m). In our example, this is 1 - .99¹⁰ = .095618.
We want to know the number of words a matches. That is, we might find more than one phonetic match in our set of m semantic matches. The average number of matches is easy, since it's p for each word, and thus mp total. We could also go on to find the probability for one match, two matches, etc.
I find the concept of multiple matches per word rather bizarre, and it doesn't give us a single probability we can plug into our general formula, so I won't pursue it here.

Vocabulary size

What happens as the vocabulary gets larger?

If the number of words is about the same as the number of meanings, the answer is 'not much'. For instance, take our original search for exact matches in two phonologically identical languages. The probability of finding one or more random matches hardly varies at all by vocabulary size:

Lexicon size Probability
100 .6340
1000 .6323
10000 .6321

The reason is not hard to find; as the lexicon size increases, the chance of a given match goes down, but we get more chances.

As the vocabulary grows into the thousands, the assumption that there are an equivalent number of "meanings" becomes increasingly dubious. Or to be more precise, fine distinctions may exist between all the words in the language; but the finer distinctions are liable to be ignored by the language comparer. A language of 10,000 words may well distinguish 'eat (of humans)' vs. 'eat (of animals)' vs. 'devour' vs. 'bite' and so on. Yet the language comparer, looking for a match for 'eat (of humans)', is not going to skip over 'eat (of animals)'.

Phonetic leeway is not independent of lexicon size either. If we are considering vocabularies of 10,000 words rather than 1000, phonological complexity is likely to be greater, and we will want to perform a more rigorous analysis, along the lines suggested above.

Analyzing a claim

To analyze a claim about language relationship based simply on resemblances (as opposed, of course, to one based on the comparative method), we can apply the principles and formulas developed above.

We will need to know:

The probability of a phonetic match (implicit in the claimant's comparisons)
The semantic leeway: how many words in B the claimant will allow to match one in A
The vocabulary size

I analyze two examples below: Greenberg & Ruhlen's "world etymology" maliq'a 'swallow', and a comparison of Quechua and Hebrew posted to sci.lang.

Greenberg & Ruhlen

Greenberg & Ruhlen are comparing multiple languages. For a truly rigorous judgment we would need phoneme and root structure frequency analyses, as well as vocabulary size estimates, for each language concerned. This information is not available; but fortunately G&R's resemblance list alone is enough to deduce most of the parameters required.

To estimate the amount of phonetic leeway they allow, we can simply count the correspondences they allow. For instance, the vowels seem to be completely ignored. The middle consonant can be one of l, ly, lh, n, r, or zero-- (at least) 6 possibiliites. The end consonant can be one of g, j, d, k, q, q', kh, k', X, zero-- (at least) 10 possibilities.

The initial consonant must, it seems, be m. There is a reason for that, I think. Our brains, as psycholinguists have found, respond very strongly to initials. Find words in A and B that begin with the same letter, and the battle is half-won-- the brain is predisposed to find them very similar. Indeed, posters to sci.lang sometimes offer "resemblances" that correspond in nothing but the first letter. (The fact that they find the coincidence remarkable is more evidence for human beings' lousy intuition about probabilities.)

(It's also worth pointing out that a very high 7% of all words in my Quechua sample text begin with m. This initial isn't particularly common in Chinese, but one has to wonder if this choice of initial doesn't increase the possibility of false cognates, simply by being more common.)

Let's consider (arbitrarily) that G&R's languages have about 25 phonemes each. Then the minimum probability of a single, exact-meaning random match is 1/25 * 6/25 * 10/25 or .004. It must be emphasized that this is a minimum; if we saw more of G&R's potential cognates we might find that they allow more leeway than this. And of course it's only a loose estimate, because it's an abstract measure intended to cover any two arbitrary languages.

As for semantic matches, we can also form a lower bound on the semantic leeway they allow by counting the separate meanings in their glosses. I count at least 12; but I would consider it highly misleading not to at least double this number, and based on our lists of words related to 'eat' I'd consider 100 to be quite reasonable. If you can accept 'breast', 'cheek', 'swallow', 'drink', 'milk', and 'nape of the neck' as cognates, it's hard to seriously claim that you wouldn't accept 'eat', 'mouth', 'guzzle', 'vomit', 'suckle', and 'stomach'.

Let's say the semantic leeway is 25 words. Then the probability that we'll find at least one match for a word is 1 - ((1 - p)^m = 1 - 0.996²⁵ = .0953.

How many random matches can R&G expect to find?

Let's assume the lexicon is 2000 words-- hopefully a good estimate of the size of the lexicons available for many of the obscure languages they work with. Our formula produces (omitting ranges with minimal probabilities):

p( 126 to 150 ) = .0080
p( 151 to 175 ) = .1222
p( 176 to 200 ) = .6509
p( 201 to 225 ) = .2214
p( 226 to 250 ) = .0047

In other words we should expect close to 200 random matches, or a tenth of the lexicon.

Of course, that's between two languages. What's the likelihood of finding matches across languages?

If the languages are related, of course, we would expect many non-random resemblances (though with G&R-style phonetic and semantic laxity we will find many random matches as well). We should not be cowed by the size of G&R's cognate list; there are multiple entries per family. They are not really comparing hundreds of languages, but a much smaller number of language families.

There's no general number of random matches to expect this way, since language families vary in number of languages, and in how similar their lexicons are. But even closely related languages, like Spanish and French, have significant numbers of non-cognate words (chien vs. perro, chapeau vs. sombrero, manger vs.comer, regarder vs. mirar, etc.). So if you don't find enough resemblances in language A1, you can try A2, A3, etc. Or even search dialects for cognates, as it seems they've done with Quechua and Aymara. The list of random resemblances between families will be larger than the list of random resemblances between languages.

Another key question is how many families they're comparing. If G&R are right about certain high-level categories, such as Amerind and Eurasiatic, this turns out to be "three or four". If families A and B have 500 random resemblances, they'll have about 125 with family C and 31 with family D-- a quite respectable listing for "proto-World".

Quechua & Semitic

Now let's look at a list of Quechua and Semitic resemblances posted to sci.lang (and a level of scholarship which will make us miss Ruhlen & Greenberg).

*Quechua*		Semitic
llama	llama	gamal	camel (Heb.)
qollana	leader	kohen	priest (Heb.)
t'eqe	rag doll	degem	model, specimen (Heb.)
qhapa	rich, powerful	gabar	become strong (Heb.)
qhoyu	group	goi	nation, people (Heb.)
qhoruy	cut off	garaz	cut (Heb.)
qhasay	burp	gasah	burp (Heb.)
q'enti	shorten, shrink	gamad	shrink (Heb.)
amaru	snake, sage	amaru	know, teach (Assyr.)
anta	copper	homnt	copper (Coptic)
atoq	fox	bachor	fox (Coptic)
aysana	basket	tsina	basket (Aramaic)
ch'olqe	wrinkle	chorchi	wrinkles (Coptic)
charki	jerky	charke	drought (Heb.)
cholo	Andean person	chlol	folk (Coptic)
wanaqo	guanaco	anaqate	she-camel (Assyr.)
churi	father's son	chere	son (Coptic)
illa	light, jewel	ille	brightness, light (Coptic)
k'ayrapin	pancreas	kaire	gullet, belly (Coptic)
kinuwa	quinoa	knaao	sheaf (Coptic)
k'ayra	frog	krur	frog (Coptic)
kutuna	blouse	kutunet	garment, tunic
onqoy	sickness, illness	thomko	ill use, afflict
punku	door	brg	be open (Coptic)
tarpuy	planting	sirpad	plant (Heb.)
hamuna	entrance	amumuna	city gate (Assyr.)
hillu	sweet tooth	akkilu	glutton (Assyr.)
huku	owl	akku	owl (Assyr.)
qoleq	silver	purku	gold (Assyr.)
p'uru	bladder, gourd	buru	vessel (Assyr.)
ch'enqo	small thing	enegu	suck (Assyr.)
watuq	diviner	baru	seer (Assyr.)
waliq	abundant	baru	become full (Assyr.)
ch'aphra	brush, twigs	abru	brush pile (Assyr.)
raphra	wing	abru	wing, fin (Assyr.)
apu	god, mountain lord	abu	father (Assyr.)
hatarichiy	build, incite, disturb	adaru	worried, disturbed (Assyr.)
hayk'api	how much?	akka'iki	how much? (Assyr.)
taruka	type of deer	barih'a	antelope (Assyr.)
umiqa	jewel	banu	headgear, diadem (Assyr.)
wawa	baby, child	babu	child (Assyr.)
p'uytu	well, puddle	buninnu	pond (Assyr.)
walla	domineering person, soldier	baxalu	ripe, youthful, manly (Assyr.)
wayra	wind, air	billu	low wind (Assyr.)
wanqara	drum	balangu	kettle-drum (Assyr.)
phiri	thick; dish made of flour and water	buranu	meal? (Assyr.)
perqa	wall	birtu	fetter; fortress (Assyr.)
phasi	steamed	bashalu	boil, roast (Assyr.)
maqt'a	young man	batulu	youth (Assyr.)
patana	stone seat	apadana	throne room (Assyr.)
qhapaq	rich, powerful	gabshu	massive, powerful (Ethiopic)
qocha	lake, pond	gubshu	mass of water (Assyr.)
llantin	type of herb	gam-gam	a herb (Assyr.)

There are a number of obvious problems with this list.

The compiler (whose name I omit out of charity) knows nothing about the comparative method; no regular correspondences are presented.
Nor does he know much about Quechua; he has for instance consistently taken the regular nominalizing suffix -na as part of the root. (Note what this does to a 'resemblance' like aysana / tsina.)
-mu- in hamuy is a movement suffix, leaving the rather unconvincing ha-/amumuna. Likewise -pi in hayk'api is a locative suffix, leaving hay'ka/akka'iki.
Many of the resemblances are based on secondary meanings of Quechua roots. For instance, 'disturb' is a very secondary meaning of the causative hatarichiy; the meaning of the basic root hatariy is 'rise up'.
It's quite naive to compare individual Semitic languages with modern Cuzqueño dialect. On the Semitic side proto-Semitic or proto-Afro-Asiatic should be used; and on the Quechua side, reconstructed proto-Quechua. We also know some words in an even earlier form; for instance qocha is related to Aymara qota-- which looks even less like the proposed cognate gubshu.

However, my only concern here is to answer the compiler's question: "Is it a mere coincidence that there are so many correspondances between these languages?"

The above criticisms cannot answer this question; but the statistical model developed here can.

First let's estimate the degree of phonetic laxness the compiler is allowing. I'll use my Quechua frequency table, but I don't have similar data for the Semitic languages.

He's fairly strict on initials. Some have two alternatives (h, q, t, w, ch'), but others match just one consonant (k, m, p, qh, r, ll, ch). Initial a matches four sounds. The pool of Semitic consonants matched is small: a, b, ch, d, g, k, p, s, 0. So the probability of an initial match is p(h, q, t, w, ch') * 2/9 + p(k, m, p, qh, r, ll, ch) * 1/9 + p(a) * (4/9) which works out to .14.
He'll allow about any medial vowels to match; but over half the time he does have a close vowel match. To estimate the probability of this we'd need to know the Semitic vowel probabilities, which we can hardly take as equiprobable-- in Quechua, for instance, 56% of medial vowels are a's.
Medial consonants match between 1 and 6 consonants: e.g. k matches k, g, h; q matches g, ch, q, k, t, 0. The pool of Semitic consonants matched is larger, with 16 members. The cumulative probability of a medial consonant match comes out to .524.
Almost all the Quechua words end in a vowel, and this can match most anything as well. Where the Quechua word ends in a consonant, the comparer ignores it, except in the single instance llantin/gam-gam.

37 out of the 54 matches involve just two matches (initial and medial); 17 have two medial consonant matches. Let's start by seeing how many two-consonant matches he can expect. The probability for a single phonetic match should be .17 * .524 = 0.089. For the resemblances with a vowel match as well, we can estimate a fifth of this or .018.

What's the level of semantic laxness? This is hard to say. Some matches are quite close (burp, copper, basket, son, frog, owl); others are fairly remote (leader/priest; snake/teach, jerky/drought, pancreas/belly; sickness/afflict; door/open; sweet tooth/glutton; small thing/suck; god/father; build/disturbed; domineering/youthful; wall/fortress; stone seat/throne room; two herb names). I think it would be quite conservative to assume 1 Quechua word can match 20 Semitic words in the compiler's mind.

Given this semantic leeway, the probability of a match on a single Quechua word is 1 - ((1 - p)^m = 1 - 0.911²⁰ = .845. That's quite telling right there-- it means that, given his phonetic and semantic laxness, the comparer is ordinarily going to find a random match for almost every Quechua word.

My own Quechua dictionary has about 2000 non-Spanish roots. Our comparer will very likely find more than 1500 chance resemblances.

With a vowel match-- which, I emphasize, the comparer bothers with only half the time-- the match probability becomes .302, for roughly 600 chance resemblances.

It's just not that hard to match just two consonants. How about the three-consonant matches? Given the initial and medial consonant match probabilities calculated above, the probability of a single 3-consonant exact-semantic match is .17 * .524 * .524 = 0.047. With the same semantic leeway of 20 words, the probability of a match per word becomes 1 - 0.953²⁰ = .618.

Our formula gives:

p( 1101 to 1200 ) = .051
p( 1201 to 1300 ) = .947
p( 1301 to 1400 ) = .014

Is it a mere coincidence that there are so many correspondances between these languages? No, it isn't; what would be really surprising would be if there weren't a thousand more of the same quality.

With a vowel match, the match probability becomes .171 and we'd expect over 300 resemblances. However, there are only eight words in the list that can be described as matching three consonants and a vowel. Three of those are ruined by including -na as part of the root, and the remaining five include such uninspiring matches as q/t and ll/g.

Characteristics of the model

What we have seen offers quantified support for what many linguists would have suspected: the number of chance resemblances soars as phonetic and semantic matches are loosened.

The actual variation is almost linear. That is, allow word a to match n words phonetically and m words semantically, and you very nearly increase the expected number of matches by n * m.

Thus, the results depend almost entirely on the value of n and m, and small variations in either n or m become very large variations in the number of matches.

Calculating an expected number of matches, then, it is essential to estimate phonetic and semantic laxness from the claim being evaluated. There is no such thing as a general "number of random chances expected". It depends almost entirely on what you count as a match.

Equally, it's necessary to carefully evaluate any probability calculations offered by the comparer. Comparers typically present calculations for extremely narrow matches, and then give very loose matches in their word lists. And they often ignore semantic looseness entirely. Such errors are not trivial in this game; they can produce numbers that are off by several orders of magnitude.

Why are we so easy to fool?

Why do people fool themselves so easily in this area? Why is it so hard for even highly intelligent people to convince themselves that random matches will be few? I think there's several reasons.

We want to be fooled. The idea that far-flung languages are related (and with them their speakers) is intriguing. Proto-World is exciting. Slogging through the comparative method, by contrast, is dull and too often gives the completely unwanted answer "I don't know."
The brain is a pattern matchin' machine. We evolved to quickly find patterns in the world. It's not hard to see that this could be a matter of survival (that red striped plant gives you a stomach ache) or give an edge in social competition (I know that the red striped plant is bad for you and Tumba doesn't; or, Roger's eyebrows flutter when he has a good hand). It's so useful to find patterns that the brain is very tolerant of false patterns.
The brain is no good at probabilities. We simply have no great intuitive feel for probabilities. Most people's eyes glaze over when you start talking about the chances of the chances of r events among n objects over t trials with a single event probability of p.

When it comes to specific calculations of probability, which so often "prove" that the chance of a random match is vanishingly small, more factors come into play.

We're skeptical only of numbers we don't like. You don't have to be trying to defraud anyone to fool yourself. But it's hard not to avoid our own bias in favor of numbers that go the way we want them to. If they don't, we're displeased, and skeptically examine our calculations, and seize on new assumptions that create better numbers. If the numbers do behave, we look over our results and our assumptions more carelessly. (Feynman gives a nice example: an incorrect estimate of a physical constant, which took decades to be corrected, each correction edging toward the correct value. Since the original estimate was assumed to be correct, scientists were more than usually cautious about new estimates that deviated from it, and easily convinced themselves that they'd made a mistake somewhere. Estimates near the accepted value were given more slack.)
We don't double-check our results against reality. Few who make these calculations have looked for random matches in languages they're sure are unrelated. In other words, they have no control case to check their results against. It's no wonder they never notice how common random matches really are.
(I have looked for random matches, and found plenty of them; see my lists of Chinese/Quechua and Chinese/English pseudo-cognates.)
We haven't worked the numbers. Even trained linguists, though they know that random matches will occur, generally can't say how many.
Statistical or linguistic ignorance. The calculations are often ruined by elementary statistical errors, or by wildly unrealistic linguistic assumptions, or by disregard of the enormous phonetic and semantic leeway comparers allow themselves.

This document can hopefully help out in this area. The calculations are here and you can use them to evaluate claims (amateur or professional), or refer others to them.

A computer program for calculating matches

Here's a very simple computer program for applying the probability calculation described above. It was written for a Mac, but should work with most C compilers.

The important parameters are the lexicon size n, the probability of a phonetic match p, and the semantic leeway semN (how many meanings count as a match).

#include 

#include 

#include 

#include 



main( void )

{

    double n   = 2000;    // lexicon size

    double p   = 0.002;   // match probability

    int semN   = 10;      // semantic leeway

    int stopat = 70;      // stop calculating at this r

    int bin    = 10;      // report probs within ranges of this size

 

    int    r;

    double unp = 1 - p;

    double rfac = 1.0;

    double nfac_nrfac = 1.0;

    double cump = 0.0;

    double pr;

    double ptor = 1.0;

    

    p = 1 - pow(1 - p, semN);

    printf( "Probability %7.5lf\n\n", p );

    unp = 1 - p;

    

    for (r = 1; r <= stopat; r++)

    {

        rfac *= r;

        nfac_nrfac *= n - r + 1;

        ptor *= p;

        pr = nfac_nrfac / rfac * ptor * pow( unp, n - r );

     

        cump += pr;

     

        if (bin > 1)

        {

             if (r % bin == 0)

             {

                 printf( "p( %3i to %3i ) = %le\n", 

                          r - bin + 1, r, cump );

                 cump = 0.0;

             }

        }

        else

            printf( "p(%i) = %le\n", r, pr );

    }

    

    if (bin == 1)

        printf( "\nCumulative p is %le\n", cump );

}

Here's the program output generated when the parameters are set as shown. Note that stopat and bin are reporting parameters. stopat = 70 told the computer to calculate propabilities for r = 1 to 70; bin = 10 told it to report these probabilities in groups of 10. With a little trial and error you can adjust these values for the most useful reporting of results.

Probability 0.01982

p(   1 to  10 ) = 1.695872e-08

p(  11 to  20 ) = 4.004021e-04

p(  21 to  30 ) = 6.627584e-02

p(  31 to  40 ) = 4.979988e-01

p(  41 to  50 ) = 3.904203e-01

p(  51 to  60 ) = 4.403604e-02

p(  61 to  70 ) = 8.649523e-04

The probability reported at the top is the phonetic probability, adjusted to reflect the given semantic leeway.

When the probability (after the semantic adjustment) is very low, .002 or less, you should set the bin size to 1 for best results.

Qu.	Ch.
p	p, b
t	t, d
ch	ch, zh, j, q, c, z
k	k, g
s	s, sh, c, z, x, zh
h	h
q	h, k
m	m, n
n	m, n, ng
ñ	m, n, ng, y
l	l, r
ll	l, r, y
r	l, r
w	w, u
y	y, i
a	a, e, o
i	i, e, y
u	u, o, w

	initial	medial	final
a	5.291005	25.906736	40.211640
b	2.645503	0	0
d	0	0.310881	0
g	0.529101	0.103627	0
h	5.820106	0	0
i	2.645503	8.808290	5.291005
k	14.814815	5.595855	3.174603
l	0.529101	0.414508	0
m	7.407407	4.145078	3.703704
n	1.587302	6.528497	25.396825
p	7.936508	6.010363	0
q	4.232804	3.108808	8.465608
r	4.232804	5.077720	0
s	6.349206	4.145078	2.645503
t	7.407407	6.424870	0
u	3.703704	11.398964	2.645503
w	11.111111	1.450777	0.529101
y	3.174603	4.145078	7.936508
ch	6.878307	3.108808	0
ñ	1.058201	1.243523	0
rr	0.529101	0	0
ll	2.116402	1.865285	0

a aeo	.05291 * .020 =	.00106
h h	.05820 * .034 =	.00198
i iey	.02646 * .098 =	.00259
k kg	.14815 * .054 =	.00800
l lr	.00529 * .068 =	.00036
m mn	.07407 * .064 =	.00474
n mn ng	.01587 * .160 =	.00254
p pb	.07937 * .080 =	.00635
q hk	.04228 * .056 =	.00237
r lr	.04228 * .068 =	.00288
s s sh c z x	.06349 * .232 =	.01473
t td	.07407 * .166 =	.01230
u uow	.03704 * .082 =	.00304
w wu	.11111 * .078 =	.00867
y yi	.03174 * .096 =	.00305
ch ch zh jqcz	.06883 * .192 =	.01322
ñ mn ng y	.01058 * .160 =	.00169
ll lry	.02121 * .164 =	.00348

a aeo	.25907 * .3828 =	.09917
i iey	.08808 * .2661 =	.02344
k kg	.05596 *.0205 =	.00114
l lr	.00415 * .0246 =	.00010
m mn	.04145 * .0737 =	.00305
n mn ng	.06528 * .1320 =	.00862
p pb	.06010 * .0153 =	.00092
q hk	.03109 * .0235 =	.00073
r lr	.05078 * .0246 =	.00125
s s sh c z x	.04145 * .0582 =	.00241
t td	.06425 * .0246 =	.00158
u uow	.11399 * .1710 =	.01949
w wu	.01451 * .0921 =	.00134
y yi	.04145 * .1771 =	.00734
ch ch zh jqcz	.03109 * .0736 =	.00229
ñ mn ng y	.01244 * .1371 =	.00170
ll lry	.01865 * .0297 =	.00055

a aeo	.40212 * .3299 =	.13266
i iey	.05291 * .4522 =	.02393
k kg	.03175 * 0 =	0
m mn	.03704 *.116 =	.00430
n mn ng	.25397 *.236 =	.05994
q hk	.08466 * 0 =	0
s s sh c z x	.02646 * 0 =	0
u uow	.02646 * .2139 =	.00566
w wu	.00529 * .1202 =	.00064
y yi	.07937 * .2933 =	.02328

a aeo	.25907 * .3300 =	.08549
i iey	.08808 * .4521 =	.03982
l lr	.00415 * .0163 =	.00007
m mn	.04145 *.1161 =	.00481
n mn ng	.06528 * .2363 =	.01543
r lr	.05078 * .0163 =	.00083
u uow	.11399 * .2138 =	.02437
w wu	.01451 * .1202 =	.00174
y yi	.04145 * .2933 =	.01216
ñ mn ng y	.01244 * .2363 =	.00294
ll lry	.01865 * .0163 =	.00030