Melberg, Hans O., Against Correlation

[Note for bibliographic reference: Melberg, Hans O. (1996), Against correlation, http://www.oocities.org/hmelberg/papers/960415.htm]

Against Correlation

by Hans O. Melberg

In most introductory texts on statistics there is a warning about confusing correlation and causation. The standard example is that the number of storks and the number of new-born babies are strongly correlated. However, as we all know, it is not the number of storks that causes the number of babies. Rather, there is a third variable which causes the two to be correlated: the weather. In this article I attempt to classify some of the causes of the confusion. I then try to construct an alternative to correlation as the basis for estimating the probability of statements.

Introduction

What is correlation?
Informally, we say that two phenomena are correlated if we observe that the two, to a large extent, appear together. For a short technical introduction to this intuitive idea, see the short review "What is correlation?" that follows the notes.

Why are we interested in correlation?
Correlation is often used as an argument to justify explanations. Hence, the study of the theory of correlation is important because it may reveal flaws which in turn reduce the plausibility of explanations.

To introduce the reader to this kind of reasoning I want to give two examples.

1.
In 1867 a doctor named J. Lister published a paper which showed that surgery was much safer when the environment was sterilised. Previously the Hungarian doctor I. Semmelweis had been ridiculed for suggesting that there was a connection between the dirty hands of hospital staff and infections caught by women after childbirth. Although there was a correlation there was no "scientific reason" to support it. Pasteur then provided the causal mechanism by demonstrating how bacteria in the air or on hands could cause the disease. After Lister's paper hygienic standards were raised and the occurrence of the disease decreased drastically.

2.
An example of the second use of correlation is G. A. Cohen's argument that historical correlations can be used to justify the Marxist theory of historical change. Basically, his argument is that if we observe a historical correlation between two factors, we are justified in believing that there is a causal relationship between the two even if we cannot elaborate on the exact nature of the link between the two variables. 1

These examples may seem perfectly acceptable. However, as I shall try to argue below, the intuitive appeal is sometimes deceptive because there are many cases in which we are falsely led to believe that a correlation also implies a causal relationship.

The conceptual problem
In the social sciences we want to explain facts and events. There is some conceptual disagreement as to what an explanation is. Jon Elster argues that to explain an event "is to give an account of why it happened as it happened."2 Hence, even when we prove historically that two variables are correlated, we do not have an explanation since we do not know why the two are correlated. Cohen, on the other hand, argues that correlations may constitute explanations, although incomplete explanations. For example, if I prove that one type of revolution is always correlated with a certain level of development of the economy (or more precisely: of the productive forces), I have an incomplete explanation of this type of revolution. For the explanation to be complete we must, as Cohen agrees and Elster insists, provide the causal mechanisms between the two variables.

I do not want to enter into this terminological issue. My focus is on the reasons why correlation may lead us to the wrong conclusions, not the definition of the term explanation.

How correlations may lead to wrong conclusions
In the first part I shall assume that there are no practical problems with the data (such as measurement errors and data scarcity).

a) Strong correlation but no causal link
Assume we have a strong correlation. The question is then how this could be if there is no causal connection between the two variables.

1. Accidental
Some correlations are purely accidental and hence constitute no proof of causation. I may take a phenomenon, such as the unemployment numbers, and then search for a data series that is highly correlated with these figures. I might find that the number of rattlesnakes in the USA is highly correlated with unemployment in Norway, but the two are obviously in no way causally related.

In the example above it was easy to discover the accidental nature of the correlation. However, in other cases we are less sure whether the correlation is accidental or not. Consider a recent case in Norway in which a nurse in a nursing home was jailed because there had been an abnormal number of deaths when she was on duty. 3 The police suspected that she was a murderer; she maintained her innocence. The problem, of course, was that the correlation could be accidental. To determine this we could try to estimate the strength of our belief that the correlation was accidental by statistical techniques. However, in this case such tests are difficult. Statistically there are bound to be some nurses who have more deaths when they are on duty than others, even when deaths are distributed randomly. We cannot every year accuse the nurse with the highest number of deaths of being a murderer, even if we are more than 95% sure that her "death average" is significantly different from that of the average nurse. The problem illustrates that accidental correlations may be important in the formation of false beliefs. (In this specific case the police decided not to prosecute precisely because of the mentioned statistical problems.)
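The statistical difficulty can be illustrated with a small simulation. The following is a minimal sketch in Python; the number of nurses, the number of shifts and the death rate are assumptions chosen for illustration and have nothing to do with the actual case. Deaths are generated purely at random, and we then look at the nurse with the most deaths.

```python
import random
from math import comb

random.seed(1)

N_NURSES = 200   # hypothetical number of nurses
SHIFTS = 250     # hypothetical shifts per nurse per year
P_DEATH = 0.02   # hypothetical chance that a death occurs on any one shift

# Deaths occur at random, independently of who is on duty.
counts = [sum(random.random() < P_DEATH for _ in range(SHIFTS))
          for _ in range(N_NURSES)]
expected = SHIFTS * P_DEATH
worst = max(counts)


def tail_probability(k):
    """Exact binomial probability that one given nurse sees k or more deaths."""
    return sum(comb(SHIFTS, j) * P_DEATH ** j * (1 - P_DEATH) ** (SHIFTS - j)
               for j in range(k, SHIFTS + 1))


# Probability for a single nurse who was named before looking at the data.
p_single = tail_probability(worst)
# Probability that at least one of all the nurses does this badly by chance.
p_any = 1 - (1 - p_single) ** N_NURSES

print("expected deaths per nurse:", expected)
print("deaths for the 'worst' nurse:", worst)
print("tail probability for one pre-selected nurse:", round(p_single, 4))
print("probability that some nurse looks this bad:", round(p_any, 4))
```

The first probability treats the nurse as if she had been singled out in advance; the second takes into account that we picked her because she had the highest count. The second is always the larger of the two, which is the reason a test at the usual 95% level is misleading here.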

2. Ignored third variable effects I (Common Cause)
In the introduction I mentioned the standard textbook example of spurious correlation: that the number of storks is correlated with the number of babies. The reason these two variables are correlated could be a third variable, the weather, which might be a common cause of both variables. When the weather is warm there are many storks and humans are more likely to have sex.4 Hence we have
 
                 Storks      Babies
                      |       |
                     Temperature
One should note that this common cause could be found at a lower level i.e. the structure could be:
 
  D   E
  |   |
  B   C
  |   |
    A
In other words D and E are correlated, but not causally linked, since there is a common factor (A) which causes B and C, which in turn cause D and E. This structure could be extended so that the common cause lies even deeper down in the system. In this way the problem of third variable effects becomes even more difficult to discover, since the connections may not be as obvious as in the stork/babies case.
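The structure A, B, C, D, E can be simulated to show how a deep common cause produces a correlation between two variables that do not affect each other. This is a minimal sketch under assumed linear relationships with normally distributed noise; the coefficients are arbitrary. The partial correlation used at the end is the standard first-order formula for the correlation of D and E with A held constant.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)
n = 5000

# A causes B and C; B causes D; C causes E. D and E never influence each other.
A = [random.gauss(0, 1) for _ in range(n)]
B = [a + random.gauss(0, 0.5) for a in A]
C = [a + random.gauss(0, 0.5) for a in A]
D = [b + random.gauss(0, 0.5) for b in B]
E = [c + random.gauss(0, 0.5) for c in C]

r_de = pearson(D, E)

# Partial correlation of D and E once the common cause A is held constant.
r_da, r_ea = pearson(D, A), pearson(E, A)
r_de_given_a = (r_de - r_da * r_ea) / ((1 - r_da ** 2) * (1 - r_ea ** 2)) ** 0.5

print("correlation of D and E:", round(r_de, 3))
print("correlation of D and E given A:", round(r_de_given_a, 3))
```

The raw correlation is substantial, while the correlation that remains after controlling for A is close to zero. The catch, of course, is that we can only control for A if we have thought of it and measured it.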

3. Ignored third variable effects II (Intervening variable)
In a famous study of suicide one of the fathers of modern sociology, Durkheim, discovered that there was a correlation between the number of suicides and climate. 5 The warmer it was, the more suicides were committed. However, we should not draw the conclusion that it was the weather which caused the number of suicides to increase. It could be that warm weather means that people interact more, and that the feeling of loneliness and failure therefore intensifies in potential suicide candidates. Hence, the causal chain is:

Temperature --> Social interaction --> Feeling of isolation --> Suicide

We may wonder whether this is really a case of spurious correlation. Even if we do not have a direct link between suicides and temperature, there is at least an indirect link, in contrast to the case of babies and storks. It is true that in some cases and for some purposes the indirect nature of these links is not important. For example, if you simply want to predict a variable you do not care whether the basis for your prediction is indirect as long as it works. However, sometimes it is very important to discover the indirect nature of the correlation. If you were to design policies aimed at reducing the number of suicides, it would be useful to know that it was caused by a feeling of social isolation and not temperature directly.

4. Pre-emption
Even if we have a strong correlation and we suspect it implies causation, it would still be wrong to conclude that the correlation implies causation in a specific instance (fallacy of composition). For example, cancer is highly correlated with an early death. However, I might be wrong to conclude that the death of a person who had cancer was caused by cancer. The reason is that there might be a pre-emptive cause, such as the man being killed in a car accident, which was the real cause of the death.6

The same applies at the social level. When historians try to explain the causes of the Russian revolution of 1917, they cannot simply infer the causes from a general theory of revolutions even if this theory is backed by strong correlation evidence. Every revolution must be examined on its own (as well as comparatively) because in a specific instance there might be different causes at work than what is generally the case.

b) Weak correlation but strong causal link
Strong correlations need not imply causation, as discussed above. But the converse is also true. We may have weak correlations but strong causal connections.

5. Third variable effects III
In the same way that a third variable may create a spurious correlation, a third variable may also disguise a strong relationship. Wonnacott and Wonnacott, in their classic introduction to econometrics, give the example of rainfall and yield in agriculture.7 First they find a weak negative correlation between rainfall and yield. This may seem strange, and the answer is that a third variable, temperature, is at work. Rainfall tends to be associated with low temperatures, and low temperatures result in lower yield. Hence there is a third variable which, if we were unaware of it, might lead us into concluding either that there is no strong relationship between rainfall and yield or that the relationship is in the wrong direction.
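The rainfall example can be reproduced with invented numbers. This is a minimal sketch: the coefficients below are assumptions and are not taken from Wonnacott and Wonnacott. Rain and warmth both raise yield, but rainy seasons tend to be cold.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(2)
n = 5000

temperature = [random.gauss(0, 1) for _ in range(n)]
# Assumption: rainy seasons tend to be cold.
rainfall = [-0.8 * t + random.gauss(0, 0.6) for t in temperature]
# Assumption: both rain and warmth are good for the crop.
crop_yield = [0.6 * r + 1.0 * t + random.gauss(0, 0.5)
              for r, t in zip(rainfall, temperature)]

r_simple = pearson(rainfall, crop_yield)

# Partial correlation of rainfall and yield with temperature held constant.
r_rt = pearson(rainfall, temperature)
r_yt = pearson(crop_yield, temperature)
r_partial = (r_simple - r_rt * r_yt) / ((1 - r_rt ** 2) * (1 - r_yt ** 2)) ** 0.5

print("simple correlation, rainfall and yield:", round(r_simple, 3))
print("same correlation with temperature held constant:", round(r_partial, 3))
```

The simple correlation comes out weakly negative even though rain helps the crop in every single season of the simulated data; once temperature is held constant the sign changes and the relationship is clearly positive.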

6. Lags
The correlation coefficient measures only the contemporaneous correlation between variables. However, some variables have causal links that come into effect only after some time. The contemporaneous correlation between inflation and the government budget deficit may be low, but the causal relationship may be strong because it may take some time before the deficit creates inflation. This, of course, need not be a large problem. We might simply try the correlation between the deficit of the previous year (or another year) and current inflation. However, a serious problem arises when the length of the lag varies (maybe as the result of increased speed of belief revision). In this case we might not find any strong correlation even if there was a strong causal link.
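Both the fixed and the varying lag can be sketched in a few lines. The series below are artificial: "deficit" is random noise, and "inflation" is built from it with an assumed coefficient of 0.9, first with a fixed lag of two periods and then with a lag drawn at random between one and six periods.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(5)
n = 2000
MAX_LAG = 6
deficit = [random.gauss(0, 1) for _ in range(n)]


def make_inflation(lag_for):
    """Inflation responds to the deficit lag_for(t) periods earlier, plus noise."""
    series = []
    for t in range(n):
        source = t - lag_for(t)
        effect = 0.9 * deficit[source] if source >= 0 else 0.0
        series.append(effect + random.gauss(0, 0.3))
    return series


fixed = make_inflation(lambda t: 2)
varying = make_inflation(lambda t: random.randint(1, MAX_LAG))


def lagged_r(lag, inflation):
    """Correlation between the deficit at t and inflation at t + lag."""
    return pearson(deficit[:n - lag], inflation[lag:])


r_fixed = {lag: lagged_r(lag, fixed) for lag in range(MAX_LAG + 1)}
r_varying = {lag: lagged_r(lag, varying) for lag in range(MAX_LAG + 1)}

for lag in range(MAX_LAG + 1):
    print(lag, round(r_fixed[lag], 3), round(r_varying[lag], 3))
```

With the fixed lag the contemporaneous correlation is negligible, but trying a few lags quickly reveals the link. With the varying lag the causal effect is exactly as strong in every period, yet no single lag shows more than a weak correlation.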

7. Non-linear relationship
The correlation coefficient measures only the strength of a linear relationship. This well-known fact leads to the almost equally well-known warning that a weak correlation coefficient may mask a strong non-linear relationship. This is true, though one should note that a strong non-linear correlation need not imply causation either. An amusing example may be the relationship between my wife's anger and the amount of water I pour over her. A small glass may provoke a rather large reaction. A somewhat bigger glass of water does not increase the amount of anger, but a bucket of water greatly increases the anger. Thus, the (linear) correlation might be weak, but the causal relationship is strong.

Another example could be the relationship between the cost of sending a letter and the weight of the letter. We know by definition that there is a strong and deterministic relationship. However, the (linear) correlation is less than one because the actual relationship is discrete.
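The postage example is easy to check numerically. The price table below is hypothetical (the weight brackets and prices are invented); the point is only that a fully deterministic step function gives a correlation below one. As a second, more extreme case, a quadratic relationship over a symmetric range gives a correlation of zero.

```python
import bisect


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical price table: upper weight limit in grams and price per bracket.
LIMITS = [20, 50, 100, 350, 1000]
PRICES = [10, 15, 20, 30, 50]


def postage(weight):
    """Price of a letter: fully determined by its weight."""
    return PRICES[bisect.bisect_left(LIMITS, weight)]


weights = list(range(1, 1001))
r_postage = pearson(weights, [postage(w) for w in weights])

# A deterministic quadratic relationship over a symmetric range.
xs = list(range(-10, 11))
r_quadratic = pearson(xs, [x * x for x in xs])

print("postage, deterministic step function:", round(r_postage, 3))
print("y = x squared on a symmetric range:", round(r_quadratic, 3))
```

In both cases one variable is completely determined by the other, so the shortfall from a correlation of one says something about the shape of the relationship and nothing about its strength.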

Certain non-linear relationships are well known (exponential, quadratic, hyperbolic), but one might wonder whether there are some kinds of causal connections we have not discovered. For example, so-called chaotic relationships were not really studied until a few years ago. An example of such a relationship is the function: x_{t+1} = k x_t (1 - x_t). This difference function was originally used (in its differential form) by Verhulst in 1845 to model population changes.8 For some values of k (and with x between 0 and 1) this relationship generates patterns that seem random on a graph. If we did not know that the pattern was generated by a function, we could have falsely concluded that there was no deterministic causal process at work. Certainly there is no correlation between successive values that would lead to the suspicion of a relationship. By focusing on linear and standard non-linear relationships we may have ignored other and potentially more important causal patterns.9
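The claim about the function x_{t+1} = k x_t (1 - x_t) can be checked directly. The sketch below assumes k = 4 and a starting value of 0.2, both picked for illustration; the text itself does not name particular values.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


K = 4.0      # assumed value of k that gives chaotic behaviour
x = 0.2      # assumed starting value between 0 and 1

series = []
for _ in range(5000):
    series.append(x)
    x = K * x * (1 - x)   # each value is fully determined by the previous one

# Correlation between each value and the one that follows it.
r_successive = pearson(series[:-1], series[1:])
print("correlation between x_t and x_{t+1}:", round(r_successive, 3))
```

Every value is an exact function of its predecessor, yet the correlation between successive values is close to zero. A plot of x_{t+1} against x_t would show a perfect parabola; the correlation coefficient simply cannot see it.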

c) Interpretative problems
Even if we assume we have a correlation which we suspect reflects some kind of causal relationship between the variables, we still have serious problems in determining the precise nature of this relationship.

8. The direction of the causation
A correlation between two variables does not say which variable causes the other. I recently read a newspaper article which illustrated this problem very clearly. The article was about a large study which indicated a strong (positive) correlation between the quality of a person's sex life and how young the person looked (above a threshold, of course). The newspaper interpreted this to the effect that good sex causes a person's looks to age more slowly. However, the causal relation could also be in the opposite direction: people with good looks may be more likely to have good sex. This demonstrates the problem of interpreting the direction of causation even if you have two variables which are both correlated and causally related.

9. Joint causation
Sometimes the flow of causation goes in both directions and it would be wrong to interpret the correlation as a proof that one variable causes the other. For example, in economics the price and the quantity sold of a good are often (negatively) correlated. We may interpret this to the effect that the price of the good determines the quantity traded, which is probably partly true. However, it might also be true that quantity determines prices. In effect we have a relationship in which the variables are simultaneously determined. Prices affect quantity and quantity affects prices. The correlation coefficient is of no help in determining the causal structure of this relationship.
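A small simulation shows how little the correlation coefficient reveals in this situation. The sketch assumes a linear demand curve and a linear supply curve with invented parameters. The causal structure is identical in the two runs; only the relative size of the random shocks to demand and to supply differs.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical market: demand q = A - B*p + u, supply q = C + D*p + v.
A, B, C, D = 10.0, 1.0, 2.0, 1.0


def price_quantity_correlation(sd_demand, sd_supply, n=5000):
    """Correlation of equilibrium price and quantity for given shock sizes."""
    prices, quantities = [], []
    for _ in range(n):
        u = random.gauss(0, sd_demand)   # shock to demand
        v = random.gauss(0, sd_supply)   # shock to supply
        p = (A - C + u - v) / (B + D)    # price at which demand equals supply
        prices.append(p)
        quantities.append(C + D * p + v)
    return pearson(prices, quantities)


random.seed(6)
r_supply_shocks = price_quantity_correlation(sd_demand=0.1, sd_supply=1.0)
r_demand_shocks = price_quantity_correlation(sd_demand=1.0, sd_supply=0.1)

print("mostly supply shocks:", round(r_supply_shocks, 3))
print("mostly demand shocks:", round(r_demand_shocks, 3))
```

When supply shocks dominate, price and quantity are strongly negatively correlated; when demand shocks dominate, the correlation is strongly positive. The same two-way causal structure is compatible with either sign.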

10. Wrong kind of causation
In one of his books Jon Elster gives an amusing illustration of "wrong causation." He writes that his son once tried to command him to laugh. And, Elster admits, of course he laughed. We may thus observe a strong correlation between the command "laugh" and laughter. However, we would be wrong to conclude that the command "laugh" causes laughter. It was precisely because Elster knew that it is impossible to produce laughter on command that he laughed at the command. One might argue that this is simply another example of an ignored third variable (and maybe it is). However, it is an almost unavoidable one, since such mental operations cannot be quantified (and hence tested for), as opposed to the ignored variable (social interaction) in the mentioned example of suicide rates (see 3: Intervening variable).

11. Self-confirming correlations
Assume you are opening a firm in a rather poor society composed of one large homogeneous ethnic group and a small minority. Assume further that this minority has a bad reputation for stealing. It might even be true, when you arrive, that a disproportionate amount of the crime is committed by this minority. You then quite rationally decide not to hire people from the ethnic minority (this is obviously a country with few laws regarding discrimination). This, in turn, means that since they constantly lose out, the minority may engage in more criminal activity (either to survive or because they have nothing to lose). This, then, only exacerbates the problem. What we have is thus a correlation between criminal activity and being a member of an ethnic group. The problem is how we interpret this correlation. Some may say that the group is "by nature" more untrustworthy and prove this by the statistical correlation. However, a little more reflection also shows that the correlation may sustain itself (once it is started). It is the correlation which forms the basis for belief formation, which in turn results in actions that lead to the correlation. We thus have a correlation which indicates some kind of causal chain, but not the one we might think at first.

12. The time aspect
Even if we find a correlation and even if this correlation implies a causal relationship, we do not know whether this is a long run or a short run effect or whether it is a steady state or a disequilibrium effect. An example may clarify these statements.

Assume you live in a society in which the parents determine who their children should marry. 10 Some people may go against this tradition, but it is observed that these marriages often fail or become unhappy. One might then conclude that the observed correlation between marrying for love and unhappy marriages is causal. However, we might inquire about the precise nature of this causal relationship. For example, the observed correlation need not apply in a different steady state such as a state in which people married out of love. There are at least two reasons for this. First, the observed correlation may be caused by adverse selection i.e. people who do not conform to the tradition may be people who have stubborn personalities and therefore their marriages are not very happy. Second, there might be a causal after-effect in that people who marry against tradition are treated differently by the rest of society which in turn makes them unhappy. Both these effects would disappear in a state in which all people married out of love.

The distinction between the short run and the long run is of similar structure. For example, a democratic system may initially result in greater diversity of expressed opinions than a dictatorship or an autocracy. However, as time passes different causal mechanisms may undermine this effect to produce a more conformist society again. Hence we cannot take the initially observed short run correlation as an indication of long run correlation.

d) Practical problems
So far I have assumed that there have been no problems with respect to the data. In the real world this is a most serious problem of correlation analysis.

13. Few observations
Often we do not have enough data to reach reliable conclusions. Certain events only occur once or only a few times. There was only one Russian revolution (though there might be other revolutions that are comparable). The industrial age did not arise many times under slightly different conditions. Even if I have data on discrimination in 20 countries, I am no closer to a reliable explanation if I have 19 explanatory variables.
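How unreliable a correlation based on a handful of observations is can be seen by correlating pairs of series that are unrelated by construction. This is a minimal sketch; the sample sizes (5 and 100) and the threshold of 0.5 are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def share_of_strong_correlations(n_obs, trials=2000, threshold=0.5):
    """Share of unrelated pairs of series whose correlation exceeds the threshold."""
    hits = 0
    for _ in range(trials):
        xs = [random.gauss(0, 1) for _ in range(n_obs)]
        ys = [random.gauss(0, 1) for _ in range(n_obs)]   # independent of xs
        if abs(pearson(xs, ys)) > threshold:
            hits += 1
    return hits / trials


random.seed(3)
share_small = share_of_strong_correlations(5)
share_large = share_of_strong_correlations(100)

print("share with |r| > 0.5, 5 observations:", share_small)
print("share with |r| > 0.5, 100 observations:", share_large)
```

With five observations a sizeable share of completely unrelated pairs show a correlation stronger than 0.5; with a hundred observations this practically never happens.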

14. Not enough variation
To avoid problems of multicollinearity among the explanatory variables we would like our data to be spread out. Unfortunately this is often not possible. This means that it is difficult to determine the relative importance of the various variables (since there is not enough variation to distinguish how important they are relative to each other).

15. Too much noise
We may believe that two variables are causally related, but be unable to prove this because there are so many other variables interfering, making it impossible to determine whether the variation is caused by the variable we have in mind or by noise from the other variables.

16. Flawed assumptions about the underlying distributions
When we use the correlation coefficient we usually assume that the underlying populations are normally distributed. However, if the distribution of one variable is highly skewed, and that of the other is highly skewed in the opposite direction, the highest possible correlation coefficient is below 1; it may, for example, be 0.6. In this case 0.6, and not 1 as usual, indicates the strongest possible relationship. If we are unaware of the underlying distribution we might therefore discard a correlation of 0.6 as not very strong, while in fact the opposite is true.
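The ceiling on the correlation coefficient can be demonstrated numerically. The sketch below assumes exponential distributions as the example of "highly skewed" variables; this choice is mine and not from the text. For given values of two variables, the largest possible correlation is obtained by pairing the smallest value of one with the smallest value of the other, and so on upwards, which is what sorting both lists does.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(4)
n = 20000

# Sorting both lists pairs them as favourably as their values allow.
right_skewed = sorted(random.expovariate(1.0) for _ in range(n))
left_skewed = sorted(-random.expovariate(1.0) for _ in range(n))
also_right_skewed = sorted(random.expovariate(1.0) for _ in range(n))

r_opposite = pearson(right_skewed, left_skewed)
r_same = pearson(right_skewed, also_right_skewed)

print("best possible r, skewed in opposite directions:", round(r_opposite, 3))
print("best possible r, skewed in the same direction:", round(r_same, 3))
```

Under these assumptions the strongest attainable relationship between the two oppositely skewed variables gives a coefficient well below one, whereas two variables skewed in the same direction can come very close to one.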

17. Measurement errors
Usually there is some unknown degree of measurement error involved in the data; if nothing else, we might press the wrong keys when we enter the numbers into the computer. The problem becomes serious when there are systematic measurement errors, such as when old data is less reliable than new data. The correlation coefficient may then not reflect the true correlation, but simply the measurement error.

18. Non-quantifiable factors
Some factors are notoriously difficult to quantify and some are inherently so. For example, cultural factors such as "commercial talent" seem to be difficult to measure in isolation. Another example could be "inferiority complex" as a cultural trait. A third example could be discrimination: How do we put a number on how much discrimination there is in a society?11 These problems are serious because they mean that we often leave such factors out of our analysis despite their undeniable significance, simply because they do not fit our frame of correlation analysis.

The sum of these problems
What are the implications of the problems mentioned so far? First of all I do not advocate the abandonment of statistical analysis even if there are problems. Many of the problems can be reduced. We can develop measures to examine non-linear correlation, tests for the existence of lags, tests for the existence of third variable effects and the data may be manipulated to remove some spurious correlation (by differencing the data). We may also consider second best strategies that account for the weaknesses described so far. Hendry's "test, test, test" strategy (also called the general-to-simple methodology) may be viewed as such a strategy as opposed to the simple-to-general methodology commonly used.12
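As an illustration of one of these remedies, the sketch below shows how differencing can remove a spurious correlation. The two series are artificial random walks with an assumed upward drift; they are generated independently, so any correlation between them is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def trending_series(n, drift=1.0):
    """Random walk with drift: each value is the previous one plus drift plus noise."""
    level, series = 0.0, []
    for _ in range(n):
        level += drift + random.gauss(0, 1)
        series.append(level)
    return series


def differences(series):
    """Change from one period to the next."""
    return [b - a for a, b in zip(series, series[1:])]


random.seed(7)
x = trending_series(500)
y = trending_series(500)   # generated independently of x

r_levels = pearson(x, y)
r_differences = pearson(differences(x), differences(y))

print("correlation of the levels:", round(r_levels, 3))
print("correlation of the differences:", round(r_differences, 3))
```

The levels are almost perfectly correlated simply because both series trend upwards. The period-to-period changes, which no longer contain the shared trend, are essentially uncorrelated.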

Yet, sometimes the limitations may be too great to be remedied. The question is then whether we have any alternative strategies for determining the probability of statements.

An alternative strategy: Reflective estimation
Assume you want to examine to which degree the laws in a country are unjust, defining injustice as differential treatment according to morally irrelevant criteria such as gender, race and status. One strategy could be to simply collect laws to see how many of them are discriminatory. Based on this one might arrive at an estimate of the amount of injustice in the legal system.

An alternative approach might be to start with the assumption that people are in general selfish. We may then deduce the consequences of this selfishness with respect to the laws, given the political system. For example, we did not need to examine the laws of Apartheid to have confidence in the belief that it discriminated against the blacks. A political system in which blacks did not have the right to vote, and the belief that people are selfish, are enough to give some degree of reliability to the belief that the blacks were discriminated against.

These two strategies may be combined in what I call reflective estimation. First I may arrive at a probability estimate based on deduction from a few basic beliefs. These basic beliefs may be based on correlation (i.e. induction) which are reliable in the sense that they do not suffer from many of the weaknesses identified above. Having arrived at an estimate deductively I may try to estimate the probability inductively i.e. by gathering statistics and finding the correlation coefficients. I now have two estimates of the amount of injustice in a legal system. One arrived at by deduction, one by induction. I would then suggest a compromise between the two. The relative weight attached to each would depend upon how many of the problems listed above we suspect the correlation might be suffering from.
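The compromise between the two estimates can be written out as a small function. This is only a sketch of one possible weighting rule. The rule itself (equal weights when no problem is suspected, with the weight on the inductive estimate falling linearly to zero as more of the 18 listed problems are suspected) and the divergence threshold are assumptions made for illustration, and the function name is hypothetical.

```python
N_PROBLEMS_LISTED = 18   # the problems numbered 1 to 18 in this article


def reflective_estimate(deductive, inductive, suspected_problems,
                        divergence_threshold=0.2):
    """Combine a deductive and an inductive probability estimate.

    Hypothetical rule: the inductive estimate gets weight 0.5 when no
    problems are suspected and weight 0 when all listed problems are.
    Returns the compromise and a flag saying whether the two estimates
    diverge enough that both lines of reasoning should be re-examined.
    """
    if not 0 <= suspected_problems <= N_PROBLEMS_LISTED:
        raise ValueError("suspected_problems must be between 0 and 18")
    w_inductive = 0.5 * (1 - suspected_problems / N_PROBLEMS_LISTED)
    estimate = w_inductive * inductive + (1 - w_inductive) * deductive
    re_examine = abs(deductive - inductive) > divergence_threshold
    return estimate, re_examine


# Invented numbers: deduction suggests 0.8, the statistics suggest 0.4.
for k in (0, 9, 18):
    print(k, reflective_estimate(0.8, 0.4, suspected_problems=k))
```

The more problems we suspect in the correlation evidence, the closer the compromise moves to the deductive estimate. The flag captures the other half of the idea: a large gap between the two estimates is itself a reason to go over both once more.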

Another concrete example may help. Assume I want to examine the causes of successful revolution. One approach would be to start with a few basic facts about people's desires and beliefs and combine this with the existing political system to estimate the likely causes deductively. Another approach would be to gather aggregated statistical data (strikes, food consumption, industrialisation, urbanisation) and examine which of these are correlated with successful revolutions. If the two estimates diverge I must make a compromise between them according to how likely I believe they are.

A third example may be needed, at least to clarify the thought in my mind. G.A. Cohen writes that the difference between him and Elster is exemplified by the following example 13 : Imagine a man has died after a dinner party. You suspect the cause was food poisoning. How do you examine the probability of this claim? Cohen suggests that it is good enough to examine the other participants at the dinner. If those who ate the same food that the dead man ate also died, we are quite sure the cause was food poisoning. However, Cohen claims, Elster would use a different approach to assess the probability of food poisoning. He would examine the man in a hospital to see whether it actually was the food that killed the man. I suggest doing both and then, if the results diverge, going over the evidence once more. (I have to admit that this is not a very original or very revolutionary proposal.)

Why should this method work? Both induction and deduction involve uncertainties. Induction, which I have associated with correlation, suffers potentially from the problems identified above. Deduction suffers from the fact that the deductive process may be unreliable (we might ignore some effects) and that the initial starting point is sometimes unreliable. For example, economists often start by assuming selfish and rational individuals. Sociologists then complain that this assumption is not always correct and hence it leads economists to the wrong conclusions. The point of the above method is that divergent estimates may suggest that there is a problem, which in turn means that we examine both processes to see whether we might have missed something. In this way we arrive at what I think is the best estimate of the probability of a statement.

NOTES

1 Cohen (1982a), p. 490 and Cohen (1982b), p. 53

2 Elster (1989), p. 6

3 "Landås saken"

4 I am somewhat sceptical of the stork/babies example because there is a time-lag between sex and having babies which may destroy the contemporaneous correlation.

5 Giddens (199?), p. 680

6 Elster (1989), p. 6

7 Wonnacott and Wonnacott (1979), p. 96

8 Baker G. L. and Gollub J. P. (1990), p. 77

9 It is true that explanations by reference to chaotic processes have not been very successful so far. Maybe they never will be, but the point that one should look for processes that create patterns that are not simply linear or standard non-linear remains, even if one such attempt was (initially) unsuccessful.

10 The example is from Elster (1993), p. 104, who in turn is inspired by Tocqueville's discussion in Democracy in America

11 For more on this see Blalock (1984), p. 49-56

12 See Gilbert (1986) for a discussion of this

13 Cohen (1982a), p. 491

What is correlation?

- A short review

(If you are new to the subject you are better off reading a textbook, such as Wonnacott and Wonnacott's "Econometrics" p. 150-)

The usual definition of correlation is "a measure of the strength and direction of a linear relationship." Depending on the nature of the data (cardinal, ordinal) there are many different measures of correlation. In the following I shall show how the most common measure - the Pearson's correlation coefficient - is derived.

How do we measure the strength and direction of the relationship between two variables? The first idea that comes to mind is to use the sum of the product of the deviations. The larger the sum is, the stronger is the positive relationship. However, the number may also become larger when we simply add new observations. To adjust for this we divide the sum by the number of observations. However, there is still one problem because the measure changes when we change the scale of measurement. For example, we may give a class of 30 students two tests, one in mathematics and another in verbal skills. We may then mark the scores of both tests on a scale from 0 to 50. Having done that we find the sum of the product of the deviations divided by the number of observations. This is a measure of the strength of the correlation. However, if we had marked the scores on a scale from 0 to 100, the measured strength of the relationship would change. To avoid this we standardise the deviations by dividing them by their average deviation (i.e. the standard deviation of the sample). After these adjustments the average of the products of the standardised deviations is a good measure of correlation.

More formally we have the following:

We have two data series: X and Y (X may be the scores on a test in mathematics, Y the scores on a verbal test). We also have n observations (for example 30 if the tests were given to a class of thirty persons).

The deviation of one observation is: (this particular X - average X) [often called little x].

The product of the deviations is: (this X - average X) * (this Y - average Y) [i.e. xy]. If the people who score well on the math test also score well on the verbal test, this is a high, positive number.

We add the product of the deviations for all the observations: Sigma (xy).

We divide the resulting number by the number of observations: Sigma (xy) / n.

However, to adjust for differences in scale, we use the standardised deviations: (x) / (the standard deviation of X), and likewise (y) / (the standard deviation of Y).

We now have:

r = 1/n sigma ( (x / s.d. X) * (y / s.d. Y) )

This is the sample correlation coefficient. It is a measure that varies between -1 and +1 depending on the strength and direction of the measured linear relationship. A strong negative relationship should come out as close to -1 on our scale, no relationship should give a measure close to zero, and a strong positive relationship should approach +1.

Note: To adjust for the loss of one degree of freedom one could use n-1 instead of n.
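The steps above translate directly into a few lines of Python. This is a minimal sketch; the function name and the test scores are invented for illustration. The `sample` switch follows the note about n - 1: when it is set, n - 1 is used both in the standard deviations and in the final average, which leaves the coefficient unchanged.

```python
from math import isclose


def correlation(xs, ys, sample=False):
    """Pearson correlation: average product of the standardised deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    divisor = n - 1 if sample else n

    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dev_x = [x - mean_x for x in xs]            # "little x"
    dev_y = [y - mean_y for y in ys]            # "little y"
    sd_x = (sum(d * d for d in dev_x) / divisor) ** 0.5
    sd_y = (sum(d * d for d in dev_y) / divisor) ** 0.5

    return sum((dx / sd_x) * (dy / sd_y)
               for dx, dy in zip(dev_x, dev_y)) / divisor


# Invented scores for five students, marked on a scale from 0 to 50.
maths = [12, 25, 31, 40, 47]
verbal = [20, 22, 35, 38, 45]

r_scale_50 = correlation(maths, verbal)
# The same maths scores re-marked on a scale from 0 to 100.
r_scale_100 = correlation([2 * m for m in maths], verbal)

print(round(r_scale_50, 4), round(r_scale_100, 4))
print(isclose(r_scale_50, r_scale_100))
```

Because the deviations are standardised, re-marking one test on a different scale leaves the coefficient unchanged, which is exactly the property the derivation was designed to achieve.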


BIBLIOGRAPHY

Baker G. L. and Gollub J. P (1990), Chaotic dynamics: An introduction, Cambridge: Cambridge University Press

Blalock, Hubert M. Jr. (1984), Basic Dilemmas in the Social Sciences, Beverly Hills: Sage Publications

Cohen, G. A. (1980), Functional explanation: Reply to Elster, Political Studies 28: 129-135

Cohen, G. A. (1982a), Functional explanation, Consequence explanation, and Marxism, Inquiry 25: 27-56

Cohen, G. A. (1982b), Reply to Elster on "Marxism, Functionalism and Game Theory" , Theory and Society 11: 483-495

Darnell, A. C. and Evans J. L. (1990), The Limits of Econometrics, Aldershot: Edward Elgar

Elster, Jon (1980), Cohen on Marx's theory of history, Political Studies 28: 121-128

Elster, Jon (1982), Marxism, functionalism and game theory, Theory and Society 11: 453-482

Elster, Jon (1983), Reply to comments (on "Marxism, functionalism and game theory"), Theory and Society 12: 111-120

Elster, Jon (1983), Explaining Technical Change, Cambridge: Cambridge University Press

Elster, Jon (1986), Further thoughts on Marxism, functionalism and game theory, in J. Roemer (1986), ed., Analytical Marxism, Cambridge: Cambridge University Press

Elster, Jon (1986), An Introduction to Karl Marx, Cambridge: Cambridge University Press

Elster, Jon (1987), The possibility of rational politics, European Journal of Sociology (Archives Européennes de Sociologie) 28: 67-103

Elster, Jon (1989), Nuts and Bolts for the Social Sciences, Cambridge: Cambridge University Press

Elster, Jon (1993), Political Psychology, Cambridge: Cambridge University Press

Giddens, Anthony (199?), Sociology

Gilbert, Christopher L. (1986), Professor Hendry's Econometric Methodology, Oxford Bulletin of Economics and Statistics 48(3): 283-307

Kline, Paul (1981), No Smoking, London Review of Books, 19. Feb.- 4 March 1981, page 23

Koutsoyiannis, A. (1977), Theory of Econometrics, London: Macmillan (second edition, first edition: 1973)

Ovind, Jan (1996), Hold deg ung med god sex, Verdens Gang 8. januar 1996: 22 (An article in a Norwegian newspaper)

Simon, Herbert A. (1971), Spurious Correlation: A Causal Interpretation, in H. M. Blalock (1971), Causal Models in the Social Sciences, Macmillan (Reprint from Journal of the American Statistical Association, 1954)

Wonnacott R. J. and T. H. (1979), Econometrics, New York: Wiley (2nd ed., first: 1970)
