Why a perfect correlation can mean nothing: intuition behind correlation
Besides attending a meeting in the morning discussing the Colombia paper, I have read some interesting stuff about correlation.
When we talk about correlation, in most cases we mean "linear correlation". What is a linear correlation?
Let's look at two cases where there is no linear correlation (the correlation coefficient is 0).
In the first case, the two variables follow a quadratic function, a parabola, and the correlation is zero. In the second case, a line through (2,2) fits the points equally well in any direction, so no particular line can be chosen. In fact, the correlation is zero whenever the scatterplot is symmetric about a horizontal or a vertical line.
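To make the two zero-correlation cases concrete, here is a minimal sketch in plain Python. The data points are made up for illustration (they are not the ones from the original plots), and `pearson` is a small helper written for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Case 1 (illustrative data): y is completely determined by x (y = x^2),
# but the x values are symmetric around 0, so the points mirror each other
# across a vertical line.
xs_parabola = [-3, -2, -1, 0, 1, 2, 3]
ys_parabola = [x ** 2 for x in xs_parabola]
r_parabola = pearson(xs_parabola, ys_parabola)

# Case 2 (illustrative data): four points placed symmetrically around (2, 2).
xs_cross = [1, 3, 2, 2]
ys_cross = [2, 2, 1, 3]
r_cross = pearson(xs_cross, ys_cross)

print(r_parabola, r_cross)
```

Both coefficients come out as zero, even though in the first case y depends on x exactly.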
In sum, the strength of the correlation only describes how well a linear model captures the relationship, or how close a straight line can get to all the points. A linear model is relatively easy to interpret and works well in many real-world cases, but there is no guarantee that it works everywhere.
To give another case in which linear correlation is misleading: the two variables have a fairly strong negative linear correlation (r = -0.66, r² ≈ 0.4), but they actually follow a hyperbolic function. This is why talking about correlation without looking at the scatterplot makes little sense. So be cautious the next time a statistician claims that two variables are highly correlated.
I have also run into some excellent explanations on the intuition behind correlation.
Especially this one by William Huber on Cross Validated: in short, he suggests drawing a rectangle for every pair of points, with the two points at opposite corners, and coloring it red if the line connecting them rises toward the upper right and blue if it rises toward the upper left:
An example of 3 points:
Do this for all the pairs of points. The redder the plot, the more positive the covariance; the bluer the plot, the more negative. A picture dominated by red or by blue indicates a strong correlation.
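The rectangle picture can be checked numerically. The sketch below relies on a standard identity: the sum of the signed rectangle areas over all pairs of points, divided by n(n-1), equals the sample covariance. The three points and the helper names are made up for illustration.

```python
from itertools import combinations

def sample_covariance(xs, ys):
    """Usual sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def signed_rectangle_areas(xs, ys):
    """One signed area per pair of points.

    Positive ("red") when the line joining the pair rises to the upper right,
    negative ("blue") when it rises to the upper left.
    """
    points = list(zip(xs, ys))
    return [(x2 - x1) * (y2 - y1)
            for (x1, y1), (x2, y2) in combinations(points, 2)]

# Three illustrative points (hypothetical data).
xs = [1, 2, 4]
ys = [1, 3, 2]

areas = signed_rectangle_areas(xs, ys)
n = len(xs)
cov_from_rectangles = sum(areas) / (n * (n - 1))
cov_direct = sample_covariance(xs, ys)

print(areas)
print(cov_from_rectangles, cov_direct)
```

Here two rectangles are red and one is blue, and the net red area gives the same number as the textbook covariance formula.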
By the way, the Spearman rank-order correlation coefficient, which describes the monotonic relationship between two variables, is -1 for the hyperbolic example above (a perfect monotonic relationship). The linear correlation is also called the Pearson product-moment correlation.
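The gap between the two coefficients is easy to reproduce. This is a minimal sketch on made-up data, y = 1/x for x = 1..10, so the Pearson value will not match the r = -0.66 quoted above. `ranks` and `spearman` are small helpers written for this example and assume there are no ties.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Illustrative hyperbolic data (hypothetical, not the data behind the plot).
xs = list(range(1, 11))
ys = [1 / x for x in xs]

r = pearson(xs, ys)
rho = spearman(xs, ys)
print(r, rho)
```

Because y strictly decreases as x increases, the rank correlation is exactly -1, while the Pearson coefficient is negative but clearly weaker, since a straight line cannot follow the curve.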
Last but not least, it is worth mentioning that even a perfect correlation can mean nothing. For example, US spending on science and technology correlates almost perfectly with suicides by hanging, strangulation, and suffocation (r ≈ 1). It just happens. Human minds are strongly inclined to read causation into correlation: I cannot help trying to reason why when I read these wonderful examples of spurious correlations.