A linear regression model from urban compactness with R-squared equal to one

What is the average travel distance in a city calculated from the compactness index? 

The compactness of an area (e.g. urban extent) is the ratio between two average distances: d / D.
There are two ways to get the numerator d:

  1. The average distance from a random point to the center of the circle, d = 2/3*R (P.S., this one is easy to calculate: it is the integral of (2πr/πR^2) * r = 2r^2/R^2 from 0 to R. The idea is that, for any r in [0, R], the probability density of a uniformly random point in the disc lying at distance r from the center is 2πr/πR^2, and we integrate r weighted by that density from 0 to R. See this post.)
  2. The average distance from a random point to another random point in the circle, d = (128/(45π))*R = 0.9054*R
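
As a quick sanity check on the two closed-form averages for a disc (2/3*R for point-to-center and 128/(45π)*R for point-to-point), here is a minimal Monte Carlo sketch. The helper name `sample_disc`, the sample size and the seed are my own choices for illustration, not part of the original analysis.

```python
import numpy as np

def sample_disc(n, radius, rng):
    """Draw n points uniformly (by area) from a disc centred on the origin."""
    # Taking the square root of the radial draw makes the points uniform in area
    r = radius * np.sqrt(rng.random(n))
    theta = 2.0 * np.pi * rng.random(n)
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

rng = np.random.default_rng(0)
R = 1.0
p = sample_disc(200_000, R, rng)
q = sample_disc(200_000, R, rng)

# Average distance from a random point to the center
mean_to_center = np.linalg.norm(p, axis=1).mean()
# Average distance between two independent random points
mean_between = np.linalg.norm(p - q, axis=1).mean()

print("to center :", mean_to_center, "closed form:", 2 / 3 * R)
print("two points:", mean_between, "closed form:", 128 / (45 * np.pi) * R)
```

The simulated averages should agree with the closed forms to about two decimal places at this sample size.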

The circle is called the "Equal-Area circle": it has the same area as the shape under study (e.g. the urban extent).
The denominator D is the average distance from a random point to the center, or to another random point, within the actual shape.
  • The 1st approach (random distance to center) is called proximity compactness index;
  • The 2nd approach (between any two random points) is called cohesion compactness index; 
These two measurements are very similar and highly correlated. Both give a ratio that reaches its maximum of one when the city is a perfect circle.
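
To make the two indices concrete, here is a rough sketch that estimates them for rectangles of increasing elongation. It assumes "the center" means the centroid of the shape and uses Monte Carlo sampling; the function name `rectangle_compactness` is hypothetical, and real urban extents would need point sampling inside an arbitrary polygon instead.

```python
import numpy as np

def rectangle_compactness(width, height, n=200_000, seed=0):
    """Estimate (proximity, cohesion) compactness of a width x height rectangle."""
    rng = np.random.default_rng(seed)
    size = np.array([width, height], dtype=float)
    # Uniform points in a rectangle centred on the origin (assumed center = centroid)
    p = (rng.random((n, 2)) - 0.5) * size
    q = (rng.random((n, 2)) - 0.5) * size

    # Radius of the equal-area circle
    R = np.sqrt(width * height / np.pi)

    # Denominators D: average distances measured in the actual shape
    D_center = np.linalg.norm(p, axis=1).mean()
    D_pair = np.linalg.norm(p - q, axis=1).mean()

    # Numerators d: closed-form averages in the equal-area circle
    proximity = (2 / 3 * R) / D_center
    cohesion = (128 / (45 * np.pi) * R) / D_pair
    return proximity, cohesion

results = {w: rectangle_compactness(w, 1) for w in (1, 4, 16)}
for w, (prox, coh) in results.items():
    print(f"{w} x 1 rectangle: proximity = {prox:.3f}, cohesion = {coh:.3f}")
```

Both indices should stay below one and fall as the rectangle gets more elongated, which is the behaviour the compactness score is meant to capture.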

Compactness: How close to a circle is the shape of the city:


Figure: a city with high compactness (left, 0.86) and a city with low compactness (right, 0.49)

Here is the interesting part. When we calculated the compactness score, we also obtained the (theoretical) average travel distance in the urban extent, D. D is measured on the actual shape, and the compactness index is d / D. If we build this linear model on the logarithmic scale:
        log_D = log_Population size + log_Population density + log_Compactness index
The R-squared almost equals one (R-squared = 0.997). Is it surprising?

Can I use linear regression to predict Y with X if Y was calculated from X?

This is a simple but sometimes confusing question.
Define: Density = Population Size / Urban Extent

  • If Y = A + B, then using A and B as predictors for Y gives R-squared = 1, since Y is a linear combination of A and B.
  • If Y = A * B or Y = A / B (e.g. Density = Population / Urban Extent), a linear model Y = b0 + b1*A + b2*B can give a very low R-squared, because a product or ratio is not a linear function of its parts. Predicting density from population and urban extent directly gives R-squared = 0.03. The fact that population and urban extent are highly correlated (the variance inflation factor is very high: 61) may make this worse.
  • But on the logarithmic scale, log_Y = log_A + log_B (or log_A - log_B for a ratio), so the same model on the logarithmic scale gives R-squared = 1. This doesn't mean that predicting Y with either A or B alone gives a strong correlation, whether or not the variables are on the logarithmic scale.
  • In sum, with Density = Population Size / Urban Extent, these models give:
    • predict Density using Population and Urban Extent                    R-squared = 0.03
    • predict log_Density using log_Population and log_Urban Extent    R-squared = 1
    • relationship between log_Population and log_Urban Extent         R-squared = 0.98
    • But:
    • predict log_Density by log_Population alone                                 R-squared = 0.24
    • predict log_Density by log_Urban Extent alone                            R-squared = 0.08
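
The pattern in this list can be reproduced on synthetic data. The sketch below is not the Colombia Atlas data: the distributions, the seed and the helper `fit_r2` are assumptions made purely for illustration, so the R-squared values it prints will differ from the ones quoted above. Only the qualitative pattern is the point.

```python
import numpy as np

def fit_r2(y, *predictors):
    """Ordinary least squares with an intercept; returns (R-squared, coefficients)."""
    X = np.column_stack((np.ones_like(y),) + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot, beta

# Synthetic cities: log-extent is strongly tied to log-population (assumed values)
rng = np.random.default_rng(42)
n = 500
log_pop = rng.normal(11.0, 1.2, n)
log_ext = -6.0 + 0.9 * log_pop + rng.normal(0.0, 0.3, n)
pop, ext = np.exp(log_pop), np.exp(log_ext)
density = pop / ext  # Density = Population / Urban Extent

r2_raw, _ = fit_r2(density, pop, ext)                        # raw scale
r2_log, beta_log = fit_r2(np.log(density), log_pop, log_ext)  # log scale, both predictors
r2_log_pop, _ = fit_r2(np.log(density), log_pop)              # log scale, population only
r2_log_ext, _ = fit_r2(np.log(density), log_ext)              # log scale, extent only

print("raw scale, both predictors:", r2_raw)
print("log scale, both predictors:", r2_log, "slopes:", beta_log[1:])
print("log scale, population only:", r2_log_pop)
print("log scale, extent only    :", r2_log_ext)
```

On the log scale with both predictors the fit is exact up to floating-point error, with slopes of +1 and -1, while the raw-scale model and the single-predictor models explain much less.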

*P.S. A technical note: the results above are based on the Colombia Atlas. For global cities, the correlation between population and urban extent is not that high (R-squared = 0.75), and density correlates more weakly with population (R-squared = 0.08) and with urban extent (R-squared = 0.04).


The model for average distance

Back to the model of average distance from compactness index.
log_D = log_Population size + log_Population density + log_Compactness index
The numerator is d = 0.91 R or d = 2/3 R, so it depends only on the area of the urban extent (R = radius of the equal-area circle). Two cities with the same population size and density have the same urban extent, thus the same R and the same d. Since D = d / Compactness, population size, density and compactness fully explain (determine) D, so we have an R-squared of 100%. It may look like a mathematical trick, but I don't mean that it is meaningless: it helps show the relationship between the variables from a different perspective. Here the linear model serves an explanatory purpose rather than a predictive one.
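
Spelled out, the chain is D = d / Compactness, d = k*R (k = 2/3 or 0.9054), R = sqrt(Area/π) and Area = Population / Density, which gives log_D = constant + 0.5*log_Population - 0.5*log_Density - log_Compactness. The sketch below checks this on synthetic cities; all input distributions are made up for illustration, and k = 2/3 (the proximity variant) is an arbitrary choice.

```python
import numpy as np

# Synthetic inputs (assumed distributions, not real city data)
rng = np.random.default_rng(1)
n = 300
population = np.exp(rng.normal(12.0, 1.0, n))
density = np.exp(rng.normal(8.5, 0.4, n))
compactness = rng.uniform(0.4, 0.95, n)

k = 2 / 3                        # proximity variant: d = 2/3 * R
area = population / density      # urban extent implied by size and density
R = np.sqrt(area / np.pi)        # radius of the equal-area circle
D = k * R / compactness          # D = d / Compactness

# Regress log D on the three log predictors
X = np.column_stack((np.ones(n), np.log(population),
                     np.log(density), np.log(compactness)))
y = np.log(D)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

print("coefficients (pop, density, compactness):", beta[1:])
print("R-squared:", r2)
```

The fitted coefficients come out as 0.5, -0.5 and -1 up to floating-point error, and R-squared is one, because D is an exact function of the three predictors.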

In an earlier post, "a perfect correlation could mean nothing", we discussed that correlation, as a single measurement, must be judged carefully in practice. The same caution applies to R-squared. In a univariate regression, R-squared is just the square of the correlation coefficient. In this case, we should be clear that the model serves an explanatory purpose, and we need to understand the reason behind the high R-squared. Choosing the right statistical model depends on the purpose and target.
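
The univariate identity (R-squared equals the squared correlation coefficient) is easy to confirm numerically. This is a small sketch on simulated data; the slope, noise level and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Simple linear regression of y on x, then R-squared from the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2_uni = 1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("r squared :", r ** 2)
print("R-squared :", r2_uni)
```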

Extra: a nice note on the several ways to describe R-squared:

  • The strength of the relationship between variables
  • How well the predictor variables can predict the outcome
  • The amount of the variation in the response variable that is explained around its mean
  • The goodness of fit of the regression
updated 5.19
