How I mapped city-level GDP from the MGI dataset onto the universe of cities

One of the main exercises I did in late 2017 was to create city-level GDP estimates for the "universe" of 4231 cities. Many interesting problems came up along the way.

The main reference dataset is the McKinsey Global Institute (MGI) Cityscope database of 2910 cities. It provides city GDP and population, which are highly correlated with each other. My guess is that regression was already used to produce its city GDP figures.

My own guideline for this task was to rely on MGI data as much as I could. The first step was to match cities directly. I first matched by city name and country, then matched the remaining cities to MGI cities within 20 km geographic distance. Both passes were needed because, not surprisingly, the two datasets share no common city codes or standardized city names. Even country names can differ, and MGI's geographic coordinates are sometimes wrong.
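The two matching passes can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the record layout (dicts with `name`, `country`, `lat`, `lon`), the function names, and the use of the haversine great-circle formula for the 20 km check are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def normalize(text):
    """Lowercase and collapse whitespace so trivial spelling noise is ignored."""
    return " ".join(text.lower().split())


def match_city(city, mgi_cities, max_km=20.0):
    """Return the matching MGI record for a universe city, or None.

    Records are hypothetical dicts with keys name, country, lat, lon.
    """
    # Pass 1: same normalized city name and country name.
    for m in mgi_cities:
        if (normalize(m["name"]) == normalize(city["name"])
                and normalize(m["country"]) == normalize(city["country"])):
            return m
    # Pass 2: nearest MGI city within max_km, ignoring names entirely,
    # since even country names can differ between the datasets.
    best, best_d = None, max_km
    for m in mgi_cities:
        d = haversine_km(city["lat"], city["lon"], m["lat"], m["lon"])
        if d <= best_d:
            best, best_d = m, d
    return best
```

Matches found by distance alone are worth checking by hand, because a wrong coordinate on either side can pair two unrelated cities.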

Direct matching covered 2617 (62%) of the cities. Next, I fitted a linear regression within each country, with MGI city population as the independent variable and MGI city GDP as the response, and then predicted city GDP from the universe population. This gave another 966 (23%) cities. Regression works well in countries with enough cities. For example, China alone takes up 24% of the universe. But note that of the 172 countries in our universe, 112 (65%) have fewer than 10 cities, and more than half (88) have fewer than 5. If a country has fewer than 5 MGI cities, or the population of the city to be predicted falls outside the range of MGI populations in that country, regression would be inappropriate.
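A minimal sketch of the per-country regression and its two guard conditions is below. The function names and the hand-written least-squares fit are assumptions for illustration; any standard linear regression routine would do the same job.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        return None  # all populations identical, slope is undefined
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_city_gdp(mgi_pop, mgi_gdp, universe_pop, min_cities=5):
    """Predict GDP for one universe city from its country's MGI cities.

    Returns None when regression is inappropriate:
    fewer than min_cities MGI cities in the country, or a universe
    population outside the range of MGI populations in that country.
    """
    if len(mgi_pop) < min_cities:
        return None
    if not (min(mgi_pop) <= universe_pop <= max(mgi_pop)):
        return None  # would require extrapolation
    fitted = fit_line(mgi_pop, mgi_gdp)
    if fitted is None:
        return None
    slope, intercept = fitted
    return slope * universe_pop + intercept
```

A `None` result signals that the city should be handled by another method rather than by an extrapolated or poorly supported fit.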

Thus, in the remaining 648 (15%) cases where regression is inappropriate, national GDP per capita calculated from MGI data was used together with the universe population to get city GDP. For 20 of these cities (0.5%), whose countries do not appear in MGI, national GDP per capita from the World Bank was used instead.
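The fallback amounts to multiplying a national GDP per capita by the city's universe population. The sketch below assumes that the MGI-based per capita figure is the sum of a country's MGI city GDP divided by the sum of its MGI city population. That aggregation, and all the names used, are my illustrative assumptions.

```python
def gdp_per_capita_from_mgi(mgi_gdp, mgi_pop):
    """National GDP per capita from a country's MGI cities.

    Assumption: total MGI city GDP divided by total MGI city population.
    """
    return sum(mgi_gdp) / sum(mgi_pop)


def fallback_city_gdp(universe_pop, mgi_gdp=None, mgi_pop=None,
                      world_bank_gdp_per_capita=None):
    """City GDP as per capita GDP times universe population.

    Prefers MGI data for the country and uses the World Bank figure
    only when the country has no MGI cities. Returns None if neither
    source is available.
    """
    if mgi_gdp and mgi_pop:
        per_capita = gdp_per_capita_from_mgi(mgi_gdp, mgi_pop)
    elif world_bank_gdp_per_capita is not None:
        per_capita = world_bank_gdp_per_capita
    else:
        return None
    return per_capita * universe_pop
```

GDP and population must be in consistent units across sources, otherwise the per capita figures from MGI and the World Bank are not comparable.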

In summary, I used the following datasets for each part of the 4231 cities:
2617 (62%):   MGI city GDP
966  (23%):   MGI city population, population from the universe
628  (15%):   MGI city GDP, MGI city population, population from the universe
20   (0.5%):  National GDP per capita from the World Bank, population from the universe

The regressions use MGI population as the independent variable to fit the linear model (within each country), then predict from the population in the universe of cities to get the corresponding city GDP. I took this approach because otherwise the estimates for some European cities come out very high.
