Spatial join is very useful: project coordinate onto geological grid

Projecting data onto geological grid using spatial join in R

We have lots of observations and their coordinate (lat, long), and I want to group these observations by a manually designed grid. In the beginning, I wrote a for loop which is super slow (about 5 hours for 3.5 million observations). Then I was reminded that there is something called spatial join. In Python, it is done by gpd.sjoin(). as Xinyue told me. In R we can use over() from sp, which is a package that provides classes and methods for spatial data: points, lines, polygons, and grids. Spatial join toke fewer than 2 minutes.

The over() function is not very easy to use, I found out one way to get it work by turning grid into Spatial Polygon object first. There are different approaches to do this. 

The main dataset of our observations has recorded millions of trips: the longitudes and latitudes of the trip origin, and the counts of people on those trips by social economical (SE) strata. Part of it looks like:

home_long    home_lat    SE01   SE02   SE03
41.51957      -87.51127    0         2          1

In our case, we divide the map (Chicago) into 2 km x 2 km grid (small blue dots, red dots have 20 km interval):

If we want to sum the number of people by their home block (the 2km x2 km block created by connecting these blue dots where they live in it), we impose the dataset onto the grid.


      1. Create the grid
  • Grid was created from two vectors which contains all the longitudes and latitudes of these blue dots
  • Adding id is optional, it's for easier data joining later
> long_grid
 [1] -88.65292 -88.62852 -88.60412 ...
> lat_grid
 [1] 41.31433 41.33273 41.35114 ...

grid <- expand.grid(long = long_grid, lat = lat_grid)
grid$id <- 1:nrow(grid)
The grid has the longitude and latitude of  each blue dot:
> nrow(grid)
[1] 4550
> head(grid)
       long           lat       id
1 -88.65292 41.31433  1
2 -88.62852 41.31433  2
3 -88.60412 41.31433  3
...

      2. Turn grid into polygons: 
  • SpatialPixelsDataFrame() defines the longitude and latitude for the grid:
grid_df <- SpatialPixelsDataFrame(points = grid[c("long", "lat")], data = grid)
grid_poly <-  as(grid_df, "SpatialPolygonsDataFrame") 
  • Another option to get the polygons is:
grid2 <- grid
coordinates(grid2) <- ~ long + lat
gridded(grid2) = TRUE
grid_poly <- as(grid2, "SpatialPolygons")

      3. Set attribute of mydata and join data: 
  • Use coordinates() to set spatial coordinates to create a Spatial object
  • Use over() to get the sum of "SE01", "SE02", and "SE03" by the each block of grid_poly
coordinates(mydata) <- ~ h_cen_long + h_cen_lat
impose1 <- over(grid_poly, mydata[c("SE01", "SE02", "SE03")], fn = sum)

All the codes together:

# The grid ----------------------------------------------------------------
library("sp")
grid <- expand.grid(long = long_grid, lat = lat_grid)
grid$id <- 1:nrow(grid)
grid_df <- SpatialPixelsDataFrame(points = grid[c("long", "lat")], data = grid)
grid_poly <-  as(grid_df,  "SpatialPolygonsDataFrame")

# -----------------------------------------------------------
# spatial join by imposing mydata over grid by home coordinates
test_home <- mydata
coordinates(test_home) <- ~ h_cen_long + h_cen_lat
impose1 <- over(grid_poly, test_home[c("SE01", "SE02", "SE03")], fn = sum)

# combine output with grid
impose1[is.na(impose1)] <- 0
impose1$id <- 1:nrow(impose1)
impose1$count <- impose1$SE01 + impose1$SE02 + impose1$SE03
homeBlock2 <- join(grid, impose1, by = 'id')
homeBlock2 <- homeBlock2[homeBlock2$count != 0, ]

the output looks like this:

> head(homeBlock2)
         long           lat              id    SE01    SE02    SE03    count
52     -87.40848 41.31433   52     13        21        33        67
53     -87.38408 41.31433   53     11        21        44        76
121   -87.43288 41.33273   121    8          9         23        40

Comments

Popular posts from this blog

How to Draw Heatmap with Colorful Dendrogram

Power-law distribution (Pareto)& Zipf's Law: connection and how to fit the distribution of global city population

SAS’s Best Subset Selection by Mallows's Cp is actually Stepwise?