eXtreme Gradient Boosting (XGBoost): Better than random forest or gradient boosting
Ensemble Methods
I recently had the great pleasure of meeting Professor Allan Just, who introduced me to eXtreme Gradient Boosting (XGBoost). XGBoost is “one of the most loved machine learning algorithms at Kaggle”: it combines the advantages of random forest and boosting. We will see later that its predictive ability is indeed very strong.
In the earlier post Model Selection using Lasso and Best Subset I focused mainly on model selection by best subset, stepwise, or lasso. This post focuses on ensemble methods: random forest, gradient boosting, and eXtreme Gradient Boosting (XGBoost).
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
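As a toy illustration of the idea (a hypothetical sketch on the built-in mtcars data, not part of the analysis below), the simplest ensemble just averages the predictions of two different models:
# Hypothetical sketch: average the predictions of two different models
library(randomForest)
fit_lm <- lm(mpg ~ ., data = mtcars)            # a linear model
fit_rf <- randomForest(mpg ~ ., data = mtcars)  # a tree ensemble
pred_avg <- (predict(fit_lm, mtcars) + predict(fit_rf, mtcars)) / 2
head(pred_avg)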
1. Random Forest
Use library randomForest.
As a well-known decision-tree-based approach, random forest usually performs very well for prediction and is easy to use. It is based on bagging (i.e., bootstrap aggregation), which averages the results over many decision trees grown on bootstrap sub-samples. It further limits each split to a random subset of the features (about 1/3 of them in regression), weakening the correlation among the trees. It usually achieves a lower testing error than linear models.
# 3. Random Forest --------------------------------------------------------
# Easy to use: the defaults work well here, so no tuning is done
rf.fit <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx)
# Predict on the test data: mydata4[-train_idx,]
yhat_bag <- predict(rf.fit, newdata = mydata4[-train_idx,])
# MSE
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
## [1] 0.04104
- MSE on the testing dataset:
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
## [1] 0.04104
- Feature Importance (top 15)
- Variables ranked higher are relatively more important in the tree model
- For example, Monthly Water Cost, Resettled Housing, and Population Estimate are the most influential features.
varImpPlot(rf.fit, n.var=15)
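- The fit above uses the defaults; mtry (the number of features tried at each split, about p/3 in regression) and ntree can also be set explicitly. A minimal sketch, assuming the same mydata4, train_idx, and Y_test objects:
# Sketch: set mtry and ntree explicitly instead of relying on the defaults
p <- ncol(mydata4) - 1                       # number of predictors
rf.fit2 <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx,
                        mtry = floor(p / 3), # the regression default
                        ntree = 1000)        # more trees than the default 500
mean((predict(rf.fit2, newdata = mydata4[-train_idx,]) - Y_test)^2)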
2. Gradient Boosting
- Tuning Method: use the train function from caret to scan a grid of parameters.
library(gbm) # for Gradient boosting
library(caret) # scan the parameter grid using `train` function
time_now <- Sys.time()
para_grid <- expand.grid(n.trees = (20*c(1:100)),
shrinkage = c(0.1, 0.05, 0.01),
interaction.depth = c(1:5),
n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
set.seed(123)
gbm_caret <- train(Share_Temporary ~ ., mydata4[train_idx,],
distribution = "gaussian", method = "gbm",
trControl = trainControl, verbose = FALSE,
tuneGrid = para_grid, metric = "RMSE", bag.fraction = 0.75)
Sys.time() - time_now
## Time difference of 2.746 mins
- The tuning parameters that give the lowest MSE in training set CV.
gbm_caret$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 1303 60 4 0.1 10
# MSE
yhat_boost <- predict(gbm_caret, mydata4[-train_idx,])
(MSE_boost <- mean((yhat_boost - Y_test)^2))
## [1] 0.03809
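- caret can also show how the cross-validated RMSE varies across the grid, which is a quick sanity check on the chosen combination. A minimal sketch, assuming the gbm_caret object fit above:
# Sketch: inspect the CV profile over the tuning grid
plot(gbm_caret)    # RMSE vs. n.trees, by shrinkage and interaction.depth
# ggplot(gbm_caret)  # ggplot2 version of the same plot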
3. Extreme Gradient Boosting
- Tuning Method: randomly sample parameter settings and keep a record of the best ones.
- This package is really fast.
library(xgboost)
# Matrix for xgb: dtrain
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)
# Randomize and bound
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0
time_now <- Sys.time()
set.seed(1234)
for (iter in 1:100) {
param <- list(objective = "reg:linear", # For regression
eval_metric = "rmse",
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 8, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
Sys.time() - time_now
## Time difference of 3.147 mins
- The best tuning parameters
data.frame(best_param, best_rmse_index, best_rmse, best_seednumber)
## objective eval_metric max_depth eta subsample colsample_bytree
## 1 reg:linear rmse 10 0.04441 0.7931 0.7619
## min_child_weight max_delta_step best_rmse_index best_rmse
## 1 5 9 91 0.2105
## best_seednumber
## 1 7476
# The best index (min_rmse_index) is the best "nround" in the model
nround = best_rmse_index
set.seed(best_seednumber)
xg_mod <- xgboost(data = dtest, params = best_param, nround = nround, verbose = F)
# MSE
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
## [1] 0.004265
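- The selected parameters can also be passed to xgb.train, the lower-level interface, which fits on the training matrix and accepts a watchlist for monitoring. A minimal sketch, assuming dtrain, dtest, best_param, nround, best_seednumber, and Y_test from above:
# Sketch: fit with the selected parameters via xgb.train on the training matrix
set.seed(best_seednumber)
xg_mod_train <- xgb.train(params = best_param, data = dtrain, nrounds = nround,
                          watchlist = list(train = dtrain, test = dtest), verbose = 0)
mean((predict(xg_mod_train, dtest) - Y_test)^2)  # MSE on the held-out set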
- Feature Importance
importance_matrix <- xgb.importance(colnames(X_train), model = xg_mod)
# Use `xgb.plot.importance`, which creates a barplot, or use `xgb.ggplot.importance`
library(Ckmeans.1d.dp) # for xgb.ggplot.importance
xgb.ggplot.importance(importance_matrix, top_n = 15,
measure = "Gain")
- Plot only 2 trees as an example (use trees = 1)
- Two ways to show it:
# two ways to do it:
library("DiagrammeR")
xgb.plot.tree(model = xg_mod, trees = 1, feature_names = colnames(X_train))
# xgb.plot.tree needs to be revised a little bit by adding `fontcolor ="black"` in the `nodes <- DiagrammeR::create_node_df` line
source("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/R/xgb.plot.tree2.R")
xgb.plot.tree2(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
- Project all trees onto one tree and plot it: a huge plot
xgb.plot.multi.trees(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
Future work:
Try the “xgboostExplainer” package; I am still working on it.
Recall the features selected by Lasso and best subset from the last post as a comparison:
Compared to Lasso and Best-subset
Lasso
## b.Estimate b.Std..Error b.t.value
## FF11_Water_MonthlyCost -3.949e-06 6.278e-07 -6.2902
## B14__resettled -1.651e-01 2.893e-02 -5.7052
## DD1_Location_Problemscanal 1.943e-01 3.020e-02 6.4357
## FF1_8_Water_Sourceswells -9.677e-02 2.118e-02 -4.5681
## Eviction_Threats 8.793e-02 2.218e-02 3.9638
## B14__declared_legal_protected 8.435e-02 2.702e-02 3.1216
## FF1_8_Water_Sourcesshared_taps -1.119e-01 3.084e-02 -3.6277
## DD1_Location_Problemsslope 7.522e-02 2.224e-02 3.3817
## FF1_8_Water_Sourceswater_tankers -1.247e-01 3.636e-02 -3.4310
## DD1_Location_Problemsunder_power_lines -8.192e-02 2.565e-02 -3.1944
## GG1_Sewer_Line 6.772e-02 2.350e-02 2.8818
## DD1_Location_Problemswater_body 8.081e-02 2.783e-02 2.9042
## CC9_Households -5.211e-06 1.874e-06 -2.7808
## DD1_Location_Problemsflood_prone_area 5.697e-02 2.107e-02 2.7037
## C13__no_rent 1.515e-01 5.604e-02 2.7030
## FF1_8_Water_Sourcescommunity_taps 5.601e-02 2.663e-02 2.1033
## FF1_8_Water_Sourcesrivers -4.035e-02 3.551e-02 -1.1363
## DD1_Location_Problemssinking_soil 2.615e-02 3.102e-02 0.8431
## (Intercept) 3.961e-01 3.433e-02 11.5381
## b.Pr...t.. stb
## FF11_Water_MonthlyCost 6.403e-10 -0.21723
## B14__resettled 1.886e-08 -0.21544
## DD1_Location_Problemscanal 2.646e-10 0.18813
## FF1_8_Water_Sourceswells 6.059e-06 -0.13182
## Eviction_Threats 8.332e-05 0.11916
## B14__declared_legal_protected 1.891e-03 0.11034
## FF1_8_Water_Sourcesshared_taps 3.121e-04 -0.10669
## DD1_Location_Problemsslope 7.707e-04 0.09968
## FF1_8_Water_Sourceswater_tankers 6.458e-04 -0.09805
## DD1_Location_Problemsunder_power_lines 1.480e-03 -0.08740
## GG1_Sewer_Line 4.106e-03 0.08206
## DD1_Location_Problemswater_body 3.827e-03 0.08085
## CC9_Households 5.605e-03 -0.07923
## DD1_Location_Problemsflood_prone_area 7.066e-03 0.07819
## C13__no_rent 7.081e-03 0.07605
## FF1_8_Water_Sourcescommunity_taps 3.588e-02 0.06259
## FF1_8_Water_Sourcesrivers 2.563e-01 -0.03116
## DD1_Location_Problemssinking_soil 3.995e-01 0.02372
## (Intercept) 8.958e-28 0.00000
Best-subset
## b.Estimate b.Std..Error b.t.value
## FF11_Water_MonthlyCost -5.469e-06 9.276e-07 -5.8960
## B14__resettled -2.061e-01 3.893e-02 -5.2933
## Eviction_Threats 1.751e-01 4.399e-02 3.9797
## DD1_Location_Problemscanal 2.218e-01 4.108e-02 5.3983
## FF1_8_Water_Sourceswater_tankers -2.080e-01 5.435e-02 -3.8267
## FF1_8_Water_Sourcesshared_taps -1.362e-01 4.393e-02 -3.1006
## EE2A_Current_Eviction_Threat -8.252e-02 4.475e-02 -1.8441
## DD1_Location_Problemsflood_prone_area 7.352e-02 2.866e-02 2.5655
## B14__declared_legal_protected 6.983e-02 3.690e-02 1.8925
## FF1_8_Water_Sourcesrivers -7.327e-02 4.944e-02 -1.4821
## FF1_8_Water_Sourcessprings -4.165e-02 4.110e-02 -1.0134
## FF1_8_Water_Sourcesdams -5.584e-02 5.874e-02 -0.9507
## DD1_Location_Problemsgarbage_dump -2.470e-02 3.155e-02 -0.7828
## FF12_Water_CollectionTime30_minutes 3.477e-02 4.548e-02 0.7646
## DD1_Location_Problemsroad_side 1.986e-02 2.998e-02 0.6624
## GG7_10_Toilet_Typesindividual_toilets -2.052e-02 3.840e-02 -0.5344
## JJ1_Electricity_Availableyes 5.476e-03 4.093e-02 0.1338
## (Intercept) 4.445e-01 6.387e-02 6.9591
## b.Pr...t.. stb
## FF11_Water_MonthlyCost 9.258e-09 -0.299393
## B14__resettled 2.204e-07 -0.271580
## Eviction_Threats 8.498e-05 0.238699
## DD1_Location_Problemscanal 1.293e-07 0.216368
## FF1_8_Water_Sourceswater_tankers 1.555e-04 -0.157373
## FF1_8_Water_Sourcesshared_taps 2.099e-03 -0.128307
## EE2A_Current_Eviction_Threat 6.607e-02 -0.107864
## DD1_Location_Problemsflood_prone_area 1.075e-02 0.101994
## B14__declared_legal_protected 5.931e-02 0.093230
## FF1_8_Water_Sourcesrivers 1.393e-01 -0.058064
## FF1_8_Water_Sourcessprings 3.116e-01 -0.044319
## FF1_8_Water_Sourcesdams 3.424e-01 -0.036997
## DD1_Location_Problemsgarbage_dump 4.343e-01 -0.030778
## FF12_Water_CollectionTime30_minutes 4.451e-01 0.029456
## DD1_Location_Problemsroad_side 5.082e-01 0.025531
## GG7_10_Toilet_Typesindividual_toilets 5.934e-01 -0.020818
## JJ1_Electricity_Availableyes 8.937e-01 0.005555
## (Intercept) 1.869e-11 0.000000
Compare MSEs
- The advantage of XGBoost is striking.
options(digits = 4)
(MSE <- data.frame(MSE_best.subset, MSE_Lasso,
MSE_rForest, MSE_boost, MSE_xgb))
## MSE_best.subset MSE_Lasso MSE_rForest MSE_boost MSE_xgb
## 1 0.066 0.06079 0.04104 0.03809 0.004265
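- Because Share_Temporary is a share between 0 and 1, the square roots of these values (RMSE) are a bit easier to read; a minimal sketch using the MSE data frame above:
# Sketch: RMSE is easier to interpret on the 0-1 scale of Share_Temporary
sqrt(MSE)
barplot(unlist(sqrt(MSE)), las = 2, ylab = "Test RMSE")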
Code
## Working Sample: Predicting Share of Temporary Structure in Slums Settlements ##
# 07/09/2018
library(readxl)
library(plyr)
library(ggplot2)
library(QuantPsyc) # used to show standardized regression coefficients
library(glmnet) # for Lasso
library(leaps) # for best subset
library(randomForest) # random forest
library(gbm) # for Gradient boosting
library(caret) # grid scan by train
library(xgboost) # for Extreme Gradient boosting
library(Ckmeans.1d.dp) # for xgb.ggplot.importance in xgboost
library(DiagrammeR) # for xgb.plot.tree in xgboost
library(knitr)
options(digits = 4)
# Data Cleaning -----------------------------------------------------------
sdi <- read_excel("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/Data/20180618_SDI 2.xlsx",
na = "", skip = 1)
# clean variables names, remove "/".
all_varnames <- names(sdi)
names(sdi) <- gsub("/", "", all_varnames)
# turn all character variables into factors
sdi[sapply(sdi, is.character)] <- lapply(sdi[sapply(sdi, is.character)],
as.factor)
sdi$Share_Temporary <- sdi$CC7_Structures_Temporary / (sdi$CC7_Structures_Temporary +sdi$CC6_Structures_Permanent)
# remove variables with over 10% NA
Remove <- (sapply(sdi, function(x) sum(is.na(x))) < dim(sdi)[1]*0.1)
mydata0 <- sdi[Remove]
mydata <- na.omit(mydata0) # remove all the NA
# fix some format issue
mydata$CC10_Household_Size <- as.numeric(mydata$CC10_Household_Size)
mydata$CC12_Total_Population <- as.numeric(mydata$CC12_Total_Population)
# remove useless factor
mydata2 <- subset(mydata, select = -c(1, City, Country)) # for some other use not shown here
# remove what are highly correlated to dependent variable
mydata3 <- subset(mydata2, select = - grep("Structure", names(mydata2)))
# divide training and test dataset
set.seed(123)
train_idx <- sample(dim(mydata3)[1], dim(mydata3)[1]* 0.6)
# The model.matrix() function is used in many regression packages for building
# an "X" matrix from data.
# need matrix for glmnet
X2 <- model.matrix(Share_Temporary~., data = mydata3)
Y2 <- as.matrix(mydata3[c("Share_Temporary")])
X_train <- X2[train_idx,]
X_test <- X2[-train_idx,]
Y_train <- Y2[train_idx]
Y_test <- Y2[-train_idx]
# merge back to df again, as df is required for regsubsets (best subset)
data_train <- data.frame(X2, Y2)[train_idx,]
mydata4 <- data.frame(X2, Y2)
# 1. Lasso ----------------------------------------------------------------
# Use cross-validation to choose lambda for Lasso
cv_lasso = cv.glmnet(X_train, Y_train, alpha=1) # Lasso regression
plot(cv_lasso)
(best_lam <- cv_lasso$lambda.1se)
coef_table <- coef(cv_lasso, s = cv_lasso$lambda.1se)
# variables chosen
# data.frame(name = coef_table@Dimnames[[1]][coef_table@i + 1], coefficient = coef_table@x)
# Check prediction error in test dataset
# The lasso model, then feed lambda to predict fitted value
lasso_mod <- glmnet(X_train, Y_train, alpha = 1)
# As lambda increase, "survived" variables decrease as coefficients move to zero
plot(lasso_mod, xvar = "lambda", label = TRUE)
lasso_pred <- predict(lasso_mod, s = best_lam, newx = X_test)
# The MSE (Mean square error)
(MSE_Lasso <- mean((lasso_pred - Y_test)^2))
# show the variables chosen into the model
var_list <- unlist(coef_table@Dimnames[[1]][coef_table@i + 1])
var_list
# run a regression using these variables
# Because there is intercept, we need to use [coef_table@i + 1], and [-1] in the end
sdi_sub <- mydata4[c(coef_table@Dimnames[[1]][coef_table@i + 1], "Share_Temporary")[-1]]
# lm model using variables chosen by Lasso 1se
reg.chosen1 <- lm(data = sdi_sub, Share_Temporary~.)
summary(reg.chosen1)
# Relative importance by showing Standardized parameter estimates in decreasing order
coef_table <- data.frame(b = reg.chosen1$coefficients, stb = c(0, lm.beta(reg.chosen1)))
coef_table[order(abs(coef_table$stb), decreasing = T),]
# 2. Best Subset ----------------------------------------------------------
time_now <- Sys.time()
nv_max <- 25 # the maximum allowed number of features to test
fit_s <- regsubsets(Share_Temporary~., data_train,
really.big = T, nbest=1, nvmax = nv_max
)
Sys.time() - time_now
val.errors =rep(0, nv_max)
# Need the X_test with intercept
for(i in 1:nv_max){
coefi = coef(fit_s, id=i)
pred = X_test[,names(coefi)]%*%coefi # need to use matrix here
val.errors[i]= mean((Y_test - pred)^2)
}
# choose the best number of predictors to use
(val_err_min <- which.min(val.errors))
# MSE
(MSE_best.subset <- val.errors[val_err_min])
# The reg model selected:
coef_list = coef(fit_s, id = val_err_min)
# create sdi_sub2 which has all the selected predictors, and dependent variable,
# remove intercept by adding [-1]
sdi_sub2 <- mydata4[c(names(coef_list), "Share_Temporary")[-1]]
reg_chosen_bs <- lm(data = sdi_sub2, Share_Temporary ~ .)
summary(reg_chosen_bs)
# get output of b and stb sorted by |stb|
coef_table_bs <- data.frame(b = reg_chosen_bs$coefficients, stb = c(0, lm.beta(reg_chosen_bs)))
coef_table_bs[order(abs(coef_table_bs$stb), decreasing = T),]
# the chosen model from cross-validation
reg.chosen4 <- lm(data = mydata4[train_idx, ], Share_Temporary~
FF11_Water_MonthlyCost +
B14__resettled +
Eviction_Threats +
DD1_Location_Problemsflood_prone_area +
FF1_8_Water_Sourcesshared_taps +
DD1_Location_Problemscanal +
FF1_8_Water_Sourceswater_tankers +
B14__declared_legal_protected +
FF1_8_Water_Sourcessprings +
JJ1_Electricity_Availableyes +
GG7_10_Toilet_Typesindividual_toilets +
FF1_8_Water_Sourcesdams +
FF1_8_Water_Sourcesrivers +
FF12_Water_CollectionTime30_minutes +
EE2A_Current_Eviction_Threat +
DD1_Location_Problemsroad_side +
DD1_Location_Problemsgarbage_dump
)
# MSE
(MSE_best.subset <- mean((predict(reg.chosen4, mydata4[-train_idx,]) - Y_test)^2))
# 3. Random Forest --------------------------------------------------------
# Easy to use: the defaults work well here, so no tuning is done
bag.fit <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx)
# rf_test <- mydata3[-train_idx, "Share_Temporary"]
# Predict on the test data: mydata4[-train_idx,]
yhat_bag <- predict(bag.fit, newdata = mydata4[-train_idx,])
# MSE
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
varImpPlot(bag.fit, n.var=15)
# 4. Gradient boosting ----------------------------------------------------
# Tuning Method: use `train` function from `caret` to scan for parameters.
library(gbm) # for Gradient boosting
library(caret) # scan the parameter grid using `train` function
time_now <- Sys.time()
para_grid <- expand.grid(n.trees = (20*c(1:100)),
shrinkage = c(0.1, 0.05, 0.01),
interaction.depth = c(1:5),
n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
set.seed(123)
gbm_caret <- train(Share_Temporary ~ ., mydata4[train_idx,],
distribution = "gaussian", method = "gbm",
trControl = trainControl, verbose = FALSE,
tuneGrid = para_grid, metric = "RMSE", bag.fraction = 0.75)
# The tuning parameters that give lowest MSE in training set CV.
gbm_caret$bestTune
yhat_boost <- predict(gbm_caret, mydata4[-train_idx,])
(MSE_boost <- mean((yhat_boost - Y_test)^2))
Sys.time() - time_now
# 5. Extreme Gradient Boosting --------------------------------------------------------------
# Tuning Method: randomize parameters and update the record with the best ones.
library(xgboost)
# Matrix for xgb: dtrain
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)
# # For a single cv
# xg_cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
# max_depth = 3, eta = 0.1, objective = "reg:linear",
# eval_metric = "rmse")
# print(xg_cv)
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0
time_now <- Sys.time()
set.seed(1234)
for (iter in 1:100) {
param <- list(objective = "reg:linear",
eval_metric = "rmse",
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 8, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
# The best index (min_rmse_index) is the best "nround" in the model
nround = best_rmse_index
set.seed(best_seednumber)
xg_mod <- xgboost(data = dtest, params = best_param, nround = nround, verbose = F)
# Check error in testing data
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
best_param_store <- data.frame(best_param, best_rmse_index, best_rmse, best_seednumber)
Sys.time() - time_now
# Feature importance
importance_matrix <- xgb.importance(colnames(X_train), model = xg_mod)
# Use `xgb.plot.importance`, which creates a barplot, or use `xgb.ggplot.importance`
library(Ckmeans.1d.dp) # for xgb.ggplot.importance
xgb.ggplot.importance(importance_matrix, top_n = 15,
measure = "Gain")
# plot the trees from model using `xgb.plot.tree`
# xgb.plot.tree needs to be revised a little bit by adding `fontcolor="black"` in the `nodes <- DiagrammeR::create_node_df` line
# only show 1 tree as an example.
#
source("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/R/xgb.plot.tree2.R")
library("DiagrammeR")
xgb.plot.tree(model = xg_mod, trees = 1, feature_names = colnames(X_train))
xgb.plot.tree2(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
# Project all trees on one tree and plot it: A huge plot
xgb.plot.multi.trees(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train),
features_keep = 3)
# Summarize MSEs ----------------------------------------------------------
options(digits = 4)
(MSE <- data.frame(MSE_best.subset, MSE_Lasso,
MSE_rForest, MSE_boost, MSE_xgb))