eXtreme Gradient Boosting (XGBoost): Better than random forest or gradient boosting
Ensemble Methods
I recently had the great pleasure of meeting Professor Allan Just, who introduced me to eXtreme Gradient Boosting (XGBoost). XGBoost is “one of the most loved machine learning algorithms at Kaggle”: it combines the advantages of random forest and boosting. We will see later that its predictive ability is indeed very strong.
In the earlier post Model Selection using Lasso and Best Subset I focused mainly on model selection by best subset, stepwise, or lasso. This post focuses on ensemble methods: random forest, gradient boosting, and eXtreme Gradient Boosting (XGBoost).
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
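As a toy illustration of the idea (a hypothetical sketch on the built-in mtcars data, not part of the analysis below), the simplest ensemble just averages the predictions of two different models:
# Hypothetical sketch: average the predictions of two different models
library(randomForest)
fit_lm <- lm(mpg ~ ., data = mtcars)            # a linear model
fit_rf <- randomForest(mpg ~ ., data = mtcars)  # a tree ensemble
pred_avg <- (predict(fit_lm, mtcars) + predict(fit_rf, mtcars)) / 2
head(pred_avg)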
1. Random Forest
Use library randomForest.
As a well-known decision-tree-based approach, random forest usually performs very well for prediction and is easy to use. It is based on bagging (i.e., bootstrap aggregation), which averages the results over many decision trees grown on bootstrap sub-samples. It further limits each split to a random subset of the features (about 1/3 of them in regression), weakening the correlation among the trees. It usually achieves a lower testing error than linear models.
# 3. Random Forest --------------------------------------------------------
# Easy to use: the defaults work well here, so no tuning is done
rf.fit <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx)
# Predict on the test data: mydata4[-train_idx,]
yhat_bag <- predict(rf.fit, newdata = mydata4[-train_idx,])
# MSE
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
## [1] 0.04104
- MSE on the testing dataset:
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
## [1] 0.04104
- Feature Importance (top 15)
- Variables ranked higher are relatively more important in the tree model
- For example, Monthly Water Cost, Resettled Housing, and Population Estimate are the most influential features.
varImpPlot(rf.fit, n.var=15)
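- The fit above uses the defaults; mtry (the number of features tried at each split, about p/3 in regression) and ntree can also be set explicitly. A minimal sketch, assuming the same mydata4, train_idx, and Y_test objects:
# Sketch: set mtry and ntree explicitly instead of relying on the defaults
p <- ncol(mydata4) - 1                       # number of predictors
rf.fit2 <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx,
                        mtry = floor(p / 3), # the regression default
                        ntree = 1000)        # more trees than the default 500
mean((predict(rf.fit2, newdata = mydata4[-train_idx,]) - Y_test)^2)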
2. Gradient Boosting
- Tuning Method: use the train function from caret to scan a grid of parameters.
library(gbm) # for Gradient boosting
library(caret) # scan the parameter grid using `train` function
time_now <- Sys.time()
para_grid <- expand.grid(n.trees = (20*c(1:100)),
shrinkage = c(0.1, 0.05, 0.01),
interaction.depth = c(1:5),
n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
set.seed(123)
gbm_caret <- train(Share_Temporary ~ ., mydata4[train_idx,],
distribution = "gaussian", method = "gbm",
trControl = trainControl, verbose = FALSE,
tuneGrid = para_grid, metric = "RMSE", bag.fraction = 0.75)
Sys.time() - time_now
## Time difference of 2.746 mins
- The tuning parameters that give the lowest MSE in training set CV.
gbm_caret$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 1303 60 4 0.1 10
# MSE
yhat_boost <- predict(gbm_caret, mydata4[-train_idx,])
(MSE_boost <- mean((yhat_boost - Y_test)^2))
## [1] 0.03809
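- caret can also show how the cross-validated RMSE varies across the grid, which is a quick sanity check on the chosen combination. A minimal sketch, assuming the gbm_caret object fit above:
# Sketch: inspect the CV profile over the tuning grid
plot(gbm_caret)    # RMSE vs. n.trees, by shrinkage and interaction.depth
# ggplot(gbm_caret)  # ggplot2 version of the same plot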
3. Extreme Gradient Boosting
- Tuning Method: randomly sample parameter settings and keep a record of the best ones.
- This package is really fast.
library(xgboost)
# Matrix for xgb: dtrain
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)
# Randomize and bound
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0
time_now <- Sys.time()
set.seed(1234)
for (iter in 1:100) {
param <- list(objective = "reg:linear", # For regression
eval_metric = "rmse",
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 8, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
Sys.time() - time_now
## Time difference of 3.147 mins
- The best tuning parameters
data.frame(best_param, best_rmse_index, best_rmse, best_seednumber)
## objective eval_metric max_depth eta subsample colsample_bytree
## 1 reg:linear rmse 10 0.04441 0.7931 0.7619
## min_child_weight max_delta_step best_rmse_index best_rmse
## 1 5 9 91 0.2105
## best_seednumber
## 1 7476
# The best index (min_rmse_index) is the best "nround" in the model
nround = best_rmse_index
set.seed(best_seednumber)
xg_mod <- xgboost(data = dtest, params = best_param, nround = nround, verbose = F)
# MSE
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
## [1] 0.004265
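- The selected parameters can also be passed to xgb.train, the lower-level interface, which fits on the training matrix and accepts a watchlist for monitoring. A minimal sketch, assuming dtrain, dtest, best_param, nround, best_seednumber, and Y_test from above:
# Sketch: fit with the selected parameters via xgb.train on the training matrix
set.seed(best_seednumber)
xg_mod_train <- xgb.train(params = best_param, data = dtrain, nrounds = nround,
                          watchlist = list(train = dtrain, test = dtest), verbose = 0)
mean((predict(xg_mod_train, dtest) - Y_test)^2)  # MSE on the held-out set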
- Feature Importance
importance_matrix <- xgb.importance(colnames(X_train), model = xg_mod)
# Use `xgb.plot.importance`, which creates a barplot, or use `xgb.ggplot.importance`
library(Ckmeans.1d.dp) # for xgb.ggplot.importance
xgb.ggplot.importance(importance_matrix, top_n = 15,
measure = "Gain")
- Plot only 2 trees as an example (use trees = 1)
- Two ways to show it:
# two ways to do it:
library("DiagrammeR")
xgb.plot.tree(model = xg_mod, trees = 1, feature_names = colnames(X_train))
# xgb.plot.tree needs to be revised a little bit by adding `fontcolor ="black"` in the `nodes <- DiagrammeR::create_node_df` line
source("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/R/xgb.plot.tree2.R")
xgb.plot.tree2(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
- Project all trees onto one tree and plot it: a huge plot
xgb.plot.multi.trees(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
Future work:
Try the “xgboostExplainer” package; I am still working on it.
Recall the features selected by Lasso and best subset from the last post as a comparison:
Compared to Lasso and Best-subset
Lasso
## b.Estimate b.Std..Error b.t.value
## FF11_Water_MonthlyCost -3.949e-06 6.278e-07 -6.2902
## B14__resettled -1.651e-01 2.893e-02 -5.7052
## DD1_Location_Problemscanal 1.943e-01 3.020e-02 6.4357
## FF1_8_Water_Sourceswells -9.677e-02 2.118e-02 -4.5681
## Eviction_Threats 8.793e-02 2.218e-02 3.9638
## B14__declared_legal_protected 8.435e-02 2.702e-02 3.1216
## FF1_8_Water_Sourcesshared_taps -1.119e-01 3.084e-02 -3.6277
## DD1_Location_Problemsslope 7.522e-02 2.224e-02 3.3817
## FF1_8_Water_Sourceswater_tankers -1.247e-01 3.636e-02 -3.4310
## DD1_Location_Problemsunder_power_lines -8.192e-02 2.565e-02 -3.1944
## GG1_Sewer_Line 6.772e-02 2.350e-02 2.8818
## DD1_Location_Problemswater_body 8.081e-02 2.783e-02 2.9042
## CC9_Households -5.211e-06 1.874e-06 -2.7808
## DD1_Location_Problemsflood_prone_area 5.697e-02 2.107e-02 2.7037
## C13__no_rent 1.515e-01 5.604e-02 2.7030
## FF1_8_Water_Sourcescommunity_taps 5.601e-02 2.663e-02 2.1033
## FF1_8_Water_Sourcesrivers -4.035e-02 3.551e-02 -1.1363
## DD1_Location_Problemssinking_soil 2.615e-02 3.102e-02 0.8431
## (Intercept) 3.961e-01 3.433e-02 11.5381
## b.Pr...t.. stb
## FF11_Water_MonthlyCost 6.403e-10 -0.21723
## B14__resettled 1.886e-08 -0.21544
## DD1_Location_Problemscanal 2.646e-10 0.18813
## FF1_8_Water_Sourceswells 6.059e-06 -0.13182
## Eviction_Threats 8.332e-05 0.11916
## B14__declared_legal_protected 1.891e-03 0.11034
## FF1_8_Water_Sourcesshared_taps 3.121e-04 -0.10669
## DD1_Location_Problemsslope 7.707e-04 0.09968
## FF1_8_Water_Sourceswater_tankers 6.458e-04 -0.09805
## DD1_Location_Problemsunder_power_lines 1.480e-03 -0.08740
## GG1_Sewer_Line 4.106e-03 0.08206
## DD1_Location_Problemswater_body 3.827e-03 0.08085
## CC9_Households 5.605e-03 -0.07923
## DD1_Location_Problemsflood_prone_area 7.066e-03 0.07819
## C13__no_rent 7.081e-03 0.07605
## FF1_8_Water_Sourcescommunity_taps 3.588e-02 0.06259
## FF1_8_Water_Sourcesrivers 2.563e-01 -0.03116
## DD1_Location_Problemssinking_soil 3.995e-01 0.02372
## (Intercept) 8.958e-28 0.00000
Best-subset
## b.Estimate b.Std..Error b.t.value
## FF11_Water_MonthlyCost -5.469e-06 9.276e-07 -5.8960
## B14__resettled -2.061e-01 3.893e-02 -5.2933
## Eviction_Threats 1.751e-01 4.399e-02 3.9797
## DD1_Location_Problemscanal 2.218e-01 4.108e-02 5.3983
## FF1_8_Water_Sourceswater_tankers -2.080e-01 5.435e-02 -3.8267
## FF1_8_Water_Sourcesshared_taps -1.362e-01 4.393e-02 -3.1006
## EE2A_Current_Eviction_Threat -8.252e-02 4.475e-02 -1.8441
## DD1_Location_Problemsflood_prone_area 7.352e-02 2.866e-02 2.5655
## B14__declared_legal_protected 6.983e-02 3.690e-02 1.8925
## FF1_8_Water_Sourcesrivers -7.327e-02 4.944e-02 -1.4821
## FF1_8_Water_Sourcessprings -4.165e-02 4.110e-02 -1.0134
## FF1_8_Water_Sourcesdams -5.584e-02 5.874e-02 -0.9507
## DD1_Location_Problemsgarbage_dump -2.470e-02 3.155e-02 -0.7828
## FF12_Water_CollectionTime30_minutes 3.477e-02 4.548e-02 0.7646
## DD1_Location_Problemsroad_side 1.986e-02 2.998e-02 0.6624
## GG7_10_Toilet_Typesindividual_toilets -2.052e-02 3.840e-02 -0.5344
## JJ1_Electricity_Availableyes 5.476e-03 4.093e-02 0.1338
## (Intercept) 4.445e-01 6.387e-02 6.9591
## b.Pr...t.. stb
## FF11_Water_MonthlyCost 9.258e-09 -0.299393
## B14__resettled 2.204e-07 -0.271580
## Eviction_Threats 8.498e-05 0.238699
## DD1_Location_Problemscanal 1.293e-07 0.216368
## FF1_8_Water_Sourceswater_tankers 1.555e-04 -0.157373
## FF1_8_Water_Sourcesshared_taps 2.099e-03 -0.128307
## EE2A_Current_Eviction_Threat 6.607e-02 -0.107864
## DD1_Location_Problemsflood_prone_area 1.075e-02 0.101994
## B14__declared_legal_protected 5.931e-02 0.093230
## FF1_8_Water_Sourcesrivers 1.393e-01 -0.058064
## FF1_8_Water_Sourcessprings 3.116e-01 -0.044319
## FF1_8_Water_Sourcesdams 3.424e-01 -0.036997
## DD1_Location_Problemsgarbage_dump 4.343e-01 -0.030778
## FF12_Water_CollectionTime30_minutes 4.451e-01 0.029456
## DD1_Location_Problemsroad_side 5.082e-01 0.025531
## GG7_10_Toilet_Typesindividual_toilets 5.934e-01 -0.020818
## JJ1_Electricity_Availableyes 8.937e-01 0.005555
## (Intercept) 1.869e-11 0.000000
Compare MSEs
- The advantage of XGBoost is striking.
options(digits = 4)
(MSE <- data.frame(MSE_best.subset, MSE_Lasso,
MSE_rForest, MSE_boost, MSE_xgb))
## MSE_best.subset MSE_Lasso MSE_rForest MSE_boost MSE_xgb
## 1 0.066 0.06079 0.04104 0.03809 0.004265
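- Because Share_Temporary is a share between 0 and 1, the square roots of these values (RMSE) are a bit easier to read; a minimal sketch using the MSE data frame above:
# Sketch: RMSE is easier to interpret on the 0-1 scale of Share_Temporary
sqrt(MSE)
barplot(unlist(sqrt(MSE)), las = 2, ylab = "Test RMSE")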
Code
## Working Sample: Predicting Share of Temporary Structure in Slums Settlements ##
# 07/09/2018
library(readxl)
library(plyr)
library(ggplot2)
library(QuantPsyc) # used to show standardized regression coefficients
library(glmnet) # for Lasso
library(leaps) # for best subset
library(randomForest) # random forest
library(gbm) # for Gradient boosting
library(caret) # grid scan by train
library(xgboost) # for Extreme Gradient boosting
library(Ckmeans.1d.dp) # for xgb.ggplot.importance in xgboost
library(DiagrammeR) # for xgb.plot.tree in xgboost
library(knitr)
options(digits = 4)
# Data Cleaning -----------------------------------------------------------
sdi <- read_excel("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/Data/20180618_SDI 2.xlsx",
na = "", skip = 1)
# clean variables names, remove "/".
all_varnames <- names(sdi)
names(sdi) <- gsub("/", "", all_varnames)
# turn all character variables into factors
sdi[sapply(sdi, is.character)] <- lapply(sdi[sapply(sdi, is.character)],
as.factor)
sdi$Share_Temporary <- sdi$CC7_Structures_Temporary / (sdi$CC7_Structures_Temporary +sdi$CC6_Structures_Permanent)
# remove variables with over 10% NA
Remove <- (sapply(sdi, function(x) sum(is.na(x))) < dim(sdi)[1]*0.1)
mydata0 <- sdi[Remove]
mydata <- na.omit(mydata0) # remove all the NA
# fix some format issue
mydata$CC10_Household_Size <- as.numeric(mydata$CC10_Household_Size)
mydata$CC12_Total_Population <- as.numeric(mydata$CC12_Total_Population)
# remove useless factor
mydata2 <- subset(mydata, select = -c(1, City, Country)) # for some other use not shown here
# remove what are highly correlated to dependent variable
mydata3 <- subset(mydata2, select = - grep("Structure", names(mydata2)))
# divide training and test dataset
set.seed(123)
train_idx <- sample(dim(mydata3)[1], dim(mydata3)[1]* 0.6)
# The model.matrix() function is used in many regression packages for building
# an "X" matrix from data.
# need matrix for glmnet
X2 <- model.matrix(Share_Temporary~., data = mydata3)
Y2 <- as.matrix(mydata3[c("Share_Temporary")])
X_train <- X2[train_idx,]
X_test <- X2[-train_idx,]
Y_train <- Y2[train_idx]
Y_test <- Y2[-train_idx]
# merge back to df again, as df is required for regsubsets (best subset)
data_train <- data.frame(X2, Y2)[train_idx,]
mydata4 <- data.frame(X2, Y2)
# 1. Lasso ----------------------------------------------------------------
# Use cross-validation to choose lambda for Lasso
cv_lasso = cv.glmnet(X_train, Y_train, alpha=1) # Lasso regression
plot(cv_lasso)
(best_lam <- cv_lasso$lambda.1se)
coef_table <- coef(cv_lasso, s = cv_lasso$lambda.1se)
# variables chosen
# data.frame(name = coef_table@Dimnames[[1]][coef_table@i + 1], coefficient = coef_table@x)
# Check prediction error in test dataset
# The lasso model, then feed lambda to predict fitted value
lasso_mod <- glmnet(X_train, Y_train, alpha = 1)
# As lambda increase, "survived" variables decrease as coefficients move to zero
plot(lasso_mod, xvar = "lambda", label = TRUE)
lasso_pred <- predict(lasso_mod, s = best_lam, newx = X_test)
# The MSE (Mean square error)
(MSE_Lasso <- mean((lasso_pred - Y_test)^2))
# show the variables chosen into the model
var_list <- unlist(coef_table@Dimnames[[1]][coef_table@i + 1])
var_list
# run a regression using these variables
# Because there is intercept, we need to use [coef_table@i + 1], and [-1] in the end
sdi_sub <- mydata4[c(coef_table@Dimnames[[1]][coef_table@i + 1], "Share_Temporary")[-1]]
# lm model using variables chosen by Lasso 1se
reg.chosen1 <- lm(data = sdi_sub, Share_Temporary~.)
summary(reg.chosen1)
# Relative importance by showing Standardized parameter estimates in decreasing order
coef_table <- data.frame(b = reg.chosen1$coefficients, stb = c(0, lm.beta(reg.chosen1)))
coef_table[order(abs(coef_table$stb), decreasing = T),]
# 2. Best Subset ----------------------------------------------------------
time_now <- Sys.time()
nv_max <- 25 # the maximum allowed number of features to test
fit_s <- regsubsets(Share_Temporary~., data_train,
really.big = T, nbest=1, nvmax = nv_max
)
Sys.time() - time_now
val.errors =rep(0, nv_max)
# Need the X_test with intercept
for(i in 1:nv_max){
coefi = coef(fit_s, id=i)
pred = X_test[,names(coefi)]%*%coefi # need to use matrix here
val.errors[i]= mean((Y_test - pred)^2)
}
# choose the best number of predictors to use
(val_err_min <- which.min(val.errors))
# MSE
(MSE_best.subset <- val.errors[val_err_min])
# The reg model selected:
coef_list = coef(fit_s, id = val_err_min)
# create sdi_sub2 which has all the selected predictors, and dependent variable,
# remove intercept by adding [-1]
sdi_sub2 <- mydata4[c(names(coef_list), "Share_Temporary")[-1]]
reg_chosen_bs <- lm(data = sdi_sub2, Share_Temporary ~ .)
summary(reg_chosen_bs)
# get output of b and stb sorted by |stb|
coef_table_bs <- data.frame(b = reg_chosen_bs$coefficients, stb = c(0, lm.beta(reg_chosen_bs)))
coef_table_bs[order(abs(coef_table_bs$stb), decreasing = T),]
# the chosen model from cross-validation
reg.chosen4 <- lm(data = mydata4[train_idx, ], Share_Temporary~
FF11_Water_MonthlyCost +
B14__resettled +
Eviction_Threats +
DD1_Location_Problemsflood_prone_area +
FF1_8_Water_Sourcesshared_taps +
DD1_Location_Problemscanal +
FF1_8_Water_Sourceswater_tankers +
B14__declared_legal_protected +
FF1_8_Water_Sourcessprings +
JJ1_Electricity_Availableyes +
GG7_10_Toilet_Typesindividual_toilets +
FF1_8_Water_Sourcesdams +
FF1_8_Water_Sourcesrivers +
FF12_Water_CollectionTime30_minutes +
EE2A_Current_Eviction_Threat +
DD1_Location_Problemsroad_side +
DD1_Location_Problemsgarbage_dump
)
# MSE
(MSE_best.subset <- mean((predict(reg.chosen4, mydata4[-train_idx,]) - Y_test)^2))
# 3. Random Forest --------------------------------------------------------
# Easy to use: the defaults work well here, so no tuning is done
bag.fit <- randomForest(Share_Temporary ~ ., data = mydata4, subset = train_idx)
# rf_test <- mydata3[-train_idx, "Share_Temporary"]
# Predict on the test data: mydata4[-train_idx,]
yhat_bag <- predict(bag.fit, newdata = mydata4[-train_idx,])
# MSE
(MSE_rForest <- mean((yhat_bag - Y_test)^2))
varImpPlot(bag.fit, n.var=15)
# 4. Gradient boosting ----------------------------------------------------
# Tuning Method: use `train` function from `caret` to scan for parameters.
library(gbm) # for Gradient boosting
library(caret) # scan the parameter grid using `train` function
time_now <- Sys.time()
para_grid <- expand.grid(n.trees = (20*c(1:100)),
shrinkage = c(0.1, 0.05, 0.01),
interaction.depth = c(1:5),
n.minobsinnode = 10)
trainControl <- trainControl(method = "cv", number = 10)
set.seed(123)
gbm_caret <- train(Share_Temporary ~ ., mydata4[train_idx,],
distribution = "gaussian", method = "gbm",
trControl = trainControl, verbose = FALSE,
tuneGrid = para_grid, metric = "RMSE", bag.fraction = 0.75)
# The tuning parameters that give lowest MSE in training set CV.
gbm_caret$bestTune
yhat_boost <- predict(gbm_caret, mydata4[-train_idx,])
(MSE_boost <- mean((yhat_boost - Y_test)^2))
Sys.time() - time_now
# 5. Extreme Gradient Boosting --------------------------------------------------------------
# Tuning Method: randomize parameters and update the record with the best ones.
library(xgboost)
# Matrix for xgb: dtrain
dtrain <- xgb.DMatrix(X_train, label = Y_train)
dtest <- xgb.DMatrix(X_test, label = Y_test)
# # For a single cv
# xg_cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
# max_depth = 3, eta = 0.1, objective = "reg:linear",
# eval_metric = "rmse")
# print(xg_cv)
best_param <- list()
best_seednumber <- 1234
best_rmse <- Inf
best_rmse_index <- 0
time_now <- Sys.time()
set.seed(1234)
for (iter in 1:100) {
param <- list(objective = "reg:linear",
eval_metric = "rmse",
max_depth = sample(6:10, 1),
eta = runif(1, .01, .3), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(1:40, 1),
max_delta_step = sample(1:10, 1)
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 8, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
# The best index (min_rmse_index) is the best "nround" in the model
nround = best_rmse_index
set.seed(best_seednumber)
xg_mod <- xgboost(data = dtest, params = best_param, nround = nround, verbose = F)
# Check error in testing data
yhat_xg <- predict(xg_mod, dtest)
(MSE_xgb <- mean((yhat_xg - Y_test)^2))
best_param_store <- data.frame(best_param, best_rmse_index, best_rmse, best_seednumber)
Sys.time() - time_now
# Feature importance
importance_matrix <- xgb.importance(colnames(X_train), model = xg_mod)
# Use `xgb.plot.importance`, which creates a barplot, or use `xgb.ggplot.importance`
library(Ckmeans.1d.dp) # for xgb.ggplot.importance
xgb.ggplot.importance(importance_matrix, top_n = 15,
measure = "Gain")
# plot the trees from model using `xgb.plot.tree`
# xgb.plot.tree needs to be revised a little bit by adding `fontcolor="black"` in the `nodes <- DiagrammeR::create_node_df` line
# only show 1 tree as an example.
#
source("C:/Users/lyhel/OneDrive - nyu.edu/NYU Marron/R/xgb.plot.tree2.R")
library("DiagrammeR")
xgb.plot.tree(model = xg_mod, trees = 1, feature_names = colnames(X_train))
xgb.plot.tree2(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train))
# Project all trees on one tree and plot it: A huge plot
xgb.plot.multi.trees(model = xg_mod, n_first_tree = 1, feature_names = colnames(X_train),
features_keep = 3)
# Summarize MSEs ----------------------------------------------------------
options(digits = 4)
(MSE <- data.frame(MSE_best.subset, MSE_Lasso,
MSE_rForest, MSE_boost, MSE_xgb))