Posts

Showing posts from June, 2018

SAS’s Best Subset Selection by Mallows's Cp is actually Stepwise?

SAS’s Best Subset is actually Stepwise?

I answered this question in my next post: "Best subset selection uses the branch and bound algorithm to speed up". The original post, explaining my own confusion, follows.

I suspect that the function regsubsets from the R library leaps does go through all the possible combinations, and it takes hours for 50 variables. My suspicion is that, facing a large number of variables, SAS just falls back to stepwise selection, even though the code asks for best subset by adding selection = cp to the model statement of proc reg. I don’t know this for sure; maybe it never scans all the combinations.

I tested with the same dataset from the last post. Without cross-validation, I used all 598 observations to run regsubsets:

# R:
library(leaps)
nv_max <- 25  # upper limit on the number of variables to test
fit_s <- regsubsets(Share_Temporary ~ ., mydata4, really.big = TRUE, nbest = 1, nvmax = nv_max)

As we can see from the figure above, the number of variables with

Modeling of Slums: Model Selection using Lasso and Best Subset

Model Selection using Lasso and Best Subset

1. Linear regression model with Lasso feature selection
2. Linear regression model with Best Subset selection
3. Random Forest
Conclusion
Complete Code

I will give a short introduction to statistical learning and modeling, then apply feature (variable) selection using Best Subset and Lasso. The dependent variable to model is Share_Temporary: the share of temporary structures in slums. The independent variables are monitoring indicators such as water, sanitation, housing conditions and overcrowding. Each of the 600 observations used is a slum settlement. I will also compare the results to Random Forest at the end.

Prediction and Inference

The two main motivations for statistical modeling (e.g. running a linear regression model) are prediction (to predict) and inference (to explain). Prediction usually needs cross-validation, which fits the model on a training dataset. By checking the model’s performance on a separ