Statistics and Data Analysis

Posts

Showing posts from June, 2018

SAS’s Best Subset Selection by Mallows's Cp is actually Stepwise?

- June 25, 2018

I answered this question in my next post, "Best subset selection uses the branch and bound algorithm to speed up". The original post, explaining my own confusion, follows.

I suspect that the function regsubsets from the R package leaps does go through all the possible combinations, which takes hours for 50 variables. Faced with a large number of variables, SAS seems to just use stepwise selection, even though the code asks for best subset by adding selection = cp to the model statement of proc reg. This is only my suspicion; maybe it never scans all the combinations.

I tested with the same dataset as in the last post. Without cross-validation, I used all 598 observations to run regsubsets:

# R:
nv_max <- 25  # upper limit on the number of variables to test
fit_s <- regsubsets(Share_Temporary ~ ., mydata4, really.big = TRUE, nbest = 1, nvmax = nv_max)

As we can see from the figure above, the number of variables with
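To get a feel for why a truly exhaustive search would be slow, the number of candidate models can be counted directly. The sketch below is only an illustration. It assumes p = 50 candidate variables (the figure mentioned in the text) and reuses the nv_max = 25 cap from the regsubsets call. The names p, n_by_size, n_capped and n_all are hypothetical, and the code says nothing about what regsubsets or proc reg do internally.

```r
# Count the candidate models an exhaustive best subset search would face.
# Assumption: p = 50 candidate variables, as mentioned in the text.
p <- 50
nv_max <- 25  # same cap as in the regsubsets call

# Number of distinct subsets of each size k = 1, ..., nv_max
n_by_size <- choose(p, 1:nv_max)

# Total number of models with at most nv_max variables
n_capped <- sum(n_by_size)

# Total number of non-empty subsets without any cap: 2^p - 1
n_all <- 2^p - 1

cat("Subsets with up to", nv_max, "variables:",
    format(n_capped, big.mark = ",", scientific = FALSE), "\n")
cat("All non-empty subsets:",
    format(n_all, big.mark = ",", scientific = FALSE), "\n")
```

Even with the cap at 25 variables, the count is still more than half of all 2^50 subsets, on the order of 10^14 models, so fitting every one of them separately is not practical.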

Modeling of Slums: Model Selection using Lasso and Best Subset

- June 25, 2018

1. Linear regression model with Lasso feature selection
2. Linear regression model with Best Subset selection
3. Random Forest
Conclusion
Complete Code

I will give a short introduction to statistical learning and modeling, and apply feature (variable) selection using Best Subset and Lasso. The dependent variable to model is Share_Temporary: Share of Temporary Structure in Slums. The independent variables are monitoring indicators such as water, sanitation, housing conditions and overcrowding. Each of the 600 observations used is a slum settlement. I will also compare the results to Random Forest at the end.

Prediction and Inference

The two main motivations for statistical modeling (e.g. running a linear regression model) are prediction (to predict) and inference (to explain). Prediction usually needs cross-validation, which fits the model on a training dataset. By checking the model’s performance on a separ