Best subset selection uses the branch and bound algorithm to speed up

Answering my own question in the last post: best-subset selection is not stepwise. The speed was achieved due to the specific algorithm applied. Same results from R stepwise and SAS best subset were obtained due to coincidence. After I remove some variables the outputs are no longer the same. The output from R best subset is still not exactly the same as SAS best subset. The differences are very small, only on several predictors with trivial effects.

I discussed this question on SAS community, got very good comments that remind me obviously R did not search for all the combinations either. Some speed-up algorithm must have been applied.

Then I sent an email to SAS technical help (follow this page), and got a reply in a day. The algorithm is called "Furnival-Wilson leaps-and-bounds" (1974), which seems quite basic in computer science after I figured out what it is. This article is fairly easy to read: Branch and bound in statistical data analysis D.J.Hand.
It scans for combinations given each allowed numbers of variables to use (1, 2, 3, ..., p) and find an optimal solution for each given level (thus Cp is bounded). This is why the program always reports the optimal model based on the number of predictors chosen: e.g., the best 1-predictor model, the best 2-predictor model, the best 10-predictor model... etc.


I hereby attach the reply from SAS:

--------------------------

Yang,
With selection-RSQUARE, ADJRSQ, and CP and n=number of regressors >=11, by default REG will only DISPLAY the best n subset models for each number of regressors. The best n one-variable models, best n two-variable models, etc. These can be computed (using the Furnival and Wilson algorithm) without examining every possible model of every possible size and so this is typically much faster than if all models of each size need to be displayed.

By default, when you run PROC REG with the SELECTION=CP and the STOP=10 option, and you have 20 regressors in the model, PROC REG will display at most 20 models for each of the 1-variable models, 2-variable models, 3-variable models, ...through 10-variable models.  In other words, the maximum number of models displayed will be equal to the number of predictor variables in the MODEL statement (if the number of predictors listed in the MODEL statement is greater than 11).   

To obtain more models than the number displayed by default, you will need to add the BEST= option to the MODEL statement.  For example, if you have 20 predictors in your MODEL statement, but you want to see up to 35 models in each of the possible subsets, then your PROC REG step would need to look something like:

---------------------
proc reg data=test;
  model y = x1-x20 / selection=rsquare stop=10 best=35;
run;
quit;
----------------------

I hope the above information is helpful.

Kathleen Kiernan
Senior Principal Technical Support Statistician

Comments

Popular posts from this blog

How to Draw Heatmap with Colorful Dendrogram

Power-law distribution (Pareto)& Zipf's Law: connection and how to fit the distribution of global city population

eXtreme Gradient Boosting (XGBoost): Better than random forest or gradient boosting