When you have sufficient data, you can subdivide your data into three parts: training, validation, and test data. During the selection process, models are fit on the training data, and the prediction error for the models so obtained is estimated by using the validation data. This prediction error on the validation data can be used to decide when to terminate the selection process or to decide which effects to include as the selection process proceeds. Finally, after a selected model has been obtained, the test set can be used to assess how the selected model generalizes to data that played no role in selecting the model.
In some cases you might want to use only training and test data. For example, you might decide to use an information criterion to decide which effects to include and when to terminate the selection process. In this case no validation data are required, but test data can still be useful in assessing the predictive performance of the selected model. In other cases you might decide to use validation data during the selection process but forgo assessing the selected model on test data. Hastie, Tibshirani, and Friedman (2001) note that it is difficult to give a general rule for how many observations you should assign to each role; they suggest that a typical split might be 50% for training and 25% each for validation and testing.
PROC QUANTSELECT provides several methods for partitioning data into training, validation, and test data. You can provide data for each role in separate data sets that you specify with the DATA=, TESTDATA=, and VALDATA= options in the PROC QUANTSELECT statement. Alternatively, you can use a PARTITION statement to logically subdivide the DATA= data set into separate roles by naming the fractions of the data that you want to reserve as test data and as validation data. The following statements randomly subdivide the inData data set, using 25% of the data for validation and 25% for testing and leaving 50% of the data for training:
proc quantselect data=inData;
   partition fraction(test=0.25 validate=0.25);
   ...
run;
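As an illustration of the first method, the following sketch supplies each role in its own data set (the data set names and the model specification are illustrative, not taken from this document):

proc quantselect data=inTrain testdata=inTest valdata=inVal;
   /* inTrain is used for fitting; inTest and inVal supply
      the observations for the test and validation roles */
   model y=x1-x10;
run;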
If you need to exercise more control over the partitioning of the input data set, you can name a variable in the input data set and a formatted value of that variable to correspond to each role. The following statements assign roles to observations in the inData data set based on the value of the variable named group in that data set:
proc quantselect data=inData;
   partition roleVar=group(test='group 1' train='group 2');
   ...
run;
Observations whose value of the variable group is group 1 are assigned to testing, and those whose value is group 2 are assigned to training. All other observations are ignored.
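For instance, the group variable might be created with a DATA step like the following sketch (the source data set rawData and the alternating assignment rule are purely illustrative):

data inData;
   set rawData;
   length group $ 7;
   /* assign every other observation to group 1, the rest to group 2 */
   if mod(_n_, 2) = 0 then group = 'group 1';
   else group = 'group 2';
run;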
You can also combine the use of the PARTITION statement with named data sets for specifying data roles. For example, the following statements reserve 40% of the inData data set for validation, leaving the remaining 60% for training:
proc quantselect data=inData testData=inTest;
   partition fraction(validate=0.4);
   ...
run;
Data for testing are supplied in the inTest data set. Because a TESTDATA= data set is specified, additional observations for testing cannot be reserved by specifying a PARTITION statement.
When you use a PARTITION statement, the output data set that is created by an OUTPUT statement contains a character variable _ROLE_ whose values TRAIN, TEST, and VALIDATE indicate the role of each observation. _ROLE_ is blank for observations that were not assigned to any of these three roles. If the input data set that is specified in the DATA= option in the PROC QUANTSELECT statement contains a _ROLE_ variable, no PARTITION statement is used, and the TESTDATA= and VALDATA= options are not specified, then the _ROLE_ variable is used to define the role of each observation. This is useful when you want to rerun PROC QUANTSELECT with the same data partitioning that you used in a previous PROC QUANTSELECT step. For example, the following statements use the same data for testing and training in both PROC QUANTSELECT steps:
proc quantselect data=inData;
   partition fraction(test=0.5);
   model y=x1-x10 / selection=forward;
   output out=outDataForward;
run;

proc quantselect data=outDataForward;
   model y=x1-x10 / selection=backward;
run;
When you have reserved observations for training, validation, and testing, a model that is fit on the training data is scored on the validation and test data, and the average check loss, denoted by ACL, is computed separately for each of these subsets. The ACL for each data role is the sum of check losses for observations in that role divided by the number of observations in that role.
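In symbols, for quantile level \(\tau\), the ACL can be written in terms of the standard check loss of quantile regression (the notation here is ours, not taken from this document):

\[
\rho_\tau(r) = r\bigl(\tau - I(r < 0)\bigr), \qquad
\mathrm{ACL} = \frac{1}{n}\sum_{i=1}^{n} \rho_\tau\bigl(y_i - \hat{y}_i\bigr),
\]

where the sum runs over the \(n\) observations that belong to the given role and \(\hat{y}_i\) is the predicted quantile for observation \(i\).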
If you have provided observations for validation, then you can use the STOP=VALIDATE method-option to specify the validation ACL as the STOP= criterion in the SELECTION= option in the MODEL statement. At step k of the selection process, the best candidate effect to enter or leave the current model is determined; here the best candidate is the effect that yields the best value of the SELECT= criterion, which need not be based on the validation data. The validation ACL for the model with this candidate effect added or removed is then computed. If this validation ACL is greater than the validation ACL for the model at step k, then the selection process terminates at step k.
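A minimal sketch of this usage, assuming a forward selection with illustrative variables y and x1-x10:

proc quantselect data=inData;
   /* reserve 30% of inData for validation */
   partition fraction(validate=0.3);
   /* stop adding effects once the validation ACL starts to increase */
   model y=x1-x10 / selection=forward(stop=validate);
run;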
When you specify the CHOOSE=VALIDATE method-option in the SELECTION= option in the MODEL statement, the validation ACL is computed for the model at each step of the selection process. The model at the step that yields the smallest validation ACL is selected; if several steps yield this smallest value, the model among them that contains the fewest effects is selected.
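For example, the following sketch (again with illustrative variables) runs a backward elimination and selects, among the models visited, the one with the smallest validation ACL:

proc quantselect data=inData;
   partition fraction(validate=0.3);
   /* choose the step whose model minimizes the validation ACL */
   model y=x1-x10 / selection=backward(choose=validate);
run;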
You request the validation ACL as the selection criterion by specifying the SELECT=VALIDATE method-option in the SELECTION= option in the MODEL statement. At step k of the selection process, the validation ACL is computed for each model that is formed by adding a candidate effect for entry or by dropping a candidate effect for removal. The candidate that is selected for entry or removal is the one whose model yields the smallest validation ACL.
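A corresponding sketch, assuming a stepwise selection with the same illustrative variables:

proc quantselect data=inData;
   partition fraction(validate=0.3);
   /* at each step, enter or remove the candidate that minimizes
      the validation ACL */
   model y=x1-x10 / selection=stepwise(select=validate);
run;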