The QUANTSELECT Procedure

Example 84.4 Surface Fitting with Many Noisy Variables

This example is based on the example "Surface Fitting with Many Noisy Variables" in Chapter 25: The ADAPTIVEREG Procedure. It shows how you can use the EFFECT statement to select a nonlinear surface model for a data set that contains many nuisance variables.

Consider a simulated data set that contains a response variable and 10 continuous predictors. Each continuous predictor is sampled independently from the uniform distribution $U(0,1)$. The true model of the artificial data set depends nonlinearly on two variables x1 and x2:

\[  y = \frac{40\exp \left(8\left((x_1-0.5)^2+(x_2-0.5)^2\right)\right)}{\exp \left(8\left((x_1-0.2)^2+(x_2-0.7)^2\right)\right)+ \exp \left(8\left((x_1-0.7)^2+(x_2-0.2)^2\right)\right)}  \]

The values of the response variable are generated by adding errors from the standard normal distribution $N(0,1)$ to the true model. The generating mechanism is adapted from Gu et al. (1990). The following statements create an artificial data set that contains 400 observations for effect selection and 10,201 observations with missing response values for prediction:

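%let p=10;
data artificial;
   drop i j;      /* loop counters are not needed in the data set */
   array x{&p};   /* creates x1-x&p */

   /* 400 training observations: every predictor is a U(0,1) draw */
   do i=1 to 400;
      do j=1 to &p;
         x{j} = ranuni(1);
      end;
      /* true surface in x1 and x2 plus N(0,1) noise */
      yTrain = 40*exp(8*((x1-0.5)**2+(x2-0.5)**2))/
          (exp(8*((x1-0.2)**2+(x2-0.7)**2))+
          exp(8*((x1-0.7)**2+(x2-0.2)**2)))+rannor(1);
      output;
   end;

   /* 101 x 101 prediction grid over (x1, x2): yTrain is missing, */
   /* and y holds the noise-free true surface                     */
   yTrain = .;
   do x1=0 to 1 by 0.01;
      do x2 = 0 to 1 by 0.01;
         y = 40*exp(8*((x1-0.5)**2+(x2-0.5)**2))/
             (exp(8*((x1-0.2)**2+(x2-0.7)**2))+
             exp(8*((x1-0.7)**2+(x2-0.2)**2)));
         output;
      end;
   end;
run;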
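
Before fitting, it can be worth confirming that the data set has the layout described above. The following is a minimal sketch, not part of the original example: it assumes the data set artificial was created by the preceding DATA step. If it was, yTrain should be nonmissing for the 400 training observations and missing for the 10,201 grid points, and y should show the reverse pattern.

```sas
/* Sanity check (illustrative only): count nonmissing and missing values */
/* and show the range of the responses and the two relevant predictors.  */
proc means data=artificial n nmiss min max;
   var yTrain y x1 x2;
run;
```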

The variables x3 through x10 are nuisance variables that can cause overfitting in your analysis. The following statements invoke the QUANTSELECT procedure to select effects, fit a model on the selected effects, and output the model predictions to an output data set Out:

%macro art;
   proc quantselect data=artificial algorithm=smooth;
      %do i=1 %to &p;
         effect sp&i = spline(x&i);
      %end;
      model yTrain =
         sp1 %do i=2 %to &p; |sp&i %end; @2/details=all;
      output out=Out p=pred;
   run;
%mend;

%art;
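
To make the macro logic concrete, the following sketch writes out by hand what the macro would generate if only the first three predictors were used. The choice of three predictors and the output data set name Out3 are assumptions made for brevity and are not part of the original example. The bar operator combined with @2 requests all main effects and all two-way interactions among the listed effects.

```sas
/* Hand-written equivalent of the macro body for three predictors.  */
/* Out3 is a hypothetical data set name used only for illustration. */
proc quantselect data=artificial algorithm=smooth;
   effect sp1 = spline(x1);
   effect sp2 = spline(x2);
   effect sp3 = spline(x3);
   model yTrain = sp1 | sp2 | sp3 @2 / details=all;
   output out=Out3 p=pred;
run;
```

To see the statements that the macro actually generates for all &p predictors, you can submit `options mprint;` before calling %art and inspect the SAS log.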

You can use the EFFECT statement to generate nonlinear effects and model a nonlinear surface. This example defines one spline effect for each predictor and includes all the two-way interactions among these spline effects.

The ALGORITHM=SMOOTH option specifies the smoothing algorithm for model fitting. It takes approximately 2.8 seconds to select the model on a PC that has an Intel i7-2600 quad-core CPU and the 64-bit Windows 7 Enterprise operating system. If you use the ALGORITHM=SIMPLEX option, which is the default, the same computation takes approximately 8.7 seconds.

Output 84.4.1 shows the model information. By default, the effect selection method is the stepwise method, and the selection criterion is SBC for the SELECT=, CHOOSE=, and STOP= options. The default quantile level is 0.5 for median regression.
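
For reference, the standard quantile regression objective can be written as follows. The notation ($\tau$ for the quantile level, $\mathbf{x}_i$ for the row of model effects for observation $i$, $\boldsymbol{\beta}$ for the parameter vector, and $n$ for the number of training observations) is introduced here for illustration and does not appear in the original example:

\[  \hat{\boldsymbol{\beta}}(\tau) = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho_\tau\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}\right), \qquad \rho_\tau(r) = r\left(\tau - I(r<0)\right)  \]

At $\tau = 0.5$, the check loss reduces to $\rho_{0.5}(r) = |r|/2$, so median regression minimizes the sum of absolute residuals.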

Output 84.4.1: Model Information

The QUANTSELECT Procedure

Model Information
Data Set              WORK.ARTIFICIAL
Dependent Variable    yTrain
Selection Method      Stepwise
Quantile Type         Single Level
Select Criterion      SBC
Stop Criterion        SBC
Choose Criterion      SBC
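
The defaults reported in the model information can also be stated explicitly in the MODEL statement, which makes the analysis easier to read and to modify. The following is a sketch under assumptions: the macro name art_explicit and the output data set name OutExplicit are hypothetical, and you should check the option spellings against the MODEL statement syntax for your SAS/STAT release. If the defaults are as described, this call should select the same model as %art.

```sas
/* Same analysis as %art, with the default quantile level, selection  */
/* method, and criteria written out explicitly (illustrative sketch). */
%macro art_explicit;
   proc quantselect data=artificial algorithm=smooth;
      %do i=1 %to &p;
         effect sp&i = spline(x&i);
      %end;
      model yTrain =
         sp1 %do i=2 %to &p; |sp&i %end; @2
         / quantile=0.5
           selection=stepwise(select=sbc stop=sbc choose=sbc)
           details=all;
      output out=OutExplicit p=pred;
   run;
%mend;

%art_explicit;
```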



Output 84.4.2 shows the best 10 entry candidates at step 1 of the selection. You can see that sp1*sp2 is the most important effect, followed by sp1 and sp2.

Output 84.4.2: Best 10 Entry Candidates at Step 1

Best 10 Entry Candidates
Rank   Effect     SBC
1      sp1*sp2    -496.6752
2      sp1         165.9104
3      sp2         178.2126
4      sp3         213.4593
5      sp6         220.8471
6      sp7         222.0916
7      sp9         224.3185
8      sp4         224.7100
9      sp8         226.8373
10     sp5         227.2176



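Output 84.4.3 shows the selection summary. The selected model contains the intercept and the single effect sp1*sp2, which gives the optimal SBC value.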

Output 84.4.3: Selection Summary

The QUANTSELECT Procedure
Quantile Level = 0.5

Selection Summary
Step   Effect Entered   Number Effects In   Number Parms In   SBC
0      Intercept        1                   1                 195.8108
1      sp1*sp2          2                   49                -496.6752*
* Optimal Value Of Criterion



The following statements produce a graph that shows both the true model and the fitted model:

ods graphics on;
data pred;
   set out;
   where yTrain=.;
run;

%let off0 = offsetmin=0 offsetmax=0;
%let off0 = xaxisopts=(&off0) yaxisopts=(&off0);
%let eopt = location=outside valign=top textattrs=graphlabeltext;
proc template;
   define statgraph surfaces;
      begingraph / designheight=360px;
         layout lattice/columns=2;
            layout overlay / &off0;
               entry "True Model" / &eopt;
               contourplotparm z=y y=x2 x=x1;
            endlayout;
            layout overlay / &off0;
               entry "Fitted Model" / &eopt;
               contourplotparm z=pred y=x2 x=x1;
            endlayout;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=pred template=surfaces;
run;

Output 84.4.4 displays surfaces for both the true model and the fitted model. You can see that the fitted model nicely approximates the underlying true model.
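
To complement the visual comparison with a numeric one, you can summarize the pointwise difference between the fitted surface and the true surface over the prediction grid. The following is a minimal sketch, not part of the original example: it assumes the data set pred created earlier, and the data set name predErr and the variable names absErr and sqErr are hypothetical.

```sas
/* Pointwise errors of the fitted surface on the prediction grid */
data predErr;
   set pred;
   absErr = abs(pred - y);      /* absolute error at each grid point */
   sqErr  = (pred - y)**2;      /* squared error at each grid point  */
run;

/* Mean and maximum errors; the mean of sqErr is the mean squared error */
proc means data=predErr mean max;
   var absErr sqErr;
run;
```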

Output 84.4.4: True Model and Fitted Model

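[Figure: side-by-side contour plots titled "True Model" and "Fitted Model", each with x1 on the horizontal axis and x2 on the vertical axis]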