The ADAPTIVEREG Procedure (Experimental)

Getting Started: ADAPTIVEREG Procedure

This example concerns city-cycle fuel efficiency and automobile characteristics for 398 vehicle models made from 1970 to 1982. The data can be downloaded from the UCI Machine Learning Repository (Asuncion and Newman, 2007). The following DATA step creates the data set Autompg:

title 'Automobile MPG Study';
data Autompg;
   input MPG Cylinders Displacement Horsepower Weight
         Acceleration Year Origin Name $35.;
   datalines;
18.0 8 307.0   130.0   3504   12.0   70  1  Chevrolet Chevelle Malibu
15.0 8 350.0   165.0   3693   11.5   70  1  Buick Skylark 320
18.0 8 318.0   150.0   3436   11.0   70  1  Plymouth Satellite
16.0 8 304.0   150.0   3433   12.0   70  1  AMC Rebel SST

   ... more lines ...   

44.0 4 97.00   52.00   2130   24.6   82  2  VW Pickup
32.0 4 135.0   84.00   2295   11.6   82  1  Dodge Rampage
28.0 4 120.0   79.00   2625   18.6   82  1  Ford Ranger
31.0 4 119.0   82.00   2720   19.4   82  1  Chevy S-10
;

There are nine variables in the data set. The response variable MPG is city-cycle fuel efficiency in miles per gallon. Seven predictor variables (Cylinders, Displacement, Horsepower, Weight, Acceleration, Year, and Origin) provide vehicle attributes. Among them, Cylinders, Year, and Origin are categorical variables. The last variable, Name, contains the specific name of each vehicle model.
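Before modeling, it can help to confirm how many distinct levels the three categorical predictors take. The following step is a minimal sketch that is not part of the original example; it assumes only that the Autompg data set has been created as shown above.

```sas
/* Tabulate the three predictors that this example treats as categorical. */
/* NOCUM suppresses the cumulative frequency and percent columns.         */
proc freq data=autompg;
   tables cylinders year origin / nocum;
run;
```

The level counts in this output can be compared with the Class Level Information table that PROC ADAPTIVEREG displays.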

The dependency of vehicle fuel efficiency on various factors might be nonlinear. There might also be redundant predictor variables as a result of dependency structures within predictors. For example, a vehicle model with more cylinders is likely to have more horsepower. The objective of this example is to explore the nonlinear dependency structure and also to produce a parsimonious model that does not overfit and thus has good predictive power.

By default, PROC ADAPTIVEREG fits a nonparametric regression model that includes two-way interactions between spline basis functions. You can try models with higher interaction orders by specifying the MAXORDER= option in the MODEL statement. For this particular data set, however, the sample size is relatively small. Restricting model complexity by specifying an additive model can both improve model interpretability and reduce model variance without sacrificing much predictive power. An additive model consists of nonparametric transformations of individual variables. Both the transformation of each variable and the selection of transformed terms are performed adaptively and automatically.

The following invocation of the ADAPTIVEREG procedure fits an additive model with linear spline terms of the continuous predictors:

ods graphics on;

proc adaptivereg data=autompg plots=all;
   class cylinders year origin;
   model mpg = cylinders displacement horsepower
               weight acceleration year origin / additive;
run;
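As a variation on the preceding call, the following sketch drops the ADDITIVE option and raises the interaction order with the MAXORDER= option that is mentioned above. The value 3 is an arbitrary choice for illustration, and this is not the model that is analyzed in the rest of this example.

```sas
/* Variation only: allow interactions among up to three basis functions. */
/* The output shown in this section comes from the additive fit instead. */
proc adaptivereg data=autompg;
   class cylinders year origin;
   model mpg = cylinders displacement horsepower
               weight acceleration year origin / maxorder=3;
run;
```

With a sample of this size, a higher-order fit is likely to have more variance than the additive model, so it is best treated as an exploratory comparison.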

Figure 24.1 shows the Model Information and Fit Controls tables that PROC ADAPTIVEREG produces for the additive model.

Figure 24.1: Model Information and Fit Controls

Automobile MPG Study

The ADAPTIVEREG Procedure

Model Information
Data Set             WORK.AUTOMPG
Response Variable    MPG
Class Variables      Cylinders Year Origin
Distribution         Normal
Link Function        Identity

Fit Controls
Maximum Number of Bases         21
Maximum Order of Interaction    1
Degrees of Freedom per Knot     2
Knot Separation Parameter       0.05
Penalty for Variable Reentry    0
Missing Value Handling          Include
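Several of the fit controls can be set explicitly in the MODEL statement. The following sketch restates three of the values from Figure 24.1. It assumes that the MAXBASIS=, DFPERBASIS=, and VARPENALTY= options correspond to the rows Maximum Number of Bases, Degrees of Freedom per Knot, and Penalty for Variable Reentry; that mapping is not stated in this section, so check it against the MODEL statement syntax before relying on it.

```sas
/* Sketch: same additive model with three fit controls written out.    */
/* The option-to-row mapping is an assumption, as noted in the lead-in. */
proc adaptivereg data=autompg;
   class cylinders year origin;
   model mpg = cylinders displacement horsepower
               weight acceleration year origin
               / additive maxbasis=21 dfperbasis=2 varpenalty=0;
run;
```

If the mapping holds, this call should produce the same Fit Controls table, which makes it a convenient starting point for trying other values.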


In addition to listing classification variables in the Model Information table, PROC ADAPTIVEREG displays level information about the classification variables that are specified in the CLASS statement. The table in Figure 24.2 lists the levels of the classification variables Cylinders, Year, and Origin. Although the values of Cylinders and Year are naturally ordered, they are treated as ordinary classification variables.

Figure 24.2: Class Level Information

Class Level Information
Class        Levels    Values
Cylinders    5         3 4 5 6 8
Year         13        70 71 72 73 74 75 76 77 78 79 80 81 82
Origin       3         1 2 3


The Fit Statistics table (Figure 24.3) lists summary statistics of the fitted regression spline model. Because the final model is essentially a linear model, several fit statistics are reported as if the model were fitted with basis functions as predetermined effects. However, because the model selection process and the determination of basis functions are highly nonlinear, additional statistics that incorporate the extra source of degrees of freedom are also displayed. The statistics include effective degrees of freedom, the generalized cross validation (GCV) criterion, and the GCV R-square value.

Figure 24.3: Fit Statistics

Fit Statistics
GCV                             11.55804
GCV R-Square                    0.81128
Effective Degrees of Freedom    23
R-Square                        0.83161
Adjusted R-Square               0.82682
Mean Square Error               10.57977
Average Square Error            10.26079
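The statistics in Figure 24.3 can be tied together with a short calculation. This is a sketch under assumptions that the output does not state: the GCV criterion is taken to have the usual multivariate adaptive regression splines form, the effective degrees of freedom (EDF) are taken to be M + d(M - 1)/2, where M = 12 is the number of selected bases including the intercept and d = 2 is the Degrees of Freedom per Knot, and n = 398 is the number of observations implied by the ratio of Mean Square Error (MSE) to Average Square Error (ASE).

```latex
\begin{aligned}
\mathrm{EDF} &= M + \frac{d\,(M-1)}{2} = 12 + \frac{2 \cdot 11}{2} = 23, \\
\mathrm{GCV} &= \frac{\mathrm{ASE}}{\left(1 - \mathrm{EDF}/n\right)^{2}}
             = \frac{10.26079}{\left(1 - 23/398\right)^{2}} \approx 11.558, \\
\frac{\mathrm{MSE}}{\mathrm{ASE}} &= \frac{n}{n - M} = \frac{398}{386} \approx 1.0311 .
\end{aligned}
```

Under these assumptions the results agree with the reported Effective Degrees of Freedom, GCV, and MSE-to-ASE ratio (10.57977 / 10.26079) to the displayed precision.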


The Parameter Estimates table (Figure 24.4) displays both parameter estimates for constructed basis functions and each function’s construction components. The basis functions are constructed as two-way interaction terms from parent basis functions and transformations of variables. For continuous variables, the transformations are linear spline functions with knot values specified in the Knot column. For classification variables, the transformations are formed by dichotomizing the variables based on levels specified in the Levels column.

Figure 24.4: Parameter Estimates

Regression Spline Model after Backward Selection
Name       Coefficient    Parent    Variable        Knot       Levels
Basis0     29.4394                  Intercept
Basis2     0.004412       Basis0    Weight          3139.00
Basis3     -21.2899       Basis0    Horsepower      .
Basis6     0.1534         Basis3    Horsepower      158.00
Basis7     2.3920         Basis3    Year                       10 12 11 9 8 7 3
Basis9     1.6658         Basis0    Acceleration    21.0000
Basis10    0.4672         Basis0    Acceleration    21.0000
Basis11    -8.1766        Basis0    Cylinders                  0 3
Basis13    -10.0976       Basis4    Origin                     0
Basis15    2.1354         Basis0    Origin                     2
Basis17    6.7675         Basis0    Cylinders                  3
Basis19    1.4987         Basis0    Year                       3 10 12 11 9
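To make the linear spline transformation concrete, the following sketch computes the two mirrored truncated linear terms that share one knot, using the Weight knot of 3139 from Figure 24.4. The data set name hinge_demo and the variables w_knot, w_above, and w_below are hypothetical names for illustration. The Parameter Estimates table does not show which of the two forms a given basis function uses, so no claim is made here about the form of Basis2.

```sas
/* Illustration only: the two mirrored linear spline terms for one knot. */
/* hinge_demo, w_knot, w_above, and w_below are hypothetical names and   */
/* are not produced by PROC ADAPTIVEREG.                                 */
data hinge_demo;
   set autompg;
   w_knot  = 3139;                      /* knot for Weight in Figure 24.4 */
   w_above = max(Weight - w_knot, 0);   /* zero at or below the knot      */
   w_below = max(w_knot - Weight, 0);   /* zero at or above the knot      */
run;
```

Each term is piecewise linear with a single change of slope at the knot. Because the MAX function ignores missing arguments, an observation with a missing Weight would receive 0 for both terms in this sketch.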


During the model construction and selection process, some basis function terms are removed. You can view the backward elimination process in the selection plot (Figure 24.5). The plot displays how the model sum of squared error and the corresponding GCV criterion change along with the backward elimination process. The sum of squared error increases as more basis functions are removed from the full model. The GCV criterion decreases at first when two basis functions are dropped and increases afterward. The vertical line indicates the selected model that has the minimum GCV value.

Figure 24.5: Selection Plot



The fitted model is an additive model. Basis functions of the same variable can be grouped together to form functional components. The ANOVA Decomposition table (Figure 24.6) shows the functional components and their contributions to the final model.

Figure 24.6: ANOVA Decomposition

ANOVA Decomposition
                                                  Change If Omitted
Functional Component    Number of Bases    DF    Lack of Fit    GCV
Weight                  1                  2     299.55         0.7165
Horsepower              1                  2     1324.81        3.5875
Year                    2                  4     1183.22        3.0358
Acceleration            2                  4     287.76         0.5546
Cylinders               2                  4     321.11         0.6470
Origin                  2                  4     316.04         0.6330


Another criterion that focuses on the contribution of each individual variable is variable importance. It is defined as the square root of the GCV value of a submodel from which all basis functions that involve the variable have been removed, minus the square root of the GCV value of the selected model. The differences are then scaled so that the largest importance value is 100. The table in Figure 24.7 lists importance values, sorted in descending order, for the variables that compose the selected model.

Figure 24.7: Variable Importance

Variable Importance
Variable        Number of Bases    Importance
Horsepower      1                  100.00
Year            2                  85.46
Weight          1                  21.10
Cylinders       2                  19.08
Origin          2                  18.67
Acceleration    2                  16.38
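The definition of variable importance can be checked against the tables above. The following sketch assumes that the GCV column under Change If Omitted in Figure 24.6 is the increase in GCV when a component is removed, so that the submodel GCV is the selected model's GCV (11.55804 from Figure 24.3) plus that change. The data set names anova_gcv and imp_raw and the variable RawImp are hypothetical.

```sas
/* GCV change if omitted, copied from Figure 24.6. */
data anova_gcv;
   input Component :$12. GCVChange;
   datalines;
Weight 0.7165
Horsepower 3.5875
Year 3.0358
Acceleration 0.5546
Cylinders 0.6470
Origin 0.6330
;

/* Unscaled importance: sqrt(submodel GCV) minus sqrt(selected GCV). */
data imp_raw;
   set anova_gcv;
   SelectedGCV = 11.55804;              /* GCV from Figure 24.3 */
   RawImp = sqrt(SelectedGCV + GCVChange) - sqrt(SelectedGCV);
run;

/* Scale so that the largest value is 100 and sort in descending order. */
/* PROC SQL remerges max(RawImp) with the detail rows and writes a note */
/* about the remerge to the log.                                        */
proc sql;
   select Component,
          100 * RawImp / max(RawImp) as Importance format=8.2
   from imp_raw
   order by Importance desc;
quit;
```

Under the stated assumption, the scaled values agree with Figure 24.7 up to rounding of the tabulated inputs, which supports reading the GCV column as an increase relative to the selected model.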


The component panel (Figure 24.8) displays the fitted functional components against their forming variables.

Figure 24.8: Component Panel



Figure 24.9 shows a panel of fit diagnostics for the selected model that indicate a reasonable fit.

Figure 24.9: Diagnostics Panel

PROC ADAPTIVEREG provides an adaptive way to fit parsimonious regression spline models. The nonparametric transformation of variables is automatically determined, and model selection methods are used to reduce model complexity. The final model based on piecewise linear splines is easy to interpret and highly portable. It can also be used to suggest parametric forms based on the nonlinear trend.