Getting Started: ADAPTIVEREG Procedure :: SAS/STAT(R) 12.1 User's Guide

Getting Started: ADAPTIVEREG Procedure

This example concerns city-cycle fuel efficiency and automobile characteristics for 361 vehicle models made from year 1970 to 1982. The data can be downloaded from the UCI Machine Learning Repository (Asuncion and Newman, 2007). The following DATA step creates the data set autompg:

title 'Automobile MPG Study';
data Autompg;
   input MPG Cylinders Displacement Horsepower Weight
         Acceleration Year Origin Name $35.;
   datalines;
18.0 8 307.0   130.0   3504   12.0   70  1  Chevrolet Chevelle Malibu
15.0 8 350.0   165.0   3693   11.5   70  1  Buick Skylark 320
18.0 8 318.0   150.0   3436   11.0   70  1  Plymouth Satellite
16.0 8 304.0   150.0   3433   12.0   70  1  AMC Rebel SST

   ... more lines ...   

44.0 4 97.00   52.00   2130   24.6   82  2  VW Pickup
32.0 4 135.0   84.00   2295   11.6   82  1  Dodge Rampage
28.0 4 120.0   79.00   2625   18.6   82  1  Ford Ranger
31.0 4 119.0   82.00   2720   19.4   82  1  Chevy S-10
;

There are nine variables in the data set. The response variable MPG is city-cycle mileage per gallon (MPG). Seven predictor variables (Cylinders, Displacement, HorsePower, Weight, Acceleration, Year, and Origin) provide vehicle attributes. Among them, Cylinders, Year, and Origin are categorical variables. The last variable, Name, contains the specific name of each vehicle model.

The dependency of vehicle fuel efficiency on various factors might be nonlinear. There might also be redundant predictor variables as a result of dependency structures within predictors. For example, a vehicle model with more cylinders is likely to have more horsepower. The objective of this example is to explore the nonlinear dependency structure and also to produce a parsimonious model that does not overfit and thus has good predictive power. The following invocation of the ADAPTIVEREG procedure fits an additive model with linear spline terms of continuous predictors. By default, PROC ADAPTIVEREG fits a nonparametric regression model that includes two-way interaction between spline basis functions. You can try models with even higher interaction orders by specifying the MAXORDER= option in the MODEL statement. For this particular data set, the sample size is relatively small. Restricting model complexity by specifying an additive model can both improve model interpretability and reduce model variance without sacrificing much predictive power. The additive model consists of terms of nonparametric transformations of variables. The transformation of each variable and the selection of transformed terms are performed in an adaptive and automatic way.

ods graphics on;

proc adaptivereg data=autompg plots=all;
   class cylinders year origin;
   model mpg = cylinders displacement horsepower
               weight acceleration year origin / additive;
run;

PROC ADAPTIVEREG summarizes important information about the model that you are fitting in Figure 24.1.

Figure 24.1: Model Information and Fit Controls

Automobile MPG Study

The ADAPTIVEREG Procedure

Model Information
Data Set	WORK.AUTOMPG
Response Variable	MPG
Class Variables	Cylinders Year Origin
Distribution	Normal
Link Function	Identity

Fit Controls
Maximum Number of Bases	21
Maximum Order of Interaction	1
Degrees of Freedom per Knot	2
Knot Separation Parameter	0.05
Penalty for Variable Reentry	0
Missing Value Handling	Include

In addition to listing classification variables in the “Model Information” table, PROC ADAPTIVEREG displays level information about the classification variables that are specified in the CLASS statement. The table in Figure 24.2 lists the levels of the classification variables Cylinders, Year, and Origin. Although the values of Cylinders and Year are naturally ordered, they are treated as ordinary classification variables.

Figure 24.2: Class Level Information

Class Level Information
Class	Levels	Values
Cylinders	5	3 4 5 6 8
Year	13	70 71 72 73 74 75 76 77 78 79 80 81 82
Origin	3	1 2 3

The “Fit Statistics” table (Figure 24.3) lists summary statistics of the fitted regression spline model. Because the final model is essentially a linear model, several fit statistics are reported as if the model were fitted with basis functions as predetermined effects. However, because the model selection process and the determination of basis functions are highly nonlinear, additional statistics that incorporate the extra source of degrees of freedom are also displayed. The statistics include effective degrees of freedom, the generalized cross validation (GCV) criterion, and the GCV R-square value.

Figure 24.3: Fit Statistics

Fit Statistics
GCV	11.55804
GCV R-Square	0.81128
Effective Degrees of Freedom	23
R-Square	0.83161
Adjusted R-Square	0.82682
Mean Square Error	10.57977
Average Square Error	10.26079

The “Parameter Estimates” table (Figure 24.4) displays both parameter estimates for constructed basis functions and each function’s construction components. The basis functions are constructed as two-way interaction terms from parent basis functions and transformations of variables. For continuous variables, the transformations are linear spline functions with knot values specified in the Knot column. For classification variables, the transformations are formed by dichotomizing the variables based on levels specified in the Levels column.

Figure 24.4: Parameter Estimates

Regression Spline Model after Backward Selection
Name	Coefficient	Parent	Variable	Knot	Levels
Basis0	29.4394		Intercept
Basis2	0.004412	Basis0	Weight	3139.00
Basis3	-21.2899	Basis0	Horsepower	.
Basis6	0.1534	Basis3	Horsepower	158.00
Basis7	2.3920	Basis3	Year		10 12 11 9 8 7 3
Basis9	1.6658	Basis0	Acceleration	21.0000
Basis10	0.4672	Basis0	Acceleration	21.0000
Basis11	-8.1766	Basis0	Cylinders		0 3
Basis13	-10.0976	Basis4	Origin		0
Basis15	2.1354	Basis0	Origin		2
Basis17	6.7675	Basis0	Cylinders		3
Basis19	1.4987	Basis0	Year		3 10 12 11 9

During the model construction and selection process, some basis function terms are removed. You can view the backward elimination process in the selection plot (Figure 24.5). The plot displays how the model sum of squared error and the corresponding GCV criterion change along with the backward elimination process. The sum of squared error increases as more basis functions are removed from the full model. The GCV criterion decreases at first when two basis functions are dropped and increases afterward. The vertical line indicates the selected model that has the minimum GCV value.

Figure 24.5: Selection Plot

The formed model is an additive model. Basis functions of same variables can be grouped together to form functional components. The “ANOVA Decomposition” table (Figure 24.6) shows functional components and their contribution to the final model.

Figure 24.6: ANOVA Decomposition

ANOVA Decomposition
Functional Component	Number of Bases	DF	Change If Omitted
Functional Component	Number of Bases	DF	Lack of Fit	GCV
Weight	1	2	299.55	0.7165
Horsepower	1	2	1324.81	3.5875
Year	2	4	1183.22	3.0358
Acceleration	2	4	287.76	0.5546
Cylinders	2	4	321.11	0.6470
Origin	2	4	316.04	0.6330

Another criterion that focuses on the contribution of each individual variable is variable importance. It is defined to be the square root of the GCV value of a submodel from which all basis functions that involve a variable have been removed, minus the square root of the GCV value of the selected model, then scaled to have the largest importance value, 100. The table in Figure 24.7 lists importance values, sorted in descending order, for the variables that compose the selected model.

Figure 24.7: Variable Importance

Variable Importance
Variable	Number of Bases	Importance
Horsepower	1	100.00
Year	2	85.46
Weight	1	21.10
Cylinders	2	19.08
Origin	2	18.67
Acceleration	2	16.38

The component panel (Figure 24.8) displays the fitted functional components against their forming variables.

Figure 24.8: Component Panel

Figure 24.9 shows a panel of fit diagnostics for the selected model that indicate a reasonable fit.

PROC ADAPTIVEREG provides an adaptive way to fit parsimonious regression spline models. The nonparametric transformation of variables is automatically determined, and model selection methods are used to reduce model complexity. The final model based on piecewise linear splines is easy to interpret and highly portable. It can also be used to suggest parametric forms based on the nonlinear trend.

Figure 24.9: Diagnostics Panel

The ADAPTIVEREG Procedure (Experimental)

Getting Started: ADAPTIVEREG Procedure