The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmavesi; this data set is available from Puranen. For each of the seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), the weight, length, height, and width of each fish are tallied. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is available from the Sashelp library. PROC STEPDISC selects a subset of the six quantitative variables that might be useful for differentiating between the fish species. This subset is used in conjunction with PROC CANDISC and PROC DISCRIM to develop discrimination models.
The following statements use PROC STEPDISC to select a subset of potential discriminator variables. By default, PROC STEPDISC uses stepwise selection on all numeric variables that are not listed in other statements, and the significance levels for a variable to enter the subset and to stay in the subset are both set to 0.15. These statements produce Figure 89.1 through Figure 89.5:
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish;
   class Species;
run;
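As a sketch, the defaults described above can be spelled out explicitly. The following statements should request the same analysis, assuming that Weight, Length1, Length2, Length3, Height, and Width (the variable names shown in Figure 89.2) are the only numeric variables in Sashelp.Fish. The VAR statement and the METHOD=, SLENTRY=, and SLSTAY= options are not part of the original example:

```sas
/* Same analysis with the default settings written out.  */
/* Assumes the six variables below are the only numeric  */
/* variables in Sashelp.Fish.                            */
title 'Fish Measurement Data';
proc stepdisc data=sashelp.fish
              method=stepwise   /* default selection method       */
              slentry=0.15      /* default significance to enter  */
              slstay=0.15;      /* default significance to stay   */
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Writing the options out makes it easier to see which setting to change when a stricter or looser entry criterion is wanted.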
PROC STEPDISC begins by displaying summary information about the analysis (see Figure 89.1). This information includes the number of observations with nonmissing values, the number of classes in the classification variable (specified by the CLASS statement), the number of quantitative variables under consideration, the significance criteria for variables to enter and to stay in the model, and the method of variable selection being used. The frequency of each class is also displayed.
Figure 89.1: Summary Information
Fish Measurement Data |
The Method for Selecting Variables is STEPWISE | | | |
---|---|---|---|
Total Sample Size | 158 | Variable(s) in the Analysis | 6 |
Class Levels | 7 | Variable(s) Will Be Included | 0 |
 | | Significance Level to Enter | 0.15 |
 | | Significance Level to Stay | 0.15 |
Number of Observations Read | 159 |
---|---|
Number of Observations Used | 158 |
Class Level Information | | | | |
---|---|---|---|---|
Species | Variable Name | Frequency | Weight | Proportion |
Bream | Bream | 34 | 34.0000 | 0.215190 |
Parkki | Parkki | 11 | 11.0000 | 0.069620 |
Perch | Perch | 56 | 56.0000 | 0.354430 |
Pike | Pike | 17 | 17.0000 | 0.107595 |
Roach | Roach | 20 | 20.0000 | 0.126582 |
Smelt | Smelt | 14 | 14.0000 | 0.088608 |
Whitefish | Whitefish | 6 | 6.0000 | 0.037975 |
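Figure 89.1 reports 159 observations read but only 158 used, which indicates that one observation has a missing value in at least one of the analysis variables or in Species. One way to locate it, offered here as a sketch rather than as part of the original example, is to count the missing values per numeric variable with PROC MEANS:

```sas
/* Count nonmissing (N) and missing (NMISS) values for each */
/* numeric variable in Sashelp.Fish.                        */
proc means data=sashelp.fish n nmiss;
run;
```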
For each entry step, the statistics for entry are displayed for all variables not currently selected (see Figure 89.2). The variable selected to enter at this step (if any) is displayed, as well as all the variables currently selected. Next are multivariate statistics that take into account all previously selected variables and the newly entered variable.
Figure 89.2: Step 1: Variable HEIGHT Selected for Entry
Fish Measurement Data |
Statistics for Entry, DF = 6, 151 | ||||
---|---|---|---|---|
Variable | R-Square | F Value | Pr > F | Tolerance |
Weight | 0.3750 | 15.10 | <.0001 | 1.0000 |
Length1 | 0.6017 | 38.02 | <.0001 | 1.0000 |
Length2 | 0.6098 | 39.32 | <.0001 | 1.0000 |
Length3 | 0.6280 | 42.49 | <.0001 | 1.0000 |
Height | 0.7553 | 77.69 | <.0001 | 1.0000 |
Width | 0.4806 | 23.29 | <.0001 | 1.0000 |
Variable Height will be entered. |
Variable(s) That Have Been Entered |
---|
Height |
Multivariate Statistics | |||||
---|---|---|---|---|---|
Statistic | Value | F Value | Num DF | Den DF | Pr > F |
Wilks' Lambda | 0.244670 | 77.69 | 6 | 151 | <.0001 |
Pillai's Trace | 0.755330 | 77.69 | 6 | 151 | <.0001 |
Average Squared Canonical Correlation | 0.125888 |
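The step 1 values in Figure 89.2 can be checked by hand. The relations below are standard identities for an analysis with a single variable; the symbols are notation introduced here, not taken from the output: $R^2$ is the R-square for Height, $\Lambda$ is Wilks' lambda, $c = 7$ is the number of classes, and $n = 158$ is the number of observations used.

```latex
\begin{align*}
F &= \frac{R^2}{1 - R^2}\cdot\frac{n - c}{c - 1}
   = \frac{0.7553}{0.2447}\cdot\frac{151}{6} \approx 77.7, \\
\Lambda &= 1 - R^2 = 1 - 0.755330 = 0.244670, \\
\text{ASCC} &= \frac{\text{Pillai's trace}}{c - 1}
   = \frac{0.755330}{6} \approx 0.125888 .
\end{align*}
```

The small gap between 77.7 and the displayed 77.69 comes from using the rounded R-square.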
For each removal step (Figure 89.3), the statistics for removal are displayed for all variables currently entered. The variable to be removed at this step (if any) is displayed. If no variable meets the criterion to be removed and the maximum number of steps as specified by the MAXSTEP= option has not been attained, then the procedure continues with another entry step.
Figure 89.3: Step 2: No Variable Is Removed; Variable Length2 Added
Fish Measurement Data |
Statistics for Removal, DF = 6, 151 | | | |
---|---|---|---|
Variable | R-Square | F Value | Pr > F |
Height | 0.7553 | 77.69 | <.0001 |
No variables can be removed. |
Statistics for Entry, DF = 6, 150 | ||||
---|---|---|---|---|
Variable | Partial R-Square | F Value | Pr > F | Tolerance |
Weight | 0.7388 | 70.71 | <.0001 | 0.4690 |
Length1 | 0.9220 | 295.35 | <.0001 | 0.6083 |
Length2 | 0.9229 | 299.31 | <.0001 | 0.5892 |
Length3 | 0.9173 | 277.37 | <.0001 | 0.5056 |
Width | 0.8783 | 180.44 | <.0001 | 0.3699 |
Variable Length2 will be entered. |
Variable(s) That Have Been Entered | |
---|---|
Length2 | Height |
Multivariate Statistics | |||||
---|---|---|---|---|---|
Statistic | Value | F Value | Num DF | Den DF | Pr > F |
Wilks' Lambda | 0.018861 | 157.04 | 12 | 300 | <.0001 |
Pillai's Trace | 1.554349 | 87.78 | 12 | 302 | <.0001 |
Average Squared Canonical Correlation | 0.259058 |
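The step 2 statistics are tied to step 1 through the partial R-square of the entering variable. As a worked check, with notation introduced here: $\Lambda_k$ is Wilks' lambda after step $k$, $R^2_{\text{partial}}$ is the partial R-square of Length2 from the entry table, and $c = 7$ is the number of classes.

```latex
\begin{align*}
\Lambda_2 &= \Lambda_1\,\bigl(1 - R^2_{\text{partial}}\bigr)
           = 0.244670 \times (1 - 0.9229) \approx 0.01886, \\
\text{ASCC} &= \frac{\text{Pillai's trace}}{c - 1}
           = \frac{1.554349}{6} \approx 0.259058 .
\end{align*}
```

Both results agree with the displayed values up to rounding of the partial R-square.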
The stepwise procedure terminates either when no variable can be removed and no variable can be entered or when the maximum number of steps as specified by the MAXSTEP= option has been attained. In this example, at step 7, no variables can be either removed or entered (Figure 89.4). Steps 3 through 6 are not displayed in this document.
Figure 89.4: Step 7: No Variables Entered or Removed
Fish Measurement Data |
Statistics for Removal, DF = 6, 146 | | | |
---|---|---|---|
Variable | Partial R-Square | F Value | Pr > F |
Weight | 0.4521 | 20.08 | <.0001 |
Length1 | 0.2987 | 10.36 | <.0001 |
Length2 | 0.5250 | 26.89 | <.0001 |
Length3 | 0.7948 | 94.25 | <.0001 |
Height | 0.7257 | 64.37 | <.0001 |
Width | 0.5757 | 33.02 | <.0001 |
No variables can be removed. |
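When only the final result matters, the per-step output can be limited. The sketch below uses two PROC STEPDISC options that the original example does not use: MAXSTEP=, which caps the number of steps, and SHORT, which suppresses the output displayed for each step. The value 10 is an arbitrary illustration:

```sas
/* Cap the number of steps and suppress the per-step tables. */
/* MAXSTEP=10 is an illustrative value, not a recommendation. */
proc stepdisc data=sashelp.fish maxstep=10 short;
   class Species;
run;
```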
PROC STEPDISC ends by displaying a summary of the steps.
Figure 89.5: Step Summary
No further steps are possible. |
Fish Measurement Data |
Stepwise Selection Summary | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Step | Number In | Entered | Removed | Partial R-Square | F Value | Pr > F | Wilks' Lambda | Pr < Lambda | Average Squared Canonical Correlation | Pr > ASCC |
1 | 1 | Height | | 0.7553 | 77.69 | <.0001 | 0.24466983 | <.0001 | 0.12588836 | <.0001 |
2 | 2 | Length2 | | 0.9229 | 299.31 | <.0001 | 0.01886065 | <.0001 | 0.25905822 | <.0001 |
3 | 3 | Length3 | | 0.8826 | 186.77 | <.0001 | 0.00221342 | <.0001 | 0.38427100 | <.0001 |
4 | 4 | Width | | 0.5775 | 33.72 | <.0001 | 0.00093510 | <.0001 | 0.45200732 | <.0001 |
5 | 5 | Weight | | 0.4461 | 19.73 | <.0001 | 0.00051794 | <.0001 | 0.49488458 | <.0001 |
6 | 6 | Length1 | | 0.2987 | 10.36 | <.0001 | 0.00036325 | <.0001 | 0.51744189 | <.0001 |
All six quantitative variables in the data set are found to have potential discriminatory power. These variables are used to develop discrimination models in both the CANDISC and DISCRIM procedure chapters.
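A minimal sketch of that follow-up, assuming default options throughout (the CANDISC and DISCRIM chapters may use different options and output requests):

```sas
/* Canonical discriminant analysis on the six selected variables. */
proc candisc data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;

/* Discriminant analysis with the procedure's default settings. */
proc discrim data=sashelp.fish;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
run;
```

Listing the variables in a VAR statement keeps both analyses tied to the subset that PROC STEPDISC selected, even if the input data set later gains other numeric columns.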