The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmaevesi; this data set is available from Puranen (1917). For each fish, the species (bream, roach, whitefish, parkki, perch, pike, or smelt), weight, length, height, and width are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The fish data set is available from the Sashelp library. The goal now is to find a discriminant function based on these six variables (weight, the three lengths, height, and width) that best classifies the fish into species.
First, assume that the data are normally distributed within each group with equal covariances across groups. The following statements use PROC DISCRIM to analyze the Sashelp.Fish data and create Figure 35.1 through Figure 35.5:
    title 'Fish Measurement Data';
    proc discrim data=sashelp.fish;
       class Species;
    run;
The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 35.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement; because no VAR statement appears here, all numeric variables in the data set are used), and the number of classes in the classification variable (specified with the CLASS statement). Of the 159 observations read, 158 are used; PROC DISCRIM omits observations with missing values from the analysis. The frequency of each class, its weight, the proportion of the total sample, and the prior probability are also displayed. Equal priors are assigned by default.
Figure 35.1: Summary Information
Fish Measurement Data

Total Sample Size | 158 | DF Total | 157 |
---|---|---|---|
Variables | 6 | DF Within Classes | 151 |
Classes | 7 | DF Between Classes | 6 |

Number of Observations Read | 159 |
---|---|
Number of Observations Used | 158 |

Class Level Information

Species | Variable Name | Frequency | Weight | Proportion | Prior Probability |
---|---|---|---|---|---|
Bream | Bream | 34 | 34.0000 | 0.215190 | 0.142857 |
Parkki | Parkki | 11 | 11.0000 | 0.069620 | 0.142857 |
Perch | Perch | 56 | 56.0000 | 0.354430 | 0.142857 |
Pike | Pike | 17 | 17.0000 | 0.107595 | 0.142857 |
Roach | Roach | 20 | 20.0000 | 0.126582 | 0.142857 |
Smelt | Smelt | 14 | 14.0000 | 0.088608 | 0.142857 |
Whitefish | Whitefish | 6 | 6.0000 | 0.037975 | 0.142857 |
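The defaults described above can also be written out explicitly. The following is a minimal sketch, not part of the original example: it assumes that the six analysis variables are the ones named in the row labels of Figure 35.4, and it should be equivalent to the shorter step shown earlier.

```sas
/* Sketch: the same analysis with the defaults spelled out.          */
/* Variable names are assumed from the row labels of Figure 35.4.    */
title 'Fish Measurement Data';
proc discrim data=sashelp.fish method=normal pool=yes;
   class Species;                                     /* classification variable  */
   var Weight Length1 Length2 Length3 Height Width;   /* quantitative variables   */
   priors equal;                                      /* default: 1/7 per species */
run;
```

Replacing `priors equal` with `priors proportional` would instead set each prior to the sample proportion shown in the Proportion column of Figure 35.1.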
The natural log of the determinant of the pooled covariance matrix is displayed in Figure 35.2.
Figure 35.2: Pooled Covariance Matrix Information
Pooled Covariance Matrix Information

Covariance Matrix Rank | Natural Log of the Determinant of the Covariance Matrix |
---|---|
6 | 4.17613 |
The squared distances between the classes are shown in Figure 35.3.
Figure 35.3: Squared Distances
Fish Measurement Data

Generalized Squared Distance to Species

From Species | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish |
---|---|---|---|---|---|---|---|
Bream | 0 | 83.32523 | 243.66688 | 310.52333 | 133.06721 | 252.75503 | 132.05820 |
Parkki | 83.32523 | 0 | 57.09760 | 174.20918 | 27.00096 | 60.52076 | 26.54855 |
Perch | 243.66688 | 57.09760 | 0 | 101.06791 | 29.21632 | 29.26806 | 20.43791 |
Pike | 310.52333 | 174.20918 | 101.06791 | 0 | 92.40876 | 127.82177 | 99.90673 |
Roach | 133.06721 | 27.00096 | 29.21632 | 92.40876 | 0 | 33.84280 | 6.31997 |
Smelt | 252.75503 | 60.52076 | 29.26806 | 127.82177 | 33.84280 | 0 | 46.37326 |
Whitefish | 132.05820 | 26.54855 | 20.43791 | 99.90673 | 6.31997 | 46.37326 | 0 |
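For reference, with a pooled covariance matrix and equal priors (the settings used here), the generalized squared distance between two classes reduces to the Mahalanobis distance between their mean vectors. The notation below is assumed for illustration: $\bar{x}_i$ is the mean vector of class $i$ and $S_p$ is the pooled covariance matrix. This is a sketch of the special case only; the general definition includes additional terms when covariances are not pooled or priors are unequal.

```latex
D^2(i \mid j) = (\bar{x}_i - \bar{x}_j)^{\top} S_p^{-1} (\bar{x}_i - \bar{x}_j)
```

This form is symmetric in $i$ and $j$ and is zero when $i = j$, which matches the table in Figure 35.3. Roach and Whitefish are the closest pair (6.31997), and Bream and Pike are the farthest apart (310.52333).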
The coefficients of the linear discriminant function are displayed (in Figure 35.4) with the default options METHOD=NORMAL and POOL=YES.
Figure 35.4: Linear Discriminant Function
Linear Discriminant Function for Species

Variable | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish |
---|---|---|---|---|---|---|---|
Constant | -185.91682 | -64.92517 | -48.68009 | -148.06402 | -62.65963 | -19.70401 | -67.44603 |
Weight | -0.10912 | -0.09031 | -0.09418 | -0.13805 | -0.09901 | -0.05778 | -0.09948 |
Length1 | -23.02273 | -13.64180 | -19.45368 | -20.92442 | -14.63635 | -4.09257 | -22.57117 |
Length2 | -26.70692 | -5.38195 | 17.33061 | 6.19887 | -7.47195 | -3.63996 | 3.83450 |
Length3 | 50.55780 | 20.89531 | 5.25993 | 22.94989 | 25.00702 | 10.60171 | 21.12638 |
Height | 13.91638 | 8.44567 | -1.42833 | -8.99687 | -0.26083 | -1.84569 | 0.64957 |
Width | -23.71895 | -13.38592 | 1.32749 | -9.13410 | -3.74542 | -3.43630 | -2.52442 |
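The table in Figure 35.4 can be read as follows. Under the normal-theory method with a pooled covariance matrix and equal priors, each class $j$ has a linear score, and an observation $x$ is assigned to the class with the largest score. The symbols here are assumed notation for illustration: $\bar{x}_j$ is the mean vector of class $j$ and $S_p$ is the pooled covariance matrix.

```latex
L_j(x) = c_j + b_j^{\top} x, \qquad
b_j = S_p^{-1} \bar{x}_j, \qquad
c_j = -\tfrac{1}{2}\, \bar{x}_j^{\top} S_p^{-1} \bar{x}_j
```

The Constant row of the table holds $c_j$ for each species, and the remaining six rows hold the coefficient vector $b_j$. Choosing the largest $L_j(x)$ is equivalent to choosing the class whose mean is closest to $x$ in the squared distance $(x - \bar{x}_j)^{\top} S_p^{-1} (x - \bar{x}_j)$.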
A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 35.5, you see that only three of the observations are misclassified, all of them perch. The error-count estimates give the proportion of misclassified observations in each group. Since you are classifying the same data that are used to derive the discriminant function, these error-count estimates are optimistically biased: they tend to understate the error rate you would see on new data.
Figure 35.5: Resubstitution Misclassification Summary
Fish Measurement Data

Number of Observations and Percent Classified into Species

From Species | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish | Total |
---|---|---|---|---|---|---|---|---|
Bream | | | | | | | | |
Parkki | | | | | | | | |
Perch | | | | | | | | |
Pike | | | | | | | | |
Roach | | | | | | | | |
Smelt | | | | | | | | |
Whitefish | | | | | | | | |
Total | | | | | | | | |
Priors | | | | | | | | |

Error Count Estimates for Species

| | Bream | Parkki | Perch | Pike | Roach | Smelt | Whitefish | Total |
|---|---|---|---|---|---|---|---|---|
| Rate | 0.0000 | 0.0000 | 0.0536 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0077 |
| Priors | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | 0.1429 | |
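The figures in the Rate row can be checked by hand. The short calculation below is a sketch that takes the counts from Figure 35.1 (56 perch) and the three misclassified perch; $q_j$ is assumed notation for the prior probability of species $j$.

```latex
\text{Rate}_{\text{Perch}} = \frac{3}{56} \approx 0.0536, \qquad
\text{Rate}_{\text{Total}} = \sum_{j=1}^{7} q_j \, \text{Rate}_j
  = \frac{1}{7} \cdot \frac{3}{56} \approx 0.0077
```

The total is therefore a prior-weighted average of the per-species rates, not the raw fraction of misclassified observations, which would be $3/158 \approx 0.0190$.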
One way to reduce the bias of the error-count estimates is to split your data into two sets. One set is used to derive the discriminant function, and the other set is used to run validation tests. Example 35.4 shows how to analyze a test data set. Another method of reducing bias is to classify each observation by using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
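As a minimal sketch of the second approach, the CROSSVALIDATE option can be added to the PROC DISCRIM statement used earlier. The resulting output is not shown in this section.

```sas
/* Sketch: request cross validation in addition to resubstitution. */
title 'Fish Measurement Data';
proc discrim data=sashelp.fish crossvalidate;
   class Species;
run;
```

With this option, each observation is classified by a discriminant function computed from all of the other observations, so the resulting error-count estimates are less optimistic than the resubstitution estimates in Figure 35.5.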