The DISCRIM Procedure

Getting Started: DISCRIM Procedure

The data in this example are measurements of 159 fish caught in Finland’s Lake Laengelmaevesi; the data set is available from Puranen (1917) and from the Sashelp library. Each fish belongs to one of seven species (bream, roach, whitefish, parkki, perch, pike, and smelt), and its weight, length, height, and width are recorded. Three different length measurements are recorded: from the nose of the fish to the beginning of its tail, from the nose to the notch of its tail, and from the nose to the end of its tail. The height and width are recorded as percentages of the third length variable. The goal is to find a discriminant function based on these six variables that best classifies the fish into species.

First, assume that the data are normally distributed within each group with equal covariances across groups. The following statements use PROC DISCRIM to analyze the Sashelp.Fish data and create Figure 35.1 through Figure 35.5:

title 'Fish Measurement Data';

proc discrim data=sashelp.fish;
   class Species;
run;

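The DISCRIM procedure begins by displaying summary information about the variables in the analysis (see Figure 35.1). This information includes the number of observations, the number of quantitative variables in the analysis (specified with the VAR statement), and the number of classes in the classification variable (specified with the CLASS statement). Because the preceding statements omit the VAR statement, the analysis uses all numeric variables that are not listed in other statements, which here are the six measurement variables. Of the 159 observations read, 158 are used; an observation that has a missing value for an analysis variable is excluded. The frequency of each class, its weight, the proportion of the total sample, and the prior probability are also displayed. Equal priors are assigned by default.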

Figure 35.1: Summary Information

Fish Measurement Data

The DISCRIM Procedure

Total Sample Size	158	DF Total	157
Variables	6	DF Within Classes	151
Classes	7	DF Between Classes	6

Number of Observations Read	159
Number of Observations Used	158

Class Level Information
Species	Variable Name	Frequency	Weight	Proportion	Prior Probability
Bream	Bream	34	34.0000	0.215190	0.142857
Parkki	Parkki	11	11.0000	0.069620	0.142857
Perch	Perch	56	56.0000	0.354430	0.142857
Pike	Pike	17	17.0000	0.107595	0.142857
Roach	Roach	20	20.0000	0.126582	0.142857
Smelt	Smelt	14	14.0000	0.088608	0.142857
Whitefish	Whitefish	6	6.0000	0.037975	0.142857
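
The defaults described before Figure 35.1 can also be written out explicitly. The following statements are a minimal sketch that should be equivalent to the earlier PROC DISCRIM step. The sketch assumes that the six numeric variables in Sashelp.Fish have the names that appear in Figure 35.4 (Weight, Length1, Length2, Length3, Height, and Width):

```sas
proc discrim data=sashelp.fish;
   class Species;
   /* list the quantitative variables instead of relying on the default */
   var Weight Length1 Length2 Length3 Height Width;
   /* equal priors are the default; PRIORS PROPORTIONAL would use */
   /* the class proportions from the input data instead           */
   priors equal;
run;
```

Listing the variables explicitly protects the analysis from picking up any additional numeric variable that is later added to the input data set.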

The natural log of the determinant of the pooled covariance matrix is displayed in Figure 35.2.

Figure 35.2: Pooled Covariance Matrix Information

Pooled Covariance Matrix Information
Covariance Matrix Rank	Natural Log of the Determinant of the Covariance Matrix
6	4.17613

The squared distances between the classes are shown in Figure 35.3.

Figure 35.3: Squared Distances

Fish Measurement Data

The DISCRIM Procedure

Generalized Squared Distance to Species
From Species	Bream	Parkki	Perch	Pike	Roach	Smelt	Whitefish
Bream	0	83.32523	243.66688	310.52333	133.06721	252.75503	132.05820
Parkki	83.32523	0	57.09760	174.20918	27.00096	60.52076	26.54855
Perch	243.66688	57.09760	0	101.06791	29.21632	29.26806	20.43791
Pike	310.52333	174.20918	101.06791	0	92.40876	127.82177	99.90673
Roach	133.06721	27.00096	29.21632	92.40876	0	33.84280	6.31997
Smelt	252.75503	60.52076	29.26806	127.82177	33.84280	0	46.37326
Whitefish	132.05820	26.54855	20.43791	99.90673	6.31997	46.37326	0
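
With equal priors and a pooled covariance matrix, the generalized squared distances in Figure 35.3 are squared Mahalanobis distances between the class means, which is why the table is symmetric with zeros on the diagonal. If you want more detail than Figure 35.2 and Figure 35.3 provide, the following sketch adds the PCOV and DISTANCE options, which should display the pooled within-class covariance matrix and the squared Mahalanobis distances between class means together with the associated test statistics:

```sas
/* request the pooled covariance matrix and the */
/* between-class distances with their tests     */
proc discrim data=sashelp.fish pcov distance;
   class Species;
run;
```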

Because the default options METHOD=NORMAL and POOL=YES are in effect, the procedure derives a linear discriminant function for each species. The coefficients are displayed in Figure 35.4.

Figure 35.4: Linear Discriminant Function

Linear Discriminant Function for Species
Variable	Bream	Parkki	Perch	Pike	Roach	Smelt	Whitefish
Constant	-185.91682	-64.92517	-48.68009	-148.06402	-62.65963	-19.70401	-67.44603
Weight	-0.10912	-0.09031	-0.09418	-0.13805	-0.09901	-0.05778	-0.09948
Length1	-23.02273	-13.64180	-19.45368	-20.92442	-14.63635	-4.09257	-22.57117
Length2	-26.70692	-5.38195	17.33061	6.19887	-7.47195	-3.63996	3.83450
Length3	50.55780	20.89531	5.25993	22.94989	25.00702	10.60171	21.12638
Height	13.91638	8.44567	-1.42833	-8.99687	-0.26083	-1.84569	0.64957
Width	-23.71895	-13.38592	1.32749	-9.13410	-3.74542	-3.43630	-2.52442
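
The linear form in Figure 35.4 follows from the assumption of equal covariance matrices across groups. The following sketch writes the two default options out explicitly and then shows one way to relax the assumption: POOL=TEST asks the procedure to test the homogeneity of the within-class covariance matrices and to choose between a linear and a quadratic function accordingly. Treat the second step as an illustration only; classes that have few observations, such as the six whitefish, might have poorly estimated within-class covariance matrices.

```sas
/* the defaults written out: normal theory, pooled covariance matrix */
proc discrim data=sashelp.fish method=normal pool=yes;
   class Species;
run;

/* let a homogeneity test decide between pooled (linear) and         */
/* within-class (quadratic) covariance matrices; POOL=NO would force */
/* the quadratic function                                            */
proc discrim data=sashelp.fish method=normal pool=test;
   class Species;
run;
```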

A summary of how the discriminant function classifies the data used to develop the function is displayed last. In Figure 35.5, you see that only three of the 158 observations are misclassified: three perch are classified as smelt. The error-count estimates give the proportion of misclassified observations in each group. Because you are classifying the same data that are used to derive the discriminant function, these error-count estimates are biased.

Figure 35.5: Resubstitution Misclassification Summary

Fish Measurement Data

The DISCRIM Procedure

Classification Summary for Calibration Data: SASHELP.FISH

Resubstitution Summary using Linear Discriminant Function

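Percent Classified into Species (158 observations in total)
From Species	Bream	Parkki	Perch	Pike	Roach	Smelt	Whitefish	Total
Bream	100.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00
Parkki	0.00	100.00	0.00	0.00	0.00	0.00	0.00	100.00
Perch	0.00	0.00	94.64	0.00	0.00	5.36	0.00	100.00
Pike	0.00	0.00	0.00	100.00	0.00	0.00	0.00	100.00
Roach	0.00	0.00	0.00	0.00	100.00	0.00	0.00	100.00
Smelt	0.00	0.00	0.00	0.00	0.00	100.00	0.00	100.00
Whitefish	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00
Total	21.52	6.96	33.54	10.76	12.66	10.76	3.80	100.00
Priors	0.14286	0.14286	0.14286	0.14286	0.14286	0.14286	0.14286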

Error Count Estimates for Species
	Bream	Parkki	Perch	Pike	Roach	Smelt	Whitefish	Total
Rate	0.0000	0.0000	0.0536	0.0000	0.0000	0.0000	0.0000	0.0077
Priors	0.1429	0.1429	0.1429	0.1429	0.1429	0.1429	0.1429
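
The total rate of 0.0077 is not the raw fraction of misclassified fish (3 of 158). The values in the table are consistent with a prior-weighted sum of the per-species rates, and the following DATA step is a small sketch of that arithmetic. The variable names are hypothetical, and the counts are taken from Figure 35.1 and Figure 35.5:

```sas
data _null_;
   perch_rate = 3 / 56;   /* 3 of the 56 perch are misclassified     */
   prior      = 1 / 7;    /* equal priors for the seven species      */
   /* every other species has a rate of 0, so only perch contributes */
   total_rate = prior * perch_rate;
   put perch_rate= 6.4 total_rate= 6.4;
run;
```

The two values written to the log should match the Perch and Total entries in the Rate row.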

One way to reduce the bias of the error-count estimates is to split your data into two sets. One set is used to derive the discriminant function, and the other set is used to run validation tests. Example 35.4 shows how to analyze a test data set. Another method of reducing bias is to classify each observation by using a discriminant function computed from all of the other observations; this method is invoked with the CROSSVALIDATE option.
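
Both approaches can be sketched as follows. The data set names (fish_split, fish_train, fish_test), the 70% sampling rate, and the seed are arbitrary choices made for illustration; they are not part of the original example. The variable names are again assumed to match those in Figure 35.4.

```sas
/* Cross validation: classify each observation by using a function */
/* computed from all of the other observations                     */
proc discrim data=sashelp.fish crossvalidate;
   class Species;
run;

/* Holdout validation: flag roughly 70% of the fish for training.  */
/* OUTALL keeps every observation and adds the 0/1 flag Selected.  */
proc surveyselect data=sashelp.fish out=fish_split
                  method=srs samprate=0.7 seed=12345 outall;
run;

data fish_train fish_test;
   set fish_split;
   if Selected then output fish_train;
   else output fish_test;
   drop Selected;
run;

/* Derive the function from the training set and classify the test */
/* set. The VAR statement keeps the analysis to the six            */
/* measurement variables.                                          */
proc discrim data=fish_train testdata=fish_test;
   class Species;
   var Weight Length1 Length2 Length3 Height Width;
   testclass Species;
run;
```

A simple random split can leave very few fish of a rare species, such as whitefish, in one of the two sets, so a stratified split might be preferable in practice.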