The DISCRIM Procedure

Input Data Sets

DATA= Data Set

When you specify METHOD=NPAR, an ordinary SAS data set is required as the input DATA= data set. When you specify METHOD=NORMAL, the DATA= data set can be an ordinary SAS data set or one of several specially structured data sets created by SAS/STAT procedures. These specially structured data sets include the following:

TYPE=CORR data sets created by PROC CORR by using a BY statement
TYPE=COV data sets created by PROC PRINCOMP by using both the COV option and a BY statement
TYPE=CSSCP data sets created by PROC CORR by using the CSSCP option and a BY statement, where the OUT= data set is assigned TYPE=CSSCP with the TYPE= data set option
TYPE=SSCP data sets created by PROC REG by using both the OUTSSCP= option and a BY statement
TYPE=LINEAR, TYPE=QUAD, and TYPE=MIXED data sets produced by previous runs of PROC DISCRIM that used both METHOD=NORMAL and OUTSTAT= options

When the input data set is TYPE=CORR, TYPE=COV, TYPE=CSSCP, or TYPE=SSCP, the BY variable in these data sets becomes the CLASS variable in the DISCRIM procedure.
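
As an illustration of the BY-to-CLASS correspondence, the following sketch builds a TYPE=CORR data set with PROC CORR and passes it to PROC DISCRIM. The Sashelp.Iris sample data set, its variable names, and the Work data set names are assumptions chosen for the example; substitute your own data.

```sas
/* Assumption: Sashelp.Iris is available; any data set with a
   grouping variable and numeric analysis variables would do. */
proc sort data=sashelp.iris out=work.iris_sorted;
   by Species;
run;

/* OUTP= writes a TYPE=CORR data set; the BY statement produces
   one set of statistics for each value of Species. */
proc corr data=work.iris_sorted noprint outp=work.iris_corr;
   by Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* The former BY variable is named in the CLASS statement. */
proc discrim data=work.iris_corr method=normal pool=yes;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Because the input here holds only summary statistics, results that require the original observations are not expected from this run.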

When the input data set is TYPE=CORR, TYPE=COV, or TYPE=CSSCP, PROC DISCRIM reads the number of observations for each class from the observations with _TYPE_='N' and the variable means in each class from the observations with _TYPE_='MEAN'. The remaining within-class statistics depend on the data set type. For TYPE=CORR, PROC DISCRIM reads the within-class correlations from the observations with _TYPE_='CORR' and the standard deviations from the observations with _TYPE_='STD'. For TYPE=COV, it reads the within-class covariances from the observations with _TYPE_='COV'. For TYPE=CSSCP, it reads the within-class corrected sums of squares and crossproducts from the observations with _TYPE_='CSSCP'.

When you specify POOL=YES and the data set does not include any observations with _TYPE_='CSSCP' (data set TYPE=CSSCP), _TYPE_='COV' (data set TYPE=COV), or _TYPE_='CORR' (data set TYPE=CORR) for each class, PROC DISCRIM reads the pooled within-class information from the data set instead. What it reads depends on the data set type. For TYPE=COV, PROC DISCRIM reads the pooled within-class covariances from the observations with _TYPE_='PCOV'. For TYPE=CORR, it reads the pooled within-class correlations from the observations with _TYPE_='PCORR' and the pooled within-class standard deviations from the observations with _TYPE_='PSTD'. For TYPE=CSSCP, it reads the pooled within-class corrected SSCP matrix from the observations with _TYPE_='PSSCP'.

When the input data set is TYPE=SSCP, PROC DISCRIM reads the number of observations for each class from the observations with _TYPE_='N'. From the observations with _TYPE_='SSCP' and _NAME_='INTERCEPT', it reads the sum of weights of observations for each class from the variable INTERCEP and the variable sums from the analysis variables. It reads the uncorrected sums of squares and crossproducts from the analysis variables in the observations with _TYPE_='SSCP' and _NAME_='variablenames', where variablenames stands for the names of the analysis variables.
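
A minimal sketch of the TYPE=SSCP route follows, assuming the Sashelp.Iris sample data set and hypothetical Work data set names. PROC REG is used here only to write the crossproducts matrix, so the example has a VAR statement and no MODEL statement.

```sas
/* Assumption: Sashelp.Iris stands in for your own data. */
proc sort data=sashelp.iris out=work.iris_sorted;
   by Species;
run;

/* OUTSSCP= writes a TYPE=SSCP data set; the BY statement yields
   one uncorrected SSCP matrix for each value of Species. */
proc reg data=work.iris_sorted outsscp=work.iris_sscp noprint;
   by Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
quit;

/* Look at the _TYPE_ and _NAME_ values before using the data set. */
proc print data=work.iris_sscp;
run;

/* The BY variable from PROC REG becomes the CLASS variable. */
proc discrim data=work.iris_sscp method=normal;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```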

When the input data set is TYPE=LINEAR, TYPE=QUAD, or TYPE=MIXED, PROC DISCRIM reads the prior probabilities for each class from the observations with _TYPE_='PRIOR'.

When the input data set is TYPE=LINEAR, PROC DISCRIM reads the coefficients of the linear discriminant functions from the observations with _TYPE_='LINEAR'.

When the input data set is TYPE=QUAD, PROC DISCRIM reads the coefficients of the quadratic discriminant functions from the observations with _TYPE_='QUAD'.

When the input data set is TYPE=MIXED, PROC DISCRIM reads the coefficients of the linear discriminant functions from the observations with _TYPE_='LINEAR'. If there are no observations with _TYPE_='LINEAR', PROC DISCRIM reads the coefficients of the quadratic discriminant functions from the observations with _TYPE_='QUAD'.
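
The following sketch shows one way such a data set might be produced and reused. It assumes the Sashelp.Iris sample data set and hypothetical Work data set names. With METHOD=NORMAL and POOL=YES, the OUTSTAT= data set is expected to be TYPE=LINEAR.

```sas
/* First run: derive the discriminant functions and save them. */
proc discrim data=sashelp.iris method=normal pool=yes
             outstat=work.iris_stat noprint;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Inspect the rows that a later run reads: the prior
   probabilities and the linear coefficients. */
proc print data=work.iris_stat;
   where _TYPE_ in ('PRIOR', 'LINEAR');
run;

/* Second run: read the saved statistics from DATA= and classify
   the TESTDATA= observations. The calibration data are reused as
   test data purely for illustration. */
proc discrim data=work.iris_stat testdata=sashelp.iris
             testout=work.iris_scored;
   class Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Saving the calibration information this way avoids recomputing the discriminant functions each time new observations are classified.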

TESTDATA= Data Set

The TESTDATA= data set is an ordinary SAS data set with observations that are to be classified. The quantitative variable names in this data set must match those in the DATA= data set. The TESTCLASS statement can be used to specify the variable containing group membership information of the TESTDATA= data set observations. When the TESTCLASS statement is missing and the TESTDATA= data set contains the variable given in the CLASS statement, this variable is used as the TESTCLASS variable. The TESTCLASS variable should have the same type (character or numeric) and length as the variable given in the CLASS statement. PROC DISCRIM considers an observation misclassified when the value of the TESTCLASS variable does not match the group into which the TESTDATA= observation is classified.
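
A minimal sketch of classifying a held-out test set follows. The Sashelp.Iris sample data set, the every-fifth-observation split, and the Work data set names are assumptions made for the example.

```sas
/* Hypothetical split: hold out every fifth observation. */
data work.train work.test;
   set sashelp.iris;
   if mod(_n_, 5) = 0 then output work.test;
   else output work.train;
run;

/* TESTCLASS names the variable holding the known group of each
   test observation, so that misclassifications can be counted.
   Here it could be omitted, because the test data contain the
   CLASS variable Species. */
proc discrim data=work.train testdata=work.test
             testout=work.test_scored method=normal pool=yes;
   class Species;
   testclass Species;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* Cross-tabulate known group against assigned group; _INTO_ is
   the assigned-class variable in the TESTOUT= data set. */
proc freq data=work.test_scored;
   tables Species * _INTO_ / norow nocol nopercent;
run;
```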