The DISCRIM Procedure

Computational Resources

In the following discussion, let

\begin{eqnarray*}  n &  = &  \mbox{number of observations in the training data set} \\ v &  = &  \mbox{number of variables} \\ c &  = &  \mbox{number of class levels} \\ k &  = &  \mbox{number of canonical variables} \\ l &  = &  \mbox{length of the CLASS variable} \\ \end{eqnarray*}

Memory Requirements

The amount of temporary storage required depends on the discriminant method used and the options specified. The least amount of temporary storage in bytes needed to process the data is approximately

\[  c(32v + 3l + 128) + 8v^2 + 104v + 4l  \]

A parametric method (METHOD=NORMAL) requires an additional temporary memory of $12v^2+100v$ bytes. When you specify the CROSSVALIDATE option, this temporary storage must be increased by $4v^2+44v$ bytes. When a nonparametric method (METHOD=NPAR) is used, an additional temporary storage of $10v^2+94v$ bytes is needed if you specify METRIC=FULL to evaluate the distances.

With the MANOVA option, the temporary storage must be increased by $8v^2+96v$ bytes. The CANONICAL option requires a temporary storage of $2v^2+94v+8k(v+c)$ bytes. The POSTERR option requires a temporary storage of $8c^2+64c+96$ bytes. Additional temporary storage is also required for classification summary and for each output data set.

Consider the following statements:

proc discrim manova;
   class gp;
   var x1 x2 x3;
run;

If the CLASS variable gp has a length of 8 and the input data set contains two class levels, the procedure requires a temporary storage of 1992 bytes. This includes 1104 bytes for processing data, 480 bytes for using a parametric method, and 408 bytes for specifying the MANOVA option.

Time Requirements

The following factors determine the time requirements of discriminant analysis:

  • The time needed for reading the data and computing covariance matrices is proportional to $nv^2$. PROC DISCRIM must also look up each class level in the list. This is faster if the data are sorted by the CLASS variable. The time for looking up class levels is proportional to a value ranging from n to $n \ln (c)$.

  • The time for inverting a covariance matrix is proportional to $v^3$.

  • With a parametric method, the time required to classify each observation is proportional to $cv$ for a linear discriminant function and $cv^2$ for a quadratic discriminant function. When you specify the CROSSVALIDATE option, the discriminant function is updated for each observation in the classification. A substantial amount of time is required.

  • With a nonparametric method, the data are stored in a tree structure (Friedman, Bentley, and Finkel, 1977). The time required to organize the observations into the tree structure is proportional to $nv \ln (n)$. The time for performing each tree search is proportional to $\ln (n)$. When you specify the normal KERNEL= option, all observations in the training sample contribute to the density estimation and more computer time is needed.

  • The time required for the canonical discriminant analysis is proportional to $v^3$.

Each of the preceding factors has a different machine-dependent constant of proportionality.