The FASTCLUS Procedure

Example 38.1 Fisher’s Iris Data

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis. The sepal length, sepal width, petal length, and petal width are measured in millimeters on 50 iris specimens from each of three species, Iris setosa, I. versicolor, and I. virginica. Mezzich and Solomon (1980) discuss a variety of cluster analyses of the iris data.

In this example, the FASTCLUS procedure is used to find two and then three clusters. The statements that follow create an output data set for each run and invoke PROC FREQ to compare the clusters with the species classification. See Output 38.1.1 and Output 38.1.2 for these results.

For three clusters, you can use the CANDISC procedure to compute canonical variables for plotting the clusters. See Output 38.1.3 and Output 38.1.4 for the results.

title 'Fisher (1936) Iris Data';
proc fastclus data=sashelp.iris maxc=2 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc freq;
   tables cluster*Species;
run;

proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

proc freq;
   tables cluster*Species;
run;

proc candisc anova out=can;
   class cluster;
   var SepalLength SepalWidth PetalLength PetalWidth;
   title2 'Canonical Discriminant Analysis of Iris Clusters';
run;

proc sgplot data=Can;
   scatter y=Can2 x=Can1 / group=Cluster;
   title2 'Plot of Canonical Variables Identified by Cluster';
run;

Output 38.1.1: Fisher’s Iris Data: PROC FASTCLUS with MAXC=2 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02

Initial Seeds
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 77.00000000 26.00000000 69.00000000 23.00000000
2 45.00000000 23.00000000 13.00000000 3.00000000

Minimum Distance Between Initial Seeds = 67.59438

Iteration History
Iteration Criterion Relative Change in Cluster Seeds
1 2
1 11.0045 0.3169 0.2164
2 5.6161 0.0379 0.0791
3 5.1042 0.0133 0.0306
4 5.0417 0.00348 0.00679

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5.0390

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Maximum Distance from Seed to Observation  Radius Exceeded  Nearest Cluster  Distance Between Cluster Centroids
1        97         5.6779             24.8448                                                     2                39.2879
2        53         3.7050             21.6197                                                     1                39.2879

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 5.49313 0.562896 1.287784
SepalWidth 4.35866 3.70393 0.282710 0.394137
PetalLength 17.65298 6.80331 0.852470 5.778291
PetalWidth 7.62238 3.57200 0.781868 3.584390
OVER-ALL 10.69224 5.07291 0.776410 3.472463

Pseudo F Statistic = 513.92

Approximate Expected Over-All R-Squared = 0.51539

Cubic Clustering Criterion = 14.806


WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 63.01030928 28.86597938 49.58762887 16.95876289
2 50.05660377 33.69811321 15.60377358 2.90566038

Cluster Standard Deviations
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 6.336887455 3.267991438 7.800577673 4.155612484
2 3.427350930 4.396611045 4.404279486 2.105525249
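
As a cross-check on the listing above, the pseudo F statistic can be recovered from the OVER-ALL value of RSQ/(1-RSQ). The sketch below assumes the usual definition of the pseudo F statistic, with n = 150 observations and c = 2 clusters; the symbols n, c, and R^2 are notation introduced here for illustration and do not appear in the output.

```latex
\[
F_{\text{pseudo}}
  = \frac{R^2/(c-1)}{(1-R^2)/(n-c)}
  = \frac{R^2}{1-R^2}\cdot\frac{n-c}{c-1}
  = 3.472463 \times \frac{150-2}{2-1}
  \approx 513.92
\]
```

The result agrees with the value printed as Pseudo F Statistic in the two-cluster listing.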

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species
Cell contents: Frequency / Percent / Row Pct / Col Pct

CLUSTER(Cluster)  Species(Iris Species)
                  Setosa                        Versicolor                    Virginica                     Total
1                 0 / 0.00 / 0.00 / 0.00        47 / 31.33 / 48.45 / 94.00    50 / 33.33 / 51.55 / 100.00   97 / 64.67
2                 50 / 33.33 / 94.34 / 100.00   3 / 2.00 / 5.66 / 6.00        0 / 0.00 / 0.00 / 0.00        53 / 35.33
Total             50 / 33.33                    50 / 33.33                    50 / 33.33                    150 / 100.00


Output 38.1.2: Fisher’s Iris Data: PROC FASTCLUS with MAXC=3 and PROC FREQ

Fisher (1936) Iris Data

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02

Initial Seeds
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 77.00000000 38.00000000 67.00000000 22.00000000
2 57.00000000 44.00000000 15.00000000 4.00000000
3 49.00000000 25.00000000 45.00000000 17.00000000

Minimum Distance Between Initial Seeds = 38.23611

Iteration History
Iteration Criterion Relative Change in Cluster Seeds
1 2 3
1 7.0151 0.3205 0.3151 0.2985
2 3.7097 0.0459 0 0.0317
3 3.6427 0.0182 0 0.0124

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 3.6289

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Maximum Distance from Seed to Observation  Radius Exceeded  Nearest Cluster  Distance Between Cluster Centroids
1        38         4.0168             14.9736                                                     3                17.9718
2        50         2.7803             12.4803                                                     3                33.5693
3        62         4.0398             16.9272                                                     1                17.9718

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
SepalLength 8.28066 4.39488 0.722096 2.598359
SepalWidth 4.35866 3.24816 0.452102 0.825156
PetalLength 17.65298 4.21431 0.943773 16.784895
PetalWidth 7.62238 2.45244 0.897872 8.791618
OVER-ALL 10.69224 3.66198 0.884275 7.641194

Pseudo F Statistic = 561.63

Approximate Expected Over-All R-Squared = 0.62728

Cubic Clustering Criterion = 25.021


WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 68.50000000 30.73684211 57.42105263 20.71052632
2 50.06000000 34.28000000 14.62000000 2.46000000
3 59.01612903 27.48387097 43.93548387 14.33870968

Cluster Standard Deviations
Cluster SepalLength SepalWidth PetalLength PetalWidth
1 4.941550255 2.900924461 4.885895746 2.798724562
2 3.524896872 3.790643691 1.736639965 1.053855894
3 4.664100551 2.962840548 5.088949673 2.974997167
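
The OVER-ALL row of Statistics for Variables can be cross-checked from its two standard deviations. The sketch below assumes that Total STD is computed with divisor n - 1 and that Within STD is pooled over clusters with divisor n - c (here n = 150 and c = 3); the symbols s_T and s_W are notation introduced for this check, not names from the output.

```latex
\[
R^2 = 1 - \frac{(n-c)\,s_W^2}{(n-1)\,s_T^2}
    = 1 - \frac{147 \times 3.66198^2}{149 \times 10.69224^2}
    \approx 1 - \frac{1971.28}{17034.28}
    \approx 0.8843
\]
```

Under these assumptions the computed value matches the listed over-all R-square of 0.884275 to the precision of the inputs.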

Fisher (1936) Iris Data

The FREQ Procedure

Table of CLUSTER by Species
Cell contents: Frequency / Percent / Row Pct / Col Pct

CLUSTER(Cluster)  Species(Iris Species)
                  Setosa                        Versicolor                    Virginica                     Total
1                 0 / 0.00 / 0.00 / 0.00        2 / 1.33 / 5.26 / 4.00        36 / 24.00 / 94.74 / 72.00    38 / 25.33
2                 50 / 33.33 / 100.00 / 100.00  0 / 0.00 / 0.00 / 0.00        0 / 0.00 / 0.00 / 0.00        50 / 33.33
3                 0 / 0.00 / 0.00 / 0.00        48 / 32.00 / 77.42 / 96.00    14 / 9.33 / 22.58 / 28.00     62 / 41.33
Total             50 / 33.33                    50 / 33.33                    50 / 33.33                    150 / 100.00
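
One way to summarize the three-cluster crosstabulation is to identify each cluster with its majority species. That pairing is an interpretation made here for illustration (cluster 1 with Virginica, cluster 2 with Setosa, cluster 3 with Versicolor); PROC FREQ does not report it.

```latex
\[
\text{agreement} = \frac{36 + 50 + 48}{150} = \frac{134}{150} \approx 0.893,
\qquad
\text{mismatches} = 2 + 14 = 16
\]
```

Under this pairing, all 16 mismatches involve Versicolor and Virginica; every Setosa specimen falls in cluster 2.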


Output 38.1.3: Fisher’s Iris Data using PROC CANDISC

Fisher (1936) Iris Data
Canonical Discriminant Analysis of Iris Clusters

The CANDISC Procedure

Total Sample Size 150 DF Total 149
Variables 4 DF Within Classes 147
Classes 3 DF Between Classes 2

Number of Observations Read 150
Number of Observations Used 150

Class Level Information
CLUSTER  Variable Name  Frequency  Weight   Proportion
1        _1             38         38.0000  0.253333
2        _2             50         50.0000  0.333333
3        _3             62         62.0000  0.413333

Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147
Variable     Label              Total Standard Deviation  Pooled Standard Deviation  Between Standard Deviation  R-Square  R-Square / (1-RSq)  F Value  Pr > F
SepalLength  Sepal Length (mm)  8.2807                    4.3949                     8.5893                      0.7221    2.5984              190.98   <.0001
SepalWidth   Sepal Width (mm)   4.3587                    3.2482                     3.5774                      0.4521    0.8252              60.65    <.0001
PetalLength  Petal Length (mm)  17.6530                   4.2143                     20.9336                     0.9438    16.7849             1233.69  <.0001
PetalWidth   Petal Width (mm)   7.6224                    2.4524                     8.8164                      0.8979    8.7916              646.18   <.0001

Average R-Square
Unweighted 0.7539604
Weighted by Variance 0.8842753

Multivariate Statistics and F Approximations
S=2 M=0.5 N=71
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda 0.03222337 164.55 8 288 <.0001
Pillai's Trace 1.25669612 61.29 8 290 <.0001
Hotelling-Lawley Trace 21.06722883 377.66 8 203.4 <.0001
Roy's Greatest Root 20.63266809 747.93 4 145 <.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

   Canonical Correlation  Adjusted Canonical Correlation  Approximate Standard Error  Squared Canonical Correlation
1  0.976613               0.976123                        0.003787                    0.953774
2  0.550384               0.543354                        0.057107                    0.302923

Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
   Eigenvalue  Difference  Proportion  Cumulative
1  20.6327     20.1981     0.9794      0.9794
2  0.4346                  0.0206      1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero
   Likelihood Ratio  Approximate F Value  Num DF  Den DF  Pr > F
1  0.03222337        164.55               8       288     <.0001
2  0.69707749        21.00                3       145     <.0001
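
The listed eigenvalues and multivariate statistics can be tied back to the squared canonical correlations. The following worked check uses only numbers printed in the output; the symbols (rho_i for the canonical correlations, lambda_i for the eigenvalues) are notation introduced here, and the results agree with the listing only up to rounding of the printed inputs.

```latex
\[
\lambda_i = \frac{\rho_i^2}{1-\rho_i^2}:
\qquad
\lambda_1 = \frac{0.953774}{1 - 0.953774} \approx 20.63,
\qquad
\lambda_2 = \frac{0.302923}{1 - 0.302923} \approx 0.4346
\]
\[
\Lambda_{\text{Wilks}} = \prod_i \left(1-\rho_i^2\right)
  = 0.046226 \times 0.697077 \approx 0.03222,
\qquad
V_{\text{Pillai}} = \sum_i \rho_i^2 = 0.953774 + 0.302923 = 1.256697
\]
\[
T_{\text{Hotelling-Lawley}} = \sum_i \lambda_i \approx 20.6327 + 0.4346 \approx 21.067,
\qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{20.6327}{21.0673} \approx 0.9794
\]
```

The last ratio is the Proportion entry for the first canonical variable, which indicates that Can1 carries nearly all of the between-cluster separation.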

Total Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.831965 0.452137
SepalWidth Sepal Width (mm) -0.515082 0.810630
PetalLength Petal Length (mm) 0.993520 0.087514
PetalWidth Petal Width (mm) 0.966325 0.154745

Between Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.956160 0.292846
SepalWidth Sepal Width (mm) -0.748136 0.663545
PetalLength Petal Length (mm) 0.998770 0.049580
PetalWidth Petal Width (mm) 0.995952 0.089883

Pooled Within Canonical Structure
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.339314 0.716082
SepalWidth Sepal Width (mm) -0.149614 0.914351
PetalLength Petal Length (mm) 0.900839 0.308136
PetalWidth Petal Width (mm) 0.650123 0.404282

Total-Sample Standardized Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.047747341 1.021487262
SepalWidth Sepal Width (mm) -0.577569244 0.864455153
PetalLength Petal Length (mm) 3.341309573 -1.283043758
PetalWidth Petal Width (mm) 0.996451144 0.900476563

Pooled Within-Class Standardized Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.0253414487 0.5421446856
SepalWidth Sepal Width (mm) -.4304161258 0.6442092294
PetalLength Petal Length (mm) 0.7976741592 -.3063023132
PetalWidth Petal Width (mm) 0.3205998034 0.2897207865

Raw Canonical Coefficients
Variable Label Can1 Can2
SepalLength Sepal Length (mm) 0.0057661265 0.1233581748
SepalWidth Sepal Width (mm) -.1325106494 0.1983303556
PetalLength Petal Length (mm) 0.1892773419 -.0726814163
PetalWidth Petal Width (mm) 0.1307270927 0.1181359305
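
In this listing, each standardized coefficient appears to equal the raw coefficient multiplied by the corresponding standard deviation from the Univariate Test Statistics table (total-sample or pooled within-class). The check below illustrates this for PetalLength on Can1; it is a numerical observation on the printed values rather than a statement of how the procedure computes them.

```latex
\[
0.1892773 \times 17.6530 \approx 3.3413
\quad\text{(total-sample standardized)},
\qquad
0.1892773 \times 4.2143 \approx 0.7977
\quad\text{(pooled within-class standardized)}
\]
```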

Class Means on Canonical Variables
CLUSTER Can1 Can2
1 4.931414018 0.861972277
2 -6.131527227 0.244761516
3 1.922300462 -0.725693908
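
Because the canonical variables are centered over the whole sample, the class means weighted by cluster frequency should sum to zero. The check below uses the frequencies 38, 50, and 62 from the Class Level Information table; the symbols n_k and z-bar are notation introduced here for illustration.

```latex
\[
\sum_k n_k\,\bar{z}_{k1} = 38(4.931414) + 50(-6.131527) + 62(1.922300) \approx 0
\]
\[
\sum_k n_k\,\bar{z}_{k2} = 38(0.861972) + 50(0.244762) + 62(-0.725694) \approx 0
\]
```

The means on Can1 also show that cluster 2 lies far from the other two clusters, while clusters 1 and 3 are closer together.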


Output 38.1.4: Plot of Fisher’s Iris Data using PROC CANDISC