This example involves data artificially generated to contain two clusters and several severe outliers. A preliminary analysis specifies 20 clusters and outputs an OUTSEED= data set to be used for a diagnostic plot. The exact number of initial clusters is not important; similar results could be obtained with 10 or 50 initial clusters. Examination of the plot suggests that clusters with more than five (again, the exact number is not important) observations can yield good seeds for the main analysis. A DATA step deletes clusters with five or fewer observations, and the remaining cluster means provide seeds for the next PROC FASTCLUS analysis.
Two clusters are requested; the LEAST= option specifies the mean absolute deviation criterion (LEAST=1). Values of the LEAST= option less than 2 reduce the effect of outliers on cluster centers.
The next analysis also requests two clusters; the STRICT= option is specified to prevent outliers from distorting the results.
The STRICT= value is chosen to be close to the _GAP_ and _RADIUS_ values of the larger clusters in the diagnostic plot; the exact value is not critical.
A final PROC FASTCLUS run assigns the outliers to clusters.
The following SAS statements implement these steps, and the results are displayed in Output 38.2.3 through Output 38.2.8. First, an artificial data set is created with two clusters and some outliers. Then PROC FASTCLUS is run with many clusters to produce an OUTSEED= data set. A diagnostic plot using the variables _GAP_ and _RADIUS_ is then produced using the SGSCATTER procedure. The results from these steps are shown in Output 38.2.1 and Output 38.2.2.
title 'Using PROC FASTCLUS to Analyze Data with Outliers';
data x;
   drop n;
   do n=1 to 100;
      x=rannor(12345)+2;
      y=rannor(12345);
      output;
   end;
   do n=1 to 100;
      x=rannor(12345)-2;
      y=rannor(12345);
      output;
   end;
   do n=1 to 10;
      x=10*rannor(12345);
      y=10*rannor(12345);
      output;
   end;
run;

title2 'Preliminary PROC FASTCLUS Analysis with 20 Clusters';
proc fastclus data=x outseed=mean1 maxc=20 maxiter=0 summary;
   var x y;
run;

proc sgscatter data=mean1;
   compare y=(_gap_ _radius_) x=_freq_;
run;
Output 38.2.1: Preliminary Analysis of Data with Outliers Using PROC FASTCLUS
Cluster Summary | ||||||
---|---|---|---|---|---|---|
Cluster | Frequency | RMS Std Deviation | Maximum Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids |
1 | 8 | 0.4753 | 1.1924 | | 19 | 1.7205 |
2 | 1 | . | 0 | | 6 | 6.2847 |
3 | 44 | 0.6252 | 1.6774 | | 5 | 1.4386 |
4 | 1 | . | 0 | | 20 | 5.2130 |
5 | 38 | 0.5603 | 1.4528 | | 3 | 1.4386 |
6 | 2 | 0.0542 | 0.1085 | | 2 | 6.2847 |
7 | 1 | . | 0 | | 14 | 2.5094 |
8 | 2 | 0.6480 | 1.2961 | | 1 | 1.8450 |
9 | 1 | . | 0 | | 7 | 9.4534 |
10 | 1 | . | 0 | | 18 | 4.2514 |
11 | 1 | . | 0 | | 16 | 4.7582 |
12 | 20 | 0.5911 | 1.6291 | | 16 | 1.5601 |
13 | 5 | 0.6682 | 1.4244 | | 3 | 1.9553 |
14 | 1 | . | 0 | | 7 | 2.5094 |
15 | 5 | 0.4074 | 1.2678 | | 3 | 1.7609 |
16 | 22 | 0.4168 | 1.5139 | | 19 | 1.4936 |
17 | 8 | 0.4031 | 1.4794 | | 5 | 1.5564 |
18 | 1 | . | 0 | | 10 | 4.2514 |
19 | 45 | 0.6475 | 1.6285 | | 16 | 1.4936 |
20 | 3 | 0.5719 | 1.3642 | | 15 | 1.8999 |
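Before any clusters are deleted, the frequency cutoff can be checked directly against the OUTSEED= data set. The following statements are a minimal sketch and are not part of the original example: they sort the mean1 data set by descending _FREQ_ and print the diagnostic variables, so that the break between large clusters and isolated outliers is easy to see. The data set name mean1_sorted is made up for illustration, and the variable list assumes the usual OUTSEED= variables CLUSTER, _FREQ_, _GAP_, and _RADIUS_.

```sas
/* Illustrative only: list preliminary clusters from largest to smallest. */
/* mean1_sorted is a hypothetical name; mean1 comes from the OUTSEED=     */
/* option in the preliminary PROC FASTCLUS run.                           */
proc sort data=mean1 out=mean1_sorted;
   by descending _freq_;
run;

proc print data=mean1_sorted noobs;
   var cluster _freq_ _gap_ _radius_;
run;
```

In the listing, clusters with only a handful of observations should appear at the bottom, which makes it straightforward to judge whether a cutoff such as five observations separates them from the larger clusters.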
In the following SAS statements, a DATA step removes the low-frequency clusters, and then the FASTCLUS procedure is run again with the LEAST=1 clustering criterion, selecting its seeds from the high-frequency clusters of the previous analysis. The results are shown in Output 38.2.3 and Output 38.2.4.
data seed;
   set mean1;
   if _freq_>5;
run;

title2 'PROC FASTCLUS Analysis Using LEAST= Clustering Criterion';
title3 'Values < 2 Reduce Effect of Outliers on Cluster Centers';
proc fastclus data=x seed=seed maxc=2 least=1 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;
Output 38.2.3: Analysis of Data with Outliers Using the LEAST= Option
Using PROC FASTCLUS to Analyze Data with Outliers |
PROC FASTCLUS Analysis Using LEAST= Clustering Criterion |
Values < 2 Reduce Effect of Outliers on Cluster Centers |
Initial Seeds | ||
---|---|---|
Cluster | x | y |
1 | 2.794174248 | -0.065970836 |
2 | -2.027300384 | -2.051208579 |
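A one-dimensional toy case shows why LEAST= values below 2 resist outliers. The sketch below assumes, as the PROC FASTCLUS documentation describes, that the LEAST=1 center of a cluster is the per-variable median, whereas the default least squares center is the mean. The data set toy and the variable v are hypothetical and are not part of the example data.

```sas
/* Illustrative only: four ordinary values and one severe outlier. */
data toy;
   input v @@;
   datalines;
1 2 3 4 100
;
run;

/* Compare the least squares center (mean) with the L1 center (median). */
proc means data=toy mean median maxdec=2;
   var v;
run;
```

The single outlier pulls the mean to 22, while the median stays at 3. A center based on absolute deviations is affected in the same limited way by the scattered points in data set x.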
The FASTCLUS procedure is run again, selecting seeds from the high-frequency clusters in the previous analysis. This time the STRICT= option prevents outliers from distorting the results. The results are shown in Output 38.2.5 and Output 38.2.6.
title2 'PROC FASTCLUS Analysis Using STRICT= to Omit Outliers';
proc fastclus data=x seed=seed maxc=2 strict=3.0 out=out outseed=mean2;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;
Output 38.2.5: Cluster Analysis with Outliers Omitted: PROC FASTCLUS SGPLOT
12 Observation(s) were not assigned to a cluster because the minimum distance to a cluster seed exceeded the STRICT= value. |
Statistics for Variables | ||||
---|---|---|---|---|
Variable | Total STD | Within STD | R-Square | RSQ/(1-RSQ) |
x | 2.06854 | 0.87098 | 0.823609 | 4.669219 |
y | 1.02113 | 1.00352 | 0.039093 | 0.040683 |
OVER-ALL | 1.63119 | 0.93959 | 0.669891 | 2.029303 |
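The observations excluded by the STRICT= option can also be listed from the OUT= data set. The sketch below is not part of the original example. It relies on the documented PROC FASTCLUS behavior that an unassigned observation receives a negative CLUSTER value whose absolute value identifies the nearest seed, and it assumes that the OUT= data set contains the standard CLUSTER and DISTANCE variables. Run it before the final analysis, because that step overwrites the out data set.

```sas
/* Illustrative only: count observations per cluster value. Negative  */
/* values are assumed to mark observations left unassigned by STRICT=. */
proc freq data=out;
   tables cluster;
run;

/* List the unassigned observations and their distance to the nearest seed. */
proc print data=out;
   where cluster < 0;
   var x y cluster distance;
run;
```

If the assumption holds, the listing should contain the 12 observations reported in the note above, each with a DISTANCE value greater than the STRICT= value of 3.0.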
Finally, the FASTCLUS procedure is run one more time with zero iterations to assign outliers and tails to clusters. The results are shown in Output 38.2.7 and Output 38.2.8.
title2 'Final PROC FASTCLUS Analysis Assigning Outliers to Clusters';
proc fastclus data=x seed=mean2 maxc=2 maxiter=0 out=out;
   var x y;
run;

proc sgplot data=out;
   scatter y=y x=x / group=cluster;
run;