The following table lists standardization methods and their corresponding location and scale measures available with the METHOD= option.
Table 91.2: Available Standardization Methods
Method | Location | Scale |
---|---|---|
MEAN | Mean | 1 |
MEDIAN | Median | 1 |
SUM | 0 | Sum |
EUCLEN | 0 | Euclidean length |
USTD | 0 | Standard deviation about origin |
STD | Mean | Standard deviation |
RANGE | Minimum | Range |
MIDRANGE | Midrange | Range/2 |
MAXABS | 0 | Maximum absolute value |
IQR | Median | Interquartile range |
MAD | Median | Median absolute deviation from median |
ABW(c) | Biweight one-step M-estimate | Biweight A-estimate |
AHUBER(c) | Huber one-step M-estimate | Huber A-estimate |
AWAVE(c) | Wave one-step M-estimate | Wave A-estimate |
AGK(p) | Mean | AGK estimate (ACECLUS) |
SPACING(p) | Mid-minimum spacing | Minimum spacing |
L(p) | L(p) | L(p) |
IN(ds) | Read from data set | Read from data set |
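As an illustration of how the simpler rows of the table translate into numbers, the following Python sketch computes the location and scale for a few methods and then applies (value - location) / scale. It is not PROC STDIZE code: the function names `location_scale` and `standardize` are hypothetical, and the use of the sample (n - 1) standard deviation for STD is an assumption made for this example.

```python
import math
import statistics


def location_scale(values, method):
    """Return (location, scale) for a handful of METHOD= rows in the table."""
    lo, hi = min(values), max(values)
    med = statistics.median(values)
    if method == "MEAN":
        return statistics.mean(values), 1.0
    if method == "MEDIAN":
        return med, 1.0
    if method == "SUM":
        return 0.0, sum(values)
    if method == "EUCLEN":
        return 0.0, math.sqrt(sum(x * x for x in values))
    if method == "STD":
        # Assumption: sample standard deviation (n - 1 divisor).
        return statistics.mean(values), statistics.stdev(values)
    if method == "RANGE":
        return lo, hi - lo
    if method == "MIDRANGE":
        return (lo + hi) / 2, (hi - lo) / 2
    if method == "MAXABS":
        return 0.0, max(abs(x) for x in values)
    if method == "MAD":
        return med, statistics.median([abs(x - med) for x in values])
    raise ValueError(f"method not covered by this sketch: {method}")


def standardize(values, method):
    """Subtract the location and divide by the scale."""
    location, scale = location_scale(values, method)
    return [(x - location) / scale for x in values]


sample = [2.0, 4.0, 4.0, 6.0, 9.0]
print(standardize(sample, "RANGE"))
```

With METHOD=RANGE the minimum maps to 0 and the maximum to 1, whereas METHOD=MIDRANGE would map them to -1 and 1.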
For METHOD=ABW(c), METHOD=AHUBER(c), or METHOD=AWAVE(c), c is a positive numeric tuning constant.
For METHOD=AGK(p), p is a numeric constant that gives the proportion of pairs to be included in the estimation of the within-cluster variances.
For METHOD=SPACING(p), p is a numeric constant that gives the proportion of data to be contained in the spacing.
For METHOD=L(p), p is a numeric constant greater than or equal to 1 that specifies the power to which differences are to be raised in computing an L(p) or Minkowski metric.
For METHOD=IN(ds), ds is the name of a SAS data set that meets either of the following two conditions:
The data set contains a _TYPE_ variable. The observation that contains the location measure corresponds to the value _TYPE_ = 'LOCATION', and the observation that contains the scale measure corresponds to the value _TYPE_ = 'SCALE'. You can also use a data set created by the OUTSTAT= option from another PROC STDIZE statement as the ds data set. See the section Output Data Sets for the contents of the OUTSTAT= data set.
The data set contains the location and scale variables specified by the LOCATION and SCALE statements.
PROC STDIZE reads in the location and scale variables in the ds data set by first looking for the _TYPE_ variable in the ds data set. If it finds this variable, PROC STDIZE continues to search for all variables specified in the VAR statement. If it does not find the _TYPE_ variable, PROC STDIZE searches for the location variables specified in the LOCATION statement and the scale variables specified in the SCALE statement.
The ds data set can also contain optional observations for which _TYPE_ = 'ADD' and _TYPE_ = 'MULT'. If these observations are found, the values in the _TYPE_ = 'MULT' observation are used as the multiplication constants, and the values in the _TYPE_ = 'ADD' observation are used as the addition constants; otherwise, the constants specified in the ADD= and MULT= options (or their default values) are used.
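The _TYPE_ lookup described above can be sketched in Python as follows. This is only an illustration of the rules in the text, not the procedure's implementation. The names `read_in_dataset` and `apply_params` are hypothetical, the rows stand in for observations of the ds data set, and the defaults of 0 for ADD and 1 for MULT as well as the final formula add + mult * (value - location) / scale are assumptions not stated in this section. The alternative path through the LOCATION and SCALE statements is not modeled.

```python
def read_in_dataset(rows, var_names, add_default=0.0, mult_default=1.0):
    """Collect per-variable constants from rows keyed by their _TYPE_ value.

    rows: list of dicts, each holding a '_TYPE_' key plus one value per variable.
    """
    by_type = {row["_TYPE_"]: row for row in rows if "_TYPE_" in row}
    if "LOCATION" not in by_type or "SCALE" not in by_type:
        raise ValueError("both LOCATION and SCALE observations are required")

    params = {}
    for name in var_names:
        params[name] = {
            "location": by_type["LOCATION"][name],
            "scale": by_type["SCALE"][name],
            # Optional observations: fall back to the option defaults if absent.
            "add": by_type["ADD"][name] if "ADD" in by_type else add_default,
            "mult": by_type["MULT"][name] if "MULT" in by_type else mult_default,
        }
    return params


def apply_params(value, p):
    """Assumed final formula: add + mult * (value - location) / scale."""
    return p["add"] + p["mult"] * (value - p["location"]) / p["scale"]


rows = [
    {"_TYPE_": "LOCATION", "x": 10.0},
    {"_TYPE_": "SCALE", "x": 2.0},
    {"_TYPE_": "MULT", "x": 3.0},  # no ADD observation, so the default is used
]
params = read_in_dataset(rows, ["x"])
print(apply_params(14.0, params["x"]))
```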
For robust estimators, see Goodall (1983) and Iglewicz (1983). The MAD method has the highest breakdown point (50%), but it is somewhat inefficient. The ABW, AHUBER, and AWAVE methods provide a good compromise between breakdown and efficiency. The L(p) location estimates are increasingly robust as p drops from 2 (which corresponds to least squares, or mean estimation) to 1 (which corresponds to least absolute value, or median estimation). However, the L(p) scale estimates are not robust.
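The statement that the L(p) location estimate moves from the mean toward the median as p drops from 2 to 1 can be checked numerically. The Python sketch below assumes that the L(p) location is the value m minimizing the sum of |x - m|^p, and it finds that value with a plain ternary search, which is valid because the cost is convex for p >= 1. The function name `lp_location` and the search method are choices made for this example and are not taken from PROC STDIZE.

```python
import statistics


def lp_location(values, p, tol=1e-10):
    """Approximate the m that minimizes sum(|x - m| ** p), for p >= 1."""
    if p < 1:
        raise ValueError("p must be greater than or equal to 1")

    def cost(m):
        return sum(abs(x - m) ** p for x in values)

    lo, hi = min(values), max(values)
    # Ternary search: the cost is convex in m, so a minimizer always
    # remains inside the bracket [lo, hi] as it shrinks.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) <= cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier
print("mean:", statistics.mean(data), "median:", statistics.median(data))
for p in (2, 1.5, 1):
    print("L(%s) location: %.4f" % (p, lp_location(data, p)))
```

For this sample, p = 2 reproduces the mean (pulled toward the outlier), p = 1 reproduces the median, and intermediate values of p fall in between.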
The SPACING method is robust to both outliers and clustering (Janssen et al., 1995) and is, therefore, a good choice for cluster analysis or nonparametric density estimation. The mid-minimum spacing method estimates the mode for small p. The AGK method is also robust to clustering and more efficient than the SPACING method, but it is not as robust to outliers and takes longer to compute. If you expect g clusters, the argument to METHOD=SPACING or METHOD=AGK should be 1/g or less. The AGK method is less biased than the SPACING method for small samples. As a general guide, it is reasonable to use AGK for samples of size 100 or less and SPACING for samples of size 1,000 or more, with the treatment of intermediate sample sizes depending on the available computer resources.
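To make the role of the proportion p concrete, the following Python sketch computes a minimum spacing and its midpoint for one variable. It assumes that the minimum spacing is the width of the narrowest window of consecutive sorted values that holds a proportion p of the data, and that the window contains ceil(p * n) points. The exact counting rule used by PROC STDIZE is not given here, so that rule and the function name `min_spacing` are assumptions.

```python
import math


def min_spacing(values, p):
    """Return (mid-minimum spacing, minimum spacing) for proportion p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    # Assumption: a window holds ceil(p * n) points, and at least 2.
    k = min(n, max(2, math.ceil(p * n)))
    # Slide the window over the sorted data and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    low, high = xs[best], xs[best + k - 1]
    return (low + high) / 2, high - low


# Two well-separated clusters (g = 2), so p = 1/g = 0.5.
data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.5, 11.0, 11.5]
print(min_spacing(data, 0.5))
print(min_spacing(data, 1.0))
```

With p = 0.5 the narrowest window sits inside the tighter cluster, so the scale reflects within-cluster spread. With p = 1 the window must span both clusters, and the scale is dominated by the gap between them, which is why a proportion of 1/g or less is advised.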