Combining Robust Residual and Robust Distance :: SAS/IML(R) 12.1 User's Guide

Combining Robust Residual and Robust Distance

12.6 Hawkins-Bradu-Kass Data
12.7 Stackloss Data

This section is based entirely on Rousseeuw and Van Zomeren (1990). Observations $\mb{x}_i$ that are far away from most of the other observations are called leverage points. One classical method inspects the Mahalanobis distances $MD_i$ to find outliers $\mb{x}_i$:

$MD_ i = \sqrt {(\mb {x}_ i - \mu ) \mb {C}^{-1}(\mb {x}_ i - \mu )^ T}$

where $\mu$ is the classical sample mean and $\mb{C}$ is the classical sample covariance matrix.
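As a rough illustration of the formula above, the following Python/NumPy sketch computes classical Mahalanobis distances for a small made-up data matrix. It is not SAS/IML code and does not reproduce the MVE subroutine; the function name `mahalanobis_distances` and the sample data are assumptions introduced here for illustration only.

```python
import numpy as np

def mahalanobis_distances(X):
    """Classical Mahalanobis distance of each row of X from the sample mean.

    Hypothetical helper for illustration; rows are observations.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                          # classical sample mean
    C = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance, divisor N-1
    diff = X - mu
    # Solve C z = diff^T rather than forming the inverse of C explicitly.
    z = np.linalg.solve(C, diff.T).T
    # Row-wise quadratic form (x_i - mu) C^{-1} (x_i - mu)^T
    return np.sqrt(np.einsum("ij,ij->i", diff, z))

# Made-up data: four nearby points and one point far from the rest.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [10.0, 10.0]])
md = mahalanobis_distances(X)
print(np.round(md, 3))
```

In this sample the last row receives the largest distance. A quick sanity check on any such computation is that the squared distances sum to $(N-1)$ times the number of columns of the data matrix.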

Note that the MVE subroutine prints the classical Mahalanobis distances $MD_i$ together with the robust distances $RD_i$. In classical linear regression, the diagonal elements $h_{ii}$ of the hat matrix

$\mb {H} = \mb {X}(\mb {X}^ T\mb {X})^{-1}\mb {X}^ T$

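are used to identify leverage points. Rousseeuw and Van Zomeren (1990) report the following monotone relationship between the $h_{ii}$ and $MD_i$:

$h_{ii} = \frac{(MD_i)^2}{N-1} + \frac{1}{N}$

where $N$ is the number of observations. The relationship holds for a model with an intercept, with $MD_i$ computed from the explanatory variables.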

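They point out that neither the $MD_i$ nor the $h_{ii}$ is entirely safe for detecting leverage points reliably. Multiple outliers do not necessarily have large $MD_i$ values because of the masking effect: the outliers themselves distort the classical mean and covariance matrix from which the distances are computed.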
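The relationship between the hat-matrix diagonal and the Mahalanobis distance can be checked numerically. The Python/NumPy sketch below is an illustration under stated assumptions, not part of the SAS/IML software: the data are randomly generated, the design matrix includes an intercept column, and the names `n_obs`, `n_vars`, and `max_gap` are introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 20, 3
Z = rng.normal(size=(n_obs, n_vars))     # explanatory variables, no intercept

# Hat matrix H = X (X^T X)^{-1} X^T for the design with an intercept column.
X = np.column_stack([np.ones(n_obs), Z])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Squared classical Mahalanobis distances of the rows of Z.
diff = Z - Z.mean(axis=0)
C = np.cov(Z, rowvar=False)              # divisor n_obs - 1
md2 = np.einsum("ij,ij->i", diff, np.linalg.solve(C, diff.T).T)

# Largest absolute difference between h_ii and MD_i^2/(N-1) + 1/N.
max_gap = np.max(np.abs(h - (md2 / (n_obs - 1) + 1.0 / n_obs)))
print(max_gap)
```

The two sides should agree up to floating-point rounding. Because one is a monotone function of the other, $h_{ii}$ and $MD_i$ rank the observations identically and share the same vulnerability to masking.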

The definition of a leverage point is, therefore, based entirely on the outlyingness of $\mb{x}_i$ and is not related to the response value $y_i$. By including the $y_i$ value in the definition, Rousseeuw and Van Zomeren (1990) distinguish between the following:

Good leverage points are points $(\mb {x}_ i,y_ i)$ that are close to the regression plane; that is, good leverage points improve the precision of the regression coefficients.
Bad leverage points are points $(\mb {x}_ i,y_ i)$ that are far from the regression plane; that is, bad leverage points reduce the precision of the regression coefficients.

Rousseeuw and Van Zomeren (1990) propose plotting the standardized residuals of robust regression (LMS or LTS) versus the robust distances $RD_i$ obtained from MVE. Two horizontal lines that correspond to residual values of $+2.5$ and $-2.5$ are useful to distinguish between small and large residuals, and one vertical line that corresponds to $\sqrt{\chi^2_{n,.975}}$ is used to distinguish between small and large distances.
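The regions of this diagnostic plot can be expressed as a simple decision rule. The following Python sketch is one possible reading, not SAS/IML output: the function `classify_point`, the labels "vertical outlier" and "regular observation", and the sample values are assumptions introduced here. The distance cutoff is passed in by the caller; the example uses 9.348, the approximate 0.975 quantile of a chi-square distribution with 3 degrees of freedom, which assumes three explanatory variables.

```python
import math

def classify_point(std_resid, robust_dist, dist_cutoff, resid_cutoff=2.5):
    """Place one observation in a region of the residual-distance plot.

    Hypothetical helper: std_resid is a standardized robust residual,
    robust_dist a robust distance, dist_cutoff the vertical-line cutoff.
    """
    large_resid = abs(std_resid) > resid_cutoff   # outside the +/-2.5 band
    large_dist = robust_dist > dist_cutoff        # right of the vertical line
    if large_dist and large_resid:
        return "bad leverage point"
    if large_dist:
        return "good leverage point"
    if large_resid:
        return "vertical outlier"
    return "regular observation"

# Assumed cutoff for 3 explanatory variables: sqrt of chi-square(3) 0.975 quantile.
dist_cutoff = math.sqrt(9.348)

# Made-up (standardized residual, robust distance) pairs.
points = [(0.4, 1.1), (3.2, 1.5), (-0.8, 6.0), (-4.1, 7.5)]
labels = [classify_point(r, d, dist_cutoff) for r, d in points]
for (r, d), label in zip(points, labels):
    print(f"residual={r:5.1f}  distance={d:4.1f}  ->  {label}")
```

Keeping the cutoff as a parameter avoids tying the sketch to a particular statistics library; with SciPy available, `scipy.stats.chi2.ppf(0.975, df)` would give the quantile for other numbers of variables.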