This example models the selling price of a house as a function of several covariates. One of these covariates is a classification variable that indicates whether a house is located on a corner lot (called a corner house in this example). Because corner houses are relatively rare, including this classification effect in the model introduces a low-dimensional structure into the design matrix: the majority of the observations lie in a lower-dimensional hyperplane defined by being non-corner houses. As discussed in the section Robust Distance, the presence of this low-dimensional structure causes difficulties in the traditional computation of robust distances. This example illustrates how you can use the projected robust distance to address those difficulties and to obtain meaningful leverage diagnostics. It also shows how you can use the RDPLOT= and DDPLOT= options to illustrate the outlier-leverage relationship.
The following house price data set contains 66 home resale records on seven variables from February 15 to April 30, 1993 (Data and Story Library, 2005). The records are randomly selected from the database maintained by the Albuquerque Board of Realtors.
    data house;
       input price sqft age feats ne cor tax @@;
       label price = "Selling price"
             sqft  = "Square feet of living space"
             age   = "Age of home in year"
             feats = "Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access)"
             ne    = "Located in northeast sector of city (1) or not (0)"
             cor   = "Corner location (1) or not (0)"
             tax   = "Annual taxes";
       sum = sqft+age+feats+ne+cor+tax;
       id = _N_;
       datalines;
    2050 2650 13 7 1 0 1639   2150 2664 6 5 1 0 1193
    2150 2921 3 6 1 0 1635    1999 2580 4 4 1 0 1732
    ... more lines ...
    870 1273 4 4 0 0 638      869 1165 7 4 0 0 694
    766 1200 7 4 0 1 634      739 970 4 4 0 1 541
    ;
To illustrate the dependence detection ability of the generalized MCD algorithm, an extra variable sum is created such that all the observations satisfy sum = sqft + age + feats + ne + cor + tax. Adding sum does not change the rank of the original design matrix, so that sum is expected to be ignored in the model and also in the diagnostics. The next statements apply the MM method and the generalized MCD algorithm to the house price data.
    ods graphics on;

    proc robustreg data=house method=MM plots=all;
       model price = sqft age feats ne cor tax sum / leverage(opc mcdinfo) diagnostics;
    run;
As shown in Output 80.5.1 and Output 80.5.2, PROC ROBUSTREG finds the design dependence equation and forces the parameter estimate of variable sum to be zero.
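The following short derivation sketches why fixing the coefficient of sum at zero loses nothing. The symbols $x_1,\dots,x_6$ (the six regressor columns sqft, age, feats, ne, cor, and tax), $s$ (the sum column), $\beta_j$, and $\gamma$ are notation introduced here for illustration only; they do not appear in the PROC ROBUSTREG output.

```latex
\[
s = \sum_{j=1}^{6} x_j
\quad\Longrightarrow\quad
\beta_0 + \sum_{j=1}^{6} \beta_j x_j + \gamma s
  = \beta_0 + \sum_{j=1}^{6} (\beta_j + \gamma)\, x_j .
\]
```

Any fit that uses a nonzero $\gamma$ can therefore be reproduced with $\gamma = 0$ by shifting the other coefficients, which is consistent with the zero estimate and zero degrees of freedom reported for sum.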
Output 80.5.1: MM Estimates
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | Chi-Square | Pr > ChiSq |
Intercept | 1 | 46.4062 | 79.1714 | -108.767 | 201.5792 | 0.34 | 0.5578 |
sqft | 1 | 0.3809 | 0.0756 | 0.2327 | 0.5291 | 25.37 | <.0001 |
age | 1 | -2.6067 | 1.7610 | -6.0582 | 0.8449 | 2.19 | 0.1388 |
feats | 1 | 8.3627 | 14.7107 | -20.4697 | 37.1951 | 0.32 | 0.5697 |
ne | 1 | 65.0081 | 40.1329 | -13.6508 | 143.6671 | 2.62 | 0.1053 |
cor | 1 | -19.2997 | 38.1907 | -94.1520 | 55.5526 | 0.26 | 0.6133 |
tax | 1 | 0.4699 | 0.1260 | 0.2229 | 0.7170 | 13.90 | 0.0002 |
sum | 0 | 0.0000 | . | . | . | . | . |
Scale | 0 | 157.5593 |
Output 80.5.2: Design Dependence Equations
Note: | The following variables have been ignored in the MCD computation because of linear dependence. |
sum = | sqft + age + feats + ne + cor + tax |
---|
Moreover, PROC ROBUSTREG also identifies a robust dependence equation on cor in Output 80.5.3, which holds for 77.27% of the observations but not for the entire data set.
Output 80.5.3: Robust Dependence Equations
Note: | The following robust dependence equations simultaneously hold for 77.27% of the observations in the data set. The breakdown setting for the MCD algorithm is 22.73%. |
cor = | 0 |
---|
Another way to represent the low-dimensional structure is to specify the coefficients of the MCD-dropped components on the data (see Output 80.5.4), which form a basis of the complementary space to the relevant low-dimensional hyperplane.
Output 80.5.4: Coefficients for MCD-Dropped Components
Coefficients for MCD-Dropped Components | ||
---|---|---|
Parameter | DesignDrop0 | RobustDrop1 |
sqft | 0 | 0 |
age | 0 | 0 |
feats | 0 | 0 |
ne | 0 | 0 |
cor | 0 | 1.0000 |
tax | 0 | 0 |
sum | 1.0000 | 0 |
By the definitions of projected robust distance and leverage point, an observation is called an off-plane leverage point if at least one of the robust or design dependence equations does not apply to the observation. In this example, the observations with cor = 1 are all off-plane leverage points. Output 80.5.5 lists the leverage points and outliers along with the relevant distance measurements and standardized residuals.
Output 80.5.5: Diagnostics
Diagnostics | ||||||
---|---|---|---|---|---|---|
Obs | Projected Mahalanobis Distance | Projected Robust Distance | Projected Off-Plane Distance | Leverage | Standardized Robust Residual | Outlier |
1 | 3.5567 | 4.0211 | 0.0000 | * | 0.8522 | |
13 | 4.0034 | 5.2310 | 0.0000 | * | 0.1411 | |
15 | 1.3221 | 1.5219 | 2.3681 | * | 0.0226 | |
16 | 1.0839 | 1.0905 | 2.3681 | * | 0.4148 | |
18 | 1.9452 | 2.4655 | 2.3681 | * | -0.2789 | |
20 | 3.6006 | 4.0771 | 2.3681 | * | -0.0150 | |
22 | 3.0210 | 3.4307 | 2.3681 | * | 1.1664 | |
23 | 1.5920 | 1.8197 | 2.3681 | * | 0.2422 | |
24 | 3.4967 | 4.5154 | 0.0000 | * | 0.6464 | |
26 | 3.0420 | 3.6975 | 0.0000 | * | -1.7068 | |
29 | 2.3264 | 2.9925 | 2.3681 | * | -2.4980 | |
30 | 1.2587 | 1.2714 | 2.3681 | * | -1.2558 | |
38 | 2.4064 | 2.7249 | 2.3681 | * | -1.0620 | |
42 | 1.4722 | 1.4645 | 2.3681 | * | 0.2584 | |
44 | 2.8491 | 3.0019 | 0.0000 | | 4.5665 | * |
46 | 3.9725 | 5.2271 | 0.0000 | * | 3.5835 | * |
47 | 2.9431 | 3.3728 | 2.3681 | * | 0.1365 | |
55 | 2.2325 | 2.9590 | 2.3681 | * | 0.3217 | |
56 | 1.7999 | 1.8119 | 2.3681 | * | 0.1715 | |
65 | 1.8831 | 2.1822 | 2.3681 | * | -0.1990 | |
66 | 2.2483 | 2.5673 | 2.3681 | * | 0.4134 |
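As a cross-check on the table above, the corner houses can be listed directly from the input data. This is a minimal sketch that assumes the data set house created by the DATA step earlier in this example; the choice of variables to print is arbitrary.

```sas
/* List the corner houses (cor = 1). Because id was set to _N_ in the
   DATA step, the id values can be compared with the Obs column of the
   Diagnostics table. */
proc print data=house;
   where cor = 1;
   var id price sqft age cor;
run;
```

If the classification described above holds, the id values that are printed should match the observations that have a nonzero off-plane distance in Output 80.5.5.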
Output 80.5.6 and Output 80.5.7 show no apparent difference between corner and non-corner houses, either in the standardized robust residuals or in the relationship between the projected Mahalanobis distance (MD) and the projected robust distance (RD), even though all the corner houses are classified as off-plane leverage points.
Output 80.5.6: Projected RDPLOT
Output 80.5.7: Projected DDPLOT
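The two plots above are produced here by PLOTS=ALL. As a hedged sketch, the following statements request only the RDPLOT and DDPLOT by listing them in the PLOTS= option; the MODEL statement is copied unchanged from the earlier call.

```sas
ods graphics on;

/* Request only the RDPLOT and DDPLOT instead of PLOTS=ALL. */
proc robustreg data=house method=MM plots=(rdplot ddplot);
   model price = sqft age feats ne cor tax sum / leverage(opc mcdinfo) diagnostics;
run;
```

Requesting the plots by name keeps the output focused on the outlier-leverage relationship when the other diagnostic plots are not needed.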
Output 80.5.8 shows more details of the robust diagnostics. The number of dimensions indicates that six regressors are used in the MCD analysis. Because sum is excluded from model fitting, it is also ignored in the MCD analysis. The number of robust dropped components equals 1 because of cor. The number of off-plane observations, 15, is the number of corner houses. The reweighted value of H is the number of observations that are finally used to estimate the MCD covariance.
Output 80.5.8: MCD Information
MCD Profile | |
---|---|
Number of Dimensions | 6 |
Number of Robust Dropped Components | 1 |
Number of Observations | 66 |
Number of Off-Plane Observations | 15 |
Specified Value of H | 51 |
Reweighted Value of H | 47 |
Breakdown Value | 0.2273 |
MCD Center | ||
---|---|---|
ParameterName | Parameter | Center |
sqft | sqft | 1752.7 |
age | age | 12.809 |
feats | feats | 4.0426 |
ne | ne | 0.6170 |
cor | cor | -2E-16 |
tax | tax | 895.40 |
sum | sum | 2665.6 |
MCD Covariance | |||||||
---|---|---|---|---|---|---|---|
Variable | sqft | age | feats | ne | cor | tax | sum |
sqft | 248870.3 | -853.232 | 147.0347 | 88.60083 | 0 | 148494.5 | 396747.3 |
age | -853.232 | 126.2886 | -1.18733 | 1.229417 | 0 | -1251.44 | -1978.34 |
feats | 147.0347 | -1.18733 | 0.99815 | 0.234043 | 0 | 87.0259 | 361.5814 |
ne | 88.60083 | 1.229417 | 0.234043 | 0.241443 | 0 | 45.76688 | 134.42 |
cor | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
tax | 148494.5 | -1251.44 | 87.0259 | 45.76688 | 0 | 106652.5 | 255147 |
sum | 396747.3 | -1978.34 | 361.5814 | 134.42 | 0 | 255147 | 650413.7 |
MCD Correlation | |||||||
---|---|---|---|---|---|---|---|
Variable | sqft | age | feats | ne | cor | tax | sum |
sqft | 1 | -0.15219 | 0.295009 | 0.361446 | 0 | 0.911462 | 0.986126 |
age | -0.15219 | 1 | -0.10575 | 0.222643 | 0 | -0.34099 | -0.21829 |
feats | 0.295009 | -0.10575 | 1 | 0.476749 | 0 | 0.266726 | 0.448759 |
ne | 0.361446 | 0.222643 | 0.476749 | 1 | 0 | 0.285206 | 0.339204 |
cor | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
tax | 0.911462 | -0.34099 | 0.266726 | 0.285206 | 0 | 1 | 0.968747 |
sum | 0.986126 | -0.21829 | 0.448759 | 0.339204 | 0 | 0.968747 | 1 |
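The correlation table can be checked against the covariance table by using the standard definition of a correlation coefficient. As a worked example for the sqft and tax entries (the symbols $r$ and $c$ are notation introduced here, not part of the output):

```latex
\[
r_{\text{sqft},\,\text{tax}}
  = \frac{c_{\text{sqft},\,\text{tax}}}
         {\sqrt{c_{\text{sqft},\,\text{sqft}}\; c_{\text{tax},\,\text{tax}}}}
  = \frac{148494.5}{\sqrt{248870.3 \times 106652.5}}
  \approx 0.9115 .
\]
```

This agrees with the reported value 0.911462 to four decimal places.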
You might speculate that the projected MD and projected RD are equal to the regular MD and RD computed on the same data set without the variable cor. This is not the case. (See Output 80.5.9 and Output 80.5.10 for the RDPLOT and DDPLOT on the data set without cor.) When cor is included in the MODEL statement, it is dropped in the distance calculation, but it is still used in the initial orthonormalization step and the h-subset search. In this example, including cor causes all the other covariates to be centered separately for corner houses and non-corner houses. Without cor, the centering process does not distinguish corner houses from non-corner houses, so the MCD algorithm can still be influenced by cor through the correlation between cor and the other covariates. The following statements drop the variable cor and produce the RDPLOT and DDPLOT for the reduced model, which are shown in Output 80.5.9 and Output 80.5.10:
    proc robustreg data=house method=MM plots=all;
       model price = sqft age feats ne tax / leverage(mcdinfo) diagnostics;
    run;

    ods graphics off;
Output 80.5.9: RDPLOT for the Reduced Model
Output 80.5.10: DDPLOT for the Reduced Model
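To get a rough sense of how different the two groups are in the remaining covariates, you can compare ordinary group means. This is only an illustrative sketch: it uses classical means from PROC MEANS, not the robust centers that the MCD algorithm computes, and it assumes the data set house from the DATA step earlier in this example.

```sas
/* Classical means of the remaining covariates, computed separately
   for non-corner (cor = 0) and corner (cor = 1) houses. */
proc means data=house n mean;
   class cor;
   var sqft age feats ne tax;
run;
```

Noticeable differences between the two rows of means would indicate how much a single pooled center, as used in the reduced model, differs from separate centering by cor.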
Compared with Output 80.5.8, Output 80.5.11 shows how the MCD information changes when cor is removed from the model. The corner houses are no longer identified as off-plane points, and the reweighted value of H increases from 47 to 52. The breakdown value is unchanged because it depends only on the specified value of H and the total number of observations.
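The reported percentages can be reproduced from the specified value of H and the number of observations. The formula $(n-h)/n$ used here is an assumption chosen because it matches the reported values; the text states only that the breakdown value depends on H and the number of observations. The symbols $n$ and $h$ are introduced here for illustration.

```latex
\[
n = 66,\quad h = 51:
\qquad
\frac{n-h}{n} = \frac{15}{66} \approx 0.2273,
\qquad
\frac{h}{n} = \frac{51}{66} \approx 0.7727 .
\]
```

The first value matches the breakdown value in both MCD profiles, and the second matches the 77.27% of observations cited for the robust dependence equation in Output 80.5.3.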
Output 80.5.11: MCD Information for the Reduced Model
MCD Profile | |
---|---|
Number of Dimensions | 5 |
Number of Robust Dropped Components | 0 |
Number of Observations | 66 |
Number of Off-Plane Observations | 0 |
Specified Value of H | 51 |
Reweighted Value of H | 52 |
Breakdown Value | 0.2273 |
MCD Center | ||
---|---|---|
ParameterName | Parameter | Center |
sqft | sqft | 1710.9 |
age | age | 11.173 |
feats | feats | 3.9423 |
ne | ne | 0.5962 |
tax | tax | 858.10 |
MCD Covariance | |||||
---|---|---|---|---|---|
sqft | age | feats | ne | tax | |
sqft | 216974.7 | 681.2327 | 199.2492 | 103.0388 | 107503.1 |
age | 681.2327 | 64.49887 | -0.9506 | 1.855581 | -187.135 |
feats | 199.2492 | -0.9506 | 0.878959 | 0.152715 | 114.9076 |
ne | 103.0388 | 1.855581 | 0.152715 | 0.245475 | 49.98077 |
tax | 107503.1 | -187.135 | 114.9076 | 49.98077 | 66558.68 |
MCD Correlation | |||||
---|---|---|---|---|---|
sqft | age | feats | ne | tax | |
sqft | 1 | 0.182102 | 0.456255 | 0.44647 | 0.89457 |
age | 0.182102 | 1 | -0.12625 | 0.466337 | -0.09032 |
feats | 0.456255 | -0.12625 | 1 | 0.328771 | 0.475075 |
ne | 0.44647 | 0.466337 | 0.328771 | 1 | 0.391018 |
tax | 0.89457 | -0.09032 | 0.475075 | 0.391018 | 1 |