This example contrasts several of the robust methods available in the ROBUSTREG procedure.
The following statements generate 1,000 random observations. The first 900 observations are from a linear model, and the last 100 observations are significantly biased in the y-direction. In other words, 10% of the observations are contaminated with outliers.
data a (drop=i);
   do i=1 to 1000;
      x1=rannor(1234);
      x2=rannor(1234);
      e=rannor(1234);
      if i > 900 then y=100 + e;
      else y=10 + 5*x1 + 3*x2 + .5 * e;
      output;
   end;
run;
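As a quick sanity check on the contamination described above, the following sketch flags the last 100 observations and summarizes y within each group. The data set name a_check and the variable contaminated are hypothetical names introduced only for this illustration; the flag relies on the observation order produced by the DATA step.

```sas
/* Sketch: flag the contaminated observations by their position.        */
/* a_check and contaminated are hypothetical names, not part of the     */
/* original example.                                                     */
data a_check;
   set a;
   contaminated = (_n_ > 900);   /* 1 for the last 100 observations */
run;

/* Summarize y separately for clean and contaminated observations. */
proc means data=a_check n mean std;
   class contaminated;
   var y;
run;
```

The contaminated group should show a mean of y near 100, far from the values generated by the linear model.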
The following statements invoke PROC REG and PROC ROBUSTREG with the data set a.
proc reg data=a;
   model y = x1 x2;
run;

proc robustreg data=a method=m;
   model y = x1 x2;
run;

proc robustreg data=a method=mm seed=100;
   model y = x1 x2;
run;

proc robustreg data=a method=s seed=100;
   model y = x1 x2;
run;

proc robustreg data=a method=lts seed=100;
   model y = x1 x2;
run;
The tables of parameter estimates generated by using M estimation, MM estimation, S estimation, and LTS estimation in the ROBUSTREG procedure are shown in Output 80.1.2, Output 80.1.3, Output 80.1.4, and Output 80.1.5, respectively. For comparison, the ordinary least squares (OLS) estimates produced by the REG procedure (see Chapter 79: The REG Procedure) are shown in Output 80.1.1. The four robust methods, M, MM, S, and LTS, correctly estimate the regression coefficients for the underlying model (10, 5, and 3), but OLS does not: its estimates (approximately 19.07, 3.55, and 2.12) are pulled away from the true values by the contaminated observations.
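To compare the estimates programmatically rather than by reading the displayed tables, one option is to capture them with the ODS OUTPUT statement. The following is a minimal sketch that assumes the ODS table name ParameterEstimates for both procedures; the data set names ols_est and m_est are hypothetical.

```sas
/* Sketch: save the OLS parameter estimates to a data set.              */
/* ols_est and m_est are hypothetical names chosen for illustration.    */
ods output ParameterEstimates=ols_est;
proc reg data=a;
   model y = x1 x2;
run;
quit;

/* Save the M estimates in the same way. */
ods output ParameterEstimates=m_est;
proc robustreg data=a method=m;
   model y = x1 x2;
run;

/* Print both sets of estimates for comparison. */
proc print data=ols_est;
run;

proc print data=m_est;
run;
```

The same pattern can be repeated for the MM and S fits if you want all of the estimates in data sets.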
Output 80.1.1: OLS Estimates for Data with 10% Contamination
Parameter Estimates | |||||
---|---|---|---|---|---|
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | 1 | 19.06712 | 0.86322 | 22.09 | <.0001 |
x1 | 1 | 3.55485 | 0.86892 | 4.09 | <.0001 |
x2 | 1 | 2.12341 | 0.83039 | 2.56 | 0.0107 |
Output 80.1.2: M Estimates for Data with 10% Contamination
Model Information | |
---|---|
Data Set | WORK.A |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | M Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 10.0024 | 0.0174 | 9.9683 | 10.0364 | 331908 | <.0001 |
x1 | 1 | 5.0077 | 0.0175 | 4.9735 | 5.0420 | 82106.9 | <.0001 |
x2 | 1 | 3.0161 | 0.0167 | 2.9834 | 3.0488 | 32612.5 | <.0001 |
Scale | 1 | 0.5780 |
Output 80.1.3: MM Estimates for Data with 10% Contamination
Model Information | |
---|---|
Data Set | WORK.A |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | MM Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 10.0035 | 0.0176 | 9.9690 | 10.0379 | 323947 | <.0001 |
x1 | 1 | 5.0085 | 0.0178 | 4.9737 | 5.0433 | 79600.6 | <.0001 |
x2 | 1 | 3.0181 | 0.0168 | 2.9851 | 3.0511 | 32165.0 | <.0001 |
Scale | 0 | 0.6733 |
Output 80.1.4: S Estimates for Data with 10% Contamination
Model Information | |
---|---|
Data Set | WORK.A |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | S Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 10.0055 | 0.0180 | 9.9703 | 10.0408 | 309917 | <.0001 |
x1 | 1 | 5.0096 | 0.0182 | 4.9740 | 5.0452 | 76045.2 | <.0001 |
x2 | 1 | 3.0210 | 0.0172 | 2.9873 | 3.0547 | 30841.3 | <.0001 |
Scale | 0 | 0.6721 |
Output 80.1.5: LTS Estimates for Data with 10% Contamination
Model Information | |
---|---|
Data Set | WORK.A |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | LTS Estimation |
LTS Parameter Estimates | ||
---|---|---|
Parameter | DF | Estimate |
Intercept | 1 | 10.0083 |
x1 | 1 | 5.0316 |
x2 | 1 | 3.0396 |
Scale (sLTS) | 0 | 0.5880 |
Scale (Wscale) | 0 | 0.5113 |
The next statements demonstrate that if the percentage of contamination is increased to 40%, the M method and the MM method with default options fail to estimate the underlying model. Output 80.1.6 and Output 80.1.7 display these estimates. However, by tuning the constant c for the M method and the constants INITH and K0 for the MM method, you can increase the breakdown values of the estimates and capture the right model. Output 80.1.8 and Output 80.1.9 display these estimates. Similarly, you can tune the constant EFF for the S method and the constant H for the LTS method to correctly estimate the underlying model; the results for those two methods are not presented here.
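As background for the constant c mentioned above: the tuned M fit in this example selects Tukey's bisquare weight function through WF=BISQUARE(C=2). The standard textbook form of that weight function, given here as a reference and not quoted from this example, is shown below, where r denotes a residual divided by the scale estimate.

```latex
$$
w(r) =
\begin{cases}
\left(1 - \left(\dfrac{r}{c}\right)^{2}\right)^{2}, & |r| \le c, \\[6pt]
0, & |r| > c.
\end{cases}
$$
```

Observations whose scaled residuals exceed c in absolute value receive zero weight, so a smaller c discounts gross outliers more aggressively, at the cost of some efficiency on uncontaminated data.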
data b (drop=i);
   do i=1 to 1000;
      x1=rannor(1234);
      x2=rannor(1234);
      e=rannor(1234);
      if i > 600 then y=100 + e;
      else y=10 + 5*x1 + 3*x2 + .5 * e;
      output;
   end;
run;
proc robustreg data=b method=m;
   model y = x1 x2;
run;

proc robustreg data=b method=mm;
   model y = x1 x2;
run;

proc robustreg data=b method=m(wf=bisquare(c=2));
   model y = x1 x2;
run;

proc robustreg data=b method=mm(inith=502 k0=1.8);
   model y = x1 x2;
run;
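The S and LTS fits for data set b, whose results are not presented, might be requested along the following lines. This is a sketch under stated assumptions: the values K0=1.8 and H=502 are borrowed from the leverage-point fits later in this example and are not values reported for data set b. The S method is tuned here through K0 rather than EFF because K0 is the form that appears elsewhere in this example.

```sas
/* Sketch only: S and LTS fits for data set b (40% contamination).      */
/* k0=1.8 and h=502 are assumptions taken from the later fits on data   */
/* set c, not settings the example reports for data set b.              */
proc robustreg data=b method=s(k0=1.8) seed=100;
   model y = x1 x2;
run;

proc robustreg data=b method=lts(h=502) seed=100;
   model y = x1 x2;
run;
```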
Output 80.1.6: M Estimates (Default Setting) for Data with 40% Contamination
Model Information | |
---|---|
Data Set | WORK.B |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | M Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 44.8991 | 1.5609 | 41.8399 | 47.9584 | 827.46 | <.0001 |
x1 | 1 | 2.4309 | 1.5712 | -0.6485 | 5.5104 | 2.39 | 0.1218 |
x2 | 1 | 1.3742 | 1.5015 | -1.5687 | 4.3171 | 0.84 | 0.3601 |
Scale | 1 | 56.6342 |
Output 80.1.7: MM Estimates (Default Setting) for Data with 40% Contamination
Model Information | |
---|---|
Data Set | WORK.B |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | MM Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 43.0607 | 1.7978 | 39.5370 | 46.5844 | 573.67 | <.0001 |
x1 | 1 | 2.7369 | 1.8140 | -0.8185 | 6.2924 | 2.28 | 0.1314 |
x2 | 1 | 1.5211 | 1.7265 | -1.8628 | 4.9049 | 0.78 | 0.3783 |
Scale | 0 | 52.8496 |
Output 80.1.8: M Estimates (Tuned) for Data with 40% Contamination
Model Information | |
---|---|
Data Set | WORK.B |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | M Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 10.0137 | 0.0219 | 9.9708 | 10.0565 | 209688 | <.0001 |
x1 | 1 | 4.9905 | 0.0220 | 4.9473 | 5.0336 | 51399.1 | <.0001 |
x2 | 1 | 3.0399 | 0.0210 | 2.9987 | 3.0811 | 20882.4 | <.0001 |
Scale | 1 | 1.0531 |
Output 80.1.9: MM Estimates (Tuned) for Data with 40% Contamination
Model Information | |
---|---|
Data Set | WORK.B |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | MM Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 10.0103 | 0.0213 | 9.9686 | 10.0520 | 221639 | <.0001 |
x1 | 1 | 4.9890 | 0.0218 | 4.9463 | 5.0316 | 52535.9 | <.0001 |
x2 | 1 | 3.0363 | 0.0201 | 2.9970 | 3.0756 | 22895.5 | <.0001 |
Scale | 0 | 1.8992 |
When there are bad leverage points, the M method fails to estimate the underlying model no matter what constant c you use. In this case, other methods (LTS, S, and MM) in PROC ROBUSTREG, which are robust to bad leverage points, correctly estimate the underlying model.
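The following statements generate 1,000 observations in which 1% (the first 10) are bad high-leverage points, and then analyze them with the MM, S, and LTS methods. The data set keeps the 40% contamination in the y-direction that was used in the previous data set.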
data c (drop=i);
   do i=1 to 1000;
      x1=rannor(1234);
      x2=rannor(1234);
      e=rannor(1234);
      if i > 600 then y=100 + e;
      else y=10 + 5*x1 + 3*x2 + .5 * e;
      if i < 11 then x1=200 * rannor(1234);
      if i < 11 then x2=200 * rannor(1234);
      if i < 11 then y= 100*e;
      output;
   end;
run;
proc robustreg data=c method=mm(inith=502 k0=1.8) seed=100;
   model y = x1 x2;
run;

proc robustreg data=c method=s(k0=1.8) seed=100;
   model y = x1 x2;
run;

proc robustreg data=c method=lts(h=502) seed=100;
   model y = x1 x2;
run;
Output 80.1.10 displays the MM estimates with initial LTS estimates, Output 80.1.11 displays the S estimates, and Output 80.1.12 displays the LTS estimates.
Output 80.1.10: MM Estimates for Data with 1% Leverage Points
Model Information | |
---|---|
Data Set | WORK.C |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | MM Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 9.9820 | 0.0215 | 9.9398 | 10.0241 | 215369 | <.0001 |
x1 | 1 | 5.0303 | 0.0206 | 4.9898 | 5.0707 | 59469.1 | <.0001 |
x2 | 1 | 3.0222 | 0.0221 | 2.9789 | 3.0655 | 18744.9 | <.0001 |
Scale | 0 | 2.2134 |
Output 80.1.11: S Estimates for Data with 1% Leverage Points
Model Information | |
---|---|
Data Set | WORK.C |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | S Estimation |
Parameter Estimates | |||||||
---|---|---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | 95% Confidence Limits | Chi-Square | Pr > ChiSq | |
Intercept | 1 | 9.9808 | 0.0216 | 9.9383 | 10.0232 | 212532 | <.0001 |
x1 | 1 | 5.0303 | 0.0208 | 4.9896 | 5.0710 | 58656.3 | <.0001 |
x2 | 1 | 3.0217 | 0.0222 | 2.9782 | 3.0652 | 18555.7 | <.0001 |
Scale | 0 | 2.2094 |
Output 80.1.12: LTS Estimates for Data with 1% Leverage Points
Model Information | |
---|---|
Data Set | WORK.C |
Dependent Variable | y |
Number of Independent Variables | 2 |
Number of Observations | 1000 |
Method | LTS Estimation |
LTS Parameter Estimates | ||
---|---|---|
Parameter | DF | Estimate |
Intercept | 1 | 9.9742 |
x1 | 1 | 5.0010 |
x2 | 1 | 3.0219 |
Scale (sLTS) | 0 | 0.9952 |
Scale (Wscale) | 0 | 0.5216 |
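The example states that the M method fails on data with bad leverage points but does not show that fit. The following sketch reuses the tuned bisquare constant from the 40% contamination case so that you can check the claim on data set c, and then refits the MM method with the DIAGNOSTICS and LEVERAGE options in the MODEL statement to list the observations that are flagged as outliers or leverage points. Neither run is part of the original example, and no output is reproduced here.

```sas
/* Sketch: M estimation on data set c with the tuned bisquare constant. */
/* The example states that M estimation fails here; no output is shown. */
proc robustreg data=c method=m(wf=bisquare(c=2));
   model y = x1 x2;
run;

/* Sketch: MM fit with outlier and leverage-point diagnostics requested. */
proc robustreg data=c method=mm(inith=502 k0=1.8) seed=100;
   model y = x1 x2 / diagnostics leverage;
run;
```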