Robust regression and outlier detection techniques have considerable applications to econometrics. The following example from Zaman, Rousseeuw, and Orhan (2001) shows how these techniques substantially improve the ordinary least squares (OLS) results for the growth study of De Long and Summers.
De Long and Summers (1991) studied the national growth of 61 countries from 1960 to 1985 by using OLS with the following data set, growth:
    data growth;
       input country $ GDP LFG EQP NEQ GAP @@;
       datalines;
    Argentin 0.0089 0.0118 0.0214 0.2286 0.6079
    Austria 0.0332 0.0014 0.0991 0.1349 0.5809
    Belgium 0.0256 0.0061 0.0684 0.1653 0.4109
    Bolivia 0.0124 0.0209 0.0167 0.1133 0.8634
    Botswana 0.0676 0.0239 0.1310 0.1490 0.9474
    ... more lines ...
    Venezuel 0.0120 0.0378 0.0340 0.0760 0.4974
    Zambia -0.0110 0.0275 0.0702 0.2012 0.8695
    Zimbabwe 0.0110 0.0309 0.0843 0.1257 0.8875
    ;
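The regression equation they used is

$$\mathrm{GDP} = \beta_0 + \beta_1\,\mathrm{LFG} + \beta_2\,\mathrm{GAP} + \beta_3\,\mathrm{EQP} + \beta_4\,\mathrm{NEQ} + \epsilon$$
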
where the response variable is the growth in gross domestic product per worker (GDP) and the regressors are labor force growth (LFG), relative GDP gap (GAP), equipment investment (EQP), and nonequipment investment (NEQ).
The following statements invoke the REG procedure (Chapter 79: The REG Procedure) for the OLS analysis:

    proc reg data=growth;
       model GDP = LFG GAP EQP NEQ;
    run;

The OLS analysis shown in Output 80.3.1 indicates that GAP and EQP have a significant influence on GDP at the 5% level.
Output 80.3.1: OLS Estimates
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | -0.01430 | 0.01028 | -1.39 | 0.1697 |
LFG | 1 | -0.02981 | 0.19838 | -0.15 | 0.8811 |
GAP | 1 | 0.02026 | 0.00917 | 2.21 | 0.0313 |
EQP | 1 | 0.26538 | 0.06529 | 4.06 | 0.0002 |
NEQ | 1 | 0.06236 | 0.03482 | 1.79 | 0.0787 |

The following statements invoke the ROBUSTREG procedure with the default M estimation:

    ods graphics on;

    proc robustreg data=growth plots=all;
       model GDP = LFG GAP EQP NEQ / diagnostics leverage;
       id country;
    run;

    ods graphics off;

Output 80.3.2 displays model information and summary statistics for variables in the model.
Output 80.3.2: Model Fitting Information and Summary Statistics
Model Information | |
---|---|
Data Set | WORK.GROWTH |
Dependent Variable | GDP |
Number of Independent Variables | 4 |
Number of Observations | 61 |
Method | M Estimation |

Summary Statistics

Variable | Q1 | Median | Q3 | Mean | Standard Deviation | MAD |
---|---|---|---|---|---|---|
LFG | 0.0118 | 0.0239 | 0.0281 | 0.0211 | 0.00979 | 0.00949 |
GAP | 0.5796 | 0.8015 | 0.8863 | 0.7258 | 0.2181 | 0.1778 |
EQP | 0.0265 | 0.0433 | 0.0720 | 0.0523 | 0.0296 | 0.0325 |
NEQ | 0.0956 | 0.1356 | 0.1812 | 0.1399 | 0.0570 | 0.0624 |
GDP | 0.0121 | 0.0231 | 0.0310 | 0.0224 | 0.0155 | 0.0150 |

Output 80.3.3 displays the M estimates. In addition to GAP and EQP, the robust analysis indicates that NEQ is significant at the 5% level (p = 0.0069, compared with 0.0787 under OLS). This new finding is explained by Output 80.3.4, which shows that Zambia, the 60th country in the data, is an outlier. Output 80.3.4 also identifies leverage points based on the robust MCD distances; however, there are no serious high-leverage points in this data set.
Output 80.3.3: M Estimates
Parameter Estimates

Parameter | DF | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|---|
Intercept | 1 | -0.0247 | 0.0097 | -0.0437 | -0.0058 | 6.53 | 0.0106 |
LFG | 1 | 0.1040 | 0.1867 | -0.2619 | 0.4699 | 0.31 | 0.5775 |
GAP | 1 | 0.0250 | 0.0086 | 0.0080 | 0.0419 | 8.36 | 0.0038 |
EQP | 1 | 0.2968 | 0.0614 | 0.1764 | 0.4172 | 23.33 | <.0001 |
NEQ | 1 | 0.0885 | 0.0328 | 0.0242 | 0.1527 | 7.29 | 0.0069 |
Scale | 1 | 0.0099 | | | | | |

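As a rough check on Output 80.3.3, the NEQ row can be approximately reproduced from the rounded table values. This sketch assumes that the chi-square statistics and confidence limits are Wald-type quantities built from the estimate and its standard error; that construction is an assumption here, not something stated in this example.

```latex
\chi^2 = \left(\frac{\hat\beta}{\mathrm{SE}(\hat\beta)}\right)^{2}
       = \left(\frac{0.0885}{0.0328}\right)^{2} \approx 7.28,
\qquad
\hat\beta \pm 1.96\,\mathrm{SE}(\hat\beta)
       = 0.0885 \pm 1.96 \times 0.0328 \approx (0.0242,\ 0.1528)
```

The small differences from the reported values (7.29 and 0.1527) are consistent with the displayed estimate and standard error being rounded to four decimal places.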
Output 80.3.4: Diagnostics
Diagnostics

Obs | country | Mahalanobis Distance | Robust MCD Distance | Leverage | Standardized Robust Residual | Outlier |
---|---|---|---|---|---|---|
1 | Argentin | 2.6083 | 4.0639 | * | -0.9424 | |
5 | Botswana | 3.4351 | 6.7391 | * | 1.4200 | |
8 | Canada | 3.1876 | 4.6843 | * | -0.1972 | |
9 | Chile | 3.6752 | 5.0599 | * | -1.8784 | |
17 | Finland | 2.6024 | 3.8186 | * | -1.7971 | |
23 | HongKong | 2.1225 | 3.8238 | * | 1.7161 | |
27 | Israel | 2.6461 | 5.0336 | * | 0.0909 | |
31 | Japan | 2.9179 | 4.7140 | * | 0.0216 | |
53 | Tanzania | 2.2600 | 4.3193 | * | -1.8082 | |
57 | U.S. | 3.8701 | 5.4874 | * | 0.1448 | |
58 | Uruguay | 2.5953 | 3.9671 | * | -0.0978 | |
59 | Venezuel | 2.9239 | 4.1663 | * | 0.3573 | |
60 | Zambia | 1.8562 | 2.7135 | | -4.9798 | * |
61 | Zimbabwe | 1.9634 | 3.9128 | * | -2.5959 | |

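The flags shown in Output 80.3.4 can also be saved to a data set for further processing. The following is a minimal sketch that assumes the OUTPUT statement of PROC ROBUSTREG with its OUT=, R=, SR=, OUTLIER=, and LEVERAGE= keywords. The data set name `robout` and the variable names `resid`, `stdres`, `outl`, and `lev` are illustrative and do not come from this example.

```sas
/* Sketch: save M-estimation diagnostics to a data set.                 */
/* robout, resid, stdres, outl, and lev are hypothetical names.         */
proc robustreg data=growth;
   model GDP = LFG GAP EQP NEQ / diagnostics leverage;
   id country;
   output out=robout r=resid sr=stdres outlier=outl leverage=lev;
run;

/* List only the flagged observations.                                  */
/* Assumes the indicator variables are coded 1 for flagged rows.        */
proc print data=robout;
   where outl = 1 or lev = 1;
   var country resid stdres outl lev;
run;
```

The LEVERAGE option is kept in the MODEL statement because the leverage indicator depends on the robust distances that the option requests.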
Output 80.3.5 displays robust versions of goodness-of-fit statistics for the model.
Output 80.3.5: Goodness-of-Fit Statistics
Goodness-of-Fit

Statistic | Value |
---|---|
R-Square | 0.3178 |
AICR | 80.2134 |
BICR | 91.5095 |
Deviance | 0.0070 |

The PLOTS=ALL option generates four diagnostic plots. Output 80.3.6 and Output 80.3.7 are for outlier and leverage-point diagnostics. Output 80.3.8 and Output 80.3.9 are a histogram and a Q-Q plot of the standardized robust residuals, respectively.
Output 80.3.6: RDPLOT for growth Data
Output 80.3.7: DDPLOT for growth Data
Output 80.3.8: Histogram
Output 80.3.9: Q-Q Plot
The following statements invoke the ROBUSTREG procedure with LTS estimation, the method used by Zaman, Rousseeuw, and Orhan (2001). The results are consistent with those of M estimation.

    proc robustreg method=lts(h=33) fwls data=growth seed=100;
       model GDP = LFG GAP EQP NEQ / diagnostics leverage;
       id country;
    run;

Output 80.3.10: LTS Estimates and LTS R Square
LTS Parameter Estimates

Parameter | DF | Estimate |
---|---|---|
Intercept | 1 | -0.0249 |
LFG | 1 | 0.1123 |
GAP | 1 | 0.0214 |
EQP | 1 | 0.2669 |
NEQ | 1 | 0.1110 |
Scale (sLTS) | 0 | 0.0076 |
Scale (Wscale) | 0 | 0.0109 |

R-Square for LTS Estimation | |
---|---|
R-Square | 0.7418 |

Output 80.3.10 displays the LTS estimates and the LTS R Square.
Output 80.3.11: Diagnostics
Diagnostics

Obs | country | Mahalanobis Distance | Robust MCD Distance | Leverage | Standardized Robust Residual | Outlier |
---|---|---|---|---|---|---|
1 | Argentin | 2.6083 | 4.0639 | * | -1.0715 | |
5 | Botswana | 3.4351 | 6.7391 | * | 1.6574 | |
8 | Canada | 3.1876 | 4.6843 | * | -0.2324 | |
9 | Chile | 3.6752 | 5.0599 | * | -2.0896 | |
17 | Finland | 2.6024 | 3.8186 | * | -1.6367 | |
23 | HongKong | 2.1225 | 3.8238 | * | 1.7570 | |
27 | Israel | 2.6461 | 5.0336 | * | 0.2334 | |
31 | Japan | 2.9179 | 4.7140 | * | 0.0971 | |
53 | Tanzania | 2.2600 | 4.3193 | * | -1.2978 | |
57 | U.S. | 3.8701 | 5.4874 | * | 0.0605 | |
58 | Uruguay | 2.5953 | 3.9671 | * | -0.0857 | |
59 | Venezuel | 2.9239 | 4.1663 | * | 0.4113 | |
60 | Zambia | 1.8562 | 2.7135 | | -4.4984 | * |
61 | Zimbabwe | 1.9634 | 3.9128 | * | -2.1201 | |

Output 80.3.11 displays outlier and leverage-point diagnostics based on the LTS estimates and the robust MCD distances.
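The flags in the diagnostics tables can be related to cutoff values. The following sketch assumes the procedure's default settings, which are not stated in this example. Under that assumption, an observation is flagged as a leverage point when its robust MCD distance exceeds the square root of the 0.975 quantile of a chi-square distribution with p degrees of freedom, where p is the number of regressors excluding the intercept. It is flagged as an outlier when the absolute standardized robust residual exceeds 3. The variable names are illustrative only.

```sas
/* Sketch: compute the assumed default cutoffs behind the flags.        */
/* cutoffalpha = 0.025 and out_cutoff = 3 are assumed defaults;         */
/* all variable names here are hypothetical.                            */
data _null_;
   p           = 4;                               /* LFG, GAP, EQP, NEQ  */
   cutoffalpha = 0.025;
   lev_cutoff  = sqrt(cinv(1 - cutoffalpha, p));  /* chi-square quantile */
   out_cutoff  = 3;
   put lev_cutoff= 8.4 out_cutoff=;
run;
```

With p = 4 the leverage cutoff is about 3.34. This is consistent with both diagnostics tables: every starred leverage point has a robust MCD distance of at least 3.8186, and Zambia (2.7135) is not flagged as a leverage point. Zambia is the only observation whose standardized robust residual exceeds 3 in absolute value.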
Output 80.3.12: Final Weighted LS Estimates
Parameter Estimates for Final Weighted Least Squares Fit

Parameter | DF | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|---|
Intercept | 1 | -0.0222 | 0.0093 | -0.0405 | -0.0039 | 5.65 | 0.0175 |
LFG | 1 | 0.0446 | 0.1771 | -0.3026 | 0.3917 | 0.06 | 0.8013 |
GAP | 1 | 0.0245 | 0.0082 | 0.0084 | 0.0406 | 8.89 | 0.0029 |
EQP | 1 | 0.2824 | 0.0581 | 0.1685 | 0.3964 | 23.60 | <.0001 |
NEQ | 1 | 0.0849 | 0.0314 | 0.0233 | 0.1465 | 7.30 | 0.0069 |
Scale | 0 | 0.0116 | | | | | |

Output 80.3.12 displays the final weighted least squares estimates, which are identical to those reported in Zaman, Rousseeuw, and Orhan (2001).