The CALIS Procedure

Example 29.14 Residual Diagnostics and Robust Estimation

Subsections:

Case-Level Residual Diagnostics
Direct Robust Estimation
Two-Stage Robust Estimation

This example illustrates case-level residual diagnostics and robust estimation with PROC CALIS. The data set is available in Mardia, Kent, and Bibby (1979). It contains 88 responses on five subjects: mechanics, vector, algebra, analysis, and statistics. Yuan and Hayashi (2010) use the same data set to illustrate the uses of residual diagnostics in the context of robust estimation. Inspired by their illustrations, this example mirrors their one-factor model, with some modifications.

Case-Level Residual Diagnostics

The following DATA step specifies and inputs five observed variables for the data set mardia:

data mardia;
   input mechanics vector algebra analysis statistics;
   datalines;
77.000     82.000     67.000     67.000     81.000 
63.000     78.000     80.000     70.000     81.000 
75.000     73.000     71.000     66.000     81.000 
  .          .          .          .          .
  .          .          .          .          .
     { More Data }
  .          .          .          .          .
  .          .          .          .          .
  .          .          .          .          .
  .          .          .          .          .
;

You specify a one-factor model for the five variables by using the following statements:

ods graphics on;

proc calis data=mardia residual plots=caseresid;
   path
      fact1 ===>  mechanics vector algebra analysis statistics = 1. ;
run;

ods graphics off;

In the PATH statement, all observed variables are indicators of the latent factor fact1. The first factor loading for mechanics is fixed at 1 for identification of the latent variable scale, while the remaining four loadings are free parameters in the model. The RESIDUAL option that is specified in the PROC CALIS statement requests residual analysis. Because raw data are input, case-level residual analysis is done. To request all the available ODS Graphics output for case-level residual diagnostics, you specify the PLOTS=CASERESID option. To enable ODS Graphics, you specify the ODS GRAPHICS ON statement before running the CALIS procedure.

Output 29.14.1 shows some of the fit statistics of the one-factor model. The model fit chi-square is 8.9782 (df = 5, p = 0.1099). Therefore, you fail to reject the one-factor model for the data. The SRMR is 0.0411, which shows a good fit. But the RMSEA is 0.0956, which does not show that the one-factor model is a very good approximating model for the data. Output 29.14.2 shows the loading estimates. All are statistically significant.

Output 29.14.1: Model Fit of Mardia Data

Fit Summary
Chi-Square	8.9782
Chi-Square DF	5
Pr > Chi-Square	0.1099
Standardized RMR (SRMR)	0.0475
RMSEA Estimate	0.0956

Output 29.14.2: Estimates of Loadings

PATH List
Path			Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
fact1	===>	mechanics		1.00000
fact1	===>	vector	_Parm1	0.83770	0.16598	5.0469	<.0001
fact1	===>	algebra	_Parm2	0.93064	0.15369	6.0552	<.0001
fact1	===>	analysis	_Parm3	1.09480	0.19611	5.5826	<.0001
fact1	===>	statistics	_Parm4	1.19229	0.22309	5.3444	<.0001

Output 29.14.3 shows the classifications of outlier and leverage observations in a two-dimensional space. Four regions are clearly shown.

Output 29.14.3: Outliers and Leverage Observations

The upper and the lower regions are separated by a horizontal line that represents the criterion for outlier detection. This criterion is set by a certain $\alpha$ -level that has a probability meaning. Loosely speaking, if the model is true and the distribution of the residuals is multivariate normal, you would expect that about $\alpha \times 100\%$ of observations will have their residual M-distances above the criterion. The larger the $\alpha$ value, the more liberal the outlier detection criterion. This $\alpha$ -level is set to 0.01 by default. You can change this default by using the ALPHAOUT= option.

Notice that all residuals and the criterion are measured in terms of M-distances. Hence, these measures are always positive. Output 29.14.3 classifies observations 81 and 28 as outliers (in the residual dimension). In addition to these two outlying observations, PROC CALIS also labels the next five observations with the largest residual M-distances. Output 29.14.3 shows that observations 33, 54, 56, 61, and 66 are the closest to being labeled as outliers.

The right and left regions are separated by a vertical line that represents the criterion for leverage observations. Again, this criterion is set by a certain $\alpha$ -level that has a probability meaning. If the model is true and the distribution of the predictor variables (in this case, the factor variable fact1) is multivariate normal, you would expect that about $\alpha \times 100\%$ of observations will have their leverage M-distances above the criterion. The larger the $\alpha$ value, the more liberal the leverage observation detection criterion. This $\alpha$ -level is also set to 0.01 by default. You can change this default by using the ALPHALEV= option.

Output 29.14.3 classifies observations 2, 87, and 88 as leverage points or observations. In addition to these three observations, PROC CALIS also labels the next five observations with the largest leverage M-distances. Output 29.14.3 shows that observations 1, 3, 6, 8, and 86 are the closest to being labeled as leverage observations.

The lower left region in Output 29.14.3 contains observations that are neither outliers nor leverage observations. This region contains the majority of the observations. The upper right region in Output 29.14.3 contains observations that are classified as both outliers and leverage observations. For the current model, no observations are classified as such.

To supplement the graphical depiction of outliers and leverage observations, PROC CALIS shows two tables that contain more numerical information. Output 29.14.4 shows the seven observations that have the largest residual M-distances. The table reflects the outlier classification in Output 29.14.3 numerically. Two outliers are flagged in the table. However, none of the seven observations are leverage observations. Output 29.14.5 shows the eight observations that have the largest leverage M-distances. Again, this table reflects the classification of leverage observations in Output 29.14.3 numerically. Three leverage observations are flagged, but none of these observations are residual outliers at the same time.

Output 29.14.4: Largest Residual Outliers

Diagnostics of the 7 Largest Case-Level Residuals (alpha=0.01)
Case Number	Residual (M-Distance)	Diagnostics
Case Number	Residual (M-Distance)	Outlier	Leverage
81	4.60511	*
28	3.69953	*
54	3.43068
33	3.17659
61	3.15104
56	3.06094
66	3.03262

Output 29.14.5: Largest Leverage Observations

Diagnostics of the 8 Largest Case-Level Leverages (alpha=0.01)
Case Number	Leverage (M-Distance)	Diagnostics
Case Number	Leverage (M-Distance)	Leverage	Outlier
87	2.96438	*
88	2.72441	*
2	2.69293	*
3	2.15519
1	2.04014
6	1.87894
86	1.72950
8	1.59242

Output 29.14.6 shows the histogram and its kernel density function of the residual M-distances. Theoretically, the density should look like a $\chi$ -distribution (that is, the distribution of the square root of the corresponding $\chi ^2$ variate). While this histogram provides an overall impression of the empirical distribution of the residual M-distances, it is not very useful in diagnosing the observations that do not conform to the theoretical distribution of the residual M-distances. To this end, Output 29.14.7 and Output 29.14.8 are more useful.

Output 29.14.6: Distribution of Case-Level Residuals

Output 29.14.7 shows the so-called Q-Q plot for the residual M-distances. The Y coordinates represent the observed quantiles, and the X coordinates represent the theoretical quantiles. Output 29.14.8 shows the so-called P-P plots for the residual M-distances. The Y coordinates represent the observed percentiles, and the X coordinates represent the theoretical percentiles. If the observed residual M-distances distribute exactly like the theoretical $\chi$ -distribution, all observations should lie on the straight lines with a slope of 1 in both plots. In Output 29.14.7 and Output 29.14.8, observations with the largest departures from the straight lines are labeled.

Output 29.14.7: Q-Q Plot of Residual M-Distances

In terms of quantiles, the Q-Q plot in Output 29.14.7 shows that observations 81, 28, and 54 have the largest departures from the theoretical distribution for the residual M-distances. Notice that observations 81 and 28 were also identified as outliers previously.

Output 29.14.8: P-P Plot of Residual M-Distances

In terms of percentiles, the P-P plot in Output 29.14.8 shows that several observations in the middle of the distribution have the largest departures from the theoretical distribution. However, it is not clear from the plot which observations have the largest departures.

To supplement the Q-Q and the P-P plots with numerical information, PROC CALIS outputs the table in Output 29.14.9, which shows the observations with the largest departures from the theoretical residual M-distance distribution in terms of quantiles and percentiles. The observations in the table are arranged by their percentile regions. Approximately 10 $\%$ of the observations with the largest departures in terms of quantile and in terms of percentile, respectively, are shown in this table. However, the final percentage of observations shown in the table might not add up to 20 $\%$ , because some observations could have large departures in terms of both quantile and percentile. Output 29.14.9 shows that observations 1, 15, 30, and 72 are this kind of observation. However, none of these are residual outliers.

Output 29.14.9: Largest Departures from the Theoretical Residual Distribution

Largest Departures From the Theoretical Residual Distribution
Percentile Region	Case Number	Quantile Deviation	Percentile Deviation
(.65,.90)	26		-0.05302
	36		-0.05836
	40		-0.05863
	15	-0.13087	-0.06509
	87		-0.05257
	71		-0.05145
	30	-0.11973	-0.05198
	1	-0.12344	-0.05065
	72	-0.13739	-0.05498
	64	-0.12550
(.90,.95)	66	0.13803
>.95	54	0.20039
	28	0.32398
	81	0.99843

Output 29.14.10 shows the residual on fit plot for the variable algebra. Notice that unlike the preceding residual diagnostic plots where the M-distances (nonnegative) for residuals are plotted, the residuals (which can be positive or negative) themselves are plotted against the predicted values for algebra in the current residual on fit (or residual on predicted) plot. If the linear relationship as described by the one-factor model is reasonable, then the residuals should show no systematic trend with the predicted values. That is, the residuals should be randomly scattered around the plot in the range of the predicted values. The pattern of residuals in Output 29.14.10 looks sufficiently "random," and therefore it gives support to the prescribed linear relationship between the variable algebra and the factor fact1. PROC CALIS also outputs the residual on fit plots for other dependent observed variables. To conserve space, they are not shown here. You can also control the amount of graphical output by specifying the desirable set of dependent observed variables for the residual on fit plots. For example, the following requests the residual on fit plots for the variables mechanics and analysis only, even though there are five dependent observed variables in the model:

PLOTS=RESBYPRED(VAR=mechanics analysis)

See the PLOTS= option for details.

Output 29.14.10: Residual on Fit Plot for the Variable Algebra

Direct Robust Estimation

Usually, the outlier detection process in residual diagnostics implies some treatments after their detections. For example, in the preceding analysis, observations 28 and 81 were detected as outliers. You would naturally think that the next step is to do an additional model fitting with these two outlying observations removed. While this thinking is intuitively appealing, the outlier issues are usually more complicated than can be resolved by such a "pick-and-remove" strategy. One issue is the so-called masking effect of outliers. In the preceding analysis, outliers are detected based on a criterion that is a function of the chosen criterion level (for example, the ALPHAOUT= value set by the researcher) and the parameter estimates. However, because the outliers have contributed to the estimation of the parameters, you might wonder whether the outlier detection process has been contaminated by the outliers themselves. If such a masking effect is present and an additional analysis is done after employing the pick-and-remove strategy for the outliers, you would be likely to discover more outliers. As a result, you might wonder whether the outlier removal process should be carried out indefinitely.

Robust estimation provides an alternative way to resolve the issue of masking effect. In robust estimation, outlying observations are downweighted simultaneously with the estimation. Iterative steps are used to reweight the observations according to the updated parameter estimates. Hence, with robust estimation you do not need to make a decision about removing outliers because they are already downweighted during the estimation. The issue of masking effect is avoided.

PROC CALIS provides two major methods of robust estimation—direct and two-stage. This section illustrates the direct robust estimation method, in which the model is estimated simultaneously with the (re)weighting of the observations. The next section illustrates the two-stage estimation method, in which the robust mean and covariance matrices obtained in the first stage are analyzed by the maximum likelihood (ML) estimation in the second stage. Both methods are based on the methodology proposed by Yuan and Zhong (2008) and Yuan and Hayashi (2010).

The following statements request direct robust estimation of the model for the mardia data set:

ods graphics on;

proc calis data=mardia residual robust plots=caseresid pcorr;
   path
      fact1 ===>  mechanics vector algebra analysis statistics = 1. ;
run;

ods graphics off;

The ROBUST option invokes the direct robust estimation method. The direct robust method reweights the observations according to the magnitudes of the residuals during iterative steps. For this reason, you can also specify ROBUST=RES to signify direct robust estimation with residual (M-distance) weighting. Following Yuan and Hayashi (2010), there are two variants of the direct robust estimation, One treats the disturbances (that is, error terms for endogenous latent factors) as factors to estimate (see the ROBUST= RES(F) option), and the other treats the disturbances as true error terms (see the ROBUST= RES(E) option). See the ROBUST= option and the section Robust Estimation for details. These variants would not make any difference for the current one-factor model estimation, and the ROBUST option that is used in the specification is the same as ROBUST=RES(E).

You also request case-level residual analysis by using the RESIDUAL option, even though the intention now is to identify rather than to remove the outliers.

Output 29.14.11 shows the fit summary of the direct robust estimation. The model fit chi-square is 6.0031 (df = 5, p = 0.3059). Again, you fail to reject the one-factor model for the data. Unlike the regular ML estimation in the preceding analysis, both SRMR (0.0321) and RMSEA (0.0480) show good model fit. This result can be attributed to the downweighting of the outlying observations in the direct robust estimation. Output 29.14.12 shows that all loading estimates are statistically significant.

Output 29.14.11: Model Fit of Mardia Data with Direct Robust Estimation

Fit Summary
Chi-Square	6.0031
Chi-Square DF	5
Pr > Chi-Square	0.3059
Standardized RMR (SRMR)	0.0370
RMSEA Estimate	0.0480
NOTE: Saturated mean structure parameters are excluded from the computations of fit indices.

Output 29.14.12: Estimates of Loadings with Direct Robust Estimation

PATH List
Path			Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
fact1	===>	mechanics		1.00000
fact1	===>	vector	_Parm1	0.83279	0.15680	5.3111	<.0001
fact1	===>	algebra	_Parm2	0.92854	0.14763	6.2895	<.0001
fact1	===>	analysis	_Parm3	1.08665	0.18855	5.7632	<.0001
fact1	===>	statistics	_Parm4	1.18799	0.21487	5.5289	<.0001

Output 29.14.13 shows the outlier and leverage classifications by the direct robust estimation. This plot has a similar pattern to the one obtained from the preceding regular ML estimation (see Output 29.14.3). Both plots show that observations 28 and 81 are the only outliers. Thus, the results from the direct robust estimation eliminate the concern of the masking effect.

Output 29.14.13: Outliers and Leverage Observations with Direct Robust Estimation

As a by-product of the direct robust estimation, robust (unstructured) covariance and mean matrices are computed with the PCORR option. Output 29.14.14 shows the robust covariance and mean matrices obtained for the current direct robust estimation for the mardia data set. These matrices are essentially computed by the usual formulas for covariance and mean matrices, but with weights obtained from the final estimates of the direct robust estimation of the model. For the current analysis, outlying observations (28 and 81) receive weights of less than 1 whereas all other observations receive weights of 1 in the computation.

Output 29.14.14: Robust Covariance and Mean Matrices with Direct Robust Estimation

Robust Covariance Matrix
	mechanics	vector	algebra	analysis	statistics
mechanics	299.05121	118.73107	104.04139	111.35028	122.12239
vector	118.73107	162.60829	86.88667	97.18075	102.10393
algebra	104.04139	86.88667	114.69031	113.43289	124.09414
analysis	111.35028	97.18075	113.43289	220.34769	156.62425
statistics	122.12239	102.10393	124.09414	156.62425	296.42019

Determinant	33723890038	Ln	24.241472

Robust Means
mechanics	vector	algebra	analysis	statistics
39.11015	50.80549	50.61153	46.70196	42.21135

Two-Stage Robust Estimation

In addition to the direct robust estimation method, PROC CALIS implements another robust estimation method for structural equation models. The two-stage robust estimation method estimates the robust means and covariance matrices in the first stage. It then fits the model to the robust mean and covariance matrix by using the ML method in the second stage. During the first stage, the robust method downweights the outlying observations in all variable dimensions. The robust properties are then propagated to the second stage of estimation.

In PROC CALIS, you can use the ROBUST=SAT option to invoke the two-stage robust estimation method, as shown in the following statements for fitting the one-factor model to the mardia data set:


proc calis data=mardia robust=sat pcorr;
   path
      fact1 ===>  mechanics vector algebra analysis statistics = 1. ;
run;

With the PCORR option, the (unstructured) robust mean and covariance matrices are displayed in Output 29.14.15. Although the robust covariance and mean matrices obtained here are numerically similar to those obtained from the direct robust estimation (see Output 29.14.14), they are obtained with totally different methods. The robust mean and covariance matrices in the direct robust estimation are the by-products of the model estimation, while the robust mean and covariance matrices in the two-stage robust estimation are the initial inputs for model estimation.

Output 29.14.15: Robust Covariance and Mean Matrices with Two-Stage Robust Estimation

Robust Covariance Matrix
	mechanics	vector	algebra	analysis	statistics
mechanics	298.83513	119.28292	99.65134	105.47098	116.68242
vector	119.28292	165.11323	84.25310	94.28855	98.70260
algebra	99.65134	84.25310	110.93012	110.46759	121.18480
analysis	105.47098	94.28855	110.46759	219.56424	155.24377
statistics	116.68242	98.70260	121.18480	155.24377	298.80146

Determinant	36001000868	Ln	24.306813

Robust Means
mechanics	vector	algebra	analysis	statistics
39.10571	50.74232	50.65677	46.73882	42.33179

Output 29.14.16 shows the fit summary of the two-stage robust estimation. The model fit chi-square is 7.2941 (df = 5, p = 0.1997). As in direct robust estimation, you fail to reject the one-factor model for the data. Unlike the case with direct robust estimation, only SRMR (0.0366) shows a good model fit. RMSEA is 0.0726 and is not very close to a good model fit.

Output 29.14.16: Model Fit of Mardia Data with Two-Stage Robust Estimation

Fit Summary
Chi-Square	7.2941
Chi-Square DF	5
Pr > Chi-Square	0.1997
Standardized RMR (SRMR)	0.0422
RMSEA Estimate	0.0726
NOTE: Saturated mean structure parameters are excluded from the computations of fit indices.

Output 29.14.16 shows the estimates of loadings with the two-stage robust estimation. All the estimates are statistically significant. All the numerical results are similar to that of the direct robust estimation in Output 29.14.12.

Output 29.14.17: Estimates of Loadings with Two-Stage Robust Estimation

PATH List
Path			Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
fact1	===>	mechanics		1.00000
fact1	===>	vector	_Parm1	0.84322	0.16538	5.0986	<.0001
fact1	===>	algebra	_Parm2	0.93683	0.15485	6.0500	<.0001
fact1	===>	analysis	_Parm3	1.10397	0.19849	5.5617	<.0001
fact1	===>	statistics	_Parm4	1.21080	0.22686	5.3371	<.0001

The outlier and leverage observation detection in two-stage robust estimation is essentially the same method as that in direct robust estimation. Therefore, the corresponding results for the two-stage robust estimation are not shown here. See the section Residual Diagnostics with Robust Estimation for details about case-level residual analysis with robust estimation.