For ordinary generalized linear models, regression diagnostic statistics developed by Williams (1987) can be requested in an output data set or in the OBSTATS table by specifying the DIAGNOSTICS | INFLUENCE option in the MODEL statement. These diagnostics measure the influence of an individual observation on model fit, and generalize the one-step diagnostics developed by Pregibon (1981) for the logistic regression model for binary data.
Preisser and Qaqish (1996) further generalized regression diagnostics to apply to models for correlated data fit by generalized estimating equations (GEEs), where the influence of entire clusters of correlated observations, or the influence of individual observations within a cluster, is measured. These diagnostic statistics can be requested in an output data set or in the OBSTATS table if a model for correlated data is specified with a REPEATED statement.
The next two sections use the following notation:
$\hat{\beta}$ is the maximum likelihood estimate of the regression parameters $\beta$, or, in the case of correlated data, the solution of the GEEs.
$\hat{\beta}_{[i]}$ is the corresponding estimate evaluated with the ith observation deleted, or, in the case of correlated data, with the ith cluster deleted.
$p$ is the dimension of the regression parameter vector $\beta$.
$r_{pi}$ is the standardized Pearson residual $(y_i - \mu_i)/\sqrt{v_i (1 - h_i)}$, where $v_i$ is the variance of the ith response and $h_i$ is the leverage defined in the section H | LEVERAGE.
$v_i$ is the variance of response i, $v_i = \phi V(\mu_i)$, where $V(\mu)$ is the variance function and $\phi$ is the dispersion parameter.
$w_i$ is the prior weight of the ith observation specified with the WEIGHT statement. If there is no WEIGHT statement, $w_i = 1$ for all i.
All unknown quantities are replaced by their estimated values in the following two sections.
The following statistics are available for generalized linear models.
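The DFBETA statistic for measuring the influence of the ith observation is defined as the one-step approximation to the difference in the MLE of the regression parameter vector and the MLE of the regression parameter vector without the ith observation. This one-step approximation assumes a Fisher scoring step, and is given by
$$\mathrm{DFBETA}_i = \hat{\beta} - \hat{\beta}_{[i]} \approx (X'WX)^{-1} x_i \, \frac{w_i (y_i - \mu_i)}{\phi V(\mu_i) \, g'(\mu_i) \, (1 - h_i)}$$
where $x_i$ is the covariate vector of the ith observation, $g$ is the link function, $W$ is the diagonal Fisher scoring weight matrix with ith diagonal element $w_i / (\phi V(\mu_i) g'(\mu_i)^2)$, and $h_i$ is the leverage defined in the section H | LEVERAGE.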
The standardized DFBETA statistic for assessing the influence of the ith observation on the jth regression parameter is defined as the DFBETA statistic for the jth parameter divided by its estimated standard deviation, where the standard deviation is estimated from all the data.
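As an illustration of the DFBETA and standardized DFBETA definitions, the following is a minimal sketch rather than the procedure's implementation. It assumes a normal model with identity link, unit prior weights, and a single covariate plus intercept, so that the Fisher scoring weights are constant and the one-step formula reduces to $(X'X)^{-1} x_i (y_i - \mu_i)/(1 - h_i)$. The helper names `solve2`, `fit`, `dfbeta`, and `dfbetas`, the toy data, and the dispersion estimate (residual sum of squares over n - p) are all assumptions made for the example.

```python
import math

def solve2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def fit(xs, ys):
    """Least squares fit of y = b0 + b1*x; returns (beta, X'X)."""
    xtx = [[float(len(xs)), sum(xs)], [sum(xs), sum(x * x for x in xs)]]
    xty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]
    return solve2(xtx, xty), xtx

def dfbeta(xs, ys):
    """One-step DFBETA vector for each observation (normal, identity link)."""
    beta, xtx = fit(xs, ys)
    rows = []
    for x, y in zip(xs, ys):
        c = solve2(xtx, [1.0, x])          # (X'X)^{-1} x_i
        h = c[0] + x * c[1]                # leverage h_i = x_i'(X'X)^{-1} x_i
        e = y - (beta[0] + beta[1] * x)    # raw residual y_i - mu_i
        rows.append([cj * e / (1.0 - h) for cj in c])
    return rows

def dfbetas(xs, ys):
    """DFBETA divided by the full-data standard error of each coefficient."""
    beta, xtx = fit(xs, ys)
    rss = sum((y - beta[0] - beta[1] * x) ** 2 for x, y in zip(xs, ys))
    phi = rss / (len(xs) - 2)              # assumed dispersion estimate
    se = [math.sqrt(phi * solve2(xtx, [1.0, 0.0])[0]),
          math.sqrt(phi * solve2(xtx, [0.0, 1.0])[1])]
    return [[d[0] / se[0], d[1] / se[1]] for d in dfbeta(xs, ys)]

# Hypothetical toy data; the last point has high leverage.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.1, 2.9, 4.2, 3.0]
for i, row in enumerate(dfbetas(xs, ys)):
    print(i, [round(v, 3) for v in row])
```

In this special case the one-step approximation coincides with the change obtained by refitting without the observation; for other links or distributions it is only an approximation.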
In normal linear regression, the influence of observation i can be measured by Cook’s distance (Cook and Weisberg, 1982). A measure of influence of observation i for generalized linear models that is equivalent to Cook’s distance for normal linear regression is given by
$$\mathrm{DOBS}_i = \frac{w_i \, h_i \, r_{pi}^2}{p \, (1 - h_i)}$$
where $h_i$ is the leverage defined in the section H | LEVERAGE. This measure is the one-step approximation to $2 p^{-1} [L(\hat{\beta}) - L(\hat{\beta}_{[i]})]$, where $L(\beta)$ is the log likelihood evaluated at $\beta$.
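The Cook's distance type measure can be sketched in the same simplified setting. The code below is an illustrative sketch, not the procedure's implementation: it assumes a normal model with identity link, unit prior weights ($w_i = 1$), one covariate plus intercept, and a dispersion estimate equal to the residual sum of squares over n - p. It computes $h_i r_{pi}^2 / (p (1 - h_i))$ from the leverage and the standardized Pearson residual. The function names and data are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def fit(xs, ys):
    """Least squares fit of y = b0 + b1*x; returns (beta, X'X)."""
    xtx = [[float(len(xs)), sum(xs)], [sum(xs), sum(x * x for x in xs)]]
    xty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]
    return solve2(xtx, xty), xtx

def cooks_distance(xs, ys):
    """Cook's distance per observation from leverage and standardized residual."""
    beta, xtx = fit(xs, ys)
    n, p = len(xs), 2
    resid = [y - beta[0] - beta[1] * x for x, y in zip(xs, ys)]
    phi = sum(e * e for e in resid) / (n - p)   # assumed dispersion estimate
    out = []
    for x, e in zip(xs, resid):
        c = solve2(xtx, [1.0, x])               # (X'X)^{-1} x_i
        h = c[0] + x * c[1]                     # leverage h_i
        r2 = e * e / (phi * (1.0 - h))          # squared standardized Pearson residual
        out.append(h * r2 / (p * (1.0 - h)))    # prior weight w_i = 1
    return out

# Hypothetical toy data; the last point has high leverage.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.1, 2.9, 4.2, 3.0]
for i, d in enumerate(cooks_distance(xs, ys)):
    print(i, round(d, 4))
```

For the normal model the result equals the scaled quadratic form in the exact change of the parameter estimates, which gives a convenient check on the formula.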
The diagnostic statistics in this section were developed by Preisser and Qaqish (1996). See the section Generalized Estimating Equations for further information and notation for generalized estimating equations (GEEs). The following additional notation is used in this section.
Partition the design matrix $X$ and response vector $Y$ by cluster; that is, let $X = (X_1', \ldots, X_K')'$, and $Y = (Y_1', \ldots, Y_K')'$ corresponding to the K clusters.
Let $n_i$ be the number of responses for cluster i, and denote by $N = \sum_{i=1}^{K} n_i$ the total number of observations. Denote by $A_i$ the $n_i \times n_i$ diagonal matrix with $V(\mu_{ij})$ as the jth diagonal element. If there is a WEIGHT statement, the diagonal element of $A_i$ is $V(\mu_{ij})/w_{ij}$, where $w_{ij}$ is the specified weight of the jth observation in the ith cluster. Let $B$ be the $N \times N$ diagonal matrix with $g'(\mu_{ij})$ as diagonal elements, $i = 1, \ldots, K$, $j = 1, \ldots, n_i$. Let $B_i$ be the $n_i \times n_i$ diagonal matrix corresponding to cluster i with $g'(\mu_{ij})$ as the jth diagonal element.
Let $W$ be the block diagonal weight matrix whose ith block, corresponding to the ith cluster, is the $n_i \times n_i$ matrix
$$W_i = B_i^{-1} A_i^{-1/2} R_i^{-1} A_i^{-1/2} B_i^{-1}$$
where $R_i$ is the working correlation matrix for cluster i.
Let
$$Q_i = X_i (X'WX)^{-1} X_i'$$
where $X_i$ is the $n_i \times p$ design matrix corresponding to cluster i.
Define the adjusted residual vector as
$$E = B (Y - \mu)$$
and $E_i = B_i (Y_i - \mu_i)$, the estimated residual for the ith cluster.
Let the subscript $[i]$ denote estimates evaluated without the ith cluster, $[it]$ estimates evaluated using all the data except the tth observation of the ith cluster, and let $[t]$ denote matrices corresponding to the ith cluster without the tth observation.
The following statistics are available for generalized estimating equation models.
The leverage of cluster i is contained in the matrix $H_i = Q_i W_i$, and is summarized by the trace of $H_i$,
$$c_i = \mathrm{tr}(H_i)$$
The leverage $h_{it}$ of the tth observation in the ith cluster is the tth diagonal element of $H_i$.
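The effect of deleting cluster i on the estimated parameter vector is given by the following one-step approximation for $\hat{\beta} - \hat{\beta}_{[i]}$:
$$\mathrm{DFBETAC}_i = (X'WX)^{-1} X_i' (W_i^{-1} - Q_i)^{-1} E_i$$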
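The cluster deletion approximation can be illustrated with a small sketch, offered under explicit assumptions rather than as the procedure's implementation. It takes the one-step form $(X'WX)^{-1} X_i' (W_i^{-1} - Q_i)^{-1} E_i$ and specializes it to an identity link, unit variance function, unit weights, and an independence working correlation, so that $W_i$ is the identity matrix and $E_i$ is the raw residual vector. Clusters are restricted to two observations so that all linear systems are 2x2. The names `solve2`, `fit`, and `dfbetac` and the data are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def fit(rows):
    """Least squares fit of y = b0 + b1*x on (x, y) rows; returns (beta, X'X)."""
    sx = sum(x for x, _ in rows)
    xtx = [[float(len(rows)), sx], [sx, sum(x * x for x, _ in rows)]]
    xty = [sum(y for _, y in rows), sum(x * y for x, y in rows)]
    return solve2(xtx, xty), xtx

def dfbetac(clusters, i):
    """One-step change in beta from deleting two-observation cluster i."""
    rows = [r for c in clusters for r in c]
    beta, xtx = fit(rows)
    (x1, y1), (x2, y2) = clusters[i]
    c1 = solve2(xtx, [1.0, x1])                 # (X'X)^{-1} x_{i1}
    c2 = solve2(xtx, [1.0, x2])                 # (X'X)^{-1} x_{i2}
    # Q_i = X_i (X'X)^{-1} X_i'
    q = [[c1[0] + x1 * c1[1], c2[0] + x1 * c2[1]],
         [c1[0] + x2 * c1[1], c2[0] + x2 * c2[1]]]
    # W_i^{-1} - Q_i with W_i equal to the identity matrix
    m = [[1.0 - q[0][0], -q[0][1]], [-q[1][0], 1.0 - q[1][1]]]
    e = [y1 - beta[0] - beta[1] * x1, y2 - beta[0] - beta[1] * x2]   # E_i
    z = solve2(m, e)                            # (W_i^{-1} - Q_i)^{-1} E_i
    xtz = [z[0] + z[1], x1 * z[0] + x2 * z[1]]  # X_i' z
    return solve2(xtx, xtz)

# Hypothetical data: three clusters of two (x, y) observations each.
clusters = [[(0.0, 1.0), (1.0, 2.1)],
            [(2.0, 2.9), (3.0, 4.2)],
            [(4.0, 4.8), (10.0, 3.0)]]
for i in range(len(clusters)):
    print(i, [round(v, 4) for v in dfbetac(clusters, i)])
```

Under these assumptions the estimating equations reduce to ordinary least squares, and the one-step value matches the change obtained by refitting without the cluster. With a non-identity link or a non-independence working correlation it is an approximation only.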
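The cluster deletion statistic DFBETAC can be standardized using the variances of $\hat{\beta}$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of cluster i is
$$\mathrm{DFBETACS}_{ij} = \frac{\mathrm{DFBETAC}_{ij}}{[(X'WX)^{-1}_{jj}]^{1/2}}$$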
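Partition the matrices $X_i$ and $V_i = W_i^{-1}$ as
$$X_i = \begin{pmatrix} x_{it} \\ X_{[t]} \end{pmatrix}, \qquad V_i = \begin{pmatrix} V_{it} & V_{it[t]} \\ V_{[t]it} & V_{[t]} \end{pmatrix}$$
and let $\tilde{X}_{it} = x_{it} - V_{it[t]} V_{[t]}^{-1} X_{[t]}$ and $\tilde{V}_{it} = V_{it} - V_{it[t]} V_{[t]}^{-1} V_{[t]it}$.
The effect of deleting the tth observation from the ith cluster is given by the following one-step approximation to $\hat{\beta} - \hat{\beta}_{[it]}$:
$$\mathrm{DFBETAO}_{it} = (X'WX)^{-1} \tilde{X}_{it}' \, \frac{\tilde{E}_{it}}{\tilde{W}_{it}^{-1} - \tilde{Q}_{it}}$$
where $\tilde{E}_{it} = E_{it} - V_{it[t]} V_{[t]}^{-1} E_{[t]}$, $\tilde{Q}_{it} = \tilde{X}_{it} (X'WX)^{-1} \tilde{X}_{it}'$, and $\tilde{W}_{it} = \tilde{V}_{it}^{-1}$. Note that $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $\tilde{W}_{it}$ are scalars.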
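The observation deletion statistic DFBETAO can be standardized using the variances of $\hat{\beta}$ based on the complete data. The standardized one-step approximation for the change in $\hat{\beta}_j$ due to deletion of observation t in cluster i is
$$\mathrm{DFBETAOS}_{itj} = \frac{\mathrm{DFBETAO}_{itj}}{[(X'WX)^{-1}_{jj}]^{1/2}}$$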
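A measure of the standardized influence of the subset m of observations on the overall fit is $(\hat{\beta} - \hat{\beta}_{[m]})' (X'WX) (\hat{\beta} - \hat{\beta}_{[m]}) / (p \phi)$. For deletion of cluster i, this is approximated by
$$\mathrm{DCLS}_i = \frac{E_i' (W_i^{-1} - Q_i)^{-1} Q_i (W_i^{-1} - Q_i)^{-1} E_i}{p \phi}$$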
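The cluster-level measure of influence on the overall fit can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure's implementation: identity link, unit variance function, unit weights, and an independence working correlation (so $W_i$ is the identity), two observations per cluster, and a dispersion estimate equal to the residual sum of squares over N - p. Writing $z = (W_i^{-1} - Q_i)^{-1} E_i$, the measure is computed as $z' Q_i z / (p \phi)$. The names `solve2`, `fit`, and `dcls` and the data are hypothetical.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def fit(rows):
    """Least squares fit of y = b0 + b1*x on (x, y) rows; returns (beta, X'X)."""
    sx = sum(x for x, _ in rows)
    xtx = [[float(len(rows)), sx], [sx, sum(x * x for x, _ in rows)]]
    xty = [sum(y for _, y in rows), sum(x * y for x, y in rows)]
    return solve2(xtx, xty), xtx

def dcls(clusters):
    """Cluster-level Cook's distance type measure for two-observation clusters."""
    rows = [r for c in clusters for r in c]
    beta, xtx = fit(rows)
    p = 2
    rss = sum((y - beta[0] - beta[1] * x) ** 2 for x, y in rows)
    phi = rss / (len(rows) - p)                 # assumed dispersion estimate
    out = []
    for (x1, y1), (x2, y2) in clusters:
        c1 = solve2(xtx, [1.0, x1])             # (X'X)^{-1} x_{i1}
        c2 = solve2(xtx, [1.0, x2])             # (X'X)^{-1} x_{i2}
        # Q_i = X_i (X'X)^{-1} X_i'
        q = [[c1[0] + x1 * c1[1], c2[0] + x1 * c2[1]],
             [c1[0] + x2 * c1[1], c2[0] + x2 * c2[1]]]
        # W_i^{-1} - Q_i with W_i equal to the identity matrix
        m = [[1.0 - q[0][0], -q[0][1]], [-q[1][0], 1.0 - q[1][1]]]
        e = [y1 - beta[0] - beta[1] * x1, y2 - beta[0] - beta[1] * x2]
        z = solve2(m, e)                        # (W_i^{-1} - Q_i)^{-1} E_i
        zqz = sum(z[a] * q[a][b] * z[b] for a in range(2) for b in range(2))
        out.append(zqz / (p * phi))
    return out

# Hypothetical data: three clusters of two (x, y) observations each.
clusters = [[(0.0, 1.0), (1.0, 2.1)],
            [(2.0, 2.9), (3.0, 4.2)],
            [(4.0, 4.8), (10.0, 3.0)]]
print([round(v, 4) for v in dcls(clusters)])
```

In this linear special case the value agrees with the quadratic form in the exact change of the estimates when the cluster is removed and the model is refit.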
The measure of overall fit in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD for the deletion of the tth observation in the ith cluster is approximated by
$$\mathrm{DOBS}_{it} = \frac{\tilde{E}_{it}^2 \, \tilde{Q}_{it}}{p \phi \, (\tilde{W}_{it}^{-1} - \tilde{Q}_{it})^2}$$
where $\tilde{E}_{it}$, $\tilde{Q}_{it}$, and $\tilde{W}_{it}$ are defined in the section DFBETAO. In the case of the independence working correlation, this is equal to the measure for ordinary generalized linear models defined in the section DOBS | COOKD | COOKSD.
A studentized distance measure of the type defined in the section DCLS | CLUSTERCOOKD | CLUSTERCOOKSD of the influence of the ith cluster is given by
$$\mathrm{MCLS}_i = \frac{E_i' (W_i^{-1} - Q_i)^{-1} H_i E_i}{p \phi}$$