The QLIM Procedure

Selection Models

Subsections:

Heckman’s Two-Step Selection Method

In sample selection models, one or several dependent variables are observed when another variable takes certain values. For example, the standard Heckman selection model can be defined as

$z^{*}_{i} = \mb {w}_{i}’\bgamma + u_{i}$

$z_{i} = \left\{ \begin{array}{ll} 1 & \hbox{ if } z^{*}_{i}>0 \\ 0 & \hbox{ if } z^{*}_{i}\leq 0 \end{array} \right.$

$y_{i} = \mb {x}_{i}’\bbeta + \epsilon _{i} \quad \hbox{if } z_{i}=1$

where $u_{i}$ and $\epsilon _{i}$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$ , respectively, and correlation of $\rho$ . Selection is based on the variable $z$ , and $y$ is observed when $z$ has a value of 1. Least squares regression that uses only the observed values of $y$ produces inconsistent estimates of $\bbeta$ when $\rho \neq 0$ . The maximum likelihood method is used to estimate selection models. It is also possible to estimate these models by using Heckman’s two-step method, which is more computationally efficient. However, it can be shown that the resulting estimates, although consistent, are not asymptotically efficient under a normality assumption. Moreover, this method often produces an estimate of the correlation coefficient that violates the constraint $|\rho |\leq 1$ .

The log-likelihood function of the Heckman selection model is written as

$\begin{eqnarray*} \ell & =& \sum _{\{ i|z_{i}=0\} }\ln [1-\Phi (\mb {w}_{i}’\bgamma )] \\ & +& \sum _{\{ i|z_{i}=1\} } \left\{ \ln \phi \left(\frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }\right) - \ln \sigma + \ln \Phi \left(\frac{\mb {w}_{i}’\bgamma + \rho \frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }}{\sqrt {1-\rho ^2}}\right) \right\} \end{eqnarray*}$
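As an illustration only, the log-likelihood above can be evaluated term by term as in the following Python sketch. It is not PROC QLIM code and says nothing about how the procedure computes or maximizes the function. The names `heckman_loglik`, `log_norm_pdf`, `norm_cdf`, and `dot`, the tuple layout of `obs`, and the sample values are all assumptions made for this example.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_norm_pdf(t):
    """Log of the standard normal density phi(t)."""
    return -0.5 * t * t - LOG_SQRT_2PI

def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def heckman_loglik(obs, gamma, beta, sigma, rho):
    """Sum the log-likelihood over obs, a list of (z, w, x, y) tuples.

    x and y are ignored when z == 0, because y is not observed there.
    """
    total = 0.0
    for z, w, x, y in obs:
        wg = dot(w, gamma)
        if z == 0:
            # ln[1 - Phi(w'gamma)], written as ln Phi(-w'gamma)
            total += math.log(norm_cdf(-wg))
        else:
            r = (y - dot(x, beta)) / sigma
            arg = (wg + rho * r) / math.sqrt(1.0 - rho * rho)
            total += log_norm_pdf(r) - math.log(sigma) + math.log(norm_cdf(arg))
    return total

# Hypothetical data: one unselected and one selected observation.
obs = [(0, [1.0, 0.2], None, None), (1, [1.0, -0.4], [1.0, 2.0], 3.1)]
ll = heckman_loglik(obs, gamma=[0.3, 1.0], beta=[1.0, 1.0], sigma=1.5, rho=0.4)
```

The function requires $|\rho | < 1$ and $\sigma > 0$ ; a maximizer built on it would need to enforce those bounds.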

Selection can be based on only one variable, but it can determine which of several variables is observed. For example, selection is based on the variable $z$ in the following switching regression model:

$z^{*}_{i} = \mb {w}_{i}’\bgamma + u_{i}$

$z_{i} = \left\{ \begin{array}{ll} 1 & \hbox{if } z^{*}_{i}>0 \\ 0 & \hbox{if } z^{*}_{i}\leq 0 \end{array} \right.$

$\begin{eqnarray*} y_{1i} & =& \mb {x}_{1i}’\bbeta _1 + \epsilon _{1i} \quad \hbox{if $z_{i}=0$} \\ y_{2i} & =& \mb {x}_{2i}’\bbeta _2 + \epsilon _{2i} \quad \hbox{if $z_{i}=1$} \end{eqnarray*}$

If $z=0$ , then $y_1$ is observed. If $z=1$ , then $y_2$ is observed. Because $y_1$ and $y_2$ are never observed at the same time, the correlation between $y_1$ and $y_2$ cannot be estimated. Only the correlation between $z$ and $y_1$ and the correlation between $z$ and $y_2$ can be estimated. This estimation uses the maximum likelihood method.

A brief example of the SAS statements for this model can be found in Sample Selection Model.

The Heckman selection model can include censoring or truncation. For a brief example of the SAS statements for these models, see Sample Selection Model with Truncation and Censoring. The following example shows a variable $y_{i}$ that is censored from below at zero:

$z^{*}_{i} = \mb {w}_{i}’\bgamma + u_{i}$

$z_{i} = \left\{ \begin{array}{ll} 1 & \hbox{if } z^{*}_{i}>0 \\ 0 & \hbox{if } z^{*}_{i}\leq 0 \end{array} \right.$

$y^*_{i} = \mb {x}_{i}’\bbeta + \epsilon _{i} \quad \hbox{if } z_{i}=1$

$y_{i} = \left\{ \begin{array}{ll} y^{*}_{i} & \hbox{if } y^{*}_{i}>0 \\ 0 & \hbox{if } y^{*}_{i}\leq 0 \end{array} \right.$

In this case, the log-likelihood function of the Heckman selection model needs to be modified as follows to include the censored region:

$\begin{eqnarray*} \ell & =& \sum _{\{ i|z_{i}=0\} }\ln [1-\Phi (\mb {w}_{i}’\bgamma )] \\ & +& \sum _{ \{ i|z_{i}=1,y_{i}=y_{i}^{*}\} } \left\{ \ln \left[\phi \left(\frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }\right)\right] - \ln \sigma + \ln \left[\Phi \left(\frac{\mb {w}_{i}’\bgamma + \rho \frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }}{\sqrt {1-\rho ^2}}\right)\right] \right\} \\ & +& \sum _{ \{ i|z_{i}=1,y_{i}=0\} } \ln \int _{-\infty } ^{\frac{-\mb {x}_{i}’\bbeta }{\sigma }} \int _{-\mb {w}_{i}’\bgamma } ^{\infty } \phi _2(u,v,\rho ) \, du\, dv \end{eqnarray*}$

If $y_{i}$ is truncated from below at 0 instead of censored, the log-likelihood function can be written as

$\begin{eqnarray*} \ell & =& \sum _{\{ i|z_{i}=0\} }\ln [1-\Phi (\mb {w}_{i}’\bgamma )] \\ & +& \sum _{\{ i|z_{i}=1\} } \left\{ \ln \left[\phi \left(\frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }\right)\right] - \ln \sigma + \ln \left[\Phi \left(\frac{\mb {w}_{i}’\bgamma + \rho \frac{y_{i}-\mb {x}_{i}’\bbeta }{\sigma }}{\sqrt {1-\rho ^2}}\right)\right] -\ln \left[\Phi (\mb {x}_{i}’\bbeta /\sigma )\right] \right\} \end{eqnarray*}$
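The following Python sketch transcribes the contribution of a single selected observation ( $z_{i}=1$ ) and shows that the truncated form differs from the untruncated form only by the term $-\ln \Phi (\mb {x}_{i}’\bbeta /\sigma )$ . It is illustrative rather than PROC QLIM code. The function `selected_term`, its arguments `xb` and `wg` (which stand for $\mb {x}_{i}’\bbeta$ and $\mb {w}_{i}’\bgamma$ ), and the numeric inputs are hypothetical.

```python
import math

def norm_cdf(t):
    """Standard normal distribution function Phi(t)."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def selected_term(y, xb, wg, sigma, rho, truncated=False):
    """Log-likelihood contribution of one observation with z = 1.

    With truncated=True the term ln Phi(x'beta / sigma) is subtracted,
    as in the expression for truncation from below at 0.
    """
    r = (y - xb) / sigma
    arg = (wg + rho * r) / math.sqrt(1.0 - rho * rho)
    term = (-0.5 * r * r - 0.5 * math.log(2.0 * math.pi)
            - math.log(sigma) + math.log(norm_cdf(arg)))
    if truncated:
        term -= math.log(norm_cdf(xb / sigma))
    return term

# Hypothetical values for one selected observation.
plain = selected_term(1.2, 0.8, 0.5, 1.0, 0.3)
trunc = selected_term(1.2, 0.8, 0.5, 1.0, 0.3, truncated=True)
```

Because $\Phi (\cdot ) < 1$ , the truncated contribution is always larger than the untruncated one for the same parameter values.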

Heckman’s Two-Step Selection Method

Sample selection bias arises from nonrandom selection of the sample from the population. A classic example is using a sample of market wages for working women to estimate the female labor supply function. This sample is nonrandom because it includes only the wages of women whose market wage exceeds their home wage at zero hours of work.

A simple selection model can be written as the latent model

$z^{*}_{i} = \mb {w}_{i}’\bgamma + u_{i}$

$z_{i} = \left\{ \begin{array}{ll} 1 & \hbox{ if } z^{*}_{i}>0 \\ 0 & \hbox{ if } z^{*}_{i}\leq 0 \end{array} \right.$

$y_{i} = \mb {x}_{i}’\bbeta + \epsilon _{i} \quad \hbox{if } z_{i}=1$

where $u_{i}$ and $\epsilon _{i}$ are jointly normal with 0 mean, standard deviations of 1 and $\sigma$ , respectively, and correlation of $\rho$ . The dependent variable $y_{i}$ (wage) is observed if the latent variable $z^{*}_{i}$ (the difference between market wage and reservation wage) is positive or, equivalently, if the indicator variable $z_{i}$ (labor force participation) is 1.

The model of interest that applies to the observations in the selected sample can be written as

$E( y_{i} | \mb {x}_{i}, z_{i}=1) = \mb {x}_{i}’\bbeta + \rho \sigma \lambda (\mb {w}_{i}’\bgamma )$

where $\lambda (\mb {w}_{i}’\bgamma ) = \phi (\mb {w}_{i}’\bgamma )/\Phi (\mb {w}_{i}’\bgamma )$ . Hence, the following regression equation is valid for the observations for which $z_{i} = 1$ :

$y_{i} = \mb {x}_{i}’\bbeta + \rho \sigma \lambda (\mb {w}_{i}’\bgamma ) + v_{i}$

Therefore, estimates of $\bbeta$ that are obtained from the OLS regression of $y$ on $\mb {x}$ by using the selected sample (that is, the sample for which $z_{i} = 1$ ) suffer from omitted variable bias when selection bias is present, because the term $\rho \sigma \lambda (\mb {w}_{i}’\bgamma )$ is left out of the regression. Although maximum likelihood estimation of $\bbeta$ is consistent and efficient, Heckman’s two-step method is more frequently used. You can request Heckman’s two-step method by specifying the HECKIT option in the PROC QLIM statement.
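The selection-correction term can be made concrete with a short Python sketch of $\lambda (\cdot )$ (the inverse Mills ratio) and of the conditional mean $E( y_{i} | \mb {x}_{i}, z_{i}=1)$ . The function names `inverse_mills` and `selected_mean` and the numeric inputs are assumptions made for illustration; they are not part of PROC QLIM.

```python
import math

def inverse_mills(t):
    """lambda(t) = phi(t) / Phi(t) for the standard normal distribution.

    Phi(t) underflows to zero for t below roughly -38, so this simple
    form is only suitable for moderate values of t.
    """
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-t / math.sqrt(2.0))
    return pdf / cdf

def selected_mean(xb, wg, rho, sigma):
    """E(y | x, z = 1) = x'beta + rho * sigma * lambda(w'gamma)."""
    return xb + rho * sigma * inverse_mills(wg)

lam0 = inverse_mills(0.0)               # sqrt(2 / pi), about 0.7979
m = selected_mean(1.0, 0.0, 0.5, 2.0)   # 1.0 + 0.5 * 2.0 * lam0
```

When $\rho = 0$ the correction vanishes and the conditional mean reduces to $\mb {x}_{i}’\bbeta$ . Because $\lambda (\cdot )$ is decreasing in its argument, the correction is largest for observations that have a low probability of selection.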

Heckman’s two-step method is as follows:

1. Obtain $\hat{\bgamma }$ , the estimate of the parameters that determine the probability that $z^{*}_{i}>0$ , by probit analysis of the binary dependent variable $z_{i}$ on the regressors $\mb {w}_{i}$ for the full sample. Compute $\hat{\lambda }_{i} = \lambda (\mb {w}_{i}’\hat{\bgamma })$ .
2. Obtain $\hat{\bbeta }$ and $\hat{\beta }_{\lambda }$ , the estimates of $\bbeta$ and $\rho \sigma$ , by least squares regression of $y_{i}$ on $\mb {x}_{i}$ and $\hat{\lambda }_{i}$ by using observations on the selected subsample.
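The two steps above can be sketched in plain Python on simulated data. This is a minimal illustration under stated assumptions, not the algorithm that PROC QLIM implements: the probit step uses Fisher scoring started at zero, the least squares step solves the normal equations directly, and every name (`probit`, `ols`, `solve`, `heckman_two_step`) and every simulation setting (`true_gamma`, `true_beta`, the sample size, the seed) is hypothetical.

```python
import math
import random

def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def cdf(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least squares coefficients from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def probit(rows, z, iterations=20):
    """Probit coefficients by Fisher scoring, started at zero."""
    k = len(rows[0])
    g = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for w, zi in zip(rows, z):
            t = dot(w, g)
            p = min(max(cdf(t), 1e-12), 1.0 - 1e-12)  # guard against 0 and 1
            d = pdf(t)
            s = d * (zi - p) / (p * (1.0 - p))
            h = d * d / (p * (1.0 - p))
            for i in range(k):
                score[i] += s * w[i]
                for j in range(k):
                    info[i][j] += h * w[i] * w[j]
        g = [gi + si for gi, si in zip(g, solve(info, score))]
    return g

def heckman_two_step(w_rows, z, x_rows, y):
    # Step 1: probit on the full sample, then lambda-hat for selected rows.
    gamma_hat = probit(w_rows, z)
    sel = [i for i, zi in enumerate(z) if zi == 1]
    idx = [dot(w_rows[i], gamma_hat) for i in sel]
    lam = [pdf(t) / cdf(t) for t in idx]
    # Step 2: OLS of y on x and lambda-hat, selected subsample only.
    design = [x_rows[i] + [l] for i, l in zip(sel, lam)]
    coef = ols(design, [y[i] for i in sel])
    return gamma_hat, coef[:-1], coef[-1]

# Simulated data (all settings are assumptions for this example).
random.seed(12345)
true_gamma, true_beta, sigma, rho = [0.0, 1.0, 0.5], [1.0, 2.0], 1.0, 0.5
w_rows, x_rows, z, y = [], [], [], []
for _ in range(4000):
    x1, w1, u = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    eps = sigma * (rho * u + math.sqrt(1.0 - rho * rho) * random.gauss(0, 1))
    w, x = [1.0, w1, x1], [1.0, x1]
    zi = 1 if dot(w, true_gamma) + u > 0 else 0
    w_rows.append(w)
    x_rows.append(x)
    z.append(zi)
    y.append(dot(x, true_beta) + eps if zi == 1 else None)

gamma_hat, beta_hat, beta_lambda_hat = heckman_two_step(w_rows, z, x_rows, y)
```

In this simulation the selection equation contains a regressor ( `w1` ) that is excluded from the outcome equation, so $\hat{\lambda }_{i}$ is not close to a linear function of $\mb {x}_{i}$ . The coefficient `beta_lambda_hat` estimates $\rho \sigma$ , which is 0.5 under the assumed settings.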

The standard least squares estimators of the population variance $\sigma ^2$ and of the variances of the estimated coefficients are incorrect for this regression. Hypothesis tests therefore require corrected estimators. An estimator of $\sigma ^2$ is

$\hat{\sigma }^2 = \frac{1}{N_1}\sum _{i=1}^{N_1}e_{i}^2 + \hat{\beta }_{\lambda }^2\frac{1}{N_1}\sum _{i=1}^{N_1}\hat{\delta }_{i}$

where $N_1$ is the selected subsample size, $e_{i}$ is the residual for the $i$ th observation obtained from step 2, and $\hat{\delta }_{i} = \hat{\lambda }_{i}(\hat{\lambda }_{i}+\mb {w}_{i}’\hat{\bgamma })$ . Let $K$ be the number of regressors in $\mb {x}_{i}$ , let $\mb {X}_{*}$ be an $N_{1}\times (K +1)$ matrix with $i$ th row $[\mb {x}_{i}’ ~ ~ \hat{\lambda }_{i}]$ , and define $\mb {W}$ similarly with $i$ th row $\mb {w}_{i}’$ . Then the estimator of the asymptotic covariance of $[\hat{\bbeta }, \hat{\beta }_{\lambda }]$ is

$\mbox{Est.Asy.Var}[\hat{\bbeta }, \hat{\beta }_{\lambda }] = \hat{\sigma }^2[\mb {X}_{*}’\mb {X}_{*}]^{-1}[\mb {X}_{*}’(\mb {I} - \hat{\rho }^2\hat{\bDelta })\mb {X}_{*} + \mb {Q}][\mb {X}_{*}’\mb {X}_{*}]^{-1}$

where $\hat{\rho }^2 = \hat{\beta }_{\lambda }^2/\hat{\sigma }^2$ , $\hat{\bDelta } = \mbox{diag}(\hat{\delta }_{i})$ , and

$\mb {Q} = \hat{\rho }^2(\mb {X}_{*}’\hat{\bDelta }\mb {W})\mbox{Est.Asy.Var}(\hat{\bgamma })(\mb {W}’\hat{\bDelta }\mb {X}_{*})$

where $\mbox{Est.Asy.Var}(\hat{\bgamma })$ is the estimator of the asymptotic covariance of the probit coefficients that are obtained in step 1. When you specify the HECKIT option, PROC QLIM uses a numerically estimated asymptotic variance.
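The scalar quantities $\hat{\delta }_{i}$ , $\hat{\sigma }^2$ , and $\hat{\rho }^2$ defined above can be computed from the step 2 output as in the following Python sketch. The function name `corrected_sigma2`, its argument names, and the toy inputs are assumptions made for illustration; the matrix expression for the covariance is not reproduced here.

```python
def corrected_sigma2(residuals, lam_hat, index_hat, beta_lambda_hat):
    """Return (sigma2_hat, rho2_hat, delta_hat) for the selected subsample.

    residuals       -- step 2 residuals e_i
    lam_hat         -- lambda-hat_i for each selected observation
    index_hat       -- w_i' gamma-hat for each selected observation
    beta_lambda_hat -- step 2 coefficient on lambda-hat
    """
    n1 = len(residuals)
    # delta-hat_i = lambda-hat_i * (lambda-hat_i + w_i' gamma-hat)
    delta = [l * (l + t) for l, t in zip(lam_hat, index_hat)]
    sigma2 = (sum(e * e for e in residuals) / n1
              + beta_lambda_hat ** 2 * sum(delta) / n1)
    rho2 = beta_lambda_hat ** 2 / sigma2
    return sigma2, rho2, delta

# Toy inputs: two selected observations.
sigma2, rho2, delta = corrected_sigma2([1.0, -1.0], [0.5, 0.5], [0.0, 0.0], 1.0)
# sigma2 = 1.25, rho2 = 0.8, delta = [0.25, 0.25]
```

Nothing in these formulas forces `rho2` to be at most 1, which is one way to see how the two-step estimates can violate the constraint $|\rho |\leq 1$ .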

When the HECKIT option is specified, PROC QLIM reports the corrected standard errors for $[\hat{\bbeta }, \hat{\beta }_{\lambda }]$ automatically. However, if you need the conventional OLS standard errors, you can specify the HECKIT(UNCORRECTED) option.

In the selected regression model, when the coefficient of $\lambda (\mb {w}_{i}’\bgamma )$ is 0, you do not need Heckman’s two-step estimation method; a simple regression of $y$ on $\mb {x}$ produces consistent estimates for $\bbeta$ , and the OLS standard errors are correct. Thus, a standard $t$ test on $\hat{\beta }_{\lambda }$ (which uses the estimate from step 2 and the uncorrected standard errors) is a valid test of the null hypothesis of no selection bias.

Although Heckman’s two-step method uses the OLS method in the second stage, you can request the ML method by specifying the HECKIT(SECONDSTAGE=ML) option. When the second-stage method is the ML method, the model for $y_{i}$ can be nonlinear.