The GAM Procedure

Backfitting and Local Scoring Algorithms

Much of the development and notation in this section follows Hastie and Tibshirani (1986).

Additive Models

Consider the estimation of the smoothing terms $s_0, s_1(\cdot), \ldots, s_p(\cdot)$ in the additive model

\[ \eta(X) = s_0 + \sum_{j=1}^{p} s_j(X_j) \]

where $E\left(s_j(X_j)\right) = 0$ for every $j$. Because the algorithm for additive models is the basis for fitting generalized additive models, it is discussed first.

There are many ways to formulate and estimate additive models. The backfitting algorithm is a general algorithm that can fit an additive model by using any regression-type fitting mechanism.

Define the $k$th set of partial residuals as

\[ R_k = Y - s_0 - \sum_{j \ne k} s_j(X_j) \]

Then $E(R_k|X_k) = s_k(X_k)$. This observation provides a way to estimate each smoothing function $s_k(\cdot)$ given estimates $\{\hat{s}_j(\cdot), j \ne k\}$ for all the others. The resulting iterative procedure is known as the backfitting algorithm (Friedman and Stuetzle, 1981). The formulation that follows is taken from Hastie and Tibshirani (1986).
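Before turning to that formulation, the relation $E(R_k|X_k) = s_k(X_k)$ can be checked numerically. The following Python sketch is illustrative only; the simulated model, the running-mean smoother, and all names are this illustration's own assumptions, not part of PROC GAM. It smooths the partial residual for one term against its covariate and compares the result to the true function:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
s1_true = np.sin(np.pi * x1)          # E[sin(pi * X1)] = 0 by symmetry
s2_true = x2**2 - 1.0 / 3.0           # centered: E[X2^2] = 1/3 on (-1, 1)
y = 2.0 + s1_true + s2_true + rng.normal(0, 0.1, n)   # s_0 = 2

def running_mean(x, r, k=101):
    """Crude stand-in for E(r | x): moving average over the k nearest
    x-values (biased near the boundaries because of zero padding)."""
    order = np.argsort(x)
    smooth = np.convolve(r[order], np.ones(k) / k, mode="same")
    out = np.empty_like(r)
    out[order] = smooth
    return out

# Partial residual for the first term, using the true s_2 as its "estimate"
r1 = y - y.mean() - s2_true
s1_hat = running_mean(x1, r1)
print(np.mean(np.abs(s1_hat - s1_true)))   # small: the smooth tracks s_1
```

The running-mean smoother is a crude stand-in for the conditional expectation; PROC GAM uses spline and loess smoothers instead.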

The Backfitting Algorithm

The unweighted form of the backfitting algorithm is as follows:

  1. Initialization: $s_0 = E(Y),\; s_1^{(0)}=s_2^{(0)}=\cdots =s_p^{(0)}=0,\; m=0$

  2. Iterate: $m = m+1$; for $j = 1$ to $p$ do: $R_j = Y - s_0 - \sum_{k=1}^{j-1} s_k^{(m)}(X_k) - \sum_{k=j+1}^{p} s_k^{(m-1)}(X_k)$; $s_j^{(m)} = E(R_j|X_j)$;

  3. Until: $\mathrm{RSS} = \frac{1}{n}\left\| Y - s_0 - \sum_{j=1}^{p} s_j^{(m)}(X_j)\right\|^2$ fails to decrease or satisfies the convergence criterion.

In the preceding notation, $s_j^{(m)}(\cdot)$ denotes the estimate of $s_j(\cdot)$ at the $m$th iteration. It can be shown that with many smoothers (including linear regression, univariate and bivariate splines, and combinations of these), RSS never increases at any step, which implies that the algorithm always converges (Hastie and Tibshirani, 1986). Note, however, that for distributions other than Gaussian, numerical instabilities with weights can cause convergence problems. Even when the algorithm converges, the individual functions need not be unique, since dependence among the covariates can lead to more than one representation for the same fitted surface.
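As a concrete illustration of the three steps above, here is a minimal Python sketch of unweighted backfitting. It assumes a generic smoother(x, r), such as the running_mean function from the earlier sketch; it is not the PROC GAM implementation:

```python
import numpy as np

def backfit(X, y, smoother, max_iter=50, tol=1e-8):
    """Unweighted backfitting. X is (n, p); smoother(x, r) estimates E(r | x)."""
    n, p = X.shape
    s0 = y.mean()                          # step 1: s_0 = E(Y)
    S = np.zeros((n, p))                   # s_j^(0) = 0 for all j
    rss_prev = np.inf
    for m in range(1, max_iter + 1):       # step 2: iterate
        for j in range(p):
            # Partial residual R_j: current estimates for k < j,
            # previous-iteration estimates for k > j
            r_j = y - s0 - S.sum(axis=1) + S[:, j]
            S[:, j] = smoother(X[:, j], r_j)
            S[:, j] -= S[:, j].mean()      # identifiability: keep E(s_j) = 0
        rss = np.mean((y - s0 - S.sum(axis=1)) ** 2)
        if rss_prev - rss < tol:           # step 3: RSS fails to decrease
            break
        rss_prev = rss
    return s0, S

# e.g., s0_hat, S_hat = backfit(np.column_stack([x1, x2]), y, running_mean)
```

Because S is updated in place, the partial residual for term $j$ automatically uses the current-iteration estimates for $k < j$ and the previous-iteration estimates for $k > j$, exactly as in step 2.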

A weighted backfitting algorithm has the same form as the unweighted case, except that the smoothers are weighted. In PROC GAM, weights are used with non-Gaussian data in the local scoring procedure described later in this section.
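In the illustrative Python, the weighted variant differs only in that the smoother accepts weights and all means are weighted. The smoother(x, r, w) signature here is a hypothetical convention of this sketch:

```python
import numpy as np

def weighted_backfit(X, y, w, smoother, max_iter=50, tol=1e-8):
    """Weighted backfitting: smoother(x, r, w) estimates E(r | x) under weights w."""
    n, p = X.shape
    s0 = np.average(y, weights=w)          # weighted mean replaces E(Y)
    S = np.zeros((n, p))
    rss_prev = np.inf
    for _ in range(max_iter):
        for j in range(p):
            r_j = y - s0 - S.sum(axis=1) + S[:, j]
            S[:, j] = smoother(X[:, j], r_j, w)
            S[:, j] -= np.average(S[:, j], weights=w)
        rss = np.average((y - s0 - S.sum(axis=1)) ** 2, weights=w)
        if rss_prev - rss < tol:           # weighted RSS fails to decrease
            break
        rss_prev = rss
    return s0, S
```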

The GAM procedure uses the following condition as the convergence criterion for the backfitting algorithm:

\[ \frac{\sum_{i=1}^{n} \sum_{j=1}^{p} \left(s^{(m-1)}_j(\mb{X}_{ij}) - s^{(m)}_j(\mb{X}_{ij})\right)^2}{1 + \sum_{i=1}^{n} \sum_{j=1}^{p} \left(s^{(m-1)}_j(\mb{X}_{ij})\right)^2} \le \epsilon \]

where $\epsilon = 10^{-8}$ by default; you can change this value with the EPSILON= option in the MODEL statement.
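In the illustrative Python from the earlier sketches, this criterion is straightforward to evaluate, given the $(n, p)$ arrays of fitted function values from two successive iterations:

```python
import numpy as np

def backfit_converged(S_prev, S_curr, eps=1e-8):
    """The criterion above; S_prev and S_curr hold s_j^(m-1)(X_ij) and s_j^(m)(X_ij)."""
    num = np.sum((S_prev - S_curr) ** 2)
    den = 1.0 + np.sum(S_prev ** 2)
    return num / den <= eps
```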

Generalized Additive Models

The algorithm described so far fits only additive models. The algorithm for generalized additive models is a little more complicated. Generalized additive models extend generalized linear models in the same manner that additive models extend linear regression models; that is, the form $\alpha + \sum_j \beta_j \mb{x}_j$ is replaced with the additive form $\alpha + \sum_j f_j(\mb{x}_j)$. See Generalized Linear Models Theory in Chapter 43: The GENMOD Procedure, for more information.

PROC GAM fits generalized additive models by using a modified form of adjusted dependent variable regression, as described for generalized linear models in McCullagh and Nelder (1989), with the additive predictor taking the role of the linear predictor. Hastie and Tibshirani (1986) call this the local scoring algorithm. Important components of this algorithm depend on the link function for each distribution, as shown in the following table.

Distribution        Link                                    Adjusted Dependent ($\mb{z}$)     Weights ($\mb{w}$)
------------------  --------------------------------------  --------------------------------  -------------------
Normal              $\mu$                                   $y$                               $1$
Binomial            $\log\left(\frac{\mu}{n-\mu}\right)$    $\eta + (y-\mu)/n\mu(1-\mu)$      $n\mu(1-\mu)$
Gamma               $-1/\mu$                                $\eta + (y-\mu)/\mu^2$            $\mu^2$
Poisson             $\log(\mu)$                             $\eta + (y-\mu)/\mu$              $\mu$
Inverse Gaussian    $1/\mu^2$                               $\eta - 2(y-\mu)/\mu^3$           $\mu^3/4$
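As an illustration of how the rows of this table translate into computation, the following hypothetical Python helpers evaluate $\mb{z}$ and $\mb{w}$ for the Poisson and gamma cases (constant dispersion factors are omitted, consistent with the table):

```python
import numpy as np

def poisson_z_w(y, eta):
    """Poisson row of the table: log link, so mu = exp(eta)."""
    mu = np.exp(eta)                 # inverse link
    z = eta + (y - mu) / mu          # adjusted dependent variable
    w = mu                           # working weights
    return z, w

def gamma_z_w(y, eta):
    """Gamma row of the table: eta = -1/mu, so mu = -1/eta (eta must be negative)."""
    mu = -1.0 / eta                  # inverse link
    z = eta + (y - mu) / mu**2       # adjusted dependent variable
    w = mu**2                        # working weights
    return z, w
```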

Once the distribution and hence these quantities are defined, the local scoring algorithm proceeds as follows.

The General Local Scoring Algorithm

  1. Initialization: $s_0 = g(E(y)),\; s_1^{(0)}=s_2^{(0)}=\cdots =s_p^{(0)}=0,\; m=0$

  2. Iterate: $m = m+1$; form the predictor $\bm{\eta}$, mean $\bm{\mu}$, weights $\mb{w}$, and adjusted dependent variable $\mb{z}$ based on their corresponding values from the previous iteration:

    \begin{eqnarray*} \eta^{(m-1)}_i & = & s_0 + \sum_{j=1}^{p} s_j^{(m-1)}(x_{ij}) \\ \mu^{(m-1)}_i & = & g^{-1}\left(\eta^{(m-1)}_i\right) \\ w_i & = & \left(V^{(m-1)}_i\right)^{-1} \cdot \left[\left(\frac{\partial \mu}{\partial \eta}\right)_i^{(m-1)}\right]^2 \\ z_i & = & \eta^{(m-1)}_i + \left(y_i - \mu^{(m-1)}_i\right) \cdot \left(\frac{\partial \eta}{\partial \mu}\right)_i^{(m-1)} \end{eqnarray*}

    where $V_i^{(m-1)}$ is the variance of $Y$ at $\mu_i^{(m-1)}$. Fit an additive model to $\mb{z}$ by using the backfitting algorithm with weights $\mb{w}$ to obtain estimated functions $s_j^{(m)}(\cdot),\; j=1,\dots,p$;

  3. Until: the convergence criterion is satisfied or the deviance fails to decrease. The deviance is the extension of the RSS to generalized linear models; see Goodness of Fit in Chapter 43: The GENMOD Procedure, for a definition.

The GAM procedure uses the following condition as the convergence criterion for local scoring:

\[ \frac{\sum_{i=1}^{n} w_i \sum_{j=1}^{p} \left(s^{(m-1)}_j(\mb{X}_{ij}) - s^{(m)}_j(\mb{X}_{ij})\right)^2}{\sum_{i=1}^{n} w_i \left(1 + \sum_{j=1}^{p} \left(s^{(m-1)}_j(\mb{X}_{ij})\right)^2\right)} \le \epsilon^s \]

where $\epsilon^s = 10^{-8}$ by default; you can change this value with the EPSSCORE= option in the MODEL statement.

The estimating procedure for generalized additive models consists of two loops. Inside each step of the local scoring algorithm (outer loop), a weighted backfitting algorithm (inner loop) is used until convergence or until the RSS fails to decrease. Then, based on the estimates from this weighted backfitting algorithm, a new set of weights is calculated and the next iteration of the scoring algorithm starts. The scoring algorithm stops when the convergence criterion is satisfied or the deviance of the estimates stops decreasing.
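This two-loop structure can be sketched in the same illustrative Python as before. The version below handles the Poisson case, reuses the weighted_backfit function from the earlier sketch as the inner loop, and applies the local scoring convergence criterion; unlike PROC GAM, it omits the deviance-based stopping test:

```python
import numpy as np

def local_scoring_poisson(X, y, smoother, max_iter=25, eps=1e-8):
    """Local scoring for the Poisson/log-link case. The inner loop reuses
    weighted_backfit(X, z, w, smoother) from the earlier sketch."""
    n, p = X.shape
    s0 = np.log(y.mean())                    # step 1: s_0 = g(E(y)), s_j^(0) = 0
    S = np.zeros((n, p))
    for _ in range(max_iter):                # step 2: outer (scoring) loop
        eta = s0 + S.sum(axis=1)
        mu = np.exp(eta)                     # inverse link
        w = mu                               # Poisson weights from the table
        z = eta + (y - mu) / mu              # adjusted dependent variable
        s0, S_new = weighted_backfit(X, z, w, smoother)   # inner loop
        # step 3: weighted convergence criterion on the function estimates
        num = np.sum(w * np.sum((S - S_new) ** 2, axis=1))
        den = np.sum(w * (1.0 + np.sum(S ** 2, axis=1)))
        S = S_new
        if num / den <= eps:
            break
    return s0, S
```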