The LOGISTIC Procedure

Scoring Data Sets

Scoring a data set, which is especially important for predictive modeling, means applying a previously fitted model to a new data set in order to compute the conditional, or posterior, probabilities of each response category given the values of the explanatory variables in each observation.

The SCORE statement enables you to score new data sets and output the scored values and, optionally, the corresponding confidence limits into a SAS data set. If the response variable is included in the new data set, then you can request fit statistics for the data, which is especially useful for test or validation data. If the response is binary, you can also create a SAS data set containing the receiver operating characteristic (ROC) curve. You can specify multiple SCORE statements in the same invocation of PROC LOGISTIC.

By default, the posterior probabilities are based on implicit prior probabilities that are proportional to the frequencies of the response categories in the training data (the data used to fit the model). Explicit prior probabilities should be specified with the PRIOR= or PRIOREVENT= option when the sample proportions of the response categories in the training data differ substantially from the operational data to be scored. For example, to detect a rare category, it is common practice to use a training set in which the rare categories are overrepresented; without prior probabilities that reflect the true incidence rate, the predicted posterior probabilities for the rare category will be too high. By specifying the correct priors, the posterior probabilities are adjusted appropriately.

The model fit to the DATA= data set in the PROC LOGISTIC statement is the default model used for the scoring. Alternatively, you can save a model fit in one run of PROC LOGISTIC and use it to score new data in a subsequent run. The OUTMODEL= option in the PROC LOGISTIC statement saves the model information in a SAS data set. Specifying this data set in the INMODEL= option of a new PROC LOGISTIC run will score the DATA= data set in the SCORE statement without refitting the model.

The STORE statement can also be used to save your model. The PLM procedure can use this model to score new data sets; see Chapter 75: The PLM Procedure, for more information. You cannot specify priors in PROC PLM.

Fit Statistics for Scored Data Sets

Specifying the FITSTAT option displays the following fit statistics when the data set being scored includes the response variable:

Statistic

Description

Total frequency

$F=\sum _ if_ in_ i$

Total weight

$W=\sum _ if_ iw_ in_ i$

Log likelihood

$\log L= \sum _ if_ iw_ i\log (\widehat{\pi }_ i)$

Full log likelihood

$\log L_ f= \textrm{constant}+\log L$

Misclassification (error) rate

$\displaystyle \frac{\sum _ i1\{ {\textrm F}\_ Y_ i\ne {\textrm I}\_ Y_ i\} f_ in_ i}{F}$

AIC

$-2\log L_ f+2p$

AICC

$-2\log L_ f+\displaystyle \frac{2pn}{n-p-1}$

BIC

$-2\log L_ f+p\log (n)$

SC

$-2\log L_ f+p\log (F)$

R-square

$\displaystyle R^2=1-\left(\frac{L_0}{L}\right)^{2/F}$

Maximum-rescaled R-square

$\displaystyle \frac{R^2}{1-L_0^{2/F}}$

AUC

Area under the ROC curve

Brier score (polytomous response)

$\frac{1}{W}\sum _ if_ iw_ i\sum _ j(y_{ij}-\widehat{\pi }_{ij})^2$

Brier score (binary response)

$\frac{1}{W}\sum _ if_ iw_ i(r_ i(1-\widehat{\pi }_ i)^2+(n_ i-r_ i)\widehat{\pi }_ i^2)$

Brier reliability (events/trials syntax)

$\frac{1}{W}\sum _ if_ iw_ i(r_ i/n_ i-\widehat{\pi }_{i})^2$

In the preceding table, $f_ i$ is the frequency of the ith observation in the data set being scored, $w_ i$ is the weight of the observation, and $n=\sum _ if_ i$. The number of trials when events/trials syntax is specified is $n_ i$, and with single-trial syntax $n_ i=1$. The values ${\textrm F}\_ Y_ i$ and ${\textrm I}\_ Y_ i$ are described in the section OUT= Output Data Set in a SCORE Statement. The indicator function $1\{ A\} $ is 1 if A is true and 0 otherwise. The likelihood of the model is L, and $L_0$ denotes the likelihood of the intercept-only model. For polytomous response models, $y_ i$ is the observed polytomous response level, $\widehat{\pi }_{ij}$ is the predicted probability of the jth response level for observation i, and $y_{ij}=1\{ y_ i=j\} $. For binary response models, $\widehat{\pi }_ i$ is the predicted probability of the observation, $r_ i$ is the number of events when you specify events/trials syntax, and $r_ i=y_ i$ when you specify single-trial syntax.

The log likelihood, Akaike’s information criterion (AIC), and Schwarz criterion (SC) are described in the section Model Fitting Information. The full log likelihood is displayed for models specified with events/trials syntax, and the constant term is described in the section Model Fitting Information. The AICC is a small-sample bias-corrected version of the AIC (Hurvich and Tsai, 1993; Burnham and Anderson, 1998). The Bayesian information criterion (BIC) is the same as the SC except when events/trials syntax is specified. The area under the ROC curve for binary response models is defined in the section ROC Computations. The R-square and maximum-rescaled R-square statistics, defined in Generalized Coefficient of Determination, are not computed when you specify both an OFFSET= variable and the INMODEL= data set. The Brier score (Brier, 1950) is the weighted squared difference between the predicted probabilities and their observed response levels. For events/trials syntax, the Brier reliability is the weighted squared difference between the predicted probabilities and the observed proportions (Murphy, 1973).

Posterior Probabilities and Confidence Limits

Let F be the inverse link function. That is,

\begin{eqnarray*}  F(t) = &  \left\{  \begin{array}{ll} \frac{1}{1+\exp (-t)} &  \mbox{logistic} \\ \Phi (t) &  \mbox{normal} \\ 1- \exp (-\exp (t)) &  \mbox{complementary log-log} \end{array} \right. \end{eqnarray*}

The first derivative of F is given by

\begin{eqnarray*}  F’(t) = &  \left\{  \begin{array}{ll} \frac{\exp (-t)}{(1+\exp (-t)) ^2} &  \mbox{logistic} \\ \phi (t) &  \mbox{normal} \\ \exp (t)\exp (-\exp (t)) &  \mbox{complementary log-log} \end{array} \right. \end{eqnarray*}

Suppose there are $k+1$ response categories. Let Y be the response variable with levels $1, \ldots , k+1$. Let $\mb{x}=(x_0, x_1, \ldots , x_ s)’$ be a $(s+1)$-vector of covariates, with $x_0\equiv 1$. Let $\bbeta $ be the vector of intercept and slope regression parameters.

Posterior probabilities are given by

\[  p(Y=i|\mb{x}) = \frac{p_ o(Y=i|\mb{x}) \frac{\widetilde{p}(Y=i)}{p_ o(Y=i)}}{\sum _ j p_ o(Y=j|\mb{x})\frac{\widetilde{p}(Y=j)}{p_ o(Y=j)}} \quad i=1,\ldots ,k+1  \]

where the old posterior probabilities ($p_ o(Y=i|\mb{x}),i=1,\ldots ,k+1$) are the conditional probabilities of the response categories given $\mb{x}$, the old priors ($p_ o(Y=i),i=1,\ldots ,k+1$) are the sample proportions of response categories of the training data, and the new priors ($\widetilde{p}(Y=i),i=1,\ldots ,k+1$) are specified in the PRIOR= or PRIOREVENT= option. To simplify notation, absorb the old priors into the new priors; that is

\[  p(Y=i) = \frac{\widetilde{p}(Y=i)}{p_ o(Y=i)} \quad i=1,\ldots ,k+1  \]

Note if the PRIOR= and PRIOREVENT= options are not specified, then $p(Y=i)=1$.

The posterior probabilities are functions of $\bbeta $ and their estimates are obtained by substituting $\bbeta $ by its MLE ${\widehat{\bbeta }}$. The variances of the estimated posterior probabilities are given by the delta method as follows:

\[  \mbox{Var}(\widehat{p}(Y=i|\mb{x})) = \biggl [\frac{\partial p(Y=i|\mb{x})}{\partial \bbeta } \biggr ]’ \mbox{Var}({\widehat{\bbeta }}) \biggl [\frac{\partial p(Y=i|\mb{x})}{\partial \bbeta } \biggr ]  \]

where

\[  \frac{\partial p(Y=i|\mb{x})}{\partial \bbeta } = \frac{\frac{\partial p_ o(Y=i|\mb{x})}{\partial \bbeta } p(Y=i)}{\sum _ j p_ o(Y=j|\mb{x}) p(Y=j)} - \frac{ p_ o(Y=i|\mb{x})p(Y=i)\sum _ j\frac{\partial p_ o(Y=j|\mb{x})}{\partial \bbeta } p(Y=j)}{[\sum _ j p_ o(Y=j|\mb{x})p(Y=j)]^2}  \]

and the old posterior probabilities $p_ o(Y=i|\mb{x})$ are described in the following sections.

A 100($1-\alpha $)% confidence interval for $p(Y=i|\mb{x})$ is

\[  \widehat{p}(Y=i|\mb{x}) \pm z_{1-\alpha /2} \sqrt {\widehat{\mbox{Var}}(\widehat{p}(Y=i|\mb{x}))}  \]

where $z_{\tau }$ is the upper 100$\tau $ percentile of the standard normal distribution.

Binary and Cumulative Response Models

Let $\alpha _{1}, \ldots , \alpha _{k}$ be the intercept parameters and let $\bbeta _{s}$ be the vector of slope parameters. Denote $\bbeta = (\alpha _{1}, \ldots ,\alpha _{k}, \bbeta _{s}’)’$. Let

\[  \eta _ i = \eta _ i(\bbeta )= \alpha _{i} + \mb{x}’\bbeta _{s}, i=1,\ldots ,k  \]

Estimates of $\eta _1,\ldots ,\eta _{k}$ are obtained by substituting the maximum likelihood estimate ${\widehat{\bbeta }}$ for $\bbeta $.

The predicted probabilities of the responses are

\begin{eqnarray*}  \widehat{p_ o}(Y=i|\mb{x}) = \widehat{{\Pr }}(Y = i) = \left\{  \begin{array}{ll} F(\hat{\eta }_1) &  i=1 \\ F(\hat{\eta }_ i) - F(\hat{\eta }_{i-1}) &  i=2,\ldots ,k \\ 1 - F(\hat{\eta }_{k} ) &  i= k+1 \end{array} \right. \end{eqnarray*}

For $i=1,\dots ,k$, let $\bdelta _ i(\mb{x})$ be a (k + 1) column vector with ith entry equal to 1, k + 1 entry equal to $\mb{x}$, and all other entries 0. The derivative of $p_ o(Y=i|\mb{x})$ with respect to $\bbeta $ are

\begin{eqnarray*}  \frac{\partial p_ o(Y=i|\mb{x})}{\partial {\bbeta }} = &  \left\{  \begin{array}{ll} F’(\alpha _{1}+\mb{x}’{\bbeta _{s}})\bdelta _1(\mb{x}) &  i=1 \\ F’(\alpha _{i}+\mb{x}’{\bbeta _{s}})\bdelta _ i(\mb{x}) - F’(\alpha _{i-1}+\mb{x}’{\bbeta _{s}})\bdelta _{i-1}(\mb{x}) &  i=2,\ldots ,k \\ -F’(\alpha _{k}+\mb{x}’{\bbeta _{s}})\bdelta _{k}(\mb{x}) &  i=k+1 \end{array} \right. \end{eqnarray*}

The cumulative posterior probabilities are

\[  p(Y\le i|\mb{x}) = \frac{ \sum _{j=1}^{i} p_ o(Y=j|\mb{x})p(Y=j) }{ \sum _{j=1}^{k+1} p_ o(Y=j|\mb{x})p(Y=j) } = \sum _{j=1}^{i} p(Y=j|\mb{x}) \quad i=1,\ldots , k+1  \]

Their derivatives are

\[  \frac{ \partial p(Y\le i|\mb{x}) }{\partial \bbeta } = \sum _{j=1}^{i} \frac{ \partial p(Y=j|\mb{x}) }{ \partial \bbeta } \quad i=1,\ldots , k+1  \]

In the delta-method equation for the variance, replace $p(Y=\cdot |\mb{x})$ with $p(Y\le \cdot |\mb{x})$.

Finally, for the cumulative response model, use

\begin{eqnarray*}  \widehat{p_ o}(Y\le i|\mb{x}) & =&  F(\hat{\eta }_ i) \quad i=1,\ldots , k \\ \widehat{p_ o}(Y\le k+1|\mb{x}) & =&  1 \\ \frac{ \partial p_ o(Y\le i|\mb{x}) }{\partial \bbeta } & =&  F’(\alpha _{i}+\mb{x}’ \bbeta _{s})\delta _ i(\mb{x}) \quad i=1,\ldots , k \\ \frac{ \partial p_ o(Y\le k+1|\mb{x}) }{\partial \bbeta } & =&  0 \end{eqnarray*}
Generalized Logit Model

Consider the last response level (Y=k+1) as the reference. Let $\bbeta _1,\ldots ,\bbeta _{k}$ be the (intercept and slope) parameter vectors for the first k logits, respectively. Denote $\bbeta =(\bbeta _1’,\ldots ,\bbeta _{k}’)’$. Let $\eta = (\eta _1,\ldots ,\eta _{k})’$ with

\[  \eta _ i= \eta _ i(\bbeta ) = \mb{x}’\bbeta _ i \quad i=1,\ldots ,k  \]

Estimates of $\eta _1,\ldots ,\eta _{k}$ are obtained by substituting the maximum likelihood estimate ${\widehat{\bbeta }}$ for $\bbeta $.

The predicted probabilities are

\begin{eqnarray*}  \widehat{p_ o}(Y=k+1|\mb{x}) \equiv {\Pr }(Y=k+1|\mb{x}) &  = &  \frac{1}{1+\sum _{l=1}^{k} \exp ({\hat{\eta }_ l})} \\ \widehat{p_ o}(Y=i|\mb{x}) \equiv {\Pr }(Y=i|\mb{x}) & =&  \widehat{p_ o}(Y=k+1|\mb{x}) \exp ({\eta _ i}), i=1,\ldots ,k \end{eqnarray*}

The derivative of $p_ o(Y=i|\mb{x})$ with respect to $\bbeta $ are

\begin{eqnarray*}  \frac{\partial p_ o(Y=i|\mb{x})}{\partial \bbeta } &  = &  \frac{\partial \eta }{\partial \bbeta } \frac{\partial p_ o(Y=i|\mb{x})}{\partial \eta } \\ & = &  (I_{k} \otimes \mb{x}) \left(\frac{\partial p_ o(Y=i|\mb{x})}{\partial \eta _1}, \cdots , \frac{\partial p_ o(Y=i|\mb{x})}{\partial \eta _{k}} \right)’ \end{eqnarray*}

where

\begin{eqnarray*}  \frac{\partial p_ o(Y=i|\mb{x})}{\partial \eta _ j} = \left\{  \begin{array}{ll} p_ o(Y=i|\mb{x})(1-p_ o(Y=i|\mb{x})) &  j=i \\ -p_ o(Y=i|\mb{x})p_ o(Y=j|\mb{x}) &  \textrm{otherwise} \end{array} \right. \end{eqnarray*}
Special Case of Binary Response Model with No Priors

Let $\bbeta $ be the vector of regression parameters. Let

\[  \eta = \eta (\bbeta ) = \mb{x}’\bbeta  \]

The variance of $\hat{\eta }$ is given by

\[  \mbox{Var}(\hat{\eta }) = \mb{x}’ \mbox{Var}({\widehat{\bbeta }}) \mb{x}  \]

A 100($1-\alpha $) percent confidence interval for $\eta $ is

\[  \hat{\eta } \pm z_{1-\alpha /2} \sqrt {\widehat{\mbox{Var}}(\hat{\eta })}  \]

Estimates of $p_ o(Y=1|\mb{x})$ and confidence intervals for the $p_ o(Y=1|\mb{x})$ are obtained by back-transforming $\hat{\eta }$ and the confidence intervals for $\eta $, respectively. That is,

\[  \widehat{p_ o}(Y=1|\mb{x}) = F(\hat{\eta })  \]

and the confidence intervals are

\[  F\left( \hat{\eta } \pm z_{1-\alpha /2} \sqrt {\widehat{\mbox{Var}}(\hat{\eta })}\right)  \]