The GENMOD Procedure

Exact Logistic and Exact Poisson Regression


The theory of exact logistic regression, also called exact conditional logistic regression, is described in the section Exact Conditional Logistic Regression in Chapter 58: The LOGISTIC Procedure. The following discussion of exact Poisson regression, also called exact conditional Poisson regression, uses the notation given in that section.

In exact logistic regression, the coefficient $C({\bm {t}})$ is the number of possible response vectors $\bm {y}$ that generate ${\bm {t}}$: $C({\bm {t}})= ||\{ \bm {y} : \bm {y}'\bX ={\bm {t}}'\} ||$. In exact Poisson regression, this value is replaced by

$C({\bm {t}}) = \sum _\Omega \prod _{i=1}^ n\frac{N_ i^{y_ i}}{y_ i!}$

where $\Omega = \{ \bm {y}\colon \bm {y}'\bX =\bm {t}'\}$ and $N_ i=\exp (o_ i)$ is the exponential of the offset $o_ i$ for observation $i$. If an offset variable is not specified, then $N_ i=1$.
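The sum that defines $C({\bm {t}})$ can be checked by brute force on a tiny problem. The following Python sketch is illustrative only and is not how PROC GENMOD computes the coefficients; the function name `poisson_coefficient` and the explicit per-observation upper bounds `S` are assumptions introduced for the example.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def poisson_coefficient(X, t, N, S):
    """Sum prod_i N_i**y_i / y_i! over all y with y'X == t and 0 <= y_i <= S_i.

    X is a list of design rows, t the target sufficient statistics,
    N the exponentiated offsets, S hypothetical upper bounds on each y_i.
    """
    total = Fraction(0)
    for y in product(*(range(s + 1) for s in S)):
        # Sufficient statistics y'X for this candidate response vector.
        stat = [sum(yi * row[j] for yi, row in zip(y, X)) for j in range(len(t))]
        if stat == list(t):
            term = Fraction(1)
            for yi, ni in zip(y, N):
                term *= Fraction(ni) ** yi / factorial(yi)
            total += term
    return total


# Intercept-only model with two observations, N = (2, 3), and t = 2.
# Omega = {(0, 2), (1, 1), (2, 0)}, so C(t) = 9/2 + 6 + 2.
print(poisson_coefficient([[1], [1]], t=[2], N=[2, 3], S=[2, 2]))
```

For an intercept-only model the result agrees with the closed form $(\sum _ i N_ i)^ t/t!$, which follows from the multinomial theorem.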

The probability density function (PDF) for $\bT$ is created by summing over all candidate sequences $\bm {y}$ that generate an observable ${\bm {t}}$:

$\Pr (\bT =\bm {t}) = \frac{C({\bm {t}})\exp ({\bm {t}}'\bbeta )}{\prod _{i=1}^ n\exp (N_ ie^{\bm {x}_ i'\bbeta })}$
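A short derivation shows where this form comes from. It assumes the usual Poisson log-linear model with an offset, in which the responses $y_ i$ are independent with means $N_ i e^{\bm {x}_ i'\bbeta }$; the probability of a single response vector is then

$\Pr (\bm {Y}=\bm {y}) = \prod _{i=1}^ n\frac{(N_ i e^{\bm {x}_ i'\bbeta })^{y_ i}\exp (-N_ i e^{\bm {x}_ i'\bbeta })}{y_ i!} = \frac{\exp (\bm {y}'\bX \bbeta )\prod _{i=1}^ n N_ i^{y_ i}/y_ i!}{\prod _{i=1}^ n\exp (N_ ie^{\bm {x}_ i'\bbeta })}$

Every $\bm {y}$ in $\Omega$ satisfies $\bm {y}'\bX =\bm {t}'$, so the factor $\exp (\bm {y}'\bX \bbeta )=\exp (\bm {t}'\bbeta )$ is common to all terms, and summing the remaining product over $\Omega$ gives $C({\bm {t}})$.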

However, the conditional likelihood of $\bT _\mr {I}$ given $\bT _\mr {N} = \bm {t}_\mr {N}$ has the same form as that for exact logistic regression.

For details about hypothesis testing and estimation, see the sections Hypothesis Tests and Inference for a Single Parameter in Chapter 58: The LOGISTIC Procedure. See the section Computational Resources for Exact Logistic Regression in Chapter 58: The LOGISTIC Procedure, for some computational notes about exact analyses.

In exact binary logistic regression, each component $y_ i$, $i=1,\ldots ,n$, of $\bm {y}$ can take a value of 0 or 1, so there is a finite number, $2^ n$, of candidate $\bm {y}$ vectors to consider. Because a Poisson-distributed response variable can take infinitely many values, exact Poisson regression would in principle have to evaluate an infinite number of $\bm {y}$ vectors. However, by identifying a maximum value of $y_ i$ to check, $S_ i$, for each observation $i$, the number of candidate $\bm {y}$ vectors to check is reduced to $\prod _{i=1}^ n S_ i$. On a practical level, as $S_ i$ becomes large the probability that the Poisson random variable attains this value drops to zero, so $S_ i$ can be thought of as the point beyond which larger values no longer matter. You can provide these maxima by specifying either an OFFSET= variable, $o_ i$, or an EXACTMAX= variable, $e_ i$, or you can let the algorithm choose a maximum for you. The following list describes how these two options interact to provide a maximum:

If an EXACTMAX= variable is specified, then $S_ i=e_ i$ .
If the EXACTMAX option is specified without a variable, or if neither the EXACTMAX= option nor the OFFSET= option is specified, then you must also condition out the intercept or specify the STRATA statement. If you are conditioning out the intercept, then every $S_ i$ has an effective maximum of $\sum _{j=1}^ nf_ jy_{0j}$, where $\bm {y}_0$ is the observed response vector and $f_ j$ is the frequency of observation $j$; this sum is the sufficient statistic for the intercept term. If you are performing a stratified analysis, these sums are computed within each stratum.
If an offset variable is specified and the EXACTMAX option is not specified (that is, you are modeling proportions), then $N_ i=\exp (o_ i)$ must be a positive integer, and $S_ i=N_ i$ is the maximum possible value for each observation in the experiment. For example, if you are counting the number of rats in a cage that acquire a disease, then $N_ i$ is the number of rats in cage $i$.
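The three rules in the preceding list can be summarized in a small Python sketch. This is only an illustration of the selection logic as described here, not the procedure's implementation: the function name `response_maxima`, its arguments, and the integer tolerance are assumptions, and the stratified case (sums computed within each stratum) is left out for brevity.

```python
import math


def response_maxima(y0, freq, offset=None, exactmax=None, exactmax_flag=False):
    """Return the list of maxima S_i for a non-stratified analysis.

    y0            observed responses
    freq          observation frequencies f_i
    offset        offsets o_i, or None if no OFFSET= variable is given
    exactmax      values e_i of an EXACTMAX= variable, or None
    exactmax_flag True if EXACTMAX is specified without a variable
    """
    # Rule 1: an EXACTMAX= variable supplies the maxima directly.
    if exactmax is not None:
        return list(exactmax)
    # Rule 2: EXACTMAX without a variable, or neither option given.
    # Every S_i is bounded by the intercept's sufficient statistic.
    if exactmax_flag or offset is None:
        total = sum(f * y for f, y in zip(freq, y0))
        return [total] * len(y0)
    # Rule 3: OFFSET= only; N_i = exp(o_i) must be a positive integer.
    maxima = []
    for o in offset:
        n_i = math.exp(o)
        if round(n_i) < 1 or abs(n_i - round(n_i)) > 1e-8:
            raise ValueError("exp(offset) must be a positive integer")
        maxima.append(round(n_i))
    return maxima


print(response_maxima([1, 0, 2], [1, 2, 1]))
print(response_maxima([1, 0, 2], [1, 1, 1], offset=[math.log(5), math.log(3), 0.0]))
```

In the first call neither option is given, so each maximum is the frequency-weighted sum of the observed responses; in the second call the maxima are the exponentiated offsets.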

OUTDIST= Output Data Set

The OUTDIST= data set contains every exact conditional distribution necessary to process the corresponding EXACT statement. For example, the following statements create one distribution for the x1 parameter and another for the x2 parameters, and produce the data set dist shown in Table 42.11:

data test;
   input y x1 x2 count;
   datalines;
0 0 0 1
1 0 0 1
0 1 1 2
1 1 1 1
1 0 2 3
1 1 2 1
1 2 0 3
1 2 1 2
1 2 2 1
;

proc genmod data=test exactonly;
   class x2 / param=ref;
   freq count;   /* count holds the frequency of each row */
   model y=x1 x2 / d=b;
   exact x1 x2 / outdist=dist;
run;

proc print data=dist;
run;

Table 42.11: OUTDIST= Data Set

Obs	x1	x20	x21	Count	Score	Prob
1	.	0	0	3	5.81151	0.03333
2	.	0	1	15	1.66031	0.16667
3	.	0	2	9	3.12728	0.10000
4	.	1	0	15	1.46523	0.16667
5	.	1	1	18	0.21675	0.20000
6	.	1	2	6	4.58644	0.06667
7	.	2	0	19	1.61869	0.21111
8	.	2	1	2	3.27293	0.02222
9	.	3	0	3	6.27189	0.03333
10	2	.	.	6	3.03030	0.12000
11	3	.	.	12	0.75758	0.24000
12	4	.	.	11	0.00000	0.22000
13	5	.	.	18	0.75758	0.36000
14	6	.	.	3	3.03030	0.06000

The first nine observations in the dist data set contain an exact distribution for the parameters of the x2 effect (hence the values for the x1 parameter are missing), and the remaining five observations are for the x1 parameter. If a joint distribution were created, there would be observations with values for both the x1 and x2 parameters. For CLASS variables, the corresponding parameters in the dist data set are identified by concatenating the variable name with the appropriate classification level.

The data set contains the possible sufficient statistics of the parameters for the effects specified in the EXACT statement, and the Count variable contains the number of different responses that yield these statistics. In particular, there are six possible response vectors $\mb {y}$ for which the dot product $\mb {y}'\mb {x1}$ is equal to 2 and for which $\mb {y}'\mb {x20}$, $\mb {y}'\mb {x21}$, and $\mb {y}'\mb {1}$ are equal to their observed values (displayed in the “Sufficient Statistics” table).
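The Count values can be reproduced by direct enumeration. The following Python sketch is a brute-force check, not the algorithm the procedure uses. It rests on two assumptions that are not stated in this section: each row of the test data is replicated count times (15 observations in all), and the modeled event is y=0, the lower response level. Under these assumptions the tallies agree with the Count column of Table 42.11.

```python
from collections import Counter
from itertools import product

# (y, x1, x2, count) rows from the DATA step.
rows = [
    (0, 0, 0, 1), (1, 0, 0, 1), (0, 1, 1, 2), (1, 1, 1, 1), (1, 0, 2, 3),
    (1, 1, 2, 1), (1, 2, 0, 3), (1, 2, 1, 2), (1, 2, 2, 1),
]

# Assumption: expand each row by its frequency; the event indicator is y == 0.
# Reference coding for x2 (reference level 2) gives dummies x20 and x21.
observed, design = [], []
for y, x1, x2, count in rows:
    for _ in range(count):
        observed.append(int(y == 0))
        design.append((1, x1, int(x2 == 0), int(x2 == 1)))


def suff(e):
    """Sufficient statistics (intercept, x1, x20, x21) for event vector e."""
    return tuple(sum(ei * xi for ei, xi in zip(e, col)) for col in zip(*design))


t0 = suff(observed)  # observed sufficient statistics

x1_dist, x2_dist = Counter(), Counter()
for e in product((0, 1), repeat=len(design)):
    one, x1, x20, x21 = suff(e)
    if one != t0[0]:
        continue  # always condition on the intercept
    if (x20, x21) == t0[2:]:
        x1_dist[x1] += 1  # distribution for x1, conditioning on x2
    if x1 == t0[1]:
        x2_dist[(x20, x21)] += 1  # distribution for x2, conditioning on x1

print(sorted(x1_dist.items()))
print(sorted(x2_dist.items()))
```

The enumeration visits all $2^{15}$ binary vectors, which is feasible only because the example is small.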

Note: If you are performing an exact Poisson analysis, then the Count variable is replaced by a variable named Weight.

When hypothesis tests are performed on the parameters, the Prob variable contains the probability of obtaining that statistic (which is just the count divided by the total count), and the Score variable contains the score for that statistic.
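As a worked check, the Prob and Score columns for the x1 rows of Table 42.11 can be recomputed from the Count column. The Prob calculation follows the description above. The Score formula used here, the squared deviation from the conditional mean divided by the conditional variance, is an assumption based on the usual form of the exact conditional score statistic for a single parameter; it is not spelled out in this section. The variable names are illustrative.

```python
# Count column for the x1 distribution (observations 10-14 of Table 42.11).
counts = {2: 6, 3: 12, 4: 11, 5: 18, 6: 3}

total = sum(counts.values())
prob = {t: c / total for t, c in counts.items()}

# Conditional mean and variance of the sufficient statistic under the null.
mean = sum(t * p for t, p in prob.items())
var = sum((t - mean) ** 2 * p for t, p in prob.items())

# Assumed score for a single parameter: (t - mean)^2 / variance.
score = {t: (t - mean) ** 2 / var for t in counts}

for t in counts:
    print(t, round(prob[t], 5), round(score[t], 5))
```

With these counts the total is 50, the conditional mean is 4, and the conditional variance is 1.32, so the recomputed values agree with the tabulated Prob and Score columns to the printed precision.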

The OUTDIST= data set can contain a different exact conditional distribution for each specified EXACT statement. For example, consider the following EXACT statements:


exact 'O1'   x1    /           outdist=o1;
exact 'OJ12' x1 x2 / jointonly outdist=oj12;
exact 'OA12' x1 x2 / joint     outdist=oa12;
exact 'OE12' x1 x2 / estimate  outdist=oe12;

The O1 statement outputs a single exact conditional distribution. The OJ12 statement outputs only the joint distribution for x1 and x2. The OA12 statement outputs three conditional distributions: one for x1, one for x2, and one jointly for x1 and x2. The OE12 statement outputs two conditional distributions: one for x1 and the other for x2. Data set oe12 contains both the x1 and x2 variables; the distribution for x1 has missing values in the x2 column while the distribution for x2 has missing values in the x1 column.