This example uses a data set from a study of the analgesic effects of treatments on elderly patients with neuralgia. The purpose of this example is to show how PROC PLM behaves in different situations when BY-group processing is present. Two test treatments and a placebo are compared, and the response is whether the patient reported pain. For each patient, the age, gender, and duration of complaint before the treatment began were recorded. The following DATA step creates the data set named Neuralgia:

    data Neuralgia;
       input Treatment $ Sex $ Age Duration Pain $ @@;
       datalines;
    P F 68  1 No   B M 74 16 No   P F 67 30 No   P M 66 26 Yes  B F 67 28 No
    B F 77 16 No   A F 71 12 No   B F 72 50 No   B F 76  9 Yes  A M 71 17 Yes
    A F 63 27 No   A F 69 18 Yes  B F 66 12 No   A M 62 42 No   P F 64  1 Yes
    A F 64 17 No   P M 74  4 No   A F 72 25 No   P M 70  1 Yes  B M 66 19 No
    B M 59 29 No   A F 64 30 No   A M 70 28 No   A M 69  1 No   B F 78  1 No
    P M 83  1 Yes  B F 69 42 No   B M 75 30 Yes  P M 77 29 Yes  P F 79 20 Yes
    A M 70 12 No   A F 69 12 No   B F 65 14 No   B M 70  1 No   B M 67 23 No
    A M 76 25 Yes  P M 78 12 Yes  B M 77  1 Yes  B F 69 24 No   P M 66  4 Yes
    P F 65 29 No   P M 60 26 Yes  A M 78 15 Yes  B M 75 21 Yes  A F 67 11 No
    P F 72 27 No   P F 70 13 Yes  A M 75  6 Yes  B F 65  7 No   P F 68 27 Yes
    P M 68 11 Yes  P M 67 17 Yes  B M 70 22 No   A M 65 15 No   P F 67  1 Yes
    A M 67 10 No   P F 72 11 Yes  A F 74  1 No   B M 80 21 Yes  A F 69  3 No
    ;

The data set contains five variables. Treatment is a classification variable that has three levels: A and B represent the two test treatments, and P represents the placebo treatment. Sex is a classification variable that indicates each patient's gender. Age is a continuous variable that indicates the age in years of each patient when a treatment began. Duration is a continuous variable that indicates the duration of complaint in months. The last variable, Pain, is the response variable with two levels: 'Yes' if pain was reported, 'No' if no pain was reported.
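Before fitting any model, it can help to see how the response is distributed within each combination of gender and treatment. The following is a minimal sketch that uses only the variables described above. The choice of PROC FREQ and of the NOCOL and NOPERCENT options is an assumption made for illustration and is not part of the original example.

```sas
/* Cross-tabulate Treatment by Pain separately for each level of Sex. */
/* NOCOL and NOPERCENT suppress column and overall percentages so     */
/* that only counts and row percentages are displayed.                */
proc freq data=Neuralgia;
   tables Sex*Treatment*Pain / nocol nopercent;
run;
```

The three-way table request produces one Treatment-by-Pain table for each level of Sex, which mirrors the BY-group structure used later in the example.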
Suppose there is some preliminary belief that the dependency of Pain on the explanatory variables is different for male and female patients, which leads to fitting a separate model for each gender. You also believe that some of the explanatory variables might be redundant for predicting the probability of Pain, so you want to perform model selection to eliminate unnecessary effects. You can use the following statements:

    proc sort data=Neuralgia;
       by sex;
    run;

    proc logistic data=Neuralgia;
       class Treatment / param=glm;
       model pain = Treatment Age Duration / selection=backward;
       by sex;
       store painmodel;
       title 'Logistic Model on Neuralgia';
    run;

PROC SORT is called to sort the data by the variable Sex. The LOGISTIC procedure is then called to fit the probability of no pain. Three variables are specified for the full model: Treatment, Age, and Duration. Backward elimination is used as the model selection method. The BY statement fits separate models for male and female patients. Finally, the STORE statement specifies that the fitted results be saved to an item store named painmodel.
Output 75.5.1 lists parameter estimates from the two models after backward elimination is performed. In the model for female patients, Treatment is the only effect that is retained, and Treatment A and B have the same positive estimate for the log odds of no pain relative to the placebo. In the model for male patients, both Treatment and Age are retained: Treatment A and B have different positive estimates, whereas Age has a negative estimate. The estimates displayed here, with equal coefficients for A and B and no Age term, correspond to the model for female patients.
Output 75.5.1: Parameter Estimates for Male and Female Patients

Logistic Model on Neuralgia

Analysis of Maximum Likelihood Estimates

Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|
Intercept | | 1 | -0.4055 | 0.6455 | 0.3946 | 0.5299 |
Treatment | A | 1 | 2.6027 | 1.2360 | 4.4339 | 0.0352 |
Treatment | B | 1 | 2.6027 | 1.2360 | 4.4339 | 0.0352 |
Treatment | P | 0 | 0 | . | . | . |

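
The estimates in Output 75.5.1 are on the logit scale. To read them as probabilities of no pain, add the intercept to the treatment estimate and apply the inverse logit. The following DATA step is a minimal sketch of that arithmetic. The data set name female_prob and the variable names LinearPredictor and ProbNoPain are hypothetical, and the estimates are typed in by hand from the table rather than read from the item store.

```sas
/* Hand-entered estimates from Output 75.5.1 (model with Treatment only). */
data female_prob;
   length Treatment $ 1;
   Intercept = -0.4055;                            /* intercept estimate        */
   input Treatment $ Estimate;                     /* treatment-level estimate  */
   LinearPredictor = Intercept + Estimate;         /* logit of P(Pain = 'No')   */
   ProbNoPain = 1 / (1 + exp(-LinearPredictor));   /* inverse logit             */
   datalines;
A 2.6027
B 2.6027
P 0
;

proc print data=female_prob noobs;
run;
```

The resulting probabilities are approximately 0.90 for treatments A and B and 0.40 for the placebo. These agree with the raw proportions of female patients in the Neuralgia data who reported no pain: 9 of 10 for each test treatment and 4 of 10 for the placebo.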
Now the fitted models are saved to the item store painmodel. Suppose you want to use it to score several new observations. The following DATA steps create three data sets for scoring:

    data score1;
       input Treatment $ Sex $ Age;
       datalines;
    A F 20
    B F 30
    P F 40
    A M 20
    B M 30
    P M 40
    ;

    data score2;
       set score1(drop=sex);
    run;

    data score3;
       set score2(drop=Age);
    run;

The first score data set, score1, contains six observations with the BY variable Sex and the variables Treatment and Age, which are the only explanatory variables retained in the selected models. Duration, which backward elimination removed from both models, is not included. The second score data set, score2, is a duplicate of score1 except that Sex is dropped. The third score data set, score3, is a duplicate of score2 except that Age is dropped. You can use the following statements to score the three data sets:

    proc plm restore=painmodel;
       score data=score1 out=score1out predicted;
       score data=score2 out=score2out predicted;
       score data=score3 out=score3out predicted;
    run;

Output 75.5.2 lists the store information that PROC PLM reads from the item store painmodel. The "Model Effects" entry lists all three variables that are specified in the full model before the BY-group processing.
Output 75.5.2: Item Store Information for painmodel
Logistic Model on Neuralgia

Store Information | |
---|---|
Item Store | WORK.PAINMODEL |
Data Set Created From | WORK.NEURALGIA |
Created By | PROC LOGISTIC |
Date Created | 27MAR14:10:15:34 |
By Variable | Sex |
Response Variable | Pain |
Link Function | Logit |
Distribution | Binary |
Class Variables | Treatment Pain |
Model Effects | Intercept Treatment Age Duration |
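
The store information lists the effects of the full model, not the effects that survived selection in each BY group. One way to inspect what each stored model actually contains is the SHOW statement of PROC PLM. The following is a minimal sketch. The choice of the BYVAR and PARAMETERS options is an assumption about what is most useful to display here, and the resulting output is not reproduced in this example.

```sas
/* Display BY-variable information and the stored parameter estimates */
/* from the item store without scoring any data.                      */
proc plm restore=painmodel;
   show byvar parameters;
run;
```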
With the three SCORE statements, three data sets are produced: score1out, score2out, and score3out. They contain the linear predictors in addition to all original variables. The data set score1out contains the values shown in Output 75.5.3.
Linear predictors are computed for all six observations. Because the BY variable Sex is available in score1, PROC PLM uses the separate models to score observations of male and female patients. As a result, observations that share the same Treatment and Age have different linear predictors for different genders.
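
The SCORE statements in this example request only the PREDICTED keyword, so the output values are linear predictors on the logit scale. If predicted probabilities are wanted instead, the ILINK option of the SCORE statement applies the inverse link function. The following is a minimal sketch. The output data set name score1prob and the variable name ProbNoPain are hypothetical and do not appear in the original example.

```sas
/* Score score1 again, this time on the probability scale.           */
/* PREDICTED=ProbNoPain names the output variable, and ILINK applies */
/* the inverse logit to the linear predictor.                        */
proc plm restore=painmodel;
   score data=score1 out=score1prob predicted=ProbNoPain / ilink;
run;
```

Because Sex is present in score1, each observation is still scored with the model for its own BY group. Only the scale of the prediction changes.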
The data set score2out contains the values shown in Output 75.5.4.
The second score data set, score2, does not contain the BY variable Sex, so PROC PLM scores the full data set twice, once with the fitted model for each BY group. In the output data set, Sex is added as the first column to indicate the BY group. The first six entries correspond to the model for female patients, and the next six entries correspond to the model for male patients. Age is not included in the first model, and Treatment A and B have the same parameter estimates, so observations 1, 2, 4, and 5 have the same linear predictor value.
The data set score3out contains the values shown in Output 75.5.5.
The third score data set, score3, also lacks the BY variable Sex, so PROC PLM again scores the full data set twice with separate models. In addition, score3 does not contain the variable Age, which is a selected variable for predicting the probability of no pain for male patients. Thus, PROC PLM computes linear predictor values for score3 by using the first model (female patients), and it sets the linear predictor to missing when it uses the second model (male patients) to score the data set.
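
A quick way to confirm this pattern is to count the missing and nonmissing linear predictors in each BY group. The following is a minimal sketch. It assumes that score3out carries the BY variable Sex in the same way that score2out does and that the linear predictor has the default variable name Predicted, because the SCORE statements do not assign a name.

```sas
/* Count nonmissing (N) and missing (NMISS) linear predictors by Sex. */
proc means data=score3out n nmiss;
   class Sex;
   var Predicted;
run;
```

If the behavior described above holds, the female group shows only nonmissing values and the male group shows only missing values.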