The Fitness data described in the REG procedure are measurements of 31 individuals in a physical fitness course. See Chapter 79: The REG Procedure, for more information.
The Fitness1 data set is constructed from the Fitness data set and contains three variables: Oxygen, RunTime, and RunPulse. Some values have been set to missing, and the resulting data set has an arbitrary pattern of missingness in these three variables.
    *---------------------Data on Physical Fitness-------------------------*
    | These measurements were made on men involved in a physical fitness   |
    | course at N.C. State University. Certain values have been set to     |
    | missing and the resulting data set has an arbitrary missing pattern. |
    | Only selected variables of                                           |
    | Oxygen (intake rate, ml per kg body weight per minute),              |
    | Runtime (time to run 1.5 miles in minutes),                          |
    | RunPulse (heart rate while running) are used.                        |
    *----------------------------------------------------------------------*;
    data Fitness1;
       input Oxygen RunTime RunPulse @@;
       datalines;
    44.609  11.37  178     45.313  10.07  185
    54.297   8.65  156     59.571    .      .
    49.874   9.22    .     44.811  11.63  176
      .     11.95  176       .     10.85    .
    39.442  13.08  174     60.055   8.63  170
    50.541    .      .     37.388  14.03  186
    44.754  11.12  176     47.273    .      .
    51.855  10.33  166     49.156   8.95  180
    40.836  10.95  168     46.672  10.00    .
    46.774  10.25    .     50.388  10.08  168
    39.407  12.63  174     46.080  11.17  156
    45.441   9.63  164       .      8.92    .
    45.118  11.08    .     39.203  12.88  168
    45.790  10.47  186     50.545   9.93  148
    48.673   9.40  186     47.920  11.50  170
    47.467  10.50  170
    ;
Suppose that the data are multivariate normally distributed and the missing data are missing at random (MAR). That is, the probability that an observation is missing can depend on the observed variable values of the individual, but not on the missing variable values of the individual. See the section Statistical Assumptions for Multiple Imputation for a detailed description of the MAR assumption.
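The MAR condition can also be written compactly. The notation below is introduced here only for illustration and is not taken from this section: R denotes the missingness indicators, Y_obs and Y_mis denote the observed and missing parts of the data, and φ denotes the parameters of the missingness mechanism.

```latex
\Pr\bigl(R \mid Y_{\mathrm{obs}},\, Y_{\mathrm{mis}},\, \phi\bigr)
  \;=\; \Pr\bigl(R \mid Y_{\mathrm{obs}},\, \phi\bigr)
```

Under this reading, whether RunPulse is missing for an individual may depend on that individual's recorded RunTime, for example, but not on the unrecorded RunPulse value itself.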
The following statements invoke the MI procedure and impute missing values for the Fitness1 data set:

    proc mi data=Fitness1 seed=501213 mu0=50 10 180 out=outmi;
       mcmc;
       var Oxygen RunTime RunPulse;
    run;
The “Model Information” table in Figure 57.1 describes the method used in the multiple imputation process. By default, the MCMC statement uses the Markov chain Monte Carlo (MCMC) method with a single chain to create five imputations. The posterior mode, that is, the parameter value with the highest observed-data posterior density under a noninformative prior, is computed by the expectation-maximization (EM) algorithm and is used as the starting value for the chain.
Figure 57.1: Model Information
Model Information | |
---|---|
Data Set | WORK.FITNESS1 |
Method | MCMC |
Multiple Imputation Chain | Single Chain |
Initial Estimates for MCMC | EM Posterior Mode |
Start | Starting Value |
Prior | Jeffreys |
Number of Imputations | 5 |
Number of Burn-in Iterations | 200 |
Number of Iterations | 100 |
Seed for random number generator | 501213 |
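The settings reported in Figure 57.1 can also be requested explicitly. The following sketch spells out the values shown in the table by using the NIMPUTE= option and the CHAIN=, NBITER=, NITER=, INITIAL=, and PRIOR= options of the MCMC statement. It is intended as an equivalent, more explicit form of the earlier invocation, on the assumption that these options correspond to the table rows as their names suggest. Because defaults can differ between releases, stating them explicitly may make the analysis easier to reproduce.

```sas
/* Sketch: the same imputation with the defaults written out.      */
/* Values are copied from the "Model Information" table.           */
proc mi data=Fitness1 seed=501213 nimpute=5 mu0=50 10 180 out=outmi;
   mcmc chain=single      /* one chain for all imputations         */
        nbiter=200        /* burn-in iterations                    */
        niter=100         /* iterations between imputations        */
        initial=em        /* start from the EM posterior mode      */
        prior=jeffreys;   /* noninformative (Jeffreys) prior       */
   var Oxygen RunTime RunPulse;
run;
```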
The MI procedure runs 200 burn-in iterations before the first imputation and 100 iterations between imputations. In a Markov chain, the state at the current iteration influences the state at the next iteration. The burn-in iterations at the beginning of each chain serve both to remove the dependence of the series on the starting value of the chain and to let the chain reach its stationary distribution. The between-imputation iterations in a single chain are used to eliminate the serial dependence between successive imputations.
The “Missing Data Patterns” table in Figure 57.2 lists distinct missing data patterns with their corresponding frequencies and percentages. An “X” means that the variable is observed in the corresponding group, and a “.” means that the variable is missing. The table also displays group-specific variable means. The MI procedure sorts the data into groups based on whether the analysis variables are observed or missing. For a detailed description of missing data patterns, see the section Missing Data Patterns.
Figure 57.2: Missing Data Patterns
Missing Data Patterns | ||||||||
---|---|---|---|---|---|---|---|---|
Group | Oxygen | RunTime | RunPulse | Freq | Percent | Group Mean: Oxygen | Group Mean: RunTime | Group Mean: RunPulse |
1 | X | X | X | 21 | 67.74 | 46.353810 | 10.809524 | 171.666667 |
2 | X | X | . | 4 | 12.90 | 47.109500 | 10.137500 | . |
3 | X | . | . | 3 | 9.68 | 52.461667 | . | . |
4 | . | X | X | 1 | 3.23 | . | 11.950000 | 176.000000 |
5 | . | X | . | 2 | 6.45 | . | 9.885000 | . |
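The missing data patterns can be inspected before any imputation is run. The following sketch assumes that NIMPUTE=0 behaves as documented for PROC MI, so that no imputations are created and only the descriptive tables, including “Missing Data Patterns”, are displayed.

```sas
/* Sketch: display the missing data patterns without imputing.     */
proc mi data=Fitness1 nimpute=0;
   var Oxygen RunTime RunPulse;
run;
```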
After the completion of m imputations, the “Variance Information” table in Figure 57.3 displays the between-imputation variance, within-imputation variance, and total variance for combining complete-data inferences. It also displays the degrees of freedom for the total variance. The relative increase in variance due to missing values, the fraction of missing information, and the relative efficiency (in units of variance) for each variable are also displayed. A detailed description of these statistics is provided in the section Combining Inferences from Multiply Imputed Data Sets.
Figure 57.3: Variance Information
Variance Information | |||||||
---|---|---|---|---|---|---|---|
Variable | Between Variance | Within Variance | Total Variance | DF | Relative Increase in Variance | Fraction Missing Information | Relative Efficiency |
Oxygen | 0.056930 | 0.954041 | 1.022356 | 25.549 | 0.071606 | 0.068898 | 0.986408 |
RunTime | 0.000811 | 0.064496 | 0.065469 | 27.721 | 0.015084 | 0.014968 | 0.997015 |
RunPulse | 0.922032 | 3.269089 | 4.375528 | 15.753 | 0.338455 | 0.275664 | 0.947748 |
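The relationships among the columns of the “Variance Information” table can be sketched with the usual combining rules for multiply imputed data. The symbols are assumptions introduced here for illustration: \hat{Q}_i is the estimated mean from imputation i, \hat{U}_i is its squared standard error, m is the number of imputations, and λ is the fraction of missing information, taken directly from the table. The worked lines use the rounded Oxygen entries, so the last digit can differ slightly from the table.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, &
W &= \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, &
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, \\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B, &
r &= \frac{(1+1/m)\,B}{W}, &
\mathrm{RE} &= \Bigl(1+\frac{\lambda}{m}\Bigr)^{-1}.
\end{align*}

% Oxygen, m = 5, using the rounded entries of the table:
\begin{align*}
T &\approx 0.954041 + 1.2 \times 0.056930 \approx 1.022357, \\
r &\approx \frac{0.068316}{0.954041} \approx 0.0716, \\
\mathrm{RE} &\approx \Bigl(1+\frac{0.068898}{5}\Bigr)^{-1} \approx 0.9864.
\end{align*}
```

Here B, W, and T correspond to the between-imputation, within-imputation, and total variance columns, r to the relative increase in variance, and RE to the relative efficiency.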
The “Parameter Estimates” table in Figure 57.4 displays the estimated mean and standard error of the mean for each variable. The inferences are based on the t distribution. The table also displays a 95% confidence interval for the mean, the minimum and maximum of the means across the imputed data sets, and a t statistic with the associated p-value for the hypothesis that the population mean is equal to the value specified with the MU0= option. A detailed description of these statistics is provided in the section Combining Inferences from Multiply Imputed Data Sets.
Figure 57.4: Parameter Estimates
Parameter Estimates | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit | DF | Minimum | Maximum | Mu0 | t for H0: Mean=Mu0 | Pr > \|t\| |
Oxygen | 47.094040 | 1.011116 | 45.0139 | 49.1742 | 25.549 | 46.783898 | 47.395550 | 50.000000 | -2.87 | 0.0081 |
RunTime | 10.572073 | 0.255870 | 10.0477 | 11.0964 | 27.721 | 10.526392 | 10.599616 | 10.000000 | 2.24 | 0.0336 |
RunPulse | 171.787793 | 2.091776 | 167.3478 | 176.2278 | 15.753 | 170.774818 | 173.122002 | 180.000000 | -3.93 | 0.0012 |
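As a rough check on how the entries of the “Parameter Estimates” table fit together, the following DATA step recomputes the Oxygen row from the rounded values displayed in Figures 57.3 and 57.4. It is a sketch rather than the procedure's own computation: the variable names are hypothetical, and the results written to the log should agree with the table only up to rounding of the displayed inputs.

```sas
/* Sketch: recompute the Oxygen row from displayed (rounded) values. */
data _null_;
   mean  = 47.094040;    /* combined mean (Figure 57.4)             */
   total = 1.022356;     /* total variance (Figure 57.3)            */
   df    = 25.549;       /* degrees of freedom (Figure 57.4)        */
   mu0   = 50;           /* null value given in the MU0= option     */

   se    = sqrt(total);                 /* standard error of mean   */
   t     = (mean - mu0) / se;           /* t statistic for H0       */
   p     = 2 * (1 - probt(abs(t), df)); /* two-sided p-value        */
   half  = tinv(0.975, df) * se;        /* half-width of 95% limits */
   lower = mean - half;
   upper = mean + half;

   put se= t= p= lower= upper=;         /* write results to the log */
run;
```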
In addition to the output tables, the procedure also creates a data set with imputed values. The imputed data sets are stored in the outmi data set, with the index variable _Imputation_ indicating the imputation numbers. The data set can now be analyzed using standard statistical procedures with _Imputation_ as a BY variable.
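For instance, a regression can be fit separately to each imputed data set and the results combined with the MIANALYZE procedure. The model below (Oxygen regressed on RunTime and RunPulse) and the output data set name outreg are chosen only for illustration. The sketch assumes that the OUTEST= and COVOUT options of PROC REG save the estimates and their covariance matrices in a form that PROC MIANALYZE can read.

```sas
/* Sketch: fit the same model to each imputation ...               */
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;

/* ... and combine the per-imputation estimates.                   */
proc mianalyze data=outreg;
   modeleffects Intercept RunTime RunPulse;
run;
```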
The following statements list the first 10 observations of data set outmi:

    proc print data=outmi (obs=10);
       title 'First 10 Observations of the Imputed Data Set';
    run;
The table in Figure 57.5 shows that the precision of the imputed values differs from the precision of the observed values. You can use the ROUND= option to make the imputed values consistent with the observed values.
Figure 57.5: Imputed Data Set
First 10 Observations of the Imputed Data Set |
Obs | _Imputation_ | Oxygen | RunTime | RunPulse |
---|---|---|---|---|
1 | 1 | 44.6090 | 11.3700 | 178.000 |
2 | 1 | 45.3130 | 10.0700 | 185.000 |
3 | 1 | 54.2970 | 8.6500 | 156.000 |
4 | 1 | 59.5710 | 8.0747 | 155.925 |
5 | 1 | 49.8740 | 9.2200 | 176.837 |
6 | 1 | 44.8110 | 11.6300 | 176.000 |
7 | 1 | 42.8857 | 11.9500 | 176.000 |
8 | 1 | 46.9992 | 10.8500 | 173.099 |
9 | 1 | 39.4420 | 13.0800 | 174.000 |
10 | 1 | 60.0550 | 8.6300 | 170.000 |
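The ROUND= option can be used to match the imputed values to the precision of the observed data. In the following sketch the rounding units (0.001 for Oxygen, 0.01 for RunTime, and 1 for RunPulse) are inferred from the observed values in the Fitness1 data lines, and the output data set name outmi_round is hypothetical. The units are assumed to apply to the variables in the order given in the VAR statement.

```sas
/* Sketch: impute with values rounded to the observed precision.   */
proc mi data=Fitness1 seed=501213 round=.001 .01 1 out=outmi_round;
   mcmc;
   var Oxygen RunTime RunPulse;
run;

proc print data=outmi_round (obs=10);
   title 'First 10 Observations With Rounded Imputed Values';
run;
```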