Suppose that the previous student sample is actually drawn using a stratified sample design. The strata are grades in the junior high school: 7, 8, and 9. Within strata, simple random samples are selected. Table 94.1 provides the number of students in each grade.
Table 94.1: Students in Grades
Grade | Number of Students |
---|---|
7 | 1,824 |
8 | 1,025 |
9 | 1,151 |
Total | 4,000 |
In order to analyze this sample by using PROC SURVEYREG, you need to input the stratification information by creating a SAS data set with the information in Table 94.1. The following SAS statements create such a data set called StudentTotals:
    data StudentTotals;
       input Grade _TOTAL_;
       datalines;
    7 1824
    8 1025
    9 1151
    ;
The variable Grade is the stratification variable, and the variable _TOTAL_ contains the total numbers of students in each stratum in the survey population. PROC SURVEYREG requires you to use the keyword _TOTAL_ as the name of the variable that contains the population total information.
In a stratified sample design, when the sampling rates in the strata are unequal, you need to use sampling weights to reflect this information. For this example, the appropriate sampling weights are the reciprocals of the probabilities of selection. You can use the following DATA step to create the sampling weights:
    data IceCream;
       set IceCream;
       if Grade=7 then Prob=20/1824;
       if Grade=8 then Prob=9/1025;
       if Grade=9 then Prob=11/1151;
       Weight=1/Prob;
    run;
If you use PROC SURVEYSELECT to select your sample, PROC SURVEYSELECT creates these sampling weights for you.
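As a cross-check outside SAS, the weight arithmetic in the DATA step above can be sketched in Python. This is a minimal illustration, not part of the SAS workflow: the names `population`, `sample_size`, and `sampling_weights` are hypothetical, the stratum totals come from Table 94.1, and the sample sizes (20, 9, 11) are the numerators used in the DATA step.

```python
from fractions import Fraction

# Stratum population totals (Table 94.1) and stratum sample sizes
# (the numerators of Prob in the DATA step). Keys are grades.
population = {7: 1824, 8: 1025, 9: 1151}
sample_size = {7: 20, 8: 9, 9: 11}


def sampling_weights(sample_size, population):
    """Return {stratum: weight}, where weight = 1 / (n_h / N_h)."""
    return {g: 1 / Fraction(sample_size[g], population[g]) for g in population}


weights = sampling_weights(sample_size, population)

# Every sampled student in a stratum carries that stratum's weight,
# so the weights summed over the whole sample are n_h * w_h added up.
sum_of_weights = sum(sample_size[g] * weights[g] for g in population)

for g in sorted(weights):
    print(f"Grade {g}: weight = {float(weights[g]):.4f}")
print("Sum of weights:", float(sum_of_weights))
```

Exact fractions are used so that the total comes out to exactly 4000, the population size in Table 94.1, with no floating-point rounding.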
The following statements demonstrate how you can fit a linear model while incorporating the sample design information (stratification):
    title1 'Ice Cream Spending Analysis';
    title2 'Stratified Sample Design';
    proc surveyreg data=IceCream total=StudentTotals;
       strata Grade / list;
       class Kids;
       model Spending = Income Kids / solution;
       weight Weight;
    run;
Comparing these statements to those in the section Simple Random Sampling, you can see how the TOTAL=StudentTotals option replaces the previous TOTAL=4000 option.
The STRATA statement specifies the stratification variable Grade. The LIST option in the STRATA statement requests that the stratification information be included in the output. The WEIGHT statement specifies the weight variable.
Figure 94.4 summarizes the data information, the sample design information, and the fit information. Because of the stratification, the denominator degrees of freedom for the F tests and t tests are 37 (the 40 observations minus the 3 strata), which differs from the analysis in Figure 94.1.
Figure 94.4: Summary of the Regression
Ice Cream Spending Analysis |
Stratified Sample Design |
Data Summary | |
---|---|
Number of Observations | 40 |
Sum of Weights | 4000.0 |
Weighted Mean of Spending | 9.14130 |
Weighted Sum of Spending | 36565.2 |
Design Summary | |
---|---|
Number of Strata | 3 |
Fit Statistics | |
---|---|
R-square | 0.8219 |
Root MSE | 2.4185 |
Denominator DF | 37 |
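The summary values in Figure 94.4 are related by simple arithmetic, which the following Python sketch illustrates. It assumes the default rule for a stratified design without clustering, in which the denominator degrees of freedom equal the number of observations minus the number of strata. The variable names are hypothetical, and the inputs are copied from the figure.

```python
# Values copied from Figure 94.4.
n_obs = 40
n_strata = 3
sum_of_weights = 4000.0
weighted_sum_spending = 36565.2

# Assumed default rule for a stratified, unclustered design:
# denominator DF = number of observations - number of strata.
denominator_df = n_obs - n_strata

# The weighted mean is the weighted sum divided by the sum of weights.
weighted_mean_spending = weighted_sum_spending / sum_of_weights

print("Denominator DF:", denominator_df)
print(f"Weighted mean of Spending: {weighted_mean_spending:.5f}")
```

Both results agree with the Denominator DF and Weighted Mean of Spending entries in the figure.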
For each stratum, Figure 94.5 displays the values of the identifying variables, the number of observations (sample size), the total population size, and the calculated sampling rate or fraction.
Figure 94.5: Stratification and Classification Information
Stratum Information | ||||
---|---|---|---|---|
Stratum Index | Grade | N Obs | Population Total | Sampling Rate |
1 | 7 | 20 | 1824 | 1.10% |
2 | 8 | 9 | 1025 | 0.88% |
3 | 9 | 11 | 1151 | 0.96% |
Class Level Information | ||
---|---|---|
Class Variable | Levels | Values |
Kids | 4 | 1 2 3 4 |
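The Sampling Rate column in the stratum table is the stratum sample size divided by the stratum population total. The Python sketch below reproduces that column as a sanity check. The helper name `sampling_rate_percent` and the dictionaries are hypothetical, and the two-decimal percentage format is chosen only to match the display in Figure 94.5.

```python
# N Obs and Population Total columns from Figure 94.5, keyed by grade.
sample_size = {7: 20, 8: 9, 9: 11}
population = {7: 1824, 8: 1025, 9: 1151}


def sampling_rate_percent(n, total):
    """Format the sampling fraction n / total as a percentage with two decimals."""
    return f"{100 * n / total:.2f}%"


rates = {g: sampling_rate_percent(sample_size[g], population[g]) for g in population}

for g in sorted(rates):
    print(f"Grade {g}: {rates[g]}")
```

The rates differ across grades, which is why the analysis needs sampling weights rather than treating the data as an equal-probability sample.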
Figure 94.6 displays the tests for the significance of model effects under the stratified sample design. The Income effect is strongly significant (p < 0.0001), while the Kids effect is not significant at the 5% level (p = 0.4081).
Figure 94.6: Testing Effects
Tests of Model Effects | |||
---|---|---|---|
Effect | Num DF | F Value | Pr > F |
Model | 4 | 124.85 | <.0001 |
Intercept | 1 | 150.95 | <.0001 |
Income | 1 | 326.89 | <.0001 |
Kids | 3 | 0.99 | 0.4081 |
Note: | The denominator degrees of freedom for the F tests is 37. |
The regression coefficient estimates for the stratified sample, along with their standard errors and associated t tests, are displayed in Figure 94.7.
Figure 94.7: Regression Coefficients
Estimated Regression Coefficients | ||||
---|---|---|---|---|
Parameter | Estimate | Standard Error | t Value | Pr > |t| |
Intercept | -26.086882 | 2.44108058 | -10.69 | <.0001 |
Income | 0.776699 | 0.04295904 | 18.08 | <.0001 |
Kids 1 | 0.888631 | 1.07000634 | 0.83 | 0.4116 |
Kids 2 | 1.545726 | 1.20815863 | 1.28 | 0.2087 |
Kids 3 | -0.526817 | 1.32748011 | -0.40 | 0.6938 |
Kids 4 | 0.000000 | 0.00000000 | . | . |
Note: | The denominator degrees of freedom for the t tests is 37. Matrix X'WX is singular and a generalized inverse was used to solve the normal equations. Estimates are not unique. |
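Each t value in Figure 94.7 is the estimate divided by its standard error. The Python sketch below recomputes that column from the printed estimates and standard errors. The dictionary and variable names are hypothetical. The last step relies on the general fact that a single-degree-of-freedom F statistic equals the square of the corresponding t statistic, applied here only to the Income effect.

```python
# (estimate, standard error) pairs copied from Figure 94.7.
# Kids 4 is the reference level (estimate 0, no test), so it is omitted.
coefficients = {
    "Intercept": (-26.086882, 2.44108058),
    "Income": (0.776699, 0.04295904),
    "Kids 1": (0.888631, 1.07000634),
    "Kids 2": (1.545726, 1.20815863),
    "Kids 3": (-0.526817, 1.32748011),
}

# t value = estimate / standard error.
t_values = {name: est / se for name, (est, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<10} t = {t:6.2f}")

# For a single-DF effect such as Income, F is the square of t.
income_f = t_values["Income"] ** 2
print(f"Income F = {income_f:.2f}")
```

Rounded to two decimals, the results match the t Value column in Figure 94.7, and the squared Income t value matches the F value for Income in Figure 94.6.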
You can request other statistics and tests by using PROC SURVEYREG. You can also analyze data from a more complex sample design. The remainder of this chapter provides more detailed information.