You can specify a STRATA statement to request stratified sampling. The STRATA statement names one or more variables that partition the input data set into nonoverlapping groups (strata). The combinations of levels of the STRATA variables define the strata. PROC SURVEYSELECT selects samples independently from the strata according to the selection method and design parameters that you specify in the PROC SURVEYSELECT statement. For information about stratification in sample design, see Lohr (2010), Kalton (1983), Kish (1965), Kish (1987), and Cochran (1977).
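As a minimal sketch of the statement in use, the following step draws a simple random sample of 10 units from each stratum. The data set work.Customers, the variable Region, the seed, and the sample size are illustrative assumptions rather than values from this documentation, and the input is assumed to be sorted by Region already.

```sas
/* Hypothetical frame: work.Customers, one row per sampling unit, */
/* assumed to be sorted by the stratification variable Region.    */
proc surveyselect data=work.Customers
                  method=srs        /* simple random sampling within each stratum */
                  sampsize=10       /* 10 units are requested from every stratum  */
                  seed=20240        /* arbitrary seed so the draw can be repeated */
                  out=work.RegionSample;
   strata Region;                   /* each level of Region forms one stratum     */
run;
```

Because the sample is selected independently in each stratum, every stratum in this sketch should contain at least 10 sampling units.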
The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric, but PROC SURVEYSELECT treats them as categorical variables. The formatted values of the STRATA variables determine the STRATA variable levels. Thus, you can use formats to group values into levels. For more information, see the FORMAT procedure in the Base SAS Procedures Guide and the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.
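One way to group values into strata with a format might look like the following sketch. The format name agegrp, the data set work.Members, the variable Age, and the cutoffs are all hypothetical; the use of a FORMAT statement inside the procedure step is assumed to behave as it does in other SAS procedures.

```sas
/* Hypothetical format that maps a numeric Age into three levels */
proc format;
   value agegrp low  -< 30   = 'Under 30'
                30   -< 60   = '30 to 59'
                60   -  high = '60 and over';
run;

/* Sorting by the raw Age values also keeps each formatted level contiguous */
proc sort data=work.Members out=work.MembersSorted;
   by Age;
run;

proc surveyselect data=work.MembersSorted method=srs sampsize=5
                  seed=1234 out=work.AgeSample;
   strata Age;            /* strata are the formatted levels, not the raw ages */
   format Age agegrp.;    /* associates the grouping format with Age           */
run;
```

With this format in effect, the step would treat the three age groups as the strata and request 5 units from each.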
The STRATA variables function much like BY variables, and PROC SURVEYSELECT expects the input data set to be sorted by the STRATA variables. The BY statement options DESCENDING and NOTSORTED are available in the STRATA statement. For more information about these BY statement options, see SAS Language Reference: Concepts.
If you specify a CONTROL statement or METHOD=PPS in the PROC SURVEYSELECT statement, the input data set must be sorted by the STRATA variables in ascending order. In this case, you cannot specify the NOTSORTED or DESCENDING option in the STRATA statement.
If your input data set is not sorted by the STRATA variables, use one of the following alternatives:
- Sort the data by using the SORT procedure with the STRATA variables in a BY statement.
- Specify the NOTSORTED or DESCENDING option in the STRATA statement (if you do not specify a CONTROL statement or METHOD=PPS in the PROC SURVEYSELECT statement). The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the STRATA variables) and that these groups are not necessarily in alphabetical or increasing numeric order.
- Create an index on the STRATA variables by using the DATASETS procedure (in Base SAS software).
For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
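The three alternatives for unsorted input can be sketched as follows. Only one of them is needed in practice. The data sets work.Frame and work.FrameGrouped, the variables State and Type, and the index name StateType are hypothetical, and the placement of NOTSORTED after the variable names is an assumption that follows BY statement syntax.

```sas
/* Alternative 1: sort the hypothetical frame by the STRATA variables */
proc sort data=work.Frame out=work.FrameSorted;
   by State Type;
run;

/* Alternative 2: the data are already grouped by State and Type, but the */
/* groups are not in ascending order. This is not available together     */
/* with a CONTROL statement or METHOD=PPS.                                */
proc surveyselect data=work.FrameGrouped method=srs sampsize=5
                  seed=1234 out=work.GroupedSample;
   strata State Type notsorted;
run;

/* Alternative 3: create a composite index on the STRATA variables */
proc datasets library=work nolist;
   modify Frame;
   index create StateType=(State Type);
quit;
```

After the first or third alternative, a PROC SURVEYSELECT step can name State and Type in the STRATA statement without the NOTSORTED option.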
Table 102.2 summarizes the options available in the STRATA statement. Descriptions of the options follow in alphabetical order.
Table 102.2: STRATA Statement Options for Sample Allocation
Option | Description
ALLOC=name | Specifies the allocation method
ALLOC=(values) | Provides allocation proportions
ALLOCMIN= | Specifies the minimum sample size per stratum
ALPHA= | Specifies the confidence level for the MARGIN= option
COST= | Provides stratum costs
MARGIN= | Specifies the margin of error
NOSAMPLE | Allocates but does not select the sample
STATS | Displays additional allocation statistics
VAR= | Provides stratum variances
You can specify the following options in the STRATA statement after a slash (/):
-
ALLOC=name | (values) | SAS-data-set
-
specifies the allocation method name, or specifies the stratum allocation proportions as a list of values or as a SAS-data-set. You can use the ALLOC= option with any selection method (which you specify in the PROC SURVEYSELECT statement) except METHOD=PPS_BREWER and METHOD=PPS_MURTHY, each of which selects two units from each stratum.
You can specify the sample size allocation by using one of the following forms:
-
ALLOC=name
-
specifies the method for allocating the total sample size among the strata. You can specify one of the following values for name:
-
NEYMAN
-
requests Neyman allocation, which allocates the total sample size among the strata in proportion to the stratum sizes and variances. For more information, see the section Neyman Allocation. If you specify ALLOC=NEYMAN, you must provide the stratum variances by also specifying the VAR= option.
-
OPTIMAL | OPT
-
requests optimal allocation, which allocates the total sample size among the strata in proportion to the stratum sizes, stratum variances, and stratum costs. For more information, see the section Optimal Allocation. If you specify ALLOC=OPTIMAL, you must provide the stratum variances by also specifying the VAR= option, and you must provide the stratum costs by also specifying the COST= option.
-
PROPORTIONAL | PROP
-
requests proportional allocation, which allocates the total sample size in proportion to the stratum sizes, where stratum size is the number of sampling units in the stratum. For more information, see the section Proportional Allocation.
-
ALLOC=(values)
-
specifies a list of stratum allocation proportion values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.
A stratum allocation proportion specifies the proportion of the total sample size to allocate to the stratum. The sum of the allocation proportions must be 1 or 100%.
The allocation proportions must be positive numbers. You can specify the proportion values as numbers between 0 and 1. Alternatively, you can specify the values in percentage form (as numbers between 1 and 100), and PROC SURVEYSELECT converts the numbers to proportions. PROC SURVEYSELECT treats the value 1 as 100% instead of 1%.
The order of the stratum allocation proportions must match the order of the stratum groups in the DATA= input data set. When you specify a list of proportion values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
-
ALLOC=SAS-data-set
-
names a SAS-data-set that contains stratum allocation proportions. You should provide the stratum allocation proportions in the data set variable named _ALLOC_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
A stratum allocation proportion specifies the proportion of the total sample size to allocate to the corresponding stratum. The sum of the allocation proportions must be 1 or 100%.
The allocation proportions must be positive numbers. You can specify the proportion values as numbers between 0 and 1. Alternatively, you can specify the values in percentage form (as numbers between 1 and 100), and PROC SURVEYSELECT converts the numbers to proportions. PROC SURVEYSELECT treats the value 1 as 100% instead of 1%.
The ALLOC= data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the ALLOC= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent between the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
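The three forms of the ALLOC= option might be used as in the following sketch. It assumes a hypothetical frame work.Frame that is sorted in ascending order by a character variable Region with exactly three levels (East, North, West); the total sample size, seed, and proportions are illustrative only.

```sas
/* Form 1: allocation method by name (proportional allocation of 200 units) */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.PropSample;
   strata Region / alloc=prop;
run;

/* Form 2: explicit proportions, listed in the ascending order of Region */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.ListSample;
   strata Region / alloc=(0.5 0.3 0.2);
run;

/* Form 3: proportions read from a secondary data set in the variable _ALLOC_ */
data work.AllocProps;
   input Region $ _ALLOC_;
   datalines;
East 0.5
North 0.3
West 0.2
;
run;

proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.DataSetSample;
   strata Region / alloc=work.AllocProps;
run;
```

In the second and third forms the proportions sum to 1, and the stratum groups appear in the same order as in the assumed frame.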
-
ALLOCMIN=n
-
specifies the minimum sample size to allocate to a stratum. If you specify ALLOCMIN=n, PROC SURVEYSELECT allocates at least n sampling units to each stratum.
The minimum stratum sample size n must be a positive integer. The value of n times the number of strata must not exceed the total sample size to be allocated. For without-replacement selection methods, the value of n must not exceed the number of sampling units in any stratum.
By default, PROC SURVEYSELECT allocates at least one sampling unit to each stratum.
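A minimal sketch of the ALLOCMIN= option follows, assuming the same kind of hypothetical frame (work.Frame, sorted by Region, three strata, each with at least 10 sampling units). The minimum of 10 and the total of 200 are illustrative values.

```sas
/* Proportional allocation of 200 units, but no stratum receives fewer than 10. */
/* With three assumed strata, 10 x 3 = 30 does not exceed the total of 200.     */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.MinSample;
   strata Region / alloc=prop allocmin=10;
run;
```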
-
ALPHA=α
-
specifies the confidence level that PROC SURVEYSELECT uses in the MARGIN= computations. For more information, see the section Specifying the Margin of Error.
The value of α must be between 0 and 1; a confidence level of α produces a 100(1 - α)% confidence interval. By default, ALPHA=0.05, which produces a 95% confidence interval.
-
COST < =(values) | SAS-data-set >
-
specifies the stratum-level costs that PROC SURVEYSELECT uses to compute optimal allocation when you specify ALLOC=OPTIMAL. For more information, see the section Optimal Allocation. The stratum costs must be positive numbers. A stratum cost represents the per-unit cost, which is the survey cost of a single unit in the stratum.
You can provide stratum costs by specifying one of the following forms:
-
COST
-
indicates that stratum costs are provided in a secondary input data set that you name in another option (for example, the VAR=SAS-data-set option). You should provide the stratum costs in the data set variable named _COST_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
COST=(values)
-
specifies a list of stratum cost values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.
The order of the stratum cost values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
-
COST=SAS-data-set
-
names a SAS-data-set that contains the stratum costs. You should provide the stratum costs in the data set variable named _COST_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the COST= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
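The following sketch shows two ways of supplying stratum costs for optimal allocation. The frame work.Frame (sorted by a three-level character variable Region), the data set work.StratumInfo, and all variance and cost values are hypothetical.

```sas
/* Costs and variances as lists, in the ascending order of Region */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.OptListSample;
   strata Region / alloc=optimal var=(25 100 64) cost=(4 9 16);
run;

/* One secondary data set that holds both _VAR_ and _COST_ */
data work.StratumInfo;
   input Region $ _VAR_ _COST_;
   datalines;
East 25 4
North 100 9
West 64 16
;
run;

/* COST= names the secondary data set; the bare VAR keyword indicates that */
/* the variances are read from _VAR_ in that same data set.                */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.OptDataSample;
   strata Region / alloc=optimal cost=work.StratumInfo var;
run;
```

Keeping both variables in one data set is a design choice that respects the limit of one secondary input data set per invocation.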
-
MARGIN=value
-
specifies the desired margin of error for estimating the overall mean from the stratified sample. When you specify this option, PROC SURVEYSELECT determines the stratum sample sizes that achieve the margin value by using the allocation method or proportions that you specify in the ALLOC= option. For more information, see the section Specifying the Margin of Error.
The value must be a positive number. When you specify this option, you must also provide the stratum variances in the VAR= option.
You can use the ALPHA= option to specify the confidence level for the MARGIN= computations. By default, ALPHA=0.05, which produces a 95% confidence interval.
You can specify the MARGIN= option with any allocation method (proportional, optimal, or Neyman) or with allocation proportions (ALLOC=(values) or ALLOC=SAS-data-set).
Allocation to achieve a specified margin is an alternative approach to the allocation of a specified total sample size. Therefore, when you specify the MARGIN= option, you cannot also specify a total sample size in the SAMPSIZE= option in the PROC SURVEYSELECT statement.
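As a sketch of margin-based allocation, the following step asks for stratum sample sizes that achieve a margin of error of 0.5 at ALPHA=0.1 under Neyman allocation. The frame work.Frame (sorted by a three-level variable Region), the variances, and the margin are hypothetical values chosen for illustration.

```sas
/* No SAMPSIZE= option appears: the total sample size is determined by the margin. */
proc surveyselect data=work.Frame method=srs
                  seed=4321 out=work.MarginSample;
   strata Region / alloc=neyman
                   var=(25 100 64)   /* assumed stratum variances              */
                   margin=0.5        /* desired margin of error for the mean   */
                   alpha=0.1;        /* corresponds to 90% confidence          */
run;
```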
-
NOSAMPLE
-
requests that PROC SURVEYSELECT not select a sample after computing the allocation. When you specify this option, the OUT= output data set contains the stratum sample sizes that PROC SURVEYSELECT computes. For more information, see the section Allocation Output Data Set. (By default, PROC SURVEYSELECT selects a sample after computing the allocation.)
-
STATS
-
displays sample allocation statistics. When you specify the MARGIN= option, the STATS option displays the expected margin of error for the allocation. For more information, see the section Specifying the Margin of Error. When you specify ALLOC=OPTIMAL or ALLOC=NEYMAN but do not specify the MARGIN= option, the STATS option displays the expected variance, which is computed from the stratum variances that you provide and the allocated stratum sample sizes. When you specify ALLOC=OPTIMAL, the STATS option also displays the total stratum-level cost, which is computed from the stratum costs that you provide and the allocated stratum sample sizes.
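The NOSAMPLE and STATS options can be combined to inspect an allocation before any units are drawn, as in this sketch. The frame work.Frame (sorted by a three-level variable Region), the variances, and the total sample size are assumptions for illustration.

```sas
/* Compute a Neyman allocation of 200 units without selecting a sample. */
/* STATS requests the additional allocation statistics in the output.   */
proc surveyselect data=work.Frame method=srs sampsize=200
                  out=work.AllocOnly;
   strata Region / alloc=neyman var=(25 100 64) nosample stats;
run;

/* With NOSAMPLE, the OUT= data set holds the computed stratum sample sizes */
proc print data=work.AllocOnly;
run;
```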
-
VAR < =(values) | SAS-data-set >
-
specifies the stratum variances that PROC SURVEYSELECT uses to compute optimal allocation (ALLOC=OPTIMAL), Neyman allocation (ALLOC=NEYMAN), or allocation for a specified margin (MARGIN=). The stratum variances must be positive numbers.
You can provide stratum variances by specifying one of the following forms:
-
VAR
-
indicates that stratum variances are provided in a secondary input data set that you name in another option (for example, the COST=SAS-data-set option). You should provide the stratum variances in the data set variable named _VAR_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
VAR=(values)
-
specifies a list of stratum variance values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.
The order of the stratum variance values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
-
VAR=SAS-data-set
-
names a SAS-data-set that contains the stratum variances. You should provide the stratum variances in the data set variable named _VAR_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the VAR= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
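A sketch of the VAR=SAS-data-set form with Neyman allocation follows. The data set work.StratumVars, the frame work.Frame (sorted by a three-level character variable Region), and the variance values are hypothetical; in practice the variances might come from a pilot study or an earlier survey.

```sas
/* Secondary input data set: one observation per stratum, variances in _VAR_ */
data work.StratumVars;
   input Region $ _VAR_;
   datalines;
East 25
North 100
West 64
;
run;

/* Neyman allocation of 200 units that uses the variances from the data set */
proc surveyselect data=work.Frame method=srs sampsize=200
                  seed=4321 out=work.NeymanSample;
   strata Region / alloc=neyman var=work.StratumVars;
run;
```

The stratum groups in work.StratumVars are listed in the same ascending order as in the assumed frame, and Region is assumed to have the same type and formatted values in both data sets.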