-
CERTSIZE < =value | SAS-data-set >
-
specifies the certainty size measure that PROC SURVEYSELECT uses to identify units that are selected with certainty. You can
provide a single certainty value for the entire sample selection, or you can provide stratum-level certainty values by specifying a SAS-data-set. The certainty size values must be positive numbers.
You can use the SIZE statement to provide size measures for the sampling units. PROC SURVEYSELECT selects with certainty all sampling units whose size measures are greater than or equal to the certainty size value. After removing the certainty units, the procedure selects the remainder of the sample by using the method that you specify in the METHOD= option. The OUT= output data set contains a variable named Certain that identifies units that are selected with certainty. The selection probability of each certainty unit is one.
This option is available for the following PPS selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, METHOD=PPS_SYS, and METHOD=PPS_WR. The CERTSIZE= option is not available when you specify a SAMPLINGUNIT statement.
You can provide certainty size values by specifying one of the following forms:
-
CERTSIZE
-
indicates that certainty size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). This data set should include a variable named _CERTSIZE_ that contains the certainty values. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
CERTSIZE=value
-
specifies a single certainty size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA
statement, PROC SURVEYSELECT uses the certainty value to determine certainty selections for all strata.
-
CERTSIZE=SAS-data-set
-
names a SAS-data-set that contains stratum-level certainty size values. You should provide the certainty values in the data set variable named _CERTSIZE_. Each observation in this data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
data set. The order of the stratum groups in the CERTSIZE= data set must match the order of the groups in the DATA= data
set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information,
see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
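The threshold rule for CERTSIZE=value and CERTSIZE=SAS-data-set can be sketched in Python. This is only an illustration of the rule as described above, not the procedure's implementation; the function name, the tuple layout of the frame, and the use of a dictionary in place of a secondary input data set are assumptions made for the example.

```python
def split_certainty_units(units, certsize):
    """Split units into (certain, remaining) by a certainty size threshold.

    units    : list of (unit_id, stratum, size) tuples (hypothetical layout)
    certsize : one positive number for all strata, or a dict that maps each
               stratum to its own certainty size (stands in for _CERTSIZE_)
    """
    certain, remaining = [], []
    for unit_id, stratum, size in units:
        # Use the stratum-level value when a mapping is supplied.
        threshold = certsize[stratum] if isinstance(certsize, dict) else certsize
        if threshold <= 0:
            raise ValueError("certainty size values must be positive")
        # Units at or above the threshold are selected with certainty.
        if size >= threshold:
            certain.append(unit_id)
        else:
            remaining.append(unit_id)
    return certain, remaining


frame = [("a", "East", 120), ("b", "East", 40), ("c", "West", 75), ("d", "West", 10)]
print(split_certainty_units(frame, 100))                        # single value
print(split_certainty_units(frame, {"East": 100, "West": 50}))  # stratum-level values
```

The units returned in the second list are the ones that would then be passed to the selection method named in the METHOD= option.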
-
CERTSIZE=P < =p | SAS-data-set >
-
specifies the certainty proportion that PROC SURVEYSELECT uses for iterative certainty selection. You can provide a single certainty proportion p for the entire sample, or you can provide stratum-level certainty proportions by specifying a SAS-data-set.
The certainty proportions must be positive numbers. You can specify a certainty proportion as a number between 0 and 1, or in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.
You can use the SIZE
statement to provide size measures for the sampling units. PROC SURVEYSELECT computes the certainty size as the certainty
proportion p of the total size for all units. The procedure selects with certainty the sampling units whose size measures are greater
than or equal to the certainty size. After removing these certainty units from consideration, the procedure computes a new
certainty size as the certainty proportion of the total size of the remaining units and again identifies certainty units.
PROC SURVEYSELECT repeats this process until no more certainty units are selected. After certainty selection is complete,
the remainder of the sample is selected by using the method that you specify in the METHOD=
option. The OUT=
output data set contains a variable named Certain
that identifies units that are selected with certainty. The selection probability of each certainty unit is one.
This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. This option is not available when you specify a SAMPLINGUNIT statement.
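The iterative certainty selection described above can be sketched in Python as follows. This is a minimal illustration of the stated rule for a single stratum, not the procedure's code; the function names and the use of list positions as unit identifiers are assumptions.

```python
def to_proportion(p):
    """Convert a certainty proportion given as a proportion or a percentage."""
    if not 0 < p <= 100:
        raise ValueError("certainty proportion must be positive and at most 100")
    # Values up to 1 are already proportions, so 1 means 100% and not 1%.
    # Larger values are read as percentages.
    return p if p <= 1 else p / 100.0


def iterative_certainty(sizes, p):
    """Return (certain, remaining) unit positions for one stratum."""
    prop = to_proportion(p)
    remaining = dict(enumerate(sizes))
    certain = []
    while remaining:
        # The certainty size is recomputed from the units still in play.
        threshold = prop * sum(remaining.values())
        hits = [i for i, size in remaining.items() if size >= threshold]
        if not hits:
            break  # no more certainty units: stop iterating
        certain.extend(hits)
        for i in hits:
            del remaining[i]
    return certain, sorted(remaining)


print(iterative_certainty([60, 20, 8, 6, 6], 50))
```

In this example the first pass uses a certainty size of 50 (half of 100) and takes the unit of size 60. The second pass uses 20 (half of 40) and takes the unit of size 20. The third pass uses 10 (half of 20), finds no unit that large, and stops.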
You can provide certainty size proportions by specifying one of the following forms:
-
CERTSIZE=P
-
indicates that certainty size proportions are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the certainty proportions in the data set variable named _CERTP_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
CERTSIZE=P=p
-
specifies a single certainty size proportion p, which must be a positive number. If you request a stratified sample design by specifying the STRATA
statement, PROC SURVEYSELECT uses the certainty proportion p to determine certainty selections for all strata.
-
CERTSIZE=P=SAS-data-set
-
names a SAS-data-set that contains stratum-level certainty size proportions. You should provide the certainty proportions in the data set variable named _CERTP_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
input data set. The order of the stratum groups in the CERTSIZE=P= data set must match the order of the groups in the DATA=
data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more
information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
CERTUNITS=NOPRINT | OUTPUT
-
controls the display and output of information about certainty selection. This option is available when you specify the CERTSIZE=
or CERTSIZE=P=
option. CERTUNITS=NOPRINT suppresses display of the number of certainty units in the "Sample Selection Summary" table. For
more information, see the section Displayed Output. CERTUNITS=OUTPUT includes the number of certainty units in the output data set. For more information about the contents
of the output data set, see the section Sample Output Data Set.
-
DATA=SAS-data-set
-
names the SAS-data-set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created
SAS data set. In sampling terminology, the input data set is the sampling frame (the list of units from which the sample is selected).
By default, the procedure uses input data set observations as sampling units and selects a sample of these units. Alternatively,
you can use the SAMPLINGUNIT
statement to define sampling units as groups of observations (clusters).
-
GROUPS=n | (values)
-
requests random assignment of the observations in the input data set to groups. You can specify the total number of groups
as n, which must be a positive integer. Or you can provide a list of group size values, which are positive integers that specify the number of observations in the groups. When you use a STRATA
statement, PROC SURVEYSELECT performs the specified random assignment independently in each stratum. Otherwise, the procedure
performs the random assignment for the entire data set.
When you specify GROUPS=n, PROC SURVEYSELECT randomly assigns the observations in the data set (or stratum) to n groups. The number of observations in each group is equal, or as nearly equal as possible. For example, if the data set contains
100 observations and you specify GROUPS=3, PROC SURVEYSELECT creates three groups that contain 33, 33, and 34 observations.
This is equivalent to specifying GROUPS=(33, 33, 34).
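The group-size arithmetic for GROUPS=n can be sketched in Python. This is an illustration only: the function names are hypothetical, placing the larger groups last is an assumption chosen to match the 33, 33, 34 example, and numbering the groups from 1 is also an assumption.

```python
import random


def group_sizes(n_obs, n_groups):
    """Sizes that are equal, or as nearly equal as possible."""
    base, extra = divmod(n_obs, n_groups)
    # Assumption: the groups that receive one extra observation come last.
    return [base] * (n_groups - extra) + [base + 1] * extra


def assign_groups(n_obs, sizes, seed=None):
    """Randomly assign n_obs observations to groups of the given sizes."""
    if sum(sizes) != n_obs:
        raise ValueError("group sizes must sum to the number of observations")
    # One label per observation, then shuffle the labels.
    labels = [g for g, size in enumerate(sizes, start=1) for _ in range(size)]
    random.Random(seed).shuffle(labels)
    return labels


print(group_sizes(100, 3))  # [33, 33, 34]
group_ids = assign_groups(100, group_sizes(100, 3), seed=1)
```

With a STRATA statement, the same assignment would be repeated independently within each stratum.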
When you specify GROUPS=values, the number of groups is determined by the number of group size values that you list. You can separate the values with blanks
or commas, and you must enclose the list of values in parentheses. The sum of the group size values must equal the total number
of observations in the data set (or in the stratum, if you specify a STRATA
statement).
The OUT=
data set includes a variable named GroupID
that identifies the group assignment of each observation. When you specify the OUTSIZE
option, the output data set includes a variable named GroupSize
that provides the number of units in the group; the output data set also includes the total number of units and the number
of groups (in the data set, or in the stratum if you specify a STRATA
statement). For more information, see the section Random Assignment Output Data Set.
The following options are available when you specify the GROUPS= option: the SEED=, RANUNI, and OUTSEED options, which pertain to random number generation; the REPS= option, which provides independent replicates of the random assignment; the NOPRINT option, which suppresses display of the "Random Assignment" table; and the OUTSIZE option.
The GROUPS= option does not select a sample; you cannot specify sample selection options (for example, METHOD= or SAMPSIZE=) when you use the GROUPS= option. The SAMPLINGUNIT statement is not available when you use the GROUPS= option.
-
JTPROBS
-
includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum.
For information about joint selection probabilities for a particular sampling method, see the method description in the section
Sample Selection Methods. For more information about the contents of the output data set, see the section Sample Output Data Set.
-
MAXSIZE < =value | SAS-data-set >
-
specifies the maximum size measure. You can provide a single maximum value for the entire sample selection, or you can provide stratum-level maximum values by specifying a SAS-data-set. The maximum size values must be positive numbers.
PROC SURVEYSELECT uses the maximum size values to adjust the size measures, which you can provide by specifying the SIZE
statement or by specifying the PPS
option in the SAMPLINGUNIT
statement. When a size measure exceeds the maximum value, the procedure replaces the size measure with the maximum value.
If you use a SAMPLINGUNIT
statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (instead of the observation
sizes). If you specify a SIZE
statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the
SAMPLINGUNIT PPS
option, the size of a sampling unit is the number of observations in the unit.
When you use a SAMPLINGUNIT
statement, the OUT=
data set includes a variable named UnitSize
that contains the adjusted sampling unit size measures. When you do not use a SAMPLINGUNIT
statement, the OUT=
data set includes a variable named AdjustedSize
that contains the adjusted observation size measures.
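The size adjustment can be sketched in Python. This is a minimal illustration of the rules stated above, with hypothetical function names and a hypothetical (cluster, size) layout for the observations; it is not the procedure's implementation.

```python
def adjust_to_maxsize(sizes, maxsize):
    """Replace any size measure that exceeds maxsize with maxsize."""
    if maxsize <= 0:
        raise ValueError("maximum size values must be positive")
    return [min(size, maxsize) for size in sizes]


def unit_sizes(observations, count_observations=False):
    """Sampling unit (cluster) sizes from (cluster_id, size) observations.

    count_observations=False: sum the observation size measures (SIZE statement).
    count_observations=True : count the observations (SAMPLINGUNIT PPS option).
    """
    totals = {}
    for cluster, size in observations:
        totals[cluster] = totals.get(cluster, 0) + (1 if count_observations else size)
    return totals


obs = [("c1", 30), ("c1", 50), ("c2", 20)]
sizes = unit_sizes(obs)  # {'c1': 80, 'c2': 20}
# With clusters, the cap applies to the unit totals, not to each observation.
adjusted = dict(zip(sizes, adjust_to_maxsize(list(sizes.values()), 60)))
print(adjusted)  # {'c1': 60, 'c2': 20}
```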
You can provide maximum size values by specifying one of the following forms:
-
MAXSIZE
-
indicates that maximum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the maximum size values in the data set variable named _MAXSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
MAXSIZE=value
-
specifies a single maximum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA
statement, PROC SURVEYSELECT uses the value to adjust size measures in all strata.
-
MAXSIZE=SAS-data-set
-
names a SAS-data-set that contains stratum-level maximum size values. You should provide the maximum size values in the data set variable named _MAXSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
data set. The order of the stratum groups in the MAXSIZE= data set must match the order of the groups in the DATA= data set.
If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information,
see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
METHOD=name
M=name
-
specifies the method for sample selection.
If you do not specify the METHOD= option, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you do specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default.
The following values are available for the METHOD= option:
-
BERNOULLI
-
requests Bernoulli sampling, which consists of N independent selection trials, each with constant inclusion probability π, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable.
For more information, see the section Bernoulli Sampling.
When you specify this method, you must provide the sampling rate (inclusion probability π) in the SAMPRATE= option. For stratified sampling (which you request with the STRATA statement), you can specify the same sampling rate for each stratum in the SAMPRATE=value option. Or you can specify different sampling rates for different strata in the SAMPRATE=(values) or SAMPRATE=SAS-data-set option.
Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI
does not use the SAMPSIZE=
option. Also, the ALLOC=
option in the STRATA
statement (which allocates the total sample size among strata) is not available with METHOD=BERNOULLI.
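Bernoulli sampling as described above can be sketched in Python. This is a generic illustration of independent trials with a constant inclusion probability; the function name is hypothetical, and Python's random module stands in for the procedure's random number generator.

```python
import random


def bernoulli_sample(n_units, rate, seed=None):
    """Select each of n_units independently with the same probability."""
    if not 0 <= rate <= 1:
        raise ValueError("sampling rate must be between 0 and 1")
    rng = random.Random(seed)
    # One independent trial per unit; rng.random() lies in [0, 1).
    return [i for i in range(n_units) if rng.random() < rate]


sample = bernoulli_sample(1000, 0.1, seed=42)
print(len(sample))  # varies with the seed; the expected size is 100
```

Because the sample size is the count of successful trials, it changes from run to run, which is why a fixed sample size and sample allocation do not apply to this method.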
-
POISSON
-
requests Poisson sampling. A generalization of Bernoulli sampling, Poisson sampling consists of N independent selection trials with a separate inclusion probability specified for each unit, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable.
For more information, see the section Poisson Sampling.
You must provide inclusion probabilities for Poisson sampling in the SIZE
variable. The probability values should be between 0 and 1. If a value of the SIZE variable is missing, nonpositive, or greater
than 1, PROC SURVEYSELECT omits the observation from sample selection.
Because Poisson sampling is based on specified inclusion probabilities instead of a fixed sample size, you cannot specify
the SAMPSIZE=
option when you specify METHOD=POISSON. You also cannot specify the ALLOC=
option in the STRATA
statement when you specify METHOD=POISSON.
The SAMPLINGUNIT
statement is not available when you specify METHOD=POISSON.
When you specify the SAMPRATE= option for METHOD=POISSON but do not specify a SIZE statement, PROC SURVEYSELECT uses METHOD=BERNOULLI.
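Poisson sampling, including the rule that omits observations with unusable probabilities, can be sketched in Python. This is an illustration under stated assumptions: the function name is hypothetical, None stands in for a missing SIZE value, and Python's random module stands in for the procedure's random number generator.

```python
import random


def poisson_sample(probabilities, seed=None):
    """Return (eligible, selected) positions for unit-level probabilities."""
    rng = random.Random(seed)
    eligible, selected = [], []
    for i, p in enumerate(probabilities):
        # Missing, nonpositive, or greater-than-1 values are omitted
        # from sample selection altogether.
        if p is None or p <= 0 or p > 1:
            continue
        eligible.append(i)
        # One independent trial per unit, each with its own probability.
        if rng.random() < p:
            selected.append(i)
    return eligible, selected


eligible, selected = poisson_sample([0.9, None, 0.0, 1.5, 0.2, 1.0], seed=7)
print(eligible)  # [0, 4, 5]
```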
-
PPS
-
requests selection with probability proportional to size and without replacement. For more information, see the section PPS Sampling without Replacement. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement.
-
PPS_BREWER
BREWER
-
requests selection according to Brewer’s method. Brewer’s method selects two units from each stratum with probability proportional
to size and without replacement. For more information, see the section Brewer’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement. You do not need to specify the sample size in the SAMPSIZE=
option because Brewer’s method selects two units from each stratum.
-
PPS_MURTHY
MURTHY
-
requests selection according to Murthy’s method. Murthy’s method selects two units from each stratum with probability proportional
to size and without replacement. For more information, see the section Murthy’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement. You do not need to specify the sample size in the SAMPSIZE=
option because Murthy’s method selects two units from each stratum.
-
PPS_SAMPFORD
SAMPFORD
-
requests selection according to Sampford’s method. Sampford’s method selects units with probability proportional to size and
without replacement. For more information, see the section Sampford’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement.
-
PPS_SEQ
CHROMY
-
requests sequential selection with probability proportional to size and with minimum replacement. This method is also known
as Chromy’s method. For more information, see the section PPS Sequential Sampling. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement.
-
PPS_SYS < (method-options)>
-
requests systematic selection with probability proportional to size. For more information, see the section PPS Systematic Sampling. When you specify this method, you must provide size measures by specifying the SIZE
statement or the PPS
option in the SAMPLINGUNIT
statement.
You can specify the following method-options:
-
DETAILS
-
displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include
strata or replicates. For more information, see the section Displayed Output.
-
INTERVAL=value
-
specifies the interval value for PPS systematic selection. The interval value must be a positive number. It must not exceed the total of the size measures
in the data set (or in each stratum if you specify a STRATA
statement). By default, the systematic interval is the ratio of the size measure total to the sample size (which you provide
in the SAMPSIZE=
option). For more information, see the section PPS Systematic Sampling.
You cannot use the INTERVAL= method-option when you specify a sample size in the SAMPSIZE=
option or when you specify the ALLOC=
option, which allocates the total sample size among strata.
-
START=value
-
specifies the starting value for PPS systematic selection. The starting value must be a positive number that is less than the systematic interval. By
default, PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the
section PPS Systematic Sampling.
When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.
When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.
-
PPS_WR
-
requests selection with probability proportional to size and with replacement. For more information, see the section PPS Sampling with Replacement. When you specify this method, you must name a size measure variable in the SIZE
statement or specify the PPS
option in the SAMPLINGUNIT
statement.
-
SEQ
CHROMY
-
requests sequential selection according to Chromy’s method. If you specify this method and do not specify a SIZE
statement (or the PPS
option in the SAMPLINGUNIT
statement), PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. For more information,
see the section Sequential Random Sampling.
If you specify METHOD=SEQ and also specify a SIZE
statement (or the PPS
option in the SAMPLINGUNIT
statement), PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and
with minimum replacement. For more information, see the section PPS Sequential Sampling.
-
SRS
-
requests simple random sampling, which is selection with equal probability and without replacement. For more information,
see the section Simple Random Sampling. METHOD=SRS is the default selection method if you do not specify the METHOD= option and also do not specify a SIZE
statement (or the PPS
option in the SAMPLINGUNIT
statement).
-
SYS < (method-options)>
-
requests systematic random sampling. If you specify this method and do not specify a SIZE
statement (or the PPS
option in the SAMPLINGUNIT
statement), PROC SURVEYSELECT uses systematic random sampling with equal probability. For more information, see the section
Systematic Random Sampling.
If you specify this method and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic random sampling with probability proportional to size (METHOD=PPS_SYS). For more information, see the section PPS Systematic Sampling.
You can specify the following method-options:
-
DETAILS
-
displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include
strata or replicates. For more information, see the section Displayed Output.
-
INTERVAL=value
-
specifies the interval for systematic random sampling. The interval value must be a positive number and must not exceed the number of sampling units in the data set (or the number of units in each
stratum, if you specify a STRATA
statement).
By default, PROC SURVEYSELECT determines the systematic interval from the sampling rate or sample size that you provide in
the SAMPRATE=
or SAMPSIZE=
option, respectively. When you specify the sampling rate, PROC SURVEYSELECT computes the systematic interval as the inverse
of the sampling rate. When you specify the sample size, the procedure computes the interval as the ratio of the number of
sampling units to the sample size. For more information, see the section Systematic Random Sampling.
You cannot use the INTERVAL= method-option when you specify the SAMPSIZE=
option, the SAMPRATE=
option, or the ALLOC=
option (which allocates the total sample size among strata).
-
START=value
-
specifies the starting value for systematic selection. The starting value must be a positive number that is less than the systematic interval. By default,
PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the section Systematic Random Sampling.
When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.
When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.
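The default interval and the role of the starting value for equal-probability systematic sampling can be sketched in Python. The interval formulas follow the description above; the selection loop is a common textbook fractional-interval scheme and is an assumption here, not necessarily the algorithm that the section Systematic Random Sampling describes. Function names are hypothetical.

```python
import math
import random


def default_interval(n_units, sampsize=None, samprate=None):
    """Interval implied by a sample size or a sampling rate."""
    if (sampsize is None) == (samprate is None):
        raise ValueError("provide exactly one of sampsize or samprate")
    if samprate is not None:
        return 1.0 / samprate   # inverse of the sampling rate
    return n_units / sampsize   # number of units divided by the sample size


def systematic_sample(n_units, interval, start=None, seed=None):
    """Return 1-based positions of the selected units."""
    if not 0 < interval <= n_units:
        raise ValueError("interval must be positive and at most n_units")
    if start is None:
        rng = random.Random(seed)
        start = 0.0
        while start == 0.0:      # draw a random start that is positive
            start = rng.random() * interval
    elif not 0 < start < interval:
        raise ValueError("start must be positive and less than the interval")
    selected, k = [], 0
    # Step through the frame in increments of the interval.
    while start + k * interval <= n_units:
        selected.append(math.ceil(start + k * interval))
        k += 1
    return selected


print(default_interval(100, sampsize=20))        # 5.0
print(systematic_sample(10, 2.5, start=1.0))     # [1, 4, 6, 9]
```

When the start is passed in explicitly, the seed is never used, which mirrors why the random number options have no effect with the START= method-option.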
-
URS
-
requests unrestricted random sampling, which is selection with equal probability and with replacement. For more information,
see the section Unrestricted Random Sampling.
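The defaulting and substitution rules stated in the METHOD= descriptions above can be summarized in a small Python sketch. The function and its arguments are hypothetical and cover only the rules that this section states; has_size means that a SIZE statement or the PPS option in the SAMPLINGUNIT statement is present.

```python
def resolve_method(method=None, has_size=False, has_samprate=False):
    """Return the selection method that is effectively used."""
    if method is None:
        # Default: PPS when size measures are available, otherwise SRS.
        return "PPS" if has_size else "SRS"
    method = method.upper()
    if method == "SEQ" and has_size:
        return "PPS_SEQ"      # sequential PPS with minimum replacement
    if method == "SYS" and has_size:
        return "PPS_SYS"      # systematic PPS
    if method == "POISSON" and has_samprate and not has_size:
        return "BERNOULLI"    # constant rate and no SIZE variable
    return method


print(resolve_method())                                # SRS
print(resolve_method(has_size=True))                   # PPS
print(resolve_method("sys", has_size=True))            # PPS_SYS
print(resolve_method("poisson", has_samprate=True))    # BERNOULLI
```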
-
MINSIZE < =value | SAS-data-set >
-
specifies the minimum size measure. You can provide a single minimum value for the entire sample selection, or you can provide stratum-level minimum values by specifying a SAS-data-set. The minimum size values must be positive numbers.
PROC SURVEYSELECT uses the minimum size values to adjust the size measures, which you provide by specifying the SIZE
statement or by specifying the PPS
option in the SAMPLINGUNIT
statement. When a size measure is less than the minimum value, the procedure replaces the size measure with the minimum value.
If you use a SAMPLINGUNIT
statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (not the observation sizes).
If you specify a SIZE
statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the
SAMPLINGUNIT PPS
option, the size of a sampling unit is the number of observations in the unit.
When you use a SAMPLINGUNIT
statement, the OUT=
data set includes a variable named UnitSize
that contains the adjusted sampling unit size measures. When you do not use a SAMPLINGUNIT
statement, the OUT=
data set includes a variable named AdjustedSize
that contains the adjusted observation size measures.
You can provide minimum size values by specifying one of the following forms:
-
MINSIZE
-
indicates that minimum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the minimum size values in the data set variable named _MINSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
MINSIZE=value
-
specifies a single minimum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA
statement, PROC SURVEYSELECT uses the minimum value to adjust size measures in all strata.
-
MINSIZE=SAS-data-set
-
names a SAS-data-set that contains stratum-level minimum size values. You should provide the minimum size values in the data set variable named _MINSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
input data set. The order of the stratum groups in the MINSIZE= data set must match the order of the groups in the DATA=
input data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets.
For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
NMAX=n
-
specifies the maximum stratum sample size n for the SAMPRATE=
option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total
number of units in the stratum by the specified sampling rate. If this sample size is greater than the value NMAX=n, PROC SURVEYSELECT selects only n units.
The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMAX= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.
-
NMIN=n
-
specifies the minimum stratum sample size n for the SAMPRATE=
option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total
number of units in the stratum by the specified sampling rate. If this sample size is less than the value NMIN=n, PROC SURVEYSELECT selects n units.
The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMIN= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.
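The way NMAX= and NMIN= bound the stratum sample size can be sketched in Python. The function name is hypothetical, and rounding the product up to an integer is an assumption, because this section does not state how the procedure rounds.

```python
import math


def stratum_sample_size(n_units, samprate, nmin=None, nmax=None):
    """Sample size from a sampling rate, bounded by optional limits."""
    # Assumption: a fractional product is rounded up to the next integer.
    n = math.ceil(n_units * samprate)
    if nmax is not None and n > nmax:
        n = nmax   # select only nmax units
    if nmin is not None and n < nmin:
        n = nmin   # select nmin units
    return n


print(stratum_sample_size(200, 0.25))              # 50
print(stratum_sample_size(1000, 0.1, nmax=50))     # 50
print(stratum_sample_size(20, 0.1, nmin=5))        # 5
```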
-
NOPRINT
-
suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. This
option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 20: Using the Output Delivery System.
-
OUT=SAS-data-set
-
names the output data set. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique. If you request sample selection by specifying the METHOD=
option, the output data set contains the observations that are selected for the sample. If you request sample allocation
without sample selection by specifying the ALLOC=
and NOSAMPLE
options in the STRATA
statement, the output data set contains the allocated sample sizes. If you request random assignment by specifying the GROUPS=
option, the output data set contains all observations in the input data set together with their assigned group identification.
When PROC SURVEYSELECT selects a sample, the output data set contains the units that are selected, sample design information, and selection statistics. You can specify options that control the information to include in the output data set. For more information, see the descriptions of the following options: JTPROBS, OUTALL, OUTHITS, OUTSEED, OUTSIZE, and STATS. For more information about the contents of the output data set, see the section Sample Output Data Set.
By default, the sample output data set contains only those units that are selected for the sample. To include all observations
from the input data set in the output data set, use the OUTALL
option.
By default, the sample output data set includes one copy of each selected unit, even when a unit is selected more than once,
which can occur when you use with-replacement or with-minimum-replacement selection methods. For with-replacement or with-minimum-replacement
selection methods, the output data set includes a variable NumberHits
that records the number of hits (selections) for each unit. To include a distinct copy of each selection in the output data
set when the same unit is selected more than once, use the OUTHITS
option.
When you specify the ALLOC=
and NOSAMPLE
options in the STRATA
statement, PROC SURVEYSELECT allocates the total sample size among the strata but does not select a sample. In this case,
the OUT= data set contains the allocated sample sizes. For more information, see the section Allocation Output Data Set.
When you specify the GROUPS=
option, PROC SURVEYSELECT randomly assigns observations to groups; it does not select a sample. In this case, the OUT= data
set contains all observations from the input data set and includes a variable named GroupID
that identifies group assignments. For more information, see the section Random Assignment Output Data Set.
-
OUTALL
-
includes all observations from the DATA=
input data set in the OUT=
output data set. By default, the output data set includes only those units selected for the sample. When you specify the
OUTALL option, the output data set includes all observations from the input data set and also contains a variable that indicates
each observation’s selection status. For an observation that is selected, the value of the variable Selected
is 1; for an observation that is not selected, the value of Selected
is 0. For information about the contents of the output data set, see the section Sample Output Data Set.
The OUTALL option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). The OUTALL option is also available for METHOD=POISSON.
-
OUTHITS
-
includes a distinct copy of each selected unit in the OUT=
output data set when the same sampling unit is selected more than once. By default, the output data set contains a single
copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits
records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits
is m; for example, the output data set contains three copies of a unit that is selected three times (NumberHits
is 3).
A sampling unit can be selected more than once by with-replacement and with-minimum-replacement selection methods, which include METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ. The OUTHITS option is available for these selection methods.
For information about the contents of the output data set, see the section Sample Output Data Set.
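A minimal sketch of the OUTHITS option with unrestricted random sampling follows. The data set names Frame and SampleHits are hypothetical.

```sas
/* With-replacement selection of 10 units from a hypothetical frame. */
/* OUTHITS writes one observation per selection, so a unit that is   */
/* hit three times appears three times in SampleHits.                */
proc surveyselect data=Frame method=urs n=10
                  seed=12345 out=SampleHits outhits;
run;
```

Without OUTHITS, the same request would write one observation per distinct selected unit, and NumberHits would carry the selection count.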
-
OUTSEED
-
includes the initial seed for each stratum in the OUT=
output data set. The variable InitialSeed
contains the stratum initial seeds. For information about the contents of the output data set, see the section Sample Output Data Set.
To reproduce the same sample for any stratum in a subsequent execution of PROC SURVEYSELECT, you can specify the same stratum
initial seed in the SEED=SAS-data-set
option together with the same sample selection parameters. For more information, see the section Random Number Generation.
The "Sample Selection Summary" table displays the initial random number seed for the entire sample selection, which is the
same as the initial seed for the first stratum when the design is stratified. To reproduce the entire sample, you can specify
this same seed value in the SEED=
option, along with the same sample selection parameters.
Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases,
PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option.
To reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1, specify the RANUNI option
with the SEED= option (for the same input data set and sample selection parameters).
-
OUTSIZE
-
includes additional design and sampling frame information in the OUT=
output data set.
If you use a STRATA
statement, the OUTSIZE option provides stratum-level values in the output data set. Otherwise, the OUTSIZE option provides
overall values.
The OUTSIZE option includes the sample size or sampling rate in the output data set, depending on whether you specify the
SAMPSIZE=
option or the SAMPRATE=
option, respectively. For PPS selection methods, the OUTSIZE option includes the total size measure in the output data set.
If you do not provide size measures, or if you specify a SAMPLINGUNIT
statement, the OUTSIZE option includes the total number of sampling units.
If you request size measure adjustment or certainty selection, the OUTSIZE option includes the following information in the
output data set: the minimum size measure if you specify the MINSIZE=
option, the maximum size measure if you specify the MAXSIZE=
option, the certainty size measure if you specify the CERTSIZE=
option, and the certainty proportion if you specify the CERTSIZE=P=
option.
For METHOD=BERNOULLI, the OUTSIZE option includes the following information in the output data set: total number of sampling units, selection
probability (sampling rate), expected sample size, and actual sample size. See the section Bernoulli Sampling for descriptions of these statistics.
For more information about the contents of the output data set, see the section Sample Output Data Set.
If you specify the GROUPS=
option for random assignment, the OUTSIZE option adds the following information to the output data set: total number of units,
number of groups, and number of units in the group. For more information, see the section Random Assignment Output Data Set.
-
OUTSORT=SAS-data-set
-
names an output data set to store the sorted input data set. This option is available when you specify a CONTROL
statement to sort the DATA= input data set for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ).
If you specify CONTROL variables but do not name an output data set in the OUTSORT= option, the sorted data set replaces the
input data set.
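The following sketch shows one way to keep the input data set unchanged while sorting by CONTROL variables. The data set names (Frame, Sample, FrameSorted) and the variables Region and Employees are hypothetical.

```sas
/* Systematic selection after sorting by two hypothetical CONTROL     */
/* variables. OUTSORT= stores the sorted frame in FrameSorted, so the */
/* original Frame data set is not replaced.                           */
proc surveyselect data=Frame method=sys n=50 seed=12345
                  out=Sample outsort=FrameSorted;
   control Region Employees;
run;
```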
-
RANUNI
-
requests uniform random number generation by the method of Fishman and Moore (1982), which PROC SURVEYSELECT used in releases before SAS/STAT 12.1. This is the same random number generator that the RANUNI
function provides.
Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. Developed by Matsumoto
and Nishimura (1998), the Mersenne-Twister random number generator has a very long period and good statistical properties. This is the random
number generator that the RAND function provides for the uniform distribution.
For more information, see the section Random Number Generation. For information about the RANUNI and RAND functions, see
SAS Functions and CALL Routines: Reference.
You can specify the RANUNI option with the SEED=
option to reproduce samples that PROC SURVEYSELECT selects in releases before SAS/STAT 12.1. To reproduce a sample by using
the RANUNI and SEED= options, you must also specify the same input data set and sample selection parameters.
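For example, a request of the following form asks for the earlier generator. The seed value and the data set names are hypothetical; reproducing an older sample assumes that the seed, input data set, and selection parameters match the original run.

```sas
/* Request the RANUNI generator with a previously recorded seed */
proc surveyselect data=Frame method=srs n=100
                  seed=39647 ranuni out=LegacySample;
run;
```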
-
REPS=nreps
-
specifies the number of sample replicates. The value of nreps must be a positive integer.
When you specify the REPS= option, PROC SURVEYSELECT selects nreps independent samples, each with the same sample size or sampling rate and the same sample design that you request. The variable
Replicate
in the OUT= data set contains the sample replicate number.
You can use replicated sampling to provide a simple method of variance estimation for any form of statistic, and also to evaluate
variable nonsampling errors such as interviewer differences. For information about replicated sampling, see Lohr (2010), Wolter (2007), Kish (1965), Kish (1987), and Kalton (1983). You can also use the REPS= option to perform a variety of other resampling and simulation tasks. For more information,
see Cassell (2007).
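A short sketch of replicated sampling follows. The data set names and the analysis variable Income are hypothetical.

```sas
/* Four independent simple random samples of 25 units each */
proc surveyselect data=Frame method=srs n=25 reps=4
                  seed=12345 out=RepSamples;
run;

/* Summarize a hypothetical variable within each replicate */
proc means data=RepSamples mean;
   class Replicate;
   var Income;
run;
```

The spread of the replicate means gives a rough, informal view of sampling variability for the statistic.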
-
SAMPRATE=value | (values)| SAS-data-set
RATE=value | (values)| SAS-data-set
-
specifies the sampling rate, which is the proportion of units to select for the sample. You can provide a single sampling
rate value for the entire sample selection, or you can provide stratum sampling rates by specifying values or a SAS-data-set.
The sampling rate value must be a positive number. The stratum sampling rate values and the stratum sampling rates that you
provide in the SAS-data-set must be nonnegative numbers. You can specify a sampling rate as a proportion between 0 and 1, or in percentage
form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as
100% instead of 1%.
This option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT computes the selection interval as the inverse of the sampling
rate. For more information, see the section Systematic Random Sampling. For Bernoulli sampling (METHOD=BERNOULLI), the procedure uses the sampling rate as the inclusion probability. For more information,
see the section Bernoulli Sampling. For the other equal probability selection methods, PROC SURVEYSELECT converts the sampling rate to the sample size before
selection by multiplying the total number of units in the stratum or data set by the sampling rate and rounding up to the
nearest integer.
You cannot specify both the SAMPRATE= option and the SAMPSIZE=
option.
You can provide sampling rates by specifying one of the following forms:
-
SAMPRATE=value
RATE=value
-
specifies a single sampling rate value, which must be a positive number. If you request a stratified sample design by specifying the STRATA
statement, PROC SURVEYSELECT uses the rate value for all strata.
-
SAMPRATE=(values)
RATE=(values)
-
specifies a list of stratum sampling rate values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of
stratum sampling rate values should equal the number of strata in the input data set.
The order of the stratum sampling rate values must match the order of the stratum groups in the DATA=
input data set. When you specify a list of values, the input data set must be sorted by the STRATA
variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
The stratum sampling rate values must be nonnegative numbers. If you specify a stratum sampling rate of zero, PROC SURVEYSELECT
does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the
stratum that you omit is not included in the sampling frame or represented in the sample.
-
SAMPRATE=SAS-data-set
RATE=SAS-data-set
-
names a SAS-data-set that contains stratum sampling rates. You should provide the sampling rates in the data set variable named _RATE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA
variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
input data set. The order of the stratum groups in the SAMPRATE= data set must match the order of the groups in the DATA=
data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more
information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
The stratum sampling rates must be nonnegative numbers. If you specify a stratum sampling rate of zero, PROC SURVEYSELECT
does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the
stratum that you omit is not included in the sampling frame or represented in the sample.
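The three forms of the SAMPRATE= option can be sketched as follows. All data set names are hypothetical, as is the character stratification variable Region, which is assumed to have exactly three levels (East, North, West) and to be sorted in ascending order in Frame.

```sas
/* Single rate: 10% of the units (SAMPRATE=10 is equivalent) */
proc surveyselect data=Frame method=srs samprate=0.1
                  seed=12345 out=Sample1;
run;

/* List of stratum rates, given in the sort order of Region */
proc surveyselect data=Frame method=srs samprate=(0.10 0.20 0.05)
                  seed=12345 out=Sample2;
   strata Region;
run;

/* Stratum rates from a secondary input data set (variable _RATE_) */
data Rates;
   input Region $ _RATE_;
   datalines;
East 0.10
North 0.20
West 0.05
;

proc surveyselect data=Frame method=srs samprate=Rates
                  seed=12345 out=Sample3;
   strata Region;
run;
```

The list form is compact for a few strata; the data set form is usually easier to maintain when stratum rates come from another step.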
-
SAMPSIZE=n |(values)| SAS-data-set
N=n | (values)| SAS-data-set
-
specifies the sample size, which is the number of units to select for the sample. You can provide a single sample size n for the entire sample selection, or you can provide stratum sample sizes by specifying values or a SAS-data-set.
The value of n must be a positive integer. The stratum sample size values and the stratum sample sizes that you provide in the SAS-data-set must be nonnegative numbers. For selection methods that select without replacement, the sample size must not exceed the total
number of units in the data set (or the number of units in the stratum, if you specify a STRATA
statement).
This option specifies the number of sampling units to select. If you do not specify a SAMPLINGUNIT
statement, PROC SURVEYSELECT defines sampling units as observations and selects the number of observations that you specify.
If you specify a SAMPLINGUNIT
statement, PROC SURVEYSELECT defines sampling units as groups of observations (clusters) and selects the number of clusters
that you specify.
If you specify SAMPSIZE=n and the ALLOC=
option in the STRATA
statement, PROC SURVEYSELECT allocates the sample size n among the strata according to the allocation method that you request. For more information, see the section Sample Size Allocation. You cannot specify SAMPSIZE=values or SAMPSIZE=SAS-data-set when you use the ALLOC=
option. You cannot specify SAMPSIZE= with the MARGIN=
option, which determines stratum sample sizes that provide the specified margin of error. For more information, see the section
Specifying the Margin of Error.
You cannot specify both the SAMPSIZE= option and the SAMPRATE=
option.
You can provide sample size values by specifying one of the following forms:
-
SAMPSIZE=n
N=n
-
specifies a single sample size value n, which must be a positive integer. If you request a stratified sample design, PROC SURVEYSELECT selects n units from each stratum (unless you also specify the ALLOC=
option in the STRATA
statement, which allocates the total sample size among the strata).
For methods that select without replacement, the sample size n must not exceed the number of units in the stratum unless you also specify the SELECTALL
option. If you specify the SELECTALL
option, PROC SURVEYSELECT selects all stratum units when the stratum sample size exceeds the total number of units in the
stratum.
-
SAMPSIZE=(values)
N=(values)
-
specifies a list of stratum sample size values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of
sample size values must equal the number of strata in the input data set.
The order of the stratum sample size values must match the order of the stratum groups in the DATA=
input data set. When you specify a list of values, the input data set must be sorted by the STRATA
variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
The values of the stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of zero, PROC SURVEYSELECT
does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the
stratum that you omit is not included in the sampling frame or represented in the sample.
-
SAMPSIZE=SAS-data-set
N=SAS-data-set
-
names a SAS-data-set that contains stratum sample sizes. You should provide the sample sizes in the data set variable named _NSIZE_ or SampleSize. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA
variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
input data set. The order of the stratum groups in the SAMPSIZE= data set must match the order of the groups in the DATA=
data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more
information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
The stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of zero, PROC SURVEYSELECT does
not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum
that you omit is not included in the sampling frame or represented in the sample.
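As a sketch of how the sample size forms differ under stratification, consider the following requests. The data set names and the stratification variable Region (assumed to have three levels, with Frame sorted by Region in ascending order) are hypothetical.

```sas
/* Single value: 20 units from each stratum */
proc surveyselect data=Frame method=srs n=20
                  seed=12345 out=Sample1;
   strata Region;
run;

/* List of values: 20, 10, and 30 units, in the sort order of Region */
proc surveyselect data=Frame method=srs n=(20 10 30)
                  seed=12345 out=Sample2;
   strata Region;
run;

/* Single value with ALLOC=: a total of 60 units is allocated */
/* proportionally among the strata                            */
proc surveyselect data=Frame method=srs n=60
                  seed=12345 out=Sample3;
   strata Region / alloc=prop;
run;
```

The first two requests assume that every stratum contains at least as many units as its requested sample size.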
-
SEED < =value | SAS-data-set >
-
specifies the initial seed for random number generation. You can provide a single seed value for the entire sample selection, or you can provide stratum initial seeds by specifying a SAS-data-set. To initialize random number generation, a seed must be a positive integer. If you do not specify this option, or if you
specify an initial seed that is negative or zero, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain
an initial seed. For more information, see the section Random Number Generation.
PROC SURVEYSELECT displays the value of the initial seed in the "Sample Selection Summary" table. To reproduce the same sample
in a subsequent execution of PROC SURVEYSELECT, you can specify the same initial seed in the SEED= option (for the same input
data set and sample selection parameters).
If you specify a STRATA
statement, you can provide stratum initial seeds by specifying a SAS-data-set. If you do not provide stratum initial seeds, the procedure generates random numbers continuously across strata from the
random number stream that is initialized by the single seed value or by default. You can specify the OUTSEED
option to include stratum initial seeds in the output data set.
Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases,
PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option.
To reproduce samples that PROC SURVEYSELECT selected in releases before SAS/STAT 12.1, use the RANUNI
option with the SEED= option (for the same input data set and sample selection parameters).
You can provide initial seeds by specifying one of the following forms:
-
SEED
-
indicates that stratum initial seeds are provided in a secondary input data set that you name in another option (for example,
the SAMPSIZE=SAS-data-set
option). You should provide the initial seeds in the data set variable named _SEED_ or InitialSeed. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
-
SEED=value
-
specifies a single initial seed value for random number generation. To initialize random number generation, the value must be a positive integer.
-
SEED=SAS-data-set
-
names a SAS-data-set that contains stratum initial seeds. You should provide the stratum initial seeds in the data set variable named _SEED_ or InitialSeed. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA
variables.
This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA
statement. The data set must also contain all stratum groups that appear in the DATA=
input data set. The order of the stratum groups in the SEED= data set must match the order of the groups in the DATA= data
set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information,
see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
The OUTSEED
option includes the stratum initial seeds in the OUT=
output data set. You can reproduce the same sample in a subsequent execution of PROC SURVEYSELECT by specifying the same
stratum initial seeds (for the same input data set and sample selection parameters). If you need to reproduce the same sample
for only a subset of the strata, you can use the same initial seeds for the strata in the subset.
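One possible workflow for saving and reusing stratum initial seeds is sketched below. The data set names and the stratification variable Region are hypothetical; the sketch assumes that Frame is sorted by Region and that every stratum contributes at least one selected unit to Sample1.

```sas
/* First run: store the stratum initial seeds in the output data set */
proc surveyselect data=Frame method=srs n=5
                  seed=2024 out=Sample1 outseed;
   strata Region;
run;

/* Keep one observation per stratum with its InitialSeed value */
proc sort data=Sample1(keep=Region InitialSeed) out=Seeds nodupkey;
   by Region;
run;

/* Second run: supply the stratum seeds as a secondary input data set */
proc surveyselect data=Frame method=srs n=5
                  seed=Seeds out=Sample2;
   strata Region;
run;
```

To reproduce only some strata, you could edit the InitialSeed values in Seeds for the strata that should be redrawn before the second run.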
-
SELECTALL
-
requests that PROC SURVEYSELECT select all stratum units when the stratum sample size exceeds the total number of units in
the stratum. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total
number of units in the stratum, unless you are using a with-replacement selection method.
The SELECTALL option is available for the following without-replacement selection methods: METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD.
The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or those PPS
methods that select two units per stratum.
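The following sketch shows a typical use of SELECTALL. The data set names and the stratification variable Region are hypothetical.

```sas
/* Request 20 units per stratum. Any stratum with fewer than 20 units */
/* is selected in full because SELECTALL is specified.                */
proc surveyselect data=Frame method=srs n=20 selectall
                  seed=12345 out=Sample;
   strata Region;
run;
```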
-
SORT=NEST | SERP
-
specifies the type of sorting by CONTROL variables. The option SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic
serpentine sorting. The default is SORT=SERP. See the section Sorting by CONTROL Variables for descriptions of serpentine and nested sorting. When there is only one CONTROL variable, the two types of sorting are
equivalent.
The SORT= option is available when you specify a CONTROL
statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). When you specify a CONTROL statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata
before selecting the sample.
The SORT= option and the CONTROL statement are not available when you specify a SAMPLINGUNIT
statement. For more information, see the descriptions of the CONTROL and SAMPLINGUNIT statements.
When you specify a CONTROL statement, you can also use the OUTSORT=
option to name an output data set that contains the sorted input data set. Otherwise, if you do not specify the OUTSORT=
option, the sorted data set replaces the input data set.
-
STATS
-
includes the selection probability and sampling weight in the OUT= output data set for equal probability selection methods
when you do not specify a STRATA
statement. By default, the output data set does not include these values for equal probability selection methods unless you
specify a STRATA statement. The STATS option applies to the following selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI.
In addition to the selection probability and sampling weight, the STATS option includes the following statistics in the output
data set for METHOD=BERNOULLI: total number of sampling units, expected sample size, actual sample size, and adjusted sampling weight. For more information,
see the section Bernoulli Sampling.
For PPS selection methods, the output data set contains selection probabilities and sampling weights by default. The STATS
option has no effect for PPS methods.
For more information about the contents of the output data set, see the section Sample Output Data Set.
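A minimal sketch of the STATS option for an unstratified design follows. The data set names Frame and SampleStats are hypothetical.

```sas
/* Unstratified simple random sample. STATS adds the selection */
/* probability and sampling weight to the output data set.     */
proc surveyselect data=Frame method=srs n=100 stats
                  seed=12345 out=SampleStats;
run;
```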