The SURVEYSELECT Procedure

PROC SURVEYSELECT Statement

  • PROC SURVEYSELECT options;

The PROC SURVEYSELECT statement invokes the SURVEYSELECT procedure. Optionally, it identifies input and output data sets. If you do not name a DATA= input data set, the procedure selects the sample from the most recently created SAS data set. If you do not name an OUT= output data set to contain the sample of selected units, the procedure still creates an output data set and names it according to the DATAn convention.

The PROC SURVEYSELECT statement also specifies the sample selection method, the sample size, and other sample design parameters.

If you do not specify a selection method, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you do specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default. For more information, see the description of the METHOD= option.

You can use the SAMPSIZE=n option to specify the sample size, or you can use the SAMPSIZE=SAS-data-set option to name a secondary input data set that contains stratum sample sizes. You must specify a sample size or sampling rate except when you request one of the following: random assignment (GROUPS=); Poisson sampling (METHOD=POISSON); Brewer’s method or Murthy’s method, either of which selects two units from each stratum (METHOD=PPS_BREWER or METHOD=PPS_MURTHY); or sample allocation for a specified margin (MARGIN=).

You can provide stratum sample sizes, sampling rates, initial seeds, minimum size measures, maximum size measures, and certainty size measures in a secondary input data set. For more information, see the descriptions of the SAMPSIZE=, SAMPRATE=, SEED=, MINSIZE=, MAXSIZE=, CERTSIZE=, and CERTSIZE=P= options. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT. For more information, see the section Secondary Input Data Set.

Table 102.1 summarizes the options available in the PROC SURVEYSELECT statement. Descriptions of the options follow in alphabetical order.

Table 102.1: PROC SURVEYSELECT Statement Options

Option          Description

Input and Output Data Sets
  DATA=         Names the input SAS data set
  OUT=          Names the output SAS data set that contains the sample
  OUTSORT=      Names an output SAS data set that stores the sorted input data set

Selection Method
  METHOD=       Specifies the sample selection method

Sample Size
  SAMPSIZE=     Specifies the sample size
  SELECTALL     Selects all stratum units when the sample size exceeds the total

Sampling Rate
  SAMPRATE=     Specifies the sampling rate
  NMIN=         Specifies the minimum stratum sample size
  NMAX=         Specifies the maximum stratum sample size

Replicated Sampling
  REPS=         Specifies the number of sample replicates

Size Measures
  MINSIZE=      Specifies the minimum size measure
  MAXSIZE=      Specifies the maximum size measure
  CERTSIZE=     Specifies the certainty size measure
  CERTSIZE=P=   Specifies the certainty proportion

Control Sorting
  SORT=         Specifies the type of sorting

Random Number Generation
  SEED=         Specifies the initial seed
  RANUNI        Requests the RANUNI random number generator

Random Assignment
  GROUPS=       Requests random assignment

Displayed Output
  NOPRINT       Suppresses the display of all output

OUT= Data Set Contents
  CERTUNITS=    Includes number of certainty units
  JTPROBS       Includes joint probabilities of selection
  OUTALL        Includes all observations from the DATA= input data set
  OUTHITS       Includes a distinct copy of each selected unit
  OUTSEED       Includes the initial seed for each stratum
  OUTSIZE       Includes additional design and sampling frame information
  STATS         Includes selection probabilities and sampling weights


You can specify the following options in the PROC SURVEYSELECT statement:

CERTSIZE < =value | SAS-data-set >

specifies the certainty size measure that PROC SURVEYSELECT uses to identify units that are selected with certainty. You can provide a single certainty value for the entire sample selection, or you can provide stratum-level certainty values by specifying a SAS-data-set. The certainty size values must be positive numbers.

You can use the SIZE statement to provide size measures for the sampling units. PROC SURVEYSELECT selects with certainty all sampling units whose size measures are greater than or equal to the certainty size value. After removing the certainty units, the procedure selects the remainder of the sample by using the method that you specify in the METHOD= option. The OUT= output data set contains a variable named Certain that identifies units that are selected with certainty. The selection probability of each certainty unit is one.

This option is available for the following PPS selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, METHOD=PPS_SYS, and METHOD=PPS_WR. The CERTSIZE= option is not available when you specify a SAMPLINGUNIT statement.

You can provide certainty size values by specifying one of the following forms:

CERTSIZE

indicates that certainty size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). This data set should include a variable named _CERTSIZE_ that contains the certainty values. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTSIZE=value

specifies a single certainty size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the certainty value to determine certainty selections for all strata.

CERTSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level certainty size values. You should provide the certainty values in the data set variable named _CERTSIZE_. Each observation in this data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= data set. The order of the stratum groups in the CERTSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
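The stratum-matching requirements above can be checked before a run. The following Python sketch is illustrative only and is not part of SAS: the function name `check_secondary_strata` and the representation of stratum keys as tuples are assumptions, and it presumes the DATA= data set is already sorted by the STRATA variables. How the procedure treats extra stratum groups in the secondary data set is not stated here, so the sketch simply ignores them.

```python
def check_secondary_strata(data_strata, secondary_strata):
    """Compare stratum groups in the frame with those in a secondary data set.

    data_strata: stratum key of each observation in the DATA= frame, in file order.
    secondary_strata: stratum key of each row in the secondary data set.
    Returns (missing_groups, order_matches).
    """
    # Stratum groups in order of first appearance in the frame.
    frame_groups = list(dict.fromkeys(data_strata))

    # Every frame group must be present in the secondary data set.
    secondary_set = set(secondary_strata)
    missing = [g for g in frame_groups if g not in secondary_set]

    # Assumption: groups that appear only in the secondary data set are ignored.
    frame_set = set(frame_groups)
    relevant = [g for g in secondary_strata if g in frame_set]

    order_matches = not missing and relevant == frame_groups
    return missing, order_matches


frame = [("East", 1), ("East", 1), ("East", 2), ("West", 1)]
print(check_secondary_strata(frame, [("East", 1), ("East", 2), ("West", 1)]))
print(check_secondary_strata(frame, [("West", 1), ("East", 1), ("East", 2)]))
```

A result of `([], True)` means that every stratum group is present and in the same order as in the frame.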

CERTSIZE=P < =p | SAS-data-set >

specifies the certainty proportion that PROC SURVEYSELECT uses for iterative certainty selection. You can provide a single certainty proportion p for the entire sample, or you can provide stratum-level certainty proportions by specifying a SAS-data-set.

The certainty proportions must be positive numbers. You can specify a certainty proportion as a number between 0 and 1. Or you can specify a proportion in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.
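The two accepted forms of the certainty proportion can be summarized in a short Python sketch. It is a reading of the paragraph above, not SAS code; the function name `normalize_certainty_proportion` is hypothetical, and rejecting values above 100 is an assumption.

```python
def normalize_certainty_proportion(value):
    """Return the certainty proportion as a number in (0, 1].

    Values in (0, 1] are taken as proportions (so 1 means 100%, not 1%).
    Values in (1, 100] are taken as percentages and divided by 100.
    """
    if value <= 0 or value > 100:
        # Assumption: anything outside (0, 100] is treated as invalid.
        raise ValueError("certainty proportion must be in (0, 100]")
    if value <= 1:
        return float(value)
    return value / 100.0


print(normalize_certainty_proportion(0.25))  # proportion form
print(normalize_certainty_proportion(25))    # percentage form
print(normalize_certainty_proportion(1))     # treated as 100%
```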

You can use the SIZE statement to provide size measures for the sampling units. PROC SURVEYSELECT computes the certainty size as the certainty proportion p of the total size for all units. The procedure selects with certainty the sampling units whose size measures are greater than or equal to the certainty size. After removing these certainty units from consideration, the procedure computes a new certainty size as the certainty proportion of the total size of the remaining units and again identifies certainty units. PROC SURVEYSELECT repeats this process until no more certainty units are selected. After certainty selection is complete, the remainder of the sample is selected by using the method that you specify in the METHOD= option. The OUT= output data set contains a variable named Certain that identifies units that are selected with certainty. The selection probability of each certainty unit is one.
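The iterative procedure just described can be sketched in Python as follows. This is a minimal illustration of the stated steps, not the procedure's implementation; the function name `iterative_certainty_units` and the use of list positions as unit identifiers are assumptions.

```python
def iterative_certainty_units(sizes, p):
    """Return the positions of units selected with certainty.

    sizes: size measure of each sampling unit.
    p: certainty proportion, as a number in (0, 1].
    """
    remaining = dict(enumerate(sizes))
    certain = []
    while remaining:
        # Certainty size is the proportion p of the total size of remaining units.
        threshold = p * sum(remaining.values())
        hits = [i for i, s in remaining.items() if s >= threshold]
        if not hits:
            break  # no more certainty units, so stop iterating
        certain.extend(hits)
        for i in hits:
            del remaining[i]
    return sorted(certain)


# Pass 1: total 100, threshold 50, unit 0 (size 60) is certain.
# Pass 2: total 40, threshold 20, unit 1 (size 25) is certain.
# Pass 3: total 15, threshold 7.5, no unit qualifies.
print(iterative_certainty_units([60, 25, 5, 5, 5], 0.5))
```

In this example a single pass would flag only the first unit; the second unit becomes a certainty unit because the threshold is recomputed after the first is removed.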

This option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. This option is not available when you specify a SAMPLINGUNIT statement.

You can provide certainty size proportions by specifying one of the following forms:

CERTSIZE=P

indicates that certainty size proportions are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the certainty proportions in the data set variable named _CERTP_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTSIZE=P=p

specifies a single certainty size proportion p, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the certainty proportion p to determine certainty selections for all strata.

CERTSIZE=P=SAS-data-set

names a SAS-data-set that contains stratum-level certainty size proportions. You should provide the certainty proportions in the data set variable named _CERTP_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the CERTSIZE=P= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

CERTUNITS=NOPRINT | OUTPUT

controls the display and output of information about certainty selection. This option is available when you specify the CERTSIZE= or CERTSIZE=P= option. CERTUNITS=NOPRINT suppresses display of the number of certainty units in the "Sample Selection Summary" table. For more information, see the section Displayed Output. CERTUNITS=OUTPUT includes the number of certainty units in the output data set. For more information about the contents of the output data set, see the section Sample Output Data Set.

DATA=SAS-data-set

names the SAS-data-set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created SAS data set. In sampling terminology, the input data set is the sampling frame (the list of units from which the sample is selected).

By default, the procedure uses input data set observations as sampling units and selects a sample of these units. Alternatively, you can use the SAMPLINGUNIT statement to define sampling units as groups of observations (clusters).

GROUPS=n | (values)

requests random assignment of the observations in the input data set to groups. You can specify the total number of groups as n, which must be a positive integer. Or you can provide a list of group size values, which are positive integers that specify the number of observations in the groups. When you use a STRATA statement, PROC SURVEYSELECT performs the specified random assignment independently in each stratum. Otherwise, the procedure performs the random assignment for the entire data set.

When you specify GROUPS=n, PROC SURVEYSELECT randomly assigns the observations in the data set (or stratum) to n groups. The number of observations in each group is equal, or as nearly equal as possible. For example, if the data set contains 100 observations and you specify GROUPS=3, PROC SURVEYSELECT creates three groups that contain 33, 33, and 34 observations. This is equivalent to specifying GROUPS=(33, 33, 34).

When you specify GROUPS=values, the number of groups is determined by the number of group size values that you list. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The sum of the group size values must equal the total number of observations in the data set (or in the stratum, if you specify a STRATA statement).
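As a rough illustration of the two GROUPS= forms, the Python sketch below computes group sizes and then assigns observations to groups at random. It is not the procedure's algorithm: the function names, the choice to give the extra observations to the last groups (which reproduces the 33, 33, 34 example), the numbering of groups from 1, and the use of Python's `random` module are all assumptions.

```python
import random


def group_sizes(n_obs, groups):
    """Return group sizes for GROUPS=n (an int) or GROUPS=(values) (a list)."""
    if isinstance(groups, int):
        if groups <= 0:
            raise ValueError("number of groups must be a positive integer")
        base, extra = divmod(n_obs, groups)
        # Sizes are equal, or as nearly equal as possible.
        return [base] * (groups - extra) + [base + 1] * extra
    sizes = list(groups)
    if sum(sizes) != n_obs:
        raise ValueError("group sizes must sum to the number of observations")
    return sizes


def assign_groups(n_obs, groups, seed=None):
    """Return a randomly ordered group label (1, 2, ...) for each observation."""
    sizes = group_sizes(n_obs, groups)
    labels = [gid for gid, size in enumerate(sizes, start=1) for _ in range(size)]
    random.Random(seed).shuffle(labels)
    return labels


print(group_sizes(100, 3))             # [33, 33, 34]
print(group_sizes(100, [33, 33, 34]))  # same sizes, given explicitly
labels = assign_groups(100, 3, seed=1)
print(labels[:10])
```

With a STRATA statement, the same assignment would be repeated independently within each stratum.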

The OUT= data set includes a variable named GroupID that identifies the group assignment of each observation. When you specify the OUTSIZE option, the output data set includes a variable named GroupSize that provides the number of units in the group; the output data set also includes the total number of units and the number of groups (in the data set, or in the stratum if you specify a STRATA statement). For more information, see the section Random Assignment Output Data Set.

The following options are available when you specify the GROUPS= option: the SEED=, RANUNI, and OUTSEED options, which pertain to random number generation; the REPS= option, which provides independent replicates of the random assignment; the NOPRINT option, which suppresses display of the "Random Assignment" table; and the OUTSIZE option.

The GROUPS= option does not select a sample; you cannot specify sample selection options (for example, METHOD= or SAMPSIZE=) when you use the GROUPS= option. The SAMPLINGUNIT statement is not available when you use the GROUPS= option.

JTPROBS

includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum.

For information about joint selection probabilities for a particular sampling method, see the method description in the section Sample Selection Methods. For more information about the contents of the output data set, see the section Sample Output Data Set.

MAXSIZE < =value | SAS-data-set >

specifies the maximum size measure. You can provide a single maximum value for the entire sample selection, or you can provide stratum-level maximum values by specifying a SAS-data-set. The maximum size values must be positive numbers.

PROC SURVEYSELECT uses the maximum size values to adjust the size measures, which you can provide by specifying the SIZE statement or by specifying the PPS option in the SAMPLINGUNIT statement. When a size measure exceeds the maximum value, the procedure replaces the size measure with the maximum value.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (instead of the observation sizes). If you specify a SIZE statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the SAMPLINGUNIT PPS option, the size of a sampling unit is the number of observations in the unit.

When you use a SAMPLINGUNIT statement, the OUT= data set includes a variable named UnitSize that contains the adjusted sampling unit size measures. When you do not use a SAMPLINGUNIT statement, the OUT= data set includes a variable named AdjustedSize that contains the adjusted observation size measures.
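The adjustment of observation size measures amounts to capping each value at the maximum. The Python sketch below illustrates the case without a SAMPLINGUNIT statement; the function name `apply_maxsize` is hypothetical, and the returned list plays the role of the AdjustedSize variable.

```python
def apply_maxsize(sizes, maxsize):
    """Replace every size measure that exceeds maxsize with maxsize."""
    if maxsize <= 0:
        raise ValueError("maximum size value must be a positive number")
    return [min(size, maxsize) for size in sizes]


# Sizes above 100 are replaced with 100; the others are unchanged.
print(apply_maxsize([5, 120, 40, 300], 100))  # [5, 100, 40, 100]
```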

You can provide maximum size values by specifying one of the following forms:

MAXSIZE

indicates that maximum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the maximum size values in the data set variable named _MAXSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.

MAXSIZE=value

specifies a single maximum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the value to adjust size measures in all strata.

MAXSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level maximum size values. You should provide the maximum size values in the data set variable named _MAXSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= data set. The order of the stratum groups in the MAXSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

METHOD=name
M=name

specifies the method for sample selection.

If you do not specify the METHOD= option, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you do specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default.

The following values are available for the METHOD= option:

BERNOULLI

requests Bernoulli sampling, which consists of N independent selection trials, each with constant inclusion probability $\pi$, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. For more information, see the section Bernoulli Sampling.

When you specify this method, you must provide the sampling rate (inclusion probability $\pi$) in the SAMPRATE= option. For stratified sampling (which you request with the STRATA statement), you can specify the same sampling rate for each stratum in the SAMPRATE=value option. Or you can specify different sampling rates for different strata in the SAMPRATE=(values) or SAMPRATE=SAS-data-set option.

Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI does not use the SAMPSIZE= option. Also, the ALLOC= option in the STRATA statement (which allocates the total sample size among strata) is not available with METHOD=BERNOULLI.

POISSON

requests Poisson sampling. A generalization of Bernoulli sampling, Poisson sampling consists of N independent selection trials with a separate inclusion probability specified for each unit, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. For more information, see the section Poisson Sampling.

You must provide inclusion probabilities for Poisson sampling in the SIZE variable. The probability values should be between 0 and 1. If a value of the SIZE variable is missing, nonpositive, or greater than 1, PROC SURVEYSELECT omits the observation from sample selection.

Because Poisson sampling is based on specified inclusion probabilities instead of a fixed sample size, you cannot specify the SAMPSIZE= option when you specify METHOD=POISSON. You also cannot specify the ALLOC= option in the STRATA statement when you specify METHOD=POISSON.

The SAMPLINGUNIT statement is not available when you specify METHOD=POISSON.

When you specify the SAMPRATE= option for METHOD=POISSON but do not specify a SIZE statement, PROC SURVEYSELECT uses METHOD=BERNOULLI.
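The relationship between Poisson and Bernoulli sampling can be made concrete with a small Python sketch. It follows the description above (independent trials, with units omitted when the probability is missing, nonpositive, or greater than 1) but is not the procedure's implementation; the function names, the use of `None` or NaN for a missing value, and Python's `random` generator are assumptions.

```python
import math
import random


def poisson_sample(probabilities, seed=None):
    """Run one independent selection trial per unit.

    Returns (selected, omitted) as lists of unit positions.
    """
    rng = random.Random(seed)
    selected, omitted = [], []
    for i, p in enumerate(probabilities):
        missing = p is None or (isinstance(p, float) and math.isnan(p))
        if missing or p <= 0 or p > 1:
            omitted.append(i)  # excluded from sample selection
            continue
        if rng.random() < p:   # rng.random() is in [0, 1), so p = 1 always selects
            selected.append(i)
    return selected, omitted


def bernoulli_sample(n_units, pi, seed=None):
    """Bernoulli sampling: Poisson sampling with the same probability for every unit."""
    selected, _ = poisson_sample([pi] * n_units, seed=seed)
    return selected


selected, omitted = poisson_sample([1.0, None, 0.0, 1.5, 1.0, -0.2], seed=7)
print(selected, omitted)
print(len(bernoulli_sample(1000, 0.1, seed=7)))  # random sample size, about 100 on average
```

Because each trial is independent, the realized sample size varies from run to run, which is why SAMPSIZE= does not apply to these methods.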

PPS

requests selection with probability proportional to size and without replacement. For more information, see the section PPS Sampling without Replacement. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_BREWER
BREWER

requests selection according to Brewer’s method. Brewer’s method selects two units from each stratum with probability proportional to size and without replacement. For more information, see the section Brewer’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size in the SAMPSIZE= option because Brewer’s method selects two units from each stratum.

PPS_MURTHY
MURTHY

requests selection according to Murthy’s method. Murthy’s method selects two units from each stratum with probability proportional to size and without replacement. For more information, see the section Murthy’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size in the SAMPSIZE= option because Murthy’s method selects two units from each stratum.

PPS_SAMPFORD
SAMPFORD

requests selection according to Sampford’s method. Sampford’s method selects units with probability proportional to size and without replacement. For more information, see the section Sampford’s PPS Method. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SEQ
CHROMY

requests sequential selection with probability proportional to size and with minimum replacement. This method is also known as Chromy’s method. For more information, see the section PPS Sequential Sampling. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SYS < (method-options)>

requests systematic selection with probability proportional to size. For more information, see the section PPS Systematic Sampling. When you specify this method, you must provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

You can specify the following method-options:

DETAILS

displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include strata or replicates. For more information, see the section Displayed Output.

INTERVAL=value

specifies the interval value for PPS systematic selection. The interval value must be a positive number. It must not exceed the total of the size measures in the data set (or in each stratum if you specify a STRATA statement). By default, the systematic interval is the ratio of the size measure total to the sample size (which you provide in the SAMPSIZE= option). For more information, see the section PPS Systematic Sampling.

You cannot use the INTERVAL= method-option when you specify a sample size in the SAMPSIZE= option or when you specify the ALLOC= option, which allocates the total sample size among strata.

START=value

specifies the starting value for PPS systematic selection. The starting value must be a positive number that is less than the systematic interval. By default, PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the section PPS Systematic Sampling.

When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.

When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.

PPS_WR

requests selection with probability proportional to size and with replacement. For more information, see the section PPS Sampling with Replacement. When you specify this method, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

SEQ
CHROMY

requests sequential selection according to Chromy’s method. If you specify this method and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. For more information, see the section Sequential Random Sampling.

If you specify METHOD=SEQ and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and with minimum replacement. For more information, see the section PPS Sequential Sampling.

SRS

requests simple random sampling, which is selection with equal probability and without replacement. For more information, see the section Simple Random Sampling. METHOD=SRS is the default selection method if you do not specify the METHOD= option and also do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement).

SYS < (method-options)>

requests systematic random sampling. If you specify this method and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic random sampling with equal probability. For more information, see the section Systematic Random Sampling.

If you specify this method and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic random sampling with probability proportional to size (METHOD=PPS_SYS). For more information, see the section PPS Systematic Sampling.

You can specify the following method-options:

DETAILS

displays the random start and the systematic interval in the "Sample Selection Summary" table when the design does not include strata or replicates. For more information, see the section Displayed Output.

INTERVAL=value

specifies the interval for systematic random sampling. The interval value must be a positive number and must not exceed the number of sampling units in the data set (or the number of units in each stratum, if you specify a STRATA statement).

By default, PROC SURVEYSELECT determines the systematic interval from the sampling rate or sample size that you provide in the SAMPRATE= or SAMPSIZE= option, respectively. When you specify the sampling rate, PROC SURVEYSELECT computes the systematic interval as the inverse of the sampling rate. When you specify the sample size, the procedure computes the interval as the ratio of the number of sampling units to the sample size. For more information, see the section Systematic Random Sampling.

You cannot use the INTERVAL= method-option when you specify the SAMPSIZE= option, the SAMPRATE= option, or the ALLOC= option (which allocates the total sample size among strata).
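The default interval computation and the constraints on a user-supplied interval for equal-probability systematic sampling can be sketched in Python. The function names are hypothetical, and the sampling rate is assumed to be expressed as a proportion rather than a percentage.

```python
def default_interval(n_units, samprate=None, sampsize=None):
    """Return the default systematic interval from a sampling rate or a sample size."""
    if (samprate is None) == (sampsize is None):
        raise ValueError("provide exactly one of samprate or sampsize")
    if samprate is not None:
        return 1.0 / samprate      # inverse of the sampling rate
    return n_units / sampsize      # number of units divided by sample size


def is_valid_interval(interval, n_units):
    """A user-supplied interval must be positive and must not exceed the unit count."""
    return 0 < interval <= n_units


print(default_interval(1000, samprate=0.25))  # 4.0
print(default_interval(1000, sampsize=50))    # 20.0
print(is_valid_interval(1200, 1000))          # False
```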

START=value

specifies the starting value for systematic selection. The starting value must be a positive number that is less than the systematic interval. By default, PROC SURVEYSELECT randomly determines a starting point in the systematic interval. For more information, see the section Systematic Random Sampling.

When you use this option to specify a systematic starting point (instead of allowing the procedure to randomly determine the starting point), the following options for random number generation have no effect: SEED=, RANUNI, and OUTSEED. You cannot use the REPS= option when you specify the START= method-option.

When the starting value that you provide is not randomly determined, the resulting selection is not a probability-based sample.

URS

requests unrestricted random sampling, which is selection with equal probability and with replacement. For more information, see the section Unrestricted Random Sampling.
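Several METHOD= values resolve to a different method depending on whether size measures are supplied. The Python sketch below collects those rules in one place. It is a summary of the descriptions above rather than SAS logic: the function name `resolve_method` is hypothetical, `has_size` stands for "a SIZE statement or the PPS option in the SAMPLINGUNIT statement is specified", and treating the CHROMY alias as SEQ before resolution is an assumption.

```python
ALIASES = {
    "BREWER": "PPS_BREWER",
    "MURTHY": "PPS_MURTHY",
    "SAMPFORD": "PPS_SAMPFORD",
    "CHROMY": "SEQ",  # assumption: resolved like SEQ, so it becomes PPS_SEQ with sizes
}


def resolve_method(method=None, has_size=False, has_samprate=False):
    """Return the selection method that is effectively used."""
    if method is None:
        # Default: PPS when size measures are available, otherwise SRS.
        return "PPS" if has_size else "SRS"
    name = ALIASES.get(method.upper(), method.upper())
    if name == "SEQ" and has_size:
        return "PPS_SEQ"
    if name == "SYS" and has_size:
        return "PPS_SYS"
    if name == "POISSON" and has_samprate and not has_size:
        return "BERNOULLI"
    return name


print(resolve_method())                               # SRS
print(resolve_method(has_size=True))                  # PPS
print(resolve_method("SEQ", has_size=True))           # PPS_SEQ
print(resolve_method("POISSON", has_samprate=True))   # BERNOULLI
```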

MINSIZE < =value | SAS-data-set >

specifies the minimum size measure. You can provide a single minimum value for the entire sample selection, or you can provide stratum-level minimum values by specifying a SAS-data-set. The minimum size values must be positive numbers.

PROC SURVEYSELECT uses the minimum size values to adjust the size measures, which you provide by specifying the SIZE statement or by specifying the PPS option in the SAMPLINGUNIT statement. When a size measure is less than the minimum value, the procedure replaces the size measure with the minimum value.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), PROC SURVEYSELECT adjusts the sampling unit sizes (not the observation sizes). If you specify a SIZE statement, the size of a sampling unit is the sum of the size measures of the observations in the unit. If you specify the SAMPLINGUNIT PPS option, the size of a sampling unit is the number of observations in the unit.

When you use a SAMPLINGUNIT statement, the OUT= data set includes a variable named UnitSize that contains the adjusted sampling unit size measures. When you do not use a SAMPLINGUNIT statement, the OUT= data set includes a variable named AdjustedSize that contains the adjusted observation size measures.
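For clustered designs, the adjustment applies to the size of each sampling unit rather than to each observation. The Python sketch below first derives unit sizes (the sum of SIZE values, or the observation count when the PPS option is used) and then applies the minimum. The function names and the dictionary representation are assumptions; the adjusted values correspond to the UnitSize variable.

```python
def unit_sizes(unit_ids, size_values=None):
    """Return the size of each sampling unit (cluster).

    unit_ids: sampling unit identifier of each observation.
    size_values: SIZE variable value of each observation, or None to count
                 observations (as with the SAMPLINGUNIT PPS option).
    """
    totals = {}
    for idx, unit in enumerate(unit_ids):
        increment = 1 if size_values is None else size_values[idx]
        totals[unit] = totals.get(unit, 0) + increment
    return totals


def apply_minsize(sizes_by_unit, minsize):
    """Replace every unit size that is below minsize with minsize."""
    if minsize <= 0:
        raise ValueError("minimum size value must be a positive number")
    return {unit: max(size, minsize) for unit, size in sizes_by_unit.items()}


units = ["A", "A", "B", "C", "C", "C"]
print(apply_minsize(unit_sizes(units), 2))                        # counts as sizes
print(apply_minsize(unit_sizes(units, [10, 5, 3, 1, 1, 1]), 5))   # summed SIZE values
```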

You can provide minimum size values by specifying one of the following forms:

MINSIZE

indicates that minimum size values are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the minimum size values in the data set variable named _MINSIZE_. For more information, see the section Secondary Input Data Set. You can specify only one secondary input data set in each invocation of PROC SURVEYSELECT.

MINSIZE=value

specifies a single minimum size value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the minimum value to adjust size measures in all strata.

MINSIZE=SAS-data-set

names a SAS-data-set that contains stratum-level minimum size values. You should provide the minimum size values in the data set variable named _MINSIZE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the MINSIZE= data set must match the order of the groups in the DATA= input data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

NMAX=n

specifies the maximum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is greater than the value NMAX=n, PROC SURVEYSELECT selects only n units.

The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMAX= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.

NMIN=n

specifies the minimum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is less than the value NMIN=n, PROC SURVEYSELECT selects n units.

The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which you can specify for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMIN= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.
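Taken together, NMIN= and NMAX= bound the stratum sample size that is computed from the sampling rate. The Python sketch below shows the idea; the function name `stratum_sample_size` is hypothetical, and rounding the product up to an integer is an assumption, since the rounding rule is not stated in this section.

```python
import math


def stratum_sample_size(n_units, samprate, nmin=None, nmax=None):
    """Return the stratum sample size implied by a sampling rate, bounded by NMIN/NMAX."""
    # Assumption: the product of unit count and rate is rounded up to an integer.
    n = math.ceil(n_units * samprate)
    if nmin is not None and n < nmin:
        n = nmin  # too few units from the rate alone, so select nmin units
    if nmax is not None and n > nmax:
        n = nmax  # too many units from the rate alone, so select only nmax units
    return n


print(stratum_sample_size(1000, 0.25, nmin=5, nmax=100))  # 250 is capped at 100
print(stratum_sample_size(10, 0.25, nmin=5, nmax=100))    # 3 is raised to 5
print(stratum_sample_size(200, 0.25, nmin=5, nmax=100))   # 50 is within bounds
```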

NOPRINT

suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. This option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 20: Using the Output Delivery System.

OUT=SAS-data-set

names the output data set. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique. If you request sample selection by specifying the METHOD= option, the output data set contains the observations that are selected for the sample. If you request sample allocation without sample selection by specifying the ALLOC= and NOSAMPLE options in the STRATA statement, the output data set contains the allocated sample sizes. If you request random assignment by specifying the GROUPS= option, the output data set contains all observations in the input data set together with their assigned group identification.

When PROC SURVEYSELECT selects a sample, the output data set contains the units that are selected, sample design information, and selection statistics. You can specify options that control the information to include in the output data set. For more information, see the descriptions of the following options: JTPROBS, OUTALL, OUTHITS, OUTSEED, OUTSIZE, and STATS. For more information about the contents of the output data set, see the section Sample Output Data Set.

By default, the sample output data set contains only those units that are selected for the sample. To include all observations from the input data set in the output data set, use the OUTALL option.

By default, the sample output data set includes one copy of each selected unit, even when a unit is selected more than once, which can occur when you use with-replacement or with-minimum-replacement selection methods. For with-replacement or with-minimum-replacement selection methods, the output data set includes a variable NumberHits that records the number of hits (selections) for each unit. To include a distinct copy of each selection in the output data set when the same unit is selected more than once, use the OUTHITS option.

When you specify the ALLOC= and NOSAMPLE options in the STRATA statement, PROC SURVEYSELECT allocates the total sample size among the strata but does not select a sample. In this case, the OUT= data set contains the allocated sample sizes. For more information, see the section Allocation Output Data Set.

When you specify the GROUPS= option, PROC SURVEYSELECT randomly assigns observations to groups; it does not select a sample. In this case, the OUT= data set contains all observations from the input data set and includes a variable named GroupID that identifies group assignments. For more information, see the section Random Assignment Output Data Set.
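
As an illustration of the random assignment case, the following sketch assigns the observations of SASHELP.CLASS to three groups and tabulates the GroupID variable. The number of groups, the seed, and the output data set name are assumptions made for the example.

```sas
/* Randomly assign every observation to one of three groups. */
proc surveyselect data=sashelp.class groups=3
                  seed=4321 out=work.class_groups;
run;

/* All input observations are kept; GroupID holds the assignment. */
proc freq data=work.class_groups;
   tables GroupID;
run;
```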

OUTALL

includes all observations from the DATA= input data set in the OUT= output data set. By default, the output data set includes only those units selected for the sample. When you specify the OUTALL option, the output data set includes all observations from the input data set and also contains a variable that indicates each observation’s selection status. For an observation that is selected, the value of the variable Selected is 1; for an observation that is not selected, the value of Selected is 0. For information about the contents of the output data set, see the section Sample Output Data Set.

The OUTALL option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). The OUTALL option is also available for METHOD=POISSON.
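
A short sketch of the OUTALL option follows. It assumes the SASHELP.CLASS sample data set; the sample size, seed, and output data set name are illustrative.

```sas
/* Keep every input observation and flag the selected ones. */
proc surveyselect data=sashelp.class method=srs n=5
                  outall seed=1357 out=work.class_flagged;
run;

/* Selected = 1 for sampled observations and 0 otherwise. */
proc freq data=work.class_flagged;
   tables Selected;
run;
```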

OUTHITS

includes a distinct copy of each selected unit in the OUT= output data set when the same sampling unit is selected more than once. By default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits is m; for example, the output data set contains three copies of a unit that is selected three times (NumberHits is 3).

A sampling unit can be selected more than once by with-replacement and with-minimum-replacement selection methods, which include METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ. The OUTHITS option is available for these selection methods.

For information about the contents of the output data set, see the section Sample Output Data Set.
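
To make the difference concrete, the following sketch draws a with-replacement sample twice, once with the default output and once with the OUTHITS option. The data set names and seed are assumptions for the example.

```sas
/* Default: one row per distinct selected unit; NumberHits */
/* records how many times each unit was selected.          */
proc surveyselect data=sashelp.class method=urs n=10
                  seed=2468 out=work.urs_units;
run;

/* OUTHITS: one row per selection, so a unit that is */
/* selected m times appears m times.                 */
proc surveyselect data=sashelp.class method=urs n=10 outhits
                  seed=2468 out=work.urs_hits;
run;
```

With the OUTHITS option, the number of rows in the output data set should equal the requested sample size; without it, the row count can be smaller whenever a unit is selected more than once.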

OUTSEED

includes the initial seed for each stratum in the OUT= output data set. The variable InitialSeed contains the stratum initial seeds. For information about the contents of the output data set, see the section Sample Output Data Set.

To reproduce the same sample for any stratum in a subsequent execution of PROC SURVEYSELECT, you can specify the same stratum initial seed in the SEED=SAS-data-set option together with the same sample selection parameters. For more information, see the section Random Number Generation.

The "Sample Selection Summary" table displays the initial random number seed for the entire sample selection, which is the same as the initial seed for the first stratum when the design is stratified. To reproduce the entire sample, you can specify this same seed value in the SEED= option, along with the same sample selection parameters.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases, PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option. To reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1, specify the RANUNI option with the SEED= option (for the same input data set and sample selection parameters).
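
One way to use the stratum initial seeds is sketched below: the first run saves them with the OUTSEED option, and the second run reads them back through the SEED=SAS-data-set option. This is an illustration under stated assumptions (SASHELP.CARS as input, hypothetical WORK data set names, an arbitrary seed), not a prescribed workflow.

```sas
proc sort data=sashelp.cars out=work.cars;
   by Origin;
run;

/* First run: save the initial seed of each stratum. */
proc surveyselect data=work.cars method=srs n=5
                  seed=97531 outseed out=work.sample1;
   strata Origin;
run;

/* Build a secondary input data set with one row per stratum: */
/* the STRATA variable plus the InitialSeed variable.         */
proc sort data=work.sample1(keep=Origin InitialSeed)
          out=work.stratum_seeds nodupkey;
   by Origin;
run;

/* Second run: same design, seeds supplied per stratum. */
proc surveyselect data=work.cars method=srs n=5
                  seed=work.stratum_seeds out=work.sample2;
   strata Origin;
run;

/* The two samples are expected to match. */
proc compare base=work.sample1(drop=InitialSeed)
             compare=work.sample2;
run;
```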

OUTSIZE

includes additional design and sampling frame information in the OUT= output data set.

If you use a STRATA statement, the OUTSIZE option provides stratum-level values in the output data set. Otherwise, the OUTSIZE option provides overall values.

The OUTSIZE option includes the sample size or sampling rate in the output data set, depending on whether you specify the SAMPSIZE= option or the SAMPRATE= option, respectively. For PPS selection methods, the OUTSIZE option includes the total size measure in the output data set. If you do not provide size measures, or if you specify a SAMPLINGUNIT statement, the OUTSIZE option includes the total number of sampling units.

If you request size measure adjustment or certainty selection, the OUTSIZE option includes the following information in the output data set: the minimum size measure if you specify the MINSIZE= option, the maximum size measure if you specify the MAXSIZE= option, the certainty size measure if you specify the CERTSIZE= option, and the certainty proportion if you specify the CERTSIZE=P= option.

For METHOD=BERNOULLI, the OUTSIZE option includes the following information in the output data set: total number of sampling units, selection probability (sampling rate), expected sample size, and actual sample size. See the section Bernoulli Sampling for descriptions of these statistics.

For more information about the contents of the output data set, see the section Sample Output Data Set.

If you specify the GROUPS= option for random assignment, the OUTSIZE option adds the following information to the output data set: total number of units, number of groups, and number of units in the group. For more information, see the section Random Assignment Output Data Set.
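
The following sketch requests the OUTSIZE option for a stratified design so that you can inspect the added stratum-level variables. The input data set, sampling rate, seed, and output name are illustrative assumptions.

```sas
proc sort data=sashelp.cars out=work.cars;
   by Origin;
run;

proc surveyselect data=work.cars method=srs samprate=0.05
                  outsize seed=8642 out=work.cars_outsize;
   strata Origin;
run;

/* List the variables that OUTSIZE added to the output data set. */
proc contents data=work.cars_outsize;
run;
```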

OUTSORT=SAS-data-set

names an output data set to store the sorted input data set. This option is available when you specify a CONTROL statement to sort the DATA= input data set for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ).

If you specify CONTROL variables but do not name an output data set in the OUTSORT= option, the sorted data set replaces the input data set.
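
Because the sorted data set otherwise replaces the input data set, naming an OUTSORT= data set is a simple way to leave the input untouched. The sketch below assumes SASHELP.CLASS as input; the CONTROL variables, seed, and data set names are illustrative.

```sas
/* Systematic sample after sorting by Sex and Age. The sorted */
/* frame is written to WORK.CLASS_SORTED, so the input data   */
/* set is not overwritten.                                     */
proc surveyselect data=sashelp.class method=sys n=5
                  seed=1122 outsort=work.class_sorted
                  out=work.class_sys;
   control Sex Age;
run;
```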

RANUNI

requests uniform random number generation by the method of Fishman and Moore (1982), which PROC SURVEYSELECT uses in releases before SAS/STAT 12.1. This is the same random number generator that the RANUNI function provides.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. Developed by Matsumoto and Nishimura (1998), the Mersenne-Twister random number generator has a very long period and good statistical properties. This is the random number generator that the RAND function provides for the uniform distribution.

For more information, see the section Random Number Generation. For information about the RANUNI and RAND functions, see SAS Functions and CALL Routines: Reference.

You can specify the RANUNI option with the SEED= option to reproduce samples that PROC SURVEYSELECT selects in releases before SAS/STAT 12.1. To reproduce a sample by using the RANUNI and SEED= options, you must also specify the same input data set and sample selection parameters.
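
For example, a sample that was selected in an earlier release might be requested again as follows. The seed shown here is a placeholder; in practice you would use the seed that the earlier run reported, along with the same input data set and design.

```sas
/* Use the pre-12.1 generator with the seed from the earlier run. */
proc surveyselect data=sashelp.class method=srs n=5
                  ranuni seed=39647 out=work.class_ranuni;
run;
```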

REPS=nreps

specifies the number of sample replicates. The value of nreps must be a positive integer.

When you specify the REPS= option, PROC SURVEYSELECT selects nreps independent samples, each with the same sample size or sampling rate and the same sample design that you request. The variable Replicate in the OUT= data set contains the sample replicate number.

You can use replicated sampling to provide a simple method of variance estimation for any form of statistic, and also to evaluate variable nonsampling errors such as interviewer differences. For information about replicated sampling, see Lohr (2010), Wolter (2007), Kish (1965), Kish (1987), and Kalton (1983). You can also use the REPS= option to perform a variety of other resampling and simulation tasks. For more information, see Cassell (2007).
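
A minimal sketch of replicated sampling follows. The number of replicates, sample size, seed, and output name are assumptions chosen for illustration.

```sas
/* Select three independent simple random samples of size 5. */
proc surveyselect data=sashelp.class method=srs n=5 reps=3
                  seed=5566 out=work.class_reps;
run;

/* The Replicate variable identifies the sample that each row */
/* belongs to.                                                */
proc freq data=work.class_reps;
   tables Replicate;
run;
```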

SAMPRATE=value | (values) | SAS-data-set
RATE=value | (values) | SAS-data-set

specifies the sampling rate, which is the proportion of units to select for the sample. You can provide a single sampling rate value for the entire sample selection, or you can provide stratum sampling rates by specifying values or a SAS-data-set.

The sampling rate value must be a positive number. The stratum sampling rate values and the stratum sampling rates that you provide in the SAS-data-set must be nonnegative numbers. You can specify a sampling rate as a number between 0 and 1. Or you can specify a rate in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.

This option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT computes the selection interval as the inverse of the sampling rate. For more information, see the section Systematic Random Sampling. For Bernoulli sampling (METHOD=BERNOULLI), the procedure uses the sampling rate as the inclusion probability. For more information, see the section Bernoulli Sampling. For the other equal probability selection methods, PROC SURVEYSELECT converts the sampling rate to the sample size before selection by multiplying the total number of units in the stratum or data set by the sampling rate and rounding up to the nearest integer.

You cannot specify both the SAMPRATE= option and the SAMPSIZE= option.

You can provide sampling rates by specifying one of the following forms:

SAMPRATE=value
RATE=value

specifies a single sampling rate value, which must be a positive number. If you request a stratified sample design by specifying the STRATA statement, PROC SURVEYSELECT uses the rate value for all strata.

SAMPRATE=(values)
RATE=(values)

specifies a list of stratum sampling rate values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of stratum sampling rate values must equal the number of strata in the input data set.

The order of the stratum sampling rate values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

The stratum sampling rate values must be nonnegative numbers. If you specify a stratum sampling rate of zero, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.

SAMPRATE=SAS-data-set
RATE=SAS-data-set

names a SAS-data-set that contains stratum sampling rates. You should provide the sampling rates in the data set variable named _RATE_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SAMPRATE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The stratum sampling rates must be nonnegative numbers. If you specify a stratum sampling rate of zero, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.
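
The list form and the data set form can be compared side by side. The following sketch assumes SASHELP.CARS as input, stratified by Origin; the rates, seed, and WORK data set names are illustrative. The secondary data set is derived from the input data set so that the STRATA variable has the same attributes in both.

```sas
proc sort data=sashelp.cars out=work.cars;
   by Origin;
run;

/* Form 1: a list of rates in ascending stratum order. */
proc surveyselect data=work.cars method=srs
                  samprate=(0.05 0.10 0.05)
                  seed=7788 out=work.sample_list;
   strata Origin;
run;

/* Form 2: a secondary input data set that has one row per */
/* stratum and the rate in the variable _RATE_.            */
data work.rates;
   set work.cars(keep=Origin);
   by Origin;
   if first.Origin;
   if Origin = 'Europe' then _RATE_ = 0.10;
   else _RATE_ = 0.05;
run;

proc surveyselect data=work.cars method=srs
                  samprate=work.rates
                  seed=7788 out=work.sample_ds;
   strata Origin;
run;
```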

SAMPSIZE=n | (values) | SAS-data-set
N=n | (values) | SAS-data-set

specifies the sample size, which is the number of units to select for the sample. You can provide a single sample size n for the entire sample selection, or you can provide stratum sample sizes by specifying values or a SAS-data-set.

The value of n must be a positive integer. The stratum sample size values and the stratum sample sizes that you provide in the SAS-data-set must be nonnegative integers. For selection methods that select without replacement, the sample size must not exceed the total number of units in the data set (or the number of units in the stratum, if you specify a STRATA statement).

This option specifies the number of sampling units to select. If you do not specify a SAMPLINGUNIT statement, PROC SURVEYSELECT defines sampling units as observations and selects the number of observations that you specify. If you specify a SAMPLINGUNIT statement, PROC SURVEYSELECT defines sampling units as groups of observations (clusters) and selects the number of clusters that you specify.

If you specify SAMPSIZE=n and the ALLOC= option in the STRATA statement, PROC SURVEYSELECT allocates the sample size n among the strata according to the allocation method that you request. For more information, see the section Sample Size Allocation. You cannot specify SAMPSIZE=values or SAMPSIZE=SAS-data-set when you use the ALLOC= option. You cannot specify SAMPSIZE= with the MARGIN= option, which determines stratum sample sizes that provide the specified margin of error. For more information, see the section Specifying the Margin of Error.

You cannot specify both the SAMPSIZE= option and the SAMPRATE= option.

You can provide sample size values by specifying one of the following forms:

SAMPSIZE=n
N=n

specifies a single sample size value n, which must be a positive integer. If you request a stratified sample design, PROC SURVEYSELECT selects n units from each stratum (unless you also specify the ALLOC= option in the STRATA statement, which allocates the total sample size among the strata).

For methods that select without replacement, the sample size n must not exceed the number of units in the stratum unless you also specify the SELECTALL option. If you specify the SELECTALL option, PROC SURVEYSELECT selects all stratum units when the stratum sample size exceeds the total number of units in the stratum.

SAMPSIZE=(values)
N=(values)

specifies a list of stratum sample size values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. The number of sample size values must equal the number of strata in the input data set.

The order of the stratum sample size values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

The values of the stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of zero, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.

SAMPSIZE=SAS-data-set
N=SAS-data-set

names a SAS-data-set that contains stratum sample sizes. You should provide the sample sizes in the data set variable named _NSIZE_ or SampleSize. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SAMPSIZE= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The stratum sample sizes must be nonnegative numbers. If you specify a stratum sample size of zero, PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.
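
As a sketch of the data set form, the following example supplies stratum sample sizes in the variable _NSIZE_ and sets one stratum's size to zero so that the stratum is left out of the sample. The input data set, sizes, seed, and WORK data set names are assumptions for illustration.

```sas
proc sort data=sashelp.cars out=work.cars;
   by Origin;
run;

/* One row per stratum; a size of 0 omits that stratum. */
data work.sizes;
   set work.cars(keep=Origin);
   by Origin;
   if first.Origin;
   if Origin = 'USA' then _NSIZE_ = 0;
   else _NSIZE_ = 10;
run;

proc surveyselect data=work.cars method=srs
                  sampsize=work.sizes
                  seed=1029 out=work.sample_sizes;
   strata Origin;
run;
```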

SEED < =value | SAS-data-set >

specifies the initial seed for random number generation. You can provide a single seed value for the entire sample selection, or you can provide stratum initial seeds by specifying a SAS-data-set. To initialize random number generation, a seed must be a positive integer. If you do not specify this option, or if you specify an initial seed that is negative or zero, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain an initial seed. For more information, see the section Random Number Generation.

PROC SURVEYSELECT displays the value of the initial seed in the "Sample Selection Summary" table. To reproduce the same sample in a subsequent execution of PROC SURVEYSELECT, you can specify the same initial seed in the SEED= option (for the same input data set and sample selection parameters).

If you specify a STRATA statement, you can provide stratum initial seeds by specifying a SAS-data-set. If you do not provide stratum initial seeds, the procedure generates random numbers continuously across strata from the random number stream that is initialized by the single seed value or by default. You can specify the OUTSEED option to include stratum initial seeds in the output data set.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases, PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option. To reproduce samples that PROC SURVEYSELECT selected in releases before SAS/STAT 12.1, use the RANUNI option with the SEED= option (for the same input data set and sample selection parameters).

You can provide initial seeds by specifying one of the following forms:

SEED

indicates that stratum initial seeds are provided in a secondary input data set that you name in another option (for example, the SAMPSIZE=SAS-data-set option). You should provide the initial seeds in the data set variable named _SEED_ or InitialSeed. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

SEED=value

specifies a single initial seed value for random number generation. To initialize random number generation, the value must be a positive integer.

SEED=SAS-data-set

names a SAS-data-set that contains stratum initial seeds. You should provide the stratum initial seeds in the data set variable named _SEED_ or InitialSeed. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the SEED= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

The OUTSEED option includes the stratum initial seeds in the OUT= output data set. You can reproduce the same sample in a subsequent execution of PROC SURVEYSELECT by specifying the same stratum initial seeds (for the same input data set and sample selection parameters). If you need to reproduce the same sample for only a subset of the strata, you can use the same initial seeds for the strata in the subset.
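
Because only one secondary input data set is allowed, stratum sample sizes and stratum seeds can share a single data set, with the SEED option specified without a value. The sketch below illustrates this under stated assumptions: SASHELP.CARS as input, a hypothetical WORK.DESIGN data set, and arbitrary positive integer seeds.

```sas
proc sort data=sashelp.cars out=work.cars;
   by Origin;
run;

/* One row per stratum with both _NSIZE_ and _SEED_. */
data work.design;
   set work.cars(keep=Origin);
   by Origin;
   if first.Origin;
   _NSIZE_ = 5;
   select (Origin);
      when ('Asia')   _SEED_ = 1101;
      when ('Europe') _SEED_ = 1102;
      otherwise       _SEED_ = 1103;
   end;
run;

/* SEED without a value: read seeds from the data set that */
/* the SAMPSIZE= option names.                             */
proc surveyselect data=work.cars method=srs
                  sampsize=work.design seed
                  out=work.sample_design;
   strata Origin;
run;
```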

SELECTALL

requests that PROC SURVEYSELECT select all stratum units when the stratum sample size exceeds the total number of units in the stratum. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum, unless you are using a with-replacement selection method.

The SELECTALL option is available for the following without-replacement selection methods: METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD.

The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or those PPS methods that select two units per stratum.
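
The following sketch shows a case where the SELECTALL option matters. It assumes SASHELP.CLASS stratified by Sex, where at least one stratum is expected to hold fewer than the 10 units requested; the seed and data set names are illustrative.

```sas
proc sort data=sashelp.class out=work.class;
   by Sex;
run;

/* Request 10 units per stratum. Any stratum that has fewer */
/* than 10 units is taken in full instead of being rejected */
/* as an invalid sample size.                               */
proc surveyselect data=work.class method=srs n=10 selectall
                  seed=9900 out=work.class_selectall;
   strata Sex;
run;
```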

SORT=NEST | SERP

specifies the type of sorting by CONTROL variables. The option SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic serpentine sorting. The default is SORT=SERP. See the section Sorting by CONTROL Variables for descriptions of serpentine and nested sorting. Where there is only one CONTROL variable, the two types of sorting are equivalent.

The SORT= option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). When you specify a CONTROL statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample.

The SORT= option and the CONTROL statement are not available when you specify a SAMPLINGUNIT statement. For more information, see the descriptions of the CONTROL and SAMPLINGUNIT statements.

When you specify a CONTROL statement, you can also use the OUTSORT= option to name an output data set that contains the sorted input data set. If you do not specify the OUTSORT= option, the sorted data set replaces the input data set.
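
To see how the two sort types order the frame, you can write each sorted data set to its own OUTSORT= data set and compare them. This sketch assumes SASHELP.CLASS as input with two CONTROL variables; the seed and data set names are illustrative.

```sas
/* Nested sorting by Sex, then Age within Sex. */
proc surveyselect data=sashelp.class method=sys n=5 sort=nest
                  seed=3344 outsort=work.class_nest
                  out=work.sys_nest;
   control Sex Age;
run;

/* Hierarchic serpentine sorting (the default) of the same frame. */
proc surveyselect data=sashelp.class method=sys n=5 sort=serp
                  seed=3344 outsort=work.class_serp
                  out=work.sys_serp;
   control Sex Age;
run;

/* Print both sorted frames to compare the ordering of Age */
/* within each value of Sex.                               */
proc print data=work.class_nest; var Sex Age Name; run;
proc print data=work.class_serp; var Sex Age Name; run;
```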

STATS

includes the selection probability and sampling weight in the OUT= output data set for equal probability selection methods when you do not specify a STRATA statement. By default, the output data set does not include these values for equal probability selection methods unless you specify a STRATA statement. The STATS option applies to the following selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI.

In addition to the selection probability and sampling weight, the STATS option includes the following statistics in the output data set for METHOD=BERNOULLI: total number of sampling units, expected sample size, actual sample size, and adjusted sampling weight. For more information, see the section Bernoulli Sampling.

For PPS selection methods, the output data set contains selection probabilities and sampling weights by default. The STATS option has no effect for PPS methods.

For more information about the contents of the output data set, see the section Sample Output Data Set.
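
A brief sketch of the STATS option for an unstratified simple random sample follows. The input data set, sample size, seed, and output name are assumptions; the variable names SelectionProb and SamplingWeight are the ones that the Sample Output Data Set section describes.

```sas
/* Unstratified SRS: STATS adds the selection probability */
/* and the sampling weight to the output data set.        */
proc surveyselect data=sashelp.class method=srs n=5 stats
                  seed=6677 out=work.class_stats;
run;

proc print data=work.class_stats;
   var Name SelectionProb SamplingWeight;
run;
```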