PROC SURVEYSELECT Statement :: SAS/STAT(R) 12.1 User's Guide

PROC SURVEYSELECT Statement

PROC SURVEYSELECT options ;

The PROC SURVEYSELECT statement invokes the SURVEYSELECT procedure. Optionally, it identifies input and output data sets. If you do not name a DATA= input data set, the procedure selects the sample from the most recently created SAS data set. If you do not name an OUT= output data set to contain the sample of selected units, the procedure still creates an output data set and names it according to the DATAn convention.

The PROC SURVEYSELECT statement also specifies the sample selection method, the sample size, and other sample design parameters.

If you do not specify a selection method, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you do specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default. See the description of the METHOD= option for more information.

You must specify the sample size or sampling rate except when you request Poisson sampling (METHOD=POISSON), request a method that selects two units from each stratum (METHOD=PPS_BREWER or METHOD=PPS_MURTHY), or specify the MARGIN= option in the STRATA statement for sample allocation. You can use the SAMPSIZE=n option to specify the sample size, or you can use the SAMPSIZE=SAS-data-set option to name a secondary input data set that contains stratum sample sizes.

You can also provide stratum sampling rates, minimum size measures, maximum size measures, and certainty size measures in the secondary input data set. See the descriptions of the SAMPSIZE=, SAMPRATE=, MINSIZE=, MAXSIZE=, CERTSIZE=, and CERTSIZE=P= options for more information. You can name only one secondary input data set in each invocation of the procedure. See the section Secondary Input Data Set for details.

Table 95.1 summarizes the options available in the PROC SURVEYSELECT statement. Descriptions of the options follow in alphabetical order.

Table 95.1: PROC SURVEYSELECT Statement Options

Option	Description
Input and Output Data Sets
DATA=	Names the input SAS data set
OUT=	Names the output SAS data set that contains the sample
OUTSORT=	Names an output SAS data set that stores the sorted input data set
Selection Method
METHOD=	Specifies the sample selection method
Sample Size
SAMPSIZE=	Specifies the sample size
SELECTALL	Selects all stratum units when the sample size exceeds the total
Sampling Rate
SAMPRATE=	Specifies the sampling rate
NMIN=	Specifies the minimum stratum sample size
NMAX=	Specifies the maximum stratum sample size
Replicated Sampling
REPS=	Specifies the number of sample replicates
Size Measures
MINSIZE=	Specifies the minimum size measure
MAXSIZE=	Specifies the maximum size measure
CERTSIZE=	Specifies the certainty size measure
CERTSIZE=P=	Specifies the certainty proportion
Control Sorting
SORT=	Specifies the type of sorting
Random Number Generation
SEED=	Specifies the initial seed
RANUNI	Requests the RANUNI random number generator
Displayed Output
NOPRINT	Suppresses the display of all output
OUT= Data Set Contents
JTPROBS	Includes joint probabilities of selection
OUTALL	Includes all observations from the DATA= input data set
OUTHITS	Includes a distinct copy of each selected unit
OUTSEED	Includes the initial seed for each stratum
OUTSIZE	Includes additional design and sampling frame information
STATS	Includes selection probabilities and sampling weights

You can specify the following options in the PROC SURVEYSELECT statement:

CERTSIZE

requests certainty selection, where the certainty size values are provided in the secondary input data set. Use the CERTSIZE option when you have already named the secondary data set in another option, such as the SAMPSIZE=SAS-data-set option. See the section Secondary Input Data Set for details.

The CERTSIZE option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE option is not available with the SAMPLINGUNIT statement.

In certainty selection, PROC SURVEYSELECT automatically selects all sampling units that have size measures greater than or equal to the stratum certainty size values. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

You provide the stratum certainty size values in the secondary input data set variable _CERTSIZE_. Each certainty size value must be a positive number. The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

If you want to specify a single certainty size value for all strata, you can use the CERTSIZE=certain option.

CERTSIZE=certain

specifies the certainty size value, which must be a positive number. PROC SURVEYSELECT automatically selects all sampling units that have size measures greater than or equal to the value certain. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

The CERTSIZE= option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE= option is not available with the SAMPLINGUNIT statement.

The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

If you request a stratified sample design with the STRATA statement and specify the CERTSIZE=certain option, PROC SURVEYSELECT uses the value certain for all strata. If you do not want to use the same certainty size for all strata, use the CERTSIZE=SAS-data-set option to specify a certainty size value for each stratum.
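
As a rough illustration of the certainty rule described for CERTSIZE=certain, the following Python sketch splits a small frame into certainty units (size measure greater than or equal to the certainty size value) and the remaining units. It is not PROC SURVEYSELECT code: the function name split_certainty, the (unit, size) tuple layout, and the sample values are hypothetical, and the subsequent PPS selection of the remainder is not modeled.

```python
def split_certainty(units, certain):
    """Split (unit_id, size) pairs into certainty units and the remainder.

    A unit is a certainty unit when its size measure is greater than or
    equal to the certainty size value. Hypothetical helper, not a SAS API.
    """
    if certain <= 0:
        raise ValueError("the certainty size value must be a positive number")
    certainty = [(u, s) for u, s in units if s >= certain]
    remainder = [(u, s) for u, s in units if s < certain]
    return certainty, remainder


# Made-up frame of four units with size measures.
frame = [("A", 120.0), ("B", 45.0), ("C", 300.0), ("D", 100.0)]
certainty, remainder = split_certainty(frame, certain=100)
print("certainty units:", certainty)
print("left for METHOD= selection:", remainder)
```

In this sketch a unit whose size equals the certainty size value exactly (unit D) is treated as a certainty unit, matching the "greater than or equal to" wording above.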

CERTSIZE=SAS-data-set

names a SAS data set that contains certainty size values for the strata. PROC SURVEYSELECT automatically selects all sampling units that have size measures greater than or equal to the stratum certainty size values. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

The CERTSIZE= option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE= option is not available with the SAMPLINGUNIT statement.

You provide the stratum certainty size values in the CERTSIZE= data set variable _CERTSIZE_. Each certainty size value must be a positive number. The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

The CERTSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the CERTSIZE= data set as in the DATA= data set. The CERTSIZE= data set must include a variable named _CERTSIZE_ that contains the certainty size value for each stratum. The CERTSIZE= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

If you want to specify a single certainty size value for all strata, you can use the CERTSIZE=certain option.

CERTSIZE=P

requests certainty proportion selection, where the stratum certainty proportions are provided in the secondary input data set. Use the CERTSIZE=P option when you have already named the secondary data set in another option, such as the SAMPSIZE=SAS-data-set option. See the section Secondary Input Data Set for details.

The CERTSIZE=P option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE=P option is not available with the SAMPLINGUNIT statement.

In certainty proportion selection, PROC SURVEYSELECT automatically selects all sampling units that have size measures greater than or equal to the stratum certainty proportion of the total stratum size. The procedure repeats this process with the remaining units until no more certainty units are selected. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

You provide the stratum certainty proportions in the secondary input data set variable _CERTP_. Each certainty proportion must be a positive number. You can specify a proportion value as a number between 0 and 1, or in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.

The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

If you want to specify a single certainty proportion for all strata, you can use the CERTSIZE=P=p option.

CERTSIZE=P=p

specifies the certainty proportion. PROC SURVEYSELECT automatically selects all sampling units that have size measures greater than or equal to the proportion p of the total stratum size. The procedure repeats this process with the remaining units until no more certainty units are selected. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

The CERTSIZE=P= option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE=P= option is not available with the SAMPLINGUNIT statement.

The value of the certainty proportion p must be a positive number. You can specify p as a number between 0 and 1, or in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.

The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

If you request a stratified sample design with the STRATA statement and specify the CERTSIZE=P=p option, PROC SURVEYSELECT uses the certainty proportion p for all strata. If you do not want to use the same certainty proportion for all strata, use the CERTSIZE=P=SAS-data-set option to specify a certainty proportion for each stratum.
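
The iterative rule described for CERTSIZE=P=p can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the names to_proportion and certainty_by_proportion are hypothetical, and the sketch assumes that "the total stratum size" is recomputed over the remaining (not yet certain) units at each pass.

```python
def to_proportion(p):
    """Convert p to a proportion: values up to 1 are used as given
    (so 1 means 100%), larger values up to 100 are read as percentages."""
    if p <= 0 or p > 100:
        raise ValueError("p must be positive and no greater than 100")
    return float(p) if p <= 1 else p / 100.0


def certainty_by_proportion(sizes, p):
    """Repeatedly move units whose size is >= proportion * (total size of
    the remaining units) into the certainty list, until a pass finds none."""
    prop = to_proportion(p)
    remaining = dict(sizes)
    certain = []
    while remaining:
        total = sum(remaining.values())
        hits = [u for u, s in remaining.items() if s >= prop * total]
        if not hits:
            break
        for u in hits:
            certain.append(u)
            del remaining[u]
    return certain, remaining


# Made-up stratum: unit -> size measure (total 100).
sizes = {"A": 50, "B": 25, "C": 9, "D": 9, "E": 7}
certain, remaining = certainty_by_proportion(sizes, 40)  # 40 is read as 40%
print(certain, remaining)
```

With these made-up sizes, the first pass takes A (50 is at least 40% of 100), the second pass takes B (25 is at least 40% of the remaining 50), and the third pass finds no unit at or above 40% of the remaining 25, so the loop stops.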

CERTSIZE=P=SAS-data-set

names a SAS data set that contains certainty proportions for the strata. PROC SURVEYSELECT automatically selects all sampling units with size measures greater than or equal to the certainty proportion of the total stratum size. The procedure repeats this process with the remaining units until no more certainty units are selected. After identifying the certainty units, PROC SURVEYSELECT selects the remainder of the sample according to the method that is specified in the METHOD= option.

The CERTSIZE=P= option is available for METHOD=PPS and METHOD=PPS_SAMPFORD. The CERTSIZE=P= option is not available with the SAMPLINGUNIT statement.

You provide the stratum certainty proportions in the CERTSIZE=P= data set variable _CERTP_. Each certainty proportion must be a positive number. You can specify a proportion value as a number between 0 and 1, or in percentage form as a number between 1 and 100, which PROC SURVEYSELECT converts to a proportion. The procedure treats the value 1 as 100% instead of 1%.

The variable Certain in the OUT= data set identifies the certainty selections, which have selection probabilities equal to 1.

The CERTSIZE=P= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the CERTSIZE=P= data set as in the DATA= data set. The CERTSIZE=P= data set must include a variable named _CERTP_ that contains the certainty proportion for each stratum. The CERTSIZE=P= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

If you want to specify a single certainty proportion for all strata, you can use the CERTSIZE=P=p option.

DATA=SAS-data-set

names the SAS data set from which PROC SURVEYSELECT selects the sample. If you omit the DATA= option, the procedure uses the most recently created SAS data set. In sampling terminology, the input data set is the sampling frame (the list of units from which the sample is selected).

By default, the procedure uses input data set observations as sampling units and selects a sample of these units. Alternatively, you can use the SAMPLINGUNIT statement to define sampling units as groups of observations (clusters).

JTPROBS

includes joint probabilities of selection in the OUT= output data set. This option is available for the following probability proportional to size selection methods: METHOD=PPS, METHOD=PPS_SAMPFORD, and METHOD=PPS_WR. By default, PROC SURVEYSELECT outputs joint selection probabilities for METHOD=PPS_BREWER and METHOD=PPS_MURTHY, which select two units per stratum.

For details about computation of joint selection probabilities for a particular sampling method, see the method description in the section Sample Selection Methods. For more information about the contents of the output data set, see the section Sample Output Data Set.

MAXSIZE

requests adjustment of size measures according to the stratum maximum size values provided in the secondary input data set. Use the MAXSIZE option when you have already named the secondary input data set in another option, such as the SAMPSIZE=SAS-data-set option. See the section Secondary Input Data Set for details.

The MAXSIZE option is available when you use size measures for any PPS selection method and also include a STRATA statement. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

You provide the stratum maximum size values in the secondary input data set variable _MAXSIZE_. Each maximum size value must be a positive number.

When a size measure exceeds the specified maximum value for its stratum, PROC SURVEYSELECT adjusts the size measure downward to equal the maximum size value. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), then the procedure applies the MAXSIZE adjustment to the sampling unit size. The sampling unit size equals the number of observations in the sampling unit if you specify the PPS option, or the sum of the observation size measures if you specify a SIZE statement. The output data set variable UnitSize contains the adjusted sampling unit size measures.

If you want to specify a single maximum size value for all strata, you can use the MAXSIZE=max option.

MAXSIZE=max

specifies the maximum size value. The value of max must be a positive number.

When a size measure exceeds the value max, PROC SURVEYSELECT adjusts the size measure downward to equal max. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

The MAXSIZE=max option is available when you use size measures for any PPS selection method. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

If you request a stratified sample design with the STRATA statement and specify the MAXSIZE=max option, PROC SURVEYSELECT uses the maximum size max for all strata. If you do not want to use the same maximum size for all strata, use the MAXSIZE=SAS-data-set option to specify a maximum size value for each stratum.
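
The downward adjustment described for MAXSIZE=max amounts to capping each size measure at max. A minimal Python sketch, assuming individual observations as sampling units (the function name apply_maxsize and the sample values are hypothetical, and this is not SAS code):

```python
def apply_maxsize(sizes, max_size):
    """Return size measures capped at max_size (the analogue of AdjustedSize)."""
    if max_size <= 0:
        raise ValueError("the maximum size value must be a positive number")
    return [min(s, max_size) for s in sizes]


sizes = [12.0, 250.0, 80.0, 1000.0]
adjusted = apply_maxsize(sizes, 200.0)
print(adjusted)
```

Only the two measures that exceed 200 change; the others pass through unchanged.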

MAXSIZE=SAS-data-set

names a SAS data set that contains maximum size values for the strata. You provide the stratum maximum size values in the MAXSIZE= data set variable _MAXSIZE_. Each maximum size value must be a positive number.

The MAXSIZE=SAS-data-set option is available when you use size measures for any PPS selection method and also include a STRATA statement. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

When a size measure exceeds the maximum size value for its stratum, PROC SURVEYSELECT adjusts the size measure downward to equal the maximum size value. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

The MAXSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MAXSIZE= data set as in the DATA= data set. The MAXSIZE= data set must include a variable named _MAXSIZE_ that contains the maximum size value for each stratum. The MAXSIZE= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

If you want to specify a single maximum size value for all strata, you can use the MAXSIZE=max option.

METHOD=name | M=name

specifies the method for sample selection.

If you do not specify the METHOD= option, PROC SURVEYSELECT uses simple random sampling (METHOD=SRS) by default unless you specify a SIZE statement or the PPS option in the SAMPLINGUNIT statement. If you do specify a SIZE statement (or the PPS option), PROC SURVEYSELECT uses probability proportional to size selection without replacement (METHOD=PPS) by default.

The following values are available for the METHOD= option:

BERNOULLI

requests Bernoulli sampling, which consists of N independent selection trials, each with constant inclusion probability $\pi$, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. See the section Bernoulli Sampling for details.

When you specify METHOD=BERNOULLI, you must provide the sampling rate (inclusion probability $\pi$) by using the SAMPRATE= option. For stratified sampling (which you request with the STRATA statement), you can specify the same sampling rate for each stratum with the SAMPRATE=r option. Or you can specify different sampling rates for different strata by using the SAMPRATE=(values) or SAMPRATE=SAS-data-set option.

Because Bernoulli sampling is based on a specified inclusion probability instead of a fixed sample size, METHOD=BERNOULLI does not use the SAMPSIZE= option. Also, the ALLOC= option in the STRATA statement (which allocates the total sample size among strata) is not available with METHOD=BERNOULLI.
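
To make the idea of N independent trials concrete, here is a small Python sketch of Bernoulli sampling. It only illustrates the concept: the function name bernoulli_sample is hypothetical, it uses Python's own random module, and it will not reproduce the samples that PROC SURVEYSELECT draws.

```python
import random


def bernoulli_sample(units, rate, seed):
    """Run one independent trial per unit, each with inclusion probability rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    return [u for u in units if rng.random() < rate]


frame = list(range(1, 1001))          # made-up frame of N = 1000 units
sample = bernoulli_sample(frame, rate=0.1, seed=2024)
print(len(sample))                    # random; near 100 on average, not fixed
```

Because each unit is decided independently, the realized sample size varies from run to run (for different seeds), which is why a fixed SAMPSIZE= value has no role in this design.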

POISSON

requests Poisson sampling. A generalization of Bernoulli sampling, Poisson sampling consists of N independent selection trials with a separate inclusion probability specified for each unit, where N is the total number of sampling units in the stratum or data set. The sample size is not fixed but is a random variable. See the section Poisson Sampling for details.

You must provide inclusion probabilities for Poisson sampling in the SIZE variable. The probability values should be between 0 and 1. If a value of the SIZE variable is missing, nonpositive, or greater than 1, PROC SURVEYSELECT omits the observation from sample selection.

Because Poisson sampling is based on specified inclusion probabilities instead of a fixed sample size, METHOD=POISSON does not use the SAMPSIZE= option. Also, the ALLOC= option in the STRATA statement (which allocates the total sample size among strata) is not available with METHOD=POISSON.

The SAMPLINGUNIT statement is not available with METHOD=POISSON.

When METHOD=POISSON is specified with the SAMPRATE= option and without a SIZE statement, PROC SURVEYSELECT uses METHOD=BERNOULLI.
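
The following Python sketch illustrates the Poisson sampling rules described above: one independent trial per unit with its own inclusion probability, and omission of units whose probability is missing, nonpositive, or greater than 1. The function name poisson_sample and the use of None for a missing value are assumptions made for illustration; this is not the procedure's code.

```python
import random


def poisson_sample(frame, seed):
    """frame is a list of (unit, probability) pairs; returns (selected, omitted)."""
    rng = random.Random(seed)
    selected, omitted = [], []
    for unit, prob in frame:
        # Missing, nonpositive, or > 1: leave the unit out of selection entirely.
        if prob is None or prob <= 0 or prob > 1:
            omitted.append(unit)
            continue
        if rng.random() < prob:
            selected.append(unit)
    return selected, omitted


frame = [("A", 1.0), ("B", 0.0), ("C", None), ("D", 1.5), ("E", 0.3)]
selected, omitted = poisson_sample(frame, seed=7)
print(selected, omitted)
```

Unit A has probability 1 and is always selected, units B, C, and D are never considered, and unit E is selected in roughly 30% of runs across different seeds.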

PPS

requests selection with probability proportional to size and without replacement. See the section PPS Sampling without Replacement for details. If you specify METHOD=PPS, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_BREWER | BREWER

requests selection according to Brewer’s method. Brewer’s method selects two units from each stratum with probability proportional to size and without replacement. See the section Brewer’s PPS Method for details. If you specify METHOD=PPS_BREWER, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size with the SAMPSIZE= option because Brewer’s method selects two units from each stratum.

PPS_MURTHY | MURTHY

requests selection according to Murthy’s method. Murthy’s method selects two units from each stratum with probability proportional to size and without replacement. See the section Murthy’s PPS Method for details. If you specify METHOD=PPS_MURTHY, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement. You do not need to specify the sample size with the SAMPSIZE= option because Murthy’s method selects two units from each stratum.

PPS_SAMPFORD | SAMPFORD

requests selection according to Sampford’s method. Sampford’s method selects units with probability proportional to size and without replacement. See the section Sampford’s PPS Method for details. If you specify METHOD=PPS_SAMPFORD, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SEQ | CHROMY

requests sequential selection with probability proportional to size and with minimum replacement. This method is also known as Chromy’s method. See the section PPS Sequential Sampling for details. If you specify METHOD=PPS_SEQ, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_SYS

requests systematic selection with probability proportional to size. See the section PPS Systematic Sampling for details. If you specify METHOD=PPS_SYS, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

PPS_WR

requests selection with probability proportional to size and with replacement. See the section PPS Sampling with Replacement for details. If you specify METHOD=PPS_WR, you must name a size measure variable in the SIZE statement or specify the PPS option in the SAMPLINGUNIT statement.

SEQ

requests sequential selection according to Chromy’s method. If you specify METHOD=SEQ and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses sequential zoned selection with equal probability and without replacement. See the section Sequential Random Sampling for details.

If you specify METHOD=SEQ and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses METHOD=PPS_SEQ, which is sequential selection with probability proportional to size and with minimum replacement. See the section PPS Sequential Sampling for more information.

SRS

requests simple random sampling, which is selection with equal probability and without replacement. See the section Simple Random Sampling for details. METHOD=SRS is the default if you do not specify the METHOD= option and also do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement).

SYS

requests systematic random sampling. If you specify METHOD=SYS and do not specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses systematic selection with equal probability. See the section Systematic Random Sampling for more information.

If you specify METHOD=SYS and also specify a SIZE statement (or the PPS option in the SAMPLINGUNIT statement), PROC SURVEYSELECT uses METHOD=PPS_SYS, which is systematic selection with probability proportional to size. See the section PPS Systematic Sampling for details.

URS

requests unrestricted random sampling, which is selection with equal probability and with replacement. See the section Unrestricted Random Sampling for details.
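
The defaulting and substitution rules scattered through the METHOD= descriptions above can be summarized in a short Python sketch. The function resolve_method and its arguments are hypothetical; has_size stands for "a SIZE statement or the PPS option in the SAMPLINGUNIT statement is present", and the sketch covers only the rules stated in this section.

```python
ALIASES = {
    "BREWER": "PPS_BREWER",
    "MURTHY": "PPS_MURTHY",
    "SAMPFORD": "PPS_SAMPFORD",
    "CHROMY": "PPS_SEQ",
}


def resolve_method(method=None, has_size=False, has_samprate=False):
    """Return the selection method that is effectively used."""
    if method is None:
        # No METHOD=: PPS when size measures are given, otherwise SRS.
        return "PPS" if has_size else "SRS"
    method = ALIASES.get(method.upper(), method.upper())
    if method == "SEQ" and has_size:
        return "PPS_SEQ"
    if method == "SYS" and has_size:
        return "PPS_SYS"
    if method == "POISSON" and has_samprate and not has_size:
        return "BERNOULLI"
    return method


print(resolve_method())                                   # SRS
print(resolve_method(has_size=True))                      # PPS
print(resolve_method("sys", has_size=True))               # PPS_SYS
print(resolve_method("poisson", has_samprate=True))       # BERNOULLI
```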

MINSIZE

requests adjustment of size measures according to the stratum minimum size values provided in the secondary input data set. Use the MINSIZE option when you have already named the secondary input data set in another option, such as the SAMPSIZE=SAS-data-set option. See the section Secondary Input Data Set for details.

The MINSIZE option is available when you use size measures for any PPS selection method and also include a STRATA statement. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

You provide the stratum minimum size values in the secondary input data set variable _MINSIZE_. Each minimum size value must be a positive number.

When a size measure is less than the specified minimum value for its stratum, PROC SURVEYSELECT adjusts the size measure upward to equal the minimum size value. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

If you use a SAMPLINGUNIT statement to define sampling units (clusters), then the procedure applies the MINSIZE adjustment to the sampling unit size. The sampling unit size equals the number of observations in the sampling unit if you specify the PPS option, or the sum of the observation size measures if you specify a SIZE statement. The output data set variable UnitSize contains the adjusted sampling unit size measures.

If you want to specify a single minimum size value for all strata, you can use the MINSIZE=min option.
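
For cluster sampling units, the MINSIZE adjustment described above applies to the sampling unit size rather than to individual observations. The Python sketch below computes unit sizes both ways (observation count for the PPS option, sum of size measures for a SIZE statement) and then raises small units to the minimum. All names here (unit_sizes, apply_minsize, the cluster IDs) are hypothetical, and handling of missing or invalid size measures is not modeled.

```python
from collections import defaultdict


def unit_sizes(observations, use_size_variable):
    """observations: (cluster_id, size) pairs.

    use_size_variable=False mimics the PPS option (count observations);
    use_size_variable=True mimics a SIZE statement (sum the size measures).
    """
    totals = defaultdict(float)
    for cluster_id, size in observations:
        totals[cluster_id] += size if use_size_variable else 1
    return dict(totals)


def apply_minsize(unit_size_by_cluster, min_size):
    """Raise any sampling unit size below min_size up to min_size."""
    return {c: max(s, min_size) for c, s in unit_size_by_cluster.items()}


obs = [("h1", 3.0), ("h1", 4.5), ("h2", 1.0),
       ("h3", 2.0), ("h3", 2.0), ("h3", 6.0)]
by_count = apply_minsize(unit_sizes(obs, use_size_variable=False), 2)
by_sum = apply_minsize(unit_sizes(obs, use_size_variable=True), 5)
print(by_count)
print(by_sum)
```

In both runs only cluster h2 is adjusted: its count of 1 is raised to 2, and its summed size of 1.0 is raised to 5.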

MINSIZE=min

specifies the minimum size value. The value of min must be a positive number.

When a size measure is less than the value min, PROC SURVEYSELECT adjusts the size measure upward to equal min. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

The MINSIZE=min option is available when you use size measures for any PPS selection method. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

If you request a stratified sample design with the STRATA statement and specify the MINSIZE=min option, PROC SURVEYSELECT uses the minimum size min for all strata. If you do not want to use the same minimum size for all strata, use the MINSIZE=SAS-data-set option to specify a minimum size value for each stratum.

MINSIZE=SAS-data-set

names a SAS data set that contains minimum size values for the strata. You provide the stratum minimum size values in the MINSIZE= data set variable _MINSIZE_. Each minimum size value must be a positive number.

The MINSIZE=SAS-data-set option is available when you use size measures for any PPS selection method and also include a STRATA statement. You provide size measures by specifying the SIZE statement or the PPS option in the SAMPLINGUNIT statement.

When a size measure is less than the minimum size value for its stratum, PROC SURVEYSELECT adjusts the size measure upward to equal the minimum size value. If your sampling units are individual observations, the variable AdjustedSize in the OUT= data set contains the adjusted size measures.

The MINSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the MINSIZE= data set as in the DATA= data set. The MINSIZE= data set must include a variable named _MINSIZE_ that contains the minimum size value for each stratum. The MINSIZE= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

If you want to specify a single minimum size value for all strata, you can use the MINSIZE=min option.

NMAX=n

specifies the maximum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is greater than the value NMAX=n, then PROC SURVEYSELECT selects only n units.

The maximum sample size n must be a positive integer. The NMAX= option is available only with the SAMPRATE= option, which can be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMAX= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.

NMIN=n

specifies the minimum stratum sample size n for the SAMPRATE= option. When you specify the SAMPRATE= option, PROC SURVEYSELECT calculates the stratum sample size by multiplying the total number of units in the stratum by the specified sampling rate. If this sample size is less than the value NMIN=n, then PROC SURVEYSELECT selects n units.

The minimum sample size n must be a positive integer. The NMIN= option is available only with the SAMPRATE= option, which can be used with equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, and METHOD=SEQ). The NMIN= option is not available with METHOD=BERNOULLI, where the SAMPRATE= option specifies the constant inclusion probability.
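
The interaction of SAMPRATE= with NMIN= and NMAX= can be sketched as a bounded calculation. In the Python sketch below, the function name stratum_sample_size is hypothetical, and rounding the product up to the next integer is an assumption that this section does not state; see the SAMPRATE= option description for the actual rounding rule.

```python
import math


def stratum_sample_size(n_units, rate, nmin=None, nmax=None):
    """Stratum sample size from a sampling rate, bounded by nmin and nmax."""
    n = math.ceil(n_units * rate)      # assumed rounding rule (round up)
    if nmin is not None and n < nmin:
        n = nmin
    if nmax is not None and n > nmax:
        n = nmax
    return n


# Three made-up strata sampled at a 25% rate with NMIN=5 and NMAX=100.
for stratum_total in (10, 200, 1000):
    print(stratum_total, stratum_sample_size(stratum_total, 0.25, nmin=5, nmax=100))
```

With these values, the stratum of 10 units is raised from 3 to the minimum of 5, the stratum of 200 units keeps its computed size of 50, and the stratum of 1000 units is cut from 250 to the maximum of 100.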

NOPRINT

suppresses the display of all output. You can use the NOPRINT option when you want only to create an output data set. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 20: Using the Output Delivery System.

OUT=SAS-data-set

names the output data set that contains the sample. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique.
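
The DATAn naming rule can be pictured with a few lines of Python. This is only a sketch of "the smallest integer that makes the name unique": the function next_default_name is hypothetical, and starting the search at n = 1 and comparing names without regard to case are assumptions.

```python
def next_default_name(existing_names):
    """Return DATAn for the smallest n (starting at 1) not already in use."""
    taken = {name.upper() for name in existing_names}
    n = 1
    while f"DATA{n}" in taken:
        n += 1
    return f"DATA{n}"


print(next_default_name(["data1", "DATA2", "Customers"]))
```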

The output data set contains the units that are selected for the sample, in addition to design information and selection statistics, depending on the selection method and output options that you request. See descriptions of the options JTPROBS, OUTALL, OUTHITS, OUTSEED, OUTSIZE, and STATS, which specify information to include in the output data set. See the section Sample Output Data Set for details about the contents of the output data set.

By default, the output data set contains only those units that are selected for the sample. To include all observations from the input data set in the output data set, use the OUTALL option.

By default, the output data set includes one copy of each selected unit, even when a unit is selected more than once, which can occur when you use with-replacement or with-minimum-replacement selection methods. For with-replacement or with-minimum-replacement selection methods, the output data set includes a variable NumberHits that records the number of hits (selections) for each unit. To include a distinct copy of each selection in the output data set when the same unit is selected more than once, use the OUTHITS option.

If you specify the NOSAMPLE option in the STRATA statement, PROC SURVEYSELECT allocates the total sample size among the strata but does not select the sample. In this case, the OUT= data set contains the allocated sample sizes. See the section Allocation Output Data Set for details.

OUTALL

includes all observations from the DATA= input data set in the OUT= output data set. By default, the output data set includes only those units selected for the sample. When you specify the OUTALL option, the output data set includes all observations from the input data set and also contains a variable that indicates each observation’s selection status. The variable Selected equals 1 for an observation that is selected for the sample, and equals 0 for an observation that is not selected. For information about the contents of the output data set, see the section Sample Output Data Set.

The OUTALL option is available for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). The OUTALL option is also available for METHOD=POISSON.
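
The effect of OUTALL on the output can be pictured with this Python sketch, which keeps every frame row and adds a 0/1 Selected flag. The row layout (dictionaries with an "id" key) and the function name outall are hypothetical; only the Selected variable and its coding come from the description above.

```python
def outall(frame, selected_ids):
    """Return all frame rows, each with Selected = 1 (in sample) or 0 (not)."""
    return [
        {**row, "Selected": 1 if row["id"] in selected_ids else 0}
        for row in frame
    ]


frame = [{"id": 1, "state": "NC"}, {"id": 2, "state": "VA"}, {"id": 3, "state": "SC"}]
rows = outall(frame, selected_ids={1, 3})
for row in rows:
    print(row)
```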

OUTHITS

includes a distinct copy of each selected unit in the OUT= output data set when the same sampling unit is selected more than once. By default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits equals m. For example, with the OUTHITS option a unit that is selected three times is represented by three copies in the output data set.

A sampling unit can be selected more than once by with-replacement and with-minimum-replacement selection methods, which include METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ. The OUTHITS option is available for these selection methods.

See the section Sample Output Data Set for details about the contents of the output data set.
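The difference between the default output and the OUTHITS output can be sketched as follows. The frame, the data set names Compact and Expanded, and the seed are hypothetical; the sample size deliberately exceeds the frame size so that repeated selections are guaranteed under METHOD=URS.

```sas
/* Hypothetical frame of only 10 units */
data Frame;
   do UnitID = 1 to 10;
      output;
   end;
run;

/* Default: one observation per distinct selected unit;      */
/* NumberHits records how many times each unit was selected  */
proc surveyselect data=Frame method=urs sampsize=20
                  seed=1953 out=Compact;
run;

/* OUTHITS: one observation per selection, so a unit with    */
/* NumberHits = m appears m times in the output data set     */
proc surveyselect data=Frame method=urs sampsize=20
                  seed=1953 out=Expanded outhits;
run;
```

With these settings, Expanded should contain one observation for each of the 20 selections, whereas Compact can contain at most 10 observations.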

OUTSEED

includes the initial seed for each stratum in the OUT= output data set. The variable InitialSeed contains the stratum initial seeds. See the section Sample Output Data Set for details about the contents of the output data set.

To reproduce the same sample for any stratum in a subsequent execution of PROC SURVEYSELECT, you can specify the same stratum initial seed with the SEED=SAS-data-set option, along with the same sample selection parameters. See the section Random Number Generation for more information.
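One way to carry stratum seeds from one run to the next is sketched below, under the assumption of a hypothetical frame stratified by a variable named Region. The data set names Sample1, StratumSeeds, and Sample2 are illustrative.

```sas
/* Hypothetical frame: 3 strata with 100, 200, and 300 units, */
/* already in ascending order of the STRATA variable          */
data Frame;
   do Region = 1 to 3;
      do Unit = 1 to 100 * Region;
         output;
      end;
   end;
run;

/* First run: OUTSEED adds the variable InitialSeed to OUT= */
proc surveyselect data=Frame method=srs sampsize=5
                  seed=1953 out=Sample1 outseed;
   strata Region;
run;

/* Reduce to one observation per stratum with its initial seed */
proc sort data=Sample1(keep=Region InitialSeed)
          out=StratumSeeds nodupkey;
   by Region;
run;

/* Later run: supply the stratum seeds as a secondary input data set */
proc surveyselect data=Frame method=srs sampsize=5
                  seed=StratumSeeds out=Sample2;
   strata Region;
run;
```

Because the SEED= data set is a secondary input data set, this sketch passes the sample size with SAMPSIZE=n rather than with a second data set.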

The “Sample Selection Summary” table displays the initial random number seed for the entire sample selection, which is the same as the initial seed for the first stratum when the design is stratified. To reproduce the entire sample, you can specify this same seed value in the SEED= option, along with the same sample selection parameters.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases, PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option. To reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1, specify the RANUNI option with the SEED= option (for the same input data set and sample selection parameters).

OUTSIZE

includes additional design and sampling frame information in the OUT= output data set.

If you use a STRATA statement, the OUTSIZE option provides stratum-level values in the output data set. Otherwise, the OUTSIZE option provides overall values.

The OUTSIZE option includes the sample size or sampling rate in the output data set, depending on whether you specify the SAMPSIZE= option or the SAMPRATE= option, respectively. For PPS selection methods, the OUTSIZE option includes the total size measure in the output data set. If you do not provide size measures, or if you specify a SAMPLINGUNIT statement, the OUTSIZE option includes the total number of sampling units.

If you request size measure adjustment or certainty selection, the OUTSIZE option includes the following information in the output data set: the minimum size measure if you specify the MINSIZE= option, the maximum size measure if you specify the MAXSIZE= option, the certainty size measure if you specify the CERTSIZE= option, and the certainty proportion if you specify the CERTSIZE=P= option.

For METHOD=BERNOULLI, the OUTSIZE option includes the following information in the output data set: total number of sampling units, selection probability (sampling rate), expected sample size, and actual sample size. See the section Bernoulli Sampling for descriptions of these statistics.

For more information about the contents of the output data set, see the section Sample Output Data Set.

OUTSORT=SAS-data-set

names an output data set to store the sorted input data set. This option is available when you specify a CONTROL statement to sort the DATA= input data set for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ).

If you specify CONTROL variables but do not name an output data set with the OUTSORT= option, then the sorted data set replaces the input data set.

RANUNI

requests uniform random number generation by the method of Fishman and Moore (1982), which PROC SURVEYSELECT used in releases prior to SAS/STAT 12.1. This is the same random number generator that the RANUNI function provides.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. Developed by Matsumoto and Nishimura (1998), the Mersenne-Twister random number generator has a very long period and good statistical properties. This is the random number generator that the RAND function provides for the uniform distribution.

See the section Random Number Generation for details, and see SAS Functions and CALL Routines: Reference for information about the RANUNI and RAND functions.

You can specify the RANUNI option with the SEED= option to reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1. To reproduce a sample by using the RANUNI and SEED= options, you must also specify the same input data set and sample selection parameters.

REPS=nreps

specifies the number of sample replicates. The value of nreps must be a positive integer.

When you specify the REPS= option, PROC SURVEYSELECT selects nreps independent samples, each with the same sample size or sampling rate and the same sample design that you request. The variable Replicate in the OUT= data set contains the sample replicate number.

You can use replicated sampling to provide a simple method of variance estimation for any form of statistic, and also to evaluate variable nonsampling errors such as interviewer differences. For information about replicated sampling, see Lohr (2010), Wolter (2007), Kish (1965, 1987), and Kalton (1983). You can also use the REPS= option to perform a variety of other resampling and simulation tasks. See Cassell (2007) for more information.
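As a rough illustration of replicated sampling, the following sketch draws five independent replicates and computes a mean within each one. The frame, the analysis variable Y, and the data set name Replicates are hypothetical.

```sas
/* Hypothetical frame with an arbitrary analysis variable Y */
data Frame;
   do UnitID = 1 to 200;
      Y = mod(UnitID * 37, 101);
      output;
   end;
run;

/* Five independent simple random samples of 20 units each */
proc surveyselect data=Frame method=srs sampsize=20 reps=5
                  seed=1953 out=Replicates;
run;

/* The variable Replicate identifies the replicate for each unit */
proc means data=Replicates mean;
   by Replicate;
   var Y;
run;
```

The spread of the per-replicate means gives a simple indication of sampling variability for the statistic of interest.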

SAMPRATE=r RATE=r

specifies the sampling rate, which is the proportion of units to select for the sample. The sampling rate r must be a positive number. You can specify r as a number between 0 and 1. Or you can specify r in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100% instead of 1%.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the sampling rate r as the selection interval. See the section Systematic Random Sampling for details. For Bernoulli sampling (METHOD=BERNOULLI), PROC SURVEYSELECT uses the sampling rate r as the inclusion probability. See the section Bernoulli Sampling for details. For the other equal probability selection methods, PROC SURVEYSELECT converts the sampling rate r to the sample size before selection by multiplying the total number of units in the stratum or frame by the sampling rate and rounding up to the nearest integer.

If you request a stratified sample design with the STRATA statement and specify the SAMPRATE=r option, PROC SURVEYSELECT uses the sampling rate r for each stratum. If you do not want to use the same sampling rate for each stratum, use the SAMPRATE=(values) option or the SAMPRATE=SAS-data-set option to specify a sampling rate for each stratum.
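The two ways of writing a sampling rate can be compared with a short sketch. The frame and the output data set names RateProp and RatePct are assumptions made for illustration.

```sas
/* Hypothetical frame of 1,000 units */
data Frame;
   do UnitID = 1 to 1000;
      output;
   end;
run;

/* Proportion form: request 10% of the frame */
proc surveyselect data=Frame method=srs samprate=0.10
                  seed=1953 out=RateProp;
run;

/* Percentage form: 10 is read as 10%, so this requests the same rate */
proc surveyselect data=Frame method=srs samprate=10
                  seed=1953 out=RatePct;
run;
```

Note that SAMPRATE=1 would request the entire frame, because the value 1 is interpreted as 100%.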

SAMPRATE=(values) RATE=(values)

specifies stratum sampling rates, where the stratum sampling rate is the proportion of units to select from the stratum. You can separate values with blanks or commas. The number of SAMPRATE= values must equal the number of strata in the input data set.

List the stratum sampling rate values in the order in which the strata appear in the input data set. When you use the SAMPRATE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

Each stratum sampling rate value must be a nonnegative number. You can specify a rate value as a number between 0 and 1. Or you can specify a rate value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100% instead of 1%.

To select a sample from a stratum, the value of the stratum sampling rate must be positive. If you specify a stratum sampling rate of 0, then PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the stratum selection interval. See the section Systematic Random Sampling for details. For Bernoulli sampling (METHOD=BERNOULLI), PROC SURVEYSELECT uses the stratum sampling rate as the inclusion probability for the stratum. See the section Bernoulli Sampling for details. For the other equal probability selection methods, PROC SURVEYSELECT converts the stratum sampling rate to the stratum sample size before selection by multiplying the total number of units in the stratum by the sampling rate and rounding up to the nearest integer.

SAMPRATE=SAS-data-set RATE=SAS-data-set

names a SAS data set that contains stratum sampling rates, where the stratum sampling rate is the proportion of units to select from the stratum. The SAMPRATE= data set should include a variable _RATE_ that contains the stratum sampling rates.

Each sampling rate value must be a nonnegative number. You can specify a rate value as a number between 0 and 1. Or you can specify a rate value in percentage form as a number between 1 and 100, and PROC SURVEYSELECT converts that number to a proportion. The procedure treats the value 1 as 100% instead of 1%.

The SAMPRATE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPRATE= data set as in the DATA= data set.

The SAMPRATE= option is available only for equal probability selection methods (METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI). For systematic random sampling (METHOD=SYS), PROC SURVEYSELECT uses the inverse of the stratum sampling rate as the stratum selection interval. See the section Systematic Random Sampling for details. For Bernoulli sampling (METHOD=BERNOULLI), PROC SURVEYSELECT uses the stratum sampling rate as the inclusion probability for the stratum. See the section Bernoulli Sampling for details. For the other equal probability selection methods, PROC SURVEYSELECT converts the stratum sampling rate to the stratum sample size before selection by multiplying the total number of units in the stratum by the sampling rate and rounding up to the nearest integer.
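A minimal sketch of a SAMPRATE= secondary input data set follows. The stratification variable Region, the rate values, and the data set names StratumRates and RateSample are hypothetical; only the variable name _RATE_ comes from the description above.

```sas
/* Hypothetical frame: 3 strata with 100, 200, and 300 units */
data Frame;
   do Region = 1 to 3;
      do Unit = 1 to 100 * Region;
         output;
      end;
   end;
run;

/* One observation per stratum, in the same order as in Frame; */
/* Region has the same type and length as in the DATA= data set */
data StratumRates;
   input Region _RATE_;
   datalines;
1 0.10
2 0.05
3 0.02
;

/* Stratified simple random sampling with stratum-specific rates */
proc surveyselect data=Frame method=srs samprate=StratumRates
                  seed=1953 out=RateSample;
   strata Region;
run;
```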

SAMPSIZE=n N=n

specifies the sample size, which is the number of units to select for the sample. The sample size n must be a positive integer. For selection methods that select without replacement, the sample size n must not exceed the number of units in the input data set.

If you do not specify a SAMPLINGUNIT statement, then your sampling units are observations, and PROC SURVEYSELECT selects n observations. If you use a SAMPLINGUNIT statement to define sampling units as groups of observations (clusters), then the procedure selects n clusters.

If you specify the SAMPSIZE=n option and request stratified selection with the STRATA statement, PROC SURVEYSELECT selects n units from each stratum unless you also specify the ALLOC= option in the STRATA statement to allocate the total sample size among the strata.

If you specify the ALLOC= option in the STRATA statement and the SAMPSIZE=n option, PROC SURVEYSELECT allocates the total sample size n among the strata according to the allocation method that you request. See the section Sample Size Allocation for details. If you specify the MARGIN= option with the ALLOC= option in the STRATA statement, PROC SURVEYSELECT determines the stratum sample sizes that provide the requested margin of error for the allocation. Therefore, you cannot use the SAMPSIZE= option with the MARGIN= option.

For methods that select without replacement, the sample size n must not exceed the number of units in any stratum. If you do not want to select the same number of units from each stratum, use the SAMPSIZE=(values) option or the SAMPSIZE=SAS-data-set option to specify a sample size for each stratum.

For without-replacement selection methods, by default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units available in the stratum. If you specify the SELECTALL option, PROC SURVEYSELECT selects all stratum units when the stratum sample size exceeds the number of units in the stratum.
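The difference between a per-stratum sample size and an allocated total can be sketched as follows. The frame and data set names are hypothetical, and ALLOC=PROP (proportional allocation) is used here only as one example of an allocation method.

```sas
/* Hypothetical frame: 3 strata with 100, 200, and 300 units */
data Frame;
   do Region = 1 to 3;
      do Unit = 1 to 100 * Region;
         output;
      end;
   end;
run;

/* Without ALLOC=: 60 units are selected from EACH stratum */
proc surveyselect data=Frame method=srs sampsize=60
                  seed=1953 out=PerStratum;
   strata Region;
run;

/* With ALLOC=: 60 is the TOTAL sample size, divided among strata */
proc surveyselect data=Frame method=srs sampsize=60
                  seed=1953 out=Allocated;
   strata Region / alloc=prop;
run;
```

For strata of 100, 200, and 300 units, proportional shares of a total of 60 work out to 10, 20, and 30 units.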

SAMPSIZE=(values) N=(values)

specifies stratum sample sizes, where the stratum sample size is the number of units to select from the stratum. You can separate values with blanks or commas. The number of SAMPSIZE= values must equal the number of strata in the input data set.

List the stratum sample size values in the order in which the strata appear in the input data set. When you use the SAMPSIZE=(values) option, the input data set must be sorted by the STRATA variables in ascending order. You cannot use the DESCENDING or NOTSORTED option in the STRATA statement.

Each stratum sample size value must be a nonnegative integer. To select a sample from a stratum, the value of the stratum sample size must be positive. If you specify a stratum sample size of 0, then PROC SURVEYSELECT does not select a sample from the stratum. This has the effect of subsetting the input data set before sample selection; the stratum that you omit is not included in the sampling frame or represented in the sample.
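As an illustration of listing stratum sample sizes, including a value of 0 to leave a stratum out, consider this sketch. The frame, the variable Region, and the data set name ListSample are assumptions.

```sas
/* Hypothetical frame: 3 strata, sorted by Region in ascending order */
data Frame;
   do Region = 1 to 3;
      do Unit = 1 to 100 * Region;
         output;
      end;
   end;
run;

/* Values are matched to strata in order: 5 units from Region 1, */
/* 10 units from Region 2, and no sample from Region 3           */
proc surveyselect data=Frame method=srs sampsize=(5 10 0)
                  seed=1953 out=ListSample;
   strata Region;
run;
```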

SAMPSIZE=SAS-data-set N=SAS-data-set

names a SAS data set that contains stratum sample sizes, where the stratum sample size is the number of units to select from the stratum. The SAMPSIZE= input data set should include a variable named _NSIZE_ or SampleSize that contains the stratum sample sizes.

The SAMPSIZE= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SAMPSIZE= data set as in the DATA= data set. The SAMPSIZE= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

SEED

indicates that stratum-level initial seeds are included in the secondary input data set. Use the SEED option when you have already named the secondary input data set in another option, such as the SAMPSIZE=SAS-data-set option. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

You provide the stratum initial seeds in the secondary input data set variable named _SEED_ or InitialSeed. The initial seeds must be positive integers.

See the description of the SEED=SAS-data-set option for more information about initial seeds for random number generation.

SEED=number

specifies the initial seed for random number generation. The SEED= value must be a positive integer. If you do not specify the SEED= option, or if the SEED= value is negative or 0, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed. See the section Random Number Generation for more information.

If you request a stratified sample design with the STRATA statement, you can use the SEED=SAS-data-set option to specify an initial seed for each stratum. Otherwise, PROC SURVEYSELECT generates random numbers continuously across strata from the random number stream initialized by the SEED= value.

You can use the OUTSEED option to include the stratum initial seeds in the output data set.

Whether or not you specify the SEED= option, PROC SURVEYSELECT displays the value of the initial seed in the “Sample Selection Summary” table. If you need to reproduce the same sample in a subsequent execution of PROC SURVEYSELECT, you can specify this same seed value in the SEED= option, along with the same sample selection parameters, and PROC SURVEYSELECT will reproduce the sample.
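A simple way to check that a seed reproduces a sample is sketched here. The frame, the seed value 39647, and the data set names Run1 and Run2 are hypothetical.

```sas
/* Hypothetical frame of 200 units */
data Frame;
   do UnitID = 1 to 200;
      output;
   end;
run;

/* Two executions with the same seed and the same selection parameters */
proc surveyselect data=Frame method=srs sampsize=20
                  seed=39647 out=Run1;
run;

proc surveyselect data=Frame method=srs sampsize=20
                  seed=39647 out=Run2;
run;

/* Compare the two samples observation by observation */
proc compare base=Run1 compare=Run2;
run;
```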

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases, PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option. To reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1, use the RANUNI option with the SEED= option (for the same input data set and sample selection parameters).

SEED=SAS-data-set

names a SAS data set that contains initial seeds for the strata. You provide the stratum seeds in the SEED= input data set variable _SEED_ or InitialSeed.

The initial seed values must be positive integers. If the initial seed value for the first stratum is not a positive integer, PROC SURVEYSELECT uses the time of day from the computer’s clock to obtain the initial seed. If the initial seed value for a subsequent stratum is not a positive integer, PROC SURVEYSELECT continues to use the random number stream already initialized by the seed for the previous stratum. See the section Sample Selection Methods for more information.

The SEED= input data set should contain all the STRATA variables, with the same type and length as in the DATA= data set. The STRATA groups should appear in the same order in the SEED= data set as in the DATA= data set. The SEED= data set is a secondary input data set. See the section Secondary Input Data Set for details. You can name only one secondary input data set in each invocation of the procedure.

You can use the OUTSEED option to include the stratum initial seeds in the output data set.

If you specify initial seeds by strata with the SEED=SAS-data-set option, you can reproduce the same sample in a subsequent execution of PROC SURVEYSELECT by specifying these same stratum initial seeds, along with the same sample selection parameters. If you need to reproduce the same sample for only a subset of the strata, you can use the same initial seeds for those strata in the subset.

Beginning in SAS/STAT 12.1, PROC SURVEYSELECT uses the Mersenne-Twister random number generator by default. In previous releases, PROC SURVEYSELECT used the RANUNI random number generator, which you can now request by specifying the RANUNI option. To reproduce samples that PROC SURVEYSELECT selected in releases prior to SAS/STAT 12.1, use the RANUNI option with the SEED= option (for the same input data set and sample selection parameters).

SELECTALL

requests that PROC SURVEYSELECT select all stratum units when the stratum sample size exceeds the total number of units in the stratum. By default, PROC SURVEYSELECT does not allow you to specify a stratum sample size that is greater than the total number of units in the stratum, unless you are using a with-replacement selection method.

The SELECTALL option is available for the following without-replacement selection methods: METHOD=SRS, METHOD=SYS, METHOD=SEQ, METHOD=PPS, and METHOD=PPS_SAMPFORD.

The SELECTALL option is not available for with-replacement selection methods, with-minimum-replacement methods, or those PPS methods that select two units per stratum.
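The effect of SELECTALL can be sketched with a frame in which one stratum is smaller than the requested sample size. The frame and data set names are assumptions made for illustration.

```sas
/* Hypothetical frame: 3 strata with 100, 200, and 300 units */
data Frame;
   do Region = 1 to 3;
      do Unit = 1 to 100 * Region;
         output;
      end;
   end;
run;

/* Request 150 units per stratum. Region 1 has only 100 units, */
/* so SELECTALL takes all of them instead of stopping with an  */
/* error; Regions 2 and 3 are sampled as usual                 */
proc surveyselect data=Frame method=srs sampsize=150 selectall
                  seed=1953 out=SelectAllSample;
   strata Region;
run;
```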

SORT=NEST | SERP

specifies the type of sorting by CONTROL variables. The option SORT=NEST requests nested sorting, and SORT=SERP requests hierarchic serpentine sorting. The default is SORT=SERP. See the section Sorting by CONTROL Variables for descriptions of serpentine and nested sorting. Where there is only one CONTROL variable, the two types of sorting are equivalent.

The SORT= option is available when you specify a CONTROL statement for systematic or sequential selection methods (METHOD=SYS, METHOD=PPS_SYS, METHOD=SEQ, and METHOD=PPS_SEQ). When you specify a CONTROL statement, PROC SURVEYSELECT sorts the input data set by the CONTROL variables within strata before selecting the sample.

The SORT= option and the CONTROL statement are not available with a SAMPLINGUNIT statement. See the descriptions of the CONTROL and SAMPLINGUNIT statements for more information.

When you specify a CONTROL statement, you can also use the OUTSORT= option to name an output data set that contains the sorted input data set. Otherwise, if you do not specify the OUTSORT= option, the sorted data set replaces the input data set.
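The following sketch combines the SORT= and OUTSORT= options for systematic selection. The frame, the CONTROL variables Type and Size, and the data set names are hypothetical.

```sas
/* Hypothetical frame with two illustrative CONTROL variables */
data Frame;
   do UnitID = 1 to 200;
      Type = mod(UnitID, 4);
      Size = mod(UnitID * 7, 13);
      output;
   end;
run;

/* Systematic selection after nested sorting by Type and Size.    */
/* OUTSORT= stores the sorted frame in a separate data set, so    */
/* the DATA= input data set is not replaced by its sorted version */
proc surveyselect data=Frame method=sys sampsize=20 sort=nest
                  seed=1953 out=SysSample outsort=FrameSorted;
   control Type Size;
run;
```

Omitting SORT=NEST would request the default serpentine sorting instead.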

STATS

includes the selection probability and sampling weight in the OUT= output data set for equal probability selection methods when you do not specify a STRATA statement. By default, the output data set does not include these values for equal probability selection methods unless you specify a STRATA statement. The STATS option applies to the following selection methods: METHOD=SRS, METHOD=URS, METHOD=SYS, METHOD=SEQ, and METHOD=BERNOULLI.

In addition to the selection probability and sampling weight, the STATS option includes the following statistics in the output data set for METHOD=BERNOULLI: total number of sampling units, expected sample size, actual sample size, and adjusted sampling weight. See the section Bernoulli Sampling for more information.

For PPS selection methods, the output data set contains selection probabilities and sampling weights by default. The STATS option has no effect for PPS methods.

For more information about the contents of the output data set, see the section Sample Output Data Set.
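As a brief sketch of the STATS option for an unstratified equal probability design, the following steps print the added variables. The frame and the data set name StatSample are assumptions; SelectionProb and SamplingWeight are the output variable names assumed here for the selection probability and sampling weight.

```sas
/* Hypothetical frame of 200 units */
data Frame;
   do UnitID = 1 to 200;
      output;
   end;
run;

/* Unstratified SRS: STATS adds the selection probability */
/* and the sampling weight to the OUT= data set           */
proc surveyselect data=Frame method=srs sampsize=20 stats
                  seed=1953 out=StatSample;
run;

proc print data=StatSample(obs=5);
   var UnitID SelectionProb SamplingWeight;
run;
```

For 20 units drawn from 200, each unit's selection probability is 20/200 = 0.1, which corresponds to a sampling weight of 10.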
