Efficient Indexing in the SPD Engine

Parallel Indexing

Indexes can improve the performance of WHERE expression processing and BY expression processing. The SPD Engine enables the rapid creation and update of indexes because it can process in parallel.

The SPD Engine's indexes are especially appropriate for data sets of varying sizes and data distributions. These indexes contain both a segmented view and a global view of indexed variables' values. This feature enables the SPD Engine to optimally support both of the following queries:

queries that require global data views, such as BY expression processing
queries that require segmented views, such as parallel processing of WHERE expressions

When an SPD Engine data set is encrypted, only the data component files are encrypted. None of the other files are encrypted, such as the metadata and index files.

Parallel Index Creation

You can create indexes on your SPD Engine data in parallel, asynchronously. To enable asynchronous parallel index creation, use the ASYNCINDEX= Data Set Option.

Use this option with the DATA step INDEX= option, with PROC DATASETS INDEX CREATE commands, or in the PROC APPEND statement when creating an SPD Engine data set from a default Base SAS engine data set that has an index. Each method enables all of the declared indexes to be populated from a single scan of the data set.

Note: If you create an SPD Engine data set from a default Base SAS engine data set that is encrypted and that has an index, the index is not encrypted in the SPD Engine data set. For more information, see Converting Default Base SAS Engine Data Sets to SPD Engine Data Sets.

The following example shows indexes created in parallel using the DATA step. A simple index is created on variable X and a composite index is created on variables A and B.

data foo.mine(index=(x y=(a b)) asyncindex=yes);   
   x=1;
   a="Doe";
   b=20;
run;

To create multiple indexes in parallel, you must allocate enough utility disk space to create all of the key sorts at the same time. You must also allocate enough memory space. Use the SPDEUTILLOC= System Option to allocate disk space and SPDEINDEXSORTSIZE= System Option in the configuration file or at invocation to allocate additional memory.

The DATASETS procedure has the flexibility to enable batched parallel index creation by using multiple MODIFY groups. Instead of creating all of the indexes at once, which would require a significant amount of space, you can create the indexes in groups as shown in the following example:

proc datasets lib=main;
   modify patients(asyncindex=yes);
      index create number;
      index create class;
   run;
   modify patients(asyncindex=yes)'
      index create lastname firstname;
   run;
   modify patients(asyncindex=yes);
      index create fullname=(lastname firstname);
      index create class_sex=(class sex);
   run;
quit;

Indexes NUMBER and CLASS are created in parallel, indexes LASTNAME and FIRSTNAME are created in parallel, and indexes FULLNAME and CLASS_SEX are created in parallel.

Note: You cannot create an index or composite index on a variable if the variable name contains any of the following special characters:

"  *  |  \  :  /  <  >  ?  -

Parallel Index Updates

The SPD Engine also supports parallel index updating during data set append operations. Multiple threads enable updates of the data store and index files. The SPD Engine decomposes a data set append or insert operation into a set of steps that can be performed in parallel. The level of parallelism attained depends on the number of indexes in the data set. As with parallel index creation, this operation uses memory and disk space for the key sorts that are part of the index append processing. Use system options SPDEINDEXSORTSIZE= to allocate memory and SPDEUTILLOC= to allocate disk space.