Allocating the Library Space

How to Allocate the Library Space

To realize performance gains through SPD Engine's partitioned data I/O and threading capabilities, the SPD Engine libraries must be properly configured and managed. Optimally, a SAS system administrator performs these tasks.

An SPD Engine data set requires a file system with enough space to store the various component files. Often that file system includes multiple directories for these components. Usually, a single directory path (part of a file system) is constrained by a volume limit for the file system as a whole. This limit is the maximum amount of disk space configured for the file system to use.

Within this maximum disk space, you must allocate enough space for all of the SPD Engine component files. Understanding how each component file is handled is critical to estimating the amount of storage that you need in each library.

Configuring Space for All Components in a Single Path

In the simplest SPD Engine library configuration, all of the SPD Engine component files (data files, metadata files, and index files) can reside in a single path called the primary path. The primary path is the default path specification in the LIBNAME statement. The following LIBNAME statement sets up the primary file system for the MYLIB library:

libname mylib spde '/disk1/spdedata';

Because no other path options are specified, all component files are created in this primary path. Storing all component file types in the primary path is simple and works for very small data sets. However, it does not take advantage of the performance boost that storing components separately can achieve, nor does it take advantage of multiple CPUs.

Note: The SPD Engine requires complete pathnames to be specified.

Configuring Separate Library Space for Each Component File

Most sites use the SPD Engine to manage very large amounts of data, which can have thousands of variables, some of them indexed. At these sites, separate storage paths are usually defined for the various component types. In addition, using disk striping and RAID (Redundant Array of Independent Disks) can be very efficient. For more information, see “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.

All metadata component files must begin in the primary path, even if they span devices. In addition, specifying separate paths for the data and index components provides further performance gains because the I/O load is distributed across disk drives. Separating the data and index components helps prevent disk contention and increases the level of parallelism that can be achieved, especially in complex WHERE evaluations. The following example code specifies a primary path for the metadata. The code uses the DATAPATH= and INDEXPATH= options to specify additional, separate paths for the data and index component files:

libname all_users spde '/disk1/metadata'  
   datapath= ('/disk2/userdata' '/disk3/userdata')  
   indexpath= ('/disk4/userindexes' '/disk5/userindexes');

The metadata is stored on disk1, which is the primary path. The data is on disk2 and disk3, and the indexes are on disk4 and disk5. For all path specifications, you must specify the complete pathname.

CAUTION:

The primary path must be unique for each library.

If two librefs are created with the same primary path, but with differences in the other paths, data can be lost. You cannot use NFS in any path option other than the primary path.

Note: If you plan to store data on locally mounted drives and access the data from a remote computer, use the remote pathname in the LIBNAME statement. For example, if /data01 and /data02 are locally mounted drives on the localA computer, use the pathnames /nfs/localA/data01 and /nfs/localA/data02 in the LIBNAME statement.

Note: You cannot change the pathnames of the files. When you specify the DATAPATH=, INDEXPATH=, METAPATH=, or primary path LIBNAME options, make sure that the identical paths that were used when the data set was created are used every time you access the data sets. The pathnames for these locations are stored internally in the data set. If you change any part of the pathname, the SPD Engine might not be able to find the data set, or it might damage the data set.

Anticipating the Space for Each Component File

To properly configure the SPD Engine library space, you need to understand the relative sizes of the SPD Engine component files. The following information provides a general overview. For more information, see the “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.

Metadata component files are relatively small files, so the primary path might be large enough to contain all the metadata files for the library.

Index component files (both .idx and .hbx) can be medium to large, depending on the number of distinct values in each index and whether the indexes are single or composite indexes. When an index component file grows beyond the space available in the current file path, a new component file is created in the next path.

Data component files can be quite numerous, depending on the amount of data and the partition size specified for the data set. Each data partition is stored as a separate data component file. The size of the data partitions is specified in the PARTSIZE= LIBNAME Statement Option. Data files are the only component files for which you can specify a partition size.
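As a rough illustration of how the partition size drives the number of data component files, the following DATA step sketch estimates the partition count for an uncompressed data set. The observation count, observation length, and number of data paths are hypothetical values. The calculation ignores compression and any adjustment that the SPD Engine makes to the partition size, so treat the result as an approximation only.

```sas
data _null_;
   nobs        = 50000000;  /* hypothetical number of observations */
   obs_len     = 400;       /* hypothetical observation length, in bytes */
   partsize_mb = 128;       /* partition size, in megabytes */
   n_paths     = 2;         /* hypothetical number of DATAPATH= directories */

   data_mb  = nobs * obs_len / (1024 * 1024);  /* total data size, in MB */
   n_parts  = ceil(data_mb / partsize_mb);     /* estimated data component files */
   per_path = ceil(n_parts / n_paths);         /* estimated files in the fullest path */

   put data_mb= n_parts= per_path=;            /* write the estimates to the log */
run;
```

With these assumed values, the estimate is about 150 data component files, or about 75 in each of the two data paths.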

Storage of the Metadata Component Files

Metadata Component Files

Much of the information that the SPD Engine needs to efficiently read and write partitioned data is stored in the metadata component. The SPD Engine must be able to rapidly access that metadata. By design, the SPD Engine expects every data set's metadata component to begin in the primary path. These metadata component files can overflow into other paths (specified in the METAPATH= LIBNAME Statement Option), but they must always begin in the primary path. This concept is very important to understand because it directly affects whether you can add data sets (with their associated metadata files) to the library.

For example, suppose that you try to create a new data set in a library whose primary path is full. The SPD Engine cannot begin the metadata component file in that primary path as required, so the create operation fails with an appropriate error message. To successfully create a new data set in this case, you must either free space in the primary path or assign a new library. You cannot use the METAPATH= option to create space for a new data set's first metadata partition. METAPATH= specifies only overflow space for a metadata component that begins in the primary path but has expanded to fill the space reserved in the primary path. Your metadata component might grow to exceed the file size or library space limitations. To ensure that you have space in the primary path for additional data sets, specify an overflow path for metadata in the METAPATH= option when you first create the library.
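The following LIBNAME statement is a minimal sketch of a library that is created with metadata overflow space from the start. All of the directory names are hypothetical. The metadata for each data set still begins in the primary path, and the METAPATH= directory is used only when a metadata component outgrows that path.

```sas
/* Sketch only: every directory name below is hypothetical. */
libname mylib spde '/disk1/spdedata'              /* primary path: metadata begins here */
   metapath=('/disk6/metaoverflow')               /* overflow space for metadata only */
   datapath=('/disk2/userdata' '/disk3/userdata') /* data partitions */
   indexpath=('/disk4/userindexes');              /* index component files */
```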

You can specify additional space at a later time for data and index component files, even if you specified separate paths in the initial LIBNAME statement.

Initial Set of Paths

In the following example, the LIBNAME statement specifies the MYLIB directory for the primary path. By default, this path is used to store initial metadata partitions. Other devices and directories are specified to store the data and index partitions.

libname myref spde 'mylib'
   datapath=('/mydisk30/siteuser')
   indexpath=('/mydisk31/siteuser');

Adding Subsequent Paths

Later, if more space is needed (for example, for appending large amounts of data), you can add devices for the data and indexes, as in the following example:

libname myref spde 'mylib'
   datapath=('/mydisk30/siteuser' '/mydisk32/siteuser' '/mydisk33/siteuser') 
   indexpath=('/mydisk31/siteuser' '/mydisk34/siteuser');

Storage of the Index Component Files

Index component files are stored based on overflow space. You can specify several file paths with the INDEXPATH= option. Index files are started in the first available space, and then overflow to the next file path when the previous space is filled. Unlike metadata components, index component files do not have to begin in the primary path.

Storage of the Data Partitions

The data component partitions are the only files for which you can specify the size. Partitioned data can be processed in threads easily, thereby taking full advantage of multiple CPUs on your computer.

The partition size for the data component is fixed and it is set at the time the data set is created. The default is 128 megabytes, but you can specify a different partition size using the PARTSIZE= option. Performance depends on appropriate partition sizes. You are responsible for knowing the size and uses of the data. SPD Engine data sets can be created with a partition size that results in a balanced number of observations. (For more information, see PARTSIZE= Data Set Option.)
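As an illustration, the following sketch sets a partition size for a library and then sets a different size for one data set. The libref, directory names, data set names, and size values are hypothetical and are not recommendations. The sketch assumes that the data set option value takes precedence over the LIBNAME value for that data set.

```sas
/* Sketch only: names, paths, and sizes are hypothetical. */
libname sales spde '/disk1/metadata'
   datapath=('/disk2/userdata' '/disk3/userdata')
   partsize=256m;                 /* applies to data sets created through this libref */

/* Request a different partition size for one data set when it is created. */
data sales.orders(partsize=512m);
   set work.orders;               /* hypothetical input data set */
run;
```

Because the partition size is fixed when the data set is created, choose the value before you load the data.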

Many data partitions can be created in each data path for a given data set. The SPD Engine uses the file paths that you specify with the DATAPATH= option to distribute partitions in a cyclic fashion. The SPD Engine creates the first data partition in one of the specified paths, the second partition in the next path, and so on. The software continues to cycle among the file paths, as many times as needed, until all data partitions for the data set are stored. The path for the first partition is selected at random.

Assume that you specify the following in your LIBNAME statement:

datapath=('/data1' '/data2')

If the SPD Engine stores the first partition in /data1, it stores the second partition in /data2, the third partition in /data1, and so on. Cyclical distribution of the data partitions creates disk striping, which can be highly efficient. Disk striping is discussed in detail in “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.
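The cyclic placement can be modeled with a short DATA step. This is only an illustrative sketch of the pattern, not the SPD Engine's internal logic. The starting path index and the number of partitions are assumptions chosen for the example.

```sas
data _null_;
   length path $ 20;
   /* Paths from the DATAPATH= example */
   array paths[2] $ 20 _temporary_ ('/data1' '/data2');
   start_idx = 1;   /* assumption: the first partition lands in the first path */

   do part = 1 to 5;   /* hypothetical number of partitions */
      /* Step through the paths in order and wrap around after the last one. */
      idx  = mod(start_idx - 1 + part - 1, dim(paths)) + 1;
      path = paths[idx];
      put part= path=;   /* write the partition number and its path to the log */
   end;
run;
```

With these assumptions, the log shows the partitions alternating between /data1 and /data2, beginning with /data1.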

Renaming, Copying, or Moving Component Files

CAUTION:

Do not rename, copy, or move an SPD Engine data set or its component files using operating system commands.

You should always use the COPY procedure to copy SPD Engine data sets from one location to another, or the DATASETS procedure to rename or delete SPD Engine data sets.
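The following sketch shows what these procedure calls might look like. The librefs, directory names, and data set names are hypothetical. Each library has its own primary path.

```sas
/* Sketch only: librefs, paths, and data set names are hypothetical. */
libname oldlib spde '/disk1/spdedata';
libname newlib spde '/disk7/spdedata'
   datapath=('/disk8/userdata' '/disk9/userdata');

/* Copy one data set to the new library. */
proc copy in=oldlib out=newlib;
   select mytable;
run;

/* Rename one data set and delete another in the original library. */
proc datasets lib=oldlib nolist;
   change mytable=mytable_old;
   delete scratch;
quit;
```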