Multiple partitions
are necessary to read the data in parallel. The option PARTSIZE= forces
the software to partition SPD Engine data files at the specified size.
The actual size of the partition is computed to accommodate the maximum
number of observations that fit in the specified size of
n megabytes, gigabytes, or terabytes. If you
have a table with an observation length greater than 65K, you might
find that the PARTSIZE= that you specify and the actual partition
size do not match. To get these numbers to match, specify a PARTSIZE=
that is a multiple of 32 and the observation length.
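For example, a LIBNAME statement can set the partition size for a library (the libref and paths here are hypothetical; PARTSIZE= accepts a megabyte, gigabyte, or terabyte suffix):

```
/* Hypothetical libref and paths */
libname sport spde '/spde/meta'
   datapath=('/spde/data1' '/spde/data2')
   partsize=256m;   /* partition the data files at about 256 MB each */
```

PARTSIZE= can also be specified as a data set option on an output data set, which overrides the LIBNAME value for that data set.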
By splitting (partitioning)
the data portion of an SPD Engine data set into fixed-sized files,
the software can introduce a high degree of scalability for some operations.
The SPD Engine can spawn threads in parallel (for example, up to one
thread per partition for WHERE evaluations). Separate data partitions
also enable the SPD Engine to process the data without the overhead
of file access contention between the threads. Because each partition
is one file, the trade-off for a small partition size is that an increased
number of files (for example, UNIX i-nodes) is required to store
the observations.
The scalability that PARTSIZE= provides
depends on how you configure and spread the file systems
specified in the DATAPATH= option across striped volumes. (You should
spread each individual volume's striping configuration across multiple
disk controllers or SCSI channels in the disk storage array.) The
goal for the configuration, at the hardware level, is to maximize
parallelism during data retrieval. For information about disk striping,
see “I/O Setup and Validation” under “SPD Engine”
in Scalability and Performance at
http://support.sas.com/rnd/scalability.
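As a sketch, a DATAPATH= configuration spread across several volumes might look like the following (the mount points are assumptions; each /volN is assumed to be a separate striped volume, ideally served by its own controller or SCSI channel):

```
/* Hypothetical libref and mount points */
libname sales spde '/spde/meta'
   datapath=('/vol1/spde' '/vol2/spde' '/vol3/spde' '/vol4/spde')
   partsize=512m;
```

The SPD Engine distributes the partition files across the DATAPATH= file systems, so threads reading different partitions can access different disks in parallel.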
The PARTSIZE= specification
is limited by the SPD Engine system option MINPARTSIZE=, which is
usually maintained by the system administrator. MINPARTSIZE= ensures
that an inexperienced user does not arbitrarily create small partitions,
thereby generating a large number of data files.
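For example, a system administrator might set MINPARTSIZE= at SAS invocation, either on the command line or in the configuration file (the value shown is only an illustration):

```
/* In the SAS configuration file or on the command line: */
-minpartsize 256m
```

With a setting like this, a PARTSIZE= request below the minimum does not produce smaller partitions.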
The partition size determines
a unit of work for many of the parallel operations that require full
data set scans. However, more partitions do not always mean faster processing.
The trade-offs involve balancing the increased number of physical
files (partitions) required to store the data set against the amount
of work that can be done in parallel by having more partitions. More
partitions means more open files to process the data set, but a smaller
number of observations in each partition. A general rule is to have
10 or fewer partitions per data path, and 3 to 4 partitions per CPU.
(Some operating systems have a limit on the number of open files that
you can use.)
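As a sketch of this rule of thumb, the following DATA step computes a candidate partition size for a data set stored across four data paths (all of the figures are assumptions for illustration):

```
/* Assumed figures for illustration only */
data _null_;
   ds_size_gb = 64;              /* total data set size in GB      */
   npaths     = 4;               /* file systems in DATAPATH=      */
   max_parts  = 10 * npaths;     /* rule: 10 or fewer per path     */
   partsize_gb = ceil(ds_size_gb / max_parts);
   put 'Suggested PARTSIZE: ' partsize_gb 'GB';
run;
```

For these numbers, 40 partitions of 2 GB each would keep the partition count within the suggested limit.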
To determine an adequate
partition size for a new SPD Engine data set, you should be aware
of the following:
- the types of applications that run against the data
- how many CPUs are available to the applications
- which disks are available for storing the partitions
- the relationships of these disks to the CPUs
For example, if each
CPU controls only one disk, then an appropriate partition size would
be one in which each disk contains approximately the same amount of
data. If each CPU controls two disks, then an appropriate partition
size would be one in which the load is balanced so that each CPU
does approximately the same amount of work.
Note: The PARTSIZE= value for a
data set cannot be changed after the data set is created. To change
PARTSIZE=, you must re-create the data set and specify a different
PARTSIZE= value in the LIBNAME statement or on the new (output) data
set.
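For example, assuming a libref SPORT and a data set RACQUETS (hypothetical names), the data set could be re-created with a larger partition size:

```
/* Re-create the data set with a new PARTSIZE= (names are hypothetical) */
data sport.racquets2 (partsize=1g);
   set sport.racquets;
run;
```

The new data set is repartitioned as its observations are written; the original data set can then be deleted or the new one renamed, for example with PROC DATASETS.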