Four macros control the process of duplicate-data checking:
- %RMDUPINT loads macro definitions that are used by the other duplicate-data checking macros.
- %RMDUPDSN generates the name (WORK._DUPCNTL) of the temporary SAS data set that will contain datetime ranges for the data that is being processed into the active IT data mart. It also generates the name of the sourceAUDIT data set for duplicate checking.
- %RMDUPCHK checks for duplicate data by examining timestamps on data being read by the staging code. This macro also writes to the temporary control data set.
- %RMDUPUPD updates the permanent control data sets with information from a temporary control data set through the intermediate control data sets.
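The division of labor among the four macros can be pictured as a four-phase pipeline. The following is a hypothetical Python sketch; the function names and data structures are illustrative stand-ins for the SAS macros, not their actual implementations:

```python
# Illustrative four-phase pipeline mirroring the roles of the
# duplicate-checking macros. Hypothetical stand-ins only.

def rmdupint():
    """Phase 1 (%RMDUPINT): load shared definitions for the other phases."""
    return {"initialized": True}

def rmdupdsn(source):
    """Phase 2 (%RMDUPDSN): generate the control and audit data set names."""
    return {"temp_control": "WORK._DUPCNTL",
            "audit": f"{source}AUDIT"}

def rmdupchk(records, temp_control, perm_control):
    """Phase 3 (%RMDUPCHK): keep records whose datetime has not been
    seen before, recording every datetime in the temporary control set."""
    kept = []
    for rec in records:
        if rec["datetime"] not in perm_control:
            kept.append(rec)
        temp_control.add(rec["datetime"])
    return kept

def rmdupupd(temp_control, perm_control):
    """Phase 4 (%RMDUPUPD): merge the temporary control set into the
    permanent control set."""
    perm_control.update(temp_control)
```

On a second run over the same records, phase 3 finds every datetime already in the permanent control set and keeps nothing, which is the behavior the macros are designed to produce.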
Each of the duplicate-data
checking macros performs a specific task. Together, these macros set
up and manage duplicate-data checking. The macros are designed to
check your data and to prevent duplicate data from being processed
into the IT data mart. However, sometimes it is necessary to process
data in a datetime range for which a machine's or system's data was
already processed. For example, you might need to process data into
a table that you did not use earlier or that you accidentally deleted.
You can specify that the data is acceptable even though it appears
to be duplicate data.
As raw data is being read, one of the macros that performs duplicate-data checking reviews the datetime information in each record. It then stores the information in a SAS data set called a temporary control data set. Later, by using intermediate control data sets, another macro merges the information in the temporary control data set into one or more SAS data sets that are called permanent control data sets.
When additional data
is processed into the IT data mart, the timestamps of the incoming
data are compared with the datetime information in the permanent control
data sets in order to determine whether the new data has already been
processed. If it has, the duplicate data is handled in the way that
you specify.
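This control-data-set flow can be sketched in simplified form. The Python below is an illustrative assumption, not the actual SAS logic: it tracks one datetime range per machine, discards or force-accepts apparent duplicates according to a caller-specified policy, and then merges the run's ranges into the permanent control set, which also covers the first-run case where no permanent control data exists yet:

```python
# Simplified sketch of timestamp-based duplicate checking against a
# permanent control set keyed by machine. Hypothetical, not SAS code.

def process(records, perm_control, on_duplicate="discard"):
    """Compare each record's timestamp with the datetime range already
    recorded for its machine; handle duplicates per the given policy."""
    temp_control = {}              # this run's ranges (temporary control set)
    kept, dupes = [], []
    for rec in records:
        machine, ts = rec["machine"], rec["ts"]
        seen = perm_control.get(machine)
        is_dup = seen is not None and seen[0] <= ts <= seen[1]
        if is_dup and on_duplicate == "discard":
            dupes.append(rec)
        else:
            kept.append(rec)       # "force" accepts apparent duplicates
        lo, hi = temp_control.get(machine, (ts, ts))
        temp_control[machine] = (min(lo, ts), max(hi, ts))
    # merge this run's ranges into the permanent control set
    for machine, (lo, hi) in temp_control.items():
        plo, phi = perm_control.get(machine, (lo, hi))
        perm_control[machine] = (min(plo, lo), max(phi, hi))
    return kept, dupes
```

On an empty permanent control set (the first run), every record is kept and the control set is populated; rerunning the same data then flags every record as a duplicate unless the force policy is used.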
A duplicate-data report is printed in the SAS log after the data is read. The report describes how many records were read for each machine or system and how many duplicates were found, if any. (If you specified Report=No on the Duplicate Checking page of the Staging Parameters tab of the staging transformation, this information is written to the sourceAUDIT file.)
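The per-machine summary could be tallied along these lines. This is a hypothetical Python sketch; the actual report layout in the SAS log is not reproduced here:

```python
# Hypothetical per-machine duplicate-data summary; the real SAS log
# report's format differs.
from collections import Counter

def duplicate_report(records, duplicates):
    """Count records read and duplicates found for each machine."""
    read = Counter(rec["machine"] for rec in records)
    dup = Counter(rec["machine"] for rec in duplicates)
    return [f"{machine}: {read[machine]} read, "
            f"{dup.get(machine, 0)} duplicates"
            for machine in sorted(read)]
```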
Note: The first time you run a job with duplicate-data checking enabled, the permanent control data sets have not yet been built, so the %RMDUPCHK macro cannot check the input records. Your data is not checked or rejected for duplicates, but the permanent control data sets are created and the datetime information for this data is saved to them. Data is checked only by datetime, although SMF data is also checked by system name. (For example, if you try to add a new record type, but you have already read other record types from that adapter for that time period, the records are not kept.) The duplicate-data report contains only a limited amount of information about your data.