Glossary
- analysis data set
-
in SAS data quality, a SAS output data set that
provides information about the degree of divergence in specified character
values.
- Blue Fusion data format
-
a file format for schemes that can be created
and applied in data quality software from SAS and from DataFlux (a
SAS company). Schemes in Blue Fusion data format are sometimes referred
to as BFD schemes. Schemes can also be created in SAS format.
- case definition
-
a part of a locale that is referenced during data
cleansing to impose a consistent usage of uppercase and lowercase
letters on character values.
- cleanse
-
to improve the consistency and accuracy of data
by standardizing it, reorganizing it, and eliminating redundancy.
- cluster
-
in SAS data quality, a set of character values
that have the same match code.
- composite match code
-
a match code that consists of a concatenation
of match codes from values from two or more input character variables
in the same observation. A delimiter can be specified to separate
the individual match codes in the concatenation.
- compound match code
-
a match code that consists of a concatenation
of match codes that are created for each token in a delimited or parsed
string. Within a compound match code, individual match codes might
be separated by a delimiter.
- data analysis
-
in SAS data quality, the process of evaluating
input data sets in order to determine whether data cleansing is needed.
- data cleansing
-
the process of eliminating inaccuracies, irregularities,
and discrepancies from data.
- data quality
-
the relative value of data, which is based on
the accuracy of the knowledge that can be generated using that data.
High-quality data is consistent, accurate, and unambiguous, and it
can be processed efficiently.
- data transformation
-
in SAS data quality, a cleansing process that
applies a scheme to a specified character variable. The scheme creates
match codes internally to create clusters. All values in each cluster
are then transformed to the standardization value that is specified
in the scheme for each cluster.
- delimiter
-
a character that serves as a boundary that separates
the elements of a text string.
- gender definition
-
a part of a locale that is referenced during data
cleansing to determine the gender of individuals based on the names
of those individuals.
- guess definition
-
a part of a locale that is referenced in order to
select, from the locale list, the locale that is the best choice
for use in the analysis or cleansing of the specified character values.
- identification definition
-
a part of a locale that is referenced during data
analysis or data cleansing to determine categories for specified character
values.
- locale
-
a setting that reflects the language, local conventions,
and culture for a geographic region. Local conventions can include
specific formatting rules for paper sizes, dates, times, and numbers,
and a currency symbol for the country or region. Some examples of
locale values are French_Canada, Portuguese_Brazil, and Chinese_Singapore.
- locale list
-
an ordered list of locales that is loaded into
memory prior to data analysis or data cleansing. The first locale
in the list is the default locale.
- match
-
a set of values that produce identical match codes
or identical match code components. Values that have identical match
codes are assigned to the same cluster.
- match code
-
an encoded version of a character value that is
created as a basis for data analysis and data cleansing. Match codes
are used to cluster and compare character values.
- match definition
-
a part of a locale that is referenced during the
creation of match codes. Each match definition is specific to a category
of data content. In the ENUSA locale, for example, match definitions
are provided for names, e-mail addresses, and street addresses, among
others.
- name prefix
-
a title of respect or a professional title that
precedes a first name or an initial. For example, Mr., Mrs., and Dr.
are name prefixes.
- name suffix
-
a part of a name that follows the last name. For
example, Jr. and Sr. are name suffixes.
- parse
-
to analyze text, such as a SAS statement, for
the purpose of separating it into its constituent words, phrases,
punctuation marks, values, or other types of information. The information
can then be analyzed according to a definition or set of rules.
- parse definition
-
a part of a locale that is referenced during the
parsing of character values. The parse definition specifies the number
and location of the delimiters that are inserted during parsing. The
location of the delimiters depends on the content of the character
values.
- parse token
-
a named element that can be assigned a value during
parsing. The specified parse definition provides the criteria that
detect the value in the string. After the value is detected and assigned
to the token, the character value can be manipulated using the name
of the token.
- parsed string
-
in SAS data quality, a text string in which a
delimiter and a name have been inserted at the beginning of each token
in that string. The string is automatically parsed by referencing
a parse definition.
- Quality Knowledge Base
-
a collection of locales and other information
that is referenced during data analysis and data cleansing. For example,
to create match codes for a data set that contains street addresses
in Great Britain, you would reference the ADDRESS match definition
in the ENGBR locale in the Quality Knowledge Base.
- scheme
-
a reusable collection of match codes and standardization
values that is applied to input character values for the purposes
of transformation or analysis.
- sensitivity
-
in SAS data quality, a value that specifies the
amount of information in match codes. Greater sensitivity values result
in match codes that contain greater amounts of information. As sensitivity
values increase, character values must be increasingly similar to
generate the same match codes.
- standardization definition
-
a part of a locale that is referenced during data
cleansing to impose a specified format on character values.
- standardize
-
to eliminate unnecessary variation in data in
order to maximize the consistency and accuracy of the data.
- token
-
in SAS data quality, a named word or phrase in
a parsed or delimited string that can be individually analyzed and
cleansed.
- transformation
-
in data integration, an operation that extracts
data, transforms data, or loads data into data stores.
- transformation value
-
in SAS data quality, the most frequently occurring
value in a cluster. In data cleansing, this value is propagated to
all of the values in the cluster.
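
Several of the terms above (match code, sensitivity, cluster, composite match code, and transformation value) describe steps that build on one another. The following Python sketch illustrates how they relate. It is a toy model only: the encoding in `toy_match_code` is an assumption made for illustration and is not the algorithm that SAS data quality software uses, which derives match codes from match definitions in the Quality Knowledge Base. All function names here are hypothetical.

```python
from collections import Counter, defaultdict


def toy_match_code(value: str, sensitivity: int = 4) -> str:
    """Hypothetical encoding: uppercase the value, keep letters and digits,
    drop vowels (and Y) after the first character, then keep at most
    `sensitivity` characters. A higher sensitivity keeps more information."""
    cleaned = "".join(ch for ch in value.upper() if ch.isalnum())
    if not cleaned:
        return ""
    head, tail = cleaned[0], cleaned[1:]
    consonants = "".join(ch for ch in tail if ch not in "AEIOUY")
    return (head + consonants)[:sensitivity]


def cluster_values(values, sensitivity: int = 4) -> dict:
    """Group values that share a match code (a cluster)."""
    clusters = defaultdict(list)
    for value in values:
        clusters[toy_match_code(value, sensitivity)].append(value)
    return dict(clusters)


def transform(values, sensitivity: int = 4) -> list:
    """Replace each value with the most frequent value in its cluster
    (the transformation value)."""
    clusters = cluster_values(values, sensitivity)
    replacement = {
        code: Counter(members).most_common(1)[0][0]
        for code, members in clusters.items()
    }
    return [replacement[toy_match_code(v, sensitivity)] for v in values]


def composite_match_code(fields, delimiter: str = "|", sensitivity: int = 4) -> str:
    """Concatenate the match codes of several variables in one observation."""
    return delimiter.join(toy_match_code(f, sensitivity) for f in fields)


names = ["Smith", "Smyth", "Smith", "Smithson", "Jones"]

# Low sensitivity: "Smithson" falls into the same cluster as "Smith".
print(transform(names, sensitivity=4))
# ['Smith', 'Smith', 'Smith', 'Smith', 'Jones']

# Higher sensitivity: values must be more similar to share a match code.
print(transform(names, sensitivity=6))
# ['Smith', 'Smith', 'Smith', 'Smithson', 'Jones']

print(composite_match_code(["Smith", "Jones"]))
# SMTH|JNS
```

In this sketch, raising the sensitivity lengthens the code, so fewer values collapse into the same cluster. That mirrors the behavior described in the sensitivity entry, although the real encoding is locale-specific and considerably more sophisticated.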
Copyright © SAS Institute Inc. All rights reserved.