COMPGED Function

Returns the generalized edit distance between two strings.

Category: Character
Restriction: I18N Level 0 functions are designed for use with Single Byte Character Sets (SBCS) only.

Syntax

COMPGED(string-1, string-2 <,cutoff> <,modifiers> )

Required Arguments

string–1

specifies a character constant, variable, or expression.

string-2

specifies a character constant, variable, or expression.

Optional Arguments

cutoff

is a numeric constant, variable, or expression. If the actual generalized edit distance is greater than the value of cutoff, the value that is returned is equal to the value of cutoff.

Tip
Using a small value of cutoff improves the efficiency of COMPGED if the values of string–1 and string–2 are long

modifiers

specifies a character string that can modify the action of the COMPGED function. You can use one or more of the following characters as a valid modifier:

i or I ignores the case in string–1 and string–2.
l or L removes leading blanks in string–1 and string–2 before comparing the values.
n or N removes quotation marks from any argument that is an n-literal and ignores the case of string–1 and string–2.
: (colon) truncates the longer of string–1 or string–2 to the length of the shorter string, or to one, whichever is greater.
Tip
COMPGED ignores blanks that are used as modifiers.

Details

The Order in Which Modifiers Appear

The order in which the modifiers appear in the COMPGED function is relevant.
  • “LN” first removes leading blanks from each string and then removes quotation marks from n-literals.
  • “NL” first removes quotation marks from n-literals and then removes leading blanks from each string.

Definition of Generalized Edit Distance

Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The Levenshtein edit distance is the number of deletions, insertions, or replacements of single characters that are required to transform string-1 into string-2.

Computing the Generalized Edit Distance

The COMPGED function returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the minimum-cost sequence of operations for constructing string-1 from string-2.
The algorithm for computing the sum of the costs involves a pointer that points to a character in string-2 (the input string). An output string is constructed by a sequence of operations that might advance the pointer, add one or more characters to the output string, or both. Initially, the pointer points to the first character in the input string, and the output string is empty.
The operations and their costs are described in the following table.
Operation
Default Cost in Units
Description of Operation
APPEND
50
When the output string is longer than the input string, add any one character to the end of the output string without moving the pointer.
BLANK
10
Do any of the following:
  • Add one space character to the end of the output string without moving the pointer.
  • When the character at the pointer is a space character, advance the pointer by one position without changing the output string.
  • When the character at the pointer is a space character, add one space character to the end of the output string, and advance the pointer by one position.
If the cost for BLANK is set to zero by the COMPCOST function, the COMPGED function removes all space characters from both strings before doing the comparison.
DELETE
100
Advance the pointer by one position without changing the output string.
DOUBLE
20
Add the character at the pointer to the end of the output string without moving the pointer.
FDELETE
200
When the output string is empty, advance the pointer by one position without changing the output string.
FINSERT
200
When the pointer is in position one, add any one character to the end of the output string without moving the pointer.
FREPLACE
200
When the pointer is in position one and the output string is empty, add any one character to the end of the output string, and advance the pointer by one position.
INSERT
100
Add any one character to the end of the output string without moving the pointer.
MATCH
0
Copy the character at the pointer from the input string to the end of the output string, and advance the pointer by one position.
PUNCTUATION
30
Do any of the following:
  • Add one punctuation character to the end of the output string without moving the pointer.
  • When the character at the pointer is a punctuation character, advance the pointer by one position without changing the output string.
  • When the character at the pointer is a punctuation character, add one punctuation character to the end of the output string, and advance the pointer by one position.
If the cost for PUNCTUATION is set to zero by the COMPCOST function, the COMPGED function removes all punctuation characters from both strings before doing the comparison.
REPLACE
100
Add any one character to the end of the output string, and advance the pointer by one position.
SINGLE
20
When the character at the pointer is the same as the character that follows in the input string, advance the pointer by one position without changing the output string.
SWAP
20
Copy the character that follows the pointer from the input string to the output string. Then copy the character at the pointer from the input string to the output string. Advance the pointer two positions.
TRUNCATE
10
When the output string is shorter than the input string, advance the pointer by one position without changing the output string.
To set the cost of the string operations, you can use the CALL COMPCOST routine or use default costs. If you use the default costs, the values that are returned by COMPGED are approximately 100 times greater than the values that are returned by COMPLEV.

Examples of Errors

The rationale for determining the generalized edit distance is based on the number and types of typographical errors that can occur. COMPGED assigns a cost to each error and determines the minimum sum of these costs that could be incurred. Some types of errors can be more serious than others. For example, inserting an extra letter at the beginning of a string might be more serious than omitting a letter from the end of a string. For another example, if you type a word or phrase that exists in string-2 and introduce a typographical error, you might produce string-1 instead of string-2.

Making the Generalized Edit Distance Symmetric

Generalized edit distance is not necessarily symmetric. That is, the value that is returned by COMPGED(string1, string2) is not always equal to the value that is returned by COMPGED(string2, string1). To make the generalized edit distance symmetric, use the CALL COMPCOST routine to assign equal costs to the operations within each of the following pairs:
  • INSERT, DELETE
  • FINSERT, FDELETE
  • APPEND, TRUNCATE
  • DOUBLE, SINGLE

Comparisons

You can compute the Levenshtein edit distance by using the COMPLEV function. You can compute the generalized edit distance by using the CALL COMPCOST routine and the COMPGED function. Computing generalized edit distance requires considerably more computer time than does computing Levenshtein edit distance. But generalized edit distance usually provides a more useful measure than Levenshtein edit distance for applications such as fuzzy file merging and text mining.

Example

The following example uses the default costs to calculate the generalized edit distance.
data test;
   infile datalines missover;
   input String1 $char8. +1 String2 $char8. +1 Operation $40.;
   GED=compged(string1, string2);
   datalines;
baboon   baboon   match
baXboon  baboon   insert
baoon    baboon   delete
baXoon   baboon   replace
baboonX  baboon   append
baboo    baboon   truncate
babboon  baboon   double
babon    baboon   single
baobon   baboon   swap
bab oon  baboon   blank
bab,oon  baboon   punctuation
bXaoon   baboon   insert+delete
bXaYoon  baboon   insert+replace
bXoon    baboon   delete+replace
Xbaboon  baboon   finsert
aboon    baboon   trick question: swap+delete
Xaboon   baboon   freplace
axoon    baboon   fdelete+replace
axoo     baboon   fdelete+replace+truncate
axon     baboon   fdelete+replace+single
baby     baboon   replace+truncate*2
balloon  baboon   replace+insert
;

proc print data=test label;
   label GED='Generalized Edit Distance';
   var String1 String2 GED Operation;
run;
Generalized Edit Distance Based on Operation
Generalized Edit Distance Based on Operation

See Also

CALL Routines: