Using Perl Regular Expressions in the DATA Step

Syntax of Perl Regular Expressions

The Components of a Perl Regular Expression

Perl regular expressions consist of characters and special characters that are called metacharacters. When performing a match, SAS searches a source string for a substring that matches the Perl regular expression that you specify. Using metacharacters enables SAS to perform special actions. These actions include forcing the match to begin in a particular location, and matching a particular set of characters. Paired forward slashes are the default delimiters. The following two examples show metacharacters and the values that they match:
  • If you use the metacharacter \d, SAS matches a digit between 0–9.
  • If you use /\d+/, SAS finds the run of digits 27506 in the string “Raleigh, NC 27506”.
You can see lists of PRX metacharacters in Tables of Perl Regular Expression (PRX) Metacharacters. For a complete list of metacharacters, see the Perl documentation.
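As a minimal sketch of how a metacharacter and a quantifier work together, the following DATA step locates the run of digits in the sample string. The variable names ADDRESS, POS, LEN, and DIGITS are chosen here for illustration and are not part of the original text. CALL PRXSUBSTR returns the position and length of the first match, and SUBSTR then extracts it.

```sas
data _null_;
   address = 'Raleigh, NC 27506';
   /* \d matches one digit; + repeats it one or more times */
   re = prxparse('/\d+/');
   /* pos is set to 0 if the pattern is not found */
   call prxsubstr(re, address, pos, len);
   if pos > 0 then
      do;
         digits = substr(address, pos, len);
         put pos= len= digits=;
      end;
run;
```

For this string the match should begin at position 13 and span 5 characters.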

Basic Syntax for Finding a Match in a String

You use the PRXMATCH function to find the position of a matched value in a source string. PRXMATCH has the following general form, in which the regular expression is enclosed in delimiters:
prxmatch('/search-string/', source-string)
The following example uses the PRXMATCH function to find the position of search-string in source-string:
prxmatch('/world/', 'Hello world!');
The result of PRXMATCH is the value 7, because world begins in the seventh position of the string Hello world!. If the pattern is not found, PRXMATCH returns 0.
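The following short DATA step is a sketch that shows both outcomes of PRXMATCH side by side. The variable names FOUND and NOTFOUND are illustrative only.

```sas
data _null_;
   /* the pattern occurs in the source, so the start position is returned */
   found    = prxmatch('/world/', 'Hello world!');
   /* the pattern does not occur in the source, so 0 is returned */
   notfound = prxmatch('/planet/', 'Hello world!');
   put found= notfound=;
run;
```

SAS should write found=7 notfound=0 to the log.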

Basic Syntax for Searching and Replacing Text

The basic syntax for searching and replacing text has the following form:
s/regular-expression/replacement-string/ 
The following example uses the PRXCHANGE function to show how substitution is performed:
prxchange('s/world/planet/', 1, 'Hello world!'); 
Arguments
s
specifies the metacharacter for substitution.
world
specifies the regular expression.
planet
specifies the replacement value for world.
1
specifies that the search ends when one match is found.
Hello world!
specifies the source string to be searched.
The result of the substitution is Hello planet!. The exclamation point is not part of the match, so it is left unchanged.
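To illustrate the effect of the second argument, the following sketch applies the same substitution once and then to every match. The source string and the variable names ONCE and ALL are assumptions made for this illustration. The LENGTH statement is included because the replacement text can be longer than the text that it replaces.

```sas
data _null_;
   length once all $ 40;
   source = 'world to world';
   /* 1: stop after the first replacement */
   once = prxchange('s/world/planet/', 1, source);
   /* -1: keep replacing until the end of the source is reached */
   all  = prxchange('s/world/planet/', -1, source);
   put once= / all=;
run;
```

With a count of 1, only the first occurrence of world is replaced. With a count of -1, both occurrences are replaced.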

Another Example of Using Basic Syntax for Searching and Replacing Text

Another example of using the PRXCHANGE function changes the value Jones, Fred to Fred Jones:
prxchange('s/(\w+), (\w+)/$2 $1/', -1, 'Jones, Fred');
In this example, the Perl regular expression is s/(\w+), (\w+)/$2 $1/. The number of times to search for a match is –1. The source string is 'Jones, Fred'. The value –1 specifies that matching patterns continue to be replaced until the end of the source is reached.
The Perl regular expression can be divided into its elements:
s
specifies a substitution regular expression.
(\w+)
matches one or more word characters (alphanumeric and underscore). The parentheses indicate that the value is stored in capture buffer 1.
,<space>
matches a comma and a space.
(\w+)
matches one or more word characters (alphanumeric and underscore). The parentheses indicate that the value is stored in capture buffer 2.
/
separator between the regular expression and the replacement string.
$2
puts capture buffer 2 into the result. In this case, it is the word after the comma.
<space>
puts a space in the result.
$1
puts capture buffer 1 into the result. In this case, it is the word before the comma.
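The same substitution can be applied to each record of an input file. The following is a sketch only: the variable names NAME and REVERSED and the sample data lines are invented for illustration. A value that does not contain the pattern, such as the last data line, is returned unchanged.

```sas
data _null_;
   length name reversed $ 40;
   infile datalines truncover;
   input name $40.;
   /* swap the two captured words and drop the comma */
   reversed = prxchange('s/(\w+), (\w+)/$2 $1/', -1, name);
   put name= reversed=;
   datalines;
Jones, Fred
Kavich, Kate
Smith
;
run;
```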

Replacing Text

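The following example uses the \u and \L metacharacters to replace the second character in MCLAUREN with a lowercase letter. In the replacement string, \L converts the captured text MC to lowercase, and \u converts its first character back to uppercase: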
data _null_;
   x = 'MCLAUREN';
   x = prxchange("s/(MC)/\u\L$1/i", -1, x);
   put x=;
run;
SAS writes the following output to the log:
x=McLAUREN

Example 1: Validating Data

You can test for a pattern of characters within a string. For example, you can examine a string to determine whether it contains a correctly formatted telephone number. This type of test is called data validation.
The following example validates a list of phone numbers. To be valid, a phone number must have one of the following forms: (XXX) XXX-XXXX or XXX-XXX-XXXX.
data _null_;  /* 1 */
   if _N_ = 1 then
      do;
         paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";  /* 2 */
         dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";  /* 3 */
         expression = "/(" || paren || ")|(" || dash || ")/";  /* 4 */
         retain re;
         re = prxparse(expression);  /* 5 */
         if missing(re) then  /* 6 */
            do;
               putlog "ERROR: Invalid expression " expression;  /* 7 */
               stop;
            end;
      end;

   length first last home business $ 16;
   input first last home business;

   if ^prxmatch(re, home) then  /* 8 */
      putlog "NOTE: Invalid home phone number for " first last home;

   if ^prxmatch(re, business) then  /* 9 */
      putlog "NOTE: Invalid business phone number for " first last business;

   datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent . 919-782-3199
Ruby Archuleta . .
Takei Ito 7042982145 .
Tom Joad 209/963/2764 2099-66-8474
;
run;
The following items correspond to the lines that are numbered in the DATA step that is shown above.
1 Create a DATA step.
2 Build a Perl regular expression to identify a phone number that matches (XXX)XXX-XXXX, and assign the variable PAREN to hold the result. Use the following syntax elements to build the Perl regular expression:
\( matches the open parenthesis in the area code.
[2-9] matches a digit from 2 through 9, which is the first number in the area code.
\d matches a digit, which is the second number in the area code.
\d matches a digit, which is the third number in the area code.
\) matches the closed parenthesis in the area code.
<space>? matches the space (which is the preceding subexpression) zero or one time. Spaces are significant in Perl regular expressions. They match a space in the text that you are searching. If a space precedes the question mark metacharacter (as it does in this case), the pattern matches either zero spaces or one space in this position in the phone number.
3 Build a Perl regular expression to identify a phone number that matches XXX-XXX-XXXX, and assign the variable DASH to hold the result.
4 Build a Perl regular expression that concatenates the regular expressions for (XXX)XXX-XXXX and XXX-XXX-XXXX. The concatenation enables you to search for both phone number formats from one regular expression.
The PAREN and DASH regular expressions are placed within parentheses. The bar metacharacter (|) that is located between PAREN and DASH instructs the compiler to match either pattern. The slashes around the entire pattern tell the compiler where the start and end of the regular expression are located.
5 Pass the Perl regular expression to PRXPARSE and compile the expression. PRXPARSE returns a value that identifies the compiled pattern. Using the value with other Perl regular expression functions and CALL routines enables SAS to perform operations with the compiled Perl regular expression.
6 Use the MISSING function to check whether the regular expression was successfully compiled.
7 Use the PUTLOG statement to write an error message to the SAS log if the regular expression did not compile.
8 Search for a valid home phone number. PRXMATCH uses the value from PRXPARSE along with the search text and returns the position where the regular expression was found in the search text. If there is no match for the home phone number, the PUTLOG statement writes a note to the SAS log.
9 Search for a valid business phone number. PRXMATCH uses the value from PRXPARSE along with the search text and returns the position where the regular expression was found in the search text. If there is no match for the business phone number, the PUTLOG statement writes a note to the SAS log.
Output from Validating Data
NOTE: Invalid home phone number for Palinor Kent  
NOTE: Invalid home phone number for Ruby Archuleta  
NOTE: Invalid business phone number for Ruby Archuleta  
NOTE: Invalid home phone number for Takei Ito 7042982145
NOTE: Invalid business phone number for Takei Ito  
NOTE: Invalid home phone number for Tom Joad 209/963/2764
NOTE: Invalid business phone number for Tom Joad 2099-66-8474
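The expression in this example is not anchored, so PRXMATCH also accepts a value that merely contains a valid phone number. The following sketch, which is not part of the original example, contrasts the unanchored pattern with an anchored variant. The names LOOSE and STRICT, the use of the {4} quantifier in place of \d\d\d\d, and the sample values are assumptions made for illustration. The \s* before $ allows for the trailing blanks that pad a fixed-length character variable.

```sas
data _null_;
   length phone $ 16;
   retain loose strict;
   if _N_ = 1 then
      do;
         /* unanchored: can match anywhere inside the value */
         loose  = prxparse('/\([2-9]\d\d\) ?[2-9]\d\d-\d{4}|[2-9]\d\d-[2-9]\d\d-\d{4}/');
         /* anchored: the whole value, apart from trailing blanks, must match */
         strict = prxparse('/^(\([2-9]\d\d\) ?[2-9]\d\d-\d{4}|[2-9]\d\d-[2-9]\d\d-\d{4})\s*$/');
      end;
   input phone;
   loose_pos  = prxmatch(loose, phone);
   strict_pos = prxmatch(strict, phone);
   put phone= loose_pos= strict_pos=;
   datalines;
919-782-3199
1919-782-3199
(919)846-2198
;
run;
```

The second value has an extra leading digit. The unanchored pattern is expected to find a match that starts at position 2, whereas the anchored pattern should return 0.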

Example 2: Matching and Replacing Text

This example uses a Perl regular expression to find a match and replace the matching characters with other characters. The regular expression is passed directly to PRXCHANGE, which compiles it, finds each match, and performs the replacement. The example replaces all occurrences of a less-than sign with &lt;, a common substitution when converting text to HTML.
data _null_;  /* 1 */
   input;  /* 2 */
   _infile_ = prxchange('s/</&lt;/', -1, _infile_);  /* 3 */
   put _infile_;  /* 4 */
   /* 5 */ datalines;
x + y < 15
x < 10 < y
y < 11
;
run;
The following items correspond to the numbered lines in the DATA step that is shown above.
1 Create a DATA step.
2 Bring an input data record into the input buffer without creating any SAS variables.
3 Call the PRXCHANGE function to perform the pattern exchange. The format for the regular expression is s/regular-expression/replacement-text/. The s before the regular expression signifies that this is a substitution regular expression. The –1 is a special value that is passed to PRXCHANGE and indicates that all possible replacements should be made.
4 Write the current output line to the log by using the _INFILE_ option with the PUT statement.
5 Identify the input file.
Output from Replacing Text
x + y &lt; 15
x &lt; 10 &lt; y
y &lt; 11
Because PRXCHANGE accepts a regular expression directly and returns the changed text, you can call it from a PROC SQL query. The following query produces a column with the same character substitution as in the preceding example. The query reads the input table text_lines, changes the text in the column line, and places the results in a column named html_line:
proc sql;
   select prxchange('s/</&lt;/', -1, line)
   as html_line
   from text_lines;
quit;
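The query assumes that a table named text_lines with a character column named line already exists. As a sketch, the following DATA step builds such a table from the data lines that are used in the DATA step example, and then runs a query that shows the original and converted text together. The column length of 20 is an arbitrary choice for this illustration.

```sas
/* build a small input table for the query */
data text_lines;
   length line $ 20;
   infile datalines truncover;
   input line $20.;
   datalines;
x + y < 15
x < 10 < y
y < 11
;
run;

proc sql;
   /* show each original line next to its converted form */
   select line,
          prxchange('s/</&lt;/', -1, line) as html_line
   from text_lines;
quit;
```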

Example 3: Extracting a Substring from a String

You can use Perl regular expressions to find and easily extract text from a string. In this example, the DATA step creates a subset of North Carolina business phone numbers. The program extracts the area code and checks it against a list of area codes for North Carolina.
data _null_;  /* 1 */
   if _N_ = 1 then
      do;
         paren = "\(([2-9]\d\d)\) ?[2-9]\d\d-\d\d\d\d";  /* 2 */
         dash = "([2-9]\d\d)-[2-9]\d\d-\d\d\d\d";  /* 3 */
         regexp = "/(" || paren || ")|(" || dash || ")/";  /* 4 */
         retain re;
         re = prxparse(regexp);  /* 5 */
         if missing(re) then  /* 6 */
            do;
               putlog "ERROR: Invalid regexp " regexp;  /* 7 */
               stop;
            end;

         retain areacode_re;
         areacode_re = prxparse("/828|336|704|910|919|252/");  /* 8 */
         if missing(areacode_re) then
            do;
               putlog "ERROR: Invalid area code regexp";
               stop;
            end;
      end;

   length first last home business $ 25;
   length areacode $ 3;
   input first last home business;

   if ^prxmatch(re, home) then
      putlog "NOTE: Invalid home phone number for " first last home;

   if prxmatch(re, business) then  /* 9 */
      do;
         which_format = prxparen(re);  /* 10 */
         call prxposn(re, which_format, pos, len);  /* 11 */
         areacode = substr(business, pos, len);
         if prxmatch(areacode_re, areacode) then  /* 12 */
            put "In North Carolina: " first last business;
      end;
   else
      putlog "NOTE: Invalid business phone number for " first last business;
   datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent 704-782-4673 704-782-3199
Ruby Archuleta 905-384-2839 905-328-3892
Takei Ito 704-298-2145 704-298-4738
Tom Joad 515-372-4829 515-389-2838
;
run;
The following items correspond to the numbered lines in the DATA step that is shown above.
1 Create a DATA step.
2 Build a Perl regular expression to identify a phone number that matches (XXX)XXX-XXXX, and assign the variable PAREN to hold the result. Use the following syntax elements to build the Perl regular expression:
\( matches the open parenthesis in the area code.
( marks the start of the submatch.
[2-9] matches a digit from 2 through 9, which is the first number in the area code.
\d matches a digit, which is the second number in the area code.
\d matches a digit, which is the third number in the area code.
) marks the end of the submatch.
\) matches the closed parenthesis in the area code.
<space>? matches the space (which is the preceding subexpression) zero or one time. Spaces are significant in Perl regular expressions. They match a space in the text that you are searching. If a space precedes the question mark metacharacter (as it does in this case), the pattern matches either zero spaces or one space in this position in the phone number.
3 Build a Perl regular expression to identify a phone number that matches XXX-XXX-XXXX, and assign the variable DASH to hold the result.
4 Build a Perl regular expression that concatenates the regular expressions for (XXX)XXX-XXXX and XXX-XXX-XXXX. The concatenation enables you to search for both phone number formats from one regular expression.
The PAREN and DASH regular expressions are placed within parentheses. The bar metacharacter (|) that is located between PAREN and DASH instructs the compiler to match either pattern. The slashes around the entire pattern tell the compiler where the start and end of the regular expression are located.
5 Pass the Perl regular expression to PRXPARSE and compile the expression. PRXPARSE returns a value that identifies the compiled pattern. Using the value with other Perl regular expression functions and CALL routines enables SAS to perform operations with the compiled Perl regular expression.
6 Use the MISSING function to check whether the Perl regular expression compiled without error.
7 Use the PUTLOG statement to write an error message to the SAS log if the regular expression did not compile.
8 Compile a Perl regular expression that searches a string for a valid North Carolina area code.
9 Search for a valid business phone number.
10 Use the PRXPAREN function to determine which submatch to use. PRXPAREN returns the last submatch that was matched. If an area code matches the form (XXX), PRXPAREN returns the value 2. If an area code matches the form XXX, PRXPAREN returns the value 4.
11 Call the PRXPOSN routine to retrieve the position and length of the submatch.
12 Use the PRXMATCH function to determine whether the area code is a valid North Carolina area code, and write the observation to the log.
Output from Extracting a Substring from a String
In North Carolina: Jerome Johnson (919)846-2198
In North Carolina: Palinor Kent 704-782-3199
In North Carolina: Takei Ito 704-298-4738

Example 4: Another Example of Extracting a Substring from a String

In this example, the PRXPOSN function is passed the original search text instead of the position and length variables. PRXPOSN returns the text that is matched.
data _null_;  /* 1 */
   length first last phone $ 16;
   retain re;
   if _N_ = 1 then do;  /* 2 */
      re = prxparse("/\(([2-9]\d\d)\) ?[2-9]\d\d-\d\d\d\d/");  /* 3 */
   end;

   input first last phone & 16.;
   if prxmatch(re, phone) then do;  /* 4 */
      area_code = prxposn(re, 1, phone);  /* 5 */
      if area_code not in ("828"
                           "336"
                           "704"
                           "910"
                           "919"
                           "252") then
         putlog "NOTE: Not in North Carolina: "
                first last phone;  /* 6 */
   end;
   /* 7 */ datalines;
Thomas Archer    (919)319-1677
Lucy Mallory       (800)899-2164
Tom Joad              (508)852-2146
Laurie Jorgensen  (252)352-7583
;
run;
The following items correspond to the numbered lines in the DATA step that is shown above.
1 Create a DATA step.
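2 If this is the first iteration of the DATA step, compile the regular expression and store the resulting pattern identifier in re. The RETAIN statement keeps the value of re for later iterations.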
3 Build a Perl regular expression for pattern matching. Use the following syntax elements to build the Perl regular expression:
/ is the beginning delimiter for a regular expression.
\( matches the open parenthesis in the area code. The backslash marks the next character as a literal.
( marks the start of the submatch.
[2-9] matches a digit from 2 through 9 and identifies the first number in the area code.
\d matches a digit, which is the second number in the area code.
\d matches a digit, which is the third number in the area code.
) marks the end of the submatch.
\) matches the close parenthesis in the area code.
<space>? matches the space (which is the preceding subexpression) zero or one time. Spaces are significant in Perl regular expressions. The spaces match a space in the text that you are searching. If a space precedes the question mark metacharacter (as it does in this case), the pattern matches either zero spaces or one space in this position in the phone number.
[2-9] matches a digit from 2 through 9 and identifies the first number in the seven-digit phone number.
\d matches a digit, which is the second number in the seven-digit phone number.
\d matches a digit, which is the third number in the seven-digit phone number.
- is the hyphen between the first three and last four digits of the phone number after the area code.
\d matches a digit, which is the fourth number in the seven-digit phone number.
\d matches a digit, which is the fifth number in the seven-digit phone number.
\d matches a digit, which is the sixth number in the seven-digit phone number.
\d matches a digit, which is the seventh number in the seven-digit phone number.
/ is the ending delimiter for a regular expression.
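4 Search the phone number for the pattern. PRXMATCH returns the position at which the match begins, or 0 if there is no match.
5 Extract the area code. PRXPOSN returns the text that is stored in capture buffer 1.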
6 Search for an area code from the list. If the area code is not valid for North Carolina, use the PUTLOG statement to write a note to the SAS log.
7 Identify the input file.
Output from Extracting a Substring from a String
NOTE: Not in North Carolina: Lucy Mallory (800)899-2164
NOTE: Not in North Carolina: Tom Joad (508)852-2146