Processing BY-Groups in the DATA Step

Overview

The most common use of BY-group processing in the DATA step is to combine two or more SAS data sets using a BY statement with a SET, MERGE, MODIFY, or UPDATE statement. (If you use a SET, MERGE, or UPDATE statement with the BY statement, your observations must be grouped or ordered.) When processing these statements, SAS reads one observation at a time into the program data vector. With BY-group processing, SAS selects the observations from the data sets according to the values of the BY variable or variables. After processing all the observations from one BY group, SAS expects the next observation to be from the next BY group.
The BY statement modifies the action of the SET, MERGE, MODIFY, or UPDATE statement by controlling when the values in the program data vector are set to missing. During BY-group processing, SAS retains the values of variables until it has copied the last observation that it finds for that BY group in any of the data sets. Without the BY statement, the SET statement sets variables to missing when it reads the last observation from any data set, and the MERGE statement does not set variables to missing after the DATA step starts reading observations into the program data vector.

Processing BY-Groups Conditionally

You can process observations conditionally by using the subsetting IF or IF-THEN statements, or the SELECT statement, with the temporary variables FIRST.variable and LAST.variable (set up during BY-group processing). For example, you can use them to perform calculations for each BY group and to write an observation when the first or the last observation of a BY group has been read into the program data vector.
The following example computes annual payroll by department. It uses IF-THEN statements and the values of FIRST.variable and LAST.variable automatic variables to reset the value of PAYROLL to 0 at the beginning of each BY group and to write an observation after the last observation in a BY group is processed.
title;
options pageno=1 nodate linesize=80 pagesize=60;  

data salaries;    
   input Department $ Name $ WageCategory $ WageRate;    
   datalines; 
BAD Carol Salaried 20000 
BAD Elizabeth Salaried 5000 
BAD Linda Salaried 7000 
BAD Thomas Salaried 9000 
BAD Lynne Hourly 230
DDG Jason Hourly 200 
DDG Paul Salaried 4000 
PPD Kevin Salaried 5500 
PPD Amber Hourly 150 
PPD Tina Salaried 13000 
STD Helen Hourly 200 
STD Jim Salaried 8000
; 

proc print data=salaries; 
run;
proc sort data=salaries out=temp;    
   by Department;
run;

data budget (keep=Department Payroll);   
   set temp;       
   by Department; 
   if WageCategory='Salaried' then YearlyWage=WageRate*12;    
   else if WageCategory='Hourly' then YearlyWage=WageRate*2000;  
      /* SAS sets FIRST.variable to 1 if this is a new        */
      /* department in the BY group.                          */
   if first.Department then Payroll=0;    
   Payroll+YearlyWage;
      /* SAS sets LAST.variable to 1 if this is the last      */
      /* department in the current BY group.                  */
   if last.Department; 
run;

proc print data=budget;    
   format Payroll dollar10.;
   title 'Annual Payroll by Department'; 
run;
Output from Conditional BY-Group Processing
                          Annual Payroll by Department                         1

                        Obs    Department       Payroll

                         1        BAD          $952,000
                         2        DDG          $448,000
                         3        PPD          $522,000
                         4        STD          $496,000

Data Not in Alphabetic or Numeric Order

In BY-group processing, you can use data that is arranged in an order other than alphabetic or numeric, such as by calendar month or by category. To do this, use the NOTSORTED option in a BY statement when you use a SET statement. The NOTSORTED option in the BY statement tells SAS that the data is not in alphabetic or numeric order, but that it is arranged in groups by the values of the BY variable. You cannot use the NOTSORTED option with the MERGE statement, the UPDATE statement, or when the SET statement lists more than one data set.
This example assumes that the data is grouped by the character variable MONTH. The subsetting IF statement conditionally writes an observation, based on the value of LAST.month. This DATA step writes an observation only after processing the last observation in each BY group.
data sales;
   input month
data total_sale(drop=sales);    
   set region.sales
   by month notsorted;    
   total+sales;    
   if last.month; 
run;

Data Grouped by Formatted Values

Use the GROUPFORMAT option in the BY statement to ensure that
  • formatted values are used to group observations when a FORMAT statement and a BY statement are used together in a DATA step
  • the FIRST.variable and LAST.variable are assigned by the formatted values of the variable
The GROUPFORMAT option is valid only in the DATA step that creates the SAS data set. It is particularly useful with user-defined formats. The following example illustrates the use of the GROUPFORMAT option.
proc format;
   value range
      low -55 = 'Under 55'
      55-60   = '55 to 60'
      60-65   = '60 to 65'
      65-70   = '65 to 70'
      other   = 'Over 70';
run;

proc sort data=class out=sorted_class;
   by height;
run;

data _null_;
   format height range.;
   set sorted_class;
      by height groupformat;
   if first.height then
      put 'Shortest in ' height 'measures ' height:best12.;
run;
SAS writes the following output to the log:
The SAS Log Showing Output Using the BY Statement GROUPFORMAT Option

Shortest
in Under 55 measures 51.3
Shortest in 55 to 60 measures 56.3
Shortest in 60 to 65 measures 62.5
Shortest in 65 to 70 measures 65.3
Shortest in Over 70 measures 72
options
pageno=1 nodate linesize=80 pagesize=60;     

   /* Create a user-defined format */ 
proc format;    
   value Range 1-2='Low'                 
               3-4='Medium' 
               5-6='High';
run;     

   /* Create the SAS data set */ 
data newtest;    
   set test;   
   by groupformat Score; 
   format Score Range.;    
run;  

   /* Print using formatted values */ 
proc print data=newtest;    
   title 'Score Categories';    
   var Name Score;    
   by Score;    
run;

Output
- — — —
                                Score Categories                               1

---------------------------------- Score=Low -----------------------------------

                              Obs    Name    Score

                               1     Jon      Low 


--------------------------------- Score=Medium ---------------------------------

                            Obs     Name      Score

                             2     Anthony    Medium
                             3     Miguel     Medium
                             4     Joseph     Medium


---------------------------------- Score=High ----------------------------------

                              Obs    Name    Score

                               5     Ian     High 
                               6     Jan     High