The Format of the Input Data Set :: SAS/STAT(R) 12.1 User's Guide

The Format of the Input Data Set

The SAS data set used as input to the INBREED procedure must contain an observation for each individual. Each observation must include one variable identifying the individual and two variables identifying the individual’s parents. Optionally, an observation can contain a known covariance coefficient and a character variable defining the gender of the individual.

For example, consider the following data:

data Population;
   input Individual $ Parent1 $ Parent2 $
         Covariance Sex $ Generation;
   datalines;
Mark   George Lisa    .    M  1
Kelly  Scott  Lisa    .    F  1
Mike   George Amy     .    M  1
.      Mark   Kelly  0.50  .  1
David  Mark   Kelly   .    M  2
Merle  Mike   Jane    .    F  2
Jim    Mark   Kelly  0.50  M  2
Mark   Mike   Kelly   .    M  2
;
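
As a minimal sketch of how these variable roles can be named explicitly, the following step assumes the Population data set above and uses the VAR and GENDER statements of PROC INBREED. The choice of the COVAR and MATRIX options is only an illustration; other output options would serve equally well.

```sas
/* Sketch: name each variable's role explicitly instead of relying on variable order. */
proc inbreed data=Population covar matrix;
   var Individual Parent1 Parent2 Covariance;  /* individual, two parents, optional covariance */
   gender Sex;                                 /* optional character variable for gender */
run;
```

Because no CLASS statement appears, this sketch ignores the Generation variable and treats the population as overlapping.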

It is important to order the pedigree observations so that individuals are defined before they are used as parents of other individuals. The family relationships between individuals cannot be ascertained correctly unless you observe this ordering. Also, older individuals must precede younger ones. For example, ‘Mark’ appears as the first parent of ‘David’ at observation 5; therefore, the observation that defines him must come before observation 5, as it does (see observation 1). Also, ‘David’ is older than ‘Jim’, whose observation appears after the observation for ‘David’, as is appropriate.
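
The ordering rule can be checked before the procedure runs. The following DATA step is a hypothetical pre-check, not a feature of PROC INBREED: it assumes the Population data set and its variable names, ignores generation grouping, and uses a hash object (with the illustrative name seen) to remember which individuals have been defined so far.

```sas
/* Hypothetical pre-check (not part of PROC INBREED): report parents that are
   used before any observation defines them. Generation grouping is ignored. */
data _null_;
   length Name $ 8;
   if _n_ = 1 then do;
      declare hash seen();          /* names of individuals defined so far */
      seen.defineKey('Name');
      seen.defineDone();
   end;
   set Population;

   Name = Parent1;
   if Name ne ' ' and seen.check() ne 0 then
      put 'Observation ' _n_ ': parent ' Parent1 'has no earlier definition.';

   Name = Parent2;
   if Name ne ' ' and seen.check() ne 0 then
      put 'Observation ' _n_ ': parent ' Parent2 'has no earlier definition.';

   /* Register the individual only after its parents have been checked. */
   Name = Individual;
   if Name ne ' ' then seen.replace();
run;
```

The messages are written to the SAS log. Parents that never have an observation of their own, such as founders of the pedigree, are reported as well, so a message is a prompt to inspect the data rather than proof of an ordering error.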

In populations with distinct, nonoverlapping generations, the older generation (parents) must precede the younger generation. For example, the individuals defined in Generation=1 appear as parents of individuals defined in Generation=2.

PROC INBREED produces warning messages when a parent cannot be found. For example, ‘Jane’ appears as the second parent of the individual ‘Merle’ even though no previous observation defines her. If the population is treated as an overlapping population, that is, if the generation grouping is ignored, then the procedure inserts an observation for ‘Jane’ with missing parents just before the sixth observation (the one that defines ‘Merle’), as follows:

Jane   .      .       .    F  2
Merle  Mike   Jane    .    F  2

However, if generation grouping is taken into consideration, then ‘Jane’ is defined as the last observation in Generation=1, as follows:

Mike   George Amy     .    M  1
Jane   .      .       .    F  1

In this latter case, however, the observation for ‘Jane’ is inserted after the computations are reported for the first generation. Therefore, she does not appear in the covariance/inbreeding matrix, even though her observation is used in computations for the second generation (see Figure 47.2).
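
The two treatments described above can be requested as in the following sketch, which assumes the Population data set from this section. The CLASS statement is what makes the procedure take generation grouping into account; the COVAR and MATRIX options are one possible choice of output.

```sas
/* Overlapping treatment: no CLASS statement, so generation grouping is ignored. */
proc inbreed data=Population covar matrix;
   var Individual Parent1 Parent2 Covariance;
run;

/* Nonoverlapping treatment: CLASS identifies the generation variable. */
proc inbreed data=Population covar matrix;
   var Individual Parent1 Parent2 Covariance;
   class Generation;
run;
```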

If the data for an individual are duplicated, only the first occurrence of the data is used by the procedure, and a warning message is displayed to note the duplication. For example, individual ‘Mark’ is defined twice, at observations 1 and 8. If generation grouping is ignored, then this is an error and observation 8 is skipped. However, if the population is processed with respect to two distinct generations, then ‘Mark’ refers to two different individuals, one in Generation=1 and the other in Generation=2.
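
Duplicated definitions can also be listed before the procedure runs. The following query is a hypothetical check, not part of PROC INBREED; it assumes the Population data set and ignores generation grouping. The column name Definitions is illustrative.

```sas
/* Hypothetical check: list individuals that are defined in more than one observation. */
proc sql;
   select Individual, count(*) as Definitions
      from Population
      where Individual is not missing   /* skip covariance-only observations */
      group by Individual
      having count(*) > 1;
quit;
```

For the Population data set, ‘Mark’ is the only name that is defined more than once. To check within generations instead, Generation could be added to both the SELECT and GROUP BY lists.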

If a covariance is to be assigned between two individuals, then those individuals must be defined prior to the assignment observation. For example, a covariance of 0.50 can be assigned between ‘Mark’ and ‘Kelly’ since they are previously defined. Note that assignment observations must have different formats depending on whether the population is processed with respect to generations (see the section DATA= Data Set for further information). For example, observation 4 is valid for nonoverlapping generations but invalid for a processing mode that ignores generation grouping. In the latter case, observation 7 indicates a valid assignment, and observation 4 is skipped.

The latest covariance specification between any given two individuals overrides the previous one between the same individuals.
