Data that measure lifetime, or the length of time until the occurrence of an event, are called lifetime, failure time, or survival data. For example, a variable of interest might be the lifetime of diesel engines, the length of time a person stays at a job, or the survival time for heart transplant patients. Such data have special characteristics that any analysis must take into account.
Survival data consist of a response (event time, failure time, or survival time) variable that measures the duration of time until a specified event occurs and possibly a set of independent variables that are thought to be associated with the failure time variable. These independent variables (concomitant variables, covariates, or prognostic factors) can be either discrete, such as sex or race, or continuous, such as age or temperature. The system that gives rise to the event of interest can be biological (as with most medical data) or physical (as with engineering data). The purpose of survival analysis is to model the underlying distribution of the failure time variable and to assess the dependence of the failure time variable on the independent variables.
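The "underlying distribution of the failure time variable" is usually described through two functions. The notation below is the conventional one and is introduced here as an assumption, since the text above does not fix any symbols: T denotes the failure time, f its density (assuming T is continuous), S the survival function, and h the hazard function.

```latex
S(t) = \Pr(T > t), \qquad
h(t) = \lim_{\Delta t \to 0^{+}}
       \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)} .
```

Modeling the dependence of the failure time on the independent variables then amounts to letting S or h depend on the covariate values.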
An intrinsic characteristic of survival data is the possibility of censoring of observations (that is, the actual time until the event is not observed). Such censoring can arise from withdrawal by a subject from the experiment or from termination of the experiment. Because the response is usually a duration, some of the possible events might not yet have occurred when the period of data collection ends. For example, clinical trials are conducted over a finite period of time, with staggered entry of patients. That is, patients enter a clinical trial over time, and thus the length of follow-up varies by patient; consequently, the time to the event might not be ascertained for every patient in the study. In addition, some subjects might be lost to follow-up (for example, a participant might move or refuse to continue to participate) before data collection ends. In either case, only a lower bound on the failure time of the censored observations is known. Such observations are said to be right-censored. Thus, an additional variable is incorporated into the analysis to indicate which failure times are observed event times and which are censored times.
More generally, the failure time might be known only to be smaller than a given value (left-censored) or known only to be within a given interval (interval-censored). Many possible censoring schemes arise in survival analysis. Maddala (1983) discusses several related types of censoring situations, and Kalbfleisch and Prentice (1980) also discuss several censoring schemes.
Data that contain censored observations cannot be analyzed by simply discarding the censored observations because, among other considerations, the longer-lived individuals are usually more likely to be right-censored, so the remaining observations would understate the typical failure time. The method of analysis must take the censoring into account and correctly use both the censored observations and the uncensored observations.
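As a minimal sketch of why the censoring indicator matters, the following Python fragment stores each observation as a time plus an indicator (1 = event observed, 0 = right-censored) and compares a product-limit (Kaplan-Meier) estimate of the survival probability with a naive estimate that drops the censored observations. The data values and the function names `kaplan_meier` and `naive_survival` are hypothetical and chosen only for illustration. The Kaplan-Meier estimator is one standard way to use censored observations; it is not named in the text above.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct observed event time.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    observation was right-censored at times[i].
    Returns a list of (time, estimated survival probability) pairs.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    leaving_counts = Counter(times)  # events and censorings together
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving_counts):
        d = event_counts.get(t, 0)
        if d > 0:
            # Subjects censored at t are still counted as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Everyone whose recorded time equals t leaves the risk set.
        at_risk -= leaving_counts[t]
    return curve


def naive_survival(times, events, t):
    """Fraction surviving past t when censored observations are discarded."""
    observed = [x for x, e in zip(times, events) if e == 1]
    return sum(1 for x in observed if x > t) / len(observed)


# Hypothetical data: follow-up time and censoring indicator per subject.
times = [2, 3, 3, 5, 6, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 0, 0]

km_curve = kaplan_meier(times, events)
km_at_5 = dict(km_curve)[5]
naive_at_5 = naive_survival(times, events, 5)

print("Kaplan-Meier curve:", km_curve)
print("Estimated P(T > 5), using censored observations:", km_at_5)
print("Estimated P(T > 5), discarding censored observations:", naive_at_5)
```

With these made-up data, the naive estimate of surviving past time 5 is well below the product-limit estimate, because every discarded subject was known to have survived at least as long as its censoring time.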
Another characteristic of survival data is that the response cannot be negative. This suggests either that a transformation of the survival time, such as a log transformation, might be necessary or that specialized methods might be more appropriate than those that assume a normal distribution of the error term. It is especially important to check any underlying assumptions as part of the analysis, because some of the models that are used are very sensitive to these assumptions.
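The point about nonnegative responses can be illustrated with a small sketch. It fits a normal distribution to a set of hypothetical, fully observed failure times (censoring is deliberately left out to keep the example short) and computes the probability that the fitted model assigns to negative times. It then fits a normal distribution to the log times instead, which implies a log-normal model whose support is strictly positive. The data and the helper `normal_cdf` are assumptions made for illustration only.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal(mu, sigma) variable."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Hypothetical, right-skewed failure times with no censoring.
times = [0.5, 1.2, 2.0, 3.5, 8.0, 15.0, 40.0]

# Normal model on the raw scale: some probability falls below zero.
mu_raw = statistics.mean(times)
sd_raw = statistics.stdev(times)
p_negative_raw = normal_cdf(0.0, mu_raw, sd_raw)

# Normal model on the log scale: T = exp(Y) is positive by construction.
log_times = [math.log(t) for t in times]
mu_log = statistics.mean(log_times)
sd_log = statistics.stdev(log_times)
median_lognormal = math.exp(mu_log)  # median of the implied log-normal

print("P(T < 0) under the raw-scale normal fit:", round(p_negative_raw, 3))
print("Median failure time under the log-scale fit:", round(median_lognormal, 3))
```

On the raw scale the fitted normal distribution places a noticeable share of its probability on impossible negative times, whereas the log-scale fit cannot produce a negative failure time.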