The SSM Procedure

Example 27.9 Longitudinal Data: Variable Bandwidth Smoothing

The data for this example, taken from Givens and Hoeting (2005, chap. 11, Example 11.8), contain two variables, x and y. The variable y represents noisy evaluation of an unknown smooth function at x. The data are sorted by x.

 data Difficult;
 input x y;
 datalines;
 0.002 0.040 
 0.011 0.009 
 0.013 0.719 
 0.016 0.199 
 0.017 -0.409 

   ... more lines ...   

Output 27.9.1 shows the scatter plot of y against x that is generated by the following statements:

 proc sgplot data=Difficult;
    title "Scatter Plot of Y versus X";
    scatter x=x y=y ;
 run;

Output 27.9.1: Scatter Plot of Y versus X


The plot clearly shows that the variance of y values varies considerably over the range of x values—the variance is larger for x values around 0.2 and gets increasingly smaller as the x values get closer to 1. Givens and Hoeting (2005) discuss the difficulties of extracting a smooth pattern from such data. Consider the following model for y:

\[  \mb {y}(x) = \pmb {\mu }_{x} + \pmb {\epsilon }_{x}  \]

where $\pmb {\mu }_{x}$ is a smooth trend component and $\pmb {\epsilon }_{x}$ is the observation noise with variance, $h(x)$, which changes with x: $\pmb {\epsilon }_{x} \sim N( 0, h(x))$. It is known (Durbin and Koopman, 2001, chap. 3, sect. 10 and sect. 11) that modeling the trend $\pmb {\mu }_{x}$ as a polynomial smoothing spline (for example, the way the growth curves are modeled in Example 27.4) and taking the variance function of the observation noise $\pmb {\epsilon }_{x}$ a constant results in a trend estimate that can be termed a fixed-bandwidth-smoother. The optimal bandwidth turns out to be a function of the signal-to-noise ratio: the ratio of the observation noise variance and the disturbance variance of the trend component. On the other hand, allowing the variance function of the observation noise to change with the x values results in a trend estimate that can be termed a variable-bandwidth-smoother. The rest of this example shows how to use the SSM procedure to create a data-dependent variance function $h(x)$ and to extract the associated (variable-bandwidth) smooth trend from such data. Suppose that the (unknown) variance function $h(x)$ can be approximated as

\[  h(x) = \exp ( \sum _{i=1}^{7} \nu _{i} \;  \mr {SplineBasis}_{i} (x))  \]

where $\nu _{i}, i=1, 2, \ldots , 7$ are unknown parameters and $\mr {SplineBasis}_{i} (x), i=1, 2, \ldots , 7$, are the full set of cubic spline basis functions (B-splines) with four evenly spaced internal knots between the range of x values—essentially, four equispaced points between 0.0 and 1.0. Note that the number of basis functions in the full set, 7, is the sum of the number of internal knots, 4, and the degree of the polynomial, 3. The following statements create a data set, Combined, that contains the variables x and y, along with the desired spline basis functions (col1col7) that are created by using the BSPLINE function in PROC IML:

 proc iml;
    use difficult;
    /* read x and y from difficult into temp */
    read all var _num_ into temp;
    x = temp[,1];
    /* generate B-spline basis for a cubic spline
       with 4 evenly spaced internal knots in the x-range */
    bsp = bspline(x, 2, ., 4);
    Combined = temp || bsp;
    /* create a merged data set with x, y, and
       spline basis columns                    */
    create Combined var {x y col1 col2 col3 col4 col5 col6 col7};
    append from Combined;
 quit;

The following statements specify and fit the desired model to the data:

 proc ssm data=Combined opt(tech=dbldog);
    id x;
    /* parameters needed to define h(x) */
    parms v1-v7;
    /* defining h(x) */
    var = exp(v1*col1 + v2*col2 + v3*col3 + v4*col4
       + v5*col5 + v6*col6 + v7*col7);
    /* defining the polynomial spline trend */
    trend trend(ps(2));
    /* defining the observation noise with variance h(x) */
    irregular wn variance=var;
    model y = trend wn;
    output out=For pdv;
 run;

Output 27.9.2 shows the estimates of $\nu _{i}, i=1, 2, \ldots , 7$, and Output 27.9.3 shows the estimate of the disturbance variance associated with the polynomial spline trend that is specified in the TREND statement.

Output 27.9.2: Estimates of v1–v7

The SSM Procedure

Estimates of Named Parameters
Parameter Estimate Standard Error
v1 -3.302 1.501
v2 -0.826 0.619
v3 -2.234 0.453
v4 -3.130 0.412
v5 -4.306 0.415
v6 -6.901 0.588
v7 -19.514 2.306


Output 27.9.3: Estimate of the Disturbance Variance Associated with the Trend

Model Parameter Estimates
Component Type Parameter Estimate Standard
Error
trend PS(2) Trend Level Variance 339 110


The following statements produce a plot, shown in Output 27.9.4, of the fitted trend with 95% confidence band:

 proc sgplot data=For;
    title "Variable Bandwidth Smoothing Spline";
    band x=x lower=smoothed_lower_trend
       upper=smoothed_upper_trend ;
    series x=x y=smoothed_trend;
    scatter x=x y=y;
 run;

Output 27.9.4: Fitted Trend with 95% Confidence Band


Clearly the fitted curve tracks the data quite well. Lastly, Output 27.9.5 (produced by using the following statements) shows the estimated variance function $h(x)$.

 proc sgplot data=For;
    title "Estimated Variance Function";
    series x=x y=var;
 run;

As expected, the curve attains its peak at an x value around 0.18 and decays to nearly 0 as x values reach 1.0.

Output 27.9.5: Estimated Variance Function $h(x) = \exp ( \sum _{i=1}^{7} \hat{\nu }_{i} \;  \mr {SplineBasis}_{i} (x))$