A bivariate histogram
shows the distribution of data for two continuous numeric variables.
In the following graph, the X axis displays HEIGHT values and the
Y axis displays WEIGHT values. The Z axis represents the frequency
count of observations. The Z values could be some other measure (for
example, percentage of observations), but they can never be negative.
As with a standard histogram,
the X and Y variables in the bivariate histogram have been uniformly
binned, which means that each variable's data range has been divided
into equal-sized intervals (bins), and that each observation is assigned
to exactly one of the resulting bin combinations.
The BIHISTOGRAM3DPARM statement,
which produced this plot, does not perform any binning computation
on the input columns. Thus, you must pre-bin the data. In the following
example, the binning is done with PROC KDE (part of the
SAS/STAT product).
proc kde data=sashelp.heart;
  bivar height(ngrid=8) weight(ngrid=10) /
        out=kde(keep=value1 value2 count) noprint plots=none;
run;
In this program, the
NGRID= option sets the number of bins to create for each variable.
The default for NGRID is 60. The binned values for HEIGHT are stored
in VALUE1, and the binned values for WEIGHT are stored in VALUE2.
This selection of bins produces 1 observation for each of the 80 bin
combinations. Frequency counts for each bin combination are placed
in a COUNT variable in the output data set.
Notice that when you
form the grid by choosing the number of bins, the bin widths (about
3.5 for HEIGHT and about 26 for WEIGHT) are usually not integers.
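The bin widths quoted above can be checked against the PROC KDE output. The following is a minimal sketch, assuming that WORK.KDE was created by the preceding PROC KDE step and that the grid points in VALUE1 and VALUE2 are evenly spaced. The column names HeightBins, WeightBins, HeightWidth, and WeightWidth are illustrative only.

```sas
/* Sketch: estimate the grid spacing from the KDE output data set.   */
/* Spacing = (largest grid value - smallest grid value) / (n - 1),   */
/* where n is the number of distinct grid values for the variable.   */
proc sql;
  select count(distinct value1) as HeightBins,
         count(distinct value2) as WeightBins,
         (max(value1) - min(value1)) / (count(distinct value1) - 1)
            as HeightWidth format=8.2,
         (max(value2) - min(value2)) / (count(distinct value2) - 1)
            as WeightWidth format=8.2
  from kde;
quit;
```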
The following template
definition displays this data. By default, the BINAXIS=TRUE setting
requests that X and Y axes show tick values at bin boundaries. Also
by default, XVALUES=MIDPOINTS and YVALUES=MIDPOINTS, which means that
the X and Y columns represent midpoint values rather than lower bin
boundaries (LEFTPOINTS) or upper bin boundaries (RIGHTPOINTS). Because
the graph is small, not all of the bins can be labeled without collision,
so the ticks and tick values were thinned.
The non-integer bin values are formatted as integers (TICKVALUEFORMAT=5.)
to simplify the axis tick values. DISPLAY=ALL means "show outlined,
filled bins."
proc template;
  define statgraph bihistogram1a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on)
          xaxisopts=(linearopts=(tickvalueformat=5.))
          yaxisopts=(linearopts=(tickvalueformat=5.));
        bihistogram3dparm x=value1 y=value2 z=count /
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=kde template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
Eliminating Bins That Have No Data. Notice
that the bins with a frequency of 0 (there are several) are included in
the plot. If you want to eliminate the bins that have no data,
you can render a subset of the data. The subset makes it clearer
which bins have small frequency counts and which portions of
the grid have no data.
proc sgrender data=kde template=bihistogram1a;
  where count > 0;
  label value1="Height" value2="Weight";
run;
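Before subsetting, it can help to know how many bins are actually empty. The following sketch counts them and then stores the subset in its own data set, as an alternative to the WHERE statement in PROC SGRENDER. It assumes that WORK.KDE still holds the PROC KDE output and that the BIHISTOGRAM1A template is already defined. The data set name KDE_NONZERO and the column names EmptyBins and NonEmptyBins are hypothetical.

```sas
/* Sketch: count empty and non-empty bin combinations.                 */
/* In PROC SQL a comparison evaluates to 1 (true) or 0 (false), so     */
/* summing the comparison counts the rows that satisfy it.             */
proc sql;
  select sum(count = 0) as EmptyBins,
         sum(count > 0) as NonEmptyBins
  from kde;
quit;

/* Store the non-empty bins in a separate data set (hypothetical name) */
data kde_nonzero;
  set kde;
  where count > 0;
run;

proc sgrender data=kde_nonzero template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
```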
Displaying Percentages on Z Axis. To display
the percentage of observations on the Z axis instead of the actual
count, you need to perform an additional data transformation to convert
the counts to percentages.
proc kde data=sashelp.heart;
  bivar height(ngrid=8) weight(ngrid=10) /
        out=kde(keep=value1 value2 count) noprint plots=none;
run;
data kde;
  /* On the first iteration, read every row once to total the counts. */
  if _n_ = 1 then do i=1 to rows;
    set kde(keep=count) point=i nobs=rows;
    TotalObs+count;
  end;
  /* Read each row again and convert its count to a percentage. */
  set kde;
  Count=100*(Count/TotalObs);
  label Count="Percent";
run;
proc sgrender data=kde template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
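The same transformation can also be written in PROC SQL, which some readers find easier to follow than the two-pass DATA step. This is a sketch of an alternative, not the method used above. It assumes that WORK.KDE contains raw counts, that is, the PROC KDE output before the DATA step has overwritten it. The output data set name KDE_PCT is hypothetical.

```sas
/* Sketch: convert counts to percentages with PROC SQL.                */
/* SUM(count) is computed over all rows and remerged with each row;    */
/* SAS writes a note about the remerge to the log.                     */
proc sql;
  create table kde_pct as
  select value1,
         value2,
         100 * count / sum(count) as Count label="Percent"
  from kde;
quit;

proc sgrender data=kde_pct template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
```

The result column keeps the name Count so that the existing template, which specifies Z=COUNT, can be reused without change.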
Setting Bin Width. Another technique for binning data
is to set a bin width and compute the number of observations in each
bin. In the DATA step below, the bin width is 5 for HEIGHT and 25
for WEIGHT. With this technique you do not know the exact number
of bins in advance, but you can ensure that the bins are of a "good" size.
data heart;
  set sashelp.heart(keep=height weight);
  if height ne . and weight ne .;   /* keep complete observations only */
  height=round(height,5);           /* nearest multiple of 5  */
  weight=round(weight,25);          /* nearest multiple of 25 */
run;
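The ROUND function with a second argument maps each value to the nearest multiple of that argument, so every rounded value is the midpoint of its bin. The following sketch writes a few sample heights and their bins to the log. The sample values and the variable name BIN are chosen only for illustration.

```sas
/* Sketch: show how ROUND assigns sample values to bins of width 5.    */
/* Values from 62.5 up to (but not including) 67.5 share midpoint 65.  */
data _null_;
  do height = 62.4, 62.5, 67.4;
    bin = round(height, 5);
    put height= bin=;
  end;
run;
```

Here 62.4 maps to the bin with midpoint 60, and both 62.5 and 67.4 map to the bin with midpoint 65.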
After rounding, HEIGHT
and WEIGHT can be used as classifiers for a summarization. Notice
that the COMPLETETYPES option forces all possible combinations of
the two variables to be output, even if no data exists for a particular
crossing.
proc summary data=heart nway completetypes;
  class height weight;
  var height;
  output out=stats(keep=height weight count) N=Count;
run;
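To see the grid that the summarization produced, you can count the levels of each classifier and the rows in the output. This sketch assumes that WORK.STATS was created by the preceding PROC SUMMARY step. The column names are illustrative. With NWAY and COMPLETETYPES, the row count is expected to equal the product of the two level counts.

```sas
/* Sketch: inspect the grid stored in WORK.STATS.                      */
proc sql;
  select count(distinct height) as HeightBins,
         count(distinct weight) as WeightBins,
         count(*)               as GridRows,
         sum(count = 0)         as EmptyCrossings  /* crossings with no data */
  from stats;
quit;
```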
The template can be
simplified because we know that the bin midpoints are uniformly spaced
integers. For this selection of bin widths, 6 bins were produced for
HEIGHT and 10 for WEIGHT.
proc template;
  define statgraph bihistogram2a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on);
        bihistogram3dparm x=height y=weight z=count /
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=stats template=bihistogram2a;
run;
If
you prefer to see the axes labeled with the bin endpoints rather than the
bin midpoints, you can use the ENDLABELS=TRUE setting on the BIHISTOGRAM3DPARM
statement. Note that the ENDLABELS= option is independent of the XVALUES=
and YVALUES= options.
In the following example,
the bin widths are changed to even numbers (10 and 50) so that the
bin endpoints, which lie half a bin width from each midpoint, are whole numbers:
proc template;
  define statgraph bihistogram2a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on);
        bihistogram3dparm x=height y=weight z=count /
          binaxis=true endlabels=true display=all;
      endlayout;
    endgraph;
  end;
run;
data heart;
  set sashelp.heart(keep=height weight);
  height=round(height,10);
  weight=round(weight,50);
run;
proc summary data=heart nway completetypes;
  class height weight;
  var height;
  output out=stats(keep=height weight count) N=Count;
run;
proc sgrender data=stats template=bihistogram2a;
run;
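Because the rounded values are bin midpoints, each bin extends half a bin width on either side of its midpoint. The following sketch lists the endpoints for the first few bin combinations. It assumes that WORK.STATS was created with bin widths of 10 and 50 as in the preceding program and that the X and Y columns are treated as midpoints (the default). The data set name ENDPOINTS and the four endpoint variables are hypothetical.

```sas
/* Sketch: derive bin endpoints from the midpoints in WORK.STATS.      */
data endpoints;
  set stats(keep=height weight);
  HeightLow  = height - 10/2;   /* half of the HEIGHT bin width (10) */
  HeightHigh = height + 10/2;
  WeightLow  = weight - 50/2;   /* half of the WEIGHT bin width (50) */
  WeightHigh = weight + 50/2;
run;

proc print data=endpoints(obs=5) noobs;
run;
```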
If you choose bin widths that are too small, "gaps" might appear
among the axis tick values, which might cause the following message:
WARNING: The data for a HISTOGRAMPARM statement is not appropriate.
HISTOGRAMPARM statement expects uniformly-binned data. The
histogram might not be drawn correctly.
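One way to look for such gaps before rendering is to list the spacing between consecutive bin values. The following is a sketch under the assumption that WORK.STATS holds the summarized data and that HEIGHT is the binned variable of interest; the same steps apply to WEIGHT. The data set names LEVELS and STEPS and the variable name Step are hypothetical.

```sas
/* Sketch: check whether the HEIGHT bin values are uniformly spaced.   */
proc sort data=stats(keep=height) out=levels nodupkey;
  by height;                    /* one row per distinct bin value */
run;

data steps;
  set levels;
  Step = dif(height);           /* difference from the previous bin value */
  if _n_ > 1;                   /* the first bin value has no predecessor */
run;

/* More than one distinct Step value indicates non-uniform spacing.    */
proc freq data=steps;
  tables Step / nocum nopercent;
run;
```

If the frequency table shows a single Step value, the bins are uniformly spaced. Additional values point to bin values that are missing from the data.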
Because BIHISTOGRAM3DPARM
is a parameterized plot, you can use it to show the 3-D data summarization
of a response variable Z, which must have non-negative values, by
two numeric classification variables that are uniformly spaced (X
and Y). That is, even though the graphical representation is a bivariate
histogram, the Z axis does not have to display a frequency count or
a percent.
data cars;
  set sashelp.cars(keep=weight horsepower mpg_highway);
  if horsepower ne . and weight ne .;
  horsepower=round(horsepower,75);
  weight=round(weight,1000);
run;
proc summary data=cars nway completetypes;
  class weight horsepower;
  var mpg_highway;
  output out=stats mean=Mean;
run;
proc template;
  define statgraph bihistogram2b;
    begingraph;
      entrytitle
        "Distribution of Gas Mileage by Vehicle Weight and Horsepower";
      entryfootnote halign=right "SASHELP.CARS";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on) rotate=130;
        bihistogram3dparm y=weight x=horsepower z=mean / binaxis=true
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=stats template=bihistogram2b;
run;