The following data provide crime rates per 100,000 people in seven categories for each of the 50 states in 1977. Since there are seven numeric variables, it is impossible to plot all the variables simultaneously. Principal components can be used to summarize the data in two or three dimensions, and they help to visualize the data. The following statements produce Figure 73.1 through Figure 73.5.
title 'Crime Rates per 100,000 Population by State';

data Crime;
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   datalines;
Alabama         14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska          10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas        8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii          7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho           5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa            2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky        10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine           2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota       2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi     14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana         5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska        3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire   3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina  10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota    0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma        8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania    5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island    3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota    2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah            3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont         1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia        9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia   6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin       2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming         5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;

ods graphics on;

proc princomp out=Crime_Components plots=score(ellipse ncomp=3);
   id State;
run;
Figure 73.1 displays the PROC PRINCOMP output, beginning with simple statistics followed by the correlation matrix. By default, the PROC PRINCOMP statement requests principal components computed from the correlation matrix. Each standardized variable then has variance 1, so the total variance is equal to the number of variables, 7.
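For contrast with the default, the following sketch requests the same analysis from the covariance matrix by adding the COV option. It assumes the Crime data set created above; the explicit VAR statement is added here only for clarity. Because the simple statistics show very different standard deviations (for example, about 726 for Larceny versus about 3.9 for Murder), a covariance-based analysis would be expected to weight the high-variance variables much more heavily, which is one reason the correlation matrix is a reasonable choice for these data.

```sas
/* Sketch only: covariance-based analysis for comparison with the default. */
/* Assumes the Crime data set from the DATA step above.                     */
proc princomp data=Crime cov;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
```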
Figure 73.1: Number of Observations and Simple Statistics from the PRINCOMP Procedure
Crime Rates per 100,000 Population by State

| Statistic | Value |
|---|---|
| Observations | 50 |
| Variables | 7 |
Simple Statistics

| Statistic | Murder | Rape | Robbery | Assault | Burglary | Larceny | Auto_Theft |
|---|---|---|---|---|---|---|---|
| Mean | 7.444000000 | 25.73400000 | 124.0920000 | 211.3000000 | 1291.904000 | 2671.288000 | 377.5260000 |
| StD | 3.866768941 | 10.75962995 | 88.3485672 | 100.2530492 | 432.455711 | 725.908707 | 193.3944175 |
Correlation Matrix

| Variable | Murder | Rape | Robbery | Assault | Burglary | Larceny | Auto_Theft |
|---|---|---|---|---|---|---|---|
| Murder | 1.0000 | 0.6012 | 0.4837 | 0.6486 | 0.3858 | 0.1019 | 0.0688 |
| Rape | 0.6012 | 1.0000 | 0.5919 | 0.7403 | 0.7121 | 0.6140 | 0.3489 |
| Robbery | 0.4837 | 0.5919 | 1.0000 | 0.5571 | 0.6372 | 0.4467 | 0.5907 |
| Assault | 0.6486 | 0.7403 | 0.5571 | 1.0000 | 0.6229 | 0.4044 | 0.2758 |
| Burglary | 0.3858 | 0.7121 | 0.6372 | 0.6229 | 1.0000 | 0.7921 | 0.5580 |
| Larceny | 0.1019 | 0.6140 | 0.4467 | 0.4044 | 0.7921 | 1.0000 | 0.4442 |
| Auto_Theft | 0.0688 | 0.3489 | 0.5907 | 0.2758 | 0.5580 | 0.4442 | 1.0000 |
Figure 73.2 displays the eigenvalues. The first principal component explains about 58.8% of the total variance, the second principal component explains about 17.7%, and the third principal component explains about 10.4%. Note that the eigenvalues sum to the total variance.
The eigenvalues indicate that two or three components provide a good summary of the data, two components accounting for 76% of the total variance and three components explaining 87%. Subsequent components contribute less than 5% each.
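If only the leading components are of interest, the analysis can be restricted to them. The sketch below uses the N= option to compute three components and PLOTS=SCREE to request a scree plot of the eigenvalues. The output data set name Crime_Top3 is a hypothetical name chosen for illustration.

```sas
/* Sketch only: keep the first three components and request a scree plot. */
/* Crime_Top3 is a hypothetical output data set name.                      */
ods graphics on;

proc princomp data=Crime n=3 out=Crime_Top3 plots=scree;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
```

With N=3, the OUT= data set contains the original variables plus Prin1 through Prin3.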
Figure 73.2: Results of Principal Component Analysis: PROC PRINCOMP
Eigenvalues of the Correlation Matrix

| Component | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 4.11495951 | 2.87623768 | 0.5879 | 0.5879 |
| 2 | 1.23872183 | 0.51290521 | 0.1770 | 0.7648 |
| 3 | 0.72581663 | 0.40938458 | 0.1037 | 0.8685 |
| 4 | 0.31643205 | 0.05845759 | 0.0452 | 0.9137 |
| 5 | 0.25797446 | 0.03593499 | 0.0369 | 0.9506 |
| 6 | 0.22203947 | 0.09798342 | 0.0317 | 0.9823 |
| 7 | 0.12405606 | | 0.0177 | 1.0000 |
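As a worked check of the Proportion and Cumulative columns, each proportion is an eigenvalue divided by the total variance of 7. The symbol $\lambda_k$ for the $k$th eigenvalue is notation introduced here for illustration:

$$
\frac{\lambda_1}{7} = \frac{4.11495951}{7} \approx 0.5879, \qquad
\frac{\lambda_1 + \lambda_2}{7} = \frac{5.35368134}{7} \approx 0.7648, \qquad
\frac{\lambda_1 + \lambda_2 + \lambda_3}{7} = \frac{6.07949797}{7} \approx 0.8685
$$

Adding all seven eigenvalues in the table gives 7.00000001, which matches the total variance up to rounding.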
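Figure 73.3 displays the eigenvectors. From the eigenvectors matrix, you can represent the first principal component Prin1 as a linear combination of the original variables:

$$
\begin{aligned}
\mathrm{Prin1} = {} & 0.300279\,(\mathrm{Murder}) + 0.431759\,(\mathrm{Rape}) + 0.396875\,(\mathrm{Robbery}) \\
& + 0.396652\,(\mathrm{Assault}) + 0.440157\,(\mathrm{Burglary}) + 0.357360\,(\mathrm{Larceny}) \\
& + 0.295177\,(\mathrm{Auto\_Theft})
\end{aligned}
$$

Similarly, the second principal component Prin2 is

$$
\begin{aligned}
\mathrm{Prin2} = {} & -0.629174\,(\mathrm{Murder}) - 0.169435\,(\mathrm{Rape}) + 0.042247\,(\mathrm{Robbery}) \\
& - 0.343528\,(\mathrm{Assault}) + 0.203341\,(\mathrm{Burglary}) + 0.402319\,(\mathrm{Larceny}) \\
& + 0.502421\,(\mathrm{Auto\_Theft})
\end{aligned}
$$

where the variables are standardized.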
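The linear combinations can be checked by hand. The following sketch standardizes the variables with PROC STANDARD, applies the Prin1 and Prin2 coefficients from the Figure 73.3 eigenvectors, and places the results next to the scores in Crime_Components. The data set names Crime_Std, Prin_Check, and Compare and the variables Prin1_Check and Prin2_Check are hypothetical names introduced for illustration. The one-to-one merge assumes both data sets are still in the original row order.

```sas
/* Standardize each variable to mean 0 and standard deviation 1. */
proc standard data=Crime mean=0 std=1 out=Crime_Std;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

/* Apply the first two eigenvectors from Figure 73.3 to the standardized data. */
data Prin_Check;
   set Crime_Std;
   Prin1_Check = 0.300279*Murder + 0.431759*Rape + 0.396875*Robbery
               + 0.396652*Assault + 0.440157*Burglary + 0.357360*Larceny
               + 0.295177*Auto_Theft;
   Prin2_Check = -0.629174*Murder - 0.169435*Rape + 0.042247*Robbery
               - 0.343528*Assault + 0.203341*Burglary + 0.402319*Larceny
               + 0.502421*Auto_Theft;
   keep State Prin1_Check Prin2_Check;
run;

/* One-to-one merge (no BY statement): relies on unchanged row order. */
data Compare;
   merge Crime_Components(keep=State Prin1 Prin2) Prin_Check;
run;

proc print data=Compare(obs=5);
run;
```

The hand-computed values should agree with Prin1 and Prin2 up to the six-decimal rounding of the printed coefficients.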
Figure 73.3: Results of Principal Component Analysis: PROC PRINCOMP
Eigenvectors

| Variable | Prin1 | Prin2 | Prin3 | Prin4 | Prin5 | Prin6 | Prin7 |
|---|---|---|---|---|---|---|---|
| Murder | 0.300279 | -.629174 | 0.178245 | -.232114 | 0.538123 | 0.259117 | 0.267593 |
| Rape | 0.431759 | -.169435 | -.244198 | 0.062216 | 0.188471 | -.773271 | -.296485 |
| Robbery | 0.396875 | 0.042247 | 0.495861 | -.557989 | -.519977 | -.114385 | -.003903 |
| Assault | 0.396652 | -.343528 | -.069510 | 0.629804 | -.506651 | 0.172363 | 0.191745 |
| Burglary | 0.440157 | 0.203341 | -.209895 | -.057555 | 0.101033 | 0.535987 | -.648117 |
| Larceny | 0.357360 | 0.402319 | -.539231 | -.234890 | 0.030099 | 0.039406 | 0.601690 |
| Auto_Theft | 0.295177 | 0.502421 | 0.568384 | 0.419238 | 0.369753 | -.057298 | 0.147046 |
The first component is a measure of the overall crime rate since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on variables Auto_Theft and Larceny and high negative loadings on variables Murder and Assault. There is also a small positive loading on Burglary and a small negative loading on Rape. This component seems to measure the preponderance of property crime over violent crime. The interpretation of the third component is not obvious.
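One way to make "approximately equal loadings" concrete: the Prin1 loadings in Figure 73.3 have unit length, so seven exactly equal loadings would each equal $1/\sqrt{7}$. The symbol $v_{j1}$ for the loading of variable $j$ on Prin1 is notation introduced here for illustration:

$$
\sum_{j=1}^{7} v_{j1}^{2} = 0.300279^2 + 0.431759^2 + \cdots + 0.295177^2 \approx 1.0000,
\qquad
\frac{1}{\sqrt{7}} \approx 0.378
$$

The observed Prin1 loadings range from 0.295177 (Auto_Theft) to 0.440157 (Burglary), which brackets this value fairly closely.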
The ODS GRAPHICS statement enables the PRINCOMP procedure to produce statistical graphs by using ODS Graphics. See Chapter 21: Statistical Graphics Using ODS, for more information. PLOTS=SCORE(ELLIPSE NCOMP=3) in the PROC PRINCOMP statement requests the pairwise component score plots for the first three components, with a 95% prediction ellipse overlaid on each scatter plot.

Figure 73.4 shows the plot of the first two components. It is possible to identify regional trends on this plot. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North and South Dakota are at the extreme left, with low overall crime rates. Southeastern states tend to be at the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a higher-than-average ratio of property crime to violent crime. Assuming the first two components are from a bivariate normal distribution, the ellipse identifies Nevada as a possible outlier.
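The positions described for the plot can also be read from a listing of the scores. The following sketch sorts the OUT= data set by the first component and prints it, so that states with the lowest overall crime rates appear first and those with the highest appear last. The data set name By_Prin1 is a hypothetical name introduced for illustration.

```sas
/* Sketch only: list states in order of the first component score. */
/* By_Prin1 is a hypothetical data set name.                        */
proc sort data=Crime_Components out=By_Prin1;
   by Prin1;
run;

proc print data=By_Prin1;
   id State;
   var Prin1 Prin2 Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
```

Sorting by Prin2 instead would order the states along the vertical axis of the same plot.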
Figure 73.4: Plot of the First Two Component Scores
Figure 73.5 shows the plot of the first and third components. Assuming the first and the third components are from a bivariate normal distribution, the ellipse identifies Nevada, Massachusetts, and New York as possible outliers.
Figure 73.5: Plot of the First and Third Component Scores
The most striking feature of the plot of the first and third principal components is that Massachusetts and New York are outliers on the third component.
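A rough numerical screen for extreme third-component scores is sketched below. It divides Prin3 by the square root of the third eigenvalue from Figure 73.2, since the variance of each unstandardized component score equals its eigenvalue, and keeps states whose scaled score exceeds 2 in absolute value. The cutoff of 2, the variable name Z3, and the data set name Prin3_Flags are assumptions made for illustration. This is a univariate check, not the bivariate 95% prediction ellipse drawn on the plot, so the two need not flag exactly the same states.

```sas
/* Sketch only: flag states with large scaled Prin3 scores.          */
/* Z3, Prin3_Flags, and the cutoff of 2 are illustrative assumptions. */
data Prin3_Flags;
   set Crime_Components;
   /* Var(Prin3) equals the third eigenvalue, 0.72581663 (Figure 73.2). */
   Z3 = Prin3 / sqrt(0.72581663);
   if abs(Z3) > 2;   /* subsetting IF keeps only the extreme scores */
   keep State Prin3 Z3 Robbery Larceny Auto_Theft;
run;

proc print data=Prin3_Flags;
   id State;
run;
```

Robbery, Larceny, and Auto_Theft are kept because they have the largest loadings on Prin3 in Figure 73.3.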