The following data provide crime rates per 100,000 people in seven categories for each of the 50 US states in 1977. Because there are seven numeric variables, it is impossible to plot all the variables simultaneously. You can use principal components to summarize the data in two or three dimensions, which makes the data easier to visualize. The following statements produce Figure 79.1 through Figure 79.5:
title 'Crime Rates per 100,000 Population by State';

data Crime;
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   datalines;
Alabama         14.2 25.2 96.8 278.3 1135.5 1881.9 280.7
Alaska          10.8 51.6 96.8 284.0 1331.7 3369.8 753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas        8.8 27.6 83.2 203.4 972.6 1862.1 183.4
California      11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida         10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia         11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii          7.2 25.5 128.0 64.1 1911.5 3920.4 489.4
Idaho           5.5 19.4 39.6 172.5 1050.8 2599.6 237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa            2.3 10.6 41.2 89.8 812.5 2685.1 219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky        10.1 19.1 81.1 123.3 872.2 1662.1 245.4
Louisiana       15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine           2.4 13.5 38.7 170.0 1253.1 2350.7 246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota       2.7 19.5 85.9 85.8 1134.7 2559.3 343.1
Mississippi     14.3 19.6 65.7 189.1 915.6 1239.9 144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana         5.4 16.7 39.2 156.8 804.9 2773.2 309.2
Nebraska        3.9 18.1 64.7 112.7 760.0 2316.1 249.1
Nevada          15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire   3.2 10.7 23.2 76.0 1041.7 2343.9 293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York        10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina  10.6 17.0 61.3 318.3 1154.1 2037.8 192.1
North Dakota    0.9 9.0 13.3 43.8 446.1 1843.0 144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma        8.6 29.2 73.8 205.0 1288.2 2228.1 326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania    5.6 19.0 130.3 128.0 877.5 1624.1 333.2
Rhode Island    3.6 10.5 86.5 201.0 1489.5 2844.1 791.4
South Carolina  11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota    2.0 13.5 17.9 155.7 570.5 1704.4 147.5
Tennessee       10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas           13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah            3.5 20.3 68.8 147.3 1171.6 3004.6 334.5
Vermont         1.4 15.9 30.8 101.2 1348.2 2201.0 265.2
Virginia        9.0 23.3 92.1 165.7 986.2 2521.2 226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia   6.0 13.2 42.2 90.9 597.4 1341.7 163.3
Wisconsin       2.8 12.9 52.2 63.7 846.9 2614.2 220.7
Wyoming         5.4 21.9 39.7 173.9 811.6 2772.2 282.0
;

ods graphics on;

proc princomp out=Crime_Components plots=score(ellipse ncomp=3);
   id State;
run;
Figure 79.1 displays the PROC PRINCOMP output, beginning with simple statistics and followed by the correlation matrix. By default, the PROC PRINCOMP statement requests principal components that are computed from the correlation matrix, so the total variance is equal to the number of variables, 7.
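For contrast with the correlation-based default, the following is a minimal sketch of how the same analysis could be requested from the covariance matrix by using the COV option. It assumes the Crime data set created earlier; the output data set name Crime_Cov is a hypothetical name chosen for illustration.

```sas
/* Sketch only: analyze the covariance matrix instead of the      */
/* correlation matrix. Crime_Cov is a hypothetical data set name. */
proc princomp data=Crime cov out=Crime_Cov;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
```

With COV, the total variance is the sum of the variances of the original variables rather than 7, so variables measured on a larger scale (here, most likely Larceny and Burglary) would tend to dominate the leading components. That is one reason the correlation matrix is a reasonable default for these data.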
Figure 79.1: Number of Observations and Simple Statistics from the PRINCOMP Procedure
Correlation Matrix

| | Murder | Rape | Robbery | Assault | Burglary | Larceny | Auto_Theft |
|---|---|---|---|---|---|---|---|
Murder | 1.0000 | 0.6012 | 0.4837 | 0.6486 | 0.3858 | 0.1019 | 0.0688 |
Rape | 0.6012 | 1.0000 | 0.5919 | 0.7403 | 0.7121 | 0.6140 | 0.3489 |
Robbery | 0.4837 | 0.5919 | 1.0000 | 0.5571 | 0.6372 | 0.4467 | 0.5907 |
Assault | 0.6486 | 0.7403 | 0.5571 | 1.0000 | 0.6229 | 0.4044 | 0.2758 |
Burglary | 0.3858 | 0.7121 | 0.6372 | 0.6229 | 1.0000 | 0.7921 | 0.5580 |
Larceny | 0.1019 | 0.6140 | 0.4467 | 0.4044 | 0.7921 | 1.0000 | 0.4442 |
Auto_Theft | 0.0688 | 0.3489 | 0.5907 | 0.2758 | 0.5580 | 0.4442 | 1.0000 |
Figure 79.2 displays the eigenvalues. The first principal component accounts for about 58.8% of the total variance, the second principal component accounts for about 17.7%, and the third principal component accounts for about 10.4%. Note that the eigenvalues sum to the total variance.
The eigenvalues indicate that two or three components provide a good summary of the data: two components account for 76% of the total variance, and three components account for 87%. Subsequent components account for less than 5% each.
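As a worked check of these percentages, each proportion is an eigenvalue from Figure 79.2 divided by the total variance, which equals 7 for a correlation-matrix analysis of seven variables. The symbol $\lambda_k$ for the $k$th eigenvalue is notation introduced here for illustration.

```latex
\begin{aligned}
\text{Proportion}_k &= \frac{\lambda_k}{\sum_{j=1}^{7} \lambda_j} = \frac{\lambda_k}{7},
  \qquad \text{Proportion}_1 = \frac{4.11495951}{7} \approx 0.5879 \\[4pt]
\text{Cumulative}_2 &= \frac{4.11495951 + 1.23872183}{7} = \frac{5.35368134}{7} \approx 0.7648 \\[4pt]
\text{Cumulative}_3 &= \frac{5.35368134 + 0.72581663}{7} = \frac{6.07949797}{7} \approx 0.8685
\end{aligned}
```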
Figure 79.2: Results of Principal Component Analysis: PROC PRINCOMP
Eigenvalues of the Correlation Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
1 | 4.11495951 | 2.87623768 | 0.5879 | 0.5879 |
2 | 1.23872183 | 0.51290521 | 0.1770 | 0.7648 |
3 | 0.72581663 | 0.40938458 | 0.1037 | 0.8685 |
4 | 0.31643205 | 0.05845759 | 0.0452 | 0.9137 |
5 | 0.25797446 | 0.03593499 | 0.0369 | 0.9506 |
6 | 0.22203947 | 0.09798342 | 0.0317 | 0.9823 |
7 | 0.12405606 | | 0.0177 | 1.0000 |
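Figure 79.3 displays the eigenvectors. From the eigenvectors matrix, you can represent the first principal component, Prin1, as a linear combination of the original variables:

$$
\begin{aligned}
\text{Prin1} = {} & 0.300279\,(\text{Murder}) + 0.431759\,(\text{Rape}) + 0.396875\,(\text{Robbery}) + 0.396652\,(\text{Assault}) \\
& + 0.440157\,(\text{Burglary}) + 0.357360\,(\text{Larceny}) + 0.295177\,(\text{Auto\_Theft})
\end{aligned}
$$

Similarly, the second principal component, Prin2, is

$$
\begin{aligned}
\text{Prin2} = {} & -0.629174\,(\text{Murder}) - 0.169435\,(\text{Rape}) + 0.042247\,(\text{Robbery}) - 0.343528\,(\text{Assault}) \\
& + 0.203341\,(\text{Burglary}) + 0.402319\,(\text{Larceny}) + 0.502421\,(\text{Auto\_Theft})
\end{aligned}
$$

where the variables are standardized.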
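The linear combination for Prin1 can be checked by hand. The following sketch standardizes the seven variables and applies the Prin1 coefficients from Figure 79.3. It assumes the Crime data set created earlier; the data set names Crime_Std and Prin1_Check and the variable Prin1_Manual are hypothetical names introduced for illustration.

```sas
/* Standardize each crime variable to mean 0 and standard deviation 1. */
/* Crime_Std is a hypothetical data set name.                          */
proc standard data=Crime mean=0 std=1 out=Crime_Std;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

/* Apply the Prin1 coefficients listed in Figure 79.3.                 */
/* Prin1_Check and Prin1_Manual are hypothetical names.                */
data Prin1_Check;
   set Crime_Std;
   Prin1_Manual = 0.300279*Murder   + 0.431759*Rape
                + 0.396875*Robbery  + 0.396652*Assault
                + 0.440157*Burglary + 0.357360*Larceny
                + 0.295177*Auto_Theft;
   keep State Prin1_Manual;
run;

proc print data=Prin1_Check(obs=5);
run;
```

If the defaults are left unchanged, the values of Prin1_Manual should agree, up to rounding of the coefficients, with the variable Prin1 in the Crime_Components data set that the OUT= option creates.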
Figure 79.3: Results of Principal Component Analysis: PROC PRINCOMP
Eigenvectors

| | Prin1 | Prin2 | Prin3 | Prin4 | Prin5 | Prin6 | Prin7 |
|---|---|---|---|---|---|---|---|
Murder | 0.300279 | -.629174 | 0.178245 | -.232114 | 0.538123 | 0.259117 | 0.267593 |
Rape | 0.431759 | -.169435 | -.244198 | 0.062216 | 0.188471 | -.773271 | -.296485 |
Robbery | 0.396875 | 0.042247 | 0.495861 | -.557989 | -.519977 | -.114385 | -.003903 |
Assault | 0.396652 | -.343528 | -.069510 | 0.629804 | -.506651 | 0.172363 | 0.191745 |
Burglary | 0.440157 | 0.203341 | -.209895 | -.057555 | 0.101033 | 0.535987 | -.648117 |
Larceny | 0.357360 | 0.402319 | -.539231 | -.234890 | 0.030099 | 0.039406 | 0.601690 |
Auto_Theft | 0.295177 | 0.502421 | 0.568384 | 0.419238 | 0.369753 | -.057298 | 0.147046 |
The first component is a measure of the overall crime rate because the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on the variables Auto_Theft and Larceny and high negative loadings on the variables Murder and Assault. There is also a small positive loading on the variable Burglary and a small negative loading on the variable Rape. This component seems to measure the preponderance of property crime compared to violent crime. The interpretation of the third component is not obvious.
The ODS GRAPHICS statement enables the creation of graphs. For more information, see Chapter 21: Statistical Graphics Using ODS. The option PLOTS=SCORE(ELLIPSE NCOMP=3) in the PROC PRINCOMP statement requests the pairwise component score plots for the first three components, with a 95% prediction ellipse overlaid on each scatter plot.

Figure 79.4 shows the plot of the first two components, in which you can identify regional trends. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North Dakota and South Dakota are at the extreme left, with low overall crime rates. Southeastern states tend to be at the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a higher-than-average ratio of property crime to violent crime. Assuming that the first two components are from a bivariate normal distribution, the ellipse identifies Nevada as a possible outlier.
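If you want more control over the score plot than the PLOTS= option offers, one possible approach is to plot the scores saved by the OUT= option directly. The following sketch assumes the Crime_Components data set created earlier and uses PROC SGPLOT; it is an illustrative alternative, not the code that produces Figure 79.4.

```sas
/* Sketch: scatter plot of the first two component scores with state  */
/* labels and a 95% prediction ellipse, drawn from the OUT= data set. */
proc sgplot data=Crime_Components;
   scatter x=Prin1 y=Prin2 / datalabel=State;
   ellipse x=Prin1 y=Prin2 / type=predicted alpha=0.05;
   refline 0 / axis=x;   /* average overall crime rate          */
   refline 0 / axis=y;   /* average property-to-violent balance */
run;
```

The reference lines at zero make it easier to see which states fall above or below average on each component, because the component scores are centered at zero.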
Figure 79.5 shows the plot of the first and third components. Assuming that the first and third components are from a bivariate normal distribution, the ellipse identifies Nevada, Massachusetts, and New York as possible outliers.
The most striking feature of the plot of the first and third principal components is that Massachusetts and New York are outliers on the third component.