A quickie.

If you’re interested in principal components (and you should be), take a good look at this. The linked post walks you through doing everything from scratch. If you’re not in the mood, the following R functions will help you.

An example.

# Generates a sample matrix of five discrete clusters that have
# very different mean and standard deviation values.
# Each vector fills 500 consecutive rows of the 2500 x 20 matrix.
z1 <- rnorm(10000, mean=1, sd=1)
z2 <- rnorm(10000, mean=3, sd=3)
z3 <- rnorm(10000, mean=5, sd=5)
z4 <- rnorm(10000, mean=7, sd=7)
z5 <- rnorm(10000, mean=9, sd=9)
mydata <- matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=TRUE,
                 dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep="")))

# Performs principal component analysis after scaling the data.
# It returns a list with class "prcomp" that contains five components:
#   (1) the standard deviations (sdev) of the principal components,
#   (2) the matrix of eigenvectors (rotation),
#   (3) the principal component data (x),
#   (4) the centering (center) and
#   (5) scaling (scale) used.
pca <- prcomp(mydata, scale.=TRUE)

# Prints variance summary for all principal components.
summary(pca)

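# Opens a new graphics device and sets the plotting parameters
# (dev.new() is used here as a portable alternative to x11()).
dev.new(height=6, width=12, pointsize=12)
par(mfrow=c(1,2))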

# Define plotting colors.
mycolors <- c("red", "green", "blue", "magenta", "black")

# Plots a scatter plot for the first two principal components
# that are stored in pca$x[,1:2].
plot(pca$x, pch=20, col=mycolors[sort(rep(1:5, 500))])
# Same as above, but prints the row labels instead of points.
plot(pca$x, type="n")
text(pca$x, rownames(pca$x), cex=0.8,
     col=mycolors[sort(rep(1:5, 500))])

# Plots scatter plots for all combinations between the first four principal components.
pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))])

# Plots a scatter plot for the first two principal components
# plus the corresponding eigenvectors that are stored in pca$rotation.
biplot(pca)

# Loads library scatterplot3d.
library(scatterplot3d)
# Plots the first three principal components in a 3D scatter plot.
scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))])

# Output of summary(pca):
# Importance of components:
#                          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
# Standard deviation     2.157 0.9953 0.9831 0.9684 0.9601 0.9465 0.9340 0.9288
# Proportion of Variance 0.233 0.0495 0.0483 0.0469 0.0461 0.0448 0.0436 0.0431
# Cumulative Proportion  0.233 0.2822 0.3305 0.3774 0.4235 0.4683 0.5119 0.5550
#                           PC9   PC10   PC11   PC12   PC13   PC14   PC15   PC16
# Standard deviation     0.9030 0.8989 0.8930 0.8763 0.8703 0.8656 0.8573 0.8458
# Proportion of Variance 0.0408 0.0404 0.0399 0.0384 0.0379 0.0375 0.0367 0.0358
# Cumulative Proportion  0.5958 0.6362 0.6761 0.7145 0.7523 0.7898 0.8265 0.8623
#                          PC17   PC18   PC19   PC20
# Standard deviation     0.8415 0.8360 0.8302 0.8110
# Proportion of Variance 0.0354 0.0349 0.0345 0.0329
# Cumulative Proportion  0.8977 0.9326 0.9671 1.0000
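
To see how the components listed above relate to each other, here is a minimal sketch on a small made-up matrix. The names demo, fit, prop_var and scores are mine and are not part of the example above, and the data are random numbers chosen only for illustration. It shows two things: the "Proportion of Variance" row of summary() is the squared sdev divided by its total, and x is the centered and scaled data multiplied by rotation.

```r
# Small stand-in data set (hypothetical; not the matrix from the post).
set.seed(1)
demo <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
demo[, 2] <- demo[, 1] + 0.5 * demo[, 2]   # make two columns correlated

fit <- prcomp(demo, scale. = TRUE)

# Proportion of variance explained by each component:
# squared standard deviations divided by their sum.
prop_var <- fit$sdev^2 / sum(fit$sdev^2)

# The principal component data (x): center and scale the input with the
# stored values, then project it onto the eigenvectors (rotation).
scores <- scale(demo, center = fit$center, scale = fit$scale) %*% fit$rotation

# Both comparisons should hold up to floating point error.
stopifnot(isTRUE(all.equal(unname(scores), unname(fit$x))))
stopifnot(isTRUE(all.equal(sum(prop_var), 1)))
```

So each column of x holds the coordinates of the rows of the original matrix along one principal component, which is why plotting the first two columns gives the usual PCA scatter plot.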



Comments

  1. Dymphy says:

    I have a question. What does ‘pca$x’ stand for? I know that pca is the name of the PCA you did earlier, and that $ is used for selecting a specific column, but I can’t deduce where the x comes from.