This script will demonstrate how to perform PCA analysis. This process involves: - Re-scaling parameters (if necessary) - Calculating principal components - Visualizing variance explained - Visualizing particular components with scatterplots
The data being used today is the GAPIT demo data. This data is from a maize diversity panel consisting of 281 lines with 3093 markers. The setup portion of the Rmarkdown will check if the data is downloaded in the markdown project folder. If not, it will automatically download it and unzip it for use in the markdown.
knitr::opts_chunk$set(echo = F)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(gridExtra)
library(knitr)
# Download the demo data if necessary
if (!file.exists("GAPIT_Tutorial_Data.zip"))
{
download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip", destfile = "GAPIT_Tutorial_Data.zip")
unzip("GAPIT_Tutorial_Data.zip")
}
# Import the GAPIT demo data genotypes
gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t"))
marker_map <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t")
PCA is sensitive to scale. For example, if you have two variables that describe a set of objects. The first is a continuous variable and ranges from 0 to >1000. The second is a continuous index that ranges from -1 to 1. If you leave the variables ‘as is’, the first variable will dominate the PCA simply because the range is greater. For a real-world example, here is a blog post that demonstrates this with a set of wine data: PCA rescaling example
Because genotype values all have the same range (0,1,2) or (-1, 0, 1), rescaling isn’t necessary. If scaling is necessary, you can either transform them using your own strategy, or you can use the scaling feature included with the PCA function.
R has two built-in PCA functions: princomp() and prcomp(). The function princomp() uses the spectral decomposition approach. The functions prcomp() use the singular value decomposition (SVD). It is outside the scope of this lab, but it is useful to know that SVD is considered to be generally superior to spectral decomposition so we will use prcomp().
gt_pca_results <- prcomp(genotypes)
names(gt_pca_results)
## [1] "sdev" "rotation" "center" "scale" "x"
The output of this function is a prcomp object that contains a list of several items. The first sdev is the standard deviations of the principal components (used to calculate variance-explained). The next item rotation is a matrix of the variable loadings (columns are eigenvectors). The center item is the variable means that were subtracted. Scale are the variable standard deviations. Lastly, the x item contains the coordinates of the individuals on the principal components. This is what is used for plotting the components.
A useful function for getting the variance explained is the generic summary() function. When this function is called on a prcomp object, it will return the prcomp object with an additional item in the list called importance. This item is a matrix that has the standard deviation, variance explained, and cumulative variance for each component.
gt_pca_results <- summary(gt_pca_results)
kable(round(gt_pca_results$importance[,1:10], 2))
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
---|---|---|---|---|---|---|---|---|---|---|
Standard deviation | 10.28 | 7.76 | 6.27 | 5.73 | 5.65 | 5.26 | 5.11 | 4.87 | 4.80 | 4.43 |
Proportion of Variance | 0.06 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
Cumulative Proportion | 0.06 | 0.09 | 0.12 | 0.13 | 0.15 | 0.17 | 0.18 | 0.19 | 0.21 | 0.22 |
var_exp_plot_data <- data.frame(t(gt_pca_results$importance)) # I transposed the data to be in long form and coerce to a data.frame for ggplot
names(var_exp_plot_data) <- c("sdev", "var_exp", "cum_var_exp") # rename the columns (because the original rownames has spaces)
var_exp_plot_data$pca_index <- 1:nrow(var_exp_plot_data) # Add a new column that is the integer index of the component
var_exp_plot <- ggplot(data = var_exp_plot_data, aes(x = pca_index, y = var_exp)) +
geom_line() +
geom_point() +
labs(x = "PCA component index", y = "Variance explained", title = "GAPIT Demo GT: Variance Explained")
var_exp_plot
This plot shows multiple things. The first component doesn’t explain a massive portion of the variance (only 6%). Second, the variance explained drops off quickly. This is a typical pattern to observe with PCA. Next, we will plot the cumulative variance explained.
cum_var_exp_plot <- ggplot(data = var_exp_plot_data, aes(x = pca_index, y = cum_var_exp)) +
geom_line() +
geom_point() +
labs(x = "PCA component index", y = "Cumulative Variance explained", title = "GAPIT Demo GT: Cumulative Variance Explained")
cum_var_exp_plot
Looking at the cumulative variance isn’t drastically different, except it makes it very easy to quickly identify how many components explain x% of the data. For example: the top 50 components explain 50% of the data.
Next, we are going to visualize these PCA components with scatterplots. We will start with 2d graphs, and then will create a 3d plot.
pca_comp_plot_data <- data.frame(gt_pca_results$x)
# Plotting the first two components
pca_comp_plot_12 <-
ggplot(data = pca_comp_plot_data, aes(x = PC1, y = PC2)) +
geom_point()
pca_comp_plot_13 <-
ggplot(data = pca_comp_plot_data, aes(x = PC1, y = PC3)) +
geom_point()
pca_comp_plot_23 <-
ggplot(data = pca_comp_plot_data, aes(x = PC2, y = PC3)) +
geom_point()
grid.arrange(pca_comp_plot_12, pca_comp_plot_13, pca_comp_plot_23, nrow = 2, ncol = 2, top = "GAPIT Demo GT: Principal Components")
Based on the PCA plots, there seems to be a set of individuals that differ from the majority of samples and that these differences are gradual (there arent any big gaps between individuals). The second component divides the individuals into two main groups (suggesting structure) and the third component further pulls out a small group.
Rstudio has some issues with rendering Plotly output and embedding .html files. I currently do not know of a workaround, but you can open this plot by double-clicking on the “PCA_3d.html” file located in the same folder as the knitted Rmarkdown report. Notice that this 3d plot is interactive and can be rotated, zoomed, and even saved as a static image for later use.