The objective of this report is to demonstrate how k-fold and jackknife validation can be used to test different genomic selection models (a short sketch of the fold assignments themselves follows the method list below). Because genomic selection has not been formally introduced in class, we will use a makeshift combination of methods already covered as a proxy for the methods that will be taught later this week.
Method 1: Use pairs of principal components derived from the genotype matrix.
Method 2: Use the top n markers ranked by GWAS p-value (keeping in mind the methodology covered last lecture).
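As a quick refresher before diving in, here is a minimal sketch of how k-fold and jackknife (leave-one-out) partitions can be generated in base R. The sample size and seed are arbitrary illustration choices, not part of the lab code:
# Minimal sketch: assign n individuals to k folds; the jackknife is the
# special case where each individual forms its own fold (k = n)
n <- 281 # number of individuals in the demo data used below
k <- 5
set.seed(1)
fold_id <- sample(rep(1:k, length.out = n)) # random, roughly equal folds
jackknife_id <- 1:n # leave-one-out: one individual per fold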
library(GAPIT3)
## Warning: replacing previous import 'multtest::wapply' by 'gplots::wapply' when
## loading 'GAPIT3'
library(ggplot2)
library(caret)
## Loading required package: lattice
library(knitr)
source("functions_lab10.R")
source("http://zzlab.net/StaGen/2020/R/G2P.R")
# Download the demo data if necessary
if (!file.exists("GAPIT_Tutorial_Data.zip")) {
  download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip", destfile = "GAPIT_Tutorial_Data.zip")
  unzip("GAPIT_Tutorial_Data.zip")
}
# Import the GAPIT demo data genotypes
# Read the first row only to infer column classes, then pass colClasses
# so the full genotype table loads faster
gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t"))
# Filter SNPs on minor allele frequency (MAF)
maf_thresh <- 0.05 # a commonly used MAF threshold
maf <- calc_maf_apply(genotypes)
genotypes <- genotypes[, maf >= maf_thresh]
marker_map <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t")
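Note that calc_maf_apply() comes from functions_lab10.R, which is not shown in this excerpt. As a rough sketch, a MAF calculator along these lines would work, assuming the genotypes are numerically coded 0/1/2 (the function name and coding here are assumptions, not the lab's actual implementation):
# Hypothetical sketch of a MAF calculator, assuming 0/1/2 genotype coding
calc_maf_sketch <- function(geno) {
  apply(geno, 2, function(snp) {
    p <- mean(snp, na.rm = TRUE) / 2 # frequency of the allele coded as 2
    min(p, 1 - p) # minor allele frequency
  })
}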
cv_plot <- qplot(cv_acc_means, bins = 50) +
  geom_vline(xintercept = top3_mean_acc, color = 'red') +
  labs(title = "GLM regression accuracy with 3 random PCA components, 5-fold cross-validation",
       x = "Mean r-squared",
       y = "Number of models",
       caption = "Vertical line: mean accuracy using the top 3 components")
cv_plot
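For context, cv_acc_means and top3_mean_acc come from earlier steps of the lab that are not shown in this excerpt. A minimal sketch of that workflow might look like the following, assuming sim_pheno is the phenotype simulated with G2P earlier in the lab; the seed, the 100 random draws, and the helper name are illustrative assumptions:
# Hypothetical sketch of the CV loop behind the histogram above
phenotype <- sim_pheno$y # simulated trait from G2P (created earlier in the lab)
pcs <- prcomp(genotypes)$x # principal components of the genotype matrix
set.seed(42)
folds <- sample(rep(1:5, length.out = length(phenotype)))
cv_rsq <- function(pc_idx) {
  # mean held-out r-squared of a GLM on the chosen PCs across 5 folds
  mean(sapply(1:5, function(k) {
    df <- data.frame(y = phenotype, pcs[, pc_idx])
    fit <- lm(y ~ ., data = df[folds != k, ])
    pred <- predict(fit, newdata = df[folds == k, ])
    cor(pred, df$y[folds == k])^2
  }))
}
cv_acc_means <- replicate(100, cv_rsq(sample(ncol(pcs), 3))) # random PC triples
top3_mean_acc <- cv_rsq(1:3) # the top 3 components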
It is also possible to use the cross-validation framework in the caret package (a popular R package for machine learning). I won’t go in-depth on this because general-purpose machine learning is a bit beyond the scope of this course, but I highly recommend you check out this caret tutorial: The caret Package.
The following code demonstrates how caret abstracts 5-fold cross-validation to work with many different kinds of models with very little code (that’s the power of functions!). In this case, I have chosen the ‘glmboost’ algorithm, a method based on a machine-learning technique called gradient boosting.
# Combine the simulated phenotype with the genotype matrix
phenotype <- sim_pheno$y
mydata <- cbind(phenotype, genotypes)
# Ask caret for 5-fold cross-validation, then fit a boosted GLM
train_control <- trainControl(method = "cv", number = 5)
model <- train(phenotype ~ ., data = mydata, trControl = train_control, method = "glmboost")
model
## Boosted Generalized Linear Model
##
## 281 samples
## 2683 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 225, 224, 225, 225, 225
## Resampling results across tuning parameters:
##
## mstop RMSE Rsquared MAE
## 50 3.245513 0.5857484 2.557190
## 100 2.938038 0.6428847 2.352163
## 150 2.872901 0.6384974 2.313921
##
## Tuning parameter 'prune' was held constant at a value of no
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mstop = 150 and prune = no.
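The jackknife from the report objective maps directly onto caret’s leave-one-out cross-validation, so the same pattern works with a one-line change to trainControl. A sketch (keep in mind that LOOCV refits the model once per individual, so it is slow with 2,683 predictors):
# Jackknife (leave-one-out) validation in caret: n folds of size 1
loo_control <- trainControl(method = "LOOCV")
loo_model <- train(phenotype ~ ., data = mydata,
                   trControl = loo_control, method = "glmboost")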