Objective

The objective of this report is to demonstrate how k-fold and jackknife validation can be used to evaluate different genomic selection models. Because genomic selection has not been formally introduced in class, we will use a makeshift combination of methods already covered as a proxy for the methods that will be taught later this week.

Method 1: Use small sets of principal components derived from the genotype matrix.

Method 2: Use the top n markers based on GWAS p-values (keeping in mind the methodology covered in the last lecture).

library(GAPIT3)
## Warning: replacing previous import 'multtest::wapply' by 'gplots::wapply' when
## loading 'GAPIT3'
library(ggplot2)
library(caret)
## Loading required package: lattice
library(knitr)

source("functions_lab10.R")
source("http://zzlab.net/StaGen/2020/R/G2P.R")

# Download the demo data if necessary
if (!file.exists("GAPIT_Tutorial_Data.zip"))
{
  download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip", destfile = "GAPIT_Tutorial_Data.zip")
  unzip("GAPIT_Tutorial_Data.zip")
}

# Import the GAPIT demo data genotypes.
# Read one row first to detect the column classes, then re-read the full file
# with colClasses set, which speeds up read.table() on wide genotype files.
gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t"))

# Filter SNPs on minor allele frequency (MAF)
maf_thresh <- 0.05 # a typical MAF threshold
maf <- calc_maf_apply(genotypes) # calc_maf_apply() comes from functions_lab10.R
genotypes <- genotypes[, maf >= maf_thresh]
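
For reference, here is a minimal sketch of what calc_maf_apply() might compute, assuming the mdp_numeric genotypes are coded 0/1/2 (allele counts); the real helper is defined in functions_lab10.R and may differ in detail.

# Sketch only: per-marker minor allele frequency for 0/1/2-coded genotypes
calc_maf_sketch <- function(geno) {
  apply(geno, 2, function(marker) {
    p <- mean(marker, na.rm = TRUE) / 2 # frequency of the counted allele
    min(p, 1 - p)                       # minor allele frequency
  })
}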

marker_map <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t")

Using a simple GS method that relies only on principal components of the genotype matrix and basic linear regression.
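
The chunk that produces cv_acc_means and top3_mean_acc is not shown above, so the following is only a sketch of how it could be assembled: it assumes sim_pheno is the phenotype object simulated earlier with G2P() (so sim_pheno$y is the phenotype) and that each candidate model uses 3 randomly chosen principal components, matching the plot title below. The names geno_pca and cv_r2 are hypothetical.

# Sketch (assumed reconstruction): 5-fold cross-validation of linear models
# built from 3 randomly chosen PCs, versus a model using the top 3 PCs.
set.seed(42)
geno_pca <- prcomp(genotypes)    # principal components of the genotype matrix
pcs <- geno_pca$x
y <- sim_pheno$y                 # simulated phenotype (assumed from G2P())
folds <- createFolds(y, k = 5)   # caret's 5-fold test-set indices

# Mean cross-validated r-squared for a given set of PC columns
cv_r2 <- function(pc_idx) {
  mean(sapply(folds, function(test_idx) {
    train_df <- data.frame(y = y[-test_idx], pcs[-test_idx, pc_idx, drop = FALSE])
    test_df  <- data.frame(pcs[test_idx, pc_idx, drop = FALSE])
    fit <- lm(y ~ ., data = train_df)
    cor(predict(fit, newdata = test_df), y[test_idx])^2
  }))
}

cv_acc_means <- replicate(1000, cv_r2(sample(ncol(pcs), 3))) # random 3-PC models
top3_mean_acc <- cv_r2(1:3)                                  # top 3 PCs only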

Plotting the cross-validation results

cv_plot <- qplot(cv_acc_means, bins = 50) + 
  geom_vline(xintercept = top3_mean_acc, color = 'red') +
  labs(title = "GLM regression accuracy with 3 random PCA components, 5-fold cross-validation",
       x = "Mean r-squared", 
       y = "Number of models",
       caption = "Vertical line is the top 3 components")

cv_plot

Cross-Validation with caret

It is also possible to use the cross-validation framework in the caret package (a popular R package for machine learning). I won’t go in-depth on this because general-purpose machine learning is a bit beyond the scope of this course, but I highly recommend you check out this caret tutorial: The caret Package

The following code demonstrates how caret abstracts 5-fold cross-validation to work with many different kinds of models with very little code (that’s the power of functions!). In this case, I have chosen the ‘glmboost’ algorithm, which fits a boosted generalized linear model using gradient boosting.

phenotype <- sim_pheno$y # simulated phenotype from G2P()
mydata <- cbind(phenotype, genotypes) # response plus all filtered markers
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
model <- train(phenotype~., data = mydata, trControl = train_control, method = "glmboost")
model
## Boosted Generalized Linear Model 
## 
##  281 samples
## 2683 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 225, 224, 225, 225, 225 
## Resampling results across tuning parameters:
## 
##   mstop  RMSE      Rsquared   MAE     
##    50    3.245513  0.5857484  2.557190
##   100    2.938038  0.6428847  2.352163
##   150    2.872901  0.6384974  2.313921
## 
## Tuning parameter 'prune' was held constant at a value of no
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mstop = 150 and prune = no.
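
As a quick follow-up, the fitted train object keeps the resampling summary and the selected tuning values; these are standard caret accessors:

model$results                            # cross-validated metrics per mstop value
model$bestTune                           # tuning values of the selected model
head(predict(model, newdata = mydata))   # predictions from the final model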