---
title: "REPORT_CV_jankGS_testing"
author: "Matthew McGowan"
date: "April 8, 2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Objective

The objective of this report is to demonstrate how k-fold and jackknife validation can be used to test different genomic selection models. Because genomic selection has not been formally introduced in class, we will use a makeshift combination of methods already covered as a proxy for the methods that will be taught later this week.

- Method 1: Use small sets of principal components derived from the genotype matrix.
- Method 2: Use the n top markers ranked by GWAS p-value (keeping in mind the methodology covered in the last lecture).

```{r loading-data-and-dependencies}
library(GAPIT3)
library(ggplot2)
library(caret)
library(knitr)

source("functions_lab10.R")
source("http://zzlab.net/StaGen/2020/R/G2P.R")

# Download and extract the demo data if necessary
if (!file.exists("GAPIT_Tutorial_Data.zip")) {
  download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip",
                destfile = "GAPIT_Tutorial_Data.zip")
  unzip("GAPIT_Tutorial_Data.zip")
}

# Scan the first row of the GAPIT demo genotypes to infer column classes,
# then reuse those classes to speed up the full import
gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt",
                                 header = TRUE, stringsAsFactors = FALSE,
                                 sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt",
                                   header = TRUE, row.names = 1,
                                   colClasses = classes,
                                   stringsAsFactors = FALSE, sep = "\t"))

# Filter SNPs on minor allele frequency (MAF)
maf_thresh <- 0.05  # a typical MAF threshold
maf <- calc_maf_apply(genotypes)
genotypes <- genotypes[, maf >= maf_thresh]

marker_map <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt",
                         header = TRUE, stringsAsFactors = FALSE, sep = "\t")
```

## Using a simple GS method that only uses principal components and basic linear regression

```{r cross-validation-with-lm-jank, echo=FALSE}
set.seed(1337)
sim_pheno <- G2P(genotypes, h2 = 0.75, NQTN = 15, distribution = 'normal')
pca_results <- prcomp(genotypes)

# Randomly select 3 PCA components, n times
n <- 1000
random_3comp_sets <- replicate(n, sample(1:ncol(pca_results$x), size = 3))

# Perform 5-fold cross-validation using the lm_jank() method
cv_accuracies <- apply(random_3comp_sets, 2, function(x) {
  # Note: this works with any set of predictors and observations,
  # provided they meet the model assumptions
  cv_sets <- make_CV_sets(sim_pheno$y, pca_results$x[, x])
  # Note: the modeling approach could be swapped for anything
  kfold_accuracies <- lm_jank(cv_sets)
  return(kfold_accuracies)
})

# Calculate the mean accuracy for each random PCA set
cv_acc_means <- apply(cv_accuracies, 2, mean)

# Perform 5-fold cross-validation using specifically the top 3 components
top3_cv_acc <- lm_jank(make_CV_sets(sim_pheno$y, pca_results$x[, 1:3]))
top3_mean_acc <- mean(top3_cv_acc)
```

### Plotting the cross-validation results

```{r plot_cv_results}
cv_plot <- qplot(cv_acc_means, bins = 50) +
  geom_vline(xintercept = top3_mean_acc, color = 'red') +
  labs(title = "GLM regression accuracy with 3 random PCA components, 5-fold cross-validation",
       x = "Mean r-squared",
       y = "Number of models",
       caption = "Vertical line is the top 3 components")
cv_plot
```

## Cross-Validation with *caret*

It is also possible to use the cross-validation framework in the *caret* package, a widely used R package for machine learning. I won't go in-depth here because general machine learning is beyond the scope of this course, but I highly recommend this tutorial: [The caret Package](http://topepo.github.io/caret/index.html)

The following code demonstrates how *caret* abstracts 5-fold cross-validation to work with many different kinds of models using very little code (that's the power of functions!).
In this case, I have chosen the 'glmboost' algorithm, which fits a model using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting).

```{r cv_with_caret}
phenotype <- sim_pheno$y
mydata <- cbind(phenotype, genotypes)

# Define 5-fold cross-validation, then fit a boosted GLM with it
train_control <- trainControl(method = "cv", number = 5)
model <- train(phenotype ~ ., data = mydata,
               trControl = train_control, method = "glmboost")
model
```
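## Appendix: what the k-fold helpers are doing

The helper functions `make_CV_sets()` and `lm_jank()` used earlier are defined in `functions_lab10.R` and are not shown in this report. For readers without that file, here is a minimal self-contained sketch of the same idea: assign observations to 5 folds, fit a linear model on each training split, and score the held-out fold with r-squared. Everything here (`toy_x`, `toy_y`, `kfold_r2`) is an illustrative stand-in, not the actual lab code or the GAPIT demo data.

```r
# Minimal sketch of 5-fold cross-validation in base R.
# toy_x / toy_y / kfold_r2 are hypothetical stand-ins for the lab helpers.
set.seed(42)
n_obs <- 100
toy_x <- matrix(rnorm(n_obs * 3), ncol = 3)                 # 3 predictors (e.g. 3 PCs)
toy_y <- as.vector(toy_x %*% c(1, 0.5, -0.5) + rnorm(n_obs)) # simulated phenotype

kfold_r2 <- function(y, x, k = 5) {
  folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment
  sapply(1:k, function(i) {
    train <- folds != i
    fit <- lm(y[train] ~ x[train, , drop = FALSE])   # fit on the training folds
    preds <- cbind(1, x[!train, , drop = FALSE]) %*% coef(fit)  # predict held-out fold
    cor(y[!train], preds)^2                          # accuracy as r-squared
  })
}

accuracies <- kfold_r2(toy_y, toy_x)
mean(accuracies)
```

The same loop structure works for any predictor matrix and any model that can produce predictions for the held-out fold, which is exactly why the report above can swap random PCA sets in and out so cheaply.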
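## Appendix: a jackknife sketch

The objective also mentions jackknife validation, which never appears in the chunks above. Jackknife validation is simply leave-one-out cross-validation: drop one observation, fit on the remaining n-1, and predict the dropped observation, repeating for every observation. A minimal sketch on simulated stand-in data (`toy_x` and `toy_y` here are hypothetical, not the GAPIT genotypes):

```r
# Minimal sketch of jackknife (leave-one-out) validation of a linear model.
# toy_x / toy_y are simulated stand-ins, not the GAPIT demo data.
set.seed(42)
n_obs <- 50
toy_x <- matrix(rnorm(n_obs * 3), ncol = 3)
toy_y <- as.vector(toy_x %*% c(1, 0.5, -0.5) + rnorm(n_obs))

# Leave out one observation at a time and predict it from the rest
loo_preds <- sapply(1:n_obs, function(i) {
  fit <- lm(toy_y[-i] ~ toy_x[-i, , drop = FALSE])
  as.numeric(cbind(1, toy_x[i, , drop = FALSE]) %*% coef(fit))
})

# One overall accuracy across all held-out predictions
jackknife_r2 <- cor(toy_y, loo_preds)^2
jackknife_r2
```

In *caret* the same scheme is available directly via `trainControl(method = "LOOCV")`, so the glmboost example above could be re-run as a jackknife by changing only the `trainControl()` call (at a much higher computational cost, since one model is fit per observation).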