---
title: "REPORT_CV_jankGS_testing"
author: "Matthew McGowan"
date: "April 8, 2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Objective

The objective of this report is to demonstrate how k-fold and jackknife validation can be used to test different genomic selection models. Because genomic selection has not been formally introduced in class, we will use a makeshift combination of methods already covered as a proxy for the methods that will be taught later this week.

- Method 1: Use small sets of principal components derived from the genotype matrix.
- Method 2: Use the n top markers ranked by GWAS p-value (keeping in mind the methodology covered in the last lecture).

```{r loading-data-and-dependencies}
library(GAPIT3)
library(ggplot2)
library(caret)
library(knitr)

source("functions_lab10.R")
source("http://zzlab.net/StaGen/2020/R/G2P.R")

# Download and extract the demo data if necessary
if (!file.exists("GAPIT_Tutorial_Data.zip")) {
  download.file("http://zzlab.net/GAPIT/GAPIT_Tutorial_Data.zip",
                destfile = "GAPIT_Tutorial_Data.zip")
  unzip("GAPIT_Tutorial_Data.zip")
}

# Scan the first row of the GAPIT demo genotypes to infer column classes,
# then reuse those classes to speed up the full import
gt_scan <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt",
                                 header = TRUE, stringsAsFactors = FALSE,
                                 sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("GAPIT_Tutorial_Data/mdp_numeric.txt",
                                   header = TRUE, row.names = 1,
                                   colClasses = classes,
                                   stringsAsFactors = FALSE, sep = "\t"))

# Filter SNPs on minor allele frequency (MAF)
maf_thresh <- 0.05  # a typical MAF threshold
maf <- calc_maf_apply(genotypes)
genotypes <- genotypes[, maf >= maf_thresh]

marker_map <- read.table("GAPIT_Tutorial_Data/mdp_SNP_information.txt",
                         header = TRUE, stringsAsFactors = FALSE, sep = "\t")
```

## Using a simple GS method that only uses principal components and basic linear regression

```{r cross-validation-with-lm-jank, echo=FALSE}
set.seed(1337)
sim_pheno <- G2P(genotypes, h2 = 0.75, NQTN = 15, distribution = 'normal')
pca_results <- prcomp(genotypes)

# Randomly select 3 PCA components, n times
n <- 1000
random_3comp_sets <- replicate(n, sample(1:ncol(pca_results$x), size = 3))

# Perform 5-fold cross-validation using the lm_jank() method
cv_accuracies <- apply(random_3comp_sets, 2, function(x) {
  # Note: this works with any set of predictors and observations,
  # provided they meet the model assumptions
  cv_sets <- make_CV_sets(sim_pheno$y, pca_results$x[, x])
  # Note: the modeling approach could be swapped for anything
  kfold_accuracies <- lm_jank(cv_sets)
  return(kfold_accuracies)
})

# Calculate the mean accuracy for each random PCA set
cv_acc_means <- apply(cv_accuracies, 2, mean)

# Perform 5-fold cross-validation using specifically the top 3 components
top3_cv_acc <- lm_jank(make_CV_sets(sim_pheno$y, pca_results$x[, 1:3]))
top3_mean_acc <- mean(top3_cv_acc)
```

### Plotting the cross-validation results

```{r plot_cv_results}
cv_plot <- qplot(cv_acc_means, bins = 50) +
  geom_vline(xintercept = top3_mean_acc, color = 'red') +
  labs(title = "GLM regression accuracy with 3 random PCA components, 5-fold cross-validation",
       x = "Mean r-squared",
       y = "Number of models",
       caption = "Vertical line is the top 3 components")
cv_plot
```

## Cross-Validation with *caret*

It is also possible to use the cross-validation framework in the *caret* package, a widely used R package for machine learning. I won't go in-depth here because general machine learning is beyond the scope of this course, but I highly recommend this tutorial: [The caret Package](http://topepo.github.io/caret/index.html)

The following code demonstrates how *caret* abstracts 5-fold cross-validation to work with many different kinds of models using very little code (that's the power of functions!).
In this case, I have chosen the 'glmboost' algorithm, which fits a model using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting).

```{r cv_with_caret}
phenotype <- sim_pheno$y
mydata <- cbind(phenotype, genotypes)

# Define 5-fold cross-validation, then fit a boosted GLM with it
train_control <- trainControl(method = "cv", number = 5)
model <- train(phenotype ~ ., data = mydata,
               trControl = train_control, method = "glmboost")
model
```
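## Appendix: what the k-fold helpers are doing

The helper functions `make_CV_sets()` and `lm_jank()` used earlier are defined in `functions_lab10.R` and are not shown in this report. For readers without that file, here is a minimal self-contained sketch of the same idea: assign observations to 5 folds, fit a linear model on each training split, and score the held-out fold with r-squared. Everything here (`toy_x`, `toy_y`, `kfold_r2`) is an illustrative stand-in, not the actual lab code or the GAPIT demo data.

```r
# Minimal sketch of 5-fold cross-validation in base R.
# toy_x / toy_y / kfold_r2 are hypothetical stand-ins for the lab helpers.
set.seed(42)
n_obs <- 100
toy_x <- matrix(rnorm(n_obs * 3), ncol = 3)                 # 3 predictors (e.g. 3 PCs)
toy_y <- as.vector(toy_x %*% c(1, 0.5, -0.5) + rnorm(n_obs)) # simulated phenotype

kfold_r2 <- function(y, x, k = 5) {
  folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment
  sapply(1:k, function(i) {
    train <- folds != i
    fit <- lm(y[train] ~ x[train, , drop = FALSE])   # fit on the training folds
    preds <- cbind(1, x[!train, , drop = FALSE]) %*% coef(fit)  # predict held-out fold
    cor(y[!train], preds)^2                          # accuracy as r-squared
  })
}

accuracies <- kfold_r2(toy_y, toy_x)
mean(accuracies)
```

The same loop structure works for any predictor matrix and any model that can produce predictions for the held-out fold, which is exactly why the report above can swap random PCA sets in and out so cheaply.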
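## Appendix: a jackknife sketch

The objective also mentions jackknife validation, which never appears in the chunks above. Jackknife validation is simply leave-one-out cross-validation: drop one observation, fit on the remaining n-1, and predict the dropped observation, repeating for every observation. A minimal sketch on simulated stand-in data (`toy_x` and `toy_y` here are hypothetical, not the GAPIT genotypes):

```r
# Minimal sketch of jackknife (leave-one-out) validation of a linear model.
# toy_x / toy_y are simulated stand-ins, not the GAPIT demo data.
set.seed(42)
n_obs <- 50
toy_x <- matrix(rnorm(n_obs * 3), ncol = 3)
toy_y <- as.vector(toy_x %*% c(1, 0.5, -0.5) + rnorm(n_obs))

# Leave out one observation at a time and predict it from the rest
loo_preds <- sapply(1:n_obs, function(i) {
  fit <- lm(toy_y[-i] ~ toy_x[-i, , drop = FALSE])
  as.numeric(cbind(1, toy_x[i, , drop = FALSE]) %*% coef(fit))
})

# One overall accuracy across all held-out predictions
jackknife_r2 <- cor(toy_y, loo_preds)^2
jackknife_r2
```

In *caret* the same scheme is available directly via `trainControl(method = "LOOCV")`, so the glmboost example above could be re-run as a jackknife by changing only the `trainControl()` call (at a much higher computational cost, since one model is fit per observation).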