---
title: "BGLR_demonstration"
author: "Matthew McGowan"
date: "April 22, 2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Dependencies

BGLR is a self-contained R package for calculating Bayesian genomic selection models. It is available on CRAN and is easy to install and load. When you want to include a package in an Rmarkdown, but don't know if it will already be installed on all systems running the script, you can use the 'require' function to check if the package is available or needs to be installed. 

```{r setup}
test_install <- require("BGLR", "ggplot2")
if (!test_install)
{
  install.packages("BGLR")
  library(BGLR)
  library(ggplot2)
}

source("FUNCTIONS_GS.R")

# Read in the data
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
row.names(myGD) = myGD[,1]
myGD_mat<- as.matrix(myGD[,-1])

myGM <- read.table("http://zzlab.net/GAPIT/data/mdp_SNP_information.txt", header = T, stringsAsFactors = F, sep = "\t")
myY <- read.table("http://zzlab.net/GAPIT/data/mdp_traits.txt", header = T, stringsAsFactors = F, sep = "\t")
```

## Data Formatting

Again, because we know the demo data contains monomorphic markers, we need to pre-process it to avoid any downstream issues.

The genotype matrix can be in either c(-1,0,1) or c(0,1,2) format. There are also helper functions for using PLINK formatted data in BGLR (see documentation).

While this script will demonstrate the basics of using BGLR, I highly recommend going through the vignette posted on the CRAN package url: [HERE](https://cran.r-project.org/web/packages/BGLR/index.html)

```{r data_formatting, echo=FALSE}
# Remove monomorphic markers
maf <- calc_maf_apply(myGD_mat, encoding = c(0, 1, 2))
mono_indices <- which(maf == 0)
taxa <- row.names(myGD)
myGD_mat = myGD_mat[,-mono_indices]
myGM = myGM[-mono_indices,]
  
# Subset a single phenotype by taxa
trait_col <- 2
trait_name <- names(myY)[trait_col]
singleY <- myY[,c(1,trait_col)]
taxa_match <- match(myGD$taxa, singleY$Taxa)
singleY <- singleY[taxa_match,]
singleY_vect <- as.vector(singleY[,2])
```

## BGLR basics
```{r BGLR_basics, echo = F}
# Testing a single model
# Set missing values
phenotype_withNA <- singleY_vect
randNA <- sample(1:length(singleY_vect), 30)
phenotype_withNA[randNA] <- NA

# Calculate the GS model using BGLR
bayesA_ETA<-list(list(X=myGD_mat,model="BayesB"))
bayesA_results <- BGLR(y = phenotype_withNA, ETA = bayesA_ETA, nIter=5000, burnIn=1000)
bayesA_predictions <- predict(bayesA_results)
bayesA_acc <- cor(singleY_vect[randNA], bayesA_predictions[randNA])
```

## Testing Multiple Models
```{r BGLR_model_testing, echo = F}
model_testing_results <- test_all_models_BGLR_cv(genotypes = myGD_mat, 
                                                 phenotype = singleY_vect, 
                                                 nIter = 100, 
                                                 burnIn = 20)
```