Homework 5, CROPS 545, Spring 2020

Professor: Zhiwu Zhang

Due on April 13, 2020, Monday, 3:10PM, PST

Problem 1 (20 Pts)

Sample number of QTNs of your choice from the genetic markers used in homework2 and simulate QTN effects from a standard normal distribution. Assign genetic effects for each individual. Simulate normal distributed residual effects with appropriate variance to have a heritability of 0.75. Add residual effects to the genetic effects to create phenotypes. Use the GLM GWAS package you developed in homework 4 to perform association analyses with three PCs included as covariates. Count number of false positives for identifying half of your QTNs (20 points).

Load Package from HW4

#code to install package from github
#install.packages("devtools")
library(devtools)

#install_github("stp4freedom/GWhEAT", force =TRUE)
library(GWhEAT)#package name
if (!require("pacman")) install.packages("pacman")#Load dependencies
pacman::p_load(ggplot2,knitr,gridExtra)

Load Data

myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)

#read Genetic data in with gt_scan for GWASapply function, it's the same data as myGD but in a form that GWASapply will take
gt_scan <- data.frame(read.table("http://zzlab.net/GAPIT/data/mdp_numeric.txt", header = T, stringsAsFactors = F, sep = "\t", nrows = 1))
classes <- sapply(gt_scan, class)
genotypes <- data.frame(read.table("http://zzlab.net/GAPIT/data/mdp_numeric.txt", header = T, row.names = 1, colClasses = classes, stringsAsFactors = F, sep = "\t"))

Simulated Data

source("http://www.zzlab.net/StaGen/2020/R/G2P.R")
NQTN = 20 #specify number of QTN
h2 = 0.75 #heritability
alpha = 1 #alpha value
distribution = "normal" #specify distribution
set.seed(99163)
#phenotype simulation
mySim=G2P(genotypes, h2, alpha, NQTN, distribution)

Run GWAS

q1=GWASapply(pheno=mySim$y, geno=genotypes, GM=myGM, PCA.M=3,QTN.position=mySim$QTN.position, plots=FALSE, messages=FALSE, trait="problem1")

#find middle p-value of QTNs
QTN_pval= q1$P.value.res[mySim$QTN.position]
a = sort(QTN_pval)
QTNcutoff = median(a)

#Manhattan Plot 
  color.vector <- rep(rainbow(5,s = 1), length(unique(myGM[,2])))
  
  plot(t(-log10(q1$P.value.res)),col=color.vector[myGM[,2]], 
       main="GWAS Manhattan Plot", xlab="Marker Position", ylab="-log10(p-value)") 
  
  points(mySim$QTN.position, -log10(q1$P.value.res[mySim$QTN.position]), type="p", pch=21, cex=3,lwd=2,col="Blue")# Circles the QNTs
  
  abline(h=-log10(QTNcutoff), col="green") #p-vlaue of middle QTN

Order QTNs

#order QTNs
pValue = q1$P.value.res[mySim$QTN.position]
QTNsq1 =cbind(myGM[c(mySim$QTN.position),], pValue)#Name Chr SNP position and P-value of QTNs
OrderQTNs = QTNsq1[with(QTNsq1, order(QTNsq1[,4])),]
OrderQTNs

##              SNP Chromosome  Position       pValue
## 1057 PZA00413.18          3 125192339 0.000000e+00
## 123   PZA03243.2          1  44532330 4.369838e-13
## 1314  PZA00078.4          4  18502089 7.161186e-09
## 1977 PZA00440.15          6  22403926 2.588799e-04
## 917   PZA00121.1          2 228972770 1.879514e-03
## 2157 PZB01569.10          6 160748446 4.419381e-03
## 1419  PZA00704.1          4 128552705 4.724744e-03
## 1890   PHM532.23          5 193208321 9.282557e-03
## 729   PZA00635.1          2  59600932 1.334000e-02
## 1878  PZB02005.1          5 188557321 1.577566e-02
## 2212 PZA03067.20          7  14354237 2.877772e-02
## 693   PZB01231.2          2  39041117 9.696740e-02
## 2849  PZB00750.2          9 133449028 1.495539e-01
## 605   PZA00620.2          2  10429605 1.631715e-01
## 1337  PZA03385.1          4  37067373 1.908301e-01
## 2648  PZB00700.1          8 163492031 2.406736e-01
## 1371  PZA03753.2          4  73310537 2.919296e-01
## 1370  PZA03752.1          4  73309674 4.100369e-01
## 2269  PZA01210.1          7  75099046 4.579929e-01
## 1092 PZA00507.11          3 155818939 6.134086e-01

Find median p-value of QTNs

#find middle p-value of QTNs
median_pval = median(OrderQTNs$pValue)
median_pval

## [1] 0.02227669

#find number of false positives

Count False Positives

sig.snps = sum(q1$P.value.res < QTNcutoff)#Total number of SNPs with p-value smaller then middle QTN
sig.QTNs = sum(q1$P.value.res[mySim$QTN.position] < QTNcutoff)#number of QTN smller then middel QTN ie true positives 
False.Positives = sig.snps - sig.QTNs#False positives
False.Positives

## [1] 193

#plot(q1$power.fdr.type1error.res$power~q1$power.fdr.type1error.res$fdr,xlab="FDR",ylab="Power",main="ROC Curve: GWASapply",type="b")

Problem 1 Answer

Problem 1 suggested I use my data from homework 2 however the data I used for homework 2 had minor allele frequencies of less then 0.05 for ~80% of the SNPs. And that can be problematic for correctly identifying QTNs. In order to make sure my code was working properly I used the GAPIT tutorial data.

As can be seen in the code above I simulated the phenotypic data as requested and then ran a GLM GWAS using the package I created in home work 4. The Manhattan plot shows the location of all QTNs circled in blue. The Manhattan plot also shows the cutoff line which is set by the median p value of the QTNs. Under “Order QTNs” you can see a table with the QTNs ordered by P-values and under “Find median p-value of QTNs” and “Count False Positives” you can see the the median p-value of the QTNs listed was 0.022 as well the number of SNPs with P-values smaller than that threshold 193 i.e. the number of false positives using the median QTN as the threshold.

Problem, 2 (20 Pts)

Repeat simulation in (1) 30 times and exam statistical power vs. FDR at mapping resolution of 100,000 base pairs (20 points).

Load GAPIT for Problems 2-5

#devtools::install_github("jiabowang/GAPIT3",force=TRUE)
#library(GAPIT3)
library(compiler)
source("GAPIT_FUNCTION.R")

nrep = 30
h2 = 0.75
NQTN = 20
set.seed(99163)
q2 = replicate(nrep, {
mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm") 

GAPITrun <- GAPIT(
Y=mySim$Y,
GD=myGD,#Genotype
GM=myGM,#Map information
PCA.total=3,
QTN.position=mySim$QTN.position,
model= "GLM",
file.output = FALSE)
  
myStat2=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                       GM=myGM,seqQTN=mySim$QTN.position,
                       GWAS=cbind(myGM, GAPITrun$GWAS$P.value))
  
})

Problem 2 ROC Curve

#plot power vs. FDR
power=q2[[2]]
s.t1=seq(3,length(q2),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=q2[s.t1]
t1.mean=Reduce ("+", t1) / length(t1)
plot(t1.mean[,1],power, type="b", col="blue",xlim=c(0,1),xlab = "FDR", ylab = "Power")
title("Problem 2 ROC Curve")

Problem 2 Answer

For problem 2-5 I switch to GAPIT to preform GWAS. The best way to download GAPIT is the devtools line I have commented out. When you run devtools::install_github(“jiabowang/GAPIT3”,force=TRUE) you will get a pop up asking if you want to update some all or none of several of the packages. Because of this manual step I could not get the code to Knit and is why I have it commented out. The way I installed GAPIT was to download the source code and add it as a function to my homework4 project that way the Knit option would work. Either way the code works the same.

For problem 2 I replicated a GLM GWAS 30 times using a simple replicate function. To caculated Power vs. FDR I used the GAPIT.FDR.TypeI function from the GAPIT package. Inside GAPIT.FDR.TYPEI I set the WS equal to 1e5. WS is the window size for mapping resolution which I set to 100,00bp.

To calculate power and FDR averages I used the code from the lectures as seen in “Problem 2 ROC Curve” code chunk. The 30 reps of problem 2 are stored in q2 with seven different values listed. Which is why the seq function is used to advance by 7 to get all the data for all 30 reps. Power is stored in q2[[2]] while FDR is stored in q2[[3]]. The code then used the Reduce function to average all 30 reps to calculate FDR at each Power. Power is set as increments from 0 to 1 by number of QTNs, so in our case the increments are 1/20 = 0.05.The best way to show power and FDR averages across replicates is the ROC curve as it allows you to see the average FDR at each incrementally increasing Power values.

Problem 3 (20 Pts)

Repeat simulation in (2) with additional two levels of heritability (0.25, and 0.5) (20 points).

#############################################h2 = 0.5
nrep = 30
h2 = 0.5
NQTN = 20
set.seed(99163)
statrep50 = replicate(nrep, {
mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm") 

GAPITrun <- GAPIT(
Y=mySim$Y,
GD=myGD,#Genotype
GM=myGM,#Map information
PCA.total=3,
QTN.position=mySim$QTN.position,
model= "GLM",
file.output = FALSE)
  
myStat=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                       GM=myGM,seqQTN=mySim$QTN.position,
                       GWAS=cbind(myGM, GAPITrun$GWAS$P.value))
})
#############################################h2 = 0.25
nrep = 30
h2 = 0.25
NQTN = 20
set.seed(99163)
statrep25 = replicate(nrep, {
mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm") 

GAPITrun <- GAPIT(
Y=mySim$Y,
GD=myGD,#Genotype
GM=myGM,#Map information
PCA.total=3,
QTN.position=mySim$QTN.position,
model= "GLM",
file.output = FALSE)
myStat=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                       GM=myGM,seqQTN=mySim$QTN.position,
                       GWAS=cbind(myGM, GAPITrun$GWAS$P.value))
})

Problem 3 ROC Curve

#plot power vs. FDR for h2 = 0.5
power.5=statrep50[[2]]
s.t1=seq(3,length(statrep50),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=statrep50[s.t1]
t1.mean.5=Reduce ("+", t1) / length(t1)
#plot power vs. FDR for h2 = 0.25
power.25=statrep25[[2]]
s.t1=seq(3,length(statrep25),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=statrep25[s.t1]
t1.mean.25=Reduce ("+", t1) / length(t1)


plot(t1.mean.5[,1],power.5, type="b", col="blue",xlim=c(0,1),xlab = "FDR", ylab = "Power")
lines(t1.mean.25[,1],power.25, type="b", col="red")
title("Problem 3 ROC Curve")
legend("topleft", legend = c("h2 = 0.5","h2 = 0.25"),col=c("blue","red"),lty=1,bty="n" )

Problem 3 Answer

Phenotypic simulations were conducted for problem 3 with heritability set at 0.5 and 0.25 with the h2 variable in the GAPIT.Phenotype.simulation function. Replications were conducted as before using the replication function. Power and FDR averages where also conducted the same as in problem 2 using the code from the lecture slides.
As can be seen in the “Problem 3 ROC Curve” graph, power is higher at the same FDR rate when heritability is higher i.e. 0.5 vs. 0.25. This makes since as a higher heritability gives a stronger signal which can be more easily detected by GWAS.

Problem 4 (20 Pts)

Repeat simulation in (2) with additional two methods of your choice among MLM. CMLM, SUPER, FarmCPU, and BLINK (20 points).

#############################################MLM
nrep = 30
h2 = 0.75
NQTN = 20
set.seed(99163)
GWAS_MLM = replicate(nrep, {
mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm") 

GAPITrun <- GAPIT(
Y=mySim$Y,
GD=myGD,#Genotype
GM=myGM,#Map information
PCA.total=3,
QTN.position=mySim$QTN.position,
model = "MLM",
file.output = FALSE)
  
myStat=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                       GM=myGM,seqQTN=mySim$QTN.position,
                       GWAS=cbind(myGM, GAPITrun$GWAS$P.value))
})
#############################################?
nrep = 30
h2 = 0.75
NQTN = 20
set.seed(99163)
GWAS_SUPER = replicate(nrep, {
mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm") 

GAPITrun <- GAPIT(
Y=mySim$Y,
GD=myGD,#Genotype
GM=myGM,#Map information
PCA.total=3,
QTN.position=mySim$QTN.position,
model = "SUPER",
file.output = FALSE)
myStat=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                       GM=myGM,seqQTN=mySim$QTN.position,
                       GWAS=cbind(myGM, GAPITrun$GWAS$P.value))
})

Problem 4 ROC Curve

#plot power vs. FDR for MLM
power.5=GWAS_MLM[[2]]
s.t1=seq(3,length(GWAS_MLM),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=GWAS_MLM[s.t1]
t1.mean.5=Reduce ("+", t1) / length(t1)
#plot power vs. FDR for SUPER
power.25=GWAS_SUPER[[2]]
s.t1=seq(3,length(GWAS_SUPER),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=GWAS_SUPER[s.t1]
t1.mean.25=Reduce ("+", t1) / length(t1)


plot(t1.mean.5[,1],power.5, type="b", col="blue",xlim=c(0,1),xlab = "FDR", ylab = "Power")
lines(t1.mean.25[,1],power.25, type="b", col="red")
title("Problem 4 ROC Curve")
legend("topleft", legend = c("MLM","SUPER"),col=c("blue","red"),lty=1,bty="n" )

Problem 4 Answer

I choose to repeat simulations in problem 2 using MLM and SUPER. I picked MLM because it’s relatively fast and computationally easy comparted to SUPER. The tradeoff is a reduction in power of MLM vs. SUPER. Replication and Power vs. FDR calculations were conducted as in problem 2.

In the “Problem 4 ROC Curve” graph we can see that SUPER does indeed have higher power as compared to MLM. SUPER is able to achieve a higher power because it does adjustment on its kinship covariates. If sample size is small or computational power is high, I would recommend SUPER over MLM every time. However, if computational speed is an issue MLM might be the preferred method as it can run much faster.

Problem 5 (20 Pts)

Among above problem (2-4), redo one of them by excluding the QTNs from the marker dataset before GWAS. Describe the impact of inclusion and exclusion of the QTNs (20 points).

################################################################Q5
nrep = 30
h2 = 0.75 #heritability
NQTN = 20
set.seed(99163)
################################################################run replication
GWAS_noQTNs = replicate(nrep, {
  mySim=GAPIT.Phenotype.Simulation(GD=myGD,GM=myGM,h2=h2,NQTN=NQTN,QTNDist="norm")#simulate phenotype data
  
  
  QTNindex = mySim$QTN.position#index of QTNs
  
  
  GMonlyQTNs = myGM[c(QTNindex),]#Name Chr and SNP position of QTNs
  
  QTNindexGD = QTNindex +1 #have to +1 becasue GM does not have the taxa column in it
  GDminusQTN = myGD[c(-QTNindexGD)]#Remove QTNs from genetic data
  GMminusQTN = myGM[-c(QTNindex),]#Remove QTNs from genetic map
  
################################################################Run GWAS
  GAPITrun <- GAPIT(
    Y=mySim$Y,
    GD=GDminusQTN,#Genotype
    GM=GMminusQTN,#Map information
    PCA.total=3,
    #QTN.position=mySim$QTN.position,
    model = "GLM",
    file.output = FALSE)
 
#######################################################add QTNs back into dataframe
a = GAPITrun$GWAS[1:4]# Name Chr Position and P-value of non QTNs
P.value = rep(NaN, NQTN)# create NaN P-values for QTNs
g = cbind(GMonlyQTNs, P.value)# get Name chr Position and NA P-values of QTNs
names(a)[3]= c("Position")#rename Position so it matches other dataframes position so we can rbind
h = rbind(g,a)#combine data

final = h[with(h, order(h[,2],h[,3])),]# need to order by chr then position so that indexs line up
########################################################Caculate power vs. FDR with GAPIT.FDR.TypeI using window size of 100,000bps
  myStat=GAPIT.FDR.TypeI(WS=1e5, #WS equals window size of 100kb
                         GM=myGM,
                         seqQTN=mySim$QTN.position,
                         GWAS=cbind(myGM, final$P.value))
  
})

Problem 5 ROC Curve

#plot power vs. FDR for QTNs removed
power_noQTNs=GWAS_noQTNs[[2]]
s.t1=seq(3,length(GWAS_noQTNs),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=GWAS_noQTNs[s.t1]
t1.mean.5=Reduce ("+", t1) / length(t1)
plot(t1.mean.5[,1],power_noQTNs, type="b", col="blue",xlim=c(0,1),xlab = "FDR", ylab = "Power")


#plot power vs. FDR for problem 2 
powerq2=q2[[2]]
FDRq2=seq(3,length(q2),7)#1 = P, 2 = power, 3 = FDR, 4 = type1, 6 = AUC vs.FDR, 7 = AUC.T1
t1=q2[FDRq2]
t1.meanq2=Reduce ("+", t1) / length(t1)
lines(t1.meanq2[,1],powerq2, type="b", col="red")
title("Problem 5 ROC Curve")
legend("topleft", legend = c("QTNs removed","QTNs included"),col=c("blue","red"),lty=1,bty="n" )

Problem 5 Answer

In the code above I removed the QTNs from the Genetic Data (GD) I also removed the QTNs from the Genetic Map (GM) I had to remove the QTNs from the GM because GAPIT wanted the dimensions of the data sets to match. After running GLM GWAS I put the QTNs with P-Values set to NaN back into the data frame with all the other SNPs and their p-values. With the sorted data frame of all the SNP p-values and QTNs with p-values set to NaN I was able to use the GAPIT.FDR.TypeI function to calculate Power and FDR. And then just like in problem 2 I calculated the average Power and FDR for all 30 reps and plotted them on a graph.

In the “Problem 5 ROC Curve” graph you can see that excluding the QTNs before running GWAS greatly decreased the power vs. FDR curve. The ROC curve with QTNs removed is still able to pick up a few true positives because I’m using a window size of 100,000bps so while none of the QTNs are being selected based off their p-values because I set them all the NaN, if the QTN is within 100,000bp of a peak p-value it will still be selected. For an analysis like problem 5 LD will play a strong role in how well a GWAS with QTNs excluded is still able to pick true positives. And from the results above we can see that if you remove the QTNs before GWAS the power/FDR gets very low indicating LD is not strongly effecting our end results with QTNs included