---
title: "Homework 2, CROPS 545, Spring 2020"
author: "Your Name"
date: "January 7, 2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Professor: Zhiwu Zhang
## **Due on Friday, February 21, 2020, 3:10 PM PST**

**Hand in:** Email your report (R Markdown source and knitted .html file) with the subject "CROPS545_HW2" to Zhiwu.Zhang@WSU.edu. Written portions are limited to 500 words or fewer for each problem. Name your files using the following format: Homework2_firstname_lastname.Rmd and Homework2_firstname_lastname.html.

**Objectives:** To examine the impact of different variables on imputation accuracy, evaluated as the correlation coefficient, the match proportion across the genome, and the match proportions for major-allele and minor-allele homozygotes separately.

**The following variables will be tested:**

1) Missing rate
2) Sample size
3) Method of imputation

**Grading Criteria:**

* Problem is adequately addressed
* All code should be bug-free and well commented
* Plots should be properly labeled
* Written portions should be in sentence form and easy to read

**Remember:** You can use the echo=FALSE chunk option to hide your code in the .html file.

Feel free to add extra sub-problem code chunks as necessary.

### **Problem 1 (20 Pts)**

Choose a dataset (please specify the dataset number) from the recommended list:
http://zzlab.net/StaGen/2018/Data/PublicData.pdf. 

You can sample a portion of the individuals. However, the final dataset must contain at least 100 individuals and 5,000 markers with known chromosome and base pair positions. Display the marker locations on chromosomes, the distribution of missing rate (both marker-wise and individual-wise), and the distribution of minor allele frequency.
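
As a starting point, the missing-rate and minor allele frequency summaries could be computed along the following lines. This is only a sketch on simulated data: it assumes a numeric genotype matrix `X` with individuals in rows, markers in columns, genotypes coded 0/1/2, and `NA` for missing calls. Your chosen dataset may use a different layout or coding, and the variable names are illustrative.

```r
# Toy genotype matrix (assumption): rows = individuals, columns = markers,
# coded 0/1/2 as copies of one allele, NA = missing call.
set.seed(1)
n <- 100
m <- 500
X <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
X[sample(length(X), size = round(0.05 * length(X)))] <- NA

# Missing rate per marker (column) and per individual (row)
miss_marker <- colMeans(is.na(X))
miss_indiv  <- rowMeans(is.na(X))

# Frequency of the counted allele from non-missing calls,
# folded so that the value is always the minor allele frequency
p   <- colMeans(X, na.rm = TRUE) / 2
maf <- pmin(p, 1 - p)

# Properly labeled histograms of the three distributions
par(mfrow = c(1, 3))
hist(miss_marker, main = "Missing rate per marker", xlab = "Proportion missing")
hist(miss_indiv, main = "Missing rate per individual", xlab = "Proportion missing")
hist(maf, main = "Minor allele frequency", xlab = "MAF")
```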

EXTRA CREDIT (5 points): Find a new publicly available dataset that matches the sample and marker requirements. Provide a brief description of the data as well as information that matches the recommended list.

```{r problem1, echo=FALSE}
# Your code goes here

print('Hello World')
```

THIS IS AN EXAMPLE OF WHERE YOUR WRITTEN ANSWERS WILL SHOW UP.

### **Problem 2 (20 Pts)**

Randomly select 5%, 25%, and 50% of data points and set them as missing values. Impute these missing values with the stochastic imputation method. Calculate the imputation accuracy as a correlation coefficient or match proportion. Repeat this process at least 30 times and report the average, standard deviation, and number of replicates. Describe the relationship between the missing rate and imputation accuracy (20 points).
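
The mask, impute, and evaluate loop might be organized roughly as follows. This sketch makes several assumptions: genotypes are coded 0/1/2 in a matrix with individuals in rows, the data are simulated rather than real, and "stochastic imputation" is read as drawing each missing genotype at random from the observed genotypes at the same marker. The helper names (`mask_genotypes`, `impute_stochastic`, `one_rep`) are hypothetical, and the version of the method presented in class may differ.

```r
set.seed(2020)
n <- 100
m <- 200
X <- matrix(rbinom(n * m, 2, 0.3), n, m)   # toy "true" genotypes

# Set a given fraction of entries to NA and remember which ones were masked
mask_genotypes <- function(X, rate) {
  idx <- sample(length(X), size = round(rate * length(X)))
  X_miss <- X
  X_miss[idx] <- NA
  list(X_miss = X_miss, idx = idx)
}

# Assumed reading of stochastic imputation: for each marker, draw the
# missing genotypes at random from the observed genotypes of that marker
impute_stochastic <- function(X_miss) {
  for (j in seq_len(ncol(X_miss))) {
    na  <- is.na(X_miss[, j])
    obs <- X_miss[!na, j]
    if (any(na) && length(obs) > 0) {
      X_miss[na, j] <- obs[sample.int(length(obs), sum(na), replace = TRUE)]
    }
  }
  X_miss
}

# One replicate: accuracy is measured on the masked entries only
one_rep <- function(X, rate) {
  masked <- mask_genotypes(X, rate)
  X_imp  <- impute_stochastic(masked$X_miss)
  truth  <- X[masked$idx]
  guess  <- X_imp[masked$idx]
  c(cor = cor(truth, guess), match = mean(truth == guess))
}

rates <- c(0.05, 0.25, 0.50)
n_rep <- 30
summary_tab <- do.call(rbind, lapply(rates, function(r) {
  acc <- replicate(n_rep, one_rep(X, r))   # 2 x n_rep matrix
  data.frame(rate = r, n_rep = n_rep,
             mean_cor = mean(acc["cor", ]),     sd_cor = sd(acc["cor", ]),
             mean_match = mean(acc["match", ]), sd_match = sd(acc["match", ]))
}))
print(summary_tab)
```

Because the toy markers are simulated independently, the numbers it prints say nothing about real data; the point is only the structure of the replicate loop and the summary table.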

```{r problem2}
# Your code goes here
```

### **Problem 3 (20 Pts)**

Redo problem 2, but replace the stochastic method with the k-nearest neighbor method. Compare and contrast your results with the stochastic method.
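
For orientation, a very small k-nearest neighbor imputer can be written in base R as below. It is a sketch under assumptions, not the method from lecture or from any package: individuals are rows, genotypes are coded 0/1/2, distances are Euclidean as computed by `dist()` (which skips missing pairs and rescales), and a missing genotype is filled with the rounded mean of its `k` nearest donors. The function name `impute_knn` and the simulated two-group data are illustrative only.

```r
set.seed(545)
n <- 60
m <- 100
# Toy data with two groups of individuals that differ in allele frequency,
# so that neighbors carry some information about each other
grp  <- rep(1:2, each = n / 2)
freq <- rbind(runif(m, 0.05, 0.4), runif(m, 0.6, 0.95))
X    <- matrix(rbinom(n * m, 2, freq[grp, ]), n, m)

impute_knn <- function(X_miss, k = 5) {
  # Pairwise distances between rows (individuals); NA pairs are skipped
  D <- as.matrix(dist(X_miss))
  diag(D) <- Inf                       # an individual is not its own neighbor
  X_imp <- X_miss
  for (i in which(rowSums(is.na(X_miss)) > 0)) {
    ord <- order(D[i, ])               # nearest first; NA distances sort last
    for (j in which(is.na(X_miss[i, ]))) {
      # Keep only neighbors with an observed genotype at marker j
      donors <- head(ord[!is.na(X_miss[ord, j])], k)
      if (length(donors) > 0) {
        X_imp[i, j] <- round(mean(X_miss[donors, j]))
      }
    }
  }
  X_imp
}

# Single evaluation at each missing rate (wrap in replicates for the report)
for (rate in c(0.05, 0.25, 0.50)) {
  idx    <- sample(length(X), round(rate * length(X)))
  X_miss <- X
  X_miss[idx] <- NA
  X_imp  <- impute_knn(X_miss, k = 5)
  cat(sprintf("rate %.2f: r = %.3f, match = %.3f\n", rate,
              cor(X[idx], X_imp[idx]), mean(X[idx] == X_imp[idx])))
}
```

Using the most frequent genotype among the donors instead of the rounded mean is an equally reasonable choice and may be worth comparing.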

```{r problem3}
# Your code goes here
```

### **Problem 4 (20 Pts)**

In Problem 3, the neighbors in KNN are individuals and the attributes are genetic markers. Redo Problem 3 with the roles switched, so that the neighbors are genetic markers and the attributes are individuals. Describe the differences.
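
If your imputation function treats the rows of its input as the units being compared, switching roles amounts to transposing the matrix before imputation and transposing the result back. The sketch below illustrates only that pattern. `impute_by_row` is a hypothetical stand-in (it fills each gap with the rounded row mean) and is not a KNN implementation; substitute your own row-wise function.

```r
# Hypothetical stand-in for any imputer that works row by row:
# fills each missing value with the rounded mean of the same row
impute_by_row <- function(X_miss) {
  for (i in seq_len(nrow(X_miss))) {
    na <- is.na(X_miss[i, ])
    if (any(na) && !all(na)) {
      X_miss[i, na] <- round(mean(X_miss[i, !na]))
    }
  }
  X_miss
}

# Tiny example: 3 individuals (rows) by 4 markers (columns)
X_miss <- rbind(ind1 = c(0, 0, NA, 2),
                ind2 = c(2, NA, 2, 2),
                ind3 = c(0, 0, 0, NA))
colnames(X_miss) <- paste0("m", 1:4)

# Rows are individuals: each gap is filled from the same individual's row
by_individual <- impute_by_row(X_miss)

# Transpose so rows are markers, impute, then transpose back
by_marker <- t(impute_by_row(t(X_miss)))

print(by_individual)
print(by_marker)
```

The two results have the same dimensions and row and column names, so the same accuracy code can be applied to both.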

```{r problem4}
# Your code goes here
```

### **Problem 5 (20 Pts)**

Fix the missing rate at 25% and perform imputation with BEAGLE. Calculate the imputation accuracy as a correlation coefficient or match proportion. Repeat this process at least 10 times and report the average, standard deviation, and number of replicates. Compare and contrast this method with the previous methods tested.

The documentation for BEAGLE is available here:
https://faculty.washington.edu/browning/beagle/beagle_5.1_08Nov19.pdf

You can run console commands from within an R script using the `system()` function.
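
A call to BEAGLE through `system()` might look like the sketch below. The file names are hypothetical placeholders, and the example assumes that Java is on your path and that the masked genotypes have already been written to a VCF file, which is the input format BEAGLE expects. Only the `gt=` (input VCF) and `out=` (output prefix) arguments are used here; see the BEAGLE documentation for the others.

```r
# Hypothetical file names; replace with your own paths
beagle_jar <- "beagle.jar"
vcf_in     <- "masked_genotypes.vcf.gz"
out_prefix <- "imputed_rep1"

# BEAGLE takes key=value arguments on the command line
cmd <- paste("java -jar", shQuote(beagle_jar),
             paste0("gt=", shQuote(vcf_in)),
             paste0("out=", shQuote(out_prefix)))
print(cmd)

# Run only if the files are present, so the chunk still knits without them
if (file.exists(beagle_jar) && file.exists(vcf_in)) {
  status <- system(cmd)            # returns the exit status; 0 means success
  if (status != 0) stop("BEAGLE exited with status ", status)
} else {
  message("BEAGLE jar or input VCF not found; command not run.")
}
```

BEAGLE writes its imputed genotypes to a compressed VCF named after the output prefix, which then needs to be read back into R to compare against the true genotypes.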

```{r problem5}
# Your code goes here
```

### **Problem 6 (Extra Credit, 20 Pts)**

Redo Problem 3, but calculate imputation accuracies for major-allele homozygotes and minor-allele homozygotes separately. In 500 words or fewer, describe any novel insights gained by calculating accuracy this way.
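
One way to split the match proportion by genotype class is sketched below. It assumes 0/1/2 coding that counts copies of one allele, so whether 0 or 2 is the minor-allele homozygote depends on each marker's allele frequency. The function name `accuracy_by_class` is hypothetical, and the "imputer" used in the demonstration simply fills every masked entry with 0; it is a placeholder, not KNN.

```r
accuracy_by_class <- function(X_true, X_imp, idx) {
  # Frequency of the counted allele per marker; if it is below 0.5 the
  # counted allele is the minor one, so genotype 2 is the minor homozygote
  p <- colMeans(X_true) / 2
  minor_hom_code <- ifelse(p <= 0.5, 2, 0)

  marker <- col(X_true)[idx]        # marker (column) of each masked entry
  truth  <- X_true[idx]
  guess  <- X_imp[idx]

  is_het   <- truth == 1
  is_minor <- !is_het & truth == minor_hom_code[marker]
  is_major <- !is_het & !is_minor

  c(major_hom = mean(guess[is_major] == truth[is_major]),
    minor_hom = mean(guess[is_minor] == truth[is_minor]))
}

# Toy demonstration with a placeholder imputer
set.seed(3)
n <- 100
m <- 50
X_true <- matrix(rbinom(n * m, 2, 0.2), n, m)
idx    <- sample(length(X_true), round(0.25 * length(X_true)))
X_imp  <- X_true
X_imp[idx] <- 0                     # placeholder: always guess genotype 0

acc <- accuracy_by_class(X_true, X_imp, idx)
print(acc)
```

In this toy case the placeholder is always right for major-allele homozygotes and always wrong for minor-allele homozygotes, which is exactly the kind of difference an overall match proportion would hide.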

### **Problem 7 (Extra Credit, 20 Pts)**

Find another published imputation method that can achieve a higher imputation accuracy than both KNN and BEAGLE. In 500 words or fewer, describe the basics of this method and explain why it can achieve a higher accuracy.