1 Summary

Here we will make use of the SummarizedBenchmark package to benchmark a Genome-Wide Association Study (GWAS) dataset. We will examine a GWAS meta-analysis for Body Mass Index (BMI), which was also included in the Boca-Leek manuscript.

2 Workspace setup

This analysis requires that the software PLINK (version 1.9) is installed on your system. PLINK is freely available for download here.

library(data.table)
library(readxl)
library(readr)
library(dplyr)
library(ggplot2)
library(cowplot)
library(BiocParallel)

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

# data and results directories
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/GWAS"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

# results files
resfile_N <- file.path(sbdir, paste0("bmi-samplesize-benchmark.rds"))
resfile_AF <- file.path(sbdir, paste0("bmi-maf-benchmark.rds"))
resfile_uninf <- file.path(sbdir, paste0("bmi-uninf-benchmark.rds"))

# set up parallel backend
cores <- 8
multicoreParam <- MulticoreParam(workers = cores)

3 Data preparation

We’ll first download the GWAS summary statistics dataset from the GIANT consortium data portal.

3.1 Data download

After downloading, we unzip the file, and delete the files we won’t use (they provide results subsetted by ancestry and sex), keeping only the file with European ancestry and both sexes BMI.SNPadjSMK.CombinedSexes.EuropeanOnly.txt. Note that the Boca-Leek paper also used the ‘EuropeanOnly’ subset, likely because using all ancestries would require that we assess the impact of population stratification and adjust for it if present. For simplicity, we follow suit and use the homogeneous subset to avoid the impact of population stratification on our results.

if (!file.exists(file.path(datdir, "BMI.SNPadjSMK.CombinedSexes.EuropeanOnly.txt"))) {
  download.file(url = "http://portals.broadinstitute.org/collaboration/giant/images/3/3a/BMI.SNPadjSMK.zip", 
             destfile = file.path(datdir, "BMI.SNPadjSMK.zip")) 
  unzip(file.path(datdir, "BMI.SNPadjSMK.zip"), exdir = datdir)
  file.remove(file.path(datdir,"BMI.SNPadjSMK.zip"))
  
  dfiles <- list.files(path = datdir, pattern = "BMI.SNPadjSMK.*.txt", 
                       full.names = TRUE)
  dfiles <- dfiles[!grepl("BMI.SNPadjSMK.CombinedSexes.EuropeanOnly.txt", dfiles)]
  file.remove(dfiles)
}

We also download the 1000 genomes CEU reference population data, which will be used to determine which SNPs are LD buddies for the purposes of obtaining an independent set of SNPs.

reffile <- file.path(datdir, "1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01")
if (!file.exists(paste0(reffile, ".fam"))) {
    download.file("http://neurogenetics.qimrberghofer.edu.au/iSECA/1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01.zip", 
                  destfile = paste0(reffile, ".zip"))
    unzip(paste0(reffile, ".zip"), exdir = datdir)
    file.remove(paste0(reffile, ".zip"))
}

Next, we’ll read in the unzipped .txt file into R and verify that it contains the necessary inputs to run the FDR benchmark comparisons.

bmi <- fread(file.path(datdir, "BMI.SNPadjSMK.CombinedSexes.EuropeanOnly.txt"),
             header = TRUE)
dim(bmi)

## [1] 2456142      10

head(bmi)

##    chromosome     rs_id     markername allele_1 allele_2
## 1:          1 rs1000050 chr1:161003087        T        C
## 2:          1 rs1000073 chr1:155522020        A        G
## 3:          1 rs1000075  chr1:94939420        T        C
## 4:          1 rs1000085  chr1:66630503        C        G
## 5:          1 rs1000127  chr1:63205304        T        C
## 6:          1 rs1000312 chr1:203772100        A        G
##    Freq_Allele1_HapMapCEU  effect stderr  p_value      N
## 1:                 0.9000  0.0069 0.0060 0.249700 141408
## 2:                 0.3136  0.0046 0.0043 0.276400 140542
## 3:                 0.3583 -0.0039 0.0040 0.334100 163841
## 4:                 0.1667 -0.0026 0.0047 0.582400 163398
## 5:                 0.7719  0.0003 0.0043 0.947900 163274
## 6:                 0.2982  0.0130 0.0043 0.002776 151830

It looks like we have:

p_value:p-value
effect:effect size
stderr:standard error
additional covariates:
N: Number of samples with this SNP - BMI association measured
Freq_Allele1_HapMapCEU: Allele Frequency of Allele1, as measured in HapMapCEU

for 2456142 SNPs.

3.2 Prune SNPs that are in Linkage Disequilibrium

Since nearby SNPs in linkage disequilibrium may not constitute separate discoveries, we carry out a pruning step. This commonly done by examining correlation of genotypes among nearby markers, but since we don’t have the genotype data, we’ll use the linkage disequilibrium results of common SNPs and previously observed LD blocks. Specifically, we’ll use the phase 3 data from 1000 genomes, restricted to the CEU population (since this most closely matches our population of European ancestry). This reference data was downloaded in a previous code chunk. Here we read in the list of more than 9 million SNPs with data in 1000 genomes and create a list of SNPs that overlap our set. We’ll write this list to a file that matches PLINK format.

# load reference data 
onekg <- fread(file.path(datdir,  "1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01.bim"))$V2

# construct input list of SNPs that overlap - 97.5% overlap
bmi$BP <- as.numeric(unlist(lapply(strsplit(bmi$markername, ":"), function(x) x[[2]])))
write_delim(bmi %>%
            dplyr::rename(CHR=chromosome, SNP=rs_id, A1=allele_1, A2=allele_2) %>%
            mutate(P=0.5, F_A=Freq_Allele1_HapMapCEU, F_U=F_A, CHISQ=1, OR=1) %>% 
            select(CHR, SNP, BP, A1, F_A, F_U, A2, CHISQ, P, OR) %>%
            filter(SNP %in% onekg), 
            path=file.path(datdir,  "rslist.assoc"), delim="\t")

ol <- sum(bmi$rs_id %in% onekg)/nrow(bmi)
rm(onekg)

It turns out that 97.5 percent of SNPs in our set are included in the 1000 genomes reference data. Next we’ll use the PLINK software to read in our list of SNPs and the 1000 genomes linkage disequilibrium data to ‘clump’ the SNPs into independent sets of SNPs that have LD correlation (r squared) less than 0.2 with all neighbors within 250 kilobases of distance. We write the results to a file called plink_clump.clumped. We have to jump out of R here since PLINK is a command line tool.

## use PLINK clumping tool to obtain a set of independent SNPs by LD
if (!file.exists(file.path(datdir, "plink_clump.clumped"))) {
    code <- paste("plink",
                  "--bfile", file.path(datdir, "1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01"),
                  "--clump", file.path(datdir, "rslist.assoc"),
                  "--clump-field P",
                  "--clump-p1 0.9999",
                  "--clump-p2 0.9999",
                  "--clump-r2 0.2",
                  "--clump-kb 250",
                  "--out", file.path(datdir, "plink_clump"))
    system(code)
}

Next, we’ll read in the clumped results and subset our data by our list of independent SNPs.

clump <- fread(file.path(datdir, "plink_clump.clumped"))

# subset to include only the SNPs in the ld.list
bmi <- bmi[bmi$rs_id %in% clump$SNP, ]

rm(clump)

After pruning, we are left with for 196969 approximately independent SNPs.

Before moving on, we’ll rename the relevant columns so that they will use common terms across the different datasets. We’ll also add a ‘test_statistic’ column.

bmi <- bmi %>% 
    dplyr::rename(pval = p_value,
                  SE = stderr,
                  effect_size = effect,
                  ind_covar_N = N,
                  ind_covar_AF = Freq_Allele1_HapMapCEU) %>%
    mutate(test_statistic = effect_size / SE) %>%
    select(pval, SE, effect_size, ind_covar_N, ind_covar_AF, test_statistic)

Now we’ll add an random (uninformative covariate) to compare results with.

set.seed(39580)
bmi <- bmi %>% mutate(ind_covar_uninf = rnorm(nrow(bmi)))

4 Data Analysis

4.1 Differential Testing

Since we do not have access to the individual genotype information and since our data already contains the results of performing hypothesis tests of association of BMI with each SNP (including p-values and effect sizes), we don’t need to carry out any additional testing. We move on to checking the covariate diagnostics and adjusting for multiple comparisons.

4.2 Check Assumptions for ash

We’ll create a plot to examine the distribution of effect sizes, since the ash method assumes that the distribution of true (unobserved) effect sizes is unimodal.

ggplot(data=bmi, aes(effect_size)) +
  geom_histogram(bins=30)

We’ll also explore how the standard error (used by ash) correlates with the independent covariates (used by methods that incorporate covariates), in order to get an idea of how these pieces of information relate to one another.

ggplot(data=bmi, aes(x = ind_covar_N, y = SE)) +
  geom_hex(bins=125) +
  xlab("Covariate: Sample Size")

ggplot(data=bmi, aes(x = ind_covar_AF, y = SE)) +
  geom_hex(bins = 125)+
  xlab("Covariate: Minor allele frequency")

## Warning: Removed 200 rows containing non-finite values (stat_binhex).

ggplot(data=bmi, aes(x = ind_covar_AF, y = SE)) +
  geom_hex(bins = 125)+
  xlab("Covariate: Random")

## Warning: Removed 200 rows containing non-finite values (stat_binhex).

4.3 Covariate Diagnostics

Here we look to see if the covariates do indeed look informative. First we look at the sample size covariate.

4.3.1 Covariate one: Sample Size

rank_scatter(bmi, pvalue="pval", covariate="ind_covar_N") + 
  ggtitle("Covariate 1: Sample Size")

strat_hist(bmi, pvalue="pval", covariate="ind_covar_N", maxy=2)

4.3.2 Covariate two: Minor Allele Frequency

Next we look at the allele frequency covariate.

rank_scatter(bmi, pvalue="pval", covariate="ind_covar_AF") +
  ggtitle("Covariate 1: Minor allele frequency")

strat_hist(bmi, pvalue="pval", covariate="ind_covar_AF", maxy=2)

4.3.3 Covariate two: Random

Next we look at the random (uninformative) covariate.

rank_scatter(bmi, pvalue="pval", covariate="ind_covar_uninf") +
  ggtitle("Covariate 1: Minor allele frequency")

strat_hist(bmi, pvalue="pval", covariate="ind_covar_uninf", maxy=2)

4.4 Multiple-Testing Correction

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.

bd <- initializeBenchDesign()

We also add in Scott’s FDR Regression (both nulltype = "empirical" and nulltype = "theoretical") since our test statistics are t-distributed.

bd <- addBMethod(bd, "fdrreg-t",
                     FDRreg::FDRreg,
                     function(x) { x$FDR },
                     z = test_statistic,
                     features = model.matrix( ~  splines::bs(ind_covariate, df = 3) - 1),
                     nulltype = 'theoretical',
                     control = list(lambda = 0.01))
bd <- addBMethod(bd, "fdrreg-e",
                     FDRreg::FDRreg,
                     function(x) { x$FDR },
                     z = test_statistic,
                     features = model.matrix( ~  splines::bs(ind_covariate, df = 3) - 1),
                     nulltype = 'empirical',
                     control = list(lambda = 0.01))

Now, we’re ready to construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

4.4.1 Covariate one: Sample size

First we’ll include the sample size covariate.

duration <- NA
if (!file.exists(resfile_N)) {
    t1 <- proc.time()
    sbN <- bd %>% buildBench(data=(bmi %>% mutate(ind_covariate=ind_covar_N)), 
                             ftCols = c("ind_covar_N", "effect_size"),
                             parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sbN)$data_download_link <- 
                    "http://portals.broadinstitute.org/collaboration/giant/images/3/3a/BMI.SNPadjSMK.zip"
    saveRDS(sbN, file = resfile_N)
    duration <- round((proc.time()-t1)[3]/60,1)
} else {
    sbN <- readRDS(resfile_N)
}

## This step took 108 minutes for 196969 SNPs using 8 cores.

4.4.2 Covariate two: Minor Allele Frequency

Now, we’ll repeat the multiple testing correction using the other covariate (AF):

duration <- NA
if (!file.exists(resfile_AF)) {
    t1 <- proc.time()
    sbAF <- bd %>% buildBench(data=(bmi %>% mutate(ind_covariate=ind_covar_AF) %>%
                                    filter(!is.na(ind_covariate))), 
                              ftCols = c("ind_covar_AF", "effect_size"),
                              parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sbAF)$data_download_link <-    
                     "http://portals.broadinstitute.org/collaboration/giant/images/3/3a/BMI.SNPadjSMK.zip"
    saveRDS(sbAF, file = resfile_AF)
    duration <- round((proc.time()-t1)[3]/60, 1)
} else {
    sbAF <- readRDS(resfile_AF)
}

## This step took 63.1 minutes for 196969 SNPs using 8 cores.

4.4.3 Covariate three: Random

Now, we’ll repeat the multiple testing correction using the other covariate (random):

duration <- NA
if (!file.exists(resfile_uninf)) {
    t1 <- proc.time()
    sbU <- bd %>% buildBench(data=(bmi %>% mutate(ind_covariate=ind_covar_uninf) %>%
                                    filter(!is.na(ind_covariate))), 
                              ftCols = c("ind_covar_uninf", "effect_size"),
                              parallel=TRUE, BPPARAM=multicoreParam)
    metadata(sbU)$data_download_link <-    
                     "http://portals.broadinstitute.org/collaboration/giant/images/3/3a/BMI.SNPadjSMK.zip"
    saveRDS(sbU, file = resfile_uninf)
    duration <- round((proc.time()-t1)[3]/60, 1)
} else {
    sbU <- readRDS(resfile_uninf)
}

## This step took 73.7 minutes for 196969 SNPs using 8 cores.

4.5 Benchmark Metrics

Next, we’ll add the default performance metric for q-value assays and plot the results. We’ll start with covariate one.

4.5.1 Covariate one: Sample Size

First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sbN) <- "qvalue"
sbN <- addDefaultMetrics(sbN)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sbN, supplementary=FALSE) +
    ggtitle("Covariate 1: Sample size")

rejection_scatter_bins(sbN, covariate="ind_covar_N", bins=4,
                       supplementary=FALSE) +
    ggtitle("Covariate 1: Sample size")

# upset plot 
plotFDRMethodsOverlap(sbN, 
                      alpha=0.05, nsets=ncol(sb),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

covariateLinePlot(sbN, alpha=0.05, covname="effect_size", nbins=25, 
                 trans="log1p")

covariateLinePlot(sbN, alpha=0.05, covname="ind_covar_N", nbins=25, 
                 trans="log1p")

4.5.2 Covariate two: Minor Allele Frequency

Next, we’ll look at the performance metrics for the other covariate (minor allele frequency).

# rename assay to qvalue
assayNames(sbAF) <- "qvalue"
sbAF <- addDefaultMetrics(sbAF)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sbAF, supplementary=FALSE) +
  ggtitle("Covariate 2: Minor allele frequency")

rejection_scatter_bins(sbAF, covariate="ind_covar_AF", bins=4,
                       supplementary=FALSE) +
  ggtitle("Covariate 2: Minor allele frequency")

# upset plot 
plotFDRMethodsOverlap(sbAF, 
                      alpha=0.05, nsets=ncol(sbAF),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

covariateLinePlot(sbAF, alpha=0.05, covname="effect_size", nbins=25, 
                 trans = "log1p")

covariateLinePlot(sbAF, alpha=0.05, covname="ind_covar_AF", nbins=25)

4.5.3 Covariate three: Random

Next, we’ll look at the performance metrics for the other covariate (random).

# rename assay to qvalue
assayNames(sbU) <- "qvalue"
sbU <- addDefaultMetrics(sbU)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sbU, supplementary=FALSE) +
  ggtitle("Covariate 3: Random")

rejection_scatter_bins(sbU, covariate="ind_covar_uninf", bins=4,
                       supplementary=FALSE) +
  ggtitle("Covariate 3: Random")

# upset plot 
plotFDRMethodsOverlap(sbU, 
                      alpha=0.05, nsets=ncol(sbAF),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

covariateLinePlot(sbU, alpha=0.05, covname="effect_size", nbins=25, 
                 trans = "log1p")

covariateLinePlot(sbU, alpha=0.05, covname="ind_covar_uninf", nbins=25)

5 Covariate comparison

Here we compare the method ranks for the two covariates at alpha = 0.10.

plotMethodRanks(c(resfile_AF, resfile_N, resfile_uninf), 
                colLabels = c("MAF", "Sample Size", "Random"), 
                alpha = 0.10, xlab = "Covariate", 
                excludeMethods = NULL)

6 Session Information

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] adaptMT_1.0.0               FDRreg_0.2-1               
##  [3] mvtnorm_1.0-8               BayesLogit_0.6             
##  [5] fda_2.4.7                   Matrix_1.2-14              
##  [7] fdrtool_1.2.15              swfdr_1.6.0                
##  [9] qvalue_2.12.0               ashr_2.2-7                 
## [11] IHW_1.8.0                   SummarizedBenchmark_0.99.1 
## [13] mclust_5.4                  stringr_1.3.1              
## [15] rlang_0.2.1                 UpSetR_1.3.3               
## [17] SummarizedExperiment_1.10.1 DelayedArray_0.6.1         
## [19] matrixStats_0.53.1          Biobase_2.40.0             
## [21] GenomicRanges_1.32.3        GenomeInfoDb_1.16.0        
## [23] IRanges_2.14.10             S4Vectors_0.18.3           
## [25] BiocGenerics_0.26.0         tidyr_0.8.1                
## [27] magrittr_1.5                hexbin_1.27.2              
## [29] bindrcpp_0.2.2              BiocParallel_1.14.2        
## [31] cowplot_0.9.2               ggplot2_3.0.0              
## [33] dplyr_0.7.5                 readr_1.1.1                
## [35] readxl_1.1.0                data.table_1.11.4          
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137           bitops_1.0-6           RColorBrewer_1.1-2    
##  [4] doParallel_1.0.11      rprojroot_1.3-2        tools_3.5.0           
##  [7] backports_1.1.2        R6_2.2.2               lazyeval_0.2.1        
## [10] colorspace_1.3-2       withr_2.1.2            mnormt_1.5-5          
## [13] tidyselect_0.2.4       gridExtra_2.3          compiler_3.5.0        
## [16] ggdendro_0.1-20        labeling_0.3           slam_0.1-43           
## [19] mosaicCore_0.5.0       scales_0.5.0           psych_1.8.4           
## [22] SQUAREM_2017.10-1      digest_0.6.15          foreign_0.8-70        
## [25] ggformula_0.7.0        rmarkdown_1.10         XVector_0.20.0        
## [28] pscl_1.5.2             pkgconfig_2.0.1        htmltools_0.3.6       
## [31] lpsymphony_1.8.0       bindr_0.1.1            RCurl_1.95-4.10       
## [34] GenomeInfoDbData_1.1.0 mosaicData_0.16.0      Rcpp_0.12.17          
## [37] munsell_0.4.3          stringi_1.2.2          yaml_2.1.19           
## [40] MASS_7.3-50            zlibbioc_1.26.0        plyr_1.8.4            
## [43] grid_3.5.0             lattice_0.20-35        hms_0.4.2             
## [46] knitr_1.20             pillar_1.2.3           reshape2_1.4.3        
## [49] codetools_0.2-15       glue_1.2.0             evaluate_0.10.1       
## [52] foreach_1.4.4          cellranger_1.1.0       gtable_0.2.0          
## [55] purrr_0.2.5            assertthat_0.2.0       broom_0.4.4           
## [58] truncnorm_1.0-8        tibble_1.4.2           iterators_1.0.9       
## [61] mosaic_1.2.0

Case study: Genome-Wide Association Analyses (GWAS)

Keegan Korthauer

August 03, 2018

1 Summary

2 Workspace setup

3 Data preparation

3.1 Data download

3.2 Prune SNPs that are in Linkage Disequilibrium

4 Data Analysis

4.1 Differential Testing

4.2 Check Assumptions for ash

4.3 Covariate Diagnostics

4.3.1 Covariate one: Sample Size

4.3.2 Covariate two: Minor Allele Frequency

4.3.3 Covariate two: Random

4.4 Multiple-Testing Correction

4.4.1 Covariate one: Sample size

4.4.2 Covariate two: Minor Allele Frequency

4.4.3 Covariate three: Random

4.5 Benchmark Metrics

4.5.1 Covariate one: Sample Size

4.5.2 Covariate two: Minor Allele Frequency

4.5.3 Covariate three: Random

5 Covariate comparison

6 Session Information