The objective of this document is benchmark methods to control FDR in the context of differential gene expression.
The data consists of 20 samples from two regions of the human basal ganglia, the nucleus accumbens and the putamen, from the GTEx project. Shortly, samples were downloaded using the Short Read Archive Toolkit and mapped to the human reference genome version GRCh38 using STAR v2.4.2a. htseq-count was used to tabulate the number of uniquely mapping reads for each gene. We used DESeq2 to format the data into a DESeqDataSet object.
library(dplyr)
library(ggplot2)
library(DESeq2)
library(SummarizedBenchmark)
library(BiocParallel)
## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
source(f)
}
## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/RNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)
## intermediary files we create below
count_file <- file.path(resdir, "brain-counts.rds")
result_file <- file.path(resdir, "brain-results.rds")
bench_file <- file.path(sbdir, "brain-benchmark.rds")
bench_file_uninf <- file.path(sbdir, "brain-uninf-benchmark.rds")
## set up parallel backend
cores <- as.numeric(Sys.getenv("SLURM_NTASKS"))
multicoreParam <- SerialParam()
We download the DESeqDataSet object from zenodo that contains the gene level counts for GTEx samples.
if (!file.exists(count_file)) {
download.file("https://zenodo.org/record/1475409/files/rnaseq-brain-counts.rds?download=1", destfile = count_file)
}
dsd <- readRDS(count_file)
We use DESeq2 to test for differential gene expression between the two cell types. We set the parameter independentFiltering=FALSE
to skip the independent filtering step, as this step would be redundant with some of the FDR control methods that use gene expression as an independent covariate to increase power.
if (file.exists(result_file)) {
res <- readRDS(result_file)
} else {
dds <- DESeq(dsd, parallel = TRUE, BPPARAM = multicoreParam)
res <- results(dds, independentFiltering = FALSE) %>%
as.data.frame() %>%
na.omit() %>%
dplyr::select(pvalue, baseMean, log2FoldChange, lfcSE, stat) %>%
dplyr::rename(pval = pvalue,
ind_covariate = baseMean,
effect_size = log2FoldChange,
SE = lfcSE,
test_statistic = stat)
saveRDS(res, file = result_file)
}
Add random (uninformative) covariate.
set.seed(83750)
res$rand_covar <- rnorm(nrow(res))
In RNA-seq differential expression analysis, it is very well established that the mean expression is an informative covariate that is independent under the null and that can be used to increase power while keeping FDR control.
rank_scatter(res, pvalue = "pval", covariate = "ind_covariate") +
ggtitle("Mean coverage as independent covariate") +
xlab("Mean Expression")
We noticed some discreteness in the distribution of p-values towards values close to 1 that was particularly pronounced in the first covariate bin. The overall distribution of p-values could violate some of the assumptions of the FDR control methods.
strat_hist(res, pvalue = "pval",
covariate = "ind_covariate", maxy = 25)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:rlang':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
This discreteness, however, corresponded to genes with very low expression values. For example, there were 17024 genes with less than an average of 1 read across the 20 samples. The distribution of p-values looked much better when removing these very lowly expressed genes.
res <- dplyr::filter(res, ind_covariate >= 1)
strat_hist(res, pvalue = "pval",
covariate = "ind_covariate", maxy = 25)
rank_scatter(res, pvalue = "pval", covariate = "rand_covar") +
ggtitle("Random independent covariate")
strat_hist(res, pvalue = "pval",
covariate = "rand_covar", maxy = 25)
We use the common BenchDesign
with the set of multiple testing correction methods already included. We also add in Scott’s FDR Regression (both nulltype = "empirical"
and nulltype = "theoretical"
) since our test statistics are approximately t-distributed.
if (file.exists(bench_file)) {
sb <- readRDS(bench_file)
} else {
bd <- initializeBenchDesign()
bd <- addBMethod(bd, "fdrreg-t",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'theoretical',
control = list(lambda = 0.01))
bd <- addBMethod(bd, "fdrreg-e",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'empirical',
control = list(lambda = 0.01))
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = bench_file)
}
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
estimatePerformanceMetrics(sb, alpha = 0.05, tidy = TRUE) %>%
filter(performanceMetric == "rejections") %>%
select(blabel, performanceMetric, alpha, value) %>%
mutate(n = nrow(sb), prop = round(value / n, 3)) %>%
arrange(desc(value)) %>%
as_tibble() %>%
print(n = 40)
## # A tibble: 23 x 6
## blabel performanceMetric alpha value n prop
## <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 ashq rejections 0.05 6647 30374 0.219
## 2 unadjusted rejections 0.05 6595 30374 0.217
## 3 lfdr rejections 0.05 3811 30374 0.125
## 4 adapt-glm rejections 0.05 3750 30374 0.123
## 5 fdrreg-t rejections 0.05 3664 30374 0.121
## 6 bl-df02 rejections 0.05 3336 30374 0.11
## 7 bl-df03 rejections 0.05 3262 30374 0.107
## 8 bl-df05 rejections 0.05 3251 30374 0.107
## 9 bl-df04 rejections 0.05 3247 30374 0.107
## 10 qvalue rejections 0.05 3204 30374 0.105
## 11 ihw-a04 rejections 0.05 2766 30374 0.091
## 12 ihw-a06 rejections 0.05 2763 30374 0.091
## 13 ihw-a07 rejections 0.05 2756 30374 0.091
## 14 ihw-a10 rejections 0.05 2751 30374 0.091
## 15 ihw-a03 rejections 0.05 2745 30374 0.09
## 16 ihw-a08 rejections 0.05 2740 30374 0.09
## 17 ihw-a09 rejections 0.05 2739 30374 0.09
## 18 ihw-a05 rejections 0.05 2734 30374 0.09
## 19 ihw-a01 rejections 0.05 2730 30374 0.09
## 20 ihw-a02 rejections 0.05 2730 30374 0.09
## 21 bh rejections 0.05 2516 30374 0.083
## 22 fdrreg-e rejections 0.05 288 30374 0.009
## 23 bonf rejections 0.05 284 30374 0.009
ash was the method that rejected the largest number of hypotheses, followed by lfdr and fdrreg-theoretical.
rejections_scatter(sb, supplementary = FALSE)
rejection_scatter_bins(sb, covariate = "ind_covariate",
bins = 4, supplementary = FALSE)
plotFDRMethodsOverlap(sb, alpha = 0.05, nsets = ncol(sb),
order.by = "freq", decreasing = TRUE,
supplementary = FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
We use the common BenchDesign
with the set of multiple testing correction methods already included. We also add in Scott’s FDR Regression (both nulltype = "empirical"
and nulltype = "theoretical"
) since our test statistics are approximately t-distributed.
if (file.exists(bench_file_uninf)) {
sb <- readRDS(bench_file_uninf)
} else {
bd <- initializeBenchDesign()
bd <- addBMethod(bd, "fdrreg-t",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'theoretical',
control = list(lambda = 0.01))
bd <- addBMethod(bd, "fdrreg-e",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'empirical',
control = list(lambda = 0.01))
res <- res %>% dplyr::mutate(ind_covariate = rand_covar)
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = bench_file_uninf)
}
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
estimatePerformanceMetrics(sb, alpha = 0.05, tidy = TRUE) %>%
filter(performanceMetric == "rejections") %>%
select(blabel, performanceMetric, alpha, value) %>%
mutate(n = nrow(sb), prop = round(value / n, 3)) %>%
arrange(desc(value)) %>%
as_tibble() %>%
print(n = 40)
## # A tibble: 23 x 6
## blabel performanceMetric alpha value n prop
## <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 ashq rejections 0.05 6647 30374 0.219
## 2 unadjusted rejections 0.05 6595 30374 0.217
## 3 fdrreg-t rejections 0.05 3594 30374 0.118
## 4 bl-df02 rejections 0.05 3297 30374 0.109
## 5 lfdr rejections 0.05 3273 30374 0.108
## 6 bl-df03 rejections 0.05 3206 30374 0.106
## 7 qvalue rejections 0.05 3204 30374 0.105
## 8 bl-df04 rejections 0.05 3192 30374 0.105
## 9 bl-df05 rejections 0.05 3191 30374 0.105
## 10 adapt-glm rejections 0.05 3146 30374 0.104
## 11 ihw-a10 rejections 0.05 2521 30374 0.083
## 12 ihw-a09 rejections 0.05 2520 30374 0.083
## 13 ihw-a06 rejections 0.05 2519 30374 0.083
## 14 ihw-a07 rejections 0.05 2519 30374 0.083
## 15 bh rejections 0.05 2516 30374 0.083
## 16 ihw-a01 rejections 0.05 2516 30374 0.083
## 17 ihw-a05 rejections 0.05 2515 30374 0.083
## 18 ihw-a08 rejections 0.05 2503 30374 0.082
## 19 ihw-a03 rejections 0.05 2491 30374 0.082
## 20 ihw-a04 rejections 0.05 2484 30374 0.082
## 21 ihw-a02 rejections 0.05 2466 30374 0.081
## 22 bonf rejections 0.05 284 30374 0.009
## 23 fdrreg-e rejections 0.05 283 30374 0.009
rejections_scatter(sb, supplementary = FALSE)
rejection_scatter_bins(sb, covariate = "ind_covariate",
bins = 4, supplementary = FALSE)
plotFDRMethodsOverlap(sb, alpha = 0.05, nsets = ncol(sb),
order.by = "freq", decreasing = TRUE,
supplementary = FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Here we compare the method ranks for the different covariates at alpha = 0.10.
plotMethodRanks(c(bench_file, bench_file_uninf),
colLabels = c("mean", "uninf"),
alpha = 0.10, xlab = "Comparison")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 magrittr_1.5
## [3] cowplot_0.9.2 hexbin_1.27.2
## [5] SummarizedBenchmark_0.99.1 mclust_5.4
## [7] stringr_1.3.1 rlang_0.2.1
## [9] UpSetR_1.3.3 tidyr_0.8.1
## [11] DESeq2_1.20.0 SummarizedExperiment_1.10.1
## [13] DelayedArray_0.6.1 BiocParallel_1.14.2
## [15] matrixStats_0.53.1 Biobase_2.40.0
## [17] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
## [19] IRanges_2.14.10 S4Vectors_0.18.3
## [21] BiocGenerics_0.26.0 ggplot2_3.0.0
## [23] dplyr_0.7.5
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 splines_3.5.0 Formula_1.2-3
## [4] assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.1
## [7] GenomeInfoDbData_1.1.0 yaml_2.1.19 pillar_1.2.3
## [10] RSQLite_2.1.1 backports_1.1.2 lattice_0.20-35
## [13] glue_1.2.0 digest_0.6.15 RColorBrewer_1.1-2
## [16] XVector_0.20.0 checkmate_1.8.5 colorspace_1.3-2
## [19] htmltools_0.3.6 Matrix_1.2-14 plyr_1.8.4
## [22] XML_3.98-1.11 pkgconfig_2.0.1 genefilter_1.62.0
## [25] zlibbioc_1.26.0 purrr_0.2.5 xtable_1.8-2
## [28] scales_0.5.0 htmlTable_1.12 tibble_1.4.2
## [31] annotate_1.58.0 withr_2.1.2 nnet_7.3-12
## [34] lazyeval_0.2.1 cli_1.0.0 crayon_1.3.4
## [37] survival_2.41-3 memoise_1.1.0 evaluate_0.10.1
## [40] foreign_0.8-70 tools_3.5.0 data.table_1.11.4
## [43] locfit_1.5-9.1 munsell_0.4.3 cluster_2.0.7-1
## [46] AnnotationDbi_1.42.1 compiler_3.5.0 grid_3.5.0
## [49] RCurl_1.95-4.10 rstudioapi_0.7 htmlwidgets_1.2
## [52] labeling_0.3 bitops_1.0-6 base64enc_0.1-3
## [55] rmarkdown_1.10 gtable_0.2.0 DBI_1.0.0
## [58] R6_2.2.2 gridExtra_2.3 knitr_1.20
## [61] utf8_1.1.4 bit_1.1-14 bindr_0.1.1
## [64] Hmisc_4.1-1 rprojroot_1.3-2 stringi_1.2.2
## [67] Rcpp_0.12.17 geneplotter_1.58.0 rpart_4.1-13
## [70] acepack_1.4.1 tidyselect_0.2.4