Here we analyze a subset of the ENIGMA data (Smith, M. and Rocha, A. et al 2015). The dataset includes water samples from 67 wells with well-characterized biochemical metadata (i.e. pH, heavy metal concentrations, etc). We’ll download the processed OTU table and associated metadata from Zenodo and put it in the data/enigma_results
folder.
Note that I’ve uploaded this to our repo as a gzipped file, make sure to gunzip the OTU table before knitting this.
We did some preliminary correlation analysis and found that the metadata variables show a range of correlation with different OTUs: some metadata variables don’t seem correlated with any OTUs, whereas others seem correlated with basically the entire community. We’ll pick three of these metadata variables to illustrate our FDR correction methods in cases where we expect different numbers of true significant associations.
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(SummarizedBenchmark)
## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
source(f)
}
## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/microbiome"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)
enigma_dir <- file.path(datdir, "enigma_results")
dir.create(enigma_dir, showWarnings = FALSE)
# From their paper, we expect SO4 to have little correlation, pH
# to have high correlations, and Al to be in the middle
so4_result_file <- file.path(resdir, "enigma-so4-otus-results.rds")
so4_bench_file <- file.path(sbdir, "enigma-so4-otus-benchmark.rds")
so4_bench_file_abun <- file.path(sbdir, "enigma-so4-otus-abun-benchmark.rds")
so4_bench_file_uninf <- file.path(sbdir, "enigma-so4-otus-uninf-benchmark.rds")
ph_result_file <- file.path(resdir, "enigma-ph-otus-results.rds")
ph_bench_file <- file.path(sbdir, "enigma-ph-otus-benchmark.rds")
ph_bench_file_abun <- file.path(sbdir, "enigma-ph-otus-abun-benchmark.rds")
ph_bench_file_uninf <- file.path(sbdir, "enigma-ph-otus-uninf-benchmark.rds")
al_result_file <- file.path(resdir, "enigma-al-otus-results.rds")
al_bench_file <- file.path(sbdir, "enigma-al-otus-benchmark.rds")
al_bench_file_abun <- file.path(sbdir, "enigma-al-otus-abun-benchmark.rds")
al_bench_file_uninf <- file.path(sbdir, "enigma-al-otus-uninf-benchmark.rds")
We download the data from Zenodo if not already available.
if (!file.exists(file.path(enigma_dir, "enigma.metadata.txt"))) {
download.file("https://zenodo.org/record/1455793/files/enigma.metadata.txt",
destfile = file.path(enigma_dir, "enigma.metadata.txt"))
}
if (!file.exists(file.path(enigma_dir, "enigma.otu_table_resampled_updated_r.txt.gz"))) {
download.file("https://zenodo.org/record/1455793/files/enigma.otu_table_resampled_updated_r.txt.gz",
destfile = file.path(enigma_dir, "enigma.otu_table_resampled_updated_r.txt.gz"))
}
## load OTU table and metadata
otu <- read.table(file.path(enigma_dir, "enigma.otu_table_resampled_updated_r.txt.gz"),
sep='\t', header = TRUE, row.names = 1)
meta <- read.csv(file.path(enigma_dir, "enigma.metadata.txt"), sep='\t')
# Select only the 0.02 um samples
meta <- meta[endsWith(as.character(meta$Sample), "02"), ]
# Make metadata sample ID match OTU table
meta$Sample <- gsub("_", "\\.", as.character(meta$Sample))
meta$Sample <- substr(meta$Sample, 1, nchar(meta$Sample)-3)
## keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta$Sample)
otu <- otu[, keep_samples]
meta <- dplyr::filter(meta, Sample %in% keep_samples)
## Make sure samples in metadata and OTU table match
meta <- meta[match(colnames(otu), meta$Sample), ]
We’ll do a little filtering here before calculating correlations. We’ll apply a minimum threshold of 10 reads per OTU across all samples. After removing these shallow OTUs, we also get rid of any samples with too few reads. We define the minimum number of reads per OTU in min_otu
, and the minimum number of reads per sample in min_sample
.
After we’ve removed any shallow OTUs and samples, we’ll convert the OTU table to relative abundances.
min_otu <- 10
minp_otu <- 0.01
min_sample <- 100
## Remove OTUs w/ <= min reads, w/ <= min prop, samples w/ <= min reads
otu <- otu[rowSums(otu) > min_otu, ]
otu <- otu[rowSums(otu > 0) / ncol(otu) > minp_otu, ]
otu <- otu[, colSums(otu) > min_sample]
## Update metadata with new samples
meta <- dplyr::filter(meta, Sample %in% colnames(otu))
## Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
## Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))
## Add pseudo counts
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
Next, we need to calculate the pvalues for correlations between abundances of each OTU and the metadata variables of interest. (i.e. we’re correlating each OTU’s abundances across wells with the metadata variable across wells). Let’s start with “pH”. We’ll store the pvalue results in a dataframe.
While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate
for use in downstream steps.
if (!file.exists(ph_result_file)) {
res <- test_microbiome_corr(abundance = abun_otu, meta_col = meta$pH,
shift = zeroabun)
saveRDS(res, file = ph_result_file)
} else {
res <- readRDS(ph_result_file)
}
Add random (uninformative) covariate.
set.seed(44318)
res$rand_covar <- rnorm(nrow(res))
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="rand_covar")
Let’s use ubiquity
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(ph_bench_file)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = ph_bench_file)
} else {
sb <- readRDS(ph_bench_file)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Let’s use mean_abun_present
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(ph_bench_file_abun)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = ph_bench_file_abun)
} else {
sb <- readRDS(ph_bench_file_abun)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Let’s use rand_covar
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = rand_covar)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(ph_bench_file_uninf)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = ph_bench_file_uninf)
} else {
sb <- readRDS(ph_bench_file_uninf)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Next, we’ll look at SO4 levels
if (!file.exists(so4_result_file)) {
res <- test_microbiome_corr(abundance = abun_otu, meta_col = meta$SO4_mgL,
shift = zeroabun)
saveRDS(res, file = so4_result_file)
} else {
res <- readRDS(so4_result_file)
}
Add random (uninformative) covariate.
set.seed(27349)
res$rand_covar <- rnorm(nrow(res))
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="rand_covar")
Let’s use ubiquity
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(so4_bench_file)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = so4_bench_file)
} else {
sb <- readRDS(so4_bench_file)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Let’s use mean_abun_present
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(so4_bench_file_abun)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = so4_bench_file_abun)
} else {
sb <- readRDS(so4_bench_file_abun)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Let’s use rand_covar
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(so4_bench_file_uninf)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = so4_bench_file_uninf)
} else {
sb <- readRDS(so4_bench_file_uninf)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
Finally, we’ll look at Al levels
if (!file.exists(al_result_file)) {
res <- test_microbiome_corr(abundance = abun_otu, meta_col = meta$Al_mgL,
shift = zeroabun)
saveRDS(res, file = al_result_file)
} else {
res <- readRDS(al_result_file)
}
# remove OTUs with missing pvals
res <- res %>% filter(!is.na(pval))
Add random (uninformative) covariate.
set.seed(19474)
res$rand_covar <- rnorm(nrow(res))
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="rand_covar")
Let’s use ubiquity
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(al_bench_file)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = al_bench_file)
} else {
sb <- readRDS(al_bench_file)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Let’s use mean_abun_present
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(al_bench_file_abun)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = al_bench_file_abun)
} else {
sb <- readRDS(al_bench_file_abun)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Let’s use rand_covar
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = rand_covar)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(al_bench_file_uninf)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
saveRDS(sb, file = al_bench_file_uninf)
} else {
sb <- readRDS(al_bench_file_uninf)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Here we compare the method ranks for the different comparisons at alpha = 0.10.
plotMethodRanks(c(so4_bench_file, so4_bench_file_abun, so4_bench_file_uninf,
ph_bench_file, ph_bench_file_abun, ph_bench_file_uninf,
al_bench_file, al_bench_file_abun, al_bench_file_uninf),
colLabels = c("SO4-ubiquity", "SO4-abun", "SO4-uninf",
"pH-ubiquity", "ph-abun", "pH-uninf",
"Al-ubiquity", "Al-abun", "Al-uninf"),
alpha = 0.10, xlab = "Comparison")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] adaptMT_1.0.0 FDRreg_0.2-1
## [3] mvtnorm_1.0-8 BayesLogit_0.6
## [5] fda_2.4.7 Matrix_1.2-14
## [7] fdrtool_1.2.15 swfdr_1.6.0
## [9] qvalue_2.12.0 ashr_2.2-7
## [11] IHW_1.8.0 hexbin_1.27.2
## [13] cowplot_0.9.2 bindrcpp_0.2.2
## [15] SummarizedBenchmark_0.99.1 mclust_5.4
## [17] stringr_1.3.1 rlang_0.2.1
## [19] UpSetR_1.3.3 SummarizedExperiment_1.10.1
## [21] DelayedArray_0.6.1 BiocParallel_1.14.2
## [23] matrixStats_0.53.1 Biobase_2.40.0
## [25] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
## [27] IRanges_2.14.10 S4Vectors_0.18.3
## [29] BiocGenerics_0.26.0 magrittr_1.5
## [31] ggplot2_3.0.0 tidyr_0.8.1
## [33] dplyr_0.7.5
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bitops_1.0-6 doParallel_1.0.11
## [4] RColorBrewer_1.1-2 rprojroot_1.3-2 tools_3.5.0
## [7] backports_1.1.2 R6_2.2.2 lazyeval_0.2.1
## [10] colorspace_1.3-2 withr_2.1.2 tidyselect_0.2.4
## [13] gridExtra_2.3 mnormt_1.5-5 compiler_3.5.0
## [16] ggdendro_0.1-20 labeling_0.3 slam_0.1-43
## [19] mosaicCore_0.5.0 scales_0.5.0 SQUAREM_2017.10-1
## [22] psych_1.8.4 digest_0.6.15 ggformula_0.7.0
## [25] foreign_0.8-70 rmarkdown_1.10 XVector_0.20.0
## [28] pscl_1.5.2 pkgconfig_2.0.1 htmltools_0.3.6
## [31] lpsymphony_1.8.0 bindr_0.1.1 RCurl_1.95-4.10
## [34] GenomeInfoDbData_1.1.0 mosaicData_0.16.0 Rcpp_0.12.17
## [37] munsell_0.4.3 stringi_1.2.2 yaml_2.1.19
## [40] MASS_7.3-50 zlibbioc_1.26.0 plyr_1.8.4
## [43] grid_3.5.0 lattice_0.20-35 knitr_1.20
## [46] pillar_1.2.3 reshape2_1.4.3 codetools_0.2-15
## [49] glue_1.2.0 evaluate_0.10.1 foreach_1.4.4
## [52] gtable_0.2.0 purrr_0.2.5 assertthat_0.2.0
## [55] broom_0.4.4 truncnorm_1.0-8 tibble_1.4.2
## [58] iterators_1.0.9 mosaic_1.2.0