Here we download and analyze the Goodrich obesity (OB) dataset (Goodrich et al., 2014). The dataset includes fecal samples from 428 individuals with lean body mass index (BMI < 25), and 185 obese individuals (BMI > 30). We’ll download the processed OTU tables from Zenodo and unzip them in the data/ob_goodrich_results
folder.
We expect few or no truly differentially abundant OTUs (Finucane et al., 2014 and Walters et al., 2014), so this study provides a real-life “null” example.
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(SummarizedBenchmark)
## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
source(f)
}
## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/microbiome"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)
ob_data <- file.path(datdir, "ob_goodrich_results.tar.gz")
ob_dir <- file.path(datdir, "ob_goodrich_results")
otu_result_file <- file.path(resdir, "goodrich-otus-results.rds")
otu_bench_file <- file.path(sbdir, "goodrich-otus-benchmark.rds")
gns_result_file <- file.path(resdir, "goodrich-genus-results.rds")
gns_bench_file <- file.path(sbdir, "goodrich-genus-benchmark.rds")
gns_bench_file_abun <- file.path(sbdir, "goodrich-genus-abun-benchmark.rds")
gns_bench_file_uninf <- file.path(sbdir, "goodrich-genus-uninf-benchmark.rds")
if (!file.exists(ob_data)) {
download.file("https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz",
destfile = ob_data)
}
if (!file.exists(file.path(ob_dir, "ob_goodrich.metadata.txt"))) {
untar(ob_data, exdir = datdir)
}
Next, we’ll read in the unzipped OTU table and metadata files into R.
We’ll need to manually define which disease labels are of interest and what metadata column contains the sample IDs (i.e. the IDs in the OTU table).
## load OTU table and metadata
otu <- read.table(file.path(ob_dir, "RDP",
"ob_goodrich.otu_table.100.denovo.rdp_assigned"))
meta <- read.csv(file.path(ob_dir, "ob_goodrich.metadata.txt"), sep='\t')
## Keep only samples with the right DiseaseState metadata
meta <- dplyr::filter(meta, DiseaseState %in% c("H", "OB"))
## keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta$X)
otu <- otu[, keep_samples]
meta <- dplyr::filter(meta, X %in% keep_samples)
Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs, apart from removing any that are all zeros. (This may happen after removing shallow samples, I think.) We still apply a minimum threshold of 10 reads per OTU across all samples. After removing these shallow OTUs, we also get rid of any samples with too few reads. We define the minimum number of reads per OTU in min_otu
, and the minimum number of reads per sample in min_sample
.
After we’ve removed any shallow samples, we’ll convert the OTU table to relative abundances.
min_otu <- 10
minp_otu <- 0.01
min_sample <- 100
## Remove OTUs w/ <= min reads, w/ <= min prop, samples w/ <= min reads
otu <- otu[rowSums(otu) > min_otu, ]
otu <- otu[rowSums(otu > 0) / ncol(otu) > minp_otu, ]
otu <- otu[, colSums(otu) > min_sample]
# Update metadata with new samples
meta <- dplyr::filter(meta, X %in% colnames(otu))
# Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]
# Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))
# Add pseudo counts
zeroabun <- 0
abun_otu <- abun_otu + zeroabun
Next, we need to calculate the pvalues, effect size, and standard error for each OTU. Here, we’ll compare lean vs. obese. We’ll put these results into a dataframe, and label the columns with the standardized names for downstream use (pval
, SE
, effect_size
, test_statistic
). The test statistic is the one returned by wilcox.test()
.
Note that the effect here is calculated as logfold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases])
).
While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate
for use in downstream steps.
Note that OB Goodrich has ~70,000 OTUs, so this is going to take a while.
if (!file.exists(otu_result_file)) {
res <- test_microbiome(abundance = abun_otu, shift = zeroabun,
is_case = meta$DiseaseState == "OB")
saveRDS(res, file = otu_result_file)
} else {
res <- readRDS(otu_result_file)
}
Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.
res <- tidyr::separate(res, otu,
c("kingdom", "phylum", "class", "order",
"family", "genus", "species", "denovo"),
sep=";", remove = FALSE)
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=5, binwidth=0.05, numQ=4)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
rank_scatter(res, pvalue="pval", covariate="ubiquity")
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=5, binwidth=0.05, numQ=4)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values? But I don’t know if that’s super legit, and Eric doesn’t seem to think that phylogeny will be informative at all…
ggplot(res, aes(x=pval)) +
geom_histogram() +
facet_wrap(~phylum, scales = "free")
Let’s use ubiquity
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object. We remove ASH from the default comparison.
Next, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(otu_bench_file)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
metadata(sb)$data_download_link <-
"https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
saveRDS(sb, file = otu_bench_file)
} else {
sb <- readRDS(otu_bench_file)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
## plot nrejects by method overall and stratified by covariate
rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
##plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE )
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
There is no upsetR plot for OB Goodrich for now because only lfdr is returning any rejections.
assays(sb)[["qvalue"]][, "ashq"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))
## add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df <- dplyr::as_tibble(genus_df, rownames = "otu")
## just get genus information
genus_df <- dplyr::mutate(genus_df, genus = sapply(strsplit(otu, ";"), `[`, 6))
## gather into tidy format
genus_df <- tidyr::gather(genus_df, key = "sample", value = "abun", -otu, -genus)
## get rid of unannoated genera, and sum abundances for genera
genus_df <- genus_df %>%
dplyr::filter(genus != "g__") %>%
dplyr::group_by(genus, sample) %>%
dplyr::summarise(total_abun = sum(abun)) %>%
ungroup()
## convert back to longform
genus_df <- tidyr::spread(genus_df, sample, total_abun)
genus_df <- as.data.frame(as.list(dplyr::select(genus_df, -genus)),
row.names = genus_df$genus)
## re-order columns to match metadata
genus_df <- genus_df[, match(meta$X, colnames(genus_df))]
## use matrix
genus <- as.matrix(genus_df)
if (!file.exists(gns_result_file)) {
res <- test_microbiome(abundance = genus, shift = zeroabun,
is_case = meta$DiseaseState == "OB")
saveRDS(res, file = gns_result_file)
} else {
res <- readRDS(gns_result_file)
}
Add random (uninformative) covariate.
set.seed(76291)
res$rand_covar <- rnorm(nrow(res))
Here we look to see if the covariates do indeed look informative.
strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="ubiquity")
Something weird is happening with p=0.20 and p=0.40 - maybe this has something to do with ties? Either way, ubiquity looks to be a bit informative (from the scatter plot, but not really the histograms…)
strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)
rank_scatter(res, pvalue="pval", covariate="mean_abun_present")
strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=5, binwidth=0.05, numQ=4)
rank_scatter(res, pvalue="pval", covariate="rand_covar")
Let’s use ubiquity
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = ubiquity)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(gns_bench_file)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
metadata(sb)$data_download_link <-
"https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
saveRDS(sb, file = gns_bench_file)
} else {
sb <- readRDS(gns_bench_file)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots(sb, alpha=0.1, nsets=6, methods=methods)
Let’s use mean_abun_present
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = mean_abun_present)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(gns_bench_file_abun)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
metadata(sb)$data_download_link <-
"https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
saveRDS(sb, file = gns_bench_file_abun)
} else {
sb <- readRDS(gns_bench_file_abun)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots(sb, alpha=0.1, nsets=6, methods=methods)
Let’s use rand_covar
as our ind_covariate
.
res <- dplyr::mutate(res, ind_covariate = rand_covar)
First, we’ll create an object of BenchDesign
class to hold the data and add the benchmark methods to the BenchDesign
object.
Then, we’ll construct the SummarizedBenchmark
object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).
if (!file.exists(gns_bench_file_uninf)) {
bd <- initializeBenchDesign()
bd <- dropBMethod(bd, "ashq")
sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
metadata(sb)$data_download_link <-
"https://zenodo.org/record/840333/files/ob_goodrich_results.tar.gz"
saveRDS(sb, file = gns_bench_file_uninf)
} else {
sb <- readRDS(gns_bench_file_uninf)
}
Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Now, we’ll plot the results.
rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)
rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)
covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")
## Warning: Removed 25 rows containing missing values (geom_path).
Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.
plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )
methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots(sb, alpha=0.1, nsets=6, methods=methods)
Here we compare the method ranks for the different comparisons at alpha = 0.10.
plotMethodRanks(c(otu_bench_file, gns_bench_file, gns_bench_file_abun, gns_bench_file_uninf),
colLabels = c("OTU", "Genus-ubiquity", "Genus-abun", "Genus-uninf"),
alpha = 0.10, xlab = "Comparison")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] hexbin_1.27.2 cowplot_0.9.2
## [3] bindrcpp_0.2.2 SummarizedBenchmark_0.99.1
## [5] mclust_5.4 stringr_1.3.1
## [7] rlang_0.2.1 UpSetR_1.3.3
## [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.1
## [11] BiocParallel_1.14.2 matrixStats_0.53.1
## [13] Biobase_2.40.0 GenomicRanges_1.32.3
## [15] GenomeInfoDb_1.16.0 IRanges_2.14.10
## [17] S4Vectors_0.18.3 BiocGenerics_0.26.0
## [19] magrittr_1.5 ggplot2_3.0.0
## [21] tidyr_0.8.1 dplyr_0.7.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 RColorBrewer_1.1-2 XVector_0.20.0
## [4] pillar_1.2.3 compiler_3.5.0 plyr_1.8.4
## [7] bindr_0.1.1 zlibbioc_1.26.0 bitops_1.0-6
## [10] tools_3.5.0 digest_0.6.15 lattice_0.20-35
## [13] evaluate_0.10.1 tibble_1.4.2 gtable_0.2.0
## [16] pkgconfig_2.0.1 Matrix_1.2-14 yaml_2.1.19
## [19] gridExtra_2.3 GenomeInfoDbData_1.1.0 withr_2.1.2
## [22] knitr_1.20 rprojroot_1.3-2 grid_3.5.0
## [25] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2
## [28] rmarkdown_1.10 purrr_0.2.5 backports_1.1.2
## [31] scales_0.5.0 htmltools_0.3.6 assertthat_0.2.0
## [34] colorspace_1.3-2 labeling_0.3 stringi_1.2.2
## [37] RCurl_1.95-4.10 lazyeval_0.2.1 munsell_0.4.3