1 Summary

Here we download and analyze the Baxter colorectal cancer (CRC) dataset (Baxter et al., 2016). The dataset includes stool samples from 172 patients with no colonic lesions, and 120 patients with CRC. We’ll download the processed OTU tables from Zenodo and unzip them in the data/crc_baxter_results folder.

For colorectal cancer, we do expect some amount of truly differentially abundant OTUs, but not as many as in diarrhea (e.g. Schubert et al., 2014). This dataset will hopefully provide an intermediate non-extreme case study.

2 Workspace Setup

library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(SummarizedBenchmark)

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/microbiome"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

crc_data <- file.path(datdir, "crc_baxter_results.tar.gz")
crc_dir <- file.path(datdir, "crc_baxter_results")

otu_result_file <- file.path(resdir, "baxter-otus-results.rds")
otu_bench_file <- file.path(sbdir, "baxter-otus-benchmark.rds")
otu_bench_file_log_ubiquity <- file.path(sbdir, "baxter-otus-log-ubiquity-benchmark.rds")

gns_result_file <- file.path(resdir, "baxter-genus-results.rds")
gns_bench_file <- file.path(sbdir, "baxter-genus-benchmark.rds")
gns_bench_file_log_ubiquity <- file.path(sbdir, "baxter-genus-log-ubiquity-benchmark.rds")
gns_bench_file_abun <- file.path(sbdir, "baxter-genus-mean-abun-benchmark.rds")
gns_bench_file_uninf <- file.path(sbdir, "baxter-genus-mean-uninf-benchmark.rds")

3 Data Preparation

if (!file.exists(crc_data)) {
    download.file("https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz",
                  destfile = crc_data)
}

if (!file.exists(file.path(crc_dir, "crc_baxter.metadata.txt"))) {
    untar(crc_data, exdir = datdir)
}

Next, we’ll read in the unzipped OTU table and metadata files into R.

## load OTU table and metadata
otu <- read.table(file.path(crc_dir, "RDP",
                            "crc_baxter.otu_table.100.denovo.rdp_assigned"))
meta <- read.csv(file.path(crc_dir, "crc_baxter.metadata.txt"), sep='\t')

## Keep only samples with the right DiseaseState metadata
meta <- dplyr::filter(meta, DiseaseState %in% c("H", "CRC"))

## add "X" in front of sample IDs because of how R read in the OTU table
meta$Sample_Name_s <- paste0("X", meta$Sample_Name_s)

## keep only samples with both metadata and 16S data
keep_samples <- intersect(colnames(otu), meta$Sample_Name_s)
otu <- otu[, keep_samples]
meta <- dplyr::filter(meta, Sample_Name_s %in% keep_samples)

A brief aside: what’s the count distribution of these OTUs?

hist(colSums(otu))

Since we’ll be using OTU-wise covariates, we shouldn’t need to perform any filtering/cleaning of the OTUs, apart from removing any that are all zeros. (This may happen after removing shallow samples, I think.) We still apply a minimum threshold of 10 reads per OTU across all samples. After removing these shallow OTUs, we also get rid of any samples with too few reads. We define the minimum number of reads per OTU in min_otu, and the minimum number of reads per sample in min_sample.

After we’ve removed any shallow OTUs and samples, we’ll convert the OTU table to relative abundances.

min_otu <- 10
minp_otu <- 0.01
min_sample <- 100

## Remove OTUs w/ <= min reads, w/ <= min prop, samples w/ <= min reads
otu <- otu[rowSums(otu) > min_otu, ]
otu <- otu[rowSums(otu > 0) / ncol(otu) > minp_otu, ]
otu <- otu[, colSums(otu) > min_sample]

## Update metadata with new samples
meta <- dplyr::filter(meta, Sample_Name_s %in% colnames(otu))

## Remove empty OTUs
otu <- otu[rowSums(otu) > 0, ]

## Convert to relative abundance
abun_otu <- t(t(otu) / rowSums(t(otu)))

## Add pseudo counts
zeroabun <- 0
abun_otu <- abun_otu + zeroabun

4 Data Analysis

4.1 Differential Testing

Next, we need to calculate the pvalues, effect size, and standard error for each OTU. Here, we’ll compare CRC vs. healthy. We won’t consider the nonCRC adenoma patients. We’ll put these results into a dataframe, and label the columns with the standardized names for downstream use (pval, SE, effect_size, test_statistic). The test statistic is the one returned by wilcox.test().

Note that the effect here is calculated as logfold change of mean abundance in controls relative to cases (i.e. log(mean_abun[controls]/mean_abun[cases])).

While we’re at it, we’ll also calculate the mean abundance and ubiquity (detection rate) of each OTU. Later, we can assign their values to a new column called ind_covariate for use in downstream steps.

if (!file.exists(otu_result_file)) {
    res <- test_microbiome(abundance = abun_otu, shift = zeroabun,
                           is_case = meta$DiseaseState == "CRC")
    saveRDS(res, file = otu_result_file)
} else {
    res <- readRDS(otu_result_file)
}

Finally, let’s try to add phylogeny as covariates. Here we’ll have columns for each separate taxonomic level.

res <- tidyr::separate(res, otu, 
                       c("kingdom", "phylum", "class", "order",
                         "family", "genus", "species", "denovo"), 
                       sep = ";", remove = FALSE)

4.2 Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

4.2.1 Ubiquity

strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggplot2':
## 
##     ggsave

rank_scatter(res, pvalue="pval", covariate="ubiquity")

Something weird is happening with p=0.20 and p=0.40 - maybe this has something to do with ties? Either way, ubiquity looks to be a bit informative (from the scatter plot, but not really the histograms…)

4.2.2 Mean Abundance (across non-zero samples)

strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

4.2.3 Phylogeny

Let’s look at phylum-level stratification first. A priori, I might expect Proteobacteria to be enriched for low p-values? But I don’t know if that’s super legit, and Eric doesn’t seem to think that phylogeny will be informative at all…

ggplot(res, aes(x=pval)) +
    geom_histogram() +
    facet_wrap(~phylum, scales = "free")

Not really informative either.

4.3 Multiple-Testing Correction

Let’s use ubiquity as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = ubiquity)

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the default comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(otu_bench_file)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = otu_bench_file)
} else {
    sb <- readRDS(otu_bench_file)
}

4.4 Benchmark Metrics

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

assays(sb)[["qvalue"]][, "ashq"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

4.5 Multiple-Testing Correction - log(ubiquity)

Let’s use log(ubiquity) as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = log10(ubiquity))

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the default comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(otu_bench_file_log_ubiquity)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = otu_bench_file_log_ubiquity)
} else {
    sb <- readRDS(otu_bench_file_log_ubiquity)
}

4.6 Benchmark Metrics - log(ubiquity)

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter( sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

assays(sb)[["qvalue"]][, "ashq"] %>% max()
sum(is.na(assays(sb)[["qvalue"]][, "scott-empirical"]))
sum(is.na(assays(sb)[["qvalue"]][, "lfdr"]))

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

5 Data Analysis (genus-level)

5.1 Collapse to Genus

## add column with otu names
genus_df <- as.data.frame(abun_otu)
genus_df <- dplyr::as_tibble(genus_df, rownames = "otu")

## just get genus information
genus_df <- dplyr::mutate(genus_df, genus = sapply(strsplit(otu, ";"), `[`, 6))

## gather into tidy format
genus_df <- tidyr::gather(genus_df, key = "sample", value = "abun", -otu, -genus)

## get rid of unannoated genera, and sum abundances for genera
genus_df <- genus_df %>% 
    dplyr::filter(genus != "g__") %>% 
    dplyr::group_by(genus, sample) %>% 
    dplyr::summarise(total_abun = sum(abun)) %>%
    ungroup()

## convert back to longform
genus_df <- tidyr::spread(genus_df, sample, total_abun)
genus_df <- as.data.frame(as.list(dplyr::select(genus_df, -genus)),
                          row.names = genus_df$genus)

## re-order columns to match metadata
genus_df <- genus_df[, match(meta$Sample_Name_s, colnames(genus_df))]

## use matrix
genus <- as.matrix(genus_df)

5.2 Differential Testing

if (!file.exists(gns_result_file)) {
    res <- test_microbiome(abundance = genus, shift = zeroabun,
                           is_case = meta$DiseaseState == "CRC")
    saveRDS(res, file = gns_result_file)
} else {
    res <- readRDS(gns_result_file)
}

Add random (uninformative) covariate to test at the genus level.

set.seed(72664)
res$rand_covar <- rnorm(nrow(res))

5.3 Covariate Diagnostics

Here we look to see if the covariates do indeed look informative.

5.3.1 Ubiquity

strat_hist(res, pvalue="pval", covariate="ubiquity", maxy=10, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="ubiquity")

5.3.2 Mean Abundance (across non-zero samples)

strat_hist(res, pvalue="pval", covariate="mean_abun_present", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="mean_abun_present")

5.3.3 Random

strat_hist(res, pvalue="pval", covariate="rand_covar", maxy=8, binwidth=0.05)

rank_scatter(res, pvalue="pval", covariate="rand_covar")

5.4 Multiple-Testing Correction

Let’s use ubiquity as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = ubiquity)

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(gns_bench_file)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = gns_bench_file)
} else {
    sb <- readRDS(gns_bench_file)
}

5.5 Benchmark Metrics

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots(sb, alpha=0.1, nsets=6, methods=methods)

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

## Warning: Removed 25 rows containing missing values (geom_path).

5.6 Multiple-Testing Correction - log(ubiquity)

Let’s use log(ubiquity) as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = log10(ubiquity))

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(gns_bench_file_log_ubiquity)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = gns_bench_file_log_ubiquity)
} else {
    sb <- readRDS(gns_bench_file_log_ubiquity)
}

5.7 Benchmark Metrics - log(ubiquity)

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

methods <- c( "lfdr", "ihw-a10", "bl-df03", "qvalue", "bh", "bonf" )
plotCovariateBoxplots(sb, alpha=0.1, nsets=6, methods=methods)

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

## Warning: Removed 25 rows containing missing values (geom_path).

5.8 Multiple-Testing Correction - mean abundance

Let’s use mean_abun_present as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = mean_abun_present)

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(gns_bench_file_abun)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = gns_bench_file_abun)
} else {
    sb <- readRDS(gns_bench_file_abun)
}

5.9 Benchmark Metrics - mean abundance

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

## Warning: Removed 25 rows containing missing values (geom_path).

5.10 Multiple-Testing Correction - Random

Let’s use rand_covar as our ind_covariate.

res <- dplyr::mutate(res, ind_covariate = rand_covar)

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object. We remove ASH from the comparison.

Then, we’ll construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

if (!file.exists(gns_bench_file_uninf)) {
    bd <- initializeBenchDesign()
    bd <- dropBMethod(bd, "ashq")
    sb <- buildBench(bd, data = res, ftCols = "ind_covariate")
    metadata(sb)$data_download_link <-
                   "https://zenodo.org/record/840333/files/crc_baxter_results.tar.gz"
    saveRDS(sb, file = gns_bench_file_uninf)
} else {
    sb <- readRDS(gns_bench_file_uninf)
}

5.11 Benchmark Metrics - Random

Next, we’ll add the default performance metric for q-value assays. First, we have to rename the assay to ‘qvalue’.

assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)

Now, we’ll plot the results.

rejections_scatter(sb, as_fraction=FALSE, supplementary=FALSE)

rejection_scatter_bins(sb, covariate="ind_covariate", supplementary=FALSE)

Note: Benjamini-Hochberg (bh) overlaps exactly with the IHW results.

plotFDRMethodsOverlap(sb, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100 )

covariateLinePlot(sb, alpha = 0.05, covname = "ind_covariate")

## Warning: Removed 25 rows containing missing values (geom_path).

6 OTU/Genus comparison

Here we compare the method ranks for the different comparisons at alpha = 0.10.

plotMethodRanks(c(otu_bench_file, otu_bench_file_log_ubiquity, 
                  gns_bench_file, gns_bench_file_log_ubiquity,
                  gns_bench_file_abun, gns_bench_file_uninf), 
                colLabels = c("OTU", "OTU - log", "Genus", "Genus - log",
                              "Genus - abun", "Genus - uninf"), 
                alpha = 0.10, xlab = "Comparison")

7 Session Info

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hexbin_1.27.2               cowplot_0.9.2              
##  [3] bindrcpp_0.2.2              SummarizedBenchmark_0.99.1 
##  [5] mclust_5.4                  stringr_1.3.1              
##  [7] rlang_0.2.1                 UpSetR_1.3.3               
##  [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.1         
## [11] BiocParallel_1.14.2         matrixStats_0.53.1         
## [13] Biobase_2.40.0              GenomicRanges_1.32.3       
## [15] GenomeInfoDb_1.16.0         IRanges_2.14.10            
## [17] S4Vectors_0.18.3            BiocGenerics_0.26.0        
## [19] magrittr_1.5                ggplot2_3.0.0              
## [21] tidyr_0.8.1                 dplyr_0.7.5                
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17           RColorBrewer_1.1-2     XVector_0.20.0        
##  [4] pillar_1.2.3           compiler_3.5.0         plyr_1.8.4            
##  [7] bindr_0.1.1            zlibbioc_1.26.0        bitops_1.0-6          
## [10] tools_3.5.0            digest_0.6.15          lattice_0.20-35       
## [13] evaluate_0.10.1        tibble_1.4.2           gtable_0.2.0          
## [16] pkgconfig_2.0.1        Matrix_1.2-14          yaml_2.1.19           
## [19] gridExtra_2.3          GenomeInfoDbData_1.1.0 withr_2.1.2           
## [22] knitr_1.20             rprojroot_1.3-2        grid_3.5.0            
## [25] tidyselect_0.2.4       glue_1.2.0             R6_2.2.2              
## [28] rmarkdown_1.10         purrr_0.2.5            backports_1.1.2       
## [31] scales_0.5.0           htmltools_0.3.6        assertthat_0.2.0      
## [34] colorspace_1.3-2       labeling_0.3           stringi_1.2.2         
## [37] RCurl_1.95-4.10        lazyeval_0.2.1         munsell_0.4.3

Case Study: 16S microbiome data (Baxter et al., 2016)

Claire Duvallet

October 30, 2018

1 Summary

2 Workspace Setup

3 Data Preparation

4 Data Analysis

4.1 Differential Testing

4.2 Covariate Diagnostics

4.2.1 Ubiquity

4.2.2 Mean Abundance (across non-zero samples)

4.2.3 Phylogeny

4.3 Multiple-Testing Correction

4.4 Benchmark Metrics

4.5 Multiple-Testing Correction - log(ubiquity)

4.6 Benchmark Metrics - log(ubiquity)

5 Data Analysis (genus-level)

5.1 Collapse to Genus

5.2 Differential Testing

5.3 Covariate Diagnostics

5.3.1 Ubiquity

5.3.2 Mean Abundance (across non-zero samples)

5.3.3 Random

5.4 Multiple-Testing Correction

5.5 Benchmark Metrics

5.6 Multiple-Testing Correction - log(ubiquity)

5.7 Benchmark Metrics - log(ubiquity)

5.8 Multiple-Testing Correction - mean abundance

5.9 Benchmark Metrics - mean abundance

5.10 Multiple-Testing Correction - Random

5.11 Benchmark Metrics - Random

6 OTU/Genus comparison

7 Session Info