1 Summary

For the application of differential gene expression, we look at scRNAseq and bulk RNAseq data. Here we discuss the analyses of scRNAseq.

scRNAseq Datasets
datasets organism technology comparison
GSE84465 human smart-seq Tumor core vs Periphery
GSE94383 mouse smart-seq2 LPS vs No stimulation
DE methods
DE_method input output
scDD count p-val
MAST tpm p-val

We selected two datasets, one each from mouse and human. These datasets come from the Conquer (Soneson and Robinson 2018) scRNAseq data collection. Conquer provides uniformly processed gene expression data summarized to both count (aggregated from transcript members), and length-scaled TPMs at the gene level. The authors used salmon to quantify raw reads and tximport to summarize to gene level. The data comes from Smartseq, a plate based sequencing method which results in deeper sequencing of reads at the cost of numbers of cells.

Here we are interested in performing differential expression analyses to determine which genes have different expression among the two biological groups in each comparison. We also apply two different analysis methods: MAST (Finak et al. 2015) and scDD (Korthauer et al. 2016). In this vignette, we will explore the human dataset analysed with MAST.

2 Workspace setup

The downloaded plate data are stored in R objects called MultiAssayExperiment; this package requires BioConductor release 3.5 and R version 3.4 or greater for installation.

library(MultiAssayExperiment)
library(SingleCellExperiment)
library(dplyr)
library(ggplot2)
library(cowplot)
library(impute)
library(edgeR)
library(MAST)
library(BiocParallel)

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

# data and results directories
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/scRNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

# results files
resfile_mean <- file.path(sbdir, "mouse-benchmark-mast-mean.rds")
resfile_det <- file.path(sbdir, "mouse-benchmark-mast-det.rds")
resfile_uninf <- file.path(sbdir, "mouse-benchmark-mast-uninf.rds")

# set up parallel backend
cores <- 8
BiocParallel::register(MulticoreParam(workers = cores))

3 Data preparation

3.1 Data download

We first download the data in the data directory. conquer provides the processed datasets in convenient .rds files. We download the Mouse data from GSE94383

datfile <- file.path(datdir,"GSE94383.rds")

if (!file.exists(datfile)){
  download.file(url="http://imlspenticton.uzh.ch/robinson_lab/conquer/data-mae/GSE94383.rds",
    destfile=datfile)
}

3.2 Data processing and filtering

For data processing, we follow practices and recommendations from (Lun, McCarthy, and Marioni 2016). Briefly, we perform quality control filtering on cells to ensure that technical artifacts do not distort downstream analysis results. In addition, we also filter out very low-abundance genes, since genes that contain almost all zero counts do not contain enough information for reliable inference. Specifically, we filter out cells with extremely low mapping rate (below 20%), as these represent poor quality cells that may end up having negative size factor estimates during normalization. For the same reason, we also filter cells which have extremely low proportion of genes detected (less than 5%). We also filter out genes that are expressed in fewer than 5% of cells in both groups.

This dataset (from PMC28396000 (Lane et al. 2017)) contains cells from a murine macrophage cell line that were studied for the impact of stimulation of the innate immune transcription factor nuclear factor \(\kappa\)B (NF-\(\kappa\)B) on gene expression. Specifically, it is of interest to characterize the dynamics of global gene expression on activating immune response. The NF-\(\kappa\)B was stimulated using a lipopolysaccharide (LPS)-dependent subunit fused to a fluorescent protein. Here we extract and compare two conditions and compare unstimulated cells versus cells stimulated with LPS after 150 minutes.

datfile <- file.path(datdir, "mouse_GSE94383_mast.rds")

if (!file.exists(datfile)){

  # load MultiAssayExperiment
  mousedata=readRDS(file.path(datdir,"GSE94383.rds"))
  
  # exclude cells with extremely low mapping or feature rate (poor quality cells)
  lowMap <- which(metadata(mousedata)$salmon_summary$percent_mapped < 20 |
                  colMeans(assay(experiments(mousedata)[[1]], "count") > 0) < 0.05)
  if (length(lowMap) > 0){
    mousedata <- mousedata[, -lowMap]
    message("Removed ", length(lowMap), " cells with mapping rate below 20%",
            " or fewer than 5% of genes detected.")
  }
  
  # subset by groups of interest
  mousedata <- mousedata[, (colData(mousedata)$characteristics_ch1.2 ==
                              "condition: No stimulation") |
                           (colData(mousedata)$characteristics_ch1.2 ==
                              "condition: LPS" &
                            colData(mousedata)$characteristics_ch1.1 %in% 
                              "time point: 150 min") ]
  cdata=colData(mousedata)
  cdata$group <- cdata$characteristics_ch1.2

  # subset by gene expression TPM & count assays
  mousedata <- experiments(mousedata)[[1]]
  assays(mousedata) <- assays(mousedata)[c("TPM", "count")]
  colData(mousedata) <- cdata

  # subset by genes with detection rate > 5% in at least one condition
  levs <- unique(colData(mousedata)$group)
  mousedata <- mousedata[rowMeans(assay(mousedata, "count")[, 
                            colData(mousedata)$group == levs[1]] > 0) > 0.05 |
                         rowMeans(assay(mousedata, "count")[, 
                            colData(mousedata)$group == levs[2]] > 0) > 0.05, ]
  
  # remove spike-in controls
  mousedata <- mousedata[!grepl("ERCC-", rownames(mousedata)),]

  # save SummarizedExperiment 
  saveRDS(mousedata, datfile)
}else{
  mousedata <- readRDS(datfile) 
}

# look at number of cells in each group
table(as.character(colData(mousedata)$group))
## 
##            condition: LPS condition: No stimulation 
##                       368                       202

For the mouse dataset2, we have a total of 570 cells and 13777 genes.

4 Data analysis

4.1 Differential testing

We perform differential expression testing using MAST which has a hurdle model to account for zeroes.

We apply MAST to the TPMs and adjust for the cellular detection rate, as suggested by the authors of MAST in the manuscript and vignette on Bioconductor.

# function to add MAST p-values
compute_MAST_sc <- function(dat){
  # check if already computed
  if (! "mast_p" %in% colnames(rowData(dat))){
    sca <- FromMatrix(exprsArray = log2(assay(dat, "TPM") + 1), 
                    cData = data.frame(wellKey=colnames(dat), colData(dat)))
    colData(sca)$cdr <- scale(colSums(assay(sca)>0))
    zlmdata <- zlm(~ group + cdr, sca)

    rowData(dat)$mast_p <- 
        lrTest(zlmdata, "group")[, "hurdle", "Pr(>Chisq)"]
  }else{
     message("MAST p-values already computed.")
  }
  return(dat)
}

mousedata <- compute_MAST_sc(mousedata)
## MAST p-values already computed.
# save results
saveRDS(mousedata, datfile)

4.2 Covariate Diagnostics

In scrnaseq data, strength of the signal can be a affected by the of level of gene expression. We will explore two potential covariates related to expression level: mean nonzero expression and detection rate (defined as the proportion of cells expressing the gene at a nonzero level). In addition, we’ll add a random (uninformative covariate). These will also be added to the rowData slot of the SummarizedExperiments.

add_covariates_scrnaseq <- function(dat){
  rowData(dat)$meanExp <- apply(assay(dat, "count"), 1, 
                                function(x) mean(x[x>0]))
  rowData(dat)$detection <- rowMeans(assay(dat, "count") > 0)
  rowData(dat)$rand_covar <- rnorm(nrow(dat))
  return(dat)
}

set.seed(6823)
mousedata <- add_covariates_scrnaseq(mousedata)

# save results
saveRDS(mousedata, datfile)

4.2.1 Covariate one: Mean Nonzero Expression

For each covariate, we’ll examine the covariate diagnostic plots.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="mast_p", covariate="meanExp") + 
  ggtitle("MAST, Covariate 1: Mean Nonzero Expression")

strat_hist(data.frame(rowData(mousedata)),
           pvalue="mast_p", covariate="meanExp", maxy=20,
           main = "MAST, Covariate 1: Mean Nonzero Expression")

The mean nonzero expression appears to be informative and approximately satisfies the assumptions

4.2.2 Covariate two: Detection Rate

Next we look at the detection rate covariate.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="mast_p", covariate="detection") + 
  ggtitle("MAST, Covariate 2: Detection Rate")

strat_hist(data.frame(rowData(mousedata)), 
           pvalue="mast_p", covariate="detection", maxy=20,
           main = "MAST, Covariate 2: Detection Rate")

4.2.3 Covariate three: Random

Next we look at the random covariate.

rank_scatter(data.frame(rowData(mousedata)), 
             pvalue="mast_p", covariate="rand_covar") + 
  ggtitle("MAST, Covariate 3: Random")

strat_hist(data.frame(rowData(mousedata)), 
           pvalue="mast_p", covariate="rand_covar", maxy=20,
           main = "MAST, Covariate 3: Random")

4.3 Multiple-Testing Correction

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.

bd <- initializeBenchDesign()

Now, we’re ready to construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

4.3.1 Covariate one: Mean nonzero expression

First we’ll include the mean nonzero expression covariate.

if (!file.exists(resfile_mean)){
  sb1 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   na.omit() %>%
                                   mutate(pval=mast_p,
                                          ind_covariate=meanExp), 
                          ftCols = c("meanExp"),
                          parallel=TRUE)

  saveRDS(sb1, file = resfile_mean)
}else{
  sb1 <- readRDS(resfile_mean)
}

4.3.2 Covariate two: Detection Rate

Now, we’ll repeat the multiple testing correction using the detection rate covariate:

if (!file.exists(resfile_det)){
  sb2 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   na.omit() %>%
                                   mutate(pval=mast_p,
                                          ind_covariate=detection), 
                          ftCols = c("detection"),
                          parallel=TRUE)

  saveRDS(sb2, file = resfile_det)
}else{
  sb2 <- readRDS(resfile_det)
}

4.3.3 Covariate three: Random

Now, we’ll repeat the multiple testing correction using the random covariate:

if (!file.exists(resfile_uninf)){
  sb3 <- bd %>% buildBench(data.frame(rowData(mousedata)) %>% 
                                   na.omit() %>%
                                   mutate(pval=mast_p,
                                          ind_covariate=rand_covar), 
                          ftCols = c("rand_covar"),
                          parallel=TRUE)

  saveRDS(sb3, file = resfile_uninf)
}else{
  sb3 <- readRDS(resfile_uninf)
}

4.4 Benchmark Metrics

Next, we’ll add the default performance metric for q-value assays and plot the results. We’ll start with covariate one.

4.4.1 Covariate one: Mean Nonzero Expression

First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb1) <- "qvalue"
sb1 <- addDefaultMetrics(sb1)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb1,
                   supplementary=FALSE) +
  ggtitle("MAST, Covariate 1: Mean Nonzero Expression")

rejection_scatter_bins(sb1, covariate="meanExp", bins=4,
                       supplementary=FALSE) +
  ggtitle("MAST, Covariate 1: Mean Nonzero Expression")

# upset plot 
plotFDRMethodsOverlap(sb1, 
                      alpha=0.05, nsets=ncol(sb1),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE) 

mcols(sb1)$ind_covariate <- mcols(sb1)$meanExp
covariateLinePlot(sb1, alpha=0.05, covname="ind_covariate", nbins=25, 
                 trans = "log1p") +
  ggtitle("MAST, Covariate 1: Mean Nonzero Expression")
## Warning: Removed 25 rows containing missing values (geom_path).

4.4.2 Covariate two: Detection Rate

Next, we’ll look at the performance metrics for the detection rate covariate.

# rename assay to qvalue
assayNames(sb2) <- "qvalue"
sb2 <- addDefaultMetrics(sb2)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb2, supplementary=FALSE) +
  ggtitle("MAST, Covariate 2: Detection Rate")

rejection_scatter_bins(sb2, covariate="detection", bins=4,
                       supplementary=FALSE) +
  ggtitle("MAST, Covariate 2: Detection Rate")

# upset plot 
plotFDRMethodsOverlap(sb2, 
                      alpha=0.05, nsets=ncol(sb2),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

mcols(sb2)$ind_covariate <- mcols(sb2)$detection
covariateLinePlot(sb2, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("MAST, Covariate 2: Detection Rate")
## Warning: Removed 25 rows containing missing values (geom_path).

4.4.3 Covariate three: Random

Next, we’ll look at the performance metrics for the random covariate.

# rename assay to qvalue
assayNames(sb3) <- "qvalue"
sb3 <- addDefaultMetrics(sb3)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb3, supplementary=FALSE) +
  ggtitle("MAST, Covariate 3: Random")

rejection_scatter_bins(sb3, covariate="rand_covar", bins=4,
                       supplementary=FALSE) +
  ggtitle("MAST, Covariate 3: Random")

# upset plot 
plotFDRMethodsOverlap(sb3, 
                      alpha=0.05, nsets=ncol(sb3),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

mcols(sb3)$ind_covariate <- mcols(sb3)$rand_covar
covariateLinePlot(sb3, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("MAST, Covariate 3: Random")
## Warning: Removed 25 rows containing missing values (geom_path).

5 Covariate comparison

Here we compare the method ranks for the two covariates at alpha = 0.10.

plotMethodRanks(c(resfile_mean, resfile_det, resfile_uninf), 
                colLabels = c("Mean nonzero", "Detection rate", "Uninf"), 
                alpha = 0.10, xlab = "Covariate")

6 Session Information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] adaptMT_1.0.0               FDRreg_0.2-1               
##  [3] mvtnorm_1.0-8               BayesLogit_0.6             
##  [5] fda_2.4.7                   Matrix_1.2-14              
##  [7] fdrtool_1.2.15              swfdr_1.6.0                
##  [9] qvalue_2.12.0               ashr_2.2-7                 
## [11] IHW_1.8.0                   SummarizedBenchmark_0.99.1 
## [13] mclust_5.4                  stringr_1.3.1              
## [15] rlang_0.2.1                 UpSetR_1.3.3               
## [17] tidyr_0.8.1                 bindrcpp_0.2.2             
## [19] magrittr_1.5                hexbin_1.27.2              
## [21] MAST_1.6.1                  edgeR_3.22.3               
## [23] limma_3.36.2                impute_1.54.0              
## [25] cowplot_0.9.2               ggplot2_3.0.0              
## [27] dplyr_0.7.5                 SingleCellExperiment_1.2.0 
## [29] SummarizedExperiment_1.10.1 DelayedArray_0.6.1         
## [31] BiocParallel_1.14.2         matrixStats_0.53.1         
## [33] Biobase_2.40.0              GenomicRanges_1.32.3       
## [35] GenomeInfoDb_1.16.0         IRanges_2.14.10            
## [37] S4Vectors_0.18.3            BiocGenerics_0.26.0        
## [39] MultiAssayExperiment_1.6.0  knitr_1.20                 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137           bitops_1.0-6           RColorBrewer_1.1-2    
##  [4] doParallel_1.0.11      rprojroot_1.3-2        tools_3.5.0           
##  [7] backports_1.1.2        R6_2.2.2               lazyeval_0.2.1        
## [10] colorspace_1.3-2       withr_2.1.2            mnormt_1.5-5          
## [13] tidyselect_0.2.4       gridExtra_2.3          compiler_3.5.0        
## [16] ggdendro_0.1-20        labeling_0.3           slam_0.1-43           
## [19] mosaicCore_0.5.0       scales_0.5.0           psych_1.8.4           
## [22] SQUAREM_2017.10-1      digest_0.6.15          foreign_0.8-70        
## [25] ggformula_0.7.0        rmarkdown_1.10         XVector_0.20.0        
## [28] pscl_1.5.2             pkgconfig_2.0.1        htmltools_0.3.6       
## [31] lpsymphony_1.8.0       highr_0.7              bindr_0.1.1           
## [34] RCurl_1.95-4.10        GenomeInfoDbData_1.1.0 mosaicData_0.16.0     
## [37] Rcpp_0.12.17           munsell_0.4.3          abind_1.4-5           
## [40] stringi_1.2.2          yaml_2.1.19            MASS_7.3-50           
## [43] zlibbioc_1.26.0        plyr_1.8.4             grid_3.5.0            
## [46] lattice_0.20-35        locfit_1.5-9.1         pillar_1.2.3          
## [49] reshape2_1.4.3         codetools_0.2-15       glue_1.2.0            
## [52] evaluate_0.10.1        data.table_1.11.4      foreach_1.4.4         
## [55] gtable_0.2.0           purrr_0.2.5            assertthat_0.2.0      
## [58] broom_0.4.4            truncnorm_1.0-8        tibble_1.4.2          
## [61] iterators_1.0.9        mosaic_1.2.0

References

Finak, Greg, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K Shalek, Chloe K Slichter, et al. 2015. “MAST: A Flexible Statistical Framework for Assessing Transcriptional Changes and Characterizing Heterogeneity in Single-Cell Rna Sequencing Data.” Genome Biology 16 (1). BioMed Central: 278.

Korthauer, Keegan D, Li-Fang Chu, Michael A Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. 2016. “A Statistical Approach for Identifying Differential Distributions in Single-Cell Rna-Seq Experiments.” Genome Biology 17 (1). BioMed Central: 222.

Lane, Keara, David Van Valen, Mialy M DeFelice, Derek N Macklin, Takamasa Kudo, Ariel Jaimovich, Ambrose Carr, et al. 2017. “Measuring Signaling and Rna-Seq in the Same Cell Links Gene Expression to Dynamic Patterns of Nf-\(\kappa\)B Activation.” Cell Systems 4 (4). Elsevier: 458–69.

Lun, Aaron TL, Davis J McCarthy, and John C Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Research 5. Faculty of 1000 Ltd.

Soneson, Charlotte, and Mark D Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods. Nature Publishing Group.