1 Summary

For the application of differential gene expression, we look at scRNAseq and bulk RNAseq data. Here we discuss the analyses of scRNAseq.

scRNAseq Datasets
datasets organism technology comparison
GSE84465 human smart-seq Tumor core vs Periphery
GSE94383 mouse smart-seq2 LPS vs No stimulation
DE methods
DE_method input output
scDD count p-val
MAST tpm p-val

We selected two datasets, one each from mouse and human. These datasets come from the Conquer (Soneson and Robinson 2018) scRNAseq data collection. Conquer provides uniformly processed gene expression data summarized to both count (aggregated from transcript members), and length-scaled TPMs at the gene level. The authors used salmon to quantify raw reads and tximport to summarize to gene level. The data comes from Smartseq, a plate based sequencing method which results in deeper sequencing of reads at the cost of numbers of cells.

Here we are interested in performing differential expression analyses to determine which genes have different expression among the two biological groups in each comparison. We also apply two different analysis methods: MAST (Finak et al. 2015) and scDD (Korthauer et al. 2016). In this vignette, we will explore the human dataset analysed with MAST.

2 Workspace setup

The downloaded plate data are stored in R objects called MultiAssayExperiment; this package requires BioConductor release 3.5 and R version 3.4 or greater for installation.

Note that this Rmd also relies on scDD version 1.3.4 or higher (install from Bioconductor devel, or from the master repo on GitHub kdkorthauer/scDD), since additional functionality to skip computation of the classification of genes into patterns was implemented.

library(MultiAssayExperiment)
library(SingleCellExperiment)
library(dplyr)
library(ggplot2)
library(cowplot)
library(impute)
library(scDD)
library(EBSeq)
library(BiocParallel)

if( !packageVersion("scDD") >= '1.3.4'){
  stop("This code relies on scDD version 1.3.4 or higher - see note above")
}

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

# data and results directories
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/scRNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

# results files
resfile_mean <- file.path(sbdir, "human-benchmark-scdd-mean.rds")
resfile_det <- file.path(sbdir, "human-benchmark-scdd-det.rds")
resfile_uninf <- file.path(sbdir, "human-benchmark-scdd-uninf.rds")

# set up parallel backend
cores <- 8
BiocParallel::register(MulticoreParam(workers = cores))

3 Data preparation

3.1 Data download

We first download the data in the data directory.conquer provides the processed datasets in convenient .rds files. First, we download the Human dataset: GSE84465.

datfile <- file.path(datdir, "GSE84465.rds")

if (!file.exists(datfile)){
  download.file("http://imlspenticton.uzh.ch/robinson_lab/conquer/data-mae/GSE84465.rds",
    destfile=datfile)
}

3.2 Data processing and filtering

For data processing, we follow practices and recommendations from (Lun, McCarthy, and Marioni 2016). Briefly, we perform quality control filtering on cells to ensure that technical artifacts do not distort downstream analysis results. In addition, we also filter out very low-abundance genes, since genes that contain almost all zero counts do not contain enough information for reliable inference. Specifically, we filter out cells with extremely low mapping rate (below 20%), as these represent poor quality cells that may end up having negative size factor estimates during normalization. For the same reason, we also filter cells which have extremely low proportion of genes detected (less than 5%). We also filter out genes that are expressed in fewer than 5% of cells in both groups.

This dataset (from PMC5810554 (Darmanis et al. 2017)) contains cells from four Glioblastoma patients, sampled from both the tumor core as well as the tissue peripheral to the tumor. We’ll compare the Neoplastic cells in the tumor core vs periphery. Here it is of interest to compare the gene expression profiles with cancerous cells that are part of the primary tumor with those that have migrated, since these differences are possibly related to metastasis. We’ll exclude the two samples where only a few neoplastic cells were found in the periphery.

datfile <- file.path(datdir, "human_GSE84465_scdd.rds")

if (!file.exists(datfile)){
  
  # load MultiAssayExperiment
  humandata <- readRDS(file.path(datdir, "GSE84465.rds"))
  
  # exclude cells with extremely low mapping or feature rate (poor quality cells)
  lowMap <- which(metadata(humandata)$salmon_summary$percent_mapped < 20 |
                  colMeans(assay(experiments(humandata)[[1]], "count") > 0) < 0.05)
  if (length(lowMap) > 0){
    humandata <- humandata[, -lowMap]
    message("Removed ", length(lowMap), " cells with mapping rate below 20%",
            " or fewer than 5% of genes detected.")
  }
  
  # subset by groups of interest
  humandata <- humandata[, colData(humandata)$patient.id.ch1 %in% 
                           c(" BT_S4", " BT_S2") &
                           colData(humandata)$cell.type.ch1 == " Neoplastic" ]
  cdata=colData(humandata)
  cdata$group <- cdata$tissue.ch1

  # subset by gene expression TPM & count assays
  humandata <- experiments(humandata)[[1]]
  assays(humandata) <- assays(humandata)[c("TPM", "count")]
  colData(humandata) <- cdata

  # subset by genes with detection rate > 5% in at least one condition
  levs <- unique(colData(humandata)$group)
  humandata <- humandata[rowMeans(assay(humandata, "count")[, 
                            colData(humandata)$group == levs[1]] > 0) > 0.05 |
                         rowMeans(assay(humandata, "count")[, 
                            colData(humandata)$group == levs[2]] > 0) > 0.05, ]
  
  # remove spike-in controls
  humandata <- humandata[!grepl("ERCC-", rownames(humandata)),]
 
  # save SummarizedExperiment 
  saveRDS(humandata, datfile)
}else{
  humandata <- readRDS(datfile)
}

# look at number of cells in each group
table(colData(humandata)$group)
## 
##  Periphery      Tumor 
##         52        580

For the human dataset, there are a total of 632 cells and 23257 genes.

4 Data analysis

4.1 Differential testing

We perform differential expression testing using scDD which implicitly models any heterogeneity in expression.

We apply scDD to normalized log-transformed counts. The p-value results are added to the rowData slot of the SummarizedExperiment data objects to ensure that gene metadata stays intact.

# function to add scDD p-values
compute_scdd_sc <- function(dat){
  # check if already computed
  if (! "scdd_p" %in% colnames(rowData(dat))){
    sce <- SingleCellExperiment(assays=list(count=assay(dat, "count")),
                                colData=data.frame(colData(dat)))
    # scran normalization
    libsize <- scran::computeSumFactors(assay(sce, "count"))
    normcounts(sce) <- GetNormalizedMat(assay(sce, "count"), libsize)

    res <- results(scDD(sce, 
             condition = "group",
             prior_param = list(alpha=0.01, mu0=0, s0=0.01, a0=0.01, b0=0.01), 
             testZeroes = TRUE,
             categorize = FALSE))
    rowData(dat)$scdd_p <- res$combined.pvalue
  }else{
     message("scDD p-values already computed.")
  }
  return(dat)
}

humandata <- compute_scdd_sc(humandata)
## scDD p-values already computed.
# save results
saveRDS(humandata, datfile)

4.2 Covariate Diagnostics

In scrnaseq data, strength of the signal can be a affected by the of level of gene expression. We will explore two potential covariates related to expression level: mean nonzero expression and detection rate (defined as the proportion of cells expressing the gene at a nonzero level). In addition, we’ll add a random (uninformative covariate). These will also be added to the rowData slot of the SummarizedExperiments.

add_covariates_scrnaseq <- function(dat){
  rowData(dat)$meanExp <- apply(assay(dat, "count"), 1, 
                                function(x) mean(x[x>0]))
  rowData(dat)$detection <- rowMeans(assay(dat, "count") > 0)
  rowData(dat)$rand_covar <- rnorm(nrow(dat))
  return(dat)
}

set.seed(8723)
humandata <- add_covariates_scrnaseq(humandata)

# save results
saveRDS(humandata, datfile)

4.2.1 Covariate one: Mean Nonzero Expression

For each covariate, we’ll examine the covariate diagnostic plots.

rank_scatter(data.frame(rowData(humandata)), 
             pvalue="scdd_p", covariate="meanExp") + 
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")
## Warning: Removed 1 rows containing non-finite values (stat_binhex).

strat_hist(data.frame(rowData(humandata)), 
           pvalue="scdd_p", covariate="meanExp", maxy=9,
           main = "scDD, Covariate 1: Mean Nonzero Expression") 

The mean nonzero expression appears to be informative and approximately satisfies the assumptions

4.2.2 Covariate two: Detection Rate

Next we look at the detection rate covariate.

rank_scatter(data.frame(rowData(humandata)), 
             pvalue="scdd_p", covariate="detection") + 
  ggtitle("scDD, Covariate 2: Detection Rate")
## Warning: Removed 1 rows containing non-finite values (stat_binhex).

strat_hist(data.frame(rowData(humandata)), 
           pvalue="scdd_p", covariate="detection", maxy=9,
           main = "scDD, Covariate 2: Detection Rate") 

4.2.3 Covariate three: Random

Next we look at the random covariate.

rank_scatter(data.frame(rowData(humandata)), 
             pvalue="scdd_p", covariate="rand_covar") + 
  ggtitle("scDD, Covariate 3: Random")
## Warning: Removed 1 rows containing non-finite values (stat_binhex).

strat_hist(data.frame(rowData(humandata)), 
           pvalue="scdd_p", covariate="rand_covar", maxy=9,
           main = "scDD, Covariate 3: Random") 

4.3 Multiple-Testing Correction

First, we’ll create an object of BenchDesign class to hold the data and add the benchmark methods to the BenchDesign object.

bd <- initializeBenchDesign()

Now, we’re ready to construct the SummarizedBenchmark object, which will run the functions specified in each method (these are actually sourced in from the helper scripts).

4.3.1 Covariate one: Mean nonzero expression

First we’ll include the mean nonzero expression covariate.

if (!file.exists(resfile_mean)){
  sb1 <- bd %>% buildBench(data.frame(rowData(humandata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=meanExp), 
                          ftCols = c("meanExp"),
                          parallel=TRUE)

  saveRDS(sb1, file = resfile_mean)
}else{
  sb1 <- readRDS(resfile_mean)
}

4.3.2 Covariate two: Detection Rate

Now, we’ll repeat the multiple testing correction using the covariate detection rate:

if (!file.exists(resfile_det)){
  sb2 <- bd %>% buildBench(data.frame(rowData(humandata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=detection), 
                          ftCols = c("detection"),
                          parallel=TRUE)

  saveRDS(sb2, file = resfile_det)
}else{
  sb2 <- readRDS(resfile_det)
}

4.3.3 Covariate three: Random

Now, we’ll repeat the multiple testing correction using the random covariate:

if (!file.exists(resfile_uninf)){
  sb3 <- bd %>% buildBench(data.frame(rowData(humandata)) %>% 
                                   filter(!is.na(scdd_p)) %>%
                                   mutate(pval=scdd_p,
                                          ind_covariate=rand_covar), 
                          ftCols = c("rand_covar"),
                          parallel=TRUE)

  saveRDS(sb3, file = resfile_uninf)
}else{
  sb3 <- readRDS(resfile_uninf)
}

4.4 Benchmark Metrics

Next, we’ll add the default performance metric for q-value assays and plot the results. We’ll start with covariate one.

4.4.1 Covariate one: Mean Nonzero Expression

First, we have to rename the assay to ‘qvalue’.

# rename assay to qvalue
assayNames(sb1) <- "qvalue"
sb1 <- addDefaultMetrics(sb1)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb1,
                   supplementary=FALSE) +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")

rejection_scatter_bins(sb1, covariate="meanExp", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")

# upset plot 
plotFDRMethodsOverlap(sb1, 
                      alpha=0.05, nsets=ncol(sb1),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE) 

mcols(sb1)$ind_covariate <- mcols(sb1)$meanExp
covariateLinePlot(sb1, alpha=0.05, covname="ind_covariate", nbins=25, 
                 trans = "log1p") +
  ggtitle("scDD, Covariate 1: Mean Nonzero Expression")
## Warning: Removed 25 rows containing missing values (geom_path).

4.4.2 Covariate two: Detection Rate

Next, we’ll look at the performance metrics for the covariate detection rate.

# rename assay to qvalue
assayNames(sb2) <- "qvalue"
sb2 <- addDefaultMetrics(sb2)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb2, supplementary=FALSE) +
  ggtitle("scDD, Covariate 2: Detection Rate")

rejection_scatter_bins(sb2, covariate="detection", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 2: Detection Rate")

# upset plot 
plotFDRMethodsOverlap(sb2, 
                      alpha=0.05, nsets=ncol(sb2),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

mcols(sb2)$ind_covariate <- mcols(sb2)$detection
covariateLinePlot(sb2, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("scDD, Covariate 2: Detection Rate")
## Warning: Removed 25 rows containing missing values (geom_path).

4.4.3 Covariate three: Random

Next, we’ll look at the performance metrics for the random covariate.

# rename assay to qvalue
assayNames(sb3) <- "qvalue"
sb3 <- addDefaultMetrics(sb3)

Now, we’ll plot the results.

# plot nrejects by method overall and stratified by covariate
rejections_scatter(sb3, supplementary=FALSE) +
  ggtitle("scDD, Covariate 3: Random")

rejection_scatter_bins(sb3, covariate="rand_covar", bins=4,
                       supplementary=FALSE) +
  ggtitle("scDD, Covariate 3: Random")

# upset plot 
plotFDRMethodsOverlap(sb3, 
                      alpha=0.05, nsets=ncol(sb3),
                      order.by="freq", decreasing=TRUE,
                      supplementary=FALSE)

mcols(sb3)$ind_covariate <- mcols(sb3)$rand_covar
covariateLinePlot(sb3, alpha=0.05, covname="ind_covariate", nbins=25) +
  ggtitle("scDD, Covariate 3: Random")
## Warning: Removed 25 rows containing missing values (geom_path).

5 Covariate comparison

Here we compare the method ranks for the two covariates at alpha = 0.10.

plotMethodRanks(c(resfile_mean, resfile_det, resfile_uninf), 
                colLabels = c("Mean nonzero", "Detection rate", "Uninf"), 
                alpha = 0.10, xlab = "Covariate")

6 Session Information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] adaptMT_1.0.0               FDRreg_0.2-1               
##  [3] mvtnorm_1.0-8               BayesLogit_0.6             
##  [5] fda_2.4.7                   Matrix_1.2-14              
##  [7] fdrtool_1.2.15              swfdr_1.6.0                
##  [9] qvalue_2.12.0               ashr_2.2-7                 
## [11] IHW_1.8.0                   SummarizedBenchmark_0.99.1 
## [13] mclust_5.4                  stringr_1.3.1              
## [15] rlang_0.2.1                 UpSetR_1.3.3               
## [17] tidyr_0.8.1                 bindrcpp_0.2.2             
## [19] magrittr_1.5                hexbin_1.27.2              
## [21] EBSeq_1.20.0                testthat_2.0.0             
## [23] gplots_3.0.1                blockmodeling_0.3.1        
## [25] scDD_1.4.0                  impute_1.54.0              
## [27] cowplot_0.9.2               ggplot2_3.0.0              
## [29] dplyr_0.7.5                 SingleCellExperiment_1.2.0 
## [31] SummarizedExperiment_1.10.1 DelayedArray_0.6.1         
## [33] BiocParallel_1.14.2         matrixStats_0.53.1         
## [35] Biobase_2.40.0              GenomicRanges_1.32.3       
## [37] GenomeInfoDb_1.16.0         IRanges_2.14.10            
## [39] S4Vectors_0.18.3            BiocGenerics_0.26.0        
## [41] MultiAssayExperiment_1.6.0  knitr_1.20                 
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2          spam_2.2-0              
##   [3] plyr_1.8.4               igraph_1.2.1            
##   [5] lazyeval_0.2.1           shinydashboard_0.7.0    
##   [7] scater_1.8.0             lpsymphony_1.8.0        
##   [9] digest_0.6.15            foreach_1.4.4           
##  [11] htmltools_0.3.6          viridis_0.5.1           
##  [13] SQUAREM_2017.10-1        gdata_2.18.0            
##  [15] mosaicData_0.16.0        doParallel_1.0.11       
##  [17] mosaicCore_0.5.0         limma_3.36.2            
##  [19] colorspace_1.3-2         RCurl_1.95-4.10         
##  [21] tximport_1.8.0           lme4_1.1-17             
##  [23] bindr_0.1.1              iterators_1.0.9         
##  [25] glue_1.2.0               registry_0.5            
##  [27] gtable_0.2.0             zlibbioc_1.26.0         
##  [29] XVector_0.20.0           ggformula_0.7.0         
##  [31] Rhdf5lib_1.2.1           maps_3.3.0              
##  [33] abind_1.4-5              scales_0.5.0            
##  [35] pscl_1.5.2               mosaic_1.2.0            
##  [37] edgeR_3.22.3             rngtools_1.3.1          
##  [39] bibtex_0.4.2             Rcpp_0.12.17            
##  [41] viridisLite_0.3.0        xtable_1.8-2            
##  [43] foreign_0.8-70           dotCall64_0.9-5.2       
##  [45] DT_0.4                   truncnorm_1.0-8         
##  [47] htmlwidgets_1.2          RColorBrewer_1.1-2      
##  [49] FNN_1.1                  pkgconfig_2.0.1         
##  [51] locfit_1.5-9.1           dynamicTreeCut_1.63-1   
##  [53] tidyselect_0.2.4         labeling_0.3            
##  [55] reshape2_1.4.3           later_0.7.3             
##  [57] munsell_0.4.3            tools_3.5.0             
##  [59] broom_0.4.4              ggdendro_0.1-20         
##  [61] evaluate_0.10.1          arm_1.10-1              
##  [63] yaml_2.1.19              outliers_0.14           
##  [65] caTools_1.17.1           purrr_0.2.5             
##  [67] nlme_3.1-137             doRNG_1.6.6             
##  [69] mime_0.5                 slam_0.1-43             
##  [71] scran_1.8.2              compiler_3.5.0          
##  [73] beeswarm_0.2.3           tibble_1.4.2            
##  [75] statmod_1.4.30           stringi_1.2.2           
##  [77] highr_0.7                fields_9.6              
##  [79] lattice_0.20-35          psych_1.8.4             
##  [81] nloptr_1.0.4             pillar_1.2.3            
##  [83] data.table_1.11.4        bitops_1.0-6            
##  [85] httpuv_1.4.3             R6_2.2.2                
##  [87] promises_1.0.1           KernSmooth_2.23-15      
##  [89] gridExtra_2.3            vipor_0.4.5             
##  [91] codetools_0.2-15         MASS_7.3-50             
##  [93] gtools_3.8.1             assertthat_0.2.0        
##  [95] rhdf5_2.24.0             pkgmaker_0.27           
##  [97] rprojroot_1.3-2          rjson_0.2.20            
##  [99] withr_2.1.2              mnormt_1.5-5            
## [101] GenomeInfoDbData_1.1.0   grid_3.5.0              
## [103] coda_0.19-1              minqa_1.2.4             
## [105] rmarkdown_1.10           DelayedMatrixStats_1.2.0
## [107] shiny_1.1.0              ggbeeswarm_0.6.0

References

Darmanis, Spyros, Steven A Sloan, Derek Croote, Marco Mignardi, Sophia Chernikova, Peyman Samghababi, Ye Zhang, et al. 2017. “Single-Cell Rna-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma.” Cell Reports 21 (5). Elsevier: 1399–1410.

Finak, Greg, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K Shalek, Chloe K Slichter, et al. 2015. “MAST: A Flexible Statistical Framework for Assessing Transcriptional Changes and Characterizing Heterogeneity in Single-Cell Rna Sequencing Data.” Genome Biology 16 (1). BioMed Central: 278.

Korthauer, Keegan D, Li-Fang Chu, Michael A Newton, Yuan Li, James Thomson, Ron Stewart, and Christina Kendziorski. 2016. “A Statistical Approach for Identifying Differential Distributions in Single-Cell Rna-Seq Experiments.” Genome Biology 17 (1). BioMed Central: 222.

Lun, Aaron TL, Davis J McCarthy, and John C Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Research 5. Faculty of 1000 Ltd.

Soneson, Charlotte, and Mark D Robinson. 2018. “Bias, Robustness and Scalability in Single-Cell Differential Expression Analysis.” Nature Methods. Nature Publishing Group.