1 Summary

The objective of this vignette is to test different multiple testing methods in the context of Gene Set Enrichment Analysis (GSEA). To do this, we use data from the paper by Cabezas-Wallscheid et al. (Cell stem Cell, 2014). The data consist of RNA-seq data from mouse hematopoietic stem cells and multipotent progenitor lineages. The raw fastq data is available through the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-2262. These data were mapped to the mouse reference genome GRCm38 (ENSEMBL release 69) using the Genomic Short-Read Nucleotide Alignment program (version 2012-07-20). We used htseq-count to count the number of reads overlapping with each gene and used the DESeq2 package to format the data as a DESeqDataSet R object.

Here we use the fgsea Bioconductor package to implement the GSEA method. This is a Functional Class Scoring approach, which does not require setting an arbitrary threshold for Differential Expression, but instead relies on the gene’s rank (here we rank by DESeq2 test statistic).

2 Workspace Setup

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(scales)
library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:dplyr':
## 
##     exprs

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## The following object is masked from 'package:dplyr':
## 
##     count

## Loading required package: BiocParallel

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following objects are masked from 'package:base':
## 
##     aperm, apply

library(EnsDb.Mmusculus.v75)

## Loading required package: ensembldb

## Loading required package: GenomicFeatures

## Loading required package: AnnotationDbi

## 
## Attaching package: 'AnnotationDbi'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: AnnotationFilter

## 
## Attaching package: 'ensembldb'

## The following object is masked from 'package:dplyr':
## 
##     filter

## The following object is masked from 'package:stats':
## 
##     filter

library(fgsea)

## Loading required package: Rcpp

## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
    source(f)
}

## project data/results folders
datdir <- "data"
resdir <- "results"
sbdir <- "../../results/GSEA"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
dir.create(sbdir, showWarnings = FALSE)

## intermediary files we create below
count_file <- file.path(datdir, "mouse-counts.rds")
deseq_file <- file.path(datdir, "mouse-deseq.rds")
goset_file <- file.path(datdir, "mouse-gosets.rds")
result_file <- file.path(resdir, "mouse-results.rds")
bench_file <- file.path(sbdir, "mouse-benchmark.rds")
bench_file_uninf <- file.path(sbdir, "mouse-uninf-benchmark.rds")

3 Data Preparation

The data has been preprocessed and saved as a DESeqDataset object. The following lines of code download this DESeqDataSet if it is not present locally.

if (!file.exists(count_file)) {
    download.file("https://zenodo.org/record/1475409/files/gsea-mouse-counts.rds?download=1",
                  destfile = count_file)
}
dseHSCMPP <- readRDS(count_file)

4 Data Analysis

4.1 Enrichment Analysis

In order to rank the list of genes for GSEA, we will test each gene for differential gene expression between hematopoietic stem cells and multipotent progenitors (fraction 1). To do this, we will run DESeq2 to obtain a statistic for differential expression.

if (!file.exists(deseq_file)) {
    dseHSCMPP <- DESeq(dseHSCMPP)
    res <- results(dseHSCMPP, contrast = c("conditions", "HSC", "MPP1"),
                   independentFiltering = FALSE)
    saveRDS(res, file = deseq_file)
} else {
    res <- readRDS(deseq_file)
}

genes <- as.numeric(res$padj < 0.1)
names(genes) <- rownames(res)
sum(genes, na.rm=TRUE)

## [1] 1336

We next use biomaRt to get the relations between GO categories and genes.

if (!file.exists(goset_file)) {
    library(biomaRt)
    mart <- useMart("ensembl", "mmusculus_gene_ensembl")
    goSets <- getBM(c("ensembl_gene_id", "go_id"), mart = mart,
                    filters = "ensembl_gene_id", values = rownames(res))
    goSets <- goSets[!nchar(goSets$go_id) == 0, ]
    goSets <- with(goSets, split(go_id, ensembl_gene_id))
    saveRDS(goSets, file = goset_file)
} else {
   goSets <- readRDS(goset_file)
}

Now we use the fgsea package to perform the gene set enrichment analysis and obtain a enrichment p-value for each pathway.

# invert the list so each item is a pathway instead of a gene
goSets <- split(rep(names(goSets), lengths(goSets)), unlist(goSets))
stats <- res$stat
names(stats) <- rownames(res)
stats <- stats[!is.na(stats)]


if (!file.exists(result_file)) {
  goRes <- fgsea(goSets, 
                  stats, 
                  nperm=10000, 
                  maxSize=500,
                  minSize=5)  
  saveRDS(goRes, file = result_file)
} else {
  goRes <- readRDS(result_file)
}

Add a random (uninformative covariate) to the dataset.

## Add random (uninformative) covariate
set.seed(7476)
goRes$rand_covar <- rnorm(nrow(goRes))

4.2 Covariate Diagnostics

Here, we want to check whether the size of the gene set is actually informative and independent under the null.

4.2.1 Gene Set Size

In the following plot, we explore the relationship between the p-value and the gene set size. We can see that this covariate is actually informative.

rank_scatter(dat = goRes, pval = "pval", 
             covariate = "size", bins = 50, funx = log2, 
             funfill = log10_trans()) +
    ggtitle("Enriched gene sets") +
    xlab(expression(log[10]~"(# of genes)")) +
    ylab(expression(-log[10]~"(p-value)"))

And at the same time, it seems to be fair to assume that it is independent under the null hypothesis.

strat_hist(goRes, pval = "pval", covariate = "size", maxy=12)

4.2.2 Random

We will explore whether the random covariate can be used as a covariate for modern multiple-testing correction methods in the context of GSEA. In the plot below, the log10 of the p-values is plotted as a function of the random covariate. This covariate looks independent of the p-values.

rank_scatter(dat = goRes, pval = "pval", 
              covariate = "rand_covar", bins = 50, 
              funfill = log10_trans()) +
    ggtitle("Enriched gene sets") +
    ylab(expression(-log[10]~"(p-value)") )

We can also explore if the covariate seems to be independent under the null.

strat_hist(goRes, pval="pval", covariate="rand_covar", maxy=10)

4.3 Multiple-Testing Correction

Generating the SummarizedBenchmark object:

## rename columns and prepare for benchmarking
res <- dplyr::select(goRes, pval, size, rand_covar) %>%
    dplyr::rename(pval = pval,
                  ind_covariate = size) 

## generate default BenchDesign
bd <- initializeBenchDesign()

We don’t include ashq, fdrreg-e and fdrreg-t from the analysis because the necessary assumptions are not met in the current case study. Namely, effect sizes and standard errors are not available for ASH, and the test statistics are not normally distributed under the null and alternative, as required by Scott’s FDR regression methods.

if (!file.exists(bench_file)) {
    sGSEAMouse <- buildBench(bd, res, ftCols = "ind_covariate")
    saveRDS(sGSEAMouse, file = bench_file)
} else {
    sGSEAMouse <- readRDS(bench_file)
}

We’ll also compare the results to an uninformative (random) covariate.

if (!file.exists(bench_file_uninf)) {
  res$ind_covariate <- res$rand_covar
  sGSEAMouse_rand <- buildBench(bd, res, ftCols = "ind_covariate")
  saveRDS(sGSEAMouse_rand, file = bench_file_uninf)
} else {
  sGSEAMouse_rand <- readRDS(bench_file_uninf)
}

4.4 Benchmark Metrics

4.4.1 Gene Set Size

assayNames(sGSEAMouse) <- "qvalue"
sGSEAMouse <- addDefaultMetrics(sGSEAMouse)

rejections_scatter(sGSEAMouse, as_fraction = FALSE, supplementary = FALSE)

plotFDRMethodsOverlap(sGSEAMouse, alpha = 0.1, supplementary = FALSE,
                      order.by = "freq", nsets = 100)

covariateLinePlot(sGSEAMouse, alpha = 0.1, covname = "ind_covariate", trans="log1p")

## Warning: Removed 25 rows containing missing values (geom_path).

4.4.2 Random

assayNames(sGSEAMouse_rand) <- "qvalue"
sGSEAMouse_rand <- addDefaultMetrics(sGSEAMouse_rand)
sGSEAMouse_rand <- estimatePerformanceMetrics(sGSEAMouse_rand, addColData=TRUE)

rejections_scatter(sGSEAMouse_rand, as_fraction=FALSE, supplementary=FALSE)

## Found already estimated performance metrics, replacing these

plotFDRMethodsOverlap(sGSEAMouse_rand, alpha=0.1, supplementary=FALSE, order.by="freq", nsets=100)

covariateLinePlot(sGSEAMouse_rand, alpha = 0.1, covname = "ind_covariate", trans = "log1p")

## Warning: Removed 25 rows containing missing values (geom_path).

4.5 Covariate comparison

Here we compare the method ranks for the two covariates at alpha = 0.10.

plotMethodRanks(c(bench_file, bench_file_uninf), 
                colLabels = c("Set Size", "Random"), 
                alpha = 0.10, xlab = "Covariate", 
                excludeMethods = NULL)

5 Session Info

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] adaptMT_1.0.0               FDRreg_0.2-1               
##  [3] mvtnorm_1.0-8               BayesLogit_0.6             
##  [5] fda_2.4.7                   Matrix_1.2-14              
##  [7] fdrtool_1.2.15              swfdr_1.6.0                
##  [9] qvalue_2.12.0               ashr_2.2-7                 
## [11] IHW_1.8.0                   SummarizedBenchmark_0.99.1 
## [13] mclust_5.4                  stringr_1.3.1              
## [15] rlang_0.2.1                 UpSetR_1.3.3               
## [17] tidyr_0.8.1                 bindrcpp_0.2.2             
## [19] magrittr_1.5                cowplot_0.9.2              
## [21] hexbin_1.27.2               fgsea_1.6.0                
## [23] Rcpp_0.12.17                EnsDb.Mmusculus.v75_2.99.0 
## [25] ensembldb_2.4.1             AnnotationFilter_1.4.0     
## [27] GenomicFeatures_1.32.0      AnnotationDbi_1.42.1       
## [29] DESeq2_1.20.0               SummarizedExperiment_1.10.1
## [31] DelayedArray_0.6.1          BiocParallel_1.14.2        
## [33] matrixStats_0.53.1          Biobase_2.40.0             
## [35] GenomicRanges_1.32.3        GenomeInfoDb_1.16.0        
## [37] IRanges_2.14.10             S4Vectors_0.18.3           
## [39] BiocGenerics_0.26.0         scales_0.5.0               
## [41] ggplot2_3.0.0               dplyr_0.7.5                
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.3-2         rprojroot_1.3-2         
##  [3] htmlTable_1.12           XVector_0.20.0          
##  [5] ggdendro_0.1-20          base64enc_0.1-3         
##  [7] rstudioapi_0.7           bit64_0.9-7             
##  [9] mosaic_1.2.0             codetools_0.2-15        
## [11] mnormt_1.5-5             pscl_1.5.2              
## [13] doParallel_1.0.11        geneplotter_1.58.0      
## [15] knitr_1.20               Formula_1.2-3           
## [17] Rsamtools_1.32.0         broom_0.4.4             
## [19] annotate_1.58.0          cluster_2.0.7-1         
## [21] compiler_3.5.0           httr_1.3.1              
## [23] backports_1.1.2          assertthat_0.2.0        
## [25] lazyeval_0.2.1           acepack_1.4.1           
## [27] htmltools_0.3.6          prettyunits_1.0.2       
## [29] tools_3.5.0              gtable_0.2.0            
## [31] glue_1.2.0               GenomeInfoDbData_1.1.0  
## [33] reshape2_1.4.3           fastmatch_1.1-0         
## [35] slam_0.1-43              Biostrings_2.48.0       
## [37] nlme_3.1-137             rtracklayer_1.40.3      
## [39] iterators_1.0.9          psych_1.8.4             
## [41] mosaicCore_0.5.0         XML_3.98-1.11           
## [43] zlibbioc_1.26.0          MASS_7.3-50             
## [45] ProtGenerics_1.12.0      RColorBrewer_1.1-2      
## [47] yaml_2.1.19              curl_3.2                
## [49] mosaicData_0.16.0        memoise_1.1.0           
## [51] gridExtra_2.3            biomaRt_2.36.1          
## [53] rpart_4.1-13             latticeExtra_0.6-28     
## [55] stringi_1.2.2            RSQLite_2.1.1           
## [57] SQUAREM_2017.10-1        genefilter_1.62.0       
## [59] foreach_1.4.4            checkmate_1.8.5         
## [61] truncnorm_1.0-8          pkgconfig_2.0.1         
## [63] bitops_1.0-6             evaluate_0.10.1         
## [65] lattice_0.20-35          lpsymphony_1.8.0        
## [67] purrr_0.2.5              bindr_0.1.1             
## [69] GenomicAlignments_1.16.0 htmlwidgets_1.2         
## [71] labeling_0.3             bit_1.1-14              
## [73] tidyselect_0.2.4         ggformula_0.7.0         
## [75] plyr_1.8.4               R6_2.2.2                
## [77] Hmisc_4.1-1              DBI_1.0.0               
## [79] pillar_1.2.3             foreign_0.8-70          
## [81] withr_2.1.2              survival_2.41-3         
## [83] RCurl_1.95-4.10          nnet_7.3-12             
## [85] tibble_1.4.2             rmarkdown_1.10          
## [87] progress_1.1.2           locfit_1.5-9.1          
## [89] grid_3.5.0               data.table_1.11.4       
## [91] blob_1.1.1               digest_0.6.15           
## [93] xtable_1.8-2             munsell_0.4.3

Case Study: Gene Set Enrichment Analysis (Mouse Data Set)

Alejandro Reyes and Keegan Korthauer

October 30, 2018