This is an analysis of the yeast data with 48 biological replicates in each of two conditions (analyzed in this publication). We chose this experiment because of the large number of biological replicates, which will allow us to (1) implement null comparisons on random subsets of samples within one condition, and (2) start with a null comparison and add in artificial differences to a subset of genes to define ‘true positives’.
We will simulate a strongly informative and a weakly informative independent covariate. We will also investigate the effect of an uninformative covariate by using a randomly generated covariate.
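The three covariate types can be sketched in a few self-contained lines of R (the variable names here are illustrative; the actual generation happens inside the simulation function defined later in this document):

```r
set.seed(1)
n <- 1000
# strongly informative covariate: logistic transform of a uniform draw;
# later used as probability weights when selecting DE genes
strong <- 1 / (1 + exp(-runif(n, 0, 10) + 5))
# weakly informative covariate: the strong covariate plus noise, clipped to [0, 1]
weak <- pmin(1, abs(strong + rnorm(n, 0, 0.25)))
# uninformative covariate: independent of DE status entirely
uninf <- runif(n)
```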
The processed count table is made available by the authors in the GitHub repository accompanying their paper.
In this Rmd we carry out simulations over multiple replications and plot results averaged over them. Here we draw effect sizes (log2 fold changes) from the fold changes observed in the full comparison of the two conditions. Since this results in a distribution of effect sizes and test statistics under the alternative with a mode at zero, the assumptions of FDRreg-empirical are violated. However, since this assumption is impossible to check outside of a simulation setting, we still include the method in the benchmark comparisons.
In this setting, we assume 2000 DE genes and a unimodal alternative.
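As a toy, self-contained sketch of the spike-in mechanism (made-up counts and fold changes here; the real simulation below draws its log2 fold changes from the observed comparison between conditions):

```r
set.seed(42)
# toy count matrix: 20 genes x 10 samples, two fake groups of 5
counts <- matrix(rpois(200, lambda = 50), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
group <- rep(c("A", "B"), each = 5)
# pick a random subset of genes to act as 'true positives'
de <- sample(nrow(counts), 5)
log2fc <- rnorm(5, mean = 0, sd = 2)  # stand-in for sampled fold changes
# shift one group's counts by 2^log2FC and round back to integers
counts[de, group == "B"] <- round(counts[de, group == "B"] * 2^log2fc)
truth <- seq_len(nrow(counts)) %in% de
```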
# Load packages and source benchmark FDR
library(SummarizedBenchmark)
library(data.table)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(magrittr)
library(cowplot)
library(purrr)
library(DESeq2)
library(tibble)
library(ggthemes)
library(R.utils)
## load helper functions
for (f in list.files("../R", "\\.(r|R)$", full.names = TRUE)) {
source(f)
}
# set up data / results directories
datdir <- "yeast-data"
resdir <- "../../results/RNAseq"
dir.create(datdir, showWarnings = FALSE)
dir.create(resdir, showWarnings = FALSE)
# results files that will be generated
# null
resfile_n5 <- file.path(resdir, "yeast-results-null5.rds")
resfile_n10 <- file.path(resdir, "yeast-results-null10.rds")
# strong covariate
resfile_d5 <- file.path(resdir, "yeast-results-de5.rds")
resfile_d10 <- file.path(resdir, "yeast-results-de10.rds")
# weak covariate
resfile_d5_w <- file.path(resdir, "yeastW-results-de5.rds")
resfile_d10_w <- file.path(resdir, "yeastW-results-de10.rds")
# uninformative covariate
resfile_d5_uninfCov <- file.path(resdir, "yeast-results-de5-uninfCov.rds")
resfile_d10_uninfCov <- file.path(resdir, "yeast-results-de10-uninfCov.rds")
# set up parallel backend
library(parallel)
nCores <- 20
First, we download the processed count data from GitHub. There is one file for the Snf2 condition and one for the wild-type condition. The Snf2 condition is a yeast strain in which the transcriptional regulator gene Snf2 has been knocked out. Each of these files is a compressed tar.gz archive containing a single count file per replicate, which we'll extract into a subdirectory called yeast-data.
download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Preprocessed_data/Snf2_countdata.tar.gz",
destfile = file.path(datdir, "Snf2_countdata.tar.gz"))
download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Preprocessed_data/WT_countdata.tar.gz",
destfile = file.path(datdir, "WT_countdata.tar.gz"))
gunzip(file.path(datdir, "Snf2_countdata.tar.gz"))
gunzip(file.path(datdir, "WT_countdata.tar.gz"))
untar(file.path(datdir, "Snf2_countdata.tar"), exdir = datdir)
untar(file.path(datdir, "WT_countdata.tar"), exdir = datdir)
file.remove(file.path(datdir, "Snf2_countdata.tar"),
file.path(datdir, "WT_countdata.tar"))
## [1] TRUE TRUE
Each of the data files contains two columns, one with a gene/feature name, and one with the count value.
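For illustration, a file in this two-column layout can be mimicked and read as follows (the rows below are made up; base `read.delim` is used to keep this sketch dependency-free, while the analysis itself uses `readr::read_tsv`):

```r
# write a tiny stand-in for one replicate's count file
f <- tempfile(fileext = ".bam.gbgout")
writeLines(c("15S_rRNA\t23", "21S_rRNA\t122", "HRA1\t3"), f)
# the real files have no header line, so read without one
tab <- read.delim(f, header = FALSE, col.names = c("feature", "count"))
tab
```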
We’ll also download the list of ‘bad’ replicates that were especially poorly correlated with the others, as determined by the authors.
download.file(url = "https://github.com/bartongroup/profDGE48/raw/master/Bad_replicate_identification/exclude.lst", destfile = (file.path(datdir, "badreps.txt")))
Here we make use of the map and map2 functions from the purrr package to apply the read_tsv function from readr to all 96 sample tables, adding to each subtable the sample name derived from the file name. Finally, the reduce function joins all the replicates together. We then remove the samples that failed QC in the original study.
files <- dir(path = datdir, pattern = "*.bam.gbgout", full.names = TRUE)
sample_names <- sapply(strsplit(dir(path = datdir, pattern = "*.bam.gbgout"), "_MID"),
function(x) x[[1]])
badreps <- read_tsv(file.path(datdir, "badreps.txt"), col_names = FALSE)$X1
badreps <- unlist(lapply(strsplit(badreps, "_MID"), function(x) x[1]))
counts <- files %>%
purrr::map(read_tsv, col_names = FALSE) %>% # read in all the files individually
purrr::map2(sample_names, ~ dplyr::rename(.x, !! .y := X2, feature = X1) ) %>% # add sample names
purrr::reduce(left_join, by = "feature") %>% # left_join all tables into one data frame by feature
dplyr::select(-one_of(badreps)) # remove bad replicates
Here we’ll carry out an analysis of a subset of the full dataset, comparing the controls to the knockout samples. The fold changes observed in this comparison will be used when generating the non-null simulated data in the following sections. Here we use a subset of the samples (10 samples in each group) to mimic the sample sizes we use in simulation.
We’re ready to construct a DESeq2 object. First we pull out the feature names and add them as rownames for the count table, and next we construct a column data object that houses the sample names, replicate numbers, and condition factor (WT versus Snf2 knockout).
First we set up the DESeq2 object.
feats <- (counts %>% select(1))$feature
counts <- as.matrix(counts %>% select(-1))
rownames(counts) <- feats
coldat <- tibble(sample=colnames(counts)) %>%
separate(sample, sep="_", into=c("condition", "replicate"), remove=FALSE) %>%
mutate(condition = factor(condition))
# filter low count genes
counts <- counts[rowMeans(counts) > 1,]
dds_full <- DESeqDataSetFromMatrix(countData = counts,
colData = coldat,
design= ~ condition)
Next we run DESeq2 and subset the results to genes with FDR < 0.05.
# results on full set
set.seed(9384)
subset.wt <- dds_full[,colData(dds_full)$condition == "WT"]
subset.sn <- dds_full[,colData(dds_full)$condition == "Snf2"]
subset <- cbind(subset.wt[, sample(1:ncol(subset.wt), 10)],
subset.sn[, sample(1:ncol(subset.sn), 10)])
dds <- DESeq(subset)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 8 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
resultsNames(dds) # lists the coefficients
## [1] "Intercept" "condition_WT_vs_Snf2"
res_10 <- results(dds, name="condition_WT_vs_Snf2", independentFiltering = F)
head(res_10)
## log2 fold change (MLE): condition WT vs Snf2
## Wald test p-value: condition WT vs Snf2
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 15S_rRNA 23.0677113098 -0.442436354738558 0.737886531818313
## 21S_rRNA 122.468643375999 -0.0695116374758201 0.623139117792482
## HRA1 2.73615294778566 1.01339771844163 0.509391516531703
## ICR1 138.471320625508 -0.096978186962606 0.0756931110689155
## LSR1 236.839720373841 0.169449748770875 0.354617686257335
## NME1 27.707146784895 -0.288556885396075 0.292234960281767
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 15S_rRNA -0.599599444711775 0.548773217327728 0.627374545210852
## 21S_rRNA -0.111550752458087 0.911179622050456 0.934130172605857
## HRA1 1.98942794599634 0.0466539875922699 0.0753013745547155
## ICR1 -1.2812022863522 0.200122623271019 0.268785314674111
## LSR1 0.477837838713748 0.632765627252953 0.70030622452096
## NME1 -0.987413980578691 0.323439733395463 0.403330270778396
sum(res_10$padj < 0.05, na.rm=T)
## [1] 3854
Check densities of the test statistic and effect sizes for significant DE genes.
data.frame(res_10[res_10$padj < 0.05 & !is.na(res_10$padj),]) %>%
ggplot(aes(x = stat)) +
geom_density() +
xlab("Test Statistic")
data.frame(res_10[res_10$padj < 0.05 & !is.na(res_10$padj),]) %>%
ggplot(aes(x = log2FoldChange)) +
geom_density() +
xlab("Effect Size (log2 Fold Change)")
There are several thousand genes detected in this 10 versus 10 comparison. Next, we’ll build the input data frame for SummarizedBenchmark.
geneExp <- tbl_df(data.frame(geneName=rownames(res_10),
pval=res_10$pvalue,
SE=res_10$lfcSE,
ind_covariate = res_10$baseMean,
effect_size=res_10$log2FoldChange,
test_statistic=res_10$stat,
pzero=rowSums(counts(subset)==0)/ncol(counts(subset))))
# remove genes with NA results or with zero counts in at least 50% of samples
geneExp <- geneExp %>% na.omit() %>% dplyr::filter(pzero < 0.5)
We’ll create a plot to examine the distribution of effect sizes, since the ash method assumes that the distribution of true (unobserved) effect sizes is unimodal.
ggplot(data=geneExp, aes(effect_size)) +
geom_histogram(bins=30)
We’ll also explore how the standard error (used by ash) correlates with the independent covariate (used by methods that incorporate covariates), in order to get an idea of how these pieces of information relate to one another.
ggplot(data=geneExp, aes(x = ind_covariate, y = SE)) +
geom_hex(bins = 100) +
scale_x_continuous(trans="log10") +
xlab("Covariate: Mean gene expression")
Look at covariate diagnostic plots.
strat_hist(geneExp, pvalue="pval", covariate="ind_covariate", maxy=30)
rank_scatter(geneExp, pvalue="pval", covariate="ind_covariate")
## Warning: Removed 9 rows containing non-finite values (stat_binhex).
Build the common bench design. We also add Scott’s FDR regression, since our test statistics are approximately t-distributed. Note that FDRreg-empirical assumes the distribution of non-null test statistics has no substantial mass at zero; that assumption is violated here, but we still include fdrreg-e in the evaluation since the assumption is impossible to check outside of a simulation setting (we cannot know which tests are non-null). We’d therefore like to evaluate how results change when it is violated, since it plausibly is in the case studies.
bd <- initializeBenchDesign()
##
## Attaching package: 'IHW'
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## Attaching package: 'ashr'
## The following object is masked from 'package:mclust':
##
## dens
## Loading required package: fda
## Loading required package: splines
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:tidyr':
##
## expand
##
## Attaching package: 'fda'
## The following object is masked from 'package:graphics':
##
## matplot
## Loading required package: BayesLogit
## Loading required package: mvtnorm
bd <- addBMethod(bd, "fdrreg-t",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'theoretical',
control = list(lambda = 0.01))
bd <- addBMethod(bd, "fdrreg-e",
FDRreg::FDRreg,
function(x) { x$FDR },
z = test_statistic,
features = model.matrix( ~ splines::bs(ind_covariate, df = 3) - 1),
nulltype = 'empirical',
control = list(lambda = 0.01))
Run benchmark methods.
sb <- bd %>% buildBench(data=geneExp, parallel = FALSE)
## Due to absence of package REBayes, switching to EM algorithm
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Model selection starts!
## Shrink the set of candidate models if it is too time-consuming.
##   |==================================================| 100%
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Plot results.
rejections_scatter(sb, supplementary=FALSE)
plotFDRMethodsOverlap(sb, alpha=0.05, nsets=ncol(sb), order.by="freq", decreasing=TRUE, supplementary=FALSE)
Here we’ll carry out an analysis of a subset of the full dataset, comparing the controls to the knockout samples. The fold changes observed in this comparison will be used when generating the non-null simulated data in the following sections. Here we use a subset of the samples (5 samples in each group) to mimic the sample sizes we use in simulation.
First we run DESeq2.
# results on full set
set.seed(728)
subset.wt <- dds_full[,colData(dds_full)$condition == "WT"]
subset.sn <- dds_full[,colData(dds_full)$condition == "Snf2"]
subset <- cbind(subset.wt[, sample(1:ncol(subset.wt), 5)],
subset.sn[, sample(1:ncol(subset.sn), 5)])
dds <- DESeq(subset)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resultsNames(dds) # lists the coefficients
## [1] "Intercept" "condition_WT_vs_Snf2"
res_5 <- results(dds, name="condition_WT_vs_Snf2", independentFiltering = F)
head(res_5)
## log2 fold change (MLE): condition WT vs Snf2
## Wald test p-value: condition WT vs Snf2
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 15S_rRNA 25.6626923517005 -1.34815645290897 1.40561049095492
## 21S_rRNA 114.275757036222 -0.709270244739718 1.13245600210531
## HRA1 2.6946770614275 0.114883899003022 0.866594538316973
## ICR1 151.161331077218 -0.164662939582324 0.108031359607394
## LSR1 229.529135332088 -0.370287291037731 0.612634881796695
## NME1 31.9605950080128 -0.313338368213699 0.474311940931533
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 15S_rRNA -0.959125206865156 NA NA
## 21S_rRNA -0.626311524175012 0.531110628953209 0.646273413046184
## HRA1 0.132569378092481 0.894533951361094 0.930036965782474
## ICR1 -1.52421426686418 0.127455195247127 0.21073064217961
## LSR1 -0.604417577320732 0.545566052439667 0.659087267557163
## NME1 -0.660616655777867 0.508858184848436 0.626647659378039
sum(res_5$padj < 0.05, na.rm=T)
## [1] 2900
Check densities of the test statistic and effect sizes for significant DE genes.
data.frame(res_5[res_5$padj < 0.05 & !is.na(res_5$padj),]) %>%
ggplot(aes(x = stat)) +
geom_density() +
xlab("Test Statistic")
data.frame(res_5[res_5$padj < 0.05 & !is.na(res_5$padj),]) %>%
ggplot(aes(x = log2FoldChange)) +
geom_density() +
xlab("Effect Size (log2 Fold Change)")
There are several thousand genes detected in this 5 versus 5 comparison. Next, we’ll build the input data frame for SummarizedBenchmark.
geneExp <- tbl_df(data.frame(geneName=rownames(res_5),
pval=res_5$pvalue,
SE=res_5$lfcSE,
ind_covariate = res_5$baseMean,
effect_size=res_5$log2FoldChange,
test_statistic=res_5$stat,
pzero=rowSums(counts(subset)==0)/ncol(counts(subset))))
# remove genes with NA results or with zero counts in at least 50% of samples
geneExp <- geneExp %>% na.omit() %>% dplyr::filter(pzero < 0.5)
We’ll create a plot to examine the distribution of effect sizes, since the ash method assumes that the distribution of true (unobserved) effect sizes is unimodal.
ggplot(data=geneExp, aes(effect_size)) +
geom_histogram(bins=30)
We’ll also explore how the standard error (used by ash) correlates with the independent covariate (used by methods that incorporate covariates), in order to get an idea of how these pieces of information relate to one another.
ggplot(data=geneExp, aes(x = ind_covariate, y = SE)) +
geom_hex(bins = 100) +
scale_x_continuous(trans="log10") +
xlab("Covariate: Mean gene expression")
Look at covariate diagnostic plots.
strat_hist(geneExp, pvalue="pval", covariate="ind_covariate", maxy=23)
rank_scatter(geneExp, pvalue="pval", covariate="ind_covariate")
## Warning: Removed 4 rows containing non-finite values (stat_binhex).
Run benchmark methods.
sb <- bd %>% buildBench(data=geneExp, parallel = FALSE)
## Due to absence of package REBayes, switching to EM algorithm
## Model selection starts!
## Shrink the set of candidate models if it is too time-consuming.
##   |==================================================| 100%
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
Plot results.
rejections_scatter(sb, supplementary=FALSE)
plotFDRMethodsOverlap(sb, alpha=0.05, nsets=ncol(sb), order.by="freq", decreasing=TRUE, supplementary=FALSE)
First we’ll illustrate the logistic curve used to simulate the informative covariate. We’ll sample values along this curve and use them as probability weights when selecting DE genes. A weakly informative covariate will be generated by adding noise to the strongly informative covariate.
x <- seq(0,10, by = 0.1)
infcov <- 1 / (1+exp(-x + 5))
plot(x, infcov, type = "l", ylab = "informative covariate")
plot(infcov, pmin(1, abs(infcov + rnorm(length(x), 0, 0.25))),
ylab = "weak covariate", xlab = "true covariate")
Next, we’ll analyze random splits of one condition, both with and without the addition of simulated DE genes. Here we’ll create a function that we can use to run one replicate given sample size and number of DE gene settings. This will be looped over many replications and results averaged over them.
#' @param X simulation replicate number
#' @param rseed random seed
#' @param sampleSize number of samples in each condition
#' @param nDE number of DE genes
#' @param bd the bench design object
#' @param uninformativeCovariate logical indicating whether to replace the
#' informative covariate with a randomly generated (uninformative) covariate.
#' Default is FALSE.
#' @param pvalHists logical indicating whether to return histograms of p-values
#' by covariate instead of SB objects
#' @param strongCovariate logical indicating whether to use the strongly
#' informative covariate (TRUE) or a noisy, weakly informative version (FALSE)
simulateOneSplit <- function(X, rseed, nDE, sampleSize,
bd, uninformativeCovariate = FALSE, pvalHists = FALSE,
strongCovariate = TRUE){
# set random seed
set.seed(as.numeric(X)*as.numeric(rseed))
covar <- 1 / (1+exp(-runif(nrow(dds_full), 0, 10) + 5))
# select a random subset of 2*sampleSize samples
dds_test <- dds_full[,sample(1:ncol(dds_full), sampleSize*2)]
# add a fake condition column to coldat
colData(dds_test)$fake <- factor(c(rep("A", sampleSize),
rep("B", sampleSize))[sample(1:(sampleSize*2),
sampleSize*2)])
design(dds_test) <- ~fake
pzero = rowSums(counts(dds_test)==0)/ncol(counts(dds_test))
dds_test <- dds_test[pzero < 0.5,]
covar <- covar[pzero < 0.5]
truth <- rep(FALSE, nrow(dds_test))
# make sure the null comparison is truly null: if the group means of PC1 or
# PC2 differ significantly, reshuffle the sample labels (up to 10 times)
dds_test <- estimateSizeFactors(dds_test)
x <- t(counts(dds_test, normalized=TRUE))
pc <- prcomp(log(x + 0.5), scale.=TRUE)
a1 <- pc$x[colData(dds_test)$fake=="A",1]
b1 <- pc$x[colData(dds_test)$fake=="B",1]
a2 <- pc$x[colData(dds_test)$fake=="A",2]
b2 <- pc$x[colData(dds_test)$fake=="B",2]
p1 <- t.test(a1, b1)$p.value
p2 <- t.test(a2, b2)$p.value
tries <- 0
while((p1 < 0.10 || p2 < 0.10) && tries < 10){
colData(dds_test)$fake <- sample(colData(dds_test)$fake, ncol(dds_test))
x <- t(counts(dds_test, normalized=TRUE))
pc <- prcomp(log(x + 0.5), scale.=TRUE)
a1 <- pc$x[colData(dds_test)$fake=="A",1]
b1 <- pc$x[colData(dds_test)$fake=="B",1]
a2 <- pc$x[colData(dds_test)$fake=="A",2]
b2 <- pc$x[colData(dds_test)$fake=="B",2]
p1 <- t.test(a1, b1)$p.value
p2 <- t.test(a2, b2)$p.value
tries <- tries + 1
}
if(nDE > 0){
# pick random set of nDE genes to add signal to
DE <- sample(1:nrow(dds_test), nDE, prob = covar)
truth[DE] <- TRUE
# randomly sample a log2FC from original FCs (without regard to DE)
counts_new <- counts(dds_test)
log2FC <- rep(0, nrow(dds_test))
if(sampleSize == 5){
log2FC[DE] <- res_5$log2FoldChange[pzero < 0.5][DE]
}else if (sampleSize == 10){
log2FC[DE] <- res_10$log2FoldChange[pzero < 0.5][DE]
}else{
stop("Only sample sizes 5 and 10 are currently supported with pre-",
"computed fold changes for sampling from.")
}
# randomize which condition is shifted up or down
ran <- runif(nrow(dds_test))
refcond <- ifelse(ran < 0.5, "A", "B")
down <- which(ran < 0.5)
counts_new[down,colData(dds_test)$fake==unique(refcond[down])] <-
counts(dds_test)[down, colData(dds_test)$fake==unique(refcond[down])] *
2^log2FC[down]
counts_new[-down,colData(dds_test)$fake==unique(refcond[-down])] <-
counts(dds_test)[-down, colData(dds_test)$fake==unique(refcond[-down])] *
2^log2FC[-down]
counts_new <- apply(counts_new, 2, as.integer)
counts(dds_test) <- counts_new
}
# replace existing size factors
dds_test <- estimateSizeFactors(dds_test)
dds_test <- DESeq(dds_test, parallel = FALSE)
resTEST <- results(dds_test, name="fake_B_vs_A", independentFiltering = FALSE)
geneExp <- tbl_df(data.frame(geneName=rownames(resTEST),
pval=resTEST$pvalue,
SE=resTEST$lfcSE,
ind_covariate = covar,
effect_size = resTEST$log2FoldChange,
test_statistic = resTEST$stat,
qvalue = truth))
if (uninformativeCovariate){
geneExp <- mutate(geneExp, ind_covariate = runif(length(covar)))
}else if(!strongCovariate){
geneExp <- mutate(geneExp, ind_covariate = pmin(1, abs(covar + rnorm(length(covar), 0, 0.25))))
}
geneExp <- geneExp %>% dplyr::filter(!is.na(pval))
if(pvalHists){
return(strat_hist(geneExp, pvalue="pval", covariate="ind_covariate", maxy =10))
}
sb <- bd %>% buildBench(data=geneExp, parallel = FALSE,
truthCols = "qvalue",
ftCols = "ind_covariate")
assayNames(sb) <- "qvalue"
sb <- addDefaultMetrics(sb)
rowData(sb)$log2FC <- geneExp$effect_size
return(sb)
}
We’ll also set some parameters that will be common to all simulations. These include the number of replications, the bench design object, the set of methods to exclude in the results plots, and the alpha cutoff level to be used when plotting the aggregated Upset results.
B <- 100
excludeSet <- c("unadjusted", "bl-df02", "bl-df04", "bl-df05")
ualpha <- 0.05
# only keep one condition for subsetting in simulations that follow
dds_full <- dds_full[,colData(dds_full)$condition == "Snf2"]
Here’s a helper function that returns the number of methods with rejections at a particular alpha level. This helps us decide whether to draw the aggregated upset plot: if fewer than two methods have rejections, the plot throws an error, which is a problem for the null simulations.
# To be included in the upset agg plot, method needs to have found on average
# at least one rejection per replicate. To create an upset plot, require that
# at least two methods rejected at this threshold.
#' @param res standardized metric data.table generated using
#' standardize_results.
#' @param alpha alpha cutoff
#' @param filterSet which methods to exclude from consideration
numberMethodsReject <- function(res, alphacutoff, filterSet){
res <- res %>%
filter(is.na(param.alpha) | (param.alpha == alphacutoff)) %>%
filter(!(blabel %in% filterSet)) %>%
filter(alpha == alphacutoff) %>%
filter(performanceMetric == "rejections") %>%
select(blabel, performanceMetric, value) %>%
group_by(blabel) %>%
summarize(mean_value = mean(value)) %>%
filter(mean_value > 1)
return(nrow(res))
}
Here we’ll repeat the above, but for a null comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly. This will be done for 100 random splits.
rseed <- 225
sampleSize <- 5
nDE <- 0
if (!file.exists(resfile_n5)){
null5 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores)
saveRDS(null5, file = resfile_n5)
}else{
null5 <- readRDS(file = resfile_n5)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(null5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res5 <- plotsim_standardize(null5, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res5, met="rejections",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5, met="rejectprop",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5, met="FWER",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5, met="TNR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
covariateLinePlot(null5, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(null5, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res5, alphacutoff=ualpha, filterSet=excludeSet) >= 2){
aggupset(null5, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
}else{
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we’ll repeat the above, but for a null comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly. This will be done for 100 random splits. Note that here the covariate-aware methods use a covariate that is by definition uninformative (since there are no non-null observations).
rseed <- 837
sampleSize <- 10
nDE <- 0
if (!file.exists(resfile_n10)){
null10 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores)
saveRDS(null10, file = resfile_n10)
}else{
null10 <- readRDS(file = resfile_n10)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(null10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res10 <- plotsim_standardize(null10, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res10, met="rejections",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10, met="rejectprop",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10, met="FWER",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10, met="TNR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
covariateLinePlot(null10, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(null10, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res10, alphacutoff=ualpha, filterSet=excludeSet) >= 2){
aggupset(null10, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
}else{
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
Here we’ll repeat the above, but for a DE comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. We’ll use a strongly informative covariate. This will be done for 100 random splits.
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5)){
de5 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores)
saveRDS(de5, file = resfile_d5)
}else{
de5 <- readRDS(file = resfile_d5)
}
# pvalue histograms
plotfile <- file.path(datdir, "de5_pvalhists.pdf")
if (!file.exists(plotfile)){
hists <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd,
pvalHists = TRUE, mc.cores = nCores)
pdf(plotfile, width=8, height=4)
for(i in 1:length(hists)){
print(hists[[i]])
}
dev.off()
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res5d, met="rejections",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="FDR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="TPR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de5, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de5, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res5d, alphacutoff=ualpha, filterSet=excludeSet) >= 2){
aggupset(de5, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
}else{
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we’ll repeat the above, but for a DE comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. We’ll use a strongly informative covariate. This will be done for 100 random splits.
rseed <- 961
sampleSize <- 10
nDE <- 2000
if(!file.exists(resfile_d10)){
de10 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores)
saveRDS(de10, file = resfile_d10)
}else{
de10 <- readRDS(file = resfile_d10)
}
# pvalue histograms
plotfile <- file.path(datdir, "de10_pvalhists.pdf")
if (!file.exists(plotfile)){
hists <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd,
pvalHists = TRUE, mc.cores = nCores)
pdf(plotfile, width=8, height=4)
for(i in 1:length(hists)){
print(hists[[i]])
}
dev.off()
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res10d, met="rejections",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="FDR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="TPR",filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de10, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de10, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res10d, alphacutoff=ualpha, filterSet=excludeSet) >= 2){
aggupset(de10, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
}else{
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we’ll repeat the 5 versus 5 DE comparison, but using the weakly informative covariate (the strong covariate with noise added).
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5_w)){
de5 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores,
strongCovariate = FALSE)
saveRDS(de5, file = resfile_d5_w)
} else {
de5 <- readRDS(file = resfile_d5_w)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res5d, met="rejections", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="FDR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="TPR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de5, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de5, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res5d, alphacutoff=ualpha, filterSet=excludeSet) >= 2) {
aggupset(de5, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
} else {
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we’ll repeat the above for the 10 versus 10 comparison, again using the weakly informative covariate (one that has noise added to it).
rseed <- 961
sampleSize <- 10
nDE <- 2000
if (!file.exists(resfile_d10_w)) {
de10 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd, mc.cores=nCores,
strongCovariate = FALSE)
saveRDS(de10, file = resfile_d10_w)
} else {
de10 <- readRDS(file = resfile_d10_w)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res10d, met="rejections", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="FDR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="TPR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de10, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de10, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res10d, alphacutoff=ualpha, filterSet=excludeSet) >= 2) {
aggupset(de10, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
} else {
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we repeat the previous section, using an uninformative (randomly generated) covariate. We start with the DE comparison of 5 versus 5 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. This will be done for 100 random splits.
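By contrast with the informative settings, an uninformative covariate is drawn independently of DE status, so it carries no signal that covariate-aware methods could exploit. A minimal sketch (hypothetical; the actual draw happens inside `simulateOneSplit()` when `uninformativeCovariate = TRUE`):

```r
## Hypothetical sketch: an uninformative covariate is simply a random
## draw that is independent of DE status by construction.
set.seed(7)
m     <- 5000
is_de <- seq_len(m) <= 2000        # 2000 "DE" genes
uninf <- runif(m)                  # drawn without reference to is_de
## its correlation with DE status should be near zero
cor(uninf, as.numeric(is_de))
```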
rseed <- 198
sampleSize <- 5
nDE <- 2000
if (!file.exists(resfile_d5_uninfCov)) {
de5 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd,
uninformativeCovariate = TRUE, mc.cores=nCores)
saveRDS(de5, file = resfile_d5_uninfCov)
} else {
de5 <- readRDS(file = resfile_d5_uninfCov)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de5, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res5d <- plotsim_standardize(de5, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res5d, met="rejections", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="FDR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met="TPR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res5d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de5, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de5, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res5d, alphacutoff=ualpha, filterSet=excludeSet) >= 2) {
aggupset(de5, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
} else {
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we’ll repeat the above, but for a DE comparison of 10 versus 10 Snf2 samples, where the groups are selected randomly and 2000 DE genes are added. This will again be done for 100 random splits.
rseed <- 961
sampleSize <- 10
nDE <- 2000
if (!file.exists(resfile_d10_uninfCov)) {
de10 <- mclapply(X=1:B, FUN=simulateOneSplit, rseed=rseed,
nDE=nDE, sampleSize=sampleSize, bd=bd,
uninformativeCovariate = TRUE, mc.cores=nCores)
saveRDS(de10, file = resfile_d10_uninfCov)
} else {
de10 <- readRDS(file = resfile_d10_uninfCov)
}
Plot results.
# Check for missing results (if any methods threw an error for relevant metrics).
rowSums(sapply(de10, function(x) colSums(is.na(assays(x)$qvalue)) > 0))
## unadjusted bonf bh qvalue ihw-a01 ihw-a02
## 0 0 0 0 0 0
## ihw-a03 ihw-a04 ihw-a05 ihw-a06 ihw-a07 ihw-a08
## 0 0 0 0 0 0
## ihw-a09 ihw-a10 ashq bl-df02 bl-df03 bl-df04
## 0 0 0 0 0 0
## bl-df05 lfdr adapt-glm fdrreg-t fdrreg-e
## 0 0 0 0 0
res10d <- plotsim_standardize(de10, alpha = seq(0.01, 0.10, 0.01))
plotsim_average(res10d, met="rejections", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="FDR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met="TPR", filter_set = excludeSet,
merge_ihw = TRUE, errorBars=TRUE)
plotsim_average(res10d, met=c("FPR", "TPR"), filter_set = excludeSet,
merge_ihw = TRUE)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
covariateLinePlot(de10, alpha=0.05, covname="log2FC", nbins=25,
trans="log1p")
covariateLinePlot(de10, alpha=0.05, covname="ind_covariate", nbins=25,
trans="log1p")
if (numberMethodsReject(res10d, alphacutoff=ualpha, filterSet=excludeSet) >= 2) {
aggupset(de10, alpha=ualpha, supplementary = FALSE, return_list = FALSE)
} else {
message("Not enough methods found rejections at alpha ", ualpha,
"; skipping upset plot")
}
Here we compare the method ranks across the different sample sizes and covariate informativeness settings at alpha = 0.10 (column labels: S = strong, W = weak, UI = uninformative covariate).
plotMethodRanks(c(resfile_d5, resfile_d5_w, resfile_d10, resfile_d10_w,
resfile_d5_uninfCov, resfile_d10_uninfCov),
colLabels = c("5 S", "5 W", "10 S", "10 W",
"5 UI", "10 UI"),
alpha = 0.10, xlab = "Comparison",
excludeMethods = NULL)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] adaptMT_1.0.0 FDRreg_0.2-1
## [3] mvtnorm_1.0-8 BayesLogit_0.6
## [5] fda_2.4.7 Matrix_1.2-14
## [7] fdrtool_1.2.15 swfdr_1.6.0
## [9] qvalue_2.12.0 ashr_2.2-7
## [11] IHW_1.8.0 hexbin_1.27.2
## [13] bindrcpp_0.2.2 R.utils_2.6.0
## [15] R.oo_1.22.0 R.methodsS3_1.7.1
## [17] ggthemes_3.5.0 tibble_1.4.2
## [19] DESeq2_1.20.0 purrr_0.2.5
## [21] cowplot_0.9.2 magrittr_1.5
## [23] readr_1.1.1 data.table_1.11.4
## [25] SummarizedBenchmark_0.99.1 dplyr_0.7.5
## [27] mclust_5.4 ggplot2_3.0.0
## [29] stringr_1.3.1 rlang_0.2.1
## [31] UpSetR_1.3.3 SummarizedExperiment_1.10.1
## [33] DelayedArray_0.6.1 BiocParallel_1.14.2
## [35] matrixStats_0.53.1 Biobase_2.40.0
## [37] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
## [39] IRanges_2.14.10 S4Vectors_0.18.3
## [41] BiocGenerics_0.26.0 tidyr_0.8.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.3-2 rprojroot_1.3-2 htmlTable_1.12
## [4] XVector_0.20.0 ggdendro_0.1-20 base64enc_0.1-3
## [7] rstudioapi_0.7 bit64_0.9-7 AnnotationDbi_1.42.1
## [10] mosaic_1.2.0 codetools_0.2-15 mnormt_1.5-5
## [13] pscl_1.5.2 doParallel_1.0.11 geneplotter_1.58.0
## [16] knitr_1.20 Formula_1.2-3 broom_0.4.4
## [19] annotate_1.58.0 cluster_2.0.7-1 compiler_3.5.0
## [22] backports_1.1.2 assertthat_0.2.0 lazyeval_0.2.1
## [25] acepack_1.4.1 htmltools_0.3.6 tools_3.5.0
## [28] gtable_0.2.0 glue_1.2.0 GenomeInfoDbData_1.1.0
## [31] reshape2_1.4.3 Rcpp_0.12.17 slam_0.1-43
## [34] nlme_3.1-137 iterators_1.0.9 psych_1.8.4
## [37] mosaicCore_0.5.0 XML_3.98-1.11 zlibbioc_1.26.0
## [40] MASS_7.3-50 scales_0.5.0 hms_0.4.2
## [43] RColorBrewer_1.1-2 yaml_2.1.19 mosaicData_0.16.0
## [46] memoise_1.1.0 gridExtra_2.3 rpart_4.1-13
## [49] latticeExtra_0.6-28 stringi_1.2.2 RSQLite_2.1.1
## [52] SQUAREM_2017.10-1 genefilter_1.62.0 foreach_1.4.4
## [55] checkmate_1.8.5 truncnorm_1.0-8 pkgconfig_2.0.1
## [58] bitops_1.0-6 evaluate_0.10.1 lattice_0.20-35
## [61] lpsymphony_1.8.0 bindr_0.1.1 htmlwidgets_1.2
## [64] labeling_0.3 bit_1.1-14 tidyselect_0.2.4
## [67] ggformula_0.7.0 plyr_1.8.4 R6_2.2.2
## [70] Hmisc_4.1-1 DBI_1.0.0 pillar_1.2.3
## [73] foreign_0.8-70 withr_2.1.2 survival_2.41-3
## [76] RCurl_1.95-4.10 nnet_7.3-12 rmarkdown_1.10
## [79] locfit_1.5-9.1 grid_3.5.0 blob_1.1.1
## [82] digest_0.6.15 xtable_1.8-2 munsell_0.4.3