The SARTools R package which generated this report has been developped at PF2 - Institut Pasteur by M.-A. Dillies and H. Varet (). Thanks to cite H. Varet, L. Brillet-Guéguen, J.-Y. Coppee and M.-A. Dillies, SARTools: A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data, PLoS One, 2016, doi: http://dx.doi.org/10.1371/journal.pone.0157022 when using this tool for any analysis published.

1 Introduction

The analyses reported in this document are part of the G4PLAST1 cDNA+ncRNA project. The aim is to find features that are differentially expressed between WTMS2 and WTN50. The statistical analysis process includes data normalization, graphical exploration of raw and normalized data, test for differential expression for each feature between the conditions, raw p-value adjustment and export of lists of features having a significant differential expression between the conditions. In this analysis, the batch effect will be taken into account in the statistical models.

The analysis is performed using the R software [1], Bioconductor [2] packages including edgeR [3] and the SARTools package developed at PF2 - Institut Pasteur. Normalization and differential analysis are carried out according to the edgeR model and package. This report comes with additional tab-delimited text files that contain lists of differentially expressed features.

For more details about the edgeR methodology, please refer to its related publications [3–6].

2 Description of raw data

The count data files and associated biological conditions are listed in the following table.

Table 1: Data files and associated biological conditions.
label files group batch
wt_ms2_1 WT_MS2_1_sorted.count WTMS2 day1
wt_ms2_2 WT_MS2_2_sorted.count WTMS2 day2
wt_ms2_3 WT_MS2_3_sorted.count WTMS2 day3
wt_n50_1 WT_N50_1_sorted.count WTN50 day1
wt_n50_2 WT_N50_2_sorted.count WTN50 day2
wt_n50_3 WT_N50_3_sorted.count WTN50 day3

After loading the data we first have a look at the raw data table itself. The data table contains one row per annotated feature and one column per sequenced sample. Row names of this table are feature IDs (unique identifiers). The table contains raw count values representing the number of reads that map onto the features. For this project, there are 33309 features in the count data table.

Table 2: Partial view of the count data table.
wt_ms2_1 wt_ms2_2 wt_ms2_3 wt_n50_1 wt_n50_2 wt_n50_3
AT1G01010.1 111 90 104 550 374 431
AT1G01020.1 227 256 282 384 229 254
AT1G01030.2 117 168 151 192 83 72
AT1G01040.1 886 1155 1171 2061 780 1015
at1g01046 6 5 8 35 7 22
AT1G01050.1 1129 1241 1212 1846 1175 1392

Looking at the summary of the count table provides a basic description of these raw counts (min and max values, median, etc).

Table 3: Summary of the raw counts.
Min. 1st Qu. Median Mean 3rd Qu. Max.
wt_ms2_1 0 0 59 1202 401 7470590
wt_ms2_2 0 0 69 1380 481 8165762
wt_ms2_3 0 0 65 1459 473 9713009
wt_n50_1 0 1 111 1406 719 11245438
wt_n50_2 0 1 70 950 466 8312708
wt_n50_3 0 1 80 1070 517 8480469

Figure 1 shows the total number of mapped and counted reads for each sample. We expect total read counts to be similar within conditions, they may be different across conditions. Total counts sometimes vary widely between replicates. This may happen for several reasons, including:

  • different rRNA contamination levels between samples (even between biological replicates);
  • slight differences between library concentrations, since they may be difficult to measure with high precision.
Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.

Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.

Figure 2 shows the percentage of features with no read count in each sample. We expect this percentage to be similar within conditions. Features with null read counts in the 6 samples will not be taken into account for the analysis with edgeR. Here, 5338 features (16.03%) are in this situation (dashed line).

Figure 2: Percentage of features with null read counts in each sample.

Figure 2: Percentage of features with null read counts in each sample.

Figure 3 shows the distribution of read counts for each sample (on a log scale to improve readability). Again we expect replicates to have similar distributions. In addition, this figure shows if read counts are preferably low, medium or high. This depends on the organisms as well as the biological conditions under consideration.

Figure 3: Density distribution of read counts.

Figure 3: Density distribution of read counts.

It may happen that one or a few features capture a high proportion of reads (up to 20% or more). This phenomenon should not influence the normalization process. The edgeR normalization has proved to be robust to this situation [7]. Anyway, we expect these high count features to be the same across replicates. They are not necessarily the same across conditions. Figure 4 and table 4 illustrate the possible presence of such high count features in the data set.

Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.

Figure 4: Percentage of reads associated with the sequence having the highest count (provided in each box on the graph) for each sample.

Table 4: Percentage of reads associated with the sequences having the highest counts.
ATCG00020.1 ATCG00490.1 ATCG00340.1 AT2G41310.1 AT1G05163.1
wt_ms2_1 18.65 12.12 1.77 0.92 0.69
wt_ms2_2 17.77 11.73 1.89 0.84 0.60
wt_ms2_3 19.98 12.57 1.84 0.76 0.55
wt_n50_1 24.02 1.86 0.61 2.60 1.87
wt_n50_2 26.26 1.82 0.31 3.01 2.12
wt_n50_3 23.79 1.54 0.28 3.11 2.31

We may wish to assess the similarity between samples across conditions. A pairwise scatter plot is produced (figure 5) to show how replicates and samples from different biological conditions are similar or different (using a log scale). Moreover, as the Pearson correlation has been shown not to be relevant to measure the similarity between replicates, the SERE statistic has been proposed as a similarity index between RNA-Seq samples [8]. It measures whether the variability between samples is random Poisson variability or higher. Pairwise SERE values are printed in the lower triangle of the pairwise scatter plot. The value of the SERE statistic is:

  • 0 when samples are identical (no variability at all: this may happen in the case of a sample duplication);

  • 1 for technical replicates (technical variability follows a Poisson distribution);

  • greater than 1 for biological replicates and samples from different biological conditions (biological variability is higher than technical one, data are over-dispersed with respect to Poisson). The higher the SERE value, the lower the similarity. It is expected to be lower between biological replicates than between samples of different biological conditions. Hence, the SERE statistic can be used to detect inversions between samples.

Figure 5: Pairwise comparison of samples (not produced when more than 12 samples).

Figure 5: Pairwise comparison of samples (not produced when more than 12 samples).

3 Filtering low counts

edgeR suggests to filter features with null or low counts because they do not supply much information. For this project, 5338 features (16.03%) have been removed from the analysis because they did not satisfy the following condition: having at least 0 counts-per-million in at least 3 samples.

4 Variability within the experiment: data exploration

The main variability within the experiment is expected to come from biological differences between the samples. This can be checked in two ways. The first one is to perform a hierarchical clustering of the whole sample set. This is performed after a transformation of the count data as moderated log-counts-per-million. Figure 6 shows the dendrogram obtained from CPM data. An euclidean distance is computed between samples, and the dendrogram is built upon the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.

Figure 6: Sample clustering based on normalized data.

Figure 6: Sample clustering based on normalized data.

Another way of visualizing the experiment variability is to look at the first two dimensions of a multidimensional scaling plot, as shown on figure 7. On this figure, the first dimension is expected to separate samples from the different biological conditions, meaning that the biological variability is the main source of variance in the data.

Figure 7: Multidimensional scaling plot of the samples.

Figure 7: Multidimensional scaling plot of the samples.

For the statistical analysis, we need to take into account the effect of the batch parameter. Statistical models and tests will thus be adjusted on it.

5 Normalization

Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by edgeR is called Trimmed Mean of M-values (TMM) but it is also possible to use the RLE (DESeq) or upperquartile normalizations. It relies on the hypothesis that most features are not differentially expressed.

edgeR computes a factor for each sample. These normalization factors apply to the total number of counts and cannot be used to normalize read counts in a direct manner. Indeed, normalization factors are used to normalize total counts. These in turn are used to normalize read counts according to a total count normalization: if \(N_j\) is the total number of reads of the sample \(j\) and \(f_j\) its normalization factor, \(N'_j=f_j \times N_j\) is the normalized total number of reads. Then, let \(s_j=N'_j/\bar{N'}\) with \(\bar{N'}\) the mean of the \(N'_j\) s. Finally, the normalized counts of the sample \(j\) are defined as \(x'_{ij}=x_{ij}/s_j\) where \(i\) is the gene index.

Table 5: Normalization factors.
wt_ms2_1 wt_ms2_2 wt_ms2_3 wt_n50_1 wt_n50_2 wt_n50_3
TMM normalization factors 0.82 0.85 0.79 1.29 1.19 1.2

Boxplots are often used to assess the quality of the normalization process, as they show how distributions are globally affected during this process. We expect normalization to stabilize distributions across samples. Figure 8 shows boxplots of raw (left) and normalized (right) data respectively.

Figure 8: Boxplots of raw (left) and normalized (right) read counts.

Figure 8: Boxplots of raw (left) and normalized (right) read counts.

6 Differential analysis

6.1 Modelization

edgeR aims at fitting one linear model per feature. For this project, the design used is ~ batch + group and the goal is to estimate the models’ coefficients which can be interpreted as \(\log_2(\texttt{FC})\). These coefficients will then be tested to get p-values and adjusted p-values.

6.2 Dispersions estimation

The edgeR model assumes that the count data follow a negative binomial distribution which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data.

Figure 9: Dispersion estimates.

Figure 9: Dispersion estimates.

Figure 9 shows the result of the dispersion estimation step. The x- and y-axes represent the mean count value and the estimated dispersion respectively. Black dots represent empirical dispersion estimates for each feature (from the observed count values). The blue curve shows the relationship between the means of the counts and the dispersions modeled with splines. The red segment represents the common dispersion.

6.3 Statistical test for differential expression

Once the dispersion estimation and the model fitting have been done, edgeR can perform the statistical testing. Figure 10 shows the distributions of raw p-values computed by the statistical test for the comparison(s) done. This distribution is expected to be a mixture of a uniform distribution on \([0,1]\) and a peak around 0 corresponding to the differentially expressed features.

Figure 10: Distribution(s) of raw p-values.

Figure 10: Distribution(s) of raw p-values.

6.4 Final results

A p-value adjustment is performed to take into account multiple testing and control the false positive rate to a chosen level \(\alpha\). For this analysis, a BH p-value adjustment was performed [9,10] and the level of controlled false positive rate was set to 0.05.

Table 6: Number of up-, down- and total number of differentially expressed features for each comparison.
Test vs Ref # down # up # total
WTN50 vs WTMS2 8293 8844 17137

Figure 11 represents the MA-plot of the data for the comparisons done, where differentially expressed features are highlighted in red. A MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features having a too low/high \(\log_2(\text{FC})\) to be displayed on the plot.

Figure 11: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 11: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 12 shows the volcano plots for the comparisons performed and differentially expressed features are still highlighted in red. A volcano plot represents the log of the adjusted P value as a function of the log ratio of differential expression.

Figure 12: Volcano plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Figure 12: Volcano plot(s) of each comparison. Red dots represent significantly differentially expressed features.

Full results as well as lists of differentially expressed features are provided in the following text files which can be easily read in a spreadsheet. For each comparison:

  • TestVsRef.complete.txt contains results for all the features;
  • TestVsRef.up.txt contains results for up-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one;
  • TestVsRef.down.txt contains results for down-regulated features. Features are ordered from the most significant adjusted p-value to the less significant one.

These files contain the following columns:

  • Id: unique feature identifier;
  • sampleName: raw counts per sample;
  • norm.sampleName: rounded normalized counts per sample;
  • baseMean: base mean over all samples;
  • WTMS2 and WTN50: means (rounded) of normalized counts of the biological conditions;
  • FoldChange: fold change of expression, calculated as \(2^{\log_2(\text{FC})}\);
  • log2FoldChange: \(\log_2(\text{FC})\) as estimated by the GLM model. It reflects the differential expression between Test and Ref and can be interpreted as \(\log_2(\frac{\text{Test}}{\text{Ref}})\). If this value is:
    • around 0: the feature expression is similar in both conditions;
    • positive: the feature is up-regulated (\(\text{Test} > \text{Ref}\));
    • negative: the feature is down-regulated (\(\text{Test} < \text{Ref}\));
  • pvalue: raw p-value from the statistical test;
  • padj: adjusted p-value on which the cut-off \(\alpha\) is applied;
  • tagwise.dispersion: dispersion parameter estimated from feature counts (i.e. black dots on figure 9);
  • trended.dispersion: dispersion parameter estimated with splines (i.e. blue curve on figure 9).

7 R session information and parameters

The versions of the R software and Bioconductor packages used for this analysis are listed below. It is important to save them if one wants to re-perform the analysis in the same conditions.

  • R version 3.6.3 (2020-02-29), x86_64-pc-linux-gnu
  • Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=fr_FR.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=fr_FR.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=fr_FR.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=fr_FR.UTF-8, LC_IDENTIFICATION=C
  • Running under: Ubuntu 18.04.4 LTS
  • Random number generation:
  • RNG: Mersenne-Twister
  • Normal: Inversion
  • Sample: Rounding
  • Matrix products: default
  • BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
  • LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
  • Base packages: base, datasets, graphics, grDevices, methods, parallel, stats, stats4, utils
  • Other packages: Biobase 2.46.0, BiocGenerics 0.32.0, BiocParallel 1.20.1, DelayedArray 0.12.3, DESeq2 1.26.0, edgeR 3.28.1, GenomeInfoDb 1.22.1, GenomicRanges 1.38.0, ggplot2 3.3.2, IRanges 2.20.2, kableExtra 1.1.0, limma 3.42.2, matrixStats 0.56.0, S4Vectors 0.24.4, SARTools 1.7.3, SummarizedExperiment 1.16.1
  • Loaded via a namespace (and not attached): acepack 1.4.1, annotate 1.64.0, AnnotationDbi 1.48.0, backports 1.1.8, base64enc 0.1-3, bit 1.1-15.2, bit64 0.9-7.1, bitops 1.0-6, blob 1.2.1, checkmate 2.0.0, cluster 2.1.0, colorspace 1.4-1, compiler 3.6.3, crayon 1.3.4, data.table 1.12.8, DBI 1.1.0, digest 0.6.25, dplyr 1.0.0, ellipsis 0.3.1, evaluate 0.14, farver 2.0.3, foreign 0.8-75, Formula 1.2-3, genefilter 1.68.0, geneplotter 1.64.0, generics 0.0.2, GenomeInfoDbData 1.2.2, GGally 2.0.0, ggdendro 0.1-20, ggrepel 0.8.2, glue 1.4.1, grid 3.6.3, gridExtra 2.3, gtable 0.3.0, highr 0.8, Hmisc 4.4-0, hms 0.5.3, htmlTable 2.0.1, htmltools 0.5.0, htmlwidgets 1.5.1, httr 1.4.2, jpeg 0.1-8.1, knitr 1.29, labeling 0.3, lattice 0.20-41, latticeExtra 0.6-29, lifecycle 0.2.0, locfit 1.5-9.4, magrittr 1.5, MASS 7.3-51.6, Matrix 1.2-18, memoise 1.1.0, munsell 0.5.0, nnet 7.3-14, pillar 1.4.6, pkgconfig 2.0.3, plyr 1.8.6, png 0.1-7, purrr 0.3.4, R6 2.4.1, RColorBrewer 1.1-2, Rcpp 1.0.5, RCurl 1.98-1.2, readr 1.3.1, reshape 0.8.8, rlang 0.4.7, rmarkdown 2.3, rpart 4.1-15, RSQLite 2.2.0, rstudioapi 0.11, rvest 0.3.5, scales 1.1.1, splines 3.6.3, stringi 1.4.6, stringr 1.4.0, survival 3.1-12, tibble 3.0.3, tidyselect 1.1.0, tools 3.6.3, vctrs 0.3.2, viridisLite 0.3.0, webshot 0.5.2, withr 2.2.0, xfun 0.15, XML 3.99-0.3, xml2 1.3.2, xtable 1.8-4, XVector 0.26.0, yaml 2.2.1, zlibbioc 1.32.0

Parameter values used for this analysis are:

  • workDir: /home/cecile/partage/F/G4PLAST1/sartools_WT_N50_vs_WT_MS2
  • projectName: G4PLAST1 cDNA+ncRNA
  • author: Cécile
  • targetFile: target_file.txt
  • rawDir: /home/cecile/partage/F/G4PLAST1/count
  • featuresToRemove: alignment_not_unique, ambiguous, no_feature, not_aligned, too_low_aQual
  • varInt: group
  • condRef: WTMS2
  • batch: batch
  • alpha: 0.05
  • pAdjustMethod: BH
  • cpmCutoff: 0
  • gene.selection: pairwise
  • normalizationMethod: TMM
  • colors: #f3c300, #875692, #f38400, #a1caf1, #be0032, #c2b280, #848482, #008856, #e68fac, #0067a5, #f99379, #604e97

Bibliography

1. R Core Team. R: A language and environment for statistical computing. Vienna, Austria : R Foundation for Statistical Computing, 2017 :

2. Gentleman RC, Carey VJ, Bates DM et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004 ; 5: R80.

3. Robinson M, McCarthy D, Smyth G. EdgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010 ; 26: 139.

4. Robinson M, Smyth G. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 2007 ; 23: 2881.

5. Robinson M, Smyth G. Small-sample estimation of negative binomial dispersion, with applications to sage data. Biostatistics 2008 ; 9: 321.

6. McCarthy D, Chen Y, Smyth G. Differential expression analysis of multifactor rna-seq experiments with respect to biological variation. Nucleic Acids Research 2012 ; 40: 4288.

7. Dillies M-A, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Briefings in Bioinformatics 2013 ; 14: 671.

8. Schulze SK, Kanwar R, Gölzenleuchter M et al. SERE: Single-parameter quality control and sample comparison for rna-seq. BMC Genomics 2012 ; 13: 524.

9. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 1995 ; 57: 289–300.

10. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 2001 ; 29: 1165–1188.