## Review for "Transformation and model choice for RNA-seq co-expression analysis"

Completed on 4 Aug 2016 by Wolfgang Huber . Sourced from http://biorxiv.org/content/early/2016/07/24/065607.

Author response is in blue.

I was asked to review this manuscript for a journal, and decided to share the review here.
---

Our thanks to Dr. Huber for these insightful comments. We have updated the manuscript according to his recommendations (as well as those of two other anonymous reviewers). For reference, here are our responses to the points raised by Dr. Huber.

The work is concerned with clustering of gene expression profiles from RNA-seq data, using data transformation and Gaussian mixture models. The manuscript is well-written and of high technical quality. I have a few suggestions for improving it further.

Main points:

1.) The transformation to p_ij and then to g_arcsin, g_logit is interesting, and worthwhile considering. However, as the authors note on p.4, there are also other, obvious, and well-established candidates for transformations, such as log(n+c), VST, rlog, moderated CPM (all should be followed by mean centering). Since the title of the ms is "Transformation and model choice..." I would consider it important to include these in the study. One possible result could be that "it doesn't matter very much", or that one or another of these candidates really does poorly; in any case, this would be interesting for readers, as the choice of transformation often creates a good deal of anxiety.

We thank the reviewer for this very relevant and interesting remark. To follow-up on this suggestion, we made the following change to the Discussion section of the manuscript in the newly added “Choice of transformation for co-expression analysis" subsection:

We have focused the majority of our discussion here on the use of (arcsin- or logit-transformed) normalized profiles to identify groups of co-expressed genes. As previously noted, a variety of well-established candidates for transformations have already been proposed for RNA-seq data, including the $\log(\cdot + c)$ (for a constant $c$), VST, CPM, and rlog. As suggested by a reviewer, we could consider the alternative approach of clustering RNA-seq data after applying one of these transformations and mean-centering per gene; this latter step would remove the spread in values observed, for example, in the log-transformed (and uncentered) data in Figure 1B and ensure that the data to be clustered are independent of the absolute expression levels.

To investigate this idea, we fit Gaussian mixture models to the centered $\log$, VST, CPM, and rlog transformed data for the Fietz et al. [29] mouse data with $K=2, ..., 40$ clusters using the same parameters as for the \texttt{coseq} analysis. For all four mean-centered transformations across all clusters, the most general covariance model form (the so-called $[p_kL_kC_k]$ model, with variable proportions, volume, orientation, and shape) systematically resulted in estimation errors via the Rmixmod package. When considering a larger variety of general covariance model forms (the $[p_kLC_k]$, $[p_kL_kC]$, $[p_kLC]$ forms, which all have variable proportions and respectively feature equal volume, equal orientation/shape, and equal volume/orientation/shape), the same estimation errors were observed for all centered transformations across all clusters. We were in fact only able to estimate the Gaussian mixture model for all four centered transformations when restricting the covariance model forms to be spherical or diagonal (which roughly corresponds to applying a K-means type algorithm); as such, this implies that when using these alternative transformations, complex covariance structures among samples (such as those observed in Figure 3) must be assumed to be negligeable (i.e., conditional independence given the clustering component).

2.) The arguments for Gaussian mixture models (GMM, e.g. p.16) are well taken, but are a bit old-fashioned, "20th century". Nowadays, there also very good resampling based methods for assessing cluster stability, cluster membership, etc. See, e.g., the clue package on CRAN, or the "Cluster Stability Analysis" Section in the vignette of my Hiiragi2013 Bioconductor package. Adding these methods would be interesting, although I could understand if the authors decide it is out of scope. In that case, I suggest that at least the claims on exclusive utility of GMMs for doing such stability assessments be toned down.

We thank the reviewer for pointing out these references; although we decided that adding them to the manuscript was beyond its scope, we did temper our arguments about the assessment of GMMs for cluster stability analyses. In particular, we adapted the text in the newly added Discussion subsection “Gaussian mixture models for co-expression analysis" as follows:

However, it should be noted that such assessments of cluster stability are not unique to Gaussian mixture models. In recent years, several resampling-based methods for assessing cluster stability and membership have been proposed, including the clue R package and ConsensusClusterPlus Bioconductor package; for example, Ohnisi et al. applied K-means clustering to resampled microarray data, constructed a consensus clustering using clue, and compared each of the resampled clusterings to the consensus to identify stable and robust clusters.

In addition, we slightly rearranged the discussion so that the third-to-last paragraph reads as follows:

Many alternative clustering strategies exist based on different algorithms (e.g., K-means and hierarchical clustering), distance measures calculated among pairs of genes (e.g., Euclidean distance, correlation, etc), and techniques for identifying the number of clusters (e.g., the Dynamic Tree Cut method for dendrograms). The difficulty of comparing clusterings arising from different approaches is well-known, and it is rarely straightforward to establish the circumstances under which a given strategy may be preferred. One possibility that may be of interest in practice is to analyze cluster ensembles arising from a set of different methods to assess the agreement or dissimilarity among partitions and obtain a consensus clustering.

Smaller points:

3.) I very much appreciate the provision of an Rmarkdown vignette reproducing all plots in the paper. This is exactly how it should be done. Here two more suggestions, which go beyond with what is required for journal publication, but would greatly increase the impact and quality of this research: To allow execution by readers, I recommend also providing the .Rmd file, not only the rendered PDF. Moreover, to avoid 'code rot' and other reproducibility issues, I recommend submitting the Rmarkdown document (e.g. in the form of a package vignette) to a repository with a build system, such as CRAN or Bioconductor, which will make sure the code actually runs on any regular computer (no dependencies on private files), with current versions of R, etc.

We have added the source .Rnw file that produces the rendered PDF containing the analysis plots to the Supplementary Materials [NB: only for the submitted version, as bioRxiv does not accept this file extension]. With respect to submitting the full Rmarkdown document to a build system, one difficulty is the relatively long computational time (on the order of a few hours) for a full run of coseq across a large number of models. We will submit an executable vignette with the package for inclusion within Bioconductor for a small subset of data for illustrative purposes, but it does not seem feasible to do so with the full datasets considered here.

4.) p.4 "As previously noted, each of these transformations seeks to render the data homoskedastic" -- I do not think this is correct. Homoskedasticity is the stated goal of the variance-stablising transformation, but not of the others.

We thank the reviewer for noting this oversight, and have modified the text of the manuscript as follows:

(Introduction) Several transformations of read counts or pseudocounts have been proposed in the context of exploratory or differential analyses, but most largely seek to render the data homoskedastic or to reduce skewness.

(Methods: Data transformations for RNA-seq co-expression) As previously noted, each of these transformations was proposed with the objective of rendering the data homoskedastic (in the case of the VST or regularized log transformations) or to reduce the orders of magnitude spanned by untransformed RNA-seq data.

5.) p.4 "... but does not facilitate clustering together features with similar patterns of expression across experiments." -- where is the evidence for this claim? (cf. Point 1 above)

Thank you for this remark; this claim was indeed unsupported in the manuscript, and we have removed this phrase.

6.) p.4 Notation overload in the equation for p_ij: the symbol j on the right hand side is used for two different things

This equation has been revised as follows (note that a constant in both the numerator and denominator was mistakenly omitted in the first submission):
$\tcb{p_{ij} = \frac{y_{ij}/s_{j}+1}{\sum_\ell y_{i\ell}/s_{\ell}+1}.}$
We have also revised the equation just preceding this one to fix the same notation overload, now labeled Equation (1):
$\tcb{s_j = \frac{\ell_j}{\sum_{m=1}^q \ell_m/q}}$

7.) p.5 "becomes even more apparent when considering the normalized expression profiles p_ij (Figure 1C)." -- is this not a circular argument if Cluster 1 was itself obtained from the p_ij?

We thank the reviewer for this comment; this is indeed a circular argument. Figure 1 was primarily intended to illustrate the importance of using a measure that is independent of the absolute expression level of each gene when performing co-expression analyses, and not to confirm or validate the clusters identified when using $p_{ij}$ values. We have thus revised the text as follows (and modified Figure 1 accordingly) to avoid this circular reasoning:

To illustrate the interest of using these normalized expression profiles for co-expression analysis, we plot $y_{ij}/s_j$, $\log (y_{ij}/s_j+1)$, and the normalized expression profiles $p_{ij}$ in Figure 1 for a subset of genes from the mouse RNA-seq data studied by Fietz et al. In particular, we consider ten genes that are most representative (as measured by Euclidean distance) of four distinct groups: non-differentially expressed (NDE) genes across all samples (Group 1); genes expressed only in the last experimental condition (samples 11 to 15, Group 2); genes expressed only in the first experimental condition (samples 1 to 5, Group 3); and genes expressed only in the second experimental condition (samples 6 to 10, Group 4). It may clearly be seen that the large differences in magnitude that are dominant for normalized counts (Figure 1A) are greatly reduced by a log-transformation (Figure 1B), although a certain amount of spread remains between very highly and weakly expressed genes. This spread can be notably reduced by considering the normalized expression profiles $p_{ij}$ (Figure 1C). This example is thus instructive in illustrating the importance in co-expression analyses of considering a measure that is independent of the absolute expression level of the genes, as is the case for the normalized profiles.

8.) p.5 "This means that the vector of values p_i are linearly dependent ... For this reason, we consider two separate transformations of the profiles pij to break the sum constraint ..." -- Even though the sum constraint is replaced by a more complicated constraint, the dependency is not broken. Is this argument really tenable (or needed)?

As correctly noted by the reviewer, following arcsin or logit transformation of the $p_{ij}$ values, the sum constraint is indeed simply replaced by a more complicated constraint. That being said, we feel that it is important to include this remark in the text as fitting a Gaussian mixture model to linearly dependent variables (such as the untransformed $p_{ij}$) leads to singular covariance matrices for each component; this problem is no longer an issue for variables that are non-linearly dependent.

9.) p.5 What are the values (used by the software) for g_logit for p_ij = 0 or 1? Both can and will happen in practice.

We thank the reviewer for this comment, as this reflects a typo in our initial submission; in our method, a constant was added to the normalized counts $y_{ij}/s_{j}$ prior to calculating the profiles $p_{ij}$, and the equation for $p_{ij}$ should read as follows:
$p_{ij} = \frac{y_{ij}/s_{j}+1}{\sum_\ell y_{i\ell}/s_{\ell}+1}.$
As such, $p_{ij}$ never takes the value of 0 or 1.

10.) p.7 It would be helpful if plots showing the graphs of these transformations could be provided.

This is a nice suggestion, and we have added this plot as Figure S1 in the Supplementary Materials, and referred to it in the text as follows:

Over a broad range of intermediate values of the proportions, the logit and arcsin transformations are roughly linearly related to one another \tcb{(see Figure S1A in the Supplementary Materials)}. However, although both transformations tend to pull out the ends of the distribution of $p_{ij}$ values, this effect is more marked for the logit transformation, meaning that it is more affected by smaller differences at the ends of the scale Figure S1B).

11.) p.10 "filtering genes with mean normalized count less than 50" -- This seems like it could be a too stringent threshold, especially in, say, developmental studies, where certain genes are completely switched off in some of the conditions. Indeed these tend to be the most important genes.

We agree with the reviewer that in some cases a less stringent filter would be desired. To reflect this, have added the following text to emphasize that by default, the coseq package does not filter genes and that the user may specify a filter through the meanFilterCutoff argument if desired:

We note that for some studies (e.g., those in which some genes may be completely switched off in some conditions), a less stringent filtering threshold may be desired; in such cases, the meanFilterCutoff argument of coseq may be omitted (corresponding to no filter) or set to a smaller value.

Very small point:

12.) Introduction: "Increasingly complex studies of transcriptome dynamics are now routinely carried out using high-throughput sequencing of RNA molecules, ..."
Indeed what is being sequenced in RNA-Seq are cDNA molecules. Direct sequencing of RNA is also possible but (currently) usually not called RNA-Seq.

We thank the reviewer for this important detail; the first sentence of the manuscript has been modified as follows:

Increasingly complex studies of transcriptome dynamics are now routinely carried out using high-throughput sequencing of reverse-transcribed RNA molecules (i.e., cDNA molecules), called RNA sequencing (RNA-seq).

Typos:

p.8 compatability

p.15 hierarhical clustering

Corrected.

This review was prepared by Wolfgang Huber.