How to analyze RNA-seq data

Transcriptome analysis using next-generation sequencing data has become increasingly popular, and you might be considering a study of your own. Let’s assume you approach Genevia with a typical RNA-seq data set consisting of raw reads from a set of samples and would like us to perform the analyses. Let me outline what kind of analysis your samples could go through.

The question “How to analyze RNA-seq data” could of course be answered in multiple ways. Although a multitude of software tools and environments can be used in the analysis, the steps in an analysis workflow are typically very similar and would likely follow the outline I present here.

Let us look at a typical RNA-seq data analysis workflow that includes the following steps:

1. Quality control

The overall quality of the sequencing reads in RNA-seq data is first inspected to ensure that nothing went wrong when the samples were prepared and the data produced.

This step shows whether any adapter sequences remain in the data or whether any sequencing reads have worse quality than expected. If so, the adapter sequences are removed and low-quality read ends are trimmed, so that the data of a given sample can still be used. Sometimes a whole sample turns out to be of bad quality at this step, in which case the entire sample can be removed from the study.
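To make the trimming idea concrete, here is a minimal Python sketch of cleaning a single read. It is only an illustration, not the tool we would run on real data: the function names, the exact-match adapter search, the Phred+33 quality encoding, and the thresholds (quality 20, minimum length 10) are all assumptions chosen for the example.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_end(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_read(seq, qual, adapter, min_q=20, min_len=10):
    """Return the trimmed (sequence, quality) pair, or None if too short."""
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality_end(seq, qual, min_q)
    return (seq, qual) if len(seq) >= min_len else None


# Made-up read: 16 bases of insert followed by an adapter remnant.
# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
adapter = "AGATCGGAAGAGC"
read_seq = "ACGTACGTACGTTTGA" + adapter
read_qual = "I" * 14 + "##" + "I" * 13

cleaned = clean_read(read_seq, read_qual, adapter)
print(cleaned)
```

In this toy case the adapter remnant is cut off first, and the two low-quality bases at the new read end are trimmed afterwards. A read that ends up shorter than the minimum length is dropped altogether.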

2. Read alignment and normalisation

Each good-quality RNA read needs to be associated with its gene of origin by sequence alignment to a reference genome.

After alignment, the reads overlapping each gene are counted to obtain an expression value for every gene. These expression values are further normalised to enable cross-sample statistical analysis and visualisation.
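As a rough illustration of why normalisation matters, the sketch below scales raw gene counts to counts per million (CPM) and log-transforms them. CPM is just one simple scheme picked for this example; the gene names, counts, and the pseudocount of 1 are made up, and a real analysis may well use a more elaborate normalisation method.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_cpm(counts, pseudocount=1.0):
    """Log-transform CPM values; the pseudocount avoids log2(0)."""
    cpm = counts_per_million(counts)
    return {gene: math.log2(v + pseudocount) for gene, v in cpm.items()}


# Hypothetical samples: sample_b was sequenced twice as deeply as sample_a,
# but the relative expression of the three genes is identical.
sample_a = {"GENE1": 100, "GENE2": 300, "GENE3": 600}
sample_b = {"GENE1": 200, "GENE2": 600, "GENE3": 1200}

print(counts_per_million(sample_a))
print(counts_per_million(sample_b))
```

The raw counts of the two samples differ by a factor of two purely because of sequencing depth. After scaling, the samples become directly comparable.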

3. PCA

Having all the expression values per sample in hand, it is finally possible to proceed to visualising and statistically comparing the samples.

A principal component analysis (PCA) is a practical and quick approach to check sample similarity within experimental groups prior to other statistical tests. A PCA can efficiently reveal outliers and - although no one wishes for it - even samples mixed up accidentally in the laboratory! An example case of a PCA is shown below, with the different cell type samples forming separate groups. Sometimes PCA reveals the samples to be highly similar, with little likelihood of finding any differentially expressed genes. In such a case the samples would seem much more mixed.

A PCA can be used to check sample similarity within experimental groups prior to statistical comparisons, as well as to confirm the differences between experimental groups. Severe outliers can also be pinpointed and considered for removal after a careful study of their origin.
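For readers curious about what happens under the hood, here is a small sketch that finds the first principal component of a samples-by-genes matrix with power iteration, using only the Python standard library. It is a teaching sketch under stated assumptions: the expression values are invented, only the first component is computed, and in practice one would rely on an established numerical library rather than this hand-written loop.

```python
import math


def first_principal_component(samples, iterations=100):
    """Return (axis, scores) for the first principal component.

    samples: list of samples, each a list of (log-scale) expression values.
    """
    n_samples, n_genes = len(samples), len(samples[0])
    # Centre every gene so that its mean across samples is zero.
    means = [sum(row[j] for row in samples) / n_samples for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in samples]

    # Start from the centred sample with the largest norm (a simple,
    # deterministic choice that is non-zero whenever there is any variance).
    start = max(centered, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in start))
    if norm == 0:
        raise ValueError("all samples are identical; no variance to explain")
    axis = [x / norm for x in start]

    # Power iteration: repeatedly apply X^T X to the axis and renormalise.
    for _ in range(iterations):
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        updated = [sum(centered[i][j] * scores[i] for i in range(n_samples))
                   for j in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in updated))
        axis = [x / norm for x in updated]

    scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
    return axis, scores


# Made-up expression values: two samples of cell type A, two of cell type B.
expression = [
    [10.0, 2.0, 5.0],   # A1
    [11.0, 1.0, 6.0],   # A2
    [2.0, 10.0, 5.0],   # B1
    [1.0, 11.0, 4.0],   # B2
]
axis, scores = first_principal_component(expression)
print(scores)
```

In this toy data set the two cell types land on opposite sides of zero along the first component, which is the kind of separation one hopes to see in a PCA plot.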

4. Statistical tests

The statistical comparisons that aim at identifying differentially expressed genes between sample groups remain at the heart of transcriptomics data analysis.

Depending on the experimental setting, the approaches taken may vary. The typical approach is to compare sample groups pairwise using a statistical test that can take sample dependencies into account, such as paired samples or other variables.

A typical output of the analysis includes a list of significantly differentially expressed genes between the conditions, together with their fold changes for expression levels and p-values for their significance. The list is typically filtered to include only genes with at least a 2-fold change in expression and a significant p-value.
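The filtering step can be sketched in a few lines of Python. The example assumes the statistical test has already produced a log2 fold change and a p-value per gene; the gene names and numbers are invented. The Benjamini-Hochberg adjustment and the 0.05 cutoff are assumptions added for illustration, since the text above only speaks of a "significant p-value". A 2-fold change corresponds to an absolute log2 fold change of at least 1.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_de_genes(results, min_abs_log2fc=1.0, alpha=0.05):
    """Keep genes with |log2 fold change| >= 1 and adjusted p-value < alpha.

    results: list of (gene, log2_fold_change, p_value) tuples.
    """
    adjusted = benjamini_hochberg([p for _, _, p in results])
    return [
        (gene, lfc, q)
        for (gene, lfc, _), q in zip(results, adjusted)
        if abs(lfc) >= min_abs_log2fc and q < alpha
    ]


# Hypothetical test results for four genes.
results = [
    ("GENE_A", 2.3, 0.0001),
    ("GENE_B", -1.5, 0.004),
    ("GENE_C", 0.4, 0.003),   # significant, but less than 2-fold
    ("GENE_D", 1.2, 0.2),     # more than 2-fold, but not significant
]
significant = filter_de_genes(results)
print([gene for gene, _, _ in significant])
```

Both criteria have to hold at once: a gene with a tiny p-value but a small fold change is dropped, and so is a gene with a large fold change that is not statistically supported.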

5. Pathway enrichment analysis

The lists of differentially expressed genes may naturally include a few expected hits that are easy to identify and to link to biological processes.

However, more often there are hundreds of differentially expressed genes not previously associated with the process under study. One then needs to understand what biological processes or pathways their up- or downregulation may be associated with. One way to disentangle the functional meaning of the genes is to perform pathway enrichment analyses.

These analyses determine whether any pathway terms in databases are annotated to the list of differentially expressed genes at a frequency greater than what would be expected by chance alone. A typical output of such an analysis is a list of significantly enriched pathway terms together with p-values and with the original differentially expressed genes associated with each.
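One common way to formalise "greater than expected by chance" is an over-representation test based on the hypergeometric distribution. The sketch below shows that idea for a single pathway. It is an assumption that this particular test is used, and the gene identifiers, pathway, and gene universe are invented; real enrichment tools also correct for testing many pathways at once.

```python
from math import comb


def hypergeom_tail(n_universe, n_pathway, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random."""
    upper = min(n_pathway, n_de)
    favourable = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_universe, n_de)


def pathway_enrichment(de_genes, pathway_genes, universe):
    """Return the overlapping genes and the enrichment p-value."""
    universe = set(universe)
    de = set(de_genes) & universe
    pathway = set(pathway_genes) & universe
    overlap = de & pathway
    p = hypergeom_tail(len(universe), len(pathway), len(de), len(overlap))
    return sorted(overlap), p


# Hypothetical example: 20 measured genes, a 5-gene pathway, 4 DE genes.
universe = {f"G{i}" for i in range(20)}
pathway = {"G0", "G1", "G2", "G3", "G4"}
de_genes = {"G0", "G1", "G2", "G10"}

overlap, p_value = pathway_enrichment(de_genes, pathway, universe)
print(overlap, round(p_value, 4))
```

Here three of the four differentially expressed genes belong to a pathway that covers only a quarter of the measured genes, so the overlap is unlikely to be a coincidence.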

6. Integration with other data types

These steps typically appear in our workflow, yet each data set is analysed in a tailored fashion that takes the requirements of the data and the customer’s interests into account.

Finally, the data could also be combined in integrative analyses with other data types, such as miRNA or proteomics data. Assuming we also had data on differentially expressed miRNAs, we could predict their potential target genes in databases, and find cases where both genes and their potential regulator miRNAs are differentially expressed to identify regulator-target relationships.
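The miRNA example boils down to intersecting two lists through a table of predicted targets. A minimal sketch, assuming the predictions are already available as a mapping from miRNA to target genes; the miRNA names, gene names, and the mapping itself are hypothetical placeholders and do not come from any real database.

```python
def regulator_target_pairs(de_mirnas, de_genes, predicted_targets):
    """List (miRNA, gene) pairs where both members are differentially expressed.

    predicted_targets: dict mapping a miRNA to the set of its predicted targets.
    """
    de_genes = set(de_genes)
    pairs = []
    for mirna in sorted(de_mirnas):
        targets = set(predicted_targets.get(mirna, ()))
        for gene in sorted(targets & de_genes):
            pairs.append((mirna, gene))
    return pairs


# Hypothetical inputs.
de_mirnas = {"miR-A", "miR-B"}
de_genes = {"GENE1", "GENE3"}
predicted_targets = {
    "miR-A": {"GENE1", "GENE2"},
    "miR-B": {"GENE4"},
    "miR-C": {"GENE3"},   # not differentially expressed itself
}

pairs = regulator_target_pairs(de_mirnas, de_genes, predicted_targets)
print(pairs)
```

Only pairs where both the miRNA and its predicted target changed are kept, which narrows the list to candidate regulator-target relationships worth a closer look.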

That’s the main story of RNA-seq analysis in brief. As a next step, you could tell us more about your own experiment so that we can plan your analysis!

What would you like to know more about?