Processing of next-generation sequencing (NGS) data for downstream applications is a critical part of the NGS workflow since it has a direct impact on data interpretation and experimental conclusions. Understanding the basics of NGS data analysis, common bioinformatic tools, and a general framework of the data analysis pipeline can facilitate the design and planning of your NGS experiments. This section provides an overview of NGS data analysis workflows for common DNA and RNA sequencing using Illumina systems.
NGS data analysis is a computationally intensive process, requiring the storage, transfer, and processing of very large data files (generally 1–3 GB in size). This means access to advanced computing centers on-site via a private network or in the cloud is highly recommended. While many off-the-shelf and “plug-and-play” tools are available, it is very likely that some level of scripting and coding skills will be necessary. Python, Perl, R, and Bash scripting are among the most prevalent, and they are typically performed within Linux or Unix-like operating systems and in command-line environments. At the very least, these skill sets can markedly facilitate the task of data analysis [1,2].
NGS data analysis generally involves three core steps: primary, secondary, and tertiary analysis (Figure 1).
The primary analysis provides total reads and quality metrics of the input libraries for assessment of sequencing efficiency and quality. In Illumina sequencing, the input for primary analysis is a raw binary file with nucleotide bases which are identified during the sequencing run (also called base calls). This file has the extension .bcl. The output from primary analysis is a text-based (ASCII) file in FASTQ format . Software built within, or in some cases linked directly to, the sequencer manages this conversion (historically this software has been the bcl2fastq Conversion Software).
Figure 2. Phred quality score, or Q score, and its relationship to base call accuracy.
Because multiple library samples can be sequenced concurrently (i.e., multiplexing), they are identified and separated by index reads during primary analysis. Demultiplexing results in multiple FASTQ files, each corresponding to one unique sample. Demultiplexed files contain read name, flow cell location, etc., for sample identification (Figure 3).
Figure 3. FASTQ file format and annotation. The sequence label often contains flow cell ID, lane, coordinate, etc., which are not shown here for simplicity.
Demultiplexed FASTQ files are used to process each sample individually in secondary analysis. Secondary processing includes the following steps and output files (Table 1).
|This step removes low-quality sequence reads and/or portions of reads (also called “soft-clipping”) from the data. The output from this step is a “cleaned” FASTQ file.
|This step matches, or maps, each base call to its corresponding location in the genome (or transcriptome for RNA) of the organism of interest. Although each individual organism generally has a unique genome, there are reference genomes and transcriptomes published for use as common alignment tools for the NGS community. The alignment output is a Binary Alignment Map (BAM) file that contains the base call alignments to the reference sequences.
|In this step, mutations, features, and/or other anomalies that do not match the reference genome of interest are identified (or called) from the BAM file; these calls are represented in a text-based tab-delimited file in VCF format.
|Gene expression analysis
|In the context of RNA sequencing, this can also include gene count and (differential) expression data. A standard file format (like VCF) does not exist for such data, but most tools will output in a tab-delimited format (e.g., Tab Separated Values (TSV) files) with each column representing samples, genes, amplicon IDs, raw counts, normalized counts, etc.
The very first step of secondary analysis is to perform cleanup of sequencing reads. It usually includes trimming adapters, tracking indexes or barcodes, merging paired-end reads, and removing low-quality or suspect reads (or portions of reads) from the data. While steps like removing reads of base calls below a given Phred score cutoff (typically 30) are common, processing can also be both highly subjective and dependent on the nature or quality of samples, e.g., sequencing of archived samples that are degraded. However, reads shorter than a certain length are commonly thrown out because they are difficult to map or representative of unwanted degraded nucleic acids. Additionally, portions of reads beyond a specific read length cutoff (e.g., 200–250 bp) may be eliminated, or “soft-clipped”, due to diminishing sequencing quality at read ends.
Common tools (e.g., FastQC) can be used to check general FASTQ data . FastQC provides quality scores per base and per sequence overall, sequence and GC content, and duplicate or overrepresented sequences (Figure 4). To account for specific chemistry and experiment-specific quality controls, custom scripting is usually necessary. The output from this cleanup would be another “cleaned” FASTQ file.
Another optional step at this stage is de-duplication of reads. This step removes repeat reads of the same library fragments and reads of PCR amplicons of the same sequence. If a unique molecular identifier (UMI, also called barcoding) is integrated in the adapters, de-duplication will be more informative since UMI fingerprints every individual nucleic acid molecule sequenced. By comparing duplicate reads to correct for errors from PCR and/or sequencing , this step helps ensure biological mutations, anomalies, and gene expression levels are correctly represented in sequencing data.
For RNA sequencing (RNA-Seq) data, additional steps may also be required during cleanup.
If an established reference genome exists for the sequenced sample, alignment of all short reads in a FASTQ file can be done with various off-the-shelf tools. In general, there is a trade-off between computational cost (i.e., power and speed) and the quality of mapping among these tools. Nevertheless, common alignment tools such as BWA and Bowtie 2 offer a reliable balance and are often used together to reach a satisfactory consensus. Another alignment tool, BLAST, is lower in throughput but comprehensive for verification of read sections and troubleshooting of ambiguous reads.
Note that the reference genome chosen is critical in this process. Given that no reference genome is perfect, be clear and consistent about the reference being used and understand potential weaknesses associated with the chosen reference. In the case of human samples, the latest standard is GRCh38 (also called hg38), but the prior standard, GRCh37 (also called hg19), is still commonly used.
The output file from alignment is a Binary Alignment Map (BAM) file. Although alignment output may be in the form of a text-based Sequence Alignment Map (SAM) file , BAM files are generally preferred because they are more compact. Interfacing and manipulation of alignment files are done by scripting or with purpose-built software such as SAMtools.
Sequence alignment can be visualized by uploading BAM or SAM files to a genome browser such as Integrative Genomic Viewer (IGV), University of California Santa Cruz (UCSC): Genome Browser, or Golden Helix. Overlaps of reads in a given region (called pileups) and mismatches from the reference (which may be due to SNPs, mutations, or other errors) can be observed (Figure 5).
Figure 5. (A) Pileups of short reads mapped against the human gene FMR1-AS1, as viewed in the Integrative Genomic Viewer (IGV) from the Broad Institute. Red and blue represent reads 1 and 2, respectively, from Illumina NGS paired-end sequencing. (B) In a view of reads mapped against the CHODL gene in IGV, mismatched regions (mutations) can be quickly scanned in the browser via color-coded highlighting, in this case a C/T SNP. Further details on the reference gene are shown at the top of the viewer, including the genomic address as well as the exact bases of interest.
To build a new reference genome (or sections of a genome), or to provide higher confidence in the alignment process itself, de novo assembly may be performed prior to alignment. In this process, short and long reads with overlapping regions are combined to yield single, longer consensus reads. De novo assembly is useful particularly when large mutations or anomalies might make alignment of shorter reads impossible.
To check that prepared libraries and data processing have been successful to this point, examine the following alignment metrics.
Figure 6. Comparison of percentage of mapped reads from libraries prepared with four different kits.
Figure 6. Comparison of percentage of mapped reads from libraries prepared with four different kits.
Figure 8. Coverage histograms and interquartile range (IQR). The IQR represents the range of the middle 50% of the sequencing coverage histogram; a smaller IQR indicates more uniform coverage and reliable results.
For alignment of RNA-Seq data, tools such as Bowtie and BWA can be used for a first pass because many reads will likely not span exon–exon boundaries. RNA-Seq reads across splice sites, however, should be mapped with reference transcript sequences using transcriptome alignment tools such as TopHat and STAR [9,10]. In cases of incomplete or missing reference transcriptome data, de novo spliced alignment can be performed using TopHat, Cufflinks, or other open-source tools. The de novo alignment process may also allow identification of novel transcripts and new splice isoforms.
Once alignment is complete, BAM or SAM files are fed into multiple off-the-shelf tools to collate lists of mutations. These mutation lists are most commonly reported in a Variant Call Format (VCF) file. Note that poorly mapped—as defined by MAPping Quality (MAPQ) values in the SAM file, or based on the Phred scale—and unmappable reads should be removed prior to generating VCF files.
Genome Analysis Toolkit (GATK) is one of the most advanced general-purpose variant callers available. GATK serves as a good starting point for highlighting single-nucleotide variants (SNVs), multiple-nucleotide variants (MNVs), insertion deletions (indels), etc. Other alignment tools such as SAMtools and Mpileup should also be used to cross-reference the results, as each tool has strengths and weaknesses in finding specific types of mutations and variant allele fractions (VAFs).
More complicated genetic variations will often require specialized mutation callers and pipelines, such as using an off-the-shelf tool in tandem with some custom scripting. Examples of such genetic variations include structural rearrangements (e.g., gene fusions), chromosomal abnormalities, and copy number variations (CNVs). Their mutation reporting often does not fit within the context of VCF files and needs their own custom format. For instance, detection of gene fusion may occur as a supplementary step to a de novo alignment in RNA-Seq because adjacent exons in a given read are mapped to different genes (i.e., chimeric reads). Such mappings can be tabulated in a purpose-built tab-delimited (TSV) file for further investigation during tertiary analysis.
Note that understanding the nature of a mutation or other anomaly of the sample is critical in this step. Sample origin, library preparation technique, challenges in sequence alignment, and allele fraction can all impact calling mutations in a sample.
One of the goals of RNA sequencing is gene expression analysis. For screening, researchers may look for changes in genes of interest, e.g., those expressed at high or low levels after certain treatment. In some cases, researchers may investigate differential expression of genes by comparing their relative levels in the same sample at different time points (temporal), or in samples of different origins at a certain time point (spatial).
Regardless of the research goal, gene/transcript counting and normalization are two common steps in expression analysis.
|Reads per kilobase of transcript per million mapped reads (RPKM)
|Divides total mapped reads (counts in millions, for depth) by transcript length (in kilobases)
|Fragments per kilobase of transcript per million mapped reads (FPKM)
|Counts the number of cDNA fragments corresponding to transcript length (in kilobases), per total mapped reads (in millions)
|Transcripts per million mapped reads (TPM)
|Divides total mapped reads by transcript length (in kilobases), to obtain reads per kilobase (RPK). Sum of RPK values of all transcripts in a sample is then expressed per (i.e., divided by) million—known as a scaling factor. TPM is RPK of each transcript, divided by the scaling factor.
An alternative to FPKM and TPM described in Table 2, DESeq2 has become a popular program to identify differential gene expression among samples because it provides more quantitative analysis of comparative RNA-Seq data from high-throughput sequencing .
Tertiary analysis is perhaps the most subjective step in the NGS data analysis workflow. This step includes annotation of genes and transcripts; interpretation of sequencing data; and prediction of mutational effects on signaling pathways, proteins, and phenotypes. Therefore, experimental hypotheses (e.g., hypothesizing which mutations and/or genetic features are of interest), the statistical confidence of data obtained, and the availability and quality of databases on studied genes and mutations can all impact tertiary analysis.
For the organism being studied, databases may be available for annotating mutations in the VCF file. There are also databases for known and predicted transcripts and common SNPs. One example is Ensembl, which provides basic transcript information and prediction of mutational effects on proteins for multiple organisms. Ensembl can operate as an online tool as well as a downloaded program for direct integration into a data analysis pipeline.
Annotating RNA transcripts is often more challenging because multiple transcripts (e.g., coding, noncoding, different start/stop sites, splice variants) may stem from the same genomic sequence. In some cases, library fragments that are sequenced may not cover all exons of a transcript.
In reporting mutations or sequence variants, it is critical to be clear and consistent on the position and nature of mutated nucleotides being annotated. One of the widely accepted recommendations for identifying and recording variants is a nomenclature proposed by the Human Genome Variation Society (HGVS) .
After identification, mutations and other variants should be prioritized based on:
Despite quality control steps taken in primary and secondary analyses, questionable NGS data that lead to false-positive or false-negative mutation calls cannot be avoided. These scenarios arise especially when quality scores are close to their prescribed cutoff or when the allele fraction of a mutation is approaching that of the known limitation of errors introduced by library preparation and sequencing processes [17,18].
In preparing a final report, a general recommendation is to rank or categorize mutations to guide appropriate actions to take based on the sequencing data (Table 3). Also, consult sorting and ranking processes reported in the literature , because the method chosen depends on the types of genetic features and phenotypes of interest, as well as the actions expected from the NGS report.
|Questionable but potentially impactful
Tertiary analysis for RNA-Seq often includes visualization of advanced statistical analysis of the normalized expression metrics determined from the secondary analysis. Tools such as Cufflinks are used for comparison of gene expression across multiple libraries. Different models may be applied to detect systematic biases, e.g., skewed data from highly expressed genes. Different probability distributions, like Poisson vs. binomial, may be applied to quantify reproducibility and confidence levels .
For Research Use Only. Not for use in diagnostic procedures.