The importance of coverage

Coverage describes the number of sequencing reads that are uniquely mapped to a reference and “cover” a known part of the genome. Ideally, the uniquely aligned sequencing reads are distributed evenly across the reference genome and hence provide uniform coverage. In reality, coverage is not uniform, and genetic regions of interest may be underrepresented due to a variety of factors (see table below). One such factor is the complexity of the genome itself: it contains genes, noncoding DNA, repetitive sequences, and other elements that can make it difficult to align a sequencing read to the proper genomic coordinates.
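To make the idea of coverage and uniformity concrete, the following is a minimal sketch that counts how many reads overlap each position of a region. It assumes reads are given as 0-based, half-open (start, end) intervals that have already been uniquely aligned; the function names and the example intervals are illustrative and do not come from any particular aligner or toolkit.

```python
from typing import List, Tuple


def per_base_depth(reads: List[Tuple[int, int]], region_length: int) -> List[int]:
    """Depth at each position of a region.

    `reads` holds hypothetical (start, end) alignments, 0-based and half-open.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (region_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    depth = []
    running = 0
    for i in range(region_length):
        running += diff[i]
        depth.append(running)
    return depth


def summarize(depth: List[int], min_depth: int) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    mean = sum(depth) / len(depth)
    breadth = sum(1 for d in depth if d >= min_depth) / len(depth)
    return mean, breadth


# Illustrative input: three reads over a 10-base region.
example_reads = [(0, 5), (2, 7), (3, 8)]
example_depth = per_base_depth(example_reads, 10)
mean_depth, breadth_at_2 = summarize(example_depth, min_depth=2)
```

In this toy region the last two positions receive no reads at all, which is the kind of gap that a mean depth figure alone would hide.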

The number of sequencing reads that map to a known region is also an important part of coverage. A sufficient number of properly mapped reads is required to find and correctly identify genetic mutations. With high sequencing coverage, researchers can find the proverbial ‘needle in the haystack’: they can identify low-frequency mutations or discover mutations in a heterogeneous sample such as a tumor biopsy. Poor coverage, whether due to an insufficient number of reads or to reads that are mapped incorrectly, makes it impossible to detect the variants of interest.
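The link between depth and the ability to detect a low-frequency mutation can be illustrated with a simple binomial model. This is only a sketch under stated assumptions: each read at a position is treated as an independent draw, sequencing and mapping errors are ignored, and the minimum number of supporting reads (`min_alt_reads`) is a hypothetical calling threshold rather than a value taken from this paper.

```python
from math import comb


def detection_probability(depth: int, allele_fraction: float, min_alt_reads: int) -> float:
    """Probability of seeing at least `min_alt_reads` variant reads at a position.

    Assumes the number of variant-supporting reads follows
    Binomial(depth, allele_fraction); errors are not modeled.
    """
    # Sum the probability of observing fewer than the required variant reads.
    upper = min(min_alt_reads, depth + 1)
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(upper)
    )
    return 1.0 - p_fewer


# Illustrative comparison: a 1% variant at two different depths,
# requiring at least 5 supporting reads (hypothetical threshold).
p_shallow = detection_probability(100, 0.01, 5)
p_deep = detection_probability(1000, 0.01, 5)
```

Under this model, the deeper position gives a markedly higher chance of collecting enough variant reads, which is why low-frequency mutations call for high coverage.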

White Paper: The importance of coverage: advantages of amplicon-based approaches in next-generation sequencing


How does throughput relate to coverage?

Adequate coverage is clearly important to ensure that the genomic region of interest can be studied with high confidence. For regions with little to no coverage, researchers frequently increase the sequencing throughput of their studies; that is, they obtain more sequencing reads and data to increase coverage for a genetic region by brute force. However, this method is inefficient, increases costs, and does not address the underlying reasons for the poor coverage. When throughput is increased, genomic regions that already had sufficient coverage become over-represented, and those extra reads are in effect wasted. Meanwhile, areas that previously had zero coverage may still have none after simply sequencing more sample.
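The inefficiency of the brute-force approach can be sketched with a toy depth profile. The model below assumes that depth at every position scales linearly with the number of reads, so a position that cannot be covered for mapping reasons stays at zero; the profile values, the target depth, and the function names are all hypothetical.

```python
from typing import List


def scale_throughput(depth_profile: List[int], factor: int) -> List[int]:
    """Depth profile after multiplying the number of reads by `factor`.

    Assumes depth scales linearly everywhere (a simplification).
    """
    return [d * factor for d in depth_profile]


def wasted_fraction(depth_profile: List[int], target_depth: int) -> float:
    """Fraction of sequenced bases that exceed the target depth."""
    total = sum(depth_profile)
    if total == 0:
        return 0.0
    excess = sum(max(d - target_depth, 0) for d in depth_profile)
    return excess / total


# Hypothetical regions: unmappable, poorly covered, adequate, well covered.
baseline = [0, 5, 30, 60]
target = 30

# Six times more reads brings the poorly covered region up to the target...
scaled = scale_throughput(baseline, 6)

# ...but the zero-coverage region stays at zero and most bases are now excess.
waste_before = wasted_fraction(baseline, target)
waste_after = wasted_fraction(scaled, target)
```

In this sketch the extra throughput rescues one region while the share of bases beyond the target depth grows substantially, and the region with no coverage is unchanged.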

A more efficient way to address coverage is to use a targeted sequencing approach. Through targeted sequencing, researchers can focus on just their regions of interest instead of sequencing the entire genome. Because the reads are concentrated on those regions, this approach helps ensure sufficient coverage, including in parts of the genome that may not have been accessible previously, at lower sequencing cost.
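A rough estimate shows why targeting reduces the sequencing required. The sketch below uses the standard relationship that mean depth equals the number of reads times the read length divided by the size of the region sequenced. It assumes perfectly uniform coverage; the panel size, read length, target depth, and `on_target_rate` parameter are hypothetical values chosen only for illustration, and the human genome size is approximate.

```python
from math import ceil


def reads_required(target_depth: float, region_size_bp: int, read_length_bp: int,
                   on_target_rate: float = 1.0) -> int:
    """Reads needed to reach a mean depth over a region.

    Rearranges depth = reads * read_length / region_size and divides by the
    fraction of reads assumed to land on target. Uniform coverage is assumed.
    """
    if not 0.0 < on_target_rate <= 1.0:
        raise ValueError("on_target_rate must be in (0, 1]")
    return ceil(target_depth * region_size_bp / (read_length_bp * on_target_rate))


HUMAN_GENOME_BP = 3_100_000_000   # approximate size of the human genome
PANEL_BP = 50_000                 # hypothetical targeted panel size
READ_LENGTH_BP = 150              # hypothetical read length

whole_genome_reads = reads_required(30, HUMAN_GENOME_BP, READ_LENGTH_BP)
panel_reads = reads_required(30, PANEL_BP, READ_LENGTH_BP)
```

The same mean depth over the small panel needs several orders of magnitude fewer reads than over the whole genome, which leaves room to sequence the targeted regions far more deeply for the same cost.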

Potential reasons for poor sequencing coverage and uniformity

| Reason for poor coverage | Why this can affect coverage |
| --- | --- |
| Sample quality | Degraded samples are more difficult to prepare and yield shorter sequencing reads. Shorter reads are harder to map to the correct region because they are less likely to be unique. |
| Sample input | There may not be enough sample to sequence, and the DNA that is sequenced may not be representative of the entire genome. |
| Homologous regions | Homologous regions have similar sequences, which makes it more difficult to map a read to the correct portion of the reference genome. |
| Regions of low complexity | Sequence reads with low complexity may be mapped to the wrong part of the genome, resulting in coverage bias. |
| Hypervariable regions | Due to the high number of variants, a sequencing read can look very different from the reference genome and may not be mapped appropriately. |
| GC content | Sequencing bias can arise from the percentage of guanine-cytosine (GC) nucleotides in a region. |
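
As an illustration of the last factor listed, GC content can be computed directly from a sequence. The following is a minimal sketch; the function names, the window size, and the example sequence are hypothetical, and ambiguous bases such as N are simply excluded from the calculation.

```python
from typing import List


def gc_content(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(sequence: str, window: int, step: int) -> List[float]:
    """GC content in sliding windows, useful for spotting GC-extreme stretches."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(0, max(len(sequence) - window + 1, 1), step)
    ]


# Illustrative sequence: a GC-rich stretch followed by an AT-rich stretch.
example = "GGCCGGCCAATTAATT"
profile = windowed_gc(example, window=8, step=8)
```

Windows with very high or very low GC fractions are the ones where coverage would be checked first when looking for this kind of bias.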