Sequencing using chain-terminating dideoxynucleotides – also known as Sanger sequencing – has long been recognized as the standard for DNA sequence determination. The uncomplicated chemistries and workflows, ease of data analysis, and clear results interpretation allow Sanger sequencing to remain an important tool in the biologist’s toolbox, especially when accurately determining a sequence is extremely important. For example, miscalling sequences can lead to a misinterpretation of the underlying biology. In cancer research, it might mean making the wrong assumptions about the nature of the mutation in a tumor, affecting possible intervention choices. In infectious disease research, it might mean misidentifying a pathogenic strain and using an antibiotic that is ineffective against it. In inherited disease research, it could result in the misidentification of the causative mutation for a syndrome. Sanger sequencing gives researchers confidence that they have correctly identified the underlying mutation, enabling them to build further research on a solid foundation.
However, in spite of the robustness of Sanger workflows, there are occasionally problems that can complicate the results. For example, if the morphology of a peak is slightly off, base calling software may call the peak incorrectly or not at all. Additionally, background noise from a sequencing reaction can reduce the confidence in read quality. Dye blobs, resulting from incomplete clean-up of sequencing reactions, can also interfere with base calling confidence. Manual examination of suboptimal sequencing traces can overcome these issues, often at a cost of increased time and reduced efficiency for the lab.
To overcome these problems, the innovators at Applied Biosystems™ took advantage of machine learning and artificial intelligence algorithms to develop a novel Sanger sequencing basecaller solution. A set of algorithms, including deep neural networks, was applied to paired suboptimal traces and ground truth traces to improve basecalling accuracy in Sanger sequencing traces. A large collection of in-house generated, annotated Sanger sequencing datasets was then used to train and test the resulting algorithms.
These development efforts produced Smart Deep Basecaller™ (SDB). Smart Deep Basecaller, accessible within Sequencing Analysis Software 8.0, takes the output of the Applied Biosystems genetic analyzers and provides improved sequence interpretation, particularly with suboptimal traces. This new solution gives researchers enhanced confidence in their Sanger sequencing results.
Increased read lengths
Researchers are interested in getting the most information from their experiments. The advanced algorithm in SDB allows for greater accuracy at the 5’ and 3’ ends, thus optimizing the number of bases per read. This increase in the number of high-quality basecalls at the 5’ and 3’ ends of long reads increases the overall read length from a single reaction. In internal tests, SDB increased the Q20 CR length (number of contiguous bases with a QV greater than 20) by 6.2% to 15.5% relative to KB Basecaller, depending on the instrument and run module used. Another metric used to demonstrate the utility of SDB is aligned clear read (ACR) length, which is the number of bases within a region that aligned with a reference sequence with high accuracy. This is a measure of the accuracy of a Sanger read. SDB increased the ACR length by 3.9% to 12.5% on long reads, relative to KB Basecaller. These advanced features can produce read lengths of over 1200 bp.
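To make the Q20 CR metric concrete, here is a minimal Python sketch that takes the parenthetical definition above literally: the longest run of consecutive bases whose QV is greater than 20. How Sequencing Analysis Software actually computes the metric is not described here, so treat this as an illustration only. The function names and the quality values are hypothetical and are not taken from any real trace.

```python
def q20_cr_length(quality_values, threshold=20):
    """Length of the longest run of consecutive bases with QV > threshold.

    Assumption: this follows the literal definition given in the text,
    not necessarily the calculation used by the vendor software.
    """
    best = 0
    current = 0
    for qv in quality_values:
        if qv > threshold:
            current += 1
            best = max(best, current)
        else:
            current = 0  # a low-quality base breaks the contiguous run
    return best


def percent_increase(new, old):
    """Relative change of `new` over `old`, expressed in percent."""
    return 100.0 * (new - old) / old


# Hypothetical per-base quality values for the same short read,
# as two different basecallers might report them.
kb_qvs = [12, 25, 30, 41, 18, 33, 35, 36, 40, 22, 9]
sdb_qvs = [22, 25, 30, 41, 28, 33, 35, 36, 40, 22, 15]

kb_len = q20_cr_length(kb_qvs)
sdb_len = q20_cr_length(sdb_qvs)
print(kb_len, sdb_len, percent_increase(sdb_len, kb_len))
```

In this toy data, a single low-quality call in the middle of the read splits the contiguous region in two, which is why rescuing a few bases can lengthen the metric considerably.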
More accurate pure and mixed base calls
Another aspect of SDB functionality is that it can read through artifacts such as dye blobs, mobility-shifted peaks, malformed peaks, and N-1 peaks. An example is shown in Figure 1. This example electropherogram shows a region that has a dye blob and a couple of flanking malformed peaks. KB™ Basecaller (right) has a problem calling some of the bases co-migrating with the dye blob; five bases are incorrectly called as mixed bases with poor quality. SDB is able to recognize the blob anomalies and can make the correct base calls, even in the presence of the anomalous peaks (left).
Increased accuracy through GC-rich regions
SDB is able to improve reads through difficult sequences such as homopolymeric regions and GC-rich templates. In internal tests, we analyzed 109 sequencing traces from a template with GC content between 60% and 75%. Relative to KB Basecaller, SDB produced 8.4% longer ACR length, 10.4% longer Q20 CR length, and a 28.9% lower ACR error rate. Similar results were seen on homopolymeric A/T sequences.
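For readers who want to check whether their own template falls in a comparable GC range, GC content is simply the percentage of bases that are G or C. The following is a minimal Python sketch; the function name and the example sequence are hypothetical, and ambiguity codes such as S or N are counted as non-GC for simplicity.

```python
def gc_content(sequence):
    """Percentage of bases in `sequence` that are G or C.

    Assumption: only unambiguous G and C are counted; IUPAC ambiguity
    codes (for example S or N) are treated as non-GC bases.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Hypothetical 10-base example: 7 of the 10 bases are G or C.
example = "GGCCATGCGA"
print(gc_content(example))
```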
Improved confidence with heterozygous insertion-deletion (het indel) variants
In many cases, a genomic DNA sample contains a mixture of alleles. SDB can improve the basecalling of single nucleotide variants. Moreover, when analyzing a sequence that is heterozygous for a frameshifting insertion or deletion mutation, SDB’s advanced algorithms can improve the quality value and accuracy of the basecalls (Figure 2). This gives researchers more confidence in the resulting sequence.
Enhanced View trace visualization
Many Sanger sequencing reactions, especially long reads, show degraded peak morphology (reduced resolution) at the 3’ end of a read. SDB algorithms can accurately call these reduced-resolution peaks by improving the baseline and increasing resolution at the 3’ end of plasmid sequences. The results are packaged into an improved electropherogram diagram, facilitating the interpretation of the sequence in this region (Figure 3).
Reduced manual review time
When Sanger sequencing reactions produce suboptimal results, the traces and sequences often have to be examined and edited manually, which takes extra time and incurs extra costs. The improvements introduced by SDB have been designed to overcome these suboptimal conditions. The reduced number of low-quality base calls and false positives reduces the amount of manual interpretation needed when the reactions are suboptimal. This eases Sanger sequencing data review, freeing staff time to work on other tasks.
SDB harnesses the power of AI-driven advancements to improve the accuracy of Sanger sequencing. Use this power to reveal insights that might have been missed before.
Watch webinar on Testing Applied Biosystems Smart Deep Basecaller for Sanger Sequencing QC
For Research Use Only. Not for use in diagnostic procedures.