Part four of a six-part series: an overview of how a capillary sequencing instrument converts an analog optical signal to a digital one, assigns a base call (or fragment length) with a quality metric, and finally reports variants.
Getting from optical signal to bases
In part one of this series I briefly described the shift away from radioactivity and X-ray film detection to fluorescent dyes and optical detection. Automating the reading of DNA sequences eliminated the labor of reading individual lanes of G’s, T’s, A’s and C’s. Ask any researcher of a certain generation (doing DNA sequencing work from the technology’s inception in the late 1970’s through perhaps the next two decades) and they will tell of the manual effort involved in setting up the gel apparatus and the sequencing reactions, as well as in reading the individual DNA bases.
The optical snapshot of fluorescence signal intensity (in a convenient Relative Fluorescence Unit metric, or RFU) measures four dyes, each emitting at a different wavelength and thus measured independently of the others. A laser illuminates each dye, which emits light in a particular frequency spectrum that is detected by an in-line charge-coupled device (CCD). The labeling scheme depends on the application: for DNA sequencing each base is labeled with a unique dye terminator, while for fragment analysis the primer is labeled instead. In either case, the data is collected by the instrument’s Data Collection software.
The Data Collection Software is how the end-user interacts with the instrument (from the simplest 310 Genetic Analyzer to the latest 3500 Genetic Analyzer and every model in between). User-modifiable parameters include voltage, run time, filters to apply, and camera speed. Since this is still electrophoresis (through thin capillaries rather than the thicker acrylamide matrix of the manual sequencing context), voltage and time are the two main variables that affect the speed and resolution of the separated products.
The software takes the color information from the CCD and, because the emission spectra of the four dyes overlap, removes the overlapping signal according to the specific dyes (and dye-specific filters) used.
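This spectral separation is often modeled as a matrix correction: each detection channel records a known mixture of the four dye signals, and inverting that mixture recovers the per-dye intensities. The sketch below illustrates the idea only; the matrix values are made up, not from any real dye set or instrument calibration.

```python
import numpy as np

# Hypothetical 4x4 spectral calibration matrix: entry [i, j] is the fraction
# of dye j's emission picked up by detection channel i. The diagonal
# dominates; the off-diagonal terms are the crosstalk to be removed.
crosstalk = np.array([
    [1.00, 0.15, 0.02, 0.00],
    [0.10, 1.00, 0.12, 0.01],
    [0.01, 0.08, 1.00, 0.20],
    [0.00, 0.02, 0.15, 1.00],
])

def remove_crosstalk(observed_rfu, matrix=crosstalk):
    """Recover per-dye signal from raw channel intensities by solving the
    linear system observed = matrix @ true (a simplified spectral correction)."""
    return np.linalg.solve(matrix, np.asarray(observed_rfu, dtype=float))

# A pure 100-RFU signal from dye 0 still registers in neighboring channels...
observed = crosstalk @ np.array([100.0, 0.0, 0.0, 0.0])
# ...but the correction restores the underlying single-dye intensity.
corrected = remove_crosstalk(observed)
```

In practice the calibration matrix comes from a spectral calibration run on the instrument itself, which is why the correction depends on the specific dye set and filters.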
The Data Collection software then writes the raw intensity data (along with timing and parameter information) to what is commonly called an ‘AB1’ file, since the file names produced carry an .ab1 suffix.
There are several analysis programs (some particular to specific instruments) available for the kind of genetic analysis being performed. For DNA sequencing, for example, the software takes the *.ab1 file and applies a mobility correction specific to the instrument, the polymer used for that run, and the dyes. Individual bases are then called: the background noise at each position (signal from the other dyes) is analyzed and a quality score assigned to each base.
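A toy version of that calling step: at each detected peak, call the dye channel with the strongest signal, and measure how much competing signal is present for a quality model to act on. This is only a sketch of the idea; real base callers also model peak shape, spacing, and mobility, and the channel ordering here is arbitrary.

```python
import numpy as np

DYES = "GATC"  # illustrative channel order, not the instrument's actual order

def call_base(channel_rfu):
    """Toy base caller: call the strongest dye channel at a peak and report
    the fraction of signal coming from the other three channels, which a
    quality model would translate into a per-base score."""
    channel_rfu = np.asarray(channel_rfu, dtype=float)
    best = int(np.argmax(channel_rfu))
    noise_fraction = 1.0 - channel_rfu[best] / channel_rfu.sum()
    return DYES[best], noise_fraction

# A clean peak: one dye dominates, so the competing-signal fraction is small
base, noise = call_base([900.0, 20.0, 15.0, 10.0])
```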
A note about quality scores
Sanger sequencing by capillary electrophoresis has developed a highly refined error model over its more than two decades of automated detection. These instruments and methods served as the major workhorse for the Human Genome Project and related genome projects. (The sequencing of model organisms, from E. coli through the yeast Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, the worm Caenorhabditis elegans and the laboratory mouse, was part of the Human Genome Project planning in 1991.) A program called Phred, developed by Phil Green, assigns each base call a quality score that maps, on a logarithmic scale, to the probability that the call is an error.
Defined as Q = -10 log10(P), where P is the probability of an error and Q is the Phred quality score, a score of 20 corresponds to a 1 in 100 probability of an incorrect base, or 99% accuracy for the base in question. A Q score of 30 corresponds to a 1 in 1,000 probability of an error, or 99.9% accuracy.
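The conversion in both directions can be sketched in a few lines of Python (the function names are mine, not from Phred itself):

```python
import math

def phred_quality(p_error):
    """Convert an error probability to a Phred quality score: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Invert the relationship: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Q20: a 1-in-100 chance of a miscall, i.e. 99% accuracy
q20 = phred_quality(0.01)
# Q30 maps back to a 1-in-1000 error probability, i.e. 99.9% accuracy
p30 = error_probability(30)
```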
There are many reasons why low quality scores are assigned to particular bases or, as the case may be, to regions of bases. GC-rich regions cause ‘compressions’ of the data, where the polymerase cannot read through effectively; other bases sit within palindromic or other sequences that form a high degree of secondary structure, which also degrades base quality. However, the accuracy of the quality assignment itself is very high: when a base is called with low accuracy (a high probability of error), there is a very high level of confidence that the base is truly inaccurate.
Different software for different purposes
For sequencing applications, if all that is needed is a visual inspection of the base calls, a free viewer called Sequence Scanner is available. For viewing bases, comparing them to a reference sequence (alignment), and then calling and annotating variants (if so desired), software called Variant Reporter™ is available. Lastly, for more complex alignments (such as against a library of multiple reference sequences) and advanced reporting functions, SeqScape™ Software is available. (SeqScape software assists with 21 CFR compliance, for applications that require that level of process documentation.)
For fragment analysis, a free application called Peak Scanner™ Software handles low-complexity analysis; for more sophisticated work, an application called GeneMapper™ Software offers more flexibility. Each well in a fragment analysis run includes a size ladder used to determine the size of the measured peaks, and the analysis can be as simple as assigning a base-pair size to each peak or as complex as comparing each peak against a custom database of specific marker and allele information. This capability is very useful for population studies, where anywhere from fifty to several hundred peaks can be compared and catalogued, automatically determining which peaks are universal across the population and which are unique to an individual.
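The sizing step itself can be sketched as interpolation against the ladder. The ladder values below are hypothetical, and real fragment-analysis software uses more robust fits (such as the Local Southern method) rather than straight linear interpolation, but the idea is the same:

```python
import numpy as np

# Hypothetical size standard: scan positions (data points) at which ladder
# fragments of known base-pair size were detected in this capillary.
ladder_scans = np.array([1500.0, 2100.0, 2900.0, 3800.0, 4900.0])
ladder_sizes = np.array([35.0, 100.0, 250.0, 340.0, 500.0])

def size_fragment(peak_scan):
    """Assign a base-pair size to a sample peak by interpolating between the
    flanking ladder peaks detected in the same capillary run."""
    return float(np.interp(peak_scan, ladder_scans, ladder_sizes))

# A sample peak halfway between the 100 bp and 250 bp ladder peaks
size = size_fragment(2500.0)
```

Because every well carries its own ladder, each sample is sized against its own run conditions, which is what makes peak sizes comparable across wells and runs.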
This collection of software can all be set up in advance, so that the end-user need only load the data and click a few buttons to have the analysis done and a report generated.
Check out the whole Series: