Whether you are new to next-generation sequencing (NGS) or an old hand, you have likely encountered terms like base quality, Q-score, consensus accuracy and so on. This could have happened when you were reading an NGS review or listening to a bright-eyed grad student presenting her new variant calling algorithm at a poster session. Either way, you may have been left thinking, “Huh?”
Let’s turn that “Huh?” into an “Oh, yeah” while we take a closer look at the statistics of sequencing.
So let’s begin our journey in the good old days of base quality scores, also known as Q-scores. Back in the early days of DNA sequencing, shotgun sequencing was typically employed to perform de novo assembly. Basically, DNA was fragmented into shorter pieces and sequenced. The resulting sequencing reads were then assembled to build a reference genome.
Since we didn’t know what the truth was back then, we tried to predict the quality of the sequenced bases using a, well, de novo methodology. That is, we did some complicated signal analysis and assigned a base quality score on the Phred scale, or a Q-score, to every base observed.
The Q-score is a value derived from the formula Q = -10 log10(P), where P is the probability that the base is wrongly called. A Q-score of 10 translates to a 1 in 10 chance of the base call being wrong, Q20 is a 1 in 100 chance, and Q30 is a 1 in 1000 chance.
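To make that formula concrete, here is a minimal Python sketch of the conversion in both directions. The function names are made up for illustration, and the logarithm is base 10, as on the standard Phred scale.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled Q-score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled Q-score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability for the three Q-scores mentioned above.
    for q in (10, 20, 30):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4f}")
```

Every step of 10 on the Q scale is one more factor of ten in confidence, which is why Q30 reads as "one error in a thousand".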
What I just described is what we refer to as the predicted quality in the sequencing world. You have likely encountered a metric called %bases greater than or equal to Q30, which translates to the percent of sequenced bases that have a predicted quality score of 30 or more.
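As a rough sketch, that metric could be computed from the quality strings of a set of reads like this. It assumes the qualities are stored as ASCII characters with an offset of 33 (the common FASTQ convention, which is not something stated here), and the function name and inputs are hypothetical.

```python
def percent_bases_at_or_above(quality_strings, threshold=30, offset=33):
    """Percent of bases whose predicted Q-score is >= threshold.

    Assumes each quality character encodes Q as ord(char) - offset
    (Phred+33 by default; this encoding is an assumption of the sketch).
    """
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return 100.0 * passing / total if total else 0.0


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    reads = ["IIII", "5555"]
    print(percent_bases_at_or_above(reads))  # 50.0
```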
Moving to the present, we now have the benefit of knowing the reference genome. As a result, we can take individual reads, map them to a reference and count the number of mismatches. This gives us the actual observed measure of raw read accuracy, or the empirical quality, measured by the error rate. An error rate of 1% translates to a raw read accuracy of 99% and tells us that one in a hundred bases is a mismatch to the reference. Keep in mind that the error rate will count actual biological variants as mismatches too, and is thus a conservative metric of observed quality. Distinguishing true variants from base calling errors is where consensus accuracy comes into play.
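Here is a simplified sketch of that counting step. It assumes each read has already been mapped and is given as a hypothetical (start position, sequence) pair with no insertions or deletions, which real aligners do not guarantee, so treat it as an illustration rather than a real pipeline.

```python
def count_mismatches(reference, read, start):
    """Count positions where the read differs from the reference window."""
    window = reference[start:start + len(read)]
    return sum(1 for ref_base, base in zip(window, read) if ref_base != base)


def raw_error_rate(reference, aligned_reads):
    """Mismatches divided by total aligned bases.

    aligned_reads is a list of (start, sequence) pairs; ungapped
    alignment is an assumption of this sketch.
    """
    mismatches = 0
    bases = 0
    for start, read in aligned_reads:
        mismatches += count_mismatches(reference, read, start)
        bases += len(read)
    return mismatches / bases if bases else 0.0


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    reads = [(0, "ACGTA"), (5, "CGTAT")]  # the last base of read 2 is a mismatch
    rate = raw_error_rate(reference, reads)
    print(rate)      # 0.1
    print(1 - rate)  # raw read accuracy
```

Notice that the code cannot tell whether that one mismatch is a base calling error or a real variant. It counts both, which is exactly why the error rate is a conservative measure.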
One of the advantages of NGS is the ability to produce millions of sequencing reads. As you can see, when we pile these reads up together along the reference, what we get is the power of the majority. We can filter out mismatches that are likely to have been base calling errors and keep those that are likely to be real variants using statistical inference. What we get in the end is a consensus sequence and its corresponding consensus accuracy with respect to the reference.
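To give a feel for the pile-up idea, here is a toy sketch that uses a plain majority vote at each position. That is a deliberate simplification standing in for the statistical inference real variant callers use, and the function names and the (start, sequence) read format are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(reference, aligned_reads):
    """Build a consensus by majority vote over an ungapped pile-up.

    Positions with no read coverage fall back to the reference base.
    """
    piles = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1
    consensus = []
    for ref_base, pile in zip(reference, piles):
        consensus.append(pile.most_common(1)[0][0] if pile else ref_base)
    return "".join(consensus)


def identity_to_reference(reference, consensus):
    """Fraction of positions where the consensus matches the reference."""
    matches = sum(1 for r, c in zip(reference, consensus) if r == c)
    return matches / len(reference) if reference else 0.0


if __name__ == "__main__":
    reference = "ACGTACGT"
    reads = [
        (0, "ACGTATGT"),
        (0, "ACCTATGT"),  # lone C at position 2: outvoted, likely an error
        (0, "ACGTATGT"),
    ]
    consensus = majority_consensus(reference, reads)
    print(consensus)  # ACGTATGT
    print(identity_to_reference(reference, consensus))  # 0.875
```

The mismatch seen in only one read is voted away, while the T seen in every read at position 5 survives into the consensus as a candidate real variant.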
Hopefully you are now familiar with the ideas of Q-scores, raw read accuracy, and consensus accuracy as they pertain to NGS. But I am sure you’ll have more questions.
Submit your question at thermofisher.com/ask and subscribe to our channel to see more videos like this.
And remember, when in doubt, just Seq It Out.