If you are new to next generation sequencing, you’ve probably heard about the large amount of data it generates and the challenges of finding meaning in all that data. But don’t worry: just like working in the lab late at night, bioinformatics doesn’t have to be scary.
The power of next generation sequencing, or NGS, is the ability to interrogate hundreds or thousands of genes, or even whole genomes, in a single sequencing run. While the throughput and speed are ideal for accelerating genetic research, the amount of data may be overwhelming. Finding what you are looking for can sometimes feel like searching for a needle in a haystack. Fortunately, many NGS bioinformatics tools take the pain out of data analysis and interpretation.
Let’s take a look at the general NGS data analysis workflow and see why it isn’t so scary after all.
The input into NGS systems is a collection of DNA fragments, known as a library. These library fragments range in size from about 50 bp to 1,000 bp, depending on the system used. Basically, NGS systems sequence these library fragments and automatically process the raw sequence data to make sure you get high-quality sequences, referred to as reads. The reads are presented in a form we lab scientists all understand: strings of A, T, G and C.
Did you know that if you sequenced the genome of everybody on Earth, you would need about 21 exabytes of space on your computer to keep their list of As, Ts, Gs and Cs? That’s about 21 billion gigabytes! You would definitely need another external drive for that one.
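For the curious, here is a back-of-the-envelope sketch of where a number like that could come from. The inputs are assumptions, not figures from this video: roughly 7 billion people, roughly 3 billion bases per genome, and one byte of storage per base.

```python
# Rough storage estimate for one genome per person on Earth.
# All three inputs are illustrative assumptions.
PEOPLE = 7_000_000_000            # assumed world population
BASES_PER_GENOME = 3_000_000_000  # assumed bases in one human genome
BYTES_PER_BASE = 1                # assumed: one text character per base

total_bytes = PEOPLE * BASES_PER_GENOME * BYTES_PER_BASE

# Decimal units: 1 exabyte = 10**18 bytes, 1 gigabyte = 10**9 bytes.
exabytes = total_bytes // 10**18
gigabytes = total_bytes // 10**9

print(f"{exabytes} exabytes, or {gigabytes:,} gigabytes")
```

Under those assumptions the total works out to 21 exabytes. Packing each base into 2 bits instead of a full byte would cut that by a factor of four.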
Let’s take a look at our lab book.
NGS can produce a bunch of As, Ts, Cs and Gs, but how do we make sense of it all? The collection of sequencing reads can be aligned to a reference genome, generating a Binary Alignment Map file, or BAM file. This standard file is the input for many NGS software tools and can be used for a variety of applications, including downstream variant detection. No reference genome? No problem. The collection of reads can also be used by specialized NGS software to build a reference genome, a process called de novo assembly.
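To make the idea of alignment concrete, here is a toy sketch that places each read on a reference by exact string matching. The sequences and the `align_reads` helper are made up for illustration. Real aligners use indexed references and tolerate mismatches and gaps, and they write BAM files rather than Python lists.

```python
def align_reads(reference, reads):
    """Return (read, position) pairs for each read.

    position is the 0-based index of the first exact match in the
    reference, or -1 if the read does not match anywhere.
    """
    return [(read, reference.find(read)) for read in reads]


# Hypothetical reference and reads, for illustration only.
reference = "ACGTTAGCCGATAGGCTA"
reads = ["TAGCC", "GATAGG", "TTTTT"]

alignments = align_reads(reference, reads)
for read, position in alignments:
    status = f"aligned at {position}" if position >= 0 else "unaligned"
    print(read, status)
```

Each aligned read now has a coordinate on the reference, which is the core piece of information an alignment file records.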
So now you have a BAM file with aligned reads, what’s next? Let’s use the example of variant detection and dive deeper into how to find that needle in a haystack.
Now, with the help of our bioinformatics tools, we will determine if the sequence information contains a variant when compared to the reference genome. Variants can be single nucleotide polymorphisms, or SNPs; nucleotide insertions or deletions, also known as indels; or structural variants. The output of variant calling is a Variant Call Format, or VCF, file. These files list all the variants identified, and which variants appear depends on the settings used by the variant detection software.
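A VCF file is tab-separated text, so a few lines of code are enough to peek inside one. The sketch below reads the first five columns (CHROM, POS, ID, REF, ALT) and sorts each record into one of the variant types just described. The sample records and both helper functions are hypothetical, and the classification rule is a simplification that assumes one alternate allele per record.

```python
def classify_variant(ref, alt):
    """Classify a variant from its REF and ALT alleles (simplified)."""
    if alt.startswith("<"):            # symbolic allele such as <DEL>
        return "structural"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"                     # e.g. multi-nucleotide substitution


def parse_vcf(text):
    """Parse VCF text into a list of dicts, skipping header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        variants.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "type": classify_variant(ref, alt),
        })
    return variants


# Hypothetical VCF content, for illustration only.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t22334\t.\tAT\tA\t60\tPASS\t.\n"
    "chr2\t5000\t.\tN\t<DEL>\t40\tPASS\tSVTYPE=DEL;END=9000\n"
)

variants = parse_vcf(vcf_text)
for v in variants:
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["type"])
```

For real data, a dedicated VCF library is a safer choice, since production files carry multiple alternate alleles, sample columns and richer INFO fields.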
But what is the biological meaning of the observed changes? This is where NGS bioinformatics analysis gets really interesting. There are several software tools that use the VCF file as input. The tools compare that information against a large collection of annotation databases that associate a variant with some type of function, process, pathway or disease. Filtering your data based on these annotations helps narrow your focus to variants relevant to your research, getting you closer to that needle.
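Here is a minimal sketch of that annotate-then-filter step. The tiny lookup table stands in for a real annotation database, and every gene name, effect and disease label in it is invented for illustration.

```python
# Variants as (chrom, pos, ref, alt) tuples; hypothetical values.
variants = [
    ("chr1", 12345, "A", "G"),
    ("chr1", 22334, "AT", "A"),
    ("chr3", 777, "C", "T"),
]

# Stand-in for an annotation database; all entries are made up.
annotations = {
    ("chr1", 12345, "A", "G"): {"gene": "GENE_A", "effect": "missense",
                                "disease": "condition X"},
    ("chr1", 22334, "AT", "A"): {"gene": "GENE_B", "effect": "frameshift",
                                 "disease": None},
}


def annotate(variants, annotations):
    """Attach the matching annotation (or None) to each variant."""
    return [{"variant": v, "annotation": annotations.get(v)}
            for v in variants]


def with_disease_link(annotated):
    """Keep only variants whose annotation names a disease."""
    return [a for a in annotated
            if a["annotation"] and a["annotation"]["disease"]]


annotated = annotate(variants, annotations)
candidates = with_disease_link(annotated)
for c in candidates:
    print(c["variant"], c["annotation"]["gene"], c["annotation"]["disease"])
```

Three variants go in and one comes out: the haystack shrinks to the records with an annotation that matters for the question being asked.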
High-throughput NGS is revolutionizing genomics, getting us data faster than ever before. With all the available NGS data analysis tools, uncovering critical genetic associations and trends, or even putting together a new reference genome, isn’t as daunting as it once was. NGS combined with advanced bioinformatics tools means we are getting not just data faster, but critical answers faster, helping lead to a healthier future.
I hope this video on NGS bioinformatics was helpful, and I am sure you’ll have more questions.
Submit your question at https://www.thermofisher.com/ask and subscribe to our channel to see more videos like this.
And remember, when in doubt, just Seq It Out.