After years in the making, the 1000 Genomes Project consortium this week published in Nature the results of the analysis of the project's pilot phase. The succinct text of the article conceals the hard work that went into generating the data and, even more so, into figuring out how to analyze it in the most efficient manner.

The pilot phase of the project aimed to test different second-generation platforms and sequencing strategies for the scale-up phase, in which at least 1,000 genomes would be sequenced. The underlying dilemma is common to many human genetics projects: how to maximize the information obtained under a fixed budget. The goal was ambitious: to catalog 95% of the variants with a frequency of 1% or more, the traditional definition of polymorphism, in the major continental populations. The original budget for the project allowed either sequencing 100 genomes to good depth of coverage, obtaining a more comprehensive view of those genomes, or sequencing 1,000 genomes at low coverage, randomly sampling large parts of each genome but leaving many holes.
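To make the trade-off concrete, here is a rough back-of-envelope sketch (the total sequencing yield below is illustrative, not the Project's actual budget) showing how per-sample depth falls as more genomes are added under a fixed budget:

```python
# Sketch of the fixed-budget trade-off: the total number of sequenced bases is
# fixed, so per-sample depth drops as more genomes are included.
GENOME_SIZE_GB = 3.1      # approximate haploid human genome, in gigabases
TOTAL_BUDGET_GB = 9_300   # hypothetical total sequencing yield, in gigabases

def per_sample_depth(n_samples, total_gb=TOTAL_BUDGET_GB, genome_gb=GENOME_SIZE_GB):
    """Average fold-coverage per genome when the yield is split evenly."""
    return total_gb / (n_samples * genome_gb)

for n in (100, 1000):
    print(f"{n:>4} genomes -> ~{per_sample_depth(n):.0f}x coverage each")
# 100 genomes -> ~30x coverage each; 1000 genomes -> ~3x coverage each
```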

It was at that point that Life Technologies decided to contribute to the project's pilot and increase the available “budget” with data from the SOLiD™ System. This academia-industry partnership proved extremely valuable for the Project, not only for the increase in sequencing capacity but also for the development of the methods ultimately used in the analysis pipelines, by providing expertise on the specific features of each platform. In partnership with the Baylor College of Medicine, we produced over a tera-base of raw SOLiD data for the pilot, which yielded uniquely mapped sequence amounting to about 167 times the length of the human genome. With this extra capacity, and with gains from the increasing throughput of the sequencing platforms, the low-coverage pilot ultimately targeted a sequencing depth of 4-6X for 179 samples. We are glad that our effort made a significant contribution to the Project.

One of the unique challenges of the Project was to consolidate data from different sequencing platforms, each with its own set of attributes. During this process, new bioinformatic methods and standards were created that allow analysis pipelines and platforms to interoperate, such as the BAM format for aligned reads. Our customers are already benefiting from these developments. The SOLiD analysis software, BioScope™, now uses the BAM format as a central component of its architecture, allowing customers to use third-party analysis tools. These developments also demystified the analysis of SOLiD data, with its unique advantage of error-detection codes that enable higher accuracy but also require some specific analysis strategies to reap that benefit. The 1000 Genomes Project has now produced various tools that can easily analyze SOLiD data and combine it with data from other platforms.
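As a small illustration of how the BAM standard enables platform-agnostic, third-party tooling, the sketch below reads aligned reads from a BAM file with the open-source pysam library (the file name and region are hypothetical, and pysam plus a BAM index are assumed to be available); the same code works whichever platform produced the reads:

```python
import pysam  # Python wrapper around samtools; assumed to be installed

# Hypothetical BAM file of aligned reads; because BAM is platform-agnostic,
# SOLiD, Illumina, or 454 alignments can all be read the same way.
with pysam.AlignmentFile("sample.aligned.bam", "rb") as bam:
    # fetch() requires an index (.bai) alongside the BAM file
    for read in bam.fetch("chr20", 1_000_000, 1_000_500):
        if read.is_unmapped:
            continue
        print(read.query_name, read.reference_name,
              read.reference_start, read.mapping_quality)
```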

Besides the pilot testing the strategy of low-coverage sequencing, two other experiments aimed to provide complete sequencing of two family trios and to sequence more deeply the much smaller fraction of the genome corresponding to the exons of protein-coding genes. Experimental validation of novel genetic variation was used to assess the data quality from each pilot. These data suggested that combining data from multiple platforms and analysis methods reduced the number of false-positive variants. In this context, the accuracy of the SOLiD System was a contributing factor in minimizing false positives in the Project's products. Our analysis of the experimental validation of the data from the offspring of the Yoruba trio demonstrated a remarkably low false-positive rate of less than 0.7% for SNP calls produced from SOLiD data. Similarly good results were obtained from the validation of small indels and structural variants derived from data from our platform.

The availability of the pilot data has immediate implications for the study of genetic diseases. In the search for highly penetrant variants underlying Mendelian diseases by exome sequencing, the 1000 Genomes data provides a filter to reduce the number of candidate functional variants, since these mutations are expected to be private to the study cohorts and not present in the 1000 Genomes catalog (see for example Hoischen et al, Nat. Genet. 42(6):483-5, 2010). In the analysis of GWAS, the availability of this new data set increases the power of imputation methods to test for association at variants that were not genotyped in the studies but are thought to be present in the sampled individuals.
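As a minimal sketch of the filtering idea (the file names and the bare-bones VCF parsing are hypothetical simplifications; a real pipeline would use a proper VCF parser and frequency thresholds), candidate variants from an exome study can be screened against the sites already cataloged by the 1000 Genomes Project:

```python
# Sketch: remove candidate variants already present in the 1000 Genomes catalog,
# on the assumption that highly penetrant Mendelian mutations are unlikely to be
# common polymorphisms. File names are illustrative only.

def read_sites(vcf_path):
    """Collect (chrom, pos, ref, alt) tuples from an uncompressed VCF file."""
    sites = set()
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            sites.add((chrom, pos, ref, alt))
    return sites

known = read_sites("1000g_pilot_sites.vcf")         # catalog of known variants
candidates = read_sites("patient_exome_calls.vcf")  # calls from the study cohort

novel = candidates - known
print(f"{len(novel)} of {len(candidates)} candidate variants are not in the catalog")
```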

However, the most important lesson from the project so far is that sequencing is becoming the mainstream method for carrying out medical genetics studies. The tools are now available to execute projects that require thousands of samples. The continuous drop in sequencing costs, together with strategies that either reduce the extent of the genome sequenced (e.g. exome enrichment) or permit the analysis of data from low-coverage genomes, is making this approach feasible. This heralds the passing of the microarray as the platform of choice for the study of genetic predisposition to disease. The detection by sequencing of rare single-nucleotide variants of potentially larger genetic effect, as well as of novel indels and structural variants (which arrays detect poorly), is poised to fill the gaps in missing genetic heritability left by the recent wave of GWAS. Even with more populations and a larger number of samples slated for sequencing by the 1000 Genomes Project (currently about 2,500 at low coverage), it is becoming clear that fixed sets of SNPs that could be put onto arrays would be limiting, especially as studies start to include admixed populations, which make up a sizeable fraction of the US and world population. Therefore, the next wave of genetic studies is undoubtedly ushering in a new age of sequencing.