Studies of genetic variation generally set aside how variants line up along a particular parental chromosome. While we accept that the human genome is diploid, in practice the two forms of each gene are reduced to individual homozygous or heterozygous single nucleotide variants at a given base position, rather than treated as strings of variants lined up along distinct haploid gene copies.
In order to look at the role of haplotype in genomics, a bit of recent history is in order.
The Human Genome Project
The original 2001 achievement of the Human Genome Project (HGP) was undoubtedly the starting point for significant advances in human genomics. Three years later, in 2004, the international consortium delivered a high-quality human genome assembly covering 99% of the euchromatic sequence. Yet it is incomplete with regard to haplotype phase, because the reference sequence is a haploid consensus mosaic composed of sequence from multiple samples. (On this point it is not clear who these individual samples were, nor how many the original HGP effort included.)
Concurrent with the completion of the HGP, the International HapMap Consortium (IHC) began work on determining the linkage disequilibrium (LD) blocks across 269 samples from four populations. This laid the foundation for the flood of genome-wide association studies to come: the most recent data from the Genome.gov website (December 2013) include 1,778 publications and 12,123 SNPs associated with different phenotypic traits. The consortium published its first map of over a million SNPs in 2005 and a second-generation map of 3.1 million SNPs in 2007; as a high-quality dataset, these maps were critical to understanding common variation across populations.
Having the map of haplotypes across four populations enabled genome-wide association studies to explore the ‘common disease – common variant’ hypothesis: that common variants in a population contribute additive effects to disease risk within the population where the disease is manifest.
Limitations of technology
Technology is available to determine blocks of SNPs that travel together as a haplotype group, but for an individual sample the biology of these variants as distinct paternal and maternal allelic forms of genes has not been explored. The ability to determine individual genetic variants in a sample has been available for decades; however, the ability to analyze the entire set of genes that make up the genome by its haploid content has yet to mature.
In the simplest example, a single SNV can be assessed with great precision via a TaqMan® Genotyping Assay for a given ‘rs’ identification number. (At the online TaqMan® Gene Expression Assay Search tool, you can simply enter a gene name and pull up a list of pre-designed genotyping research assays.) For a single rsID, for example rs10170549 in the IDH1 gene, this assay will detect an A/G transition.
From a sample of human genomic DNA the assay will very accurately determine the specific genotype (in the example of rs10170549 above, typically A/A, A/G or G/G). What this analysis does not capture is which parental chromosome the variant lies on, in the context of all the other variants that may be analyzed simultaneously: the result carries no allelic information about its surroundings, as the assay looks at only a single SNP.
In the realm of whole-genome SNP microarrays, so fundamental to the successful execution of the International HapMap Project, obtaining hundreds of thousands of genotypes from a single sample has been routine for almost a decade. Yet which genotypes lie together along a particular chromosome cannot be determined unambiguously without additional information, usually in the form of genotypes from parental samples. Many computational methods have also been developed to analyze genotypes from unrelated individuals and infer haplotypes from them.
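To make the ambiguity concrete, here is a minimal Python sketch that lists every haplotype pair consistent with a set of unphased genotypes. The function name and the toy genotypes are illustrative assumptions and do not come from any genotyping software; the point is only that each additional heterozygous site doubles the number of possible phasings.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Enumerate the unordered haplotype pairs consistent with unphased genotypes.

    genotypes: list of 2-character strings, one per site, e.g. ["AG", "CT"].
    The order of the two alleles within each string carries no meaning.
    """
    pairs = set()
    # For each site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort so that (hap1, hap2) and (hap2, hap1) count as the same pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Two heterozygous sites: the genotypes alone cannot tell these apart.
print(compatible_haplotype_pairs(["AG", "CT"]))
# [('AC', 'GT'), ('AT', 'GC')]
```

With n heterozygous sites there are 2^(n-1) candidate pairs, which is why a genotyping array alone cannot resolve phase across hundreds of thousands of markers.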
In the case where genotypes from both parents and the individual sample are available (otherwise known as a trio), haplotypes can be inferred using Identity By Descent (IBD) methods.
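As a rough illustration of why a trio helps, the sketch below phases a single site using simple Mendelian transmission logic. It is a deliberate simplification and not a full IBD method: the function name is hypothetical, each site is treated independently, and genotyping errors and recombination are ignored.

```python
def phase_site_with_trio(child, mother, father):
    """Return (maternal_allele, paternal_allele) for the child at one site,
    or None if the trio genotypes do not determine the phase.

    Each genotype is a 2-character string such as "AG"; allele order in
    the input is arbitrary.
    """
    options = set()
    # Try both ways the child's two alleles could have been transmitted.
    for from_mother, from_father in ((child[0], child[1]), (child[1], child[0])):
        if from_mother in mother and from_father in father:
            options.add((from_mother, from_father))
    # Exactly one consistent assignment means the site is phased.
    return options.pop() if len(options) == 1 else None


print(phase_site_with_trio("AG", "AA", "AG"))  # ('A', 'G'): G must be paternal
print(phase_site_with_trio("AG", "AG", "AG"))  # None: all three heterozygous
```

The second call shows the well-known limitation of trio phasing: a site at which the child and both parents are heterozygous stays ambiguous without further information.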
These computational methods are useful, but they are limited in how much of the genome they can phase at a given threshold of accuracy. Experimental methods promise much better accuracy in haplotype phasing over a much larger proportion of the genome.
Using Sanger long reads for whole-genome haplotyping
Interest in completing a human haplotype sequence was fulfilled in 2007 with the publication of the genome of a single human individual, sequenced via Sanger capillary electrophoresis. The paper, titled “The diploid genome sequence of an individual human”, held some surprises, not the least of which was the importance of indels (insertions/deletions), termed non-SNP DNA variation, which accounted for 22% of the variation events and involved 74% of the variant bases relative to the human genome reference.
Assembling an individual diploid human genome with Sanger capillary electrophoresis is highly accurate and, thanks to its long reads, much simpler than with short-read next-generation sequencing. However, it cannot feasibly be performed routinely, as the cost and time involved would be prohibitive.
Experimental methods to determine whole-genome haplotype
Several subsequent publications have described alternative sample preparation methods combined with next-generation sequencing.
These sample preparation methods include:
- Spread metaphase chromosomes from arrested cells, then microdissect them
- Use fluorescence-activated cell sorting (FACS) to separate chromosomes prior to amplification, tagging, then sequencing
- Use a microfluidic device to capture a single metaphase cell, then separate the homologous chromosomes into distinct pools to determine haplotypes by whole-genome genotyping
- Dilute a very small number of genomic equivalents across a 384-well plate with 10% of a haploid genome in each well (called Long Fragment Read technology, or LFR)
- Create many pools of large-insert fosmid libraries, then create individual NGS libraries from each to reconstruct the haplotypes
Taking this last approach, a group at the Max Planck Institute for Molecular Genetics (Berlin, Germany) published a German individual’s haplotype-resolved genome in 2011. The researchers analyzed 159 experimentally phased genes harboring deleterious mutations; of these, 86 carried the mutations in cis (on the same haplotype) while 73 carried them in trans (compound heterozygosity). Mutations in cis are less likely to be damaging, as one copy of the gene retains an unchanged protein-coding sequence.
Taking this work further, the group recently extended its set of haplotype-resolved individual genomes to 14 samples and combined the analysis with an additional 372 statistically resolved genomes from the 1000 Genomes Project.
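The cis/trans distinction only becomes computable once genotypes are phased. A minimal sketch, assuming VCF-style phased genotype strings in which "1" marks the deleterious allele and the side of the "|" identifies the haplotype (the function name is hypothetical and is not the method used in the study):

```python
def classify_configuration(phased_a, phased_b):
    """Classify two heterozygous deleterious variants within one gene.

    phased_a, phased_b: phased genotype strings such as "0|1" or "1|0".
    Returns "cis" if both deleterious alleles sit on the same haplotype,
    "trans" if they sit on opposite haplotypes.
    """
    a = phased_a.split("|")
    b = phased_b.split("|")
    # Unphased input such as "0/1" or homozygous calls are rejected here.
    if sorted(a) != ["0", "1"] or sorted(b) != ["0", "1"]:
        raise ValueError("both variants must be heterozygous and phased")
    return "cis" if a == b else "trans"


print(classify_configuration("0|1", "0|1"))  # cis: one gene copy left intact
print(classify_configuration("1|0", "0|1"))  # trans: both copies affected
```

The same two unphased calls, "0/1" and "0/1", would be indistinguishable in both cases, which is exactly the information that experimental phasing recovers.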
Findings from multiple haplotype-resolved genomes
They characterize both haploid and diploid gene forms and find remarkable diversity: an average of 249 haploid forms and 235 diploid forms per gene. In addition, they describe a set of 4,269 genes that encode two different proteins in over 30% of the genomes they examined, which they call a ‘common diplotypic proteome’.
The large number of haploid and diploid gene forms (on the order of 4.1 million and 3.9 million respectively) in their relatively small population size suggests “that current efforts are still far from capturing the majority of gene forms and that saturation may not even be achievable”. They continue, “the concept of a predominant, ‘wild-type’ form of ‘the’ gene appears obsolete for over 85% of genes, challenging traditional Mendelian views.”
This work was done using the Applied Biosystems® SOLiD® Next Generation Sequencing platform. The largely unexplored diploid landscape is just starting to unfold.
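To illustrate what counting haploid and diploid gene forms means, here is a small sketch on toy data. It assumes, as a simplification that may differ from the paper's actual definitions, that a haploid form is one distinct phased sequence of a gene and a diploid form is one distinct unordered pair of such sequences; the function name and the sequences are invented for illustration.

```python
def count_gene_forms(diplotypes):
    """Count distinct haploid and diploid forms of one gene.

    diplotypes: list of (hap1, hap2) tuples, one per individual, where each
    element is that individual's phased sequence for one gene copy.
    Returns (number_of_haploid_forms, number_of_diploid_forms).
    """
    # A haploid form is any distinct sequence seen on either chromosome.
    haploid_forms = {hap for pair in diplotypes for hap in pair}
    # A diploid form is a distinct unordered pair of haploid forms.
    diploid_forms = {tuple(sorted(pair)) for pair in diplotypes}
    return len(haploid_forms), len(diploid_forms)


# Four toy individuals, one gene (sequences are made up).
individuals = [
    ("ACG", "ACG"),
    ("ACG", "ATG"),
    ("ATG", "ACG"),  # same diploid form as the previous individual
    ("GCG", "ATG"),
]
print(count_gene_forms(individuals))  # (3, 3)
```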
Reference: “Multiple haplotype-resolved genomes reveal population patterns of gene and protein diplotypes”, Nature Communications 5:5569 (November 2014)