Data Considerations for Human Genome Sequencing: Is Smaller Better?
Author: Jonathan Mangion
Head of Bioinformatics (Europe, Middle East & Africa)
Life Science Solutions, Thermo Fisher Scientific
The development of ultra-high-throughput next-generation sequencing (NGS) technologies, coupled with the initiation of large-scale population sequencing projects such as the Genomics England 100,000 Genomes Project and the Genome Canada initiative, has led to a dramatic reduction in the cost of genomic sequencing. Very high-profile projects like Genomics England give the impression that whole-genome sequencing (WGS) or whole-exome sequencing (WES) is the most appropriate strategy for human genomics in a clinical or research environment. These approaches bring many benefits, as the amount of information delivered aids discovery and increases our understanding of the interactions between genetics and human health. However, there are many considerations to take into account before deciding whether WGS, WES, or a more targeted approach using gene panels is the right solution, such as experimental aim, cost, time to results, and data analysis and management. Data analysis and management is an area that is often overlooked in the decision-making process.
The software used to analyze genomic sequencing data has matured greatly over recent years, and most platforms provide a solution that allows mapping and variant calling of the data. WGS identifies millions of variants, creating challenges that need to be overcome to properly handle so much data:
- Provision of the computing power to undertake the analysis
- Interpretation of the results
- Development of a truth dataset to determine specificity and sensitivity using an alternative method such as Sanger sequencing.
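The third point, measuring sensitivity and specificity against a truth dataset, can be sketched in a few lines of Python. This is a minimal illustration rather than a validated benchmarking tool: the function name, the representation of a variant as any hashable key, and the requirement that the caller supplies the set of sites actually assessed by both methods are all assumptions made for the example.

```python
from typing import Hashable, Set, Tuple


def sensitivity_specificity(
    called: Set[Hashable],
    truth: Set[Hashable],
    assessed: Set[Hashable],
) -> Tuple[float, float]:
    """Compare variant calls with a truth set over the assessed sites.

    called   -- sites where the NGS pipeline reported a variant (hypothetical keys)
    truth    -- sites where the orthogonal method (e.g. Sanger) confirmed a variant
    assessed -- every site examined by both methods
    """
    # Only sites covered by both methods can be scored.
    called = called & assessed
    truth = truth & assessed

    tp = len(called & truth)           # variant called and confirmed
    fp = len(called - truth)           # variant called but not confirmed
    fn = len(truth - called)           # confirmed variant that was missed
    tn = len(assessed) - tp - fp - fn  # no variant by either method

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

The denominator for specificity depends on how many non-variant sites were assessed, which is one reason a truth dataset for millions of WGS variants is much harder to build than one for a small panel.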
High-throughput WGS requires a large investment in infrastructure, and with projects such as Genomics England there is a trend to centralize these resources. Setting up such a facility independently is cost-prohibitive for most institutes outside of the main genome institutes. WES seems to offer a more viable approach. A few groups, such as the Cincinnati Children’s Hospital, have managed to set up high-throughput clinical exome sequencing. Using cloud technology to reduce the cost burden of the computing infrastructure is one option, but users need to understand the data governance regulations and whether the data can be transferred to a cloud provider located in another country. Even in this digital age, transferring files of many tens or hundreds of gigabytes may put a major strain on the local network.
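To put the network strain in perspective, a back-of-the-envelope estimate helps. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbit/s = 10^6 bits per second) and a hypothetical `efficiency` factor for the share of the nominal link speed that is actually available. The figures in the note that follows are illustrative, not measurements.

```python
def transfer_seconds(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate the time in seconds to move a file over a network link.

    size_gb    -- file size in gigabytes (decimal, 1 GB = 1e9 bytes)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of the link that is usable (0 < efficiency <= 1)
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits_to_send = size_gb * 1e9 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits_to_send / usable_bits_per_second
```

Under these assumptions, a single 100 GB file on a fully available 100 Mbit/s link takes 8,000 seconds, a little over two hours. If only half the link is usable, that doubles to roughly four and a half hours per file.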
WGS and WES generate vast amounts of information that are invaluable for a research project. In a clinical setting, though, this raises the problems of increases in the false positive rate and detection of incidental findings. Often it is not clear what the functional implication of an incidental finding is. The software may predict that the variant is likely to cause a serious impact on the protein, but the consequences are not truly known until functional studies are undertaken. One approach to circumvent this issue is to provide the results for a subset of genes (in effect, a virtual gene panel). The advantage of this is that with increased knowledge of a disease, the virtual gene set can be easily expanded to include extra genes without having to redo the sequencing.
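The virtual gene panel idea amounts to a filter applied at reporting time. The following sketch assumes annotated variants are available as simple dictionaries with a `gene` key; the record layout, the function name, and the placeholder gene names are hypothetical and do not reflect the output of any particular pipeline.

```python
from typing import Dict, Iterable, List, Set

Variant = Dict[str, str]  # hypothetical record, e.g. {"gene": "GENE_A", "id": "v1"}


def apply_virtual_panel(variants: Iterable[Variant], panel: Set[str]) -> List[Variant]:
    """Return only the variants that fall in genes listed on the virtual panel."""
    return [variant for variant in variants if variant.get("gene") in panel]


# Illustrative records with placeholder gene names (not real findings).
all_variants = [
    {"gene": "GENE_A", "id": "v1"},
    {"gene": "GENE_B", "id": "v2"},
    {"gene": "GENE_C", "id": "v3"},
]

# Initial report: only GENE_A is on the panel.
initial_report = apply_virtual_panel(all_variants, {"GENE_A"})

# Later, GENE_C is implicated in the disease: expand the panel and re-filter
# the stored calls. No new sequencing run is needed.
expanded_report = apply_virtual_panel(all_variants, {"GENE_A", "GENE_C"})
```

The full set of calls still has to be stored and governed, so a virtual panel limits what is reported, not the size of the data footprint.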
There is ongoing debate in the community regarding the most cost-effective strategy, not just for the actual sequencing but also for the effects on the health care system. Muin Khoury at the Centers for Disease Control and Prevention argued in a blog that, due to the consequences of incidental findings, a targeted approach should be used in a clinical setting. A small panel allows for the development of assays that more adequately test the full sequence of the specific genes. The Saudi Mendeliome project shows that this approach generates favorable results compared to WES, at a much reduced cost per sample (article). It is important to note that the extent of knowledge of the genes implicated in a disease needs to be understood. If only a small percentage of the genetics is accounted for, then a WES approach is likely to prove more successful.
An often ignored aspect of generating vast quantities of genomic information is data management, which comprises not just the storage but also the safeguarding, compliance, and usability of the information.
Research studies take years to complete, and there are regulations requiring long-term retention of genetic health data. Often there is a desire to keep the alignment files and not just the variant calls, so that comprehensive retrospective studies can be undertaken. Management and preservation of the data over the long term, in a cost-effective and compliant manner, while maintaining usability and accessibility, is not a trivial effort. The lifetime of the data is often longer than the lifetime of the medium it is stored on. Migrating the data to new servers requires processes that ensure data integrity, which in turn calls for specialist digital archiving knowledge. Reducing the genomic footprint will help to reduce the challenge.
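One basic building block of an integrity-preserving migration is comparing a cryptographic checksum of each file before and after the move. The sketch below uses SHA-256 from Python's standard library. The function names are hypothetical, and a real archive would also record the checksums alongside the data and re-verify them periodically, which is not shown here.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for very large alignment files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(source: Path, migrated: Path) -> bool:
    """Return True if the migrated file has the same SHA-256 digest as the source."""
    return file_sha256(source) == file_sha256(migrated)
```

Because every byte must be read on both sides, the cost of this check grows with the volume of data retained, which is another place where a smaller genomic footprint pays off.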
When discussing the data storage aspect of data management with researchers and clinicians, I often hear surprise that it should be expensive. After all, you can purchase terabytes of storage from Amazon for a couple of thousand dollars. However, the data that is generated in these studies is critical and so deserves an enterprise solution that gives on-site support and protects the data in case of a disk or hardware failure. An enterprise solution should ensure that you have a copy of your data in the event of a disaster.
Data protection and privacy regulations also need to be considered. The EU General Data Protection Regulation (GDPR) that came into force in May 2018 defined genetic data that may be used to identify an individual as personal data. Data from WGS and WES likely fall in this category. The consequence of GDPR for institutes across the world that use data from EU citizens is that there are various auditable requirements that they should comply with. A person enrolled in a genetic study using WGS or WES will have the following rights:
- Access: the individual will have the right to know if, how, and why you are processing their personal data
- Portability: individuals can obtain their data in a readable format and have it transferred to another organization free of charge
- Erasure (or right to be forgotten): individuals can ask for their personal data to be deleted once it has outlived its original purposes.
Smaller gene panels do not have the information to uniquely identify an individual, so they will probably not be deemed personal data.
There are many factors that need to be considered when selecting the most appropriate approach to a gene sequencing project. WGS is a very powerful tool, but the cost of such an approach appears to have stopped falling. A targeted approach is often a more suitable and cost-effective alternative, especially for clinical and specific clinical research use. As our knowledge of the genetics implicated in diseases increases, the need to use a broad net to detect mutations in a limited subset of genes lessens and so gene panels will become the way forward, reducing the cost and data management burden.