Improving Biobank Data Curation with SORTA

Data curation in a biobank is a time-consuming process that requires expert knowledge. It is often performed retrospectively because of variation in data collection protocols, and it can involve matching original data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Diseases) and HPO (Human Phenotype Ontology). Pang et al. (2015)¹ have developed a computer-aided system, SORTA (system for ontology-based re-coding and technical annotation), which shows promise in partially automating the process to improve efficiency.

Using SORTA, the investigators recoded 5,210 unique entries for “physical exercise” in the LifeLines biobank and 315 unique entries for “physical symptoms” (including terms that are similar, but not the same) from the Dutch CINEAS coding system for metabolic diseases, which they matched to HPO. They first gathered requirements from the researchers who would use the tool and found that the following capabilities were necessary:

Comparable similarity score
Ability to import code system in ontology format
Ability to import code system in Excel format
Use of a lexical index to improve performance
Ability to code/recode data directly in the tool
Availability as online service
Support for partial matches
Ability to match complex data values
Ability to learn from curated datasets

SORTA works in the following way: Pang et al. combined Lucene, a high-performance, token-based search engine, with an n-gram-based algorithm. Lucene enabled them to retrieve suitable candidate codes for each value and sort them by how well they matched. The secondary n-gram-based algorithm allowed them to express similarity scores as percentages, which helps users judge the quality of a match and apply a uniform cut-off value. The workflow ran in this order: they uploaded coding systems or ontologies into Lucene, then had users create their own coding/recoding project by uploading a list of data values. Each item in the short list of matching concepts for a value was then compared with that value using the n-gram-based algorithm, which normalized similarity scores to a range of 0 to 100%. The users could then apply a cut-off value based on the percentage similarity, allowing the system to automatically accept matches that scored above it.
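
The second stage of this pipeline can be illustrated with a small sketch. It is not SORTA's actual implementation: it assumes character bigrams and a Dice-style overlap, and the function names (bigrams, similarity_percent, rank_candidates) and the default cut-off are hypothetical choices made for illustration. The candidate list stands in for the short list that a search engine such as Lucene would return.

```python
from collections import Counter


def bigrams(text):
    """Return character bigrams for each lower-cased word, with boundary markers."""
    grams = []
    for word in text.lower().split():
        padded = f"^{word}$"  # mark word start and end so short words still yield bigrams
        grams.extend(padded[i:i + 2] for i in range(len(padded) - 1))
    return grams


def similarity_percent(a, b):
    """Dice-style bigram overlap between two strings, scaled to 0-100."""
    ga, gb = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection of bigrams
    return 100.0 * 2 * shared / total


def rank_candidates(value, candidates, cutoff=90.0):
    """Score each candidate against the value and flag those at or above the cut-off.

    The cut-off of 90.0 is an arbitrary illustrative default, not a value from the paper.
    """
    scored = sorted(((similarity_percent(value, c), c) for c in candidates), reverse=True)
    return [(c, round(s, 1), s >= cutoff) for s, c in scored]


if __name__ == "__main__":
    for row in rank_candidates("hearing loss", ["Vision loss", "Hearing loss"]):
        print(row)
```

Because the score is computed per word and ignores word order, a value such as "loss hearing" would score the same as "hearing loss" in this sketch. Whether that behavior is desirable depends on the coding system being matched.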

In the first instance, Pang et al. tested SORTA on unstructured data from the Healthy Obese Project (HOP) in the LifeLines biobank. The researchers needed to match free text fields from a questionnaire to an existing coding system, the Ainsworth compendium of physical activities. User evaluations suggested that the researchers considered the tool useful as long as the correct match appeared among the top 10 suggested codes. Otherwise, based on their experience, users changed the query in the tool to update the matching results. The investigators were able to improve SORTA’s performance by reusing manually curated data from the previous coding round. As a result, recall/precision at rank 1 increased from 0.59/0.65 to 0.97/0.98, and at rank 10 changed from 0.79/0.14 to 0.98/0.11. At the end of the coding task, SORTA captured about 97% of correct matches at rank 1, so users only needed to look at the first candidate match.
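
To make the rank-based figures concrete, the sketch below computes recall and precision at rank k under one plausible reading: recall is the share of data values whose correct code appears in the top k suggestions, and precision is the number of correct codes divided by the number of suggestions shown. This definition is an assumption (it would explain why precision at rank 10 stays low when each value has a single correct code), and the function name and toy data are hypothetical.

```python
def recall_precision_at_k(suggestions, gold, k):
    """Compute (recall, precision) at rank k.

    suggestions: dict mapping each data value to its ranked list of candidate codes
    gold:        dict mapping each data value to its single correct code
    """
    hits = 0   # values whose correct code appears in the top k
    shown = 0  # total number of candidates shown across all values
    for value, correct_code in gold.items():
        top = suggestions.get(value, [])[:k]
        shown += len(top)
        if correct_code in top:
            hits += 1
    recall = hits / len(gold) if gold else 0.0
    precision = hits / shown if shown else 0.0
    return recall, precision


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    suggestions = {
        "jogging": ["running", "walking"],
        "swim": ["cycling", "swimming"],
    }
    gold = {"jogging": "running", "swim": "swimming"}
    print(recall_precision_at_k(suggestions, gold, 1))
    print(recall_precision_at_k(suggestions, gold, 2))
```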

In the second instance, Pang et al. tested SORTA on recoding data from the CINEAS clinical symptom list to the HPO. Currently, researchers use written notes or an equivalent to track candidate terms, which is a highly time-consuming and error-prone method. They found 89% to be a good cut-off value for CINEAS matching because, above this value, all of the suggested matches were correct (100% precision).
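
One way to arrive at such a cut-off from a manually checked sample is sketched below. It assumes each suggested match has been labeled correct or incorrect by a curator, and it returns the lowest similarity score at or above which every match is correct. The function name and the sample scores are hypothetical, and this is not necessarily how the authors chose their value.

```python
def lowest_perfect_cutoff(scored_matches):
    """Return the lowest score at or above which all matches are correct.

    scored_matches: iterable of (similarity_percent, is_correct) pairs.
    Returns None if no threshold gives 100% precision.
    """
    scored_matches = list(scored_matches)
    wrong = [score for score, ok in scored_matches if not ok]
    highest_wrong = max(wrong) if wrong else float("-inf")
    # Only scores strictly above the best-scoring wrong match are safe to auto-accept.
    safe = [score for score, ok in scored_matches if score > highest_wrong]
    return min(safe) if safe else None


if __name__ == "__main__":
    # Invented sample: scores paired with a curator's verdict.
    sample = [(100, True), (95, True), (89, True), (85, False), (80, True)]
    print(lowest_perfect_cutoff(sample))
```

In this sketch a correct match that scores below an incorrect one (the 80% entry above) falls under the cut-off and would still need manual review.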

The investigators further compared SORTA to BioPortal Annotator and ZOOMA. SORTA outperformed these other tools. Using SORTA, they were able to retrieve all of the ontology matches, including complex matches with multiple words. Additionally, SORTA recalled at least 99.6% of the existing matches with 100% precision across all three matching experiments.

Overall, SORTA provided significant improvements in speed and quality compared to existing protocols used for data pooling. The researchers have also noted areas where they can continue to improve SORTA. In particular, they intend to add the option for users to choose which algorithm is used to sort the matching results. They also plan to include additional resources such as WordNet for query expansion to increase the chance of finding correct matches from ontologies or coding systems. Finally, they plan to publish mappings as linked data, for example as nanopublications, so they can be easily reused.

Reference

1. Pang et al. (2015) “SORTA: A system for ontology-based re-coding and technical annotation of biomedical phenotype data,” Database (pp. 1-13).