Biobank Data Integration: Semi-Automated

To achieve sufficient statistical power, researchers frequently need to pool data from multiple biobanks, particularly in rare-disease research. However, integrating data can be time-consuming because biobanks differ in their data collection protocols and questionnaires. Pang et al. (2016) developed a program, MOLGENIS/connect, to overcome this problem and streamline biospecimen research processes.¹

MOLGENIS/connect is a semi-automatic system that can find, match and pool data from multiple sources. To begin with, Pang et al. implemented a metadata model component that allows users to upload, view and visualize both the data of the source biobanks and the target DataSchemas, which are lists of target variables that researchers need to include to address their specific research question. Pang et al.’s flexible meta-model, Entity Model Extensible (EMX), requires only two types of information (entity and attribute) to describe a data set. Entities are table definitions that group attributes as columns and hold the data. Attributes are features that can be observed, such as disease, gender and height. In a manual search, a researcher would typically have to go through all data attributes of all biobanks. MOLGENIS/connect instead combines the Lucene information retrieval library with query expansion to automatically shortlist good candidate attributes.
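The shortlisting idea can be illustrated with a minimal Python sketch. It is not the MOLGENIS/connect implementation: the `Entity` and `Attribute` classes only mirror the two EMX concepts described above, the `SYNONYMS` table is a hypothetical stand-in for whatever expansion source the real system uses, and a simple token-overlap score replaces Lucene's ranking.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    """An observable feature (a column), e.g. height or gender."""
    name: str
    label: str


@dataclass
class Entity:
    """A table definition that groups attributes as columns."""
    name: str
    attributes: List[Attribute] = field(default_factory=list)


# Hypothetical synonym table used for query expansion (assumption).
SYNONYMS = {
    "height": {"stature", "length"},
    "weight": {"mass"},
    "gender": {"sex"},
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expand(tokens):
    """Add known synonyms of each token to the query."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded


def shortlist(target, source, top_n=3):
    """Rank source attributes by token overlap with the expanded target label."""
    query = expand(tokenize(target.label))
    scored = []
    for attr in source.attributes:
        doc = tokenize(attr.label)
        overlap = len(query & doc)
        if overlap:
            # Jaccard similarity between expanded query and attribute label.
            scored.append((overlap / len(query | doc), attr.name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


source = Entity("biobank_a", [
    Attribute("ht_cm", "Stature in cm"),
    Attribute("wt", "Body mass in kg"),
    Attribute("sex", "Sex of participant"),
])

print(shortlist(Attribute("height", "Height"), source))
```

Without the expansion step, the target label "Height" shares no token with "Stature in cm" and the candidate would be missed, which is the gap query expansion is meant to close.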

Units of measurement also differ between sources: some databases use centimeters while others use meters, and similar differences exist for other units. MOLGENIS/connect uses a newly developed two-step method for converting units, designed so that composite or derived units such as kg/m² are also recognized.
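The article does not spell out the two steps, so the following Python sketch rests on an assumption: step one detects a unit in an attribute label, and step two derives a conversion factor between the source and target units. The `BASE_UNITS` table and all function names are hypothetical, and only ratio-style composite units such as kg/m² are handled.

```python
import re

# Hypothetical unit table: symbol -> (dimension, factor to the SI base unit).
BASE_UNITS = {
    "m": ("length", 1.0),
    "cm": ("length", 0.01),
    "mm": ("length", 0.001),
    "kg": ("mass", 1.0),
    "g": ("mass", 0.001),
}


def parse_unit(unit):
    """Parse 'cm' or 'kg/m^2' into (factor_to_si, {dimension: exponent})."""
    unit = unit.replace("²", "^2").replace(" ", "")
    factor, dims = 1.0, {}
    for index, part in enumerate(unit.split("/")):
        sign = 1 if index == 0 else -1  # parts after '/' are denominators
        symbol, _, exponent = part.partition("^")
        power = sign * int(exponent or 1)
        dimension, to_si = BASE_UNITS[symbol]  # KeyError for unknown symbols
        factor *= to_si ** power
        dims[dimension] = dims.get(dimension, 0) + power
    return factor, {d: p for d, p in dims.items() if p}


def detect_unit(label):
    """Step 1 (assumed): return the first label token that parses as a unit."""
    # Note: single-letter symbols such as 'm' or 'g' can match ordinary words.
    for token in re.findall(r"[^\s()\[\],;]+", label):
        try:
            parse_unit(token)
        except (KeyError, ValueError):
            continue
        return token
    return None


def conversion_factor(source_unit, target_unit):
    """Step 2 (assumed): factor that turns source values into target values."""
    source_factor, source_dims = parse_unit(source_unit)
    target_factor, target_dims = parse_unit(target_unit)
    if source_dims != target_dims:
        raise ValueError(f"incompatible units: {source_unit} vs {target_unit}")
    return source_factor / target_factor


source_unit = detect_unit("Height (cm)")
height_in_m = 172 * conversion_factor(source_unit, "m")
print(source_unit, height_in_m)
```

Reducing each unit to a factor plus a dimension signature is what lets the same code treat a simple unit and a composite one alike, and it rejects nonsensical conversions such as centimeters to kilograms.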

A typical workflow is as follows:

1. Users upload a target DataSchema and the source biobank data.
2. Users create a mapping project and select the target DataSchema and the data sources.
3. MOLGENIS/connect automatically generates all matches and conversion algorithms for all data sources and all target attributes.
4. Users curate each of the matches and algorithms using the algorithm editor and preview tool.
5. MOLGENIS/connect generates the integrated data set.
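
The final step of this workflow can be pictured with a small Python sketch. Everything in it is illustrative: the source names, column names and the `MAPPINGS` table are invented, and plain Python functions stand in for the curated algorithms that MOLGENIS/connect would hold for each source and target attribute.

```python
# Target DataSchema: the variables the researcher wants in the pooled set.
TARGET_SCHEMA = ["height_m", "gender", "weight_kg"]

# Hypothetical source data from two biobanks with different conventions.
SOURCES = {
    "biobank_a": [{"ht_cm": 180, "sex": "1"}],
    "biobank_b": [{"height": 1.65, "gender": "Female"}],
}

# Hypothetical curated algorithms: source -> target attribute -> function.
MAPPINGS = {
    "biobank_a": {
        "height_m": lambda row: row["ht_cm"] / 100,
        "gender": lambda row: {"1": "male", "2": "female"}[row["sex"]],
    },
    "biobank_b": {
        "height_m": lambda row: row["height"],
        "gender": lambda row: row["gender"].lower(),
    },
}


def integrate(sources, mappings, target_schema):
    """Apply each source's algorithms and pool the results into one list."""
    pooled = []
    for source_name, rows in sources.items():
        algorithms = mappings[source_name]
        for row in rows:
            record = {"source": source_name}
            for attribute in target_schema:
                algorithm = algorithms.get(attribute)
                # A missing algorithm leaves a gap for manual curation.
                record[attribute] = algorithm(row) if algorithm else None
            pooled.append(record)
    return pooled


pooled = integrate(SOURCES, MAPPINGS, TARGET_SCHEMA)
for record in pooled:
    print(record)
```

Target attributes with no algorithm, such as `weight_kg` here, come out empty, which corresponds to the cases where a mapping still has to be written by hand.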

Pang et al. evaluated MOLGENIS/connect on 184 BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) matches. It generated useful matches and algorithms in 73% of the cases, while only 11% still needed to be created manually. Users can use these auto-generated algorithms to rapidly design and execute the integration via a user-friendly web application. The application and source code are available as open source via the MOLGENIS software suite at http://github.com/molgenis/molgenis.

Reference

1. Pang, C., et al. (2016) “MOLGENIS/connect: A system for semiautomatic integration of heterogeneous phenotype data with applications in biobanks,” Bioinformatics, pii: btw155. [Epub ahead of print]