The single biggest asset in biobanking is data: it is what makes a bioresource so valuable to the research world. But it is also potentially the biggest headache. To realize its full potential, a biobank must make data available to researchers, which in turn can put its most important stakeholders, the donors, at risk. Releasing data opens the door to privacy breaches, threatening the confidentiality and trust of the very people to whom the biobank has a duty of care.
Biobankers face the following risks:
- Make data readily accessible and risk breaching privacy.
- Breach privacy and donors will no longer donate.
- Enforce privacy and reduce data availability.
For this seemingly circular catch-22 situation, Kuiper et al. (2015) propose a solution that could keep both researchers and donors happy [1].
Most statistical disclosure control strategies rely either on suppression, which removes personal information such as names and addresses to de-identify the data, or on straightforward obfuscation, in which institutions obscure individual identifiers by grouping values such as ages or postal codes into ranges. However, these measures can be insufficient for maintaining personal privacy, since simple combinations such as date of birth together with postal code are enough to reveal identity. Moreover, only a few single nucleotide polymorphisms (SNPs) can be enough to identify a person from DNA. Releasing data in secure enclaves with restricted access sounds promising, but it can skew results and lead to bias, thus limiting statistical analysis.
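To make the weakness of suppression and grouping concrete, here is a minimal Python sketch. The donor records, field names, and grouping rules (decade of birth, four-character postal prefix) are all invented for illustration and do not come from Kuiper et al. It counts how many records remain unique on the combination of birth year and postal code after each step.

```python
from collections import Counter

# Hypothetical donor records; every name and value is invented for illustration.
records = [
    {"name": "A. Jansen", "birth_year": 1961, "postal_code": "9711AB", "egfr": 82},
    {"name": "B. de Vries", "birth_year": 1964, "postal_code": "9711CD", "egfr": 74},
    {"name": "C. Bakker", "birth_year": 1978, "postal_code": "9712EF", "egfr": 91},
    {"name": "D. Visser", "birth_year": 1979, "postal_code": "9712GH", "egfr": 66},
    {"name": "E. Smit", "birth_year": 1985, "postal_code": "9713JK", "egfr": 58},
]


def suppress(record, fields=("name",)):
    """Suppression: drop direct identifiers outright."""
    return {k: v for k, v in record.items() if k not in fields}


def generalize(record):
    """Obfuscation by grouping: decade of birth and postal-code prefix."""
    out = dict(record)
    out["birth_year"] = f"{record['birth_year'] // 10 * 10}s"
    out["postal_code"] = record["postal_code"][:4]
    return out


def unique_combinations(rows, keys=("birth_year", "postal_code")):
    """Count rows whose combination of quasi-identifiers occurs exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)


suppressed = [suppress(r) for r in records]
grouped = [generalize(r) for r in suppressed]

print(unique_combinations(suppressed))  # 5: names are gone, yet every row is still unique
print(unique_combinations(grouped))     # 1: grouping helps, but one donor still stands out
```

In this toy data set, removing names leaves every record singled out by two ordinary fields, and even after grouping one donor remains unique, while the grouped values are already less useful for analysis.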
What Kuiper and co-authors introduce is a hybrid system that uses a synthetic data model to give researchers access to data exploration in a controlled environment that ensures donor privacy. In summary, biobanks build a synthetic data model from the existing data for researchers to interact with in order to build search and query parameters. This approach is currently in use by the United States Census Bureau but so far has not been applied to biobanking.
Kuiper et al. describe the steps involved in supplying data to researchers as follows:
- The researcher requests data and then obtains free access to allied metadata to create requests based on measured endpoints.
- The system creates a synthetic data set, supplying a link that the researcher can access.
- The researcher analyzes the synthetic data set, generating the code and tools necessary for running an identical query on the original data.
- The researcher submits the analysis code, which is then run privately on the original data set for evaluation.
- Once the system evaluates the results and, as an additional security measure, confirms the veracity of the request, it releases the results.
- The researcher receives the results generated for evaluation and makes inferences.
The analysis code is executed remotely, thus ensuring privacy for biobank donors.
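The workflow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform Kuiper et al. built: the records, the function names (`build_synthetic`, `run_remote`, `mean_creatinine_over_50`), the independent-normal synthetic model, and the "scalar summaries only" output check are all hypothetical stand-ins chosen to keep the example short.

```python
import random
import statistics

# Original records stay inside the biobank. Values are invented for illustration.
_ORIGINAL = [
    {"age": 45, "creatinine": 0.8},
    {"age": 52, "creatinine": 0.9},
    {"age": 58, "creatinine": 1.1},
    {"age": 61, "creatinine": 1.0},
    {"age": 67, "creatinine": 1.3},
    {"age": 49, "creatinine": 0.7},
]


def build_synthetic(rows, n, seed=0):
    """Assumed toy model: sample each column independently from a normal
    distribution fitted to the original column. Correlations are lost."""
    rng = random.Random(seed)
    params = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        params[column] = (statistics.mean(values), statistics.stdev(values))
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


def run_remote(analysis, min_rows=5):
    """Run submitted code on the original data and screen what comes back.
    The checks are hypothetical stand-ins for the system's evaluation step."""
    if len(_ORIGINAL) < min_rows:
        raise PermissionError("cohort too small to release results")
    result = analysis(_ORIGINAL)
    if not isinstance(result, (int, float)):
        raise PermissionError("only scalar summaries are released")
    return result


def mean_creatinine_over_50(rows):
    """Researcher's analysis, written and debugged against synthetic rows."""
    selected = [row["creatinine"] for row in rows if row["age"] > 50]
    return statistics.mean(selected) if selected else float("nan")


# The researcher explores the synthetic data set and drafts the analysis.
synthetic = build_synthetic(_ORIGINAL, n=100)
draft = mean_creatinine_over_50(synthetic)

# The same code is then submitted and executed remotely on the original data.
released = run_remote(mean_creatinine_over_50)
print(f"synthetic draft: {draft:.3f}, released result: {released:.3f}")
```

The researcher only ever sees `synthetic` and the screened number returned by `run_remote`; an analysis that tried to return raw rows would be rejected rather than released.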
Kuiper et al. tested the theoretical system using data from PREVEND (Prevention of Renal and Vascular End-stage Disease), a longitudinal cohort study based on the population of Groningen in the Netherlands, as a model data set.
In discussing the workflow proposed for the hybrid system, the researchers note that many analysts are suspicious of synthetic data because they see it as fake. Furthermore, for acceptance within the research community, the method must be reproducible, so that other teams can verify, repeat and publish results. Despite these concerns, Kuiper et al. believe that a hybrid disclosure system that uses synthetic data as surrogates could avoid potentially damaging privacy breaches.
In conclusion, the researchers acknowledge that although “models will always be an approximation,” they feel that the hybrid system offers a balance of practicality and confidentiality for biobank data control and disclosure.
1. Kuiper, J., et al. (2015) “The hybrid synthetic microdata platform: A method for statistical disclosure control,” Biopreservation and Biobanking, 13. doi: 10.1089/bio.2014.0069