Scale and the Promise of Genomic Science
Sequencing the first human genome required more than ten years of effort and cost $3 billion. In the last decade, the cost has decreased by approximately six orders of magnitude — today, a human genome can be sequenced in a week for less than $5000. As costs have become less prohibitive, researchers in academia, public health, and the pharmaceutical industry have sequenced more and more individuals and fruitfully explored many of the associations between genetics and disease.
The strongest scientific results, however, have been in relatively simple Mendelian disorders. Studies of more complex diseases for which genetic predictors have been identified have suffered from a lack of scale. Sample sizes are simply too small to draw meaningful scientific conclusions. Many researchers have argued that genome studies will yield meaningful scientific results only when study sizes have expanded far beyond their current level and incorporate much more data from much more diverse populations.
Access versus Privacy?
Collecting such a large set of samples is beyond the capability of most research projects and organizations. As a result, some researchers have suggested that open, public databases, filled with anonymized genome data from consenting subjects, are the only viable way to reach the scale required for true statistical power. Unfortunately, recent studies have shown that anonymization by itself is not sufficient to preserve subject privacy. In a widely publicized article in Science, one research group detailed the method by which they were able to personally identify fifty individuals in the 1000 Genomes Project database (the most prominent of the public databases) using side information obtained from ancestry records elsewhere online. When breaches of medical confidentiality routinely lead to millions of dollars in fines for larger institutions, the risk of managing public databases may prove too great.
If it is true that (a) the statistical power of genomic studies is limited by small sample sizes and a lack of subject diversity, and (b) even anonymized public databases cannot protect the privacy of subjects, then it would appear that researchers in this field are faced with a stark choice between openness and privacy. This is a false dichotomy — there is another way forward that allows for open access and yet mitigates the worst effects of re-identification attacks.
The great virtue of open public databases is that they make data accessible to the broadest possible set of stakeholders, and this openness is absolutely essential to the progress of science. The privacy risks outlined above depend primarily on an attacker’s ability to combine genetic data with other information in order to erode data anonymity.
One solution to this problem is to limit access to the information to trusted actors: people and institutions that have been screened in advance and can be trusted to respect the privacy interests of subjects. Once a particular researcher is labeled as “trusted”, they have full access to the information they request. This solution depends on an oversimplified view of trust, which is a far more nuanced concept. For example, can a trusted researcher email a patient’s private data to a colleague whom he trusts? Could he legitimately email a subset of that information? How large a subset would be acceptable? What are his ethical obligations with respect to managing the computing infrastructure on which these data are stored? These scenarios play out hundreds of times per day in the research community. The subject, who has a legitimate interest in the use of his information, is not part of the discussion.
The most significant problem with the gatekeeper/trusted actor approach is that it expects the users of the data to enforce the policies that govern the use of the data. Once information has been revealed, the burden is on the recipient to act with discretion, a requirement that is often in conflict with the recipient’s desire to accomplish his scientific task by sharing that information. Fundamentally, it is difficult to govern the behavior of people.
On the other hand, people increasingly interact with private data through the medium of a computer program, and computer programs are eminently governable. A computer program running in a virtualization container can be restricted in its network access, for example, so that it is unable to forward private information to a third-party endpoint even if the program contains instructions to do so. Or suppose a subject wished to make his potential Alzheimer’s risk factors inaccessible (as Dr. James Watson requested for his own published genome). An intermediary process governing a bioinformatics program’s access to regions of that subject’s genome solves this problem simply and elegantly. Somewhat counterintuitively, the introduction of cloud computing technologies into genetics actually serves the cause of privacy rather than undermining it.
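As a rough illustration of the intermediary idea described above, the following Python sketch wraps a subject's sequence data in a proxy that refuses any read overlapping a region the subject has blocked. The names `Region`, `GenomeAccessProxy`, and `AccessDenied`, the toy coordinates, and the in-memory string storage are all assumptions made for this example; they are not drawn from any real system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A half-open interval [start, end) on one chromosome (hypothetical type)."""
    chrom: str
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        # Two half-open intervals overlap when each starts before the other ends.
        return (self.chrom == other.chrom
                and self.start < other.end
                and other.start < self.end)


class AccessDenied(Exception):
    """Raised when a program requests data the subject has blocked."""


@dataclass
class GenomeAccessProxy:
    """Hypothetical intermediary between an analysis program and raw sequence data."""
    sequences: dict                       # chromosome name -> sequence string (toy storage)
    blocked: list = field(default_factory=list)

    def read(self, region: Region) -> str:
        # Refuse the whole request if it touches any blocked region.
        for b in self.blocked:
            if region.overlaps(b):
                raise AccessDenied(f"{region} overlaps a region blocked by the subject")
        return self.sequences[region.chrom][region.start:region.end]


# Placeholder data: a ten-base "chromosome" with positions 4..7 blocked.
proxy = GenomeAccessProxy(
    sequences={"chr19": "ACGTACGTAC"},
    blocked=[Region("chr19", 4, 8)],
)
```

In this sketch the analysis program only ever holds a reference to `proxy`, never to the underlying sequence, so the subject's restriction is enforced by the intermediary and not by the goodwill of the program's author.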
genecloud is a trusted cloud service for storing and analyzing genetic sequences and other medical information. The project is specifically designed to address issues of genetic privacy by allowing researchers to interact with data through computer programs — trusted analytics — that can be managed across several dimensions:
- They may be restricted from accessing some parts of a subject’s genome altogether
- They may be required to ask for the subject’s permission to use certain data, or for use by a particular researcher
- The computations may require that third-party certifications (e.g. FDA, CLIA, AMA) be attached
- Programs may be required to randomize data or statistically average it into a larger dataset to protect privacy
- Accesses to (and transport of) data across different legal jurisdictions can be managed
- All accesses can be carefully audited for subsequent forensic purposes
- The interests of different stakeholders in the data (subject, physician, researcher, governments) can be enforced
None of these features are compatible with a model in which raw data is simply provided directly to researchers and other users.
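To make a few of the dimensions listed above more concrete, here is a minimal Python sketch that combines three of them: per-researcher subject permission, release of only a statistical average over a sufficiently large cohort, and an audit trail of every access. The names `SubjectRecord`, `GovernedDataset`, `mean_for`, and the `min_cohort` threshold are hypothetical and chosen only for illustration; this is not a description of how genecloud itself is built.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class SubjectRecord:
    """One subject's derived value plus the researchers that subject has approved."""
    subject_id: str
    value: float                                   # placeholder per-subject measurement
    consented_researchers: set = field(default_factory=set)


@dataclass
class GovernedDataset:
    """Hypothetical governed store: programs get aggregates, never raw records."""
    records: list
    min_cohort: int = 3                            # assumed threshold for releasing an average
    audit_log: list = field(default_factory=list)

    def mean_for(self, researcher: str) -> float:
        # Use only the records whose subjects granted this researcher permission.
        permitted = [r for r in self.records
                     if researcher in r.consented_researchers]
        released = len(permitted) >= self.min_cohort
        # Record every attempt, successful or not, for later forensic review.
        self.audit_log.append({
            "researcher": researcher,
            "subjects_used": len(permitted),
            "released": released,
        })
        if not released:
            raise PermissionError("cohort too small to release an average")
        return mean(r.value for r in permitted)


# Placeholder data for illustration only.
dataset = GovernedDataset(records=[
    SubjectRecord("s1", 1.0, {"alice"}),
    SubjectRecord("s2", 2.0, {"alice"}),
    SubjectRecord("s3", 3.0, {"alice", "bob"}),
    SubjectRecord("s4", 10.0, {"bob"}),
])
```

Here a request from `"alice"` covers three consenting subjects and returns their average, while a request from `"bob"` covers only two and is refused; both attempts are appended to `audit_log`. Adding random noise to the released value would be a further step that this sketch leaves out.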
At the same time, genecloud provides a level of open access that rivals that of open public databases. Anyone, anywhere in the world, with or without access to the massive financial resources required to sequence and store genomes, may submit programs to genecloud and seek permission to run those programs on real data. By striking a balance between openness and privacy, genecloud helps to democratize access to genetic data and enables the community to harness the vast potential of genetic data to treat and prevent human disease.
- Ioannidis, JPA et al, Assessment of cumulative evidence on genetic associations: interim guidelines, International Journal of Epidemiology, 37(1), September 2007.
- Cecile, A. et al, Genome-based prediction of common diseases: advances and prospects, Human Molecular Genetics, 22(8), April 2013.
- Bustamante, C. et al, Genomics for the world, Nature, 475, July 2011.
- Lunshof, J. et al, From Genetic Privacy to Open Consent, Nature Reviews Genetics, 9, May 2008.
- Gymrek, M. et al, Identifying Personal Genomes by Surname Inference, Science, 339 (6117), January 2013.
- Sack, K., Patient Data Posted Online in Major Breach of Privacy, New York Times, September 8, 2011.
- DNA databases shut after identities compromised, Nature, 455, September 2008.
- Nyholt, D. et al, On Jim Watson’s APOE status: genetic information is hard to hide, European Journal of Human Genetics, 17(2), February 2009.