In the name of science, 2,500 people have participated in Harvard’s Personal Genome Project, anonymously sharing DNA data along with such information as histories of depression and sexually transmitted disease. But Harvard Privacy Lab director Latanya Sweeney demonstrates what “anonymous” really means. Of 579 participants who listed their birth dates, ZIP codes, and genders, her team was able to identify 42% by name using public records. Previously she was able to identify up to 87% of the U.S. population with just ZIP code, birth date, and gender.
Harvard Professor Re-Identifies Anonymous Volunteers In DNA Study
Harvard Professor Latanya Sweeney
A Harvard professor has re-identified the names of more than 40% of a sample of anonymous participants in a high-profile DNA study, highlighting the dangers that ever greater amounts of personal data available in the Internet era could unravel personal secrets.
From the outset, the Personal Genome Project, set up by Harvard Medical School Professor of Genetics George Church, has warned participants of the risk that someone someday could identify them, meaning anyone could look up the intimate medical histories that many have posted along with their genome data. That day arrived on Thursday.
Professor Latanya Sweeney, director of the Data Privacy Lab at Harvard, along with her research assistant and two students scraped data on 1,130 people of the now more than 2,500 who have shared their DNA data for the Personal Genome Project. Church’s project posts information about the volunteers on the Internet to help researchers gain new insights about human health and disease.
Their names do not appear, but the profiles list medical conditions including abortions, illegal drug use, alcoholism, depression, sexually transmitted diseases, medications and their DNA sequence.
Of the 1,130 volunteers Sweeney and her team reviewed, 579 provided zip code, date of birth and gender, the three key pieces of information she needs to identify anonymous people when combined with information from voter rolls or other public records. Of these, Sweeney succeeded in naming 241, or 42% of the 579. The Personal Genome Project confirmed that 97% of the names matched those in its database if nicknames and first name variations were included. She describes her findings here.
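For readers who want to see the mechanics, the kind of matching described above can be sketched in a few lines of Python. This is a minimal illustration, not Sweeney's actual method or code: the records are made up, and the field names and the `link` helper are hypothetical. It joins "anonymous" profiles to a public list on zip code, date of birth and gender, and names a profile only when exactly one public record shares all three values.

```python
from collections import defaultdict

# Hypothetical, made-up records for illustration; none are real people.
profiles = [
    {"profile_id": "P1", "zip": "02138", "dob": "1954-03-02", "gender": "F"},
    {"profile_id": "P2", "zip": "60601", "dob": "1970-11-15", "gender": "M"},
    {"profile_id": "P3", "zip": "94110", "dob": "1982-07-30", "gender": "F"},
]
voter_roll = [
    {"name": "Alice Example", "zip": "02138", "dob": "1954-03-02", "gender": "F"},
    {"name": "Bob Sample", "zip": "60601", "dob": "1970-11-15", "gender": "M"},
    {"name": "Ben Sample", "zip": "60601", "dob": "1970-11-15", "gender": "M"},
]


def key(record):
    """The three fields shared by both data sets."""
    return (record["zip"], record["dob"], record["gender"])


def link(profiles, public_records):
    """Return {profile_id: name} for profiles with exactly one public match."""
    names_by_key = defaultdict(list)
    for rec in public_records:
        names_by_key[key(rec)].append(rec["name"])

    matches = {}
    for profile in profiles:
        candidates = names_by_key.get(key(profile), [])
        # Zero candidates: no match. Two or more: ambiguous, so not named.
        if len(candidates) == 1:
            matches[profile["profile_id"]] = candidates[0]
    return matches


matches = link(profiles, voter_roll)
print(matches)
print(f"{len(matches)} of {len(profiles)} profiles named")
```

In this toy data only P1 is named: P2 shares its three values with two people on the public list, and P3 has no counterpart there at all.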
Sweeney has also set up a web page for anyone to test how unique their birthdate, gender and zip are in combination. When I tried it, I was the only match in my zip code, suggesting that I, like so many others, would be easy to re-identify. “This allows us to show the vulnerabilities and to show that they can be identified by name,” she said. “Vulnerabilities exist but there are solutions too.”
(Personal disclosure: I work closely with Professor Sweeney in the Harvard Department of Government on topics related to my book research on the business of personal data, but was not involved with this study).
On Thursday, researchers and participants in the Personal Genome Project gathered in Boston for a conference timed to mark the 60th anniversary of James Watson and Francis Crick’s publication of their discovery of the DNA double helix structure in April 1953. Sweeney and her research assistant set up a table at the conference where participants could find out whether they could easily be identified. Sweeney sought not to out the study participants, but rather to demonstrate to them how providing a little less information–for example, just birth year rather than exact birth date, and three digits rather than five or nine from the zip code–could help preserve anonymity for participants.
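The effect of sharing a little less can also be shown with a small sketch. The example below is an illustration under stated assumptions, with invented records and hypothetical helper names (`generalize`, `unique_count`): it trims each zip code to three digits and each birth date to the year, then counts how many records remain the only one with their combination of values.

```python
from collections import Counter

# Hypothetical, made-up (zip, date of birth, gender) records.
people = [
    ("02138", "1954-03-02", "F"),
    ("02139", "1954-09-17", "F"),
    ("02141", "1954-12-25", "F"),
    ("60601", "1970-11-15", "M"),
    ("60614", "1970-01-08", "M"),
]


def generalize(record):
    """Keep three zip digits and the birth year only."""
    zip_code, dob, gender = record
    return (zip_code[:3], dob[:4], gender)


def unique_count(records):
    """Count records whose combination of values appears exactly once."""
    counts = Counter(records)
    return sum(1 for rec in records if counts[rec] == 1)


exact_unique = unique_count(people)
coarse_unique = unique_count([generalize(rec) for rec in people])

print(f"Unique with full zip and birth date: {exact_unique} of {len(people)}")
print(f"Unique with 3-digit zip and birth year: {coarse_unique} of {len(people)}")
```

With full detail, every one of the five invented records is unique. After trimming, each record shares its values with at least one other, so none can be singled out on these three fields alone. Real populations are far larger and messier, so the sketch shows the principle rather than any measured rate.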
Several participants said they expected someone would one day re-identify them and said they were not particularly concerned. Volunteer Gabriel Dean said he was far more worried about another future threat forecast by the experiment, that one day criminals might be able to replicate DNA and place some at the scene of a crime. The conference took place a few blocks from the scene of the Boston Marathon bombing earlier this month.
Another “outed” participant, James Smith, a 59-year-old who lives outside Chicago, says he has an additional layer of protection because his name is so common. He said his genetic testing showed he had a greater possibility of developing Alzheimer’s disease than a typical person, but that he was “not worried about job discrimination, I’m not worried about health care.” Smith is independently wealthy after having sold his company to Yahoo. “I’m retired,” he said.
Volunteer Lenore Snyder, however, said that she did not want to be identified and as a result did not provide her zip code and some other identifying characteristics in her profile. She said her genetic testing suggests she has an intellectual disability, even though she is a molecular biologist with a PhD. “People don’t know how to interpret this,” she said. “It’s dangerous. A little bit of information is dangerous.”
Sweeney’s latest findings build on a 1997 study she did that showed she could identify up to 87% of the U.S. population with just zip code, birthdate and gender. She was also able to identify then Massachusetts Gov. William Weld from anonymous hospital discharge records.
The same techniques could be used to identify people in various surveys and records, pharmacy purchases, or from a wide variety of seemingly anonymous activities such as Internet searches.
Figuring out clues about people could also enable identity theft. “I believe that many people in the current interconnected digital world are not aware of how easy it is to identify them with a high level of granularity,” says Keith Batchelder, the founder of Genomic Healthcare Strategies in Charlestown, Massachusetts, and one of the first ten volunteers in the Personal Genome Project.
Church, who maintains a thick mountain-man beard, says that advances in data and in medicine make it impossible to guarantee anonymity for most medical experiment volunteers. Church has participated as a volunteer himself in past medical studies and scoffs at claims that such data can remain anonymous. Every year his university sends him an anonymous survey. He scribbles in some additional information at the beginning of the form. “My name is George Church, you could figure that out anyway,” he writes.
His Personal Genome Project makes no privacy promises at all. “The Personal Genome Project is a new form of public genomics research and, as a result, it is impossible to accurately predict all of the possible risks and discomforts that you might experience,” the 24-page consent form tells users. Later it specifies some possible risks: “The data that you provide to the PGP may be used, on its own or in combination with your previously shared data, to identify you as a participant in otherwise private and/or confidential research.”
Volunteers take an online exam about the risks they face before they are allowed into the program. And the test does not pose a universal “you do understand the risks” question. It has 20 questions, and Church requires a perfect score. Potential volunteers can take the test as many times as they want until they pass. One person took the test 90 times before passing.
Given what Church sees as the flaws in preserving privacy in the Internet age, he has embraced openness about many aspects of his own history. On his personal home page he posts the exact coordinates of his home, his birthdate and parents, medical problems (heart attack, carcinoma, narcolepsy, dyslexia, pneumonia, motion sickness) and even a copy of the 1976 letter booting him out of Duke University for getting an F in his graduate major subject.
Many of the early participants in the Personal Genome Project share the same ‘let it all hang out’ ethos. Volunteer Steven Pinker, a well-known experimental psychologist and author of the 2011 book “The Better Angels of Our Nature,” posts his genome and a 1996 scan of his brain on his web page. He says even data as in depth as his genome and medical records does not provide especially deep insights into a person.
“There just isn’t going to be an ‘honesty gene’ or anything else that would be nearly as informative as a person’s behavior, which, after all, reflects the effect of all three billion base pairs and their interactions together with chance, environmental effects, and personal history,” he says. “As for the medical records, I just don’t think anyone is particularly interested in my back pain.”
Could companies use medical information to single out people to deny them services? Might a bank, for example, turn down a loan to someone because their health records suggest they may die at a young age? Even though Church expected reidentification of his volunteers, he does not think so.
“These companies are not yet highly motivated to do that and probably judging from the way the wind’s blowing on the Genetic Information Nondiscrimination Act they would be ill advised to do that from a public relations standpoint,” he says, referring to the 2008 law.
In a different study released earlier this year, researcher Yaniv Erlich at the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, was also able to re-identify almost 50 people participating in a different genomic study. He said that he does not know of anyone who has suffered harm to date from such re-identifications, but pointed out the current ethical debate “emerged from the very bad history of the field in the first half of the 20th century, where bad genetic and abundance of records of familial genealogy contributed to one of the most horrific crimes.”
Misha Angrist, an assistant professor of the practice at the Duke Institute for Genome Sciences & Policy and one of the original ten to participate in the Personal Genome Project, praises the re-identification experiments by researchers such as Sweeney and Erlich. “It is a nuisance to scientists who are trying to operate under the status quo and to tell their participants with a straight face, you know, it’s very unlikely that you will be identified,” he says. “It is useful for pointing out that the emperor has no clothes, that absolute privacy and confidentiality are illusory.”