I Know Who You Are: Using ‘Private’ DNA Sequences To Identify People

Searching for ancestors online is a popular activity. So is having your DNA sequenced. But merging the two has created a problem: it’s very, very easy to use genealogy software and DNA sequence data to identify people who are supposed to be anonymous.

A spanner in the works

This means all sorts of data, like health history, race and ethnicity, and even paternity, aren’t really as confidential as they’re supposed to be. It also throws a wrench into scientists’ efforts to provide genetic data to any researcher, for free and without restriction.

Easy to find

Yaniv Erlich, a geneticist at the Whitehead Institute in Massachusetts, started looking at new tools to probe big DNA databases. He found that by using just a string of DNA code and a participant’s age (both available in the database), he could find people by matching that information against public genealogy sites. His work was published in the 18 January 2013 issue of Science.

Our ancestors

Erlich started by looking at the 1000 Genomes Project, an international study that collects DNA sequences, ages, and the regions where participants live, and posts that information on the internet. He found that tiny, inherited patterns in DNA on the Y chromosome, called ‘short tandem repeats’, could help identify the last names of men in the study. In fact, genealogy websites use these short tandem repeats to help men find other males with the same last name, in other words, likely relatives on their father’s side.
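To make the idea of a short tandem repeat concrete, here is a minimal sketch in Python. It counts how many times a short DNA motif is repeated back to back, which is the kind of number a genealogy site stores for each marker. The function name, the motif and the toy sequence are all made up for illustration; real tools work from sequencing reads and are far more involved.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # (?:MOTIF)+ matches one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Invented sequence: 7 copies of GATA, a short spacer, then 3 more copies.
toy_sequence = "ACGT" + "GATA" * 7 + "TTC" + "GATA" * 3 + "AC"

repeat_count = longest_repeat_run(toy_sequence, "GATA")
print(repeat_count)  # 7
```

Two men who inherit the same Y chromosome lineage will tend to have the same repeat counts at many such markers, which is why these counts can point to a shared last name.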

I know who you are…

Using the DNA sequence of someone well known as a test case (in this case, DNA sequencing pioneer Craig Venter), Erlich found that he could identify Venter solely by searching the DNA databases. He estimated that, using DNA sequences alone, he could find about 12 percent of the males in the project.

…and who your grandfather was!

He didn’t stop there. Erlich obtained the Y chromosome short tandem repeats of one male participant in the database. By focusing on men from Utah and knowing his participant’s age, he had the region, age and DNA sequence. Entering the Y chromosome short tandem repeats and looking for a match in two genealogy databases (he used Ysearch and SMGF, the largest free, public genealogy sites) quickly gave him the man’s grandfathers on both sides of his family. A quick Google search of the names turned up an obituary, which gave him the names of more relatives: people who were not actually part of the 1000 Genomes Project.
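The matching step described above can be sketched roughly as follows. This is a toy illustration in Python, not the method used in the study and not the interface of any real genealogy site: the record layout, the function names, the surnames and the repeat counts are all invented, and the scoring is a simple share of matching markers. In the study, age and region were then used to narrow down the candidates, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class GenealogyRecord:
    """Hypothetical database entry: a surname plus repeat counts per marker."""
    surname: str
    markers: dict


def match_fraction(query: dict, record: GenealogyRecord) -> tuple:
    """Return (fraction of shared markers that agree, number of shared markers)."""
    shared = query.keys() & record.markers.keys()
    if not shared:
        return 0.0, 0
    agree = sum(1 for name in shared if query[name] == record.markers[name])
    return agree / len(shared), len(shared)


def rank_surnames(query: dict, records: list, min_shared: int = 3) -> list:
    """Rank surnames by how well their stored markers agree with the query."""
    scored = []
    for record in records:
        fraction, shared = match_fraction(query, record)
        if shared >= min_shared:  # ignore records with too little overlap
            scored.append((fraction, record.surname))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented repeat counts for an anonymous participant.
query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

# Invented database entries with placeholder surnames.
database = [
    GenealogyRecord("Surname-B", {"DYS19": 15, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    GenealogyRecord("Surname-A", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    GenealogyRecord("Surname-C", {"DYS19": 14}),  # too few markers to score
]

ranked = rank_surnames(query_profile, database)
print(ranked)  # [(1.0, 'Surname-A'), (0.5, 'Surname-B')]
```

A top-ranked surname is only a lead. It becomes an identification when combined with other public details, such as the age and home state used in the Utah example.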

Collision course!

Erlich’s study has created a storm of sorts in research circles, as well as within government agencies tasked with managing DNA and other supposedly confidential health data. Recently, the White House issued recommendations on maintaining the privacy of DNA data while allowing its unrestricted use among researchers. Those two goals appear to be on a collision course.

Gymrek, M., et al. (2013). Identifying personal genomes by surname inference. Science, 339, 321-324. DOI: 10.1126/science.1229566.

