When Anonymized Data Isn’t and Why Expertise Matters in Privacy

If you start looking into organizations’ disclosures of their data and privacy policies, it won’t take you long to find a statement like “we make sure any personally identifiable information (also known as PII) released is completely anonymized.” In other words, if the organization holds data that could be used to identify, locate, or contact a particular person and wants to release that data to a third party, it will first obfuscate the data so that it can’t be traced back to that person.

Organizations commonly release anonymized data to third parties when the data, analyzed in aggregate, can yield useful information. One such case arose when a data solutions architect asked the New York City government to release trip data from New York taxi cabs, hoping to use it to map out useful information about the city. The city government provided the data as requested and, of course, anonymized it so that it couldn’t be traced back to individual taxi drivers.

However, it could be. When Vijay Pandurangan, the founder and CEO of a security software company, got hold of the data, he was able to “de-anonymize” it (i.e. turn it back into PII) within an hour (link to https://medium.com/@vijayp/f6bc289679a1). The original data included taxi medallion numbers, which had been transformed into anonymous-looking values using a cryptographic hash function called MD5. Mr. Pandurangan took advantage of the fact that medallion numbers follow a structure that allows relatively few possible values, which made it relatively easy to reconstruct the actual medallion numbers from the “anonymized” data using computational techniques. Had he been so inclined, he could then have tracked any individual New York taxi driver’s daily drives throughout the city.
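
To see why hashing a small, structured set of identifiers offers so little protection, consider the following minimal Python sketch. It assumes a hypothetical medallion format of one digit, one letter, and two digits (for example “7B42”); that format, the helper names, and the sample value are illustrative assumptions, not details taken from the released data. Under that assumption there are only 26,000 possible medallions, so every one of them can be hashed in advance and the “anonymized” values simply looked up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value):
    """Return the MD5 digest of an ASCII string as lowercase hex."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every identifier in the ASSUMED format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup():
    """Map each possible hash back to the identifier that produced it."""
    return {md5_hex(candidate): candidate for candidate in candidate_ids()}


# Precompute the full table once: 10 * 26 * 10 * 10 = 26,000 entries.
lookup = build_lookup()

# Pretend this hash came from a released, "anonymized" record.
# "7B42" is a made-up sample value, not a real medallion.
anonymized = md5_hex("7B42")

# Reversing the anonymization is now a single dictionary lookup.
recovered = lookup[anonymized]
print(recovered)
```

A table of this size should take only a fraction of a second to build on typical hardware. A larger identifier space lengthens the precomputation, but as long as every valid value can be enumerated, a plain hash does not hide it.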

Mr. Pandurangan pointed out that security experts had been questioning the method New York City used to anonymize the data for at least a year. And in an article about Mr. Pandurangan’s work on the technology news site Ars Technica (link to http://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/), the author pointed out three other incidents in which anonymized data had also been compromised.

So, the next time you read about someone promising to anonymize personally identifiable information, remember that the data may not remain anonymous unless it is handled by people with the appropriate security expertise.