Study: Anonymized Data Cannot Be Totally Anonymous. And 'Homomorphic Encryption' Explained
Thursday, September 19, 2019
Many online users have encountered situations where companies collect data with the promise that it is safe because the data has been anonymized -- all personally identifiable data elements have been removed. How safe is this really? A recent study reinforced earlier findings that it isn't as safe as promised. Anonymized data can be de-anonymized, or re-identified, to individual persons.
"... data can be deanonymised in a number of ways. In 2008, an anonymised Netflix data set of film ratings was deanonymised by comparing the ratings with public scores on the IMDb film website; in 2014, the home addresses of New York taxi drivers were uncovered from an anonymous data set of individual trips in the city; and an attempt by Australia’s health department to offer anonymous medical billing data could be reidentified by cross-referencing “mundane facts” such as the year of birth for older mothers and their children, or for mothers with many children. Now researchers from Belgium’s Université catholique de Louvain (UCLouvain) and Imperial College London have built a model to estimate how easy it would be to deanonymise any arbitrary dataset. A dataset with 15 demographic attributes, for instance, “would render 99.98% of people in Massachusetts unique”. And for smaller populations, it gets easier..."
According to the U.S. Census Bureau, the population of Massachusetts was about 6.9 million on July 1, 2018. How did this de-anonymization problem happen? Scientific American explained:
"Many commonly used anonymization techniques, however, originated in the 1990s, before the Internet’s rapid development made it possible to collect such an enormous amount of detail about things such as an individual’s health, finances, and shopping and browsing habits. This discrepancy has made it relatively easy to connect an anonymous line of data to a specific person: if a private detective is searching for someone in New York City and knows the subject is male, is 30 to 35 years old and has diabetes, the sleuth would not be able to deduce the man’s name—but could likely do so quite easily if he or she also knows the target’s birthday, number of children, zip code, employer and car model."
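The private-detective scenario above can be sketched in a few lines of code. This toy example uses entirely made-up records (names, employers, and attribute values are all hypothetical) to show how a handful of quasi-identifiers shrinks the candidate pool from several people to exactly one:

```python
# A toy illustration (hypothetical data) of how a few quasi-identifiers
# can single out one record in an "anonymized" dataset.
records = [
    # (sex, age, condition, birthday, children, zip, employer, car)
    ("M", 32, "diabetes", "03-14", 2, "10027", "AcmeCo",  "Civic"),
    ("M", 33, "diabetes", "07-02", 0, "10013", "BetaLLC", "Camry"),
    ("M", 31, "diabetes", "07-02", 2, "10013", "AcmeCo",  "Civic"),
    ("F", 34, "asthma",   "01-09", 1, "10027", "BetaLLC", "Prius"),
]

FIELDS = ("sex", "age", "condition", "birthday", "children", "zip", "employer", "car")

def match(rec, **known):
    # Return True if the record agrees with every attribute the searcher knows.
    return all(rec[FIELDS.index(k)] == v for k, v in known.items())

# Knowing only sex and condition leaves several candidates -- still anonymous.
coarse = [r for r in records if match(r, sex="M", condition="diabetes")]
print(len(coarse))  # 3 candidates

# Adding birthday, children, zip, employer, and car narrows it to one record.
fine = [r for r in records if match(r, sex="M", condition="diabetes",
                                    birthday="07-02", children=0,
                                    zip="10013", employer="BetaLLC", car="Camry")]
print(len(fine))  # exactly 1 -- the record is re-identified
```

The point is not the specific attributes but the multiplication: each extra attribute divides the candidate pool, so even a modest number of "mundane facts" is enough to make most people unique.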
Data brokers, including credit-reporting agencies, have collected a massive number of demographic data attributes about every person. According to this 2018 report, Acxiom has compiled about 5,000 data elements for each of 700 million persons worldwide.
It's reasonable to assume that credit-reporting agencies and other data brokers have similar capabilities. So, data brokers' massive databases can make it relatively easy to re-identify data that was supposedly anonymized. This means consumers don't have the privacy promised.
What's the solution? Researchers suggest that data brokers must develop new anonymization methods, and rigorously test them to ensure anonymization truly works. And data brokers must be held to higher data security standards.
Any legislation serious about protecting consumers' privacy must address this, too. What do you think?
The Editor once again brings us the realization that anonymized data isn't. This is no secret or new revelation, as leading academic departments of computer science, statistics, and business have shown that anonymized data can be analyzed to reveal individuals.
But there may be a statistical technique that can truly shield individuals’ identities in an aggregated database, so that information on patterns can be extracted from that database without revealing the individuals within it. That technique is Differential Privacy. See Differential Privacy at https://en.m.wikipedia.org/wiki/Differential_privacy and https://privacytools.seas.harvard.edu/differential-privacy and https://privacytools.seas.harvard.edu/research/differential-privacy.
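To give a flavor of how Differential Privacy works, here is a minimal sketch of the Laplace mechanism, one of its classic building blocks: calibrated random noise is added to an aggregate query so the answer stays useful in aggregate, while any single individual's presence or absence is statistically hidden. The dataset and the epsilon value below are assumptions for illustration only:

```python
import random

def laplace_noise(scale):
    # Laplace(0, scale) drawn as the difference of two exponentials with mean `scale`.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon=0.5):
    # A counting query has sensitivity 1: adding or removing one person changes
    # the true count by at most 1, so the noise scale is 1/epsilon.
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: how many patients in a database have diabetes?
patients = ["diabetes", "asthma", "diabetes", "flu", "diabetes"]
noisy = private_count(patients, lambda c: c == "diabetes", epsilon=0.5)
print(round(noisy))  # randomized, but clustered around the true count of 3
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means more accuracy and weaker privacy. Either way, the analyst gets only a noisy aggregate, never any individual's row.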
If Differential Privacy works as it appears to, the question is: why is Apple the only major tech company to use it? Facebook, Google, et al., who are always going on and on about how they value our privacy and do so much to protect it by, inter alia, ineffectively anonymizing our aggregate data, apparently eschew the one technique, Differential Privacy, that could actually and effectively anonymize our identifying data. Differential Privacy is no secret: Tim Cook presented it at a big Apple presentation a few years ago, discussing how Apple was testing Differential Privacy and intended to implement it broadly to effectively anonymize data, but in ways that would permit Apple to extract useful aggregate information from aggregate data.
So why hasn’t Differential Privacy been broadly adopted? Precisely because it does conceal individual identities. Differential Privacy limits what can be extracted from aggregate databases to aggregate information, with individual data effectively hidden. Well, that could mean less money for Facebook, Google, and their ilk, as it would limit them to valuable, but less valuable, aggregate data, instead of ineffectively anonymized data, which is so valuable because it can be effectively and practically analyzed to provide tech firms or their customers information about us to whatever level of detail is needed, including revealing individual characteristics and identities.
So those, other than Apple, don’t use Differential Privacy because it takes significant money off the table, which, unlike Apple, they can’t offset with sales of hardware, services, and content. Apple has those offsets yet is willing to leave significant money on the table for the sake of privacy, while the others aren’t, either because they don’t have those offsetting revenues or are just greedy, or both.
Posted by: Chanson de Roland | Friday, September 20, 2019 at 03:13 PM