The Power of Data
As our lives become more centered around technology, we create more and more data. Collectors of this data often lead us to believe that it is generic, that no personal information is identifiable, or that it’s “just metadata” and presents no real threat. However, data is powerful: it can be used to identify problems and provide solutions, such as uncovering criminal networks involved in money laundering and combating domestic extremism. Beyond these large-scale operations, data can also be leveraged to identify you.
With Great Power Comes Great Responsibility
Today, the movement of people all around the world can be tracked through app location data. In 2019, the New York Times investigated how this data can be leveraged by analyzing over 50 billion location pings from more than 12 million American phones as they moved through major U.S. cities. Ultimately, linking these locations revealed specific movement patterns that could be tied to individuals.
For analysts, these capabilities are a double-edged sword. App location data can quickly reveal how actors move and link them to other people through overlapping locations. However, it also raises privacy and ethical concerns, because these methodologies allow for constant surveillance that erodes any semblance of “private life.”
Metadata
Metadata is often described as “data about data.” It provides information and descriptors about data without exposing the content of the data itself. Metadata includes elements like title, tags, creation date, and author. These elements may appear superficial, but they reveal a lot. An email’s metadata typically includes the subject, sender, receiver, date, time, and IP addresses, while a photo’s metadata can show the date, time, file name, and location.
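To make this concrete, here is a minimal sketch of what can be read from an email without ever opening its body. It uses Python's standard `email` module; the message, names, addresses, and IP address are all invented for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message; every name and address here is invented.
RAW_MESSAGE = """\
Received: from mail.example.org (mail.example.org [203.0.113.7])
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.com>
Subject: Lunch on Friday?
Date: Fri, 05 Mar 2021 09:15:00 -0500

(body intentionally ignored)
"""

def extract_metadata(raw: str) -> dict:
    """Return header-level metadata only; the body is never read."""
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "sender": msg["From"],
        "receiver": msg["To"],
        "subject": msg["Subject"],
        "sent_at": sent.isoformat(),
        "relay": msg["Received"],  # includes the relaying server's IP address
    }

metadata = extract_metadata(RAW_MESSAGE)
for key, value in metadata.items():
    print(f"{key}: {value}")
```

Even in this toy case, the headers alone show who wrote to whom, when, about what, and through which server.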
Consumers Are the Product
Every day, 2.5 quintillion bytes of new data are created. Humans produce data in all sorts of ways: through web searches, social media activity, and photos, and even within our homes through appliances and devices connected via the Internet of Things (IoT). It is no surprise that companies profit from the collection of this data.
Companies collect data for various purposes, including tailoring advertising, promoting relevant services and products, analyzing significant trends, and selling information to third parties. Whatever the use for the data, consumers are the product.
Data Re-identification/De-anonymization
Before being sold to third parties, data needs to be anonymized or “scrubbed.” The scrubbing process includes removing personally identifiable information, such as name, address, and date of birth, from the data. Pseudonyms are also used to anonymize data.
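As a rough sketch of what scrubbing can look like in practice, the Python example below drops direct identifiers from a record and attaches a random pseudonym. The field names, the sample record, and the choice of which fields count as direct identifiers are assumptions made for illustration, not a description of any particular vendor's process.

```python
import secrets

# Fields treated as direct identifiers in this sketch (an assumption).
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}

def scrub(records):
    """Drop direct identifiers and attach a random pseudonym to each record."""
    scrubbed = []
    for record in records:
        kept = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        kept["pseudonym"] = secrets.token_hex(8)  # random, not derived from the data
        scrubbed.append(kept)
    return scrubbed

# Hypothetical input record.
records = [
    {"name": "Alice Example", "address": "1 Main St", "date_of_birth": "1980-02-14",
     "zip_code": "20001", "purchase": "running shoes"},
]
scrubbed_records = scrub(records)
print(scrubbed_records)
```

Note that `zip_code` survives the scrub. Indirect identifiers like this one are exactly what re-identification methods exploit.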
Methods for Re-identification of Scrubbed Data
There are three primary methods for re-identifying anonymized data. A detailed description of each method can be found here. The methods can be used in conjunction with one another to maximize the chances of re-identification.
1. Exposing Insufficient De-Identification
As previously mentioned, data and metadata are scrubbed when shared with a third party to protect an individual’s privacy. However, data that has been insufficiently anonymized still contains direct or indirect identifiers: attributes that can be linked to a person.
Studies have shown that individuals can be re-identified using auxiliary information even in large-scale datasets. One study showed that in a dataset of 60 million people, 93% of individuals could be re-identified using just four points of place and time information.
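The intuition behind results like this can be sketched with a toy example. The Python code below assumes each pseudonymous user has a trace of (place, hour) observations and checks which traces are consistent with a handful of points known from some outside source. The traces and place names are invented, and this is an illustration of the idea rather than the method used in the study.

```python
# Hypothetical traces: pseudonym -> set of (place, hour) observations.
TRACES = {
    "user_a": {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22)},
    "user_b": {("cafe", 8), ("office", 9), ("bar", 19), ("home", 22)},
    "user_c": {("cafe", 8), ("school", 9), ("gym", 18), ("home", 23)},
}

def matching_users(traces, known_points):
    """Return the pseudonyms whose trace contains every known point."""
    known = set(known_points)
    return sorted(user for user, points in traces.items() if known <= points)

# Each additional known point narrows the set of candidate traces.
print(matching_users(TRACES, [("cafe", 8)]))
print(matching_users(TRACES, [("cafe", 8), ("gym", 18)]))
print(matching_users(TRACES, [("cafe", 8), ("gym", 18), ("office", 9)]))
```

With one known point, all three traces match. With three, only one remains, and the pseudonym no longer protects that person.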
2. Pseudonym Reversal
Pseudonymization is the process of using artificial identifiers to mask personally identifiable information within data. The pseudonyms must be properly applied to decrease the chances of reversal. Applying pseudonyms randomly rather than using an algorithm better protects personally identifiable information. Re-identifying pseudonymized data is possible when indirect identifiers are left in the dataset. These identifiers could include race, occupation, age, and geographical location, which make it possible to connect traits to identity.
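To illustrate why algorithmic pseudonyms can be weaker than random ones, the sketch below assumes a pseudonym is produced by an unsalted SHA-256 hash of an email address, then shows that anyone with a list of candidate addresses can reverse it by hashing each guess. The hashing scheme, function names, and addresses are assumptions chosen for illustration.

```python
import hashlib
import secrets

def algorithmic_pseudonym(email: str) -> str:
    """Deterministic pseudonym: unsalted SHA-256 of the identifier."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def random_pseudonym() -> str:
    """Random pseudonym with no mathematical link to the identifier."""
    return secrets.token_hex(32)

def reverse_by_guessing(pseudonym: str, candidates):
    """Hash each candidate and return the one that reproduces the pseudonym."""
    for candidate in candidates:
        if algorithmic_pseudonym(candidate) == pseudonym:
            return candidate
    return None

# Hypothetical candidate list, e.g. addresses gathered from an unrelated source.
CANDIDATES = ["alice@example.org", "bob@example.com", "carol@example.net"]

hashed = algorithmic_pseudonym("bob@example.com")
token = random_pseudonym()

print(reverse_by_guessing(hashed, CANDIDATES))  # the guess that matches the hash
print(reverse_by_guessing(token, CANDIDATES))   # a random token matches no guess
```

A random pseudonym resists this kind of guessing, but it offers no protection if indirect identifiers such as age or location remain in the same record.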
3. Dataset Combining
Combining datasets presents the greatest opportunity for data de-anonymization. This is because indirect identifiers can be matched across merged datasets to ultimately re-identify scrubbed names and other direct identifiers.
Dr. Latanya Sweeney has demonstrated this technique, using publicly available information to expose insufficiently de-identified, commercially sold data. Her lab’s project, “Identifying Participants in the Personal Genome Project by Name,” details a workflow that combines publicly available voter lists with the publicly available, “anonymized” participant profiles in the Personal Genome Project to re-identify participants with an estimated 84 to 97 percent success rate.
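A dataset-combining attack of this kind can be sketched in a few lines of Python. The example below joins a scrubbed dataset to a named public list on shared indirect identifiers. The records are invented, and the choice of ZIP code, date of birth, and sex as join keys is an assumption for illustration rather than a reproduction of the workflow described above.

```python
# Hypothetical "anonymized" profiles: names removed, indirect identifiers kept.
PROFILES = [
    {"profile_id": "p1", "zip_code": "20001", "date_of_birth": "1980-02-14",
     "sex": "F", "condition": "asthma"},
    {"profile_id": "p2", "zip_code": "20002", "date_of_birth": "1975-07-30",
     "sex": "M", "condition": "diabetes"},
]

# Hypothetical public voter list with names and the same indirect identifiers.
VOTERS = [
    {"name": "Alice Example", "zip_code": "20001", "date_of_birth": "1980-02-14", "sex": "F"},
    {"name": "Bob Example", "zip_code": "20002", "date_of_birth": "1975-07-30", "sex": "M"},
    {"name": "Brian Sample", "zip_code": "20002", "date_of_birth": "1975-07-30", "sex": "M"},
]

JOIN_KEYS = ("zip_code", "date_of_birth", "sex")

def link(profiles, voters, keys=JOIN_KEYS):
    """Map each profile to a name only when exactly one voter shares its keys."""
    by_key = {}
    for voter in voters:
        by_key.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    linked = {}
    for profile in profiles:
        names = by_key.get(tuple(profile[k] for k in keys), [])
        linked[profile["profile_id"]] = names[0] if len(names) == 1 else None
    return linked

linked_profiles = link(PROFILES, VOTERS)
print(linked_profiles)
```

Profile `p1` matches exactly one voter and is re-identified, while `p2` matches two voters and stays ambiguous. The fewer people who share a combination of indirect identifiers, the less protection scrubbing the name provides.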
Privacy Threats
The accuracy and scale at which seemingly anonymous data can be re-identified raise a slew of privacy concerns. Mass data collection and emerging re-identification techniques make it harder to keep personal information private and heighten other risks, such as cyberstalking and digital profiling. Team Praescient’s blog on Big Tech’s collection of user data further details the risks associated with mass data collection.
Praescient Analytics utilizes innovative methods to analyze patterns, trends, and organizational relationships to support litigation, internal investigations, risk, and compliance mission sets. Analyzing anonymized data alongside publicly and commercially available information can show how well the data has been anonymized and how much risk remains, which helps mitigate privacy concerns.
Understanding Data’s Potential Moving Forward
Innovation and technology have shaped the Information Age of the 21st century. While advancements in technology seek to improve everyday life, they come with consequences. Privacy is becoming harder to achieve, and mass data collection contributes to this. The belief that data cannot be traced back to a person once it has been scrubbed of personally identifiable information creates a false sense of security. As Praescient analysts, we have to be conscious of the potential of all data, even datasets that have been anonymized. While there are no comprehensive laws in the United States that prohibit data re-identification, it is our responsibility to uphold the foundational commitment to ethical behavior across all aspects of the company. Understanding that metadata isn’t just “data about data,” and that anonymization does not guarantee identity protection, will help guide future online actions and how much one shares.