There is a significant conceptual gap between legal and mathematica thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators.
We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset x using the output of a data-release mechanism M(x) if it finds a predicate p matching exactly one row in x with probability much better than a statistical baseline. A data-release mechanism that precludes suchattacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and kanonymity
are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, kanonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.
There are significant gaps between legal and technical thinking around data privacy. Technical standards are described using mathematical language whereas legal standards are not rigorous from a mathematical point of view and often resort to concepts which they only partially define. As a result, arguments about the adequacy of technical privacy measures for satisfying legal privacy often lack rigor, and their conclusions are uncertain. The uncertainty is exacerbated by a litany of successful privacy attacks on privacy measures thought to meet legal expectations but then shown to fall short of doing so.
As computer systems manipulating individual privacy-sensitive data become integrated in almost every aspect of society, and as such systems increasingly make decisions of legal significance, the need to bridge the diverging, and sometimes conflicting legal and technical approaches becomes urgent.
We formulate and prove formal claims – “legal theorems” – addressing legal questions such as whether the use of technological measures satisfies the requirements of a legal privacy standard. In particular, we analyze the notion of singling out from the GDPR and whether technologies such as k-anonymity and differential privacy prevent singling out.
Our long-term goal is to develop concepts which are on one hand technical, so they can be integrated in the design of computer systems, and can be used in legal reasoning and for policymaking on the other hand.
This paper aims to shed some light on potential reasons for this. We argue that the requirements when implementing differential privacy approaches at government agencies are often fundamentally different from the requirements in industry. This raises many challenging problems and open questions that still need to be addressed before the concept might be used as an overarching principle when sharing data with the public. The paper will not offer any solutions to these challenges. Instead, we hope to stimulate some collaborative research efforts, as we believe that many of the problems can only be addressed by inter-disciplinary collaborations.
avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders’ trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a ‘statistical imaginary.’
Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.
This article advocates a hybrid legal-technical approach to the evaluation of technical measures designed to render information anonymous in order to bring it outside the scope of data protection regulation. The article demonstrates how such an approach can be used for instantiating a key anonymization concept appearing in the EU General Data Protection Regulation (GDPR) -- singling out. The analysis identifies and addresses a tension between a common, compelling theory of singling out and a mathematical analysis of this theory, and it demonstrates how to make determinations regarding the sufficiency of specific technologies for satisfying regulatory requirements for anonymization.
Doubts about the feasibility of effective anonymization and de-identification have gained prominence in recent years in response to high-profile privacy breaches enabled by scientific advances in privacy research, improved analytical capabilities, the wider availability of personal data, and the unprecedented richness of available data sources. At the same time, privacy regulations recognize the possibility, at least in principle, of data anonymization that is sufficiently protective so as to free the resulting (anonymized) data from regulation. As a result, practitioners developing privacy enhancing technologies face substantial uncertainty as to the legal standing of these technologies. More fundamentally, it is not clear how to make a determination of compliance even when the tool is fully described and available for examination.
This gap is symptomatic of a more general problem: Legal and technical approaches to data protection have developed in parallel, and their conceptual underpinnings are growing increasingly divergent. When lawmakers rely on purely legal concepts to engage areas that are affected by rapid scientific and technological change, the resulting laws, when applied in practice, frequently create substantial uncertainty for implementation; provide contradictory recommendations in important cases; disagree with current scientific technical understanding; and fail to scale to the rapid pace of technological development. This article argues that new hybrid concepts, created through technical and legal co-design, can inform practices that are practically complete, coherent, and scalable.
As a case study, the article focuses on a key privacy-related concept appearing in Recital 26 of the General Data Protection Regulation (GDPR) called singling out. We identify a compelling theory of singling out that is implicit in the most persuasive guidance available, and demonstrate that the theory is ultimately incomplete. We then use that theory as the basis for a new and mathematically rigorous privacy concept called predicate singling-out. Predicate singling-out sheds light on the notion of singling out in the GDPR, itself inextricably linked to anonymization. We argue that any data protection tool that purports to anonymize arbitrary personal data under the GDPR must prevent predicate singling-out. This enables, for the first time, a legally- and mathematically-grounded analysis of the standing of supposed anonymization technologies like k-anonymity and differential privacy. The analysis in this article is backed by a technical-mathematical analysis previously published by two of the authors.
Conceptually, our analysis demonstrates that a nuanced understanding of baseline risk is unavoidable for a theory of singling out based on current regulatory guidance. Practically, it identifies previously unrecognized failures of anonymization. In particular, it demonstrates that some k-anonymous mechanisms may allow singling out, challenging the prevailing regulatory guidance.
The article concludes with a discussion of specific recommendations for both policymakers and scholars regarding how to conduct a hybrid legal-technical analysis. Rather than formalizing or mathematizing the law, the article provides approaches for wielding formal tools in the service of practical regulation.
Is it possible to piece together the confidential data of almost everyone in the US from statistics published by the Census Bureau—without breaching Census security or policy? Could someone—a doctor, a nosy neighbor, or a foreign state actor—determine whether a particular person participated in a genetic study of hundreds of individuals, when each individual contributed only tiny trace amounts of DNA to a highly complex and aggregated genetic mixture? Could police detectives re-trace a suspect’s every movement over the course of many months and thereby learn intimate details about the suspect’s political, religious, and sexual associations—without having to deploy any sort of surveillance or tracking devices? Could someone reliably deduce the sexual preferences of a Facebook user without looking at any content that user has shared?
Until recently, most people probably never imagined that their highly sensitive personal data could be so vulnerable to discovery from seemingly innocuous sources. Many continue to believe that the privacy risks from purely public, statistical, and anonymised data are merely theoretical, and that the practical risks are negligibly small. Yet all of the privacy violations described above are not only theoretically possible—they have already been successfully executed.
The foregoing examples of real-world privacy attacks all leverage one particular vulnerability that we refer to as composition effects. This vulnerability stems from the cumulative erosions of privacy that inhere in every piece of data about people. These erosions occur no matter how aggregated, insignificant, or anonymised the data may seem, and even small erosions can combine in unanticipated ways to create big risks.
Privacy and data protection failures from unanticipated composition effects reflect a type of data myopia—a short-sighted approach toward addressing increasingly-ubiquitous surveillance and privacy risks from Big Data analytics, characterized by a near-total focus on individual data processors and processes and by pervasive underestimation of systemic risks accumulating from independent data products. The failure to recognize accumulation of risk in the information ecosystem reflects a more general societal blind spot to cumulative systemic risks, with parallels in collective failures to foresee or forestall global financial crises, and to adequately address mounting risks to the natural environment.
As the volume and complexity of data uses and publications grow rapidly across a broad range of contexts, the need to develop frameworks for addressing cumulative privacy risks is likely to become an increasingly urgent and widespread problem. Threats to privacy are growing due to the accelerating abundance, and richness, of data about individuals being generated and made publicly available. Furthermore, substantial increases in computing power and algorithmic improvements are making the execution of such attacks more technically feasible. These threats will be impossible to overcome unless regulations are designed to explicitly regulate cumulative risk in a manner that is consistent with the science of composition effects.