Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.
When the U.S. Census Bureau announced its intention to modernize its disclosure avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders’ trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a ‘statistical imaginary.’
Over the past two decades, we have come to see that traditional de-anonymization techniques fail to protect the privacy of individuals in sensitive datasets. To address this problem, computer scientists introduced differential privacy, a strong statistical notion of privacy that bounds the amount of information a statistical release leaks about any individual. Differential privacy has become a gold standard for privacy protection: organizations from Google to the U.S. Census Bureau have adopted differentially private methods, and the MIT Technology Review named it as one of the top ten technologies expected to have “widespread consequences for human life.” Yet, while differential privacy offers rigorous statistical guarantees, we must also examine how these guarantees interact with social and contextual factors. In this paper, I investigate the political dimensions of differential privacy. What does the adoption of this standard reveal or obscure about the privacy practices within our sociotechnical systems? And how might a reliance on this standard impact our progress towards broader notions of privacy? Drawing on scholarship from sociology, law, computer science, and science and technology studies, I describe the entanglements between algorithmic privacy and institutional logics, highlighting disempowering practices that may emerge despite, or in response to, the adoption of differential privacy. The goal of this work is not to discourage the use of differential privacy, which I argue is necessary and beneficial in a wide range of settings, but to examine where it may have unintended consequences. I conclude with recommendations on how the privacy community can continue to develop formal privacy standards while elevating broader visions of privacy.
Differential privacy is a restriction on data processing algorithms that provides strong confidentiality guarantees for individual records in the data. However, research on proper statistical inference, that is, research on properly quantifying the uncertainty of the (noisy) sample estimate regarding the true value in the population, is currently still limited. This paper proposes and evaluates several strategies to compute valid differentially private confidence intervals for the median. Instead of computing a differentially private point estimate and deriving its uncertainty, we directly estimate the interval bounds and discuss why this approach is superior if ensuring privacy is important. We also illustrate that addressing both sources of uncertainty--the error from sampling and the error from protecting the output--simultaneously should be preferred over simpler approaches that incorporate the uncertainty in a sequential fashion. We evaluate the performance of the different algorithms under various parameter settings in extensive simulation studies and demonstrate how the findings could be applied in practical settings using data from the 1940 Decennial Census.
Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields a doubly exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of Ψ-dimension defined in work of Ben-David et al.  to the online setting and explores its general properties.
There are significant gaps between legal and technical thinking around data privacy. Technical standards are described using mathematical language whereas legal standards are not rigorous from a mathematical point of view and often resort to concepts which they only partially define. As a result, arguments about the adequacy of technical privacy measures for satisfying legal privacy often lack rigor, and their conclusions are uncertain. The uncertainty is exacerbated by a litany of successful privacy attacks on privacy measures thought to meet legal expectations but then shown to fall short of doing so.
As computer systems manipulating individual privacy-sensitive data become integrated in almost every aspect of society, and as such systems increasingly make decisions of legal significance, the need to bridge the diverging, and sometimes conflicting legal and technical approaches becomes urgent.
We formulate and prove formal claims – “legal theorems” – addressing legal questions such as whether the use of technological measures satisfies the requirements of a legal privacy standard. In particular, we analyze the notion of singling out from the GDPR and whether technologies such as k-anonymity and differential privacy prevent singling out.
Our long-term goal is to develop concepts which are on one hand technical, so they can be integrated in the design of computer systems, and can be used in legal reasoning and for policymaking on the other hand.
This article advocates a hybrid legal-technical approach to the evaluation of technical measures designed to render information anonymous in order to bring it outside the scope of data protection regulation. The article demonstrates how such an approach can be used for instantiating a key anonymization concept appearing in the EU General Data Protection Regulation (GDPR) -- singling out. The analysis identifies and addresses a tension between a common, compelling theory of singling out and a mathematical analysis of this theory, and it demonstrates how to make determinations regarding the sufficiency of specific technologies for satisfying regulatory requirements for anonymization.
Doubts about the feasibility of effective anonymization and de-identification have gained prominence in recent years in response to high-profile privacy breaches enabled by scientific advances in privacy research, improved analytical capabilities, the wider availability of personal data, and the unprecedented richness of available data sources. At the same time, privacy regulations recognize the possibility, at least in principle, of data anonymization that is sufficiently protective so as to free the resulting (anonymized) data from regulation. As a result, practitioners developing privacy enhancing technologies face substantial uncertainty as to the legal standing of these technologies. More fundamentally, it is not clear how to make a determination of compliance even when the tool is fully described and available for examination.
This gap is symptomatic of a more general problem: Legal and technical approaches to data protection have developed in parallel, and their conceptual underpinnings are growing increasingly divergent. When lawmakers rely on purely legal concepts to engage areas that are affected by rapid scientific and technological change, the resulting laws, when applied in practice, frequently create substantial uncertainty for implementation; provide contradictory recommendations in important cases; disagree with current scientific technical understanding; and fail to scale to the rapid pace of technological development. This article argues that new hybrid concepts, created through technical and legal co-design, can inform practices that are practically complete, coherent, and scalable.
As a case study, the article focuses on a key privacy-related concept appearing in Recital 26 of the General Data Protection Regulation (GDPR) called singling out. We identify a compelling theory of singling out that is implicit in the most persuasive guidance available, and demonstrate that the theory is ultimately incomplete. We then use that theory as the basis for a new and mathematically rigorous privacy concept called predicate singling-out. Predicate singling-out sheds light on the notion of singling out in the GDPR, itself inextricably linked to anonymization. We argue that any data protection tool that purports to anonymize arbitrary personal data under the GDPR must prevent predicate singling-out. This enables, for the first time, a legally- and mathematically-grounded analysis of the standing of supposed anonymization technologies like k-anonymity and differential privacy. The analysis in this article is backed by a technical-mathematical analysis previously published by two of the authors.
Conceptually, our analysis demonstrates that a nuanced understanding of baseline risk is unavoidable for a theory of singling out based on current regulatory guidance. Practically, it identifies previously unrecognized failures of anonymization. In particular, it demonstrates that some k-anonymous mechanisms may allow singling out, challenging the prevailing regulatory guidance.
The article concludes with a discussion of specific recommendations for both policymakers and scholars regarding how to conduct a hybrid legal-technical analysis. Rather than formalizing or mathematizing the law, the article provides approaches for wielding formal tools in the service of practical regulation.
Government agencies always need to carefully consider potential risks of disclosure whenever they publish statistics based on their data or give external researchers access to the collected data. For this reason, research on disclosure avoiding techniques has a long tradition at statistical agencies. In this context, the promise of formal privacy guarantees offered by concepts such as differential privacy seem to be the panacea enabling the agencies to exactly quantify and control the privacy loss incurred by any data release. Still, despite the excitement in academia and industry, most agencies-with the prominent exception of the U.S. Census Bureau-have been reluctant to even consider the concept for their data release strategy. This paper aims to shed some light on potential reasons for this. We argue that the requirements when implementing differential privacy approaches at government agencies are often fundamentally different from the requirements in industry. This raises many challenging problems and open questions that still need to be addressed before the concept might be used as an overarching principle when sharing data with the public. The paper will not offer any solutions to these challenges. Instead, we hope to stimulate some collaborative research efforts, as we believe that many of the problems can only be addressed by inter-disciplinary collaborations.
There is a significant conceptual gap between legal and mathematica thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators.
We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset x using the output of a data-release mechanism M(x) if it finds a predicate p matching exactly one row in x with probability much better than a statistical baseline. A data-release mechanism that precludes suchattacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and kanonymity
are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, kanonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.
Hypothesis testing plays a central role in statistical inference, and is used in many settings where privacy concerns are paramount. This work answers a basic question about privately testing simple hypotheses: given two distributions P and Q, and a privacy level ε, how many i.i.d. samples are needed to distinguish P from Q subject to ε-differential privacy, and what sort of tests have optimal sample complexity? Specifically, we characterize this sample complexity up to constant factors in terms of the structure of P and Q and the privacy level ε, and show that this sample complexity is achieved by a certain randomized and clamped variant of the log-likelihood ratio test. Our result is an analogue of the classical Neyman–Pearson lemma in the setting of private hypothesis testing. We also give an application of our result to private change-point detection. Our characterization applies more generally to hypothesis tests satisfying essentially any notion of algorithmic stability, which is known to imply strong generalization bounds in adaptive data analysis, and thus our results have applications even when privacy is not a primary concern.
A new line of work [6, 9, 15, 2] demonstrates how differential privacy  can be used as a mathematical tool for guaranteeing generalization in adaptive data analysis. Specifically, if a differentially private analysis is applied on a sample S of i.i.d. examples to select a lowsensitivity function f , then w.h.p. f (S) is close to its expectation, although f is being chosen based on the data. Very recently, Steinke and Ullman  observed that these generalization guarantees can be used for proving concentration bounds in the non-adaptive setting, where the low-sensitivity function is fixed beforehand. In particular, they obtain alternative proofs for classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid’s Inequality. In this work, we set out to examine the situation for functions with high-sensitivity, for which differential privacy does not imply generalization guarantees under adaptive analysis. We show that differential privacy can be used to prove concentration bounds for such functions in the non-adaptive setting.
Is it possible to piece together the confidential data of almost everyone in the US from statistics published by the Census Bureau—without breaching Census security or policy? Could someone—a doctor, a nosy neighbor, or a foreign state actor—determine whether a particular person participated in a genetic study of hundreds of individuals, when each individual contributed only tiny trace amounts of DNA to a highly complex and aggregated genetic mixture? Could police detectives re-trace a suspect’s every movement over the course of many months and thereby learn intimate details about the suspect’s political, religious, and sexual associations—without having to deploy any sort of surveillance or tracking devices? Could someone reliably deduce the sexual preferences of a Facebook user without looking at any content that user has shared?
Until recently, most people probably never imagined that their highly sensitive personal data could be so vulnerable to discovery from seemingly innocuous sources. Many continue to believe that the privacy risks from purely public, statistical, and anonymised data are merely theoretical, and that the practical risks are negligibly small. Yet all of the privacy violations described above are not only theoretically possible—they have already been successfully executed.
The foregoing examples of real-world privacy attacks all leverage one particular vulnerability that we refer to as composition effects. This vulnerability stems from the cumulative erosions of privacy that inhere in every piece of data about people. These erosions occur no matter how aggregated, insignificant, or anonymised the data may seem, and even small erosions can combine in unanticipated ways to create big risks.
Privacy and data protection failures from unanticipated composition effects reflect a type of data myopia—a short-sighted approach toward addressing increasingly-ubiquitous surveillance and privacy risks from Big Data analytics, characterized by a near-total focus on individual data processors and processes and by pervasive underestimation of systemic risks accumulating from independent data products. The failure to recognize accumulation of risk in the information ecosystem reflects a more general societal blind spot to cumulative systemic risks, with parallels in collective failures to foresee or forestall global financial crises, and to adequately address mounting risks to the natural environment.
As the volume and complexity of data uses and publications grow rapidly across a broad range of contexts, the need to develop frameworks for addressing cumulative privacy risks is likely to become an increasingly urgent and widespread problem. Threats to privacy are growing due to the accelerating abundance, and richness, of data about individuals being generated and made publicly available. Furthermore, substantial increases in computing power and algorithmic improvements are making the execution of such attacks more technically feasible. These threats will be impossible to overcome unless regulations are designed to explicitly regulate cumulative risk in a manner that is consistent with the science of composition effects.
The bootstrap is a common and powerful statistical tool for numerically computing the standard error of estimators, that is, a calculation of the uncertainty of functions computed on sample data so as to make an inference back to the original population from which the sample was drawn. Understanding uncertainty, and inferential questions, in the context of private data is an increasingly important task within the literature of differential privacy [7, 20, 15]. We show how to construct an implementation of the bootstrap within differential privacy. Most importantly, we show that, for a broad class of functions under zero concentrated differential privacy, the bootstrap can be implemented at no cost. That is, for a given choice of privacy parameter and associated expected error of some query, the bootstrap can be implemented for the exact same privacy guarantee, resulting in the same expected error (or sometimes less) in the desired query, but additionally provide the standard error of that query. In section 2 we provide a brief overview of differential privacy. Then to describe these results on bootstrap inference, in section 3 we describe some foundational results on the aggregation of repeated queries under contrasting privacy and composition definitions. This leads to a tangential result in section 4 on a low-noise Gaussian mechanism for pure differential privacy. Next we provide a brief foundation on the bootstrap algorithm in statistics in section 5, before showing our algorithmic construction of the bootstrap using the mechanisms of differential privacy in section 6. In section 7 we describe how to use the differentially private estimate of the standard error in the construction of confidence intervals and hypothesis tests, and then demonstrate this in section 8 with examples using published Census microdata in the style of privacy sensitive data.
The fields of law and computer science incorporate contrasting notions of the privacy risks associated with the analysis and release of statistical data about individuals and groups of individuals. Emerging concepts from the theoretical computer science literature provide formal mathematical models for quantifying and mitigating privacy risks, where the set of risks they take into account is much broader than the privacy risks contemplated by many privacy laws. An example of such a model is differential privacy, which provides a provable guarantee of privacy against a wide range of potential attacks, including types of attacks currently unknown or unforeseen. The subject of much theoretical investigation, new privacy technologies based on formal models have recently been making significant strides towards practical implementation. For these tools to be used with sensitive personal information, it is important to demonstrate that they satisfy relevant legal requirements for privacy protection. However, making such an argument is challenging due to the conceptual gaps between the legal and technical approaches to defining privacy. Notably, information privacy laws are generally subject to interpretation and some degree of flexibility, which creates uncertainty for the implementation of more formal approaches. This Article articulates the gaps between legal and technical approaches to privacy and presents a methodology for rigorously arguing that a technological method for privacy protection satisfies the requirements of a particular law. The proposed methodology has two main components: (i) extraction of a formal mathematical requirement of privacy based on a legal standard found in an information privacy law, and (ii) construction of a rigorous mathematical proof for establishing that a technological privacy solution satisfies the mathematical requirement derived from the law. To handle ambiguities that can lead to different interpretations of a legal standard, the methodology takes a conservative “worst-case” approach and attempts to extract a mathematical requirement that is robust to potential ambiguities. Under this approach, the mathematical proof demonstrates that a technological method satisfies a broad range of reasonable interpretations of a legal standard. The Article demonstrates the application of the proposed methodology with an example bridging between the requirements of the Family Educational Rights and Privacy Act of 1974 and differential privacy.
This document is a primer on differential privacy, which is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings. Using intuitive illustrations and limited mathematical formalism, this document provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social scientists can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded.
We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.
This position paper observes how different technical and normative conceptions of privacy have evolved in parallel and describes the practical challenges that these divergent approaches pose. Notably, past technologies relied on intuitive, heuristic understandings of privacy that have since been shown not to satisfy expectations for privacy protection. With computations ubiquitously integrated in almost every aspect of our lives, it is increasingly important to ensure that privacy technologies provide protection that is in line with relevant social norms and normative expectations. Similarly, it is also important to examine social norms and normative expectations with respect to the evolving scientific study of privacy. To this end, we argue for a rigorous analysis of the mapping from normative to technical concepts of privacy and vice versa. We review the landscape of normative and technical definitions of privacy and discuss specific examples of gaps between definitions that are relevant in the context of privacy in statistical computation. We then identify opportunities for overcoming their differences in the design of new approaches to protecting privacy in accordance with both technical and normative standards.