Publications

2022
Sílvia Casacuberta, Michael Shoemate, Salil Vadhan, and Connor Wagaman. 7/21/2022. “Widespread Underestimation of Sensitivity in Differentially Private Libraries and How to Fix It.” In Theory and Practice of Differential Privacy (TPDP 2022). Publisher's Version. Abstract:
We identify a new class of vulnerabilities in implementations of differential privacy. Specifically, they arise when computing basic statistics such as sums, thanks to discrepancies between the implemented arithmetic using finite data types (namely, ints or floats) and idealized arithmetic over the reals or integers. These discrepancies cause the sensitivity of the implemented statistics (i.e., how much one individual's data can affect the result) to be much higher than the sensitivity we expect. Consequently, essentially all differential privacy libraries fail to introduce enough noise to hide individual-level information as required by differential privacy, and we show that this may be exploited in realistic attacks on differentially private query systems. In addition to presenting these vulnerabilities, we also provide a number of solutions, which modify or constrain the way in which the sum is implemented in order to recover the idealized or near-idealized bounds on sensitivity.
TPDP.pdf
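A minimal, self-contained illustration (not taken from the paper) of the finite-precision issue the abstract describes: two datasets contain exactly the same multiset of clamped values, so an idealized bounded sum would treat them identically, yet their floating-point sums differ depending on the order of accumulation. Gaps of this kind are what allow the implemented sensitivity to exceed the idealized clipping bound.

# Illustration only (Python): floating-point addition is not associative,
# so the bounded sum a DP library reasons about idealistically can differ
# from the sum it actually computes. All names here are hypothetical.

def bounded_sum(values, lower=0.0, upper=1.0):
    """Clamp each value to [lower, upper] and add them in the order given."""
    return sum(min(max(v, lower), upper) for v in values)

small = 2.0 ** -53                      # half an ulp of 1.0
a = [small] * 1_000_000 + [1.0]         # small terms first, then 1.0
b = [1.0] + [small] * 1_000_000         # same multiset, 1.0 first

# Ideal (real-valued) sums are equal; the float sums are not, because each
# 1.0 + 2^-53 step in `b` rounds back down to 1.0.
print(bounded_sum(a))   # ~1.0000000001110223
print(bounded_sum(b))   # 1.0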
Daniel Alabi and Salil Vadhan. 6/29/2022. “Hypothesis Testing for Differentially Private Linear Regression.” In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS '22). Publisher's Version. Abstract:
In this work, we design differentially private hypothesis tests for the following problems in the general linear model: testing a linear relationship and testing for the presence of mixtures. The majority of our hypothesis tests are based on differentially private versions of the F-statistic for the general linear model framework, which are uniformly most powerful unbiased in the non-private setting. We also present other tests for these problems, one of which is based on the differentially private nonparametric tests of Couch, Kazan, Shi, Bray, and Groce (CCS 2019), which is especially suited for the small dataset regime. We show that the differentially private F-statistic converges to the asymptotic distribution of its non-private counterpart. As a corollary, the statistical power of the differentially private F-statistic converges to the statistical power of the non-private F-statistic. Through a suite of Monte Carlo-based experiments, we show that our tests achieve desired significance levels and have a high power that approaches the power of the non-private tests as we increase sample sizes or the privacy-loss parameter. We also show when our tests outperform existing methods in the literature.
ARXIV.pdf
Mark Bun, Jörg Drechsler, Marco Gaboardi, Audra McMillan, and Jayshree Sarathy. 6/2022. “Controlling Privacy Loss in Sampling Schemes: An Analysis of Stratified and Cluster Sampling.” In Foundations of Responsible Computing (FORC 2022). Abstract:

Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.

FORC-22.pdf
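For context, the "fundamental result" the abstract refers to is the standard privacy-amplification-by-subsampling bound; the small sketch below (a textbook formula, not a contribution of the paper) simply evaluates it, to contrast with the stratified and cluster designs the paper shows need not amplify at all.

import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Textbook amplification-by-subsampling bound: running an eps-DP
    mechanism on a subsample drawn with inclusion probability q yields
    roughly ln(1 + q*(e^eps - 1))-DP. Assumes simple, data-independent
    (e.g., Poisson) subsampling, which is exactly what stratified and
    cluster sampling are not."""
    return math.log1p(q * math.expm1(eps))

print(amplified_epsilon(1.0, 0.05))  # about 0.082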
Ruobin Gong, Erica L. Groshen, and Salil Vadhan. 6/2022. “Harnessing the Known Unknowns: Differential Privacy and the 2020 Census (co-editors’ foreword).” Harvard Data Science Review, Special Issue 2. Publisher's Version. Harvard Data Science Review 2022.pdf
danah boyd and Jayshree Sarathy. 4/2022. “Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau’s Use of Differential Privacy.” To appear in the Harvard Data Science Review (HDSR). Publisher's Version. Abstract:
When the U.S. Census Bureau announced its intention to modernize its disclosure avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders’ trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a ‘statistical imaginary.’
SSRN.pdf
Rachel Cummings, Yajun Mei, and Wanrong Zhang. 3/2022. “Private Sequential Hypothesis Testing for Statisticians: Privacy, Error Rates, and Sample Size.” In The 25th International Conference on Artificial Intelligence and Statistics (AISTATS). Abstract:
The sequential hypothesis testing problem is a class of statistical analyses where the sample size is not fixed in advance, and the analyst must make real-time decisions until a stopping criterion is reached. In this work, we study the sequential hypothesis testing problem under the constraint of Rényi differential privacy for the sample. Unlike previous work in private hypothesis testing, which focuses on the classical fixed-sample setting, our results may allow a conclusion to be reached much earlier, saving the cost of collecting additional samples. We present a new private algorithm based on Wald's Sequential Probability Ratio Test (SPRT) that gives strong theoretical privacy guarantees. We provide theoretical analysis of statistical performance, measured by Type I and Type II error as well as the expected sample size, and we also provide empirical validation of our results.
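As background for the non-private test underlying the paper's algorithm, here is a textbook sketch of Wald's SPRT for Bernoulli data (illustrative only; it does not include the noise the paper adds to obtain Rényi differential privacy):

import math
import random

def sprt_bernoulli(stream, p0, p1, alpha=0.05, beta=0.05):
    """Classical (non-private) Wald SPRT for H0: p = p0 vs H1: p = p1.
    Accumulates the log-likelihood ratio and stops at Wald's thresholds."""
    upper = math.log((1 - beta) / alpha)   # cross above: reject H0
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr, n = 0.0, 0
    for x in stream:
        n += 1
        llr += math.log(p1 if x else 1 - p1) - math.log(p0 if x else 1 - p0)
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

rng = random.Random(0)
data = (rng.random() < 0.7 for _ in range(10_000))
print(sprt_bernoulli(data, p0=0.5, p1=0.7))  # typically stops after a few dozen samples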
Jayshree Sarathy. 2022. From Algorithmic to Institutional Logics: The Politics of Differential Privacy. Publisher's Version. Abstract:
Over the past two decades, we have come to see that traditional anonymization techniques fail to protect the privacy of individuals in sensitive datasets. To address this problem, computer scientists introduced differential privacy, a strong statistical notion of privacy that bounds the amount of information a statistical release leaks about any individual. Differential privacy has become a gold standard for privacy protection: organizations from Google to the U.S. Census Bureau have adopted differentially private methods, and the MIT Technology Review named it as one of the top ten technologies expected to have “widespread consequences for human life.” Yet, while differential privacy offers rigorous statistical guarantees, we must also examine how these guarantees interact with social and contextual factors. In this paper, I investigate the political dimensions of differential privacy. What does the adoption of this standard reveal or obscure about the privacy practices within our sociotechnical systems? And how might a reliance on this standard impact our progress towards broader notions of privacy? Drawing on scholarship from sociology, law, computer science, and science and technology studies, I describe the entanglements between algorithmic privacy and institutional logics, highlighting disempowering practices that may emerge despite, or in response to, the adoption of differential privacy. The goal of this work is not to discourage the use of differential privacy, which I argue is necessary and beneficial in a wide range of settings, but to examine where it may have unintended consequences. I conclude with recommendations on how the privacy community can continue to develop formal privacy standards while elevating broader visions of privacy.
SSRN.pdf
Jörg Drechsler, Ira Globus-Harris, Audra McMillan, Jayshree Sarathy, and Adam Smith. 2022. “Nonparametric Differentially Private Confidence Intervals for the Median.” To appear in the Journal of Survey Statistics and Methodology (JSSAM). Publisher's Version. Abstract:
Differential privacy is a restriction on data processing algorithms that provides strong confidentiality guarantees for individual records in the data. However, research on proper statistical inference, that is, research on properly quantifying the uncertainty of the (noisy) sample estimate regarding the true value in the population, is currently still limited. This paper proposes and evaluates several strategies to compute valid differentially private confidence intervals for the median. Instead of computing a differentially private point estimate and deriving its uncertainty, we directly estimate the interval bounds and discuss why this approach is superior if ensuring privacy is important. We also illustrate that addressing both sources of uncertainty--the error from sampling and the error from protecting the output--simultaneously should be preferred over simpler approaches that incorporate the uncertainty in a sequential fashion. We evaluate the performance of the different algorithms under various parameter settings in extensive simulation studies and demonstrate how the findings could be applied in practical settings using data from the 1940 Decennial Census.
ARXIV.pdf
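One standard building block in this space, shown below purely for illustration, is releasing a median with the exponential mechanism over a bounded range. This is a generic textbook construction, not necessarily any of the algorithms evaluated in the paper, which focuses on confidence intervals rather than point estimates.

import math
import random

def dp_median(data, lo, hi, eps, rng=None):
    """Exponential-mechanism median for data assumed to lie in [lo, hi].
    Utility of a candidate y is -|#below(y) - #above(y)|, which changes by
    at most 1 when a record is added or removed, so the weights use
    exp(eps * utility / 2)."""
    rng = rng or random.Random(0)
    x = sorted(min(max(float(v), lo), hi) for v in data)
    n = len(x)
    pts = [lo] + x + [hi]
    log_w = []
    for i in range(n + 1):  # candidates in (pts[i], pts[i+1]) have i points below
        width = pts[i + 1] - pts[i]
        utility = -abs(i - (n - i))
        log_w.append((math.log(width) if width > 0 else -math.inf)
                     + eps * utility / 2)
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    i = rng.choices(range(n + 1), weights=weights, k=1)[0]
    return rng.uniform(pts[i], pts[i + 1])

print(dp_median([3.1, 4.5, 4.9, 5.2, 8.0], lo=0.0, hi=10.0, eps=1.0))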
Daniel Alabi, Adam Smith, Audra McMillan, Salil Vadhan, and Jayshree Sarathy. 2022. “Differentially Private Simple Linear Regression.” arXiv:2007.05157. Publisher's Version. Abstract:
Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
ARXIV 2020 POPET-2022.pdf
Daniel Alabi, Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. 2022. “Private Rank Aggregation in Central and Local Models.” In Proceedings of the 2022 AAAI Conference on Artificial Intelligence. Publisher's Version. Abstract:
In social choice theory, (Kemeny) rank aggregation is a well-studied problem where the goal is to combine rankings from multiple voters into a single ranking on the same set of items. Since rankings can reveal preferences of voters (which a voter might like to keep private), it is important to aggregate preferences in a way that preserves privacy. In this work, we present differentially private algorithms for rank aggregation in the pure and approximate settings along with distribution-independent utility upper and lower bounds. In addition to bounds in the central model, we also present utility bounds for the local model of differential privacy.
ARXIV.pdf
2021
Jordan Awan and Salil Vadhan. 8/9/2021. “Canonical Noise Distributions and Private Hypothesis Tests.” In Privacy-Preserving Machine Learning (PPML '21), Privacy in Machine Learning (PriML '21). Publisher's Version. Abstract:
f-DP has recently been proposed as a generalization of classical definitions of differential privacy that allows a lossless analysis of composition, post-processing, and privacy amplification via subsampling. In the setting of f-DP, we propose the concept of a canonical noise distribution (CND), which captures whether an additive privacy mechanism is appropriately tailored for a given f, and we give a construction that produces a CND given an arbitrary tradeoff function f. We show that private hypothesis tests are intimately related to CNDs, allowing for the release of private p-values at no additional privacy cost as well as the construction of uniformly most powerful (UMP) tests for binary data.
We apply our techniques to the problem of difference of proportions testing, and construct a UMP unbiased "semi-private" test which upper bounds the performance of any DP test. Using this as a benchmark, we propose a private test, based on the inversion of characteristic functions, which allows for optimal inference for the two population parameters and is nearly as powerful as the semi-private UMPU test. When specialized to the case of (ϵ,0)-DP, we show empirically that our proposed test is more powerful than any (ϵ/√2)-DP test and has more accurate type I errors than the classic normal approximation test.
ARXIV.pdf
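For readers unfamiliar with f-DP, the underlying definitions (standard ones from the f-DP literature, restated here rather than quoted from the paper) are, in LaTeX:

% Tradeoff function between distributions P and Q: the smallest type II error
% achievable by any test \phi whose type I error is at most \alpha.
T(P,Q)(\alpha) \;=\; \inf_{\phi} \bigl\{\, 1 - \mathbb{E}_{Q}[\phi] \;:\; \mathbb{E}_{P}[\phi] \le \alpha \,\bigr\}

% A mechanism M is f-DP if, for every pair of neighboring datasets S, S',
% distinguishing M(S) from M(S') is at least as hard as f prescribes:
T\bigl(M(S),\, M(S')\bigr) \;\ge\; f .

% A canonical noise distribution for f is, roughly, an additive noise
% distribution whose one-dimensional shift attains this tradeoff exactly.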
Mark Bun, Marco Gaboardi, and Satchit Sivakumar. 7/2021. “Multiclass versus Binary Differentially Private PAC Learning.” In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Publisher's Version. Abstract:
We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields a doubly exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of Ψ-dimension defined in work of Ben-David et al. [5] to the online setting and explores its general properties.
NEURIPS.pdf ARXIV.pdf
Kobbi Nissim. 6/2021. “Legal Theorems: Bridging Computer Science and Privacy Law.” 53rd Annual ACM Symposium on Theory of Computing (STOC 2021).
Kobbi Nissim. 6/2021. “Privacy: From Database Reconstruction to Legal Theorems.” In 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2021). Abstract:
There are significant gaps between legal and technical thinking around data privacy. Technical standards are described using mathematical language whereas legal standards are not rigorous from a mathematical point of view and often resort to concepts which they only partially define. As a result, arguments about the adequacy of technical privacy measures for satisfying legal privacy often lack rigor, and their conclusions are uncertain. The uncertainty is exacerbated by a litany of successful privacy attacks on privacy measures thought to meet legal expectations but then shown to fall short of doing so.

As computer systems manipulating individual privacy-sensitive data become integrated in almost every aspect of society, and as such systems increasingly make decisions of legal significance, the need to bridge the diverging, and sometimes conflicting legal and technical approaches becomes urgent.

We formulate and prove formal claims – “legal theorems” – addressing legal questions such as whether the use of technological measures satisfies the requirements of a legal privacy standard. In particular, we analyze the notion of singling out from the GDPR and whether technologies such as k-anonymity and differential privacy prevent singling out.

Our long-term goal is to develop concepts that are, on the one hand, technical, so they can be integrated into the design of computer systems, and, on the other hand, usable in legal reasoning and for policymaking.
PODS.pdf
Salil Vadhan and Tianhao Wang. 5/2021. “Concurrent Composition of Differential Privacy.” In Proceedings of the 19th Theory of Cryptography Conference (TCC '21), vol. 13043. Publisher's Version. Abstract:
We initiate a study of the composition properties of interactive differentially private mechanisms. An interactive differentially private mechanism is an algorithm that allows an analyst to adaptively ask queries about a sensitive dataset, with the property that an adversarial analyst's view of the interaction is approximately the same regardless of whether or not any individual's data is in the dataset. Previous studies of composition of differential privacy have focused on non-interactive algorithms, but interactive mechanisms are needed to capture many of the intended applications of differential privacy and a number of the important differentially private primitives.

We focus on concurrent composition, where an adversary can arbitrarily interleave its queries to several differentially private mechanisms, which may be feasible when differentially private query systems are deployed in practice. We prove that when the interactive mechanisms being composed are pure differentially private, their concurrent composition achieves privacy parameters (with respect to pure or approximate differential privacy) that match the (optimal) composition theorem for non-interactive differential privacy. We also prove a composition theorem for interactive mechanisms that satisfy approximate differential privacy. That bound is weaker than even the basic (suboptimal) composition theorem for non-interactive differential privacy, and we leave closing the gap as a direction for future research, along with understanding concurrent composition for other variants of differential privacy.
ARXIV.pdf TCC 21.pdf
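For reference, the non-interactive baselines the abstract compares against are the standard composition statements (restated here, not results of this paper): basic composition for pure DP, and the advanced composition bound of Dwork, Rothblum, and Vadhan for approximate DP.

% Basic composition: if M_1, ..., M_k are \varepsilon_i-DP, their
% (non-interactive) composition is pure DP with
\varepsilon_{\mathrm{basic}} \;=\; \sum_{i=1}^{k} \varepsilon_i .

% Advanced composition: k-fold composition of \varepsilon-DP mechanisms is
% (\varepsilon', \delta')-DP for every \delta' > 0, where
\varepsilon' \;=\; \sqrt{2k \ln(1/\delta')}\,\varepsilon \;+\; k\,\varepsilon\,(e^{\varepsilon} - 1).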
Micah Altman, Aloni Cohen, Kobbi Nissim, and Alexandra Wood. 5/2021. “What a Hybrid Legal-Technical Analysis Teaches Us About Privacy Regulation: The Case of Singling Out.” BU Journal of Science & Technology Law, Vol. 27, Issue 1. Publisher's Version. Abstract:
This article advocates a hybrid legal-technical approach to the evaluation of technical measures designed to render information anonymous in order to bring it outside the scope of data protection regulation. The article demonstrates how such an approach can be used for instantiating a key anonymization concept appearing in the EU General Data Protection Regulation (GDPR) -- singling out. The analysis identifies and addresses a tension between a common, compelling theory of singling out and a mathematical analysis of this theory, and it demonstrates how to make determinations regarding the sufficiency of specific technologies for satisfying regulatory requirements for anonymization.

Doubts about the feasibility of effective anonymization and de-identification have gained prominence in recent years in response to high-profile privacy breaches enabled by scientific advances in privacy research, improved analytical capabilities, the wider availability of personal data, and the unprecedented richness of available data sources. At the same time, privacy regulations recognize the possibility, at least in principle, of data anonymization that is sufficiently protective so as to free the resulting (anonymized) data from regulation. As a result, practitioners developing privacy enhancing technologies face substantial uncertainty as to the legal standing of these technologies. More fundamentally, it is not clear how to make a determination of compliance even when the tool is fully described and available for examination.

This gap is symptomatic of a more general problem: Legal and technical approaches to data protection have developed in parallel, and their conceptual underpinnings are growing increasingly divergent. When lawmakers rely on purely legal concepts to engage areas that are affected by rapid scientific and technological change, the resulting laws, when applied in practice, frequently create substantial uncertainty for implementation; provide contradictory recommendations in important cases; disagree with current scientific technical understanding; and fail to scale to the rapid pace of technological development. This article argues that new hybrid concepts, created through technical and legal co-design, can inform practices that are practically complete, coherent, and scalable.

As a case study, the article focuses on a key privacy-related concept appearing in Recital 26 of the General Data Protection Regulation (GDPR) called singling out. We identify a compelling theory of singling out that is implicit in the most persuasive guidance available, and demonstrate that the theory is ultimately incomplete. We then use that theory as the basis for a new and mathematically rigorous privacy concept called predicate singling-out. Predicate singling-out sheds light on the notion of singling out in the GDPR, itself inextricably linked to anonymization. We argue that any data protection tool that purports to anonymize arbitrary personal data under the GDPR must prevent predicate singling-out. This enables, for the first time, a legally- and mathematically-grounded analysis of the standing of supposed anonymization technologies like k-anonymity and differential privacy. The analysis in this article is backed by a technical-mathematical analysis previously published by two of the authors.

Conceptually, our analysis demonstrates that a nuanced understanding of baseline risk is unavoidable for a theory of singling out based on current regulatory guidance. Practically, it identifies previously unrecognized failures of anonymization. In particular, it demonstrates that some k-anonymous mechanisms may allow singling out, challenging the prevailing regulatory guidance.

The article concludes with a discussion of specific recommendations for both policymakers and scholars regarding how to conduct a hybrid legal-technical analysis. Rather than formalizing or mathematizing the law, the article provides approaches for wielding formal tools in the service of practical regulation.

BU JOSTL.pdf
Tyler Piazza. 3/2021. “Differentially Private Ridge Regression.” Abstract:

Studying problems of interest, like finding trends in medical data, can require analyzing data which contains sensitive and personally identifying information. As a result, it is often infeasible to release these datasets to researchers or to the general public. In this paper, we study algorithms that are differentially private, where there are theoretical guarantees that the mechanisms studied will reveal only limited amounts of information about individual people while still providing insights about large groups. This thesis discusses various forms of linear and ridge regression in this differentially private setting, with the goal of studying sensitive data to make predictions about future sensitive data. In particular, we will discuss the internal privacy-loss budgeting of the differentially private ridge regression technique adaSSP. This thesis provides 3 contributions. First, we discuss the existing SSP and adaSSP algorithms, and provide detailed proofs that they are each differentially private. Second, we introduce the two new algorithms adaSSPbudget and constSSPfull and prove that these are each differentially private. Third, we conduct experiments using synthetic and real world data to explore whether the precise privacy-loss budgeting used within these algorithms could improve their performance.

These experiments will explore the tradeoff between the accuracy of a hyperparameter and the accuracy of the other releases. Through the experimental results, we find that the performance is often insensitive to the particular privacy-loss budgeting and that for certain datasets, no choice of privacy-loss budget allows for the adaptive adaSSPbudget to outperform the standard SSP algorithm.

PDF
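A minimal sketch of the sufficient-statistics-perturbation (SSP) idea the thesis builds on, assuming rows are pre-scaled so that ||x_i|| <= 1 and |y_i| <= 1. The thesis's SSP, adaSSP, adaSSPbudget, and constSSPfull variants differ in how noise is calibrated and how the privacy-loss budget is split, so treat this as illustrative only.

import numpy as np

def ssp_ridge(X, y, eps, delta, lam=1.0, rng=None):
    """Sketch of sufficient-statistics perturbation for ridge regression.
    Assumes each row satisfies ||x_i||_2 <= 1 and |y_i| <= 1, so adding or
    removing one record changes X^T X by at most 1 (Frobenius norm) and
    X^T y by at most 1 (l2 norm). Half of the (eps, delta) budget goes to
    each released statistic via the classical Gaussian mechanism."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    sigma = np.sqrt(2 * np.log(2.5 / delta)) / (eps / 2)  # sensitivity 1 each

    noisy_xtx = X.T @ X + rng.normal(0.0, sigma, (d, d))
    noisy_xtx = (noisy_xtx + noisy_xtx.T) / 2  # symmetrize (post-processing)
    noisy_xty = X.T @ y + rng.normal(0.0, sigma, d)

    return np.linalg.solve(noisy_xtx + lam * np.eye(d), noisy_xty)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce row bound
y = np.clip(X @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=500), -1.0, 1.0)
print(ssp_ridge(X, y, eps=1.0, delta=1e-6))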
Victor Balcer, Albert Cheu, Matthew Joseph, and Jieming Mao. 2021. “Connecting Robust Shuffle Privacy and Pan-Privacy.” In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2384-2403. Publisher's Version. Abstract:
In the shuffle model of differential privacy, data-holding users send randomized messages to a secure shuffler, the shuffler permutes the messages, and the resulting collection of messages must be differentially private with regard to user data. In the pan-private model, an algorithm processes a stream of data while maintaining an internal state that is differentially private with regard to the stream data. We give evidence connecting these two apparently different models.
Our results focus on robustly shuffle private protocols, whose privacy guarantees are not greatly affected by malicious users. First, we give robustly shuffle private protocols and upper bounds for counting distinct elements and uniformity testing. Second, we use pan-private lower bounds to prove robustly shuffle private lower bounds for both problems. Focusing on the dependence on the domain size k, we find that robust approximate shuffle privacy and approximate pan-privacy have additive error Θ(√k) for counting distinct elements. For uniformity testing, we give a robust approximate shuffle private protocol with sample complexity Õ(k^{2/3}) and show that an Ω(k^{2/3}) dependence is necessary for any robust pure shuffle private tester. Finally, we show that this connection is useful in both directions: we give a pan-private adaptation of recent work on shuffle private histograms and use it to recover further separations between pan-privacy and interactive local privacy.
ARXIV.pdf
Marco Gaboardi, Michael Hay, and Salil Vadhan. 2021. “A Programming Framework for OpenDP (extended abstract).” In 6th Workshop on the Theory and Practice of Differential Privacy (TPDP 2020). Abstract:
In this paper, we propose a programming framework for the library of differentially private algorithms that will be at the core of the new OpenDP open-source software project (http://opendp.io/).
PDF
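A conceptual toy of the framework's core abstraction as described in the extended abstract: transformations carry stability maps, measurements carry privacy maps, and chaining composes the two. This is deliberately not the actual OpenDP API; all names below are made up for illustration.

import math
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Transformation:
    """Deterministic data-to-data map; stability_map bounds how far outputs
    can move (in the output metric) when inputs differ by d_in."""
    function: Callable[[Any], Any]
    stability_map: Callable[[float], float]

@dataclass
class Measurement:
    """Randomized data-to-release map; privacy_map bounds the privacy loss
    (here, epsilon) incurred for inputs differing by d_in."""
    function: Callable[[Any], Any]
    privacy_map: Callable[[float], float]

def chain(t: Transformation, m: Measurement) -> Measurement:
    """A transformation followed by a measurement is itself a measurement,
    with the privacy map obtained by composing the two maps."""
    return Measurement(
        function=lambda data: m.function(t.function(data)),
        privacy_map=lambda d_in: m.privacy_map(t.stability_map(d_in)),
    )

rng = random.Random(0)

def laplace_noise(scale: float) -> float:
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

bound = 10.0   # hypothetical clamp bound
scale = 20.0   # hypothetical Laplace scale

clamp_sum = Transformation(
    function=lambda xs: sum(min(max(x, 0.0), bound) for x in xs),
    stability_map=lambda d_in: bound * d_in,   # each added record moves the sum by <= bound
)
laplace = Measurement(
    function=lambda v: v + laplace_noise(scale),
    privacy_map=lambda d_in: d_in / scale,     # Laplace mechanism: eps = sensitivity / scale
)

dp_sum = chain(clamp_sum, laplace)
data = [3.0, 14.2, 7.7, 120.0, 5.5]
print(dp_sum.function(data))      # noisy clamped sum
print(dp_sum.privacy_map(1))      # epsilon spent if one record is added or removed -> 0.5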
2020
Marco Gaboardi, Kobbi Nissim, and David Purser. 7/2020. “The Complexity of Verifying Loop-free Programs as Differentially Private.” In 47th International Colloquium on Automata, Languages and Programming (ICALP 2020, to appear). ArXiv Version. Abstract:
We study the problem of verifying differential privacy for loop-free programs with probabilistic choice. Programs in this class can be seen as randomized Boolean circuits, which we will use as a formal model to answer two different questions: first, deciding whether a program satisfies a prescribed level of privacy; second, approximating the privacy parameters a program realizes. We show that the problem of deciding whether a program satisfies ε-differential privacy is coNP^{#P}-complete. In fact, this is the case when either the input domain or the output range of the program is large.

Further, we show that deciding whether a program is (ε, δ)-differentially private is coNP^{#P}-hard, and in coNP^{#P} for small output domains, but always in coNP^{#P^{#P}}. Finally, we show that the problem of approximating the level of differential privacy is both NP-hard and coNP-hard. These results complement previous results by Murtagh and Vadhan [34] showing that deciding the optimal composition of differentially private components is #P-complete, and that approximating the optimal composition of differentially private components is in P.
ARXIV.pdf
