The sequential hypothesis testing problem is a class of statistical analyses where the sample size is not fixed in advance, and the analyst must make real-time decisions until a stopping criterion is reached. In this work, we study the sequential hypothesis testing problem under the constraint of Renyi differential privacy for the sample. Unlike previous work in private hypothesis testing that focuses on the classical fixed sample setting, our results may allow a conclusion to be reached much earlier, and thus saves the cost of collecting these additional samples. We present a new private algorithm based on Wald's Sequential Probability Ratio Test (SPRT) that gives strong theoretical privacy guarantees. We provide theoretical analysis on statistical performance measured by Type I and Type II error as well as the expected sample size, and we also provide empirical validation of our results.
Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A fundamental result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. However, in practice, sampling designs are often more complex than the simple, data-independent sampling schemes that are addressed in prior work. In this work, we extend the study of privacy amplification results to more complex, data-dependent sampling schemes. We find that not only do these sampling schemes often fail to amplify privacy, they can actually result in privacy degradation. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, as well as provide some insight into the study of more general sampling designs.
When the U.S. Census Bureau announced its intention to modernize its disclosure avoidance procedures for the 2020 Census, it sparked a controversy that is still underway. The move to differential privacy introduced technical and procedural uncertainties, leaving stakeholders unable to evaluate the quality of the data. More importantly, this transformation exposed the statistical illusions and limitations of census data, weakening stakeholders’ trust in the data and in the Census Bureau itself. This essay examines the epistemic currents of this controversy. Drawing on theories from Science and Technology Studies (STS) and ethnographic fieldwork, we analyze the current controversy over differential privacy as a battle over uncertainty, trust, and legitimacy of the Census. We argue that rebuilding trust will require more than technical repairs or improved communication; it will require reconstructing what we identify as a ‘statistical imaginary.’
Over the past two decades, we have come to see that traditional de-anonymization techniques fail to protect the privacy of individuals in sensitive datasets. To address this problem, computer scientists introduced differential privacy, a strong statistical notion of privacy that bounds the amount of information a statistical release leaks about any individual. Differential privacy has become a gold standard for privacy protection: organizations from Google to the U.S. Census Bureau have adopted differentially private methods, and the MIT Technology Review named it as one of the top ten technologies expected to have “widespread consequences for human life.” Yet, while differential privacy offers rigorous statistical guarantees, we must also examine how these guarantees interact with social and contextual factors. In this paper, I investigate the political dimensions of differential privacy. What does the adoption of this standard reveal or obscure about the privacy practices within our sociotechnical systems? And how might a reliance on this standard impact our progress towards broader notions of privacy? Drawing on scholarship from sociology, law, computer science, and science and technology studies, I describe the entanglements between algorithmic privacy and institutional logics, highlighting disempowering practices that may emerge despite, or in response to, the adoption of differential privacy. The goal of this work is not to discourage the use of differential privacy, which I argue is necessary and beneficial in a wide range of settings, but to examine where it may have unintended consequences. I conclude with recommendations on how the privacy community can continue to develop formal privacy standards while elevating broader visions of privacy.
Differential privacy is a restriction on data processing algorithms that provides strong confidentiality guarantees for individual records in the data. However, research on proper statistical inference, that is, research on properly quantifying the uncertainty of the (noisy) sample estimate regarding the true value in the population, is currently still limited. This paper proposes and evaluates several strategies to compute valid differentially private confidence intervals for the median. Instead of computing a differentially private point estimate and deriving its uncertainty, we directly estimate the interval bounds and discuss why this approach is superior if ensuring privacy is important. We also illustrate that addressing both sources of uncertainty--the error from sampling and the error from protecting the output--simultaneously should be preferred over simpler approaches that incorporate the uncertainty in a sequential fashion. We evaluate the performance of the different algorithms under various parameter settings in extensive simulation studies and demonstrate how the findings could be applied in practical settings using data from the 1940 Decennial Census.
Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
In social choice theory, (Kemeny) rank aggregation is a well-studied problem where the goal is to combine rankings from multiple voters into a single ranking on the same set of items. Since rankings can reveal preferences of voters (which a voter might like to keep private), it is important to aggregate preferences in such a way to preserve privacy. In this work, we present differentially private algorithms for rank aggregation in the pure and approximate settings along with distribution-independent utility upper and lower bounds. In addition to bounds in the central model, we also present utility bounds for the local model of differential privacy.