Privacy Tools for Sharing Research Data: Publications

2018
Micah Altman, Alexandra Wood, David O'Brien, and Urs Gasser. 3/2018. “Practical Approaches to Big Data Privacy Over Time.” International Data Privacy Law, 8, 1, Pp. 29-51. Publisher's Version
Victor Balcer and Salil Vadhan. 1/2018. “Differential Privacy on Finite Computers.” in 9th Innovations in Theoretical Computer Science Conference (ITCS 2018); also presented at Theory and Practice of Differential Privacy Conference (TPDP 2017). arXiv PageAbstract

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomialtime algorithms. We use a case study the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe X . The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Confidentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in |X |, which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its “per-bin accuracy” (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of log |X |, but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and efficiently computable representation of a dense histogram; it is based on an (n + 1)-wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.

ITCS Version Sep 2017 ArXiv Version
K. Nissim, A Bembenek, A Wood, M Bun, M Gaboardi, U. Gasser, D O'Brien, T Steinke, and S. Vadhan. 2018. “Bridging the Gap between Computer Science and Legal Approaches to Privacy Kobbi Nissim, Aaron Bembenek, Alexandra Wood, Mark Bun, Marco Gaboardi, Urs Gasser, David R. O’Brien, and Salil Vadhan, Bridging the Gap between Computer Science and Legal Appro.” In , 2nd ed., 31: Pp. 687-780. Harvard Journal of Law & Technology. Publisher's VersionAbstract
The fields of law and computer science incorporate contrasting notions of the privacy risks associated with the analysis and release of statistical data about individuals and groups of individuals. Emerging concepts from the theoretical computer science literature provide formal mathematical models for quantifying and mitigating privacy risks, where the set of risks they take into account is much broader than the privacy risks contemplated by many privacy laws. An example of such a model is differential privacy, which provides a provable guarantee of privacy against a wide range of potential attacks, including types of attacks currently unknown or unforeseen. The subject of much theoretical investigation, new privacy technologies based on formal models have recently been making significant strides towards practical implementation. For these tools to be used with sensitive personal information, it is important to demonstrate that they satisfy relevant legal requirements for privacy protection. However, making such an argument is challenging due to the conceptual gaps between the legal and technical approaches to defining privacy. Notably, information privacy laws are generally subject to interpretation and some degree of flexibility, which creates uncertainty for the implementation of more formal approaches. This Article articulates the gaps between legal and technical approaches to privacy and presents a methodology for rigorously arguing that a technological method for privacy protection satisfies the requirements of a particular law. The proposed methodology has two main components: (i) extraction of a formal mathematical requirement of privacy based on a legal standard found in an information privacy law, and (ii) construction of a rigorous mathematical proof for establishing that a technological privacy solution satisfies the mathematical requirement derived from the law. To handle ambiguities that can lead to different interpretations of a legal standard, the methodology takes a conservative “worst-case” approach and attempts to extract a mathematical requirement that is robust to potential ambiguities. Under this approach, the mathematical proof demonstrates that a technological method satisfies a broad range of reasonable interpretations of a legal standard. The Article demonstrates the application of the proposed methodology with an example bridging between the requirements of the Family Educational Rights and Privacy Act of 1974 and differential privacy.
PDF
Jack Murtagh and Salil Vadhan. 2018. “The Complexity of Computing the Optimal Composition of Differential Privacy.” In Theory of Cryptography Conference (TCC 2016), 8th ed., 14: Pp. 1-35. Theory of Computing (2018). TOC's VersionAbstract

In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC'06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML'15) showed how to compute the optimal bound for composing k arbitrary (ϵ,δ)-differentially private algorithms. We characterize the optimal composition for the more general case of k arbitrary (ϵ1,δ1),,(ϵk,δk)-differentially private algorithms where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is #P-complete. Since computing optimal composition exactly is infeasible (unless FP=#P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer's dynamic programming approach to approximately counting solutions to knapsack problems (STOC'03).

PDF
Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David O'Brien, and Salil Vadhan. 2018. “Differential Privacy: A Primer for a Non-technical Audience.” Vanderbilt Journal of Entertainment and Technology Law , 21, 1, Pp. 209-276.Abstract

This document is a primer on differential privacy, which is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings. Using intuitive illustrations and limited mathematical formalism, this document provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social scientists can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded. 

 

Preliminary Version Updated Version PDF
Vishesh Karwa and Salil Vadhan. 2018. “Finite Sample Differentially Private Confidence Intervals.” 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). arXiv PageAbstract
We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.
ITCS Version
Micah Altman, Alexandra Wood, and Effy Vayena. 2018. “A Harm-Reduction Framework for Algorithmic Accountability over Personal Information.” IEEE Security & Privacy , 16, 3, Pp. 34-45. Publisher's Version harm-reduction_framework.pdf
Jack Murtagh, Kathryn Taylor, George Kellaris, and Salil Vadhan. 2018. “Usable Differential Privacy: A Case Study with PSI”. arXiv Page
2017
Thomas Steinke and Jonathan Ullman. 2017. “Between Pure and Approximate Differential Privacy.” Journal of Privacy and Confidentiality.Abstract
We show a new lower bound on the sample complexity of (ε, δ)-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter δ, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy (δ > 0) and pure differential privacy (δ = 0).
PDF
Merce Crosas. 2017. “The DataTags System: Sharing Sensitive Data with Confidence.” Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016. PDF
Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. 2017. “Exposed! A Survey of Attacks on Private Data.” Annual Review of Statistics and Its Application (2017).Abstract
Privacy-preserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset. A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, re-identification attacks have shown that it is easy to link supposedly de-identified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof. We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individual's data are included in the dataset.Wealso discuss techniques from the differential privacy literature for releasing approximate aggregate statistics while provably thwarting any privacy attack.
PDF
Mark Bun, Thomas Steinke, and Jonathan Ullman. 2017. “Make Up Your Mind: The Price of Online Queries in Differential Privacy.” Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). arXiv PageAbstract

We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set Q of allowable queries in one of three ways, which we list in order from easiest to hardest to answer:

• Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch.

• Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query.

• Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries.

Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.

Michael Bar-Sinai and Rotem Medzini. 2017. “Public Policy Modeling using the DataTags Toolset.” In European Social Policy Analysis network (ESPAnet). Israel.Abstract
We apply Tags, a framework for modeling data handling policies, to a welfare policy. The generated model is useful for assessing entitlements of specific cases, and for gaining insights into the modeled policy as a whole.
PDF
2016
Urs Gasser. 12/9/2016. “Recoding Privacy Law: Reflections on the Future Relationship Among Law, Technology, and Privacy.” Harvard Law Review . Publisher's Version
Michael Bar-Sinai, Latanya Sweeney, and Merce Crosas. 8/4/2016. “DataTags, Data Handling Policy Spaces and the Tags Language.” In Security and Privacy Workshops (SPW), 2016 IEEE. San Jose, California : IEEE. Publisher's VersionAbstract

Widespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, The policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, which are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper.

PDF
Salil Vadhan. 8/2016. “The Complexity of Differential Privacy”.Abstract

Differential Privacy is a theoretical framework for ensuring the privacy of individual-level data when performing statistical analysis of privacy-sensitive datasets. The goal of this tutorial is to convey the deep connections between differential privacy and a variety of other topics in computational complexity, cryptography, and theoretical computer science at large. This tutorial was written starting from notes taken during a minicourse given by the author and Kunal Talwar at the 26th McGill Invitational Workshop on Computational Complexity in February 2014, at the Bellairs Institute in Holetown, Barbados.

August 2016 March 2017 Errata
Marco Gaboardi, James Honaker, Gary King, Kobbi Nissim, Jonathan Ullman, Salil Vadhan, and Jack Murtagh. 6/2016. “PSI (Ψ): a Private data Sharing Interface.” In Theory and Practice of Differential Privacy. New York, NY. ArXiv VersionAbstract

We provide an overview of PSI (“a Private data Sharing Interface”), a system we are devel- oping to enable researchers in the social sciences and other fields to share and explore privacy- sensitive datasets with the strong privacy protections of differential privacy.

Poster presented at Theory and Practice of Differential Privacy (TPDP 2016).

TPDP Poster PSI.pdf
Yiling Chen, Stephen Chong, Ian Kash, Tal Moran, and Salil Vadhan. 6/2016. “Truthful Mechanisms for Agents that Value Privacy.” ACM Transactions on Economics and Computation (TEAC), 4, 3. TEAC VersionAbstract

Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from truthfulness; it is not incorporated in players’ utility functions (and doing so has been shown to lead to nontruthfulness in some cases). In this work, we propose a new, general way of modeling privacy in players’ utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has a small privacy cost to player i. We give three mechanisms that are truthful with respect to our modeling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n).

PDF
Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. 2016. “Adaptive Learning with Robust Generalization Guarantees.” Conference on Learning Theory (COLT). arXiv VersionAbstract

The traditional notion of generalization---i.e., learning a hypothesis whose empirical error is close to its true error---is surprisingly brittle. As has recently been noted in [DFH+15b], even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this paper, we study three notions of generalization---increasing in strength---that are robust to postprocessing and amenable to adaptive composition, and examine the relationships between them. We call the weakest such notion Robust Generalization. A second, intermediate, notion is the stability guarantee known as differential privacy. The strongest guarantee we consider we call Perfect Generalization. We prove that every hypothesis class that is PAC learnable is also PAC learnable in a robustly generalizing fashion, with almost the same sample complexity. It was previously known that differentially private algorithms satisfy robust generalization. In this paper, we show that robust generalization is a strictly weaker concept, and that there is a learning task that can be carried out subject to robust generalization guarantees, yet cannot be carried out subject to differential privacy. We also show that perfect generalization is a strictly stronger guarantee than differential privacy, but that, nevertheless, many learning tasks can be carried out subject to the guarantees of perfect generalization.

PDF
Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. 2016. “Algorithmic Stability for Adaptive Data Analysis.” 48th Annual Symposium on the Theory of Computing. arXiv VersionAbstract

Adaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen queries about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions: (i) We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015, NIPS, 2015). (ii) We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries. As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stability implies low generalization error. We also study weaker stability guarantees such as bounded KL divergence and total variation distance.

PDF

Pages