Privacy Tools for Sharing Research Data

Information technology, advances in statistical computing, and the deluge of data available through the Internet are transforming computational social science. However, a major challenge is maintaining the privacy of human subjects. This project is a broad, multidisciplinary effort to help enable the collection, analysis, and sharing of sensitive data while providing privacy for individual subjects. Bringing together computer science, social science, statistics, and law, the investigators seek to refine and develop definitions and measures of privacy and data utility, and design an array of technological, legal, and policy tools for dealing with sensitive data. In addition to contributing to research infrastructure around the world, the ideas developed in this project will benefit society more broadly as it grapples with data privacy issues in many other domains, including public health and electronic commerce.

This project will define and measure privacy in both mathematical and legal terms, and explore alternate definitions of privacy that may be more general or more practical. The project will study variants of differential privacy and develop new theoretical results for use in contexts where it is currently inappropriate or impractical. The research will provide a better understanding of the practical performance and usability of a variety of algorithms for analyzing and sharing privacy-sensitive data. The project will develop secure implementations of these algorithms and legal instruments, which will be made publicly available and used to enable wider access to privacy-sensitive data sets at the Harvard Institute for Quantitative Social Science's Dataverse Network.

This project is funded by a National Science Foundation Secure and Trustworthy Cyberspace Frontier Grant and a gift from Google. For more information, see the original proposed project description to NSF (2012).

Two major areas of research in this project are DataTags and Differential Privacy. This project has contributed to the development of the software tools DataTags, PSI, and AbcDatalog.

Senior Personnel

Salil Vadhan

Salil Vadhan

Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard
Area Chair for Computer Science

Salil Vadhan is the lead PI of Privacy Tools for Sharing Research Data project and the Vicky Joseph Professor of Computer Science and Applied Mathematics...

Read more about Salil Vadhan

Jayshree Sarathy

Graduate student, Computer Science, Theory Group, Harvard School of Engineering and Applied Sciences
Marco Gaboardi

Marco Gaboardi

Visiting Scholar, Center for Research on Computation & Society
State University of New York at Buffalo
Current Member of Datatags Team
Urs Gasser

Urs Gasser

Executive Director, Berkman Center for Internet & Society
Professor of Practice, Harvard Law School
Current Member of Datatags Team
  • 1 of 3
  • »


Victor Balcer and Salil Vadhan. 1/2018. “Differential Privacy on Finite Computers.” in 9th Innovations in Theoretical Computer Science Conference (ITCS 2018); also presented at Theory and Practice of Differential Privacy Conference (TPDP 2017). arXiv PageAbstract

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomialtime algorithms. We use a case study the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe X . The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Confidentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in |X |, which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its “per-bin accuracy” (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of log |X |, but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and efficiently computable representation of a dense histogram; it is based on an (n + 1)-wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.

Vishesh Karwa and Salil Vadhan. 2018. “Finite Sample Differentially Private Confidence Intervals.” 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). arXiv PageAbstract
We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.
Thomas Steinke and Jonathan Ullman. 2017. “Between Pure and Approximate Differential Privacy.” Journal of Privacy and Confidentiality.Abstract
We show a new lower bound on the sample complexity of (ε, δ)-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter δ, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy (δ > 0) and pure differential privacy (δ = 0).
Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. 2017. “Exposed! A Survey of Attacks on Private Data.” Annual Review of Statistics and Its Application (2017).Abstract
Privacy-preserving statistical data analysis addresses the general question of protecting privacy when publicly releasing information about a sensitive dataset. A privacy attack takes seemingly innocuous released information and uses it to discern the private details of individuals, thus demonstrating that such information compromises privacy. For example, re-identification attacks have shown that it is easy to link supposedly de-identified records to the identity of the individual concerned. This survey focuses on attacking aggregate data, such as statistics about how many individuals have a certain disease, genetic trait, or combination thereof. We consider two types of attacks: reconstruction attacks, which approximately determine a sensitive feature of all the individuals covered by the dataset, and tracing attacks, which determine whether or not a target individual's data are included in the dataset.Wealso discuss techniques from the differential privacy literature for releasing approximate aggregate statistics while provably thwarting any privacy attack.
Mark Bun, Thomas Steinke, and Jonathan Ullman. 2017. “Make Up Your Mind: The Price of Online Queries in Differential Privacy.” Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). arXiv PageAbstract

We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set Q of allowable queries in one of three ways, which we list in order from easiest to hardest to answer:

• Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch.

• Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query.

• Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries.

Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.

Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David O'Brien, and Salil Vadhan. Forthcoming. “Differential Privacy: A Primer for a Non-technical Audience (Preliminary Version).” Vanderbilt Journal of Entertainment and Technology Law .Abstract

This document is a primer on differential privacy, which is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings. Using intuitive illustrations and limited mathematical formalism, this document provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social scientists can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded. 


  • 1 of 6
  • »