Privacy Tools for Sharing Research Data

Information technology, advances in statistical computing, and the deluge of data available through the Internet are transforming computational social science. However, a major challenge is maintaining the privacy of human subjects. This project is a broad, multidisciplinary effort to help enable the collection, analysis, and sharing of sensitive data while providing privacy for individual subjects. Bringing together computer science, social science, statistics, and law, the investigators seek to refine and develop definitions and measures of privacy and data utility, and design an array of technological, legal, and policy tools for dealing with sensitive data. In addition to contributing to research infrastructure around the world, the ideas developed in this project will benefit society more broadly as it grapples with data privacy issues in many other domains, including public health and electronic commerce.

This project will define and measure privacy in both mathematical and legal terms, and explore alternate definitions of privacy that may be more general or more practical. The project will study variants of differential privacy and develop new theoretical results for use in contexts where it is currently inappropriate or impractical. The research will provide a better understanding of the practical performance and usability of a variety of algorithms for analyzing and sharing privacy-sensitive data. The project will develop secure implementations of these algorithms and legal instruments, which will be made publicly available and used to enable wider access to privacy-sensitive data sets at the Harvard Institute for Quantitative Social Science's Dataverse Network.

This project is funded by a National Science Foundation Secure and Trustworthy Cyberspace Frontier Grant and a gift from Google. For more information, see the original proposed project description to NSF (2012).

Two major areas of research in this project are DataTags and Differential Privacy. This project has contributed to the development of the software tools DataTags, PSI, and AbcDatalog.

Senior Personnel

Salil Vadhan

Principal Investigator
Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard

Home Page

Jayshree Sarathy

Graduate student, Computer Science, Theory Group, Harvard School of Engineering and Applied Sciences

Marco Gaboardi

Visiting Scholar, Center for Research on Computation & Society
State University of New York at Buffalo
Current Member of Datatags Team

James Honaker

Chief Privacy Engineer and Research Associate

Home Page

Urs Gasser

Executive Director, Berkman Center for Internet & Society
Professor of Practice, Harvard Law School
Current Member of Datatags Team

Home Page

David O’Brien

Research Fellow, Berkman Center, Harvard Law School
Current Member of Datatags Team

Home Page

Publications

Marco Gaboardi, Kobbi Nissim, and David Purser. 7/2020. “The Complexity of Verifying Loop-free Programs as Differentially Private.” In 47th International Colloquium on Automata, Languages and Programming (To appear - ICALP 2020). ArXiv Version Abstract

ARXIV.pdf

We study the problem of verifying differential privacy for loop-free programs with probabilistic choice. Programs in this class can be seen as randomized Boolean circuits, which we will use as a formal model to answer two different questions: first, deciding whether a program satisfies a prescribed level of privacy; second, approximating the privacy parameters a program realizes. We show that the problem of deciding whether a program satisfies "-differential privacy is coNP#P-complete. In fact, this is the case when either the input domain or the output range of the program is large.

Further, we show that deciding whether a program is (", )-differentially private is coNP#P-hard, and in coNP#P for small output domains, but always in coNP#P#P. Finally, we show that the problem of approximating the level of differential privacy is both NP-hard and coNP-hard. These results complement previous results by Murtagh and Vadhan [34] showing that deciding the optimal composition of differentially private components is #P-complete, and that approximating the optimal composition of differentially private components is in P.

Jack Murtagh, Kathryn Taylor, George Kellaris, and Salil Vadhan. 2018. “Usable Differential Privacy: A Case Study with PSI”. arXiv Page

PDF

Micah Altman, Alexandra Wood, David O'Brien, and Urs Gasser. 3/2018. “Practical Approaches to Big Data Privacy Over Time.” International Data Privacy Law, 8, 1, Pp. 29-51. Publisher's Version

Micah Altman, Alexandra Wood, and Effy Vayena. 2018. “A Harm-Reduction Framework for Algorithmic Accountability over Personal Information.” IEEE Security & Privacy , 16, 3, Pp. 34-45. Publisher's Version

harm-reduction_framework.pdf

Victor Balcer and Salil Vadhan. 9/2019. “Differential Privacy on Finite Computers.” Journal of Privacy and Confidentiality, 9, 2. JPC Page Abstract

JPC 2019

ITCS 2018

ArXiv

Version History:

Also presented at TPDP 2017; preliminary version posted as arXiv:1709.05396 [cs.DS].

2018: Published in Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pp 43:1-43:21. http://drops.dagstuhl.de/opus/frontdoor.php?source_opus=8353

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. As a case study, we examine the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe \(X\). The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Condentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in \(|X|\), which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its "per-bin accuracy" (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of \(\log |X|\), but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and eciently computable representation of a dense histogram; it is based on an \((n + 1)\) - wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.

Merce Crosas. 2017. “The DataTags System: Sharing Sensitive Data with Confidence.” Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016.

PDF

Michael Bar-Sinai and Rotem Medzini. 2017. “Public Policy Modeling using the DataTags Toolset.” In European Social Policy Analysis network (ESPAnet). Israel.Abstract

PDF

We apply Tags, a framework for modeling data handling policies, to a welfare policy. The generated model is useful for assessing entitlements of specific cases, and for gaining insights into the modeled policy as a whole.

Vishesh Karwa and Salil Vadhan. 2018. “Finite Sample Differentially Private Confidence Intervals.” 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). arXiv Page Abstract

ITCS Version

We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors.

Thomas Steinke and Jonathan Ullman. 2017. “Between Pure and Approximate Differential Privacy.” Journal of Privacy and Confidentiality.Abstract

PDF

We show a new lower bound on the sample complexity of (ε, δ)-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter δ, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy (δ > 0) and pure differential privacy (δ = 0).

G. Barthe, P.Y. Strub, J. Hsu, A. D. Gordon, E. J. Gallego Arias, M. Gaboardi, and G. P. Farina. 2016. “Differentially Private Bayesian Programming.” 23rd ACM Conference on Computer and Communications Security, CCS.

p68-barthe.pdf

Harvard University Privacy Tools Project

Privacy Tools for Sharing Research Data

Senior Personnel

Salil Vadhan

Jayshree Sarathy

Marco Gaboardi

James Honaker

Urs Gasser

David O’Brien

Publications

h_seas_logo_rgb.png