Applying Theoretical Advances in Privacy to Computational Social Science Practice

The goal of this project is to improve replicability and reproducibility in social science by developing easy-to-use tools for researchers to share confidential research data in a privacy-protective manner, supported by rigorous computational, institutional, and legal foundations. We are motivated by the opportunities in social science created by massive new sources of data and developments in data analysis and sharing, and by the threat that privacy concerns pose to realizing the full potential of social science research. By leveraging ongoing multidisciplinary collaborations and theoretical advances in computation, statistics, law, and social science, the project aims to improve the replicability and reproducibility of data in empirical social science. Our goal is to develop and extend integrated privacy-preserving tools for enabling access to, use, and disclosure of social science data. The proposed project builds on a successful, ongoing multidisciplinary collaboration supported by an NSF Frontier grant: Privacy Tools for Sharing Research Data.

Objectives for this project include (1) analyzing the institutional and stakeholder incentives for managing research data privacy and the policy consequences of implementing new computational and legal privacy tools and concepts; (2) designing a blueprint for securing large-scale confidential archival data in the Dataverse repository; (3) exploring applications of our new computational and legal privacy tools to massive data and selected use cases, including online education data, human subjects research data, and economic data protected by NDAs; and (4) expanding research collaborations to engage with other differential privacy and privacy law experts, ongoing data privacy and dissemination efforts at MIT and Harvard, and several related Sloan Foundation projects.

This project is supported by a grant from the Sloan Foundation. For more information, please see the original proposed project description:

cover sheet
project proposal
appendices

Principal Investigators

Salil Vadhan

Principal Investigator
Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard

Home Page

Urs Gasser

Executive Director, Berkman Center for Internet & Society
Professor of Practice, Harvard Law School
Current Member of Datatags Team

Home Page

Micah Altman

Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, MIT
Non-Resident Senior Fellow, The Brookings Institution
Current Member of Datatags Team

Home Page

Gary King

Co-PI
Albert J. Weatherhead III University Professor, Harvard
Director, IQSS, Harvard

Home Page

Mercè Crosas

Co-PI
Director of Data Science, IQSS, Harvard

Home Page

Publications

Ryan Rogers, Aaron Roth, Jonathan Ullman, and Salil Vadhan. 2016. “Privacy Odometers and Filters: Pay-as-you-Go Composition.” Neural Information Processing Systems (NIPS).Abstract

ArXiv PDF

In this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer use-cases.

Marco Gaboardi, James Honaker, Gary King, Kobbi Nissim, Jonathan Ullman, Salil Vadhan, and Jack Murtagh. 6/2016. “PSI (Ψ): a Private data Sharing Interface.” In Theory and Practice of Differential Privacy. New York, NY. ArXiv Version Abstract

TPDP Poster

PSI.pdf

We provide an overview of PSI (“a Private data Sharing Interface”), a system we are devel- oping to enable researchers in the social sciences and other fields to share and explore privacy- sensitive datasets with the strong privacy protections of differential privacy.

Poster presented at Theory and Practice of Differential Privacy (TPDP 2016).

Kobbi Nissim, Uri Stemmer, and Salil Vadhan. 2016. “Locating a Small Cluster Privately.” In PODS 2016. ACM SIGMOD/PODS Conference, San Francisco, USA, 2016.Abstract

PDF

We present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim,and Smith, 2006]. Our algorithm has implications to private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova,and Smith, 2007], which allows compiling of "off the shelf" (non-private) analyses into analyses that preserve differential privacy.

Marco Gaboardi, Hyun woo Lim, Ryan Rogers, and Salil Vadhan. 2016. “Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing.” Proceedings of The 33rd International Conference on Machine Learning, PMLR . Publisher's Version Abstract

PDF

Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.
We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance.
We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.

Jack Murtagh and Salil Vadhan. 2018. “The Complexity of Computing the Optimal Composition of Differential Privacy.” In Theory of Cryptography Conference (TCC 2016), 8th ed., 14: Pp. 1-35. Theory of Computing (2018). TOC's Version Abstract

PDF

In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC'06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML'15) showed how to compute the optimal bound for composing k arbitrary (ϵ,δ)-differentially private algorithms. We characterize the optimal composition for the more general case of k arbitrary (ϵ1,δ1),…,(ϵk,δk)-differentially private algorithms where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is #P-complete. Since computing optimal composition exactly is infeasible (unless FP=#P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer's dynamic programming approach to approximately counting solutions to knapsack problems (STOC'03).

Mercè Crosas, Gary King, James Honaker, and Latanya Sweeney. 2015. “Automating Open Science for Big Data.” The ANNALS of the American Academy of Political and Social Science, 659, 1, Pp. 260-273 . Publisher's Version Abstract

The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

Harvard University Privacy Tools Project

Applying Theoretical Advances in Privacy to Computational Social Science Practice

Principal Investigators

Salil Vadhan

Urs Gasser

Micah Altman

Gary King

Mercè Crosas

Publications

h_seas_logo_rgb.png