Applying Theoretical Advances in Privacy to Computational Social Science Practice

The goal of this project is to improve replicability and reproducibility in social science by developing easy-to-use tools that allow researchers to share confidential research data in a privacy-protective manner, supported by rigorous computational, institutional, and legal foundations. We are motivated both by the opportunities that massive new data sources and developments in data analysis and sharing create for social science, and by the threat that privacy concerns pose to realizing the full potential of social science research. Leveraging ongoing multidisciplinary collaborations and theoretical advances in computation, statistics, law, and social science, we aim to develop and extend integrated privacy-preserving tools for enabling access to, use of, and disclosure of social science data. The proposed project builds on a successful, ongoing multidisciplinary collaboration supported by an NSF Frontier grant: Privacy Tools for Sharing Research Data.

Objectives for this project include (1) analyzing the institutional and stakeholder incentives for managing research data privacy and the policy consequences of implementing new computational and legal privacy tools and concepts; (2) designing a blueprint for securing large-scale confidential archival data in the Dataverse repository; (3) exploring applications of our new computational and legal privacy tools to massive data and selected use cases, including online education data, human subjects research data, and economic data protected by NDAs; and (4) expanding research collaborations to engage with other differential privacy and privacy law experts, ongoing data privacy and dissemination efforts at MIT and Harvard, and several related Sloan Foundation projects. 

This project is supported by a grant from the Sloan Foundation. For more information, please see the original proposed project description:

cover sheet
project proposal
appendices

Principal Investigators

Salil Vadhan

Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard
Current Member of Datatags Team

Salil Vadhan is the lead PI of the Privacy Tools for Sharing Research Data project and the Vicky Joseph Professor of Computer Science and Applied Mathematics.

Urs Gasser

Executive Director, Berkman Center for Internet & Society
Professor of Practice, Harvard Law School
Current Member of Datatags Team

Micah Altman

Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, MIT
Non-Resident Senior Fellow, The Brookings Institution
Current Member of Datatags Team

Gary King

Albert J. Weatherhead III University Professor, Harvard
Director, IQSS, Harvard

Publications

Micah Altman, Alexandra Wood, David R. O'Brien, and Urs Gasser. 2016. “Practical Approaches to Big Data Privacy Over Time.” Brussels Privacy Symposium.

Increasingly, governments and businesses are collecting, analyzing, and sharing detailed information about individuals over long periods of time. Vast quantities of data from new sources and novel methods for large-scale data analysis promise to yield deeper understanding of human characteristics, behavior, and relationships and advance the state of science, public policy, and innovation. At the same time, the collection and use of fine-grained personal data over time is associated with significant risks to individuals, groups, and society at large. In this article, we examine a range of long-term data collections, conducted by researchers in social science, in order to identify the characteristics of these programs that drive their unique sets of risks and benefits. We also examine the practices that have been established by social scientists to protect the privacy of data subjects in light of the challenges presented in long-term studies. We argue that many uses of big data, across academic, government, and industry settings, have characteristics similar to those of traditional long-term research studies. In this article, we discuss the lessons that can be learned from longstanding data management practices in research and potentially applied in the context of newly emerging data sources and uses.

K. Nissim, A. Bembenek, A. Wood, M. Bun, M. Gaboardi, U. Gasser, D. O'Brien, T. Steinke, and S. Vadhan. 2016. “Bridging the Gap between Computer Science and Legal Approaches to Privacy.” Privacy Law Scholars Conference, Washington, D.C., 2016.
The fields of law and computer science incorporate contrasting notions of the privacy risks associated with the analysis and release of statistical data about individuals and groups of individuals. Emerging concepts from the theoretical computer science literature provide formal mathematical models for quantifying and mitigating privacy risks, where the set of risks they take into account is much broader than the privacy risks contemplated by many privacy laws. An example of such a model is differential privacy, which provides a provable guarantee of privacy against a wide range of potential attacks, including types of attacks currently unknown or unforeseen. The subject of much theoretical investigation, new privacy technologies based on formal models have recently been making significant strides towards practical implementation. For these tools to be used with sensitive personal information, it is important to demonstrate that they satisfy relevant legal requirements for privacy protection. However, making such an argument is challenging due to the conceptual gaps between the legal and technical approaches to defining privacy. Notably, information privacy laws are generally subject to interpretation and some degree of flexibility, which creates uncertainty for the implementation of more formal approaches. This Article articulates the gaps between legal and technical approaches to privacy and presents a methodology for rigorously arguing that a technological method for privacy protection satisfies the requirements of a particular law. The proposed methodology has two main components: (i) extraction of a formal mathematical requirement of privacy based on a legal standard found in an information privacy law, and (ii) construction of a rigorous mathematical proof for establishing that a technological privacy solution satisfies the mathematical requirement derived from the law. 
To handle ambiguities that can lead to different interpretations of a legal standard, the methodology takes a conservative “worst-case” approach and attempts to extract a mathematical requirement that is robust to potential ambiguities. Under this approach, the mathematical proof demonstrates that a technological method satisfies a broad range of reasonable interpretations of a legal standard. The Article demonstrates the application of the proposed methodology with an example bridging between the requirements of the Family Educational Rights and Privacy Act of 1974 and differential privacy.
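The differential privacy guarantee discussed in the abstract above can be made concrete with a minimal sketch. This illustration is ours, not the Article's: the `dp_count` function and the counting-query example are assumptions chosen for simplicity.

```python
import random

def dp_count(records, predicate, epsilon):
    """epsilon-differentially private count of records satisfying predicate.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so adding Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent Exponential(epsilon) draws is
    # exactly Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller values of `epsilon` mean stronger privacy and noisier answers; the guarantee holds regardless of what side information an attacker has, which is what makes such a mechanism a candidate for the kind of formal legal argument the Article proposes.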

Kobbi Nissim, Uri Stemmer, and Salil Vadhan. 2016. “Locating a Small Cluster Privately.” ACM SIGMOD/PODS Conference (PODS 2016), San Francisco, USA, 2016.

We present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim, and Smith, 2006]. Our algorithm has implications for private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample-and-aggregate technique [Nissim, Raskhodnikova, and Smith, 2007], which allows compiling "off the shelf" (non-private) analyses into analyses that preserve differential privacy.
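The sample-and-aggregate idea referenced above can be sketched as follows. This is a simplified illustration of the general technique, not the algorithm from either paper; the clamping-based aggregation, the function name, and all parameters are our own assumptions.

```python
import random

def sample_and_aggregate(data, statistic, k, lower, upper, epsilon):
    """Sketch of sample-and-aggregate: run a non-private statistic on k
    disjoint blocks, then aggregate the block results privately.

    Each individual appears in exactly one block, so changing one record
    changes at most one block estimate; clamping estimates to
    [lower, upper] bounds the sensitivity of their average by
    (upper - lower) / k, which calibrates the Laplace noise.
    """
    shuffled = data[:]
    random.shuffle(shuffled)
    blocks = [shuffled[i::k] for i in range(k)]  # k disjoint blocks
    estimates = [min(max(statistic(b), lower), upper) for b in blocks]
    avg = sum(estimates) / k
    scale = (upper - lower) / (k * epsilon)  # Laplace scale = sensitivity/epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return avg + noise
```

The appeal of the technique is that `statistic` can be any off-the-shelf, non-private analysis; privacy comes entirely from the partitioning and the noisy aggregation step.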
