Latanya Sweeney and Merce Crosas. 2015. “An Open Science Platform for the Next Generation of Data.” Computer Science, Computers and Scoiety [Internet]. ArXiv VersionAbstract

Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent, data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.

Latanya Sweeney. 2015. “Privacy as a Sword and Shield in Public Health.” New York City Department of Public Health. New York, NY. .
Micah Altman. 2015. “Privacy Principles (framing talk).” United Nations Global Pulse Workshop on ICT4D Principle 8: Address Privacy & Security In Development Programs. New York, USA.
Or Sheffet. 2015. “Private Approximations of the 2nd-Moment Matrix Using Existing Techniques in Linear Regression”. ArXiv VersionAbstract

We introduce three differentially-private algorithms that approximates the 2nd-moment matrix of the data. These algorithm, which in contrast to existing algorithms output positive-definite matrices, correspond to existing techniques in linear regression literature. Specifically, we discuss the following three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that by approximating the solution using Johnson-Lindenstrauss transform we preserve privacy. (ii) We show that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially private provided the prior is set correctly. We also evaluate our techniques experimentally and compare them to the existing "Analyze Gauss" algorithm of Dwork et al.

C. Dwork, A Smith, T Steinke, J Ullman, and S. Vadhan. 2015. “Robust Traceability from Trace Amounts.” In IEEE Symposium on Foundations of Computer Science (FOCS 2015). Berkeley, California.Abstract

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in `1 norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

Latanya Sweeney, Mercè Crosas, and Michael Bar-Sinai. 2015. “Sharing Sensitive Data with Confidence: The Datatags System.” Technology Science . Online VersionAbstract

Society generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

Mark Bun, Kobbi Nissim, and Uri Stemmer. 2015. “Simultaneous private learning of multiple concepts.” In .Abstract

We investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving k learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving k learning tasks without privacy? In our setting, an individual example consists of a domain element x labeled by k unknown concepts (c1,,ck). The goal of a multi-learner is to output k hypotheses (h1,,hk) that generalize the input examples. 
Without concern for privacy, the sample complexity needed to simultaneously learn k concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with k. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in k.

L Cranor, T Rabin, V Shmatikov, S Vadhan, and D Weitzner. 2015. “Towards a Privacy Research Roadmap for the Computing Community.” Report for the Computing Community Consortium (CCC). CCC Version
David Abrams. 2015. “What Stays in Vegas: The Road to “Zero Privacy".” New England Law Review, 49, 4. Publisher's Version
Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood. 2014. “Comment on the Occupational Safety and Health Administration (OSHA) Proposed Rule: Improve Tracking of Workplace Injuries and Illnesses; Extension of Comment Period”. Full Text at PDF
Micah Altman. 2014. “Comment to the Federal Trade Commission on Mobile Device Tracking”. PDF
Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood. 2014. “Comment to The White House Office of Science and Technology Policy (OSTP): Big Data Study, Request for Information”.Abstract
On January 23, 2014, President Barack Obama asked John Podesta to perform a comprehensive review of big data and privacy. During this review, the White House Office of Science and Technology Policy issued a request for public comment on questions related to the public policy implications of big data.
Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood submitted a response on behalf of the Privacy Tools for Sharing Research Data project. Their comments outline a broad, comprehensive, and systematic framework for privacy analysis and provide a taxonomy of modern technological, statistical, and cryptographic approaches to preserving both data privacy and utility. They argue that an analysis of information privacy should address the scope of information covered, the sensitivity of that information, the risk that sensitive information will be disclosed, the availability of control and accountability mechanisms, and the suitability of existing data sharing models, as applied across the entire lifecyle of information use, from collection through dissemination and reuse.
With this submission, the authors discuss the inadequacy of traditional approaches to privacy protection and recommend a modern approach that considers three principles. First, the risks of informational harm are generally not a simple function of the presence or absence of specific fields, attributes, or keywords in the released set of data. Second, redaction, pseudonymization, coarsening, and hashing, are often neither an adequate nor appropriate practice, nor is releasing less information necessary more privacy protective. Third, a thoughtful analysis with expert consultation is necessary in order to evaluate the sensitivity of the data collected, to quantify the associated re-identification risks, and to design useful and safe release mechanisms.
Daniel J. Weitzner, Hal Abelson, Cynthia Dwork, Cameron Kerry, Daniela Rus, Sandy Pentland, and Salil Vadhan. 2014. “Consumer Privacy Bill of Rights and Big Data: Response to White House Office of Science and Technology Policy Request for Information”.Abstract

In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy we offer these comments based on presentations and discussions at the White House-MIT Workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society and NYU Information Law Institute and the UC Berkeley iSchool.

Christos Dimoulas, Scott Moore, Aslan Askarov, and Stephen Chong. 2014. “Declarative Policies for Capability Control.” In Proceedings of the 27th {IEEE} Computer Security Foundations Symposium. Piscataway, NJ, USA: IEEE Press.Abstract

In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system.

To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. Access control policies restrict which components may use a capability and are enforced using higher-order contracts. Integrity policies restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.

Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, and Andrew Wan. 2014. “Faster Private Release of Marginals on Small Databases.” In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, Pp. 387–402. New York, NY, USA: ACM. Publisher's Version PDF
Mark Bun, Jonathan Ullman, and Salil Vadhan. 2014. “Fingerprinting Codes and the Price of Approximate Differential Privacy.” In SIAM Journal on Computing Special Issue on STOC `14., Pp. 1–10. PDF arXiv Version
David O'Brien. 2014. “In the Age of the Web, What Does ''Public' Mean?” Internet Monitor 2014: Reflections on the Digital World (Berkman Center Research Publication 2014-17 ). SSRN Version
Alexandra Wood, David O'Brien, Micah Altman, Alan Karr, Urs Gasser, Michael Bar-Sinai, Kobbi Nissim, Jonathan Ullman, Salil Vadhan, and Wojcik, Michael John. 2014. Integrating Approaches to Privacy Across the Research Lifecycle: Long-Term Longitudinal Studies. Social Science Research Network. Cambridge: Harvard University. Publisher's VersionAbstract

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.

Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.

Thomas Steinke and Jonathan Ullman. 2014. “Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery”. arXiv.orgAbstract

We show a tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently considered by Dwork et al., who showed that Ω~(n2) queries can be answer efficiently, and also by Hardt and Ullman, who showed that answering O~(n3) queries is computationally hard. We close the gap between the two bounds by proving a new, nearly-optimal hardness result. Specifically, we show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to O(n2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results via an optimal construction of a new combinatorial object that we call an interactive fingerprinting code, which may be of independent interest.

Urs Gasser, Jonathan Zittrain, R Faris, and RH Jones. 2014. Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse. 17th ed., Pp. 155. Cambridge, MA: Berkman Center Research Publication. SSNR VersionAbstract

This publication is the second annual report of the Internet Monitor project at the Berkman Center for Internet & Society at Harvard University. As with the inaugural report, this year’s edition is a collaborative effort of the extended Berkman community. Internet Monitor 2014: Reflections on the Digital World includes nearly three dozen contributions from friends and colleagues around the world that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

The result, intended for a general interest audience, brings together reflection and analysis on a broad range of issues and regions — from an examination of Europe’s “right to be forgotten” to a review of the current state of mobile security to an exploration of a new wave of movements attempting to counter hate speech online — and offers it up for debate and discussion. Our goal remains not to provide a definitive assessment of the “state of the Internet” but rather to provide a rich compendium of commentary on the year’s developments with respect to the online space.

Last year’s report examined the dynamics of Internet controls and online activity through the actions of government, corporations, and civil society. We focus this year on the interplay between technological platforms and policy; growing tensions between protecting personal privacy and using big data for social good; the implications of digital communications tools for public discourse and collective action; and current debates around the future of Internet governance.

The report reflects the diversity of ideas and input the Internet Monitor project seeks to invite. Some of the contributions are descriptive; others prescriptive. Some contain purely factual observations; others offer personal opinion. In addition to those in traditional essay format, contributions this year include a speculative fiction story exploring what our increasingly data-driven world might bring, a selection of “visual thinking” illustrations that accompany a number of essays, a “Year in Review” timeline that highlights many of the year’s most fascinating Internet-related news stories (and an interactive version of which is available at the, and a slightly tongue-in-cheek “By the Numbers” section that offers a look at the year’s important digital statistics. We believe that each contribution offers insights, and hope they provoke further reflection, conversation, and debate in both offline and online settings around the globe.