Y. Chen, K. Nissim, and B. Waggoner, “Fair Information Sharing for Treasure Hunting.,” in Association for the Advancement of Artificial Intelligence (AAAI), 2015.
M. Crosas, G. King, J. Honaker, and L. Sweeney, “Automating Open Science for Big Data,” The ANNALS of the American Academy of Political and Social Science , vol. 659, no. 1, pp. 260-273 , 2015. Publisher's VersionAbstract
The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.
M. Bun, K. Nissim, U. Stemmer, and S. Vadhan, “Differentially Private Release and Learning of Threshold Functions,” 2015. ArXiv's Version
K. Nissim and U. Stemmer, “On the Generalization Properties of Differential Privacy,” 2015. ArXiv's Version
T. Steinke and J. Ullman, “Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery,” 2015. ArXiv's Version
D. O'Brien, et al., “Integrating Approaches to Privacy Across the Research Lifecycle: When is Information Purely Public?,” Social Science Research Network, 2015. SSRN's Version
R. Faris and D. R. O’Brien, “Data and Privacy,” Internet Monitor 2014: Data and Privacy. 2015. Medium LinkAbstract
This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.
R. Faris and D. O'Brien, “In the Age of the Web, What Does “Public” Mean?,” Internet Monitor 2014: Data and Privacy. 2015. Medium's Link
A. Beimel, K. Nissim, and U. Stemmer, “Learning Privately with Labeled and Unlabeled Examples,” Accepted for publication, SODA 2015, 2015. arXiv.orgAbstract
A private learner is an algorithm that given a sample of labeled individual examples outputs a generalizing hypothesis while preserving the privacy of each individual. In 2008, Kasiviswanathan et al. (FOCS 2008) gave a generic construction of private learners, in which the sample complexity is (generally) higher than what is needed for non-private learners. This gap in the sample complexity was then further studied in several followup papers, showing that (at least in some cases) this gap is unavoidable. Moreover, those papers considered ways to overcome the gap, by relaxing either the privacy or the learning guarantees of the learner. We suggest an alternative approach, inspired by the (non-private) models of semi-supervised learning and active-learning, where the focus is on the sample complexity of labeled examples whereas unlabeled examples are of a significantly lower cost. We consider private semi-supervised learners that operate on a random sample, where only a (hopefully small) portion of this sample is labeled. The learners have no control over which of the sample elements are labeled. Our main result is that the labeled sample complexity of private learners is characterized by the VC dimension. We present two generic constructions of private semi-supervised learners. The first construction is of learners where the labeled sample complexity is proportional to the VC dimension of the concept class, however, the unlabeled sample complexity of the algorithm is as big as the representation length of domain elements. Our second construction presents a new technique for decreasing the labeled sample complexity of a given private learner, while only slightly increasing its unlabeled sample complexity. In addition, we show that in some settings the labeled sample complexity does not depend on the privacy parameters of the learner.
Y. Chen, O. Sheffet, and S. Vadhan, “Privacy Games,” in 10th Conference on Web and Internet Economics (WINE), Beijing, China, 2014. privacy_game_wine.pdf
R. Bassily, A. Smith, and A. Thakurta, “

Private Empirical Risk Minimization, Revisited

,” in ICML 2014 Workshop on Learning, Security and Privacy, Beijing, China, 2014. Publisher's VersionAbstract
In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.  Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.
A. Wood, et al.,

Integrating Approaches to Privacy Across the Research Lifecycle: Long-Term Longitudinal Studies

. Cambridge: Harvard University, 2014. Publisher's VersionAbstract
On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.
L. Waye, “Privacy Integrated Data Stream Queries,” in Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14), New York, NY, 2014. ACM Digital Library
C. Dimoulas, S. Moore, A. Askarov, and S. Chong, “

Declarative Policies for Capability Control

,” in Proceedings of the 27th {IEEE} Computer Security Foundations Symposium, Piscataway, NJ, USA, 2014.Abstract
In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system. To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. Access control policies restrict which components may use a capability and are enforced using higher-order contracts. Integrity policies restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.
M. Kearns, M. Pai, A. Roth, and J. Ullman, “

Mechanism Design in Large Games: Incentives and Privacy

,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 403–410. Publisher's Version p403-kearns.pdf
K. Chandrasekaran, J. Thaler, J. Ullman, and A. Wan, “

Faster Private Release of Marginals on Small Databases

,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 387–402. Publisher's Version p387-chandrasekaran.pdf
M. Bun, J. Ullman, and S. Vadhan, “

Fingerprinting Codes and the Price of Approximate Differential Privacy

,” in Proceedings of the 46th Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2014, pp. 1–10. Publisher's Version p1-bun.pdf
K. Nissim, S. Vadhan, and D. Xiao, “

Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals

,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 411–422. Publisher's Version p411-nissim.pdf
D. O'Brien, “In the Age of the Web, What Does ''Public' Mean?,” Internet Monitor 2014: Reflections on the Digital World, no. Berkman Center Research Publication 2014-17 . 2014. SSRN's Version
U. Gasser, J. Zittrain, R. Faris, and R. H. Jones, “Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse,” Social Science Research Network. 2014. SSRN's VersionAbstract
This publication is the second annual report of the Internet Monitor project at the Berkman Centerfor Internet & Society at Harvard University. As with the inaugural report, this year’s edition is a collaborative effort of the extended Berkman community. Internet Monitor 2014: Reflections on the Digital World includes nearly three dozen contributions from friends and colleagues around the world that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.The result, intended for a general interest audience, brings together reflection and analysis on a broad range of issues and regions — from an examination of Europe’s “right to be forgotten” to a review of the current state of mobile security to an exploration of a new wave of movements attempting to counter hate speech online — and offers it up for debate and discussion. Our goal remains not to provide a definitive assessment of the “state of the Internet” but rather to provide a rich compendium of commentary on the year’s developments with respect to the online space.Last year’s report examined the dynamics of Internet controls and online activity through the actions of government, corporations, and civil society. We focus this year on the interplay between technological platforms and policy; growing tensions between protecting personal privacy and using big data for social good; the implications of digital communications tools for public discourse and collective action; and current debates around the future of Internet governance.The report reflects the diversity of ideas and input the Internet Monitor project seeks to invite. Some of the contributions are descriptive; others prescriptive. Some contain purely factual observations; others offer personal opinion. In addition to those in traditional essay format, contributions this year include a speculative fiction story exploring what our increasingly data-driven world might bring, a selection of “visual thinking” illustrations that accompany a number of essays, a “Year in Review” timeline that highlights many of the year’s most fascinating Internet-related news stories (and an interactive version of which is available at the, and a slightly tongue-in-cheek “By the Numbers” section that offers a look at the year’s important digital statistics. We believe that each contribution offers insights, and hope they provoke further reflection, conversation, and debate in both offline and online settings around the globe.