On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.
This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.
Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.
We show a tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently considered by Dwork et al., who showed that Ω~(n2) queries can be answer efficiently, and also by Hardt and Ullman, who showed that answering O~(n3) queries is computationally hard. We close the gap between the two bounds by proving a new, nearly-optimal hardness result. Specifically, we show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to O(n2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results via an optimal construction of a new combinatorial object that we call an interactive fingerprinting code, which may be of independent interest.
This publication is the second annual report of the Internet Monitor project at the Berkman Center for Internet & Society at Harvard University. As with the inaugural report, this year’s edition is a collaborative effort of the extended Berkman community. Internet Monitor 2014: Reflections on the Digital World includes nearly three dozen contributions from friends and colleagues around the world that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.
The result, intended for a general interest audience, brings together reflection and analysis on a broad range of issues and regions — from an examination of Europe’s “right to be forgotten” to a review of the current state of mobile security to an exploration of a new wave of movements attempting to counter hate speech online — and offers it up for debate and discussion. Our goal remains not to provide a definitive assessment of the “state of the Internet” but rather to provide a rich compendium of commentary on the year’s developments with respect to the online space.
Last year’s report examined the dynamics of Internet controls and online activity through the actions of government, corporations, and civil society. We focus this year on the interplay between technological platforms and policy; growing tensions between protecting personal privacy and using big data for social good; the implications of digital communications tools for public discourse and collective action; and current debates around the future of Internet governance.
The report reflects the diversity of ideas and input the Internet Monitor project seeks to invite. Some of the contributions are descriptive; others prescriptive. Some contain purely factual observations; others offer personal opinion. In addition to those in traditional essay format, contributions this year include a speculative fiction story exploring what our increasingly data-driven world might bring, a selection of “visual thinking” illustrations that accompany a number of essays, a “Year in Review” timeline that highlights many of the year’s most fascinating Internet-related news stories (and an interactive version of which is available at the netmonitor.org), and a slightly tongue-in-cheek “By the Numbers” section that offers a look at the year’s important digital statistics. We believe that each contribution offers insights, and hope they provoke further reflection, conversation, and debate in both offline and online settings around the globe.
In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.
Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different.
Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.
Justin Hsu, Aaron Roth, Tim Roughgarden, Jonathan Ullman, Javier Esparza, Pierre Fraigniaud, Thore Husfeldt, and Elias Koutsoupias. 2014. “Privately Solving Linear Programs.” In Automata, Languages, and Programming, 8572: Pp. 612-624. Springer Berlin Heidelberg. Publisher's VersionPDF
In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008) but a number of basic questions still remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013ab).
Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:
SCDP(C)=Ω(LDim(C)), where LDim(C) is the Littlestone's (1987) dimension characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that SCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work.
For any t, there exists a class C such that LDim(C)=2 but SCDP(C)≥t.
For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b).
We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al., (2013a) and our bounds rely on a number of known results from communication complexity.
The greatest threat to privacy today is not the NSA, but good-old American companies. Internet giants, leading retailers, and other firms are voraciously gathering data with little oversight from anyone.
In Las Vegas, no company knows the value of data better than Caesars Entertainment. Many thousands of enthusiastic clients pour through the ever-open doors of their casinos. The secret to the company’s success lies in their one unrivaled asset: they know their clients intimately by tracking the activities of the overwhelming majority of gamblers. They know exactly what games they like to play, what foods they enjoy for breakfast, when they prefer to visit, who their favorite hostess might be, and exactly how to keep them coming back for more.
Caesars’ dogged data-gathering methods have been so successful that they have grown to become the world’s largest casino operator, and have inspired companies of all kinds to ramp up their own data mining in the hopes of boosting their targeted marketing efforts. Some do this themselves. Some rely on data brokers. Others clearly enter a moral gray zone that should make American consumers deeply uncomfortable.
We live in an age when our personal information is harvested and aggregated whether we like it or not. And it is growing ever more difficult for those businesses that choose not to engage in more intrusive data gathering to compete with those that do. Tanner’s timely warning resounds: Yes, there are many benefits to the free flow of all this data, but there is a dark, unregulated, and destructive netherworld as well.
A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database db ∈ dbset, and k = ~Θ(n2) arbitrary efficiently computable counting queries, runs in time poly(d, n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth (AC0) circuits. Our result is almost tight because it is known that ~Ω(n2) counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than n2 queries (even exponential in n) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.
In 2008, Kasiviswanathan el al. defined private learning as a combination of PAC learning and differential privacy . Informally, a private learner is applied to a collection of labeled individual information and outputs a hypothesis while preserving the privacy of each individual. Kasiviswanathan et al. gave a generic construction of private learners for (finite) concept classes, with sample complexity logarithmic in the size of the concept class. This sample complexity is higher than what is needed for non-private learners, hence leaving open the possibility that the sample complexity of private learning may be sometimes significantly higher than that of non-private learning. We give a combinatorial characterization of the sample size sufficient and necessary to privately learn a class of concepts. This characterization is analogous to the well known characterization of the sample complexity of non-private learning in terms of the VC dimension of the concept class. We introduce the notion of probabilistic representation of a concept class, and our new complexity measure RepDim corresponds to the size of the smallest probabilistic representation of the concept class. We show that any private learning algorithm for a concept class C with sample complexity m implies RepDim(C) = O(m), and that there exists a private learning algorithm with sample complexity m = O(RepDim(C)).
We further demonstrate that a similar characterization holds for the database size needed for privately computing a large class of optimization problems and also for the well studied problem of private data release.
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting dif- ferential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data an- alysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer an- alyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the pri- vate query-release problem as a two-player zero-sum game, which may be of independent interest.