# Privacy Tools for Sharing Research Data: Publications

2015
A. Beimel, K. Nissim, and U. Stemmer. 2015. “Learning Privately with Labeled and Unlabeled Examples.” Accepted for publication, SODA 2015. arXiv.orgAbstract

A private learner is an algorithm that given a sample of labeled individual examples outputs a generalizing hypothesis while preserving the privacy of each individual. In 2008, Kasiviswanathan et al. (FOCS 2008) gave a generic construction of private learners, in which the sample complexity is (generally) higher than what is needed for non-private learners. This gap in the sample complexity was then further studied in several followup papers, showing that (at least in some cases) this gap is unavoidable. Moreover, those papers considered ways to overcome the gap, by relaxing either the privacy or the learning guarantees of the learner.
We suggest an alternative approach, inspired by the (non-private) models of semi-supervised learning and active-learning, where the focus is on the sample complexity of labeled examples whereas unlabeled examples are of a significantly lower cost. We consider private semi-supervised learners that operate on a random sample, where only a (hopefully small) portion of this sample is labeled. The learners have no control over which of the sample elements are labeled. Our main result is that the labeled sample complexity of private learners is characterized by the VC dimension.
We present two generic constructions of private semi-supervised learners. The first construction is of learners where the labeled sample complexity is proportional to the VC dimension of the concept class, however, the unlabeled sample complexity of the algorithm is as big as the representation length of domain elements. Our second construction presents a new technique for decreasing the labeled sample complexity of a given private learner, while only slightly increasing its unlabeled sample complexity. In addition, we show that in some settings the labeled sample complexity does not depend on the privacy parameters of the learner.

C. Dwork, A Smith, T Steinke, J Ullman, and S. Vadhan. 2015. “Robust Traceability from Trace Amounts.” In IEEE Symposium on Foundations of Computer Science (FOCS 2015). Berkeley, California.Abstract

The privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in 1 norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

Mark Bun, Kobbi Nissim, and Uri Stemmer. 2015. “Simultaneous private learning of multiple concepts.” In .Abstract

We investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving k learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving k learning tasks without privacy? In our setting, an individual example consists of a domain element x labeled by k unknown concepts (c1,,ck). The goal of a multi-learner is to output k hypotheses (h1,,hk) that generalize the input examples.
Without concern for privacy, the sample complexity needed to simultaneously learn k concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with k. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in k.

2014
Daniel J. Weitzner, Hal Abelson, Cynthia Dwork, Cameron Kerry, Daniela Rus, Sandy Pentland, and Salil Vadhan. 4/4/2014. “Consumer Privacy Bill of Rights and Big Data: Response to White House Office of Science and Technology Policy Request for Information”.Abstract

In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy we offer these comments based on presentations and discussions at the White House-MIT Workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society and NYU Information Law Institute and the UC Berkeley iSchool.

Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood. 3/31/2014. “Comment to The White House Office of Science and Technology Policy (OSTP): Big Data Study, Request for Information”.Abstract
On January 23, 2014, President Barack Obama asked John Podesta to perform a comprehensive review of big data and privacy. During this review, the White House Office of Science and Technology Policy issued a request for public comment on questions related to the public policy implications of big data.

Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood submitted a response on behalf of the Privacy Tools for Sharing Research Data project. Their comments outline a broad, comprehensive, and systematic framework for privacy analysis and provide a taxonomy of modern technological, statistical, and cryptographic approaches to preserving both data privacy and utility. They argue that an analysis of information privacy should address the scope of information covered, the sensitivity of that information, the risk that sensitive information will be disclosed, the availability of control and accountability mechanisms, and the suitability of existing data sharing models, as applied across the entire lifecyle of information use, from collection through dissemination and reuse.

With this submission, the authors discuss the inadequacy of traditional approaches to privacy protection and recommend a modern approach that considers three principles. First, the risks of informational harm are generally not a simple function of the presence or absence of specific fields, attributes, or keywords in the released set of data. Second, redaction, pseudonymization, coarsening, and hashing, are often neither an adequate nor appropriate practice, nor is releasing less information necessary more privacy protective. Third, a thoughtful analysis with expert consultation is necessary in order to evaluate the sensitivity of the data collected, to quantify the associated re-identification risks, and to design useful and safe release mechanisms.
Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood. 3/8/2014. “Comment on the Occupational Safety and Health Administration (OSHA) Proposed Rule: Improve Tracking of Workplace Injuries and Illnesses; Extension of Comment Period”. Full Text at Regulations.gov
Christos Dimoulas, Scott Moore, Aslan Askarov, and Stephen Chong. 2014. “Declarative Policies for Capability Control.” In Proceedings of the 27th {IEEE} Computer Security Foundations Symposium. Piscataway, NJ, USA: IEEE Press.Abstract

In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system.

To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. Access control policies restrict which components may use a capability and are enforced using higher-order contracts. Integrity policies restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.

Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, and Andrew Wan. 2014. “Faster Private Release of Marginals on Small Databases.” In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, Pp. 387–402. New York, NY, USA: ACM. Publisher's Version
Mark Bun, Jonathan Ullman, and Salil Vadhan. 2014. “Fingerprinting Codes and the Price of Approximate Differential Privacy.” In SIAM Journal on Computing Special Issue on STOC 14., Pp. 1–10.
Alexandra Wood, David O'Brien, Micah Altman, Alan Karr, Urs Gasser, Michael Bar-Sinai, Kobbi Nissim, Jonathan Ullman, Salil Vadhan, and Wojcik, Michael John. 2014. Integrating Approaches to Privacy Across the Research Lifecycle: Long-Term Longitudinal Studies. Social Science Research Network. Cambridge: Harvard University. Publisher's VersionAbstract

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.

Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.

Yiling Chen, Or Sheffet, and Salil Vadhan. 2014. “Privacy Games.” In 10th Conference on Web and Internet Economics (WINE). Beijing, China.
Kobbi Nissim, Salil Vadhan, and David Xiao. 2014. “Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals.” In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, Pp. 411–422. New York, NY, USA: ACM. Publisher's Version
Vitaly Feldman and David Xiao. 2014. “Sample Complexity Bounds on Differentially Private Learning via Communication Complexity.” Proceedings of The 27th Conference on Learning Theory (COLT 2014) 35, Pp. 1-20. Barcelona, Spain: JMLR Workshop and Conference Proceedings. Publisher's VersionAbstract

In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008) but a number of basic questions still remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013ab).

Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:

• SCDP(C)=Ω(LDim(C)), where LDim(C) is the Littlestone's (1987) dimension characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that SCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work.
• For any t, there exists a class C such that LDim(C)=2 but SCDP(C)t.
• For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b).

We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al., (2013a) and our bounds rely on a number of known results from communication complexity.

2013
Jonathan Ullman. 2013. “Answering n{2+o(1)} counting queries with differential privacy is hard.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 361-370. Palo Alto, California, USA: ACM. DOIAbstract
A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database db ∈ dbset, and k = ~Θ(n2) arbitrary efficiently computable counting queries, runs in time poly(d, n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth (AC0) circuits. Our result is almost tight because it is known that ~Ω(n2) counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than n2 queries (even exponential in n) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.
Justin Hsu, Aaron Roth, and Jonathan Ullman. 2013. “Differential privacy for the analyst via private equilibrium computation.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 341-350. Palo Alto, California, USA: ACM. DOIAbstract
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting dif- ferential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data an- alysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer an- alyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the pri- vate query-release problem as a two-player zero-sum game, which may be of independent interest.
Latanya Sweeney. 2013. “Discrimination in online ad delivery.” Commun. ACM, 56, 5, Pp. 44–54. DOIAbstract
Google ads, black names and white names, racial discrimination, and click advertising.
Mark Bun and Justin Thaler. 2013. “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities.” Edited by FedorV. Fomin, Rūsiņš Freivalds, Marta Kwiatkowska, and David Peleg. Automata, Languages, and Programming, 7965, Pp. 303-314. DOIAbstract
The ε-approximate degree of a Boolean function f: { − 1, 1} n  → { − 1, 1} is the minimum degree of a real polynomial that approximates f to within ε in the ℓ ∞  norm. We prove several lower bounds on this important complexity measure by explicitly constructing solutions to the dual of an appropriate linear program. Our first result resolves the ε-approximate degree of the two-level AND-OR tree for any constant ε > 0. We show that this quantity is Θ(n‾‾√) , closing a line of incrementally larger lower bounds [3,11,21,30,32]. The same lower bound was recently obtained independently by Sherstov using related techniques [25]. Our second result gives an explicit dual polynomial that witnesses a tight lower bound for the approximate degree of any symmetric Boolean function, addressing a question of Špalek [34]. Our final contribution is to reprove several Markov-type inequalities from approximation theory by constructing explicit dual solutions to natural linear programs. These inequalities underly the proofs of many of the best-known approximate degree lower bounds, and have important uses throughout theoretical computer science.
Guy N. Rothblum, Salil Vadhan, and Avi Wigderson. 2013. “Interactive proofs of proximity: delegating computation in sublinear time.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 793-802. Palo Alto, California, USA: ACM. DOIAbstract

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication and complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n3/4 queries and communication. On the negative side, we show that there exist natural languages in NC1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least ~Ω(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

Amos Beimel, Kobbi Nissim, Uri Stemmer, Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and JoséD.P. Rolim. 2013. “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy.” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 8096: Pp. 363-378. Springer Berlin Heidelberg. Publisher's Version
Sean Hooley and Latanya Sweeney. 2013. “Survey of Publicly Available State Health Databases.” Data Privacy Lab, IQSS, Harvard University. Project website