# Privacy Tools for Sharing Research Data: Publications

2014
Kobbi Nissim, Salil Vadhan, and David Xiao. 2014. “Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals.” In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, Pp. 411–422. New York, NY, USA: ACM. Publisher's Version
2013
Guy N. Rothblum, Salil Vadhan, and Avi Wigderson. 2013. “Interactive proofs of proximity: delegating computation in sublinear time.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 793-802. Palo Alto, California, USA: ACM. DOIAbstract

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication and complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n3/4 queries and communication. On the negative side, we show that there exist natural languages in NC1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least ~Ω(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

2012
Yevgeniy Dodis, Adriana López-Alt, Ilya Mironov, and Salil Vadhan. 2012. “Differential Privacy with Imperfect Randomness.” In Proceedings of the 32nd International Cryptology Conference (CRYPTO 12), Lecture Notes on Computer Science, 7417: Pp. 497–516. Santa Barbara, CA: Springer-Verlag. Springer LinkAbstract

In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under speciﬁc “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Speciﬁcally, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and diﬀerential privacy with respect to imperfect randomness. Interestingly, the design of our mechanism is quite diﬀerent from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisﬁed by any additive-noise mechanism.

Justin Thaler, Jonathan Ullman, and Salil P. Vadhan. 2012. “Faster Algorithms for Privately Releasing Marginals.” In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Lecture Notes in Computer Science. Vol. 7391. Warwick, UK: Springer. DOIAbstract

We study the problem of releasing k-way marginals of a database D ∈ ({0, 1} d ) n , while preserving differential privacy. The answer to a k-way marginal query is the fraction of D’s records x ∈ {0, 1} d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS ’07). We give an algorithm that runs in time dO(k√) and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n≥dO(k√) . To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d Θ(k) (for k ≪ d).

Cynthia Dwork, Moni Naor, and Salil Vadhan. 2012. “The Privacy of the Analyst and the Power of the State.” In Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS 12), Pp. 400–409. New Brunswick, NJ: IEEE. IEEE XploreAbstract

We initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

2011
Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Paul Kominers, Scott Duke Kominers, Harry R. Lewis, Tal Moran, and Guy Rothblum. 2011. “Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005”. regulations.govAbstract

Comments by Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Scott Duke Kominers, Paul Kominers, Harry Lewis, Tal Moran, Guy Rothblum, and Jon Ullman (at Harvard, Microsoft Research, the University of Chicago, MIT, and the Herzilya Interdisciplinary Center) These comments address the issues of data privacy and de-identification raised in the ANPRM. Our perspective is informed by substantial advances in privacy science that have been made in the computer science literature.

Jon Ullman and Salil Vadhan. 2011. “PCPs and the Hardness of Generating Synthetic Data.” In Proceedings of the 8th IACR Theory of Cryptography Conference (TCC 11), edited by Yuval Ishai, Lecture Notes on Computer Science, 5978: Pp. 572–587. Providence, RI: Springer-Verlag. Springer LinkAbstract

Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D\in ({0,1}^d)^n and outputs a synthetic database'' D' all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x\in {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS 07), who gave an algorithm running in time poly(n,2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC 09) with PCP-based Levin-reductions from NP search problems to finding approximate solutions to CSPs.

Yiling Chen, Stephen Chong, Ian A. Kash, Tal Moran, and Salil P. Vadhan. 2011. “Truthful Mechanisms for Agents that Value Privacy.” CoRR, abs/1111.5472. ArXiv VersionAbstract

Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome $o$ has the property that any report of player $i$ would have led to $o$ with approximately the same probability, then $o$ has small privacy cost to player $i$. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number $n$ of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of $n$).

2010
Cynthia Dwork, Guy Rothblum, and Salil Vadhan. 2010. “Boosting and Differential Privacy.” In Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS 10), Pp. 51–60. Las Vegas, NV: IEEE. DOIAbstract

Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-pre serving synopses of an input database. These are data structures that yield, for a given set Q of queries over an input database, reasonably accurate estimates of the responses to every query in Q, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on Q and produces a "weak" synopsis that yields "good" answers for a majority of the weight in Q, our Boosting for Queries algorithm obtains a synopsis that is good for all of Q. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an O(ε2) bound on the expected privacy loss from a single e-differentially private mechanism. Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides e-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.

Andrew McGregor, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. 2010. “The Limits of Two-Party Differential Privacy.” In Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS 10), Pp. 81–90. Las Vegas, NV: IEEE. DOIAbstract

We study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy. The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions).

2009
Cynthia Dwork, Moni Naor, Omer Reingold, Guy Rothblum, and Salil Vadhan. 2009. “On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results.” In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC 09), Pp. 381–390. Bethesda, MD. ACM Digital LibraryAbstract

We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role. For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.

Ilya Mironov, Omkant Pandey, Omer Reingold, and Salil Vadhan. 2009. “Computational Differential Privacy.” In Advances in Cryptology–-CRYPTO `09, 5677: Pp. 126–142. Santa Barbara, CA: Springer-Verlag. Springer LinkAbstract

The definition of differential privacy has recently emerged as a leading standard of privacy guarantees for algorithms on statistical databases. We offer several relaxations of the definition which require privacy guarantees to hold only against efficient—i.e., computationally-bounded—adversaries. We establish various relationships among these notions, and in doing so, we observe their close connection with the theory of pseudodense sets by Reingold et al.[1]. We extend the dense model theorem of Reingold et al. to demonstrate equivalence between two definitions (indistinguishability- and simulatability-based) of computational differential privacy. Our computational analogues of differential privacy seem to allow for more accurate constructions than the standard information-theoretic analogues. In particular, in the context of private approximation of the distance between two vectors, we present a differentially-private protocol for computing the approximation, and contrast it with a substantially more accurate protocol that is only computationally differentially private.