Latanya Sweeney, Akua Abu, and Julia Winn. 2013. “Identifying Participants in the Personal Genome Project by Name.” Data Privacy Lab, IQSS, Harvard University. Project website PDF
Guy N. Rothblum, Salil Vadhan, and Avi Wigderson. 2013. “Interactive proofs of proximity: delegating computation in sublinear time.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 793-802. Palo Alto, California, USA: ACM. DOIAbstract

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication and complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n3/4 queries and communication. On the negative side, we show that there exist natural languages in NC1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least ~Ω(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

Latanya Sweeney. 2013. “Matching Known Patients to Health Records in Washington State Data.” Data Privacy Lab, IQSS, Harvard University. Project website PDF
Amos Beimel, Kobbi Nissim, Uri Stemmer, Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and JoséD.P. Rolim. 2013. “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy.” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 8096: Pp. 363-378. Springer Berlin Heidelberg. Publisher's Version PDF
Latanya Sweeney, William A Yasnoff, and Edward H Shortliffe. 2013. “Putting Health IT on the Path to Success.” JAMA, 309, 10, Pp. 989-990. DOIAbstract
The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Serious flaws in current approaches to health information exchanges: (1) complex and expensive; (2) prone to error and insecurity; (3) increase liability; (4) not financially sustainable; (5) unable to protect privacy; (6) unable to ensure stakeholder cooperation; and, (7) unable to facilitate robust data sharing. The good news is that personal health record banks pose a viable alternative that is: (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and, (e) capable of providing the most comprehensive records. The idea of patient controlled records is not new, but what is new is how personally controlled records can help achieve the HIT vision.
George Alter, Micah Altman, Mark Abrahamson, Roper Center, Merce Crosas, Jon Crabtree, and Ted Hull. 2013. “Response to the National Institute of Health Request Information: Input on Development of NIH Data Catalog”.
Sean Hooley and Latanya Sweeney. 2013. “Survey of Publicly Available State Health Databases.” Data Privacy Lab, IQSS, Harvard University. Project website PDF
Yiling Chen, Stephen Chong, Ian A Kash, Tal Moran, and Salil Vadhan. 2013. “Truthful mechanisms for agents that value privacy.” In Proceedings of the fourteenth ACM conference on Electronic commerce, Pp. 215-232. Philadelphia, Pennsylvania, USA: ACM. DOIAbstract
Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has small privacy cost to player i. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n).
Yevgeniy Dodis, Adriana López-Alt, Ilya Mironov, and Salil Vadhan. 2012. “Differential Privacy with Imperfect Randomness.” In Proceedings of the 32nd International Cryptology Conference (CRYPTO `12), Lecture Notes on Computer Science, 7417: Pp. 497–516. Santa Barbara, CA: Springer-Verlag. Springer LinkAbstract

In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness. Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.

Justin Thaler, Jonathan Ullman, and Salil P. Vadhan. 2012. “Faster Algorithms for Privately Releasing Marginals.” In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Lecture Notes in Computer Science. Vol. 7391. Warwick, UK: Springer. DOIAbstract

We study the problem of releasing k-way marginals of a database D ∈ ({0, 1} d ) n , while preserving differential privacy. The answer to a k-way marginal query is the fraction of D’s records x ∈ {0, 1} d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS ’07). We give an algorithm that runs in time dO(k√) and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n≥dO(k√) . To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d Θ(k) (for k ≪ d).

Anupam Gupta, Aaron Roth, and Jonathan Ullman. 2012. “Iterative Constructions and Private Data Release.” In Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012, Lecture Notes in Computer Science, 7194: Pp. 339-356. Taormina, Sicily, Italy: Springer. DOIAbstract

In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|1.5). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

Cynthia Dwork, Moni Naor, and Salil Vadhan. 2012. “The Privacy of the Analyst and the Power of the State.” In Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `12), Pp. 400–409. New Brunswick, NJ: IEEE. IEEE XploreAbstract

We initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. 2012. “Private Equilibrium Release, Large Games, and No-Regret Learning”. ArXiv VersionAbstract

We give mechanisms in which each of n players in a game is given their component of an (approximate) equilibrium in a way that guarantees differential privacy---that is, the revelation of the equilibrium components does not reveal too much information about the utilities of the other players. More precisely, we show how to compute an approximate correlated equilibrium (CE) under the constraint of differential privacy (DP), provided n is large and any player's action affects any other's payoff by at most a small amount. Our results draw interesting connections between noisy generalizations of classical convergence results for no-regret learning, and the noisy mechanisms developed for differential privacy. Our results imply the ability to truthfully implement good social-welfare solutions in many games, such as games with small Price of Anarchy, even if the mechanism does not have the ability to enforce outcomes. We give two different mechanisms for DP computation of approximate CE. The first is computationally efficient, but has a suboptimal dependence on the number of actions in the game; the second is computationally efficient, but allows for games with exponentially many actions. We also give a matching lower bound, showing that our results are tight up to logarithmic factors.

Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Paul Kominers, Scott Duke Kominers, Harry R. Lewis, Tal Moran, and Guy Rothblum. 2011. “Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005”. regulations.govAbstract

Comments by Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Scott Duke Kominers, Paul Kominers, Harry Lewis, Tal Moran, Guy Rothblum, and Jon Ullman (at Harvard, Microsoft Research, the University of Chicago, MIT, and the Herzilya Interdisciplinary Center) These comments address the issues of data privacy and de-identification raised in the ANPRM. Our perspective is informed by substantial advances in privacy science that have been made in the computer science literature.

Jon Ullman and Salil Vadhan. 2011. “PCPs and the Hardness of Generating Synthetic Data.” In Proceedings of the 8th IACR Theory of Cryptography Conference (TCC `11), edited by Yuval Ishai, Lecture Notes on Computer Science, 5978: Pp. 572–587. Providence, RI: Springer-Verlag. Springer LinkAbstract

Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D\in ({0,1}^d)^n and outputs a ``synthetic database'' D' all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x\in {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS `07), who gave an algorithm running in time poly(n,2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC `09) with PCP-based Levin-reductions from NP search problems to finding approximate solutions to CSPs.

Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. 2011. “Privately releasing conjunctions and the statistical query barrier.” In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, Pp. 803-812. San Jose, CA, USA: ACM. ACM Digital LibraryAbstract
Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ)model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents progress on a key open problem in privacy-preserving data analysis. Our first result on the other hand gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms.
Yiling Chen, Stephen Chong, Ian A. Kash, Tal Moran, and Salil P. Vadhan. 2011. “Truthful Mechanisms for Agents that Value Privacy.” CoRR, abs/1111.5472. ArXiv VersionAbstract

Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome $o$ has the property that any report of player $i$ would have led to $o$ with approximately the same probability, then $o$ has small privacy cost to player $i$. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number $n$ of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of $n$).

Cynthia Dwork, Guy Rothblum, and Salil Vadhan. 2010. “Boosting and Differential Privacy.” In Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10), Pp. 51–60. Las Vegas, NV: IEEE. DOIAbstract

Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-pre serving synopses of an input database. These are data structures that yield, for a given set Q of queries over an input database, reasonably accurate estimates of the responses to every query in Q, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on Q and produces a "weak" synopsis that yields "good" answers for a majority of the weight in Q, our Boosting for Queries algorithm obtains a synopsis that is good for all of Q. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an O(ε2) bound on the expected privacy loss from a single e-differentially private mechanism. Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides e-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.

Andrew McGregor, Ilya Mironov, Toniann Pitassi, Omer Reingold, Kunal Talwar, and Salil Vadhan. 2010. “The Limits of Two-Party Differential Privacy.” In Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10), Pp. 81–90. Las Vegas, NV: IEEE. DOIAbstract

We study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy. The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions).

Cynthia Dwork, Moni Naor, Omer Reingold, Guy Rothblum, and Salil Vadhan. 2009. “On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results.” In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC `09), Pp. 381–390. Bethesda, MD. ACM Digital LibraryAbstract

We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role. For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.