# Publications

## 2013
Justin Hsu, Aaron Roth, and Jonathan Ullman. 2013. “Differential privacy for the analyst via private equilibrium computation.” In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC '13), Pp. 341-350. Palo Alto, California, USA: ACM. DOI.
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.
Latanya Sweeney. 2013. “Discrimination in online ad delivery.” Commun. ACM, 56, 5, Pp. 44–54. DOI.
Mark Bun and Justin Thaler. 2013. “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities.” Edited by Fedor V. Fomin, Rūsiņš Freivalds, Marta Kwiatkowska, and David Peleg. Automata, Languages, and Programming, 7965, Pp. 303-314. DOI.
The ε-approximate degree of a Boolean function f: {−1,1}^n → {−1,1} is the minimum degree of a real polynomial that approximates f to within ε in the ℓ∞ norm. We prove several lower bounds on this important complexity measure by explicitly constructing solutions to the dual of an appropriate linear program. Our first result resolves the ε-approximate degree of the two-level AND-OR tree for any constant ε > 0. We show that this quantity is Θ(√n), closing a line of incrementally larger lower bounds [3,11,21,30,32]. The same lower bound was recently obtained independently by Sherstov using related techniques [25]. Our second result gives an explicit dual polynomial that witnesses a tight lower bound for the approximate degree of any symmetric Boolean function, addressing a question of Špalek [34]. Our final contribution is to reprove several Markov-type inequalities from approximation theory by constructing explicit dual solutions to natural linear programs. These inequalities underlie the proofs of many of the best-known approximate degree lower bounds, and have important uses throughout theoretical computer science.
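The dual-LP perspective this abstract describes can be stated concretely. By LP duality, a lower bound on approximate degree is certified by an explicit "dual polynomial"; the following is the standard formulation (notation mine, not taken from the paper):

```latex
% A dual witness certifying deg_eps(f) > d: a function psi on {-1,1}^n with
% unit L1 mass, correlation with f exceeding eps, and no low-degree part.
\exists\, \psi : \{-1,1\}^n \to \mathbb{R} \quad\text{such that}\quad
\sum_{x} |\psi(x)| = 1, \qquad
\sum_{x} \psi(x)\,f(x) > \varepsilon, \qquad
\sum_{x} \psi(x)\,p(x) = 0 \;\; \text{for every polynomial } p \text{ with } \deg p \le d.
```

Exhibiting such a ψ explicitly, rather than arguing non-constructively, is the technique the paper applies to the AND-OR tree and to symmetric functions.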
S.H. Chan, T.B. Costa, and E.M. Airoldi. 2013. “Estimation of exchangeable graph models by stochastic blockmodel approximation.” In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, Pp. 293-296.
Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, and Andrew Wan. 2013. “Faster Private Release of Marginals on Small Databases.” CoRR, abs/1304.3754. arXiv.org.
We study the problem of answering \emph{$k$-way marginal} queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$\min\{\exp(d^{1-\Omega(1/\sqrt{k})}),\ \exp(d / \log^{0.99} d)\}$$ per query and answers any (possibly superpolynomially long and adaptively chosen) sequence of $k$-way marginal queries up to error at most $\pm .01$ on every query, provided $n \gtrsim d^{.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee on a database of size $\mathrm{poly}(d, k)$ in time $\exp(o(d))$. Our algorithms are a variant of the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10), but using a different low-weight representation of the database. We derive our low-weight representation using approximations to the OR function by low-degree polynomials with coefficients of bounded $L_1$-norm. We also prove a strong limitation on our approach that is of independent approximation-theoretic interest. Specifically, we show that for any $k = o(\log d)$, any polynomial with coefficients of $L_1$-norm $\mathrm{poly}(d)$ that pointwise approximates the $d$-variate OR function on all inputs of Hamming weight at most $k$ must have degree $d^{1-O(1/\sqrt{k})}$.
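As context for the "private multiplicative weights" approach the abstract builds on, here is a toy sketch of the generic multiplicative-weights release loop: maintain a public synthetic distribution over the data universe, and reweight it whenever a noisy true answer disagrees with the synthetic answer. This is an illustration only, not the paper's algorithm; the function name, noise scale, and threshold are all my own choices.

```python
import numpy as np

def mw_release(true_hist, queries, noise_scale=0.01, threshold=0.05, eta=0.5, seed=0):
    """Toy multiplicative-weights release loop (illustrative sketch only).

    true_hist: normalized histogram over a data universe of size N.
    queries:   0/1 vectors of length N; a query's answer is <q, hist>.
    """
    rng = np.random.default_rng(seed)
    n = len(true_hist)
    synth = np.full(n, 1.0 / n)            # start from the uniform distribution
    answers = []
    for q in queries:
        noisy = float(q @ true_hist) + rng.laplace(0.0, noise_scale)
        est = float(q @ synth)
        if abs(noisy - est) > threshold:   # "hard" query: update the synthetic data
            sign = 1.0 if noisy > est else -1.0
            synth *= np.exp(eta * sign * q)  # multiplicative reweighting
            synth /= synth.sum()             # renormalize to a distribution
            est = float(q @ synth)
        answers.append(est)
    return answers
```

The privacy budget accounting that makes such a loop differentially private (and the different low-weight database representation that is this paper's contribution) are omitted here.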
Latanya Sweeney, Akua Abu, and Julia Winn. 2013. “Identifying Participants in the Personal Genome Project by Name.” Data Privacy Lab, IQSS, Harvard University. Project website
Guy N. Rothblum, Salil Vadhan, and Avi Wigderson. 2013. “Interactive proofs of proximity: delegating computation in sublinear time.” In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC '13), Pp. 793-802. Palo Alto, California, USA: ACM. DOI.

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n^{3/4} queries and communication. On the negative side, we show that there exist natural languages in NC^1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least Ω̃(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

Latanya Sweeney. 2013. “Matching Known Patients to Health Records in Washington State Data.” Data Privacy Lab, IQSS, Harvard University. Project website
Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2013. “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy.” Edited by Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and José D. P. Rolim. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 8096: Pp. 363-378. Springer Berlin Heidelberg. Publisher's Version
Latanya Sweeney, William A Yasnoff, and Edward H Shortliffe. 2013. “Putting Health IT on the Path to Success.” JAMA, 309, 10, Pp. 989-990. DOI.
The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Current approaches to health information exchange have serious flaws: they are (1) complex and expensive; (2) prone to error and insecurity; (3) liability-increasing; (4) not financially sustainable; (5) unable to protect privacy; (6) unable to ensure stakeholder cooperation; and (7) unable to facilitate robust data sharing. The good news is that personal health record banks pose a viable alternative that is: (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and (f) capable of providing the most comprehensive records. The idea of patient controlled records is not new, but what is new is how personally controlled records can help achieve the HIT vision.
Sean Hooley and Latanya Sweeney. 2013. “Survey of Publicly Available State Health Databases.” Data Privacy Lab, IQSS, Harvard University. Project website
Yiling Chen, Stephen Chong, Ian A. Kash, Tal Moran, and Salil Vadhan. 2013. “Truthful mechanisms for agents that value privacy.” In Proceedings of the fourteenth ACM conference on Electronic commerce, Pp. 215-232. Philadelphia, Pennsylvania, USA: ACM. DOI.
Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has small privacy cost to player i. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n).
## 2012
Yevgeniy Dodis, Adriana López-Alt, Ilya Mironov, and Salil Vadhan. 2012. “Differential Privacy with Imperfect Randomness.” In Proceedings of the 32nd International Cryptology Conference (CRYPTO 12), Lecture Notes in Computer Science, 7417: Pp. 497–516. Santa Barbara, CA: Springer-Verlag. Springer Link.

In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC'07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC'06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness. Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.
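For contrast, the standard additive-noise baseline the abstract departs from is the Laplace mechanism, which assumes perfect randomness. A minimal sketch (the function name is mine; this is the classical mechanism, not the SV-robust one constructed in the paper):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Classical additive-noise (Laplace) mechanism for a low-sensitivity
    statistic: add Laplace noise with scale sensitivity/epsilon. Requires a
    source of perfect randomness; the paper shows no additive-noise mechanism
    can be made robust to a Santha-Vazirani source."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(0.0, sensitivity / epsilon)
```

For example, releasing a count (sensitivity 1) with ε = 0.5 adds noise of scale 2, so the released value is unbiased with standard deviation about 2.8.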

Justin Thaler, Jonathan Ullman, and Salil P. Vadhan. 2012. “Faster Algorithms for Privately Releasing Marginals.” In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Lecture Notes in Computer Science. Vol. 7391. Warwick, UK: Springer. DOI.

We study the problem of releasing k-way marginals of a database D ∈ ({0,1}^d)^n, while preserving differential privacy. The answer to a k-way marginal query is the fraction of D's records x ∈ {0,1}^d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS '07). We give an algorithm that runs in time d^{O(√k)} and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n ≥ d^{O(√k)}. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d^{Θ(k)} (for k ≪ d).
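For reference, the query being released is simple to evaluate when privacy is ignored; a minimal non-private sketch of a k-way marginal over a {0,1}^d database (function name mine):

```python
def kway_marginal(db, cols, vals):
    """Fraction of records x in db (tuples over {0,1}) with value vals[i]
    in column cols[i] for every i. Non-private reference evaluation of the
    query; the paper's contribution is releasing all such answers privately."""
    match = sum(1 for x in db if all(x[c] == v for c, v in zip(cols, vals)))
    return match / len(db)
```

The difficulty the abstract highlights is that there are d^{Θ(k)} such queries, so answering each one with independent noise is infeasible for large k.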

Anupam Gupta, Aaron Roth, and Jonathan Ullman. 2012. “Iterative Constructions and Private Data Release.” In Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012, Lecture Notes in Computer Science, 7194: Pp. 339-356. Taormina, Sicily, Italy: Springer. DOI.

In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|^{1.5}). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

Cynthia Dwork, Moni Naor, and Salil Vadhan. 2012. “The Privacy of the Analyst and the Power of the State.” In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS 12), Pp. 400–409. New Brunswick, NJ: IEEE. IEEE Xplore.

We initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow privately answering a huge number of queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. 2012. “Private Equilibrium Release, Large Games, and No-Regret Learning”. ArXiv Version.

We give mechanisms in which each of n players in a game is given their component of an (approximate) equilibrium in a way that guarantees differential privacy---that is, the revelation of the equilibrium components does not reveal too much information about the utilities of the other players. More precisely, we show how to compute an approximate correlated equilibrium (CE) under the constraint of differential privacy (DP), provided n is large and any player's action affects any other's payoff by at most a small amount. Our results draw interesting connections between noisy generalizations of classical convergence results for no-regret learning, and the noisy mechanisms developed for differential privacy. Our results imply the ability to truthfully implement good social-welfare solutions in many games, such as games with small Price of Anarchy, even if the mechanism does not have the ability to enforce outcomes. We give two different mechanisms for DP computation of approximate CE. The first is computationally efficient, but has a suboptimal dependence on the number of actions in the game; the second is not computationally efficient, but allows for games with exponentially many actions. We also give a matching lower bound, showing that our results are tight up to logarithmic factors.

## 2011
Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Paul Kominers, Scott Duke Kominers, Harry R. Lewis, Tal Moran, and Guy Rothblum. 2011. “Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005”. regulations.gov.

Comments by Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Scott Duke Kominers, Paul Kominers, Harry Lewis, Tal Moran, Guy Rothblum, and Jon Ullman (at Harvard, Microsoft Research, the University of Chicago, MIT, and the Herzliya Interdisciplinary Center). These comments address the issues of data privacy and de-identification raised in the ANPRM. Our perspective is informed by substantial advances in privacy science that have been made in the computer science literature.

Jon Ullman and Salil Vadhan. 2011. “PCPs and the Hardness of Generating Synthetic Data.” In Proceedings of the 8th IACR Theory of Cryptography Conference (TCC 11), edited by Yuval Ishai, Lecture Notes in Computer Science, 5978: Pp. 572–587. Providence, RI: Springer-Verlag. Springer Link.

Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D ∈ ({0,1}^d)^n and outputs a “synthetic database” D′ all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x ∈ {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS '07), who gave an algorithm running in time poly(n, 2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC '09) with PCP-based Levin-reductions from NP search problems to finding approximate solutions to CSPs.

Anupam Gupta, Moritz Hardt, Aaron Roth, and Jonathan Ullman. 2011. “Privately releasing conjunctions and the statistical query barrier.” In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, Pp. 803-812. San Jose, CA, USA: ACM. ACM Digital Library.
Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents progress on a key open problem in privacy-preserving data analysis. Our first result on the other hand gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms.
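The statistical-query access model underlying these results can be sketched as a simple oracle: the algorithm may ask only for the expectation of a predicate over the data, answered up to an additive tolerance. A toy sketch (here the tolerance is modeled as uniform noise, one common idealization; names are mine and this is not the paper's mechanism):

```python
import numpy as np

def sq_oracle(data, predicate, tau, rng=None):
    """Toy statistical-query oracle: returns the empirical mean of
    predicate(x) over the data set, perturbed by at most tau (modeled here
    as uniform noise in [-tau, tau]). Every access pattern the abstract
    discusses is restricted to queries of this shape."""
    rng = rng or np.random.default_rng(0)
    true_mean = sum(predicate(x) for x in data) / len(data)
    return true_mean + rng.uniform(-tau, tau)
```

The paper's lower bounds apply to any algorithm, private or not, whose only interaction with the data factors through such an oracle.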