Publications

2014
D. J. Weitzner, et al., “Consumer Privacy Bill of Rights and Big Data: Response to White House Office of Science and Technology Policy Request for Information”. 2014. Abstract
In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy, we offer these comments based on presentations and discussions at the White House-MIT workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society, the NYU Information Law Institute, and the UC Berkeley iSchool.
PDF version
M. Altman, O’Brien, D., and Wood, A., “Comment on the Occupational Safety and Health Administration (OSHA) Proposed Rule: Improve Tracking of Workplace Injuries and Illnesses; Extension of Comment Period”. 2014. Full Text at Regulations.gov PDF version of comments
2013
M. Bun and Thaler, J., “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities”, Automata, Languages, and Programming, vol. 7965, pp. 303-314, 2013. DOI Abstract
The ε-approximate degree of a Boolean function f: {−1, 1}^n → {−1, 1} is the minimum degree of a real polynomial that approximates f to within ε in the ℓ_∞ norm. We prove several lower bounds on this important complexity measure by explicitly constructing solutions to the dual of an appropriate linear program. Our first result resolves the ε-approximate degree of the two-level AND-OR tree for any constant ε > 0. We show that this quantity is Θ(√n), closing a line of incrementally larger lower bounds [3,11,21,30,32]. The same lower bound was recently obtained independently by Sherstov using related techniques [25]. Our second result gives an explicit dual polynomial that witnesses a tight lower bound for the approximate degree of any symmetric Boolean function, addressing a question of Špalek [34]. Our final contribution is to reprove several Markov-type inequalities from approximation theory by constructing explicit dual solutions to natural linear programs. These inequalities underlie the proofs of many of the best-known approximate degree lower bounds, and have important uses throughout theoretical computer science.
PDF
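For quick reference, the quantity studied above can be written out explicitly; this is a standard formulation of ε-approximate degree restated from the abstract, and the dual linear-programming witnesses the paper constructs certify lower bounds on it:

```latex
\widetilde{\mathrm{deg}}_{\varepsilon}(f)
  \;=\; \min\Bigl\{ \deg(p) \;:\;
        p \in \mathbb{R}[x_1,\ldots,x_n],\;
        \max_{x \in \{-1,1\}^n} \bigl|p(x) - f(x)\bigr| \le \varepsilon \Bigr\}
```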
K. Chandrasekaran, Thaler, J., Ullman, J., and Wan, A., “Faster Private Release of Marginals on Small Databases”, CoRR, vol. abs/1304.3754, 2013. arXiv.org Abstract
We study the problem of answering \emph{$k$-way marginal} queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$\min\{\exp(d^{1-\Omega(1/\sqrt{k})}), \exp(d / \log^{.99} d)\}$$ per query and answers any (possibly superpolynomially long and adaptively chosen) sequence of $k$-way marginal queries up to error at most $\pm .01$ on every query, provided $n \gtrsim d^{.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee on a database of size $\mathrm{poly}(d, k)$ in time $\exp(o(d))$. Our algorithms are a variant of the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10), but using a different low-weight representation of the database. We derive our low-weight representation using approximations to the OR function by low-degree polynomials with coefficients of bounded $L_1$-norm. We also prove a strong limitation on our approach that is of independent approximation-theoretic interest. Specifically, we show that for any $k = o(\log d)$, any polynomial with coefficients of $L_1$-norm $\mathrm{poly}(d)$ that pointwise approximates the $d$-variate OR function on all inputs of Hamming weight at most $k$ must have degree $d^{1-O(1/\sqrt{k})}$.
PDF
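For concreteness, the query class itself is easy to state in code. The sketch below computes a k-way marginal on a database in ({0,1}^d)^n; names like `marginal` are illustrative, and this is only the non-private query definition, not the paper's private multiplicative-weights-style algorithm:

```python
import numpy as np

def marginal(db: np.ndarray, cols: list[int], vals: list[int]) -> float:
    """Fraction of records x in {0,1}^d whose entries at `cols` equal `vals`.

    db has shape (n, d); len(cols) = len(vals) <= k for a k-way marginal.
    """
    mask = np.ones(len(db), dtype=bool)
    for c, v in zip(cols, vals):
        mask &= (db[:, c] == v)
    return mask.mean()

# Example: a 2-way marginal on a random database over {0,1}^5.
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1000, 5))
print(marginal(db, cols=[0, 3], vals=[1, 0]))
```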
L. Sweeney, Yasnoff, W. A., and Shortliffe, E. H., “Putting Health IT on the Path to Success”, JAMA, vol. 309, no. 10, pp. 989-990, 2013. DOI Abstract
The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Current approaches to health information exchange have serious flaws: they are (1) complex and expensive; (2) prone to error and insecurity; (3) likely to increase liability; (4) not financially sustainable; (5) unable to protect privacy; (6) unable to ensure stakeholder cooperation; and (7) unable to facilitate robust data sharing. The good news is that personal health record banks pose a viable alternative that is: (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and (f) capable of providing the most comprehensive records. The idea of patient-controlled records is not new; what is new is how personally controlled records can help achieve the HIT vision.
PDF
S. Hooley and Sweeney, L., “Survey of Publicly Available State Health Databases”, Data Privacy Lab, IQSS, Harvard University. 2013. Project website PDF
L. Sweeney, Abu, A., and Winn, J., “Identifying Participants in the Personal Genome Project by Name”, Data Privacy Lab, IQSS, Harvard University. 2013. Project website PDF
L. Sweeney, “Matching Known Patients to Health Records in Washington State Data”, Data Privacy Lab, IQSS, Harvard University. 2013. Project website PDF
J. Ullman, “Answering n^{2+o(1)} counting queries with differential privacy is hard”, in Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Palo Alto, California, USA, 2013, pp. 361-370. DOI Abstract
A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database D ∈ ({0,1}^d)^n, and k = ~Θ(n^2) arbitrary efficiently computable counting queries, runs in time poly(d, n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth (AC^0) circuits. Our result is almost tight because it is known that ~Ω(n^2) counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than n^2 queries (even exponentially many in n) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.
PDF
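For contrast with the hardness result above: a single counting query is easy to answer privately with the standard Laplace mechanism of Dwork et al. (TCC '06), since the fraction has global sensitivity 1/n. A minimal sketch under those standard assumptions, not code from the paper:

```python
import numpy as np

def private_counting_query(records, predicate, epsilon: float, rng=None) -> float:
    """Answer 'what fraction of records satisfy predicate q?' with epsilon-DP.

    Changing one record moves the fraction by at most 1/n, so Laplace noise
    of scale 1 / (n * epsilon) suffices for epsilon-differential privacy.
    """
    rng = rng or np.random.default_rng()
    n = len(records)
    true_answer = sum(predicate(x) for x in records) / n
    return true_answer + rng.laplace(scale=1.0 / (n * epsilon))

# Example: fraction of bit-vectors whose first bit is 1, with epsilon = 0.5.
rng = np.random.default_rng(1)
db = rng.integers(0, 2, size=(1000, 8))
print(private_counting_query(list(db), lambda x: x[0] == 1, epsilon=0.5))
```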
J. Hsu, Roth, A., and Ullman, J., “Differential privacy for the analyst via private equilibrium computation”, in Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Palo Alto, California, USA, 2013, pp. 341-350. DOI Abstract
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.
PDF
G. N. Rothblum, Vadhan, S., and Wigderson, A., “Interactive proofs of proximity: delegating computation in sublinear time”, in Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Palo Alto, California, USA, 2013, pp. 793-802. DOI Abstract
We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n^{3/4} queries and communication. On the negative side, we show that there exist natural languages in NC^1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least ~Ω(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.
PDF
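In property-testing style, the IPP guarantee described above can be stated roughly as follows; the constants 2/3 and 1/3 are the conventional choices, assumed here, and V^x denotes a verifier with query access to the input x:

```latex
x \in L \;\Longrightarrow\; \Pr\bigl[(V^{x} \leftrightarrow P) \text{ accepts}\bigr] \ge \tfrac{2}{3},
\qquad
\operatorname{dist}(x, L) \ge \varepsilon \;\Longrightarrow\;
\forall P^{*}:\ \Pr\bigl[(V^{x} \leftrightarrow P^{*}) \text{ accepts}\bigr] \le \tfrac{1}{3},
```

with the verifier's queries, computation, and communication all sublinear in |x|.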
Y. Chen, Chong, S., Kash, I. A., Moran, T., and Vadhan, S., “Truthful mechanisms for agents that value privacy”, in Proceedings of the fourteenth ACM conference on Electronic commerce, Philadelphia, Pennsylvania, USA, 2013, pp. 215-232. DOI Abstract
Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has small privacy cost to player i. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n).
PDF
L. Sweeney, “Discrimination in online ad delivery”, Commun. ACM, vol. 56, no. 5, pp. 44–54, 2013. DOI Abstract
Google ads, black names and white names, racial discrimination, and click advertising.
PDF
S. P. Kasiviswanathan, Nissim, K., Raskhodnikova, S., and Smith, A., “Analyzing Graphs with Node Differential Privacy”, in Theory of Cryptography, vol. 7785, Springer Berlin Heidelberg, 2013, pp. 457-476. Springer Link Abstract
We develop algorithms for the private analysis of network data that provide accurate analysis of realistic networks while satisfying stronger privacy guarantees than those of previous work. We present several techniques for designing node differentially private algorithms, that is, algorithms whose output distribution does not change significantly when a node and all its adjacent edges are added to a graph. We also develop methodology for analyzing the accuracy of such algorithms on realistic networks. The main idea behind our techniques is to “project” (in one of several senses) the input graph onto the set of graphs with maximum degree below a certain threshold. We design projection operators, tailored to specific statistics that have low sensitivity and preserve information about the original statistic. These operators can be viewed as giving a fractional (low-degree) graph that is a solution to an optimization problem described as a maximum flow instance, linear program, or convex program. In addition, we derive a generic, efficient reduction that allows us to apply any differentially private algorithm for bounded-degree graphs to an arbitrary graph. This reduction is based on analyzing the smooth sensitivity of the “naive” truncation that simply discards nodes of high degree.
PDF
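The “naive truncation” the abstract mentions is easy to state; a minimal sketch follows, for illustration only — calibrating noise to the truncated graph's output so that the release is node-differentially private requires the smooth-sensitivity analysis from the paper, which is not reproduced here:

```python
def truncate(adj: dict[int, set[int]], max_deg: int) -> dict[int, set[int]]:
    """Project a graph onto bounded degree by discarding high-degree nodes.

    adj maps each node to its neighbor set. Nodes of degree > max_deg are
    removed along with all their incident edges.
    """
    keep = {v for v, nbrs in adj.items() if len(nbrs) <= max_deg}
    return {v: adj[v] & keep for v in keep}

# Example: edge count of the truncated graph (degree threshold 2).
g = {0: {1, 2, 3}, 1: {0}, 2: {0, 3}, 3: {0, 2}}
t = truncate(g, max_deg=2)
print(sum(len(nbrs) for nbrs in t.values()) // 2)  # edges remaining after truncation
```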
A. Beimel, Nissim, K., and Stemmer, U., “Characterizing the sample complexity of private learners”, in Proceedings of the 4th conference on Innovations in Theoretical Computer Science, Berkeley, California, USA, 2013, pp. 97-110. DOI Abstract
In 2008, Kasiviswanathan et al. defined private learning as a combination of PAC learning and differential privacy [16]. Informally, a private learner is applied to a collection of labeled individual information and outputs a hypothesis while preserving the privacy of each individual. Kasiviswanathan et al. gave a generic construction of private learners for (finite) concept classes, with sample complexity logarithmic in the size of the concept class. This sample complexity is higher than what is needed for non-private learners, hence leaving open the possibility that the sample complexity of private learning may be sometimes significantly higher than that of non-private learning. We give a combinatorial characterization of the sample size sufficient and necessary to privately learn a class of concepts. This characterization is analogous to the well known characterization of the sample complexity of non-private learning in terms of the VC dimension of the concept class. We introduce the notion of probabilistic representation of a concept class, and our new complexity measure RepDim corresponds to the size of the smallest probabilistic representation of the concept class. We show that any private learning algorithm for a concept class C with sample complexity m implies RepDim(C) = O(m), and that there exists a private learning algorithm with sample complexity m = O(RepDim(C)). We further demonstrate that a similar characterization holds for the database size needed for privately computing a large class of optimization problems and also for the well studied problem of private data release.
PDF
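The two directions stated in the abstract combine into a tight characterization; writing SC_priv(C) for the sample complexity of privately learning C (notation assumed here, not from the paper), the result reads:

```latex
\mathrm{SC}_{\mathrm{priv}}(C) \;=\; \Theta\bigl(\mathrm{RepDim}(C)\bigr)
```

This plays the role that the VC dimension plays for non-private PAC learning.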
2012
A. Gupta, Roth, A., and Ullman, J., “Iterative Constructions and Private Data Release”, in Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012, Taormina, Sicily, Italy, 2012, Lecture Notes in Computer Science., vol. 7194, pp. 339-356. DOI:10.1007/978-3-642-28914-9_19 Abstract
In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction (IDC) algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|^{1.5}). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.
PDF
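The randomized-response ingredient mentioned in the abstract can be sketched directly. The code below perturbs each edge bit and debiases cut answers; it is only this one ingredient, and omits the Alon–Naor SDP post-processing step the paper's synthetic-data release builds on (S and T are assumed to be disjoint vertex sets):

```python
import numpy as np

def randomized_response(adj: np.ndarray, epsilon: float, rng=None) -> np.ndarray:
    """Flip each edge bit independently with probability 1 / (1 + e^epsilon).

    Keeping a bit with probability e^eps / (1 + e^eps) gives epsilon-edge-DP;
    answers computed from the noisy matrix must be debiased as below.
    """
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(epsilon))
    flips = rng.random(adj.shape) < p
    flips = np.triu(flips, 1)          # perturb each undirected edge once
    flips = flips | flips.T            # keep the adjacency matrix symmetric
    return np.where(flips, 1 - adj, adj)

def estimate_cut(noisy: np.ndarray, S, T, epsilon: float) -> float:
    """Unbiased estimate of the number of edges between disjoint S and T."""
    p = 1.0 / (1.0 + np.exp(epsilon))
    raw = sum(noisy[u, v] for u in S for v in T)
    # E[raw] = p * |S||T| + (1 - 2p) * cut, so invert the affine map.
    return (raw - p * len(S) * len(T)) / (1.0 - 2.0 * p)
```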
C. Dwork, Naor, M., and Vadhan, S., “The Privacy of the Analyst and the Power of the State”, in Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS '12), New Brunswick, NJ, 2012, pp. 400–409. IEEE Xplore Abstract
We initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to privately answer a huge number of queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.
PDF
Y. Dodis, López-Alt, A., Mironov, I., and Vadhan, S., “Differential Privacy with Imperfect Randomness”, in Proceedings of the 32nd International Cryptology Conference (CRYPTO '12), Santa Barbara, CA, 2012, Lecture Notes on Computer Science., vol. 7417, pp. 497–516. Springer Link Abstract
In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness. Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.
PDF
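The γ-Santha-Vazirani source at the center of this result can be made concrete with a small sketch. Here the adversarial bias function `bias` is illustrative: any prefix-dependent bias keeping each bit's conditional probability within [(1−γ)/2, (1+γ)/2] is allowed by the definition:

```python
import random

def sv_source(n: int, gamma: float, bias) -> list[int]:
    """Draw n bits from a gamma-Santha-Vazirani source.

    Each bit equals 1 with probability 1/2 + delta, where the adversary picks
    delta as a function of the prefix, subject to |delta| <= gamma / 2.
    """
    bits = []
    for _ in range(n):
        delta = max(-gamma / 2, min(gamma / 2, bias(tuple(bits))))
        bits.append(1 if random.random() < 0.5 + delta else 0)
    return bits

# Example adversary: push each bit toward repeating the previous one.
print(sv_source(16, gamma=0.3, bias=lambda prefix: 0.15 if prefix and prefix[-1] else -0.15))
```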
J. Thaler, Ullman, J., and Vadhan, S. P., “Faster Algorithms for Privately Releasing Marginals”, in Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Warwick, UK, 2012, Lecture Notes in Computer Science., vol. 7391. DOI:10.1007/978-3-642-31594-7_68 Abstract
We study the problem of releasing k-way marginals of a database D ∈ ({0, 1}^d)^n, while preserving differential privacy. The answer to a k-way marginal query is the fraction of D’s records x ∈ {0, 1}^d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS ’07). We give an algorithm that runs in time d^{O(√k)} and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n ≥ d^{O(√k)}. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d^{Θ(k)} (for k ≪ d).
PDF
M. Kearns, Pai, M., Roth, A., and Ullman, J., “Private Equilibrium Release, Large Games, and No-Regret Learning”, 2012. arXiv:1207.4084 Abstract
We give mechanisms in which each of n players in a game is given their component of an (approximate) equilibrium in a way that guarantees differential privacy---that is, the revelation of the equilibrium components does not reveal too much information about the utilities of the other players. More precisely, we show how to compute an approximate correlated equilibrium (CE) under the constraint of differential privacy (DP), provided n is large and any player's action affects any other's payoff by at most a small amount. Our results draw interesting connections between noisy generalizations of classical convergence results for no-regret learning, and the noisy mechanisms developed for differential privacy. Our results imply the ability to truthfully implement good social-welfare solutions in many games, such as games with small Price of Anarchy, even if the mechanism does not have the ability to enforce outcomes. We give two different mechanisms for DP computation of approximate CE. The first is computationally efficient, but has a suboptimal dependence on the number of actions in the game; the second is computationally inefficient, but allows for games with exponentially many actions. We also give a matching lower bound, showing that our results are tight up to logarithmic factors.
PDF
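For reference, the object being privately computed here, an α-approximate correlated equilibrium, can be stated as follows (a standard definition; the distribution π over joint action profiles and action sets A_i are notation assumed here):

```latex
\mathbb{E}_{a \sim \pi}\bigl[u_i(a)\bigr]
  \;\ge\; \mathbb{E}_{a \sim \pi}\bigl[u_i\bigl(f(a_i), a_{-i}\bigr)\bigr] - \alpha
\quad \text{for every player } i \text{ and every swap function } f : A_i \to A_i .
```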