# Publications

## 2014
Lucas Waye. 2014. “Privacy Integrated Data Stream Queries.” In Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14). New York, NY: ACM. ACM Digital Library Version
Raef Bassily, Adam Smith, and Abhradeep Thakurta. 2014. “Private Empirical Risk Minimization, Revisited.” In ICML 2014 Workshop on Learning, Security and Privacy. Beijing, China. Publisher's Version

In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.

Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different.

Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.
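
For intuition, the following is a minimal sketch of one standard approach to private convex ERM, noisy projected gradient descent, with Gaussian noise added to each averaged gradient. The function name, step sizes, and noise calibration below are illustrative placeholders, not the paper's optimal algorithms:

```python
import numpy as np

def noisy_projected_sgd(X, y, grad_fn, L=1.0, R=1.0, epsilon=1.0, delta=1e-6, T=100):
    """Illustrative noisy gradient descent for private ERM.

    Assumes each per-example loss is L-Lipschitz and the optimization
    domain is the ball of radius R. The noise scale is a rough
    Gaussian-mechanism calibration under composition; a sketch only.
    """
    n, d = X.shape
    theta = np.zeros(d)
    # Rough per-iteration noise scale for (epsilon, delta)-DP.
    sigma = 4 * (L / n) * np.sqrt(T * np.log(1 / delta)) / epsilon
    for t in range(1, T + 1):
        g = np.mean([grad_fn(theta, X[i], y[i]) for i in range(n)], axis=0)
        theta -= (R / (L * np.sqrt(t))) * (g + np.random.normal(0, sigma, size=d))
        # Project back onto the ball of radius R.
        norm = np.linalg.norm(theta)
        if norm > R:
            theta *= R / norm
    return theta

# Example (hypothetical usage): logistic-loss gradient for one record.
# grad = lambda th, x, yi: -yi * x / (1 + np.exp(yi * th @ x))
```

With a logistic-loss gradient this behaves like a crude variant of the noisy gradient methods the paper analyzes and improves upon.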

Justin Hsu, Aaron Roth, Tim Roughgarden, and Jonathan Ullman. 2014. “Privately Solving Linear Programs.” In Automata, Languages, and Programming, edited by Javier Esparza, Pierre Fraigniaud, Thore Husfeldt, and Elias Koutsoupias, 8572: Pp. 612-624. Springer Berlin Heidelberg. Publisher's Version
Kobbi Nissim, Salil Vadhan, and David Xiao. 2014. “Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals.” In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, Pp. 411–422. New York, NY, USA: ACM. Publisher's Version
Vitaly Feldman and David Xiao. 2014. “Sample Complexity Bounds on Differentially Private Learning via Communication Complexity.” Proceedings of The 27th Conference on Learning Theory (COLT 2014) 35, Pp. 1-20. Barcelona, Spain: JMLR Workshop and Conference Proceedings. Publisher's Version

In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy, introduced by Dwork et al. (2006), that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. The sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with Kasiviswanathan et al. (2008), but a number of basic questions remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013a,b).

Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:

• SCDP(C) = Ω(LDim(C)), where LDim(C) is Littlestone's (1987) dimension, which characterizes the number of mistakes in the online mistake-bound learning model. This result implies that SCDP(C) is different from the VC dimension of C, resolving one of the main open questions from prior work.
• For any t, there exists a class C such that LDim(C) = 2 but SCDP(C) ≥ t.
• For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b).

We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al. (2013a), and our bounds rely on a number of known results from communication complexity.

Adam Tanner. 2014. What Stays in Vegas: The World of Personal Data -- Lifeblood of Big Business -- and the End of Privacy as We Know It. Pp. 336. New York, NY: Public Affairs. Available on Amazon

The greatest threat to privacy today is not the NSA, but good-old American companies. Internet giants, leading retailers, and other firms are voraciously gathering data with little oversight from anyone.

In Las Vegas, no company knows the value of data better than Caesars Entertainment. Many thousands of enthusiastic clients pour through the ever-open doors of their casinos. The secret to the company’s success lies in their one unrivaled asset: they know their clients intimately by tracking the activities of the overwhelming majority of gamblers. They know exactly what games they like to play, what foods they enjoy for breakfast, when they prefer to visit, who their favorite hostess might be, and exactly how to keep them coming back for more.

Caesars’ dogged data-gathering methods have been so successful that they have grown to become the world’s largest casino operator, and have inspired companies of all kinds to ramp up their own data mining in the hopes of boosting their targeted marketing efforts. Some do this themselves. Some rely on data brokers. Others clearly enter a moral gray zone that should make American consumers deeply uncomfortable.

We live in an age when our personal information is harvested and aggregated whether we like it or not. And it is growing ever more difficult for those businesses that choose not to engage in more intrusive data gathering to compete with those that do. Tanner’s timely warning resounds: Yes, there are many benefits to the free flow of all this data, but there is a dark, unregulated, and destructive netherworld as well.

## 2013
Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2013. “Analyzing Graphs with Node Differential Privacy.” In Theory of Cryptography, 7785: Pp. 457-476. Springer Berlin Heidelberg. Springer Link
We develop algorithms for the private analysis of network data that provide accurate analysis of realistic networks while satisfying stronger privacy guarantees than those of previous work. We present several techniques for designing node differentially private algorithms, that is, algorithms whose output distribution does not change significantly when a node and all its adjacent edges are added to a graph. We also develop methodology for analyzing the accuracy of such algorithms on realistic networks. The main idea behind our techniques is to “project” (in one of several senses) the input graph onto the set of graphs with maximum degree below a certain threshold. We design projection operators, tailored to specific statistics, that have low sensitivity and preserve information about the original statistic. These operators can be viewed as giving a fractional (low-degree) graph that is a solution to an optimization problem described as a maximum flow instance, linear program, or convex program. In addition, we derive a generic, efficient reduction that allows us to apply any differentially private algorithm for bounded-degree graphs to an arbitrary graph. This reduction is based on analyzing the smooth sensitivity of the “naive” truncation that simply discards nodes of high degree.
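
As a rough illustration of the naive-truncation idea, the hypothetical sketch below truncates a graph to a degree threshold and releases a noisy edge count; it deliberately omits the smooth-sensitivity analysis that is the paper's actual contribution:

```python
import numpy as np
import networkx as nx

def truncated_edge_count(G, D, epsilon):
    """Release a noisy edge count under node differential privacy (sketch).

    Naive truncation: drop every node of degree > D, so the remaining
    graph has maximum degree D, and one node touches at most D edges.
    """
    kept = [v for v in G.nodes if G.degree(v) <= D]
    H = G.subgraph(kept)
    # Caveat: truncation itself is unstable (removing one node can change
    # which neighbors survive), which is why the paper analyzes the smooth
    # sensitivity of truncation rather than using the naive bound below.
    sensitivity = D
    return H.number_of_edges() + np.random.laplace(0, sensitivity / epsilon)
```
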
Jonathan Ullman. 2013. “Answering n^{2+o(1)} counting queries with differential privacy is hard.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 361-370. Palo Alto, California, USA: ACM. DOI
A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries on a sensitive database. Counting queries are of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database D ∈ ({0,1}^d)^n and k = Θ̃(n^2) arbitrary efficiently computable counting queries, runs in time poly(d, n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant-depth (AC^0) circuits. Our result is almost tight because it is known that Ω̃(n^2) counting queries can be answered efficiently while satisfying differential privacy. Moreover, many more than n^2 queries (even exponential in n) can be answered in exponential time. We prove our results by extending the connection between differentially private query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.
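
For context, a counting query and its standard Laplace-mechanism answer look like the sketch below. This is the generic efficient baseline, whose noise grows with the number of queries k under basic composition, not the paper's construction:

```python
import numpy as np

def answer_counting_queries(records, predicates, epsilon):
    """Answer k counting queries "what fraction of records satisfy q?"
    with Laplace noise. Each query has sensitivity 1/n; splitting the
    budget gives noise scale k/(n*epsilon) per answer. Illustrative only.
    """
    n, k = len(records), len(predicates)
    answers = []
    for q in predicates:
        true_frac = sum(1 for r in records if q(r)) / n
        answers.append(true_frac + np.random.laplace(0, k / (n * epsilon)))
    return answers

# Example: records are bit-vectors, queries are coordinate predicates.
data = [tuple(np.random.randint(0, 2, 8)) for _ in range(1000)]
queries = [lambda r, j=j: r[j] == 1 for j in range(8)]
print(answer_counting_queries(data, queries, epsilon=1.0))
```
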
Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2013. “Characterizing the Sample Complexity of Private Learners.” In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, Pp. 97-110. Berkeley, California, USA: ACM. DOI
In 2008, Kasiviswanathan et al. defined private learning as a combination of PAC learning and differential privacy [16]. Informally, a private learner is applied to a collection of labeled individual information and outputs a hypothesis while preserving the privacy of each individual. Kasiviswanathan et al. gave a generic construction of private learners for (finite) concept classes, with sample complexity logarithmic in the size of the concept class. This sample complexity is higher than what is needed for non-private learners, hence leaving open the possibility that the sample complexity of private learning may be sometimes significantly higher than that of non-private learning. We give a combinatorial characterization of the sample size sufficient and necessary to privately learn a class of concepts. This characterization is analogous to the well-known characterization of the sample complexity of non-private learning in terms of the VC dimension of the concept class. We introduce the notion of probabilistic representation of a concept class, and our new complexity measure RepDim corresponds to the size of the smallest probabilistic representation of the concept class. We show that any private learning algorithm for a concept class C with sample complexity m implies RepDim(C) = O(m), and that there exists a private learning algorithm with sample complexity m = O(RepDim(C)). We further demonstrate that a similar characterization holds for the database size needed for privately computing a large class of optimization problems and also for the well studied problem of private data release.
Justin Hsu, Aaron Roth, and Jonathan Ullman. 2013. “Differential privacy for the analyst via private equilibrium computation.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 341-350. Palo Alto, California, USA: ACM. DOI
We give new mechanisms for answering exponentially many queries from multiple analysts on a private database, while protecting differential privacy both for the individuals in the database and for the analysts. That is, our mechanism's answer to each query is nearly insensitive to changes in the queries asked by other analysts. Our mechanism is the first to offer differential privacy on the joint distribution over analysts' answers, providing privacy for data analysts even if the other data analysts collude or register multiple accounts. In some settings, we are able to achieve nearly optimal error rates (even compared to mechanisms which do not offer analyst privacy), and we are able to extend our techniques to handle non-linear queries. Our analysis is based on a novel view of the private query-release problem as a two-player zero-sum game, which may be of independent interest.
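
The zero-sum-game view can be made concrete with a standard multiplicative-weights equilibrium solver, sketched below. This is the generic non-private routine; the paper's contribution is making such equilibrium computation differentially private:

```python
import numpy as np

def mw_equilibrium(A, T=1000, eta=0.05):
    """Approximate the value of a zero-sum game with payoff matrix A
    (row player minimizes, column player maximizes) via multiplicative
    weights. Generic sketch of the solver underlying the game view of
    query release; the private versions perturb these updates.
    """
    m, n = A.shape
    w = np.ones(m)
    avg_row = np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        # Column player best-responds to the row player's mixed strategy.
        j = np.argmax(p @ A)
        # Row player downweights rows that paid a lot against column j.
        w *= np.exp(-eta * A[:, j])
        avg_row += p
    avg_row /= T
    return avg_row, max(avg_row @ A)  # strategy and approximate game value
```
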
Latanya Sweeney. 2013. “Discrimination in online ad delivery.” Commun. ACM, 56, 5, Pp. 44–54. DOI
Mark Bun and Justin Thaler. 2013. “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities.” Edited by Fedor V. Fomin, Rūsiņš Freivalds, Marta Kwiatkowska, and David Peleg. Automata, Languages, and Programming, 7965, Pp. 303-314. DOI
The ε-approximate degree of a Boolean function f: {−1, 1}^n → {−1, 1} is the minimum degree of a real polynomial that approximates f to within ε in the ℓ∞ norm. We prove several lower bounds on this important complexity measure by explicitly constructing solutions to the dual of an appropriate linear program. Our first result resolves the ε-approximate degree of the two-level AND-OR tree for any constant ε > 0. We show that this quantity is Θ(√n), closing a line of incrementally larger lower bounds [3,11,21,30,32]. The same lower bound was recently obtained independently by Sherstov using related techniques [25]. Our second result gives an explicit dual polynomial that witnesses a tight lower bound for the approximate degree of any symmetric Boolean function, addressing a question of Špalek [34]. Our final contribution is to reprove several Markov-type inequalities from approximation theory by constructing explicit dual solutions to natural linear programs. These inequalities underlie the proofs of many of the best-known approximate degree lower bounds, and have important uses throughout theoretical computer science.
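
The linear program whose dual the paper constructs solutions for can be written out directly. The sketch below numerically computes the best degree-d ℓ∞ approximation error for a small Boolean function with scipy; it illustrates the primal LP, not the paper's explicit dual witnesses:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def approx_degree_error(f, n, d):
    """Minimum l_inf error of a degree-d real polynomial approximating
    f: {-1,1}^n -> {-1,1}. Variables: coefficients c_S for |S| <= d plus
    the error eps; constraints |p(x) - f(x)| <= eps for every input x.
    """
    monomials = [S for k in range(d + 1) for S in itertools.combinations(range(n), k)]
    points = list(itertools.product([-1, 1], repeat=n))
    A_ub, b_ub = [], []
    for x in points:
        row = [float(np.prod([x[i] for i in S])) for S in monomials]
        fx = f(x)
        A_ub.append(row + [-1.0]); b_ub.append(fx)                 #  p(x) - eps <= f(x)
        A_ub.append([-r for r in row] + [-1.0]); b_ub.append(-fx)  # -p(x) - eps <= -f(x)
    c = [0.0] * len(monomials) + [1.0]  # minimize eps
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * len(monomials) + [(0, None)])
    return res.fun

# Example: OR on 3 bits (in the {-1,1} convention, -1 = true).
OR3 = lambda x: -1 if -1 in x else 1
print(approx_degree_error(OR3, n=3, d=1))  # error > 1/3 means deg_{1/3}(OR3) > 1
```
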
Stanley H. Chan, Thiago B. Costa, and Edoardo M. Airoldi. 2013. “Estimation of exchangeable graph models by stochastic blockmodel approximation.” In Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE, Pp. 293-296.
Karthekeyan Chandrasekaran, Justin Thaler, Jonathan Ullman, and Andrew Wan. 2013. “Faster Private Release of Marginals on Small Databases.” CoRR, abs/1304.3754. arXiv.org
We study the problem of answering \emph{$k$-way marginal} queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$\min\{\exp(d^{1-\Omega(1/\sqrt{k})}), \exp(d / \log^{0.99} d)\}$$ per query and answers any (possibly superpolynomially long and adaptively chosen) sequence of $k$-way marginal queries up to error at most $\pm .01$ on every query, provided $n \gtrsim d^{.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee on a database of size $\mathrm{poly}(d, k)$ in time $\exp(o(d))$. Our algorithms are a variant of the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10), but using a different low-weight representation of the database. We derive our low-weight representation using approximations to the OR function by low-degree polynomials with coefficients of bounded $L_1$-norm. We also prove a strong limitation on our approach that is of independent approximation-theoretic interest. Specifically, we show that for any $k = o(\log d)$, any polynomial with coefficients of $L_1$-norm $\mathrm{poly}(d)$ that pointwise approximates the $d$-variate OR function on all inputs of Hamming weight at most $k$ must have degree $d^{1-O(1/\sqrt{k})}$.
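
Concretely, a $k$-way marginal is an average of conjunctions, as in the sketch below; the paper's algorithm replaces the exact AND with a low-degree polynomial of bounded $L_1$ coefficient norm (illustrative code, not the private mechanism itself):

```python
import numpy as np

def kway_marginal(D, cols, vals):
    """Exact answer to a k-way marginal query: the fraction of records
    agreeing with vals on the given columns, written as an average of
    products of indicators (the AND the paper approximates by a
    low-degree, low-L1-norm polynomial).
    """
    match = np.ones(len(D), dtype=float)
    for c, v in zip(cols, vals):
        match *= (D[:, c] == v)
    return match.mean()

D = np.random.randint(0, 2, size=(1000, 20))  # n records, d binary columns
print(kway_marginal(D, cols=[2, 5, 11], vals=[1, 0, 1]))
```
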
Latanya Sweeney, Akua Abu, and Julia Winn. 2013. “Identifying Participants in the Personal Genome Project by Name.” Data Privacy Lab, IQSS, Harvard University. Project website
Guy N. Rothblum, Salil Vadhan, and Avi Wigderson. 2013. “Interactive proofs of proximity: delegating computation in sublinear time.” In Proceedings of the 45th annual ACM symposium on Symposium on theory of computing, Pp. 793-802. Palo Alto, California, USA: ACM. DOI

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n^{3/4} queries and communication. On the negative side, we show that there exist natural languages in NC1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least Ω̃(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e., without a prover.

Latanya Sweeney. 2013. “Matching Known Patients to Health Records in Washington State Data.” Data Privacy Lab, IQSS, Harvard University. Project website
Amos Beimel, Kobbi Nissim, and Uri Stemmer. 2013. “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy.” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, edited by Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and José D. P. Rolim, 8096: Pp. 363-378. Springer Berlin Heidelberg. Publisher's Version