A new line of work [6, 9, 15, 2] demonstrates how differential privacy [8] can be used as a mathematical tool for guaranteeing generalization in adaptive data analysis. Specifically, if a differentially private analysis is applied to a sample S of i.i.d. examples to select a low-sensitivity function f, then w.h.p. f(S) is close to its expectation, even though f is chosen based on the data. Very recently, Steinke and Ullman [16] observed that these generalization guarantees can be used to prove concentration bounds in the non-adaptive setting, where the low-sensitivity function is fixed beforehand. In particular, they obtain alternative proofs of classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid's inequality. In this work, we set out to examine the situation for functions with high sensitivity, for which differential privacy does not imply generalization guarantees under adaptive analysis. We show that differential privacy can nevertheless be used to prove concentration bounds for such functions in the non-adaptive setting.
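A standard in-expectation form of the DP-to-generalization guarantee for statistical queries (paraphrased here from memory as an illustration, not quoted from the cited works) reads:

```latex
% If M is an eps-differentially private algorithm that, on input a sample
% S ~ P^n, outputs a statistical query q : X -> [0,1], then the empirical
% value of the selected query cannot much exceed its true mean:
\[
  \mathop{\mathbb{E}}_{S \sim \mathcal{P}^n,\; q \leftarrow \mathcal{M}(S)}
    \bigl[ q(S) \bigr]
  \;\le\;
  e^{\varepsilon} \cdot
  \mathop{\mathbb{E}}_{S \sim \mathcal{P}^n,\; q \leftarrow \mathcal{M}(S)}
    \bigl[ q(\mathcal{P}) \bigr],
\]
% where q(S) is the average of q over S and q(P) = E_{x ~ P}[q(x)].
% High-probability versions (the "w.h.p." statement above) follow from such
% expectation bounds via amplification arguments.
```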

%B Journal of Privacy and Confidentiality %V 9 %G eng %U https://arxiv.org/abs/1703.01970 %N 1 %0 Journal Article %J European Data Protection Law Review %D 2019 %T Data Protection's Composition Problem %A Aaron Fluitt %A Aloni Cohen %A Micah Altman %A Kobbi Nissim %A Salome Viljoen %A Alexandra Wood %B European Data Protection Law Review %V 5 %P 285-292 %G eng %U https://edpl.lexxion.eu/article/EDPL/2019/3/4 %N 3 %0 Journal Article %J ArXiv %D 2019 %T Formalizing Privacy Laws for License Generation and Data Repository Decision Automation %A Micah Altman %A Stephen Chong %A Alexandra Wood %X In this paper, we summarize work-in-progress on expert system support to automate some data deposit and release decisions within a data repository, and to generate custom license agreements for those data transfers. Our approach formalizes via a logic programming language the privacy-relevant aspects of laws, regulations, and best practices, supported by legal analysis documented in legal memoranda. This formalization enables automated reasoning about the conditions under which a repository can transfer data, through interrogation of users, and the application of formal rules to the facts obtained from users. The proposed system takes the specific conditions for a given data release and produces a custom data use agreement that accurately captures the relevant restrictions on data use. This enables appropriate decisions and accurate licenses, while removing the bottleneck of lawyer effort per data transfer. The operation of the system aims to be transparent, in the sense that administrators, lawyers, institutional review boards, and other interested parties can evaluate the legal reasoning and interpretation embodied in the formalization, and the specific rationale for a decision to accept or release a particular dataset.
%B ArXiv %G eng %U https://arxiv.org/abs/1910.10096 %0 Journal Article %J Computer Security Foundations %D 2019 %T Information Flow Control for Distributed Trusted Execution Environments %A Anitha Gollamudi %A Owen Arden %A Stephen Chong %B Computer Security Foundations %G eng %0 Conference Paper %B ACM SIGMOD/PODS Conference International Conference on Management of Data (PODS 2018) %D 2018 %T Heavy Hitters and the Structure of Local Privacy %A Mark Bun %A Jelani Nelson %A Uri Stemmer %B ACM SIGMOD/PODS Conference International Conference on Management of Data (PODS 2018) %8 2018 %G eng %0 Journal Article %J Conference on Algorithmic Learning Theory (ALT 2018) %D 2018 %T Clustering Algorithms for the Centralized and Local Models %A Kobbi Nissim %A Uri Stemmer %B Conference on Algorithmic Learning Theory (ALT 2018) %G eng %0 Government Document %D 2018 %T Comments to the U.S. Office of Management and Budget, Re: Request for information (New Techniques and Methodologies Based on Combining Data from Multiple Sources) %A Micah Altman %A Aloni Cohen %A Aaron Fluitt %A James Honaker %A Kobbi Nissim %A Michael Washington %A Alexandra Wood %G eng %0 Thesis %B Applied Mathematics %D 2018 %T Towards Differentially Private Inference on Network Data %A Alexander Koujianos Goldberg %X Statistical analysis of network data, while popular in a broad range of fields, can also be highly problematic from a privacy standpoint. In this thesis, we study privacy-preserving inference on network data using the rigorous notion of differential privacy. We propose new methods for differentially private inference using a common class of models known as Exponential Random Graph Models (ERGMs). The goal of our work is to accurately estimate the parameters of an ERGM applied to a network dataset, while offering meaningful privacy guarantees to participants. 
We propose methods that provably guarantee differential privacy at two different granularities: edge-level privacy, which protects the privacy of any single relationship in the network, and node-level privacy, which protects all of the relationships of a participant. Specifically, using the framework of "restricted sensitivity," we take advantage of the sparsity of real-world networks to perturb data much less than prior work while guaranteeing differential privacy. We empirically evaluate the accuracy of inference in a series of experiments on both synthetic networks and a real network dataset. Experimental results suggest that our proposed methods enable accurate inference under meaningful privacy guarantees in settings where current methods do not, moving us closer to the goal of useful differentially private statistical modeling of network data. %B Applied Mathematics %G eng %9 Undergraduate thesis %0 Journal Article %J International Data Privacy Law %D 2018 %T Practical Approaches to Big Data Privacy Over Time %A Micah Altman %A Alexandra Wood %A O'Brien, David %A Gasser, Urs %B International Data Privacy Law %V 8 %P 29-51 %G eng %U https://academic.oup.com/idpl/advance-article/doi/10.1093/idpl/ipx027/4930711 %N 1 %0 Journal Article %J in 9th Innovations in Theoretical Computer Science Conference (ITCS 2018); also presented at Theory and Practice of Differential Privacy Conference (TPDP 2017) %D 2018 %T Differential Privacy on Finite Computers %A Victor Balcer %A Salil Vadhan %X

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. We use as a case study the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe X. The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Confidentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundararajan, STOC 2009 and SICOMP 2012), take time at least linear in |X|, which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its “per-bin accuracy” (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of log |X|, but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and efficiently computable representation of a dense histogram; it is based on an (n + 1)-wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.
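As a minimal illustration of the baseline this abstract contrasts against, here is a sketch of the Geometric Mechanism applied to a histogram. The function names and the eps/2-per-bin budget split (for the "swap" notion of neighboring datasets) are our own choices; note that this naive version runs in time linear in the universe size, which is exactly the cost the paper's algorithms are designed to avoid.

```python
import math
import random

def two_sided_geometric(eps, rng):
    # The difference of two i.i.d. geometric variables (number of failures
    # before the first success, with success probability 1 - exp(-eps)) is
    # two-sided geometric: Pr[Z = z] is proportional to exp(-eps * |z|).
    alpha = math.exp(-eps)
    sample = lambda: math.floor(math.log(rng.random()) / math.log(alpha))
    return sample() - sample()

def private_histogram(data, universe, eps, rng=None):
    # eps-DP histogram via independent integer noise on every bin.
    # Changing one record changes two bins by 1 each under the swap notion
    # of neighboring datasets, so each bin gets eps/2.
    rng = rng or random.Random()
    counts = {x: 0 for x in universe}
    for x in data:
        counts[x] += 1
    return {x: c + two_sided_geometric(eps / 2.0, rng) for x, c in counts.items()}
```

Because the noise is integer-valued, the output needs no real arithmetic; the remaining obstacle (addressed in the paper) is the O(|X|) loop over the universe.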

%B in 9th Innovations in Theoretical Computer Science Conference (ITCS 2018); also presented at Theory and Practice of Differential Privacy Conference (TPDP 2017) %G eng %U https://arxiv.org/abs/1709.05396 %0 Conference Paper %D 2018 %T Bridging the Gap between Computer Science and Legal Approaches to Privacy %A K. Nissim %A A Bembenek %A A Wood %A M Bun %A M Gaboardi %A Gasser, U. %A D O'Brien %A T Steinke %A S. Vadhan %X The fields of law and computer science incorporate contrasting notions of the privacy risks associated with the analysis and release of statistical data about individuals and groups of individuals. Emerging concepts from the theoretical computer science literature provide formal mathematical models for quantifying and mitigating privacy risks, where the set of risks they take into account is much broader than the privacy risks contemplated by many privacy laws. An example of such a model is differential privacy, which provides a provable guarantee of privacy against a wide range of potential attacks, including types of attacks currently unknown or unforeseen. The subject of much theoretical investigation, new privacy technologies based on formal models have recently been making significant strides towards practical implementation. For these tools to be used with sensitive personal information, it is important to demonstrate that they satisfy relevant legal requirements for privacy protection. However, making such an argument is challenging due to the conceptual gaps between the legal and technical approaches to defining privacy. Notably, information privacy laws are generally subject to interpretation and some degree of flexibility, which creates uncertainty for the implementation of more formal approaches.
This Article articulates the gaps between legal and technical approaches to privacy and presents a methodology for rigorously arguing that a technological method for privacy protection satisfies the requirements of a particular law. The proposed methodology has two main components: (i) extraction of a formal mathematical requirement of privacy based on a legal standard found in an information privacy law, and (ii) construction of a rigorous mathematical proof for establishing that a technological privacy solution satisfies the mathematical requirement derived from the law. To handle ambiguities that can lead to different interpretations of a legal standard, the methodology takes a conservative “worst-case” approach and attempts to extract a mathematical requirement that is robust to potential ambiguities. Under this approach, the mathematical proof demonstrates that a technological method satisfies a broad range of reasonable interpretations of a legal standard. The Article demonstrates the application of the proposed methodology with an example bridging between the requirements of the Family Educational Rights and Privacy Act of 1974 and differential privacy.

%7 2
%I Harvard Journal of Law & Technology
%V 31
%P 687-780
%8 2016
%G eng
%U https://jolt.law.harvard.edu/assets/articlePDFs/v31/02.-Article-Wood-7.21.pdf
%0 Conference Paper
%B Theory of Cryptography Conference (TCC 2016)
%D 2018
%T The Complexity of Computing the Optimal Composition of Differential Privacy
%A Jack Murtagh
%A Salil Vadhan
%X In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC'06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML'15) showed how to compute the optimal bound for composing k arbitrary (ϵ,δ)-differentially private algorithms. We characterize the optimal composition for the more general case of k arbitrary (ϵ1,δ1),…,(ϵk,δk)-differentially private algorithms where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is #P-complete. Since computing optimal composition exactly is infeasible (unless FP=#P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer's dynamic programming approach to approximately counting solutions to knapsack problems (STOC'03).

%B Theory of Cryptography Conference (TCC 2016) %7 8 %I Theory of Computing (2018) %V 14 %P 1-35 %G eng %U http://theoryofcomputing.org/articles/v014a008/ %0 Generic %D 2018 %T Composable and Versatile Privacy via Truncated CDP (Poster) %A Mark Bun %A Cynthia Dwork %A Guy Rothblum %A Thomas Steinke %B ACM Symposium on the Theory of Computing (STOC 2018) %G eng %0 Journal Article %D 2018 %T Computational Two-Party Correlation: A Dichotomy for Key-Agreement Protocols %A Iftach Haitner %A Ronen Shaltiel %A Jad Silbak %A Kobbi Nissim %A Eran Omri %G eng %U https://www.irif.fr/~focs2018/accepted/ %0 Journal Article %J Vanderbilt Journal of Entertainment and Technology Law %D 2018 %T Differential Privacy: A Primer for a Non-technical Audience %A Kobbi Nissim %A Thomas Steinke %A Alexandra Wood %A Micah Altman %A Aaron Bembenek %A Mark Bun %A Marco Gaboardi %A O'Brien, David %A Salil Vadhan %X This document is a primer on differential privacy, which is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings. Using intuitive illustrations and limited mathematical formalism, this document provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social scientists can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded.

%B Vanderbilt Journal of Entertainment and Technology Law %I a product of the "Bridging Privacy Definitions" working group, part of the Privacy Tools for Sharing Research Data project at Harvard University %C Cambridge, MA %V 21 %P 209-276 %8 2018 %G eng %N 1 %0 Journal Article %J 9th Innovations in Theoretical Computer Science Conference (ITCS 2018) %D 2018 %T Finite Sample Differentially Private Confidence Intervals %A Vishesh Karwa %A Salil Vadhan %X We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set, showing that our parameters are optimal up to polylogarithmic factors. %B 9th Innovations in Theoretical Computer Science Conference (ITCS 2018) %G eng %U https://arxiv.org/abs/1711.03908 %0 Journal Article %J IEEE Security & Privacy %D 2018 %T A Harm-Reduction Framework for Algorithmic Accountability over Personal Information %A Micah Altman %A Alexandra Wood %A Effy Vayena %B IEEE Security & Privacy %V 16 %P 34-45 %G eng %U https://ieeexplore.ieee.org/document/8395114/ %N 3 %0 Journal Article %J Philosophical Transactions of the Royal Society A %D 2018 %T Is Privacy

When it is ethical and legal to use a sensitive attribute (such as gender or race) in machine learning systems, the question remains how to do so. We show that the naive application of machine learning algorithms using sensitive features leads to an inherent tradeoff in accuracy between groups. We provide a simple and efficient decoupling technique that can be added on top of any black-box machine learning algorithm, to learn different classifiers for different groups. Transfer learning is used to mitigate the problem of having too little data on any one group.

The method can apply to a range of fairness criteria. In particular, we require the application designer to specify a joint loss function that makes explicit the trade-off between fairness and accuracy. Our reduction is shown to efficiently find the minimum loss, as long as the objective has a certain natural monotonicity property, which may be of independent interest in the study of fairness in algorithms.
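The decoupling idea can be illustrated with a toy example (hypothetical threshold classifiers of our own construction, not the paper's black-box reduction or its transfer-learning component): training one classifier per group can achieve zero error where any single pooled classifier cannot.

```python
def error(points, t):
    # 0-1 loss of the threshold classifier h(x) = 1[x >= t] on (x, y) pairs.
    return sum((x >= t) != y for x, y in points) / len(points)

def best_threshold(points):
    # Exhaustive search over candidate cut points (plus "always predict 0").
    candidates = sorted({x for x, _ in points}) + [float("inf")]
    return min(candidates, key=lambda t: error(points, t))

def decoupled_classifiers(groups):
    # Decoupling on the sensitive attribute: one classifier per group,
    # instead of a single pooled classifier over the union of the groups.
    return {g: best_threshold(pts) for g, pts in groups.items()}
```

When the two groups have different label boundaries, the pooled classifier is forced to err on at least one group, which is the accuracy tradeoff the abstract describes.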

We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set Q of allowable queries in one of three ways, which we list in order from easiest to hardest to answer:

• Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch.

• Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query.

• Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries.

Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.
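The three interaction models can be illustrated with a toy Laplace-based mechanism for statistical queries (our own sketch, not from the paper): an offline mechanism splits the privacy budget over a known batch, while an adaptive mechanism must commit to a per-query budget up front because later queries may depend on earlier answers.

```python
import math
import random

def laplace(scale, rng):
    # Inverse-CDF sampling of a centered Laplace variable.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def offline_mechanism(data, queries, eps, rng):
    # Offline: the whole batch is known up front, so the budget eps is
    # split across the queries once, before any answer is released.
    scale = len(queries) / (eps * len(data))
    return [sum(q(x) for x in data) / len(data) + laplace(scale, rng)
            for q in queries]

class AdaptiveMechanism:
    # Adaptive: each query may depend on earlier answers, so a per-query
    # budget must be fixed before the interaction begins.
    def __init__(self, data, eps_per_query, rng):
        self.data, self.rng = data, rng
        self.scale = 1.0 / (eps_per_query * len(data))
    def answer(self, q):
        true = sum(q(x) for x in self.data) / len(self.data)
        return true + laplace(self.scale, self.rng)
```

The separation results in the abstract say these models are not interchangeable: some query families admit far more accurate answers offline than online, and online than adaptively.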

%B Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) %G eng %U https://arxiv.org/abs/1604.04618 %0 Journal Article %J Proceedings of The 30th Conference on Learning Theory Conference (COLT 2017) %D 2017 %T The Price of Selection in Differential Privacy %A Mitali Bafna %A Jonathan Ullman %B Proceedings of The 30th Conference on Learning Theory Conference (COLT 2017) %G eng %U https://arxiv.org/abs/1702.02970 %0 Journal Article %J in the ACM SIGMOD/PODS Conference (PODS 2017) %D 2017 %T Private Incremental Regression %A Kasiviswanathan, Shiva %A Kobbi Nissim %A Hongxia Jin %B in the ACM SIGMOD/PODS Conference (PODS 2017) %G eng %0 Conference Paper %B European Social Policy Analysis network (ESPAnet). Israel %D 2017 %T Public Policy Modeling using the DataTags Toolset %A Bar-Sinai, Michael %A Rotem Medzini %X We apply Tags, a framework for modeling data handling policies, to a welfare policy. The generated model is useful for assessing entitlements of specific cases, and for gaining insights into the modeled policy as a whole. %B European Social Policy Analysis network (ESPAnet). 
Israel %G eng %0 Journal Article %J in Technology Science %D 2017 %T Reidentification Risks in HIPAA Safe Harbor Data: A study of data from one environmental health study %A Sweeney, Latanya %A Ji Su Yoo %A Laura Perovich %A Katherine E Boronow %A Brown, Phil %A Julia Green Brody %B in Technology Science %G eng %0 Journal Article %J in the IEEE Secure Development Conference (IEEE 2017) %D 2017 %T Securing Dataverse with an Adapted Command Design Pattern %A Gustavo Durand %A Bar-Sinai, Michael %A Merce Crosas %B in the IEEE Secure Development Conference (IEEE 2017) %G eng %0 Journal Article %J in the 58th Annual Symposium on Foundations of Computer Science (FOCS 2017) %D 2017 %T Tight Lower Bounds for Differentially Private Selection %A Thomas Steinke %A Jonathan Ullman %B in the 58th Annual Symposium on Foundations of Computer Science (FOCS 2017) %G eng %U https://arxiv.org/abs/1704.03024 %0 Journal Article %J in Technology Science %D 2017 %T Voter Identity Theft: Submitting Changes to Voter Registrations Online to Disrupt Elections %A Sweeney, Latanya %A Ji Su Yoo %A Jinyan Zang %B in Technology Science %G eng %0 Journal Article %J in the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) %D 2017 %T What's Fair? %A Cynthia Dwork %B in the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) %G eng %0 Journal Article %J Harvard Law Review %D 2016 %T Recoding Privacy Law: Reflections on the Future Relationship Among Law, Technology, and Privacy %A Gasser, Urs %B Harvard Law Review %8 12 Dec, 2016 %G eng %U http://harvardlawreview.org/2016/12/recoding-privacy-law-reflections-on-the-future-relationship-among-law-technology-and-privacy/ %0 Journal Article %J Conference on Internet and Economics, WINE %D 2016 %T Computer-Aided Verification in Mechanism Design %A G. Barthe %A M. Gaboardi %A E. J. Gallego Arias %A J. Hsu %A A. Roth %A P.-Y. 
Strub %B Conference on Internet and Economics, WINE %8 Dec 2016 %G eng %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security, CCS %D 2016 %T Advanced Probabilistic Couplings for Differential Privacy %A G. Barthe %A N. Fong %A M. Gaboardi %A B. Gregoire %A J. Hsu %A P.-Y. Strub %B 23rd ACM Conference on Computer and Communications Security, CCS %8 Oct 2016 %G eng %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security %D 2016 %T Generic Attacks on Secure Outsourced Databases %A Georgios Kellaris %A George Kollios %A Kobbi Nissim %A Adam O'Neill %X Recently, various protocols have been proposed for securely outsourcing database storage to a third party server, ranging from systems with “full-fledged” security based on strong cryptographic primitives such as fully homomorphic encryption or oblivious RAM, to more practical implementations based on searchable symmetric encryption or even on deterministic and order-preserving encryption. On the flip side, various attacks have emerged that show that for some of these protocols confidentiality of the data can be compromised, usually given certain auxiliary information. We take a step back and identify a need for a formal understanding of the inherent efficiency/privacy trade-off in outsourced database systems, independent of the details of the system. We propose abstract models that capture secure outsourced storage systems in sufficient generality, and identify two basic sources of leakage, namely access pattern and communication volume. We use our models to distinguish certain classes of outsourced database systems that have been proposed, and deduce that all of them exhibit at least one of these leakage sources. We then develop generic reconstruction attacks on any system supporting range queries where either access pattern or communication volume is leaked.
These attacks are in a rather weak passive adversarial model, where the untrusted server knows only the underlying query distribution. In particular, to perform our attack the server need not have any prior knowledge about the data, and need not know any of the issued queries nor their results. Yet, the server can reconstruct the secret attribute of every record in the database after about N^4 queries, where N is the domain size. We provide a matching lower bound showing that our attacks are essentially optimal. Our reconstruction attacks using communication volume apply even to systems based on homomorphic encryption or oblivious RAM in the natural way. Finally, we provide experimental results demonstrating the efficacy of our attacks on real datasets with a variety of different features. On all these datasets, after the required number of queries our attacks successfully recovered the secret attributes of every record in at most a few seconds.
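An idealized toy version of the access-pattern attack (our own simplification: the server observes every range exactly once, rather than roughly N^4 random queries as in the paper) shows why reconstruction up to reflection is possible.

```python
def all_range_patterns(values, N):
    # Access-pattern leakage for every range query [a, b] over domain {1..N}:
    # the server sees which record ids match, but not the values or the query.
    return [{i for i, v in enumerate(values) if a <= v <= b}
            for a in range(1, N + 1) for b in range(a, N + 1)]

def reconstruct(patterns, num_records, N):
    # A record with value v matches exactly v * (N + 1 - v) of the N(N+1)/2
    # ranges, so occurrence counts pin down each value up to the reflection
    # v <-> N + 1 - v (a symmetry the real attacks also cannot avoid).
    counts = [0] * num_records
    for pattern in patterns:
        for rid in pattern:
            counts[rid] += 1
    value_of = {}
    for v in range(1, N + 1):
        value_of.setdefault(v * (N + 1 - v), v)  # keeps min(v, N + 1 - v)
    return [value_of[c] for c in counts]
```

The paper's contribution is showing that essentially the same information leaks from a realistic stream of queries drawn from an unknown-to-the-attacker but fixed distribution, with matching lower bounds.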

%B 23rd ACM Conference on Computer and Communications Security %8 Oct 2016 %G eng %0 Conference Paper %B Security and Privacy Workshops (SPW), 2016 IEEE %D 2016 %T DataTags, Data Handling Policy Spaces and the Tags Language %A Bar-Sinai, Michael %A Sweeney, Latanya %A Merce Crosas %X Widespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, who are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper.

%B Security and Privacy Workshops (SPW), 2016 IEEE %I IEEE %C San Jose, California %8 22-26 May %G eng %U http://ieeexplore.ieee.org/document/7527746/?reload=true&arnumber=7527746 %0 Manuscript %D 2016 %T The Complexity of Differential Privacy %A Salil Vadhan %X Differential Privacy is a theoretical framework for ensuring the privacy of individual-level data when performing statistical analysis of privacy-sensitive datasets. The goal of this tutorial is to convey the deep connections between differential privacy and a variety of other topics in computational complexity, cryptography, and theoretical computer science at large. This tutorial was written starting from notes taken during a minicourse given by the author and Kunal Talwar at the 26th McGill Invitational Workshop on Computational Complexity in February 2014, at the Bellairs Institute in Holetown, Barbados.

We provide an overview of PSI (“a Private data Sharing Interface”), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.

Poster presented at Theory and Practice of Differential Privacy (TPDP 2016).

%B Theory and Practice of Differential Privacy %C New York, NY %8 2016 %G eng %U https://arxiv.org/abs/1609.04340 %0 Journal Article %J ACM Transactions on Economics and Computation (TEAC) %D 2016 %T Truthful Mechanisms for Agents that Value Privacy %A Yiling Chen %A Stephen Chong %A Ian Kash %A Tal Moran %A Salil Vadhan %X Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from truthfulness; it is not incorporated in players’ utility functions (and doing so has been shown to lead to nontruthfulness in some cases). In this work, we propose a new, general way of modeling privacy in players’ utility functions. Specifically, we only assume that if an outcome *o* has the property that any report of player *i* would have led to *o* with approximately the same probability, then *o* has a small privacy cost to player *i*. We give three mechanisms that are truthful with respect to our modeling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number *n* of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of *n*).

Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood in theory, and in practice avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators in practice for generalized linear models, Cox proportional hazards models, and M-estimators, and perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
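For the linear model, the implicit update described here has a simple closed form. The sketch below (our own illustration with a constant step size on a noiseless toy stream, not the authors' code) makes the shrinkage factor explicit.

```python
import random

def implicit_sgd_linear(stream, dim, gamma=0.5):
    # Implicit SGD for least squares: the update
    #     theta_n = theta_{n-1} + gamma * (y_n - x_n . theta_n) * x_n
    # defines theta_n implicitly, but for the linear model it has a closed
    # form: the standard SGD residual is shrunk by 1 / (1 + gamma * |x_n|^2),
    # which is what keeps the procedure stable for large step sizes.
    theta = [0.0] * dim
    for x, y in stream:
        xx = sum(v * v for v in x)
        resid = (y - sum(a * b for a, b in zip(x, theta))) / (1.0 + gamma * xx)
        theta = [t + gamma * resid * v for t, v in zip(theta, x)]
    return theta

# Usage on a noiseless synthetic stream with y = 2*x1 - x2:
rng = random.Random(42)
stream = []
for _ in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    stream.append((x, 2.0 * x[0] - x[1]))
theta_hat = implicit_sgd_linear(stream, dim=2)
```

Each step contracts the error along the observed direction by 1/(1 + gamma*|x|^2), so unlike standard SGD the iterates cannot diverge regardless of the step size.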

%B Annals of Statistics %G eng %U https://arxiv.org/abs/1408.2923 %0 Journal Article %J NIPS 2015 Workshop on Learning and Privacy with Incomplete Data and Weak Supervision %D 2016 %T Private Posterior Distributions from Variational Approximations %A Vishesh Karwa %A Dan Kifer %A Aleksandra Slavkovic %X Privacy-preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating posterior distributions over model parameters. This leads to a principled way of performing valid statistical inference using private data; however, the corresponding likelihoods are intractable. In this paper, we derive fast and accurate variational approximations to tackle such intractable likelihoods that arise due to privacy. We focus on estimating posterior distributions of parameters of the naive Bayes log-linear model, where the sufficient statistics of this model are shared using a differentially private interface. Using a simulation study, we show that the posterior approximations outperform the naive method of ignoring the noise addition mechanism.

%B NIPS 2015 Workshop on Learning and Privacy with Incomplete Data and Weak Supervision %G eng %U https://arxiv.org/abs/1511.07896 %0 Journal Article %J The Ethics of Biomedical Big Data %D 2016 %T Strictly Biomedical? Sketching the Ethics of the Big Data Ecosystem in Biomedicine %A Effy Vayena %A Gasser, Urs %X In today’s ever-evolving data ecosystem it is evident that data generated for a wide range of purposes unrelated to biomedicine possess tremendous potential value for biomedical research. Analyses of our Google searches, social media content, loyalty card points and the like are used to draw a fairly accurate picture of our health, our future health, our attitudes towards vaccination, disease outbreaks within a county and epidemic trajectories in other continents. These data sets are different from traditional biomedical data, if a biomedical purpose is the categorical variable. Yet the results their analyses yield are of serious biomedical relevance. This paper discusses important but unresolved challenges within typical biomedical data, and it explores examples of non-biomedical Big Data with high biomedical value, including the specific conundrums these engender, especially when we apply biomedical data concepts to them. It also highlights the “digital phenotype” project, illustrating the Big Data ecosystem in action and an approach believed likely to yield biomedical and health knowledge. We argue that to address the challenges and make full use of the opportunities that Big Data offers to biomedicine, a new ethical framework taking a data ecosystem approach is urgently needed. We conclude by discussing key components, design requirements and substantive normative elements of such a framework.

%B The Ethics of Biomedical Big Data %G eng %U http://link.springer.com/chapter/10.1007/978-3-319-33525-4_2 %0 Journal Article %J Conference on Learning Theory (COLT) %D 2016 %T Adaptive Learning with Robust Generalization Guarantees %A Rachel Cummings %A Katrina Ligett %A Kobbi Nissim %A Aaron Roth %A Zhiwei Steven Wu %XThe traditional notion of generalization---i.e., learning a hypothesis whose empirical error is close to its true error---is surprisingly brittle. As has recently been noted in [DFH+15b], even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this paper, we study three notions of generalization---increasing in strength---that are robust to postprocessing and amenable to adaptive composition, and examine the relationships between them. We call the weakest such notion Robust Generalization. A second, intermediate, notion is the stability guarantee known as differential privacy. The strongest guarantee we consider we call Perfect Generalization. We prove that every hypothesis class that is PAC learnable is also PAC learnable in a robustly generalizing fashion, with almost the same sample complexity. It was previously known that differentially private algorithms satisfy robust generalization. In this paper, we show that robust generalization is a strictly weaker concept, and that there is a learning task that can be carried out subject to robust generalization guarantees, yet cannot be carried out subject to differential privacy. We also show that perfect generalization is a strictly stronger guarantee than differential privacy, but that, nevertheless, many learning tasks can be carried out subject to the guarantees of perfect generalization.

%B Conference on Learning Theory (COLT) %G eng %U https://arxiv.org/abs/1602.07726 %0 Journal Article %J 48th Annual Symposium on the Theory of Computing %D 2016 %T Algorithmic Stability for Adaptive Data Analysis %A Raef Bassily %A Kobbi Nissim %A Smith, Adam %A Thomas Steinke %A Uri Stemmer %A Jonathan Ullman %XAdaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen queries about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions: (i) We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015, NIPS, 2015). (ii) We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries. As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. 
We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stability implies low generalization error. We also study weaker stability guarantees such as bounded KL divergence and total variation distance.

%B 48th Annual Symposium on the Theory of Computing %G eng %U https://arxiv.org/abs/1511.02513v1 %0 Journal Article %J PLoS Med %D 2016 %T Between Openness and Privacy in Genomics. %A Effy Vayena %A Gasser, Urs %X- Advancing genomic research depends on the accessing and sharing of genomic data. However, the increasing need for sharing escalates the tension between genomic privacy and openness.
- Promoting openness while protecting privacy is a challenge that cannot be overcome only with technical solutions such as encryption and differential privacy. Although such solutions are crucial, we still need to confront some fundamental normative tensions that are intensified in the era of genomics and big data. Here are at least three:
- The right to genomic privacy is not an absolute right. If privacy is understood as control over information or data, privacy is not about maximal levels of control, but rather about reasonable measures of control.
- Although individual control is necessary, it is not a sufficient safeguard of privacy. Individuals’ willingness to be open about their data does not obviate responsibility for reducing privacy risks on the part of the data users.
- Data governance models, such as data cooperatives, that enable meaningful and continuous roles of the individuals whose data are at stake hold promise for reconciling privacy and openness.

This is a graduate textbook of advanced tutorials on the theory of cryptography and computational complexity. In particular, the chapters explain aspects of garbled circuits, public-key cryptography, pseudorandom functions, one-way functions, homomorphic encryption, the simulation proof technique, and the complexity of differential privacy. Most chapters progress methodically through motivations, foundations, definitions, major results, issues surrounding feasibility, surveys of recent developments, and suggestions for further study.

This book honors Professor Oded Goldreich, a pioneering scientist, educator, and mentor. Oded was instrumental in laying down the foundations of cryptography, and he inspired the contributing authors, Benny Applebaum, Boaz Barak, Andrej Bogdanov, Iftach Haitner, Shai Halevi, Yehuda Lindell, Alon Rosen, and Salil Vadhan, themselves leading researchers on the theory of cryptography and computational complexity. The book is appropriate for graduate tutorials and seminars, and for self-study by experienced researchers, assuming prior knowledge of the theory of cryptography.

%B Tutorials on the Foundations of Cryptography %I Yehuda Lindell %G eng %0 Journal Article %J 14th Theory of Cryptography Conference %D 2016 %T Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds %A Mark Bun %A Thomas Steinke %X"Concentrated differential privacy" was recently introduced by Dwork and Rothblum as a relaxation of differential privacy, which permits sharper analyses of many privacy-preserving computations. We present an alternative formulation of the concept of concentrated differential privacy in terms of the Renyi divergence between the distributions obtained by running an algorithm on neighboring inputs. With this reformulation in hand, we prove sharper quantitative results, establish lower bounds, and raise a few new questions. We also unify this approach with approximate differential privacy by giving an appropriate definition of "approximate concentrated differential privacy."
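As a reader's aid (the paper's exact parameterization should be checked against the source itself), the reformulation the abstract mentions uses the standard Rényi divergence, and the resulting "zero-concentrated" guarantee bounds it linearly in the order α:

```latex
% Standard Renyi divergence of order \alpha > 1 between distributions P, Q:
D_\alpha(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1}\,
    \log \sum_x P(x)^{\alpha}\, Q(x)^{1-\alpha}
% A mechanism M satisfies \rho-zero-concentrated differential privacy if,
% for all neighboring inputs x, x' and all orders \alpha > 1:
D_\alpha\bigl(M(x) \,\|\, M(x')\bigr) \;\le\; \rho\,\alpha
```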

%B 14th Theory of Cryptography Conference %G eng %U https://arxiv.org/abs/1605.02065 %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security, CCS %D 2016 %T Differentially Private Bayesian Programming %A G. Barthe %A P.Y. Strub %A J. Hsu %A A. D. Gordon %A E. J. Gallego Arias %A M. Gaboardi %A G. P. Farina %B 23rd ACM Conference on Computer and Communications Security, CCS %G eng %0 Journal Article %D 2016 %T Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing %A Marco Gaboardi %A Hyun woo Lim %A Ryan Rogers %A Salil Vadhan %XHypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.

We propose new tests for goodness of fit and independence testing that, like the classical versions, can be used to determine whether a given model should be rejected, and that additionally can ensure differential privacy. We give both Monte Carlo-based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness-of-fit test and the Pearson chi-squared test for independence. Crucially, in determining significance, our tests account for the distribution of the noise that is injected to ensure privacy.

We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.
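The calibration issue described above can be illustrated with a toy sketch: compute the classical chi-squared statistic on Laplace-noised counts, then obtain a p-value by Monte Carlo simulation of the full pipeline (multinomial sampling plus privacy noise) under the null. All names and parameter choices here are illustrative rather than the paper's actual tests; the Laplace scale 2/ε reflects the standard L1 sensitivity of 2 for a histogram of fixed total size.

```python
import math
import random

def chi2_stat(observed, expected):
    # Classical Pearson chi-squared statistic.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def laplace(scale):
    # Laplace(0, scale) sample via the inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def monte_carlo_pvalue(noisy_stat, probs, n, epsilon, trials=1000):
    # Re-simulate BOTH the multinomial sampling and the privacy noise
    # under the null, so the reference distribution matches what was
    # actually released (rather than the classical chi-squared law).
    expected = [p * n for p in probs]
    exceed = 0
    for _ in range(trials):
        counts = [0] * len(probs)
        for _ in range(n):
            r, acc = random.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if r < acc:
                    counts[i] += 1
                    break
        noisy = [c + laplace(2.0 / epsilon) for c in counts]
        if chi2_stat(noisy, expected) >= noisy_stat:
            exceed += 1
    return exceed / trials
```

Because the simulated reference statistics include the injected noise, the reported significance level remains valid for the noisy release.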

Emerging large-scale data sources hold tremendous potential for new scientific research into human biology, behaviors, and relationships. At the same time, big data research presents privacy and ethical challenges that the current regulatory framework is ill-suited to address. In light of the immense value of large-scale research data, the central question moving forward is not whether such data should be made available for research, but rather how the benefits can be captured in a way that respects fundamental principles of ethics and privacy.

In response, this Essay outlines elements of a new ethical framework for big data research. It argues that oversight should aim to provide universal coverage of human subjects research, regardless of funding source, across all stages of the information lifecycle. New definitions and standards should be developed based on a modern understanding of privacy science and the expectations of research subjects. In addition, researchers and review boards should be encouraged to incorporate systematic risk-benefit assessments and new procedural and technological solutions from the wide range of interventions that are available. Finally, oversight mechanisms and the safeguards implemented should be tailored to the intended uses, benefits, threats, harms, and vulnerabilities associated with a specific research activity.

Development of a new ethical framework with these elements should be the product of a dynamic multistakeholder process that is designed to capture the latest scientific understanding of privacy, analytical methods, available safeguards, community and social norms, and best practices for research ethics as they evolve over time. Such a framework would support big data utilization and help harness the value of big data in a sustainable and trust-building manner.

%B Washington and Lee Law Review %V 72 %8 31 Mar, 2016 %G eng %U http://lawreview.journals.wlu.io/elements-of-a-new-ethical-framework-for-big-data-research/ %N 3 %0 Journal Article %J Annals of Statistics %D 2016 %T Inference Using Noisy Degrees: Differentially Private Beta-Model and Synthetic Graphs %A Vishesh Karwa %A Aleksandra Slavković %XThe β-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the β-model, and show that the MLE never exists for the degree partition β-model. Second, motivated by privacy problems with network data, we derive a differentially private estimator of the parameters of β-model, and show it is consistent and asymptotically normally distributed - it achieves the same rate of convergence as the nonprivate estimator. We present an efficient algorithm for the private estimator that can be used to release synthetic graphs. Our techniques can also be used to release degree distributions and degree partitions accurately and privately, and to perform inference from noisy degrees arising from contexts other than privacy. We evaluate the proposed estimator on real graphs and compare it with a current algorithm for releasing degree distributions and find that it does significantly better. Finally, our paper addresses shortcomings of current approaches to a fundamental problem of how to perform valid statistical inference from data released by privacy mechanisms, and lays a foundational groundwork on how to achieve optimal and private statistical inference in a principled manner by modeling the privacy mechanism; these principles should be applicable to a class of models beyond the β-model.

%B Annals of Statistics %V 44 %P 87-112 %G eng %N 1 %0 Journal Article %J Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS) %D 2016 %T Lipschitz Extensions for Node-Private Graph Statistics and the Generalized Exponential Mechanism %A Raskhodnikova, Sofya %A Smith, Adam %B Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS) %G eng %0 Conference Paper %B PODS 2016 %D 2016 %T Locating a Small Cluster Privately %A Kobbi Nissim %A Uri Stemmer %A Salil Vadhan %XWe present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim, and Smith, 2006]. Our algorithm has implications for private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova, and Smith, 2007], which allows compiling of "off the shelf" (non-private) analyses into analyses that preserve differential privacy.

%B PODS 2016 %C ACM SIGMOD/PODS Conference, San Francisco, USA, 2016 %8 June 26 2016 %G eng %0 Conference Paper %B Proceedings of the 12th Theory of Cryptography Conference (TCC 2016) %D 2016 %T Order-revealing encryption and the hardness of private learning %A Mark Bun %A Mark Zhandry %XAn order-revealing encryption scheme gives a public procedure by which two ciphertexts can be compared to reveal the ordering of their underlying plaintexts. We show how to use order-revealing encryption to separate computationally efficient PAC learning from efficient (ϵ,δ)-differentially private PAC learning. That is, we construct a concept class that is efficiently PAC learnable, but for which every efficient learner fails to be differentially private. This answers a question of Kasiviswanathan et al. (FOCS '08, SIAM J. Comput. '11).

To prove our result, we give a generic transformation from an order-revealing encryption scheme into one with strongly correct comparison, which enables the consistent comparison of ciphertexts that are not obtained as the valid encryption of any message. We believe this construction may be of independent interest.

Increasingly, governments and businesses are collecting, analyzing, and sharing detailed information about individuals over long periods of time. Vast quantities of data from new sources and novel methods for large-scale data analysis promise to yield deeper understanding of human characteristics, behavior, and relationships and advance the state of science, public policy, and innovation. At the same time, the collection and use of fine-grained personal data over time is associated with significant risks to individuals, groups, and society at large. In this article, we examine a range of long-term data collections, conducted by researchers in social science, in order to identify the characteristics of these programs that drive their unique sets of risks and benefits. We also examine the practices that have been established by social scientists to protect the privacy of data subjects in light of the challenges presented in long-term studies. We argue that many uses of big data, across academic, government, and industry settings, have characteristics similar to those of traditional long-term research studies. We discuss the lessons that can be learned from longstanding data management practices in research and potentially applied in the context of newly emerging data sources and uses.

%B Brussels Privacy Symposium %C Brussels, Belgium %8 2016 %G eng %0 Journal Article %J Neural Information Processing Systems (NIPS) %D 2016 %T Privacy Odometers and Filters: Pay-as-you-Go Composition %A Ryan Rogers %A Aaron Roth %A Jonathan Ullman %A Salil Vadhan %XIn this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer use-cases.
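The "filter" use case described in the abstract can be made concrete with a deliberately naive sketch that charges each query against a fixed budget using basic sequential composition (the ε values simply add). The filters in the paper achieve tighter, advanced-composition-style bounds; this class and its method names are purely illustrative.

```python
class BasicPrivacyFilter:
    """Halt an interactive analysis before a pre-specified privacy
    budget is exceeded, using naive sequential composition."""

    def __init__(self, budget):
        self.budget = budget   # total epsilon the analyst may spend
        self.spent = 0.0       # epsilon consumed so far

    def charge(self, epsilon):
        # Return True if the query may run; False means HALT, and the
        # running total is left unchanged so earlier guarantees hold.
        if self.spent + epsilon > self.budget:
            return False
        self.spent += epsilon
        return True

f = BasicPrivacyFilter(budget=1.0)
decisions = [f.charge(0.4), f.charge(0.4), f.charge(0.4)]  # third request exceeds the budget
```

Note the stopping-time behavior: the analyst may choose each ε adaptively, but the filter refuses any charge that would push the realized privacy loss past the pre-specified budget.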

%B Neural Information Processing Systems (NIPS) %8 2016 %G eng %0 Journal Article %J 43rd International Colloquium on Automata, Languages, and Programming %D 2016 %T Sensitivity of Counting Queries %A Myrto Arapinis %A Diego Figueira %A Marco Gaboardi %XIn the context of statistical databases, the release of accurate statistical information about the collected data often puts at risk the privacy of the individual contributors. The goal of differential privacy is to maximize the utility of a query while protecting the individual records in the database. A natural way to achieve differential privacy is to add statistical noise to the result of the query. In this context, a mechanism for releasing statistical information is thus a trade-off between utility and privacy. In order to balance these two "conflicting" requirements, privacy preserving mechanisms calibrate the added noise to the so-called sensitivity of the query, and thus a precise estimate of the sensitivity of the query is necessary to determine the amplitude of the noise to be added. In this paper, we initiate a systematic study of sensitivity of counting queries over relational databases. We first observe that the sensitivity of a Relational Algebra query with counting is not computable in general, and that while the sensitivity of Conjunctive Queries with counting is computable, it becomes unbounded as soon as the query includes a join. We then consider restricted classes of databases (databases with constraints), and study the problem of computing the sensitivity of a query given such constraints. We are able to establish bounds on the sensitivity of counting conjunctive queries over constrained databases. The kind of constraints studied here are: functional dependencies and cardinality dependencies. The latter is a natural generalization of functional dependencies that allows us to provide tight bounds on the sensitivity of counting conjunctive queries.
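The sensitivity-calibrated noise addition described above is the standard Laplace mechanism. A minimal sketch with a counting query of sensitivity 1 follows; the paper's setting is the sensitivity of relational counting queries, which this toy example does not attempt to capture.

```python
import math
import random

def laplace_noise(scale):
    # Laplace(0, scale) sample via the inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Calibrate the noise magnitude to sensitivity / epsilon: larger
    # sensitivity or smaller epsilon means more noise must be added.
    return true_answer + laplace_noise(sensitivity / epsilon)

# Counting query over a toy table: how many values exceed 10?
# Adding or removing a single record changes this count by at most 1,
# so the query's sensitivity is 1.
table = [3, 14, 15, 9, 26, 5]
count = sum(1 for x in table if x > 10)
private_count = laplace_mechanism(count, sensitivity=1.0, epsilon=0.5)
```

This is why a precise sensitivity estimate matters: overestimating it adds unnecessary noise and destroys utility, while underestimating it voids the privacy guarantee.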

%B 43rd International Colloquium on Automata, Languages, and Programming %G eng %0 Journal Article %J Proceedings of the 14th Theory of Cryptography Conference (TCC 2016-B) %D 2016 %T Separating Computational and Statistical Differential Privacy in the Client-Server Model %A Mark Bun %A Yi-Hsiu Chen %A Salil Vadhan %XDifferential privacy is a mathematical definition of privacy for statistical data analysis. It guarantees that any (possibly adversarial) data analyst is unable to learn too much information that is specific to an individual. Mironov et al. (CRYPTO 2009) proposed several computational relaxations of differential privacy (CDP), which relax this guarantee to hold only against computationally bounded adversaries. Their work and subsequent work showed that CDP can yield substantial accuracy improvements in various multiparty privacy problems. However, these works left open whether such improvements are possible in the traditional client-server model of data analysis. In fact, Groce, Katz and Yerukhimovich (TCC 2011) showed that, in this setting, it is impossible to take advantage of CDP for many natural statistical tasks. Our main result shows that, assuming the existence of sub-exponentially secure one-way functions and 2-message witness indistinguishable proofs (zaps) for NP, there is in fact a computational task in the client-server model that can be efficiently performed with CDP, but is infeasible to perform with information-theoretic differential privacy.

%B Proceedings of the 14th Theory of Cryptography Conference (TCC 2016-B) %8 Aug 2016 %G eng %0 Journal Article %J Berkeley Technology Law Journal %D 2016 %T Towards a Modern Approach to Privacy-Aware Government Data Releases %A Micah Altman %A Alexandra Wood %A O'Brien, David %A Salil Vadhan %A Gasser, Urs %XThis article summarizes research exploring various models by which governments release data to the public and the interventions in place to protect the privacy of individuals in the data. Applying concepts from the recent scientific and legal literature on privacy, the authors propose a framework for a modern privacy analysis and illustrate how governments can use the framework to select appropriate privacy controls that are calibrated to the specific benefits and risks in individual data releases.

%B Berkeley Technology Law Journal %V 30 %G eng %N 3 %0 Thesis %B Computer Science Department at Harvard University %D 2016 %T Upper and Lower Bounds for Privacy and Adaptivity in Algorithmic Data Analysis %A Thomas Steinke %B Computer Science Department at Harvard University %G eng %9 PhD Thesis %0 Journal Article %J Proceedings of the 32nd International Conference on Machine Learning %D 2015 %T Consistent Estimation of Dynamic and Multi-Layer Block Models %A Qiuyi Han %A Kevin Xu %A Edoardo Airoldi %XSignificant progress has been made recently on theoretical analysis of estimators for the stochastic block model (SBM). In this paper, we consider the multi-graph SBM, which serves as a foundation for many application settings including dynamic and multi-layer networks. We explore the asymptotic properties of two estimators for the multi-graph SBM, namely spectral clustering and the maximum-likelihood estimate (MLE), as the number of layers of the multi-graph increases. We derive sufficient conditions for consistency of both estimators and propose a variational approximation to the MLE that is computationally feasible for large networks. We verify the sufficient conditions via simulation and demonstrate that they are practical. In addition, we apply the model to two real data sets: a dynamic social network and a multi-layer social network with several types of relations.

%B Proceedings of the 32nd International Conference on Machine Learning %G eng %U https://arxiv.org/pdf/1410.8597v3.pdf %0 Journal Article %J Encyclopedia of Algorithms %D 2015 %T Differentially Private Analysis of Graphs %A Sofya Raskhodnikova and Adam Smith %B Encyclopedia of Algorithms %G eng %0 Journal Article %J ZSR %D 2015 %T Perspectives on the Future of Digital Privacy %A Gasser, Urs %B ZSR %P 339-448 %G eng %0 Generic %D 2015 %T All the Data on All the People %A Sweeney, Latanya %B UC Berkeley Law School & GWU Law School (Berkeley Center for Law & Technology). The Privacy Law Scholars Conference (PLSC). Berkeley, CA. %G eng %0 Journal Article %J The ANNALS of the American Academy of Political and Social Science %D 2015 %T Automating Open Science for Big Data %A Mercè Crosas %A Gary King %A James Honaker %A Sweeney, Latanya %XThe vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. 
In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

%B The ANNALS of the American Academy of Political and Social Science %V 659 %P 260-273 %G eng %U http://ann.sagepub.com/content/659/1/260.abstract %N 1 %0 Generic %D 2015 %T Between Pure and Approximate Differential Privacy %A Thomas Steinke %A Jon Ullman %B Theory and Practice of Differential Privacy (TPDP 2015), London, UK %G eng %U http://tpdp.computing.dundee.ac.uk/abstracts/TPDP_2015_3.pdf %0 Journal Article %J Proceedings of the 28th IEEE Computer Security Foundations Symposium (CSF) %D 2015 %T Cryptographic Enforcement of Language-Based Erasure %A A Askarov %A Moore, S. %A C Dimoulas %A S Chong %XInformation erasure is a formal security requirement that stipulates when sensitive data must be removed from computer systems. In a system that correctly enforces erasure requirements, an attacker who observes the system after sensitive data is required to have been erased cannot deduce anything about the data. Practical obstacles to enforcing information erasure include: (1) correctly determining which data requires erasure; and (2) reliably deleting potentially large volumes of data, despite untrustworthy storage services.

In this paper, we present a novel formalization of language-based information erasure that supports cryptographic enforcement of erasure requirements: sensitive data is encrypted before storage, and upon erasure, only a relatively small set of decryption keys needs to be deleted. This cryptographic technique has been used by a number of systems that implement data deletion to allow the use of untrustworthy storage services. However, these systems provide no support to correctly determine which data requires erasure, nor have the formal semantic properties of these systems been explained or proven to hold. We address these shortcomings. Specifically, we study a programming language extended with primitives for public-key cryptography, and demonstrate how information-flow control mechanisms can automatically track data that requires erasure and provably enforce erasure requirements even when programs employ cryptographic techniques for erasure.

This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

%B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/data-and-privacy-f7bfa24bbddc %0 Report %D 2015 %T The Differential Privacy of Bayesian Inference %A Shijie Zheng %XDifferential privacy is one recent framework for analyzing and quantifying the amount of privacy lost when data is released. Meanwhile, multiple imputation is an existing Bayesian-inference based technique from statistics that learns a model using real data, then releases synthetic data by drawing from that model. Because multiple imputation does not directly release any real data, it is generally believed to protect privacy. In this thesis, we examine that claim. While there exist newer synthetic data algorithms specifically designed to provide differential privacy, we evaluate whether multiple imputation already includes differential privacy for free. Thus, we focus on several method variants for releasing the learned model and releasing the synthetic data, and how these methods perform for models taking on two common distributions: the Bernoulli and the Gaussian with known variance. We prove a number of new or improved bounds on the amount of privacy afforded by multiple imputation for these distributions. We find that while differential privacy is ostensibly achievable for most of our method variants, the conditions needed for it to do so are often not realistic for practical usage. At least in theory, this is particularly true if we want absolute privacy (ε-differential privacy), but the methods are more practically compatible with privacy when we allow a small probability of a catastrophic data leakage ((ε, δ)-differential privacy).

We prove new upper and lower bounds on the sample complexity of (ϵ,δ) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function cx over a totally ordered domain X evaluates to cx(y)=1 if y≤x, and evaluates to 0 otherwise. We give the first nontrivial lower bound for releasing thresholds with (ϵ,δ) differential privacy, showing that the task is impossible over an infinite domain X, and moreover requires sample complexity n≥Ω(log∗|X|), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with n≤2(1+o(1))log∗|X| samples. This improves the previous best upper bound of 8(1+o(1))log∗|X| (Beimel et al., RANDOM '13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with (ϵ,δ) differential privacy and learning without privacy. For properly learning thresholds in ℓ dimensions, this lower bound extends to n≥Ω(ℓ⋅log∗|X|).

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database D of elements from X, the interior point problem asks for an element between the smallest and largest elements in D. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Emerging large-scale data sources hold tremendous potential for new scientific research into human biology, behaviors, and relationships. At the same time, big data research presents privacy and ethical challenges that the current regulatory framework is ill-suited to address. In light of the immense value of large-scale research data, the central question moving forward is not whether such data should be made available for research, but rather how the benefits can be captured in a way that respects fundamental principles of ethics and privacy.

In a search task, a group of agents compete to be the first to find the solution. Each agent has different private information to incorporate into its search. This problem is inspired by settings such as scientific research, Bitcoin hash inversion, or hunting for some buried treasure. A social planner such as a funding agency, mining pool, or pirate captain might like to convince the agents to collaborate, share their information, and greatly reduce the cost of searching. However, this cooperation is in tension with the individuals' competitive desire to each be the first to win the search. The planner's proposal should incentivize truthful information sharing, reduce the total cost of searching, and satisfy fairness properties that preserve the spirit of the competition. We design contract-based mechanisms for information sharing without money. The planner solicits the agents' information and assigns search locations to the agents, who may then search only within their assignments. Truthful reporting of information to the mechanism maximizes an agent's chance to win the search. Epsilon-voluntary participation is satisfied for large search spaces. In order to formalize the planner's goals of fairness and reduced search cost, we propose a simplified, simulated game as a benchmark and quantify fairness and search cost relative to this benchmark scenario. The game is also used to implement our mechanisms. Finally, we extend to the case where coalitions of agents may participate in the mechanism, forming larger coalitions recursively.

%B AAAI Conference on Artificial Intelligence %I Association for the Advancement of Artificial Intelligence (AAAI) %C North America %8 February %G eng %U http://people.seas.harvard.edu/~bwaggoner/papers/2015/fair-information-sharing-for-treasure-hunting--2015--chen,nissim,waggoner.pdf %0 Conference Proceedings %D 2015 %T On the Generalization Properties of Differential Privacy %A Kobbi Nissim %A Uri Stemmer %XA new line of work, started with Dwork et al., studies the task of answering statistical queries using a sample and relates the problem to the concept of differential privacy. By the Hoeffding bound, a sample of size O(log k/α^2) suffices to answer k non-adaptive queries within error α, where the answers are computed by evaluating the statistical queries on the sample. This argument fails when the queries are chosen adaptively (and can hence depend on the sample). Dwork et al. showed that if the answers are computed with (ϵ,δ)-differential privacy then O(ϵ) accuracy is guaranteed with probability 1−O(δ^ϵ). Using the Private Multiplicative Weights mechanism, they concluded that the sample size can still grow polylogarithmically with k.

Very recently, Bassily et al. presented an improved bound and showed that (a variant of) the private multiplicative weights algorithm can answer k adaptively chosen statistical queries using sample complexity that grows logarithmically in k. However, their results no longer hold for every differentially private algorithm, and require modifying the private multiplicative weights algorithm in order to obtain their high probability bounds.

We greatly simplify the results of Dwork et al. and improve on the bound by showing that differential privacy guarantees O(ϵ) accuracy with probability 1−O((δ/ϵ)·log(1/ϵ)). It would be tempting to guess that an (ϵ,δ)-differentially private computation should guarantee O(ϵ) accuracy with probability 1−O(δ). However, we show that this is not the case, and that our bound is tight (up to logarithmic factors).
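As a single-query illustration of the kind of mechanism at issue (a minimal sketch; the papers above concern many adaptively chosen queries, and the noise scale here is the standard Laplace-mechanism calibration rather than anything from their analysis):

```python
import math
import random

def laplace(scale, rng):
    # Sample Laplace(0, scale) via the inverse-CDF method.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_sq_answer(sample, predicate, eps, rng):
    # Answer one statistical query with eps-differential privacy.
    # The empirical mean of a 0/1 predicate has sensitivity 1/n,
    # so Laplace noise of scale 1/(n*eps) suffices.
    n = len(sample)
    true_mean = sum(predicate(x) for x in sample) / n
    return true_mean + laplace(1.0 / (n * eps), rng)

rng = random.Random(1)
sample = [rng.random() for _ in range(10_000)]
# True answer is about 0.5; the private answer is very close for this n.
ans = private_sq_answer(sample, lambda x: x < 0.5, eps=0.5, rng=rng)
```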

We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query them. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (short for GRaph EnCryption for approximate Shortest distance queries), which includes three schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally-efficient construction. Our second scheme makes use of somewhat-homomorphic encryption and is less computationally-efficient but achieves optimal communication complexity (i.e., uses a minimal amount of bandwidth). Finally, our third scheme is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.

%B The 22nd ACM Conference on Computer and Communications Security %G eng %U https://eprint.iacr.org/2015/266 %0 Journal Article %J International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %D 2015 %T Hardness Amplification and the Approximate Degree of Constant-Depth Circuits %A Mark Bun %A Justin Thaler %XWe establish a generic form of hardness amplification for the approximability of constant-depth Boolean circuits by polynomials. Specifically, we show that if a Boolean circuit cannot be pointwise approximated by low-degree polynomials to within constant error in a certain one-sided sense, then an OR of disjoint copies of that circuit cannot be pointwise approximated even with very high error. As our main application, we show that for every sequence of degrees d(n), there is an explicit depth-three circuit F : {−1,1}^n → {−1,1} of polynomial size such that any degree-d polynomial cannot pointwise approximate F to error better than 1 − exp(−Ω~(n·d^(−3/2))). As a consequence of our main result, we obtain an exp(−Ω~(n^(2/5))) upper bound on the discrepancy of a function in AC0, and an exp(Ω~(n^(2/5))) lower bound on the threshold weight of AC0, improving over the previous best results of exp(−Ω(n^(1/3))) and exp(Ω(n^(1/3))) respectively. Our techniques also yield a new lower bound of Ω(n^(1/2)/log^((d−2)/2)(n)) on the approximate degree of the AND-OR tree of depth d, which is tight up to polylogarithmic factors for any constant d, as well as new bounds for read-once DNF formulas. In turn, these results imply new lower bounds on the communication and circuit complexity of these classes, and demonstrate strong limitations on existing PAC learning algorithms.

%B International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %G eng %U http://arxiv.org/abs/1311.1616 %0 Generic %D 2015 %T In the Age of the Web, What Does “Public” Mean? %A Rob Faris %A O'Brien, David %B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/in-the-age-of-the-web-what-does-public-mean-ee74df403174 %0 Journal Article %J Social Science Research Network %D 2015 %T Integrating Approaches to Privacy Across the Research Lifecycle: When is Information Purely Public? %A O'Brien, David %A Jonathan Ullman %A Micah Altman %A Gasser, Urs %A Bar-Sinai, Michael %A Kobbi Nissim %A Salil Vadhan %A Michael Wojcik %A Alexandra Wood %XOn September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.

We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is “close” to the correct expectation over the distribution. This question was recently studied by Dwork et al. (2015), who showed how to answer Ω~(n^2) queries efficiently, and also by Hardt and Ullman (2014), who showed that answering O~(n^3) queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n^2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code (Fiat and Tassa, 2001). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

%B JMLR: Workshop and Conference Proceedings %V 40 %P 1-41 %G eng %U http://jmlr.org/proceedings/papers/v40/Steinke15.pdf %N 201 %0 Journal Article %J Accepted for publication, SODA 2015 %D 2015 %T Learning Privately with Labeled and Unlabeled Examples %A A. Beimel %A K. Nissim %A U. Stemmer %XA private learner is an algorithm that given a sample of labeled individual examples outputs a generalizing hypothesis while preserving the privacy of each individual. In 2008, Kasiviswanathan et al. (FOCS 2008) gave a generic construction of private learners, in which the sample complexity is (generally) higher than what is needed for non-private learners. This gap in the sample complexity was then further studied in several followup papers, showing that (at least in some cases) this gap is unavoidable. Moreover, those papers considered ways to overcome the gap, by relaxing either the privacy or the learning guarantees of the learner.

We suggest an alternative approach, inspired by the (non-private) models of semi-supervised learning and active-learning, where the focus is on the sample complexity of labeled examples whereas unlabeled examples are of a significantly lower cost. We consider private semi-supervised learners that operate on a random sample, where only a (hopefully small) portion of this sample is labeled. The learners have no control over which of the sample elements are labeled. Our main result is that the labeled sample complexity of private learners is characterized by the VC dimension.

We present two generic constructions of private semi-supervised learners. The first construction is of learners where the labeled sample complexity is proportional to the VC dimension of the concept class; however, the unlabeled sample complexity of the algorithm is as big as the representation length of domain elements. Our second construction presents a new technique for decreasing the labeled sample complexity of a given private learner, while only slightly increasing its unlabeled sample complexity. In addition, we show that in some settings the labeled sample complexity does not depend on the privacy parameters of the learner.

Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.

%B Arxiv.org Computer Science, Computers and Society [Internet] %G eng %U http://arxiv.org/abs/1506.05632 %0 Generic %D 2015 %T Privacy as a Sword and Shield in Public Health %A Sweeney, Latanya %B New York City Department of Public Health. New York, NY. %G eng %0 Generic %D 2015 %T Privacy Principles (framing talk) %A Micah Altman %B United Nations Global Pulse Workshop on ICT4D Principle 8: Address Privacy & Security In Development Programs. New York, USA %G eng %U https://www.slideshare.net/drmaltman/un-global-pulse-privacy-framing %0 Journal Article %D 2015 %T Private Approximations of the 2nd-Moment Matrix Using Existing Techniques in Linear Regression %A Or Sheffet %XWe introduce three differentially-private algorithms that approximate the 2nd-moment matrix of the data. These algorithms, which in contrast to existing algorithms output positive-definite matrices, correspond to existing techniques in the linear regression literature. Specifically, we discuss the following three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that by approximating the solution using the Johnson-Lindenstrauss transform we preserve privacy. (ii) We show that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially private provided the prior is set correctly. We also evaluate our techniques experimentally and compare them to the existing "Analyze Gauss" algorithm of Dwork et al.
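Of the three techniques in the abstract above, (ii) is the simplest to sketch (a toy illustration with hypothetical parameters; the paper's contribution is calibrating the number of added samples and their variance to (ε,δ), which is not reproduced here):

```python
import random

def second_moment(rows):
    # Compute the 2nd-moment matrix X^T X of a list of row vectors.
    d = len(rows[0])
    A = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                A[i][j] += r[i] * r[j]
    return A

def noisy_second_moment(rows, k, sigma, rng):
    # Technique (ii), sketched: append k random Gaussian rows to the
    # data before forming X^T X.  The result is symmetric and positive
    # semi-definite by construction; with k and sigma set appropriately
    # (the paper's analysis), the release is differentially private.
    d = len(rows[0])
    noise = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]
    return second_moment(rows + noise)

rng = random.Random(0)
X = [[1.0, 2.0], [3.0, 4.0]]
A = noisy_second_moment(X, k=5, sigma=1.0, rng=rng)
```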

%G eng %U http://arxiv.org/abs/1507.00056 %0 Conference Paper %B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %D 2015 %T Robust Traceability from Trace Amounts %A C. Dwork %A Smith, A %A T Steinke %A J Ullman %A S. Vadhan %XThe privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in ℓ1 norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernoulli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

%B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %C Berkeley, California %8 10/18-20/2015 %G eng %0 Journal Article %J Technology Science %D 2015 %T Sharing Sensitive Data with Confidence: The Datatags System %A Sweeney, Latanya %A Mercè Crosas %A Bar-Sinai, Michael %XSociety generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

%B Technology Science %G eng %U http://techscience.org/a/2015101601/ %0 Conference Paper %D 2015 %T Simultaneous private learning of multiple concepts %A Mark Bun %A Kobbi Nissim %A Uri Stemmer %XWe investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving k learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving k learning tasks without privacy? In our setting, an individual example consists of a domain element x labeled by k unknown concepts (c1,…,ck). The goal of a multi-learner is to output k hypotheses (h1,…,hk) that generalize the input examples.

Without concern for privacy, the sample complexity needed to simultaneously learn k concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with k. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in k.

In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy we offer these comments based on presentations and discussions at the White House-MIT Workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society and NYU Information Law Institute and the UC Berkeley iSchool.

%G eng %0 Government Document %D 2014 %T Comment to The White House Office of Science and Technology Policy (OSTP): Big Data Study, Request for Information %A Micah Altman %A David O’Brien %A Salil Vadhan %A Alexandra Wood %XOn January 23, 2014, President Barack Obama asked John Podesta to perform a comprehensive review of big data and privacy. During this review, the White House Office of Science and Technology Policy issued a request for public comment on questions related to the public policy implications of big data.

Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood submitted a response on behalf of the Privacy Tools for Sharing Research Data project. Their comments outline a broad, comprehensive, and systematic framework for privacy analysis and provide a taxonomy of modern technological, statistical, and cryptographic approaches to preserving both data privacy and utility. They argue that an analysis of information privacy should address the scope of information covered, the sensitivity of that information, the risk that sensitive information will be disclosed, the availability of control and accountability mechanisms, and the suitability of existing data sharing models, as applied across the entire lifecycle of information use, from collection through dissemination and reuse.

With this submission, the authors discuss the inadequacy of traditional approaches to privacy protection and recommend a modern approach that considers three principles. First, the risks of informational harm are generally not a simple function of the presence or absence of specific fields, attributes, or keywords in the released set of data. Second, redaction, pseudonymization, coarsening, and hashing are often neither adequate nor appropriate practices, nor is releasing less information necessarily more privacy protective. Third, a thoughtful analysis with expert consultation is necessary in order to evaluate the sensitivity of the data collected, to quantify the associated re-identification risks, and to design useful and safe release mechanisms.

In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system.

To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. *Access control policies* restrict which components may use a capability and are enforced using higher-order contracts. *Integrity policies* restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.

Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.

We show a tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently considered by Dwork et al., who showed that Ω~(n^2) queries can be answered efficiently, and also by Hardt and Ullman, who showed that answering O~(n^3) queries is computationally hard. We close the gap between the two bounds by proving a new, nearly-optimal hardness result. Specifically, we show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to O(n^2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results via an optimal construction of a new combinatorial object that we call an interactive fingerprinting code, which may be of independent interest.

%G eng %U http://arxiv.org/abs/1410.1228 %0 Book %D 2014 %T Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse %A Gasser, Urs %A Jonathan Zittrain %A R Faris %A R.H Jones %XThis publication is the second annual report of the Internet Monitor project at the Berkman Center for Internet & Society at Harvard University. As with the inaugural report, this year’s edition is a collaborative effort of the extended Berkman community. Internet Monitor 2014: Reflections on the Digital World includes nearly three dozen contributions from friends and colleagues around the world that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

The result, intended for a general interest audience, brings together reflection and analysis on a broad range of issues and regions — from an examination of Europe’s “right to be forgotten” to a review of the current state of mobile security to an exploration of a new wave of movements attempting to counter hate speech online — and offers it up for debate and discussion. Our goal remains not to provide a definitive assessment of the “state of the Internet” but rather to provide a rich compendium of commentary on the year’s developments with respect to the online space.

Last year’s report examined the dynamics of Internet controls and online activity through the actions of government, corporations, and civil society. We focus this year on the interplay between technological platforms and policy; growing tensions between protecting personal privacy and using big data for social good; the implications of digital communications tools for public discourse and collective action; and current debates around the future of Internet governance.

The report reflects the diversity of ideas and input the Internet Monitor project seeks to invite. Some of the contributions are descriptive; others prescriptive. Some contain purely factual observations; others offer personal opinion. In addition to those in traditional essay format, contributions this year include a speculative fiction story exploring what our increasingly data-driven world might bring, a selection of “visual thinking” illustrations that accompany a number of essays, a “Year in Review” timeline that highlights many of the year’s most fascinating Internet-related news stories (an interactive version of which is available at netmonitor.org), and a slightly tongue-in-cheek “By the Numbers” section that offers a look at the year’s important digital statistics. We believe that each contribution offers insights, and hope they provoke further reflection, conversation, and debate in both offline and online settings around the globe.

%7 17 %I Berkman Center Research Publication %C Cambridge, MA %G eng %U http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2538813 %0 Conference Paper %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %D 2014 %T Mechanism Design in Large Games: Incentives and Privacy %A Michael Kearns %A Mallesh Pai %A Aaron Roth %A Jonathan Ullman %K differential privacy %K equilibrium selection %K game theory %K mechanism design %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %S ITCS '14 %I ACM %C New York, NY, USA %P 403–410 %@ 978-1-4503-2698-8 %G eng %U http://doi.acm.org/10.1145/2554797.2554834 %R 10.1145/2554797.2554834 %0 Journal Article %J CoRR %D 2014 %T An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring

%A Mallesh M. Pai %A Aaron Roth %A Jonathan Ullman %B CoRR %V abs/1402.2801 %G eng %0 Conference Paper %B 10th Conference on Web and Internet Economics (WINE) %D 2014 %T Privacy Games %A Yiling Chen %A Or Sheffet %A Salil Vadhan %B 10th Conference on Web and Internet Economics (WINE) %C Beijing, China %8 December %G eng %0 Conference Paper %B Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14) %D 2014 %T Privacy Integrated Data Stream Queries %A Lucas Waye %B Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14) %I ACM %C New York, NY %G eng %U http://delivery.acm.org/10.1145/2690000/2687150/p19-waye.pdf?ip=65.112.10.216&id=2687150&acc=OPEN&key=4D4702B0C3E38B35.4D4702B0C3E38B35.4D4702B0C3E38B35.6D218144511F3437&CFID=469747432&CFTOKEN=41625734&__acm__=1420474755_45ce38ca730887129ca49b667ac6e0aa %0 Conference Paper %B ICML 2014 Workshop on Learning, Security and Privacy %D 2014 %T Private Empirical Risk Minimization, Revisited %A Raef Bassily %A Smith, Adam %A Abhradeep Thakurta %XIn this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.

Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different.

Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.

%B ICML 2014 Workshop on Learning, Security and Privacy %C Beijing, China %8 25 Jun %G eng %U http://arxiv.org/abs/1405.7085 %0 Book Section %B Automata, Languages, and Programming %D 2014 %T Privately Solving Linear Programs %A Justin Hsu %A Aaron Roth %A Roughgarden, Tim %A Jonathan Ullman %A Esparza, Javier %A Fraigniaud, Pierre %A Husfeldt, Thore %A Koutsoupias, Elias %B Automata, Languages, and Programming %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 8572 %P 612-624 %@ 978-3-662-43947-0 %G eng %U http://dx.doi.org/10.1007/978-3-662-43948-7_51 %R 10.1007/978-3-662-43948-7_51 %0 Conference Paper %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %D 2014 %T Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals %A Kobbi Nissim %A Salil Vadhan %A Xiao, David %K differential privacy %K mechanism design %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %S ITCS '14 %I ACM %C New York, NY, USA %P 411–422 %@ 978-1-4503-2698-8 %G eng %U http://doi.acm.org/10.1145/2554797.2554835 %R 10.1145/2554797.2554835 %0 Conference Proceedings %B Proceedings of The 27th Conference on Learning Theory (COLT 2014) %D 2014 %T Sample Complexity Bounds on Differentially Private Learning via Communication Complexity %A Vitaly Feldman %A Xiao, David %X In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008) but a number of basic questions still remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013ab).

Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:

- SCDP(C)=Ω(LDim(C)), where LDim(C) is Littlestone's (1987) dimension, characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that SCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work.
- For any t, there exists a class C such that LDim(C)=2 but SCDP(C)≥t.
- For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b).

We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al. (2013a), and our bounds rely on a number of known results from communication complexity.

%B Proceedings of The 27th Conference on Learning Theory (COLT 2014) %I JMLR Workshop and Conference Proceedings %C Barcelona, Spain %V 35 %P 1-20 %G eng %U http://jmlr.org/proceedings/papers/v35/feldman14b.pdf %0 Book %D 2014 %T What Stays in Vegas: The World of Personal Data -- Lifeblood of Big Business -- and the End of Privacy as We Know It. %A Adam Tanner %X The greatest threat to privacy today is not the NSA, but good old American companies. Internet giants, leading retailers, and other firms are voraciously gathering data with little oversight from anyone.

In Las Vegas, no company knows the value of data better than Caesars Entertainment. Many thousands of enthusiastic clients pour through the ever-open doors of their casinos. The secret to the company’s success lies in their one unrivaled asset: they know their clients intimately by tracking the activities of the overwhelming majority of gamblers. They know exactly what games they like to play, what foods they enjoy for breakfast, when they prefer to visit, who their favorite hostess might be, and exactly how to keep them coming back for more.

Caesars’ dogged data-gathering methods have been so successful that they have grown to become the world’s largest casino operator, and have inspired companies of all kinds to ramp up their own data mining in the hopes of boosting their targeted marketing efforts. Some do this themselves. Some rely on data brokers. Others clearly enter a moral gray zone that should make American consumers deeply uncomfortable.

We live in an age when our personal information is harvested and aggregated whether we like it or not. And it is growing ever more difficult for those businesses that choose not to engage in more intrusive data gathering to compete with those that do. Tanner’s timely warning resounds: Yes, there are many benefits to the free flow of all this data, but there is a dark, unregulated, and destructive netherworld as well.

%0 Conference Paper %B Proceedings of the 45th annual ACM symposium on Symposium on theory of computing %D 2013 %T Interactive Proofs of Proximity: Delegating Computation in Sublinear Time %A Guy N. Rothblum %A Salil Vadhan %A Avi Wigderson %X We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n^{3/4} queries and communication. On the negative side, we show that there exist natural languages in NC^1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least Ω̃(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

%B Proceedings of the 45th annual ACM symposium on Symposium on theory of computing %I ACM %C Palo Alto, California, USA %P 793-802 %@ 978-1-4503-2029-0 %G eng %U http://dx.doi.org/10.1145/2488608.2488709 %1 2488709 %R 10.1145/2488608.2488709 %0 Unpublished Work %D 2013 %T Matching Known Patients to Health Records in Washington State Data %A Sweeney, Latanya %B Data Privacy Lab, IQSS, Harvard University %8 06/01/2013 %G eng %U http://thedatamap.org/risks.html %0 Book Section %B Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques %D 2013 %T Private Learning and Sanitization: Pure vs. Approximate Differential Privacy %A Amos Beimel %A Kobbi Nissim %A Uri Stemmer %A Raghavendra, Prasad %A Raskhodnikova, Sofya %A Jansen, Klaus %A Rolim, José D.P. %K differential privacy %K Private Learning %K Sanitization %B Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 8096 %P 363-378 %@ 978-3-642-40327-9 %G eng %U http://dx.doi.org/10.1007/978-3-642-40328-6_26 %R 10.1007/978-3-642-40328-6_26 %0 Journal Article %J JAMA %D 2013 %T Putting Health IT on the Path to Success %A Sweeney, Latanya %A Yasnoff, William A. %A Shortliffe, Edward H. %X The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Current approaches to health information exchange have serious flaws: they (1) are complex and expensive; (2) are prone to error and insecurity; (3) increase liability; (4) are not financially sustainable; (5) fail to protect privacy; (6) fail to ensure stakeholder cooperation; and (7) fail to facilitate robust data sharing.
The good news is that personal health record banks pose a viable alternative that is (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and (f) capable of providing the most comprehensive records. The idea of patient-controlled records is not new, but what is new is how personally controlled records can help achieve the HIT vision. %B JAMA %V 309 %P 989-990 %G eng %U http://dx.doi.org/10.1001/jama.2013.1474 %N 10 %0 Unpublished Work %D 2013 %T Survey of Publicly Available State Health Databases %A Hooley, Sean %A Sweeney, Latanya %B Data Privacy Lab, IQSS, Harvard University %8 06/01/2013 %G eng %U http://thedatamap.org/states.html %0 Conference Paper %B Proceedings of the fourteenth ACM conference on Electronic commerce %D 2013 %T Truthful mechanisms for agents that value privacy %A Yiling Chen %A Stephen Chong %A Ian A. Kash %A Tal Moran %A Salil Vadhan %X Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has small privacy cost to player i. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n).
%B Proceedings of the fourteenth ACM conference on Electronic commerce %I ACM %C Philadelphia, Pennsylvania, USA %P 215-232 %@ 978-1-4503-1962-1 %G eng %U http://dx.doi.org/10.1145/2482540.2482549 %1 2482549 %R 10.1145/2482540.2482549 %0 Conference Paper %B Proceedings of the 32nd International Cryptology Conference (CRYPTO `12) %D 2012 %T Differential Privacy with Imperfect Randomness %A Yevgeniy Dodis %A Adriana López-Alt %A Ilya Mironov %A Salil Vadhan %X In this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) require an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under specific “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Specifically, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and differential privacy with respect to imperfect randomness.
Interestingly, the design of our mechanism is quite different from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisfied by any additive-noise mechanism.

%B Proceedings of the 32nd International Cryptology Conference (CRYPTO `12) %S Lecture Notes on Computer Science %7 Lecture Notes on Computer Science %I Springer-Verlag %C Santa Barbara, CA %V 7417 %P 497–516 %8 19–23 August %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-32009-5_29 %0 Conference Paper %B Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012 %D 2012 %T Faster Algorithms for Privately Releasing Marginals %A Justin Thaler %A Jonathan Ullman %A Salil P. Vadhan %X We study the problem of releasing k-way marginals of a database D ∈ ({0,1}^d)^n, while preserving differential privacy. The answer to a k-way marginal query is the fraction of D’s records x ∈ {0,1}^d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS ’07). We give an algorithm that runs in time d^{O(√k)} and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n ≥ d^{O(√k)}. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d^{Θ(k)} (for k ≪ d).

%B Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012 %7 Lecture Notes in Computer Science %I Springer %C Warwick, UK %V 7391 %8 9-13 Jul. %G eng %U http://dx.doi.org/10.1007/978-3-642-31594-7_68 %0 Conference Paper %B Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012 %D 2012 %T Iterative Constructions and Private Data Release %A Anupam Gupta %A Aaron Roth %A Jonathan Ullman %X In this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction (IDC) algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|^{1.5}). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor.
Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

%B Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012 %7 Lecture Notes in Computer Science %I Springer %C Taormina, Sicily, Italy %V 7194 %P 339-356 %8 19-21 Mar. %G eng %U http://dx.doi.org/10.1007/978-3-642-28914-9_19 %0 Conference Paper %B Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `12) %D 2012 %T The Privacy of the Analyst and the Power of the State %A Cynthia Dwork %A Moni Naor %A Salil Vadhan %X We initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

%B Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `12) %I IEEE %C New Brunswick, NJ %P 400–409 %8 20–23 October %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6375318&tag=1 %0 Manuscript %D 2012 %T Private Equilibrium Release, Large Games, and No-Regret Learning %A Michael Kearns %A Mallesh Pai %A Aaron Roth %A Jonathan Ullman %X We give mechanisms in which each of n players in a game is given their component of an (approximate) equilibrium in a way that guarantees differential privacy---that is, the revelation of the equilibrium components does not reveal too much information about the utilities of the other players. More precisely, we show how to compute an approximate correlated equilibrium (CE) under the constraint of differential privacy (DP), provided n is large and any player's action affects any other's payoff by at most a small amount. Our results draw interesting connections between noisy generalizations of classical convergence results for no-regret learning, and the noisy mechanisms developed for differential privacy. Our results imply the ability to truthfully implement good social-welfare solutions in many games, such as games with small Price of Anarchy, even if the mechanism does not have the ability to enforce outcomes. We give two different mechanisms for DP computation of approximate CE. The first is computationally efficient, but has a suboptimal dependence on the number of actions in the game; the second is not computationally efficient, but allows for games with exponentially many actions. We also give a matching lower bound, showing that our results are tight up to logarithmic factors.

%G eng %U http://arxiv.org/abs/1207.4084 %0 Government Document %D 2011 %T Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005 %A Salil Vadhan %A David Abrams %A Micah Altman %A Cynthia Dwork %A Paul Kominers %A Scott Duke Kominers %A Harry R. Lewis %A Tal Moran %A Guy Rothblum %X Comments by Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Scott Duke Kominers, Paul Kominers, Harry Lewis, Tal Moran, Guy Rothblum, and Jon Ullman (at Harvard, Microsoft Research, the University of Chicago, MIT, and the Herzliya Interdisciplinary Center). These comments address the issues of data privacy and de-identification raised in the ANPRM. Our perspective is informed by substantial advances in privacy science that have been made in the computer science literature.

%8 October %G eng %U http://www.regulations.gov/#!documentDetail;D=HHS-OPHS-2011-0005-1101 %0 Conference Paper %B Proceedings of the 8th IACR Theory of Cryptography Conference (TCC `11) %D 2011 %T PCPs and the Hardness of Generating Synthetic Data %A Jon Ullman %A Salil Vadhan %E Yuval Ishai %X Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D \in ({0,1}^d)^n and outputs a "synthetic database" D' all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x \in {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS `07), who gave an algorithm running in time poly(n, 2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC `09) with PCP-based Levin-reductions from NP search problems to finding approximate solutions to CSPs.

%B Proceedings of the 8th IACR Theory of Cryptography Conference (TCC `11) %S Lecture Notes on Computer Science %7 Lecture Notes on Computer Science %I Springer-Verlag %C Providence, RI %V 5978 %P 572–587 %8 28–30 March %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-19571-6_24 %0 Conference Paper %B Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011 %D 2011 %T Privately releasing conjunctions and the statistical query barrier %A Anupam Gupta %A Moritz Hardt %A Aaron Roth %A Jonathan Ullman %X Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ) model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents progress on a key open problem in privacy-preserving data analysis. Our first result, on the other hand, gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries.
Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms. %B Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011 %I ACM %C San Jose, CA, USA %P 803-812 %8 6-8 June %G eng %U http://dl.acm.org/citation.cfm?id=1993742 %0 Conference Paper %B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %D 2010 %T Boosting and Differential Privacy %A Cynthia Dwork %A Guy Rothblum %A Salil Vadhan %X Boosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-preserving synopses of an input database. These are data structures that yield, for a given set Q of queries over an input database, reasonably accurate estimates of the responses to every query in Q, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on Q and produces a "weak" synopsis that yields "good" answers for a majority of the weight in Q, our Boosting for Queries algorithm obtains a synopsis that is good for all of Q. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an O(ε^2) bound on the expected privacy loss from a single ε-differentially private mechanism.
Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides ε-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.

%B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %I IEEE %C Las Vegas, NV %P 51–60 %8 23–26 October %G eng %U http://dx.doi.org/10.1109/FOCS.2010.12 %0 Conference Paper %B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %D 2010 %T The Limits of Two-Party Differential Privacy %A Andrew McGregor %A Ilya Mironov %A Toniann Pitassi %A Omer Reingold %A Kunal Talwar %A Salil Vadhan %X We study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy. The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions).

%B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %I IEEE %C Las Vegas, NV %P 81–90 %8 23–26 October %G eng %U http://dx.doi.org/10.1109/FOCS.2010.14 %0 Conference Paper %B Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC `09) %D 2009 %T On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results %A Cynthia Dwork %A Moni Naor %A Omer Reingold %A Guy Rothblum %A Salil Vadhan %X We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role. For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.

%B Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC `09) %C Bethesda, MD %P 381–390 %8 31 May–2 June %G eng %U http://dl.acm.org/citation.cfm?id=1536467 %0 Conference Paper %B Advances in Cryptology - CRYPTO `09 %D 2009 %T Computational Differential Privacy %A Ilya Mironov %A Omkant Pandey %A Omer Reingold %A Salil Vadhan %X The definition of differential privacy has recently emerged as a leading standard of privacy guarantees for algorithms on statistical databases. We offer several relaxations of the definition which require privacy guarantees to hold only against efficient—i.e., computationally-bounded—adversaries. We establish various relationships among these notions, and in doing so, we observe their close connection with the theory of pseudodense sets by Reingold et al. [1]. We extend the dense model theorem of Reingold et al. to demonstrate equivalence between two definitions (indistinguishability- and simulatability-based) of computational differential privacy. Our computational analogues of differential privacy seem to allow for more accurate constructions than the standard information-theoretic analogues. In particular, in the context of private approximation of the distance between two vectors, we present a differentially-private protocol for computing the approximation, and contrast it with a substantially more accurate protocol that is only computationally differentially private.

%B Advances in Cryptology - CRYPTO `09 %S Lecture Notes in Computer Science %I Springer-Verlag %C Santa Barbara, CA %V 5677 %P 126–142 %8 16–20 August %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-03356-8_8 %0 Unpublished Work %D 2000 %T Simple Demographics Often Identify People Uniquely %A Sweeney, Latanya %B Carnegie Mellon University, Data Privacy %G eng %U http://dataprivacylab.org/projects/identifiability/ %9 Working paper