Abstract

This article advocates a hybrid legal-technical approach to the evaluation of technical measures designed to render information anonymous in order to bring it outside the scope of data protection regulation. The article demonstrates how such an approach can be used for instantiating a key anonymization concept appearing in the EU General Data Protection Regulation (GDPR) -- singling out. The analysis identifies and addresses a tension between a common, compelling theory of singling out and a mathematical analysis of this theory, and it demonstrates how to make determinations regarding the sufficiency of specific technologies for satisfying regulatory requirements for anonymization.

Doubts about the feasibility of effective anonymization and de-identification have gained prominence in recent years in response to high-profile privacy breaches enabled by scientific advances in privacy research, improved analytical capabilities, the wider availability of personal data, and the unprecedented richness of available data sources. At the same time, privacy regulations recognize the possibility, at least in principle, of data anonymization that is sufficiently protective so as to free the resulting (anonymized) data from regulation. As a result, practitioners developing privacy enhancing technologies face substantial uncertainty as to the legal standing of these technologies. More fundamentally, it is not clear how to make a determination of compliance even when the tool is fully described and available for examination.

This gap is symptomatic of a more general problem: Legal and technical approaches to data protection have developed in parallel, and their conceptual underpinnings are growing increasingly divergent. When lawmakers rely on purely legal concepts to engage areas that are affected by rapid scientific and technological change, the resulting laws, when applied in practice, frequently create substantial uncertainty for implementation; provide contradictory recommendations in important cases; disagree with current scientific technical understanding; and fail to scale to the rapid pace of technological development. This article argues that new hybrid concepts, created through technical and legal co-design, can inform practices that are practically complete, coherent, and scalable.

As a case study, the article focuses on a key privacy-related concept appearing in Recital 26 of the General Data Protection Regulation (GDPR) called singling out. We identify a compelling theory of singling out that is implicit in the most persuasive guidance available, and demonstrate that the theory is ultimately incomplete. We then use that theory as the basis for a new and mathematically rigorous privacy concept called predicate singling-out. Predicate singling-out sheds light on the notion of singling out in the GDPR, itself inextricably linked to anonymization. We argue that any data protection tool that purports to anonymize arbitrary personal data under the GDPR must prevent predicate singling-out. This enables, for the first time, a legally- and mathematically-grounded analysis of the standing of supposed anonymization technologies like k-anonymity and differential privacy. The analysis in this article is backed by a technical-mathematical analysis previously published by two of the authors.

Conceptually, our analysis demonstrates that a nuanced understanding of baseline risk is unavoidable for a theory of singling out based on current regulatory guidance. Practically, it identifies previously unrecognized failures of anonymization. In particular, it demonstrates that some k-anonymous mechanisms may allow singling out, challenging the prevailing regulatory guidance.

The article concludes with a discussion of specific recommendations for both policymakers and scholars regarding how to conduct a hybrid legal-technical analysis. Rather than formalizing or mathematizing the law, the article provides approaches for wielding formal tools in the service of practical regulation.

We study the problem of verifying differential privacy for loop-free programs with probabilistic choice. Programs in this class can be seen as randomized Boolean circuits, which we will use as a formal model to answer two different questions: first, deciding whether a program satisfies a prescribed level of privacy; second, approximating the privacy parameters a program realizes. We show that the problem of deciding whether a program satisfies "-differential privacy is coNP#P-complete. In fact, this is the case when either the input domain or the output range of the program is large.

Further, we show that deciding whether a program is (", )-differentially private is coNP#P-hard, and in coNP#P for small output domains, but always in coNP#P#P. Finally, we show that the problem of approximating the level of differential privacy is both NP-hard and coNP-hard. These results complement previous results by Murtagh and Vadhan [34] showing that deciding the optimal composition of differentially private components is #P-complete, and that approximating the optimal composition of differentially private components is in P.

%B 47th International Colloquium on Automata, Languages and Programming (To appear - ICALP 2020) %8 7 July 2020 %G eng %U https://arxiv.org/abs/1911.03272 %0 Journal Article %J arXiv:2007.05157 %D 2020 %T Differentially Private Simple Linear Regression %A Daniel Alabi %A Smith, Adam %A Audra McMillan %A Salil Vadhan %A Jayshree Sarathy %X Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases. %B arXiv:2007.05157 %G eng %U http://arxiv.org/abs/2007.05157 %0 Conference Paper %B 20th Privacy Enhancing Technologies Symposium (To appear - PET 2020) %D 2020 %T Formalizing Privacy Laws for License Generation and Data Repository Decision Automation %A Micah Altman %A Stephen Chong %A Alexandra Wood %X In this paper, we summarize work-in-progress on expert system support to automate some data deposit and release decisions within a data repository, and to generate custom license agreements for those data transfers.Our approach formalizes via a logic programming language the privacy-relevant aspects of laws, regulations, and best practices, supported by legal analysis documented in legal memoranda. This formalization enables automated reasoning about the conditions under which a repository can transfer data, through interrogation of users, and the application of formal rules to the facts obtained from users. The proposed system takes the specific conditions for a given data release and produces a custom data use agreement that accurately captures the relevant restrictions on data use. This enables appropriate decisions and accurate licenses, while removing the bottleneck of lawyer effort per data transfer. The operation of the system aims to be transparent, in the sense that administrators, lawyers, institutional review boards, and other interested parties can evaluate the legal reasoning and interpretation embodied in the formalization, and the specific rationale for a decision to accept or release a particular dataset. %B 20th Privacy Enhancing Technologies Symposium (To appear - PET 2020) %G eng %U https://arxiv.org/abs/1910.10096 %0 Conference Paper %B 52nd ACM Symposium on Theory of Computing (To appear - STOC 2020) %D 2020 %T Better Secret-Sharing via Robust Conditional Disclosure of Secrets %A Benny Applebaum %A Amos Beimel %A Oded Nir %A Naty Peter %X

A secret-sharing scheme allows to distribute a secret s among n parties such that only some predefined

“authorized” sets of parties can reconstruct the secret, and all other “unauthorized” sets learn

nothing about *s*. The collection of authorized sets is called the access structure. For over 30 years, it

was known that any (monotone) collection of authorized sets can be realized by a secret-sharing scheme

whose shares are of size 2^{n-o(n)} and until recently no better scheme was known. In a recent breakthrough,

Liu and Vaikuntanathan (STOC 2018) have reduced the share size to 2^{0:994n+o(n)}, which was

later improved to 2^{0:892n+o(n)} by Applebaum et al. (EUROCRYPT 2019).

In this paper we improve the exponent of general secret-sharing down to 0:637. For the special case

of linear secret-sharing schemes, we get an exponent of 0:762 (compared to 0:942 of Applebaum et al.).

As our main building block, we introduce a new robust variant of conditional disclosure of secrets

(robust CDS) that achieves unconditional security even under limited form of re-usability. We show that

the problem of general secret-sharing reduces to robust CDS with sub-exponential overhead and derive

our main result by implementing robust CDS with a non-trivial exponent. The latter construction follows

by presenting a general immunization procedure that turns standard CDS into a robust CDS.

%B 52nd ACM Symposium on Theory of Computing (To appear - STOC 2020) %G eng %U https://eprint.iacr.org/2020/080 %0 Conference Paper %B Information-Theoretic Cryptography (To appear - ITC 2020) %D 2020 %T The Power of Synergy in Differential Privacy: Combining a Small Curator with Local Randomizers %A Amos Beimel %A Aleksandra Korolova %A Kobbi Nissim %A Or Sheffet %A Uri Stemmer %XMotivated by the desire to bridge the utility gap between local and trusted curator modelsof differential privacy for practical applications, we initiate the theoretical study of a hybridmodel introduced by “Blender” [Avent et al., USENIX Security ’17], in which differentially private protocols of n agents that work in the local-model are assisted by a differentially private curator that has access to the data of m additional users. We focus on the regime where mn and study the new capabilities of this (m;n)-hybrid model. We show that, despite the fact that the hybrid model adds no significant new capabilities for the basic task of simple hypothesistesting, there are many other tasks (under a wide range of parameters) that can be solved in the hybrid model yet cannot be solved either by the curator or by the local-users separately. Moreover, we exhibit additional tasks where at least one round of interaction between the curator and the local-users is necessary – namely, no hybrid model protocol without such interaction can solve these tasks. Taken together, our results show that the combination of the local model with a small curator can become part of a promising toolkit for designing and implementing differential privacy.

%B Information-Theoretic Cryptography (To appear - ITC 2020) %G eng %U https://arxiv.org/abs/1912.08951 %0 Thesis %D 2020 %T Certifiably Accurate Private Data Release with Generative Adversarial Networks %A Michael Fine %XSociety is caught in a vise. The exponential growth in the power and ubiquity of computing devices has enabled the collection and analysis of data at an unprecedented scale. This Cambrian explosion in data collection promises enormous benefits across commercial, scientific, and policy fields. Unfortunately, this collection and analysis of increasingly-personal data has also proved to be a grave threat to individual’s privacy.

%G eng %9 Bachelor of Arts Thesis %0 Conference Paper %B Information-Theoretic Cryptography (To appear - ITC 2020) %D 2020 %T Separating Local & Shuffled Differential Privacy via Histograms %A Victor Balcer %A Albert Cheu %XRecent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users’ hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This impliesan arbitrarily large gap in sample complexity between the shuffled and local models. On theother hand, we show that the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.

%B Information-Theoretic Cryptography (To appear - ITC 2020) %G eng %U https://arxiv.org/abs/1911.06879 %0 Journal Article %J ACM Transactions on Economics and Computation %D 2020 %T Privacy Games %A Yiling Chen %A Or Sheffet %A Salil Vadhan %X **Version History: **

Previously published as: Yiling Chen, Or Sheffet, and Salil Vadhan. Privacy games. In Proceedings of the 10th International Conference on Web and Internet Economics (WINE ‘14), volume 8877 of *Lecture Notes in Computer Science*, pages 371–385. Springer-Verlag, 14–17 December 2014. (WINE Publisher's Version linked here: https://link.springer.com/chapter/10.1007/978-3-319-13129-0_30); PDF attached as WINE2014.

The problem of analyzing the effect of privacy concerns on the behavior of selfish utility-maximizing agents has received much attention lately. Privacy concerns are often modeled by altering the utility functions of agents to consider also their privacy loss. Such privacy aware agents prefer to take a randomized strategy even in very simple games in which non-privacy aware agents play pure strategies. In some cases, the behavior of privacy aware agents follows the framework of Randomized Response, a well-known mechanism that preserves differential privacy.

Our work is aimed at better understanding the behavior of agents in settings where their privacy concerns are explicitly given. We consider a toy setting where agent *A*, in an attempt to discover the secret type of agent *B*, offers *B* a gift that one type of *B* agent likes and the other type dislikes. As opposed to previous works, *B*'s incentive to keep her type a secret isn't the result of "hardwiring" *B*'s utility function to consider privacy, but rather takes the form of a payment between *B* and *A*. We investigate three different types of payment functions and analyze *B*'s behavior in each of the resulting games. As we show, under some payments, *B*'s behavior is very different than the behavior of agents with hardwired privacy concerns and might even be deterministic. Under a different payment we show that *B*'s BNE strategy does fall into the framework of Randomized Response.

Real-world applications routinely make authorization decisions based on dynamic computation. Reasoning about dynamically computed authority is challenging. Integrity of the system might be compromised if attackers can improperly influence the authorizing computation. Confidentiality can also be compromised by authorization, since authorization decisions are often based on sensitive data such as membership lists and passwords. Previous formal models for authorization do not fully address the security implications of permitting trust relationships to change, which limits their ability to reason about authority that derives from dynamic computation. Our goal is a way to construct dynamic authorization mechanisms that do not violate confidentiality or integrity.

We introduce the Flow-Limited Authorization Calculus (FLAC), which is both a simple, expressive model for reasoning about dynamic authorization and also an information flow control language for securely implementing various authorization mechanisms. FLAC combines the insights of two previous models: it extends the Dependency Core Calculus with features made possible by the Flow-Limited Authorization Model. FLAC provides strong end-to-end information security guarantees even for programs that incorporate and implement rich dynamic authorization mechanisms. These guarantees include noninterference and robust declassification, which prevent attackers from influencing information disclosures in unauthorized ways. We prove these security properties formally for all FLAC programs and explore the expressiveness of FLAC with several examples.

%B Journal of Computer Security %G eng %0 Conference Paper %D 2020 %T The OpenDP White Paper %A The OpenDP Team %X **Talks: **

- View a talk on this paper presented at the 2020 OpenDP Community Meeting
- View a talk on this paper presented at TPDP 2020

OpenDP is a community effort to build a trustworthy suite of open-source tools for enabling privacy-protective analysis of sensitive personal data, focused on a library of algorithms for generating differentially private statistical releases. The target use cases for OpenDP are to enable government, industry, and academic institutions to safely and confidently share sensitive data to support scientifically oriented research and exploration in the public interest. We aim for OpenDP to flexibly grow with the rapidly advancing science of differential privacy, and be a pathway to bring the newest algorithmic developments to a wide array of practitioners.

OpenDP is led by Faculty Directors Gary King and Salil Vadhan and an Executive Committee at Harvard University, funded in part by a grant from the Sloan Foundation. Its efforts so far have included implementing a differentially private curator application in collaboration with Microsoft, and developing a framework for a community-driven OpenDP Commons through the work of an Ad Hoc Design Committee including external experts. Going forward, the project plans to engage with a wide community of stakeholders, establish partnerships with a wide variety of groups from industry, academia, and government, and adopt increasing levels of community governance.

%G eng %U https://projects.iq.harvard.edu/files/opendp/files/opendp_white_paper_11may2020.pdf %0 Journal Article %J INSS Cyber Intelligence and Security [Internet] %D 2019 %T Social Change Through Computerized Accessibility of Legal Rules %A Tadjer, Michel %A Bar-Sinai, Michael %A Mor Vilozni %XIn a recent paper, Chan et al. [SODA ’19] proposed a relaxation of the notion of (full) memory obliviousness, which was introduced by Goldreich and Ostrovsky [J. ACM ’96] and extensively researched by cryptographers. The new notion, differential obliviousness, requires that any two neighboring inputs exhibit similar memory access patterns, where the similarity requirement is that of differential privacy.

Chan et al. demonstrated that differential obliviousness allows achieving improved efficiency for several algorithmic tasks, including sorting, merging of sorted lists, and range query data structures.

In this work, we continue the exploration of differential obliviousness, focusing on algorithms that do not necessarily examine all their input. This choice is motivated by the fact that the existence of logarithmic overhead ORAM protocols implies that differential obliviousness can yield at most a logarithmic improvement in efficiency for computations that need to examine all their input. In particular, we explore property testing, where we show that differential obliviousness yields an almost linear improvement in overhead in the dense graph model, and at most quadratic improvement in the bounded degree model.

We also explore tasks where a non-oblivious algorithm would need to explore different portions of the input, where the latter would depend on the input itself, and where we show that such a behavior can be maintained under differential obliviousness, but not under full obliviousness. Our examples suggest that there would be benefits in further exploring which class of computational tasks are amenable to differential obliviousness.

%B Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019 %G eng %U https://drops.dagstuhl.de/opus/frontdoor.php?source_opus=11280 %0 Conference Paper %B 21st International Symposium on Principles and Practice of Declarative Programming (PPDP 2019) %D 2019 %T Relational Symbolic Execution %A Gian Pietro Farina %A Stephen Chong %A Marco Gaboardi %XSymbolic execution is a classical program analysis technique used to show that programs satisfy or violate given specifications. In this work we generalize symbolic execution to support program analysis for relational specifications in the form of relational properties - these are properties about two runs of two programs on related inputs, or about two executions of a single program on related inputs. Relational properties are useful to formalize notions in security and privacy, and to reason about program optimizations. We design a relational symbolic execution engine, named RelSymwhich supports interactive refutation, as well as proving of relational properties for programs written in a language with arrays and for-like loops.

%B 21st International Symposium on Principles and Practice of Declarative Programming (PPDP 2019) %G eng %U https://doi.org/10.1145/3354166.3354175 %0 Journal Article %J Journal of Privacy and Confidentiality %D 2019 %T Differential Privacy on Finite Computers %A Victor Balcer %A Salil Vadhan %X **Version History: **

Also presented at TPDP 2017; preliminary version posted as arXiv:1709.05396 [cs.DS].

2018: Published in Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of *Leibniz International Proceedings in Informatics *(LIPIcs), pp 43:1-43:21. http://drops.dagstuhl.de/opus/frontdoor.php?source_opus=8353

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. As a case study, we examine the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe \(X\). The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Condentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in \(|X|\), which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its "per-bin accuracy" (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of \(\log |X|\), but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and eciently computable representation of a dense histogram; it is based on an \((n + 1)\) - wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.

%B Journal of Privacy and Confidentiality %V 9 %G eng %U https://journalprivacyconfidentiality.org/index.php/jpc/article/view/679 %N 2 %0 Conference Paper %B International Cryptology Conference (CRYPTO 2019) %D 2019 %T The Privacy Blanket of the Shuffle Model %A Borja Balle %A James Bell %A Adria Gascon %A Kobbi Nissim %X This work studies differential privacy in the context of the recently proposed shuffle model. Unlike in the local model, where the server collecting privatized data from users can track back an input to a specific user, in the shuffle model users submit their privatized inputs to a server anonymously. This setup yields a trust model which sits in between the classical curator and local models for differential privacy. The shuffle model is the core idea in the Encode, Shuffle, Analyze (ESA) model introduced by Bittau et al. (SOPS 2017). Recent work by Cheu et al. (EUROCRYPT 2019) analyzes the differential privacy properties of the shuffle model and shows that in some cases shuffled protocols provide strictly better accuracy than local protocols. Additionally, Erlingsson et al. (SODA 2019) provide a privacy amplification bound quantifying the level of curator differential privacy achieved by the shuffle model in terms of the local differential privacy of the randomizer used by each user. In this context, we make three contributions. First, we provide an optimal single message protocol for summation of real numbers in the shuffle model. Our protocol is very simple and has better accuracy and communication than the protocols for this same problem proposed by Cheu et al. Optimality of this protocol follows from our second contribution, a new lower bound for the accuracy of private protocols for summation of real numbers in the shuffle model. The third contribution is a new amplification bound for analyzing the privacy of protocols in the shuffle model in terms of the privacy provided by the corresponding local randomizer. Our amplification bound generalizes the results by Erlingsson et al. to a wider range of parameters, and provides a whole family of methods to analyze privacy amplification in the shuffle model. %B International Cryptology Conference (CRYPTO 2019) %G eng %U https://link.springer.com/chapter/10.1007/978-3-030-26951-7_22 %0 Conference Paper %B Computer Security Foundations %D 2019 %T Information Flow Control for Distributed Trusted Execution Environments %A Anitha Gollamudi %A Owen Arden %A Stephen Chong %X Distributed applications cannot assume that their security policies will be enforced on untrusted hosts. Trusted execution environments (TEEs) combined with cryptographic mechanisms enable execution of known code on an untrusted host and the exchange of confidential and authenticated messages with it. TEEs do not, however, establish the trustworthiness of code executing in a TEE. Thus, developing secure applications using TEEs requires specialized expertise and careful auditing. This paper presents DFLATE, a core security calculus for distributed applications with TEEs. DFLATE offers high-level abstractions that reflect both the guarantees and limitations of the underlying security mechanisms they are based on. The accuracy of these abstractions is exhibited by asymmetry between confidentiality and integrity in our formal results: DFLATE enforces a strong form of noninterference for confidentiality, but only a weak form for integrity. This reflects the asymmetry of the security guarantees of a TEE: a malicious host cannot access secrets in the TEE or modify its contents, but they can suppress or manipulate the sequence of its inputs and outputs. Therefore DFLATE cannot protect against the suppression of high-integrity messages, but when these messages are delivered, their contents cannot have been influenced by an attacker. %B Computer Security Foundations %G eng %U https://ieeexplore.ieee.org/document/8823697 %0 Conference Paper %B Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT 2019) %D 2019 %T Secret-Sharing Schemes for General and Uniform Access Structures %A Benny Applebaum %A Amos Beimel %A Oriol Farr`as %A Oded Nir %A Naty Peter %X A secret-sharing scheme allows some authorized sets of parties to reconstruct a secret; the collection of authorized sets is called the access structure. For over 30 years, it was known that any (monotone) collection of authorized sets can be realized by a secret-sharing scheme whose shares are of size2

A new line of work [6, 9, 15, 2] demonstrates how differential privacy [8] can be used as a mathematical tool for guaranteeing generalization in adaptive data analysis. Specifically, if a differentially private analysis is applied on a sample S of i.i.d. examples to select a lowsensitivity function f , then w.h.p. f (S) is close to its expectation, although f is being chosen based on the data. Very recently, Steinke and Ullman [16] observed that these generalization guarantees can be used for proving concentration bounds in the non-adaptive setting, where the low-sensitivity function is fixed beforehand. In particular, they obtain alternative proofs for classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid’s Inequality. In this work, we set out to examine the situation for functions with high-sensitivity, for which differential privacy does not imply generalization guarantees under adaptive analysis. We show that differential privacy can be used to prove concentration bounds for such functions in the non-adaptive setting.

%B Journal of Privacy and Confidentiality %V 9 %G eng %U https://journalprivacyconfidentiality.org/index.php/jpc/article/view/658 %N 1 %0 Conference Paper %B 32nd Annual Conference on Learning Theory %D 2019 %T Private Center Points and Learning of Halfspaces %A Amos Beimel %A Shay Moran %A Kobbi Nissim %A Uri Stemmer %X We present a private learner for halfspaces over an arbitrary finite domain X⊂ℝWe also provide a lower bound on the sample complexity for privately finding a point in the convex hull. For approximate differential privacy, we show a lower bound of m=Ω(d+log

Is it possible to piece together the confidential data of almost everyone in the US from statistics published by the Census Bureau—without breaching Census security or policy? Could someone—a doctor, a nosy neighbor, or a foreign state actor—determine whether a particular person participated in a genetic study of hundreds of individuals, when each individual contributed only tiny trace amounts of DNA to a highly complex and aggregated genetic mixture? Could police detectives re-trace a suspect’s every movement over the course of many months and thereby learn intimate details about the suspect’s political, religious, and sexual associations—without having to deploy any sort of surveillance or tracking devices? Could someone reliably deduce the sexual preferences of a Facebook user without looking at any content that user has shared?

Until recently, most people probably never imagined that their highly sensitive personal data could be so vulnerable to discovery from seemingly innocuous sources. Many continue to believe that the privacy risks from purely public, statistical, and anonymised data are merely theoretical, and that the practical risks are negligibly small. Yet all of the privacy violations described above are not only theoretically possible—they have already been successfully executed.

The foregoing examples of real-world privacy attacks all leverage one particular vulnerability that we refer to as composition effects. This vulnerability stems from the cumulative erosions of privacy that inhere in every piece of data about people. These erosions occur no matter how aggregated, insignificant, or anonymised the data may seem, and even small erosions can combine in unanticipated ways to create big risks.

Privacy and data protection failures from unanticipated composition effects reflect a type of data myopia—a short-sighted approach toward addressing increasingly-ubiquitous surveillance and privacy risks from Big Data analytics, characterized by a near-total focus on individual data processors and processes and by pervasive underestimation of systemic risks accumulating from independent data products. The failure to recognize accumulation of risk in the information ecosystem reflects a more general societal blind spot to cumulative systemic risks, with parallels in collective failures to foresee or forestall global financial crises, and to adequately address mounting risks to the natural environment.

As the volume and complexity of data uses and publications grow rapidly across a broad range of contexts, the need to develop frameworks for addressing cumulative privacy risks is likely to become an increasingly urgent and widespread problem. Threats to privacy are growing due to the accelerating abundance, and richness, of data about individuals being generated and made publicly available. Furthermore, substantial increases in computing power and algorithmic improvements are making the execution of such attacks more technically feasible. These threats will be impossible to overcome unless regulations are designed to explicitly regulate cumulative risk in a manner that is consistent with the science of composition effects.

The fields of law and computer science incorporate contrasting notions of the privacy risks associated with the analysis and release of statistical data about individuals and groups of individuals. Emerging concepts from the theoretical computer science literature provide formal mathematical models for quantifying and mitigating privacy risks, where the set of risks they take into account is much broader than the privacy risks contemplated by many privacy laws. An example of such a model is differential privacy, which provides a provable guarantee of privacy against a wide range of potential attacks, including types of attacks currently unknown or unforeseen. The subject of much theoretical investigation, new privacy technologies based on formal models have recently been making significant strides towards practical implementation. For these tools to be used with sensitive personal information, it is important to demonstrate that they satisfy relevant legal requirements for privacy protection. However, making such an argument is challenging due to the conceptual gaps between the legal and technical approaches to defining privacy. Notably, information privacy laws are generally subject to interpretation and some degree of flexibility, which creates uncertainty for the implementation of more formal approaches. This Article articulates the gaps between legal and technical approaches to privacy and presents a methodology for rigorously arguing that a technological method for privacy protection satisfies the requirements of a particular law. The proposed methodology has two main components: (i) extraction of a formal mathematical requirement of privacy based on a legal standard found in an information privacy law, and (ii) construction of a rigorous mathematical proof for establishing that a technological privacy solution satisfies the mathematical requirement derived from the law. To handle ambiguities that can lead to different interpretations of a legal standard, the methodology takes a conservative “worst-case” approach and attempts to extract a mathematical requirement that is robust to potential ambiguities. Under this approach, the mathematical proof demonstrates that a technological method satisfies a broad range of reasonable interpretations of a legal standard. The Article demonstrates the application of the proposed methodology with an example bridging between the requirements of the Family Educational Rights and Privacy Act of 1974 and differential privacy.

%7 2
%I Harvard Journal of Law & Technology
%V 31
%P 687-780
%8 2016
%G eng
%U https://jolt.law.harvard.edu/assets/articlePDFs/v31/02.-Article-Wood-7.21.pdf
%0 Conference Paper
%B Theory of Cryptography Conference (TCC 2016)
%D 2018
%T The Complexity of Computing the Optimal Composition of Differential Privacy
%A Jack Murtagh
%A Salil Vadhan
%X In the study of differential privacy, composition theorems (starting with the original paper of Dwork, McSherry, Nissim, and Smith (TCC'06)) bound the degradation of privacy when composing several differentially private algorithms. Kairouz, Oh, and Viswanath (ICML'15) showed how to compute the optimal bound for composing k arbitrary (ϵ,δ)-differentially private algorithms. We characterize the optimal composition for the more general case of k arbitrary (ϵ1,δ1),…,(ϵk,δk)-differentially private algorithms where the privacy parameters may differ for each algorithm in the composition. We show that computing the optimal composition in general is #P-complete. Since computing optimal composition exactly is infeasible (unless FP=#P), we give an approximation algorithm that computes the composition to arbitrary accuracy in polynomial time. The algorithm is a modification of Dyer's dynamic programming approach to approximately counting solutions to knapsack problems (STOC'03).

%B Theory of Cryptography Conference (TCC 2016) %7 8 %I Theory of Computing (2018) %V 14 %P 1-35 %G eng %U http://theoryofcomputing.org/articles/v014a008/ %0 Generic %D 2018 %T Composable and Versatile Privacy via Truncated CDP (Poster) %A Mark Bun %A Cynthia Dwork %A Guy Rothblum %A Thomas Steinke %B ACM Symposium on the Theory of Computing (STOC 2018) %G eng %0 Journal Article %D 2018 %T Computational Two-Party Correlation: A Dichotomy for Key-Agreement Protocols %A Iftach Haitner %A Ronen Shaltiel %A Jad Silbak %A Kobbi Nissim %A Eran Omri %G eng %U https://www.irif.fr/~focs2018/accepted/ %0 Journal Article %J Vanderbilt Journal of Entertainment and Technology Law %D 2018 %T Differential Privacy: A Primer for a Non-technical Audience %A Kobbi Nissim %A Thomas Steinke %A Alexandra Wood %A Micah Altman %A Aaron Bembenek %A Mark Bun %A Marco Gaboardi %A O'Brien, David %A Salil Vadhan %XThis document is a primer on differential privacy, which is a formal mathematical framework for guaranteeing privacy protection when analyzing or releasing statistical data. Recently emerging from the theoretical computer science literature, differential privacy is now in initial stages of implementation and use in various academic, industry, and government settings. Using intuitive illustrations and limited mathematical formalism, this document provides an introduction to differential privacy for non-technical practitioners, who are increasingly tasked with making decisions with respect to differential privacy as it grows more widespread in use. In particular, the examples in this document illustrate ways in which social scientists can conceptualize the guarantees provided by differential privacy with respect to the decisions they make when managing personal data about research subjects and informing them about the privacy protection they will be afforded.

%B Vanderbilt Journal of Entertainment and Technology Law %I a product of the "Bridging Privacy Definitions" working group, part of the Privacy Tools for Sharing Research Data project at Harvard University %C Cambridge, MA %V 21 %P 209-276 %8 2018 %G eng %N 1 %0 Journal Article %J 9th Innovations in Theoretical Computer Science Conference (ITCS 2018) %D 2018 %T Finite Sample Differentially Private Confidence Intervals %A Vishesh Karwa %A Salil Vadhan %X We study the problem of estimating finite sample confidence intervals of the mean of a normal population under the constraint of differential privacy. We consider both the known and unknown variance cases and construct differentially private algorithms to estimate confidence intervals. Crucially, our algorithms guarantee a finite sample coverage, as opposed to an asymptotic coverage. Unlike most previous differentially private algorithms, we do not require the domain of the samples to be bounded. We also prove lower bounds on the expected size of any differentially private confidence set showing that our the parameters are optimal up to polylogarithmic factors. %B 9th Innovations in Theoretical Computer Science Conference (ITCS 2018) %G eng %U https://arxiv.org/abs/1711.03908 %0 Journal Article %J IEEE Security & Privacy %D 2018 %T A Harm-Reduction Framework for Algorithmic Accountability over Personal Information %A Micah Altman %A Alexandra Wood %A Effy Vayena %B IEEE Security & Privacy %V 16 %P 34-45 %G eng %U https://ieeexplore.ieee.org/document/8395114/ %N 3 %0 Journal Article %J Philosophical Transaction of the Royal Society A %D 2018 %T Is Privacy

The transparency goals of the open data movement serve important social, economic, and democratic functions in cities like Seattle. At the same time, some municipal datasets about the city and its citizens’ activities carry inherent risks to individual privacy when shared publicly. In 2016, the City of Seattle declared in its Open Data Policy that the city’s data would be “open by preference,” except when doing so may affect individual privacy. To ensure its Open Data program effectively protects individuals, Seattle committed to performing an annual risk assessment and tasked the Future of Privacy Forum (FPF) with creating and deploying an initial privacy risk assessment methodology for open data.

This Draft Report provides tools and guidance to the City of Seattle and other municipalities navigating the complex policy, operational, technical, organizational, and ethical standards that support privacyprotective open data programs. Although there is a growing body of research on open data privacy, open data managers and departmental data owners need to be able to employ a standardized methodology for assessing the privacy risks and benefits of particular datasets internally, without a bevy of expert statisticians, privacy lawyers, or philosophers. By following a flexible, risk-based assessment process, the City of Seattle – and other municipal open data programs – can maximize the utility and openness of civic data while minimizing privacy risks to individuals and community concerns about ethical challenges, fairness, and equity.

This Draft Report first describes inherent privacy risks in an open data landscape, with an emphasis on potential harms related to re-identification, data quality, and fairness. Accompanying this, the Draft Report includes a Model Open Data Benefit Risk Analysis (MODBRA). The model template evaluates the types of data contained in a proposed open dataset, the potential benefits – and concomitant risks – of releasing the dataset publicly, and strategies for effective de-identification and risk mitigation. This holistic assessment guides city officials to determine whether to release the dataset openly, in a limited access environment, or to withhold it from publication (absent countervailing public policy considerations). The Draft Report methodology builds on extensive work done in this field by experts at the National Institute of Standards and Technology, the University of Washington, the Berkman Klein Center for Internet & Society at Harvard University, and others, and adapts existing frameworks to the unique challenges faced by cities as local governments, technological system integrators, and consumer facing service providers.

%B Comments to the Future of Privacy Forum %G eng %U https://fpf.org/wp-content/uploads/2018/01/Wood-Altman-Baleato-Vadhan_Comments-on-FPF-Seattle-Open-Data-Draft-Report.pdf %0 Journal Article %J Symposium on the Principle of Programming Languages, ACM %D 2017 %T Relational Cost Analysis %A Ezgi Cicek %A Gilles Barthe %A Marco Gaboardi %A Deepak Garg %A Jan Hoffmann %X Establishing quantitative bounds on the execution cost of programs is essential in many areas of computer science such as complexity analysis, compiler optimizations, security and privacy. Techniques based on program analysis, type systems and abstract interpretation are well-studied, but methods for analyzing how the execution costs of two programs compare to each other have not received attention. Naively combining the worst and best case execution costs of the two programs does not work well in many cases because such analysis forgets the similarities between the programs or the inputs. In this work, we propose a relational cost analysis technique that is capable of establishing precise bounds on the difference in the execution cost of two programs by making use of relational properties of programs and inputs. We develop RelCost, a refinement type and effect system for a higher-order functional language with recursion and subtyping. The key novelty of our technique is the combination of relational refinements with two modes of typing—relational typing for reasoning about similar computations/inputs and unary typing for reasoning about unrelated computations/inputs. This combination allows us to analyze the execution cost difference of two programs more precisely than a naive non-relational approach. We prove our type system sound using a semantic model based on step-indexed unary and binary logical relations accounting for non-relational and relational reasoning principles with their respective costs. We demonstrate the precision and generality of our technique through examples. %B Symposium on the Principle of Programming Languages, ACM %8 Jan 2017 %G eng %0 Journal Article %J Symposium on the Principle of Programming Languages, ACM %D 2017 %T A Semantic Account of Metric Preservation %A Arthur Azevedo de Amorim %A Marco Gaboardi %A Justin Hsu %A Shin-ya Katsumata %A Ikram Cherigui %X Program sensitivity measures how robust a program is to small changes in its input, and is a fundamental notion in domains ranging from differential privacy to cyber-physical systems. A natural way to formalize program sensitivity is in terms of metrics on the input and output spaces, requiring that an r-sensitive function map inputs that are at distance d to outputs that are at distance at most r⋅d. Program sensitivity is thus an analogue of Lipschitz continuity for programs. Reed and Pierce introduced Fuzz, a functional language with a linear type system that can express program sensitivity. They show soundness operationally, in the form of a metric preservation property. Inspired by their work, we study program sensitivity and metric preservation from a denotational point of view. In particular, we introduce metric CPOs, a novel semantic structure for reasoning about computation on metric spaces, by endowing CPOs with a compatible notion of distance. This structure is useful for reasoning about metric properties of programs, and specifically about program sensitivity. We demonstrate metric CPOs by giving a model for the deterministic fragment of Fuzz. %B Symposium on the Principle of Programming Languages, ACM %8 Jan 2017 %G eng %U https://arxiv.org/abs/1702.00374 %0 Journal Article %J Journal of Privacy and Confidentiality %D 2017 %T Between Pure and Approximate Differential Privacy %A Thomas Steinke %A Jonathan Ullman %X We show a new lower bound on the sample complexity of (ε, δ)-differentially private algorithms that accurately answer statistical queries on high-dimensional databases. The novelty of our bound is that it depends optimally on the parameter δ, which loosely corresponds to the probability that the algorithm fails to be private, and is the first to smoothly interpolate between approximate differential privacy (δ > 0) and pure differential privacy (δ = 0). %B Journal of Privacy and Confidentiality %G eng %0 Book Section %B Tutorials on the Foundations of Cryptography %D 2017 %T The Complexity of Differential Privacy %A Salil Vadhan %X**Version History: **

August 2016: Manuscript v1 (see files attached)

March 2017: Manuscript v2 (see files attached); Errata

April 2017: Published Version (in *Tutorials on the Foundations of Cryptography*; see above)

Differential privacy is a theoretical framework for ensuring the privacy of individual-level data when performing statistical analysis of privacy-sensitive datasets. This tutorial provides an introduction to and overview of differential privacy, with the goal of conveying its deep connections to a variety of other topics in computational complexity, cryptography, and theoretical computer science at large. This tutorial is written in celebration of Oded Goldreich’s 60th birthday, starting from notes taken during a minicourse given by the author and Kunal Talwar at the 26th McGill Invitational Workshop on Computational Complexity [1].

%B Tutorials on the Foundations of Cryptography %I Springer, Yehuda Lindell, ed. %P 347-450 %G eng %U https://link.springer.com/chapter/10.1007/978-3-319-57048-8_7 %0 Journal Article %J Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016 %D 2017 %T The DataTags System: Sharing Sensitive Data with Confidence %A Merce Crosas %B Research Data Alliance (RDA) 8th Plenary on Privacy Implications of Research Data Sets, during International Data Week 2016 %G eng %0 Conference Paper %B Fairness and Transparency in Machine Learning Conference (FATML) %D 2017 %T Decoupled Classifiers for Fair and Efficient Machine Learning %A Cynthia Dwork %A Nicole Immorlica %A Adam Kalai %A Max Leiserson %XWhen it is ethical and legal to use a sensitive attribute (such as gender or race) in machine learning systems, the question remains how to do so. We show that the naive application of machine learning algorithms using sensitive features leads to an inherent tradeoff in accuracy between groups. We provide a simple and efficient decoupling technique, that can be added on top of any black-box machine learning algorithm, to learn different classifiers for different groups. Transfer learning is used to mitigate the problem of having too little data on any one group.

The method can apply to a range of fairness criteria. In particular, we require the application designer to specify as joint loss function that makes explicit the trade-off between fairness and accuracy. Our reduction is shown to efficiently find the minimum loss as long as the objective has a certain natural monotonicity property which may be of independent interest in the study of fairness in algorithms.

We consider the problem of answering queries about a sensitive dataset subject to differential privacy. The queries may be chosen adversarially from a larger set Q of allowable queries in one of three ways, which we list in order from easiest to hardest to answer:

• Offline: The queries are chosen all at once and the differentially private mechanism answers the queries in a single batch.

• Online: The queries are chosen all at once, but the mechanism only receives the queries in a streaming fashion and must answer each query before seeing the next query.

• Adaptive: The queries are chosen one at a time and the mechanism must answer each query before the next query is chosen. In particular, each query may depend on the answers given to previous queries.

Many differentially private mechanisms are just as efficient in the adaptive model as they are in the offline model. Meanwhile, most lower bounds for differential privacy hold in the offline setting. This suggests that the three models may be equivalent. We prove that these models are all, in fact, distinct. Specifically, we show that there is a family of statistical queries such that exponentially more queries from this family can be answered in the offline model than in the online model. We also exhibit a family of search queries such that exponentially more queries from this family can be answered in the online model than in the adaptive model. We also investigate whether such separations might hold for simple queries like threshold queries over the real line.

%B Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) %G eng %U https://arxiv.org/abs/1604.04618 %0 Journal Article %J Proceedings of The 30th Conference on Learning Theory Conference (COLT 2017) %D 2017 %T The Price of Selection in Differential Privacy %A Mitali Bafna %A Jonathan Ullman %B Proceedings of The 30th Conference on Learning Theory Conference (COLT 2017) %G eng %U https://arxiv.org/abs/1702.02970 %0 Journal Article %J in the ACM SIGMOD/PODS Conference (PODS 2017) %D 2017 %T Private Incremental Regression %A Kasiviswanathan, Shiva %A Kobbi Nissim %A Hongxia Jin %B in the ACM SIGMOD/PODS Conference (PODS 2017) %G eng %0 Conference Paper %B European Social Policy Analysis network (ESPAnet). Israel %D 2017 %T Public Policy Modeling using the DataTags Toolset %A Bar-Sinai, Michael %A Rotem Medzini %X We apply Tags, a framework for modeling data handling policies, to a welfare policy. The generated model is useful for assessing entitlements of specific cases, and for gaining insights into the modeled policy as a whole. %B European Social Policy Analysis network (ESPAnet). Israel %G eng %0 Journal Article %J in Technology Science %D 2017 %T Reidentification Risks in HIPAA Safe Harbor Data: A study of data from one environmental health study %A Sweeney, Latanya %A Ji Su Yoo %A Laura Perovich %A Katherine E Boronow %A Brown, Phil %A Julia Green Brody %B in Technology Science %G eng %0 Journal Article %J in the IEEE Secure Development Conference (IEEE 2017) %D 2017 %T Securing Dataverse with an Adapted Command Design Pattern %A Gustavo Durand %A Bar-Sinai, Michael %A Merce Crosas %B in the IEEE Secure Development Conference (IEEE 2017) %G eng %0 Journal Article %J in the 58th Annual Symposium on Foundations of Computer Science (FOCS 2017) %D 2017 %T Tight Lower Bounds for Differentially Private Selection %A Thomas Steinke %A Jonathan Ullman %B in the 58th Annual Symposium on Foundations of Computer Science (FOCS 2017) %G eng %U https://arxiv.org/abs/1704.03024 %0 Journal Article %J in Technology Science %D 2017 %T Voter Identity Theft: Submitting Changes to Voter Registrations Online to Disrupt Elections %A Sweeney, Latanya %A Ji Su Yoo %A Jinyan Zang %B in Technology Science %G eng %0 Journal Article %J in the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) %D 2017 %T What's Fair? %A Cynthia Dwork %B in the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017) %G eng %0 Journal Article %J Harvard Law Review %D 2016 %T Recoding Privacy Law: Reflections on the Future Relationship Among Law, Technology, and Privacy %A Gasser, Urs %B Harvard Law Review %8 12 Dec, 2016 %G eng %U http://harvardlawreview.org/2016/12/recoding-privacy-law-reflections-on-the-future-relationship-among-law-technology-and-privacy/ %0 Journal Article %J Conference on Internet and Economics, WINE %D 2016 %T Computer-Aided Verification in Mechanism Design %A G. Barthe %A M. Gaboardi %A E. J. Gallego Arias %A J. Hsu %A A. Roth %A P.-Y. Strub %B Conference on Internet and Economics, WINE %8 Dec 2016 %G eng %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security, CCS %D 2016 %T Advanced Probabilistic Couplings for Differential Privacy %A G. Barthe %A N. Fong %A M. Gaboardi %A B. Gregoire %A J. Hsu %A P.-Y. Strub %B 23rd ACM Conference on Computer and Communications Security, CCS %8 Oct 2016 %G eng %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security %D 2016 %T Generic Attacks on Secure Outsourced Databases %A Georgios Kellaris %A George Kollios %A Kobbi Nissim %A Adam O'Neill %XRecently, various protocols have been proposed for securely outsourcing database storage to a third party server, ranging from systems with “full-fledged” security based on strong cryptographic primitives such as fully homomorphic encryption or oblivious RAM, to more practical implementations based on searchable symmetric encryption or even on deterministic and order-preserving encryption. On the flip side, various attacks have emerged that show that for some of these protocols confidentiality of the data can be compromised, usually given certain auxiliary information. We take a step back and identify a need for a formal understanding of the inherent efficiency/privacy trade-off in outsourced database systems, independent of the details of the system. We propose abstract models that capture secure outsourced storage systems in sufficient generality, and identify two basic sources of leakage, namely access pattern and communication volume. We use our models to distinguish certain classes of outsourced database systems that have been proposed, and deduce that all of them exhibit at least one of these leakage sources. We then develop generic reconstruction attacks on any system supporting range queries where either access pattern or communication volume is leaked. These attacks are in a rather weak passive adversarial model, where the untrusted server knows only the underlying query distribution. In particular, to perform our attack the server need not have any prior knowledge about the data, and need not know any of the issued queries nor their results. Yet, the server can reconstruct the secret attribute of every record in the database after about N 4 queries, where N is the domain size. We provide a matching lower bound showing that our attacks are essentially optimal. Our reconstruction attacks using communication volume apply even to systems based on homomorphic encryption or oblivious RAM in the natural way. Finally, we provide experimental results demonstrating the efficacy of our attacks on real datasets with a variety of different features. On all these datasets, after the required number of queries our attacks successfully recovered the secret attributes of every record in at most a few seconds.

%B 23rd ACM Conference on Computer and Communications Security %8 Oct 2016 %G eng %0 Conference Paper %B Security and Privacy Workshops (SPW), 2016 IEEE %D 2016 %T DataTags, Data Handling Policy Spaces and the Tags Language %A Bar-Sinai, Michael %A Sweeney, Latanya %A Merce Crosas %XWidespread sharing of scientific datasets holds great promise for new scientific discoveries and great risks for personal privacy. Dataset handling policies play the critical role of balancing privacy risks and scientific value. We propose an extensible, formal, theoretical model for dataset handling policies. We define binary operators for policy composition and for comparing policy strictness, such that propositions like "this policy is stricter than that policy" can be formally phrased. Using this model, The policies are described in a machine-executable and human-readable way. We further present the Tags programming language and toolset, created especially for working with the proposed model. Tags allows composing interactive, friendly questionnaires which, when given a dataset, can suggest a data handling policy that follows legal and technical guidelines. Currently, creating such a policy is a manual process requiring access to legal and technical experts, which are not always available. We present some of Tags' tools, such as interview systems, visualizers, development environment, and questionnaire inspectors. Finally, we discuss methodologies for questionnaire development. Data for this paper include a questionnaire for suggesting a HIPAA compliant data handling policy, and formal description of the set of data tags proposed by the authors in a recent paper.

%B Security and Privacy Workshops (SPW), 2016 IEEE %I IEEE %C San Jose, California %8 22-26 May %G eng %U http://ieeexplore.ieee.org/document/7527746/?reload=true&arnumber=7527746 %0 Conference Paper %B Theory and Practice of Differential Privacy %D 2016 %T PSI (Ψ): a Private data Sharing Interface %A Marco Gaboardi %A James Honaker %A Gary King %A Kobbi Nissim %A Jonathan Ullman %A Salil Vadhan %A Jack Murtagh %XWe provide an overview of PSI (“a Private data Sharing Interface”), a system we are devel- oping to enable researchers in the social sciences and other fields to share and explore privacy- sensitive datasets with the strong privacy protections of differential privacy.

Poster presented at Theory and Practice of Differential Privacy (TPDP 2016).

%B Theory and Practice of Differential Privacy %C New York, NY %8 2016 %G eng %U https://arxiv.org/abs/1609.04340 %0 Journal Article %J ACM Transactions on Economics and Computation (TEAC) %D 2016 %T Truthful Mechanisms for Agents that Value Privacy %A Yiling Chen %A Stephen Chong %A Ian Kash %A Tal Moran %A Salil Vadhan %XRecent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from truthfulness; it is not incorporated in players’ utility functions (and doing so has been shown to lead to nontruthfulness in some cases). In this work, we propose a new, general way of modeling privacy in players’ utility functions. Specifically, we only assume that if an outcome *o* has the property that any report of player *i* would have led to *o* with approximately the same probability, then *o* has a small privacy cost to player *i*. We give three mechanisms that are truthful with respect to our modeling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number *n* of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of *n*).

Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood, in theory. And in practice, avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators for generalized linear models, Cox proportional hazards, M-estimators, in practice, and perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.

%B Annals of Statistics %G eng %U https://arxiv.org/abs/1408.2923 %0 Journal Article %J NIPS 2015 Workshop on Learning and Privacy with Incomplete Data and Weak Supervision %D 2016 %T Private Posterior Distributions from Variational Approximations %A Vishesh Karwa %A Dan Kifer %A Aleksandra Slavkovic %XPrivacy preserving mechanisms such as differential privacy inject additional randomness in the form of noise in the data, beyond the sampling mechanism. Ignoring this additional noise can lead to inaccurate and invalid inferences. In this paper, we incorporate the privacy mechanism explicitly into the likelihood function by treating the original data as missing, with an end goal of estimating posterior distributions over model parameters. This leads to a principled way of performing valid statistical inference using private data, however, the corresponding likelihoods are intractable. In this paper, we derive fast and accurate variational approximations to tackle such intractable likelihoods that arise due to privacy. We focus on estimating posterior distributions of parameters of the naive Bayes log-linear model, where the sufficient statistics of this model are shared using a differentially private interface. Using a simulation study, we show that the posterior approximations outperform the naive method of ignoring the noise addition mechanism.

%B NIPS 2015 Workshop on Learning and Privacy with Incomplete Data and Weak Supervision %G eng %U https://arxiv.org/abs/1511.07896 %0 Journal Article %J The Ethics of Biomedical Big Data %D 2016 %T Strictly Biomedical? Sketching the Ethics of the Big Data Ecosystem in Biomedicine %A Effy Vayena %A Gasser, Urs %XIn today’s ever evolving data ecosystem it is evident that data generated for a wide range of purposes unrelated to biomedicine possess tremendous potential value for biomedical research. Analyses of our Google searches, social media content, loyalty card points and the like are used to draw a fairly accurate picture of our health, our future health, our attitudes towards vaccination, disease outbreaks within a county and epidemic trajectories in other continents. These data sets are different from traditional biomedical data, if a biomedical purpose is the categorical variable. Yet the results their analyses yield are of serious biomedical relevance. This paper discusses important but unresolved challenges within typical biomedical data, and it explores examples of non-biomedical Big Data with high biomedical value, including the specific conundrums these engender, especially when we apply biomedical data concepts to them. It also highlights the “digital phenotype” project, illustrating the Big Data ecosystem in action and an approach believed as likely to yield biomedical and health knowledge. We argue that to address the challenges and make full use of the opportunities that Big Data offers to biomedicine, a new ethical framework taking a data ecosystem approach is urgently needed. We conclude by discussing key components, design requirements and substantive normative elements of such a framework.

%B The Ethics of Biomedical Big Data %G eng %U http://link.springer.com/chapter/10.1007/978-3-319-33525-4_2 %0 Journal Article %J Conference on Learning Theory (COLT) %D 2016 %T Adaptive Learning with Robust Generalization Guarantees %A Rachel Cummings %A Katrina Ligett %A Kobbi Nissim %A Aaron Roth %A Zhiwei Steven Wu %XThe traditional notion of generalization---i.e., learning a hypothesis whose empirical error is close to its true error---is surprisingly brittle. As has recently been noted in [DFH+15b], even if several algorithms have this guarantee in isolation, the guarantee need not hold if the algorithms are composed adaptively. In this paper, we study three notions of generalization---increasing in strength---that are robust to postprocessing and amenable to adaptive composition, and examine the relationships between them. We call the weakest such notion Robust Generalization. A second, intermediate, notion is the stability guarantee known as differential privacy. The strongest guarantee we consider we call Perfect Generalization. We prove that every hypothesis class that is PAC learnable is also PAC learnable in a robustly generalizing fashion, with almost the same sample complexity. It was previously known that differentially private algorithms satisfy robust generalization. In this paper, we show that robust generalization is a strictly weaker concept, and that there is a learning task that can be carried out subject to robust generalization guarantees, yet cannot be carried out subject to differential privacy. We also show that perfect generalization is a strictly stronger guarantee than differential privacy, but that, nevertheless, many learning tasks can be carried out subject to the guarantees of perfect generalization.

%B Conference on Learning Theory (COLT) %G eng %U https://arxiv.org/abs/1602.07726 %0 Journal Article %J 48th Annual Symposium on the Theory of Computing %D 2016 %T Algorithmic Stability for Adaptive Data Analysis %A Raef Bassily %A Kobbi Nissim %A Smith, Adam %A Thomas Steinke %A Uri Stemmer %A Jonathan Ullman %XAdaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis. Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen queries about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy? In this work we make two new contributions: (i) We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015, NIPS, 2015). (ii) We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries. As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that stability implies low generalization error. We also study weaker stability guarantees such as bounded KL divergence and total variation distance.

%B 48th Annual Symposium on the Theory of Computing %G eng %U https://arxiv.org/abs/1511.02513v1 %0 Journal Article %J PLoS Med %D 2016 %T Between Openness and Privacy in Genomics. %A Effy Vayena %A Gasser, Urs %X- Advancing genomic research depends on the accessing and sharing of genomic data. However, the increasing need for sharing escalates the tension between genomic privacy and openness.
- Promoting openness while protecting privacy is a challenge that cannot be overcome only with technical solutions such as encryption and differential privacy. Although such solutions are crucial, we still need to confront some fundamental normative tensions that are intensified in the era of genomics and big data. Here are at least three:
- The right to genomic privacy is not an absolute right. If privacy is understood as control over information or data, privacy is not about maximal levels of control, but rather about reasonable measures of control.
- Although individual control is necessary, it is not a sufficient safeguard of privacy. Individuals’ willingness to be open about their data does not obviate responsibility for reducing privacy risks on the part of the data users.
- Data governance models, such as data cooperatives, that enable meaningful and continuous roles of the individuals whose data are at stake hold promise for reconciling privacy and openness.

"Concentrated differential privacy" was recently introduced by Dwork and Rothblum as a relaxation of differential privacy, which permits sharper analyses of many privacy-preserving computations. We present an alternative formulation of the concept of concentrated differential privacy in terms of the Renyi divergence between the distributions obtained by running an algorithm on neighboring inputs. With this reformulation in hand, we prove sharper quantitative results, establish lower bounds, and raise a few new questions. We also unify this approach with approximate differential privacy by giving an appropriate definition of "approximate concentrated differential privacy."

%B 14th Theory of Cryptography Conference %G eng %U https://arxiv.org/abs/1605.02065 %0 Journal Article %J 23rd ACM Conference on Computer and Communications Security, CCS %D 2016 %T Differentially Private Bayesian Programming %A G. Barthe %A P.Y. Strub %A J. Hsu %A A. D. Gordon %A E. J. Gallego Arias %A M. Gaboardi %A G. P. Farina %B 23rd ACM Conference on Computer and Communications Security, CCS %G eng %0 Journal Article %J Proceedings of The 33rd International Conference on Machine Learning, PMLR %D 2016 %T Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing %A Marco Gaboardi %A Hyun woo Lim %A Ryan Rogers %A Salil Vadhan %X Hypothesis testing is a useful statistical tool in determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information. Thus it is important to design statistical tests that guarantee the privacy of subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and independence between two categorical variables.

We propose new tests for goodness of fit and independence testing that like the classical versions can be used to determine whether a given model should be rejected or not, and that additionally can ensure differential privacy. We give both Monte Carlo based hypothesis tests as well as hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, our tests account for the distribution of the noise that is injected to ensure privacy in determining significance.

We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We empirically show that to achieve the same level of power as the classical non-private tests our new tests need only a relatively modest increase in sample size.

merging large-scale data sources hold tremendous potential for new scientific research into human biology, behaviors, and relationships. At the same time, big data research presents privacy and ethical challenges that the current regulatory framework is ill-suited to address. In light of the immense value of large-scale research data, the central question moving forward is not whether such data should be made available for research, but rather how the benefits can be captured in a way that respects fundamental principles of ethics and privacy.

In response, this Essay outlines elements of a new ethical framework for big data research. It argues that oversight should aim to provide universal coverage of human subjects research, regardless of funding source, across all stages of the information lifecycle. New definitions and standards should be developed based on a modern understanding of privacy science and the expectations of research subjects. In addition, researchers and review boards should be encouraged to incorporate systematic risk-benefit assessments and new procedural and technological solutions from the wide range of interventions that are available. Finally, oversight mechanisms and the safeguards implemented should be tailored to the intended uses, benefits, threats, harms, and vulnerabilities associated with a specific research activity.

Development of a new ethical framework with these elements should be the product of a dynamic multistakeholder process that is designed to capture the latest scientific understanding of privacy, analytical methods, available safeguards, community and social norms, and best practices for research ethics as they evolve over time. Such a framework would support big data utilization and help harness the value of big data in a sustainable and trust-building manner.

%B Washington and Lee Law Review %V 72 %8 31 Mar, 2016 %G eng %U http://lawreview.journals.wlu.io/elements-of-a-new-ethical-framework-for-big-data-research/ %N 3 %0 Journal Article %J Annals of Statistics %D 2016 %T Inference Using Noisy Degrees: Differentially Private Beta-Model and Synthetic Graphs %A Vishesh Karwa %A Aleksandra Slavković %XThe β-model of random graphs is an exponential family model with the degree sequence as a sufficient statistic. In this paper, we contribute three key results. First, we characterize conditions that lead to a quadratic time algorithm to check for the existence of MLE of the β-model, and show that the MLE never exists for the degree partition β-model. Second, motivated by privacy problems with network data, we derive a differentially private estimator of the parameters of β-model, and show it is consistent and asymptotically normally distributed - it achieves the same rate of convergence as the nonprivate estimator. We present an efficient algorithm for the private estimator that can be used to release synthetic graphs. Our techniques can also be used to release degree distributions and degree partitions accurately and privately, and to perform inference from noisy degrees arising from contexts other than privacy. We evaluate the proposed estimator on real graphs and compare it with a current algorithm for releasing degree distributions and find that it does significantly better. Finally, our paper addresses shortcomings of current approaches to a fundamental problem of how to perform valid statistical inference from data released by privacy mechanisms, and lays a foundational groundwork on how to achieve optimal and private statistical inference in a principled manner by modeling the privacy mechanism; these principles should be applicable to a class of models beyond the β-model.

%B Annals of Statistics %V 44 %P 87-112 %G eng %N 1 %0 Journal Article %J Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS) %D 2016 %T Lipschitz Extensions for NodePrivate Graph Statistics and the Generalized Exponential Mechanism %A Raskhodnikova, Sofya %A Smith, Adam %B Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS) %G eng %0 Conference Paper %B PODS 2016 %D 2016 %T Locating a Small Cluster Privately %A Kobbi Nissim %A Uri Stemmer %A Salil Vadhan %XWe present a new algorithm for locating a small cluster of points with differential privacy [Dwork, McSherry, Nissim,and Smith, 2006]. Our algorithm has implications to private data exploration, clustering, and removal of outliers. Furthermore, we use it to significantly relax the requirements of the sample and aggregate technique [Nissim, Raskhodnikova,and Smith, 2007], which allows compiling of "off the shelf" (non-private) analyses into analyses that preserve differential privacy.

%B PODS 2016 %C ACM SIGMOD/PODS Conference, San Francisco, USA, 2016 %8 June 26 2016 %G eng %0 Conference Paper %B Proceedings of the 12th Theory of Cryptography Conference (TCC 2016) %D 2016 %T Order revealing encryption and the hardness of private learning %A Mark Bun %A Mark Zhandry %XAn order-revealing encryption scheme gives a public procedure by which two ciphertexts can be compared to reveal the ordering of their underlying plaintexts. We show how to use order-revealing encryption to separate computationally efficient PAC learning from efficient (ϵ,δ)-differentially private PAC learning. That is, we construct a concept class that is efficiently PAC learnable, but for which every efficient learner fails to be differentially private. This answers a question of Kasiviswanathan et al. (FOCS '08, SIAM J. Comput. '11).

To prove our result, we give a generic transformation from an order-revealing encryption scheme into one with strongly correct comparison, which enables the consistent comparison of ciphertexts that are not obtained as the valid encryption of any message. We believe this construction may be of independent interest.

Increasingly, governments and businesses are collecting, analyzing, and sharing detailed information about individuals over long periods of time. Vast quantities of data from new sources and novel methods for large-scale data analysis promise to yield deeper understanding of human characteristics, behavior, and relationships and advance the state of science, public policy, and innovation. At the same time, the collection and use of fine-grained personal data over time is associated with significant risks to individuals, groups, and society at large. In this article, we examine a range of longterm data collections, conducted by researchers in social science, in order to identify the characteristics of these programs that drive their unique sets of risks and benefits. We also examine the practices that have been established by social scientists to protect the privacy of data subjects in light of the challenges presented in long-term studies. We argue that many uses of big data, across academic, government, and industry settings, have characteristics similar to those of traditional long-term research studies. In this article, we discuss the lessons that can be learned from longstanding data management practices in research and potentially applied in the context of newly emerging data sources and uses.

%B Brussels Privacy Symposium %C Brussels, Belgium %8 2016 %G eng %0 Journal Article %J Neural Information Processing Systems (NIPS) %D 2016 %T Privacy Odometers and Filters: Pay-as-you-Go Composition %A Ryan Rogers %A Aaron Roth %A Jonathan Ullman %A Salil Vadhan %XIn this paper we initiate the study of adaptive composition in differential privacy when the length of the composition, and the privacy parameters themselves can be chosen adaptively, as a function of the outcome of previously run analyses. This case is much more delicate than the setting covered by existing composition theorems, in which the algorithms themselves can be chosen adaptively, but the privacy parameters must be fixed up front. Indeed, it isn't even clear how to define differential privacy in the adaptive parameter setting. We proceed by defining two objects which cover the two main use cases of composition theorems. A privacy filter is a stopping time rule that allows an analyst to halt a computation before his pre-specified privacy budget is exceeded. A privacy odometer allows the analyst to track realized privacy loss as he goes, without needing to pre-specify a privacy budget. We show that unlike the case in which privacy parameters are fixed, in the adaptive parameter setting, these two use cases are distinct. We show that there exist privacy filters with bounds comparable (up to constants) with existing privacy composition theorems. We also give a privacy odometer that nearly matches non-adaptive private composition theorems, but is sometimes worse by a small asymptotic factor. Moreover, we show that this is inherent, and that any valid privacy odometer in the adaptive parameter setting must lose this factor, which shows a formal separation between the filter and odometer use-cases.

%B Neural Information Processing Systems (NIPS) %8 2016 %G eng %0 Journal Article %J 43rd International Colloquium on Automata, Languages, and Programming %D 2016 %T Sensitivity of Counting Queries %A Myrto Arapinis %A Diego Figueira %A Marco Gaboardi %XIn the context of statistical databases, the release of accurate statistical information about the collected data often puts at risk the privacy of the individual contributors. The goal of differential privacy is to maximize the utility of a query while protecting the individual records in the database. A natural way to achieve differential privacy is to add statistical noise to the result of the query. In this context, a mechanism for releasing statistical information is thus a trade-off between utility and privacy. In order to balance these two "conflicting" requirements, privacy preserving mechanisms calibrate the added noise to the so-called sensitivity of the query, and thus a precise estimate of the sensitivity of the query is necessary to determine the amplitude of the noise to be added. In this paper, we initiate a systematic study of sensitivity of counting queries over relational databases. We first observe that the sensitivity of a Relational Algebra query with counting is not computable in general, and that while the sensitivity of Conjunctive Queries with counting is computable, it becomes unbounded as soon as the query includes a join. We then consider restricted classes of databases (databases with constraints), and study the problem of computing the sensitivity of a query given such constraints. We are able to establish bounds on the sensitivity of counting conjunctive queries over constrained databases. The kind of constraints studied here are: functional dependencies and cardinality dependencies. The latter is a natural generalization of functional dependencies that allows us to provide tight bounds on the sensitivity of counting conjunctive queries.

%B 43rd International Colloquium on Automata, Languages, and Programming %G eng %0 Journal Article %J Proceedings of the 14th Theory of Cryptography Conference (TCC 2016-B) %D 2016 %T Separating Computational and Statistical Differential Privacy in the Client-Server Model %A Mark Bun %A Yi Hsiu Chen %A Salil Vadhan %XDifferential privacy is a mathematical definition of privacy for statistical data analysis. It guarantees that any (possibly adversarial) data analyst is unable to learn too much information that is specific to an individual. Mironov et al. (CRYPTO 2009) proposed several computational relaxations of differential privacy (CDP), which relax this guarantee to hold only against computationally bounded adversaries. Their work and subsequent work showed that CDP can yield substantial accuracy improvements in various multiparty privacy problems. However, these works left open whether such improvements are possible in the traditional client-server model of data analysis. In fact, Groce, Katz and Yerukhimovich (TCC 2011) showed that, in this setting, it is impossible to take advantage of CDP for many natural statistical tasks. Our main result shows that, assuming the existence of sub-exponentially secure one-way functions and 2-message witness indistinguishable proofs (zaps) for NP, that there is in fact a computational task in the client-server model that can be efficiently performed with CDP, but is infeasible to perform with information-theoretic differential privacy.

%B Proceedings of the 14th Theory of Cryptography Conference (TCC 2016-B) %8 Aug 2016 %G eng %0 Journal Article %J Berkeley Technology Law Journal %D 2016 %T Towards a Modern Approach to Privacy-Aware Government Data Releases %A Micah Altman %A Alexandra Wood %A O'Brien, David %A Salil Vadhan %A Gasser, Urs %XThis article summarizes research exploring various models by which governments release data to the public and the interventions in place to protect the privacy of individuals in the data. Applying concepts from the recent scientific and legal literature on privacy, the authors propose a framework for a modern privacy analysis and illustrate how governments can use the framework to select appropriate privacy controls that are calibrated to the specific benefits and risks in individual data releases.

%B Berkeley Technology Law Journal %V 30 %G eng %N 3 %0 Thesis %B Computer Science Department at Harvard University %D 2016 %T Upper and Lower Bounds for Privacy and Adaptivity in Algorithmic Data Analysis %A Thomas Steinke %B Computer Science Department at Harvard University %G eng %9 PhD Thesis %0 Journal Article %J Proceedings of the 32nd International Conference on Machine Learning %D 2015 %T Consistent Estimation of Dynamic and Multi-Layer Block Models %A Qiuyi Han %A Kevin Xu %A Edoardo Airoldi %XSignificant progress has been made recently on theoretical analysis of estimators for the stochastic block model (SBM). In this paper, we consider the multi-graph SBM, which serves as a foundation for many application settings including dynamic and multi-layer networks. We explore the asymptotic properties of two estimators for the multi-graph SBM, namely spectral clustering and the maximum-likelihood estimate (MLE), as the number of layers of the multi-graph increases. We derive sufficient conditions for consistency of both estimators and propose a variational approximation to the MLE that is computationally feasible for large networks. We verify the sufficient conditions via simulation and demonstrate that they are practical. In addition, we apply the model to two real data sets: a dynamic social network and a multi-layer social network with several types of relations.

%B Proceedings of the 32nd International Conference on Machine Learning %G eng %U https://arxiv.org/pdf/1410.8597v3.pdf %0 Journal Article %J Encyclopedia of Algorithms %D 2015 %T Differentially Private Analysis of Graphs %A Sofya Raskhodnikova and Adam Smith %B Encyclopedia of Algorithms %G eng %0 Journal Article %J ZSR %D 2015 %T Perspectives on the Future of Digital Privacy %A Gasser, Urs %B ZSR %P 339-448 %G eng %0 Generic %D 2015 %T All the Data on All the People %A Sweeney, Latanya %B UC Berkeley Law School & GWU Law School (Berkeley Center for Law & Technology). The Privacy Law Scholars Conference (PLSC). Berkeley, CA. %G eng %0 Journal Article %J The ANNALS of the American Academy of Political and Social Science %D 2015 %T Automating Open Science for Big Data %A Mercè Crosas %A Gary King %A James Honaker %A Sweeney, Latanya %XThe vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

%B The ANNALS of the American Academy of Political and Social Science %V 659 %P 260-273 %G eng %U http://ann.sagepub.com/content/659/1/260.abstract %N 1 %0 Generic %D 2015 %T Between Pure and Approximate Differential Privacy %A Thomas Steinke %A Jon Ullman %B Theory and Practice of Differential Privacy (TPDP 2015), London, UK %G eng %U http://tpdp.computing.dundee.ac.uk/abstracts/TPDP_2015_3.pdf %0 Journal Article %J Proceedings of the 28th IEEE Computer Security Foundations Symposium (CSF) %D 2015 %T Cryptographic Enforcement of Language-Based Erasure %A A Askarov %A Moore, S. %A C Dimoulas %A S Chong %XInformation erasure is a formal security requirement that stipulates when sensitive data must be removed from computer systems. In a system that correctly enforces erasure requirements, an attacker who observes the system after sensitive data is required to have been erased cannot deduce anything about the data. Practical obstacles to enforcing information erasure include: (1) correctly determining which data requires erasure; and (2) reliably deleting potentially large volumes of data, despite untrustworthy storage services.

In this paper, we present a novel formalization of language- based information erasure that supports cryptographic enforcement of erasure requirements: sensitive data is encrypted be- fore storage, and upon erasure, only a relatively small set of decryption keys needs to be deleted. This cryptographic technique has been used by a number of systems that implement data deletion to allow the use of untrustworthy storage services. However, these systems provide no support to correctly determine which data requires erasure, nor have the formal semantic properties of these systems been explained or proven to hold. We address these shortcomings. Specifically, we study a programming language extended with primitives for public- key cryptography, and demonstrate how information-flow control mechanisms can automatically track data that requires erasure and provably enforce erasure requirements even when programs employ cryptographic techniques for erasure.

This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

%B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/data-and-privacy-f7bfa24bbddc %0 Report %D 2015 %T The Differential Privacy of Bayesian Inference %A Shijie Zheng %XDifferential privacy is one recent framework for analyzing and quantifying the amount of privacy lost when data is released. Meanwhile, multiple imputation is an existing Bayesian-inference based technique from statistics that learns a model using real data, then releases synthetic data by drawing from that model. Because multiple imputation does not directly release any real data, it is generally believed to protect privacy. In this thesis, we examine that claim. While there exist newer synthetic data algorithms specifically designed to provide differential privacy, we evaluate whether multiple imputation already includes differential privacy for free. Thus, we focus on several method variants for releasing the learned model and releasing the synthetic data, and how these methods perform for models taking on two common distributions: the Bernoulli and the Gaussian with known variance. We prove a number of new or improved bounds on the amount of privacy afforded by multiple imputation for these distributions. We find that while differential privacy is ostensibly achievable for most of our method variants, the conditions needed for it to do so are often not realistic for practical usage. At least in theory, this is particularly true if we want absolute privacy (ε-differential privacy), but that the methods are more practically compatible with privacy when we allow a small probability of a catastrophic data leakage ((ε, δ)-differential privacy). |

We prove new upper and lower bounds on the sample complexity of (ϵ,δ) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function cx over a totally ordered domain X evaluates to cx(y)=1 if y≤x, and evaluates to 0 otherwise. We give the first nontrivial lower bound for releasing thresholds with (ϵ,δ) differential privacy, showing that the task is impossible over an infinite domain X, and moreover requires sample complexity n≥Ω(log∗|X|), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with n≤2(1+o(1))log∗|X| samples. This improves the previous best upper bound of 8(1+o(1))log∗|X| (Beimel et al., RANDOM '13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with (ϵ,δ) differential privacy and learning without privacy. For properly learning thresholds in ℓ dimensions, this lower bound extends to n≥Ω(ℓ⋅log∗|X|).

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database D of elements from X, the interior point problem asks for an element between the smallest and largest elements in D. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Emerging large-scale data sources hold tremendous potential for new scientific research into human biology, behaviors, and relationships. At the same time, big data research presents privacy and ethical challenges that the current regulatory framework is ill-suited to address. In light of the immense value of large-scale research data, the central question moving forward is not whether such data should be made available for research, but rather how the benefits can be captured in a way that respects fundamental principles of ethics and privacy.

In a search task, a group of agents compete to be the first to find the solution. Each agent has different private information to incorporate into its search. This problem is inspired by settings such as scientific research, Bitcoin hash inversion, or hunting for some buried treasure. A social planner such as a funding agency, mining pool, or pirate captain might like to convince the agents to collaborate, share their information, and greatly reduce the cost of searching. However, this cooperation is in tension with the individuals' competitive desire to each be the first to win the search. The planner's proposal should incentivize truthful information sharing, reduce the total cost of searching, and satisfy fairness properties that preserve the spirit of the competition. We design contract-based mechanisms for information sharing without money. The planner solicits the agents' information and assigns search locations to the agents, who may then search only within their assignments. Truthful reporting of information to the mechanism maximizes an agent's chance to win the search. Epsilon-voluntary participation is satisfied for large search spaces. In order to formalize the planner's goals of fairness and reduced search cost, we propose a simplified, simulated game as a benchmark and quantify fairness and search cost relative to this benchmark scenario. The game is also used to implement our mechanisms. Finally, we extend to the case where coalitions of agents may participate in the mechanism, forming larger coalitions recursively.

%B AAI Conference on Artificial Intelligence %I Association for the Advancement of Artificial Intelligence (AAAI) %C North America %8 February %G eng %U http://people.seas.harvard.edu/~bwaggoner/papers/2015/fair-information-sharing-for-treasure-hunting--2015--chen,nissim,waggoner.pdf %0 Conference Proceedings %D 2015 %T On the Generalization Properties of Differential Privacy %A Kobbi Nissim %A Uri Stemmer %XA new line of work, started with Dwork et al., studies the task of answering statistical queries using a sample and relates the problem to the concept of differential privacy. By the Hoeffding bound, a sample of size O(logk/α2) suffices to answer k non-adaptive queries within error α, where the answers are computed by evaluating the statistical queries on the sample. This argument fails when the queries are chosen adaptively (and can hence depend on the sample). Dwork et al. showed that if the answers are computed with (ϵ,δ)-differential privacy then O(ϵ) accuracy is guaranteed with probability 1−O(δϵ). Using the Private Multiplicative Weights mechanism, they concluded that the sample size can still grow polylogarithmically with the k.

Very recently, Bassily et al. presented an improved bound and showed that (a variant of) the private multiplicative weights algorithm can answer k adaptively chosen statistical queries using sample complexity that grows logarithmically in k. However, their results no longer hold for every differentially private algorithm, and require modifying the private multiplicative weights algorithm in order to obtain their high probability bounds.

We greatly simplify the results of Dwork et al. and improve on the bound by showing that differential privacy guarantees O(ϵ) accuracy with probability 1−O(δlog(1/ϵ)/ϵ). It would be tempting to guess that an (ϵ,δ)-differentially private computation should guarantee O(ϵ) accuracy with probability 1−O(δ). However, we show that this is not the case, and that our bound is tight (up to logarithmic factors).

We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query it. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (stands for GRaph EnCryption for approximate Shortest distance queries) which includes three schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally-efficient construction. Our second scheme, makes use of somewhat-homomorphic encryption and is less computationally-efficient but achieves optimal communication complexity (i.e., uses a minimal amount of bandwidth). Finally, our third scheme is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.

%B The 22nd ACM Conference on Computer and Communications Security %G eng %U https://eprint.iacr.org/2015/266 %0 Journal Article %J International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %D 2015 %T Hardness Amplification and the Approximate Degree of Constant-Depth Circuits %A Mark Bun %A Justin Thaler %XWe establish a generic form of hardness amplification for the approximability of constant-depth Boolean circuits by polynomials. Specifically, we show that if a Boolean circuit cannot be pointwise approximated by low-degree polynomials to within constant error in a certain one-sided sense, then an OR of disjoint copies of that circuit cannot be pointwise approximated even with very high error. As our main application, we show that for every sequence of degrees d(n), there is an explicit depththree circuit F : {−1, 1} n → {−1, 1} of polynomial-size such that any degree-d polynomial cannot pointwise approximate F to error better than 1 − exp −Ω( ˜ nd−3/2 ) . As a consequence of our main result, we obtain an exp −Ω( ˜ n 2/5 ) upper bound on the the discrepancy of a function in AC0, and an exp Ω( ˜ n 2/5 ) lower bound on the threshold weight of AC0, improving over the previous best results of exp −Ω(n 1/3 ) and exp Ω(n 1/3 ) respectively. Our techniques also yield a new lower bound of Ω n 1/2/ log(d−2)/2 (n) on the approximate degree of the AND-OR tree of depth d, which is tight up to polylogarithmic factors for any constant d, as well as new bounds for read-once DNF formulas. In turn, these results imply new lower bounds on the communication and circuit complexity of these classes, and demonstrate strong limitations on existing PAC learning algorithms.

%B International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %G eng %U http://arxiv.org/abs/1311.1616 %0 Generic %D 2015 %T In the Age of the Web, What Does “Public” Mean? %A Rob Faris %A O'Brien, David %B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/in-the-age-of-the-web-what-does-public-mean-ee74df403174 %0 Journal Article %J Social Science Research Network %D 2015 %T Integrating Approaches to Privacy Across the Research Lifecycle: When is Information Purely Public? %A O'Brien, David %A Jonathan Ullman %A Micah Altman %A Gasser, Urs %A Bar-Sinai, Michael %A Kobbi Nissim %A Salil Vadhan %A Michael Wojcik %A Alexandra Wood %XOn September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.

We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is “close” to the correct expectation over the distribution. This question was recently studied by Dwork et al. (2015), who showed how to answer Ω( ˜ n 2 ) queries efficiently, and also by Hardt and Ullman (2014), who showed that answering O˜(n 3 ) queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n 2 ) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code Fiat and Tassa (2001). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

%B JMLR: Workshop and Conference Proceedings %V 40 %P 1-41 %G eng %U http://jmlr.org/proceedings/papers/v40/Steinke15.pdf %N 201 %0 Journal Article %J Accepted for publication, SODA 2015 %D 2015 %T Learning Privately with Labeled and Unlabeled Examples %A A. Beimel %A K. Nissim %A U. Stemmer %XA private learner is an algorithm that given a sample of labeled individual examples outputs a generalizing hypothesis while preserving the privacy of each individual. In 2008, Kasiviswanathan et al. (FOCS 2008) gave a generic construction of private learners, in which the sample complexity is (generally) higher than what is needed for non-private learners. This gap in the sample complexity was then further studied in several followup papers, showing that (at least in some cases) this gap is unavoidable. Moreover, those papers considered ways to overcome the gap, by relaxing either the privacy or the learning guarantees of the learner.

We suggest an alternative approach, inspired by the (non-private) models of semi-supervised learning and active-learning, where the focus is on the sample complexity of labeled examples whereas unlabeled examples are of a significantly lower cost. We consider private semi-supervised learners that operate on a random sample, where only a (hopefully small) portion of this sample is labeled. The learners have no control over which of the sample elements are labeled. Our main result is that the labeled sample complexity of private learners is characterized by the VC dimension.

We present two generic constructions of private semi-supervised learners. The first construction is of learners where the labeled sample complexity is proportional to the VC dimension of the concept class, however, the unlabeled sample complexity of the algorithm is as big as the representation length of domain elements. Our second construction presents a new technique for decreasing the labeled sample complexity of a given private learner, while only slightly increasing its unlabeled sample complexity. In addition, we show that in some settings the labeled sample complexity does not depend on the privacy parameters of the learner.

Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent, data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.

%B Arxiv.org Computer Science, Computers and Scoiety [Internet] %G eng %U http://arxiv.org/abs/1506.05632 %0 Generic %D 2015 %T Privacy as a Sword and Shield in Public Health %A Sweeney, Latanya %B New York City Department of Public Health. New York, NY. %G eng %0 Generic %D 2015 %T Privacy Principles (framing talk) %A Micah Altman %B United Nations Global Pulse Workshop on ICT4D Principle 8: Address Privacy & Security In Development Programs. New York, USA %G eng %U https://www.slideshare.net/drmaltman/un-global-pulse-privacy-framing %0 Journal Article %D 2015 %T Private Approximations of the 2nd-Moment Matrix Using Existing Techniques in Linear Regression %A Or Sheffet %XWe introduce three differentially-private algorithms that approximates the 2nd-moment matrix of the data. These algorithm, which in contrast to existing algorithms output positive-definite matrices, correspond to existing techniques in linear regression literature. Specifically, we discuss the following three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that by approximating the solution using Johnson-Lindenstrauss transform we preserve privacy. (ii) We show that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially private provided the prior is set correctly. We also evaluate our techniques experimentally and compare them to the existing "Analyze Gauss" algorithm of Dwork et al.

%G eng %U http://arxiv.org/abs/1507.00056 %0 Conference Paper %B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %D 2015 %T Robust Traceability from Trace Amounts %A C. Dwork %A Smith, A %A T Steinke %A J Ullman %A S. Vadhan %XThe privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in `1 norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

%B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %C Berkeley, California %8 10/18-20/2015 %G eng %0 Journal Article %J Technology Science %D 2015 %T Sharing Sensitive Data with Confidence: The Datatags System %A Sweeney, Latanya %A Mercè Crosas %A Bar-Sinai, Michael %XSociety generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

%B Technology Science %G eng %U http://techscience.org/a/2015101601/ %0 Conference Paper %D 2015 %T Simultaneous private learning of multiple concepts %A Mark Bun %A Kobbi Nissim %A Uri Stemmer %XWe investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving k learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving k learning tasks without privacy? In our setting, an individual example consists of a domain element x labeled by k unknown concepts (c1,…,ck). The goal of a multi-learner is to output k hypotheses (h1,…,hk) that generalize the input examples.

Without concern for privacy, the sample complexity needed to simultaneously learn k concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with k. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in k.

In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy we offer these comments based on presentations and discussions at the White House-MIT Workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society and NYU Information Law Institute and the UC Berkeley iSchool.

%G eng %0 Government Document %D 2014 %T Comment to The White House Office of Science and Technology Policy (OSTP): Big Data Study, Request for Information %A Micah Altman %A David O’Brien %A Salil Vadhan %A Alexandra Wood %XOn January 23, 2014, President Barack Obama asked John Podesta to perform a comprehensive review of big data and privacy. During this review, the White House Office of Science and Technology Policy issued a request for public comment on questions related to the public policy implications of big data.

Micah Altman, David O’Brien, Salil Vadhan, and Alexandra Wood submitted a response on behalf of the Privacy Tools for Sharing Research Data project. Their comments outline a broad, comprehensive, and systematic framework for privacy analysis and provide a taxonomy of modern technological, statistical, and cryptographic approaches to preserving both data privacy and utility. They argue that an analysis of information privacy should address the scope of information covered, the sensitivity of that information, the risk that sensitive information will be disclosed, the availability of control and accountability mechanisms, and the suitability of existing data sharing models, as applied across the entire lifecyle of information use, from collection through dissemination and reuse.

With this submission, the authors discuss the inadequacy of traditional approaches to privacy protection and recommend a modern approach that considers three principles. First, the risks of informational harm are generally not a simple function of the presence or absence of specific fields, attributes, or keywords in the released set of data. Second, redaction, pseudonymization, coarsening, and hashing, are often neither an adequate nor appropriate practice, nor is releasing less information necessary more privacy protective. Third, a thoughtful analysis with expert consultation is necessary in order to evaluate the sensitivity of the data collected, to quantify the associated re-identification risks, and to design useful and safe release mechanisms.

In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system.

To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. *Access control policies* restrict which components may use a capability and are enforced using higher-order contracts. *Integrity policies* restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.

On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly-specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable.

Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.

We show a tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is "close" to the correct expectation over the distribution. This question was recently considered by Dwork et al., who showed that Ω~(n2) queries can be answer efficiently, and also by Hardt and Ullman, who showed that answering O~(n3) queries is computationally hard. We close the gap between the two bounds by proving a new, nearly-optimal hardness result. Specifically, we show that, under a standard hardness assumption, there is no computationally efficient algorithm that given n samples from an unknown distribution can give valid answers to O(n2) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results via an optimal construction of a new combinatorial object that we call an interactive fingerprinting code, which may be of independent interest.

%G eng %U http://arxiv.org/abs/1410.1228 %0 Book %D 2014 %T Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse %A Gasser, Urs %A Jonathan Zittrain %A R Faris %A R.H Jones %XThis publication is the second annual report of the Internet Monitor project at the Berkman Center for Internet & Society at Harvard University. As with the inaugural report, this year’s edition is a collaborative effort of the extended Berkman community. Internet Monitor 2014: Reflections on the Digital World includes nearly three dozen contributions from friends and colleagues around the world that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

The result, intended for a general interest audience, brings together reflection and analysis on a broad range of issues and regions — from an examination of Europe’s “right to be forgotten” to a review of the current state of mobile security to an exploration of a new wave of movements attempting to counter hate speech online — and offers it up for debate and discussion. Our goal remains not to provide a definitive assessment of the “state of the Internet” but rather to provide a rich compendium of commentary on the year’s developments with respect to the online space.

Last year’s report examined the dynamics of Internet controls and online activity through the actions of government, corporations, and civil society. We focus this year on the interplay between technological platforms and policy; growing tensions between protecting personal privacy and using big data for social good; the implications of digital communications tools for public discourse and collective action; and current debates around the future of Internet governance.

The report reflects the diversity of ideas and input the Internet Monitor project seeks to invite. Some of the contributions are descriptive; others prescriptive. Some contain purely factual observations; others offer personal opinion. In addition to those in traditional essay format, contributions this year include a speculative fiction story exploring what our increasingly data-driven world might bring, a selection of “visual thinking” illustrations that accompany a number of essays, a “Year in Review” timeline that highlights many of the year’s most fascinating Internet-related news stories (and an interactive version of which is available at the netmonitor.org), and a slightly tongue-in-cheek “By the Numbers” section that offers a look at the year’s important digital statistics. We believe that each contribution offers insights, and hope they provoke further reflection, conversation, and debate in both offline and online settings around the globe.

%7 17 %I Berkman Center Research Publication %C Cambridge, MA %G eng %U http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2538813 %0 Conference Paper %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %D 2014 %T Mechanism Design in Large Games: Incentives and Privacy %A Michael Kearns %A Mallesh Pai %A Aaron Roth %A Jonathan Ullman %K differential privacy %K equilibrium selection %K game theory %K mechanism design %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %S ITCS '14 %I ACM %C New York, NY, USA %P 403–410 %@ 978-1-4503-2698-8 %G eng %U http://doi.acm.org/10.1145/2554797.2554834 %R 10.1145/2554797.2554834 %0 Journal Article %J CoRR %D 2014 %TAn Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring

%A Mallesh M. Pai %A Aaron Roth %A Jonathan Ullman %B CoRR %V abs/1402.2801 %G eng %0 Conference Paper %B 10th Conference on Web and Internet Economics (WINE) %D 2014 %T Privacy Games %A Yiling Chen %A Or Sheffet %A Salil Vadhan %B 10th Conference on Web and Internet Economics (WINE) %C Beijing, China %8 December %G eng %0 Conference Paper %B Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14) %D 2014 %T Privacy Integrated Data Stream Queries %A Lucas Waye %B Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP '14) %I ACM %C New York, NY %G eng %U http://delivery.acm.org/10.1145/2690000/2687150/p19-waye.pdf?ip=65.112.10.216&id=2687150&acc=OPEN&key=4D4702B0C3E38B35.4D4702B0C3E38B35.4D4702B0C3E38B35.6D218144511F3437&CFID=469747432&CFTOKEN=41625734&__acm__=1420474755_45ce38ca730887129ca49b667ac6e0aa %0 Conference Paper %B ICML 2014 Workshop on Learning, Security and Privacy %D 2014 %T Private Empirical Risk Minimization, Revisited %A Raef Bassily %A Smith, Adam %A Abhradeep Thakurta %XIn this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex.

Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different.

Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.

%B ICML 2014 Workshop on Learning, Security and Privacy %C Beijing, China %8 25 Jun %G eng %U http://arxiv.org/abs/1405.7085 %0 Book Section %B Automata, Languages, and Programming %D 2014 %T Privately Solving Linear Programs %A Justin Hsu %A Aaron Roth %A Roughgarden, Tim %A Jonathan Ullman %A Esparza, Javier %A Fraigniaud, Pierre %A Husfeldt, Thore %A Koutsoupias, Elias %B Automata, Languages, and Programming %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 8572 %P 612-624 %@ 978-3-662-43947-0 %G eng %U http://dx.doi.org/10.1007/978-3-662-43948-7_51 %R 10.1007/978-3-662-43948-7_51 %0 Conference Paper %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %D 2014 %T Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals %A Kobbi Nissim %A Salil Vadhan %A Xiao, David %K differential privacy %K mechanism design %B Proceedings of the 5th Conference on Innovations in Theoretical Computer Science %S ITCS '14 %I ACM %C New York, NY, USA %P 411–422 %@ 978-1-4503-2698-8 %G eng %U http://doi.acm.org/10.1145/2554797.2554835 %R 10.1145/2554797.2554835 %0 Conference Proceedings %B Proceedings of The 27th Conference on Learning Theory (COLT 2014) %D 2014 %T Sample Complexity Bounds on Differentially Private Learning via Communication Complexity %A Vitaly Feldman %A Xiao, David %XIn this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. Sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008) but a number of basic questions still remain open (Beimel et al. 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013ab).

Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:

- SCDP(C)=Ω(LDim(C)), where LDim(C) is the Littlestone's (1987) dimension characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that SCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work.
- For any t, there exists a class C such that LDim(C)=2 but SCDP(C)≥t.
- For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b).

We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al., (2013a) and our bounds rely on a number of known results from communication complexity.

%B Proceedings of The 27th Conference on Learning Theory (COLT 2014) %I JMLR Workshop and Conference Proceedings %C Barcelona, Spain %V 35 %P 1-20 %G eng %U http://jmlr.org/proceedings/papers/v35/feldman14b.pdf %0 Book %D 2014 %T What Stays in Vegas: The World of Personal Data -- Lifeblood of Big Business -- and the End of Privacy as We Know It. %A Adam Tanner %XThe greatest threat to privacy today is not the NSA, but good-old American companies. Internet giants, leading retailers, and other firms are voraciously gathering data with little oversight from anyone.

In Las Vegas, no company knows the value of data better than Caesars Entertainment. Many thousands of enthusiastic clients pour through the ever-open doors of their casinos. The secret to the company’s success lies in their one unrivaled asset: they know their clients intimately by tracking the activities of the overwhelming majority of gamblers. They know exactly what games they like to play, what foods they enjoy for breakfast, when they prefer to visit, who their favorite hostess might be, and exactly how to keep them coming back for more.

Caesars’ dogged data-gathering methods have been so successful that they have grown to become the world’s largest casino operator, and have inspired companies of all kinds to ramp up their own data mining in the hopes of boosting their targeted marketing efforts. Some do this themselves. Some rely on data brokers. Others clearly enter a moral gray zone that should make American consumers deeply uncomfortable.

We live in an age when our personal information is harvested and aggregated whether we like it or not. And it is growing ever more difficult for those businesses that choose not to engage in more intrusive data gathering to compete with those that do. Tanner’s timely warning resounds: Yes, there are many benefits to the free flow of all this data, but there is a dark, unregulated, and destructive netherworld as well.

We study interactive proofs with sublinear-time verifiers. These proof systems can be used to ensure approximate correctness for the results of computations delegated to an untrusted server. Following the literature on property testing, we seek proof systems where with high probability the verifier accepts every input in the language, and rejects every input that is far from the language. The verifier's query complexity (and computation complexity), as well as the communication, should all be sublinear. We call such a proof system an Interactive Proof of Proximity (IPP). On the positive side, our main result is that all languages in NC have Interactive Proofs of Proximity with roughly √n query and communication and complexities, and polylog(n) communication rounds. This is achieved by identifying a natural language, membership in an affine subspace (for a structured class of subspaces), that is complete for constructing interactive proofs of proximity, and providing efficient protocols for it. In building an IPP for this complete language, we show a tradeoff between the query and communication complexity and the number of rounds. For example, we give a 2-round protocol with roughly n3/4 queries and communication. On the negative side, we show that there exist natural languages in NC1, for which the sum of queries and communication in any constant-round interactive proof of proximity must be polynomially related to n. In particular, for any 2-round protocol, the sum of queries and communication must be at least ~Ω(√n). Finally, we construct much better IPPs for specific functions, such as bipartiteness on random or well-mixing graphs, and the majority function. The query complexities of these protocols are provably better (by exponential or polynomial factors) than what is possible in the standard property testing model, i.e. without a prover.

%B Proceedings of the 45th annual ACM symposium on Symposium on theory of computing %I ACM %C Palo Alto, California, USA %P 793-802 %@ 978-1-4503-2029-0 %G eng %U http://dx.doi.org/10.1145/2488608.2488709 %1 2488709 %R 10.1145/2488608.2488709 %0 Unpublished Work %D 2013 %T Matching Known Patients to Health Records in Washington State Data %A Sweeney, Latanya %B Data Privacy Lab, IQSS, Harvard University %8 06/01/2013 %G eng %U http://thedatamap.org/risks.html %0 Book Section %B Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques %D 2013 %T Private Learning and Sanitization: Pure vs. Approximate Differential Privacy %A Amos Beimel %A Kobbi Nissim %A Uri Stemmer %A Raghavendra, Prasad %A Raskhodnikova, Sofya %A Jansen, Klaus %A Rolim, JoséD.P. %K differential privacy %K Private Learning %K Sanitization %B Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques %S Lecture Notes in Computer Science %I Springer Berlin Heidelberg %V 8096 %P 363-378 %@ 978-3-642-40327-9 %G eng %U http://dx.doi.org/10.1007/978-3-642-40328-6_26 %R 10.1007/978-3-642-40328-6_26 %0 Journal Article %J JAMA %D 2013 %T Putting Health IT on the Path to Success %A Sweeney, Latanya %A Yasnoff, William A. %A Shortliffe, Edward H. %X The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Serious flaws in current approaches to health information exchanges: (1) complex and expensive; (2) prone to error and insecurity; (3) increase liability; (4) not financially sustainable; (5) unable to protect privacy; (6) unable to ensure stakeholder cooperation; and, (7) unable to facilitate robust data sharing. The good news is that personal health record banks pose a viable alternative that is: (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and, (e) capable of providing the most comprehensive records. The idea of patient controlled records is not new, but what is new is how personally controlled records can help achieve the HIT vision. %B JAMA %V 309 %P 989-990 %G eng %U http://dx.doi.org/10.1001/jama.2013.1474 %N 10 %0 Unpublished Work %D 2013 %T Survey of Publicly Available State Health Databases %A Hooley, Sean %A Sweeney, Latanya %B Data Privacy Lab, IQSS, Harvard University %8 06/01/2013 %G eng %U http://thedatamap.org/states.html %0 Conference Paper %B Proceedings of the fourteenth ACM conference on Electronic commerce %D 2013 %T Truthful mechanisms for agents that value privacy %A Yiling Chen %A Stephen Chong %A Ian A. Kash %A Tal Moran %A Salil Vadhan %X Recent work has constructed economic mechanisms that are both truthful and differentially private. In these mechanisms, privacy is treated separately from the truthfulness; it is not incorporated in players' utility functions (and doing so has been shown to lead to non-truthfulness in some cases). In this work, we propose a new, general way of modelling privacy in players' utility functions. Specifically, we only assume that if an outcome o has the property that any report of player i would have led to o with approximately the same probability, then o has small privacy cost to player i. We give three mechanisms that are truthful with respect to our modelling of privacy: for an election between two candidates, for a discrete version of the facility location problem, and for a general social choice problem with discrete utilities (via a VCG-like mechanism). As the number n of players increases, the social welfare achieved by our mechanisms approaches optimal (as a fraction of n). %B Proceedings of the fourteenth ACM conference on Electronic commerce %I ACM %C Philadelphia, Pennsylvania, USA %P 215-232 %@ 978-1-4503-1962-1 %G eng %U http://dx.doi.org/10.1145/2482540.2482549 %1 2482549 %R 10.1145/2482540.2482549 %0 Conference Paper %B Proceedings of the 32nd International Cryptology Conference (CRYPTO `12) %D 2012 %T Differential Privacy with Imperfect Randomness %A Yevgeniy Dodis %A Adriana López-Alt %A Ilya Mironov %A Salil Vadhan %XIn this work we revisit the question of basing cryptography on imperfect randomness. Bosley and Dodis (TCC’07) showed that if a source of randomness R is “good enough” to generate a secret key capable of encrypting k bits, then one can deterministically extract nearly k almost uniform bits from R, suggesting that traditional privacy notions (namely, indistinguishability of encryption) requires an “extractable” source of randomness. Other, even stronger impossibility results are known for achieving privacy under speciﬁc “non-extractable” sources of randomness, such as the γ-Santha-Vazirani (SV) source, where each next bit has fresh entropy, but is allowed to have a small bias γ < 1 (possibly depending on prior bits). We ask whether similar negative results also hold for a more recent notion of privacy called differential privacy (Dwork et al., TCC’06), concentrating, in particular, on achieving differential privacy with the Santha-Vazirani source. We show that the answer is no. Speciﬁcally, we give a differentially private mechanism for approximating arbitrary “low sensitivity” functions that works even with randomness coming from a γ-Santha-Vazirani source, for any γ < 1. This provides a somewhat surprising “separation” between traditional privacy and diﬀerential privacy with respect to imperfect randomness. Interestingly, the design of our mechanism is quite diﬀerent from the traditional “additive-noise” mechanisms (e.g., Laplace mechanism) successfully utilized to achieve differential privacy with perfect randomness. Indeed, we show that any (accurate and private) “SV-robust” mechanism for our problem requires a demanding property called consistent sampling, which is strictly stronger than differential privacy, and cannot be satisﬁed by any additive-noise mechanism.

%B Proceedings of the 32nd International Cryptology Conference (CRYPTO `12) %S Lecture Notes on Computer Science %7 Lecture Notes on Computer Science %I Springer-Verlag %C Santa Barbara, CA %V 7417 %P 497–516 %8 19–23 August %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-32009-5_29 %0 Conference Paper %B Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012 %D 2012 %T Faster Algorithms for Privately Releasing Marginals %A Justin Thaler %A Jonathan Ullman %A Salil P. Vadhan %XWe study the problem of releasing k-way marginals of a database D ∈ ({0, 1} d ) n , while preserving differential privacy. The answer to a k-way marginal query is the fraction of D’s records x ∈ {0, 1} d with a given value in each of a given set of up to k columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et. al., PODS ’07). We give an algorithm that runs in time dO(k√) and releases a private summary capable of answering any k-way marginal query with at most ±.01 error on every query as long as n≥dO(k√) . To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of k-way marginal queries, which is d Θ(k) (for k ≪ d).

%B Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012 %7 Lecture Notes in Computer Science %I Springer %C Warwick, UK %V 7391 %8 9-13 Jul. %G eng %U http://dx.doi.org/10.1007/978-3-642-31594-7_68 %0 Conference Paper %B Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012 %D 2012 %T Iterative Constructions and Private Data Release %A Anupam Gupta %A Aaron Roth %A Jonathan Ullman %XIn this paper we study the problem of approximately releasing the cut function of a graph while preserving differential privacy, and give new algorithms (and new analyses of existing algorithms) in both the interactive and non-interactive settings. Our algorithms in the interactive setting are achieved by revisiting the problem of releasing differentially private, approximate answers to a large number of queries on a database. We show that several algorithms for this problem fall into the same basic framework, and are based on the existence of objects which we call iterative database construction algorithms. We give a new generic framework in which new (efficient) IDC algorithms give rise to new (efficient) interactive private query release mechanisms. Our modular analysis simplifies and tightens the analysis of previous algorithms, leading to improved bounds. We then give a new IDC algorithm (and therefore a new private, interactive query release mechanism) based on the Frieze/Kannan low-rank matrix decomposition. This new release mechanism gives an improvement on prior work in a range of parameters where the size of the database is comparable to the size of the data universe (such as releasing all cut queries on dense graphs). We also give a non-interactive algorithm for efficiently releasing private synthetic data for graph cuts with error O(|V|1.5). Our algorithm is based on randomized response and a non-private implementation of the SDP-based, constant-factor approximation algorithm for cut-norm due to Alon and Naor. Finally, we give a reduction based on the IDC framework showing that an efficient, private algorithm for computing sufficiently accurate rank-1 matrix approximations would lead to an improved efficient algorithm for releasing private synthetic data for graph cuts. We leave finding such an algorithm as our main open problem.

%B Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012 %7 Lecture Notes in Computer Science %I Springer %C Taormina, Sicily, Italy %V 7194 %P 339-356 %8 19-21 Mar. %G eng %U http://dx.doi.org/10.1007/978-3-642-28914-9_19 %0 Conference Paper %B Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `12) %D 2012 %T The Privacy of the Analyst and the Power of the State %A Cynthia Dwork %A Moni Naor %A Salil Vadhan %XWe initiate the study of "privacy for the analyst" in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e. individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system. This problem arises only in the context of stateful privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries). We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

%B Proceedings of the 53rd Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `12) %I IEEE %C New Brunswick, NJ %P 400–409 %8 20–23 October %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6375318&tag=1 %0 Manuscript %D 2012 %T Private Equilibrium Release, Large Games, and No-Regret Learning %A Michael Kearns %A Mallesh Pai %A Aaron Roth %A Jonathan Ullman %XWe give mechanisms in which each of n players in a game is given their component of an (approximate) equilibrium in a way that guarantees differential privacy---that is, the revelation of the equilibrium components does not reveal too much information about the utilities of the other players. More precisely, we show how to compute an approximate correlated equilibrium (CE) under the constraint of differential privacy (DP), provided n is large and any player's action affects any other's payoff by at most a small amount. Our results draw interesting connections between noisy generalizations of classical convergence results for no-regret learning, and the noisy mechanisms developed for differential privacy. Our results imply the ability to truthfully implement good social-welfare solutions in many games, such as games with small Price of Anarchy, even if the mechanism does not have the ability to enforce outcomes. We give two different mechanisms for DP computation of approximate CE. The first is computationally efficient, but has a suboptimal dependence on the number of actions in the game; the second is computationally efficient, but allows for games with exponentially many actions. We also give a matching lower bound, showing that our results are tight up to logarithmic factors.

%G eng %U http://arxiv.org/abs/1207.4084 %0 Government Document %D 2011 %T Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005 %A Salil Vadhan %A David Abrams %A Micah Altman %A Cynthia Dwork %A Paul Kominers %A Scott Duke Kominers %A Harry R. Lewis %A Tal Moran %A Guy Rothblum %XComments by Salil Vadhan, David Abrams, Micah Altman, Cynthia Dwork, Scott Duke Kominers, Paul Kominers, Harry Lewis, Tal Moran, Guy Rothblum, and Jon Ullman (at Harvard, Microsoft Research, the University of Chicago, MIT, and the Herzilya Interdisciplinary Center) These comments address the issues of data privacy and de-identification raised in the ANPRM. Our perspective is informed by substantial advances in privacy science that have been made in the computer science literature.

%8 October %G eng %U http://www.regulations.gov/#!documentDetail;D=HHS-OPHS-2011-0005-1101 %0 Conference Paper %B Proceedings of the 8th IACR Theory of Cryptography Conference (TCC `11) %D 2011 %T PCPs and the Hardness of Generating Synthetic Data %A Jon Ullman %A Salil Vadhan %E Yuval Ishai %XAssuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D\in ({0,1}^d)^n and outputs a ``synthetic database'' D' all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x\in {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS `07), who gave an algorithm running in time poly(n,2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC `09) with PCP-based Levin-reductions from NP search problems to finding approximate solutions to CSPs.

%B Proceedings of the 8th IACR Theory of Cryptography Conference (TCC `11) %S Lecture Notes on Computer Science %7 Lecture Notes on Computer Science %I Springer-Verlag %C Providence, RI %V 5978 %P 572–587 %8 28–30 March %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-19571-6_24 %0 Conference Paper %B Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011 %D 2011 %T Privately releasing conjunctions and the statistical query barrier %A Anupam Gupta %A Moritz Hardt %A Aaron Roth %A Jonathan Ullman %X Suppose we would like to know all answers to a set of statistical queries C on a data set up to small error, but we can only access the data itself using statistical queries. A trivial solution is to exhaustively ask all queries in C. Can we do any better? We show that the number of statistical queries necessary and sufficient for this task is---up to polynomial factors---equal to the agnostic learning complexity of C in Kearns' statistical query (SQ)model. This gives a complete answer to the question when running time is not a concern. We then show that the problem can be solved efficiently (allowing arbitrary error on a small fraction of queries) whenever the answers to C can be described by a submodular function. This includes many natural concept classes, such as graph cuts and Boolean disjunctions and conjunctions. While interesting from a learning theoretic point of view, our main applications are in privacy-preserving data analysis: Here, our second result leads to an algorithm that efficiently releases differentially private answers to all Boolean conjunctions with 1% average error. This presents progress on a key open problem in privacy-preserving data analysis. Our first result on the other hand gives unconditional lower bounds on any differentially private algorithm that admits a (potentially non-privacy-preserving) implementation using only statistical queries. Not only our algorithms, but also most known private algorithms can be implemented using only statistical queries, and hence are constrained by these lower bounds. Our result therefore isolates the complexity of agnostic learning in the SQ-model as a new barrier in the design of differentially private algorithms. %B Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011 %I ACM %C San Jose, CA, USA %P 803-812 %8 6-8 June %G eng %U http://dl.acm.org/citation.cfm?id=1993742 %0 Conference Paper %B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %D 2010 %T Boosting and Differential Privacy %A Cynthia Dwork %A Guy Rothblum %A Salil Vadhan %XBoosting is a general method for improving the accuracy of learning algorithms. We use boosting to construct improved privacy-pre serving synopses of an input database. These are data structures that yield, for a given set Q of queries over an input database, reasonably accurate estimates of the responses to every query in Q, even when the number of queries is much larger than the number of rows in the database. Given a base synopsis generator that takes a distribution on Q and produces a "weak" synopsis that yields "good" answers for a majority of the weight in Q, our Boosting for Queries algorithm obtains a synopsis that is good for all of Q. We ensure privacy for the rows of the database, but the boosting is performed on the queries. We also provide the first synopsis generators for arbitrary sets of arbitrary low-sensitivity queries, i.e., queries whose answers do not vary much under the addition or deletion of a single row. In the execution of our algorithm certain tasks, each incurring some privacy loss, are performed many times. To analyze the cumulative privacy loss, we obtain an O(ε2) bound on the expected privacy loss from a single e-differentially private mechanism. Combining this with evolution of confidence arguments from the literature, we get stronger bounds on the expected cumulative privacy loss due to multiple mechanisms, each of which provides e-differential privacy or one of its relaxations, and each of which operates on (potentially) different, adaptively chosen, databases.

%B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %I IEEE %C Las Vegas, NV %P 51–60 %8 23–26 October %G eng %U http://dx.doi.org/10.1109/FOCS.2010.12 %0 Conference Paper %B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %D 2010 %T The Limits of Two-Party Differential Privacy %A Andrew McGregor %A Ilya Mironov %A Toniann Pitassi %A Omer Reingold %A Kunal Talwar %A Salil Vadhan %XWe study differential privacy in a distributed setting where two parties would like to perform analysis of their joint data while preserving privacy for both datasets. Our results imply almost tight lower bounds on the accuracy of such data analyses, both for specific natural functions (such as Hamming distance) and in general. Our bounds expose a sharp contrast between the two-party setting and the simpler client-server setting (where privacy guarantees are one-sided). In addition, those bounds demonstrate a dramatic gap between the accuracy that can be obtained by differentially private data analysis versus the accuracy obtainable when privacy is relaxed to a computational variant of differential privacy. The first proof technique we develop demonstrates a connection between differential privacy and deterministic extraction from Santha-Vazirani sources. A second connection we expose indicates that the ability to approximate a function by a low-error differentially private protocol is strongly related to the ability to approximate it by a low communication protocol. (The connection goes in both directions).

%B Proceedings of the 51st Annual {IEEE} Symposium on Foundations of Computer Science (FOCS `10) %I IEEE %C Las Vegas, NV %P 81–90 %8 23–26 October %G eng %U http://dx.doi.org/10.1109/FOCS.2010.14 %0 Conference Paper %B Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC `09) %D 2009 %T On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results %A Cynthia Dwork %A Moni Naor %A Omer Reingold %A Guy Rothblum %A Salil Vadhan %XWe consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a "sanitization" of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a "synthetic data set" consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive; once the sanitization has been released the original data and the curator play no further role. For the task of sanitizing with a synthetic dataset output, we map the boundary between computational feasibility and infeasibility with respect to a variety of utility measures. For the (potentially easier) task of sanitizing with unrestricted output format, we show a tight qualitative and quantitative connection between hardness of sanitizing and the existence of traitor tracing schemes.

%B Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC `09) %C Bethesda, MD %P 381–390 %8 31 May–2 June %G eng %U http://dl.acm.org/citation.cfm?id=1536467 %0 Conference Paper %B Advances in Cryptology–-CRYPTO `09 %D 2009 %T Computational Differential Privacy %A Ilya Mironov %A Omkant Pandey %A Omer Reingold %A Salil Vadhan %XThe definition of differential privacy has recently emerged as a leading standard of privacy guarantees for algorithms on statistical databases. We offer several relaxations of the definition which require privacy guarantees to hold only against efficient—i.e., computationally-bounded—adversaries. We establish various relationships among these notions, and in doing so, we observe their close connection with the theory of pseudodense sets by Reingold et al.[1]. We extend the dense model theorem of Reingold et al. to demonstrate equivalence between two definitions (indistinguishability- and simulatability-based) of computational differential privacy. Our computational analogues of differential privacy seem to allow for more accurate constructions than the standard information-theoretic analogues. In particular, in the context of private approximation of the distance between two vectors, we present a differentially-private protocol for computing the approximation, and contrast it with a substantially more accurate protocol that is only computationally differentially private.

%B Advances in Cryptology–-CRYPTO `09 %S Lecture Notes in Computer Science %I Springer-Verlag %C Santa Barbara, CA %V 5677 %P 126–142 %8 16–20 August %G eng %U http://link.springer.com/chapter/10.1007%2F978-3-642-03356-8_8 %0 Unpublished Work %D 2000 %T Simple Demographics Often Identify People Uniquely %A Sweeney, Latanya %B Carnegie Mellon University, Data Privacy %G eng %U http://dataprivacylab.org/projects/identifiability/ %9 Working paper