There is a significant conceptual gap between legal and mathematical thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators. We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset x using the output of a data-release mechanism M(x) if it finds a predicate p matching exactly one row in x with probability much better than a statistical baseline. A data-release mechanism that precludes such attacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and k-anonymity are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, k-anonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.
Society is caught in a vise. The exponential growth in the power and ubiquity of computing devices has enabled the collection and analysis of data at an unprecedented scale. This Cambrian explosion in data collection promises enormous benefits across commercial, scientific, and policy fields. Unfortunately, this collection and analysis of increasingly-personal data has also proved to be a grave threat to individual’s privacy.
Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users’ hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This impliesan arbitrarily large gap in sample complexity between the shuffled and local models. On theother hand, we show that the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.
Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users' hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This implies an arbitrarily large gap in sample complexity between the shuffled and local models. On the other hand, the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.
There is a significant conceptual gap between legal and mathematica thinking around data privacy. The effect is uncertainty as to which technical offerings meet legal standards. This uncertainty is exacerbated by a litany of successful privacy attacks demonstrating that traditional statistical disclosure limitation techniques often fall short of the privacy envisioned by regulators.
We define “predicate singling out,” a type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). An adversary predicate singles out a dataset x using the output of a data-release mechanism M(x) if it finds a predicate p matching exactly one row in x with probability much better than a statistical baseline. A data-release mechanism that precludes suchattacks is “secure against predicate singling out” (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data-release mechanism that purports to “render anonymous” personal data under the GDPR must prevent singling out and, hence, must be PSO secure. We analyze the properties of PSO security, showing that it fails to compose. Namely, a combination of more than logarithmically many exact counts, each individually PSO secure, facilitates predicate singling out. Finally, we ask whether differential privacy and kanonymity
are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, and in contrast with current legal guidance, kanonymity does not: There exists a simple predicate singling out attack under mild assumptions on the k-anonymizer and the data distribution.
Previously published as: Yiling Chen, Or Sheffet, and Salil Vadhan. Privacy games. In Proceedings of the 10th International Conference on Web and Internet Economics (WINE ‘14), volume 8877 of Lecture Notes in Computer Science, pages 371–385. Springer-Verlag, 14–17 December 2014. (WINE Publisher's Version linked here: https://link.springer.com/chapter/10.1007/978-3-319-13129-0_30); PDF attached as WINE2014.
The problem of analyzing the effect of privacy concerns on the behavior of selfish utility-maximizing agents has received much attention lately. Privacy concerns are often modeled by altering the utility functions of agents to consider also their privacy loss. Such privacy aware agents prefer to take a randomized strategy even in very simple games in which non-privacy aware agents play pure strategies. In some cases, the behavior of privacy aware agents follows the framework of Randomized Response, a well-known mechanism that preserves differential privacy.
Our work is aimed at better understanding the behavior of agents in settings where their privacy concerns are explicitly given. We consider a toy setting where agent A, in an attempt to discover the secret type of agent B, offers B a gift that one type of B agent likes and the other type dislikes. As opposed to previous works, B's incentive to keep her type a secret isn't the result of "hardwiring" B's utility function to consider privacy, but rather takes the form of a payment between B and A. We investigate three different types of payment functions and analyze B's behavior in each of the resulting games. As we show, under some payments, B's behavior is very different than the behavior of agents with hardwired privacy concerns and might even be deterministic. Under a different payment we show that B's BNE strategy does fall into the framework of Randomized Response.
Real-world applications routinely make authorization decisions based on dynamic computation. Reasoning about dynamically computed authority is challenging. Integrity of the system might be compromised if attackers can improperly influence the authorizing computation. Confidentiality can also be compromised by authorization, since authorization decisions are often based on sensitive data such as membership lists and passwords. Previous formal models for authorization do not fully address the security implications of permitting trust relationships to change, which limits their ability to reason about authority that derives from dynamic computation. Our goal is a way to construct dynamic authorization mechanisms that do not violate confidentiality or integrity.
We introduce the Flow-Limited Authorization Calculus (FLAC), which is both a simple, expressive model for reasoning about dynamic authorization and also an information flow control language for securely implementing various authorization mechanisms. FLAC combines the insights of two previous models: it extends the Dependency Core Calculus with features made possible by the Flow-Limited Authorization Model. FLAC provides strong end-to-end information security guarantees even for programs that incorporate and implement rich dynamic authorization mechanisms. These guarantees include noninterference and robust declassification, which prevent attackers from influencing information disclosures in unauthorized ways. We prove these security properties formally for all FLAC programs and explore the expressiveness of FLAC with several examples.
OpenDP is a community effort to build a trustworthy suite of open-source tools for enabling privacy-protective analysis of sensitive personal data, focused on a library of algorithms for generating differentially private statistical releases. The target use cases for OpenDP are to enable government, industry, and academic institutions to safely and confidently share sensitive data to support scientifically oriented research and exploration in the public interest. We aim for OpenDP to flexibly grow with the rapidly advancing science of differential privacy, and be a pathway to bring the newest algorithmic developments to a wide array of practitioners.
OpenDP is led by Faculty Directors Gary King and Salil Vadhan and an Executive Committee at Harvard University, funded in part by a grant from the Sloan Foundation. Its efforts so far have included implementing a differentially private curator application in collaboration with Microsoft, and developing a framework for a community-driven OpenDP Commons through the work of an Ad Hoc Design Committee including external experts. Going forward, the project plans to engage with a wide community of stakeholders, establish partnerships with a wide variety of groups from industry, academia, and government, and adopt increasing levels of community governance.
This article presents a self-help software system that makes rights accessible through an on-line interview. The interview is based on a formal model of the relevant jurisprudence and does not require the involvement of a service representative, only a user who wants to understand his or her rights. In addition, the article provides a methodology for building models and interviews for similar social contexts and describes building a model for workers’ rights according to Israeli law, upon completing their employment. In addition to conducting interviews, these models can be used to create diagrams and perform legal queries. This kind of system can fulfill a central role in empowering disadvantaged populations, as it enables people to asses their rights in a user-friendly manner, which is personalized to the situation of the interviewee and not overburdened with large amounts of information that it is difficult to navigate.
In a recent paper, Chan et al. [SODA ’19] proposed a relaxation of the notion of (full) memory obliviousness, which was introduced by Goldreich and Ostrovsky [J. ACM ’96] and extensively researched by cryptographers. The new notion, differential obliviousness, requires that any two neighboring inputs exhibit similar memory access patterns, where the similarity requirement is that of differential privacy.
Chan et al. demonstrated that differential obliviousness allows achieving improved efficiency for several algorithmic tasks, including sorting, merging of sorted lists, and range query data structures.
In this work, we continue the exploration of differential obliviousness, focusing on algorithms that do not necessarily examine all their input. This choice is motivated by the fact that the existence of logarithmic overhead ORAM protocols implies that differential obliviousness can yield at most a logarithmic improvement in efficiency for computations that need to examine all their input. In particular, we explore property testing, where we show that differential obliviousness yields an almost linear improvement in overhead in the dense graph model, and at most quadratic improvement in the bounded degree model.
We also explore tasks where a non-oblivious algorithm would need to explore different portions of the input, where the latter would depend on the input itself, and where we show that such a behavior can be maintained under differential obliviousness, but not under full obliviousness. Our examples suggest that there would be benefits in further exploring which class of computational tasks are amenable to differential obliviousness.
Symbolic execution is a classical program analysis technique used to show that programs satisfy or violate given specifications. In this work we generalize symbolic execution to support program analysis for relational specifications in the form of relational properties - these are properties about two runs of two programs on related inputs, or about two executions of a single program on related inputs. Relational properties are useful to formalize notions in security and privacy, and to reason about program optimizations. We design a relational symbolic execution engine, named RelSymwhich supports interactive refutation, as well as proving of relational properties for programs written in a language with arrays and for-like loops.
We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. As a case study, we examine the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe \(X\). The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Condentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in \(|X|\), which can be exponential in the bit length of the input.
In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee. One of our algorithms produces a sparse histogram as output. Its "per-bin accuracy" (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of \(\log |X|\), but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram. A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and eciently computable representation of a dense histogram; it is based on an \((n + 1)\) - wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.
This work studies differential privacy in the context of the recently proposed shuffle model. Unlike in the local model, where the server collecting privatized data from users can track back an input to a specific user, in the shuffle model users submit their privatized inputs to a server anonymously. This setup yields a trust model which sits in between the classical curator and local models for differential privacy. The shuffle model is the core idea in the Encode, Shuffle, Analyze (ESA) model introduced by Bittau et al. (SOPS 2017). Recent work by Cheu et al. (EUROCRYPT 2019) analyzes the differential privacy properties of the shuffle model and shows that in some cases shuffled protocols provide strictly better accuracy than local protocols. Additionally, Erlingsson et al. (SODA 2019) provide a privacy amplification bound quantifying the level of curator differential privacy achieved by the shuffle model in terms of the local differential privacy of the randomizer used by each user. In this context, we make three contributions. First, we provide an optimal single message protocol for summation of real numbers in the shuffle model. Our protocol is very simple and has better accuracy and communication than the protocols for this same problem proposed by Cheu et al. Optimality of this protocol follows from our second contribution, a new lower bound for the accuracy of private protocols for summation of real numbers in the shuffle model. The third contribution is a new amplification bound for analyzing the privacy of protocols in the shuffle model in terms of the privacy provided by the corresponding local randomizer. Our amplification bound generalizes the results by Erlingsson et al. to a wider range of parameters, and provides a whole family of methods to analyze privacy amplification in the shuffle model.
Distributed applications cannot assume that their security policies will be enforced on untrusted hosts. Trusted execution environments (TEEs) combined with cryptographic mechanisms enable execution of known code on an untrusted host and the exchange of confidential and authenticated messages with it. TEEs do not, however, establish the trustworthiness of code executing in a TEE. Thus, developing secure applications using TEEs requires specialized expertise and careful auditing. This paper presents DFLATE, a core security calculus for distributed applications with TEEs. DFLATE offers high-level abstractions that reflect both the guarantees and limitations of the underlying security mechanisms they are based on. The accuracy of these abstractions is exhibited by asymmetry between confidentiality and integrity in our formal results: DFLATE enforces a strong form of noninterference for confidentiality, but only a weak form for integrity. This reflects the asymmetry of the security guarantees of a TEE: a malicious host cannot access secrets in the TEE or modify its contents, but they can suppress or manipulate the sequence of its inputs and outputs. Therefore DFLATE cannot protect against the suppression of high-integrity messages, but when these messages are delivered, their contents cannot have been influenced by an attacker.
A secret-sharing scheme allows some authorized sets of parties to reconstruct a secret; the collection of authorized sets is called the access structure. For over 30 years, it was known that any (monotone) collection of authorized sets can be realized by a secret-sharing scheme whose shares are of size2n−o(n)and until recently no better scheme was known. In a recent breakthrough, Liu and Vaikuntanathan (STOC 2018) have reduced the share size to O(20.994n) Our first contribution is improving the exponent of secret sharing down to 0.892. For the special case of linear secret-sharing schemes, we get an exponent of 0.942 (compared to 0.999 of Liu and Vaikuntanathan).Motivated by the construction of Liu and Vaikuntanathan, we study secret-sharing schemes for uniform access structures. An access structure is k-uniform if all sets of size larger than k are authorized, all sets of size smaller than k are unauthorized, and each set of size k can be either authorized or unauthorized. The construction of Liu and Vaikuntanathan starts from protocols for conditional disclosure of secrets, constructs secret-sharing schemes for uniform access structures from them, and combines these schemes in order to obtain secret-sharing schemes for general access structures. Our second contribution in this paper is constructions of secret-sharing schemes for uniform access structures. We achieve the following results:A secret-sharing scheme for k-uniform access structures for large secrets in which the share size is O(k2) times the size of the secret.A linear secret-sharing scheme for k-uniform access structures for a binary secret in which the share size is~O(2h(k/n)n/2) (where h is the binary entropy function). By counting arguments, this construction is optimal (up to polynomial factors).A secret-sharing scheme for k-uniform access structures for a binary secret in which the share size is 2~O(√klogn). Our third contribution is a construction of ad-hoc PSM protocols, i.e., PSM protocols in which only a subset of the parties will compute a function on their inputs. This result is based on ideas we used in the construction of secret-sharing schemes for k-uniform access structures for a binary secret.
Hypothesis testing plays a central role in statistical inference, and is used in many settings where privacy concerns are paramount. This work answers a basic question about privately testing simple hypotheses: given two distributions P and Q, and a privacy level ε, how many i.i.d. samples are needed to distinguish P from Q subject to ε-differential privacy, and what sort of tests have optimal sample complexity? Specifically, we characterize this sample complexity up to constant factors in terms of the structure of P and Q and the privacy level ε, and show that this sample complexity is achieved by a certain randomized and clamped variant of the log-likelihood ratio test. Our result is an analogue of the classical Neyman–Pearson lemma in the setting of private hypothesis testing. We also give an application of our result to private change-point detection. Our characterization applies more generally to hypothesis tests satisfying essentially any notion of algorithmic stability, which is known to imply strong generalization bounds in adaptive data analysis, and thus our results have applications even when privacy is not a primary concern.
A new line of work [6, 9, 15, 2] demonstrates how differential privacy  can be used as a mathematical tool for guaranteeing generalization in adaptive data analysis. Specifically, if a differentially private analysis is applied on a sample S of i.i.d. examples to select a lowsensitivity function f , then w.h.p. f (S) is close to its expectation, although f is being chosen based on the data. Very recently, Steinke and Ullman  observed that these generalization guarantees can be used for proving concentration bounds in the non-adaptive setting, where the low-sensitivity function is fixed beforehand. In particular, they obtain alternative proofs for classical concentration bounds for low-sensitivity functions, such as the Chernoff bound and McDiarmid’s Inequality. In this work, we set out to examine the situation for functions with high-sensitivity, for which differential privacy does not imply generalization guarantees under adaptive analysis. We show that differential privacy can be used to prove concentration bounds for such functions in the non-adaptive setting.
We present a private learner for halfspaces over an arbitrary finite domain X⊂ℝd with sample complexity mathrmpoly(d,2log∗|X|). The building block for this learner is a differentially private algorithm for locating an approximate center point of m>poly(d,2log∗|X|) points -- a high dimensional generalization of the median function. Our construction establishes a relationship between these two problems that is reminiscent of the relation between the median and learning one-dimensional thresholds [Bun et al.\ FOCS '15]. This relationship suggests that the problem of privately locating a center point may have further applications in the design of differentially private algorithms. We also provide a lower bound on the sample complexity for privately finding a point in the convex hull. For approximate differential privacy, we show a lower bound of m=Ω(d+log∗|X|), whereas for pure differential privacy m=Ω(d log|X|).
Is it possible to piece together the confidential data of almost everyone in the US from statistics published by the Census Bureau—without breaching Census security or policy? Could someone—a doctor, a nosy neighbor, or a foreign state actor—determine whether a particular person participated in a genetic study of hundreds of individuals, when each individual contributed only tiny trace amounts of DNA to a highly complex and aggregated genetic mixture? Could police detectives re-trace a suspect’s every movement over the course of many months and thereby learn intimate details about the suspect’s political, religious, and sexual associations—without having to deploy any sort of surveillance or tracking devices? Could someone reliably deduce the sexual preferences of a Facebook user without looking at any content that user has shared?
Until recently, most people probably never imagined that their highly sensitive personal data could be so vulnerable to discovery from seemingly innocuous sources. Many continue to believe that the privacy risks from purely public, statistical, and anonymised data are merely theoretical, and that the practical risks are negligibly small. Yet all of the privacy violations described above are not only theoretically possible—they have already been successfully executed.
The foregoing examples of real-world privacy attacks all leverage one particular vulnerability that we refer to as composition effects. This vulnerability stems from the cumulative erosions of privacy that inhere in every piece of data about people. These erosions occur no matter how aggregated, insignificant, or anonymised the data may seem, and even small erosions can combine in unanticipated ways to create big risks.
Privacy and data protection failures from unanticipated composition effects reflect a type of data myopia—a short-sighted approach toward addressing increasingly-ubiquitous surveillance and privacy risks from Big Data analytics, characterized by a near-total focus on individual data processors and processes and by pervasive underestimation of systemic risks accumulating from independent data products. The failure to recognize accumulation of risk in the information ecosystem reflects a more general societal blind spot to cumulative systemic risks, with parallels in collective failures to foresee or forestall global financial crises, and to adequately address mounting risks to the natural environment.
As the volume and complexity of data uses and publications grow rapidly across a broad range of contexts, the need to develop frameworks for addressing cumulative privacy risks is likely to become an increasingly urgent and widespread problem. Threats to privacy are growing due to the accelerating abundance, and richness, of data about individuals being generated and made publicly available. Furthermore, substantial increases in computing power and algorithmic improvements are making the execution of such attacks more technically feasible. These threats will be impossible to overcome unless regulations are designed to explicitly regulate cumulative risk in a manner that is consistent with the science of composition effects.