# Publications

2021
Salil Vadhan and Tianhao Wang. 5/2021. “Concurrent Composition of Differential Privacy”. Publisher's VersionAbstract

We initiate a study of the composition properties of interactive dierentially private mechanisms. An interactive dierentially private mechanism is an algorithm that allows an analyst to adaptively ask queries about a sensitive dataset, with the property that an adversarial

analyst's view of the interaction is approximately the same regardless of whether or not any individual's data is in the dataset. Previous studies of composition of dierential privacy have focused on non-interactive algorithms, but interactive mechanisms are needed to capture many of the intended applications of dierential privacy and a number of the important dierentially private primitives.

We focus on concurrent composition, where an adversary can arbitrarily interleave its queries to several dierentially private mechanisms, which may be feasible when dierentially private query systems are deployed in practice. We prove that when the interactive mechanisms being composed are pure dierentially private, their concurrent composition achieves privacy parameters (with respect to pure or approximate dierential privacy) that match the (optimal) composition theorem for noninteractive dierential privacy. We also prove a composition theorem for interactive mechanisms that satisfy approximate dierential privacy. That bound is weaker than even the basic (suboptimal) composition theorem for noninteractive dierential privacy, and we leave closing the gap as a direction for future research, along with understanding concurrent composition for other variants of dierential privacy.

Tyler Piazza. 3/2021. “Differentially Private Ridge Regression”.Abstract

Studying problems of interest, like finding trends in medical data, can require analyzing data which contains sensitive and personally identifying information. As a result, it is often infeasible to release these datasets to researchers or to the general public. In this paper, we study algorithms that are differentially private, where there are theoretical guarantees that the mechanisms studied will reveal only limited amounts of information about individual people while still providing insights about large groups. This thesis discusses various forms of linear and ridge regression in this differentially private setting, with the goal of studying sensitive data to make predictions about future sensitive data. In particular, we will discuss the internal privacy-loss budgeting of the differentially private ridge regression technique adaSSP. This thesis provides 3 contributions. First, we discuss the existing SSP and adaSSP algorithms, and provide detailed proofs that they are each differentially private. Second, we introduce the two new algorithms adaSSPbudget and constSSPfull and prove that these are each differentially private. Third, we conduct experiments using synthetic and real world data to explore whether the precise privacy-loss budgeting used within these algorithms could improve their performance.

These experiments will explore the tradeoff between the accuracy of a hyperparameter and the accuracy of the other releases. Through the experimental results, we find that the performance is often insensitive to the particular privacy-loss budgeting and that for certain datasets, no choice of privacy-loss budget allows for the adaptive adaSSPbudget to outperform the standard SSP algorithm.

Marco Gaboardi, Michael Hay, and Salil Vadhan. 2021. “A Programming Framework for OpenDP (extended abstract).” In 6th Workshop on the Theory and Practice of Differential Privacy (TPDP 2020).Abstract

In this paper, we propose a programming framework for the library of dierentially private algorithms that will be at the core of the new OpenDP open-source software project (http://opendp.io/).

2020
Victor Balcer, Albert Cheu, Matthew Joseph, and Jieming Mao. 8/2020. “Connecting Robust Shuffle Privacy and Pan-Privacy.” arXiv. Publisher's VersionAbstract
In the \emph{shuffle model} of differential privacy, data-holding users send randomized messages to a secure shuffler, the shuffler permutes the messages, and the resulting collection of messages must be differentially private with regard to user data. In the \emph{pan-private} model, an algorithm processes a stream of data while maintaining an internal state that is differentially private with regard to the stream data. We give evidence connecting these two apparently different models.
Our results focus on \emph{robustly} shuffle private protocols, whose privacy guarantees are not greatly affected by malicious users. First, we give robustly shuffle private protocols and upper bounds for counting distinct elements and uniformity testing. Second, we use pan-private lower bounds to prove robustly shuffle private lower bounds for both problems. Focusing on the dependence on the domain size k, we find that robust approximate shuffle privacy and approximate pan-privacy have additive error Θ(k√) for counting distinct elements. For uniformity testing, we give a robust approximate shuffle private protocol with sample complexity Õ (k2/3) and show that an Ω(k2/3) dependence is necessary for any robust pure shuffle private tester. Finally, we show that this connection is useful in both directions: we give a pan-private adaptation of recent work on shuffle private histograms and use it to recover further separations between pan-privacy and interactive local privacy.
Micah Altman, Aloni Cohen, Kobbi Nissim, and Alexandra Wood. 7/27/2020. “What a Hybrid Legal-Technical Analysis Teaches Us About Privacy Regulation: The Case of Singling Out.” Proceedings of the National Academy of Sciences, Vol 117, 15. Publisher's VersionAbstract
Abstract

This article advocates a hybrid legal-technical approach to the evaluation of technical measures designed to render information anonymous in order to bring it outside the scope of data protection regulation. The article demonstrates how such an approach can be used for instantiating a key anonymization concept appearing in the EU General Data Protection Regulation (GDPR) -- singling out. The analysis identifies and addresses a tension between a common, compelling theory of singling out and a mathematical analysis of this theory, and it demonstrates how to make determinations regarding the sufficiency of specific technologies for satisfying regulatory requirements for anonymization.

Doubts about the feasibility of effective anonymization and de-identification have gained prominence in recent years in response to high-profile privacy breaches enabled by scientific advances in privacy research, improved analytical capabilities, the wider availability of personal data, and the unprecedented richness of available data sources. At the same time, privacy regulations recognize the possibility, at least in principle, of data anonymization that is sufficiently protective so as to free the resulting (anonymized) data from regulation. As a result, practitioners developing privacy enhancing technologies face substantial uncertainty as to the legal standing of these technologies. More fundamentally, it is not clear how to make a determination of compliance even when the tool is fully described and available for examination.

This gap is symptomatic of a more general problem: Legal and technical approaches to data protection have developed in parallel, and their conceptual underpinnings are growing increasingly divergent. When lawmakers rely on purely legal concepts to engage areas that are affected by rapid scientific and technological change, the resulting laws, when applied in practice, frequently create substantial uncertainty for implementation; provide contradictory recommendations in important cases; disagree with current scientific technical understanding; and fail to scale to the rapid pace of technological development. This article argues that new hybrid concepts, created through technical and legal co-design, can inform practices that are practically complete, coherent, and scalable.

As a case study, the article focuses on a key privacy-related concept appearing in Recital 26 of the General Data Protection Regulation (GDPR) called singling out. We identify a compelling theory of singling out that is implicit in the most persuasive guidance available, and demonstrate that the theory is ultimately incomplete. We then use that theory as the basis for a new and mathematically rigorous privacy concept called predicate singling-out. Predicate singling-out sheds light on the notion of singling out in the GDPR, itself inextricably linked to anonymization. We argue that any data protection tool that purports to anonymize arbitrary personal data under the GDPR must prevent predicate singling-out. This enables, for the first time, a legally- and mathematically-grounded analysis of the standing of supposed anonymization technologies like k-anonymity and differential privacy. The analysis in this article is backed by a technical-mathematical analysis previously published by two of the authors.

Conceptually, our analysis demonstrates that a nuanced understanding of baseline risk is unavoidable for a theory of singling out based on current regulatory guidance. Practically, it identifies previously unrecognized failures of anonymization. In particular, it demonstrates that some k-anonymous mechanisms may allow singling out, challenging the prevailing regulatory guidance.

The article concludes with a discussion of specific recommendations for both policymakers and scholars regarding how to conduct a hybrid legal-technical analysis. Rather than formalizing or mathematizing the law, the article provides approaches for wielding formal tools in the service of practical regulation.

Marco Gaboardi, Kobbi Nissim, and David Purser. 7/2020. “The Complexity of Verifying Loop-free Programs as Differentially Private.” In 47th International Colloquium on Automata, Languages and Programming (To appear - ICALP 2020). ArXiv VersionAbstract

We study the problem of verifying differential privacy for loop-free programs with probabilistic choice. Programs in this class can be seen as randomized Boolean circuits, which we will use as a formal model to answer two different questions: first, deciding whether a program satisfies a prescribed level of privacy; second, approximating the privacy parameters a program realizes. We show that the problem of deciding whether a program satisfies "-differential privacy is coNP#P-complete. In fact, this is the case when either the input domain or the output range of the program is large.

Further, we show that deciding whether a program is (", )-differentially private is coNP#P-hard, and in coNP#P for small output domains, but always in coNP#P#P. Finally, we show that the problem of approximating the level of differential privacy is both NP-hard and coNP-hard. These results complement previous results by Murtagh and Vadhan [34] showing that deciding the optimal composition of differentially private components is #P-complete, and that approximating the optimal composition of differentially private components is in P.

Daniel Alabi, Adam Smith, Audra McMillan, Salil Vadhan, and Jayshree Sarathy. 7/2020. “Differentially Private Simple Linear Regression.” arXiv:2007.05157. Publisher's VersionAbstract
Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.
Micah Altman, Stephen Chong, and Alexandra Wood. 7/2020. “Formalizing Privacy Laws for License Generation and Data Repository Decision Automation.” In 20th Privacy Enhancing Technologies Symposium (To appear - PET 2020). ArXiv VersionAbstract
In this paper, we summarize work-in-progress on expert system support to automate some data deposit and release decisions within a data repository, and to generate custom license agreements for those data transfers.
Our approach formalizes via a logic programming language the privacy-relevant aspects of laws, regulations, and best practices, supported by legal analysis documented in legal memoranda. This formalization enables automated reasoning about the conditions under which a repository can transfer data, through interrogation of users, and the application of formal rules to the facts obtained from users. The proposed system takes the specific conditions for a given data release and produces a custom data use agreement that accurately captures the relevant restrictions on data use. This enables appropriate decisions and accurate licenses, while removing the bottleneck of lawyer effort per data transfer. The operation of the system aims to be transparent, in the sense that administrators, lawyers, institutional review boards, and other interested parties can evaluate the legal reasoning and interpretation embodied in the formalization, and the specific rationale for a decision to accept or release a particular dataset.
Benny Applebaum, Amos Beimel, Oded Nir, and Naty Peter. 6/2020. “Better Secret-Sharing via Robust Conditional Disclosure of Secrets.” In 52nd ACM Symposium on Theory of Computing (To appear - STOC 2020). ePrint VersionAbstract

A secret-sharing scheme allows to distribute a secret s among n parties such that only some predefined

“authorized” sets of parties can reconstruct the secret, and all other “unauthorized” sets learn

nothing about s. The collection of authorized sets is called the access structure. For over 30 years, it

was known that any (monotone) collection of authorized sets can be realized by a secret-sharing scheme

whose shares are of size 2n-o(n)and until recently no better scheme was known. In a recent breakthrough,

Liu and Vaikuntanathan (STOC 2018) have reduced the share size to 20:994n+o(n), which was

later improved to 20:892n+o(n)by Applebaum et al. (EUROCRYPT 2019).

In this paper we improve the exponent of general secret-sharing down to 0:637. For the special case

of linear secret-sharing schemes, we get an exponent of 0:762 (compared to 0:942 of Applebaum et al.).

As our main building block, we introduce a new robust variant of conditional disclosure of secrets

(robust CDS) that achieves unconditional security even under limited form of re-usability. We show that

the problem of general secret-sharing reduces to robust CDS with sub-exponential overhead and derive

our main result by implementing robust CDS with a non-trivial exponent. The latter construction follows

by presenting a general immunization procedure that turns standard CDS into a robust CDS.

Amos Beimel, Aleksandra Korolova, Kobbi Nissim, Or Sheffet, and Uri Stemmer. 6/2020. “The Power of Synergy in Differential Privacy: Combining a Small Curator with Local Randomizers.” In Information-Theoretic Cryptography (To appear - ITC 2020). ArXiv VersionAbstract

Motivated by the desire to bridge the utility gap between local and trusted curator modelsof differential privacy for practical applications, we initiate the theoretical study of a hybridmodel introduced by “Blender” [Avent et al., USENIX Security ’17], in which differentially private protocols of n agents that work in the local-model are assisted by a differentially private curator that has access to the data of m additional users. We focus on the regime where mn and study the new capabilities of this (m;n)-hybrid model. We show that, despite the fact that the hybrid model adds no significant new capabilities for the basic task of simple hypothesistesting, there are many other tasks (under a wide range of parameters) that can be solved in the hybrid model yet cannot be solved either by the curator or by the local-users separately. Moreover, we exhibit additional tasks where at least one round of interaction between the curator and the local-users is necessary – namely, no hybrid model protocol without such interaction can solve these tasks. Taken together, our results show that the combination of the local model with a small curator can become part of a promising toolkit for designing and implementing differential privacy.

Victor Balcer and Albert Cheu. 4/2020. “Separating Local & Shuffled Differential Privacy via Histograms.” arXiv. Publisher's VersionAbstract
Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users' hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This implies an arbitrarily large gap in sample complexity between the shuffled and local models. On the other hand, the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.
Michael Fine. 4/2020. “Certifiably Accurate Private Data Release with Generative Adversarial Networks”.Abstract

Society is caught in a vise. The exponential growth in the power and ubiquity of computing devices has enabled the collection and analysis of data at an unprecedented scale. This Cambrian explosion in data collection promises enormous benefits across commercial, scientific, and policy fields. Unfortunately, this collection and analysis of increasingly-personal data has also proved to be a grave threat to individual’s privacy.

Victor Balcer and Albert Cheu. 4/2020. “Separating Local & Shuffled Differential Privacy via Histograms.” In Information-Theoretic Cryptography (To appear - ITC 2020). ArXiv VersionAbstract

Recent work in differential privacy has highlighted the shuffled model as a promising avenue to compute accurate statistics while keeping raw data in users’ hands. We present a protocol in this model that estimates histograms with error independent of the domain size. This impliesan arbitrarily large gap in sample complexity between the shuffled and local models. On theother hand, we show that the models are equivalent when we impose the constraints of pure differential privacy and single-message randomizers.

Yiling Chen, Or Sheffet, and Salil Vadhan. 2020. “Privacy Games.” ACM Transactions on Economics and Computation, 8, 2, Pp. Article 9. Publisher's VersionAbstract

Version History:

Previously published as: Yiling Chen, Or Sheffet, and Salil Vadhan. Privacy games. In Proceedings of the 10th International Conference on Web and Internet Economics (WINE ‘14), volume 8877 of Lecture Notes in Computer Science, pages 371–385. Springer-Verlag, 14–17 December 2014. (WINE Publisher's Version linked here: https://link.springer.com/chapter/10.1007/978-3-319-13129-0_30); PDF attached as WINE2014.

The problem of analyzing the effect of privacy concerns on the behavior of selfish utility-maximizing agents has received much attention lately. Privacy concerns are often modeled by altering the utility functions of agents to consider also their privacy loss. Such privacy aware agents prefer to take a randomized strategy even in very simple games in which non-privacy aware agents play pure strategies. In some cases, the behavior of privacy aware agents follows the framework of Randomized Response, a well-known mechanism that preserves differential privacy.

Our work is aimed at better understanding the behavior of agents in settings where their privacy concerns are explicitly given. We consider a toy setting where agent A, in an attempt to discover the secret type of agent B, offers B a gift that one type of B agent likes and the other type dislikes. As opposed to previous works, B's incentive to keep her type a secret isn't the result of "hardwiring" B's utility function to consider privacy, but rather takes the form of a payment between B and A. We investigate three different types of payment functions and analyze B's behavior in each of the resulting games. As we show, under some payments, B's behavior is very different than the behavior of agents with hardwired privacy concerns and might even be deterministic. Under a different payment we show that B's BNE strategy does fall into the framework of Randomized Response.

Owen Arden, Anitha Gollamudi, Ethan Cecchetti, Stephen Chong, and Andrew C. Myers. 2020. “A Calculus for Flow-Limited Authorization.” Journal of Computer Security.Abstract

Real-world applications routinely make authorization decisions based on dynamic computation. Reasoning about dynamically computed authority is challenging. Integrity of the system might be compromised if attackers can improperly influence the authorizing computation. Confidentiality can also be compromised by authorization, since authorization decisions are often based on sensitive data such as membership lists and passwords. Previous formal models for authorization do not fully address the security implications of permitting trust relationships to change, which limits their ability to reason about authority that derives from dynamic computation. Our goal is a way to construct dynamic authorization mechanisms that do not violate confidentiality or integrity.

We introduce the Flow-Limited Authorization Calculus (FLAC), which is both a simple, expressive model for reasoning about dynamic authorization and also an information flow control language for securely implementing various authorization mechanisms. FLAC combines the insights of two previous models: it extends the Dependency Core Calculus with features made possible by the Flow-Limited Authorization Model. FLAC provides strong end-to-end information security guarantees even for programs that incorporate and implement rich dynamic authorization mechanisms. These guarantees include noninterference and robust declassification, which prevent attackers from influencing information disclosures in unauthorized ways. We prove these security properties formally for all FLAC programs and explore the expressiveness of FLAC with several examples.

The OpenDP Team. 2020. “The OpenDP White Paper.” In . Publisher's VersionAbstract

Talks:

OpenDP is a community effort to build a trustworthy suite of open-source tools for enabling privacy-protective analysis of sensitive personal data, focused on a library of algorithms for generating differentially private statistical releases. The target use cases for OpenDP are to enable government, industry, and academic institutions to safely and confidently share sensitive data to support scientifically oriented research and exploration in the public interest. We aim for OpenDP to flexibly grow with the rapidly advancing science of differential privacy, and be a pathway to bring the newest algorithmic developments to a wide array of practitioners.

OpenDP is led by Faculty Directors Gary King and Salil Vadhan and an Executive Committee at Harvard University, funded in part by a grant from the Sloan Foundation. Its efforts so far have included implementing a differentially private curator application in collaboration with Microsoft, and developing a framework for a community-driven OpenDP Commons through the work of an Ad Hoc Design Committee including external experts. Going forward, the project plans to engage with a wide community of stakeholders, establish partnerships with a wide variety of groups from industry, academia, and government, and adopt increasing levels of community governance.

2019
Michel Tadjer, Michael Bar-Sinai, and Mor Vilozni. 10/2019. “Social Change Through Computerized Accessibility of Legal Rules.” INSS Cyber Intelligence and Security [Internet], 3, 2, Pp. 81-98. Publisher's VersionAbstract

### This article presents a self-help software system that makes rights accessible through an on-line interview. The interview is based on a formal model of the relevant jurisprudence and does not require the involvement of a service representative, only a user who wants to understand his or her rights. In addition, the article provides a methodology for building models and interviews for similar social contexts and describes building a model for workers’ rights according to Israeli law, upon completing their employment. In addition to conducting interviews, these models can be used to create diagrams and perform legal queries. This kind of system can fulfill a central role in empowering disadvantaged populations, as it enables people to asses their rights in a user-friendly manner, which is personalized to the situation of the interviewee and not overburdened with large amounts of information that it is difficult to navigate.

Amos Beimel, Kobbi Nissim, and Mohammad Zaherix. 10/2019. “Exploring Differential Obliviousness.” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 10/2019. Publisher's VersionAbstract

In a recent paper, Chan et al. [SODA ’19] proposed a relaxation of the notion of (full) memory obliviousness, which was introduced by Goldreich and Ostrovsky [J. ACM ’96] and extensively researched by cryptographers. The new notion, differential obliviousness, requires that any two neighboring inputs exhibit similar memory access patterns, where the similarity requirement is that of differential privacy.

Chan et al. demonstrated that differential obliviousness allows achieving improved efficiency for several algorithmic tasks, including sorting, merging of sorted lists, and range query data structures.

In this work, we continue the exploration of differential obliviousness, focusing on algorithms that do not necessarily examine all their input. This choice is motivated by the fact that the existence of logarithmic overhead ORAM protocols implies that differential obliviousness can yield at most a logarithmic improvement in efficiency for computations that need to examine all their input. In particular, we explore property testing, where we show that differential obliviousness yields an almost linear improvement in overhead in the dense graph model, and at most quadratic improvement in the bounded degree model.

We also explore tasks where a non-oblivious algorithm would need to explore different portions of the input, where the latter would depend on the input itself, and where we show that such a behavior can be maintained under differential obliviousness, but not under full obliviousness. Our examples suggest that there would be benefits in further exploring which class of computational tasks are amenable to differential obliviousness.

Gian Pietro Farina, Stephen Chong, and Marco Gaboardi. 10/2019. “Relational Symbolic Execution.” In 21st International Symposium on Principles and Practice of Declarative Programming (PPDP 2019). Publisher's VersionAbstract

Symbolic execution is a classical program analysis technique used to show that programs satisfy or violate given specifications. In this work we generalize symbolic execution to support program analysis for relational specifications in the form of relational properties - these are properties about two runs of two programs on related inputs, or about two executions of a single program on related inputs. Relational properties are useful to formalize notions in security and privacy, and to reason about program optimizations. We design a relational symbolic execution engine, named RelSymwhich supports interactive refutation, as well as proving of relational properties for programs written in a language with arrays and for-like loops.

Victor Balcer and Salil Vadhan. 9/2019. “Differential Privacy on Finite Computers.” Journal of Privacy and Confidentiality, 9, 2. JPC PageAbstract

Version History:

Also presented at TPDP 2017; preliminary version posted as arXiv:1709.05396 [cs.DS].

2018: Published in Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pp 43:1-43:21. http://drops.dagstuhl.de/opus/frontdoor.php?source_opus=8353

We consider the problem of designing and analyzing differentially private algorithms that can be implemented on discrete models of computation in strict polynomial time, motivated by known attacks on floating point implementations of real-arithmetic differentially private algorithms (Mironov, CCS 2012) and the potential for timing attacks on expected polynomial-time algorithms. As a case study, we examine the basic problem of approximating the histogram of a categorical dataset over a possibly large data universe $$X$$. The classic Laplace Mechanism (Dwork, McSherry, Nissim, Smith, TCC 2006 and J. Privacy & Condentiality 2017) does not satisfy our requirements, as it is based on real arithmetic, and natural discrete analogues, such as the Geometric Mechanism (Ghosh, Roughgarden, Sundarajan, STOC 2009 and SICOMP 2012), take time at least linear in $$|X|$$, which can be exponential in the bit length of the input.

In this paper, we provide strict polynomial-time discrete algorithms for approximate histograms whose simultaneous accuracy (the maximum error over all bins) matches that of the Laplace Mechanism up to constant factors, while retaining the same (pure) differential privacy guarantee.  One of our algorithms produces a sparse histogram as output. Its "per-bin accuracy" (the error on individual bins) is worse than that of the Laplace Mechanism by a factor of $$\log |X|$$, but we prove a lower bound showing that this is necessary for any algorithm that produces a sparse histogram.  A second algorithm avoids this lower bound, and matches the per-bin accuracy of the Laplace Mechanism, by producing a compact and eciently computable representation of a dense histogram; it is based on an $$(n + 1)$$ - wise independent implementation of an appropriately clamped version of the Discrete Geometric Mechanism.