Computing Over Distributed Sensitive Data

Large amounts of data are being collected about individuals by a variety of organizations: government agencies, banks, hospitals, research institutions, privacy companies, etc. Many of these organizations collect similar data, or data about similar populations. Sharing this data between organizations could bring about many benefits in social, scientific, business, and security domains. For example, by sharing their data, hospitals and small clinics can obtain statistically significant results in cases where the individual datasets are otherwise too small. Unfortunately, much of the collected data is sensitive: it contains personal details about individuals or information that may damage an organization’s reputation and competitiveness. The sharing of data is hence often curbed for ethical, legal, or business reasons. 

This project develops a collection of tools that will enable the benefits of data sharing without requiring data owners to share their data. The techniques developed respect principles of data ownership and privacy requirement, and draw on recent scientific developments in privacy, cryptography, machine learning, computational statistics, program verification, and system security. The tools developed in this project will contribute to existing research and business infrastructure, and hence enable new ways to create value in information whose use would otherwise have been restricted. The project supports the development of new curricula material and trains a new generation of researchers and citizens with the multidisciplinary perspectives required to address the complex issues surrounding data privacy.

This project is funded by grant 1565387 from the National Science Foundation to Harvard University and SUNY at Buffalo.


Salil Vadhan

Salil Vadhan

Principal Investigator
Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard
Marco Gaboardi

Marco Gaboardi

Visiting Scholar, Center for Research on Computation & Society
State University of New York at Buffalo
Current Member of Datatags Team
Victor Balcer

Victor Balcer

Undergraduate Researcher (REU Summer 2014)
Graduate student, Theory of Computing Group

Victor joined the project in summer 2014 as an REU student. In 2015 Victor completed his undergraduate work in Computer Science & Mathematics at...

Read more about Victor Balcer
  • 1 of 2
  • »


Gian Pietro Farina, Stephen Chong, and Marco Gaboardi. 10/2019. “Relational Symbolic Execution.” In 21st International Symposium on Principles and Practice of Declarative Programming (PPDP 2019). Publisher's VersionAbstract

Symbolic execution is a classical program analysis technique used to show that programs satisfy or violate given specifications. In this work we generalize symbolic execution to support program analysis for relational specifications in the form of relational properties - these are properties about two runs of two programs on related inputs, or about two executions of a single program on related inputs. Relational properties are useful to formalize notions in security and privacy, and to reason about program optimizations. We design a relational symbolic execution engine, named RelSymwhich supports interactive refutation, as well as proving of relational properties for programs written in a language with arrays and for-like loops.

Benny Applebaum, Amos Beimel, Oded Nir, and Naty Peter. 6/2020. “Better Secret-Sharing via Robust Conditional Disclosure of Secrets.” In 52nd ACM Symposium on Theory of Computing (To appear - STOC 2020). ePrint VersionAbstract

A secret-sharing scheme allows to distribute a secret s among n parties such that only some predefined

“authorized” sets of parties can reconstruct the secret, and all other “unauthorized” sets learn

nothing about s. The collection of authorized sets is called the access structure. For over 30 years, it

was known that any (monotone) collection of authorized sets can be realized by a secret-sharing scheme

whose shares are of size 2n-o(n)and until recently no better scheme was known. In a recent breakthrough,

Liu and Vaikuntanathan (STOC 2018) have reduced the share size to 20:994n+o(n), which was

later improved to 20:892n+o(n)by Applebaum et al. (EUROCRYPT 2019).


In this paper we improve the exponent of general secret-sharing down to 0:637. For the special case

of linear secret-sharing schemes, we get an exponent of 0:762 (compared to 0:942 of Applebaum et al.).

As our main building block, we introduce a new robust variant of conditional disclosure of secrets

(robust CDS) that achieves unconditional security even under limited form of re-usability. We show that

the problem of general secret-sharing reduces to robust CDS with sub-exponential overhead and derive

our main result by implementing robust CDS with a non-trivial exponent. The latter construction follows

by presenting a general immunization procedure that turns standard CDS into a robust CDS.

Amos Beimel, Kobbi Nissim, and Mohammad Zaherix. 10/2019. “Exploring Differential Obliviousness.” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 10/2019. Publisher's VersionAbstract

In a recent paper, Chan et al. [SODA ’19] proposed a relaxation of the notion of (full) memory obliviousness, which was introduced by Goldreich and Ostrovsky [J. ACM ’96] and extensively researched by cryptographers. The new notion, differential obliviousness, requires that any two neighboring inputs exhibit similar memory access patterns, where the similarity requirement is that of differential privacy.

Chan et al. demonstrated that differential obliviousness allows achieving improved efficiency for several algorithmic tasks, including sorting, merging of sorted lists, and range query data structures.

In this work, we continue the exploration of differential obliviousness, focusing on algorithms that do not necessarily examine all their input. This choice is motivated by the fact that the existence of logarithmic overhead ORAM protocols implies that differential obliviousness can yield at most a logarithmic improvement in efficiency for computations that need to examine all their input. In particular, we explore property testing, where we show that differential obliviousness yields an almost linear improvement in overhead in the dense graph model, and at most quadratic improvement in the bounded degree model.

We also explore tasks where a non-oblivious algorithm would need to explore different portions of the input, where the latter would depend on the input itself, and where we show that such a behavior can be maintained under differential obliviousness, but not under full obliviousness. Our examples suggest that there would be benefits in further exploring which class of computational tasks are amenable to differential obliviousness.

Anitha Gollamudi, Owen Arden, and Stephen Chong. 6/2019. “Information Flow Control for Distributed Trusted Execution Environments.” In Computer Security Foundations. Publisher's VersionAbstract
Distributed applications cannot assume that their security policies will be enforced on untrusted hosts. Trusted execution environments (TEEs) combined with cryptographic mechanisms enable execution of known code on an untrusted host and the exchange of confidential and authenticated messages with it. TEEs do not, however, establish the trustworthiness of code executing in a TEE. Thus, developing secure applications using TEEs requires specialized expertise and careful auditing. This paper presents DFLATE, a core security calculus for distributed applications with TEEs. DFLATE offers high-level abstractions that reflect both the guarantees and limitations of the underlying security mechanisms they are based on. The accuracy of these abstractions is exhibited by asymmetry between confidentiality and integrity in our formal results: DFLATE enforces a strong form of noninterference for confidentiality, but only a weak form for integrity. This reflects the asymmetry of the security guarantees of a TEE: a malicious host cannot access secrets in the TEE or modify its contents, but they can suppress or manipulate the sequence of its inputs and outputs. Therefore DFLATE cannot protect against the suppression of high-integrity messages, but when these messages are delivered, their contents cannot have been influenced by an attacker.