Computing Over Distributed Sensitive Data

Large amounts of data are being collected about individuals by a variety of organizations: government agencies, banks, hospitals, research institutions, privacy companies, etc. Many of these organizations collect similar data, or data about similar populations. Sharing this data between organizations could bring about many benefits in social, scientific, business, and security domains. For example, by sharing their data, hospitals and small clinics can obtain statistically significant results in cases where the individual datasets are otherwise too small. Unfortunately, much of the collected data is sensitive: it contains personal details about individuals or information that may damage an organization’s reputation and competitiveness. The sharing of data is hence often curbed for ethical, legal, or business reasons. 

This project develops a collection of tools that will enable the benefits of data sharing without requiring data owners to share their data. The techniques developed respect principles of data ownership and privacy requirement, and draw on recent scientific developments in privacy, cryptography, machine learning, computational statistics, program verification, and system security. The tools developed in this project will contribute to existing research and business infrastructure, and hence enable new ways to create value in information whose use would otherwise have been restricted. The project supports the development of new curricula material and trains a new generation of researchers and citizens with the multidisciplinary perspectives required to address the complex issues surrounding data privacy.

This project is funded by grant 1565387 from the National Science Foundation to Harvard University and SUNY at Buffalo.

Personnel

Salil Vadhan

Salil Vadhan

Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard
Area Chair for Computer Science

Salil Vadhan is the lead PI of Privacy Tools for Sharing Research Data project and the Vicky Joseph Professor of Computer Science and Applied Mathematics...

Read more about Salil Vadhan
Marco Gaboardi

Marco Gaboardi

Visiting Scholar, Center for Research on Computation & Society
State University of New York at Buffalo
Current Member of Datatags Team
Victor Balcer

Victor Balcer

Undergraduate researcher (REU summer 2014)
Graduate student, Theory of Computing Group

Victor joined the project in summer 2014 as an REU student. In 2015 Victor completed his undergraduate work in Computer Science & Mathematics at...

Read more about Victor Balcer
  •  
  • 1 of 2
  • »

Publications

E. Cicek, G. Barthe, M. Gaboardi, D. Garg, and J. Hoffmann. 1/2017. “Relational Cost Analysis.” Symposium on the Principle of Programming Languages, ACM.Abstract
Establishing quantitative bounds on the execution cost of programs is essential in many areas of computer science such as complexity analysis, compiler optimizations, security and privacy. Techniques based on program analysis, type systems and abstract interpretation are well-studied, but methods for analyzing how the execution costs of two programs compare to each other have not received attention. Naively combining the worst and best case execution costs of the two programs does not work well in many cases because such analysis forgets the similarities between the programs or the inputs. In this work, we propose a relational cost analysis technique that is capable of establishing precise bounds on the difference in the execution cost of two programs by making use of relational properties of programs and inputs. We develop RelCost, a refinement type and effect system for a higher-order functional language with recursion and subtyping. The key novelty of our technique is the combination of relational refinements with two modes of typing—relational typing for reasoning about similar computations/inputs and unary typing for reasoning about unrelated computations/inputs. This combination allows us to analyze the execution cost difference of two programs more precisely than a naive non-relational approach. We prove our type system sound using a semantic model based on step-indexed unary and binary logical relations accounting for non-relational and relational reasoning principles with their respective costs. We demonstrate the precision and generality of our technique through examples.
Georgios Kellaris, George Kollios, Kobbi Nissim, and Adam O'Neill. 10/2016. “Generic Attacks on Secure Outsourced Databases.” 23rd ACM Conference on Computer and Communications Security.Abstract

Recently, various protocols have been proposed for securely outsourcing database storage to a third party server, ranging from systems with “full-fledged” security based on strong cryptographic primitives such as fully homomorphic encryption or oblivious RAM, to more practical implementations based on searchable symmetric encryption or even on deterministic and order-preserving encryption. On the flip side, various attacks have emerged that show that for some of these protocols confidentiality of the data can be compromised, usually given certain auxiliary information. We take a step back and identify a need for a formal understanding of the inherent efficiency/privacy trade-off in outsourced database systems, independent of the details of the system. We propose abstract models that capture secure outsourced storage systems in sufficient generality, and identify two basic sources of leakage, namely access pattern and communication volume. We use our models to distinguish certain classes of outsourced database systems that have been proposed, and deduce that all of them exhibit at least one of these leakage sources. We then develop generic reconstruction attacks on any system supporting range queries where either access pattern or communication volume is leaked. These attacks are in a rather weak passive adversarial model, where the untrusted server knows only the underlying query distribution. In particular, to perform our attack the server need not have any prior knowledge about the data, and need not know any of the issued queries nor their results. Yet, the server can reconstruct the secret attribute of every record in the database after about N 4 queries, where N is the domain size. We provide a matching lower bound showing that our attacks are essentially optimal. Our reconstruction attacks using communication volume apply even to systems based on homomorphic encryption or oblivious RAM in the natural way. Finally, we provide experimental results demonstrating the efficacy of our attacks on real datasets with a variety of different features. On all these datasets, after the required number of queries our attacks successfully recovered the secret attributes of every record in at most a few seconds.

  • «
  • 2 of 2
  •