Differential Privacy

Research Overview:

The goals of the Differential Privacy research group are to:

  • Design and implement differentially private tools that will enable social scientists to share useful statistical information about sensitive datasets.
  • Integrate these tools with the widely-used platforms developed by The Institute for Quantitative Social Science for sharing and exploring research data.
  • Advance the theory of differential privacy in a variety of settings, including statistical analysis (e.g. statistical estimation, regression, and answering many statistical queries), machine learning, and economic mechanism design.

Please see below for further information about differential privacy and our group's work. To get involved, please email privacytools-info@seas.harvard.edu and see our calls for summer interns, students, postdocs and visiting scholars.

About the Differential Privacy Group

What is differential privacy?

Differential privacy is a rigorous mathematical definition of privacy. In the simplest setting, consider an algorithm that analyzes a dataset and computes statistics about it (such as the data's mean, variance, median, mode, etc.). Such an algorithm is said to be differentially private if by looking at the output, one cannot tell whether any individual's data was included in the original dataset or not. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset -- anything the algorithm might output on a database containing some individual's information is almost as likely to have come from a database without that individual's information. Most notably, this guarantee holds for any individual and any dataset. Therefore, regardless of how eccentric any single individual's details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal guarantee that individual-level information about participants in the database is not leaked.
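The verbal guarantee above can be stated formally. In the pure (epsilon-only) variant of the standard definition, a randomized algorithm M is epsilon-differentially private if for every pair of datasets D and D' differing in one individual's record, and every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].
```

A small epsilon means the output distributions with and without any one person are nearly indistinguishable; a relaxed variant with a second parameter delta additionally allows a small failure probability.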

The definition of differential privacy emerged from a long line of work applying algorithmic ideas to the study of privacy (Dinur and Nissim '03; Dwork and Nissim '04; Blum, Dwork, McSherry, and Nissim '05), culminating with the work of Dwork, McSherry, Nissim, and Smith '06.

See our educational materials for more detail about the formal definition of differential privacy and its semantic guarantees.

Why do we need a mathematical definition of privacy?

Many heuristics are used to preserve the privacy of individuals in research databases. Anonymization (the removal of "identifiable" attributes, such as names, addresses, SSNs, and IP addresses) is the most commonly used technique. However, such heuristics come with no formal guarantees and have repeatedly been shown to fail. In a striking example, Latanya Sweeney showed that gender, date of birth, and zip code are sufficient to uniquely identify the vast majority of Americans. By linking these attributes in a supposedly anonymized healthcare database to public voter records, she was able to identify the individual health record of the Governor of Massachusetts. These "linkage attacks" motivate the need for a robust definition of privacy -- one that is immune to attacks using auxiliary knowledge, including knowledge whose availability a data curator cannot predict.

Another line of work has shown that answering too many innocuous (even completely random) queries about a database inherently violates the privacy of its individual contributors. These works expose a fundamental tradeoff between statistical utility and privacy. In order to understand this tradeoff and locate socially desirable outcomes, we must be able to formally define privacy in the first place.

Privacy as a quantifiable measure

A crucial feature of differential privacy is that it defines privacy not as a binary notion of "was an individual's data exposed or not," but rather as a matter of cumulative risk: every time a person's data is processed, her risk of exposure increases. To this end, the definition of differential privacy is equipped with parameters (epsilon and delta) that quantify the "privacy loss" -- the additional risk to an individual that results from her data being used. Regardless of any auxiliary knowledge used in a re-identification attack, the risk to one's privacy caused by a differentially private algorithm will forever be bounded by this privacy loss.
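This cumulative view of risk can be made concrete with the basic composition property of differential privacy, a standard result sketched here in illustrative code (the function name is ours, not from any particular library):

```python
def total_privacy_loss(epsilons):
    """Basic composition: if a person's data is used by k differentially
    private mechanisms with privacy losses e_1, ..., e_k, the combined
    release has privacy loss at most e_1 + ... + e_k.
    (Tighter "advanced composition" bounds also exist.)"""
    return sum(epsilons)

# Three separate analyses at epsilon = 0.1 each consume at most
# epsilon = 0.3 of an individual's privacy in total.
combined = total_privacy_loss([0.1, 0.1, 0.1])
```

The key point is that the loss is tracked and bounded: a data curator can budget how much total risk any one person ever accumulates.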

When is Differential Privacy useful?

Extensive theoretical research suggests that differential privacy can enable research data to be shared in a wide variety of settings.* The simplest and most well-studied scenario is statistical query release: a data owner can specify counting queries, such as "how many people in the database are male?" or "how many people in the database live in Massachusetts?", and receive answers perturbed by a small amount of random noise. Differentially private algorithms can answer a large number of such queries approximately, so that a researcher seeing these approximate answers can draw roughly the same conclusions as if she had the data herself.
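A minimal sketch of this scenario is the classic Laplace mechanism for counting queries (the code below is illustrative, not from our tools; the dataset and function names are invented for the example):

```python
import numpy as np

def private_count(data, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so adding noise drawn from a
    Laplace distribution with scale 1/epsilon suffices.
    """
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy dataset: 60 of 100 records are from Massachusetts.
data = [{"state": "MA"}] * 60 + [{"state": "NY"}] * 40
rng = np.random.default_rng(0)
answer = private_count(data, lambda r: r["state"] == "MA", epsilon=0.5, rng=rng)
# `answer` is 60 plus Laplace noise of scale 2: close to the true count
# on average, yet no single record noticeably shifts its distribution.
```

Because the noise has mean zero and fixed scale, a researcher can still estimate the true proportion well, especially on large datasets.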

However, the reach of differential privacy extends far beyond the simple case of statistical queries. For instance, there are differentially private versions of algorithms in machine learning, game theory and economic mechanism design, statistical estimation, and streaming.

It is worth remarking that differential privacy works better on larger databases. This is because as the number of individuals in a database grows, the effect of any single individual on a given aggregate statistic diminishes.
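To make this concrete: the Laplace noise needed to release a count privately has a fixed scale of 1/epsilon regardless of the dataset size, so the error relative to the population shrinks as the database grows. A back-of-the-envelope sketch (numbers purely illustrative):

```python
epsilon = 0.1
noise_scale = 1.0 / epsilon  # Laplace scale for a sensitivity-1 count; independent of n

# Expected error of an estimated proportion, as a fraction of n records:
relative_error = {n: noise_scale / n for n in (100, 10_000, 1_000_000)}
# The same absolute noise that overwhelms a 100-person database is
# negligible for a database of a million people.
```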

*Of course, a major challenge -- and one focus of the PrivacyTools project -- is to bring these theoretical results into practice.

How does differential privacy fit into the PrivacyTools project?

Our goal is to integrate the definitions and algorithmic tools from differential privacy into several IQSS projects for sharing and exploring research data, especially the widely-used Dataverse platform. We are also incorporating differential privacy into related projects, including DataTags, TwoRavens, and Zelig.

The Dataverse project is a software infrastructure used for hosting data repositories around the world, enabling researchers to share, preserve, cite, explore, and analyze research data. Our goal is to augment Dataverse to enable differentially private access to sensitive datasets that currently cannot be safely shared. What makes this particularly challenging (compared to many other practical applications of differential privacy) is that the tools need to be general purpose, applying to a wide variety of datasets uploaded to Dataverse repositories, and automated, with no differential privacy expert optimizing the algorithms for each dataset or analyst. Consequently, we envision that the differentially private access we provide will allow researchers to perform rough preliminary analyses that help determine whether it is worth the effort to apply for access to the raw data.

DataTags is a PrivacyTools project that generates guidelines for how dataset holders should share their data in compliance with the relevant privacy laws and regulations. To use the tool, a dataset holder engages with an automated interview process, which produces a "Data Tag" telling the user how the data can be shared, how it can be stored, etc. We are working to incorporate differential privacy into these tags, especially to enable sharing of data where the current tags do not allow for public release. For instance, we are assessing the protection guaranteed by various settings of the differential privacy parameters (epsilon and delta) so that we can make recommendations for which parameters are appropriate for each level of tag.

Zelig is a user-friendly package built on R for performing statistical methods and interpreting and presenting the results. TwoRavens, integrated with Zelig and Dataverse, is a browser-based tool for exploring and analyzing data. We are working towards creating differentially private versions of the core functionalities of these projects.

Differential Privacy Tool

Please refer to this link to test out our differentially private tool.

How have students contributed to the project?

Harvard Undergraduate and Graduate Students

Harvard students have contributed to all aspects of our group's theoretical and applied work on differential privacy. These projects have culminated in PhD and undergraduate theses, as well as numerous research papers. Please see our student research page for posters describing the projects students have worked on.

Summer Interns

In Summer 2014, we hosted a Research Experience for Undergraduates (REU) in which students designed, implemented, and experimentally assessed differentially private tools for a number of statistical tasks. In this first effort, we focused on the fundamental tasks of estimating histograms, quantiles, means, and covariances. We also had a student architect a system that enables a data depositor to choose how his privacy budget is divided across these various functionalities.
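The budget-splitting idea behind that student project can be sketched as follows (a hypothetical illustration of the general principle, not the system's actual code; all names are invented): the depositor divides a total privacy budget across the released statistics, and by basic composition the combined release satisfies differential privacy with the total epsilon.

```python
def split_budget(epsilon_total, weights):
    """Divide a total privacy budget across statistics in proportion to
    depositor-chosen weights. Under basic composition, releasing each
    statistic with its allocated epsilon keeps the combined privacy loss
    at most epsilon_total."""
    total_weight = sum(weights.values())
    return {stat: epsilon_total * w / total_weight for stat, w in weights.items()}

# A depositor who cares most about an accurate histogram might choose:
allocation = split_budget(1.0, {"histogram": 2, "mean": 1, "quantiles": 1})
# The per-statistic epsilons sum (up to float error) to the total budget.
```

Weighting lets the depositor spend more of the budget, and hence get less noise, on the statistics that matter most.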

Publications

Relevant Publications by the Differential Privacy Team:

2016

M. Bun, K. Nissim, and U. Stemmer, “Simultaneous private learning of multiple concepts,” 2016.

2015

M. Bun and M. Zhandry, “Order revealing encryption and the hardness of private learning,” in Proceedings of the 12th Theory of Cryptography Conference (TCC 2016), Tel-Aviv, Israel, 2016.

J. Murtagh and S. Vadhan, “The Complexity of Computing the Optimal Composition of Differential Privacy,” in Theory of Cryptography Conference (TCC 2016), 2016.

C. Dwork, A. Smith, T. Steinke, J. Ullman, and S. Vadhan, “Robust Traceability from Trace Amounts,” in IEEE Symposium on Foundations of Computer Science (FOCS 2015), Berkeley, California, 2015.

S. Zheng, The Differential Privacy of Bayesian Inference. Bachelor's thesis, Harvard College, 2015.

M. Bun, K. Nissim, U. Stemmer, and S. Vadhan, “Differentially Private Release and Learning of Threshold Functions,” in 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2015), Berkeley, California, 2015.

O. Sheffet, “Private Approximations of the 2nd-Moment Matrix Using Existing Techniques in Linear Regression,” in Neural Information Processing Systems (NIPS 2015), 2015.

M. Bun and J. Thaler, “Hardness Amplification and the Approximate Degree of Constant-Depth Circuits,” in International Colloquium on Automata, Languages, and Programming (ICALP 2015), 2015.

K. Nissim and U. Stemmer, “On the Generalization Properties of Differential Privacy,” 2015.

T. Steinke and J. Ullman, “Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery,” JMLR: Workshop and Conference Proceedings, vol. 40, pp. 1-41, 2015.

K. Nissim and D. Xiao, “Mechanism Design and Differential Privacy,” in Encyclopedia of Algorithms, New York, NY: Springer Berlin Heidelberg, 2015, pp. 1-12.

J. Honaker, “Efficient Use of Differentially Private Binary Trees,” in Theory and Practice of Differential Privacy (TPDP 2015), London, UK, 2015.

T. Steinke and J. Ullman, “Between Pure and Approximate Differential Privacy,” in Theory and Practice of Differential Privacy (TPDP 2015), London, UK, 2015.

A. Beimel, K. Nissim, and U. Stemmer, “Learning Privately with Labeled and Unlabeled Examples,” in ACM-SIAM Symposium on Discrete Algorithms (SODA 2015), 2015.

2014

Y. Chen, O. Sheffet, and S. Vadhan, “Privacy Games,” in 10th Conference on Web and Internet Economics (WINE 2014), Beijing, China, 2014.

R. Bassily, A. Smith, and A. Thakurta, “Private Empirical Risk Minimization, Revisited,” in ICML 2014 Workshop on Learning, Security and Privacy, Beijing, China, 2014.

M. Kearns, M. Pai, A. Roth, and J. Ullman, “Mechanism Design in Large Games: Incentives and Privacy,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science (ITCS 2014), New York, NY, USA, 2014, pp. 403-410.

K. Chandrasekaran, J. Thaler, J. Ullman, and A. Wan, “Faster Private Release of Marginals on Small Databases,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science (ITCS 2014), New York, NY, USA, 2014, pp. 387-402.

M. Bun, J. Ullman, and S. Vadhan, “Fingerprinting Codes and the Price of Approximate Differential Privacy,” in Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC 2014), New York, NY, USA, 2014, pp. 1-10.

K. Nissim, S. Vadhan, and D. Xiao, “Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science (ITCS 2014), New York, NY, USA, 2014, pp. 411-422.

T. Steinke and J. Ullman, “Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery,” arXiv preprint arXiv:1410.1228, 2014.

L. Waye, “Privacy Integrated Data Stream Queries,” in Privacy and Security in Programming, 2014.

V. Feldman and D. Xiao, “Sample Complexity Bounds on Differentially Private Learning via Communication Complexity,” in Proceedings of the 27th Conference on Learning Theory (COLT 2014), JMLR Workshop and Conference Proceedings, vol. 35, pp. 1-20, 2014.

J. Hsu, et al., “Privately Solving Linear Programs,” in Automata, Languages, and Programming, vol. 8572, Springer Berlin Heidelberg, 2014, pp. 612-624.

M. M. Pai, A. Roth, and J. Ullman, “An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring,” CoRR, vol. abs/1402.2801, 2014.

2013

S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith, “Analyzing Graphs with Node Differential Privacy,” in Proceedings of the 10th Theory of Cryptography Conference (TCC 2013), Berlin, Heidelberg, 2013, pp. 457-476.

A. Beimel, K. Nissim, and U. Stemmer, “Characterizing the Sample Complexity of Private Learners,” in Proceedings of the 4th Conference on Innovations in Theoretical Computer Science (ITCS 2013), New York, NY, USA, 2013, pp. 97-110.

J. Ullman, “Answering n^{2+o(1)} counting queries with differential privacy is hard,” in Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC 2013), Palo Alto, California, USA, 2013, pp. 361-370.

J. Hsu, A. Roth, and J. Ullman, “Differential privacy for the analyst via private equilibrium computation,” in Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC 2013), Palo Alto, California, USA, 2013, pp. 341-350.

G. N. Rothblum, S. Vadhan, and A. Wigderson, “Interactive proofs of proximity: delegating computation in sublinear time,” in Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC 2013), Palo Alto, California, USA, 2013, pp. 793-802.

Y. Chen, S. Chong, I. A. Kash, T. Moran, and S. Vadhan, “Truthful mechanisms for agents that value privacy,” in Proceedings of the 14th ACM Conference on Electronic Commerce (EC 2013), Philadelphia, Pennsylvania, USA, 2013, pp. 215-232.

A. Beimel, et al., “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy,” in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, vol. 8096, Springer Berlin Heidelberg, 2013, pp. 363-378.

M. Bun and J. Thaler, “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities,” in Automata, Languages, and Programming (ICALP 2013), vol. 7965, pp. 303-314, 2013.

K. Chandrasekaran, J. Thaler, J. Ullman, and A. Wan, “Faster Private Release of Marginals on Small Databases,” CoRR, vol. abs/1304.3754, 2013.

2012

J. Thaler, J. Ullman, and S. P. Vadhan, “Faster Algorithms for Privately Releasing Marginals,” in Automata, Languages, and Programming: 39th International Colloquium (ICALP 2012), Warwick, UK, Lecture Notes in Computer Science, vol. 7391, 2012.

C. Dwork, M. Naor, and S. Vadhan, “The Privacy of the Analyst and the Power of the State,” in Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2012), New Brunswick, NJ, 2012, pp. 400-409.

A. Gupta, A. Roth, and J. Ullman, “Iterative Constructions and Private Data Release,” in Theory of Cryptography: 9th Theory of Cryptography Conference (TCC 2012), Taormina, Sicily, Italy, Lecture Notes in Computer Science, vol. 7194, 2012, pp. 339-356.

M. Kearns, M. Pai, A. Roth, and J. Ullman, “Private Equilibrium Release, Large Games, and No-Regret Learning,” arXiv:1207.4084, 2012.

2011

S. Vadhan, et al., “Comments on Advance Notice of Proposed Rulemaking: Human Subjects Research Protections: Enhancing Protections for Research Subjects and Reducing Burden, Delay, and Ambiguity for Investigators, Docket ID number HHS-OPHS-2011-0005,” 2011.

Y. Chen, S. Chong, I. A. Kash, T. Moran, and S. P. Vadhan, “Truthful Mechanisms for Agents that Value Privacy,” CoRR, vol. abs/1111.5472, 2011.

A. Gupta, M. Hardt, A. Roth, and J. Ullman, “Privately releasing conjunctions and the statistical query barrier,” in Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC 2011), San Jose, CA, USA, 2011, pp. 803-812.

J. Ullman and S. Vadhan, “PCPs and the Hardness of Generating Synthetic Data,” in Proceedings of the 8th IACR Theory of Cryptography Conference (TCC 2011), Providence, RI, Lecture Notes in Computer Science, 2011, pp. 572-587.

2010

C. Dwork, G. Rothblum, and S. Vadhan, “Boosting and Differential Privacy,” in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), Las Vegas, NV, 2010, pp. 51-60.

A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. Vadhan, “The Limits of Two-Party Differential Privacy,” in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2010), Las Vegas, NV, 2010, pp. 81-90.

2009

C. Dwork, M. Naor, O. Reingold, G. Rothblum, and S. Vadhan, “On the Complexity of Differentially Private Data Release: Efficient Algorithms and Hardness Results,” in Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC 2009), Bethesda, MD, 2009, pp. 381-390.

I. Mironov, O. Pandey, O. Reingold, and S. Vadhan, “Computational Differential Privacy,” in Advances in Cryptology - CRYPTO 2009, Santa Barbara, CA, vol. 5677, 2009, pp. 126-142.

Educational materials

All Differential Privacy education materials are available on our Courses & Educational Materials Page.