Preventing False Discovery via Differential Privacy


Abstract: How can we prevent false discovery and ensure that the conclusions we draw from data generalize to the population at large? For decades, statisticians have been developing methods to prevent false discovery, and yet false discovery remains a vexing problem in the scientific community, leading to provocatively titled scientific articles like "Why Most Published Research Findings Are False." While there are many causes of false discovery, one that is increasingly cited as a major source is interactive data analysis, the common scenario where datasets are re-used across multiple analyses. Interactivity invalidates statistical methods for preventing false discovery, and has been implicated in a "statistical crisis in science." In this talk, I will describe a recent line of work that formalizes the problem of interactive data analysis and introduces new methods for preventing false discovery in this challenging setting. These methods draw on a novel connection between differential privacy and preventing false discovery, and show how differential privacy can aid, rather than restrict, statistical analysis of datasets. This talk is based on a series of works with Raef Bassily, Moritz Hardt, Kobbi Nissim, Adam Smith, Thomas Steinke, and Uri Stemmer.

Bio: Jonathan Ullman is an Assistant Professor in the College of Computer and Information Science at Northeastern University, where he is a member of the Cybersecurity and Privacy Institute. His research lies at the intersection of privacy, cryptography, machine learning, and game theory. Prior to Northeastern, he was a Ph.D. student and postdoctoral fellow at Harvard University, working on the Privacy Tools Project, and later a junior fellow in the Simons Society of Fellows.