Can you have privacy and big data too? — Comments for the White House

Originally posted by Micah Altman on 3/31/14 on the MIT Libraries Program for Information Science blog and on

Big data has huge implications for privacy, as summarized in our commentary below:

Both the government and third parties have the potential to collect extensive (sometimes exhaustive), fine grained, continuous, and identifiable records of a person’s location, movement history, associations and interactions with others, behavior, speech, communications, physical and medical conditions, commercial transactions, etc. Such “big data” has the ability to be used in a wide variety of ways, both positive and negative. Examples of potential applications include improving government and organizational transparency and accountability, advancing research and scientific knowledge, enabling businesses to better serve their customers, allowing systematic commercial and non-commercial manipulation, fostering pervasive discrimination, and surveilling public and private spheres.

On January 23, 2014, President Obama asked John Podesta to develop in 90 days, a ‘comprehensive review’ on big data and privacy.

This lead to a series of workshop on big data and  technology at MIT, and on social cultural & ethical dimensions at NYU, with a third planned to discuss legal issues at Berkeley. A number of colleagues from our Privacy Tools for Research  project and from the BigData@CSAIL projects have contributed to these workshops  and raised many thoughtful issues (and the workshop sessions are  online and well worth watching).

EPIC, ARL and 22 other privacy organizations requested an opportunity to comment on the report, and OSTP later allowed for a 27-day commentary period during which brief comments would be accepted by e-mail. (Note that the original RFI provided by OSTP is, at the time of this writing, a broken link, so we have posted a copy. )They requested commenters provide specific answers to five extraordinarily broad questions:

  1. What are the public policy implications of the collection, storage, analysis, and use of big data? For example, do the current U.S. policy framework and privacy proposals for protecting consumer privacy and government use of data adequately address issues raised by big data analytics? 
  2. What types of uses of big data [are most important]… could measurably improve outcomes or productivity with further government action, funding, or research? What types of uses of big data raise the most public policy concerns? Are there specific sectors or types of uses that should receive more government and/or public attention?
  3. What technological trends or key technologies will affect the collection, storage, analysis and use of big data? Are there particularly promising technologies or new practices for safeguarding privacy while enabling effective uses of big data?
  4. How should the policy frameworks or regulations for handling big data differ between the government and the private sector? Please be specific as to the type of entity and type of use (e.g., law enforcement, government services, commercial, academic research, etc.). 
  5. What issues are raised by the use of big data across jurisdictions, such as the adequacy of current international laws, regulations, or norms? 

 My colleagues at the Berkman Center, David O’Brien, Alexandra Woods, Salil Vadhan and I have submitted responses to these questions that outline a broad, comprehensive, and systematic framework for analyzing these types of questions and taxonomize a variety of modern technological, statistical, and cryptographic approaches to simultaneously providing privacy and utility. This comment is made on behalf of the Privacy Tools for Research Project, of which we are a part, and has benefitted from extensive commentary by the other project collaborators.

Much can be improved in how big data and is currently treated. To summarize (quoting from the conclusions of the comment):

Addressing privacy risks requires a sophisticated approach, and the privacy protections currently used for big data do not take advantage of advances in data privacy research or the nuances these provide in dealing with different kinds of data and closely matching sensitivity to risk.

I invite you to read the full comment.