DataTags Research

Research Overview:

Members of the Privacy Tools project are developing DataTags, a suite of tools to help researchers share and use sensitive data in a standardized and responsible way.

Proper handling of human subjects data requires knowledge of relevant federal and state data privacy laws, applicable data sharing agreements, best practices for confidentiality and security, and available mechanisms for privacy protection. The goal of DataTags is to help researchers who are not legal or technical experts navigate these considerations and make informed decisions when collecting, storing, and sharing privacy-sensitive data.

This project is a collaboration with the IQSS Dataverse team. For more information, please see the FAQ below and visit DataTags.org to try the demo.

About the DataTags project

What problem is DataTags designed to address?

Making data widely available to researchers is good policy and crucial to good science. It enables replication and validation of scientific findings, supports extensions of studies, and maximizes return on research investment. For these reasons, sponsors and publishers expect or mandate the sharing of data where possible.

However, data containing sensitive information about individuals cannot be shared openly without appropriate safeguards. An extensive body of statutes, regulations, institutional policies, consent forms, data sharing agreements, and best practices governs how sensitive data should be used and disclosed in different contexts. Researchers and institutions that manage and share data must interpret how the various legal requirements and other data privacy and security standards constrain their handling of a given dataset. DataTags helps researchers navigate these complex issues.

How does DataTags work?

DataTags is designed to enable computer-assisted assessments of the legal, contractual, and policy restrictions that govern data sharing decisions. Assessments are performed through interactive computation, in which the DataTags system asks a user a series of questions to elicit the key properties of a given dataset and applies inference rules to determine which laws, contracts, and best practices are applicable. The output is a set of recommended DataTags, or simple, iconic labels that represent a human-readable and machine-actionable data policy, and a license agreement that is tailored to the individual dataset. The DataTags system is being designed to integrate with the open source data repository software Dataverse and its suite of access controls and statistical analysis tools. It will also operate as a standalone tool and as an application that can be integrated with other platforms.
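The inference step described above can be sketched roughly as follows. This is a minimal illustration, not the DataTags implementation: the fact names, the `Rule` class, and the rule conditions are all hypothetical, and the conditions are heavily simplified stand-ins for real legal analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, bool]  # answers elicited from the user during the interview


@dataclass(frozen=True)
class Rule:
    """If `condition` holds for the elicited facts, `requirement` applies."""
    requirement: str
    condition: Callable[[Facts], bool]


# Hypothetical, simplified rules (assumption: real rules would be derived
# from legal memoranda and would be far more detailed).
RULES: List[Rule] = [
    Rule("HIPAA",
         lambda f: f.get("contains_health_records", False)
         and f.get("from_covered_entity", False)),
    Rule("FERPA",
         lambda f: f.get("contains_student_records", False)),
    Rule("data_sharing_agreement",
         lambda f: f.get("obtained_under_agreement", False)),
]


def applicable_requirements(facts: Facts) -> List[str]:
    """Return every requirement whose rule condition matches the facts."""
    return [rule.requirement for rule in RULES if rule.condition(facts)]


answers = {"contains_health_records": True, "from_covered_entity": True}
requirements = applicable_requirements(answers)
```

Keeping rules as data rather than hard-coded branches would let legal reviewers inspect each condition separately, which seems in line with the project's stated aims, although the actual system may be organized differently.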

What are DataTags?

The DataTags recommended by the system are human-readable and machine-actionable labels that express conditions under which datasets can be stored, transmitted, or used. Colloquially, each DataTag tells you that there are some specific things you can safely do with the data (such as making the data available to any user who accepts a prespecified click-through agreement) without requiring further human analysis or decision making. Requirements that cannot be automated and expressed by a simple label are encoded instead in a custom license agreement that complements the DataTags assigned to a dataset.

More formally, a DataTag is an informative label from a controlled vocabulary that can be applied to a dataset. It carries distinct semantics, summarizing sufficient conditions for a specific set of automated actions over the data. A dataset is labelled with a tag on the basis of a systematic interrogation of a data controller, conducted using a specified set of survey questions, and inferential rules for tag assignment. Each label formally corresponds to a set of assertions regarding permissible or impermissible actions over the dataset.
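One way to picture a tag as a label from a controlled vocabulary that corresponds to assertions about permissible actions is the sketch below. The tag names, action names, and the mapping between them are invented for illustration and are not the vocabulary used by DataTags.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Action(Enum):
    """Automated actions a repository might take (hypothetical list)."""
    PUBLIC_DOWNLOAD = "public_download"
    CLICK_THROUGH_DOWNLOAD = "click_through_download"
    STORE_UNENCRYPTED = "store_unencrypted"


class Tag(Enum):
    """Controlled vocabulary of labels (hypothetical names)."""
    OPEN = "open"
    CLICK_THROUGH = "click_through"
    RESTRICTED = "restricted"


# Each tag corresponds to a set of assertions: the actions it permits.
# Any action not listed for a tag is treated as impermissible.
PERMITTED: Dict[Tag, FrozenSet[Action]] = {
    Tag.OPEN: frozenset(Action),
    Tag.CLICK_THROUGH: frozenset(
        {Action.CLICK_THROUGH_DOWNLOAD, Action.STORE_UNENCRYPTED}),
    Tag.RESTRICTED: frozenset(),
}


def is_permitted(tag: Tag, action: Action) -> bool:
    """Machine-actionable check: may `action` run on data labelled `tag`?"""
    return action in PERMITTED[tag]
```

Because the vocabulary is closed, a repository could in principle decide every listed action from the label alone, leaving only the non-automatable requirements to the license agreement.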

Who are the members of the DataTags team?

Salil Vadhan

Vicky Joseph Professor of Computer Science and Applied Mathematics, SEAS, Harvard

Salil Vadhan is the lead PI of the Privacy Tools for Sharing Research Data project and the Vicky Joseph Professor of Computer Science and Applied Mathematics.

Marco Gaboardi

Visiting Scholar, Center for Research on Computation & Society
State University of New York at Buffalo

Urs Gasser

Executive Director, Berkman Center for Internet & Society
Professor of Practice, Harvard Law School

Michael Bar-Sinai

Graduate Student, Ben-Gurion University of the Negev, Israel
Visiting Graduate Student, Harvard University, IQSS

Michael creates DataTags theory (based on decision graphs), programming language and tools, and explores their integration with data repositories, such as …

Micah Altman

Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, MIT
Non-Resident Senior Fellow, The Brookings Institution

Latanya Sweeney

Professor of Government and Technology in Residence, Harvard
Director of the Data Privacy Lab, Harvard

Publications

2017

Bar-Sinai M, Medzini R. Public Policy Modeling using the DataTags Toolset. 2017.

2016

Bar-Sinai M, Sweeney L, Crosas M. DataTags, Data Handling Policy Spaces and the Tags Language, in Proceedings of the International Workshop on Privacy Engineering, IEEE. San Jose, CA, USA: IEEE; 2016.

2015

Crosas M, King G, Honaker J, Sweeney L. Automating Open Science for Big Data. The ANNALS of the American Academy of Political and Social Science. 2015;659(1):260-273.

Sweeney L. All the Data on All the People, in The Privacy Law Scholars Conference (PLSC). Berkeley, California: UC Berkeley Law School & GWU Law School (Berkeley Center for Law & Technology); 2015.

Sweeney L, Crosas M. An Open Science Platform for the Next Generation of Data. Arxiv.org Computer Science, Computers and Society. 2015.

Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The Datatags System. Technology Science. 2015.

PolicyModels

PolicyModels (formerly the DataTags toolset) is a system for creating models of policies, e.g. for handling datasets or determining welfare entitlements. A policy model consists of a policy space, detailing all possible treatments within a policy, and a decision tree, which describes the process of getting to a specific treatment. Policy models can be used to perform interactive interviews that yield a concrete treatment that is both human-readable and machine-actionable. Models can also be visualized and analyzed to find caveats or loopholes. A policy model of DataTags (still in beta) that takes into account best information practices and various US laws and regulations is available at:

http://dvnweb-vm1.hmdc.harvard.edu/interviews/privacy/intro
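The two parts of a policy model, a policy space and a decision tree, might be pictured as in the sketch below. It is plain Python written for illustration only and does not use the PolicyModels language or tools. The dimensions, question identifiers, and the `Ask` class are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Policy space: every dimension of a treatment and its allowed values
# (hypothetical dimensions and values).
POLICY_SPACE: Dict[str, Tuple[str, ...]] = {
    "storage": ("clear", "encrypted"),
    "access": ("public", "click_through", "approval"),
}

Treatment = Dict[str, str]  # one value chosen per dimension


@dataclass
class Ask:
    """Inner tree node: a question and the subtree for each answer."""
    question_id: str
    branches: Dict[str, "Node"]


Node = Union[Ask, Treatment]  # leaves are concrete treatments

# Hypothetical decision tree with two questions.
TREE: Node = Ask("personal_info", {
    "no": {"storage": "clear", "access": "public"},
    "yes": Ask("consent_to_share", {
        "yes": {"storage": "encrypted", "access": "click_through"},
        "no": {"storage": "encrypted", "access": "approval"},
    }),
})


def interview(node: Node, answers: Dict[str, str]) -> Treatment:
    """Walk the tree using the given answers until a treatment is reached."""
    while isinstance(node, Ask):
        node = node.branches[answers[node.question_id]]
    return node


def in_policy_space(treatment: Treatment) -> bool:
    """Check that a treatment sets each dimension to an allowed value."""
    return (set(treatment) == set(POLICY_SPACE)
            and all(treatment[d] in POLICY_SPACE[d] for d in treatment))


result = interview(TREE, {"personal_info": "yes", "consent_to_share": "no"})
```

Separating the space from the tree means every leaf can be checked against the space, which is one simple form of the model analysis mentioned above.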

Robot Lawyers for License Generation

The Robot Lawyers system is being developed to provide data repositories with expert system-like support for automating certain data handling decisions and generating custom data sharing agreements. It relies on a formalization of the privacy-relevant aspects of selected statutes, regulations, and best practices, supported by an analysis documented in legal memoranda. This formalization enables automated reasoning about the conditions under which a data transfer is permitted, based on facts learned about the data through an interview with the user depositing the data and the application of rules to these facts. The system uses this formalization to generate a custom data sharing agreement that accurately captures the relevant conditions on the data transfer. Transparency at each stage enables repository administrators, lawyers, institutional review boards, and other interested parties to examine the legal analysis and interpretation embodied in the formalization, as well as the rationale behind the generation of a particular license. Through integration with Dataverse, DataTags, and PolicyModels, this system will aim to help Dataverse users access and share data under tailored licenses, with confidence that the agreements reflect legal requirements and best practices with respect to privacy.
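As a rough sketch of how facts from an interview could drive agreement generation while keeping the rationale visible, consider the following. The clause texts, fact names, and structure are hypothetical and do not reflect the Robot Lawyers formalization or any actual legal language.

```python
from typing import Dict, List, Tuple

# Each entry: (fact that triggers the clause, clause text, rationale).
# All entries are invented placeholders for illustration.
CLAUSES: List[Tuple[str, str, str]] = [
    ("contains_identifiers",
     "The recipient shall not attempt to re-identify any individual.",
     "Depositor reported that the dataset contains identifiers."),
    ("requires_encryption",
     "The recipient shall store the data in encrypted form.",
     "Depositor reported a requirement for encrypted storage."),
]


def generate_agreement(facts: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return the numbered agreement text and the reason for each clause."""
    selected = [(clause, why) for fact, clause, why in CLAUSES
                if facts.get(fact, False)]
    text = "\n".join(f"{number}. {clause}"
                     for number, (clause, _) in enumerate(selected, start=1))
    rationale = [why for _, why in selected]
    return text, rationale


agreement, reasons = generate_agreement({"contains_identifiers": True})
```

Returning the rationale alongside the text is one way to support the kind of transparency described above, since a reviewer could trace each clause back to the fact that triggered it.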

How have students contributed to the project?

Students have been involved throughout the development of DataTags. Law students contribute to the project by performing legal research and drafting memoranda analyzing how various privacy laws and regulations govern the collection, use, and sharing of personal data for research purposes. They also draft questions for the DataTags automated interview and terms for the custom license agreements. Undergraduates, graduate students, and postdocs in computer science contribute to the development of the DataTags software. This involves creating a custom language for the DataTags interview, inference, and tags assignment process, as well as tools for testing, verifying, and validating the software code.

How can I get involved?

Please see our open positions for interns, students, postdocs and visiting scholars.