DataTags

Research Overview:

Members of the Privacy Tools project are developing DataTags, a suite of tools to help researchers share and use sensitive data in a standardized and responsible way.

Proper handling of human subjects data requires knowledge of relevant federal and state data privacy laws, applicable data sharing agreements, best practices for confidentiality and security, and available mechanisms for privacy protection. The goal of DataTags is to help researchers who are not legal or technical experts navigate these considerations and make informed decisions when collecting, storing, and sharing privacy-sensitive data.

For more information, please see the FAQ below and visit DataTags.org to try the demo.

About the DataTags project

What problem is DataTags designed to address?

Making data widely available to researchers is good policy and crucial to good science. It enables replication and validation of scientific findings, supports extensions of studies, and maximizes return on research investment. For these reasons, sponsors and publishers expect or mandate the sharing of data where possible.

However, data containing sensitive information about individuals cannot be shared openly without appropriate safeguards. An extensive body of statutes, regulations, institutional policies, consent forms, data sharing agreements, and best practices governs how sensitive data should be used and disclosed in different contexts. Researchers and institutions that manage and share data must interpret how the various legal requirements and other data privacy and security standards constrain their handling of a given dataset. DataTags helps researchers navigate these complex issues.

How does DataTags work?

DataTags is designed to enable computer-assisted assessments of the legal, contractual, and policy restrictions that govern data sharing decisions. Assessments are performed through interactive computation, in which the DataTags system asks a user a series of questions to elicit the key properties of a given dataset and applies inference rules to determine which laws, contracts, and best practices are applicable. The output has two parts: a set of recommended DataTags, which are simple, iconic labels that represent a human-readable and machine-actionable data policy, and a license agreement tailored to the individual dataset. The DataTags system is being designed to integrate with the open source data repository software Dataverse and its suite of access controls and statistical analysis tools. It will also operate as a standalone tool and as an application that can be integrated with other platforms.
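The question-and-inference flow described above can be pictured with a minimal Python sketch. This is not the DataTags implementation: the questions, the tag names (`open_access`, `click_through_agreement`, `encrypted_storage`, `approval_required`), the license clauses, and the rules themselves are all hypothetical and chosen only to show the shape of the process.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical yes/no questions; a real interview would be far more detailed.
QUESTIONS = {
    "identifiable": "Does the dataset contain information about identifiable individuals?",
    "health": "Does it include health information?",
    "consent_permits_sharing": "Do the consent forms permit sharing with other researchers?",
}


@dataclass
class Assessment:
    tags: set[str] = field(default_factory=set)             # recommended labels
    license_terms: list[str] = field(default_factory=list)  # custom license clauses


def run_interview(ask: Callable[[str], bool]) -> dict[str, bool]:
    """Ask every question and record the yes/no answers by key."""
    return {key: ask(text) for key, text in QUESTIONS.items()}


def infer(answers: dict[str, bool]) -> Assessment:
    """Apply illustrative (made-up) inference rules to the answers."""
    result = Assessment()
    if not answers["identifiable"]:
        result.tags.add("open_access")
        return result
    result.tags.add("click_through_agreement")
    if answers["health"]:
        result.tags.add("encrypted_storage")
        result.license_terms.append("Recipient must not attempt to re-identify individuals.")
    if not answers["consent_permits_sharing"]:
        # A stricter rule overrides the click-through default.
        result.tags.discard("click_through_agreement")
        result.tags.add("approval_required")
        result.license_terms.append("Access requires approval by the data controller.")
    return result


if __name__ == "__main__":
    # Scripted answers stand in for an interactive user.
    assessment = infer(run_interview(lambda question: True))
    print(sorted(assessment.tags))
    print(assessment.license_terms)
```

Separating `run_interview` from `infer` mirrors the description: eliciting the dataset's properties is one step, and deriving tags and license terms from them is another.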

What are DataTags?

The DataTags recommended by the system are human-readable and machine-actionable labels that express conditions under which datasets can be stored, transmitted, or used. Colloquially, each DataTag tells you that there are some specific things you can safely do with the data — such as make the data available to any user who accepts a prespecified click-through agreement — without requiring further human analysis or decision making. Requirements that cannot be automated and expressed by a simple label are encoded instead in a custom license agreement that complements the DataTags assigned to a dataset.

More formally, a DataTag is an informative label from a controlled vocabulary that can be applied to a dataset. It carries distinct semantics, summarizing sufficient conditions for a specific set of automated actions over the data. A dataset is labelled with a tag on the basis of a systematic interrogation of a data controller, conducted using a specified set of survey questions and inferential rules for tag assignment. Each label formally corresponds to a set of assertions regarding permissible or impermissible actions over the dataset.
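One way to read this formal definition is as a mapping from each label in a fixed vocabulary to the set of actions it permits. The Python sketch below illustrates that reading only. The tag names, the action names, and the permissions assigned to each tag are hypothetical and are not taken from the DataTags vocabulary.

```python
from enum import Enum


class Action(Enum):
    """Automated actions a repository might take (hypothetical)."""
    STORE_UNENCRYPTED = "store_unencrypted"
    TRANSMIT_IN_CLEAR = "transmit_in_clear"
    RELEASE_ON_CLICK_THROUGH = "release_on_click_through"


class Tag(Enum):
    """Controlled vocabulary of labels (hypothetical names)."""
    OPEN = "open"
    CLICK_THROUGH = "click_through"
    RESTRICTED = "restricted"


# Each tag corresponds to a set of assertions: here, the actions it permits.
# Any action not listed for a tag is treated as impermissible.
PERMITTED = {
    Tag.OPEN: frozenset(Action),
    Tag.CLICK_THROUGH: frozenset({Action.RELEASE_ON_CLICK_THROUGH}),
    Tag.RESTRICTED: frozenset(),
}


def is_permitted(tag: Tag, action: Action) -> bool:
    """Return True if the tag alone is sufficient to allow the action."""
    return action in PERMITTED[tag]


if __name__ == "__main__":
    print(is_permitted(Tag.CLICK_THROUGH, Action.RELEASE_ON_CLICK_THROUGH))
    print(is_permitted(Tag.RESTRICTED, Action.STORE_UNENCRYPTED))
```

Using enumerations keeps the vocabulary closed, so a system consuming the tags can check an action mechanically without further human analysis.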

Who are the members of the DataTags team?

    Latanya Sweeney

    Merce Crosas

    Urs Gasser

    Salil Vadhan

    Steve Chong

    Michael Bar-Sinai

    Micah Altman

    David O’Brien

    Marco Gaboardi

    Kit Walsh

    Alexandra Wood

    Kevin Condon

    Michael Heppler

    Sean Hooley

    Elizabeth Quigley

    Naomi Day

    Bryan Lee

    Jeremy Merkel

    Anna Myers

    Brett Weinstein

How can I get involved?

Please see our open positions for interns, students, postdocs, and visiting scholars.

How have students contributed to the project?

Students have been involved throughout the development of DataTags. Law students contribute to the project by performing legal research and drafting memoranda analyzing how various privacy laws and regulations govern the collection, use, and sharing of personal data for research purposes. They also draft questions for the DataTags automated interview and terms for the custom license agreements. Undergraduates, graduate students, and postdocs in computer science contribute to the development of the DataTags software. This involves creating a custom language for the DataTags interview, inference, and tag assignment process, as well as tools for testing, verifying, and validating the software code.