In a search task, a group of agents compete to be the first to find the solution. Each agent has different private information to incorporate into its search. This problem is inspired by settings such as scientific research, Bitcoin hash inversion, or hunting for some buried treasure. A social planner such as a funding agency, mining pool, or pirate captain might like to convince the agents to collaborate, share their information, and greatly reduce the cost of searching. However, this cooperation is in tension with the individuals' competitive desire to each be the first to win the search. The planner's proposal should incentivize truthful information sharing, reduce the total cost of searching, and satisfy fairness properties that preserve the spirit of the competition. We design contract-based mechanisms for information sharing without money. The planner solicits the agents' information and assigns search locations to the agents, who may then search only within their assignments. Truthful reporting of information to the mechanism maximizes an agent's chance to win the search. Epsilon-voluntary participation is satisfied for large search spaces. In order to formalize the planner's goals of fairness and reduced search cost, we propose a simplified, simulated game as a benchmark and quantify fairness and search cost relative to this benchmark scenario. The game is also used to implement our mechanisms. Finally, we extend to the case where coalitions of agents may participate in the mechanism, forming larger coalitions recursively.
C. Dwork, A. Smith, T. Steinke, J. Ullman, and S. Vadhan, “Robust Traceability from Trace Amounts,” in IEEE Symposium on Foundations of Computer Science (FOCS 2015), Berkeley, California, 2015.
We prove new upper and lower bounds on the sample complexity of (ϵ,δ) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function cx over a totally ordered domain X evaluates to cx(y)=1 if y≤x, and evaluates to 0 otherwise. We give the first nontrivial lower bound for releasing thresholds with (ϵ,δ) differential privacy, showing that the task is impossible over an infinite domain X, and moreover requires sample complexity n≥Ω(log∗|X|), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with n≤2(1+o(1))log∗|X| samples. This improves the previous best upper bound of 8(1+o(1))log∗|X| (Beimel et al., RANDOM '13). Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with (ϵ,δ) differential privacy and learning without privacy. For properly learning thresholds in ℓ dimensions, this lower bound extends to n≥Ω(ℓ⋅log∗|X|). To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database D of elements from X, the interior point problem asks for an element between the smallest and largest elements in D. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.
We introduce three differentially-private algorithms that approximates the 2nd-moment matrix of the data. These algorithm, which in contrast to existing algorithms output positive-definite matrices, correspond to existing techniques in linear regression literature. Specifically, we discuss the following three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that by approximating the solution using Johnson-Lindenstrauss transform we preserve privacy. (ii) We show that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially private provided the prior is set correctly. We also evaluate our techniques experimentally and compare them to the existing "Analyze Gauss" algorithm of Dwork et al.
Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent, data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.
When marketing and data analytic companies purchase medical data, sensitive medical facts about patients can leak if the information includes national identifiers. In some cases, patient information is at risk of exposure even when national identifiers are encrypted to provide privacy. We examined prescription data with encrypted national identifiers from South Korean decedents. Because the data did not include the patient’s name or address and encrypted the patient’s national identifier (Resident Registration Number, or RRN), the data was presumed to be anonymous and to resemble data shared with IMS Health, a large multinational corporation headquartered in the United States that collects these kinds of prescription data on millions of living South Koreans. However, weakly encrypted RRNs may be vulnerable to de-anonymization. They have demographics embedded within its digits with a publicly-known pattern, and like credit cards, a last digit that is a weighted sum of prior digits which allows for arithmetic inspection of the accuracy of re-identification. We conducted two experiments on the prescription data that demonstrates the vulnerability of its RRN encryption method. This work is timely because South Korea is debating a redesign of its national identifier, the United States is discussing a new universal patient identifier, and countries worldwide are struggling with breaches of personal information that rely on national identification systems.
L. Sweeney, “All the Data on All the People,” UC Berkeley Law School & GWU Law School (Berkeley Center for Law & Technology). The Privacy Law Scholars Conference (PLSC). Berkeley, CA. . 2015.
The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.
A new line of work, started with Dwork et al., studies the task of answering statistical queries using a sample and relates the problem to the concept of differential privacy. By the Hoeffding bound, a sample of size O(logk/α2) suffices to answer k non-adaptive queries within error α, where the answers are computed by evaluating the statistical queries on the sample. This argument fails when the queries are chosen adaptively (and can hence depend on the sample). Dwork et al. showed that if the answers are computed with (ϵ,δ)-differential privacy then O(ϵ) accuracy is guaranteed with probability 1−O(δϵ). Using the Private Multiplicative Weights mechanism, they concluded that the sample size can still grow polylogarithmically with the k. Very recently, Bassily et al. presented an improved bound and showed that (a variant of) the private multiplicative weights algorithm can answer k adaptively chosen statistical queries using sample complexity that grows logarithmically in k. However, their results no longer hold for every differentially private algorithm, and require modifying the private multiplicative weights algorithm in order to obtain their high probability bounds. We greatly simplify the results of Dwork et al. and improve on the bound by showing that differential privacy guarantees O(ϵ) accuracy with probability 1−O(δlog(1/ϵ)/ϵ). It would be tempting to guess that an (ϵ,δ)-differentially private computation should guarantee O(ϵ) accuracy with probability 1−O(δ). However, we show that this is not the case, and that our bound is tight (up to logarithmic factors).