Significant progress has been made recently on theoretical analysis of estimators for the stochastic block model (SBM). In this paper, we consider the multi-graph SBM, which serves as a foundation for many application settings including dynamic and multi-layer networks. We explore the asymptotic properties of two estimators for the multi-graph SBM, namely spectral clustering and the maximum-likelihood estimate (MLE), as the number of layers of the multi-graph increases. We derive sufficient conditions for consistency of both estimators and propose a variational approximation to the MLE that is computationally feasible for large networks. We verify the sufficient conditions via simulation and demonstrate that they are practical. In addition, we apply the model to two real data sets: a dynamic social network and a multi-layer social network with several types of relations.

%B Proceedings of the 32nd International Conference on Machine Learning %G eng %U https://arxiv.org/pdf/1410.8597v3.pdf %0 Journal Article %J Encyclopedia of Algorithms %D 2015 %T Differentially Private Analysis of Graphs %A Sofya Raskhodnikova and Adam Smith %B Encyclopedia of Algorithms %G eng %0 Journal Article %J ZSR %D 2015 %T Perspectives on the Future of Digital Privacy %A Gasser, Urs %B ZSR %P 339-448 %G eng %0 Generic %D 2015 %T All the Data on All the People %A Sweeney, Latanya %B UC Berkeley Law School & GWU Law School (Berkeley Center for Law & Technology). The Privacy Law Scholars Conference (PLSC). Berkeley, CA. %G eng %0 Journal Article %J The ANNALS of the American Academy of Political and Social Science %D 2015 %T Automating Open Science for Big Data %A Mercè Crosas %A Gary King %A James Honaker %A Sweeney, Latanya %XThe vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published results. The trend toward big data—including large-scale streaming data—is starting to transform research and has the potential to impact policymaking as well as our understanding of the social, economic, and political problems that affect human societies. However, big data research poses new challenges to the execution of the analysis, archiving and reuse of the data, and reproduction of the results. Downloading these datasets to a researcher’s computer is impractical, leading to analyses taking place in the cloud, and requiring unusual expertise, collaboration, and tool development. The increased amount of information in these large datasets is an advantage, but at the same time it poses an increased risk of revealing personally identifiable sensitive information. In this article, we discuss solutions to these new challenges so that the social sciences can realize the potential of big data.

%B The ANNALS of the American Academy of Political and Social Science %V 659 %P 260-273 %G eng %U http://ann.sagepub.com/content/659/1/260.abstract %N 1 %0 Generic %D 2015 %T Between Pure and Approximate Differential Privacy %A Thomas Steinke %A Jon Ullman %B Theory and Practice of Differential Privacy (TPDP 2015), London, UK %G eng %U http://tpdp.computing.dundee.ac.uk/abstracts/TPDP_2015_3.pdf %0 Journal Article %J Proceedings of the 28th IEEE Computer Security Foundations Symposium (CSF) %D 2015 %T Cryptographic Enforcement of Language-Based Erasure %A A Askarov %A Moore, S. %A C Dimoulas %A S Chong %XInformation erasure is a formal security requirement that stipulates when sensitive data must be removed from computer systems. In a system that correctly enforces erasure requirements, an attacker who observes the system after sensitive data is required to have been erased cannot deduce anything about the data. Practical obstacles to enforcing information erasure include: (1) correctly determining which data requires erasure; and (2) reliably deleting potentially large volumes of data, despite untrustworthy storage services.

In this paper, we present a novel formalization of language- based information erasure that supports cryptographic enforcement of erasure requirements: sensitive data is encrypted be- fore storage, and upon erasure, only a relatively small set of decryption keys needs to be deleted. This cryptographic technique has been used by a number of systems that implement data deletion to allow the use of untrustworthy storage services. However, these systems provide no support to correctly determine which data requires erasure, nor have the formal semantic properties of these systems been explained or proven to hold. We address these shortcomings. Specifically, we study a programming language extended with primitives for public- key cryptography, and demonstrate how information-flow control mechanisms can automatically track data that requires erasure and provably enforce erasure requirements even when programs employ cryptographic techniques for erasure.

This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

%B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/data-and-privacy-f7bfa24bbddc %0 Report %D 2015 %T The Differential Privacy of Bayesian Inference %A Shijie Zheng %XDifferential privacy is one recent framework for analyzing and quantifying the amount of privacy lost when data is released. Meanwhile, multiple imputation is an existing Bayesian-inference based technique from statistics that learns a model using real data, then releases synthetic data by drawing from that model. Because multiple imputation does not directly release any real data, it is generally believed to protect privacy. In this thesis, we examine that claim. While there exist newer synthetic data algorithms specifically designed to provide differential privacy, we evaluate whether multiple imputation already includes differential privacy for free. Thus, we focus on several method variants for releasing the learned model and releasing the synthetic data, and how these methods perform for models taking on two common distributions: the Bernoulli and the Gaussian with known variance. We prove a number of new or improved bounds on the amount of privacy afforded by multiple imputation for these distributions. We find that while differential privacy is ostensibly achievable for most of our method variants, the conditions needed for it to do so are often not realistic for practical usage. At least in theory, this is particularly true if we want absolute privacy (ε-differential privacy), but that the methods are more practically compatible with privacy when we allow a small probability of a catastrophic data leakage ((ε, δ)-differential privacy). |

We prove new upper and lower bounds on the sample complexity of (ϵ,δ) differentially private algorithms for releasing approximate answers to threshold functions. A threshold function cx over a totally ordered domain X evaluates to cx(y)=1 if y≤x, and evaluates to 0 otherwise. We give the first nontrivial lower bound for releasing thresholds with (ϵ,δ) differential privacy, showing that the task is impossible over an infinite domain X, and moreover requires sample complexity n≥Ω(log∗|X|), which grows with the size of the domain. Inspired by the techniques used to prove this lower bound, we give an algorithm for releasing thresholds with n≤2(1+o(1))log∗|X| samples. This improves the previous best upper bound of 8(1+o(1))log∗|X| (Beimel et al., RANDOM '13).

Our sample complexity upper and lower bounds also apply to the tasks of learning distributions with respect to Kolmogorov distance and of properly PAC learning thresholds with differential privacy. The lower bound gives the first separation between the sample complexity of properly learning a concept class with (ϵ,δ) differential privacy and learning without privacy. For properly learning thresholds in ℓ dimensions, this lower bound extends to n≥Ω(ℓ⋅log∗|X|).

To obtain our results, we give reductions in both directions from releasing and properly learning thresholds and the simpler interior point problem. Given a database D of elements from X, the interior point problem asks for an element between the smallest and largest elements in D. We introduce new recursive constructions for bounding the sample complexity of the interior point problem, as well as further reductions and techniques for proving impossibility results for other basic problems in differential privacy.

Emerging large-scale data sources hold tremendous potential for new scientific research into human biology, behaviors, and relationships. At the same time, big data research presents privacy and ethical challenges that the current regulatory framework is ill-suited to address. In light of the immense value of large-scale research data, the central question moving forward is not whether such data should be made available for research, but rather how the benefits can be captured in a way that respects fundamental principles of ethics and privacy.

In a search task, a group of agents compete to be the first to find the solution. Each agent has different private information to incorporate into its search. This problem is inspired by settings such as scientific research, Bitcoin hash inversion, or hunting for some buried treasure. A social planner such as a funding agency, mining pool, or pirate captain might like to convince the agents to collaborate, share their information, and greatly reduce the cost of searching. However, this cooperation is in tension with the individuals' competitive desire to each be the first to win the search. The planner's proposal should incentivize truthful information sharing, reduce the total cost of searching, and satisfy fairness properties that preserve the spirit of the competition. We design contract-based mechanisms for information sharing without money. The planner solicits the agents' information and assigns search locations to the agents, who may then search only within their assignments. Truthful reporting of information to the mechanism maximizes an agent's chance to win the search. Epsilon-voluntary participation is satisfied for large search spaces. In order to formalize the planner's goals of fairness and reduced search cost, we propose a simplified, simulated game as a benchmark and quantify fairness and search cost relative to this benchmark scenario. The game is also used to implement our mechanisms. Finally, we extend to the case where coalitions of agents may participate in the mechanism, forming larger coalitions recursively.

%B AAI Conference on Artificial Intelligence %I Association for the Advancement of Artificial Intelligence (AAAI) %C North America %8 February %G eng %U http://people.seas.harvard.edu/~bwaggoner/papers/2015/fair-information-sharing-for-treasure-hunting--2015--chen,nissim,waggoner.pdf %0 Conference Proceedings %D 2015 %T On the Generalization Properties of Differential Privacy %A Kobbi Nissim %A Uri Stemmer %XA new line of work, started with Dwork et al., studies the task of answering statistical queries using a sample and relates the problem to the concept of differential privacy. By the Hoeffding bound, a sample of size O(logk/α2) suffices to answer k non-adaptive queries within error α, where the answers are computed by evaluating the statistical queries on the sample. This argument fails when the queries are chosen adaptively (and can hence depend on the sample). Dwork et al. showed that if the answers are computed with (ϵ,δ)-differential privacy then O(ϵ) accuracy is guaranteed with probability 1−O(δϵ). Using the Private Multiplicative Weights mechanism, they concluded that the sample size can still grow polylogarithmically with the k.

Very recently, Bassily et al. presented an improved bound and showed that (a variant of) the private multiplicative weights algorithm can answer k adaptively chosen statistical queries using sample complexity that grows logarithmically in k. However, their results no longer hold for every differentially private algorithm, and require modifying the private multiplicative weights algorithm in order to obtain their high probability bounds.

We greatly simplify the results of Dwork et al. and improve on the bound by showing that differential privacy guarantees O(ϵ) accuracy with probability 1−O(δlog(1/ϵ)/ϵ). It would be tempting to guess that an (ϵ,δ)-differentially private computation should guarantee O(ϵ) accuracy with probability 1−O(δ). However, we show that this is not the case, and that our bound is tight (up to logarithmic factors).

We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query it. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (stands for GRaph EnCryption for approximate Shortest distance queries) which includes three schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally-efficient construction. Our second scheme, makes use of somewhat-homomorphic encryption and is less computationally-efficient but achieves optimal communication complexity (i.e., uses a minimal amount of bandwidth). Finally, our third scheme is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.

%B The 22nd ACM Conference on Computer and Communications Security %G eng %U https://eprint.iacr.org/2015/266 %0 Journal Article %J International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %D 2015 %T Hardness Amplification and the Approximate Degree of Constant-Depth Circuits %A Mark Bun %A Justin Thaler %XWe establish a generic form of hardness amplification for the approximability of constant-depth Boolean circuits by polynomials. Specifically, we show that if a Boolean circuit cannot be pointwise approximated by low-degree polynomials to within constant error in a certain one-sided sense, then an OR of disjoint copies of that circuit cannot be pointwise approximated even with very high error. As our main application, we show that for every sequence of degrees d(n), there is an explicit depththree circuit F : {−1, 1} n → {−1, 1} of polynomial-size such that any degree-d polynomial cannot pointwise approximate F to error better than 1 − exp −Ω( ˜ nd−3/2 ) . As a consequence of our main result, we obtain an exp −Ω( ˜ n 2/5 ) upper bound on the the discrepancy of a function in AC0, and an exp Ω( ˜ n 2/5 ) lower bound on the threshold weight of AC0, improving over the previous best results of exp −Ω(n 1/3 ) and exp Ω(n 1/3 ) respectively. Our techniques also yield a new lower bound of Ω n 1/2/ log(d−2)/2 (n) on the approximate degree of the AND-OR tree of depth d, which is tight up to polylogarithmic factors for any constant d, as well as new bounds for read-once DNF formulas. In turn, these results imply new lower bounds on the communication and circuit complexity of these classes, and demonstrate strong limitations on existing PAC learning algorithms.

%B International Colloquium on Automata, Languages, and Programming (ICALP 2015) BG %G eng %U http://arxiv.org/abs/1311.1616 %0 Generic %D 2015 %T In the Age of the Web, What Does “Public” Mean? %A Rob Faris %A O'Brien, David %B Internet Monitor 2014: Data and Privacy %G eng %U https://medium.com/internet-monitor-2014-data-and-privacy/in-the-age-of-the-web-what-does-public-mean-ee74df403174 %0 Journal Article %J Social Science Research Network %D 2015 %T Integrating Approaches to Privacy Across the Research Lifecycle: When is Information Purely Public? %A O'Brien, David %A Jonathan Ullman %A Micah Altman %A Gasser, Urs %A Bar-Sinai, Michael %A Kobbi Nissim %A Salil Vadhan %A Michael Wojcik %A Alexandra Wood %XOn September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data.

Researchers are increasingly obtaining data from social networking websites, publicly-placed sensors, government records and other public sources. Much of this information appears public, at least to first impressions, and it is capable of being used in research for a wide variety of purposes with seemingly minimal legal restrictions. The insights about human behaviors we may gain from research that uses this data are promising. However, members of the research community are questioning the ethics of these practices, and at the heart of the matter are some difficult questions about the boundaries between public and private information. This workshop report, the second in a series, identifies selected questions and explores issues around the meaning of “public” in the context of using data about individuals for research purposes.

We show an essentially tight bound on the number of adaptively chosen statistical queries that a computationally efficient algorithm can answer accurately given n samples from an unknown distribution. A statistical query asks for the expectation of a predicate over the underlying distribution, and an answer to a statistical query is accurate if it is “close” to the correct expectation over the distribution. This question was recently studied by Dwork et al. (2015), who showed how to answer Ω( ˜ n 2 ) queries efficiently, and also by Hardt and Ullman (2014), who showed that answering O˜(n 3 ) queries is hard. We close the gap between the two bounds and show that, under a standard hardness assumption, there is no computationally efficient algorithm that, given n samples from an unknown distribution, can give valid answers to O(n 2 ) adaptively chosen statistical queries. An implication of our results is that computationally efficient algorithms for answering arbitrary, adaptively chosen statistical queries may as well be differentially private. We obtain our results using a new connection between the problem of answering adaptively chosen statistical queries and a combinatorial object called an interactive fingerprinting code Fiat and Tassa (2001). In order to optimize our hardness result, we give a new Fourier-analytic approach to analyzing fingerprinting codes that is simpler, more flexible, and yields better parameters than previous constructions.

%B JMLR: Workshop and Conference Proceedings %V 40 %P 1-41 %G eng %U http://jmlr.org/proceedings/papers/v40/Steinke15.pdf %N 201 %0 Journal Article %J Accepted for publication, SODA 2015 %D 2015 %T Learning Privately with Labeled and Unlabeled Examples %A A. Beimel %A K. Nissim %A U. Stemmer %XA private learner is an algorithm that given a sample of labeled individual examples outputs a generalizing hypothesis while preserving the privacy of each individual. In 2008, Kasiviswanathan et al. (FOCS 2008) gave a generic construction of private learners, in which the sample complexity is (generally) higher than what is needed for non-private learners. This gap in the sample complexity was then further studied in several followup papers, showing that (at least in some cases) this gap is unavoidable. Moreover, those papers considered ways to overcome the gap, by relaxing either the privacy or the learning guarantees of the learner.

We suggest an alternative approach, inspired by the (non-private) models of semi-supervised learning and active-learning, where the focus is on the sample complexity of labeled examples whereas unlabeled examples are of a significantly lower cost. We consider private semi-supervised learners that operate on a random sample, where only a (hopefully small) portion of this sample is labeled. The learners have no control over which of the sample elements are labeled. Our main result is that the labeled sample complexity of private learners is characterized by the VC dimension.

We present two generic constructions of private semi-supervised learners. The first construction is of learners where the labeled sample complexity is proportional to the VC dimension of the concept class, however, the unlabeled sample complexity of the algorithm is as big as the representation length of domain elements. Our second construction presents a new technique for decreasing the labeled sample complexity of a given private learner, while only slightly increasing its unlabeled sample complexity. In addition, we show that in some settings the labeled sample complexity does not depend on the privacy parameters of the learner.

Imagine an online work environment where researchers have direct and immediate access to myriad data sources and tools and data management resources, useful throughout the research lifecycle. This is our vision for the next generation of the Dataverse Network: an Open Science Platform (OSP). For the first time, researchers would be able to seamlessly access and create primary and derived data from a variety of sources: prior research results, public data sets, harvested online data, physical instruments, private data collections, and even data from other standalone repositories. Researchers could recruit research participants and conduct research directly on the OSP, if desired, using readily available tools. Researchers could create private or shared workspaces to house data, access tools, and computation and could publish data directly on the platform or publish elsewhere with persistent, data citations on the OSP. This manuscript describes the details of an Open Science Platform and its construction. Having an Open Science Platform will especially impact the rate of new scientific discoveries and make scientific findings more credible and accountable.

%B Arxiv.org Computer Science, Computers and Scoiety [Internet] %G eng %U http://arxiv.org/abs/1506.05632 %0 Generic %D 2015 %T Privacy as a Sword and Shield in Public Health %A Sweeney, Latanya %B New York City Department of Public Health. New York, NY. %G eng %0 Generic %D 2015 %T Privacy Principles (framing talk) %A Micah Altman %B United Nations Global Pulse Workshop on ICT4D Principle 8: Address Privacy & Security In Development Programs. New York, USA %G eng %U https://www.slideshare.net/drmaltman/un-global-pulse-privacy-framing %0 Journal Article %D 2015 %T Private Approximations of the 2nd-Moment Matrix Using Existing Techniques in Linear Regression %A Or Sheffet %XWe introduce three differentially-private algorithms that approximates the 2nd-moment matrix of the data. These algorithm, which in contrast to existing algorithms output positive-definite matrices, correspond to existing techniques in linear regression literature. Specifically, we discuss the following three techniques. (i) For Ridge Regression, we propose setting the regularization coefficient so that by approximating the solution using Johnson-Lindenstrauss transform we preserve privacy. (ii) We show that adding a small batch of random samples to our data preserves differential privacy. (iii) We show that sampling the 2nd-moment matrix from a Bayesian posterior inverse-Wishart distribution is differentially private provided the prior is set correctly. We also evaluate our techniques experimentally and compare them to the existing "Analyze Gauss" algorithm of Dwork et al.

%G eng %U http://arxiv.org/abs/1507.00056 %0 Conference Paper %B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %D 2015 %T Robust Traceability from Trace Amounts %A C. Dwork %A Smith, A %A T Steinke %A J Ullman %A S. Vadhan %XThe privacy risks inherent in the release of a large number of summary statistics were illustrated by Homer et al. (PLoS Genetics, 2008), who considered the case of 1-way marginals of SNP allele frequencies obtained in a genome-wide association study: Given a large number of minor allele frequencies from a case group of individuals diagnosed with a particular disease, together with the genomic data of a single target individual and statistics from a sizable reference dataset independently drawn from the same population, an attacker can determine with high confidence whether or not the target is in the case group. In this work we describe and analyze a simple attack that succeeds even if the summary statistics are significantly distorted, whether due to measurement error or noise intentionally introduced to protect privacy. Our attack only requires that the vector of distorted summary statistics is close to the vector of true marginals in `1 norm. Moreover, the reference pool required by previous attacks can be replaced by a single sample drawn from the underlying population. The new attack, which is not specific to genomics and which handles Gaussian as well as Bernouilli data, significantly generalizes recent lower bounds on the noise needed to ensure differential privacy (Bun, Ullman, and Vadhan, STOC 2014; Steinke and Ullman, 2015), obviating the need for the attacker to control the exact distribution of the data.

%B IEEE Symposium on Foundations of Computer Science (FOCS 2015) %C Berkeley, California %8 10/18-20/2015 %G eng %0 Journal Article %J Technology Science %D 2015 %T Sharing Sensitive Data with Confidence: The Datatags System %A Sweeney, Latanya %A Mercè Crosas %A Bar-Sinai, Michael %XSociety generates data on a scale previously unimagined. Wide sharing of these data promises to improve personal health, lower healthcare costs, and provide a better quality of life. There is a tendency to want to share data freely. However, these same data often include sensitive information about people that could cause serious harms if shared widely. A multitude of regulations, laws and best practices protect data that contain sensitive personal information. Government agencies, research labs, and corporations that share data, as well as review boards and privacy officers making data sharing decisions, are vigilant but uncertain. This uncertainty creates a tendency not to share data at all. Some data are more harmful than other data; sharing should not be an all-or-nothing choice. How do we share data in ways that ensure access is commensurate with risks of harm?

%B Technology Science %G eng %U http://techscience.org/a/2015101601/ %0 Conference Paper %D 2015 %T Simultaneous private learning of multiple concepts %A Mark Bun %A Kobbi Nissim %A Uri Stemmer %XWe investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving k learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving k learning tasks without privacy? In our setting, an individual example consists of a domain element x labeled by k unknown concepts (c1,…,ck). The goal of a multi-learner is to output k hypotheses (h1,…,hk) that generalize the input examples.

Without concern for privacy, the sample complexity needed to simultaneously learn k concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with k. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in k.