Publications

2014
A. Wood, et al., “Integrating Approaches to Privacy Across the Research Lifecycle: Long-Term Longitudinal Studies”. Cambridge: Harvard University, 2014.
On September 24-25, 2013, the Privacy Tools for Sharing Research Data project at Harvard University held a workshop titled "Integrating Approaches to Privacy across the Research Data Lifecycle." Over forty leading experts in computer science, statistics, law, policy, and social science research convened to discuss the state of the art in data privacy research. The resulting conversations centered on the emerging tools and approaches from the participants’ various disciplines and how they should be integrated in the context of real-world use cases that involve the management of confidential research data. This workshop report, the first in a series, provides an overview of the long-term longitudinal study use case. Long-term longitudinal studies collect, at multiple points over a long period of time, highly specific and often sensitive data describing the health, socioeconomic, or behavioral characteristics of human subjects. The value of such studies lies in part in their ability to link a set of behaviors and changes to each individual, but these factors tend to make the combination of observable characteristics associated with each subject unique and potentially identifiable. Using the research information lifecycle as a framework, this report discusses the defining features of long-term longitudinal studies and the associated challenges for researchers tasked with collecting and analyzing such data while protecting the privacy of human subjects. It also describes the disclosure risks and common legal and technical approaches currently used to manage confidentiality in longitudinal data. Finally, it identifies urgent problems and areas for future research to advance the integration of various methods for preserving confidentiality in research data.
R. Bassily, Smith, A., and Thakurta, A., “Private Empirical Risk Minimization, Revisited”, in ICML 2014 Workshop on Learning, Security and Privacy, Beijing, China, 2014.
In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for (ϵ,0)- and (ϵ,δ)-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contribution of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.
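
A minimal sketch of the setting, not the algorithm of this paper: noisy projected gradient descent for ERM over a bounded domain, with per-example gradient clipping and a standard Gaussian-mechanism noise calibration. All function names and constants below are assumptions made for the illustration.

    import numpy as np

    def dp_erm_gradient_descent(X, y, loss_grad, epsilon, delta,
                                steps=100, lr=0.1, L=1.0, radius=1.0):
        """Sketch of (epsilon, delta)-DP projected gradient descent for ERM.
        Assumes each per-example loss gradient has norm at most L (enforced by
        clipping) and that the domain is the Euclidean ball of the given radius."""
        n, d = X.shape
        theta = np.zeros(d)
        # Naive per-step budget split; a tighter privacy accountant would do better.
        eps_step, delta_step = epsilon / steps, delta / steps
        # Replacing one record moves the average gradient by at most 2L/n.
        sensitivity = 2.0 * L / n
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta_step)) / eps_step
        for _ in range(steps):
            grads = np.array([loss_grad(theta, X[i], y[i]) for i in range(n)])
            norms = np.linalg.norm(grads, axis=1, keepdims=True)
            grads = grads * np.minimum(1.0, L / np.maximum(norms, 1e-12))  # clip to norm L
            noisy_grad = grads.mean(axis=0) + np.random.normal(0.0, sigma, size=d)
            theta = theta - lr * noisy_grad
            norm = np.linalg.norm(theta)
            if norm > radius:  # project back onto the bounded domain
                theta *= radius / norm
        return theta

    def logistic_grad(theta, x, y):
        """Per-example gradient of the logistic loss, labels y in {-1, +1}."""
        return -y * x / (1.0 + np.exp(y * np.dot(theta, x)))

A call such as dp_erm_gradient_descent(X, y, logistic_grad, epsilon=1.0, delta=1e-6) would fit a private linear classifier under these (hypothetical) parameter choices.
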
C. Dimoulas, Moore, S., Askarov, A., and Chong, S., “Declarative Policies for Capability Control”, in Proceedings of the 27th IEEE Computer Security Foundations Symposium, Piscataway, NJ, USA, 2014.
In capability-safe languages, components can access a resource only if they possess a capability for that resource. As a result, a programmer can prevent an untrusted component from accessing a sensitive resource by ensuring that the component never acquires the corresponding capability. In order to reason about which components may use a sensitive resource it is necessary to reason about how capabilities propagate through a system. This may be difficult, or, in the case of dynamically composed code, impossible to do before running the system. To counter this situation, we propose extensions to capability-safe languages that restrict the use of capabilities according to declarative policies. We introduce two independently useful semantic security policies to regulate capabilities and describe language-based mechanisms that enforce them. Access control policies restrict which components may use a capability and are enforced using higher-order contracts. Integrity policies restrict which components may influence (directly or indirectly) the use of a capability and are enforced using an information-flow type system. Finally, we describe how programmers can dynamically and soundly combine components that enforce access control or integrity policies with components that enforce different policies or even no policy at all.
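
As a loose analogy for the access-control policies described above (the paper enforces them with higher-order contracts in a capability-safe language; the Python sketch below is only an illustration, and the caller identifies itself explicitly rather than being determined by the runtime): a capability is wrapped so that only components named in a declarative allow-list may invoke it.

    class PolicyViolation(Exception):
        pass

    def restrict(capability, allowed_components):
        """Wrap a capability so only components in the allow-list may invoke it."""
        def guarded(component, *args, **kwargs):
            if component not in allowed_components:
                raise PolicyViolation(f"{component!r} may not use this capability")
            return capability(*args, **kwargs)
        return guarded

    # Hypothetical sensitive capability: appending to an audit log.
    def append_audit_log(message):
        print("AUDIT:", message)

    guarded_log = restrict(append_audit_log, allowed_components={"auditor"})
    guarded_log("auditor", "report generated")   # permitted by the policy
    # guarded_log("plugin", "tampering")         # would raise PolicyViolation
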
V. Feldman and Xiao, D., “Sample Complexity Bounds on Differentially Private Learning via Communication Complexity”, Proceedings of the 27th Conference on Learning Theory (COLT 2014), vol. 35. JMLR Workshop and Conference Proceedings, pp. 1-20, 2014.
In this work we analyze the sample complexity of classification by differentially private algorithms. Differential privacy is a strong and well-studied notion of privacy introduced by Dwork et al. (2006) that ensures that the output of an algorithm leaks little information about the data point provided by any of the participating individuals. The sample complexity of private PAC and agnostic learning was studied in a number of prior works starting with (Kasiviswanathan et al., 2008), but a number of basic questions remain open (Beimel et al., 2010; Chaudhuri and Hsu, 2011; Beimel et al., 2013a,b). Our main contribution is an equivalence between the sample complexity of differentially-private learning of a concept class C (or SCDP(C)) and the randomized one-way communication complexity of the evaluation problem for concepts from C. Using this equivalence we prove the following bounds:
  • SCDP(C)=Ω(LDim(C)), where LDim(C) is the Littlestone's (1987) dimension characterizing the number of mistakes in the online-mistake-bound learning model. This result implies that SCDP(C) is different from the VC-dimension of C, resolving one of the main open questions from prior work.
  • For any t, there exists a class C such that LDim(C)=2 but SCDP(C)≥t.
  • For any t, there exists a class C such that the sample complexity of (pure) α-differentially private PAC learning is Ω(t/α) but the sample complexity of the relaxed (α,β)-differentially private PAC learning is O(log(1/β)/α). This resolves an open problem from (Beimel et al., 2013b). 
We also obtain simpler proofs for a number of known related results. Our equivalence builds on a characterization of sample complexity by Beimel et al. (2013a), and our bounds rely on a number of known results from communication complexity.
K. Chandrasekaran, Thaler, J., Ullman, J., and Wan, A., “Faster Private Release of Marginals on Small Databases”, in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 387–402.
M. Bun, Ullman, J., and Vadhan, S., “Fingerprinting Codes and the Price of Approximate Differential Privacy”, in Proceedings of the 46th Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2014, pp. 1–10.
M. Kearns, Pai, M., Roth, A., and Ullman, J., “Mechanism Design in Large Games: Incentives and Privacy”, in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 403–410.
K. Nissim, Vadhan, S., and Xiao, D., “Redrawing the Boundaries on Purchasing Data from Privacy-sensitive Individuals”, in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2014, pp. 411–422.
M. M. Pai, Roth, A., and Ullman, J., “An Anti-Folk Theorem for Large Repeated Games with Imperfect Monitoring”, CoRR, vol. abs/1402.2801, 2014.
J. Hsu, et al., “Privately Solving Linear Programs”, in Automata, Languages, and Programming, vol. 8572, Springer Berlin Heidelberg, 2014, pp. 612-624.
D. J. Weitzner, et al., “Consumer Privacy Bill of Rights and Big Data: Response to White House Office of Science and Technology Policy Request for Information”. 2014.
In response to the White House Office of Science and Technology Policy Request for Information on Big Data Privacy, we offer these comments based on presentations and discussions at the White House-MIT workshop “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice” and subsequent workshops co-sponsored with Data & Society, the NYU Information Law Institute, and the UC Berkeley iSchool.
M. Altman, O’Brien, D., and Wood, A., “Comment on the Occupational Safety and Health Administration (OSHA) Proposed Rule: Improve Tracking of Workplace Injuries and Illnesses; Extension of Comment Period”. 2014.
2013
S. H. Chan, Costa, T. B., and Airoldi, E. M., “Estimation of exchangeable graph models by stochastic blockmodel approximation”, in 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2013, pp. 293-296.
S. P. Kasiviswanathan, Nissim, K., Raskhodnikova, S., and Smith, A., “Analyzing Graphs with Node Differential Privacy”, in Proceedings of the 10th Theory of Cryptography Conference on Theory of Cryptography, Berlin, Heidelberg, 2013, pp. 457–476.
A. Beimel, Nissim, K., and Stemmer, U., “Characterizing the Sample Complexity of Private Learners”, in Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, New York, NY, USA, 2013, pp. 97–110.
A. Beimel, et al., “Private Learning and Sanitization: Pure vs. Approximate Differential Privacy”, in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, vol. 8096, Springer Berlin Heidelberg, 2013, pp. 363-378.
M. Bun and Thaler, J., “Dual Lower Bounds for Approximate Degree and Markov-Bernstein Inequalities”, Automata, Languages, and Programming, vol. 7965, pp. 303-314, 2013.
The $\varepsilon$-approximate degree of a Boolean function $f: \{-1,1\}^n \to \{-1,1\}$ is the minimum degree of a real polynomial that approximates $f$ to within $\varepsilon$ in the $\ell_\infty$ norm. We prove several lower bounds on this important complexity measure by explicitly constructing solutions to the dual of an appropriate linear program. Our first result resolves the $\varepsilon$-approximate degree of the two-level AND-OR tree for any constant $\varepsilon > 0$. We show that this quantity is $\Theta(\sqrt{n})$, closing a line of incrementally larger lower bounds [3,11,21,30,32]. The same lower bound was recently obtained independently by Sherstov using related techniques [25]. Our second result gives an explicit dual polynomial that witnesses a tight lower bound for the approximate degree of any symmetric Boolean function, addressing a question of Špalek [34]. Our final contribution is to reprove several Markov-type inequalities from approximation theory by constructing explicit dual solutions to natural linear programs. These inequalities underlie the proofs of many of the best-known approximate degree lower bounds, and have important uses throughout theoretical computer science.
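
The dual objects referred to above can be made explicit. By LP duality (a standard characterization, not specific to this paper's constructions), showing that the $\varepsilon$-approximate degree of $f$ exceeds $d$ amounts to exhibiting a dual witness $\psi$:

$$
\widetilde{\operatorname{deg}}_\varepsilon(f) > d
\iff
\exists\, \psi:\{-1,1\}^n \to \mathbb{R} \ \text{ such that }\
\sum_x |\psi(x)| = 1,\quad
\sum_x \psi(x)\, f(x) > \varepsilon,\quad
\sum_x \psi(x)\, p(x) = 0 \ \text{ for every polynomial } p \text{ of degree at most } d.
$$

Such a $\psi$ is said to have pure high degree greater than $d$, and the paper's contributions are explicit constructions of witnesses of this form.
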
K. Chandrasekaran, Thaler, J., Ullman, J., and Wan, A., “Faster Private Release of Marginals on Small Databases”, CoRR, vol. abs/1304.3754, 2013.
We study the problem of answering \emph{$k$-way marginal} queries on a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of the database's records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses on a dataset, and designing efficient algorithms for privately answering marginal queries has been identified as an important open problem in private data analysis. For any $k$, we give a differentially private online algorithm that runs in time $$ \min\{\exp(d^{1-\Omega(1/\sqrt{k})}), \exp(d / \log^{.99} d)\} $$ per query and answers any (possibly superpolynomially long and adaptively chosen) sequence of $k$-way marginal queries up to error at most $\pm .01$ on every query, provided $n \gtrsim d^{.51}$. To the best of our knowledge, this is the first algorithm capable of privately answering marginal queries with a non-trivial worst-case accuracy guarantee on a database of size $\mathrm{poly}(d, k)$ in time $\exp(o(d))$. Our algorithms are a variant of the private multiplicative weights algorithm (Hardt and Rothblum, FOCS '10), but using a different low-weight representation of the database. We derive our low-weight representation using approximations to the OR function by low-degree polynomials with coefficients of bounded $L_1$-norm. We also prove a strong limitation on our approach that is of independent approximation-theoretic interest. Specifically, we show that for any $k = o(\log d)$, any polynomial with coefficients of $L_1$-norm $\mathrm{poly}(d)$ that pointwise approximates the $d$-variate OR function on all inputs of Hamming weight at most $k$ must have degree $d^{1-O(1/\sqrt{k})}$.
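
For contrast with the algorithm above, a baseline sketch of the query class itself: a single $k$-way marginal answered with the Laplace mechanism (independent noise per query, which does not give the worst-case guarantees over long, adaptively chosen query sequences that this paper targets). The database layout and function names are assumptions of the sketch.

    import numpy as np

    def marginal_query(D, columns, values):
        """Fraction of records x with x[c] == v for every queried (column c, value v)."""
        matches = np.all(D[:, columns] == values, axis=1)
        return matches.mean()

    def laplace_marginal(D, columns, values, epsilon):
        """Answer one marginal with Laplace noise; a fraction has sensitivity 1/n."""
        n = D.shape[0]
        true_answer = marginal_query(D, list(columns), list(values))
        noise = np.random.laplace(loc=0.0, scale=1.0 / (n * epsilon))
        return true_answer + noise

    # Hypothetical example: n = 1000 records over 20 binary attributes, one 3-way marginal.
    rng = np.random.default_rng(0)
    D = rng.integers(0, 2, size=(1000, 20))
    print(laplace_marginal(D, columns=[0, 3, 7], values=[1, 0, 1], epsilon=0.5))
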
L. Sweeney, Yasnoff, W. A., and Shortliffe, E. H., “Putting Health IT on the Path to Success”, JAMA, vol. 309, no. 10, pp. 989-990, 2013.
The promise of health information technology (HIT) is comprehensive electronic patient records when and where needed, leading to improved quality of care at reduced cost. However, physician experience and other available evidence suggest that this promise is largely unfulfilled. Current approaches to health information exchange have serious flaws: they are (1) complex and expensive; (2) prone to error and insecurity; (3) apt to increase liability; (4) not financially sustainable; (5) unable to protect privacy; (6) unable to ensure stakeholder cooperation; and (7) unable to facilitate robust data sharing. The good news is that personal health record banks pose a viable alternative that is: (a) simpler; (b) scalable; (c) less expensive; (d) more secure; (e) community oriented to ensure stakeholder participation; and (f) capable of providing the most comprehensive records. The idea of patient-controlled records is not new, but what is new is how personally controlled records can help achieve the HIT vision.
S. Hooley and Sweeney, L., “Survey of Publicly Available State Health Databases”, Data Privacy Lab, IQSS, Harvard University, 2013.