Applied Computational Mathematics Division


Up

Clustering Algorithms for Streaming and Online Settings

Claire Monteleoni
Department of Computer Science, George Washington University

Tuesday, June 17, 2014 15:00-16:00,
Building 225, Room B111
Gaithersburg
Tuesday, June 17, 2014 13:00-14:00,
Room 1-4058
Boulder

Abstract:

Clustering techniques are widely used to summarize large quantities of data (e.g. aggregating similar news stories), however their outputs can be hard to evaluate. While a domain expert could judge the quality of a clustering, having a human in the loop is often impractical. Probabilistic assumptions have been used to analyze clustering algorithms, for example i.i.d. data, or even data generated by a well-separated mixture of Gaussians. Without any distributional assumptions, one can analyze clustering algorithms by formulating some objective function, and proving that a clustering algorithm either optimizes or approximates it. The k-means clustering objective, for Euclidean data, is simple, intuitive, and widely-cited, however it is NP-hard to optimize, and few algorithms approximate it, even in the batch setting (the algorithm known as "k-means" does not have an approximation guarantee). Dasgupta (2008) posed open problems for approximating it on data streams.

In this talk, I will discuss my ongoing work on designing clustering algorithms for streaming and online settings. First I will present a one-pass, streaming clustering algorithm which approximates the k-means objective on finite data streams. This involves analyzing a variant of the k-means++ algorithm, and extending a divide-and-conquer streaming clustering algorithm from the k-medoid objective. Then I will turn to endless data streams, and introduce a family of algorithms for online clustering with experts. We extend algorithms for online learning with experts, to the unsupervised setting, using intermediate k-means costs, instead of prediction errors, to re-weight experts. When the experts are instantiated as k-means approximate (batch) clustering algorithms run on a sliding window of the data stream, we provide novel online approximation bounds that combine regret bounds extended from supervised online learning, with k-means approximation guarantees. Notably, the resulting bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. I will also present encouraging experimental results.

This talk is based on joint work with Nir Ailon, Ragesh Jaiswal, and Anna Choromanska.

Speaker Bio: Claire Monteleoni is an assistant professor of Computer Science at The George Washington University, which she joined in 2011. Previously, she was research faculty at the Center for Computational Learning Systems, and adjunct faculty in the Department of Computer Science, at Columbia University. She was a postdoc in Computer Science and Engineering at the University of California, San Diego, and completed her PhD and Masters in Computer Science, at MIT. Her research focus is on machine learning algorithms and theory for problems including learning from data streams, learning from raw (unlabeled) data, learning from private data, and Climate Informatics: accelerating discovery in Climate Science with machine learning. Her papers have received several awards. In 2011, she co-founded the International Workshop on Climate Informatics, which is now entering its fourth year. She also co-organized the ICML 2011 workshop on Machine Learning for Global Challenges, and is currently co-organizing the KDD 2014 workshop on Data Science for Social Good. She is on the Editorial Board of the Machine Learning Journal, and she served as an Area Chair for NIPS 2013 and ICML 2012, 2014.

Presentation Slides: PDF

Contact: Y. Liu

Note: Visitors from outside NIST must contact Cathy Graham; (301) 975-3800; at least 24 hours in advance.