
2016-01-22: university paper on RIC framework

Simon Lackerbauer 5 months ago
parent
commit
5e51fc5a66
Signed by: Simon Lackerbauer <simon@lackerbauer.com> GPG Key ID: 2B27C889039C0125

BIN
2016-01-22_robust_information_theoretic_clustering/imgs/bad_clustering.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cat1.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cat2.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cat3.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cat4.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cat5.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cluster_merging1.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/cluster_merging2.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/good_clustering.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/robust_estimation.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/imgs/vac.png View File


BIN
2016-01-22_robust_information_theoretic_clustering/paper.pdf View File


+ 191
- 0
2016-01-22_robust_information_theoretic_clustering/paper.tex View File

@@ -0,0 +1,191 @@
1
+\documentclass[conference]{IEEEtran}
2
+\usepackage{polyglossia}
3
+\usepackage{moreverb}
4
+\usepackage{mathtools}
5
+\usepackage{amsfonts}
6
+\usepackage{pifont}
7
+\usepackage{booktabs,caption}
8
+\captionsetup{labelsep=newline,singlelinecheck=false} % optional
9
+\usepackage{tabularx}
10
+
11
+\begin{document}
12
+\title{A Discussion of the Robust Information-theoretic Clustering Framework (RIC)}
13
+\author{\IEEEauthorblockN{Simon Lackerbauer}
14
+\IEEEauthorblockA{Institut für Informatik\\
15
+Ludwig-Maximilians-Universität München\\
16
+Oettingenstraße 67, 80538 München\\
17
+lackerbauer@lrz.mwn.de}}
18
+
19
+\maketitle
20
+
21
+\begin{abstract}
22
+In their 2006 paper \textit{Robust Information-theoretic Clustering}, Böhm et al. propose a three-part algorithm that finds diversely distributed clusters in data sets using an information-theoretic approach, most notably the VAC criterion, which rests on the minimum description length (MDL) principle, while being input-free and as such usable without specialist knowledge. After comparison with other algorithms with the same or similar prerequisites, one can conclude that it reaches its goal admirably, although with certain limitations, most notably its dependence on the quality of the input clustering and its runtime efficiency.
23
+\end{abstract}
24
+
25
+\section{Introduction}
26
+The authors of the paper discussed here\cite{Bohm2006-ts} set out to answer the question “How do we find a natural clustering of a real world point set, which contains an unknown number of clusters with different shapes, and which may be contaminated by noise?”
27
+
28
+Heavy emphasis was put on the requirement that the algorithm accomplishing this feat need no user input and, furthermore, not be restricted to purely Gaussian cluster distributions. The proposed algorithm should also not be easily thrown off by noisy data sets, and should moreover be at least reasonably efficient in its runtime complexity. Whether and how well the proposed algorithm meets these self-imposed requirements, and how it compares to other methods by researchers tackling much the same problems (compare e.g. NIC\cite{Faivishevsky2010-uk} and PaCCo\cite{Mueller2011-hd}), is the focus of this discussion.
29
+
30
+\section{Overview}
31
+First, in section~\ref{sec:ric}, the RIC framework is presented and the principles behind it explained. Section~\ref{sec:exp} then takes a short look at the experiments on which RIC was tested. Afterwards, section~\ref{sec:crit} expands on general criticisms of the framework that were raised by other seminar participants in the discussion following the presentation. In section~\ref{sec:comp} the algorithm is compared with other, similar algorithms along a variety of axes. Finally, this discussion concludes in section~\ref{sec:conc}.
32
+\section{The RIC Framework}
33
+\label{sec:ric}
34
+The RIC framework takes as input a pre-clustered data set, though it is indifferent towards the method by which this pre-clustering is accomplished. The authors themselves use K-means on their example data sets. RIC is thus not so much a clustering algorithm as a cluster-refining algorithm\cite{Bohm2008-eh}. (By definition, any cluster-refining algorithm should in principle be able to function as a clustering algorithm, refining from the case that every point in the data set is a distinct cluster.) Compared with other input-free clustering algorithms like X-means\cite{Pelleg2000-jr} and G-means\cite{Hamerly2003-hm}, RIC is supposed to find any clustering that can be defined by its predetermined PDFs, which means it does not rely solely on Gaussian clusters, but is nevertheless restricted to detecting clusters drawn from a (reasonably small) finite set of mathematically not-too-complex probability densities.
35
+
36
+The RIC framework is composed of two sub-algorithms that do the heavy lifting, with the MDL-like VAC criterion serving as the measure of goodness.
37
+\subsection{VAC - Volume After Compression}
38
+The VAC (Volume After Compression) criterion lies at the heart of the RIC framework and guides virtually every decision the framework has to make about the goodness of a given clustering. VAC uses Elias gamma encoding\cite{Elias1975-wj} to encode any integer $i$ using $O(\log i)$ bits, be it point coordinates or point offsets from the cluster center if point $\vec{x}$ is said to belong to a cluster $C$. It further takes into account possible correlation within a cluster, and whether it makes sense to additionally store a decorrelation matrix. Thus, the volume after compression of a point $ \vec{x} $ in a grid with $ \gamma $ distance between grid cells can be summarized as 
39
+\[ VAC(x) = \left(\log_2 \frac{n}{|C|} \right) + \left(\sum_{0 \leq i < d} \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma} \right) \]
40
+where $ \log_2 \frac{n}{|C|} $ is the encoding cost of the cluster ID and $ \sum_{0 \leq i < d} \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma} $ is the encoding cost of all the integers defining a point $ \vec{x} $ in $ d $ dimensions. Obviously, the encoding costs depend quite significantly on the probability density function ($ pdf_i $) that is assigned to estimate one dimension of the cluster.
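+
+As a brief illustration of the integer code involved: the Elias gamma code of a positive integer $ i $ consists of $ \lfloor \log_2 i \rfloor $ zeros followed by the binary representation of $ i $, i.e. $ 2\lfloor \log_2 i \rfloor + 1 $ bits in total. A minimal Python sketch:
+\begin{verbatim}
+def elias_gamma(i):
+    # positive integer -> unary length prefix plus binary digits
+    b = bin(i)[2:]                 # e.g. 9 -> '1001'
+    return '0' * (len(b) - 1) + b  # 9 -> '0001001' (7 bits)
+\end{verbatim}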
41
+
42
+To find the appropriate $ pdf $ the VAC criterion is employed again so that
43
+\[ pdf_i = {\arg\min}_{pdf_{stat} \in \text{PDF}} \sum_{\vec{x} \in C} \log_2 \frac{1}{pdf_{stat}(x_i) \cdot \gamma} \]
44
+where the $ pdf $ is chosen from the set of predefined PDFs, with defining characteristics (e.g. mean, variance) calculated from the considered coordinate dimension.
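+
+One way to picture this selection step is as a small argmin over candidate densities whose parameters are estimated from the coordinate itself. A rough Python sketch, assuming SciPy's \texttt{stats} module, a fixed grid constant $ \gamma $, and glossing over degenerate cases such as zero variance:
+\begin{verbatim}
+import numpy as np
+from scipy import stats
+
+def coord_cost(xs, pdf, gamma):
+    # total coding cost of one coordinate dimension under a candidate pdf
+    return float(np.sum(np.log2(1.0 / (pdf(xs) * gamma))))
+
+def best_pdf(xs, gamma):
+    # candidate densities with roughly estimated parameters
+    cands = [stats.norm(xs.mean(), xs.std()).pdf,
+             stats.uniform(xs.min(), xs.max() - xs.min()).pdf,
+             stats.laplace(np.median(xs), xs.std()).pdf]
+    return min(cands, key=lambda p: coord_cost(xs, p, gamma))
+\end{verbatim}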
45
+
46
+Further, a decorrelation matrix is calculated, but only included if the VAC criterion warrants such a move, i.e. if the decorrelation saves more space by virtue of its better fit than it costs to store the matrix itself.
47
+
48
+Putting all these formulae together, we finally reach the VAC of a full cluster in 
49
+\[ VAC(C) = VAC(dec(C)) + \sum_{\vec{x} \in C} VAC(dec(C) \cdot \vec{x}) \]
50
+where $ dec(C) $ is the decorrelated cluster.
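+
+Putting the pieces together, the VAC of a whole cluster can be sketched as follows, reusing \texttt{coord\_cost} and \texttt{best\_pdf} from above; \texttt{dec} stands for the chosen decorrelation matrix, whose own storage cost would still have to be added:
+\begin{verbatim}
+def cluster_vac(points, dec, gamma, n_total):
+    # points: |C| x d array of coordinates, dec: d x d decorrelation matrix
+    z = points @ dec.T                       # decorrelated coordinates
+    id_cost = len(points) * np.log2(n_total / len(points))
+    dim_cost = sum(coord_cost(z[:, i], best_pdf(z[:, i], gamma), gamma)
+                   for i in range(z.shape[1]))
+    return id_cost + dim_cost
+\end{verbatim}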
51
+
52
+The authors illustrate the principle behind encoding a cluster with different PDFs in figure~\ref{fig:vac}. Since they write themselves that the grid constant $ \gamma $ should be chosen such that each data point falls into its own grid cell, the choice of $ \gamma $ in their example seems a little off, but not in a way that harms the illustration.
53
+
54
+\begin{figure}
55
+  \centering
56
+  \noindent\includegraphics[width=\linewidth]{imgs/vac}
57
+  \caption[]{Example of VAC\cite{Bohm2006-ts}}
58
+  \label{fig:vac}
59
+\end{figure}
60
+
61
+\subsection{RF - Robust Fitting}
62
+The Robust Fitting sub-algorithm deals with outliers in the input clusters. The fitting algorithm employs the VAC criterion to determine whether a point is adequately described by the characteristic distribution of the cluster or whether it makes more sense not to count that specific point as belonging to (any or this) specific cluster. An example of how a conventional clustering might go about finding a characteristic distribution, as opposed to how the RF algorithm estimates its fit using the VAC criterion, can be found in figure~\ref{fig:robust_estimation}. The mere introduction of the VAC criterion as a measure of goodness does not by itself provide outlier detection, however.
63
+
64
+\begin{figure}
65
+  \centering
66
+  \noindent\includegraphics[width=\linewidth]{imgs/robust_estimation}
67
+  \caption[]{Conventional and robust estimation\cite{Bohm2006-ts}}
68
+  \label{fig:robust_estimation}
69
+\end{figure}
70
+
71
+Instead, RF derives its robustness against outliers from the way the covariance matrix $ \Sigma $ is estimated: conventionally, a matrix $ \Sigma_C $ is computed from the points $ \vec{x} \in C $ by averaging
72
+\[ \Sigma_C = \frac{1}{|C|} \sum_{\vec{x} \in C} (\vec{x} - \vec{\mu}) \cdot (\vec{x} - \vec{\mu})^\text{T} \]
73
+whereas RF does not rely solely on arithmetic means but also tries the coordinate-wise median (entry $ (i,j) $ being the median of $ (x_i - \mu_{R,i}) \cdot (x_j - \mu_{R,j}) $ over all $ \vec{x} \in C $, with $ \vec{\mu}_R $ the coordinate-wise median center), yielding the robust covariation matrix $ \Sigma_R $.
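+
+In numpy terms, the two estimators differ only in the aggregation step; a minimal sketch:
+\begin{verbatim}
+import numpy as np
+
+def conventional_cov(X):
+    D = X - X.mean(axis=0)                   # deviations from the mean
+    return D.T @ D / len(X)                  # average of outer products
+
+def robust_cov(X):
+    D = X - np.median(X, axis=0)             # deviations from the median center
+    d = X.shape[1]
+    # entry (i, j): median over all points of the product of deviations
+    return np.array([[np.median(D[:, i] * D[:, j]) for j in range(d)]
+                     for i in range(d)])
+\end{verbatim}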
74
+
75
+Given the conventional covariance matrix $ \Sigma_C $, Principal Component Analysis (PCA) yields the orthonormal eigenvector matrix $ V $, the decorrelation matrix, and the diagonal matrix $ \Lambda $ containing all eigenvalues, with $ \Sigma_C = V \Lambda V^\text{T} $. Using the property of diagonal dominance (i.e. every diagonal element is greater than the sum of the absolute values of the other entries in its row), which ensures the matrix is positive definite and hence invertible, one can then measure the distance between any two points $ \vec{x} $ and $ \vec{y} $ with the Mahalanobis distance\cite{Mahalanobis1936-wy} defined by $ V $ and $ \Lambda $:
76
+\[ d_{\Sigma_C}(\vec{x}, \vec{y}) = (\vec{x} - \vec{y})^\text{T} \cdot V \cdot \Lambda^{-1} \cdot V^\text{T} \cdot (\vec{x} - \vec{y}) \]
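+
+With the eigendecomposition written out, the induced distance can be sketched as follows (numpy as \texttt{np} as above; the matrix is assumed to be symmetric and positive definite):
+\begin{verbatim}
+def mahalanobis(x, y, cov):
+    lam, V = np.linalg.eigh(cov)    # eigenvalues and orthonormal eigenvectors
+    diff = x - y
+    return diff @ V @ np.diag(1.0 / lam) @ V.T @ diff
+\end{verbatim}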
77
+
78
+The robust covariance matrix $ \Sigma_R $ does not necessarily have the diagonal dominance property. However, this can easily be fixed by adding a matrix $ \varPhi \cdot I $ to the covariance matrix. $ \varPhi $ should naturally be chosen as the maximum difference between the off-diagonal row sums and their corresponding diagonal elements. The authors also add a small amount on top of that (in their example 10\%), though it is not really clear why that extra margin would be needed, as even
79
+\[ \varPhi_{naive} = \max_{0 \leq i < d} \left\{ \left( \sum_{0 \leq j < d, i\neq j} (\Sigma_R)_{i,j} \right) - (\Sigma_{R})_{i,i} \right\} \]
80
+should be enough to satisfy the diagonal dominance property. They do, however, add that 10\%, which means the value suggested in the original paper is
81
+\[  \varPhi = 1.1 \cdot \varPhi_{naive}. \]
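+
+This repair step can be sketched directly from the formula above (again with numpy; the shift is skipped when the matrix is already diagonally dominant):
+\begin{verbatim}
+def make_diag_dominant(S, slack=1.1):
+    off = S.sum(axis=1) - np.diag(S)         # row sums without the diagonal
+    phi_naive = np.max(off - np.diag(S))
+    phi = slack * max(phi_naive, 0.0)        # 10% on top, as suggested
+    return S + phi * np.eye(len(S))
+\end{verbatim}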
82
+
83
+The estimation of a robust covariance matrix does not mean that this matrix is actually used as the descriptive matrix of the cluster, however: the decision about which matrix to use is, as expected, made by calculating the VAC of $ \Sigma_C $, of $ \Sigma_R $, and of several further candidate matrices, and finally choosing the one yielding the lowest VAC.
84
+
85
+Once the covariance matrix with the lowest VAC is found, all points are first deemed outliers and then iteratively inserted into the set of cluster points (by order of their Mahalanobis distance from the distribution center), with accompanying VAC scoring at each step. Once a (local) minimum is reached, the process stops and the cluster and outliers are then returned to the next RIC sub-algorithm, CM.
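+
+The insertion loop itself might look as follows, reusing \texttt{mahalanobis} from above and assuming a hypothetical callback \texttt{vac\_of(core, noise)} that returns the coding cost of a given core/outlier split:
+\begin{verbatim}
+def robust_fit(points, center, cov, vac_of):
+    order = np.argsort([mahalanobis(p, center, cov) for p in points])
+    best_k, best_vac = 0, vac_of(points[:0], points)
+    for k in range(1, len(points) + 1):
+        v = vac_of(points[order[:k]], points[order[k:]])
+        if v >= best_vac:
+            break                            # stop at the first (local) minimum
+        best_k, best_vac = k, v
+    return order[:best_k]                    # indices of the core points
+\end{verbatim}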
86
+
87
+\subsection{CM - Cluster Merging}
88
+The Cluster Merging algorithm is likely the computationally costliest part of the RIC process, as it compares each cluster with every other cluster for possible merging, again using the VAC criterion, i.e. simply checking whether
89
+\[ VAC(C_i \cup C_j) < VAC(C_i) + VAC(C_j) \]
90
+for all $ C_i,C_j \in \mathcal{C}, i \neq j $. Getting the VAC score of a merged cluster seems to be possible only by building a new set of points from the two previous clusters and running the whole RF algorithm on that set again.
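+
+A naive sketch of one merging pass, with clusters kept as plain point lists and a hypothetical scoring callback \texttt{vac} that, in the full framework, would re-run RF on the union:
+\begin{verbatim}
+def merge_pass(clusters, vac):
+    best = None                              # (gain, i, j) of the best merge
+    for i in range(len(clusters)):
+        for j in range(i + 1, len(clusters)):
+            gain = vac(clusters[i]) + vac(clusters[j]) \
+                   - vac(clusters[i] + clusters[j])
+            if gain > 0 and (best is None or gain > best[0]):
+                best = (gain, i, j)
+    if best is None:
+        return clusters                      # no merge lowers the coding cost
+    _, i, j = best
+    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
+    return rest + [clusters[i] + clusters[j]]
+\end{verbatim}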
91
+
92
+The CM part of the algorithm also comes with the option of going further down the search tree to get out of a local minimum. It will then merge clusters even if
93
+\[ VAC(C_i \cup C_j) \geq VAC(C_i) + VAC(C_j) \]
94
+for up to a given number $ t $ of iterations. If the search does turn out to have been stuck in a local minimum, the counter naturally resets to 0.
95
+
96
+\section{Experiments}
97
+\label{sec:exp}
98
+RIC is tested on two different sets of synthetic data and on two different sets of real data. RIC found all the correlational, Laplacian and Gaussian clusters in the synthetic data sets, the input being provided by either K-means or DBSCAN\cite{Ester1996-rh}. RIC manages to correctly identify more than 98\% of the noise objects in the data sets and reduces the initial VAC score by about a quarter in each case. The runtimes reported for both data sets are discussed below in section~\ref{sub:run}.
99
+
100
+Further evaluation is provided by running RIC on a 14-dimensional real-world metabolic data set. Clustering in this biomedical setting can help in diagnosing disorders, provided the algorithm can correctly separate the healthy control group from the diseased test group. For that reason, \textit{impurities} in the clusters are counted as a further point of comparison, impurities being points assigned to the wrong group relative to the known distribution, i.e. the sum of statistical type I and type II errors. As expected, RIC is able to beat both spectral and K-means clustering by achieving a lower VAC, even with both other clustering algorithms being provided with the correct number of clusters to identify. More important, however, is RIC's far lower number of impurities (mostly ill data points assigned to the healthy cluster); it usually makes only about a third of the errors that K-means and spectral clustering make on their own.
101
+
102
+The fourth example provided concerns retinal images of cats in various states of health. The images are encoded as a 7-dimensional data set with more than 21,000 instances, which RIC (again using K-means as a starting point) groups into 13 clusters, all of which are correlational, and two of which describe actual, studied, biologically meaningful relationships, namely the ``Müller cells'' and ``rod photoreceptors''. As RIC's stated aim was to provide meaningful clusters without user input, and, as such, without relying on the user's specific domain knowledge, this can be taken as a sign that it at least partially reached its goal.
103
+
104
+\section{General Criticism}
105
+\label{sec:crit}
106
+As the lead author himself confirms in later works\cite{Bohm2008-eh}\cite{Bohm2010-uu}, the clustering output of RIC strongly depends on the quality of the initial clustering, and the cluster model is limited to linear attribute correlations. Clusters are also limited to the list of predefined PDFs, though that might not be too much of a limitation, as, in theory, new PDFs can be added quite easily. Further, there is a reason that there is only a limited supply of deeply researched and named density functions to begin with, namely that they are readily applicable to a large enough subset of problems. Natural processes that need much more elaborate models are clearly not in the scope of the RIC framework anyway.
107
+
108
+\subsection{Runtime Analysis}
109
+\label{sub:run}
110
+The paper does not contain an explicit runtime analysis of RIC in general or of its sub-algorithms. It is, however, relatively easy to see that the VAC calculation itself should take at most linear time, as it only adds up the total size of an encoded input. The Robust Fitting part can probably be assumed to also need only roughly linear time, whereas the Cluster Merging algorithm almost certainly grows at least quadratically in the worst case.
111
+
112
+The experimental runtimes that can be taken from the paper (147s for 4,751 objects in 2 dimensions and 567s for 7,500 objects in 3 dimensions) also suggest clearly super-linear growth: an input of $ 7500 \cdot 3 = 22500 $ integers, about 2.4 times as many as $ 4751 \cdot 2 = 9502 $ integers, takes roughly 3.9 times as long to process.
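+A rough power-law fit through these two measurements gives an exponent of $ \log(567/147) / \log(22500/9502) \approx 1.6 $, which sits between linear behaviour and the quadratic worst case assumed in table~\ref{tab:comp}; two data points are of course far too few to pin the exponent down.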
113
+
114
+\begin{table*}
115
+  \caption{Comparison of RIC with other information-theoretic clustering algorithms}
116
+  \label{tab:comp}
117
+  \begin{tabularx}{\textwidth}{rlllXXl}
118
+  	\toprule
119
+  	                    & parameter-free & runtime complexity            & input data          & clustering strategy & clusters detected                                       & year \\ \midrule
120
+  	                RIC & yes            & $ O(n^2)^{\beta} $            & matrix              & input-refining      & Gaussian, Laplacian, Correlational, others$ ^{\alpha} $ & 2006 \\
121
+  	              PaCCo & yes            & $ O(n) < x < O(n^3)^{\beta} $ & graph (weighted)    & bisecting             & Gaussian, others$ ^{\alpha} $                           & 2011 \\
122
+  	minCEntropy($ ^+ $) & no             & $ O(n^2) $                    & matrix              & proprietary         & Gaussian                                                & 2010 \\
123
+  	              ROCAT & yes            & $ O(n) $                      & matrix              & bisecting         & not applicable                                          & 2014 \\
124
+  	               PICS & yes            & $ O(n) $                      & graph (attributed)  & bisecting         & not applicable                                          & 2012 \\
125
+  	            INCONCO & yes            &           $ O(n)^\beta $                    & matrix (attributed) & bisecting                    & Gaussian                                                & 2011 \\
126
+  	                NIC & yes            &     $ O(n^2) $                          &    matrix                 &      greedy sequential k-means               &      non-convex                                                   & 2010 \\
127
+  	                VoG &  yes              &     $ O(n)^\gamma $                          &     graph                &      multiple               &          not applicable                                               & 2014 \\ \bottomrule
128
+  	\multicolumn{7}{l}{$ ^{\alpha} $ user-defined PDFs are explicitly mentioned as desired extensions to the algorithm}                                                               \\
129
+  	\multicolumn{7}{l}{$ ^{\beta} $ estimate, as the respective authors did not provide an explicit runtime analysis} \\
130
+    \multicolumn{7}{l}{$ ^{\gamma} $ the average case with $ n $ being the numbers of edges on the graph}
131
+  \end{tabularx}
132
+\end{table*}
133
+
134
+\subsection{Edge Cases}
135
+It would have been interesting to see how the RIC framework reacts to purely random or exactly uniformly distributed input. Would it see structure in the noise regardless? Would the CM step make an initially bad K-means clustering on such data more deterministic? Sadly, no examples are included in the paper that could act as a kind of control group.
136
+
137
+\section{Comparison with other Algorithms}
138
+\label{sec:comp}
139
+In this section, we compare how RIC fares against some of the other clustering algorithms introduced in the seminar, e.g. PaCCo\cite{Mueller2011-hd}, minCEntropy\cite{Vinh2010-tc}, ROCAT\cite{He2014-uf}, PICS\cite{Akoglu2012-vz}, INCONCO\cite{Plant2011-wy}, NIC\cite{Faivishevsky2010-uk}, VoG\cite{Koutra2014-up} and Subdue\cite{Ketkar2005-zs}. The categorical results of these comparisons are summarized in table~\ref{tab:comp}.
140
+
141
+\subsection{PaCCo}
142
+PaCCo is a clustering algorithm for weighted graphs that employs the information-theoretic MDL principle to check the goodness of a clustering just like RIC does, including Huffman coding along a cluster PDF. Unlike RIC, however, PaCCo in general assumes Gaussian clusters, though other PDFs can be added by the user. A further similarity is that both algorithms work without user input. PaCCo, however, comes with its own bisecting K-means strategy, so its input really is completely unordered and unclustered data.
143
+
144
+PaCCo's runtime efficiency is not explicitly stated, other than being ``super-linear''. In tests against Zelnik-Manor and Perona's parameter-free SpectralZM algorithm\cite{Zelnik-Manor2004-rx}, which has a runtime of $ O(n^3) $, PaCCo outpaced SpectralZM with ease. Its real runtime efficiency will thus lie somewhere between linear and cubic.
145
+
146
+In conclusion, given that PaCCo was introduced significantly later than RIC and makes a different input assumption, the RIC framework still holds up surprisingly well in comparison.
147
+
148
+\subsection{minCEntropy$ (^+) $}
149
+minCEntropy does not focus on producing a single clustering solution for a given data set, but instead starts from the observation that data can often be interpreted in more than one way (e.g. data that can be used for facial recognition/authentication could also be used to identify gaze direction). The algorithm's goodness criterion differs from all the other criteria discussed in this paper: instead of referencing the MDL principle, the authors choose to minimize the conditional entropy left in the data, or in other words to maximize the average intra-cluster similarity (hence the name). Of course, concentrating on maximizing intra-cluster similarity will invariably make it easier to describe a given cluster succinctly, in other words to minimize the description length needed.
150
+
151
+The included complexity analysis puts emphasis on the fact that although initializing the required kernel matrix takes up to $ O(n^2) $ time, further updating to reach an optimal clustering takes only linear time and converges quickly.
152
+
153
+minCEntropy$ ^+ $ is the extension of minCEntropy that finds the above-mentioned alternative clusterings when given a clustering to explicitly exclude. There is also a further extension, minCEntropy$ ^{++} $, that can exclude multiple clusterings.
154
+
155
+No variant of the algorithm works completely parameter-free.
156
+
157
+\subsection{ROCAT}
158
+ROCAT is a relatively recent, parameter-free approach to clustering, using the MDL principle (though expressed as a question of entropy, as with minCEntropy above) and focusing on subspace clusters for categorical (i.e. non-numerical) data. As ROCAT deals with categorical data, a defining cluster probability distribution like a Gaussian curve is not really applicable. Clusters built by ROCAT over a matrix view of the attribute data are always rectangular in shape.
159
+
160
+ROCAT's runtime efficiency grows linearly with $ n $ and quadratically with dimensionality, making it one of the most efficient algorithms discussed here.
161
+
162
+\subsection{PICS}
163
+PICS is a parameter-free graph-clustering algorithm that does not cluster solely on connectivity information but also takes node attributes into account. Naturally, PICS also relies on an MDL criterion to decide whether a split of the biggest identified cluster yields a better description of the data (similar to the approach used by PaCCo above).
164
+
165
+\subsection{INCONCO}
166
+INCONCO is a parameter-free clustering algorithm with much the same premise as PICS, except that it operates not on graph data with attributes but on vector data with mixed-type attributes.
167
+
168
+Where RIC uses the Robust Fitting method instead of PCA to make its clustering less sensitive to outliers, INCONCO extends Cholesky decomposition\cite{Golub1996-ak}, as it is more efficient than PCA and yields better results. The use of an MDL criterion for goodness-of-fit decisions at each step of the bisecting pass through the input data set comes as no surprise at this point.
169
+
170
+No explicit theoretical runtime analysis is given, but from the runtimes reported, INCONCO seems extremely efficient relative to the number of points processed, with much of the cost probably due to a large but constant overhead.
171
+
172
+\subsection{NIC}
173
+NIC is a parameter-free framework that can find clusters independently of their underlying structure by utilizing the MeanNN differential entropy estimator. NIC is certainly different in that it can even find clusters whose points form non-convex sets (e.g. concentric circles), a feat that RIC would not be able to accomplish without being fed a complex user-defined distribution first.
174
+
175
+\subsection{VoG}
176
+The premise of VoG is the description of large graphs using a certain vocabulary (e.g. stars, chains, bipartite cores). Clusters in the graph are thus not only recognized but also described by their most characteristic structure. By then examining the frequencies with which the predefined structures of the vocabulary appear in the graph, together with the most notable structures overall, one can make some meaningful abstractions about the underlying real-world structure (e.g. finding the two groups involved in a Wikipedia edit war).
177
+
178
+
179
+\section{Conclusion}
180
+\label{sec:conc}
181
+In the end, especially against competitors that were introduced significantly later and could build upon a larger body of work pointing in much the same direction, the RIC framework still fares surprisingly well. It is one of the few algorithms that comes pre-equipped with more than just the regular Gaussian PDF (of course, Gaussian distributions still play one of the most significant roles in data description across all fields). Its runtime complexity is not the framework's strongest point, but it still holds up against much of the competition. As seen in section~\ref{sec:crit}, there are some valid criticisms to be aware of, but as the real-world examples have shown, any clustering algorithm that can identify known relations consistently, and as such can be used to find new relationships in data, is a step in the right direction.
182
+
183
+Add to that the fact that RIC is exactly that, a framework, one that can profit whenever clustering elsewhere improves and thus delivers better input to be refined by the information-theoretic principles guiding it, and the findings by Böhm et al. certainly advanced the area of research they set out to advance.
184
+
185
+\section*{Acknowledgments}
186
+The author would like to thank Dr. Bianca Wackersreuther for providing input and help whenever needed and for conducting the seminar in a well-received and professional manner; Prof. Christian Böhm, both for co-authoring the discussed paper in the first place and for giving me the opportunity to engage with the topic in this course; and lastly the other participants in this seminar, for providing the constructive criticism after my talk without which this discussion paper would not have been possible.
187
+
188
+\bibliography{sources.bib}{}
189
+\bibliographystyle{plain}
190
+
191
+\end{document}

BIN
2016-01-22_robust_information_theoretic_clustering/slides.pdf View File


+ 352
- 0
2016-01-22_robust_information_theoretic_clustering/slides.tex View File

@@ -0,0 +1,352 @@
1
+\documentclass{beamer}
2
+%\setbeameroption{show only notes}
3
+\usepackage{polyglossia}
4
+\usetheme{default} %minimal
5
+\setbeamercovered{transparent}
6
+\setbeamertemplate{bibliography item}{}
7
+\setbeamertemplate{caption}[numbered]
8
+\setbeamercolor*{bibliography entry title}{fg=black}
9
+\setbeamercolor*{bibliography entry author}{fg=black}
10
+\setbeamercolor*{bibliography entry location}{fg=black}
11
+\setbeamercolor*{bibliography entry note}{fg=black}
12
+\usepackage{natbib}
13
+\bibliographystyle{plain}
14
+\renewcommand\bibfont{\scriptsize}
15
+\beamertemplatenavigationsymbolsempty
16
+
17
+\AtBeginSection[]
18
+{
19
+  \begin{frame}<beamer>
20
+    \frametitle{Outline}
21
+    \tableofcontents[currentsection]
22
+  \end{frame}
23
+}
24
+
25
+
26
+\title{Robust Information-theoretic Clustering}
27
+\subtitle{}
28
+
29
+\author{Simon Lackerbauer}
30
+
31
+\institute[Ludwig-Maximilians-Universität München]
32
+{
33
+  Institut für Informatik\\
34
+  Ludwig-Maximilians-Universität München\\
35
+  lackerbauer@lrz.mwn.de
36
+}
37
+
38
+\date{Seminar \textit{Information Theoretic Data Mining}, Winter Term 2015/16}
39
+
40
+\subject{}
41
+
42
+\AtBeginSubsection[]
43
+{
44
+  \begin{frame}<beamer>{Outline}
45
+    \tableofcontents[currentsection,currentsubsection]
46
+  \end{frame}
47
+}
48
+
49
+
50
+\begin{document}
51
+  
52
+  \begin{frame}
53
+    \titlepage
54
+  \end{frame}
55
+  
56
+  \note{Good morning, my name is Simon Lackerbauer, and today I am presenting an algorithm for automated clustering that was introduced at SIGKDD 2006.}
57
+  
58
+  \begin{frame}{Outline}
59
+    \tableofcontents
60
+  \end{frame}
61
+  
62
+  \note{Here is the rough outline: first we look at the problem with a few simple examples, then I explain the RIC framework and its individual components, Volume after Compression, Robust Fitting and Cluster Merging, which only in combination are actually able to tackle the problem.\\Afterwards we look at some real data the algorithm has clustered and briefly touch on comparisons with traditional algorithms.}
63
+  
64
+  \section{Clustering Problems}
65
+  
66
+  \note{Right, so first off to the problem, or rather the problems.}
67
+  
68
+  \begin{frame}{Clustering Problems}{}
69
+    \begin{itemize}
70
+      \item There exists a wide range of possible clustering algorithms.
71
+      \item Many need user input or assume only Gaussian clusters
72
+      \item We want an algorithm without user input that automatically selects appropriate cluster functions
73
+    \end{itemize}
74
+  \end{frame}
75
+  
76
+  \note{The problem itself (putting data into meaningful clusters) is of course not new. A cluster is nothing more than an accumulation of data points following a certain pattern, or rather a certain distribution function. Data points within such a cluster are not independent of the underlying distribution: if you know how the data is clustered, you may be able to spot interesting relationships in the data easily.\\That of course also means that several people have already thought about this. K-means, for example, still a very widespread and rather simple algorithm, goes back to the 1950s.}
77
+  
78
+  \note{K-means is not the only such algorithm, but it serves well as an example of the problems that exist in general. One needs an input value (the number of clusters present), which of course obscures the problem a little, because how is one usually supposed to know that? In addition, all clusters (when looking at data in 2D) end up with these ellipse-like shapes, because the algorithm assumes a Gaussian distribution, and that is exactly what many conventional algorithms do.\\So one wants an algorithm that works without input and is additionally independent in its choice of the density function the cluster is based on.}
79
+  
80
+  \begin{frame}{Clustering Problems}{Example: How not to do it}
81
+    \begin{figure}
82
+      \centering
83
+      \includegraphics[width=0.7\linewidth]{imgs/bad_clustering}
84
+      \caption[]{Example of “Bad” Clustering\cite{Bohm2006-ts}}
85
+      \label{fig:bad_clustering}
86
+    \end{figure}    
87
+  \end{frame}
88
+  
89
+  \note{Here we see an example of what K-means might have found with an input of 5. With an input of two, these two groups would probably have come out, and so on. One immediately sees that another big problem is the handling of outliers. The bad algorithm is not robust and tries, come hell or high water, to pack noise into the clusters.}
90
+  
91
+  \begin{frame}{Clustering Problems}{Example: Reasonable reduction}
92
+    \begin{figure}
93
+      \centering
94
+      \includegraphics[width=0.7\linewidth]{imgs/good_clustering}
95
+      \caption[]{Example of “Good” Clustering\cite{Bohm2006-ts}}
96
+      \label{fig:good_clustering}
97
+    \end{figure}    
98
+  \end{frame}
99
+  
100
+  \note{This, by contrast, makes much more sense: one cluster that is normally distributed on both axes and one correlated, line-shaped cluster, and in between a few noise points that are ignored.}
101
+  
102
+  \begin{frame}{Comparison of examples}{}
103
+    \textbf{What makes the second pattern better than the first?}
104
+    \begin{itemize}
105
+      \item It's more descriptive of the interesting patterns in the data, because outliers have been “omitted”.
106
+      \item The clusters in the “good” example are those a human would immediately recognize as points being associated somehow.
107
+    \end{itemize}
108
+  \end{frame}
109
+  
110
+  \note{One can also put this a bit more formally and say that the two good clusters simply describe the data better. Information can be derived from the shape and position of the clusters alone, which is not the case in the bad example. Here, in the simple 2D example, this of course coincides with the clusters a human would have drawn intuitively. But there are of course data sets where even a human does not immediately see anything (for example the set at the very end) and therefore cannot “help” the algorithm with intuition.}
111
+  
112
+  \begin{frame}{Measuring Success}{}
113
+        This human intuition must be translated into a dependable clustering algorithm, for which two measures of success can be defined:
114
+        \begin{itemize}
115
+          \item \uncover<2->{\textbf{Goodness of fit}}
116
+          \item \uncover<3->{\textbf{Efficiency}}     
117
+        \end{itemize}
118
+  \end{frame}
119
+  
120
+  \note{So we want an algorithm that fulfils two requirements. Goodness of fit: how well does the cluster actually fit the data? Or, more formally: how do we get a function that allows scoring of clusterings and assigns a “good” score to the second example above and a “bad” one to the noisy clusters from the bad example?}
121
+  
122
+  \note{How do we achieve this with a reasonable runtime and without being thrown off too much by outliers/noise?}
123
+  
124
+  \section{Solution: The iterative approach}
125
+  
126
+  \note{Right, now on to the actual topic.}
127
+  
128
+  \subsection{VAC -- Volume After Compression}
129
+  
130
+  \note{Let us start with arguably the most important sub-algorithm: Volume After Compression.}
131
+  
132
+  \begin{frame}{Proposition for a solution: VAC}{}
133
+    \textbf{VAC - Volume after Compression}\\
134
+    \begin{itemize}
135
+      \item does not define an absolute notion of a good grouping
136
+      \item specifies for two groupings $x,y$ which one is better (e.g., because $VAC(x) < VAC(y) \rightarrow$ x is a better grouping)
137
+      \item size of total, \textbf{lossless} compression
138
+    \end{itemize}
139
+  \end{frame}
140
+  
141
+  \note{The VAC criterion is an information-theoretic approach to scoring. The better one describes the data, the better one can compress it without loss of information. Just as I can describe the whole numbers in heavily compressed form with the word “integer” without having to enumerate them all, a good cluster describes the underlying data without having to enumerate every data point individually.\\Important here: VAC is not an absolute score. One cannot compare VAC scores between two data sets and then say the one with the lower score is clustered better. But within the same data set one can compare two possible clusterings and then say unambiguously: the one with the lower score simply describes the data better, because less information transfer is needed to establish the same knowledge on both sides.}
142
+  
143
+  \begin{frame}{Proposition for a solution: VAC}{Integer Encoding}
144
+    \begin{itemize}
145
+      \item point coordinates are always integers
146
+      \item self-delimiting encoding of integers: Elias (gamma) codes
147
+      \item smaller integers require fewer bits
148
+    \end{itemize}
149
+  \end{frame}
150
+  
151
+  \note{When we start preparing the data, we assume that all coordinates are integers (since only finite precision can be represented anyway) and can therefore divide the axes into a “grid”; the grid can of course be made arbitrarily fine. Integers can be encoded unambiguously via Elias encoding without any padding, so that small numbers take up less space, which becomes important later.}
152
+
153
+  \begin{frame}{Proposition for a solution: VAC}{Cluster Encoding}
154
+    \begin{itemize}
155
+      \item uses Huffman encoding for positioning points with probability distribution according to assumed cluster pdf
156
+      \item thus, if we assume the correct distribution for the cluster, core points will be encoded more efficiently
157
+    \end{itemize}
158
+  \end{frame}
159
+  
160
+  \note{Huffman encoding gives us, for each coordinate of a point, a bit string of length $l = \log_2(1/P(x))$, so that points lying where the assumed pdf puts higher probability use fewer bits than those lying further out towards the edge of the distribution. The better our assumed distribution for a cluster, the more effectively the points can be encoded.}
161
+  
162
+  \begin{frame}{Proposition for a solution: VAC}{Cluster Encoding}
163
+    \begin{figure}
164
+      \centering
165
+      \includegraphics[width=0.65\linewidth]{imgs/vac}
166
+      \caption[]{Example of VAC\cite{Bohm2006-ts}}
167
+      \label{fig:vac}
168
+    \end{figure}    
169
+  \end{frame}
170
+  
171
+  \note{Here we see an example of partitioning a data set into a grid and then encoding the coordinates. One can see clearly how, for both distributions, the bulk of the points lies right in the middle of the respective density function.}
172
+  
173
+  \begin{frame}{Proposition for a solution: VAC}{Cluster encoding}
174
+    \begin{definition}[VAC of point $\vec{x}$]
175
+      Let $\vec{x} \in \mathbb{R}^d$ be a point of a cluster $C$ and $\overrightarrow{pdf}(\vec{x})$ be a $d$-dimensional vector of probability density functions  which are associated to $C$. Each $pdf_i(x_i)$ is selected from  a set of predefined probability density functions with corresponding parameters, i.e. $PDF = \{pdf_{Gauss(\mu_i, \sigma_i)}, pdf_{uniform(lb_i, ub_i)}, pdf_{Lapl(a_i, b_i)}, ...\}$, $\mu_i, lb_i, ub_i, a_i \in \mathbb{R}, \sigma_i, b_i \in \mathbb{R}^+$. Let $\gamma$ be the grid constant (distance between grid cells). The $VAC_i$ of coordinate $i$ of point $\vec{x}$ corresponds to
176
+      \[VAC_i(x) = \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma}\]
177
+      The $VAC$ of point $\vec{x}$ corresponds to
178
+      \[VAC(x) = \left( \log_2 \frac{n}{|C|}\right) + \sum_{0 \leq i < d} VAC_i(x)\]
179
+    \end{definition}
180
+  \end{frame}
181
+  
182
+  \note{Here again is a formal definition of the VAC criterion. What matters above all is that the possible PDFs are fixed in advance and can then be referenced via an ID.}
183
+  
184
+  \begin{frame}{Proposition for a solution: VAC}{Cluster Encoding and Decorrelation}
185
+    \begin{itemize}
186
+      \item $\gamma$ is a measure of granularity of grid cells
187
+      \item absolute VAC changes with grid resolution, but relative VAC stays the same
188
+      \item to choose optimal parameter settings for clusters, we use the statistical parameters of the data set
189
+      \item if the data is internally correlated, include a decorrelation matrix iff the VAC savings at least compensate for the cost of storing the decorrelation matrix
190
+    \end{itemize}
191
+  \end{frame}
192
+  
193
+  \subsection{RF -- Robust Fitting}
194
+  
195
+  \note{To actually describe clusters with the help of the VAC criterion, we still need two algorithms that help with the actual construction of clusters. One of them is Robust Fitting.}
196
+  
197
+  \begin{frame}{Two helper algorithms}{Robust Fitting}
198
+    \begin{itemize}
199
+      \item Start: get as input a set of clusters $\mathcal{C} = \{C_1,...,C_k\}$ by an arbitrary method
200
+      \item for every $C_i$ in $\mathcal{C}$ define a similarity measure (decorrelation matrix == ellipsoid)
201
+      \item use the VAC score to try out decorrelation matrices until the one with the lowest VAC is found 
202
+    \end{itemize}
203
+  \end{frame}
204
+ 
205
+  \note{RF itself is quite simple to describe: first you get some number of clusters by some method (K-means will do for a start). Then you take a decorrelation matrix describing an ellipsoid that marks the boundary between core points and outliers. Then you try out different boundaries until you have found the one that produces the lowest VAC score. (Since every cluster has only finitely many points, you can in fact try all points one after another from the inside out.)}
206
+  
207
+  \begin{frame}{Two helper algorithms}{Robust Fitting}
208
+    Decorrelation Matrix:
209
+    \begin{itemize}
210
+      \item contains the vectors that define the space in which points in the cluster reside
211
+      \item to improve robustness of the cluster center estimation, use the coordinate-wise median instead of arithmetic means
212
+      \item of the several matrices generated during this step, choose the one with the best VAC, again partitioning into core points and noise
213
+    \end{itemize}    
214
+  \end{frame}
215
+  
216
+  \note{}
217
+  
218
+  \begin{frame}{Two helper algorithms}{Robust Fitting}
219
+    \begin{figure}
220
+      \centering
221
+      \includegraphics[width=\linewidth]{imgs/robust_estimation}
222
+      \caption[]{Conventional and robust estimation\cite{Bohm2006-ts}}
223
+      \label{fig:robust_estimation}
224
+    \end{figure}    
225
+  \end{frame}
226
+  
227
+  \note{}
228
+  
229
+  \subsection{CM -- Cluster Merging}
230
+  
231
+  \begin{frame}{Two helper algorithms}{Cluster Merging}
232
+    \begin{itemize}
233
+      \item Start: get as input a set of clusters $\mathcal{C} = \{C_1,...,C_k\}$ by an arbitrary method
234
+      \item Purify (of noise) each cluster individually
235
+    \end{itemize}    
236
+  \end{frame}
237
+  
238
+  \note{}
239
+
240
+  \begin{frame}{Two helper algorithms}{Cluster Merging}
241
+    \begin{figure}
242
+      \centering
243
+      \includegraphics[width=\linewidth]{imgs/cluster_merging1}
244
+      \caption[]{Clustering by using K-means and then purifying\cite{Bohm2006-ts}}
245
+      \label{fig:cluster_merging1}
246
+    \end{figure}    
247
+  \end{frame}
248
+  
249
+  \note{}
250
+  
251
+  \begin{frame}{Two helper algorithms}{Cluster Merging}
252
+    \begin{figure}
253
+      \centering
254
+      \includegraphics[width=\linewidth]{imgs/cluster_merging2}
255
+      \caption[]{After merging\cite{Bohm2006-ts}}
256
+      \label{fig:cluster_merging2}
257
+    \end{figure}    
258
+  \end{frame}
259
+  
260
+  \note{}
261
+  
262
+  \section{Example: Cat Retina Images}
263
+  
264
+  \begin{frame}{Example: Cat Retina Images}{}
265
+    \begin{itemize}
266
+      \item 219 blocks of retinal images, 96 tiles per image $\rightarrow$ 21,024 tiles in total (example tiles in figure \ref{fig:cat5})
267
+      \item each tile is represented as vector of 7 features (figure \ref{fig:cat1}(a))
268
+      \item RIC finds 13 clusters, color coded in figure \ref{fig:cat1}(b)
269
+      \item Example clusters in figures \ref{fig:cat2}(a)-(f)
270
+    \end{itemize}
271
+  \end{frame}
272
+  
273
+  \note{}
274
+  
275
+  \begin{frame}{Example: Cat Retina Images}{}
276
+    \begin{figure}
277
+      \centering
278
+      \includegraphics[width=\linewidth]{imgs/cat5}
279
+      \caption[]{Examples of tiles\cite{Bohm2006-ts}}
280
+      \label{fig:cat5}
281
+    \end{figure}    
282
+  \end{frame}
283
+  
284
+  \note{}
285
+  
286
+  \begin{frame}{Example: Cat Retina Images}{}
287
+    \begin{figure}
288
+      \centering
289
+      \includegraphics[width=\linewidth]{imgs/cat1}
290
+      \caption[]{Visualization of cat retina data\cite{Bohm2006-ts}}
291
+      \label{fig:cat1}
292
+    \end{figure}    
293
+  \end{frame}
294
+  
295
+  \note{}
296
+  
297
+  \begin{frame}{Example: Cat Retina Images}{}
298
+    \begin{figure}
299
+      \centering
300
+      \includegraphics[width=\linewidth]{imgs/cat2}
301
+      \caption[]{Example clusters\cite{Bohm2006-ts}}
302
+      \label{fig:cat2}
303
+    \end{figure}    
304
+  \end{frame}
305
+  
306
+  \note{}
307
+  
308
+  \setcounter{figure}{7}
309
+  
310
+  \begin{frame}{Example: Cat Retina Images}{}
311
+    \begin{figure}
312
+      \centering
313
+      \includegraphics[width=\linewidth]{imgs/cat3}
314
+      \caption[]{Example clusters\cite{Bohm2006-ts}}
315
+      \label{fig:cat3}
316
+    \end{figure}    
317
+  \end{frame}
318
+  
319
+  \note{}
320
+  
321
+  \setcounter{figure}{7}
322
+  
323
+  \begin{frame}{Example: Cat Retina Images}{}
324
+    \begin{figure}
325
+      \centering
326
+      \includegraphics[width=\linewidth]{imgs/cat4}
327
+      \caption[]{Example clusters\cite{Bohm2006-ts}}
328
+      \label{fig:cat4}
329
+    \end{figure}    
330
+  \end{frame}
331
+  
332
+  \note{}
333
+  
334
+  \section{Summary}
335
+  
336
+  \begin{frame}{Summary}
337
+    \begin{itemize}
338
+      \item The VAC criterion provides a \alert{stable measure of goodness of fit}.
339
+      \item The RIC framework is very flexible, does not rely on user input and can handle any distribution that can be described by a pdf
340
+      \item Anytime a new, better clustering algorithm is introduced, RIC can improve on it by running its parts (CM and RF) with the better algorithm as a starting point\\
341
+    \end{itemize}
342
+  \end{frame}
343
+  
344
+  \note{Thank you very much for your attention!}
345
+  
346
+  \section{References}
347
+  
348
+  \begin{frame}[t]{References}
349
+    \bibliography{sources}
350
+  \end{frame}
351
+  
352
+\end{document}

+ 504
- 0
2016-01-22_robust_information_theoretic_clustering/sources.bib View File

@@ -0,0 +1,504 @@
1
+
2
+@INPROCEEDINGS{Ester1996-rh,
3
+  title     = "A density-based algorithm for discovering clusters in large
4
+  spatial databases with noise",
5
+  booktitle = "Proceedings of the 2nd {ACM} {SIGKDD} International Conference
6
+  on Knowledge Discovery and Data Mining",
7
+  author    = "Ester, Martin and Kriegel, Hans-Peter and Sander, J{\"{o}}rg and
8
+  Xu, Xiaowei",
9
+  volume    =  96,
10
+  pages     = "226--231",
11
+  year      =  1996
12
+}
13
+
14
+@ARTICLE{Mahalanobis1936-wy,
15
+  title   = "On the generalized distance in statistics",
16
+  author  = "Mahalanobis, Prasanta Chandra",
17
+  journal = "Proceedings of the National Institute of Sciences (Calcutta)",
18
+  volume  =  2,
19
+  pages   = "49--55",
20
+  year    =  1936
21
+}
22
+
23
+@ARTICLE{Elias1975-wj,
24
+  title    = "Universal codeword sets and representations of the integers",
25
+  author   = "Elias, Peter",
26
+  abstract = "Countable prefix codeword sets are constructed with the universal
27
+  property that assigning messages in order of decreasing
28
+  probability to codewords in order of increasing length gives an
29
+  average code-word length, for any message set with positive
30
+  entropy, less than a constant times the optimal average codeword
31
+  length for that source. Some of the sets also have the
32
+  asymptotically optimal property that the ratio of average
33
+  codeword length to entropy approaches one uniformly as entropy
34
+  increases. An application is the construction of a uniformly
35
+  universal sequence of codes for countable memoryless sources, in
36
+  which the th code has a ratio of average codeword length to
37
+  source rate bounded by a function of for all sources with
38
+  positive rate; the bound is less than two for and approaches one
39
+  as increases.",
40
+  journal  = "IEEE Trans. Inf. Theory",
41
+  volume   =  21,
42
+  number   =  2,
43
+  pages    = "194--203",
44
+  month    =  mar,
45
+  year     =  1975,
46
+  keywords = "Source coding;Variable-length coding (VLC);Application
47
+  software;Binary sequences;Computational
48
+  complexity;Concrete;Encoding;Entropy;Information
49
+  analysis;Information retrieval;Information theory;Probability
50
+  distribution",
51
+  issn     = "0018-9448",
52
+  doi      = "10.1109/TIT.1975.1055349"
53
+}
54
+
55
+@INPROCEEDINGS{Pelleg2000-jr,
56
+  title     = "X-means: Extending K-means with Efficient Estimation of the
57
+  Number of Clusters",
58
+  booktitle = "{ICML}",
59
+  author    = "Pelleg, Dan and Moore, Andrew W and {Others}",
60
+  volume    =  1,
61
+  year      =  2000
62
+}
63
+
64
+@INPROCEEDINGS{Hamerly2003-hm,
65
+  title     = "Learning the k in k-means",
66
+  booktitle = "Seventeenth Annual Conference on Neural Information Processing
67
+  Systems ({NIPS)}, Neural Inf. Process. Syst., Vancouver, {BC},
68
+  Canada",
69
+  author    = "Hamerly, Greg and Elkan, Charles",
70
+  year      =  2003
71
+}
72
+
73
+@INPROCEEDINGS{Bohm2010-uu,
74
+  title     = "Clustering by Synchronization",
75
+  booktitle = "Proceedings of the 16th {ACM} {SIGKDD} International Conference
76
+  on Knowledge Discovery and Data Mining",
77
+  author    = "B{\"{o}}hm, Christian and Plant, Claudia and Shao, Junming and
78
+  Yang, Qinli",
79
+  publisher = "ACM",
80
+  pages     = "583--592",
81
+  series    = "KDD '10",
82
+  year      =  2010,
83
+  address   = "New York, NY, USA",
84
+  keywords  = "clustering, kuramoto model, synchronization",
85
+  isbn      = "9781450300551",
86
+  doi       = "10.1145/1835804.1835879"
87
+}
88
+
89
+@INPROCEEDINGS{Bohm2008-eh,
90
+  title     = "Outlier-robust Clustering Using Independent Components",
91
+  booktitle = "Proceedings of the 2008 {ACM} {SIGMOD} International Conference
92
+  on Management of Data",
93
+  author    = "B{\"{o}}hm, Christian and Faloutsos, Christos and Plant, Claudia",
94
+  publisher = "ACM",
95
+  pages     = "185--198",
96
+  series    = "SIGMOD '08",
97
+  year      =  2008,
98
+  address   = "New York, NY, USA",
99
+  keywords  = "epd, ica, outlier-robust clustering",
100
+  isbn      = "9781605581026",
101
+  doi       = "10.1145/1376616.1376638"
102
+}
103
+
104
+@INPROCEEDINGS{Bohm2009-im,
105
+  title     = "{CoCo}: Coding Cost for Parameter-free Outlier Detection",
106
+  booktitle = "Proceedings of the 15th {ACM} {SIGKDD} International Conference
107
+  on Knowledge Discovery and Data Mining",
108
+  author    = "B{\"{o}}hm, Christian and Haegler, Katrin and M{\"{u}}ller,
109
+  Nikola S and Plant, Claudia",
110
+  publisher = "ACM",
111
+  pages     = "149--158",
112
+  series    = "KDD '09",
113
+  year      =  2009,
114
+  address   = "New York, NY, USA",
115
+  keywords  = "coding costs, data compression, minimum description length,
116
+  outlier detection",
117
+  isbn      = "9781605584959",
118
+  doi       = "10.1145/1557019.1557042"
119
+}
120
+
121
+@INPROCEEDINGS{Bhattacharya2005-rg,
122
+  title      = "{ViVo}: Visual Vocabulary Construction for Mining Biomedical
123
+  Images",
124
+  booktitle  = "Fifth {IEEE} International Conference on Data Mining
125
+  ({ICDM'05})",
126
+  author     = "Bhattacharya, Arnab and Ljosa, Vebjorn and Pan, Jia-Yu and
127
+  Verardo, Mark R and Yang, Hyungjeong and Faloutsos, Christos
128
+  and Singh, Ambuj K",
129
+  publisher  = "IEEE",
130
+  pages      = "50--57",
131
+  year       =  2005,
132
+  conference = "Fifth IEEE International Conference on Data Mining (ICDM'05)",
133
+  isbn       = "9780769522784",
134
+  doi        = "10.1109/ICDM.2005.151"
135
+}
136
+
137
+@INPROCEEDINGS{Akoglu2012-vz,
138
+  title       = "{PICS}: Parameter-free Identification of Cohesive Subgroups in
139
+  Large Attributed Graphs",
140
+  booktitle   = "{SDM}",
141
+  author      = "Akoglu, Leman and Tong, Hanghang and Meeder, Brendan and
142
+  Faloutsos, Christos",
143
+  abstract    = "Abstract Given a graph with node attributes, how can we find
144
+  meaningful patterns such as clusters, bridges, and outliers?
145
+  Attributed graphs appear in real world in the form of social
146
+  networks with user interests, gene interaction networks with
147
+  gene expression information, ...",
148
+  publisher   = "Citeseer",
149
+  pages       = "439--450",
150
+  institution = "Citeseer",
151
+  year        =  2012
152
+}
153
+
154
+@INPROCEEDINGS{Vinh2010-tc,
155
+  title     = "{minCEntropy}: A Novel Information Theoretic Approach for the
156
+  Generation of Alternative Clusterings",
157
+  booktitle = "Data Mining ({ICDM)}, 2010 {IEEE} 10th International Conference
158
+  on",
159
+  author    = "Vinh, Nguyen Xuan and Epps, J",
160
+  abstract  = "Traditional clustering has focused on creating a single good
161
+  clustering solution, while modern, high dimensional data can
162
+  often be interpreted, and hence clustered, in different ways.
163
+  Alternative clustering aims at creating multiple clustering
164
+  solutions that are both of high quality and distinctive from
165
+  each other. Methods for alternative clustering can be divided
166
+  into objective-function-oriented and
167
+  data-transformation-oriented approaches. This paper presents a
168
+  novel information theoretic-based, objective-function-oriented
169
+  approach to generate alternative clusterings, in either an
170
+  unsupervised or semi-supervised manner. We employ the
171
+  conditional entropy measure for quantifying both clustering
172
+  quality and distinctiveness, resulting in an analytically
173
+  consistent combined criterion. Our approach employs a
174
+  computationally efficient nonparametric entropy estimator, which
175
+  does not impose any assumption on the probability distributions.
176
+  We propose a partitional clustering algorithm, named
177
+  minCEntropy, to concurrently optimize both clustering quality
178
+  and distinctiveness. minCEntropy requires setting only some
179
+  rather intuitive parameters, and performs competitively with
180
+  existing methods for alternative clustering.",
181
+  pages     = "521--530",
182
+  month     =  dec,
183
+  year      =  2010,
184
+  keywords  = "entropy;nonparametric statistics;pattern clustering;statistical
185
+  distributions;alternative clustering
186
+  generation;data-transformation-oriented approach;information
187
+  theoretic approach;minCEntropy;nonparametric entropy
188
+  estimator;objective-function-oriented approach;partitional
189
+  clustering algorithm;probability distributions;alternative
190
+  clustering;clustering;information theoretic
191
+  clustering;multi-objective optimization;transformation",
192
+  issn      = "1550-4786",
193
+  doi       = "10.1109/ICDM.2010.24"
194
+}
195
+
196
+@INPROCEEDINGS{Mueller2011-hd,
197
+  title       = "Weighted Graph Compression for Parameter-free Clustering With
198
+  {PaCCo}",
199
+  booktitle   = "{SDM}",
200
+  author      = "Mueller, Nikola and Haegler, Katrin and Shao, Junming and
201
+  Plant, Claudia and B{\"{o}}hm, Christian",
202
+  abstract    = "Abstract Object similarities are now more and more
203
+  characterized by connectivity information available in form of
204
+  network or graph data. Complex graph data arises in various
205
+  fields like e-commerce, social networks, high throughput
206
+  biological analysis etc. The generated interaction information
207
+  for objects is often not simply binary but rather associated
208
+  with interaction strength which are in turn represented as
209
+  edge weights in graphs. The identification of groups of highly
210
+  connected nodes is an important task and results in valuable
211
+  knowledge of the data set as a whole. Many popular clustering
212
+  techniques are designed for vector or unweighted graph data,
213
+  and can thus not be directly applied for weighted graphs. In
214
+  this paper, we propose a novel clustering algorithm for
215
+  weighted graphs, called PaCCo (Parameter-free Clustering by
216
+  Coding costs), which is based on the Minimum Description
217
+  Length (MDL) principle in combination with a bisecting k-Means
218
+  strategy. MDL relates the clustering problem to the problem of
219
+  data compression: A good cluster structure on graphs enables
220
+  strong graph compression. The compression efficiency depends
221
+  on the underlying edges which constitute the graph
222
+  connectivity. The compression rate serves as similarity or
223
+  distance metric for nodes. The MDL principle ensures that our
224
+  algorithm is parameter free (automatically finds the number of
225
+  clusters) and avoids restrictive assumptions that no
226
+  information on the data is required. We systematically
227
+  evaluate our clustering approach PaCCo on synthetic as well as
228
+  on real data to demonstrate the superiority of our developed
229
+  algorithm over existing approaches.",
230
+  pages       = "932--943",
231
+  institution = "SIAM",
232
+  year        =  2011,
233
+  doi         = "10.1137/1.9781611972818.80"
234
+}
235
+
236
+@INPROCEEDINGS{Plant2011-wy,
237
+  title     = "{INCONCO}: Interpretable Clustering of Numerical and Categorical
238
+  Objects",
239
+  booktitle = "Proceedings of the 17th {ACM} {SIGKDD} International Conference
240
+  on Knowledge Discovery and Data Mining",
241
+  author    = "Plant, Claudia and B{\"{o}}hm, Christian",
242
+  publisher = "ACM",
243
+  pages     = "1127--1135",
244
+  series    = "KDD '11",
245
+  year      =  2011,
246
+  address   = "New York, NY, USA",
247
+  keywords  = "clustering, minimum description length principle, mixed-type
248
+  data",
249
+  isbn      = "9781450308137",
250
+  doi       = "10.1145/2020408.2020584"
251
+}
252
+
253
+@INPROCEEDINGS{Feng2013-jn,
254
+  title     = "{Compression-Based} Graph Mining Exploiting Structure Primitives",
255
+  booktitle = "Data Mining ({ICDM)}, 2013 {IEEE} 13th International Conference
256
+  on",
257
+  author    = "Feng, Jing and He, Xiao and Hubig, N and Bohm, C and Plant, C",
258
+  abstract  = "How can we retrieve information from sparse graphs? Traditional
259
+  graph mining approaches focus on discovering dense patterns
260
+  inside complex networks, for example modularity-based or
261
+  cut-based methods. However, most real world data sets are very
262
+  sparse. Nevertheless, traditional approaches tend to omit
263
+  interesting sparse patterns like stars. In this paper, we
264
+  propose a novel graph mining technique modeling the transitivity
265
+  and the hubness of a graph using structure primitives. We
266
+  exploit these structure primitives for effective graph
267
+  compression using the Minimum Description Length Principle. The
268
+  compression rate is an unbiased measure for the transitivity or
269
+  hubness and therefore provides interesting insights into the
270
+  structure of even very sparse graphs. Since real graphs can be
271
+  composed of subgraphs of different structures, we propose a
272
+  novel algorithm CXprime (Compression-based exploiting
273
+  Primitives) for clustering graphs using our coding scheme as an
274
+  objective function. In contrast to traditional graph clustering
275
+  methods, our algorithm automatically recognizes different types
276
+  of subgraphs without requiring the user to specify input
277
+  parameters. Additionally we propose a novel link prediction
278
+  algorithm based on the detected substructures, which increases
279
+  the quality of former methods. Extensive experiments evaluate
280
+  our algorithms on synthetic and real data.",
281
+  pages     = "181--190",
282
+  month     =  dec,
283
+  year      =  2013,
284
+  keywords  = "data compression;data mining;graph theory;pattern
285
+  clustering;CXprime algorithm;coding scheme;compression
286
+  rate;compression-based graph mining technique;cut-based
287
+  methods;dense pattern discovery;example modularity-based
288
+  method;graph clustering methods;information retrieval;link
289
+  prediction algorithm;minimum description length
290
+  principle;objective function;sparse graphs;star parse
291
+  patterns;structure primitives;Clustering
292
+  algorithms;Communities;Data mining;Encoding;Entropy;Prediction
293
+  algorithms;Receivers;Compression;Graph mining;Link
294
+  prediction;Minimum Description Length;Partition",
295
+  issn      = "1550-4786",
296
+  doi       = "10.1109/ICDM.2013.56"
297
+}
298
+
299
+@INPROCEEDINGS{He2014-uf,
300
+  title     = "Relevant overlapping subspace clusters on categorical data",
301
+  booktitle = "Proceedings of the 20th {ACM} {SIGKDD} international conference
302
+  on Knowledge discovery and data mining",
303
+  author    = "He, Xiao and Feng, Jing and Konte, Bettina and Mai, Son T and
304
+  Plant, Claudia",
305
+  publisher = "ACM",
306
+  pages     = "213--222",
307
+  month     =  "24~" # aug,
308
+  year      =  2014,
309
+  keywords  = "categorical data; minimum description length; relevant subspace
310
+  clustering",
311
+  isbn      = "9781450329569",
312
+  doi       = "10.1145/2623330.2623652"
313
+}
314
+
315
+@INPROCEEDINGS{Bohm2006-ts,
316
+  title     = "Robust information-theoretic clustering",
317
+  booktitle = "Proceedings of the 12th {ACM} {SIGKDD} international conference
318
+  on Knowledge discovery and data mining",
319
+  author    = "B{\"{o}}hm, Christian and Faloutsos, Christos and Pan, Jia-Yu
320
+  and Plant, Claudia",
321
+  publisher = "ACM",
322
+  pages     = "65--75",
323
+  month     =  "20~" # aug,
324
+  year      =  2006,
325
+  keywords  = "clustering; data summarization; noise-robustness; parameter-free
326
+  data mining",
327
+  isbn      = "9781595933393",
328
+  doi       = "10.1145/1150402.1150414"
329
+}
330
+
331
+@INPROCEEDINGS{Faivishevsky2010-uk,
332
+  title     = "Nonparametric information theoretic clustering algorithm",
333
+  booktitle = "Proceedings of the 27th International Conference on Machine
334
+  Learning ({ICML-10})",
335
+  author    = "Faivishevsky, Lev and Goldberger, Jacob",
336
+  pages     = "351--358",
337
+  year      =  2010
338
+}
339
+
340
+@INCOLLECTION{Bohm2010-fv,
341
+  title     = "Integrative {Parameter-Free} Clustering of Data with Mixed Type
342
+  Attributes",
343
+  booktitle = "Advances in Knowledge Discovery and Data Mining",
344
+  author    = "B{\"{o}}hm, Christian and Goebl, Sebastian and Oswald, Annahita
345
+  and Plant, Claudia and Plavinski, Michael and Wackersreuther,
346
+  Bianca",
347
+  publisher = "Springer Berlin Heidelberg",
348
+  pages     = "38--47",
349
+  series    = "Lecture Notes in Computer Science",
350
+  month     =  "21~" # jun,
351
+  year      =  2010,
352
+  isbn      = "9783642136566, 9783642136573",
353
+  doi       = "10.1007/978-3-642-13657-3\_7"
354
+}
355
+
356
+@INPROCEEDINGS{Plant2012-vu,
357
+  title     = "Dependency clustering across measurement scales",
358
+  booktitle = "Proceedings of the 18th {ACM} {SIGKDD} international conference
359
+  on Knowledge discovery and data mining",
360
+  author    = "Plant, Claudia",
361
+  publisher = "ACM",
362
+  pages     = "361--369",
363
+  month     =  "12~" # aug,
364
+  year      =  2012,
365
+  keywords  = "clustering; heterogeneous data; minimum description length",
366
+  isbn      = "9781450314626",
367
+  doi       = "10.1145/2339530.2339589"
368
+}
369
+
370
+@INPROCEEDINGS{Chakrabarti2004-en,
371
+  title     = "Fully automatic cross-associations",
372
+  booktitle = "Proceedings of the tenth {ACM} {SIGKDD} international conference
373
+  on Knowledge discovery and data mining",
374
+  author    = "Chakrabarti, Deepayan and Papadimitriou, Spiros and Modha,
375
+  Dharmendra S and Faloutsos, Christos",
376
+  publisher = "ACM",
377
+  pages     = "79--88",
378
+  month     =  "22~" # aug,
379
+  year      =  2004,
380
+  keywords  = "MDL; cross-association; information theory",
381
+  isbn      = "9781581138887",
382
+  doi       = "10.1145/1014052.1014064"
383
+}
384
+
385
+@INPROCEEDINGS{Feng2012-ox,
386
+  title     = "Summarization-based Mining Bipartite Graphs",
387
+  booktitle = "Proceedings of the 18th {ACM} {SIGKDD} International Conference
388
+  on Knowledge Discovery and Data Mining",
389
+  author    = "Feng, Jing and He, Xiao and Konte, Bettina and B{\"{o}}hm,
390
+  Christian and Plant, Claudia",
391
+  publisher = "ACM",
392
+  pages     = "1249--1257",
393
+  series    = "KDD '12",
394
+  year      =  2012,
395
+  address   = "New York, NY, USA",
396
+  keywords  = "bipartite graph, clustering, link prediction, summarization",
397
+  isbn      = "9781450314626",
398
+  doi       = "10.1145/2339530.2339725"
399
+}
400
+
401
+@INPROCEEDINGS{Ketkar2005-zs,
402
+  title     = "Subdue: Compression-based Frequent Pattern Discovery in Graph
403
+  Data",
404
+  booktitle = "Proceedings of the 1st International Workshop on Open Source
405
+  Data Mining: Frequent Pattern Mining Implementations",
406
+  author    = "Ketkar, Nikhil S and Holder, Lawrence B and Cook, Diane J",
407
+  publisher = "ACM",
408
+  pages     = "71--76",
409
+  series    = "OSDM '05",
410
+  year      =  2005,
411
+  address   = "New York, NY, USA",
412
+  isbn      = "9781595932105",
413
+  doi       = "10.1145/1133905.1133915"
414
+}
415
+
416
+% The entry below contains non-ASCII chars that could not be converted
417
+% to a LaTeX equivalent.
418
+@INPROCEEDINGS{Koutra2014-up,
419
+  title     = "{VOG}: Summarizing and Understanding Large Graphs",
420
+  booktitle = "Proceedings of the 2014 {SIAM} International Conference on Data
421
+  Mining",
422
+  author    = "Koutra, Danai and Kang, U and Vreeken, Jilles and Faloutsos,
423
+  Christos",
424
+  abstract  = "Abstract How can we succinctly describe a million-node graph
425
+  with a few simple sentences? How can we measure the ‘importance’
426
+  of a set of discovered subgraphs in a large graph? These are
427
+  exactly the problems we focus on. Our main ideas are to
428
+  construct a ‘vocabulary’ of subgraph-types that often occur in
429
+  real graphs (e.g., stars, cliques, chains), and from a set of
430
+  subgraphs, find the most succinct description of a graph in
431
+  terms of this vocabulary. We measure success in a well-founded
432
+  way by means of the Minimum Description Length (MDL) principle:
433
+  a subgraph is included in the summary if it decreases the total
434
+  description length of the graph. Our contributions are
435
+  three-fold: (a) formulation: we provide a principled encoding
436
+  scheme to choose vocabulary subgraphs; (b) algorithm: we develop
437
+  VOG, an efficient method to minimize the description cost, and
438
+  (c) applicability: we report experimental results on
439
+  multi-million-edge real graphs, including Flickr and the Notre
440
+  Dame web graph.",
441
+  pages     = "91--99",
442
+  year      =  2014,
443
+  doi       = "10.1137/1.9781611973440.11"
444
+}
445
+
446
+@INPROCEEDINGS{Akoglu2013-qd,
447
+  title       = "Mining Connection Pathways for Marked Nodes in Large Graphs",
448
+  booktitle   = "{SDM}",
449
+  author      = "Akoglu, Leman and Chau, Duen Horng and Faloutsos, Christos and
450
+  Tatti, Nikolaj and Tong, Hanghang and Vreeken, Jilles",
452
+  abstract    = "Abstract Suppose we are given a large graph in which, by some
453
+  external process, a handful of nodes are marked. What can we
454
+  say about these nodes? Are they close together in the graph?
455
+  or, if segregated, how many groups do they form? We approach
456
+  this problem by trying to find sets of simple connection
457
+  pathways between sets of marked nodes. We formalize the
458
+  problem in terms of the Minimum Description Length principle:
459
+  a pathway is simple when we need only few bits to tell which
460
+  edges to follow, such that we visit all nodes in a group.
461
+  Then, the best partitioning is the one that requires the least
462
+  number of bits to describe the paths that visit all the marked
463
+  nodes. We prove that solving this problem is NP-hard, and
464
+  introduce DOT2DOT, an efficient algorithm for partitioning
465
+  marked nodes by finding simple pathways between nodes.
466
+  Experimentation shows that DOT2DOT correctly groups nodes for
467
+  which good connection paths can be constructed, while
468
+  separating distant nodes.",
469
+  pages       = "37--45",
470
+  institution = "SIAM",
471
+  year        =  2013,
472
+  doi         = "10.1137/1.9781611972832.5"
473
+}
474
+
476
+@INPROCEEDINGS{Zelnik-Manor2004-rx,
477
+  title     = "Self-tuning spectral clustering",
478
+  booktitle = "Advances in neural information processing systems",
479
+  author    = "Zelnik-Manor, Lihi and Perona, Pietro",
480
+  pages     = "1601--1608",
481
+  year      =  2004
482
+}
483
+
485
+@BOOK{Golub1996-ak,
486
+  title     = "Matrix Computations",
487
+  author    = "Golub, Gene Howard and Van Loan, Charles F",
488
+  abstract  = "Revised and updated, the third edition of Golub and Van Loan's
489
+  classic text in computer science provides essential information
490
+  about the mathematical background and algorithmic skills
491
+  required for the production of numerical software. This new
492
+  edition includes thoroughly revised chapters on matrix
493
+  multiplication problems and parallel matrix computations,
494
+  expanded treatment of CS decomposition, an updated overview of
495
+  floating point arithmetic, a more accurate rendition of the
496
+  modified Gram-Schmidt process, and new material devoted to
497
+  GMRES, QMR, and other methods designed to handle the sparse
498
+  unsymmetric linear system problem.",
499
+  publisher = "Johns Hopkins University Press",
500
+  year      =  1996,
501
+  isbn      = "9780801854132"
502
+}
503
+
504
+