
2016-01-22: university paper on RIC framework

Simon Lackerbauer 1 year ago

2016-01-22_robust_information_theoretic_clustering/paper.tex
\captionsetup{labelsep=newline,singlelinecheck=false} % optional
\title{A Discussion of the Robust Information-theoretic Clustering Framework (RIC)}
\author{\IEEEauthorblockN{Simon Lackerbauer}
\IEEEauthorblockA{Institut für Informatik\\
Ludwig-Maximilians-Universität München\\
Oettingenstraße 67, 80538 München\\}}
In the 2006 paper \textit{Robust Information-theoretic Clustering}, Böhm et al. propose a three-part algorithm to find diversely distributed clusters in data sets using an information-theoretic approach, most notably the VAC criterion, which rests on the minimum description length (MDL) principle, while being input-free and as such usable without specialist knowledge. After comparison with other algorithms with the same or similar prerequisites, one can conclude that it reaches its goal admirably, though with certain limitations, most notably its dependence on the quality of the input clustering and its runtime efficiency.
The authors of the paper discussed herein\cite{Bohm2006-ts} set out to answer the question: “How do we find a natural clustering of a real world point set, which contains an unknown number of clusters with different shapes, and which may be contaminated by noise?”
Heavy emphasis was put on the notion that an algorithm accomplishing this feat should need no user input and, furthermore, not be restricted to purely Gaussian cluster distributions. The proposed algorithm should also not be easily thrown off by noisy data sets, and moreover be at least reasonably efficient in its runtime complexity. Whether and how well the proposed algorithm meets these self-imposed requirements, and how it compares to other methods by researchers tackling much the same problems (compare e.g. NIC\cite{Faivishevsky2010-uk} and PaCCo\cite{Mueller2011-hd}), shall be the focus of this discussion.
First, in section~\ref{sec:ric}, the RIC framework will be presented and the principles behind it explained. Then we take a short look at the experiments against which RIC was tested in section~\ref{sec:exp}. Afterwards, section~\ref{sec:crit} will expand upon general criticisms of the framework that were raised by other seminar participants in the discussion following the presentation. In section~\ref{sec:comp} the algorithm will be compared with similar algorithms along a variety of axes. Finally, this discussion concludes in section~\ref{sec:conc}.
\section{The RIC Framework}
The RIC framework takes as input a pre-clustered data set, though it is indifferent to the method by which this pre-clustering is accomplished; the authors themselves use K-means in their example data sets. RIC is thus not strictly a clustering algorithm but rather a cluster-refining algorithm\cite{Bohm2008-eh}. (By definition, any cluster-refining algorithm should in principle be able to function as a clustering algorithm, refining from the initial state in which every point in the data set is its own cluster.) Compared with other input-free clustering algorithms like X-means\cite{Pelleg2000-jr} and G-means\cite{Hamerly2003-hm}, RIC is supposed to find any clustering that can be described by its predefined PDFs, which means it does not rely solely on Gaussian clusters, but is nevertheless restricted to detecting clusters from a (reasonably small) finite set of mathematically not-too-complex probability densities.
The RIC framework is composed of two sub-algorithms that do the heavy lifting, with the MDL-like VAC criterion serving as the measure of goodness.
\subsection{VAC - Volume After Compression}
The VAC (Volume After Compression) criterion lies at the heart of the RIC framework and guides virtually every decision the framework has to make about the goodness of a given clustering. VAC uses Elias gamma encoding\cite{Elias1975-wj} to encode any integer $i$ in $O(\log i)$ bits, be it point coordinates or point offsets from the cluster center if point $\vec{x}$ is said to belong to a cluster $C$. It further takes into account possible correlation within a cluster, and whether it makes sense to instead store a decorrelation matrix. Thus, the volume after compression of a point $ \vec{x} $ in a grid with distance $ \gamma $ between grid cells can be summarized as
\[ VAC(x) = \left(\log_2 \frac{n}{|C|} \right) + \left(\sum_{0 \leq i < d} \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma} \right) \]
where $ \log_2 \frac{n}{|C|} $ is the encoding cost of the cluster id and $ \sum_{0 \leq i < d} \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma} $ is the encoding cost of all the integers defining a point $ \vec{x} $ in $ d $ dimensions. Obviously, the encoding costs depend quite significantly on the probability density function ($ pdf_i $) chosen to model each dimension of the cluster.
To find the appropriate $ pdf $ the VAC criterion is employed again so that
\[ pdf_i = {\arg\min}_{pdf_{stat} \in \text{PDF}} \sum_{\vec{x} \in C} \log_2 \frac{1}{pdf_{stat}(x_i) \cdot \gamma} \]
where each $ pdf_i $ is chosen from the set of predefined PDFs, with its defining parameters (e.g. mean, variance) estimated from the considered coordinate dimension.
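As an illustration only (this is not the authors' code), the two formulas above can be sketched in a few lines of Python. The helper names `vac_point` and `best_pdf`, and the representation of each candidate pdf as a plain callable with already-fitted parameters, are assumptions of this sketch:

```python
import math

def vac_point(x, cluster_size, n, pdfs, gamma):
    """Code length (in bits) of point x: log2(n/|C|) for the cluster id,
    plus log2(1 / (pdf_i(x_i) * gamma)) per coordinate."""
    cost = math.log2(n / cluster_size)
    for xi, pdf in zip(x, pdfs):
        cost += math.log2(1.0 / (pdf(xi) * gamma))
    return cost

def best_pdf(coords, candidates, gamma):
    """Pick, for one coordinate dimension, the candidate density that
    minimizes the total coding cost of that dimension (the arg-min above)."""
    return min(candidates,
               key=lambda pdf: sum(math.log2(1.0 / (pdf(c) * gamma))
                                   for c in coords))
```

For data drawn from a narrow Gaussian around zero, `best_pdf` would prefer a fitted Gaussian density over a wide uniform one, because central points become cheap to encode.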
Further, a decorrelation matrix is calculated, but only included if the VAC criterion warrants such a move, i.e. if the decorrelation saves more space by virtue of its better fit than it costs to store the matrix itself.
Putting all these formulae together, we finally reach the VAC of a full cluster in
\[ VAC(C) = VAC(dec(C)) + \sum_{\vec{x} \in C} VAC(dec(C) \cdot \vec{x}) \]
where $ dec(C) $ is the decorrelated cluster.
The authors illustrate the principle behind encoding a cluster with different PDFs in figure~\ref{fig:vac}. Since they write themselves that the grid constant $ \gamma $ shall be chosen so that each data point falls into its own grid cell, the choice of $ \gamma $ in their example figure does not quite follow that rule, though this does not hurt the illustration.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{imgs/vac.png}
\caption[]{Example of VAC\cite{Bohm2006-ts}}
\label{fig:vac}
\end{figure}
\subsection{RF - Robust Fitting}
The Robust Fitting sub-algorithm deals with outliers in the input clusters. The fitting algorithm employs the VAC criterion to determine whether a point is adequately described by the characteristic distribution of the cluster or whether it makes more sense not to count that specific point as belonging to (any or this) specific cluster. An example of how a conventional clustering might go about finding a characteristic distribution, as opposed to how the RF algorithm estimates its fit using the VAC criterion, can be found in figure~\ref{fig:robust_estimation}. The sole introduction of the VAC criterion as a measure of goodness, however, does not yet incorporate outlier detection.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{imgs/robust_estimation.png}
\caption[]{Conventional and robust estimation\cite{Bohm2006-ts}}
\label{fig:robust_estimation}
\end{figure}
Instead, RF derives its robustness against outliers from the way the covariance matrix is estimated: the conventional estimate $ \Sigma_C $ is computed from the points $ \vec{x} \in C $ by averaging
\[ \Sigma_C = \frac{1}{|C|} \sum_{\vec{x} \in C} (\vec{x} - \vec{\mu}) \cdot (\vec{x} - \vec{\mu})^\text{T} \]
whereas RF does not rely solely on arithmetic means but also tries the coordinate-wise median (entry $(i,j)$ being the median of $ (x_i - \mu_{R,i}) \cdot (x_j - \mu_{R,j}) $ over all $ \vec{x} $), yielding the robust covariation matrix $ \Sigma_R $.
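A minimal sketch contrasting the two estimators, assuming points given as tuples and a fixed center `mu` (the function names are invented for illustration, and the paper's robust center $\vec{\mu}_R$ is simply taken as given):

```python
import statistics

def cov_conventional(points, mu):
    """Mean-based covariance: entry (i, j) averages (x_i - mu_i)(x_j - mu_j)."""
    d, n = len(mu), len(points)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def cov_robust(points, mu):
    """Median-based 'robust covariation': entry (i, j) is the median of
    (x_i - mu_i)(x_j - mu_j) over all points, so single outliers barely move it."""
    d = len(mu)
    return [[statistics.median((p[i] - mu[i]) * (p[j] - mu[j]) for p in points)
             for j in range(d)] for i in range(d)]
```

On a line of points plus one gross outlier, the mean-based variance explodes while the median-based estimate stays close to the variance of the inliers.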
Given the conventional covariance matrix $ \Sigma_C $, Principal Component Analysis (PCA) yields the orthonormal eigenvector matrix $ V $ (the decorrelation matrix) and the diagonal matrix $ \Lambda $ containing the eigenvalues, with $ \Sigma_C = V \Lambda V^\text{T} $. Using the property of diagonal dominance (i.e. every diagonal element is greater than the sum of the absolute values of the other entries in its row), one can then measure the distance between any two points $ \vec{x} $ and $ \vec{y} $ using the Mahalanobis distance\cite{Mahalanobis1936-wy} defined by $ V $ and $ \Lambda $:
\[ d_{\Sigma_C}(\vec{x}, \vec{y}) = (\vec{x} - \vec{y})^\text{T} \cdot V \cdot \Lambda^{-1} \cdot V^\text{T} \cdot (\vec{x} - \vec{y}) \]
The robust covariance matrix $ \Sigma_R $ does not necessarily have the diagonal dominance property. However, this can easily be fixed by adding a matrix $ \varPhi \cdot I $ to it. $ \varPhi $ should naturally be chosen as the maximum amount by which a row's off-diagonal sum exceeds its corresponding diagonal element. The authors add a small margin on top of that (in their example 10\%), though it is not really clear why that extra margin would be needed, as even
\[ \varPhi_{naive} = \max_{0 \leq i < d} \left\{ \left( \sum_{0 \leq j < d, i\neq j} \left| (\Sigma_R)_{i,j} \right| \right) - (\Sigma_{R})_{i,i} \right\} \]
should be enough to satisfy the diagonal dominance property. They do however add that 10\%, which means the suggested value in the original paper is
\[ \varPhi = 1.1 \cdot \varPhi_{naive}. \]
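The repair step can be sketched as follows, assuming the matrix is given as nested lists; the helper names are hypothetical, and absolute values of the off-diagonal entries are used, matching the usual definition of diagonal dominance:

```python
def phi_naive(sigma):
    """Largest amount by which a row's off-diagonal absolute sum exceeds
    its diagonal entry (Phi_naive in the text)."""
    d = len(sigma)
    return max(sum(abs(sigma[i][j]) for j in range(d) if j != i) - sigma[i][i]
               for i in range(d))

def make_diagonally_dominant(sigma, slack=1.1):
    """Add Phi * I to the matrix; slack=1.1 mirrors the paper's extra 10%."""
    phi = max(phi_naive(sigma), 0.0) * slack
    d = len(sigma)
    return [[sigma[i][j] + (phi if i == j else 0.0) for j in range(d)]
            for i in range(d)]
```

After the correction, every diagonal entry strictly exceeds the absolute sum of the rest of its row, so the decomposition-based distance computation is safe to apply.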
The estimation of a robust covariance matrix does not mean that this matrix is actually used as the descriptive matrix of the cluster, however; the decision about which matrix to use is, as expected, made by calculating the VAC of $ \Sigma_C $, $ \Sigma_R $, and several candidate matrices in between, and finally choosing the one yielding the lowest VAC.
Once the covariance matrix with the lowest VAC is found, all points are first deemed outliers and then iteratively inserted into the set of cluster points (by order of their Mahalanobis distance from the distribution center), with accompanying VAC scoring at each step. Once a (local) minimum is reached, the process stops and the cluster and outliers are then returned to the next RIC sub-algorithm, CM.
\subsection{CM - Cluster Merging}
The Cluster Merging algorithm is likely the computationally costliest part of the RIC process, as it compares each cluster with every other cluster for possible merging, again using the VAC criterion, simply checking whether
\[ VAC(C_i \cup C_j) < VAC(C_i) + VAC(C_j) \]
for all $ C_i, C_j \in \mathcal{C}, i \neq j $. Obtaining the VAC score of a merged cluster seems only possible by building a new point set from the two clusters and running the whole RF algorithm on that set again.
The CM part of the algorithm also comes with the option of descending further down the search tree to escape a local minimum. It will then keep merging clusters even if
\[ VAC(C_i \cup C_j) \geq VAC(C_i) + VAC(C_j) \]
for up to $ t $ iterations. If the search indeed turns out to have been stuck in a local minimum, the counter naturally resets to 0.
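The merge-with-lookahead loop can be sketched roughly as follows, assuming `vac` is some callable that scores a point set (in RIC's case via a full Robust Fitting pass). This simplified version keeps the best clustering seen so far and stops after `t` consecutive non-improving merges, rather than reproducing the paper's exact bookkeeping:

```python
def merge_clusters(clusters, vac, t=3):
    """Greedy cluster merging with a t-step lookahead past local minima.

    clusters: iterable of point sets; vac: callable scoring one point set.
    """
    clusters = [frozenset(c) for c in clusters]
    best_state = list(clusters)
    best_score = sum(vac(c) for c in clusters)
    misses = 0  # consecutive merges that did not improve the total VAC
    while len(clusters) > 1 and misses < t:
        # pick the pair whose union has the lowest VAC
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: vac(clusters[ab[0]] | clusters[ab[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        score = sum(vac(c) for c in clusters)
        if score < best_score:
            best_state, best_score, misses = list(clusters), score, 0
        else:
            misses += 1  # speculative merge, hoping to escape a local minimum
    return best_state  # roll back to the best clustering encountered
```

With a toy score that is cheap for homogeneous groups and expensive for mixed ones, the loop merges matching singletons and then rolls back the final, worsening merge.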
\section{Experiments}\label{sec:exp}
RIC is tested on two synthetic data sets and two real data sets. RIC found all the correlation, Laplacian, and Gaussian clusters in the synthetic data sets, the input clustering being provided by either K-means or DBSCAN\cite{Ester1996-rh}. RIC correctly identifies more than 98\% of the noise objects in the data sets and reduces the initial VAC score by about a quarter in each case. The runtimes for both data sets are provided and discussed below in section~\ref{sub:run}.
Further evaluation is provided by running RIC on a 14-dimensional real world metabolic data set. Clustering in this biomedical instance can help in diagnosing disorders, if the algorithm can correctly cluster both the healthy control group and the test group with the disease. For that reason, \textit{impurities} in the clusters are counted as a further point of comparison, impurities being points assigned to the wrong group given the known distribution, i.e. the sum of statistical type I and type II errors. As expected, RIC beats both spectral and K-means clustering by achieving a lower VAC, even with both competitors being provided the correct number of clusters. More important, however, is RIC's far lower number of impurities (mostly ill data points ascribed to the healthy cluster): it usually makes only about a third of the errors that K-means and spectral clustering produce on their own.
The fourth example concerns retinal images of cats in various states of health. The images are encoded as a 7-dimensional data set with more than 21,000 instances, which RIC (again starting from K-means) partitions into 13 clusters, all of them correlation clusters, two of which describe actual, studied, biologically meaningful relationships, namely the ``Müller cells'' and ``rod photoreceptors''. As RIC's premise was to provide meaningful clusters without user input, and as such without relying on the user's specific domain knowledge, this can be taken as a sign that it at least partially reached its goal.
\section{General Criticism}
As the lead author himself confirms in later works\cite{Bohm2008-eh}\cite{Bohm2010-uu}, the clustering output of RIC depends strongly on the quality of the initial clustering, and the cluster model is limited to linear attribute correlations. It is also limited to the list of predefined PDFs, though that might not be too much of a limitation, as new PDFs are, in theory, supposed to be easy to add. Further, there is a reason that only a limited supply of deeply researched, named density functions exists in the first place, namely that they are readily applicable to a large enough subset of problems. Natural processes that need far more elaborate models are clearly out of scope for the RIC framework.
\subsection{Runtime Analysis}
The paper does not contain an explicit runtime analysis of RIC or its sub-algorithms. It is, however, relatively easy to see that the VAC calculation itself takes at most linear time, as it merely adds up the total size of an encoded input. The Robust Fitting part can probably be assumed to run in near-linear time as well, whereas the Cluster Merging algorithm almost certainly grows at least quadratically in the worst case.
The experimental runtimes that can be taken from the paper (147\,s for 4751 objects in 2 dimensions and 567\,s for 7500 objects in 3 dimensions) point to super-linear growth: an input of $ 7500 \cdot 3 = 22500 $ integers, about 2.4 times as many as $ 4751 \cdot 2 = 9502 $, takes roughly 3.9 times as long, corresponding to a growth exponent of about 1.6 on this (admittedly tiny) sample, i.e. clearly super-linear though still below quadratic.
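This back-of-the-envelope reading can be reproduced with a few lines of arithmetic (the figures are the two runtimes reported in the paper; the exponent interpretation is this discussion's, not the authors'):

```python
import math

# Reported experimental runtimes from the paper
small_n, small_t = 4751 * 2, 147   # 4751 objects in 2 dimensions: 147 s
large_n, large_t = 7500 * 3, 567   # 7500 objects in 3 dimensions: 567 s

size_ratio = large_n / small_n     # ~2.37x the input size
time_ratio = large_t / small_t     # ~3.86x the runtime

# If runtime scaled as n^k, then time_ratio = size_ratio^k, so
# k = log(time_ratio) / log(size_ratio); here k is ~1.6, i.e. between
# linear (k = 1) and quadratic (k = 2) growth on this two-point sample.
growth_exp = math.log(time_ratio) / math.log(size_ratio)
```

Two data points are of course far too few to pin down a growth rate; the arithmetic only bounds what the reported numbers are consistent with.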
\begin{table*}
\caption{Comparison of RIC with other information-theoretic clustering algorithms}
\label{tab:comp}
\centering
\begin{tabular}{lllllll}
\toprule
& parameter-free & runtime analysis & input data & clustering strategy & clusters detected & year \\ \midrule
RIC & yes & $ O(n^2)^{\beta} $ & matrix & input-refining & Gaussian, Laplacian, correlation, others$ ^{\alpha} $ & 2006 \\
PaCCo & yes & $ O(n) < x < O(n^3)^{\beta} $ & graph (weighted) & bisecting & Gaussian, others$ ^{\alpha} $ & 2011 \\
minCEntropy($ ^+ $) & no & $ O(n^2) $ & matrix & proprietary & Gaussian & 2010 \\
ROCAT & yes & $ O(n) $ & matrix & bisecting & not applicable & 2014 \\
PICS & yes & $ O(n) $ & graph (attributed) & bisecting & not applicable & 2012 \\
INCONCO & yes & $ O(n)^\beta $ & matrix (attributed) & bisecting & Gaussian & 2011 \\
NIC & yes & $ O(n^2) $ & matrix & greedy sequential k-means & non-convex & 2010 \\
VoG & yes & $ O(n)^\gamma $ & graph & multiple & not applicable & 2014 \\ \bottomrule
\multicolumn{7}{l}{$ ^{\alpha} $ user-defined PDFs are explicitly mentioned as desired extensions to the algorithm} \\
\multicolumn{7}{l}{$ ^{\beta} $ estimate, as the respective authors did not provide an explicit runtime analysis} \\
\multicolumn{7}{l}{$ ^{\gamma} $ average case, with $ n $ being the number of edges in the graph}
\end{tabular}
\end{table*}
It would have been interesting to see how the RIC framework reacts to random or perfectly uniform input. Would it see structure in the noise regardless? Would the CM method make a bad initial K-means clustering on such data more deterministic? Sadly, the paper contains no examples that could act as a kind of control group.
\section{Comparison with other Algorithms}
In this section we compare how RIC fares against some of the other clustering algorithms introduced in the seminar: PaCCo\cite{Mueller2011-hd}, minCEntropy\cite{Vinh2010-tc}, ROCAT\cite{He2014-uf}, PICS\cite{Akoglu2012-vz}, INCONCO\cite{Plant2011-wy}, NIC\cite{Faivishevsky2010-uk}, VoG\cite{Koutra2014-up}, and Subdue\cite{Ketkar2005-zs}. The categorical results of these comparisons are summarized in table~\ref{tab:comp}.
\subsection{PaCCo}
PaCCo is a weighted-graph-based algorithm employing the information-theoretic MDL principle to check the goodness of a clustering just like RIC does, including a Huffman coding scheme along a cluster PDF. Unlike RIC, however, PaCCo in general assumes Gaussian clusters, though other PDFs can be added by the user. A further similarity is that both algorithms work without user input. PaCCo, however, comes with its own bisecting K-means strategy, so its input really is completely unordered and unclustered data.
PaCCo's runtime efficiency is not explicitly stated beyond being ``super-linear''. In tests against Zelnik-Manor and Perona's parameter-free SpectralZM algorithm\cite{Zelnik-Manor2004-rx}, which has a runtime of $ O(n^3) $, PaCCo outpaced SpectralZM with ease; its real runtime will thus lie somewhere in between.
In conclusion, even though PaCCo was introduced significantly later than RIC and makes a different input assumption, the RIC framework still holds up surprisingly well in comparison.
\subsection{minCEntropy$ (^+) $}
minCEntropy does not focus on producing a single clustering solution for a given data set, but instead acknowledges that data can often be interpreted in more than one way (e.g. data used for facial recognition/authentication could also be used to identify gaze direction). The algorithm's goodness criterion differs from all the other criteria discussed in this paper: instead of referencing the MDL principle, the authors choose to minimize the conditional entropy left in the data, or in other words, maximize the average intra-cluster similarity (hence the name). Of course, maximizing intra-cluster similarity will invariably make it easier to describe a given cluster succinctly, in other words minimizing the description length needed.
The included complexity analysis puts emphasis on the fact that although initializing the required kernel matrix takes up to $ O(n^2) $ time, further updating to reach an optimal clustering takes only linear time and converges quickly.
minCEntropy$ ^+ $ is the extension of minCEntropy that finds the above-mentioned alternative clusterings when given a clustering to explicitly exclude; a further extension, minCEntropy$ ^{++} $, can exclude multiple clusterings.
No variant of the algorithm works completely parameter-free.
\subsection{ROCAT}
ROCAT is a fairly recent, parameter-free approach to clustering, using the MDL principle (though expressed as a question of entropy, as with minCEntropy above) and focusing on subspace clusters in categorical (i.e. non-numerical) data. As ROCAT deals with categorical data, a defining cluster probability distribution like a Gaussian curve is not really applicable; clusters built by ROCAT over a matrix view of the attribute data are always rectangular.
ROCAT's runtime grows linearly with $ n $ and quadratically with dimensionality, making it one of the most efficient algorithms discussed here.
\subsection{PICS}
PICS is a parameter-free graph-clustering algorithm that does not cluster solely on connectivity information but also takes node attributes into account. Naturally, PICS also relies on an MDL criterion to decide whether splitting the biggest identified cluster yields a better description of the data (similar to the approach used by PaCCo above).
\subsection{INCONCO}
INCONCO is a parameter-free clustering algorithm with much the same premise as PICS, except that it is based not on attributed graph data but on vector data with mixed-type attributes.
Where RIC uses the Robust Fitting method instead of plain PCA to make its clustering less sensitive to outliers, INCONCO extends the Cholesky decomposition\cite{Golub1996-ak}, which is more efficient than PCA and yields better results. The use of an MDL criterion for goodness-of-fit decisions at each step of the bisecting pass through the input data no longer comes as a surprise.
No explicit theoretical runtime analysis is given, but judging from the reported runtimes, INCONCO seems extremely efficient relative to the number of points processed, with much of the cost probably due to a large but constant overhead.
\subsection{NIC}
NIC is a parameter-free framework that can find clusters independently of their underlying structure by utilizing the MeanNN differential entropy estimator. NIC stands out in that it can even find clusters consisting of non-convex point sets (e.g. concentric circles), a feat RIC could not accomplish without first being fed a complex user-defined distribution.
\subsection{VoG}
The premise of VoG is the description of large graphs using a fixed vocabulary (e.g. stars, chains, bipartite cores). Clusters in the graph shall not only be recognized but also described by their most characteristic structure. By examining the frequencies with which the predefined vocabulary structures appear in the graph, along with the most notable structures overall, one can make meaningful abstractions about the underlying real-world structure (e.g. finding the two groups comprising a Wikipedia edit war).
\section{Conclusion}\label{sec:conc}
In the end, especially against competitors that were introduced significantly later and could build upon a larger body of work in much the same direction, the RIC framework fares surprisingly well. It is one of the few algorithms that comes pre-equipped with more than just the usual Gaussian PDF (although Gaussian distributions still play one of the most significant roles in data description across all fields). Its runtime complexity is not the framework's strongest point, but it still holds up against much of the competition. As seen in section~\ref{sec:crit}, there are valid criticisms to be aware of, but as the real-world examples have shown, any clustering algorithm that can consistently identify known relations, and can thus be used to find new relationships in data, is a step in the right direction.
Add to this that RIC is exactly that: a framework that profits whenever clustering elsewhere improves, since better input can then be refined further by the information-theoretic principles that guide it. The findings of Böhm et al. have certainly advanced the area of research they set out to advance.
The author would like to thank Dr. Bianca Wackersreuther for providing input and help whenever needed and for conducting the seminar in a well-received and professional manner; Prof. Christian Böhm, both for co-authoring the discussed paper in the first place and for giving me the opportunity to engage with the topic in this course; and lastly the other participants in this seminar, for providing the constructive criticism after my talk without which this discussion paper would not have been possible.

2016-01-22_robust_information_theoretic_clustering/slides.tex
%\setbeameroption{show only notes}
\usetheme{default} %minimal
\setbeamertemplate{bibliography item}{}
\setbeamercolor*{bibliography entry title}{fg=black}
\setbeamercolor*{bibliography entry author}{fg=black}
\setbeamercolor*{bibliography entry location}{fg=black}
\setbeamercolor*{bibliography entry note}{fg=black}


\title{Robust Information-theoretic Clustering}

\author{Simon Lackerbauer}

\institute[Ludwig-Maximilians-Universität München]{Institut für Informatik\\
Ludwig-Maximilians-Universität München}

\date{Seminar \textit{Information Theoretic Data Mining}, winter term 2015/16}



\note{Good morning, my name is Simon Lackerbauer, and today I am presenting an algorithm for automated clustering that was introduced at SIGKDD 2006.}
\note{Here is the rough outline: first we look at the problem using a few simple examples, then I explain the RIC framework and its individual components (Volume after Compression, Robust Fitting, and Cluster Merging), which only in combination are actually able to tackle the problem.\\Afterwards we look at some real data clustered by the algorithm and briefly touch on comparisons with traditional algorithms.}
\section{Clustering Problems}
\note{Right, so first to the problem, or rather, the problems.}
\begin{frame}{Clustering Problems}{}
\item There exists a wide range of clustering algorithms.
\item Many need user input or assume purely Gaussian clusters.
\item We want an algorithm that needs no user input and automatically selects appropriate cluster distribution functions.
\note{The problem itself (grouping data into meaningful clusters) is of course not new. A cluster is nothing more than an accumulation of data points following a certain pattern, or rather a certain distribution function. Data points within such a cluster are not independent of the underlying distribution: if you know how the data is clustered, you may be able to spot interesting relationships in the data easily.\\That also means, of course, that several people have already thought about this. K-means, for example, a still very widespread and rather simple algorithm, goes back to the 1950s.}
\note{K-means is not the only such algorithm, but it serves well as an example of the problems that exist in general. It needs an input parameter (the number of clusters), which somewhat obscures the problem, because how is one supposed to know that in advance? Then all clusters (looking at data in 2D) take on ellipse-like shapes, because the algorithm assumes a Gaussian distribution, and that is exactly what many conventional algorithms do.\\So we want an algorithm that needs no input and is additionally unrestricted in its choice of the density function the cluster is based on.}
\begin{frame}{Clustering Problems}{Example: How not to do it}
\caption[]{Example of “Bad” Clustering\cite{Bohm2006-ts}}
\note{Here you see an example of what K-means might have found given the input 5. With input 2, these two groups would probably have come out, and so on. One immediately sees that another big problem is the handling of outliers. The bad algorithm is not robust and tries at all costs to cram noise into the clusters.}
\begin{frame}{Clustering Problems}{Example: Reasonable reduction}
\caption[]{Example of “Good” Clustering\cite{Bohm2006-ts}}
\note{This, by contrast, makes much more sense: one cluster normally distributed on both axes and one correlated, line-shaped cluster, with a few noise points in between that are ignored.}
\begin{frame}{Comparison of examples}{}
\textbf{What makes the second pattern better than the first?}
\item It's more descriptive of the interesting patterns in the data, because outliers have been “omitted”.
\item The clusters in the “good” example are those a human would immediately recognize as points being associated somehow.
\note{One can put this more formally and say that the two good clusters simply describe the data better. Information can be derived from the shape and position of the clusters alone, which is not the case in the bad example. Here, in the simple 2D example, this of course coincides with the clusters a human would have drawn intuitively. But there are data sets where even a human does not immediately see anything (for example the set at the very end) and therefore cannot "help" the algorithm with intuition.}
\begin{frame}{Measuring Success}{}
This human intuition must be translated into a dependable clustering algorithm, for which two measures of success can be defined:
\item \uncover<2->{\textbf{Goodness of fit}}
\item \uncover<3->{\textbf{Efficiency}}
\note{So we want an algorithm that satisfies two requirements. Goodness of fit: how well does the cluster actually match the data? Formally: how do we get a function that allows scoring clusterings and assigns a "good" score to the second example above and a "bad" one to the noisy clusters from the bad example?}
\note{How do we achieve that with decent runtime and without being disturbed too much by outliers/noise?}
\section{Solution: The iterative approach}
\note{Good, now on to the actual topic.}
\subsection{VAC -- Volume After Compression}
\note{Let's start with what is probably the most important sub-algorithm: Volume After Compression.}
\begin{frame}{Proposition for a solution: VAC}{}
\textbf{VAC - Volume after Compression}\\
\item does not define what a ``good'' grouping is in absolute terms
\item specifies, for two groupings $x, y$, which one is better (e.g., $VAC(x) < VAC(y) \Rightarrow x$ is the better grouping)
\item size of the total, \textbf{lossless} compression of the data
\note{The VAC criterion is an information-theoretic approach to scoring. The better we describe the data, the better we can compress it without loss of information. Just as I can describe the whole numbers in a highly compressed way with the word "integer" without enumerating them all, a good cluster describes the underlying data without listing every data point individually.\\Importantly, VAC is not an absolute score. One cannot compare VAC scores between two data sets and say the one with the lower score is clustered better. But within the same data set one can compare two candidate clusterings and state unambiguously: the one with the lower score describes the data better, because it needs less information transfer to establish the same knowledge on both sides.}
\begin{frame}{Proposition for a solution: VAC}{Integer Encoding}
\item point coordinates are always integers
\item self-delimiting encoding of integers: Elias (gamma) codes
\item smaller integers require fewer bits
\note{When preparing the data we assume that all coordinates are integers (since only finite precision can be represented anyway) and can therefore divide the axes into a grid, which can be made arbitrarily precise. Integers can be encoded unambiguously without any padding via Elias coding, so that small numbers use less space; this becomes important later.}
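A toy Elias gamma coder, included here only to illustrate the ``smaller integers get shorter, self-delimiting codes'' property (this sketch is not part of the original slides):

```python
def elias_gamma(n):
    """Encode a positive integer: (len(binary) - 1) leading zeros,
    followed by the binary digits themselves."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]                      # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary  # 9 -> '0001001'

def elias_gamma_decode(bits):
    """Decode one gamma-coded integer: count leading zeros z,
    then read the next z + 1 bits as the binary value."""
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:zeros + zeros + 1], 2)
```

The leading zeros tell the decoder how many bits follow, so codes can be concatenated without padding, and small (i.e. frequent, near-center) values cost the fewest bits.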

\begin{frame}{Proposition for a solution: VAC}{Cluster Encoding}
\item uses Huffman coding to encode point coordinates according to the assumed cluster pdf
\item thus, if we assume the correct distribution for the cluster, core points are encoded more efficiently
\note{Huffman encoding geben uns für alle Koordinaten eines Punktes einen bit-string $l = log_2(1/P(x))$, so dass Punkte die an einem Punkt liegen, wo die Wahrscheinlichkeit laut pdf höher ist, dass sie liegen, weniger Bits verbrauchen als solche, die weiter am Rand der Verteilung liegen. Je besser demnach unsere Annahme der Distribution für ein Cluster ist, desto effectiver lassen sich die Punkte kodieren.}
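The code length $l = \log_2(1/P(x))$ can be checked directly (a sketch with an assumed standard-Gaussian cluster pdf and a hypothetical grid constant, not the paper's code):

```python
import math

def code_length_bits(x: float, mu: float = 0.0, sigma: float = 1.0,
                     gamma: float = 0.1) -> float:
    """Ideal code length log2(1/P(x)), where the probability of a grid cell
    is approximated by pdf(x) * gamma for a Gaussian cluster pdf."""
    pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return math.log2(1.0 / (pdf * gamma))

# a core point near the cluster center is cheaper to encode than a tail point
assert code_length_bits(0.0) < code_length_bits(3.0)
```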
\begin{frame}{Proposition for a solution: VAC}{Cluster Encoding}
\caption[]{Example of VAC\cite{Bohm2006-ts}}
\note{Here we see an example of partitioning a data set into a grid and then encoding the coordinates. One can see clearly how, for both distributions, the bulk of the points lies right at the center of the respective density function.}
\begin{frame}{Proposition for a solution: VAC}{Cluster encoding}
\begin{definition}[VAC of point $\vec{x}$]
Let $\vec{x} \in \mathbb{R}^d$ be a point of a cluster $C$ and $\overrightarrow{pdf}(\vec{x})$ be a $d$-dimensional vector of probability density functions which are associated to $C$. Each $pdf_i(x_i)$ is selected from a set of predefined probability density functions with corresponding parameters, i.e. $PDF = \{pdf_{Gauss(\mu_i, \sigma_i)}, pdf_{uniform(lb_i, ub_i)}, pdf_{Lapl(a_i, b_i)}, ...\}$, $\mu_i, lb_i, ub_i, a_i \in \mathbb{R}, \sigma_i, b_i \in \mathbb{R}^+$. Let $\gamma$ be the grid constant (distance between grid cells). The $VAC_i$ of coordinate $i$ of point $\vec{x}$ corresponds to
\[VAC_i(\vec{x}) = \log_2 \frac{1}{pdf_i(x_i) \cdot \gamma}\]
The $VAC$ of point $\vec{x}$ corresponds to
\[VAC(\vec{x}) = \left( \log_2 \frac{n}{|C|}\right) + \sum_{0 \leq i < d} VAC_i(\vec{x})\]
\note{Here, once more, is the formal definition of the VAC criterion. Most importantly, the set of admissible pdfs is fixed in advance, so that each pdf can then be referenced by an ID.}
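The definition above translates almost one-to-one into code (a sketch; the Gaussian pdf and all parameter values are illustrative, not taken from the paper):

```python
import math

def gauss_pdf(mu: float, sigma: float):
    """Return a one-dimensional Gaussian pdf with the given parameters."""
    def pdf(t: float) -> float:
        return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return pdf

def vac_point(x, pdfs, gamma: float, n: int, cluster_size: int) -> float:
    """VAC(x) = log2(n/|C|) + sum_i log2(1 / (pdf_i(x_i) * gamma))."""
    id_bits = math.log2(n / cluster_size)             # cost of naming the cluster
    coord_bits = sum(math.log2(1.0 / (pdf(xi) * gamma))
                     for pdf, xi in zip(pdfs, x))     # per-coordinate VAC_i
    return id_bits + coord_bits

pdfs = [gauss_pdf(0.0, 1.0), gauss_pdf(0.0, 1.0)]
core = vac_point([0.1, 0.0], pdfs, gamma=0.01, n=1000, cluster_size=100)
tail = vac_point([3.0, 2.5], pdfs, gamma=0.01, n=1000, cluster_size=100)
assert core < tail   # core points compress better under the right pdf
```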
\begin{frame}{Proposition for a solution: VAC}{Cluster Encoding and Decorrelation}
\item $\gamma$ is a measure of the granularity of the grid cells
\item the absolute VAC changes with the grid resolution, but relative VAC comparisons stay the same
\item to choose optimal parameter settings for a cluster, we use the statistical parameters of the data set
\item if the attributes are correlated, a decorrelation matrix is applied iff the VAC savings at least compensate for the cost of storing the decorrelation matrix
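A quick numeric check of the grid-resolution claim (a sketch; the Gaussian pdf and grid constants are assumptions for illustration): refining the grid shifts every per-coordinate VAC by the same $\log_2$ amount, so comparisons between points or clusterings are unaffected.

```python
import math

def vac_coord(x: float, mu: float = 0.0, sigma: float = 1.0,
              gamma: float = 0.1) -> float:
    """Per-coordinate VAC: log2(1 / (pdf(x) * gamma)) for a Gaussian pdf."""
    pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return math.log2(1.0 / (pdf * gamma))

# the absolute VAC depends on the grid constant gamma ...
assert abs(vac_coord(0.0, gamma=0.01) - vac_coord(0.0, gamma=0.1) - math.log2(10)) < 1e-9
# ... but VAC differences between points are independent of gamma
coarse = vac_coord(2.0, gamma=0.1) - vac_coord(0.0, gamma=0.1)
fine = vac_coord(2.0, gamma=0.01) - vac_coord(0.0, gamma=0.01)
assert abs(coarse - fine) < 1e-9
```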
\subsection{RF -- Robust Fitting}
\note{To actually describe clusters with the help of the VAC criterion, we still need two algorithms that assist in the actual construction of the clusters. One of them is Robust Fitting.}
\begin{frame}{Two helper algorithms}{Robust Fitting}
\item Start: get as input a set of clusters $\mathcal{C} = \{C_1,...,C_k\}$ by an arbitrary method
\item for every $C_i$ in $\mathcal{C}$, define a similarity measure (a decorrelation matrix, i.e.\ an ellipsoid)
\item use the VAC score to try out decorrelation matrices until the one with the lowest VAC is found
\note{RF itself is fairly simple to describe: we first obtain some number of clusters by some arbitrary method (plain K-means will do for a start). We then take a decorrelation matrix that describes an ellipsoid forming the boundary between core points and outliers, and try different boundaries until we have found the one that produces the lowest VAC score. (Since every cluster has only finitely many points, we can in fact try all points one after the other, from the inside out.)}
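A one-dimensional toy version of this loop can be sketched as follows (an illustrative simplification, not the paper's RF: core points are coded with a Gaussian fitted robustly to them, outliers with a uniform pdf over an assumed data range, and every core/noise boundary is tried):

```python
import math
import statistics

def coding_cost(core, noise, gamma=0.01, lo=-10.0, hi=10.0):
    """Bits to code core points with a Gaussian fitted to them (robust
    center: median) and noise points with a uniform pdf over [lo, hi]."""
    bits = 0.0
    if core:
        mu = statistics.median(core)
        sigma = max(statistics.pstdev(core), gamma)  # floor avoids degenerate fits
        for x in core:
            pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            bits += -math.log2(pdf * gamma)
    bits += len(noise) * -math.log2(gamma / (hi - lo))  # uniform pdf = 1/(hi-lo)
    return bits

def robust_fit_1d(points):
    """Order points by distance from the robust (median) center and keep
    the core/noise split with the lowest coding cost."""
    center = statistics.median(points)
    ordered = sorted(points, key=lambda x: abs(x - center))
    best_k = min(range(1, len(ordered) + 1),
                 key=lambda k: coding_cost(ordered[:k], ordered[k:]))
    return ordered[:best_k], ordered[best_k:]

core, noise = robust_fit_1d([-0.2, -0.1, 0.0, 0.1, 0.2, 9.0])
assert noise == [9.0]   # the outlier is separated from the core
```

Declaring the far-away point as noise wins because coding it under the cluster's Gaussian would inflate $\sigma$ and make every core point more expensive.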
\begin{frame}{Two helper algorithms}{Robust Fitting}
Decorrelation Matrix:
\item contains the vectors that span the space in which the points of the cluster reside
\item to improve the robustness of the cluster-center estimate, use the coordinate-wise median instead of the arithmetic mean
\item among the several matrices generated during this step, again partition into core points and noise by choosing the one with the best VAC
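Why the coordinate-wise median: a single gross outlier barely moves the median but drags the arithmetic mean far away (a toy illustration with made-up numbers):

```python
import statistics

data = [9.9, 10.0, 10.1, 10.2, 500.0]               # one gross outlier
assert abs(statistics.median(data) - 10.1) < 1e-9   # robust center estimate
assert statistics.mean(data) > 100                  # mean dragged toward the outlier
```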
\begin{frame}{Two helper algorithms}{Robust Fitting}
\caption[]{Conventional and robust estimation\cite{Bohm2006-ts}}
\subsection{CM -- Cluster Merging}
\begin{frame}{Two helper algorithms}{Cluster Merging}
\item Start: get as input a set of clusters $\mathcal{C} = \{C_1,...,C_k\}$ by an arbitrary method
\item purify each cluster of noise individually
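The merging decision itself reduces to a VAC comparison (a simplified one-dimensional sketch, not the paper's CM: each cluster is coded with its own Gaussian plus the $\log_2(n/|C|)$ cluster-ID cost, and two clusters are merged iff the joint code is shorter):

```python
import math
import statistics

def cluster_bits(points, n, gamma=0.01):
    """Simplified 1-D VAC of a cluster: per-point cluster-ID cost
    log2(n/|C|) plus Gaussian coding cost for every point."""
    mu = statistics.mean(points)
    sigma = max(statistics.pstdev(points), gamma)    # floor avoids degenerate fits
    bits = len(points) * math.log2(n / len(points))  # cluster-ID cost
    for x in points:
        pdf = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        bits += -math.log2(pdf * gamma)
    return bits

def should_merge(c1, c2):
    """Merge two clusters iff the merged coding cost is lower."""
    n = len(c1) + len(c2)
    return cluster_bits(c1 + c2, n) < cluster_bits(c1, n) + cluster_bits(c2, n)

assert should_merge([0.0, 0.1, 0.2], [0.05, 0.15, 0.25])      # halves of one cluster
assert not should_merge([0.0, 0.1, 0.2], [50.0, 50.1, 50.2])  # distant clusters
```

The cluster-ID term is what makes merging attractive: splitting one true cluster in two always tightens the fitted pdfs a little, but every point then pays extra bits just to name its cluster.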

\begin{frame}{Two helper algorithms}{Cluster Merging}
\caption[]{Clustering by using K-means and then purifying\cite{Bohm2006-ts}}
\begin{frame}{Two helper algorithms}{Cluster Merging}
\caption[]{After merging\cite{Bohm2006-ts}}
\section{Example: Cat Retina Images}
\begin{frame}{Example: Cat Retina Images}{}
\item 219 blocks of retinal images, 96 tiles per block $\rightarrow$ 21,024 tiles in total (example tiles in figure \ref{fig:cat5})
\item each tile is represented as vector of 7 features (figure \ref{fig:cat1}(a))
\item RIC finds 13 clusters, color coded in figure \ref{fig:cat1}(b)
\item Example clusters in figures \ref{fig:cat2}(a)-(f)
\begin{frame}{Example: Cat Retina Images}{}
\caption[]{Examples of tiles\cite{Bohm2006-ts}}
\begin{frame}{Example: Cat Retina Images}{}
\caption[]{Visualization of cat retina data\cite{Bohm2006-ts}}
\begin{frame}{Example: Cat Retina Images}{}
\caption[]{Example clusters\cite{Bohm2006-ts}}
\begin{frame}{Example: Cat Retina Images}{}
\caption[]{Example clusters\cite{Bohm2006-ts}}
\begin{frame}{Example: Cat Retina Images}{}
\caption[]{Example clusters\cite{Bohm2006-ts}}
\item The VAC criterion provides a \alert{stable measure of goodness of fit}.
\item The RIC framework is very flexible, does not rely on user input, and can handle any distribution that can be described by a pdf.
\item Any time a new, better clustering algorithm is introduced, RIC can improve on it by running its parts (CM and RF) with the better algorithm as a starting point.\\
\note{Thank you for your attention!}

% 2016-01-22_robust_information_theoretic_clustering/sources.bib

@inproceedings{Ester1996,
  title     = "A density-based algorithm for discovering clusters in large spatial databases with noise",
  booktitle = "Proceedings of the 2nd {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining",
  author    = "Ester, Martin and Kriegel, Hans-Peter and Sander, J{\"{o}}rg and Xu, Xiaowei",
  volume    = 96,
  pages     = "226--231",
  year      = 1996
}

@article{Mahalanobis1936,
  title   = "On the generalized distance in statistics",
  author  = "Mahalanobis, Prasanta Chandra",
  journal = "Proceedings of the National Institute of Sciences (Calcutta)",
  volume  = 2,
  pages   = "49--55",
  year    = 1936
}

@article{Elias1975,
  title   = "Universal codeword sets and representations of the integers",
  author  = "Elias, Peter",
  journal = "IEEE Trans. Inf. Theory",
  volume  = 21,
  number  = 2,
  pages   = "194--203",
  year    = 1975,
  issn    = "0018-9448",
  doi     = "10.1109/TIT.1975.1055349"
}

@inproceedings{Pelleg2000,
  title     = "X-means: Extending K-means with Efficient Estimation of the Number of Clusters",
  booktitle = "{ICML}",
  author    = "Pelleg, Dan and Moore, Andrew W and {Others}",
  volume    = 1,
  year      = 2000
}

@inproceedings{Hamerly2003,
  title     = "Learning the k in k-means",
  booktitle = "Seventeenth Annual Conference on Neural Information Processing Systems ({NIPS}), Vancouver, {BC}",
  author    = "Hamerly, Greg and Elkan, Charles",
  year      = 2003
}

@inproceedings{Bohm2010-sync,
  title     = "Clustering by Synchronization",
  booktitle = "Proceedings of the 16th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining",
  author    = "B{\"{o}}hm, Christian and Plant, Claudia and Shao, Junming and Yang, Qinli",
  publisher = "ACM",
  pages     = "583--592",
  year      = 2010,
  isbn      = "9781450300551",
  doi       = "10.1145/1835804.1835879"
}

@inproceedings{Bohm2008,
  title     = "Outlier-robust Clustering Using Independent Components",
  booktitle = "Proceedings of the 2008 {ACM} {SIGMOD} International Conference on Management of Data",
  author    = "B{\"{o}}hm, Christian and Faloutsos, Christos and Plant, Claudia",
  publisher = "ACM",
  pages     = "185--198",
  year      = 2008,
  isbn      = "9781605581026",
  doi       = "10.1145/1376616.1376638"
}

@inproceedings{Bohm2009,
  title     = "{CoCo}: Coding Cost for Parameter-free Outlier Detection",
  booktitle = "Proceedings of the 15th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining",
  author    = "B{\"{o}}hm, Christian and Haegler, Katrin and M{\"{u}}ller, Nikola S and Plant, Claudia",
  publisher = "ACM",
  pages     = "149--158",
  year      = 2009,
  isbn      = "9781605584959",
  doi       = "10.1145/1557019.1557042"
}

@inproceedings{Bhattacharya2005,
  title     = "{ViVo}: Visual Vocabulary Construction for Mining Biomedical Images",
  booktitle = "Fifth {IEEE} International Conference on Data Mining ({ICDM}'05)",
  author    = "Bhattacharya, Arnab and Ljosa, Vebjorn and Pan, Jia-Yu and Verardo, Mark R and Yang, Hyungjeong and Faloutsos, Christos and Singh, Ambuj K",
  publisher = "IEEE",
  pages     = "50--57",
  year      = 2005,
  isbn      = "9780769522784",
  doi       = "10.1109/ICDM.2005.151"
}

@inproceedings{Akoglu2012,
  title     = "{PICS}: Parameter-free Identification of Cohesive Subgroups in Large Attributed Graphs",
  booktitle = "{SDM}",
  author    = "Akoglu, Leman and Tong, Hanghang and Meeder, Brendan and Faloutsos, Christos",
  publisher = "Citeseer",
  pages     = "439--450",
  year      = 2012
}

@inproceedings{Vinh2010,
  title     = "{minCEntropy}: A Novel Information Theoretic Approach for the Generation of Alternative Clusterings",
  booktitle = "Data Mining ({ICDM}), 2010 {IEEE} 10th International Conference on",
  author    = "Vinh, Nguyen Xuan and Epps, J",
  pages     = "521--530",
  year      = 2010,
  issn      = "1550-4786",
  doi       = "10.1109/ICDM.2010.24"
}

@inproceedings{Mueller2011,
  title     = "Weighted Graph Compression for Parameter-free Clustering with {PaCCo}",
  booktitle = "{SDM}",
  author    = "Mueller, Nikola and Haegler, Katrin and Shao, Junming and Plant, Claudia and B{\"{o}}hm, Christian",
  publisher = "SIAM",
  pages     = "932--943",
  year      = 2011,
  doi       = "10.1137/1.9781611972818.80"
}

@inproceedings{Plant2011,
  title     = "{INCONCO}: Interpretable Clustering of Numerical and Categorical Objects",
  booktitle = "Proceedings of the 17th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining",
  author    = "Plant, Claudia and B{\"{o}}hm, Christian",
  publisher = "ACM",
  pages     = "1127--1135",
  year      = 2011,
  isbn      = "9781450308137",
  doi       = "10.1145/2020408.2020584"
}

@inproceedings{Feng2013,
  title     = "{Compression-Based} Graph Mining Exploiting Structure Primitives",
  booktitle = "Data Mining ({ICDM}), 2013 {IEEE} 13th International Conference on",
  author    = "Feng, Jing and He, Xiao and Hubig, N and Bohm, C and Plant, C",
  pages     = "181--190",
  year      = 2013,
  issn      = "1550-4786",
  doi       = "10.1109/ICDM.2013.56"
}

@inproceedings{He2014,
  title     = "Relevant overlapping subspace clusters on categorical data",
  booktitle = "Proceedings of the 20th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining",
  author    = "He, Xiao and Feng, Jing and Konte, Bettina and Mai, Son T and Plant, Claudia",
  publisher = "ACM",
  pages     = "213--222",
  year      = 2014,
  isbn      = "9781450329569",
  doi       = "10.1145/2623330.2623652"
}

@inproceedings{Bohm2006-ts,
  title     = "Robust information-theoretic clustering",
  booktitle = "Proceedings of the 12th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining",
  author    = "B{\"{o}}hm, Christian and Faloutsos, Christos and Pan, Jia-Yu and Plant, Claudia",
  publisher = "ACM",
  pages     = "65--75",
  year      = 2006,
  isbn      = "9781595933393",
  doi       = "10.1145/1150402.1150414"
}

@inproceedings{Faivishevsky2010,
  title     = "Nonparametric information theoretic clustering algorithm",
  booktitle = "Proceedings of the 27th International Conference on Machine Learning ({ICML-10})",
  author    = "Faivishevsky, Lev and Goldberger, Jacob",
  pages     = "351--358",
  year      = 2010
}

@incollection{Bohm2010-mixed,
  title     = "Integrative {Parameter-Free} Clustering of Data with Mixed Type Attributes",
  booktitle = "Advances in Knowledge Discovery and Data Mining",
  author    = "B{\"{o}}hm, Christian and Goebl, Sebastian and Oswald, Annahita and Plant, Claudia and Plavinski, Michael and Wackersreuther, Bianca",
  publisher = "Springer Berlin Heidelberg",
  series    = "Lecture Notes in Computer Science",
  pages     = "38--47",
  year      = 2010,
  isbn      = "9783642136566, 9783642136573",
  doi       = "10.1007/978-3-642-13657-3\_7"
}

@inproceedings{Plant2012,
  title     = "Dependency clustering across measurement scales",
  booktitle = "Proceedings of the 18th {ACM} {SIGKDD} international conference on Knowledge discovery and data mining",
  author    = "Plant, Claudia",
  publisher = "ACM",
  pages     = "361--369",
  year      = 2012,
  isbn      = "9781450314626",
  doi       = "10.1145/2339530.2339589"
}

@inproceedings{Chakrabarti2004,
  title     = "Fully automatic cross-associations",
  booktitle = "Proceedings of the tenth {ACM} {SIGKDD} international conference on Knowledge discovery and data mining",
  author    = "Chakrabarti, Deepayan and Papadimitriou, Spiros and Modha, Dharmendra S and Faloutsos, Christos",
  publisher = "ACM",
  pages     = "79--88",
  year      = 2004,
  isbn      = "9781581138887",
  doi       = "10.1145/1014052.1014064"
}

@inproceedings{Feng2012,
  title     = "Summarization-based Mining Bipartite Graphs",
  booktitle = "Proceedings of the 18th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining",
  author    = "Feng, Jing and He, Xiao and Konte, Bettina and B{\"{o}}hm, Christian and Plant, Claudia",
  publisher = "ACM",
  pages     = "1249--1257",
  year      = 2012,
  isbn      = "9781450314626",
  doi       = "10.1145/2339530.2339725"
}

@inproceedings{Ketkar2005,
  title     = "Subdue: Compression-based Frequent Pattern Discovery in Graph Data",
  booktitle = "Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations",
  author    = "Ketkar, Nikhil S and Holder, Lawrence B and Cook, Diane J",
  publisher = "ACM",
  pages     = "71--76",
  year      = 2005,
  isbn      = "9781595932105",
  doi       = "10.1145/1133905.1133915"
}

@inproceedings{Koutra2014,
  title     = "{VOG}: Summarizing and Understanding Large Graphs",
  booktitle = "Proceedings of the 2014 {SIAM} International Conference on Data Mining",
  author    = "Koutra, Danai and Kang, U and Vreeken, Jilles and Faloutsos, Christos",
  pages     = "91--99",
  year      = 2014,
  doi       = "10.1137/1.9781611973440.11"
}

@inproceedings{Akoglu2013,
  title     = "Mining Connection Pathways for Marked Nodes in Large Graphs",
  booktitle = "{SDM}",
  author    = "Akoglu, Leman and Chau, Duen Horng and Faloutsos, Christos and Tatti, Nikolaj and Tong, Hanghang and Vreeken, Jilles",
  publisher = "SIAM",
  pages     = "37--45",
  year      = 2013,
  doi       = "10.1137/1.9781611972832.5"
}

@inproceedings{ZelnikManor2004,
  title     = "Self-tuning spectral clustering",
  booktitle = "Advances in neural information processing systems",
  author    = "Zelnik-Manor, Lihi and Perona, Pietro",
  pages     = "1601--1608",
  year      = 2004
}

@book{Golub1996,
  title     = "Matrix Computations",
  author    = "Golub, Gene Howard and Van Loan, Charles F",
  publisher = "Johns Hopkins University Press",
  year      = 1996,
  isbn      = "9780801854132"
}