What is: Large-scale spectral clustering?
Source | Divide-and-conquer based Large-Scale Spectral Clustering |
Year | 2000 |
Data Source | CC BY-SA - https://paperswithcode.com |
Spectral Clustering
Spectral clustering aims to partition $n$ data points $X=\{x_{1}, \ldots, x_{n}\}$ into clusters using the spectrum of the graph Laplacian. Given the dataset, a spectral clustering algorithm first constructs a similarity matrix $W \in \mathbb{R}^{n \times n}$, where $W_{ij}$ indicates the similarity between data points $x_i$ and $x_j$ under a chosen similarity metric.
Let $L=D^{-1/2} W D^{-1/2}$, where $L$ is called the graph Laplacian and $D$ is a diagonal matrix with $D_{ii}=\sum_{j} W_{ij}$. The objective function of spectral clustering can be formulated based on the graph Laplacian as follows: \begin{equation} \label{eq:SC_obj} {\max_{{U}} \operatorname{tr}\left({U}^{T} {L} {U}\right)}, \ {\text { s.t. } \quad {U}^{T} {{U}={I}}}, \end{equation} where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. The rows of the matrix $U \in \mathbb{R}^{n \times k}$ are the low-dimensional embeddings of the original data points. Generally, spectral clustering computes $U$ as the $k$ eigenvectors of $L$ associated with its largest eigenvalues (equivalently, the bottom eigenvectors of $I-L$), and finally applies $k$-means on $U$ to obtain the clustering results.
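The pipeline above (normalized Laplacian, eigendecomposition, embedding) can be sketched as follows. This is a minimal illustration, not the method of the cited paper; the Gaussian similarity and the toy two-cluster data are assumptions for the example.

```python
import numpy as np

def spectral_embedding(W, k):
    """Spectral embedding from a dense similarity matrix W (minimal sketch).

    Maximizing tr(U^T L U) with L = D^{-1/2} W D^{-1/2} subject to
    U^T U = I is solved by the top-k eigenvectors of L.
    """
    d = W.sum(axis=1)                                     # degrees D_ii
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)                  # ascending eigenvalues
    return eigvecs[:, -k:]                                # top-k eigenvectors of L

# Toy example: two well-separated blobs (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)       # pairwise squared distances
W = np.exp(-sq / 2.0)                                     # Gaussian similarity
U = spectral_embedding(W, 2)                              # rows: embedded points
```

In practice one would then run $k$-means on the rows of `U`; any standard implementation works at this point.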
Large-scale Spectral Clustering
To capture the relationships between all $n$ data points in $X$, an $n \times n$ similarity matrix $W$ must be constructed in conventional spectral clustering, which costs $O(n^2)$ time and memory and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods use a similarity sub-matrix that represents each data point by its cross-similarities to a small set of representative data points (i.e., landmarks): \begin{equation} \label{eq: cross-similarity} B = \Phi(X,R), \end{equation} where $R=\{r_{1}, \ldots, r_{p}\}$ ($p \ll n$) is a set of landmarks with the same dimension as $X$, $\Phi$ denotes a similarity metric, and $B \in \mathbb{R}^{n \times p}$ is the similarity sub-matrix that represents $X$ with respect to $R$.
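Constructing $B = \Phi(X, R)$ can be sketched in a few lines. Both choices below are assumptions for illustration: $\Phi$ is taken to be a Gaussian kernel, and the landmarks are drawn by uniform sampling from the data.

```python
import numpy as np

def cross_similarity(X, R, gamma=0.5):
    """B = Phi(X, R): the n x p cross-similarity between data X (n x d)
    and landmarks R (p x d). The Gaussian kernel here is one common
    choice of Phi, not the only one."""
    sq = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)   # n x p squared distances
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                             # n = 100 points in d = 3
R = X[rng.choice(100, size=10, replace=False)]            # p = 10 random landmarks
B = cross_similarity(X, R)                                # n x p sub-matrix
```

Note that $B$ costs only $O(np)$ memory instead of the $O(n^2)$ of the full $W$.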
For large-scale spectral clustering with such a sub-matrix, a symmetric similarity matrix can be designed as \begin{equation} \label{eq: WusedB } W=\left[\begin{array}{ll} \mathbf{0} & B \\ B^{T} & \mathbf{0} \end{array}\right]. \end{equation} The size of the matrix $W$ is $(n+p) \times (n+p)$. Taking advantage of its bipartite structure, fast eigen-decomposition methods can then be used to obtain the spectral embedding. Finally, $k$-means is conducted on the embedding to obtain the clustering results.
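One common way to exploit the bipartite structure, which I sketch here as an illustration (the source does not name a specific decomposition), is that the top eigenvectors of the normalized bipartite $W$ can be recovered from the singular value decomposition of the degree-normalized sub-matrix $D_1^{-1/2} B D_2^{-1/2}$, so the full $(n+p) \times (n+p)$ matrix is never formed:

```python
import numpy as np

def bipartite_embedding(B, k):
    """Spectral embedding of the bipartite matrix W = [[0, B], [B^T, 0]]
    without forming W: normalize B by the row degrees D1 = diag(B 1) and
    column degrees D2 = diag(B^T 1), then take the top-k left singular
    vectors. Costs O(n p^2) instead of O((n+p)^3).
    (Sketch; assumes B >= 0 with no all-zero rows or columns.)"""
    d1 = B.sum(axis=1)
    d2 = B.sum(axis=0)
    B_norm = B / np.sqrt(np.maximum(d1, 1e-12))[:, None]
    B_norm = B_norm / np.sqrt(np.maximum(d2, 1e-12))[None, :]
    U, s, Vt = np.linalg.svd(B_norm, full_matrices=False)
    return U[:, :k]          # embedding rows for the n original data points

# Toy data and landmarks (assumed, for illustration only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
R = np.array([[0.0, 0.0], [5.0, 5.0]])                    # p = 2 landmarks
sq = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
B = np.exp(-sq)                                           # Gaussian cross-similarity
E = bipartite_embedding(B, 2)                             # n x k embedding
```

Running $k$-means on the rows of `E` then yields the final cluster assignments.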
The clustering result depends directly on the quality of $B$, which consists of the similarities between the data points and the landmarks. Thus, landmark selection is crucial to the clustering performance.
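As a concrete example of landmark selection, one widely used alternative to uniform sampling, which I add here as a hedged illustration rather than as the method of the cited paper, is to run a few $k$-means iterations and use the resulting centers as landmarks, since they tend to cover the data distribution better than random points:

```python
import numpy as np

def kmeans_landmarks(X, p, iters=20, seed=0):
    """Select p landmarks as k-means centers (Lloyd iterations).
    This selector is a common heuristic, not taken from the source."""
    rng = np.random.default_rng(seed)
    R = X[rng.choice(len(X), size=p, replace=False)].copy()  # random init
    for _ in range(iters):
        # assign each point to its nearest current landmark
        d = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each landmark to the mean of its assigned points
        for j in range(p):
            pts = X[labels == j]
            if len(pts):
                R[j] = pts.mean(axis=0)
    return R

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
R = kmeans_landmarks(X, 8)     # p = 8 landmarks in the data's dimension
```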