Viet-Anh on Software Logo

What is: Large-scale spectral clustering?

SourceDivide-and-conquer based Large-Scale Spectral Clustering
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

Spectral Clustering

Spectral clustering aims to partition the data points into kk clusters using the spectrum of the graph Laplacians Given a dataset XX with NN data points, spectral clustering algorithm first constructs similarity matrix W{W}, where wij{w_{ij}} indicates the similarity between data points xix_i and xjx_j via a similarity measure metric.

Let L=DWL=D-W, where LL is called graph Laplacian and D{D} is a diagonal matrix with dii=j=1nwijd_{ii} = \sum_ {j=1}^n w_{ij}. The objective function of spectral clustering can be formulated based on the graph Laplacian as follow: \begin{equation} \label{eq:SC_obj} {\max_{{U}} \operatorname{tr}\left({U}^{T} {L} {U}\right)}, \ {\text { s.t. } \quad {U}^{T} {{U}={I}}}, \end{equation} where tr(⋅)\operatorname{tr(\cdot)} denotes the trace norm of a matrix. The rows of matrix U{U} are the low dimensional embedding of the original data points. Generally, spectral clustering computes U{U} as the bottom kk eigenvectors of L{L}, and finally applies kk-means on U{U} to obtain the clustering results.

Large-scale Spectral Clustering

To capture the relationship between all data points in XX, an N×NN\times N similarity matrix is needed to be constructed in conventional spectral clustering, which costs O(N2d)O(N^2d) time and O(N2)O(N^2) memory and is not feasible for large-scale clustering tasks. Instead of a full similarity matrix, many accelerated spectral clustering methods are using a similarity sub-matrix to represent each data points by the cross-similarity between data points and a set of representative data points (i.e., landmarks) via some similarity measures, as \begin{equation} \label{eq: cross-similarity} B = \Phi(X,R), \end{equation} where R={r1,r2,,rp}R = \{r_1,r_2,\dots, r_p \} (pNp \ll N) is a set of landmarks with the same dimension to XX, Φ()\Phi(\cdot) indicate a similarity measure metric, and BRN×pB\in \mathbb{R}^{N\times p} is the similarity sub-matrix to represent the XRN×dX \in \mathbb{R}^{N\times d} with respect to the RRp×dR\in \mathbb{R}^{p\times d}.

For large-scale spectral clustering using such similarity matrix, a symmetric similarity matrix WW can be designed as \begin{equation} \label{eq: WusedB } W=\left[\begin{array}{ll} \mathbf{0} & B ; \ B^{T} & \mathbf{0} \end{array}\right]. \end{equation} The size of matrix WW is (N+p)×(N+p)(N+p)\times (N+p). Taking the advantage of the bipartite structure, some fast eigen-decomposition methods can then be used to obtain the spectral embedding. Finally, kk-means is conducted on the embedding to obtain clustering results.

The clustering result is directly related to the quality of BB that consists of the similarities between data points and landmarks. Thus, the performance of landmark selection is crucial to the clustering result.