Skip to main content

Provable and efficient data clustering via matrix factorization

A Prime Video paper discusses one of the fundamental clustering techniques in machine learning, examining its key performance factors, and revealing the underlying mechanism used to find the clusters.

A core machine learning (ML) problem is data clustering, which is when the given data is divided into a set of clusters in which the data points share common or similar features. Data clustering is considered an unsupervised learning problem because we don’t have any ground-truth label.

Being closed-form is a desirable feature in an ML algorithm. This is because the algorithm can quickly solve a given task without needing to run an iterative procedure. Most closed-form algorithms underperform complex approaches but a matrix factorization-based clustering (MFC) method can significantly outperform the other approaches while also running much faster.

In our Clustering a union of linear subspaces via matrix factorization and innovation search paper accepted to the conference on Uncertainty in Artificial Intelligence (UAI 2022), we provide the first deep analysis of this algorithm, explaining why the algorithm performs better in many scenarios, and reveal that some of the state-of-the-art clustering methods can be simplified into simple matrix factorization-based methods.

Clustering a union of manifolds via matrix factorization

In data clustering, we often assume that data points are clustered around some cluster centers and that the clusters are spatially separated from each other. However, in some applications this assumption does not hold and the clusters correspond to different manifolds in which the data points lie on them. These manifolds can be close to each other and this leads to challenging clustering problems in which many of clustering methods fail to distinguish the clusters.

MFC is an algorithm that can handle these challenging scenarios, particularly when each manifold has a linear structure that is a proper assumption in many applications. Another method that has been shown to perform well in these scenarios is Innovation Pursuit (iPursuit), which relies on solving an optimization problem to find out the connection between the data points. Innovation Pursuit yields state-of-the-art results when the cluster manifolds are close to each other. However, solving its optimization problem is computationally expensive. In our Clustering a union of linear subspaces via matrix factorization and innovation search paper, we illustrate the interesting connection between MFC and iPursuit, proving that they use a similar mechanism to cluster the data. In addition, we present a novel analysis that is applicable to both methods and paves the way to speed up the iPursuit algorithm.

A deep analysis for MFC and iPursuit

Although MFC is a core clustering algorithm and one of very few closed-form approaches, there has not been a deep understanding of the algorithm that explains the real mechanism it employs to cluster the data and its key performance factors. To derive an analysis of the algorithm, we assume that the data lies on a union of linear manifolds (either hyperplane or subspaces). Additionally, we let the span or space of the clusters intersect with each other. This data model allows us to demonstrate the robustness of the algorithm against challenging clustering scenarios.

In our analysis, we established the conditions guaranteeing that the true connections (affinity scores) computed by the algorithm are sufficiently stronger than false connections, which are the connections between data points that are in different clusters. According to the presumed data model, the span or space of each cluster can be decomposed into two components. The first component is the intersection of the span of that cluster with the rest of clusters. The second component is the innovative or unique part of that cluster. Note that the innovative component of a cluster doesn’t lie in the span of any other cluster. Defining this innovative component allows us to reveal an interesting and surprising discovery.

In sharp contrast to previous theoretical results for other clustering methods that require the span of the clusters to be sufficiently incoherent with each other, we proved that MFC and iPursuit algorithms only require the innovative components of the clusters to be incoherent with each other.

This feature made our results stronger than former guarantees, because even if the spans of two clusters are extremely close to each other, their innovative components can be strongly incoherent with each other. For example, two linear manifolds can have a high dimension of intersection but their innovative components could be orthogonal to each other. Therefore, we proved that MFC and iPursuit can handle challenging clustering scenarios while most other clustering method such as K-means fail.

Another research discovery that we presented in our Clustering a union of linear subspaces via matrix factorization and innovation search paper was that MFC and iPursuit employ similar mechanisms to construct the affinity graph. We proved that if we simply change the L1-norm based innovation search optimization problem used in iPursuit to an L2-norm, then iPursuit will be equivalent to MFC. This discovery not only explains why they yield a similar performance, but also shows a way to speed up the iPursuit method. Because they both compute the optimal directions of innovation, if we initialize the iPursuit algorithm with the directions computed by MFC (which is computed in a closed-form way), the optimizer of iPursuit converges much faster.

Conclusion

Clustering is a fundamental problem in data science and AI. It helps us to identify the important structures in the data and divide the data into meaningful or insightful segments. In many scenarios, it’s incorrect to assume that clusters are several clouds of data points around some cluster centers. MFC is a unique clustering method that is closed-form and can handle sophisticated clustering tasks, such as clustering a union of manifolds.

In our Clustering a union of linear subspaces via matrix factorization and innovation search paper, we provide a comprehensive theoretical analysis of the MFC algorithm that sheds light on the actual mechanism it employs to find the clusters in the data. We prove that iPursuit, a state-of-the-art clustering technique, is equivalent to MFC and we derived novel theoretical guarantees that were applicable to both methods.

Surprisingly, we showed that what matters to MFC and iPursuit is not the incoherency between the span of clusters but the incoherency between the innovative component of the clusters. The theoretical results provided a clear explanation for the robustness of the MFC and iPursuit methods and it paved the way to speed up the iPursuit method via utilizing its theoretical connection with MFC.

For more information and our full paper, see the Clustering a union of linear subspaces via matrix factorization and innovation search paper on the Amazon Science website.

Stay tuned for more from us!

Applied Scientist – Prime Video