The K-means algorithm has long been a staple of machine learning and data mining, primarily because it clusters large-scale datasets effectively. However, traditional K-means treats all features alike and does not account for differences in their discriminative power. To address this, the paper proposes a clustering framework that adds L2-norm regularization on feature weights. The approach builds on the Weighted K-means (W-K-means) algorithm, and the regularization term balances feature importance so that the learned weights improve clustering outcomes.
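The summary does not state the objective function itself. As an illustration only, one plausible form of a weighted K-means objective with an L2 penalty on the feature weights is written below. The symbols are assumptions introduced here, not taken from the paper: u_{ik} is the membership of point i in cluster k, c_{kj} is the j-th coordinate of center k, w_j is the weight of feature j, and \lambda is the regularization parameter. The paper's actual formulation may differ.

```latex
\[
\min_{U,\,C,\,w}\;
\sum_{k=1}^{K}\sum_{i=1}^{n} u_{ik} \sum_{j=1}^{m} w_j\,(x_{ij}-c_{kj})^2
\;+\; \lambda \sum_{j=1}^{m} w_j^2
\quad\text{s.t.}\quad
\sum_{j=1}^{m} w_j = 1,\;\; w_j \ge 0,\;\;
u_{ik}\in\{0,1\},\;\; \sum_{k=1}^{K} u_{ik}=1 .
\]
```

Under this assumed form, a larger \lambda pushes the weights toward the uniform value 1/m, while a smaller \lambda lets the weight concentrate on the features with the lowest within-cluster dispersion.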
For numerical datasets, the framework introduces the l2-Wkmeans algorithm, which uses conventional means as cluster centers. For categorical datasets, it proposes two variants: l2-NOF (Non-numeric features based on different smoothing modes) and l2-NDM (Non-numeric features based on distance metrics). All of these methods rest on a modified clustering objective function, from which update rules for the cluster centers, membership matrices, and feature weights are derived.
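The alternating updates described above can be sketched for the numerical case as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes the objective is the weighted within-cluster squared distance plus `lam` times the squared L2 norm of the weights, with weights constrained to be non-negative and sum to one. Under that assumption the weight update reduces to a Euclidean projection onto the probability simplex. The function names `project_to_simplex` and `l2_weighted_kmeans` are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # values in descending order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]  # last index kept in the support
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def l2_weighted_kmeans(X, k, lam, n_iter=50, seed=0):
    """Sketch of K-means with L2-regularized feature weights (assumed objective).

    lam must be positive; it controls how strongly weights are pulled to uniform.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    w = np.full(m, 1.0 / m)                   # start from uniform feature weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Membership update: nearest center under the weighted squared distance.
        dist = (((X[:, None, :] - centers[None, :, :]) ** 2) * w).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Center update: per-cluster mean; an empty cluster keeps its old center.
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = X[mask].mean(axis=0)
        # Weight update: minimizing sum_j w_j * D_j + lam * sum_j w_j**2 on the
        # simplex is the projection of -D / (2 * lam) onto the simplex.
        D = ((X - centers[labels]) ** 2).sum(axis=0)
        w = project_to_simplex(-D / (2.0 * lam))
    return labels, centers, w


if __name__ == "__main__":
    demo_rng = np.random.default_rng(0)
    # Synthetic data: two blobs in three dimensions (made up for illustration).
    X_demo = np.vstack([demo_rng.normal(0, 0.1, (30, 3)),
                        demo_rng.normal(5, 0.1, (30, 3))])
    _, _, w_demo = l2_weighted_kmeans(X_demo, k=2, lam=1.0)
    print("feature weights:", np.round(w_demo, 3))
```

The sketch covers only numerical data with mean-based centers. The categorical variants would need different center and distance definitions, which this summary does not specify.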
Extensive experiments show that the proposed algorithms perform better on both numerical and categorical datasets. Their advantages include improved clustering accuracy, robustness to noisy data, and adaptability to high-dimensional data. These results suggest that L2-norm regularization of feature weights substantially improves the clustering quality of K-means, especially on complex, high-dimensional datasets. The study also examines how the regularization parameter affects clustering performance and offers practical guidance on tuning it, so that users can choose a regularization strength suited to the task and to the characteristics of the data.
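To make the role of the regularization parameter concrete, the sketch below shows how feature weights change with its value. It rests on an assumption that is not taken from the paper: the weights minimize the sum of w_j * D_j plus `lam` times the sum of w_j squared, subject to non-negative weights that sum to one, where D_j is the within-cluster dispersion of feature j. The dispersion values and the helper names `project_to_simplex` and `feature_weights` are made up for illustration.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def feature_weights(D, lam):
    """Weights minimizing sum(w * D) + lam * sum(w**2) on the simplex (assumed form)."""
    return project_to_simplex(-np.asarray(D, dtype=float) / (2.0 * lam))


# Hypothetical per-feature dispersions: feature 0 is the most compact within clusters.
D = np.array([1.0, 2.0, 10.0])
for lam in (0.1, 1.0, 10.0, 1000.0):
    print(f"lam={lam:>7}: weights={np.round(feature_weights(D, lam), 3)}")
```

With these made-up dispersions, a small value such as `lam=0.1` places all weight on the most compact feature, `lam=10.0` spreads it as roughly 0.5, 0.45, and 0.05, and a very large value yields nearly uniform weights. In practice the parameter would be tuned on the data, for example by comparing a clustering quality index across a grid of values.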
The research offers a new way to improve K-means clustering: it accounts for feature importance through L2-norm regularization, which enhances both clustering power and generalizability. The method is most valuable for large-scale datasets and for scenarios that require fine-grained differentiation between features, and it marks a step forward in clustering quality for related research.