Towards Optimal Lower Bounds for k-median and k-means Coresets - Sorbonne Université
Communication Dans Un Congrès Année : 2022

Towards Optimal Lower Bounds for k-median and k-means Coresets

Résumé

Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power of z of every data point to its closest center is minimized. Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2). The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1 ± ε) factor, hence reducing from large to small scale the input to the problem. While there has been an intensive effort to understand what is the best coreset size possible for both problems in various metric spaces, there is still a significant gap between the stateof-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely: • In finite n point general metrics, any coreset must consist of Ω(k log n/ε^2) points. This improves on the Ω(k log n/ε) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors. • For doubling metrics with doubling constant D, any coreset must consist of Ω(kD/ε^2) points. This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors. • In d-dimensional Euclidean space, any coreset for (k, z) clustering requires Ω(k/ε^2) points. This improves on the Ω(k/ √ ε) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the Ω(k min(d, 2 z/20)) lower bound of Huang and Vishnoi [STOC'20]. We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size Õ(k/ε^2 • min(ε^{−z} , k)). This improves over the Õ(k^2 ε^{−4}) upper bound for general power of z proposed by Braverman Jiang, Krauthgamer, and Wu [SODA'21] and over the Õ(k/ε^4) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the ε^{−2} • min(d, ε^{−2}) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining based analysis that may be of independent interest. Together our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(ε^{−1} polylog k/ε).
Fichier principal
Vignette du fichier
2202.12793.pdf (765.04 Ko) Télécharger le fichier
Origine Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-03944755 , version 1 (18-01-2023)

Identifiants

Citer

Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn. Towards Optimal Lower Bounds for k-median and k-means Coresets. Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, Jun 2022, Rome, Italy. pp.1038-1051, ⟨10.1145/3519935.3519946⟩. ⟨hal-03944755⟩
33 Consultations
22 Téléchargements

Altmetric

Partager

More