Towards Optimal Lower Bounds for k-median and k-means Coresets

Vincent Cohen-Addad; Kasper Green Larsen; David Saulpic; Chris Schwiegelshohn

doi:10.1145/3519935.3519946

Communication Dans Un Congrès Année : 2022

Towards Optimal Lower Bounds for k-median and k-means Coresets

(1) , (2) , (3) , (2)

1
2
3

Vincent Cohen-Addad

Fonction : Auteur

Google Research [Zurich]

Kasper Green Larsen

Fonction : Auteur

Aarhus University [Aarhus]

David Saulpic

Fonction : Auteur

Recherche Opérationnelle

Chris Schwiegelshohn

Fonction : Auteur

Aarhus University [Aarhus]

Résumé

Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power of z of every data point to its closest center is minimized. Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2). The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1 ± ε) factor, hence reducing from large to small scale the input to the problem. While there has been an intensive effort to understand what is the best coreset size possible for both problems in various metric spaces, there is still a significant gap between the stateof-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely: • In finite n point general metrics, any coreset must consist of Ω(k log n/ε^2) points. This improves on the Ω(k log n/ε) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors. • For doubling metrics with doubling constant D, any coreset must consist of Ω(kD/ε^2) points. This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors. • In d-dimensional Euclidean space, any coreset for (k, z) clustering requires Ω(k/ε^2) points. This improves on the Ω(k/ √ ε) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the Ω(k min(d, 2 z/20)) lower bound of Huang and Vishnoi [STOC'20]. We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size Õ(k/ε^2 • min(ε^{−z} , k)). This improves over the Õ(k^2 ε^{−4}) upper bound for general power of z proposed by Braverman Jiang, Krauthgamer, and Wu [SODA'21] and over the Õ(k/ε^4) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the ε^{−2} • min(d, ε^{−2}) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining based analysis that may be of independent interest. Together our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(ε^{−1} polylog k/ε).

Mots clés

lower-bound sketch Clustering coreset k-median k-means

Domaines

Informatique [cs]

Fichier principal

2202.12793.pdf (765.04 Ko)

Origine	Fichiers produits par l'(les) auteur(s)

David Saulpic : Connectez-vous pour contacter le contributeur

https://hal.sorbonne-universite.fr/hal-03944755

Soumis le : mercredi 18 janvier 2023-11:08:52

Dernière modification le : samedi 7 octobre 2023-21:36:22

Dates et versions

hal-03944755 , version 1 (18-01-2023)

Identifiants

HAL Id : hal-03944755 , version 1
ARXIV : 2202.12793
DOI : 10.1145/3519935.3519946

Citer

Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn. Towards Optimal Lower Bounds for k-median and k-means Coresets. Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, Jun 2022, Rome, Italy. pp.1038-1051, ⟨10.1145/3519935.3519946⟩. ⟨hal-03944755⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

CNRS LIP6 SORBONNE-UNIVERSITE SU-SCIENCES ANR

14 Consultations

17 Téléchargements

Towards Optimal Lower Bounds for k-median and k-means Coresets

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Altmetric

Partager