Towards Optimal Lower Bounds for k-median and k-means Coresets
Abstract
Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power of z of every data point to its closest center is minimized. Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2). The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1 ± ε) factor, hence reducing from large to small scale the input to the problem. While there has been an intensive effort to understand what is the best coreset size possible for both problems in various metric spaces, there is still a significant gap between the stateof-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely:
• In finite n point general metrics, any coreset must consist of Ω(k log n/ε^2) points. This improves on the Ω(k log n/ε) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors.
• For doubling metrics with doubling constant D, any coreset must consist of Ω(kD/ε^2) points. This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors.
• In d-dimensional Euclidean space, any coreset for (k, z) clustering requires Ω(k/ε^2) points. This improves on the Ω(k/ √ ε) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the Ω(k min(d, 2 z/20)) lower bound of Huang and Vishnoi [STOC'20]. We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size Õ(k/ε^2 • min(ε^{−z} , k)). This improves over the Õ(k^2 ε^{−4}) upper bound for general power of z proposed by Braverman Jiang, Krauthgamer, and Wu [SODA'21] and over the Õ(k/ε^4) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the ε^{−2} • min(d, ε^{−2}) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining based analysis that may be of independent interest. Together our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(ε^{−1} polylog k/ε).
Domains
Computer Science [cs]Origin | Files produced by the author(s) |
---|