Schema Inference for Massive JSON Datasets

Abstract: Recent years have seen the widespread use of JSON as a data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly identify structural properties that would help them formulate correct queries, and many schema-based optimizations are not possible. In this paper we address the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language that is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about the input data. We then present our main contribution: the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, which enables reasonable schema inference times for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of the inferred schemas, and scalability.
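To make the idea of schema inference concrete, here is a minimal sketch in plain Python of one plausible shape such an algorithm could take: infer a structural type for each JSON value, then fuse the types pairwise. It is not the algorithm or the type language defined in the paper. The names `infer_type`, `fuse`, `fuse_fields`, and `infer_schema`, and the dictionary-based representation of types, are assumptions made for illustration only.

```python
import json
from functools import reduce

# Hypothetical type representation (an assumption, not the paper's language):
# a type is a dict of alternatives, e.g. {"Num": True, "Null": True} is a union.
BASIC = {bool: "Bool", int: "Num", float: "Num", str: "Str", type(None): "Null"}

def infer_type(value):
    """Derive a structural type from one parsed JSON value."""
    if isinstance(value, dict):
        fields = {k: {"type": infer_type(v), "optional": False}
                  for k, v in value.items()}
        return {"record": fields}
    if isinstance(value, list):
        # All element types are fused into a single element type.
        return {"array": reduce(fuse, map(infer_type, value), {})}
    return {BASIC[type(value)]: True}

def fuse_fields(f1, f2):
    """Merge two record field maps; a field missing on one side becomes optional."""
    out = {}
    for name in f1.keys() | f2.keys():
        if name in f1 and name in f2:
            out[name] = {
                "type": fuse(f1[name]["type"], f2[name]["type"]),
                "optional": f1[name]["optional"] or f2[name]["optional"],
            }
        else:
            present = f1[name] if name in f1 else f2[name]
            out[name] = {"type": present["type"], "optional": True}
    return out

def fuse(t1, t2):
    """Merge two types into one that describes the values of both."""
    fused = {}
    for kind in t1.keys() | t2.keys():
        if kind in t1 and kind in t2:
            if kind == "record":
                fused[kind] = fuse_fields(t1[kind], t2[kind])
            elif kind == "array":
                fused[kind] = fuse(t1[kind], t2[kind])
            else:
                fused[kind] = True
        else:
            fused[kind] = t1[kind] if kind in t1 else t2[kind]
    return fused

def infer_schema(lines):
    """Infer one type per JSON line, then fuse them all ({} is the empty union)."""
    return reduce(fuse, (infer_type(json.loads(line)) for line in lines), {})

if __name__ == "__main__":
    sample = ['{"a": 1, "b": "x"}', '{"a": null, "c": [1, "y"]}']
    print(json.dumps(infer_schema(sample), indent=2, sort_keys=True))
```

In this sketch `fuse` is written so that the order in which types are combined does not change the result, which is the property that would let the same two steps run as a map followed by a reduce on a distributed engine such as Spark. Whether the paper's implementation is organized this way is not stated in the abstract.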
Document type:
Conference paper
Extending Database Technology (EDBT), Mar 2017, Venice, Italy. 〈10.5441/002/edbt.2017.21〉

https://hal.sorbonne-universite.fr/hal-01491765
Contributor: Mohamed-Amine Baazizi
Submitted on: Friday, March 17, 2017 - 12:55:01
Last modified on: Friday, August 31, 2018 - 09:25:56
Document(s) archived on: Sunday, June 18, 2017 - 13:16:08

File

paper-62.pdf
Files produced by the author(s)

Collections

UPMC | PSL | LIP6

Citation

Mohamed-Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani. Schema Inference for Massive JSON Datasets. Extending Database Technology (EDBT), Mar 2017, Venice, Italy. 〈10.5441/002/edbt.2017.21〉. 〈hal-01491765〉
