Schema Inference for Massive JSON Datasets

Mohamed-Amine Baazizi; Houssem Ben Lahmar; Dario Colazzo; Giorgio Ghelli; Carlo Sartiani

doi:10.5441/002/edbt.2017.21

Conference paper, Year: 2017


Mohamed-Amine Baazizi

Role: Corresponding author
PersonId : 13062
IdHAL : mohamed-amine-baazizi
IdRef : 162548923

Bases de Données

Houssem Ben Lahmar

Role: Author
PersonId : 1004298

Université de Stuttgart

Dario Colazzo

Role: Author
PersonId : 840508
ORCID : 0000-0002-6031-0049
IdRef : 156469073

Université Paris Sciences et Lettres

Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision

Giorgio Ghelli

Role: Author
PersonId : 1004291

Dipartimento di Informatica [Pisa]

Carlo Sartiani

Role: Author
PersonId : 1095200

Dipartimento di Matematica e Informatica

Abstract

Recent years have seen the widespread use of JSON as a data format to represent massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
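
To make the idea of inferring a schema from schemaless JSON values concrete, here is a minimal sketch in plain Python. It is not the type language or the algorithm of the paper: the dictionary-based type representation and the names `infer_type`, `fuse`, `fuse_same_kind` and `infer_schema` are assumptions made for illustration only. It types each value independently, then merges the resulting types pairwise, a shape that would fit a map/reduce engine such as Spark because the merge does not depend on the order in which values arrive.

```python
from functools import reduce

# Assumed representation: a type is a dict from kind name to payload, read as a
# union of kinds. {"Num": True, "Str": True} means "number or string".
# The empty dict {} is the empty type and the identity of fuse().


def infer_type(value):
    """Map step: compute the structural type of one parsed JSON value."""
    if value is None:
        return {"Null": True}
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return {"Bool": True}
    if isinstance(value, (int, float)):
        return {"Num": True}
    if isinstance(value, str):
        return {"Str": True}
    if isinstance(value, list):
        # An array is typed by the fusion of its element types.
        return {"Array": reduce(fuse, map(infer_type, value), {})}
    if isinstance(value, dict):
        return {"Record": {name: {"type": infer_type(v), "optional": False}
                           for name, v in value.items()}}
    raise TypeError(f"not a JSON value: {value!r}")


def fuse(t1, t2):
    """Reduce step: merge two types, keeping one entry per kind."""
    out = {}
    for kind in t1.keys() | t2.keys():
        if kind in t1 and kind in t2:
            out[kind] = fuse_same_kind(kind, t1[kind], t2[kind])
        else:
            out[kind] = t1[kind] if kind in t1 else t2[kind]
    return out


def fuse_same_kind(kind, a, b):
    """Merge two payloads of the same kind."""
    if kind == "Array":
        return fuse(a, b)
    if kind == "Record":
        fields = {}
        for name in a.keys() | b.keys():
            if name in a and name in b:
                fields[name] = {
                    "type": fuse(a[name]["type"], b[name]["type"]),
                    "optional": a[name]["optional"] or b[name]["optional"],
                }
            else:
                # A field seen on one side only becomes optional.
                present = a[name] if name in a else b[name]
                fields[name] = {"type": present["type"], "optional": True}
        return fields
    return True  # atomic kinds carry no payload


def infer_schema(values):
    """Type every value, then fuse all the types into one schema."""
    return reduce(fuse, map(infer_type, values), {})


docs = [{"a": 1, "b": "x"}, {"a": "y"}]
schema = infer_schema(docs)
print(schema)
```

In this sketch, field `a` ends up typed as a union of number and string, while `b` is marked optional because it is missing from the second record. How much irregularity to keep and how much to collapse is the kind of precision versus conciseness trade-off the abstract refers to; the choices made here are arbitrary.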

Keywords

JSON schema inference

Domains

Databases [cs.DB]; Computation and Language [cs.CL]

Main file

paper-62.pdf (315.5 KB)

Origin: Files produced by the author(s)

https://hal.sorbonne-universite.fr/hal-01491765

Submitted on: Friday, March 17, 2017, 12:55:01

Last modified on: Sunday, November 17, 2024, 14:42:04

Long-term archiving on: Sunday, June 18, 2017, 13:16:08

Dates and versions

hal-01491765 , version 1 (17-03-2017)

Identifiers

HAL Id : hal-01491765 , version 1
DOI : 10.5441/002/edbt.2017.21

Cite

Mohamed-Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani. Schema Inference for Massive JSON Datasets. Extending Database Technology (EDBT), Mar 2017, Venice, Italy. ⟨10.5441/002/edbt.2017.21⟩. ⟨hal-01491765⟩

Collections

UPMC CNRS UNIV-DAUPHINE LIP6 LAMSADE-DAUPHINE PSL SORBONNE-UNIVERSITE SU-SCIENCES

1888 views

1429 downloads
