Catalog vs. Table versioning: which is the best data versioning strategy?
In this article, our Technical Director, Jean Humann, shares his experience of choosing a data versioning method that enables continuous improvement of transformations without affecting production.
The data lakehouse takes the best of both worlds: on the one hand, the low-cost, open storage layer of the data lake; on the other, the data management features of the data warehouse, from ACID transactions to rollbacks. However, to achieve the latter in an environment such as S3, a layer of abstraction is essential so that compute engines understand how to query groups of raw data files as SQL tables.
At Cleyrop, our data lakehouse is based on a SQL engine and pure object storage, with data in Apache Parquet format. The abstraction layer is provided by Apache Iceberg, an open source table format originally developed at Netflix. Iceberg writes a metadata layer that helps engines understand the design and history of tables, as illustrated in the diagram opposite.
In other words, Iceberg enables simple SQL queries, data versioning, parallel reads and writes, and atomic changes. This table format is backed by a data catalog, such as JDBC, Hive or Nessie. When a table is queried via a SQL engine, the engine first calls this catalog to retrieve the correct Iceberg metadata references. Very schematically, the role of a catalog is to group tables together and make them discoverable by tools such as Spark or Dremio.
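At its simplest, then, a catalog is a mapping from table identifiers to the location of their current Iceberg metadata. A minimal sketch (names and paths are illustrative, not any real catalog's API):

```python
class Catalog:
    """Toy catalog: table name -> location of its current metadata file."""

    def __init__(self) -> None:
        self._tables: dict[str, str] = {}

    def register(self, name: str, metadata_location: str) -> None:
        self._tables[name] = metadata_location

    def load_table(self, name: str) -> str:
        # The engine calls this first, then fetches and parses the
        # Iceberg metadata at the returned location to plan its scan.
        return self._tables[name]

catalog = Catalog()
catalog.register("sales.orders",
                 "s3://lake/sales/orders/metadata/v3.metadata.json")
location = catalog.load_table("sales.orders")
```

Everything below the catalog (snapshots, manifests, data files) lives in object storage; the catalog only has to answer "where is the current metadata for this table?".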
Catalog-level versioning with Nessie
But where most catalogs confine themselves to this table discovery, Nessie goes a step further. It is one of the first open source catalogs to be transactional, preserving a commit history across all tables. In other words, it brings "git-like" semantics directly into the data lakehouse, at the catalog level: branches can be tagged, branched and merged, with access control rules attached not to the engine but to the catalog. Data engineers and data scientists can thus iterate on their data cleansing processes and run tests in complete safety, without risking any impact on a production branch.

However, this catalog-level versioning is not without its shortcomings. First and foremost, there is the problem of managing the data files behind the tables, typically when you no longer wish to keep the data of tables merged into main, or want to delete unnecessary data. Iceberg's PURGE option acts as a powerful garbage collector: it deletes all data linked to tables you no longer need. But this feature is deactivated with Nessie, since branching is performed at the catalog level, not the table level. As a result, the engine will no longer find a child table if the parent table has been deleted, and will return an error when you try to delete that child table's data, even though it is still stored in MinIO.

Workarounds exist, but they too quickly show their limits. At Cleyrop, we tested a first mechanism consisting of deleting the data directly in MinIO, with a Kubernetes job to reflect these changes in Nessie. However, fine-grained data management, particularly for orphan files, remains complex. What's more, this method risks undermining Nessie's ability to work with Iceberg.
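The catalog-level model can be sketched as follows: each branch is a mapping from table names to snapshot pointers, so a single commit or merge can atomically span several tables. This is a deliberately simplified illustration, not Nessie's API:

```python
class BranchingCatalog:
    """Toy Nessie-style catalog: branches map table names to snapshot ids."""

    def __init__(self) -> None:
        self.branches: dict[str, dict[str, int]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Branching copies only pointers, never data files.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch: str, table: str, snapshot_id: int) -> None:
        self.branches[branch][table] = snapshot_id

    def merge(self, source: str, target: str = "main") -> None:
        # Simplified fast-forward merge: the target branch adopts
        # every table pointer from the source branch at once.
        self.branches[target].update(self.branches[source])

cat = BranchingCatalog()
cat.commit("main", "orders", 1)
cat.create_branch("etl-test")
cat.commit("etl-test", "orders", 2)   # main still sees snapshot 1
cat.merge("etl-test")                 # now main sees snapshot 2
```

Note what this model makes visible: the catalog knows only pointers, which is precisely why per-table garbage collection (PURGE) becomes unsafe — deleting a table's files in storage leaves pointers on other branches dangling.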
Fine-grained table management with Iceberg 1.2.0
The solution came from Iceberg itself. From version 1.2.0 onwards, the tool added branch and tag management at the table level, eliminating the need to go through the catalog. Concretely, in Iceberg each version of a table, called a snapshot, can be tagged, branched or merged. It is thus possible to create a branch from a snapshot and tag it in order to carry out modifications, while main remains unchanged until the merge.

Cleyrop has chosen to implement this feature and to change data catalog, moving from Nessie to the more standard REST catalog pushed by the Iceberg team. The REST catalog connects more easily to other applications and clients, and can be called via a REST API. It uses the same git semantics and commit system as Nessie; only the commands change. Above all, Iceberg 1.2.0 goes one step further than Nessie, since branching is carried out at the table level, each table having its own git-like history. As a result, there is no need for workarounds to remove data files and orphans, since versioning is done per table rather than per catalog.

Conclusion: there is no such thing as good or bad versioning. Branching by catalog and branching by table both have their advantages. The former, supported by Nessie, enables multi-table transactions; the latter, natively supported by Iceberg, enables finer-grained data management, particularly for garbage collection. Why not get the best of both worlds by combining the two? It is technically feasible to manage versions at both the catalog and table level, but the double branching would make it complex for the user.
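The table-level mechanism discussed above can be sketched in the same toy style (class and method names are illustrative, not Iceberg's API): each table carries its own refs, so expiring snapshots and collecting orphaned data files is a purely local, per-table operation.

```python
class VersionedTable:
    """Toy Iceberg-style table: its own refs map branch/tag -> snapshot id."""

    def __init__(self) -> None:
        self.snapshots: dict[int, tuple[str, ...]] = {}
        self.refs: dict[str, int] = {}
        self._next_id = 1

    def commit(self, branch: str, data_files: tuple[str, ...]) -> int:
        sid = self._next_id
        self._next_id += 1
        self.snapshots[sid] = data_files
        self.refs[branch] = sid
        return sid

    def create_branch(self, name: str, source: str = "main") -> None:
        self.refs[name] = self.refs[source]

    def merge(self, source: str, target: str = "main") -> None:
        self.refs[target] = self.refs[source]

    def expire_unreferenced(self) -> set[str]:
        # Per-table garbage collection: drop every data file that no
        # branch or tag can reach any more, without asking the catalog.
        live = {f for sid in self.refs.values() for f in self.snapshots[sid]}
        all_files = {f for files in self.snapshots.values() for f in files}
        self.snapshots = {s: f for s, f in self.snapshots.items()
                          if s in self.refs.values()}
        return all_files - live

t = VersionedTable()
t.commit("main", ("a.parquet",))
t.create_branch("fix")
t.commit("fix", ("a-compacted.parquet",))  # rewrite on the branch
t.merge("fix")                             # main now points at snapshot 2
orphans = t.expire_unreferenced()          # a.parquet is no longer reachable
```

Because the refs live inside the table's own metadata, `expire_unreferenced` can safely reclaim files: no other table, and no catalog branch, can still be pointing at them.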
References:
- REST catalog technical specification: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml
- Nessie CLI: https://projectnessie.org/tools/cli/
- Iceberg 1.2.0 release: https://iceberg.apache.org/releases/#120-release