More AI-Ready Metadata, and More Open Metadata
The history of metadata is rich and long, but a lot of the promise of metadata has never truly been unlocked. Conveying meaning about data to humans via business glossaries and data dictionaries has often been neglected, and when needs arise, ad hoc reverse engineering is used to try to understand data. Data lineage, whether at a technical level (tables, columns) or a business level (datasets, data products), is often undertaken only when there is a driving regulatory need, and is documentary in nature. Metadata engineering, an area where I have a lot of experience, can achieve phenomenal results in terms of code generation, but seems to have almost no adoption.
And yet this seems poised to change. The era of AI has taken us to an inflection point where not only the need for metadata, but also attitudes towards metadata, are changing radically. I recently had the opportunity to interact with Suresh Srinivas and Harsha Chintalapani, co-founders of Collate (getcollate.io), on an episode of The Briefing Room hosted by Eric Kavanagh of DM Radio to discuss what this inflection point means, and how the industry needs to react to it.
The Mosaic of Metadata
How we define “metadata” is important. Often it is just said to be “data about data”. However, I define it as:
the data that describes all aspects of an enterprise’s information assets and enables the enterprise to effectively use and manage these assets.
It is the usage and management needs that drive what metadata is maintained by an enterprise. Metadata is not passive documentation that exists for its own sake. This brings us to the fact that metadata is an open-ended concept. There are many different kinds of metadata, and new kinds come into existence all the time. For instance, descriptions of tables and columns have existed for decades, but only with the advent of the Internet did we see URLs emerge as a new class of metadata.
Not only is there an ever-expanding mosaic of metadata, but there are important relationships between different kinds of metadata. For example, data quality business rules need access to business glossaries and data dictionaries to get exact names and definitions.
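The dependency described above, where a data quality rule needs the glossary and the data dictionary, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class names, field names, and sample entries are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    name: str
    definition: str


@dataclass(frozen=True)
class ColumnEntry:
    table: str
    column: str
    term: str  # name of the glossary term this column represents


@dataclass(frozen=True)
class QualityRule:
    term: str         # the business term the rule is written against
    description: str  # the rule itself, stated in business language


def resolve_rule(rule, glossary, dictionary):
    """Bind a business-level rule to exact definitions and physical columns."""
    # Raises KeyError if the rule refers to a term the glossary does not define.
    term = glossary[rule.term]
    columns = [
        f"{entry.table}.{entry.column}"
        for entry in dictionary
        if entry.term == rule.term
    ]
    return {
        "rule": rule.description,
        "definition": term.definition,
        "columns": columns,
    }
```

If the glossary and the dictionary lived in separate products with no shared layer, a lookup like this would not be possible without custom integration work.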
Ultimately, this means that the mosaic of the different kinds of metadata must exist in a single layer that is managed as a whole. It cannot be fragmented with different pieces existing only in specific technical products.
I was very happy to hear from Suresh and Harsha that this is the vision for their company too. Their perspective is that in the past metadata has existed to satisfy human needs, but now metadata has to satisfy the requirements of AI – and this is the core of the inflection point we are witnessing.
The OpenMetadata initiative, in which Collate is heavily involved, is attempting to address this need. It is an open source effort to build the single metadata layer as a platform. OpenMetadata connects to warehouses, databases, BI tools, pipelines, etc., and builds a centralized metadata graph. It includes connectors to collect metadata, and covers data discovery, data lineage, data quality, data observability, and data governance.
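To make the idea of a centralized metadata graph concrete, here is a minimal Python sketch of lineage held as a graph and traversed upstream. It is a generic illustration, not OpenMetadata's API or data model: the MetadataGraph class, its methods, and the asset names are all hypothetical.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy lineage graph: edges point from a source asset to a target asset."""

    def __init__(self):
        # Maps each asset to the set of assets that feed it directly.
        self._upstream = defaultdict(set)

    def add_lineage(self, source, target):
        self._upstream[target].add(source)

    def upstream_of(self, asset):
        """Return every asset that feeds `asset`, directly or indirectly."""
        seen = set()
        queue = deque([asset])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for assets with no parents.
            for parent in self._upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen
```

Once lineage from every connected tool lands in one structure like this, a question such as "what feeds this dashboard?" becomes a single traversal instead of an investigation across several products.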
My experience in building such a layer for specific projects has been that it also enables integration of metadata, which can even be fed back to sources such as data catalogs to enrich them. Like a data warehouse, it also permits historical analysis of metadata, so that trends can be reported on.
If we think about where this is going, then we can see that the inflection point may be more significant than it appears at first glance.
Context and Meaning
Returning to the needs of AI, we need to understand that metadata that only provides context is not going to be enough. What is missing is meaning.
Humans can take metadata that provides context and investigate further to get a complete picture of what they need. This is an unfortunate reality given the traditional approach to metadata, which has resulted in poor naming conventions, poor definitions, and patchy coverage of information assets.
However, this will not do for AI. If AI is told something is a Customer table, then it will treat it like that and use the data. But what if that table is a backup from a year ago, or something in a development environment with only test data, or simply junk someone forgot to delete? AI only understands what it has been told and will simply think this is a Customer table. Even if it is a good production Customer table, what is a “Customer” and what if in a Marketing context “Customer” legitimately includes “Prospect”?
What this means is that the traditional structural metadata is not sufficient for AI. It needs meaning, that is, semantics. This semantic layer has to sit over all of the other metadata in the metadata layer as it informs all of it. It is this kind of architecture that AI needs.
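The Customer table problem can be sketched in Python. This is a minimal illustration under my own assumptions: the TableMetadata fields, the sample definitions, and the function names are hypothetical and are meant only to show the kind of information an AI would need beyond a table name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    name: str
    environment: str          # e.g. "production" or "development"
    is_backup: bool
    contains_test_data: bool


def trusted_tables(tables):
    """Keep only tables an AI should be allowed to treat as real data."""
    return [
        t.name
        for t in tables
        if t.environment == "production"
        and not t.is_backup
        and not t.contains_test_data
    ]


# Hypothetical definitions: the same term can legitimately differ by context.
DEFINITIONS = {
    ("Customer", "Sales"): "A party that has purchased at least one product.",
    ("Customer", "Marketing"): "A party that has purchased a product, or a prospect.",
}


def meaning_of(term, context):
    """Look up what a business term means in a given business context."""
    return DEFINITIONS[(term, context)]
```

Structural metadata alone would report three tables named Customer as equivalent. It is the additional attributes and the context-specific definitions that let an AI pick the right table and interpret it correctly.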
In our conversation, Suresh and Harsha emphasized the importance of meaning and how it ties into all metadata. They also pointed out that Semantic Intelligence includes something more. The concept they have is of AI agents and automation that use the metadata layer (implemented as a semantic graph) to help add to the meaning by auto-documenting datasets, classifying sensitive data, enforcing governance policies, answering natural-language questions, and providing LLMs with trusted context so they reason accurately about the enterprise’s data. For me this begins to address some of the problems that exist with Data Governance where roles and responsibilities are assigned, but the scale of metadata documentation is just too massive for human effort alone. Automation of some kind is needed, although significant human effort will always be required. It seems that we may have an answer with AI not just consuming metadata, but producing it too.
It Matters Now
As organizations deploy LLMs, AI agents, and other AI applications that use enterprise data, those systems need more than just access to tables. Yes, access control and having AI align to the privileges of the user invoking it are extremely important. But fundamentally, AI needs to understand what the data means. Without something like the metadata platform with a semantic layer that we have been discussing, AI will revert to its generalized training and hallucinate or give inconsistent answers. Enterprises that understand this and create the required metadata infrastructure are going to have a huge advantage as the reach of AI expands.
