Dewey Decimal Proud: Why Catalogs Rock With Nathan Turajski, Ken Barth, And Tim Gasper
Making sense of data is the cornerstone of analytics. Data catalogs are an increasingly valuable tool for achieving this critical goal. Find out why on this episode of DM Radio, as host Eric Kavanagh interviews Nathan Turajski, a Senior Director at Informatica; Ken Barth, the CEO of Catalogic; and Tim Gasper, the VP of Product at data.world. Join in the conversation to learn more about data cataloging in machine learning, cyber resiliency, and democratization.
[00:00:46] Eric: Our topic is a good one. It’s going to help us understand more about the whole data world. In fact, this technology and this whole best practice are all geared around furthering the understanding of your data. The topic is a data catalog. We are going to talk about a couple of different kinds of data catalogs, the more modern versions that are all about understanding your corporate data, understanding definitions of different things like a customer, revenue or whatever the case may be.
Catalogs are also used in data security and for watching out for troublemakers who are out there trying to hack into our systems, which happens every day. The attacks are constant. I get reports all the time, and you’ve got to watch out for that. It’s all about data catalogs and how they can help us better understand the world around us. We have several great guests. We have Ken Barth from Catalogic, Tim Gasper from data.world, and Nathan Turajski from a company called Informatica that I’ve done some work with over the years. It’s a cool company based out of Redwood City in Silicon Valley.
I went to the headquarters a long time ago, when my journey in data began back in 2001. A lot has changed since then but a lot is still the same. Nathan, I throw it over to you to tell us what you folks are doing. Informatica has been one of the stalwart vendors in the data integration space and the data management space for decades. What’s the latest with you and data catalogs?
[00:02:13] Nathan: Thanks. It’s great being here. Informatica has been around for a bit over 25 years. That’s a lifetime. We are focused on data management. Data cataloging is the area that I’m focusing on, along with data privacy and data security. As part of a bigger picture, data cataloging fits into a category that we call data governance. It covers a bit more broadly than data cataloging, but data cataloging tends to be the tip of the spear.
It’s the starting point for most journeys into the world of data governance. You have to understand where your data is and where it’s going in order to start making informed and, hopefully, intelligent decisions on how to use that data responsibly, or how to protect that data from the abuses that we see nowadays, and to derive trust from that data so it can be used responsibly.
That’s where we see data cataloging, whether you are doing this on-premises or moving to the cloud. Certainly, after the pandemic, we are seeing a big acceleration toward cloud-hosted services. We play in these spaces and offer a variety of different data cataloging products that customers are using to essentially build trust in the data, understand that data, how it’s being used, and how they can use it more effectively in the future.
[00:03:38] Eric: Trust is a good concept to bring in here. If you don’t trust your data, you are not going to use your data. That happens all the time. People lose trust in the data, and they go to some alternate system or go with gut instinct or whatever. Increasingly, that’s not going to be an option. You are going to have to have trust in your data, especially if you are in a large enterprise. You’re going to need some technology.
We were joking before the show that business dictionaries had been around for decades. It’s nothing new, but it was an offline resource that you had to rely on someone to go to and check. It wasn’t dynamically connected to your operational systems. Where we are heading is for the data catalog to be this intermediary that offers meaning, lineage, and control points around the information assets to enable responsible use of that information.
[00:04:29] Nathan: Think about the proliferation of data, not just in the last few years. It has been building for a long time that there’s much more data. It’s proliferated. It’s spread across the organization. The pandemic that we are starting to hopefully get away from and move on from created a new environment for us to deal with. You have a lot more users and consumers in the organization that are working remotely, spread in diverse places. The enterprise, as a model, is changing, and with it, so is that data.
That data proliferates to locations and places that we may not have expected. It’s critical to have transparency into where that data is going, and that catalog helps tie together to better understand, “Where is my data? Who’s using it? Where is it going? Can I trust it? What are the indicators of those different data sets?” Maybe it’s customer information, maybe it’s your intellectual property. Maybe it’s information related to sensitive data that you need to protect for regulatory reasons and otherwise.
Maybe you just want to understand the value of it for using it in different value creation opportunities. All this understanding starts with the catalog. Everyone here is enthusiastic about where you can take data once you can expose it responsibly, have that trust assurance, and consumers across the organization trust that it’s available for their needs.
[00:05:58] Eric: We all agree that the pandemic spurred the data catalog space on, at least in part. Suddenly, we had folks working remotely who could not go into the office. You had better come up with some way to expose the data they need responsibly. That helped fuel the catalog space. What do you think?
[00:06:20] Nathan: One thing that is also a big shift is that years ago, when data analytics started to take off, this data was purely in the hands of the data scientists. They were trying to understand that data and its relevancy, and then use it for different analytic uses. We have seen a huge shift towards what we call data democratization. It is simply that there are more consumers in the organization that want to mine that data.
They want to understand that data for their value-creation opportunities, whether you are in human resources, marketing or some other program where you are responsible for the bottom line, to figure out how that data can be transformed and, again, used responsibly but used for these value creation opportunities. Maybe it’s a better understanding of your customers. Maybe it’s to understand what kinds of products and services are going to be more appropriate and build more stickiness or loyalty for your customer base. It’s gone beyond the data science use and into many more use cases as data has become the lifeblood of our economy.
[00:07:25] Eric: We talk a lot about data governance on this show. Until a few years ago, it was difficult to pull off. You could give access controls at the database level or the tool level in a BI environment. In between, you couldn’t do that, but now you can. It makes sense that the catalog should be woven together with the governance structures. That also touches on security too. You are touching on multiple things that all revolve around responsible use of data.
[00:07:59] Nathan: You could think of data governance as two sides of the same coin. Historically, privacy and security, understanding the data in terms of what was sensitive data, cataloging that data, and classifying that data as sensitive were useful because of all the compliance mandates. We are a few years removed from the GDPR. California, for example, has the CCPA. There are new state-mandated privacy regulations out there to protect data privacy, and certainly there are all the abuses and security issues that are out there.
The flip side or the other side of that coin is we still want to use that data. We can’t simply lock it down perpetually. We have to be able to make it available to those stakeholders in and across the organization that can see the value in that data and want to build those types of programs that improve the bottom line, customer stickiness, and perhaps give us new insights into how to build better operational efficiencies or create other types of value that are important for the particular domain that you may work in.
[00:08:57] Eric: That’s an interesting point you brought up about operational efficiencies. That’s a great sign that you are mentioning that. It means that we are at a place now where understanding the data and the information sources can play a strategic role in redesigning business processes. What you are talking about is changing a business process and infusing it with data in some responsible way. That’s pretty cool because we are squarely in the space of digital transformation.
[00:09:26] Nathan: Even at Informatica, we like to position our product as being a catalog of catalogs because there are catalogs that can be departmental and localized. Being able to understand where data is coming from, and even tap into those sub-catalogs as it were, and be able to understand all the different data sources across your environment, aggregate them together.
From there, understanding the quality of that data, where that data is coming from, how it’s being used across the organization can start to build those insights into how to make that data flow or those information value chains a bit more efficient and make them available to the right people in the right context for the right use cases.
It’s all about those efficiencies. How do I connect the producers of the data or the data sources with the consumers of that data? There’s quite a bit of operational efficiency in terms of understanding those data sets. There’s a lot going on on both sides. You can connect the right producers and consumers. You can encourage data literacy and bring together the right groups of people. They can find new ways to collaborate and democratize that data for additional value creation.
[00:10:35] Eric: That’s great stuff. Let’s bring in Tim Gasper from data.world. Nathan is sticking around for the round table. It’s another cool company that has a cool background. You are strong in the data catalog space. Tell us a bit about what you folks are doing there and how you are helping to increase the value and the usage of the data responsibly.
[00:11:00] Tim: Thanks so much, Eric, for having me on the show. I’m the VP of Product over at data.world. I’m also a cohost of the Catalog & Cocktails podcast, which is about data management. Data.world has been around for a few years. Folks recognize us as being in the catalog and governance space a little bit more because we started as an open data catalog. We were connecting to the world’s data, and that exists now. It’s the world’s largest open data catalog. Anybody can sign up for free. It’s an intelligent Facebook-like experience around data.
Years ago, a lot of enterprises were saying, “You created this engaging Facebook-like, Amazon-like experience around open data. We want to bring that into our own enterprise. We want to have that Amazon or Facebook-like experience internally where people can collaborate around data, find data, and leverage the intelligence that you are providing being built on a knowledge graph to provide a more intelligent, engaging experience for people trying to work with their data.” What we have been focused on is providing this engaging, highly adopted data environment for people to find data and work around data.
[00:12:16] Eric: To explain this to our audience, think about when you are in some information system, and you are appending it with notes. Think about CRM systems. What we are talking about here is enabling collaboration around the definition and meaning of terms, and helping people understand how to stitch all that stuff together. Especially at a large organization, let’s say a product organization, we have tons of different products. That could be a bit of a morass to get into and to understand.
What we are trying to do is provide a framework and a set of processes or workflows around defining the terms, then sharing those definitions and collaborating with people so that folks can onboard more quickly. Folks can dive into particular domains of expertise in the organization and figure out number one, “What are we doing as an organization,” and number two, “How does that fit into the broader world?” Is that about right, Tim?
[00:13:12] Tim: Yeah, that was a great explanation. There are two analogies that I like to use most often when talking about catalogs and the value that it provides around working with data and then managing data responsibly. One of them is the analogy of the library. When you are going into the library, you want to look up a certain subject. If you want to be a responsible librarian, you need to organize the books such that people can find them in logical places so that you can’t lose track of those things. When people want to go find it, they need to be able to use the computer terminal to search for it or be able to browse around the aisles and for it to make a lot of sense where these things are organized.
One analogy is this library analogy. Another one, which is maybe even more apt, is more of the marketplace analogy. It’s where you want to go to buy something. This something happens to be important. It happens to be data that’s going to be what you make your sales decisions on. Maybe if you are a retail organization, you are trying to decide on new locations. These are important products that you are going to buy from this “marketplace.”
One of the things that you want is a great retail experience. You walk into the store, and you want to be able to see what data is available. What’s the most important data? What’s not the most important data? You want to be able to talk to the right experts that can guide you to the right data. If you have a problem with it, you can come back to the store, and you can return it, get a repair or what that might be, that retail experience. It’s also the trust of that data.
It’s, “What’s the supply chain that got this data product through this store in the first place? Can I trust the factory that created it the right way? What trucks did it take, and what warehouses did it stop at on its way over here?” What catalogs are all about is providing you that library and that marketplace around your data so that you can see it. You have that traceability, and you can interact with it. You can collaborate around it, and you can extract a lot more value out of your data.
[00:15:11] Eric: I have been in this business for many years, so I have been watching the evolution of data warehousing. Years ago, you saw master data management come around, and a number of different companies came along to provide master data management solutions for both operations and analytics. You have analytical MDM and operational MDM. Did data catalogs come to prominence in the last few years in part because we never fully got the value we expected from MDM? Where does MDM begin and data catalogs end? Tim, what do you think about that?
[00:15:51] Tim: Master data management and that movement are still going now, although maybe it’s not quite as hot as it was at that time. It’s still around how do you reconcile data integration? How do you bring different data sources together in a smart way to get that one view of the customer or the product? That’s still important. You see a lot of data integration projects still focused on master data management. Catalogs and governance solutions play an important part together in conjunction with master data management.
There are a couple of other trends that pushed catalogs even more than the master data management trends. It’s more around, first of all, self-service. It’s the idea that we want anybody in the organization to be able to work with data. We say, “You’ve got Excel. You may not think of it like this, but you technically have a business intelligence solution sitting on your desktop.” We want to add people to Tableau, Power BI, and all these new modern tools as well. You even have the advent of these modern AI tools: Dataiku, Domino Labs, and Hex, which is the new player. If you want to have these tools and you want to create self-service around data, then you have to have visibility, trustability, and access to that data for it to make sense. That’s one big trend here.
The other one is about the complexity around data. You had the Big Data movement, which was big in technology. Hadoop became popular for a period there. We all thought that maybe Hadoop was going to be the saving grace that, “We bring all our data together, and magic will happen,” and that we will solve world peace with Hadoop. The truth is, as probably we should have expected, it’s just a piece of a larger technology ecosystem but volume, velocity, all of this stuff is an important part of that.
[00:18:20] Eric: We are talking to several cool companies in the space. We have heard from Informatica and data.world. Next up is Catalogic. We have Ken Barth, CEO of Catalogic. They do a different kind of catalog to protect us from all these hackers out there and all this ransomware, which is everywhere. It’s crazy. Ken, tell us a bit about yourself, your company, and how you are enabling the use of data catalogs.
[00:18:45] Ken: First of all, I want to thank you because few people have ever called me one of the brightest minds in the world. I love that tagline. It’s perfect. It says it in our name. We are in the security and data protection space. We have been there for many years. We have over a twenty-year history of protecting some of the world’s largest data. It’s interesting, we decided to call the company Catalogic for a reason because we believe the catalog is at the core of everything.
We use catalogs in multiple different ways. We use catalogs for backup and cataloging actions in our cybersecurity products. We have two main lines of business, DPX, which is our flagship backup product. We have added ransomware protection to that. Again, we use a catalog to look at and map customer actions. If you think about it, you are managing all this data. You have to find ways to be able to track.
Nathan and Tim both mentioned data governance is huge here. You have to find a way to map and track the data that you bring into your organization. Once you have that, you have to perform and be actionable with it. It has been a long time coming that the catalog is the next step in this whole migration of bringing structured, unstructured, and semi-structured data together so you can make good business decisions. As part of that, we are certainly using that in Catalogic to make sure that your backups and your cyber resiliency, if you are going into the cloud, are there with you.
[00:20:34] Eric: You made a couple of good comments there. One, I love that you brought in the fact that you can use these catalogs to better understand the synthesis of structured and unstructured data. The unstructured data typically provides the context to help you better understand the structured data and what it means. Data without context is meaningless.
[00:20:58] Ken: There are three kinds. There’s structured data, unstructured data, and semi-structured data. Each brings its own information into play here and could be valuable, depending on the perspective that you are bringing to the problem you are trying to solve. We are in a data swamp out there. We collect information, and if you can’t somehow organize and tag that, you are in a lot of trouble. I feel the organizations that will be able to organize it and tag it are the ones that are going to win in a big way.
[00:21:32] Eric: It’s interesting you bring up tagging. We talk about machine learning on this show. I had a guy who is one of the foremost experts in the machine learning space in artificial intelligence. He was talking about the tagging part of the process. It’s funny because what he said is in their experience, a small percentage of people are good at that process. You want to be careful about who you put in charge of tagging things because if they don’t tag it responsibly or consistently, you are going to mess up.
[00:22:03] Ken: That’s why with machine learning, you are seeing that process get automated to an extent, though you still have to build out the problem. What you are going to see is a subset of catalogs that will be running businesses depending on perspective. Nathan mentioned a catalog of catalogs like a master catalog. We touched on master data management. To me, master data management was a manual attempt from organizations to try to build the master of master catalogs. What you want is data syncing with different kinds of catalogs. Take, as an example, a data of record, which is your customer information.
If you look at what happens a lot in organizations, you will have one customer record that maybe is in a product like Salesforce.com, but over here in Oracle you have a whole different customer record with different information. Bringing those together into one master data manager requires those two groups. If you can bring a catalog together and then have a master catalog on top, you can bring automation to it, and you are going to stand a better chance of getting that done a lot quicker.
[00:23:29] Eric: With the advent of machine learning, we are going to be able to revisit how we do almost everything in data. That’s already happening, where we’ve got concepts like self-driving data. There’s a company we’ve had on the show a few times. It’s doing some cool work in the self-driving data space. It’s a different form of automation. For example, I’m always loading spreadsheets of registrants into our email marketing applications. The system can learn over time: “He’s logging into the system every time. He’s done it before. He’s taken a marketing database and put it over here.”
What we are going to start seeing is the automation of suggestions. It may be that this time I want to do something differently. Even though 99 times up until now I did it all the same way, I may want to do something different next time. What we are going to see is more of these suggestions for people. Within the console where you are working as an information worker, you are going to have a suggestion come in, “Do you want to pull in this data source? Do you want to do that based upon your history?” That’s clever because it doesn’t automate something that you don’t want automated but it does automate something that’s valuable because the user will probably want to do that. You are saving a half step, if that makes sense. What do you think about that?
[00:24:44] Ken: It makes a ton of sense. There’s a lot of information that goes into making that decision that you might not be privy to. It’s trying to suggest to you the right decision to make. That’s where the metadata becomes involved. There are three kinds of metadata. You can have metadata about your schemas and tables, which is how your databases are laid out, and a good machine learning product picks up all that. You’ve got the business metadata.
If I’m on Salesforce.com, I’m making notes. They are in the margin. You have operational metadata, like how recently a field was refreshed. You need to take all of that to make a suggestion as to which way you should go: “Here’s what I suggest.” That’s why the technology is here now, where you can start building some of these things and get some unbelievable automation. Tim was referring to that a little bit earlier.
It will become a self-service thing. People aren’t going to panic every time Eric comes and tries to import a marketing list. We know he is not going to blow our schemas away or put the European date format in where we want the other format. It will clean it up. We will solve a lot of issues. We have a long way to go, don’t get me wrong, but a lot of this is starting to happen. If you look at it, particularly from a Catalogic perspective, if I’m using this for cyber resiliency, I need to be able to track user actions. That’s the key.
Take a file that looks like it was corrupted. Who touched that file? It seems like Eric or Nathan touched that file, but what else did they do after that? Where did they go after that? If I have a catalog of that, then I can go back. The two big issues that we fight in our business are recoverability and scope: “How quickly can you recover? What was affected?” If I can put a box around those two things and I can bring that back for you quickly, I’m going to win the game. A catalog is key for that.
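Editor's note: Ken's question ("Who touched that file, and what else did they do after that?") can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Action` record, the `actions_after_touch` helper, and the file paths are hypothetical and do not describe how Catalogic's products store or query their catalog.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Action:
    """One cataloged user action (hypothetical record shape)."""
    user: str
    path: str
    at: datetime


def actions_after_touch(actions: Iterable[Action], suspect_path: str) -> Dict[str, List[str]]:
    """For each user who touched suspect_path, list the other paths
    they touched afterward, in chronological order."""
    ordered = sorted(actions, key=lambda a: a.at)
    first_touch: Dict[str, datetime] = {}
    followups: Dict[str, List[str]] = {}
    for action in ordered:
        if action.path == suspect_path:
            # Record the first time this user touched the suspect file.
            if action.user not in first_touch:
                first_touch[action.user] = action.at
                followups[action.user] = []
        elif action.user in first_touch:
            # Any later action by that user is inside the "box" to inspect.
            followups[action.user].append(action.path)
    return followups


# Illustrative log; the users come from the conversation, the paths are made up.
LOG = [
    Action("eric", "/shared/leads.csv", datetime(2024, 1, 5, 9, 0)),
    Action("eric", "/finance/q4.xlsx", datetime(2024, 1, 5, 9, 5)),
    Action("nathan", "/finance/q4.xlsx", datetime(2024, 1, 5, 9, 7)),
    Action("eric", "/hr/payroll.db", datetime(2024, 1, 5, 9, 10)),
    Action("nathan", "/shared/leads.csv", datetime(2024, 1, 5, 9, 12)),
]

affected = actions_after_touch(LOG, "/finance/q4.xlsx")
```

The result maps each user to the files worth checking next, which is one way to "put a box around" what was affected before deciding what to recover.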
[00:27:04] Eric: I’m sitting here thinking of another example I can throw at you. This is a bit of a curveball. In terms of the value that a catalog can provide, we talked about master data management and other ways of reconciling disparate information systems. For company size, we wound up standardizing on whatever it is that LinkedIn goes by. It’s like 0 to 10 or 1 to 10, 11 to 50, whatever the case may be. You go to different places, and those numbers change dramatically. It might be 10,000 to 49,000. It might be 10,000 to 249,000 or whatever.
The key is how do you reconcile that? How do you normalize that data? This is a fairly simple example but that’s one place where a data catalog can be the place where you come together and decide upon that and say, “We are going to go with this version. I want to translate this set of 5,000 records that we have to fit over to this version.” It’s a menial detail but it’s an important detail at some point in time. What do you think about that as a metaphor for what we have to work on, Tim?
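Editor's note: the company-size example can be sketched in a few lines. The canonical buckets below are assumptions chosen for illustration (they are not quoted from LinkedIn or from any vendor), and `canonical_bucket`, `parse_range`, and `translate` are hypothetical helper names. The point is that once a catalog records the agreed buckets, translating records becomes mechanical, and ranges that straddle two buckets get flagged for a data steward instead of being guessed.

```python
from bisect import bisect_right
from typing import Optional, Tuple

# Hypothetical canonical buckets: lower bound of each bucket and its label.
CANONICAL_LOWER_BOUNDS = [1, 11, 51, 201, 501, 1001, 5001, 10001]
CANONICAL_LABELS = [
    "1-10", "11-50", "51-200", "201-500",
    "501-1000", "1001-5000", "5001-10000", "10001+",
]


def canonical_bucket(employee_count: int) -> str:
    """Return the agreed bucket label for a single employee count."""
    if employee_count < 1:
        raise ValueError("employee_count must be at least 1")
    index = bisect_right(CANONICAL_LOWER_BOUNDS, employee_count) - 1
    return CANONICAL_LABELS[index]


def parse_range(label: str) -> Tuple[int, int]:
    """Parse a source label such as '10,000 to 49,000' into (low, high)."""
    low, high = label.replace(",", "").split(" to ")
    return int(low), int(high)


def translate(label: str) -> Optional[str]:
    """Map a source range onto a canonical bucket.

    Returns None when the range straddles more than one canonical bucket,
    so the record can be routed to a steward for a decision.
    """
    low, high = parse_range(label)
    low_bucket, high_bucket = canonical_bucket(low), canonical_bucket(high)
    return low_bucket if low_bucket == high_bucket else None
```

With these assumed buckets, `translate("11 to 50")` maps cleanly, while `translate("10,000 to 49,000")` returns `None` because the range spans two buckets and needs a human decision.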
[00:28:20] Tim: That’s a good connection there. A lot of the rules and the semantics that are important when it comes to data are things that a catalog can play an important role in. It’s usually the simplest concepts that we struggle with the most as businesses, “What is a phone number, and how should we represent it? What is an active customer?” “That’s a customer that has been active in the last twelve months.” “No, it’s not. It’s the last 36 months.” “What? We have different definitions.” A catalog plays an important role, not just in the automated discovery of things, but also then coming together and saying, “Let’s align on meaning.”
[00:29:03] Eric: We talked about this on a show. It’s going to be a common theme of prioritization. What do I work on? A catalog is a good tool for being able to identify where we have disparate definitions. Maybe I will throw this one over to Nathan to see, “We’ve got different definitions. I want active customers. Who are the people involved? Who are the stakeholders? Let’s get them together.”
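Editor's note: Tim's "active customer" disagreement can be made concrete with a small sketch. The `GlossaryTerm` class and the `is_active` helper are hypothetical and are not data.world's API; the twelve-month and 36-month windows are the two competing definitions from the conversation. The idea is that the definition lives in one glossary entry and every report reads the window from it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GlossaryTerm:
    """A business term with one agreed, machine-readable definition."""
    name: str
    definition: str
    window_months: int


# The two competing definitions from the conversation.
ACTIVE_12 = GlossaryTerm("active customer", "Active in the last twelve months", 12)
ACTIVE_36 = GlossaryTerm("active customer", "Active in the last 36 months", 36)


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from earlier to later (simplified;
    end-of-month edge cases are not specially handled)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return months


def is_active(last_activity: date, today: date, term: GlossaryTerm) -> bool:
    """Apply whichever definition the glossary entry carries."""
    return months_between(last_activity, today) < term.window_months


# The same customer is inactive under one definition and active under the other.
last_seen = date(2022, 3, 10)
today = date(2024, 6, 1)
under_12 = is_active(last_seen, today, ACTIVE_12)
under_36 = is_active(last_seen, today, ACTIVE_36)
```

Here the same customer counts as inactive under one definition and active under the other, which is exactly the mismatch that aligning on a single glossary entry removes.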
If you do it through the tool, depending upon which tool that is, it can be dynamic. It’s connected to other systems. You change a policy, and the policy is changed in effect. Whereas before, you had to send an email, “From now on, this is what it should be.” Half the people read the email, half the people didn’t. You haven’t done much at all. In fact, you made matters worse. What do you think about all that, Nathan?
[00:29:51] Nathan: When you talked earlier about trustworthiness, that’s a big part of data cataloging. When you break that down a little bit further, trustworthiness is often a function of the quality and privacy of data. Let’s put privacy aside for one second. The quality of that data is critically important. We heard quite a bit of discussion around automation, AI, and ML being critically important to accelerate a lot of the insights and the data intelligence you can derive from a catalog.
One thing I also would like to touch on a little bit too is this idea overall of organizational data literacy. It’s important to have AI and ML, these sorts of modern technologies, to help accelerate how we derive data intelligence. It’s also important to bring together all of the stakeholders in the organization that are data stewards, consumers, and producers, and have an environment for them to be able to collaborate.
If I were to use an analogy, when you go to Amazon.com, and you search for a product, that’s similar to the way you might search in a catalog. You are looking for something in particular, maybe not a data set, maybe it might be a product. You rely upon all of the various comments, the ratings, and the scoring. You are trying to find some value in that data, and you are relying upon your organization also to provide input. There is that tribal knowledge that’s locked up in an organization. You don’t want it to remain in silos.
The catalog can be critical in bringing together all of those individuals, so they have a place to collaborate. They have a place to be able to comment on those data sets and also be able to determine the trustworthiness and value of consuming that data as it’s exposed to more individuals in the organization. I mentioned data democratization and this concept of a marketplace. That ultimately is the end game.
The way that Informatica sees it is that you can find the data, you can trust the data. Ultimately, you want to use that data to create value. You make it available in the marketplace where then the different organizations can come together, understand the data, understand how others have found value in that data, look at the quality of that data, and profile that data to understand it a bit better. The two go hand-in-hand: AI and ML automating and accelerating discovery and lineage, and then also the human aspect, which can’t be ignored. A catalog can still help to provide that environment.
[00:32:07] Eric: I got myself thinking here about retail environments. For example, this marketplace concept is fantastic, by the way, I will throw that out there. It’s a great way to look at the world. It also enables self-service much more. I’ve got a good friend in the industry, Donald Farmer, who came up with this great metaphor to describe the changing role of IT and IT professionals, especially in the data space.
He says, “In the on-prem world, they were more like the gatekeepers to protect the machinery. We don’t want to break our schemas. In this new world of self-service and cloud-based, you want them to be more like shopkeepers to show the different people where the stuff is.” It’s a dramatic and significant change but it speaks to the evolution of the technologies and what’s available quickly, right, Nathan?
[00:32:54] Nathan: Absolutely. Data marketplaces are the future. It’s, “How do I connect the producers with the consumers, and how do I make that information available?” As a consumer, I don’t have to dig into the space of data scientists. I don’t have to search down across the organization, send out a bunch of emails, and look through a bunch of spreadsheets. I have one place to collaborate. I have one place to accelerate my value creation opportunities, as opposed to my data curation and data discovery capabilities.
[00:33:22] Eric: You want an assembly line for data. You don’t want every person having to try to do every job on the assembly line. It makes no sense at all. That’s why you have an assembly line. “Here’s where the data is pulled. Here’s where the data is curated. Here’s where the data is enriched.” At the end of that process, you have the data in the marketplace that some users can come to grab and not have to worry about the extraction, the pipelines, and all that stuff. You don’t want those people worrying about that. You want them to use the data, leveraging the data to make better decisions.
I will throw this first question over to Ken. It’s one of the coolest things about data catalogs in terms of processes and onboarding people. If you hire someone to come into the organization, that’s a great place to have them start to go looking around, to understand, “What does your business hold? What are the information assets and all that fun stuff?” It’s a fun little project. Years ago, I had someone on a show who said, “That’s exactly what we do.” It’s a resource of information and definitions. It touches everything, depending upon how thorough it is. What do you think about all that?
[00:35:01] Ken: If you have a catalog, a lot of times, it’s a series of Excel spreadsheets. It is what it is. Go read product brochures. Get Excel spreadsheets. If you are good enough to work for a larger company like Informatica like Nathan is, you probably have a good training and induction program. For a smaller company like ours and the ones I have usually been involved in, a catalog would be an absolute home run. Let me say it that way. It would be a great place to send somebody because then it’s a guidebook on how to touch on things. That’s one of the valuable things about a catalog. A catalog can be a catalog of catalogs.
If I’m coming in as a salesperson, I don’t necessarily need to know what the operational aspect of it is or what our backup situation is in the company. What I need to know are customer records and purchase orders, “What is the financing process? How do you process an order? What are the travel policies of the company?” Those are the things I need to know. I can easily get pointed to each one of those catalogs to look at it or have one reservoir.
[00:36:14] Eric: That’s a good point. Tim, maybe I will bring you into this one too. I’m sure you have seen lots of different examples of catalogs because it’s all over the map in terms of what you define in them and how you use them. A lot of these topics and concepts come around, then they go away, they come back around, and they go away. Think of a portal. Remember, you used to have intranets all the time. Now they are coming back as portals, customer analytics or things of this nature. That’s the interface through which you see the world. A catalog underneath something like that is powerful, right, Tim?
[00:36:46] Tim: Absolutely. It ties into your question and the comments by Ken there that a lot of times, data is becoming an important part of our jobs. It becomes a central part of what the knowledge worker has to work with. You want to have access to your data through a data portal in the same way that you have your employee portal. I think of an example. One of our customers at data.world is a large consultancy. They have over 50,000 employees, and over 22,000 of them are users of the catalog.
That’s what you can get to when you have broad adoption of a data portal or a data catalog. They integrate their data catalog with their intranet. When you go on the intranet, you search for things. You are not just getting results of Wiki pages, benefits portals, and stuff like that. You are also getting data results and analytics results directly in your employee portal. You are going to see more of that kind of thing happen as more and more people embrace this data democratization approach.
[00:37:48] Ken: I’m going to add one thing to what Tim said. He mentioned 22,000. The way everyone looks at that is 22,000 people, that’s wonderful. They have access but you have 22,000 chances to make a mistake and let a bad actor into your system. That’s why it’s so important we are tracking those actions. The fact that we have democratized this is fantastic. We need to be aware that every eleven seconds, somebody gets hacked. For anybody considering any of this out there, you’ve got to make sure you’ve got it buttoned down, you have good data governance, as Nathan was saying, and you’ve got a way to protect yourself against that.
[00:38:27] Nathan: Governance and democratization are two sides of the same coin. The more you do one better, the more you can do the other better.
[00:38:36] Eric: Nathan, maybe I will throw this one over to you too. The other nice thing about these catalogs is that you can track their usage, and then you understand what’s hot, who’s hot, who’s using them, who’s not. You can go back and see, “Did this marketing campaign work or did it not work?” I can tell you, having spent years in that business, it’s a difficult thing to figure out. Attribution is difficult. There are some amazing tools these days. I lean and I’m like, “They don’t even do attribution.” “What do you mean they don’t do attribution? You sure sound like you do attribution.” It’s a difficult thing to determine but it’s another resource to help us understand what’s happening, right, Nathan?
[00:39:14] Nathan: You talked about the concept a little bit earlier, this notion of metadata. It’s not just the data itself but that insight into the data, the metadata or the data about the data that’s highly important for those in the organizations that are trying to make intelligent data-driven decisions. At the end of the day, catalogs have historically been more of a technical tool. The data scientists, the analytic leaders, and the analysts have largely employed the data catalog.
The one topic that we keep coming back to is the idea that democratization and a marketplace are going to open up that data to more consumers. To play the devil’s advocate, there might be situations because of abuses, security concerns, and regulatory requirements where you may not want to necessarily expose all of your data. The more you can do that responsibly, the more you can build trust assurance into the equation, and the more that you can open up these marketplaces so that the consumers across the organization can not only find the data they need, but also understand the effectiveness and the quality of that data.
They can make better decisions. They can bring together data sets that they may not even have been aware of before. They can use that Google search capability in a way to almost web surf through their data sets and understand what data sets might be relevant to a particular program. If I’m trying to build a customer loyalty program, if I’m trying to offer better products and services to my consumers, I want to understand all those different data sets about those consumers.
What you are largely seeing is this major transformation happening over the last few years, where cataloging has gone from being a technical tool for understanding the data to more of a tool that enables those value-creation opportunities across the organization. What better place than a catalog to bring the business side and the technical side together to collaborate, to bring together the producers of the data with the collaborators and the consumers of that data? It can act as a hub for all of that understanding, improving your data literacy, and making you as an organization more effective at using that data in ways that you might not have even dreamed of before.
[00:41:13] Eric: Tim, I’ve got to throw this over to you. It’s a great conduit for better understanding the business and having conversations about things, programs, and organizational hierarchies, whatever the case may be. If you embrace it thoroughly, it becomes that hub that Nathan mentioned. It’s not just a data hub. It’s an organizational hub that helps you understand who’s doing what and where we can align.
What I love so much about the collaborative capabilities of our technology is that it’s completely game-changing. It’s something as simple as Google Sheets or Google Docs. The fact that you, I, and ten other people can all log into the same doc and see who’s where and comment on it in real-time is a sea change from the way it used to be. I did it on my machine, then I emailed all of you. You all did it separately, and you emailed it back. I tried to parse all that together. Good luck.
[00:42:08] Tim: Yet it’s scary at the same time because the conversation is happening in so many places. How do you keep track of it? It’s like, “That one Google Doc you sent me that one time, why is Google not finding it when I’m searching for it?” This is the funny situation we find ourselves in. We have so much collaboration at our fingertips, and yet it has become harder than ever to find that one piece of information.
Catalogs are becoming a hub, not just of the data but of the organizational knowledge. The organizational conversation is important, whether it’s directly in your catalog like comments or integrating with things like Slack and Teams in these places where the conversations are happening, where you want to have back and forth push and pull with your catalog. It turns your catalog into something operational.
[00:42:57] Eric: I will throw it back over to Nathan to comment on that. You want it to be operational. You want it to be a central hub for discussions, for activities, for your business. Right, Nathan?
[00:43:07] Nathan: Yeah, I thought that it was great to bring up that analogy of something like Google Docs. It works in very much the same way. You’ve got maybe data in a spreadsheet, and you want to be able to bring together your collaborators to be able to understand what’s going on, how to use that information, and how to move forward. Increasingly, it is becoming that hub for many organizations. It is moving from the technical to more of the business audiences, as the business audiences are responsible for using data as the lifeblood of their organization to derive new value.
It used to be that you had certain industries, financial services, healthcare or a few that were data-driven. Now in every organization virtually across the board, you can find situations where data is giving them new insights that they have to be able to harness and turn into data intelligence that can then fuel that organization. In every organization, whether you are last to the table or whether it’s critical to your business, you are going to find that in the environment nowadays. You can’t delay the implementation of a central resource to manage your data on that.
[00:44:15] Tim: Eric, I need to compliment you on this. I had to jump in because when you named this the Dewey Decimal Classification, I thought that was cool. If you look at what this is, the Decimal Classification introduced the concepts of relative location and index. We have been talking about the flexible index or catalog this whole thing. You were spot on with what you did.
[00:45:00] Eric: Nathan, how do you get started? If you are in an organization that does not have a data catalog and realizes that you need one, what does that process look like? What’s your advice?
[00:45:10] Nathan: First, understand who the different stakeholders are that are responsible for success. It could be an analytics program. It could be a line of business. It could be your chief data officer. There’s going to be some stakeholder in your organization that’s going to be passionate about this. You have to bring in the right stakeholders. It could be both technical as well as business. It could be your IT department. It can also be others in the organization that are trying to derive value.
It could be a security concern. Maybe I’m interested in understanding my particular security posture, and I need those tools. Start small. Bring together the right stakeholders, and then scale out from there. You don’t want to bite off more than you can chew. Initially, you want to be able to start with a project that you can understand. You can demonstrate success. You can show an easy and quick win.
I don’t want to call data cataloging a lifelong journey but it’s certainly something that’s going to take many years to achieve the true full enterprise value. If you want to start small, you can start small. Identify a project. Identify specific data sets, whether they are customer-related, IP-related, or tied to security concerns. You want to start by bringing together the right people and then scale out from there.
[00:46:18] Eric: Tim Gasper from data.world, what’s your advice on how to get the ball rolling?
[00:46:25] Tim: We hear that a lot, “How do I get started with a catalog? How do I get started with solving these use cases?” Similar to what Nathan said, the first is to understand your goals and use cases. Establish what success means for you. Is it adoption? Is it quality? Is it engagement and enablement? Set that up. Set it to paper. Get people to agree on it. It’s people, process, and technology. The catalog is part of the technology piece but the people and the process are important too.
At data.world, we think about three things. Discovery, you want to connect to things and understand them deeply as automatically as possible, take advantage of AI and machine learning. Governance, wrap your arms, process, and policy around it. Enablement, you want to empower as many people in the organization to engage in the data conversation and get more value with data.
[00:47:15] Eric: A good way to think about this is, “What is your priority? What are you trying to do?” Nathan had a great example about loyalty programs. How do you figure out what to reward them and what to reward them for? These are all things that the data can help you understand. You perhaps use machine learning or AI or some other statistical analysis to segment your customers.
You’ve got 9, 10 or 11 different groups, “Who do we want to reward?” Obviously, the loyal customers we want to reward. Understand who these people are. Navigate through the data. Find some similarities. It turns out they all like to sail, for example. That’s an interesting angle. It’s an iterative process. That’s what Nathan was alluding to. It’s not a once-and-done. It’s an ongoing process but you’ve got to start somewhere and start climbing the mountain, right, Tim?
[00:48:07] Tim: Yeah, absolutely. We use a phrase, agile data governance. Think about how you can implement agile data governance within your organization.
[00:48:17] Eric: Ken Barth, you are focused on the security space and using catalogs for that. It’s a mechanism of action, used to wrangle the data and get some understanding of the truth.
[00:48:37] Ken: For bad actors, the ultimate prize is your data. That’s what they are all after. The very first thing I would start with is being able to get an actionable database of what my users are doing and what they are after. To do that, you’ve got to build a little bit of a governance policy. It doesn’t have to be huge but, “Where is my data, the sensitivity of my data? Who needs the access? How quickly can I roll back?”
We get hacked every eleven seconds. Somebody in this country or this world is getting hacked every eleven seconds. It’s the world’s fastest-growing business. If I want to go into business, I can go on some sites, download some tools that other people have written, start hacking, and probably make a pretty good living.
That’s where I would start. This is the perfect place. I would make sure your data protection software has some cataloging ability to catalog users’ actions so you can start tracking some of this. Getting hacked for ransom, if any of you have ever lived through it, is something I have been lucky enough not to experience. I have been with a lot of people that have. This is a problem. It is the biggest single problem we face in terms of the democratization of the data, which is what we have all been talking about.
[00:49:50] Eric: You made a good point earlier too when you said that when you have 22,000 people using your catalog, those are 22,000 endpoints.
[00:49:58] Ken: They are great people. They don’t mean to make a mistake but we all make mistakes. You just click on the wrong thing. What you need to be able to do is know where they were last. If you can figure that out, then you can start doing damage control, which is the most important thing in the whole world for you.
[00:50:18] Eric: We enjoyed the conversation. Look these guys up online, Tim Gasper of data.world, Ken Barth of Catalogic Software, and Nathan Turajski of Informatica. Send me an email if you want to be on the show, Info@DMRadio.biz. We will talk to you next time, folks.