Diversity And Inclusion: A Database For All Seasons!

October 14, 2022

Years ago, there were precious few choices for database technology. You had IBM, Oracle, SAP, and some stuff from a company called Microsoft. Today, there are hundreds of database technologies to choose from, each designed for a particular purpose. What's the latest? Check out this episode of DM Radio to find out as host @eric_kavanagh interviews guests Keshav Pingali of Katana Graph, David Wang of Imply, and Jim Walker of Cockroach Labs.

Transcript

[00:00:48] Eric: We’re going to be talking all things database, diversity and inclusion, a database for all seasons. I’m old enough to remember when there were only a few databases you could choose from. Back in the day, you had Oracle, IBM and some stuff from a company called Microsoft. There were a few fringe databases out there. You had Sybase IQ, which was a very interesting column-oriented database. Now, there are hundreds of database options that you can choose from. The open-source movement was a big part of this transformation. Big data also helped shape the database industry because companies like Facebook realized there was no database they could go out and buy to do what they wanted so they had to roll their own. It’s called Cassandra.

DataStax sits on top of that, and there are lots of other databases. We use MongoDB for our media lens. There are lots of other open-source databases. Also, MySQL. Monty Widenius built that from the ground up. He got a bunch of people to help him out over the years. I interviewed him a few years ago and asked him, "How do you do it?" He sent 40,000 emails to people who were helping him try to build this thing. My favorite story in the database world has to be when Oracle bought Sun Microsystems, and with it Java and MySQL. What happened is Monty Widenius forked the open-source project that very day into something called MariaDB.

Can you imagine if you’re Oracle, you go and buy something and then someone says, “I’m going to go ahead and fork it over here,” and you don’t own that? Sorry. MariaDB is a very prominent player in the space too. There are lots of these companies out there. There are some interesting technologies. They’re all purpose-built. We have this whole next generation of database platforms that are designed around analytics and analytic discovery.

This is important because if you’re going to build an analytic app, the discovery side of it is important. Understanding the shape of your data, nature of your data, frequency, speed, velocity, all these different characteristics, knowing what that data looks like and what you can glean from it is important information for being able to design new apps to leverage that data. If someone gives you a big pile of data and you don’t even know what’s in there, trying to build an application on top of that, it’s going to be a pretty sketchy process.
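
As a rough illustration of that discovery step, a minimal profiling pass in Python might look like this (pandas is assumed, and the file name and its columns are purely hypothetical):

import pandas as pd

# Load a sample of the unfamiliar dataset (hypothetical file name).
df = pd.read_csv("events_sample.csv")

# Shape and types: how many rows, and what kind of data each column holds.
print(df.shape)
print(df.dtypes)

# Basic statistics and cardinality: the "nature" of each column.
print(df.describe(include="all"))
print(df.nunique())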

With some of these new tools, we'll talk about Apache Druid and Google Spanner, but we have lots of great companies on: CockroachDB; my old buddy Jim Walker is on the show. Also, Katana Graph, which is a new company. It came out of the University of Texas, if I'm not mistaken, and is doing interesting things. And David Wang from Imply; they're the folks who sit on top of Apache Druid. We've got Jim Walker from Cockroach, Keshav Pingali from Katana Graph and David Wang from Imply. Let's go around the room. Jim Walker, tell us a bit about what you've been working on at Cockroach and what's going on in your world.

[00:03:54] Jim: First of all, thank you for having me. I’m happy to be talking to you again on this show. It’s been several years since we last spoke. Thanks for having me back. I’m the Principal Product Evangelist at Cockroach Labs, which means I go out and talk a lot. This is a perfect thing for me. Maybe I’ll have a radio show one day. Cockroach is an interesting company. I had been working there for years. When I saw what was happening here, I couldn’t unsee it. When I see technologies that are like that, I’m wildly attracted to them.

What the team is doing here is rebuilding and rearchitecting the entirety of the database, minus the language, because I still think SQL is paramount. Our grandkids are all going to be speaking SQL one day. Let's not reinvent SQL, guys. Let's make sure that the execution layer and the storage layer of the database, which is where the magic happens, are cloud-native, distributed and can run across an entire planet. How do you do that? You look into the past at what some of the leaders have done, and you use that as inspiration to go out and build what you want to build.

CockroachDB was born several years ago. Before you ask me what the name means: the founders were working in New York City when they started building the repo, the open-source Git repo, and there's the ubiquitous cockroach of New York City. It's a database you can't kill and it will sprawl across the whole planet. It's an operational, relational database, something like PostgreSQL but with a different name.
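
One practical consequence of that PostgreSQL resemblance is that an ordinary Postgres driver can talk to CockroachDB over the PostgreSQL wire protocol. A minimal sketch, assuming an insecure local cluster and hypothetical connection details:

import psycopg2

# CockroachDB accepts PostgreSQL clients; 26257 is its default port.
conn = psycopg2.connect(
    host="localhost", port=26257,
    user="root", dbname="defaultdb", sslmode="disable",
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
    cur.execute("INSERT INTO accounts VALUES (1, 100) ON CONFLICT (id) DO NOTHING")
    cur.execute("SELECT id, balance FROM accounts")
    print(cur.fetchall())
conn.close()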

[00:05:22] Eric: It was inspired by Google Spanner but it's open source. Google Spanner is from Google, so it's tied to one cloud environment. You had a vision at Cockroach of having a database that is cloud-native but does not favor one cloud environment over another.

[00:05:40] Jim: The three founders here were all in the employee-300 range at Google. They were responsible for building Colossus, which is the backend file system for Google, and Google Reader. When they left, they started a company and were frustrated that they didn't have the same tools that they had internally at Google.

Spanner is one of those things: a relational, operational, transactional database that can't be killed, that's truly distributed and can span the whole planet. There was nothing like it. The Spanner white paper is out there and available. They took inspiration from that and started coding. The person who opened the first bracket on this code base was Spencer Kimball. Here we are years later and it's truly a groundbreaking, phenomenal piece of technology.

[00:06:26] Eric: I interviewed Spencer Kimball in the second year of this enterprise. He is a very smart guy. What I love so much about this industry is that when you're creating software, even a database platform, it's just code. You're writing code and building something with a vision for the future. Look around: if there's one space that has more innovation than you can track, it's databases, and open source as well. The database is the foundation of the information economy.

[00:06:57] Jim: We used to think about the OLTP market, the transactional database, and the OLAP market: one was big and the other one was different. To me, the future of that collapses into one thing because it's all data. The pandemic killed the concept of an analog business. If you weren't online at that moment, what were you doing? It was pretty hard to have an analog business at that time. That was the death knell of the analog business. Everything is digital. If you're going to be digital, that's predicated on data, and that's the equation.

[00:07:29] Eric: Let’s get our next guest in. We also have David Wang from Imply. I’m very impressed with these folks and what they’re doing. I was alluding to you guys in the opening when I talked about the importance of the discovery side. Jim did a good job hinting at different kinds of databases. You had a transactional database, which is row-based that’s for doing transactions, like purchasing stuff from stores, you want to know what the address is and all that information. You have analytical databases that tend to be column-oriented. It’s easier to compress columns because all the data is the same.

There are these various constraints or factors that go into the design but then there’s Apache Druid that comes along. One of my favorite lines was from a play by Edward Albee called The Zoo Story. There’s a great line in it where the character says, “Sometimes it’s necessary to go a long distance out of the way to come back a short distance correctly.” If you look at the database world, it’s very applicable in terms of how we design things. These new database technologies can learn from all the mistakes of the past. With that, David Wang, tell us a bit about yourself and Imply.

[00:08:37] David: Thank you so much, Eric. I appreciate the opportunity to be on your show. I'm part of Imply. Imply is a company founded by the creators of Apache Druid. It's a real-time analytics database, created back in 2010 by a handful of guys who got together. They were working at an Adtech startup and they had a problem. They were trying to build an analytic application. This is very different from what you think of as BI.

BI is the world of reports and executive dashboards, looking predominantly at historical data. When they were at this Adtech startup, they recognized the opportunity to build an application. Application means interactivity, lots of concurrent users and operational uses on streaming events. They recognized the opportunity to build a whole new database. It's a database that they open-sourced back in 2012, and it was contributed to the Apache Foundation after that. They incorporated as a company in 2015.

Between the time they open-sourced it in 2012 and 2015, they saw this traction where a bunch of companies like Netflix, Pinterest, Airbnb and others were recognizing the opportunity to also go beyond BI and build analytic applications, developer-built apps that are pushing the direction forward in terms of how organizations leverage data and analytics. They're building out these whole new use cases. The growth of Apache Druid has been phenomenal over the last several years because of this market trend in how folks are thinking: "BI is important but there's a whole new opportunity to take analytics even further." We're excited about it.

[00:10:18] Eric: What’s interesting is that we were coming back together. If you think about analytics and business intelligence and I’ve been in this business a long time, tracking the space, it was this offline process where you pull all your data out, put it into a data warehouse, then you do your analytics on top of that. You learn something. Once you learn something, you pick up the phone and say, “We should do a program around X, Y or Z.”

There’s a lot of latency in that continuum and time that gets lost between the time the data is captured to the time I do something with it and change my business. What you’re talking about is collapsing that latency into a moment where the analytics and insights can fuel what the application is telling you. That’s a much more dynamic and engaging environment than the old way of doing business intelligence.

[00:11:06] David: I love the way you put that, because the buzzword "real-time analytics" has been around for a while. If you unpack what real-time analytics means, it is two parts. One has to do with how fast your queries are. Can you do a very large aggregation, a group-by, and get it back in a sub-second? That's real-time. The point that you're bringing up is the notion of when data is created and the ability to analyze it a split second later. That's real-time from the standpoint of data freshness. If you have data freshness and fast query responses, you've got Apache Druid.
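
For the "large aggregation, group-by, sub-second" part, a rough sketch of issuing a SQL group-by against Druid's HTTP SQL endpoint might look like this (the router address and the datasource name are assumptions):

import requests

# Druid exposes a SQL API on the router, which listens on port 8888 by default.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# The "clickstream" datasource is hypothetical; __time is Druid's time column.
query = """
SELECT country, COUNT(*) AS events, SUM(amount) AS total
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY events DESC
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=10)
resp.raise_for_status()
for row in resp.json():
    print(row)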


It’s not even druid-like pushing an agenda into the market. It’s a fact that organizations are adopting streaming pipelines. They’re moving away from the days of ETL and creating events. The whole world is increasingly becoming event-driven. That’s because you got folks like Apache Kafka and Amazon who can use this that are saying, “If the world is event-driven, then what do you do after you create the event? After you create the pipeline, what are you going to do at that event?” That’s what becomes interesting for analysis.

[00:12:07] Eric: What business is all about is making decisions, taking new courses of action, changing things and doing something that the data is fueling. We’re past just data-driven. We’re analytics-driven now. Last but not least, Keshav Pingali. He is waiting out there in the wings. The University of Texas, “Hook ’em Horns.” I’m a big Texas fan. I wish I were in Texas. I love Texas. Tell us a bit about yourself and what Katana Graph is doing.

[00:12:36] Keshav: Thanks, Eric. It's a great pleasure to be here. Thank you for inviting me. It's also a pleasure to be on this panel with some very distinguished people, Jim and David. I know you guys by reputation. It's a pleasure to get to meet you and talk with you. I'm the CEO of Katana Graph. I'm also a professor in the Computer Science Department at the University of Texas at Austin. Katana Graph is different from what Jim and David talked about. We are building a graph intelligence platform. It provides graph processing, AI and analytics across a wide range of industries and innovative applications.

I’ll mention two things that are different from what we’ve heard about it so far. The first is our focus is on data that can be represented as very large knowledge graphs or property graphs. Usually, when you talk about big data and databases, you’re thinking in terms of relations tables and then you process those using languages like SQL. What we find is the data that’s being created is heterogeneous. The entities are all of the different types. What you need to exploit is the relationships between those entities. That is what we represent as a graph because edges in the graph represent relationships between entities.


The second thing that I want to emphasize is that the analytics we do are predictive rather than descriptive. In predictive analytics, you take all the data that you have, build a model using AI or machine learning and then do inference using that model. Our emphasis is on making that inference and training, the graph AI and machine learning, as efficient as possible.

[00:14:36] Eric: We should talk about graph databases versus some of these other types, because a lot of people think database, rows and columns, an Excel spreadsheet, and you could have multiple spreadsheets and tables. That's like a relational database pointing between different tables, for example. A graph is a much different entity. It relies on nodes and edges. Graphs became very popular because of how well they serve the social media space.

If you look at LinkedIn, Twitter, Facebook and all these other engines, it’s important to know who knows whom and where the centers of attention are or who the influencers are, for example. That’s why the graph is very good at discovering relationships between things. I noticed that all these people like to buy a certain kind of car. That’s an edge now that I’m going to manage. I’m going to track that over time or all these people tend to go to this certain location. There are a lot of interesting insights that you can generate. You mentioned knowledge graphs, which can be an underpinning technology for all kinds of different applications. Talk quickly about nodes, edges and how you populate a graph database.

[00:15:44] Keshav: We can read data from CSV files, for example. You just need to tell us what the nodes are, what the properties on the nodes are, what the edges are and what the properties on the edges are. We read them into our internal representation for the graph. The internal representation captures the topology of the graph, in other words, which nodes are connected to which other nodes by edges. Very importantly, the nodes and the edges carry a lot of property data. That's where most of the storage is taken up, rather than by the topology, in most of the applications that we have seen.
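
A minimal sketch of that loading step, assuming two hypothetical CSV files (nodes.csv with an id column plus node properties, edges.csv with src and dst columns plus edge properties), could look like this:

import pandas as pd
import networkx as nx

nodes = pd.read_csv("nodes.csv")   # columns: id, plus any node properties
edges = pd.read_csv("edges.csv")   # columns: src, dst, plus any edge properties

g = nx.DiGraph()
for _, row in nodes.iterrows():
    props = row.drop("id").to_dict()            # everything else is property data
    g.add_node(row["id"], **props)
for _, row in edges.iterrows():
    props = row.drop(["src", "dst"]).to_dict()
    g.add_edge(row["src"], row["dst"], **props)

# The topology is the node/edge structure; most storage goes to the property data.
print(g.number_of_nodes(), g.number_of_edges())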

[00:16:20] Eric: Graphs are very lean in that sense. They’re very lightweight compared to some of the other traditional database types. They’re great for discovery. You have a discovery process, you want to design some predictive model, for example, and then you deploy your predictive model. All these factors come into choosing what database you want.

[00:19:18] Eric: It’s a wonderful world out there for database technologies. I often tell the story of when I interviewed Dr. Michael Stonebraker a long time ago when I was working for the Data Warehousing Institute and he was pushing something he called, “One size does not fit all.” He came out and said that the relational database for various reasons had won the day. This is back in 2005.

He said, “Relational database is not optimal for all use cases.” Frankly, the more that our economy evolves and the more that the information economy prevails and takes hold of international commerce, the more we are going to rely on data and database technologies or data fabric technologies. There are lots of old databases that have been around for a long time and they’re good at doing certain things.

What happens? The keyword to watch out for is scale. Once you scale out, a whole set of different challenges comes to the fore. It's very difficult to deal with that stuff. That's why Cassandra had to come into play for Facebook. It's why a lot of these databases were designed in the first place: people realized the existing options were just not going to cut the mustard. We're not going to be able to do what we want to do unless we come up with a new platform. With that, it's a great lead-in to Jim Walker from CockroachDB. Tell us a bit about the vision from Cockroach and where you are in that continuum.

[00:20:37] Jim: I love that intro. Look at this interview. There are three of us, all with different databases. Honestly, every application is composed of different workloads, and each workload has its requirements for what you need to do with data. It's important for people to understand that. Are you going to just use PostgreSQL for everything? No. You're going to use it for transactions, where data needs to be correct and you need the isolation of transactions. This word ACID is funny. We all know what it is, but there's this I in ACID that gets tricky for databases and the people who implement these things. It's about understanding the tool that you need for the workload that contributes to your application.

Any application is composed of lots of different databases, in my opinion. That's what I see most people who are doing the right thing doing. If you think about real-time analytics, like David was talking about, it's, "We're going to publish from a relational database. We'll publish a change feed to a Kafka sink, and it gets picked up by Kafka." There's this complete thread of data and different things you need to accomplish with it. To reaffirm that, it's important. There's a lot of room and it's a big market in DBMS.
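
As a sketch of that publish-a-change-feed pattern, CockroachDB can emit a changefeed from a table into a Kafka sink; issuing it through a Postgres driver might look roughly like this (the table name and broker address are hypothetical, and changefeeds have to be enabled on the cluster):

import psycopg2

conn = psycopg2.connect(host="localhost", port=26257, user="root",
                        dbname="defaultdb", sslmode="disable")
conn.autocommit = True
with conn.cursor() as cur:
    # Stream every change to the orders table into a Kafka topic.
    cur.execute(
        "CREATE CHANGEFEED FOR TABLE orders "
        "INTO 'kafka://localhost:9092' "
        "WITH updated, resolved"
    )
conn.close()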

You talk about scale as well. People often think of the problem of scale as, "My database got big. The storage is big. What do I do?" "Throw a bigger machine at it." "I have to handle 1.5 million transactions per second." "Shard that database." Sharding a database is not fun. Anybody who has ever said it is fun has woefully misunderstood it. When we think about that, the size is important, but there's a third vector of scale as well.

What happens when you have transactions that are happening all over the planet? What happens when I have accounts for a bank running in New York and I have somebody trying to access that account in Singapore at the same time? Two transactions are happening. Who wins? Are they all going to travel the 800 milliseconds back and forth a couple of times to commit a transaction?

Your customers or consumers simply will not wait. If there's any delay and you had to wait two or five seconds for something to happen, you'd worry that something went wrong; any of us on this show would. You can't beat the speed of light. It's no joke. We have not beaten that thing yet. I would love to be here when we do, because it would probably be a cool moment in time for humans. We have to change the way we think about the underlying infrastructure that powers our applications, because every business is global from the day it starts.
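
A quick back-of-the-envelope calculation shows why the speed of light bites here (the distance and fiber speed are rough assumptions):

# Roughly 15,000 km from New York to Singapore over the surface,
# and light in optical fiber travels at roughly 200,000 km per second.
distance_km = 15_000
fiber_speed_km_per_s = 200_000

one_way_ms = distance_km / fiber_speed_km_per_s * 1000
round_trip_ms = 2 * one_way_ms
print(f"one way:    {one_way_ms:.0f} ms")     # ~75 ms
print(f"round trip: {round_trip_ms:.0f} ms")  # ~150 ms
# A transaction protocol that needs several round trips quickly adds up
# to the hundreds of milliseconds Jim is describing.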

There’s a scale in terms of volume or size of database or volume of transactions but this geographic scale is one of these things important. The Cockroach is there to simplify the first two, “Let’s make it simple to scale up a database even in a single region. Let’s take the concept of charting eliminated. Let’s take the concept of an active-passive system so that when it fails over to something, let’s make everything active. More importantly, let’s do it so that can have guaranteed transactions that global scale if you need to go there.” That’s what we’re predicated on.

[00:23:37] Eric: You had a bunch of great content in that one little line, and a couple of great quotes that I've already pulled out to tweet about. ACID transactions: Atomicity, Consistency, Isolation and Durability. To put that in context for our readers, take consistency. If you look at the NoSQL databases, for example, Cassandra, they are like, "We can relax the consistency a bit. We don't need to worry about that too much right now. What we want is super fast reads." Or maybe it was writes. I can't remember.

The point is databases do different things. They read, write and persist. They have different mechanisms for locking and doing all kinds of other stuff. What we’ve seen is an explosion of database technologies that are focusing on certain parts of the components or the picture to be able to get massive performance in certain ways.

[00:24:29] Jim: Why do we have great graph databases to shop for? Relationships are a different concept within a database, and you don't want to use a relational database to do that. It's not going to perform well. The concept of a graph database evolved because I need to think about the data in a different way, in terms of the relationships in the data. Or take an analytical database. We're asking, "Is it optimized for vectorized queries?"


They are, because I want to think about columns and not rows. Is there a time and place where these things start to converge? I don't think there's ever one solution for all the things, but I do believe that we're all moving toward the middle a little bit: the data warehouse-type technologies are adding a few more transactional things, and the relational databases are adding more analytical-type capabilities. That's the way we look at it.

To me, that’s the pragmatic way of thinking about these things because there is a different solution for the workload and whatever your requirements are. For us, you need something durable. You are focused on business continuity. You want to reduce the operational overhead of scale and managing the database. That’s us. We’re going to redream this for the cloud and allow you to take advantage of the ubiquity and the always-on nature and the scale that the cloud promises but do it without the operational overhead. That’s what we’re all about.

[00:25:43] Eric: You referenced the graph there. I’ll throw it over to Keshav to talk about where you folks are going. As Jim said, there are all these different use cases there. Some of these databases, I won’t mention any names here, but they’re trying to be like a Swiss Army knife, which is great. A Swiss Army knife could do lots of different things but it can’t do certain things extremely well where if you want a big hunting knife because you’re going to be cleaning fish and doing big hunter things out there in the woods, you need a separate tool. That’s what we’re talking about. You want to understand, “What is the job I’m trying to accomplish? What database can I use in terms of performance, functionality, features or whatever that is going to get that job done?”

[00:26:22] Keshav: That’s a very good way to put it. One way to explain some of these differences between different databases that Jim was talking about was to look at a few use cases that we believe graphs are very well suited forward. We have worked with BAE systems to build a solution for them for doing intrusion detection in computer networks. You’re running a big computer network. There are good guys using it. There are also bad guys trying to break in. We want to catch the bad guys as quickly as possible. The way they wanted to do it was by building what they called interaction graphs where the nodes represent users of the system as well as resources like files, I/O ports and so on. If I send you an email for example, then there’s an edge that’s created between my node and your node.

What they wanted to do was to find what I call forbidden patterns. For example, if every communication between the two of us has to go through Jim, then you can ask, "Is there a path in this graph from my node to your node that doesn't contain Jim's node?" If so, that's a forbidden pattern. You raise the alarm and a human operator steps in. When you have data that has entities and relationships between entities, and you're looking for patterns within that, that's the kind of thing graph technologies are well suited for.
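
A toy version of that forbidden-pattern check, using networkx on a small interaction graph (the names are reused from the conversation purely as placeholders), might look like this:

import networkx as nx

# Interaction graph: an edge means a direct communication happened.
g = nx.DiGraph()
g.add_edges_from([
    ("keshav", "jim"), ("jim", "eric"),      # the allowed route goes through jim
    ("keshav", "david"), ("david", "eric"),  # a route that bypasses jim
])

def forbidden_path_exists(graph, src, dst, required_node):
    """True if some path from src to dst avoids the required intermediary."""
    pruned = graph.copy()
    pruned.remove_node(required_node)
    return pruned.has_node(src) and pruned.has_node(dst) and nx.has_path(pruned, src, dst)

print(forbidden_path_exists(g, "keshav", "eric", "jim"))  # True -> raise the alarm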

You can do it in principle using other kinds of databases, SQL and so on, but it's much more difficult to do that computation. It's also not as efficient. I agree with Jim when he said you have to look at the use cases and the data. Some things are best done in the relational model. Other things, for example, this pattern finding in graph data, are better done, in my opinion, with graphs.

[00:28:11] Eric: Let’s talk about hops too because as you’re traversing the graph, you do these hops. The best way to describe this is the classic The Six Degrees Of Separation with Kevin Bacon. A lot of people have seen that. The joke is that Kevin Bacon has been in many movies. The question is, “How many degrees of separation are you from Kevin Bacon based on movies you’ve been in?” Most people are 1 or 2 because he’s been in many movies. You think, “If I’ve been in a movie with Kevin Bacon and all these other people, they’ve been in movies too and other movies. How many hops do you have to take to get to Kevin Bacon?” Talk about hops and why it’s important to be able to do 5, 6 and 7 and not just 2 or 3 efficiently.

[00:28:57] Keshav: I’ll give you another example. Take the Facebook friends graph. The average person has about 300 friends. In the Facebook friends graph, the nodes represent users of Facebook and then there’s an edge between two users if they happen to be friends. Mark Zuckerberg is not one of my friends on Facebook but is he a friend of a friend? How far out do I have to go from my social network before I get to Mark Zuckerberg? Facebook has about 3 to 4 billion users. The average person has only 300 friends. Most people guess that you have to go out maybe 10, 15 or 20 hops before you encounter Mark Zuckerberg.

It turns out that the average distance in the Facebook friends graph is only about 4.5 or so. In other words, within about four hops, you get to just about everybody in the network. That's because there are these very high-degree nodes that have lots of edges. Those are the concentrators, as they call them. It's Kevin Bacon in the movie example, because he's been in a lot of movies with lots of other people, and on Facebook, there are people like Zuckerberg and Elon Musk who have lots of friends. Once you hit one of those nodes during your traversal, you get connected to lots of people.
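
The "how many hops" question is just a shortest-path (breadth-first) query over the graph. A tiny sketch with networkx, using a made-up friendship graph with one high-degree concentrator node:

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("you", "friend_a"),
    ("friend_a", "friend_b"),
    ("friend_b", "concentrator"),
    # The concentrator is connected to a huge number of other users.
    *[("concentrator", f"user_{i}") for i in range(1000)],
])

# Number of hops from you to an arbitrary distant user.
print(nx.shortest_path_length(g, "you", "user_999"))  # 4 hops, via the concentrator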

Multi-hop queries, why are they useful? I can go back to that example of intrusion detection in computer networks. If we say, "Communications between the two of us have to go through Jim," we don't know whether that communication was a single email message, or whether maybe I sent a message to David and then asked him to forward it to you. If so, that would be a forbidden pattern too, because the path doesn't contain Jim. You have to look at arbitrarily long paths in this interaction graph to catch the case where Jim was not involved.

[00:30:56] Eric: We've got to get David into this segment too. I'll throw it over to you, David Wang, to talk about some of the other use cases that you are addressing. Ad networks are a good area for you because there is so much data. Understanding that data and driving analysis are important to the decisions you're going to make.

[00:31:16] David: You mentioned scale earlier. If you think about the notion of scale, there are a couple of dimensions to consider. You want to consider the ability of a system to handle the scale, but what is the byproduct? What about performance? What about latency? What about cost, even, as a dimension? Think about use cases where you have a lot of data: telemetry, Adtech, cloud services and SaaS products. When you're looking at all the telemetry that's coming in and has to be analyzed, you hit a point of scale where traditional BI-based data warehouses start incurring the spinning wheel of death, where you're waiting seconds to minutes.

Let’s say, you’re the CEO asking for a report. You go to the data analyst to get that report. It takes two minutes to get the report generated. Does anybody care? Probably not because it’s going to send the inbox of the data analyst and then the CEO. When you’re in an operational state, for example, Salesforce, a nice customer of ours and a big user, Apache Druid, are in an operational business where they have to maintain their cloud service always on, fast and the best Saas product out there, Salesforce has to mine through trillions of log lines consecutively and interactively where its product management, supports teams and engineers are constantly going through issuing queries in a high concurrent fashion across one trillion rows of data.

They do that while their ingestion pipeline is bringing in three million events per second, because their cloud service is multitenant and large. This notion of scale is a scale of events coming in, of the number of users trying to interact with the database and of the amount of data. How do you do all that and still get back a sub-second response, so that the person who's trying to get insights isn't sitting there waiting but can ask a question and get an answer? Scalability is huge. That is one of the most important driving factors in why people choose Druid.

[00:33:23] Eric: It is a line of demarcation. You must scale to survive.

[00:35:46] Eric: We got some salty veterans in the show. It’s always fun talking to old pals. We were chatting about service orientation and the cloud. David Wang from Imply, you had a good point about that. Talk about service orientation and how it’s baked into how we do things in the cloud.

[00:36:03] David: If you think about a developer, they want choice, flexibility and ease. The cloud has provided the agility and the ease of being able to pick the services that they need to build the apps that they're trying to build. In this whole notion of "a database for all seasons," the notion of a spork, one database to rule them all, is a fallacy. Nobody wants a spork. They want the right tool for the job. The notion of the cloud becomes very intriguing because, in the cloud, the developer has choice, ease and consumption. If you're going to build an application that has transaction processing, maybe some BI reporting, but you also want to power an analytics use case for that app, you can choose multiple databases. You don't have the pain that you used to have back on-premises. That's cool.

[00:36:56] Eric: Speed to production from the idea is important. You can't be taking 6, 8 or 12 weeks to figure something out and either deploy a new feature or understand why network traffic has changed dramatically. You've got to be able to iterate very quickly and try things. The whole fail-fast, fail-forward concept: maybe it doesn't work, but at least let's see if it works. If it doesn't, we'll try something else, and that has to be a very short cycle time.

[00:37:26] David: Even if you think about Apache Druid, it is open-source software. Developers can learn the technology, the architecture and the different distributed processes, build out a node or a cluster and run with it, or they can consume our cloud service at Imply, called Polaris. Then they don't have to learn any of those constructs. They can just get the performance that they need to power their app. You're using Druid and SQL, but not the underlying mechanics. That allows the organization to go fast. That's what it's all about.


[00:37:56] Eric: Jim Walker from CockroachDB, you mentioned, “The new battleground is in the cloud.” That’s very true. I’m amused by the fact that we can thank Microsoft for saving us from the monopoly of Amazon Web Services. That’s ironic. What do you think about all that?

[00:38:27] Jim: To build on something that David was saying, the way that people consume software is as a service. That's flat-out going to be what it is for the next however long we're all in this game. If you look at database software in 2021, Gartner says the market from 2020 to 2021 was up by $15 billion in spend. That represented roughly a 25% or 30% increase. Of that, database as a service accounted for 60% of the increase.

If you look at that, that’s a fundamental shift. They also track numbers. We’re seeing AWS and GCP landing in the top five database vendors on the planet. This is a territory owned by Oracle, Microsoft and IBM for traditionally many years. Ultimately, what people want to do is consume software as a service. That is the way that people work because people want choice, autonomy and move. That’s the funny thing. We talk a little bit about open source. I’ve been an open source for quite a long time as well. I’ve been building open-source companies and doing this commercial open-source software for a long time.

Open source is not just a license. The OSI has approved licenses and that's great. We've lived in the Apache world. The people who fork those licenses and make little changes to protect the business behind these things are valid, because open source is not just a license. It is code: is that code open? It is a community: are there people working on it?

The third and most important part is the consumption model. Back in the day, you downloaded the bits, built the thing and ran it yourself. Consumption in 2022 has changed. It is by default a service. The companies that are doing that well and delivering a great service, and AWS does a good job at that, are the ones that are winning larger pieces of this data market.

[00:40:21] Eric: On-prem, it’s going to be around for a long time. I don’t think it’s going away. I’ve already seen some interesting developments of ways to leverage on-prem in different ways and hybrid environments but the cloud is the future for lots of different reasons. You got all these engineers focused on making that thing run fast. I want to rent it.

In my first conversation ever with Dr. Robin Bloor, with whom I founded this company many years ago, we were talking about being able to rent capabilities as opposed to buying things. I said, "Imagine if you could rent an F/A-18 Hornet for 30 minutes to go blow up your friend's house or something. Wouldn't that be nice? You don't have to spend all the money to build it, maintain it and operate it. I'll rent it for 30 minutes to go do something."

That’s what we’re looking at with these technologies. You can rent a small space in Amazon, Google, Microsoft or wherever and use TensorFlow, PyTorch or all these incredibly powerful technologies. It’s up to you to execute to get the job done. It’s all about knowing which tools to use and where, knowing your business model and what you’re trying to accomplish. If you could do that, the possibilities are endless. Keshav, what do you think?

[00:41:32] Keshav: I agree with you and what Jim was saying, which is the future is in the cloud but if we look at our customers at Katana Graph, we still have a lot of customers who are on-prem and they’re not planning to move to the cloud anytime soon. That’s because they’re worried about data security and those sorts of issues. I agree with what you and Jim were saying that instead of having your data center, maintaining it and all of that, it is much easier to rent some space in AWS or GCP and so on. There are a lot of banks that don’t want to let go of their data. They want that to be on-prem. We do run on the cloud. Katana is available on AWS and GCP but we also have a lot of on-prem customers.


One of the things we’ve noticed within some of these big banks is there are many different groups within these banks. Some groups are on-prem and not moving anywhere else. Whereas other groups are already exploring the possibility of moving to the cloud. Things are a bit mixed up as you were saying. All these companies are generating the data on the cloud itself. Online payment processes, for example, their data is already on the cloud and they’re natural for this kind of service-oriented architecture that you are talking about.

[00:42:59] Eric: A topic that we should probably save for the bonus segment, because it's not a small topic, is to persist or not to persist. In the database world, you think, "Persist first. I'm going to grab some data and persist it in my database, then my applications are going to use it." There's also this whole thing called streaming data. David from Imply, what's your take on this whole new dynamic of streaming data and leveraging it before you've even persisted it anywhere? What do you think about that?

[00:43:28] David: The notion of streaming, also known as real-time data or events, is pretty foundational. We have this construct at Imply called the Event Data Architecture. Think about the traditional constructs of OLTP, like transactions: the notion of a transaction itself is about updating state. Think of a banking transaction where the account balance went down and up. It's the state that changed.

When you think about events and streaming data, the paradigm is about appending. You're adding more data and more streams to the flow. Therefore, when it comes to analyzing it, the whole architecture changes. The way that you analyze streams and process them to drive some type of insight changes. It requires fundamentally different data architectures.
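
A minimal sketch of that contrast, a transactional state update versus an append-only event stream whose state is derived by replaying the events:

# Transactional view: a single piece of state that is updated in place.
account = {"balance": 100}
account["balance"] -= 30           # the previous value is gone once this commits

# Event view: nothing is overwritten; new facts are only ever appended.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdrawal", "amount": 30},
]

# State is derived by folding over the stream, and can be recomputed at any time,
# for any point in time, which is what makes streams so friendly to analytics.
balance = sum(e["amount"] if e["type"] == "deposit" else -e["amount"] for e in events)
print(account["balance"], balance)  # 70 70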

[00:44:19] Eric: It does require a whole different view and architecture. We could do a whole episode on data and information architecture. What David is talking about here is that you've got to understand the cycle time of your business and the decisions that you're making. Is it daily or weekly? Maybe you don't need some of these technologies, but if you're running ad networks, the numbers in some of these use cases are mind-blowing. You ask yourself, "Is that even possible?" Yes, it's possible and it's happening.

Think about how impatient we get. I'm trying to log into Google Docs, and if it spins for a second, I'm like, "Get me into my data." They've done a masterful job of being able to deliver this service. I'll never not be impressed by what Google did with Google Docs, broadsiding Microsoft in their bread basket of office documents, workflow and basic stuff. They kicked butt there. The show's bonus segment is up next. You can always email me at [email protected]. I always love learning from these intelligent people and these fantastic conversations.

[00:45:30] Eric: It’s time for the show’s bonus segment here talking all things database. We’ve got Jim Walker of Cockroach, David Wang from Imply and Keshav Pingali from Katana Graph. We were talking about ad networks and you threw out a number there at 300 milliseconds.

[00:45:45] David: Don’t quote me on it but I remember reading that the publishers have in real-time bidding, you have 300 milliseconds for when you have to be able to serve the right ad based on the person who came to the webpage, choose the right ad based on the bid auction and publish it is 300 milliseconds. There is an incredible level of analytics and decisions have to go into what’s the right ad at the right time. There isn’t much time. That speaks to the speed at which analytics is moving.

[00:46:13] Eric: You’re not going to be able to use an access database to fulfill that particular use case. Jim Walker, go ahead.

[00:46:23] Jim: It’s the speed of transactions too. You’re seeing them on the analytical side. I’m seeing them on the transactional side. I don’t know the guy who invented Gmail but he came up with this concept called The 100-Millisecond Rule. It goes back to the cockpit of a fighter jet. If something takes longer than 100 milliseconds, a human can discern there’s some delay. When you’re talking about the transactions, millions of dollars could be made in that. Those improvements and the stuff that they work on are awesome.

With the transaction, we were talking about state like, “This state needs to be updated super fast. This stuff is going to happen in Sydney and New York at the same time. We cannot beat the speed of light. How does that work?” There are interesting software engineering problems that we’re all solving. I find it to be one of the most amazing times to be involved in tech because of the challenges that are in front of us like this.

[00:47:13] Eric: I’ll throw this over to Keshav to comment on. What a fun time to be around. I’m glad that I could see the whole history of how we got here but past this prologue, going forward, it is exciting what can be done because number 1) You can work from anywhere. That’s a key component. We talked about that in the pre-show. Number 2) You can focus on your passion.

Part of what keeps people going when they love their jobs is that they love their jobs and they want to do that kind of thing. Historically, it was hard to just follow your passion. You had to be good at what you were doing many years ago. Now, because you can use the web to get to any place you want and learn about anything you want, you can follow your passion in creative, interesting ways and do it for any number of types of jobs. The possibilities are endless. It’s exciting. You guys are playing a role in that.

[00:48:10] Keshav: Let me elaborate a bit on that. What you talked about is the fact that you could be sitting anywhere in the world and access data or information that could be on the other side of the world. That is ubiquity in space, but there's another important component, which is that analytics has largely been restricted to descriptive analytics: what has happened in the past? You can query that, but we are now at a point where we can do predictive analytics.

You can talk about what might happen in the future, because you're taking all the data that you have about what happened in the past, building a model and then doing predictions based on that. It's not only spatial, that you can move anywhere in space; you can also move back and forth in time. That is what is exciting about the new wave of analytics, like the kind we are doing at Katana.
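
A minimal sketch of that descriptive-to-predictive step, training a model on past data and then doing inference on unseen cases (scikit-learn, with toy made-up numbers):

from sklearn.linear_model import LogisticRegression

# Historical (descriptive) data: features observed in the past, with known outcomes.
past_features = [[1, 20], [2, 35], [3, 50], [4, 70], [5, 90]]
past_outcomes = [0, 0, 1, 1, 1]

# Build a model from the past...
model = LogisticRegression()
model.fit(past_features, past_outcomes)

# ...then do inference about cases we have not seen yet: predictive analytics.
print(model.predict([[2, 40], [5, 80]]))
print(model.predict_proba([[2, 40], [5, 80]]))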

[00:49:01] Eric: David Wang, I’ll give it over to you for final comments. What’s coming down the pike? What are your thoughts about the trajectory of the database base?

[00:49:09] David: You brought up open source. Open source wins, hands down. If you look at the top five databases in the world, they're all open-source technologies. As we look at how people want to consume the types of technologies they use in their data infrastructure, it's open source. That's where I would end. At Imply, we're big fans of open-source tech. Our value-add is not about building proprietary widgets within the database itself; it's about the service. That's what people are liking.

[00:49:36] Eric: Look these folks up online. We've been talking to Jim Walker from CockroachDB, David Wang from Imply and Keshav Pingali from Katana Graph. We'll talk to you next time.

 
