Learning Curve? Understanding ML’s Growing Role With Bin Fan, Joshua Rubin, And Ryan Ries


November 15, 2022


From 0-60 in just a few short years, Machine Learning is now pervasive in business. It’s being used by lots of different large and small organizations, whether for optimizing pricing, procurement, or processes. ML algorithms are everywhere, and the ML process is getting easier to understand, from low code to no code. Join Eric Kavanagh as he talks with three other guests who are using machine learning to their advantage. Join VP of Open Source at Alluxio, Inc., Bin Fan, Director of Data Science at Fiddler.ai, Joshua Rubin, and Practice Lead at Mission Cloud Services, Ryan Ries. Find out how machine learning is growing by checking out this episode of DM Radio!


[00:00:41] Eric: We have a wonderful all-star lineup now. We are going to talk all about machine learning and understanding ML’s growing role. Machine learning refers to a set of technologies that learn over time and are able to optimize all kinds of things, from pricing to processes. There are many different ways that you can optimize the business. It turns out that we humans are fairly predictable. Once the machine scans the stuff we do, follows it, and understands where we are going, it can be good at knowing what we are going to do next. That is why machine learning is such a powerful technology these days.

It needs data. You have to train these algorithms to be able to do clever things. In order to do that, you need some platform. Even there, the technologies are getting better in terms of making it more business-friendly, low code to no code environments, or low code to pro code, as some people refer to it. The technologies used to deploy these types of algorithms are getting much better.

You look at the big guys like Facebook, LinkedIn, and Instagram. I promise you these folks are using machine learning in all kinds of ways, whether it is to optimize the flow of traffic on their sites or the algorithm for what content to show. They want you to stick around. That is what they use these for. Primarily, it is to understand what you want as a consumer of information and how they can better serve you and keep you rocking and rolling. We are going to talk to several guests. We have Ryan Ries, Bin Fan, and Josh Rubin. I will start with Ryan. Tell us a bit about yourself, your company, and how you are helping organizations leverage machine learning.

[00:02:16] Ryan: I work for a company called Mission Cloud. We are an AWS Premier Partner. We are a full-service integrator. We work with companies wherever they are on their journey. In my particular team, I run the data analytics and machine learning team. We help customers wherever they are in that data stack, whether they are starting out trying to get into the cloud onto AWS and do reporting, working in more advanced analytics, starting to do machine learning, or are already pretty advanced doing machine learning.

One of the things we see, and it is often hard to believe, is that people have built tons of models. They have data scientists on staff and they never get those models into production. I have walked into companies that have ten models sitting on the shelf and they are wondering, “How do we get these models into production to help our business?” We help them move those models from on-premise to AWS. What people who aren’t in the field often don’t realize is that a model is like a living organism. It’s got a lifetime. The data you use to train that model will tell you how frequently you need to retrain.

If you think about real estate, it was going along pretty well. COVID hit and everything changed in the world of real estate. If you were tracking your models and you never updated them once COVID hit, your models would be badly wrong if you were sitting there trying to make investments or draw other insights on what is going on in real estate. That shows you one example of how data changes and drives you to make sure to retrain your models.

A lot of companies are retraining every day because their data are changing frequently. With that, you have to start thinking about MLOps, which is the equivalent of the CI/CD pipeline in DevOps. In cloud-native development, people are no longer looking at, “Let me release an update every six months.” It is now releasing updates daily or hourly. You are able to push out new products all the time. It is the same thing with ML: you push out new models as frequently as you need them to keep the accuracy and performance that you are looking for.

[00:04:21] Eric: That’s a good example. COVID threw off quite a few machine learning algorithms from my understanding because human behavior changed dramatically. Basically, it changed behaviors: where you go, when you go places. All of that changed. That was a real come-to-Jesus moment, to use the expression, for the industry because a lot of companies realized they had to throw out all these different models for predictive capabilities, whether it was for the next best offer, pricing, or whatever. Because the whole world changed, they had to go and retrain all the algorithms.

[00:04:53] Ryan: If you didn’t have that MLOps auto training, you had to hope that the person who trained your model was still on staff, because people move around, and that the tribal knowledge was still there so you could retrain. Every time you retrain, there is always something new that pops up. Some new edge case that you sit there and investigate and try to understand, “What’s this edge case telling me? Is it truly something I need to worry about or is it some noise that is not going to happen?” Retraining is always an adventure.

[00:05:20] Eric: Since you know a lot about this, explain to our readers overfitting the model. This is always a danger with machine learning to overfit a model where it is telling you what you want to hear instead of telling you what is happening and giving you ideas about what to do about it. How does that happen from your experience and how do you avoid that?

[00:05:37] Ryan: There are a lot of things that can happen when you are training on data. Overfitting is one of the big ones that people talk about, where you often will see somebody take a data set and put in huge numbers of samples. They train the model on that and it gives you an impression of, “Your data looks like this.” Some new data comes in and it doesn’t fit that model. It is not necessarily an edge case but it is a case that you didn’t put into your data set. It seems like your model is not working.

One of the things that we talk about a lot with customers is the importance of creating a good representative data set. A lot of people don’t put a lot of effort into creating their data set. They grab a bunch of data like, “Let me grab all the data I have right here for a particular use case.” Often, they will have millions of points, train on them, and you will see overfitting. It also happens when you put more parameters into a model.

Sometimes you will see somebody build a model with 50 parameters. That is one of those other cases where you start getting overfitting. Think about it like a sawtooth line: I put in a huge number of parameters to capture all of those little triangles when an average might be a better fit at the end of the day. Part of that overfitting is you are trying to tweak it so much to get 100% accuracy that you often lose the details that you need for predicting.
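
As a rough illustration of the sawtooth point above, the following Python sketch compares a two-parameter straight line with a "saw line" that passes through every training point. The synthetic data, the split into training and held-out points, and all function names are assumptions made for this example; they do not come from the conversation.

```python
import bisect
import random

random.seed(0)

# Synthetic data (an assumption for illustration): a straight line plus noise.
xs = [i * 0.01 for i in range(2000)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 1.0) for x in xs]

# Even-indexed points are used for training, odd-indexed points are held out.
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]


def fit_line(x_vals, y_vals):
    """Ordinary least squares for y = slope * x + intercept (two parameters)."""
    n = len(x_vals)
    mean_x = sum(x_vals) / n
    mean_y = sum(y_vals) / n
    sxx = sum((x - mean_x) ** 2 for x in x_vals)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_vals, y_vals))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def saw_predict(x):
    """Piecewise-linear 'saw line' through every training point (one knot per sample)."""
    if x <= train_x[0]:
        return train_y[0]
    if x >= train_x[-1]:
        return train_y[-1]
    i = bisect.bisect_right(train_x, x)  # train_x[i - 1] <= x < train_x[i]
    x0, x1 = train_x[i - 1], train_x[i]
    t = (x - x0) / (x1 - x0)
    return train_y[i - 1] + t * (train_y[i] - train_y[i - 1])


def mse(predict, x_vals, y_vals):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(x_vals, y_vals)) / len(x_vals)


slope, intercept = fit_line(train_x, train_y)
line_predict = lambda x: slope * x + intercept

saw_train_mse = mse(saw_predict, train_x, train_y)
saw_test_mse = mse(saw_predict, test_x, test_y)
line_train_mse = mse(line_predict, train_x, train_y)
line_test_mse = mse(line_predict, test_x, test_y)

print(f"saw line : train={saw_train_mse:.3f} test={saw_test_mse:.3f}")
print(f"two-param: train={line_train_mse:.3f} test={line_test_mse:.3f}")
```

The saw line reproduces the training points exactly, yet the simpler line does better on the held-out points, which is the pattern Ryan describes.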

[00:07:08] Eric: Before we bring in our next guest, one last question I will throw at you, which I also find interesting is this process of introducing some noise intentionally or some people refer to it as kicking the models a little bit. What can you say about trying to throw the models some data that is going to change things a little bit, almost like you used to hit the side of your TV back in the old days? What do you think about that?

[00:07:31] Ryan: It depends on the type of model you are trying to do. Some people are doing that more on your deep learning models because they are oftentimes black boxes and no one quite understands how the black box works. If you made a predictive model, people are already trying to take into account those differences you would have by constructing your data set and doing things like seven folds. That is where you break up your data set into a bunch of different pieces, fit each of the pieces and then come back and look at, “How is my model?” Look across all of these different pieces as you fit them.
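
The "seven folds" idea is usually called k-fold cross-validation. The sketch below is one minimal way to do it in Python, assuming the standard form in which each piece is held out once while the model is fit on the remaining pieces. The toy data, the straight-line model, and the helper names are all hypothetical.

```python
import random

random.seed(1)

# Toy data set (an assumption for illustration): a noisy straight line.
xs = [i * 0.1 for i in range(70)]
ys = [3.0 * x - 2.0 + random.gauss(0.0, 0.5) for x in xs]


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x_vals, y_vals):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x_vals)
    mean_x = sum(x_vals) / n
    mean_y = sum(y_vals) / n
    sxx = sum((x - mean_x) ** 2 for x in x_vals)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_vals, y_vals))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(x_vals, y_vals, k=7):
    """Return one held-out mean squared error per fold."""
    scores = []
    for held_out in k_fold_indices(len(x_vals), k):
        held = set(held_out)
        tr_x = [x for i, x in enumerate(x_vals) if i not in held]
        tr_y = [y for i, y in enumerate(y_vals) if i not in held]
        slope, intercept = fit_line(tr_x, tr_y)
        err = sum((slope * x_vals[i] + intercept - y_vals[i]) ** 2 for i in held_out)
        scores.append(err / len(held_out))
    return scores


fold_scores = cross_validate(xs, ys, k=7)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean MSE    :", round(sum(fold_scores) / len(fold_scores), 3))
```

Looking at the spread of the per-fold scores, and not only their mean, is what tells you whether the model behaves consistently across the different pieces.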

There are a lot of traditional techniques that aren’t necessarily used in deep learning to try to get past some of those endpoints. What they are worried about is hitting a local minimum, not the global minimum, because all these are optimization functions. If you are in a local minimum, they try to hit it to pop it out of that to find the true global minimum. This is what they are trying to do on the deep learning side.
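
To make the local-versus-global minimum point concrete, here is a small Python sketch, under assumptions of my own: a one-dimensional function with two valleys, plain gradient descent, and a variant that randomly "kicks" the current best point and keeps the result only if it is lower. The function, step size, and kick size are hypothetical choices; real deep learning setups inject noise in other ways.

```python
import random


def f(x):
    """Toy loss with a shallow valley near x = 1.13 and a deeper one near x = -1.30."""
    return x ** 4 - 3 * x ** 2 + x


def grad(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(x, lr=0.01, steps=500):
    """Plain gradient descent; the gradient is clipped so large kicks stay stable."""
    for _ in range(steps):
        g = max(-10.0, min(10.0, grad(x)))
        x -= lr * g
    return x


def descend_with_kicks(x, kicks=50, sigma=1.5, seed=0):
    """Descend, then repeatedly perturb the best point and keep any improvement."""
    rng = random.Random(seed)
    best = descend(x)
    for _ in range(kicks):
        candidate = descend(best + rng.gauss(0.0, sigma))
        if f(candidate) < f(best):
            best = candidate
    return best


plain = descend(2.0)
kicked = descend_with_kicks(2.0)
print(f"plain descent : x={plain:.3f} loss={f(plain):.3f}")
print(f"with kicks    : x={kicked:.3f} loss={f(kicked):.3f}")
```

Starting from x = 2.0, plain descent settles in the shallow valley, while the kicked version gets the chance to hop over the ridge into the deeper one.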

[00:08:23] Eric: We will stop and get into deep learning later in the show. That is a good segue now to bring in our next guest. We got Josh Rubin on the line. Josh, tell us a bit about yourself and what you are doing in the world of machine learning.

[00:08:34] Josh: Thanks, Eric. I’m with Fiddler.ai. Fiddler is about four years old now in 2022. I have been with the company for three of those. Our motto is to build trust in AI. Fiddler provides a platform for enterprises to use machine learning responsibly for a variety of things. To start with, I will take a step back. Ryan and I will have a whole lot of overlap and commonality in the things that we are interested in and concerned by.

One thing that we see is how ubiquitous machine learning is across all business functions. In some ways, the risks are increasing. We deal with companies. We have customers who are in all different segments of the industry. We have financial services customers, eCommerce customers, ad-tech customers, and personal digital assistant customers who are using machine learning to infer the intent behind the thing you are asking your digital assistant to do.

All of those functions are based on models that are trained on data. Some of those are high-risk functions. For example, a credit underwriting model in a bank is a potentially risky scenario for a bank. They are making a decision about how responsibly someone can manage debt. Or it could be something innocuous, like email classification or some internal function in a company that adds a little bit of efficiency to people’s workflows. Call center chatbot classification might be a low-risk model. Companies are finding all sorts of places to pepper in machine learning to increase their efficiency and also to do new things that weren’t possible before.

The second source of risk is that these models are getting more complex. We certainly see lots of models that are simple regressions, simple line fits to points that you might do in Excel or a math class. Increasingly, we are seeing things like deep learning models used for some essential functions, and they can outperform some of these simpler models from a pure utility point of view, from a measured revenue point of view.

An example is we see a lot of financial services companies starting to toy with the idea of using deep learning models that can look at a whole credit history rather than make a simple assessment based on credits. With more complex models, there are greater risks. You are dealing with something that is harder to definitively say will make the right decision in every scenario.

The third risk is that, as we talked about with COVID, the world is a changing place, and it doesn’t need to be a dramatic shift like a sudden pandemic. There are things like seasonality in data sets or political cycles and changing preferences of Boomers in a forecasting model. All of those things introduce risks in the operation of machine learning in production. We try to provide a stack of tools, including explainability, model monitoring, and fairness and bias assessment, that can help companies navigate those risks.

[00:11:10] Eric: A lot of times, I will see news stories where they are blaming the algorithm for being bad. I’m thinking, “It is an algorithm. It is not a person. It can’t be a bad algorithm. It can be a poorly functioning or designed algorithm but don’t blame the algorithm. Blame the data or the person who designed the algorithm.”

[00:11:27] Josh: To connect up with Ryan’s comment about MLOps, that is a big term for us. Our stakeholders are data scientists, model developers, and people in this MLOps role of running a model in production and making sure it is doing the right thing. We think of ourselves as a model performance management platform.

There is a responsible AI practice here: model development doesn’t end when the model is rolled into production. Part of responsibly using AI in a production setting is instrumenting it with things that can show how it is performing over time and how inputs and outputs may be changing with time, and providing diagnostic tools for asking questions about why it made a decision when it made a bad decision. That includes tools for segmenting analysis so that you can see that a model is making decisions on the right features and not parroting back some unfortunate demographic injustices that it has learned from experience.
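
One common way to quantify "inputs changing with time" is the population stability index (PSI), which compares the binned distribution of a feature in production against a baseline. The Python sketch below is a generic illustration and not a description of how Fiddler or any other product computes drift; the bin count, the smoothing constant, and the function names are assumptions, and the baseline is assumed to contain more than one distinct value.

```python
import math


def bin_fractions(values, lo, hi, bins):
    """Fraction of values falling in each of `bins` equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = int((v - lo) / width)
        counts[min(max(i, 0), bins - 1)] += 1  # out-of-range values go to the end bins
    return [c / len(values) for c in counts]


def psi(baseline, current, bins=10, eps=1e-6):
    """Population stability index between a baseline sample and a current sample."""
    lo, hi = min(baseline), max(baseline)
    expected = bin_fractions(baseline, lo, hi, bins)
    actual = bin_fractions(current, lo, hi, bins)
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical feature values: the production data has shifted upward by 3 units.
baseline_values = [i / 100 for i in range(1000)]
shifted_values = [v + 3.0 for v in baseline_values]

psi_same = psi(baseline_values, baseline_values)
psi_shifted = psi(baseline_values, shifted_values)
print(f"PSI vs itself : {psi_same:.4f}")
print(f"PSI vs shifted: {psi_shifted:.4f}")
```

A monitoring job could compute a score like this per feature on a schedule and raise a flag, or trigger retraining, when it climbs.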

[00:12:19] Eric: You have to be careful about this. It is a red flag for organizations. When you decide to deploy machine learning or artificial intelligence, you need to be careful about which data sets you point it at because that is the data where it is going to grab the information it uses to make its calculations. If you point it at old data, stale data, bad data, or inaccurate data, you are going to get an unpleasant response on the other end.

[00:12:42] Josh: There are lots of ways that this can go wrong. If you grab a data set that is too narrow in time, you might not capture a natural seasonality in your data. You can imagine the usage of a cell phone network going up and down depending on when people in that area are awake and asleep. If you are monitoring your data against a baseline that is too narrow, or training a model on a time window that is too narrow, a natural cycle in your data set may look like wildly unpredictable behavior when you chart that on a graph. There are some careful considerations that go into that.
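
The cell phone network example can be sketched in a few lines of Python. The traffic curve, the six-hour and seven-day baselines, and the three-standard-deviation alert rule below are all assumptions chosen for illustration.

```python
import math
import statistics


def traffic(hour):
    """Hypothetical network load with a clean 24-hour cycle and no real anomalies."""
    return 100.0 + 50.0 * math.sin(2.0 * math.pi * hour / 24.0)


# Two weeks of hourly readings.
history = [traffic(h) for h in range(24 * 14)]


def count_alerts(baseline, observed, z_threshold=3.0):
    """Count readings more than z_threshold standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    return sum(1 for v in observed if abs(v - mean) > z_threshold * std)


narrow_baseline = history[:6]        # six hours: only part of one daily cycle
wide_baseline = history[:24 * 7]     # seven days: many full cycles
next_week = history[24 * 7:]

narrow_alerts = count_alerts(narrow_baseline, next_week)
wide_alerts = count_alerts(wide_baseline, next_week)
print(f"alerts with 6-hour baseline: {narrow_alerts}")
print(f"alerts with 7-day baseline : {wide_alerts}")
```

Nothing unusual happens in the second week, yet the narrow baseline still raises alerts every day because it never saw the nightly dip.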

I would also throw out there that we all know this old adage that correlation isn’t causation. A lot of the dominant techniques for doing machine learning right now don’t have any intrinsic causal model that they are trying to build. They are inferring something causal from a pile of data that we have given them. Without some care and diagnostic tools that give you the ability to ask the model why it made its decision, it could be learning inappropriate causal relationships between variables that are correlated. When you think about historical sexism and racism, a simple peril to run into is to reiterate those because a model has learned something causal from something that was an unfortunate statistical correlation.

[00:13:52] Eric: That’s a very good point. We’ll get deeper into deep learning, pun intended, in our coming segments. One of the challenges in some of these deep learning models is you have hidden layers. What is interesting is that there is a certain proprietary nature to the algorithms that are being run by Facebook, Google, and other companies where they don’t want to reveal the recipe but there is going to be a bit of a challenge on that front sooner or later. At least, I hope there is because I’m a big fan of transparency, and besides, execution is what matters in business now.

[00:15:36] Eric: We are talking about all things machine learning on this show. ML has a growing role in the world. It is under the covers all over the place. Machine learning is being used by lots of different large corporations, quite a few startups, and small organizations. It is good at optimizing processes. It is used quite a bit in segmentation, trying to figure out which group to put these objects, people, or products that people want to purchase into. That is one thing it does.

It is also used to optimize pricing. As we discussed in the last segment, sometimes you have to be careful about what decisions come out of these machines. Explainability is a big part of machine learning and AI these days, especially with deep learning. A man who knows a thing or two about that is my friend Bin Fan from Alluxio. Bin, welcome back to the show. Tell us a bit about what you have been working on and how it is going.

[00:16:26] Bin: Thanks, Eric. I’m happy to be back. I’m one of the founding engineers and I run the open-source initiative at Alluxio. We are an open-source technology vendor focusing on data and how to present the data closer to applications, such as machine learning training frameworks or MLOps training pipelines, so that they can access the data faster, with higher bandwidth, and more cheaply. That is what we are doing.

We are different from Ryan and Josh. They are more from the model side. If we want to build this model, how do we build it more efficiently, using fewer resources and reducing the cost? That is the starting point of how we are talking to our users. There are different challenges in explainability and monitoring in MLOps. The angle we are entering from is more the efficiency perspective, especially IO efficiency. We are an open-source project that started more from the data analytics side.

This was originally a research project incubated by UC Berkeley AMPLab as a sister project of Apache Spark, and it was about how to present data to Spark faster. People naturally think, “You helped me solve this data analytics problem. How can you help me solve this machine-learning problem?” We notice there is a difference here.

What we are doing for data analytics with Spark, Trino, and Presto does not necessarily translate to how machine learning applications are using data. I will give you a few examples. The data set is different for machine learning. From time to time, we see users or applications using a big pile of small files, typically video clips or pictures. They are only a few hundred kilobytes each, representing a picture or something small. That is common for machine learning applications. That is not what the conventional, previous generation of big data platforms was designed for. We need to do something different.

The API is different too. In the data analytics world, by default, people will think, “We can use a file system API or client to access data.” It is not the case in the machine learning world. Because of the latest advances in hardware like GPUs and other specialized hardware, machine learning jobs can run in a highly parallel way. They are good at leveraging hundreds or thousands of different threads, doing the jobs in parallel. That makes applications efficient but also creates a lot of challenges on the IO side.

How do you feed these applications to keep them from going hungry? These are typically the more expensive instances in the cloud. We want to reduce your cost. We don’t want them to be idle because they are waiting for some data to be fed. These are the challenges we are seeing from the lower level rather than from the model side: how we help data scientists or machine learning engineers feed their pipelines in an efficient way. These are interesting and challenging problems. We are working on open-source projects. We get a lot of people working together with us globally, and we are making a lot of progress.
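
As a generic illustration of the "big pile of small files" problem, the Python sketch below reads many small files through a thread pool so that a consumer is not waiting on one file at a time. It uses only the Python standard library and is not Alluxio's API; the file names, file sizes, and worker count are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one small file fully into memory."""
    with open(path, "rb") as fh:
        return fh.read()


def load_parallel(paths, workers=16):
    """Read many small files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))


with tempfile.TemporaryDirectory() as tmp:
    # Create 200 stand-in "samples" of 1 KiB each (sizes chosen for illustration).
    paths = []
    for i in range(200):
        path = os.path.join(tmp, f"sample_{i:04d}.bin")
        with open(path, "wb") as fh:
            fh.write(bytes([i % 256]) * 1024)
        paths.append(path)

    blobs = load_parallel(paths)

total_bytes = sum(len(b) for b in blobs)
print(f"read {len(blobs)} files, {total_bytes} bytes")
```

On local disk the gain is modest, but the same pattern matters more when each read is a round trip to remote storage and an expensive GPU instance is waiting on the result.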

[00:19:49] Eric: You gave us a really good point here because the cost is always a concern when you get into training some of these models and leveraging these models. To your point, if the training is not an efficient process, it can be rather expensive and especially expensive if it doesn’t do what you wanted it to do in the first place anyway. You want to be able to train the model and gauge the efficacy of the models.

One of the challenges has been going from your Jupyter Notebook where you are working on stuff to production depending upon the environment because a lot of these large organizations have lots of legacy systems that weren’t designed for that input. You have to come up with some framework and mechanism by which you insert the analysis or the intelligence into the operational system.

You have these challenger models too. Theoretically, you should have your production model and challenger models waiting in the wings and ready to move over. How far have we come in terms of being able to quickly replace models that are running and MLOps stuff that Ryan and Josh were talking about? How far along are we on that front?

[00:20:50] Bin: We are talking with quite a few users. They see these challenges and they are saying, “How can you help us to make sure all the models are quickly updated and all those things?” The situation depends on the different applications and the data platform. We are seeing that a lot of users have their own different challenges. Especially for our users, from time to time, they have data silos or some management problem that prevents them from getting this done or deployed easily. That is one direction in which we are helping them.

It is still an early stage in general. The entire machine-learning world is hot. Everyone is talking about it. Still, the tech stack is quickly evolving. There are a lot of different moving pieces and legacy pieces there. We are seeing different features from one company to another company. That is where we are. We are trying to help people from one angle, and we would say it is early in the game.

[00:21:54] Eric: You brought up another good point here, which is that the tech stack is changing. It is a good segue to our round table. I will throw this first to you, Bin, and then to the rest of the panel. Let’s talk about data science platforms. There are a number of cool vendors out there that have data science platforms. We were talking about this on a call. They need a team of data scientists and a bunch of data engineers. Those people are in tremendously high demand right now because there are not that many of them, and this tremendous need is growing.

On the other side, you have no-code platforms, which are slowly rising and getting better at being able to build nice GUI interfaces on top of the algorithms and all the number crunching underneath. That is clearly where we want to go. It seems to me we are probably still far away from that, but that is what the business wants.

If I’m a business person, I don’t want to have to get lost in the mire of understanding R, Python, data pipelines, observability, and how that plays into it. I want the car to run the way I want it to run. Business people want easy interfaces. Where are we in that evolution? What is your advice to an organization that wants to leverage this technology but isn’t sure how much to budget for it? How many people do we need? These are difficult questions to answer. Bin, first, what is your advice on that?

[00:23:10] Bin: Based on my conversations with the users or customers that are running their machine learning pipelines and platforms, one piece of advice I typically give them is that if you want to build the whole tech stack yourself and have a huge team to maintain it, that is going to be complicated. You want to leverage what exists. Don’t reinvent the wheel. Look at what functionality you need and look for vendors on the market, but also make sure your architecture is future-proof. That is important. You don’t want to be stuck with some architectural legacy tech debt that prevents you from moving to the next thing.

Technology and models are evolving quickly. These giants are still creating awesome models for newer applications. We see this trend. It is going quickly so you should be prepared. You should leave enough room to say, “We need to revisit the architecture in the next several years.” Be prepared and make sure that you have enough room to make your tech stack evolve to the next thing or change direction, and stay on top of it.

[00:24:23] Eric: Ryan, I will throw it over to you. What are your thoughts on how to recommend the right course of action for your clients or people you talk to?

[00:24:28] Ryan: It’s some of what you are describing about where the future is. Do we end up in a dystopian world, like the Matrix background I have, or with the Terminator coming after us? I often think that is not going to happen because if you look at things like AutoML and everything people are trying to do to make it easy, it doesn’t work because you still need to have somebody who has insight, looks at the data, and understands, “This is good data, or this is not good data that is going in.”

You still need to have somebody look at the output of a model and say, “This makes sense.” With Stack Overflow and everything else, anyone can be an ML engineer, go out, and get a model to run. I’m old and I remember doing data science. XGBoost came out of Stanford. If we wanted to use that, we had to code it all ourselves. You had to sit there, code it, and understand how to group everything and all that, but nowadays, anybody can go out, grab a library in Python, stick it into their code and run it. That doesn’t make it a good model or something that you are going to be able to talk about and report on.

We are seeing a lot of that with our customers where they tried to go out on the cheap and grab somebody to run them a model. They can get results and once they dig into it, it doesn’t work. You still don’t have a good data science base of people that understand it or build it. People are trying to make all these tech stacks. There is a lot of value in some of them and in the AutoML stuff, but they still need a trained data scientist to look and make sure these models and the data put into them make sense. Until you improve or figure out how to create something that makes sure you have good data, you are always going to have to have someone in the loop to make it work.

[00:26:10] Eric: That is a good point and segue to bring in Josh because you are right. The human is never going to go away. The storyline in the consumer media was, “AI and machine learning are going to take away jobs.” No, it is not going to take away jobs. If it does, it is going to take away the tedious, boring jobs that humans are not good at and don’t want to do anyway. The best expression I have heard is that AI won’t take away jobs, but people who learn how to use AI will get the jobs, and people who don’t, won’t. It is a new tool and platform to leverage in business and you should figure out one way or another how to do it.

[00:26:43] Josh: My best advice is to keep it simple as a starting place. There is this incredible tendency in using ML in the industry to grab the shiniest thing that you read about on your Apple newsfeed or saw on YouTube. In many scenarios, there is a much simpler solution than the world’s biggest deep learning model to solve the problem that you are trying to do. I wanted to make sure that I addressed that.

I also agree with Ryan’s comment on making sure you are equipped with the knowledge and tools to use this technology responsibly. There is a question of how risky a model is. What is its function? What are the risks involved? There is a whole school of thought that is focused on what should ship with your machine-learning model. People are talking about it.

Google had this strong push towards this notion of model cards: along with your machine learning model, you should provide a prescription label with information about what it was trained on and a risk assessment for various kinds of uses of this model, so that whoever receives it can make some intelligent assessment.
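
A "prescription label" of this kind can be as simple as a small structured record that ships next to the model. The Python sketch below shows one hypothetical shape for it. The field names and the example credit model are invented for illustration and are not Google's official model card schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical minimal model card; the fields are illustrative, not a standard."""
    name: str
    intended_use: str
    training_data: str
    known_limitations: list = field(default_factory=list)
    risk_by_use: dict = field(default_factory=dict)

    def render(self):
        """Return the card as plain text that can be shipped alongside the model."""
        lines = [
            f"Model: {self.name}",
            f"Intended use: {self.intended_use}",
            f"Trained on: {self.training_data}",
            "Known limitations:",
        ]
        lines += [f"  - {item}" for item in self.known_limitations]
        lines.append("Risk by use:")
        lines += [f"  - {use}: {risk}" for use, risk in self.risk_by_use.items()]
        return "\n".join(lines)


card = ModelCard(
    name="example-credit-model",
    intended_use="Ranking applications for manual review",
    training_data="Hypothetical loan applications, 2015 to 2019",
    known_limitations=["Not validated on data collected after 2019"],
    risk_by_use={"manual review ranking": "low", "automated approval": "high"},
)
card_text = card.render()
print(card_text)
```

Whoever receives the model can then read the label before deciding whether their use falls inside what the author assessed.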

[00:27:49] Eric: I’m glad you mentioned that because that is good advice for anyone out there using these technologies and models. Have the prescription, explain what the model is supposed to do, and what the effect is supposed to be.

[00:29:06] Eric: I will throw this out to our guests for comment from each. One of the most powerful uses of machine learning and artificial intelligence is going to take the form of suggestions, meaning the AI and the ML are under the covers. It is scanning, looking for stuff, and giving you suggestions of what to do next instead of taking decisions for you. It is going to give you a recommendation about what to do. I will throw that first to Ryan Ries. What do you think about the manifestation of AI and ML being suggestive?

[00:29:32] Ryan: We are already in that world, to be honest. You look at personalization algorithms. When you go to your Netflix, it tells you what to watch. When you go to Amazon, it tells you what you want to buy. Amazon’s got Alexa and they have been working on trying to make it more conversational if you ask it a question. It still works on Valentine’s Day in 2022, when they had set it to run. You could ask it for suggestions for a date and it would tell you, even knowing your location, some date ideas that you could have. We are already living in that world as far as I’m concerned.

[00:30:08] Eric: I don’t like talking to machines, but I’m an old man. My daughter has no problem talking to machines. She was sitting there talking to Siri asking all weird questions. It will even give weird sassy answers sometimes. Whenever she is insulted, she is like, “That is not very nice.” I found it a little bit creepy. One time, I was asking Siri to do something. She gave me a completely wacko answer. I’m like, “It is ridiculous.” She was like, “That is not nice.” I’m like, “Is the phone now talking back to me?” How far are we going to push this model?

[00:30:37] Ryan: There was a movie out on Netflix called Jexi. He got a new phone and it has the latest Siri version that tries to take over and control his life and wants to marry him. It is not too far in the distant future since the writers are already there.

[00:30:56] Eric: Josh, I will throw it over to you. What are your thoughts on where this is all going from a business perspective? There is a concept I heard a couple of years ago that I like. They talk about narrow AI, where it’s not the big deep-learning red robot that is going to take over the world. It is an algorithm focused on one thing, whether that is classification of signals, network security, optimization of decision points, or whatever. The key is to keep the focus narrow, understand what you are doing and make sure it does that well. You build that into a larger mosaic of analytic recommendations. What do you think about that approach?

[00:31:33] Josh: One of the fun things about my role at Fiddler is that we are introduced to new customers and new potential customers all the time. You get on the call with somebody who has some new use case. To go back to the personalization for a second, you suddenly realize the degree to which everything, from the order of items that you see in eCommerce to the articles behind any search, is all tailored to you.

[00:31:54] Ryan: Is it tailored to you? We once wrote a personalization algorithm for a company that was somewhat tailored to you, but also, the first results were most beneficial to that company for you to click on.

[00:32:07] Josh: You are optimizing for something. The company is always optimizing for you to click and purchase the thing closest to where your mouse pointer is. It is a little unclear who the real stakeholders are here.

[00:32:17] Eric: Bin, I will throw this over to you to chew on. If you look at GDPR, General Data Protection Regulation, this whole concept of the right to be forgotten is a very ambitious goal that they have. I think it is not going to happen but you can get closer. You can strive towards it. What the regulators will want is to see that you are trying to solve that problem even if it is impossible to track down every last bit of data about you. It is still a bit of a challenge.

I have expanded upon that. I talk about the right to be respected. It is an obvious thing that I would expect you, XYZ corporation, to be respectful of me, my time, my purchasing habits, etc. If you have that respect baked into your corporate culture, you are going to do well leveraging these technologies because you will have conversations. The reason we’re having so many conversations around ethics with AI is because we are at this significant inflection point where things are changing in some fundamental ways. What do you think about that whole bevy of issues I threw at you?

[00:33:18] Bin: Before I joined Alluxio, I was working at Google. One of my projects was implementing the right to be forgotten for your search results, search history, ad-click history, and all this. You have the option to nuke all this history and make sure this entire footprint is forgotten from Google’s internal perspective. That was my own project at Google. I left Google after two years of working on that. I know it was getting well adopted as an upstream for many other downstream applications because there are a lot of machine learning models based on different people’s histories.

After this, if you retrain your machine learning models on the updated training sets, you can remove that history or behavior from the models. That is one way to make sure you remove yourself from certain algorithms. At least from that perspective, I can see how Google is working on that.

At Alluxio, one of the reasons people are talking to us is because they want to use us as a data abstraction layer to handle different data in different places. They ask us, “Can you add some filters or help us update the data? We want to update the training set or the logs.” They are using the same terms I was using at Google.

We see this as demand in the market. More people are talking to us and they are looking for solutions. Put that way, I see a big market opportunity there. It is a good direction. I’m not following that closely. I don’t know what the state of the art is, but I don’t think there is a well-established standard or technology to help customers and users do that.
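The forget-then-retrain idea Bin describes can be sketched in a few lines. This is only a toy illustration, not how Google or Alluxio implement it: the `Event` record, the `forget_user` filter, and the click-rate "model" are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical row of behavioral history used for training."""
    user_id: str
    item: str
    clicked: bool


def forget_user(events, user_id):
    """Return a new training set with every event from user_id removed."""
    return [e for e in events if e.user_id != user_id]


def train_click_rates(events):
    """Toy 'model': per-item click-through rate computed from the events."""
    shown, clicks = {}, {}
    for e in events:
        shown[e.item] = shown.get(e.item, 0) + 1
        clicks[e.item] = clicks.get(e.item, 0) + int(e.clicked)
    return {item: clicks[item] / shown[item] for item in shown}


events = [
    Event("alice", "shoes", True),
    Event("bob", "shoes", False),
    Event("alice", "hats", True),
    Event("carol", "hats", False),
]

# Model trained on everyone's history.
model_before = train_click_rates(events)

# Alice asks to be forgotten: filter the training set, then retrain.
remaining = forget_user(events, "alice")
model_after = train_click_rates(remaining)
```

The point of the sketch is that deleting the rows alone is not enough. The old model still reflects Alice's clicks until it is retrained on the filtered set.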

[00:35:10] Eric: I will throw it over to Ryan first. This right to be forgotten is ambitious, but there is something to be said for wanting certain pieces of your history to be removed and you think about what goes into it. The point I was going to make is I recall from certain interpretations, at least of GDPR, that if someone says, “I want you to remove me,” you are supposed to retrain any algorithms that you had trained with their data. I’m not sure how strict they will be about that but that is a substantial ask of the business, especially if they don’t have a modern platform. What do you think?

[00:35:41] Ryan: GDPR is interesting. It is a UK thing right now. People have taken and modeled their own versions of that. California has its version. Not all states have their version. I have done a lot of stuff with GDPR on the data side but not necessarily on the model side. It is true what you are saying. If you have to take people out of that model every time, it comes down to the timeline that it sets, because even in GDPR, you have a certain period of time to take that person out of the database. If it were to change where you have to retrain every time, you could imagine how painful that could be.

A lot of people use fairly small models. Your training time might only be 1 or 2 hours. If you were working with a big data set that took a week to train, you could imagine the significant cost that it is going to be for a company to remove that data. If you are running GPUs for a week, that is many thousands of dollars in training costs to remove one person. It is interesting to see what the rules are there.
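Ryan's cost point is simple arithmetic, and a rough sketch makes the gap concrete. The numbers below are assumptions for illustration only: a hypothetical rate of $3 per GPU-hour, one GPU for the small model, and eight GPUs for the week-long job. Real prices and cluster sizes vary widely.

```python
def retraining_cost(num_gpus, hours, dollars_per_gpu_hour):
    """Rough compute cost of one full retraining run."""
    return num_gpus * hours * dollars_per_gpu_hour


def cost_per_removal(total_cost, removal_requests):
    """Cost per person if several removal requests share one retraining run."""
    return total_cost / removal_requests


# Assumed rate, not a quoted price from any cloud provider.
RATE = 3.0

# Small model: one GPU for 2 hours.
small_run = retraining_cost(num_gpus=1, hours=2, dollars_per_gpu_hour=RATE)

# Large model: eight GPUs for a full week (7 * 24 = 168 hours).
large_run = retraining_cost(num_gpus=8, hours=7 * 24, dollars_per_gpu_hour=RATE)

# If the rules allow a window of time, 100 requests could share one run.
batched = cost_per_removal(large_run, removal_requests=100)
```

Under these assumptions the small run costs about $6 and the large run about $4,032. That is why the allowed timeline matters: retraining per request pays the full amount each time, while batching requests inside the window spreads it out.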

[00:36:39] Eric: With all the work in observability, we have come a long way in the last few years in terms of being able to understand what this costs. Companies like Snowflake, for example, rolled out what they call resource groups. They are doing that as a way to be able to get some understanding of how much it costs to run this algorithm, to run that algorithm, to store this data, and to store that data. That thing is important for business people who have to try to figure out, have we gotten a return on investment on all this? Are we moving in the right direction? Are we moving in the wrong direction? We are getting there, but it is a fast-moving space. Look all these guys up online. They are cool companies.

It is time for the bonus segment here on our show, Learning Curve. It’s understanding machine learning’s growing role in our society. I will throw it over to Ryan Ries first. It is interesting for me to watch how the reflection we get back from machine learning and AI will affect us, our culture, and even our way of communicating.

The example I will give is this NLP stuff. I’m sure you have seen this now. If you are typing in Gmail, Google is finishing sentences for you or at least offering to finish sentences for you, which is interesting. What they are doing is they are sensing patterns. They now know when you start a sentence like this, oftentimes, you are going to finish it like that. I’m a typist. It is hard for me to take advantage of that. I’m like, “Leave me alone. Let me finish my sentence on my own.” It is this interesting dynamic where it is going to reflect back at us with all these recommendations that are going to start changing how we communicate. What do you think about all that?

[00:38:25] Ryan: LinkedIn does that too, where somebody will send you a note on LinkedIn and it suggests whatever random thing you might want to say. There are some mail services out there that are exactly that, where it tries to auto-write your emails. With the rise of GPT-3, it is the biggest language model out there. It is beyond your emails. Everyone wants to write code for you.

I’m not sure if you have looked at Dall-E. In Dall-E, you can type in something, and it will create pictures using GPT. It is pretty fascinating. A lot of people are looking at that. How is it going to change video games and everything else? A big issue in making any content is how you make the content, how you get the artists, and all that. If you are using stuff like Dall-E to make it, it is going to create content much faster, and what happens to everything else will be interesting.

[00:39:15] Eric: There is another side of the coin here that we weren’t going to talk about but let’s throw it out quickly for good measure, which is blockchain. One of the cool things about blockchain, and not necessarily the original but some of the other ones that are coming out of Ethereum and some others, is that it is immutable or at least, it is theoretically immutable. You can have this true pure audit trail and thus be able to know who took this picture and who wrote this code.

You start talking about fake news, fake imagery, deep fakes, and so forth. It does fundamentally change things. It is like, “What are we going to go back to?” You have to go back to trust and your own intuition on things, but that could be hard to do, especially when the fakes are good. That looks like Elon Musk. What is he promoting? Josh, I will throw it over to you.

[00:39:58] Josh: In the ML space, they talk about provenance, with a chain of custody of all the ingredients that go into a data set, model version, and who touched it. It is like an immutable blockchain idea, especially for customers or ML users who are in highly regulated industries like financial services. Being able to go back in time and say, “This is the chain of custody,” all the way back to the instructions for where and how the data that trained the model was collected. That is an important thing, and it is almost entirely missing from the standard tooling right now.

[00:40:30] Ryan: One of our big pieces in MLOps is that big financial institutions and healthcare have to have audit trails of what data set created that model. What were the models? You have to show all that history. It is a big deal in our world.
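The chain-of-custody idea Josh and Ryan describe can be illustrated with a small hash-chained log. This is a minimal sketch of a tamper-evident audit trail, not a blockchain and not any vendor's product. The record fields (`step`, `dataset`, `model`, `who`) and the helper names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(record, prev_hash):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, record):
    """Add a record to the log, linking it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": entry_hash(record, prev_hash)})


def verify(log):
    """Recompute every hash; any edited record or broken link fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["record"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"step": "collect", "dataset": "loans_v1", "who": "data-eng"})
append(log, {"step": "train", "model": "credit_model_v1",
             "dataset": "loans_v1", "who": "ml-team"})

intact = verify(log)  # the untouched log checks out

# Someone quietly rewrites history about which data set was used.
log[0]["record"]["dataset"] = "loans_v2"
tampered = verify(log)  # the edit no longer matches the stored hash
```

Because each entry's hash covers the previous entry's hash, changing an old record invalidates the chain from that point on. An auditor can detect the edit even though nothing physically prevents it.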

[00:40:45] Eric: I will throw it over to Bin to comment on. It is funny because we are starting to see a lot of these development platforms bake in the commentary about what it was, what you did, and when you did it. Coders don’t typically like to do documentation. That is what they call it. Coders like to do stuff. They like to build things. They don’t want to have to document what they built. Who wants to do that? Not many people. What you are seeing now is more effort being put into the tools themselves.

This is one cool thing about the cloud. It is baked into the cloud, capturing the metadata: who logged in when, what did they touch, and where did they go? That thing used to be impossible to figure out. Now it is not impossible at all because it is all logged into the system. You can tell, “John logged in at 2:00 PM and he loaded this set of data.” It is not completely there, but we are getting there. That is good news. What do you think, Bin?

[00:41:35] Bin: For the modern enterprise, this is super important. Some people call this governance, and there are a lot of different technologies to enforce it. You understand who did what to which data sets. This is not new. It has been in the industry for a while. It is now in the cloud environment, and people are getting maybe more cautious because the cloud is accessible by anyone who gets access and credentials to it. It has become even more important than before. For example, for our customers, we see a lot of requirements like this. We are supporting, to some degree, different features like this. This is a big category of maintaining healthy and stable production environments for their data. That is why people are talking to us and why we are working with them.

Machine Learning: Understanding who did what to which data or data sets is very important. Especially today, with the cloud, people are more cautious because it is accessible by anyone with the right credentials.


[00:42:22] Eric: One last quote from each of you. Ryan Ries, your final comment. What is your advice on helping people leverage machine learning?

[00:42:26] Ryan: The advice is come to Mission.

[00:42:32] Eric: He already admitted some of his habits. He is a very honest man in AI. That is what we want. We want honesty, transparency, and explainability. I will throw it over to Josh. Final comments from you.

[00:42:42] Josh: The industry is getting more complicated. There is more unstructured data, and there are more computer vision applications and language applications. Using those things responsibly is increasingly tricky. The best practice is to tool up with an MPM platform or some strong MLOps practice that takes those risks into account.

The industry is getting more complicated, so your best practice is to tool up with a strong MLOps practice that takes risks into account.

[00:43:02] Eric: The build-your-own thing can work in certain circumstances, but I think you are right. You want to start somewhere and have some platform that you can use and build on top of. If you try to build that whole thing yourself, that is what a lot of the big Silicon Valley giants did, Facebook with Cassandra, LinkedIn with Kafka, etc. They rolled their own and that is great. At a certain point, if you are going to commoditize that or commercialize that, you want to have a foundation to work from. Some final thoughts from Bin Fan. What is your recommendation?

[00:43:29] Bin: I should borrow what Josh mentioned previously. Keep it simple. It is important. Keep your work, tech stack, or project as simple as possible to start with. That is more stable and explainable, typically.

[00:43:42] Eric: Look these guys up online. We have an excellent show. Thank you so much. Send me an email at [email protected]. We’ll talk to you next time.


About Bin Fan

I like to design and implement distributed systems that solve important and challenging problems.




About Joshua Rubin

Versatile solver of challenging computational problems, and particularly enthusiastic about data modeling and semantic matching (ranking and recommendation) using deep learning. Background in experimental particle physics, industry experience using deep learning for calibration and signal processing problems, and subsequently worked on natural-language embedding and matching tasks.


About Ryan Ries

Dr. Ries is an accomplished research scientist, system engineer, program manager, and business developer who has honed his expertise in scientific consulting and data analytics. Dr. Ries’ expertise is in developing advanced imaging products that combine cutting-edge hardware with custom detection software and data analytics to improve performance. Dr. Ries interfaces seamlessly with customers, capturing funding opportunities for cutting-edge imaging systems (proposal writing, networking, expanding current contracts, …). Dr. Ries manages staff, defines tasks, and ensures timely delivery of software and hardware to the customer.