Data Pipelines and the Promise of Data
The flow of data can be perilous. Any number of problems can develop as data moves from one system to another: flows can hit bottlenecks that introduce latency, data can become corrupted, and datasets may conflict or contain duplicates. The more complex the environment and the more intricate the requirements, the greater the potential for these problems, and rising data volumes compound the risk. Transporting data between systems often requires several steps, including copying the data, moving it to another location, and reformatting it and/or joining it to other datasets. With good reason, data teams are focusing on the end-to-end performance and reliability of their data pipelines.
If massive amounts of data are streaming in from a variety of sources and passing through different data ecosystem software during different stages, it is reasonable to expect periodic problems with the data flow. A well-designed and well-tuned data pipeline ensures that all the steps needed to provide reliable performance are taken care of. The necessary steps should be automated, and most organizations will require at least one or two engineers to maintain the systems, repair failures, and make updates as the needs of the business evolve.
DataOps and Application Performance Management
Not long ago, data from different sources would be sent to separate silos, which often provided limited access. During transit, data could not be viewed, interpreted, or analyzed. Data was typically processed in batches on a daily basis, and the concept of real-time processing was unrealistic.
“Luckily, for enterprises today, such a reality has changed,” said Shivnath Babu, co-founder and Chief Technology Officer at Unravel, in a recent interview with DATAVERSITY®.
“Now data pipelines can process and analyze huge amounts of data in real time. Data pipelines should therefore be designed to minimize the number of manual steps to provide a smooth, automated flow of data from one stage to the next.”
The first stage in a pipeline defines what data is collected, and how and where it is collected. The pipeline should then automate the processes used to extract the data, transform it, validate it, aggregate it, and load it for further analysis. A data pipeline provides operational velocity by eliminating errors and correcting bottlenecks. A good data pipeline can handle multiple data streams simultaneously and has become an absolute necessity for many data-driven enterprises.
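As a rough illustration of these stages, the sketch below wires extract, transform, validate, aggregate, and load steps into a single automated flow. The function names, the CSV input, and the per-user totals are assumptions made purely for illustration; they are not tied to any particular pipeline tool.

```python
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """Pull rows out of a raw CSV payload (stand-in for any source)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Normalize field names and types."""
    return [{"user": r["user"].strip().lower(), "amount": float(r["amount"])} for r in rows]

def validate(rows: list[dict]) -> list[dict]:
    """Drop records that would break downstream aggregation."""
    return [r for r in rows if r["user"] and r["amount"] >= 0]

def aggregate(rows: list[dict]) -> dict[str, float]:
    """Total amount per user."""
    totals: dict[str, float] = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def load(totals: dict[str, float]) -> None:
    """Stand-in for writing results to a warehouse or downstream store."""
    for user, total in sorted(totals.items()):
        print(f"{user}: {total:.2f}")

if __name__ == "__main__":
    raw = "user,amount\nAlice,10.5\nbob,3.0\nalice,2.5\n"
    load(aggregate(validate(transform(extract(raw)))))
```

Each stage is a plain function here, so the same flow could be scheduled, monitored, or swapped out stage by stage without manual intervention.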
DataOps teams leverage Application Performance Management (APM) tools to monitor the performance of apps written in specific languages (Java, Python, Ruby, .NET, PHP, Node.js). As data moves through the application, three key metrics are collected (a sketch of how they might be computed follows the list):
- Error rate (errors per minute)
- Load (data flow per minute)
- Latency (average response time)
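A minimal sketch of how these three metrics might be derived from per-request records over a one-minute window is shown below. The record fields and the window length are assumptions for illustration, not the data model of any specific APM product.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    timestamp: float      # seconds since epoch
    latency_ms: float     # response time for this request
    is_error: bool        # did the request fail?

def window_metrics(records: list[RequestRecord], window_start: float,
                   window_seconds: float = 60.0) -> dict[str, float]:
    """Compute error rate, load, and average latency for one window."""
    in_window = [r for r in records
                 if window_start <= r.timestamp < window_start + window_seconds]
    if not in_window:
        return {"errors_per_min": 0.0, "requests_per_min": 0.0, "avg_latency_ms": 0.0}
    minutes = window_seconds / 60.0
    errors = sum(r.is_error for r in in_window)
    return {
        "errors_per_min": errors / minutes,                              # error rate
        "requests_per_min": len(in_window) / minutes,                    # load
        "avg_latency_ms": sum(r.latency_ms for r in in_window) / len(in_window),  # latency
    }
```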
According to Babu, Unravel provides an AI-powered DataOps/APM solution that is specifically designed for big data systems such as Spark, Kafka, NoSQL, Impala, and Hadoop. A spectrum of industries, including financial services, telecom, healthcare, and technology, use Unravel to optimize their data pipelines. It is known for improving application reliability and the productivity of data operations teams while reducing costs. Babu commented that:
“The modern applications that truly are driving the promise of data, that are delivering or helping a company make sense of the data—they are running on a complex stack, which is critical and crucial to delivering on the promise of data. And that means, now you need software. You need tooling that can ensure these systems that are part of those applications end up being reliable, that you can troubleshoot them, and ensure their performance can be depended upon. You can detect and fix it. Ideally, the problem should never happen in the first place. They should be avoided, via machine learning algorithms.”
Data Pipelines
Data pipelines are the digital reflection of the goals of data teams and business leaders, with each pipeline having unique characteristics and serving different purposes depending on those goals. For example, in a marketing-feedback scenario, real-time tools would be more useful than tools designed for moving data to the cloud. Pipelines are not mutually exclusive and can be optimized for both the cloud and real-time, or other combinations. The following list describes the most popular kinds of pipelines; a short sketch contrasting the batch and real-time styles follows it:
- Cloud native pipelines: Deploying data pipelines in the cloud helps with cloud-based data and applications. Cloud vendors generally provide many of the tools needed to create these pipelines, saving the client the time and money of building out infrastructure.
- Real-time pipelines: Designed to process data as it arrives, in real time. Real-time processing requires data coming from a streaming source.
- Batch pipelines: Batch processing is generally most useful when moving large amounts of data at regularly scheduled intervals. It does not support receiving data in real time.
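The sketch below contrasts the batch and real-time styles by applying the same transformation two ways: once over a file accumulated since the last scheduled run, and once record by record as events arrive. The file format, the fake stream, and the transform are placeholders, not features of any particular platform.

```python
import json
import time
from typing import Iterable, Iterator

def transform(record: dict) -> dict:
    """Shared business logic used by both pipeline styles."""
    return {**record, "amount_cents": int(record["amount"] * 100)}

def run_batch(path: str) -> list[dict]:
    """Batch: process everything accumulated since the last scheduled run."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [transform(r) for r in records]

def run_streaming(source: Iterable[dict]) -> Iterator[dict]:
    """Real-time: process each record as soon as it arrives from the stream."""
    for record in source:
        yield transform(record)   # downstream consumers see results immediately

def fake_stream() -> Iterator[dict]:
    """Stand-in for a real streaming source such as a message queue."""
    for i in range(3):
        time.sleep(0.1)
        yield {"order_id": i, "amount": 9.99 + i}

if __name__ == "__main__":
    for result in run_streaming(fake_stream()):
        print(result)
```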
Ideally, a data pipeline handles any data as though it were streaming data and allows for flexible schemas. Whether the data comes from static sources (e.g., flat-file databases) or from real-time sources (e.g., online retail transactions), a data pipeline can separate each data stream into narrower streams that are processed in parallel (a sketch of this partitioning appears after the quote below). It should be noted that processing in parallel uses significant computing power, said Babu:
“Companies get more value from their data by using modern applications. And these modern applications are running on this complex series of systems, be it the cloud or in data centers. In case any problem happens, you want to detect and fix it—and ideally fix the problem before it happens. That solution comes from Unravel. This is how Unravel helps companies deliver on the promise of data.”
The big problem in the industry is having to hire human experts, who aren’t there 24/7, he commented: “If the application is slow at 2 a.m., or the real-time application is not capable of delivering in true real time, and somebody has to troubleshoot it and fix it, that person is very hard to find. We have created technology that can collect what we call full-stack performance information at every level of this complex procedural stack. From the application, from the platform, from infrastructure, from storage, we can collect performance information.”
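Returning to the earlier point about splitting a stream into narrower streams, the sketch below partitions incoming events by key and processes each partition in its own worker process using Python’s standard library. The keying field and the worker count are assumptions for illustration; engines such as Spark or Kafka consumer groups handle this partitioning internally and at far larger scale.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def process_partition(partition: list[dict]) -> float:
    """Work done on one narrow stream, e.g. summing a value per key."""
    return sum(event["value"] for event in partition)

def partition_by_key(events: list[dict], key: str) -> list[list[dict]]:
    """Split one wide stream into narrower streams sharing the same key."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        buckets[event[key]].append(event)
    return list(buckets.values())

if __name__ == "__main__":
    events = [{"region": r, "value": v} for r in ("us", "eu", "apac") for v in (1, 2, 3)]
    partitions = partition_by_key(events, "region")
    # Each partition runs in its own process: more parallelism, but also
    # more CPU and memory, which is the cost noted above.
    with ProcessPoolExecutor(max_workers=len(partitions)) as pool:
        print(list(pool.map(process_partition, partitions)))
```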
Cloud Migration Containers
There has been a significant rise in the use of data containers. As the cloud has grown in popularity, methods for transferring data along with its processing instructions have become important, and data containers provide a viable solution. Data containers organize and store “virtual objects,” self-contained entities made up of both the data and the instructions needed for controlling that data.
“Containers do, however, come with limitations,” Babu remarked. While containers are very easy to transport, they can only be used with servers that have compatible operating system kernels, which limits the kinds of servers that can be used:
“We’re moving toward microservices. We have created a technology called Sensors, which can basically pop open any container. Whenever a container comes up, in effect, it opens what we call a sensor. It can sense everything that is happening within the container, and then stream that data back to Unravel. Sensor enables us to make these correlations between Spark applications, that are streaming from a Kafka topic. Or writing to a S3 storage, where we are able to draw these connections. And that is in conjunction with the data that we are tapping into, where the APIs go with each system.”
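Unravel’s Sensors are proprietary, but the general pattern they describe, an in-container agent that periodically samples resource usage and ships it to an external collector for correlation, can be sketched as follows. The sampling interval, the chosen metrics, and the use of the psutil library are assumptions for illustration; this is not Unravel’s implementation.

```python
import json
import time

import psutil  # third-party: pip install psutil

def sample_container_metrics() -> dict:
    """Read a few coarse resource metrics from inside the container."""
    return {
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
    }

def run_agent(interval_s: float = 5.0, samples: int = 3) -> None:
    """Periodically sample and emit metrics.

    A real agent would stream these to an external collector and tag them
    with application identifiers so they can be correlated across systems
    (Spark jobs, Kafka topics, object storage); here they are simply printed.
    """
    for _ in range(samples):
        print(json.dumps(sample_container_metrics()))
        time.sleep(interval_s)

if __name__ == "__main__":
    run_agent()
```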
DataOps
DataOps is a process-oriented practice that focuses on communicating, integrating, and automating the ingest, preparation, analysis, and lifecycle management of data infrastructure. DataOps manages the technology used to automate data delivery, with attention to the appropriate levels of quality, metadata, and security needed to improve the value of data.
“We apply AI and machine learning algorithms to solve a very critical problem that affects modern day applications. A lot of the time, companies are creating new business intelligence applications, or streaming applications. Once they create the application, it takes them a long time to make the application reliable and production-ready. So, there are mission-critical technical applications, very important to the overall company’s bottom line.”
But the application itself often ends up running on multiple different systems: systems that collect data in a streaming fashion, systems that store very large and valuable types of data. The database architecture that most companies have ends up being a complex collection of interconnected systems, and on this, “we have mission-critical technical applications running.”
Unravel applies AI and machine learning to logs, metrics, execution plans, and configurations to automatically identify where critical problems and inefficiencies lie, and to fix many of those problems automatically, so that companies running these applications can count on their mission-critical applications being reliable.
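Unravel’s models are not public, so the sketch below only illustrates the general idea of surfacing problems from collected metrics: a robust z-score over recent latency samples flags an obvious outlier. The threshold, the sample data, and the single-signal approach are assumptions for illustration and stand in for the far richer combination of logs, execution plans, and learned models described above.

```python
import statistics

def flag_anomalies(latencies_ms: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of latency samples that deviate sharply from the median.

    Uses a modified z-score based on the median absolute deviation (MAD),
    a simple, robust baseline; real systems would combine many signals and
    learned models rather than a single statistic.
    """
    median = statistics.median(latencies_ms)
    mad = statistics.median(abs(x - median) for x in latencies_ms) or 1e-9
    flagged = []
    for i, x in enumerate(latencies_ms):
        modified_z = 0.6745 * (x - median) / mad
        if abs(modified_z) > threshold:
            flagged.append(i)
    return flagged

if __name__ == "__main__":
    samples = [120, 118, 125, 119, 122, 121, 950, 117]  # one obvious spike
    print(flag_anomalies(samples))  # -> [6]
```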
“So, this is how we help companies deliver on the promise of data,” said Babu in closing. “Data pipelines are certainly making data flows and data volumes easier to deal with, and we remove those blind spots in an organization’s data pipelines. We give them visibility, AI-powered recommendations, and much more reliable performance.”