
Expanding Druid to Data Science: Introducing the Open-Source Spark-Druid Connector

Beata Zawiślak
October 23, 2023

Introduction

Big data technologies such as Apache Druid, Spark, Flink, and Kafka have gained tremendous popularity in recent years as organizations strive to make sense of the massive amounts of data they generate. However, as the demand for data science and machine learning capabilities grows, a new need has emerged: bridging the gap between analytics and data science workflows.

We will take a closer look at Apache Druid, one of the key players in the data analytics arena, and show how to integrate Druid-based analytics with data science workflows using the open-source Spark-Druid Connector created by our team at Deep.BI. We will explore how the connector makes extracting data from Druid into Spark for ad hoc data science workloads easier and more efficient.

The Challenge: Expanding Druid to Data Science

Druid is widely used in real-time analytics pipelines to provide fast and flexible access to data. However, when it comes to using Druid data for AI and machine learning, the process can become cumbersome. To train an AI model, a dataset must be built from the raw events collected. 

This typically involves inspecting, transforming, and modeling the data into a structure suitable for machine learning. If the data is stored in Druid, it must be extracted and processed before it can be used for AI model training.

There are several options for extracting data from Druid, but each comes with its own set of challenges:

1. Direct Data Extraction via Scan Queries

One approach is to extract the data directly from Druid using scan queries. While this method works well for small datasets, it becomes inefficient and resource-intensive when dealing with terabytes of data. Furthermore, it places a significant burden on the Druid cluster's performance.
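For a sense of what this looks like, here is a minimal sketch of pulling raw rows out of Druid with a native scan query over HTTP; the broker URL, datasource name, and interval are placeholders. Each response contains a batch of rows per scanned segment, so extracting terabytes means issuing many such requests against the broker and data nodes.

import requests

BROKER_URL = "http://druid-broker:8082/druid/v2/"  # Druid native query endpoint (placeholder host)

scan_query = {
    "queryType": "scan",
    "dataSource": "events",                      # placeholder datasource
    "intervals": ["2023-01-01/2023-02-01"],      # placeholder interval
    "resultFormat": "compactedList",
    "limit": 10000,
}

resp = requests.post(BROKER_URL, json=scan_query, timeout=300)
resp.raise_for_status()
for batch in resp.json():
    rows = batch["events"]  # rows for one scanned segment, as lists of column values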

2. Using the Druid DumpSegment tool

The Druid DumpSegment tool offers an alternative, allowing us to skip both the query and data nodes and read data directly from Deep Storage. While this approach is useful for debugging, it isn't optimized for extracting large volumes of data, and reading the results as a data frame in a scalable way proves challenging.
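As an illustration, a rough sketch of that route; the Druid install path, segment directory, and output file are placeholders. Each segment has to be dumped to newline-delimited JSON and then loaded back into Spark, which is why the approach does not scale well.

import subprocess
from pyspark.sql import SparkSession

# Dump one segment from Deep Storage to newline-delimited JSON (paths are placeholders).
subprocess.run(
    [
        "java",
        "-classpath", "/opt/druid/lib/*",
        "-Ddruid.extensions.loadList=[]",
        "org.apache.druid.cli.Main", "tools", "dump-segment",
        "--directory", "/deep-storage/events/2023-01-01T00_2023-01-02T00/0/",
        "--out", "/tmp/segment_dump.json",
    ],
    check=True,
)

# Load the single-segment dump back into Spark, one segment at a time.
spark = SparkSession.builder.getOrCreate()
segment_df = spark.read.json("/tmp/segment_dump.json")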

3. Archiving Data as a Data Frame on HDFS

The final option involves archiving a copy of the data as a data frame on HDFS. However, this approach comes with the drawback of duplicating both storage and indexing processes. This duplication creates operational complexities and incurs significant costs, especially when the data is only needed for a limited period.
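Concretely, this pattern looks something like the sketch below (paths are placeholders): the same events that are already indexed into Druid are written a second time to HDFS in a Spark-friendly format such as Parquet, doubling both storage and ingestion work.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raw events that are also being ingested into Druid (placeholder path).
events = spark.read.json("hdfs:///landing/events/2023-01-01/")

# Second, Spark-friendly copy kept purely for data science workloads.
events.write.mode("overwrite").parquet("hdfs:///archive/events/2023-01-01/")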

The Solution: Spark Druid Segment Reader

To address these challenges, Deep.BI has developed the Spark Druid Segment Reader. This tool allows data scientists to read data directly from Druid's Deep Storage as a Spark DataFrame in PySpark. 
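For illustration, a minimal PySpark sketch of what reading segments with the connector can look like; the format name and option keys below are hypothetical placeholders rather than the connector's documented API, so refer to the project README and the tutorial linked further down for the real ones.

from pyspark.sql import SparkSession

# Assumes the connector JAR is already on the Spark classpath.
spark = SparkSession.builder.appName("druid-segment-reader-demo").getOrCreate()

df = (
    spark.read
    .format("druid-segment")                                    # hypothetical format name
    .option("segments_path", "hdfs:///druid/segments/events/")  # Deep Storage location (placeholder)
    .option("start_date", "2023-01-01")                         # hypothetical interval options
    .option("end_date", "2023-02-01")
    .load()
)

df.printSchema()  # schema is inferred and merged across segments
df.show(5)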

The Druid Segment Reader offers several key advantages:

1. Reprocessing Historical Data

With the Druid Segment Reader, historical data can be reprocessed without involving the Druid cluster. This reduces the load on the cluster and eliminates data duplication between the data lake and Druid Deep Storage.

2. Ad Hoc Analysis of Historical Data

Different teams within an organization can run ad hoc analyses on historical data going back several years without impacting the performance of the Druid cluster. This flexibility empowers data scientists and analysts to extract valuable insights from long-term historical data.

3. Streamlined Workflow

The Druid Segment Reader allows data scientists to work natively in Spark without needing to learn additional frameworks or languages. This simplifies the transition from Druid to data science and allows for a more seamless and efficient workflow.

After the AI model is trained and evaluated, it can be serialized and pushed to Apache Flink for online predictions. The predictions can then be stored in Druid for real-time analytics. This ensures a smooth transition from batch processing to real-time predictions, all while minimizing operational and infrastructure costs. 
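As a rough sketch of the batch side of that flow, the snippet below fits a simple Spark ML model on the DataFrame read from Deep Storage and persists it; the feature and label columns are placeholders, and the actual export to Flink depends on the serialization format used in your deployment.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# df is the DataFrame read via the segment reader, as sketched earlier.
assembler = VectorAssembler(
    inputCols=["pageviews", "session_length"],  # placeholder feature columns
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="converted")  # placeholder label

model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("hdfs:///models/conversion/v1")  # placeholder path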

Technical Insights: How Druid Segment Reader Works

The Druid Segment Reader is an Apache Spark connector designed to extend the DataFrame reader. Initially developed for internal use, it has been open-sourced under the Apache 2.0 license. Here's a closer look at how it operates:

  • Segment Detection: The connector identifies segments based on intervals within the Deep Storage directory structure.
  • Schema Inference: Data schema is automatically inferred from each segment file, allowing for flexible handling of data with varying schemas.
  • Schema Compatibility: All generated schemas are merged into one compatible schema, streamlining the data handling process.
  • Data Conversion: Druid Segment Reader converts Druid data types into Spark data types, ensuring seamless integration with Spark workflows.
  • Reading Data: The connector reads data row by row, mapping each row into a new Spark row, ultimately rendering segments into Spark DataFrames.

One of the standout features of the Druid Segment Reader is its ability to automatically infer data schemas, eliminating the need to specify schemas in advance. This feature is particularly valuable in situations where data dimensions are automatically detected at indexing time, leading to variations in dimensions across segments.
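To make the merging behavior concrete, here is a small PySpark illustration of the idea (not the connector's internal code): two segments indexed with different dimension sets yield different schemas, and the merged result simply carries nulls for the dimensions a segment lacks.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two "segments" whose dimensions were detected differently at indexing time.
seg_a = spark.createDataFrame([("u1", "mobile")], ["user_id", "device"])
seg_b = spark.createDataFrame([("u2", "US")], ["user_id", "country"])

# One compatible schema (user_id, device, country), with nulls where a
# segment has no value for a dimension.
merged = seg_a.unionByName(seg_b, allowMissingColumns=True)
merged.show()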

To see how you can use the Spark Druid Segment Reader in practice in a Jupyter Notebook, follow our step-by-step tutorial.

Limitations

While the Druid Segment Reader offers powerful capabilities, it currently supports reading only the dimensions and skips the metrics. This limitation might be a consideration for specific use cases.

However, because the project is open source, organizations can contribute to and extend the connector to meet their specific needs. Deep.BI encourages contributions from the community to address these limitations and expand the connector's capabilities.

Conclusion

We have discussed the challenges of using Druid data for AI and machine learning and how our Spark Druid Segment Reader addresses them. The tool allows data scientists to read data directly from Druid's Deep Storage as a Spark DataFrame in PySpark, making it easier and more efficient to extract data for AI model training.

By integrating Druid with AI, organizations can unlock the full potential of their Big Data and make better-informed decisions. To access the Druid Segment Reader and explore its features further, you can find the project in its GitHub repository.

Additionally, Deep.BI offers other solutions, such as Predictive Cache for Druid, which optimizes cluster performance, and our Druid support plans. If you have any inquiries or are interested in collaborations, please feel free to contact us. We look forward to helping you harness the power of your data.
