
Reading Druid Segments as Data Frames in PySpark: Spark Druid Segment Reader Guide for Data Scientists

Beata Zawiślak
October 30, 2023

If you found this article, you are probably looking for an efficient way to use data from Druid segments in PySpark. There are several options for extracting data from Druid, but each comes with its own set of challenges. To address them, our team has developed the Spark Druid Segment Reader, which allows reading data directly from Druid's Deep Storage as a Spark DataFrame in PySpark.

This tutorial will help you use the Spark Druid Segment Reader in a Jupyter Notebook. If you want to find more information about the connector itself, see the previous blog article.

Before we begin:

  • Check your PySpark version: ensure you download the appropriate version of the connector from the repository. The Spark Druid Segment Reader supports Spark 2.3 and Spark 2.4 (see the version check after this list).
  • Dimensions and metrics: the connector reads only the dimensions, skipping all the metrics.
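For example, you can check the PySpark version installed in your notebook environment directly from Python:


import pyspark

# The connector supports Spark 2.3 and 2.4, so the version should start with one of these
print(pyspark.__version__)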

Step-by-step procedure:

  1. Download/Build the Solution

Download the JAR from the repository: deep-bi/spark-druid-segment-reader. Alternatively, you can download the source code and build the project yourself using the ‘sbt assembly’ command.

  2. Add the Custom JAR to Spark Context

To work with the custom JAR, add the “spark.jars” parameter to the Spark context configuration. You can use the following Python code:


from pyspark.sql import SparkSession, SQLContext

sc = (
    SparkSession.builder
    .config("spark.jars", spark_druid_segment_reader_jar_path)
    .getOrCreate().sparkContext
)

Remember to replace ‘spark_druid_segment_reader_jar_path’ with the actual path to the JAR file you downloaded or built in the previous step.
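As a quick sanity check, you can confirm that the JAR was registered in the Spark configuration:


# The "spark.jars" entry should contain the path you configured above
print(sc.getConf().get("spark.jars"))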

  3. Load a Data Frame from Druid Segments

With the Spark Druid Segment Reader properly configured, you can now load a Data Frame from Druid Segments using the code below: 


sqlContext = SQLContext(sc)
dataFrame = sqlContext.read.format("druid-segment").options(**properties).load()

The ‘properties’ parameter is a dictionary containing the read configuration. You can find its detailed specification in the README.

Here is an example definition of the properties dictionary:


properties = {
    "data_source": "druid_segments",
    "start_date": "2022-05-01",
    "end_date": "2022-05-10",
    "druid_timestamp": "druid-timestamp",
    "input_path": "s3a://s3-deep-storage/druid/segments/",
    "temp_segment_dir": "/tmp/segments/"
}

Results

If you complete all the steps correctly, you can read data from Druid segments and work with it as Data Frames. Now you can use the full potential of a Data Frame for your purposes. If you have any concerns or need further technical support, feel free to contact us.
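For example, once the Data Frame is loaded, you can inspect and query it like any other Spark Data Frame. The dimension name ‘country’ below is a hypothetical example; replace it with a dimension from your own datasource:


# Inspect the schema: the columns correspond to the Druid dimensions
dataFrame.printSchema()

# Count rows matching a (hypothetical) dimension value
dataFrame.filter(dataFrame["country"] == "US").count()

# Register the Data Frame as a temporary view and query it with Spark SQL
dataFrame.createOrReplaceTempView("druid_data")
sqlContext.sql("SELECT COUNT(*) FROM druid_data").show()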
