
Reading Druid Segments as Data Frames in PySpark: Spark Druid Segment Reader Guide for Data Scientists

Beata Zawiślak
October 30, 2023

If you found this article, you are probably looking for an efficient way to use data from Druid segments in PySpark. There are several options for extracting data from Druid, but each comes with its own set of challenges. To address them, our team has developed the Spark Druid Segment Reader, which allows reading data directly from Druid's Deep Storage as a Spark DataFrame in PySpark.

This tutorial will help you use the Spark Druid Segment Reader in a Jupyter Notebook. If you want to find more information about the connector itself, see the previous blog article.

Before we begin:

  • Check your PySpark version: ensure you download the appropriate version of the connector from the repository. The Spark Druid Segment Reader supports Spark 2.3 and Spark 2.4 (see the version check after this list).
  • Dimensions and metrics: the connector reads only the dimensions, skipping all the metrics.
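For example, you can check the PySpark version installed in your notebook environment directly from Python:


import pyspark

# The connector supports Spark 2.3 and 2.4, so the version should start with one of these
print(pyspark.__version__)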

Step-by-step procedure:

  1. Download/Build the Solution

Download the JAR from the repository: deep-bi/spark-druid-segment-reader. Alternatively, you can download the source code and build the project yourself using the ‘sbt assembly’ command.

  2. Add the Custom JAR to Spark Context

To work with the custom JAR, add the “spark.jars” parameter to the Spark context configuration. You can use the following Python code:


from pyspark.sql import SparkSession, SQLContext

sc = (
    SparkSession.builder
    .config("spark.jars", spark_druid_segment_reader_jar_path)
    .getOrCreate().sparkContext
)

Remember to replace ‘spark_druid_segment_reader_jar_path’ with the actual path to the JAR file you downloaded or built in the previous step.
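As a quick sanity check, you can confirm that the JAR was registered in the Spark configuration:


# The "spark.jars" entry should contain the path you configured above
print(sc.getConf().get("spark.jars"))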

  3. Load a Data Frame from Druid Segments

With the Spark Druid Segment Reader properly configured, you can now load a Data Frame from Druid Segments using the code below: 


sqlContext = SQLContext(sc)
dataFrame = sqlContext.read.format("druid-segment").options(**properties).load()

The ‘properties’ parameter is a dictionary containing the read configuration. You can find its detailed specification in the README.

Here is an example definition of the properties dictionary:


properties = {
    "data_source": "druid_segments",
    "start_date": "2022-05-01",
    "end_date": "2022-05-10",
    "druid_timestamp": "druid-timestamp",
    "input_path": "s3a://s3-deep-storage/druid/segments/",
    "temp_segment_dir": "/tmp/segments/"
}

Results

If you complete all the steps correctly, you can read data from Druid segments and work with it as Data Frames. Now you can use the full potential of a Data Frame for your purposes. If you have any concerns or need further technical support, feel free to contact us.
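For example, once the Data Frame is loaded, you can inspect and query it like any other Spark Data Frame. The dimension name ‘country’ below is a hypothetical example; replace it with a dimension from your own datasource:


# Inspect the schema: the columns correspond to the Druid dimensions
dataFrame.printSchema()

# Count rows matching a (hypothetical) dimension value
dataFrame.filter(dataFrame["country"] == "US").count()

# Register the Data Frame as a temporary view and query it with Spark SQL
dataFrame.createOrReplaceTempView("druid_data")
sqlContext.sql("SELECT COUNT(*) FROM druid_data").show()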
