
Druid SQL Queries on Apache Druid

Sebastian Zontek
September 5, 2022

Running Druid SQL Queries on Apache Druid

Apache Druid is an open-source, real-time analytics database designed to ingest large volumes of streaming and batch data and make it queryable with very low latency.

Written in Java, it can ingest data from Apache Hadoop and from streaming sources, and it can be queried with SQL or from client libraries in languages such as Java, Python, or Scala.

It provides near real-time query capabilities across terabytes of data, can use the Hadoop Distributed File System (HDFS) as its deep-storage backend, and is commonly combined with stream-processing engines such as Apache Storm or Apache Spark in a wider data pipeline.

Apache Druid SQL architecture

Apache Druid supports a native SQL layer that translates SQL queries into efficient native Druid queries. This layer is built on the Apache Calcite planner, which performs cost-based optimization and provides features such as type inference, subquery support, and parameterized queries.

The planner's cost model takes into account factors such as the number of segments that must be scanned to answer a query, the number of bytes processed, and the network latency between the Broker and Historical nodes.

Apache Druid supports running SQL queries via a native query engine or via a third-party query engine such as Presto. The query engine converts SQL queries into native Druid queries, which are then executed by the Druid cluster.
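As a rough sketch of this translation, you can ask Druid to show the native query it will run by prefixing a statement with EXPLAIN PLAN FOR (the wikipedia datasource and channel column below are only illustrative):

EXPLAIN PLAN FOR
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC

The output describes the native query (for a statement like this, a groupBy or topN) that the Broker will fan out to the data nodes.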

Apache Druid SQL interfaces: API selection - HTTP or JDBC

There are two ways to query data in Apache Druid: HTTP API or JDBC. Both interfaces have their own benefits and drawbacks, so it's essential to choose the right one for your needs.

The HTTP interface is a good choice if you're integrating Druid with external tools and services that already rely on the REST protocol. Queries are sent as small JSON payloads, and results can be returned as JSON objects, compact arrays, or CSV, which makes it practical to pull back even large result sets from any language or platform with an HTTP client.

The JDBC interface provides some advantages over the HTTP interface because it lets standard Java database tooling, BI tools, and ORM-style libraries connect to Druid (via the Avatica driver) as if it were any other SQL database.

Another advantage of using the JDBC interface is that data arrives as ordinary result sets and can be extracted from Druid without extra transformation. If you need complex SQL queries or want to plug Druid into existing JDBC-based tooling, this is the way to go. However, it may be less straightforward for developers who are new to big data technologies such as Druid.
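As a minimal sketch of both interfaces (host name, port, and datasource are placeholders; the endpoint path and the Avatica-based JDBC URL follow the Druid documentation), the same SQL statement can be POSTed to the Broker as JSON or sent through a JDBC connection:

POST http://broker-host:8082/druid/v2/sql
{"query": "SELECT COUNT(*) FROM my_datasource WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY"}

-- equivalent JDBC connection string (Apache Calcite Avatica driver):
jdbc:avatica:remote:url=http://broker-host:8082/druid/v2/sql/avatica/

With JDBC, the statement is then executed like any other SQL query through the standard java.sql API.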

Different kinds of SQL workloads: OLTP vs OLAP

When it comes to running SQL queries on Apache Druid, there are two different types of workloads that you can use: OLTP and OLAP. OLTP is best suited for transactional workloads, while OLAP is better for analytical workloads. Each type of workload has its own benefits and drawbacks, so it's important to choose the right one for your needs.

For example, with OLTP workloads, a single query typically reads or writes only one or two rows. With OLAP workloads, on the other hand, a single query might scan millions of rows and return hundreds or thousands of them in one request.

Another difference between these two types of workloads is how they scale. With OLTP workloads like transaction processing, you need high concurrency because many small transactions run, and overlap, at the same time. With an analytical workload like data exploration, individual queries are heavier but far fewer of them run concurrently, so concurrency isn't as critical an issue.

OLTP - Online Transaction Processing

If you're running a business that relies on fast, reliable access to data, you need a database that can keep up with OLTP-style (online transaction processing) traffic: many small, concurrent queries answered with low latency. If you're looking for an open-source option that handles this kind of high-concurrency, low-latency serving well, Apache Druid is a strong choice.

Some of the advantages of using Apache Druid include its scalability and availability, combined with low latency and high throughput. Apache Druid also offers a long list of enterprise-grade features, such as horizontal clustering, segment replication, and automatic failover.

OLAP - Online Analytical Processing

Druid is an open-source OLAP database that supports high-performance queries on large data sets. Its built-in query languages, Druid SQL and the JSON-based native query language, can execute a wide variety of analytical operations, including aggregations, joins, and approximate statistical functions.

The SQL layer can also join multiple tables using the JOIN keyword and aggregate columns using GROUP BY. These features let users run complex analytics far more efficiently than traditional analytics engines, which often require the user to pre-process the data into different buckets before any type of analysis can be performed.
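As an illustrative sketch (the orders datasource and the countries lookup are hypothetical), a join and an aggregation can be combined in one statement; in Druid the right-hand side of a join must be small enough to broadcast, so lookups and small datasources are typical join targets:

SELECT c.v AS country_name, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN lookup.countries AS c ON o.country_code = c.k
GROUP BY c.v
ORDER BY total_amount DESC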

How do OLAP queries give better results on Apache Druid?

Druid is designed for OLAP (online analytical processing) style queries, which tend to be much more complex and resource-intensive than the simpler OLTP (online transaction processing) queries that are commonly run on relational databases.

As a result, Druid can provide much better performance and results for OLAP queries.

Apache Druid SQL data types

Apache Druid supports standard SQL data types such as TIMESTAMP, DATE, BOOLEAN, VARCHAR, BIGINT, DOUBLE, and FLOAT, which are mapped onto Druid's native column types at query time.

Standard types

Druid supports a variety of standard data types, including booleans, strings, longs, floats, doubles, and dates. In addition to these basic types, Druid also supports complex types, such as the sketch columns used for approximate algorithms (for example, hyperUnique and approximate histograms). All of these data types can be used in SQL queries. A good way to start is by trying out some simple SELECT statements and inspecting how they are planned with EXPLAIN PLAN FOR.

Druid's native column types are "long" (64-bit signed int), "float" (32-bit float), "double" (64-bit float), "string" (UTF-8 encoded strings and string arrays), and "complex" (catch-all for more exotic data types like hyperUnique and approxHistogram columns).

Druid treats timestamps (including the __time column) as longs, with the value being the number of milliseconds from 1970-01-01 00:00:00 UTC, without including leap seconds. As a result, timestamps in Druid do not include any time zone information, but merely the precise point in time they represent.
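A short sketch of this behavior (the events datasource is hypothetical): MILLIS_TO_TIMESTAMP and TIMESTAMP_TO_MILLIS convert between the underlying long representation and SQL TIMESTAMP values:

SELECT
  __time,
  TIMESTAMP_TO_MILLIS(__time) AS epoch_millis,
  MILLIS_TO_TIMESTAMP(1662336000000) AS fixed_point_in_time  -- 2022-09-05T00:00:00Z
FROM events
LIMIT 5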

The table below illustrates how Druid translates SQL types to native types during query execution. Except for the exceptions listed in the table, casts between two SQL types with the same Druid runtime type have no impact.

Casts between two SQL types with distinct Druid runtime types will result in a Druid runtime cast. If a value cannot be properly cast to another value, as in CAST('foo' AS BIGINT), the runtime will substitute a default value. NULL values cast to non-nullable types will also be substituted with a default value (for example, nulls cast to numbers will be converted to zeroes).
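A brief sketch of these casting rules (constant expressions only, so no datasource is needed; the zero for the failed cast assumes Druid's default, non-SQL-compatible null handling):

SELECT
  CAST('42' AS BIGINT)  AS string_to_long,   -- 42
  CAST(42 AS VARCHAR)   AS long_to_string,   -- '42'
  CAST('foo' AS BIGINT) AS failed_cast       -- 0, the default value is substituted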

Table: standard SQL data types in Apache Druid and their Druid runtime types

Multi-value strings

Strings can have many values in Druid's native type system. These multi-value string dimensions will be represented as VARCHAR types in SQL and can be utilized syntactically as any other VARCHAR. Regular string functions referring to multi-value string dimensions will be applied to each row's values independently.

Multi-value string dimensions can alternatively be regarded as arrays by using special multi-value string functions that conduct array-aware operations. When you group by a multi-value expression, the native Druid multi-value aggregation behavior is followed, which is akin to the UNNEST capabilities seen in some other SQL dialects. For more information, see the documentation on multi-value string dimensions.

Because multi-value dimensions are treated by the SQL planner as VARCHAR, there are some inconsistencies between how they are handled in Druid SQL and in native queries.

For example, expressions involving multi-value dimensions may be incorrectly optimized by the Druid SQL planner: multi_val_dim = 'a' AND multi_val_dim = 'b' will be optimized to false, even though it is possible for a single row to have both "a" and "b" as values for multi_val_dim. The SQL behavior of multi-value dimensions will change in a future release to more closely align with their behavior in native queries.
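A hedged sketch of the array-aware approach (the pages datasource and its multi-value tags dimension are hypothetical): MV_CONTAINS checks whether a row's multi-value dimension contains all of the given values, which avoids the mis-optimized pair of equality filters described above, and grouping by the dimension produces one output row per value, UNNEST-style:

SELECT tags, COUNT(*) AS page_count
FROM pages
WHERE MV_CONTAINS(tags, ARRAY['a', 'b'])
GROUP BY tags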

NULL values

If you want to filter out NULL values, you can use the IS NOT NULL operator in your WHERE clause; this returns every row for which the column has a value: SELECT * FROM MY_TABLE WHERE COLUMN_A IS NOT NULL. Note that a comparison such as COLUMN_A != NULL does not do this: comparing against NULL never evaluates to true, so use IS NOT NULL instead.

In contrast, if you want only the rows where the column is missing, use IS NULL: SELECT * FROM MY_TABLE WHERE COLUMN_A IS NULL. Once the NULLs are handled, results can be sorted and paged through with ORDER BY, LIMIT, and OFFSET. For example, ORDER BY COLUMN_A ASC LIMIT 10 OFFSET 0 returns the first ten rows alphabetically; switching to the DESC keyword and increasing the OFFSET in steps of the page size walks through the full result set in descending order.

With a page size of, say, 100 records, you simply repeat the query with a larger OFFSET each time until you have listed as many pages as desired or reached the end of the result set.

Apache Druid SQL Operators

Apache Druid supports a wide variety of SQL operators. These operators allow the user to build more complex queries. For example, a user can find events whose text contains the word "deer" by adding a CONTAINS_STRING(..., 'deer') call (or a LIKE '%deer%' filter) to the WHERE clause, entering the statement in the query editor, and running the query.
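For instance, a filter like this could be run in the console or over the API (the events datasource and comment column are hypothetical):

SELECT __time, comment
FROM events
WHERE CONTAINS_STRING(comment, 'deer')
LIMIT 10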

Arithmetic operators

Druid supports standard arithmetic operators for all numeric data types. Addition, subtraction, multiplication, and division all work as you would expect. You can use these operators in conjunction with the GROUP BY clause to perform calculations on grouped data. For example, you could use the following query to calculate the average price of a product by category.
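A minimal sketch of such a query (the products datasource with its category and price columns is hypothetical):

SELECT category, AVG(price) AS avg_price
FROM products
GROUP BY category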

Datetime arithmetic operators

You can use the + and - operators together with INTERVAL values on TIMESTAMP columns: they add or subtract a certain number of days, months, years, hours, minutes, or seconds from a TIMESTAMP value (for example, __time + INTERVAL '1' DAY).

Multiplying or dividing a timestamp itself is not meaningful. For finer-grained control, Druid provides time functions such as TIMESTAMPADD and TIMESTAMPDIFF, which add an offset to a timestamp or compute the difference between two timestamps in a chosen unit, and TIME_FLOOR, which buckets timestamps to a given granularity.
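A small sketch of these operations on a hypothetical events datasource:

SELECT
  __time + INTERVAL '1' DAY AS one_day_later,
  TIMESTAMPADD(HOUR, 6, __time) AS six_hours_later,
  TIMESTAMPDIFF(MINUTE, __time, CURRENT_TIMESTAMP) AS minutes_ago,
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket
FROM events
LIMIT 5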

Concatenation operator

Druid supports a wide variety of data sources, and one of the most popular is Apache Kafka. If you're running Druid queries that use an aggregation function such as COUNT(), make sure the query groups by a key so the results are meaningful. For example:

SELECT ds, COUNT(*) FROM streams WHERE ds = 'my-stream' GROUP BY ds;

We can also use the concatenation operator (||) to join two fields together. For example, we can label each message that mentions 'cats' or 'dogs' with its type and text:

SELECT msg_type || ':' || msg_text AS type_of_message
FROM stream AS s JOIN messages AS m ON m.msg = s.ds
WHERE m.msg_text LIKE '%cats%' OR m.msg_text LIKE '%dogs%'

Comparison operators

Druid provides several different ways to query data, including native JSON queries and SQL. SQL queries are often preferred because they're easier to write, while native queries require a more thorough understanding of the query engine and the dataset; since Druid SQL is translated into native queries, the performance of the two is broadly comparable.

For this reason, most applications that use Apache Druid do not hand-write native queries; they instead rely on SQL, typically issued from an application language such as Python or Java, often through a client library or ORM-style layer.
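The standard comparison operators (=, <>, >, >=, <, <=, BETWEEN, IN, LIKE) are nevertheless all available in Druid SQL; a quick sketch against a hypothetical orders datasource:

SELECT order_id, amount
FROM orders
WHERE amount > 100
  AND status <> 'cancelled'
  AND country IN ('US', 'DE', 'PL')
  AND __time BETWEEN TIMESTAMP '2022-09-01 00:00:00' AND TIMESTAMP '2022-09-05 00:00:00'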

When to Use SQL Queries on Apache Druid

SQL queries can be used to aggregate data, run time series calculations, or perform filters and grouping operations.

In general, if the data can be expressed in SQL, then it can be run as a query on Apache Druid.
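As a closing sketch that combines filtering, grouping, aggregation, and a time-series bucket (the events datasource and its columns are hypothetical):

SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  country,
  COUNT(*) AS event_count,
  APPROX_COUNT_DISTINCT(user_id) AS unique_users
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY TIME_FLOOR(__time, 'PT1H'), country
ORDER BY hour_bucket, event_count DESC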
