
Monitoring Apache Druid in Grafana

Beata Zawiślak
April 29, 2024

Modern systems, especially distributed ones, are increasingly tailored to business expectations and customer requirements. As a result, they are becoming more complex, and it is no longer obvious how to track every operation they perform.

Fortunately, good system monitoring is gaining popularity and now goes beyond collecting and processing logs. Newer systems can expose the most important information about themselves as metrics.

System metrics are the measurements available within a system. Every resource that can be observed for performance, availability, reliability, or other characteristics has one or more metrics from which data can be gathered.

Metrics in Druid

Druid makes it easy to configure the emission of metrics, which are fundamental for monitoring query execution and performance, the ingestion process, and exceptions. The Druid documentation lists every metric along with its dimensions and a brief description.

Some of the key metric groups that Druid offers include:

  • Query Metrics: Query metrics track the performance of queries executed on Druid. This includes metrics such as the time needed to complete a query, the amount of returned bytes in the query response, and query success rates.
  • Indexing Metrics: This set of metrics provides all needed details about indexing processes, such as the time taken to run a task, and the number of task actions executed successfully during the emission period.
  • Coordinator Metrics: With these metrics, it is possible to track general segment status.
  • Ingestion Metrics: Druid provides general native ingestion metrics and the possibility of monitoring indexing services such as Kinesis or Kafka.
  • General Health Metrics: General health metrics contain information about the number of segments and their size, JVM memory, ZooKeeper connection status, and disconnected time.
  • System Metrics: System metrics provide information on CPU usage, memory usage, disk I/O, and network traffic. These metrics are only available if the OshiSysMonitor monitor is enabled.

These metrics are essential for monitoring the health, performance, and efficiency of Druid clusters, identifying bottlenecks, optimizing configurations, and ensuring the smooth operation of real-time analytics workloads. Additionally, the Druid integration with Grafana allows users to set up alerts based on predefined thresholds for these metrics.

Grafana in Druid

Check our previous tutorial on Integrating Grafana with Druid for detailed guidelines on how to set up the environment correctly. In the rest of this article, we will show you sample dashboards that you may find useful for basic Druid monitoring.

Example dashboards in Grafana

1. Number of successful and failed queries

Used metrics: 

  • query/success/count
  • query/failed/count 
  • query/interrupted/count

The most basic (but extremely useful) metrics are those that return the number of successful, failed, interrupted, or timed-out queries. This simple information can easily be used for Druid cluster monitoring.

Number of successful and failed queries

In Grafana, we can display individual values by time of day. Because filtering by time is easy, we can show a line graph for a particular time range.

A line graph for a particular time range

Moreover, Grafana can display the result of a mathematical expression on a chart. In our case, we wanted to see the rate of successful queries. The ratio of successful queries to all queries gives us more detailed information about query performance.

The absolute number of queries depends on the system, but the ratio of successful queries should be high regardless of how many queries are executed.

Successful query rate

To configure the above chart, you must use the Grafana expression feature available while creating the panel. 

First, choose the metrics for the number of successful, failed, and interrupted queries. For each of them, write an appropriate PromQL query against Prometheus.

In the screenshot below you can also see global Grafana variables such as “Servers” and “Jobs” added for filtering. They are not required to build the successful query rate panel.

Metrics browser

The expression that calculates the successful query rate uses three inputs:

  • $A is the result of the Prometheus query which returns the number of successful queries.
  • $B is the result of the Prometheus query which returns the number of failed queries.
  • $C is the result of the Prometheus query which returns the number of interrupted queries.
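
The arithmetic behind this panel can be sketched in a few lines of Python. This is only an illustration, assuming the rate is defined as the successful count divided by the sum of all three counts (in Grafana expression terms, something like `$A / ($A + $B + $C)`); the function name and the sample counts are hypothetical.

```python
from typing import Optional


def success_rate(successful: float, failed: float, interrupted: float) -> Optional[float]:
    """Return the share of successful queries, or None when no queries ran."""
    total = successful + failed + interrupted
    if total == 0:
        # Avoid dividing by zero for windows in which no query was executed.
        return None
    return successful / total


# Hypothetical counts for a single time window (assumed values, not real data).
rate = success_rate(successful=950, failed=30, interrupted=20)
print(f"{rate:.1%}")  # prints "95.0%"
```

Returning None for an empty window is a design choice in this sketch: it keeps "no queries" distinct from "0% success", which would otherwise look like an outage on a chart.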

Remember to disable every query except the expression that holds the final result.

2. Cache hit rate

Used metrics: 

  • druid/query/delta/hitRate

Druid supports two types of query caching: per-segment caching and whole-query caching. Regardless of the type in use, we want to check whether caching is effective. Druid can report cache metrics as either delta or total values. The difference between them is significant, so take care to query the correct one: delta metrics cover only the period since the last emission, while total metrics report the cumulative cache values. To monitor cache performance in a Grafana panel, use the druid/query/delta/hitRate metric. Be sure to choose the delta metric rather than the total one if you want to reproduce the chart below.
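
To make the difference between delta and total concrete, here is a minimal Python sketch. It does not read anything from Druid; it assumes hypothetical per-period hit and miss counts and computes both a per-period (delta) hit rate and a cumulative (total) hit rate from them.

```python
from typing import List, Optional, Tuple


def delta_and_total_hit_rates(
    periods: List[Tuple[int, int]],
) -> List[Tuple[Optional[float], Optional[float]]]:
    """For each emission period, return (delta hit rate, total hit rate).

    `periods` holds hypothetical (hits, misses) counts observed in each period.
    """
    results = []
    total_hits = 0
    total_misses = 0
    for hits, misses in periods:
        total_hits += hits
        total_misses += misses
        delta_lookups = hits + misses
        total_lookups = total_hits + total_misses
        # Delta rate looks only at the current period.
        delta_rate = hits / delta_lookups if delta_lookups else None
        # Total rate looks at everything seen so far.
        total_rate = total_hits / total_lookups if total_lookups else None
        results.append((delta_rate, total_rate))
    return results


# Assumed sample data: the cache stops being effective in the third period.
for delta_rate, total_rate in delta_and_total_hit_rates([(90, 10), (80, 20), (10, 90)]):
    print(f"delta={delta_rate:.2f} total={total_rate:.2f}")
```

In the last period the delta rate drops to 0.10 while the total rate is still 0.60, which is why the delta value is the better choice for a chart meant to reveal recent changes.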

Cache hitrate

3. Time taken to complete a query

Used metrics: 

  • druid/query/time/sum

The average time of executed queries gives a sense of how well our queries perform, but on its own it makes possible issues hard to pinpoint. Grouping the time metric by query type reveals the relationship between query type and average execution time. With that, we can see which kinds of queries take the longest and are most likely to frustrate customers.
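
The grouping idea can be sketched in Python as follows. The sketch assumes that, for each query type, we have a summed query time and a matching query count; the input values and the function name are hypothetical, and the average is simply the sum divided by the count.

```python
from typing import Dict, List, Tuple


def average_query_time(totals: Dict[str, Tuple[float, int]]) -> List[Tuple[str, float]]:
    """Return (query type, average time in ms) pairs, slowest type first.

    `totals` maps a query type to (summed query time in ms, number of queries).
    """
    averages = {
        query_type: time_sum / count
        for query_type, (time_sum, count) in totals.items()
        if count > 0  # skip types with no queries in the window
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Assumed sample data for one time window.
totals = {
    "timeseries": (100.0, 2),
    "groupBy": (800.0, 2),
    "topN": (120.0, 1),
}
for query_type, avg_ms in average_query_time(totals):
    print(f"{query_type}: {avg_ms:.0f} ms")
```

Sorting from slowest to fastest puts the query types that deserve attention at the top, which mirrors what a per-type panel is meant to highlight.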

Average time taken to complete a query


Monitoring Druid in Grafana provides a comprehensive solution for tracking the health, performance, and efficiency of Druid clusters. Users can gain valuable insights into system operations with various metrics available, including query, indexing, coordinator, ingestion, and general health metrics. Grafana's flexibility enables the creation of informative dashboards, allowing users to visualize key metrics such as query success rates, cache hit rates, and query completion times. This integration helps users to optimize configurations, identify bottlenecks, and ensure the smooth operation of real-time analytics workloads in increasingly complex distributed systems.
