Connecting Druid with AWS EMR via VPN to run Hadoop Indexing Jobs

It is common to run a hybrid infrastructure: your own datacenter plus some services in a public cloud. At Deep.BI we have built our private cloud on rented servers, and we also use external clouds such as AWS or Azure.

In this post we describe how to connect a Druid cluster hosted in a private datacenter to EMR (Elastic MapReduce), Amazon's cloud Hadoop service, in order to run Hadoop Indexing Jobs, which solves the Kafka Indexing Service "not merging segments" problem.

First, we modified the security groups to accept connections from our Druid MiddleManagers. HDFS access was not a problem, as our HDFS client (snakebite) connects to a webHDFS service which listens on port 8020. Unfortunately, when we tried to access EMR through its public DNS name, we encountered the same Connection refused error.
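A quick way to tell a security-group or routing problem apart from a Hadoop-level one is to test plain TCP reachability from a MiddleManager. The following is only a minimal sketch: the function name can_connect is ours, and port 8020 is simply the port mentioned above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling can_connect("<EMR master address>", 8020) from a MiddleManager should return True once the security groups and the network path are in place. A False result does not say which of the two is at fault, only that the port is unreachable from that host.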

The message was clear to us: we needed direct access to the EMR cluster using its local hostnames. We configured the ec2-to-emr router and used a VPN to access EMR.

Finally, our MiddleManagers were able to connect to the EMR cluster using its local IP. Unfortunately, we still encountered the same Connection refused error. The problem was that the Hadoop client was choosing randomly between our two interfaces [all_traffic:eth0, VPN:tun0] to communicate with the cluster. We tried to "convince" it to use the tun0 interface, which was a partial success: the Connection refused error was gone, but a new one appeared: unknown tun0 interface. EMR passed our Hadoop option on to its own cluster, which had no idea about our tun0 interface, nor was it supposed to.
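When debugging this kind of interface confusion, it helps to know which local address the operating system itself would use to reach an EMR node. The sketch below is an illustration only (the helper name source_ip_for is ours): it reports the kernel's routing decision, not the choice made by the Hadoop client, so it only confirms whether the VPN route is set up as expected.

```python
import socket


def source_ip_for(dest_ip: str, port: int = 8020) -> str:
    """Return the local IP the kernel would use to reach dest_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel pick a route and bind a matching local address.
        s.connect((dest_ip, port))
        return s.getsockname()[0]
```

If the address returned for an EMR private IP is the one assigned to tun0, the routing table is sending that traffic through the VPN, and any remaining failure lies higher up the stack.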

Fortunately, adding the proper entries for EMR local name resolution to /etc/hosts on our Druid MiddleManagers solved all of the networking problems.
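For illustration, the /etc/hosts entries can be generated from the private IPs of the EMR nodes. This is a rough sketch under assumptions: it relies on the default AWS private DNS naming (ip-A-B-C-D.ec2.internal in us-east-1, ip-A-B-C-D.<region>.compute.internal elsewhere), and the function names and the sample IP are hypothetical. If your VPC uses custom DNS names, take the hostnames from the cluster instead.

```python
import ipaddress


def emr_private_hostname(private_ip: str, region: str = "us-east-1") -> str:
    """Build the default AWS private DNS name for a private IPv4 address."""
    ipaddress.IPv4Address(private_ip)  # raises ValueError on malformed input
    dashed = private_ip.replace(".", "-")
    # Assumption: default VPC naming; us-east-1 uses the ec2.internal domain.
    domain = "ec2.internal" if region == "us-east-1" else f"{region}.compute.internal"
    return f"ip-{dashed}.{domain}"


def hosts_lines(private_ips, region: str = "us-east-1"):
    """Return /etc/hosts lines: IP, full private name, short name."""
    lines = []
    for ip in private_ips:
        fqdn = emr_private_hostname(ip, region)
        short = fqdn.split(".", 1)[0]
        lines.append(f"{ip} {fqdn} {short}")
    return lines
```

For a hypothetical node at 10.0.1.23 in us-east-1, hosts_lines(["10.0.1.23"]) yields the single line "10.0.1.23 ip-10-0-1-23.ec2.internal ip-10-0-1-23", which can be appended to /etc/hosts on each MiddleManager.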