
How Deep.BI collects and stores big data for real-time and historical super fast analytics and AI

Sebastian Zontek
September 26, 2017

The major challenge in big data processing is storing the data in a way that allows fast, flexible access and analytics, including real-time analysis.

At Deep.BI we tackle this problem using top-notch technologies such as Druid, Kafka, and Flink.

How the data is collected and stored

Our JavaScript tag collects single interactions on websites and apps, then converts them to JSON events.


For example:  

{
  "event.type": "page-open",
  "timestamp": timestamp,
  "user.id.cookie": cookie_id,
  "attributeGroup1": {
    "attr1": "value",
    "attr2": "value"
  },
  "attributeGroup2": {
    "attr1": "value",
    "attr2": "value"
  }
}

Each attribute creates a column in our database (Druid). As a result, we get the following columns:

event.type, timestamp, user.id.cookie, attributeGroup1.attr1, attributeGroup1.attr2, attributeGroup2.attr1, attributeGroup2.attr2 
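To make that mapping concrete, here is a minimal sketch in plain Python (an illustration only, not our actual ingestion code) of how a nested event JSON flattens into those dotted column names:

# Minimal sketch: flatten a nested event dict into dotted column names.
# The values mirror the example event above; this is not Deep.BI's pipeline code.
def flatten(event, prefix=""):
    columns = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=f"{name}."))
        else:
            columns[name] = value
    return columns

event = {
    "event.type": "page-open",
    "timestamp": "2017-09-26T12:00:00Z",
    "user.id.cookie": "cookie_id",
    "attributeGroup1": {"attr1": "value", "attr2": "value"},
    "attributeGroup2": {"attr1": "value", "attr2": "value"},
}

print(flatten(event))
# {'event.type': 'page-open', ..., 'attributeGroup1.attr1': 'value',
#  'attributeGroup2.attr2': 'value'}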

Usually, there are ~200 columns per row. Each row represents a single event, and each event "weighs", on average, 4kB.

These columns and rows are stored in Druid in its native, binary format, segmented by time: each segment contains the events from a certain time range. It is worth noting that data is compressed per column, and the compression typically shrinks the data 25-100 times, so 1 TB of raw data is stored as 10-40 GB. This is how we can provide real-time data exploration, analytics, and access on huge data sets.
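The time-based segmentation is configured in Druid's ingestion spec through its granularitySpec. The fragment below, written as a Python dict, is an illustrative sketch; the specific granularity values are example assumptions, not our production settings:

# Illustrative fragment of a Druid ingestion spec controlling time-based segmentation.
# Granularity values are assumptions for the example, not Deep.BI's production config.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "HOUR",   # each segment holds one hour of events
    "queryGranularity": "NONE",     # keep full per-event timestamps
    "rollup": False,                # store individual events rather than pre-aggregates
}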

How you can use this data

We provide visual data exploration and dashboard creation as well as sharing tools. Additionally, you have access to our API.
People often ask about access to their "raw data" stored at Deep.BI.

When considering raw data usage, you should keep in mind how you want to analyze it.

First, you can reverse the compression mechanism and extract the raw JSONs, but this is often not optimal.

Usually, people want to extract specific, user-centric information from the data. For example, for machine learning purposes you may want to extract rows of the form:

[userid, attr1, attr2, …, attrN, metric1, metric2, …, metricN, label]

Example:  

[UUID1, deviceModel, country, …, emailDomain, numberOfVisits[N], visitedSectionX[0,1], …, timeSpent[X], purchased[0,1]]

In this ML scenario you would use the Deep.BI platform for feature engineering: the extraction of attributes and creation of synthetic features from metrics.
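As an illustration only (the field names below, such as "device.model" or "page.section", are hypothetical placeholders rather than our schema), feature engineering of this kind boils down to grouping events by user and aggregating them into one row per user:

from collections import defaultdict

# Hypothetical sketch of the feature-engineering step: group raw events by user
# and aggregate them into [userid, attributes..., metrics..., label] rows.
# Field names are placeholders, not Deep.BI's actual schema.
def build_feature_rows(events):
    per_user = defaultdict(list)
    for e in events:
        per_user[e["user.id.cookie"]].append(e)

    rows = []
    for user_id, user_events in per_user.items():
        first = user_events[0]
        rows.append([
            user_id,
            first.get("device.model"),                                        # attribute
            first.get("geo.country"),                                         # attribute
            len(user_events),                                                 # numberOfVisits
            int(any(e.get("page.section") == "X" for e in user_events)),      # visitedSectionX in {0, 1}
            sum(e.get("time.spent", 0) for e in user_events),                 # timeSpent
            int(any(e.get("event.type") == "purchase" for e in user_events)), # purchased label in {0, 1}
        ])
    return rows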

To do this you don’t actually need raw Druid segments, nor raw JSON files; just create Deep.BI API queries and you’ll get CSVs with that kind of “raw data”.
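As a purely hypothetical sketch (the endpoint, payload fields, and auth header below are placeholders, not the documented Deep.BI API; see the documentation page for the real interface), such a query-to-CSV flow might look like this:

import requests

# Hypothetical example only: URL, payload and auth are placeholders,
# not the real Deep.BI API.
response = requests.post(
    "https://api.example-analytics.com/v1/query",   # placeholder endpoint
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "interval": "2017-09-01/2017-09-26",
        "dimensions": ["user.id.cookie", "device.model", "geo.country"],
        "metrics": ["numberOfVisits", "timeSpent"],
        "format": "csv",
    },
    timeout=60,
)
response.raise_for_status()

with open("features.csv", "wb") as f:
    f.write(response.content)   # each row is one user's "raw data" feature vector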

Documentation Page

Get in touch to find out more!

