Analytics Infrastructure

Source

I don’t have a source for this post; it’s a walkthrough of evaluating a technology for real-time analytics.

Background

I have been working on a side project — namely RelayBird — to compete with URL shorteners like bitly and tinyurl (have you seen their prices???). I’ll probably do a longer write up when I launch the site. At a minimum I need link analytics and dashboards to display them, otherwise I won’t have a feature set worth paying or charging for. I decided to do this write up because, up until recently, analytics were a mystery to me.

Requirements

Like any good developer, I asked: “What should I be tracking, and what would a user want to see?” I came up with this simplified list:

  1. Average clicks over time (i.e., traffic to each link)
  2. Total clicks per link
  3. Where the clicks to links are coming from

As the site admin, I want to be able to view that data for everyone who uses the site. As a sidenote: I don’t expect anyone to really use this service. RelayBird is just a conduit for my learning.

Solutions

A Time Series DB

I am tracking events (clicks) that happen over time, so a time series DB seems like a natural place to start. The most popular free metrics store I am aware of is Prometheus. However, Prometheus wasn’t really built to serve as a real-time analytics data store; it was built to be the monitoring and alerting backbone of a service. More importantly, Prometheus wasn’t built to be a long-term storage option. E.g., we might want to keep “clicks over time” data for months or years, and Prometheus doesn’t support that kind of retention out of the box.

Furthermore, I might need to perform some queries that are a little more complicated than most TSDBs were designed to handle. E.g., if I want to run an aggregation across dozens of different fields (i.e., dimensions), I will have a better time if I use tech that can handle that use case. Cue some OLAP!

An OLAP Data Store

RelayBird should be able to serve hundreds or thousands of clicks and redirects. The information collected from those clicks will need to be shown to users in their dashboards in real time. I might also want to analyze all the “click” data and the metadata I have collected for those clicks directly, so I can make decisions on what to change.

ClickHouse is a great option for my use case, and I have used it at my job for tons of different data. If you are comfortable working with relational databases, you should feel right at home with ClickHouse, with the exception that it is a column-oriented data store.

A Simple Example

By no means am I a ClickHouse expert. If I were brand new to the OLAP space, and analytics pipelines in general, I would evaluate ClickHouse by writing a toy example and getting familiar with the platform. That is exactly what we are going to do. Thankfully, ClickHouse has a Docker image we can leverage.

Docker

The ClickHouse docs will point you to the official ClickHouse Docker image. I recommend spinning up a container and getting a shell into it. This way you don’t need to worry about installing ClickHouse on your system.

You can start up the container and get a shell by running the following command.

docker run -it clickhouse /bin/bash

Now you can run the clickhouse command to start building/testing/evaluating. I do recommend reading the official starter guides: https://clickhouse.com/docs/en/guides/creating-tables. TL;DR - ClickHouse (CH) works like any other SQL database, so you can create a database in the CH instance and the tables you want in it just like you would in Postgres or MySQL. Similarly, you can define a primary key (optional) and a sort key for your table.

I am going to call my database “relaybird”.

:) CREATE DATABASE IF NOT EXISTS relaybird;

CREATE DATABASE IF NOT EXISTS relaybird

Query id: a723ec3e-0ad4-481b-92d5-8dfac05f63e3

Ok.

0 rows in set. Elapsed: 0.030 sec.

Since I’ll be working in the relaybird DB context, I’ll just switch into it. You don’t have to do this, but I think it’ll save some typing.

USE relaybird;

USE relaybird

Query id: f66568b3-766c-4854-8058-3237c0cb7883

Ok.

0 rows in set. Elapsed: 0.002 sec.

Rollup Tables

If you are asking: “Wait, how exactly are we going to store the information we want?”, I applaud you.

We could just make one table to store all the information we care about. In my case, I want to let users see minute-by-minute click rates (e.g., 123 clicks in a 1 minute period). However, some users might have campaigns they want to run over the course of a few weeks or months. Storing minute-by-minute data would not only cost me an arm and a leg, it would also take a long time to compute the coarse-grained values I am talking about.

Enter the “rollup table”. The idea of these tables is to aggregate fine-grained data at a coarser granularity. E.g., my minute-by-minute table could be aggregated into an hour-by-hour table. These tables do wonders for performance and storage. If you want hour-by-hour data, you don’t need to query your raw data; instead, you can query a table that contains pre-aggregated hour-by-hour data (faster). In terms of storage, you have the option to delete some of your raw data once you have the coarse-grained values you care about. In my case, I am going to guess that most people don’t care to see redirects at second-level granularity, so once my minute-by-minute data has been computed, I can delete the raw events that took place within that minute.
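
ClickHouse has first-class support for that kind of cleanup: MergeTree tables accept a TTL clause, and expired rows are removed during background merges. Here is a minimal sketch of the idea, using a hypothetical raw_events table and a placeholder 30-day retention (the real base table for RelayBird is defined in the next section):

CREATE TABLE raw_events
(
  user_id UInt32,
  link_string String,
  timestamp DateTime
)
ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
-- drop raw rows once they are more than 30 days old
TTL timestamp + INTERVAL 30 DAY;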

Tables

Let’s zoom in on redirect/click data. Each redirect is an event that takes place in my system. I want to record every single one regardless of when it happens. The collection of all redirects will form my base table.

CREATE TABLE IF NOT EXISTS redirect_events
(
  user_id UInt32,
  link_string String,
  timestamp DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);

This table definition is pretty much identical to the one used in the ClickHouse docs. It will serve as the source of truth for all the minute, hour, day, etc. rollup tables that I will want. So let’s create those table definitions too.

CREATE TABLE IF NOT EXISTS redirects_by_minute
(
  user_id UInt32,
  link_string String,
  minute DateTime,
  redirects UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(minute)
ORDER BY (minute, link_string, user_id);

CREATE TABLE IF NOT EXISTS redirects_by_hour
(
  user_id UInt32,
  link_string String,
  hour DateTime,
  redirects UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, link_string, user_id);

CREATE TABLE IF NOT EXISTS redirects_by_day
(
  user_id UInt32,
  link_string String,
  day DateTime,
  redirects UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (day, link_string, user_id);

You might be asking: why are we using a SummingMergeTree engine instead of the MergeTree engine? The CH docs have a pretty good description of the difference:

The difference is that when merging data parts for SummingMergeTree tables ClickHouse replaces all the rows with the same primary key (or more accurately, with the same sorting key) with one row which contains summarized values for the columns with the numeric data type

That is exactly what we want: these tables will collapse rows that share a sorting key into a single row with summed values. In this case, the primary key is implicitly defined by the sort key: (time granularity, link_string, user_id).
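
One caveat worth knowing: the summing happens during background merges, which run some time after insert, so a rollup table can temporarily hold several partial rows for the same key. The CH docs recommend aggregating at query time with sum() and GROUP BY rather than relying on merges having completed. A sketch of what that looks like for the minute table:

SELECT
    user_id,
    link_string,
    minute,
    sum(redirects) AS redirects
FROM redirects_by_minute
GROUP BY
    user_id,
    link_string,
    minute;

The plain SELECT * queries later in this post happen to look right because each key’s events arrive in a single insert block, but the GROUP BY form is the safe default.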

So far so good. We have the table definitions we want; we are left with figuring out how to populate them. A naive approach would be to write a cronjob that populates the rollup tables based on values in the base table. That isn’t a terrible idea depending on how much data and time you have, but as the data grows, a brute-force approach becomes more costly.

Fortunately, CH has a solution for populating these tables: materialized views. I do recommend reading the CH documentation on materialized views because they are not exactly the same as the materialized views that you would create in PostgreSQL.

Unlike in transactional databases like Postgres, a ClickHouse materialized view is just a trigger that runs a query on blocks of data as they are inserted into a table. The result of this query is inserted into a second “target” table. Should more rows be inserted, results will again be sent to the target table where the intermediate results will be updated and merged. This merged result is the equivalent of running the query over all of the original data.

In our case, we want materialized views that aggregate data at each time granularity. Here are the definitions:

CREATE MATERIALIZED VIEW redirects_by_minute_mv
TO redirects_by_minute
AS
SELECT
    user_id,
    link_string,
    toStartOfMinute(timestamp) AS minute,
    count() AS redirects
FROM redirect_events
GROUP BY
    user_id,
    link_string,
    minute;

CREATE MATERIALIZED VIEW redirects_by_hour_mv
TO redirects_by_hour
AS
SELECT
    user_id,
    link_string,
    toStartOfHour(timestamp) AS hour,
    count() AS redirects
FROM redirect_events
GROUP BY
    user_id,
    link_string,
    hour;

CREATE MATERIALIZED VIEW redirects_by_day_mv
TO redirects_by_day
AS
SELECT
    user_id,
    link_string,
    toStartOfDay(timestamp) AS day,
    count() AS redirects
FROM redirect_events
GROUP BY
    user_id,
    link_string,
    day;

The toStartOf* functions are supported by CH directly. More information about these functions can be found in the CH docs.
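
If you want a feel for what they do before wiring them into views, you can call them directly:

SELECT
    toStartOfMinute(toDateTime('2024-12-01 08:13:45')) AS minute,  -- 2024-12-01 08:13:00
    toStartOfHour(toDateTime('2024-12-01 08:13:45')) AS hour,      -- 2024-12-01 08:00:00
    toStartOfDay(toDateTime('2024-12-01 08:13:45')) AS day;        -- 2024-12-01 00:00:00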

These views are basically the same; the main difference is the time bucket the events are counted into. As you can see, redirects_by_day will contain fewer rows than redirects_by_minute. The CH docs capture this idea well:

These results will often be a smaller representation of the original data (a partial sketch in the case of aggregations).

Testing

After you create the tables and views, you should be able to insert data into your base table and watch the data flow into your rollup tables.

Here is some toy data you can use for the base table. Since the views were created before any data was inserted, every insert into the base table will flow through them, so there shouldn’t be any issue with viewing the aggregated data.

INSERT INTO redirect_events (user_id, link_string, timestamp) VALUES \
(1, 'http://example.com/my-link', '2024-12-01 08:13:00'), \
(1, 'http://example.com/my-link', '2024-12-01 08:13:45'), \
(2, 'http://example.com/my-link', '2024-12-05 10:26:00');

You can perform a few SELECT queries against the base table or against the rollup tables if you’d like.

Here is what I have in the base table.

SELECT *
FROM redirect_events

Query id: 6058f99d-cfb8-45e7-ba09-c24649b143a7

   ┌─user_id─┬─link_string────────────────┬───────────timestamp─┐
1. │       1 │ http://example.com/my-link │ 2024-12-01 08:13:00 │
2. │       1 │ http://example.com/my-link │ 2024-12-01 08:13:45 │
3. │       2 │ http://example.com/my-link │ 2024-12-05 10:26:00 │
   └─────────┴────────────────────────────┴─────────────────────┘

3 rows in set. Elapsed: 0.015 sec.

The minute-by-minute breakdown makes sense.

SELECT *
FROM redirects_by_minute

Query id: e8bb2aa3-04de-4fbd-a67c-a67a8f1c4755

   ┌─user_id─┬─link_string────────────────┬──────────────minute─┬─redirects─┐
1. │       1 │ http://example.com/my-link │ 2024-12-01 08:13:00 │         2 │
2. │       2 │ http://example.com/my-link │ 2024-12-05 10:26:00 │         1 │
   └─────────┴────────────────────────────┴─────────────────────┴───────────┘

2 rows in set. Elapsed: 0.017 sec.

The hour-by-hour breakdown makes sense.

SELECT *
FROM redirects_by_hour

Query id: a55ea4cc-365b-4479-8089-708ea1b868de

   ┌─user_id─┬─link_string────────────────┬────────────────hour─┬─redirects─┐
1. │       1 │ http://example.com/my-link │ 2024-12-01 08:00:00 │         2 │
2. │       2 │ http://example.com/my-link │ 2024-12-05 10:00:00 │         1 │
   └─────────┴────────────────────────────┴─────────────────────┴───────────┘

2 rows in set. Elapsed: 0.006 sec.

Finally, the day-by-day breakdown looks good too!

SELECT *
FROM redirects_by_day

Query id: 3408c2a3-601f-4316-a611-65c1815fd835

   ┌─user_id─┬─link_string────────────────┬─────────────────day─┬─redirects─┐
1. │       1 │ http://example.com/my-link │ 2024-12-01 00:00:00 │         2 │
2. │       2 │ http://example.com/my-link │ 2024-12-05 00:00:00 │         1 │
   └─────────┴────────────────────────────┴─────────────────────┴───────────┘

2 rows in set. Elapsed: 0.007 sec.
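
Looping back to the requirements list at the top, the rollup tables map onto those dashboards pretty directly. Here are a couple of sketches (the table and column names match the schema above; the user and link values are just the toy data):

-- Requirement 2: total clicks per link for a given user
SELECT
    link_string,
    sum(redirects) AS total_clicks
FROM redirects_by_day
WHERE user_id = 1
GROUP BY link_string;

-- Requirement 1: clicks over time for one link, bucketed by day
SELECT
    day,
    sum(redirects) AS clicks
FROM redirects_by_day
WHERE user_id = 1
  AND link_string = 'http://example.com/my-link'
GROUP BY day
ORDER BY day;

Requirement 3 (where the clicks are coming from) would need a referrer or geo column on the base table, which I left out of this toy schema.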

Conclusion

In this post, I covered some of the steps you could take if you were going to build an analytics pipeline. Personally, I feel content with ClickHouse and recommend it for your analytics needs. The fact that it’s open source is a huge plus compared to proprietary options.

Note: I am not paid by CH to say this. The company I work for (Cloudflare, Inc.) is all in on ClickHouse and I genuinely think it’s an amazing piece of technology.

Evaluation and Extensions

Depending on the context you’re working in, I do recommend a more thorough evaluation of CH since it might not be the best fit for your needs. E.g., depending on how much data you expect to log, you could just use PostgreSQL for everything I showed here. PostgreSQL is powerful. Like really powerful.
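
To make that concrete, here is a rough Postgres sketch of the hour-by-hour rollup computed at query time with a plain GROUP BY (assuming an equivalent redirect_events table exists on the Postgres side):

-- Hour-by-hour redirect counts, computed on demand in Postgres
SELECT
    user_id,
    link_string,
    date_trunc('hour', "timestamp") AS hour,
    count(*) AS redirects
FROM redirect_events
GROUP BY user_id, link_string, date_trunc('hour', "timestamp");

At small scale you could also wrap that query in a Postgres materialized view and refresh it on a schedule, which gets you most of the rollup-table behavior without ClickHouse.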

You can also extend what I have done here by automating the insertion of events into the base table and the reads from the rollup tables. CH lists a handful of useful third-party client libraries in their docs, and you can write a wrapper around those clients for your own needs.

I actually intend to make the insertion and reading a service in RelayBird (because why not).