Introducing Cirdan, an embeddable data warehouse
Cirdan is an embeddable “data warehouse”. By embeddable, I mean it is a library that a program can link against to use the functionality it provides, and by “data warehouse” I mean a data repository that can be updated and accessed programmatically and/or via SQL.
Think BigQuery, Snowflake, Redshift, except not running in the cloud or across multiple machines, and not running as a service (which would accept connections and process requests on behalf of clients); instead, it is used like any other library (think RocksDB, or SQLite as opposed to MySQL or AWS Aurora).
It is great for analytics, data exploration, powering reports and dashboards, and more. Much more.
It is also inspired heavily by Presto, ClickHouse, and Cloudera Impala. They are all fantastic projects and you should definitely consider using them.
It was developed over the last 4 months or so as a side project, and it is now production-ready, although there are many planned features and improvements (mostly optimizations) that will be pursued later, depending on demand for them and on free time (I almost always need to divide my time between working on such side projects and working on products and services for our two companies).
It supports joins, subqueries, and aggregations, and has some unique features made possible thanks to its design and use semantics (it is a library).
Rationale
We have been working on rebuilding our ads server technology and as soon as it was mostly done, we had to figure out what to do about analytics.
Our “legacy” service uses a bunch of data stores, including some ad-hoc ones for specific purposes, and while that worked fine, we wanted to do better. Specifically, we wanted to do away with aggregations/roll-ups and instead retain all events, relying on dimensions, filters, and metrics to access them; we would no longer be restricted to specific dimensions. We wanted to be able to query millions or billions of events, as opposed to relying on aggregations such as one for (advertiser, campaign, site, date).
We all agreed this is what we want to be able to offer to our customers (specifically, we wanted to offer “Google Analytics for Digital Marketing professionals”).
I initially built a clone of Google Mesa. It is still based on aggregations, but it makes defining new aggregations easy. It is powerful and very fast, but because it relies on aggregations, it couldn’t be used here. We will still use it in other future projects that require sub-ms response times.
We could either use an existing solution for that (OSS or proprietary) or roll our own.
Not long ago, when we faced such dilemmas, we would lean towards building everything in-house. I am still not sure whether that was the right call, but for the past few years we have been aggressively adopting OSS unless building something ourselves provides a real benefit to our customers and users and a competitive advantage to our businesses (we are also constrained by resources, so we have to make good use of them).
I wanted to understand how such systems work, and I had a bunch of ideas that wouldn’t really work unless we could embed the functionality in programs, so I decided we needed to build the technology ourselves.
We all stand on the shoulders of giants, though; researching such designs is very easy nowadays. There are so many codebases and papers to study, and so many people you can reach out to online who can help you understand their areas of expertise, that it’s an embarrassment of riches.
The only problem, and it is a real problem, is that one can spend too much time researching and not making any real progress.
Furthermore, there is a lot of OSS (libraries, frameworks) one can use, without reinventing the wheel entirely, so to speak.
I am grateful to everyone who answered my questions (especially Todd, Alexey, and my friends at the dist-sys Slack community), and to the people who released the code and papers I studied and who otherwise, directly or indirectly, helped me build Cirdan.
Design
Cirdan is written in C++ (C++20). Roughly speaking, it comprises an SQL parser, storage “engines”, an execution engine, implementations of various ‘transformers’ and of various functions (including aggregation functions), codecs for data storage, and datatype handling.
Conceptually, a warehouse is made up of databases. A database is made up of tables, and each table is associated with a schema and a storage engine. SQL statements can reference tables and columns of those tables.
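To make that model concrete, here is a minimal usage sketch. Cirdan’s API is not public, so every name below (the cirdan namespace, Warehouse::open(), execute(), and so on) is a hypothetical stand-in; only the warehouse/database/table structure comes from the description above.

    // Hypothetical sketch: all cirdan:: names are assumptions used only to
    // illustrate the warehouse -> database -> table model described above.
    #include <cirdan/cirdan.h> // hypothetical header
    #include <cstdio>

    int main() {
        auto wh = cirdan::Warehouse::open("/srv/warehouse"); // a warehouse...
        auto db = wh->database("ads");                       // ...holds databases...
        auto events = db->table("events");                   // ...which hold tables, each
                                                             // bound to a schema and a
                                                             // storage engine
        // SQL statements reference tables and their columns
        auto res = wh->execute(
            "SELECT campaign, COUNT(*) FROM ads.events GROUP BY campaign");
        for (const auto &row : res)
            std::printf("%s %llu\n", row.str(0).c_str(),
                        (unsigned long long)row.u64(1));
    }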
There are a bunch of existing storage engines. One persists data in memory, another on disk (in partitions, themselves segmented into “micro-partitions”), another accesses TANK partitions directly for real-time events, another accesses CSV files, and (soon) another will access MySQL databases.
Applications can trivially build specialized storage engines by implementing an interface; this is possible because Cirdan is a library (and it is one of the reasons I wanted to build it as a library).
For example, we have some services that persist to RocksDB; for any of them, we could build a RocksDB-based storage engine that interfaces with those datasets, as sketched below. An SQL statement can involve hundreds of tables, each with its own schema and storage engine (Cirdan may access PBs of data and trillions of rows).
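The interface below is a hypothetical rendering of that idea (Cirdan’s actual interface is not public); it only shows the shape: an application implements a scan over its own dataset and exposes it as an engine.

    // Hypothetical sketch: names and signatures are assumptions, not
    // Cirdan's actual storage engine API.
    #include <string_view>

    struct RowSink;       // receives decoded column rows
    struct PushdownState; // per-query predicate pushdown state (see below)

    struct StorageEngine {
        virtual ~StorageEngine() = default;
        // stream the requested table's rows into sink,
        // honoring pushed-down predicates where possible
        virtual void scan(std::string_view table,
                          const PushdownState &ps, RowSink &sink) = 0;
    };

    // An application-specific engine backed by, say, RocksDB:
    struct RocksDBEngine final : StorageEngine {
        void scan(std::string_view table,
                  const PushdownState &ps, RowSink &sink) override {
            // iterate the RocksDB dataset, decode values, push rows into sink
        }
    };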
This is just the tip of the iceberg, though. For other programs, one could build new codecs, new functions (UDFs, see later), even new transforms. This would have been either impossible or very involved and unwieldy if Cirdan weren’t implemented as a library. Another side benefit is that experimenting with Cirdan is trivial: you can link your program against the library, access files locally, iterate on custom storage engine implementations effortlessly, and more.
Because of the simplicity of the implementation (at least, as I perceive it), extending the SQL syntax is also trivial. If we are going to need another operator, we will just build support for it (e.g. window functions, which are planned for sometime next month, time permitting). (Update: implemented.)
Currently, INNER, LEFT, and CROSS joins are supported, because that is all we need for now, but extending support to other join types, even non-conventional ones, is also very easy. We will implement whatever is missing as we go, based on the needs of our developers and our products.
Currently, there is no code generation (JIT) involved, but that may change (see the previous paragraph); it would benefit aggregations. It’s a matter of priorities, though.
Cirdan is very fast: close to 0.7s for an aggregation over 1 billion rows. We do not make use of GPUs in our data centers but, again, depending on time and demand, we may wind up supporting GPU execution for even greater performance.
Cirdan supports signed and unsigned integers (from 8 to 128 bits), single- and double-precision floating-point numbers, and strings. We will probably extend data type support to arrays and tuples down the road if we need them.
The various codecs (which may be used by storage engines) are responsible for serializing and deserializing the rows of a column. There are currently two different encoders: one for strings and one for numbers.
For numbers specifically, we attempt to persist them using the narrowest type that can hold the [min, max] range of the values in a column rows set. For example, the data type of a column may be uint32, but the values can be stored as uint8. We consider dictionary encoding, RLE, and PFOR (we will use StreamVByte as well in a future update; thank you, Daniel). Dictionary encoding may be combined with RLE or PFOR. We consider both the compression gains and the “cost” of decompressing later, and choose the best codec.
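A minimal sketch of the narrowest-type selection, assuming unsigned values and a non-empty column (the real codecs also weigh dictionary encoding, RLE, and PFOR against each other):

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <vector>

    enum class Width : uint8_t { U8 = 1, U16 = 2, U32 = 4, U64 = 8 };

    // Choose the narrowest storage width for a column's rows: values are
    // stored relative to min, so only the [min, max] range matters, not the
    // column's declared type (e.g. uint32 values persisted as uint8).
    Width narrowest(const std::vector<uint64_t> &rows) {
        const auto [mn, mx] = std::minmax_element(rows.begin(), rows.end());
        const uint64_t range = *mx - *mn;
        if (range <= std::numeric_limits<uint8_t>::max())  return Width::U8;
        if (range <= std::numeric_limits<uint16_t>::max()) return Width::U16;
        if (range <= std::numeric_limits<uint32_t>::max()) return Width::U32;
        return Width::U64;
    }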
For strings, we determine the common prefix of a column rows set and strip it if there is one; we then identify other common prefixes, if any, and also use dictionary, RLE, and PFOR encoding for those prefixes/strings when possible.
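The common-prefix step itself is simple; an illustrative version:

    #include <string>
    #include <string_view>
    #include <vector>

    // Longest common prefix of all strings in a column's rows; if non-empty,
    // it can be stored once and stripped from every row before further
    // encoding. The returned view points into rows and is valid only while
    // rows is alive.
    std::string_view common_prefix(const std::vector<std::string> &rows) {
        if (rows.empty()) return {};
        std::string_view prefix{rows.front()};
        for (const auto &s : rows) {
            size_t i = 0;
            while (i < prefix.size() && i < s.size() && prefix[i] == s[i]) ++i;
            prefix = prefix.substr(0, i);
            if (prefix.empty()) break;
        }
        return prefix;
    }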
For any type, we also attempt to compress using LZ4, and if compressing the already codec-encoded column rows results in over 20% gains, we retain the LZ4-compressed data (this is specific to the “disk” storage engine; each engine implementation is different).
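A sketch of that heuristic using the public LZ4 C API (error handling elided; how the engine records that LZ4 was applied is up to each implementation):

    #include <lz4.h>
    #include <vector>

    // Keep the LZ4-compressed form only if it shrinks the already
    // codec-encoded column rows by more than 20%.
    std::vector<char> maybe_lz4(const std::vector<char> &encoded) {
        std::vector<char> out(LZ4_compressBound(static_cast<int>(encoded.size())));
        const int n = LZ4_compress_default(encoded.data(), out.data(),
                                           static_cast<int>(encoded.size()),
                                           static_cast<int>(out.size()));
        if (n > 0 && n < static_cast<int>(encoded.size() * 0.8)) { // >20% gain
            out.resize(n);
            return out; // caller must note that LZ4 was applied
        }
        return encoded; // not worth it; keep the codec output as-is
    }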
Storage, Access, and Compute
Hopefully, the previous paragraphs illustrated that storage is decoupled from access. The only coupling between query and storage is predicate pushdown: for each SELECT query, a predicate pushdown state is generated and passed on to the storage engines, which can use it to quickly filter partitions/segments of data.
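As an illustration of the kind of filtering this enables (the structures below are mine, not Cirdan’s): an engine that keeps per-segment min/max metadata can rule out a whole segment from a pushed-down range predicate alone.

    #include <cstdint>

    // Per micro-partition metadata an engine might persist.
    struct ZoneMap { int64_t min, max; };

    // Pushed-down predicate: WHERE col >= lo AND col < hi.
    // If the segment's [min, max] cannot intersect [lo, hi),
    // the whole segment is skipped without reading any rows.
    bool can_skip(const ZoneMap &zm, int64_t lo, int64_t hi) {
        return zm.max < lo || zm.min >= hi;
    }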
Each storage engine is responsible for managing its datasets. In fact, one can use the storage interfaces directly to access that data: you can get a pointer to a table’s storage engine and use its various member functions. This may make sense for some specialized applications.
Sometimes you may want to process data in ways SQL can’t express. In such situations, you can create a user-defined function (UDF). Because Cirdan is a library, you can just implement an API, register the function with the Cirdan SQL parser and execution subsystem, and then use that function in your SQL queries.
For example, you may want to write a new aggregation function that accepts various columns, runs some ML model on the aggregated data while simultaneously doing something else with it, and eventually generates a JSON string with the results of those operations for each aggregation set. This is very powerful. Such UDFs may also be preferable to the available SQL functions and facilities for performance reasons.
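A sketch of what that could look like. The registration API below is entirely hypothetical; only the flow (implement an API, register the function, use it in SQL) comes from the text.

    // Hypothetical UDF sketch: cirdan::AggregateFunction, cirdan::Row,
    // cirdan::Value, and register_udf() are assumed names, not the real API.
    #include <memory>

    struct ScoreJSON final : cirdan::AggregateFunction {
        // called once per row of the group, with the columns the query passed
        void step(const cirdan::Row &row) override { /* feed the ML model */ }

        // called once per aggregation set; returns the JSON string result
        cirdan::Value finalize() override {
            return cirdan::Value::string("{\"score\": 0.87}");
        }
    };

    // register with the SQL parser and execution subsystem, then use it:
    //   SELECT campaign, score_json(clicks, cost)
    //   FROM ads.events GROUP BY campaign
    wh->register_udf("score_json", [] { return std::make_unique<ScoreJSON>(); });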
Unique Features
A schema column can be configured so that a special bloom filter will be built to speed up prefix queries (e.g. for a LIKE operation or regular expressions), both case-sensitive and case-insensitive. The strings encoder used to encode the rows of a column will determine the common prefix among those strings and will also build that special bloom filter. Both can be persisted by the storage engine per micro-partition (in the case of the “disk” storage engine). Storage engines can use them, together with the predicate pushdown state, to quickly skip/ignore micro-partitions. This is powerful.
When I was experimenting with Cirdan, I built a barebones Google Analytics-like service that involved url LIKE “http://<host_path_prefix>%”. This feature improved performance by an order of magnitude.
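A toy version of the idea (sizes and hash mixing are illustrative only): index every bounded-length prefix of each row in a per-micro-partition bloom filter; a LIKE ‘prefix%’ probe that misses then proves the whole micro-partition can be skipped.

    #include <algorithm>
    #include <bitset>
    #include <functional>
    #include <string_view>

    struct PrefixBloom {
        static constexpr size_t kHashes = 3, kMaxPrefix = 8, kBits = 1 << 20;
        std::bitset<kBits> bits;

        static size_t slot(std::string_view s, size_t seed) {
            return (std::hash<std::string_view>{}(s) ^
                    (seed * 0x9e3779b97f4a7c15ull)) % kBits;
        }
        // index every prefix (up to kMaxPrefix bytes) of a column row
        void add(std::string_view row) {
            for (size_t n = 1; n <= std::min(kMaxPrefix, row.size()); ++n)
                for (size_t seed = 0; seed < kHashes; ++seed)
                    bits.set(slot(row.substr(0, n), seed));
        }
        // false => no row can match LIKE "<p>%"; skip the micro-partition
        // (only valid for p.size() <= kMaxPrefix)
        bool may_match(std::string_view p) const {
            for (size_t seed = 0; seed < kHashes; ++seed)
                if (!bits.test(slot(p, seed))) return false;
            return true;
        }
    };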
This is currently not used, but the pipeline execution supports “scheduling” retrieval of data asynchronously; when the data becomes available, the pipeline transform (source) is notified so that it can push the data back into the pipeline for processing. This can involve C++20 coroutines, other OS threads; anything should work. For example, data could be stored on S3 and retrieved asynchronously without blocking pipeline execution, and when it is returned, pushed back into the pipeline (the pipeline itself operates on a graph of nodes, and the graph can be altered at runtime).
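A rough illustration of that shape (not Cirdan’s actual pipeline types): a source schedules a fetch off-thread and, on completion, a callback pushes the block back into the pipeline.

    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    struct Block { std::vector<uint8_t> data; };

    struct AsyncSource {
        // hook the pipeline installs: hands a fetched block back for processing
        std::function<void(Block)> on_ready;

        // schedule retrieval (e.g. from S3) without blocking pipeline execution
        void schedule_fetch() {
            std::thread([cb = on_ready] {
                Block b;          // ... perform the remote read here ...
                cb(std::move(b)); // notify: push the data back into the pipeline
            }).detach();
        }
    };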
A FROM clause may specify multiple sources. A source can be a table, a sub-select, a named (using WITH) SELECT statement, or (eventually, when implemented) a materialized view. This is semantically equivalent to a UNION ALL where the same SELECT query is repeated for each source; however, including more than one source in a FROM clause is faster. You can also use a “*” to include all tables of a specific warehouse database, or all tables with a common prefix. That way you can have, for example, a table using the ‘disk’ storage engine for data updated periodically, and another table that accesses TANK or memory directly for recent column rows not yet committed to disk, effectively giving you real-time queries.
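For example (a hypothetical sketch: execute() is the assumed call from earlier, and the exact multi-source syntax is my rendering of the description above):

    // One query reads the periodically-updated 'disk' table and the
    // real-time TANK/memory table together, instead of repeating the
    // SELECT under a UNION ALL.
    auto res = wh->execute(R"(
        SELECT site, SUM(impressions)
        FROM ads.events_disk, ads.events_rt
        GROUP BY site
    )");
    // A wildcard covers every table sharing a prefix:
    //   ... FROM ads.events_* ...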
The Future
Cirdan is going to get regular updates, not only for new functionality and performance, but also for new tools and extensions.
We will use it for new products and for improving existing ones at both our companies (The BEST Company and Phaistos Networks).
I hope we can open-source Cirdan soon; it’s really a function of time and priorities, though. We have already open-sourced some of our most important projects (e.g. TANK, Trinity, KMS). We have benefited greatly from OSS and we want to contribute back in turn. It is the least we can do.
Cirdan was one of the wisest and most foresighted of the Elves in Tolkien’s Legendarium, and the only known bearded Elf.
We have been naming projects after The Matrix characters, but we ran out of names, so we are branching out to Legendarium characters. I don’t think we will have problems with name availability anytime soon.