Time-series (or timeseries. Or time series? Never figured it out…) are a semi-regular encounter at work. Chances are you’ve met some too!
Be it IoT data, monitoring metrics, logs from services… everyone is talking about them, and some crazy people at Sqooba even want to convince you that everything is a time-series!
At some point, when you need to work with this kind of data, you’ll be like: “please, someone give me an efficient time-series store!” To that end, you will fire up your favourite search engine or ask your colleagues, and suddenly be swamped by blog posts, promotional websites and opinions selling you the best time-series store in the world!
However, blindly going for one of these (possibly very good!) options may land you in an uncomfortable place down the road. See, there are actually many things hiding under the umbrella term of time-series, and because these things may have vastly different properties, it is very unlikely that a single time-series store can fulfill your needs in an optimal manner for every type of time-series that exists.
Disclaimer
I haven’t worked with every technology listed here. Some things were experienced first hand. Many others were learned by reading some docs while evaluating some engines. Also, I like sarcasm: take everything here with a grain of salt.
Four Categories Of Time Series
I’ve found it useful to reason in terms of the four following shapes or categories of time-series (other nuances follow further down):
| | Numerical | Arbitrary Type |
|---|---|---|
| Single Value | (timestamp, number) tuples | (timestamp, text) tuples |
| Multi Value | (timestamp, n1, …, ni) n-tuples | (timestamp, key-value collection) |
Why the numerical versus non-numerical distinction? Mainly because if you are storing series of numbers (e.g., floats or integers), interesting optimisation (read: compression) options open up.[1]
This is the first important point to clarify when various parties are enthusiastically hurling “But, Elasticsearch!” “No, M3DB!” “Cut it! TimescaleDB, FFS!” at each other[2] around the coffee machine: What are you going to store? Where does it come from, and in what format? In what quantities?
For example, if you have a collection of documents in Elasticsearch, it’s enough to slap a timestamp on each of them to obtain a time-series of documents. In a log-indexing scenario, Elasticsearch might be exactly what you need!
However, if you’re collecting temperatures from a humongous network of wireless thermostats and are essentially dealing with numbers, you may opt for a more specialised tool.
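To make the compression point a bit more concrete, here is a minimal sketch of delta-of-delta encoding for timestamps, one of the tricks described in the Gorilla paper. It is a simplification: the function names are mine, and a real engine would bit-pack the result instead of keeping Python integers around.

```python
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """Encode as: first timestamp, first delta, then the changes between deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    changes = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + changes


def decode(encoded: List[int]) -> List[int]:
    """Rebuild the original timestamps from the encoded form."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for change in encoded[2:]:
        delta += change
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# A thermostat reporting roughly once a minute, with a bit of jitter.
samples = [1600000000, 1600000060, 1600000120, 1600000181, 1600000240]
encoded = delta_of_delta(samples)
print(encoded)  # [1600000000, 60, 0, 1, -2]
```

For regularly sampled data, most of the encoded values end up at or near zero, which is what makes them cheap to store. No such luck with arbitrary text or key-value documents.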
Time-Domain considerations
To squeeze as many data points as possible into every available byte, storage engines can make additional assumptions about the data they will store.
For example, they may assume that you will be monitoring live stuff and that after some time has passed, metrics that have not been received will never arrive. This allows for multiple series to be compacted together, possibly saving on space. It also allows for certain partitioning schemes.
How is this relevant? If you happen to generate series of, say, the number of black holes in the observable universe since the extinction of the dinosaurs (because why not?), a storage engine that partitions data by month may not be the best idea.
However, for storing your data-center metrics for the next year or so, it makes a lot of sense.[3]
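A back-of-the-envelope sketch of why monthly partitions fit one case and not the other. The key format and the figure of roughly 66 million years since the dinosaurs went extinct are my assumptions for illustration, not how any particular engine does it.

```python
from datetime import datetime, timezone

MONTHS_PER_YEAR = 12


def month_partition_key(epoch_seconds: int) -> str:
    """Hypothetical partition key: the UTC year and month of a sample."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{ts.year:04d}-{ts.month:02d}"


def monthly_partition_count(years: int) -> int:
    """How many monthly partitions a time span of `years` would cover."""
    return years * MONTHS_PER_YEAR


print(month_partition_key(1600000000))      # 2020-09
print(monthly_partition_count(1))           # 12
print(monthly_partition_count(66_000_000))  # 792000000
```

Twelve partitions for a year of data-center metrics is perfectly manageable. Several hundred million mostly empty partitions for the black hole census, not so much.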
Scale Considerations & Cardinality
The question of the scale at which you are doing things matters too: will you have a few metrics coming in at insane rates, or a lot of different metrics coming in now and then?
Will the same series exist over very long time periods, or will they be short-lived?[4]
And how intensive will the access to the data be? Write once, never read? Or many, many reads because you happen to be displaying Corona dashboards?
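To illustrate the few-series-many-points versus many-series-few-points question, here is a small sketch that counts distinct series. It assumes a series is identified by a metric name plus a set of labels, as in Prometheus-style data models; the metric and label names are made up.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Labels = Dict[str, str]
SeriesId = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_id(metric: str, labels: Labels) -> SeriesId:
    """A series is identified by its metric name and its full label set."""
    return (metric, frozenset(labels.items()))


def cardinality(samples: Iterable[Tuple[str, Labels]]) -> int:
    """Count the distinct series appearing in a stream of samples."""
    return len({series_id(metric, labels) for metric, labels in samples})


# One long-lived host sending 1000 samples: a single series.
stable = [("cpu_usage", {"host": "db-1"}) for _ in range(1000)]

# 1000 short-lived containers sending one sample each: 1000 series.
churning = [("cpu_usage", {"pod": f"api-{i}"}) for i in range(1000)]

print(cardinality(stable))    # 1
print(cardinality(churning))  # 1000
```

Both streams contain the same number of data points, yet an engine has to keep track of a thousand times more series in the second case.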
Time Series Hell?
Considering the rough categories outlined above gives us 4 types x 2 domains x 2 scales equals 16 – sixteen! – possible specialisation niches for engines, notwithstanding my probably abusive simplifications.[5]
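For the sceptics, the niches can be enumerated. The four shapes come from the table above; the labels for the two time domains and the two scales are my own shorthand for the considerations discussed so far.

```python
from itertools import product

shapes = [
    "single numerical value",
    "single arbitrary value",
    "multiple numerical values",
    "key-value collection",
]
# Assumed shorthand labels, not an established taxonomy.
time_domains = ["live, bounded window", "arbitrary time span"]
scales = ["few series, high rate", "many series, low rate"]

niches = list(product(shapes, time_domains, scales))

print(len(niches))  # 16
for shape, domain, scale in niches[:3]:
    print(f"{shape} | {domain} | {scale}")
```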
So, next time you find yourself around the sacred battle coffee grounds it’s probably worth enquiring what exactly is hiding behind the time-series you are talking about on that day. And if you are going to deal with crazy scales on one or more dimensions, take the time to choose your storage layer.
What about solutions?
Have you read up to here in the hopes of finding a magic solution? I’m sorry, haven’t really found one yet. But here are some heuristics for getting you started.
- No clue about which engine to select but you just need something? Start by reading the available integrations for Prometheus. It won’t help you choose, but the 23 options – at the time of writing – kind of make the case for this blog post.
- Dealing with mixed types and bigger scales? Assuming you’re not simulating log output for services over the lifetime of the universe, you’ll probably be fine with Elasticsearch.
- Same as above, but you absolutely want to do SQL queries and happen to have a PostgreSQL instance around? Consider TimescaleDB!
- Stuck with a gargantuan hose spitting out live metrics? Give VictoriaMetrics or M3DB a try.
- You don’t really know, and suspect you won’t increase scale any time soon? InfluxDB might have you covered.
This is the point where I confess my (wet) dream of having Elasticsearch support Gorilla-like compression where possible: that would really kick some ass. But it ain’t Christmas yet.[6]
Sorry, no silver bullet here. And of course, YMMV!
Tools & Libs
April 2022 update: since writing this post, some colleagues and I continued experimenting, mostly with VictoriaMetrics and generic PromQL endpoints.
This resulted in a Scala PromQL client and a higher-level Chronos-client which directly returns the rich TimeSeries type from the scala-timeseries-lib.
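For readers who have never poked at a PromQL endpoint, here is a rough sketch of what a range query looks like over the Prometheus HTTP API (`/api/v1/query_range` with its `query`, `start`, `end` and `step` parameters). It is plain Python and has nothing to do with how the Scala clients above are implemented; the base URL, the metric name and the helper function are placeholders.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str, start: int, end: int, step: str) -> str:
    """Build a range-query URL for a Prometheus-compatible HTTP API."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Placeholder endpoint and metric: adapt to whatever you are running.
url = query_range_url(
    "http://localhost:8428",
    "rate(http_requests_total[5m])",
    start=1600000000,
    end=1600003600,
    step="60s",
)
print(url)
```

Fetching that URL from a running, compatible server should return a JSON document with the matching series.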
Happy hacking!
1. See for example the Gorilla: A Fast, Scalable, In-Memory Time Series Database paper by Facebook, which inspired countless variations that improved upon it.
2. I’m the M3 fan, by the way. No, I can’t help you operate it. ;)
3. This is actually a good example to explain the difference between very generic time-series storage engines – that is, engines which store any kind of data over arbitrary time intervals – and more specialised engines like VictoriaMetrics or M3DB, which are better suited for metrics.
4. Metrics in a container-orchestrated world where services are respawned hundreds of times a day are worlds away from the good old days where manually configured RRDtool or Zabbix would do the job!
5. Some niches are probably equivalent or irrelevant: at smaller scales generic engines will do fine. On the other hand, we haven’t looked at the problem of out-of-order inserts, nullable fields, open-source vs commercial or the ease of operations…
6. The only reference I found on that subject seems to be an open ticket from 2017…