Introduction to Working with Time Series Data
Key Takeaways
- What is time series data?
- When to use a time series database?
- Why do relational and NoSQL databases fall short for time series data?
What is Time Series Data?
A time series is a series of data points listed in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time. - Wikipedia
Time series data has three things in common:
- Every record is a new entry
- Data typically comes in time order
- Time is a primary axis
Image source: GitLab dashboards
A single data point says little on its own, but a data set (a series of data points) can tell a story or reveal insights.
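The three properties listed above can be sketched in a few lines of Python. This is only an illustration, not how any particular database stores data: the `TimeSeries` class, its method names and the CPU load values are all made up for this example.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


class TimeSeries:
    """Hypothetical append-only series of (timestamp, value) points."""

    def __init__(self):
        self._times = []   # timestamps, always sorted ascending
        self._values = []  # values, aligned with _times by index

    def append(self, ts, value):
        # Every record is a new entry: existing points are never updated.
        # Data arrives in time order, so out-of-order points are rejected here.
        if self._times and ts < self._times[-1]:
            raise ValueError("points must arrive in time order")
        self._times.append(ts)
        self._values.append(value)

    def between(self, start, end):
        # Time is the primary axis: a binary search on the timestamps
        # finds all points with start <= timestamp <= end.
        lo = bisect_left(self._times, start)
        hi = bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))


# Made-up CPU load samples, one per minute.
start = datetime(2024, 1, 1, 0, 0, 0)
cpu = TimeSeries()
for i, load in enumerate([0.2, 0.4, 0.9, 0.7, 0.3]):
    cpu.append(start + timedelta(minutes=i), load)

window = cpu.between(start + timedelta(minutes=1), start + timedelta(minutes=3))
print([value for _, value in window])  # [0.4, 0.9, 0.7]
```

Because points are only ever appended in time order, the list stays sorted and a time range lookup needs no scan over the whole series.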
Examples of time series data:
- Forecasting weather
- Stock prices
- Internet of Things
- App Analytics
- DevOps metrics: collecting metrics over time for resources such as VMs or containers helps us understand how the system behaves under load and when the peak load occurred.
Why doesn't updated_at serve the purpose?
Throughout this article, we will use user logins as the example data. In a web application with user login, if we overwrite the last login timestamp every time a user logs in, we only ever have one data point per user, and we cannot get a sense of the user's login behaviour over a period of time.
If we instead treat every login as a new event, the accumulated events let us analyze login behaviour over time. For example, you could find the peak traffic hours, which helps you time your advertisements. You could also group users into buckets by the time of day they log in.
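The difference between the two approaches can be sketched as follows. The login events, user names and function names below are hypothetical and exist only for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical login events: one (user, timestamp) record per login.
login_events = [
    ("alice", datetime(2024, 1, 1, 9, 5)),
    ("bob", datetime(2024, 1, 1, 9, 40)),
    ("alice", datetime(2024, 1, 1, 18, 15)),
    ("carol", datetime(2024, 1, 2, 9, 20)),
    ("alice", datetime(2024, 1, 2, 21, 0)),
]


def last_login(events):
    """What an updated_at-style column keeps: only the latest login per user."""
    latest = {}
    for user, ts in events:
        if user not in latest or ts > latest[user]:
            latest[user] = ts
    return latest


def logins_per_hour(events):
    """Count logins by hour of day, which needs every event, not just the last."""
    return Counter(ts.hour for _, ts in events)


def peak_hour(events):
    """Hour of day with the most logins."""
    return logins_per_hour(events).most_common(1)[0][0]


print(last_login(login_events)["alice"])  # 2024-01-02 21:00:00
print(logins_per_hour(login_events)[9])   # 3
print(peak_hour(login_events))            # 9
```

With only the `last_login` view, alice's earlier logins are lost and the 9 o'clock peak cannot be recovered.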
What can you get out of your time series data?
The real value of time series data comes from recording every change. Usually, a time series database is connected to a visualization engine; for example, Grafana works well with InfluxDB.
Past: analyze how things changed over time.
Present: measure how something is changing right now.
Future: predict how it may change in the future.
When to use a time series database?
You could use a relational or NoSQL database if your volume of events is low. The problem with time series data is that it piles up very quickly. For example, a Boeing 787 generates half a terabyte of data per flight. Time series databases are a category of databases specialized for time series data. They treat time as a first-class citizen, and they offer better write and query performance at scale.
Let's understand the scale with an example. Say we have 5000 IoT devices, each sending 100 measurements once every 5 seconds. This alone produces 8.64 billion data points per day, a volume that is very hard to handle in a relational database.
Per day, each device sends 86400 / 5 = 17280 samples.
17280 * 5000 * 100 = 8.64 billion data points per day.
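The same arithmetic can be checked with a short script. The variable names are made up for this sketch, and the per-second rate is derived from the figures above rather than taken from any benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

devices = 5000
measurements_per_sample = 100
sample_interval_seconds = 5

# Samples each device sends per day.
samples_per_device_per_day = SECONDS_PER_DAY // sample_interval_seconds

# Total data points across all devices per day.
points_per_day = samples_per_device_per_day * devices * measurements_per_sample

# Sustained write rate the database would need to absorb.
points_per_second = points_per_day // SECONDS_PER_DAY

print(samples_per_device_per_day)  # 17280
print(points_per_day)              # 8640000000
print(points_per_second)           # 100000
```

Seen as a write rate, the example works out to 100,000 data points per second, sustained around the clock.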
The following are a few time series databases:
- InfluxDB
- TimescaleDB
- Prometheus
TimescaleDB gives 20% higher insert performance and up to 1400x faster queries when benchmarked against MongoDB.[3]
TimescaleDB gives 20x higher insert rates and queries that are 1.2x to 14000x faster when benchmarked against PostgreSQL.[4]
Time series databases give better performance for time-based queries.
References
- [1] https://en.wikipedia.org/wiki/Time_series
- [2] https://blog.timescale.com/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563
- [3] Benchmarking MongoDB and TimescaleDB
- [4] Benchmarking TimescaleDB and PostgreSQL
- [5] InfluxDB vs MongoDB