Time-series architecture epic #291

Open
jimaek opened this issue Feb 20, 2023 · 0 comments
jimaek commented Feb 20, 2023

To support use cases like continuously plotting performance data for an endpoint, we need to design a time-series system.

This means that each probe must have the ability to run scheduled tests every minute or even every 30 seconds. The results will have to be processed and stored in a time-series DB. The API will then need new endpoints to read this data and output results ready to be charted on a frontend.

It sounds like the easiest solution would be to run a cron at the API level and then send commands to all probes to run the scheduled tests. Summary:

  • We need a way for the admin to register the continuous tests. There could be many, but we can cap it at 200 while it's an admin-only feature.
  • The tests could be HTTP, DNS, PING commands with different parameters targeting different endpoints.
  • This means that a single probe could receive 200 different tests that it needs to run every 30 seconds, all without impacting the data or the quality of service. That sounds problematic for smaller probes if we consider that some endpoints could take 10+ seconds to respond, unless we build some kind of queue and de-duplication system (see the sketch after this list). Needs discussion.
  • The results will be returned to the API as normal, but in this case, instead of being output to a user, the API needs to process them and store them in a persistent DB.
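
A minimal sketch of the queue idea, assuming a hypothetical `ScheduledTest` type and `runMeasurement` function (neither exists in the codebase yet). Each probe keeps a queue keyed by test id, so a slow endpoint can never pile up overlapping runs of the same test:

```ts
// Sketch of a per-probe queue with de-duplication; ScheduledTest and
// runMeasurement are hypothetical placeholders, not existing code.
type ScheduledTest = {
  id: string;
  type: 'http' | 'dns' | 'ping';
  target: string;
};

declare function runMeasurement(test: ScheduledTest): Promise<void>;

class ProbeTestQueue {
  // Keyed by test id, so a new cycle replaces a still-pending copy
  // of the same test instead of queueing a duplicate.
  private pending = new Map<string, ScheduledTest>();
  private draining = false;

  enqueue(test: ScheduledTest): void {
    this.pending.set(test.id, test);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.draining) return; // single consumer: runs never overlap
    this.draining = true;
    try {
      while (this.pending.size > 0) {
        const next = this.pending.values().next().value as ScheduledTest;
        this.pending.delete(next.id);
        await runMeasurement(next); // may take 10+ seconds on slow targets
      }
    } finally {
      this.draining = false;
    }
  }
}
```

One consequence of this design worth discussing: on an overloaded probe, de-duplication means dropping cycles rather than delaying them, which keeps timestamps honest but produces gaps in the series.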

Next we need to select the best possible DB: one that has good performance and is easy to use and operate. I am considering https://questdb.io/ or ClickHouse, but further research and benchmarking is needed.

The DB needs to be able to store (number of registered tests) × (number of probes) data points per cycle. So if we have 1000 probes and 200 registered tests that run every 30 seconds, the API would have to accept and then store 400k data points per minute.
The DB will then downsample the raw data into aggregated values using functions like average and median, while the raw data remains available for use. We also need to decide on the TTL of raw data and downsampled data; I would say there is no reason to store more than 2 years' worth of data of any kind.
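
A rough sketch of the ingestion side, assuming ClickHouse and the `@clickhouse/client` package (an unconfirmed choice pending the benchmarking above; the table name and row shape are also assumptions). The point is that at 400k points per minute the API should buffer and flush in batches rather than issue single-row inserts:

```ts
import { createClient } from '@clickhouse/client'; // assumed driver choice

const clickhouse = createClient({ url: 'http://localhost:8123' });

type DataPoint = {
  testId: string;
  probeId: string;
  timestamp: string; // ISO 8601
  rttMs: number;
};

const buffer: DataPoint[] = [];

// Collect points in memory; time-series DBs strongly prefer
// large batched inserts over many small ones.
export function record(point: DataPoint): void {
  buffer.push(point);
}

setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length);
  await clickhouse.insert({
    table: 'measurements_raw', // hypothetical table name
    values: batch,
    format: 'JSONEachRow',
  });
}, 5_000); // flushing every 5s => ~33k rows per batch at 400k/min
```

If we go with ClickHouse, its table-level TTL clause and materialized views map naturally onto the retention and downsampling requirements; QuestDB would need a different approach, which should be part of the comparison.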

Before storing the data in the DB we need to consider:

  • Filter out trash data, e.g. errors or impossible values like a query taking 0.0ms (see the sketch after this list)
  • Consider deduplication or some kind of pre-processing of data from the same ASN+City combo
  • Add all related metadata: location, resolver, all perf data...
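
A sketch of the pre-storage step; the field names, thresholds, and the median-based de-duplication policy are all illustrative assumptions to be tuned:

```ts
// Illustrative pre-storage checks; field names and thresholds are
// assumptions, not final values.
type RawResult = {
  testId: string;
  probeId: string;
  asn: number;
  city: string;
  rttMs: number | null;
  error?: string;
};

const MIN_PLAUSIBLE_RTT_MS = 0.1; // a 0.0ms query is physically impossible
const MAX_PLAUSIBLE_RTT_MS = 60_000;

export function isStorable(r: RawResult): boolean {
  if (r.error !== undefined) return false; // keep errors out of the perf series
  if (r.rttMs === null) return false;
  return r.rttMs >= MIN_PLAUSIBLE_RTT_MS && r.rttMs <= MAX_PLAUSIBLE_RTT_MS;
}

// One possible pre-processing policy: collapse results from the same
// ASN+city combo within a cycle down to their median.
export function dedupeByAsnCity(results: RawResult[]): RawResult[] {
  const groups = new Map<string, RawResult[]>();
  for (const r of results) {
    const key = `${r.asn}:${r.city}`;
    const group = groups.get(key);
    if (group) group.push(r);
    else groups.set(key, [r]);
  }
  return [...groups.values()].map((group) => {
    const sorted = [...group].sort((a, b) => (a.rttMs ?? 0) - (b.rttMs ?? 0));
    return sorted[Math.floor(sorted.length / 2)];
  });
}
```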

Next we need new API endpoints to read the results for a set date range, e.g. a month, a quarter, a year... This affects the DB's schema as well.
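
A hypothetical shape for such an endpoint (the framework, path, and query parameters are all placeholders, as is the `queryAggregated` helper):

```ts
import express from 'express'; // framework choice is illustrative only

declare function queryAggregated(
  testId: string,
  from: string,
  to: string,
  agg: string,
): Promise<Array<{ t: string; value: number }>>;

const app = express();

// GET /v1/perf/:testId?from=2023-01-01&to=2023-03-31&agg=median
// Hypothetical route; real paths and params are open questions.
app.get('/v1/perf/:testId', async (req, res) => {
  const { from, to, agg = 'median' } = req.query as Record<string, string>;
  const points = await queryAggregated(req.params.testId, from, to, agg);
  res.json({ testId: req.params.testId, from, to, agg, points });
});
```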
To consider:

  • A weights system. The idea is to view perf data per endpoint per location, but if you want to see the worldwide performance, you will run into an issue where it is heavily impacted by the number of tests per location; e.g. if 90% of tests come from California, then to be the best in the world you would only need to be the best in California. That's why we need a system that would be fair (see the sketch after this list).
  • We need to support different aggregations, e.g. average, median, 90th percentile, 95th percentile...
  • But even while providing aggregated data, we still need a way to query and show the raw data that was used to produce the aggregation.
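
One way to make the worldwide view fair, sketched below: aggregate per location first, then combine locations with equal weight, so a location contributing 90% of the samples no longer contributes 90% of the result. The names and the equal-weight policy are assumptions, and the percentile function is a simple nearest-rank variant:

```ts
// Two-level aggregation: per-location first, then across locations with
// equal weight, so sample-heavy locations cannot dominate the result.
type Sample = { location: string; rttMs: number };

// Nearest-rank percentile over a pre-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

export function worldwideMedian(samples: Sample[]): number {
  const byLocation = new Map<string, number[]>();
  for (const s of samples) {
    const arr = byLocation.get(s.location) ?? [];
    arr.push(s.rttMs);
    byLocation.set(s.location, arr);
  }
  // Step 1: reduce each location to its own median.
  const locationMedians = [...byLocation.values()].map((values) => {
    values.sort((a, b) => a - b);
    return percentile(values, 50);
  });
  // Step 2: median across locations, each counting once regardless of volume.
  locationMedians.sort((a, b) => a - b);
  return percentile(locationMedians, 50);
}
```

A variant worth considering is capping rather than equalizing weights (e.g. no location counts for more than some fixed share), which keeps some signal from sample volume without letting one region dominate.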