Similarly to Reddit’s todayilearned, I will be sharing my recent learnings about all things DevOps, cloud and distributed systems.

I’ve been using Prometheus for about six months, and it has been an instant success. Over time, however, the number of metrics stored in Prometheus has grown, and the frequency of querying has also increased. With more dashboards being added to Grafana, I started to experience situations where Grafana would not render a graph in time and the Prometheus query would time out. This can be very annoying, and refreshing the dashboard is an optimistic fix that works sporadically, if at all. I needed a better way to fix Prometheus query timeouts, especially when aggregating a vast number of metrics over long durations.

Problem

One practical example was an attempt to understand the real utilisation of CPU and Memory across Kubernetes nodes. I wanted to compare CPU and Memory % utilisation using the container_cpu_usage_seconds_total and container_memory_usage_bytes metrics. Both metrics are collected per running container and then summed. In a busy production environment, you can be running anywhere from a thousand to tens of thousands of containers simultaneously. When you aggregate data over 5-minute intervals for thousands of containers over a week, it is not that surprising that Prometheus struggles with timely data retrieval.

CPU utilisation, calculated by dividing the container_cpu_usage_seconds_total total by the kube_node_status_allocatable_cpu_cores total using rolling windows:

sum(rate(container_cpu_usage_seconds_total[5m])) / avg_over_time(sum(kube_node_status_allocatable_cpu_cores)[5m:5m])
Load time: 15723ms

Memory utilisation, calculated by dividing the container_memory_usage_bytes total by the kube_node_status_allocatable_memory_bytes total using rolling windows:

avg_over_time(sum(container_memory_usage_bytes)[15m:15m]) / avg_over_time(sum(kube_node_status_allocatable_memory_bytes)[5m:5m])
Load time: 18656ms

Solution

I thought there must be a better way to achieve the same result than reading in all of the data points from thousands of containers over a week. This is how I discovered Prometheus Recording Rules. While I had seen that page earlier, you rarely appreciate documentation until after you’ve needed it.

The essential idea of recording rules is that they allow you to create custom, precomputed time series based on other time series. If you are a Prometheus Operator user, you might already have a large number of such rules running in your Prometheus.

groups:
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", image!="", container_name!=""}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum

The two rules above do exactly what I was doing in my queries, but they do it continuously and store the results in very small time series. sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace) will be evaluated at a predefined interval and stored as the new metric namespace:container_cpu_usage_seconds_total:sum_rate; the same goes for the memory query.
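For completeness, the evaluation interval and the rule file location are configured in the main Prometheus configuration. A minimal sketch, assuming a plain (non-Operator) deployment — the file path and interval values here are my own, adjust them to your setup:

```yaml
# prometheus.yml (fragment) - illustrative values only
global:
  scrape_interval: 30s
  evaluation_interval: 30s   # how often recording rules are evaluated

rule_files:
  - /etc/prometheus/rules/k8s.rules.yml   # file containing the groups shown above
```

A rule file can be validated with `promtool check rules /etc/prometheus/rules/k8s.rules.yml` before reloading Prometheus. If you run the Prometheus Operator, rules are instead managed through PrometheusRule custom resources, which the Operator converts into rule files for you.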

Now I can change my queries to look as follows. CPU utilisation, calculated by dividing the precomputed namespace:container_cpu_usage_seconds_total:sum_rate total by the kube_node_status_allocatable_cpu_cores total using rolling windows:

sum(namespace:container_cpu_usage_seconds_total:sum_rate) / avg_over_time(sum(kube_node_status_allocatable_cpu_cores)[5m:5m])
Load time: 1077ms
This now runs 14x faster!

Memory utilisation, calculated by dividing the precomputed namespace:container_memory_usage_bytes:sum total by the kube_node_status_allocatable_memory_bytes total using rolling windows:

sum(namespace:container_memory_usage_bytes:sum) / avg_over_time(sum(kube_node_status_allocatable_memory_bytes)[5m:5m])
Load time: 677ms
This now runs 27x faster!
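The denominators could be precomputed the same way, leaving only a cheap division at query time. A hedged sketch — these rule names are my own invention, following the level:metric:operations naming convention the Prometheus documentation recommends:

```yaml
groups:
  - name: node-allocatable.rules
    rules:
    # Total allocatable CPU cores across all nodes
    - expr: sum(kube_node_status_allocatable_cpu_cores)
      record: cluster:kube_node_status_allocatable_cpu_cores:sum
    # Total allocatable memory (bytes) across all nodes
    - expr: sum(kube_node_status_allocatable_memory_bytes)
      record: cluster:kube_node_status_allocatable_memory_bytes:sum
```

With these in place, the CPU query could reduce to something like sum(namespace:container_cpu_usage_seconds_total:sum_rate) / cluster:kube_node_status_allocatable_cpu_cores:sum, touching only two tiny precomputed series.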

Benefits of using Prometheus Recording Rules

  • Reduce the query-time load on Prometheus by spreading the computational overhead across ingest time.
  • Keep Prometheus DRY by creating reusable queries, building blocks upon which more sophisticated insights can be drawn.
  • Enjoy more responsive dashboards and a smoother overall experience with Prometheus.