Autoscaling your Kubernetes application with Dash0

This blog post shows how to use Dash0 as the source of truth to automatically scale applications running on Kubernetes.

Elasticity is one of the key value propositions for containerizing applications. Kubernetes, like most of the other orchestration platforms on which we deploy our containers, makes it easy to scale applications up and down. This, in turn, allows developers to balance the infrastructure costs of running their applications, and the load those applications serve.

The decision of when to scale an application up or down is one that many developers still make manually. For example, you may increase the replica count (that is, the number of instances, or copies) of a deployment with the kubectl command line or your GitOps tool of choice.
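To make this concrete, the manual approach usually amounts to bumping a single number in a manifest. Here is a minimal, hypothetical Deployment (the name and image are made up for illustration), where the replicas field is the value you would edit by hand, with kubectl, or through your GitOps tool:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # hypothetical name, for illustration only
spec:
  replicas: 5                     # the value you bump manually, e.g. from 3 to 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0   # placeholder image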

But what if I told you that you could entirely automate the scaling of your applications based on telemetry, and do it with the very same tool and the very same data you already use to troubleshoot your applications’ behaviour? Sounds good? Read on!

Types of Kubernetes autoscaling

In Kubernetes, there are three main types of autoscaling:

  • The horizontal pod autoscaler, which adds or removes replicas (pods) of a workload.
  • The vertical pod autoscaler, which adjusts the CPU and memory requests and limits of your pods.
  • The cluster autoscaler, which adds or removes nodes in the cluster itself.

By the way, there are also other projects that provide autoscaling capabilities. For example, KEDA is a bit like the horizontal pod autoscaler, but driven by events, like messages accumulating in a queue.

In this post, we focus on what is, based on my experience, the most adopted autoscaler: the horizontal pod autoscaler.

How the Horizontal Pod Autoscaler works

The horizontal pod autoscaler comes out-of-the-box with Kubernetes as a resource definition, i.e., a set of configurations you provide to the Kubernetes cluster, and a controller that acts on those configurations. You can think of this as a sort of built-in Kubernetes operator, although everything you do with Kubernetes out of the box, like creating namespaces or deployments, follows the same principles.

The horizontal pod autoscaler works on a simple principle: it monitors, through metric data, specific aspects of how your application is operating, and when those metrics drift too far from the thresholds you specify, more instances of your application are scheduled, or some are deleted. It’s effectively a control loop that minimises the number of pods of your application while keeping the metrics it’s told to care about close to the desired target values. Let’s have a look at an example:

The horizontal pod autoscaler in action, configured to respond to the count of HTTP requests an application serves. The Y-axis uses a logarithmic scale.

What you see in the chart above is how the horizontal pod autoscaler changes the number of instances of an application as the count of requests served by that application changes over time. The vertical axis of the chart uses a logarithmic scale in base 10, i.e., each step up the axis represents a tenfold increase. This tweak of the visualisation is useful because the count of requests your pods serve is several orders of magnitude larger than the number of pods needed to serve them: with a logarithmic scale, we can still see the correlation between the two metrics despite the HTTP requests one having much, much larger values.
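For reference, the Kubernetes documentation describes the control loop as computing the desired replica count roughly as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, if 4 replicas currently report a metric value of 100 against a target of 50, the autoscaler scales towards ceil(4 × 100 / 50) = 8 replicas, within the configured minimum and maximum.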

Hello, Horizontal Pod Autoscaler

Out of the box, the horizontal pod autoscaler can use metrics like the amount of CPU and memory used by your pods. In the autoscaler lingo, these are Utilization metrics, which you configure with a resource like the following:

hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

The value you set is then matched against the average, across all your pods, of the percentage of CPU or memory used relative to the pods’ resource requests.
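As a hypothetical worked example (the numbers are made up): if each pod requests 200m of CPU and the pods currently average 300m of usage, the utilization is 150%; with the 50% target above, the control loop scales towards roughly three times the current replica count, capped at maxReplicas.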

But are Utilization metrics good enough?

As always, it depends. Specifically, it depends on how your application behaves under load, and on other non-functional aspects. There are applications out there that are CPU- or memory-intensive, where a lack of sufficient computing resources will swiftly lead to malfunction.

However, which type of computing resource matters most for your application is somewhat beside the point: it’s a very technical detail. What utilization metrics do not tell you much about is the user experience your applications provide. In fact, a lot of containerized applications are more I/O-bound than anything else: for example, the web application that serves your e-commerce website likely spends considerably more time waiting for your database or another microservice to serve data than it does crunching numbers on its own.

Besides, each application is a bit different. Deciding which metrics to use to scale them is something you should extensively test over time. Often, you may even want to engineer your application so that it scales in a certain way.

Which metrics could I use instead?

A very common aspect to optimize for in user-facing applications is latency, i.e., how fast your application responds to requests. I would even go as far as arguing that, as long as the user experience is great, maxed-out CPU usage in your containers is the mark of a very fine-tuned application. (Whether it is also an efficient application, i.e., whether that CPU is put to good use for its intended purpose of serving requests, is another discussion altogether.)

This very line of thought, prioritizing the user experience over technical details like resource utilization, is actually at the core of best practices around Service Level Indicators (SLIs): your users likely never spare a thought about how “chill” the CPU usage of your containers is, but they will get annoyed by slow responses.

Which metric you use to drive your horizontal pod autoscaling is usually very dependent on your implementation details, but there is a golden rule to always remember:

  • The metric you use to autoscale should guarantee that adding a replica will make things better

Using a metric that does not offer this guarantee may cause compounding negative effects. For example, if your architecture is such that applications can overwhelm the database with requests (say, because your Postgres database is not behind connection pooling), scaling up the applications when the database is already overloaded will make things worse.

How do I wire my own metrics?

The horizontal pod autoscaler needs to retrieve metrics from somewhere. And that “somewhere” is a metric server. The most common metric server is Kubernetes Metrics Server, but that only knows how to expose CPU and memory utilization using telemetry from Kubelet, the Kubernetes component on each of your nodes that is responsible for managing the pods running on that node.

And this is where it gets really interesting: the application-level metrics you would rather scale on, like request counts or latency, are likely something you are already monitoring in your observability solution!

Wouldn’t it be amazing if you could just ask your observability tool for the data it already has? If your observability tool speaks PromQL, the query language of Prometheus, the answer is YES. That is because the Prometheus Adapter provides an implementation of the Custom Metrics API that generates metrics for the horizontal pod autoscaler by running PromQL queries against an endpoint. That endpoint could be a Prometheus server running in your cluster. Or, you know, Dash0!

A diagram of how the Horizontal Pod Autoscaler can be steered by data from Dash0, using the Prometheus Adapter to calculate HPA metrics through PromQL queries.
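Under the hood, a custom metrics adapter like the Prometheus Adapter plugs into the Kubernetes API aggregation layer, so the horizontal pod autoscaler can query it like any other Kubernetes API. As a rough sketch of the kind of APIService object its Helm chart registers for you (the service name and namespace below are assumptions, not taken from this post’s setup):

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  version: v1beta1
  service:
    name: prometheus-adapter      # assumed Service name created by the Helm chart
    namespace: monitoring         # assumed namespace
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true     # in production you would typically provide a caBundle instead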

In Dash0, every single piece of telemetry is queryable via PromQL. You can use PromQL to set up alerts or dashboards based on logs and spans, as well as, of course, metrics. And the same queries can be used with the horizontal pod autoscaler. After you have done the setup described in this repository to deploy the Prometheus Adapter and configure it to query Dash0, scaling your application up or down based on the number of requests it serves is as easy as creating the following two pieces of configuration:

hpa.yaml

# Based on https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/walkthrough.md
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2
metadata:
  name: sample-app
spec:
  scaleTargetRef:
    # point the HPA at a sample application
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  # autoscale between 1 and 10 replicas
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # use a "Pods" metric, which takes the average of the given metric across all pods controlled by the autoscaling target
  - type: Pods
    pods:
      # use the metric that you used above: pods/http_requests
      metric:
        name: http_requests
      # target 200 requests max per replica, on average
      target:
        type: AverageValue
        averageValue: "200"

And this is how the http_requests metric is mapped to a PromQL query that the Prometheus Adapter will run against Dash0:

prometheus-adapter-values.yaml

rules:
  default: false
  custom:
  - seriesQuery: '{otel_metric_name="http_requests_total",k8s_namespace_name!="",k8s_pod_name!=""}'
    resources:
      # Map HPA parameters to semantic conventions
      overrides:
        k8s_namespace_name: {resource: "namespace"}
        k8s_pod_name: {resource: "pod"}
    name:
      as: http_requests
    metricsQuery: |
      sum (increase(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
  external: []
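For completeness, the same values file is also where you tell the Prometheus Adapter which PromQL endpoint to query. A minimal sketch, assuming the prometheus-community Helm chart; the URL below is a placeholder, and the exact Dash0 endpoint and authentication settings are documented in the linked repository:

prometheus:
  url: https://example.dash0.com   # placeholder: your Dash0 PromQL-compatible API endpoint
  port: 443
  path: ""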

The full setup is provided in this repository.

Do I have to compute metrics upfront?

Normally, the answer would be yes. But not with Dash0. Since Dash0 also exposes tracing and logging data to PromQL, you can autoscale based on the P99 latency of the application serving your requests (that is, the latency under which the fastest 99% of your requests complete), by defining the following metric:

prometheus-adapter-values.yaml

rules:
  default: false
  custom:
  - seriesQuery: '{otel_metric_name="http_requests_total",k8s_namespace_name!="",k8s_pod_name!=""}'
    resources:
      overrides:
        k8s_namespace_name: {resource: "namespace"}
        k8s_pod_name: {resource: "pod"}
    name:
      as: http_requests_p99
    metricsQuery: '(max({otel_metric_name="dash0.resources",<<.LabelMatchers>>})by(<<.GroupBy>>)*0)unless(histogram_quantile(0.99,sum(rate({otel_metric_name="dash0.spans.duration",dash0_operation_name!="",<<.LabelMatchers>>}[2m]))by(<<.GroupBy>>,le))*1000)'
  external: []

The horizontal pod autoscaler will then evaluate the most recent value of that metric.
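To close the loop, here is a sketch of a horizontal pod autoscaler that consumes the http_requests_p99 metric defined above; it is not taken from the original setup, and both the deployment name and the target value are assumptions (the query above multiplies the quantile by 1000, so the target is expressed in milliseconds):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app-latency
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app              # assumed deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_p99   # the custom metric defined in the adapter rule above
      target:
        type: AverageValue
        averageValue: "500"       # aim to keep the average P99 across pods below ~500ms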

Dash0 also offers you high-level query builders to figure out the PromQL for your queries, without having to be a PromQL expert:

The Services query builder in Dash0 allows you to specify complex metrics for Resource, Errors and Duration (RED) use cases based on your tracing data, without any PromQL expertise.

The horizontal pod autoscaler, wired to retrieve data from the same system that alerts you of issues with your applications. And with complex use-cases made easy. That’s an SRE dream!

Conclusions

Scaling your applications up and down is a key value proposition of container orchestrators like Kubernetes. You could scale your applications on Kubernetes based on how much CPU and memory they consume, but you are probably better off using application-level metrics like latency, which describe the experience of your end users.

In this blog post, we have shown how to wire up Kubernetes’ Horizontal Pod Autoscaler (HPA) with Dash0, so that scaling your applications is entirely automated, using as the source of truth the same system you trust to send you alert notifications in the dead of night when your applications have problems.