7 steps to save half of your AKS costs
At Pionative we talk to companies running Kubernetes every single day. Many of them started with managed Kubernetes clusters (in particular AKS) that were set up, with the best intentions, by their application developers. In this article we focus on one of the downsides this can have: high Azure bills.
Many companies are surprised by the high costs Microsoft charges them for running AKS. Others are not even aware that they are paying too much. Application developers or less experienced DevOps engineers can deploy a working AKS cluster, but the question is: at what cost?
As a reminder: the main drivers of AKS costs are the number of clusters, the number of nodes, and the types of nodes you deploy. Depending on your use case, you will also need to carefully monitor storage and networking costs, such as traffic between peered Virtual Networks across regions. By the way: did you know that Microsoft will start charging for cross-Availability Zone traffic from the 1st of July 2023?
Below we will describe a few ways to cut back on those costs. Some of these solutions will sound familiar, but others may be new to you. We will also give some tangible examples of how to get started with these ideas. For the sake of this article, we will focus on Microsoft Azure.
1. Run fewer clusters
The most obvious way to cut back on AKS costs is to simply run fewer clusters. Many companies are still used to having a DTAP street (separate Development, Test, Acceptance, and Production environments). When you are using containers, you are already tackling many of the issues the DTAP street was originally introduced for. Most of our customers no longer have a Development environment, and for some of them we also got rid of the Test environment.
We suggest this article about the role of the DTAP street in Agile development.
If you still want to have multiple environments for your applications, to cut costs you should also consider deploying in separate namespaces instead of in separate clusters. This saves you a lot of overhead costs for building and maintaining multiple Kubernetes clusters. Make sure you use clear naming conventions to differentiate between the environments.
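Setting this up requires little more than creating the namespaces and deploying per environment. A minimal sketch with kubectl (the application and namespace names are placeholders for your own conventions):

```shell
# One namespace per environment in a single shared cluster,
# with a clear naming convention ("myapp-test" etc. are examples).
kubectl create namespace myapp-test
kubectl create namespace myapp-acceptance
kubectl create namespace myapp-production

# Deploy the same manifests into a specific environment:
kubectl apply -f ./manifests --namespace myapp-test
```

Combined with resource quotas per namespace, this keeps environments separated without paying for separate control planes and node pools.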
2. Azure Reservations
Azure Reservations allow you to buy resources in advance. By committing to one or three years of usage you can get discounts of up to 72%. Reservations are even more flexible than the term suggests: you can pay monthly (instead of upfront), exchange a reservation for a similar resource type, and even request refunds (up to USD 50,000 per year).
For AKS, the Virtual Machines that function as worker nodes can be reserved. However, for VMs Microsoft is currently pushing users towards another form of long-term commitment: the Azure savings plan for compute. The advantage of this pricing model is that you do not need to specify the instance family (e.g. Dv3), size (e.g. D8 v3), or region (e.g. West Europe). Instead, you agree to an hourly rate (or ‘budget’) for the coming one or three years. All compute you use up to that hourly rate is discounted (up to 65 percent according to Microsoft); any compute above it is charged pay-as-you-go.
A three-year commitment is quite a stretch for most smaller companies. It is not only a financial commitment, but also a technological one: if you buy three years’ worth of VMs, switching to Azure Functions suddenly becomes a lot less attractive. Another drawback of both strategies is that unused capacity is lost. This may not be a problem if you still end up better off than paying the full pay-as-you-go price, and if you are constantly growing (never leaving capacity unused) these options can be attractive.
3. Azure Spot Virtual Machines
Anyone who has ever created an Azure Virtual Machine through the portal will have seen the option to use Spot Instances: unused Azure capacity available at a discount. Microsoft clearly mentions the discount, but also tries to scare you off by adding: “Workloads should be tolerant to infrastructure loss as Azure may recall capacity for pay-as-you-go workloads.”
Basically, this signals: I am unreliable, don’t use me.
At Pionative, we reconsidered this when we realized the potential cost saving: up to 90 percent. Kubernetes is ideal for stateless workloads and as such ‘tolerant to infrastructure loss’. Of course, not all your nodes should disappear when Microsoft feels like it, but we think you can spread the risks according to your risk appetite.
Basically what we recommend to our clients is to use Spot Instances for non-production environments. By using the autoscaler (more on that below) you can immediately switch to On Demand (also called pay-as-you-go) nodes when your Spot Instances are reclaimed by Azure. Theoretically, this should be possible without any downtime, but even a minute or two of downtime in a non-production environment we consider acceptable given the high cost-savings.
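A Spot node pool with the cluster autoscaler enabled can be added to an existing cluster with the Azure CLI. A minimal sketch (resource group, cluster, and pool names are placeholders):

```shell
# Add a Spot node pool to an existing AKS cluster.
# --spot-max-price -1 means: never pay more than the current
# pay-as-you-go price; --eviction-policy Delete removes evicted VMs.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10
```

Note that AKS taints Spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so workloads that should run there need a matching toleration.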
For Production environments, we do not yet recommend the above scenario. We have seen some issues with Spot Instances not being drained correctly (hence the two minutes downtime mentioned above). We would like to see this fixed first and after that, we will carefully test this scenario for Production usage.
There is another way to reduce risk while still saving costs: by using different types of Spot Instances (different instance families, Availability Zones, and Regions) you spread the risk of all your instances being reclaimed at once. By configuring slightly more nodes than you actually need, you can still save costs at a lower risk.
One final remark: it is not possible to use Spot Instances for the system node pool. Every AKS cluster needs at least one system node pool with one node. We recommend only deploying system services on this node and sizing it correctly. Of course, you could decide to create an Azure Savings Plan for your system node pools.
4. Region selection
Almost everyone chooses the Region to run in based on the distance to the nearest data center. What not everyone knows is that there are considerable pricing differences between regions. You may have seen pop-ups from Azure saying “Consider selecting UK South to help reduce your costs.”
Regional pricing is a difficult topic. Prices vary over time, and even Azure recommendations like the one above are not always correct. Moreover, different instance families are cheaper or more expensive in different regions. We recommend checking the pricing for your desired capacity every now and then. Microsoft’s own pages are not always the most insightful; you can consider AzurePrice.net for this.
But still: be careful. At the moment of writing, the France Central region (USD 1.40) is on average significantly cheaper (13%) than the popular West Europe region (USD 1.62). However, if you select a specific instance type, e.g. the cheap and therefore popular DS2 v2, you will see it is actually more expensive in France Central (€0.162 vs. €0.126 per hour).
What makes matters even worse: in the pricing calculator Microsoft (again) recommends UK South, but that would actually increase your costs from €0.126 to €0.162. The nearest cheap region for this particular type is North Europe at €0.122 per hour. In short: pay attention and don’t follow the recommendations blindly.
For a complete overview, you can take a look at the VM-specific page on AzurePrice.net, e.g. https://azureprice.net/vm/Standard_D2_v2. You will need a subscription to see all data points.
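If you prefer to check prices yourself, Microsoft also exposes a public Azure Retail Prices REST API that requires no authentication. A sketch of a per-region lookup (the VM size and region are examples):

```shell
# Query the public Azure Retail Prices API for the pay-as-you-go
# price of a specific VM size in a specific region.
curl -s -G "https://prices.azure.com/api/retail/prices" \
  --data-urlencode "\$filter=armSkuName eq 'Standard_DS2_v2' and armRegionName eq 'westeurope' and priceType eq 'Consumption'"
```

The JSON response lists matching SKUs with a `retailPrice` field per meter; repeating the query with a different `armRegionName` lets you compare regions directly.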
5. Ephemeral disks
Another advantage of running stateless workloads on Kubernetes is that you usually do not need data disks attached to your Virtual Machines. However, by default Azure VMs do come with a managed OS disk attached, and guess what: you pay for it too.
As an alternative, you can run the OS disk on the VM’s local storage (its cache or temp disk) using so-called Ephemeral OS disks. This means that if your VM is deallocated, the data on the OS disk is lost. For stateless workloads, such as containers on AKS, this is not a problem. The approach comes with a few other advantages as well: lower read/write latency, faster node scaling, and faster cluster upgrades. On top of that, you stop paying for the separate managed disks. In our opinion, if you only run stateless workloads on AKS, this is a no-brainer.
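Enabling this on a new node pool is a single CLI flag. A minimal sketch (resource group, cluster, pool name, and VM size are placeholders; the chosen VM size must have enough local cache/temp storage to hold the OS disk):

```shell
# Create a node pool whose OS disk lives on the VM's local storage,
# so no separate managed OS disk is billed.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name ephpool \
  --node-osdisk-type Ephemeral \
  --node-vm-size Standard_DS3_v2
```

Existing node pools cannot be converted in place; you add an ephemeral pool and drain the old one.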
6. Start-stop schedules
Another no-brainer for non-production environments is a start-stop schedule. Development environments are usually not used during the weekends and some other environments may only operate at night for testing purposes.
There is no native functionality in AKS to start and stop your clusters according to a schedule. However, fairly easy examples are available using e.g. Logic Apps, Automation Accounts, or Azure DevOps pipelines.
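Whichever scheduler you pick, the underlying operations are the AKS stop and start commands, which deallocate and restore both the control plane and the node pools. A sketch (names are placeholders):

```shell
# Stop the whole cluster outside office hours, e.g. from a
# scheduled pipeline, Logic App, or Automation runbook:
az aks stop --resource-group my-rg --name my-aks

# Start it again before the working day begins:
az aks start --resource-group my-rg --name my-aks
```

While stopped, you no longer pay for the worker node VMs, which is where most of the cluster cost sits.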
Make sure your developers know how to turn the cluster back on in case they want to pull an all-nighter.
Lastly, if cost saving is your main concern, consider not starting the clusters on a schedule at all, but only starting them when developers need them. Especially when you use Ephemeral Disks (see above) start-up times are excellent these days. This way you can easily save a few days of cluster usage each month, e.g. when your developers are in meetings all day (hint: they shouldn’t be).
7. Autoscaling
Kubernetes and autoscaling are often considered to go hand in hand, but in reality this is rarely the case. Autoscaling is more complex than most people think.
If your application receives a lot of traffic, it will consume more resources (e.g. CPU and/or memory) with performance degradation as a result. To restore the performance you will either need to increase the number of replicas of your application or assign more resources to your current replica.
Of course, you want this to happen automatically. For increasing the number of replicas you can use a Horizontal Pod Autoscaler (HPA), and for assigning more resources a Vertical Pod Autoscaler (VPA). You have to choose between these autoscalers; the right choice depends highly on the technologies you use, but in most cases we recommend the HPA.
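For the common case, an HPA can be created with a single command. A sketch against a hypothetical deployment called "web":

```shell
# Create a Horizontal Pod Autoscaler for an existing deployment:
# scale between 2 and 10 replicas, targeting 70% average CPU utilisation.
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10

# Inspect the current scaling status:
kubectl get hpa web
```

Note that the HPA needs CPU requests set on the deployment’s containers, otherwise it has no baseline to compute utilisation against.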
Once your autoscaler of choice makes its change, the Cluster Autoscaler can then decide to increase the number of nodes.
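On AKS the Cluster Autoscaler is managed per node pool. A sketch of enabling it on an existing pool (names and counts are placeholders):

```shell
# Enable the cluster autoscaler on an existing node pool so AKS
# adds nodes when pods cannot be scheduled, and removes idle ones.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```

The HPA then scales pods, and the Cluster Autoscaler scales the nodes those pods need: the two operate at different levels and both must be configured.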
As you can see, this scaling process requires configuration on multiple levels. Many companies will do this scaling manually at first, e.g. when they expect higher customer traffic. Configuring and maintaining the different autoscaling components will cost your developers quite some time, and it is usually only cost-effective if you have a lot of highly fluctuating traffic. If you have such a variable usage pattern, diving into the world of autoscaling will be a challenging but worthwhile adventure for you and your colleagues.
Finally: please combine these measures
The above measures can easily be combined. For example: use on-demand nodes with an Azure Savings Plan for all your system node pools, but carefully select the region in which you run them. Use Spot Instances for all your non-production environments, and stop them at night. Maybe use part of your Savings Plan for your baseline Production workloads as well, but autoscale the rest of the capacity using Spot Instances of different types. Use ephemeral disks on every instance.
At Pionative, we helped multiple clients significantly lower their Azure costs using (among others) the above methods. Feel free to reach out in case you want us to review your cloud setup as well.