Kubernetes clusters as cattle, not pets
Since DevOps and automation became popular in IT infrastructure, many DevOps Engineers and Platform Engineers have used the 'pets vs cattle' analogy to describe how to treat applications and infrastructure. Infrastructure as Code tools like Terraform, Chef, Puppet, and Ansible made it possible to simply destroy and recreate your application from scratch when there is an issue, instead of fixing (petting) it manually. A good read on how the 'pets vs cattle' analogy is used, and why you should treat your applications as cattle, can be found here.
Kubernetes takes this concept of immutable infrastructure a step further by automatically stopping and recreating our applications as container images, at both planned and unplanned moments. This forces us to write our applications in a way that is cattle-proof. As a developer, you never know (and should not care much) when your application will be restarted. Your application should be written in such a way that the end user will not notice anything in such a scenario. A good starting point is to follow the 12-factor app guidelines.
However, in many companies the Kubernetes clusters themselves are treated like pets. As you run more and more Kubernetes resources, your cluster configuration gets very complex very fast. This gets even worse if you apply your resources manually using 'kubectl apply' and rely completely on etcd for your cluster configuration. It is important to apply the cattle principles to all layers of the stack, including the Kubernetes clusters themselves.
Why should I treat my Clusters as Cattle?
Of course, when it comes to the 'pets vs cattle' analogy, there is no black and white. There are different levels of automation and recreation in Infrastructure as Code. But in general, the more you focus on creating immutable infrastructure, the easier your maintenance will be in the long run and the more control you will have over your infrastructure. Container platforms are powerful partly because they force developers to deal with the fact that containers will be killed regularly. Why should the Ops teams managing Kubernetes be treated any differently?
No more configuration drift
As with all configuration that is not recreated every once in a while, 'configuration drift' is likely to happen, especially if it is not actively monitored. This means that your actual cluster configuration deviates from your code, and it is dangerous to roll out changes to an environment in a 'partly unknown' state. If you manage your clusters as cattle and regularly redeploy your Kubernetes clusters, you will remove configuration drift altogether. This gives you control over your infrastructure, and confidence with every new rollout.
Easy disaster recovery strategy
Failures happen, in Kubernetes too. You should be prepared for an entire Kubernetes cluster failure, which is why you should have a disaster recovery scenario in place. If you treat your Kubernetes cluster as a pet, this means you will need to actively look for the problem and fix it on the spot. Some cloud providers are better than others, but sometimes you will need deep Kubernetes and cloud knowledge to fix big outages in your cluster. If you can simply destroy your cluster and recreate it from code, you will be up and running again in no time: no complex etcd backup-and-restore scenarios, just a simple clean redeploy with Git as your only source of truth.
Easier upgrades and migrations
One big advantage of treating your infrastructure as cattle is that upgrades and migrations become easier. Kubernetes updates and upgrades can feel like a black box. Although it is perfectly possible to upgrade Kubernetes using your cloud provider's upgrade process, recreating your Kubernetes cluster from scratch gives you the opportunity to test your changes in a better way (blue/green, canary). With every Kubernetes upgrade, you will also enforce your 'Infrastructure as Code' architectural principles (more about this below).
How do I set up my clusters as cattle?
While building a mature Kubernetes cluster infrastructure, there are some things you need to take into account. You will be recreating your entire Kubernetes cluster from code, from scratch. To do this, you need to make sure that no data is lost and that your new cluster behaves in exactly the same way as your old cluster.
Use Infrastructure as Code and GitOps everywhere
Having a good Infrastructure as Code strategy for cluster recreation is crucial for clusters-as-cattle. You should be able to run a set of scripts that automatically recreates your entire stack, on both the infrastructure and the application level. On the infrastructure level, you can use Terraform scripts to roll out your cluster resources. Recreating your Kubernetes cluster should be nothing more than a 'terraform apply' command.
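As a sketch of what this looks like, here is a minimal Terraform definition of a managed cluster (AWS EKS shown as one example; the names, role references, and node counts are illustrative, and a real setup would also define the VPC, subnets, and IAM roles in code):

```hcl
# Minimal sketch of a cluster defined entirely in code. Recreating it is a
# plain `terraform destroy` + `terraform apply` cycle.
resource "aws_eks_cluster" "main" {
  name     = "platform-cluster"       # illustrative name
  role_arn = aws_iam_role.cluster.arn # IAM role defined elsewhere in the stack

  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "default"
  node_role_arn   = aws_iam_role.nodes.arn
  subnet_ids      = aws_subnet.private[*].id

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }
}
```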
After you have created the Kubernetes cluster as code, you need to add all your applications on top of Kubernetes. The best way to manage applications on Kubernetes is GitOps. If you leverage a GitOps bootstrapping technique, you only need to create one bootstrap GitOps resource, which will automatically provision your entire Kubernetes cluster. Make sure that GitOps is your only deployment method for Kubernetes, and you can be confident that Git is your single source of truth and that all resources are recreated exactly as they were. You won't need to rebuild your applications, and you won't need to run any pipelines for them: GitOps manages this for you, out of the box.
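One common bootstrapping pattern is the Argo CD "app of apps": a single Application resource that points at a Git directory containing all other Application manifests. A sketch, with the repository URL and paths as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git
    targetRevision: main
    path: apps            # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true         # remove resources deleted from Git
      selfHeal: true      # revert manual changes back to the Git state
```

Applying this one manifest on a fresh cluster is enough to have GitOps pull in everything else.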
Recreate your clusters regularly
If you want to rely on your infrastructure code and GitOps code, you need to trust your recreation setup. Just like with regular DevOps deployments, the best way to gain trust in your setup is to regularly redeploy your Kubernetes clusters. If you actively test your cluster configuration by recreating the cluster, you will reduce configuration drift. Automatically recurring pipelines can be created that spin up a Kubernetes cluster from scratch, run some tests, and destroy the cluster when all tests are completed. You can go a step further and automatically recreate your production clusters weekly, or even daily, with automated testing.
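Such a recurring pipeline can be sketched as follows (GitHub Actions syntax shown as one option; the schedule, job name, and test script are illustrative):

```yaml
name: cluster-recreation-test
on:
  schedule:
    - cron: "0 2 * * *"   # spin up and test a fresh cluster every night
jobs:
  recreate-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init && terraform apply -auto-approve
      - run: ./run-smoke-tests.sh        # hypothetical test script against the fresh cluster
      - if: always()                     # tear down even when the tests fail
        run: terraform destroy -auto-approve
```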
Deploy your clusters blue/green or canary
Recreating your clusters comes with challenges, one of which is handling zero-downtime cluster deployments. A rolling, blue/green deployment of your cluster can be used to deploy without downtime. You can even roll out cluster updates in a canary fashion, directing only a certain percentage of your users to the new cluster, with the option to switch back in case of unexpected results. Running a duplicate (older) version of your stack also gives you the possibility to easily revert to your previous cluster configuration.
In cloud environments, blue/green and canary deployments are fairly easy to do for stateless clusters, especially since most Kubernetes clusters are set up with a single ingress entrypoint. AWS supports blue/green DNS updates with weighted routing for canary deployments, and in Azure you can leverage Traffic Manager for canary cluster deployments.
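On AWS, the weighted-routing approach can be sketched in Terraform with two Route 53 records pointing at the ingress load balancers of the old and new clusters (hostnames and resource names are illustrative):

```hcl
resource "aws_route53_record" "blue" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = [aws_lb.blue_ingress.dns_name]
  set_identifier = "blue"

  weighted_routing_policy {
    weight = 90   # 90% of traffic stays on the existing cluster
  }
}

resource "aws_route53_record" "green" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = [aws_lb.green_ingress.dns_name]
  set_identifier = "green"

  weighted_routing_policy {
    weight = 10   # canary: 10% to the new cluster, dialed up over time
  }
}
```

Dialing the weights from 90/10 to 0/100 completes the cutover; setting them back is the rollback.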
Challenges with clusters-as-cattle
When implementing clusters-as-cattle, you will find some challenges you need to deal with. For example, if you have already been running Kubernetes in production for some time, it will take effort to migrate from your existing Kubernetes architecture to a cattle setup. Also, if you run stateful workloads (databases, caches, session state, etc.) on Kubernetes, you have an extra complication to deal with. Data is not meant to be treated as cattle and should be handled with care. If you run databases in Kubernetes using Persistent Volumes, you might not want to go all the way with clusters-as-cattle, or you might roll out your clusters less regularly, since downtime might be unavoidable.
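One mitigation worth knowing about: a StorageClass with a `Retain` reclaim policy keeps the underlying cloud disk alive when its PersistentVolume (or the whole cluster) is deleted, so a recreated cluster can re-attach the same data via a statically defined PersistentVolume. A sketch, assuming the AWS EBS CSI driver (the class name and parameters are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-ssd
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver as one example
reclaimPolicy: Retain          # the default, Delete, removes the disk with the PV
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

This does not make the data itself cattle, but it decouples the lifetime of the data from the lifetime of the cluster.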
Although recreating your production clusters from scratch may sound scary, it is fairly easy as long as you stick to your architectural principles ('Infrastructure as Code' and 'GitOps only'). Don't just write those principles down in a document (developers don't read those): enforce them, and regularly recreate your stack in production! The bigger your Kubernetes cluster grows, the more configuration drift becomes an issue. A clusters-as-cattle strategy will give you peace of mind, and it will save you headaches and maintenance in the long run.