We are nearing 10 years of Kubernetes (K8s), which is great. The platform has matured and a lot has happened since we installed our first K8s clusters. But right from the beginning we had one question, and we are still missing the answer: How many clusters do I need?
As so often, there is no single true answer, but there is a way to get the right answer for you.
May this article help you on your Cloud Native Journey.
Do I want a dedicated Production Cluster?
First, you should decide whether you are willing to run everything on just one cluster or whether you should have at least two.
Have you separated Dev and Prod in the past? Think about why that might have been a good idea and you will find the same arguments also apply to K8s. Here are some pointers to get your train of thought going.
Imagine a developer deploys a new version that writes millions of log files to disk. At first this impacts the I/O of your disks, and eventually you run out of space. Your old server did not like that, and neither did the other applications running on that same server. The same goes for K8s.
You may now think that in a virtualized world of resource quotas and limits this might no longer be an issue. But this is only partially right. Yes, CPU, RAM, and disk space can be limited, but this applies only to the resources your service is using directly. In a distributed world there need to be common infrastructure services that abstract away the complexity for the end user. And these services, in general, do not care about the quotas of single services, namespaces, or teams. So if your logs run wild, so does the log collector; this affects the node on which the rogue application is running and with it all other services running on that node.
But this is not even the most common case in which one workload has a huge negative impact on others running in production. The most common cause is, of course, simple misconfiguration of quotas and limits. This can be mitigated using automation. But in the fast-moving, fast-changing world of microservices it can be a challenge to automate every little detail perfectly and adapt to each change in a flawless manner.
Now ask yourself again: can you risk that a minor mistake in development may affect your production systems? If the answer is no, you should opt for a dedicated production K8s cluster, or at least dedicated nodes for production to prevent immediate side effects.
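One way to catch missing limits before they reach a cluster is a small check in the deployment pipeline. The following Python sketch is only an illustration, not a complete policy tool: the function name `containers_missing_limits`, the `REQUIRED_LIMITS` tuple, and the sample manifest are all made up for this example, and it assumes the manifest has already been loaded into a dictionary.

```python
from typing import Any

# Resources every container is expected to cap (an assumption for this sketch).
REQUIRED_LIMITS = ("cpu", "memory")


def containers_missing_limits(manifest: dict[str, Any]) -> list[str]:
    """Return the names of containers that lack a CPU or memory limit."""
    spec = manifest.get("spec", {})
    # Workloads such as Deployments nest the pod spec under spec.template.spec;
    # a plain Pod keeps its containers directly under spec.
    pod_spec = spec.get("template", {}).get("spec", spec)
    missing = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if any(resource not in limits for resource in REQUIRED_LIMITS):
            missing.append(container["name"])
    return missing


# Hypothetical manifest: the second container has requests but no limits.
sample_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "shop", "namespace": "dev"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "image": "example.org/shop/api:1.0",
                        "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
                    },
                    {
                        "name": "log-writer",
                        "image": "example.org/shop/log-writer:1.0",
                        "resources": {"requests": {"cpu": "100m"}},
                    },
                ]
            }
        }
    },
}

print(containers_missing_limits(sample_deployment))
```

For the sample manifest the check reports the `log-writer` container. A check like this does not replace a LimitRange or ResourceQuota in the namespace; it only flags the omission earlier.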
Shared resources like the log collector, monitoring systems, and even automation tools also require updates, and updates need to be tested.
Of course, those are much more stable when you opt for a K8s enterprise distribution in which the company backing the product tests updated components and provides you only with the supported versions. But updates are updates. There is not only the chance that something goes wrong; with shared resources and many services developed by different teams depending on them, you can almost guarantee that one team did not do its homework and you will run into a version conflict for that project. That might have been fine in the past when only a few applications ran on the same host and you had a zoo of different server versions anyway. But now we are opting for scale and with this for homogeneity. There should be no special treatment for a project within your K8s cluster. So what do you do now? Should you roll back the complete update? Do you have the project team on call and can they quickly fix that issue? Do you really want to prolong the update process and even risk partial downtimes? I guess not. You’d rather wish you had tested the cluster update in a development environment.
Authorization and Security
Are all of your developers allowed to work on the production systems? No? Then they should not be allowed to in K8s either.
RBAC allows fine-grained access control for all resources within a K8s cluster. So it is possible to run production and non-production workloads in the same K8s cluster, separated by namespaces and secured against unwanted user access. It is possible to protect production secrets from prying eyes, and it is possible to separate all workloads from a security perspective using network policies and pod security policies, but it is hard work. And across all of these configurations, a single mistake can be enough to expose your valued secrets to an attacker. Since we are thinking zero trust and thus always assume an attack, also from the inside, it is best to limit the blast radius of an attack as effectively as possible. This means that a breach in the less secure development environment should not be able to affect production.
Traditionally there have been two roles: Dev and Ops. Dev needs to adapt fast and therefore needs to experiment a lot, while Ops keeps things running and stable.
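To give an impression of what namespace-level separation involves, here is a minimal Python sketch that generates two manifests for a production namespace: a NetworkPolicy that denies all ingress by default and a read-only Role that deliberately leaves out secrets. The namespace `prod`, the object names, and the helper functions are assumptions for illustration; a real setup needs many more rules (egress policies, RoleBindings, admission controls), which is exactly the hard work mentioned above.

```python
import json
from typing import Any


def default_deny_ingress(namespace: str) -> dict[str, Any]:
    """NetworkPolicy that selects every pod in the namespace and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Ingress is restricted and no ingress rule is listed, so none is allowed.
            "policyTypes": ["Ingress"],
        },
    }


def read_only_role(namespace: str) -> dict[str, Any]:
    """Role that can view common workload objects but not secrets."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "workload-viewer", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                # "secrets" is intentionally not listed here.
                "resources": ["pods", "services", "configmaps"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


# Wrap both objects in a List so they can be written to a single JSON file.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [default_deny_ingress("prod"), read_only_role("prod")],
}

print(json.dumps(manifests, indent=2))
```

The output is JSON, which `kubectl apply -f` accepts just like YAML. Keep in mind that a NetworkPolicy only has an effect if the cluster's network plugin enforces it, and that the Role grants nothing until a RoleBinding attaches it to a user or group.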
We are now working with DevOps teams and platform teams and all the mixtures of roles that come with your individual organizational structure. But in the end we are facing the same conflict that we saw between Dev and Ops. We need to be able to adapt fast and we need to be able to have secured and stable production workloads.
How do we solve this?
Production needs to be stable. We simply cannot allow certain libraries and container images there, especially when they are untested. But if we are running Dev and Prod on the same cluster, we might have to restrict the developers' access to those external sources, which will lead to unhappy teams, slow development cycles, and shadow IT. Alternatively, we can give the developers a cluster for fast experiments and a safe space to test those external sources before they move to production.
In some cases it needs to be clear which team is responsible for which systems, especially when it comes to SLAs.
A colleague once told me ‘The only stable cluster is a cluster without users.’ This might sound a bit over the top, but it does have some truth to it. Users always find creative ways to break a cluster. This might be due to misbehavior of an application, extensive load testing, misplaced curiosity, or just true and honest misconfiguration of yet another YAML file.
Should the platform team who gives their best to keep the required SLAs work overtime because the production cluster is affected? Or should they only be held responsible for a cluster where only they, their pipeline and a handful of expert users have access?
Do I need even more K8s-Clusters?
There are plenty of reasons to use multiple K8s clusters, especially multiple production clusters. Have you thought about the following categories?
Microservices are designed to be resilient to failure, but is your cluster?
Putting your K8s cluster in one datacentre with the nodes spread evenly across different racks is good enough for many. But depending on your SLA you might consider multiple clusters in multiple regions. Ideally you are able to spread the workload evenly across microservices that reside in multiple clusters, because spanning one cluster over multiple regions is generally not a good idea due to in-cluster latencies.
Analyze your latency requirements.
If you are running applications on edge locations or need impressively low latencies for your end users, think about the location of your clusters.
And yes, clusters. You probably need more than one.
Compliance is a given.
If you identify workloads that per your compliance regulations are not allowed to run with other certain applications, need dedicated hardware, are only allowed in certain company networks or certain regions, you might need a separate cluster for this.
Dev and Prod are only the two extremes. You will also need something in between for integration tests.
Depending on your setup there might be a lot of microservices, applications and maybe even partner applications that need to play together in near perfect harmony to fulfil your business needs. This requires testing! On which hardware you want to put that stage is up to you. If you are in the cloud, you might have a dedicated ephemeral cluster that you spin up when needed. When you are running with a fixed number of clusters, the integration environment might be placed within your more dev-ish or prod-ish clusters. Where to put it depends on the number of changes, the network requirements, compliance regulations (not every partner application necessarily sends only anonymized test data), and factors of responsibility.
To operate at scale, homogeneity is key.
Otherwise, your automation very quickly becomes riddled with unreadable if statements and hard-to-debug hidden dependencies. But what if your services are quite diverse and have very different requirements?
The best-known use case is GPUs. If only a part of your applications requires GPUs, you might think about a dedicated cluster for those workloads. On the other hand, it is entirely possible to run different workloads on different nodes in a single cluster using node labels. So, if you have special hardware requirements, it also comes down to scale.
Many tools come prepackaged for K8s. Depending on your setup, this might call for yet another cluster.
There are certain services that need to be available to multiple teams and applications, for example messaging, streaming, and logging. Each of these services brings its own challenges with different infrastructure requirements and maybe different SLAs. They might be handled by different organisational units or teams. If you are concerned about stability or simply need to separate them due to the teams' responsibilities, different clusters for these shared services might be the right way to go.
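The single-cluster variant with node labels could look roughly like the following Python sketch, which generates a Pod manifest that is only scheduled onto labelled GPU nodes. The label `hardware=gpu`, the helper `gpu_pod`, and the image name are assumptions for this example; the `nvidia.com/gpu` resource is only available if the NVIDIA device plugin is installed on the cluster.

```python
import json
from typing import Any

# Hypothetical node label; nodes could be labelled with
# `kubectl label nodes <node-name> hardware=gpu`.
GPU_NODE_LABEL = {"hardware": "gpu"}


def gpu_pod(name: str, image: str, gpus: int = 1) -> dict[str, Any]:
    """Build a Pod manifest that is only scheduled onto labelled GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The scheduler only considers nodes carrying all of these labels.
            "nodeSelector": dict(GPU_NODE_LABEL),
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumes the NVIDIA device plugin exposes this resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


print(json.dumps(gpu_pod("trainer", "example.org/ml/trainer:1.0"), indent=2))
```

A node selector only steers the GPU workload towards the GPU nodes. Keeping other workloads off those nodes additionally requires taints on the nodes and matching tolerations on the GPU pods, which is one more piece of per-workload configuration that a dedicated cluster would avoid.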
Are there third-party applications that you do not want to run with your own workload?
This might simply be a matter of compliance or trust, but sometimes third-party applications are shipped with low security standards. It might be that the application is perfectly fine and secure, but the vendor decided to ship it for K8s and now relies on some undesirable privileges, e.g. allowing root users in the container. Such applications should not run in the same environment as your production workloads. You might want to set up a designated cluster for these applications, or maybe even one cluster per application, depending on the resource consumption.
Be honest, cost is always a factor.
More clusters mean more overhead. Each cluster has dedicated infrastructure and master nodes. These take up resources and, depending on the licencing model, they cause additional cost. But on the other hand, for huge clusters you may need additional master nodes anyway, or at least nodes with more power. So instead of condemning the idea of multiple clusters, take the time to sit down with your K8s distribution's sales representative or your trusted consultant. You might be surprised.
That’s it. I hope it helps and best of luck on your Cloud Native Journey!