Chaos Engineering with K8s and AWS

Recently, one of my platform engineering colleagues approached me with a great question:
“What tools do you think work well for Chaos Engineering?”
I had no answer.

Of course, I had heard of Chaos Monkey (who hasn’t?), but with the nearly overwhelming set of DevOps tools for K8s, who has the time to try them all, right?
But since I find this topic highly fascinating, today is the day I embrace the chaos!

The Obligatory Introduction

It is great to break stuff systematically, before it breaks randomly in production.

This will save you many sleepless nights because you are the one actively introducing errors and chaos on your schedule, which gives you plenty of time to fix errors before they ever find their way into production. So break things and start learning from the errors!
But do yourself a favor and never call it Chaos in front of your PO. They prefer the much nicer term Reliability Engineering 😉

The Contenders

As always with Cloud Native tools, there are plenty of fish in the CNCF pond. So we need some criteria to narrow it down.

  • Criterion 1: Works well with K8s, as Kubernetes is the strategic platform for us and most of our customers.
  • Criterion 2: Works well with AWS, as the vast majority of our public cloud projects run on AWS.

Currently, there are 8 Chaos Engineering tools listed in the CNCF Landscape, which probably all deserve their place and have unique benefits, but only two are listed in the AWS documentation: Litmus and Chaos Mesh.
And this brings me directly to the third contender: the AWS Fault Injection Simulator (FIS). It is not part of the CNCF, but as it is specifically designed to introduce chaos to AWS, I must check it out.

Getting Started

Litmus

Highlights: Cloud-Native, nice UI, includes observability features
Fun Fact: Litmus has had two more years to mature than Chaos Mesh

The first thing to note is that there are multiple versions of Litmus and, as always with documentation, versions matter.
So if you want to get started, check that this link still points to the newest version.
Currently, there are two ways to install Litmus, but for me only the kubectl variant worked (on EKS v1.23). So unless you really require Helm, skip it for now.

Once it is running, you of course want to start with the hello world scenario.
For this, Litmus deploys the podtato-head application, which is actually quite a cool project that you should take a look at. Then Litmus tries to break it, and you get a nice visualization of the performed steps in the UI.

At this point I wonder why one of the steps is red, and whether this is a design choice or in fact an error. It turns out that the Litmus version I have running has a bug and currently won’t work with its own hello world deployment.

But hey, I see this as a great learning opportunity! Who wants a working hello world anyway?!

So I dive deeper and quickly realize that the documentation misses a few steps, which need to be complemented by blog articles. So I learn about their CRDs and their different versions and about the architecture, and I feel really lucky at this point that I have quite some experience with Kubernetes. Otherwise, I might have given up a long time ago. By now, I understand the authors of chaos-engineering-tools-comparison. They were right all along:

“Litmus is a comprehensive tool that, unfortunately, comes with a steep learning curve.”

https://www.gremlin.com/community/tutorials/chaos-engineering-tools-comparison/

But after deploying the permissions to all namespaces and adapting the examples to my app, finally some success. I can kill pods as part of a chaos experiment and I can see the results as Kubernetes resources, which is a really nice touch.
But somehow there is still trouble with the Network Loss experiment, and suddenly the UI, which had small hiccups before, stopped showing experiments altogether.
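
For reference, a pod kill with Litmus can be sketched roughly as the ChaosEngine below. This is a minimal sketch and not my exact manifest: the namespace demo, the label app=my-app and the service account pod-delete-sa are placeholders, and it assumes that the pod-delete experiment and its RBAC are already installed in that namespace. Field names can differ between Litmus versions, so check them against the version you run.

```yaml
# Sketch only: all names below are placeholders, not taken from a real setup.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-app-pod-delete
  namespace: demo
spec:
  engineState: active              # set to "stop" to abort a running experiment
  appinfo:
    appns: demo                    # namespace of the application under test
    applabel: app=my-app           # label that selects the target pods
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # assumed to exist with the required RBAC
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds the experiment runs in total
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between two pod kills
              value: "10"
            - name: FORCE                  # "false" means graceful deletion
              value: "false"
```

Once the run finishes, the verdict ends up in a ChaosResult resource in the same namespace, which you can inspect with kubectl like any other resource.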

This would be the right time to join the community Slack channel, where I would surely find all the answers I am looking for, but as I just want to gain an overview, it is time to take a look at the next contender.

Chaos Mesh

Highlights: Cloud-Native, nice UI, good Documentation

Chaos Mesh? More like Well Organized Mesh! Up-to-date documentation with usable examples is what sets Chaos Mesh apart at first glance.

Right from the start, there are multiple QuickStart scripts that help with the setup. Go right to the QuickStart and from there to the Permissions and you are up and running. And since the tool also supports K3s I can run it locally and save a few bucks.
Wrong. It works like a charm for Pod Delete, but somehow the Network Loss Experiment fails. But we like it when stuff fails, don’t we? Because this is where we learn.

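apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh   # namespace the PodChaos resource itself lives in
spec:
  action: pod-kill        # kill the selected pod
  mode: one               # pick one matching pod at random
  selector:
    namespaces:
      - demo              # namespace of the targeted application
    labelSelectors:
      app: my-app         # only pods carrying this label
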
After having a slightly deeper look at Chaos Mesh, I still don’t see much of a difference to Litmus. Yes, there is a slight difference in the architecture, but whether this is good or bad is too early to say. Both tools use CRDs, which are easy to understand, and since it is all Cloud Native, both fit well into GitOps pipelines.

Regarding network traffic, both tools use the capabilities of the Linux network emulator (netem). And since I am having problems with the network experiment on K3s on WSL2 on Windows, this is the point where I abandon K3s and switch to EKS.
Finally, everything works perfectly.
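
To make the network loss experiment a bit more tangible, here is a minimal sketch of what such a Chaos Mesh resource can look like. It reuses the placeholder namespace demo and label app=my-app from the pod kill example. The loss percentage and duration are values I picked for illustration, not the ones from my test.

```yaml
# Sketch only: selector, percentages and duration are illustrative assumptions.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-example
  namespace: chaos-mesh
spec:
  action: loss            # drop packets (implemented via netem under the hood)
  mode: all               # apply to every pod that matches the selector
  selector:
    namespaces:
      - demo
    labelSelectors:
      app: my-app
  loss:
    loss: "25"            # percentage of packets to drop
    correlation: "25"     # how strongly each drop depends on the previous one
  duration: "60s"         # the fault is removed again after this time
```

Starting with a short duration is a cheap safety net: even if you forget to delete the resource, the network recovers on its own.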

AWS Fault Injection Simulator (FIS)

Highlights: Breaks more than just K8s, available via Terraform

The AWS Fault Injection Simulator is the odd one out in the group. This tool is not Cloud Native, its focus is not on Kubernetes, and its experiments are not written in YAML files, unless you count CloudFormation.

With FIS it is really easy to start since nothing needs to be installed. FIS acts through an IAM role that you assign to the experiment, and it is tempting to simply give that role broad permissions. This is convenient for the first tests, but it might get a bit risky later on, since those permissions largely determine the blast radius of your chaos.

But with FIS you are only a few clicks away from the perfect Network Chaos. Just select the VPC and the Subnet and you are all set. You can also simulate a wide range of API misbehavior and force your RDS Database to fail. And all of this of course on a schedule. Ideally, you also provide CloudWatch Alarms that stop the experiment if something breaks that definitely shouldn’t.
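
Since CloudFormation counts as YAML after all, the network experiment can be sketched as an experiment template like the one below. Treat it as a sketch under assumptions: the two parameters (an IAM role for FIS and the ARN of a CloudWatch alarm), the tag chaos-ready and the five-minute duration are placeholders of mine, and I select the subnets by tag instead of clicking them in the console.

```yaml
# Sketch only: role, alarm, tag and duration are placeholders.
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of an FIS experiment that disrupts subnet connectivity
Parameters:
  FisRoleArn:
    Type: String          # IAM role that FIS assumes while running the experiment
  StopAlarmArn:
    Type: String          # CloudWatch alarm that aborts the experiment
Resources:
  NetworkChaosTemplate:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Disrupt network connectivity in tagged subnets
      RoleArn: !Ref FisRoleArn
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: !Ref StopAlarmArn
      Targets:
        ChaosSubnets:
          ResourceType: aws:ec2:subnet
          ResourceTags:
            chaos-ready: "true"     # only subnets carrying this tag are hit
          SelectionMode: ALL
      Actions:
        DisruptConnectivity:
          ActionId: aws:network:disrupt-connectivity
          Parameters:
            duration: PT5M          # ISO 8601 duration, here five minutes
            scope: all              # block all traffic in and out of the subnet
          Targets:
            Subnets: ChaosSubnets
      Tags:
        Name: network-chaos-sketch
```

The stop condition is the important part: if the alarm fires, FIS ends the experiment instead of letting it run for the full duration.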

You can also use it in connection with SSM, and since it is available in Terraform, FIS can also be used with GitOps.

FIS + Chaos Mesh/Litmus

Highlights: Chaos Mesh’s detailed K8s control + the ability to break AWS

Currently, Chaos Mesh is mostly limited to K8s. It provides a few actions that work on EC2 instances, but that’s it. FIS, on the other hand, works well with AWS, and it is possible to break stuff in K8s using SSM, but currently it is not as well-rounded when it comes to K8s as Chaos Mesh or Litmus.

So of course AWS thought it would be great to integrate these tools into its Fault Injection Simulator.
But somehow they forgot to integrate it into their documentation, or maybe the person documenting it thought that it would be sufficient to document a field kubernetesNamespace as “The Kubernetes namespace”. Yes, namespace, right... but the one of the controller, or the one of the targeted resource, which is also provided in the next field, kubernetesSpec? When it comes to permissions, it is not much better, simply because for using Chaos Mesh or Litmus we switch to RBAC and leave the IAM world behind. So this is out of scope for the AWS documentation.
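
To show where these fields end up, here is a sketch of an experiment template that lets FIS create the pod kill PodChaos from the Chaos Mesh section. It rests on my own reading of the fields: I assume kubernetesNamespace is the namespace the custom resource itself is created in (its metadata.namespace), while the namespace of the targeted pods lives inside kubernetesSpec. The parameters FisRoleArn and ClusterArn are placeholders, and the RBAC mapping for the FIS role inside the cluster is not shown.

```yaml
# Sketch only: the namespace interpretation is my assumption, not documented fact.
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of FIS injecting a Chaos Mesh PodChaos into EKS
Parameters:
  FisRoleArn:
    Type: String          # IAM role for FIS, must also be mapped to K8s RBAC
  ClusterArn:
    Type: String          # ARN of the EKS cluster that runs Chaos Mesh
Resources:
  PodKillViaFis:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Kill one pod of my-app through Chaos Mesh
      RoleArn: !Ref FisRoleArn
      StopConditions:
        - Source: none    # no alarm here, add one for anything beyond a demo
      Targets:
        DemoCluster:
          ResourceType: aws:eks:cluster
          ResourceArns:
            - !Ref ClusterArn
          SelectionMode: ALL
      Actions:
        InjectPodChaos:
          ActionId: aws:eks:inject-kubernetes-custom-resource
          Parameters:
            kubernetesApiVersion: chaos-mesh.org/v1alpha1
            kubernetesKind: PodChaos
            kubernetesNamespace: chaos-mesh   # assumed: where the PodChaos is created
            # spec of the custom resource as a JSON string, targets namespace "demo"
            kubernetesSpec: '{"action":"pod-kill","mode":"one","selector":{"namespaces":["demo"],"labelSelectors":{"app":"my-app"}}}'
            maxDuration: PT5M
          Targets:
            Cluster: DemoCluster
      Tags:
        Name: fis-chaos-mesh-sketch
```

Compared to applying the PodChaos directly, the only real gain is that the experiment now lives next to your other FIS experiments and can share their stop conditions.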

The main use case for this integration is, in my opinion, the extension deeper into K8s for teams that are already using FIS. For everybody else it might be more comfortable to run the tools separately but as part of the same pipeline, instead of integrating them. At least for now.

Summary

So what have we learned?

Even though Litmus has had more time to mature, Chaos Mesh provides a better entry point for beginners. But if you want to get serious, it is probably smart to get into the community Slack channels as fast as possible. Then both tools should bring a real benefit to your Kubernetes Reliability Engineering journey.

If you are set on AWS, definitely have a look at FIS. It is fun and easy to get started.

Embrace the Chaos!
