03. March 2021
by Josef Fuchshuber | 1163 words | ~6 min read
diagnosibility devops observability chaos engineering testing quality
This blog article demonstrates the usage and functionality of the open-source Chaos Engineering tool Chaos Toolkit (CTK) with two simple examples. If you want to learn more about Chaos Engineering first, read our article “The Status Quo of Chaos Engineering”. With CTK, Chaos Engineering experiments are specified in a custom DSL as a JSON or YAML description. CTK is open source (Apache-2.0), implemented in Python and can be extended by drivers. Some drivers are provided out of the box by the project, among them the Kubernetes and Azure drivers used in this article.
The drivers include actions and probes. Probes are used to validate the Steady State, the state before and after the actions. The actions are the core of the experiment. They simulate failures of the application or platform, or slow down network traffic. CTK always runs actions and probes through the public APIs of the platforms and applications. The experiment runs outside the cluster and adds no invasive software components (e.g. service meshes) to the cluster.
The best feature of CTK is that you can not only run experiments, you also have to define the Steady State as a part of your test cases. This makes CTK tests ideal for integration into ongoing CI/CD processes.
We will demonstrate our two examples using a demo application, a contacts management app based on a microservice architecture and Kubernetes runtime. The demo app is taken from the Azure Developer College.
The first example is the Kubernetes “hello world” of Chaos Testing. What happens if the search service fails?
To interact with Kubernetes, CTK has its own extension [1]. The tool and the extension can easily be installed in an existing Python environment (installation details).
After we have installed the tool, we can write our first “hello world” test. For the declarative description you can choose between JSON and YAML. YAML is preferable, if only because JSON does not support comments.
The first part of the test describes the Steady State. The CTK core supports three different probe providers: HTTP, Process and Python [2]. Our first example uses the HTTP provider: it sends an HTTP request to the search service and validates the HTTP response code.
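The original listing is not reproduced here, so the following is only a minimal sketch of what such a Steady State section might look like. The title, the probe name, the URL and the timeout are assumptions for illustration and are not taken from the demo application; `tolerance: 200` tells CTK to expect HTTP status 200 from the probe.

```yaml
# Sketch only: names and URL are placeholders, not the demo app's real values.
version: 1.0.0
title: What happens if a pod of the search service is terminated?
description: The search service should stay reachable while one pod is killed.
steady-state-hypothesis:
  title: Search service is available
  probes:
    - type: probe
      name: search-service-must-respond
      tolerance: 200                 # expected HTTP status code
      provider:
        type: http
        url: http://localhost:8080/search   # assumption: replace with your endpoint
        method: GET
        timeout: 3                   # seconds
```

CTK evaluates this block before and after the method, so a single probe definition covers both Steady State checks.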
The second part of the test describes our experimental method. Our example uses the Kubernetes driver extension and terminates exactly one pod in the “contactsapp” namespace, selected from the pods with the label service=searchapi. The driver uses your current kubectl context (~/.kube/config).
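As an illustration, a method section along these lines could express that action. It is a sketch, not the article's original listing: the action name and the pause are assumptions, while `terminate_pods` from `chaosk8s.pod.actions` with the `label_selector`, `ns`, `qty` and `rand` arguments comes from the Kubernetes driver extension.

```yaml
# Sketch only: action name and pause length are illustrative assumptions.
method:
  - type: action
    name: terminate-search-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: service=searchapi   # pods to choose from
        ns: contactsapp                     # namespace of the demo app
        qty: 1                              # terminate exactly one pod
        rand: true                          # pick the pod at random
    pauses:
      after: 5                              # seconds to wait before the next step
```

A short pause after the action leaves a little time for the failure to become visible before the Steady State is validated again; whether this is wanted depends on what the test is supposed to prove.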
As soon as Kubernetes detects that a pod is missing from the replica set, a new pod is started. This does not happen immediately, but takes a few seconds depending on the Kubernetes configuration and the application in the pod. At the beginning, our example has a replica set of size 1, so there will be a short downtime when terminating the pod. This fails the validation of the Steady State and thus the complete chaos test. After increasing the replica set to two instances of the search service, the test can be successfully executed. The following recording shows the failure of the first execution, the fix by rescaling and then the successful repetition of the test.
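Execution plan used by Chaos Toolkit when running an experiment:
1. Validate the Steady State (probes)
2. Execute the method (actions)
3. Validate the Steady State again
4. Execute the rollbacks, if any are defined
If one of the steps fails, the experiment is classified as a failure.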
The results of the experiment are printed directly to the terminal and additionally logged in more detail in the output file journal.json. HTML or PDF reports of the experiment can be generated from this file.
The source code of the experiment can be found on GitHub.
The second example uses the Azure driver extension [3] for CTK. In this example, a node from the Kubernetes node pool managed by Azure AKS [4] is shut down temporarily, simulating a virtual machine failure.
Before the extension can be used, the Azure secrets must be stored in an Azure credentials file. This requires the Azure CLI and the permission to create a service principal (full doc).
Store the path to the credentials file (credentials.json) in an environment variable called AZURE_AUTH_LOCATION and make sure that your experiment does NOT contain a secrets section. Once this is done, our Azure experiment can be run. The experiment stops an instance from the filtered virtual machine scale set (VMSS). The filtering is done by the resource group, the scale set name and the instance ID.
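A method section for this experiment might look roughly like the sketch below. Everything in angle brackets is a placeholder, and the action name is made up for illustration. The `filter` string follows the query style shown in the driver's documentation; the `instance_criteria` argument and its `instanceId` key are assumptions about the installed driver version and should be checked against its documentation before use.

```yaml
# Sketch only: placeholders in <...> must be replaced; verify the argument
# names against the version of the Azure driver extension you have installed.
configuration:
  azure_subscription_id: "<your-subscription-id>"
method:
  - type: action
    name: stop-aks-node
    provider:
      type: python
      module: chaosazure.vmss.actions
      func: stop_vmss
      arguments:
        # selects the scale set by resource group and name
        filter: "where resourceGroup=~'<node-resource-group>' and name=~'<scale-set-name>'"
        # assumption: narrows the selection to one instance of the scale set
        instance_criteria:
          - instanceId: "0"
```

There is deliberately no secrets section here, since the credentials are read from the file referenced by AZURE_AUTH_LOCATION.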
This test can be used to verify that the instances of the search service have been deployed on more than one Kubernetes node, so that the service stays available when a node fails.
Because the node remains stopped in this test even after the second Steady State validation, we need to execute a rollback. In our example we restart the node with the function restart_vmss.
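A rollback section for this could be sketched as follows. The action name and the placeholders in angle brackets are assumptions; `restart_vmss` is the function named above, and the filter is assumed to mirror the one used to stop the instance.

```yaml
# Sketch only: the filter should match the one used in the method section.
rollbacks:
  - type: action
    name: restart-aks-node
    provider:
      type: python
      module: chaosazure.vmss.actions
      func: restart_vmss
      arguments:
        filter: "where resourceGroup=~'<node-resource-group>' and name=~'<scale-set-name>'"
```

Using the same filter for the action and the rollback keeps the experiment from restarting a different scale set than the one it stopped.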
The source code of the experiment can be found on GitHub.
The Chaos Toolkit is a stable open-source tool for Chaos Engineering. The existing driver extensions, the option to write your own extensions and the ability to execute processes directly as actions or probes make it very flexible for any kind of chaos test.
Because a test always includes the complete experiment (Steady State & Action), Chaos Toolkit is ideal for continuous automated quality assurance.