Chaos Engineering: Chaos Toolkit Demo

by Josef Fuchshuber

This blog article demonstrates the usage and functionality of the open-source Chaos Engineering tool Chaos Toolkit (CTK) with two simple examples. If you want to learn more about Chaos Engineering first, read our article “The Status Quo of Chaos Engineering”. With CTK, Chaos Engineering experiments are specified in a custom DSL, written as JSON or YAML. CTK is open source (Apache-2.0), implemented in Python, and can be extended with drivers. The project provides several drivers out of the box:

  • Infrastructure & platform level: AWS, Azure, Cloud Foundry, Gandi, Google Cloud Platform, Kubernetes, Service Fabric
  • Application level: Spring Boot
  • Network: Istio, ToxiProxy, WireMock
  • Observability: Dynatrace, Humio, OpenTracing, Prometheus

The drivers provide actions and probes. Probes are used to validate the Steady State, that is, the state before and after the actions. The actions are the core of the experiment: they simulate failures of the application or platform, or slow down network traffic. CTK always runs actions and probes through the public APIs of the platforms and applications. The experiment runs outside the cluster and adds no invasive software components (e.g. service meshes) to the cluster.

The Chaos Toolkit CLI orchestrates your experiment

The best feature of CTK is that you do not just run experiments: you also have to define the Steady State as part of every test case. This makes CTK tests ideal for integration into CI/CD pipelines.

Demo App

We will demonstrate our two examples using a demo application: a contacts management app that is built on a microservice architecture and runs on Kubernetes. The demo app is taken from the Azure Developer College.

Contacts App Architecture

Example #1

The first example is the Kubernetes “hello world” of Chaos Testing. What happens if the contact service fails?

To interact with Kubernetes, CTK has its own extension. The tool and the extension can easily be installed in an existing Python environment (installation details).

pip install -U chaostoolkit
pip install chaostoolkit-kubernetes

After installing the tool, we can write our first “hello world” test. For the declarative description you can choose between JSON and YAML. We recommend YAML, if only because JSON does not support comments.

The first part of the test describes the Steady State. The CTK core supports three different probe providers: HTTP, Process and Python. Our first example uses the HTTP provider: it sends an HTTP request to the search service and validates the HTTP response code.

# define the steady state hypothesis
steady-state-hypothesis:
  title: Verifying search api remains healthy
  probes:
  - type: probe
    name: search-api-must-still-respond
    tolerance: 200 # http response code 200 is expected
    provider:
      type: http
      timeout: 2
      url: http://${azure_app_endpoint}/api/search/contacts?phrase=mustermann
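
To make the probe semantics concrete, here is a minimal Python sketch of what such an HTTP probe boils down to: send a GET request with a timeout and compare the status code with the tolerance. It is not CTK's implementation. The function names `http_status` and `within_tolerance` are made up for this illustration, and the sketch assumes only two forms of tolerance: a single value (equality) and a list of accepted values (membership).

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Send a GET request and return the HTTP status code.

    Hypothetical helper for illustration; not part of the CTK API.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions but still carry a code.
        return error.code


def within_tolerance(status: int, tolerance) -> bool:
    """Return True if the observed status satisfies the tolerance.

    Assumption of this sketch: a scalar means equality, a list membership.
    """
    if isinstance(tolerance, (list, tuple)):
        return status in tolerance
    return status == tolerance


if __name__ == "__main__":
    # No network access needed: check the tolerance logic on fixed values.
    print(within_tolerance(200, 200))         # True
    print(within_tolerance(503, 200))         # False
    print(within_tolerance(204, [200, 204]))  # True
```

A connection error or timeout is not caught in `http_status` and would surface as an exception, which a caller would also have to treat as a Steady State violation.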

The second part of the test describes our experimental method. Our example uses the Kubernetes driver extension and terminates exactly one randomly chosen pod in the “contactsapp” namespace that carries the label service=searchapi. The driver uses your current kubectl context (~/.kube/config).

method:
- type: action
  name: terminate-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    # Terminates one "searchapi" pod randomly
    arguments:
      label_selector: service=searchapi
      ns: contactsapp
      qty: 1
      rand: true
      grace_period: 0
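
The effect of the arguments label_selector, qty and rand can be sketched in a few lines of Python. This is an illustration of the selection logic as described in this article, not the code of the Kubernetes driver: the function `select_pods` and the dictionary layout of a pod are assumptions made up for the example, and only simple `key=value` selectors are handled.

```python
import random


def select_pods(pods, label_selector, qty=1, rand=False, rng=None):
    """Pick the pods that a terminate action would target.

    `pods` is a list of dicts with "name" and "labels" keys (a made-up
    structure for this sketch; the real driver queries the Kubernetes API).
    """
    key, _, value = label_selector.partition("=")
    matching = [p for p in pods if p["labels"].get(key) == value]
    if rand:
        rng = rng or random.Random()
        # Sample without replacement, never more than what is available.
        return rng.sample(matching, min(qty, len(matching)))
    return matching[:qty]


if __name__ == "__main__":
    pods = [
        {"name": "searchapi-1", "labels": {"service": "searchapi"}},
        {"name": "searchapi-2", "labels": {"service": "searchapi"}},
        {"name": "contactsapi-1", "labels": {"service": "contactsapi"}},
    ]
    victims = select_pods(pods, "service=searchapi", qty=1, rand=True)
    print([p["name"] for p in victims])  # one of the two searchapi pods
```

With a single matching pod, the one selected pod is also the only one, which is why the first run of the experiment causes a short downtime.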

As soon as Kubernetes detects that a pod is missing from the replica set, a new pod is started. This does not happen immediately, but takes a few seconds depending on the Kubernetes configuration and the application in the pod. At the beginning, our example has a replica set of size 1, so there will be a short downtime when terminating the pod. This fails the validation of the Steady State and thus the complete chaos test. After increasing the replica set to two instances of the search service, the test can be successfully executed. The following recording shows the failure of the first execution, the fix by rescaling and then the successful repetition of the test.

Execution plan used by Chaos Toolkit when running an experiment:

  1. Validate experiment description
  2. Run Steady State Hypothesis
  3. Run Method
  4. Run Steady State Hypothesis
  5. Run Rollbacks

If one of the steps fails, the experiment is classified as a failure.
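
The five steps above can be sketched as a small Python function. This is a simplified model for illustration only, not the CTK runner: the function `run_experiment` and its parameters are hypothetical, and two behaviors are assumptions of the sketch, namely that the method is skipped when the first Steady State check fails and that rollbacks always run once the method has started.

```python
from typing import Callable, List


def run_experiment(
    validate: Callable[[], bool],
    steady_state: Callable[[], bool],
    method: List[Callable[[], None]],
    rollbacks: List[Callable[[], None]],
) -> str:
    """Run the five steps in order and return "completed" or "failed"."""
    if not validate():                 # 1. validate experiment description
        return "failed"
    if not steady_state():             # 2. steady state before the method
        return "failed"                #    nothing was changed, so stop here
    try:
        for action in method:          # 3. run the method
            action()
        ok = steady_state()            # 4. steady state after the method
    except Exception:
        ok = False
    finally:
        for rollback in rollbacks:     # 5. rollbacks, even if a step failed
            rollback()
    return "completed" if ok else "failed"


if __name__ == "__main__":
    log = []
    status = run_experiment(
        validate=lambda: True,
        steady_state=lambda: True,
        method=[lambda: log.append("terminate-pod")],
        rollbacks=[lambda: log.append("rollback")],
    )
    print(status, log)  # completed ['terminate-pod', 'rollback']
```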

The results of the experiment are printed directly to the terminal and additionally logged in more detail in the output file journal.json. From this file, HTML or PDF reports of the experiment can be generated.

chaos report --export-format=html journal.json report.html
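
Because the journal is plain JSON, it can also be evaluated by a script, for example in a CI job. The following Python sketch reads a journal and extracts two top-level fields. It assumes that the journal is a JSON object with the keys `status` and `deviated`; check this against a journal produced by your own CTK version. The function name `summarize_journal` is made up for this example.

```python
import json
from pathlib import Path


def summarize_journal(path: str) -> dict:
    """Return a few top-level fields from a journal file.

    Assumption: the journal is a JSON object with "status" and "deviated"
    keys; missing keys are reported as None instead of raising.
    """
    journal = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated"),
    }


if __name__ == "__main__":
    # Only runs against a real journal if one exists in the working directory.
    if Path("journal.json").exists():
        print(summarize_journal("journal.json"))
```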

The source code of the experiment can be found on GitHub.

Example #2

The second example uses the Azure driver extension for CTK. In this example, a node from the Kubernetes node pool managed by Azure Kubernetes Service (AKS) is shut down temporarily, simulating a virtual machine failure.

pip install -U chaostoolkit-azure

Before the extension can be used, the Azure secrets must be stored in an Azure credentials file. This requires the Azure CLI and the permission to create a service principal (full doc).

az login
az ad sp create-for-rbac --sdk-auth > credentials.json

credentials.json:

{
  "subscriptionId": "<azure_subscription_id>",
  "tenantId": "<tenant_id>",
  "clientId": "<application_id>",
  "clientSecret": "<application_secret>",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}
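
A quick sanity check of the generated file can save a failed experiment run. The following Python sketch reports which identity fields are missing or empty in the content of a credentials file like the one above. Which keys are strictly required is an assumption of this sketch (the four identity values from the example), and `missing_credential_keys` is a hypothetical helper, not part of the Azure driver.

```python
import json

# Keys taken from the credentials.json example; treating exactly these four
# as mandatory is an assumption of this sketch.
REQUIRED_KEYS = ("subscriptionId", "tenantId", "clientId", "clientSecret")


def missing_credential_keys(raw_json: str) -> list:
    """Return the required keys that are absent or empty in the file content."""
    data = json.loads(raw_json)
    return [key for key in REQUIRED_KEYS if not data.get(key)]


if __name__ == "__main__":
    sample = '{"subscriptionId": "s", "tenantId": "t", "clientId": "c"}'
    print(missing_credential_keys(sample))  # ['clientSecret']
```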

Store the path to the file in an environment variable called AZURE_AUTH_LOCATION and make sure that your experiment does NOT contain a secrets section. Once this is done, our Azure experiment can be run. The experiment stops one instance of the filtered virtual machine scale set (VMSS). The filtering is done by resource group, scale set name and instance ID.

method:
- type: action
  name: stop-instance
  provider:
    type: python
    module: chaosazure.vmss.actions
    func: stop_vmss
    arguments:
      filter: where resourceGroup=~'${azure_resource_group}' and name=~'${azure_vmss_name}'
      instance_criteria:
      - name: ${azure_vmss_instanceId}
  pauses:
    after: 15

This test verifies that the instances of the search service have been deployed on more than one Kubernetes node, so that the service stays available when a single node is unavailable.
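
The property under test can also be stated as a simple check: the pods of a service must be spread over at least two distinct nodes. The Python sketch below expresses this on a plain pod-to-node mapping, which could be collected, for example, from the output of `kubectl get pods -o wide`. The function name and the pod and node names are hypothetical.

```python
def runs_on_multiple_nodes(pod_to_node: dict, min_nodes: int = 2) -> bool:
    """Return True if the pods are spread over at least `min_nodes` nodes."""
    return len(set(pod_to_node.values())) >= min_nodes


if __name__ == "__main__":
    # Hypothetical pod and node names, for illustration only.
    spread = {"searchapi-1": "aks-node-0", "searchapi-2": "aks-node-1"}
    stacked = {"searchapi-1": "aks-node-0", "searchapi-2": "aks-node-0"}
    print(runs_on_multiple_nodes(spread))   # True
    print(runs_on_multiple_nodes(stacked))  # False
```

Two replicas on the same node pass the first example but fail this one, which is exactly the gap the node shutdown experiment is meant to expose.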

Because in this test the node remains stopped even after the second Steady State validation, we need to execute a rollback. In our example we restart the node with the function restart_vmss.

rollbacks:
- type: action
  name: restart-instance
  provider:
    type: python
    module: chaosazure.vmss.actions
    func: restart_vmss
    arguments:
      filter: where resourceGroup=~'${azure_resource_group}' and name=~'${azure_vmss_name}'
      instance_criteria:
      - name: ${azure_vmss_instanceId}

The source code of the experiment can be found on GitHub.

Summary

The Chaos Toolkit is a stable open-source tool for Chaos Engineering. The existing driver extensions, the option to write your own extensions, and the ability to execute processes directly as actions or probes give it a great deal of flexibility for any kind of chaos test.

Because a test always includes the complete experiment (Steady State & Action), Chaos Toolkit is ideal for continuous automated quality assurance.
