The idea of adding chaos to a system is generally credited to Netflix. In 2011, the company published Chaos Monkey, a tool that it built to disable parts of its production infrastructure. By inducing random failures in monitored environments, Netflix found that it could discover hidden problems that went unnoticed during regular tests.
Chaos engineering provides an immune response effect. It’s similar to how we vaccinate healthy people. You purposefully introduce a threat, potentially causing brief but observable problems, in order to develop stronger long-term resistance.
Building Resilience
It’s safe to assume that any sufficiently large system contains bugs that you don’t know about. Despite all your automated tests and day-to-day real-world usage, you can’t catch everything. Some issues only surface in very specific scenarios, such as loss of connectivity to a third-party service.
Chaos engineering accepts that unforeseen operating issues will always be a fact of life, even in supposedly watertight production environments. Whereas many organizations end up taking a “wait and see” approach, playing whack-a-mole as real reports come in, chaos engineering works on the principle that a brief outage that you invoke is always better than one that the customer sees first.
Breaking things on purpose gives you a way of determining your system’s overall resilience. What happens if the database goes down? How about an outage at your third-party email-sending service? Chaos engineering’s greatest strength is its ability to reproduce events that unit tests and real-world use alone won’t usually cover.
Chaos testing tools are often run against real deployments to eliminate discrepancies between dev and production environments. You don’t need to accept this much risk, though: as long as you’re confident that you can accurately replicate your infrastructure, you can run the same experiments against a sandboxed staging environment.
Adding Chaos to Your Systems
You have multiple options if you’d like to add some chaos to your infrastructure. Automated tools built for this purpose provide a starting point but can be tricky to incorporate into your own infrastructure. You normally need to integrate with VM or container management platforms so that the tool can interact with your own instances.
In the case of Chaos Monkey, you need to be using Spinnaker, Netflix’s continuous delivery platform. While it has broad compatibility with popular public cloud providers, it’s also another dependency that you’re adding to your stack.
If you’re using Kubernetes, kube-monkey takes the original Netflix principles and packages them for use in your cluster. It works on an opt-in basis, so Kubernetes resources with the kube-monkey/enabled label will be eligible for random termination.
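The opt-in model can be illustrated with a small sketch. This is not kube-monkey’s code or API. The `Workload` type, the sample workload names, and the assumption that opting in means setting the label value to `"enabled"` are all illustrative.

```python
import random
from dataclasses import dataclass, field

OPT_IN_LABEL = "kube-monkey/enabled"  # label key named in the text


@dataclass
class Workload:
    """Hypothetical stand-in for a labelled Kubernetes resource."""
    name: str
    labels: dict = field(default_factory=dict)


def eligible_for_termination(workloads):
    """Return only the workloads that explicitly opted in."""
    # Assumption: opting in means setting the label to "enabled".
    return [w for w in workloads if w.labels.get(OPT_IN_LABEL) == "enabled"]


def pick_victim(workloads, rng):
    """Pick one opted-in workload at random, or None if nobody opted in."""
    candidates = eligible_for_termination(workloads)
    return rng.choice(candidates) if candidates else None


workloads = [
    Workload("checkout", {OPT_IN_LABEL: "enabled"}),
    Workload("search", {OPT_IN_LABEL: "enabled"}),
    Workload("billing"),  # no label, so never selected
]
victim = pick_victim(workloads, random.Random(7))
```

Because selection starts from an explicit allow-list, a workload that nobody has labelled can never be chosen, which keeps the blast radius under the team’s control.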
Pumba provides similar capabilities for regular Docker containers. It can provoke container crashes, stress resource allowances such as CPU and memory, and cause network failures.
A tool that specifically targets networking errors is Shopify’s Toxiproxy. This provides a TCP proxy that simulates a wide range of network conditions. You can filter your application’s traffic through Toxiproxy to see how the system performs with severe latency or reduced bandwidth.
For advanced control, VMware’s Mangle is a “chaos engineering orchestrator” that targets several different deployment mechanisms. It works with Kubernetes, Docker, VMware vCenter, and generic SSH connections. Mangle lets you define custom faults for application and infrastructure components. Application faults affect a single service, while infrastructure faults target shared components whose failure could take down multiple services.
While chaos engineering is most commonly associated with backend development and DevOps, there’s growing interest among frontend engineers, too. React Chaos is a library that will throw random errors from React components, letting you identify flaky UI sections that could crash your whole app.
Designing Your Own Chaos Experiments
If you can’t successfully use an open-source chaos tool, design your own experiments instead. Make a list of the assumptions within your application’s environment. Identify the connections between services and think about what would happen if one dropped out. Each prediction that you write down is a hypothesis about how the system will behave.
You then need to test your hypothesis. Break the system and observe the consequences. Next, determine whether the effect was acceptable. Did the app crash and display a stack trace to the user? Or did it show an outage status page and email the stack trace to your on-call staff?
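As a minimal sketch of this break-and-observe loop, the example below switches off a stand-in email service and checks whether signup still degrades gracefully. The `EmailService` class, the `register_user` function, and the `alerts` list are hypothetical names invented for illustration. A real experiment would target actual infrastructure instead of an in-memory flag.

```python
class EmailService:
    """Hypothetical third-party dependency that can be switched off."""

    def __init__(self):
        self.available = True

    def send(self, recipient, message):
        if not self.available:
            raise ConnectionError("email service unreachable")
        return True


def register_user(email, email_service, alerts):
    """Create an account; a failed welcome email must not fail the signup."""
    try:
        email_service.send(email, "Welcome!")
        email_sent = True
    except ConnectionError as exc:
        # Degrade gracefully: notify on-call staff instead of crashing.
        alerts.append(f"welcome email failed for {email}: {exc}")
        email_sent = False
    return {"registered": True, "email_sent": email_sent}


# Experiment: break the dependency, then observe the consequences.
service = EmailService()
alerts = []
service.available = False
result = register_user("user@example.com", service, alerts)
service.available = True  # restore the dependency after the experiment

# Acceptable outcome: the user is registered and staff were alerted.
hypothesis_holds = result["registered"] and len(alerts) == 1
```

If `hypothesis_holds` came back false, that would be the signal to fix the error handling and run the experiment again.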
It’s important to keep each test small and focused. This limits the impact in the event of a production outage and helps you be sure that the issue arises from the tested assumption, not from another part of the system.
Always ensure that you have a clear recovery procedure before manually conducting a chaos experiment. Escalating a provoked outage into a live, unplanned one is the last thing that you want. If you’re terminating a service, be mindful of the time that you’ll need to get it started again. There could be knock-on impacts on your application during longer outages: if you take down an email distribution service, there could be a backlog to work through when it comes back online. These aspects need to be incorporated into your action plan before you start work.
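One way to make the recovery step hard to forget is to pair every fault with its rollback in code. The sketch below assumes hypothetical `stop_email_service` and `start_email_service` helpers that flip an in-memory flag. In practice, they would call whatever tooling actually stops and starts your service.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(inject, revert):
    """Apply a fault, and always revert it, even if the experiment raises."""
    inject()
    try:
        yield
    finally:
        revert()


# Hypothetical in-memory stand-in for real infrastructure state.
state = {"email_service_up": True}


def stop_email_service():
    state["email_service_up"] = False


def start_email_service():
    state["email_service_up"] = True


observed_during_fault = None
try:
    with injected_fault(stop_email_service, start_email_service):
        observed_during_fault = state["email_service_up"]
        raise RuntimeError("experiment aborted early")
except RuntimeError:
    pass  # the abort is expected here; the fault has already been reverted
```

The `finally` clause runs the rollback whether the experiment finishes normally or is aborted, so the provoked outage doesn’t outlive the test. It doesn’t cover secondary effects such as a backlog of queued work, which still belong in the written action plan.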
After your experiment completes, you might need to update your system before re-running the test. Verifying that your fix actually improves the situation lets you be confident that your system is now resilient to that specific scenario.
Here’s a summary of the chaos experiment process:
1. Develop a hypothesis: “The system is resilient to increased network latency.”
2. Design a focused experiment: “We will artificially increase latency to 500ms on 70% of requests.” Make sure that you have a clear rollback and recovery strategy.
3. Run the experiment: Observe the impact on your application. Revert detrimental changes to production environments as soon as possible.
4. Analyze the results: If you decide that your system wasn’t resilient enough, implement improvements and repeat the process.
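The latency experiment from this summary can be rehearsed as a simulation before it goes anywhere near real traffic. The sketch below assumes a 120ms baseline response time and treats “resilient” as a 95th-percentile latency of at most 800ms. Both figures are illustrative assumptions, and no real requests are sent or delayed.

```python
import random

BASE_LATENCY_MS = 120      # assumed normal response time
INJECTED_LATENCY_MS = 500  # from the example experiment
AFFECTED_FRACTION = 0.7    # 70% of requests
P95_BUDGET_MS = 800        # assumed threshold for "resilient"


def simulate_latencies(n_requests, rng):
    """Return one simulated latency per request, with the fault applied."""
    latencies = []
    for _ in range(n_requests):
        affected = rng.random() < AFFECTED_FRACTION
        latencies.append(INJECTED_LATENCY_MS if affected else BASE_LATENCY_MS)
    return latencies


def percentile_95(values):
    """Simple index-based 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]


# A seeded generator keeps the experiment repeatable between runs.
latencies = simulate_latencies(1000, random.Random(42))
p95 = percentile_95(latencies)
affected_share = latencies.count(INJECTED_LATENCY_MS) / len(latencies)
hypothesis_holds = p95 <= P95_BUDGET_MS
```

Writing the threshold down as a constant before the run forces the team to agree on what “resilient” means in advance, which makes the analysis step a comparison, not a debate.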
The Non-Technical Side of Chaos Engineering
Chaos engineering is normally viewed as a technical task for development and operations teams; after all, “engineering” is in the name. Besides the nuts and bolts of networks and services, it’s important to look at the human side, too. It’s easy to think that your system only depends on a database, a few app servers, and a stable network. That’s not usually the case.
Think about how your system would respond if team members were unavailable. Is knowledge readily accessible if an admin needs to step back unexpectedly? Especially in smaller organizations, it’s common for a “team” to be a single person. What happens if your networking guy is ill during a live outage?
In the same way that you test the technical aspects by taking services offline, you can rehearse human scenarios, too. Try purposefully excluding key individuals as you rehearse an outage. Was the remainder of the team able to restore service to an acceptable state? If they weren’t, you might benefit from documenting more of the system and its dependencies.
Summary
The term “chaos engineering” refers to the practice of breaking things on purpose, often in production, to uncover previously hidden issues. Although the approach can seem daunting to start with, dedicated tools like Chaos Monkey can help you get started with minimal risk.
Adding chaos is a useful technique, as it uncovers both transient and systemic problems. You might find that peaking memory use causes knock-on impacts across your infrastructure, but that increased network latency has a sporadic effect on specific parts of your stack.
Effective use of chaos engineering can help you find bugs faster, before your customers notice them. It helps you build up resiliency in your system by encouraging anticipation of issues. Most teams still address problems reactively, leading to an increased cycle time that impedes efficiency.
Chaos engineering is best treated as a mindset rather than a specific procedure or software product. If you acknowledge that systems tend toward chaos, you’ll naturally start baking support for more “what-if” scenarios into your code.
It’s always worth thinking about the “impossible” events, like a data center outage or severe network congestion. In reality, they’re not impossible, just extremely rare. When they do strike, they’re likely to be the most destructive events that your system encounters, unless your infrastructure is prepared to handle them with fallback routines.