Semi-coherent ramblings about cool tech stuff

By Tristan Rhodes Posted 21 September 2015

Building Resilient Distributed Systems - Part 1



This post is part of a series that covers the concept of resilience in a distributed environment, and how to improve a system's handling of transient errors.



Wikipedia - In computer networking: “Resiliency is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.”

What does this mean?

Large distributed systems are made up of many nodes spread over many clustered layers, which communicate in long chains of calls, either synchronously or asynchronously.

Successful Service Call

The more nodes involved in a system, the greater the probability of a failure. A data center with 100,000 servers and millions of users will experience more hardware failures than a data center with 10 servers and a few hundred users. So what happens when one of these nodes fails?
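To put rough numbers on that claim, here is a minimal sketch. It assumes every node fails independently with the same probability over some time window; both the independence assumption and the 0.1% figure are made up for illustration, not measurements from any real data center.

```python
def p_any_failure(node_count: int, p_node: float) -> float:
    """Probability that at least one of node_count nodes fails,
    assuming each node fails independently with probability p_node."""
    return 1.0 - (1.0 - p_node) ** node_count


# Hypothetical per-node failure probability for a given time window
P_NODE = 0.001

for n in (10, 1_000, 100_000):
    print(f"{n:>7} nodes: {p_any_failure(n, P_NODE):.4f}")
```

Under these assumptions, the chance of seeing at least one failure goes from about 1% with 10 nodes, to about 63% with 1,000 nodes, to a near certainty with 100,000 nodes.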

Failed Service Call

Our user gets this!

500 Error

What causes failure?

So what kind of things can cause a transient error in our system?

These are all errors that can happen in a large system, but need not be instantly fatal to a given request. Using resilience techniques, we can ensure the system better handles these events.
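As a taste of what "need not be instantly fatal" can look like in practice, here is a minimal sketch of one common resilience technique: retrying with exponential backoff. It is only one possible approach, and the names TransientError and call_with_retry are hypothetical, invented for this example rather than taken from any library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Wait longer after each failure; jitter spreads out retries
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# Usage: a fake service call that fails twice, then succeeds
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("node temporarily unavailable")
    return "200 OK"


print(call_with_retry(flaky_call, sleep=lambda seconds: None))
```

The sleep function is passed in as a parameter so the example (and any test) can skip the real waiting. Any error that is not a TransientError is deliberately left to propagate, since retrying a permanent failure only adds load.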

What kinds of systems will benefit from resilience?

Some kinds of systems benefit much more from resiliency than others.

These systems all provide a feed of data, so an interruption severs the connection to the client. Where the cost of re-establishing connections is high, the lack of resilience is even more pronounced.


This covers what resilience is and what symptoms you can expect from a non-resilient system. In my next post, I will go over how you can implement resilience in your system.

Tristan Rhodes