In the previous post in this series we looked at how to reason about load and overload in a system.
The way to think about building systems hardened against overload is to introduce an upper bound on the amount of load you allow into your system at a given point in time. In the graph below the yellow line indicates the load threshold above which our system availability degrades. The blue line represents a safe upper bound we want to introduce – we want to start rejecting work if the load in our system reaches this threshold.
The naive recommendation we made was to use token bucket throttling to restrict our inbound traffic to 38 tps. This allowed us to draw the red line shown in the graph below.
This would protect our service from exceeding the safe upper bound of load in our system, but it is also very conservative, because it caters for the pathologically worst scenario we could encounter. In reality, the percentile distribution of cost per unit of work will vary, so the average cost per unit of inbound work will fall somewhere between the steepest curve and the flattest curve. If we configure our system to start rejecting work when we encounter the red line, we lose out on utilising the headroom we have to fulfill cheaper requests. This unutilised capacity is indicated by the red and blue triangles in the graph above.
What we actually want to do is slide that vertical red line to the right to create a safe upper bound for tps, and introduce a separate horizontal red line to restrict overload.
Intuitively this makes sense, but what does that mean in practice?
Throttling vs Loadshedding
Throttling or Rate Limiting refers to controlling the rate of traffic you allow into your system. The intent is to control the number of transactions per second entering your system, or concurrent within it, at a given point in time.
Loadshedding refers to controlling how much load you allow in your system at a given point in time. The intent is to prevent overload.
Remember in the initial post of the series I told you about a failure mode where we saw a massive spike in load in the system while tps remained constant? In that scenario, throttling inbound requests alone would not have been enough to protect us against overload. We would have needed additional loadshedding strategies to recover from overload.
Throttling
There are a few ways we can introduce throttling, and a few reasons why we would want to do so.
One of the most popular throttling mechanisms is the token bucket algorithm. See here for a nice example of how Stripe.com uses Redis to implement a shared token bucket in a distributed environment. In summary, we control the rate of access to our system by handing out tokens from a shared bucket which is replenished with new tokens at a constant rate. The rate at which we can hand out these tokens is dictated by the rate of replenishment. The algorithm is designed to accommodate small bursts of traffic, but slow down larger, more sustained spikes in load.
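To make the mechanics concrete, here is a minimal, in-process sketch of a token bucket. The class and parameter names are illustrative; a shared, distributed bucket like Stripe's would live in Redis instead.

```python
import threading
import time

class TokenBucket:
    """Minimal in-process token bucket: refills at a fixed rate, allows small bursts."""

    def __init__(self, rate_per_second: float, capacity: int):
        self.rate = rate_per_second       # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Returns True if a token was available, False if the caller should be throttled."""
        with self.lock:
            now = time.monotonic()
            # Replenish tokens according to elapsed time, capped at bucket capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage: allow roughly 38 requests per second with a small burst allowance.
bucket = TokenBucket(rate_per_second=38, capacity=50)
if not bucket.try_acquire():
    print("throttled")  # e.g. return HTTP 429 to the caller
```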
Overall inbound traffic throttling
We know that there is a definitive upper bound to the request tps our service can handle. That vertical red line we moved to the far right in the graph above can be implemented by configuring a maxconns limit per host on your load balancer, or (preferably AND) using token bucket throttling to rate limit inbound HTTP request tps. (Note, there is a distinction between maxConns and max HTTP requests. More to follow in a future post.) This can help protect you against a DDoS attack, for example.
Concurrent Throttling
It is good practice to reason about how many concurrent units of work can be in flight in your system at a given point in time. Subtleties that can inform this are thread count and db connection pool size. It doesn’t make sense to accept more requests into your system than you have worker threads to execute them. All you will achieve is creating bottlenecks in your system where in-flight work backs up waiting for resources, leading to higher request latency.
Imagine our supermarket deli counter. If we only have three people who can hand out salami, it is inevitable that a queue will start forming. The queue might not move as slowly as our checkout counter queue, but the latency that customers spend in the shop will increase.
You can implement this by putting an interceptor in your HTTP request handler stack which maintains a local ‘bucket counter’. Every time an HTTP request enters your system, it decreases the counter. Every time an HTTP request exits the system, it increases the counter. If the counter is at zero, the interceptor rejects the request with a 503 error.
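A minimal sketch of such an interceptor, using a semaphore as the ‘bucket counter’ and a hypothetical next_handler callback standing in for the rest of the handler chain:

```python
import threading

class ConcurrencyLimiter:
    """Tracks in-flight requests; rejects new work once the limit is reached."""

    def __init__(self, max_in_flight: int):
        self.semaphore = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, next_handler):
        # Non-blocking acquire: if no slot is free, shed the request immediately.
        if not self.semaphore.acquire(blocking=False):
            return {"status": 503, "body": "server busy, try again later"}
        try:
            return next_handler(request)   # work enters the system
        finally:
            self.semaphore.release()       # work exits the system, slot freed

# Usage: size the limit to your worker thread count and db connection pool size.
limiter = ConcurrencyLimiter(max_in_flight=32)
```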
Per customer throttling
Many customers will write code which calls our banking APIs. It is very easy for a bug in one of these systems to turn it into a bad actor that starts calling your banking service too aggressively. If they send a malformed request which returns an HTTP error response, and retry without exponential backoff, they can create retry traffic storms. It is a good idea to restrict the number of concurrent TCP connections against your service, as well as the number of HTTP requests a customer can make in a given period. The intent here is isolation – to protect other users from one bad actor.
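A sketch of the idea, using a simple fixed-window counter per customer. The limits and the fixed-window choice are illustrative; a production version would typically use per-customer token buckets backed by a shared store such as Redis, plus locking for concurrent access.

```python
import time
from collections import defaultdict

class PerCustomerRateLimiter:
    """Fixed-window request counter per customer: isolates one noisy caller from the rest."""

    def __init__(self, max_requests_per_window: int, window_seconds: float = 1.0):
        self.limit = max_requests_per_window
        self.window = window_seconds
        self.counters = defaultdict(lambda: [0.0, 0])  # customer -> [window_start, count]

    def allow(self, customer_id: str) -> bool:
        now = time.monotonic()
        state = self.counters[customer_id]
        if now - state[0] >= self.window:
            state[0], state[1] = now, 0   # start a new window for this customer
        if state[1] >= self.limit:
            return False                   # this customer has exhausted its quota
        state[1] += 1
        return True

limiter = PerCustomerRateLimiter(max_requests_per_window=100)
if not limiter.allow("customer-123"):
    print("throttled")  # reject this customer's request; other customers are unaffected
```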
Per customer resource throttling
We saw that the number of accounts a customer has can have an impact on the cost per request for certain operations. Also, we often build systems where customers are billed per resource. It is good practice to restrict the rate at which customers can create new resources, both to protect the load in your system and to protect the customer against themselves.
Loadshedding
The point of a loadshedder is to measure the concurrent load in your system and to start rejecting work when the load increases above a healthy threshold.
[Interesting point – the word threshold comes from the wooden barrier across the bottom of a doorway which used to be built to stand up from the ground, creating a barrier. In medieval times people used to throw thresh (or straw) on the ground in the house for insulation and protection against damp. The threshold had to prevent the straw from dribbling out the door.]
Fleet usage loadshedding
This type of loadshedding reserves a certain percentage of system capacity for specific traffic. The loadshedder will start rejecting unreserved traffic once it hits a usage threshold, even if the system is not degraded, in order to protect headroom for potential mission critical traffic. This is a nice pattern for protecting health check traffic between a load balancer and a host, for example.
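A minimal sketch of this kind of reserved-capacity shedder, with illustrative numbers:

```python
class ReservedCapacityShedder:
    """Sheds unreserved traffic early so headroom stays free for reserved traffic.

    Single-threaded sketch; a real implementation needs locking or atomic counters.
    """

    def __init__(self, total_capacity: int, reserved_fraction: float):
        self.total_capacity = total_capacity
        # Unreserved traffic may only occupy this many slots, even if the rest sit idle.
        self.unreserved_limit = int(total_capacity * (1.0 - reserved_fraction))
        self.in_flight = 0             # all admitted requests
        self.in_flight_unreserved = 0  # unreserved subset

    def admit(self, is_reserved: bool) -> bool:
        if self.in_flight >= self.total_capacity:
            return False  # system is full, shed everything
        if not is_reserved and self.in_flight_unreserved >= self.unreserved_limit:
            return False  # shed unreserved traffic early to protect the reserved headroom
        self.in_flight += 1
        if not is_reserved:
            self.in_flight_unreserved += 1
        return True

    def release(self, is_reserved: bool) -> None:
        self.in_flight -= 1
        if not is_reserved:
            self.in_flight_unreserved -= 1

# Usage: reserve 10% of 100 slots for health check traffic from the load balancer.
shedder = ReservedCapacityShedder(total_capacity=100, reserved_fraction=0.1)
```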
Prioritised loadshedding
If your system hits a load threshold, you have to start rejecting work to protect availability. Prioritised loadshedding allows you to decide which requests to reject based on their priority. In the example above you can see how we start favouring high priority traffic bands over low priority bands to reduce load in the system until it has recovered from overload. We then slowly start letting the lower priority traffic back in.
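A sketch of how the decision might look, assuming three priority bands and illustrative shedding thresholds:

```python
def should_shed(priority: int, load_fraction: float) -> bool:
    """Shed the lowest priority bands first as load climbs toward the threshold.

    priority: 0 = highest, larger = lower priority.
    load_fraction: current load as a fraction of the safe threshold (0.0 - 1.0+).
    The per-band thresholds below are illustrative.
    """
    shed_above = {0: 1.00,   # highest priority: only shed at or over the threshold
                  1: 0.90,   # medium priority: shed in the last 10% of headroom
                  2: 0.75}   # lowest priority: shed once load passes 75%
    return load_fraction >= shed_above.get(priority, 0.75)

# As the system recovers, load_fraction drops and lower bands are admitted again.
print(should_shed(priority=2, load_fraction=0.8))  # True: low priority is shed first
print(should_shed(priority=0, load_fraction=0.8))  # False: high priority still admitted
```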
Detecting Load
This is an interesting topic which actually deserves its own post, and is very much informed by the shape of your system. I can list a few mechanisms here.
Response Latency
In most systems there is a direct relationship between the load in the system and response latency. If the load in your system increases, your response latency will start degrading and you can start backing off. Response latency becomes a very useful proxy for load if the cost of work per request in your system is uniform.
In our banking application, using response latency as an indicator of degradation in the system would work for DescribeAccount, because you have an O(1) cost of work percentile distribution. It could also work for an O(n) percentile distribution if you have insight into n – this would require you to calculate a normalised latency cost, i.e. load measure = response latency / number of accounts. It would not work easily for DescribeLargestAccount(), because you can’t derive n by inspecting either the request or the response.
You can easily implement a loadshedder like this by adding an interceptor in your HTTP request handler chain which maintains a history of the last n response latencies seen. Every time you see a new response, you update your historic sliding window. This allows you to detect the velocity at which load is increasing or decreasing in the system, and adjust your loadshedding behaviour accordingly.
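A minimal sketch of such a latency-based loadshedder, with an illustrative threshold. The trend check is one simple way to capture the velocity of change.

```python
from collections import deque

class LatencyWindow:
    """Sliding window of the last n response latencies, used as a proxy for load."""

    def __init__(self, window_size: int, shed_threshold_ms: float):
        self.latencies = deque(maxlen=window_size)
        self.shed_threshold_ms = shed_threshold_ms  # illustrative threshold

    def record(self, latency_ms: float) -> None:
        self.latencies.append(latency_ms)

    def average(self) -> float:
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0

    def trend(self) -> float:
        """Positive when the newer half of the window is slower than the older half."""
        if len(self.latencies) < 4:
            return 0.0
        half = len(self.latencies) // 2
        older = list(self.latencies)[:half]
        newer = list(self.latencies)[half:]
        return sum(newer) / len(newer) - sum(older) / len(older)

    def should_shed(self) -> bool:
        # Shed when recent latency is over the threshold and still climbing.
        return self.average() > self.shed_threshold_ms and self.trend() > 0

window = LatencyWindow(window_size=100, shed_threshold_ms=250.0)
window.record(120.0)
window.record(310.0)
print(window.should_shed())
```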
Concurrent In-flight Requests
You can use in-flight requests as a proxy for load as well. This is fairly coarse, and works better for uniform cost operations where the number of in-flight requests is directly related to load in the system. If our service degrades, work stops exiting the system and the number of in-flight requests starts climbing. If you keep a sliding window history of in-flight requests, you can measure the velocity at which load is increasing or decreasing in your system. It is a poor measure of absolute load in the system though, and it will also not work well if you have high variance in cost per unit of work.
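A sketch of tracking in-flight request velocity with a sliding window of samples (the sampling approach and window size are illustrative):

```python
import time
from collections import deque

class InFlightTracker:
    """Samples the number of in-flight requests and estimates how fast it is changing."""

    def __init__(self, window_size: int = 50):
        self.samples = deque(maxlen=window_size)  # (timestamp, in_flight_count)

    def sample(self, in_flight: int) -> None:
        self.samples.append((time.monotonic(), in_flight))

    def velocity(self) -> float:
        """Approximate change in in-flight requests per second over the window."""
        if len(self.samples) < 2:
            return 0.0
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        return (c1 - c0) / (t1 - t0) if t1 > t0 else 0.0

# A sustained positive velocity suggests work is backing up and shedding should begin.
tracker = InFlightTracker()
```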
CPU/Memory Utilisation
Your service can use CPU or memory utilisation as a proxy for load in your system. It is easy to query your OS to get utilisation metrics to inform your loadshedding behaviour.
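For example, using the third-party psutil library (the thresholds are illustrative):

```python
import psutil  # third-party: pip install psutil

def system_under_pressure(cpu_threshold: float = 85.0, mem_threshold: float = 90.0) -> bool:
    """Returns True when CPU or memory utilisation crosses an (illustrative) threshold."""
    cpu = psutil.cpu_percent(interval=0.1)    # % CPU over a short sampling interval
    mem = psutil.virtual_memory().percent     # % of physical memory in use
    return cpu > cpu_threshold or mem > mem_threshold

# A loadshedder could call this periodically and start rejecting work when it returns True.
```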
Back Pressure
I really like this one for systems which are composed of multiple components or layers – engines, databases etc. – each with its own load threshold and failure modes.
You want to place your loadshedder as high up in your system as possible, because the deeper into your system work penetrates, the more load on your system you add.
Sometimes, however, it is difficult to anticipate the cost of units of work by inspecting requests and responses at the edge. A good example is the failure mode I discussed in my initial post. The cost of work was determined deep within the system, depending on whether we needed to do a write to the database or whether we treated the write as a no-op. There is no way for an HTTP interceptor to inspect a request and determine the cost of doing the work in that request. Two identical requests could trigger very different costs in the system, depending on the existing state in the database.
A common pattern for HTTP services looks like the diagram below. The entry point to the service is an HTTP port receiving requests. In this example we have two types of operations, each with its own request handler.
We want to implement a Loadshedding Interceptor right at the outside border of the service, taking a look at each inbound request and each outbound response. This guy needs to reason about the load in the service to decide which requests to reject or allow.
The problem here is that different components of this service have different load thresholds and failure modes. The loadshedder does not necessarily have all the information to reason about load in the system, because it is far away from the blue and green layers.
There are a few ways we can try to solve this.
Approach 1
Defer the loadshedding decision to the database integration layer. Let the database integration layer have its own token bucket to allow access to a db connection in the pool.
There are a few problems with this.
Firstly, one request handler can make multiple independent calls down into the database integration layer. If even one of these operations gets throttled by the database integration layer, it will retry – the request latency goes up and the request stays in the system for longer. Work takes longer to exit the system and we have introduced a bottleneck. We want to loadshed full requests, not sub-components of requests.
Secondly, the database integration layer is request agnostic and sits fairly far down in the system. Work has to penetrate deep into the system before we make a loadshedding decision. This means a lot of wasted work and a lot of occupied threads that could have been dedicated to more relevant work.
Approach 2
Let each component bubble up metadata about its load upstream to the loadshedder.
There are two approaches here – add load metadata to the response (not ideal, since we are now mixing system internals with business logic in the contract between components), or add extra operations to components so that the loadshedder can query component health continuously.
It can work, but it is clunky and adds bulk to the system.
Approach 3
Each component measures its own internal load and throws an error when it isn’t able to accommodate the cost of an operation. These errors bubble up to the loadshedder which can see that the system failed to complete work. It doesn’t need to know why the system failed to complete the work, just that it wasn’t successful. The loadshedder maintains a sliding window of the past n responses, grouping them by success and error. It can calculate the velocity of change in the ratio of successes vs failures and use this as a proxy for reasoning about the availability of the system.
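A minimal sketch of this kind of error-ratio loadshedder. Here the shedding decision uses the ratio itself; you could extend it to compare the older and newer halves of the window to capture the velocity of change, as described above. The window size and threshold are illustrative.

```python
from collections import deque

class ErrorRatioShedder:
    """Sliding window of recent outcomes; sheds when the failure ratio climbs too high."""

    def __init__(self, window_size: int = 200, max_error_ratio: float = 0.2):
        self.outcomes = deque(maxlen=window_size)  # True = success, False = error
        self.max_error_ratio = max_error_ratio     # illustrative threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def error_ratio(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_shed(self) -> bool:
        return self.error_ratio() > self.max_error_ratio

shedder = ErrorRatioShedder()
shedder.record(True)
shedder.record(False)
print(shedder.should_shed())
```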
This pattern extends nicely to integrate with other definitions of load. You can easily build a loadshedder that creates a composite definition of load by inspecting CPU and memory utilisation, the velocity of change in error rates, and response latency. It sits in the right position to reject or accept requests into the system. It is also in the right place to inspect a request and classify its priority, allowing us to give precedence to more important requests.
Health Monitoring System
You can build an external monitoring system that observes service metrics like CPU and memory utilisation, response latency and error rate. This out of band system then calculates a composite load score that a loadshedder can use to make loadshedding decisions.
The benefit of this approach is that load monitoring and compute happens out of band. Even if your system becomes unavailable, this guy can still be online.
The problem with this approach is that an external system does not have the detailed insight into component load that each internal component has. It also comes at the cost of building and maintaining another service.