Over the life of your production system, it will experience two types of forces. The first type of force is known as an impulse. Impulses are forces that act for a short duration of time. For example, I could publish a popular article on my blog and see a sudden rise in the number of visitors. That sudden rise will eventually subside; impulses tend to return to normal after a short period of time. Is your system prepared to withstand such impulses? Do you perform impulse tests before deploying your software to production?
Another type of force is known as stress. Stress is a strain on your system that builds gradually over time. A great example is the growth of a database. In the beginning, you start with an empty database and queries that perform at lightning speed. A year later, the database holds hundreds, thousands, or, if you are lucky, millions of gigabytes of data. As the size of your database grows, so does the latency of your queries, because the tables they scan keep getting bigger. Bigger tables usually mean more users, which is a good problem to have. Stress on your production system will manifest itself as slower responses or, at worst, an unresponsive system. Is your system designed to handle this type of stress?
These two types of forces will cause your production system to experience failures due to instability, overcapacity, or both. Either type of failure can bring down an otherwise bug-free system. Unfortunately, what you will find is that the two often work together to cripple your production system.
There are many things that can cause your system to become unstable. The good news is that there are design patterns you can implement to mitigate such instabilities or isolate them from spreading throughout your system. I will go over two problems that most developers have run into at some point and that can make a system unstable.
Integration points are any interface between your code and another system. This can be as common as performing an RPC (remote procedure call) against another system to carry out a transaction, or calling into a library that you have no control over. If the other party you are interacting with can be treated as a black box, it is an integration point. Integration points will fail sooner or later. They can become slow to respond, unavailable, or, worst of all, unresponsive. An unresponsive call is the worst culprit because it disrupts your normal workflow: your code expects a return value from a method call. What do you do when the method hangs and doesn't return at all?
You can implement design patterns that detect and isolate such failures. The first form of protection is a timeout. Whenever you call an integration point, a timeout guarantees that the call will return, which gives you the opportunity to decide how to handle the failure. The second form of protection is to guard against repeated calls to a failed integration point. Our coding instinct tells us to retry the call until it succeeds, but this kind of retry only pushes a struggling integration point closer to collapse. Integration points often fail because they cannot process requests fast enough; they simply do not have the capacity. They can also be victims of a cascading failure in another system they depend on. The best chance for a failing system to recover is to stop retrying against it, giving the integration point room to return to a normal state. A circuit breaker design pattern prevents these retries while still allowing a failed integration point to recover.
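Here is a minimal sketch in Python of both protections working together. The service URL, the two-second timeout, and the threshold values are assumptions for illustration, not recommendations; production-grade circuit breakers (often provided by libraries) track more state than this.

```python
import time

import requests  # assuming an HTTP integration point; any RPC client works the same way


class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing integration point
    and give it time to recover before letting traffic through again."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before the circuit opens
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.failure_count = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        # While the circuit is open, fail fast instead of hammering the remote system.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: integration point is failing")
            self.opened_at = None  # reset window elapsed: allow one trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open the circuit
            raise
        else:
            self.failure_count = 0                  # success closes the circuit again
            return result


# Hypothetical integration point: the URL and the 2-second timeout are assumptions.
def fetch_exchange_rates():
    # The timeout guarantees this call returns instead of hanging forever.
    response = requests.get("https://rates.example.com/latest", timeout=2.0)
    response.raise_for_status()
    return response.json()


breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0)


def get_rates():
    return breaker.call(fetch_exchange_rates)
```

While the circuit is open, callers fail immediately instead of piling more load onto the failing service; once the reset window passes, a single trial call is allowed through to probe whether the integration point has recovered.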
When you perform a database query, do you store all the results in memory and then process them afterward? If this sounds like your system, then you will become a victim of an unbounded result set failure. With unbounded result sets, the faster a database table grows, the bigger the result sets become. Over time, a result set can grow beyond the point where it no longer fits within the memory allocated to your production process. Even before that breaking point, your system will already be experiencing slower responses due to memory pressure. Memory pressure forces the garbage collector to work harder and burn more CPU cycles. Eventually, your system can become unresponsive because most of the CPU is spent on a garbage collector working overtime.
Unbounded result sets can be avoided either by putting a limit on the number of results that can be retrieved or by treating the results as a stream of data that is processed as it is fetched from the database, avoiding the need to hold it all in memory.
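Below is a small sketch of both approaches using Python's built-in sqlite3 module as a stand-in for whatever database driver you use; the table and column names are assumptions for illustration.

```python
import sqlite3  # stand-in for any DB-API driver; "orders" and its columns are assumptions

conn = sqlite3.connect("app.db")


# Option 1: put a hard limit on the number of rows the query can return.
def recent_orders(limit=500):
    cur = conn.execute(
        "SELECT id, total FROM orders ORDER BY created_at DESC LIMIT ?", (limit,)
    )
    return cur.fetchall()  # at most `limit` rows ever sit in memory


# Option 2: stream the results, processing rows as they are fetched
# instead of materializing the entire result set.
def total_revenue():
    cur = conn.execute("SELECT total FROM orders")
    running_total = 0
    while True:
        batch = cur.fetchmany(1000)  # pull a bounded batch at a time
        if not batch:
            break
        running_total += sum(total for (total,) in batch)
    return running_total
```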
The capacity of your production system determines how many transactions it can perform in a given amount of time. Many junior developers give little thought to the limits of the systems they are working on. The only way to truly measure the capacity of your system is to perform a capacity test that simulates the kind of increased stress that can jeopardize your system's integrity.
How many simultaneous users can your production system handle without degrading its response time? Take your favorite website as an example: how slowly can it respond before you become frustrated? That website must respond as fast as possible while also serving as many simultaneous users as possible. There will always be a point where adding users hurts responsiveness for everyone, because every system has finite resources and each additional user brings you closer to that limit.
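As a rough illustration, the sketch below ramps up the number of simultaneous requests against a hypothetical endpoint and reports how latency changes; the URL and user counts are placeholders. Dedicated load-testing tools do this far more realistically, but the idea is the same.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # the endpoint below is a placeholder for your own system


def timed_request(url):
    # Measure how long a single request takes to complete.
    start = time.monotonic()
    requests.get(url, timeout=10)
    return time.monotonic() - start


def measure_latency(url, simultaneous_users):
    # Fire one request per simulated user at the same time and collect latencies.
    with ThreadPoolExecutor(max_workers=simultaneous_users) as pool:
        latencies = list(pool.map(timed_request, [url] * simultaneous_users))
    return statistics.median(latencies), max(latencies)


# Ramp up the load and watch for the point where response times start to degrade.
for users in (10, 50, 100, 200):
    median, worst = measure_latency("https://app.example.com/health", users)
    print(f"{users} users -> median {median:.2f}s, worst {worst:.2f}s")
```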
Capacity issues arise from the overuse of limited resources. A common example of a limited resource is the database. Like any system, a database degrades in performance when many users or connections are simultaneously running queries against the same tables. Slow database queries usually come from one of two causes. The first is finite physical resources: when multiple queries run at once, response time is determined by how much CPU and memory the database has available, and I/O performance matters greatly because the database constantly reads and writes data to secondary storage such as a hard drive or SSD. The second is locking: when multiple queries hit the same table, the database has to lock it to maintain the integrity of the data. Locking is a form of blocking that slows a system down because it prevents multiple threads of execution from running independently and in parallel. There are a few things that you can do to improve the capacity of your system.
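One of the simplest is to bound how much concurrent work you send to the database, so extra callers queue up in your application instead of adding to the contention. The sketch below uses a semaphore and the standard library's sqlite3 module as a stand-in for a real connection pool; the limit of ten in-flight queries is an assumption, not a recommendation.

```python
import sqlite3  # stand-in for whatever database driver you actually use
import threading

# Allow at most 10 queries to be in flight at once; additional callers wait
# their turn instead of piling more load onto an already busy database.
DB_SLOTS = threading.BoundedSemaphore(10)


def run_query(sql, params=()):
    with DB_SLOTS:  # blocks while 10 queries are already running
        conn = sqlite3.connect("app.db")
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            conn.close()
```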
Implementing the right caching strategy can greatly improve the responsiveness and capacity of your production system. The purpose of a cache is to avoid recomputing a result on every request. That computation is slow and is typically identified as a bottleneck in your transaction workflow, often because it requires access to a shared resource. Caching either reduces how often the shared resource is accessed or reduces the time needed to access it. In both cases, performance improves, and that improvement translates into higher capacity for your production system.
The use of caches will improve your system's performance, but it can also do the opposite. Be extremely careful when implementing caches that are unbounded. Always make it a practice to ensure the cache has a limited size and does not grow unchecked. Just like the unbounded result sets I mentioned in regard to stability, an oversized cache can jeopardize the stability of your system, especially when the cache uses memory to store its lookup keys or data. In-memory caches are the fastest, and you are encouraged to use them whenever possible, but you should always know how large the in-memory portion of the cache can grow. Cache replacement strategies such as LRU (least recently used) and MRU (most recently used) exist to control this growth. Having a cache replacement strategy is the first important step in keeping your cache from contributing to the failure of your production system.
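In Python, for example, a bounded LRU cache for an expensive lookup can be as simple as the sketch below; the function names and the 10,000-entry limit are assumptions for illustration.

```python
from functools import lru_cache


def expensive_product_query(product_id):
    # Placeholder for the slow database query or computation being cached.
    return {"id": product_id}


# maxsize bounds the cache: once 10,000 entries are stored, the least recently
# used entry is evicted, so the cache can never grow without limit.
@lru_cache(maxsize=10_000)
def product_summary(product_id):
    return expensive_product_query(product_id)
```

Calling `product_summary.cache_info()` reports hits, misses, and the current size, which makes it easy to verify that the bound is doing its job.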
After you iron out the bugs in your system, the longevity of your production system depends heavily on how it handles itself in times of instability and overcapacity. In this article, I have only touched on some of the things that can go wrong during the lifetime of a production system. The first step toward building a solid production system that can withstand the test of time is to become aware of the natural forces that work in concert to bring it down.