Are your scale tests answering the right questions?

One of the key requirements for running a cloud service is a solid understanding of its scalability. Scalability is usually defined as a service’s ability to handle increased demand, and it is assessed by running scale tests. A scale test usually involves driving a service, or a collection of services, with synthetic traffic that mimics production load. It consists of a series of test runs in which the synthetic traffic is slowly ramped up (more concurrent users, larger data or request sizes, or both) while the service’s behavior under load is observed.

In my experience reviewing scale test results, I have consistently seen teams focus on how their service behaves under varying traffic patterns while the topology of the service is kept constant. In other words, the number of instances of the service in the cluster and the hardware resources available to each instance (CPU, memory, disk) are unchanged. Failing to observe how the service behaves when the topology varies is a missed opportunity to harden the system against a surge in usage.

If a scale test answers the following 3 key questions, you will have a clearer grasp of when your service is going to break under load, and a clear plan for handling a spike before it does.

  1. Does your service scale linearly with capacity? (Horizontal scalability) – For example, if you add a node to your cluster, do you see a proportional increase in the concurrent users or requests/second handled by your service?
  2. How long does it take to add new capacity in production? (Elasticity) – For example, depending on your service architecture and deployment process, it might take minutes, hours, or days to add new nodes.
  3. Can you predict when you need to add new capacity? (Monitoring & Alerting) – Given the time required to add capacity, you have to work backward to figure out when that work needs to start. Knowing this lets you set up the right monitors and alerts to catch increased usage early enough.
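
To make questions 1 and 3 concrete, here is a minimal Python sketch of the arithmetic behind them. Everything in it is an assumption made for illustration: the names `ScaleRun`, `scaling_efficiency`, and `alert_threshold_rps` are hypothetical, the sample numbers are invented rather than measured, traffic growth is assumed to be roughly linear over the provisioning window, and the safety factor of 1.5 is an arbitrary default. The first function compares per-node throughput across test runs with different node counts. The second works backward from the time it takes to add capacity to the traffic level at which an alert should fire.

```python
from dataclasses import dataclass


@dataclass
class ScaleRun:
    """One scale test run at a fixed topology (hypothetical record shape)."""
    nodes: int      # instances in the cluster during the run
    max_rps: float  # highest requests/second sustained within the latency/error target


def scaling_efficiency(runs):
    """Return (nodes, efficiency) pairs relative to the smallest run.

    An efficiency of 1.0 means throughput grew linearly with node count.
    Lower values mean each node delivers less than a node did in the
    baseline run.
    """
    ordered = sorted(runs, key=lambda r: r.nodes)
    base = ordered[0]
    per_node_base = base.max_rps / base.nodes
    return [(r.nodes, (r.max_rps / r.nodes) / per_node_base) for r in ordered]


def alert_threshold_rps(capacity_rps, growth_rps_per_hour,
                        provisioning_hours, safety_factor=1.5):
    """Traffic level at which work on new capacity has to start.

    Assuming roughly linear growth, traffic rises by
    growth_rps_per_hour * provisioning_hours while new nodes are being
    added, so the alert must fire at least that far below capacity.
    The safety factor (an assumed default) widens that margin.
    """
    headroom = growth_rps_per_hour * provisioning_hours * safety_factor
    return max(0.0, capacity_rps - headroom)


if __name__ == "__main__":
    # Invented sample data: three runs at 2, 4, and 8 nodes.
    runs = [ScaleRun(2, 1000.0), ScaleRun(4, 1900.0), ScaleRun(8, 3200.0)]
    for nodes, eff in scaling_efficiency(runs):
        print(f"{nodes} nodes: {eff:.0%} of linear")

    # Capacity of 3200 rps, growth of 100 rps/hour, 4 hours to add nodes.
    print(alert_threshold_rps(3200.0, 100.0, 4.0))
```

With these made-up numbers, per-node throughput at 8 nodes is 80% of what it was at 2 nodes, so the service is not scaling linearly, and the alert would need to fire at 2600 requests/second to leave enough time to add nodes. The provisioning time fed into the second function is the answer to question 2, which is why it needs to be measured rather than guessed.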