Availability vs Reliability

In the world of cloud services, Availability is a key metric that your customers track to ensure they are getting what they are paying for. But Availability (sometimes also referred to as uptime) is an IT service metric that usually accounts only for the time when the service was completely inaccessible to its users. It doesn’t translate well to a cloud-service-based architecture, where the chances of your service being completely down are far lower than the chances of it not behaving properly. Typically, when a team measures the Availability of its service, it uses a combination of synthetic tests and monitors that run on a schedule and count the minutes when the service was unavailable. Once you have this number, you can calculate the availability for the specified time window using the formula:

Availability = ((total minutes - outage minutes) / (total minutes)) × 100%
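
As a rough illustration, the calculation above can be sketched in a few lines of Python. The function name `availability_percent`, the 30-day window and the 43 minutes of outage are all assumptions made up for this example, not figures from a real service.

```python
def availability_percent(total_minutes, outage_minutes):
    """Availability over a window, as a percentage.

    Hypothetical helper: both arguments are minute counts for the same window.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= outage_minutes <= total_minutes:
        raise ValueError("outage_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - outage_minutes) / total_minutes


# Assumed example: a 30-day window with 43 minutes of detected outage.
window_minutes = 30 * 24 * 60  # 43,200 minutes
print(f"{availability_percent(window_minutes, 43):.3f}%")  # prints 99.900%
```

Note that the only input is the number of minutes the monitors flagged as an outage, so any failure that happens while the service is still reachable never shows up in this number.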

The problem with this metric is that it doesn’t accurately capture what your users are experiencing in terms of a usable product. For example, if you provide a web service where search is a key scenario and 20% of a particular user’s searches are failing, you have a highly available service, but it is still a bad experience for that user.

To close this gap, we have usually required all of our team’s services to define and measure another metric called Service Reliability. Reliability as a metric is very common in the world of electronics, where it is measured as Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), but it is not usually used in the world of cloud services. A simple definition of Service Reliability is the ratio of the number of times your service behaves correctly to the total number of times it was invoked to serve its users. If you provide a RESTful API-based service, you can use a simple formula such as:

Reliability = ((total number of successes (HTTP status 2xx)) / (total number of invocations)) × 100%
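
A minimal Python sketch of this formula might look like the following. It assumes you already have the HTTP status code of every invocation in the window, for example from access logs. The function name `reliability_percent` and the sample status codes are invented for illustration.

```python
def reliability_percent(status_codes):
    """Percentage of invocations that returned an HTTP 2xx status.

    Hypothetical helper: status_codes is a list of integer HTTP status codes,
    one per invocation in the measurement window.
    """
    total = len(status_codes)
    if total == 0:
        raise ValueError("no invocations in the window")
    successes = sum(1 for code in status_codes if 200 <= code <= 299)
    return 100.0 * successes / total


# Assumed sample: 10 invocations, 7 of which returned a 2xx status.
sample = [200, 200, 201, 500, 204, 503, 200, 200, 404, 200]
print(f"{reliability_percent(sample):.1f}%")  # prints 70.0%
```

Counting only 2xx responses as success is the simplest reading of the formula. Whether other responses, such as redirects or client errors caused by bad input, should count against the service is a choice each team would need to make for itself.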

You could add further attributes or dimensions to this formula to track things such as scenario (for example, ‘search’ or ‘checkout’) and customer (a unique customer ID). With such a metric, you can easily identify the overall quality of service provided by your application and drill into hotspots with respect to scenarios and customers. Another advantage of measuring and ensuring high Service Reliability is that you are also ensuring a highly available service as defined by the Availability metric.
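
To make the drill-down idea concrete, here is a small Python sketch that breaks Reliability down by a chosen dimension. The record layout (dictionaries with `scenario`, `customer` and `status` keys), the function name `reliability_by` and the sample data are all assumptions for this example. A real service would more likely pull these fields from its request logs or telemetry pipeline.

```python
from collections import defaultdict


def reliability_by(records, dimension):
    """Reliability percentage per distinct value of `dimension`.

    Hypothetical helper: each record is a dict holding the dimension value
    (for example "scenario" or "customer") and an integer HTTP "status".
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if 200 <= record["status"] <= 299:
            successes[key] += 1
    return {key: 100.0 * successes[key] / totals[key] for key in totals}


# Assumed sample invocations, tagged with scenario and customer ID.
records = [
    {"scenario": "search", "customer": "c1", "status": 200},
    {"scenario": "search", "customer": "c1", "status": 500},
    {"scenario": "search", "customer": "c2", "status": 200},
    {"scenario": "checkout", "customer": "c1", "status": 201},
    {"scenario": "checkout", "customer": "c2", "status": 200},
]

for dimension in ("scenario", "customer"):
    for key, value in sorted(reliability_by(records, dimension).items()):
        print(f"{dimension}={key}: {value:.1f}%")
```

In this made-up data the overall Reliability is 80%, but the breakdown shows that every failure comes from the ‘search’ scenario and from customer c1, which is the kind of hotspot the extra dimensions are meant to expose.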