Cloud Services

To Rewrite or Not To Rewrite

Back in 2017, when I was looking around for my next challenge, I wrote this article while preparing for an Amazon interview. I recently came across it again and thought it was a good story to share about the early days of the Power BI SaaS offering. The interview prompt was:

Most decisions are made with analysis, but some are judgment calls not susceptible to analysis due to time or information constraints.  Please write about a judgment call you’ve made recently that couldn’t be analyzed.  It can be a big or small one, but should focus on a business issue.  What was the situation, the alternatives you considered and evaluated, and your decision-making process?  Be sure to explain why you chose the alternative you did relative to others considered.

An existential question most engineering leaders face at least once in their career is whether to reuse or rewrite a core piece of their technology when changing trends impact their business. As a Group Engineering Manager on Microsoft's Power BI team, I recently had to make such a decision. The choice is usually a hard one since you rarely have the data required to make an analytical decision; it is more a judgment call based on past experience and future trends on which you need to take a bet.

Before I describe the specific decision, it would be helpful to provide some context on where we were as a business. Power BI's core value proposition is to provide self-service BI tools targeted at business users. Towards this goal, we had released Power BI 1.0 as an add-on for Office 365 users which leveraged the analysis and reporting capabilities of Excel on the web. As part of that effort, the visualization stack used for the charting capabilities was based on an existing implementation from Excel. For the web application, our team had transformed it into a server-hosted rendering component with client-side support for drawing the charts. The team built a new concept called Visual Representation Markup for communicating the shape (or geometry) of charts from the server to the client. Nine months into the project, the MVP of the visualization stack was released along with Power BI 1.0. At that time the team was aware of the high technical debt they had accrued due to the complex nature of the stack. The primary area of concern was performance: the service was chatty, and each user interaction required several round trips from the client to the service. The team had drawn up a plan to address these performance issues and implement other optimizations.

After the initial release of Power BI 1.0, the senior leadership team, challenged by slow user adoption, decided to pivot Power BI from an Excel-based workload to a more generic SaaS service. The service was focused on allowing business users to connect to a variety of data sources and gain insights from interactive charts. As part of this pivot, I was promoted to Group Engineering Manager for the SaaS service and inherited the visualization stack. Given the intense pressure to release the new service, we had to decide on an execution strategy for the visualization stack. Our choice at that point was either –

  • double down on the existing server-based rendering stack and improve the performance of the service, or 
  • invest in rewriting the visualization stack on a client-based technology, leveraging the best-of-breed charting solutions available in the market like D3.js, C3.js, Highcharts, etc.

One of the main reasons for the server-based stack was the desire to use a single implementation across desktop and web. The goal was to provide a consistent experience for our users. Given the team's and my previous experience with web technologies, and the requirements around experience, we trusted our instincts that a rewrite using the latest HTML5/JavaScript technologies would make the team more agile and help meet our goals on both user requirements and engineering cost. After an assessment of the available frameworks, we chose D3.js, which had industry-wide adoption and strong community support. The technical challenge was to wrap this framework within our visualization stack and keep the experience consistent. We didn't think a continued investment in the server-based stack would have resulted in a better outcome. The reasoning was based on two fronts –

  • User Experience – a client-based implementation would provide the best experience in terms of performance and fluidity. Making a chatty server-based implementation perform better would have been prohibitively expensive. 
  • Engineering Cost – leveraging a community-built solution would lower our engineering cost, since we could devote our focus entirely to making the integration solid and invest more in the user experience.

Armed with a few early prototypes demonstrating the proposed new architecture, we undertook the task of convincing senior leadership that this was the right call. Microsoft has long struggled with the "not-invented-here" syndrome, where distrust of community solutions caused it to invest in building everything internally. But through persistence and a clear presentation of the pros and cons, we convinced leadership to give us a 3-month window to deliver an MVP to replace the existing stack. Based on an aggressive execution plan, we delivered a performant and delightful user experience with the new stack on time. With the engineering agility afforded by the new stack, we were able to react quickly to user feedback and make our offering one of the most compelling BI solutions in the market, with close to 6 million subscribers in the first year after launch.

Cloud Services

Mind the gap

Microservices have become the de facto standard for building cloud native applications over the past decade, and many enterprises have large internal projects to decompose their monolithic applications into collections of microservices to realize the benefits of this service-oriented architecture.

As the move to a microservice world accelerates, I have been noticing an aspect of software engineering that needs to be revisited. For any large project to be successful and sustainable, you have to define a solid testing strategy. This involves building several types of tests, usually classified as unit, integration and end-to-end (e2e) tests. QE teams follow a model called the "Testing Pyramid" to guide this investment; it suggests you invest far more in unit tests since they exercise the business logic without the noise of external dependencies. Martin Fowler's article is an excellent read on this topic.

But in the world of microservices, most of the business logic isn't centralized inside a particular microservice; it is usually a complex orchestration across different services. In such a world, having a larger corpus of unit tests than integration tests doesn't provide the protection you need to run your cloud application. The focus has to shift to a larger battery of integration tests that cover the intricate dependencies between the microservices and increase the quality of your releases. The result looks more like a "Testing Diamond".
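
To make this concrete, here is a minimal sketch of what such a cross-service integration test could look like. The catalog and order services, their endpoints and payloads are hypothetical, and it assumes a Jest-style test runner (describe/it/expect) plus the standard fetch API.

  // Sketch of an integration test that exercises an orchestration across two
  // hypothetical microservices instead of a single unit of business logic.
  const CATALOG_URL = "http://localhost:8081"; // hypothetical catalog service
  const ORDERS_URL = "http://localhost:8082";  // hypothetical order service

  describe("placing an order spans the catalog and order services", () => {
    it("creates an order for an item the catalog reports as in stock", async () => {
      // Step 1: ask the catalog service for an in-stock item.
      const catalogRes = await fetch(`${CATALOG_URL}/items?inStock=true`);
      expect(catalogRes.status).toBe(200);
      const [item] = await catalogRes.json();

      // Step 2: place an order for that item through the order service.
      const orderRes = await fetch(`${ORDERS_URL}/orders`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ itemId: item.id, quantity: 1 }),
      });
      expect(orderRes.status).toBe(201);

      // Step 3: verify the cross-service side effect: the catalog stock was decremented.
      const afterRes = await fetch(`${CATALOG_URL}/items/${item.id}`);
      const after = await afterRes.json();
      expect(after.stock).toBe(item.stock - 1);
    });
  });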

Cloud Services

Availability vs Reliability

In the world of cloud services, Availability is a key metric measured by your customers to ensure they are getting what they paid for. But Availability (sometimes also referred to as uptime) is an IT service metric which usually measures the time when the service was completely inaccessible to its users. It doesn't translate well to a cloud service architecture, where the chance of your service being completely down is far smaller than the chance of it misbehaving. Typically, when a team measures the Availability of their service, they use a combination of synthetic tests and monitors which run on a schedule and count the minutes when the service was unavailable. Once you have this number, you can calculate availability for the specified time window using the formula –

Availability % = (total minutes - outage minutes) / (total minutes) × 100

The problem with this metric is that it doesn't accurately capture what your users are experiencing in terms of a usable product. For example, if you provide a web service where search is a key scenario and 20% of a particular user's searches are failing, you have a highly available service, but it is still a bad experience for that user.
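
To put numbers on it: over a 30-day month there are 43,200 minutes. If the service was never fully down, Availability is (43,200 - 0) / 43,200 = 100%, yet if that user ran 1,000 searches and 200 of them failed, their success rate for the scenario they care about was only 80%.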

To combat this gap, we have usually required all of our team's services to define and measure another metric called Service Reliability. Reliability is a very common metric in the world of electronics, where it is captured through Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), but it is rarely used in the world of cloud services. A simple definition of Service Reliability is the ratio of the number of times your service behaved correctly to the number of times it was invoked by its users. If you provide a RESTful API based service, you can use a simple formula such as –

Reliability % = (number of successful invocations (HTTP status 2xx)) / (total number of invocations) × 100

You could add further attributes/dimensions to this formula to track things such as scenario (for example, 'search' or 'checkout') or customer (a unique customer id). With such a metric you can easily gauge the overall quality of service provided by your application and drill into hotspots by scenario and customer. Another advantage of measuring and maintaining high Service Reliability is that you are also ensuring you provide a highly available service as defined by the Availability metric.
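
As an illustration, here is a minimal sketch of computing Reliability from request records, including a per-scenario breakdown. The record shape (scenario, customerId, status) is a hypothetical example, not the schema of any particular service.

  // Sketch: compute overall and per-scenario Reliability from request records.
  interface RequestRecord {
    scenario: string;    // e.g. "search" or "checkout"
    customerId: string;  // unique customer id
    status: number;      // HTTP status code returned to the caller
  }

  const isSuccess = (status: number): boolean => status >= 200 && status < 300;

  function reliability(records: RequestRecord[]): number {
    if (records.length === 0) return 100;
    const successes = records.filter(r => isSuccess(r.status)).length;
    return (successes / records.length) * 100;
  }

  // Drill into hotspots: Reliability per scenario.
  function reliabilityByScenario(records: RequestRecord[]): Map<string, number> {
    const grouped = new Map<string, RequestRecord[]>();
    for (const r of records) {
      const bucket = grouped.get(r.scenario) ?? [];
      bucket.push(r);
      grouped.set(r.scenario, bucket);
    }
    const result = new Map<string, number>();
    for (const [scenario, rs] of grouped) {
      result.set(scenario, reliability(rs));
    }
    return result;
  }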

Cloud Services

Are your scale tests answering the right things?

One of the key requirements for running a cloud service is to have a solid understanding of your service's scalability. Scalability is usually defined as a service's ability to handle increased demand, and it is assessed by running scale tests. A scale test typically drives the service, or a collection of services, with synthetic traffic that mimics production loads. It involves a series of test runs where the synthetic traffic is slowly ramped up (concurrent users, and/or size of the data/request) while observing the service's behavior under load.

In my experience of reviewing scale test results, I have consistently noticed teams focus on the behavior of their service under varying traffic patterns while the topology of the service is kept constant. In other words, the number of instances of the service in the cluster and the hardware resources allocated to it (CPU/memory/disk) are unchanged. Failing to observe the behavior of the service when you vary the topology is a missed opportunity to further bolster the system against a surge in usage.

By answering these three key questions as an outcome of a scale test, you ensure you have a clear grasp of when your service will break under load and a clear plan for adding capacity before it does.

  1. Does your service scale linearly with capacity? (Horizontal scalability) – For example, if you add an additional node to your cluster, do you see linear growth in the concurrent users or requests/second your service can handle? (A minimal sketch of this check follows the list.)
  2. How long does it take to add new capacity in production? (Elasticity) – For example, depending on your service architecture and deployment, it might take minutes, hours or days to add new nodes. 
  3. Can you predict when you need to add new capacity? (Monitoring & Alerting) – Given the time it takes to add capacity, you will have to work backward to figure out when you need to start that work. This ensures you have the right monitoring and alerting set up to address increased usage.
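
Here is a minimal sketch of the linear-scaling check from question 1. The numbers are hypothetical and would come from your own scale test harness; throughput per node that drops well below the single-node baseline is a sign the service is not scaling linearly.

  // Sketch: compare throughput per node across scale test runs with different topologies.
  interface ScaleRun {
    nodes: number;              // instances of the service in the cluster
    peakRequestsPerSec: number; // highest sustained throughput before errors/latency spiked
  }

  function linearScalingReport(runs: ScaleRun[]): void {
    if (runs.length === 0) return;
    const baseline = runs[0].peakRequestsPerSec / runs[0].nodes;
    for (const run of runs) {
      const perNode = run.peakRequestsPerSec / run.nodes;
      const efficiency = (perNode / baseline) * 100;
      console.log(
        `${run.nodes} node(s): ${run.peakRequestsPerSec} req/s ` +
        `(${perNode.toFixed(1)} per node, ${efficiency.toFixed(0)}% of baseline)`
      );
    }
  }

  // Hypothetical results: 4 nodes delivering well under 4x the single-node throughput.
  linearScalingReport([
    { nodes: 1, peakRequestsPerSec: 900 },
    { nodes: 2, peakRequestsPerSec: 1750 },
    { nodes: 4, peakRequestsPerSec: 3100 },
  ]);
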
Programming

Running a SaaS/Web app for zero cost

Recently I helped a friend with a web application that she wanted implemented very quickly while spending as little as possible to keep it running (she was on a shoestring budget). I looked around at the various PaaS/IaaS/SaaS building blocks she could use, and most of them provided a free option. There are two types of free options in the market – time based vs quota based. Given my friend wanted to see if her idea had any legs, a quota based option was the best fit for her (having to pay because your usage went up is a good problem to have). So the solution was built using the following technologies/services –

  • Technology – Modern web application implemented using the latest JavaScript framework (AngularJS).
  • CI/CD – We used GitHub for the code repository, bug/work item tracking and milestone planning. For continuous integration, we leveraged the Travis CI (travis-ci.org) integration with GitHub.
  • Data storage – We used Firebase (firebase.google.com) since it provided the best JSON document storage, which also had the added benefit of –
    • integration with the latest web frameworks (AngularJS)
    • realtime change notifications (this feature is amazingly good; a small sketch at the end of this post shows it in action)
    • a free plan that provides up to 1GB of storage.
  • Image/File storage – We used Cloudinary (cloudinary.com) for storing images; it provides one of the best in class solutions for image storage, manipulation and caching. For regular file storage, Firebase provides 5GB in its free plan.
  • Hosting – We used Firebase here again since it provides custom domains in its free plan. Being part of Google, its hosting is backed by Google's CDN.
  • Telemetry – We used Localytics (localytics.com) for logging telemetry and for measuring and tracking usage of the application. They provide a free subscription for up to 10,000 MAU.
  • Backend Services – Any web application will require some server side processing (mainly for security reasons). Azure (portal.azure.com) provides a free plan for running basic web services in a shared environment. Azure Functions is also a great alternative for rapid deployment of backend functionality. Both integrate really well with GitHub repositories, which enables reliable CI/CD.
  • Error Tracking – Any web application needs a way of tracking errors faced by customers in the browser. There are several options available in this space, but we ended up using Rollbar (rollbar.com) since it was one of the best, with good support for tracking deployments and deduping errors across deployments. Also, the free plan allows you to push up to 5,000 events per month (a good incentive to keep errors low in your application).

So, with the above laundry list of areas covered, we created a web application that costs zero dollars to keep running. Pretty sweet deal!
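
Since the realtime change notifications were the standout feature, here is a minimal sketch of how they are consumed with the classic (namespaced) Firebase Web SDK; the configuration values are placeholders, and newer SDK versions use a different, modular API.

  // Sketch: subscribing to realtime change notifications with the namespaced Firebase Web SDK.
  import firebase from "firebase/app";
  import "firebase/database";

  firebase.initializeApp({
    apiKey: "YOUR_API_KEY",                         // placeholder
    databaseURL: "https://your-app.firebaseio.com", // placeholder
    projectId: "your-app",                          // placeholder
  });

  const itemsRef = firebase.database().ref("items");

  // Fires once with the current data and again on every subsequent change,
  // which is what makes the realtime notifications so handy for a small app.
  itemsRef.on("value", snapshot => {
    console.log("items updated:", snapshot.val());
  });

  // A write from any client (or from the Firebase console) triggers the listener above.
  itemsRef.push({ name: "example item", createdAt: Date.now() });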

Software

IE7 never starts/works on my machine!

I have heard this comment many times when people find out that I work on the IE team in Redmond. It is especially common when I visit India. One quick check that helps narrow down where the problem lies is to run IE without any add-ons. To do this, go to Start -> All Programs -> Accessories -> System Tools -> Internet Explorer (No Add-ons). If IE runs properly, the problem lies with one of the previously installed extensions that is failing to work properly with the new version of the browser. You can then use Manage Add-ons to zoom in on the faulty add-on and disable it.
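
If I remember correctly, the same No Add-ons mode can also be launched from the Run dialog using the -extoff switch:

  "%ProgramFiles%\Internet Explorer\iexplore.exe" -extoff
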
Software

Check out Multi-Login Helper for IE7

MLH is an IE7 add-on that lets you sign in to multiple accounts on a website in the same IE7 browser. Currently, you cannot access two different accounts on a website (for example, your bank's site) from the same IE7 browser, due to the sharing of sessions within the same process. You need to start two browsers to check multiple accounts, which renders the advantages of tabs useless in such scenarios.

Want to monitor those multiple accounts using tabs in a single IE7 browser? Want to check both your and your spouse's bank accounts from a single browser? Then check out Multi-Login Helper. Download the add-on from http://digiratii.com