Info Sources for Building Highly Available, Scalable, Resilient Azure Services and Apps

May 16, 2016

Building highly available, scalable, and resilient software running in the cloud is quite different from building such software systems that run on-premises.  Why?  In the cloud you must plan for your software to encounter a much higher rate of failures than is usual in on-premises systems.  This article provides links to sources that describe techniques and best practices for building cloud software that deals effectively with such frequent failures.

Here is a rough sketch of the sources of these failures:

  • Cloud hardware failures – The cloud uses vast numbers of cheap, commodity compute, storage, and network hardware units to host both the cloud provider’s PaaS services and customer services and apps. This cheap hardware fails more frequently than the expensive, top-of-the-line compute, storage, and network hardware generally used on-premises.  On-premises hardware is designed for a high Mean-Time-Between-Failures (MTBF) so that the software running on it does not have to deal with a high rate of hardware failures.  The cloud is the opposite: its cheap hardware fails far more often, so its MTBF is low.  These routine hardware failures are very common and can affect a single cloud service multiple times a day.  The cloud control software (known as the “fabric”) is programmed to recover the software affected by hardware failures, both customer software and the cloud provider’s own service software.  This “fabric” recovery happens in the background, out of sight.  While recovery from a routine hardware failure is under way, the cloud provider’s services return a “not available” signal to the customer software using them.  Such “not available” periods typically last seconds, perhaps minutes, and rarely longer.  Customer software running in the cloud must therefore be designed 1) to gracefully handle the higher rate of routine, short-term failures of both the hardware and the cloud provider services it uses, and 2) to have a low Mean-Time-To-Recovery from non-routine failures.  The much higher rate of such routine failures is the big difference between cloud and on-premises software.  Note that cloud providers pass the cost savings from using cheap, commodity hardware on to customers.
  • Cloud hardware overloading – Many cloud provider services are multitenant, i.e. they share blocks of hardware (nodes) between the multiple customers using the service. For example, Azure SQL is a multitenant cloud provider service that is used by multiple customer services and apps. A multitenant cloud provider service shares hardware amongst customers to reduce costs, with the savings passed on to the customer.  When one customer’s software becomes very heavily loaded, it may consume too much of the shared compute, storage, or network capacity behind a particular cloud provider service.  In this situation the cloud provider service itself and/or the “fabric” control software will start throttling the service to protect it and its hardware from becoming fatally overloaded and crashing.  To the customer’s software, such throttling looks as if the cloud provider service is temporarily unavailable.  In other words, the service appears to have failed, since it will be unresponsive for a few seconds or minutes until the throttling stops.  This intermittent protective throttling affects all customer software using that cloud provider service on the shared hardware.  Throttling is a very common occurrence, happening as often as several times per hour, or more during heavy usage periods.  Each occurrence typically lasts seconds, but occasionally longer.  Customer software must be written to deal effectively with such throttling in order to remain resilient and available.  Note that some cloud providers offer non-shared (single tenant) PaaS services for a premium price.  Using such premium services will sidestep these throttling issues, although you should still build a throttle into your own customer-developed services to avoid hard crashes due to overloading.
  • Cloud catastrophic failures – Compared to the above failures, catastrophic failures are very rare. They occur perhaps a few times per year and typically make one or more cloud provider services unavailable to customers for half an hour, several hours, or, in extreme cases, a day or so.  Such failures are caused by 1) physical disasters, like earthquakes or terrorism, affecting data centers or their network infrastructure, 2) massive hardware failures, 3) massive software failures or bugs, or 4) operational failures, i.e. the cloud provider’s operations staff making a big mistake or a series of smaller mistakes that cascade into a big outage.  Mission critical customer services and apps must be designed to withstand these longer duration failures as well as the shorter duration failures above.  One way to achieve such “high availability” is for customer software to “failover” to another data center located in a different geographical area. Note that this situation is quite similar to what can happen in an on-premises data center, and is also addressed by the links that follow.
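
The overloading item above recommends building a throttle into your own services so that they reject excess work instead of crashing. One common way to do that is a token bucket, sketched below in Python. This is an illustration only, not taken from any of the linked sources: the `TokenBucket` class, its parameter names, and the injectable `clock` argument are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens that can accumulate
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A service would call `try_acquire()` at the start of each request and, when it returns False, answer with a quick "busy, try again later" response. That keeps the service responsive under overload and gives well-behaved callers a clear signal to back off.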

The routine short-term failures described above are known as Transient Faults in Azure.  Please see the item called “Retry General Guidance” in the “Azure Cloud Application Design and Implementation Guidance” link below for a full description of how Transient Faults happen and best practices for dealing with them.
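
To make the idea concrete, here is a minimal Python sketch of the usual way to handle a Transient Fault: retry the call a limited number of times, waiting a little longer before each new attempt. It is an illustration under stated assumptions, not code from the Microsoft guidance. `TransientError`, `retry_with_backoff`, and the default delays are hypothetical names and values chosen for this example. Real client libraries signal transient failures in their own ways, and some already include retry logic.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a short-lived 'not available' or throttling failure."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation(); on TransientError wait and try again, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            # Exponential backoff capped at max_delay, with random jitter so
            # that many clients do not all retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Usage: an operation that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service busy")
    return "rows"


result = retry_with_backoff(flaky_query, sleep=lambda seconds: None)  # no real waiting in the demo
```

Only failures known to be transient should be retried this way. Retrying a permanent error, such as a bad credential, just adds load and delays the error report.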

The good news in the area of failure is that the cloud “fabric” control software is very intelligent and will usually be able to automatically heal cloud hardware failures and hardware overloading failures.  For these, the healing process may take a few seconds, or a minute, or generally some time that is within the Service Level Agreement (SLA) for a particular cloud service like Azure SQL or Azure Storage.  A Service Level Agreement is a legal agreement between customers and a cloud provider that gives the cloud provider a financial incentive to provide a stated level of service to customers.  Each cloud service usually has its own SLA. Typically, if the cloud provider is not able to fulfill the terms of the SLA for a particular cloud service, it will refund the customer’s payments for the services used to some stated extent.  The list below shows the levels of service one can typically expect from an Azure SLA, usually measured on a monthly basis as minutes of availability per month.

So, how much failure time per month can one expect from different SLAs?

  • An SLA of “three 9s” (a cloud service is available 99.9% of the minutes in a month) results in a maximum unavailability time of 43.2 minutes per month, or 10.1 minutes per week.
  • An SLA of “four 9s” (a cloud service is available 99.99% of the minutes in a month) results in a maximum unavailability time of 4.32 minutes per month, or 1.01 minutes per week.
  • Many cloud services have a 99.9% availability SLA. Some are a little higher, some a little lower.
  • For more on Azure SLAs, please see the “Characteristics of Resilient Cloud Applications – Availability” section of “Disaster Recovery and High Availability for Azure Applications”, linked below.
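
The downtime figures above are simple arithmetic, and the short Python sketch below reproduces them. It assumes a 30-day month, which is what the 43.2 minute figure implies. The function name `max_downtime_minutes` is made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def max_downtime_minutes(availability_percent, days):
    """Maximum unavailable minutes allowed over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return days * MINUTES_PER_DAY * unavailable_fraction


for sla in (99.9, 99.99):
    per_month = max_downtime_minutes(sla, days=30)  # assumes a 30-day month
    per_week = max_downtime_minutes(sla, days=7)
    print(f"{sla}%: {per_month:.2f} min/month, {per_week:.2f} min/week")
```

For “three 9s” the weekly result is 10.08 minutes, which the list above rounds to 10.1.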

Conclusion

  • A typical 99.9% SLA permits roughly 10 minutes of unavailability per week, and to customer software running in the cloud that unavailability looks as if a cloud provider service has failed. You absolutely must build your cloud software to deal effectively with frequent failures of all kinds. Failure is a normal part of cloud computing.  It is not exceptional at all.
  • Plus, mission critical services and apps running in the cloud must also be built for high availability so that they can gracefully withstand a catastrophic failure and come back on line very rapidly, perhaps within seconds to minutes.

The info sources presented below describe specific techniques to deal with such failures.

Azure Cloud Application Design and Implementation Guidance by Microsoft Patterns and Practices – Over the past year Microsoft has pulled together its key Azure best practices into one place, which makes them much easier to draw upon when building software to run in Azure.  The Guidance contains links to 13 focused areas.  In my opinion the “must reads” among them are as follows.  They are required to gain a minimal effective understanding of what it takes to build “Highly Available, Scalable, Resilient Azure Services and Apps”.

  • Retry General Guidance (this has more detail on why failures are so much more frequent in the cloud)
  • Availability Check List
  • Scalability Check List
  • Monitoring and Diagnostics Guidance
  • Background Job Guidance

Disaster Recovery and High Availability for Azure Applications – This Microsoft document covers strategies and design patterns for implementing high availability across geographic regions to cope with catastrophic failures.  These patterns allow an Azure app or service to remain available even if an entire data center hosting the app or service ceases to function.  They also aid in reducing the Mean-Time-To-Recovery for your cloud hosted software.
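
As a rough illustration of the failover idea, the Python sketch below tries a request against a prioritized list of regions and returns the first successful answer. It is a simplified example under stated assumptions, not a pattern copied from the Microsoft document. `RegionUnavailable`, `call_with_failover`, and the region names are hypothetical. Real failover also involves data replication and traffic routing, which this sketch leaves out.

```python
class RegionUnavailable(Exception):
    """Hypothetical error raised when a whole region cannot be reached."""


def call_with_failover(regions, request):
    """Try request(region) against each region in priority order.

    Returns (region, result) from the first region that answers.
    """
    failures = []
    for region in regions:
        try:
            return region, request(region)
        except RegionUnavailable as exc:
            failures.append((region, str(exc)))  # remember why, then try the next region
    raise RegionUnavailable(f"all regions failed: {failures}")


# Usage: the primary region is down, so the secondary serves the request.
def fetch_orders(region):
    if region == "primary-region":  # hypothetical region name
        raise RegionUnavailable("data center outage")
    return f"orders served from {region}"


served_by, orders = call_with_failover(["primary-region", "secondary-region"], fetch_orders)
```

Trying regions in a fixed priority order keeps the example short. A production design would normally also remember that the primary is down, so that it does not pay the cost of a failed call on every request.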

Hardening Azure Applications – A book by Suren Machiraju and Suraj Gaurav published by Apress in 2015.  It does a great job of identifying techniques to build “Highly Available, Scalable, Resilient Azure Services and Apps”, and it also covers security, latency, throughput, disaster recovery, instrumentation and monitoring, and the “economics of 9s” in SLAs.  It is invaluable in defining requirements and dealing with the business in these areas.  The target audience is Architects and CIOs, but Senior Developers and Technical Leads will also benefit from it.  We all have a steep cloud learning curve to climb in understanding and defining an organization’s non-functional requirements for cloud services and apps, plus the techniques required to meet those requirements.  This book speeds one on their way.

Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications – An online and paperback book by Microsoft Patterns and Practices, published in 2014.  It provides excellent primers on key cloud topics like Data Consistency and Asynchronous Messaging, plus an excellent section with in-depth explanations of a number of “Problem Areas in the Cloud”.  So if you are unsure of terminology or technology terms, this is a good place to learn the basics.

Finally, a new way to help build “Highly Available, Scalable, Resilient Azure Services and Apps” has just become available in Azure.  It is called Service Fabric.  I will cover that in future blog posts.

George Stevens

Creative Commons License

dotnetsilverlightprism blog by George Stevens is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at dotnetsilverlightprism.wordpress.com.
