
Why I Like Modeling IoT Devices with Azure Service Fabric Actors

Service Fabric Actors offer a simple, reliable programming model for efficiently acting as an IoT Device Shadow.  A Device Shadow is a software entity (a class, a service, etc.) that contains the recent state of an individual remote IoT device.  By “state” I mean the current data that an IoT device is designed to furnish to an IoT system.  For example, the state of a remote oil temperature sensor device would be the current temperature of the oil it is monitoring.  Device Shadows are often implemented with a JSON document, but here I use Service Fabric Actors.

Essentially, a Device Shadow is a virtual representation of a remote IoT device.  It contains the persistent recent state of the device.  A Device Shadow is used by the rest of the software system in an IoT solution to access a device’s state, avoiding the time-consuming and resource-expensive process of gathering the state information from various places in an IoT system, perhaps including a trip all the way out to the remote IoT device to get its state.  Continuing the previous example, with a Device Shadow for an oil temperature sensor device the software does not need to ask the sensor device directly for its current measurement.  Nor does the software have to do a series of database queries to get the current state of a device saved in disk storage.  Rather, the software can simply query the associated Device Shadow, since it contains all of a device’s recent state in one place.  This assumes the rest of the system is designed to have the IoT devices periodically report their state to their Device Shadow — a standard practice.  There may be dozens to hundreds of thousands or more of remote devices and their Device Shadows in an IoT system!  The ability to deal with such massive scale is extremely important here.

Service Fabric is Microsoft’s next-generation middleware platform with strong support for high-scale microservices.  It aims to provide very high reliability by running multiple replicas of each service it hosts on the multiple virtual machines that make up a Service Fabric cluster.  Only the “primary replica” is used, but the “secondary replicas” standing by on other virtual machines in the cluster always have their state kept up to date with that of the primary replica.  This allows very quick and accurate automatic recovery from a crashed virtual machine in the cluster.  No operator involvement is required for this recovery.

A Service Fabric Actor is a highly reliable, single-threaded, persistently stateful microservice that runs in a Service Fabric cluster.  Actors support high scale and high reliability since each one is a special Service Fabric Reliable Stateful Service that runs under control of the Service Fabric Actor Service, which provides them with the unique characteristics of Actors.  Thus, SF Stateful Actors are very well suited to act as a Device Shadow.  And, since they are individual microservices, they can scale out without a negative impact on the code, operations, or performance of the overall software system.
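To make this concrete, here is a minimal sketch of a Device Shadow Actor in C#, assuming the Reliable Actors StateManager API.  All names here (IDeviceShadowActor, OilTemperatureReading, the “lastReading” state key) are hypothetical illustrations, not code from my POC.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Runtime;

// Hypothetical telemetry record for the oil temperature sensor example.
public class OilTemperatureReading
{
    public double DegreesC { get; set; }
    public DateTimeOffset ReadAtUtc { get; set; }
}

// The shadow's contract: the ingestion path and the Backend both call these methods.
public interface IDeviceShadowActor : IActor
{
    Task ReportTelemetryAsync(OilTemperatureReading reading); // device -> shadow
    Task<OilTemperatureReading> GetLastReadingAsync();        // backend -> shadow
}

[StatePersistence(StatePersistence.Persisted)]
internal class DeviceShadowActor : Actor, IDeviceShadowActor
{
    public DeviceShadowActor(ActorService actorService, ActorId actorId)
        : base(actorService, actorId) { }

    public Task ReportTelemetryAsync(OilTemperatureReading reading) =>
        StateManager.SetStateAsync("lastReading", reading); // persisted, replicated state

    public Task<OilTemperatureReading> GetLastReadingAsync() =>
        StateManager.GetStateAsync<OilTemperatureReading>("lastReading"); // assumes a reading was reported
}
```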

Here is a conceptual data flow diagram showing how Service Fabric Actors can implement the Device Shadow role in an Azure IoT solution.

Figure 1 – Device Shadow Actors in an Azure IoT solution.

I have omitted detail in Figure 1 so as to focus only on the Actors and the flow of commands, command responses, and telemetry in the solution.  Please see “Microsoft IoT Reference Architecture” for definitions of terms used above.

The Business Backend System (Backend for short) is responsible for the UI display of alerts, dashboards, visualizations, and reports, plus some analysis of data, provisioning the system’s resources, and monitoring the health of the system as it runs.  The Backend is also responsible for issuing commands to remote IoT devices through Device Shadows, as directed by user interaction with the UI or via a programmatic workflow.  The Device Shadow Actors are intermediaries, standing between the Backend System and the Cloud IoT System (with its IoT Gateway data ingestion buffer to which the IoT devices connect).

During December 2016 and January 2017 I implemented an exploratory proof of concept (POC) system that used SF Stateful Actors as Device Shadows.  A challenging feature of my POC was its requirement to implement significant commanding from the Backend System to the IoT devices.  Specifically, I had to make the IoT devices start and stop, and, when running, bring them online and take them offline.  Commanding devices entails a lot more than just passively collecting telemetry from sensors, as is common in many IoT solutions.

Commanding involves the Device Shadows maintaining the command state of a remote device, in addition to its telemetry state.  It also requires a Device Shadow to implement behavior to support commanding.  Specifically, the Device Shadow must receive a command from the Backend System and then send the command to the correct remote device, allowing time for the device to execute the command, and then receive a command response from the IoT device when the device has completed command execution.  In addition, the Device Shadow needs the ability to time out when no command response is received within a specified time.  And, the Device Shadows must know when a command is in the process of being executed by a device so as to prevent attempts to have the device execute multiple commands concurrently.  Finally, in all of the above scenarios a Device Shadow Actor must notify the Backend System of the command state at critical junctures in the commanding process, e.g. normal command completion, command time-out, device error, etc.  These command response notifications allow the Backend to push alerts and dashboard updates to UI clients.
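Here is a hedged sketch of how that commanding behavior can be laid out in an Actor, using an Actor reminder as the command time-out.  The names and the gateway/backend stubs are hypothetical stand-ins for real plumbing, and production code would also guard GetReminder() against a reminder that has already fired.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Runtime;

public interface ICommandableShadowActor : IActor
{
    Task<bool> TrySendCommandAsync(string command);   // Backend -> Shadow
    Task CompleteCommandAsync(string deviceResponse); // IoT Gateway -> Shadow
}

[StatePersistence(StatePersistence.Persisted)]
internal class CommandableShadowActor : Actor, ICommandableShadowActor, IRemindable
{
    private const string TimeoutReminder = "CommandTimeout";

    public CommandableShadowActor(ActorService actorService, ActorId actorId)
        : base(actorService, actorId) { }

    public async Task<bool> TrySendCommandAsync(string command)
    {
        // Reject a new command while one is still executing on the device.
        bool inFlight = await StateManager.GetOrAddStateAsync("inFlight", false);
        if (inFlight) return false;

        await StateManager.SetStateAsync("inFlight", true);
        await StateManager.SetStateAsync("pendingCommand", command);

        // A one-shot reminder (period of -1 ms) implements the command time-out.
        await RegisterReminderAsync(TimeoutReminder, null,
            TimeSpan.FromSeconds(30), TimeSpan.FromMilliseconds(-1));

        await IotGatewayStub.SendCommandToDeviceAsync(this.Id.ToString(), command);
        return true;
    }

    public async Task CompleteCommandAsync(string deviceResponse)
    {
        await UnregisterReminderAsync(GetReminder(TimeoutReminder)); // see caveat above
        await StateManager.SetStateAsync("inFlight", false);
        await BackendStub.NotifyAsync(this.Id.ToString(), "Completed: " + deviceResponse);
    }

    public async Task ReceiveReminderAsync(string name, byte[] state, TimeSpan due, TimeSpan period)
    {
        if (name != TimeoutReminder) return;
        // No command response arrived within the time-out window.
        await StateManager.SetStateAsync("inFlight", false);
        await BackendStub.NotifyAsync(this.Id.ToString(), "Command timed out");
    }
}

// Stand-ins for the real IoT Gateway and Backend notification plumbing.
internal static class IotGatewayStub
{
    public static Task SendCommandToDeviceAsync(string deviceId, string command) => Task.CompletedTask;
}
internal static class BackendStub
{
    public static Task NotifyAsync(string deviceId, string message) => Task.CompletedTask;
}
```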

Service Fabric Stateful Actors offer a great programming model for efficiently acting as a Device Shadow that models the telemetry and command needs of an IoT device, allowing it to work well with the Backend System as well as other parts of an IoT solution.  As such, SF Actors shorten development time and reduce the level of skill required to effectively program and debug the potentially complex logic associated with commanding and telemetry.

Keep in mind, with many cloud solutions there is usually no guarantee that difficult problems will not arise when scaling out to 1) service tens of thousands or even millions of IoT devices, or 2) service high data rates moving to and from the devices.  Such problems often involve throughput and contention issues in accessing storage, the need to use multi-threaded programming techniques, excessive latency, and the need to orchestrate the timing of several different collaborating components in a cloud environment that is eventually consistent by its nature.  Any one of these issues by itself is a challenge for a seasoned developer, and taken together they will likely take a substantial amount of time to get right.  My POC showed me that Service Fabric Stateful Actors can relieve a substantial amount of this burden in the realm of Device Shadows.

In what follows I list what I liked about developing my POC with SF Actors, followed by links that will help you learn more about Actors and Service Fabric, rather than explaining them in this blog.

Why I Like Service Fabric Actors as IoT Device Shadows

  • Most of the behavior required for Device Shadowing is located close together: It is in the Actor microservice itself or in the resource accessor used to provide system-wide access to an Actor (see the proxy sketch after this list). Not much of this behavior is spread widely across the entire system.  This reduces development and maintenance time.
  • There is no need to deal with complex multi-threaded code. Actors are single threaded with “turn based concurrency”.  In other words, all the code in one of an Actor’s methods will be completely executed before any of the other methods can begin execution.  This makes development and debugging much, much easier and faster than when dealing with multi-threading.
  • One does not have to worry about data persistence concurrency issues. All of the state of an Actor (i.e. the telemetry and command state in a Device Shadow) is persisted in an Actor’s own private state store.  And, that state can only be accessed via an Actor method or property which is guaranteed to use “turn based concurrency”.  So, there is no worry about data access exceptions due to “record locking”, which can happen with a busy traditional database.  And there is no need to use transactions since they are “built in” to Actor state access.
  • One does not have to worry about slow or intermittent data access to an Actor’s private state. Service Fabric Stateful Actors are built on top of Service Fabric Reliable Services.  Behind the scenes, Stateful Reliable Services save their state on the hard disk of the virtual machine in the SF cluster that is running the Actor instance.  Thus, there is no network access involved in state access by an Actor.  This makes data access much faster than over a network, and it eliminates the ever-present network transient faults of cloud computing that serve to complicate data access.
  • There is no need to write code to deal with special considerations to gain massive scalability. By the inherent nature of Service Fabric, its clusters, and Actors (each instance is running as a separate microservice) there is built-in scalability of vast proportions!  The focus shifts from providing scalability in code, to configuring the Service Fabric cluster to provide the required scale.  And configuration takes a whole lot less time and effort.
  • The learning curve is quite reasonable. The documentation is good.  And learning to use Actors in the role of Device Shadows is just not that hard!  However there are some advanced scenarios involving complex parallel computations with Actors that are not for the rank beginner.
  • The development environment is good. A “development cluster” can be set up on your development system, and it runs exactly the same binaries that a production Service Fabric cluster does.  This is not emulation!  Working with the development cluster through Visual Studio is a very time efficient way to develop and debug code before running it on a non-development cluster on Azure or in your own data center.
  • SF provides amazing time saving upgrade capabilities when Actors must have code changes. Upgrades can happen without taking the system down or out of service.  And if an upgrade has problems, the code can be automatically rolled back.  Again, without taking the system out of service.  In general, Service Fabric adds great value in the area of operations, in this and many other ways.
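As an example of that system-wide access, here is a minimal sketch of reaching a Device Shadow through an ActorProxy.  The device id, application name, and interface are the hypothetical ones from the earlier sketch.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Actors;
using Microsoft.ServiceFabric.Actors.Client;

internal static class ShadowAccessor
{
    public static async Task<OilTemperatureReading> GetLastReadingAsync(string deviceId)
    {
        // Any service in the cluster can reach a device's shadow by its device id.
        var shadow = ActorProxy.Create<IDeviceShadowActor>(
            new ActorId(deviceId),
            new Uri("fabric:/IotPoc/DeviceShadowActorService")); // hypothetical names

        // Turn-based concurrency: this call runs only after any in-flight
        // method on this Actor instance has completed.
        return await shadow.GetLastReadingAsync();
    }
}
```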

Links Concerning Service Fabric, Actors, and Code Samples Showing Their Use in IoT Solutions

“Microsoft Azure – Azure Service Fabric and the Microservices Architecture”, MSDN Magazine, December 2015.  This is an excellent overview of Service Fabric, and it has a good section on Actors as well.

Overview of Service Fabric.  This Microsoft article provides a broad view of Service Fabric development and operational features, plus contains a few good videos for more in depth knowledge.

Introduction to Service Fabric Reliable Actors.  This Microsoft document is a good place to start learning about Actors.  It also includes links to other detailed documentation for Actors, in the panel on the left.

Getting to Know Actors in Service Fabric. This is a really good in-depth blog on the details of Service Fabric Actors that also graphically shows how they fit in with other microservices running under Service Fabric.  This is a must read to get up to speed on Actors quickly, IMO.

IoT Actor Code Example:  Service Fabric IoT Sample, from Microsoft by Vaclav Turecek (Sr. Program Manager on the Service Fabric team).

IoT Actor Code Example:  Paolo Salvatori (Microsoft, Italy) provides IoT code examples using Actors, plus several other Service Fabric examples as well.

Azure Code Samples for Service Fabric provides some IoT related Actor samples, plus others as well.

I hope this introduction to Service Fabric Stateful Actors used as Device Shadows spurs you to further investigate their capabilities and how they can be useful to you.

George Stevens


Internet-of-Things Security — Info Sources

The security of distributed systems, whether cloud based, on-premises, or hybrid cloud/on-premises, is a complex subject by itself.  Add securely connecting a bunch of hardware things to a distributed software system and you have more complexity, new requirements, new techniques, and new technologies to deal with.  Hopefully this article will shed some light on some of the current best mental models, best practices, and technologies to use in designing and building secure Internet-of-Things (IoT) systems.

Please keep in mind the key points made in my previous blog article “Reinventing the Wheel is Not Necessary for IoT Software Architecture”:

  1. It’s best to use an end-to-end system perspective when thinking about IoT Systems. They are much more complex than just the internet and some things.
  2. “When developing IoT Systems we can use all of the software structural (aka software architecture) knowledge we’ve gained over the past decade from developing secure, mission critical distributed systems, and Service Oriented Architectures (SOA), and Cloud Systems.”

The info sources listed below often apply the above perspective and techniques since they generally serve to facilitate the timely development of secure IoT systems, as well as high quality IoT systems.

To get you started, consider what happened with weak IoT security on October 21, 2016 — Hacked Cameras, DVRs Powered Today’s Massive Internet Outage, by Brian Krebs.  We can do better than that!  Below are the most useful sources of information on security of IoT systems that I’ve encountered in 2016.

General IoT Security Info Sources

First, if you only have time to consult one of the info sources listed in this blog, make sure it is the following video, viewing at least the recommended parts noted below.  That is where you will initially get the greatest return for the time you spend.  This video provides an excellent overview of key technology-agnostic concepts and techniques in IoT security: Secure your IoT with Azure IoT by Arjmand Samuel of Microsoft.  It is a presentation from Microsoft’s Ignite conference in September 2016.  The first quarter of the video (about 10 minutes) is an overview of the key general security issues in IoT, including the roles and concerns of various stakeholders.  I found it most helpful in identifying the specific challenges of why IoT security is hard.

Then it presents an excellent mental model of a “Trustworthy Internet of Things”, with pressure put on any IoT system by the Environment, Security Threats, Human Error, and System Faults.  Counteracting these pressures are the design and implementation of the IoT system’s aspects of Security, Privacy, Safety, and Reliability throughout the entire system.  I believe this mental model, along with the roles of various stakeholders, are key concepts to drive the effective design and execution of the planning, development and operation of a solid, secure IoT system.

The middle part of the video outlines specifically how various Microsoft technologies fit into this model.  It spans from the Windows 10 IoT operating system down at the “things” level all the way up to the preconfigured Azure cloud IoT Suites available.  These IoT Suites are full cloud software systems specifically targeted at remote monitoring, predictive maintenance, etc.

The last part of the video is a “must see”.  Starting at around 28 minutes is a super valuable description of the concept of “Defense in Depth”.  Plus, it shows how to use the STRIDE threat analysis model to systematically identify security threats and then counteract each one with a “Defense in Depth” approach.  I found the 10 minutes spent walking through an example of how to apply the STRIDE threat analysis model to be vital to being able to build strong security into an IoT system.  STRIDE is part of Microsoft’s long-standing “Security Development Lifecycle” (SDL).  They use it internally on the software products and services they sell, plus they support their customers using it as well with free tools, videos, and tutorials at SDL.  The SDL concepts and practices around STRIDE (as well as other areas in SDL) are largely technology agnostic.

Second, the Microsoft article Internet of Things Security Architecture – This is mainly about technology agnostic security techniques.  It has a detailed example of using the STRIDE threat modeling analysis technique as a starting point to secure an IoT system.  It goes on to show how to design the architecture of various portions of an IoT system to counteract threats at each level, with a “defense in depth” perspective.  I consider this a must read article.

Third, the Microsoft article Internet of Things Security Best Practices – This deals with “Defense in Depth” and outlines the best practices of various roles in the IoT world.  For example, the roles of the IoT hardware manufacturer/integration, the IoT solution developer, etc.  This role based approach is useful in being able to focus on security concerns specific to key participants involved in developing and operating an IoT system.

Fourth, in June 2015 the Industrial Internet Consortium released its Industrial Internet Reference Architecture (IIRA) document (click to download a pdf).  It outlines the requirements and the conceptual system architecture needed to build industrial strength IoT systems.  This is about a lot more than hooking up your toaster to the internet!  The 5 founding members of IIC are AT&T, Cisco, GE, Intel, and IBM.  Note that most of them have deep experience in distributed systems and/or Cloud Systems.

Section 9 of this IIRA document, “Security, Trust, and Privacy”, gives extensive coverage to all aspects of IoT security.  Being familiar with the ideas, terms and techniques presented in Section 9 will give you a strong base in what is recommended by many of the leading, highly experienced companies in the IoT realm.  You can greatly advance your knowledge from their experience as expressed in this section.

Microsoft Specific IoT Technology Info Sources

Here are useful links to Microsoft IoT Security documentation generally focused on specific Microsoft technologies.  However, many of them also contain valuable general IoT security concepts that are technology agnostic.

First, the 11 page Microsoft white paper “Securing your Internet of Things from the Ground Up – Comprehensive built-in security features of the Microsoft Azure IoT Suite” (click to download a pdf) provides an introduction to Microsoft’s Azure IoT services using most of the concepts outlined above.  Here you will see them in action.  And, roughly the same information is presented in the online document Internet of Things security from the ground up.

Second, the online Microsoft document Securing your IoT deployment provides the details of securing Azure IoT systems in 3 security areas – Device Security, Connection Security, and Cloud Security.  It provides a more fine-grained-detail look at IoT security than most of the other info sources listed above.

Third, the minute details of device authentication and security credentials used by the Azure IoT Hub service are presented in Control access to IoT Hub.  This shows exactly how robust device security is achieved.

Finally, Azure IoT Hub Developer Guide provides a list of references to documents on over 15 topic areas concerning the use of the Azure IoT Hub.  You can use this as a guide to perusing the IoT Hub documentation.

I hope you benefit as much from the above info sources as I have.

George Stevens


Azure Application Insights – Quick and Easy Service Performance and Health Monitoring

I am getting solid productivity increases in my microservice development process by using Azure Application Insights (AAI) to monitor the performance and health of services running in the cloud or on-premises.  Not only can I quickly put AAI to productive use in the code-test-debug cycle to see where performance bottlenecks are, I can also use it when my services are in normal operation to monitor their health daily (and send me email alerts), and over weeks and months via graphs and charts.  These capabilities are easily available, and require very little code to be written.  Indeed, many health indicators have their telemetry data automatically generated by the AAI dlls added to a service project.

Microsoft describes Application Insights as follows — “Visual Studio Application Insights is an extensible analytics service that monitors your live” service or “web application.  With it you can detect and diagnose performance issues, and understand what users actually do with your app.  It’s designed for developers, to help you continuously improve performance and usability.  It works for apps on a wide variety of platforms including .NET, Node.js and J2EE, hosted on-premises or in the cloud”, from Application Insights – introduction.

In summary, Azure Application Insights is a software developer/dev-ops Business Intelligence (BI) package.  Similar to BI in other realms, AAI allows one to easily and quickly compose and visualize charts and graphs of key indicators of the “business” of software development and operations, and also drill down into the minute details with other charts and lists of detailed data.  I am really impressed with how quickly one can come up to speed and productively use it.

AAI has charts, searches, and analyses available both in Visual Studio and the Azure Portal.  When you need health alerts sent by email, long term charts and graphs, and an easy to use query language to search through your health telemetry data, use the Azure Portal.  Visual Studio’s Application Insights capabilities provide good performance and usage oriented charts (with drill down capabilities) and searches available during debug test runs without leaving Visual Studio. 

The following are examples of some of the basic Visual Studio AAI displays involving performance analysis that can be had without much work on your part to write the code that generates the telemetry data and/or displays it in a useful way.

Below, Figure 1 is an example of an AAI chart I’ve found highly useful in pinpointing the source of performance bottlenecks.  This chart is available via the Visual Studio Application Insights toolbar by clicking on the “Explore Telemetry Trends” menu item, which displays an empty chart.  You then must click the “Analyze Telemetry” button to generate the display.  Note how you can set up the chart to display various “Telemetry Types”, “Time Ranges”, etc.

Figure 1 – Visual Studio’s Explore Telemetry Trends:  The Analyze Telemetry display.

If you double click on one of the blue dots in Figure 1, you’ll start a “drill down” operation that will open up a “Search” display shown below in Figure 2.  This display lists all the individual measurements that have been aggregated into the dot you double clicked on.  And in a pane to the right (not shown) it lists the minute details of the item your cursor is on.   Also note that you can use check boxes to the left and above to further refine your search.   Figure 2 below shows the drill down display you get from double clicking on the 1sec – 3sec small blue dot at the Event Time of 4:48 in Figure 1.

Figure 2 – Visual Studio’s Explore Telemetry Trends:  The Drilldown Search display.

The displays in Figure 1 and 2 show the aggregation and breakdown of the elapsed time it takes for a single WCF service to complete about 100 dequeue operations from an Azure Service Bus Queue using the NetMessagingBinding in ReceiveAndDelete mode.  After the service dequeues a single item, it checks to see if the item is valid, and then saves it in Azure Table Storage.  You can get a link to the service code from this blog article SO Apps 2, WcfNQueueSMEx2 – A System of Collaborating Microservices.  This code does not have the telemetry generating code present.

Therefore, from the point of view of Application Insights there are a couple of relevant things to measure in this service:

  • The total elapsed time of the “request”, from the start of the service operation until it executes its return statement. This data is generated by a few lines of telemetry code that I had to write (a sketch follows this list).  The telemetry code uses the TelemetryClient.Context.Operation, TelemetryClient.TrackRequest(), and TelemetryClient.Flush() provided by the AAI dlls added to the service project.  These are described in Application Insights API for custom events and metrics in the “Track Request” section.  The telemetry code also uses the System.Diagnostics.Stopwatch to record the total elapsed time of a service operation.
  • The elapsed time it takes for each of the 2 “dependencies” (aka external service calls) to execute. The external dependencies are the Azure Service Bus and Azure Table Storage.  Specifically, one dependency is the Service Bus Dequeue operation.  The other dependency is the Table Storage Save operation.  In both cases the dependency elapsed time is automatically measured by the Application Insights dlls, and this data is automatically sent as telemetry as well.  I did not have to write any code to support dependency analysis.  All the work is done by the 5 or 6 Application Insights dlls that are added to a service project via NuGet.  Note that many of the “automatic” performance monitoring features require that the “Target Framework” of a service’s project be set to .NET 4.6.1.  You can use lower versions as well, but may not get as many automatic measurements.  Note also that many .NET Performance Counters are automatically generated and sent out as telemetry as well.
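Here is a minimal sketch of the kind of request-tracking code described in the first bullet.  The class and method names are hypothetical; the real service wraps a Service Bus dequeue and a Table Storage save in the tracked section.

```csharp
using System;
using System.Diagnostics;
using Microsoft.ApplicationInsights;

public class QueueItemProcessor
{
    private readonly TelemetryClient telemetry = new TelemetryClient();

    // Wraps one dequeue-validate-save cycle in a tracked "request".
    public void ProcessOneItem()
    {
        telemetry.Context.Operation.Name = "DequeueAndSave"; // groups these requests in AAI
        var startTime = DateTimeOffset.UtcNow;
        var stopwatch = Stopwatch.StartNew();
        bool success = true;
        try
        {
            DequeueValidateAndSave(); // the service's real work (not shown)
        }
        catch
        {
            success = false;
            throw;
        }
        finally
        {
            stopwatch.Stop();
            telemetry.TrackRequest("DequeueAndSave", startTime, stopwatch.Elapsed,
                success ? "200" : "500", success);
            telemetry.Flush(); // send now rather than waiting for the buffer to fill
        }
    }

    private void DequeueValidateAndSave()
    {
        // Service Bus dequeue, validation, and Table Storage save go here.
    }
}
```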

Figure 1 measures the first item, the total elapsed time of the request, from start to finish including the elapsed time of any dependencies.  Figure 1 shows 2 performance test runs – One at Event Time of 4:23 and the other at Event Time of 4:48.  It is obvious that the run at 4:48 (at the right of the chart) had the vast majority of the service requests complete in <= 250 milliseconds.  That is fast!

In the 4:23 run (at the left of the chart) the majority of the service requests took between 500 milliseconds and 1 second to complete.  That is much longer.  Why?  The 4:23 run (at the left) had the WCF service running on my development system, while the 4:48 run (at the right) had the service running in an Azure WorkerRole.  It is not surprising to see much faster elapsed times in the cloud since the overall network latency is much, much less there when the service does Service Bus and Table Storage operations.  Plus, there is more CPU power available to the Azure-based service since the WorkerRole host did not have to run the test client as well.  Both runs had the test client running on my desktop development system in my office, using a single thread enqueuing 100 items one after another.

Being able to quickly separate the execution time of the service code from the code it depends upon is key to rapidly pinpointing the source of performance problems.  In Figure 2’s drill-down display, which you get by double clicking on the 1sec – 3sec small blue dot at the Event Time of 4:48, you can clearly see where the slowness in these two independent dequeue and save operations came from – one was entirely due to the slowness of the service code, while the other was largely due to the slowness of the Service Bus during that service operation.

Figure 3 below shows the drill down display you get from double clicking on the 3sec – 7sec small blue dot at the Event Time of 4:48.  Note that the source of slowness this time is NOT due to the Service Bus nor the Table Storage dependencies, but rather solely due to the service code itself.  Perhaps there was some thread or resource contention going on between service instances here that deserves further investigation.  AAI has the capability to aid in pinpointing these sorts of things as well, but that is not covered here.

Figure 3 – Visual Studio’s Explore Telemetry Trends:  Another Drilldown Search display.

The above displays (and more) are available in Visual Studio.  And you get even more displays and capabilities in the Azure Portal via the Application Insights resource that collects and analyzes the telemetry sent from services and clients. 

Please see the following links for more info on AAI:

Visual Studio Application Insights Preview

Application Insights – introduction

Application Insights for Azure Cloud Services

WCF Monitoring with Application Insights — With the dll that comes with this you do not have to write the request tracking code yourself.  It takes care of that for you, providing code-less performance telemetry data.

More telemetry from Application Insights

Application Insights API for custom events and metrics

I hope this introduction to Application Insights spurs you to further investigate its capabilities and how it can be useful to you.

George Stevens


SO Apps 7: Idempotency Info Sources

Back in the 20th century, before the broad use of Service Oriented Apps, many software systems depended upon distributed transactions using the 2-phase commit to ensure data was properly obtained from, and saved into, databases.  Many software systems did not use messaging back then.  Nowadays Service Oriented Apps often use messaging, and they also steer away from distributed transactions since the various databases used are spread far and wide — from the data center to various clouds.  Widely distributed data makes the resource locking required for distributed transactions problematic in various ways, plus distributed transactions tend to produce high latency times.

Learning how to design and develop Service Oriented systems that use messaging while avoiding distributed transactions requires some new perspectives, knowledge, and skills.  One part of the new knowledge required is how to create idempotent designs that can handle “at least once” message delivery.  This kind of delivery is common with most messaging technologies today.  “At least once” means that a given message may be delivered once, or twice, or more often.  The software must be able to effectively deal with multiple deliveries, and the duplicate data within the multiply delivered messages.  And this requires idempotency.
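As a simple illustration, here is a minimal sketch of a dedup-guarded message handler.  All the names are hypothetical, and a real system would keep the set of processed ids in a durable store (e.g. a database table with a unique key on the id) rather than in memory.

```csharp
using System;
using System.Collections.Generic;

public class OrderMessage
{
    public Guid MessageId { get; set; } // unique per logical message, even across redeliveries
    public string Body { get; set; }
}

public class IdempotentHandler
{
    private readonly HashSet<Guid> processedIds = new HashSet<Guid>(); // stand-in for a durable store

    public void Handle(OrderMessage msg)
    {
        // At-least-once delivery: the same message may arrive twice or more.
        if (!processedIds.Add(msg.MessageId))
            return; // duplicate; the work was already done, so do nothing

        Process(msg.Body); // the real work runs at most once per MessageId
    }

    private void Process(string body)
    {
        // Apply the change the message describes.
    }
}
```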

Here are a few articles I found useful concerning idempotency and related issues:

Messaging:  At-least-once-delivery, by Jonathan Oliver, April 2010.

Idempotency Patterns by Jonathan Oliver, April 2010.  This is a really useful article and widely cited.

Ditching 2-phased commits, by Jimmy Bogard, May 2013.  A good overview of problems with 2-phased commits and alternatives.

(Un)reliability in Messaging:  idempotency and deduplication by Jimmy Bogard, June 2013.  This shows a couple useful techniques with code snippets.

Life Beyond Distributed Transactions, Pat Helland, 2007.  Pat worked at Amazon.com then, and it is interesting to read about his perspective at that time.

I hope you find this information as useful as I did!

George Stevens


Info Sources for Building Highly Available, Scalable, Resilient Azure Services and Apps

Building highly available, scalable, and resilient software running in the cloud is quite different from building such software systems that run on-premises.  Why?  In the cloud you must plan for your software encountering a much higher rate of failures than usually encountered in on-premises systems.  This article provides links that describe techniques and best practices for building your cloud software to effectively deal with such frequent failures.

Here is a rough sketch of the sources of these failures:

  • Cloud hardware failures – The cloud uses vast numbers of cheap, commodity compute, storage, and network hardware units to host both the cloud provider’s PaaS services and customer services and apps. This cheap hardware fails more frequently than on-premises systems, which generally utilize expensive, top-of-the-line compute, storage, and network hardware.  On-premises hardware systems are designed to achieve a high Mean-Time-Between-Failure (MTBF) so that software running on them does not have to deal with a high rate of hardware failures.  The cloud is the opposite, having a low hardware MTBF due to a much higher rate of failure of its cheap hardware.  These routine hardware failures are very common and can happen multiple times a day to a single cloud service.  The cloud control software (known as the “fabric”) is programmed to recover the software affected by hardware failures, both customer software and cloud provider service software.  The “fabric” recovery happens in the background, out of sight.  During the recovery process from these routine hardware failures the cloud provider’s services return a “not available” signal to customer software using the service.  The duration of such “not available” failures is typically measured in seconds, perhaps minutes, rarely longer.  This requires that customer software running in the cloud be designed to 1) gracefully handle the higher rate of routine, short-term failures of both hardware and the cloud provider services it uses, plus 2) have a low Mean-Time-To-Recovery from non-routine failures as well.  The much higher rate of such routine failures is the big difference between cloud and on-premises software.  Note that the cost savings of using cheap, commodity hardware by cloud providers are passed on to customers.
  • Cloud hardware overloading – Many cloud provider services are multitenant (software-as-a-service), i.e. they share blocks of hardware (nodes) between multiple customers utilizing a cloud provider service. For example, Azure SQL is a multitenant cloud provider service that is used by multiple customer services and apps. A multitenant cloud provider service shares hardware amongst customers to reduce costs, with the savings passed on to the customer.  When some customer’s software becomes very heavily loaded it may use too many resources provided by a particular cloud provider service sharing compute, storage, or network nodes.  In this heavily loaded situation the cloud provider service itself and/or the “fabric” control software will start throttling the cloud provider service to protect it and its hardware from becoming fatally overloaded and crashing.  Such throttling appears to the customer’s software as if the cloud provider service is temporarily unavailable.  In other words, it appears as if the cloud provider service has failed for some reason, since it will be unresponsive for a few seconds or minutes until the throttling stops.  This intermittent protective throttling affects all customer software utilizing that cloud provider service in this way.  Throttling is a very common occurrence, happening as much as several times per hour, or more during heavy usage periods, with a typical duration of seconds per occurrence, but occasionally longer.  Customer software must be written so it is able to effectively deal with such throttling to remain resilient and available.  Note that some cloud providers have non-shared (single tenant) PaaS services available for a premium price.  Use of such premium services will sidestep throttling issues, other than the throttle you should build within your own customer-developed services to avoid hard crashes due to overloading.
  • Cloud catastrophic failures – Compared to the above failures, catastrophic failures are very rare. They occur perhaps a few times per year and typically involve the loss of one or more cloud provider services for use by customers for a half hour, several hours, or for a day or so in extreme cases.  Such failures are caused by 1) Physical disasters, like earthquakes or terrorism, affecting data centers or their network infrastructure, 2) Massive hardware failures, 3) Massive software failures or bugs, or 4) Operational failures, i.e. the cloud provider operations staff making a big mistake or a series of smaller mistakes which cascade into a big outage.  Mission critical customer services and apps must be designed to withstand these longer duration failures as well as the above shorter duration failures.  One way to achieve such “high availability” is for customer software to “failover” to another data center located in a different geographical area. Note that this situation is quite similar to what can happen in an on-premises data center, and is also addressed by the links that follow.

The routine short-term failures described above are known as Transient Faults in Azure.  Please see the item called “Retry General Guidance” in the “Azure Cloud Application Design and Implementation Guidance” link below for a full description of how Transient Faults happen and best practices to deal with them.
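To make the standard coping technique concrete, here is a minimal retry-with-exponential-backoff sketch.  It is only an illustration: real code should retry only on the exceptions the target service documents as transient, and the Azure SDKs and the Transient Fault Handling Application Block provide ready-made retry policies.

```csharp
using System;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Retries an operation with exponential backoff: waits 1s, 2s, 4s, ... between attempts.
    public static async Task<T> RunAsync<T>(Func<Task<T>> operation, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // In real code, filter for the specific transient exceptions of the
                // service being called, not every Exception.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}
```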

The good news in the area of failure is that the cloud “fabric” control software is very intelligent and will usually be able to automatically heal cloud hardware failures and hardware overloading failures.  For these, the healing process may take a few seconds, or a minute, or generally some time that is within the Service Level Agreement (SLA) for a particular cloud service like Azure SQL or Azure Storage.  A Service Level Agreement is a legal agreement between customers and a cloud provider that gives the cloud provider a financial incentive to provide a stated level of service to customers.  Each cloud service usually has its own unique SLA.  Typically, if the cloud provider is not able to fulfill the terms of the SLA for a particular cloud service, it will refund the customer’s payments for the services used to some stated extent.  Below are typical levels of service one can expect from an Azure SLA, usually measured on a monthly basis in terms of minutes of availability per month.

So, how much failure time per month can one expect from different SLAs?

  • An SLA of “three 9s” (a cloud service is available 99.9% of the minutes in a month) results in a maximum unavailability time of 43.2 minutes per month, or 10.1 minutes per week (the arithmetic is sketched after this list).
  • An SLA of “four 9s” (a cloud service is available 99.99% of the minutes in a month) results in a maximum unavailability time of 4.32 minutes per month, or 1.01 minutes per week.
  • Many cloud services have a 99.9% availability. Some are a little higher, some a little lower.
  • For more on Azure SLAs please see the “Characteristics of Resilient Cloud Applications – Availability” section of the link below to “Disaster Recovery and High Availability for Azure Applications”.
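For reference, here is a minimal sketch of the downtime arithmetic behind those numbers, assuming a 30-day month and a 7-day week:

```csharp
using System;

class SlaMath
{
    // Downtime budget in minutes for a given availability over a period of days.
    static double DowntimeMinutes(double availability, double days) =>
        days * 24 * 60 * (1 - availability);

    static void Main()
    {
        Console.WriteLine(DowntimeMinutes(0.999, 30));  // three 9s: 43.2 minutes per 30-day month
        Console.WriteLine(DowntimeMinutes(0.999, 7));   // 10.08 minutes per week (~10.1)
        Console.WriteLine(DowntimeMinutes(0.9999, 30)); // four 9s: 4.32 minutes per month
    }
}
```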

Conclusion

  • With 10.1 minutes of unavailability per week being typical, and with such unavailability appearing to customer software running in the cloud as if a cloud provider service has failed, you absolutely must build your cloud software to effectively deal with frequent failures of all kinds. Failure is a normal part of cloud computing.  It is not exceptional at all.
  • Plus, for mission critical services and apps running in the cloud you must also build them for high availability so that they can gracefully withstand a catastrophic failure as well, and very rapidly come back online, perhaps in seconds to minutes.

The info sources presented below describe specific techniques to deal with such failures.

Azure Cloud Application Design and Implementation Guidance by Microsoft Patterns and Practices — Over the past year Microsoft has pulled together its key Azure best practices into one place.  This makes it so much easier to draw upon when building software to run in Azure.  The Guidance contains links to 13 focused areas.  In my opinion the “must reads” among them are the following.  They are required to gain a minimal effective understanding of what it takes to build “Highly Available, Scalable, Resilient Azure Services and Apps”.

  • Retry General Guidance (this has more detail on why there are lots more failures in the cloud)
  • Availability Check List
  • Scalability Check List
  • Monitoring and Diagnostics Guidance
  • Background Job Guidance

Disaster Recovery and High Availability for Azure Applications – This Microsoft document covers strategies and design patterns for implementing high availability across geographic regions to cope with catastrophic failures.  These patterns allow an Azure app or service to remain available even if an entire data center hosting the app or service ceases to function.  They also aid in reducing the Mean-Time-To-Recovery for your cloud hosted software.

Hardening Azure Applications – A book by Suren Machiraju and Suraj Gaurav, published by Apress in 2015.  It does a great job of identifying techniques to build “Highly Available, Scalable, Resilient Azure Services and Apps”, as well as covering security, latency, throughput, disaster recovery, instrumentation and monitoring, and the “economics of 9s” in SLAs.  It is invaluable in defining requirements and dealing with the business in these areas.  The target audience is Architects and CIOs, but Senior Developers and Technical Leads will also benefit from it.  We all have a steep cloud learning curve to climb in the area of understanding and defining an organization’s non-functional requirements for cloud services and apps, plus the techniques required to meet those requirements.  This book speeds one on their way.

Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications – An online and paperback book by Microsoft Patterns and Practices, published in 2014.  This provides excellent primers on key cloud topics like Data Consistency and Asynchronous Messaging, plus an excellent section with in-depth explanations of a number of “Problem Areas in the Cloud”.  So if you are unsure of terminology or technology terms, this is a good place to learn the basics.

Finally, a new way to aid building “Highly Available, Scalable, Resilient Azure Services and Apps” has just become available in Azure.  It is called Service Fabric.  I will cover that in future blogs.

George Stevens


Fast Development Time for Analytics and Data Transforms — Azure Stream Analytics

I’ve been able to get quite fast development times using Azure Stream Analytics (ASA) to analyze streams of unstructured data and to transform the format of such data, i.e. breaking the data up into different streams and/or reconstituting it into different structures and streams.  These are things we often need to do, and now we do not always have to write programs to do them.  In some cases we can use ASA instead.

The learning curve is quite manageable for ASA.  I found the longest part of the learning curve was working with ASA’s SQL like query language, particularly learning how to use its ability to do real time analysis of data streams via the Tumbling, Hopping, and Sliding time windows it offers.  But if you know the basics of SQL this only takes an hour or so to learn, with good examples at hand (in the links below).  I hope the links to ASA info sources will shorten your learning curve as much as they shortened mine, plus open your eyes to the possibilities ASA offers — It is a powerful, yet easy to use tool.
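To give a taste of that query language, here is a minimal sketch of a windowed query.  The input/output aliases and field names (telemetry-input, DeviceId, Temperature, EventTime) are hypothetical; it computes a per-device average over non-overlapping 60-second tumbling windows.

```sql
-- Average temperature per device over non-overlapping 60-second windows.
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp AS WindowEnd   -- the end of each tumbling window
INTO
    [aggregates-output]
FROM
    [telemetry-input] TIMESTAMP BY EventTime
GROUP BY
    DeviceId,
    TumblingWindow(second, 60)
```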

Here is a basic introductory example showing the process of building an ASA job and its query in the Azure Portal – “Get started with Azure Stream Analytics to process data from IoT devices“ by Jeff Stokes of Microsoft.  The screen shots of the Azure Portal for ASA in this link will give you an understanding of how to work with ASA and its query language.  Note that you need not write external code to get things working.  All your work, including writing and debugging the query, is done in the Azure Portal UI.  Note that you may need to write some C# code later for production monitoring of the ASA job and any Event Hubs it gets data from.

At the time of writing this blog article, ASA can input and output data from the following Azure services:

ASA Input Sources

  • Blob
  • Event Hub
  • IoT Hub
  • Reference Data in a Blob

ASA Output Destinations

  • SQL database
  • Blob
  • Event Hub
  • Table Storage
  • Document DB
  • Service Bus Queue or Topic
  • PowerBI

These inputs and outputs provide an amazing array of options for processing data at rest (residing in a Blob) or data in motion (streaming into an Event Hub or IoT hub).

Here are 2 common usage scenarios of ASA:

  • Searching for patterns in log files or data streams
    • This can include using ASA to analyze log files that are programmatically created by one’s software, to look for errors and warnings of certain kinds, or for telltale evidence of security problems. “SQL Server intrusion detection using Azure Event Hub, Azure Stream Analytics and PowerBI” by Francesco Cogno of Microsoft is an example of such a usage scenario.
    • Since ASA works on live data streams contained in Azure Event Hubs, it can be used to search for patterns in telemetry data from the outside world, e.g. IoT systems. For example, one could find each item in the input stream that had “Alert” in the field named “EventType” and place that record into a Service Bus Queue read by a Worker Role whose job it was to push alert messages to a UI (see the query sketch after this list).
  • Calculating real time statistics on-the-fly
    • An example is calculating moving averages, standard deviations, and being able to create alert records sent to an Alerts queue when such a calculation exceeds some preset level. “Using Stream Analytics with Event Hubs” by Kirk Evans of Microsoft presents an example of this usage scenario as does the first link, above.
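Here is a minimal sketch of the alert-routing scenario described above, assuming an Event Hub input aliased as telemetry-input and a Service Bus Queue output aliased as alerts-queue (all names hypothetical):

```sql
-- Copy only alert records into the queue a Worker Role reads.
SELECT
    *
INTO
    [alerts-queue]
FROM
    [telemetry-input]
WHERE
    EventType = 'Alert'
```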

Other Useful Info Sources

“How to debug your ASA job, step by step” by Venkat Chilakala of Microsoft. This can save lots of time when debugging.

“Query examples for common Stream Analytics usage patterns” by Jeff Stokes of Microsoft. For both simple and complex query techniques by example.

“Scale Azure Stream Analytics jobs to increase stream data processing throughput” by Jeff Stokes of Microsoft. This will give you in-depth knowledge of ASA.

“Stream Analytics & Power BI: A real-time analytics dashboard for streaming data” by Jeff Stokes of Microsoft. How to quickly display charts from data output by ASA.

“Azure Stream Analytics Forum” on MSDN. I have found this forum to contain some really useful posts. Plus you can ask questions as well.

I hope you find these info sources as useful as I did in opening up a new world of cloud-based data analysis and transformation!

George Stevens


No SQL? No Problem! – No SQL Info Sources

One of my current technology explorations is polyglot persistence.  I am now mostly through the reading stage, and it is quite clear that No SQL databases can be quite useful in certain situations, as can relational databases.  Using both No SQL and relational databases together in the same solution, each according to its strengths, is the essence of the polyglot persistence idea.

Here are some sources of information I’ve found to be most useful on No SQL databases, their strengths, weaknesses, and when and how they can be best used:

  • Martin Fowler’s book NoSQL Distilled (2013) has been immensely helpful in gaining an understanding of the various DBs, their strengths and weaknesses, and key underlying issues like eventual consistency, sharding, replication, data models, versioning, etc. It is a short little book that is truly distilled.  If you read only one thing, this should be it.
  • Also very useful is Data Access for Highly-Scalable Solutions (2013) from Microsoft Press and the Patterns and Practices group.  It is written with a cloud mindset, contains code examples, and goes into much more detail than Fowler’s book. Importantly, it shows examples of how to design for No SQL DBs. I found the first few pages of its Chapter 8, “Building a Polyglot Solution”, to be an excellent summary of the strengths, weaknesses, and issues one must deal with in using a No SQL database. That chapter also presents an excellent succinct summary of general guidelines on when to use a Key-Value DB, a Document DB, a Column-Family DB, and a Graph DB, on page 194 of the book.
  • The blog article I posted several months ago, CQRS Info Sources, contains links to good articles on techniques that themselves use No SQL persistence (sometimes by implication).  Reading these links aided me in seeing areas where NoSQL DBs could be useful.
  • Microsoft Press’s book Cloud Design Patterns contains a lot of useful information on patterns that can use NoSQL DBs; guidance on things like Data Partitioning and Data Replication; plus a primer on Data Consistency that promotes a good understanding of eventual consistency versus strong consistency (usually available with relational DBs via transactions). Some of the patterns it describes that can be implemented with a NoSQL DB are Event Sourcing, CQRS, Sharding, and the Materialized View.

Finally, keep in mind that both books listed above advise that relational databases will typically be the best choice for the majority of database needs in a system, and that No SQL DBs should be used only when there are strong reasons to do so.  The costs of not using a relational DB, with its capability to automatically roll back transactions spanning multiple tables, can be quite substantial due to the complexity of programming the error compensation (rollbacks) by hand, as the sketch below illustrates.
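Here is a minimal sketch of what that hand-written compensation looks like when a single business operation writes to two stores that share no transaction.  The store interfaces and names are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public interface IOrderDocumentStore            // e.g. a Document DB
{
    Task InsertAsync(Order order);
    Task DeleteAsync(Guid orderId);
}

public interface IInventoryStore                // e.g. a relational DB
{
    Task ReserveStockAsync(Order order);
}

public class Order
{
    public Guid Id { get; set; }
}

public class OrderService
{
    private readonly IOrderDocumentStore orders;
    private readonly IInventoryStore inventory;

    public OrderService(IOrderDocumentStore orders, IInventoryStore inventory)
    {
        this.orders = orders;
        this.inventory = inventory;
    }

    public async Task PlaceOrderAsync(Order order)
    {
        await orders.InsertAsync(order);              // write #1
        try
        {
            await inventory.ReserveStockAsync(order); // write #2
        }
        catch
        {
            // No distributed transaction will undo write #1,
            // so compensate by hand and surface the failure.
            await orders.DeleteAsync(order.Id);
            throw;
        }
    }
}
```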

George Stevens

Creative Commons License

dotnetsilverlightprism blog by George Stevens is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Based on a work at dotnetsilverlightprism.wordpress.com.