On August 24, 2006, Amazon made a test version of its Elastic Compute Cloud (EC2) public. EC2 allowed renting infrastructure and accessing it over the internet. The term “Cloud Computing” was coined a year later to describe a phenomenon that was not limited to renting infrastructure over the internet but encompassed a wide array of technology service offerings, including Infrastructure as a Service (IaaS), web hosting, Platform as a Service (PaaS), Software as a Service (SaaS), network, storage, High Performance Computing (HPC) and many more.
The maturity of technologies like the internet, high-performance networks, virtualization, and grid computing played a vital role in the evolution and success of cloud computing. Cloud platforms are highly scalable, can be made available on demand, can be scaled up or down quickly as required, and are very cost effective. Enterprises leverage these factors to foster innovation, which is the survival and growth mantra for new age businesses.
An upward surge in cloud adoption by business enterprises of all sizes has confirmed that it is more than a fad and is here to stay. As cloud platforms mature and some of the genuine inhibitions regarding security and data ownership are addressed, more and more businesses will find themselves moving to the cloud.
Designing complex and highly distributed systems has always been a daunting task. Cloud platforms provide many of the infrastructure elements and building blocks that facilitate building such applications, opening the door to unlimited possibilities. But with the opportunities come challenges. The power that cloud platforms offer doesn’t guarantee a successful implementation; leveraging them correctly does.
This article introduces readers to some of the popular and useful architectural patterns that are often implemented to harness the potential of cloud platforms. The patterns themselves are not specific to cloud platforms but can be implemented effectively there. They are also generic and in most cases can be applied to various cloud scenarios like IaaS and PaaS. Wherever possible, the services (or tools) from Azure, AWS, or both that are most likely to help implement the pattern being discussed have been cited.
HORIZONTAL SCALING

Traditionally, getting a more powerful computer (with a better processor, more RAM, or bigger storage) was the only way to get more computing power when needed. This approach was called Vertical Scaling (Scaling Up). Apart from being inflexible and costly, it had some inherent limitations: the power of a single piece of hardware can’t be increased beyond a certain threshold, and the monolithic structure of the infrastructure can’t be load balanced. Horizontal Scaling (Scaling Out) takes a better approach. Instead of making one piece of hardware bigger and bigger, it gets more computing resources by adding multiple computers, each having limited computing power. This approach doesn’t limit the number of computers (called nodes) that can participate, and so provides theoretically infinite computing resources. Individual nodes can be of limited size themselves, but as many of them as required can be added or removed to meet changing demand. This gives practically unlimited capacity together with the flexibility of adding or removing nodes as requirements change, and the nodes can be load balanced.
In Horizontal Scaling there are usually different types of nodes performing specific functions, e.g., web server, application server, or database server. It is likely that each of these node types will have a specific configuration. Each of the instances of a node type (e.g., web server) could have a similar or different configuration. Cloud platforms allow creation of node instances from images, and many other management functions can be automated. Keeping that in mind, using homogeneous nodes (nodes with identical configurations) for a specific node type is a better approach.
Horizontal Scaling is very suitable for the scenarios where:
- Enormous computing power is required, now or in the future, that can’t be provided even by the largest available computer
- The computing needs change and may have drops and spikes that may or may not be predictable
- The application is business critical and can’t afford a slowdown in the performance or a downtime
This pattern is typically used in combination with the Node Termination Pattern (which covers concerns when releasing compute nodes) and the Auto-Scaling Pattern (which covers automation).
It is very important to keep the nodes stateless and independent of each other (Autonomous Nodes). Applications should store their user session details on a separate node with persistent storage: in a database, cloud storage, a distributed cache, etc. Stateless nodes ensure better failover, as a new node that comes up after a failure can always pick up the details from there. It also removes the need for sticky sessions, so simple and effective round robin load balancing can be implemented.
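To make the idea concrete, here is a minimal sketch (in Python, with invented class names) of round robin load balancing over stateless nodes. The in-memory `SessionStore` stands in for the external persistent store (database, cloud storage, or distributed cache) the text describes; because no node keeps session state locally, any node can serve any request.

```python
import itertools

class SessionStore:
    """Stands in for persistent external storage (database, cache, etc.)."""
    def __init__(self):
        self._data = {}

    def save(self, session_id, state):
        self._data[session_id] = state

    def load(self, session_id):
        return self._data.get(session_id)

class StatelessNode:
    """A compute node that keeps no session state of its own."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        # Fetch session state from the shared store, never from local memory.
        state = self.store.load(session_id) or {"hits": 0}
        state["hits"] += 1
        self.store.save(session_id, state)
        return self.name, state["hits"]

class RoundRobinBalancer:
    """Simple round robin: no sticky sessions needed."""
    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def route(self, session_id):
        return next(self._cycle).handle(session_id)

store = SessionStore()
balancer = RoundRobinBalancer(
    [StatelessNode(f"node-{i}", store) for i in range(3)])

# Three requests for the same session land on three different nodes,
# yet the hit count is continuous because the state lives externally.
results = [balancer.route("session-42") for _ in range(3)]
print(results)
```

Each tuple in the output pairs the node that served the request with the running hit count, showing that failover to (or routing through) any node preserves the session.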
Public cloud platforms are optimized for horizontal scaling. Computer instances (nodes) can be created, scaled up or down, load balanced, and terminated on demand. Most of them also allow automated load balancing, failover, and rule-based horizontal scaling.
Since horizontal scaling is meant to cater to changing demands, it is important to understand the usage patterns. Because there are multiple instances of various node types and their numbers can change dynamically, collecting the operational data, and combining and analyzing it to derive any meaning, is not an easy task. There are third-party tools available to automate this task, and Azure too provides some facilities. The Windows Azure Diagnostics (WAD) Monitor is a platform service that can be used to gather data from all of your role instances and store it centrally in a single Windows Azure Storage account. Once the data is gathered, analysis and reporting become possible. Another source of operational data is the Windows Azure Storage Analytics feature, which includes metrics and access logs from Windows Azure Storage blobs, tables, and queues.
Microsoft provides the Windows Azure portal and Amazon provides the Amazon Web Services dashboard as management portals. Both also provide APIs for programmatic access to these services.
QUEUE CENTRIC WORKFLOW
Queues have long been used to implement asynchronous processing. The Queue-Centric Workflow pattern implements asynchronous delivery of command requests from the user interface to a back-end processing service. This pattern suits cases where a user action may take a long time to complete and the user can’t be made to wait that long. It is also an effective solution where the process depends on another service that might not always be available. Since cloud native applications can be highly distributed and have back-end processes they need to connect with, this pattern is very useful. It successfully decouples the application tiers and ensures delivery of the messages, which is critical for many applications dealing with financial transactions. Websites dealing with media and file uploads, batch processes, approval workflows, etc. are some of the applicable scenarios.
Since the queue based approach offloads part of the processing to the queue infrastructure that can be provisioned and scaled separately, it assists in optimizing the computing resources and managing the infrastructure.
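The decoupling described above can be sketched in a few lines. In this hedged illustration, Python’s in-process `queue.Queue` stands in for a cloud queue service such as Azure Queue storage or Amazon SQS; the function names are invented for the example. The front end acknowledges immediately, and a back-end worker drains the queue on its own schedule.

```python
import queue

# Stand-in for a durable cloud queue (e.g., Azure Queues or Amazon SQS).
work_queue = queue.Queue()

def front_end_submit(command):
    """Accept the request and acknowledge without doing the work."""
    work_queue.put(command)
    return "accepted"  # the user is not kept waiting

def back_end_worker():
    """Process whatever commands have accumulated in the queue."""
    processed = []
    while not work_queue.empty():
        command = work_queue.get()
        processed.append(f"done:{command}")  # the slow work happens here
        work_queue.task_done()
    return processed

print(front_end_submit("resize-image-1"))  # → accepted
print(front_end_submit("resize-image-2"))  # → accepted
print(back_end_worker())                   # → ['done:resize-image-1', 'done:resize-image-2']
```

Because the queue is the only contract between the tiers, the worker fleet can be provisioned and scaled independently of the front end, exactly as the text notes.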
Although the Queue-Centric Workflow pattern has many benefits, it poses challenges that should be considered beforehand for its effective implementation.
Queues are supposed to ensure that the messages received are processed successfully at least once. For this reason the messages are not deleted permanently until the request is processed successfully, and they can be made available repeatedly after a failed attempt. Since a message can be picked up multiple times and from multiple nodes, keeping the business process idempotent (so that processing the same message multiple times doesn’t alter the final result) can be a tricky task. This only gets more complicated in cloud environments, where processes might be long running, span across service nodes, and involve multiple data stores, possibly of multiple types.
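One common way to achieve idempotency under at-least-once delivery is to record the ids of messages already applied and skip duplicates. The sketch below assumes each message carries a unique id; the message shape and the in-memory id set are illustrative (in production the set would live in durable storage alongside the business data).

```python
# Ids of messages already applied; in production this record would be
# kept in durable storage, ideally updated atomically with the balance.
processed_ids = set()
balance = 0

def apply_payment(message):
    """Apply a payment message; redelivery leaves the result unchanged."""
    global balance
    if message["id"] in processed_ids:
        return balance            # duplicate delivery: no effect
    balance += message["amount"]
    processed_ids.add(message["id"])
    return balance

msg = {"id": "txn-001", "amount": 50}
apply_payment(msg)
apply_payment(msg)  # redelivered, e.g. after a lost acknowledgement
print(balance)      # → 50, not 100
```

Without the id check, the redelivered message would double the balance, which is exactly the hazard the pattern must guard against for financial workloads.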
Another issue that queues pose is that of poison messages. These are messages that can’t be processed due to some problem (e.g., an email address that is too long or has invalid characters) and keep reappearing in the queue. Some queues provide a dead letter queue to which such messages are routed for further analysis. The implementation should consider the poison message scenarios and how to deal with them.
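A typical defense, sketched below, is to track how many times a message has been attempted and move it to a dead letter queue once a threshold is crossed. The threshold, message shape, and in-memory queues are assumptions for illustration; real queue services such as Amazon SQS expose an equivalent receive count and redrive mechanism.

```python
import queue

MAX_ATTEMPTS = 3          # illustrative threshold
main_queue = queue.Queue()
dead_letters = []         # stands in for a dead letter queue

def process(message):
    # Fails for any message with an invalid email address,
    # simulating a request that can never succeed.
    if "@" not in message["email"]:
        raise ValueError("invalid email")

def drain():
    while not main_queue.empty():
        message = main_queue.get()
        try:
            process(message)
        except ValueError:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)  # set aside for analysis
            else:
                main_queue.put(message)       # make it visible again

main_queue.put({"email": "not-an-address", "attempts": 0})
drain()
print(len(dead_letters))  # → 1: the poison message no longer loops forever
```

Without the attempt counter, the failing message would be re-enqueued indefinitely, wasting worker capacity on every pass.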
Given the inherent asynchronous processing nature of queues, applications implementing this pattern need to find ways to notify the user about the status and completion of the initiated tasks. Long polling mechanisms are also available for requesting the status from the back-end service.
Microsoft Azure provides two mechanisms for implementing asynchronous processing: Queues and Service Bus. Queues allow two applications to communicate using a simple method: one application puts a message in the queue and another application picks it up. Service Bus provides a publish-and-subscribe mechanism. An application can send messages to a topic, while other applications can create subscriptions to this topic. This allows one-to-many communication among a set of applications, letting the same message be read by multiple recipients. Service Bus also allows direct communication through its relay service, providing a secure way to interact through firewalls. Note that Azure charges for each de-queuing request even if there are no messages waiting, so necessary care should be taken to reduce the number of such unnecessary requests.
AUTO SCALING

Auto Scaling maximizes the benefits of Horizontal Scaling. Cloud platforms provide on-demand availability, scaling, and termination of resources. They also provide mechanisms for gathering signals of resource utilization and for automated management of resources. Auto Scaling leverages these capabilities and manages cloud resources (adding more when more resources are required, releasing existing ones when they are no longer required) without manual intervention. In the cloud, this pattern is often applied together with the Horizontal Scaling pattern. Automating the scaling not only makes it effective and error free, but the optimized use cuts down cost as well.
Since horizontal scaling can be applied to the application layers individually, auto scaling has to be applied to them separately. Known events (e.g., overnight reconciliation, quarterly processing of region-wise data) and environmental signals (e.g., a surging number of concurrent users, consistently climbing site hits) are the two primary sources that can be used to set the auto scaling rules. Apart from that, rules can be constructed based on inputs like CPU usage, available memory, or the length of the queue. More complex rules can be built based on analytical data gathered by the application, like the average processing time for an online form.
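Such a rule can be expressed as a small decision function. The sketch below is a hedged illustration: the CPU and queue-length thresholds, and the floor and ceiling on node count, are invented example values, not provider defaults.

```python
# Illustrative bounds: an SLA floor and a cost ceiling on node count.
MIN_NODES, MAX_NODES = 2, 10

def target_node_count(current, cpu_percent, queue_length):
    """Rule-based scaling decision from two environmental signals."""
    if cpu_percent > 75 or queue_length > 100:
        proposed = current + 1   # scale out under pressure
    elif cpu_percent < 25 and queue_length < 10:
        proposed = current - 1   # scale in when idle
    else:
        proposed = current       # hold steady in the comfortable band
    # Never violate the floor or ceiling, whatever the signals say.
    return max(MIN_NODES, min(MAX_NODES, proposed))

print(target_node_count(4, cpu_percent=80, queue_length=5))   # → 5
print(target_node_count(4, cpu_percent=10, queue_length=2))   # → 3
print(target_node_count(2, cpu_percent=10, queue_length=2))   # → 2 (floor holds)
```

Keeping a dead band between the scale-out and scale-in thresholds, as here, avoids oscillating between adding and removing nodes on every evaluation.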
Cloud service providers bill instances based on clock hours. Also, the SLAs they provide may require a minimum number of resources to be active all the time. See that implementing auto scaling too aggressively doesn’t end up being costly or put the business outside the SLA rules. The auto-scale feature includes alerts and notifications that should be set and used wisely. Auto scaling can also be enabled or disabled on demand if there is a need.
The cloud platforms provide APIs that allow building auto scaling into the application or creating a custom-tailored auto scaling solution. Both Azure and AWS provide their own auto-scaling solutions, which tend to be more effective; they come with a price tag though. There are some third-party products as well that enable auto scaling.
Azure provides a software component named the Windows Azure Auto-scaling Application Block (WASABi for short) that cloud native applications can leverage to implement auto scaling.
BUSY SIGNAL PATTERN
Requests to cloud services (e.g., a data service or management service) may experience a transient failure when the service is very busy. Similarly, services that reside outside of the application, within or outside of the cloud, may at times fail to respond to a service request immediately. Often the timespan for which the service is busy is very short, and just another request might be successful. Given that cloud applications are highly distributed and connected to such services, a premeditated strategy for handling such busy signals is very important for the reliability of the application. In cloud environments such short-lived failures are normal behavior and are hard to diagnose, so it makes even more sense to think them through in advance.
There could be many possible reasons for such failures (an unusual spike in load, a hardware failure, etc.). Depending upon the circumstances, applications can take many approaches to handle busy signals: retry immediately, retry after a delay, or retry with a delay that grows by fixed increments (linear backoff) or exponential increments (exponential backoff). The application should also decide when to stop further attempts and throw an exception. Besides that, the approach could vary depending upon the type of the application: whether it handles user interactions directly, is a service, or is a back-end batch process, and so on.
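The exponential backoff variant can be sketched as follows. This is an illustrative implementation, not any provider’s library: the delays are computed but not slept on, so the example runs instantly (a real client would call `time.sleep(delay)` before each retry), and `ConnectionError` stands in for whatever transient-failure exception the service client raises.

```python
def retry_with_backoff(operation, max_attempts=4, base_delay=0.5):
    """Retry a flaky operation, doubling the delay after each failure."""
    delays = []
    for attempt in range(max_attempts):
        try:
            return operation(), delays
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                       # give up: surface the error
            delay = base_delay * (2 ** attempt)
            delays.append(delay)            # 0.5, 1.0, 2.0, ...
            # a real client would wait here: time.sleep(delay)

# Simulated transient busy signal: fails twice, then succeeds.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("busy")
    return "ok"

result, delays = retry_with_backoff(flaky_service)
print(result, delays)  # → ok [0.5, 1.0]
```

Capping `max_attempts` implements the “decide when to stop and throw” advice: after the budget is exhausted, the failure propagates to the caller instead of retrying forever.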
Azure provides client libraries for most of its services that allow programming the retry behavior into the applications accessing those services. They provide easy implementation of the default behavior and also allow customization. A library known as the Transient Fault Handling Application Block (also known as Topaz) is available from Microsoft.
NODE FAILURE

Nodes can fail due to various reasons like hardware failure, an unresponsive application, or auto scaling. Since these events are common in cloud scenarios, applications need to handle them proactively. Since applications might be running on multiple nodes simultaneously, they should remain available even when an individual node experiences a shutdown. Some failure scenarios may send signals in advance but others might not, and similarly, different failure scenarios may or may not be able to retain the data saved locally. Deploying one node more than required (N+1 deployment), catching and processing platform-generated signals when available (both Azure and AWS send alerts for some node failures), building a robust exception handling mechanism into the application, keeping application and user data in reliable storage, avoiding sticky sessions, and fine-tuning long running processes are some of the best practices that help handle node failures gracefully.
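For the failure scenarios that do send a signal in advance, a worker can drain gracefully instead of dying mid-task. The sketch below is a hedged illustration using a POSIX termination signal: the handler merely flips a flag, and the work loop checks it between tasks. For determinism the handler is invoked directly here rather than by delivering a real signal.

```python
import signal

shutting_down = False

def on_terminate(signum, frame):
    """Advance warning received: stop taking new work."""
    global shutting_down
    shutting_down = True

# Register for the platform's termination warning (SIGTERM here).
signal.signal(signal.SIGTERM, on_terminate)

def worker_loop(tasks):
    """Process tasks until done or until a shutdown signal arrives."""
    completed = []
    for task in tasks:
        if shutting_down:
            break   # leave remaining tasks for another node to pick up
        completed.append(task)
    return completed

# Simulate the platform delivering the termination warning.
on_terminate(signal.SIGTERM, None)
print(worker_loop(["a", "b"]))  # → [] because shutdown arrived first
```

Combined with a queue-centric design, the abandoned tasks simply become visible again and are picked up by a surviving node, which is what makes the node’s disappearance harmless.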
MULTI SITE DEPLOYMENT
Applications might need to be deployed across datacenters to implement failover across them. A multi-site deployment can also improve performance by reducing network latency, as requests can be routed to the nearest datacenter. At times there may be specific reasons for multi-site deployments, like government regulations, unavoidable integration with a private datacenter, or extremely high availability and data safety requirements. Note that there can be equally valid reasons that do not allow multi-site deployments, e.g., government regulations that forbid storing business-sensitive or private information outside the country. Due to cost and complexity, such deployments should be considered carefully before implementation.
Multi-site deployments call for two important activities: directing users to the nearest possible datacenter, and replicating data across the data stores if the data needs to be kept the same. Both of these activities mean additional cost.
Multi-site deployments are complicated, but the cloud services provide networking and data related services for geographic load balancing, cross-datacenter failover, database synchronization, and geo-replication of cloud storage. Both Azure and Amazon Web Services have multiple datacenters across the globe. Windows Azure Traffic Manager and Amazon Route 53 allow configuring their services for geographical load balancing.
Note that the services for geographical load balancing and data synchronization may not be 100% resilient to all types of failures. The service description must be matched against the requirements to understand the potential risks and mitigation strategies.
The cloud is a world of possibilities. There are many other patterns that are pertinent to cloud-specific architecture. Taking it further, in real-life business scenarios more than one of these patterns will need to be implemented together to make a solution work. Some other crucial aspects that are important for architects are multi-tenancy, maintaining the consistency of database transactions, separation of commands and queries, etc. In a way each business scenario is unique and so needs specific treatment. The cloud being a platform for innovation, even well-established architecture patterns may be implemented there in novel ways to solve these specific business problems.
The cloud is a complex and evolving environment that fosters innovation. Architecture is important for any application, and even more important for cloud based applications. Cloud based solutions are expected to be flexible to change, scale on demand, and minimize cost. Cloud offerings provide the necessary infrastructure, services, and other building blocks that must be put together in the right way to provide the maximum Return on Investment (ROI). Since the majority of cloud applications are distributed and spread over cloud services, finding and implementing the right architecture patterns is very important for success.