Maintaining High Availability in a Multi-Tenanted SaaS Solution
Introduction
SaaS solution providers can live or die on the basis of availability. There is a massive shift of responsibility to the vendor, who must ensure not only that users and their data are protected and safe (a massive and separate topic) but also that the solution is available when clients want to use it.
This is no different to browsing Netflix only to be told "Sorry, we're having some technical difficulties at the moment" or "There are too many people watching that movie right now". A SaaS product from a Fintech such as Fenergo might seem a million miles from a streaming service, but there is commonality in the target outcomes.
If your platform suffers from availability issues, you will lose customers, breach SLAs and invalidate contracts. Your business will not last. The tolerance for downtime is so low in our industry that meeting availability targets is only one aspect: vendors also need to prove empirically that their strategy for assuring availability is consistent with industry best practice.
What Does High Availability Mean in a Multi-Tenanted SaaS Environment?
High Availability SaaS means that a system has been designed to provide fault-tolerant, resilient access to the software a vendor has sold. Usually this is achieved by engaging multiple independent sets of infrastructure, each working in concert to accommodate any failures that might occur. Put simply, if Site A goes offline, Sites B and C remain operational and the client is not impacted.
For a Multi-Tenanted SaaS platform, that same scenario plays out with many clients using the platform simultaneously, all expecting an experience on par with what they would achieve had they deployed to their own infrastructure. To sell availability as part of a service offering means a client is being asked to "trust" that a vendor has designed a solution which can meet and even exceed what the client would do themselves. The Netflix comparison fades somewhat here, because on a 1:1 basis each Fenergo client is using our software as an operationally critical part of their business. The same cannot be said for someone who wants to sit down and watch an episode of The Crown, but then Netflix does support more than 200 million subscribers globally.
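The "Site A goes offline, Sites B and C carry on" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular platform routes traffic: the `Site` class, the `pick_site` function and the `healthy` flag are hypothetical names, and in a real system the health status would come from health checks rather than a hard-coded value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    name: str      # hypothetical label, e.g. "Site A"
    healthy: bool  # assumption: set by some external health check


def pick_site(sites: list[Site]) -> Optional[Site]:
    """Return the first healthy site, or None if every site is offline."""
    for site in sites:
        if site.healthy:
            return site
    return None


# Site A is offline; the request is served from the next healthy site.
sites = [
    Site("Site A", healthy=False),
    Site("Site B", healthy=True),
    Site("Site C", healthy=True),
]
chosen = pick_site(sites)
print(chosen.name if chosen else "no site available")
```

The client only sees an outage when `pick_site` returns `None`, which requires every independent site to be down at the same time.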
Strategies for High Availability
Fenergo selected AWS early in our SaaS Journey and we deliberately selected a “Single Cloud” strategy because we wanted to embrace all the benefits available from cloud native services. (https://resources.fenergo.com/engineering/single-cloud-vs-multi-cloud-strategy) AWS also offered the most complete set of available (constantly evolving) products and their global infrastructure footprint is second to none. In selecting AWS, our engineering team knew that we could bake in resilience from the ground up and design our solution with high availability as a native feature.
In traditional system design for enterprise platforms, the process is often to stack the components on top of one another, focus on developing the solution, and then scale the infrastructure horizontally and vertically for higher environments and production.
Resilience came from the multiple combined stacks, and availability was serviced by provisioning infrastructure that was powerful enough to cater for peak demand. Disaster Recovery was catered for by duplicating production in a separate data centre, preferably in a HOT state, and keeping data in sync across the data centres. This entire approach COULD work in a cloud environment, where it is known as a "Lift and Shift" approach to cloud adoption. It has all the disadvantages and most of the cost of a traditional system design without many of the cloud benefits (arguably it costs more, because you need an ops team capable of managing in the cloud).
However, the objective of the above architecture is to achieve high availability and fault tolerance. A SaaS solution must be available for clients when THEY want to use it and must eliminate impacts caused by issues with the underlying infrastructure. This is EXACTLY what the best of the cloud has to offer if you design with resilience in mind and take advantage of serverless components. The cloud approach differs from the traditional one at every stage, from early DEV through to production deployment.
High Availability in the Cloud
Before looking at a comparative cloud solution design, remember the "Shared Responsibility Model". AWS are responsible for the availability of the infrastructure that underpins the cloud. They allow customers (such as Fenergo) to select, at a granular level, what to run and where across their vast plane of availability. Each AWS Availability Zone represents a sophisticated data centre capable of delivering all the services a customer would require to deploy their solutions at scale. Dependency on a single data centre does not represent resilience. To get resilience, you must spread your deployment across multiple Availability Zones. In physical terms, looking at the picture below, you would need to travel 50 kilometres into the horizon off the top right and then again off the top left. At each of those locations you will find an equally impressive data centre facility. These three Availability Zones are interconnected with low-latency, high-speed fibre, which allows a single logical data centre to be conceived.
One of three Availability Zones in Dublin, Ireland
AWS best practice guidelines suggest that clients looking to stand up resilient workloads on the cloud should architect them to take advantage of multiple Availability Zones. Doing so will result in near negligible downtime for those workloads in the event of a single AZ outage (which does happen from time to time). It's a risk vs cost debate. A higher appetite for risk will result in a lower cost, and the lower the risk appetite, the higher the cost. It is naturally more expensive to deploy across two AZs than one.
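To put rough numbers on the risk side of that debate, the sketch below estimates how availability improves as AZs are added. It rests on two loud assumptions: the single-AZ availability of 0.99 is a made-up illustrative figure (not an AWS-published number), and AZ failures are treated as independent, which is a simplification. The function names are our own.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single_az: float, az_count: int) -> float:
    """Availability of a workload that stays up while at least one AZ is up.

    Assumes AZ failures are independent: the workload is down only when
    every AZ is down at once, which has probability (1 - single_az) ** n.
    """
    return 1 - (1 - single_az) ** az_count


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year."""
    return (1 - availability) * HOURS_PER_YEAR


SINGLE_AZ = 0.99  # illustrative assumption only

for az_count in (1, 2, 3):
    a = combined_availability(SINGLE_AZ, az_count)
    hours = annual_downtime_hours(a)
    print(f"{az_count} AZ(s): availability {a:.6f}, about {hours:.3f} h/year down")
```

Under these assumptions each extra AZ cuts expected downtime by roughly two orders of magnitude, which is the benefit that has to be weighed against the extra cost of running in more than one AZ.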
Resilient Cloud Solution Design
Comparing traditional strategies with a cloud strategy is 🍎 and 🍊. Because Fenergo are using "serverless" for the most part, horizontal and vertical scaling are quite different. Instead of provisioning multiples of EC2 instances, we deploy compute and storage services (such as Lambda or DynamoDB) and select which AZs to make those services available from. For maximum resilience, we actually deploy across three AZs (where available). The AWS "Well Architected Framework" defines resilience as "The capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components". This is the objective we aimed for with our resilient design.
Let's once again look at the Development through to Production comparison to understand where resilience is embedded. Developers use AWS cloud services to deploy code assets and configuration to selected AWS AZs. These services are capable of scaling to production levels, but of course that is not required in non-production deployments. Using serverless technology eliminates the need for development teams to manage infrastructure and server software. They can focus on creating the functionality and deploying code. AWS tooling that integrates seamlessly with modern IDE platforms like Visual Studio, together with IAC Ops pipelines, means that building and deploying to the cloud (once the up-front investment to create the pipelines has been made) becomes a relatively effortless task for developers.
The key here is that Development is simply a profile and a selection of variables. Lambdas can be tuned to have a certain amount of memory, compute resources and scalability, as well as availability across AZs. DynamoDB works the same way: a non-production table can be scoped to one or more AZs, a given size and so on. All serverless resources work this way. Then, when the time comes to deploy into production, it's the CI/CD pipeline (not the developer) which pushes the changes and applies a production profile to the deployment. This production profile is what conforms to High Availability by deploying across multiple AZs (and even multiple Regions). The infrastructure of servers and networking components is handled by AWS, allowing Fenergo to focus on the solution.
If we look at the Fenergo product from a deployed perspective (as below), it can be seen that each AZ is part of the production deployment. This delivers a solution that can be described as "Always On - Active - Active - Active". A failure at one or two AZs will not affect the availability of the application, and we have achieved a high degree of resilience. Let's not forget the aerial shot of a single Availability Zone from above, which emphasises how powerful a configuration this truly is. Not only that, but with the build pipelines handling the complexity of deployment and serverless removing the complexity of infrastructure and networking, we really can focus on the solution.
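The idea that an environment is "simply a profile and selection of variables" can be sketched as plain data. Everything below is hypothetical: the `DeploymentProfile` fields, the numbers and the `profile_for` helper are illustrative names of our own, not real AWS settings or Fenergo's actual pipeline configuration. The AZ identifiers are only example values for the Ireland region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentProfile:
    name: str
    availability_zones: tuple[str, ...]  # where the workload is made available
    lambda_memory_mb: int                # illustrative tuning variable
    max_concurrency: int                 # illustrative scaling ceiling


# Hypothetical profiles: same code assets, different variables.
PROFILES = {
    "dev": DeploymentProfile("dev", ("eu-west-1a",), 256, 5),
    "production": DeploymentProfile(
        "production", ("eu-west-1a", "eu-west-1b", "eu-west-1c"), 1024, 500
    ),
}


def profile_for(environment: str) -> DeploymentProfile:
    """Look up the profile a pipeline would apply for an environment."""
    try:
        return PROFILES[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


def is_highly_available(profile: DeploymentProfile, minimum_azs: int = 2) -> bool:
    """A profile counts as highly available if it spans enough distinct AZs."""
    return len(set(profile.availability_zones)) >= minimum_azs


for env in ("dev", "production"):
    p = profile_for(env)
    print(f"{env}: {len(p.availability_zones)} AZ(s), HA={is_highly_available(p)}")
```

Keeping the profile separate from the code is what lets a pipeline, rather than a developer, decide that production spans multiple AZs while a development deployment stays small and cheap.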
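A small simulation makes the "Active - Active - Active" claim concrete by walking through every combination of AZ failures. It is a sketch under stated assumptions: the AZ names are placeholders, and it assumes any surviving AZ can absorb the full load, which is the property serverless scaling is relied on to provide.

```python
from itertools import combinations

AZS = ("az-1", "az-2", "az-3")  # placeholder names for three active AZs


def surviving_azs(failed: set[str]) -> list[str]:
    """AZs still serving traffic after the given ones have failed."""
    return [az for az in AZS if az not in failed]


def is_available(failed: set[str]) -> bool:
    """The application is available while at least one AZ survives."""
    return len(surviving_azs(failed)) > 0


# Enumerate every failure scenario, from no failures to all three AZs down.
for k in range(len(AZS) + 1):
    for failed in combinations(AZS, k):
        status = "available" if is_available(set(failed)) else "UNAVAILABLE"
        print(f"failed={list(failed)} -> {status}")
```

Of the eight scenarios, only the one in which all three AZs fail together takes the application offline.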
Summary
Across this post we looked at what resilience means in traditional terms and how we at Fenergo have designed our solution to be resilient from the ground up.
From our clients' perspective, the hard work of implementation is done as part of our SaaS offering, but we must also explain how that High Availability has been achieved. As has been seen, the investment in best practice strategies and pipelines offers maximum flexibility for our own SDLC process by taking advantage of the granular controls AWS offer on their serverless components. This has to be done with efficiency in mind, but also with the commitments we make to our clients in terms of SLAs, and the balance between the two is always changing.