
The On-Call Engineer's Guide to a Happy Christmas

If you're a software engineer who loves what you do, nothing beats the feeling of solving a challenge. Then there's the other side of the job: being "On-Call" for support.

At Fenergo, we manage a large, globally deployed SaaS platform for our clients. Recently I spoke with Steve O'Rourke, one of our Directors of SaaS Engineering, about how we handle On-Call support at scale and, importantly, to ask the question: is it possible to be On-Call and still have a Happy Christmas?

- - - - - - - - - - - - - - - - - 

Q: Steve, can you give me an idea of the scale of deployment the SaaS Engineering Team supports?

A: Well, we have clients all around the globe, and in order to meet their data residency and application proximity requirements, we also have deployments all around the globe. Specifically, we have multiple deployments across EMEA, Canada, the United States and APAC. In each of those regions there are multiple instances of the application, and each instance is spread across multiple AWS Availability Zones. On top of that, we offer Multi-Region Disaster Recovery, and each of our clients is allocated multiple tenants per deployment. So . . . to say we manage a LOT of moving parts is an understatement.

- - - - - - - - - - - - - - - - - 

Q: Wow, that sounds like a large deployment. Would it be fair to say that not much depends on manual handling?

A: Absolutely. It's not viable to have manual steps in place at that scale, and it wouldn't be best practice either. So from front to back of our SDLC, all the way from DEV check-in to Production Deployment and Support, we have automated processes across all pipelines, with multiple approval and gatekeeping steps to ensure nothing but working, fully tested, secure code makes it into the hands of our clients.

- - - - - - - - - - - - - - - - - 

Q: Can you describe the key difference between supporting a multi-tenant SaaS Platform on the cloud versus a more traditional software offering? 

A: That question is really "What does the new world look like compared to the old?" With SaaS, the "as a Service" component means that we as the vendor are responsible for the availability and resilience of the software to a specified SLA. We have a 99.9% uptime commitment, which allows for only 8.77 hours of downtime per year. Making sure the platform is available to clients is our top priority. To confidently achieve that availability, we need to be monitoring all the moving parts of all the deployments across all of the regions, all of the time. The application itself is made up of many sub-components, so collectively all that monitoring is a firehose of data, way more than any person or group of people could handle by watching screens or tracing logs. But this is one of the key benefits of using a large-scale cloud vendor like AWS: they have the facilities and infrastructure that SaaS vendors like ourselves can use to support our clients. Provided you are doing it properly, that is. That's the trick.
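A quick back-of-the-envelope calculation shows where that 8.77-hour figure comes from:

```python
# Back-of-the-envelope downtime budget for a 99.9% uptime SLA.
HOURS_PER_YEAR = 365.25 * 24      # ~8,766 hours in an average year
uptime_sla = 0.999                # 99.9% availability commitment

allowed_downtime = HOURS_PER_YEAR * (1 - uptime_sla)
print(f"Allowed downtime: {allowed_downtime:.2f} hours per year")  # ~8.77 hours
```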

- - - - - - - - - - - - - - - - - 

Q: So how do you drink from a firehose? How can you get a meaningful view across so much data?

A: It all comes down to best practice and solid SDLC principles. Every executing piece of code logs activity and errors. When clients make API calls, our services respond with good and bad HTTP codes, 2XX, 3XX, 4XX and 5XX, and that's what you monitor. When operating at scale, you use thresholds of healthy versus unhealthy in terms of error volume. Clients can cause errors by submitting bad data, but those tend to be the exception rather than the norm, or at least localised to a specific client in a particular region. So a few errors while a client is learning or testing an API won't breach a threshold, but once a pattern of errors begins it can be indicative of something more meaningful, and if a threshold is breached, an automatic alarm is raised against the area of the application the errors are coming from.
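As a rough illustration of that idea (a minimal sketch, not Fenergo's actual tooling; the service names and threshold are invented), counting error responses per service over a window and flagging anything over a limit might look like this:

```python
from collections import Counter

ERROR_THRESHOLD = 50  # hypothetical errors-per-window limit for each service

def check_thresholds(responses):
    """Return services whose 5xx count breaches the alarm threshold.

    'responses' is an iterable of (service, http_status) pairs, for example
    pulled from access logs for the last monitoring window.
    """
    errors = Counter(service for service, status in responses if status >= 500)
    return {svc: count for svc, count in errors.items() if count >= ERROR_THRESHOLD}

# 60 server errors from one service would raise an alarm against that area.
window = [("policy-api", 503)] * 60 + [("search-api", 200)] * 500
print(check_thresholds(window))  # {'policy-api': 60}
```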

Plus, you need to remember we are supporting multiple identical deployments of code, each used by multiple clients. That means we are supporting at a "GLOBALLY AVAILABLE SERVICE" level. If errors start accumulating, we can tell very quickly whether an issue is localised to a region or a specific client on a deployment, as opposed to a more holistic issue across multiple regions.
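A minimal sketch of that triage step, again purely illustrative (the field names, regions and tenants are assumptions): group the same error events by region and by tenant, and the shape of the counts tells you whether the problem is local or global.

```python
from collections import defaultdict

def summarise_errors(error_events):
    """Count errors by region and by (region, tenant) to localise an issue."""
    by_region, by_tenant = defaultdict(int), defaultdict(int)
    for event in error_events:
        by_region[event["region"]] += 1
        by_tenant[(event["region"], event["tenant"])] += 1
    return dict(by_region), dict(by_tenant)

events = [
    {"region": "ap-southeast-2", "tenant": "client-a"},
    {"region": "ap-southeast-2", "tenant": "client-a"},
    {"region": "eu-west-1", "tenant": "client-b"},
]
by_region, by_tenant = summarise_errors(events)
# One tenant in one region dominating the counts suggests a client-specific
# problem; errors spread evenly across regions point at the service itself.
print(by_region)
print(by_tenant)
```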

The automation software we use to monitor activity and generate alarms works PROACTIVELY, not REACTIVELY. So we are alerted to, and investigating, an issue in real time before clients or users even look to raise a support ticket. More often than not, the first communication is from us to a client or clients, notifying them that we are looking into alerts.

- - - - - - - - - - - - - - - - - 

Q: You raise a very important distinction about supporting SERVICES; traditionally, support would be focused on clients and their issues. Can you elaborate on that a little?

A: Sure. As I said, guaranteeing availability of the platform for clients is priority one, but that doesn't mean client-specific focus comes second. Supporting a SERVICE is, indirectly, supporting clients at scale. Think of it like a rising tide raising all ships. If there is a genuine issue with a service, the impact will be broad and affect all clients. It may be 1am on a Monday morning GMT and only one client in APAC has noticed, but that doesn't mean the problem will stay localised to just one client for long. On the other hand, if a single client has observed an issue but the service is operating as expected for all other clients in other regions, then chances are the client themselves, through some error in API calls or configuration, has caused the problem, and more traditional support activities and personnel will resolve any problems they are having.

- - - - - - - - - - - - - - - - - 

Q: When you say "more traditional" support activities and personnel, what do you mean?

A: The On-Call support team is made up of actual software engineers. The people who built the product are the same ones who steward it forward into production and then monitor it to make sure their code is working as expected. It creates an extra vested interest in quality and speaks to our culture as well: our development team takes full end-to-end ownership of their output.

As for "more traditional" support activities, that's where we have dismissed a potential P1 as non-pertinent, the root cause comes down to a "How do I correctly do X, Y or Z?" style question, and we can direct clients to the appropriate documentation or other channels best suited to address their issue.

- - - - - - - - - - - - - - - - - 

Q: Coming into the holiday season, people are looking forward to a break, time to switch off and spend with their families. What does an incident look like for an On-Call Engineer?

A: The truth is, a support incident looks the same at Christmas as it does on any other day of the year. The easier we can make it for an engineer to deliver that support, the better. Let's break down an incident.

The majority of our engineering department works in or close to GMT hours; there is a few hours' overlap with other development centres in EMEA, but the bulk of the working day is GMT. We have clients across North America up to 11 hours behind, and clients in APAC working 10 or more hours ahead. An alarm at 1am GMT is the end of the working day in California or the start of the next day in Sydney.

Let's stick with that 1am GMT Priority 1 alarm as an example.

I mentioned thresholds earlier. We have these set across all moving parts of the application, and there is a specified quantity of errors allowed to occur before an alarm goes off. Our automation software is not waiting for the thresholds to breach; it watches as they approach a limit and notifies the support team of trending metrics. So even before an alarm is raised, our engineers have their attention drawn to areas of the service to watch out for.
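One simple way to picture that "approaching the limit" behaviour (a sketch only; the real tooling is more sophisticated than a straight-line projection) is to project the recent trend of an error count forward a few periods and warn if it would cross the alarm threshold:

```python
def projected_breach(samples, threshold, lookahead_periods=3):
    """Warn if a metric's recent trend would cross the threshold soon.

    'samples' are error counts per monitoring period, oldest first.
    Uses a simple average-change-per-period projection.
    """
    if len(samples) < 2:
        return False
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    projected = samples[-1] + slope * lookahead_periods
    return projected >= threshold

error_counts = [5, 12, 20, 29, 39]                   # errors per 5-minute period
print(projected_breach(error_counts, threshold=50))  # True: trending toward a breach
```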

Once a P1 Alarm is raised, that same Automation Software creates a report which presents metrics and details of what has been happening, along with useful links to drill into the logs where the issues have been captured. 
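The shape of such a report could be as simple as the breaching metric plus pre-filtered links into the logs. A hypothetical sketch follows; the service, metric name and URL are invented for illustration:

```python
from datetime import datetime, timezone

def build_p1_report(service, region, metric, observed, threshold, log_query_url):
    """Bundle the breaching metric with a deep link into the relevant logs."""
    return {
        "title": f"P1: {service} error threshold breached in {region}",
        "raised_at": datetime.now(timezone.utc).isoformat(),
        "metric": metric,
        "observed": observed,
        "threshold": threshold,
        "logs": log_query_url,   # pre-filtered query for the affected window
    }

report = build_p1_report(
    service="policy-api",
    region="ap-southeast-2",
    metric="Http5xxCount",
    observed=120,
    threshold=50,
    log_query_url="https://logs.example.internal/query?service=policy-api",
)
print(report["title"])
```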

At that point it comes down to root cause analysis, and the root cause at the bottom of any given P1 can be many things. One thing we do not do is "code forward". There is a well-oiled SDLC pipeline for shipping patch updates if needed; the support activity is about analysis. Based on the outcome of that analysis, a P1 might be dismissed, or, if an issue is identified, a feature can be rolled back, bringing the application to a known working state, and the Development Team can focus on remediation as part of their SDLC.

All that being said, knowing that a service is healthy while a specific client is still having issues is not ideal. If we have identified what is causing a specific client's issues, we can notify them of our findings and join a call with them, or schedule a call with the people best suited to help them with anything implemented incorrectly.

Lastly, we refine. This is how we make sure On-Call engineers are not overburdened. If an alarm threshold is set too low and we are managing multiple P1 alarms that originated from user error, that threshold can be adjusted. Likewise, if we are not getting to issues proactively enough, the threshold could be too high. It is about constant learning, refactoring and adjustment to ensure we can be exactly where we need to be, when we need to be there. New functionality is also constantly being released via CI/CD, so the targets keep changing, but our strategy is solid.
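That feedback loop can be pictured as a simple heuristic (illustrative only; the percentages and step sizes here are assumptions, not Fenergo's process): relax a threshold that is producing mostly non-actionable alarms, and tighten one that is letting real issues through too slowly.

```python
def tune_threshold(current, recent_alarms):
    """Adjust an alarm threshold based on how recent alarms turned out.

    'recent_alarms' is a list of dicts like
    {"actionable": bool, "caught_late": bool}.
    """
    if not recent_alarms:
        return current
    n = len(recent_alarms)
    false_positive_rate = sum(not a["actionable"] for a in recent_alarms) / n
    caught_late_rate = sum(a.get("caught_late", False) for a in recent_alarms) / n
    if false_positive_rate > 0.5:
        return int(current * 1.2)   # too noisy: allow more errors before alarming
    if caught_late_rate > 0.2:
        return int(current * 0.8)   # too slow: alarm earlier
    return current

history = [{"actionable": False}, {"actionable": False}, {"actionable": True}]
print(tune_threshold(50, history))  # 60: relaxed after a run of noisy alarms
```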

- - - - - - - - - - - - - - - - - 

Q: You really seem to be able to turn that firehose down to a manageable stream of helpful data. Any parting tips for other cloud support engineering teams out there? What do you think the big game changers are?

A: Sure. I'll give you my top 3 tips:

1 - Automate EVERYTHING You Can: We didn't get here overnight. Anything built automatically was first created manually, but we sure didn't leave it that way. We test constantly, not just the platform's code but the Infrastructure as Code too. It means you have dependable, repeatable processes for provisioning and managing large-scale global services. The cloud providers, and lots of other solutions, exist to automate this; top tip: use them (see the sketch after these tips).

2 - Remember You Are Supporting Services: This strategy still has the client at the centre of the outcomes you are driving towards, but when you focus on services and their availability, the benefit applies to all clients and users. In traditional software, a feature is complete when it gets released to production; a service is NEVER complete until it is switched off. It is a more fluid way to view how the platform is responsible for the outcomes clients are looking to achieve continuously.

3 - Refine and Improve: Software will change and new functionality will be introduced, but clients will also come up with creative, often unintended, ways to use the service you provide. Every day of usage you learn more about what good looks like and what can go wrong, so reflect that learning back into the support processes to create an optimal experience for clients. Staying too rigid will just slow you down and create blind spots.
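To make tip 1 a little more concrete, here is a minimal sketch of defining an alarm in code rather than clicking it together in a console. The interview doesn't name the exact tooling, so this assumes AWS CloudWatch via boto3; the metric, threshold and SNS topic are illustrative placeholders, not Fenergo's actual configuration.

```python
import boto3

# Codify the alarm rather than configure it by hand. Assumes AWS CloudWatch
# via boto3; all names and values below are illustrative placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="policy-api-5xx-threshold",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Statistic="Sum",
    Period=300,                       # evaluate in 5-minute windows
    EvaluationPeriods=2,              # two consecutive breaches before alarming
    Threshold=50,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-pager"],  # hypothetical SNS topic
)
```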

George McGrane is a Technical Advocate for the Fenergo SaaS Engineering Team and has been with Fenergo for 5 years.
 

Steve O'Rourke is a Director of Engineering for the SaaS Engineering Team and has been with Fenergo for 5 years.