Elevated rates of 503 errors for some ad serving requests
Incident Report for Adnuntius System Status
Postmortem

Context

  • Adnuntius processes ad requests (traffic) using load balancers and ad servers in several Data Centres (DCs) worldwide. Some ad requests come from users through our adn.js adapter and some arrive via prebid, but all requests are normally served by the same load balancers and ad servers in the DC closest to the user that made the request.
  • We can move traffic between DCs as needed using BGP or DNS, or by updating our load balancers.
  • When we wish to shut down an ad server, we first drain traffic from it by writing a key to a distributed KV store. A process that monitors this key instructs our load balancers to remove the server from rotation (a simplified sketch of this flow follows this list). If a server is not cleanly drained in this way, checks that run every few milliseconds from the load balancer will detect it as down and re-dispatch its requests to another server; this increases latency and, in the worst case, can cause requests to fail if the server is up but unable to process requests.
  • We monitor and alert on the presence of 5xx errors (responses such as 500 Internal Server Error and 503 Service Unavailable) on all our servers.
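
For illustration, here is a minimal sketch of the drain flow described above, assuming a generic key-value interface; the key layout, polling interval and load-balancer call are placeholders rather than the actual Adnuntius tooling.

    # Minimal sketch of the drain flow, under the assumptions stated above.
    import time


    class KVStore:
        """Stand-in for the distributed key-value store."""

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)


    def request_drain(kv, server_id):
        """Operator action: mark an ad server as draining before shutting it down."""
        kv.put(f"adserver/{server_id}/drain", "true")


    def drain_watcher(kv, server_id, remove_from_rotation, poll_seconds=1.0):
        """Monitoring process: take a draining server out of load-balancer rotation."""
        while kv.get(f"adserver/{server_id}/drain") != "true":
            time.sleep(poll_seconds)
        remove_from_rotation(server_id)  # instruct the load balancer

    # Example usage with a stubbed load-balancer call:
    # kv = KVStore()
    # request_drain(kv, "ps16")
    # drain_watcher(kv, "ps16", lambda s: print(f"removed {s} from rotation"))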

Timeline of events

Note: All times are given in the Melbourne, Australia time zone (GMT+11). The timeline runs from the evening of 16th December to the morning of 17th December.

Part 1 - Deployment

20:03: Routine deployment of ad servers in one of our DCs (DC1). Deployment to ad servers always proceeds one server at a time, with each server checked before continuing to the next.

20:09: The first ad server (identifier ps16) starts up and passes the standard checks on the local machine – these check that it can serve ads and is reporting a healthy status.

20:12: Monitoring systems report that the ad server on ps16 has not started handling any external ad requests (successful or unsuccessful) from the internet. The deployment engineer notes this and halts the deployment process to investigate.

20:18: Automated alerts begin to fire, reporting that 10% of ad requests on ps16 are returning a 5xx HTTP response (Server Error). The deployment engineer concludes that the software changes in the deployment are causing the issue and decides to roll back to the previously deployed version.

20:26: The software rollback completes.

20:34: Automated alerts fire, indicating that a “canary” ad has not responded when pinged on its 5-minute schedule. We have a number of different canary ads that are tested from various locations around the world. The deployment engineer assumes that the alert was slightly delayed and was caused by an ad server that was still running the “faulty” version of the software before the rollback completed.

20:53: An automated alert fires, again reporting that 10% of ad requests on ps16 are returning a 5xx HTTP response (Server Error). The deployment engineer checks the monitoring system and notices that the ad server on ps16 is still not logging any external ad requests, even though it has been rolled back to the previous software version. The decision is made to shut down the ad server process on this machine and resolve the issue in the morning. The automated checks that alert when an ad server is shut down are disabled overnight, as otherwise they would continue to fire until the ad server was fixed.

Part 2 - Out of hours traffic shift

22:20: An automated alert fires, reporting that a routine check of the ad server status returned a 503 error. The status check returns to normal on the next check. An SRE notes the error, but takes no action because it resolved quickly without intervention and is assumed to be an isolated issue, unrelated to the earlier problems with the ps16 ad server.

00:01: The same automated alert fires again, reporting that the ad server status check returned a 503 error. Again the alert automatically resolves on the next check. Because this is the second time the alert has fired, an SRE investigates the metrics behind it and notices that ad server response times have been trending upwards since the ad server in DC1 was shut down, and will soon cause further alerts to fire and wake the on-call support engineer.

The SRE concludes that the earlier shutdown of one of the five ad servers in DC1 has reduced capacity enough that we are unable to absorb the current rate of traffic. As a remedy, they move just a portion (the prebid requests) of the traffic that would normally be handled in DC1 to another DC (DC2). This decision is made to minimise the impact on DC1 customers, who would otherwise suffer increased latency (due to routing to a more distant DC) until the deployment engineer can investigate the original issue in the morning. However, the normal procedure for moving traffic is to shift all of it via DNS; the procedure for moving only the prebid traffic, although extremely simple, had never been tested at the scale of traffic occurring at the time of the switch.

Monitoring systems are checked for a period of time. The volume of incoming ad requests appears normal, the expected amount of prebid traffic has moved from the first DC to the second, and none of the alerts fire that would notify us when a specific customer (rather than the entire platform) shows a traffic volume significantly below its expected average. Furthermore, the increased response times return to close to normal, and so the SRE assumes the incident is resolved.

Part 3 - Detection and makeshift fix

01:15: We start to receive reports from a customer that ads are not always appearing on their site. The first-level support team investigates and initially concludes that it might be an intermittent CDN issue. They try various things to reproduce and isolate the issue, but are ultimately unsuccessful.

01:42: An on-call engineer in Australia is paged and asked to investigate.

02:40: The on-call engineer has eliminated all proximate causes other than the earlier traffic move, so requests a decision on waking the SRE who made the traffic shift to investigate it as a cause.

02:49: The SRE starts investigating. The fact that only a single customer is reporting the issue, and that this customer is not located near DC1, leads them to initially dismiss the earlier DC1 issue as irrelevant.

02:59: The intermittent nature of the problem leads the investigators to conclude that different ad servers are responding differently to the incoming ad requests. This could indicate that the indexes for some ad servers, which contain all of the customer campaign data, may be out of date or corrupted. It is decided to perform a re-index of the customer’s advertising campaigns, which will refresh all of the campaign data in every ad server.

03:08: The significance of the 503 errors is discussed again and it is realised that they are not actually a problem with the ad servers, but with the load balancers failing to find an ad server to respond to requests. Armed with this conclusion, it is suspected that the re-routing of traffic earlier in the night might have overloaded DC2, so it is decided that ps16 should be fixed immediately so that traffic can be safely routed back into DC1.

03:20: ps16 has been started and the prebid traffic has been fully restored to DC1. Support engineers observe that 503 errors have stopped occurring in the DC2 region, although they still occur intermittently for DC1. The previously stopped ad server in DC1 is now running, although it still does not receive any traffic, and the 5xx error rate displayed on the DC1 dashboards remains high.

04:14: The customer confirms the issues are no longer occurring for them. To resolve the remaining 503 errors in the DC1 region, all traffic for the first DC is moved to the second DC via DNS until the morning.

Part 4 - Clean up

07:28: The deployment engineer commences a full deployment of the previously deployed software version (not the one that appeared to cause issues).

08:22: The deployment completes. The monitoring system shows that the ad server on ps16 is once more handling ad requests. DC1 traffic is moved back from DC2 to DC1. Monitoring shows that all 5xx errors are now at normal levels.

Findings

There were two immediate causes, which are currently being fixed:

  • A race condition in the setting and reading of the key that controls traffic resulted in the ps16 load balancer accepting requests that it could not process. This caused some requests in the DC1 region that arrived at this server to be returned 503 errors (illustrated in the first sketch below).
  • The process for moving prebid traffic resulted in a single server in the second DC receiving the vast majority of that traffic and becoming overloaded, due to a lack of load balancing in how we distribute this traffic (illustrated in the second sketch below).
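
To illustrate the first cause, the sketch below shows a generic check-then-act race on a traffic-control key. The exact interleaving inside our systems differs; the key name, components and timings here are hypothetical.

    # Hypothetical illustration of a check-then-act race on a traffic-control key.
    # The key name, timings and components are placeholders, not the real system.
    import threading
    import time

    kv = {"adserver/ps16/drain": "true"}   # the server was drained earlier
    backend_ready = False                  # the ad server is not yet serving


    def deployment():
        # The deployment clears the drain key first, then (slowly) brings the
        # ad server back up.
        global backend_ready
        kv["adserver/ps16/drain"] = "false"
        time.sleep(0.5)                    # startup takes time
        backend_ready = True


    def load_balancer():
        # The load balancer reads the key and starts accepting requests as soon
        # as it flips, even though no backend is ready to process them yet.
        while kv["adserver/ps16/drain"] == "true":
            time.sleep(0.05)
        if not backend_ready:
            print("503 Service Unavailable: request accepted with no backend")


    threading.Thread(target=deployment).start()
    load_balancer()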
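
To illustrate the second cause, here is a hypothetical contrast between redirecting a traffic stream at a single fixed endpoint and spreading it over the destination DC's pool; the server names and request counts are invented for illustration.

    # Hypothetical contrast between a fixed-endpoint redirect and a simple
    # round-robin spread; server names and counts are invented for illustration.
    import itertools
    from collections import Counter

    dc2_pool = ["dc2-ps01", "dc2-ps02", "dc2-ps03", "dc2-ps04", "dc2-ps05"]
    prebid_requests = range(10_000)

    # 1) Redirect everything to one fixed endpoint: that server takes all of it.
    fixed = Counter("dc2-ps01" for _ in prebid_requests)

    # 2) Spread the same requests over the pool: each server takes ~1/5 of it.
    cycle = itertools.cycle(dc2_pool)
    balanced = Counter(next(cycle) for _ in prebid_requests)

    print(fixed.most_common(1))     # [('dc2-ps01', 10000)]
    print(balanced.most_common(1))  # [('dc2-ps01', 2000)]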

Additionally, there are a number of areas for improvement:

  • The deployment process did not provide the information needed for the deployment engineer (and later the SRE and support engineer) to maintain situational awareness of the cause of the 5xx errors. A new monitoring dashboard will be developed that distinguishes 5xx errors returned by a service from 5xx errors returned by a load balancer, and shows the state of the key controlling traffic.
  • Clearly distinguish between 500 and 503 errors in alerts, since the action to take is very different (see the first sketch after this list).
  • Metrics are now shown for the distribution of ad server load, and an alert may be developed for this if possible.
  • External checks should be made through each load balancer after the deployment reaches each one’s server, rather than only internal checks on the server itself (see the second sketch after this list).
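
As a sketch of the kind of 5xx classification the improved dashboard and alerting could perform, the labels and the "emitted_by" field below are assumptions rather than a final design.

    # Sketch of a possible 5xx classification for dashboards and alerts.
    # The labels and the "emitted_by" field are assumptions, not a final design.

    def classify_5xx(status_code: int, emitted_by: str) -> str:
        """Label a 5xx response by status code and by which component emitted it.

        emitted_by is assumed to be "adserver" or "load_balancer", e.g. derived
        from which component's logs the response was found in.
        """
        if status_code == 500:
            return "adserver-internal-error"   # a bug or bad data in the service
        if status_code == 503 and emitted_by == "load_balancer":
            return "no-backend-available"      # LB could not find an ad server
        if status_code == 503:
            return "adserver-unavailable"      # server up but refusing work
        return "other-5xx"


    assert classify_5xx(500, "adserver") == "adserver-internal-error"
    assert classify_5xx(503, "load_balancer") == "no-backend-available"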
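
And a sketch of the external post-deployment check suggested in the last bullet, made through each load balancer's public endpoint; the URLs are placeholders.

    # Sketch of an external post-deployment check: request an ad through each
    # load balancer's public endpoint (placeholder URLs) rather than relying
    # only on checks local to the deployed server.
    import urllib.request


    def check_via_load_balancers(lb_urls, timeout=5):
        """Return the load balancers whose endpoint did not answer with a 2xx."""
        failed = []
        for url in lb_urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if not 200 <= resp.status < 300:
                        failed.append((url, resp.status))
            except Exception as exc:           # timeouts, 5xx, DNS failures
                failed.append((url, str(exc)))
        return failed


    # Example: abort the rollout if any load balancer fails the check.
    # failures = check_via_load_balancers(["https://dc1-lb1.example/ad-check"])
    # if failures:
    #     raise SystemExit(f"post-deployment check failed: {failures}")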
Posted Dec 21, 2022 - 07:10 UTC

Resolved
From 09:30 UTC, for just over 7 hours (see the timeline in the postmortem for specifics), some users experienced elevated rates of 503 errors. The issue was challenging to detect, both because an earlier related issue caused it to be initially overlooked, and because it was initially isolated to one DC and occurred during the evening and early morning hours in that time zone, when traffic is quite low and developers are on call rather than actively working.
Findings from our initial investigation have revealed gaps in our alerting systems and suggest improvements to some processes. We apologise for allowing this incident to happen and for not detecting it ourselves. The findings are now being addressed to prevent similar incidents from going undetected in the future.
Posted Dec 16, 2022 - 09:30 UTC