Yesterday we had the worst outage of the FourteenFish website in several years. In total there were 35 minutes where our website was inaccessible, usually for 30 seconds to 2 minutes at a time. This happened between 12:50pm and 9:10pm yesterday.
We use the Pingdom website monitoring service, which sends an SMS alert to our two most senior staff if it detects that the FourteenFish website is inaccessible or is very slow to load. Pingdom tests our website every minute from about 10 different locations around the world.
At 12:50pm we received our first notice of an outage, which lasted for 1 minute.
When we manually checked the website, everything seemed fine. This made us think that the outage was caused by a “timeout” where the website was too slow to respond to the Pingdom check.
We immediately checked our database log for any problems, as database queries are usually the bottleneck when a website is slow to respond. We found that a particular database query was taking much longer than is acceptable. We consider anything over 2 seconds to be “slow”, and this query was taking about 10 seconds. Our developers optimised the problematic database query to speed it up, and we thought this was the likely cause of the outage.
At 1:37pm we received our second notice of an outage, which again lasted for 1 minute.
Again we checked our database log, and found some other queries that were running slowly. We also noticed that there were a large number of connections to our database. Our baseline is about 150 connections, with 200 at busy times.
We were hitting 300-400 concurrent connections, which is very unusual. When we manually checked the website this time, we were unable to access it.
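To put those numbers in context, here is a minimal sketch of how a connection count could be compared against the baseline figures above. The function name and the three labels are hypothetical and chosen for illustration; this is not the monitoring code we actually run.

```python
BASELINE_CONNECTIONS = 150  # typical number of database connections
BUSY_CONNECTIONS = 200      # what we expect to see at busy times


def connection_status(current: int) -> str:
    """Classify a concurrent-connection count against the baseline figures."""
    if current <= BASELINE_CONNECTIONS:
        return "normal"
    if current <= BUSY_CONNECTIONS:
        return "busy"
    # Anything above the busy-time figure is outside what we normally see.
    return "abnormal"


for sample in (140, 190, 350):
    print(sample, connection_status(sample))
```

Under these assumptions, anything in the 300-400 range falls well outside the "busy" band.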
Looking for the root cause
The most likely cause of any outage is that a recent change we’ve made to our code is causing problems, so we started to look back through recent changes.
We had recently added a new feature to allow GPs taking the RCA exam to see if they had met the mandatory criteria, and we knew that this page was getting accessed a lot at the moment. We stepped through the changes we’d made to see if there was anything that might be causing the problem. We then started looking back at other changes we’d made in the last week.
Throughout the afternoon, we had more short periods of downtime, most lasting around 1 minute and the longest up to 3 minutes. These were happening roughly every 20 minutes.
This was quite a stressful time, to put it mildly.
Around 10 people raised a support ticket mentioning the problem during the afternoon, but we knew it was affecting a lot more people than this. On a typical Monday around 11,000 people use FourteenFish. We checked our website usage on Google Analytics, and it seemed like a pretty normal day.
We carried on troubleshooting throughout the afternoon. We looked at all the most frequently accessed parts of our system (the dashboard, the portfolio overview and the recorded consultation system) to try and isolate the problem.
At around 6pm we noticed the issues were getting less frequent, but the problem was still persisting.
At around 7pm we had looked at all the most likely causes of the problem, so we started looking at other components of our system.
When we checked our mailserver something jumped out – our system had sent 1,500,000 emails in the past 12 hours.
That’s a lot of emails
On an average day our system sends about 13,000 emails, so this was over 100 times our baseline. Looking at the logs for our mailserver, we noticed that the majority of the emails were to a single @nhs.net email address.
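The arithmetic behind "over 100 times" can be checked in a few lines. This is only a sketch: the function names and the alert threshold of 10 times the baseline are assumptions made for illustration, not something we had in place.

```python
AVERAGE_DAILY_EMAILS = 13_000  # typical daily volume quoted above
ALERT_MULTIPLE = 10            # hypothetical alert threshold (an assumption)


def volume_ratio(sent: int, baseline: int = AVERAGE_DAILY_EMAILS) -> float:
    """How many times the daily baseline a given email count represents."""
    return sent / baseline


def is_abnormal_volume(sent: int) -> bool:
    """True when the volume is at or above the hypothetical alert threshold."""
    return volume_ratio(sent) >= ALERT_MULTIPLE


print(f"{volume_ratio(1_500_000):.0f}x baseline")  # prints "115x baseline"
print(is_abnormal_volume(1_500_000))               # prints "True"
```

A check along these lines on outgoing mail volume would have flagged the problem long before the database did.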
As well as providing online tools for GPs and other health professionals, we also supply a system to related organisations such as LMCs (local medical committees). One of the features of this system gives the LMC the ability to email all the doctors in their area. If any of the emails they send bounce, we alert the LMC by sending a message to its “office admin” email address.
The problem was that the “office admin” email address itself was having problems, and our emails to it were being bounced back. When we got the bounce, we’d again alert the “office admin” email address of the problem, triggering another bounce back.
This created an infinite loop between the mailserver and the API which handles the bounces. Each time a bounce is received, this triggers a number of database queries (find the person in our database, find the LMC that they belong to, find the office admin email for that LMC and so on). This was having the effect of consuming all our database server capacity.
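The loop can be illustrated with a small simulation. This is a simplified sketch, not our real bounce handler: the addresses are made up, and `max_events` exists only so the example terminates, whereas the real loop had no such cap.

```python
def simulate_bounce_loop(admin_address: str, bouncing: set, max_events: int = 10) -> int:
    """Count the alerts produced by one initial bounce under the unfixed logic."""
    notifications = 0
    pending = ["doctor@example.com"]  # the first bounced recipient (hypothetical)
    while pending and notifications < max_events:
        pending.pop()
        # Unfixed behaviour: every bounce triggers an alert to the office admin.
        notifications += 1
        if admin_address in bouncing:
            # The alert itself bounces, which queues yet another alert.
            pending.append(admin_address)
    return notifications


admin = "office-admin@example.com"  # hypothetical address

print(simulate_bounce_loop(admin, bouncing=set()))    # healthy mailbox: 1 alert
print(simulate_bounce_loop(admin, bouncing={admin}))  # broken mailbox: runs until the cap (10)
```

In the real system each of those alerts also ran several database queries, which is why the connection count climbed.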
Fixing the problem
We immediately began working on a change to our system to address this by filtering out any bounce that would be “self reported” (a bounce from the “office admin” address itself) and so create this sort of loop.
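In outline, a filter of this kind could look like the sketch below. The function names and addresses are hypothetical, and comparing addresses case-insensitively is an assumption; the change we deployed is not reproduced here.

```python
def normalise(address: str) -> str:
    """Trim whitespace and lower-case an address for comparison (an assumption)."""
    return address.strip().lower()


def should_notify_admin(bounced_recipient: str, admin_address: str) -> bool:
    """Return False when the bounce came from the office admin address itself."""
    return normalise(bounced_recipient) != normalise(admin_address)


def handle_bounce(bounced_recipient: str, admin_address: str, outbox: list) -> None:
    """Queue an alert for the office admin unless doing so would be self-reporting."""
    if should_notify_admin(bounced_recipient, admin_address):
        outbox.append((admin_address, f"Email to {bounced_recipient} bounced"))
    # Otherwise drop the bounce: alerting a broken mailbox about itself
    # would only produce another bounce.


outbox = []
handle_bounce("doctor@example.com", "office-admin@example.com", outbox)
handle_bounce("Office-Admin@example.com", "office-admin@example.com", outbox)
print(len(outbox))  # prints 1: only the doctor's bounce is reported
```

The key design choice is to break the cycle at the point where the alert is created, so a failing admin mailbox can never generate more work for the mailserver or the database.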
At 9:10pm we deployed the change to our systems, and we immediately saw a massive reduction in connections to our database.
We were confident that we’d found the problem. We continued to monitor our website for the next hour just to be sure.
The full extent of the outage
Here’s the overview of all the downtime we had, shown by the red blocks. In total our website was inaccessible for around 35 minutes.
What we’re doing now
As part of our ISO 27001 certification we have an Information Security Management System in place that includes a few procedures for documenting and analysing any system outages.
We’re currently documenting all the steps we took yesterday while we were looking for the problem, and analysing what we could have done better.
For example, in hindsight we went down a bit of a rabbit hole looking at the most likely cause of the problem (the recent changes to the recorded consultation tool) and we should have started looking at other parts of our system sooner to rule those out.
We’d like to apologise to all our users who were affected by this. We’re very sorry for any inconvenience and stress that was caused.
CPO (Founder), FourteenFish