Summary
Emails routed to another mail provider (hostedemail.com) began failing, causing a backup in our delivery queue as we continued retrying them.
Timeline (all US Eastern Time)
10/3/2025
21:30 [Incident Begins][First Machine Detection] All emails sent to hostedemail.com fail. Our Sentry exception tracker sends a soft (non-paging) email alert about an unhandled exception caused by malformed ARC headers from them. Our systems automatically retry all emails, so the queue begins backing up with retries. This has little effect at first, but the backlog steadily builds.
10/4/2025
01:18 [First Customer Report] We receive the first customer report that their emails are being delayed. These continue over the next couple of hours.
05:25 [First Human Detection] Yosif signs on, sees many customers complaining about delayed delivery, and investigates.
05:29 [Escalation] Yosif pages Matthew for assistance, as this looks like a more widespread production incident.
05:30 Matthew acknowledges the page and signs on.
05:34 [Customers Notified] Matthew posts a status page update alerting customers that we are looking into the issue.
06:05 We notice a large number of incoming malformed ARC headers and deploy code to log additional information on these requests.
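For illustration only, a minimal sketch of the kind of extra logging deployed here, assuming inbound messages are available as email.message.Message objects; the function and logger names are hypothetical, not our actual code.

```python
# Hypothetical logging hook for inbound messages with malformed ARC headers.
# Assumes `message` is an email.message.Message; names are illustrative only.
import logging

logger = logging.getLogger("inbound.arc")

def log_malformed_arc(message):
    """Log sender, ARC chain depth, and a bounded sample of the raw headers."""
    seals = message.get_all("ARC-Seal") or []
    logger.warning(
        "malformed ARC chain: from=%s chain_len=%d sample=%r",
        message.get("From"),
        len(seals),
        seals[:3],  # cap what we log so log lines stay bounded
    )
```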
06:46 [Mitigation #1 Deployed] From debugging, we see that hostedemail.com is sending us many emails with large chains of malformed ARC headers, indicating trouble with their infrastructure. We deploy code to drop these requests, which reduces the incoming flow of malformed requests.
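A hedged sketch of what this kind of drop check could look like: reject inbound messages whose ARC chains are excessively deep or internally inconsistent. The cutoff value and the helper name are assumptions for illustration, not our production logic.

```python
# Illustrative check for malformed ARC chains. Per RFC 8617, each hop adds one
# ARC-Seal, ARC-Message-Signature, and ARC-Authentication-Results header, so
# mismatched counts or an excessive depth suggests a broken chain.
MAX_ARC_CHAIN = 10  # assumed cutoff; legitimate chains are rarely this deep

def should_drop(message) -> bool:
    seals = message.get_all("ARC-Seal") or []
    sigs = message.get_all("ARC-Message-Signature") or []
    results = message.get_all("ARC-Authentication-Results") or []
    if len(seals) > MAX_ARC_CHAIN:
        return True
    if not (len(seals) == len(sigs) == len(results)):
        return True
    return False
```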
07:14 [Mitigation #2 Deployed][Recovery Begins] With the incoming flow alleviated, we determine that our outgoing queue holds a large backlog of emails being retried to hostedemail.com. They fail and retry, clogging the queue and causing emails to be delayed by up to one hour. We deploy code to drain hostedemail.com messages from the queue, and email delivery speeds begin to improve.
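A minimal sketch of this kind of domain-targeted drain, assuming the outgoing queue can be treated as an in-memory list of task dicts keyed by recipient domain; our real queue is more complex than this, so the shape and field name are assumptions.

```python
# Illustrative drain of retry tasks for a single failing provider. The task
# shape ("recipient_domain") and the in-memory list are assumptions.
def drain_domain(queue, domain="hostedemail.com"):
    """Set aside tasks for the failing domain so other mail keeps flowing."""
    kept, parked = [], []
    for task in queue:
        (parked if task["recipient_domain"] == domain else kept).append(task)
    queue[:] = kept    # remaining providers' mail is processed normally
    return parked      # held for redelivery once the provider recovers
```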
07:18 [Customers Notified of Recovery] Status update posted that we've identified the root cause and services are recovering.
07:44 All retrying hostedemail.com tasks have been drained, and the queue is now working through the backlog of delayed emails.
08:48 [Incident Ends] The last of the backlogged delayed emails has been delivered, and delivery speed is back to normal. A status update announcing full recovery is posted soon after.
What Ultimately Went Wrong
Our infrastructure is vulnerable to outages at other major mail providers, and we lacked alerting and visibility on delivery times and queue sizes, which delayed our response. The follow-up action items below address both gaps.
Follow-up Action Items
- IMX-1362: The size of the outgoing message queue should be prominently displayed on our main metrics dashboard
- IMX-1363: Our current alerting checks that emails are successfully delivered to all major providers, but does not alert when they are delayed. We will add strict alerting for delivery delays (see the sketch after this list).
- IMX-1368: Add prominent dashboards for p10, p50, and p90 delivery speeds for all major email providers
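As an illustration of IMX-1363 and IMX-1368, a sketch of per-provider delivery percentiles with a delay alert; the 5-minute threshold, field names, and paging callback are assumptions rather than the final design.

```python
# Hypothetical per-provider delivery-time percentiles plus a delay alert.
# The threshold and the `page` callback are illustrative assumptions.
import statistics

DELAY_ALERT_SECONDS = 300  # assumed: alert if p90 delivery time exceeds 5 min

def delivery_percentiles(durations_seconds):
    """Return (p10, p50, p90) delivery times in seconds for one provider."""
    qs = statistics.quantiles(durations_seconds, n=10)  # 9 cut points
    return qs[0], qs[4], qs[8]

def check_provider(provider, durations_seconds, page):
    p10, p50, p90 = delivery_percentiles(durations_seconds)
    if p90 > DELAY_ALERT_SECONDS:
        page(f"{provider}: p90 delivery {p90:.0f}s exceeds {DELAY_ALERT_SECONDS}s")
    return {"provider": provider, "p10": p10, "p50": p50, "p90": p90}
```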