Summary
Our primary database ran out of memory, causing it to become completely unresponsive. Because all of our core services depend on this database, this led to a full outage for our API, web app, inbound email, and outbound email delivery. The incident downtime lasted approximately 30 minutes and was resolved by upgrading the database server.
Timeline (Wednesday April 8, 2026, all times US Eastern / EDT)
- 12:48 PM — [Service Degradation Begins / First Alert Fired] Monitoring detects timeouts on our API from multiple regions
- 12:55 PM — [First Human Detection] Matthew sees an unusual Sentry error reporting that the database cannot create new threads. He begins investigating and finds that load is high, memory is high, and the database server is unreachable via SSH.
- 12:59 PM — [Second Responder Signs On] Yosif acknowledges the alert
- 1:02 PM — [Customers Alerted] Matthew creates status page incident: "We are having issues with our primary database, affecting all services."
- 1:04 PM — [Root Cause Identified] Matthew determines the server needs more memory. Since a restart would likely hit the same issue again, he begins upgrading the server
- 1:10 PM — [Mitigation #1 Applied] Server is stopped, upgraded, and restarted. Database comes back online
- 1:18 PM — Status page updated: "We've upgraded our primary database, and are seeing services recover."
- 1:27 PM — Status page updated: "Mail delivery has recovered. We are still investigating app/api."
- 1:32 PM — [Recovery Complete] All services confirmed recovered. Status page resolved
Second Incident (same day)
Later that evening, a problematic database query during our recovery process caused a brief recurrence of service degradation.
- 5:37 PM — [Downtime Detected] Services begin degrading again. Matthew begins investigating and posts a status page update.
- 5:50 PM — [Mitigation #2 Applied] Matthew identifies and kills a bad query on the primary database. Services begin recovering
- 5:55 PM — Status page updated: "Mail delivery has recovered. API servers are still degrading and we're investigating"
- 5:58 PM — [Full Recovery] API and app have recovered. Status page resolved
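Mitigation #2 above amounts to finding queries that have been running far longer than expected and terminating them. As an illustrative sketch only (the processlist row format, field names, and threshold here are assumptions, not our actual tooling), selecting kill candidates from a MySQL-style processlist might look like:

```python
# Hypothetical helper: pick long-running client queries to kill, given
# rows shaped like a MySQL-style processlist. The 300-second threshold
# is illustrative.

def find_runaway_queries(processlist, threshold_seconds=300):
    """Return connection IDs of queries executing longer than the threshold.

    Idle connections (command != "Query") are ignored, since they hold no
    running statement to kill.
    """
    return [
        row["id"]
        for row in processlist
        if row["command"] == "Query" and row["time"] > threshold_seconds
    ]

if __name__ == "__main__":
    sample = [
        {"id": 12, "command": "Sleep", "time": 900},   # idle connection, skip
        {"id": 31, "command": "Query", "time": 4},     # healthy query
        {"id": 47, "command": "Query", "time": 1800},  # runaway query
    ]
    print(find_runaway_queries(sample))  # prints [47]
```

Each returned ID would then be passed to the database's KILL command by an operator.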
What Went Wrong
The database was overutilized and ran out of memory during a spike in incoming mail traffic.
The increased utilization was the result of us raising our maximum mail delivery throughput earlier this week without scaling up the database appropriately in response.
We did not have alerting on overall memory usage on the server, which could have warned us of the problem before it resulted in downtime.
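To make the alerting gap concrete, a memory alert of this kind could be expressed as a Prometheus rule along these lines (the metric names are standard node_exporter metrics, but the threshold, duration, and labels are illustrative assumptions, not our actual configuration):

```yaml
groups:
  - name: database-memory
    rules:
      - alert: DatabaseMemoryHigh
        # Fire when less than 10% of memory has been available for 5 minutes
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Database server memory is nearly exhausted"
```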
Followup Actions
In the 24 hours following the incident, we made significant infrastructure improvements:
- Upgraded all database servers to double the CPU and memory, with higher storage capacity, faster I/O, and increased throughput
- Improved our database backup process to produce consistent, replication-ready backups every hour
- Created comprehensive runbooks for database recovery so we can restore service faster if a similar issue recurs
- Reduced unnecessary database write load by removing redundant data collection
- Dropped unnecessary database tables, improving recovery speed in the event we need to restore from backup
- Tuned database memory settings to maintain a safe margin, preventing future out-of-memory conditions
- Added alerting on high memory usage on our database servers
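On the memory-tuning item above: the "cannot create new threads" error suggests a MySQL-compatible database (an inference on our part, not a detail from the report), where the usual lever is capping the main caches and per-connection buffers so their sum stays safely below physical RAM. A hypothetical sketch for a 32 GB server, with illustrative values rather than our production settings:

```ini
# my.cnf sketch (values are illustrative, not our production settings)
[mysqld]
# Leave roughly 25% of RAM as headroom for the OS, connection
# threads, and other caches
innodb_buffer_pool_size = 24G
# Bound per-connection memory: max_connections times the per-thread
# buffers must also fit within the remaining headroom
max_connections         = 500
```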
Statement
We sincerely apologize for the outage and the impact it had on your email forwarding. We have taken this incident as an opportunity to harden all the tooling and performance tuning around our core database, and this should result in significantly improved reliability going forward.