Post Mortem: SQL Servers Overloaded

October 5, 2025

Summary

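A spike in inbound SMTP connections triggered a recurrence of the SQL connection overload issue from 09/23/2025. API endpoints and email forwarding were degraded for about an hour, until the SQL cluster was upgraded to larger machines.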

Timeline (10/05/2025 EST)

08:00 Inbound SMTP connections spike
08:08 [First Machine Detection][Service Degradation Begins] Sentry alerting detects the "too many connections" SQL error and emails us. API endpoints and email forwarding begin to degrade.
08:15 [Machine Escalation] Our alerting detects errors for api.improvmx.com and pages Matthew, the primary oncall.
08:16 [Oncall Signs On] Oncall acknowledges the page and signs on
08:18 [Customers Notified] Oncall posts that we're investigating increased error rates.
08:58 [Mitigation #1 Attempted][Recovery Begins] Oncall recognizes that the SQL servers are under heavy load, and upgrades one of the SQL secondaries to an EC2 machine 4x the size.
09:08 [Mitigation #2 Attempted] The upgrade seems to have a positive effect, so oncall applies the same 4x upgrade to the other secondary and to the primary.
09:10 [Recovery Complete] The full upgrade of the SQL cluster causes the SQL connection overload error to disappear from all clients.
09:14 [Customers Notified] Oncall posts incident resolution to status.improvmx.com

What Went Wrong

The last time this SQL issue occurred, we drastically reduced the number of connections our servers open to SQL. That was not enough: under this spike the SQL servers were still overloaded, and they needed to be upgraded to larger machines.
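To illustrate what capping connections on the client side can look like, here is a minimal sketch of a bounded pool in Python. It is not our production code: the class name `BoundedPool`, the `connect` factory and the `timeout` parameter are hypothetical and chosen for the example. Each process can hold at most `max_size` connections, and callers wait for a free one instead of opening more.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most `max_size` connections per process (illustrative sketch)."""

    def __init__(self, connect, max_size, timeout=5.0):
        self._connect = connect          # hypothetical factory returning a new connection
        self._timeout = timeout          # seconds to wait for a free slot
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)        # None marks a free slot with no connection opened yet

    @contextmanager
    def connection(self):
        # Waits up to `timeout` seconds for a slot; raises queue.Empty if none frees up,
        # so a traffic spike turns into client-side waiting rather than new SQL connections.
        conn = self._slots.get(timeout=self._timeout)
        if conn is None:
            conn = self._connect()       # open lazily, only when a slot is first used
        try:
            yield conn
        finally:
            self._slots.put(conn)        # return the slot (and its connection) for reuse
```

With `pool = BoundedPool(connect, max_size=5)`, code runs queries inside `with pool.connection() as conn:`. A cap like this limits how many connections each client opens, but as this incident showed, it does not help if the servers are too small for the load those connections carry. The sketch also returns a connection to the pool even if it has failed. A real pool would need to detect and replace broken connections.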

Action Items

IMX-1378: Ensure SQL servers are right-sized
IMX-1337: Audit all libraries connecting to SQL to ensure a safe number of connections
IMX-1377: Ensure SQL primary/secondary fallthrough works, and that load is spread evenly across primaries/secondaries
IMX-1340: Add SQL-specific load metrics to our primary metrics dashboards
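
As a rough illustration of the kind of check the connection audit and right-sizing items could involve, the sketch below adds up the worst-case number of connections a fleet of clients can open and compares it with a server's connection limit, keeping some headroom in reserve. All names and numbers here are hypothetical and are not taken from our infrastructure.

```python
def worst_case_connections(services):
    """Total connections if every instance of every service fills its pool."""
    return sum(s["instances"] * s["pool_size"] for s in services)


def check_budget(services, max_connections, reserve_pct=20):
    """Return (fits, total, budget), keeping reserve_pct percent of the limit free."""
    budget = max_connections * (100 - reserve_pct) // 100
    total = worst_case_connections(services)
    return total <= budget, total, budget


# Illustrative fleet only: service names, instance counts and pool sizes are made up.
fleet = [
    {"name": "api", "instances": 4, "pool_size": 10},
    {"name": "smtp", "instances": 8, "pool_size": 5},
]

fits, total, budget = check_budget(fleet, max_connections=100)
print(fits, total, budget)  # True 80 80
```

A check like this only covers connection counts. Whether the servers can handle the query load arriving over those connections is a separate question, which is why load metrics and right-sizing are tracked as their own items.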

Matthew Tse

Owner and CEO of ImprovMX