Summary
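A spike in inbound SMTP connections triggered a recurrence of the SQL connection overload issue from 09/23/2025. API endpoints and email forwarding were degraded from 08:08 to 09:10, and the errors stopped once the whole SQL cluster had been upgraded to machines 4x the size.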
Timeline (10/05/2025 EST)
08:00 Inbound SMTP connections spike
08:08 [First Machine Detection][Service Degradation Begins] Sentry alerting detects the "too many connections" SQL error and emails the team. API endpoints and email forwarding begin to degrade.
08:15 [Machine Escalation] Our alerting detects errors for api.improvmx.com and pages Matthew, the primary oncall.
08:16 [Oncall Signs On] Oncall acknowledges the page and signs on.
08:18 [Customers Notified] Oncall posts that we're investigating increased error rates.
08:58 [Mitigation #1 Attempted][Recovery Begins] Oncall recognizes that the SQL servers are under heavy load and upgrades one of the SQL secondaries to an EC2 instance 4x the size.
09:08 [Mitigation #2 Attempted] The upgrade appears to help, so oncall applies the same 4x upgrade to the other secondary and the primary.
09:10 [Recovery Complete] The full upgrade of the SQL cluster causes the SQL connection overload error to disappear from all clients.
09:14 [Customers Notified] Oncall posts incident resolution to status.improvmx.com
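The intervals implied by the timeline above can be computed directly from its timestamps. The following is a minimal sketch rather than part of our tooling: the dictionary keys and the function names minutes_between and incident_durations are illustrative assumptions, and only the timestamps come from the timeline.

```python
from datetime import datetime

# Timestamps copied from the timeline above (10/05/2025, same day).
# The key names are illustrative labels, not names used by any tool.
TIMELINE = {
    "spike": "08:00",
    "first_detection": "08:08",
    "page": "08:15",
    "ack": "08:16",
    "first_mitigation": "08:58",
    "recovery_complete": "09:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_durations(timeline: dict) -> dict:
    """Derive the usual postmortem intervals from the timeline."""
    return {
        "spike_to_detection": minutes_between(
            timeline["spike"], timeline["first_detection"]),
        "detection_to_page": minutes_between(
            timeline["first_detection"], timeline["page"]),
        "page_to_ack": minutes_between(timeline["page"], timeline["ack"]),
        "ack_to_first_mitigation": minutes_between(
            timeline["ack"], timeline["first_mitigation"]),
        "degradation_total": minutes_between(
            timeline["first_detection"], timeline["recovery_complete"]),
    }


if __name__ == "__main__":
    for name, minutes in incident_durations(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

Read this way, the longest single interval is the 42 minutes between oncall signing on and the first mitigation attempt.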
What Went Wrong
The last time this SQL issue occurred, we sharply reduced the number of connections our servers make to SQL. That reduction was not enough on its own: the SQL servers were still overloaded during this spike and needed to be upgraded.
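One way to reason about whether a connection reduction is sufficient is a worst-case connection budget: add up the most connections every client could open at once and compare that with the server's limit, minus some headroom. The sketch below is only an illustration under stated assumptions. The Client class, the connection_budget function, the service names, and every number in the example are hypothetical and are not taken from our actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Client:
    """A service that connects to SQL. All fields are hypothetical inputs."""
    name: str
    hosts: int
    processes_per_host: int
    pool_size: int  # max connections each process may open

    def worst_case_connections(self) -> int:
        # Every process on every host fills its pool at the same time.
        return self.hosts * self.processes_per_host * self.pool_size


def connection_budget(clients, server_limit: int, headroom_pct: int = 20) -> dict:
    """Compare worst-case client demand with the server's connection limit.

    headroom_pct reserves part of the limit for admin sessions,
    replication, and bursts. Integer math avoids float rounding.
    """
    total = sum(c.worst_case_connections() for c in clients)
    usable = server_limit * (100 - headroom_pct) // 100
    return {"total": total, "usable": usable, "within_budget": total <= usable}


if __name__ == "__main__":
    # Hypothetical fleet, for illustration only.
    fleet = [
        Client("smtp-inbound", hosts=4, processes_per_host=2, pool_size=10),
        Client("api", hosts=3, processes_per_host=4, pool_size=5),
    ]
    print(connection_budget(fleet, server_limit=150))
```

A budget like this only bounds the connection count. As this incident showed, a server can stay within its connection limit and still be overloaded, so it complements load metrics rather than replacing them.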
Action Items
- IMX-1378: Ensure SQL servers are right-sized
- IMX-1337: Audit all libraries connecting to SQL to ensure a safe number of connections
- IMX-1377: Ensure SQL primary/secondary fallthrough works, and that load is spread evenly across primaries/secondaries
- IMX-1340: Add SQL-specific load metrics to our primary metrics dashboards
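As an illustration of what IMX-1377 could verify, the sketch below shows one possible reading of "fallthrough": reads rotate evenly across healthy secondaries and fall through to the primary only when no secondary is available. This is an assumption about the intended behavior, not a description of our current client code. The ReadRouter class, its methods, and the host names are hypothetical.

```python
class ReadRouter:
    """Round-robin read routing with fallthrough to the primary.

    Hypothetical helper: it only picks a host name and does not
    open connections or run health checks itself.
    """

    def __init__(self, primary: str, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)
        self._healthy = set(self.secondaries)
        self._next = 0  # index of the secondary to try first

    def mark_down(self, host: str) -> None:
        self._healthy.discard(host)

    def mark_up(self, host: str) -> None:
        if host in self.secondaries:
            self._healthy.add(host)

    def pick_reader(self) -> str:
        """Return the next healthy secondary, or the primary if none."""
        count = len(self.secondaries)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self.secondaries[index]
            if candidate in self._healthy:
                # Resume after the chosen host so load stays even.
                self._next = (index + 1) % count
                return candidate
        return self.primary


if __name__ == "__main__":
    router = ReadRouter("sql-primary", ["sql-secondary-1", "sql-secondary-2"])
    print([router.pick_reader() for _ in range(4)])
    router.mark_down("sql-secondary-1")
    print(router.pick_reader())
```

A test for the action item could drive a router like this with simulated failures and assert that each healthy secondary receives an equal share of reads.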