On Monday, May 7th, our platform experienced an outage that affected most customer-facing services from approximately 16:40 UTC to 19:31 UTC. During this time, users encountered issues accessing our UI, and sending of both marketing and transactional emails was impacted.
The outage affected:
Web application access
Email sending capabilities (both marketing and transactional)
Most API endpoints
While Email API endpoints remained accessible for queuing messages, the actual delivery of these messages was delayed until service was restored.
The outage was caused by a networking configuration issue in our internal infrastructure. A previously deployed configuration change created an unexpected interaction with our service mesh, causing request routing failures. When this configuration was propagated across our entire infrastructure during routine maintenance, it resulted in widespread 404 errors for internal service communications.
Our engineering team identified the issue and implemented a solution by:
Isolating the problematic configuration
Reconfiguring our internal networking to use dedicated routing paths
Expanding our network address pool to support the new configuration
Progressively restoring services in order of priority
Critical services were restored by 18:24 UTC, with full platform recovery completed by 19:31 UTC.
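The address pool step can be illustrated with a capacity check of the kind that might run before dedicated routing paths are assigned. This is a minimal sketch under stated assumptions: IPv4 ranges, one address per service, and hypothetical CIDR values and function names. It does not describe the production configuration.

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Return the number of usable host addresses in an IPv4 CIDR block.

    Assumption: the network and broadcast addresses are reserved for
    prefixes shorter than /31.
    """
    network = ipaddress.ip_network(cidr)
    if network.prefixlen < 31:
        return network.num_addresses - 2
    return network.num_addresses


def pool_can_support(cidr: str, service_count: int, addresses_per_service: int = 1) -> bool:
    """Check whether a pool is large enough to give every service its own address(es)."""
    return usable_addresses(cidr) >= service_count * addresses_per_service


# Hypothetical example: 300 services do not fit in a /24 but do fit in a /22.
print(pool_can_support("10.0.0.0/24", 300))
print(pool_can_support("10.0.0.0/22", 300))
```

Running a check like this before switching routing modes would surface a pool that is too small before any service is moved.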
This incident highlighted several areas for improvement in our systems:
Testing of configuration changes before deployment
Monitoring that detects partial failures before they cascade
Visibility into configuration changes across our infrastructure
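One way to detect a partial failure of this kind is to watch the share of 404 responses on internal calls over a sliding window. The sketch below assumes a hypothetical `NotFoundRateMonitor` class, an arbitrary window size, and an arbitrary threshold. It illustrates the idea only and is not our monitoring implementation.

```python
from collections import deque


class NotFoundRateMonitor:
    """Track the fraction of 404 responses over the most recent requests."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # Each entry is True if the corresponding response was a 404.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code: int) -> None:
        self.window.append(status_code == 404)

    def rate(self) -> float:
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def is_anomalous(self) -> bool:
        # Only alert once the window is full, to avoid noise from a few samples.
        return len(self.window) == self.window.maxlen and self.rate() > self.threshold


monitor = NotFoundRateMonitor(window_size=10, threshold=0.2)
for status in [200] * 8 + [404] * 3:
    monitor.record(status)
print(monitor.rate(), monitor.is_anomalous())
```

A per-service monitor along these lines could raise an alert while only a subset of routes is failing, before the problem spreads.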
We are implementing several changes to prevent similar issues:
Adopting dedicated routing paths for all services to prevent widespread failures
Improving our monitoring systems to detect configuration anomalies
Enhancing our deployment processes to identify potential conflicts
Creating better tooling to track configuration changes across our systems
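A staged rollout with a health gate between stages is one common way to stop a configuration change from reaching an entire infrastructure at once. The following is a minimal sketch, assuming hypothetical stage names and caller-supplied `apply_config` and `is_healthy` callbacks. It is not a description of our deployment pipeline.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Apply a change stage by stage, halting at the first unhealthy stage.

    Returns the stages the change was applied to and whether the rollout
    completed successfully.
    """
    applied: List[str] = []
    for stage in stages:
        apply_config(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Stop here so later stages never receive the change.
            return applied, False
    return applied, True


# Hypothetical example: the change breaks "region-a", so "region-b" is never touched.
applied, ok = staged_rollout(
    ["canary", "region-a", "region-b"],
    apply_config=lambda stage: None,
    is_healthy=lambda stage: stage != "region-a",
)
print(applied, ok)
```

The health check could be as simple as an error-rate threshold on internal requests. The key design choice is that propagation stops automatically instead of depending on someone noticing the failure.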
We sincerely apologize for the disruption this incident caused to your operations. We are committed to continuously improving our platform's reliability, and we appreciate your patience during this event.