Multiple services are down
Summary

On Monday, May 7th, our platform experienced a major outage affecting most customer-facing services from approximately 16:40 UTC to 19:31 UTC. During this time, users had trouble accessing our UI, and sending of both marketing and transactional emails was impacted.

Impact

The outage affected:

  • Web application access

  • Email sending capabilities (both marketing and transactional)

  • Most API endpoints

Email API endpoints remained accessible and continued to queue messages, but delivery of those messages was delayed until service was restored.

Root Cause

The outage was caused by a networking configuration issue in our internal infrastructure. A previously deployed configuration change interacted unexpectedly with our service mesh and caused request routing failures. When routine maintenance propagated this configuration across our entire infrastructure, internal service-to-service requests began returning 404 errors on a wide scale.

Resolution

Our engineering team identified the issue and implemented a solution by:

  • Isolating the problematic configuration

  • Reconfiguring our internal networking to use dedicated routing paths

  • Expanding our network address pool to support the new configuration

  • Progressively restoring services in order of priority

Critical services were restored by 18:24 UTC, with full platform recovery completed by 19:31 UTC.
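
For illustration only, the priority-ordered restoration described above can be sketched as follows. This is a minimal sketch under stated assumptions, not our actual tooling: the `Service` type, the `restore` and `is_healthy` callbacks, and the service names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Service:
    name: str      # hypothetical service name
    priority: int  # lower number = restored earlier


def restore_in_priority_order(
    services: Iterable[Service],
    restore: Callable[[Service], None],
    is_healthy: Callable[[Service], bool],
) -> List[str]:
    """Restore services one at a time, lowest priority number first.

    Stops at the first service that fails its health check, so a bad
    restore does not carry over to lower-priority services.
    Returns the names of the services that were restored successfully.
    """
    restored: List[str] = []
    for service in sorted(services, key=lambda s: s.priority):
        restore(service)
        if not is_healthy(service):
            break
        restored.append(service.name)
    return restored
```

Called with hypothetical services such as `Service("email-sending", 1)` and `Service("web-ui", 2)`, the function restores email sending first and only moves on to the web UI once the health check passes.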

What We've Learned

This incident highlighted several areas for improvement in our systems:

  • Better testing of configuration changes before deployment

  • Improved monitoring to detect partial failures before they cascade

  • Enhanced visibility into configuration changes across our infrastructure
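
As one possible shape for the monitoring improvement listed above, the sketch below tracks the share of 404 responses over the most recent requests and signals when it crosses a threshold. It is an assumption-laden illustration: the class name, the window size, and the threshold are hypothetical and are not values from our systems.

```python
from collections import deque
from typing import Deque


class NotFoundRateMonitor:
    """Tracks the share of 404 responses over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.threshold = threshold  # hypothetical alerting threshold
        # A bounded deque drops the oldest entry once the window is full.
        self._recent: Deque[bool] = deque(maxlen=window)

    def record(self, status_code: int) -> None:
        """Record one response; only 404s count as failures here."""
        self._recent.append(status_code == 404)

    def rate(self) -> float:
        """Return the fraction of recorded responses that were 404s."""
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        """True when the 404 rate is strictly above the threshold."""
        return self.rate() > self.threshold
```

Running one such monitor per internal service would let a rising 404 rate on a single route surface before the failure spreads to other services.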

Preventative Measures

We are implementing several changes to prevent similar issues:

  • Adopting dedicated routing paths for all services to prevent widespread failures

  • Improving our monitoring systems to detect configuration anomalies

  • Enhancing our deployment processes to identify potential conflicts

  • Creating better tooling to track configuration changes across our systems
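
To illustrate how a deployment process can limit the reach of a faulty configuration, here is a minimal sketch of a staged rollout that halts as soon as a stage shows elevated errors. The function, its callbacks, the host names, and the 1% error limit are hypothetical and do not describe our actual deployment pipeline.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_config: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # hypothetical limit
) -> List[str]:
    """Apply a configuration change one stage at a time.

    After each stage, every host in that stage is checked. If any host
    exceeds `max_error_rate`, the rollout halts and later stages are
    left untouched. Returns the hosts that received the change.
    """
    updated: List[str] = []
    for stage in stages:
        for host in stage:
            apply_config(host)
            updated.append(host)
        if any(error_rate(host) > max_error_rate for host in stage):
            break
    return updated
```

Starting with a small first stage (for example, a single canary host) means a configuration that conflicts with existing routing affects only that stage instead of the entire infrastructure.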

We sincerely apologize for any disruption this incident may have caused to your operations. We're committed to continuously improving our platform's reliability and appreciate your patience during this event.
