Write-up
Issue with Multiple Applications
Issue Summary

On October 1, 2025, between 20:10 and 21:27 UTC, our platform experienced widespread service disruptions affecting multiple features. During this time, customers may have encountered errors or delays when using segmentation operations and onboarding features, and when sending marketing and transactional SMS messages and emails.

We sincerely apologize for the inconvenience this caused and want to provide full transparency about what happened and how we're preventing similar issues in the future.

What Happened

A scheduled maintenance event by one of our cloud infrastructure providers caused an unexpected network routing issue on our end. This maintenance affected the network connection between our data centers, which disrupted communication between our application servers and databases.

The issue arose because our network routing configuration inadvertently continued directing traffic through a network path that was temporarily unavailable due to the maintenance. This caused timeouts and connection failures across multiple services.

Customer Impact

Duration: Approximately 1 hour and 17 minutes (20:10 - 21:27 UTC)

Affected Services:

  • Segmentation operations: Failed to complete or experienced significant delays

  • Onboarding processes: Interruptions for new users and account setup

  • Email: Delays in sending both marketing and transactional emails

  • SMS campaigns: Delays and failures in sending marketing and transactional SMS messages

During this window, customers experienced intermittent errors and slower-than-normal performance across the platform.

Timeline

  • 20:10 UTC - Issue began affecting services

  • 20:18 UTC - Our monitoring systems detected the problem and our team was alerted

  • 21:27 UTC - Issue fully resolved

  • 21:54 UTC - All services confirmed operating normally
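The intervals implied by this timeline can be cross-checked with a short calculation. The sketch below uses only the timestamps listed above; the variable and function names are illustrative and do not refer to any internal tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline above.
started = datetime.strptime("2025-10-01 20:10", FMT)
detected = datetime.strptime("2025-10-01 20:18", FMT)
resolved = datetime.strptime("2025-10-01 21:27", FMT)
confirmed = datetime.strptime("2025-10-01 21:54", FMT)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(started, detected)    # 8 minutes
impact_duration = minutes_between(started, resolved)   # 77 minutes (1 h 17 min)
verification = minutes_between(resolved, confirmed)    # 27 minutes

print(time_to_detect, impact_duration, verification)
```

The 77-minute impact window matches the "1 hour and 17 minutes" duration stated under Customer Impact.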

Root Cause and Resolution

The root cause was a maintenance window by our cloud provider, for which we had not received advance notice, that temporarily disrupted a network link. Our network configuration, which was designed to provide redundancy, did not properly redirect traffic to alternative paths due to how routing information was being shared between our systems.

Our engineering team quickly identified the problem and implemented an emergency configuration change to reroute traffic through healthy network paths. This immediately stabilized all affected services.
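To illustrate the general failure mode, here is a minimal sketch of path selection with and without a health check. It is not our actual routing configuration: the Path type, the path names, the cost values, and both selection functions are hypothetical and exist only to show why a preferred path that is down must be excluded before traffic is routed.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str       # hypothetical label for a network path
    cost: int       # lower cost is preferred
    healthy: bool   # outcome of an assumed health probe


def pick_static(paths):
    """Choose the lowest-cost path without checking health.

    This mirrors the failure mode: traffic keeps flowing to the
    preferred path even while it is unavailable.
    """
    return min(paths, key=lambda p: p.cost)


def pick_with_failover(paths):
    """Choose the lowest-cost path among those currently healthy."""
    healthy = [p for p in paths if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy network path available")
    return min(healthy, key=lambda p: p.cost)


# Example: the preferred link is down for maintenance.
paths = [
    Path("primary-link", cost=10, healthy=False),
    Path("backup-link", cost=20, healthy=True),
]

print(pick_static(paths).name)         # primary-link (unavailable)
print(pick_with_failover(paths).name)  # backup-link
```

In this simplified model, redundancy only helps when the selection step accounts for path health, which is the behavior the emergency configuration change restored.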

What We're Doing to Prevent This

We're taking several steps to ensure this doesn't happen again:

  • Improved Monitoring: Enhanced our notification system to receive advance notice of infrastructure provider maintenance windows

  • Network Configuration Updates: Modified our routing configuration to better handle similar scenarios and ensure proper failover

  • Infrastructure Improvements: Planning to redistribute critical infrastructure components to reduce dependency on any single network path

  • Runbook Enhancements: Updated our incident response procedures to identify and resolve network-related issues more quickly
