Brevo

Write-up

Issue with multiple services

Issue Summary:

On July 30, between 08:32 and 09:02 UTC, we experienced a temporary disruption that affected multiple customer-facing services. During this time, some users may have seen error messages or experienced delays when accessing our frontend applications or receiving emails.

Impact:

The following areas were affected during the incident:

Access to all frontend applications was briefly unavailable.
Transactional emails (such as password resets and confirmations) were delayed or not delivered.
Marketing emails also experienced delivery issues.

No customer data was lost, and all systems recovered once the issue was resolved.

Root Cause:

The incident was triggered by the deletion of a service-specific routing rule within our system. This rule was handled by a shared traffic routing component that also served several unrelated services.

The deletion left the shared routing layer misconfigured because its configuration failed to update correctly. As a result, multiple unrelated services lost the ability to route traffic properly, leading to 403 errors and sudden connection drops across several applications.

This was due to a known issue in how the routing system manages dynamic configuration updates. In this case, the configuration sync failed after one service’s route was removed, impacting all services using the same routing entry point.
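
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual routing system: the class `SharedRouter`, its methods, the service names, and the choice to model the failed sync as an empty live routing table are all hypothetical.

```python
class SharedRouter:
    """Toy model of one routing entry point shared by several services (hypothetical)."""

    def __init__(self, routes):
        self.desired = dict(routes)  # source-of-truth configuration
        self.active = dict(routes)   # table actually used to serve traffic

    def delete_route(self, service, sync_ok=True):
        # Remove one service's rule, then try to sync the live table.
        del self.desired[service]
        if sync_ok:
            self.active = dict(self.desired)
        else:
            # Assumed failure mode: a failed sync leaves the live table
            # unusable for every service behind this entry point.
            self.active = {}

    def restart(self):
        # A restart forces a full reload from the source of truth.
        self.active = dict(self.desired)

    def handle(self, service):
        # Return an HTTP-like status for a request to the given service.
        return 200 if service in self.active else 403


router = SharedRouter({
    "frontend": "fe-backend",
    "smtp-api": "mail-backend",
    "legacy": "old-backend",
})

# Deleting one service's rule with a failed sync breaks the unrelated services too.
router.delete_route("legacy", sync_ok=False)
during = {name: router.handle(name) for name in ("frontend", "smtp-api")}

# A manual restart reloads the full configuration and restores the remaining routes.
router.restart()
after = {name: router.handle(name) for name in ("frontend", "smtp-api")}
```

In this model, `during` holds 403 for both unrelated services and `after` holds 200 for both, while the deleted `legacy` route stays removed.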

Action Taken:

At 09:00 UTC, our engineering team manually restarted the affected routing layer to trigger a full configuration reload. This restored all expected traffic routes and resolved the issue immediately.

To prevent recurrence, we’ve made the following improvements:

Operational Safeguards: We are adding checks to detect and respond to similar issues faster.
Routing Isolation: All critical services have now been moved to dedicated routing paths, ensuring that changes to one service can no longer impact others.
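
As a rough illustration of what a detection check of this kind might look like, the sketch below flags a probable shared-layer fault when several unrelated services show a high rate of 403 responses at the same time. The function name, the thresholds, and the sample figures are assumptions made for illustration; they do not describe the checks actually deployed.

```python
def shared_fault_suspected(error_rates, threshold=0.5, min_services=2):
    """Return (suspected, failing_services) for a snapshot of per-service 403 rates.

    error_rates maps a service name to the fraction of its recent requests
    that returned 403. Both threshold and min_services are illustrative values.
    """
    failing = sorted(name for name, rate in error_rates.items() if rate >= threshold)
    # Simultaneous failures in unrelated services point at a shared component
    # rather than at any single service.
    return len(failing) >= min_services, failing


# Hypothetical snapshot resembling the incident: three services failing together.
sample = {
    "frontend": 0.92,
    "transactional-email": 0.88,
    "marketing-email": 0.90,
    "billing": 0.01,
}
suspected, failing = shared_fault_suspected(sample)
```

With routing isolation in place, a single failing service would no longer trip this check, because `min_services` requires at least two services to fail together.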

We sincerely apologize for the inconvenience this caused. Ensuring reliable service is our top priority, and we’re actively working to prevent such incidents in the future.