Write-up
Issue with multiple services
Incident Summary

On Friday, July 25th, between 06:32 UTC and 08:20 UTC, we experienced intermittent disruptions affecting our Public API V3, transactional email delivery, and frontend applications. These issues were the result of an internal configuration error and an unexpected load surge on critical components. We’ve identified and addressed the root causes and are taking clear steps to prevent recurrence.

What Was Impacted

The incident occurred in two overlapping parts:

1. Public API & Email Delivery Impact (06:32 UTC – 07:04 UTC)

Between 06:32 and 07:04 UTC, the following issues were observed:

  • Public API V3 traffic intermittently returned 500 Internal Server Errors

  • The majority of these errors were concentrated in:

    • /v3/smtp/email: 75% of the failures

    • /v3/contact/lists: 24.4% of the failures

    • Other routes accounted for the remaining 0.6%

  • A portion of transactional emails was routed through a legacy fallback system. After the issue was resolved, these emails were replayed, resulting in some end users receiving duplicate emails.

  • Marketing email campaign processing was delayed by 25–30 minutes, affecting only a very small number of customers

2. Frontend Application Impact (07:03 UTC – 08:20 UTC)

  • At 07:03 UTC, while the Public API V3 issue was still being resolved, we observed frontend errors that prevented many applications from displaying data.

  • At 07:42 UTC, frontend access was restored, although some errors still persisted.

  • By 08:20 UTC, all remaining issues were fully resolved.

Root Cause

A code change intended to update internal configuration was released at 06:32 UTC. However, due to an error, the change inadvertently modified a setting unrelated to its intended scope. This caused a critical dependency for the Public API V3 to become unreachable.

As a result, certain system components became unhealthy and entered crash-restart cycles, making the API temporarily unavailable. Some transactional emails were routed through a fallback system during this time, which led to duplicate email deliveries once the system recovered.
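
The duplicate deliveries came from replaying messages that the fallback system had already sent. As a rough, simplified sketch of the kind of safeguard described under Next Steps, replayed messages can be skipped when their idempotency key has already been recorded; the field names, the in-memory store, and the deliver function below are illustrative assumptions, not our production pipeline.

    # Simplified sketch: skip replayed messages that were already delivered
    # through the fallback path. All names here are illustrative.

    already_sent: set[str] = set()  # in production, a durable store

    def idempotency_key(message: dict) -> str:
        # One stable key per logical send, e.g. message ID plus recipient.
        return f"{message['message_id']}:{message['recipient']}"

    def deliver(message: dict) -> None:
        print(f"delivering {message['message_id']} to {message['recipient']}")

    def replay(messages: list[dict]) -> None:
        for message in messages:
            key = idempotency_key(message)
            if key in already_sent:
                continue  # already delivered, e.g. by the fallback system
            deliver(message)
            already_sent.add(key)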

At 07:04 UTC, the misconfiguration was corrected, all affected components of the Public API recovered automatically, and Public API functionality was fully restored.

Meanwhile, at 07:03 UTC, our frontend applications began to experience errors due to an unexpected surge in traffic to a backend validation service. This surge overwhelmed the frontend infrastructure, producing spurious CORS errors and preventing frontend apps from rendering data.

At 07:42 UTC, a component restart alleviated the issue for most frontend traffic. Finally, at 08:20 UTC, a related backend change was rolled back, resolving the remaining issues and fully stabilizing all services.

Resolution

Our engineering team:

  • Quickly identified and corrected the misconfiguration that triggered the initial service disruption

  • Restarted impacted components and reverted a related backend change to stabilize frontend services

  • Monitored the system for residual effects, including retries and delays in email delivery

  • Verified the health and performance of all affected components to confirm full recovery

Next Steps

To prevent similar issues in the future, we are:

  • Improving safeguards and validation for internal configuration changes (a brief sketch follows this list)

  • Enhancing monitoring to detect partial dependency failures earlier

  • Strengthening our existing controls to further minimize the risk of duplicate processing during retries or message replays

  • Continuing internal audits of our change management and deployment pipelines
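
As a minimal illustration of the first item above, a deployment-time check can reject a configuration change that modifies keys outside its declared scope, which is the class of error behind this incident. The key names and the declared scope below are hypothetical; this is a sketch of the idea, not our actual tooling.

    # Simplified sketch: reject a config change that touches keys outside
    # the scope it declared. Key names and scopes are hypothetical.

    def changed_keys(old: dict, new: dict) -> set[str]:
        return {k for k in set(old) | set(new) if old.get(k) != new.get(k)}

    def validate_change(old: dict, new: dict, declared_scope: set[str]) -> None:
        out_of_scope = changed_keys(old, new) - declared_scope
        if out_of_scope:
            raise ValueError(
                f"change modifies keys outside its declared scope: {sorted(out_of_scope)}"
            )

    # Example: a change declared for an email setting must not alter API routing.
    old_config = {"email.retry_limit": 3, "api.v3.upstream": "internal-dependency"}
    new_config = {"email.retry_limit": 5, "api.v3.upstream": "unreachable-host"}
    try:
        validate_change(old_config, new_config, declared_scope={"email.retry_limit"})
    except ValueError as exc:
        print(exc)  # the change is rejected before rollout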

We sincerely apologize for the disruption and any inconvenience caused. Ensuring a reliable experience for our customers remains our highest priority. We are taking proactive steps to prevent recurrence and improve resilience across our systems.

Thank you for your understanding and continued trust.
