On January 29th, 2026, at approximately 11:20 AM UTC, multiple production services experienced degradation and intermittent unavailability. This resulted in elevated error rates, request timeouts, and temporary failures across several customer-facing features.
The incident was triggered by an unexpected spike in system resource usage during background processing, which led to internal connectivity disruptions. Our engineering teams quickly identified the issue, applied mitigations, and restored services shortly thereafter. We continue to monitor the platform to ensure full stability and prevent recurrence.
Several customer-facing features experienced increased latency, including segmentation, automation workflows, transactional and marketing email delivery, outbound webhooks, and email engagement tracking (such as opens and clicks).
All affected services have since fully recovered.
A limited number of components were monitored post-recovery to ensure continued stability.
During routine background operations, a processing task unintentionally opened a very high number of simultaneous connections to a shared data system. This exhausted available system resources, preventing new connections from being established.
As resource limits were reached, internal services were unable to communicate reliably with one another. This resulted in widespread connection failures and increased application error rates. Additionally, internal name-resolution caching issues delayed recovery for some components, even after the initial system load was reduced.
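To make the failure mode concrete in general terms: a background job that opens one connection per work item can exhaust a shared system's connection slots long before any other resource limit is reached, while a simple concurrency cap keeps the same workload within safe bounds. The sketch below is illustrative only, not our production code; `FakeConnection`, the limit, and the workload are all hypothetical.

```python
import asyncio

MAX_CONNECTIONS = 20  # hypothetical cap; the safe value depends on the shared system's limits

class FakeConnection:
    """Stand-in for a real client connection to the shared data system."""
    async def execute(self, work: str) -> None:
        await asyncio.sleep(0.01)   # simulate query I/O
    async def close(self) -> None:
        pass

async def open_connection() -> FakeConnection:
    await asyncio.sleep(0.005)      # simulate connection setup
    return FakeConnection()

async def process_item(item: str, slots: asyncio.Semaphore) -> None:
    async with slots:               # wait for a free slot instead of piling on
        conn = await open_connection()
        try:
            await conn.execute(f"work for {item}")
        finally:
            await conn.close()      # always close, even when the work fails

async def main() -> None:
    # The semaphore guarantees at most MAX_CONNECTIONS connections exist at
    # once; without it, gather() would try to open 10,000 at the same time.
    slots = asyncio.Semaphore(MAX_CONNECTIONS)
    await asyncio.gather(*(process_item(f"item-{i}", slots) for i in range(10_000)))

if __name__ == "__main__":
    asyncio.run(main())
```

Capping concurrency this way turns an overload into backpressure: excess work waits for a free slot instead of taking the shared system down.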
Our engineering team took the following actions:
Immediately stopped the background process responsible for excessive resource consumption.
Paused related workloads to prevent further strain on the system.
Restored normal operation of the affected data systems.
Resolved internal connectivity and caching issues to reestablish reliable service communication.
Verified recovery across all impacted services and closely monitored system performance.
Communicated updates through our status page to keep customers informed.
To prevent similar incidents in the future, we are implementing the following improvements:
Proactive connection monitoring:
Add alerts that track active system connections and quickly flag abnormal spikes before they cause customer impact (see the first sketch after this list).
Improved connection handling:
Enhance background processing to reuse connections efficiently, close them properly, and enforce safe limits that prevent overload (see the second sketch after this list).
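As a rough sketch of what the connection-monitoring alert can look like (the thresholds and the source of the counts below are assumptions, not our actual monitoring configuration), the core check compares each new sample of active connections against a rolling baseline and flags sudden multiples of it:

```python
from collections import deque

WINDOW = 60          # samples kept for the baseline (hypothetical)
SPIKE_FACTOR = 3.0   # alert when a sample exceeds 3x the recent average (hypothetical)
MIN_BASELINE = 50    # ignore spikes while the baseline is trivially small

recent_counts = deque(maxlen=WINDOW)

def check_connection_spike(current_count: int) -> bool:
    """Return True when the active-connection count spikes above the rolling baseline."""
    baseline = sum(recent_counts) / len(recent_counts) if recent_counts else 0.0
    recent_counts.append(current_count)
    return baseline >= MIN_BASELINE and current_count > SPIKE_FACTOR * baseline

# Example: a steady baseline around 100 connections, then an abnormal jump.
for count in [100] * 60 + [420]:
    if check_connection_spike(count):
        print(f"ALERT: active connections spiked to {count}")
```

In practice the samples would come from the data system's own connection statistics, and the alert would fire before the connection limit is reached rather than after.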
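And as a minimal sketch of the improved connection handling (the pool size, `FakeConnection`, and `connect` are again hypothetical; most production drivers ship an equivalent pool built in), work is routed through a fixed set of reusable connections rather than opened ad hoc:

```python
import queue
from contextlib import contextmanager

class FakeConnection:
    """Stand-in for a real driver connection."""
    def execute(self, work: str) -> None:
        pass
    def close(self) -> None:
        pass

def connect() -> FakeConnection:
    return FakeConnection()

class ConnectionPool:
    """Reuses a fixed set of connections and enforces a hard upper bound."""
    def __init__(self, size: int) -> None:
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):          # open the whole pool up front
            self._pool.put(connect())

    @contextmanager
    def connection(self):
        conn = self._pool.get()        # blocks when every connection is in use
        try:
            yield conn
        finally:
            self._pool.put(conn)       # return for reuse instead of reopening

pool = ConnectionPool(size=10)         # hypothetical bound

with pool.connection() as conn:
    conn.execute("work")
```

Because a full pool blocks instead of opening more connections, a runaway job slows down rather than exhausting the shared system.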
We sincerely apologize for the disruption and any inconvenience this incident may have caused. Providing a reliable and stable platform for our customers remains our highest priority. We are taking proactive steps to strengthen our systems and minimize the risk of similar issues in the future.
Thank you for your understanding and continued trust.