Issue summary
On Monday, July 28, 2025, between 21:12 and 21:24 IST, some clients experienced delays in the delivery of outbound messages, including transactional and marketing emails, SMS, and WhatsApp communications. Delays were up to 2 minutes.
Impact
Delay of up to 2 minutes for outbound transactional emails, marketing emails, SMS, and WhatsApp messages. No message loss occurred; all messages were successfully delivered after the delay.
Root Cause
The issue was triggered by increased read latency in our storage layer (Bigtable) following a recent backend update. This latency degraded the performance of our Plan Manager service, which plays a central role in handling and routing outbound messages. As a result, the services that depend on Plan Manager (transactional and marketing emails, SMS, and WhatsApp) experienced delivery delays.
The elevated latency was due to a spike in CPU and storage consumption, which caused timeouts during message processing. Once we identified this, we mitigated the issue by increasing the capacity of the affected Bigtable instance, which restored system stability.
Resolution
Once the root cause was identified (high read latency in Bigtable affecting the Plan Manager service), our engineering team took immediate action to stabilize the system. We scaled up the capacity of the affected Bigtable instance, which reduced latency and restored normal performance across all impacted services.
Next Steps
To prevent recurrence of this issue and improve overall system resilience, we are taking the following actions:
Infrastructure Scaling Automation
We are enhancing our auto-scaling policies for Bigtable and other key infrastructure components. This will allow us to automatically respond to increased load or resource consumption before it begins to impact performance.
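As an illustration only, the kind of scaling decision such a policy makes can be sketched as below. This is a simplified model, not our production configuration: the function name `desired_nodes`, the percentage inputs, and the node limits are all hypothetical, and it assumes CPU load spreads evenly across nodes.

```python
def desired_nodes(current_nodes: int, cpu_percent: int, target_percent: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Return a node count that would bring average CPU back to the target.

    All names and thresholds here are hypothetical. The model assumes load is
    spread evenly, so total work is current_nodes * cpu_percent.
    """
    if current_nodes < 1 or target_percent <= 0:
        raise ValueError("current_nodes must be >= 1 and target_percent > 0")
    # Integer ceiling division: ceil(current_nodes * cpu_percent / target_percent).
    needed = -(-current_nodes * cpu_percent // target_percent)
    # Clamp to the configured bounds so scaling never runs away or drops to zero.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # 3 nodes at 90% CPU with a 60% target suggests scaling out to 5 nodes.
    print(desired_nodes(3, 90, 60, min_nodes=1, max_nodes=10))
```

Scaling on a utilization target rather than on observed timeouts is what lets capacity grow before latency is affected.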
Improved Monitoring and Alerting
We're updating our monitoring systems to detect early warning signs such as latency spikes or rising CPU/storage usage across critical services. This includes more granular metrics and real-time alerting to ensure quicker detection and response.
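A minimal sketch of one such early-warning rule follows, assuming an alert should fire when several recent read-latency samples exceed a threshold. The class name `LatencyAlert`, the 100 ms threshold, and the window sizes are hypothetical values chosen for illustration, not our actual alerting rules.

```python
from collections import deque


class LatencyAlert:
    """Fire when enough of the most recent samples exceed a latency threshold.

    Hypothetical illustration: names and numbers are assumptions.
    """

    def __init__(self, threshold_ms: float, window: int, min_breaches: int):
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        # deque(maxlen=...) drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms: float) -> bool:
        """Record one sample and return True if the alert condition holds."""
        self.samples.append(latency_ms)
        breaches = sum(1 for s in self.samples if s > self.threshold_ms)
        return breaches >= self.min_breaches


if __name__ == "__main__":
    alert = LatencyAlert(threshold_ms=100, window=5, min_breaches=3)
    for sample in [20, 150, 30, 160, 170]:
        firing = alert.observe(sample)
    print(firing)  # True: three of the last five samples exceeded 100 ms
```

Requiring several breaches within a window, rather than a single one, trades a little detection time for fewer false alarms from isolated slow reads.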
Service Isolation Enhancements
We are working to better isolate time-sensitive services like outbound messaging from shared dependencies. This will help limit the blast radius of similar issues in the future and keep critical workflows running smoothly.
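One common way to achieve this kind of isolation is the bulkhead pattern, where each caller of a shared dependency gets its own capped pool of concurrent requests. The sketch below is an assumption about how this could look, not a description of our implementation; the `Bulkhead` class and the service names are hypothetical.

```python
import threading


class Bulkhead:
    """Cap concurrent calls from one service into a shared dependency.

    Hypothetical illustration of the bulkhead pattern.
    """

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self) -> bool:
        """Take a slot without blocking; False means the pool is saturated."""
        return self._slots.acquire(blocking=False)

    def leave(self) -> None:
        """Return a slot after the call to the dependency completes."""
        self._slots.release()


if __name__ == "__main__":
    messaging = Bulkhead("outbound-messaging", max_concurrent=2)
    reporting = Bulkhead("reporting", max_concurrent=1)

    reporting.try_enter()          # reporting uses its only slot
    print(reporting.try_enter())   # False: reporting is saturated
    print(messaging.try_enter())   # True: messaging is unaffected
```

Because each pool is separate, a slow or saturated caller exhausts only its own slots and cannot starve time-sensitive traffic.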
Post-Deployment Safeguards
We’re implementing additional validation and performance checks post-deployment, especially for components that interact with core infrastructure such as Bigtable. This will help catch performance regressions earlier.
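A post-deployment performance check of this kind could, for example, compare 95th-percentile read latency before and after a release. The sketch below assumes a nearest-rank percentile and a tolerated regression of 20%; the function names and the threshold are hypothetical and do not reflect our actual release tooling.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile of a non-empty list, with p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 gives the 1-based nearest rank.
    rank = -(-p * len(ordered) // 100)
    return ordered[max(rank, 1) - 1]


def passes_latency_check(baseline_ms, post_deploy_ms,
                         max_regression_percent: int = 20) -> bool:
    """Return True if post-deploy p95 latency is within the allowed regression.

    The 20% default is an assumed threshold for illustration.
    """
    base = percentile(baseline_ms, 95)
    post = percentile(post_deploy_ms, 95)
    return post * 100 <= base * (100 + max_regression_percent)


if __name__ == "__main__":
    baseline = list(range(1, 101))            # p95 = 95 ms
    regressed = [x * 2 for x in baseline]     # p95 = 190 ms
    print(passes_latency_check(baseline, regressed))  # False
```

A failed check would then block or roll back the release before the regression reaches production traffic.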
Documentation & Knowledge Sharing
The learnings from this incident are being documented internally and shared across engineering teams. We’re also updating our incident response playbooks to include specific actions for handling similar storage-layer issues.