Write-up
Delay in email, SMS, and WhatsApp delivery

Issue summary
On Monday, July 28th, 2025, between 21:12 and 21:24 IST, some clients experienced delays in the delivery of outbound messages, including transactional and marketing emails, SMS, and WhatsApp communications. Delays lasted up to 2 minutes.

Impact

Delay of up to 2 minutes for outbound transactional emails, marketing emails, SMS, and WhatsApp messages. No message loss occurred; all messages were successfully delivered after the delay.

Root Cause

The issue was triggered by increased read latency in our storage layer (Bigtable) following a recent backend update. This latency impacted the performance of our Plan Manager service, which plays a central role in handling and routing outbound messages. As a result, services dependent on Plan Manager, including transactional and marketing emails, SMS, and WhatsApp, experienced delivery delays.

The elevated latency was due to a spike in CPU and storage consumption, which caused timeouts during message processing. Once we identified the cause, we mitigated the issue by increasing the capacity of the affected Bigtable instance, which restored system stability.

Resolution
Once the root cause was identified (high read latency in Bigtable affecting the Plan Manager service), our engineering team took immediate action to stabilize the system. We scaled up the capacity of the affected Bigtable instance, which reduced latency and restored normal performance across all impacted services.

Next Steps

To prevent recurrence of this issue and improve overall system resilience, we are taking the following actions:

  • Infrastructure Scaling Automation
    We are enhancing our auto-scaling policies for Bigtable and other key infrastructure components. This will allow us to automatically respond to increased load or resource consumption before it begins to impact performance.

  • Improved Monitoring and Alerting
    We're updating our monitoring systems to detect early warning signs such as latency spikes or rising CPU/storage usage across critical services. This includes more granular metrics and real-time alerting to ensure quicker detection and response.

  • Service Isolation Enhancements
    We are working to better isolate time-sensitive services like outbound messaging from shared dependencies. This will help limit the blast radius of similar issues in the future and keep critical workflows running smoothly.

  • Post-Deployment Safeguards
    We’re implementing additional validation and performance checks post-deployment, especially for components that interact with core infrastructure such as Bigtable. This will help catch performance regressions earlier.

  • Documentation & Knowledge Sharing
    The learnings from this incident are being documented internally and shared across engineering teams. We’re also updating our incident response playbooks to include specific actions for handling similar storage-layer issues.
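
As an illustration of the early-warning checks described above, the following is a minimal sketch of a sustained-breach rule over read latency, CPU, and storage utilization. It is not our production monitoring code: the names (Thresholds, Sample, should_alert) and all threshold values are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits, for illustration only.
    read_latency_ms: float = 200.0
    cpu_utilization: float = 0.70      # fraction, 0.0 to 1.0
    storage_utilization: float = 0.70  # fraction, 0.0 to 1.0


@dataclass(frozen=True)
class Sample:
    read_latency_ms: float
    cpu_utilization: float
    storage_utilization: float


def breached_signals(sample: Sample, limits: Thresholds) -> List[str]:
    """Return the names of the signals in this sample that exceed their limit."""
    names = []
    if sample.read_latency_ms > limits.read_latency_ms:
        names.append("read_latency_ms")
    if sample.cpu_utilization > limits.cpu_utilization:
        names.append("cpu_utilization")
    if sample.storage_utilization > limits.storage_utilization:
        names.append("storage_utilization")
    return names


def should_alert(samples: Sequence[Sample], limits: Thresholds, sustained: int = 3) -> bool:
    """Alert only when each of the last `sustained` samples breaches some limit.

    Requiring consecutive breaches avoids paging on a single noisy reading.
    """
    if len(samples) < sustained:
        return False
    return all(breached_signals(s, limits) for s in samples[-sustained:])
```

In this sketch, a single elevated reading does not raise an alert; several consecutive elevated readings do, which is one simple way to trade a little detection delay for fewer false alarms.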
