CometChat - Chat & Messaging - API (Europe) Degraded Performance – Incident details


Chat & Messaging - API (Europe) Degraded Performance

Resolved
Degraded performance
Started 10 months ago
Lasted about 1 hour

Affected

Europe (EU)

Degraded performance from 7:25 AM to 8:20 AM

Chat & Messaging - API

Degraded performance from 7:25 AM to 8:20 AM

Updates
  • Postmortem

    Root Cause Analysis (RCA)
    Incident Date: June 09, 2025
    Region Impacted: Shared EU Region
    Service Impacted: Chat & Messaging API

    Introduction
    At CometChat, we’re committed to providing a reliable experience. On June 09, 2025, some applications in our shared EU Region experienced degradation of the Chat & Messaging API. We’re sharing this report for transparency and to outline actions taken to prevent recurrence.

    Incident Description
    At 12:15 AM MST (07:15 UTC) on June 09, 2025, automated monitoring detected a significant increase in resource consumption on a database shard in our shared EU Region. A sudden traffic surge from a single tenant, combined with the shard’s existing rate-limit configuration, caused resource contention, resulting in intermittent latency and brief service disruptions for other applications on the same shard. The engineering team intervened immediately and restored normal operations by 1:20 AM MST (08:20 UTC).

    Scope of Impact
    Only customers whose applications were hosted on the affected shard experienced this disruption. All other EU shards and regions operated normally throughout the event.

    Root Cause Analysis
    The affected tenant generated an unprecedented spike in request volume that exceeded the adaptive thresholds configured for the shared database. Although rate-limiting controls operated as designed, the magnitude and velocity of the burst saturated the connection pool before throttling could fully engage, creating a query backlog that affected co-resident applications. Monitoring alerted the team promptly, but the anomaly developed faster than the controls in place at the time could contain it.
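
    The failure mode described above can be illustrated with a minimal sketch. This is not CometChat's implementation: the TokenBucket class, the simulate_burst helper, and every number used here are hypothetical. It assumes a token-bucket limiter whose burst capacity is larger than the shared connection pool, so a limiter that behaves exactly as configured can still admit more simultaneous requests than the pool can serve.

```python
class TokenBucket:
    """Hypothetical token-bucket limiter: refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # the bucket starts full
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_burst(bucket: TokenBucket, pool_size: int, requests: int, now: float = 0.0):
    """Send `requests` at the same instant.

    Returns (admitted, queued): how many requests the limiter lets through,
    and how many of those exceed the connection pool and must wait.
    """
    admitted = sum(1 for _ in range(requests) if bucket.allow(now))
    queued = max(0, admitted - pool_size)
    return admitted, queued


if __name__ == "__main__":
    bucket = TokenBucket(rate=50.0, capacity=200.0)
    print(simulate_burst(bucket, pool_size=100, requests=1000))
```

    With these illustrative values the limiter rejects 800 of the 1,000 simultaneous requests, yet the 200 it admits are still twice what a 100-connection pool can serve, so 100 requests queue behind the pool.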

    Resolution and Preventative Measures
    Immediate Actions Taken

    • Refined rate-limit parameters for the originating tenant.

    • Applied targeted throttling to stabilize the load on the affected shard.

    • Cleared the backlog and confirmed system performance at baseline.

    Long-Term Actions

    • Increase rate-limit granularity and enable dynamic scaling to accommodate extreme load without affecting neighboring tenants.

    • Deploy advanced traffic guardrails with real-time analytics and automated containment to intercept abnormal patterns earlier.

    • Enhance shared-database architecture by introducing stronger logical isolation for high-traffic tenants, reducing the blast radius and improving resilience.
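
    One common way to reduce the blast radius on a shared resource is a per-tenant cap on pool usage, often called a bulkhead. The sketch below shows the idea only and does not describe CometChat's architecture: the TenantBulkhead class, its parameters, and the tenant names are hypothetical, and the code is single-threaded for brevity (a real pool would need locking).

```python
class TenantBulkhead:
    """Hypothetical shared pool that caps how many connections one tenant may hold."""

    def __init__(self, pool_size: int, per_tenant_cap: int) -> None:
        self.pool_size = pool_size
        self.per_tenant_cap = per_tenant_cap
        self.in_use: dict[str, int] = {}  # tenant id -> connections currently held

    def acquire(self, tenant: str) -> bool:
        used = self.in_use.get(tenant, 0)
        pool_full = sum(self.in_use.values()) >= self.pool_size
        # Refuse when the tenant is at its cap or the whole pool is exhausted.
        if used >= self.per_tenant_cap or pool_full:
            return False
        self.in_use[tenant] = used + 1
        return True

    def release(self, tenant: str) -> None:
        used = self.in_use.get(tenant, 0)
        if used <= 0:
            raise ValueError(f"tenant {tenant!r} holds no connections")
        self.in_use[tenant] = used - 1


if __name__ == "__main__":
    pool = TenantBulkhead(pool_size=10, per_tenant_cap=4)
    granted = sum(pool.acquire("noisy") for _ in range(100))
    print(granted, pool.acquire("quiet"))
```

    In this sketch a tenant that attempts 100 acquisitions receives only 4 connections, leaving the remaining 6 available to neighboring tenants on the same shard.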

    We remain committed to providing uninterrupted chat services and will continue strengthening platform safeguards to prevent recurrence. For any questions, please contact your Customer Success Manager.

  • Resolved
    This incident has been resolved.
  • Investigating

    Our engineering team is actively addressing an issue with the Chat & Messaging API. As we optimize traffic flow, there may be brief service disruptions and slight delays in event processing. We appreciate your patience and will provide updates as soon as more information is available.