SUMMARY
On February 26th, 2025 from 10:43 UTC to 21:56 UTC, Talk customers on Pod 17 experienced delays and issues changing agent status.
TIMELINE
February 26, 2025 10:17 PM UTC | February 26, 2025 02:17 PM PT
We are happy to report that the fix has resolved the issue causing delays and affecting agent status changes in Talk on pod 17. Thank you for your patience during our investigation.
February 26, 2025 09:32 PM UTC | February 26, 2025 01:32 PM PT
Following the fix, we are beginning to see improvement in the delays and agent status issues affecting Talk customers on pod 17. Please let us know if you continue to experience any issues.
February 26, 2025 07:25 PM UTC | February 26, 2025 11:25 AM PT
We have identified a recent update that we believe is responsible for the delays and agent status issues affecting Talk customers on pod 17. Our team is preparing a fix, and we will provide a further update when it is deployed.
February 26, 2025 06:36 PM UTC | February 26, 2025 10:36 AM PT
Our team continues to investigate the issue affecting Talk customers on pod 17 that is causing delays and leaving agents stuck in wrap-up mode. We will post additional updates in one hour or when we have new information to share.
February 26, 2025 06:08 PM UTC | February 26, 2025 10:08 AM PT
We are aware of an issue impacting Talk customers on pod 17 that is preventing agents from moving between statuses, sometimes leaving them stuck in wrap-up mode. Our team is investigating, and we will provide further updates as soon as they are available.
POST-MORTEM
Root Cause Analysis
This incident was caused by a system change that allowed a data scrubbing job to remain in the queue while the same job continued to be enqueued again every minute. The resulting backlog starved other jobs of processing time.
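To illustrate the failure mode described above, here is a minimal sketch (not Zendesk's actual code; the job and queue names are hypothetical) of a per-minute scheduler that enqueues a scrubbing job unconditionally. If the pending job never completes, identical copies accumulate and crowd out other work sharing the same queue.

```python
# Hypothetical illustration of the failure mode: the scheduler enqueues the
# scrubbing job every minute without checking whether the previous copy has
# ever been processed, so stuck copies pile up in the shared queue.
import queue

low_priority_queue = queue.Queue()

def schedule_tick():
    """Runs once per minute and enqueues the job unconditionally."""
    low_priority_queue.put("data_scrubbing_job")

def run_worker_once():
    """A worker that can never finish the scrubbing job."""
    job = low_priority_queue.get()
    if job == "data_scrubbing_job":
        # Fails every time (e.g. the record it must update is read-only),
        # so the job is never marked done and is scheduled again next minute.
        raise RuntimeError("cannot scrub: target model is read-only")

# After N minutes the queue holds N stuck copies of the same job, and every
# other job sharing the queue waits behind all of them.
```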
Resolution
To resolve this issue, the team first addressed the data scrubbing job's inability to complete due to the read-only model. They then modified the enqueueing logic so that once a job was enqueued, it would not be enqueued again until it had been processed successfully.
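A minimal sketch of the kind of guard described here, using hypothetical names rather than Zendesk's actual implementation: a pending marker is set when the job is enqueued and cleared only after successful processing, so duplicate enqueues are skipped in between.

```python
# Hypothetical sketch of the fix: enqueue a job only if no copy of it is
# already pending, and clear the pending marker once it completes successfully.
import queue

low_priority_queue = queue.Queue()
_pending_jobs = set()

def enqueue_once(job_name):
    """Enqueue the job unless a previous copy is still pending."""
    if job_name in _pending_jobs:
        return False  # earlier enqueue has not been processed yet
    _pending_jobs.add(job_name)
    low_priority_queue.put(job_name)
    return True

def mark_completed(job_name):
    """Called by the worker after successful processing, allowing the
    next scheduled run to be enqueued again."""
    _pending_jobs.discard(job_name)
```

In a real multi-worker deployment the pending marker would live in shared storage (for example, the queue backend itself) rather than in process memory, but the principle is the same: at most one copy of the job is in flight at a time.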
Remediation Items
- Implement a mechanism to prevent the same data scrubbing requests from being enqueued multiple times.
- Create a tool within the Voice system to manage and remove jobs from the relevant queue.
- Improve database queries associated with data scrubbing jobs to enhance performance.
- Reduce the number of worker instances to optimize resource allocation and review current limits.
- Explore options to quickly boost resources (workers) as needed.
- Consider relocating jobs from the low queue if latency impacts core features.
By addressing these remediation items, the team aims to prevent similar incidents in the future and improve the overall reliability of the Voice service.
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.