Title:
Secret Server Cloud (SSC) Issues in East US starting on April 3rd, 2024.
Impact:
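Beginning on April 3rd, 2024, at approximately 7:52:40 PM ET, SSC customers in the East US region began experiencing issues accessing their systems. After initial triage, we estimated that roughly 25% of US-based SSC customers were affected.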
Issue:
On April 3rd, 2024, at approximately 7:52:40 PM ET, Azure SQL Databases in East US started experiencing issues. New connections to databases in this region may have resulted in an error or timeout. Existing connections remained available to accept new requests; however, if those connections were terminated and re-established, they may have failed. Microsoft suspects that a deployment to its frontend gateways caused the SQL database availability issues that impacted customer connectivity. Microsoft worked to mitigate the issue by stopping the active rollout of that deployment.
Resolution:
At approximately 7:54 PM ET, our monitoring systems began alerting on the issue. After initial triage, we estimated that roughly 25% of US-based SSC customers were experiencing issues. At approximately 8:26 PM ET, SSC failover was initiated. However, because of an issue with the automated failover tool, we had to perform the failover manually, which increased our response and resolution time. At approximately 10:36 PM ET, the failover completed and all SSC systems were operational. Our engineers monitored the Azure incident and waited for it to be fully resolved before closing the incident on our side. At approximately 11:29 PM ET on April 3rd, 2024, the Azure team declared the incident mitigated. At approximately 3:56 AM ET on April 4th, 2024, a rollback to restore functionality in the primary East US region for SSC was completed.
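For readers who want the intervals rather than the clock times, the timeline above can be turned into elapsed durations with a short calculation. The following is a minimal sketch in Python, not part of any SSC tooling: the event labels and the helper names `elapsed_since_start` and `format_delta` are illustrative, and because most timestamps in this report are approximate to the minute, the resulting durations are approximate as well.

```python
from datetime import datetime

# Timestamps copied from the timeline in this report (all Eastern Time).
# Labels are illustrative; times are approximate, as stated in the report.
EVENTS = [
    ("Azure SQL issues begin", datetime(2024, 4, 3, 19, 52, 40)),
    ("Monitoring alerts fire", datetime(2024, 4, 3, 19, 54)),
    ("Failover initiated", datetime(2024, 4, 3, 20, 26)),
    ("Failover completed", datetime(2024, 4, 3, 22, 36)),
    ("Azure declares mitigation", datetime(2024, 4, 3, 23, 29)),
    ("Rollback to East US completed", datetime(2024, 4, 4, 3, 56)),
]


def elapsed_since_start(events):
    """Map each event label to the time elapsed since the first event."""
    start = events[0][1]
    return {label: ts - start for label, ts in events}


def format_delta(delta):
    """Render a timedelta as 'Hh MMm SSs'."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


if __name__ == "__main__":
    for label, delta in elapsed_since_start(EVENTS).items():
        print(f"{label}: +{format_delta(delta)}")
```

Under these timestamps, the manual failover itself took about 2 hours 10 minutes (8:26 PM to 10:36 PM ET), and SSC systems were operational again roughly 2 hours 43 minutes after the Azure issue began.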
Action Items:
1. Investigate issues with tooling that facilitates automated failover.
To address this and prevent future occurrences:
IN PROGRESS – Improve failover automation tooling and capabilities.
Incident Start Time: April 3rd, 2024, 7:52:40 PM ET
Incident End Time: April 3rd, 2024, 11:29 PM ET