Pod29 Service Interruptions
Beginning on 7/30/22 at approximately 06:15 AM UTC, pod29 began experiencing intermittent errors in operations that required new connections to the backend storage system. Those operations failed, resulting in an overall service degradation.
Due to an increase in application activity, the number of storage-system connections requested by the front-end servers began spiking periodically. These spikes hit the storage system's maximum connection limit, causing additional operations to fail. Monitoring on the number of storage connections paged the Operations staff about the issue. The initial investigation led Operations to suspect an underlying problem in the primary storage node, and a failover to the secondary storage node was initiated on 7/30 at 1:45 PM UTC. This appeared to resolve the problem, since the failover forced all front-end servers to reconnect to the storage system, clearing the connection alert. However, the alert triggered again the following day, on 7/31 at 7:49 PM UTC. Operations staff investigated again and determined that idle connections to the storage system were not being cleaned up as expected.
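The connection-count alert described above reduces to a simple threshold check against the backend's connection limit. A minimal sketch; the `max_connections` value and the 90% paging threshold are illustrative assumptions, not values from the incident:

```python
def should_page(current_connections: int, max_connections: int,
                threshold: float = 0.9) -> bool:
    """Return True when connection usage crosses the paging threshold.

    Paging at a fraction of the hard limit (rather than at the limit
    itself) leaves Operations headroom to intervene before new
    connections start failing.
    """
    return current_connections >= max_connections * threshold
```

Alerting before the hard limit matters here: once the limit is reached, the failure mode is exactly the one described, new connection requests erroring out.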
At 8:51 PM UTC on 7/31, Operations staff began manually executing a command to kill idle connections to the storage system whenever the alert fired. This prevented customer impact or, at minimum, reduced the time customers were impacted. Over the next several days, the alert went from firing 10-13 times a day to 3-4 times a day. Additionally, Operations engaged the Development team to investigate the cause of the connection leak and to provide an automated script to detect the issue and kill idle connections. Since the automation script was deployed to Production, the alert has decreased further, to 0-2 times a day.
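The core of the automated cleanup can be sketched as a filter over the backend's session list. The record shape and the 10-minute idle cutoff below are assumptions for illustration; the actual script would query the storage system's session table and issue its terminate command for each matching connection:

```python
from dataclasses import dataclass

@dataclass
class Conn:
    pid: int            # backend process/session id
    state: str          # e.g. "active" or "idle"
    idle_seconds: float # time since the last operation finished

def idle_conns_to_kill(conns: list[Conn],
                       max_idle_seconds: float = 600.0) -> list[int]:
    """Return the pids of connections idle longer than the cutoff.

    Filtering on state avoids killing connections that are merely
    between statements in an active workload.
    """
    return [c.pid for c in conns
            if c.state == "idle" and c.idle_seconds > max_idle_seconds]
```

In a deployed script, this selection would run on a schedule (or on alert), with the kill step logged so leaked connections can be correlated back to the application code that opened them.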
After investigation, the Development team determined that a bug in the third-party library the application uses for operations with the backend storage system was causing the connection leak. The bug had existed for some time but did not manifest in our application until the increase in activity on 7/30. The third-party vendor has fixed the bug; however, the fix requires a major version upgrade of the library. Because of the potential impact of a major version change, extensive application-wide testing is being conducted before the update is deployed to Production. The results of that testing will determine when the update can be deployed.
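Connection leaks of the kind described usually follow one pattern: a connection is acquired but never released on an error path. The toy pool below is a generic illustration, not the vendor library's actual code; the fix is to guarantee release with try/finally (or an equivalent context manager):

```python
class Pool:
    """Toy connection pool that only counts checked-out connections."""
    def __init__(self):
        self.open = 0

    def acquire(self):
        self.open += 1

    def release(self):
        self.open -= 1

def leaky_query(pool: Pool, fail: bool = False):
    pool.acquire()
    if fail:
        raise RuntimeError("query failed")  # connection never released
    pool.release()

def safe_query(pool: Pool, fail: bool = False):
    pool.acquire()
    try:
        if fail:
            raise RuntimeError("query failed")
    finally:
        pool.release()  # released on every path, including errors
```

Under low traffic the leaky version goes unnoticed, because the pool rarely runs dry; under a traffic spike, leaked connections accumulate until the backend's connection limit is hit, which matches how this bug stayed latent until the 7/30 activity increase.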