Elevated error rates for rendering service.
Resolved
May 14 at 02:20pm UTC
The incident has been resolved.
Affected services
Service
Recache
Created
May 14 at 11:07am UTC
On May 14, 2024, between 11:07 UTC and 14:20 UTC, Chrome servers took longer than usual to process rendering requests. We also saw increased error rates across the service.
The root cause of the incident was that our manager daemon was inadvertently upgraded during a regular deployment process to a new version that wasn't compatible with the GLIBC library installed on most of our US servers. This caused the manager daemon to crash and restart repeatedly, resulting in increased error rates and slow render times.
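The kind of mismatch described above can be caught before a rollout by comparing the glibc version a build needs against the version on the target host. The following is a minimal sketch of such a check, not our actual deployment tooling: the function names are hypothetical, and the version numbers in the usage note are illustrative only, since the real versions involved are not stated here.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.31' into a tuple of ints, e.g. (2, 31)."""
    return tuple(int(part) for part in text.split("."))


def is_glibc_compatible(required, installed):
    """Return True when the installed glibc is at least the required version.

    Versions are compared numerically, so '2.9' is treated as older than '2.28'.
    """
    return parse_version(installed) >= parse_version(required)


def host_glibc_version():
    """Return the host's glibc version string, or None if it cannot be detected."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None
```

A deployment step could call `host_glibc_version()` on each server and skip the upgrade when `is_glibc_compatible("2.34", installed)` is false, where `"2.34"` stands in for whatever version the new build requires.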
At 13:10 UTC, we began rolling back the manager daemon to restore service. Because of the load on the system, the rollback took longer than expected. It completed at 14:20 UTC, at which point service was restored.
To prevent recurrence, we've pinned the version of the manager daemon and added monitoring that alerts us if the daemon crashes again. We're also reviewing our deployment process to ensure that incompatible software versions are not deployed to production servers.
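As an illustration of the crash monitoring mentioned above, a restart loop can be detected by counting restarts inside a sliding time window. This is a hedged sketch under assumed thresholds, not the monitoring we deployed: the class name `CrashLoopDetector` and the defaults of 3 restarts in 300 seconds are hypothetical.

```python
from collections import deque


class CrashLoopDetector:
    """Flags a crash loop when too many restarts fall inside a time window."""

    def __init__(self, max_restarts=3, window_seconds=300):
        # Both thresholds are illustrative assumptions, not production values.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, timestamp):
        """Record a restart at `timestamp` (seconds); return True if an alert should fire."""
        self._restarts.append(timestamp)
        # Drop restarts that are older than the window relative to the newest one.
        while self._restarts and timestamp - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) >= self.max_restarts
```

Feeding the detector one timestamp per daemon restart is enough: isolated restarts spaced far apart never trigger it, while a daemon that restarts repeatedly within a few minutes does.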