Service outage

Downtime

Jun 11, 2025 at 2:36pm UTC

Affected services

Service

API

Dashboard

Dashboard API

REST API (API v3)

Authentication

Recache

Resolved
Jun 11, 2025 at 2:37pm UTC

Service recovered

Created
Jun 11, 2025 at 2:36pm UTC

At 06 : 45 CEST (UTC +2) on 11 June 2025 every compute node in our production cluster was powered off simultaneously by the upstream infrastructure provider. Public-facing APIs, workers and scheduled jobs were completely unavailable for 4 h 45 m. A reduced-capacity service then ran with elevated latency and intermittent 503s until 16 : 25 CEST, when the final batch of nodes was re-attached and all metrics returned to baseline.

Root Cause (still under investigation)
No definitive response has been received from the infrastructure provider at the time of writing. The working hypothesis is an unscheduled maintenance event on their side that powered off hardware which had been running continuously for ~2.5 years. Because the shutdown was ungraceful, filesystems required replay and the cluster could not auto-recover.