Malfunctioning scheduler post-mortem
Date of incident: 2025-02-24
Duration: 10:30 UTC - 16:15 UTC
Impact:
- From 10:30 UTC to 15:00 UTC, some schedules were triggered more than once. The issue started at a small scale and escalated after 14:00 UTC.
- From 15:00 UTC to 15:55 UTC, API performance was degraded, and some Actor builds and runs were stuck in the READY state. Some Actor runs were aborted but can be resurrected.
- From 15:25 UTC to 16:15 UTC, schedules were not working, so some Actor runs did not start at all.
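If one of your runs was aborted during the incident, you can restart it yourself via the platform's run-resurrect endpoint. A minimal sketch, assuming the Apify API v2 `actor-runs/{runId}/resurrect` route (the run ID below is a placeholder):

```python
API_BASE = "https://api.apify.com/v2"

def resurrect_url(run_id: str) -> str:
    """Build the URL for resurrecting an aborted Actor run.

    Send a POST request to this URL with your API token
    (Authorization: Bearer <APIFY_TOKEN>) to restart the run.
    """
    return f"{API_BASE}/actor-runs/{run_id}/resurrect"

# Placeholder run ID; use the ID of your aborted run from the Apify Console.
print(resurrect_url("MY-RUN-ID"))
```

The same operation is also available from the official API clients, e.g. `client.run(run_id).resurrect()` in the Python client.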
What happened?
- We deployed a change intended to improve Actor startup performance by targeting Docker layers more effectively.
- The change caused certain database items to grow in size, which overloaded the primary cluster's network and degraded performance.
- Some core components, including those behind the scheduler, repeatedly ran out of memory.
What we did
- We temporarily paused the Actor run scheduler and manually aborted runs triggered after 15:00 UTC.
- We downscaled the platform to make the database responsive again and reverted the change that caused the bug.
Next steps
- We will improve our monitoring to catch similar problems, such as metric deviations and network spikes, before they escalate.
- We will review the architecture of the affected components to prevent similar incidents in the future.
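As an illustration only, a network-spike alert of the kind described might look like this in a Prometheus-style monitoring setup (the metric name, thresholds, and labels below are hypothetical, not our actual configuration):

```yaml
# Hypothetical alerting rule: fire when a node's network receive rate
# stays above 5x its one-hour average for more than five minutes.
groups:
  - name: network-anomalies
    rules:
      - alert: NetworkThroughputSpike
        expr: |
          rate(node_network_receive_bytes_total[5m])
            > 5 * avg_over_time(rate(node_network_receive_bytes_total[5m])[1h:5m])
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Network receive rate is far above its recent average"
```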
We apologize for the disruption and appreciate your patience. If you have any questions, please reach out to support@apify.com.
Sincerely,
Apify engineering team