Malfunctioning scheduler

Incident Report for Apify

Postmortem

Malfunctioning scheduler post-mortem

Date of incident: 2025-02-24
Duration: 10:30 UTC - 16:15 UTC
Impact:

From 10:30 UTC to 15:00 UTC, some of the schedules were triggered more than once. It started as a small-scale issue, and the problem escalated after 14:00 UTC.
From 15:00 UTC to 15:55 UTC, API performance was degraded, and some Actor builds and runs were stuck in the READY state. Some Actor runs were aborted but can be resurrected.
From 15:25 UTC to 16:15 UTC, schedules were not working, so some Actor runs did not start at all.
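
To check whether any of your own schedules fired more than once during this window, one option is to group run start times by schedule and flag starts that follow each other too closely. The following is a minimal sketch under stated assumptions, not the platform's actual detection logic: the `(schedule_id, started_at)` record shape, the `find_duplicate_triggers` name, and the 60-second tolerance are all hypothetical and would need to be adapted to however you export your run history.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def find_duplicate_triggers(runs, tolerance=timedelta(seconds=60)):
    """Return {schedule_id: [start times that follow a previous start within tolerance]}.

    `runs` is an iterable of (schedule_id, started_at) pairs. This shape is an
    assumption made for illustration; it is not Apify's data model.
    """
    by_schedule = defaultdict(list)
    for schedule_id, started_at in runs:
        by_schedule[schedule_id].append(started_at)

    duplicates = {}
    for schedule_id, starts in by_schedule.items():
        starts.sort()
        # Compare each start with the one before it for the same schedule.
        repeated = [later for earlier, later in zip(starts, starts[1:])
                    if later - earlier <= tolerance]
        if repeated:
            duplicates[schedule_id] = repeated
    return duplicates
```

The tolerance should be shorter than the shortest interval any of your schedules legitimately uses; otherwise regular runs would be flagged as duplicates.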

What happened?

We implemented a change intended to improve Actor startup performance by targeting Docker layers more effectively.
The change caused certain database items to increase in size, overloading the network of the primary cluster and leading to degraded performance.
Some core components, including those behind the scheduler, repeatedly ran out of memory.

What we did

We temporarily paused the Actor run scheduler and manually aborted runs triggered after 15:00 UTC.
We downscaled the platform to make the database responsive again and reverted the change causing the bug.

Next steps

We will improve our monitoring to catch similar problems, deviations, and network spikes sooner, before they escalate.
We will investigate the component's current architecture to ensure we prevent such incidents in the future.

We apologize for the disruption and appreciate your patience. If you have any questions, please reach out to support@apify.com.

Sincerely,
Apify engineering team

Posted Feb 26, 2025 - 15:26 CET

Resolved

Summary: We experienced a partial outage of the scheduler and degraded performance of the Apify API.

Timeframe: The issue started at a small scale at 11:00 UTC and then escalated around 15:00 UTC. By 16:15 UTC, the issue was resolved.

Impact:
- During the incident, some scheduled Actors did not start or were run twice.
- In response, we've aborted some of the runs that started after 15:00 UTC and turned off the Schedule component for approximately 45 minutes.
- If your runs were aborted, they can now be safely resurrected; see: https://docs.apify.com/platform/actors/running/runs-and-builds#resurrection-of-finished-run
- There was no data loss for runs in progress, but some runs were not triggered.
- API performance was degraded to varying degrees throughout the incident, and some runs started via API may have remained in the READY state for longer than usual.
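
If you need to work out which of your runs are candidates for resurrection, one approach is to filter your run history for aborted runs that started inside the incident window. The sketch below assumes run records shaped as dictionaries with `id`, `status`, and `startedAt` (ISO 8601, UTC) fields; these field names, the `ABORTED` status value, and the `resurrection_candidates` helper are assumptions for illustration and should be checked against the API reference before use. The window bounds come from the times stated in this report. The sketch only selects run IDs; resurrecting them is described in the documentation linked above.

```python
from datetime import datetime, timezone

# Window from the incident notes: runs started from 15:00 UTC on 2025-02-24
# were aborted; the incident was resolved by 16:15 UTC.
WINDOW_START = datetime(2025, 2, 24, 15, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 2, 24, 16, 15, tzinfo=timezone.utc)


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def resurrection_candidates(runs):
    """Return IDs of aborted runs that started inside the incident window.

    The record shape (id / status / startedAt) is assumed for illustration.
    """
    return [
        run["id"]
        for run in runs
        if run["status"] == "ABORTED"
        and WINDOW_START <= parse_utc(run["startedAt"]) <= WINDOW_END
    ]
```

Runs that you aborted yourself during the same window would also match this filter, so review the resulting list before resurrecting anything.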

We're sorry for the inconvenience. We take this issue seriously and will implement measures to mitigate similar issues in the future.
We will provide a post-mortem summary here with more details later this week.

Posted Feb 24, 2025 - 17:55 CET

Monitoring

We have identified the root cause and deployed a fix. Both API and Actor components are now stable, and we’re slowly starting the scheduled runs.

Posted Feb 24, 2025 - 17:22 CET

Update

We are continuing to investigate this issue.

Posted Feb 24, 2025 - 16:59 CET

Investigating

We're experiencing issues with scheduling. Some Actor runs may not start on time. We are identifying Actor runs that may have been started in duplicate. The first problems started around 11:00 CET at a very small scale and escalated around 15:00 UTC.

API performance is partially degraded, and some endpoints might respond more slowly than usual.

We're investigating the issue and will keep you informed about the progress. We are sorry for the inconvenience caused; this is a critical issue.

In response to the incident, we aborted some of the runs in the READY state. This affects runs that were started, by the scheduler or by other means, from 2025-02-24 15:00 UTC onward. These runs can be resurrected if needed.

Posted Feb 24, 2025 - 16:24 CET

This incident affected: API (api.apify.com), Actors, and Scheduler.