Major systems outage

Incident Report for Apify

Postmortem

Saturday outage post-mortem
Date of incident: 2025-02-22

Impact:

  • Complete downtime of the Apify platform from 22:00 UTC to 23:00 UTC.
  • No Actors could be started.
  • Actors that were already running were paused and resumed after the incident was resolved.
  • There was no data loss, but the workloads were disrupted.

What happened:

  • Our user base is growing fast, and the number of new signups is increasing exponentially. The resulting load exposed inefficiencies that had not manifested before.
  • One misconfigured query then caused a memory spike by loading a suboptimal index, which crashed the cluster.

What we did:

  • The cluster size was increased to ensure enough memory until the issue was fixed.
  • Then, the cluster and all the dependent systems were restarted.

Next steps:

  • We are already re-architecting our primary database clusters to optimize systems for future growth.
  • As part of this, we aim to decrease the size of all our clusters so that they restart quickly if a problem occurs.
  • We are currently optimizing indexing and querying across heavy-load components.
  • In addition, we identified metrics missing from our monitoring that would have indicated similar problems ahead of time.

We sincerely apologize for the disruption and appreciate your patience. If you have any questions, please reach out to support@apify.com.

Sincerely,
Apify engineering team

Posted Feb 26, 2025 - 15:21 CET

Resolved

This incident has been resolved.
Posted Feb 23, 2025 - 00:38 CET

Update

We are continuing to monitor for any further issues.
Posted Feb 23, 2025 - 00:33 CET

Monitoring

A fix has been implemented and we're monitoring the results. It might take some time to allocate all jobs while services fully recover.

We're sorry for the inconvenience. We will implement new measures based on this critical incident to prevent similar incidents in the future.
Posted Feb 23, 2025 - 00:25 CET

Identified

We've identified the cause of the issue and are taking steps to resolve it.
Posted Feb 23, 2025 - 00:05 CET

Update

We are continuing to investigate this issue.
Posted Feb 22, 2025 - 23:13 CET

Investigating

We are investigating the issue.
Posted Feb 22, 2025 - 23:12 CET
This incident affected: Web (apify.com), Console (console.apify.com), API (api.apify.com), Actors, Scheduler, Webhooks, Proxy (Datacenter, Residential, SERP), and Storage (Dataset, Request queue, Key-value store).