Saturday outage post-mortem
Date of incident: 2025-02-22
Impact:
- Complete downtime of the Apify platform from 22:00 UTC to 23:00 UTC.
- No Actors could be started.
- Actors that were already running were paused and resumed after the incident was resolved.
- There was no data loss, but the workloads were disrupted.
What happened:
- Our user base is growing fast, and the number of new signups is increasing exponentially. The resulting load exposed inefficiencies that had not manifested before.
- One misconfigured query then loaded a suboptimal index, which caused a spike in memory usage and crashed the cluster.
What we did:
- The cluster size was increased to ensure enough memory until the issue was fixed.
- Then, the cluster and all the dependent systems were restarted.
Next steps:
- We are already re-architecting our primary database clusters to optimize systems for future growth.
- As part of this, we aim to decrease the size of all our clusters so that they restart quickly in case of a problem.
- We are currently optimizing indexing and querying in our heavy-load components.
- In addition, we identified missing metrics that would indicate similar problems ahead of time.
We sincerely apologize for the disruption and appreciate your patience. If you have any questions, please reach out to support@apify.com.
Sincerely,
Apify engineering team