The original problem was triggered by an internal Apify user running an extremely large number of actor runs in parallel. A suboptimal DB query used in our job allocator overloaded the CPU of the database kernel, and our central database slowed down past a critical threshold, which caused a cascading effect.
After we recovered the system, the buffer used by our dataset API did not fully recover, which caused the performance issues with the dataset API endpoint that persisted until full recovery.
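To illustrate the failure mode in general terms (the collection name, fields, and MongoDB-style API below are hypothetical assumptions, not Apify's actual schema or allocator code): an allocator query that scans and sorts the whole backlog of pending runs consumes database CPU in proportion to the backlog size, while a bounded, index-backed query keeps the cost per allocation cycle roughly constant.

```ts
// Hypothetical illustration of the failure mode, not Apify's actual allocator code.
// Assumes a MongoDB collection of pending runs with fields { userId, status, createdAt }.
import { Collection } from 'mongodb';

interface PendingRun {
  userId: string;
  status: 'PENDING' | 'RUNNING';
  createdAt: Date;
}

// Suboptimal: loads the whole backlog and sorts it in the application.
// With an extreme number of parallel runs, the scan and sort dominate DB CPU.
async function pickNextRunSlow(pendingRuns: Collection<PendingRun>): Promise<PendingRun | null> {
  const all = await pendingRuns.find({ status: 'PENDING' }).toArray();
  all.sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime());
  return all[0] ?? null;
}

// Bounded: relies on a compound index on { status, createdAt } and fetches a single
// document, so the per-cycle cost stays flat regardless of backlog size.
async function pickNextRunBounded(pendingRuns: Collection<PendingRun>): Promise<PendingRun | null> {
  return pendingRuns
    .find({ status: 'PENDING' })
    .sort({ createdAt: 1 })
    .limit(1)
    .next();
}
```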
Next steps and new measures
We discovered that the suboptimal query caused the database overload. It has been fixed at the time of writing. We plan to publish an engineering blog post on this topic.
We also found the bottleneck that prevented the dataset API from recovering automatically, and a fix is in progress.
We are currently analyzing and testing other system components to ensure that they are elastic enough.
We will apply new platform measures and limits to guard against the original trigger of the incident. These limits should not affect any users, but they are strict enough to protect workloads from accidental overload.
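As a minimal sketch of what such a protective limit can look like (the threshold, names, and in-memory bookkeeping are illustrative assumptions, not the actual limits being rolled out):

```ts
// Illustrative sketch of a per-user concurrency cap; threshold and names are assumptions.
const MAX_CONCURRENT_RUNS_PER_USER = 1000; // hypothetical limit, far above normal usage

// In-memory counters for the sketch; a real platform would track this in shared storage.
const runningRunsByUser = new Map<string, number>();

function canStartRun(userId: string): boolean {
  const current = runningRunsByUser.get(userId) ?? 0;
  return current < MAX_CONCURRENT_RUNS_PER_USER;
}

function onRunStarted(userId: string): void {
  runningRunsByUser.set(userId, (runningRunsByUser.get(userId) ?? 0) + 1);
}

function onRunFinished(userId: string): void {
  const current = runningRunsByUser.get(userId) ?? 0;
  runningRunsByUser.set(userId, Math.max(0, current - 1));
}

// Usage: the scheduler checks the cap before allocating a run and queues the run otherwise,
// so an accidental burst from a single user cannot overload the shared database.
```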
We are sorry for any inconvenience and problems caused by this incident.
Posted Feb 17, 2022 - 11:43 CET
Resolved
We are closing the incident. We will publish an update on this issue and the measures we are taking to prevent these problems in the future.
Posted Feb 10, 2022 - 15:04 CET
Monitoring
All systems are back up and fully operational.
In summary:
- The initial outage happened between 17:30 UTC and 20:00 UTC.
- Since then, we recovered the actor runtime, but the dataset push items API endpoint had performance problems; the rest of the platform operated normally.
- To prevent data loss, we had to restart most of our servers between 0:30 and 1:00 during the recovery process. This put runs on hold, and only runs with a short timeout may have timed out.
- Thanks to the recovery process, we prevented a potential data loss.
This incident is unrelated to the previous one, and the root cause is different. We consider this an incident of critical severity, and we will update you next week with detailed information on the root cause and our measures to prevent these problems in the future.
Posted Feb 10, 2022 - 02:57 CET
Update
We are currently working on a full recovery of the system. Some parts have to be restarted, which may cause some actor runs to take longer than usual.
Posted Feb 10, 2022 - 02:03 CET
Update
We are continuing to work on a fix for this issue.
Posted Feb 09, 2022 - 21:50 CET
Identified
We have identified the issue and the system has mostly recovered. The performance of datasets might still be affected.
Posted Feb 09, 2022 - 21:10 CET
Investigating
We are experiencing issues with our database provider causing a high error rate on the Apify API and Console. We are investigating the root cause and will keep you up to date. We are sorry for any inconvenience caused.
Posted Feb 09, 2022 - 18:38 CET
This incident affected: Web (apify.com), Console (console.apify.com), API (api.apify.com), Actors and Proxy (Datacenter).