Today we had several outages, following a generally bumpy week.
- 08:06 UTC to around 08:20 UTC
- 08:31 UTC to around 08:45 UTC
I am finally able to understand the problem, at least partly. Squidex is hosted in Kubernetes on Google Cloud. When a node runs low on resources, pods are evicted without a grace period. This is called a hard eviction (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#hard-eviction-thresholds).
In this situation the pod can no longer declare itself dead in the cluster, so the other nodes keep trying to communicate with this node for a while until they give up. The cluster then cannot recover right away and needs several attempts and automatic restarts to become healthy again.
I will do my best to understand this situation and to make the cluster setup more stable.
Furthermore, I have increased the Kubernetes resources and will set proper resource limits to ensure that only soft evictions happen.
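
As a rough illustration of what resource limits influence here: Kubernetes assigns every pod a QoS class based on the CPU and memory requests and limits of its containers, and under node pressure the kubelet evicts pods that use more than they requested first. The Python sketch below only mirrors those classification rules. The function name and the container values are hypothetical and are not the actual Squidex configuration, and quantities are compared as plain strings for simplicity.

```python
# Sketch: classify a pod's Kubernetes QoS class from its container resources.
# Each container dict mirrors the `resources` section of a pod spec.
# Simplification: quantities are compared as strings, so "1Gi" != "1024Mi".

RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Return "Guaranteed", "Burstable" or "BestEffort" for a list of containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical example values, not the real deployment.
pod = [{
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "500m", "memory": "1Gi"},
}]
print(qos_class(pod))
```

A pod whose requests equal its limits for both CPU and memory ends up in the Guaranteed class, which makes it one of the last candidates for eviction when a node runs out of resources.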