[SOLVED] 2020-05-24 Downtime due to low resources

Hello everyone.

Today we had several outages, following a generally bumpy week:

  • 08:06 UTC to around 08:20 UTC
  • 08:31 UTC to around 08:45 UTC

I am finally able to understand the problem, at least partly. Squidex is hosted on Kubernetes in Google Cloud. When a node is low on resources, pods are evicted without a grace period. This is called a hard eviction (https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#hard-eviction-thresholds).
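
As a rough illustration of the difference between a hard and a soft eviction, here is a small Python sketch. It is a simplified model and not the kubelet's actual logic; the function name, the threshold values and the grace period are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical, not the settings of our cluster.
    hard_min_available_mib: int = 100  # below this: evict immediately
    soft_min_available_mib: int = 500  # below this: evict after the grace period
    soft_grace_period_s: int = 90


def eviction_kind(available_mib: int, seconds_below_soft: int,
                  t: Thresholds = Thresholds()) -> str:
    """Return 'hard', 'soft' or 'none' for a node's available memory."""
    if available_mib < t.hard_min_available_mib:
        # Hard threshold: pods are killed without a grace period.
        return "hard"
    if (available_mib < t.soft_min_available_mib
            and seconds_below_soft >= t.soft_grace_period_s):
        # Soft threshold: only acts once memory has stayed low long enough.
        return "soft"
    return "none"


print(eviction_kind(50, 0))     # hard
print(eviction_kind(300, 10))   # none
print(eviction_kind(300, 120))  # soft
```

The point is that a soft eviction gives the pod time to shut down properly, while a hard eviction does not.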

In this situation the pod can no longer declare itself dead in the cluster, and other nodes keep trying to communicate with this node for a while until they give up. The cluster then cannot recover immediately and needs several attempts and automatic restarts to become healthy again.

I will do my best to understand this situation and to make the cluster setup more stable.

Furthermore, I have increased the Kubernetes resources and will use proper resource limits to ensure that only soft evictions happen.
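
For context on the resource limits: Kubernetes assigns each pod a QoS class based on its requests and limits, and the kubelet considers whether a pod uses more than it requested when it picks pods to evict. The Python sketch below shows how the class follows from the settings. It is simplified to a single container, compares quantities as plain strings, and uses hypothetical values, so it is not our actual configuration.

```python
from typing import Dict

Resources = Dict[str, str]  # e.g. {"cpu": "500m", "memory": "1Gi"}


def qos_class(requests: Resources, limits: Resources) -> str:
    """QoS class for a pod with a single container (simplified)."""
    if not requests and not limits:
        return "BestEffort"
    for name in ("cpu", "memory"):
        limit = limits.get(name)
        # Kubernetes defaults a missing request to the limit.
        request = requests.get(name, limit)
        # Simplification: quantities are compared as strings,
        # so "1Gi" and "1024Mi" would count as different here.
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


print(qos_class({}, {}))                                  # BestEffort
print(qos_class({"memory": "512Mi"}, {"memory": "1Gi"}))  # Burstable
print(qos_class({"cpu": "500m", "memory": "1Gi"},
                {"cpu": "500m", "memory": "1Gi"}))        # Guaranteed
```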