[SOLVED] 2018-11-20 Cluster failure and downtime


#1

Please take a look - cloud.squidex.io is down.
API calls got timeout exceptions.


#2

I restarted the cluster and it is working fine now. I will investigate the issue later and will write an analysis.


#3

Thanks, Sebastian!
It looks good from our side as well now.


#4

I promised a post-mortem and here it is:

Our setup

We are hosting the Squidex cloud version on Google Cloud and Kubernetes. Kubernetes (k8) is a cluster management system and has some behaviors you should understand:

  1. When you deploy an application you can define the number of replicas, and k8 starts or shuts down replicas so that the number of running replicas is equal to the specified number.

  2. When the system is low on resources, k8 shuts down replicas to ensure that the cluster stays healthy.

  3. Squidex uses Orleans (https://dotnet.github.io/orleans/) and is also deployed as a cluster. When a Squidex replica is shut down it notifies the other cluster members that it has just left the cluster. The state of all cluster members is stored in a membership table.
    Read more: https://dotnet.github.io/orleans/Documentation/implementation/cluster_management.html
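
To make the third point more concrete, here is a toy model of a membership table in Python. It is not the Orleans API: the class and method names are made up for illustration, and the real protocol is more involved (surviving members also detect unresponsive ones over time). It only shows why a replica that is killed without a graceful shutdown leaves a stale "active" entry behind, and why that entry blocks a new replica from joining.

```python
from typing import Dict, Set


class MembershipTable:
    """Toy model only: replica name -> "active" or "dead"."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def leave_gracefully(self, name: str) -> None:
        # A replica that shuts down cleanly marks itself as gone.
        self.status[name] = "dead"

    def try_join(self, name: str, reachable: Set[str]) -> bool:
        # Assumption of this sketch: a new replica may only join if every
        # member that is still listed as "active" actually responds.
        stale = [m for m, s in self.status.items()
                 if s == "active" and m not in reachable]
        if stale:
            return False
        self.status[name] = "active"
        return True


table = MembershipTable()
table.try_join("a", reachable=set())
table.try_join("b", reachable={"a"})

# Graceful shutdown of "b": the table is updated, so "c" can join.
table.leave_gracefully("b")
joined_after_graceful = table.try_join("c", reachable={"a"})

# "a" is killed without notice: its entry stays "active", so "d" is rejected.
joined_after_kill = table.try_join("d", reachable={"c"})
print(joined_after_graceful, joined_after_kill)
```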

What happened?

  • Last week: We introduced a health endpoint to the Squidex server. This endpoint returns the status of the replica. A replica is defined as healthy when…

    • It does not consume more memory than the specified threshold.
    • It can establish a connection to the database and query data successfully.
    • It can connect to the cluster and query data successfully.

    Kubernetes checks this endpoint every 10 seconds and after 3 consecutive failures the replica is restarted. Read more: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

  • 2018-11-20 09:45:00 UTC: The system was low on memory after backup tasks and k8 decided to shut down a Squidex replica. Unfortunately k8 started the graceful shutdown process (which has a timeout of 30 seconds) but decided to terminate the node a second later. So this replica never had the chance to notify the other cluster members, and the cluster was now in an unhealthy state.

  • 2018-11-20 09:46:00 UTC: Kubernetes decided to start a new Squidex node, but the cluster was in an unhealthy state, because the cluster members still assumed that the node that had just gone down was part of the cluster. The membership protocol of the clustering software defines that a new node can only join a healthy cluster. Because of the recently introduced health endpoint, Kubernetes restarted the new replica before the cluster had returned to a healthy state, and each restart added another unhealthy entry to the membership table. We ended up in an infinite loop and the cluster never recovered.

  • 2018-11-20 09:52:00 UTC: Our monitoring system also recorded the problem and sent out alarms, but unfortunately it was already too late in the day in Germany and nobody was available to resolve the incident.

  • 2018-11-21 08:12:00 UTC: I got back to work and restarted the cluster. The total downtime was more than 9 hours.
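
The health definition from the timeline above (memory, database, cluster) can be sketched roughly like this. This is not the Squidex implementation; the function names and the stand-in checks are assumptions for illustration only. The point is that a replica which cannot yet reach the cluster reports itself as unhealthy, even if everything else is fine.

```python
from typing import Callable, Dict, Tuple


def memory_below_threshold(used_bytes: int, threshold_bytes: int) -> bool:
    # Healthy as long as the replica does not use more than the threshold.
    return used_bytes <= threshold_bytes


def evaluate_health(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[bool, Dict[str, bool]]:
    """Run every named check; healthy only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that throws (e.g. a connection error) counts as failed.
            results[name] = False
    return all(results.values()), results


# Stand-in checks: memory and database are fine, the cluster is not reachable.
healthy, details = evaluate_health({
    "memory": lambda: memory_below_threshold(used_bytes=512, threshold_bytes=1024),
    "database": lambda: True,
    "cluster": lambda: False,
})
print(healthy, details)
```

In the incident, the "cluster" check was the one that kept failing for the freshly started replica, so the probe restarted it again and again.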

What can we do?

  • The problem with the health endpoint has been fixed by configuring an initial delay, so that k8 waits 5 minutes before performing the first probe.
  • The health endpoint has been tested, and yesterday we terminated nodes at random to verify that the cluster can return to a healthy state in such cases.
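
As a rough back-of-the-envelope check of why the delay helps, here is a simplified model of the probe timing in Python. It uses the numbers from this post (probe every 10 seconds, restart after 3 consecutive failures, 5 minute delay) and ignores probe timeouts and scheduling jitter. The 120 seconds that a replica needs to join the cluster is a made-up value for illustration, not a measurement.

```python
def earliest_restart_seconds(initial_delay: int, period: int = 10,
                             failure_threshold: int = 3) -> int:
    """Seconds after start at which the last allowed failing probe runs.

    Simplified model: the first probe runs at initial_delay, then one
    probe every `period` seconds.
    """
    return initial_delay + (failure_threshold - 1) * period


def survives_startup(join_seconds: int, initial_delay: int) -> bool:
    # The replica is healthy from join_seconds onwards, so it survives if
    # it has joined by the time the last allowed failing probe would run.
    return join_seconds <= earliest_restart_seconds(initial_delay)


JOIN_SECONDS = 120  # hypothetical time a replica needs to join the cluster

print(earliest_restart_seconds(initial_delay=0))     # without the delay
print(earliest_restart_seconds(initial_delay=300))   # with the 5 minute delay
print(survives_startup(JOIN_SECONDS, initial_delay=0))
print(survives_startup(JOIN_SECONDS, initial_delay=300))
```

Without the delay the replica is restarted well before it had a chance to join; with the delay it gets several minutes first.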