[SOLVED] 2018-11-06 Unplanned Downtime


#1

Today we had a few minutes of unplanned downtime.

The bug occurred in Microsoft's Orleans framework and still has to be understood and fixed. Fortunately, a simple restart resolved the issue.

We will introduce health checks to automatically restart the system when something like this happens. Of course, we will work very hard to make the system as stable as possible, but you can only reduce the risk, never eliminate it.


#2

The health endpoint has been implemented: https://cloud-staging.squidex.io/healthz

The memory limit is configurable: https://github.com/Squidex/squidex/blob/master/src/Squidex/appsettings.json#L57

There are several use cases.

  1. You can use monitoring tools and track the health using this endpoint.

  2. In Kubernetes, I added the health endpoint as both the readiness and the liveness probe:

       readinessProbe:
         httpGet:
           path: /healthz
           port: 80
       livenessProbe:
         httpGet:
           path: /healthz
           port: 80
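
For the first use case (external monitoring), a minimal polling sketch could look like the following. It is only an illustration: it assumes that the endpoint answers with HTTP 200 when the system is healthy and with anything else (or no answer at all) when it is not, which the posts above do not confirm. The function names `is_healthy` and `wait_for_failures` and the `opener` and `sleep` parameters are hypothetical helpers introduced here, not part of Squidex.

```python
import time
import urllib.error
import urllib.request

# Health endpoint mentioned in the post above.
HEALTH_URL = "https://cloud-staging.squidex.io/healthz"


def is_healthy(url=HEALTH_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with HTTP 200 within the timeout.

    Assumption: 200 means healthy; any other status, a timeout, or a
    connection error is treated as unhealthy.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 503), DNS failures, and timeouts all end up here.
        return False


def wait_for_failures(url=HEALTH_URL, max_failures=3, interval=10.0,
                      opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll until max_failures consecutive checks fail.

    Returns the total number of checks made, so that a caller can alert
    or trigger a restart afterwards.
    """
    failures = 0
    checks = 0
    while failures < max_failures:
        checks += 1
        if is_healthy(url, opener=opener):
            failures = 0  # a single success resets the streak
        else:
            failures += 1
        if failures < max_failures:
            sleep(interval)
    return checks
```

Calling `wait_for_failures()` blocks until three checks in a row have failed. Requiring several consecutive failures is a deliberate choice in this sketch so that one slow response does not raise an alert. The Kubernetes probes have comparable settings (`periodSeconds`, `failureThreshold`).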