[SOLVED] 2018-07-04 Unplanned Downtime


What happened?

On 2018-07-04 we had unplanned downtime, for the following reasons:

  1. We added more resources to our cluster and the migration did not go well, even though we had practiced it beforehand. Our cluster is too small to guarantee 100% uptime in this scenario, and it was pure luck that the test run went well.

  2. Google Cloud had issues with Google Compute, and the site was unavailable for a few minutes.

  3. We were still using an old authentication model. The new nodes for our cluster were created with the new authentication system, which requires the correct credentials to be set up. I was not aware of this, so Squidex could not access Google Storage to upload or download assets.

What was the total downtime?

Approximately 2 x 10 minutes (about 20 minutes in total).

What can we improve?

We will perform the next maintenance during our usual maintenance window, which is around midnight (Berlin time).