I am very ashamed that the stability of Squidex was horrible yesterday.
I apologize and can offer you a free month on a monthly plan or a 10% discount on a yearly plan. If you want to accept this offer, please send me a PM with the name of your app and your billing address, and I will add a voucher to your subscription.
Yesterday evening I tried to carry out a planned update of the MongoDB cluster. I had to upgrade from 3.4 to 4.4 because some queries were not working as expected: even though indexes were configured, MongoDB never actually used them for some queries. Locally I could confirm that this is related to the MongoDB version, so there was no other way than to upgrade.
Unfortunately, it went really wrong, and a few other things went badly as well. Here is why:
Issue 1: The morning
The day did not start very well. The billing provider had made a change and some users were not able to update their subscription. I got it sorted out with the help of their support, but I had to deploy a new version. This version had a small bug that prevented the frontend from renewing the access token or logging out properly. Therefore some users were not able to log in. I only mention this to explain the other problem.
Issue 2: The evening
In the evening and night I wanted to make the update of MongoDB.
The plan was to migrate the MongoDB cluster node by node without any downtime. Several mistakes I made turned the process into a pain:
The cluster (aka replica set) is hosted in Kubernetes. When I started with the replica set, it was configured with a sidecar. This sidecar is a special container which scans Kubernetes for MongoDB instances and connects them to a replica set. It was the way to go a few years ago, but it is not recommended anymore, because during updates and downtimes the replica set gets reconfigured all the time, which makes updates much harder in some cases. The idea of this sidecar is to make scaling easier, as you can increase the number of nodes and everything is configured automatically. But in practice this almost never happens.
Not reading the documentation properly: I was not aware that you have to make the update step by step and that you have to configure some values in MongoDB before each step. This was not the case when I made the last update, but that is no good excuse. It was unprofessional and careless, and the assumption that the upgrade would be easy and smooth was just stupid. If I had read the documentation properly, I would have made a better test first.
The replica set was also configured with the rolling update strategy, which means that when you make an update, the first node gets updated, then the next and so on, only one at a time. If you use the Kubernetes default settings, a node is only updated when you delete it, but I was not aware that the setting had been changed to rolling updates, and I do not remember the reason anymore.
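To make the step-by-step requirement concrete, here is a minimal sketch of how the upgrade path from 3.4 to 4.4 could be written down before touching the cluster. It assumes the release order 3.4, 3.6, 4.0, 4.2, 4.4 and that the value to configure before each step is MongoDB's featureCompatibilityVersion. The function name `plan_upgrade` is only an illustration and not part of any tool; please check the MongoDB upgrade documentation for the exact procedure.

```python
from typing import List

# Assumption: these are the major releases between 3.4 and 4.4, in order.
RELEASES = ["3.4", "3.6", "4.0", "4.2", "4.4"]


def plan_upgrade(current: str, target: str) -> List[str]:
    """Return the ordered steps for upgrading from `current` to `target`.

    Hypothetical helper: it only prints a plan, it does not talk to MongoDB.
    """
    start, end = RELEASES.index(current), RELEASES.index(target)
    if start > end:
        raise ValueError("downgrades are not covered by this sketch")

    steps: List[str] = []
    for running, nxt in zip(RELEASES[start:end], RELEASES[start + 1:end + 1]):
        # The compatibility version has to match the running release
        # before the binaries of the next release are installed.
        steps.append(f"set featureCompatibilityVersion to {running}")
        steps.append(f"upgrade binaries from {running} to {nxt}, one node at a time")
    if start < end:
        # Finish by enabling the features of the target release.
        steps.append(f"set featureCompatibilityVersion to {target}")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_upgrade("3.4", "4.4"), start=1):
        print(f"{number}. {step}")
```

Writing the plan down like this shows that 3.4 to 4.4 is four separate upgrades, not one, which is exactly what I missed.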
What went wrong?
I updated the replica set and the first node was terminated and restarted as expected. For a few seconds it reported that it was healthy before it crashed. But these few seconds were enough for Kubernetes to restart the next node as well, until all the nodes had gone down and the MongoDB cluster was actually dead. I was able to restart the cluster with the previous version and had a look at the logs.
Because of the mistakes I had made before, the logs were full of strange error messages. I thought I was on the right track and just had to fix a single configuration value, so I tried it one more time, and then again, until I gave up. Unfortunately the data became corrupt and I had to repair the database, which caused another downtime. The total downtime my systems recorded was 10 minutes.
When MongoDB was unavailable for a few minutes, the Squidex cluster was also not able to repair itself fast enough, so I had to restart this cluster as well, which caused the majority of the downtime.
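The cascade in the MongoDB cluster, where each node looked healthy for a few seconds before it crashed, can be illustrated with a small simulation. This is only a toy model and not how Kubernetes is implemented: `crash_after` and `stable_seconds` are hypothetical parameters that stand for "the new version crashes after this many seconds" and "the rollout waits this long before it trusts a node".

```python
from typing import List


def rolling_update(node_count: int, crash_after: int, stable_seconds: int) -> List[str]:
    """Simulate a rolling restart and return the final state of every node.

    Toy model: the new version always crashes `crash_after` seconds after
    a restart. The rollout watches each restarted node for `stable_seconds`
    (a hypothetical setting) before it moves on to the next node.
    """
    states = ["old"] * node_count
    for node in range(node_count):
        states[node] = "crashed"  # the new version crashes sooner or later
        if stable_seconds >= crash_after:
            # The crash happens while the rollout is still watching,
            # so it stops and the remaining nodes keep the old version.
            break
        # Otherwise the node looked healthy long enough and the next one is restarted.
    return states


def has_majority(states: List[str]) -> bool:
    """A replica set needs more than half of its members to elect a primary."""
    return states.count("old") > len(states) / 2


if __name__ == "__main__":
    for wait in (0, 30):
        result = rolling_update(3, crash_after=5, stable_seconds=wait)
        print(f"wait {wait}s: {result}, majority alive: {has_majority(result)}")
```

In this model a rollout that trusts a node immediately takes down all three nodes, while a rollout that waits longer than the crash delay stops after the first node and leaves two of three members running.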
How did I solve it?
After that I created a new replica set with 3.4, a clone of the data and another Squidex installation, and simulated a full migration from 3.4 to 4.4. It went well, and thanks to this Squidex installation I could prove that there was no downtime anymore. Then I did the same in the production environment and it went well again.
Overall this is a story of human mistakes caused by a stressful day, ending in a stressful evening and night. But it could have been avoided.
What I will do to prevent this in the future:
- Practice rolling updates of MongoDB, as I do not do them very often.
- Test how the Squidex cluster recovers from MongoDB failures.
Maintenance is planned for 2020-08-11. The plan is to upgrade MongoDB to a newer version. As we are running it in a replica set, no downtime should happen, but I want everybody to be aware.
The maintenance window is scheduled for 20:00 to 20:30 UTC: