Outage today - totally down


#1

We’ve been experiencing total failure today. API calls were failing, and we couldn’t log in to the cloud.squidex.io site.

Then a little later, we were able to log in to the admin, but it says all our schemas are gone!

Our site is totally down and my client is very mad. Do we have an ETA on when things will be restored?


#2

It is restored, I am sorry.


#3

What happened? This is the middle of the business day for us, so this is the worst time possible to experience an outage.

I’ve been asked to give an update to my client on the root cause. What shall I tell him?

Thanks
David


#4

I introduced a small change to the serialization system that worked well and was fully tested locally, but it affected some part of the clustering. When I deployed the new version I tested everything and it was fine; the problems started a little later, and I did not realize it immediately.

It was just pure bullshit that should not have happened. I knew that this change might be a little bit problematic and therefore moved the deployment to my late afternoon, after business hours in Germany.

I will create a report and make a plan to ensure that this does not happen again. One part of the story is the team size, of course, but the other part is to create full clustering tests that run before a new release.


#5

Would it be possible to have a “Europe cluster” and a “US cluster” and deploy to each during the downtime that would be appropriate for that time zone?


#6

That would be great, but it would be too expensive at the moment. If I deployed new changes at 6am Berlin time, it should be appropriate for you as well, I think.