I wanted to write about this topic today anyway, so I might as well do it here.
The Squidex Cloud has indeed not been as stable as it should be. There is a wide range of reasons; some of them are out of our hands and others are hard to solve. Just recently we had the following issues:
- 06-21 Cloudflare had a longer downtime.
- 06-28 The database had trouble recovering from complex queries.
- 07-04 The Squidex instances lost connection to the database.
But as of writing this, I am not experiencing any problems with the Squidex Cloud.
The first problem is out of our control, but the other issues could have been avoided with better architectural decisions.
Squidex uses Orleans, an actor framework from Microsoft. With this framework, your instances form a cluster, similar to a replica set in the database world. When the cluster becomes unstable, all kinds of issues happen; for example, new nodes can no longer join the cluster. This is what happened when the Squidex instances lost the connection to the database. Orleans is a great technology but hard to master. For many developers it is an unfamiliar way to write, maintain and operate applications. It is proven to work well in scenarios such as gaming with hundreds of nodes, but it requires special skill sets and places requirements on the hosting platform.
You could argue that in an ideal setup the cluster would be stable and you would not have any downtime or node restarts. Unfortunately this is not the case. It is very hard to forecast what kind of queries will be made. In a normal application the queries are more or less known upfront and you can work on your hot paths to make them as fast as possible.
But with Squidex, developers can define their own schemas and queries, and therefore there are no hot paths. With GraphQL, developers can even write very complex queries that need a lot of memory for processing and can even bring a node down. It is better for a node to be restarted than to consume too much memory. So we have to live with the problem of unstable nodes and a database that needs to scale automatically.
Therefore the decision has been made to move away from Orleans and use a traditional architecture with frontend (API) nodes and worker nodes. If a worker node goes down, we can still serve all queries, and the fact that API nodes are independent makes it easier to scale automatically and to use serverless hosting or managed container hosting. This is not only about the cloud; it also makes self-hosting easier.
I have made a lot of progress already and I am working on the tests now. I hope to have a first version this week and a finished version by the end of next week. It was a good decision not to couple Orleans with other components, so the code changes are acceptable.
What you can do:
- Please have a little patience, I am working hard to solve this problem.
- Use the CDN whenever possible.
- Use some other kind of caching, e.g. a simple in-memory cache can help a lot.
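To illustrate the last point, here is a minimal sketch of an in-memory cache with a time-to-live, written in Python. It is not part of Squidex or any Squidex SDK: the names `TTLCache`, `get_or_load` and `load_posts` are made up for this example, and `load_posts` is only a stand-in for whatever call your application makes to the Squidex API.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds.

    Hypothetical helper for illustration only. It is not thread-safe.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, or call loader and cache its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, no request needed
        value = loader()  # if this raises, nothing is cached
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)

    def load_posts() -> list:
        # Stand-in for a real request to the Squidex API (assumption).
        print("loading from API")
        return [{"title": "Hello"}]

    cache.get_or_load("posts", load_posts)  # calls load_posts
    cache.get_or_load("posts", load_posts)  # served from memory
```

With a TTL of 60 seconds, repeated requests for the same key within a minute never reach the API, so short outages or slow responses are less visible to your users. The right TTL depends on how fresh your content needs to be.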
Just for your info:
The Squidex Cloud is monitored by a setup inside the Google Cloud, which is also used for hosting, and it sends Slack notifications whenever something happens. So 99% of the time I am aware of issues. Sometimes there are network problems with Cloudflare that affect only some users, and in those cases the monitoring does not always detect the problem.