Cloud downtime/degraded performance

I have…

I’m submitting a…

  • [x] Performance issue

Current behavior

I’ve noticed some degraded performance and/or downtime with cloud hosted Squidex over the last week or so.

I noticed an issue yesterday, which I believe corresponds to one of the red dots here:
Squidex Queries (Cloud) Status

Just now (in the last few moments) I’ve started seeing errors being logged which are a result of slow/failed connections to Squidex.

I wonder how best to mitigate these issues. At this point it’s happened often enough that I’ve started receiving emails from customers asking if the site is “down”, so I am naturally keen to know what the options are.

I have switched some queries to the CDN (after upgrading my Squidex subscription this week) but at this point I am wondering whether to migrate to self-hosted as an alternative.

Thanks for your help.
Jon

Environment

App Name: PracticalDotNet

  • [ ] Self hosted with docker
  • [ ] Self hosted with IIS
  • [ ] Self hosted with other version
  • [x] Cloud version

I wanted to write about this topic today anyway, so I can also do it here.

Indeed, the Squidex cloud has not been as stable as it should be. There is a wide range of reasons: some of them are out of our hands and others are hard to solve. Just recently we had the following issues:

  • 06-21 Cloudflare had a longer downtime.
  • 06-28 The database had trouble recovering from complex queries.
  • 07-04 The Squidex instances lost connection to the database.

But as of writing this, I am not experiencing any problems with the Squidex Cloud.

The first problem is out of our control, but the other issues could have been avoided with better architectural decisions.

Squidex uses Orleans, an actor framework from Microsoft. With this framework, your instances form a cluster, similar to a replica set in the database world. When the cluster becomes unstable, all kinds of issues happen; for example, new nodes cannot join the cluster anymore. This is what happened when the Squidex instances lost the connection to the database. Orleans is a great technology but hard to master. For many developers it is an unfamiliar way to write, maintain and operate applications. It is proven to work well in scenarios such as gaming with hundreds of nodes, but it requires special skill sets and places requirements on the hosting platform.

You could argue that in an ideal setup the cluster would be stable and you would not have any downtime or node restarts. Unfortunately this is not the case. It is very hard to forecast what kind of queries will be made. In a normal application the queries are more or less known upfront and you can work on your hot paths to make them as performant as possible.

But with Squidex, developers can define their own schemas and queries, and therefore there are no known hot paths. With GraphQL, developers can even write very complex queries that need a lot of memory for processing and can bring a node down. It is better for a node to be restarted than to keep consuming too much memory. So we have to live with the problem of unstable nodes and a database that needs to scale automatically.

Therefore the decision has been made to move away from Orleans and use a traditional architecture with frontend (API) nodes and worker nodes. If a worker node goes down we can still serve all the queries, and the fact that API nodes are independent makes it easier to scale automatically and to use serverless hosting or managed container hosting. This is not only about the cloud; it also makes self-hosting easier.

I have made a lot of progress already and I am working on the tests now. I hope to have a first version this week and a finished version by the end of next week. It was a good decision not to couple Orleans tightly with other components, so the code changes are manageable.

What you can do:

  1. Please have a little patience; I am working hard to solve this problem.
  2. Use the CDN whenever possible.
  3. Use some other kind of caching, e.g. a simple in-memory cache can help a lot.
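
As a rough illustration of point 3, here is a minimal sketch of an in-memory cache with a time-to-live. It is not part of the Squidex client library: the class name TtlCache, its method GetOrAddAsync and the injectable clock are all hypothetical. It assumes that serving a slightly stale value is acceptable when the upstream call fails, which may not hold for every kind of content.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Squidex client library.
public sealed class TtlCache<TValue>
{
    private readonly ConcurrentDictionary<string, (TValue Value, DateTimeOffset ExpiresAt)> entries = new();
    private readonly TimeSpan ttl;
    private readonly Func<DateTimeOffset> clock;

    // The clock is injectable only so that expiry can be tested without waiting.
    public TtlCache(TimeSpan ttl, Func<DateTimeOffset>? clock = null)
    {
        this.ttl = ttl;
        this.clock = clock ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<TValue> GetOrAddAsync(string key, Func<Task<TValue>> fetch)
    {
        var now = clock();

        // Fresh entry: answer from memory without calling the backend.
        if (entries.TryGetValue(key, out var entry) && entry.ExpiresAt > now)
        {
            return entry.Value;
        }

        try
        {
            // Missing or expired entry: fetch and remember the result.
            var value = await fetch();
            entries[key] = (value, now + ttl);
            return value;
        }
        catch when (entries.TryGetValue(key, out entry))
        {
            // The fetch failed but an older value exists: serve it (assumption:
            // stale content is better than an error page). If nothing is cached,
            // the filter is false and the exception propagates to the caller.
            return entry.Value;
        }
    }
}
```

A caller would wrap each content query in GetOrAddAsync with a key derived from the query. Note that concurrent callers may each trigger a fetch for the same expired key; for a small site that is usually acceptable, but it is a deliberate simplification.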

Just for your info:

The Squidex cloud is monitored by a setup inside Google Cloud, which is also used for hosting, and it sends Slack notifications whenever something happens. So 99% of the time I am aware of issues. Sometimes there are network problems with Cloudflare that affect only some users, and in those cases the monitoring will not always detect the problem.

Thanks for this, appreciate the open response (and detailed information about the challenges cloud is currently facing).

From my side I can look at moving some more queries over to the CDN.

As far as I could tell, when I dug into the code, my GraphQLGetAsync queries won’t use the CDN so I need to modify them to use ContentQuery, is that correct?

I’m basing that conclusion on this code in ContentsClient.GraphQLGetAsync<TResponse>:

GraphQlResponse<TResponse> graphQlResponse = await contentsClient.RequestJsonAsync<GraphQlResponse<TResponse>>(
    HttpMethod.Get,
    contentsClient.BuildAppUrl("graphql", false, context) + str, // query is passed as false here
    (HttpContent) null,
    context,
    ct);

Then this in BuildAppUrl:

private string BuildAppUrl(string path, bool query, QueryContext? context = null)
{
    if (!this.ShouldUseCDN(query, context))
        return "content/" + this.ApplicationName + "/" + path;

    ...
}

And this in ShouldUseCDN:

private bool ShouldUseCDN(bool query, QueryContext? context)
{
    if (!query || string.IsNullOrWhiteSpace(this.Options.ContentCDN))
        return false;
    return context == null || !context.IsNotUsingCDN;
}

This also raises one final question: is there a way to identify which queries are using the CDN and which aren’t?

At the moment I’m just using Stackify Prefix to keep an eye on outgoing http requests when I run the app locally, to check if they’re going to the CDN.
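
One option that does not depend on a profiler is to log the target host of every outgoing request with a DelegatingHandler. The sketch below is only an illustration: CdnLoggingHandler, Classify and the cdnHost parameter are hypothetical names, and whether (and how) a custom handler can be plugged into the Squidex client's HttpClient depends on the client library version, so treat the wiring as an assumption to check.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler, not part of the Squidex client library.
public sealed class CdnLoggingHandler : DelegatingHandler
{
    private readonly string cdnHost;
    private readonly Action<string> log;

    public CdnLoggingHandler(string cdnHost, Action<string> log, HttpMessageHandler inner)
        : base(inner)
    {
        this.cdnHost = cdnHost;
        this.log = log;
    }

    // Pure helper so the classification can be checked without any network call.
    public static string Classify(Uri? requestUri, string cdnHost)
    {
        var host = requestUri?.Host ?? string.Empty;
        return string.Equals(host, cdnHost, StringComparison.OrdinalIgnoreCase) ? "CDN" : "origin";
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Log the method, URL and whether the host matches the configured CDN host.
        log($"{request.Method} {request.RequestUri} -> {Classify(request.RequestUri, cdnHost)}");
        return base.SendAsync(request, cancellationToken);
    }
}
```

With plain HttpClient this would be used as new HttpClient(new CdnLoggingHandler(host, Console.WriteLine, new HttpClientHandler())), where host is the host name of the configured ContentCDN URL.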

Thanks!

There is no reason why GraphQLGetAsync should not use the CDN. Using the CDN does not make sense for POST queries, because the request body is not used to calculate cache keys.
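
To illustrate why GET works with a URL-keyed cache: the whole query travels in the URL, so identical queries map to the same cache entry. This is a minimal sketch, assuming the common GraphQL-over-HTTP convention of a "query" parameter; GraphQlUrl.ForGet is a hypothetical helper, not the client library's code, and the base URL is whatever the client resolves for the graphql endpoint.

```csharp
using System;

// Hypothetical helper, not part of the Squidex client library.
public static class GraphQlUrl
{
    // Appends the query text to the endpoint URL so that the full request
    // is described by the URL alone, which is what a CDN keys its cache on.
    public static string ForGet(string graphQlEndpoint, string query)
    {
        // Trimming avoids distinct cache entries for queries that differ
        // only in leading or trailing whitespace.
        return graphQlEndpoint + "?query=" + Uri.EscapeDataString(query.Trim());
    }
}
```

Any other difference in the query text, such as formatting or field order, still produces a different URL and therefore a separate cache entry, so it helps to keep query strings as constants.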

I will have a look at the client library and provide a fix for that ASAP. You can also open a PR, if you want.

OK, here’s a PR for that tweak to the GraphQLGetAsync method.

Thanks :slight_smile:
