Squidex Status Down June 28 2022

I have…

I’m submitting a…

  • [ ] Regression (a behavior that stopped working in a new release)
  • [ ] Bug report
  • [X ] Performance issue
  • [ ] Documentation issue or request

Current behavior

Timeout on the Squidex API. Checking https://status.squidex.io/ shows the APIs are down. It's been like this for a few hours now.

Expected behavior

Squidex working properly

Environment

  • [ ] Self hosted with docker
  • [ ] Self hosted with IIS
  • [ ] Self hosted with other version
  • [ X] Cloud version

Thx, I am investigating right now. It is weird, because the Squidex cloud works well for me, but the website does not.

Same for us. Squidex cloud works but the API doesn’t

It seems that there are some kinds of queries that are bringing the secondary DB members down (the cloud UI works with the primary member).

I am investigating which queries these are and then I will try to block these.

I will also consider enforcing a rule that large schemas need to be queried with the CDN.

Squidex.io is down too. Is it related?

Yes, it uses the normal endpoint.

I have restarted the secondary database servers. So far the CPU is much lower than before, but I don’t know if it will stay like that.

It was not the first time that the server was down. If we change our plan to the enterprise plan would that be different?

The enterprise plan is a support plan for self hosting. If you do that you will be better protected from the noisy neighbor problem, where other apps or users bring your service down.

In general there are 2 types of hosting solutions for this kind of service: shared or dedicated hardware. If you compare the Squidex pricing to other providers like Contentful or Graph CMS you will see a very big difference. For $489 a month it is possible to provide dedicated hardware per customer, but not for 19€, 49€ or 99€.

If you think about downtimes there are several reasons, and just as a quick classification I would distinguish between these categories:

  1. External failures:
  • 2022-06-21 Cloudflare was having a major downtime.

=> Usually you can do nothing in this case, because if you are not Amazon or Google you do not have a fallback.

  2. Human error: When you make a misconfiguration and something goes wrong:
  • 2021-12-12: A new version was deployed with a misconfiguration.
  • 2021-10-29: An update with DNS was causing issues.

=> Something like this happens all the time (see Cloudflare) and you can only try your best to avoid this. If you are responsible for your own service, you can at least do something and your boss can blame someone :wink:

  3. Architectural implications and external services: Every architecture has positive and negative implications and the issue today was such a case. If you have performance issues you need the tools to investigate that. I used several tools, but none of them helped me to find the root cause of the issue:
  • You can use telemetry data to understand where the performance issue is. It has helped me to identify that the reason is MongoDB.

  • You can use database telemetry data to narrow down the issue. It was very obvious that only the secondary members had the issue.

  • You can monitor your system and identify queries that have grown in number. I tried to use Google monitoring to identify HTTP endpoints with an exceptionally high number of calls. I found a few candidates but none of them was really an obvious problem. Then I used Cloudflare to block these calls, which worked, but it did not change anything for the MongoDB performance.

  • I have tried to use the MongoDB profiler to analyze slow queries. There are two important values that often help to find the reason:

    • Is your query executed with an index? This was the case for all slow queries.
    • How many documents have been analyzed for the query? E.g. if you return 200 documents but scan 200,000 documents, then your index is not very good. With a few exceptions this number was very low, in the range of 500 documents.
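The two profiler checks above can be sketched in a few lines of Python. This is only an illustration, not the tooling used during the incident: the field names `planSummary`, `docsExamined`, `nreturned` and `millis` are the ones the MongoDB profiler writes to `system.profile`, but the sample entries, the `review_entry` helper and the ratio threshold of 10 are assumptions made up for this sketch.

```python
# Illustrative profiler entries; the values are invented for this sketch
# and mirror the 200 returned / 200,000 scanned example from the post.
SAMPLE_ENTRIES = [
    {"ns": "app.contents", "millis": 850, "planSummary": "IXSCAN { appId: 1 }",
     "docsExamined": 500, "nreturned": 200},
    {"ns": "app.contents", "millis": 1200, "planSummary": "COLLSCAN",
     "docsExamined": 200000, "nreturned": 200},
]


def review_entry(entry, max_ratio=10.0):
    """Apply the two checks: index usage and documents scanned per document returned.

    max_ratio is an arbitrary threshold chosen for illustration.
    """
    # Simplification: only an IXSCAN stage counts as "uses an index" here;
    # real plan summaries can contain other stages as well.
    uses_index = "IXSCAN" in entry.get("planSummary", "")
    examined = entry.get("docsExamined", 0)
    returned = entry.get("nreturned", 0)
    # Guard against division by zero for queries that return nothing.
    scan_ratio = examined / max(returned, 1)
    return {
        "uses_index": uses_index,
        "scan_ratio": scan_ratio,
        "suspicious": (not uses_index) or scan_ratio > max_ratio,
    }


for entry in SAMPLE_ENTRIES:
    print(entry["ns"], entry["millis"], review_entry(entry))
```

With real data the entries would come from the `system.profile` collection of the profiled database instead of a hard-coded list. In the incident described here both checks looked healthy for almost all slow queries, which is why this kind of analysis did not reveal the cause.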

Therefore I could not find the problem yet. It still confuses me that the CPU performance is back to normal after a restart. It seems that the database was not able to recover from the previous high load.

Thank you for the response! So far I can confirm it is working now.