Squidex response 521

peterstarling · December 17, 2021, 11:28am

Over the last couple of weeks, and especially today, we have been having issues reaching the Squidex service. The CDN responds with 521, see below:

This is affecting our production applications.
The issues do not seem to be reported here https://status.squidex.io/
The error seems to be intermittent; however, it usually lasts for about 10-15 minutes and can be replicated across different regions (and locally, see screenshot; FYI, the URL is https://cloud.squidex.io/).

I have…

[ ] Checked the logs and have uploaded a log file and provided a link because I found something suspicious there. Please do not post the log file in the topic because very often something important is missing.

I’m submitting a…

[ ] Regression (a behavior that stopped working in a new release)
[ ] Bug report
[x] Performance issue
[ ] Documentation issue or request

Current behavior

Expected behavior

Minimal reproduction of the problem

Environment

[ ] Self hosted with docker
[ ] Self hosted with IIS
[ ] Self hosted with other version
[x] Cloud version

Version: [VERSION]

Browser:

[ ] Chrome (desktop)
[ ] Chrome (Android)
[ ] Chrome (iOS)
[ ] Firefox
[ ] Safari (desktop)
[ ] Safari (iOS)
[ ] IE
[ ] Edge

Others:

peterstarling · December 17, 2021, 11:28am

Sorry for posting here, but I am no longer able to post a new topic in the Maintenance category.

Sebastian · December 17, 2021, 11:50am

I have a reported incident for about 1 minute between 11:25 and 11:26 UTC but I cannot reproduce it at the moment.

kmaid · December 17, 2021, 12:09pm

Hi,

We had reported intermittent problems across our application from 8:53 until 11:23.

Screenshot taken at 11:01 from our CI

Sebastian · December 17, 2021, 12:17pm

I believe you but I cannot reproduce it yet and do not see it in the logs. What I see is the problem I mentioned.

And of course the drop in performance:

I am adding more metrics to investigate that.

kmaid · December 17, 2021, 12:27pm

We are also improving our Sentry logging. Is there anything we should include when Squidex throws an error that will help you trace it on your side? We intend to log GraphQL queries and the response returned should an error be thrown.

Sebastian · December 17, 2021, 12:28pm

No, I don’t think so.

niezgoda · December 17, 2021, 1:14pm

What exactly does it mean that you do not see it in the logs?
You do not have the metrics for DB and external services response times?
Memory/CPU usage for internal infrastructure?

This does not look like a memory deallocation or tree-shaking problem if it is not limited to a single application server.

Look at network changes that occurred on your infrastructure.
Maybe the cloud/servers provider made some changes that caused DB/API services to be no longer available.

Networking issues are hard to debug

Sebastian · December 17, 2021, 1:21pm

I have an uptime check in Google Cloud that monitors Squidex Cloud from four locations, and except for the short issue it does not report anything. This correlates with the internal logs from the application (first screenshot).

I can also reach Squidex just fine.

However, I do see some degradation in performance, but I cannot tell exactly where it is coming from.

Sebastian · December 17, 2021, 2:11pm

Btw:

Does this work?

kmaid · December 17, 2021, 2:20pm

Hi @Sebastian,

The issue is no longer happening for our application and these pages work. I do wonder whether, while your backend responds, there is perhaps a proxy layer above it that times out and thus sends no response, which gives us the error on Cloudflare. I briefly looked at Cloudflare timeouts and found them to be quite long.

I would like to suggest an enhancement for errors. I don’t think Squidex should ever return an HTML Cloudflare page when there is a problem with the API. Rather, it should return a message from your own infrastructure in all cases (so that errors can be logged).
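Until that changes, a client can at least tell the two cases apart before logging. The following is a minimal sketch, assuming the client has the HTTP status code, the Content-Type header, and the body as text. The names `classify_response`, `ClassifiedResponse` and `CLOUDFLARE_ORIGIN_ERRORS` are hypothetical and not part of any Squidex or Cloudflare API, and the retry policy is only an assumption.

```python
from dataclasses import dataclass

# Assumption: 520-527 are treated as Cloudflare edge/origin errors
# (521 means the edge could not connect to the origin server).
CLOUDFLARE_ORIGIN_ERRORS = set(range(520, 528))


@dataclass
class ClassifiedResponse:
    source: str      # "cloudflare", "api" or "unknown"
    retryable: bool  # whether retrying later is likely to help
    summary: str     # short text suitable for a log line


def classify_response(status: int, content_type: str, body: str) -> ClassifiedResponse:
    """Decide whether an error came from the CDN edge or from the API itself."""
    is_json = "json" in content_type.lower()

    if status in CLOUDFLARE_ORIGIN_ERRORS:
        # The body is an HTML error page; log only the status, not the markup.
        return ClassifiedResponse("cloudflare", True, f"edge error {status}")

    if status >= 400 and is_json:
        # A structured error from the API; keep a truncated copy of the body.
        return ClassifiedResponse("api", status >= 500, f"api error {status}: {body[:200]}")

    if status >= 400:
        # Some other non-JSON error (for example from another proxy layer).
        return ClassifiedResponse("unknown", status >= 500, f"non-JSON error {status}")

    return ClassifiedResponse("api", False, "ok")
```

For example, `classify_response(521, "text/html", "<html>...")` is classified as coming from `"cloudflare"` and marked retryable, while a 400 response with a JSON body is classified as an `"api"` error that should not be retried.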

Sebastian · December 17, 2021, 2:30pm

Good point, cloudflare should not return HTML. The timeout issues are recorded though in the logs, because the request is aborted by cloudflare and then tracked.

kmaid · December 17, 2021, 2:53pm

It would also be good to track all errors (even valid ones). I used to use things like https://www.stathat.com/ to alert us if there was a sudden spike in requests resulting in an error. Sometimes it's a rogue import script or some hacker trying a bunch of things, but it has also helped detect a lot more insidious, smaller-scale issues.
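As a rough illustration of that kind of spike alert, here is a minimal sliding-window sketch. It is not how StatHat or Squidex works; the class name `ErrorRateMonitor`, the window length, the threshold and the minimum request count are all assumed values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of failed requests within a sliding time window."""

    def __init__(self, window_seconds: float = 60.0,
                 threshold: float = 0.2, min_requests: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold        # alert when the error rate reaches this value
        self.min_requests = min_requests  # ignore windows with too few requests
        self._events = deque()            # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        """Add one request outcome and drop events that fell out of the window."""
        self._events.append((timestamp, is_error))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self) -> bool:
        return (len(self._events) >= self.min_requests
                and self.error_rate() >= self.threshold)
```

Each request calls `record(time.time(), is_error)`, and a periodic job checks `should_alert()`. The minimum request count avoids alerting on a single failed request during quiet periods.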

kmaid · December 17, 2021, 2:54pm

Will you look at the error enhancement? How can we track the status of this issue?

Sebastian · December 17, 2021, 3:04pm

Yes, but not today. We can track it here.

Sebastian · December 20, 2021, 5:43pm

I have turned off the Always Online feature.

peterstarling · December 23, 2021, 1:09pm

ok, what does that mean?

kmaid · December 23, 2021, 1:21pm

Can you see the 520 & 521 errors we reported in the cloudflare logs?

Sebastian · December 23, 2021, 4:20pm

The Always Online feature serves a cached version of the page, which does not make sense anyway; it is better to bubble up the 521 to the client. Perhaps I can test that with a demo project very early next year.

But I am not aware that there are any logs in Cloudflare; I have to test that. Internally, Squidex uses Google Stackdriver, error reporting and monitoring, but I cannot expose that. To be honest, I am not 100% sure what your expectations are.