Squidex pod cannot restart after it has entered CrashLoopBackOff in GKE

I have…

  • [X] Checked the logs and have uploaded a log file and provided a link because I found something suspicious there.

I’m submitting a…

  • [ ] Regression (a behavior that stopped working in a new release)
  • [X] Bug report
  • [X] Performance issue
  • [ ] Documentation issue or request

Current behavior

We are running Squidex in GKE, and at some stage the pod failed with a CrashLoopBackOff status.

Then I deleted the pod so that a new one would be created and start, but it won't start anymore; it keeps failing until it ends up in CrashLoopBackOff…

squidex-cms-59dd6bd66f-lbf9d 0/1 CrashLoopBackOff 6 9m9s

Please see the pod log for more details.
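For reference, this is how I pull the log of the crashed container and the pod events (pod name as above; add the namespace if your deployment uses one):

kubectl logs squidex-cms-59dd6bd66f-lbf9d --previous
kubectl describe pod squidex-cms-59dd6bd66f-lbf9d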

Expected behavior

After the failed pod is deleted and a new pod is created, it should simply start a new instance and run.

Minimal reproduction of the problem

Use JMeter to run a load test for 30 minutes; the pod stops responding quickly and ends up in CrashLoopBackOff. After that, even a new pod won't work.
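The load test itself is just a JMeter plan run in non-GUI mode; the plan file name here is only a placeholder for whatever plan is pointed at the Squidex endpoint:

jmeter -n -t squidex-load-test.jmx -l results.jtl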

Could this be due to dangling data in MongoDB, and where can I check that?

I noticed the Migration object has lock: true; could that be the issue?
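For reference, this is how I looked at that document in the mongo shell. The database name squidex and the lock field name are just what I see in my setup and may differ elsewhere; I have not tried resetting the lock yet:

mongo squidex --eval 'db.Migration.find().pretty()'
# only if resetting the lock turns out to be safe:
mongo squidex --eval 'db.Migration.updateOne({}, { $set: { lock: false } })'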

Environment

  • [X] Self hosted with docker
  • [ ] Self hosted with IIS
  • [ ] Self hosted with other version
  • [ ] Cloud version

Version: 3.3.0

Browser:

  • [ ] Chrome (desktop)
  • [ ] Chrome (Android)
  • [ ] Chrome (iOS)
  • [ ] Firefox
  • [ ] Safari (desktop)
  • [ ] Safari (iOS)
  • [ ] IE
  • [ ] Edge

Others:

Logs
Now listening on: http://[::]:80
Application started. Press Ctrl+C to shut down.
{
  "logLevel": "Information",
  "filters": {
    "costs": 0.0
  },
  "elapsedRequestMs": 1027,
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "web": {
    "requestId": "112bfac8-2024-44cd-ab3f-0bcd226f6e5f",
    "requestPath": "/build/index.html",
    "requestMethod": "GET"
  },
  "timestamp": "2019-10-02T01:56:51Z"
}

{
  "logLevel": "Warning",
  "message": "UnregisterManyAsync 43 failed.",
  "eventId": {
    "id": 100502
  },
  "exception": {
    "type": "System.InvalidOperationException",
    "message": "Grain directory is stopping",
    "stackTrace": " at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription)\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context)\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount)\n at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()\n at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs)"
  },
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:51Z",
  "category": "Orleans.Runtime.Catalog"
}

{
  "logLevel": "Warning",
  "message": "UnregisterManyAsync 1 failed.",
  "eventId": {
    "id": 100502
  },
  "exception": {
    "type": "System.InvalidOperationException",
    "message": "Grain directory is stopping",
    "stackTrace": " at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription)\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context)\n at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount)\n at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()\n at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs)"
  },
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:51Z",
  "category": "Orleans.Runtime.Catalog"
}

{
  "logLevel": "Warning",
  "message": "Socket error on new socket during ReceiveAsync with RemoteEndPoint: OperationAborted",
  "eventId": {
    "id": 101013
  },
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:53Z",
  "category": "Orleans.Runtime.Messaging.IncomingMessageAcceptor"
}

{
  "logLevel": "Error",
  "message": "RunClientMessagePump has thrown exception",
  "eventId": {
    "id": 100326
  },
  "exception": {
    "type": "System.OperationCanceledException",
    "message": "The operation was canceled.",
    "stackTrace": " at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)\n at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)\n at System.Collections.Concurrent.BlockingCollection`1.Take(CancellationToken cancellationToken)\n at Orleans.Runtime.HostedClient.RunClientMessagePump()"
  },
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:53Z",
  "category": "Orleans.Runtime.HostedClient"
}

{
  "logLevel": "Warning",
  "message": "Failed to update status to dead in the alotted time during shutdown",
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:58Z",
  "category": "Orleans.Runtime.MembershipService.MembershipTableManager"
}

{
  "logLevel": "Warning",
  "message": "Lifecycle stop operations canceled by request.",
  "app": {
    "name": "Squidex",
    "version": "3.3.0.0",
    "sessionId": "0d7ed7dc-70ce-487b-9707-366d67fb44fc"
  },
  "timestamp": "2019-10-02T01:56:58Z",
  "category": "Orleans.Runtime.SiloLifecycleSubject"
}

Application is shutting down…

And in MongoDB there is a collection named Orleans_OrleansReminderV2 with the data shown below; could it be causing the issue?

db.Orleans_OrleansReminderV2.find({})
{ "_id" : "squidex_GrainReference=00000000000000000000000000000000060000001f05941d+Default_Default", "ServiceId" : "squidex", "GrainId" : "GrainReference=00000000000000000000000000000000060000001f05941d+Default", "ReminderName" : "Default", "Etag" : "a6835b2f-7d6b-4ce8-b699-ea92579114f8", "Period" : "00:10:00", "GrainHash" : NumberLong(1418268458), "IsDeleted" : false, "StartAt" : ISODate("2019-10-02T01:56:50.921Z") }
{ "_id" : "squidex_GrainReference=00000000000000000000000000000000060000002352c26b+Default_Default", "ServiceId" : "squidex", "GrainId" : "GrainReference=00000000000000000000000000000000060000002352c26b+Default", "ReminderName" : "Default", "Etag" : "03d7dff2-1283-4517-9afe-5fa31bcac5ad", "Period" : "00:10:00", "GrainHash" : NumberLong(2133228456), "IsDeleted" : false, "StartAt" : ISODate("2019-10-02T02:07:53.148Z") }
{ "_id" : "squidex_GrainReference=0000000000000000000000000000000006ffffff89da8f81+Default_Default", "ServiceId" : "squidex", "GrainId" : "GrainReference=0000000000000000000000000000000006ffffff89da8f81+Default", "ReminderName" : "Default", "Etag" : "1865bd37-a4cc-49ee-a69f-f5f583f47371", "Period" : "00:10:00", "GrainHash" : NumberLong(201268861), "IsDeleted" : false, "StartAt" : ISODate("2019-10-02T02:07:51.795Z") }
{ "_id" : "squidex_GrainReference=0000000000000000000000000000000006fffffffdc46fa9+Default_Default", "ServiceId" : "squidex", "GrainId" : "GrainReference=0000000000000000000000000000000006fffffffdc46fa9+Default", "ReminderName" : "Default", "Etag" : "5493930a-2d89-49e3-b86c-b6408c970f80", "Period" : "00:10:00", "GrainHash" : NumberLong(1408215336), "IsDeleted" : false, "StartAt" : ISODate("2019-10-02T02:07:53.162Z") }

You can try to delete the reminders and the membership table.

Thanks, but which tables do you mean?

The Identity_xxx tables and Orleans_OrleansReminderV2?
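If it is the reminder collection, I would plan to run something like this against the Squidex database while the pod is scaled down (the database and deployment names here are assumptions from my setup); I am still not sure which collection holds the Orleans membership data, though:

kubectl scale deployment squidex-cms --replicas=0
mongo squidex --eval 'db.Orleans_OrleansReminderV2.drop()'
kubectl scale deployment squidex-cms --replicas=1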

I am able to reproduce the issue now: as long as I run a load test against Squidex at a rate of 85-100 requests per second for around 30 minutes, the pod keeps restarting until it is in CrashLoopBackOff state. After that, even if I delete the pod, it is still crashing.

I also tried to make the pod read-only by setting MODE__ISREADONLY=true, but it is still the same.
The Docker version I am running is 3.3.0. I did try 3.4.0, but after switching to the new image, none of the Apps we have show up in the web dashboard.
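For reference, this is how I set that flag on the deployment (the deployment name is an assumption based on the pod name above):

kubectl set env deployment/squidex-cms MODE__ISREADONLY=true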

Below are all the collections in MongoDB.

Events
Identity_PersistedGrants
Identity_Roles
Identity_Users
Migration
Orleans_OrleansReminderV2
Projections_History
RuleEvents
States_Apps
States_Assets
States_EventConsumerState
States_Index_AppsByName
States_Index_AppsByUser
States_Index_RulesByApp
States_Index_SchemasByApp
States_Index_Tags
States_RestoreState
States_Schemas
States_UISettings
States_UsageTracker
States_XmlRepository
UsagesV2

Oh so you are running without clustering. Do you have a health check available?

It is running as one pod. I think there is a default Kubernetes health check.

I noticed the Squidex deployment has terminationGracePeriodSeconds set to 60, while the Kubernetes default is 30. Do I have to update the value to that much as well?
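This is how I checked the grace period and the configured liveness probe on the deployment (deployment name is an assumption):

kubectl get deployment squidex-cms -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
kubectl get deployment squidex-cms -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'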

Can you see in your Kubernetes cluster why it failed? E.g. a failed health check, low memory, or another crash?

I can see it usually fails (restarts) because of a failed liveness/readiness check.
That is usually because the load drives CPU usage up and the pod then no longer responds quickly enough. We previously had a CPU limit, which could also cause the pod to restart sometimes.
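This is how I check the resource usage and the recent events for the failing pod (the pod name is the one from earlier and changes on every restart; kubectl top needs metrics-server):

kubectl top pod squidex-cms-59dd6bd66f-lbf9d
kubectl get events --field-selector involvedObject.name=squidex-cms-59dd6bd66f-lbf9d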

It is OK if the pod needs to restart due to resources or high traffic. The issue is that SOMETIMES, when it is restarted, it keeps failing and restarting until it gets to CrashLoopBackOff, and the only way I can fix it now is to restore the MongoDB database from a previous backup.

Our backup is the full database, which includes everything.
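For completeness, the backup and restore are just plain mongodump/mongorestore runs; the database name and output path here are placeholders for our actual values:

mongodump --db squidex --out /backup/squidex
mongorestore --db squidex --drop /backup/squidex/squidex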

Do you have a full log of a start that crashed?