Inconsistent data and Orleans exception in my logs when running in kubernettes cluster


#1

Hi

I’ve managed to set up a kubernettes cluster with 3-nodes, so 3 Squidex instances, and 3-node mongodb replica set.

I’ve used your yaml from your github repo to set it up.

Everything seems to be working fine, but then i tried to do a bit of chaos monkey in order to test out how resilience it was. So i was killing a mongodb randomly, just to make sure it came back up.

I’m not sure if this is the reason for the issue, but now i sometimes i get data that is out of sync ( different result when i refresh schema page in editor UI), and i can see the following in the log:

{
“logLevel”: “Error”,
“message”: “Caught and ignored exception: Squidex.Infrastructure.DomainObjectVersionException with message: Requested version 12 for object ‘Default’ (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 99. thrown from timer callback GrainTimer. TimerCallbackHandler:Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain->System.Threading.Tasks.Task <OnActivateAsync>b__5_0(System.Object)", "eventId": { "id": 101413 }, "exception": { "type": "Squidex.Infrastructure.DomainObjectVersionException", "message": "Requested version 12 for object 'Default' (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 99.”,
“stackTrace”: " at Squidex.Infrastructure.States.Persistence`2.WriteSnapshotAsync(TSnapshot state) in /src/src/Squidex.Infrastructure/States/Persistence{TSnapshot,TKey}.cs:line 143\n at Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain.CheckUsagesAsync() in /src/src/Squidex.Domain.Apps.Entities/Rules/UsageTracking/UsageTrackerGrain.cs:line 110\n at Orleans.Runtime.GrainTimer.ForwardToAsyncCallback(Object state)"
},

I have…

  • [ x ] Checked the logs and have provided uploaded a log file and provided a link because I found something suspicious there.

I’m submitting a…

  • [ ] Regression (a behavior that stopped working in a new release)
  • [ x ] Bug report
  • [ ] Performance issue
  • [ ] Documentation issue or request

Current behavior

Expected behavior

Minimal reproduction of the problem

Environment

  • [ x ] Self hosted with docker and kubernettes
  • [ ] Self hosted with IIS
  • [ ] Self hosted with other version
  • [ ] Cloud version

Version: [3.3.0]

Browser:

  • [ x ] Chrome (desktop)
  • [ ] Chrome (Android)
  • [ ] Chrome (iOS)
  • [ ] Firefox
  • [ ] Safari (desktop)
  • [ ] Safari (iOS)
  • [ ] IE
  • [ ] Edge

Others:


#2

I have changed this to Hosting for now until the real issue is known. Don’t worry, I would live to improve this as well.

You said you are seeing different schema versions in the editor UI, right?

Do you also have logs for that?
Do you know the overall cluster state?
Have you checked the clustering dashboard? /orleans


#3

Can’t reproduce it know :neutral_face: For some reason it stopped throwing exception and and data seems to be in sync.

I’ll try to keep an eye on it and see if it happens again. Do you have any idea what the exception means? maybe i can find a way to reproduce it.


#4

I’m wondering if it’s a sync issue between the mongodb cluster. Maybe i can reproduce it by putting some load on the servers.

I saw in a previous post you have build some kind of script for stress testing. Is this something that can be reused?


#5

I just managed to reproduce the error, added a new schema and got folowing exception

{
“logLevel”: “Error”,
“action”: “GrainInvoked”,
“status”: “Failed”,
“grain”: “Squidex.Domain.Apps.Entities.Schemas.Indexes.SchemasByAppIndexGrain”,
“grainMethod”: “System.Threading.Tasks.Task AddSchemaAsync(System.Guid, System.String)”,
“exception”: {
“type”: “Squidex.Infrastructure.DomainObjectVersionException”,
“message”: “Requested version 5 for object ‘7c524cd9-a111-4370-acf6-19582ad8bed6’ (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Schemas.Indexes.SchemasByAppIndexGrain+GrainState]), but found 7.", "stackTrace": " at Squidex.Infrastructure.States.Persistence2.WriteSnapshotAsync(TSnapshot state) in /src/src/Squidex.Infrastructure/States/Persistence{TSnapshot,TKey}.cs:line 143\n at Squidex.Domain.Apps.Entities.Schemas.Indexes.OrleansCodeGenSchemasByAppIndexMethodInvoker.Invoke(IAddressable grain, InvokeMethodRequest request) in /src/src/Squidex.Domain.Apps.Entities/obj/Release/netstandard2.0/Squidex.Domain.Apps.Entities.orleans.g.cs:line 2021\n at Orleans.Runtime.GrainMethodInvoker.Invoke()\n at Squidex.Infrastructure.Orleans.LoggingFilter.Invoke(IIncomingGrainCallContext context) in /src/src/Squidex.Infrastructure/Orleans/LoggingFilter.cs:line 30”
},
“app”: {
“name”: “Squidex”,
“version”: “3.3.0.0”,
“sessionId”: “354ebe36-027b-4c74-aa48-38601d747282”
},
“web”: {
“requestId”: “9218f8ef-c754-4113-b512-385d418de706”,
“requestPath”: “/apps/test-account/schemas”,
“requestMethod”: “POST”,
“routeValues”: {
“area”: “Api”,
“action”: “PostSchema”,
“controller”: “Schemas”
}
},
“timestamp”: “2019-09-10T09:41:41Z”
}

And the UI looked like this:

And Orleans

Methods with Most Exceptions
100.00%AddSchemaAsync
Squidex.Domain.Apps.Entities.Schemas.Indexes.SchemasByAppIndexGrain


#6

More logs:

{
“logLevel”: “Error”,
“message”: “Caught and ignored exception: Squidex.Infrastructure.DomainObjectVersionException with message: Requested version 11 for object ‘Default’ (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 249. thrown from timer callback GrainTimer. TimerCallbackHandler:Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain->System.Threading.Tasks.Task <OnActivateAsync>b__5_0(System.Object)", "eventId": { "id": 101413 }, "exception": { "type": "Squidex.Infrastructure.DomainObjectVersionException", "message": "Requested version 11 for object 'Default' (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 249.”,
“stackTrace”: " at Squidex.Infrastructure.States.Persistence`2.WriteSnapshotAsync(TSnapshot state) in /src/src/Squidex.Infrastructure/States/Persistence{TSnapshot,TKey}.cs:line 143\n at Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain.CheckUsagesAsync() in /src/src/Squidex.Domain.Apps.Entities/Rules/UsageTracking/UsageTrackerGrain.cs:line 110\n at Orleans.Runtime.GrainTimer.ForwardToAsyncCallback(Object state)"
},
“app”: {
“name”: “Squidex”,
“version”: “3.3.0.0”,
“sessionId”: “354ebe36-027b-4c74-aa48-38601d747282”
},
“timestamp”: “2019-09-10T10:15:12Z”,
“category”: “Orleans.Runtime.GrainTimer”
}

{
“logLevel”: “Error”,
“message”: “Caught and ignored exception: Squidex.Infrastructure.DomainObjectVersionException with message: Requested version 12 for object ‘Default’ (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 249. thrown from timer callback GrainTimer. TimerCallbackHandler:Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain->System.Threading.Tasks.Task <OnActivateAsync>b__5_0(System.Object)", "eventId": { "id": 101413 }, "exception": { "type": "Squidex.Infrastructure.DomainObjectVersionException", "message": "Requested version 12 for object 'Default' (type Squidex.Infrastructure.Orleans.GrainState1[Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain+GrainState]), but found 249.”,
“stackTrace”: " at Squidex.Infrastructure.States.Persistence`2.WriteSnapshotAsync(TSnapshot state) in /src/src/Squidex.Infrastructure/States/Persistence{TSnapshot,TKey}.cs:line 143\n at Squidex.Domain.Apps.Entities.Rules.UsageTracking.UsageTrackerGrain.CheckUsagesAsync() in /src/src/Squidex.Domain.Apps.Entities/Rules/UsageTracking/UsageTrackerGrain.cs:line 110\n at Orleans.Runtime.GrainTimer.ForwardToAsyncCallback(Object state)"
},
“app”: {
“name”: “Squidex”,
“version”: “3.3.0.0”,
“sessionId”: “66d93fdd-3289-4ef1-bf84-7cdcd724a99f”
},
“timestamp”: “2019-09-10T10:15:22Z”,
“category”: “Orleans.Runtime.GrainTimer”
}


#7

Thx, I had a look to the Orleans code. This scenario is actually handled by Orleans in a kind of ugly way:

So the solution would be to catch the DomainObjectVersionException and then to rethrow a InconsistentStateException. Then the grains should be deactivated properly and recreated.


#8

Okay, so is this something that needs to be handled in Squidex or orleans ?


#9

In Squidex. I have pushed a potential improvement.


#10

Btw: My scripts for loading testing are here: https://github.com/Squidex/squidex/tree/master/tools/LoadTest


#11

I’m trying to test out the squidex:dev image out to see if it fix the issue

But i’m getting:
No usable version of the libssl was found in the logs

Any clue what’s wrong?


#12

I have no clue yet, I have only changed a few lines of C# code.


#13

Okay, but you don’t have the same issue? if not i will start looking into my own cluster to see if i can find an answer.

So far i’ve only been running latest and 3.3.0, so wondering it could be related to switching to dev image


#14

I have not used the dev docker image in k8 yet, so I don’t know yet. But it could be related to SSL, which I do not handle in Squidex itself.


#15

Okay, but is there a different build process for dev?


#16

No, it is the same build process. It is just running the dockerfile in the repository.


#17

okay, weird :confused: it’s a problem for both latest dev, and latest dev build, but not for 3.3.0 version


#18

I am working on that. It is weird, it looks like the docker registry has been corrupted. The docker image for dotnet shows a different libssl version than the Dockerfile. I have reported it.


#19

dev-3019 image seems to be working now :slight_smile:


#20

Yes, it was an issue with different versions of libssl. It seems that not all dotnet2.2 artifacts are compatible to each other.