Configurable Orleans cluster options

I’m submitting a…

  • [x] Documentation issue or request

Current behavior

When running in cluster mode with 1 silo and that silo gets shut down ungracefully, the next time a new silo tries to join it cannot do so, because ClusterMembershipOptions.ValidateInitialConnectivity requires it to get a ping response from every silo still marked as Active. This results in a cluster that cannot recover on its own.

MembershipAgent failed to start due to errors at stage BecomeActive (19999): Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.19.65.71:11111:368129322]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)

One way to fix this is to manually remove all “Active” silos from the Orleans_OrleansMembershipSingle collection, since they are not actually active anymore.
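
For anyone else hitting this, here is a rough sketch of how that collection can at least be inspected with the MongoDB C# driver before cleaning it up. The connection string and database name are placeholders, and the document layout depends on the membership provider, so verify what you see before editing or deleting anything:

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Bson.IO;
using MongoDB.Driver;

// Rough sketch only: dump the Orleans membership data so stale "Active" entries
// can be located before cleaning them up by hand. Connection string and database
// name are placeholders, and the document layout depends on the membership
// provider, so inspect the output before changing anything.
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("Squidex");
var membership = database.GetCollection<BsonDocument>("Orleans_OrleansMembershipSingle");

foreach (var document in membership.Find(new BsonDocument()).ToList())
{
    Console.WriteLine(document.ToJson(new JsonWriterSettings { Indent = true }));
}
```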

Expected behavior

I should be able to configure Orleans options such as ClusterMembershipOptions.ValidateInitialConnectivity so that my cluster can become healthy again in the situation described above.
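
For reference, at the Orleans level the switch is a single property on ClusterMembershipOptions. A rough sketch of what a Squidex setting would ultimately have to toggle (plain Orleans hosting code, not Squidex's actual startup; the localhost clustering call is just a placeholder):

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

// Rough sketch of the Orleans-level switch the requested Squidex setting would
// have to toggle. This is plain Orleans hosting code, not Squidex's startup;
// the localhost clustering call is only a placeholder (Squidex uses MongoDB).
var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder
            .UseLocalhostClustering() // placeholder for the real clustering provider
            .Configure<ClusterMembershipOptions>(options =>
            {
                // Skip pinging every "Active" silo on startup, so a single silo
                // that died ungracefully cannot block new silos from joining.
                options.ValidateInitialConnectivity = false;
            });
    })
    .Build();

await host.RunAsync();
```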

I realize that this can also be solved by running multiple silos in the cluster, so that one ungraceful shutdown does not bring the whole cluster down and the remaining silos can mark the lost silo as “Dead”. However, what happens if all silos get shut down ungracefully? Then the only intervention is a manual cleanup of the Orleans_OrleansMembershipSingle collection (unless I’m missing some other way).

Minimal reproduction of the problem

Run in cluster mode with 1 silo. Shut it down (gracefully or not). Modify the Orleans_OrleansMembershipSingle collection to change a “Dead” silo back to “Active”, simulating an ungraceful shutdown (also change Status to 6). Restart your cluster with any number of silos. The cluster won’t become healthy.

Environment

  • [x] Self hosted with Docker on ECS

Version: 5.8.2

After playing around with this for some time, I noticed that new silos start in status Joining and then turn Active if the last silo that was Active cannot be contacted and its IAmAliveTime is more than 10 minutes old. However, something then happens, the application crashes, and the new silo is left in the membership table as Active, which extends this endless cycle by another 10 minutes.
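
For what it’s worth, my reading is that the 10 minutes comes from the ClusterMembershipOptions defaults, IAmAliveTablePublishTimeout (5 minutes) multiplied by NumMissedTableIAmAliveLimit (2). If those were configurable as well, the window could be shortened. A rough Orleans-level sketch of that assumption:

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// Hedged sketch, not Squidex's actual wiring: my assumption is that the
// ~10-minute window is IAmAliveTablePublishTimeout (default 5 minutes)
// multiplied by NumMissedTableIAmAliveLimit (default 2). Shrinking it would
// reduce how long a vanished silo still counts as "recently alive" for the
// initial connectivity check.
static void ShortenIAmAliveWindow(ISiloBuilder siloBuilder)
{
    siloBuilder.Configure<ClusterMembershipOptions>(options =>
    {
        options.IAmAliveTablePublishTimeout = TimeSpan.FromMinutes(1); // default: 5 minutes
        options.NumMissedTableIAmAliveLimit = 2;                       // default: 2
    });
}
```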

I now suspect that this is a bug in error handling somewhere in Squidex or Orleans. I think the new silo should be able to stay Active when the last previously Active silo has not reported in for 10 minutes.

This is the last error message before the app crashes:

Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.19.66.102:11111:368129894. See InnerException

Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S10.19.65.71:11111:368129322. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)

First of all: thank you for your detailed analysis.

I am definitely open to adding this setting; a PR would be welcome.

  1. If you use Kubernetes, you can enable Kubernetes clustering and then the cleanup should be done automatically.
  2. If you have only one member, why do you not disable clustering?

I am not sure if it really crashes. Do you see it in the logs? It could also be an issue with the liveness or readiness probes.

If you use Kubernetes, you can enable Kubernetes clustering and then the cleanup should be done automatically.

Great, I didn’t know this. We will keep it in mind.
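
For anyone else landing here: as far as I can tell this maps to the Microsoft.Orleans.Hosting.Kubernetes package at the Orleans level, which watches pods and declares silos dead when their pods disappear. A minimal sketch of that assumption (not how Squidex actually wires it up):

```csharp
using Orleans.Hosting;

// My assumption, not Squidex's actual wiring: Kubernetes clustering corresponds
// to the Microsoft.Orleans.Hosting.Kubernetes package, which reads pod and
// namespace information from the environment and, as I understand it, declares
// silos dead when their pods disappear, so the membership table cleans itself up.
static void EnableKubernetesHosting(ISiloBuilder siloBuilder)
{
    siloBuilder.UseKubernetesHosting();
}
```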

If you have only one member, why do you not disable clustering?

We scale down to 1 member in some environments, but technically this can happen even if you have multiple members and they all crash at the same time for whatever reason.

I am not sure if it really crashes. Do you see it in the logs? It could also be an issue with the liveness or readiness probes.

Pretty sure it crashed, because that is the last log received from the container; ECS then immediately schedules another container to keep the desired count.

I am definitely open to adding this setting; a PR would be welcome.

Thanks, man. I will see whether it’s critical enough for us to spend more time on this, or whether we will opt to maintain at least 2 members in the cluster to minimize the chances of this happening.

It would be great if you could open an issue in the Orleans forum. I will support you with that as much as I can, but I don’t want to just repeat your findings.
