I’m submitting a…
- [x] Documentation issue or request
When running in cluster mode with a single silo, if that silo is shut down ungracefully, the next silo that tries to join cannot do so because of
ClusterMembershipOptions.ValidateInitialConnectivity. The result is a cluster that cannot recover. The joining silo fails with the following error:
MembershipAgent failed to start due to errors at stage BecomeActive (19999): Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: . Failed to get response from: [S10.19.65.71:11111:368129322]
   at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
   at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
   at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location ---
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
One workaround is to manually remove all “Active” silo entries from the
Orleans_OrleansMembershipSingle collection, since those silos are no longer actually active.
I should be able to configure Orleans options such as
ClusterMembershipOptions.ValidateInitialConnectivity so that my cluster can become healthy again in the situation described above.
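For illustration, here is a minimal sketch of how I would expect to set this option, assuming it can be configured through the usual `Configure<TOptions>` call on the silo builder. The class and method names (`SiloConnectivityConfig`, `SkipInitialConnectivityCheck`) are hypothetical; only `ClusterMembershipOptions` and `ValidateInitialConnectivity` are taken from the error above. Whether skipping the check is safe outside of a single-silo or dev setup is an open question, not something this sketch answers.

```csharp
using Orleans.Configuration;
using Orleans.Hosting;

public static class SiloConnectivityConfig
{
    // Hypothetical helper: disables the connectivity check that a newly
    // joining silo runs against silos still listed as active in the
    // membership table.
    public static ISiloBuilder SkipInitialConnectivityCheck(this ISiloBuilder builder)
    {
        return builder.Configure<ClusterMembershipOptions>(options =>
        {
            // Assumption: it is acceptable for this deployment to join the
            // cluster without first pinging the silos listed as active.
            options.ValidateInitialConnectivity = false;
        });
    }
}
```

It would be called while building the silo, for example `siloBuilder.SkipInitialConnectivityCheck()`, alongside the existing clustering configuration.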
I realize that this can also be solved by running multiple silos in the cluster, so that one ungraceful shutdown does not bring the whole cluster down and the other running silos can mark lost silos as “Dead”. However, what happens if all silos are shut down ungracefully? Then the only remedy is manual cleanup of the
Orleans_OrleansMembershipSingle collection (unless I’m missing some other way).
Minimal reproduction of the problem
Run in cluster mode with 1 silo. Shut it down (gracefully or not). Modify the
Orleans_OrleansMembershipSingle collection to change a “Dead” silo to “Active” to simulate an ungraceful shutdown (also change Status to 6). Restart your cluster with any number of silos. The cluster won’t become healthy.
- [x] Self hosted with docker on ECS