For some reason, one of the pods will start restarting itself. Once it gets into that situation, it never returns to a stable state. And only one pod does this.
I still can't identify what triggers the issue, but it seems to have something to do with GraphQL access.
My environment is
k8s: v1.19.8
Squidex: dev-6177 with Kubernetes enabled in AppSettings and other required settings.
MongoDB 4.4.7 (from the Bitnami chart, with 2 replica set members and 1 arbiter)
It had the same issue before the Kubernetes AppSettings change.
It usually throws errors like
The target silo became unavailable for message: Request S10.42.56.5:11111:364573933stg/DirectoryService/0000000a@S0000000a->S10.42.232.7:11111:364612165stg/DirectoryService/0000000a@S0000000a InvokeMethodRequest Orleans.Runtime.IRemoteGrainDirectory:RegisterAsync #330828. Target History is: S10.42.232.7:11111:364612165:*stg/DirectoryService/0000000a:@S0000000a
Response did not arrive on time in 00:00:30 for message: Request S10.42.56.5:11111:364573933stg/MembershipOracle/0000000f@S0000000f->S10.42.232.7:11111:364612165stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #330689. Target History is: S10.42.232.7:11111:364612165:*stg/MembershipOracle/0000000f:@S0000000f
In general, there are only a few reasons why a pod restarts:
Stack overflow exception
Out of memory exception
Health endpoint not successful.
The pod decides to restart itself.
You have to identify the root cause. But I have not seen this before. Have you configured the labels and so on correctly? I added the Kubernetes support on Monday and haven't had a chance to test it properly.
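For reference, Orleans' Kubernetes hosting support derives the cluster and service IDs from pod labels and reads the pod identity from environment variables, so those need to be present on the deployment. Below is a minimal sketch of that wiring, assuming the Microsoft.Orleans.Kubernetes.Hosting package; Squidex's actual setup behind the AppSettings flag may differ.

```csharp
// Minimal sketch only: this assumes the Microsoft.Orleans.Kubernetes.Hosting
// package and is not necessarily how Squidex wires it up internally.
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // Expects POD_NAMESPACE, POD_NAME and POD_IP as environment variables
        // (usually injected via the downward API) and the pod labels
        // orleans/serviceId and orleans/clusterId. With that in place, silo
        // entries whose pods no longer exist can be marked dead automatically.
        silo.UseKubernetesHosting();
    })
    .Build();

await host.RunAsync();
```

So it is worth double-checking that the deployment actually sets those environment variables and labels.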
I think it starts with a MongoDB connection issue that leads to a failed health check. That causes the first restart. Once it restarts, it can't find the previous pod, although I feel like the Kubernetes settings should prevent this. (Of course, this is just my theory; I don't have much experience with Orleans.) However, there are a lot of members in Orleans_OrleansMembershipSingle after multiple restarts. (Most of them have the status dead; only the running pods have the status active.)
at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveMessageAsync(Int32 responseTo, IMessageEncoderSelector encoderSelector, MessageEncoderSettings messageEncoderSettings, CancellationToken cancellationToken)
at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocolAsync[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.ExecuteAsync[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.ExecuteAsync(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.ListCollectionsUsingCommandOperation.ExecuteAsync(RetryableReadContext context, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Operations.ListCollectionsOperation.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
at MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken)
at MongoDB.Driver.MongoDatabaseImpl.ExecuteReadOperationAsync[T](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
at MongoDB.Driver.MongoDatabaseImpl.ListCollectionNamesAsync(IClientSessionHandle session, ListCollectionNamesOptions options, CancellationToken cancellationToken)
at MongoDB.Driver.MongoDatabaseImpl.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
at Squidex.Infrastructure.Diagnostics.MongoDBHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken) in /src/src/Squidex.Infrastructure.MongoDb/Diagnostics/MongoDBHealthCheck.cs:line 35
at Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService.RunCheckAsync(IServiceScope scope, HealthCheckRegistration registration, CancellationToken cancellationToken)
at Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService.CheckHealthAsync(Func`2 predicate, CancellationToken cancellationToken)
at Microsoft.AspNetCore.Diagnostics.HealthChecks.HealthCheckMiddleware.InvokeAsync(HttpContext httpContext)
at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware.Invoke(HttpContext context)
at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware.Invoke(HttpContext context)
at Squidex.Web.Pipeline.CachingKeysMiddleware.InvokeAsync(HttpContext context) in /src/src/Squidex.Web/Pipeline/CachingKeysMiddleware.cs:line 55
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
I checked more logs but haven't found anything. I think it is still related to the connection to MongoDB, but MongoDB itself is running fine; at least I don't see any errors there.
However, I do notice a couple of things.
Do you think it is correct to put the health check behind CachingKeysMiddleware? (See the sketch below.)
I found lots of duplicated log messages.
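On the middleware order: one option would be to branch off the health check endpoint before any custom middleware, so that probe requests are answered without passing through it. This is only a rough sketch of a standard ASP.NET Core pipeline; the path and the placement of CachingKeysMiddleware here are illustrative, not Squidex's actual Startup.

```csharp
// Illustrative pipeline order only; Squidex's real Startup differs.
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Probe requests to /healthz are answered here and never reach the
        // middleware registered below (requires services.AddHealthChecks()).
        app.UseHealthChecks("/healthz");

        // Custom middleware such as a caching-keys middleware then only sees
        // regular application traffic.
        // app.UseMiddleware<CachingKeysMiddleware>();

        app.UseRouting();
        app.UseEndpoints(endpoints => endpoints.MapControllers());
    }
}
```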
I think the error causing the restart is:
---> System.TimeoutException: A timeout occurred after 30000ms selecting a server using CompositeServerSelector{ Selectors = WritableServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", Type : "ReplicaSet", State : "Connected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev:27017" }", EndPoint: "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
---> System.Net.Sockets.SocketException (113): No route to host
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
at MongoDB.Driver.Core.Connections.TcpStreamFactory.ConnectAsync(Socket socket, EndPoint endPoint, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Connections.TcpStreamFactory.CreateStreamAsync(EndPoint endPoint, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.ServerMonitor.InitializeConnectionAsync(CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.ServerMonitor.HeartbeatAsync(CancellationToken cancellationToken)", LastHeartbeatTimestamp: "2021-07-22T20:05:23.3040656Z", LastUpdateTimestamp: "2021-07-22T20:05:23.3040658Z" }, { ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev.svc.cluster.local:27017" }", EndPoint: "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev.svc.cluster.local:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
---> System.Net.Sockets.SocketException (113): No route to host
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
I think when you see this error the process does not even start. So the pod starts, tries to connect for 30 seconds and then crashes. Perhaps it is just a wrong connection string, or you have to whitelist your IP addresses or something like that.
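If you want to rule out the connection string, one quick check would be a small console app run from inside the cluster that uses the exact same connection string and lists the collections, which is essentially what the MongoDBHealthCheck above does. This is only a sketch using the MongoDB .NET driver; the hostnames, replica set name and database name are placeholders based on the log and would need to match your actual configuration.

```csharp
// Quick connectivity check with the MongoDB .NET driver. The connection
// string below is a placeholder based on the hostnames in the log; replace
// it with the exact value from the Squidex configuration.
using System;
using MongoDB.Driver;

var connectionString =
    "mongodb://mongodb-dev-0.mongodb-dev-headless.dev.svc.cluster.local:27017," +
    "mongodb-dev-1.mongodb-dev-headless.dev.svc.cluster.local:27017/?replicaSet=rs0";

var client = new MongoClient(connectionString);
var database = client.GetDatabase("Squidex");

// This fails with the same 30s server selection timeout if the hosts are
// not reachable from the pod's network.
foreach (var name in database.ListCollectionNames().ToList())
{
    Console.WriteLine(name);
}
```

Since the inner exception is a SocketException 113 (No route to host) rather than an authentication error, the interesting part is mainly whether the hostnames resolve and the port is reachable from the pod that runs Squidex.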