Kubernetes Pod keeps restarting

It seems like an Orleans + Kubernetes issue.

For some reason, one of the pods will start restarting itself. Once it gets into that situation, it never returns to a stable state. And only one pod does this.

I still can't identify what triggers the issue, but it seems to have something to do with GraphQL access.

My environment is:

  • k8s: v1.19.8
  • Squidex: dev-6177 with Kubernetes enabled in AppSettings and the other required settings
  • MongoDB: 4.4.7 (from the Bitnami chart, with 2 replica set members and 1 arbiter)

It had the same issue before the Kubernetes AppSettings change.

It usually throws errors like:

The target silo became unavailable for message: Request S10.42.56.5:11111:364573933stg/DirectoryService/0000000a@S0000000a->S10.42.232.7:11111:364612165stg/DirectoryService/0000000a@S0000000a InvokeMethodRequest Orleans.Runtime.IRemoteGrainDirectory:RegisterAsync #330828. Target History is: S10.42.232.7:11111:364612165:*stg/DirectoryService/0000000a:@S0000000a

Response did not arrive on time in 00:00:30 for message: Request S10.42.56.5:11111:364573933stg/MembershipOracle/0000000f@S0000000f->S10.42.232.7:11111:364612165stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #330689. Target History is: S10.42.232.7:11111:364612165:*stg/MembershipOracle/0000000f:@S0000000f

I wonder if you have seen this before.
PS: I will keep digging; this is so weird.
PS2: It is like this one: https://github.com/dotnet/orleans/issues/6661

In general there are only a few reasons why a pod restarts:

  • A stack overflow exception
  • An out-of-memory exception
  • The health endpoint not responding successfully (see the probe sketch right after this list)
  • The pod decides to restart itself
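For the health endpoint case, a minimal liveness probe on the Squidex container could look roughly like the sketch below; the /healthz path, the port and the timings are assumptions here and have to match your actual deployment:

  livenessProbe:
    httpGet:
      path: /healthz            # assumed health endpoint path
      port: 80                  # assumed container port
    initialDelaySeconds: 30     # give the silo time to start and reach MongoDB
    periodSeconds: 10
    failureThreshold: 3         # after 3 consecutive failures Kubernetes restarts the pod

If such a probe fails repeatedly, Kubernetes kills and restarts the container, which matches the restart behaviour you are describing.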

You have to identify the root cause, but I have not seen it before. Have you configured the labels and so on correctly? I added the Kubernetes support on Monday and haven't had a chance to test it properly.

I did add the labels and env:

  # on the pod template metadata:
  labels:
    orleans/clusterId: cms-dev
    orleans/serviceId: cms-dev

  # on the container spec:
  env:
    - name: ORLEANS_SERVICE_ID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: "metadata.labels['orleans/serviceId']"
    - name: ORLEANS_CLUSTER_ID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: "metadata.labels['orleans/clusterId']"
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP

I think it starts with a MongoDB connection issue that leads to a failing health check. That causes the first restart. Once it restarts, it can't find the previous pod, although I feel like the Kubernetes settings should avoid this. (Of course, this is just my theory; I don't have much experience with Orleans.) However, there are a lot of members in Orleans_OrleansMembershipSingle after multiple restarts. (Most of their statuses are Dead; only the running pods have an Active status.)

   at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveMessageAsync(Int32 responseTo, IMessageEncoderSelector encoderSelector, MessageEncoderSettings messageEncoderSettings, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocolAsync[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadOperationExecutor.ExecuteAsync[TResult](IRetryableReadOperation`1 operation, RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ReadCommandOperation`1.ExecuteAsync(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ListCollectionsUsingCommandOperation.ExecuteAsync(RetryableReadContext context, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ListCollectionsOperation.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.ExecuteReadOperationAsync[T](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.ListCollectionNamesAsync(IClientSessionHandle session, ListCollectionNamesOptions options, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
   at Squidex.Infrastructure.Diagnostics.MongoDBHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken) in /src/src/Squidex.Infrastructure.MongoDb/Diagnostics/MongoDBHealthCheck.cs:line 35
   at Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService.RunCheckAsync(IServiceScope scope, HealthCheckRegistration registration, CancellationToken cancellationToken)
   at Microsoft.Extensions.Diagnostics.HealthChecks.DefaultHealthCheckService.CheckHealthAsync(Func`2 predicate, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Diagnostics.HealthChecks.HealthCheckMiddleware.InvokeAsync(HttpContext httpContext)
   at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Builder.Extensions.MapWhenMiddleware.Invoke(HttpContext context)
   at Squidex.Web.Pipeline.CachingKeysMiddleware.InvokeAsync(HttpContext context) in /src/src/Squidex.Web/Pipeline/CachingKeysMiddleware.cs:line 55
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

That's normal; it is there to make monitoring and debugging easier.

You should see that in Kubernetes. Check the event log, for example with kubectl describe pod or kubectl get events.

I checked more logs but I haven't seen anything. I think it is still related to the connection to MongoDB. But MongoDB is running fine; I don't see any errors at least.

However, I do notice a couple of things:

  1. Do you think it is correct to put the health check behind CachingKeysMiddleware?

  2. I found lots of duplicated log messages.

I think the error causing the restart is:

---> System.TimeoutException: A timeout occurred after 30000ms selecting a server using CompositeServerSelector{ Selectors = WritableServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", Type : "ReplicaSet", State : "Connected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev:27017" }", EndPoint: "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
---> System.Net.Sockets.SocketException (113): No route to host
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
at MongoDB.Driver.Core.Connections.TcpStreamFactory.ConnectAsync(Socket socket, EndPoint endPoint, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Connections.TcpStreamFactory.CreateStreamAsync(EndPoint endPoint, CancellationToken cancellationToken)
at MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.ServerMonitor.InitializeConnectionAsync(CancellationToken cancellationToken)
at MongoDB.Driver.Core.Servers.ServerMonitor.HeartbeatAsync(CancellationToken cancellationToken)", LastHeartbeatTimestamp: "2021-07-22T20:05:23.3040656Z", LastUpdateTimestamp: "2021-07-22T20:05:23.3040658Z" }, { ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev.svc.cluster.local:27017" }", EndPoint: "Unspecified/mongodb-dev-0.mongodb-dev-headless.dev.svc.cluster.local:27017", ReasonChanged: "Heartbeat", State: "Disconnected", ServerVersion: , TopologyVersion: , Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
---> System.Net.Sockets.SocketException (113): No route to host
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)

I think it does not matter.

I think when you see this error, the process does not even start. So Kubernetes starts the pod, it tries to connect for 30 seconds, and then it crashes. Perhaps it is just a wrong connection string, or you have to whitelist your IP addresses or something like that.

How do you host MongoDB?

No, I don't think the CachingKeysMiddleware part matters. It was just a general question.

MongoDB is on the same K8s cluster, deployed using the Bitnami chart.
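Roughly with these values, if I remember the chart options correctly (double-check the names against the chart's values.yaml; the replica set name is an assumption here):

  architecture: replicaset
  replicaCount: 2
  arbiter:
    enabled: true
  replicaSetName: mongodb-dev   # whatever this is must also be used as the replicaSet parameter in the connection string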

Here is the connection string

mongodb://username:password@mongodb-dev-0.mongodb-dev-headless.dev,mongodb-dev-1.mongodb-dev-headless.dev:27017

Maybe I should add the port to both hosts and add /?replicaSet=mongodb-dev at the end.
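Something like this, assuming the replica set is actually named mongodb-dev:

mongodb://username:password@mongodb-dev-0.mongodb-dev-headless.dev:27017,mongodb-dev-1.mongodb-dev-headless.dev:27017/?replicaSet=mongodb-dev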

Update:

Changing the connection string doesn't seem to help.
It is odd that it is always the same pod that keeps restarting.

And it seems like it is always caused by:

"MongoDb connection failed to connect to database CMS."

I think my MongoDB is somehow causing this issue. I used a different setup and everything seems fine.


However, I performed some load testing (200 VUs for 300s) and it throws lots of errors.

I wonder if you have any recommendations on resource requirements?

Like how much CPU and memory?

I have 2 CPUs and 1 GB memory, with 5 instances.

1 CPU is usually a virtualized CPU, so 2 vCPUs are roughly 1 normal CPU. I would use at least 4 vCPUs + 4 GB per instance.
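In the deployment that would be a resources block roughly like this per container (just a sketch; whether you also set limits, and to which values, depends on your cluster policy):

  resources:
    requests:
      cpu: "4"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 4Gi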
