Warnings about silos, Orleans, pings, connection timeouts, sockets, etc.

Hi,

Today, when I woke up and went to continue working on a blog post I created yesterday, I was greeted by a server error in my front-end app (a SquidexClient exception, because Squidex was down), and I noticed Squidex was behaving strangely.

I spent an hour reading logs and trying to understand what was going on, but eventually decided to restart my two Squidex pods, and Squidex started working again.

The Deployment and Pods themselves did not show any errors, but the Squidex logs show the following warnings. It has been like this since I first deployed Squidex roughly 19 days ago, and I’m starting to believe it has somehow exhausted some limit or something similar:

(Also, I’m seeing error messages whenever I save a blog post: I have to click Save 2-3 times before it saves successfully, with a message saying something like another client is already performing the operation. I think this could be related to the issue I experienced today.)

{
  "logLevel": "Warning",
  "message": "-Did not get ping response for ping #1096 from S10.244.0.169:11111:276688976. Reason = Original Exc Type: Orleans.Runtime.OrleansMessageRejectionException Message:Silo S10.244.0.173:11111:277454327 is rejecting message: Request S10.244.0.173:11111:277454327*stg/15/0000000f@S0000000f->S10.244.0.169:11111:276688976*stg/15/0000000f@S0000000f #10754: . Reason = Exception getting a sending socket to endpoint S10.244.0.169:11111:276688976",
  "eventId": {
    "id": 100613
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:53.8207321Z",
  "category": "Orleans.Runtime.MembershipService.MembershipOracleData"
}

{
  "logLevel": "Warning",
  "message": "Noticed that silo S10.244.0.169:11111:276688976 has not updated it's IAmAliveTime table column recently. Last update was at 10/08/2018 10:23:08, now is 10/17/2018 07:39:56, no update for 8.21:16:47.9160000, which is more than 00:10:00.",
  "eventId": {
    "id": 100625
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:56.2473998Z",
  "category": "Orleans.Runtime.MembershipService.MembershipOracleData"
}

{
  "logLevel": "Warning",
  "message": "Noticed that silo S10.244.0.171:11111:276688976 has not updated it's IAmAliveTime table column recently. Last update was at 10/08/2018 10:23:12, now is 10/17/2018 07:39:56, no update for 8.21:16:44.1400000, which is more than 00:10:00.",
  "eventId": {
    "id": 100625
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:56.2474549Z",
  "category": "Orleans.Runtime.MembershipService.MembershipOracleData"
}

{
  "logLevel": "Warning",
  "message": "Noticed that silo S10.244.0.170:11111:276688976 has not updated it's IAmAliveTime table column recently. Last update was at 10/08/2018 10:23:07, now is 10/17/2018 07:39:56, no update for 8.21:16:48.9160000, which is more than 00:10:00.",
  "eventId": {
    "id": 100625
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:56.2474747Z",
  "category": "Orleans.Runtime.MembershipService.MembershipOracleData"
}

{
  "logLevel": "Warning",
  "message": "Exception getting a sending socket to endpoint S10.244.0.171:11111:276688976",
  "eventId": {
    "id": 101021
  },
  "exception": {
    "type": "Orleans.Runtime.OrleansException",
    "message": "Could not connect to 10.244.0.171:11111: HostUnreachable",
    "stackTrace": "   at Orleans.Runtime.SocketManager.Connect(Socket s, IPEndPoint endPoint, TimeSpan connectionTimeout)\n   at Orleans.Runtime.SocketManager.SendingSocketCreator(IPEndPoint target)\n   at Orleans.Runtime.LRU`2.Get(TKey key)\n   at Orleans.Runtime.Messaging.SiloMessageSender.GetSendingSocket(Message msg, Socket& socket, SiloAddress& targetSilo, String& error)"
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:56.9221413Z",
  "category": "Runtime.Messaging.SiloMessageSender/PingSender"
}

{
  "logLevel": "Warning",
  "message": "Exception getting a sending socket to endpoint S10.244.0.171:11111:276688976",
  "eventId": {
    "id": 101021
  },
  "exception": {
    "type": "Orleans.Runtime.OrleansException",
    "message": "Could not connect to 10.244.0.171:11111: HostUnreachable",
    "stackTrace": "   at Orleans.Runtime.SocketManager.Connect(Socket s, IPEndPoint endPoint, TimeSpan connectionTimeout)\n   at Orleans.Runtime.SocketManager.SendingSocketCreator(IPEndPoint target)\n   at Orleans.Runtime.LRU`2.Get(TKey key)\n   at Orleans.Runtime.Messaging.SiloMessageSender.GetSendingSocket(Message msg, Socket& socket, SiloAddress& targetSilo, String& error)"
  },
  "app": {
    "name": "Squidex",
    "version": "1.0.0.0",
    "sessionId": "710f4057-6d7e-4091-a025-bc2fd0e5f32e"
  },
  "timestamp": "2018-10-17T07:39:56.9335387Z",
  "category": "Runtime.Messaging.SiloMessageSender/SystemSender"
}

Any ideas why this might have happened and how I can fix it? I can provide more logs or do more testing if necessary.

I would migrate to the newest version. It could be a bug that I have already fixed: [SOLVED] New schema -> 404

Thanks, I killed the pods so the latest Squidex image has been pulled. I will keep an eye on the logs.
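
For anyone following along, this is roughly what I ran to force the new image to be pulled (just a sketch; the app=squidex label is specific to my manifests, and it only works because the Deployment uses a floating tag like latest with imagePullPolicy: Always):

# List the current Squidex pods (assumes they are labelled app=squidex)
kubectl get pods -l app=squidex

# Delete them; the Deployment recreates the pods and pulls the image again
kubectl delete pods -l app=squidex

# Watch the replacement pods come up
kubectl get pods -l app=squidex -w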

You can also have a look at the dashboard under http://my-squidex/orleans to check whether your cluster looks okay.

I tried the Orleans dashboard, but I cannot access it; it gives me a:

502 Bad Gateway
nginx/x.x.x

The URL that I tried, btw: https://subdomain.mycustomdomain.com/orleans

Any ideas?

No idea, can you bypass Nginx?

Not sure how I would be able to test that… I’m not the greatest at Nginx…

Depending on your setup, you can try to use the internal port. Nginx forwards all requests to port 5000 or so.
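
If you just want to peek at the dashboard, a kubectl port-forward straight to a pod is the simplest way around Nginx (a rough sketch; the app=squidex label and the internal port 5000 are assumptions about your setup):

# Find a Squidex pod (assuming the pods are labelled app=squidex)
kubectl get pods -l app=squidex

# Forward the pod's internal port 5000 to localhost
kubectl port-forward <squidex-pod-name> 5000:5000

# Then open http://localhost:5000/orleans in your browser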

Okay, don’t really have time for that right now, but will try it in the future.

I am also working on a health endpoint that can be used by monitoring tools or Kubernetes.

The health endpoint has been implemented: https://cloud-staging.squidex.io/healthz
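
You can check it quickly from the command line, for example (just a sketch; a healthy instance should answer with HTTP 200):

# Show the status line and headers of the health check response
curl -i https://cloud-staging.squidex.io/healthz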

The memory limit is configurable.

In Kubernetes I added the health endpoint as readiness and liveness probes:

      readinessProbe:        # take the pod out of the Service while the check fails
        httpGet:
          path: /healthz
          port: 80
      livenessProbe:         # restart the container if the check keeps failing
        httpGet:
          path: /healthz
          port: 80