Squidex runs out of memory and crashes our server


#1

Hi,

we noticed that our server crashed once; after inspecting the logs, we found it was due to memory constraints. We sync data into Squidex via partial syncs every 15 minutes and a nightly full sync, so we can send 30K+ requests, most of them during the full sync. We have added alerts and see that CPU usage spikes to 90%+. Is there any way we can reduce the load on Squidex? The only thing I am considering right now is to make our client cache more data and hit the Squidex API less, but I wanted to ask in the forum first whether anyone else has hit similar resource issues when self-hosting. Thanks.
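For the caching idea, a minimal sketch of what I have in mind: a TTL cache in front of our content reads, so repeated lookups within a sync window skip the HTTP round trip entirely. This is hypothetical client code, not the real Squidex client; the loader callback stands in for whatever actually calls the API.

```typescript
// Sketch only: TTL cache keyed by content id, wrapping an async loader
// (the loader is a placeholder for the real Squidex API call).
type Entry<T> = { value: T; expires: number };

class TtlCache<T> {
  private store = new Map<string, Entry<T>>();

  constructor(private ttlMs: number) {}

  async get(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.store.get(key);
    if (hit && hit.expires > Date.now()) {
      return hit.value; // fresh cache hit, no API call
    }
    const value = await load();
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
    return value;
  }
}
```

With partial syncs every 15 minutes, a TTL slightly below the sync interval would keep reads fresh while cutting repeated requests.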

Since the code is from a proprietary project, I can’t share a reproducer, but I have included the kernel call trace below.

Aug 19 04:05:59 appName kernel: containerd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Aug 19 04:05:59 appName kernel: containerd cpuset=/ mems_allowed=0
Aug 19 04:05:59 appName kernel: CPU: 2 PID: 3544 Comm: containerd Not tainted 4.15.0-55-generic #60-Ubuntu
Aug 19 04:05:59 appName kernel: Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
Aug 19 04:06:00 appName kernel: Call Trace:
Aug 19 04:06:00 appName kernel:  dump_stack+0x63/0x8b
Aug 19 04:06:00 appName kernel:  dump_header+0x71/0x285
Aug 19 04:06:00 appName kernel:  oom_kill_process+0x220/0x440
Aug 19 04:06:00 appName kernel:  out_of_memory+0x2d1/0x4f0
Aug 19 04:06:00 appName kernel:  __alloc_pages_slowpath+0xa61/0xe20
Aug 19 04:06:00 appName kernel:  ? alloc_pages_current+0x6a/0xe0
Aug 19 04:06:00 appName kernel:  __alloc_pages_nodemask+0x263/0x280
Aug 19 04:06:00 appName kernel:  alloc_pages_current+0x6a/0xe0
Aug 19 04:06:00 appName kernel:  __page_cache_alloc+0x81/0xa0
Aug 19 04:06:00 appName kernel:  filemap_fault+0x378/0x6f0
Aug 19 04:06:00 appName kernel:  ? page_add_file_rmap+0x134/0x180
Aug 19 04:06:00 appName kernel:  ? filemap_map_pages+0x181/0x390
Aug 19 04:06:00 appName kernel:  ext4_filemap_fault+0x31/0x44
Aug 19 04:06:00 appName kernel:  __do_fault+0x24/0xf0
Aug 19 04:06:00 appName kernel:  handle_pte_fault+0x89a/0xce0
Aug 19 04:06:00 appName kernel:  __handle_mm_fault+0x478/0x5c0
Aug 19 04:06:00 appName kernel:  handle_mm_fault+0xb1/0x1f0
Aug 19 04:06:00 appName kernel:  __do_page_fault+0x250/0x4d0
Aug 19 04:06:00 appName kernel:  do_page_fault+0x2e/0xe0
Aug 19 04:06:00 appName kernel:  ? async_page_fault+0x2f/0x50
Aug 19 04:06:00 appName kernel:  do_async_page_fault+0x51/0x80
Aug 19 04:06:00 appName kernel:  async_page_fault+0x45/0x50
Aug 19 04:06:00 appName kernel: RIP: 0033:0x560f7860ab47
Aug 19 04:06:00 appName kernel: RSP: 002b:000000c4206dfc10 EFLAGS: 00010202
Aug 19 04:06:00 appName kernel: RAX: 000000c4202a21b0 RBX: 0000000000000000 RCX: 0000000000000010
Aug 19 04:06:00 appName kernel: RDX: 0000000000012e6d RSI: 0000000000000010 RDI: 0000560f799130c0
Aug 19 04:06:00 appName kernel: RBP: 000000c4206dfca0 R08: 0000000000000001 R09: 0000000000000000
Aug 19 04:06:00 appName kernel: R10: 000000c4202a21b0 R11: 0000000000000001 R12: ffffffffffffffff
Aug 19 04:06:00 appName kernel: R13: 000000000000001c R14: 000000000000001b R15: 0000000000000200
Aug 19 04:06:00 appName kernel: Mem-Info:
Aug 19 04:06:00 appName kernel: active_anon:4006665 inactive_anon:153 isolated_anon:0
                                          active_file:246 inactive_file:164 isolated_file:16
                                          unevictable:1 dirty:0 writeback:0 unstable:0
                                          slab_reclaimable:15195 slab_unreclaimable:21353
                                          mapped:315 shmem:299 pagetables:11516 bounce:0
                                          free:33340 free_pcp:1419 free_cma:0
Aug 19 04:06:00 appName kernel: Node 0 active_anon:16026660kB inactive_anon:612kB active_file:984kB inactive_file:656kB unevictable:4kB isolated(anon):0kB isolated(file):64kB mapped:1260kB dirty:0kB writeback:0kB shmem:1196kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Aug 19 04:06:00 appName kernel: Node 0 DMA free:15908kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Aug 19 04:06:00 appName kernel: lowmem_reserve[]: 0 3457 15999 15999 15999
Aug 19 04:06:00 appName kernel: Node 0 DMA32 free:64744kB min:14588kB low:18232kB high:21876kB active_anon:3479372kB inactive_anon:0kB active_file:0kB inactive_file:160kB unevictable:0kB writepending:0kB present:3653612kB managed:3566416kB mlocked:0kB kernel_stack:156kB pagetables:8944kB bounce:0kB free_pcp:2184kB local_pcp:260kB free_cma:0kB
Aug 19 04:06:00 appName kernel: lowmem_reserve[]: 0 0 12541 12541 12541
Aug 19 04:06:00 appName kernel: Node 0 Normal free:52708kB min:52924kB low:66152kB high:79380kB active_anon:12547288kB inactive_anon:612kB active_file:500kB inactive_file:1660kB unevictable:4kB writepending:0kB present:13107200kB managed:12842808kB mlocked:4kB kernel_stack:7492kB pagetables:37120kB bounce:0kB free_pcp:3492kB local_pcp:528kB free_cma:0kB
Aug 19 04:06:00 appName kernel: lowmem_reserve[]: 0 0 0 0 0
Aug 19 04:06:00 appName kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Aug 19 04:06:00 appName kernel: Node 0 DMA32: 538*4kB (E) 363*8kB (UME) 205*16kB (UME) 217*32kB (UME) 163*64kB (UE) 75*128kB (UME) 20*256kB (UE) 17*512kB (UME) 7*1024kB (UME) 2*2048kB (UE) 1*4096kB (M) = 64496kB
Aug 19 04:06:00 appName kernel: Node 0 Normal: 588*4kB (MEH) 846*8kB (UMEH) 638*16kB (UMEH) 271*32kB (UMEH) 159*64kB (UMEH) 70*128kB (UMEH) 23*256kB (UMEH) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 53536kB
Aug 19 04:06:00 appName kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug 19 04:06:00 appName kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 19 04:06:00 appName kernel: 665 total pagecache pages
Aug 19 04:06:00 appName kernel: 0 pages in swap cache
Aug 19 04:06:00 appName kernel: Swap cache stats: add 0, delete 0, find 0/0
Aug 19 04:06:00 appName kernel: Free swap  = 0kB
Aug 19 04:06:00 appName kernel: Total swap = 0kB
Aug 19 04:06:00 appName kernel: 4194201 pages RAM
Aug 19 04:06:00 appName kernel: 0 pages HighMem/MovableOnly
Aug 19 04:06:00 appName kernel: 87918 pages reserved
Aug 19 04:06:00 appName kernel: 0 pages cma reserved
Aug 19 04:06:00 appName kernel: 0 pages hwpoisoned
Aug 19 04:06:00 appName kernel: [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Aug 19 04:06:00 appName kernel: [  442]     0   442    32702      960   270336        0             0 systemd-journal
Aug 19 04:06:00 appName kernel: [  453]     0   453    26476       58    94208        0             0 lvmetad
Aug 19 04:06:00 appName kernel: [  690] 62583   690    35481      151   176128        0             0 systemd-timesyn
Aug 19 04:06:00 appName kernel: [  808]   100   808    17992      193   184320        0             0 systemd-network
Aug 19 04:06:00 appName kernel: [  821]   101   821    17689      205   180224        0             0 systemd-resolve
Aug 19 04:06:00 appName kernel: [  911]     0   911    10873      343   114688        0         -1000 systemd-udevd
Aug 19 04:06:00 appName kernel: [ 1044]   103  1044    12540      189   139264        0          -900 dbus-daemon
Aug 19 04:06:00 appName kernel: [ 1063]     0  1063    42738     1962   221184        0             0 networkd-dispat
Aug 19 04:06:00 appName kernel: [ 1065]     0  1065    72048      276   192512        0             0 accounts-daemon
Aug 19 04:06:00 appName kernel: [ 1070]     0  1070    27621       86   114688        0             0 irqbalance
Aug 19 04:06:00 appName kernel: [ 1073]     0  1073    17775      323   172032        0             0 systemd-logind
Aug 19 04:06:00 appName kernel: [ 1077]   102  1077    65759      474   159744        0             0 rsyslogd
Aug 19 04:06:00 appName kernel: [ 1091]   111  1091  1705821  1364428 12087296        0             0 mongod
Aug 19 04:06:00 appName kernel: [ 1092]     0  1092    95670      158   118784        0             0 lxcfs
Aug 19 04:06:00 appName kernel: [ 1093]     0  1093     7083       52   106496        0             0 atd
Aug 19 04:06:00 appName kernel: [ 1094]     0  1094     7937       76   102400        0             0 cron
Aug 19 04:06:00 appName kernel: [ 1123]     0  1123    18074      190   196608        0         -1000 sshd
Aug 19 04:06:00 appName kernel: [ 1124]     0  1124    75545     3030   229376        0             0 fail2ban-server
Aug 19 04:06:00 appName kernel: [ 1125]     0  1125   461503     4250   421888        0             0 containerd
Aug 19 04:06:00 appName kernel: [ 1133]     0  1133    46913     1977   262144        0             0 unattended-upgr
Aug 19 04:06:00 appName kernel: [ 1136]     0  1136    72220      230   200704        0             0 polkitd
Aug 19 04:06:00 appName kernel: [ 1138]     0  1138   463978    11255   573440        0          -500 dockerd
Aug 19 04:06:00 appName kernel: [ 1144]     0  1144     4103       37    73728        0             0 agetty
Aug 19 04:06:00 appName kernel: [ 1166]     0  1166     3722       32    77824        0             0 agetty
Aug 19 04:06:00 appName kernel: [ 2763]  1000  2763    19191      321   204800        0             0 systemd
Aug 19 04:06:00 appName kernel: [ 2767]  1000  2767    27956      612   253952        0             0 (sd-pam)
Aug 19 04:06:00 appName kernel: [ 3118]  1000  3118     9631     2664   114688        0             0 tmux: server
Aug 19 04:06:00 appName kernel: [ 3119]  1000  3119     6170     1234    90112        0             0 bash
Aug 19 04:06:00 appName kernel: [ 3310]     0  3310    15556      124   167936        0             0 sudo
Aug 19 04:06:00 appName kernel: [ 3311]     0  3311   145013      162  1036288        0             0 journalctl
Aug 19 04:06:00 appName kernel: [ 4022]  1000  4022     6153     1233    86016        0             0 bash
Aug 19 04:06:00 appName kernel: [ 4279]     0  4279    15556      125   172032        0             0 sudo
Aug 19 04:06:00 appName kernel: [ 4280]     0  4280   135717      162   847872        0             0 journalctl
Aug 19 04:06:00 appName kernel: [ 1436]   999  1436   290246     2140   249856        0          -900 do-agent
Aug 19 04:06:00 appName kernel: [ 9425]  1000  9425     6170     1234    94208        0             0 bash
Aug 19 04:06:00 appName kernel: [13161]  1000 13161     6153     1232    94208        0             0 bash
Aug 19 04:06:00 appName kernel: [17726]  1000 17726   230894     2404   389120        0             0 nodejs
Aug 19 04:06:00 appName kernel: [ 7892]  1000  7892     6153     1233    94208        0             0 bash
Aug 19 04:06:00 appName kernel: [10067]     0 10067     2456      289    81920        0          -500 docker-proxy
Aug 19 04:06:00 appName kernel: [10082]     0 10082     2128      288    86016        0          -500 docker-proxy
Aug 19 04:06:00 appName kernel: [10090]     0 10090     2947      285    77824        0          -999 containerd-shim
Aug 19 04:06:00 appName kernel: [10119]     0 10119     4153      123    86016        0             0 forego
Aug 19 04:06:00 appName kernel: [10220]     0 10220     4489       64    77824        0             0 bash
Aug 19 04:06:00 appName kernel: [10224]     0 10224     4488       65    81920        0             0 bash
Aug 19 04:06:00 appName kernel: [10227]     0 10227     3830      443    86016        0             0 docker-gen
Aug 19 04:06:00 appName kernel: [10228]     0 10228    22223      377   126976        0             0 nginx
Aug 19 04:06:00 appName kernel: [10277]     0 10277     2683      244    73728        0          -999 containerd-shim
Aug 19 04:06:00 appName kernel: [10305]     0 10305      543       49    40960        0             0 bash
Aug 19 04:06:00 appName kernel: [10539]  1000 10539   123116    42253   659456        0             0 docker-compose
Aug 19 04:06:00 appName kernel: [10593]     0 10593      669      184    45056        0             0 letsencrypt_ser
Aug 19 04:06:00 appName kernel: [10594]     0 10594    28577      692    94208        0             0 docker-gen
Aug 19 04:06:00 appName kernel: [16409]  1000 16409   557845   349788 12574720        0             0 node
Aug 19 04:06:00 appName kernel: [24279]   101 24279    22390      561   118784        0             0 nginx
Aug 19 04:06:00 appName kernel: [24280]   101 24280    22224      356   114688        0             0 nginx
Aug 19 04:06:00 appName kernel: [24281]   101 24281    22224      356   114688        0             0 nginx
Aug 19 04:06:00 appName kernel: [24282]   101 24282    22224      416   118784        0             0 nginx
Aug 19 04:06:00 appName kernel: [24283]   101 24283    22224      356   114688        0             0 nginx
Aug 19 04:06:00 appName kernel: [24284]   101 24284    22224      356   114688        0             0 nginx
Aug 19 04:06:00 appName kernel: [10530]     0 10530     2128      287    81920        0          -500 docker-proxy
Aug 19 04:06:00 appName kernel: [10538]     0 10538     2683      290    77824        0          -999 containerd-shim
Aug 19 04:06:00 appName kernel: [10564]     0 10564  4101262  2199348 18518016        0             0 Squidex
Aug 19 04:06:00 appName kernel: [15202]  1000 15202     6170     1233    90112        0             0 bash
Aug 19 04:06:00 appName kernel: [15609]  1000 15609     2139       26    57344        0             0 less
Aug 19 04:06:00 appName kernel: [24194]     0 24194      480       10    36864        0             0 sleep
Aug 19 04:06:00 appName kernel: [24491]     0 24491    18074      181   184320        0             0 sshd
Aug 19 04:06:00 appName kernel: [24492]   109 24492    18074      189   180224        0             0 sshd
Aug 19 04:06:00 appName kernel: Out of memory: Kill process 10564 (Squidex) score 536 or sacrifice child
Aug 19 04:06:00 appName kernel: Killed process 10564 (Squidex) total-vm:16405048kB, anon-rss:8797392kB, file-rss:0kB, shmem-rss:0kB

#2

Updated the snippet above to use the correct log date and to make the output more complete:

Aug 19 04:06:00 appName kernel: Out of memory: Kill process 10564 (Squidex) score 536 or sacrifice child
Aug 19 04:06:00 appName kernel: Killed process 10564 (Squidex) total-vm:16405048kB, anon-rss:8797392kB, file-rss:0kB, shmem-rss:0kB

#3

Currently running 3.1.0.0; I will see if it also happens with a more recent release.


#4

What are your server specs and what kind of requests are you doing?

Would you be able to share a reproduction sample?


#5

I have upgraded to 3.2.2.0 and will try to make an isolated reproducer later. We are running a DigitalOcean droplet (16 GB memory / 160 GB disk).

Docker stats:

$ docker stats --no-stream
CONTAINER ID        NAME                              CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
a8021ff2446a        squidexserver_squidex_encrypt_1   0.31%               9.68MiB / 15.66GiB    0.06%               1.58kB / 0B         242kB / 4.1kB       16
14e7a227d1d0        squidexserver_squidex_proxy_1     0.21%               15.96MiB / 15.66GiB   0.10%               147MB / 151MB       23.8MB / 4.1kB      30
44e3c0bc60fe        squidexserver_squidex_squidex_1   2.22%               641.8MiB / 15.66GiB   4.00%               92.9MB / 147MB      3.13MB / 340MB      55
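One mitigation we are considering, since the OOM killer took down the whole droplet: cap the Squidex container's memory so that, if it overruns, Docker kills and restarts only that container instead of destabilizing the host. A sketch of the compose change; the service name and limit are assumptions based on the stats output above, not our real config.

```yaml
# Sketch only: bound the Squidex container so an overrun is contained.
services:
  squidex:
    mem_limit: 8g
    restart: unless-stopped
```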

#6

This is pretty heavy :slight_smile:

It needs to be fixed, but I think it is something I have to do.


#7

I can send you more raw server logs via DM until I can reproduce it reliably, thanks.


#8

I am not so sure if it helps me, but why not …


#9

I started with a few load tests here: https://github.com/Squidex/squidex/blob/master/tools/LoadTest/QueryBenchmarks.cs

So far I cannot reproduce a problem with queries; it must be something else.


#10

I am not so used to C#, so I don’t understand much of the code, but how many requests do you end up generating with the load tests?

I think you should ideally test with 20K+ requests to get a similar amount of load.


#11

It is 100 users in parallel with 100 requests each.
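In rough terms, the load shape is N parallel workers each issuing M sequential requests. A sketch of that pattern (the actual benchmark is C#; `sendRequest` here is a placeholder, not real benchmark code):

```typescript
// Sketch of the load pattern: `users` workers run in parallel, each
// sending `requestsPerUser` requests sequentially. Returns the total
// number of requests completed (users * requestsPerUser).
async function runLoad(
  users: number,
  requestsPerUser: number,
  sendRequest: () => Promise<void>
): Promise<number> {
  let completed = 0;
  const worker = async () => {
    for (let i = 0; i < requestsPerUser; i++) {
      await sendRequest();
      completed++;
    }
  };
  await Promise.all(Array.from({ length: users }, worker));
  return completed;
}
```

With 100 users and 100 requests each, that is 10,000 requests per run.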


#12

That only gives you 10K requests if I understand you correctly. Can you try increasing it to at least 20K?


#13

Sure, but I have run it several times, and it very likely has a higher degree of parallelism than your setup.


#14

The current version contains a few improvements, but I am not sure yet if this solves your issue.


#15

I am not able to reproduce the crash, but we do still see memory spikes of up to 98%. I will leave this open and update if it happens again. In parallel, I will also try to export the payloads so it is possible to run with our dataset or something similar.

Thanks for all the help @Sebastian!


#16

If you send me your code as discussed, or a reproduction, I can probably improve it even more :slight_smile:


#17

I am working on an improvement that limits the number of grains/actors to 5000. This should improve it a lot.
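For context, capping live actors/grains typically works like an LRU cache: once the limit is reached, the least recently used grain is deactivated before a new one is activated, which bounds the working set in memory. A rough sketch of the idea; this is not the Orleans/Squidex implementation, and `ActorRegistry` and `activate` are made-up names.

```typescript
// Sketch only: LRU-style cap on active actors. A Map preserves insertion
// order, so the first key is always the least recently used entry.
class ActorRegistry<T> {
  private active = new Map<string, T>();

  constructor(private limit: number, private activate: (id: string) => T) {}

  get(id: string): T {
    const existing = this.active.get(id);
    if (existing !== undefined) {
      // Re-insert to mark this actor as most recently used.
      this.active.delete(id);
      this.active.set(id, existing);
      return existing;
    }
    if (this.active.size >= this.limit) {
      // Evict the least recently used actor before activating a new one.
      const lru = this.active.keys().next().value as string;
      this.active.delete(lru);
    }
    const actor = this.activate(id);
    this.active.set(id, actor);
    return actor;
  }

  get size(): number {
    return this.active.size;
  }
}
```

With a limit of 5000, a 30K-request sync would keep recycling the same bounded pool instead of holding every touched grain in memory at once.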


#18

The latest build contains this improvement now.