On 12/03/2013 11:11, Jan Včelák wrote:
Hello Anand,
I ran the non-debug version of the server, and when it shot up to 100%
and stopped logging, I attached to it with strace:
please, can you also try with GDB? The output might be more useful.
Okay, here's the output from gdb, as you requested. Does it help?
This looks like either a problem in the RCU library, or we are doing
something wrong with it in Knot.
Please, can you give us some information about the machine? Operating
system, libc version, versions of linked libraries. And some hint on
reproducing...
Hello Jan,
This is a CentOS 6.3 system, x86_64 architecture.
$ uname -a
Linux ams3.authdns.ripe.net 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q glibc
glibc-2.12-1.80.el6_3.7.x86_64
glibc-2.12-1.80.el6_3.7.i686
$ rpm -q userspace-rcu
userspace-rcu-0.7.3-1.el6.x86_64
Other libraries linked to this copy of Knot:
$ ldd /usr/sbin/knotd
linux-vdso.so.1 => (0x00007fff37fb0000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f851b03c000)
librt.so.1 => /lib64/librt.so.1 (0x00007f851ae34000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f851ac2f000)
liburcu.so.1 => /usr/lib64/liburcu.so.1 (0x00007f851aa2a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f851a80d000)
libm.so.6 => /lib64/libm.so.6 (0x00007f851a588000)
libc.so.6 => /lib64/libc.so.6 (0x00007f851a1f5000)
libz.so.1 => /lib64/libz.so.1 (0x00007f8519fdf000)
/lib64/ld-linux-x86-64.so.2 (0x00007f851b3e2000)
liburcu-common.so.1 => /usr/lib64/liburcu-common.so.1 (0x00007f8519ddc000)
The server has 16 GB of RAM, and NO SWAP. This is a deliberate decision.
If the server runs out of memory, we'd like to know immediately, rather
than let it begin a slow spiral of death.
I don't know what triggers the behaviour. However, what I did was to
configure it with over 5000 slave zones, which would cause Knot to try
and use more memory. The Linux OOM killer comes along and kills Knot.
Since Knot is run from upstart, it is automatically restarted, so it
tries to load all its zones. And when this happens, it sometimes locks
up. The memory usage is close to 16GB at this point, so perhaps there is
something causing this when the server memory is close to 100% usage.
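To test the suspicion about memory pressure, something like the
following could record how full RAM is at the moment of a lock-up and
capture a backtrace of every thread at the same time. This is only a
rough sketch: the function names mem_used_pct and dump_knotd_threads are
made up for illustration, and the calculation assumes a meminfo file
that has MemTotal, MemFree, Buffers and Cached fields.

```shell
# Hypothetical helper: print the percentage of RAM in use as an integer.
# Reads a meminfo-style file (default: /proc/meminfo) and treats
# buffers and page cache as reclaimable, i.e. not "in use".
mem_used_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemFree:/  { free = $2 }
         /^Buffers:/  { buf = $2 }
         /^Cached:/   { cached = $2 }
         END {
             if (total > 0)
                 printf "%d\n", (total - free - buf - cached) * 100 / total
         }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: attach GDB to one running knotd process, print a
# backtrace for every thread, then detach. Needs sufficient privileges
# to ptrace the process.
dump_knotd_threads() {
    gdb -p "$(pidof -s knotd)" -batch -ex "thread apply all bt"
}

# Example use when the server locks up:
#   echo "memory used: $(mem_used_pct)%"
#   dump_knotd_threads > knotd-backtrace.txt 2>&1
```

Running both at the same moment would show whether the lock-ups only
occur when memory usage is close to 100%.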
Please let me know if I can supply any more information.
Regards,
Anand