Hello Tomas,

On Friday I upgraded to the latest version, Knot Resolver 4.3.0, as was suggested.
A few unfortunate restarts were recorded in the log, even though DNSSEC validation was disabled and the bogus_log module was unloaded (disabled since 14.12.2019 20:30).

Packages installed on my server:
Knot Resolver, version 4.3.0
rpm -qa | grep knot
knot-libs-2.9.1-1.el7.x86_64
knot-resolver-4.3.0-1.el7.x86_64
knot-resolver-module-http-4.2.2-2.el7.x86_64
CentOS Linux release 7.7.1908 (Core)

Between 19:00 and 19:08, the VM backup runs.
Each service restart creates a new record in /var/cache/knot-resolver/tty and the old one still persists. (As you noted: this is an unfortunate state of things in CentOS 7 right now, with a solution coming in the upcoming 5.0 release, where each instance will have exactly one deterministic control socket.)

An excerpt from the log:
Dec 13 19:03:00 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 13 19:03:01 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 13 19:03:01 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 13 19:03:01 dnsserver systemd[1]: kresd@1.service failed.
Dec 13 19:19:25 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 13 19:19:35 dnsserver systemd[1]: kresd@1.service stop-sigabrt timed out. Terminating.
Dec 13 19:19:45 dnsserver systemd[1]: kresd@1.service stop-sigterm timed out. Killing.
Dec 13 19:19:47 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=9/KILL
Dec 13 19:19:47 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 13 19:19:47 dnsserver systemd[1]: kresd@1.service failed.
Dec 14 19:01:23 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:01:24 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 14 19:01:24 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 14 19:01:24 dnsserver systemd[1]: kresd@1.service failed.
Dec 14 19:02:19 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:02:23 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 14 19:02:23 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 14 19:02:23 dnsserver systemd[1]: kresd@1.service failed.
Dec 15 19:03:58 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 15 19:04:08 dnsserver systemd[1]: kresd@1.service stop-sigabrt timed out. Terminating.
Dec 15 19:04:19 dnsserver systemd[1]: kresd@1.service stop-sigterm timed out. Killing.
Dec 15 19:04:25 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=9/KILL
Dec 15 19:04:25 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 15 19:04:25 dnsserver systemd[1]: kresd@1.service failed.
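The watchdog timeouts above cluster just after 19:00 each day, inside the backup window. A self-contained sketch counting them per day (the log lines are inlined here for the demo; on the live box you would pipe `journalctl -u kresd@1` into the same filter instead):

```shell
# Count "watchdog timeout" events per day: extract month+day, sort, count.
counts=$(grep 'watchdog timeout' <<'EOF' | awk '{print $1, $2}' | sort | uniq -c
Dec 13 19:03:00 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 13 19:19:25 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:01:23 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:02:19 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 15 19:03:58 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
EOF
)
echo "$counts"   # counts per day: Dec 13: 2, Dec 14: 2, Dec 15: 1
```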
--

Smil Milan Jeskyňka Kazatel


Hi,

first, please try to use more descriptive e-mail subjects. It helps
others find solutions to the same or similar issues in the future.

On 12/12/2019 14.29, Milan Jeskynka Kazatel wrote:
> I'm still facing crashes of the kresd@1 service without any obvious reason.
> Today I did a second try to upgrade Knot Resolver to version 4.2.2 and the
> upgrade seems to be OK; the service starts without any difficulties.

The latest released version is 4.3.0. Before any further debugging,
please ensure you're using the latest version. EPEL repositories lag
behind the upstream releases, but there's usually an update waiting
shortly after our upstream release. You can install it using:

yum update knot-resolver --enablerepo epel-testing

Alternately, you can use our upstream package repositories to get the
updates right as they're released:
https://www.knot-resolver.cz/download/
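As a side note, `sort -V` gives a version-aware comparison, which is handy for checking whether an installed version string predates a release. A minimal sketch with both versions hard-coded for the demo (on a real system you would take the installed one from `rpm -q knot-resolver`):

```shell
# sort -V orders version strings numerically; if the installed version
# sorts first and differs from the latest, an upgrade is available.
installed=4.2.2
latest=4.3.0
oldest=$(printf '%s\n' "$installed" "$latest" | sort -V | head -n1)
if [ "$installed" != "$latest" ] && [ "$oldest" = "$installed" ]; then
    echo "upgrade available: $installed -> $latest"
fi
```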

> It runs as expected for more than 3.5 hours, but unfortunately it then
> starts to write the same messages to the log as reported in my previous
> post, and the service gets restarted by itself.

The auto-restart is a systemd feature we're using to recover from
crashes/failures. It's preferable to a dead service.

However, it'd be interesting to find out the cause of these crashes.
Could you explore the errors in journal and post the output?

journalctl -u kresd@1 -p notice --since -2w

> Every restart causes a new service PID in /var/cache/knot-resolver/tty;
> the old one was not correctly finished

This is an unfortunate state of things in CentOS 7 right now. We have a
solution for it in an upcoming 5.0 release. Each instance will have
exactly one deterministic control socket.
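Until then, stale sockets can at least be spotted by hand. A sketch, assuming (as your report suggests) the entries under tty/ are named after the owning PID; run it as root so `kill -0` can probe processes of any user:

```shell
# Collect entries in tty/ whose owning process no longer exists.
# kill -0 sends no signal; it only checks whether the PID is alive.
stale=""
for sock in /var/cache/knot-resolver/tty/*; do
    [ -e "$sock" ] || continue            # directory empty or absent
    pid=$(basename "$sock")
    kill -0 "$pid" 2>/dev/null || stale="$stale $sock"
done
echo "stale sockets:$stale"
```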

> and the whole operating system visibly slows down.

I don't see how a knot-resolver crash under systemd would cause any
slowdown. Do you have any evidence of that? Are there any hanging kresd
processes in ps which weren't correctly terminated? What system resources
are they using?
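To check, a sketch using ps process selection (`-C` matches by command name; the format fields show elapsed time, resident memory in KiB, and CPU):

```shell
# List any kresd processes left behind after a crash; tail strips the
# header so $leftovers is empty when nothing matches.
leftovers=$(ps -C kresd -o pid,etime,rss,%cpu,args 2>/dev/null | tail -n +2)
if [ -n "$leftovers" ]; then
    echo "$leftovers"
else
    echo "no leftover kresd processes"
fi
```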

> I don't know how to produce an exact service crashdump file, but I can
> provide any log messages if needed.

If the crashes keep happening after the upgrade to 4.3.0 and the journal
messages don't help with debugging, this is how I managed to turn on
coredump collection on CentOS 7:


1. install debug symbols
$ debuginfo-install knot knot-resolver luajit

2. create /etc/sysctl.d/50-core.conf with the following content:
kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

3. modify/uncomment the following parameters in /etc/systemd/system.conf
DumpCore=yes
DefaultLimitCORE=infinity

4. reboot

Please refer to man systemd-coredump for more details.
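After the reboot, the active pattern can be verified against the kernel's live setting (a sketch; on a box configured as above, it should start with a pipe to systemd-coredump):

```shell
# The kernel's active core pattern; with the sysctl change applied it
# should read "|/usr/lib/systemd/systemd-coredump ...".
pattern=$(cat /proc/sys/kernel/core_pattern)
echo "$pattern"
```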


The next time kresd crashes, there should be a PID listed in:
$ coredumpctl list

which can be used to display some information about the coredump:
$ coredumpctl info $PID

Even the stack trace could help us track down the root of the issue. If you
believe you've found a security issue, please report it via a
*confidential* issue at
https://gitlab.labs.nic.cz/knot/knot-resolver/issues or to
knot-resolver@labs.nic.cz (non-public list).

Thanks!

--
Tomas Krizek
PGP: 4A8B A48C 2AED 933B D495 C509 A1FB A5F7 EF8C 4869