Hi,
Roland and I ran into a crashing condition for knotd 2.6.[689],
presumably caused by a race condition in the threaded use of PKCS #11
sessions. We use a commercial, replicated, networked HSM and not SoftHSM2.
WORKAROUND:
We do have a workaround, "conf-set server.background-workers 1", so
this is not a blocking condition for us, but we would like to bring
back concurrency for our ~1700 zones later.
PROBLEM DESCRIPTION:
Without this workaround, we see crashes quite reliably under a load that
issues a number of zone-set/-unset commands, fired by sequentialised knotc
processes at a knotd that continues zone signing concurrently.
The commands are generated with the knot-aware option -k of ldns-zonediff,
https://github.com/SURFnet/ldns-zonediff
ANALYSIS:
Our HSM reports errors that look as if a session handle is reused and
then repeatedly logged into, but not always, so it looks like a race
condition on a session variable:
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:02 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:03 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:55:50 | [0000744C:0000744E] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
These errors stopped being reported once the workaround was configured.
Until then, we had crashes, one of which produced the following dump:
Thread 4 "knotd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff696902a in __GI_abort () at abort.c:89
#2 0x00007ffff69a97ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7ffff6ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff69b237a in malloc_printerr (ar_ptr=,
ptr=,
str=0x7ffff6ac2fe8 "double free or corruption (out)", action=3) at
malloc.c:5006
#4 _int_free (av=, p=, have_lock=0) at
malloc.c:3867
#5 0x00007ffff69b653c in __GI___libc_free (mem=) at
malloc.c:2968
#6 0x0000555555597ed3 in ?? ()
#7 0x00005555555987c2 in ?? ()
#8 0x000055555559ba01 in ?? ()
#9 0x00007ffff7120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4
#10 0x00007ffff6d036ba in start_thread (arg=0x7fffcd1bd700) at
pthread_create.c:333
#11 0x00007ffff6a3941d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
DEBUGGING HINTS:
Our suspicion is that you may not have set the mutex callbacks when
invoking C_Initialize() on PKCS #11, possibly because the intermediate
layers of abstraction hide this from view. This is a common oversight.
Then again, the double free might be another hint.
This is on our soon-to-go-live platform, so I'm afraid it will be very
difficult to do much more testing; I hope this suffices for your
debugging, and that it helps Knot DNS move forward!
-Rick
Hi folks,
maybe somebody can help me.
Is there any possibility to sign with more than one core? The
"background-workers" parameter didn't help; Knot DNS is using only one
core for signing.
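In case it helps later readers: as far as I can tell from the documentation of this version, background-workers parallelises zone events *across* zones, while the signing of any single zone still runs on one worker, so with one large zone only one core is used. A hedged knot.conf sketch (the value 4 is an arbitrary example):

```
server:
    # Number of workers for background zone events (signing, DNSSEC
    # key rollovers, etc.). Parallelism is per zone: many zones can be
    # signed concurrently, but a single zone still uses one core.
    background-workers: 4
```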
thanks a lot
best regards
--
Christian Petrasch
Senior System Engineer
DNS/Infrastructure
IT-Services
DENIC eG
Kaiserstraße 75-77
60329 Frankfurt am Main
GERMANY
E-Mail: petrasch(a)denic.de
http://www.denic.de
PGP-KeyID: 549BE0AE, Fingerprint: 0E0B 6CBE 5D8C B82B 0B49 DE61 870E 8841
549B E0AE
Angaben nach § 25a Absatz 1 GenG: DENIC eG (Sitz: Frankfurt am Main)
Vorstand: Helga Krüger, Martin Küchenthal, Andreas Musielak, Dr. Jörg
Schweiger
Vorsitzender des Aufsichtsrats: Thomas Keller
Eingetragen unter Nr. 770 im Genossenschaftsregister, Amtsgericht
Frankfurt am Main
Hi admin,
I find your Knot DNS really amazing software, but I have an issue with
the master/slave configuration. When I configure a domain zone on the
master, the zone does not propagate to the slave unless I manually
specify the domain name in the slave's zone configuration.
Is there any way to configure this?
I hope to hear from you soon after you receive this email. I'm doing
this for my personal use and a demo.
Regards,
Innus Ali
From: India
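For reference, automatic propagation needs the master to NOTIFY the slave and an ACL allowing the transfer, but the zone itself still has to be declared on both sides; to my knowledge this version of Knot has no automatic provisioning of slave zones. A hedged knot.conf sketch, where all names and addresses are placeholder assumptions:

```
# --- master side ---
remote:
  - id: slave1
    address: 192.0.2.2

acl:
  - id: acl_slave
    address: 192.0.2.2
    action: transfer

zone:
  - domain: example.com
    notify: slave1
    acl: acl_slave

# --- slave side ---
remote:
  - id: master1
    address: 192.0.2.1

zone:
  - domain: example.com   # must still be declared here
    master: master1
```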
Hello,
I have an issue with a zone for which Knot is the slave server. I am not
able to transfer the zone: refresh, failed (no usable master). BIND is
able to transfer this zone, and an AXFR with the host command works as
well. There are more domains on this master and the others are working.
The thing is that I can see in Wireshark that the AXFR starts, the zone
transfer begins, and then for some reason Knot terminates the TCP
connection with an RST after the first ACK to the AXFR response,
resulting in AXFR failure. The AXFR response is spread over several TCP
segments.
I can provide traces privately.
KNOT 2.6.7-1+0~20180710153240.24+stretch~1.gbpfa6f52
Thanks for help.
BR
Ales Rygl
Dear all,
I use knot 2.7.1 with automatic DNSSEC signing and key management.
For some zones I have used "cds-cdnskey-publish: none".
As .CH/.LI is about to support CDS/CDNSKEY (RFC 8078, RFC 7344), I
thought I should enable publishing the CDS/CDNSKEY RRs for all my zones.
However, the zones which are already secure (trust anchor in the parent
zone) do not publish the CDS/CDNSKEY records when the setting is changed
to "cds-cdnskey-publish: always".
I have not been able to reproduce this error on new zones or new zones
signed and secured with a trust anchor in the parent zone for which I
then change the cds-cdnskey-publish setting from "none" to "always".
This indicates that there seems to be some state error for my existing
zones only.
I tried the following, without success:
knotc zone-sign <zone>
knotc -f zone-purge +journal <zone>
; publish an inactive KSK
keymgr <zone> generate ... ; knotc zone-sign <zone>
Completely removing the zone (and all keys) and restarting obviously
fixes the problem. However, I cannot do this for all my zones, as I
would first have to remove the DS records in the parent zone...
Any idea?
Daniel
Hi all,
I would like to kindly ask you to check the state of the Debian
repository. It looks like it is a bit outdated: the latest version
available is 2.6.7-1+0~20180710153240.24+stretch~1.gbpfa6f52, while
2.7.0 has already been released.
Thanks
BR
Ales Rygl