Thomas, aren't you calling strace in a way that only tracks one thread of
the knotd process? Knot is a multithreaded application, and the backup is
done by different threads than the one handling knotc requests. Notably,
the straces you have provided so far still don't cover any key file
copying etc...
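For reference, a sketch of an strace invocation that follows all threads
(the `pidof knotd` lookup and the output path are my assumptions; adjust
them to your setup):

```shell
# -f follows all threads of knotd, not only the one serving knotc;
# -tt adds timestamps; -o writes the trace to a file for inspection.
strace -f -tt -o /tmp/knotd.strace -p "$(pidof knotd)"
```

With -f in place, the syscalls of the background workers that copy the
key files should show up in the trace as well, attributed to their TIDs.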
I'm not very surprised that the backup works on another machine. This is
a nasty issue, so we need your help with debugging it in the exact
environment where it fails.
Thanks!
On 09. 12. 20 at 17:17, Thomas E. wrote:
Hi Libor,
I performed the same actions on a test machine with 2 test zones. Same
Knot version, same OS.
root@signer-test:/var/lib/knot/keys/keys# strace -p 3424
strace: Process 3424 attached
restart_syscall(<... resuming interrupted poll ...>) = 1
accept4(13, NULL, NULL, SOCK_NONBLOCK) = 14
recvmsg(14, {msg_namelen=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN
(Resource temporarily unavailable)
poll([{fd=14, events=POLLIN}], 1, 5000) = 1 ([{fd=14, revents=POLLIN}])
recvmsg(14, {msg_name=NULL, msg_namelen=0,
msg_iov=[{iov_base="\1\20\0\vzone-backup\21\0\0\32\0\v/tmp/backup"...,
iov_len=262144}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
MSG_DONTWAIT|MSG_NOSIGNAL) = 37
write(18, "2020-12-09T16:13:06+0000 info: c"..., 71) = 71
stat("/var/lib/knot/catalog/data.mdb", 0x7ffd37b72bb0) = -1 ENOENT (No
such file or directory)
stat("/var/lib/knot/journal/data.mdb", {st_mode=S_IFREG|0660,
st_size=94208, ...}) = 0
stat("/var/lib/knot/keys/data.mdb", {st_mode=S_IFREG|0660,
st_size=45056, ...}) = 0
mkdir("/tmp/backup", 0770) = 0
openat(AT_FDCWD, "/tmp/backup/knot.backup.lockfile",
O_RDONLY|O_CREAT|O_EXCL, 0600) = 21
mkdir("/tmp/backup", 0750) = -1 EEXIST (File exists)
stat("/var/lib/knot/catalog/data.mdb", 0x7ffd37b72ac0) = -1 ENOENT (No
such file or directory)
futex(0x7ffd37b73400, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ffd37b733b0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ffd37b73404, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ffd37b733b0, FUTEX_WAKE_PRIVATE, 1) = 1
sendmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3",
iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL)
= 1
recvmsg(14, {msg_namelen=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN
(Resource temporarily unavailable)
poll([{fd=14, events=POLLIN}], 1, 5000) = 1 ([{fd=14, revents=POLLIN}])
recvmsg(14, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0",
iov_len=262144}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
MSG_DONTWAIT|MSG_NOSIGNAL) = 1
close(14) = 0
poll([{fd=13, events=POLLIN}], 1, -1
2020-12-09T16:13:06+0000 info: control, received command 'zone-backup'
2020-12-09T16:13:06+0000 info: [xx.de.] zone backed up to '/tmp/backup'
2020-12-09T16:13:06+0000 info: [yy.de.] zone backed up to '/tmp/backup'
All 4 keys have been successfully backed up.
Thanks,
Thomas
On 09.12.20 16:05, libor.peltan wrote:
> Some explanation: a zone backup or restore is performed as a zone "event".
> When zone events are invoked via the knotc interface, the event is
> typically planned to be performed later, and "OK" is returned merely as
> an indication that the command has been registered. The event (zone
> backup in this case) happens per zone and with some delay, once a
> background worker becomes available, and the outcome appears in the
> log. On the contrary, if knotc is invoked with the `-b` parameter, it
> waits for the event to be performed synchronously and collects the
> outcome of the whole process (some of the error-code handling here is
> currently subject to ongoing improvements by my colleagues).
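>
> For example, a synchronous invocation might look like this (a sketch;
> the backup path is the one from your earlier commands):
>
> ```shell
> # Block until the backup event has actually finished, rather than
> # returning "OK" as soon as the command is registered:
> knotc -b zone-backup +backupdir /tmp/backup
> ```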
>
> So, from the strace you sent me, I only see that some initialization has
> been done and that the early "OK" has been returned, but there are no
> traits of the backup process as such :(
>
> Could you please collect somewhat longer strace, by watching when the
> backup finishes in the logfile?
>
> Thanks!
>
> Libor
>
On 09. 12. 20 at 15:56, Thomas E. wrote:
>> Hi Libor,
>>
>> yes, the backup command was executed and confirmed with "OK":
>>
>> root@signer-0:/# knotc zone-backup +backupdir /tmp/backup
>> OK
>>
>> I did it again and the output is the same.
>>
>> Thanks,
>> Thomas
>>
>> On 09.12.20 15:42, libor.peltan wrote:
>>> Hi Thomas,
>>>
>>> are you sure this strace was collected while the issue was being
>>> reproduced, with only half of the keys copied again?
>>>
>>> Maybe you just detached strace too early(?).
>>>
>>> Libor
>>>
>>> On 09. 12. 20 at 15:00, Thomas E. wrote:
>>>> Hi Libor,
>>>>
>>>> sure, here is the output:
>>>>
>>>>
>>>> root@signer-0:/tmp/backup/keys/keys# strace -p 9781
>>>> strace: Process 9781 attached
>>>> restart_syscall(<... resuming interrupted restart_syscall ...>) = 1
>>>> accept4(16, NULL, NULL, SOCK_NONBLOCK) = 23
>>>> recvmsg(23, {msg_namelen=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN
>>>> (Resource temporarily unavailable)
>>>> poll([{fd=23, events=POLLIN}], 1, 5000) = 1 ([{fd=23, revents=POLLIN}])
>>>> recvmsg(23, {msg_name=NULL, msg_namelen=0,
>>>>
>>>> msg_iov=[{iov_base="\1\20\0\vzone-backup\21\0\0\32\0\v/tmp/backup"...,
>>>> iov_len=262144}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>>>> MSG_DONTWAIT|MSG_NOSIGNAL) = 37
>>>> write(6, "2020-12-09T13:57:12+0000 info: c"..., 71) = 71
>>>> stat("/var/lib/knot/catalog/data.mdb", 0x7ffd0fb9f090) = -1 ENOENT (No
>>>> such file or directory)
>>>> stat("/var/lib/knot/journal/data.mdb", {st_mode=S_IFREG|0660,
>>>> st_size=9936896, ...}) = 0
>>>> stat("/var/lib/knot/keys/data.mdb", {st_mode=S_IFREG|0660,
>>>> st_size=45056, ...}) = 0
>>>> mkdir("/tmp/backup", 0770) = 0
>>>> openat(AT_FDCWD, "/tmp/backup/knot.backup.lockfile",
>>>> O_RDONLY|O_CREAT|O_EXCL, 0600) = 24
>>>> mkdir("/tmp/backup", 0750) = -1 EEXIST (File exists)
>>>> stat("/var/lib/knot/catalog/data.mdb", 0x7ffd0fb9efa0) = -1 ENOENT (No
>>>> such file or directory)
>>>> futex(0x7ffd0fb9f8e0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> futex(0x7ffd0fb9f890, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> futex(0x7ffd0fb9f8e4, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> futex(0x7ffd0fb9f890, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> sendmsg(23, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3",
>>>> iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>>>> MSG_NOSIGNAL) = 1
>>>> recvmsg(23, {msg_namelen=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EAGAIN
>>>> (Resource temporarily unavailable)
>>>> poll([{fd=23, events=POLLIN}], 1, 5000) = 1 ([{fd=23, revents=POLLIN}])
>>>> recvmsg(23, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0",
>>>> iov_len=262144}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>>>> MSG_DONTWAIT|MSG_NOSIGNAL) = 1
>>>> close(23) = 0
>>>> poll([{fd=16, events=POLLIN}], 1, -1
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On 09.12.20 13:34, libor.peltan wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> would you try attaching `strace` to the running Knot and running
>>>>> the backup again, in order to see if some syscall(s) return error
>>>>> codes?
>>>>>
>>>>> Thanks much!
>>>>>
>>>>> Libor
>>>>>
>>>>> On 09. 12. 20 at 12:25, Thomas E. wrote:
>>>>>> Hi Libor,
>>>>>>
>>>>>>
>>>>>> # knotc -V
>>>>>> knotc (Knot DNS), version 3.0.2
>>>>>>
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>> On 09.12.20 12:16, libor.peltan wrote:
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> aren't you using Knot version 3.0.0 (thus lower than 3.0.1)?
>>>>>>>
>>>>>>> Libor
>>>>>>>
>>>>>>> On 09. 12. 20 at 11:39, Thomas E. wrote:
>>>>>>>> Hi Libor,
>>>>>>>>
>>>>>>>> I deleted the backup directory a few times, but I receive the
>>>>>>>> same error every time:
>>>>>>>>
>>>>>>>> 2020-12-09T10:31:52+0000 info: control, received command 'zone-backup'
>>>>>>>> 2020-12-09T10:31:52+0000 warning: [xxx.] zone backup failed (not exists)
>>>>>>>> 2020-12-09T10:31:52+0000 error: [xxx.] zone event 'backup/restore' failed (not exists)
>>>>>>>> 2020-12-09T10:31:52+0000 warning: [yyy.] zone backup failed (not exists)
>>>>>>>> 2020-12-09T10:31:52+0000 error: [yyy.] zone event 'backup/restore' failed (not exists)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> One of the keys is the ZSK of zone xxx and the other the KSK of
>>>>>>>> zone yyy.
>>>>>>>>
>>>>>>>>
>>>>>>>> root@signer-0:/# keymgr xxx list human | grep
>>>>>>>> 1fb3900b2e5ac72d30f927016ea4546ca561a5da
>>>>>>>>
>>>>>>>> 1fb3900b2e5ac72d30f927016ea4546ca561a5da ksk=no zsk=yes tag=39969
>>>>>>>> algorithm=7 size=1024 public-only=no pre-active=0
>>>>>>>> publish=-1M3D17h49m4s ready=0 active=-1M3D17h49m4s
retire-active=0
>>>>>>>> retire=0 post-active=0 revoke=0 remove=0
>>>>>>>>
>>>>>>>>
>>>>>>>> root@signer-0:/# keymgr yyy list human | grep
>>>>>>>> 087cc573318e070befff1d9cbcf07e3b5cf5444d
>>>>>>>>
>>>>>>>> 087cc573318e070befff1d9cbcf07e3b5cf5444d ksk=yes zsk=no tag=37419
>>>>>>>> algorithm=7 size=2048 public-only=no pre-active=0
>>>>>>>> publish=-1M3D18h11m7s ready=-1M3D18h11m7s active=-26D12h49m4s
>>>>>>>> retire-active=0 retire=0 post-active=0 revoke=0 remove=0
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>> On 09.12.20 11:15, libor.peltan wrote:
>>>>>>>>> Hi Thomas,
>>>>>>>>>
>>>>>>>>> could you please check whether this issue is reproducible:
>>>>>>>>> whenever you attempt the backup (to a fresh, empty target
>>>>>>>>> directory), does it fail with "not exists"?
>>>>>>>>>
>>>>>>>>> Could you please check whether the keys that happen to make it
>>>>>>>>> into the backup belong to the same zone, or whether it's one
>>>>>>>>> from each zone? (You might use `keymgr list` to check which key
>>>>>>>>> ID belongs to which zone.)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Libor
>>>>>>>>>
>>>>>>>>> On 08. 12. 20 at 21:36, Thomas wrote:
>>>>>>>>>> Hi Libor,
>>>>>>>>>>
>>>>>>>>>> sorry, I was really too unspecific.
>>>>>>>>>>
>>>>>>>>>> I'm hosting 2 zones. These 4 keys are on the production machine:
>>>>>>>>>>
>>>>>>>>>> root@signer-0:/var/lib/knot/keys/keys# ls -alh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -rw-r----- 1 knot knot 1,7K Nov 5 16:22
>>>>>>>>>> 087cc573318e070befff1d9cbcf07e3b5cf5444d.pem
>>>>>>>>>> -rw-r----- 1 knot knot 916 Nov 5 16:44
>>>>>>>>>> 1fb3900b2e5ac72d30f927016ea4546ca561a5da.pem
>>>>>>>>>> -rw-r----- 1 knot knot 916 Nov 5 16:22
>>>>>>>>>> 6ebb8eb3ec2ddaf150119b4bc11b47dcec91621a.pem
>>>>>>>>>> -rw-r----- 1 knot knot 1,7K Nov 5 16:44
>>>>>>>>>> d7e47e2909f4d5947d8fb8684cb79ed06feb4b0a.pem
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Performing a backup with the following command:
>>>>>>>>>>
>>>>>>>>>> # knotc zone-backup +backupdir /tmp/backup
>>>>>>>>>>
>>>>>>>>>> The backup directory after performing the backup shows:
>>>>>>>>>>
>>>>>>>>>> root@signer-0:/tmp/backup/keys/keys# ls -ahl
>>>>>>>>>>
>>>>>>>>>> -rw-r----- 1 knot knot 1,7K Dez 8 20:21
>>>>>>>>>> 087cc573318e070befff1d9cbcf07e3b5cf5444d.pem
>>>>>>>>>> -rw-r----- 1 knot knot 916 Dez 8 20:21
>>>>>>>>>> 1fb3900b2e5ac72d30f927016ea4546ca561a5da.pem
>>>>>>>>>>
>>>>>>>>>> 2 keys are missing.
>>>>>>>>>>
>>>>>>>>>> Hmm, OK, there is an error in the log:
>>>>>>>>>>
>>>>>>>>>> 2020-12-08T20:26:43+0000 info: control, received command 'zone-backup'
>>>>>>>>>> 2020-12-08T20:26:43+0000 warning: [xxx.] zone backup failed (not exists)
>>>>>>>>>> 2020-12-08T20:26:43+0000 error: [xxx.] zone event 'backup/restore' failed (not exists)
>>>>>>>>>> 2020-12-08T20:26:43+0000 warning: [yyy.] zone backup failed (not exists)
>>>>>>>>>> 2020-12-08T20:26:43+0000 error: [yyy.] zone event 'backup/restore' failed (not exists)
>>>>>>>>>>
>>>>>>>>>> I'm using the latest Knot version.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> Thomas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08.12.20 at 16:56, libor.peltan wrote:
>>>>>>>>>>> Hi Thomas,
>>>>>>>>>>>
>>>>>>>>>>> could you be more specific about "half of private keys were
>>>>>>>>>>> in the backup"? How many were, how many weren't, and was
>>>>>>>>>>> there some obvious difference between them?
>>>>>>>>>>>
>>>>>>>>>>> Could you share the log snippets covering the backup and the
>>>>>>>>>>> restore procedures?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Libor
>>>>>>>>>>>
>>>>>>>>>>> On 08. 12. 20 at 16:48, Thomas E. wrote:
>>>>>>>>>>>> Hi (again),
>>>>>>>>>>>>
>>>>>>>>>>>> I was trying to back up and restore a server with the new
>>>>>>>>>>>> knotc zone-backup/restore command.
>>>>>>>>>>>>
>>>>>>>>>>>> I recognized that only half of the private keys were in the
>>>>>>>>>>>> backup, which leads to an error:
>>>>>>>>>>>>
>>>>>>>>>>>> 2020-12-08T14:44:00+0100 error: [xxx.] DNSSEC, failed to load
>>>>>>>>>>>> private keys (not exists)
>>>>>>>>>>>> 2020-12-08T14:44:00+0100 error: [xxx.] DNSSEC, failed to load
>>>>>>>>>>>> keys (not exists)
>>>>>>>>>>>>
>>>>>>>>>>>> Shouldn't the backup contain all private keys?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Thomas