On 2026/1/5 16:30, Daniel Salzman via knot-dns-users wrote:
Hi!
On 1/5/26 08:06, SUN Guonian wrote:
Increasing quic-outbuf-max-size does take effect:
the transfer works on x86_64.
But on aarch64 the behavior is quite different.
Are there the same number of background workers on both architectures?
The number of background workers is left at the default value.
On x86_64, it is 2 for both master and slave.
On aarch64, it is 48 on the master, 48 on slave1 (10.1.136.156), and 24 on slave2
(10.1.136.159, virtual host).
After I decreased it to 2 on the master, both slaves could get the zone.
This is interesting, because the number of background workers shouldn't matter. The
output buffer limit is per worker.
1. I increased quic-outbuf-max-size from the default 100M to 3200M, doubling it each
time; the master still logs:
2026-01-05T14:47:49+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.156@37936 QUIC,
started, serial 2025123113
2026-01-05T14:47:49+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.156@37936 QUIC,
buffering finished, 0.27 seconds, 2956 messages, 49797072 bytes
2026-01-05T14:49:13+0800 debug: QUIC, terminated inactive connections 1
2026-01-05T14:49:13+0800 debug: QUIC, terminated inactive connections 1
2026-01-05T14:49:19+0800 debug: [foo.] ACL, allowed, action transfer, remote
10.1.136.159@14324 QUIC cert-key TFg9ybqubTukNtMiFdn5jW61Y4VUPS9XmYxHsCeQ/4c=
2026-01-05T14:49:19+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.159@14324 QUIC,
started, serial 2025123113
2026-01-05T14:49:19+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.159@14324 QUIC,
buffering finished, 0.27 seconds, 2956 messages, 49797072 bytes
2026-01-05T14:49:23+0800 debug: QUIC, terminated inactive connections 1
Note that the reason for connection termination is different. The client was inactive.
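The difference between the two termination reasons can be checked with simple arithmetic on the byte counts in the logs. The sketch below assumes that the "M" suffix in knot.conf sizes means 1024*1024 bytes; the helper names limit_bytes and fits_outbuf are made up for illustration and are not part of Knot DNS.

```shell
#!/bin/sh
# Assumption: "100M" in knot.conf means 100 * 1024 * 1024 bytes.
limit_bytes() { echo $(( $1 * 1024 * 1024 )); }

# Report whether a buffered transfer of $1 bytes fits under a limit of $2 M.
fits_outbuf() {
    if [ "$1" -le "$(limit_bytes "$2")" ]; then
        echo "fits"
    else
        echo "exceeds"
    fi
}

fits_outbuf 124493148 100   # 1,000,000-name zone from the first report -> exceeds
fits_outbuf 49797072 100    # AXFR buffered in the log above            -> fits
```

A single 49797072-byte transfer is below even the default limit, which is consistent with these connections being closed for inactivity rather than for the outbuf limit.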
2. One slave (10.1.136.156, rocky9) could get the zone fully, the other (10.1.136.159,
rocky10) couldn't; both are alive.
What does the server log?
on slave1,
2026-01-05T17:31:58+0800 info: [foo.] notify, incoming, remote 10.1.136.154@24037 TCP,
serial 2025123116
2026-01-05T17:31:58+0800 info: [foo.] refresh, remote 10.1.136.154@853, remote serial
2025123116, zone is outdated
2026-01-05T17:31:58+0800 info: [foo.] IXFR, incoming, remote 10.1.136.154@853 QUIC,
receiving AXFR-style IXFR
2026-01-05T17:31:58+0800 info: [foo.] AXFR, incoming, remote 10.1.136.154@853 QUIC,
started
2026-01-05T17:32:02+0800 info: [foo.] AXFR, incoming, remote 10.1.136.154@853 QUIC,
finished, remote serial 2025123116, 3.42 seconds, 3695 messages, 62246340 bytes
2026-01-05T17:32:03+0800 info: [foo.] refresh, remote 10.1.136.154@853, zone updated,
5.56 seconds, serial 2025123114 -> 2025123116, expires in 2491200 seconds
2026-01-05T17:32:05+0800 info: [foo.] zone file updated, serial 2025123114 ->
2025123116
2026-01-05T17:33:13+0800 debug: stats, dumped into file
'/home/gtld/knot/chroot1/var/run/stats.yaml'
2026-01-05T17:42:08+0800 info: [foo.] refresh, remote master_quic, address
10.1.136.154@853, failed (connection reset)
2026-01-05T17:42:08+0800 warning: [foo.] refresh, remote master_quic not usable
2026-01-05T17:42:08+0800 error: [foo.] refresh, failed (no usable master), next retry at
2026-01-05T17:45:28+0800, expires in 2490595 seconds
2026-01-05T17:42:08+0800 error: [foo.] zone event 'refresh' failed (no usable
master)
2026-01-05T17:45:30+0800 info: [foo.] refresh, remote master_quic, address
10.1.136.154@853, failed (connection reset)
2026-01-05T17:45:30+0800 warning: [foo.] refresh, remote master_quic not usable
2026-01-05T17:45:30+0800 error: [foo.] refresh, failed (no usable master), next retry at
2026-01-05T17:48:50+0800, expires in 2490393 seconds
2026-01-05T17:45:30+0800 error: [foo.] zone event 'refresh' failed (no usable
master)
on slave2,
2026-01-05T17:32:28+0800 info: server started
2026-01-05T17:32:28+0800 info: [foo.] AXFR, incoming, remote 10.1.136.154@853 QUIC,
started
2026-01-05T17:32:33+0800 info: [foo.] AXFR, incoming, remote 10.1.136.154@853 QUIC,
finished, remote serial 2025123116, 4.94 seconds, 3695 messages, 62246340 bytes
2026-01-05T17:32:35+0800 info: [foo.] refresh, remote 10.1.136.154@853, zone updated,
7.34 seconds, serial none -> 2025123116, expires in 2491200 seconds
2026-01-05T17:32:37+0800 info: [foo.] zone file updated, serial 2025123116
2026-01-05T17:34:59+0800 debug: [foo.] ACL, allowed, action notify, remote
10.1.136.154@21413 TCP
2026-01-05T17:34:59+0800 info: [foo.] notify, incoming, remote 10.1.136.154@21413 TCP,
serial 2025123116
2026-01-05T17:42:36+0800 info: [foo.] refresh, remote master_quic, address
10.1.136.154@853, failed (connection reset)
2026-01-05T17:42:36+0800 warning: [foo.] refresh, remote master_quic not usable
2026-01-05T17:42:36+0800 error: [foo.] refresh, failed (no usable master), next retry at
2026-01-05T17:45:56+0800, expires in 2490599 seconds
2026-01-05T17:42:36+0800 error: [foo.] zone event 'refresh' failed (no usable
master)
2026-01-05T17:45:58+0800 info: [foo.] refresh, remote master_quic, address
10.1.136.154@853, failed (connection reset)
2026-01-05T17:45:58+0800 warning: [foo.] refresh, remote master_quic not usable
2026-01-05T17:45:58+0800 error: [foo.] refresh, failed (no usable master), next retry at
2026-01-05T17:49:18+0800, expires in 2490397 seconds
2026-01-05T17:45:58+0800 error: [foo.] zone event 'refresh' failed (no usable
master)
on master,
2026-01-05T17:31:58+0800 info: server started
2026-01-05T17:31:58+0800 debug: [foo.] ACL, allowed, action transfer, remote
10.1.136.156@45612 QUIC cert-key kulz9ehQf5Ycn/+2mCicUdfTMuDXHbQEWBwg5qDi0Eo=
2026-01-05T17:31:58+0800 info: [foo.] IXFR, outgoing, remote 10.1.136.156@45612 QUIC,
incomplete history, serial 2025123114, fallback to AXFR
2026-01-05T17:31:58+0800 debug: [foo.] ACL, allowed, action transfer, remote
10.1.136.156@45612 QUIC cert-key kulz9ehQf5Ycn/+2mCicUdfTMuDXHbQEWBwg5qDi0Eo=
2026-01-05T17:31:58+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.156@45612 QUIC,
started, serial 2025123116
2026-01-05T17:31:58+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.156@45612 QUIC,
buffering finished, 0.33 seconds, 3695 messages, 62246340 bytes
2026-01-05T17:32:47+0800 debug: [foo.] ACL, allowed, action transfer, remote
10.1.136.159@41166 QUIC cert-key TFg9ybqubTukNtMiFdn5jW61Y4VUPS9XmYxHsCeQ/4c=
2026-01-05T17:32:47+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.159@41166 QUIC,
started, serial 2025123116
2026-01-05T17:32:47+0800 info: [foo.] AXFR, outgoing, remote 10.1.136.159@41166 QUIC,
buffering finished, 0.38 seconds, 3695 messages, 62246340 bytes
2026-01-05T17:35:18+0800 info: [foo.] notify, outgoing, remote 10.1.136.159@53 TCP,
retry, serial 2025123116
These logs describe a different situation. I see successful transfers over QUIC. Note that
the failed transfers were over TCP.
The master still crashed:
-rw-r-----. 1 root root 65663813 Jan 5 17:42
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.323451.1767606123000000.zst
3. The master (10.1.136.154, rocky9) crashed.
# ls -lrt /var/lib/systemd/coredump/
-rw-r-----. 1 root root 52527031 Jan 5 09:09
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.320376.1767575365000000.zst
-rw-r-----. 1 root root 52575905 Jan 5 10:06
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.320623.1767578803000000.zst
-rw-r-----. 1 root root 52546840 Jan 5 10:53
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.320970.1767581613000000.zst
-rw-r-----. 1 root root 52519031 Jan 5 13:33
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.321756.1767591216000000.zst
-rw-r-----. 1 root root 52519386 Jan 5 13:46
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.321896.1767591963000000.zst
-rw-r-----. 1 root root 52512650 Jan 5 14:23
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.322020.1767594182000000.zst
-rw-r-----. 1 root root 52553042 Jan 5 14:49
core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.322309.1767595778000000.zst
Could you send me a backtrace? Something like
coredumpctl gdb knot \
--batch \
-ex "set pagination off" \
-ex "thread apply all bt full"
This command outputs:
coredumpctl: unrecognized option '--batch'
I tried to get a gdb backtrace from the core dump, but failed:
# coredumpctl dump > c01
PID: 323451 (knotd)
UID: 1000 (gtld)
GID: 1000 (gtld)
Signal: 11 (SEGV)
Timestamp: Mon 2026-01-05 17:42:03 CST (6min ago)
Command Line: /usr/sbin/knotd -c /home/gtld/knot/chroot4/etc/knot.conf -d
Executable: /usr/sbin/knotd
Control Group: /user.slice/user-0.slice/session-129.scope
Unit: session-129.scope
Slice: user-0.slice
Session: 129
Owner UID: 0 (root)
Boot ID: eceeeff3d58f457db4014dc5f33e0fad
Machine ID: 82aa697a9ae54202bb5e0ec31c510520
Hostname: tiangong-01
Storage:
/var/lib/systemd/coredump/core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.323451.1767606123000000.zst
(truncated)
Size on Disk: 62.6M
Message: Process 323451 (knotd) of user 1000 dumped core.
Stack trace of thread 323458:
#0 0x0000ffff9bce36f0 n/a (n/a + 0x0)
#1 0x0000ffff9be77c7c n/a (n/a + 0x0)
#2 0x0000ffff9be77c7c n/a (n/a + 0x0)
#3 0x0000ffff9beaf4b8 n/a (n/a + 0x0)
#4 0x0000ffff9be42d5c n/a (n/a + 0x0)
#5 0x0000ffff9be4ff8c n/a (n/a + 0x0)
#6 0x0000ffff9c3516a8 n/a (n/a + 0x0)
#7 0x0000ffff9c35170c n/a (n/a + 0x0)
#8 0x0000ffff9c35c104 n/a (n/a + 0x0)
#9 0x0000ffff9c374e2c n/a (n/a + 0x0)
#10 0x0000ffff9c36b630 n/a (n/a + 0x0)
#11 0x0000ffff9c36b984 n/a (n/a + 0x0)
#12 0x0000ffff9c36c12c n/a (n/a + 0x0)
#13 0x0000ffff9c359498 n/a (n/a + 0x0)
#14 0x0000aaaae3cf8244 quic_handler (/usr/sbin/knotd + 0x28244)
#15 0xeee58b05aa78eb00 n/a (n/a + 0x0)
ELF object binary architecture: AARCH64
More than one entry matches, ignoring rest.
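Since several dumps match, one can be selected by PID. A minimal sketch, assuming the usual systemd-coredump file naming core.<comm>.<uid>.<boot-id>.<pid>.<timestamp in microseconds> (it would break for a command name containing dots):

```shell
#!/bin/sh
# File name copied from the listing above.
name='core.knotd.1000.eceeeff3d58f457db4014dc5f33e0fad.323451.1767606123000000.zst'

# Split on dots: field 5 is the PID, field 6 the crash time in microseconds.
core_pid=$(printf '%s\n' "$name" | cut -d. -f5)
core_usec=$(printf '%s\n' "$name" | cut -d. -f6)
core_sec=$((core_usec / 1000000))

echo "pid=$core_pid epoch=$core_sec"   # prints "pid=323451 epoch=1767606123"
```

The PID can then be given as a match, for example `coredumpctl dump 323451 --output=c01`, so that only the latest crash is written out.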
# gdb --core c01
GNU gdb (Rocky Linux) 16.3-2.el9
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
warning: BFD: warning: /var/lib/systemd/coredump/c01 has a segment extending past end of
file
warning: Can't open file /tmp/knot-confdb.B4NsUg/lock.mdb (deleted) during
file-backed mapping note processing
warning: Can't open file /tmp/knot-confdb.B4NsUg/data.mdb (deleted) during
file-backed mapping note processing
[New LWP 323458]
[New LWP 323453]
[New LWP 323451]
[New LWP 323454]
[New LWP 323456]
[New LWP 323460]
[New LWP 323455]
[New LWP 323461]
[New LWP 323457]
[New LWP 323459]
[New LWP 323462]
[New LWP 323463]
[New LWP 323452]
[New LWP 323464]
[New LWP 323465]
warning: failed to parse execution context from corefile: Cannot access memory at address
0xffffc31bcfe8
Reading symbols from /usr/sbin/knotd...
Reading symbols from /usr/lib/debug/usr/sbin/knotd-3.5.2-cznic.1.el9.aarch64.debug...
warning: Error reading shared library list entry at 0x2e20613635303264
Cannot access memory at address 0x6578652d6c642f80
Cannot access memory at address 0x6578652d6c642f78
Failed to read a valid object file image from memory.
Core was generated by `/usr/sbin/knotd -c /home/gtld/knot/chroot4/etc/knot.conf -d'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000ffff9bce36f0 in ?? ()
[Current thread is 1 (LWP 323458)]
(gdb) bt
#0 0x0000ffff9bce36f0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
Unfortunately, this output is incomplete. You should increase the coredump limit and
install packages with debug symbols:
knot-debuginfo knot-libs-debuginfo
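Regarding the `--batch` error above: coredumpctl parses its own options, so gdb options have to be handed over separately. The sketch below only prints a command line rather than running it; it assumes a systemd version (248 or newer) whose coredumpctl accepts --debugger-arguments, and the function name backtrace_cmd is made up for illustration.

```shell
#!/bin/sh
# Print (not run) a coredumpctl call that passes batch options through to gdb.
# Assumption: coredumpctl supports --debugger-arguments (systemd >= 248).
backtrace_cmd() {
    pid="$1"   # PID of the crashed knotd, e.g. 323451
    printf '%s\n' "coredumpctl debug $pid --debugger-arguments=\"-batch -ex 'set pagination off' -ex 'thread apply all bt full'\""
}

backtrace_cmd 323451
```

The printed line can be pasted into a root shell once the debug packages are installed. The "(truncated)" marker in the Storage field suggests that a size limit was hit; on systemd-coredump setups this is typically ProcessSizeMax= or ExternalSizeMax= in /etc/systemd/coredump.conf, but which one applies here is an assumption.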
Daniel
Thanks!
Best Regards,
SUN Guonian
>
> On 2026/1/4 19:05, Daniel Salzman wrote:
>> Hello!
>>
>> You probably have to increase
>> https://www.knot-dns.cz/docs/latest/singlehtml/index.html#quic-outbuf-max-s…
>>
>> Daniel
>>
>> On 1/4/26 10:46, SUN Guonian via knot-dns-users wrote:
>>> Greetings,
>>>
>>> I have tried to use QUIC for zone transfers and ran into an error with a bigger zone.
>>>
>>> The master's log shows:
>>> 2026-01-04T17:32:19+0800 debug: [foo.] ACL, allowed, action transfer, remote 10.0.0.147@60880 QUIC cert-key xJKsDkUqpl6orXeTwsrDgDvgZ/PiYxOSVlOkVdn5EOU=
>>> 2026-01-04T17:32:19+0800 info: [foo.] IXFR, outgoing, remote 10.0.0.147@60880 QUIC, incomplete history, serial 2026010403, fallback to AXFR
>>> 2026-01-04T17:32:19+0800 debug: [foo.] ACL, allowed, action transfer, remote 10.0.0.147@60880 QUIC cert-key xJKsDkUqpl6orXeTwsrDgDvgZ/PiYxOSVlOkVdn5EOU=
>>> 2026-01-04T17:32:19+0800 info: [foo.] AXFR, outgoing, remote 10.0.0.147@60880 QUIC, started, serial 2026010404
>>> 2026-01-04T17:32:20+0800 info: [foo.] AXFR, outgoing, remote 10.0.0.147@60880 QUIC, buffering finished, 0.87 seconds, 7390 messages, 124493148 bytes
>>> 2026-01-04T17:32:20+0800 notice: QUIC, terminated connections, outbuf limit 1
>>>
>>> On the slave side, the log shows:
>>> 2026-01-04T17:32:18+0800 info: [foo.] zone file loaded, serial 2026010403
>>> 2026-01-04T17:32:19+0800 info: [foo.] loaded, serial none -> 2026010403, 92000117 bytes
>>> 2026-01-04T17:32:19+0800 info: [foo.] refresh, remote 10.0.0.151@853, remote serial 2026010404, zone is outdated
>>> 2026-01-04T17:32:19+0800 info: server started
>>>
>>> (Also, knotd on the slave goes down without logging anything.)
>>>
>>> Thanks in advance.
>>>
>>>
>>> My testing environment is,
>>>
>>> the zone size is 1,000,000 x ( 2 NS + 2 A ), such as,
>>> domain00000000 3600 NS ns1.domain00000000
>>> 3600 NS ns2.domain00000000
>>> ns1.domain00000000 3600 A 10.0.0.1
>>> ns2.domain00000000 3600 A 10.0.0.2
>>> ...
>>> domain00999999 3600 NS ns1.domain00999999
>>> 3600 NS ns2.domain00999999
>>> ns1.domain00999999 3600 A 10.0.0.1
>>> ns2.domain00999999 3600 A 10.0.0.2
>>>
>>> If I decrease the number of records to 500,000 x ( 2 NS + 2 A ), the zone can be transferred over QUIC successfully.
>>>
>>> With traditional TCP and TLS, the zone transfer completes without error, even for larger zones.
>>>
>>> Both master and slave run version 3.5.2, installed from copr.
>>> The OS on both sides is Rocky9 x86_64.
>>>
>>> Best Regards,
>>> SUN Guonian
>>>