Hello Anand.
Thank you for a complex write-up. :)
On 4.2.2016 17:03, Anand Buddhdev wrote:
> Next, I edited the config file and added 4682 slave zones to it. They
> all share the "default" template, which defines one master server. Then
> I called "knotc reload". Knot logged all the zones and said it was going
> to bootstrap them. But then it just sat there, doing *something*, and it
> was a full 118 seconds later that it started to check the master for
> updates. Here's the log snippet showing this:
>
> 2016-02-04T15:20:06 info: [ZONE4681] zone will be bootstrapped, serial 0
> 2016-02-04T15:20:06 info: [ZONE4682] zone will be bootstrapped, serial 0
> 2016-02-04T15:20:06 info: configuration reloaded
> 2016-02-04T15:22:04 info: [ZONE0001] AXFR, incoming, X.X.X.X@53: starting
> 2016-02-04T15:22:04 info: [ZONE0002] AXFR, incoming, X.X.X.X@53: starting
Hm. We will investigate this a little more. It's quite possible that
it's related to the next problem.
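For reference, a slave setup of the shape described above looks roughly
like this in knot.conf; the remote name, address, and zone names below
are placeholders:

  remote:
    - id: primary
      address: 192.0.2.1@53

  template:
    - id: default
      master: primary

  zone:
    - domain: zone0001.example
    - domain: zone0002.example
    # ... one entry per slave zone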
> Note the 118-second delay before the zone refreshes start. During this
> delay, Knot made hundreds of DNS queries (A and AAAA) towards the
> locally-configured caching resolver (Google DNS in this case) for its
> own hostname.
Yes, this is a bug. Knot tries to get the host's canonical name so that
it can answer hostname.bind CH TXT queries. This lookup happens whenever
any event starts, which is wrong. We will fix it.
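For reference, hostname.bind is the standard CHAOS-class server identity
query, which you can try with dig; the server address here is a
placeholder:

  # ask the server to identify itself (CH-class TXT query)
  dig @192.0.2.53 hostname.bind chaos txt +short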
> Next up, when the refreshes started, Knot went and pummelled the master
> server. Several zones on the master have expired, so Knot logged this:
>
> 2016-02-04T15:22:10 warning: [ZONENNNN] AXFR, incoming, X.X.X.X@53: server responded with SERVFAIL
> 2016-02-04T15:22:10 warning: [ZONENNNN] AXFR, incoming, X.X.X.X@53: failed (processing layer error)
> 2016-02-04T15:22:10 warning: [ZONENNNN] AXFR, incoming, remote 'hidden' not available
> 2016-02-04T15:22:10 error: [ZONENNNN] AXFR, incoming, failed (no active master)
>
> So, the remote *is* available. It's just telling Knot that it can't
> provide a zone transfer (SERVFAIL), because the zone has probably
> expired on the master. The log message is therefore a bit confusing.
> And what does "processing layer error" mean?
Outgoing queries are handled by a state machine, and we use layers to
stack the processing steps. So this error just means that something went
wrong during the transfer.

We could improve this, but probably no earlier than 2.2.0.
> Finally, I have some comments about the various parameters that "knotc"
> takes. The explanation for them in the knotc man page (and in the online
> documentation) is rather terse. For example, what does "zone-check"
> actually check? It might be nice if the man page gave a bit more
> information about this.
We will improve the documentation before the release, and we will try to
address your points about any functional changes in the next feature
release.
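The zone-* commands take an optional list of zone names; with no names
given they apply to all configured zones (that is how the bare
"knotc zone-refresh" mentioned below behaves). For example, with a
placeholder zone name:

  # check a single zone, or every configured zone if no name is given
  knotc zone-check example.com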
Next up, "zone-status" prints some status.
However, it would be useful
if it also explains the output of "zone-status" to the operator (such as
what is "refresh in" or "journal flush" in for).
We are aware of this. The zone-status output is the top candidate for
upcoming improvements.
About "zone-reload": is that only for master
zones, or slave zones, or
both? If it's for master zones, will they be reloaded based on the zone
file's mtime? Or will knot look at the serial number in the SOA record?
For both. The reload checks the zone file's mtime and reloads the zone
from disk if necessary. This applies to both master and slave zones.
For slave zones, a refresh/bootstrap is scheduled in addition to that.
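So, after editing a zone file in place (the path and zone name here are
placeholders), the change is picked up with:

  # the edit updates the file's mtime, which zone-reload checks
  vi /var/lib/knot/example.com.zone
  knotc zone-reload example.com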
About "zone-refresh": I assume that this
makes knot immediately query
the master, and if the serial numbers are the same, then no transfer is
done (like BIND's "rndc refresh"). This could be made explicit.
Right.
About "zone-retransfer": I assume that this
makes knot ignore any serial
number on the master, and transfer the zone anyway, like BIND's "rndc
retransfer". Again, this could be made explicit.
Exactly.
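Side by side, with a placeholder zone name:

  # query the master's SOA; no transfer if the serials match
  knotc zone-refresh example.com
  # transfer unconditionally, ignoring the serial on the master
  knotc zone-retransfer example.com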
About "zone-sign": the word
"resign" usually means "leave your job", so
it's probably best spelled as "re-sign" for clarity :)
I already resigned on naming commands. ;-) This one was originally named
'sign', but we changed it since signing is automatic and this command
just forces Knot to drop all existing signatures.
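In other words (zone name is a placeholder):

  # drop all existing signatures; automatic signing then re-signs the zone
  knotc zone-sign example.com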
> I have one more observation. The test server has been running for about
> 5 hours now. Of the 4682 zones configured, 2891 have still not been
> transferred in. I ran "knotc zone-refresh" and Knot appears to be trying
> to refresh zones. It appears to be doing some kind of batching: it does
> a bunch of zones, and then waits 5 seconds before doing another batch.
> The number of untransferred zones is going down, albeit slowly.
>
> I think Knot 2's slave zone refresh strategy and timing still need more
> work if it's to work effectively for a configuration with lots of slave
> zones.
The thing is that we didn't change anything in the transfer scheduling
between 1.6 and 2.0. I'll investigate this. If you find any additional
hints, please let us know.
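One quick way to see whether a particular zone has been transferred yet
is to compare SOA serials on the master and on the slave; the addresses
and zone name here are placeholders:

  # serial on the master vs. serial on the local slave
  dig @192.0.2.1 zone0001.example soa +short
  dig @127.0.0.1 zone0001.example soa +short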
Thank you again. :)
Cheers,
Jan