On 11 March 2013 17:43, Anand Buddhdev <anandb(a)ripe.net> wrote:
On 11/03/2013 15:27, Marek Vavruša wrote:
Hi Marek,
Well, you probably hit the weak spot of the current
implementation.
We regularly test bootstrap speed of about 5k small zones and it
finishes in about 1 minute or so,
Oh, interesting. How many parallel XFRs does Knot try? Our master is
configured to allow a maximum of 500 XFRs in parallel, but of course
there are other clients as well, so Knot would get a share of that. And
then the master will refuse additional connections.
There is no finite upper bound. At any time only 3 transfers can be
processed, but
others may be pending, for example waiting for data. When a
transfer is pending for a long
time without data, it gets discarded (I think it's about 5 minutes
between packets).
The congestion is "solved" really primitively using jittered timers,
but that may or may not work
and gives no guarantee; that's why I want to rework it.
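
To make the jittered-timer idea concrete, here is a rough sketch in
Python. It is not the Knot code: the function name, the step and the
jitter range are made up for illustration. Each zone gets a nominal
start time plus a random offset, so the requests do not all hit the
master at the same instant.

```python
import random


def jittered_start_times(zones, step=0.5, max_jitter=2.0, seed=None):
    """Return {zone: start offset in seconds}.

    Zone number i is nominally started at i * step seconds; a random
    offset in [0, max_jitter] is added on top. All values here are
    illustrative assumptions, not Knot's real timers.
    """
    rng = random.Random(seed)
    return {
        zone: i * step + rng.uniform(0.0, max_jitter)
        for i, zone in enumerate(zones)
    }


# Hypothetical zone names, printed in the order they would be started.
zones = [f"zone{i}.example." for i in range(5)]
schedule = jittered_start_times(zones, seed=1)
for zone, start in sorted(schedule.items(), key=lambda kv: kv[1]):
    print(f"{start:6.2f}s  AXFR {zone}")
```

Note that this only spreads the start times. Nothing limits how many
transfers end up running at once, which is why it gives no guarantee.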
but the
problem is that this is done over 1GbE. The thing is, we do
not handle congestion very efficiently when there are a large
number of larger zones or the line is slower.
In our case, we have a mixture of zones. Some are small, while others
are quite large. Additionally, not all the zones can be loaded. For many
zones, the master replies with SERVFAIL, because the upstream master of
that zone has not provided a zone transfer, and the zone has gone stale
on our intermediate distribution master.
The connection between our Knot instance and the master is a 1 GbE
connection, but as I explained, the master cannot cope with a thundering
herd of incoming AXFR requests from Knot.
I see, that would be the case.
In the current implementation, bootstrap requests
are scheduled with a
jittered timer and some stepping,
but over non-ideal lines it may happen that the transfer rate is
slower, packets may be lost, connections may be interrupted and so on.
We are working on a new implementation with a fixed queue, that would
handle this situation efficiently (it will be self-throttling) but it
probably won't get into 1.2.0.
For what it's worth, the problem is most evident on a bootstrap. Once
you have most of the zones and
reasonable refresh timers, it will get up to speed again. Sorry for that.
Okay, fair enough. We don't expect to have to bootstrap a server too
often, but when we do have to, it's not ideal to have to wait so long
for it to be ready, so better queuing of the AXFR requests would be good.
Any idea which release you expect to put this code in?
Regards,
Anand
I'll try my best to put it into 1.3.
Kind regards,
Marek