Hello,
I'd like to ask for help with the following issue.
I am considering deploying knot-resolver in a DNS setup where it has to coexist with other DNS daemons, namely Unbound. I am running dnsdist in front of a pool of resolvers, to which I have just added the latest release (5.5.4) of knot-resolver. It receives a portion of the requests at a rate of about 500 qps. There are 6 kresd processes on a VM, with the cache in tmpfs; the cache size is 2 GB (8 GB mounted as tmpfs). The resolver is running Debian Bullseye. The setup serves real customers - the traffic is not artificial.
The configuration enables some modules:
modules = {
        'hints > iterate',      -- Load /etc/hosts and allow custom root hints
        'stats',                -- Track internal statistics
        'predict',              -- Prefetch expiring/frequent records
        'bogus_log',            -- DNSSEC validation failure logging
        'nsid',
        'prefill',
        'rebinding < iterate'
}
cache.size = cache.fssize() - 6*GB
modules = {
        predict = {
                window = 15,            -- 15 minutes sampling window
                period = 6*(60/15)      -- track last 6 hours
        }
}
policy.add(
        policy.rpz(policy.DENY, '/var/cache/unbound/db.rpz.xxx.cz', true)
)
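For clarity, the two derived values above work out as follows; this is just a sketch of the arithmetic, assuming cache.fssize() reports the size of the 8 GB tmpfs backing the cache:

-- cache sizing: 8*GB (tmpfs) - 6*GB = 2*GB, matching the 2 GB mentioned above
cache.size = cache.fssize() - 6*GB
-- predict: 15-minute sampling window, 6*(60/15) = 24 windows kept,
-- i.e. roughly the last 6 hours of traffic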
Carbon protocol reporting is also enabled, with a sample rate of 5 s. In terms of response time, the performance is equal to or even better than Unbound's. I measure the performance of dnsdist's backend servers (the daemons) - latency, dropped requests and request rate - via the Carbon protocol with a 5 s sample rate. The backend servers, kresd included, currently receive requests from just a few clients (the source IPs of the load balancers).
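For reference, the backend metrics are exported from dnsdist's side; the line below is roughly what that export looks like (the Carbon server address and the "dnsdist1" prefix are placeholders, not my real values):

-- dnsdist configuration (Lua): export backend latency/drops/queries every 5 s
carbonServer('192.0.2.10', 'dnsdist1', 5)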
What is wrong here is that dnsdist repeatedly reports packet drops for the kresd backend. The drops appear at roughly 15-minute intervals. When they occur, the drop rate rises from about 0.5 req/s (the normal baseline) to 5-10 req/s, which creates positive feedback and an increased packet rate; there is such a peak every 15 minutes. When I restart the kresd instances the drops go away and everything is stellar for a couple of hours.
I have no explanation for this. It is apparently not related to the queries themselves - I have tried several times to reproduce it by replaying traffic captured while the problem was occurring, without success. kresd can even handle 20k qps on this instance with no increase in the drop rate from the balancer's point of view. But after a while... The fact that restarting kresd helps immediately suggests there might be something wrong inside.
I have tried increasing the number of kresd processes, increasing the cache size, and disabling cache persistence. Nothing helps. I am pretty sure there are no network issues on the VM or in the surrounding network.
I can provide graphs of this behavior if needed, or a packet capture of the real traffic.
Many thanks
With best regards
Ales Rygl