Hello,

I'd like to ask for help with the following issue.

I am considering deploying knot-resolver in a DNS solution where it will coexist with other DNS daemons, namely Unbound. I am running dnsdist in front of a pool of resolvers, to which I have just added the latest release (5.5.4) of knot-resolver. It receives a portion of the requests at a rate of about 500 qps. There are 6 kresd processes on a VM with the cache in tmpfs. The cache size is 2 GB (8 GB mounted as tmpfs). The resolver is running Debian Bullseye. The solution serves real customers - the traffic is not artificial.

The configuration enables some modules:

modules = {
        'hints > iterate',  -- Load /etc/hosts and allow custom root hints
        'stats',            -- Track internal statistics
        'predict',          -- Prefetch expiring/frequent records
        'bogus_log',        -- DNSSEC validation failure logging
        'nsid',
        'prefill',
        'rebinding < iterate'
}

cache.size = cache.fssize() - 6*GB  -- 8 GB tmpfs minus 6 GB => 2 GB cache

modules = {
        predict = {
                window = 15, -- 15 minutes sampling window
                period = 6*(60/15) -- track last 6 hours
        }
}

policy.add(
    policy.rpz(policy.DENY,
               '/var/cache/unbound/db.rpz.xxx.cz',
               true)
)

Carbon protocol reporting is enabled with a sample interval of 5 s. Response-time performance is equal to or even better than Unbound's. I am measuring the performance of dnsdist's backend servers (daemons) - latency, dropped requests, and request rate - via the Carbon protocol with a 5 s sample interval. The backend servers, kresd included, are currently receiving requests from just a few clients (the source IPs of the balancers).
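For completeness, the Carbon export described above corresponds to kresd's graphite module; a minimal sketch of that part of the configuration (the host address and metric prefix here are placeholders, not taken from my real setup):

```lua
modules = {
        graphite = {
                prefix = hostname() .. '.kresd', -- per-instance metric prefix
                host = '192.0.2.10',             -- placeholder Carbon server address
                port = 2003,                     -- plaintext Carbon port
                interval = 5 * sec               -- 5 s sample interval, as mentioned above
        }
}
```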

What is wrong here is a pattern of repeated packet drops reported by dnsdist for the kresd instance. It appears at roughly 15-minute intervals. When it occurs, the drop rate increases from about 0.5 req/s (the normal level) to 5-10 req/s, which creates positive feedback and an increased packet rate. These peaks happen every 15 minutes. When I restart the kresd instances, the drops go away and everything is stellar for a couple of hours.

I have no explanation for this. It is apparently not related to the requests themselves - I have tried several times to reproduce it by replaying traffic captured while the problem was occurring, without success. kresd can even process 20k qps on this instance with no increase in the drop rate from the balancer's point of view. But after a while... The fact that restarting kresd helps immediately suggests that something is going wrong internally.

I have tried increasing the number of kresd processes, increasing the cache size, and disabling cache persistence. Nothing helps. I am pretty sure there are no network issues on the VM or in the surrounding network.

I can provide graphs of this behavior if needed, or a packet capture of the real traffic.


Many thanks

With best regards

Ales Rygl