Hello,
I recently deployed knot-resolver as the main resolver for my network with the following config:
-- modules
modules = {
  'policy',
  'view',

  'workarounds < iterate',
  'stats', -- required by predict module
  'predict',
  'http',
}

-- network
net.ipv6 = false
net.listen('0.0.0.0', 53, { kind = 'dns' })
net.listen('0.0.0.0', 8453, { kind = 'webmgmt' })

-- permissions
user('knot-resolver','knot-resolver')

-- cache, stats, performance optimizations
cache.open(950*MB, 'lmdb:///cache/knot-resolver')
cache.min_ttl(600)

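-- prefetching: 5-minute sampling window, 12 windows kept, i.e. roughly the last hour of traffic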
predict.config({
    window = 5,
    period = 1*(60/5)
  })


-- limit access with ACLs
view:addr('127.0.0.0/8', policy.all(policy.PASS))
view:addr('172.16.0.0/12', policy.all(policy.PASS))
view:addr('10.0.0.0/8', policy.all(policy.PASS))

view:addr('redacted', policy.all(policy.PASS))
view:addr('redacted', policy.all(policy.PASS))
view:addr('redacted', policy.all(policy.PASS))

view:addr('0.0.0.0/0', policy.all(policy.DROP))

-- query resolving policies
policy.add(policy.rpz(policy.ANSWER, '/run/knot-resolver/custom.rpz', true)) 
policy.add(policy.all(policy.FORWARD({
  "9.9.9.9",
  "8.8.8.8",
  }))) 

The process runs inside a Docker container with a tmpfs properly attached at the /cache mountpoint.
The setup is monitored with Prometheus, and around 21:55 I received a notification that the exporter was down; to be exact, the scrape failed with "context deadline exceeded" (scrape_interval: 15s).

I connected to the control socket and dumped the following stats:

> worker.stats()
{
    ['concurrent'] = 1,
    ['csw'] = 31436648,
    ['dropped'] = 16081,
    ['err_http'] = 0,
    ['err_tcp'] = 0,
    ['err_tls'] = 0,
    ['err_udp'] = 0,
    ['ipv4'] = 9950850,
    ['ipv6'] = 0,
    ['pagefaults'] = 1,
    ['queries'] = 30978879,
    ['rss'] = 166846464,
    ['swaps'] = 0,
    ['systime'] = 2681.109337,
    ['tcp'] = 232596,
    ['timeout'] = 139212,
    ['tls'] = 0,
    ['udp'] = 9718254,
    ['usertime'] = 6298.521581,
}
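
For reference, assuming 'timeout' and 'dropped' count client-facing requests, that works out to roughly 1.4% timed out and 0.16% dropped out of the ~9.95M IPv4 requests; I got the ratios on the control socket with something like:

> w = worker.stats()
> print(string.format('timeout %.2f%%  dropped %.2f%%', 100 * w.timeout / (w.udp + w.tcp), 100 * w.dropped / (w.udp + w.tcp)))
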
> cache.stats()
{
    ['clear'] = 1,
    ['close'] = 1,
    ['commit'] = 23279836,
    ['count'] = 20349,
    ['count_entries'] = 340222,
    ['match'] = 0,
    ['match_miss'] = 0,
    ['open'] = 2,
    ['read'] = 144512076,
    ['read_leq'] = 10170653,
    ['read_leq_miss'] = 4988410,
    ['read_miss'] = 25166216,
    ['remove'] = 0,
    ['remove_miss'] = 0,
    ['usage_percent'] = 11.396792763158,
    ['write'] = 25742556,
}
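
For what it's worth, the LMDB-level read miss ratio comes out at about 17% (read_miss / read); assuming these counters simply accumulate since the last clear, it can be computed like:

> c = cache.stats()
> print(string.format('read miss ratio: %.1f%%', 100 * c.read_miss / c.read))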

When I opened the web interface, I noticed a steady decline in requests.

[attachment image.png: request rate over time from the web interface]

I think this means that clients have mostly moved to the secondary DNS resolver.

Also, when reviewing the data gathered in the Grafana dashboard, I noticed that slow (250 ms+) queries now outnumber the fast ones.
[attachment image.png: Grafana latency breakdown]
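
If it helps with diagnosing this, I can also dump the per-answer latency buckets straight from the control socket (and, if I remember the API correctly, stats.upstreams() for recent upstream RTTs):

> stats.list('answer')

That should break answers down into counters like answer.100ms, answer.250ms and answer.slow, so I can cross-check the tail that Grafana shows.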

Any advice on what happened here?

Best regards,
Łukasz Jarosz