On 5/9/19 11:04 PM, Christoph wrote:
This time, kresd produced the following logs when it reached
cache.size (the tmpfs still had lots of free space).

22:00:15 kresd[11750]: [cache] MDB_BAD_TXN, probably overfull
22:00:15 kresd[11750]: [cache] clearing error, falling back
22:00:15 kresd[11750]: [cache] MDB_BAD_TXN, probably overfull
22:00:15 kresd[11750]: [cache] clearing because overfull, ret = 0

When this happened, kresd lost its entire cache (this is an assumption,
but at that moment the usage level of the tmpfs partition dropped to 0
before starting to increase again at the usual rate).

Yes, that's been the "normal" behavior of kresd, at least so far.  I don't think it's documented, but when the cache is full (i.e. a write fails because there's no space left), the only way to cope is cache.clear() - and in that state lmdb typically isn't able to commit *any* kind of change (not even the clearing), so there's the "falling back" that removes the files and starts a new cache.

We have a WIP on a "garbage-collecting" daemon that tries to remove data estimated as less useful once the cache is getting large, but so far typical deployments can afford setting the limit so large that it only fills up very rarely.
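
Just for illustration, something much cruder along those lines can already be done from the Lua config today - clearing the whole cache before it gets overfull.  The interval and the entry threshold below are made-up example values, and cache.count() only gives a rough number of entries, not bytes:

    -- rough sketch (not the WIP garbage-collecting daemon): periodically
    -- check the approximate number of cache entries and clear the whole
    -- cache before LMDB gets overfull; 60 * sec and 100000 are examples
    event.recurrent(60 * sec, function ()
        if cache.count() > 100000 then
            print('cache is getting large, clearing it')
            cache.clear()
        end
    end)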


I'm wondering if there is something special here in our setup
with tmpfs. Is anyone else putting their cache on tmpfs with kresd
4.0.0? (And can you reproduce this - or, just as useful, can you not?)

I think it's actually typical to use tmpfs, at least for more "serious" deployments.  We certainly do that on our Turris routers (Omnia and MOX), as the writes could otherwise wear out the flash storage quickly.  I suppose persisting the cache across system restarts isn't too useful :-)  and tmpfs is enough for persisting it across daemon restarts.
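
In case it helps as a reference point, such a setup is basically just pointing the cache directory at a tmpfs mount; the mount point and sizes below are example values, and the mount itself is done outside kresd (e.g. in /etc/fstab):

    tmpfs  /var/cache/knot-resolver  tmpfs  size=120M  0  0

and then in kresd's Lua config:

    -- size limit plus the cache path on the tmpfs mount; 100 MB is an
    -- example value, kept a bit below the tmpfs size for headroom
    cache.open(100 * MB, 'lmdb:///var/cache/knot-resolver')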

--Vladimir