Hello again.
I upgraded the affected VM to knot-resolver 6.0.11 / libknot15, and the problem is
still there: the same steady memleak, which is therefore unrelated to "blocky" :(
So we have two different issues:
– the steady memleak happening only with libknot15 (not libknot14), and only on one of the 3
VMs; its cause is still unidentified.
– a memleak that happened on 2 VMs, with versions 6.0.8/libknot14 and 6.0.11/libknot15,
when some queries were coming from one customer using "blocky".
I will restore this VM to 6.0.8 for now and hold off on any further upgrade.
If I find the time for it, I might create a fourth knot-resolver VM as a lab and divert
some customers/queries to it until I find which one is causing this… but again, the very
strange thing is that when the VM experiencing the problem is offline and its queries are
anycasted to one of the other VMs (with the latest knot-resolver/libknot), the memleak does
not occur (at least it has never happened so far), so a lab VM might not catch the issue.
Have a great week-end,
Gabriel
On 4 Apr 2025, at 15:50, oui.mages_0w--- via knot-resolver-users
<knot-resolver-users_at_lists_nic_cz_48qbhjm2vj0347_06aede6e(a)icloud.com> wrote:
Vladimir,
Huge progress!
During the night, our VM still on v6.0.8 crashed. Thankfully, exabgp did its job and the
anycast went directly to another VM.
This time, the second VM hit the memleak and rebooted (and the third took over while the
second restarted).
You can see the RAM usage increasing quite fast on VM1, then VM2; generally it is cleared
in time to prevent a crash, except around 6 am.
[Attached screenshots: Capture d'écran 2025-04-04 à 15.25.18.png, Capture d'écran 2025-04-04 à 15.25.50.png]
Now, the good news is that we have identified the source :)
We don’t know what the queries were, because it was DoH, but one of our customers was
using this:
https://github.com/0xERR0R/blocky
As soon as he stopped using it, no more memleaks or sawtooth graphs.
Sorry for the quick and dirty graph superposition below, but it shows the correlation:
– the blue line is the answer rate to this particular customer in pps,
– the purple line is the memory usage of the VM running knot-resolver,
– before 10 am this VM was offline, so anything before that is irrelevant.
[Attached screenshot: Capture d'écran 2025-04-04 à 15.36.30.png]
The customer stopped his blocky around 10:40 (I believe he might have restarted it
briefly between 10:50 and 11:35).
So in our case, blocky was the culprit behind the knot resolver memleaks.
Other knot-resolver users experiencing memleaks should check whether any requests are
coming from a blocky instance.
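If it helps, one rough way to find which clients to ask is to count queries per source IP
at the resolver. This is only a generic packet-capture sketch (the interface name and the
60-second window are assumptions), and it attributes client IPs only, not DoH query contents:

    # Count DNS/DoT/DoH packets per client IPv4 address over a 60-second capture (run as root).
    # Field 3 of tcpdump output is the source address with the port appended, hence the cut.
    timeout 60 tcpdump -l -nn -i eth0 'dst port 53 or dst port 853 or dst port 443' 2>/dev/null \
      | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head

The top talkers are the customers worth contacting about whether they run blocky.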
My next step will be to upgrade this VM and confirm that there are no memleaks anymore
even with version > 6.0.8 and libknot15.
All the Best,
Gabriel
On 3 Apr 2025, at 11:43, Vladimír Čunát via knot-resolver-users
<knot-resolver-users_at_lists_nic_cz_48qbhjm2vj0347_06aede6e(a)icloud.com> wrote:
On 02/04/2025 23.19, oui.mages_0w(a)icloud.com wrote:
So knot-resolver 6.0.8 with libknot15 also seems to trigger the memory leak I was
experiencing with knot-resolver 6.0.9+ under the unidentified traffic pattern (or whatever
is causing this).
Thanks, this is very interesting. I confirm that, for our Ubuntu 24.04 packages, libknot15
(i.e. knot 3.4) is used exactly since 6.0.9, so the timing checks out, too. That's just a
matter of binary builds: even the latest versions can still be built with libknot14 (3.3.x).
Have you looked into which libdnssec and libzscanner you have there? The thing is that
these two didn't change soname between knot 3.3 and 3.4, so here I see larger risks
than with libknot itself.
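In case a concrete check helps, here is a minimal sketch, assuming a Debian/Ubuntu install
and the default kresd binary name, to see which of these libraries the resolver actually uses:

    # List the knot libraries kresd is linked against, then the installed package versions
    # (libdnssec/libzscanner kept the same soname across knot 3.3 and 3.4, so the package
    # version, not the soname, shows which branch they come from).
    ldd "$(command -v kresd || echo /usr/sbin/kresd)" | grep -E 'libknot|libdnssec|libzscanner'
    dpkg -l | grep -E 'libknot|libdnssec|libzscanner'   # Debian/Ubuntu

Comparing this output between the VM that leaks and the ones that don't might narrow it down.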