Hi Libor,
On 31.08.22 13:14, libor.peltan wrote:
What rrsig-refresh is actually for is to refresh RRSIGs soon enough
that they don't expire due to delays in:
1) propagation among authoritative servers, that means synchronization
of secondaries with primaries, including e.g. the lengthy process of
signing itself (in case of huge zone)
2) propagation to resolvers' caches
When I thought about this, I actually saw that (1) is exactly
propagation-delay and (2) is exactly the RRSIG's TTL. Setting
rrsig-refresh default to the sum of both values was a logical conclusion.
I see, I guess from a knot standpoint this makes sense, but I feel this
does not take into account any potential delays that are caused by
operational issues.
To paint a very simplified picture of our own architecture:
* We use Puppet configuration management to create/maintain the Knot
configuration on all involved servers
* We have a hidden primary, that holds the zonedata and does the signing
* We have public secondaries that sync these zones via the normal
TSIG/AXFR/IXFR protocol
* Zonedata updates are pushed to the hidden primary via a CI pipeline
So for the actual "public" facing service, only the secondaries are
relevant as long as we do not need to change zonedata. That means the
hidden primary also has no redundancy built in. If it breaks, we will
simply redeploy it with puppet, rerun the pipeline and we are up and
running again. However, this would take time, depending on when the
outage occurs. So having signatures refreshed well before they expire
gives us some headroom in which the secondaries can serve the current
zonedata without depending on the primary.
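To put rough numbers on this headroom, here is a minimal Python sketch. It assumes that signatures are refreshed once their remaining validity drops below rrsig-refresh, so the secondaries always hold signatures that are valid for at least roughly that long. The function name outage_headroom, the one-hour RRSIG TTL and the seven-day rrsig-refresh are made up for illustration; only the one-hour propagation-delay comes from this thread.

```python
from datetime import timedelta


def outage_headroom(rrsig_refresh, propagation_delay, rrsig_ttl):
    """Rough worst-case time the secondaries can serve valid signatures
    while the primary is down (hypothetical helper, simplified model)."""
    # Worst case: the primary fails just before it would have re-signed, so
    # the remaining validity is about rrsig_refresh. Propagation to the
    # secondaries and caching on resolvers eat into that window.
    headroom = rrsig_refresh - propagation_delay - rrsig_ttl
    return max(headroom, timedelta(0))


propagation_delay = timedelta(hours=1)  # one-hour default mentioned in this thread
rrsig_ttl = timedelta(hours=1)          # assumption: example TTL

# rrsig-refresh set to the sum of both values, as described above
default_refresh = propagation_delay + rrsig_ttl
print(outage_headroom(default_refresh, propagation_delay, rrsig_ttl))
# prints 0:00:00

# An explicit, larger rrsig-refresh (assumption: 7 days)
print(outage_headroom(timedelta(days=7), propagation_delay, rrsig_ttl))
# prints 6 days, 22:00:00
```

Under this simplified model, a refresh value equal to the sum leaves essentially no room for an outage of the primary, while an explicit larger value leaves several days.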
Another issue I can think of could be temporary network issues between
the primary and the secondaries.
I'd say that the setting of propagation-delay is still in your hands,
as well as setting a non-default rrsig-refresh. The only disadvantage of
a too-high rrsig-refresh is that zone signing takes place more often and
creates larger change-sets to be propagated to secondaries. In other
words, it uses more of all resources (CPU, memory, disk, network).
For our deployment this is not really a concern. We do not have huge
zones, we just have many of them. Also, they are mostly static. So
signing performance was never an issue until now.
I would probably prefer setting a higher rrsig-refresh over increasing
propagation-delay; it seems clearer to me what it does.
Propagation delay for me is the time it takes during normal operations
for all primaries and secondaries to be in sync, plus some margin for
taking into account caching on resolvers. On top of that I'd like to
have some sort of safety margin against operational issues, so setting
rrsig-refresh is probably the way we will go in the future.
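As a sketch of how such an explicit value could be derived, the following Python snippet adds a safety margin on top of propagation delay and RRSIG TTL, and also shows the trade-off mentioned earlier: the larger rrsig-refresh is, the more often signatures get renewed. The helper plan_rrsig_refresh and all numbers except the one-hour propagation-delay are assumptions for illustration, including the 14-day signature lifetime and the three-day margin.

```python
from datetime import timedelta


def plan_rrsig_refresh(propagation_delay, rrsig_ttl, safety_margin, rrsig_lifetime):
    """Return (rrsig_refresh, resign_interval) for the given inputs
    (hypothetical helper, not part of Knot)."""
    refresh = propagation_delay + rrsig_ttl + safety_margin
    if refresh >= rrsig_lifetime:
        raise ValueError("rrsig-refresh must be shorter than the signature lifetime")
    # A signature is renewed once its remaining validity drops below
    # `refresh`, so each signature lives for about lifetime - refresh.
    resign_interval = rrsig_lifetime - refresh
    return refresh, resign_interval


refresh, interval = plan_rrsig_refresh(
    propagation_delay=timedelta(hours=1),  # one-hour default from this thread
    rrsig_ttl=timedelta(hours=1),          # assumption: example TTL
    safety_margin=timedelta(days=3),       # assumption: time to rebuild the primary
    rrsig_lifetime=timedelta(days=14),     # assumption: example signature lifetime
)
print(int(refresh.total_seconds()))  # prints 266400 (value in seconds)
print(interval)                      # prints 10 days, 22:00:00
```

The resulting number of seconds is what would be set explicitly as rrsig-refresh in the signing policy, instead of relying on the default.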
This all makes me wonder whether the one-hour default of
propagation-delay is maybe not optimal...?
Please let me know your ideas/opinions in more detail. Any real
operational experience is very very valuable for us!
As already said, at least to me, propagation-delay is not something I
would associate with operational issues. I would expect all my primaries
and secondaries to be in sync during normal operation well within the
default of one hour.
I guess choosing default values is always hard and I do not have an
issue with making our configuration more explicit to cover our specific
use case. I just wish this change in behavior, which is quite
significant at least for us, had been made a bit more obvious in the
changelog. It caught us by surprise :)
Regards
André