TL;DR
Set hot_standby_feedback = on to get rid of replication delay.
All the documentation on replication I found only talked about network throughput, disk I/O and change rate. I had a strong intuition that these had nothing to do with our replication delay.
---
The above is a graph of our replicas' replication lag as a stacked-area, log-scale chart. Why I chose stacked area and log scale is a bigger discussion, but suffice it to say, visually it is far more powerful than a line chart and/or a linear scale.
We have 2-4 replicas depending on the day.
There are 4 delay-time indicator lines:
1 second: black
1 minute: green
1 hour: orange
1 day: red
PS: the labels are a bit messed up, more work to do, but I want to get this blog out.
We can see that on Tuesday 15 there is a spike in delay of up to 1 DAY! Holy hell, what's going on?
The second graph in the image, the one below Replication Delay, is the longest-running query/transaction.
We can see a clear correlation.
Before I had these two graphs I was flying blind.
Replication delay is not in the default Datadog DBM graphs AFAIK.
And when I created the Replication Delay graph, the max query/txn time was on a different page.
I had gotten an intuitive hit that replication delay was correlated with transaction/query length on the replica, so I put the max transaction/query length below the Replication Delay graph, and the correlation became clearly apparent.
(Intuition comes from hours of poring over performance metrics and picking up a sixth sense of what is happening, which I can then test with empirical analysis.)
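If you don't have a DBM-style chart handy, a rough stand-in for the "longest running query/transaction" graph can be pulled straight from the replica. This is a minimal sketch, not the exact query Datadog uses:

-- Age of the longest-running query and transaction on the replica
-- (rough stand-in for the max query/txn time chart).
SELECT max(now() - query_start) AS max_query_age,
       max(now() - xact_start)  AS max_xact_age
  FROM pg_stat_activity
 WHERE state <> 'idle';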
I reached out to Nikolay Samokhvalov and asked him if he had any ideas. His response was:
1) "it seems to depend on how long queries are running" -- this sounds like hot_standby_feedback is off. Is it (on replica). If it's off, then any long-running transaction on replica is going to block replication, this is expected. Solution is set it to "on" (but then long-running transactions will be reported to the primary and affect autovacuum -- so we need to choose the lesser of two evils).
2) if hot_standby_feedback=on already, I check if DDL happening on the primary + long transactions on replica has correlation with high lags
3) are replicas the same size as primary, and shared_buffers value also the same (if it's smaller, the lag can be related to it -- quite simple reason though, smaller cache)
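To work through that checklist, the relevant settings can be read directly on each replica. A minimal sketch; the names are standard PostgreSQL GUCs, and shared_buffers covers item 3:

-- On the replica: check the settings Nikolay's checklist refers to.
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('hot_standby_feedback',
                'max_standby_streaming_delay',
                'shared_buffers');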
I have yet to monitor the impact on vacuuming. This is a shadow in the back of my mind that I have to dig into.
One important point: community is crucial to running software productively. When your team wants the shiny new bauble but knows no one who uses it personally, stay away from running it on production systems.
Zooming in on our replication delay, we can see replication lag commonly hitting 1 minute and occasionally hitting 1 hour or even 1 day.
After setting hot_standby_feedback = on on Thursday the 15, the replication delay is under 1 second.
Phewww!!
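For reference, this is roughly how the change can be applied; a sketch assuming superuser access on the replica. hot_standby_feedback only needs a config reload, not a restart:

-- On the replica: turn on hot_standby_feedback and reload the config.
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
SHOW hot_standby_feedback;  -- should now report "on"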
Here is the JSON for the Replication Delay Chart
{
"viz": "timeseries",
"requests": [
{
"style": {
"palette": "dog_classic",
"type": "solid",
"width": "normal"
},
"type": "area",
"formulas": [
{
"formula": "default_zero(query1)"
}
],
"queries": [
{
"name": "query1",
"data_source": "metrics",
"query": "avg:postgresql.replication_delay{*} by {host}"
}
],
"response_format": "timeseries"
}
],
"yaxis": {
"scale": "log",
"include_zero": false
},
"markers": [
{
"label": " 1 sec ",
"value": "y = 1",
"type": "info bold"
},
{
"label": " 1 minute ",
"value": "y = 60",
"type": "ok bold"
},
{
"label": " 1 hour ",
"value": "y = 3600",
"type": "warning bold"
},
{
"label": " 1 day ",
"value": "y = 86400",
"type": "error bold"
}
]
}
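If you want to sanity-check the metric behind that chart on the replica itself, a common approximation is the age of the last replayed transaction. A sketch, assuming PostgreSQL 10+ function names; this is not necessarily the exact formula the Datadog integration uses:

-- Approximate replication delay in seconds, measured on the replica.
-- Reports 0 when all received WAL has been replayed.
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS replication_delay_seconds;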
Also from Nikolay, to follow up on:
Btw, speaking of replication lags,
mentioned an interesting case when we chatted at PGCon a few weeks ago: if on the primary, we have a long-running DDL, then an exclusive lock is going to be acquired on replica, lasting long, blocking SELECTs. I was going to reproduce it and write about it, to have demo, but couldn't find time so far unfortunately.
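While waiting for that write-up, one way to see the symptom is to look on the replica for sessions stuck behind a replayed lock (for example an AccessExclusiveLock coming from DDL on the primary). A hedged sketch using pg_stat_activity:

-- On the replica: sessions currently waiting on a lock,
-- e.g. behind an AccessExclusiveLock replayed from primary DDL.
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS waiting_for, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock';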
Discussion on Twitter
The discussion centers around the balancing act of optimizing database performance by managing queries on a replica (standby) database, while also ensuring that the primary database is not negatively affected. It touches upon PostgreSQL-specific settings such as hot_standby_feedback and max_standby_*_delay, which control this behavior.
The conversation starts with the assertion that if hot_standby_feedback is turned off, any long-running query on the replica database would be canceled if it interferes with the recovery process.
Kyle Hailey responds that this is possible if the system is configured to behave in that manner (not the config I had).
Vignesh Ravichandran expresses surprise, thinking that this was the default setting for streaming replication in PostgreSQL. He asks for further guidance on how to configure these settings.
Vik Fearing steps in to clarify that the default setting is as Vignesh thought. However, in this particular scenario, the max_standby_*_delay parameters were disabled. Vik explains that a standby database is generally configured either for failover (acting as a backup in case the primary database fails) or for running queries. In this case, the standby was configured for running queries. Vik points out that turning on hot_standby_feedback can lead to a false sense of security as it only helps resolve one type of conflict.
Vignesh acknowledges Vik's explanation, agreeing that it explains how a long-running query could cause lag on the standby. He also expresses his preference for keeping hot_standby_feedback off and setting a reasonable standby_*_delay parameter to minimize impact on the primary database.
Finally, Vik confirms that the explanation about disabling max_standby_*_delay was not in Kyle's blog, and he also prefers to keep hot_standby_feedback off to avoid any performance issues on the primary database caused by unnecessary queries.
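The two approaches in that exchange map to two different knobs on the replica. A hedged sketch of each; the delay value is illustrative, not a recommendation:

-- Option A: feed replica query activity back to the primary (what this post did).
ALTER SYSTEM SET hot_standby_feedback = on;

-- Option B (Vik's and Vignesh's preference): keep feedback off and let the
-- replica cancel conflicting queries after a bounded delay.
ALTER SYSTEM SET hot_standby_feedback = off;
ALTER SYSTEM SET max_standby_streaming_delay = '30s';  -- illustrative value

SELECT pg_reload_conf();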
@BertrandDrouvot FWIW, as of #PostgreSQL 14 there is also log_recovery_conflict_waits that can be helpful to diagnose delay (see https://bdrouvot.github.io/2021/09/30-report-long-recovery-conflict-wait-times-with-PostgreSQL-14/…). It adds info in the logfile like the WAL record the recovery is blocked on.
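Turning that on is a one-liner on the replica; a minimal sketch, assuming PostgreSQL 14 or later:

-- PostgreSQL 14+: log when WAL replay waits longer than deadlock_timeout
-- because of a recovery conflict.
ALTER SYSTEM SET log_recovery_conflict_waits = on;
SELECT pg_reload_conf();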