Postgres, Kafka, and a mysterious 100 GB

mattritter.me

105 points by ritmatter 4 years ago · 23 comments

nja 4 years ago

Debezium is a useful tool, but it requires a lot of babysitting. If the DB connection blips or DNS changes (say, if you just rebuilt your prod db), or in some other cases, it'll die and present this exact problem. Fortunately, it's easy to enable a "heartbeat" topic you can alert on, so the connector gets restarted before the db disk fills (of course, db size growth alerts are critical too).

We've found that it's worth it for most use cases to switch to a vanilla JDBC Kafka Connector with frequent polling. This also allows for cases such as emitting joined data.

Other than Debezium, Postgres + Kafka + Kafka Connect builds a pretty stable system for sending data around all our different dbs, apps, and data lakes.
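
For reference, registering a polling JDBC source connector like the one described above usually comes down to a single POST against the Kafka Connect REST API. A minimal sketch in Python, assuming the Confluent JDBC source connector is installed; the connector name, connection URL, table, and column names are placeholders:

  # Sketch: register a polling JDBC source connector via the Kafka Connect REST API.
  # Connector name, connection URL, table, and column names are assumptions.
  import requests

  config = {
      "name": "orders-jdbc-source",
      "config": {
          "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
          "connection.url": "jdbc:postgresql://db-host:5432/app?user=etl&password=secret",
          "mode": "timestamp+incrementing",      # pick up new and updated rows
          "timestamp.column.name": "updated_at",
          "incrementing.column.name": "id",
          "table.whitelist": "orders",
          "topic.prefix": "jdbc.",
          "poll.interval.ms": "5000",            # poll frequently, as described above
      },
  }

  resp = requests.post("http://localhost:8083/connectors", json=config, timeout=10)
  resp.raise_for_status()
  print(resp.json())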

  • gunnarmorling 4 years ago

    Which version of Debezium was it you encountered these issues with? Connection losses should not be a problem with any current version, as the connectors will restart automatically in that case.

    Agreed though that monitoring should be in place, so as to be notified of failed connectors early on (this could be based on the heartbeat topic, but there are also JMX metrics which can be exposed to Prometheus/Grafana, and/or health checks could be set up based on the connector's status as exposed via the Kafka Connect REST API).
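
    A minimal health check along those lines, assuming a hypothetical connector name and the default Kafka Connect port, could be as simple as polling the status endpoint (a sketch in Python):

      # Sketch: flag a connector whose connector or task state is not RUNNING.
      # "inventory-connector" and the Connect URL are assumptions.
      import requests

      status = requests.get(
          "http://localhost:8083/connectors/inventory-connector/status", timeout=10
      ).json()

      states = [status["connector"]["state"]] + [t["state"] for t in status.get("tasks", [])]
      if any(s != "RUNNING" for s in states):
          print(f"ALERT: connector unhealthy: {states}")  # wire this into your alerting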

    On the matter of disk growth, there's no silver bullet here. Some people will want to make 100% sure that events are never ever missed, which implies the replication slot must hold onto those WAL segments while it's not being read (this is not specific to Debezium, btw). Others may be willing to accept missing events if the slot isn't read for long enough, so those WAL segments can be discarded. In recent Postgres versions, a max size (or age, not sure) can be configured for a replication slot, so which behavior you want is a matter of configuration.

    In any case, connector downtime of longer than, say, a few hours is something that should show up as an alert, allowing you to take action.

    Disclaimer: I work on Debezium

  • milanogc 4 years ago

    How do you deal with late commits that have smaller identifiers? Using the incrementing mode, those records will be skipped.

    • higeorge13 4 years ago

      This, and not being able to track deleted records, is what requires a solution like Debezium.

aeyes 4 years ago

The write activity every 5 minutes is standard Postgres checkpointing, the default value for checkpoint_timeout is 5 minutes. This is not limited to RDS.

Background processes like vacuum and analyze also write to WAL.
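
For anyone who wants to confirm this on their own instance, the relevant settings can be read with a quick query; a sketch using psycopg2, with a placeholder connection string:

  # Sketch: read the checkpoint-related settings mentioned above.
  import psycopg2

  conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder DSN
  with conn, conn.cursor() as cur:
      cur.execute("""
          SELECT name, setting, unit
          FROM pg_settings
          WHERE name IN ('checkpoint_timeout', 'max_wal_size', 'checkpoint_completion_target')
      """)
      for name, setting, unit in cur.fetchall():
          print(name, setting, unit or "")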

asjfkdlf 4 years ago

Had this exact thing happen in production when we turned off an audit DB replication slot. We got lucky and caught it before our entire app went down. It’s one of the many foot-guns we have found with Postgres.

  • nieve 4 years ago

    I'm not sure telling a replication system to keep copies of all changes without limits until they're replicated, and then not letting them replicate, is much of a footgun. You're getting exactly what you asked for, and any system with this setup is eventually going to start filling up space if the replication or connector goes down. It's like complaining that your kitchen floor got wet when you plugged the sink and left the water running - it's an unpleasant surprise, and maybe you wanted a sink with an emergency overflow outlet, but it's the obvious outcome.

  • j16sdiz 4 years ago

    Depends on your audit requirements; halting operation when auditing is unavailable may be desirable.

  • n49o7 4 years ago

    What other footguns have you found?

gregplaysguitar 4 years ago

I heard an interesting comment recently from Derek Collison (creator of NATS[1]) that durability and delivery requirements can have the unwanted side-effect that one consumer can adversely impact all the others. It didn’t immediately make sense then, but this seems like a succinct illustration of the point!

[1] https://NATS.io

bufferoverflow 4 years ago

I still don't get what caused 100GB of logs. Idle events every 5 minutes should not generate that much data, so what am I missing?

  • merb 4 years ago

    When turning on logical replication, Postgres keeps all WAL files until they have also been applied by the downstream consumer. If you don't use logical replication, they get cleaned up according to wal_keep_segments and max_wal_size (which have sane defaults, so there is nothing to worry about).

    Btw, a good idea is to keep these WAL files, or at least archive them along with a basebackup, because that enables PITR (point-in-time recovery) (https://www.postgresql.org/docs/9.3/continuous-archiving.htm...) via wal-g or wal-e.

    Btw, the bigger mistake here is that a lot of people do not monitor their database size, which is a huge problem (a monitoring sketch follows below).

    Edit: ah, and WAL files are always 16 MB (unless compiled differently).

    > There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. Ordinarily, when old log segment files are no longer needed, they are recycled (renamed to become the next segments in the numbered sequence). If, due to a short-term peak of log output rate, there are more than 3 * checkpoint_segments + 1 segment files, the unneeded segment files will be deleted instead of recycled until the system gets back under this limit.
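
    As a concrete take on the monitoring point above, the amount of WAL a slot is holding back can be watched with a query against pg_replication_slots; a sketch with psycopg2, where the DSN and threshold are placeholders:

      # Sketch: alert when a replication slot retains too much WAL (Postgres 10+ function names).
      import psycopg2

      LIMIT_BYTES = 5 * 1024**3  # arbitrary 5 GB threshold; tune for your disk

      conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder DSN
      with conn, conn.cursor() as cur:
          cur.execute("""
              SELECT slot_name,
                     pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
              FROM pg_replication_slots
          """)
          for slot_name, retained in cur.fetchall():
              if retained is not None and retained > LIMIT_BYTES:
                  print(f"ALERT: slot {slot_name} holds back {retained / 1024**2:.0f} MB of WAL")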

    • mattashii 4 years ago

      Yes, but this system seems to have generated WAL at a very high rate for something that is a test system with no load.

      It went from effectively 0 to 97 GB in less than 6 days, which is about 673 MB/hr, 11 MB/min, or 58 MB per checkpoint interval.

      If we consider that most, if not all, of this traffic would be checkpoint-related full-page writes, that's still some 7000 distinct pages being modified every 5 minutes. Even accounting for pg_stat_statements recording and writing out all activity, modifying ~7000 blocks is a lot for what I'd expect to be repetitive, templated maintenance tasks.
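
      A quick back-of-the-envelope check of those figures (taking 1 GB as 1000 MB; using 1024 gets you closer to the 58 MB figure):

        # Sketch: sanity-check the WAL rate quoted above.
        total_mb = 97 * 1000              # ~97 GB of WAL
        hours = 6 * 24                    # accumulated over ~6 days
        per_hour = total_mb / hours       # ~674 MB/hr
        per_minute = per_hour / 60        # ~11 MB/min
        per_checkpoint = per_minute * 5   # ~56 MB per 5-minute checkpoint interval
        pages = per_checkpoint * 1024 / 8 # ~7200 8 KB pages per checkpoint
        print(per_hour, per_minute, per_checkpoint, pages)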

    • whartung 4 years ago

      So this was 21 days of idle chatter? One WAL segment per 5-minute event at 16 MB per segment: 100 GB / 16 MB = 6250 segments, 6250 * 5 min = 31,250 min, and 31,250 min / 1440 min/day ≈ 21ish days.

      Is that what happened?

      • ritmatterOP 4 years ago

        In the post there is actually a graph showing that Transaction Logs filled the database in just 6 days (11/24 through 11/30). This doesn't add up to ~100 GB if we assume that each event takes up 16MB of space, but maybe each event took up even more storage than that.

      • merb 4 years ago

        Probably. Well, in his timeline it was a couple of weeks less, but without more information (which is probably lost) it would be hard to tell whether the problem was made worse by other stuff accumulating. But yeah, WAL files are probably the culprit here. Btw, most of the time it's not just AWS RDS; as somebody else stated, a few things in Postgres do write data into WAL files.

        There is a lot more to it, and depending on the config and replication it can behave a little bit differently, i.e. under normal use these files can get reused depending on min_wal_size/max_wal_size, etc. It's quite a complex thing and normally does not need to be fine-tuned.

        I'm also pretty sure the behavior changed in more recent versions (which probably made it worse in this scenario, but gives more default safety for everybody).

tbrock 4 years ago

We ran into this too; I actually think it's a terrible Postgres default. Logical replication slots should have timeouts where if you haven't read the WAL record in say 24 hours it should be dropped. Make it configurable and set a sane default, problem solved.

You'd have to resync the followers/secondaries; big deal, it's way better than the primary going down because its disk filled up. This failure mode is awful. On RDS it's relatively painless because you can snap your fingers and have more disk, but if you are running it yourself? Good luck.

In practice, MongoDB's oplog mechanism, for example, which acts as a circular buffer with a set size, is a much more tolerant implementation. If the oplog rolls over before you've read it, you just resync, but at most it's taken up 10% of your disk.

  • Izkata 4 years ago

    > Logical replication slots should have timeouts where if you haven't read the WAL record in say 24 hours it should be dropped. Make it configurable and set a sane default, problem solved.

    In other words, replication without using replication slots. Postgres has had it for a long time, although the limit is set in disk usage rather than time.

    (unless specifically logical replication and/or debezium don't support this?)

    • voganmother42 4 years ago

      Postgres replication slots support a max size (since 13) specifically to deal with this scenario (prior to that you needed to set up monitoring/intervention outside the db).
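
      On 13+ the setting is max_slot_wal_keep_size, and pg_replication_slots also reports whether a slot is at risk; a sketch with psycopg2, with a placeholder DSN:

        # Sketch: check the slot size limit and slot health on PostgreSQL 13+.
        import psycopg2

        conn = psycopg2.connect("host=localhost dbname=app user=postgres")  # placeholder DSN
        with conn, conn.cursor() as cur:
            cur.execute("SHOW max_slot_wal_keep_size")  # -1 means unlimited (the old behavior)
            print("max_slot_wal_keep_size =", cur.fetchone()[0])

            # wal_status is 'reserved', 'extended', 'unreserved', or 'lost' on 13+
            cur.execute("SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots")
            for row in cur.fetchall():
                print(row)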

TYMorningCoffee 4 years ago

I thought it was going to be a feedback loop, where the write at the end would trigger another Kafka message. I was pleasantly surprised it wasn't.

avg_dev 4 years ago

Just wanted to say thanks for this article. I have been exploring Debezium for capturing Pg changes at work. I don’t know a whole lot about Pg replication and it’s nice to hear the potential gotchas before moving to anything production-like.
