There are a bunch of pitfalls.

For one, listen/notify in Postgres wakes up all listeners on each notify. You will want a single stream of indexing updates, yet you will also want it to be fault-tolerant; if one indexer isn't running, another must take over its work.
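Because every listener wakes on each NOTIFY and the same row may be notified many times in a burst, a listener typically coalesces payloads before indexing. A minimal sketch, assuming psycopg2 and a hypothetical channel name `index_updates` (not from the original post):

```python
def coalesce_payloads(payloads):
    """Collapse a burst of NOTIFY payloads (row IDs) into a unique set,
    so a row notified many times in quick succession is indexed once."""
    return {p for p in payloads if p}

if __name__ == "__main__":
    # Hypothetical listener loop; connection string and channel name
    # are assumptions for illustration.
    import select
    import psycopg2
    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True  # NOTIFY delivery requires no open transaction
    cur = conn.cursor()
    cur.execute("LISTEN index_updates;")
    while True:
        # Wait up to 5s for the socket to become readable.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timeout, no notifications
        conn.poll()
        ids = coalesce_payloads(n.payload for n in conn.notifies)
        conn.notifies.clear()
        # ... reindex the rows in `ids` ...
```

Note this still loses notifications delivered while the listener is down, which is why the checkpoint-based designs below matter.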

Listen/notify also doesn't work well if you want to distinguish between batch updates (large imports, for example) and realtime ones.

Thirdly, if you miss events (for example because an indexer was down), you will want syncing to catch up from where it left off.

The better design is to maintain state about indexing. If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas. Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change. This can be super useful for other purposes (auditing, debugging, support), and makes it easy to index from an earlier point in time.
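The "changes" table approach can be sketched as a trigger that appends one row per modification, plus an indexer that resumes from its last checkpoint. All table, column, and function names here are hypothetical assumptions, not from the original comment:

```python
# Hypothetical DDL: a global change log fed by a per-table trigger.
CHANGES_DDL = """
CREATE TABLE changes (
    seq        bigserial PRIMARY KEY,   -- monotonically assigned sequence number
    table_name text NOT NULL,
    row_id     bigint NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO changes (table_name, row_id)
    VALUES (TG_TABLE_NAME, COALESCE(NEW.id, OLD.id));
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_changes
AFTER INSERT OR UPDATE OR DELETE ON documents
FOR EACH ROW EXECUTE PROCEDURE log_change();
"""

def rows_to_reindex(changes, last_seq):
    """Given change rows as (seq, row_id) tuples and the last sequence
    number this indexer processed, return (new_checkpoint, row ids to
    reindex). Deduplicating means a row modified many times since the
    checkpoint is fetched and indexed only once."""
    pending = [(s, r) for s, r in changes if s > last_seq]
    if not pending:
        return last_seq, set()
    return max(s for s, _ in pending), {r for _, r in pending}
```

Storing `last_seq` per indexer also gives you the fault-tolerance property above: a replacement indexer just resumes from the stored checkpoint.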

Such a system will also let you more easily recover from hard failures in ES where entire shards (and their replicas) are lost (I have had it happen with 1.x; it should be much rarer in 2.x).

Keeping one data store (ES) in sync with another (Postgres, or anything else) is an interesting problem in general.



> If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas.

It's hard because concurrent transactions can commit sequential IDs or timestamps out of order: there can be a lag between the moment a sequential ID or timestamp is generated and the moment the row is committed, so a scan for values greater than the last one seen can permanently skip rows that commit late.
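The race can be shown with a toy simulation (invented timings, purely illustrative): a transaction takes a low ID but commits slowly, while a scanner records a high-water mark from a faster transaction in between.

```python
def visible_ids(events, at_time):
    """Row IDs whose transactions have committed by `at_time`.
    `events` is a list of (row_id, commit_time) pairs."""
    return sorted(i for i, commit in events if commit <= at_time)

# T1 takes id=1 but commits at t=5; T2 takes id=2 and commits at t=2.
events = [(1, 5), (2, 2)]

# The indexer scans at t=3: it sees only id=2 and records last_seen=2.
seen_at_t3 = visible_ids(events, at_time=3)
last_seen = max(seen_at_t3)

# The next scan at t=6 asks for ids > last_seen, so id=1 (committed at
# t=5, below the high-water mark) is never picked up.
missed = [i for i in visible_ids(events, at_time=6)
          if i <= last_seen and i not in seen_at_t3]
```

Here `missed` ends up containing row 1, the late-committing transaction. Overlapping scan windows reduce the problem but cannot fully eliminate it.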

> Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change.

Yes.


Good point about sequence IDs.

Another, more recent method that's bound to be more foolproof is to track the actual transaction log. Postgres exposes an API for "logical decoding" of said log. The downside is that it's fairly new, Postgres-specific, and I also suspect it puts more burden on the client, which has to interpret the entries and determine if/how they should be indexed. But in theory it gives you a completely data-loss-proof, application-independent way of streaming changes from Postgres.
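To illustrate the "client has to interpret the entries" burden: Postgres ships a `test_decoding` output plugin whose text format a consumer must parse (production setups would more likely use a structured plugin such as wal2json). A hedged parsing sketch, with the regexes being my own assumption about the format:

```python
import re

# Matches one row line from the built-in test_decoding plugin, e.g.
#   table public.users: UPDATE: id[integer]:42 name[text]:'bob'
LINE_RE = re.compile(r"^table (\S+): (INSERT|UPDATE|DELETE): (.*)$")
COL_RE = re.compile(r"(\w+)\[[^\]]+\]:('(?:[^']|'')*'|\S+)")

def parse_change(line):
    """Parse one test_decoding line into (table, op, {column: raw_value}).
    Returns None for non-row lines such as BEGIN/COMMIT markers."""
    m = LINE_RE.match(line)
    if not m:
        return None
    table, op, cols = m.groups()
    values = {name: raw.strip("'") for name, raw in COL_RE.findall(cols)}
    return table, op, values
```

The indexer would then map each (table, op, values) tuple onto an ES index/delete operation, which is exactly the interpretation work the comment above alludes to.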


I agree, using PostgreSQL logical decoding is a good solution.

MySQL can do something similar using a replication stream (https://github.com/siddontang/go-mysql-elasticsearch) and MongoDB by tailing the oplog.



