The headline makes it sound like they did work to improve the performance of count(*) in the postgres code, when that's not at all what the article is about.
Given the title, I expected this to be about how PG improved their counting. This is not what it was about.
I remember working with over half a billion records and having problems when I needed a count. I used count(id), but that was mainly from internet mantra; I did not see an improvement. Using Citus gave me a significant improvement, from 7 minutes to 1, and that was with just a single coordinator and two workers on the same host. With a real cluster it could do much, much better.
If the data is very stagnant and writes are very low, triggers are great. Usually the "close enough" page-based estimate is good if you have over 100k rows, since paging - please correct me if I'm wrong - is sometimes off by around 1k.
My preference is Citus as a catch-all, but a trigger, a Redis cache managed at the app level, or page-count estimates are all really useful for stickier situations.
As an alternative to page counts, we used HLLs to estimate (unique) cardinality, and were quite happy with it. There is a postgres extension (postgresql-hll) and also a version for the JVM using the same algorithm/data format.
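For reference, a sketch of what the postgresql-hll approach can look like; the table and column names here are made up for illustration:

```sql
-- Requires the postgresql-hll extension.
CREATE EXTENSION hll;

CREATE TABLE daily_uniques (
    day   date PRIMARY KEY,
    users hll
);

-- Fold user ids into one HLL sketch per day.
INSERT INTO daily_uniques
SELECT date_trunc('day', created_at)::date,
       hll_add_agg(hll_hash_bigint(user_id))
FROM events
GROUP BY 1;

-- Estimated distinct users per day, without storing the ids themselves.
SELECT day, hll_cardinality(users) FROM daily_uniques;
```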
Sorry for being a bit off topic, but anyone out there who is using PostgreSQL in production, how do you manage tables with lots of updates? Is auto-vacuum doing a good enough job for you, or do you have to run “vacuum full” regularly?
For our use case, we found the best approach was to clone all of the data to a temporary table with indices and constraints disabled, perform the updates, re-enable indices and constraints, and then replace the production table with the temporary table. This only works if you are able to update your data in bulk, and if some lag time in your updates is OK. This also has the benefit of never locking your production table.
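A sketch of the clone-and-swap pattern described above, with illustrative table and index names; the swap itself is a brief rename inside one transaction:

```sql
BEGIN;
-- Clone without indexes or constraints so the bulk load is fast.
CREATE TABLE items_new (LIKE items INCLUDING DEFAULTS);
INSERT INTO items_new SELECT * FROM items;

-- ...perform the bulk updates on items_new here...

-- Recreate indexes/constraints only after the bulk work is done.
CREATE INDEX items_new_sku_idx ON items_new (sku);

ALTER TABLE items RENAME TO items_old;
ALTER TABLE items_new RENAME TO items;
COMMIT;

DROP TABLE items_old;
```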
In situations where real-time updates are important, the key is to minimize your indices as much as possible. Read up on heap only tuples (HOT). If that all isn't enough, maybe consider sharding your database.
Never run VACUUM FULL; it locks too aggressively. Let autovacuum do the job.
If you are updating lots of rows, specify a low fillfactor for both indexes and tables. It avoids table bloat from dead pages by reusing free space in existing pages for updates, rather than appending the latest row version to a new page (which also avoids the associated index update).
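As a sketch (table, index, and column names are illustrative), fillfactor is set per table and per index:

```sql
-- Leave 30% free space in each heap page so updated row versions
-- can land on the same page (enabling HOT updates).
ALTER TABLE busy_table SET (fillfactor = 70);

-- Indexes take their own fillfactor setting.
CREATE INDEX busy_table_idx ON busy_table (some_col) WITH (fillfactor = 70);

-- Note: existing pages are only repacked on the next table rewrite;
-- the setting mainly affects newly written pages.
```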
We have busy tables with lots of record churn, some of which receive 2000 xacts/s. I’ve found that they will never shrink back down without a full vacuum, but regular vacuum is good at reusing pages, so tables won’t outgrow their max size during the busiest periods. We never run full vacuums because we’ve never found it to be necessary.
If you end up needing full vacuuming to reduce bloat, there's also the pg_repack extension (it's even supported on RDS).
The extension essentially creates a new table without the bloat and swaps it in for the original, keeping track of changes to the table using a trigger; so the exclusive lock that VACUUM FULL would hold for a long time is only needed briefly.
| A note about count(1) vs count(*). One might think that count(1) would be faster because count(*) appears to consult the data for a whole row. However the opposite is true.
Which is the opposite decision of other databases. So sometimes Postgres does make bad decisions...
I think the "the opposite is true" is in reference to the assertion that "count(*) appears to consult the data for a whole row", not to the proposition that "one might think that count(1) would be faster [than count(*)]".
count(*) just counts the tuples themselves, which is fast; it's like counting heap-allocated data structures by counting their pointers (which you're already walking), without dereferencing those pointers.
count(1) counts the result of evaluating the SQL expression "1" upon landing on each row, but still walks the same pointers to do so.
So, in terms of time complexity, they're roughly equivalent. Both data items (the tuple and the SQL constant expression) are already on the stack, ready to be directly computed upon.
Postgres's count(1) isn't slower than the one in any other DBMS. It's just their count(*)—at least the expression-evaluation part of it—which is more optimized than the one in other DBMSes. Nothing wrong with that, IMHO.
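For anyone who wants to check this on their own data, comparing the two plans and timings is straightforward (the table name here is just a placeholder):

```sql
-- Both should produce the same plan shape; compare the actual times.
EXPLAIN ANALYZE SELECT count(*) FROM comments;
EXPLAIN ANALYZE SELECT count(1) FROM comments;
```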
I'm surprised count(<constant>) is not special-cased in pretty much every database, and count(*) as well. There's no reason either would do anything beyond counting the number of records / tuples.
> Postgres's count(1) isn't slower than the one in any other DBMS.
count(*) is faster on Postgres than count(1). But both are fundamentally slow because of MVCC. And count(*) on Postgres (the optimized one on Postgres) is much slower than count(1) on other databases (the optimized one on other databases). So practically speaking, counting rows is slower on Postgres than on other databases.
That said, I love Postgres. I use it every day. There is some room for improvement and it does improve all the time. It is an amazing open source project. And I wouldn't care at all if they never optimize count(1).
Yes, I was trying to make a finer point—the difference between count(*) and count(1) comes down to the cost of filtering the row, and in Postgres count(1) is a regular filtering operation—taking the same filtering cost that count(1) has in other DBMSes—while count(*) has a filtering cost that is lower than that of other DBMSes, because it has been specifically optimized†.
Separately, there's an MVCC cost of walking the rows to filter them, and other DBMSes optimize walking rows for counting [usually causing both count(1) and count(*) to be faster], while Postgres does not do this optimization. (And, as stated in the article, in those DBMSes, this isn't a pure optimization per se, but is rather a trade-off, trading write speed for all INSERTs/DELETEs for read speed for this particular case.)
(† Technically, the filtering cost of count(*) hasn't been specifically optimized; the relative speed of filtering tuples for count(*) is an emergent property of the general fact that Postgres treats any mention of `*` as a reference to the row-tuple object itself. i.e. If `foo` is a table (x int, y text), then in actuality, `foo` is first created as a type [a pg_class entry] defined as the tuple (int, text); and then the table `foo` is defined as a relation persisting a rowset of `foo`-tuple-typed rows [in essence making a table['s triggerable operations] each into a stored procedure with a `foo`-tuple-typed-rowset return type]. Then the expressions `(SELECT * FROM foo)` and `(SELECT f.* FROM foo f)` both evaluate to rowsets of type `foo`, which means that Postgres doesn't need to dereference the pointer to each `foo` heap tuple to build those rowsets. It only needs to dereference the pointers when it comes time to actually serialize and emit the rows over the wire—which, in the case of a `count(*)` operation, never happens.)
I'm not sure I understand your statement. On databases without an intrinsic row count property the only way to find the row count is to go to every page and count the rows there. The query planner will use the same plan for both `count(*)` and `count(1)`.
An interesting postgres oddity I ran into a few years ago that isn't about counting, but is in a similar arena, is how under very specific circumstances "SELECT *" is significantly faster than "SELECT one_column", despite the conventional and normally well-advised guidance that you should only select the columns you need.
Not according to that blurb. There have been other databases where count(1) was much faster. Here they are saying that in Postgres, count(1) is slower because star is treated as no arguments, and count(1) is more complex.
It is not count() that is slow, it is iterating through the rows that is / can be slow :)
For me, the trick to basic understanding of perf in PG was exactly this: it is all about limiting the amount of rows you have to iterate over.
It is true for count() but also for every other operation you do.
PG is surprisingly non-magical (at least in my experience) in that you won't get much perf for free, but on the other hand you can reason about perf & optimize pretty reliably once you come to terms with this.
Note the trigger approach mentioned in the article would be terrible for performance in a concurrent environment, since only one transaction could modify the whole table at a time.
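For context, a minimal version of that trigger approach might look like the following (names are illustrative; note that every writer contends on the same counter row, which is exactly the serialization problem):

```sql
CREATE TABLE row_counts (tbl text PRIMARY KEY, n bigint NOT NULL);

CREATE FUNCTION track_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE tbl = TG_TABLE_NAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE row_counts SET n = n - 1 WHERE tbl = TG_TABLE_NAME;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- On PostgreSQL 10 and older, use EXECUTE PROCEDURE instead.
CREATE TRIGGER items_count AFTER INSERT OR DELETE ON items
    FOR EACH ROW EXECUTE FUNCTION track_count();
```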
I wonder if there's a way to "shard" these counters to avoid this problem.
If you had 10 different counters (maybe in ten different tables) and a mechanism for round-robin or randomly selecting which counter gets incremented/decremented would that allow ten concurrent transactions at once?
The query to return the total count would then need to sum the 10 individual counters, which should be extremely fast.
Or is the concurrency limitation here caused by the trigger on the counted table itself, not the writes performed by the trigger?
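A toy version of the sharded-counter idea; this uses Python's sqlite3 purely to keep the sketch self-contained (in Postgres the shards would be ordinary rows updated by the trigger):

```python
import random
import sqlite3

N_SHARDS = 10

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE counter_shards (shard INTEGER PRIMARY KEY, n INTEGER NOT NULL)")
db.executemany(
    "INSERT INTO counter_shards VALUES (?, 0)",
    [(i,) for i in range(N_SHARDS)],
)

def increment(delta=1):
    # Pick a shard at random so concurrent writers rarely contend
    # on the same row lock.
    shard = random.randrange(N_SHARDS)
    db.execute("UPDATE counter_shards SET n = n + ? WHERE shard = ?", (delta, shard))

def total():
    # Reading must still sum every shard.
    return db.execute("SELECT sum(n) FROM counter_shards").fetchone()[0]

for _ in range(1000):
    increment()
print(total())  # 1000
```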
Sharding the counters would help, but with MVCC the typical solution is a delta table containing +1 and -1 records that is periodically compacted into the main count. With some cleverness about how to perform the compaction it's possible to make that very efficient.
Instead of maintaining a single row in a table for the count, you maintain a table containing several rows. Instead of updating the single row, you insert a new row containing a 1 or -1. To get the count, you sum() the table. And you have a process to rollup the rows occasionally. This way you avoid the locks, except for the rollup.
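A toy version of that delta-table approach, again using sqlite3 just so the sketch runs standalone:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE count_deltas (delta INTEGER NOT NULL)")

def on_insert():
    # Appending a row takes no lock on any shared counter row.
    db.execute("INSERT INTO count_deltas VALUES (1)")

def on_delete():
    db.execute("INSERT INTO count_deltas VALUES (-1)")

def current_count():
    return db.execute(
        "SELECT coalesce(sum(delta), 0) FROM count_deltas"
    ).fetchone()[0]

def rollup():
    # Periodic compaction: collapse all deltas into a single row.
    n = current_count()
    db.execute("DELETE FROM count_deltas")
    db.execute("INSERT INTO count_deltas VALUES (?)", (n,))

for _ in range(5):
    on_insert()
on_delete()
rollup()
print(current_count())  # 4
```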
Even though only 1 / 10 counters would be locked, you'd still have to read all 10 to get the count, which would be blocked until the concurrent transaction ended.
I appreciate approaches that acknowledge when “close” is good enough. It would be interesting to graph how far off those numbers are to put a real metric to it.
Only one data point, but I've got a custom search engine for a community that has pretty close to the ideal usage pattern for this: append-only with no update or deletes. A pure count(*) over 6.8M rows on a slow VPS takes 135,015ms and the n_live_tup technique¹ takes 214ms. The latter was 00.009557% high before an analyze and 00.001414% low afterward. Definitely good enough for me.
¹ SELECT n_live_tup FROM pg_stat_all_tables WHERE relname = 'comments';
If you are using a recent postgresql release and have just run ANALYZE, the pg_class.reltuples percentage error relative to the true row count will almost always be less than 1.0/DEFAULT_STATISTICS_TARGET, and usually substantially better than that.
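The estimate in question can be read directly (reusing the 'comments' table name from the sibling comment as an example):

```sql
-- Planner estimate maintained by ANALYZE / autovacuum.
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'comments';
```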
Be careful with this if your postgresql release is not up to date. There was a bug in the way ANALYZE updated pg_class.reltuples that could cause the value in reltuples to grow incorrectly for large tables. If the table had a lot of updates the pg_class.reltuples value would tend to increase a bit each time it got updated.
This was fixed in March 2018 and was backpatched, so any binary release since then should be ok, e.g. 9.x.latest, 10.x.latest, 11.x.
This post from Citus is far more informative:
https://www.citusdata.com/blog/2016/10/12/count-performance/