Caching Parquet headers/footers sounds super interesting. Can you say more about...

staticassertion · 2026-02-25T18:49:24 1772045364

Currently there's nothing in my headers, but the footer is straightforward. There's the schema, row group metadata, some statistics, byte offsets for each column in a group, page index, etc. It's everything you'd want if you wanted to reject a query outright or, if necessary, query extremely efficiently.

Min/max stats for a column are huge because I pre-encode any low-cardinality strings into integers. This means I can skip entire row groups without ever touching S3, just with that footer information, and if I don't have it cached I can read it and skip decoding anything that doesn't have my data.
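To make the row-group skipping concrete, here is a minimal sketch in Python. It assumes the footer has already been parsed into per-row-group min/max values for one integer-encoded column; `RowGroupStats`, `LEVEL_CODES`, and `candidate_row_groups` are hypothetical names for illustration, not anything from a real Parquet library.

```python
from dataclasses import dataclass

# Hypothetical mapping of a low-cardinality string column to integers.
LEVEL_CODES = {"debug": 0, "info": 1, "warn": 2, "error": 3}


@dataclass(frozen=True)
class RowGroupStats:
    index: int      # position of the row group in the file
    min_value: int  # smallest encoded value in the column chunk
    max_value: int  # largest encoded value in the column chunk


def candidate_row_groups(stats, wanted):
    """Return indexes of row groups whose [min, max] range could contain `wanted`.

    Anything not returned can be skipped without fetching or decoding it.
    """
    return [rg.index for rg in stats if rg.min_value <= wanted <= rg.max_value]


footer_stats = [
    RowGroupStats(index=0, min_value=0, max_value=1),
    RowGroupStats(index=1, min_value=1, max_value=2),
    RowGroupStats(index=2, min_value=3, max_value=3),
]
to_read = candidate_row_groups(footer_stats, LEVEL_CODES["error"])
```

Here only row group 2 survives, so a query for `error` rows would issue range requests for one row group instead of three. Min/max ranges can only rule groups out; a surviving group may still contain no matching rows.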

Footers can get quite large in one sense - 10s-100s of KB for a very large file. But that's obviously tiny compared to a multi-GB Parquet file, and the data can compress extremely well for a second/third-tier cache. You can store 1000s of these pre-parsed in memory no problem, and store 10s of thousands more on disk.
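A rough sketch of that kind of tiering, in Python. Everything here is an assumption for illustration: `FooterCache` is a hypothetical class, the parsed footer is treated as a JSON-serializable dict, and a plain dict of zlib-compressed blobs stands in for the on-disk tier.

```python
import json
import zlib
from collections import OrderedDict


class FooterCache:
    """Two-tier cache: parsed footers in an LRU, compressed bytes below it."""

    def __init__(self, max_hot=1000):
        self.max_hot = max_hot
        self.hot = OrderedDict()  # key -> parsed footer, most recent last
        self.cold = {}            # key -> compressed JSON (stand-in for disk)

    def put(self, key, footer):
        self.hot[key] = footer
        self.hot.move_to_end(key)
        # Demote least recently used entries to the compressed tier.
        while len(self.hot) > self.max_hot:
            old_key, old_footer = self.hot.popitem(last=False)
            self.cold[old_key] = zlib.compress(json.dumps(old_footer).encode())

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        blob = self.cold.pop(key, None)
        if blob is None:
            return None  # caller falls back to fetching the footer from S3
        footer = json.loads(zlib.decompress(blob))
        self.put(key, footer)  # promote back into the hot tier
        return footer
```

A miss in both tiers returns `None`, at which point the caller would fetch the footer from object storage and `put` it.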

I've spent 0 time optimizing my footers currently. They can get smaller than they are, I assume, but I've not put much thought into it. In fact, I don't have to assume, I know that my own custom metadata overlaps with the existing Parquet stats and I just haven't bothered to deal with it. TBH there are a bunch of layout optimizations I've yet to explore, like using headers, which would obviously have some benefits (streaming), whereas right now I do a sort of "attempt to grab the footer from the end in chunks until we find it lol". But it doesn't come up because... caching. And there are worse things than a few spurious RANGE requests.
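For anyone curious what reading a footer from the end looks like, here is a minimal sketch in Python. It is not necessarily the approach described above: it relies on the standard Parquet trailer, where the file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. `fetch_range(start, end)` is a hypothetical callable standing in for a ranged GET over the half-open byte range `[start, end)`, and `first_guess` is an arbitrary default.

```python
import struct

MAGIC = b"PAR1"


def read_footer(fetch_range, file_size, first_guess=64 * 1024):
    """Return the raw (still Thrift-encoded) footer bytes of a Parquet file."""
    if file_size < 12:  # leading magic + length field + trailing magic
        raise ValueError("file too small to be Parquet")

    # Speculatively read the tail; often this already contains the footer.
    start = max(0, file_size - first_guess)
    tail = fetch_range(start, file_size)
    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    footer_start = file_size - 8 - footer_len
    if footer_start < 0:
        raise ValueError("footer length exceeds file size")

    if footer_start < start:
        # Guess was too small: one extra request for the missing prefix.
        tail = fetch_range(footer_start, start) + tail
        start = footer_start

    offset = footer_start - start
    return tail[offset:offset + footer_len]
```

With a generous `first_guess` this is one request per uncached file, and at most two when the footer is larger than the guess.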

UltraSane · 2026-02-25T19:18:01 1772047081

Have you tried AWS S3 Tables, which is a managed Iceberg service?

staticassertion · 2026-02-25T22:20:29 1772058029

I haven't. I'm sort of aware of it but I guess I prefer to just have tight control over the protocol/data layout. It's not that hard and it gives me a ton of room to make niche optimizations. I doubt I'd get the same performance if I used it, but I could be wrong. Usually the more you can push your use case into the protocol the better.

UltraSane · 2026-02-26T01:20:17 1772068817

Like most managed services, it is a trade-off of control vs ease of operation. And like everything with S3 it scales to absurd levels, with 10,000 tables per table bucket.

staticassertion · 2026-02-26T01:32:38 1772069558

Makes sense and tbh there's a very good chance that I'd consider it if I were trying to stay more "standard" but I'd have to learn more.