This may be a foolish question, but if the dedup code isn’t open source, how did...

sillysaurusx · on Sept 16, 2021

Not foolish! The Tarsnap client code is open source, but the license file prohibits using the code for anything other than the Tarsnap service: https://github.com/Tarsnap/tarsnap/blob/master/COPYING

> Redistribution and use in source and binary forms, without modification, is permitted for the sole purpose of using the "tarsnap" backup service provided by Tarsnap Backup Inc.

The codebase is a jewel. I love the design, the way it's organized, the coding style, the algorithms, everything.

I started by skimming Colin's thesis: http://www.daemonology.net/papers/thesis.pdf

Along with the rsync thesis: https://www.samba.org/~tridge/phd_thesis.pdf

Then I started making a mental map of tarsnap: How does it build its deduplication index? How does it decide where block boundaries start within a file? Etc.
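
To make those two questions concrete, here's a minimal sketch of content-defined chunking plus a dedup index. To be clear, this is not Tarsnap's algorithm or code: it swaps in a gear-style rolling hash (the kind used in FastCDC), and every name and parameter here (split_chunks, store, restore, MIN_CHUNK, MAX_CHUNK) is made up for illustration.

```python
import hashlib
import random

# Hypothetical parameters for illustration only; not taken from Tarsnap.
MIN_CHUNK = 64      # never cut a chunk shorter than this (except the last one)
MAX_CHUNK = 4096    # always cut once a chunk reaches this length

# One pseudo-random 32-bit value per byte value. The seed is fixed so the
# same input always produces the same boundaries.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data where the rolling hash of the most recent bytes hits a pattern."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 32-bit state, so after 32 bytes
        # the hash depends only on the last 32 bytes seen.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_pattern = length >= MIN_CHUNK and (h >> 24) == 0
        if at_pattern or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store(data: bytes, index: dict[bytes, bytes]) -> list[bytes]:
    """Add unseen chunks to the index; return the digests needed to rebuild data."""
    recipe = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).digest()
        index.setdefault(digest, chunk)  # a chunk already in the index is not stored again
        recipe.append(digest)
    return recipe


def restore(recipe: list[bytes], index: dict[bytes, bytes]) -> bytes:
    """Rebuild the original bytes from a list of chunk digests."""
    return b"".join(index[digest] for digest in recipe)
```

The point of cutting on content rather than at fixed offsets is that inserting a byte near the start of a file only disturbs the chunks around the insertion; later boundaries tend to land in the same places, so those chunks hash to digests that are already in the index.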

Eventually I started coding the algorithms in Python, mostly as a way of understanding the code. It's not actually as hard as it sounds, but you have to be rigorous. (It's a C -> Python conversion, after all, so there's not much room for error.)

My process was basically: Copy the C code into a Python file; comment out the code; for each line, write the corresponding Python; try to get something running as quickly as possible.
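
Here's roughly what that looks like partway through, on a toy function. The C in the comments is invented for this example and is not from Tarsnap. It also shows the kind of thing you have to be rigorous about: a C uint32_t wraps around on overflow, while a Python int just keeps growing, so the port has to mask explicitly.

```python
# Hypothetical C original, kept as comments; the Python sits under each line.

# uint32_t
# checksum(const uint8_t *buf, size_t len)
# {
def checksum(buf: bytes) -> int:
#     uint32_t sum = 0;
    total = 0
#     for (size_t i = 0; i < len; i++)
    for byte in buf:
#         sum = sum * 31 + buf[i];
        total = (total * 31 + byte) & 0xFFFFFFFF  # emulate uint32_t wraparound
#     return (sum);
    return total
# }
```

Once a function runs and matches the C output on a few inputs, the commented-out lines can be deleted.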

It worked pretty well, but I eventually lost interest.

Over the years, I've wanted a deduplication library, and 2021 is no exception. Someday I'll just roll up my sleeves and finish porting it.