Hacker News

This may be a foolish question, but if the dedup code isn’t open source, how did you port it to Python?


Not foolish! The Tarsnap client code is source-available, but the license only permits using it, unmodified, with the Tarsnap service: https://github.com/Tarsnap/tarsnap/blob/master/COPYING

> Redistribution and use in source and binary forms, without modification, is permitted for the sole purpose of using the "tarsnap" backup service provided by Tarsnap Backup Inc.

The codebase is a jewel. I love the design, the way it's organized, the coding style, the algorithms, everything.

I started by skimming Colin's thesis: http://www.daemonology.net/papers/thesis.pdf

Along with the rsync thesis: https://www.samba.org/~tridge/phd_thesis.pdf

Then I started making a mental map of tarsnap: How does it build its deduplication index? How does it decide where block boundaries start within a file? Etc.
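
To make those two questions concrete, here's a minimal sketch of content-defined chunking plus a hash-keyed index. It is not Tarsnap's algorithm and is not derived from its code; it's a generic polynomial rolling hash, and every name and constant here (WINDOW, MASK, MIN_CHUNK, MAX_CHUNK, chunk_boundaries, dedup_store, restore) is an assumption picked for illustration.

```python
import hashlib

# All constants are illustrative assumptions, not values from any real tool.
WINDOW = 16            # bytes covered by the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are zero
MIN_CHUNK = 16         # never cut before the window is full
MAX_CHUNK = 256        # force a cut so chunks stay bounded
BASE = 257
MOD = (1 << 31) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of chunks chosen by a rolling hash."""
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        pos = i - start  # offset of this byte inside the current chunk
        if pos >= WINDOW:
            # Drop the oldest byte so h only covers the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def dedup_store(data: bytes, index: dict) -> list:
    """Store unseen chunks in index; return the list of chunk keys."""
    recipe = []
    for s, e in chunk_boundaries(data):
        block = data[s:e]
        key = hashlib.sha256(block).hexdigest()
        index.setdefault(key, block)  # only new chunks take up space
        recipe.append(key)
    return recipe


def restore(recipe: list, index: dict) -> bytes:
    """Rebuild the original bytes from a recipe of chunk keys."""
    return b"".join(index[key] for key in recipe)
```

Because a cut depends only on the last WINDOW bytes, an insertion early in a file tends to shift boundaries only locally, so later chunks can still hash to keys already in the index. Calling dedup_store twice on the same data with the same index dict leaves the index unchanged the second time.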

Eventually I started coding the algorithms in Python, mostly as a way of understanding the code. It's not actually as hard as it sounds, but you have to be rigorous. (It's a C -> Python conversion, after all, so there's not much room for error.)

My process was basically: Copy the C code into a Python file; comment out the code; for each line, write the corresponding Python; try to get something running as quickly as possible.
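
As an illustration of that workflow, here's a sketch using 32-bit FNV-1a, a well-known public hash, as a stand-in. The C lines in the comments are a hypothetical example written for this post, not anything from Tarsnap. It also shows where the rigor comes in: C's uint32_t wraps on overflow, while Python ints don't, so the port has to mask explicitly.

```python
# uint32_t fnv1a(const uint8_t *buf, size_t len)
# {
def fnv1a(buf: bytes) -> int:
    #     uint32_t h = 2166136261u;
    h = 2166136261
    #     for (size_t i = 0; i < len; i++) {
    for byte in buf:
        #         h ^= buf[i];
        h ^= byte
        #         h *= 16777619u;
        # Python ints are unbounded, so emulate uint32_t wraparound.
        h = (h * 16777619) & 0xFFFFFFFF
    #     }
    #     return h;
    return h
# }


if __name__ == "__main__":
    print(hex(fnv1a(b"a")))  # 0xe40c292c
```

Once the Python runs and matches the C output on a few inputs, the commented-out C can be deleted.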

It worked pretty well, but I eventually lost interest.

Over the years, I've wanted a deduplication library, and 2021 is no exception. Someday I'll just roll up my sleeves and finish porting it.



