filesystem capable of deduping tar.gz's content

Ori Berger linux-il at orib.net
Tue May 28 16:22:30 IDT 2013


On 05/08/2013 09:22 PM, Elazar Leibovich wrote:
> Hi,
>
> I have a software product being built a few times a day (continuous
> integration style). The end product is an installable tar.gz with many
> java jars.
>
> Since the content of the tar.gz's is mostly the same, I want to use a
> filesystem that would dedupe the duplicated content.
>
> As I see it, it's a FUSE filesystem that:
>
.
.snip
.
> Is there anything like that available?
> Is there a smarter solution?
.

Apologies for being late to the party.

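The tar.gz is what makes this hard: gzip compresses the whole tar as one
stream, so a small change in one file changes the compressed bytes that
follow it. A zip would work better for what you want (because, unlike a
.tar.gz, it does not compress across files - each one is compressed
individually).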
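
To illustrate the difference, here is a small Python sketch (not part of
bup; the sample "files" and all function names are made up for the
example). It uses zlib as a stand-in for both gzip and zip's deflate,
flips one byte in the first file, and compares the compressed output of
the two layouts:

```python
import random
import zlib


def make_files(count=4, size=20000, seed=0):
    """Build `count` pseudo-files of compressible text (invented sample data)."""
    rng = random.Random(seed)
    words = [b"class", b"public", b"static", b"void", b"import", b"return"]
    return [b" ".join(rng.choice(words) for _ in range(size // 6))
            for _ in range(count)]


def per_file(files):
    """zip-style layout: every member is compressed on its own."""
    return [zlib.compress(f) for f in files]


def solid(files):
    """tar.gz-style layout: one compression stream over everything."""
    return zlib.compress(b"".join(files))


def flip_first_byte(files):
    """Return a copy of `files` with one byte changed in the first member."""
    first = bytes([files[0][0] ^ 1]) + files[0][1:]
    return [first] + list(files[1:])


def common_prefix_len(x, y):
    """Number of leading bytes that two byte strings share."""
    n = 0
    for p, q in zip(x, y):
        if p != q:
            break
        n += 1
    return n


if __name__ == "__main__":
    old = make_files()
    new = flip_first_byte(old)
    same = sum(a == b for a, b in zip(per_file(old), per_file(new)))
    print("per-file members unchanged:", same, "of", len(old))
    s_old, s_new = solid(old), solid(new)
    print("solid stream: identical prefix of",
          common_prefix_len(s_old, s_new), "bytes out of", len(s_new))
```

With per-file compression the untouched members come out byte-for-byte
identical, so a deduping store can find them again. With the single
stream there is no such guarantee past the point of the change.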

However, there is an (essentially) ready made solution which will work 
with .zips, but much much much better with the original folders: bup

https://github.com/bup/bup

As long as you don't care about ownership/permissions/modification-time 
(there's a branch that has those as well, but IIRC it's not in the main 
branch yet), bup:

a) dedupes at the sub-file level (that is, if you add/delete/change 1 
byte in a 100GB file, the additional version will take ~10KB on 
average). bup breaks each file into "easy to find again" sections, and 
actually stores those sections. A change of one byte will likely change 
just one such section, which has an expected size of ~8KB

b) gzips each such section individually (so it won't be much larger than 
a .tar.gz except for pathological cases)

c) is randomly accessible - any version, any time

d) comes with a command line front end, an FTP front end, a FUSE front 
end, and possibly more that I forgot.

e) uses git as a storage format. If all else fails, you can poke at the 
internals using git.

f) has a "manual mode" (bup split / bup join), in which you supply your 
own file through stdin, and bup still does its own dedup magic. You'd 
still want to use .tar (best) or .zip (2nd best) rather than .tar.gz, of 
course.
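
The "easy to find again" sections from (a) can be sketched in Python as
below. This is only an illustration of the general technique
(content-defined chunking with a rolling checksum), not bup's actual
code: the 64-byte window, the rsync-style sums, the 13-bit mask and all
names here are assumptions picked to give sections of roughly 8KB.

```python
import hashlib
import random

WINDOW = 64             # bytes the rolling checksum looks back over (assumed)
MASK = (1 << 13) - 1    # cut when the low 13 bits are all ones: ~8KB average


def chunks(data):
    """Split `data` at positions chosen by the content itself."""
    a = b = 0           # rsync-style sums over the last WINDOW bytes
    start = 0
    out = []
    for i, new in enumerate(data):
        # Slide the window: add the new byte, drop the one that fell out.
        old = data[i - WINDOW] if i >= WINDOW else 0
        a = (a + new - old) & 0xFFFF
        b = (b + a - WINDOW * old) & 0xFFFF
        if (b & MASK) == MASK:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])    # whatever is left after the last cut
    return out


def new_chunk_count(old_data, new_data):
    """How many chunks of `new_data` are not already stored for `old_data`."""
    known = {hashlib.sha1(c).digest() for c in chunks(old_data)}
    return sum(hashlib.sha1(c).digest() not in known
               for c in chunks(new_data))


if __name__ == "__main__":
    rng = random.Random(1)
    data = bytes(rng.getrandbits(8) for _ in range(200000))
    edited = data[:100000] + b"X" + data[100000:]   # insert a single byte
    print("chunks in original:", len(chunks(data)))
    print("chunks to store for the edit:", new_chunk_count(data, edited))
```

Because a cut depends only on the last 64 bytes, the cut points after
the inserted byte fall on the same content as before, so only the
section around the edit has to be stored again. Fixed-size blocks would
all shift by one byte and stop matching.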

bup is the best thing for backup since sliced bread. It's also 
reasonably fast, works locally or client/server through ssh, and more. 
The only thing I'm really missing is built-in encryption, and some 
people who care more about perms and ctime/mtime/atime in backups miss 
those - but otherwise, it is teh awesome.


