filesystem capable of deduping tar.gz's content

Elazar Leibovich elazarl at gmail.com
Tue May 28 16:54:24 IDT 2013


You came late to the party, but you're the only one who brought a cheque!

Thanks, it's exactly what I was looking for.
On May 28, 2013 4:22 PM, "Ori Berger" <linux-il at orib.net> wrote:

> On 05/08/2013 09:22 PM, Elazar Leibovich wrote:
>
>> Hi,
>>
>> I have a software product being built a few times a day (continuous
>> integration style). The end product is an installable tar.gz with many
>> java jars.
>>
>> Since the content of the tar.gz's is mostly the same, I want to use a
>> filesystem that would dedupe the duplicated content.
>>
>> As I see it, it's a FUSE filesystem that:
>>
>>  .
> .snip
> .
>
>> Is there anything like that available?
>> Is there a smarter solution?
>>
> .
>
> Apologies for being late to the party.
>
> The tar.gz makes everything a problem - a zip would work better for what
> you want (because, unlike a .tar.gz, it will not compress across files -
> each one will compress individually).
>
> However, there is an (essentially) ready made solution which will work
> with .zips, but much much much better with the original folders: bup
>
> https://github.com/bup/bup
>
> As long as you don't care about ownership/permissions/modification-time
> (there's a branch that has those as well, but IIRC it's not in the main
> branch yet), bup:
>
> a) dedupes at the sub-file level (that is, if you add/delete/change 1 byte
> in a 100GB file, the additional version will take ~10KB on average). bup
> breaks each file into "easy to find again" sections, and actually stores
> those sections. A change of one byte will likely change just one such
> section, which has an expected size of ~8KB.
>
> b) gzips each such section individually (so it won't be much larger than a
> .tar.gz except for pathological cases)
>
> c) is randomly accessible - any version, any time
>
> d) comes with a command line front end, an FTP front end, a FUSE front
> end, and possibly more I forgot.
>
> e) uses git as a storage format. If all else fails, you can poke at the
> internals using git.
>
> f) has a "manual mode" (bup split / bup join), in which you supply your
> own file through stdin, and bup still does its own dedup magic. You'd still
> want to use .tar (best) or .zip (2nd best) rather than .tar.gz, of course.
>
> bup is the best thing for backup since sliced bread. It's also reasonably
> fast, works locally or client/server through ssh, and more. The only thing
> I'm really missing is built-in encryption, and some people who care more
> about perms and ctime/mtime/atime in backups miss those - but otherwise, it
> is teh awesome.
>
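
To make point (a) above concrete, here is a minimal sketch of the
"easy to find again" sections idea (content-defined chunking with a rolling
checksum). It is not bup's code: the window size, the boundary mask, the
rsync-style checksum and the helper names split_chunks and add_to_store are
all assumptions picked for illustration, with the mask chosen so that
sections average roughly 8KB as described in the mail. The store is just a
dict keyed by SHA-1, standing in for real storage.

```python
import hashlib
import random

WINDOW = 64             # rolling-checksum window in bytes (illustrative assumption)
MASK = (1 << 13) - 1    # cut when the low 13 bits are all ones: ~8 KiB average


def split_chunks(data: bytes) -> list:
    """Split data at boundaries chosen by the content itself.

    The checksum depends only on the last WINDOW bytes, so a local edit
    can only move boundaries near the edit; later boundaries line up again.
    """
    chunks = []
    start = 0
    s1 = s2 = 0
    for i, byte in enumerate(data):
        # Byte leaving the window (zero while the window is still filling).
        drop = data[i - WINDOW] if i >= WINDOW else 0
        s1 = (s1 + byte - drop) & 0xFFFF
        s2 = (s2 + s1 - WINDOW * drop) & 0xFFFF
        if (s2 & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial section
    return chunks


def add_to_store(store: dict, data: bytes) -> int:
    """Store every section not seen before; return the number of new bytes."""
    added = 0
    for chunk in split_chunks(data):
        key = hashlib.sha1(chunk).hexdigest()
        if key not in store:
            store[key] = chunk
            added += len(chunk)
    return added


# Demo: two versions of a 256 KiB blob that differ in a single byte.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(256 * 1024))
edited = bytearray(original)
edited[len(edited) // 2] ^= 0xFF  # flip one byte in the middle

store = {}
print("first version stored: ", add_to_store(store, original), "bytes")
print("second version stored:", add_to_store(store, bytes(edited)), "bytes")
```

The second version should add only the section (or the few sections) around
the flipped byte. Running the same experiment on gzip-compressed input would
generally show far less sharing, since a change in the input tends to alter
the compressed stream from that point on, which is why the mail recommends
.tar over .tar.gz.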