filesystem capable of deduping tar.gz's content

Elazar Leibovich elazarl at gmail.com
Wed May 8 23:21:37 IDT 2013


On Wed, May 8, 2013 at 10:47 PM, Oleg Goldshmidt <pub at goldshmidt.org> wrote:

>
> Disclaimer: I am definitely not an expert on the subject matter and I
> hardly know what I am talking about (in this case?). Creativity is no
> substitute for knowing what you are doing.
>
> Now let me try and get creative.
>
> What is your purpose? Just doing something fancy to impress your boss


My real purpose, or the officially stated purpose? ;-)
The thing is, we build a 400MB artifact multiple times a day. Say 5×200×400MB = 400GB
a year; I'm not sure I have the space. This means I have to maintain the
repository (delete old build results, etc.).
OTOH, if I use a dedupe technique, I can keep all build artifacts and forget
about it altogether. I'll never ever fill a modern 250GB disk.
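
To make the idea concrete, here is a minimal Python sketch of deduping the
*content* of tar.gz files rather than the compressed bytes. Everything in it
is an assumption made for illustration: the MemberStore class, the make_targz
helper and the sample file names are hypothetical, and a real filesystem would
also have to reproduce the original .tar.gz byte for byte, which this sketch
does not attempt.

```python
import hashlib
import io
import tarfile


def make_targz(files):
    """Build an in-memory .tar.gz from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


class MemberStore:
    """Hypothetical store: keeps each archive member once, keyed by SHA-256."""

    def __init__(self):
        self.blobs = {}      # digest -> member bytes
        self.manifests = {}  # archive name -> list of (member name, digest)

    def add(self, archive_name, targz_bytes):
        manifest = []
        with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # sketch only: directories and links are ignored
                data = tar.extractfile(member).read()
                digest = hashlib.sha256(data).hexdigest()
                self.blobs.setdefault(digest, data)  # store unseen content only
                manifest.append((member.name, digest))
        self.manifests[archive_name] = manifest

    def stored_bytes(self):
        """Total size of the unique member contents kept in the store."""
        return sum(len(blob) for blob in self.blobs.values())

    def files(self, archive_name):
        """Rebuild the {name: bytes} view of one stored archive."""
        return {name: self.blobs[d] for name, d in self.manifests[archive_name]}


# Two builds that differ in a single byte of a single file.
build1 = {"bin/app": b"A" * 1000, "lib/core.jar": b"B" * 5000}
build2 = {"bin/app": b"A" * 999 + b"!", "lib/core.jar": b"B" * 5000}

store = MemberStore()
store.add("build-1.tar.gz", make_targz(build1))
store.add("build-2.tar.gz", make_targz(build2))
print("unique bytes stored:", store.stored_bytes())
```

The unchanged lib/core.jar is stored once for both builds. Deduping at whole
file granularity is the simplest choice; splitting members into chunks would
also catch small changes inside a large file.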


> or
> truly save space, e.g., if this stuff - everything that gets built - is
> backed up? I'll assume the latter.
>
> [Aside: if it is not backed up, how many versions do you really need to
> keep and why is it an issue?]
>
> 1. I would probably look into using a version control system rather than
>    a filesystem.
>
>    a) Modern version control systems are often/usually capable of
>       storing binary diffs between revisions. Frankly, I've never looked
>       at how git or mercurial do that (probably quite well), but even,
>       say, SVN should be able to store a binary diff on commit. IIRC SVN
>       diffs using xdelta or similar.
>

I suspect they don't work well on gzipped content:

Two versions of an uncompressed binary file:

(fabenv_mac)❯ du -h .git/objects
4.0K .git/objects/08
232K .git/objects/3d
4.0K .git/objects/44
4.0K .git/objects/84
232K .git/objects/d7
4.0K .git/objects/ee
  0B .git/objects/info
  0B .git/objects/pack
480K .git/objects
(fabenv_mac)❯ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)
(fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
4.0K .git/objects//info/packs
4.0K .git/objects//pack/pack-bd546ad638a3a27e16e57298469558cdd5018879.idx
216K .git/objects//pack/pack-bd546ad638a3a27e16e57298469558cdd5018879.pack

However, when both versions are gzipped:

(fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
4.0K .git/objects//2a/8fc1caff222272cb043bbf18d240c54315f9d0
4.0K .git/objects//4e/71017582e4f46b3641d27084e5cae0c3303974
216K .git/objects//70/81d2b08bc00dff607aea60e9c6fecbc6950b16
216K .git/objects//8e/71116f4a7f89af36051b8b431427c0e88ab741
4.0K .git/objects//92/00e8eaf6093e6cfd07735bc9fe30da4e86db33
4.0K .git/objects//9d/e5e4af60673998992579be40960d65a5b498a3
(fabenv_mac)❯ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)
(fabenv_mac)❯ du -sh .git/objects
440K .git/objects
(fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
4.0K .git/objects//info/packs
4.0K .git/objects//pack/pack-5253e59d6e6950fbbf8455310bb32e3004ded6b2.idx
432K .git/objects//pack/pack-5253e59d6e6950fbbf8455310bb32e3004ded6b2.pack

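Note that when the same two versions of the file (a gcc binary with the first
byte changed) were gzipped, git gc found no delta ("delta 0") and the pack
stayed at 432K, i.e. both versions are stored in full. The uncompressed
versions packed down to 216K.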
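
A likely explanation is that a change early in the input can alter the
compressed stream from that point on, so a binary delta finds little to
reuse. The Python sketch below illustrates this under stated assumptions: it
uses zlib (the same deflate algorithm as gzip, without the gzip header) and a
synthetic, compressible input instead of the gcc binary, and the
differing_bytes helper is a made-up name. How much the compressed outputs
diverge depends on the input, so treat the printed numbers as indicative only.

```python
import random
import zlib


def differing_bytes(a, b):
    """Count positions where a and b differ, plus any length difference."""
    common = min(len(a), len(b))
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + max(len(a), len(b)) - common


# Synthetic, compressible stand-in for a binary (assumption, not the gcc file).
rng = random.Random(0)
words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
original = b" ".join(rng.choice(words) for _ in range(20000))

# Same content with only the first byte changed, as in the experiment above.
modified = bytes([original[0] ^ 0xFF]) + original[1:]

raw_diff = differing_bytes(original, modified)
packed_diff = differing_bytes(zlib.compress(original), zlib.compress(modified))

print("raw bytes differing:       ", raw_diff)
print("compressed bytes differing:", packed_diff)
```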


>    b) I suppose one can write commit/get (I use this terminology only
>       because I mentioned SVN, consider it generic) hooks for most
>       version control systems to tar/untar (and possibly zip/unzip jars)
>       if you really need something close to what you described.
>

All your suggestions are basically good, but they mean I have to change the
work style of the whole team.
The main benefit of my suggestion is that it's completely transparent. I
add a single mount command for the directory where I already keep my binary
files, and that's it. Everything still works as usual, except I never need to
worry about deleting anything.
BTW, Java artifacts have a well-known, easy-to-set-up deployment
mechanism (a binary repository with a known protocol for storing binary build
products, a known API for fetching a build product, etc.). It's good to keep
your work environment as standard as you reasonably can.


> 3. I *heard* of lessfs but I have absolutely no idea if it is relevant
>    (search and check?).
>

I need to check how it handles gzipped content.


>
> 4. MVFS (Multi-Version FileSystem - the underlying technology of
>    Rational's ClearCase) comes to mind. It's not open source (or cheap).
>    It is not userspace. It is probably only available as a part of
>    ClearCase. Just mentioning for completeness.
>
> If none of the above is even remotely relevant, sorry for the noise.
>
> --
> Oleg Goldshmidt | pub at goldshmidt.org
>