filesystem capable of deduping tar.gz's content

E.S. Rosenberg esr+linux-il at g.jct.ac.il
Thu May 9 02:27:26 IDT 2013


2013/5/8 Elazar Leibovich <elazarl at gmail.com>:
> On Wed, May 8, 2013 at 10:47 PM, Oleg Goldshmidt <pub at goldshmidt.org> wrote:
>>
>>
>> Disclaimer: I am definitely not an expert on the subject matter and I
>> hardly know what I am talking about (in this case?). Creativity is no
>> substitute for knowing what you are doing.
>>
>> Now let me try and get creative.
>>
>> What is your purpose? Just doing something fancy to impress your boss
>
>
> My real purpose, or the official stated purpose ;-)
> The thing is, we build a 400MB artifact multiple times a day. Say
> 5×200×400MB=400GB a year, not sure I have the space. This means I have to
> maintain the repository (delete old build results, etc).
> OTOH, if I use a dedupe technique, I can keep all build artifacts and forget
> about it altogether. I'll never ever fill a modern 250GB disk.
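For reference, the estimate above can be checked with shell arithmetic. This is only a sketch: it reads 5×200×400 as 5 builds per day, 200 build days per year and 400 MB per build, which is an assumed interpretation, and `yearly_storage_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: yearly storage in decimal GB (1 GB = 1000 MB).
# Arguments: builds per day, build days per year, artifact size in MB.
yearly_storage_gb() {
    echo $(( $1 * $2 * $3 / 1000 ))
}

yearly_storage_gb 5 200 400
```

With those assumed inputs the figure comes out at 400 GB a year, matching the number quoted above.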
Considering the _low_ price of storage these days, why in the world
would you invest expensive time and effort in a complicated way of
saving a few bytes here and there?
3xx NIS buys you 1TB. Since you want some data safety, you would
mirror or use some other redundant RAID level such as RAID 5, which
costs a bit more, but you end up with storage that should last you
several years by your calculations.
In addition, if you build many times a day and every build result is one
tar.gz, you can just periodically clean up all the tar.gz files that
never went to production (I assume builds that get shipped get
moved/copied to a different location). That costs far less than
developing a layer on top of a filesystem that does a task beyond
what a filesystem should do (i.e. open a [compressed] file and examine
its contents).
[find /path/of/builds -type f -mtime +30 -exec rm {} \;]
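A slightly more defensive version of that one-liner, as a sketch only. `prune_builds` is a hypothetical helper name, and the `*.tar.gz` pattern and the 30-day default are assumptions about how the build directory is laid out. It lists candidates by default and deletes only when asked to.

```shell
# Hypothetical helper: list (default) or delete build archives older than N days.
# Usage: prune_builds DIR [DAYS] [--delete]
prune_builds() {
    dir=$1
    days=${2:-30}
    action=${3:-}
    if [ "$action" = "--delete" ]; then
        # -type f skips directories; "{} +" batches many files into one rm call
        find "$dir" -type f -name '*.tar.gz' -mtime +"$days" -exec rm -f -- {} +
    else
        # Dry run: only print what would be removed
        find "$dir" -type f -name '*.tar.gz' -mtime +"$days" -print
    fi
}
```

Running `prune_builds /path/of/builds` first and reading the list before adding `--delete` (for example from cron) avoids surprises if shipped builds turn out to live in the same tree.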
>
>>
>> or
>> truly save space, e.g., if this stuff - everything that gets built - is
>> backed up? I'll assume the latter.
>>
>> [Aside: if it is not backed up, how many versions do you really need to
>> keep and why is it an issue?]
>>
>> 1. I would probably look into using a version control system rather than
>>    a filesystem.
>>
>>    a) Modern version control systems are often/usually capable of
>>       storing binary diffs between revisions. Frankly, I've never looked
>>       at how git or mercurial do that (probably quite well), but even,
>>       say, SVN should be able to store a binary diff on commit. IIRC SVN
>>       diffs using xdelta or similar.
>
>
> I suspect they don't work well on gzipped content:
>
> Binary file with diff:
>
> (fabenv_mac)❯ du -h .git/objects
> 4.0K .git/objects/08
> 232K .git/objects/3d
> 4.0K .git/objects/44
> 4.0K .git/objects/84
> 232K .git/objects/d7
> 4.0K .git/objects/ee
>   0B .git/objects/info
>   0B .git/objects/pack
> 480K .git/objects
> (fabenv_mac)❯ git gc
> Counting objects: 6, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (6/6), done.
> Writing objects: 100% (6/6), done.
> Total 6 (delta 1), reused 0 (delta 0)
> (fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
> 4.0K .git/objects//info/packs
> 4.0K .git/objects//pack/pack-bd546ad638a3a27e16e57298469558cdd5018879.idx
> 216K .git/objects//pack/pack-bd546ad638a3a27e16e57298469558cdd5018879.pack
>
> However when it's gzipped:
>
> (fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
> 4.0K .git/objects//2a/8fc1caff222272cb043bbf18d240c54315f9d0
> 4.0K .git/objects//4e/71017582e4f46b3641d27084e5cae0c3303974
> 216K .git/objects//70/81d2b08bc00dff607aea60e9c6fecbc6950b16
> 216K .git/objects//8e/71116f4a7f89af36051b8b431427c0e88ab741
> 4.0K .git/objects//92/00e8eaf6093e6cfd07735bc9fe30da4e86db33
> 4.0K .git/objects//9d/e5e4af60673998992579be40960d65a5b498a3
> (fabenv_mac)❯ git gc
> Counting objects: 6, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (6/6), done.
> Writing objects: 100% (6/6), done.
> Total 6 (delta 0), reused 0 (delta 0)
> (fabenv_mac)❯ du -sh .git/objects
> 440K .git/objects
> (fabenv_mac)❯ find .git/objects/ -type f|xargs du -h
> 4.0K .git/objects//info/packs
> 4.0K .git/objects//pack/pack-5253e59d6e6950fbbf8455310bb32e3004ded6b2.idx
> 432K .git/objects//pack/pack-5253e59d6e6950fbbf8455310bb32e3004ded6b2.pack
>
> Note that the total size did not shrink after git gc when the same two
> versions of the file (gcc binary with the first byte changed) were gzipped:
> git found no delta between them ("delta 0").
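The effect can be looked at without git by counting how many byte positions differ before and after compression. This is a rough sketch under assumptions: the `seq` output is only a stand-in for the gcc binary, `count_diff_bytes` is a hypothetical helper, and the gzipped count depends on the data and the gzip version, so no particular number is claimed for it.

```shell
# Hypothetical helper: count the byte positions at which two files differ.
# cmp exits non-zero when files differ, so its status is ignored on purpose.
count_diff_bytes() {
    { cmp -l "$1" "$2" 2>/dev/null || true; } | wc -l
}

work=$(mktemp -d)
seq 1 200000 > "$work/v1"                            # stand-in for the binary
{ printf 'X'; tail -c +2 "$work/v1"; } > "$work/v2"  # same data, first byte changed

# -n keeps file name and timestamp out of the gzip header
gzip -n -c "$work/v1" > "$work/v1.gz"
gzip -n -c "$work/v2" > "$work/v2.gz"

raw_diff=$(count_diff_bytes "$work/v1" "$work/v2")
gz_diff=$(count_diff_bytes "$work/v1.gz" "$work/v2.gz")
echo "raw: $raw_diff differing byte(s); gzipped: $gz_diff differing byte(s)"
```

The raw files differ in exactly one byte. If the gzipped pair differs in many more positions, a byte-level delta has little to work with, which would be consistent with the "delta 0" result above.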
>
>>
>>    b) I suppose one can write commit/get (I use this terminology only
>>       because I mentioned SVN, consider it generic) hooks for most
>>       version control systems to tar/untar (and possibly zip/unzip jars)
>>       if you really need something close to what you described.
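A minimal sketch of what such hooks could call, assuming the artifact is a plain `.tar.gz`. Both function names are hypothetical, and wiring them into a particular version control system's hook mechanism is left out. The idea is to hand the version control system the uncompressed tar, so that its own delta storage sees the raw bytes.

```shell
# Hypothetical helper for a commit-side hook:
# turn build.tar.gz into build.tar so the VCS stores uncompressed bytes.
unpack_for_commit() {
    gzip -dc "$1" > "${1%.gz}"
}

# Hypothetical helper for a checkout-side hook:
# recreate build.tar.gz from build.tar.
# -n leaves name and timestamp out of the gzip header, so the output
# does not depend on when it was produced.
repack_after_checkout() {
    gzip -n -c "$1" > "$1.gz"
}
```

Note that the recreated archive holds the same tar, but it is not guaranteed to be byte-identical to the original `.tar.gz` (different gzip version or compression level), which matters if checksums of the compressed file are published anywhere.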
>
>
> All your suggestions are basically good, but they mean I have to change the
> work style of the whole team.
> The main benefit of my suggestion is that it's completely transparent. I
> add a single mount command for the directory where I already keep my binary
> files, and that's it. Everything still works as usual, except I never need
> to worry about deleting anything.
Yes, but the disadvantage is that you still have to develop the whole
thing (which breaks out of fs boundaries). Is the cost of developing
that, instead of getting people used to a decent versioning system
(which is a very good idea anyhow), really lower than the loss of a few
hours/days of slightly lower productivity while people get used to a
new work mode?

Sorry to sound so cynical, but I hope that it was helpful in some way.
Regards,
Eliyahu - אליהו

> BTW Java artifacts have a very easy to set-up and known deployment mechanism
> (binary repository with a known protocol to keep binary build products,
> known API for how to get a build product, etc). It's good to keep your work
> environment as standard as you reasonably can.
>
>>
>> 3. I *heard* of lessfs but I have absolutely no idea if it is relevant
>>    (search and check?).
>
>
> I need to check how it supports gzip.
>
>>
>>
>> 4. MVFS (Multi-Version FileSystem - the underlying technology of
>>    Rational's ClearCase) comes to mind. It's not open source (or cheap).
>>    It is not userspace. It is probably only available as a part of
>>    ClearCase. Just mentioning for completeness.
>>
>> If none of the above is even remotely relevant, sorry for the noise.
>>
>> --
>> Oleg Goldshmidt | pub at goldshmidt.org
>
>
>


