filesystem capable of deduping tar.gz's content

Oleg Goldshmidt pub at goldshmidt.org
Wed May 8 22:47:14 IDT 2013


Elazar Leibovich <elazarl at gmail.com> writes:

> Hi,
>
> I have a software product being built a few times a day (continuous
> integration style). The end product is an installable tar.gz with many
> java jars.
>
> Since the content of the tar.gz's is mostly the same, I want to use a
> filesystem that would dedupe the duplicated content.
>
> As I see it, it's a FUSE filesystem that:
>
> 1. When a file with a .tar.gz extension is stored, it untars it and
> stores it in a folder (keeping the file order in a list).
> 2. When it is read again, it tars and gzips the underlying folder and
> returns the gzip'd result.
> 3. It keeps a list of file hashes, and replaces a file with a symlink
> to another file if possible.
> 4. Bonus: do the same for jars. Java is linked at runtime, so if a
> .java file didn't change - neither does its class.
>
> Is there anything like that available?
> Is there a smarter solution?

Disclaimer: I am definitely not an expert on the subject matter and I
hardly know what I am talking about (in this case?). Creativity is no
substitute for knowing what you are doing.

Now let me try and get creative.

What is your purpose? Is it just to do something fancy to impress your
boss, or do you truly need to save space, e.g., because this stuff -
everything that gets built - is backed up? I'll assume the latter.

[Aside: if it is not backed up, how many versions do you really need to
keep and why is it an issue?]

1. I would probably look into using a version control system rather than
   a filesystem.

   a) Modern version control systems are often/usually capable of
      storing binary diffs between revisions. Frankly, I've never looked
      at how git or mercurial do that (probably quite well), but even,
      say, SVN should be able to store a binary diff on commit. IIRC SVN
      diffs using xdelta or similar.

   b) I suppose one can write commit/get (I use this terminology only
      because I mentioned SVN, consider it generic) hooks for most
      version control systems to tar/untar (and possibly zip/unzip jars)
      if you really need something close to what you described.

   c) Integrate commits into your build, gets into your install
      procedure. Don't keep the actual tar.gz's around once
      committed. Only back up the version control repository.

   d) Version control commit/get may be slower than just copying the
      right file, even over the network. Unless your archives are very
      large, tarring/gzipping the files should still be reasonably
      efficient (I am thinking of your "several times a day" here).
      YMMV.

   e) Additional advantages include the possibility of tagging, etc.
  
2. I would suspect that filesystems that dedupe at block level (ZFS?)
   won't help - the contents of tar/gzip/zip files can hardly be
   expected to be block-aligned between versions, and with gzip a small
   change in the input typically alters the compressed output from
   that point on. But then you know that, eh?

3. I *heard* of lessfs but I have absolutely no idea if it is relevant
   (search and check?).

4. MVFS (Multi-Version FileSystem - the underlying technology of
   Rational's ClearCase) comes to mind. It's not open source (or cheap).
   It is not userspace. It is probably only available as a part of
   ClearCase. Just mentioning for completeness. 
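
To make point 2 a bit more concrete, here is a rough Python sketch. It
is a toy of my own, not a measurement of ZFS or any real deduper: the
file names, the word list and the 4K block size are all made up. It
builds two in-memory tar.gz's that differ in a single file, then counts
how many fixed-size blocks of the compressed archives are identical
versus how many of the unpacked files are identical.

```python
import gzip
import hashlib
import io
import random
import tarfile

# Hypothetical toy vocabulary; real jars would of course look different.
WORDS = ["class", "public", "static", "void", "import", "return",
         "string", "object", "final", "private", "package", "new"]


def make_text(rng, n_words=3000):
    """Generate compressible pseudo-text so that gzip really compresses it."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode()


def make_tar_gz(files):
    """Pack a {name: bytes} mapping into an in-memory tar.gz."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)  # mtime defaults to 0
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # mtime=0 keeps the gzip header identical between builds
    return gzip.compress(raw.getvalue(), mtime=0)


def block_hashes(blob, block_size=4096):
    """Hashes of fixed-size blocks, roughly what a block-level deduper sees."""
    return {hashlib.sha256(blob[i:i + block_size]).hexdigest()
            for i in range(0, len(blob), block_size)}


def file_hashes(files):
    """Hashes of the individual unpacked files."""
    return {hashlib.sha256(data).hexdigest() for data in files.values()}


def demo():
    rng = random.Random(0)
    build1 = {"lib/mod%02d.txt" % i: make_text(rng) for i in range(20)}
    build2 = dict(build1)
    # Only the first file differs between the two "builds".
    build2["lib/mod00.txt"] = b"changed " + build1["lib/mod00.txt"]

    shared_blocks = len(block_hashes(make_tar_gz(build1)) &
                        block_hashes(make_tar_gz(build2)))
    shared_files = len(file_hashes(build1) & file_hashes(build2))
    return shared_blocks, shared_files


if __name__ == "__main__":
    blocks, files = demo()
    print("identical 4K blocks in the two tar.gz's:", blocks)
    print("identical files after unpacking:", files)
```

With one file out of 20 changed, 19 unpacked files still hash the
same, while I would expect very few (if any) of the compressed blocks
to match. That is why both your unpack-and-hash idea and the version
control route work on the unpacked content rather than on the tar.gz
itself.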

If none of the above is even remotely relevant, sorry for the noise.

-- 
Oleg Goldshmidt | pub at goldshmidt.org


