filesystem capable of deduping tar.gz's content

Wed May 8 23:11:51 IDT 2013

On Wed, May 08, 2013 at 10:47:14PM +0300, Oleg Goldshmidt wrote:
> Elazar Leibovich <elazarl at gmail.com> writes:
> 
> > Hi,
> >
> > I have a software product being built a few times a day (continuous
> > integration style). The end product is an installable tar.gz with many
> > java jars.
> >
> > Since the content of the tar.gz's is mostly the same, I want to use a
> > filesystem that would dedupe the duplicated content.
> >
> > As I see it, it's a FUSE filesystem that:
> >
> > 1. When a file with a .tar.gz extension is stored, it untars it and
> > stores it in a folder (keeping the file order in a list).
> > 2. When it is read again, it tars and gzips the underlying folder, and
> > returns the gzip'd result.
> > 3. It will keep a list of file hashes, and would replace the file with
> > a symlink to another file if possible.
> > 4. Bonus: do the same for jars. Java is linked at runtime, so if a
> > .java file didn't change - neither does its class.
> >
> > Is there anything like that available?
> > Is there a smarter solution?
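
A minimal Python sketch of steps 1-3 of the quoted proposal, written as
plain functions rather than as a FUSE filesystem. Each regular file in a
tar.gz is stored once in a dict keyed by its SHA-256, and an ordered
manifest allows the archive to be rebuilt later. The names store_archive,
rebuild_archive, make_targz and the in-memory blobs dict are all made up
for illustration; a real implementation would keep the blobs on disk.

```python
import hashlib
import io
import tarfile


def store_archive(targz_bytes, blobs):
    """Unpack a .tar.gz and store each regular file once, keyed by SHA-256.

    Returns an ordered manifest: a list of (TarInfo, digest-or-None) pairs.
    """
    manifest = []
    with tarfile.open(fileobj=io.BytesIO(targz_bytes), mode="r:gz") as tar:
        for member in tar:
            digest = None
            if member.isfile():
                data = tar.extractfile(member).read()
                digest = hashlib.sha256(data).hexdigest()
                blobs.setdefault(digest, data)  # identical content is kept once
            manifest.append((member, digest))
    return manifest


def rebuild_archive(manifest, blobs):
    """Re-create a .tar.gz from a manifest, preserving member order."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for member, digest in manifest:
            if digest is None:
                tar.addfile(member)  # directory, symlink, etc.
            else:
                tar.addfile(member, io.BytesIO(blobs[digest]))
    return out.getvalue()


def make_targz(files):
    """Demo helper: build a .tar.gz from a {name: bytes} mapping."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


# Two "builds" that share one unchanged jar.
blobs = {}
build1 = make_targz({"lib/a.jar": b"A" * 1000, "lib/b.jar": b"B" * 1000})
build2 = make_targz({"lib/a.jar": b"A" * 1000, "lib/b.jar": b"B2" * 500})
manifest1 = store_archive(build1, blobs)
manifest2 = store_archive(build2, blobs)
print(len(blobs))  # 4 files stored, 3 unique blobs
```

Note that the rebuilt archive holds the same members in the same order,
but is not guaranteed to be byte-identical to the original: the gzip
header carries a timestamp and the compression settings may differ.
Anything that checksums the .tar.gz itself would notice.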

Can you afford a periodic scan by some service? Otherwise you could
always trigger the scan with inotify, but that has an overhead.

http://dedup.debian.net gives the following advice, which I have not yet
tested:

# Replace duplicate files with symlinks
rdfind -outputname /dev/null -makesymlinks true debian/mypackage/
# Fix those symlinks to make them relative
symlinks -r -s -c debian/mypackage/
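
To make the idea behind those two commands concrete, here is a rough
Python sketch that does both steps in one pass: it hashes every regular
file under a directory and replaces later duplicates with relative
symlinks to the first copy. This is not how rdfind or symlinks are
implemented; the function names and the choice of SHA-256 are assumptions
made for the example, and skipping empty files is a choice of this sketch.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def symlink_duplicates(root):
    """Replace duplicate files under root with relative symlinks.

    The first file seen with a given content is kept; later copies are
    replaced. Returns the list of replaced paths.
    """
    first_seen = {}
    replaced = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                continue  # leave empty files alone
            original = first_seen.setdefault(file_digest(path), path)
            if original == path:
                continue
            # Relative target, so the tree can be moved as a whole.
            target = os.path.relpath(original, start=dirpath)
            os.remove(path)
            os.symlink(target, path)
            replaced.append(path)
    return replaced


# Small demo tree: two identical jars and one different one.
demo = tempfile.mkdtemp()
for rel, data in [("a/x.jar", b"same"), ("b/y.jar", b"same"), ("b/z.jar", b"diff")]:
    demo_path = os.path.join(demo, *rel.split("/"))
    os.makedirs(os.path.dirname(demo_path), exist_ok=True)
    with open(demo_path, "wb") as f:
        f.write(data)
replaced = symlink_duplicates(demo)
print(replaced)  # only b/y.jar is replaced
```

The sketch trusts the hash alone; a more careful version would compare
the file contents byte by byte before removing anything.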

> 1. I would probably look into using a version control system rather than
>    a filesystem.
> 
>    a) Modern version control systems are often/usually capable of
>       storing binary diffs between revisions. Frankly, I've never looked
>       at how git or mercurial do that (probably quite well), but even,
>       say, SVN should be able to store a binary diff on commit. IIRC SVN
>       diffs using xdelta or similar.
> 

Git stores files. It should handle such deduping by design. But this
is in Git's storage, and not in the actual filesystem:

tzafrir@pungenday:/tmp/git-test$ git init
Initialized empty Git repository in /tmp/git-test/.git/
tzafrir@pungenday:/tmp/git-test$ dd if=/dev/urandom bs=1024 count=1024 of=rand
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0832973 s, 12.6 MB/s
tzafrir@pungenday:/tmp/git-test$ du -s .git .
92      .git
1028    .
tzafrir@pungenday:/tmp/git-test$ git add rand
tzafrir@pungenday:/tmp/git-test$ git commit -m "rand"
[master (root-commit) 401d035] rand
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 rand
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1172    .git
1028    .
tzafrir@pungenday:/tmp/git-test(master)$ cp rand rand1
tzafrir@pungenday:/tmp/git-test(master)$ git add rand1
tzafrir@pungenday:/tmp/git-test(master)$ git commit -m "rand1"
[master a4d084f] rand1
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 rand1
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1188    .git
2052    .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
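
Both entries point to the same blob because Git names a blob after a
hash of its content, not after its file name. In a default (SHA-1)
repository the ID is the SHA-1 of a short "blob <size>\0" header followed
by the file's bytes. A small Python sketch of that computation (the
function name git_blob_id is made up for the example):

```python
import hashlib
import os


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git (SHA-1 mode) gives a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two "files" with the same content get the same ID, whatever their names.
rand = os.urandom(1024)
rand1 = bytes(rand)
print(git_blob_id(rand) == git_blob_id(rand1))  # True

# The well-known ID of the empty blob.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The result should match what "git hash-object <file>" prints for the same
bytes, so identical jars in different builds end up as a single object.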

There are a number of backup systems / schemes that aim to provide file
de-duplication. At least some of them use Git.

-- 
Tzafrir Cohen         | tzafrir at jabber.org | VIM is
http://tzafrir.org.il |                    | a Mutt's
tzafrir at cohens.org.il |                    |  best
tzafrir at debian.org    |                    | friend