filesystem capable of deduping tar.gz's content
Tzafrir Cohen
tzafrir at cohens.org.il
Wed May 8 23:11:51 IDT 2013
On Wed, May 08, 2013 at 10:47:14PM +0300, Oleg Goldshmidt wrote:
> Elazar Leibovich <elazarl at gmail.com> writes:
>
> > Hi,
> >
> > I have a software product being built a few times a day (continuous
> > integration style). The end product is an installable tar.gz with many
> > java jars.
> >
> > Since the content of the tar.gz's is mostly the same, I want to use a
> > filesystem that would dedupe the duplicated content.
> >
> > As I see it, it's a FUSE filesystem that:
> >
> > 1. When a file with a .tar.gz extension is stored, it untars it and
> > stores it in a folder (keeping the file order in a list).
> > 2. When it is read again, it will tar gz the underlying folder, and
> > will give the gzip'd result.
> > 3. It will keep a list of file hashes, and would replace the file with
> > a symlink to another file if possible.
> > 4. Bonus: do the same for jars. Java is linked at runtime, so if a
> > .java file didn't change - neither does its class.
> >
> > Is there anything like that available?
> > Is there a smarter solution?
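
To make steps 1 and 3 of the proposal concrete, here is a rough Python
sketch. It is only an illustration under my own assumptions, not an
existing tool: the function name store_tarball, the flat store directory
and the choice of SHA-256 are all made up for the example, and it ignores
directories, links, permissions and the FUSE layer entirely.

```python
import hashlib
import tarfile
from pathlib import Path


def store_tarball(tar_gz_path, store_dir):
    """Unpack a .tar.gz into a content-addressed store (hypothetical layout).

    Returns a manifest: the regular-file members in archive order, each
    with the SHA-256 of its content. Identical content is written once.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    manifest = []
    with tarfile.open(tar_gz_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # this sketch skips directories, links, devices
            data = tar.extractfile(member).read()
            digest = hashlib.sha256(data).hexdigest()
            blob = store / digest
            if not blob.exists():  # dedupe: each distinct content stored once
                blob.write_bytes(data)
            manifest.append({"name": member.name, "sha256": digest})
    return manifest
```

Storing two builds that share most of their jars into the same store
directory would then only add the members whose content actually changed.
Reproducing the original tar.gz byte for byte on read (step 2) is a
separate problem that this sketch does not attempt.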

Can you afford a periodic scan by some service? Otherwise I figure you
could always trigger it with inotify, but that has an overhead.

http://dedup.debian.net gives the following advice, which I have not yet
tested:

# Replace duplicate files with symlinks
rdfind -outputname /dev/null -makesymlinks true debian/mypackage/
# Fix those symlinks to make them relative
symlinks -r -s -c debian/mypackage/

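
The general idea behind such a scan can be sketched in a few lines of
Python. This is not how rdfind is implemented; it is a minimal
illustration assuming SHA-256 content hashes, files small enough to read
into memory, and a made-up function name (symlink_duplicates).

```python
import hashlib
import os


def symlink_duplicates(root):
    """Replace duplicate regular files under root with relative symlinks.

    The first file seen with a given content hash is kept; later files
    with the same content become symlinks to it. Returns the number of
    files replaced.
    """
    first_seen = {}  # content hash -> path of the copy that is kept
    replaced = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # leave existing symlinks and special files alone
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            keeper = first_seen.setdefault(digest, path)
            if keeper == path:
                continue  # first occurrence of this content
            os.remove(path)
            # Relative target, so the tree can be moved as a whole.
            os.symlink(os.path.relpath(keeper, start=dirpath), path)
            replaced += 1
    return replaced
```

Note that symlinks change semantics: writing through one of the links
modifies the shared copy, so this only suits trees that are treated as
read-only after the build.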
> 1. I would probably look into using a version control system rather than
> a filesystem.
>
> a) Modern version control systems are often/usually capable of
> storing binary diffs between revisions. Frankly, I've never looked
> at how git or mercurial do that (probably quite well), but even,
> say, SVN should be able to store a binary diff on commit. IIRC SVN
> diffs using xdelta or similar.
>
Git stores files as content-addressed objects, so it should handle such
deduping by design. But this happens in Git's object storage, not in the
actual filesystem:

tzafrir@pungenday:/tmp/git-test$ git init
Initialized empty Git repository in /tmp/git-test/.git/
tzafrir at debian.org
tzafrir@pungenday:/tmp/git-test$ dd if=/dev/urandom bs=1024 count=1024 of=rand
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0832973 s, 12.6 MB/s
tzafrir@pungenday:/tmp/git-test$ du -s .git .
92 .git
1028 .
tzafrir@pungenday:/tmp/git-test$ git add rand
tzafrir@pungenday:/tmp/git-test$ git commit -m "rand"
[master (root-commit) 401d035] rand
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 rand
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1172 .git
1028 .
tzafrir@pungenday:/tmp/git-test(master)$ cp rand rand1
tzafrir@pungenday:/tmp/git-test(master)$ git add rand1
tzafrir@pungenday:/tmp/git-test(master)$ git commit -m "rand1"
[master a4d084f] rand1
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 rand1
tzafrir@pungenday:/tmp/git-test(master)$ du -s .git .
1188 .git
2052 .
tzafrir@pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa rand
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa rand1
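
The reason both entries carry the same blob ID is that, in a SHA-1
repository, Git names a blob by hashing a short header plus the file
content; the file name is not part of the hash. A small Python sketch of
that computation (the helper name git_blob_id is mine, for illustration
only):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git assigns to a blob in a SHA-1 repository.

    Git hashes the header "blob <size>" followed by a NUL byte and then
    the raw content. The file name is stored in the tree, not in the blob.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two files with identical bytes map to the same object, whatever their names.
rand = b"\x00\x01\x02" * 1000
rand1 = bytes(rand)
print(git_blob_id(rand) == git_blob_id(rand1))
```

This is why .git grew by only a few KB after committing rand1 in the
session above: the second commit added a new tree and commit object, but
no new blob.
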
There are a number of backup systems / schemes that aim to provide file
de-duplication. At least some of them use Git.
--
Tzafrir Cohen | tzafrir at jabber.org | VIM is
http://tzafrir.org.il | | a Mutt's
tzafrir at cohens.org.il | | best
tzafrir at debian.org | | friend