filesystem capable of deduping tar.gz's content

Tzafrir Cohen tzafrir at cohens.org.il
Thu May 9 10:39:37 IDT 2013


On Thu, May 09, 2013 at 09:27:28AM +0300, Elazar Leibovich wrote:
> On Wed, May 8, 2013 at 11:11 PM, Tzafrir Cohen <tzafrir at cohens.org.il>wrote:
> 
> >
> > Git stores files. It should handle such deduping by design. But this
> > is in Git's storage, and not in the actual filesystem:
> >
> 
> git packs them in a pack file.
> 
> Use git gc to make it aware of changes, or just look at my reply to Oleg.

(It's really a side-issue, as it won't help you, but, xkcd.com/385.
Warning: another long post)

No. That's pure file-level de-duplication.

Following my previous run:

tzafrir at pungenday:/tmp/git-test(master)$ du -s .git .
1188    .git
2052    .
tzafrir at pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
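The identical blob IDs above follow from how git names objects: a blob's
name is the SHA-1 of a short "blob <size>\0" header followed by the file's
bytes, and the file name is not part of it. A minimal sketch in Python of
that calculation; the helper name git_blob_id and the stand-in data are my
own for illustration, not anything taken from git itself:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name git assigns to a blob with this content."""
    # The hashed data is "blob <decimal size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Stand-in data: the real 'rand' file from the transcript is not reproduced here.
rand = bytes(range(256)) * 4
rand1 = bytes(rand)  # same bytes, as if stored under a different file name

print(git_blob_id(rand) == git_blob_id(rand1))        # True
print(git_blob_id(rand) == git_blob_id(rand + b"x"))  # False
```

Same bytes give the same object name, so the object is stored once. A
single changed byte gives a completely different name, which is why this
is only file-level de-duplication.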

Now let's see how it handles gzipped files. Note that you need to add
~2M to the size of '.' in the following, as 'rand' and 'rand1' are
missing from it (gzip replaces them with the .gz files).

When we use gzip with -n, we get exactly the same file:

tzafrir at pungenday:/tmp/git-test(master)$ gzip -n rand
tzafrir at pungenday:/tmp/git-test(master)$ gzip -n rand1
tzafrir at pungenday:/tmp/git-test(master)$ git add rand*.gz
tzafrir at pungenday:/tmp/git-test(master)$ git commit -m gzipped
[master 443ded4] gzipped
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 rand.gz
 create mode 100644 rand1.gz
tzafrir at pungenday:/tmp/git-test(master)$ du -s .git .
2236    .git
2060    .
tzafrir at pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz

Without -n, we get different files:

tzafrir at pungenday:/tmp/git-test(master)$ git show HEAD:rand > rand2
tzafrir at pungenday:/tmp/git-test(master)$ git show HEAD:rand > rand3
tzafrir at pungenday:/tmp/git-test(master)$ gzip rand2
tzafrir at pungenday:/tmp/git-test(master)$ gzip rand3
tzafrir at pungenday:/tmp/git-test(master)$ md5sum rand2.gz rand3.gz 
603d95587520d3ca203329eaeea8ac6c  rand2.gz
572f7178846083f82bb56da1e996d9a1  rand3.gz
tzafrir at pungenday:/tmp/git-test(master)$ git add rand[23].gz
tzafrir at pungenday:/tmp/git-test(master)$ git commit -m "gzipped without -n"
tzafrir at pungenday:/tmp/git-test(master)$ du -s .git .
4316    .git
4116    .
tzafrir at pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz
100644 blob 4e2e7106fad7c2d38dfed8b0221686ac561709ac    rand2.gz
100644 blob f2f9615fb255d7e75d74e65a675ca850a3c67a34    rand3.gz

rand.gz and rand1.gz have the same content (or rather: the same sha1
checksum of the content) and thus are considered to be the same object.
When you don't use '-n', gzip stores the original file name and
timestamp in the header, so you get a slightly different result of the
compression (only the header differs. The rest of the file is the same).
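To see the header effect in isolation, here is a minimal sketch using
Python's standard gzip module. It assumes that an empty name with a zero
timestamp stands in for 'gzip -n', and that passing a name and timestamp
stands in for plain 'gzip'. The helper gzip_bytes, the 64 KiB of random
stand-in data and the timestamps are made up for illustration:

```python
import gzip
import io
import os


def gzip_bytes(data: bytes, name: str, mtime: int) -> bytes:
    """Gzip data in memory, recording the given original name and timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


data = os.urandom(64 * 1024)  # stand-in for the 'rand' file

# Like 'gzip -n': no name, zero timestamp. Two runs are byte-identical.
plain_a = gzip_bytes(data, "", 0)
plain_b = gzip_bytes(data, "", 0)

# Like plain 'gzip': the name and the timestamp go into the header.
named2 = gzip_bytes(data, "rand2", 1368085177)
named3 = gzip_bytes(data, "rand3", 1368085178)

# 10 fixed header bytes, then the name, then a NUL terminator.
header_len = 10 + len("rand2") + 1

print(plain_a == plain_b)                          # True
print(named2 == named3)                            # False
print(named2[header_len:] == named3[header_len:])  # True
```

The two named outputs hash differently, so git stores them as separate
objects, even though everything after the header is the same.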

That's before any packing. When git packs them, it can compress the three
gzip-compressions of rand to almost the size of a single one:

tzafrir at pungenday:/tmp/git-test(master)$ git gc
Counting objects: 12, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (12/12), done.
Total 12 (delta 4), reused 0 (delta 0)
tzafrir at pungenday:/tmp/git-test(master)$ du -s .git .
1164    .git
4116    .
tzafrir at pungenday:/tmp/git-test(master)$ git ls-tree HEAD
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand.gz
100644 blob e5ad63a5eb4a806ae572e977742cdce9e9f74cfa    rand1
100644 blob a25f7e04d2c142d0f25d88850d75999c7cfa8391    rand1.gz
100644 blob 4e2e7106fad7c2d38dfed8b0221686ac561709ac    rand2.gz
100644 blob f2f9615fb255d7e75d74e65a675ca850a3c67a34    rand3.gz

And indeed:

tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
wc -c
3146274
tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
gzip | wc -c
3146772

Hmm... not quite as expected. gzip only looks back 32 KiB, far less than
the ~1M distance between the repeated data. We need a larger dictionary:

tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
gzip -9 | wc -c
3146772
tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
bzip2 | wc -c
3160302
tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
bzip2 -9 | wc -c
3160302
tzafrir at pungenday:/tmp/git-test(master)$ cat rand.gz rand2.gz rand3.gz |
xz | wc -c
1049516
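The same effect can be reproduced without git or the original files. A
minimal sketch in Python, assuming its standard zlib, bz2 and lzma modules
are fair stand-ins for gzip -9, bzip2 -9 and xz with default settings (same
algorithms, slightly different containers). DEFLATE can look back at most
32 KiB and bzip2 works in blocks of at most 900 kB, so neither can see a
repeat 1 MiB away. The default xz preset uses an 8 MiB dictionary. The
1 MiB random block is a stand-in for 'rand':

```python
import bz2
import lzma
import os
import zlib

block = os.urandom(1 << 20)  # 1 MiB of incompressible stand-in data
tripled = block * 3          # three identical copies back to back

sizes = {
    "zlib": len(zlib.compress(tripled, 9)),  # 32 KiB window
    "bz2": len(bz2.compress(tripled, 9)),    # 900 kB blocks
    "lzma": len(lzma.compress(tripled)),     # default preset, 8 MiB dictionary
}

print("input", len(tripled))
for name, size in sizes.items():
    print(name, size)
```

Only the lzma result should come out close to the size of a single copy.
The other two stay at roughly the full input size.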

Finally. So as I was saying, those three compress very well together and
thus git can efficiently pack them. The size also suggests that the
gzip compression did not change the content of the file very much in
this pathological case of a file with close-to-random content:

tzafrir at pungenday:/tmp/git-test(master)$ cat rand rand1 rand.gz rand1.gz rand2.gz rand3.gz | xz | wc -c
1050152

-- 
Tzafrir Cohen         | tzafrir at jabber.org | VIM is
http://tzafrir.org.il |                    | a Mutt's
tzafrir at cohens.org.il |                    |  best
tzafrir at debian.org    |                    | friend


