filesystem capable of deduping tar.gz's content

Oleg Goldshmidt pub at goldshmidt.org
Thu May 9 11:57:36 IDT 2013


Elazar Leibovich <elazarl at gmail.com> writes:

> On Wed, May 8, 2013 at 11:11 PM, Tzafrir Cohen <tzafrir at cohens.org.il>
> wrote:
>
>     
>     
>     Git stores files. It should handle such deduping by design. But
>     this
>     is in Git's storage, and not in the actual filesystem:
>     
>
> git packs them in a pack file.
>
> Use git gc to make it aware of changes, or just look at my reply to
> Oleg.

Well, I have a horrible suspicion that I did not make myself quite
clear. At least that is my impression from your and Tzafrir's replies.

Let me try and rephrase. I am going to make some assumptions about what
is going on as you build. I obviously think these assumptions are
reasonably close to reality; you'll tell us which of them break
down. Then I will review the procedure I suggested yesterday and
discuss when it (or something like it) may be needed.

I would suggest - assuming my explanation below is clear enough - that
you take a few of your actual builds that differ a bit, commit them as
sequential revisions of the same file (as I describe below) into
svn/git/hg/whatever, and see how much space the repository takes in
each case. It is not clear to me how what you did with git is related
to my suggestion, and it does look to me as if Tzafrir misread my
intent. Of course, it may be me who is missing something.
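
For concreteness, here is a rough sketch of that experiment in Python.
The choice of git, the archive names, and the "du -sh" size check are
just my assumptions - substitute whatever tools and builds you actually
have:

    #!/usr/bin/env python
    # Sketch of the experiment: commit a few real build archives as
    # successive revisions of ONE file and see how big the repository
    # gets. Assumes git; the archive names below are placeholders.
    import os, shutil, subprocess

    builds = ["build-575.tar.gz", "build-576.tar.gz", "build-577.tar.gz"]
    repo = "dedup-test"

    subprocess.check_call(["git", "init", repo])
    for archive in builds:
        # Always the same target name, so each build becomes a new
        # revision of the same file rather than a new file.
        shutil.copy(archive, os.path.join(repo, "build.tar.gz"))
        subprocess.check_call(["git", "add", "build.tar.gz"], cwd=repo)
        subprocess.check_call(["git", "commit", "-m", "import " + archive],
                              cwd=repo)

    subprocess.check_call(["git", "gc"], cwd=repo)  # let git repack/delta
    subprocess.check_call(["du", "-sh", os.path.join(repo, ".git")])

If the .git directory comes out much smaller than the sum of the
archives you fed it, the procedure below is worth the trouble; if it
does not, you have your answer just as cheaply.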

To the assumptions:

A1. You have a build procedure that updates from a source version control
    repository, does a build - full or incremental, I'll touch upon it a
    bit later - and creates a tar.gz file with jars inside. The build
    itself is unattended, even if it is triggered manually (or on
    schedule, or on event - whatever). At present, all the builds are
    kept as independent files, with names like build-<NNN>.tar.gz (where
    <NNN> is a build number that is incremented) or
    build-<yyyymmdd-HHMM>.tar.gz (where <yyyymmdd-HHMM> is the
    timestamp) or according to some other naming scheme the particulars
    of which are not very important.

A2. You have an install procedure that takes one of the build-*.tar.gz
    files as input. Whether there is a command, a script, or whatever is
    not important. The actual argument - functionally, not by
    implementation - is the build number or timestamp or some other
    identifier that allows the install procedure (maybe the user,
    manually) to find the right archive.

A3. Your problem or concern is that all the numerous builds together
    take too much space even though they don't differ that much. This is
    what you want to alleviate.

A4. Your development team is used to the current procedure and you want to
    keep things as transparent as possible. You are willing to change
    some things (e.g., keep the built archives in a different place, on
    a different partition with a different file system, and make the
    install tool - or users - aware of the change in pathnames, etc.) as
    long as it doesn't change or complicate the procedures too much from
    the users' point of view.

A5. Of all the builds only a few are useful often. Possibly a few recent
    ones to facilitate rollbacks when a problem occurs, plus some known
    good old ones corresponding to production releases, versions for
    specific customers, etc. I expect them to be identifiable, even if
    only by some Excel spreadsheet that says that build 437 is
    production version 3.1 in the field (hopefully something more
    automatic). I do not expect the whole team to know what a random
    build-328.tar.gz from 4 months ago corresponds to, or use it
    regularly.

So my suggestion yesterday was as follows:

1. Create a new repository (or module, if your version control supports
   it) that will only hold a single file - build.tar.gz. The version
   control should be chosen to handle binary diffs efficiently (in
   space, and in time as a secondary consideration). Assuming
   svn/git/hg/whatever are all good at it, choose the tool your dev team
   is most comfortable with.

2. Add an extra step to the build procedure - either modify the build
   process itself or add an external step after the build. The step will
   be moving or copying build-<NNN>.tar.gz to results/build.tar.gz -
   always the same filename for all the different builds - that is under
   version control. NB: *clobber* the file, do not *add* another one as
   Tzafrir seems to have understood me. The version control will
   recognize the file as locally modified. Commit the change (see the
   sketch after this list). The
   repository will hold binary deltas, i.e., if you do this 10 times
   your repository will not hold 10*X bytes (where X is the typical
   build size), but X+9*dX bytes, where the binary delta dX is much
   smaller than X. The file version may be made to correspond to build
   number (e.g., SVN increments version numbers on commit, though git/hg
   don't), or may be tagged symbolically as a part of the process. By
   assumption A1 above the build is unattended, so you should be able to
   do that without anyone noticing anything.

3. Once step 2 above is done you can decide which of the original
   build-<NNN>.tar.gz's you can remove. By assumption A5 you can remove
   many/most of them, e.g., all older than a week/month/whatever except
   a few marked as useful in the long run. Assuming that "useful" marks
   are detectable this step can be automated and incorporated into the
   build.

4. If you can modify the install procedure to get the right version from
   the version control system and then use the checked out file (see
   assumption A2) then you can remove *all* the build results after they
   have been committed (and tagged) as revisions. If your current system
   is sane then such interference (steps 2-4 here) will not lead to a
   horrible disruption of your team's workflow.

5. If the installation procedure cannot be tampered with at all you
   still end up with a lot of space savings. According to step 3 you
   keep the likely useful builds exactly as they were before. In those
   rare cases when someone needs a random build 328 from 4 months ago
   that no one has touched since it was created it can be checked out
   manually; I am pretty sure your team can handle this task once in a
   blue moon.

6. As a bonus, your version control will allow you to get not only build
   576 but also the last build before May 9, etc.
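
To make steps 2 and 4 above concrete, here is a minimal sketch in
Python. The results/ directory, the build-<NNN> tag convention, and
the use of git are assumptions of the example, not requirements:

    #!/usr/bin/env python
    # Sketch of the extra post-build step (step 2) and of fetching a
    # given build for installation (step 4). Directory names, the tag
    # format, and the choice of git are assumptions of this example.
    import shutil, subprocess

    RESULTS = "results"  # a working copy of the builds repository

    def commit_build(archive, build_number):
        # Clobber results/build.tar.gz with the new archive and commit.
        shutil.copy(archive, RESULTS + "/build.tar.gz")
        subprocess.check_call(["git", "add", "build.tar.gz"], cwd=RESULTS)
        subprocess.check_call(["git", "commit", "-m",
                               "build %d" % build_number], cwd=RESULTS)
        # Tag the revision with the build number so the install side
        # can find it later.
        subprocess.check_call(["git", "tag", "build-%d" % build_number],
                              cwd=RESULTS)

    def checkout_build(build_number):
        # Step 4: make results/build.tar.gz correspond to the requested
        # build. In practice the install tool would use its own clone.
        subprocess.check_call(["git", "checkout", "build-%d" % build_number],
                              cwd=RESULTS)
        return RESULTS + "/build.tar.gz"

    if __name__ == "__main__":
        commit_build("build-576.tar.gz", 576)  # after the regular build
        print(checkout_build(576))             # what the install tool uses

commit_build() is the only thing the unattended build needs to grow;
checkout_build() matters only if you go as far as step 4.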
    
Now, when is this exercise worthwhile? Only when the build procedure
itself is prohibitively expensive/lengthy. If it is not then I'd say
don't store your built archives forever and just rebuild from version
control when needed (you do tag your source snapshots as build numbers,
etc., right?). One should not store the build results for long unless it
is necessary. So, what are the use cases that may justify storing build
results?
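
If you do go the rebuild-on-demand route, tagging the source per build
and rebuilding from a tag is cheap to automate. A sketch, again
assuming git, a build-<NNN> tag convention, and with "make dist"
standing in for whatever your build command really is:

    #!/usr/bin/env python
    # Sketch: tag the source tree that produced build <NNN> so the
    # archive itself need not be kept - it can be rebuilt from the tag.
    # Assumes git, a "build-<NNN>" tag convention, and a remote called
    # origin.
    import subprocess

    def tag_source(build_number, source_dir="."):
        tag = "build-%d" % build_number
        subprocess.check_call(["git", "tag", tag], cwd=source_dir)
        subprocess.check_call(["git", "push", "origin", tag], cwd=source_dir)

    def rebuild(build_number, source_dir="."):
        # Check the tagged snapshot out and rerun the regular build.
        subprocess.check_call(["git", "checkout", "build-%d" % build_number],
                              cwd=source_dir)
        subprocess.check_call(["make", "dist"], cwd=source_dir)  # your build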

C1. The full build is waaaay too long, say, many hours. Your "continuous
    integration" process would not allow you to build many times a day
    if you did full builds, but you track dependencies intelligently and
    build incrementally (cf. assumption A1 above).

C2. You find out that checking an archive out of version control is much
    more efficient than checking out the source and building, e.g., it
    is 10 seconds vs. 10 minutes, and it is needed often enough.
  
C3. Your SOP is to link/test your changes against a large number of
    builds, corresponding to supported production releases, custom
    versions for specific clients, etc. And you find that checking out
    that many revisions from the repository and building them takes too
    long. Maybe even checking out the numerous archives is too
    sluggish. So keep those target builds as described in step 3 of the
    suggested procedure but remove everything else. There can't be too
    many of those - if there are then your support matrix is so huge
    that you have bigger problems than buying a terabyte disk.

There may be other use cases that I am missing now, but the point is
that you need to really think about your procedures and needs and
understand whether or not a neat deduping trick (or a functional
equivalent) is really needed. My guess is that in most cases it is not.

-- 
Oleg Goldshmidt | pub at goldshmidt.org



More information about the Linux-il mailing list