filesystem capable of deduping tar.gz's content
Oleg Goldshmidt
pub at goldshmidt.org
Thu May 9 11:57:36 IDT 2013
Elazar Leibovich <elazarl at gmail.com> writes:
> On Wed, May 8, 2013 at 11:11 PM, Tzafrir Cohen <tzafrir at cohens.org.il>
> wrote:
>
>
>
> Git stores files. It should handle such deduping by design. But this
> is in Git's storage, and not in the actual filesystem:
>
>
> git packs them in a pack file.
>
> Use git gc to make it aware of changes, or just look at my reply to
> Oleg.
Well, I have a horrible suspicion that I did not make myself quite
clear. At least that is my impression from your and Tzafrir's replies.
Let me try and rephrase. I am going to make some assumptions about what
is going on as you build. I obviously think these assumptions are
reasonably close to reality; you'll tell us which of them break
down. Then I will review the procedure I suggested yesterday, and
discuss when it (or something like it) may be needed.
I would suggest - assuming my explanation below is clear enough - that
you take a few of the actual builds of yours that differ a bit and
commit them as sequential revisions of the same file (as I describe
below) into svn/git/hg/whatever and see how much space the repository
takes in each case. It is not clear to me how what you did with git is
related to my suggestion, and it does look to me that Tzafrir misread my
intent. Of course, it may be me who misses something.
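The experiment could be scripted roughly as follows. This is only a sketch, assuming git is the tool being tried, that user.name and user.email are already configured, and that consecutive archives actually differ (git refuses to commit an unchanged file). The names measure and dir_size and the injectable run argument are made up for illustration.

```python
import os
import shutil
import subprocess


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def measure(archives, repo_dir, run=subprocess.check_call):
    """Commit each archive as the next revision of build.tar.gz.

    Returns (sum of the raw archive sizes, size of .git after repacking).
    """
    os.makedirs(repo_dir, exist_ok=True)
    run(["git", "init", "-q"], cwd=repo_dir)
    target = os.path.join(repo_dir, "build.tar.gz")
    raw = 0
    for archive in archives:
        raw += os.path.getsize(archive)
        # Clobber the same file every time; do not add a new name.
        shutil.copyfile(archive, target)
        run(["git", "add", "build.tar.gz"], cwd=repo_dir)
        message = "build " + os.path.basename(archive)
        run(["git", "commit", "-q", "-m", message], cwd=repo_dir)
    # Repack so that the stored size reflects whatever deltas git finds.
    run(["git", "gc", "-q", "--aggressive"], cwd=repo_dir)
    return raw, dir_size(os.path.join(repo_dir, ".git"))
```

Calling measure(sorted(glob.glob("build-*.tar.gz")), "/tmp/dedup-test") on a handful of real builds and comparing the two returned numbers shows whether the repository is really much smaller than the archives kept side by side. The same loop can be repeated with svn or hg for comparison.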
Now to the assumptions:
A1. You have a build procedure that updates from a source version control
repository, does a build - full or incremental, I'll touch upon it a
bit later - and creates a tar.gz file with jars inside. The build
itself is unattended, even if it is triggered manually (or on
schedule, or on event - whatever). At present, all the builds are
kept as independent files, with names like build-<NNN>.tar.gz (where
<NNN> is a build number that is incremented) or
build-<yyyymmdd-HHMM>.tar.gz (where <yyyymmdd-HHMM> is the
timestamp) or according to some other naming scheme the particulars
of which are not very important.
A2. You have an install procedure that takes one of the build-*.tar.gz
files as input. Whether there is a command, a script, or whatever is
not important. The actual argument - functionally, not by
implementation - is the build number or timestamp or some other
identifier that allows the install procedure (maybe the user,
manually) to find the right archive.
A3. Your problem or concern is that all the numerous builds together
take too much space even though they don't differ that much. This is
what you want to alleviate.
A4. Your development team is used to the current procedure and you want to
keep things as transparent as possible. You are willing to change
some things (e.g., keep the built archives in a different place, on
a different partition with a different file system, and make the
install tool - or users - aware of the change in pathnames, etc.) as
long as it doesn't change or complicate the procedures too much from
the users' point of view.
A5. Of all the builds only a few are useful often. Possibly a few recent
ones to facilitate rollbacks when a problem occurs, plus some known
good old ones corresponding to production releases, versions for
specific customers, etc. I expect them to be identifiable, even if
only by some Excel spreadsheet that says that build 437 is
production version 3.1 in the field (hopefully something more
automatic). I do not expect the whole team to know what a random
build-328.tar.gz from 4 months ago corresponds to, or use it
regularly.
So my suggestion yesterday was as follows:
1. Create a new repository (or module, if your version control supports
it) that will only hold a single file - build.tar.gz. The version
control should be chosen to handle binary diffs efficiently (in
space, and in time as a secondary consideration). Assuming
svn/git/hg/whatever are all good at it, choose the tool your dev team
is most comfortable with.
2. Add an extra step to the build procedure - either modify the build
process itself or add an external step after the build. The step will
be moving or copying build-<NNN>.tar.gz to results/build.tar.gz -
always the same filename for all the different builds - that is under
version control. NB: *clobber* the file, do not *add* another one as
Tzafrir seems to have understood me. The version control will
recognize the file as locally modified. Commit the change. The
repository will hold binary deltas, i.e., if you do this 10 times
your repository will not hold 10*X bytes (where X is the typical
build size), but X+9*dX bytes, where the binary delta dX is much
smaller than X. The file version may be made to correspond to build
number (e.g., SVN increments version numbers on commit, though git/hg
don't), or may be tagged symbolically as a part of the process. By
assumption A1 above the build is unattended, so you should be able to
do that without anyone noticing anything.
3. Once step 2 above is done you can decide which of the original
build-<NNN>.tar.gz's you can remove. By assumption A5 you can remove
many/most of them, e.g., all older than a week/month/whatever except
a few marked as useful in the long run. Assuming that "useful" marks
are detectable this step can be automated and incorporated into the
build.
4. If you can modify the install procedure to get the right version from
the version control system and then use the checked out file (see
assumption A2) then you can remove *all* the build results after they
have been committed (and tagged) as revisions. If your current system
is sane then such interference (steps 2-4 here) will not lead to a
horrible disruption of your team's workflow.
5. If the installation procedure cannot be tampered with at all you
still end up with a lot of space savings. According to step 3 you
keep the likely useful builds exactly as they were before. In those
rare cases when someone needs a random build 328 from 4 months ago
that no one has touched since it was created it can be checked out
manually. I am pretty sure your team can handle this task once in a
blue moon.
6. As a bonus, your version control will allow you to get not only build
576 but also the last build before May 9, etc.
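Steps 2 and 3 could look something like the sketch below. It assumes git, an already initialized repository, and tags of the form build-<NNN>; the function names, the run argument and the shape of the builds mapping are all hypothetical, chosen only to illustrate the idea.

```python
import os
import shutil
import subprocess
from datetime import datetime, timedelta


def commit_build(archive, repo_dir, build_id, run=subprocess.check_call):
    """Step 2: clobber results/build.tar.gz with the new archive, commit, tag."""
    results = os.path.join(repo_dir, "results")
    os.makedirs(results, exist_ok=True)
    # Always the same file name, so the new build is a revision, not a new file.
    shutil.copyfile(archive, os.path.join(results, "build.tar.gz"))
    tag = "build-" + str(build_id)
    run(["git", "add", "results/build.tar.gz"], cwd=repo_dir)
    run(["git", "commit", "-q", "-m", tag], cwd=repo_dir)
    # git does not number revisions the way SVN does, so tag symbolically.
    run(["git", "tag", tag], cwd=repo_dir)
    return tag


def removable(builds, now, keep_days, useful):
    """Step 3: archives that can be deleted once they have been committed.

    builds maps archive name -> datetime of the build; useful is the set
    of names marked as worth keeping in the long run.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name, when in builds.items()
                  if when < cutoff and name not in useful)
```

With tags like these, the rare request for build 328 becomes git checkout build-328 -- results/build.tar.gz, and the last build before a given date can be located with git rev-list -n 1 --before=<date> HEAD.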
Now, when is this exercise worthwhile? Only when the build procedure
itself is prohibitively expensive/lengthy. If it is not then I'd say
don't store your built archives forever and just rebuild from version
control when needed (you do tag your source snapshots as build numbers,
etc., right?). One should not store the build results for long unless it
is necessary. So, what are the use cases that may justify storing build
results?
C1. The full build is waaaay too long, say, many hours. Your "continuous
integration" process would not allow you to build many times a day
if you did full builds, but you track dependencies intelligently and
build incrementally (cf. assumption A1 above).
C2. You find out that checking an archive out of version control is much
more efficient than checking out the source and building, e.g., it
is 10 seconds vs. 10 minutes, and it is needed often enough.
C3. Your SOP is to link/test your changes against a large number of
builds, corresponding to supported production releases, custom
versions for specific clients, etc. And you find that checking out
that many revisions from the repository and building them takes too
long. Maybe even checking out the numerous archives is too
sluggish. So keep those target builds as described in step 3 of the
suggested procedure but remove everything else. There can't be too
many of those - if there are then your support matrix is so huge
that you have bigger problems than buying a terabyte disk.
There may be other use cases that I am missing now, but the point is
that you need to really think about your procedures and needs and
understand whether or not a neat deduping trick (or a functional
equivalent) is really needed. My guess is that in most cases it is not.
--
Oleg Goldshmidt | pub at goldshmidt.org