|
|
- <h1>Thursday, January 22</h1>
-
- deleting files from git history
- -------------------------------
-
- I'm working on a project where we committed some built files that took up a
- bunch of space, and we decided we should get rid of them. The git repository
- isn't public yet and is only shared by a handful of users, so it seemed worth
- thinking about rewriting the history a bit.
-
- There's reasonably good documentation for this in the usual places if you look,
- but I ran into some trouble.
-
- First, what seemed to work: David Underhill has a [good short script][du] from
- back in 2009 for using `git filter-branch` to eliminate particular files from
- history:
-
- > I recently had a need to rewrite a git repository’s history. This isn’t
- > generally a very good idea, though it is useful if your repository contains
- > files it should not (such as unneeded large binary files or copyrighted
- > material). I also am using it because I had a branch where I only wanted to
- > merge a subset of files back into master (though there are probably better
- > ways of doing this). Anyway, it is not very hard to rewrite history thanks to
- > the excellent git-filter-branch tool which comes with git.
-
- I'll reproduce the script here, in the not-unlikely event that his writeup goes
- away:
-
-     #!/bin/bash
-     set -o errexit
-
-     # Author: David Underhill
-     # Script to permanently delete files/folders from your git repository. To use
-     # it, cd to your repository's root and then run the script with a list of paths
-     # you want to delete, e.g., git-delete-history path1 path2
-
-     if [ $# -eq 0 ]; then
-         exit 0
-     fi
-
-     # make sure we're at the root of git repo
-     if [ ! -d .git ]; then
-         echo "Error: must run this script from the root of a git repository"
-         exit 1
-     fi
-
-     # remove all paths passed as arguments from the history of the repo
-     files=$@
-     git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD
-
-     # remove the temporary history git-filter-branch otherwise leaves behind for a long time
-     rm -rf .git/refs/original/ && git reflog expire --all && git gc --aggressive --prune
-
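- A hedged aside on that last line: on recent versions of git, as far as I can
- tell, `git reflog expire --all` with no `--expire` value only drops entries
- older than the default (90 days), and a bare `--prune` keeps a grace period
- for recent objects. The sketch below is one way to make the cleanup immediate.
- It is not part of Underhill's script, and `expire_and_prune` is a name made up
- here for illustration.
-
```shell
#!/bin/sh
# Sketch: drop filter-branch's backup refs, expire reflogs immediately, and
# prune unreachable objects now instead of after the usual grace period.
# expire_and_prune is a hypothetical name; run it from inside the repository.
expire_and_prune() {
    # Delete the backup refs through git rather than rm -rf, so that
    # packed refs are handled as well.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done

    # --expire=now drops every reflog entry, not just the old ones.
    git reflog expire --expire=now --all

    # --prune=now removes unreachable objects regardless of their age.
    git gc --quiet --aggressive --prune=now
}
```
-
- To try it, source the file inside the rewritten repository once
- `git filter-branch` has finished, then call `expire_and_prune`. Objects that
- some other branch, tag, or remote-tracking ref still reaches will survive it.
-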
- A big thank you to Mr. Underhill for documenting this one. `filter-branch`
- seems really powerful, and not as brain-hurting as some things in git land.
- The [docs][fb] are currently pretty good, and worth a read if you're trying to
- solve this problem.
-
- > Lets you rewrite Git revision history by rewriting the branches mentioned in
- > the <rev-list options>, applying custom filters on each revision. Those
- > filters can modify each tree (e.g. removing a file or running a perl rewrite
- > on all files) or information about each commit. Otherwise, all information
- > (including original commit times or merge information) will be preserved.
-
- After this, things got muddier. The script seemed to work fine, and after
- running it I was able to see all the history I expected, minus some troublesome
- files. (A version with `--prune-empty` added to the `git filter-branch`
- invocation got rid of some empty commits.) But then:
-
-     brennen@exuberance 20:05:00 /home/brennen/code $ du -hs pi_bootstrap
-     218M pi_bootstrap
-     brennen@exuberance 20:05:33 /home/brennen/code $ du -hs experiment
-     199M experiment
-
- That second repo is a clone of the original with the script run against it.
- Why is it only about 19 megabytes smaller when, minus the big binaries I
- zapped, it should come in somewhere under 10 megs?
-
- I will spare you, dear reader, the contortions I went through arriving at a
- solution for this, partially because I don't have the energy left to
- reconstruct them from the tattered history of my googling over the last few
- hours. What I figured out was that for some reason, a bunch of blobs were
- persisting in a pack file, despite not being referenced by any commits, and no
- matter what I tried, I couldn't get `git gc` or `git repack` to zap them.
-
- I more or less got this far with commands like:
-
-     brennen@exuberance 20:49:10 /home/brennen/code/experiment2/.git (master) $ git count-objects -v
-     count: 0
-     size: 0
-     in-pack: 2886
-     packs: 1
-     size-pack: 202102
-     prune-packable: 0
-     garbage: 0
-     size-garbage: 0
-
- And:
-
-     git verify-pack -v ./objects/pack/pack-b79fc6e30a547433df5c6a0c6212672c5e5aec5f > ~/what_the_fuck
-
- ...which gives a list of all the stuff in a pack file, including
- super-not-human-readable sizes that you can sort on, and many permutations of
- things like:
-
-     brennen@exuberance 20:49:12 /home/brennen/code/experiment2/.git (master) $ git log --pretty=oneline | cut -f1 -d' ' | xargs -L1 git cat-file -s | sort -nr | head
-     589
-     364
-     363
-     348
-     341
-     331
-     325
-     325
-     322
-     320
-
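- ...where `cat-file` is a bit of a Swiss army knife for looking at objects, with
- `-s` meaning "tell me a size". (Note that this pipeline hands `cat-file` commit
- hashes, so those numbers are the sizes of the commit objects themselves, not
- of the files they touch.)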
-
- (An aside: If you are writing software that outputs a size in bytes, blocks,
- etc., and you do not provide a "human readable" option to display this in
- comprehensible units, the innumerate among us quietly hate your guts. This is
- perhaps unjust of us, but I'm just trying to communicate my experience here.)
-
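- For what it's worth, the unit conversion is easy to bolt on. The sketch below
- sorts the output of `git verify-pack -v` by its size column and prints the
- largest objects in readable units. `human_size` and `largest_pack_objects`
- are names invented for this example, and it assumes you run it from the top
- of a working tree whose `.git/objects/pack` holds at least one pack.
-
```shell
#!/bin/sh
# Sketch: show the biggest objects in a repository's pack files, with units.
# human_size and largest_pack_objects are hypothetical helper names.

# Convert a byte count into a short string such as 589B or 1.5K.
human_size() {
    awk -v n="$1" 'BEGIN {
        split("B K M G T", unit, " ")
        i = 1
        while (n >= 1024 && i < 5) { n /= 1024; i++ }
        if (i == 1) printf "%d%s\n", n, unit[i]
        else printf "%.1f%s\n", n, unit[i]
    }'
}

# Print "<size> <type> <hash>" for the N largest packed objects (default 10).
# The size is the third column of verify-pack's output; for objects stored
# as deltas, that figure may describe the delta rather than the full object.
largest_pack_objects() {
    git verify-pack -v .git/objects/pack/*.idx |
    awk '$2 == "blob" || $2 == "tree" || $2 == "commit" || $2 == "tag" { print $3, $2, $1 }' |
    sort -k1,1nr |
    head -n "${1:-10}" |
    while read -r size type hash; do
        printf '%s %s %s\n' "$(human_size "$size")" "$type" "$hash"
    done
}
```
-
- Calling `largest_pack_objects 5` lists the five largest entries. Unlike the
- `git log` pipeline above, this looks at everything in the pack, whether or
- not any commit still points at it.
-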
- And finally, [Aristotle Pagaltzis's script][ap] for figuring out which commit
- has a given blob (the answer is _fucking none of them_, in my case):
-
-     #!/bin/sh
-     obj_name="$1"
-     shift
-     git log "$@" --pretty=format:'%T %h %s' \
-     | while read tree commit subject ; do
-         if git ls-tree -r $tree | grep -q "$obj_name" ; then
-             echo $commit "$subject"
-         fi
-     done
-
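- A complementary check, sketched here under the assumption that `git fsck`
- behaves on your version the way its documentation describes: ask git
- directly which blobs nothing reaches. `unreachable_blobs` is a made-up
- helper name.
-
```shell
#!/bin/sh
# Sketch: list blobs that no ref reaches, ignoring reflogs.
# unreachable_blobs is a hypothetical helper; run it inside the repository.
unreachable_blobs() {
    # fsck prints lines like "unreachable blob <hash>". With --no-reflogs,
    # objects kept alive only by reflog entries count as unreachable too.
    git fsck --unreachable --no-reflogs 2>/dev/null |
    awk '$1 == "unreachable" && $2 == "blob" { print $3 }'
}
```
-
- If this prints nothing while the pack is still enormous, the big blobs are
- probably reachable from some ref after all (another branch, a tag, a
- remote-tracking ref, or something under `refs/original/`), which would
- explain why garbage collection leaves them alone.
-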
- Also somewhere in there I learned how to use [`git bisect`][gb] (which is
- really cool and likely something I will use again) and went through and made
- entirely certain there was nothing in the history with a bunch of big files
- in it.
-
- So eventually I got to thinking ok, there's something here that is keeping
- these objects from getting expired or pruned or garbage collected or whatever,
- so how about doing a clone that just copies the stuff in the commits that still
- exist at this point. Which brings us to:
-
-     brennen@exuberance 19:03:08 /home/brennen/code/experiment2 (master) $ git help clone
-     brennen@exuberance 19:06:52 /home/brennen/code/experiment2 (master) $ cd ..
-     brennen@exuberance 19:06:55 /home/brennen/code $ git clone --no-local ./experiment2 ./experiment2_no_local
-     Cloning into './experiment2_no_local'...
-     remote: Counting objects: 2874, done.
-     remote: Compressing objects: 100% (1611/1611), done.
-     remote: Total 2874 (delta 938), reused 2869 (delta 936)
-     Receiving objects: 100% (2874/2874), 131.21 MiB | 37.48 MiB/s, done.
-     Resolving deltas: 100% (938/938), done.
-     Checking connectivity... done.
-     brennen@exuberance 19:07:15 /home/brennen/code $ du -hs ./experiment2_no_local
-     133M ./experiment2_no_local
-     brennen@exuberance 19:07:20 /home/brennen/code $ git help clone
-     brennen@exuberance 19:08:34 /home/brennen/code $ git clone --no-local --single-branch ./experiment2 ./experiment2_no_local_single_branch
-     Cloning into './experiment2_no_local_single_branch'...
-     remote: Counting objects: 1555, done.
-     remote: Compressing objects: 100% (936/936), done.
-     remote: Total 1555 (delta 511), reused 1377 (delta 400)
-     Receiving objects: 100% (1555/1555), 1.63 MiB | 0 bytes/s, done.
-     Resolving deltas: 100% (511/511), done.
-     Checking connectivity... done.
-     brennen@exuberance 19:08:47 /home/brennen/code $ du -hs ./experiment2_no_local_single_branch
-     3.0M ./experiment2_no_local_single_branch
-
- What's going on here? [Well][clone], `git clone --no-local`:
-
-     --local
-     -l
-
-         When the repository to clone from is on a local machine, this flag
-         bypasses the normal "Git aware" transport mechanism and clones the
-         repository by making a copy of HEAD and everything under objects and
-         refs directories. The files under .git/objects/ directory are
-         hardlinked to save space when possible.
-
-         If the repository is specified as a local path (e.g., /path/to/repo),
-         this is the default, and --local is essentially a no-op. If the
-         repository is specified as a URL, then this flag is ignored (and we
-         never use the local optimizations). Specifying --no-local will override
-         the default when /path/to/repo is given, using the regular Git
-         transport instead.
-
- And `--single-branch`:
-
-     --[no-]single-branch
-
-         Clone only the history leading to the tip of a single branch, either
-         specified by the --branch option or the primary branch remote’s HEAD
-         points at. When creating a shallow clone with the --depth option, this
-         is the default, unless --no-single-branch is given to fetch the
-         histories near the tips of all branches. Further fetches into the
-         resulting repository will only update the remote-tracking branch for
-         the branch this option was used for the initial cloning. If the HEAD at
-         the remote did not point at any branch when --single-branch clone was
-         made, no remote-tracking branch is created.
-
- I have no idea why `--no-local` by itself reduced the size but didn't really do
- the job.
-
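- One reading that fits those numbers, offered as a guess rather than a
- diagnosis: the `filter-branch` invocation above names only `HEAD`, so only
- the current branch gets rewritten. Any other branch would keep its old
- history, big files included. A clone over the regular transport copies every
- branch, while `--single-branch` copies just one. The sketch below lists the
- refs whose history still touches a given path; `refs_touching_path` is a
- hypothetical helper name.
-
```shell
#!/bin/sh
# Sketch: list every ref whose history still contains a commit touching a path.
# refs_touching_path is a hypothetical helper; run it inside the repository.
refs_touching_path() {
    path=$1
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
        # rev-list prints a commit only if some commit reachable from
        # this ref added, changed, or removed the path.
        if [ -n "$(git rev-list -n 1 "$ref" -- "$path" 2>/dev/null)" ]; then
            printf '%s\n' "$ref"
        fi
    done
}
```
-
- Something like `refs_touching_path build/` (the path is only an example)
- would show whether other branches, tags, or `refs/original/` entries are
- still holding on to the files that were supposed to be gone.
-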
- It's possible the lingering blobs would have been garbage collected
- _eventually_. At any rate, it seems likely that pushing to a remote
- repository would have bypassed whatever lazy local file copy operation was
- causing everything to persist on cloning, thus rendering all this
- head-scratching entirely pointless, but then who knows. At least I understand
- git file structure a little better than I did before.
-
- For good measure, I just remembered how old much of the software on this
- machine is, and I feel like kind of an ass:
-
-     brennen@exuberance 21:20:50 /home/brennen/code $ git --version
-     git version 1.9.1
-
- This is totally an old release. If there's a bug here, maybe it's fixed by
- now. I will not venture a strong opinion as to whether there is a bug. Maybe
- this is entirely expected behavior. It is time to drink a beer.
-
- [ap]: https://stackoverflow.com/questions/223678/which-commit-has-this-blob/223890#223890
- [clone]: http://git-scm.com/docs/git-clone
- [du]: http://dound.com/2009/04/git-forever-remove-files-or-folders-from-history/
- [fb]: http://git-scm.com/docs/git-filter-branch
- [gb]: http://git-scm.com/docs/git-bisect
-
- postscript: on finding bugs
- ----------------------------
-
- The first thing you learn, by way of considerable personal frustration and
- embarrassment, goes something like this:
-
- > Q: My stuff isn't working. I think there is probably a bug in this mature
- > and widely-used (programming language | library | utility software).
- >
- > A: Shut up shut up shut up _shut up_ there is not a bug. Now go and figure
- > out what is wrong with your code.
-
- The second thing goes something like this:
-
- > Oh. I guess that's actually a bug.
-
- Which is to say: I have learned that I'm probably wrong, but sometimes I'm
- also wrong about being wrong.
|