a technical notebook
<h1>Thursday, January 22</h1>

deleting files from git history
-------------------------------

Working on a project where we included some built files that took up a bunch of
space, and decided we should get rid of those. The git repository isn't public
yet and is only shared by a handful of users, so it seemed worth thinking about
rewriting the history a bit.

There's reasonably good documentation for this in the usual places if you look,
but I ran into some trouble.

First, what seemed to work: David Underhill has a [good short script][du] from
back in 2009 for using `git filter-branch` to eliminate particular files from
history:

> I recently had a need to rewrite a git repository’s history. This isn’t
> generally a very good idea, though it is useful if your repository contains
> files it should not (such as unneeded large binary files or copyrighted
> material). I also am using it because I had a branch where I only wanted to
> merge a subset of files back into master (though there are probably better
> ways of doing this). Anyway, it is not very hard to rewrite history thanks to
> the excellent git-filter-branch tool which comes with git.

I'll reproduce the script here, in the not-unlikely event that his writeup goes
away:

    #!/bin/bash
    set -o errexit

    # Author: David Underhill
    # Script to permanently delete files/folders from your git repository. To use
    # it, cd to your repository's root and then run the script with a list of paths
    # you want to delete, e.g., git-delete-history path1 path2

    if [ $# -eq 0 ]; then
        exit 0
    fi

    # make sure we're at the root of git repo
    if [ ! -d .git ]; then
        echo "Error: must run this script from the root of a git repository"
        exit 1
    fi

    # remove all paths passed as arguments from the history of the repo
    files=$@
    git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

    # remove the temporary history git-filter-branch otherwise leaves behind for a long time
    rm -rf .git/refs/original/ && git reflog expire --all && git gc --aggressive --prune

A big thank you to Mr. Underhill for documenting this one. `filter-branch`
seems really powerful, and not as brain-hurting as some things in git land.
The [docs][fb] are currently pretty good, and worth a read if you're trying to
solve this problem.

> Lets you rewrite Git revision history by rewriting the branches mentioned in
> the <rev-list options>, applying custom filters on each revision. Those
> filters can modify each tree (e.g. removing a file or running a perl rewrite
> on all files) or information about each commit. Otherwise, all information
> (including original commit times or merge information) will be preserved.

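For a feel of what the index filter does, here is a minimal sketch run against a throwaway repository rather than a real project. The file names (`code.txt`, `big.bin`), the demo identity, and the temporary directory are all made up for illustration. `FILTER_BRANCH_SQUELCH_WARNING` only matters on newer Git releases, which print a lengthy warning before `filter-branch` runs.

```shell
#!/bin/sh
# Sketch: remove one path from every commit on the current branch of a scratch repo.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # silences the warning newer Git prints
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

demo=$(mktemp -d)                        # hypothetical scratch location
cd "$demo"
git init -q .

echo "source" > code.txt
head -c 100000 /dev/urandom > big.bin    # stand-in for a large built file
git add code.txt big.bin
git commit -q -m "add code and a built file"

echo "more source" >> code.txt
git commit -q -a -m "edit code"

# Rewrite each commit reachable from HEAD, dropping big.bin from its index.
# --prune-empty discards commits that end up with no changes.
git filter-branch -f --prune-empty \
    --index-filter 'git rm -rf --cached --ignore-unmatch big.bin' HEAD >/dev/null 2>&1

# Every path that still appears anywhere in the rewritten history.
remaining=$(git log --pretty=format: --name-only | sort -u | sed '/^$/d')
echo "$remaining"
```

Only `HEAD` is rewritten here, as in the script above; any other branch or tag would keep pointing at the old commits.
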
After this, things got muddier. The script seemed to work fine, and after
running it I was able to see all the history I expected, minus some troublesome
files. (A version with `--prune-empty` added to the `git filter-branch`
invocation got rid of some empty commits.) But then:

    brennen@exuberance 20:05:00 /home/brennen/code $ du -hs pi_bootstrap
    218M pi_bootstrap
    brennen@exuberance 20:05:33 /home/brennen/code $ du -hs experiment
    199M experiment

That second repo is a clone of the original with the script run against it.
Why is it only tens of megabytes smaller, when minus the big binaries I zapped,
it should come in somewhere under 10 megs?

I will spare you, dear reader, the contortions I went through arriving at a
solution for this, partially because I don't have the energy left to
reconstruct them from the tattered history of my googling over the last few
hours. What I figured out was that for some reason, a bunch of blobs were
persisting in a pack file, despite not being referenced by any commits, and no
matter what I couldn't get `git gc` or `git repack` to zap them.

I more or less got this far with commands like:

    brennen@exuberance 20:49:10 /home/brennen/code/experiment2/.git (master) $ git count-objects -v
    count: 0
    size: 0
    in-pack: 2886
    packs: 1
    size-pack: 202102
    prune-packable: 0
    garbage: 0
    size-garbage: 0

And:

    git verify-pack -v ./objects/pack/pack-b79fc6e30a547433df5c6a0c6212672c5e5aec5f > ~/what_the_fuck

...which gives a list of all the stuff in a pack file, including
super-not-human-readable sizes that you can sort on, and many permutations of
things like:

    brennen@exuberance 20:49:12 /home/brennen/code/experiment2/.git (master) $ git log --pretty=oneline | cut -f1 -d' ' | xargs -L1 git cat-file -s | sort -nr | head
    589
    364
    363
    348
    341
    331
    325
    325
    322
    320

...where `cat-file` is a bit of a Swiss army knife for looking at objects, with
`-s` meaning "tell me a size".

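The `verify-pack -v` listing gets more useful once it is filtered and sorted. A rough sketch, assuming the third column of each object line is the object's size in bytes (which is how the `git verify-pack` documentation describes the format). The function name `largest_packed_blobs` and the scratch repository are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: list the biggest blobs stored in a repository's pack files.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# largest_packed_blobs [N]: print "size object-id" for the N largest packed blobs.
largest_packed_blobs() {
    for idx in "$(git rev-parse --git-dir)"/objects/pack/pack-*.idx; do
        [ -e "$idx" ] || continue          # no packs at all: nothing to list
        git verify-pack -v "$idx"
    done | awk '$2 == "blob" { print $3, $1 }' | sort -rn | head -n "${1:-10}"
}

# Scratch repository with one small and one large file.
demo=$(mktemp -d)
cd "$demo"
git init -q .
printf 'small\n' > small.txt
head -c 50000 /dev/urandom > big.bin
git add small.txt big.bin
git commit -q -m "two files"
git gc -q                                  # move the loose objects into a pack

big_id=$(git hash-object big.bin)
top=$(largest_packed_blobs 1)
echo "$top"
```

Because this reads the pack index directly, it also lists blobs that no commit on the current branch refers to, which is the case described above.
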
(An aside: If you are writing software that outputs a size in bytes, blocks,
etc., and you do not provide a "human readable" option to display this in
comprehensible units, the innumerate among us quietly hate your guts. This is
perhaps unjust of us, but I'm just trying to communicate my experience here.)

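For what it's worth, `git count-objects` does accept `-H` (`--human-readable`) alongside `-v`. Where a tool has no such option, a few lines of awk will do. The helper below is a hypothetical sketch, assuming the input is in KiB, which is the unit `size-pack` is documented to use.

```shell
#!/bin/sh
# to_human KIB: print a size given in KiB using the largest fitting binary unit.
to_human() {
    awk -v kib="$1" 'BEGIN {
        split("KiB MiB GiB TiB", unit, " ")
        i = 1
        while (kib >= 1024 && i < 4) { kib /= 1024; i++ }
        printf "%.1f %s\n", kib, unit[i]
    }'
}

to_human 202102    # the size-pack value from the listing above; prints 197.4 MiB
```
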
And finally, [Aristotle Pagaltzis's script][ap] for figuring out which commit
has a given blob (the answer is _fucking none of them_, in my case):

    #!/bin/sh
    obj_name="$1"
    shift
    git log "$@" --pretty=format:'%T %h %s' \
    | while read tree commit subject ; do
        if git ls-tree -r $tree | grep -q "$obj_name" ; then
            echo $commit "$subject"
        fi
    done

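Worth noting: that script only walks whatever `git log` is asked to walk, and with no extra arguments that is the current branch. Passing `--all` widens the search to every ref. The sketch below shows the difference on a scratch repository, using `git log --find-object`, an option found in Git releases much newer than 1.9.1. The branch name `side` and the file names are invented for the example.

```shell
#!/bin/sh
# Sketch: a blob that "no commit has" on the current branch can still live on another ref.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

demo=$(mktemp -d)
cd "$demo"
git init -q .
echo "source" > code.txt
git add code.txt
git commit -q -m "add code"
start=$(git symbolic-ref --short HEAD)    # whatever the default branch is called

# Commit a large file on a side branch only, then go back.
git checkout -q -b side
head -c 50000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add built file on side branch"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q "$start"

# Search the current branch, then every ref, for commits touching that blob.
on_current=$(git log --pretty=format:%s --find-object="$blob")
on_any_ref=$(git log --all --pretty=format:%s --find-object="$blob")
echo "current branch: [$on_current]"
echo "all refs:       [$on_any_ref]"
```
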
Also somewhere in there I learned how to use [`git bisect`][gb] (which is
really cool and likely something I will use again) and went through and made
entirely certain there was nothing in the history with a bunch of big files
in it.

So eventually I got to thinking ok, there's something here that is keeping
these objects from getting expired or pruned or garbage collected or whatever,
so how about doing a clone that just copies the stuff in the commits that still
exist at this point. Which brings us to:

    brennen@exuberance 19:03:08 /home/brennen/code/experiment2 (master) $ git help clone
    brennen@exuberance 19:06:52 /home/brennen/code/experiment2 (master) $ cd ..
    brennen@exuberance 19:06:55 /home/brennen/code $ git clone --no-local ./experiment2 ./experiment2_no_local
    Cloning into './experiment2_no_local'...
    remote: Counting objects: 2874, done.
    remote: Compressing objects: 100% (1611/1611), done.
    remote: Total 2874 (delta 938), reused 2869 (delta 936)
    Receiving objects: 100% (2874/2874), 131.21 MiB | 37.48 MiB/s, done.
    Resolving deltas: 100% (938/938), done.
    Checking connectivity... done.
    brennen@exuberance 19:07:15 /home/brennen/code $ du -hs ./experiment2_no_local
    133M ./experiment2_no_local
    brennen@exuberance 19:07:20 /home/brennen/code $ git help clone
    brennen@exuberance 19:08:34 /home/brennen/code $ git clone --no-local --single-branch ./experiment2 ./experiment2_no_local_single_branch
    Cloning into './experiment2_no_local_single_branch'...
    remote: Counting objects: 1555, done.
    remote: Compressing objects: 100% (936/936), done.
    remote: Total 1555 (delta 511), reused 1377 (delta 400)
    Receiving objects: 100% (1555/1555), 1.63 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (511/511), done.
    Checking connectivity... done.
    brennen@exuberance 19:08:47 /home/brennen/code $ du -hs ./experiment2_no_local_single_branch
    3.0M ./experiment2_no_local_single_branch

What's going on here? [Well][clone], `git clone --no-local`:

    --local
    -l
        When the repository to clone from is on a local machine, this flag
        bypasses the normal "Git aware" transport mechanism and clones the
        repository by making a copy of HEAD and everything under objects and
        refs directories. The files under .git/objects/ directory are
        hardlinked to save space when possible.

        If the repository is specified as a local path (e.g., /path/to/repo),
        this is the default, and --local is essentially a no-op. If the
        repository is specified as a URL, then this flag is ignored (and we
        never use the local optimizations). Specifying --no-local will override
        the default when /path/to/repo is given, using the regular Git
        transport instead.

And `--single-branch`:

    --[no-]single-branch
        Clone only the history leading to the tip of a single branch, either
        specified by the --branch option or the primary branch remote’s HEAD
        points at. When creating a shallow clone with the --depth option, this
        is the default, unless --no-single-branch is given to fetch the
        histories near the tips of all branches. Further fetches into the
        resulting repository will only update the remote-tracking branch for
        the branch this option was used for the initial cloning. If the HEAD at
        the remote did not point at any branch when --single-branch clone was
        made, no remote-tracking branch is created.

I have no idea why `--no-local` by itself reduced the size but didn't really do
the job.

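One reading that fits the object counts in the two clones (2874 versus 1555), offered as an assumption since it can't be checked against the original repository: the regular Git transport only sends objects reachable from the refs being cloned, so `--no-local` leaves unreachable objects behind, while anything still reachable from another branch or tag comes along until `--single-branch` narrows things to one branch. A sketch of that behavior on a scratch repository follows. The repository names, the `with-binary` branch, and the `stored_objects` helper are all invented for the example.

```shell
#!/bin/sh
# Sketch: objects held only by another branch survive --no-local but not --single-branch.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# stored_objects DIR: loose plus packed object count for the repository in DIR.
stored_objects() {
    git -C "$1" count-objects -v |
        awk '$1 == "count:" || $1 == "in-pack:" { n += $2 } END { print n }'
}

work=$(mktemp -d)
cd "$work"
git init -q src
cd src
echo "source" > code.txt
git add code.txt
git commit -q -m "add code"
start=$(git symbolic-ref --short HEAD)

# A second branch that still carries the large file.
git checkout -q -b with-binary
head -c 200000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add built file"
git checkout -q "$start"
cd ..

git clone -q --no-local src all-branches
git clone -q --no-local --single-branch src one-branch

all=$(stored_objects all-branches)   # commit, tree, blob for each of two commits
one=$(stored_objects one-branch)     # commit, tree, blob for the first commit only
echo "all branches: $all objects; single branch: $one objects"
```

If that is what happened, `git for-each-ref` in the bloated clone would show which refs were still holding on to the old history.
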
It's possible the lingering blobs would have been garbage collected
_eventually_, and at any rate it seems likely that in pushing them to a remote
repository I would have bypassed whatever lazy local file copy operation was
causing everything to persist on cloning, thus rendering all this
head-scratching entirely pointless, but then who knows. At least I understand
git file structure a little better than I did before.

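On the "eventually" part: by default, recent reflog entries keep otherwise unreferenced objects alive, and `git gc` leaves recently created unreachable objects alone as well. A sketch of forcing the issue with `--expire=now` and `--prune=now`, on a scratch repository with made-up names. This is a demonstration of those two options, not a claim about what was wrong in the repository above.

```shell
#!/bin/sh
# Sketch: an abandoned blob survives a default gc, then goes once reflogs are expired.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

demo=$(mktemp -d)
cd "$demo"
git init -q .
echo "source" > code.txt
git add code.txt
git commit -q -m "add code"
start=$(git symbolic-ref --short HEAD)

# Commit a large file on a temporary branch, then delete that branch.
git checkout -q -b scratch
head -c 100000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add built file"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q "$start"
git branch -q -D scratch        # HEAD's reflog still remembers the commit

git gc -q                       # default expiry settings
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi

echo "after default gc: $before; after forced expiry: $after"
```
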
For good measure, I just remembered how old much of the software on this
machine is, and I feel like kind of an ass:

    brennen@exuberance 21:20:50 /home/brennen/code $ git --version
    git version 1.9.1

This is totally an old release. If there's a bug here, maybe it's fixed by
now. I will not venture a strong opinion as to whether there is a bug. Maybe
this is entirely expected behavior. It is time to drink a beer.

[ap]: https://stackoverflow.com/questions/223678/which-commit-has-this-blob/223890#223890
[clone]: http://git-scm.com/docs/git-clone
[du]: http://dound.com/2009/04/git-forever-remove-files-or-folders-from-history/
[fb]: http://git-scm.com/docs/git-filter-branch
[gb]: http://git-scm.com/docs/git-bisect

postscript: on finding bugs
---------------------------

The first thing you learn, by way of considerable personal frustration and
embarrassment, goes something like this:

> Q: My stuff isn't working. I think there is probably a bug in this mature
> and widely-used (programming language | library | utility software).
>
> A: Shut up shut up shut up _shut up_ there is not a bug. Now go and figure
> out what is wrong with your code.

The second thing goes something like this:

> Oh. I guess that's actually a bug.

Which is to say: I have learned that I'm probably wrong, but sometimes I'm
also wrong about being wrong.