Week 8

Day 3 - "Filtered repos"

Looking at a repo with rose tinted glasses

It does happen. Sometimes when people are under pressure, mistakes are made, just like earlier when we accidentally deleted our branch from the repository. This time the mistake is a little more serious, but again it does happen, and it sometimes goes unnoticed for a long time.
In the trenches...
"So it's been in there for how long?" asked John.

Simon looked pretty sheepish as he mouthed the words, "Weeks."

John bit on the end of the pen in his hand. His teeth chewed into the plastic, deforming the blue lid. "Did you find a way of sorting it out yet?"

"I think so. It's not ideal, but I think so."

It would be useful if we could rewrite the history to remove the information that we no longer want in it. As it turns out, there is a tool that we can use to do this. The git filter-branch tool allows us to run operations on a branch to rewrite its history. Hopefully you already remember the care we need to take when rewriting history, but sometimes there is a real need to perform some of these operations. Let us take a look at a few examples to see how this can work. We are going to assume that our file newfile1 contains some very sensitive information and we wish to remove it completely from the repository.
john@satsuki:~/coderepo$ git checkout master
Already on 'master'
john@satsuki:~/coderepo$ ls -la
total 40
drwxr-xr-x 3 john john 4096 2011-07-27 19:54 .
drwxr-xr-x 32 john john 4096 2011-07-27 19:00 ..
-rw-r--r-- 1 john john 35 2011-07-22 07:15 another_file
-rw-r--r-- 1 john john 25 2011-07-22 07:15 cont_dev
drwxrwxr-x 9 john john 4096 2011-07-27 19:54 .git
-rw-r--r-- 1 john john 69 2011-07-27 19:54 newfile1
-rw-r--r-- 1 john john 58 2011-07-22 07:15 newfile2
-rw-r--r-- 1 john john 45 2011-07-22 07:15 newfile3
-rw-r--r-- 1 john john 8 2011-03-31 22:15 temp_file
-rwxrwxr-x 1 john john 114 2011-07-21 21:17 test.sh
john@satsuki:~/coderepo$

As you can see, currently we have newfile1 in our tree. We can also use the git log tool to see each commit which has touched that path.
john@satsuki:~/coderepo$ git log --pretty=oneline master -- newfile1
9cb2af2a00fd2253060e6bf8cc6c377b3d55ecea Important Update
d50ffb2fa536d869f2c4e89e8d6a48e0a29c5cc1 Merged in zaney
a27d49ef11d9f0e66edbad8f6c7806510ad5b2be Made an awesome change
cfbecabb031696a217b77b0e1285f2d5fc2ea2a3 Fantastic new feature
55fb69f4ad26fdb6b90ac6f43431be40779962dd Added two new files
john@satsuki:~/coderepo$

So there were five commits in the past which have touched that path. In our example we require the removal of this path from the entire history of the repository. As this is a destructive operation that works on the current branch, meaning it will rewrite our branch HEAD, we are first going to switch into a new branch.
john@satsuki:~/coderepo$ git checkout -b remove_file
Switched to a new branch 'remove_file'
john@satsuki:~/coderepo$

Now we need to run the git filter-branch tool.
john@satsuki:~/coderepo$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch newfile1' HEAD
Rewrite 55fb69f4ad26fdb6b90ac6f43431be40779962dd (6/21)rm 'newfile1'
Rewrite 9710177657ae00665ca8f8027b17314346a5b1c4 (7/21)rm 'newfile1'
Rewrite 4ac92012609cf8ed2480aa5d7f807caf2545fe2f (8/21)rm 'newfile1'
Rewrite cfbecabb031696a217b77b0e1285f2d5fc2ea2a3 (9/21)rm 'newfile1'
Rewrite b119573f4508514c55e1c4e3bebec0ab3667d071 (10/21)rm 'newfile1'
Rewrite ed2301ba223a63a5a930b536a043444e019460a7 (11/21)rm 'newfile1'
Rewrite a27d49ef11d9f0e66edbad8f6c7806510ad5b2be (12/21)rm 'newfile1'
Rewrite 7cc32dbf121f2afa8c40337db54bafb26de5b9c4 (13/21)rm 'newfile1'
Rewrite d50ffb2fa536d869f2c4e89e8d6a48e0a29c5cc1 (14/21)rm 'newfile1'
Rewrite 9cb2af2a00fd2253060e6bf8cc6c377b3d55ecea (15/21)rm 'newfile1'
Rewrite 37950f861a3cc0868c65ee9571fc6c491aa689ea (16/21)rm 'newfile1'
Rewrite 1c3206aac0fb012bfdaf5ff00e320b565bb89e7d (17/21)rm 'newfile1'
Rewrite 1968324ce2899883fca76bc25496bcf2b15e7011 (18/21)rm 'newfile1'
Rewrite f8d5100142b43ffaba9bbd539ba4fd92af79bf0e (19/21)rm 'newfile1'
Rewrite a8281fb589e36389cc8cb0da7ebee225b4d1adfc (20/21)rm 'newfile1'
Rewrite 30900fe1b7e72411dabab8b02070f36e2431f704 (21/21)rm 'newfile1'

Ref 'refs/heads/remove_file' was rewritten
john@satsuki:~/coderepo$

We have passed a few parameters to git filter-branch and we should take a few seconds to discuss them, as the syntax may seem a little strange. Firstly we are invoking the git filter-branch tool, which should not be anything new at all. Next, we are passing three parameters to it. The first of these is the type of filter we wish to use. In our case we have used the --index-filter option. More information is available in the Git manual, but in a nutshell we have asked Git to work on the index at each commit, without checking any files out. There is another similar option called --tree-filter; however, care must be taken to distinguish between the two, as --tree-filter checks out the commit at each point in history. This may not sound like a problem, until you discover that as well as checking each revision out, it also automatically adds any untracked files in the working tree and commits them.

The next parameter is the actual command that we wish Git to perform on each revision. In this case we want to run git rm --cached --ignore-unmatch newfile1 each time. We have enclosed the command inside quotes so that it is passed to filter-branch as a single parameter, and there is no confusion over which options belong to filter-branch and which belong to rm. The --cached option asks Git to work on just the index, and --ignore-unmatch asks it not to complain if it cannot find the file to delete, which will be the case for every commit made before newfile1 existed.

Lastly we list the commit range we wish to filter. In this case we have specified the target revision as HEAD. Git will interpret this as meaning everything up to the HEAD revision. As such Git will be rewriting the entire history of the branch.
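
To see all three parameters working together without risking a real repository, the following is a minimal sketch that builds a throwaway repository and filters it. The file names keep.txt and secret.txt, the commit messages and the temporary directory are all assumptions made for illustration and are not part of our coderepo. The FILTER_BRANCH_SQUELCH_WARNING variable only matters on newer Git releases, which otherwise print a warning and pause before filter-branch starts.

```shell
#!/bin/sh
# Sketch: strip one file from the history of a throwaway repository.
# Every name below (keep.txt, secret.txt, remove_file) is hypothetical.
set -e

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "public" > keep.txt
echo "API_KEY=not-a-real-key" > secret.txt
git add keep.txt secret.txt
git commit -q -m "Add two files"

echo "more" >> keep.txt
git commit -q -a -m "Update keep.txt"

# Remember the starting branch, then work on a separate branch so the
# original history is left untouched.
original=$(git symbolic-ref --short HEAD)
git checkout -q -b remove_file

# Parameter 1: the filter type. Parameter 2: the quoted command to run
# against the index of each commit. Parameter 3: the revisions to rewrite.
FILTER_BRANCH_SQUELCH_WARNING=1 \
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch secret.txt' HEAD

# Count the commits on the rewritten branch that still touch secret.txt.
remaining=$(git rev-list --count remove_file -- secret.txt)
echo "commits touching secret.txt on remove_file: $remaining"
```

Because the work happens on remove_file, the branch we started from still contains secret.txt afterwards, which is the same situation we are about to see with master and newfile1.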

Now if we list the files in the directory, we can see something important has happened. The file that we wanted removed has gone, and newfile1 is no more.
john@satsuki:~/coderepo$ ls -la
total 36
drwxr-xr-x 3 john john 4096 2011-07-27 19:53 .
drwxr-xr-x 32 john john 4096 2011-07-27 19:00 ..
-rw-r--r-- 1 john john 35 2011-07-22 07:15 another_file
-rw-r--r-- 1 john john 25 2011-07-22 07:15 cont_dev
drwxrwxr-x 9 john john 4096 2011-07-27 19:53 .git
-rw-r--r-- 1 john john 58 2011-07-22 07:15 newfile2
-rw-r--r-- 1 john john 45 2011-07-22 07:15 newfile3
-rw-r--r-- 1 john john 8 2011-03-31 22:15 temp_file
-rwxrwxr-x 1 john john 114 2011-07-21 21:17 test.sh
john@satsuki:~/coderepo$

Re-running the log command we ran earlier against our new branch confirms our operation, as it no longer lists any commits. However, checking out master also confirms that the file is still present elsewhere.
john@satsuki:~/coderepo$ git log --pretty=oneline remove_file -- newfile1
john@satsuki:~/coderepo$ git checkout master
Switched to branch 'master'
john@satsuki:~/coderepo$ ls -la
total 40
drwxr-xr-x 3 john john 4096 2011-07-27 19:54 .
drwxr-xr-x 32 john john 4096 2011-07-27 19:00 ..
-rw-r--r-- 1 john john 35 2011-07-22 07:15 another_file
-rw-r--r-- 1 john john 25 2011-07-22 07:15 cont_dev
drwxrwxr-x 9 john john 4096 2011-07-27 19:54 .git
-rw-r--r-- 1 john john 69 2011-07-27 19:54 newfile1
-rw-r--r-- 1 john john 58 2011-07-22 07:15 newfile2
-rw-r--r-- 1 john john 45 2011-07-22 07:15 newfile3
-rw-r--r-- 1 john john 8 2011-03-31 22:15 temp_file
-rwxrwxr-x 1 john john 114 2011-07-21 21:17 test.sh
john@satsuki:~/coderepo$

It should be stressed at this point how destructive the git filter-branch command can be to your repository. The master and remove_file branches have diverged from the point where newfile1 was first introduced. Consequently all of our other branches, such as zaney and wonderful, still refer to the original commits on the master side of that split. We would have to rewrite those branches too, but because filtering creates new commit objects, we could lose the relationships between the branches and their ancestors. In short, though it is exceedingly powerful, this type of filtering can cause huge distress to other people working on the project.
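
Before deciding how much rewriting is needed, it can help to find out which local branches still carry the file in their history. The following is a minimal sketch of one way to do that; the function name branches_with_path is hypothetical, and it only looks at local branches under refs/heads, not at tags or remote-tracking branches.

```shell
# Sketch: print every local branch whose history touches the given path.
# usage: branches_with_path <path>
branches_with_path() {
    git for-each-ref --format='%(refname:short)' refs/heads |
    while read -r branch; do
        # rev-list prints at most one commit reachable from the branch
        # that touches the path; empty output means the path never appears.
        if [ -n "$(git rev-list -n 1 "$branch" -- "$1")" ]; then
            echo "$branch"
        fi
    done
}
```

Run from inside the repository as branches_with_path newfile1, it lists each branch that would also need filtering or deleting.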
In the trenches...
"So what do we do?" asked John. "We can't push out the repo as it is because it contains the API key." He massaged his forehead, moving down to his eyebrows. "But we seem to be introducing a real headache if we filter the branch. Any suggestions?"

"Well the project is going to be finished in a few weeks right?" Simon was sitting at the end of the table. He was ashamed and was talking through a pair of hands desperately trying to conceal his identity.

"Yeh, but what the hell has that got to do with it?" snorted Klaus.

"I'm just thinking that we leave the repo like it is until all development has finished," he paused to run his hands through his hair, "then we filter the branch just before we release it." He looked over at John, "At that point there shouldn't be any test or dev branches, and we can just get everyone to clone the repo if we need to do anything else."

John nodded. "You know Simon I think you may have just redeemed yourself."

Note - Since you've been gone

Even though we have rewritten our tree, the fact that another branch still has the file present means that our potentially sensitive data still exists somewhere inside the repository. In order to truly get rid of the file, we would need not only to remove the file from all branches, or delete the branches that contained it, but also to run a few more steps to ensure the file was really gone. Be aware that these steps are potentially very destructive to a repository. The best way to remove the file completely would be to remove ALL references to the file and then clone the repository. Git will not clone objects into a new repository if nothing references them. Alternatively, if you absolutely must work on the current repository, you would need to do the following.

Delete the filter-branch backup using git update-ref -d <refname>. (See the callout on More backups)

Expire all reflogs with git reflog expire --expire=now --all

Repack all of the pack files with git repack -ad

Prune all unreachable objects with git prune

As you can see, some of these are quite scary procedures, so it is important that you understand everything you are doing before you do it.
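
The four steps above can be sketched as a single shell function. This is only an illustration under stated assumptions: the function name purge_rewritten_objects is hypothetical, and it assumes the backups sit under refs/original/, which is where filter-branch stores the pre-rewrite refs by default. It also assumes that every branch and tag that referenced the file has already been rewritten or deleted, because anything still reachable will survive the prune.

```shell
# Sketch: make the objects left behind by filter-branch unreachable,
# then delete them. Destructive; try it on a copy of the repository first.
purge_rewritten_objects() {
    # 1. Delete every backup ref that filter-branch stored.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done

    # 2. Expire all reflog entries so they stop pinning the old commits.
    git reflog expire --expire=now --all

    # 3. Repack everything still reachable and drop the redundant packs.
    git repack -ad

    # 4. Remove the loose objects that nothing references any more.
    git prune
}
```

Cloning into a fresh repository remains the gentler option, since it never touches the objects in the repository you are working in.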

The idea being proposed here is only really viable because of Tamagoyaki's situation. The code is due to be finished soon and once that happens, the team have decided to push a rewritten branch into the public domain and to resync all of their development repositories to this new branch. It should be noted that the filter-branch tool can be used in other circumstances too. We are going to take a look at just one of these. However, let us first clean up our repository a little and move some things around.
john@satsuki:~/coderepo$ mkdir tester
john@satsuki:~/coderepo$ ls
another_file cont_dev newfile1 newfile2 newfile3 temp_file tester test.sh
john@satsuki:~/coderepo$ mv test.sh tester/
john@satsuki:~/coderepo$ git mv newfile* tester
john@satsuki:~/coderepo$ git add tester/test.sh
john@satsuki:~/coderepo$ rm temp_file
john@satsuki:~/coderepo$ git status
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# renamed: newfile1 -> tester/newfile1
# renamed: newfile2 -> tester/newfile2
# renamed: newfile3 -> tester/newfile3
# new file: tester/test.sh
#
john@satsuki:~/coderepo$ git commit -a -m 'Moved testing suite'
[master f08ac57] Moved testing suite
4 files changed, 9 insertions(+), 0 deletions(-)
rename newfile1 => tester/newfile1 (100%)
rename newfile2 => tester/newfile2 (100%)
rename newfile3 => tester/newfile3 (100%)
create mode 100755 tester/test.sh
john@satsuki:~/coderepo$

We have switched back to our master branch and in doing so have regained newfile1. After that, we deleted our rewritten branch and moved test.sh, along with all of the newfiles, into a new folder called tester.
