Week 3

"A Closer Look At Diffs and Tags"

The Diff Utility

We learnt in Week 3 how to work with a diff, and what a diff actually represents. It is interesting to note how old the diff utility actually is and how it works. The diff algorithm was developed in the early 1970s and the research published in 1976, by Douglas McIlroy, who wrote the original diff utility and James Hunt. The algorithm we use today to perform diffs has become known as the Hunt-McIlroy after the research papers authors.

In essence the task of calculating a diff is that of finding the differences between two files on a line by line basis. Mathematically, this can be described as the LCS or Longest Common Subsequence problem, which is a classic computer science problem.

Though we are not going to go into the problem in great detail, it is useful to know what actually happens at this level. Essentially you have two sequences, for now we are going to simplify the problem and work on a string of letters.

Old string:

a b c d e j k l m p r s

New string:

a c d f g h i j n o p t u

The challenge is to find the longest sequence of items that is present in both of the strings above. This new sequence is found by deleting items from the first and second set until all that remains is a sequence of common items.

In our example case, this is as follows LCS string:

a c d j p

Comparing this to each of our strings above, it is easy to generate a diff. If the items are present in the Old string, but not in the LCS string then they must have been deleted. Conversely if they are present in the New string, but not the LCS, then they must have been additions. Putting this into practice in our example and marking deletions with - and additions with +, we get the following:

Diff string:

b e fghi klm o rs tu

- - ++++ --- + -- ++

The actual algorithm for generating the LCS and the subsequent diff is too complex to describe here and is out of the scope of this book. This section was included to give you some idea of how Git performs some of its actions internally.

More about tags

Tags can actually do a little more than just hold a single identifier to a specific commit. A tag can also have a log message with it, similar to the commit objects we discussed earlier. In order to invoke this option, we need to use the git tag -m 'message' option. This will allow us to supply a message to be stored along with the tag. Let us see how this works in practice.

Firstly we can use the git tag command to show all tags that are currently in the repository.

john@satsuki:~/coderepo$ git tag

v0.9

v1.0a

john@satsuki:~/coderepo$

Now let us create a new tag and give it some extra information.

john@satsuki:~/coderepo$ git tag v1.0b -m 'This is an annotated tag'

john@satsuki:~/coderepo$

Unfortunately Git was not particularly forthcoming with information on the creation of the tag. On saying that, it is not difficult for us to use the git show command to see what we have done.

john@satsuki:~/coderepo$ git show v1.0b

tag v1.0b

Tagger: John Haskins <john.haskins@tamagoyakiinc.koala>

Date:   Thu Mar 31 23:55:50 2011 +0100

This is an annotated tag

commit a022d4d1edc69970b4e8b3fe1da3dccd943a55e4

Author: John Haskins <john.haskins@tamagoyakiinc.koala>

Date:   Thu Mar 31 22:05:55 2011 +0100

    Messed with a few files

diff --git a/my_second_committed_file b/my_second_committed_file

index 095b9cd..c9887f8 100644

--- a/my_second_committed_file

+++ b/my_second_committed_file

@@ -1,2 +1 @@

-Change1

-Change2

+Changed this file completely

diff --git a/my_third_committed_file b/my_third_committed_file

new file mode 100644

index 0000000..5d27866

--- /dev/null

+++ b/my_third_committed_file

@@ -0,0 +1 @@

+Addition to the line

john@satsuki:~/coderepo$

Notice that as well as showing us the tag itself, the git show command also gave us a diff of what exactly changed in this commit. Small details like this are what makes Git in particular a joy to use for developers.

If we found that we had actually created the tag incorrectly, we have two options, we could use the -d option to delete a tag, or we could use the -F option to forcibly overwrite a tag with the same name with different information. However please remember the warning about tags in Week 2. It is very dangerous to go changing tags, especially if you have already pushed them out somewhere where other people can grab them from.

Now that we have learnt a little about how to play with tags, we should probably take a look under the hood. This is an After Hours section after all.

The implementation of tags in Git is very simple indeed. We are going to jump into the .git directory and take a look at a simple output from some commands.

john@satsuki:~/coderepo$ cd .git/

john@satsuki:~/coderepo/.git$ ls

branches        config       HEAD   index  logs     refs

COMMIT_EDITMSG  description  hooks  info   objects

john@satsuki:~/coderepo/.git$ cd refs/tags/

john@satsuki:~/coderepo/.git/refs/tags$ ls

v0.9  v1.0a  v1.0b

john@satsuki:~/coderepo/.git/refs/tags$ cat v1.0a

a022d4d1edc69970b4e8b3fe1da3dccd943a55e4

john@satsuki:~/coderepo/.git/refs/tags$

Inside the .git/refs/tags folder, there is a file for every single tag in the system. This file contains a single string of characters. That string looks oddly like an SHA-1 commit to me. Using the tricks that we learnt in the last After Hours section, we can interrogate the Git repository, by throwing a few wrenches at it.

john@satsuki:~/coderepo$ git cat-file -p a022d4

tree 96551f45496232c0ec6b389731d55fa3d7e1c8fd

parent 9938a0c30940dccaeddce4bb2eb151fba3a21ae5

author John Haskins <john.haskins@tamagoyakiinc.koala> 1301605555 +0100

committer John Haskins <john.haskins@tamagoyakiinc.koala> 1301605555 +0100

Messed with a few files

john@satsuki:~/coderepo$ git cat-file -t a022d4

commit

john@satsuki:~/coderepo$

Excellent! Just as we expected. So Git stores the SHA-1 hash for the commit we are referring to, in a file which has the same name as the tag. Just for clarity, let us run the same set of commands against our newly created annotated tag.

john@satsuki:~/coderepo$ cd .git/refs/tags/

john@satsuki:~/coderepo/.git/refs/tags$ cat v1.0b

6cbcf47957589bf4b84cc934a26731636d021574

john@satsuki:~/coderepo/.git/refs/tags$

Hang on a minute! Shouldn't the output of the v1.0a and v1.0b files be the same. There were both created at the same point in the repositories history. They were both supposed to be pointing to the same commit. Let us use a little plumbing and see what is going on.

john@satsuki:~/coderepo$ git cat-file -t 6cbcf4

tag

john@satsuki:~/coderepo$

Interesting. So it seems as if there is another type of object that can be added to the list. The tag object. So what exactly is the difference here? To answer that we need to take a look at the difference between the two tags. v1.0a was a simple tag, whereas v1.0b is an annotated tag.

When the tag was merely a pointer to a commit object, the repository required nothing more that the object that it was referring to, to be stored in the file. Now we have more information, Git treats the information just as it would any other, by creating an object. We can investigate this further and use the -p parameter to the git cat-file command to see exactly what is stored within this file.

john@satsuki:~/coderepo$ git cat-file -p 6cbcf4

object a022d4d1edc69970b4e8b3fe1da3dccd943a55e4

type commit

tag v1.0b

tagger John Haskins <john.haskins@tamagoyakiinc.koala>

 Thu Mar 31 23:55:50 2011 +0100

This is an annotated tag

john@satsuki:~/coderepo$

So an annotated tag contains more information than just the commit ID we are referring to. This is pretty handy and later on in the book we will start talking about topics like signed tags, which are required if you wish to verify the identity of the person claiming to have created the tag.

A shorter After Hours section this time. Next time we will move on to taking a deeper look at what happens behind the scenes with branches and merging. Stay tuned.