Friday, February 28, 2020

Understanding Git, Part 3

The final part of my talk about Git. Here are Part 1 and Part 2.

After familiarizing ourselves with the areas of a Git repository and the basic commands, let’s dive into the practical application of this knowledge. In this last part of the talk we will consider how to manipulate changes to achieve your goals, how to track what other people have done, and how to get out of trouble.

Let’s start with understanding how to make amendments to changes. This is useful for correcting mistakes you’ve discovered after the commits have already been submitted to the local repository. Amending your previous changes is also required if you are working with a distributed code review system like Gerrit.

As you can remember, the actual commit objects residing in the repository are immutable and can’t be changed once you have created them. However, what we can do is come up with another set of commits, trees, and blobs, connect the new commits to the commit graph, and update the branch reference to point to the new set of commits.

This is illustrated on this diagram. Imagine we have a master branch which points to commit D. We have realized that we need to amend the last two commits: C and D. We use them as starting points for creating new commits, beginning from the oldest one—C. When we create this new commit we can optionally use a different parent commit for it.

After we have created the last new commit—D' in this case, we reset the master branch reference to point to it. Now although the old commits C and D are still in the repository, Git will consider them garbage at some point as they are not referenced by any branch.

Note that even if we didn’t change the trees of C' and D' compared to the original commits, but just reparented C, it will still be a different commit because the commit hash is calculated from entire commit contents which includes the parent reference. And since C' becomes a different commit with a different hash, D' is also different from D even if it still references the same tree.
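
To make this tangible, here is a minimal sketch one could run in a scratch repository. The file name and commit messages are made up for the demo. It uses the low-level git commit-tree command to build C' by hand: the same tree and message as C, only a different parent.

```shell
# Scratch repository; file names and messages are invented for this demo.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo one > file.txt && git add file.txt && git commit -q -m "A"
echo two >> file.txt && git commit -q -a -m "B"
echo three >> file.txt && git commit -q -a -m "C"

c=$(git rev-parse HEAD)
tree_c=$(git rev-parse 'HEAD^{tree}')

# Build C' by hand: same tree and message as C, but reparented onto A.
c_prime=$(git commit-tree "$tree_c" -p HEAD~2 -m "C")
tree_c_prime=$(git rev-parse "$c_prime^{tree}")

echo "C  = $c (tree $tree_c)"
echo "C' = $c_prime (tree $tree_c_prime)"
```

The two tree hashes printed should be identical, while the commit hashes differ because the parent reference is part of the hashed commit contents.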

That was the generic case of history rewriting. Now let’s consider the simplest one: amending just the last commit. The git commit command has a special option for this, called --amend. If you haven’t staged any changes, executing this command will only update the commit message of the last commit. If you have staged any files, they will replace the corresponding files in the last commit.

As always, you can skip the intermediate step of staging by passing the -a option together with --amend.

And this is an illustration of what happens when you amend the last change. The tree of the commit may change or it may stay the same. The commit itself changes if you change the commit description. The parent of the amended commit stays the same. Your current branch gets updated to point to the new commit object.
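
Here is a small sketch of amending in a throwaway repository (file names and messages are assumptions made for the demo). It fixes both the content and the message of the last commit in one go, using -a to skip the staging step.

```shell
# Scratch repository; everything here is invented for the demo.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo base > a.txt && git add a.txt && git commit -q -m "Base"
echo draft > b.txt && git add b.txt && git commit -q -m "Add b (with a tpyo)"

before=$(git rev-parse HEAD)
parent_before=$(git rev-parse HEAD^)

# Fix the file and the message of the last commit at once.
echo final > b.txt
git commit -q -a --amend -m "Add b"

after=$(git rev-parse HEAD)
parent_after=$(git rev-parse HEAD^)
echo "old commit: $before, new commit: $after, parent: $parent_after"
```

The branch now points to a new commit object, while the parent is still the same commit as before.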

Let’s consider a more interesting example of amending commits. Imagine we were developing two features: X and Y in dedicated branches, and now want to integrate them into a single branch.

As we mentioned earlier, one way of doing that is to create a merge commit. If that is not an option, then we can take the changes specific to one of the branches and replant them onto another branch.

This is how we do that using the git rebase command. Imagine we decided to take the changes from the featureX branch and rebase them onto the featureY branch. For that, we make the featureX branch current, then we provide git rebase with the identifier of the commit that bounds the commit chain (in this case commit B, which itself is not moved) and, via the --onto option, the destination branch: featureY.

What happens next is Git takes the commits from that chain and, one by one, plants them onto the destination branch. Then it updates the branch reference. Note that the current branch remains the same after rebasing, thus the HEAD meta-reference doesn’t change.
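
The scenario can be sketched in a scratch repository as follows. The branch names come from the example above; the files and commit messages are made up, and the variable fork stands in for commit B.

```shell
# Scratch repository; files and messages are invented for this demo.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo a > a.txt && git add a.txt && git commit -q -m "A"
echo b > b.txt && git add b.txt && git commit -q -m "B"
fork=$(git rev-parse HEAD)          # commit B, where both features start

git checkout -q -b featureY
echo y > y.txt && git add y.txt && git commit -q -m "Y1"

git checkout -q -b featureX "$fork"
echo x1 > x1.txt && git add x1.txt && git commit -q -m "X1"
echo x2 > x2.txt && git add x2.txt && git commit -q -m "X2"

# Replant everything after B from the current branch onto featureY.
git rebase -q --onto featureY "$fork"

git log --format=%s                 # history of featureX after the rebase
```

After this, featureX consists of X1' and X2' sitting on top of the tip of featureY, and featureX is still the current branch.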

Here is another practical example of rebasing, this time using a tracking branch. This scenario happens when you are working on a branch which is being changed in a remote repository by someone else. Recall that by default when you do a git pull, Git merges your local changes with the remote changes. Instead we can do a rebase by specifying the --rebase argument. In this case Git will automagically rebase your local changes on top of the remote changes. Here is how this happens.

To demonstrate what happens, let’s do the same thing that git pull --rebase does, but this time manually. First we sync with the remote repository using git fetch. By comparing the positions of the tracking and upstream branches, Git can find out that 100 new changes have happened upstream, and that we have a couple of changes, too. This is in fact the same situation as we have considered earlier with two features. The only difference is that this time one of the branches is a remote branch reference.

Now we do a rebase. We don’t have to specify the fork point and the destination branch to git rebase this time, because Git assumes we are rebasing onto the upstream branch, and it can figure out the fork point itself. After rebasing we have our changes on top of the changes from the upstream, as expected.
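
Below is a sketch of the manual version, assuming a local bare repository that plays the role of the remote and two clones of it. The directory names (seed, remote.git, mine, theirs) and all file contents are hypothetical.

```shell
# A local bare repository stands in for the remote; all names are made up.
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com

git init -q seed
(cd seed && echo base > base.txt && git add base.txt && git commit -q -m "Base")
git clone -q --bare seed remote.git
git clone -q remote.git mine
git clone -q remote.git theirs

# Someone else publishes a change upstream.
(cd theirs && echo up > upstream.txt && git add upstream.txt \
  && git commit -q -m "Upstream change" && git push -q)

# Meanwhile we commit locally.
cd mine
echo local > local.txt && git add local.txt && git commit -q -m "Local change"

git fetch -q     # update the remote branch reference
git rebase -q    # no arguments: rebase onto the configured upstream branch

git log --format=%s
```

After the rebase our local change sits directly on top of the upstream change.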

Since while we are rebasing we are creating new commits, it’s possible to apply more radical changes rather than just changing a parent or amending files. With Git interactive rebase we can also re-apply changes in different order, skip some changes, or add new ones, and combine multiple changes together. Basically, if you recall our “Patch algebra”, we can do all those operations on actual commits.

For that we use -i (for “interactive”) option to git rebase.

This power of being able to rearrange our changes in hindsight allows us to apply the “working in small steps” philosophy. What this means in practice is that we commit our changes as often as possible. Usually it’s after we have completed some logical step and the code at least compiles, although this is not a requirement. This allows us to rearrange our changes later, or to find the change that broke everything.

Even if we use a code review system for the project, working in small steps is beneficial because by the time when we are ready to get our changes reviewed, we can group them into bigger chunks.

Imagine we were working on adding a resampler to our audio code. Let’s say, we have implemented the resampler partially, then hooked it up to existing code, then finished implementation, and then have fixed a bug in the integration code.

Now what we can do is fire up an interactive rebase to organize those changes into a more logical form.

After we have run git rebase -i we end up in our favorite text editor. Here we are presented with a list of changes. By default, Git will just pick them one by one, and this will result in no new commits, because no data will change. However, the interactive rebase offers a wide choice of operations listed below.

It is important to note that the commits are listed in direct chronological order (oldest go first), which is opposite to the order normally used by git log.

So what we have done here is we rearranged the changes in order to group together the two changes that implement the resampler. We use the squash (“s”) command to combine these changes and to edit the message of the resulting commit. On top of them we put the plumbing change and the fix for it. For the fix, we use the fixup (“f”) command, which folds the change into the preceding commit and simply keeps that commit’s message.

After we have finished editing the interactive rebase instructions, Git follows them, and this results in a new sequence of commits.
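
The resampler story can be replayed in a scratch repository. Since an editor can’t be shown in a script, this sketch prepares the instruction list in a file and hands it to Git through the GIT_SEQUENCE_EDITOR environment variable; GIT_EDITOR=true simply accepts the proposed message for the squashed commit. File names, contents and commit messages are assumptions made for the demo.

```shell
# Scratch repository; files and messages are invented for this demo.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo readme > README && git add README && git commit -q -m "Base"
git tag base

echo "part 1" > resampler.c && git add resampler.c \
  && git commit -q -m "Resampler, part 1"
c1=$(git rev-parse --short HEAD)
echo "use resampler" > audio.c && git add audio.c \
  && git commit -q -m "Hook up resampler"
c2=$(git rev-parse --short HEAD)
echo "part 2" >> resampler.c && git commit -q -a -m "Resampler, part 2"
c3=$(git rev-parse --short HEAD)
echo "use resampler correctly" > audio.c && git commit -q -a -m "Fix hookup"
c4=$(git rev-parse --short HEAD)

# The instruction list we would otherwise type in the editor (oldest first).
todo=$(mktemp)
cat > "$todo" <<EOF
pick $c1
squash $c3
pick $c2
fixup $c4
EOF

# Git appends the path of its own todo file to the "editor" command,
# so this copies our prepared list over it.
GIT_SEQUENCE_EDITOR="cp '$todo'" GIT_EDITOR=true git rebase -q -i base

git log --format=%s base..HEAD
```

The four small commits become two: the complete resampler, followed by the already-fixed hookup.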

There is also a manual alternative to rebasing, known as “cherry-picking”. We can basically replant changes manually one by one in the order we want. This can be used, for example, if you want to try a new approach in a new branch, but want to reuse some of the work you have done on another branch.

To do that, use the git cherry-pick command, providing a list of commit identifiers as parameters. There is no option to squash commits.

In this example, we decided to try to hook up our resampler in a different way, so we start another branch using master branch as a starting point, then we find out the hashes of the commits we are interested in, and apply them in the right order to our new branch. Obviously, this will result in new commits being created.
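
A sketch of this workflow in a scratch repository, with hypothetical branch names (resampler, resampler-v2), files and messages:

```shell
# Scratch repository; branch names, files and messages are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email "dev@example.com" && git config user.name "Dev"

echo base > README && git add README && git commit -q -m "Base"

git checkout -q -b resampler
echo "part 1" > resampler.c && git add resampler.c \
  && git commit -q -m "Resampler, part 1"
c1=$(git rev-parse HEAD)
echo "hookup" > audio.c && git add audio.c \
  && git commit -q -m "Hook up resampler"
echo "part 2" >> resampler.c && git commit -q -a -m "Resampler, part 2"
c3=$(git rev-parse HEAD)

# Start over from master and reuse only the resampler implementation.
git checkout -q -b resampler-v2 master
git cherry-pick "$c1" "$c3" >/dev/null

git log --format=%s master..HEAD
```

The new branch contains both resampler commits but not the old hookup, so audio.c does not exist there.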

In the examples above we have used, a couple of times, a special syntax for referencing commits that are ancestors of HEAD. The time has come to understand it. I called this section “Mini-Languages” to refer to the concept of Domain-Specific Languages. I also introduce Git-specific jargon here that is encountered in the documentation.

Let’s start with what is called a “refname”, which I guess means “the name of a reference”. If you recall, HEAD is the name of the meta-reference to the current branch.

Obviously, the name of the branch is a refname too, because as we know, any branch is a reference. If a local branch tracks a remote branch, we can retrieve it using @{upstream} syntax.

Another Git jargon word is “rev” which is a short name for “revision”, and that means specifying a commit. The canonical way is to use the full SHA-1 hash of the commit. However, it’s usually enough to specify just a few first characters of the hash. For small repositories, something like 5 first characters is enough, and for big ones we might need to provide up to 12.

Since the contents of a branch reference is a commit hash, a branch name can be trivially resolved into a commit. That’s why any Git command requiring commit hash will also accept a branch name. If we have a rev, which as we remember is equivalent to having a commit, we can reference the Nth parent of this commit by using the ^N syntax. If N isn’t specified, it’s assumed to be 1.

Since the parent of a rev is also a rev, we can apply the ^ “operator” multiple times, moving one parent away with each application. There is a short syntax for this: <rev>~N means applying ^ N times, that is, going N generations back along the first parents. If N isn’t specified it’s assumed to be 1, thus a ~ alone is equivalent to ^.

And finally, this is the syntax we’ve seen when we were talking about reflogs. Recall that a reflog is the list of changes that have happened to a reference. In order to reference the Nth previous state, we use the @{N} syntax.

As you remember, every commit has a tree object associated with it. You can reach that object by adding a colon (:) after the rev. And once you’ve got a tree, you can descend into it and reference the files it contains. This way you can designate the state of any file at any commit, which makes it a kind of time machine.

Since you typically execute Git commands from some subdirectory of your working tree, Git can use the current path for resolving a relative path. The relative path is specified by prefixing it with ./ character sequence.

Here are a couple of examples. Both assume we are working with the frameworks/av repository of Android. The first example references the state of the file services/audioflinger/Threads.cpp 3 commits ago. The second example assumes that we are currently in the services subdirectory of that repository, and references the state of the same file at the specific commit.
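
For readers without the slides, here is a sketch of these two forms, plus a few of the others discussed above, in a scratch repository. Only the path services/audioflinger/Threads.cpp is taken from the examples; the file contents and commits are invented stand-ins, not the real frameworks/av history.

```shell
# Scratch repository that only mimics the path from the example.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

mkdir -p services/audioflinger
for i in 1 2 3 4; do
  echo "version $i" > services/audioflinger/Threads.cpp
  git add services && git commit -q -m "Change $i"
done

git rev-parse HEAD^ HEAD~1 'HEAD@{1}'   # three spellings of the same commit here
kind=$(git cat-file -t HEAD:)           # a rev followed by a colon names its tree

# The state of the file 3 commits ago.
three_ago=$(git show HEAD~3:services/audioflinger/Threads.cpp)

# The same file at a specific commit, addressed relative to the current directory.
rev=$(git rev-parse --short HEAD~1)
cd services
at_rev=$(git show "$rev:./audioflinger/Threads.cpp")

echo "$kind / $three_ago / $at_rev"
```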

Consult the git revisions help page for the full list of options.

So far, we were talking about how to reference just one commit. It is also useful to be able to specify a section of a path in the commit graph. This is called a “commit range”. A commit range may also denote a set of commits which are disjoint in the commit graph. Commit ranges are frequently used with git log command.

If you provide git log with just one rev, it will interpret it as “all the paths that go down from the commit <rev> to the initial commit”. Actually, if you don’t provide any revision specification to git log it will simply take HEAD. The problem with that is the fact that the path from HEAD to the initial commit is quite long. In order to restrict the range we can use .. syntax, as shown here: <rev1>..<rev2>. This means “all commits not reachable from rev1 but reachable from rev2 (including rev2 itself)”.

In case of a linear history (when no merge commits are present), this is simply the path from the commit rev2 down to but not including rev1. However, with non-linear history the result can be more interesting, as we can see on the diagram on the right.

If you don’t specify rev1 or rev2 in the range, Git will use HEAD.
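
A minimal sketch of ranges on a linear history of four commits, A through D, each tagged with its own name (the tags and the file are made up for the demo):

```shell
# Scratch repository with a linear history A - B - C - D.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

for name in A B C D; do
  echo "$name" > file.txt && git add file.txt \
    && git commit -q -m "$name" && git tag "$name"
done

# Reachable from C but not from A: C and B.
in_range=$(git log --format=%s A..C | tr '\n' ' ')

# An omitted side of the range defaults to HEAD.
to_head=$(git rev-list --count A..)     # same as A..HEAD: B, C and D
from_head=$(git rev-list --count ..A)   # same as HEAD..A: nothing

echo "$in_range| $to_head | $from_head"
```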

Similar to printf function in C, Git has a template-based mini-language for specifying how to format the output of git log and other commands that can display commits, like git show. There is a set of predefined formats, called “pretty” in Git jargon. You can also construct a format that suits your needs by using placeholders prefixed with the percent symbol %. I’m not listing them here because usage of this mini-language is pretty straightforward and you can look up all the formatting instructions on the git pretty-formats help page.
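
Just to give a taste of it, here is a short sketch with one predefined format and one custom format, run in a scratch repository (author name, files and messages are assumptions of the demo). The placeholders used are %h (abbreviated hash), %an (author name), %ae (author email) and %s (subject).

```shell
# Scratch repository; author, files and messages are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo a > a.txt && git add a.txt && git commit -q -m "First"
echo b > b.txt && git add b.txt && git commit -q -m "Second"

git log --pretty=oneline                     # a predefined format
git log --pretty=format:'%h %an <%ae> %s'    # a custom format
echo

line=$(git log -1 --pretty=format:'%an: %s')
echo "$line"
```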

Now we are armed with the knowledge of how to specify objects we want to view and how to represent the results. Let’s consider the commands that can be used for viewing different kinds of objects in Git.

A versatile command git show can display any repository object: a blob, a tree, or a commit. You can provide either the hash of the object, or a rev, or a path specification we have considered earlier.

The next command is well-known git log. It accepts a commit range, and its output can be limited to only consider certain paths so you can view changes specific only to a certain file or a directory. Don’t forget that you can specify output format and number of commit entries to display to make the output more manageable.

And finally, git status command which we have seen before. Its output can also be scoped to a certain path.

As we have discussed before, Git doesn’t store any diffs in the repository but generates them on request, for example when you invoke git diff command. Remember that diff is always generated from two objects, even if you specify only one or even no arguments to git diff. Here on the slide I listed possible useful options.

Sometimes we are not interested in seeing an entire diff. First, we can limit the diff output to a certain path or paths. We can also ask diff to display only what Git calls the “diff stats”: the number of lines changed in every file. We can even display only the names of the changed files, with no extra information.

Of course you know about git blame command in Git. It’s perfect when you can immediately find the commit that made the change in question to the file. However, it’s not always that easy because in big and long-living repositories a lot of refactorings happen that just move lines between files or change formatting. But there is always a way to find the first functional change.

In this example, let’s consider this fragment of code from Android’s frameworks/av repository. Let’s say we want to find out the reason why the “FLUSHED” state of a track is also considered a “stopped” state. From the git blame output we can see that this line was last touched by someone named Eric in the commit with the hash starting with 81784.

We use git log to check the description of this commit (we could also use git show to view the entire commit), and unfortunately this last change was just code reorganization, so we need to look deeper.

It’s a bit of a problem that the line has been moved from one file to another. We need to find which file had this line before. One way to do that is to go manually through the change, and that’s doable if the change is small. In this case, the change was quite big so I used grep in order to find this line in the output of git show. I instruct grep to show a lot of lines (300) before the match to see the name of the original file and then grep the output once more just to show the line with the file name which diff prefixes with three dashes (---), as we can recall from our discussion of the patch format.

So I can see that the line was originally in the AudioFlinger.h file. Now I instruct git blame to start looking from the parent of the refactoring commit. I use the hat (^) here to specify that revision; I could also use a tilde (~), if you recall the slide about revision specification. And I specify the name of the file I want to blame.
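
The same investigation can be rehearsed on a toy repository. Everything below is a made-up stand-in for the real history: the line of code, the commit messages, and Tracks.h as the name of the file the line was moved to.

```shell
# Scratch repository imitating "a line moved between files by a refactoring".
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

line='bool isStopped() { return state == STOPPED || state == FLUSHED; }'

printf 'int state;\n' > AudioFlinger.h
git add AudioFlinger.h && git commit -q -m "Initial"

printf '%s\n' "$line" >> AudioFlinger.h
git commit -q -a -m "Treat FLUSHED as stopped"
functional=$(git rev-parse HEAD)

# The refactoring: move the line into another file.
printf 'int state;\n' > AudioFlinger.h
printf '%s\n' "$line" > Tracks.h
git add . && git commit -q -m "Reorganize code"
refactor=$(git rev-parse HEAD)

# Blaming the current file only leads to the refactoring commit...
blamed_now=$(git blame --porcelain -L1,1 Tracks.h | head -n 1 | cut -d' ' -f1)

# ...so blame the original file starting from the parent of that commit.
blamed_before=$(git blame --porcelain -L2,2 "$refactor^" -- AudioFlinger.h \
  | head -n 1 | cut -d' ' -f1)

git log -1 --format='%h %s' "$blamed_before"
```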

Now I see the last commit that actually changed this code, it was also made by Eric, and from looking at the description and probably diff, which didn’t fit on the slide so it’s not shown, I can see that I finally have found a functional change.

So git blame is really useful for figuring out commits that have changed a line of code. But changes can also remove lines of code. Obviously, we will not see the removed lines in the output of git blame. In order to see the history of removed lines, we can use the “pickaxe” tool of Git.

Recall that in diffs, each modification of a line of text is described as the removal of the old line of text, and the addition of the new version of that line. Any line can also be removed in one place of one file and added to some other place, maybe in some other file. Obviously, if the change adds a new line, there will be no old version of it. If the change removes a line, there will be no new version.

So what “pickaxe” does is this: it counts the number of diff lines that remove a line containing the specified string, and the number of diff lines that add it back. If these numbers do not match, the string has been added or removed in this diff, and this is what pickaxe reports. In this example we are looking for changes that add or remove the string 4.0 in the set of files. We pass the -S argument to git log, which invokes the pickaxe tool.
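
A small sketch of the pickaxe in a scratch repository; the file config.txt and its contents are assumptions made for the demo.

```shell
# Scratch repository; the file and its contents are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

printf 'version = "3.0"\n' > config.txt
git add config.txt && git commit -q -m "Initial"
printf 'version = "4.0"\n' > config.txt
git commit -q -a -m "Bump to 4.0"                 # adds the string
printf 'version = "4.0"\n# comment\n' > config.txt
git commit -q -a -m "Add comment"                 # leaves the string alone
printf 'version = "5.0"\n# comment\n' > config.txt
git commit -q -a -m "Bump to 5.0"                 # removes the string

# Only commits that change the number of occurrences of "4.0" are listed.
hits=$(git log --format=%s -S'4.0' | tr '\n' ';')
echo "$hits"
```

The commit that only adds a comment is skipped, because the string 4.0 is present both before and after it.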

Some commits add or remove not just lines but entire files. Once our colleague Jean-Michel could no longer find a file that he knew had existed before. He wanted to find the change that removed that file. That’s actually easy with git log, which supports the --compact-summary argument that displays only “diff stats”. We limit the scope of git log to that particular file and here is what we see: the last commit, which removed the file.
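
A sketch of that search in a scratch repository, with a hypothetical file legacy.txt playing the role of the lost file. The -- separator tells Git that what follows is a path, which matters here because the file no longer exists in the working tree.

```shell
# Scratch repository; file names and messages are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo keep > keep.txt && echo old > legacy.txt
git add . && git commit -q -m "Add files"
echo more >> keep.txt && git commit -q -a -m "Unrelated change"
git rm -q legacy.txt && git commit -q -m "Remove legacy file"
echo again >> keep.txt && git commit -q -a -m "Later change"

# History restricted to the vanished file, with a condensed summary per commit.
git log --compact-summary -- legacy.txt

# The most recent commit touching the path is the one that removed it.
culprit=$(git log -1 --format=%s -- legacy.txt)
echo "$culprit"
```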

Managing commits is a complex business and sometimes things do not go as you expect. Remember that Git always allows you to pull out from a multi-step operation like rebasing or merging and get back to a clean slate. To pull out, just re-run the same command with --abort argument. Because all of the listed commands check before starting that there are no changes to the working tree, they guarantee that after aborting you end up in the state you were initially.

Here are some more Git commands that can help you get your working tree and index back to some known good state. First, there is the git reset command which, if called with only a revision parameter, resets the branch reference to the specified revision. You can always look up any previous state of any branch by examining its reflog, as we discussed earlier.

Second, recall the git stash command, which you can use to move any uncommitted changes out of your way without losing them. If you are sure you want to lose them, use the git reset --hard command.

And finally, you can always restore any particular file to a previous state using git checkout command. Recall that Git repository is a time machine that stores the state of any file at any previously committed revision.
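
These recovery tools can be tried out safely in a scratch repository. The sketch below uses a made-up file notes.txt with two committed versions.

```shell
# Scratch repository; the file and its versions are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo v1 > notes.txt && git add notes.txt && git commit -q -m "v1"
echo v2 > notes.txt && git commit -q -a -m "v2"
tip=$(git rev-parse HEAD)

echo wip > notes.txt                 # an uncommitted change
git stash -q                         # move it out of the way
clean=$(cat notes.txt)               # the committed state again
git stash pop -q                     # ...and bring the change back
git reset -q --hard                  # this time throw it away for good

git checkout HEAD~1 -- notes.txt     # restore one file to a previous state
old=$(cat notes.txt)
git reset -q --hard                  # back to the committed state

git reset -q --hard HEAD~1           # oops: moved the branch one commit back
git reset -q --hard 'HEAD@{1}'       # the reflog still knows where it was

echo "$clean $old $(cat notes.txt)"
```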

There is one thing in version control management that people hate as much as going to the dentist or paying taxes—it’s resolving merge conflicts. Let’s try to understand this problem better using the “patch algebra” we have discussed in the first part of this talk. In terms of patches, on one branch we have a file in state A to which a change M had been applied. Then on another branch the same file has received a change P. Now we are trying to merge these changes together. Recall that any change in the diff has a context. A merge conflict occurs if the change M modifies parts of the file that are included in the context of the change P.

A lot of the time Git can work around these discrepancies and still apply the patch. However, sometimes the help of a human is required. I highly recommend configuring Git to use the so-called “diff3 conflict resolution style” using the command shown on the slide, because it provides the most comprehensive information about the conflict.

Let’s consider an example. The change A which is common to both branches adds a file with 3 lines of text. Now let’s imagine that two persons have made two different changes to this file. The change M was done by a humble developer, while the change P was done by a more optimistic marketing person.

This is how Git presents this conflict to you using the diff3 style. First goes the part of the file that hasn’t been changed. Then comes the change on our branch (change M). Then comes how this part looked prior to the change on the remote branch, that is, in the parent of the change P. In our case this is actually the same as the state A. And finally, how this part looks with the change P applied.

Now what we need to do is to either select one of the versions, or produce a new version which for example combines both changes. We also need to remove the lines with merge conflict markers. And we repeat this process for every conflicting hunk.
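
Here is a sketch of the whole cycle in a scratch repository. The setting merge.conflictStyle is the standard Git option for this style (to make it the default one would typically pass --global to git config); the branch name marketing and the three lines of text are invented to mirror the example.

```shell
# Scratch repository; the text and the branch name are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.email "dev@example.com" && git config user.name "Dev"
git config merge.conflictStyle diff3      # local to this repository

printf 'Our product\nis good\nBuy it\n' > ad.txt
git add ad.txt && git commit -q -m "A"

git checkout -q -b marketing
printf 'Our product\nis the best\nBuy it\n' > ad.txt
git commit -q -a -m "P"

git checkout -q master
printf 'Our product\nis decent\nBuy it\n' > ad.txt
git commit -q -a -m "M"

git merge marketing >/dev/null 2>&1 || true   # stops and reports a conflict
cat ad.txt                                    # ours, the common base, theirs

markers=$(grep -c -e '^<<<<<<<' -e '^|||||||' -e '^=======' -e '^>>>>>>>' ad.txt)

# Resolve: write the version we want (no marker lines), stage it, conclude.
printf 'Our product\nis decent, maybe even the best\nBuy it\n' > ad.txt
git add ad.txt && git commit -q --no-edit
```

The section between the line of vertical bars and the line of equals signs is the extra piece of information that the diff3 style adds.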

Another common scenario we encounter when working on Android is the need to transfer a change from one repository to another. Here Git offers several options.

The first one is to add the second repository as a remote, fetch its contents into our repository, and then cherry-pick the commit containing the change. The problem we encounter here is that with large repositories fetching objects from another repository will take a lot of time and end up consuming a lot of disk space.

So if we are not interested in the history of the change we are transferring, another option is to use the so-called “mail exchange” scenario. We use the git format-patch command on the source repository to export the change in the form of a diff file that contains all the commit metadata, and then apply this diff to the target repository using the git am command. git format-patch produces a series of files, one per commit, so we can end up with lots of files.

Finally, instead of using git format-patch, we can produce a diff using git diff that contains all the changes we are interested in, possibly from multiple commits, and then apply it using git apply command. But since in this case the diff will not contain commit metadata, we will need to provide it manually.
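
Both file-based options can be sketched with three local repositories. The names source, target_am and target_apply, as well as the files and messages, are hypothetical.

```shell
# Three scratch repositories: one source and two targets; names are made up.
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com

git init -q source
(cd source && echo base > base.txt && git add base.txt && git commit -q -m "Base")
git clone -q source target_am
git clone -q source target_apply
(cd source && echo feature > feature.txt && git add feature.txt \
  && git commit -q -m "Add feature")

# "Mail exchange": one patch file per commit, metadata included.
git -C source format-patch -q -1 HEAD -o "$work/patches"
git -C target_am am -q "$work"/patches/*.patch

# Plain diff: no metadata, so we stage and commit by hand.
git -C source diff HEAD~1 HEAD > "$work/change.diff"
git -C target_apply apply "$work/change.diff"
git -C target_apply add feature.txt
git -C target_apply commit -q -m "Add feature (transferred as a plain diff)"

git -C target_am log -1 --format='%an: %s'
```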

Note that while transferring changes from one repository to another it’s common to encounter a merge conflict, because the repositories have likely diverged significantly from one another, like Android internal master and AOSP. If there is a conflict you will be presented with the following instruction from Git.

What you need to do is to go over the files that have conflicts, resolve them, and then stage the updated files using git add. Then you can let Git continue.

As I’ve mentioned before, if anything goes wrong, cherry-picking can be aborted at any moment.
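
Both outcomes, aborting and continuing, can be rehearsed in a scratch repository. The branch name topic, the file and its contents are assumptions made for the demo; GIT_EDITOR=true just accepts the proposed commit message.

```shell
# Scratch repository; branch, file and contents are invented.
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email "dev@example.com" && git config user.name "Dev"

echo original > greeting.txt && git add greeting.txt && git commit -q -m "Base"

git checkout -q -b topic
echo "from topic" > greeting.txt && git commit -q -a -m "Topic change"
pick=$(git rev-parse HEAD)

git checkout -q -                     # back to the initial branch
echo "from main line" > greeting.txt && git commit -q -a -m "Main change"
before=$(git rev-parse HEAD)

# First attempt: hit the conflict and back out.
git cherry-pick "$pick" >/dev/null 2>&1 || echo "conflict, as expected"
git cherry-pick --abort
after_abort=$(git rev-parse HEAD)
status_after_abort=$(git status --porcelain)

# Second attempt: resolve, stage, and let Git continue.
git cherry-pick "$pick" >/dev/null 2>&1 || true
echo "from both" > greeting.txt
git add greeting.txt
GIT_EDITOR=true git cherry-pick --continue >/dev/null

git log --format=%s
```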

These are the final tips for your Git journeys. First, don’t panic! Remember that Git is a time machine and there is always a way out. Second, remember that with the way a Git repository is organized it’s actually hard to lose anything. Most probably, your changes are somewhere in the repository, you just need to find them using one of the tools we have discussed. Third, the most important thing you must care about is the files in your working directory: if you haven’t saved changes to them and you overwrite your working tree, that’s it, they are lost forever. So take care of the changes you are making in your working tree and let Git take care of the rest.

And finally, if you are losing track of what’s going on, recall the building blocks diagram. Commit objects are the keys to everything, so if you need to find something, always start with finding the corresponding commit.

And there is always more to learn about Git! Here are several recommended sources. First, it’s the free book about Git called “Pro Git”.

If you are a visual thinker like myself, then “A visual Git reference” provides a lot of illustrations on what Git commands are doing to Git objects.

Finally, I’m pretty confident that if you’ve got through this talk you are now able to understand the text of official Git documentation pages :) Invoking git help command with the name of another command as a parameter provides you with pages of text you might now even find helpful.
