Lecture 5: Version Control and Git
Transcript
Cool, let's get started. Today we're going to talk about version control systems, and in particular, Git, because that's a really popular version control system these days. So what is a version control system? Well, it's a piece of software that's used to track changes to your source code, or any other files and folders that you want to track changes to. And as the name version control implies, these tools help maintain a history of changes you make to the software you're writing. And furthermore, they facilitate collaboration.
So if you're using a development platform like GitHub, for example, you will use the Git version control system to interface with that service and with other developers working on the same codebase. So logically what version control systems do is they track changes to a folder containing files like code and so on and all the contents in that folder in a series of snapshots and every snapshot encapsulates the entire state of files or folders within that directory that you're tracking. And version control systems also maintain metadata, like who created each snapshot, and maybe any messages you want to associate with changes you make to your code, perhaps explaining why you're making certain changes, and so on. So why is version control useful? Well, even if you're not collaborating with other developers, even if you're just working on a pset or lab assignment, version control can let you look at old snapshots of a project, basically keep track of your code as you're changing it over time, and it can enable other features, like letting you work on parallel branches of development. So if you want to have your underlying code base and work on feature one, and then switch from that to working on feature two before you're finished with feature one, and maybe you notice a bug in your code and you want to work on fixing that bug, you can do all that with the help of a version control system without having all of your changes kind of interfere with each other before they're finished.
So these tools provide a lot of value, even if you're just working by yourself. And if you're working with other developers, version control systems are basically necessary, unless you want to be emailing zip files back and forth or something, which really quickly gets super chaotic. It's just the standard tool for collaborating with other people when doing software development. And modern version control systems let you do all sorts of handy things and answer questions that might otherwise be hard to answer. For example, you can look at a particular piece of code or a module and ask, who wrote this, if you want to go talk to them.
You can look at a particular line of code and ask, who was the last person to modify this code? When was it changed? Why was it changed? You can even do fancy things with version control systems. Suppose you've been working on a particular piece of software for months and months, and one day you notice, oh, hey, the software has a particular bug, and it looks like this bug wasn't originally there. Like one year ago, your software did this particular thing fine, and then at some point in the last year, you introduced a bug into your code. What would you do to try to figure out when that bug was introduced? What was the change that introduced it? Well, version control systems can automate this for you. You can do something really cool where you write a test that fails if your bug is present, and then your version control system can automatically binary search your entire version history in order to pinpoint exactly where that bug was introduced.
So really powerful tools that are worth learning. And in this lecture, we're going to teach you about concepts underlying version control systems, and we're going to teach you about the Git version control system in particular, because it's just the de facto version control system of today. You know? Unfortunately, Git has a kind of complicated reputation. You can take a look at the XKCD comic I've put up there, and we'll read through it. This is Git.
It tracks collaborative work on projects through a beautiful distributed graph-theory tree model. Cool. How do we use it? No idea. Just memorize these shell commands and type them to sync up. And if you get errors, save your work elsewhere, delete the project, and download a fresh copy.
So, no shame. Who has done this before? All right, a number of hands go up, including mine. I've done this before when I was learning how to use these tools. And so in this lecture, we're going to try to get you from being that guy over there to not being that guy over there. And so why does this happen? Why does Git have this kind of a reputation? I think it's because Git's interface, like the commands you use to work with it, it's a leaky abstraction.
And if you learn Git top down, if you learn the commands and what they do, and that's your starting point, it can lead to a lot of confusion, especially if you don't understand the underlying ideas. You can memorize a bunch of commands and know when to use them and how things generally work and roughly treat them as magic incantations. And then do this when anything goes wrong. Delete your repo and start over. But obviously, if you do that, you're not using the tool effectively.
So I think Git does have not the cleanest interface. But I think the ideas underlying the version control system, its design and those underlying ideas are really beautiful. The ugly interface has to be memorized, but the underlying ideas, I think, can be deeply understood. So for this reason, in today's lecture, we're going to give you a bottom-up explanation of Git, starting with the underlying ideas and things like the data model. Later, we'll cover the command line interface at a high level and then give you pointers to let you learn more and memorize all the commands you need to memorize.
We think this is the right way to learn Git. The reason we're taking the time to teach it to you this way today is because I think most of the resources available online for learning this tool don't approach the learning process in this way. So any questions before we start diving into the material on version control systems or the motivation for approaching learning this way? Great. So let's talk a little bit about Git's data model. And don't worry about this complicated distributed graph-theory tree model stuff.
It's not that complicated. So I think Git's ingenuity is in its really well-thought-out data model that enables all these nice features of version control, including some of the basics we talked about earlier, like seeing who made a particular change and keeping track of your changes, as well as some of the fancy stuff I mentioned. You can maintain history, support different branches of development, collaborate with other developers. And so as the starting point in the data model, let's talk about snapshots.
So I'm using this term to refer to the state of a directory and everything inside of it. So it's this hierarchy of files and folders on your computer. So Git models a particular point in time in the history as a collection of files and folders, which I'm referring to as a snapshot. And in Git terminology, a file is called a blob, just a helpful term to know, and a folder is called a tree. And then you might have a particular example state of a snapshot that might look something like this.
So you might have the root directory of your snapshot, which is a tree. And then, so this is like a folder. Inside this folder we might have another tree. So say we have a tree called foo. So this is another tree.
And this tree inside might have a file. So say there's a file called bar.txt in here. So this is a blob, this is a tree, this is a tree, and then maybe this top level tree has inside of it another file. So this is at the top level instead of inside. All right, so this kind of encapsulates the state of a folder and everything inside of it.
That's how we're going to model it in Git. So now in our version control system. How do we want to relate these snapshots to each other? This is kind of the state of things at a point in time. And now we want to maintain history. So one way you might imagine doing this is by just storing a sequence of snapshots.
Like you might have version one of your repository, then version two, then version three, and so on. And every one is a snapshot, and you can see how things change over time by looking at different snapshots. But for a number of reasons, Git doesn't model history like this. Instead, it's a little bit fancier: in Git, a history is a directed acyclic graph of snapshots.
And if you haven't taken the data structures and algorithms class, don't worry, we'll give you good enough intuition for that, and it's not that complicated. So all this means is that in Git, each snapshot refers to a set of parents, the things that came before it. And we'll go through an example and talk about why it's a set of parents rather than just a single parent. So suppose we have the state of our repository in a single snapshot at a particular point in time.
I'm going to draw a diagram and use circles to indicate snapshots. So this is like a folder with a bunch of files in it and like its state including the contents of the files and the names of the folders and all that at a particular point in time. So suppose I start off with that state and then maybe I make some changes to my code. I might add a feature. So I have a new snapshot, a new state of my code base, and then this will actually refer back to the earlier snapshot.
And maybe I implement a bug fix, so I create a new snapshot and that'll refer back to the earlier snapshot. And then suppose at this point I identify that I want to implement a feature but there's also a bug I want to fix, and you could imagine doing those two things independently based off the same snapshot. So maybe I branch off of here and fix a bug and create a new snapshot, but I start from the same starting point and maybe I make some progress towards implementing a feature, it's not quite done. Maybe I make some more progress, so I take another snapshot that creates another history element that points back to the parent where it came from. So far we've shown branching, but we can also talk about a scenario where we might have a snapshot with multiple parents.
We might have this bug fix branch where I made a bug fix, this feature branch where I finally finished implementing a feature. And then I want to combine together these sets of changes into a single unified code base. So I might eventually create a single commit that includes changes from both of these, refers to both of these as parents because it was derived from both. And then I create what's called a merge commit here. And so logically, this is how Git represents history.
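The history drawn on the board can be written down as a small graph. This is a minimal sketch, not Git's actual representation: the commit names are hypothetical labels for the circles in the diagram, and each one maps to the list of its parents. The merge commit is the only node with two parents.

```python
# Each key is a commit (hypothetical names); each value lists its parents.
history = {
    "initial": [],
    "add-feature": ["initial"],
    "bug-fix": ["add-feature"],
    "second-bug-fix": ["bug-fix"],   # the bug fix branch
    "feature-wip": ["bug-fix"],      # the feature branch, same starting point
    "feature-done": ["feature-wip"],
    "merge": ["second-bug-fix", "feature-done"],  # merge commit: two parents
}


def ancestors(commit, graph):
    """Return every commit reachable from `commit` by following parent links."""
    seen = set()
    stack = list(graph[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


print(sorted(ancestors("feature-done", history)))
print(len(ancestors("merge", history)))
```

Following parent links from the merge commit reaches every other commit, which is what it means for the merge to be derived from both branches.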
It's just a bunch of snapshots that are related through this parent relationship. Any questions about this? We're still talking about things in the abstract right now. We'll get more concrete later in the lecture. Great. And one other detail that might be good to know is that in Git's model of history, the snapshots, the commits, and the relationships between them are all part of an immutable data structure.
You can add on to it. You can create a new commit and have it point to the earlier thing. But you can't actually take something in here and modify it. Say I switch to this commit prime thing here. You can't actually make changes like that.
And maybe a good way to think about this might be, just like in real life, you can't go back in time and change the history of things. And in the same way, Git wants to record history as it happened. And you can't actually go back and mutate things in this graph. You can only add onto it. And there are ways to handle various types of mistakes.
Like if you accidentally introduced a bug in your code here and you want to fix it, or you made a commit that you actually want to undo later, there are ways to do things that will achieve that same effect without actually messing with this immutable or append-only model of history. All right, so let's go one level deeper. It might be instructive to see what it looks like when we write down some of these concepts and something that looks a little bit more like code. So there we go. Let's write down a little bit of pseudocode.
So let's write down some type definitions. In Git, what is a file or in Git terminology blob? It's just a bunch of data. It's a sequence of bytes. And I'll write down like array of bytes. Like on your computer, that's what a file is, right? Just some binary data.
And then we have trees or directories. And what are directories? They're data structures that map the names of their contents to the actual contents. So here I'm saying that directories map strings to their contents. In the case above, the top-level tree maps foo to the tree that's contained inside of it, and also maps baz.txt to the blob that's contained inside of it. So trees can have inside of them, as elements, either trees or blobs, identified by the file names here.
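The two definitions so far, a blob as an array of bytes and a tree as a map from names to trees or blobs, can be written as Python type aliases. This is a sketch of the logical model only. The names `foo`, `bar.txt`, and `baz.txt` come from the diagram, while the file contents and the helper `list_paths` are made up for illustration.

```python
from typing import Dict, List, Union

# A blob is just a sequence of bytes.
Blob = bytes
# A tree maps a name to either another tree or a blob.
Tree = Dict[str, Union["Tree", Blob]]

# The example snapshot from the diagram (file contents are hypothetical).
snapshot: Tree = {
    "foo": {
        "bar.txt": b"hello, world\n",
    },
    "baz.txt": b"git is wonderful\n",
}


def list_paths(tree: Tree, prefix: str = "") -> List[str]:
    """Walk the recursive structure and return the path of every blob."""
    paths = []
    for name, entry in sorted(tree.items()):
        if isinstance(entry, dict):
            paths.extend(list_paths(entry, prefix + name + "/"))
        else:
            paths.append(prefix + name)
    return paths


print(list_paths(snapshot))
```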
So this is the file name. Or maybe just name is better because it can be a tree too. All right, so now we have files and folders and now we want to model our history. And just to be a little bit more precise, when I say blob or tree, I'm referring to elements or types of things in this recursive data structure. When I'm saying snapshot, I think this is not quite standard Git terminology.
I'm referring to the top-level tree corresponding to the entire directory whose history we want to track. And then when I say commit, I'm talking about nodes in that history graph. So the history is made up of a bunch of commits related by this parent relationship. So what is a commit? It's a data structure with a couple of different elements. So continuing along in my pseudocode:
Commits have inside of them snapshots. And what is the type of this? This is a tree. And then commits also have parents. And this is a list of commits.
So this is also a recursive data structure. Commits refer back to other commits. And then there's some other stuff that goes inside of here, handy things, like maybe you want to have an author name. You might want a commit message, where you might put in things like what changed in a particular commit that you're introducing in the history graph. Any questions so far? So continuing along at this pseudocode level of abstraction, Git introduces a notion of something called an object, which unifies blobs, trees, and commits.
So object is a blob or a tree or a commit. So now we can refer to all these different types of things in the kind of unified language. And then in Git, Git stores your data in an object store where the data is content addressed by its SHA-1 hash. So now we're going to use this type we just defined here and say that Git tracks your objects in a map from SHA-1 hashes to the actual contents of the object. And so like how is stuff actually stored and loaded from this object store inside of Git? In pseudocode, if we want to store an object, we will first compute its ID as the SHA-1 hash of the object and then store it into this logical map.
And then I'm sure you can fill this one in yourself. If we want to load something by ID from this object store, we just look at the object store and grab it by its ID. All good so far? And then one detail here, blobs, trees, and commits are unified in this way. They're all stored in this object store that's content addressed. And in the on-disk representation, these things don't actually contain the other thing.
It's not like the commit contains inside of it all the previous commits or the tree contains inside of it all the trees that are inside. There's a level of indirection here through the object store. So you can think of these as pointers. Maybe I'll just draw this in this diagram as like an asterisk. So this is a pointer to a commit.
This is a pointer to a tree. This is a pointer to a tree. This is a pointer to a blob. But logically, you don't need to worry too much about that. Can you repeat the question? Oh, I see.
You mean an objects array. I see. Yes. Like, here I'm defining the type of objects, but then you can think of this as, like, a global variable in your Git repository. It stores the set of objects identified by their SHA-1 hash.
And then this is the object data itself. So, for example, a tree in Git, like that tree I've drawn up there, will be an object in the object store, and the contents of that object will have... Since this is a tree, it'll have two entries. It'll have the entry for foo, which will be the hash, like basically a pointer to that inner tree. And then it'll have baz.txt, that name with the hash of the blob, so a pointer to the blob object.
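Putting the pieces together, here is a rough Python sketch of a content-addressed object store in which trees and commits refer to other objects by hash instead of containing them. The JSON serialization, the field names, the author, and the file contents are all assumptions made for this sketch; Git's real on-disk format is different.

```python
import hashlib
import json

# The object store: a map from SHA-1 hex digest to the object's bytes.
objects = {}


def store(data: bytes) -> str:
    """Content-address `data`: its ID is the SHA-1 hash of its bytes."""
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data
    return object_id


def load(object_id: str) -> bytes:
    """Look an object up by its ID."""
    return objects[object_id]


def serialize(record: dict) -> bytes:
    # JSON with sorted keys is an assumption of this sketch, chosen so that
    # equal records always produce equal bytes. Git uses its own format.
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Blobs hold file contents (hypothetical contents here).
bar_id = store(b"hello, world\n")
baz_id = store(b"git is wonderful\n")

# Trees map names to the IDs of other objects rather than containing them.
foo_id = store(serialize({"type": "tree", "entries": {"bar.txt": bar_id}}))
root_id = store(serialize({"type": "tree",
                           "entries": {"foo": foo_id, "baz.txt": baz_id}}))

# Commits point at a snapshot (a tree ID) and at their parents by ID.
first_commit = store(serialize({"type": "commit", "snapshot": root_id,
                                "parents": [], "author": "Alice",
                                "message": "Initial commit"}))
second_commit = store(serialize({"type": "commit", "snapshot": root_id,
                                 "parents": [first_commit], "author": "Alice",
                                 "message": "Second commit"}))

print(json.loads(load(root_id))["entries"])
```

Storing the same bytes twice produces the same ID, so identical files or unchanged subdirectories are only kept once in the store.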
And so let me swap these around. So now that we've introduced these concepts, including the object store, and this idea that everything in Git, from commits to trees to blobs, can be identified by their SHA-1 hash, we can flesh out this picture here a little bit. I won't draw hashes everywhere, but for example, this commit here might be identified by... so this is a commit object. And its hash might be something like the hexadecimal.
Yeah, oftentimes in Git you write down the hashes as 40-character-long hexadecimal strings. So this might be like 0 f c 2 something something something 3 7. That's the hash of this object. And then it has contents inside of it, including a parent pointer. And so this other commit here has its own hash.
And again, the hash is the hash over the contents of the object. So maybe this has some other SHA-1 hash, maybe this starts with 2 and ends with 7. So this parent pointer here will actually be the value of this hash here. So this will be that value and this will have its own data.
So now we have our model of history, including a model of the contents of the file system, and we have a way to name things by these really inconvenient 40-character-long strings. So the next concept that Git introduces helps us with naming things. 40-character-long hexadecimal strings are not very human readable, and you don't want to have to remember them. So Git introduces a concept called references, which map human-readable names to SHA-1 hashes.
So the question is what's SHA-1? SHA-1 is a hash function. I don't think I can cover in detail today what hash functions are, but you can think of them as... [Student] They turn some strings into random data, but it's deterministic, right? Yes, so the SHA-1 hash takes in some data that's an array of bytes and returns 160 bits [speaker said "bytes"] of data.
And you can think of this as kind of like randomly but deterministically mapping arbitrary-length data to a fixed-length representation. Maybe getting a tiny bit more into crypto theory here, if this is helpful at all: there's something called the random oracle model. Suppose there's this global registry of objects and their hashes, and whenever you compute the SHA-1 hash, you look up in this global registry whether anybody has ever registered this thing before. If they haven't, then it's added to the registry and associated with some random 160-bit [speaker said "byte"] value. And then in the future, whenever anybody looks up that same object in this globally shared registry, they'll get back that same value. But really, at a high level, it's a way to take some arbitrary-size thing and compress it to a small representation in a deterministic way, such that the output looks random-ish and collisions are very unlikely.
So if you hash two different things, it's very unlikely that you will end up with the same output from the hash function. Question? Oh, so the question is why are hashes stored as strings instead of some numerical data type? Yeah, they're actually stored as arrays of bytes. I just decided to use this notation, talking about them as strings. If we're actually talking about the actual on-disk representation that Git uses, it's slightly fancier than this. But this is a good logical model.
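These properties are easy to check with Python's standard `hashlib` module. The inputs below are arbitrary examples. The point is only that the output is always 160 bits (20 bytes, or 40 hexadecimal characters), that the same input always gives the same output, and that different inputs give different outputs here.

```python
import hashlib

short_input = b"hello"
long_input = b"hello" * 100_000  # a much larger input

short_digest = hashlib.sha1(short_input).digest()
long_digest = hashlib.sha1(long_input).digest()

# The output is always 20 bytes (160 bits), whatever the input size.
print(len(short_digest), len(long_digest))

# Written as hexadecimal, 20 bytes become a 40-character string.
print(len(hashlib.sha1(short_input).hexdigest()))

# Deterministic: hashing the same bytes again gives the same result.
print(hashlib.sha1(short_input).digest() == short_digest)

# Different inputs give different digests in this example.
print(short_digest != long_digest)
```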
All right, so back to references. So recall that Git history is immutable, right? We can't change any of the contents of that thing. We can only add new things to it. Related to that, you can think of the object store as something you put new stuff into, but you don't go and delete things. And also think about my explanation of SHA-1, how it deterministically maps data to a short string or short array of bytes.
If you think about what would happen if you were to modify something in the middle of that commit graph, well, all these later nodes point to earlier nodes by referring to them by the hash of their contents. So if you change something earlier in the graph, its hash would also change, and none of these pointers would work anymore. So anyways, that stuff is all immutable, append-only, or that's a good way to think about it. And references are where you introduce mutation.
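A small experiment makes this concrete. In the sketch below, each commit's ID is the SHA-1 hash of its parent's ID together with its own contents. That is a deliberately simplified stand-in for a real commit object, and `commit_id` and `build_chain` are hypothetical helpers. Editing an early commit changes its ID, and therefore the ID of every commit after it.

```python
import hashlib


def commit_id(snapshot: bytes, parent_id: str) -> str:
    """Hash a commit's contents together with its parent's ID."""
    return hashlib.sha1(parent_id.encode("ascii") + b"\n" + snapshot).hexdigest()


def build_chain(snapshots):
    """Return the IDs of a linear history built from the given snapshots."""
    ids = []
    parent = ""  # the first commit has no parent
    for snapshot in snapshots:
        parent = commit_id(snapshot, parent)
        ids.append(parent)
    return ids


original = build_chain([b"v1", b"v2", b"v3"])
edited_first = build_chain([b"v1 (edited)", b"v2", b"v3"])
edited_middle = build_chain([b"v1", b"v2 (edited)", b"v3"])

# Editing the first commit changes every ID in the chain.
print([a != b for a, b in zip(original, edited_first)])

# Editing the middle commit leaves the first ID alone but changes the rest.
print([a != b for a, b in zip(original, edited_middle)])
```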
So you might want to have human-readable names, like maybe you want to refer to the latest version of the codebase you're working on with the name main or master. And that might be a tag or branch, in Git terminology, that refers to a particular commit in your commit graph. And because this references mapping in Git is mutable, you can keep it up to date as you continue to develop your software. So in this diagram, I'm introducing a new type of thing.
These are commits, these are parent relationships, and what I'm writing in this rectangle here is a reference that's pointing to this commit. And so the idea is that as you develop your software, maybe you will add a new feature. And now you introduce a new commit. That commit points back to this previous commit, and you can update master to point to this latest commit that you just created.
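In the same pseudocode spirit, references can be sketched as a plain mutable map from names to hashes. The function names `update_reference` and `read_reference` and the placeholder commit IDs are assumptions for illustration; they stand in for real commit hashes from the object store.

```python
import hashlib

# References: a mutable map from human-readable names to SHA-1 hashes.
references = {}


def update_reference(name: str, commit_hash: str) -> None:
    """Create a reference, or move an existing one to a different commit."""
    references[name] = commit_hash


def read_reference(name: str) -> str:
    """Resolve a human-readable name to the hash it currently points at."""
    return references[name]


# Placeholder IDs standing in for two commits in the history graph.
previous_commit = hashlib.sha1(b"previous commit").hexdigest()
latest_commit = hashlib.sha1(b"latest commit").hexdigest()

update_reference("master", previous_commit)
print(read_reference("master") == previous_commit)

# After creating a new commit, move master forward to point at it.
update_reference("master", latest_commit)
print(read_reference("master") == latest_commit)
```

Moving the reference changes nothing in the history itself; the previous commit is still there, it just no longer has the name master attached to it.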
And when you're working with Git, there are higher-level commands that'll take care of a lot of this, so you're not manually doing the individual steps as I'm drawing them on the board here. Yes, that's right. Think of the history, in the sense of the commits and the parent relationships, as being immutable, and the names that you're using to refer to particular things, which are stored in the references data structure, as mutable. Yeah, question? [Student] How do you get rid of a change from the history? Yeah, that's a great question. So the question is, suppose you realize that maybe over here you did something very bad, like you accidentally committed an API key.
What do you do if the history is immutable? So there are a couple of options. It's a kind of complicated topic. When you're collaborating with other people using Git, what's effectively happening is you're each sharing different slices of the same growing data structure. Like you might make some of these commits, a different developer might make these commits, then you'll share them with each other and things like that.
And so if you've shared this with other people and then you realize that there was an API key in there, the way I would recommend approaching it is you need to work to invalidate that API key. So generally when you get an API key from some API, there'll be a button there to delete that API key or to regenerate the API key and invalidate the old one. If you haven't yet shared this commit with other people, Git actually has some commands to basically rewrite history. Like, again, this stuff is all immutable, but Git can basically recreate, say you want to make this commit prime, you can have Git do that for you and also recreate everything that comes after it to take account for this. So you'll end up with a modified version of your entire history.
There's a command called git rebase and some other commands related to that that will do this. I think we have an exercise in today's lecture notes that will give you additional pointers on how to do things related to this. So that's one way to deal with it. Another way is if you've invalidated the API key, which you should generally do anyways, you can just leave it in your version history. It's not a big deal.
For other classes of changes, suppose you added a feature here and then later realized that your implementation of that feature was buggy and you don't have time to fix it. You just want to undo that change and maybe you'll deal with it later. And you don't want to go back and rewrite history. Git has a command that introduces a new commit on top of everything you've done that basically undoes the effect of a previous commit. So roughly what it does is it computes the delta between this and this and replays like the inverse of that delta here to undo that change.
That's a command called git revert. Yes? Yeah, that's a great question. So the question is, if you've forked your history and then you want to merge it together, when you're creating this commit, how does Git know how you want to combine the stuff between these two things? And the answer is, Git has a number of heuristics for doing so intelligently. For example, if you're working on a large code base and you modify one file in one commit, modify a different file in the other commit, and merge together those changes, Git will just take the most updated versions of those two files. When you're editing the same file, even then it's kind of smart.
Like if you modify different chunks of the file that are far enough away from each other, it'll just work out. And if you don't, if you're modifying, say, like the same function, then like at some point Git can't really tell what you want. So it'll give up and ask you to help it. So you'll get what's called a merge conflict. And then Git has certain tools that will help you to deal with those merge conflicts.
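The simplest of these heuristics, where each side changed a different file, can be sketched as a three-way comparison against the common ancestor. This is a whole-file simplification with hypothetical names and contents. As described above, Git also merges changes to different parts of the same file, which this sketch would report as a conflict.

```python
def merge_snapshots(base, ours, theirs):
    """Merge two snapshots (path -> contents) relative to their common base.

    Returns the merged snapshot and a list of conflicting paths.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree
        elif o == b:
            result = t  # only their side changed this file
        elif t == b:
            result = o  # only our side changed this file
        else:
            conflicts.append(path)  # both changed it differently
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts


# Hypothetical snapshots: the common ancestor and the two branch tips.
base = {"a.py": "a1", "b.py": "b1", "c.py": "c1"}
bug_fix_branch = {"a.py": "a2", "b.py": "b1", "c.py": "c2-fix"}
feature_branch = {"a.py": "a1", "b.py": "b2", "c.py": "c2-feature"}

merged, conflicts = merge_snapshots(base, bug_fix_branch, feature_branch)
print(merged)
print(conflicts)
```

Here `a.py` and `b.py` merge cleanly because only one side touched each, while `c.py` is reported as a conflict for a person to resolve.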
And this is something that does come up if you're doing like any real world software development, working with other people on the same Git repository, you will at some point end up having to deal with that concept. Cool. So I wrote out pseudo code for storing and loading from the object store. Unless people really want me to, I'm going to skip doing that for the references. You can kind of figure it out yourself or refer to the lecture notes.
But just recall that or remember that the references are mutable. So you can read a reference. So given a name, get the SHA-1 hash. You can store a reference. So given a name and SHA-1 hash, you can put it into this data structure.
And you can also update existing things. Which is a little bit different than what we supported in our object store. One of the little details that might be helpful to mention. Sorry, what? Yes, the question is, is it called a merge conflict? And that's right. When you have different branches that you want to merge together and something goes wrong, that concept is called a merge conflict.
And then Git has something called git mergetool that can help you deal with that. You'll end up with conflict markers in your files where Git can't figure out what to do. And then there are a variety of tools that let you use your own brain to help Git out. So we have references.
We can have names for commits in our commit graph. And one other little detail that might be helpful to mention is that Git has a special named reference called HEAD, which you can roughly think of as referring to the thing you're currently looking at. And maybe this will become a little bit more clear through the examples I'm going to show in a moment. And you might see this in documentation that you read later. All right.
So we've actually covered all the core concepts that are part of Git's data model. So we can now finally define what is a Git repository. And all it is is it's the set of objects and references. And that is it. So I think if you look at this, like in terms of, or like as far as data structures go, like these are all really simple concepts, right? So some pretty beautiful ideas at the core of Git.
And so on disk, all Git stores for your repository, roughly, are just objects and references. And as you're learning Git, it's really helpful to think about the different commands in terms of how they manipulate objects and references at the level of the underlying data model. So when you learn about commands like git commit or git reset or things like that, think about how they modify things in this picture. Do you have a question? So the question is what is GitHub? And yeah, so Git is a free and open source distributed version control tool. GitHub is a software as a service provider that lets you host Git repositories.
It's a popular place for people to host their open-source software and to collaborate with other people. So, like, one scenario that might come up if you're a student is your professor might put up lab assignments in a Git repo, and you will clone that Git repo and maybe work on the lab assignment, and they might update the lab assignment, and you can use Git to, for example, pull the latest changes from GitHub, which is a website that hosts a copy of that professor's Git repository, into your own local copy. And this version control tool will let you do things like have your own local version that you've modified, and cleanly sync in the changes from the upstream GitHub without having to do something kludgy. Like if you weren't using a version control system and your professor published an updated version of the lab, you might download a zip file, unzip it, and then copy over all of your changes, or manually figure out how to merge the changes you made in your local copy with the updated lab handout from the professor. But GitHub is not the only Git host.
They also have nothing to do with each other in some sense: one is a company that's now owned by Microsoft, the other is open source software. There are other Git hosts out there that you can google if you want to find alternatives. Don't need that comic anymore. Hopefully we've demystified Git so you are no longer that person in the comic. Okay, actually, a moment before I get to practical demos, I want to explain one other concept that's orthogonal to this core data model, but it's part of Git's interface for creating commits.
So far we haven't talked about how you actually create these things here, how you actually describe what goes in the snapshots. So one way you might imagine designing a version control system is that you're tracking changes to a particular folder on your computer, and then the version control system might have a command to commit the latest changes, and it just takes everything that's in its current state and says, okay, I'm going to take a picture of it, and that is my new snapshot. And there are some version control systems that work like that, but not Git. And the reason is that you might have your folder where your code exists, and maybe you went in and added one feature, and then before committing your changes, you went in and added another feature, and maybe you fixed a bug, and then maybe you started on a new feature, and then you're like, oh wait, I didn't commit any of my changes. And if you were to just take the current state of things as a snapshot, that would be a little bit messy.
And so Git gives you a little bit more control over crafting these snapshots. And so it has this concept called a staging area, which is used to help describe to Git what you want to be included in the next commit that you create. And we will see the staging area in action through demos in conjunction with other Git commands. Any questions so far before we start typing some code into the terminal? Fantastic. Quick show of hands, who here has used Git before? Oh, decent number of people.
Who here wants me to talk about the fundamentals, like going through the different commands like git init, git add, git branch, git merge, and so on, and relate it back to the data model, versus talk a little bit more about... slightly more advanced functionality like Git remotes, like how to collaborate with people, versus even more advanced functionality like Git bisect or Git rebase or things like that, or git worktree. So option one fundamentals, number of hands. Option two, remotes, intermediate stuff. And then option three, fancy things.
Okay, that's roughly an even split. Let's see if we can go through all of it and see what happens. So I think what I really want to make sure I cover is the fundamentals, right? Like if you're at the point where you understand all the fundamentals and you're comfortable using Git remotes and stuff like that, then for the fancier features, all you really need is someone to tell you, like, hey, this thing exists, and then you can go and figure out how to use it. And for that, we've described it in the lecture notes. We mentioned a bunch of tools that we like using and we've linked to.
I think Pro Git is probably my favorite resource. And there are links in the lecture notes so you can explore on your own, go through the exercises, which also talk about some of these advanced concepts, and we can also help you out on the Discord forum. All right, so unsurprisingly if you type in git, that is the interface to the git command line program. One helpful command to know might be git help. All of git's commands, all of its functionality, are implemented as subcommands below git, so there's like git status and git init and git commit and so on.
And we'll talk about these in detail in a moment. And there's also a git help. And you can get help on git's subcommands through this interface. It just opens up the man page for the appropriate command.
Recall, man pages we covered in lecture one. So if you do git help help, for example, you'll get the help page for the help command. If you do git help commit, you'll get the help page on the commit command. And so this is really useful just inline help if you don't want to go to Google or ChatGPT or something. All right, so here.
I have an empty directory I called git demo. It's an empty directory, there's nothing in it, nothing up my sleeves. If I type in git init, I'll see that it creates an empty git repository in this directory. If I do another directory listing, I'll see that there's this new hidden directory, one that starts with a dot, called .git. All the data corresponding to this git repository is stored in there.
We can poke around in there a little bit. If we do an ls .git, we'll see these objects and refs folders. So that actually stores Git's object storage, just an on-disk data structure, and the references. And if you want to see exactly how this looks under the hood, you can go create a Git repository and poke around yourself, but I will skip doing that for now. So one helpful command to know is git status.
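For anyone following along at a terminal, the steps just described look roughly like this. The directory name git-demo and the use of a temporary parent directory are assumptions for the sketch.

```shell
# Create an empty directory to experiment in (name is hypothetical).
demo="$(mktemp -d)/git-demo"
mkdir "$demo" && cd "$demo"

ls -a        # only . and .. at this point
git init     # prints "Initialized empty Git repository in ..."
ls -a        # now also shows the hidden .git directory
ls .git      # includes objects and refs, among other entries
```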
It tells you what's going on. So what we see here is that we're on the branch master. There are no commits yet. So in terms of like this graphical view of history, there's just nothing. It's an empty world.
And there's also nothing to commit. We've not added anything into this directory. There's nothing in the — nothing's been staged. And then we see this "on branch master". This is — my Git is set up to call the default branch master.
And so roughly think of it like there's a name that doesn't point to anything yet because there isn't any history. So let's create some contents here and then show how we actually start creating some Git history. Yeah, so the question is, can you have only one repository in a given directory given that the file name is fixed? And like, yes, asterisk. I've been writing software for a while, and I haven't had too many cases where I wanted two Git repositories in the exact same directory. There are ways to do it if you really want to.
But yeah, so if you want multiple repositories, do you want to create separate files or folders for each one of them? And yeah, the answer is yes. Like a lot of us will just have like a source directory. And then inside there, we have a bunch of directories that are all themselves Git repositories. Git also has some fancier concepts like sub repositories and stuff like that, but we won't get into those today. Cool. Where are we? Okay, so we have no contents here. We want to create some content so we can then track it in our history.
And so I'm not actually gonna like live-code something random in front of you. I'm just gonna write a text file with some kind of filler text. So we're gonna type in the NATO phonetic alphabet. Does anybody remember the NATO phonetic alphabet? All right. This is enough.
So I have this file called nato.txt that has some text in it. All right. So if I do a git status, I'll see that something has changed. Now I see I have untracked files. So git's saying, hey, this file was not included in the last snapshot.
In this case, there's no snapshot. But this file was not included in the last snapshot. But hey, I'm just letting you know that it's untracked so you can do something about it. So I can do git add nato.txt. And what this does is it stages the file to be committed.
So it says changes to be committed. So my Git repository history, my graph view is still empty. There are zero commits. But now I've said, okay, I want to include this whenever I make the next commit. And then Git has a command called git commit that creates a new commit.
And it pops up with this editor here. And Git includes a couple pieces of metadata along with commits, including a commit message. So you want to type in good commit messages. In this case, I'm just going to call this initial commit. I save this file and then git looks at the contents of that file and creates a new commit called initial commit.
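The untracked, staged, committed progression above can be reproduced with a sketch like the following. The scratch directory, the placeholder identity, and the three lines of filler text are assumptions. The sketch passes the message with -m instead of opening an editor, purely so it can run unattended.

```shell
# Scratch repository (hypothetical setup for illustration).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'bravo\ncharlie\ndelta\n' > nato.txt   # filler text standing in for the lecture's file

git status --short                  # "?? nato.txt": untracked
git add nato.txt
git status --short                  # "A  nato.txt": staged for the next commit
git commit -q -m "initial commit"   # create the first commit without opening an editor
git status --short                  # prints nothing: nothing left to commit
git log --oneline                   # shows the single commit
```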
Now I think I'm not going to talk about how to write good commit messages in today's lecture, but I think it's a really important topic and so we have included some links to resources in today's lecture notes and maybe Jon's going to talk about it next week? Potentially we're going to talk about it next week. All right, so I've created my first commit. Now I want to see what's going on here. I can do git status and it says something a little bit different. It says nothing to commit working directory clean but we notice that this no commits yet message is gone.
And there are a couple commands you can use to inspect the state of your history that are pretty handy. So there's a command called git log which gives you a flattened version, a flattened view of your history. So it's roughly like linearizing it and printing it out in a linear order and this command has a couple options. Like there's this dash dash graph option which shows a graph. It doesn't look that interesting when I have a single commit, but we'll look at this again later and see a slightly fancier graph.
We can even look a little bit under the hood at the data. If we want to connect this back to the data model, it might be helpful to look at the git cat-file command. Now, I think in regular usage, you're very unlikely to use this command, but this lets you kind of go under the hood a little bit in Git. So this lets you look at objects in the object store either by their name, so it'll do the reference lookup for you, or you can type in the SHA-1 hash directly. And this has a couple flags, -t and -p, to look at the type or to pretty print the file.
So I can say, what is the type of the object that the ref master refers to? And it says, okay, this is a commit. It's referring to this same commit up here. And if I do git cat-file -p, and this time I'll just type in the commit hash (again, recall that all the objects in Git are identified by their object hash), I can see the data corresponding to the commit. So what I've highlighted there corresponds to this data structure up here.
So we can see here it has a pointer to a tree. That's the contents of this snapshot along with some metadata and my commit message. There are no parents here because this is the first commit. But if I do git cat-file -t on this thing, I'll see tree. Then git cat-file -p on this thing, and I'll see that there's a tree with a single entry in it, in that map from strings to stuff.
And that entry is this name with this hash. And it also conveniently tells me that the thing it's referring to is a blob. And if I do git cat-file -p with this hash, I will see the contents of the file in that snapshot, which in this case is actually the same as the contents of nato.txt in my working directory. So do people see how I'm connecting these git commands back to the underlying data model? Yeah, question? Can you just like...
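The walk from commit to tree to blob described above might look like this in a scratch repository. The setup and the variable names tree and blob are assumptions for the sketch. It uses git rev-parse to look up the hashes rather than copying them from the screen, which is a convenience swapped in here and not what was typed in the demo.

```shell
# Scratch repository with a single commit on a branch named master (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # name the not-yet-born branch master, as in the demo
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\ncharlie\ndelta\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"

git cat-file -t master                      # commit
git cat-file -p master                      # tree hash, author, committer, message (no parent)

tree="$(git rev-parse 'master^{tree}')"     # hash of the tree that the commit points to
git cat-file -t "$tree"                     # tree
git cat-file -p "$tree"                     # one entry: mode, type blob, hash, nato.txt

blob="$(git rev-parse master:nato.txt)"     # hash of the blob stored under the name nato.txt
git cat-file -t "$blob"                     # blob
git cat-file -p "$blob"                     # the file contents in that snapshot
```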
Yeah, so a branch is a special type of reference that you can kind of attach to, and it moves along with you as you make commits. So here, if we look at the git log, we have a single commit, and we have the master branch which points to this commit. And if I make some changes to my file, don't write bad commit messages like this, but I added the contents of my change. -m lets you supply a commit message on the command line. Now if I look at the git log, I have my original commit, that initial commit, and then this new commit I just made where I added a line of text, and I see that master is updated.
So I was attached to this master branch, which I saw in git status. It said on branch master, and then when I do git commit, it updates that ref for me. So recall earlier I was talking about how we can think of things in terms of manipulations to this underlying data model, but usually we don't make like atomic changes to this thing. Git will combine together these things for us.
So you can create a new node in this graph and move this ref over using the git commit command. Cool. Any questions at this point? Yeah. Yes, so you pointed out that intuitively you might think that a branch refers to an entire lineage or something, but in Git, a branch is really just a pointer to a commit.
And that is right. There are some other version control systems that have a concept of branches that work a little bit differently and closer to, I guess, your intuition of actually tracking the lineage, but Git does not work like that. So branches are just pointers to commits. You can move them around however you like. If you have a branch here and you just want to move it over here, you can totally do that.
And so they aren't really special. They don't track lineage themselves; they just point to a commit, and commits are what track lineage. Any other questions at the moment? Yeah, so the question is, is master always auto updated to point to the most recent commit? No, it is not. Master is the, or used to be the default name for the default branch when you create a new Git repository. Basically, all branches work this way, where if you're attached to a branch and you make a commit, whichever branch you're on, that branch will be updated.
So if I make a new branch here, if I do git branch anish, that creates a new branch right where I am. And I do git switch anish, and I do git status, it'll say on branch anish. And then suppose I modify this file to add a new line and I look at my git history in this dash dash graph view, it also shows me where some of the branches are. Master is still where it was before because I switched to the anish branch and then made the latest commit.
So this one branch got updated to the new commit. But master stayed where it was because I wasn't on master at the time I made the commit. Yeah, so the question is if you're on the master branch, so let me git switch master, this will even, so by the way, this will change the contents of my working directory. So if I cat nato.txt, hotel's gone now because I've moved back here. So the snapshot is its own thing, but also this git switch command changes the contents of your working directory to match this branch I just switched to.
So it's absolutely right that if I go modify this thing, let me modify this to add alpha here and do git commit. So the dash a flag commits all the changes to files Git is already tracking. I'm going to start giving worse and worse commit messages to move faster. Now if I look at the git history, I can also add the dash dash all flag.
If you don't add dash dash all, it just shows you the recursive history from the branch you're on. So it doesn't show nodes that are unreachable from the current branch. But if you do dash dash all, it shows you all the things reachable from all the branches. Yeah, so now you see that I was here. You were asking what happens if I switch to master and make a commit.
So I have made this new commit. Oops, master is updated. But this commit points back to this as its parent. So this was the one without hotel or alpha. This is the one where I added hotel.
This is the one where I added alpha. And so now if I look at my file, I have alpha through golf in this one. If I do git switch anish and I... look at this, I have bravo through hotel. And now since we've gotten into a situation where we've forked our history and we have, yeah, two different branches — or two different commits where neither is a parent of the other — we can talk about merging.
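The forked history described above can be recreated with a sketch along these lines. The scratch setup, the commit messages, and the temporary file nato.tmp are assumptions for the example; the branch names follow the demo.

```shell
# Scratch repository with one commit on master (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\ncharlie\ndelta\necho\nfoxtrot\ngolf\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"

git branch anish                          # new branch pointing at the current commit
git switch -q anish
printf 'hotel\n' >> nato.txt              # add a line at the bottom
git commit -q -a -m "add hotel"           # advances anish; master stays put

git switch -q master                      # working directory returns to master's snapshot
{ printf 'alpha\n'; cat nato.txt; } > nato.tmp && mv nato.tmp nato.txt   # add a line at the top
git commit -q -a -m "add alpha"           # advances master; anish stays put

git log --graph --oneline --all           # two commits sharing the initial commit as parent
```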
So if we, let's say, switch to the master branch, we can do git merge anish, and git will get into a state where it's creating a new commit and pop up with this thing where it asks me to write the commit message. I'll do save and quit. In this case, we see it says auto merging nato.txt. So in this case, because I made one change where I added alpha to the top and a different change where I added hotel to the bottom and the middle contents were the same, the diff and merge algorithm was smart enough to figure out that like actually we want to combine the things and we didn't get a merge conflict. And so now if I look at the history, I should see my first three commits, one, two, three, or sorry, these first two commits, one, two, then this commit, this other commit that has this earlier one as a parent, and then finally this new merge commit that I just created which merges this branch anish into master.
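A sketch of that merge, under the same assumptions as before (scratch repository, made-up commit messages, a temporary file nato.tmp). It passes --no-edit so the default merge message is accepted without an editor popping up, which differs from the interactive demo.

```shell
# Rebuild a forked history: master adds a line at the top, anish adds one at the bottom.
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\ncharlie\ndelta\necho\nfoxtrot\ngolf\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"
git branch anish
git switch -q anish
printf 'hotel\n' >> nato.txt
git commit -q -a -m "add hotel"
git switch -q master
{ printf 'alpha\n'; cat nato.txt; } > nato.tmp && mv nato.tmp nato.txt
git commit -q -a -m "add alpha"

# Merge anish into the branch we are on (master).
git merge -q --no-edit anish        # changes touch different parts of the file, so no conflict

git log --graph --oneline           # the merge commit has two parents
cat nato.txt                        # alpha at the top and hotel at the bottom
```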
And if I look at the contents of nato.txt now, I have the alpha that I added in one branch and the hotel that I added in the other branch. And then since... oops. Since master includes all the history, including this thing, this thing has been merged in, I don't need this name anymore. So I can go ahead and do git branch -d anish.
Here, let me do git branch. It'll show me the list of, whoops, git branch. Git branch. There we go, third time's the charm. I'll see that I have two branches here.
I'm on the master branch. That's why it has this asterisk in green. And I can do git branch -d anish. And it says deleted this branch. So now if I look at the version history, the graph looks the same, because all this is reachable from this master thing, but just this name associated with this commit is gone, because I don't need the name anymore.
And so again, that conveys that I can just delete a branch, and the history is still there. Branches and tracking lineage are two different concepts in Git. Yeah, question? Yeah, so a commit is that data structure we spent a bunch of time talking about. And a branch is a reference. It's a name for a commit, or a name for a SHA-1 hash.
Commits are immutable, branches are mutable. Question? Yeah, so the question is, instead of merging anish into master, could we have merged master into anish? And yeah, absolutely. Master isn't special, it's just one name versus another.
Question? Let's try. So the master branch is in use. That's the one I have checked out currently. That's what this head pointing to master means. And yeah, that's what I've checked out in this current directory.
But if I do something like git branch main and then git switch main, I can do git branch -d master. And now I have no more master. Cool. Any other questions at the moment? Yeah, the question is, can I recover it? All right, so this is like pretty advanced Git. And if you do everything correctly, you will never need this tool.
But there's this tool called git reflog. Basically, Git maintains a bunch of like extra state, including history of your references. So references are mutable. And so it's oftentimes helpful to maintain a history of what's happened with them. Like if you move a branch from pointing to one thing to pointing to another, and you're like, oops, I didn't want to do that, like how would you go back? You just changed the thing. So that's why it maintains a history, and so like reflog is like reference log.
And then the other thing in Git, like, again, Git just has two things. It has references and objects. Objects are immutable. And so how does Git decide what to keep as you, like, create more and more history and move branches around and stuff? Well, think of, like, the active history as the graph that's reachable from the union of all the references. So if I create, say, actually already in this picture, like, this node is not reachable from any reference, right? And so as you end up with kind of like orphan stuff in your object store, Git will automatically garbage collect it.
So if you say like I created a commit, I like deleted the branch or something, I can look at my ref log and find the commit. That commit will still be there in my object store. But if I like use Git for a long time and then like one year later, I was like, oops, I actually wanted that old thing I deleted a very long time ago. It might be gone. Yeah, question? That's a good question.
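One way the recovery described above could go, as a hedged sketch: the branch names scratch and rescued, the commit messages, and the scratch setup are all hypothetical. It relies on the reflog of HEAD, where HEAD@{1} means the place HEAD pointed one step ago.

```shell
# Scratch repository with one commit on main (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"

# Make a commit on a throwaway branch, then delete the branch.
git switch -q -c scratch
printf 'charlie\n' >> nato.txt
git commit -q -a -m "work I will want back"
git switch -q main
git branch -q -D scratch                 # -D forces deletion of the unmerged branch

git reflog                               # the log of HEAD still lists the lost commit
lost="$(git rev-parse 'HEAD@{1}')"       # where HEAD pointed before the last switch
git branch rescued "$lost"               # a new name makes the commit reachable again
git log --oneline rescued
```

The commit is only recoverable this way while it is still in the object store, which matches the caveat above about garbage collection.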
When I did branch main, how did I specify what main should be pointing to? So without any additional arguments, main points to where I am currently, which is head, and which is also the same as master in this case. You can also specify a ref or a SHA-1 hash. So if I look at my graph again, here was my first "add more stuff" commit. So if I do git branch old this, what will this do? This will create a new branch called old that'll point here initially. And so if I look at this graph, I see that this old has been created.
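The two forms of git branch described above, with and without a starting point, can be sketched as follows. The branch names here and older, the commit messages, and the scratch setup are assumptions; old follows the demo.

```shell
# Scratch repository with three commits on main (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"
printf 'charlie\n' >> nato.txt && git commit -q -a -m "add more stuff"
printf 'delta\n' >> nato.txt && git commit -q -a -m "add even more stuff"

git branch here                              # no start point: points where HEAD is now
git branch old "$(git rev-parse HEAD~1)"     # start point given as a full hash
git branch older HEAD~2                      # start point given as a ref expression

git log --graph --oneline --all              # shows here, old, and older next to their commits
```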
Yeah, so the question is, is it convention to have a main or master branch? And yeah, the answer is yes. Like oftentimes you want like, say people have the most up-to-date version of their code being main or master. It used to be master in older versions of Git. Now I think the default is either it's unset or it's main. I'm not a hundred percent sure.
I think GitHub uses main as the default now. But yeah, like you want main, main is special in some way. So you might have that be like the main version of your code. And then when you are implementing features, you might create what's called feature branches, make changes in those feature branches and then merge them back into main. There are other more sophisticated workflows out there.
Like maybe you're maintaining a website and you have the production version of your website, like the main live one that real people use. And you might have a staging version which is used for testing or something. So you might have a main branch for production, you might have a staging branch for staging, and then you might have a bunch of feature branches and bug fix branches for other stuff that's in development. And maybe you'll merge those things into staging first, test them out, and if staging looks like it's in good shape, you might merge that into main. So there's some pretty sophisticated branching and merging workflows, and I think we linked to some of them in the lecture notes.
Yeah, that's right. The name main is just a convention. We could have called it anish if we wanted to. Cool. If you want to stick around for like five more minutes, I could show you how to use Git remotes, and then we can wrap up there.
If you have to go, feel free to leave. But I think this might be a nice topic to end on. So we've so far talked about how you can use Git just yourself for software development, introduced you to some of the basic concepts like creating a version history and going back and forth. I think one thing we didn't show is Git checkout. So if you want to change the contents of your working directory to an older commit, you can do something like Git checkout old or Git switch old in this case, Git switch old.
And it'll switch to this branch and check out the contents here. So if I cat nato.txt, it's before I added all the new stuff I added later. I still have the later history, thanks to having this main here. I can even check out particular commits like I can do git checkout and then type in the commit hash or even a prefix of the commit hash as long as it's unambiguous. And now I have my very short contents of nato.txt I had way back here.
And if I look at the graph, I see that this all caps head, the special reference that git renders in blue pointing back here. And then I have all the newer history up top. And if I do git status, I'll see that git is in this detached head state. And what that means is that there's no active branch currently. And so if I'm making new commits, I no longer have this nice property where I have this named branch that advances along with me as I change my code.
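A small sketch of entering and leaving the detached HEAD state described above. The scratch setup and the variable name first are assumptions. The git symbolic-ref command asks which branch HEAD is attached to, and with -q it quietly fails when there is none.

```shell
# Scratch repository with two commits on main (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'bravo\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"
printf 'charlie\n' >> nato.txt && git commit -q -a -m "add more stuff"

first="$(git rev-parse HEAD~1)"
git checkout -q "$first"               # check out a commit rather than a branch
git status                             # first line reads "HEAD detached at <short hash>"
cat nato.txt                           # the older, shorter contents
git symbolic-ref -q HEAD || echo "HEAD is not attached to any branch"

git switch -q main                     # attach HEAD to a branch again
git symbolic-ref --short HEAD          # main
```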
So it's relatively uncommon to stay in this detached head state for long. Normally you're on some branch making, whoops, making changes to your code. Okay, so Git remotes. So far we've talked about how you just use Git by yourself. One of the really powerful things you can do with Git is work with other people.
Yeah, question. We will make... Yes, that's a good way to think about it. Detached head means that head points directly to a commit rather than to a branch. And if you're in detached head state, so I'll do git, I don't know if git switch works with this.
Yeah, git checkout this. So I'm in detached head state here. I can modify this nato.txt, create a commit, and it'll actually create the commit. But this commit is kind of this orphaned thing. Only head points to this commit right now, no branch.
And so if I switch back to main, for example, see it says warning, you're leaving this one commit behind, it's not connected to any branch. And so this is actually still in my object store at the moment, but Git will eventually garbage collect it because there's no branch that points to it or any of its successors. All right, so Git remotes, we talked about using Git locally, but you can also collaborate with other people using Git repositories. And one common way to do so is using GitHub. So now I'm showing me logged into my GitHub account.
And you can think of GitHub repositories as just, GitHub repositories aren't really special. Git has this, it's a distributed version control system where you have a copy of your Git repository and it can be connected to any number of other Git repositories called remotes. And then you can exchange information between them in both directions. So you can add stuff to your history locally and push it to somebody, some remote.
And somebody else can make changes to a remote and you can pull it into your own local repository. So I can create a new repository on GitHub. Can be blank. Maybe I should make this private. So I've created an empty repository on GitHub.
So this is kind of like somebody else. Like GitHub has just done a git init. They have no history. But I have my repository I'm working on where I have a bunch of history. And now I can connect these together.
So there's this way to tell Git that you want to add a new remote. You can copy paste this from GitHub, or you can learn how to use this git remote command. So there's this git remote sub command. There's an add sub command for that. You can name every remote since you can have multiple.
And then I think we don't have time to get into the details of exactly how the authentication with GitHub works, but you can read their documentation or our lecture notes for that if you're interested. But once we do that, we can start sending and receiving data to and from this remote. So there's this git push command that can take my local changes and send them to the remote. So I can do git push dash u, I'll explain dash u in a moment, origin main.
And what this does is it pushes to the remote origin the contents of my branch main. It'll create, if it doesn't exist, a branch main on the remote and set it up so it matches my local main. And the dash u makes this a tracking branch. So as I make commits locally, I can just do another git push without any additional arguments. And it'll know that my current branch that I'm on, main, corresponds to origin slash master on the other end.
What? Origin slash main here. Yes. Now that I've done this git push, if I go over to GitHub's web interface, I can see that I have my nato.txt here with all the latest contents. And then just to demonstrate a couple other things, suppose I'm somebody else. I'm just going to demonstrate on the same computer, of course.
I'll go to my desktop folder. Somebody else can take this git repository and clone it. So... The git clone command takes an existing remote and gives you a copy of it locally, and I can go into this and make some changes in here. So I'm just like very quickly adding some contents, making some dummy commit.
Now if I do a git status, I'll see on branch main your branch is ahead of origin slash main by one commit. Remember I'm like player two right now on my desktop. If I do a git push here, without any additional arguments, it'll send my local changes to the remote. And if I go back to this earlier window I had open, and I do a git status here, and I do cat nato.txt, nothing has changed because I haven't pulled any contents from the remote on this computer.
But if I do a git pull here, or actually maybe I can do a git fetch first, there are these closely related commands. git fetch receives data from the remote, but doesn't actually change any of your local branches; it only updates the remote-tracking references like origin slash main. And if you do git pull, it will actually update your current branch as well. Since this branch was tracking origin slash main and origin slash main has been updated, the git pull does what's called here in this case a fast forward. And now if I look at nato.txt, I see this ASDF I just added.
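The whole round trip above (add a remote, push, clone as a second person, push again, then fetch and pull) can be tried without a GitHub account. In this sketch a local bare repository stands in for the empty GitHub repository, which is an assumption made so the example is self-contained; the directory names one and two and the identities are also hypothetical.

```shell
work="$(mktemp -d)"

# Stand-in for the empty repository created on GitHub.
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/main   # make main its default branch

# Player one: an existing local repository with some history.
git init -q "$work/one" && cd "$work/one"
git symbolic-ref HEAD refs/heads/main
git config user.name "Player One"
git config user.email "one@example.com"
printf 'alpha\nbravo\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"
git remote add origin "$work/remote.git"     # name the remote origin
git push -q -u origin main                   # -u records origin/main as the upstream of main

# Player two: clone, commit, push.
git clone -q "$work/remote.git" "$work/two" && cd "$work/two"
git config user.name "Player Two"
git config user.email "two@example.com"
printf 'asdf\n' >> nato.txt
git commit -q -a -m "add asdf"
git status -sb                               # main is ahead of origin/main by 1
git push -q                                  # no arguments needed: clone set up the upstream

# Back to player one.
cd "$work/one"
git fetch -q                                 # updates origin/main, leaves local main alone
git status -sb                               # main is behind origin/main by 1
git pull -q --ff-only                        # fast-forwards main to match origin/main
cat nato.txt                                 # now ends with asdf
```

With a real GitHub repository, the path passed to git remote add and git clone would be the repository's URL instead.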
So I've shown like end to end. Okay. One person, I make some changes to code. I'm a different person, I can pull in those changes. And these are just the fundamentals.
But you can, in your head, figure out how these things map to changes to the underlying data model. Basically, everybody has different views, different slices of this history. And you're sharing changes or updates to this history. And that enables collaboration in a very clean way. And then maybe useful to show the commit history is also on — oh, yes.
Jon points out it might be helpful to show that the commit history is also on GitHub. Yeah, and this is a pretty fancy, pretty sophisticated product. One of the nice features it has is it lets you visualize commit history. So we can also go to a different repository like the missing semester repository. So yeah, side note, all the lecture notes and everything in this class website is in an open source Git repository.
So you can go and poke around in that if you're interested. And if we look at the commit history there, like we can see that Jon was just making some changes during lecture today. I was making some changes earlier today and so on. So really powerful web interface here that will certainly come in handy as you collaborate with other people. Yeah, question? Yeah, that's a great question.
So it's like when you push main to the remote origin, does it include just the commit that main is pointing to? Like would it include just this or the entire history? And the answer is the latter. So when you share stuff, it always includes all the history that leads up to that point. Git does have some special commands if you kind of want to like truncate this stuff and only like receive part of the history, but that's pretty advanced stuff. I think it's not that commonly needed. But yeah, in general you share the entire version history and a lot of git commands like make use of that history.
One other command that I'll show you before we do final questions and wrap up is the git diff command. So I think one really important concept to understand is that in Git's model of history, every single commit has corresponding to it a snapshot, which is kind of an entire picture of a file, a folder, and all the files in it and all the contents in there. That's the logical model. Git does not model deltas, like that is not the model. It's not like this is what changed in this commit.
It's just here's the new state. That is what is in a commit. But it's very handy to be able to look at what's changed. But what's nice is that that can just be computed. So you can ask git, like, what is the difference between the snapshot here and the snapshot here? And it can render it nicely for you.
So, for example, I can do a git log here. I can do a git log of the last two commits. And I can do a git diff against this commit. And it can show me this is the diff output. The line asdf was added here.
And the git diff command is very powerful. You can also do things like git diff head tilde. There are different ways to refer to the thing you want to do a diff against. You can do a diff against two named things. You can do a diff just for a particular file.
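A few of the git diff forms mentioned above, sketched in a scratch repository. The setup and the commit messages are assumptions; HEAD~1 is how "head tilde" with an explicit count of one is written on the command line.

```shell
# Scratch repository with two commits; the second one adds the line asdf (hypothetical setup).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'alpha\nbravo\n' > nato.txt
git add nato.txt && git commit -q -m "initial commit"
printf 'asdf\n' >> nato.txt && git commit -q -a -m "add asdf"

git log --oneline -2                 # the last two commits
git diff HEAD~1                      # working directory compared against the previous commit
git diff HEAD~1 HEAD                 # difference between two named snapshots
git diff HEAD~1 HEAD -- nato.txt     # the same comparison, limited to one file
```

In each case the added line appears in the output prefixed with a plus sign. Git computes this on demand from the two snapshots; it is not stored in the commit.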
I'm not going to get into all the details of this command, but a very handy command to look at what has changed between different commits in your git history. Yeah, that too. There's also git log -p that Jon points out. That's a handy command that shows your git log with inline diffs for every single commit. So I can scroll through this.
I see my latest commit that adds the ASDF. I'll scroll down. I see this commit that adds the alpha. And here I've added hotel and so on. And Git's interface, which is also called the porcelain, the kind of high-level commands that sit on top of the core data model, is pretty sophisticated.
It's like we're not going to be able to cover it even in two hours of lecture. Look at the lecture notes and references in there if you want to learn the advanced stuff. But I think it all builds on top of these fundamentals, and it's really easy to learn all that advanced stuff once you understand this well. Any final questions? [Student question] Yes. Yeah, so the question is if you're in detached head state, you make a new commit, it'll point out like, yeah, this is this orphan thing, there's no branch there, can you just use the commit hash that it prints out and create a new branch? Yes, you can do git branch and then name the branch and then give it the SHA-1 hash and it will create a new branch at that location.
Cool, no other questions? All right, let's end here and we will see you not on Monday but on Tuesday next week.