How Git Works Deep Inside

If you are a programmer, you probably use Git. But have you ever wondered how Git works deep inside? I have. Fortunately, you can find many documents on the web about Git internals. Reading them, I realized that Git is a relatively simple but ingenious system, and that it uses the same kind of hash-based storage that is used in distributed filesystems like IPFS or Ethereum Swarm. In this article, I will show you how Git works deep inside.

When you clone your favorite repo from GitHub or any other Git host, you get the files and a .git folder. This single .git folder contains everything. It’s not a problem if you delete the other files; you can restore them with a ‘checkout’ command. This is possible because the whole file tree is described in the .git folder.

Let’s look inside the .git folder. It contains some files and folders. One of the most important is the objects folder. Git is something like a special filesystem that stores files with the same content only once. When you add a file to the Git repo, Git prepends a short header (the object type and the content length) to the content, calculates the SHA1 hash of the result, and stores the compressed data in the objects folder under a name derived from that hash. If the file exists in different places in the tree, it is stored only once, because identical content always maps to the same hash and therefore to the same object file.

The SHA1 hash is 20 bytes (40 hex characters). The first byte (2 hex characters) names a subfolder inside the objects folder, and the other 19 bytes (38 hex characters) are the name of the file. For example, if the hash is 10116ede2f0bcf2ec0720843616e4a5250ae5268 then it will be mapped to objects/10/116ede2f0bcf2ec0720843616e4a5250ae5268.

If you have just cloned the repository and haven’t changed anything, you will probably not find any individual object files, only a pack folder with a .pack file in it. This is an optimization: Git fetches the objects from the server in one pack file. You can unpack this single file if you first move it outside of the .git directory (Git skips objects that the repository already contains) and run

git unpack-objects < ./{pack_file_name}.pack

This command unpacks the objects into the objects folder in the layout described above.
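
If you are curious what is in a pack file before unpacking it, its first 12 bytes are easy to read: the signature PACK, a version number, and the number of objects, the last two as 32-bit big-endian integers. The sketch below only reads this header; the function name read_pack_header and the sample bytes are made up for illustration, and the rest of the pack format is not covered.

```python
import struct


def read_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a .pack file."""
    if len(data) < 12:
        raise ValueError("a pack header is 12 bytes long")
    # Signature "PACK", then two 32-bit big-endian unsigned integers.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a Git pack file")
    return version, count


if __name__ == "__main__":
    # Hypothetical header bytes: a version-2 pack holding 3 objects.
    sample = b"PACK" + struct.pack(">II", 2, 3)
    print(read_pack_header(sample))
```

To try it on a real pack, pass the bytes returned by open(path, "rb").read(12) for your own .pack file.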

The object files are zlib-compressed, so if you open one of them you won’t be able to read it, but you can easily decompress it with the following command:

pigz -d < ./.git/objects/10/116ede2f0bcf2ec0720843616e4a5250ae5268
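
If pigz is not at hand, Python's zlib module can do the same job. The sketch below decompresses an object and splits off the header. The function name parse_loose_object is made up for this example, and the object is built in memory as a stand-in for a real file from the objects folder.

```python
import zlib


def parse_loose_object(raw: bytes) -> tuple[str, int, bytes]:
    """Decompress a loose object and split it into (type, size, body)."""
    data = zlib.decompress(raw)
    # The header and the body are separated by a single NUL byte.
    header, _, body = data.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, int(size), body


if __name__ == "__main__":
    # Stand-in for the bytes of a file under .git/objects/.
    content = b"hello world\n"
    raw = zlib.compress(b"blob %d\0" % len(content) + content)
    print(parse_loose_object(raw))
```

For a real object, read the file in binary mode and pass its bytes to parse_loose_object.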

The objects are organized into trees. A tree is something like a folder in a filesystem, and it is itself stored as another object in the objects folder. A tree looks like this:

100644 blob 5f71dbb20efc1dc9bd95e116ebc403659556b58a	.gitignore
100644 blob f288702d2fa16d3cdf0035b15a9fcbc552cd88e7	LICENSE
100644 blob 49e96aecc3c354402c153d759e900354cfcb7c80	README.md
040000 tree 7054d5d9fd2431c4ff4f27537d6a5388b3c73ca9	database
100644 blob 9b50d8c47e0ad56aab6aa570f344c6db5409a955	env.development
100644 blob a473235e1bf1461feef090b2a62b2066d75c7d97	env.template
100644 blob a0f18dc0b81d5122a8eeca6903868f1ea4721ebc	package.json
040000 tree ae9e90c2dcc818fab099dd22093ac5e5adb87bbb	public
040000 tree 0bef5a72fa773367998e501275c262bb0ec75544	scripts
040000 tree 878c06bb25e1752fa6271c6eef51edad0942c3ff	src
100644 blob 604c913eebc2578696d37b7346be681db2591816	tsconfig.json
040000 tree cdd80d4ee72ce05a172e9d6bc05b2d946767d079	views

This output is generated by the git cat-file command, which can read and pretty-print any object from the objects folder by its hash. The above listing was produced by:

git cat-file -p 54ca9b88af96f27e181b9a059ca4be1f60e720ba

The first column shows a Linux-like file mode, the second column shows the object type, the third column is the object hash, and the last column is the filename. A Git tree is very similar to a Linux folder that can contain files (blobs) and other folders (trees). If you like, you can inspect the content of any of these blobs or trees with the same command.
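
On disk, a tree is stored more compactly than this listing suggests: each entry is the mode, a space, the filename, a NUL byte, and then the 20 raw bytes of the hash. The object type is not stored at all; it follows from the mode. The sketch below parses that layout under these assumptions. The function name parse_tree is made up, and the sample body is assembled by hand from two entries of the listing above.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str, str]]:
    """Parse a decompressed tree object body (header already removed)."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        object_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex chars
        # Mode 40000 marks a subfolder, 160000 a submodule commit;
        # everything else (files, symlinks) is a blob.
        if mode == "40000":
            obj_type = "tree"
        elif mode == "160000":
            obj_type = "commit"
        else:
            obj_type = "blob"
        # Pad the mode to 6 digits, as in the listing above.
        entries.append((mode.zfill(6), obj_type, object_id, name))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    body = (
        b"100644 .gitignore\0"
        + bytes.fromhex("5f71dbb20efc1dc9bd95e116ebc403659556b58a")
        + b"40000 database\0"
        + bytes.fromhex("7054d5d9fd2431c4ff4f27537d6a5388b3c73ca9")
    )
    for entry in parse_tree(body):
        print(*entry)
```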

Git can be imagined as a virtual filesystem in which every commit on every branch is a folder holding a complete snapshot. When you do a checkout, you copy the contents of the chosen folder out of the .git directory. In a standard filesystem, this would need a huge amount of disk space, but thanks to Git’s hash-based, compressed storage, it takes up very little.

Creating a branch would need a full directory copy in a standard filesystem, but Git only creates one small file that contains the hash of the commit the branch starts from. If you change a file and commit, Git stores only the new version of the file (a blob), a new tree for each folder on the path to it, and a commit object that points to the new root tree. For a file in the root folder, that is 3 new objects instead of a full directory copy.
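
A toy model makes the bookkeeping visible. The sketch below is not how Git is implemented: store and refs are plain dictionaries standing in for the objects folder and the branch files, all function names are made up, the tree is flat (no subfolders), and the commit objects omit the author and committer lines, so their IDs will not match real Git commits.

```python
import hashlib
from typing import Optional

store: dict[str, bytes] = {}  # object ID -> bytes (stands in for .git/objects)
refs: dict[str, str] = {}     # branch name -> commit ID (stands in for branch files)


def put(obj_type: str, body: bytes) -> str:
    """Store an object under the SHA1 of its header plus body."""
    data = f"{obj_type} {len(body)}".encode() + b"\0" + body
    object_id = hashlib.sha1(data).hexdigest()
    store[object_id] = data  # identical content lands on the same key
    return object_id


def write_tree(files: dict[str, bytes]) -> str:
    """Store one blob per file and a flat tree that lists them."""
    body = b""
    for name in sorted(files):
        blob_id = put("blob", files[name])
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob_id)
    return put("tree", body)


def commit(tree_id: str, parent_id: Optional[str], message: str) -> str:
    """Store a simplified commit (no author or committer lines)."""
    lines = [f"tree {tree_id}"]
    if parent_id:
        lines.append(f"parent {parent_id}")
    lines += ["", message, ""]
    return put("commit", "\n".join(lines).encode())


def demo() -> list[int]:
    """Return the number of stored objects after each step."""
    store.clear()
    refs.clear()
    files = {"README.md": b"hello\n", "a.txt": b"same\n", "b.txt": b"same\n"}
    refs["main"] = commit(write_tree(files), None, "first")
    counts = [len(store)]           # 2 blobs (a.txt and b.txt share one), tree, commit
    refs["feature"] = refs["main"]  # a new branch is just a new pointer
    counts.append(len(store))       # no new objects
    files["README.md"] = b"hello again\n"
    refs["feature"] = commit(write_tree(files), refs["feature"], "second")
    counts.append(len(store))       # one new blob, one new tree, one new commit
    return counts


if __name__ == "__main__":
    print(demo())  # [4, 4, 7]
```

The three numbers show the points from the text: identical files are stored once, branching adds no objects, and changing one file adds exactly three.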

Every commit contains the hash of its parent commit (or commits, in the case of a merge), much like a blockchain, so the history is fully traceable. This is what turns this special filesystem into a version control system.
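
Walking the history is then just a matter of following these hashes backwards. The sketch below assumes the commit objects have already been decompressed into text, and it uses short made-up IDs (c1, c2, c3) instead of real 40-character hashes. The function names are also made up for this example.

```python
def parent_ids(commit_text: str) -> list[str]:
    """Return the parent IDs listed in a commit object's header lines."""
    parents = []
    for line in commit_text.split("\n"):
        if line == "":  # a blank line ends the header; the message follows
            break
        if line.startswith("parent "):
            parents.append(line[len("parent "):])
    return parents


def first_parent_history(commits: dict[str, str], head: str) -> list[str]:
    """Follow first-parent links from head back to the root commit."""
    history = []
    current = head
    while current is not None:
        history.append(current)
        parents = parent_ids(commits[current])
        current = parents[0] if parents else None
    return history


if __name__ == "__main__":
    # Hypothetical commits; real ones also carry author and committer lines.
    commits = {
        "c1": "tree t1\n\nfirst commit\n",
        "c2": "tree t2\nparent c1\n\nsecond commit\n",
        "c3": "tree t3\nparent c2\n\nthird commit\n",
    }
    print(first_parent_history(commits, "c3"))  # ['c3', 'c2', 'c1']
```

A merge commit has more than one parent line. This sketch follows only the first one, which is a simplification.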

When you pull or push, Git sends these objects to the other side in a packed format. Because objects are named by their hash, two different objects will in practice never collide. You could simply copy every object from every Git repository in the world into a single folder without any problem. This is presumably why forking a repository on GitHub takes only a few seconds: GitHub doesn’t need to copy the objects, only to create an entry in its database, similar to branching.

In a nutshell, this is how Git works. IPFS and Ethereum Swarm also use this hash-based representation. The difference is that these systems add a discovery protocol on top of it to locate a given hash in a distributed network of storage nodes.

Combining the discovery system of these decentralized filesystems with Git’s versioning abilities could be the basis of a fully decentralized GitHub alternative, but that is another story…

If you want to understand Git more deeply, you can find everything on the Git website or in the Git book.