Gerrit Code Review - Repository Maintenance

Description

Each project in Gerrit is stored in a bare Git repository. Gerrit uses the JGit library to access (read and write to) these Git repositories. As modifications are made to a project, Git repository maintenance will be needed or performance will eventually suffer. When using the Git command line tool to operate on a Git repository, it will run git gc every now and then on the repository to ensure that Git garbage collection is performed. However regular maintenance does not happen as a result of normal Gerrit operations, so this is something that Gerrit administrators need to plan for.

Gerrit has a built-in feature which allows it to run Git garbage collection on repositories. This can be configured to run on a regular basis, and/or this can be run manually with the gerrit gc ssh command, or with the run-gc REST API. Some administrators will opt to run git gc or jgit gc outside of Gerrit instead. There are many reasons this might be done, the main one likely being that when it is run in Gerrit it can be very resource intensive and scheduling an external job to run Git garbage collection allows administrators to finely tune the approach and resource usage of this maintenance.

Git Garbage Collection Impacts

Unlike a typical server database, access to Git repositories is not marshalled through a single process or a set of inter communicating processes. Unfortunately the design of the on-disk layout of a Git repository does not allow for 100% race free operations when accessed by multiple actors concurrently. These design shortcomings are more likely to impact the operations of busy repositories since racy conditions are more likely to occur when there are more concurrent operations. Since most Gerrit servers are expected to run without interruptions, Git garbage collection likely needs to be run during normal operational hours. When it runs, it adds to the concurrency of the overall accesses. Given that many of the operations in garbage collection involve deleting files and directories, it has a higher chance of impacting other ongoing operations than most other operations.

Interrupted Operations

When Git garbage collection deletes a file or directory that is currently in use by an ongoing operation, it can cause that operation to fail. These sorts of failures are often single shot failures, i.e. the operation will succeed if tried again. An example of such a failure is when a pack file is deleted while Gerrit is sending an object in the file over the network to a user performing a clone or fetch. Usually pack files are only deleted when the referenced objects in them have been repacked and thus copied to a new pack file. So performing the same operation again after the fetch will likely send the same object from the new pack instead of the deleted one, and the operation will succeed.

Data Loss

It is possible for data loss to occur when Git garbage collection runs. This is very rare, but it can happen. This can happen when an object is believed to be unreferenced when object repacking is running, and then garbage collection deletes it. This can happen because even though an object may indeed be unreferenced when object repacking begins and reachability of all objects is determined, it can become referenced by another concurrent operation after this unreferenced determination but before it gets deleted. When this happens, a new reference can be created which points to a now missing object, and this will result in a loss.

Reducing Git Garbage Collection Impacts

JGit has a preserved directory feature which is intended to reduce some of the impacts of Git garbage collection, and Gerrit can take advantage of the feature too. The preserved directory is a subdirectory of a repository’s objects/pack directory where JGit will move pack files that it would normally delete when jgit gc is invoked with the --preserve-oldpacks option. It will later delete these files the next time that jgit gc is run if it is invoked with the --prune-preserved option. Using these flags together on every jgit gc invocation means that packfiles will get an extended lifetime by one full garbage collection cycle. Since an atomic move is used to move these files, any open references to them will continue to work, even on NFS. On a busy repository, preserving pack files can make operations much more reliable, and interrupted operations should almost entirely disappear.

Moving files to the preserved directory also has the ability to reduce data loss. If JGit cannot find an object it needs in its current object DB, it will look into the preserved directory as a last resort. If it finds the object in a pack file there, it will restore the slated-to-be-deleted pack file back to the original objects/pack directory effectively "undeleting" it and making all the objects in it available again. When this happens, data loss is prevented.

One advantage of restoring preserved packfiles in this way when an object is referenced in them, is that it makes loosening unreferenced objects during Git garbage collection, which is a potentially expensive, wasteful, and performance impacting operation, no longer desirable. It is recommended that if you use Git for garbage collection, that you use the -a option to git repack instead of the -A option to no longer perform this loosening.

When Git is used for garbage collection instead of JGit, it is fairly easy to wrap git gc or git repack with a small script which has a --prune-preserved option which behaves as mentioned above by deleting any pack files currently in the preserved directory, and also has a --preserve-oldpacks option which then hardlinks all the currently existing pack files from the objects/pack directory into the preserved directory right before calling the real Git command. This approach will then behave similarly to jgit gc with respect to preserving pack files. An implementation is available here.

Part of Gerrit Code Review