Gerrit Code Review - System Design

Objective

Gerrit is a web-based code review system, facilitating online code reviews for projects using the Git version control system.

Gerrit makes reviews easier by showing changes in a side-by-side display, and allowing inline/file comments to be added by any reviewer.

Gerrit simplifies Git based project maintainership by permitting any authorized user to submit changes to the master Git repository, rather than requiring all approved changes to be merged in by hand by the project maintainer. This functionality enables a more centralized usage of Git.

Background

Git is a distributed version control system, wherein each repository is assumed to be owned/maintained by a single user. There are no inherent security controls built into Git, so the ability to read from or write to a repository is controlled entirely by the host’s filesystem or network access controls.

The objective of Gerrit is to facilitate Git development by larger teams: it provides a means to enforce organizational policies around code submissions, eg. "all code must be reviewed by another developer", "all code shall pass tests". It achieves this by

providing fine-grained (per-branch, per-repository, inheriting) access controls, which allow a Gerrit admin to delegate permissions to different team(-lead)s.
facilitating code review: Gerrit offers a web view of pending code changes, which allows for easy reading and commenting by humans. The web view can offer data coming out of automated QA processes (eg. CI). The permission system also includes fine-grained control of who can approve pending changes for submission, to further facilitate delegation of code ownership.

Overview

Developers create one or more changes on their local desktop system, then upload them for review to Gerrit using the standard git push command line program, or any GUI which can invoke git push on behalf of the user. Authentication and data transfer are handled through SSH and HTTPS. Uploads are protected by the authentication, confidentiality and integrity offered by the transport (SSH, HTTPS).

Each Git commit created on the client desktop system is converted into a unique change record which can be reviewed independently.

A summary of each newly uploaded change is automatically emailed to reviewers, so they receive a direct hyperlink to review the change on the web. Reviewer email addresses can be specified on the git push command line, but typically reviewers are added in the web interface.
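As a rough illustration of the upload step, the sketch below builds the refspec a developer could pass to git push. It relies on Gerrit's refs/for/<branch> upload namespace and the r=<email> push option, which this section does not spell out; the function name and the example address are hypothetical.

```python
def review_refspec(branch, reviewers=()):
    """Build a refspec that uploads HEAD for review on `branch`."""
    refspec = f"HEAD:refs/for/{branch}"
    if reviewers:
        # Reviewers are appended after '%' as comma-separated r=<email> options.
        refspec += "%" + ",".join(f"r={email}" for email in reviewers)
    return refspec


# The result would be used as: git push origin <refspec>
print(review_refspec("master", ["alice@example.com"]))
```

Adding reviewers in the web interface, as described above, needs only the plain refs/for/<branch> form.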

Reviewers use the web interface to read the side-by-side or unified diff of a change, and insert draft inline/file comments where appropriate. A draft comment is visible only to the reviewer, until they publish those comments. Published comments are automatically emailed to the change author by Gerrit, and are CC’d to all other reviewers who have already commented on the change.

Reviewers can score the change ("vote"), indicating whether they feel the change is ready for inclusion in the project, needs more work, or should be rejected outright. These scores provide direct feedback to Gerrit’s change submit function.

After a change has been scored positively by reviewers, Gerrit enables a submit button on the web interface. Authorized users can push the submit button to have the change enter the project repository. The user pressing the submit button does not need to be the author of the change.
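A minimal sketch of how scores might gate the submit button, assuming a single review label with a -2..+2 range where one maximum vote is required and any minimum vote blocks. This is one possible policy, not a description of Gerrit's implementation; the function and parameter names are hypothetical.

```python
def is_submittable(votes, min_score=-2, max_score=2):
    """Return True if a change has an approval and no blocking vote.

    `votes` maps a reviewer name to an integer score. The score range and
    the "one maximum vote, no minimum vote" rule are assumptions.
    """
    scores = list(votes.values())
    has_approval = any(score == max_score for score in scores)
    has_veto = any(score == min_score for score in scores)
    return has_approval and not has_veto


print(is_submittable({"alice": 2, "bob": 1}))
print(is_submittable({"alice": 2, "bob": -2}))
```

Whether the user pressing the button is authorized to submit is a separate permission check and is not modeled here.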

Infrastructure

End-user web browsers make HTTP requests directly to Gerrit’s HTTP server. As nearly all of the Gerrit user interface is implemented in a JavaScript based web app, the majority of these requests are transmitting compressed JSON payloads, with all HTML being generated within the browser.

Gerrit’s HTTP server side component is implemented as a standard Java servlet, and thus runs within any J2EE servlet container. The standard install will run inside Jetty, which is included in the binary.

End-user uploads are performed over SSH or HTTP, so Gerrit’s servlets also start up a background thread to receive SSH connections through an independent SSH port. SSH clients communicate directly with this port, bypassing the HTTP server used by browsers.

User authentication is handled by identity realms. Gerrit supports the following types of authentication:

OpenID (see OpenID Specifications)
OAuth2
LDAP
Google accounts (on googlesource.com)
SAML
Kerberos
3rd party SSO

NoteDb

Server side data storage for Gerrit is broken down into two different categories:

Git repository data
Gerrit metadata

The Git repository data is the Git object database used to store already submitted revisions, as well as all uploaded (proposed) changes. Gerrit uses the standard Git repository format, and therefore requires direct filesystem access to the repositories. All repository data is stored in the filesystem and accessed through the JGit library. Repository data can be stored on remote servers accessible through NFS or SMB, but the remote directory must be mounted on the Gerrit server as part of the local filesystem namespace. Remote filesystems are likely to perform worse than local ones, due to Git disk IO behavior not being optimized for remote access.

The Gerrit metadata contains a summary of the available changes, all comments (published and drafts), and individual user account information.

Gerrit metadata is also stored in Git, with the commits marking the historical state of metadata. Data is stored in the trees associated with the commits, typically using Git config file or JSON as the base format. For metadata, there are 3 types of data: changes, accounts and groups.

Accounts are stored in a special Git repository All-Users.

Accounts can be grouped in groups. Gerrit has a built-in group system, but can also interface to external group systems (eg. Google groups, LDAP). The built-in groups are stored in All-Users.

Draft comments are stored in All-Users too.

Permissions are stored in Git, in a branch refs/meta/config for the repository. Repository configuration (including permissions) supports single inheritance, with the All-Projects repository containing site-wide defaults.
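The single-inheritance lookup can be sketched as a walk from a repository up to All-Projects, with child settings overriding inherited ones. The dictionary layouts, project names and setting keys below are hypothetical, and real permission evaluation has more rules than a plain override.

```python
def effective_config(project, parents, configs):
    """Merge settings along a single-inheritance chain.

    `parents` maps a project to its parent (the root has no entry);
    `configs` maps a project to its own settings.
    """
    chain = []
    while project is not None:
        chain.append(project)
        project = parents.get(project)
    merged = {}
    for name in reversed(chain):  # apply the root first, the child last
        merged.update(configs.get(name, {}))
    return merged


parents = {"tools/foo": "tools", "tools": "All-Projects"}
configs = {
    "All-Projects": {"submit-type": "merge", "require-change-id": True},
    "tools": {"submit-type": "rebase"},
}
print(effective_config("tools/foo", parents, configs))
```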

Code review metadata (change status, votes, comments) is stored in NoteDb, in the same Git repository as the submitted code and the code under review. Hence, the review history can be exported with git clone --mirror by anyone with sufficient permissions.
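To make "alongside the code" concrete, the sketch below computes the ref names under which a change's patch sets and its review metadata live, using Gerrit's refs/changes/ namespace sharded by the last two digits of the change number. The ref layout is not described in this section, and the helper name is hypothetical.

```python
def change_ref(change_number, patch_set=None):
    """Return the ref for a patch set, or the metadata ref if none is given."""
    shard = f"{change_number % 100:02d}"  # last two digits, zero-padded
    suffix = "meta" if patch_set is None else str(patch_set)
    return f"refs/changes/{shard}/{change_number}/{suffix}"


print(change_ref(12345, 2))
print(change_ref(12345))
```

A mirror clone copies these refs together with the branches, which is why the review history travels with the repository.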

Permissions

Permissions are specified on branch names, and given to groups. For example,

[access "refs/heads/stable/*"]
        push = group Release-Engineers

This provides a rule granting Release-Engineers push permission for stable branches.
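A small sketch of how such a rule could be evaluated against a concrete branch. It handles only exact ref names and a trailing "/*" wildcard; the rule dictionary is a hypothetical in-memory form of the access section shown above, not Gerrit's own data structure.

```python
def groups_allowed(rules, permission, ref):
    """Return the groups granted `permission` on `ref`.

    `rules` maps a ref pattern to {permission: [group, ...]}.
    """
    allowed = set()
    for pattern, grants in rules.items():
        if pattern.endswith("/*"):
            # "refs/heads/stable/*" matches everything below that prefix.
            matches = ref.startswith(pattern[:-1])
        else:
            matches = ref == pattern
        if matches:
            allowed.update(grants.get(permission, []))
    return allowed


rules = {"refs/heads/stable/*": {"push": ["Release-Engineers"]}}
print(groups_allowed(rules, "push", "refs/heads/stable/3.9"))
print(groups_allowed(rules, "push", "refs/heads/master"))
```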

There are fundamentally two types of permissions:

Write permissions (who can vote, push, submit etc.)
Read permissions (who can see data)

Read permissions need special treatment across Gerrit, because Gerrit should only surface data (including repository existence) if a user has read permission. This means that

The git wire protocol support must omit references from advertisement if the user lacks read permissions
Uploads through the git wire protocol must refuse commits that are based on SHA-1s for data that the user can’t see.
Tags are only visible if their commits are visible to the user through a non-tag reference.
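The first and third rules above can be sketched as a filter over the ref advertisement. This is a simplified model: tags are treated as pointing directly at commits, the commit graph is a plain dictionary, and all names are hypothetical.

```python
def advertised_refs(refs, can_read, parents):
    """Filter a ref advertisement down to what a user may see.

    `refs` maps a ref name to a commit id, `can_read(ref)` is the read
    permission check, and `parents` maps a commit id to its parent ids.
    """
    visible = {
        ref: commit for ref, commit in refs.items()
        if not ref.startswith("refs/tags/") and can_read(ref)
    }
    # Collect every commit reachable from a visible non-tag ref.
    reachable, stack = set(), list(visible.values())
    while stack:
        commit = stack.pop()
        if commit not in reachable:
            reachable.add(commit)
            stack.extend(parents.get(commit, []))
    # A tag is advertised only if it points at a reachable commit.
    for ref, commit in refs.items():
        if ref.startswith("refs/tags/") and commit in reachable:
            visible[ref] = commit
    return visible


refs = {
    "refs/heads/master": "c2",
    "refs/heads/secret": "c3",
    "refs/tags/v1": "c1",
    "refs/tags/secret-tag": "c3",
}
parents = {"c2": ["c1"], "c3": ["c1"]}
print(advertised_refs(refs, lambda ref: ref != "refs/heads/secret", parents))
```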

Metadata (eg. OAuth credentials) is also stored in Git. Existing endpoints must refuse creating branches or changes that expose these metadata or allow changes to them.

Indexing

Almost all data is stored as Git, but Git only supports fast lookup by SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing system (powered by Lucene by default) for other types of queries. There are 4 indices:

Project index - find repositories by name, parent project, etc.
Account index - find accounts by name, email, etc.
Group index - find groups by name, owner, description etc.
Change index - find changes by file, status, modification date etc.

The base entities are characterized by SHA-1s. Storing the characterizing SHA-1s allows detection of stale index entries.
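Stale-entry detection can be sketched as a comparison between the SHA-1 recorded at indexing time and the SHA-1 currently in Git. The dictionary layout and the shortened example hashes are hypothetical.

```python
def stale_entries(index, current_state):
    """Return ids whose indexed SHA-1 no longer matches the Git state.

    `index` maps an entity id to the SHA-1 recorded when it was indexed;
    `current_state` maps the same ids to the SHA-1 now stored in Git.
    """
    return sorted(
        entity for entity, sha1 in index.items()
        if current_state.get(entity) != sha1
    )


index = {"change-1": "a1b2", "change-2": "c3d4"}
current_state = {"change-1": "a1b2", "change-2": "e5f6"}
print(stale_entries(index, current_state))
```

An entry reported here would need to be reindexed from Git, which remains the source of truth.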

Plug-in architecture

Gerrit has a plug-in architecture. Plugins can be installed by dropping them into $site_directory/plugins, or at runtime through plugin SSH commands, or the plugin REST API.

Backend plugins

At runtime, code can be loaded from a .jar file. This code can hook into predefined extension points. A common use of plugins is to have Gerrit interoperate with site-specific tools, such as CI-systems or issue trackers.

Some backend plugins expose the JVM for scripting use (eg. Groovy, Scala), so plugins can be written without having to setup a Java development environment.

Frontend plugins

The UI can be extended using Frontend plugins. This is useful for changing the look & feel of Gerrit, but it can also be used to surface data from systems that aren’t integrated with the Gerrit backend, eg. CI systems or code coverage providers.

Internationalization and Localization

As a source code review system for open source projects, where the commonly preferred language for communication is typically English, Gerrit does not make internationalization or localization a priority.

The majority of Gerrit’s users will be writing change descriptions and comments in English, and therefore an English user interface is usable by the target user base.

Accessibility Considerations

Whenever possible Gerrit displays raw text rather than image icons, so screen readers should still be able to provide useful information to blind persons accessing Gerrit sites.

Standard HTML hyperlinks are used rather than HTML div or span tags with click listeners. This provides two benefits to the end-user. The first benefit is that screen readers are optimized for locating standard hyperlink anchors and presenting them to the end-user as a navigation action. The second benefit is that users can use the 'open in new tab/window' feature of their browser whenever they choose.

When possible, Gerrit uses the ARIA properties on DOM widgets to provide hints to screen readers.

Browser Compatibility

Gerrit requires a JavaScript enabled browser.

As Gerrit is a pure JavaScript application on the client side, with no server side rendering fallbacks, the browser must support modern JavaScript semantics in order to access the Gerrit web application. Dumb clients such as lynx, wget, curl, or even many search engine spiders are not able to access Gerrit content.

All of the content stored within Gerrit is also available through other means, such as gitweb or the git:// protocol. Any existing search engine crawlers can index the server-side HTML served by a code browser, and thus can index the majority of the changes which might appear in Gerrit. Therefore the lack of support for most search engine crawlers is a non-issue for most Gerrit deployments.

Product Integration

Gerrit optionally surfaces links to HTML pages in a code browser. The links are configurable, and Gerrit comes with a built-in code browser, called Gitiles.

Gerrit integrates with some types of corporate single-sign-on (SSO) solutions, typically by having the SSO authentication be performed in a reverse proxy web server and then blindly trusting that all incoming connections have been authenticated by that reverse proxy. When configured to use this form of authentication, Gerrit does not integrate with OpenID providers.

When installing Gerrit, administrators may optionally include an HTML header or footer snippet which may include user tracking code, such as that used by Google Analytics. This is a per-instance configuration that must be done by hand, and is not supported out of the box. Other site trackers instead of Google Analytics can be used, as the administrator can supply any HTML/JavaScript they choose.

Gerrit does not integrate with any Google service, or with any service other than those listed above.

Plugins (see above) can be used to drive product integrations from the Gerrit side. Products that support Gerrit explicitly can use the REST API or the SSH API to contact Gerrit.
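A product calling the REST API has to account for the `)]}'` prefix line that Gerrit prepends to JSON responses to guard against cross-site script inclusion. The sketch below decodes such a response body; the helper name and the sample payload are hypothetical, and no network call is made.

```python
import json

XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body):
    """Strip the anti-XSSI prefix line, if present, and decode the JSON."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


sample = ")]}'\n[{\"_number\": 42, \"status\": \"NEW\"}]"
print(parse_gerrit_json(sample))
```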

Privacy Considerations

Gerrit stores the following information per user account:

Full Name
Preferred Email Address

The full name and preferred email address fields are shown to any site visitor viewing a page containing a change uploaded by the account owner, or containing a published comment written by the account owner.

Showing the full name and preferred email is approximately the same risk as the From header of an email posted to a public mailing list that maintains archives, and Gerrit treats these fields in much the same way that a mailing list archive might handle them. Users who don’t want to expose this information should either not participate in a Gerrit based online community, or open a new email address dedicated for this use.

As the Gerrit UI data is only available through XSRF protected JSON-RPC calls, "screen-scraping" for email addresses is difficult, but not impossible. It is unlikely a spammer will go through the effort required to code a custom scraping application necessary to cull email addresses from published Gerrit comments. In most cases these same addresses would be more easily obtained from the project’s mailing list archives.

The user’s name and email address are stored unencrypted in the All-Users repository.

Spam and Abuse Considerations

There is no spam protection for the Git protocol upload path. Uploading a change successfully requires a pre-existing account, and a lot of up-front effort.

Gerrit makes no attempt to detect spam changes or comments in the web UI. To post and publish a comment a client must sign in and then use the XSRF protected JSON-RPC interface to publish the draft on an existing change record.

The absence of spam handling is based upon the idea that Gerrit caters to a niche audience, and will therefore be unattractive to spammers. In addition, spam is not a factor for corporate, on-premise deployments.

Scalability

Gerrit supports the Git wire protocol, and an API (one API for HTTP, and one for SSH).

The git wire protocol does a client/server negotiation to avoid sending too much data. This negotiation occupies a CPU, so the number of concurrent push/fetch operations should be capped by the number of CPUs.

Clients on slow network connections may be network bound rather than server side CPU bound, in which case a core may be effectively shared with another user. Possible core sharing due to network bottlenecks generally holds true for network connections running below 10 MiB/sec.
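The cap on concurrent operations can be sketched with a semaphore sized to the CPU count. This illustrates the sizing rule only, not Gerrit's own thread pool configuration; the class name is hypothetical.

```python
import os
import threading


class GitOperationLimiter:
    """Allow at most `limit` Git push/fetch operations at once."""

    def __init__(self, limit=None):
        # Default to one slot per CPU, as suggested by the sizing rule above.
        self.limit = limit or os.cpu_count() or 1
        self._slots = threading.BoundedSemaphore(self.limit)

    def run(self, operation):
        """Block until a slot is free, then run `operation`."""
        with self._slots:
            return operation()


limiter = GitOperationLimiter()
print(limiter.run(lambda: "fetch served"))
```

Where many clients are network bound, a limit somewhat above the CPU count may still keep cores busy, since a core is then effectively shared.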

Deployments for large, distributed companies can replicate Git data to read-only replicas to offload fetch traffic. The read-only replicas should also serve this data using Gerrit to ensure that permissions are obeyed.

The API serves requests of varying costs. Requests that originate in the UI can block productivity, so care has been taken to optimize these for latency, using the following techniques:

Async calls: the UI becomes responsive before some UI elements have finished loading
Caching: metadata is stored in Git, which is relatively expensive to access. This is sped up by multiple caches. Metadata entities are stored in Git, and can therefore be seen as immutable values keyed by SHA-1, which is very amenable to caching. All SHA-1 keyed caches can be persisted on local disk.
The size (memory, disk) of these caches should be adapted to the instance size (number of users, size and quantity of repositories) for optimal performance.
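Because a value keyed by SHA-1 never changes, a cache for it needs no invalidation, only eviction for size. A minimal in-memory sketch, with a hypothetical loader standing in for the expensive Git read:

```python
class Sha1Cache:
    """Cache immutable values keyed by SHA-1; entries never go stale."""

    def __init__(self, loader):
        self._loader = loader  # called on a miss, e.g. to read from Git
        self._values = {}
        self.misses = 0

    def get(self, sha1):
        if sha1 not in self._values:
            self.misses += 1
            self._values[sha1] = self._loader(sha1)
        return self._values[sha1]


cache = Sha1Cache(lambda sha1: f"parsed metadata for {sha1}")
cache.get("a1b2c3")
cache.get("a1b2c3")
print(cache.misses)
```

Size limits and on-disk persistence are left out of this sketch.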

Git does not impose fundamental limits (eg. number of files per change) on data. To ensure stability, Gerrit configures a number of default limits for these.

Scaling team size

A team of size N has N^2 possible interactions. As a result, features that expose interactions with activities of other team members have a quadratic cost in aggregate. The following features scale poorly with large team sizes:

the change screen shows conflicting changes by default. This data is cached, but updates to pending changes cause cache misses. For a single change, the amount of work is proportional to the number of pending changes, so in aggregate, the cost of this feature is quadratic in the team size.
the change screen shows if a change is mergeable to the target branch. If the target branch moves quickly (large developer team), this causes cache misses. In aggregate, the cost of this feature is also quadratic.

Both features should be turned off for repositories that involve 1000s of developers.
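The quadratic growth can be made concrete with a little arithmetic. The sketch assumes each pending change is checked against every other pending change, and that the number of pending changes grows in step with team size.

```python
def conflict_checks(pending_changes):
    """Comparisons needed if each pending change is checked against all others."""
    return pending_changes * (pending_changes - 1)


for n in (10, 100, 1000):
    print(n, conflict_checks(n))
```

Growing the number of pending changes by a factor of 10 multiplies the work by roughly 100.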

Browser performance

Real life numbers

Gerrit is designed for very large projects, both open source and proprietary commercial projects. For a single Gerrit process, the following limits are known to work:

Table 1. Observed maximums
Parameter	Maximum	Deployment
Projects	50,000	gerrithub.io
Contributors	150,000	eclipse.org
Bytes/repo	100G	Qualcomm internal
Changes/repo	300k	Qualcomm internal
Revisions/Change	300	Qualcomm internal
Reviewers/Change	87	Qualcomm internal

Google runs a horizontally scaled deployment. We have seen the following per-JVM maximums:

Table 2. Observed maximums (googlesource.com)
Parameter	Maximum	Deployment
Files/repo	500,000	chromium-review
Bytes/repo	12G	chromium-review
Changes/repo	500k	chromium-review
Revisions/Change	1900	chromium-review
Files/Change	10,000	android-review
Comments/Change	1,200	chromium-review

Redundancy & Reliability

Gerrit is structured as a single JVM process, reading and writing to a single file system. If there are hardware failures in the machine running the JVM, or the storage holding the repositories, there is no recourse; on failure, errors will be returned to the client.

Deployments needing more stringent uptime guarantees can use a replication/multi-master setup, which ensures availability and geographical distribution, at the cost of slower write actions.

Backups

Using the standard replication plugin, Gerrit can be configured to replicate changes made to the local Git repositories over any standard Git transport. After the plugin is installed, remote destinations can be configured in '$site_path'/etc/replication.conf to send copies of all changes over SSH to other servers, or to the Amazon S3 blob storage service.

Logging Plan

Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages from the Java logger, under $site_dir/logs/.

Published comments contain a publication date, so users can judge when the comment was posted and decide if it was "recent" or not. Only the timestamp is stored in the database; the IP address of the comment author is not stored.

Changes uploaded over the SSH daemon from git push have the standard Git reflog updated with the date and time that the upload occurred, and the Gerrit account identity of who did the upload. Changes submitted and merged into a branch also update the Git reflog. These logs are available only to the Gerrit site administrator, and they are not replicated through the automatic replication noted earlier. These logs are primarily recorded for an "oh s**t" moment where the administrator has to rewind data. In most installations they are a waste of disk space. Future versions of JGit may allow disabling these logs, and Gerrit may take advantage of that feature to stop writing these logs.

A web server positioned in front of Gerrit (such as a reverse proxy) or the hosting servlet container may record access logs, and these logs may be mined for usage information. This is outside of the scope of Gerrit.

Part of Gerrit Code Review