By Jonathan Block, Software Engineer
When an engineer at DoorDash opens a GitHub pull request, our goal is to quickly and automatically provide information about code health. GitHub’s status API complements GitHub webhooks, which allow you to trigger custom routines as events fire in your GitHub account.
When developers push to our largest repo, they see something like this at the bottom of their pull request page:
We initially used a third-party CI hosting company to implement our checks. This worked well when the number of tasks we wanted to trigger was relatively low. However, as the number of checks grew, developers were waiting longer and longer for their CI results. In early 2017, we were waiting more than 20 minutes for an average pull request to complete all checks, despite our use of parallelization features.
Instead of using third parties, we used Jenkins on AWS to build a CI/CD system integrated with GitHub. Our custom solution produces test results within 5 minutes, and we’ve also gained the ability to deploy our code continuously. We will integrate both features into all new DoorDash microservices.
Jenkins is an open source CI server that’s been around almost as long as WordPress. It’s used by big companies like Netflix and small two-person startups alike.
Jenkins has a handful of core concepts:
To make our GitHub integration work, we created Python scripts that receive 100% of the webhooks from our GitHub account. (There’s an option in the GitHub account settings for where to send webhooks, no matter which specific repository generated the event.) Our Python scripts examine each webhook and conditionally start jobs in Jenkins via the Jenkins API. We refer to this component as our DoorDash GitHub “callback gateway.”
Only certain GitHub events (such as “push”) on a specific list of GitHub repositories (such as our main/monolith repo) actually trigger jobs in Jenkins. For example, when a commit is pushed to our main monolith repository, we immediately begin running tests in Jenkins.
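The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not our production gateway: the Jenkins hostname, the repository and job names, and the BRANCH and SHA parameter names are all made up for the example. The only real interfaces assumed are the fields of GitHub’s push payload (ref, after, repository.full_name) and Jenkins’ buildWithParameters endpoint.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical configuration: these names are illustrative only.
JENKINS_URL = "https://jenkins.example.com"
TRIGGERS = {
    # (GitHub event name, repository full name) -> Jenkins job to start
    ("push", "example-org/monolith"): "monolith-feature-pipeline",
}


def job_for_webhook(event, payload):
    """Return the Jenkins job to start for this webhook, or None to ignore it."""
    repo = payload.get("repository", {}).get("full_name")
    return TRIGGERS.get((event, repo))


def build_trigger_request(job, payload):
    """Build (but do not send) a POST to Jenkins' buildWithParameters endpoint."""
    ref = payload.get("ref", "")
    sha = payload.get("after", "")
    # BRANCH and SHA are assumed parameter names defined on the Jenkins job.
    url = "{}/job/{}/buildWithParameters?BRANCH={}&SHA={}".format(
        JENKINS_URL, quote(job, safe=""), quote(ref, safe=""), quote(sha, safe=""))
    return Request(url, data=b"", method="POST")


def handle_webhook(event, body):
    """Examine one webhook delivery and return a Jenkins request, or None."""
    payload = json.loads(body)
    job = job_for_webhook(event, payload)
    if job is None:
        return None  # event or repository is not on the trigger list
    return build_trigger_request(job, payload)
```

The event name would come from the X-GitHub-Event header of the delivery. A real gateway would also verify the webhook signature and attach Jenkins credentials before sending the request, both of which are omitted here.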
I should note that by default, Jenkins can poll GitHub repositories and start work when commits are detected on certain branches. Our callback gateway approach lets us trigger custom logic against each event more precisely, rather than polling every 60 seconds. More details on our custom logic are in the “Callback Gateway Custom Logic” section below.
Rather than starting a handful of Jenkins jobs individually, the callback gateway starts a single Jenkins “Pipeline.”
The Jenkins Pipeline for our feature branches has two steps:
At the conclusion of a job, we send another curl request to GitHub to update the status check with the results of the job, with a message like, “Linting completed after 30 seconds,” and a GitHub status flag that makes the label either green or red.
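For readers who prefer Python to curl, here is a rough sketch of that status update. It builds a request against GitHub’s commit status endpoint (POST /repos/{owner}/{repo}/statuses/{sha}) but does not send it. The function name, its arguments and the wording of the description are assumptions for illustration; our actual jobs use curl.

```python
import json
from urllib.request import Request

GITHUB_API = "https://api.github.com"


def build_status_request(repo, sha, passed, check_name, seconds, token):
    """Build (but do not send) a commit status update for one finished job."""
    payload = {
        # "success" renders the check green, "failure" renders it red.
        "state": "success" if passed else "failure",
        # The context is the label GitHub shows for this check.
        "context": check_name,
        "description": "{} completed after {} seconds".format(check_name, seconds),
    }
    return Request(
        "{}/repos/{}/statuses/{}".format(GITHUB_API, repo, sha),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "token " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the update. GitHub also accepts "pending" and "error" as states, which is useful for marking a check as started.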
When someone pushes code to a feature branch, we trigger a pipeline oriented to testing the code on the branch. However, when someone pushes to the master line, we are interested in starting a pipeline oriented to ultimately deploying the code to production.
DoorDash runs two Jenkins pools: a “general” pool and a “deployment” pool. The general pool runs our tests, docker builds, linters, etc. The deployment pool is reserved for deploying code. The theory is that if we need to push an emergency hotfix, it should not be delayed by queueing in the general pool.
When we commit to the “master” line of our GitHub repo, the Deployment Jenkins server notices that the master line received the commit. It will automatically execute instructions found in the Jenkinsfile, located in the master line of the project root. This file uses Jenkins’ pipeline syntax to perform a sequence of events roughly covering these steps:
Jenkinsfile gives you the flexibility to implement a sequence of events that you think is a good idea. For example, you can see that we are gating the deploy sequence at certain points and requiring manual approval before we continue to subsequent steps. You could also easily implement a pipeline which only continues if certain things are true. For example, instead of requiring programmer approval, you might automatically deploy to canary and then automatically check that there is no increase in error levels, and then automatically proceed to deploy to production, etc.
Jenkins has options to depict your pipeline sequences, allowing you to more easily understand what’s going on. The following is an example pipeline I’m currently working on. It is rendered with the Jenkins “Blue Ocean” plugin:
Below is another example, showing what a different, simpler deploy pipeline looks like at DoorDash once it has completed:
Since our callback gateway is listening to all GitHub events, we can build custom features on top of our GitHub account. For example, sometimes we see a unit test flap and we want to have the tests run again. We have an ability to “fire a blank commit” at the pull request. To do it, you comment with the :gun: emoji like this:
it will appear as a normal comment as you would expect in the pull request…
however, after a few seconds, you’ll see the blank commit appear in the branch lineage…
as a result of the new commit, all of the test jobs implicitly restart:
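The detection side of this feature could look roughly like the sketch below. It assumes the gateway receives GitHub’s issue_comment event (pull request comments arrive under that event, with a pull_request key on the issue) and that the comment body contains either the :gun: shortcode or the corresponding Unicode character. The function names and the commit message are invented for the example.

```python
# Shortcode as typed in the comment box, plus the Unicode emoji (U+1F52B).
TRIGGERS = (":gun:", "\U0001F52B")


def wants_blank_commit(event, payload):
    """True when a newly created pull request comment contains the trigger."""
    if event != "issue_comment" or payload.get("action") != "created":
        return False
    issue = payload.get("issue", {})
    if "pull_request" not in issue:
        return False  # a comment on a plain issue has no branch to commit to
    body = payload.get("comment", {}).get("body", "")
    return any(trigger in body for trigger in TRIGGERS)


def blank_commit_command(message="Re-run CI"):
    """git arguments that create a commit containing no file changes."""
    return ["git", "commit", "--allow-empty", "-m", message]
```

The gateway would run the returned command in a checkout of the pull request’s branch and push the result. The push generates an ordinary push webhook, which is why the test jobs restart without any special handling.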
The easiest way to get started with Jenkins is to run a Jenkins master using Docker. Just type:
docker run -p 8080:8080 jenkins
In just one command, you have a locally running Jenkins master on your computer.
Jenkins doesn’t use a database like MySQL in order to function. Instead, it stores everything in files under /var/jenkins_home. Therefore, you should set a Docker bind mount on the jenkins_home directory. For example:
docker run -v /your/home:/var/jenkins_home -p 8080:8080 jenkins
Additionally, if you host Jenkins in AWS, I recommend that you mount an EBS volume at that host location and set up recurring snapshots of the volume.
The Jenkins master server only exists to run the Jenkins core and its website interface. You run as many slaves as you want, though in my experience, you usually do not want to exceed 200 slaves per master server.
Jenkins has an “executor” concept, which describes the maximum number of jobs a node will run at once. Though you can technically set the number of executors on your master to any number, you should probably set your master to have zero executors and only give executors to your slaves.
Since DoorDash is on AWS, our strategy is to use EC2 reserved instances to run a low baseline number of Jenkins servers that are always running. In the morning, we use EC2 Spot Instances to scale up. If we are outbid, we scale up on-demand instances.
The Jenkins master must have each slave registered in order to be able to dispatch work. When a slave server launches, the slave’s bootstrap script (Amazon’s EC2 “user_data” property) registers a minutely cron job, which upserts the instance’s internal hostname and the current unix timestamp into a t2.micro MySQL RDS database. The master server polls this table each minute for the list of servers that have upserted within the last 2 minutes. Instances failing to upsert are unregistered from the Jenkins master and new ones are idempotently added.
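The heartbeat scheme above can be sketched as follows. To keep the example runnable anywhere, it uses Python’s built-in sqlite3 module as a stand-in for the MySQL RDS database, and the table and function names are invented. On MySQL, the upsert would typically be written with INSERT ... ON DUPLICATE KEY UPDATE instead.

```python
import sqlite3

STALE_AFTER_SECONDS = 120  # hosts must have upserted within the last 2 minutes


def create_table(db):
    db.execute(
        "CREATE TABLE IF NOT EXISTS heartbeats ("
        "hostname TEXT PRIMARY KEY, last_seen INTEGER NOT NULL)")


def heartbeat(db, hostname, now):
    """Run by each slave's cron job: upsert its hostname and unix timestamp."""
    db.execute(
        "INSERT OR REPLACE INTO heartbeats (hostname, last_seen) VALUES (?, ?)",
        (hostname, now))
    db.commit()


def live_hosts(db, now):
    """Hostnames whose most recent heartbeat is recent enough."""
    rows = db.execute(
        "SELECT hostname FROM heartbeats WHERE last_seen >= ?",
        (now - STALE_AFTER_SECONDS,))
    return {row[0] for row in rows}


def reconcile(registered, live):
    """Return (to_add, to_remove) so the master's node list matches live hosts."""
    return live - registered, registered - live
```

Because reconcile works on sets, running it repeatedly with the same inputs yields the same result, which is what makes the master’s polling loop safe to run every minute. The calls that actually register or unregister a node on the Jenkins master are left out.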
Each weekday morning, we scale up the number of slave Jenkins servers. Each evening, we initiate a scheduled scale down. If you terminate a Jenkins slave while it’s doing work, the Jenkins jobs it was running will, by default, fail. In order to avoid failing developers’ builds during a scheduled scale-down, we have split all of our slaves into two groups A and B.
At 7:45pm, we mark all slaves in group A offline and then we wait 15 minutes. This allows for a graceful drain down of in-flight jobs because Jenkins will not assign new work to slaves marked as offline. At 8pm, we trigger a scheduled AWS scale-down of group A. At 8:15pm, we mark all remaining slaves in group A as online. We then repeat this sequenced process for group B, and then finally for our spot instances.
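As a rough illustration, the sequence can be expressed as a small schedule generator. The 15-minute spacing comes from the times above. The assumption that each group starts when the previous one finishes (a 30-minute gap) is mine for the sake of the example, as are the step and function names.

```python
from datetime import datetime, timedelta

DRAIN = timedelta(minutes=15)


def drain_steps(group, start):
    """Three timed steps for one group: offline, scale down, back online."""
    return [
        (start, "mark_offline", group),
        (start + DRAIN, "scale_down", group),
        (start + 2 * DRAIN, "mark_online", group),
    ]


def nightly_schedule(groups, start, gap=timedelta(minutes=30)):
    """Chain the drain sequence across groups, one group after another.

    The gap between group start times is an assumed value.
    """
    steps = []
    for i, group in enumerate(groups):
        steps.extend(drain_steps(group, start + i * gap))
    return steps
```

Calling nightly_schedule(["A", "B", "spot"], datetime(2017, 1, 2, 19, 45)) produces nine steps, beginning with group A going offline at 7:45pm and scaling down at 8pm.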
We trigger an AWS Lambda function each minute that queries the Jenkins APIs and reports certain metrics to Wavefront via statsd. The main metric that I watch is what we call “Human wait time,” representing the amount of time a real person waited from the moment a pull request was pushed to the moment that all of the CI checks were completed. Wavefront allows us to fire PagerDuty alerts to the infrastructure team if any of the metrics reach unacceptable levels.
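Here is a minimal sketch of how that metric could be computed and formatted. The metric name and function names are hypothetical, and the timestamps are assumed to be unix seconds. The output uses the standard statsd line format for timers, name:value|ms.

```python
def human_wait_seconds(pushed_at, check_completed_at):
    """Seconds from the push until the last CI check finished."""
    return max(check_completed_at) - pushed_at


def statsd_timing(metric, seconds):
    """Format a statsd timer line; statsd timers are reported in milliseconds."""
    return "{}:{}|ms".format(metric, int(round(seconds * 1000)))
```

The wait is driven by the slowest check, which is why the calculation takes the maximum completion time rather than an average. The resulting line would be sent to the statsd daemon as a UDP datagram.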
There are numerous options for setting up CI & CD. Depending on your situation, you may find third-party hosted tools like CircleCI and TravisCI to be perfect for your use case. If you like customizing an open source project and running it yourself, Jenkins might be for you. Still, if you have highly specialized needs or need to customize everything imaginable, you might decide to write something entirely from scratch.
So far, Jenkins has offered us a way to quickly set up CI & CD and scale it using the tools we’re already using, like AWS and Terraform.
Amazon has a great white-paper outlining their recommendations and considerations for setting up Jenkins on AWS, found here.
Come back to our blog for more updates on DoorDash’s engineering efforts. If you’d like to help build our systems, which are growing at 250% per year, navigate to our open infrastructure engineering jobs page.