Distributed CI: Scaling Jenkins on Mesos and Marathon

Hi, I'm Roger Ignazio from Puppet Labs, and today we're going to talk about scaling Jenkins horizontally and statelessly on top of Mesos and Marathon. Just a quick bit about myself: I'm part of the quality engineering team at Puppet Labs, and one of our missions is to provide tooling, infrastructure, and services to the rest of the engineering org. One of those services happens to be fifteen or so different Jenkins clusters, if you want to call that a service; collectively we like to refer to it as our CI system, even though it's a lot more than that. We did an experiment last year to figure out what it would take to run something as stateful as a Jenkins master on top of a stateless system such as Marathon and Mesos, and this talk covers what we learned and some of the outcomes.

A little more about me: I'm a QA automation engineer at Puppet Labs, on Twitter at @rogerignazio. I'm also the author of Mesos in Action with Manning Publications; I think we've had about four hundred or so copies sold in the last couple of months. It's currently in pre-sales, so if you bought it, thank you. If you haven't, you can get forty-something percent off with the code on the screen.

Here's the agenda: I want to talk a little bit about Puppet Labs' testing environment, then conventional methods for scaling Jenkins, meaning the typical scenarios of a master per team, a master per project, or one single master. Then I'll talk about our CI re-engineering project, some of the motivations, and the problems we're trying to solve. We'll have a short demo (everything is up in a Vagrant environment, and you'll be able to grab that code too if you want to play with it), and then we should have a little bit of time left for Q&A at the end.

So, to start off: it's been a long day, thank you for coming. I know a lot of us want to go grab a beer, myself included, but before we get started, how many of you are Jenkins users right now? OK, cool, so everybody in this room. You probably all know what I mean when I say Jenkins is hard to scale. And how many of you are also Mesos users in production? OK, considerably less. All right, so I'll glance over some of the Jenkins intro material, since you probably don't need it, but here's a quick intro to Mesos for anyone who's not already familiar with it. Mesos is a general-purpose cluster manager. It allows you to treat your data center, or multiple machines, as if they were one entity, where individual resources such as CPU, memory, and disk are advertised directly to your applications, or in Mesos terms, frameworks. With Jenkins, this really means that Jenkins either accepts an offer and launches a new slave, or it rejects the offer and does nothing.

Marathon is a Mesos framework developed by Mesosphere that provides essentially a private platform-as-a-service. It runs applications as long-running Mesos tasks and provides automatic failover if one of those tasks goes down or a slave goes down. It also allows you to easily scale to n instances: if you have two instances of an app running and suddenly want five, it just launches those as new tasks. But along with that comes the problem of maintaining state, since once a task is gone, so is your build history and your job configs.
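To make the Marathon side a little more concrete, here's a minimal sketch of what submitting a Jenkins master to Marathon could look like, using Marathon's standard /v2/apps endpoint. This is an illustration rather than our actual configuration: the Marathon URL, the resource sizes, the jenkins.war location, and the health check path are all assumptions.

    import groovy.json.JsonOutput

    // Rough sketch: register a Jenkins master with Marathon as a long-running app.
    // The Marathon URL, resource sizes, and artifact location below are illustrative.
    def app = [
        id       : '/ci/jenkins-master',
        instances: 1,
        cpus     : 2,
        mem      : 4096,
        // Marathon fetches the artifact and runs the command as a Mesos task;
        // $PORT0 is the service port Marathon assigns to the task.
        uris     : ['https://example.com/artifacts/jenkins.war'],
        cmd      : 'java -jar jenkins.war --httpPort=$PORT0',
        ports    : [0],
        // If this health check stops passing, Marathon restarts the task elsewhere.
        healthChecks: [
            [protocol: 'HTTP', path: '/login', gracePeriodSeconds: 300, intervalSeconds: 30]
        ]
    ]

    def conn = new URL('http://marathon.example.com:8080/v2/apps').openConnection()
    conn.requestMethod = 'POST'
    conn.doOutput = true
    conn.setRequestProperty('Content-Type', 'application/json')
    conn.outputStream << JsonOutput.toJson(app)
    println "Marathon responded: HTTP ${conn.responseCode}"

Scaling from one Jenkins master to five is then just a matter of bumping the instances count, which is the property we take advantage of later in the talk.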
And I don't really need to introduce Jenkins, but for anyone in here who's not familiar with it, it lets you continuously build and test software projects, and it has a huge user community and over a thousand different plugins. We're going to talk about, and reuse, some of those plugins in this talk.

Before we get into it, I want to talk about the scale at which we're testing software at Puppet, just to give you an idea of some of the challenges we face. We're launching about 4,000 to 5,000 builds a day across 75 different combinations of platforms, releases, and architectures. That's everything from RHEL and Ubuntu to Windows, OS X, and Solaris, and if there are any AIX users in here (probably not), we even support AIX with Puppet Enterprise. All of this testing is driven by 15 different statically partitioned Jenkins clusters, with some mix of Jenkins instances based on access control, team, project, or function; somebody wanted a Jenkins for testing, so we gave them one. It's really just a sprawling static infrastructure. What that comes down to is about 1,300 executors across around 240 build machines.

We've also got some custom tooling. I said all of this is driven by Jenkins; we also have an automated testing framework called Beaker. It's multi-cloud and multi-platform, and it essentially allows devs and QEs to write their tests once and run them on any platform, on any cloud provider, without needing to know about the underlying infrastructure. We also have a service called vmpooler. Because our customers run Puppet in VMs and on bare metal, not just in containers, we really need unmodified kernels; we can't do our integration testing in containers or in a paravirtualized setting. So vmpooler is a pooling service in front of our vSphere back end, and it keeps VMs of multiple different platforms ready for checkout. Instead of waiting for our testing to provision, say, 300 VMs before running upgrade test scenarios, we can check those out immediately, and vmpooler clones replacements back in sometime in the future. I think last month we cloned about 120,000 disposable VMs, which saved us about 175 compute days of time across our test suites.

Our configuration management is a little haphazard. Most of our job configs and scripts are still stored in Jenkins; sometimes the Jenkins jobs call scripts that live in the individual project repos, and we even have a monolithic job-configs repo that we use to configure our Jenkins instances via cron. All of our infrastructure is managed by Puppet (I'm sure that's not really a surprise), or at least most of it; we don't manage the Jenkins configs per se, but we at least bring up the service. And as you can imagine, our reporting is a little sketchy too. We've got 15 different Jenkins UIs, which means 15 different data stores, which means 15 different APIs to hit if we want real-time status. We've written a couple of tools called Clockin and Waylon to try to abstract some of that. Clockin gives us historical run data and infrastructure metrics about all of our different Jenkins clusters, and Waylon allows real-time querying of all of our Jenkins masters on a per-team, per-project, per-function basis, just hitting those APIs and bubbling the failures up to the top, because certain projects and teams would otherwise have to hit three, four, five different Jenkins instances to figure out whether their software is actually shippable.

Before we get into the re-architecture work that I'm pretty excited about, I want to touch on conventional methods for scaling Jenkins. Like I alluded to, there are two common deployments: a single huge Jenkins with a single resource pool and all of your jobs, or, once you outgrow that or start having problems with polling taking up too many threads or with being CPU-bound, you break that out into a master per team, per project, or per function. The problem with breaking it out is that it starts leading to static partitioning if you're not careful, and Jenkins doesn't really provide a great way to avoid that except with various plugins. Obviously the masters are not highly available unless you do some crazy sysadmin foo and start setting up shared storage or DRBD with Pacemaker; I've done it before, and I don't want to do it again. It's especially complex when you're talking about 15 or 16 different Jenkins masters: now you have 15 or 16 different pairs of Jenkins masters, and you're having to worry about failover in all of these different places.
And you still have your static partitioning problem. You haven't actually solved resource utilization, and you can't easily load-balance across these masters because all of the state is contained within them. Let's face it, we're here at MesosCon: static partitioning kills overall data center utilization. It's just a fact.

For the more visually inclined, here's a scenario. We've got an open source Jenkins, we have multiple Jenkins instances that make up Puppet Enterprise testing and the build process, and then, 13 or 14 clusters later, we've got various other Jenkins instances for individual projects. In this particular case our open source Jenkins is about 90% utilized, so that's not too bad. But we're probably pretty close to ship date on Puppet Enterprise, so that one is at 140% utilization: we've got 40 builds sitting in queue, and developers are waiting to see whether we can actually release the product. Meanwhile, Jenkins number 10 or 11 here, the Project X Jenkins, is sitting at zero percent utilization, totally idle, and all of its CPU resources are just going to waste. So what can we do about it?

Well, there are various plugins for Jenkins to support global resource pools: things like Gearman, the EC2 plugin, vSphere, and Mesos. I think there's a Kubernetes plugin for Jenkins too, but I'm not entirely sure how mature it is; I was just looking into it last week. The problem, though, is that if you've got Jenkins slave labels on any of this stuff, you've got static partitioning inception, where you're partitioning your slaves after you've already partitioned them at the infrastructure level.

So let's take a quick look at the Mesos plugin. We've got the same Jenkins masters: they're still monoliths, they're still maintaining state, we have the same reporting problems and the same config problems. But they can talk to a Mesos master and pull all of their resources from a single resource pool, as opposed to only being given the resources they were assigned, and that really improves utilization. The Project X Jenkins that was sitting at 0% utilization before? Now that's free CPU and free memory that can be used for Puppet Enterprise testing. So: a single pool of resources. But like I said, we still have the same problems of multiple URLs and multiple sources of truth.

So one of the things we did was start a CI re-engineering project to figure out, by talking to dev and to QA and QE managers, what their ideal system would look like. What data can we provide to help them make better-informed decisions, so they can predict when we're actually going to be able to ship, or whether we're putting too many features into the product to ship it in six or eight weeks? And maybe more importantly, how can we make development workflows a little bit better?

We started off with some user stories during these interviews. If you haven't used user stories before, they follow a simple but pretty powerful pattern: as a role, I want or need something, so that outcome, with the outcome being the measurable part and the want or need conveying the urgency or priority. Just to run through a few of the real user stories we got internally at Puppet: as a developer, I want tests to be run against pull requests so that I have confidence in the code about to be merged. As a developer, I don't want to worry about the underlying infrastructure of the CI system; they're devs, not sysadmins, and we have entire teams for that. As a CI consumer, I want a central location to view all CI activity so I don't have to visit multiple URLs; if you can put the status of a build in one location and bubble it up to the right people at the right level, why would you want them to log into three or four different Jenkins instances? But maybe the most important one came out of our own team: as a QE, I want slaves to be on demand so that resources are used more efficiently.

So we've got a few motivations here. Obviously we've got some friction in the dev workflows, and when I say friction, I mean the monolithic job-configs repositories and individual pipeline configurations on a Jenkins master that I was talking about. Those don't map well to topic branches, or to testing that needs to change between topic branches. If you need to set up a separate pipeline for a new branch based on a release, you have to do a bunch of copying and pasting, and you have to have somebody make that change on a Jenkins master or in a job-configs repo.
Meanwhile, devs are branching and merging all day. Why can't we follow their workflow? Why can't we work better with them and for their needs? Perhaps most importantly, we really wanted an event-driven system. How many people in here have seen the warning in the Jenkins management UI that says you have too many Git polling jobs, you're polling too fast, and you've run out of threads? All right, no one? OK, well, we have that problem on probably half of our infrastructure, simply because we want things as close to on-commit as possible and yet we're overrunning it with the sheer number of jobs we have. We also want to improve the reporting and user experience: we want to give the right people the right information when they need it, as opposed to somebody asking a question and it taking us two hours to give them an answer because we have to dig into so many different systems. But maybe most importantly, we need to scale to meet the growing demand of the org. Statically partitioned clusters, all managed with Puppet on individual VMs, don't really work; but if we can have dynamic infrastructure using something like Mesos, with on-demand resources, then we're really starting to get somewhere.

So one of the things we identified when we started this project was that Marathon allows us to run applications statelessly, and in order to remove the state from Jenkins we had to start looking at a few things. Since everybody in the room is comfortable and familiar with Jenkins, I don't think this is any huge surprise, but let's take a look at the different components that make up the monolith. We've got our GitHub repo, which just sits there and is where code lands. We've got the web UI, a REST API, various plugins that store configuration on disk, and the job configuration management. We have a trigger: polling, cron, or, if you're using the GitHub pull request builder, that's also a trigger. We've got a scheduler built into Jenkins with its own queuing, which hands off to remoting, giving a build to an executor and passing it off to a Jenkins slave to actually be run. And then build info and results are also persisted to disk. We've got all of this fifteen different times over. The problem is that this typical setup doesn't leave any clear interaction points for developers. When you consider our organization, with QE trying to provide this as a service, devs shouldn't be interacting with every little bit; they should be interacting with the things we tell them they need to interact with, and we should be handling all the infrastructure pain for them.

So just to recap: in order to break up this monolith a little and remove state, we've got to handle job configurations, we've got to handle the build trigger, and we have to take care of build history. This is an example architecture we came up with, where the job configs are actually stored alongside the project repositories. When a push or PR occurs, a webhook is fired to a hook processor, which hands off to Jenkins and a Job DSL seed job; eventually that gets handed off to the Mesos master, which gives Jenkins the resources it needs to spin up slaves on demand. The hook processor takes a bunch of information about the repository and the event that just occurred and stores it in Elasticsearch for later querying, and Jenkins also ships all of its build data over into a database that we can then write an API and a web interface for. What that really gives us is clear interaction points: developers work in their project repositories with their job configs, and they use an API and a reporting application, without having to worry about the mechanics of the underlying infrastructure. Our sysops team can provide Mesos as a service, we can build on top of it, and our developers can do what they need to do instead of worrying about the system itself.

Like I mentioned earlier, Marathon allows us to scale applications out horizontally and just say we want n instances of this app. It allows us to deploy updates to the application and make configuration and plugin changes in a very standardized, systemic way, and it might even give us continuous deployment of our own CI system, with its own automated testing, which you just don't get with static infrastructure. Now, I'm talking about scaling Jenkins masters horizontally and having them be stateless, and you might be wondering why we would do that.
Well, go back to the single monolithic Jenkins master that has all of the jobs and one resource pool: if that goes down, one hundred percent of your jobs can't run, and one hundred percent of your in-flight jobs are dead in the water. But if we scale this out to five stateless masters on Marathon, that number goes down to twenty percent for a single failure.

This is just a quick graphic of what it looks like to run Jenkins on Marathon. Marathon is registered as a Mesos framework; Jenkins runs on Marathon like any other web service and is itself registered with Mesos as a framework, and it brings up slaves on demand using the Jenkins slave JAR. All of that data is then shipped into Redis and the ELK stack, which in this particular case is not running on Mesos, though there's no real reason it couldn't be. And just to give you an idea of what's actually going on with the hook processor: the job config ends up on the Jenkins master as a seed job. The webhook hands off to the webhook listener, which queries Marathon for a healthy Jenkins instance; it looks at the number of instances that are running, picks a task, and tries to ensure that the Jenkins master is healthy and actually returning a version. It then creates the seed job, which creates multiple dynamic jobs to configure our pipeline, and all of that gets shipped into an external data store.
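To give a feel for that flow, here's a minimal sketch of what such a webhook listener could do. This is not our actual hook processor, just an illustration of the Marathon and Jenkins calls involved; the Marathon URL, app ID, and seed job name are assumptions.

    import groovy.json.JsonSlurper

    // Illustrative endpoints; the Marathon URL, app ID, and seed job name are made up.
    def marathon = 'http://marathon.example.com:8080'
    def appId    = '/ci/jenkins-master'
    def seedJob  = 'puppetlabs-puppet-seed'

    // 1. Ask Marathon for the running tasks of the Jenkins app.
    def tasks = new JsonSlurper()
            .parse(new URL("${marathon}/v2/apps${appId}/tasks"))
            .tasks

    // 2. Pick a task whose Marathon health checks report it as alive.
    def healthy = tasks.find { t ->
        t.healthCheckResults && t.healthCheckResults.every { it.alive }
    }
    assert healthy : 'No healthy Jenkins master registered with Marathon'

    // 3. Double-check the master is really Jenkins by asking for its version header.
    def jenkins = "http://${healthy.host}:${healthy.ports[0]}"
    def version = new URL("${jenkins}/api/json").openConnection().getHeaderField('X-Jenkins')
    println "Using Jenkins ${version} at ${jenkins}"

    // 4. Trigger the seed job, which generates the dynamic jobs for the pipeline.
    //    (The real system also generates a UUID for the event, pushes it onto a
    //    Redis list keyed by repo, and indexes the webhook payload in Elasticsearch.)
    def build = new URL("${jenkins}/job/${seedJob}/build").openConnection()
    build.requestMethod = 'POST'
    println "Triggered seed job, HTTP ${build.responseCode}"

In a real deployment you'd also have to deal with authentication and CSRF crumbs on the Jenkins side, but the shape of the flow is the same.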

So really what we have is ephemeral pipelines based on a point in time and a branch for an individual project. It allows QE to branch off a project repo, make changes to CI, and test them out in a topic branch before merging back into mainline.

With that hook processor, I just kind of said that it goes into a data store and hand-waved it as a little bit of magic, but really what happens is this: for every event, every webhook that comes in, we generate a UUID, and that's shoved into Redis keyed on the project repo and namespace, say puppetlabs/puppet, and we maintain that list to give us arbitrary build numbers and a list of recent builds. All of the webhooks and build data are then shipped via Logstash into Elasticsearch so we can query them a little later. We can then query and visualize our system activity in Kibana: things like searching for the number of successes and plotting that over time, or the number of failures. Do we experience more failures around a release cycle, and why is that? And really, with all of this data in its own database and queryable, arguably unlike Jenkins' file-system-based approach, why can't we write our own reporting app? So this is just something I hacked together for the purposes of the demo, but we've got all this information and we can present it in a really user-friendly way. In this particular case it looks a little like Travis CI, with just a bit of Twitter Bootstrap, and we've got a pretty nice-looking UI.

So I just want to jump into a quick demo showing all of this in action, and really the power of it. I went ahead and pre-recorded it: I wasn't sure what the Wi-Fi was going to do here, and I figured a bunch of you would rather get out of here and grab a beer than sit around and watch RSpec run. Let's walk through it. We've got a bunch of applications running on Marathon and Mesos: Jenkins, the hook processor that actually handles the hooks, and this experimental reporting app. We've already got the webhook created on the GitHub repo, and it just relies on Jenkins being up and running. Let's take a look at the Job DSL script I've been talking about, in case you aren't familiar with the Job DSL plugin. We're setting things like the job name; a "create jobs" flag that lets us clean up after the seed job; a matrix with various Rubies; the shell script inline, though that could be a separate script itself; we're going to process the JUnit output; we're going to ship everything into Logstash; and we're going to fire the job once it's created. This was already committed, but I'm just going to go ahead and bump it anyway. If we look here, the seed job has already been created, and we saw that queue item at the very end, so it's already triggered: the job is in queue to go process that Job DSL script. Jenkins has spun up a dynamic Jenkins slave on Mesos, and it's going to process that. It usually takes about a minute, but we'll skip past it. If we refresh the page, we'll see that we've got one new job. Not a very complex example, but an example nonetheless, with the three Rubies we set in our multi-configuration project.
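For anyone who wants to picture that seed script, here's a trimmed-down sketch in the style of the Job DSL plugin. The repo URL, label, Ruby versions, and shell commands are placeholders, and the real script does a bit more (the Logstash shipping, the create-jobs cleanup flag, and queuing the job once it's created), but the shape is the same.

    // Trimmed-down Job DSL seed script; repo, label, and versions are illustrative.
    def repo = 'https://github.com/puppetlabs/example-project.git'

    matrixJob('example-project-spec') {
        label('mesos')                   // run on the dynamically provisioned Mesos slaves
        axes {
            text('RUBY_VERSION', '1.9.3', '2.0.0', '2.1.5')
        }
        scm {
            git {
                remote { url(repo) }
                branch('master')
            }
        }
        steps {
            // Inline shell step; in practice this could call a script in the repo instead.
            shell('rvm use $RUBY_VERSION && bundle install && bundle exec rspec --format RspecJunitFormatter --out results.xml')
        }
        publishers {
            archiveJunit('results.xml')  // process the JUnit output
        }
    }

Each value of the text axis becomes its own matrix cell, which is why the seed job ends up generating one dynamic job per Ruby version.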
If we come back here, we'll see some builds land in the queue in just a second. The Ruby 2.0.0 build has already started, and the 1.9.3 and 2.1.5 builds are sitting in queue. What's happening is that the Mesos master is offering resources to Jenkins, and Jenkins is spinning up more of these dynamic slaves. So we're running another job in a container, and the 1.9.3 job will kick off shortly; it'll also run in a container on one of these dynamic slaves. You can have multiple executors per slave, but I decided to limit it to one just to illustrate the whole dynamic slave concept.

So one of the builds has already finished up, and that slave was just sitting idle. After whatever the idle termination timeout is, I think about three minutes in this case, the scheduler went ahead and killed it off, so those resources are now freed up for another framework, another Jenkins master, or whatever other workloads are running on your Mesos cluster. In the Mesos UI we can see that it's already been killed off and that we've still got two containers running. I'll fast-forward to the end of the run here; we'd see those slaves drop off too, but we won't stick around for that.

If we switch over to Kibana, which is a nice front end for Elasticsearch that lets us write queries using Lucene, we can filter everything by timestamp. We've got all this data: our webhook payloads, our build data. We can select a specific source; in this case the name of the hook processor is just "jenkins-hookshot" (I'm very inventive). Here we've got information such as the seed job name, the Jenkins URL that actually ran the job, the repository, who owns the repo, and some of the URLs to various things on GitHub. Scrolling down a little, we've got things like the sending repo, the before SHA, the after SHA, the new head commit, our commit messages and subjects, and who authored and who committed each particular SHA. If we switch the source back to Jenkins, we'll have our build data here, and the Logstash plugin for Jenkins gives us a whole bunch of information: the project, success or failure, what the matrix axes are, the full display name with the UUID we gave it, and the build variables, like Ruby 1.9.3. Looking through some of this, we've even got test results. We processed the JUnit output from RSpec, and we can see that we had 43 skipped tests and about 19,000 successful ones; any failed tests would show up there too. We've also got the entire build log. The cool thing is that this particular project was tested at this SHA at this point in time, based purely on an event-driven system, and it's all sitting in Elasticsearch ready for us to query; it's not sitting on disk somewhere. And then there's that quick little reporting app. It doesn't do a whole lot, but it takes the UUIDs, the projects, and the recent builds from Redis, plus some of the data in Elasticsearch, really using Redis as the lookup service, to display a little bit about our builds here.

Cool. I probably just threw a whole lot of information at you, and it's the end of the day, so some of you are tuning out. Let's wrap up and go over what we really saw here. We've got a single Git-based workflow between development, QA, and QE. We've got standardized, stateless Jenkins masters, where Marathon can deploy changes to our Jenkins configuration or scale those apps horizontally, and every Jenkins master is going to look exactly the same; sure, you could go in and change one, but restarting the app will take care of that. The Jenkins slaves are provisioned on demand using available resources from the Mesos cluster, and if there aren't any available resources, jobs will just sit in queue on one of those masters waiting to run. And as a bonus, we were talking about CI, but we also got a private platform out of it: Marathon.
And that's interesting for a couple of reasons. We've got a number of internal apps, and a lot of them are stateless or have their own data stores, so this could really ease our deployment process. We wouldn't necessarily have to worry about maintaining all these static VMs, the ops team wouldn't have to worry about maintaining user access and logging on all of them, and it just gives us a much nicer way to interact with our services.

We could have rewritten Jenkins from scratch as a highly available Mesos scheduler, but the fact of the matter is that the Jenkins plugin ecosystem is really rich, and we don't want to reinvent too many wheels. If we can already use existing plugins for HipChat or Slack notifications, for shipping all this data to Logstash, or the Job DSL plugin to keep our configs in version control, why would you reinvent all of that? It sets us up to be able to, but we don't really need to.

Just a bit on future work. Adoption: with anything like this, you don't really want to rip and replace infrastructure, so obviously everything here might not work for everyone else. But you could at least start by solving the static partitioning problem with Mesos, and then maybe start experimenting with running Jenkins on Marathon in whatever way seems more sane to you; that could just be committing your configurations to version control and running them as separate Marathon apps. Anyway, it gives you a starting point and a new way to think about operating Jenkins as a CI service. We saw a single reporting dashboard; that particular app was also backed by the API, all in a single project. We've also been talking about intelligent job queuing and throttling: a couple of the infrastructure folks on the QE team have been thinking about watching the load on our back-end infrastructure and our VM pooling service, and throttling builds, or holding builds in a queue before they hit the hook processor and actually get created as Jenkins jobs, just so we don't overload an already dire situation. And as you saw, the Job DSL was a little verbose; across 200 or 300 different projects there's going to be a lot of repetition, so we've also been looking at a Job DSL abstraction and templates. That way we can provide engineering with templates we know are good for testing Ruby projects, or C++ projects, or Clojure projects, and they can just supply parameters for how they want their job to run. But maybe most importantly, Jenkins is still here. It's still a component of the overall CI system, and I don't really see that going away any time soon. And with that, thank you for your time; I'd like to open it up for a bit of Q&A if anybody has questions.

Yes, in the blue striped shirt. So the question was which Jenkins plugin we were using to bring up the dynamic slaves, and whether the slaves already had Ruby and rvm and everything installed. Right. We're using the Mesos plugin, which I believe came out of Twitter and which Vinod and his team have been maintaining. As far as Ruby and rvm: if you're using cgroups, then yes, you'll want those dependencies installed on your Mesos slaves; we happen to have a dedicated Mesos cluster for build infrastructure. But if you were to run it in a Docker container, you could bundle all of that up that way too, and it might actually be a little nicer to manage your Ruby runtimes in a Docker container as opposed to running them on a general-purpose slave, especially if the sysops team is providing this as a service to you and perhaps other customers.

Yeah, so the question was about the idle termination timeout for the Jenkins slaves, and whether that goes through Mesos. It's a feature built into the Jenkins scheduler that the Mesos plugin provides.
It just looks to see whether a job has run on a particular slave with a given ID, and if not, it goes ahead and terminates that task. That's why we saw in the Mesos UI earlier that the task was killed rather than marked successful: the scheduler actually killed it off. It's just like a long-running, or I guess a short-running, Jenkins slave.

Yes. The question was how you tie a build, in a particular phase of a pipeline, to a particular Jenkins agent.

One of the things you can do with the Mesos plugin, in the global config for Jenkins, is define multiple different kinds of Mesos slaves, sorry, Jenkins slaves, each with their own slave labels, and each of those can have different requirements: a different amount of CPU, a different amount of memory. It just waits for those resources to be offered up, so it's pretty flexible, and along with the different labels they can be different Docker containers as well.

So we've got two questions. The first was what we do with build artifacts, which is a fantastic question. This has only been at the prototype phase, so we don't have a clear way of handling build artifacts yet; they could be stored on the master, they could be stored on an external web server, but we haven't really gotten that far. The second question was pretty interesting, and I'm glad you asked: couldn't we just run each job on an ephemeral Jenkins master, as opposed to playing with all the Jenkins slaves? The answer is yes, absolutely; that's actually what we started with. We thought it made more sense for the demo, and for trying to get people to adopt it, to show the dynamic slaves, and that applies more to the scaling use cases. But there's really no reason you couldn't do it, and it gets rid of a lot of overhead. It's just an entirely new concept on top of the new concept I just talked about, so it was a bit of a harder sell.

In the back. Oh, a great question. The question was how we deal with OS X and Solaris and Windows and all that on Mesos. If you recall, at the beginning I mentioned our framework called Beaker. What it does is talk to vmpooler, and over on vSphere we've got pools of virtual machines running Solaris or OS X or Windows. We just use Jenkins as the execution engine, run the tests on those disposable VMs, and destroy them afterwards. It's a little more heavyweight, but it lets us get rid of the static partitioning as far as our static Jenkins infrastructure is concerned and brings a bit more dynamic resourcing to the mix.

Oh yes, another great question: what do we do with the jobs that get created on each of these ephemeral masters? We could leave them there; they're never going to do anything unless we configure something like Git polling or cron, which we specifically don't do. But if you noticed, early in the Job DSL script I had that "create jobs" boolean, and I didn't really explain it much. What we could do (this particular project is up on GitHub) is rerun the seed job with "create jobs" set to false. Job DSL will see that it created all these jobs but they're no longer in its configuration, so it'll clean up after them and leave only the seed job. That's one way to do it, or you could also just periodically restart the masters as you push configuration changes.

Are there any other questions? I think we've got time for maybe one more. All right, cool. Well, thank you for your time. The slides will be posted if you're interested. I'm @rogerignazio on Twitter, from Puppet Labs. Thanks!