Cloud Infrastructure to Help Researchers Build

I'm a deputy director of the Monash eResearch Centre, and Blair, who's here with me, is also part of my team; we're going to share this talk, go half and half, so hopefully there won't be too much disruption changing over. Okay, we'll start again.

Hi all, welcome. My name is Steve Quenette, deputy director of the Monash eResearch Centre, and Blair and I are going to give this talk today. We're talking about our view on how we help researchers using the cloud, HPC and all these things put together. We're going to use slightly different words, and I'm going to spend the first ten minutes of this talk explaining what those words mean to us, the framework of thinking behind them, and how that drives what we do and how we do it. There's a third member of this team, Wojtek, who is not here with us today, and you'll see his influence in this work as well.

Now, everything we do is full of HPC-like things: it's got RDMA, we've got GPUs, lots of cores, parallel file systems and all those sorts of things. But I don't think what we do is HPC, and I say that with a lot of pain, because I come from an HPC background. Part of this talk is about how we start thinking about it in a slightly different way, one that lets us really move forward and influence the world.

To begin with, I'll start off with how we used to talk about this five years ago, because we've changed even since then, and there are two points here. We used to talk about the peak and the long tail: the peak were the HPC guys, and the long tail were the guys doing stuff on Windows shares. You usually supported one of the two camps, and as a result there was this pit in the middle of people who weren't serviced at all, the guys missing in the middle. If we look at it today, what we understand, and you'll see this come through in the talk, is that the peak are more like the leading researchers, and they build the tools that other researchers use; those tools get proliferated. But then there's another conundrum, because the peak researchers, in the toolsets they create, are themselves using tools that other researchers use. So even the peak is kind of a long tail. We don't really talk about peak and long tail so much anymore, and we definitely don't associate the peak with the idea of traditional HPC.

The other thing we did, when we were engaging researchers and building our infrastructure, was talk the way Gartner used to talk about the hype cycle: in the early phase it's all iterative and experimental, and at the very end it's this disciplined engine room which you hand to the IT guys. We tried that. We actually tried really hard to make that spectrum work, and thought we could hand things over to IT teams, or even HPC centres for that matter, and expect that they could do the right things. What was really interesting in the keynote we had on Monday morning from Gartner itself is that, five years on, they are still using this framework of thinking, though I guess business people don't like extra words, so they went with something simple: mode 1 is the disciplined engine room, the stuff you give to IT people, and the agile, experimental part, which is a result of the use of clouds and
things today, is mode 2. But even still, in that presentation they recognised this third layer in between; it's more like a spectrum of how you move between the two, and even more so, she made the point that our IT infrastructure across the board is moving towards the more cloud style of doing things.

So today, what we say instead, and this is how we think and communicate about what we do, is that researchers are building and using 21st century microscopes, and I'm going to explain what that means. If we think of the humble microscope, it came into being around 200 years ago. At that point we, as in mankind, had learnt how to machine brass really well, and we had learnt how to machine lenses really well and reliably.

And someone had the insight, I guess because they were sick of holding a lens up and trying to get the distance right by hand, to build a little machine that joined it all together using brass. This created the biology discipline and its research, and there was a boom of scientific outputs as a result. A microscope has a light source, I put some sample in, and there are knobs and filters that the light passes through; with the knobs and filters we're able to tune the device to help us see things we couldn't see before, and there's a lens, obviously, through which we see and drive that process.

Now I'll relate this to Tony Hey's fourth paradigm, so to speak, and then map it to a slightly mathematical viewpoint, because with computational science, HPC, whatever else, I really struggle with subjective conversations. The microscope really produced just outputs: they were Ys, they were observations. The boom of research we had, the first paradigm, came from those microscopes. Around the same time we had a boom of theoretical models; in this case we had the fs. The models were where the innovation was: we worked out that statistics and normal distributions are actually a model with which we can make various predictions, we had Newton, and everything else.

I'm going to come back to that, but one of the questions we ask ourselves is whether discovery leads technology, or whether it's the other way around and technology leads discovery, or whether it's a perpetual cycle between the two, and where we are in that cycle, and how we use it to drive us. If we ask that question, then it becomes very useful to ask ourselves what technologies have been driving mankind. We know that one of the greatest technology evolutions we had was the Industrial Revolution, where over about a hundred-year period we started to produce more food than we had humans to consume it. It was the first time mankind actually had more food than it needed, so we stopped controlling population growth by starvation, essentially. That was about five percent growth, compounded every year, over about a hundred years. I haven't put that in this graph, but I have picked out some technology curves here.

The really interesting one is the red line, which is the speed at which we travelled across the Atlantic. We went from steam liners to Boeings, to jets, by the 1950s, and during that forty- or fifty-year innovation boom air flight was special: there was a boom of culture around flight and everything else, and a lot of industry was built. But we hit 1950 and we stopped going across the Atlantic any faster; it's been about 500 knots, or whatever it is, ever since. Look at what happened forty years after that: the US government had to help its airlines, because the innovation was gone and the business models around everything had changed. So it's interesting to watch these things.

Everyone in this room knows what the blue line is going to be without me having to say it: it's essentially Moore's law. Nothing else is like it in mankind's history; it's fifty percent compounded growth over more than a fifty-year period. Sure, it's maybe stagnated a little bit now, and it's not clear
whether something like it will continue for a period of time yet. The really interesting question is how that has influenced what we do in science and research. If we take our four-paradigm model, what it really means is that at some point we could start using computers to deal with the fact that our fs, our models, are big and complicated, much bigger than what the human mind can handle. As a result we have engineering, we have CFD, predictions, and all the discoveries we've made through, if you like, the third paradigm: computing, or simulation. I can add to the graph and say that the green line is essentially the size of the most expensive hard disk you can buy, the peak of hard disks, and the little grey-and-yellow line is the growth of our sensing capabilities.

These two together, you can argue, have driven the fourth paradigm, which is around data; in this case the data, the Y, is really, really big. You could probably break the fourth paradigm down into three different modes. There's data mining, where, given a big Y, what we're trying to do is find what the f and the X are; we're trying to find the knowledge, and we don't give the machine any prior knowledge in the form of models. There's data assimilation, where we already have big, complicated models and we have the big data we've observed, and we're trying to marry the two, or work out some equivalence between them. And visualisation is still immensely relevant, because if it were true that you could data-mine everything, then we wouldn't need to do research at all. The best lens we can have is an environment that lets us see the most, and we have facilities in particular where the number of pixels, and the brightness of those pixels, allow researchers to discover things that they then put into models, or use to condition data mining from that point on.

I've added this last line because it's really relevant to this meeting, where we've heard a lot about the Internet of Things, and I'm suggesting that maybe there is a fifth paradigm, because I don't think it's covered by the fourth, and we're seeing it in how big businesses are thinking about where the future is. This orange line is the number of devices on the internet, the Internet of Things, and it's predicted to be the only thing that will continue at something that looks like Moore's law. The real question is how that is influencing how business works. Think about ten years ago, or even five: you went onto a Windows machine, your email popped up, and it was run by your institution. Now your email is probably done by Google, Verizon or somebody might provide your telecommunications, and there are several companies involved in doing the things you used to do by yourself. Because of these curves we can see that the world is changing, and for our researchers to win we need to think about what the workshop is, and what the materials are, that allow them to play in this space and win.

So we say that the 21st century microscope looks more like something that ties together the big instruments and all the things that produce raw data; the supercomputers and cloud infrastructures, and the software on them, are the filters that allow us to tune in order to see the things we couldn't see before. Rather than light going up through the device, it's data, and the transforms applied to it, that go up through it, and the lens is really the environment we interact with, these days desktops and other things like that. Our facility aims to create that environment, where the cloud becomes the brass, or the ability to make and tune that brass. We use OpenStack, and Blair will talk about those bits in a little while. It was always, from day one, part of a federation to share and collaborate across Australia, the Nectar Research Cloud (we have Lyle in the room here), and it was always, from day one, about specialist equipment. We had no intent of trying to compete with Amazon if it were just about dollars per core or cloud bursting; the intent was bringing the right
equipment for our researchers to do what they need to do, so we had RoCE, SSDs, high memory, all these things from day one. The graph on the right is the number of core hours allocated per month, and it's been literally an exponential-looking curve since we started, which is not that long ago.

Which brings us to what HPC as a service, or HPC in the cloud, means for us. If R@CMon is the bit that lets us orchestrate our 21st century microscopes, then the HPC part is really just another flavour, just another component that people connect into the bigger things they're doing. So we're not really HPC-first, we don't think in an HPC-first kind of way; what we're focusing on is the environments those researchers are trying to connect everything with. I'm going to give you two end-members as examples. The first: Australian banks are actually quite powerful, really well-renowned

businesses. In a world-first piece of research, the bank and the researchers wanted to do some data mining on real EFTPOS data, electronic transaction data, so it was highly confidential and highly sensitive; they were trying to discover, I think, whether they were categorising their marketplace well. The data mining required machinery which was not normal. We were able to very quickly create a virtual environment, its own microscope, and we were able to destroy that environment afterwards and do all the secure handling. We didn't have software-defined networking doing it for us at that time, but we weren't far off. The concept is there: the HPC, whatever you call it, was really, really important. They were able to publicly say that this was a world first, and they knew that in the entire world only two US banks were going to be even close to being able to do something similar. I don't know if they have yet, but you can imagine the value here, which is quite significant.

Maybe something a bit more normal: the study of perforin. It's a protein in your cells that allows things to pass into cells; pores open and close and form, and so on. It's one of those tricky ones (I was talking to Paul about this yesterday) where, like most of these things we try to understand, we can't crystallise it easily, so we can't just throw it in a synchrotron, and we need to see things at a scale beyond what we can do with the synchrotron. So there's new equipment coming about, new microscopes being built, or new instruments for those microscopes, and they require computation to get us to the point of being able to see things. It very much is that data assimilation problem. The environment looks a little bit like this: we've got some instruments, the data has to go to HPC, and it has to be shared and stored for later use, for common things, and everything in this circle is about how we produce an accelerated, reproducible environment for making that happen. What they're trying to do, and this is taken directly out of their Nature paper, is this statement around the set of tools they use and how the pipeline works. That's the bit we need to reproduce for other people, for the proliferation part; it's also the bit we need to make easy to build in the first place.

So we created this thing called the Characterisation Virtual Laboratory. It's essentially a managed desktop environment, it's VDI, already connected to all the HPC equipment and all the data sources in the secure, appropriate ways. It's also a sort of mass-customisation thing: there's a core way that we do it, and there are flavours of it for the various disciplines that are pushing the boundaries; we have four major ones in Australia, these are national projects, and they're listed here. Why is this important? Because, going back to that very first graph, it tackles that middle gap, and our usage pattern for this is an exponential curve. These are the number of people who actively use it, so not accounts, the actual active users, and on the right-hand side is the number of times a subset of those people have used it. You'll see there are 60 people who have used it more than a hundred times, who have actively started that many sessions, which is kind of scary when you
think about how we try to measure the usage stats of HPC facilities nowadays. One last part about this: we give the researchers a little app which gives them one click to get onto the virtual laboratory, and the virtual laboratory really takes the form of a bunch of GPU-enabled VDIs managed via the HPC scheduler, Slurm, so we use HPC-style scheduling to manage that resource, and everything then happens through the web and VNC connections. So that's the pattern.
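
To give a flavour of what that one-click launch amounts to underneath, a desktop session scheduled like any other job, here is a minimal sketch assuming a Slurm-managed GPU partition and TurboVNC; the partition name, resource requests and display number are illustrative assumptions, not the CVL's actual implementation.

```python
#!/usr/bin/env python3
"""Minimal sketch of a 'one-click' desktop launcher: submit a GPU-backed
TurboVNC desktop as a Slurm job and tell the user where it is running.
Partition name, resource requests and the vncserver invocation are
assumptions, not the CVL's real tooling."""
import subprocess

DESKTOP_JOB = """#!/bin/bash
#SBATCH --job-name=cvl-desktop
#SBATCH --partition=desktop        # hypothetical GPU/VDI partition
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=16G
# Start a TurboVNC desktop on display :1 and keep the job alive while it runs.
vncserver -geometry 1920x1080 :1
sleep infinity
"""

def submit_desktop() -> str:
    """Submit the desktop job (script read from stdin) and return the job id."""
    result = subprocess.run(
        ["sbatch", "--parsable"],        # --parsable prints just the job id
        input=DESKTOP_JOB, text=True,
        capture_output=True, check=True,
    )
    return result.stdout.strip()

def where_is_it(job_id: str) -> str:
    """Ask Slurm which node the desktop landed on (empty while pending)."""
    result = subprocess.run(
        ["squeue", "-j", job_id, "-h", "-o", "%N"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    job = submit_desktop()
    node = where_is_it(job) or "<pending>"
    print(f"Desktop job {job} submitted; point your VNC/web client at {node}:1 once it starts.")
```

The point is simply that the desktop is just a job: the scheduler decides where it runs, and the user only ever sees the address to connect to.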

The last thing I'll say before I hand over to Blair is that we want to join those two paradigms together: what we're doing with the Characterisation Virtual Laboratory in accelerating the peak researchers and the proliferation of that sort of work, combined with security. This is important to us because we have a lot of medical applications coming to us. If we apply this same approach to imaging matched to phenotype data, or to genomics matched to phenotype data, we have a problem where we can't take the data off that environment until the governance of the project, or of the data, says it's okay. So we're in a phase where we're starting to have to marry these sorts of things together. And with that, I'll hand over to Blair.

Thanks, Steve. Okay, so I thought I'd talk a little bit about what makes up the engine room of the virtual microscope, and some of the pre-history there with the Nectar Research Cloud program, because Nectar was a pretty pioneering program at the time. It was established by the federal government's Super Science initiative in 2011, and a small technical committee was set up to advise on what cloud middleware we should use for the Nectar research cloud. I was fortunate to be on that committee, only by accident really: I had been doing a bit of stuff on Amazon and not many people had at the time. I was working in a research group, got pulled out of that, and everything snowballed from there. Tom Fifield, whose name a few people may recognise (Tom's at openstack.org now), acted as a consultant to that group and did an evaluation of feature sets and so on across the different options at the time. Keep in mind that this was approximately the Bexar timeframe for OpenStack, so one of the highlights in the release notes for Swift in that release was experimental S3 API support. But the decision we ended up making, or recommending, in that committee actually had very little to do with tech at all; it was more about the community process and the governance structure that was starting to spring up around OpenStack. It looked very promising, and ultimately I think we made a good decision.

The University of Melbourne, also in Melbourne as Monash is, was the lead agent, or is the lead agent, for the Nectar program. They established the first node, the pilot node, for Nectar, which opened up to users in January 2012; I guess that would have been deployed on Diablo. Our Monash node eventually joined in early 2013, and we had some just-in-time features coming into Nova to allow our architecture for the Nectar research cloud, so we were one of the first major Nova cells deployments outside of Rackspace. Now there are eight nodes across Australia, with over ten data centres and 30,000 cores, and those 30,000 cores are just what the Nectar program itself funded to be built for public access. It's worth noting that many of the nodes, including Monash, are adding a bunch of capacity on top, leveraging that infrastructure but through their own institutional investment for their own members.

The other thing to point out is the cells structure, because that's kind of a new thing in OpenStack for many people now, whereas we've been doing it for a long time. I was actually kind of skeptical about it to begin with, I have to admit, because having been a user of Amazon I was used to the regions idea, I had programmed against it, and I thought that seemed fine. But cells really do make things significantly easier for the end user. At the time there was no support for regions in Horizon; the way we have things set up, users just come to the one dashboard, they have the same identity everywhere, and we don't have issues trying to sync Keystone and that sort of thing. They just have a drop-down list of AZs they can use, and they don't even have to pick one if they don't want to.
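
To make that user-side view concrete, here is a hedged sketch using openstacksdk: one cloud entry, one identity, and availability zones as the only placement choice. The cloud, image, flavour and AZ names are placeholders, and the exact SDK helper names are assumptions worth checking against your openstacksdk version.

```python
#!/usr/bin/env python3
"""Sketch of the user-side view of a cells-based cloud: one endpoint, one
identity, and availability zones you can pick (or not) at boot time.
Cloud/image/flavor/AZ names are placeholders; SDK call names are assumptions."""
import openstack

# One entry in clouds.yaml covers the whole federation; the cells behind it
# are invisible to the user.
conn = openstack.connect(cloud="nectar")

# The AZs are the only placement choice the user sees.
for az in conn.compute.availability_zones():
    print("available zone:", az.name)

# Booting without an AZ lets the scheduler place the instance anywhere;
# passing availability_zone pins it to, say, the Monash zone.
server = conn.create_server(
    name="demo-instance",
    image="ubuntu-16.04",           # placeholder image name
    flavor="m1.small",              # placeholder flavor name
    availability_zone="monash-01",  # placeholder AZ; omit to let the scheduler pick
    wait=True,
)
print("booted", server.name, server.status)
```
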
The other big advantage is that we have a core services group that looks after all the user-facing stuff, the APIs and all of that core infrastructure, and down at the nodes you only worry about the compute infrastructure and that sort of thing, so we have a small management footprint. R@CMon, which is this funny abbreviation that just means the Research Cloud at Monash, is now about 210 compute nodes across two data centres: about six and a half thousand CPU cores, 45 terabytes of RAM, about 150 GPUs, and volume Ceph plus a bit of Lustre, maybe 1.5 petabytes of that being Lustre and the rest Ceph, all integrated into the cloud infrastructure.

So, HPC at Monash. We've had an HPC resource for quite a while; it started out as the Monash Sun Grid, I think, and then became the Monash Campus Cluster. It's a typical institutional HPC service, serving everybody and everything, PhD students, high-end star researchers, that sort of thing. For a long while we've had a partnership model, where people who have won grants to buy infrastructure bring it in and have it managed through the cluster, and Monash also provides some of the operational expense. That's really good, because I talk to people in HPC forums about the problem of little departmental clusters everywhere; I talked to one guy last week who was managing 17 clusters, which is a bit amazing. We recently changed the name when we moved things onto the cloud, so it's now called MonARCH. MonARCH is, if you like, maybe one step ahead of MASSIVE, which is the next resource: on MonARCH we're innovating a little bit more at the middleware layer, and then MASSIVE comes along, takes that, and does it at a larger scale. MASSIVE is actually another federally funded project. Australia has a national computational infrastructure, and MASSIVE sits alongside it as a specialised facility for characterisation, so imaging and visualisation, with a number of external partners and affiliates as well, for example the Australian Synchrotron, which is co-located with Monash in Melbourne.

MonARCH now runs almost entirely on OpenStack, so all of our compute infrastructure there runs under a hypervisor, KVM. What we did initially was just take an existing Nova cell that we had for the Nectar research cloud, add to it, build it out, and use host aggregates to control things so that the cluster project could get to those nodes. We also went with Lustre for the first time; previously we just had NFS filers and that sort of thing attached to HSM, and we had all sorts of nasty things happening, like users trying to run HPC jobs on a hierarchical storage file system and wondering why things were getting pushed out to tape. That resource is all dual-socket gear, a mix of high-core-count and high-clock-speed stuff for various workloads. By and large, though, in terms of job numbers we still see probably over eighty percent single-threaded workload. I don't know off the top of my head what that looks like in terms of actual CPU time spent on the resource, but compared to, say, five years ago we are starting to see an increase in the number of parallel jobs: people dabbling in OpenMP and that sort of thing within a node, or hybrid stuff where they go across two to eight nodes. Initially, because we built this as part of the Nectar research cloud, one of our architectural constraints was to fit within the OpenStack framework we were using at the time. We were still on nova-network then (this was about a year ago) using multi-host FlatDHCP, but we wanted to integrate with Lustre, so that proved to be a small challenge, though not entirely impossible to overcome.
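
For anyone wanting to try the host-aggregate approach mentioned above, here is roughly what carving HPC hypervisors out of a shared cell looks like, sketched as OpenStack CLI calls driven from Python. Hostnames, aggregate, flavour and property names are illustrative, and the nova scheduler needs the AggregateInstanceExtraSpecsFilter enabled for the flavour keying to work.

```python
#!/usr/bin/env python3
"""Sketch of carving HPC hypervisors out of a shared cell with a host
aggregate, and keying a flavor to it so only HPC flavors land there.
Hostnames, names and property keys are illustrative; admin credentials
are assumed to be in the environment."""
import subprocess

def os_cli(*args: str) -> None:
    """Run an OpenStack CLI command and fail loudly on error."""
    subprocess.run(["openstack", *args], check=True)

# 1. An aggregate for the cluster's compute hypervisors.
os_cli("aggregate", "create", "--zone", "monash-02", "monarch-compute")
os_cli("aggregate", "set", "--property", "hpc=true", "monarch-compute")
for host in ["hpc-node-01", "hpc-node-02"]:      # placeholder hostnames
    os_cli("aggregate", "add", "host", "monarch-compute", host)

# 2. A flavor whose extra spec matches the aggregate metadata, so the
#    scheduler only places these instances on the HPC hosts.
os_cli("flavor", "create", "--ram", "65536", "--vcpus", "16",
       "--disk", "40", "mon.hpc.16c64g")
os_cli("flavor", "set", "--property",
       "aggregate_instance_extra_specs:hpc=true", "mon.hpc.16c64g")
```
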
So I guess people want to know: why do HPC on OpenStack? For us, on one hand, it was about consolidation. The HPC team then becomes, I guess, a customer of my team, and those guys can really focus just on their operations; they're not worried about hardware anymore, and in fact we're not really worried about the hardware either in my team. Flexibility, of course, is another big one. Lots of people running standard HPC facilities, especially on CentOS and that sort of thing, typically say they've got bioinformaticians who want Ubuntu (that seems to be a common pattern), and we get Windows users coming along as well, with various software requirements. And we had some confidence there too, because from the very beginning, when we started running in the Nectar research cloud, the HPC team already had a resource made up of a whole lot of mixed, old hardware running on bare metal.

At that point they started spreading out onto the local cloud resource as well, so we already had a fair bit of confidence that this was actually going to work and be suitable for our workloads.

So why not bare metal, or Ironic? Well, as I was saying, the performance for us was good enough, so we didn't really feel it was worth learning how to do Ironic as well when we already had this big pile of KVM infrastructure. I'm basically the only OpenStack architect in the team, and I haven't yet become confident that with Ironic you can achieve the provisioning-network isolation and so on needed for secure multi-tenancy. That secure multi-tenancy is something we really are after, because whilst we tend to build these HPC facilities as a managed service for the end user, we still want the flexibility to be able to hand a chunk off to a user if they really need it. It's also worth mentioning that bare metal was one of the big topics the Scientific Working Group folks identified as wanting more information about and some work on.

How are we doing for time? About five minutes, okay. So this is a basic diagram of how MonARCH runs on OpenStack: above the line is OpenStack, below the line is bare metal, and Lustre is the only bare-metal piece in there. With nova-network, we solved the problem of integrating with Lustre by simply using PCI passthrough. We reuse Mellanox gear there, and their NICs fortunately allow you to do some funky things with PCI virtual functions: you can define virtual functions that are already tied to a VLAN, that sort of thing. So when an instance starts up on one of these nodes, it's got whatever the nova-network interface was, say a public or private IP depending on how you set that up, provided by nova-network and DHCP, and then the guests go and configure their own layer-3 services on top of the layer-2 device we've given them: they set up their own private subnet so they can talk to Lustre as well.
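
Inside the guest, that amounts to something like the following sketch: bring up the passed-through, VLAN-tagged virtual function on the private storage subnet and mount Lustre over it. The interface name, addresses, MTU, NID and filesystem name are placeholders, and the Lustre client is assumed to be installed in the guest.

```python
#!/usr/bin/env python3
"""Sketch of what a MonARCH-style guest does with the layer-2 device it is
handed (a passed-through, VLAN-tagged virtual function): bring it up on a
private subnet and mount Lustre over it. All names and addresses below are
placeholders; run as root inside the guest."""
import subprocess

VF_IFACE = "eth1"                  # the passed-through VF inside the guest
PRIVATE_ADDR = "172.16.200.45/24"  # placeholder address on the storage subnet
LUSTRE_MGS = "172.16.200.1@tcp"    # placeholder MGS NID (an RDMA setup would use the o2ib LND)
LUSTRE_FS = "monfs"                # placeholder filesystem name

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Layer 3 is the guest's problem: address, MTU, link up.
run("ip", "addr", "add", PRIVATE_ADDR, "dev", VF_IFACE)
run("ip", "link", "set", VF_IFACE, "mtu", "9000")
run("ip", "link", "set", VF_IFACE, "up")

# With the storage network reachable, mount Lustre the usual way.
run("mkdir", "-p", "/mnt/lustre")
run("mount", "-t", "lustre", f"{LUSTRE_MGS}:/{LUSTRE_FS}", "/mnt/lustre")
```
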
So MonARCH has now been in production for six months. They started from scratch, an entirely new cluster, and didn't bring the old users across, and in six months that's 150 total users, 50 active, and about 800,000 jobs in that time, across a number of different types of workload and domain; it probably resonates with people who have institutional facilities.

Now, some of the issues with virtualised HPC. It's not all beer and skittles; there are some points of confusion, I guess, around performance tuning and so forth. CERN have been a big community player who have done a lot of work and shared really well in this space, and if you want to get into the details (I'm not going to go low-level here, because for one thing we don't have time, and this is a beginner talk) go and have a look at their blog. Hypervisor features are one of the issues: there is a bunch of features that are great for general virtualisation workloads, basically stuff that Linux does. Kernel same-page merging can save you memory footprint, but it's not so good when you've got an HPC workload. Linux natively has a NUMA auto-balancing facility as well, since about kernel 3.8; that's interesting because libvirt and KVM also let you do some NUMA tuning, so there's potential for some interesting interaction there, which we're testing at the moment. Huge pages is another one. And EPT is a feature CERN mentioned in their blog; they recently published new information saying that after they rolled out turning off EPT, based on micro-benchmark results, they realised they had a problem, to say the least, because they were rolling that out across a hypervisor fleet of about 160,000 cores, I think. As for our benchmarks, at the moment we're just using LINPACK, which is a micro-benchmark, so there's a big caveat on that; benchmarking is quite hard to do on a real-world footing, and that's something I think the Scientific Working Group might be able to help with, in terms of common codes and that sort of thing.
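
Most of those host-side features are plain kernel knobs, so checking and flipping them on a hypervisor is straightforward. A minimal sketch follows, with the usual caveat that whether each knob helps or hurts is something to benchmark for your own workloads, and the exact paths and defaults can vary by distribution.

```python
#!/usr/bin/env python3
"""Sketch of inspecting/toggling hypervisor features that help general
virtualization but can hurt HPC guests. Run as root on the hypervisor;
whether each knob is right for your workload is something to benchmark,
not assume."""
from pathlib import Path

KSM_RUN = Path("/sys/kernel/mm/ksm/run")                   # 0 = KSM off
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")   # 0 = auto-balancing off
THP = Path("/sys/kernel/mm/transparent_hugepage/enabled")  # [always]/madvise/never

def show(path: Path) -> None:
    print(f"{path}: {path.read_text().strip()}")

for knob in (KSM_RUN, NUMA_BALANCING, THP):
    show(knob)

# Kernel same-page merging saves memory on dense general-purpose hosts but
# burns CPU scanning pages; on dedicated HPC hypervisors we'd rather it off.
KSM_RUN.write_text("0\n")

# Automatic NUMA balancing can interact with libvirt/KVM NUMA pinning, so
# it's one of the things worth benchmarking with it both on and off.
NUMA_BALANCING.write_text("0\n")
```
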

The other thing to know about is CPU capabilities, and that provides a really big boost: if you're not passing the host model of the CPU through to the guest, you're probably missing out on at least ten percent performance, and sometimes that means you need a newer QEMU/KVM as well. For example, we were running Trusty, but Trusty's QEMU/KVM didn't know about Haswell yet. CPU pinning is another one that gets you another five-plus percent, and then there's NUMA memory allocation policy, which I'll talk a little more about here. Here are some numbers we got on Trusty: a Trusty hypervisor running CentOS 7 guests, because the HPC facility actually runs CentOS at the moment, on a Dell R630, a two-socket machine with two E5-2680 v3s. The bare-metal performance is at the top; there are a couple of lines because we did bare metal on both CentOS and Trusty, and you can see they're closely grouped. The lines very close together just below are various KVM configurations, and they get 97 to 98 percent of bare metal, which is pretty good. The 86 percent numbers are configurations with no pinning or anything like that, where the only thing that's been done is passing the host CPU model through to the guest. One interesting thing to note on that graph is that one of the best numbers was obtained just using numad, without specifying any strict CPU topology or mapping into the guest, and that's quite neat because it means you don't have to muck around with a big set of flavors to get all these different configurations; you just let numad decide what to do. We're now moving on to testing that on Xenial, because we're upgrading quite soon. I didn't include any results yet because we've encountered some interesting issues that look like bugs; for one thing, numad is packaged in Xenial but libvirt there isn't built to support it.

The other major thing for HPC, I guess, is network I/O, and SR-IOV, as I mentioned earlier with the way we integrated Lustre, solves that to a major extent, and likewise for coprocessors, GPGPUs, that sort of thing. Hopefully I don't need to explain what single-root I/O virtualisation is, but if you do need to look it up, there's plenty of information out there.
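
Coming back to the CPU-model point, a quick way to confirm from inside a guest that host-passthrough is actually in effect is to check that the guest sees the host's instruction-set extensions. A small sketch, with the flag list purely illustrative since it depends on your hardware and your codes:

```python
#!/usr/bin/env python3
"""Guest-side sanity check: if the hypervisor passes the host CPU model
through, the guest should see the host's instruction-set extensions in
/proc/cpuinfo. The expected-flag list below is just an example for a
Haswell-era host; adjust it for your own CPUs and codes."""

EXPECTED_FLAGS = {"avx", "avx2", "fma"}   # illustrative, not exhaustive

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            guest_flags = set(line.split(":", 1)[1].split())
            break
    else:
        raise SystemExit("no 'flags' line found in /proc/cpuinfo")

missing = EXPECTED_FLAGS - guest_flags
if missing:
    print("guest is missing:", ", ".join(sorted(missing)),
          "- check the CPU model / passthrough settings on the hypervisor")
else:
    print("guest sees all expected ISA extensions:", ", ".join(sorted(EXPECTED_FLAGS)))
```
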
Just a final word on how we manage cluster deployment. We run a managed HPC facility on the cloud; this is not about giving users tools to stand up their own clusters, because, at least in my experience, we have maybe two people in our university who could go off and use that well for themselves, and even then I'm not sure it's the best use of their time. We have a managed HPC facility with software specialists who look after all of that, so it's probably the more efficient thing to do. The guys who actually run that HPC service gave me some notes here: they used Heat initially for cluster deployment and hit some rough edges with auto-scaling and with frequent updates to the cluster at scale, which might just be a maturity issue that will eventually improve. Slurm is really happy running in this environment; that's one of the problems with things like SGE, and I guess maybe why Slurm is becoming so popular in this space. And of course, images are no substitute for configuration management. Global file systems are quite hard, obviously, and the best, most performant ones don't do encryption and that sort of thing, so they want a strong relationship with their infrastructure-as-a-service provider, which is us.

That was a quick tour of how we're doing HPC on the research cloud. Have we got any time left for questions, if anybody has a burning one?

[Audience] On the pipeline development, to make it match your system, what sort of feedback loop

do you have there?

[Blair] So, the HPC team sits just across the hallway from us, and they work directly with users; they're very focused on engagement, actually. Generally, if there are any issues that may be infrastructure-related we hear about them pretty quickly, but we typically don't need to get too involved in that; we usually hear about patterns and things that we might need to support going forward.

[Audience] So you essentially run all of your HPC workloads inside your VM environment, and only come out to InfiniBand, to Lustre, via your back-end networking?

[Blair] Yeah, well, actually we're not using InfiniBand, we're using Ethernet, so we use RDMA over Ethernet. One of the maybe interesting architectural decisions we made was not to build two separate fabrics for this environment, and instead build a single resilient fabric, so all hosts are bonded and so forth. In the cluster whose build-out we're just finalising at the moment, which is a new part of the MASSIVE environment, that's 100-gig Spectrum gear, so we have multiple different speeds coming out of the guests, and they can do MPI over their network as well as RDMA to Lustre. We designed this so that our largest, say, MPI job is the size of a rack switch pair, because that's already well over 1,100 cores or so, which is big enough for our users. Yep, okay, thank you guys.