Ceph: The De Facto Storage Backend for OpenStack

Hello everyone, thanks for being here today, on the last day of the OpenStack Summit. I know it has been a really tough week, with all the parties and everything, so thank you very much for coming. I'm Sébastien and I work as a cloud engineer at eNovance. eNovance is a multi-cloud provider: we basically design, build and run cloud platforms, and we have several domains of expertise, among others OpenStack and Ceph. My daily job is mainly focused on OpenStack and Ceph, and I rotate between the operations and development teams. Aside from that, I devote part of my time to blogging; here are the links to my personal blog and the company blog, so don't hesitate to have a look at them.

During the next 30 minutes we will be discussing the Ceph integration into OpenStack. First I'm going to briefly introduce Ceph, for those of you who are not familiar with it. Ceph is a unified, distributed storage system that started in 2006 during Sage Weil's PhD. It's open source under an LGPL license, so no vendor lock-in, fully open. It's mainly written in C++, and it's basically building the future of storage on commodity hardware, which is quite good, because we don't have any restrictions: you can choose really diverse hardware to build your first cluster, and it evolves according to your own needs. It's also fairly easy to run a proof of concept and do tests, so that's quite good.

Ceph has numerous key features, such as being self-managing and self-healing. The main point is that it's a really dynamic cluster: if something goes wrong, if you lose a node or a disk, Ceph will just trigger a recovery process, because there are tons of health checks between the components, and as soon as the cluster detects that something is wrong it heals itself. It's self-balancing, which means that as soon as you add a new disk or a new node, the cluster is dynamically rebalanced and the data moves around. It's really built for scaling, because it's fairly easy to add a new disk or a new node thanks to the tremendous Puppet modules, Chef cookbooks and ceph-deploy; it's now really easy to deploy Ceph and to scale it.

Ceph is also really unique because it has a really cool data placement feature called CRUSH, which stands for Controlled Replication Under Scalable Hashing. It's a pseudo-random placement algorithm: every time we want to store an object in the cluster we compute its location with a fast calculation. We don't store anything in a lookup table; we always calculate the location, which makes it deterministic, and that's pretty good. It gives a statistically uniform distribution, and, as mentioned earlier, as soon as you add a new node the whole cluster gets rebalanced, so it's fairly easy to take advantage of all the hardware. It's also a rule-based configuration: the really cool thing with CRUSH is that you can logically reflect your physical infrastructure. The cluster has a map, and within that map you have the topology of your physical infrastructure, your nodes and your disks. On top of that topology-aware map you can build rules, and within these rules you can specify things like the replication count.
That's something really unique. The really cool thing is that you can have several hardware tiers: maybe some of your servers are SSD-based and others have SATA disks, and you can just say, OK, for this pool store all the objects on the SSDs, and for that pool store them on the SATA disks. It's quite useful to have such a good placement algorithm.
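Just to make that concrete, here is a hedged sketch of what such a rule can look like in a decompiled CRUSH map; the bucket and pool names (ssd, fast) are made up for the example, and the exact syntax may differ between Ceph releases.

    # excerpt of a decompiled CRUSH map (illustrative names)
    rule ssd {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd                        # a root bucket that only contains SSD hosts
        step chooseleaf firstn 0 type host   # spread replicas across distinct hosts
        step emit
    }

    # then point a pool at that rule
    ceph osd pool create fast 128
    ceph osd pool set fast crush_ruleset 1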

To give you the final big picture, this is how Ceph looks. Everything is built upon the RADOS object store, so everything is stored as an object, and on top of this we built several components, so you have several ways to access and store your data.

First you have librados. It's just a library to access the RADOS cluster, so you can basically build your own application and, from there, read and write all the objects. It's very easy to plug into librados because it has several language bindings: Python, C, C++, Ruby, Java and a lot of other languages (there is a minimal librados sketch a bit further down).

Then the first component built on top of it is the RADOS Gateway. It's just a RESTful API, pretty much equivalent to what Amazon S3 and OpenStack Swift do. It has multi-tenant capabilities, and it supports geo-replication and disaster recovery features as well.

The second component is called RBD, which stands for RADOS Block Device, and it comes in two pieces. The first one is a kernel module, part of the Linux kernel, so you can basically create a device and map it on your machine: you get a regular block device, quite useful, much the same way iSCSI works if you want. The second piece is the QEMU/KVM driver, so you can create images that are thin-provisioned and that support snapshotting and copy-on-write clones, full or incremental, which is really useful, and it's well integrated with libvirt and KVM.

Then the last part of Ceph is the distributed file system, called CephFS. It's a POSIX-compliant file system that supports snapshotting as well. It's worth mentioning that all the pieces are really robust, except CephFS, which is, let's say, not production-ready yet. But we're almost there, which is awesome, because everything else is already really good, so we're almost at the point of having fully unified storage.

Now, some first considerations for building your first Ceph cluster. This is really performance-oriented, but it's a general methodology for when you want to build your first cluster: the question is how to start. First of all you need a use case, and within this use case you have to establish several rules. Ideally you should be able to say, OK, I'm mostly doing IOPS, or I'm mostly doing bandwidth, or perhaps it's mixed, because this will drive the way you build your cluster: if you want more IOPS you want faster disks, and if you want bandwidth you might want a really large network bandwidth, for example. You might also want to establish some sort of guaranteed IOPS, so that ideally you can say, I want to deliver this amount of IOPS and this amount of bandwidth for each of my customers individually. Obviously that's really difficult to establish, but if you can, just do it.

You also definitely want to know whether you will use Ceph as a standalone solution or combined with a cloud solution, with OpenStack for example, because at some point, if you have performance issues, you really want to know how things are implemented. So first of all, know how Ceph works and how it's implemented in the software stack; then if something goes wrong, you roughly know where to look. That's something you want to consider as well.
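As promised above, a minimal librados sketch using the Python bindings; it assumes python-rados is installed, a reachable cluster, a readable /etc/ceph/ceph.conf and an existing pool called data (the pool and object names are illustrative).

    import rados

    # connect to the cluster described by ceph.conf
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # open an I/O context on a pool, then store and read back an object
    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello_object', b'hello from librados')
    print(ioctx.read('hello_object'))

    ioctx.close()
    cluster.shutdown()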
Then you need to establish the amount of data that you want to start with, and that's usable data, not raw. Ceph does the replication and you can specify the replica count, so you have to decide whether you want to start with two, three, four or more replicas. Ideally you also want to establish a failure ratio. When you build a cluster you don't really want high-density nodes: if you have 100 terabytes, for example, you don't really want to build three nodes with 33 terabytes each, because if you lose a node you have a lot of data to rebalance. So you need to establish a percentage of the data that you are willing to rebalance if something goes wrong, according to your performance requirements, because as soon as a recovery kicks in the cluster has to write a little bit more while the clients keep writing too, so that's definitely something you want to consider.
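As a back-of-the-envelope illustration of those two numbers (the figures here are made up for the example):

    # usable capacity vs raw capacity, and how much data a node failure rebalances
    usable_tb = 100                      # usable capacity you want to offer
    replicas  = 3                        # chosen replica count
    raw_tb    = usable_tb * replicas     # raw capacity to provision: 300 TB

    for nodes in (3, 10):
        per_node_tb = raw_tb / nodes
        # 3 nodes -> ~33% of the data to rebalance after a node failure, 10 nodes -> 10%
        print(nodes, per_node_tb, per_node_tb / raw_tb)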

You would also really like to have some data growth planning. If you know that, I don't know, maybe every six months you are getting 10 or 100 more terabytes, this will definitely change the way you build your initial cluster: maybe you will spend a little bit more money up front, but every six months it's going to be way easier to scale. And obviously you need a budget; I won't go through any considerations about the budget, but it definitely has to match all your requirements.

Then there are things that you should not do. Don't get me wrong, this is really performance-oriented, so obviously everything is doable, but if you want to avoid unnecessary troubleshooting you might want to follow these considerations. Usually you don't want to put RAID underneath your OSDs (the OSD is the object storage daemon), and the general recommendation is just to use one disk per OSD. Ceph already does the replication, so doing RAID 1 underneath just adds more replication: you lose space and it's not really efficient. You don't necessarily want RAID 1 only, of course; you could also do RAID 0 if you really want to burst your performance, but a degraded RAID breaks the performance, and if you don't have the right tools to monitor everything you might get into trouble, because the speed of your cluster tends to be the speed of the slowest disk in the entire cluster. So if you don't want to drop all your performance or get spikes, just don't do this. As mentioned earlier, you also don't really want to build high-density nodes in a tiny cluster, because you might have a lot of data to rebalance and then potentially hurt the cluster if there is too much data to move around.

We could argue about the last one: don't run Ceph on your hypervisors. As mentioned, this is doable, obviously, and at some point you might think that you could get way more performance, because if you have your storage layer and your hypervisor layer on the same machines you can access your cluster directly, so the first hop is really fast because it's local and the next one is only a little bit slower. But my main concern there is about memory, and also about the consistency of the platform. Usually storage servers only do storage, and hypervisors mostly do memory and compute. Ceph needs memory as well, because the more memory you have, the better filesystem caching you get, so in this case both of them require memory: Ceph wants memory and obviously the hypervisor wants memory, and in the end you just end up with a really huge battle for memory. But that's mainly an assumption.

Now let's dive into the state of the integration in Havana. Basically, what makes Ceph so good with OpenStack is that it unifies all the components. Originally it was present in Glance, then in Cinder, and recently in Nova, so it unifies everything: you have this single layer of storage and all the components are plugged into it. That's quite good, because you don't need a different storage solution for one component or another; you just have the same storage abstraction for all of them. So, Havana's best additions.
First of all, there was a complete refactor of the Cinder driver, so it now uses librados and librbd. This is really cool because we get better error handling out of it, and that's something we had to do; thanks Josh for doing that, by the way. We also have new features, like flattening volumes created from a snapshot. What happens on the backend is that if Cinder detects that Ceph is also the backend storage for Glance, then when you create a new volume it creates a clone of the image; if you don't want too much dependency on the chain of snapshots and clones, you might want to flatten the volume every time you create it. And we have a new policy about clone depth, which is related to what I just said: as you create more and more clones, at some point you just want to say stop, flatten the original image, and then continue cloning from there.
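To make that concrete, this is roughly what the configuration looks like; the option names are the ones I believe existed around Havana, and the pool and user names are purely illustrative, so double-check against your release.

    # cinder.conf
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = cinder
    rbd_ceph_conf = /etc/ceph/ceph.conf
    rbd_secret_uuid = <libvirt secret uuid>
    rbd_flatten_volume_from_snapshot = false   # flatten new volumes instead of chaining clones
    rbd_max_clone_depth = 5                    # flatten once the clone chain gets this deep
    glance_api_version = 2

    # glance-api.conf
    default_store = rbd
    rbd_store_pool = images
    rbd_store_user = glance
    rbd_store_ceph_conf = /etc/ceph/ceph.conf
    show_image_direct_url = True               # lets Cinder clone images instead of copying them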

Then, Cinder backup was already present in Grizzly, but the only backend was Swift; now we can do backups from Ceph to Ceph. You can back up within the same pool, which is not recommended because it's the same machines, so you don't isolate anything across failure domains, but a different pool can definitely point to different machines, and ideally you do disaster recovery with this feature: you have one location, and you have another Ceph cluster in another data center. It supports RBD striping, and the really important thing is that it's differential, so we actually already do incremental backups when backing up from one Ceph cluster to another. I know that yesterday there was a discussion with the Cinder guys about implementing an incremental API for backups, but it's already there if you use Ceph.

One of the biggest Havana additions around Ceph, for me, is the Nova libvirt image type. Originally this flag is set to 'file', which means that every time you create a new virtual machine you get a file on the filesystem under /var/lib/nova/instances/<instance uuid>, and this file is the root disk of your virtual machine. There is also a second implementation with LVM: you specify your volume group, and every time you boot a machine it creates a new logical volume and attaches it to the KVM process. Now, with the RBD image type, you specify a Ceph pool, a new RBD image is created, and it is connected to the KVM process, so you just boot all the VMs directly within Ceph. This is completely transparent: the user doesn't know anything about it. It was a really big requirement from the community and from our customers to be able to just boot everything within Ceph instead of always doing boot-from-volume, because it's kind of hard to automate boot-from-volume for everything. So now we can put everything directly in Ceph, which makes operations like live migration way easier. (The question was: is it only for KVM, or is it also compatible with Xen? Yes, sorry, it's only for KVM.)

As part of the Nova and Cinder additions too, we now support QoS, which is quite good because Ceph doesn't do any QoS at the moment, so every I/O request is throttled from the hypervisor itself. This is quite useful to allocate a certain amount of IOPS or bandwidth from your hypervisor, and it's bound to Cinder volume types, so that's good.

That's the big picture of today's Havana integration: we can boot a VM and it goes into Ceph, we can attach a volume through Cinder, and we can also do a Nova evacuate. That was the point from the question before: live migration is made easier when you have everything in Ceph, because you just have to move the KVM process and reconnect the link to the RBD image, so it's really fast. It's also fairly easy to trigger a Nova evacuate: if you lose a compute node you can do either nova evacuate or host-evacuate, and the disk is already on the Ceph cluster, whereas if you have the disk locally on the hypervisor it's quite hard to reboot the virtual machine elsewhere. The workflow is just as I explained earlier.
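For reference, a hedged sketch of the two pieces just mentioned: the Havana-era Nova flags for the RBD image type, and a Cinder QoS spec bound to a volume type. The pool name, type name and limit values are illustrative, so check the exact flag names against your release.

    # nova.conf (Havana-era flag names)
    libvirt_images_type = rbd
    libvirt_images_rbd_pool = vms
    libvirt_images_rbd_ceph_conf = /etc/ceph/ceph.conf

    # front-end QoS enforced by QEMU/libvirt, attached to a Cinder volume type
    cinder qos-create gold-io consumer=front-end read_iops_sec=500 write_iops_sec=500
    cinder type-create gold
    cinder qos-associate <qos-spec-id> <volume-type-id>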
We also have multi-backend capabilities, so as soon as we create a volume we just do a copy-on-write clone, and we can do RBD incremental backups to the second location. But the question is: is Havana the perfect stack? Well, unfortunately we are almost there, I would say; we are still missing some tiny features. The problem is that we were about to submit a new patch, and then the patch got rejected because we were just after the feature freeze.
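As a side note on the differential Ceph-to-Ceph backups mentioned a moment ago, this is roughly the kind of RBD primitive that makes them possible; a hedged sketch with illustrative image, pool and cluster-conf names, assuming the destination image already exists (for instance after an initial full export/import).

    # first pass: snapshot the source and ship everything up to that snapshot
    rbd snap create volumes/volume-1234@backup-1
    rbd export-diff volumes/volume-1234@backup-1 - \
        | rbd -c /etc/ceph/remote.conf import-diff - backup/volume-1234

    # later passes only ship the delta between the two snapshots
    rbd snap create volumes/volume-1234@backup-2
    rbd export-diff --from-snap backup-1 volumes/volume-1234@backup-2 - \
        | rbd -c /etc/ceph/remote.conf import-diff - backup/volume-1234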

So right now, when you create a new VM, Glance downloads the image and streams it into the compute node: you have to download the image on the compute node through Glance and then import it into Ceph, which is quite inefficient. But Josh has a patch ready in the pipe, and the idea is to do the same thing we already do with Cinder: when we create a new VM, if the image is already present in Glance, we just do a copy-on-write clone, so booting a new VM is really fast. That will probably land in a bug-fix release for Havana.

One of the things that is not implemented yet is Ceph-native snapshotting of instances. Right now, even if you boot a VM into Ceph, when you take a snapshot of the instance it's just a regular QEMU snapshot: the image goes locally onto the compute node and then into Glance. In the future we could just call a Ceph snapshot, so the operation would be almost instant. If you are in a hurry to go into production and you really want to patch everything, I think there are only three bugs, and Josh already built a branch for that, so you can fix everything already if you really want to.

Now a little bit about the roadmap: Icehouse and beyond. This is really personal, once again, but this could be the Ceph integration for Icehouse and maybe for the J release. Something that is missing, and that you might want, is the ability to store snapshots and images in different pools, because at some point you may want a replica count of two for the images, for example, but snapshots potentially contain customer data, so you might want a higher replica level, like three. This is something we definitely want to tackle, and the implementation is already making its way through review, so it's worth mentioning for the Icehouse roadmap.

Another thing we would like to see is volume migration support, because currently Ceph doesn't support volume migration in Cinder. Volume migration is basically when you want to migrate from one backend to another, from an NFS backend to whatever other backend, but it's not supported when you use Ceph. Something we could easily implement is the Nova bare-metal case. Bare metal is when you boot a new VM that is actually not really a VM; it's more of a compute node, a dedicated physical machine for your customer. Thanks to the kernel module we could just enable it, create a new RBD device and map it to the physical host (there is a kernel RBD sketch just after this part), so that could be really easy to do.

There is also this LFS implementation going on, which is basically an agnostic RESTful API that can talk to RADOS and to a Swift cluster, so you would have this LFS API talking to your Ceph cluster. It's not really a replacement for Swift or anything like that, but for OpenStack it means you could also use the object store from Ceph, from the dashboard or wherever, so you would get a complete unified storage solution, because Ceph would be just about everywhere.
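Going back to the bare-metal idea for a second, the kernel RBD path it would rely on is already usable by hand today; a hedged sketch with made-up pool and image names, not an actual Nova bare-metal workflow.

    modprobe rbd                                        # load the kernel client
    rbd create bare-metal/host-42-root --size 102400    # 100 GB image (size is in MB)
    rbd map bare-metal/host-42-root --id admin          # exposes /dev/rbd/bare-metal/host-42-root
    mkfs.xfs /dev/rbd/bare-metal/host-42-root
    mount /dev/rbd/bare-metal/host-42-root /mnt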
And potentially Manila support. Manila is an initiative from NetApp, I guess, and it's the file-system-as-a-service solution, so we could also add a new driver for CephFS and create distributed file systems for our customers, because that's also a really big requirement for legacy applications that need a shared file system. So this is the Icehouse roadmap; it's basically what I just said, summarized in the picture, just so you have it as a reminder.

Now, what's coming up on the Ceph side. For the next release, Emperor, we don't have that many really new fancy features, so let's jump directly to Firefly, which should be landing in February 2014. We get the cache tiering functionality, where you have this notion of hot and cold storage.

You can have a pool backed by a bunch of SSDs, everything goes into this pool first, and then periodically we flush the data to a backend pool of SATA disks when it is less requested. We also get erasure coding, which is more or less like RAID 5 done in a software-defined storage fashion, so you can really reduce the space your data uses compared to plain replication. There is also ZFS support for the filesystem of the OSDs, which is quite good because ZFS is really solid: it means we can use the parallel write mode with the journal, the same thing we are supposed to do with btrfs, but since btrfs is not production-ready we can't use it today, so having ZFS support is definitely good. And obviously we will make every effort to fully support the OpenStack Icehouse release; it's on both the Inktank and the community roadmaps.

So that's it for me. Thank you for your kind attention, and if you have any questions, it's time for questions. I went pretty fast, so we have about 15 minutes.

Yes, it's more of a Nova thing, but yeah. Actually, for the copy-on-write cloning, every time we do this you must ensure that the image is in raw format. If the image is already in Glance and it's already in raw format, then yes, it does copy-on-write, because we do a clone. What happens is that as soon as we store the image into Glance, the image is snapshotted and protected, and then from that specific snapshot we just run a bunch of clones every time we create a new virtual machine.

I'm not really familiar with LFS, but as far as I know (and if someone knows more about this, feel free to jump in and explain), it's basically an agnostic RESTful API that talks to whatever object storage backend you have: you just make a request to LFS and behind it you can have either a Swift object store or a RADOS object store. It's a way to unify all the APIs and have a single abstraction layer, no matter which object store you have underneath.

Yeah, I really have no idea; I don't think so, and I don't really know the status nor the progress of the implementation, so you should ask the GlusterFS guys, because they initiated the initiative.

OK, yes, you second, and then you. Yes, encryption: I know there is something, but it's Ceph-specific, and there is also something on encryption on the OpenStack side, but I barely looked at it, so I don't have the answer about encryption, sorry. I know there is a module that may have landed for Havana, but I'm not sure what it covers; I just don't know. Encryption is always really complex because you don't even know where to store the keys and everything, although there is a project, Barbican or something, that does the key management.
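On the raw-format point from that first answer: if your image is in qcow2, converting it before uploading keeps the copy-on-write path available; a hedged sketch with illustrative file and image names.

    qemu-img convert -f qcow2 -O raw precise.qcow2 precise.raw
    glance image-create --name precise --disk-format raw --container-format bare \
        --is-public True --file precise.raw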

OK, that was you, and then... That's a good question. I mainly work on Debian-based systems, so I don't know about any kernel recommendation; maybe you know, Sage? Well, thanks; I think Sage is definitely the person to ask. Any more questions? Yes.

No, there is no single point of failure, because it doesn't work like Swift, for example, where you have this proxy that requests the object servers: you directly talk to the object servers, so you don't have any single entry point to retrieve your data, you just talk to the object servers directly.

OK, the question was about the largest Ceph installation. As far as I know it's around 5 petabytes, at DreamHost, more or less. Any more questions? Yeah, sorry, what is coming, wait, what is more coming? No, it's really flexible, so there are no specific use cases for that.

Yeah, so: how many disks should I put into a single machine? You have to take many things into consideration: what is your network bandwidth, and what do you want to achieve? From an IOPS perspective you can pack a lot of IOPS on a gigabit link, so with half an SSD your gigabit network is already full, and if you have 10 gig you are pretty much unlimited on IOPS. In terms of bandwidth, if that's the performance you are after, keep in mind that a single enterprise SATA hard drive can pretty much fill a gigabit link. If you go with a 10-gigabit network you have to think about it differently: you can roughly deliver 1.2 gigabytes per second. Also, Ceph has a really specific design: you first write to a journal and then flush the data, so this more or less splits the IOPS and the bandwidth in two. The general recommendation is 12 disks per machine, but you can go to 24; that's only the theory on the bandwidth, but with 24 you can fill the entire link, and just make sure the server's RAID controller supports it. You may want to use an SSD for the journal just to burst the performance and avoid any impact: if you hit a spinning disk directly you write the journal at, say, 50 megabytes per second and then flush at 50, whereas an SSD just absorbs everything. But one more time, don't put too many OSDs on a single SSD: base your calculation on what an enterprise SSD can deliver, say 500 megabytes per second of sequential writes, because the journal writes are purely sequential.
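A rough back-of-the-envelope sketch of that journal ratio (the throughput numbers are illustrative, not measurements):

    ssd_seq_write_mb_s = 500    # sustained sequential writes of one journal SSD
    disk_write_mb_s    = 120    # what one spinning OSD disk roughly sustains
    gigabit_link_mb_s  = 125    # 1 Gbit/s link expressed in MB/s

    osds_per_journal_ssd = ssd_seq_write_mb_s / disk_write_mb_s
    print(osds_per_journal_ssd)                   # ~4 OSDs before the journal SSD is the bottleneck
    print(gigabit_link_mb_s / disk_write_mb_s)    # one disk can nearly saturate a 1 Gbit/s link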

So if you put, say, four OSDs on one journal SSD, each one gets a little bit more than a hundred megabytes per second. You need to establish your own ratio for that, but traditionally it's something like six OSDs for a single journal SSD, which leads to the 12 disks in the end, and then you have already filled your gigabit bandwidth. I don't know that much about InfiniBand, I haven't really worked with it, so that would change everything if you go to InfiniBand, but with a gigabit link, that's the answer.

Yeah, that's going to be on the Internet: I'm going to post the slides on SlideShare and they will be tweeted by eNovance, so you will definitely find them, or I'll send you an email. Yeah, they are going to be on the Internet soon.

Yes, the erasure coding ratio? Sorry, what's the question exactly? I mean, I'm not doing anything on it myself, I'm not the main developer of the erasure coding, and I'm not sure exactly how it handles that, I don't know, but once again Sage knows about it. We have two more minutes if you have any more questions... OK, thank you very much everyone.