Rethinking Distributed State Management in Networks

>> Thank you. So, today I'm going to talk about a project I started a while back at Berkeley; this is joint work with Aurojit Panda and Scott Shenker. A bit of background before I get started. We have been struggling over the past 10 years, since SDN was first proposed in its current incarnation, to address the problems and issues around state management. For every single project we had, pretty much, designing controllers, scaling controllers, fault tolerance for controllers, and the same repeat of the same issues for network function virtualization, we always struggled to pinpoint what it is that the system needs to do for state management, and we always came up with ad hoc, de novo mechanisms to solve that specific problem. Then, around last year, we had had enough, and we said, "Okay, we have looked at this problem enough times that we should now have a good understanding of what the requirements are and what it is that we want out of state management for network function virtualization." So, with that, let me first tell you what SDN and NFV are about; I assume you all know, so sorry about the repeat. Classically, we build networks out of a composition of hardware boxes, each implementing an individual function like routing, switching, firewalling, WAN optimization, and so on, and these boxes bundle control and data planes together in a single box. With SDN and NFV that changed, in the sense that functions are now software running on a pool of shared infrastructure, and there is a separation of control and data planes. This brought about many nice things, the flexibility we expect of software and getting away from the rigidity of managing API-less hardware boxes, and it brought a lot of challenges with it; the ones that are probably most emphasized are performance, scalability, fault tolerance, and issues around those. So, what was SDN and NFV's impact? Obviously they had a lot of impact in the networking industry, but the impact reaches far beyond. Besides network virtualization, where pretty much everyone builds on the same concepts, the SDN and NFV software ecosystem offers an evolutionary path for hardware-software co-design: your features first emerge in software, you have time to think through what it is you really want, you put the features there, they mature, and once they are proven, you can take the proven pieces of a feature and move them to hardware. You can probably first experiment with FPGAs and, if it's really worthwhile, move it to an ASIC, and so on. Another thing that has happened in the past nine years, interestingly, is that we learned a lot of lessons from SDN and NFV, and these lessons, the same principles and arguments that were made in favor of SDN and NFV, were applied to different systems to build better systems. So, we have a whole bunch of software-defined projects, software-defined X. We have storage stacks, for example the project at Intel, SPDK, which is inspired by DPDK but for the storage stack. The key-value stores people are building nowadays, I think of them as network functions built with the same kind of low-level principles; that's very much inspired by SDN and NFV. So, what are network functions?
Network functions are pieces of network functionality built in software. Some people want to really emphasize the V in NFV; we are not among them, so network functions are just pieces of network functionality wrapped in software, and there may or may not be a separation of control and data. I assume here that there is an external controller that is orchestrating the whole function. So, what are the traits we assume these network functions have? One is that distribution is the norm. To meet traffic demand, one box and one core is often not enough; you need enough instances to meet the traffic demand, and for fault tolerance the same reasoning applies: we need more than one box. Another trait is that NFs, SDN controllers, and so on are stateful. What are these pieces of state? Flow tables, configuration, statistics, and things like that. And this is a very beautiful view of SDN and NFV, that you can think of the state as logically centralized; unfortunately, physically it is all over the network, all over your instances, because there are different producers of the state, and because the consumer and the producer of the state are different, we need to replicate this state. A final property that distinguishes network functions of today from what we did for packet processing in software in the 90s and 2000s is a different expectation in terms of throughput, performance, and latency. We went from tens or hundreds of thousands of packets per second to millions of packets per second, and from millisecond timescales to microsecond timescales, and this was enabled not only by the software stack becoming better, but by the hardware, let's say commodity x86 processors, becoming much better at doing I/O, and that trend continues: the hardware is getting better, the software is getting better. What makes scaling and fault tolerance challenging in this context is the state, and while you might argue the same is true for lots of distributed applications, there are some unique challenges here. If we didn't have any state in these network functions, scaling out and fault tolerance could be as simple as putting up more boxes and having a load balancer spread traffic between the instances, but the problem, back to the first point, is the stringent performance requirements that are put on NF boxes and SDN controllers. To meet those performance requirements, there have been many research papers arguing that state could be local, state could be remote, let's build systems that make this transparent and easy, so people can migrate state, access state remotely, and so on. But the bottom line is that to meet the level of performance we expect of NFV, the state must be local, must be in DRAM, and no farther out than that.
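To make the point about local state concrete, here is a minimal sketch, not taken from the talk, of the kind of per-flow state an NF touches on its fast path: a packet counter kept in a local in-memory table. Every packet costs one local lookup and update; if this table lived anywhere but local DRAM, each packet would pay a detour to reach it.

    /* A per-flow packet counter kept in a local in-memory table -- illustrative
       only.  The table is the NF's "state" (statistics keyed by flow); the point
       is that the fast path touches it with a single local memory access. */
    #include <stdint.h>

    #define TABLE_SIZE (1 << 16)

    struct flow_entry {
        uint64_t key;      /* flow identifier (a 5-tuple hash, simplified here) */
        uint64_t packets;  /* per-flow statistics */
    };

    static struct flow_entry table[TABLE_SIZE];  /* local, in DRAM */

    static void count_packet(uint64_t flow_key)
    {
        struct flow_entry *e = &table[flow_key & (TABLE_SIZE - 1)];
        if (e->key != flow_key) {   /* naive collision handling: evict */
            e->key = flow_key;
            e->packets = 0;
        }
        e->packets++;               /* one local write per packet */
    }

    int main(void)
    {
        for (uint64_t k = 0; k < 1000; k++)
            count_packet(k % 37);   /* stand-in for a stream of packets */
        return 0;
    }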
Two things. One is, why does this make sense? So, the volume of state in networks, and the volume of state changes, is much, much smaller than the volume of packets you would need to detour to reach this state. It makes sense to make the state available on-path for the packets rather than detouring and forwarding packets all over the place to reach the state. Another property is that we have many, many reads; as expected, we have a much higher volume of reads compared to writes. And the state itself, at the end of the day, unlike in big data systems, is not that big. I think the biggest number I remember, from a start-up company back in the day, was on the order of gigabytes of state, when they were compiling down high-level policy rules to individual flow table entries. For a large network, the state boiled down to gigabytes, on the order of gigabytes of aggregate state in the network. It's not that we have terabytes or petabytes of state. Yes? >> [inaudible] >> I probably didn't catch that, but it was for a sizable data center and the product was for network virtualization.

So, nothing more flexible than that. >> So most of that state is [inaudible]? >> Mostly it is instantiating and bootstrapping virtual networks for different tenants. It was basically a multi-tenant data center, and the state is basically forwarding rules for different tenants, building the layering, and dealing with encapsulation and things like that. This number is a few years old; people might have bigger state sizes now, but the point holds that the state I'm talking about is not gigantic, it is something quite manageable. For individual network functions, from a recent work we had, the working set size of most NFs is on the order of kilobytes to megabytes. The largest working set size I remember was that of Snort, which was on the order of thirty-something megabytes with a sizable rule set. >> Can you say something about the requirements on conflicting writes? There may not be many writes, but if two writers write to the same thing, what are the requirements on that? And does it differ in this context from other distributed state contexts? >> That is a big part of the talk going forward, so I will get to it. It is mainly an argument: what I am trying to argue is that, unlike a general-purpose data management layer, we can get away with having single writers; we can partition the state such that there are single writers. So, this is a big part of the argument, and that is, I think, up for debate, but I'll get to it; it's a very good question. Pardon me, the second point I would like to make is that bandwidth is improving day by day, latency is not. We are adding more and more parallelism, we are getting higher and higher bandwidth links, but latency is not improving much. The software stack is improving; there are things like RDMA that cut a lot of the overhead that we now know where it comes from, but at the end of the day, cutting that latency further down is a very, very challenging thing, and that's not a trend we're seeing with the technologies of today. So, the scalability and fault tolerance challenges: basically, the first point entails that we have replication at massive scale. We need to have the state replicated across wherever it is needed, and we want to keep up with this synchronization and massive replication, keeping the replicas in sync, without impeding performance. What current SDN controllers and NFV solutions often do, and I'm not saying everything and everyone, but most products, is take the same solutions developed for general-purpose distributed applications, apply them to networks, and hope that it works. What we are doing here is trying to think from first principles about what it is that we require and build a clean-slate solution. So, the view that we want to provide is what Scott promised back in 2011 for SDN: let's have a logically centralized view of the network and just operate on it. Let's decouple. What I mean by clean modularity for state management is making state available when needed in the form of snapshots, and developing NFs with state and compute decoupled. By decoupling state and compute, I don't mean placing them in different places, but logically thinking of the state as something available on the box that is doing packet processing, working on a snapshot of the state, and developing it as a single program, not worrying about the picture on top, that NFs are on different cores and different boxes with this much latency to each other, and so on. So, the view at the bottom is where we want to go: each and every box has this snapshot of all the data it needs to operate on.
>> Do we even need the picture? The snapshot that you are saying we should work with, as if it were a single global state that we instantaneously modify, is that the same thing as what you're saying? >> Not quite, because those views of S that I have on different boxes are different. The reason we want the snapshot there is that if you don't put any constraint on it and say, "Oh, everything is changing all the time," it makes reasoning about applications a lot harder. But making everything instantaneous, so everyone sees the same view, there is no way to get around the latency that entails: basically, we would have to update the replicas in real time in the fast path, or invalidate the cache entries in the fast path, and both of those incur a high latency which, our conjecture is, is not acceptable. So, where do we go with this? What we propose is basically revisiting and rethinking distributed shared memory. Distributed shared memory has a lot of connotations attached to it, the research from 20 years back, and lots of people told me, "Don't use that term," but to give you a sense of what it is we are trying to do, I'm using it, and you can probably think of our work as a marriage of the concepts from DSM, just the notion of having shared memory, with what people do to have isolation between reads and writes, something like multi-version concurrency control or, in Linux land, RCU, and things like that. We want to put these two concepts together. So, the form it takes is that we have a network fabric that connects nodes; there is a layer of shared state on top, which is our system; it has weak semantics, which I'll get to, to make working with this shared state easy. If you want building blocks that abstract away these semantics and make it easy for programmers to write applications on top, you can build data structures, barriers, locks, durable snapshots, or even consensus on top, and on top of those, build the applications. So, this is the set of applications we have been experimenting with in this context. Before I present the design, I will tell you about a few of the traits and characteristics this semantics has and how that is enabled. The semantics I'm going to talk about, we believe, is a good match for a broad class of distributed programs. We started from network function virtualization and are looking at a broader class of applications, and we are trying to work around what you might think of as the deficiencies and drawbacks of traditional, canonical DSMs. So, first, all threads can read whatever state they might require locally, but update visibility is delayed. Reads are fast, updates might be delayed, and how we do it is by versioning memory regions: pieces of the address space may be at different versions on different threads, and obviously the reads might be stale, so we are putting availability above all in this system, but our target is to update these snapshots at the millisecond timescale with very low overhead. Synchronization in the system is eager, it is fast, and it incurs very minimal overhead in the fast path. How we do it is bulk, data-oriented synchronization in the fast path instead of compute-oriented synchronization: instead of copying commands, replicating commands or function calls and messages, we are replicating the results of the computation. Synchronization has some expensive parts, so we split it into two parts, what we call internal and external synchronization, to offload the slow part of synchronization to the slow path. What this entails is that the apps now have to organize their state into single-writer memory regions.
This is a big constraint, but we believe it is something we can work around in this context. It enables conflict-free, data-oriented synchronization. What is wrong with data-oriented synchronization otherwise is that if a conflict occurs, you have no way of resolving it, because the semantics of the application is lost. To work around having single writers, there are many techniques: you can use delegation, you can use CRDTs and combine, you can use flat combining, or, if you really need to, you can implement multi-writer on top. What that entails is that you have to know how the logging and the snapshotting work in the system, and work around their constraints to implement multi-writers on top. >> [inaudible] Is that something that the programmer is required to get right, and if they get it wrong, they just go home? >> It is very easy to enforce at the page level by marking pages as read-only. We don't do it in our current implementation. For the reader threads, when they map an address space, you can always tag it as a read-only space; there are new CPU features that make it easy to do this from user space at run time, not as a security property, but as a debugging mechanism, so that if there is such a violation it will be caught. But in our current implementation we are not doing that; we assume the application developer is aware of the single-writer areas they own, and that they will be the only one writing to them. >> If they do write to them, then unspecified bad things will happen? >> No, not necessarily, because the read-only region is only visible to the threads on that machine; that update will not propagate throughout the network. We simply don't collect updates from read-only regions; that is a side effect of the implementation. So no, bad things may or may not happen. So, the basic building block is the notion of areas. It's the unit of memory allocation. When you allocate a new area, you get a contiguous [inaudible] at a fixed address, so if the address is X on thread one, it will be the same on every thread that maps it; you can use your native data structures, lay them on top, and it would work. These areas are versioned and timestamped, and versioning entails that we must update them atomically, so we guarantee that they are atomically updated, and a thread is able to define a staleness bound: okay, this area is stale by at most 10 milliseconds. It does not directly give the application the ability to know whether this is the latest state; that is a limitation. Areas are organized in a hierarchy, so each area has a parent: even though they live in a flat address space, we define that each area we allocate has a parent. The reason for that is fault tolerance; it puts someone, some thread, in charge of failure detection for specific areas. >> The parent is a thread, not another area? So it's not a [inaudible] of areas, is that right? >> The hierarchy is of areas, not threads, but the thread that owns the parent is in charge: whoever is your parent's owner is in charge of detecting if you died. >> Okay >> Does it make sense? So if it happens that the parent dies and somebody else takes over and becomes the new owner, that thread is in charge of failure detection. >> Each area has a parent that is another area? >> Yes >> And each area has an owner that is a thread?
>> Yes, that is true. >> Okay, that's very helpful. Just as you said, they are somehow atomically updated; how are they- >> [inaudible] >> How are the boundaries of the transactions defined? >> It's explicit. Synchronization, which I'll get to, is explicit, so it's very much like RCU or any other similar mechanism: you say that I'm out of the critical section, I'm out of the read region or write region, and all the threads frequently call into a service function whenever they are in a safe state, which signals that it's now time, I can enter a synchronization phase if you need me to. I'll get to it in a bit.

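As a concrete illustration of this flow, here is a minimal writer-side sketch. The names tv_area_alloc, tv_log_write, and tv_service are placeholders assumed for the kind of interface described in the talk, not the actual API:

    /* Writer-side sketch.  tv_area_alloc(), tv_log_write(), and tv_service() are
       assumed placeholder names for the kind of interface described in the talk,
       not the actual API; declared here only so the sketch is self-contained. */
    #include <stddef.h>
    #include <stdint.h>

    void *tv_area_alloc(size_t bytes, const char *name);  /* area at a fixed address */
    void  tv_log_write(const void *addr, size_t bytes);   /* record a written range */
    void  tv_service(void);                                /* safe point for sync */

    struct counters { uint64_t pkts; uint64_t bytes; };

    int main(void)
    {
        /* Single writer: this thread owns the area; readers attach elsewhere
           and see versioned, possibly slightly stale snapshots of it. */
        struct counters *c = tv_area_alloc(sizeof(*c), "nf0/stats");

        for (;;) {
            /* ... process one packet, update the state ... */
            c->pkts  += 1;
            c->bytes += 64;
            tv_log_write(c, sizeof(*c));  /* make the write visible to sync */

            tv_service();  /* out of the critical section: sync me if needed */
        }
    }

The daemon described below decides when a service call actually turns into a bulk synchronization phase; the application only promises to call it frequently while outside critical sections.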
So, explicit membership is another notion. In this picture, we have area zero owned by process A; processes B and C are attached to it, and they are at different versions. You can see that on machine A, for area 2, only process B is attached; process A is not subscribed to that area. What this mechanism provides us with is scoping: we don't get all the updates for all the areas that exist in the address space, so you can partition and limit the update traffic. The building blocks, at a very high level, because I'll get to the details in a second, I will just briefly mention. For write tracking we do software write tracking, something very simple, with user-defined granularity: at compile time, you define what granularity of write tracking you want. We experiment with 64-byte granularity because that's what the architecture we're working on is very [inaudible] for. Every time there's a write to an address, we expect the application to say "log write": I wrote these many bytes to this address. The application can choose to do one log write for many subsequent writes, or it can choose to defer that log write, to make the write visible at a later time; all that flexibility is with the application. The reason we are not using hardware mechanisms is the granularity: hardware right now provides PML, Page Modification Logging, but it's at page granularity. We want to go at the granularity the user wants and needs, and as we show, the overhead of doing it in software is really low, so there is no reason to go with hardware. >> Again, if you fail to advertise [inaudible], unspecified bad things can happen? >> No. It could be your intention not to make the write visible yet. Imagine, we have some machine learning use cases like this, where there is no reason for me to make this write immediately visible; it might be overwritten as we go on, so I will choose to only make it visible at the end, one second later, and only call log write then, but the burden is on me. It's a way for the application developer to optimize, and things like that, but that's a good question. So, log write is invoked by applications, and in one specific case by the daemon. Synchronization, the main part, is scheduled by a daemon. We have a daemon running on each machine, and synchronization happens at user-defined intervals: there is an internal synchronization and an external synchronization, and the user defines, say, internal synchronization every five milliseconds and external every 20 milliseconds. The internal one is really simple: for each area we have two copies, and internal synchronization is walking the log, having all threads cooperatively walk the logs for all the areas they are subscribed to in the system, and cooperate to copy changes from the writer copy to the reader copy. During this phase we are stopping the world, not doing any processing, so it's very important to make it fast. The external one is the part where we push updates from our machine's read-only view out to the network: packetize the updates and send them over the network to everyone else who is subscribed. Area management means allocation, deletion, ownership management, and things like that. There is a set of management features, mainly implemented by the daemon, to help with discovery, bootstrapping, failure detection, and recovery. The last one is mainly done by the daemon.
This is a DSM system, but at the end of the day, for timely notification, it should be obvious by now that there is a lot of delay between things. So if you're bootstrapping your system and you depend on lots of teeny-tiny operations at initialization time, it may take a long time. So we built a lightweight RPC mechanism that lets us implement a doorbell mechanism, or lightweight notification.
At a high level, this is just to go over what happens at run time and what the system looks like. What we assume is that the daemon first comes up and initializes Tasvir; a daemon on another machine comes up, initializes, and joins whoever the root of all these areas is; and the threads come up, each thread calls initialize, and there is a new area allocation. This area gets allocated on machine A. Similarly, on machine B, we attach to this area to start receiving the updates. So machine A, during the normal course of operation, whenever there's a write to the area, will call log write, that's phase 1, and it will continuously call service whenever it's in a safe state, out of the critical section. The daemon is going to schedule and basically flip a bit for a thread so that it now enters a barrier for bulk synchronization. When it is time, we enter phase 2, during which all the threads cooperatively walk that log and copy the changes from here to here. After this, the applications move on, but the daemon in the background, when it's time for external synchronization, will go over the log maintained internally for the area, copy all the changes, basically packetize them, and send them out over the network. >> Can you have multiple threads writing to the same area? >> By default, no, that's what we assume; there's only a single writer thread. Threads can attach to the writer view or the reader view; the default is that only the owner attaches to the writer view and everyone else attaches to the reader view. >> [inaudible] logs and then transfer these to the read copy, did you mean-? >> They all can see the log; they can all see the reader and writer copies. There is no- >> When you say they do it cooperatively, what does that mean? >> Each thread in the system has both of these copies mapped in its address space; depending on whether you're a writer or a reader, you basically swap the mapping so that the area- >> Who applies the log to the read copy? >> Every single thread. Every single thread that enters bulk synchronization; we chunk up the log, the daemon chunks up the log, and each thread walks a part of the log, finds whether anything changed, and copies it over, regardless of whether it's a reader or a writer. So during that bulk synchronization phase, every thread is going to read from the writer copy and copy to the reader copy. >> [inaudible] is on the log, you have some kind of version, you said, not to apply a copy that has been overwritten by [inaudible] >> We have very specific alignments that make sure of that. We give, let's say, two-kilobyte chunks of the log to each thread, and handing out disjoint regions of the log ensures that the threads will not overstep each other at any point in time. So we are basically giving different- >> [inaudible] >> Yes, they are partitioned by the [inaudible], but there are lots of details here and we can talk about them afterwards. So, upon external synchronization, when the data is received on the remote machine, the receiver acts as if it were the writer: on a scratch area it logs the writes for the received packets, and a similar internal synchronization makes the data available on the remote machine. So, are there any questions about that? It was a bit of a rush.
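The per-thread work in that bulk phase amounts to walking a chunk of the log and copying the dirty 64-byte lines from the writer copy to the reader copy. Here is a rough sketch of that step, with an illustrative log layout rather than the real one:

    /* The per-thread step of internal synchronization, roughly: take a disjoint
       chunk of the write log and copy the dirty 64-byte lines from the writer
       copy of the area to the reader copy.  The log layout is illustrative. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define LINE 64  /* the write-tracking granularity used in the experiments */

    struct log_entry { size_t offset; size_t len; };  /* one logged write */

    static void apply_log_chunk(const struct log_entry *log, size_t n,
                                const uint8_t *writer, uint8_t *reader)
    {
        for (size_t i = 0; i < n; i++) {
            /* Round the logged range out to whole lines and copy them over. */
            size_t start = log[i].offset & ~((size_t)LINE - 1);
            size_t end   = (log[i].offset + log[i].len + LINE - 1) & ~((size_t)LINE - 1);
            memcpy(reader + start, writer + start, end - start);
        }
    }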
>> Would it be right to say that the internal synchronization establishes a consistent snapshot that the daemon sees, and the external synchronization is like the daemons all talking to each other to make sure they swap snapshots? >> Yes >> Is that mostly right? >> Yes. >> Is it consistent across the regions? >> Well, yes- >> Across areas, sorry >> For the areas owned by threads on the same machine, yes; not for threads on different machines. >> So, you actually wait for everyone to be out of the critical section? >> Yes. So, here the run time is that of the slowest thread: if somebody is taking longer to copy all their changes, everybody has to wait for it to be safe to exit. So, it's always consistent in the sense that there are no torn writes, ever, in the system.
How we ensure that is by stopping the world: there is no point in time where we are doing a read and a write to the same area at the same time in an unsafe manner, but we are stopping the world for that. The way to make it faster is with cooperation, with data-oriented synchronization, and so on. >> [inaudible] question. So, if I'm a reader and I read multiple regions, I only see a consistent snapshot if all the writers of those regions were on the same machine? >> You see a consistent snapshot of all the state in the system in this sense: you see the latest, or almost the latest, version of the areas hosted on the same machine, but for the areas whose owner is farther away from you, you'll see a more stale version. For each area, the content for a given version number is the same across the network, but areas with different version numbers are each self-consistent; there is no consistency violation within an area. >> What about readers wanting to read different areas? Do they see a consistent snapshot across the network? >> The way we define it, it is consistent in that there is no torn write for individual areas, but they are not necessarily seeing the same snapshot; each is seeing a consistent snapshot. The version numbers, the Vs in the picture I showed earlier, might be different depending on where you are. If you are, let's say, in China, and this is just an extreme example, seeing an update made in the US, the view of the snapshot of the state in the US that you see is very different from the one someone in the US sees, just because of the latency you have. If some areas are updated in China and some areas are updated in the US, for half of them you have a more recent view than for the other half. I hope that makes sense. >> I think what you're saying is that you don't have a mechanism for transactions across areas, right? >> No. This is much weaker semantics than providing transactions for things that are not hosted on the same machine. >> [inaudible] transactions if you want within an area >> One thing you can think of is that you can reclaim ownership of an area: if you have two areas and you are the owner of both of them, you can safely write to them and make sure the writes become visible everywhere at the same time, but in the general case where the writers are in different places, the system doesn't do anything about it. You can build slower mechanisms on top to provide transactions or stronger forms of multi-version concurrency control, but that's not what the system does; you're absolutely right about that. So, the question that should be popping into everybody's head is, what is the overhead?
It looks like it should have quite a bit of overhead. There are three main overheads, ignoring the biggest, the elephant in the room, which is that we are doubling the memory requirements of the application, because we maintain two copies of each area in the system. So, if the original state was three megabytes, now it's six megabytes, because the readers and the writers are looking at different copies. The first part of the overhead is logging, the software write tracking, which is step 1. Then there is internal synchronization, which is basically walking the log and copying all the changes. Finally, there is the indirect overhead: this process induces some LLC contention and has its own memory footprint, so summing up logging and internal synchronization is not enough; it's not enough to say that they have a very low synchronization overhead. >> [inaudible] synchronization too, which costs- >> So, that shows up as indirect overhead, but it doesn't directly show up for applications, because the background daemon does it asynchronously. >> Is that overhead? >> Yes, it is; it shows up here, you're right. For every change we are making three copies, essentially: we have it in the reader copy, we have it in the writer copy, and we have a buffer structure to packetize it and send it over the network, so at the end of the day we might have at most three copies. What we see in practice is that logging is quite light; this is true for realistic workloads.

So, logging takes roughly 20 to 30 cycles, plus, very likely, reading a cache line if the log is not cached, so roughly around one percent run-time overhead. This is compared to a single-threaded application with no synchronization whatsoever. For the internal synchronization part, where we're walking the log: the log I talked about is not hierarchical, it's just flat; if the application calls for it, we can make it hierarchical, so a log of the logs as well, but as is, with 64-byte granularity and no changes, just walking the log takes one microsecond to cover one megabyte of address space. So if you have one gigabyte of address space, it takes one millisecond; but if there are modifications, depending on whether it's in the fast path or not, it takes between 100 microseconds and one millisecond per megabyte per core. Both numbers are per core. The nice thing about it is that because it's cooperative, it scales nicely in most cases, near linearly with the number of cores. So, if you have one writer aggressively writing and updating, but you have 24 threads on the machine, 24 cores, they will all cooperate to copy the changes, and your throughput can jump from, let's say, 1 to 10 gigabytes per second up to 20 to 60 gigabytes per second, depending on how many threads are cooperating to make that internal sync work. >> From my extremely limited experience, it's not so much the time for the internal sync, but the time for everybody across the whole data center to get ready to do [inaudible] >> Yes >> So the barrier bit- >> Yes >> That can dominate this part >> So for the application, that is a big constraint on the kind of applications we support, in that the critical sections should be really small and they should be frequently calling into the barrier bit. What really happens in our design is that we have a grace period; let's say wait for ten microseconds, it's a compile-time constant: wait for ten microseconds, and if you didn't succeed in entering the barrier, just let go and try again in 10 microseconds. In our experience with most of the applications we tried for packet processing, even with a large number of threads, they can enter the barrier successfully. >> [inaudible] the entire data center? >> No. This is on a single machine >> Oh, I thought this was internal- >> Oh! It's on a single machine >> Barrier synchronization, they all had to do it at once >> No, it's on a single machine >> Oh, that's good >> On a single machine.
Sorry, that would have been horrible. >> I was a little bit worried >> Yes, that's on a single machine, just synchronizing all the cores to enter a barrier together, because essentially, if somebody else is reading a copy on a remote machine, there is no way it will conflict with this copy; this and this have nothing to do with one another. That's a good point that I should have made clear: the dividing line is machine boundaries. Internal synchronization is really internal; external synchronization is what goes across machines. So, the indirect overhead for realistic workloads we observed to be around 10 percent. All these numbers could be much higher if we go with a very aggressive, adversarial workload, which is the next slide. For this, we allocate a single area on a single core, these are just numbers for a single core, and we see, for different area sizes and different synchronization intervals, what the overhead turns out to be for the different parts of the system: the logging overhead, the synchronization overhead, and the indirect overhead. The micro-benchmark is updating a random location in memory with a stride of eight bytes, I believe, for this graph, back-to-back writes, 100 percent writes, so this is stressing the system quite a bit. What this entails is that, let's say in one millisecond, that workload can pretty much rewrite the whole region for a 16-megabyte dataset, so if we are synchronizing every one millisecond, over one second we are doing that roughly a thousand times, and we are copying 16 gigabytes of data per second. That is the number in the middle, and the synchronization bit: why the indirect overhead is so large there is because that's exactly the borderline where things fit nicely in the cache alone, and as soon as we add the second copy and all of our overheads, we start pushing things out of the cache.
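A sketch of that adversarial micro-benchmark, assuming the same placeholder logging call as before; the write pattern is the point, not the exact code:

    /* The adversarial micro-benchmark, roughly: one core issuing back-to-back
       8-byte writes to random offsets of a single area, logging every write.
       tv_log_write() is the same assumed placeholder as before. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    void tv_log_write(const void *addr, size_t bytes);  /* assumed placeholder API */

    static void stress(uint8_t *area, size_t area_bytes, uint64_t iters)
    {
        for (uint64_t i = 0; i < iters; i++) {
            size_t off = ((size_t)rand() % (area_bytes / 8)) * 8;  /* 8-byte stride */
            uint64_t *p = (uint64_t *)(area + off);
            *p = i;                       /* 100 percent writes, no reads */
            tv_log_write(p, sizeof(*p));  /* ~20-30 cycles of logging per write */
        }
    }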

So, the trend you observe for this micro-benchmark is that the synchronization overhead remains pretty much the same between 16 megabytes, 64 megabytes, 256 megabytes, and so on. The reason is rather straightforward: the amount of data one core can change in one millisecond does not grow just because the dataset is bigger. You can pretty much update 10 megabytes per millisecond, some number along those lines, so as you go to larger areas, the chance of you updating more of it goes down. These numbers are specific to Haswell; on Haswell and Broadwell they look pretty much the same, but on Skylake, because of changes in the cache hierarchy and things like that, they will look a bit different. We can talk about it offline if anybody is interested. So, just a quick overview before I get to the applications bit. The semantics of this layer is that we have areas that are versioned and atomically updated; reads are local and possibly stale; logging is in software with tunable granularity; and synchronization happens at user-defined intervals, the internal one in bulk by all threads, the external one asynchronously by the daemon. The application interface is simply area management, logging writes, and explicit bulk synchronization. So, the common design patterns that follow and that we commonly use, there are a couple of very simple ones. One is combine: aggregate multiple areas into one. We have multiple published areas, you might think of them as publications, and whoever is the combiner combines them into one republished area. This is just to mimic a multi-writer scenario, but in a different way. >> This is a way that a client of this API can build an abstraction on top that does something more powerful? >> Yes >> So, we don't have to change the underlying system; this is just a library, an example of a library you can build on top, is that right? >> Yes, exactly.
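A minimal sketch of the combine pattern, again with assumed placeholder names: each producer owns its own single-writer area, and the combiner, as the single writer of the summary area, folds them together and republishes:

    /* The combine pattern: each producer owns a single-writer area, and the
       combiner -- the single writer of the summary area -- folds them together
       and republishes.  Names and the aggregation (a sum) are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    void tv_log_write(const void *addr, size_t bytes);  /* assumed placeholder API */

    struct stats { uint64_t pkts; };

    static void combine(struct stats *summary,
                        const struct stats *const *sources, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += sources[i]->pkts;   /* local, possibly slightly stale reads */
        summary->pkts = total;
        tv_log_write(summary, sizeof(*summary));  /* republish the aggregate */
    }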
So, the last thing: we have two examples building on this. One is a monitoring and telemetry application: you collect statistics from all across the data center, but doing it in a flat way is very expensive, there's lots of traffic if you do it at a fine granularity. What you can do is summarize the areas published by each machine, aggregate them on one of the machines, and republish the summary to others. So, not everyone needs to subscribe to every node in the data center; you can subscribe, say, at the ToR, at the rack level, to the summary of that rack, and things like that. You can overlay the way you think your data should be structured and represent it nicely with this application. Another example is a machine learning one, but I hope I get to it, there's not much time left, let's see. The second pattern is scoped membership, which I alluded to: if everybody subscribes to every area, you can easily see that with lots of update traffic there's not enough bandwidth to move that many bits around, and you may not want to; scoped membership lets you avoid that. Then there is pre-partitioning state and ownership transfer: there is a lot of work in NFV talking about dynamically repartitioning state, state transfer, delegation, and things like that, and it's rather complicated. What we can do here, and what we have done with, let's say, an elastic load balancer, is essentially have a hash table, basically a DHT but in memory: a front-end hash table that says which of the sub-hash-tables contains the flow entry, with all of them hosted in memory on one machine. As soon as we need to scale reads, imagine writes are not the problem, reads are the problem, we can boot up a new instance and subscribe to those areas. If writing is the bottleneck, we can transfer ownership of individual sub-hash-tables to a new instance rather than repartitioning the state. If that makes sense; it was rather quick, and I'll get back to this in a bit as well.
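A minimal sketch of that pre-partitioned layout, with illustrative names: a small front end maps a flow hash to one of several single-writer sub-tables, and scaling writes means transferring ownership of a partition rather than resharding the state:

    /* Pre-partitioned state: a front end maps a flow hash to one of several
       single-writer sub-tables, each hosted in its own area.  To scale reads,
       a new instance just attaches to the areas; to scale writes, ownership of
       a partition is transferred instead of resharding.  Illustrative names. */
    #include <stdint.h>
    #include <stddef.h>

    #define PARTITIONS 16

    struct subtable;  /* per-partition flow table, hosted in its own area */

    struct partition {
        struct subtable *table;
        uint32_t owner_instance;  /* single writer; changes only on ownership transfer */
    };

    struct frontend { struct partition part[PARTITIONS]; };

    static struct subtable *route(const struct frontend *fe, uint64_t flow_hash)
    {
        return fe->part[flow_hash % PARTITIONS].table;
    }

    /* Scale writes: hand a partition to another instance instead of resharding. */
    static void transfer_ownership(struct frontend *fe, size_t i, uint32_t new_owner)
    {
        fe->part[i].owner_instance = new_owner;
    }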
Failure detection and ownership transfer: one thing I didn't mention is how this works when a thread fails. If a thread fails, the parent is supposed to detect it, with timeouts and things like that, and just find another thread with the most recent view and delegate ownership to that thread. So, now we know exactly who is in charge of failure detection. The only exception to this is all the way up at the root: if the root fails, we don't have any parent to decide who takes over, so a consensus round needs to go in there to decide a new root. >> [inaudible] but you're going to [inaudible] the thread [inaudible] >> Yes >> You could be on your own, right? So you might lose- >> Oh, yes >> Sometimes, right? >> Yes, that is a trade-off we make: we want to keep the state in DRAM, and durability is not a core part of our system. One thing I like about FaRM is that it says, let's keep the state in DRAM and use a battery-backed system to persist it when needed, when things fail. >> That comes from battery packs; here we're [inaudible], right? >> Why not? >> Durability, you might not- >> Oh, yes, that is certainly true, that is a true statement. So, what goes on is that during the failure phase, reads are fine; it's just writes that are affected, we cannot update the area. Then you pick your poison: either you're fine waiting to see all the updates from wherever you are persisting them, or you're fine losing a bit of state beyond the last version that is visible across the network and going from there. I would personally pick to just lose some state, a few milliseconds' worth of state. >> Did you say there is some heartbeat facility between machines, is that the idea? >> Yes >> And it follows the same hierarchy as this parent structure? >> Yes. So, we bootstrap this with Tasvir itself: the areas for node management and thread management are Tasvir areas. Threads update their heartbeat in the area and it becomes visible, so everybody is able to see when the last heartbeat was, but it's bootstrapped with the system itself. So, in the interest of time, I will go over this very quickly. One thing: OpenFlow came out of Stanford and Berkeley as a joint work, and we didn't like it from the beginning; I mean, something is wrong with it. It abstracts the datapath as match-action flow tables, and there's a very detailed protocol defining how to essentially manipulate a remote flow table to match what we intended; usually the controller is the updater in the SDN case. So, there are three approaches to make this work. One is to process OpenFlow messages inline in the datapath; some switches do this: you receive an OpenFlow message and, on the same thread, update the flow table per the message. Another option, which some other vswitches take, is that there's a controller agent that has a shared lock over the table, or by some RCU mechanism will update the table for you; so it's not the datapath processing those messages, but an agent on the datapath, and it essentially involves some synchronization and locking. The last option is: let's forget about all these protocols, byte formats, everything, and just make it a pure byte copy; we construct the datapath state in the same byte format in the controller, and we just copy the changed bytes with Tasvir. So, what we see is this: let's look at fine-grained locking on a cuckoo hash table and compare it with no locking, using Tasvir. The read throughput, and this is for an update-heavy workload, is minimally affected with Tasvir, even at a fine-grained synchronization interval; basically, the loss of throughput is due to the synchronization overhead, whereas with locking we lose a lot.
Imagine you are really fond of a new mechanism or data structure where updates take a long time but reads are really fast; you cannot deploy it because it would stall the fast path. Tasvir is a very good fit for resolving that issue, because it offloads the writes to another thread and makes byte copying the main cost. So, what we are trading away is update visibility, to get a better, more scalable system and to isolate reads and writes. Another thing is that the visibility latency is bounded. The high-level point of this graph, I will just tell you the takeaway, is that this system gives you knobs: you decide how much synchronization overhead you're willing to incur and trade it off against visibility latency.
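A sketch of what the pure byte-copy flow-table update described above could look like under these assumptions: controller and datapath lay out the same table in a shared area, the controller is the single writer, and the datapath does lock-free lookups on its local, possibly slightly stale copy. Structures and names are illustrative, not the actual implementation:

    /* "Pure byte copy" flow-table updates: controller and datapath lay out the
       same table in a shared area, the controller is the single writer, and the
       datapath does lock-free lookups on its local copy.  Illustrative only. */
    #include <stdint.h>
    #include <stddef.h>

    void tv_log_write(const void *addr, size_t bytes);  /* assumed placeholder API */

    #define NRULES 4096

    struct flow_rule  { uint64_t match; uint32_t action; uint32_t valid; };
    struct flow_table { struct flow_rule rules[NRULES]; };

    /* Controller side: owns the area, mutates the table in place. */
    static void install_rule(struct flow_table *t, size_t i,
                             uint64_t match, uint32_t action)
    {
        t->rules[i] = (struct flow_rule){ .match = match, .action = action, .valid = 1 };
        tv_log_write(&t->rules[i], sizeof(t->rules[i]));
    }

    /* Datapath side: lock-free lookup on the local, possibly stale reader copy. */
    static int lookup(const struct flow_table *t, uint64_t match, uint32_t *action)
    {
        for (size_t i = 0; i < NRULES; i++) {
            if (t->rules[i].valid && t->rules[i].match == match) {
                *action = t->rules[i].action;
                return 1;
            }
        }
        return 0;
    }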
Another example we have is an elastic, fault-tolerant load balancer; it builds on the same concepts I talked about. We have an in-memory DHT; at the beginning of time all the areas are owned by one instance, and on demand we add more instances, handing over partitions. At the end of the day, we want to see how scalable it is: within one machine it scales quite linearly, and across machines it scales near linearly as well. So, we are essentially able to have a stateful load balancer, which is what everybody tries to avoid because it's hard to make work, and make it linearly scalable. What I talked about applies more broadly to networks, beyond the two examples I mentioned: control-plane-to-control-plane interactions, control-plane-to-data-plane interactions, building distributed data planes, and finally network-wide telemetry. The same abstraction could replace a lot of different protocols if you are fine with going with a data-oriented approach. One last thing: we tried it on different examples. Mini-batch stochastic gradient descent was something people at Berkeley suggested: "Why don't you try this and see if it helps?" So, we were trying to mimic a parameter server, with Tasvir taking over that role. This is the aggregation pattern: we have workers working on local updates, aggregated in a local area, then aggregated at the global step. For a head-to-head comparison, we compared with a native shared-memory implementation which has no contending writes, which is Cyclades (NIPS '17), if I'm not mistaken, and we match the native shared-memory overhead and surpass it beyond some number of cores.
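A sketch of that aggregation pattern under the same assumed placeholder API: each worker owns an area holding its local gradient, and a single aggregator owns the global model area and folds the gradients in:

    /* Parameter-server-like aggregation for mini-batch SGD: each worker owns an
       area with its local gradient; one aggregator owns the global model area
       and folds the gradients in.  Dimension, learning rate, and names are
       illustrative; tv_log_write() is the same assumed placeholder. */
    #include <stddef.h>

    void tv_log_write(const void *addr, size_t bytes);  /* assumed placeholder API */

    #define DIM 1024

    static void aggregate(float *model, const float *const *grads,
                          size_t nworkers, float lr)
    {
        for (size_t d = 0; d < DIM; d++) {
            float g = 0.0f;
            for (size_t w = 0; w < nworkers; w++)
                g += grads[w][d];         /* local reads of the workers' areas */
            model[d] -= lr * (g / (float)nworkers);
        }
        tv_log_write(model, DIM * sizeof(float));  /* publish the new model */
    }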
So, finally, the premise of our work is that copying data is faster than replaying compute for a broad class of applications in networking, and we're looking at applications beyond this. >> [inaudible] tells us dropping the [inaudible] your little data, a lot of data [inaudible] then dropping the data; doesn't it all depend? >> Yes. So, our conjecture is that the class of applications we are targeting has the property that the deltas are small compared to the volume of packets you would otherwise have to move around, so it makes sense to replicate the data; it is much cheaper to replicate state and data than to replay the messages that generated that data, to replay the compute that resulted in that data. But it all depends: if you have a workload where a function call memsets two gigabytes of data to zero, and I don't have any compression in my system, this would be a horrible system for that. So, it all depends; this is not a general solution, it's for a niche class of applications. One other thing that is nice is that this is a hardware-friendly mechanism. The mechanisms I talked about can be specialized for different architectures: if you have wider vector instructions, you can do faster copies; you can relax memory ordering constraints on x86 and exploit that; if you have RDMA, the networking and packetization can be offloaded and made much cheaper; and if hardware write tracking ever becomes available at a finer granularity, you can use it. This state-oriented nature, at the end of the day, lends itself very well to offload. Rapid snapshotting, which is essentially the crux of the work, the fact that we snapshot at millisecond timescales, enables us to move functions off the fast path. Imagine somebody asks for durability: Intel starts selling 3D XPoint DIMMs, you want to make the data persist, and you already have fault-tolerance solutions; customers don't want to go and build a new solution from scratch. What they could do with this system is boot up an instance with 3D XPoint attached to it and start persisting the state off the fast path, on a thread that is not serving traffic. Beyond that, we are interested in investigating other applications, and we are open to evolving the design as we go. This is ongoing work and I'm very excited to learn what you think
and to talk about it in our one-on-ones. Thank you so much for listening. >> [inaudible] further questions [inaudible] Okay, thanks again