RailsConf 2014 – Distributed Request Tracing by Kenny Hoxworth

Okay, I want to first welcome everyone to the distributed track at RailsConf this year. This is the first year, I think, that we've been doing a distributed track, so I'm excited to be able to give you a talk on distributed request tracing. A little bit about myself real quick: my name is Kenny Hoxworth, and I'm a software engineer at Twilio. I'm actually not doing any Rails development anymore, unfortunately, but I've taken a number of the things I've learned from the scaling architectures Twilio has built as part of its services, and I think a lot of that can come back to the Rails world. That's what I want to bring to you today.

Before we get into the investigation that led us down this path with Twilio's architecture, I want to get into what distributed request tracing really is. Distributed request tracing can be thought of as a profiler for your distributed system as a request goes through it. We have profilers for a whole bunch of frameworks and programming languages that let you break down exactly which function calls are taking so much time and what the exact stack trace is for each call; a distributed tracer provides the same kind of profiling for the request cycle as it enters a distributed system. Given this diagram, you can read it exactly like a profiling diagram of a full trace running from end to end as it entered and exited the system: the web service call that was the first call made in the distributed system, an auth service that ended up getting called as part of that, then a DB call, and then, after everything was done on the web service side, a hand-off to a worker thread to finish up some processing.

Okay, so I work on the messaging team at Twilio. I don't know if all of you are familiar with what Twilio does, but Twilio provides an API for voice communication, so you can send and receive phone calls via the Twilio API, and we also provide a messaging API, so you can send and receive text messages and MMSes. For received messages you can have webhooks that go out to your servers to figure out exactly what should be done with each message; it lets your servers respond to the incoming message. The messaging stack at Twilio is set up in a distributed fashion, where every step that can happen for an inbound or outbound message is provided by an individual service running on AWS instances, and at any given time there are nine different distributed components all communicating from when a message comes into the service via the API until it's handed out to the carriers. And so we ran into some problems. This is just a very basic overview: a request comes in, it's handed off to another server, then another, then another, with some DB persistence going on along the way. For a system like that, a distributed tracing system is kind of bland; you don't really need it when you're just handing a request off from one system to the next. Profiling is still pretty useful in the sense that it can give you more information about latencies and other issues that are occurring between your services, but where it becomes really useful is when you start scaling.

Each service that Twilio has is scaled to meet whatever the current needs are for our service, and at any given time we can have ten or twenty nodes, or however many, running for each individual service. A lot of times we have customers or support personnel come to us, the engineers, and say, hey, we have a report from a customer who says that every twenty or so SMSes that go out, one sometimes takes two seconds as opposed to being near-instantaneous; why is that? Other customers come to us and say, hey, an MMS sometimes takes five seconds; why is that longer than SMS? And some customers will say, we tried sending a message and it completely failed. Tracing that information down can be difficult in a distributed system. You can do log parsing, go through logs, and use Logstash to get some information about the actual exception that occurred.

But being able to recreate the entire event, from when it entered the system all the way to the end, is incredibly useful from an analytics standpoint and from a debugging standpoint, because you can figure out: I know exactly what code was running on this server, I know exactly what was running here and here, and I know that at this point this server was having problems because of a specific memory issue, whereas all of the other servers behind that load balancer were not affected. As an example, when you add distributed tracing, you can follow the exact path that a message travels throughout your system. Distributed tracing can give you information on performance, obviously, which we talked about previously; in this example you can see that a DB had 500 milliseconds of latency during a transaction, which is pretty bad. Another thing a tracing framework can give you is information on bad nodes. This is a problem that occurs a lot in a distributed system: a node goes bad, maybe it has run out of disk space, maybe there's some sort of unplanned problem occurring on one of your boxes. If you're running a robust system, it should retry on failure, sending the request back through the load balancer and, hopefully, on to a good server. A lot of times you don't know when these problems are occurring. There might be a slight latency introduced in your overall system, but usually there's no error message that gets back up to the customer, and no error message that you're going to see, aside from some exceptions that might get thrown and show up in your monitoring. With a distributed tracing system, you can follow the exact path: see that the request hit one node and failed, hit another node and failed, and then finally hit a third node and succeeded. Then you actually know which nodes you need to go investigate and clean up, and you can possibly plan to scale better in the future.

Okay, so that was the Twilio example. A lot of you may be wondering, is what I run really a distributed system? Essentially, if you are running a service that has two or more components that talk to each other, whether that's a web service and a database, it's a distributed system, and the more services you add, the more distributed it becomes. Laying the groundwork for a good tracing framework is really easy if you start at the beginning, but it can be really difficult to go through and instrument all of your services after you've already built everything.

So I wanted to run through a quick little example. It's kind of silly, but it's what I'm going to use for the rest of the talk. I wanted to come up with a service we can use as an example here, and I tried to think of something that the internet uses and needs more of, and that's pretty obvious: we need more cats. Everyone needs cats; cats seem to be everywhere. The thing is, we didn't want to create another I Can Has Cheezburger or anything like that; we wanted to provide an API so people can go and retrieve whatever cats they want
and send them out to customers or clients and integrate them with their own service. So we have Cats as a Service, which is fantastic; everyone loves cats. All right, so that's what our service is going to provide: wonderful cats that attack children.

Okay, so we're starting off with a very, very basic architecture. We're going to provide both an API and a web interface to our service, and then we're going to send things out through Twilio. I'm going to talk about Twilio here, but the idea is that we allow our customers to retrieve cats and send cats out to their friends using Twilio. This is actually Cat Spammer; catspammer.com actually exists if you want to play with it. This is pretty simple, and you might think, I don't actually need to add distributed tracing to this; it's a very simple architecture. Well, let's say this is a service you put together as a side project, but here I am giving this talk in this room, and let's say a VC is in the room right now thinking, oh my gosh, this is great, we really need to start sending cats all over the internet (which is already being done, but bear with me), and all we need is a social element. So if he happens to come up to me later and say, I've got 50 million dollars, I need you to go build this system because of all the massive traffic we're going to bring in, and we want to add a new social element and all that sort of stuff, it suddenly starts to become a big problem. So we start out with our API and our web interface, we add a social interface (I don't even know what a social interface for cats would be, but it seems to make good sense), and we could add an authorization layer, because it might make sense to protect the images of cats that people have.

We'd add a worker process that might go out and build cats, another media-fetching service that does the deduping and caching of any images that get retrieved, and then we'll also provide a message-queuing infrastructure, which we're going to call /mq, that communicates throughout the entire service; I mean, it's web scale, we have to have a message queue. We'll still send out via Twilio, but we'd probably also send out via other social sites like Twitter, Fakeblock, any of those kinds of services. At this point you can start to see how a distributed tracing system could be fairly useful. You can see that right now we have the API, the web interface, and the social interface as our ingress points, and we never know which one customers are going to be using, so it would be good, for any request into our system, to be able to see where it's coming from. Once a request gets into the message queue it's even more difficult, because we don't know if the authorization system is actually needed on each request, and we don't know if the media service is going to be needed; maybe the media service goes out, makes a call, retrieves a giant cat image, and ends up blowing the entire system up. And then we have multiple egress points. With that said, once you start scaling this, it ends up becoming even crazier than you think; you don't even know where you're going to end up.

So I want to run through real quick what makes a good tracing system. The first goal of building a good tracing system is low overhead. It is incredibly important that any tracing system have as little impact as possible on the server resources of the system it's tracing. This seems like an obvious point; however, if you're talking about a production system that is doing traces on every single request, trying to log all of those somewhere, and trying to send them into another system, it has to be thought about all the way from the beginning. The second goal is that it needs to be scalable. This is not as important when you're first starting to build out a distributed tracing system, but it needs to be scalable from the start so that developers don't have to worry about knocking over the tracing system as they add services. It seems silly, but developers don't want to have to sit there and think about tracing and all the metrics that get collected; they want it to be transparent. Which brings us to the third goal: transparent instrumentation. This is by far the most important part of building a tracing system. By this I mean: if you have a distributed system with 20 different components, and let's say you don't already have tracing as part of your system, you are not going to want your developers to go out and add tracing to every system, and not only every system, but every call-out that occurs throughout your process. It would be a nightmare; you're going to miss some calls, it's going to happen.

With that said, I want to go through a couple of tracing systems that already exist. X-Trace is one of them; it has been around for a few years now. I think all of these started out as academic papers in the early 2000s.
X-Trace is a system that has C++ bindings right now and a few other language bindings; I think they have Java bindings as well. There's no Ruby instrumentation for X-Trace, aside from a commercial distributed tracing product, AppNeta's TraceView, which I'm not going to demonstrate here. There's one other commercial system that's part of New Relic, their cross-application tracing, but that's just to throw the tools out there. X-Trace, Magpie, and Pinpoint all follow a very standard, basic approach. Bringing this diagram back up: they all create a single unique identifier for the trace that runs through your distributed system, and then another unique identifier gets created for each of the spans that occur. A span corresponds to each of the services that gets involved along the way, and you can link them all together by linking parent spans with child spans, so you end up with an entire tree structure.
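Just to make that concrete, here is a minimal sketch of that shape in Ruby. This is illustrative only; it is not X-Trace's or Zipkin's actual wire format, and the names are mine, but it shows the three pieces all of these systems carry: a trace ID shared by every hop, a span ID per service call, and a parent pointer that lets you rebuild the call tree afterwards.

```ruby
# Illustrative only: not any real tracer's wire format, just the shape of the
# data these systems share. One trace ID covers the whole request, each service
# call gets its own span, and parent pointers let you rebuild the call tree.
require 'securerandom'

Span = Struct.new(:trace_id, :span_id, :parent_id, :service) do
  def child_of(service)
    # A downstream call keeps the trace ID and records this span as its parent.
    Span.new(trace_id, SecureRandom.hex(8), span_id, service)
  end
end

root = Span.new(SecureRandom.hex(8), SecureRandom.hex(8), nil, 'api')
auth = root.child_of('auth')
db   = auth.child_of('db')

[root, auth, db].each do |s|
  puts "#{s.service}: trace=#{s.trace_id} span=#{s.span_id} parent=#{s.parent_id || '-'}"
end
```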

Google put together an internal system called Dapper, and they published a paper about it a few years ago as well. Dapper is very similar to X-Trace and Magpie, but there are two key differences, and they built those in specifically to make their system a lot more scalable. One is that they implemented a low-latency logging system on the back end: as traces get created, the collector will grab them if there are system resources available, but if the machines are getting pounded, the log collector won't do any collection at all. The other thing Dapper introduced is a sample rate, so not every single request gets traced. This is huge for Google because of the number of requests they receive. Dapper is a really nice system, but Google has the luxury of building an entire networking and RPC layer underneath all of their services. Google maintains control of all of its communication paths, whether that's HTTP communication or database communication; all of it is done through Google's underlying networking and RPC layer, and because of that they can integrate all of the instrumentation directly into the networking layer, which is critical to putting together a transparent instrumentation platform for developers to use. The way Dapper is deployed, all of the services run on individual nodes and transparently do their tracing and logging out to a log daemon running on each box; there's also a Dapper collector service that goes through and collects all of the log files at specific intervals, and the log files end up getting sent to the Dapper UI and Dapper service on the back end.

Twitter, however, took the Dapper paper and decided to build an open-source project off of it, which has been fantastic, because it has opened this up for the rest of us. That project is called Zipkin. Zipkin has been around for a little while now, and it seems to bounce back and forth in the open-source community as far as how much development is actually happening on it. Zipkin is also based on a custom networking and RPC layer that Twitter introduced, called Finagle. Finagle is a really nice networking layer that has all the instrumentation built into it; the problem is that most of the clients and services are built in Scala, which doesn't really match up well with the Ruby and Rails world. You could use JRuby and somehow make it communicate, but for a larger enterprise that's using MRI it makes sense to be able to do the instrumentation outside of it. The good thing about Zipkin is that it's open source, and all of the headers, all of the communication paths, and all of the Finagle services are written in Thrift, so it's pretty easy to instrument, or to create servers and clients that speak the language Zipkin expects. Zipkin also has a couple of really nice aspects to it. One is that it has a pluggable back-end data store. The entirety of Zipkin is written in Scala, but it's actually incredibly easy to start up and get running, so you can go out today and easily start instrumenting Rails projects with it.
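Before getting into the data store options, since the sample rate came up with Dapper and will show up again in Zipkin's configuration, here is a minimal sketch of what that head-based sampling decision amounts to. The constant and method names here are mine, not any tracer's API.

```ruby
# A minimal sketch of the sample-rate idea from the Dapper paper (names are
# mine, not any tracer's API): decide once, at the ingress, whether a request
# will be traced, and carry that decision along with the trace ID so that only
# sampled requests pay the cost of recording and shipping spans.
SAMPLE_RATE = 0.1 # trace roughly 1 in 10 requests

def record_trace?
  rand < SAMPLE_RATE
end

def handle_request(trace_id)
  sampled = record_trace?
  # Downstream services would receive trace_id plus the sampled flag and only
  # log their spans when sampled is true.
  puts "trace #{trace_id}: #{sampled ? 'recording' : 'skipping'} spans"
end

handle_request('a3f91c2d')
```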
By default the pluggable data store uses SQLite, which is great for doing some initial development. It's not very good for production traffic, but Zipkin also allows MySQL, Postgres, and Cassandra back ends, and I'm pretty sure Twitter has been using Cassandra pretty heavily on the back end through that pluggable data store. The Scribe collector is used as the log collector; Scribe is an open-source implementation of a low-latency log collector service. The other really nice thing about Zipkin's infrastructure is that it doesn't require all of the individual instrumented services to communicate via a Scribe collector; you can talk directly to Zipkin. So if you're just doing development, trying to introduce distributed tracing to your application, you can write directly to Zipkin's Scribe port and it works flawlessly, and that's what we'll do in the example in a second. The last point is that it's highly configurable. There are lots of tuning options: obviously, as I already said, the data store and the log collectors are pluggable, and it can also use ZooKeeper on the back end to automate service discovery for when you do have different log collectors running on individual boxes.

Zipkin's interface is pretty simple. You can see here the profiling view of a request: at the top is the entire request, and then each service that ends up getting called has its own entry. You can go and look at each individual point throughout the trace, and we'll look at a live example of that in just a second.

So anyway, I want to go into an example of instrumenting a Rails application using Zipkin. It's actually incredibly easy; I'm a little embarrassed to be up here talking to you about the Rails portion of it, but at the same time there is some additional work that needs to be done to instrument things that don't already automatically integrate with Zipkin and Finagle. The initial setup is installing Zipkin, which is pretty straightforward. You do need a JVM running and you do need Scala installed; everyone has probably installed a JVM at some point, and Scala is incredibly easy to install. Scala does take forever to compile all of the services, which is a bit of a pain point, but you can just clone the Zipkin repo, and then each of those commands will start up one of the three individual services that Zipkin provides. Zipkin provides the collector engine, the query engine, and the web engine separately, so you can put the different services on different hosts depending on the load on your system.

The Ruby setup is also pretty easy. We use the scribe gem, which lets us talk to a Scribe logging system, and the finagle-thrift gem, which is part of Twitter's Finagle Thrift GitHub repo; this introduces the trace IDs, the span IDs, the communication, and all of the recording machinery for you, so it's all fairly simple and straightforward to actually do request tracing. And finally, zipkin-tracer is a Rack middleware component, part of the Zipkin project, that automates the inbound portion of request tracing on a Rails service. As long as you're dealing with plain Rails services it's incredibly easy; there's no extra configuration that needs to be done beyond getting these things installed and adding the middleware, and we'll show that in just a second. The one difficulty is that the zipkin-tracer middleware, well, it hasn't been abandoned, but there are a couple of forks that actually work a lot better than the base Zipkin library. The file I use for this I ended up grabbing from one of those forks; I can give that information to anyone who's interested after the talk, for sure. But it's pretty simple: you set up your Gemfile, and then you add a config initializer to introduce the middleware into your Rails stack. There's a little bit of configuration that needs to happen, and all it is is the service name you're creating (right now this would be our Cat Spammer API), the service port that Zipkin runs on, the sample rate, which is a number between 0 and 1 specifying the fraction of requests that should get sampled, and then the Scribe server. In my case I'm pointing directly at my Scribe server; if you had a Scribe daemon running on each individual box, each component would have to be set up in Rails to point at whatever Scribe daemon is running on its box.
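Roughly, that setup looks like the sketch below. The gem list matches what I just described; the initializer is a sketch based on the options named above (service name, service port, sample rate, Scribe server), so treat the exact option keys, the port numbers, and the values as assumptions to check against whichever zipkin-tracer fork you install.

```ruby
# Gemfile -- the three gems described above; the zipkin-tracer line may need to
# point at one of the better-maintained forks rather than the base gem.
gem 'scribe'
gem 'finagle-thrift'
gem 'zipkin-tracer'
```

```ruby
# config/initializers/zipkin.rb -- a sketch of wiring the Rack middleware into
# the Rails stack. Option names follow the ones described in the talk; the
# exact keys and default ports depend on the zipkin-tracer fork you're using.
require 'zipkin-tracer'

zipkin_config = {
  service_name:  'catspammer-api',  # the name that shows up in the Zipkin UI
  service_port:  9410,              # described above as the port Zipkin runs on (assumed value)
  sample_rate:   0.1,               # between 0 and 1: fraction of requests to trace
  scribe_server: '127.0.0.1:9410'   # pointing straight at Zipkin's Scribe port; with a
                                    # local scribe daemon, point at that daemon instead
}

Rails.application.config.middleware.use ZipkinTracer::RackHandler, zipkin_config
```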
Here's where a little bit of difficulty comes in: ActiveRecord, Redis, RabbitMQ, and any other communication path that you want to use inside of your service. Unfortunately, the current zipkin-tracer gem does not provide default tracing and instrumentation for all of those, so it's a bit of a womp-womp moment, but it's something that's actively being worked on. I was actually hoping to have it available for you today; I was trying to add it to ActiveRecord, Redis, RabbitMQ, and a whole bunch of other services, and I wasn't able to get all of them working and playing nicely together, but it's something I'll hopefully have up and running in the next couple of weeks. You definitely still can do tracing of all of these services; unfortunately, you have to wrap them all in tracing calls yourself, something like the sketch below. For each of your DB requests, as long as it's a synchronous request, this lets you add client tracing around the actual DB call. For asynchronous communication it's a little bit different: you have to be careful that you're actually passing along your trace IDs and your span IDs, but that's something we can talk about offline as well.
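Here is a hypothetical sketch of that kind of manual wrapper. It is not the actual API of the zipkin-tracer or finagle-thrift gems; trace_client_call and record_span are placeholder names, and in a real application the recording step would go through whatever recording API your tracer gem exposes.

```ruby
# A hypothetical wrapper in the spirit of what's described above: the Rack
# middleware only traces the inbound HTTP request, so each outbound call
# (ActiveRecord, Redis, RabbitMQ, ...) gets wrapped by hand in a child span.
# record_span here just prints; a real app would hand the span to the tracer.
require 'securerandom'

def record_span(span)
  puts "span #{span[:span_id]} (#{span[:service]}) took #{span[:duration_ms]}ms"
end

def trace_client_call(trace_id, parent_span_id, service)
  span = { trace_id: trace_id, span_id: SecureRandom.hex(8),
           parent_id: parent_span_id, service: service }
  started = Time.now
  result = yield
  span[:duration_ms] = ((Time.now - started) * 1000).round
  record_span(span)
  result
end

# Synchronous example: wrap a DB call. For asynchronous work, the trace ID and
# span ID also have to travel with the queued message so the worker can attach
# its own spans to the same trace.
trace_client_call(SecureRandom.hex(8), SecureRandom.hex(8), 'mysql') do
  sleep 0.05 # stand-in for the actual ActiveRecord query
end
```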

So anyway, what I want to do now is just a real quick demo. It's nothing incredibly difficult. Okay, so I'm starting up the Cat Spammer interface that we ended up creating. Oh gosh, this is really difficult; this is the one thing I couldn't test before. Oh, I can see it down here, perfect, so give me one second. Shall I just put my own number in there? Everyone please spam me after this, yeah, exactly. So I'm going to send a cat to my phone right now using Cat Spammer. Okay. On the back end, and I'll pull this up real quick, everything's cut off pretty badly here, but I'll show you: I have the three individual Scribe and Zipkin services that are running, and then in these terminals I have all of the Rails applications that are running for our service. Did something fail? That's no fun. There's the API portion, there's an auth component, there's a social component that talks to Twilio and email and all that, and all of them are set up to communicate with one another so I can create an example trace for you.

So I'll bring up the Zipkin interface here real quick. Okay, this is the basic Zipkin UI; it's actually a pretty simple interface, nothing really difficult about it at all. The one thing that's interesting is that it lets you select different service names, so I can look at my Cat Spammer interface and then grab a trace from that. So here's one trace that ended up coming through, and I'll pull that up. Oh, and unfortunately it didn't actually work very well; that's too bad, more womp-womp. One sec, I'm going to run another request through. I might not have internet, that's the problem. Well, unfortunately I don't have internet right now, so my actual sending of the message out is going to fail, but I can drop back real quick and we can act as if this is the interface I was just showing you. So, assuming this was the request, it shows each individual service as it was called. You can click on each of the spans here and it'll give you annotation information, so you can see when the request started and when it ended. You can also add tracing information as part of your request while you're doing processing, so you can attach extra information to each individual span, and you can see the latency issues or anything else that's actually happening as part of your application.

Okay, so that's the demo, and that actually concludes what I have. I would love to take any questions from you. The actual Rails instrumentation, like I showed you, is really easy with Zipkin; the AppNeta TraceView instrumentation is just as simple, and I'm pretty sure New Relic's cross-application tracing is pretty simple also. I haven't used that one, but I have used TraceView. Okay, excellent. I'd be happy to talk to any of you about this offline, or you can go visit catspammer.com and send your friends some images. So thank you, and I hope you have a good rest of your time at RailsConf.