Best Practices for Benchmarking and Performance Analysis in the Cloud (ENT305) | AWS re:Invent 2013

Okay, my name is Robert Barnes. As Werner mentioned the other day, we're passionate about measuring, so I suppose my title should be director of measurement, or measurer in chief, or something like that, for Amazon Web Services. The topic for today is best practices for benchmarking and performance analysis in the cloud. As I put this talk together I had a couple of objectives, but I also realized what my problem is (I should probably have a lot of therapy for this): I seem to have an attraction to measuring things. From a professional perspective it harkens back to my first paid computing job, which was working in a metrology lab. If you're not familiar with what that is, it was at an aerospace company that had to measure things with very strict tolerances, and all of the equipment they used for measurement needed to be calibrated against NIST standards. So I was coding with PDP-8s and PDP-11s that did automated calibration of oscilloscopes and that sort of thing. I won't tell you how long ago it was; there were some slide rules at the time, and I'll leave it at that.

However, I'm going to actually start with a pseudo benchmarking demo. Any of you who have done benchmarking know that demos and benchmarks don't really mix; I've had to do them live and it really can be painful. But I want to start with a question for the audience, and I'll do a quick survey on the answers. There are a number of devices that I happen to have on my desk; I have most of those here right now, and they all can be used to measure. How many different ways do you think these can be used to measure? How many people think it's five or less? Just raise your hand. Okay, ten or less? Fifteen or less? Twenty or less? More than twenty? Okay, we're on the right track. There are at least twenty ways to use these devices, and I'll illustrate quickly. Don't worry, I'll make this relevant, even if you don't have faith in me. Take this tape measure: you could use the width of the tape measure, you could use the height of the tape measure, you could use the length of the tape measure, you could actually use it as intended, or, my favorite, you could use the thickness of the tape measure to measure.

Now, prior to starting, I asked some brave volunteers to assist me in this demonstration. I had four methods of measuring: this wonderful laser-based measuring tool, accurate to 1/16 of an inch (you've got to have one of these; no, I'm not passionate about this at all), a regular tape measure, and a ruler, and the fourth approach was measurement by estimate. I asked these brave volunteers to measure this stage. The measurements that I got were 16 feet for the estimate; the digital tape measure came in at 24.17 feet; the tape measure came in at 24.125 feet; and the ruler came in at 23.25 feet. So what does that tell you? All of those were measurements of the stage, and if you didn't know there were other ways of measuring, how would you know which one of those was accurate? In fact, the first time I bought this wonderful tool and used it, the first thing I did was calibrate it against the tape measure, because I wasn't sure how to hold it. Think about it: this thing has an emitter here and it measures here, but I didn't know, the way they designed it, whether it started measuring from the bottom of it or from the top, and so it took a little while before I could trust this tool. This is where I bring it in: there are many, many ways to measure performance.
You have to understand how the tool is measuring and how relevant that measurement is, and if you're using a tool for the first time, I recommend using more than one tool to see if you can correlate the results and make sure that they are meaningful to you. Before I get into actually talking about measurement, I want to give a little bit, I promise a little bit, of background on cloud benchmarking. First I have to start by saying the best benchmark is your own application. I know we hear that all the time, but the reality is that when you use some other way of measuring things, you still have to tie back what that thing tells you to your actual application. Or, put another way, how many people admit to running a benchmark in production right now? You're running a benchmark in production, and that's your business, running the benchmark? Probably not. I realize that was a somewhat sarcastic question.

But the point is, there's usually some application that you're really running, and sometimes it makes sense to use something else to help you tune it or baseline it, but you still have to tie back what you measure to your actual application.

In all benchmarks you're typically dealing with two types of measurement: absolute measurement, in this particular case the measurement of the stage, where we're looking for an absolute number, or relative measurement. I can tell you that for those four measurements, the mean was 21.88 and the standard deviation was 16 percent. Given the wide variance, that relative information can sometimes help tell you how accurate your measurement devices are, and it can also help you pinpoint whether the artifacts you're measuring are the things you're trying to measure, as opposed to artifacts of the way you're measuring. (Now that wasn't what I wanted to do; help me with my button dyslexia here. Is it going through stages? Sorry, is it back to one? Okay, I can get it. So if you don't know how to use the tools, then you're really in trouble. We're good, okay.)

Absolute versus relative: benchmarks are typically going to give you either a fixed amount of work, where you measure how long it takes, or a fixed amount of time, where you count how much work happens within that time. It's important to understand which kind of test it is because, for instance, when it's a fixed amount of time and you're counting how much work gets done, higher is better; when it's a fixed amount of work and the time is what you're measuring, lower is better. I've actually seen published results where people got that wrong and touted results that should have been considered in terms of lower is better; they got it backwards.

What's different about benchmarking in the cloud? There are two broad categories of differences that I really need to point out. The first is that there are typically a few more layers of abstraction than there would be if you were doing it on premises. What do those layers of abstraction mean? On the positive side, it means you're not wrestling cables, you're not calling someone up to say hey, can you configure this for me; you're provisioning very quickly. The flip side is that because there are more layers of abstraction, there's more likelihood that you could have some sort of variability. So as a rule of thumb you typically need to run more iterations of your tests in order to quantify variability and make sure you understand where it's coming from.

Last point before we get into actual measurement, I promise: use a good AMI. That may sound like a strange recommendation; what do you mean, good AMI versus bad AMI? To illustrate this, I ran some experiments on the exact same instance type using the exact same binaries. All of these AMIs except one were CentOS 5.4, the same distribution, and I got dramatically different results. It turns out that the differences were primarily because the kernel in the first three AMIs was an old kernel, a Fedora kernel from 2007, and that particular kernel does not handle multiprocessors well. How did it manifest itself? Theoretically, all these results should have been pretty similar.
I mentioned using two tests to calibrate: what I did here was also run Ubuntu 12.04, the latest Ubuntu release available as an Amazon AMI at the time I did this test, and it came in pretty close to the AWS CentOS 5.4 release, so I knew those results were probably reasonable and the other ones were not. When I looked at the coefficient of variance (if you're not familiar with that term, it's the standard deviation divided by the mean; it gives you a relative percentage of variation, which is a great way to look at variation), you could see there was almost 50% variation in the three bad AMIs. What does all that mean? In general, if you're using an Amazon AMI, it's been tested; we've done a lot of work to make sure it's okay. If you're using another AMI that came from somewhere else, even if it's your own AMI that has been around a while and you keep using it, it's worthwhile to go back and make sure it's still a good AMI, because they can get out of date over time and you need to keep them up to date.
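Since the coefficient of variance comes up repeatedly in this talk, here is a minimal sketch, not from the talk, of how you might compute it from a column of collected results. The file name and column number are placeholders for whatever CSV your own scripts produce.

    #!/bin/bash
    # Minimal sketch (not from the talk): compute mean, standard deviation, and
    # coefficient of variance for one numeric column of a results CSV.
    # results.csv and column 3 are placeholders for your own data.
    FILE=${1:-results.csv}
    COL=${2:-3}
    awk -F',' -v col="$COL" '
      { sum += $col; sumsq += $col * $col; n++ }
      END {
        mean = sum / n
        # population standard deviation; use (n-1) for the sample form if preferred
        sd = sqrt(sumsq / n - mean * mean)
        printf "n=%d mean=%.2f sd=%.2f cov=%.1f%%\n", n, mean, sd, (sd / mean) * 100
      }' "$FILE"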

Going into actual measurement, I'm going to cover CPU, memory, and disk. I could probably spend at least an hour on each of those topics, but I wanted to cram in as much breadth as possible and maybe leave time for a few questions; if not, I promise I'll figure out where we can do the questions afterwards if you still want to drill me with questions.

This first scenario: let's say, hypothetically, that you have an application that you're running on premises, and what you want to do is find the best instance to run that same application. The first question you have to ask yourself is, what is best? How do I know it's best? Best sounds like a great term, but there could be absolute performance, there could be price performance, there could be other parameters; you need to be able to answer the question of what best means. A very methodical approach here, and I get asked about this a lot, is: let's say that you don't have a benchmark for your application; it's complex to set up, you don't have test data, you don't have security set up, so you want to use some proxy. You find a synthetic benchmark (for CPU I'll show four), you configure and run it on premises so you get what we'll call a baseline, you then use the exact same configuration and scripts, hopefully, to run it on instances, you run those tests multiple times, you look at things like variability, and based on your definition of best, you're now ready to go to the next stage, which is deciding which one.

So for testing CPU, going through those steps one at a time, we'll choose a benchmark. For the purposes of today I picked four that I tend to see used a lot: Geekbench, UnixBench, the CPU part of sysbench, and SPEC CPU2006 integer. By choosing these I'm not saying they're the only ones you should consider, and I'm not endorsing them as the best ones; they just illustrate how you would go about doing this and some of the strengths and weaknesses of using them. The first question you have to ask yourself, particularly with a benchmark you've never used before, is: how do I know I have a good result? For instance, in my example here where I asked people to measure the stage, what I did prior to them measuring the stage was carefully measure the stage three or four times in different ways, and I came up with an answer of 24 feet and 9/16, which is pretty close to where most of them came in. And if you actually look at it, there are six panels here and each of them is six feet wide, so I had multiple ways of saying yeah, that seems about right, and then I used the measuring tape and so on.

But when you're using one of these benchmarks, if you've never used it before, sometimes they're completely black box: all you do is say run and it gives you this beautiful answer. Most of the time the beautiful answer is in arbitrary units that make no sense to you; it's just a number. How do you know that number is representative? How do you know there aren't things you could tweak in your instance and your OS settings, or, if the app supports tweaking, in the app, to say this is a good result? I should say, for people who are practitioners of benchmarking, the first question to teach young engineers to ask is: how do I know when I'm done? Because faster is open-ended; make it faster means you've got job security for life.

For the CPU tests I ran on nine different instance types, and for each of the instance types I launched ten instances; you get what I'm trying to do there. I talked about quantifying variability; I wanted to look at variability in several ways, and so numbers help a lot. I used the same base AMI, in this case Ubuntu 13.04, for all of these tests.
So the first test I mentioned was Geekbench. Geekbench has workloads in three categories: there are 13 integer tests, 10 floating-point tests, and 4 memory tests. I should say this is a black box; it comes as an executable. You can buy the commercial product, which is full-featured, or you can get a free download which is 32-bit only and has limited functionality. It sets up quickly, it doesn't really take much to set it up, and it gives you results in roughly two minutes, so it sounds like a dream. It gives you both single-CPU and multi-CPU results. I'm talking about version 3 of Geekbench; version 2 had lots of problems, and I met with them and pointed out, from a methodology perspective, what was wrong. They made a bunch of improvements, and at least based on this testing it's much better than it was.

So, speaking of testing, I'm going to show you scripts. Don't worry, there's going to be a lot of stuff here, but I'll break down what's there. The idea is that you can take these scripts from the presentation, which will be available, and harvest them all you want.

When I set out to write a script for benchmarking, the first thing I do is figure out what the results are going to look like and how I'm going to incorporate them into analysis tools, because you can run tests all day, and a lot of times tools like Geekbench will give you human-readable reports, but if you're trying to run these tests at scale and automate them, then you need to think about how to parse the results and how to stuff them into a database or spreadsheet. The first part of this script is designed to name the file, because when you're running these tests multiple times you have multiple results; how do you know which one is which? Naming the file helps me keep track of the metadata about the tests, and in this particular case I use the APIs to get both the instance ID (that's the first blue line) and the instance type. When I'm running these tests across nine different instance types and pulling them all together into one spreadsheet or database, it helps to know what they were so I can do the analysis.

The next part is actually the running of Geekbench. It's highly complicated and sophisticated, you need hours to train... no, it's very simple. On Linux, the x86 version basically has two parameters: the first one is upload or don't upload; by default they will take your results and publish them publicly, so if you don't want that, you want to say no upload. The second option is whether you're running the benchmarks or doing a stress test. It does not tell you how long it takes; it doesn't report that information, so I wrapped the execution with a start and end capture of the time and computed the elapsed time from the difference. I know there are lots of ways to do this; I'm just showing a simple and straightforward way. Then I essentially make the output file the combination of the instance type, the instance ID, a sequence number I passed in (because if I run the test four times I want to be able to distinguish between the runs), and finally, in this particular case, I'm putting the time into the file name as well. Why? When I import it into Excel, Excel has this wonderful text import wizard that helps you put things into columns, and it just makes it easy for me to do the analysis when I actually go to do it. Finally, pulling the data out: I use a combination of grep, sed, and awk to pull the results out. I won't bore you by going through it line by line, but if you're not familiar with those tools, they are must-know tools, at least on Linux, for grabbing data and putting it into a useful format. I tend to like comma-separated values just because there are so many different tools that accept that format, so I always figure out how to parse things into comma-separated values.
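The original slide script isn't reproduced in this transcript, so here is a minimal sketch of the kind of wrapper described above. It assumes the free 32-bit Geekbench 3 binary is named geekbench_x86_32 and accepts a --no-upload flag, and the grep patterns for the summary lines are guesses; all three should be checked against the version you actually download. The metadata URLs are the standard EC2 instance metadata endpoints.

    #!/bin/bash
    # Minimal sketch of the wrapper described above (not the original slide script).
    # Assumptions: the free 32-bit Geekbench 3 binary is ./geekbench_x86_32 and
    # accepts a --no-upload flag; adjust both for the version you actually have.
    SEQ=${1:-1}                                   # sequence number passed in per run

    # Metadata for the file name, from the EC2 instance metadata service
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)

    # Geekbench doesn't report elapsed time, so capture it around the run
    START=$(date +%s)
    ./geekbench_x86_32 --no-upload > raw_output.txt
    END=$(date +%s)
    ELAPSED=$((END - START))

    # File name carries the metadata: type, ID, sequence number, elapsed seconds
    OUT="${INSTANCE_TYPE}_${INSTANCE_ID}_${SEQ}_${ELAPSED}.csv"

    # Pull the summary scores out of the human-readable report into CSV;
    # the grep patterns are guesses at the report wording and may need tuning.
    grep -E 'Single-Core Score|Multi-Core Score' raw_output.txt \
      | awk '{print $NF}' \
      | paste -sd',' - > "$OUT"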
So, I mentioned we would look at results. These are actual results, and let me walk through what this table is and how to interpret it. For all of these tests I decided to use a single CPU on a cc2.8xlarge as the baseline, since in this particular case I'm interested in integer CPU performance and I'm trying to find the best instance. The absolute values from a tool like Geekbench mean nothing to me; what matters is the relative performance, so I'm setting the cc2.8xlarge to one. Then I have the one-CPU ratio, which is, for each of these instance types, the result that Geekbench provided for a single CPU, and all of these are ratios to the baseline, the cc2.8xlarge. You'll notice that they're fairly consistent, with the m2s being slightly lower than the m3s, and the m3s slightly lower than the new C3s. That's completely understandable: the C3s have 2.8 gigahertz processors, the m3s are 2.6 I believe, and the cc2 is definitely 2.6. I also measured the coefficient of variance. One thing I want you to take away from this: just because the tool reported a coefficient of variance doesn't mean you can say aha, that variance is in that instance. You can say that's what you measured, and what you have to figure out is where it's coming from. Remember our exercise where we had four different people measure this stage and we came up with roughly a 16% variance: the stage wasn't changing in shape or size; the differences happened to be artifacts of the way the measurements were made. In fact, I should point out that this digital tool is great, but if you don't understand that it's measuring from its base, you could be off by, I don't know, five inches by assuming that it didn't count that, so you need to know what it's telling you.

The third and fourth columns are the multi-CPU results; I abbreviated the heading because "multi CPU" was too long and the table got tight. The multi-CPU ratio is the Geekbench number that came out, relative to the baseline, again the single-CPU baseline, and then the coefficient of variance. You can notice here that the ratios are higher, and you would expect that when you have more CPUs involved you get higher results. You'll also notice the coefficient of variance is higher. One assumption you could make is that the coefficient of variance is higher because there's more variability when you're running more threads on the instance; you could also assume that the tool itself is less accurate with multiple CPUs and the variance is coming from there; you could decide maybe it's a combination of both. I don't know, so I'll have to try and figure that out. What's really important, as I mentioned, is that the run time for this is fairly consistently right around two minutes. Most of the benchmark workloads that are in Geekbench are publicly available, but Geekbench doesn't provide the source, they don't tell you how they tuned it, and they don't tell you the workload parameters. I can tell you from running each of those tests individually that they take a lot longer than two minutes to run, on a single CPU or on multiple CPUs, so they've drastically scaled down the workload, probably trying to tune it to look like something; I'm not sure what they tuned it to look like, but a run time of two minutes is a great way to get a quick-and-dirty check.

Now, I mentioned that I wanted to look at variance in multiple ways. My previous chart had nine different instance types; on this chart I'm taking just one of those instance types, the m3.xlarge. Remember, I mentioned ten instances were running, and each of those ten instances ran at least four times. So I took the mean and the coefficient of variance from those runs, and I'm showing you here what they look like. You can see they're fairly consistent, in both the ratio to my previous baseline and in the coefficient of variance. It's important to look at variance both across tests and within tests.

I mentioned earlier that I'm really interested, for this particular hypothetical scenario, in integer performance, so I pulled out the integer-only portion of Geekbench to compare it to the full benchmark. Remember, it had 13 integer, 10 floating-point, and 4 memory tests; I'm only looking at the integer portion. On this particular slide I can't tell you how long the integer portion took, because Geekbench doesn't call that out separately, but it's certainly a fraction of the typical two minutes. For these tests I just pulled a subset of the instances to look at this in detail. What you'll notice in just looking at the integer portion is that the coefficient of variance is typically lower for the same sets of results, for both the single and multi-CPU variants; the ratios are also different, slightly higher, but we'll come back to that in another slide coming up.

Now, UnixBench is another popular test; it's actually been around for a while. The BYTE Index is the default output when you run UnixBench; it was originally developed for BYTE magazine as its UNIX system test, to give people a measure. It has 12 workloads that run two times by default. This is a fixed-time workload, so what you're counting is how much work gets done in that fixed time, and it takes about 30 minutes. It has integer, floating-point, system call, and file system tests.
It does a statistical combination, using a geometric mean, to come up with a relative measure against an ancient SPARCstation 20-61, I believe; it's not a machine you could possibly be buying today, but when you have a benchmark like this, as long as it's consistently using the same baseline and the results are always measured the same way, those relative numbers should hold. What they mean to your application is another thing completely. One comment about this: it's available as source code, but it hasn't been updated since, I think, 2007, and it does not work by default on more than 16 CPUs, so you have to patch it; if you go to the git distribution you can find the patch for doing that. I've actually seen published results where people said they were testing 32-CPU systems, and I looked at the results and found there were no results for the 32, so I really wondered whether they made the numbers up, because I knew they hadn't patched it.

So, the script for UnixBench: again, I'm using the ID of the instance and the type to create the filename. In this particular case, in order to run the single-CPU and multi-CPU tests, I'm using /proc/cpuinfo to get how many processors are exposed, then a very sophisticated runtime command: it's capital-R Run, and -c says how many copies to use. It's a single-threaded test that you run in multiple copies to get a full measure of the system, so this command line says run one copy and then run N copies, where N is the number of CPUs exposed. Then, again, it's always important to figure out how you're going to take those results and put them into an analyzable form, so I use a combination of grep and awk to pull the results out and put them into comma-separated values.
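Again, the slide script itself isn't in the transcript; here is a minimal sketch along the lines described, assuming UnixBench is unpacked in the current directory. The test-name shorthand for the Dhrystone-only run and the exact summary-line wording are assumptions worth double-checking against your copy of UnixBench.

    #!/bin/bash
    # Minimal sketch of the UnixBench wrapper described above (not the original
    # slide script); assumes UnixBench is unpacked in the current directory.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
    NCPU=$(grep -c ^processor /proc/cpuinfo)
    OUT="${INSTANCE_TYPE}_${INSTANCE_ID}_unixbench.csv"

    # Run with 1 copy and with one copy per exposed CPU in a single invocation.
    # To run just the integer portion, you could name the test, e.g. "dhry2reg",
    # after the copy counts (the test-name syntax is an assumption; check ./Run --help).
    ./Run -c 1 -c "$NCPU" > raw_output.txt

    # Pull the final index scores into CSV; the pattern matches the summary line
    # UnixBench prints and may need adjusting for your version.
    grep 'System Benchmarks Index Score' raw_output.txt \
      | awk '{print $NF}' | paste -sd',' - > "$OUT"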

Here are the results, again looking at relative compute; this is for the full UnixBench test. You'll notice that these ratios look slightly different. Again, I'm using the cc2.8xlarge as the baseline, and you'll notice, for instance, that the m2s seem to be almost at the performance of the baseline, whereas in the previous set of tests with Geekbench they were about three quarters, or 0.7. Well, they're running different tests; they're giving you different answers, and you have to decide, for what those tests do, which is more meaningful to your application, or whether there are ways to tweak the test to make it more meaningful. You'll notice in this particular case that the multi-CPU ratio looks pretty screwy: it goes from basically three quarters up to maybe six times, whereas in the previous test it went up to, I believe, about 16 times. The coefficient of variance has some wild results here, even in the single-CPU test, but the run time is always consistently around 30 minutes.

Comparing just the integer portion, which is Dhrystone 2, against the full UnixBench benchmark, you can see that for this subset of tests the single-CPU results look fairly similar and the coefficient of variance tamed quite a bit. The multi-CPU results look very different: instead of getting a ratio of 6, we're getting a ratio of 15.5, and I can tell you from looking at the source code that it's the file system and system call tests that basically become almost single-threaded. So if you're trying to get a multi-CPU measure and it's a blended result, they really damp down the CPU result; if you care about CPU only, you should basically ignore the full UnixBench result and look only at the integer portion. The run time is fascinating here: the integer-only portion, which you can pull out of UnixBench, takes 10 seconds, or 0.17 of a minute. Talk about a quick-and-dirty answer: 10 seconds and you're there. Again, you have to ask, is this test meaningful to me, and how do I want to use these results?

Now, SPEC CPU2006 is probably the most comprehensive of the four test types I'm showing. It is a competitive test, meaning it has been designed by a public committee, SPEC's Open Systems Group, to be a competitive benchmark: if you want to run it and publish results, there are rules for both how you run it and how you report it, and a lot of those rules were designed to make it fair and even; they actually talk about how to prevent cheating and those kinds of things. So it's focused on being used for competitive purposes, but it's a very thorough test. It requires a commercial site license, it's provided as source code, and it must be built. On the positive side it's highly customizable; on the lots-of-work side, there are so many things you can do that tuning can take a long time.
However, the good news is that because this is a publishable result, the published results are available, and those published results have to include the compiler settings and the configuration settings. So for the question of how do I know I have a good result: you can go look up a published result and say, oh, what system type was this, what configuration did they use, let me see if I can reproduce their results; now I know I'm starting from a good base. It does take quite a long time to run. It's a fixed-work, variable-time kind of benchmark, and if you run it in fully reportable mode, depending on how fast the processor is, it can take five-plus hours; five-plus hours for a fully reportable run is a good time, and I've seen results take longer than 24 hours, because it runs a number of the tests multiple times and that takes a while.

I'm only going to focus on the integer portion. There are 12 workloads; all of these workloads are derived from actual workloads of some sort, but they've been highly tailored to be repeatable and usable as benchmarks, so they're scripted, they don't do any disk I/O, and they don't require a lot of system services to be available. For the purposes of a following slide, I want to point out that each of these workloads has a numeric code that designates the workload, so you don't have to spell out the whole workload name when you're telling SPEC CPU what to run. I also want to point out that by default the full integer workloads require two gigabytes of memory per CPU. It's a single-threaded application, so to get a full measure of the system you run multiple copies; if, on the configuration you're testing, you don't have two gigabytes of memory for every CPU, you need to do some tweaking to get meaningful results. The tweak I would suggest is removing 429.mcf, because it's the one workload that requires almost two gigabytes.

So, in the script that I'm about to show you, which for the most part looks similar to the other ones, I'm basically coming up with a file name; in this particular case SPEC expects things to run in a well-known path, and again, knowing the number of processors helps you decide how many copies to run. Here's the invocation command for SPEC. It looks fairly simplistic, but I have to tell you most of the complexity is in the config file; I'm using a default config file in a well-known place, the configs directory, named default, so I simply don't have to type it in, but all of the complexity is in the config file. In this particular case I'm saying it's a non-reportable run, and I've tweaked it to take the least amount of time. One other thing I should have mentioned: if you worry about the cost of benchmarking, then the longer a test takes to run, the more it costs you to run, so anything you can do to shorten the length of time the test takes, without sacrificing its accuracy, makes it cheaper to run. In this particular case I'm using iterations equals one; the default is iterations equals three, so right off the top I cut the time this test takes to a third of what it would normally take. Secondly, I eliminated the mcf workload; you'll see in the numeric list, if you've memorized all 12 of those workloads, that I did not include mcf, and that eliminates the two-gigabyte requirement, so I'm able to run this across the full range of instances I was testing. They have a peculiar file naming convention; I happen to know what it is, so a combination of grep, cut, and awk helped me pull out, from the comma-separated values they produce, the final result that I used to compile this analysis. You can pull out each of the individual workloads and go into a lot more depth, but we only have an hour here, so I'm trying to accelerate.
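As a rough illustration of the tweaks described above, here is a minimal sketch of a runspec invocation, not the original slide script. The flag spellings, the use of numeric shorthand for the benchmark names, and the rate/copies combination are from memory and should be verified against "runspec --help" and the documentation in your SPEC kit; the install path and config name are placeholders.

    #!/bin/bash
    # Minimal sketch of the SPEC CPU2006 invocation described above (not the
    # original slide script). Flag spellings are from memory; check "runspec --help"
    # for your kit. The config file is assumed to be $SPEC/config/default.cfg.
    cd /path/to/spec && source shrc            # SPEC expects its own environment
    NCPU=$(grep -c ^processor /proc/cpuinfo)

    # Integer workloads by numeric code, with 429.mcf left out to drop the
    # roughly 2 GB-per-copy memory requirement, as described above.
    BENCHES="400 401 403 445 456 458 462 464 471 473 483"

    # Non-reportable, single iteration, one copy per exposed CPU (rate-style run).
    runspec --config=default --noreportable --iterations=1 \
            --rate --copies="$NCPU" $BENCHES > raw_output.txt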
So, in this particular case, again using the cc2.8xlarge as the baseline, you see that the ratios actually look fairly similar to what we saw, for instance, in Geekbench, where the m2s looked to be about 0.7 or three quarters of the cc2.8xlarge, and the C3s looked to be about 8 to 10% better. The coefficient of variance for both the single and multi-CPU runs is actually pretty good; I can tell you from experience that this is one of the artifacts of how much engineering went into building this benchmark, to make sure that results are repeatable and that the artifacts you measure are really a result of what you're measuring, not of the tests themselves, if run properly. As I mentioned, this does run longer, and even in this tweaked state the run times were from an hour to almost two hours. It takes longer to run the multi-CPU tests because of the way it times them: it waits for all of the copies to completely finish, and sometimes on larger-CPU systems that takes longer. The results are still computed using the same statistical method, but the total time it takes to get all the work done can be longer.

The last one, and then I'll pull together a summary of all four of these and compare and contrast: sysbench was designed for DBAs to do a quick-and-dirty test of a system's usefulness for MySQL. It has six different test categories, ranging from file I/O to actually doing OLTP database calls. It's provided in source code format and must be built.

My best advice on this test is that it has very simplistic defaults, one might say outdated defaults, based on when it was developed versus what kinds of systems are available today, so I strongly caution you against using any sysbench results that rely on the default values; you'll see why in a minute. The script here is very similar, so I won't spend time on figuring out the name of the file. In this particular case, the options for the CPU test let you specify the maximum number of requests; by default it's ten thousand, and I modified it to thirty thousand when I tuned it. And the max prime: it's basically doing a prime-number calculation, and by default the max prime is ten thousand; here I set it to one hundred thousand. This is fixed work, variable time, so all it tells you when it runs is how long it took.
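Here is a minimal sketch of the tuned sysbench CPU run described above, not the original slide script; the option names follow the classic 0.4.x sysbench command-line syntax and may differ in newer releases.

    #!/bin/bash
    # Minimal sketch of the tuned sysbench CPU run described above (not the
    # original slide script); option names follow the classic 0.4.x sysbench.
    NCPU=$(grep -c ^processor /proc/cpuinfo)

    # Defaults are max-requests=10000 and cpu-max-prime=10000, which finish in
    # well under a second on a large instance; the tuned values below make the
    # run long enough to measure meaningfully.
    sysbench --test=cpu \
             --cpu-max-prime=100000 \
             --max-requests=30000 \
             --num-threads="$NCPU" \
             run > raw_output.txt

    # The figure of merit is the elapsed time (fixed work, variable time).
    grep 'total time:' raw_output.txt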
The challenge here is with the defaults. First, the relative results, using the same baseline, show a much broader spread, from 1.78 up to 25 times, and the coefficients of variance tend to be much higher than we've seen. But most importantly, if you look at the run times, these are fractions of a minute; for instance, for the cc2.8xlarge the run time for the default CPU test is less than a second, half a second. If you think about the accuracy, this is kind of like taking my measuring tape and measuring the width of this stage with the thickness of the tape: it's just pretty hard to get an accurate result when the test runs that quickly, and that's what I meant about the defaults. Now, using my tuned parameters, you can see that the ratios start to look a lot more like the ratios we saw before, ranging from a little less than 1 to about 13.7, and the coefficients of variance are better, but still slightly wonky. From what I can tell, that wonkiness is an artifact of the test, not the instances, because remember, I used the same instances for all these tests. And finally, the run time in this tweaked version is still fairly short; the longest one is about seven minutes. So you could theoretically use something like sysbench and play with the parameters until the tool showed behavior that looked more like your application. My best advice, though: be very careful with using defaults on this one.

So, in the half hour or more that we've spent diving into CPU measurement, I've given you four different ways, actually six if you count the variants, a bunch of different ways of measuring CPU. The first question I'd ask you, going back to the beginning, is: how do you know when you've got the best instance? In this particular case you'd have to decide which one of the tests is most relevant to you, which tests you trust the most, or which combination of tests. But a few things I'd like to point out. First, particularly with UnixBench and with sysbench, using the defaults may not give you the most meaningful results, so you need to look carefully at what the tests are telling you to see if it makes sense to you. Secondly, if you look at the tuned results, they all come in fairly similar; as I mentioned earlier, if you're using a test for the first time, having more than one way of calibrating tests against each other tells you whether you're getting something that makes sense. You may decide the best test for you to run is the one that costs you the least, in terms of acquiring the test and how long it takes to run, and that is good enough. For instance, using the analogy of measuring the stage: if I were trying to make a solid gold cover for this stage (why, I don't know, I'm just using it as an example where precision really means the difference between spending lots of money or not), you would probably want to use the digital measure and make sure you knew how to use it really well to get a really accurate measurement. If you're doing one of those "that's good enough" kinds of tests, then maybe one of the tests that runs in two minutes is good enough. I can't tell you which one is right; you really have to understand what the test does and what it's telling you.

Transitioning from CPU to memory, I'm using a similar approach, except in this particular case the scenario is: I need to find an instance that gives me at least 20,000 megabytes per second. I know that's a critical throughput for my application, so I'm going to run these tests using several different types of measurement, and I'm looking for an absolute value of at least 20,000 megabytes per second.

Same approach: choose a benchmark, run it on premises, then run it carefully on the set of instances you're considering, and then do your analysis. In this particular case I'm going to introduce one new test, called STREAM, and that's only partially true, in that if you were really fast and read my slide on what the tests in Geekbench were, the four memory tests there are actually variants of STREAM; in fact there are four workload types in STREAM, which I'll talk about on the next slide. Then I'm going to use the memory test portions of Geekbench and sysbench to compare and contrast these tools as a way of measuring memory throughput, again using nine instance types and ten instances of each one to get a sense of variability, using the exact same instances I used before with the exact same AMI.

STREAM was developed by Dr. John McCalpin at the University of Virginia; if you go to his website, his nickname is Dr. Bandwidth, so that tells you something about him. There have been many, many academic reports based on his tool, particularly from hardware engineers looking at quantifying and improving memory bandwidth. There are four relatively simplistic tests: the first is simply a copy from one location to another; the second is Scale, which is a copy with a multiplier; the third is a sum, where you're adding essentially two blocks of memory together and putting the result into a third; and the final one is a combination of those, called Triad. For the purposes of time and the sake of simplicity, I'm not going to go through the gory details of each of these; many people who are looking for one measure tend to use Triad, so for the sake of simplicity I'm just going to use Triad.

Going back to how do I know I have a good result: with this test, results are published on the University of Virginia website. I have to say they're sporadically published and not necessarily always up to date, but at least it's one of those ways of saying hey, I just measured the stage, how reasonable is this measurement? They tend to be the best-ever numbers reported, as opposed to everyday numbers. It must be built, and by default it runs one thread for every CPU that it can see. There is a publicly available tool out there called stream-scaling; it's a script that examines the CPU type and the cache size and comes up with optimal settings for running STREAM, which is a really good thing. I'm going to show you first how you run it. In this particular case I'm calculating the number of threads; you tell the test how many threads to use by setting an environment variable. Again I'm setting up the filename using the same mechanism as before, and then, since STREAM doesn't really take any parameters, you just invoke stream, and I use egrep to basically pull out that single number. Then, just to compare this with what you've already seen: to run the sysbench memory test, you tell it the number of threads, you say the test equals memory, and then run.
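Here is a minimal sketch of the STREAM and sysbench memory runs described above, not the original slide scripts. It assumes stream was compiled with OpenMP so that the thread count is controlled by the OMP_NUM_THREADS environment variable; that build detail, and the sysbench option names, are assumptions to verify for your binaries.

    #!/bin/bash
    # Minimal sketch of the STREAM and sysbench memory runs described above (not
    # the original slide scripts). Assumes stream was compiled with OpenMP, so the
    # thread count is controlled by OMP_NUM_THREADS (an assumption about the build).
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
    NCPU=$(grep -c ^processor /proc/cpuinfo)
    OUT="${INSTANCE_TYPE}_${INSTANCE_ID}_memory.csv"

    # STREAM: one thread per CPU, then pull just the Triad bandwidth (MB/s).
    export OMP_NUM_THREADS="$NCPU"
    ./stream > stream_output.txt
    TRIAD=$(egrep '^Triad' stream_output.txt | awk '{print $2}')

    # sysbench memory test for comparison; by default it does sequential writes.
    sysbench --test=memory --num-threads="$NCPU" run > sysbench_mem.txt

    echo "${INSTANCE_TYPE},${INSTANCE_ID},${TRIAD}" > "$OUT"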
So here, let's compare and contrast these three measurements using the methodology I've described. The first thing I want to point out is that if our primary requirement was to pick any system that provided 20,000 megabytes per second, you can see from this list that one of the m2s is there, the others were not, and most of the other instance types at least met it, with the cc2.8xlarge providing 55,000 megabytes per second. How would you decide which instance to choose? There are lots of ways you could do that: you could say which is going to cost me the least, I have other things going on where having more cores is better, I need more memory, that sort of thing. But in this scenario we're looking for something that provides at least 20,000 megabytes per second, and STREAM is telling me this. Now, this is the tuned STREAM Triad, where I used stream-scaling to pull the result, and what that does is try threads 1 through N, where N is the number of CPUs, and essentially tell you what memory bandwidth you get with each, which means from an application perspective you can use that to say, oh, if I use this many threads, I'm going to get the maximum throughput. As mentioned before, Geekbench is a black box, so you really have no control over how it runs the test. It is running the exact same Triad test, and what you can see comparing these numbers is that the numbers aren't the same. What does that mean? It means it's not tuned; you're getting the default values.

So if you're looking for an absolute value, in this particular case the only instance types you would choose are the c3.2xlarge and the cc2.8xlarge, whereas in the first column, where I used the stream-scaling script to find the throughput, I found that a majority of these instance types would actually have met the goal. The last one, I have to tell you, is that the defaults for sysbench are, let's say, different, and according to sysbench none of these instance types would have provided the bandwidth. However, caveat: sysbench is actually doing an even simpler set of tests, where by default it's doing sequential writes to memory. So if what you cared about was sequential writes, then Triad may not be what you need, and maybe you do need to use sysbench and play with the defaults, and by play with the defaults I mean heavily play with the defaults, because what I've found is that the defaults for sysbench in general tend to be slightly meaningless; they were originally developed when CPUs weren't as powerful or as fast as they are today.

The last topic: I have to apologize in advance, because I could easily spend two hours on how to benchmark disk I/O, but I wanted to cover some of the main highlights, particularly the choices you face on EC2 with all of the variants. The first choice, if you want to test disk I/O on EC2, is what type of I/O you want to test. Obviously there are differences between using ephemeral storage and EBS when you talk about availability and so on; you can use Provisioned IOPS if you're worried about consistency of the performance, and on the newer instance types like the C3s, where we have solid state disks, or on the hi1, where there are solid state disks, if you want blazing fast local storage you may choose to use that, understanding what the characteristics of that storage are beyond just its performance. Secondly, when looking at disk performance, the question you have to ask is what kind of measurement you want to do: do you want to test a single volume, some form of striping, or software RAID? There are many different ways; some tools are going to help you with this, and some tools won't help you at all. Finally, when it comes down to what you're actually going to measure, there are, fortunately or unfortunately depending on how much time you have, lots of things to think about. First, what type of access are you doing: is it read or write or some combination, and if it's some combination, what percentage read and what percentage write? Secondly, what access pattern: is it sequential, is it random? Finally, and this is very important and many people miss this step, you want to think about queue depth. What does queue depth translate into? Queue depth translates into how many I/O requests you keep in flight at once. Many people say all I care about is IOPS, and I just need X IOPS, but if you ask them, do you want X IOPS at 400 microseconds or do you want X IOPS at 1.5 milliseconds, they would say, well, of course 400 microseconds. Then I would say, well, then you don't just care about IOPS, you care about IOPS and latency; it's the same with bandwidth. So the broad categories are IOPS, latency, and throughput.

There's some good documentation on the EC2 website: if you do a search for EBS performance under documentation, you'll come up with a long list of recommendations, including how to run tests and so on, but I wanted to give you a simple guideline to measuring Provisioned IOPS. Why am I choosing to measure Provisioned IOPS? Because I've seen many customers who've struggled with getting the IOPS they want, and it was simply a matter of how they were doing the testing.
Remember, we measured this stage multiple different ways, and some gave us different results. The simple approach is to launch an EBS-optimized instance; if you're going to measure Provisioned IOPS, you really want to use EBS-optimized instances. Then you provision the volumes (you can do both simultaneously with the wizard or with the API, but I'm keeping this a simple step-by-step approach), and then you attach them. The guidelines in the documentation I mentioned talk about pre-warming the volumes. Why would you pre-warm? Unless what you're trying to measure is the production performance of a volume that has just been provisioned, you're going to get a more reliable reading when you've pre-warmed it. What is pre-warming? It means touching the blocks, and I'll show you a way to do that with fio, which is one of the tools I'm going to talk about.
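As a sketch of the launch, provision, and attach steps described above, here is one way to do it with the AWS CLI; the talk mentions using the console wizard or the API, so this is just one option, and the AMI ID, instance type, availability zone, and device name below are placeholders rather than values from the talk.

    #!/bin/bash
    # Minimal sketch of the launch/provision/attach steps described above, using
    # the AWS CLI; the AMI ID, instance type, availability zone, and device name
    # are placeholders, not values from the talk.
    AZ=us-east-1a

    # 1. Launch an EBS-optimized instance
    INSTANCE_ID=$(aws ec2 run-instances --image-id ami-12345678 \
        --instance-type m3.xlarge --ebs-optimized \
        --placement AvailabilityZone=$AZ \
        --query 'Instances[0].InstanceId' --output text)

    # 2. Provision a 400 GB volume with 4,000 Provisioned IOPS
    VOLUME_ID=$(aws ec2 create-volume --availability-zone $AZ \
        --size 400 --volume-type io1 --iops 4000 \
        --query 'VolumeId' --output text)

    # 3. Attach it (pre-warm it with fio before measuring, as described above)
    aws ec2 attach-volume --volume-id "$VOLUME_ID" \
        --instance-id "$INSTANCE_ID" --device /dev/sdf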

But to my previous point about latency: this is a graph that I produced (don't worry about the complexity of the graph, just the two arrows). The blue bar is queue depth 1 for a sequential read at 2,000 IOPS, and the orange bar is queue depth 2. So if I said all I care about is getting 2,000 IOPS, you can see that if you set the queue depth to 2 in this particular test, then you're going to be accepting more than two times the latency while still getting that IOPS level. Queue depth is really important.

So, just to illustrate a number of different ways that you could measure, and that I've seen people measure, disk performance: one of my favorites, which definitely fits into the quick-and-dirty category, is that someone takes an existing file, or, a little more sophisticated, uses dd to create a file, and then does a file copy, and says I'm going to measure the throughput of this disk by doing a file copy. You can calculate how long it takes and say I copied so many bytes in so many seconds. I have to caution you that it's not very precise, and it can be prone to all sorts of complexity. Second, a slightly more sophisticated way: people use the tool dd. I think one of the reasons they use it is that it has to do with disks, and when it's done it tells you what the throughput was, so it's all there. The challenge is that it's not accurate, from a variety of different perspectives. For instance, I'll show you some examples where I ran dd with a one-gig file: I used zero bytes to create a one-gig file, and I think it took 0.7 seconds, and it told me it got 1.5 gigabytes per second of bandwidth. Well, (a) I happen to know the particular device was a Provisioned IOPS volume not capable of 1.5 gigabytes per second, and (b) I did the math and calculated how long it took versus how much data was copied, and the answer wasn't 1.5, it was 1.4 or something; it generously rounded up to 1.5. But it is easy to use and can be quick.

So, of the three variants, the most sophisticated, which is both good and bad, is fio, the flexible I/O tester. It's publicly available. The good part is that you can test almost any kind of I/O pattern you want, including across multiple volumes. The complexity lives in the config file, and the config file here does pre-warming, and it does it in a careful way to make sure you're really touching all the blocks. And I'm doing a very simple test, which is a sequential write, using an I/O depth of 1, meaning a queue depth of 1, and driving 4,080 IOPS. These tests were all run on a 4,000 IOPS provisioned volume, and you may say, it's 4,000, why are you driving 4,080? Well, from my experience, when you drive slightly more than the provisioned level, you sort of get the maximum amount out of it. It's kind of like when you go to the gas station, you fill up your gas and the machine cuts off, but there's still a little more space in there, and you're not supposed to do it, but I like to fill it anyway. That's sharing way too much about my quirks; sorry, therapy again.
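The actual config file isn't shown in the transcript, so here is a minimal sketch of the kind of fio job described above. The device name, block size, and runtime are assumptions, and rate_iops is used to cap submission at roughly the 4,080 IOPS mentioned; note that this writes to the raw device, so only point it at a scratch volume.

    #!/bin/bash
    # Minimal sketch of the kind of fio job described above (not the original
    # slide config). The device name, block size, and runtime are assumptions;
    # this writes to the raw device, so only point it at a scratch volume.
    cat > piops_test.fio <<'EOF'
    [global]
    filename=/dev/xvdf
    direct=1
    ioengine=libaio
    bs=16k

    [prewarm]
    rw=read
    size=100%

    [seq-write-qd1]
    stonewall
    rw=write
    iodepth=1
    rate_iops=4080
    time_based
    runtime=300
    EOF

    fio piops_test.fio > fio_output.txt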
So, bringing this all together, I wanted to point out some of the problems with using copy. The first test I did was copy a 1-gig file from file1 to file2; this 4,000 IOPS provisioned volume was actually 400 gigabytes in size, and I had pre-warmed it, so there were no artifacts of the measurement, and I can tell you that the bandwidth of this volume is 64 megabytes per second, so anything that measures above that is probably not accurate. The copy, simply copying file1 to file2, took 17 seconds, which told me the bandwidth was around 59 megabytes per second. If I deleted file2 and then copied file1 to file2 again, it took 0.853 seconds and it said the bandwidth was twelve hundred megabytes per second. Why do you think that was? Can anyone say cache? Yes. So there are artifacts of how you run these tests that can greatly perturb the results; you have to be very careful about both what the test does and how you run it. Then I copied file1 to file3, and file1 was still cached, so again it's really blazingly fast.

Then I used dd. The first test was, as I mentioned, using the zero-byte fill, and with zero-byte fill the one-gig file ended up taking 0.72 seconds; it told me that I did 1.5 gigabytes per second, but if you do the math it's actually 1.419 or so. Okay, stop. Then I did a /dev/urandom fill; obviously writing random bits versus writing zero fill are very different workloads. In this particular case it took almost eighty seconds, and the measured throughput was 12.84 megabytes per second. Why do you think the bandwidth was only 12.84 megabytes per second? It wasn't the disk you were measuring; it was the ability to generate the random data. So if you're trying to measure disk performance, /dev/urandom isn't the way to do it.
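For reference, here is a minimal sketch of dd invocations along the lines of the experiments described above; the mount point is a placeholder, and these are not the exact commands from the talk.

    #!/bin/bash
    # Minimal sketch of the dd experiments described above (not the exact commands
    # from the talk); /mnt/piops is a placeholder mount point for the test volume.
    # Zero fill: finishes suspiciously fast and mostly measures caching/buffering,
    # not the volume, unless you force direct I/O.
    dd if=/dev/zero of=/mnt/piops/file1 bs=1M count=1024

    # Random fill: now you are mostly measuring how fast /dev/urandom can produce
    # data, not the volume.
    dd if=/dev/urandom of=/mnt/piops/file2 bs=1M count=1024

    # Forcing the writes to hit the device (oflag=direct) gives a less misleading
    # number, but fio remains the better tool for controlled patterns.
    dd if=/dev/zero of=/mnt/piops/file3 bs=1M count=1024 oflag=direct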

Finally, looking at the fio test: I didn't go into detail about the config file, but this test literally ran for thirty seconds, so it was not a very thorough test (it takes a while to get things up to speed), but it measured roughly 61.5 megabytes per second. And I can show you on the next slide: I was asked to help out with a blog post that Jeff did recently, sort of demonstrating testing at scale. This is a twelve-volume test with twelve 4,000 IOPS provisioned volumes, and I ran a sequence of read, write, and mixed read-write tests; the flat line is 64 megabytes per second, so basically I was able to drive all 12 volumes simultaneously at maximum bandwidth. I bring this up mainly to show that with fio in particular you can do some very sophisticated tests if that's what you need, and it's going to be much more accurate than the others.

So, a brief summary, and then we have a brief amount of time for probably a couple of questions. Out of all the stuff I threw at you in the hour you were gracious enough to spend with me this morning: you want to choose the best benchmark that represents your application, and if you can't use your application itself, you really need to understand what best means. When it comes to looking at the results and asking is this a good test, is this a good result, it's really important to run enough samples to quantify variability. That's particularly important in the cloud, where there are more layers of abstraction, and particularly important when you're running tools you're not used to, to make sure that the artifacts you're measuring are not part of the tool as opposed to the thing you're trying to measure. You really need to understand what a good result looks like: having a baseline, having a result someone else has run on a similar type of configuration, running multiple tests to ask does this really make sense. Finally, keep all of your results. I can't tell you how many times I've gone back and needed results from previous tests that had nothing to do with what I was currently trying to test; it has saved me so much time to keep information about not only how I ran the test but what the test was and, most importantly, the results. I have a database that keeps years of results in it, and I can do lots of wonderful SQL queries to pull things out.

So, in the time that we have left: first, I want to thank you. Secondly, I would appreciate it if you would fill out a survey. One of the reasons I'm here this year is that I asked people to do that last year, and enough people were either caffeinated enough, or appreciated what I talked about, or maybe it was the bribes, whatever it was, they liked it, and so they asked me to do another talk. So in the time we have left, which is about three minutes, there's time for probably one or two questions.
I'll answer questions, and then I have to vacate the room; I'll be outside, and we can go somewhere else if you have more questions. So, any questions?