Advanced Systems Monitoring with Nagios, PNP and Nconf

So, hi. It’s good to see all of you who aren’t interested in what’s new with monkey. My name is Josh Malone. I’m a system administrator with the National Radio Astronomy Observatory, and I tend to laugh at myself when I’m nervous talking in front of large groups, like I was saying just a minute ago, recorded for posterity. I’ve been there for about 10 years. I’ve been running Nagios for probably about 13 years; I used it several places before I came to work here. And, well, as far as I’m concerned, Nagios is great, right?

This talk sort of builds upon my talk from last year, where we got through installation, basic configuration, theory of operation, and what you should do in a monitored network environment. So now we’re just sort of picking up from there, where we’ve all universally agreed that this is true. It checks our stuff, it lets us know when there are problems, life is good. But there’s always a “but”, right?

I don’t know about you guys, but my data center these days kind of looks like this. Services just keep proliferating: more systems, more pieces of software. I was the only person maintaining our Nagios system, and I’d have to go chase down stuff. Also, we had a lights-on data center when I started there. We actually had people whose offices were in the machine room, so we would see amber lights on the Dell servers and we would hear beeping from their RAID cards. We knew that something was going wrong when it was going wrong, because we could physically see it. Now that we’ve all moved out of the data center, we need something a little bit more robust.

There’s more of us, and we’re all responsible for various different things. It’s no longer just “the Linux team does the main network services and the Windows team does desktops”. Well, no, we’ve got Windows services we need to pay serious attention to now. And of course, when there’s more of us, we all want to work on stuff at the same time. It’s no longer cool to have an environment where there’s only one person responsible for X and
there’s only one person responsible for Y. We need to have cross-training. We all need to be able to work on stuff. (The pointer’s not working.)

The demand for metrics is starting to rear its ugly head. We have monthly stuff we have to report all the way up to the National Science Foundation. We’re a federally funded research and development corporation, and we need to tell them: how’s our uptime, what percentage of our storage is in use, what percentage of our cluster is in use. And when you’re reporting this sort of metric to management, a picture really is worth a thousand words. So how can we collect all the stuff that our monitoring system knows about? The monitoring system has a pretty good idea what our availability is and where our capacity is, but how can we get that information out in a form that’s digestible?

And then finally, we just need to make sure that the monitoring system is completely bulletproof. No longer is it acceptable for some homegrown plugin not to detect the failure of a service. Everything has to just work, right? It’s got to let you know when something’s down. If you don’t know before the users tell your boss, you are in deep.

So we’ll take today to build on last year: take our monitoring solution and, whatever trite term you want to use, kick it up a notch, and figure out how we can take something from “yes, it’s monitoring and paging us” to “this works really nicely and I really enjoy using it”.

By this point we had already settled on Nagios as our monitoring solution. We had the in-house expertise, we’re running it at all three sites, and we know that it’s extensible enough to monitor basically everything that we would want to monitor. Additionally, in our environment, network traffic monitoring was out of scope for us. We use a piece of software called Statseeker for that. It does a real good job of monitoring switches, bandwidth, network, that sort of stuff, so we didn’t have to worry about that with Nagios. There are some good ways to
do that, but that wasn’t what I was concerned with. And second, our HPC stuff is... (I think it’s fine now. I think it was actually user error, amazingly enough, but I appreciate it.) Our HPC monitoring uses Ganglia. Ganglia is really good at getting to the low level: how much time is your processor spending waiting on your hard drives, how much of your memory is buffers, how much of it is cache, how much of it is real memory in use by an application. So again, that we weren’t concerned with. Retooling the monitoring infrastructure was really just about: make sure the services are up, make sure you’re accurately detecting whether they’re up or down, and make sure you’re getting good metrics out of them.

There’s no slide in here where I just stop and ask for Q&A, so if you have any questions that are pertinent to the current slide, please interrupt me, although I reserve the right to defer you to a later slide.

So we start with this add-on, PNP4Nagios. Simply, it graphs the data from your service checks. When I initially pitched this to the other people, they were like, “That’s really flashy and totally unnecessary. We’ll never need that.” And then you realize you have the ability to show your facilities manager: yep, the new air conditioner’s working great; right there’s where we pulled some tiles out of the floor and changed things around a little bit. I can say the environment’s good. Or I can show this to my CIO and say we’re going to need that new disk shelf on the NetApp in about six months. When you start to show them stuff like that, it changes their minds a little bit.

The second add-on I want to cover is NConf. This is a web-based Nagios configurator. Remember how I said I was the only one responsible for setting up Nagios? Well, this allows us to take the responsibility of configuring monitoring and push it out to the people who are actually running the services and actually installing the services. It turns all your hairy little Nagios config files into a web dashboard like this, where you can actually look at your list of hosts, list of services, and presets. Right now my Nagios server has over nine thousand lines of config files, and I would never want to manage that by hand. I’m seeing a couple of shocked looks out there. This is a demo screenshot; I’ll show you some from my actual environment later. But I don’t have to worry about
how complex and hairy my config files are, because I never see them anymore. And it’s not only a big help to me, it’s a big help to my other admins, who don’t have to learn the syntax of a config file. They just learn: you add a host, you add a service, you pick from the list of predefined service checks, deploy it, make sure it works, yes or no, great.

And then after we get some add-ons in the mix, we can go through and revisit the plugins that we’re using. When Nagios started out, there was this site called Nagios Exchange. It was a great place for people to post plugins that they developed. Then the great Nagios domain seizure of whatever year it was happened, and Nagios got forked eight ways from Sunday, and so did the monitoring repositories. So now we have the Icinga Exchange, the Monitoring Exchange, the Monitoring Plugins site; there’s a million of these things. And then the wonderfulness that is GitHub happened, and people started actually posting plugins on GitHub. So I can go and download the plugin, look at how it’s developed, fork the repository, fix bugs, and send pull requests. Yay, GitHub.

So things have improved in this space, but there are still a lot of crud plugins out there, a lot of check plugins that just barely work. We were running some of them, and the service would go down and the plugin wouldn’t notice, or it wouldn’t be configured for this particular special case, or the service would be up but not quite looking the same and we’d think it was down. So finding the right plugin, the plugin that actually monitors things and actually works, matters; the quality control of community plugins needs some addressing. And then ultimately you should really be comfortable writing plugins yourself, because eventually there’s going to be something that no one else out there has figured out how to monitor, or no one monitors it the way you want to.

(Yeah? I will show you; this is where I’m reserving my option to defer you to a later slide.) So we’ll cover some stuff about Nagios plugins, particularly in Perl, although
as you’ll see, you don’t have to know Perl.

So we’ll start with PNP. Anyone using PNP4Nagios? Any graphing add-ons? Anyone familiar with RRDtool, MRTG, Cacti? Okay, some of this will seem familiar to you.

PNP4Nagios graphs performance data. What is performance data? Performance data is something that can be returned by a plugin, and it’s really just the metric that the plugin used to determine whether that service was okay or not. For example, check_http will check and make sure a web page was returned within a proper amount of time. Well, the amount of time that it took is the performance data. It could be the response time of a web page, it could be the web page size, and if you are monitoring network stuff with it, it could be the network throughput or the current bandwidth of a link. It could be room temperature, if you have environment monitoring sensors. So it’s any actual, real data associated with how you determined what state this service is in.

As an example, here’s the check_ping plugin. This is the plugin that’s used just to ping the server and make sure it’s up. This is a typical invocation: -H and the host address right here (this is the upstream router from my DSL line at home), and we’ll set our warning threshold to be 100 milliseconds or 2% packet loss and our critical threshold to be 200 milliseconds or 5% packet loss. It’s a perfectly typical example of running the check_ping plugin.

The output you get from that is this. It all comes out on standard out, just like with any Nagios plugin, but this part right here is known as the screen output. This is what you see in the CGIs, this is what gets sent to your pager; this is the human-readable output from that check plugin. But after that, separated by this vertical bar character right here, we have all this other good stuff that you can do things with. After the bar is the performance data, and what it’s telling you is: RTA, the round-trip average, was 56.563000 milliseconds (I don’t know why it needs six decimal places, but they’re there); the warning threshold I specified was 100, same as there, and critical was 200, there. And the packet loss was zero percent; 2 would have been warning, 5 would have been critical. The zero is a minimum scale that suggests to PNP how to draw the graphs. So this comes out of the check plugin, and you hand that data to PNP4Nagios.
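
As a rough illustration of the format just described, here is a minimal Python sketch that splits one line of plugin output at the vertical bar and pulls the label, value, unit, warning, critical, minimum and maximum out of each performance data item. The sample line is reconstructed from the numbers quoted in the talk rather than copied from the slide, and the names `parse_plugin_output` and `PerfDatum` are made up for this example. It deliberately ignores quoted labels and range-style thresholds, which real plugins may emit.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class PerfDatum(NamedTuple):
    """One label=value[unit];warn;crit;min;max item (hypothetical helper type)."""
    label: str
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]
    minimum: Optional[float]
    maximum: Optional[float]


# A number followed by an optional unit such as "ms" or "%".
_VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def _to_float(field: str) -> Optional[float]:
    """Return the field as a float, or None if it is empty or not a plain number."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_plugin_output(line: str) -> Tuple[str, List[PerfDatum]]:
    """Split plugin output into (screen output, list of performance data items)."""
    screen, _, perf = line.partition("|")
    data = []
    for token in perf.split():
        label, _, rest = token.partition("=")
        # Pad so that missing trailing fields come back as empty strings.
        fields = rest.split(";") + [""] * 4
        match = _VALUE_RE.match(fields[0])
        if match is None:
            continue  # skip anything this simple sketch does not understand
        value, unit = match.groups()
        data.append(PerfDatum(label, float(value), unit, *map(_to_float, fields[1:5])))
    return screen.strip(), data


# Illustrative line assembled from the values mentioned in the talk.
SAMPLE = ("PING OK - Packet loss = 0%, RTA = 56.56 ms"
          "|rta=56.563000ms;100.000000;200.000000;0.000000 pl=0%;2;5;0")

if __name__ == "__main__":
    screen, perfdata = parse_plugin_output(SAMPLE)
    print(screen)
    for item in perfdata:
        print(item)
```

Everything before the bar is what a person reads; everything after it is what a grapher such as PNP would consume.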
And you get these wonderful graphs right here. I installed Nagios on a Raspberry Pi at my house and just had it monitor my DSL line for a couple of weeks, to get some pretty graphs and to show how crap my DSL line is. You can see right here it’s exceeded my warning threshold for round-trip average several times, and I got some pretty wicked packet loss right there, about forty percent. So this is the visualization of all the data that’s been coming out of your plugins this whole time that you never knew about.

Now, of course, not all plugins support performance data. It is optional in the API. Some plugins require a command-line flag to activate the performance data. The Dell OpenManage plugin is one of those; you have to specify -p. So if your plugin isn’t giving you perfdata, check its help text, check its man page, see if there’s a way to turn it on.

Like I said, the quality of plugins is not standardized. Some plugins output things in the screen output that would be performance data, but they’re not formatted correctly, so Nagios isn’t harvesting it for you. What I’ve had to do in those places is wrap that check plugin in some other script that parses the screen output, reformats it as performance data, and then outputs that. The check_mysql plugin used to be that way, but now it outputs performance data, and really nice performance data too. It tells you how many queries per second you’re running, how many connections you have to your server, how many open tables: really good stuff that you can actually use to plan the capacity and performance of your MySQL server.

So now that you get all this data out, the unfortunate fact is that by itself Nagios does nothing with performance data. It puts it in some macros, but unless you use those macros somewhere else, it just goes into the bit bucket. So we have to install an add-on for it. Nagios just comes with these two sample commands that show you some of the things you could do with performance data,
but by default they do nothing. So when we install PNP, the basic installation of PNP4Nagios is really simple. You just replace those two commands with commands that throw that performance data into the process_perfdata.pl script that’s part of PNP. What it does is take that performance data and throw it into a round-robin database.

(Did you miss something on the last slide? Oh, sorry. Yeah, please don’t try to write any of this down. Most of this is all gleaned from the installation manuals anyway. When you’re going back to install this, you’ll have seen this before; you’ll recognize something out of the installation manual. I’m trying to point out some of the places that are more important. And yes, I’m going pretty fast, sorry.)

Round-robin databases, that’s right. A round-robin database is an industry-standard way of storing time series data in a fixed amount of space. It’s a database that’s built up of a whole series of bins with different time scales to them. To make the example really simple, let’s say you have a check that’s running every one minute. You probably wouldn’t have this for real unless you have a super critical service. But if you’ve got new performance data coming in every one minute, PNP is just going to add it into the one-minute bins in the RRD. Eventually you’ll run out of one-minute bins, and then it’ll take, say, the last five and roll them up into a five-minute average. And then eventually you run out of five-minute bins, and it will roll them up into the next slot. The result is that your file never grows over time, but resolution is lost. So you can sit there and monitor your data; by default it allocates enough space for four years, and it only takes about seven megabytes. And you know that it’s not going to grow over time. It’s not going to eventually fill up your disk because you’ve been monitoring too much.

RRDtool is used in MRTG, Cacti, basically anything that draws those types of graphs. It’s an open source project written by Tobi Oetiker. Once you get the data into the RRDs, the RRDtool component rrdgraph can suck the data out of those RRDs, given whatever time window you want to visualize, and give you a visualization. This particular example shows you multiple time series in one RRD. So this is the
CPU load off of my Cisco 6509 switch. Load averages typically come in a five-minute, a one-minute, and a five-second average, if you look at load or the top command on any Linux system. What you can see here is the red five-minute average sort of peeking through the yellow five-second average, and you can sort of glance at this and go: okay, no significant problems with the CPU there. There’s a spike here to 20 or 24 percent, but we’re not in danger of causing any problems on our core switch.

Another way it can visualize the data is by stacking time series. This is off of the current probes on one of our dual-power-supply servers. This is one power supply, this is another. They total up to about 0.8 amps, and they’re distributing the power evenly. We like that. Again, we can just glance at the graph and say everything’s good. We can easily see the sum total; we know it’s not drawing a ridiculous three amps or something.

And then finally, the last cool thing that it can do is multiple-line graphs. This one is off of the OpenManage plugin again, on a Dell server, displaying the values of all four of my cooling fans, all running at about the same level. So we know we don’t have any particularly bad hot spots in this server. We don’t have one fan that’s going way too slow or one that’s going all the way up to its max, and we can infer some stuff about the health of our machine that way. Oh, I should also mention this graph is monitoring a Windows server. This doesn’t just work against your Linux stuff; the Nagios check plugins can run just fine on Windows servers. This might be the Exchange server or something. I don’t monitor any Mac stuff with Nagios, but it’s possible.

One quick thing to note; this is another one of those places where I want to point out something interesting in the installation manual. There are two ways you can set up PNP. In synchronous mode, every single time a check plugin is run, it stuffs the data into the RRD and runs its stuff. If you have a
big site monitoring lots and lots of services, this can cause the load on your machine to shoot up, because you’re running RRDtool too many times, too many Perl execs. It also has a bulk mode, where the performance data is just accumulated in a flat file after every check, and then every 30 or 60 seconds, or whatever you configure for this number, it goes through, reads all that data out, and updates all the RRDs. That reduces the load on your system at the expense of not having your graphs updated immediately. It’s usually not a big problem to trade that off if you have a big site.

I’m monitoring, I don’t know, six or seven hundred services, most of them with performance data, and this mode isn’t causing me any headache yet. (It’s the synchronous mode, yes.) It’s a bog-standard Dell server, nothing particularly fancy. But I’ll just mention that if you’re running either an underpowered server or a really large environment, if you’re monitoring 7,000 desktops in your classroom labs, you might have problems with synchronous mode. The reason that I’m pointing this out right now is that this mode, synchronous, is really easy, and this mode, bulk, is not quite as easy. There’s some more stuff you have to set up: file permissions, things have to be able to write to this temporary file, you have to make sure that the processor is able to run. So it’s just something to consider when you’re setting up your PNP installation.

And then, remember where I told you that the RRD is allocated with enough space to hold four years’ worth of data? For some reason the web interface time slots only go out to one year. It seems like an odd decision to me, but luckily we can fix it. We just find the config_local.php that comes with it and add another line like this. So this is 3600 seconds times 24 hours times 730 days, for two years. You can add whatever predefined time periods you want. You might want three years, four years; if you want a two-week view preset, you can create it here in this file. And it’s in the local config file, so you don’t have to worry about it being blown away by any upgrades. So then we get more time ranges available to us in the web interface. You can of course always use the calendar view in PNP to look at exactly the time you want, but sometimes it’s just nice to have those presets.
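
The arithmetic behind those presets is just seconds per hour times hours per day times days. Here is a minimal sketch of it, written in Python only for illustration (the PNP config file itself is PHP), with two years taken as 730 days. The function name `view_seconds` and the preset titles are assumptions made up for this example, not names from PNP.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def view_seconds(days: int) -> int:
    """Length of a graph time window in seconds for the given number of days."""
    return SECONDS_PER_HOUR * HOURS_PER_DAY * days


# Hypothetical preset titles mapped to their length in days.
PRESETS = {
    "Two Weeks": 14,
    "One Year": 365,
    "Two Years": 730,
    "Four Years": 4 * 365,
}

if __name__ == "__main__":
    for title, days in PRESETS.items():
        print(f"{title}: {view_seconds(days)} seconds")
```

The four-year entry matches the amount of history the talk says the RRD is allocated to hold, so presets longer than that would have nothing more to show.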
Which brings me to using PNP4Nagios once you’ve got it installed, in your day-to-day operation. This is what it looks like. I think you can just about read that text over there. This will be where all your graphs get drawn. Here are your search box, your Actions menu, and your basket; I’ll talk about these in just a second. It’s telling you that right now I’m looking at a host named IMS 4000, and I’m looking at the service temp to 36, which was last checked on July 6th. I have my time range presets that I can view. I think I grabbed the... yes, so this is the one-year view that I’m looking at. I just zoomed out on one of the temperature probes in my network closet, and you can see all the good stuff that’s happening to the temperature. Now bear in mind, that’s only a 6 degree swing, Fahrenheit. When you zoom in on enough of these, it can look like something really terrible is happening to your metric when in fact you’re just really zoomed in.

This menu up over here, the search menu, is really cool. Rather than having to back out to Nagios, find another host you want to look at, and click its graph link, you can just type the name of a host you want to look at in here and zoom right to it. With the calendar widget you can select an exact date range. If you need to go back in time and prepare a report for something that happened way back, you can use this widget and pull up exactly the days that you’re interested in.

It has a built-in PDF export, which actually is useful. I’ve used that to generate reports to send to facilities about the environment monitors, or to send to people who want to look at the storage. And the PDF is especially useful with the basket. Remember how I was showing you the one host you were looking at? Well, the basket allows you to combine graphs from multiple hosts into a single report. You use this little plus icon here on each graph. So I can go to the IMS 4000, add the temperature 115, go to my UPS in my network closet, add its temperature
probe, and then I click “show basket” and I get both of my rooms’ temperature graphs on one report. I can find exactly the time range I want for that, export it to PDF, send it off, and management has the information that they asked for. This gap right here is when the temperature probe failed and had to be sent back for repair for about three weeks.

Any questions so far? Yeah. All right, so I am not actually very familiar... is it a related point? Thank you. The question was about this versus Cacti: is there any point in having both Nagios and Cacti in your environment? I’ll admit I’m not completely up to speed on exactly what Cacti provides, but my guess is there’s a pretty significant overlap between the niches filled by Nagios and Cacti. I use Nagios because we started off primarily just wanting to be notified about stuff, and I think Nagios handles that better, whereas Cacti is built by default to get the performance data. Nagios by itself wouldn’t get you that, but when you add PNP, I think it would pretty much eliminate all the need for Cacti that you would have, unless there was something in particular that Cacti monitored that Nagios wasn’t good at. I don’t know what that might be, because we don’t use Cacti. Does anyone have anything to add to that? Thanks.

Templates. Once you get all the data out of the RRD, PNP needs to decide how to present that to you, and that’s where the templates come in. This is one of those cool features where, based on the name of the check command that was used to check this service, it can find a template file on your disk to figure out how to draw that graph. Remember when I showed you all those different ways of presenting the data? That was based on the templates that came with that check plugin. Some of the really cool check plugins that I’ll mention later come with their own PNP templates. If you want to create your own, you just take the check command name that you’ve configured and create a template for that. It’s written in PHP, so you can do whatever processing in it you want.

And it’s really good for fixing wonky graphs like this. In a default installation with the default plugin, you’ll see the infamous kilo-gigabyte, better known to most of us as a terabyte, but rrdgraph
wants to scale things, so it’s automatically scaling this 1,700 into 1.7 k. So we just go into our template here, create a custom template for this check, and add the -X 0, which is the rrdgraph option to say “don’t scale the y-axis by powers”. You have to pull this out of the rrdgraph man pages to figure that out, but I’ll mention it here.

The other thing you can do is create CDEFs, calculated definitions. I can define a value called GB, a gigabyte, that’s equal to var1 (this is the first data series in our RRD) divided by 1024, in postfix math notation. So now I’ve taken a value in my RRD that’s not scaled the way I want it to be, and I’ve created a scaled value that has what I want, and I can fix it up and show you this number of gigabytes. I have this one hard-coded to gigabytes in particular because the same check monitors areas of my file server that are not terabytes big, and then I would run into “0.3 terabytes” instead of “300 gigabytes”. So this is the way I’ve made the trade-off, but since I have the template, I can define it to work exactly the way I want.

Any questions thus far before we dive into NConf? Great.

So now that we’ve got our performance data and graphing sorted out, let’s figure out how we can better handle the configuration of Nagios, and get Nagios no longer controlled by just one guy: me. Last year in my intro talk I think I spent about 15 slides just on configuration files. How many people in here actually use Nagios right now and are familiar with these configuration files? Yeah. So again, NConf turns all those configuration files into a GUI. It’s web-based, and it seems to work with Internet Explorer and Firefox. I haven’t had any issues with the web GUI since the latest version.

All right, so this is a list of all of our hosts. Again, this is just an example slide, but you have the host names, the IP addresses they’re configured with, and what server they’re monitored by. NConf is designed to manage more than one Nagios server. If you have a distributed network of Nagios collectors, one for your San Francisco site, one for your New York site, you can manage them all from a single NConf dashboard, and you just pick which host is assigned to which Nagios server. There are presets for operating systems. These can be everything from simple fluff like setting the icon to predefining that Windows servers should all be ping monitored, SMB monitored, and some others. So you can assign default services that should be monitored on each OS.

And then the services. This is a list of all the machines, as machine name, colon, service name. So for instance my DNS and mail MX has its load checked, the size of its mail queue, the number of MailScanner processes, named (it’s my DNS), and a check of the NTP synchronization. It just gives you a list of all the services that you have on each host.

The way this works: with the Nagios configuration files you’re configuring objects, and objects all have relations to each other. Services are assigned to hosts, contacts are assigned to whatever. So it’s a really great match for storing relational objects in a database, which is exactly what NConf does. It’s MySQL on the back end, and it stuffs all of its objects into the database, all with their proper relations. Then when you tell it to deploy this configuration to Nagios, it reads all the relevant objects out of the database, generates the configuration flat files, hands them to Nagios, and says go. That’s the deployment phase.

It’s really flexible about how you can do the deployment phase. It can scp the files to a server. It can just untar them into a local directory, if you’re running NConf on the same server as your Nagios, which we are for one of them; that base case is really easy. You can rsync them to another machine at another site. You could put them in git and sync them to
the cloud, however you want to do it, because it’s completely scriptable. Just like everything in Nagios, it will do anything you can implement in a script.

(Yeah? Three slides from now. You’re good at this. No, it’s fine. I tried to anticipate the questions; I just didn’t necessarily anticipate the order of the questions.)

First, let’s get it installed. It wants MySQL with InnoDB, which shouldn’t be an issue in most environments, and your basic library dependencies. A quick note about PHP: it needs short tags, and register_globals and magic_quotes_gpc should be off. Really, they should be off anyway. If anyone’s still running their PHP set up this way, you deserve what you get.

To finish installation, untar the files from NConf into an area in your web server, and then there are three config files we need to look at. In the config directory, mysql.php is where you give it the authentication credentials to your database. Then there’s authentication.php. NConf can do its own authentication, because you want to make sure not just anyone can log in and manage your Nagios server. It can use an Active Directory back end, a SQL query back end, htpasswd if you want to go that route, or basic auth, which is how I use it, because I have this deployed on a server that’s already doing Apache mod_ldap authentication against my Active Directory infrastructure. So basically this just says: don’t worry about the authentication, it’s handled by the web server. In a trivial case that’s what it is, and it works great.

And then deployment.ini is what tells it: once you’ve read everything out of the database and generated the nine thousand lines of Nagios config files, what do you do with them? This is almost literally my deployment.ini. I’m using type local because I’m running on the same machine. The source file is where the output gets built; basically this is a pointer to the path of the output directory that I configured. Then it runs: move those files into /etc/nagios, extract them from the tarball, and then
reload Nagios. You do have to be careful in this case: I use sudo to give my web server user, which is what’s running this PHP, permission to run the reload with sudo without a password. That’s some sudoers hackery that I’ll talk to you about later, but I’m not going to cover it in this slide.

So, importing existing configurations. Totally doable, but there’s a trick to it. NConf comes with an importer, but that importer can only import one object type at a time. So you can’t just hand it one big monolithic file. You need to import your contacts separately from your services, separately from your hosts, and of course you have to do it in the right order, because it’s pointless to import your contact groups before you import your contacts. They have a pretty good import guide, but I did stumble over this for a couple of days.

Luckily the Nagios object cache, which should be sitting somewhere in Nagios’s var directory, already lists, in one big unified file, all the objects that you’ve configured, split up by object type. So you can literally cut that file in about six places to generate six separate files, run the importer six times, once against each of those files, and you’ve got your configuration in. I wish someone had told me this before I started, and even more than that, I wish they had just written an importer that would go find your object cache and import it one piece at a time for you. Maybe someone will write that. Maybe that someone’s in this room, because it ain’t me.

Extending the schema. Remember, we’re just dealing with a relational database on the back end, and NConf comes with some knowledge about how objects relate to each other in Nagios configurations. But it has the author’s idea of how those objects should relate to each other. So for instance, it doesn’t support adding a contact directly to a host. It expects your contacts to be part of contact groups, and contact groups to be assigned to hosts. If you don’t play that way, like my environment didn’t, you can extend the schema to add that extra attribute that the author didn’t think about. In the administration box right here, we just add that attribute. Remember, of course,
to back up your database before changing your schema. No, seriously, back up your database before changing your schema.

And we get this screen. So we’ll create a new attribute. Its Nagios name should be “contacts”; that’s the syntax in the config file that Nagios would expect. We’ll give it a friendly name (just capitalize the C, for the hell of it), and then the description: well, it’s the people we should notify about this host. The attribute belongs to the class host, so I’m assigning a contact to a host. I’m setting up the relation there in the schema. It’s an “assign many”, meaning that I can assign many items of type contact to this attribute of the host, and the items to be assigned are of the contact type, which is something else that NConf knows about.

This probably doesn’t make a lot of sense to you right now, until you get in, look at it, and start playing with it. It didn’t to me. I actually had to email with the author a couple of times to figure this out. I was like, “Hey, why can’t I add a contact directly to a host?” After a couple of exchanges of philosophical arguments about why I shouldn’t do this, he finally pointed me to the place in the manual where it tells you how to do it.

So, the second part of extending the schema. The attribute’s not mandatory. It is visible in the web interface. Yes, write it into the config file. No, it’s not the naming attribute. Remember, I’m assigning something to the host, and the host shouldn’t be named by the contact. This is pretty much always going to be “no” unless you’re creating a whole brand new type of object that NConf has never heard of. NConf knew about contacts; it just didn’t know that I should be able to assign a contact to a host.

So, any questions here before we move on to play with check plugins? Cool.

So, check plugins. This is why I like Nagios: you can monitor anything that you can write a script or a
check plugin to monitor, and some plugins are even contributed by major companies. I’m looking at a multi-thousand-dollar storage appliance right now; it comes with its own Nagios plugins. That’s a line item on their cut sheet, so real companies are starting to take this seriously.

So these are just some of the ones that I really like. check_openmanage: I’ve shown you plenty of graphs from this one, because this is really such a great plugin. This is looking at the temperature sensors on one of my servers. This is the system board inlet, so this is the air coming into the server, 18 degrees C. Air coming out of the server, 35, 36 C, so yep, we’re exhausting hot air. That’s good, we like that. And then the two temperature probes on my two CPUs. One of these is significantly hotter than the other. I’m guessing that that’s just due to the ventilation design on the server, although it could potentially be evidence of an unbalanced task load between the two CPUs. I don’t know, but it’s below 70, it’s not terrible. Another graph output from that one: remember I showed you the stacked graph from power supplies earlier? This is one that just looks weird, because it’s again got two redundant power supplies, but one of them is taking almost the entire load for the machine. This one’s only doing, you know, about 0.2 amps. Why is this? I don’t know. To me this looks like a problem with the power supply paralleling board. I should probably check it out. I only discovered that when I was looking through my graphs trying to find pretty ones to show here, so now I have something to look at when I get back home.

The NetApp one: check_netappfiler. This is one of the few ones I’ve seen written in Python. It’s old, I haven’t seen any updates to it in a while, but it works really well. I only run my filers in 7-Mode, I don’t have any clusters, so I don’t know how well it works with a cluster, but it works great for 7-Mode stuff. And it comes with its own PNP templates, so it gives me really nice graphs like this. So this is a breakdown of one of my filer volumes: grey space is used data space, green space is free data space, then light blue is free snapshot
space, dark blue is used snapshot space, and then this white line here is a zero-referenced graph of my free data space. So I can glance at this and say, all right, my snap space is probably allocated fine, I’ve got enough free data space to last me a little while, and it’s not growing at an obscene rate. The plugin itself also will monitor the basic health attributes: number of disks, number of power supplies, power supply function, overall health as reported. And this all uses the SNMP connector that’s built into Data ONTAP. By the way, I’m going to cover some plugin development, but if you’re interested in doing even more with SNMP, check out the video from yesterday’s talk. Francois and Manuel did a great session on getting SNMP data, particularly out of Macs.

check_logfiles: this is a newly added tool to my arsenal. Myself and my Oracle database administrator got this working in our environment a couple of months ago. Oracle is really verbose in the amount of stuff it tells you in the logs, but if you can read through an Oracle log you’ll learn pretty much everything you need to know about the status of stuff that’s going on. So wouldn’t it be great if we could do that in a programmatic way? Well, this check_logfiles plugin right here will do that. It will read through your log file on every check. It stores bookmarks, so it knows where to start reading from the last time it ran. It notices rotated log files, so it’ll go back and finish reading the previous log file, then catch up on the current one. You can tell it about patterns that should be warnings, patterns that should indicate critical status, and patterns that should indicate an OK status, as in the problem has fixed itself. So if it’s reading through the log and it sees, oh, this process is abending, crap, but then later on it sees, oh, the process has started again, it’s running normally, it won’t throw an error, because it sees that the problem went away. Really cool plugin. I probably have yet to
discover all the power of this plugin. The patterns are regexes, regular expressions; regex or simple string matches are supported. And there’s the external config file: if you have something really complicated, like a whole bunch of different strings or tricky strings that should trigger this type versus that type of error, you can set up a pretty complicated config file to monitor exactly the stuff you want. Like I said, I’m still probably discovering the true power that is this fully armed and operational plugin.

Check Cisco: I showed you a graph out of this from our 6509 switch, but it also monitors, you know, temperature, power supply functioning. I don’t think it goes as far as monitoring individual ports, you know, port up, port down, but I’m sure there’s a plugin for that. Synology status is another one that I use for a couple of our small NAS units. It’ll check, you know, overall health, check for degraded RAID arrays, and it actually gives you the individual temperatures of all the disks, which I guess is nice to see if you have some disk that’s on the verge of a head crash because it’s overheating, and of course, you know, available storage. I don’t have a link for that one, but it’s available on the proper Nagios Exchange. That’s its name: check_snmp_synology. Does anyone else have any plugins that we should all know about? No? All right.

So once we’ve gone through and we’ve gleaned all the really great plugins from out there in the community, we should really start looking at writing our own. Like I said, I’ve been using Nagios for a long time, been writing Nagios plugins almost as long as I’ve been using Nagios, been writing bad plugins for most of those years. So lately I’m trying to figure out, all right, how to really write a bulletproof check: make sure it handles all errors, make sure it can actually detect when the service goes down. Nagios plugins are great to write because the API is stupid simple. This is a link to the API documentation on the Nagios website. And you can write it in anything you want. I started off writing in bash, then Perl; lots of people write in Python; I’ve seen them written in Expect. You can write it in a compiled language if you want. The API is so simple, there’s basically nothing you can’t write a plugin in. But Perl is particularly fun because Nagios has an embedded Perl interpreter, so if you have a lot of good plugins in Perl, it will cache those in
memory, so you’re not executing and compiling your Perl every time. I see some head shaking. Okay, because I thought you were about to make the point that I’m about to make: there are some caveats you need to know about using this embedded Perl. It’s going to treat your code in ways that you might not expect. I will cover those in a minute.

But the API, like I mentioned, is simple. This is how simple it is: it’s really just the exit code. If you exit zero, that means OK; one for warning, two for critical, and three for unknown. Standard out is there for human-readable notices. It goes on the web page, but it’s ignored by Nagios. Your plugin can spit out on standard out, oh my god, disks are on fire, sky is falling, ah, but if you exit zero, Nagios thinks everything’s cool. It’s really only the exit code that matters. Performance data, like I showed you, goes after the vertical bar. You’re actually allowed to write up to four kilobytes of output from a Nagios plugin, and that will go into a different macro if you have something else you want to do with it. If you want to provide some really verbose information to a certain type of notification, you can. And yet another link to the plugin API.

So when you’re writing plugins in Perl, it provides you some nice features. utils.pm in particular includes this %ERRORS variable. It’s a Perl hash, so you don’t have to remember those exit codes that I showed you in the previous slide; you just have to know CRITICAL. So if you exit with $ERRORS{'CRITICAL'}, it will translate that into the proper exit code. You don’t have to remember, and potentially accidentally mix up, the API codes. I have my GitHub up here: the template that I’m using to start rewriting most of my cruddy checks as real checks. It’ll handle the command-line parsing, parsing the warning and critical thresholds, and a couple of other things for you, and there’s this great big spot right in the middle of it where it says "your logic here". So feel free to grab that plugin.
I’m still continuing to work on it, so I’ll probably be posting some more updates, maybe next week. If you find things that could be added to this, particularly if you find bugs, fork it, send me a pull request. If you don’t feel like doing that, send me an email, tell me what I need to fix. We’re all in this together. If you’re writing plugins in bash, there’s a utils.sh that also provides STATE_OK, STATE_CRITICAL, STATE_WARNING translations for the API. But I’ve stopped writing most of my plugins in bash; I’m pretty much writing most of them in Perl.

So what makes a good plugin? Or what do I think makes a good plugin? What do other people think makes a good plugin? For starters, keep your messages short. Remember that the output is going to be sent to you in an SMS message, possibly even to an alphanumeric pager. Anyone still have those? So just remember to keep that output short, sweet, to the point, and make it help you fix the problem.

If you’re calling any external binaries, make sure you do it by the full path. You don’t want to wind up on some weird system that has PATH set differently or doesn’t follow LSB correctly. If you do have to refer to an external binary, you can set the path to it in a variable at the top of your script, or, if you’re writing something in a compiled language, make it a command-line flag. There are a number of plugins I’ve seen where there’s a command-line flag to specify the full path to some other binary that they use.

Hung external processes and long runtimes: Nagios by default has a built-in timeout for its check plugins. If your check plugin runs more than, I think it’s 60 seconds, Nagios will kill it, and it’ll just assume that the state is an unknown state because the plugin didn’t work right. If you use some functions like alarm, that’s a standard function in Perl, you don’t have to use anything extra for that, or the timeout binary for bash scripts, that’s part of coreutils, it should be on every system, you can use those to tell your plugin to handle a timeout internally. So if you’re about to call df and you set this timeout, and df never returns because your storage system is hosed and your disk is on fire, the timeout will get control back to your program before Nagios kills you, and at that point you can issue a much more informative error message saying, df timed out, have you checked to see if the disks are on fire?

Avoid temporary files, because you don’t know if you’re gonna be running on a system that might have its disk full, or
maybe you’re out of file handles, or maybe your storage system is hung. So really try to avoid temporary files if at all possible.

And validate your command-line arguments. Even though this is going to be configured by admins and should not, under most circumstances, have any intentionally hostile code injected into the arguments, admins don’t always read the man page, and you want to make sure you protect some other admin who might be using your plugin from him or herself. Is it legal for your warning threshold to be higher than your critical threshold? Sometimes it is; sometimes you have a threshold where lower numbers are bad. Make sure your numeric arguments really are numeric. Make sure you have all the arguments you need to actually proceed, and if not, kindly tell the user which arguments he or she needs to supply.

Embedded Perl, as I mentioned, has some caveats to it. The great thing about embedded Perl is it’s not compiling your Perl script every time it runs. When Nagios starts up, it compiles it and then caches it. So your plugin should work with use strict. Perl should be run with -w to turn on all warnings. Just make sure your code is squeaky clean; avoid any warnings. You have to specifically close all your opened files, because embedded Perl never exits and cleans up all this stuff that you’re used to Perl doing. Similarly with variables: don’t assume that they’re going to be initialized to 0 automatically. Initialize all your variables, because embedded Perl will cache the value from the last time your check plugin was run, because again, it’s not recompiling it every time. And then this is another one, I just ripped this right out of someone else’s advice: don’t use global variables in subroutines. That can create something called a closure, which will reference memory in a function that no longer exists, and suddenly your Perl script is leaking memory like nobody’s business. And the last thing you need on a monitoring system is a big memory leak.

So, a final note
about the unknown status. Remember I showed you the API is OK, warning, critical, unknown. Unknown is intended to signify that something went wrong in the plugin, not with the service that you’re monitoring. So if you have some internal error in your plugin (unable to find an external binary that you need, called incorrectly, something timed out, missing a Perl module), that’s when to use unknown. That tells Nagios, I don’t know what the state of the service is, because the plugin couldn’t run.
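As a rough illustration of the exit-code API, the full-path advice, the internal timeout and this use of unknown, here is a minimal sketch. It is in Python rather than the Perl used in the talk, and the helper name `run_external` and its 10-second default are assumptions made up for the example, not part of Nagios or of the speaker's template.

```python
import subprocess

# Exit codes defined by the plugin API described in the talk.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_external(argv, timeout_s=10):
    """Run an external binary (argv[0] should be a full path) under our own timeout.

    Returns (exit_code, message). Failures of the plugin itself, such as a
    missing binary or a hung helper, map to UNKNOWN rather than CRITICAL.
    The caller is still responsible for interpreting the helper's output.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except OSError as exc:
        # Covers a missing or non-executable binary.
        return UNKNOWN, f"UNKNOWN - cannot run {argv[0]}: {exc.strerror}"
    except subprocess.TimeoutExpired:
        # We regain control before the scheduler's own timeout kills us.
        return UNKNOWN, f"UNKNOWN - {argv[0]} timed out after {timeout_s}s"
    return OK, result.stdout.strip()
```

A plugin would print the returned message on standard out and pass the code to `sys.exit()`. Keeping the internal timeout well under the scheduler's limit is what makes the more informative message possible.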

Don’t use unknown to indicate that the service is in an unknown state. If you’re trying to connect to a machine and you couldn’t find its DNS entry, that’s not unknown; that’s either warning or critical, depending on what your environment should be. Yes, you couldn’t actually connect to the service to make sure the service was up, but that’s not the intent of unknown.

So let’s walk through a basic Nagios check in Perl. In last year’s talk I took you through how to write a plugin to check the last time Time Machine ran, and I really hope no one is writing plugins that way, because that example was, I’ll admit it, terrible. There was no error checking, there was no input validation, and it just barely worked as an example of how to use the API. So we’re gonna start here. We’re going to use the Net::SNMP library, because we’re going to query an APC Symmetra UPS for its available runtime. In particular, I’m going to use this function to handle TimeTicks, which is the SNMP value that I’m going to get back. I’m going to tell Perl, look in this directory for your utils.pm, and then use utils to get me that %ERRORS hash. And then I’m gonna use Getopt::Long so that I can parse my command-line arguments, and in particular I’m gonna use no_ignore_case. The default behavior of Getopt::Long is to ignore the case on single-character command-line switches, but because I need an SNMP community and I need a critical threshold, both of which can be signified by C, and the Nagios convention is to use lowercase w for warning and lowercase c for critical, I need to be able to differentiate between those two. So I’m going to say no_ignore_case. The other thing is, using Getopt::Long, in order to make things even clearer, you can use --community instead of capital C, and then when someone’s looking at a config file they can see very clearly this isn’t intended to be the critical threshold, this is intended to be the SNMP community. It’s easier to
write self-documenting code using Getopt::Long. So I’ll parse my command arguments here, and then this construct, "or print help": if I pass an invalid option to GetOptions, it will run this function. This function should do something like printing out a helpful message, and then it should exit with the unknown code from %ERRORS, because there was an internal error: it was called incorrectly. So that’s the place to use unknown. Similarly here, I’m gonna validate my arguments. Because I’m checking runtime, lower runtime on a UPS is worse, I think we can all agree on that. So I want to make sure that the warning threshold is not lower than the critical threshold, because that wouldn’t make sense, not in that type of test. If it’s wrong, it says, error, your warning has to be greater than your critical, and again we exit unknown. Any questions so far?

So here’s my runtime OID, that big long string that you can get out of the SNMP PowerNet MIB or figure out from snmpwalk. Then I will construct my SNMP session. I’m using the hostname you gave me, I’m using the community you gave me on the command line, and then I’m specifying a timeout. I’m gonna say, wait up to 10 seconds for this SNMP connection, otherwise time out and let me handle that error. I’m gonna use SNMP version 1 because I’m lazy, and I’m gonna tell it, don’t do any special translation on my TimeTicks; give them to me and I’ll scale them myself, thank you. Then I grab the error value out of that using this construct. If this fails, I can tell you there was an SNMP error, using the error that I got back from the constructor, and then again I can exit unknown. So I’ve told the user, this is what went wrong, and I’ve told Nagios, something went wrong. Then finally, after all that, I can actually make the SNMP request for my runtime OID. So now I get the runtime back from my UPS, and on the very next line I should do a whole bunch more checking to make sure this is reasonable. I’m skipping that for brevity in this slide. And then I’m
going to calculate my minutes. Because TimeTicks in SNMP is in units of hundredths of a second, I divide by a hundred and then divide by sixty, and now I have my runtime in minutes. So, two screens later, I’ve retrieved the value that I want out of my UPS. Now I can actually check my value.
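The two screens so far, minus the SNMP request itself, could be sketched like this. It is Python's argparse standing in for Perl's Getopt::Long, so the class and function names (`PluginArgumentParser`, `parse_args`, `timeticks_to_minutes`) are invented for illustration and are not the code shown on the slides. Two details are worth noting: argparse is already case-sensitive, so `-C` and `-c` stay distinct, and argparse exits with code 2 on bad input by default, which Nagios would read as critical, so the sketch overrides that to exit unknown.

```python
import argparse
import sys

UNKNOWN = 3  # plugin API exit code for "the plugin itself failed"


class PluginArgumentParser(argparse.ArgumentParser):
    """Exit UNKNOWN on bad arguments instead of argparse's default code 2."""

    def error(self, message):
        print(f"UNKNOWN - {message}")
        sys.exit(UNKNOWN)


def parse_args(argv):
    parser = PluginArgumentParser(description="Check UPS runtime (sketch)")
    parser.add_argument("-H", "--hostname", required=True)
    parser.add_argument("-C", "--community", required=True)
    parser.add_argument("-w", "--warning", type=int, required=True)
    parser.add_argument("-c", "--critical", type=int, required=True)
    args = parser.parse_args(argv)
    # Lower runtime is worse, so warning must sit above critical.
    if args.warning <= args.critical:
        parser.error("warning has to be greater than critical")
    return args


def timeticks_to_minutes(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to minutes."""
    if ticks < 0:
        raise ValueError("TimeTicks cannot be negative")
    return ticks / 100 / 60
```

With these definitions, `timeticks_to_minutes(384000)` gives 64.0, which lines up with a UPS reporting 64 minutes of runtime.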

I’ll start by assuming that my status is OK. I’ll check to see if it’s below the warning threshold; if it is, I’ll set my status to warning. And then I’ll check to see if it’s below the critical threshold, and I’ll set my status to critical. The order of these two is very important: if I did them in the wrong order, I could never get into a critical state. Then I’ll construct my screen output: state name, however many minutes of runtime I’ve got. And then construct my performance data: runtime equals some number of minutes, pass in my runtime minutes, and then also provide the warning and critical thresholds. That tells PNP where to draw those yellow and red lines for your thresholds, so you can view them on the graph. And then finally, print my screen output, print my performance data, and exit with that status that I determined up here. Any questions?

Yeah. So there are two schools of thought on this. The question is, should I hard-code the OID, or should I rely on having someone have the PowerNet MIB installed and request the OID symbolically? In this case, I just want the plugin to work and not have to have people go get the PowerNet MIB and install it and figure out that sort of stuff, so I’m just gonna hard-code that one value right there. I didn’t always do it that way. It would be potentially clearer in my code if I were to specify it symbolically and require the MIB to be installed, but for ease of other admins’ use I’m hard-coding it here. I’m gonna go ahead and say that’s probably the right way to do it in this case, rather than requiring some external dependency on the MIB file, but that’s probably up to interpretation by the person writing the plugin. Good question, though.

So let’s run our fun little plugin right here. We’ll specify minus capital H, this is an IP address, an SNMP community, and we’ll say warning is 40 minutes and critical is 20 minutes. So I’m telling it, go check this network UPS right here for your runtime. I
get the output: OK, 64 minutes of runtime estimated, and then my performance data. Again, this would all be on the same line if the screen were wider. And then I will check my exit value: exit zero, we’re good. We’ve just told Nagios everything is good with your UPS.

So I have 12 minutes available. I will throw in one final word here and wax philosophical on system administration for a moment, and let you know that only you can stop putting cruddy services into production. Remember that no service is truly production-ready until it has been tested, backed up, monitored and, of course, documented. I understand there’ve been some pretty good talks on documentation at this meeting. I haven’t attended any of them, but we have to remember that it’s essential, because the unofficial motto of system administration is, it’s only temporary unless it works. Thank you.

So I will take 12 minutes of questions. Anyone else want to take a picture of this for Twitter? No? Okay, there’s my feedback link, and I will open the floor. Apparently I’ll close the floor, too. All right, thank you.
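For reference, the threshold check and output step walked through above might look like this in Python. This is a sketch, not the speaker's Perl: the function name `check_runtime`, the `runtime` performance-data label and the exact message wording are assumptions, and the thresholds are assumed to have been validated already (warning greater than critical).

```python
# Exit codes defined by the plugin API.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_runtime(minutes, warn, crit):
    """Return (exit_code, output_line) for a 'lower is worse' runtime check."""
    status = OK
    # Order matters: test warning first, then let critical override it.
    if minutes < warn:
        status = WARNING
    if minutes < crit:
        status = CRITICAL
    text = f"{STATE_NAMES[status]} - {minutes:.0f} minutes of runtime estimated"
    # Performance data follows the vertical bar as label=value;warn;crit,
    # which is what lets a grapher draw the threshold lines.
    perfdata = f"runtime={minutes:.0f};{warn};{crit}"
    return status, f"{text} | {perfdata}"
```

Called with the values from the demo run, `check_runtime(64, 40, 20)` returns status 0 and the line `OK - 64 minutes of runtime estimated | runtime=64;40;20`. A real plugin would print that line and exit with the status.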