How would you manage 500+ servers?

tdean · November 2010

what tools would you use? i've used sitescope, whatsup and opmgr before, but to set that up for 500 machines would be impossible. what would you monitor and how? just fwd events to 5-10 machines and set up alerts from there? what services though?

contentpros · November 2010

You have a few open-ended questions, so my apologies if the answer is a little long-winded.

There are a number of tools that are capable of handling large numbers of servers. You might want to take a look at HP OpenView or SolarWinds Orion. Neither is cheap, but at 500+ servers you should hopefully be able to get some sort of budget. If you have the patience and are Linux-savvy then you can look at other tools like Nagios, OpenNMS, Cacti, or Zenoss. There are tons of tools out there and these are just the tip of the iceberg. Not all tools are equal, and the right one depends on what you are looking to accomplish.

Most of these tools have some sort of SNMP or ping scan discovery utility to help get you started. Some tools may require some type of agent to be installed, so that is something else you may want to consider. Some tools will try to autoscan the well-known ports and do some type of service identification. From my experience, the initial configuration and the tuning to reduce unwanted noise are the most tedious parts. If you are logging to some central syslog servers, look at Splunk as it will make your life easier.
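To make the discovery step a bit more concrete, here is a minimal Python sketch of a subnet sweep. It is not how any of the tools above implement discovery: real products typically use ICMP ping and SNMP, while this sketch swaps in a plain TCP connect, since raw ICMP needs elevated privileges. The names `tcp_probe` and `discover`, the port 22 default, and the worker count are all assumptions made up for illustration.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor


def tcp_probe(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds (stand-in for ping)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def discover(cidr, probe=tcp_probe, workers=64):
    """Probe every usable address in a subnet and return the ones that answer."""
    hosts = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
    # Probe in parallel so a /24 does not take minutes of sequential timeouts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(probe, hosts))
    return [host for host, alive in zip(hosts, results) if alive]
```

Calling something like `discover("192.168.1.0/24")` would return the addresses that accept a connection on port 22. Passing a different `probe` function lets you test other ports or protocols without touching the sweep logic.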

As for what to monitor and how, that is really something you need to do some type of needs analysis to figure out. I have worked in many places, from small mom & pop shops to large enterprises and a number of ISPs, and the needs (and wishlists) vary for each environment. If you are looking for detailed monitoring for specific applications like Oracle, MS SQL, or Exchange, you may want to consider the Quest Software "Spotlight" tools, which can give you ridiculous amounts of information about the applications that many of the other tools will not. You may just be looking for something as simple as "does it respond to a ping", or you may need a tool that is capable of executing synthetic transactions against a database server or website.

With most tools, monitoring starts with the basics: a ping and (in most cases) an SNMP query. If you are considering undertaking a task like this, learning how to SNMP walk a host can be a handy skill. This also means that you will most likely want to make sure your community strings are set on all of the hosts you want to monitor and that the devices are sending traps to your monitoring servers. While leaving the default public and private strings will make life easier *cringe* (plz don't do this!), it also opens up other risks.
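As a rough illustration of the community string point, a small Python sketch like the one below could flag hosts that are still on the defaults. It assumes you already have an inventory mapping hostnames to their configured community strings; that inventory format and the function name `find_default_communities` are hypothetical, not something any particular tool provides.

```python
# Well-known default SNMP community strings that should never be left in place.
DEFAULT_COMMUNITIES = {"public", "private"}


def find_default_communities(inventory):
    """Return the hostnames whose community string is a well-known default.

    `inventory` is assumed to be a dict of hostname -> community string.
    """
    return sorted(
        host
        for host, community in inventory.items()
        if community.strip() in DEFAULT_COMMUNITIES
    )
```

Running this against an export of your device configs before you point a monitoring server at them is one way to catch the stragglers early.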

We run multiple monitoring solutions for alerting and trending/history. In my environment, for each server we monitor processor use, processes, lots of memory stats, logged-in users, disk I/O, temperature, network utilization and protocol stats, and ping latency. I know I have omitted a few of the other baseline stats. We have numerous other stats that we monitor based on role, like apache, bind, sendmail, spam, and a/v appliances; the list goes on and on... then you have switches, routers, phones, IDS/IPS, etc.

It is real easy to have monitoring snowball into a nightmare situation.

Set up a small lab of 5 or 10 machines and see which solutions give you the options you need and how difficult each one is to configure. Once you find one you are comfortable with, see what type of options you have for creating some sort of host template. Create a baseline for each type of host and OS to make your life easier. Start with the minimum required services/checks and add as necessary. Having a ton of information is great, but if it generates too many alerts for non-critical items then more often than not the important alerts will get missed.
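The template idea can be sketched in a few lines of Python: one baseline that every host gets, a small set of extra checks per role, and optional per-host overrides. Every check name, threshold, and port below is a made-up example rather than a recommendation, and `checks_for` is a hypothetical helper, not part of any monitoring product.

```python
import copy

# Hypothetical baseline applied to every host (thresholds are example values only).
BASELINE = {
    "ping": {"warn_ms": 100, "crit_ms": 500},
    "cpu": {"warn_pct": 85, "crit_pct": 95},
    "disk": {"warn_pct": 80, "crit_pct": 90},
}

# Hypothetical extra checks layered on top of the baseline, keyed by role.
ROLE_TEMPLATES = {
    "web": {"apache": {"port": 80}},
    "dns": {"bind": {"port": 53}},
    "mail": {"sendmail": {"port": 25}},
}


def checks_for(roles, overrides=None):
    """Build the check set for one host: baseline, then roles, then overrides."""
    # Deep copies keep per-host tweaks from leaking back into the templates.
    checks = copy.deepcopy(BASELINE)
    for role in roles:
        checks.update(copy.deepcopy(ROLE_TEMPLATES[role]))
    for name, params in (overrides or {}).items():
        checks.setdefault(name, {}).update(params)
    return checks
```

For example, `checks_for(["web", "dns"])` gives a host the baseline plus the apache and bind checks, and `checks_for(["mail"], {"disk": {"warn_pct": 70}})` tightens a single threshold for one noisy box without editing the template.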

Start small, add as needed, document your changes, monitor the noise generated, and review!

HTH
~CP

tdean · November 2010

wow, excellent. thanks for the detail. also, if a job description is for "monitoring" 500 servers, i'd assume they already have something in place? it also doesn't seem like it would be that different than 10 or 50 servers.

another semi-related question.... how do you update/reboot that many servers every week or so when the MS security patches roll out?

contentpros · November 2010

Easy.... I don't run that many MS servers=)

All kidding aside, there are many patch management solutions out there like WSUS, SCCM, BigFix and others that can help manage the process. Don't forget that tools like the MBSA and many vulnerability scanners can also help to identify boxes that require patching. Most of our MS boxes are clusters, so we patch the passive head, roll the cluster, patch the other head, rinse and repeat.
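The "patch the passive head, roll the cluster, patch the other head" routine can be written down as an ordered plan. The Python sketch below only generates the list of steps; it does not talk to WSUS or any cluster service, and the cluster layout (a dict with `active` and `passive` node names) is an assumed format for illustration.

```python
def rolling_patch_plan(clusters):
    """Return ordered (action, target) steps for two-node active/passive clusters.

    `clusters` is assumed to map a cluster name to
    {"active": node_name, "passive": node_name}.
    """
    steps = []
    for name, nodes in clusters.items():
        steps.append(("patch", nodes["passive"]))  # idle node first, no user impact
        steps.append(("failover", name))           # roll the cluster to the patched node
        steps.append(("patch", nodes["active"]))   # former active node is now idle
    return steps
```

Feeding the resulting steps to whatever actually does the patching and failover keeps the ordering logic in one place, which helps once there are dozens of clusters to roll.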

I would hope that if they have a farm of 500+ servers they have some sort of monitoring solution in place.

HTH
~CP

tdean · November 2010

i've used WSUS in a few different organizations. no more than 50 servers though. it took all night to reboot them all and restart the appropriate services and test. how would you do that even if you were clustering? don't the back end servers have to reboot? i'm guessing you stagger them, but still, it would take 2-3 full days, wouldn't it? or do you just let them reboot on their own and only deal with them if you get alerted that one doesn't come up?
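The staggering question can be put into rough numbers with a small Python sketch. The wave size and minutes per wave are pure assumptions (nothing in this thread gives real figures), and `reboot_waves` and `window_hours` are hypothetical helper names; the point is only to show how the total window scales when servers reboot in parallel batches rather than one at a time.

```python
import math


def reboot_waves(servers, wave_size):
    """Split the server list into consecutive waves of at most wave_size hosts."""
    return [servers[i:i + wave_size] for i in range(0, len(servers), wave_size)]


def window_hours(n_servers, wave_size, minutes_per_wave):
    """Estimate the total maintenance window if waves run one after another."""
    waves = math.ceil(n_servers / wave_size)
    return waves * minutes_per_wave / 60
```

With assumed values of 50 servers per wave and 30 minutes per wave, `window_hours(500, 50, 30)` comes out to 5.0 hours. The real bottleneck is usually the testing after each wave, so the per-wave time is the number to measure in your own environment.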

rhauser44 · November 2010

To patch and reboot servers I've used both Opsware and Update Expert. But again both are pricey, as both solutions require dedicated servers, so hopefully you have a decent budget.
