Just like video killed the radio star, I wonder if Google hasn't killed IT troubleshooting skills. Once upon a time we followed a common set of principles to determine the root cause of a problem and resolve it. Today, though, too many admins jump straight to Google with a list of symptoms rather than establishing root cause. Not that Google is bad; in fact it's a great tool to help, but I am not sure it's the first place to look.
For you younger pups (I still refuse to be considered old), there are some simple steps to help you get to the root of a problem.
- Determine symptoms
- Dig into the logs
- Identify suspects
- Eliminate suspects
- Determine root cause
- Retest after resolution
Let’s walk through these steps so you can better understand them and see where tools like Google make sense.
Determine Symptoms
One of the first things you see when an issue arises is the outlying symptoms. You have all seen enough medical dramas to know where I am going with this. It’s important to keep track of the symptoms, because when you hit the next step it’s easy to go down a rabbit hole fast. So jot down the issues you are seeing and when they seem to have started. Web service unreachable? Okay, is it an individual site or page? Is it localized or widespread? Does it affect multiple browsers? All good things to take in.
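If you want to be a little more disciplined about it, you can jot symptoms down as structured notes instead of trusting memory. Here is a minimal sketch in Python; the `record_symptom` helper and its field names are my own invention, not any standard tooling:

```python
from datetime import datetime, timezone

def record_symptom(description, scope, first_seen=None):
    """Capture one symptom as structured data instead of trusting memory.

    description: what you are seeing ("web service unreachable")
    scope: how wide it is ("single page", "whole site", "one subnet", ...)
    first_seen: when it started, if known; defaults to right now (UTC)
    """
    return {
        "description": description,
        "scope": scope,
        "first_seen": first_seen or datetime.now(timezone.utc).isoformat(),
    }
```

Even a list of these dictionaries dumped to a text file beats trying to remember, three rabbit holes later, what you originally saw and when.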
Dig Into the Logs
When you start to dig into the logs, look for the timestamps that correspond with the service outage, along with any entries that mention the affected service. While you are digging, note other events and errors on the system that could relate to your issue. Start making a suspects list of where you are going to look next.
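That log-digging step can be sketched in a few lines of Python. This assumes syslog-style lines that start with an ISO timestamp; the `lines_near_outage` name and the keyword list are just illustrative, and you would adjust the parsing to match your own log format:

```python
from datetime import datetime

def lines_near_outage(log_lines, outage_start, outage_end,
                      keywords=("ERROR", "WARN", "FATAL")):
    """Return log lines whose timestamp falls inside the outage window,
    plus any lines carrying one of the interesting keywords.

    Assumes lines that begin with an ISO timestamp, e.g.
    '2024-03-01T02:14:07 ERROR service crashed'.
    """
    hits = []
    for line in log_lines:
        stamp = line.split(" ", 1)[0]
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a timestamped line; skip it
        in_window = outage_start <= ts <= outage_end
        flagged = any(k in line for k in keywords)
        if in_window or flagged:
            hits.append(line)
    return hits
```

The same filtering is what you are doing by eye when you scroll a log viewer; scripting it just keeps you honest about the time window.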
Identify Suspects
Now that we know what issue we are looking for, and have some breadcrumbs to guide us toward a few possible causes, we need to go all detective on it. Based on the logs and symptoms, we can develop a list of hardware, software, network components, firewall rules, patches, or services that could be causing our issue. If the logs aren’t giving you anything, diagnostic tools like Sysmon, Filemon, Netmon, or grep can help, and even baseline analytics tools like Tripwire, vRealize Operations, or SolarWinds can turn up answers.
Eliminate Suspects
Like any good detective, your objective now is to take your list and begin to interrogate it, narrowing the search down to a few prime suspects. Here is where Google becomes really handy: some error codes will stick out, as will events that occurred around the same time the issue arose. You also know your symptoms and suspects, so you can streamline your Google query. Same web service down, only localized, happening after patch night, only affecting Windows 8.1 desktops with IE? Boom, there is your web search.
Determine Root Cause
Now that we have more info to go on, hopefully we have narrowed our suspects down to one or two. Here is where we need to pause, because if the problem is software-based we want to make an identical copy of the system in our test environment to work with. If it’s hardware, things are a little easier. Let’s split this topic.
As I mentioned before, you need a way to copy off your system, because we are going to want to roll back like a Walmart sales associate. When your list of suspects is all software, whether it’s a driver, a patch, or just bad code, you are going to need to do some uninstalls or reloads, and you will want to be able to roll back to the known (bad) state if a fix doesn’t work. DO NOT CHANGE TOO MUCH AT ONCE. I have worked with too many folks who go through their Google search results and just start implementing every change they see, never actually knowing which change fixed their problem. Go slow and be methodical: if a change doesn’t fix the problem, note the new state, roll back, and move on to the next potential fix.
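The go-slow, one-change-at-a-time discipline can be expressed as a loop. Everything here is a placeholder for your own tooling: `apply`, `rollback`, and `problem_is_fixed` stand in for whatever snapshot, uninstall, or monitoring mechanism you actually use.

```python
def try_fixes_one_at_a_time(fixes, apply, rollback, problem_is_fixed):
    """Apply candidate fixes one by one, retesting after each and
    rolling back any fix that doesn't resolve the problem.

    fixes: ordered list of fix identifiers (your suspect list)
    apply / rollback: callables that make or undo a single change
    problem_is_fixed: callable returning True when the symptom is gone

    Returns the single fix that resolved the issue, or None.
    """
    for fix in fixes:
        apply(fix)
        if problem_is_fixed():
            return fix      # one change is live, and you know which one
        rollback(fix)       # undo before trying the next suspect
    return None             # none of the suspects fixed it
```

The point of the structure is the invariant it enforces: at any moment exactly zero or one change is applied, so whichever fix works is unambiguously the root cause remedy.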
Hardware comes down to one simple rule: use a known good. If there is a hardware failure, replace the suspect busted part with a known-good replacement and retest.
For networking and firewall rules, the same software steps apply; nothing crazy there.
Retest After Resolution
Like anything else, once you believe you have a fix in place, test it. There is nothing worse than waving the all clear just for the system to blow up again. I have been guilty of that myself in the past. So validate that the resolution works. Once you are satisfied and the system is up, you can be the unsung hero and move on to the next problem in the queue.
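The retest can be automated too. A small sketch, assuming you can express "the symptom is gone" as a callable check; the attempt count and delay are arbitrary defaults, and `validate_resolution` is my own name for it:

```python
import time

def validate_resolution(check, attempts=5, delay=1.0):
    """Re-run the original symptom check several times before declaring
    the all clear; one lucky success isn't proof the fix stuck.

    check: callable returning True when the service looks healthy
    """
    for _ in range(attempts):
        if not check():
            return False  # still broken, don't wave the all clear
        time.sleep(delay)
    return True
```

If the original symptom was an unreachable web service, `check` might be a quick HTTP request against the page that was failing, repeated enough times to catch an intermittent fault.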