What does pardoning a turkey have to do with Post Mortems anyway?

Every Thanksgiving one turkey, typically named Tom, gets a pardon from the President of the United States. That’s a pretty stupid tradition, but politics has a litany of action-for-public-approval-over-substance scenarios that play out even dumber than this one. The thing that strikes me about it, though, is that the turkey is being forgiven for something he never did; a pardon is essentially a waiving of the burden of guilt and the sentence that goes with it.

Think about every post mortem or crisis room you have been in: how often is the fault for the issue acknowledged and forgiven so completely? Or, even better, how often is the problem not a direct result of anyone in the room? Unless you are running blameless post mortems, the most likely answer is that it doesn’t happen.

Now here is the thing: blameless post mortems are AWESOME in theory. No finger pointing, no screaming, just a calm discussion of what happened, how it was diagnosed, and how it was solved. Who wouldn’t want to be a part of that after a crisis? In practice, though, this requires a level of maturity from everyone in the room, and a room full of mature IT people is like a porcupine with a good hair day: not nonexistent, just a rarity.

What if, though, with all the stress around us, we all took a deep breath in that room and realized that we are all passionate professionals? What if we were able to set aside our individualism for the sake of the team?

Bear with me as I have a back-in-my-day moment…

Back when I was in the operations business, sitting in the data center, every siloed team had friends on other teams. We laughed at luncheons together; we had inside jokes about late-night patching, outages, and pranks we pulled. When it came to troubleshooting a serious issue we banded together and fixed it as a team, no matter what the root cause ended up being. But at the after action we were all so proud that we would try to deflect root cause away from our own team. It was not the right way to deal with it.

I have been fortunate enough to see blameless post mortem meetings since leaving full-time operations. Wow, what a difference. The meetings are swift and on target, determining root cause and sharing information and results. The efficiency is striking because no one feels like they have to explain their actions away; instead they are simply explaining how things occurred and how they were fixed. Crisis meetings in these groups are equally impressive, with teams coming together and everyone helping with the troubleshooting process.

If you are interested, I would highly recommend checking out the folks over at Etsy who seem to do this better than anyone and are open enough to write about it. Here are some links:

https://codeascraft.com/2012/05/22/blameless-postmortems/

https://www.pagerduty.com/blog/blameless-post-mortems-strategies-for-success/

https://www.etsy.com/teams/7716/announcements/discuss/10641726/

What are your experiences with this?

Oh and Happy Thanksgiving everyone!!

Did Google kill the troubleshooting star?

Just like video killed the radio star, I wonder if Google hasn’t killed IT troubleshooting skills. Once upon a time we followed a common set of principles to determine the root cause of a problem and resolve it. Today, though, so many admins jump straight to Google with a list of symptoms rather than establishing root cause. Not that Google is bad; in fact, it’s a great tool to help, but I am not sure it’s the first place to look.

For you younger pups (I still refuse to be considered old), there are some simple steps to help you get to the root of a problem.

  1. Determine symptoms
  2. Dig into the logs
  3. Identify suspects
  4. Eliminate suspects
  5. Determine root cause
  6. Retest after resolution

Let’s walk through these steps to help you better understand them and see where tools like Google make sense.

Determine Symptoms

The first things you see when an issue arises are the outlying symptoms. You have all seen enough medical dramas to get where I am going with this. It’s important to keep track of them, because when you hit the next step it’s easy to go down a rabbit hole fast. So jot down the issues you are seeing and when they seem to have started happening. Web service unreachable? OK, is it an individual site or page? Is it localized or widespread? Is it affecting multiple browsers? All good things to take in.
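
As a minimal sketch, even a few lines of Python can capture the kind of notes you want: is the service reachable from where you sit, what does it return, and when did you check. The URL below is a made-up placeholder, not anything from a real environment.

```python
import datetime
import urllib.error
import urllib.request

# Hypothetical URL for the service being reported as unreachable.
URL = "https://intranet.example.com/status"

def note_symptom(url: str) -> None:
    """Record what we see right now: reachable or not, status, and a timestamp."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{stamp}  {url}  reachable, HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        print(f"{stamp}  {url}  responding but erroring, HTTP {err.code}")
    except (urllib.error.URLError, OSError) as err:
        print(f"{stamp}  {url}  unreachable: {err}")

if __name__ == "__main__":
    note_symptom(URL)
```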

Dig Into the Logs

When you start to dig into the logs, you are looking for timestamps that correspond with the service outage, along with any entries from the affected service. While you are digging in, note events and errors occurring on the system that could be related to your issue. Start building a suspects list: the places where you are going to start looking.
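
Here is a rough sketch of that timestamp-window filter, assuming a plain-text log (the path and outage window below are hypothetical) with ISO-style timestamps at the start of each line:

```python
from datetime import datetime
from pathlib import Path

# Hypothetical log file and outage window -- adjust to your environment.
LOG_FILE = Path("/var/log/webapp/app.log")
WINDOW_START = datetime(2015, 11, 25, 22, 0)
WINDOW_END = datetime(2015, 11, 25, 23, 30)

def lines_in_window(path: Path, start: datetime, end: datetime):
    """Yield log lines whose leading timestamp falls inside the outage window."""
    with path.open(errors="replace") as log:
        for line in log:
            try:
                # Assumes lines start with e.g. "2015-11-25 22:14:03 ..."
                stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # skip continuation lines with no leading timestamp
            if start <= stamp <= end:
                yield line.rstrip()

for entry in lines_in_window(LOG_FILE, WINDOW_START, WINDOW_END):
    print(entry)
```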

Identify Suspects

Now that we know what issue we are looking for and have some breadcrumbs pointing us toward a few possible causes, we need to go all detective on it. Based on the logs and symptoms, we can develop a list of hardware, software, network components, firewall rules, patches, or services that could be causing our issue. If the logs aren’t giving you anything, diagnostic tools like sysmon, filemon, netmon, or grep can be used, or even baseline analytics tools like Tripwire, vRealize Operations, or SolarWinds can bear the answers.
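
To sketch the suspects list itself, a quick tally of which components show up next to errors and warnings in the windowed log lines can point you at the noisiest candidates. The keywords and sample lines below are hypothetical; they depend entirely on what your logs actually contain.

```python
import re
from collections import Counter

# Hypothetical component names you suspect, based on symptoms and log breadcrumbs.
SUSPECTS = ["firewall", "iis", "sql", "dns", "patch", "certificate"]
ERROR_MARKERS = re.compile(r"\b(error|fail|failed|warn|warning|timeout|refused)\b",
                           re.IGNORECASE)

def tally_suspects(lines):
    """Count how often each suspect keyword appears on an error or warning line."""
    hits = Counter()
    for line in lines:
        if not ERROR_MARKERS.search(line):
            continue
        lowered = line.lower()
        for suspect in SUSPECTS:
            if suspect in lowered:
                hits[suspect] += 1
    return hits

# Example: feed it the windowed lines from the previous sketch.
sample = [
    "2015-11-25 22:14:03 ERROR sql connection refused",
    "2015-11-25 22:14:05 WARN  certificate near expiry",
    "2015-11-25 22:14:09 ERROR sql timeout after 30s",
]
for suspect, count in tally_suspects(sample).most_common():
    print(f"{suspect}: {count} suspicious entries")
```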

Eliminate Suspects

Like any good detective, your objective now is to take your list and begin interrogating the suspects to narrow the search down to a few prime ones. Here is where Google becomes really handy: some error codes will stick out, as will events that occur around the same time the issue arose. You also know your symptoms and suspects, so your Google query can be streamlined. Same web service down, only localized, happening after patch night, only affecting Windows 8.1 desktops with IE? Boom, there is your web search.
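
As a trivial sketch, here is how those notes might get composed into the search itself; the symptom strings are just the example from the paragraph above.

```python
from urllib.parse import quote_plus

# Symptom notes pulled from the example above -- swap in your own.
symptoms = [
    "web service unavailable",
    "after patch night",
    "Windows 8.1",
    "IE only",
]

query = " ".join(symptoms)
print("Search query:", query)
print("URL: https://www.google.com/search?q=" + quote_plus(query))
```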

Determine Root Cause

Now that we have some more info to go on, hopefully we have our suspects down to one or two. Here is where we need to pause, because if the problem is software based we want to make an identical copy of the system in our test environment to test with. If it’s hardware, things are a little easier. Let’s split this topic.

Software  

You need to have a way to copy off your system, as I mentioned before, because we are going to want to roll back like a Walmart sales associate. When you have your list of suspects and they are all software, whether it’s a driver, a patch, or just bad code, you are going to need to do some uninstalls or reloads, and you will want to be able to roll back to the known bad state if one of the fixes doesn’t work. DO NOT CHANGE TOO MUCH AT ONCE. I have worked with too many folks who go through their Google search results and just start implementing every change they see, never actually knowing which change fixed their problem. Go slow and be methodical: if a change doesn’t fix the problem, note the new norm, roll back, and move on to the next potential fix.
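
A sketch of that one-change-at-a-time discipline looks something like the loop below. The candidate fixes and the “does it work now?” check are simulated stand-ins; in real life they would wrap your snapshot, install/uninstall, and symptom-check tooling.

```python
# A sketch of the "one change at a time" loop with simulated fixes.

CANDIDATE_FIXES = [
    "roll back hypothetical patch",
    "reinstall suspect driver",
    "revert web.config change",
]

# Simulated outcome: which single change actually clears the symptom.
ACTUALLY_FIXES_IT = {"revert web.config change"}

def apply_fix(fix: str) -> None:
    print(f"Applying: {fix}")

def roll_back(fix: str) -> None:
    print(f"Rolling back: {fix} (no effect, returning to known bad state)")

def problem_is_fixed(fix: str) -> bool:
    # In reality, re-run the symptom check from step 1 here.
    return fix in ACTUALLY_FIXES_IT

def work_through(fixes):
    """Try exactly one change at a time, verifying and rolling back as you go."""
    for fix in fixes:
        apply_fix(fix)
        if problem_is_fixed(fix):
            print(f"Resolved by: {fix} -- now you know which change mattered.")
            return fix
        roll_back(fix)
    print("No candidate fixed it; widen the suspect list and dig again.")
    return None

if __name__ == "__main__":
    work_through(CANDIDATE_FIXES)
```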

Hardware

Hardware comes down to one simple rule: use a known good. If there is a hardware failure, replace the suspect busted hardware with a known-good replacement part and retest.

For networking and firewall rules, the software steps still apply; nothing crazy there.

Retest After Resolution

Like anything, once you believe you have a fix in place, test it. There is nothing worse than waving the all clear just for the system to blow up again. I have been guilty of that myself in the past, so you need to validate that the resolution works. Once you are satisfied and the system is up, you can be the unsung hero and move on to the next problem in the queue.
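
As a last sketch, the validation step can be as simple as re-running the original symptom check a few times before you call the all clear, reusing the hypothetical URL from the first snippet.

```python
import time
import urllib.error
import urllib.request

# Hypothetical URL from the symptom check; hit it a few times before declaring victory.
URL = "https://intranet.example.com/status"
ATTEMPTS = 5

def still_healthy(url: str) -> bool:
    """Return True if the service answers with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def validate_fix(url: str, attempts: int) -> bool:
    """Check the service repeatedly; a single failure means it is not fixed yet."""
    for i in range(attempts):
        if not still_healthy(url):
            print(f"Attempt {i + 1}: still failing -- do not wave the all clear.")
            return False
        print(f"Attempt {i + 1}: healthy")
        time.sleep(5)  # space the checks out a little
    return True

if __name__ == "__main__":
    print("Fix validated." if validate_fix(URL, ATTEMPTS) else "Back to the queue.")
```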