What does pardoning a turkey have to do with Post Mortems anyway?

Every Thanksgiving one turkey, typically named Tom, gets a pardon from the President of the United States. That's a pretty stupid tradition, but politics has a litany of action-for-the-sake-of-public-approval-over-actual-substance scenarios that play out even dumber than this one. The thing that strikes me about it, though, is that the turkey is being forgiven for something he never did; essentially, a pardon waives the burden of guilt and the sentence that comes with it.

Think about every post mortem or crisis room you have been in: how often is the fault for the issue acknowledged and forgiven so completely? Or better yet, how often is the problem not a direct result of anyone in the crisis room? The answer is most likely that it doesn't happen, unless you are running blameless post mortems.

Now here is the thing: blameless post mortems are AWESOME in theory. No finger pointing, no screaming, just a calm discussion of what happened, how it was diagnosed, and how it was solved. Who wouldn't want to be a part of that after a crisis? In practice, though, this requires a level of maturity from everyone in the room, and a room full of mature IT people is like a porcupine with a good hair day: not nonexistent, just a rarity.

What if, though, we all took a deep breath in that room, with all the stress around us, and realized that we are all passionate professionals? What if we could set aside our individualism for the sake of the team?

Bear with me as I have a back-in-my-day moment…

Back when I was in the operations business, sitting in the data center, every siloed team had friends on the other teams. We laughed at luncheons together; we had inside jokes about late night patching, outages, and pranks we pulled. When it came to troubleshooting a serious issue we banded together and fixed it as a team, no matter what the root cause ended up being. But at the after action we were all so proud that we would try to deflect root cause away from our own team. It was not the right way to deal with it.

I have been fortunate enough to see blameless post mortem meetings since leaving full-time operations. Wow, what a difference. The meetings are swift and on target with determining root cause and sharing information and results. The efficiency is striking because no one feels like they have to explain their actions away; instead, they are just explaining how things occurred and were fixed. Crisis meetings in these groups are equally impressive, with teams coming together and everyone helping with the troubleshooting process.

If you are interested, I would highly recommend checking out the folks over at Etsy who seem to do this better than anyone and are open enough to write about it. Here are some links:

https://codeascraft.com/2012/05/22/blameless-postmortems/

https://www.pagerduty.com/blog/blameless-post-mortems-strategies-for-success/

https://www.etsy.com/teams/7716/announcements/discuss/10641726/

What are your experiences with this?

Oh and Happy Thanksgiving everyone!!

Did Google kill the troubleshooting star?

Just like video killed the radio star, I wonder if Google hasn't killed IT troubleshooting skills. Once upon a time we followed a common set of principles to determine the root cause of a problem and resolve it. Today, though, too many admins jump straight to Google with a list of symptoms rather than establishing root cause. Not that Google is bad; in fact, it's a great tool to help, but I am not sure it's the first place to look.

For you younger pups (I still refuse to be considered old), there are some simple steps to help you get to the root of a problem.

  1. Determine symptoms
  2. Dig into the logs
  3. Identify suspects
  4. Eliminate suspects
  5. Determine root cause
  6. Retest after resolution

Let's walk through these steps to help you better understand them and see where tools like Google make sense.

Determine Symptoms

One of the first things you see when an issue arises is the outlying symptoms. You have all seen enough medical dramas to get where I am going. It's important to keep track of these issues, because when you hit the next step it's easy to go down a rabbit hole fast. So jot down the issues you are seeing and when they seem to have started happening. Web service unreachable? OK, is it an individual site or page? Is it localized or widespread? Is it affecting multiple browsers? All good things to take in.
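
As a minimal sketch of what that note-taking can look like, here is a bit of Python that records symptoms with timestamps and does a quick reachability check. The URL and the example observations are hypothetical placeholders:

```python
from datetime import datetime, timezone
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical service under investigation
URL = "http://intranet.example.com/app/login"

symptoms = []

def note(symptom):
    """Record a symptom along with the time it was observed."""
    symptoms.append((datetime.now(timezone.utc).isoformat(), symptom))

# Quick reachability check from this host
try:
    with urlopen(URL, timeout=5) as resp:
        note(f"{URL} reachable, HTTP {resp.status}")
except URLError as exc:
    note(f"{URL} unreachable: {exc.reason}")

# Observations from users and monitoring go into the same list
note("Only the /app/login page is affected, other pages load fine")
note("Reported from both Chrome and IE, so probably not browser-specific")

for when, what in symptoms:
    print(when, "-", what)
```

The point is less the tooling and more the habit: every symptom gets a timestamp before you start chasing causes.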

Dig Into the Logs

When you start to dig into the logs, you are looking for timestamps that correspond with the service outage, along with any entries from the service itself. While you are digging in, note events and errors on the system that could impact your issue. Start making a suspects list of where you are going to look first.
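
As a rough sketch, assuming a log where each line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (the path, window, and keywords below are made up; adjust to your environment), pulling the suspicious entries from the outage window might look like this:

```python
from datetime import datetime

# Hypothetical outage window, log path, and keywords
WINDOW_START = datetime(2015, 11, 20, 2, 0)
WINDOW_END = datetime(2015, 11, 20, 3, 30)
LOG_FILE = "/var/log/webapp/app.log"
KEYWORDS = ("ERROR", "FATAL", "WARN", "timeout", "refused")

hits = []
with open(LOG_FILE, errors="replace") as fh:
    for line in fh:
        # Assumes lines look like: 2015-11-20 02:17:43 ERROR ...
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that don't start with a timestamp
        if WINDOW_START <= stamp <= WINDOW_END and any(k in line for k in KEYWORDS):
            hits.append(line.rstrip())

print(f"{len(hits)} suspicious entries in the outage window")
for entry in hits:
    print(entry)
```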

Identify Suspects

Now that we know what issue we are looking for and have some breadcrumbs to guide us to a few possible causes, we need to go all detective on it. Based on the logs and symptoms, we can develop a list of hardware, software, network components, firewall rules, patches, or services that could be causing our issue. If the logs aren't giving you anything, diagnostic tools like Sysmon, Filemon, Netmon, or grep can be used, or baseline analytics tools like Tripwire, vRealize Operations, or SolarWinds can bear the answers.
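
One way to turn those breadcrumbs into a ranked suspects list is simply to count which component each suspicious entry points at. This sketch reuses the hypothetical `hits` list from the log scan above; the phrase-to-component mapping is entirely illustrative:

```python
from collections import Counter

# Illustrative mapping of log phrases to the component they implicate
COMPONENT_HINTS = {
    "connection refused": "web service or firewall rule",
    "timeout": "network path or load balancer",
    "certificate": "TLS certificate or recent patch",
    "out of memory": "host memory or hardware",
    "disk": "storage",
}

def rank_suspects(log_lines):
    """Tally how many suspicious entries point at each component."""
    tally = Counter()
    for line in log_lines:
        lowered = line.lower()
        for hint, component in COMPONENT_HINTS.items():
            if hint in lowered:
                tally[component] += 1
    return tally.most_common()

# 'hits' would come from the log scan sketched earlier
hits = [
    "2015-11-20 02:17:43 ERROR upstream timeout while proxying /app/login",
    "2015-11-20 02:18:01 ERROR connection refused to 10.0.4.12:443",
]
for component, count in rank_suspects(hits):
    print(f"{count:3d}  {component}")
```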

Eliminate Suspects

Like any good detective, your objective now is to take your list and begin interrogating it to narrow the search down to a few prime suspects. Here is where Google becomes really handy; some error codes will stick out, as will events that occurred around the same time the issue arose. You also know your symptoms and suspects, so your Google query can be streamlined. Same web service down, only localized, happening after patch night, only affecting Windows 8.1 desktops with IE? Boom, there is your web search.

Determine Root Cause

Now that we have some more info to go on, hopefully we have our suspects down to one or two. Here is where we need to pause, because if the problem is software based we want to make an identical copy of the system in our test environment to test against. If it's hardware, things are a little easier. Let's split this topic.

Software  

You need to have a way to copy off your system, as I mentioned before, because we are going to want to roll back like a Walmart sales associate. When you have your list of suspects and they are all software, whether it's a driver, a patch, or just bad code, you are going to need to do some uninstalls or reloads, and you will want to be able to roll back to the known bad state if one of the fixes doesn't work. DO NOT CHANGE TOO MUCH AT ONCE. I have worked with too many folks who go through their Google search results and just start implementing every change they see, never actually knowing which change fixed their problem. Go slow and be methodical: if a change doesn't fix the problem, note the new norm, roll back, and move on to the next potential fix.
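
To make the one-change-at-a-time discipline concrete, here is a minimal sketch; the snapshot, restore, and health-check functions are placeholders for whatever your hypervisor, package manager, and monitoring actually provide:

```python
from urllib.request import urlopen
from urllib.error import URLError

def snapshot(name):
    # Placeholder: take a VM or filesystem snapshot named `name` here
    print(f"[snapshot] {name}")

def restore(name):
    # Placeholder: roll the system back to the snapshot named `name`
    print(f"[restore] {name}")

def service_healthy(url="http://intranet.example.com/app/login"):
    # Placeholder health check against the hypothetical URL from earlier
    try:
        with urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 400
    except URLError:
        return False

def try_fixes_one_at_a_time(candidate_fixes):
    """Apply each (name, callable) fix alone, from the same known-bad baseline."""
    snapshot("known-bad")
    for name, apply_fix in candidate_fixes:
        print(f"Trying fix: {name}")
        apply_fix()
        if service_healthy():
            print(f"Resolved by: {name}")
            return name
        print(f"No improvement after: {name}; rolling back")
        restore("known-bad")
    print("None of the candidate fixes resolved the issue")
    return None
```

The important part is the shape of the loop: one fix, one health check, and a rollback to the same known-bad baseline before the next attempt, so you always know exactly which change did the job.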

Hardware

Hardware comes down to one simple rule: use a known good. If there is a hardware failure, replace the suspect busted part with a known-good replacement and retest.

For networking and firewall rules, the software steps still apply; nothing crazy there.

Retest After Resolution

Like anything, once you believe you have a fix in place, test it. There is nothing worse than waving the all clear just to have the system blow up again. I have been guilty of that myself in the past. So you need to validate that the resolution works. Once you are satisfied and the system is up, you can be the unsung hero and move on to the next problem in the queue.
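
A simple guard against the premature all clear is to soak-test the fix rather than trusting a single successful request. This sketch reuses the hypothetical URL from earlier and just re-checks it on an interval:

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

URL = "http://intranet.example.com/app/login"  # hypothetical service
CHECKS = 30                                    # how many re-tests
INTERVAL_SECONDS = 60                          # spacing between checks

failures = 0
for i in range(CHECKS):
    try:
        with urlopen(URL, timeout=5) as resp:
            ok = 200 <= resp.status < 400
    except URLError:
        ok = False
    if not ok:
        failures += 1
        print(f"check {i + 1}: FAILED")
    time.sleep(INTERVAL_SECONDS)

if failures == 0:
    print("Soak test clean, safe to wave the all clear")
else:
    print(f"{failures} failed checks, the fix did not hold")
```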

No more VDI isolationism

End User Computing (EUC) has finally matured to the point where adoption no longer needs a long cycle of convincing everyone that VDI is a thing. Even so, wide-scale VDI isn't the norm; organizations still struggle to get projects moving because of the cost of running a separate architecture for VDI that sits apart from primary datacenter workloads. That separation comes from traditional design requirements driven by the IOPS limitations of most storage solutions. All-flash arrays (AFAs) have helped solve the IOPS issue and are great for running super fast virtual desktops, but despite solving that one problem, AFAs haven't changed the separate-architecture discussion.

SolidFire is built for mixed workloads; its quality of service (QoS) capability allows for a minimum guaranteed threshold of performance per volume. Traditionally this has been the darling of service providers: as they built multi-tenant cloud environments with shared resources, SolidFire clusters ensured the performance SLAs were being met for each tenant. But this applies just as easily to the enterprise. Hopefully by now you are thinking, well, if SolidFire can handle mixed workloads, could I run primary datacenter apps like Oracle or SharePoint on the same cluster that includes my VDI or EUC solution?
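
For a sense of how that per-volume guarantee gets set, here is a hedged sketch of a QoS change through the SolidFire Element JSON-RPC API. The cluster address, API version, credentials, volume ID, and IOPS numbers are all placeholder assumptions, so treat this as an outline and check the Element API reference for the exact method and fields in your release:

```python
import base64
import json
import ssl
import urllib.request

# Placeholder cluster management VIP, API version, and credentials
ENDPOINT = "https://solidfire-mvip.example.com/json-rpc/8.0"
USER, PASSWORD = "admin", "changeme"

def element_call(method, params):
    """Minimal JSON-RPC helper for the Element API (sketch only)."""
    body = json.dumps({"method": method, "params": params, "id": 1}).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    # Lab only: skip certificate verification for a self-signed cluster cert
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())

# Guarantee an IOPS floor for the hypothetical VDI datastore volume while
# capping its burst, so desktops and primary apps can share one cluster.
result = element_call("ModifyVolume", {
    "volumeID": 42,                       # hypothetical volume ID
    "qos": {"minIOPS": 3000,              # guaranteed minimum
            "maxIOPS": 15000,
            "burstIOPS": 20000},
})
print(result)
```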

Yeah that’s exactly my point.

Because mixed workloads are now a solved problem, we can reduce the CAPEX of the EUC entry point. Think of it this way: if you are running more than one workload in your datacenter and need an AFA level of performance, then SolidFire should be in the mix. If you recognize that the SolidFire cluster is handling the business for that workload and has room to spare, then spin up an EUC solution on the same cluster with its own guaranteed performance metrics. From a storage perspective this lowers the cost per desktop, since the shared cluster distributes cost across the various workloads. In addition, as more workloads come on board, or more desktops or applications are required, you can easily scale the cluster by adding nodes.
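
To put some back-of-the-napkin numbers on that cost-per-desktop claim, here is a tiny sketch with entirely made-up prices, desktop counts, and workload split; plug in your own quotes:

```python
# Hypothetical numbers purely for illustration
cluster_cost = 300_000.0          # fully loaded cluster price ($)
desktops = 1_000                  # planned VDI desktops
server_workload_share = 0.60      # fraction of capacity/IOPS consumed by
                                  # Oracle, SharePoint, and other primary apps

# Dedicated-array model: the desktops carry the whole array
dedicated_cost_per_desktop = cluster_cost / desktops

# Mixed-workload model: desktops only carry their share of the cluster
vdi_share = 1.0 - server_workload_share
mixed_cost_per_desktop = (cluster_cost * vdi_share) / desktops

print(f"Dedicated array: ${dedicated_cost_per_desktop:.2f} per desktop")
print(f"Shared cluster:  ${mixed_cost_per_desktop:.2f} per desktop")
```

Same cluster, same desktops; the only thing that changed is that the primary workloads are carrying their share of the bill.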

I am working on a write-up for large-scale VDI/EUC solutions on SolidFire that will be out once I have it tightened up. For now, though, I invite you to take a look at the reference architectures we have up today. These will be updated in the coming months to build out more of an EUC, not just VDI, approach.

Last thing I want to touch on: this approach also cuts down POC time. Traditionally, when I was a consultant, I would recommend we build out a small POC to test VDI and let it run for a couple of months while we determined whether the solution fit the customer's needs. Now, because we are leveraging the same SolidFire gear the rest of the datacenter uses, we can spin up more desktops and jump to a larger-scale pilot phase faster. If the desktop optimization side is properly implemented and the right use cases are targeted, this leads to big wins in EUC adoption faster than ever before.

Pimp’n may not be easy, but making a decision on building out EUC on SolidFire sure seems to be a no brainer.