Your Infrastructure and Team Are Only as Strong as the Weakest Link

Last week, my friend and fellow community addict Dennis Faucher challenged me to write a blog post in a Tweet and I really couldn’t pass it up – I mean how often does someone challenge you to a blog post? Plus, knowing that someone as smart as Dennis reads this blog is a pretty big feather in my cap. As a matter of historical perspective, Dennis was a presenter at the very first VMUG I ever hosted as a VMUG leader in Boston – and in a very real sense carried the meeting – so he’ll always have some weight with me.

Anyways, earlier this week Dennis put out the tweet below:

Now the first time I saw this, I wasn’t entirely sure what I was looking at. It was obvious that this was a bridge of some sort with a missing board that someone had tried to cover over with a sloppy patch job. (I bet you already know where we’re going – don’t you?) What wasn’t clear to me was where this was. It looked from the left side like it was fifty or more feet in the air, like some sort of canopy walk in the woods. Instead, it turns out that this is on the ground, and is just a little walkway to keep people’s feet out of the swampy or muddy area below. What looked to me like a straight shot down is actually a reflection in the water of the trees above the image.

This is what I thought the picture was of - a Pirates of the Caribbean-style bridge. To be clear, I’m not going on this bridge regardless of its level of repair.

Because my first look made me think it was suspended high in the air, my first reaction was something like “who would allow a bridge to be ‘repaired’ this way?!? This must be the dumbest thing I’ve seen in a while – and no one should trust that bridge!” In looking at the picture, it’s absolutely clear that the “repair” is simply some boards thrown across the gap.

When I first started thinking about how this applies to IT I thought about the perils of patching, but I felt like that might make a boring post – after all, you know how important it is to keep your systems and apps patched against the steady stream of newly disclosed vulnerabilities. You don’t need a post on that, and I don’t want to write one. Instead, I started thinking about my initial reaction to the picture – that no one should ever venture across a bridge fifty feet in the air with that shoddy excuse for a link. And then I started thinking about how often we all encounter something similar in our own IT environments… and I had a topic.

I can remember a time not too long ago – though fortunately long enough ago that the wound isn’t fresh – when I was working in an environment that we had built to be rock-solid with redundancy. Storage, networking, compute – you name it, and it was redundant. I would demonstrate the resiliency by going to the datacenter and removing a drive from a server or unplugging something at random – app users never noticed a thing. I was proud of the design because it was really solid; this was in the days before virtualization was widely in use in datacenters, and the “cloud” was a thing outside when it rained, so building a resilient platform for mission-critical apps was something companies (including the one I worked for) took seriously.

What I didn’t know was that upstream, the firewall that I had no responsibility for was plugged into a power strip strung across an aisle. You know how this ends – you could write the rest of this story for me – but I’ll save you the trouble. Someone tripped on the cable and down the firewall went, knocking out access to my super-resilient environment that sat behind that firewall. Users didn’t care that my servers and everything were up and quietly waiting for someone to connect to them. From the user perspective, the database, mail, and pretty much everything else requiring authentication was down. The whole company came to a halt. From their perspective it was broken. And they were right; I firmly believe that you need to adopt the vantage point of the user as the only criterion for performance and availability. If they can’t get to it – it’s downtime.

The lesson I learned was that my infrastructure is only as resilient as the weakest link, even if that weak link is firmly outside of my purview and responsibility. This is the case for every application and every environment you’ve ever worked on or will work on. It’s unlikely that the entire environment will be built to the same bulletproof redundancy, so there will be a weak link somewhere. I encourage you to find it and then be honest about it. It’s possible that the risk is acceptable, but it’s also possible that no one was thinking that the bandwidth across VPN had anything to do with your HR app… until 100% of your company had to work from home and then use VPN to get into the HR app.
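If you want to put rough numbers on the weakest-link idea, here is a minimal sketch. It assumes the components sit in series (every one has to be up for users to get in) and fail independently, so end-to-end availability is simply the product of the individual availabilities. The component names and percentages are hypothetical, made up for illustration, not measurements from the environment in my story.

```python
from math import prod

# Hypothetical availability figures (fraction of time each component is up).
# These are illustrative assumptions, not real measurements.
chain = {
    "firewall_on_power_strip": 0.99,
    "redundant_network": 0.9999,
    "redundant_storage": 0.9999,
    "redundant_compute": 0.9999,
}


def end_to_end_availability(components):
    """Availability of a serial chain: every component must be up at once.

    Assumes failures are independent, which is a simplification.
    """
    return prod(components.values())


def weakest_link(components):
    """Return the name of the component with the lowest availability."""
    return min(components, key=components.get)


overall = end_to_end_availability(chain)
print(f"weakest link: {weakest_link(chain)}")
print(f"end-to-end availability: {overall:.4%}")
```

Under these assumptions the end-to-end number can never be better than the worst component in the chain, no matter how much redundancy sits behind it.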

Once you’ve acknowledged where the weak link is, you can decide whether to accept it, mitigate it, manage it, or remediate it. Of course, the choice is going to be influenced by budget, timelines, scope, and company risk tolerance. I can’t give you much advice on which to choose because situations are going to be wildly different – but the important part is to do something. Even if the answer is to just accept it, you are at least bringing it to the surface and getting it documented so it’s known that it’s the weakest link. Let’s make sure we’re level set on what each of those means:

  • Accept it – acknowledge there is a weaker link in the chain, document the impacts if it fails, and have someone in a decision-making role acknowledge and accept the risk.

  • Mitigate it – apply a patch or band-aid to prevent it from doing as much damage. In my example above with the firewall, I could have screwed the plug into the wall. This isn’t a permanent solution, but it will bridge me to a permanent one.

  • Manage it – this one applies mostly to when the issue is human-related. You can set goals and provide training to improve the situation instead of living with it as it is.

  • Remediate it – make the underlying situation better. In our firewall example, this means we acquire a professional PDU for the rack, get it cabled appropriately and take care of it once and for all.

Honestly, though - what happened to this particular link? Like… why that one?

As uncomfortable as this is for a lot of people, you should apply the same methodology to your team if you’re a manager. Do you have a team member that you can’t trust as much as everyone else? Is there someone who seems to really struggle with some technical aspect of their role? You may have found the weakest link in your team. I’ve found out time and time again that your team is only as good as the weakest link. Now, don’t think I’m telling you to find the person on your team that is bad at their job, because I’m not. Sometimes the weakest link on your team is really strong – that means your team is really strong. That would be a good time to just acknowledge that that person is the weakest link and try to find ways to help build them up. Sometimes it’s just a matter of getting them training or offering them an opportunity to work on a new project.

Other times, you may find that you have a problem on the team, and you have to do something about it. You could just acknowledge it and say there’s acceptable risk, but if you’re managing an operations team and you can’t count on that person to handle any issues that come up during an on-call rotation, you may not want to do so. More often you have an opportunity to find a training and development path so that you can strengthen that weaker link – which will ultimately level up your whole team. In our on-call example, you could also decide to mitigate the problem by acknowledging that that person will need to have a point of escalation available when they are on call. That way the organization’s needs are met and you haven’t relied on that person skilling up further.

Generally, mitigation is a temporary fix (not unlike sticking some boards across a gap in a walkway) and won’t suffice for the long term. I would encourage you not to build an environment – or a team – based on mitigating the damage the weakest link in the chain can do. If it’s not within an acceptable risk tolerance (which clearly varies with the criticality of the function, app, or solution you’re building) you should strive to remediate it so you have a long-term solution. It’s in your best interest – and your company’s best interest – to make sure that you are finding where the weakest links are and not just throwing some boards over them. You will sleep better at night if you fix them, and your users will have a better experience whether they’re 50 feet in the air or just above the swampy surface.

Well, there you go. That’s my take on why you want your infrastructure, apps, and team to be as strong as possible all around. Ultimately, your success will be defined by how you manage, support, or design the weakest link. Special thanks to Dennis for giving me something to write about this week – he saved you all from some thoughts about productivity and energy that I’ll queue up for sometime in the future.

Every time you give me feedback, a good boy gets a head rub. Think about it.

As a parting thought, I feel like I should mention that I appreciate reader feedback, so hey – maybe if you challenge me like Dennis did, I’ll drop what I’m doing and write something about it. I’d love to hear your thoughts, feedback, and questions - but first, I have some questions for you:

  • Can you think of a time you knew there was a weak link in a project or team you were on? How did you handle it?

  • How do you define acceptable risk in your team or organization? How does it vary based on the project you’re working on?

  • If you have a weak link in an area you support right now, why is it there? What are you doing to manage or mitigate it?
