
Automating the boring parts of Tier 3 support

What years of the hardest infrastructure escalations taught me about where automation actually pays off — and how I turned repeatable cluster work into one-command runs.

  • Infrastructure
  • Automation
  • Support

I spend my days as the last line of defense for the hardest hyperconverged-infrastructure cases — hypervisor issues, cluster and storage problems, the escalations nobody else could crack. After enough of them, you notice something: the slow part usually isn’t the diagnosis. It’s everything around it.

The real bottleneck

Picture a tough escalation. By the time it reaches me, the actual “what’s wrong here” thinking is often the fast part. The slow part is the setup: collecting logs across every node, applying maintenance steps by hand, prepping updates, and getting all of it into a consistent state so I can reason about it — and so I can hand it to engineering reproducibly.

Done manually, that work is slow, easy to get subtly wrong, and different every time depending on who’s doing it. That last bit is the killer: if the steps vary from run to run, you can’t tell whether a change in behavior comes from the fault or from the procedure, and the root cause gets much harder to find.

Make the right thing the easy thing

So I built Support-Ansible — a framework that turns those repeatable, high-stakes steps into reviewable, one-command runs:

  • Ansible playbooks that codify cluster maintenance and updates, so the sequence is the same every time and anyone on the team can run it.
  • Python tooling to collect and organize logs across nodes, so an escalation arrives at engineering with complete, consistent data.
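To make the log-collection idea concrete, here is a minimal sketch of the kind of step such tooling might perform. It is not the actual Support-Ansible code: the function name `collect_logs`, the `expected` file list, and the `manifest.json` layout are all hypothetical, and it assumes each node's logs have already been fetched into a local directory. It copies every node's files into one bundle with the same structure each time, records a checksum per file, and flags nodes that are missing a log engineering would expect.

```python
import hashlib
import json
import shutil
from pathlib import Path


def collect_logs(node_dirs, bundle_dir, expected=("messages.log",)):
    """Copy each node's logs into bundle_dir/<node>/ and write a manifest.

    node_dirs maps a node name to a local directory holding that node's logs.
    (Hypothetical input: a real tool would first fetch these from the nodes.)
    """
    bundle = Path(bundle_dir)
    manifest = {}
    # Sort node names so the result never depends on the caller's ordering.
    for node in sorted(node_dirs):
        src = Path(node_dirs[node])
        dest = bundle / node
        dest.mkdir(parents=True, exist_ok=True)
        files = {}
        for path in sorted(p for p in src.rglob("*") if p.is_file()):
            rel = path.relative_to(src)
            target = dest / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            # The checksum lets engineering confirm the file arrived unchanged.
            files[rel.as_posix()] = hashlib.sha256(target.read_bytes()).hexdigest()
        # Record gaps explicitly instead of leaving them to be noticed later.
        missing = sorted(name for name in expected if name not in files)
        manifest[node] = {"files": files, "missing": missing}
    manifest_path = bundle / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Calling `collect_logs({"node-a": "/tmp/node-a", "node-b": "/tmp/node-b"}, "bundle")` (paths are placeholders) produces one `bundle/` directory with a subfolder per node and a manifest next to them. The point of the manifest is that a missing log becomes a visible entry in the hand-off, not a surprise halfway through the investigation.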

The payoff wasn’t just saved time. It was reproducibility, which is what lets root causes get found and fixed instead of worked around.

Write it down or relive it

The same instinct shows up in the runbooks and KB articles I write. The hardest cases tend to repeat — not identically, but in shape. Capturing how to diagnose them turns hard-won, in-my-head experience into something the whole team can use, and stops the same issue from being escalated five different ways by five different people.

Automation and documentation are the same move, really: do the thinking once, then make it cheap to reuse.