Disaster Recovery and The Lab’s Gaming Rig

Not too long ago, coding with my coffee on my patio, I went to check some changes into the lab git repository. The repo was not responding. I went to log in to the virtual server handling the repo ~ also no response. On my way to the Nagios monitoring dashboard, I spotted the alert emails ~ it seems that one service was not all that was missing. An entire host server had fallen off the radar!

Now this particular host was something of an experiment: It was an HP pre-built, about-to-be-discontinued model from Best Buy with an 8th generation, 6-core Intel Core i7 processor with hyperthreading, 16 GB of memory, a 128 GB NVMe drive and a 1 TB SATA spinner ~ not too shabby. On-board graphics meant no need for a GPU to show a text console on the screen ~ bonus. I’d picked it up, stripped Windows from it, and installed XCP-ng, the Xen-based server virtualization platform. With the six cores plus hyperthreading, the machine could run the equivalent of 10+ Amazon AWS micro-instances, which cost around $10/month each ~ very cool. Its primary role, as an experiment itself, was to host experiments ~ non-critical stuff that didn’t require failover or similar.

Good thing!

Firstly, no worries: The hosted virtual machines themselves were routinely backed up to local disconnected storage and could be reloaded on other servers, and particular services such as the local git repo were redundant with cloud services. Bottom line: there was nothing critical that couldn’t be recovered in some reasonable state with a little bit of work and time. That said, it would certainly be nice if I could get the server running again from where it left off…

Recovery

The server itself was roughly a $1000 investment and had been running for 1-2 years. It had some quirks, but was not too much trouble overall. Finding it powered up with fans gently spinning but unresponsive was not in any way typical. Bottom line? The box wouldn’t POST: no video signal, no beeps ~ only the fans firing up and then ramping down and settling in. Pulling the memory and trying again did result in error beeping, so at least something was alive in there…

Approach? Well, since the boot drive was an NVMe solid state device, there was clearly power to the motherboard with very little demand, and single-stick memory checks didn’t produce different results, I speculated I was facing either a failed motherboard or a failed CPU. Surely at least the memory and drives could be salvaged ~ maybe the CPU too, especially given the difference in prices between a motherboard and a CPU.

Well, the prebuilt design produced a tight and reportedly non-standard form with little space for upgrades or expansions, a consideration in light of the “memento mori” event and my thinking that the salvage should be anything but a disposable commodity. So I’d probably need a new case. And looking at the power supply? A nondescript 180W? One or two future accessories would kill that for sure. Ok then! Round One: motherboard, case, and power supply it is!

… and apparently a CPU cooler, too ~ for the second visit, that is. The original cooler, a simple air cooler, was probably suitable, but remember that prebuilt aspect? The cooler’s mounting bracket was glued to the bottom of the motherboard ~ not reusable.

Alright: New motherboard, case, power supply, and cooler; old CPU, memory, and drives. The results? No POST.

Next trip? Wait, about these “next trips”: The nearest places carrying all of these raw computer parts are our friendly neighborhood Micro Centers, where the “neighborhoods” are a choice of a 45-minute drive plus tolls either to the northeast of Baltimore or to the northwest of Washington, D.C. ~ choose your Beltway. So for the next trip, I’d pick up both replacement memory (as the old set was a pair of nondescript 8 GB, 2666 MHz sticks) and a new CPU ~ after carefully confirming the return policy, of course. The CPU would be hedging the bet and saving a trip, just in case… The 8th generation i7 was no longer sold by Micro Center as all the 9th generation stuff was out. The latest model, the Core i9, was certainly overkill (and overpriced) for my needs, so the step in between would do: the 9th generation Core i7-9700K. I’m told the gamers love it: with 8 cores and overclocking capability, what’s not to love except maybe the price of around $400?

Well, the memory alone was not a fix — but as long as I had the better memory, might as well keep it. And yes, big air cooler off, old CPU pull, new CPU push, air cooler remount did the trick — we have POST!

What I didn’t have was any recognized boot drive. Finagling a wired USB keyboard, a wired USB mouse, and a USB thumb drive, I got a live instance of Ubuntu running. From there, yes, I could see the boot partition, but the box somehow couldn’t… And the XCP-ng partitioning wasn’t great (at least not at the time): Only about 40 GB of the 128 GB NVMe drive was used by XCP-ng during the auto-install. In an old effort to recover some of that space, I had linked the spinning drive to the NVMe drive with Linux logical volume management (LVM) ~ a bad move that produced no useful results. “It’d probably just be easier to kill those disks and start fresh, right?” I thought to myself… and then pushed the button.

Okay then: a fresh Linux install. “Maybe this will serve as a development station in the meantime…” Install the desktop environment and… achievement unlocked!

… but what’s with that annoying flickering?

Google research… upgrade motherboard BIOS… tinker with settings in the OS and in BIOS… Swap HDMI ports, cables, monitors… No, still the occasional, full-screen blackout flicker. That is certainly not suitable for a workstation… The internet had no solution indexed — just some discussions regarding the Intel integrated graphics and the latest Linux 5.3 kernel…

Last trip, this time to the local Best Buy for a light fleecing by the price of a GPU card. Later that evening, after some installation details worthy of their own story, the flicker is gone. Woot.

Looking into the virtualization options, a hiccup. Remember the hyperthreading on the 6-core i7 8-gen? And remember the hyperthreading for the 8-core i9 9-gen? The hyperthreading that effectively doubles the number of apparent CPUs for a hypervisor, that would place me now between 12 and 16? Guess what Intel omitted from the 9th-gen i7?

Right. No hyperthreading. Only 8 apparent CPUs from a hypervisor’s perspective. However fast they might be, it was a net loss for the intended use.
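
For the record, the arithmetic behind that net loss can be sketched in a few lines of Python. This is only an illustration using the core counts quoted in this post; the helper name logical_cpus is made up for the example, and the assumption is simply that a hypervisor sees one apparent CPU per hardware thread.

```python
import os


def logical_cpus(cores: int, threads_per_core: int) -> int:
    """Apparent CPUs a hypervisor sees: physical cores times hardware threads per core."""
    return cores * threads_per_core


if __name__ == "__main__":
    # Core counts as quoted in the post; 2 threads per core means hyperthreading is on.
    old_box = logical_cpus(cores=6, threads_per_core=2)    # 8th-gen i7 with hyperthreading
    hoped_for = logical_cpus(cores=8, threads_per_core=2)  # 8 cores, if hyperthreading were there
    new_box = logical_cpus(cores=8, threads_per_core=1)    # i7-9700K, no hyperthreading

    print(f"old: {old_box}, hoped for: {hoped_for}, actual: {new_box}")
    print(f"net change vs. the old box: {new_box - old_box}")

    # What the machine running this script reports (may be None on some platforms).
    print(f"this machine reports: {os.cpu_count()}")
```

Faster cores can make up some of the difference for a single busy guest, but the number of virtual CPUs available to hand out without oversubscribing still drops from 12 to 8.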

So what did I end up with? Well, if I stick Windows on there I’ll have built a fairly substantial gaming rig! That’s a far cry from the targeted virtualization server… Maybe my son will enjoy the new box… right after he has me buy several new games for it :-p

Disaster Recovery Scorecard?

Well, given that we had classified this server as experimental, deploying no primary or critical services on it, and given that we did keep routine backups off-box and off-site, there was no impact on operations. There were, however, man-hours and expenses associated with restoring to the previous state, which has not yet been accomplished.

Was it worth it?

Well, kind of.

  • The expenses associated with the original purchase and even the replacement parts and time compared favorably to commercial cloud hosting. Admittedly though, that’s because this host was not significant to operations — that is, no clients were impacted with downtime.
  • As cloud provisioning becomes increasingly trendy and easy, I believe we’re losing a lot of our basic knowledge and DIY skills. Increasingly, we lose the capability to architect systems with other than cloud solutions and, as a result, we’re drawn into an increasing dependency cycle with associated pricing — the latest variant on “vendor lock-in.”
  • We hold to the tenet that using cloud computing means we’re using other people’s computers and networks, something that is fundamentally not conducive to operations requiring data security, privacy, anonymity, controls against vendor outages, and so forth.
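
The cost claim in the first bullet can be roughed out with the figures quoted earlier: a box of roughly $1000 standing in for about 10 micro-instances at around $10/month each, over 1-2 years of service. The sketch below is a back-of-the-envelope estimate only; the function names are made up for illustration, and it deliberately ignores electricity, replacement parts, and my time, none of which are tallied in this post.

```python
def cloud_cost(instances: int, price_per_month: float, months: int) -> float:
    """Total rent for the equivalent cloud instances over a number of months."""
    return instances * price_per_month * months


def break_even_months(hardware_cost: float, instances: int, price_per_month: float) -> float:
    """Months of cloud rent that add up to the hardware purchase price."""
    return hardware_cost / (instances * price_per_month)


if __name__ == "__main__":
    # Rough figures from the post: ~$1000 box, ~10 micro-instances, ~$10/month each.
    hardware = 1000.0
    instances = 10
    price = 10.0

    months = break_even_months(hardware, instances, price)
    print(f"break-even after about {months:.0f} months")

    # The server ran for 1-2 years, so compare both ends of that range.
    for service_months in (12, 24):
        rent = cloud_cost(instances, price, service_months)
        print(f"{service_months} months of cloud rent: ${rent:.0f} vs. ${hardware:.0f} for the box")
```

On those numbers the box covers its purchase price in under a year, which is the sense in which it compared favorably; adding the replacement parts would push the break-even point out accordingly.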

To be resilient and effective, it’s important to keep skills sharp at the “roughing it”- and “guerrilla warfare”-levels of computing and network operations. Lessons learned at the lowest levels are applicable at every level.

So, yes: it’s worth it ~ personally and professionally, for ourselves and for our clients 🙂