Disaster Recovery and The Lab’s Gaming Rig

Not too long ago, coding with my coffee on my patio, I went to check some changes into the lab git repository. The repo was not responding. I went to log in to the virtual server handling the repo ~ also no response. On my way to the Nagios monitoring dashboard, I spotted the alert emails ~ it seems that one service was not all that was missing. An entire host server had fallen off the radar!

Now this particular host was something of an experiment: it was a pre-built, about-to-be-discontinued HP model from Best Buy with an 8th-generation, 6-core Intel Core i7 processor with hyperthreading, 16 GB of memory, a 128 GB NVMe drive, and a 1 TB SATA spinner ~ not too shabby. On-board graphics meant no need for a GPU to show a text console on the screen ~ bonus. I’d picked it up, stripped Windows from it, and installed XCP-ng, the Xen-based server virtualization platform. With the six cores plus hyperthreading, the machine could run the equivalent of 10+ Amazon AWS micro-instances, which cost around $10/month each ~ very cool. Its primary role, as an experiment itself, was to host experiments ~ non-critical stuff that didn’t require failover or similar.

Good thing!

Firstly, no worries: the hosted virtual machines themselves were routinely backed up to local disconnected storage and could be reloaded on other servers, and particular services such as the local git repo were redundant with cloud services. Bottom line: there was nothing critical that couldn’t be recovered in some reasonable state with a little bit of work and time. That said, it would certainly be nice if I could get the server running again from where it left off…

Recovery

The server itself was roughly a $1000 investment and had been running for 1-2 years. It had some quirks, but was not too much trouble overall. Finding it powered up with fans gently spinning but unresponsive was not in any way typical. Bottom line? The box wouldn’t POST: no video signal, no beeps ~ only the fans firing up and then ramping down and settling in. Pulling the memory and trying again did result in error beeping, so at least something was alive in there…

Approach? Well, since the boot drive was an NVMe solid-state device, there was clearly power to the motherboard with very little demand, and single-stick memory checks didn’t produce different results, I speculated I was facing either a failed motherboard or a failed CPU. Surely at least the memory and drives could be salvaged ~ maybe the CPU too, especially given the difference in prices between a motherboard and a CPU.

Well, the prebuilt design produced a tight and reportedly non-standard form with little space for upgrades or expansions ~ a consideration in light of the “memento mori” event and thinking the salvage should be anything but a disposable commodity ~ so I’d probably need a new case, and looking at the power supply? A nondescript 180W? One or two future accessories would kill that for sure. Ok then! Round One: motherboard, case, and power supply it is!

… and apparently a CPU cooler, too ~ for the second visit, that is. The original cooler, a simple air cooler, was probably suitable, but remember that prebuilt aspect? The cooler’s mounting bracket was glued to the bottom of the motherboard ~ not reusable.

Alright: new motherboard, case, power supply, and cooler; old CPU, memory, and drives. The results? No POST.

Next trip? Wait, about these “next trips”: the nearest places carrying all of these raw computer parts are our friendly neighborhood Micro Centers, where the “neighborhoods” are a choice of a 45-minute drive plus tolls either to the northeast of Baltimore or to the northwest of Washington, D.C. ~ choose your Beltway. So for the next trip, I’d pick up both replacement memory (as the old set was a pair of nondescript 8 GB, 2666 MHz sticks) as well as a new CPU ~ after carefully confirming the return policy, of course. The CPU would be hedging the bet and saving a trip, just in case… The 8th generation i7 was no longer sold by Micro Center as all the 9th generation stuff was out. The latest model, the Core i9, was certainly overkill (and overpriced) for my needs, so the step in between would do: the 9th generation Core i7 9700K. I’m told the gamers love it: with 8 cores and overclocking capability, what’s not to love except maybe the price around $400?

Well, the memory alone was not a fix — but as long as I had the better memory, might as well keep it. And yes, big air cooler off, old CPU pull, new CPU push, air cooler remount did the trick — we have POST!

What I didn’t have was any recognized boot drive. Finagling a wired USB keyboard, a wired USB mouse, and a USB thumb drive, I got a live instance of Ubuntu running. From there, yes, I could see the boot partition, but the box somehow couldn’t… And the XCP-ng partitioning wasn’t great (at least not at the time): only about 40 GB of the 128 GB NVMe was used by XCP-ng during the auto-install. In an old effort to recover some of that space, I had linked the spinning drive to the NVMe drive with Linux logical volume management (LVM) ~ a bad move that produced no useful results. “It’d probably just be easier to kill those disks and start fresh, right?” I thought to myself… and then pushed the button.

Okay then: a fresh Linux install. “Maybe this will serve as a development station in the meantime…” Install the desktop environment and… achievement unlocked!

… but what’s with that annoying flickering?

Google research… upgrade motherboard BIOS… tinker with settings in the OS and in BIOS… Swap HDMI ports, cables, monitors… No, still the occasional, full-screen blackout flicker. That is certainly not suitable for a workstation… The internet had no solution indexed — just some discussions regarding the Intel integrated graphics and the latest Linux 5.3 kernel…

Last trip, this time to the local Best Buy for a light mugging by the price of a GPU card. Later that evening, after some installation details worthy of their own story, the flicker was gone. Woot.

Looking into the virtualization options, a hiccup. Remember the hyperthreading on the 6-core, 8th-gen i7? And remember the hyperthreading on the 8-core, 9th-gen i9? The hyperthreading that effectively doubles the number of apparent CPUs for a hypervisor, which would place me now between 12 and 16? Guess what Intel omitted from the 9th-gen i7?

Right. No hyperthreading. Only 8 apparent CPUs from a hypervisor’s perspective. However fast they might be, it was a net loss for the intended use.
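
For anyone who wants the arithmetic spelled out, here is a minimal sketch. The core counts are the ones from this story; the function name is just something I made up for illustration, and it assumes the usual two threads per core when hyperthreading is present.

```python
def logical_cpus(cores: int, hyperthreading: bool) -> int:
    """Apparent CPUs a hypervisor sees: two threads per core with hyperthreading, else one."""
    return cores * (2 if hyperthreading else 1)


old_i7 = logical_cpus(6, hyperthreading=True)    # the original 8th-gen, 6-core i7
new_i9 = logical_cpus(8, hyperthreading=True)    # the 9th-gen i9 I passed on
new_i7 = logical_cpus(8, hyperthreading=False)   # the 9th-gen i7 9700K I bought

print(old_i7, new_i9, new_i7)  # 12 16 8
```

Twelve apparent CPUs before the failure, eight after: faster cores, but fewer slots for small virtual machines.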

So what did I end up with? Well, if I stick Windows on there I’ll have built a fairly substantial gaming rig! That’s a far cry from the targeted virtualization server… Maybe my son will enjoy the new box… right after he has me buy several new games for it :-p

Disaster Recovery Scorecard?

Well, given that we had classified this server as experimental, deploying no primary or critical services on it, and given that we did keep routine backups off-box and off-site, there was no impact to operations. There were, however, man-hours and expenses associated with restoring to the previous state, which has not yet been accomplished.

Was it worth it?

Well, kind of.

  • The expenses associated with the original purchase and even the replacement parts and time compared favorably to commercial cloud hosting. Admittedly though, that’s because this host was not significant to operations — that is, no clients were impacted with downtime.
  • As cloud provisioning becomes increasingly trendy and easy, I believe we’re losing a lot of our basic knowledge and DIY skills. Increasingly, we lose the capability to architect systems with other than cloud solutions and, as a result, we’re drawn into an increasing dependency cycle with associated pricing — the latest variant on “vendor lock-in.”
  • We hold to the tenet that using cloud computing means we’re using other people’s computers and networks ~ something that is fundamentally not conducive to operations requiring data security, privacy, anonymity, controls against vendor outages, and so forth.
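
The cost comparison in the first point can be roughed out with a back-of-the-envelope sketch. It uses only the approximate figures mentioned in this story (about $1000 for the server, about $400 for the new CPU, roughly 10 micro-instance equivalents at about $10/month each). The other replacement parts, electricity, and my time are deliberately left out, so treat the result as a lower bound rather than an accounting.

```python
import math


def break_even_months(hardware_cost: float, instances: int, monthly_price: float) -> int:
    """Whole months of equivalent cloud rental needed to cover the hardware cost."""
    monthly_cloud_cost = instances * monthly_price
    return math.ceil(hardware_cost / monthly_cloud_cost)


SERVER = 1000     # approximate original purchase price, in dollars
NEW_CPU = 400     # approximate replacement CPU price; other parts are not tallied here
INSTANCES = 10    # "10+" micro-instance equivalents
PRICE = 10        # approximate dollars per instance per month

print(break_even_months(SERVER, INSTANCES, PRICE))            # 10
print(break_even_months(SERVER + NEW_CPU, INSTANCES, PRICE))  # 14
```

On those numbers the original box paid for itself within its 1-2 years of service, and even with the CPU added the break-even stays in the same range.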

To be resilient and effective, it’s important to keep skills sharp at the “roughing it”- and “guerrilla warfare”-levels of computing and network operations. Lessons learned at the lowest levels are applicable at every level.

So, yes: it’s worth it ~ personally and professionally, for ourselves and for our clients 🙂

Continuity Planning (BCP)

This past Monday morning, I found I had a Google Hangouts message waiting for me from a friend of mine — except it wasn’t my friend; it was his wife. She was working through her husband’s phone, letting his different contacts know that he had passed suddenly on Saturday.

My instincts from working in this business for maybe too long had me questioning the authenticity of the message as well as my friend’s cellphone security practices. As reality set in and bumped up against self-reflection, focus shifted to continuity planning. What would my family do if it was me? What would I do if it was my wife?

Let’s use the unexpected reminder not only to take care of ourselves and our families, but also our business operations. Having a sound Business Continuity Plan (BCP) is right up there with, and tied to, Incident Response, Disaster Recovery, and other planning and documentation. It’s about evaluating risks to the business as an entity (including those well outside information security), mitigating those risks, and planning what to do if, in spite of the mitigations, disruption occurs. Like the other documents, it should be periodically reevaluated to ensure it remains relevant, and changes should be approved through the ordinary change process. Key personnel should be familiar with the document, know where to find it, and know when to activate any processes within it.

Just like for ourselves and our families, the priority of our day-to-day business life is hopefully not planning for its end. Still, do take that little bit of time every so often to make sure you’ve got it covered.

Time for Backups! #reminder

Don’t be that CIO…

Happy (Fiscal) New Year to the Feds & associated contractors! (Do we have a budget yet?) Happy Fall for those tracking the seasons! Happy October for folks tracking the monthly cycles and looking forward to Halloween! And Happy Tuesday for the folks celebrating every day or at least for those folks who are happy it’s not Monday anymore! Certainly at least one of those events is suitable to trigger your periodic reminder:

Back up your Data!

Are you still tracking the Baltimore ransomware incident from not-so-long-ago? The recovery effort is said to be over $18-million so far, but that’s not what has everyone in a tizzy. Rather, it’s the news that some amount of critical data could not be restored because it was solely maintained on individual users’ systems. Not servers. Not NAS with backups. Not cloud. Just their local drives. (See Dark Reading or Google for related links.)

I sympathize… I do. And who knows how much of that $18-million was related to that particular finding or if it was avoidable as the ransomware swept through like a wildfire, like the news story is doing around here now. Regardless, it doesn’t look like it’s going to matter to the current CIO: He’s said to be on leave and there’s a lot of speculation that he’s not expected to return.

Don’t be that CIO.

Know what data is important to you. Know where it is and back it up. Have protected copies on-site and off-site. And if you’ve done that, move on to the bonus round:

When was the last time you practiced restoring from backups?
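
A restore drill can be as simple as restoring to a scratch location and checking that what came back matches what went in. The sketch below is one way to do that check, not a full backup tool: it assumes the original files are still available and unchanged since the backup was taken, and the function names are my own for illustration. After a real disaster you would compare against a checksum manifest saved at backup time instead.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(original: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing from the restore or whose contents differ."""
    want, got = checksums(original), checksums(restored)
    return {
        "missing": sorted(set(want) - set(got)),
        "changed": sorted(n for n in want.keys() & got.keys() if want[n] != got[n]),
    }
```

Calling `verify_restore(Path("/data"), Path("/mnt/restore-test/data"))` (both paths hypothetical) returns two lists; if both are empty, the drill passed.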

Identify & Protect. Detect, Respond, & Recover. It doesn’t matter if it’s your business data or your kids’ baby pictures — the principles are the same. If you’re not doing it, put the plan in place to remediate and get on it. Need assistance? Contact us.