Disaster Recovery and The Lab’s Gaming Rig

Not too long ago, coding with my coffee on my patio, I went to check some changes into the lab git repository. The repo was not responding. I went to log in to the virtual server handling the repo ~ also no response. On my way to the Nagios monitoring dashboard, I spotted the alert emails ~ it seems one service wasn’t all that was missing. An entire host server had fallen off the radar!

Now this particular host was something of an experiment: It was an HP pre-built, about-to-be-discontinued model from Best Buy with an 8th-generation, 6-core Intel Core i7 processor with hyperthreading, 16 GB of memory, a 128 GB NVMe drive, and a 1 TB SATA spinner — not too shabby. On-board graphics meant no need for a GPU to show a text console on the screen — bonus. I’d picked it up, stripped Windows from it, and installed XCP-ng, the Xen-based server virtualization platform. With the six cores plus hyperthreading, the machine could run the equivalent of 10+ Amazon AWS micro-instances, which cost around $10/month each ~ very cool. Its primary role, as an experiment itself, was to host experiments ~ non-critical stuff that didn’t require failover or the like.
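For the curious, the back-of-the-envelope comparison is simple arithmetic (a rough sketch using the approximate figures above, not a formal pricing analysis):

```python
# Rough cost comparison: local hypervisor vs. equivalent cloud micro-instances.
# Figures are the approximations from the text, not current AWS pricing.
box_cost = 1000        # one-time hardware cost, USD
instances = 10         # micro-instance equivalents the box can host
rate_per_month = 10    # USD per micro-instance per month

cloud_per_month = instances * rate_per_month    # $100/month
break_even_months = box_cost / cloud_per_month  # ~10 months

print(f"Cloud equivalent: ${cloud_per_month}/month")
print(f"Hardware pays for itself in about {break_even_months:.0f} months")
```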

Good thing!

Firstly, no worries: The hosted virtual machines themselves were routinely backed up to local disconnected storage and could be reloaded on other servers, and particular services such as the local git repo were redundant with cloud services. Bottom line: there was nothing critical that couldn’t be recovered in some reasonable state with a little bit of work and time. That said, it would certainly be nice if I could get the server running again from where it left off…
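For flavor, the routine amounted to little more than the sketch below, which drives XCP-ng’s `xe` CLI from Python. The VM names and backup path are hypothetical, and a real script would snapshot running VMs first and rotate old archives:

```python
#!/usr/bin/env python3
"""Minimal sketch of a routine VM export on an XCP-ng host.
Names and paths are hypothetical; a production script would snapshot
running VMs before export and rotate/verify old archives."""
import subprocess
from datetime import date
from pathlib import Path

BACKUP_DIR = Path("/mnt/backup")          # locally attached, normally disconnected storage
VMS = ["git-repo", "nagios", "sandbox"]   # hypothetical VM name-labels

for vm in VMS:
    target = BACKUP_DIR / f"{vm}-{date.today()}.xva"
    # `xe vm-export` writes the VM out as a portable .xva archive that can be
    # re-imported on another XCP-ng/XenServer host with `xe vm-import`.
    subprocess.run(["xe", "vm-export", f"vm={vm}", f"filename={target}"], check=True)
    print(f"exported {vm} -> {target}")
```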

Recovery

The server itself was roughly a $1000 investment and had been running for 1-2 years. It had some quirks, but was not too much trouble overall. Finding it powered up with fans gently spinning but unresponsive was not in any way typical. Bottom line? The box wouldn’t POST: no video signal, no beeps ~ only the fans firing up and then ramping down and settling in. Pulling the memory and trying again did result in error beeping, so at least something was alive in there…

Approach? Well: power was clearly reaching the motherboard, the NVMe solid-state boot drive put very little demand on it, and single-stick memory checks didn’t produce different results, so I speculated I was facing either a failed motherboard or a failed CPU. Surely at least the memory and drives could be salvaged ~ maybe the CPU too, especially given the difference in price between a motherboard and a CPU.

Well, the prebuilt design produced a tight and reportedly non-standard form factor with little space for upgrades or expansion — a consideration in light of the “memento mori” event and the thought that the salvage should be anything but a disposable commodity — so I’d probably need a new case. And looking at the power supply? A nondescript 180 W? One or two future accessories would kill that for sure. Okay then! Round One: motherboard, case, and power supply it is!

… and apparently a CPU cooler, too — for the second visit, that is. The original cooler, a simple air cooler, was probably suitable, but remember that prebuilt aspect? The cooler’s mounting bracket was glued to the bottom of the motherboard ~ not reusable.

Alright: new motherboard, case, power supply, and cooler; old CPU, memory, and drives. The results? No POST.

Next trip? Wait — about these “next trips:” The nearest places carrying all of these raw computer parts are our friendly neighborhood Micro Centers, where the “neighborhoods” are a choice of a 45-minute drive plus tolls either to the northeast of Baltimore or to the northwest of Washington, D.C. ~ choose your Beltway. So for the next trip, I’d pick up both replacement memory (as the old set was a pair of nondescript 8 GB, 2666 MHz sticks) and a new CPU — after carefully confirming the return policy, of course. The CPU would hedge the bet and save a trip, just in case… Micro Center no longer sold the 8th-generation i7, as all the 9th-generation stuff was out. The top model, the Core i9, was certainly overkill — and overpriced — for my needs, so the step in between would do: the 9th-generation Core i7-9700K. I’m told gamers love it: eight cores and overclocking capable, so what’s not to love, except maybe the roughly $400 price?

Well, the memory alone was not a fix — but as long as I had the better memory, might as well keep it. And yes, big air cooler off, old CPU pull, new CPU push, air cooler remount did the trick — we have POST!

What I didn’t have was any recognized boot drive. Finagling a wired USB keyboard, a wired USB mouse, and a USB thumb drive, I got a live instance of Ubuntu running. From there, yes, I could see the boot partition, but the box somehow couldn’t… And the XCP-ng partitioning wasn’t great (at least not at the time): the auto-install used only about 40 GB of the 128 GB NVMe. In an earlier effort to recover some of that space, I had linked the spinning drive to the NVMe drive with Linux Logical Volume Management (LVM) — a bad move that produced no useful results. “It’d probably just be easier to kill those disks and start fresh, right?” I thought to myself… and then pushed the button.
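For context, the “linking” was the standard LVM move sketched below: turn the spinner into a physical volume and fold it into the install’s volume group. Device and group names here are hypothetical, and the move itself is the mistake described above, since it welds a fast NVMe and a slow spinner into a single failure domain:

```python
#!/usr/bin/env python3
"""Sketch of the LVM 'space recovery' move described above.
Device and VG names are hypothetical -- and the move is NOT recommended."""
import subprocess

SPINNER = "/dev/sdb"             # hypothetical: the 1 TB SATA drive
VOLUME_GROUP = "VG_XenStorage"   # hypothetical: the XCP-ng install's volume group

# pvcreate marks the disk as an LVM physical volume;
# vgextend folds it into the existing volume group.
for cmd in (["pvcreate", SPINNER], ["vgextend", VOLUME_GROUP, SPINNER]):
    subprocess.run(cmd, check=True)

# Logical volumes in the group can then grow onto the spinner, e.g.:
#   lvextend -l +100%FREE /dev/VG_XenStorage/<lv>
```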

Okay then: a fresh Linux install. “Maybe this will serve as a development station in the meantime…” Install the desktop environment and… achievement unlocked!

… but what’s with that annoying flickering?

Google research… upgrade motherboard BIOS… tinker with settings in the OS and in BIOS… Swap HDMI ports, cables, monitors… No, still the occasional, full-screen blackout flicker. That is certainly not suitable for a workstation… The internet had no solution indexed — just some discussions regarding the Intel integrated graphics and the latest Linux 5.3 kernel…

Last trip, this time to the local Best Buy for a proper mugging by the price of a GPU card. Later that evening, after some installation details worthy of their own story, the flicker was gone. Woot.

Looking into the virtualization options, a hiccup. Remember the hyperthreading on the 6-core, 8th-gen i7? And remember the hyperthreading on the 8-core, 9th-gen i9? The hyperthreading that effectively doubles the number of CPUs a hypervisor sees, which would have put me somewhere between 12 and 16? Guess what Intel omitted from the 9th-gen i7?

Right. No hyperthreading. Only 8 apparent CPUs from a hypervisor’s perspective. However fast they might be, it was a net loss for the intended use.
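The tally, using Intel’s published core and thread counts for these parts, is painfully simple:

```python
# Logical CPUs as a hypervisor sees them (cores x threads per core).
old_8th_gen_i7 = 6 * 2   # 6 cores, hyperthreaded -> 12 logical CPUs
ninth_gen_i9   = 8 * 2   # 8 cores, hyperthreaded -> 16 logical CPUs
i7_9700k       = 8 * 1   # 8 cores, no hyperthreading -> 8 logical CPUs

print(old_8th_gen_i7, ninth_gen_i9, i7_9700k)  # 12 16 8
```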

So what did I end up with? Well, if I stick Windows on there I’ll have built a fairly substantial gaming rig! That’s a far cry from the targeted virtualization server… Maybe my son will enjoy the new box… right after he has me buy several new games for it :-p

Disaster Recovery Scorecard?

Well, given that we had classified this server as experimental, deploying no primary or critical services on it, and given that we did keep routine backups off-box and off-site, there was no impact to operations. There were, however, man-hours and expenses associated with restoring to the previous state, which has not yet been accomplished.

Was it worth it?

Well, kind of.

  • The expenses associated with the original purchase, and even with the replacement parts and time, compared favorably to commercial cloud hosting. Admittedly, though, that’s because this host was not significant to operations — that is, no clients were impacted by downtime.
  • As cloud provisioning becomes increasingly trendy and easy, I believe we’re losing a lot of our basic knowledge and DIY skills. Increasingly, we lose the capability to architect systems with anything other than cloud solutions and, as a result, we’re drawn into a deepening dependency cycle with associated pricing — the latest variant on “vendor lock-in.”
  • We hold to the tenet that using cloud computing means we’re using other people’s computers and networks — something that is fundamentally not conducive to operations requiring data security, privacy, anonymity, protection against vendor outages, and so forth.

To be resilient and effective, it’s important to keep skills sharp at the “roughing it”- and “guerrilla warfare”-levels of computing and network operations. Lessons learned at the lowest levels are applicable at every level.

So, yes: it’s worth it ~ personally and professionally, for ourselves and for our clients 🙂

Lifetime Warranty: Data & Identity

Living in a neighborhood replete with tall, deciduous trees, I had “gutter covers” installed around the house when the opportunity arose. When one of those tall trees later decided to attack the house, I remembered the “lifetime warranty” that came with that installation. Sure enough, it was for the lifetime of that company, which — naturally — no longer existed.

There’s a bit of chatter around Twitter today regarding a decision to purge idle accounts. The details apparently aren’t firm yet, but the broad strokes include “sometime next month” and “accounts with no login in the last six months.” One of the voiced benefits: freeing up account names so they can be reissued.

Two angles appeared immediately:

  1. Identity Management. People and organizations have identities, reputations, and relationships linked to their names. When the names change, we start over, for better or worse, carrying little if any recognition between the names. Similarly, should someone move in and claim our names, they have the opportunity to claim our identity, our reputation, and our relationships. Sometimes it’s “Under New Management!” Other times, the havoc of identity theft ensues…
  2. Data Management. ToSes and EULAs be damned! When you place your data and communications into another organization’s hands, you are implicitly accepting the risk that your data will be lost, stolen, compromised, abused, or used for purposes other than you intended; and you are implicitly accepting the risk that the access switch will be turned off without a moment’s notice. What now?

Yes, “what now?” indeed…

In the Twitterverse, I’ve seen the first “What about my deceased dad’s tweets?” questions. “I like to visit them from time to time to remember our conversations, but I don’t have a login to his account!” Extend this to every other data service that relies on a third party to accept, hold, and present data that means something to you: Facebook, Instagram, YouTube, email, blogs, websites, data storage, …; then remember that your rights to all of that are as thin as the clause that allows the provider to change the agreement at their leisure.

How about identity? I shared in a recent post the story of seeing a text message from my friend that was actually from his wife, yes? When large numbers of our interactions are no longer face-to-face, let alone in our own “voice,” we become quite comfortably conditioned to accept that email accounts, text messages, Twitter handles, and everything else are natural extensions of the person or entity we trust. (Conversely, it is easy to assume that an email account, text message, Twitter handle, or anything else is not from an individual we trust if we have not previously associated it with that individual ~ but that’s a post for another time…) The bottom line? These chains of trust are often easily broken, and once broken they are easily exploited.

What to do? Well, from our perspective it boils down, as usual, to risk analysis. For each piece of data and each service you use, ask what it would mean to you if it were gone or compromised. Do you have your cloud data backed up locally in some intelligible format? Do you have the sensitive stuff protected even in the cloud? Do you have alternatives available to provide basic services like group/family communications, email, instant messaging, and telephony? Are your peers aware of your plan, and do they know how to fail over to the alternatives? Do you have methods in place to authenticate one another, verifying identities periodically and especially before discussing important matters when not face-to-face, so you know you’re communicating with the right person? Do you have a strategy to signal that a communication channel is not secure, to switch to alternative channels, or even to indicate on the sly that you’re in distress?
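To make the exercise concrete, here’s a minimal, hypothetical sketch of the kind of per-asset worksheet we mean (the entries and fields are illustrative, not a formal methodology):

```python
"""Minimal per-asset risk worksheet (hypothetical entries and fields)."""
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    impact_if_lost: str         # what it means to you if it's gone
    impact_if_compromised: str  # what it means if someone else gets it
    local_backup: bool          # intelligible copy under your own control?
    fallback: str               # alternative channel or service, if any

inventory = [
    Asset("family photos (cloud album)", "irreplaceable", "privacy loss", False, "none"),
    Asset("email (hosted)", "lost contacts/history", "account takeover", True, "second provider"),
    Asset("group chat (messenger)", "lost coordination", "impersonation", False, "phone tree"),
]

# High-impact items with no local backup and no fallback are where to focus first.
for asset in inventory:
    if not asset.local_backup and asset.fallback == "none":
        print(f"FOCUS HERE: {asset.name} ({asset.impact_if_lost})")
```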

Some of that may seem far-fetched. If so, good! Maybe you’re one of the ordinary folks who may never encounter these problems. Either these items are not in your threat model, or they are but you estimate it’s extremely unlikely you’ll be impacted catastrophically if one materializes. That is a completely reasonable outcome of thoughtful risk analysis. On the other hand, if any of these threats resonate with you and you haven’t given any thought to handling them, well, good! That’s also a completely reasonable outcome of thoughtful risk analysis, and now you know where to focus your efforts.

Our role? Helping people and organizations open their eyes to possible threats — particularly those in their blind spots — and helping with remediation strategies where warranted. We publish posts like these freely and communicate the same message everywhere, and we offer confidential reviews of your situation as a service. Take a stab at the exercise yourself, then contact us for an outside assessment.

Time for Backups! #reminder

Don’t be that CIO…

Happy (Fiscal) New Year to the Feds & associated contractors! (Do we have a budget yet?) Happy Fall for those tracking the seasons! Happy October for folks tracking the monthly cycles and looking forward to Halloween! And Happy Tuesday for the folks celebrating every day or at least for those folks who are happy it’s not Monday anymore! Certainly at least one of those events is suitable to trigger your periodic reminder:

Back Up Your Data!

Are you still tracking the Baltimore ransomware incident from not so long ago? The recovery effort is said to be over $18 million so far, but that’s not what has everyone in a tizzy. Rather, it’s the news that some amount of critical data could not be restored because it was maintained solely on individual users’ systems. Not servers. Not NAS with backups. Not cloud. Just their local drives. (See Dark Reading or Google for related links.)

I sympathize… I do. And who knows how much of that $18 million was related to that particular finding, or whether it was avoidable as the ransomware swept through like wildfire, much as the news story is now sweeping through around here. Regardless, it doesn’t look like it’s going to matter to the current CIO: He’s said to be on leave, and there’s a lot of speculation that he’s not expected to return.

Don’t be that CIO.

Know what data is important to you. Know where it is and back it up. Have protected copies on-site and off-site. And if you’ve done that, move on to the bonus round:

When was the last time you practiced restoring from backups?
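If you need a starting point, even a drill as simple as the sketch below beats discovering a dead backup mid-incident: restore into a scratch directory with your backup tool, then hash-compare the restored tree against the originals (the paths here are hypothetical):

```python
#!/usr/bin/env python3
"""Sketch of a restore drill: hash-compare originals against a test restore.
Paths are hypothetical; populate RESTORED via your backup tool first."""
import hashlib
from pathlib import Path

ORIGINAL = Path("/data/important")
RESTORED = Path("/tmp/restore-test")

def digest(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

ok = True
for src in ORIGINAL.rglob("*"):
    if not src.is_file():
        continue
    dst = RESTORED / src.relative_to(ORIGINAL)
    if not dst.is_file() or digest(src) != digest(dst):
        print(f"MISMATCH: {src}")
        ok = False

print("Restore drill PASSED" if ok else "Restore drill FAILED")
```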

Identify & Protect. Detect, Respond, & Recover. It doesn’t matter if it’s your business data or your kids’ baby pictures — the principles are the same. If you’re not doing it, put the plan in place to remediate and get on it. Need assistance? Contact us.