We got the urgent call at 9 pm on a Monday night: the client had experienced a system crash that left data corrupt or missing and their primary file server (the lifeblood of their business) unable to boot. Users were complaining of slow or no response from several other critical business systems, and while attempting to recover, the client discovered that both local and off-site backups were corrupted and mostly unusable. Because the client was nearby, we had two ninjas on site by 8 am the next day. At the initial triage meeting it was unclear what had happened, when it had happened, and what the extent of the damage was. We did not know what we did not know.
We spent the rest of the week, the long weekend and a couple of sleepless nights with the client before the root cause(s) became clear. First, a combination of low-level hardware and firmware configuration errors had been sporadically introducing unnoticed data corruption into their environment for months. Second, a poorly implemented backup scheme had allowed that corruption to infiltrate and effectively destroy the entire chain of local and off-site backup files. And finally, the system crash that brought down their infrastructure was not the cause of the file corruption and data loss; it was actually a symptom! This CIO found himself living his own worst nightmare: extended downtime, an imminent RPE (resume-producing event), the C-suite and upper management on his back, users fuming, and years of work that could be lost forever. Fortunately, working in concert with his staff, our ninjas were able to find and fix the cause of the problem and restore operations in a timely manner. Given the cause and the nature of the data corruption, the data recovery effort is still ongoing.
We are now designing and implementing a disaster recovery/business continuity system that includes off-site replication of all IT systems, enabling failover and failback of all business operations to the cloud, as well as a backup scheme that complies with the 3-2-1 principle (three copies of the data, on two different media, with one copy off-site) and aligns with the RPOs and RTOs (recovery point and recovery time objectives) established by the company.
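The 3-2-1 check itself is simple enough to express in a few lines. Here is a minimal sketch of how a backup inventory could be validated against the rule; the `BackupCopy` structure and the sample inventory are illustrative assumptions, not this client's actual configuration:

```python
# Hypothetical sketch: check a backup inventory against the 3-2-1 rule
# (3 copies of the data, on 2 different media types, 1 copy off-site).
from dataclasses import dataclass

@dataclass
class BackupCopy:
    location: str  # e.g. "on-site" or "off-site"
    media: str     # e.g. "disk", "tape", "cloud"

def satisfies_3_2_1(copies):
    """Return True if the set of copies meets the 3-2-1 principle."""
    enough_copies = len(copies) >= 3
    two_media = len({c.media for c in copies}) >= 2
    one_offsite = any(c.location == "off-site" for c in copies)
    return enough_copies and two_media and one_offsite

inventory = [
    BackupCopy("on-site", "disk"),    # primary/production data
    BackupCopy("on-site", "tape"),    # local backup on different media
    BackupCopy("off-site", "cloud"),  # replicated off-site copy
]
print(satisfies_3_2_1(inventory))  # True for this inventory
```

Had this client's scheme been held to even this basic test, the corrupted chain of local and off-site backups would have been far less likely to fail all at once.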
This client is in the media and entertainment industry and was one of the many hit in the global “WannaCry” ransomware attack. Their systems were compromised, their files were encrypted, and the hackers at the other end of the Bitcoin wallet had a stranglehold on their business. Since this customer did not want publicity, they acquiesced and paid the ransom. Of course, the ransom had to be paid in bitcoin, which took them a few days to obtain, because not everyone has 3,600 in bitcoin ready to go. Luckily, once the ransom was paid, they were able to decrypt and regain access to their files.
Our ninjas are currently assisting in the company-wide remediation and cleanup effort, which includes an “if you’re not 100% sure it’s clean, wipe it out and start over” clause. The project is going to take some time and it will not be cheap, but let’s face it: the last thing the client wants is to find out six months down the road that some seemingly benign code tucked away on an infrequently used drive is “calling home” to initiate a new crypto-locking scheme and set the nightmare in motion again.
This was not a fire-drill emergency or a potential RPE as in the previous two examples, but it was definitely a wake-up call. Along with the cleanup effort, we will be fast-tracking a disaster recovery project that was slated for Q4 into this quarter and implementing a zero-day threat management system to watchdog their freshly scrubbed infrastructure. As an added measure of security for the media files that were the focus of the ransomware attack, they are being copied to IBM Cloud Object Storage (ICOS), where the data is erasure coded, dispersed across multiple geographic locations, and encrypted at rest. Think of it as a RAID array of self-encrypting disks spread across three data centers in three different cities. Take that, you bitcoin-demanding cybercriminals!
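The erasure-coding idea behind that dispersal is worth a quick illustration. The toy sketch below splits data into shards plus a single XOR parity shard, so any one lost shard can be rebuilt from the survivors; real dispersed storage such as ICOS uses much stronger codes (e.g. Reed-Solomon variants) that tolerate multiple simultaneous failures, so treat this only as a conceptual analogy:

```python
# Toy erasure-coding sketch: k data shards + 1 XOR parity shard.
# Losing any single shard (say, one data center goes dark) is recoverable.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=3):
    """Split `data` into k equal shards plus one XOR parity shard."""
    data += b"\x00" * ((-len(data)) % k)  # pad to a multiple of k
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def recover(shards, lost_index):
    """Rebuild one missing shard by XOR-ing all the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

shards = encode(b"critical media file", k=3)
# Simulate losing shard 1 and rebuilding it from the remaining three:
assert recover(shards, 1) == shards[1]
```

The point for the client: no single site holds a readable copy of the file, yet the loss of a site does not lose the data.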
I got a call at 7 am on a Thursday morning from a tired CIO who needed immediate IT support: his financial application was “down”. He and his IT person had been up all night, since, without a way to enter orders and create invoices, their business was effectively shut down. He had been on with the manufacturer’s tech support, escalated tickets, and exhausted all the basic checklists. We arranged a quick web conference call and soon had our ninjas troubleshooting alongside the customer for most of that day and half of the next. This could easily have been an RPE (resume-producing event) for the CIO, so we worked as part of his team until we were able to restore the financial system. While he didn’t lose his job, this CIO lost two full days of production, two nights of sleep, and his credibility with management and the user base. We are currently designing and implementing a remote failover configuration for his critical applications, off-site replication of his entire VMware infrastructure, a well-designed backup system, and a DR plan.