Key learnings from the days the digital world stood still


In the past few weeks, we have seen two devastating IT outages sweep across the globe – from the initial CrowdStrike outage that triggered the “blue screen of death” on 8.5 million Windows devices to the latest DDoS-related Microsoft crash. While the full impact remains unclear, we can expect both outages to have significant, long-term repercussions.

Already, it is estimated that the initial crash has cost US Fortune 500 companies up to $5.4 billion in damages, with companies in banking and healthcare expected to be hit the hardest. Beyond this, the disruption left countless organizations scrambling to restore their systems and secure their data, creating a chaotic environment that is ripe for exploitation. This turmoil not only exposed vulnerabilities but also weakened cybersecurity defenses, making enterprises far more susceptible to cybercriminals, who are quick to strike during times of crisis.

Gregory Richardson

VP Global Advisory CISO at BlackBerry Cybersecurity.

The cracks in our global digital infrastructure

Ultimately, the outages drew attention to the often-overlooked physical and logistical challenges of managing a distributed IT infrastructure. As the crisis unfolded, it became clear that resolving the problem required rebooting systems in safe mode with admin privileges. This process is both nightmarish and time-consuming, especially for large and dispersed enterprises. Many organizations also faced difficulties accessing and fixing remote systems, particularly those in hard-to-reach locations.

The scale of the problem is evident from the sheer number and range of sectors affected by the crash, from banks and airlines to hotels and hospitals. It showed us how a single point of failure can cascade across the intricate web of our digital infrastructure to impact various industries. At the same time, the scale of the outage highlighted the importance of skilled IT support and robust Managed Security Service Providers (MSSPs). We also immediately saw professionals from Microsoft, SonicWall and SentinelOne work together to diagnose and resolve the issue. Their collective efforts underscore the immense value of industry collaboration, which remains one of the cybersecurity industry’s greatest assets.

Key learnings from the global IT outages

When a major incident occurs, there is always a trail of lessons to uncover. These outages signal a pivotal moment for all organizations to assess their software supply chain and the operational risks to their business. This is especially true for cybersecurity software operating deep within our software stacks, where the adversaries attack but also where a bad line of code can take down the entire system.

As the immediate impacts of the global outage subside, CIOs and CISOs must now ask themselves – do we have the right balance to deliver the disaster recovery and business continuity needed when this inevitably happens again? If the question is hard to answer, then IT and security leaders should consider:

1. Enhancing process discipline – Strong management processes are crucial, particularly for security tool updates. Security leaders should look to implement rigorous testing protocols before deploying updates across the infrastructure. If a vendor manages this process, it is essential to inquire about their remediation plans for problematic updates.

2. Implementing multi-vendor strategies – While consolidation has been popular, this incident emphasizes the importance of strategically diversifying vendors to mitigate risks. A critical examination of the current setup to identify potential single points of failure should be a priority. Then consider robust Managed Detection and Response (MDR) solutions with open XDR capabilities, which are best suited to supporting a diverse IT or security stack. The alternative locks users into a single vendor and leaves them exposed to potential vulnerabilities.

3. Bolstering endpoint protection – Oftentimes, outages are caused by legacy cybersecurity practices, with complex EDR and heavy endpoint agents posing a major and unnecessary infrastructure risk. Using lightweight AI on the endpoint can avoid these types of outages, as it protects your environment without the heavy agents and regular updates that put your operations at risk.

4. Integrating AI responsibly – Though it might seem unrelated, developing clear policies for AI integration into cybersecurity operations is essential. This foresight will help prevent future large-scale issues as AI becomes more integrated into tech stacks. While AI offers a promising path forward, it has by no means reached its end state. IT and security leaders must therefore remain vigilant and adaptive and be prepared to address evolving vulnerabilities that AI may introduce with an innovative yet responsible approach.

5. Harnessing real-time comms capabilities – Given the outage impacted some of the most critical systems, networks and applications in the world, the response required speed, accuracy, and accountability. Here, a critical event management (CEM) solution can provide real-time visibility to ensure a quick and informed response to recover from business disruption. At the same time, this will provide a paper trail of incident communications to prove that the situation was handled with accountability and compliance at the fore.

6. Ensuring regular testing to remove blind spots – Understanding your vulnerabilities and risks through regular testing is paramount, not only when deploying new software but consistently over time. To protect against potential threat actors who seek to take advantage of IT outages, a combination of AI-enabled internal and external penetration testing assessments remains vital. These will reveal how an outside threat actor could compromise assets through ever-evolving tactics, techniques and procedures. The performance and security of your systems are only as good as their least secure hardware and software components. Therefore, blind spots need to be addressed as a priority to keep companies operating as usual.
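
To make the first recommendation more concrete, one common way to add process discipline to updates is a staged, ring-based rollout: the update goes to a small canary group first and only proceeds to wider groups if failures stay below a threshold. The sketch below is illustrative only. The names `staged_rollout`, `RingResult`, `apply_update` and the 1% threshold are assumptions made for this example, not part of any vendor's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RingResult:
    """Outcome of applying an update to one deployment ring (hypothetical type)."""
    ring: str
    attempted: int
    failed: int

    @property
    def failure_rate(self) -> float:
        # An empty ring counts as a 0% failure rate.
        return self.failed / self.attempted if self.attempted else 0.0


def staged_rollout(
    rings: Sequence[tuple[str, Sequence[str]]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, tune per environment
) -> tuple[bool, list[RingResult]]:
    """Apply an update ring by ring, halting once a ring exceeds the threshold.

    `apply_update` is a caller-supplied function that updates one host and
    returns True if the host is healthy afterwards.
    """
    results: list[RingResult] = []
    for name, hosts in rings:
        failed = sum(1 for host in hosts if not apply_update(host))
        result = RingResult(ring=name, attempted=len(hosts), failed=failed)
        results.append(result)
        if result.failure_rate > max_failure_rate:
            # Stop here: later rings are never touched by the bad update.
            return False, results
    return True, results


if __name__ == "__main__":
    # Simulated update that breaks every host whose name starts with "branch".
    def fake_update(host: str) -> bool:
        return not host.startswith("branch")

    rings = [
        ("canary", ["lab-01", "lab-02"]),
        ("early", ["branch-01", "branch-02", "branch-03"]),
        ("broad", ["hq-01", "hq-02", "hq-03", "hq-04"]),
    ]
    completed, report = staged_rollout(rings, fake_update)
    print("rollout completed:", completed)
    for r in report:
        print(r.ring, r.attempted, r.failed)
```

In this simulation the rollout halts at the "early" ring, so the "broad" ring is never updated. The design point is that a faulty update is contained to the smallest group that can reveal it, rather than reaching every endpoint at once.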

These global tech outages were a stark reminder of the critical need for digital independence and robust management processes. Now, industry leaders must turn these lessons into actionable strategies, using this experience to build more resilient and adaptable cybersecurity frameworks. In this field, it is not a matter of if the next crisis will occur, but when. The strength of the cybersecurity industry lies not only in our individual expertise but also in our collective response to challenges. By fostering collaboration, embracing strategic complexity, and continuously improving processes, future crises can be faced with greater confidence and effectiveness.


This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
