SolarWinds, MOVEit, Knight Capital, and now CrowdStrike. The vendor ecosystem will remain a major playing field for operational disruptions. But are you ready for the next inevitable event? As a CISO, your response to such a question from the board shouldn't be anything less than a resounding "Yes!"
Here are five plans of action to help your organization survive the next major IT quake, whether it's due to another faulty security update or a third-party breach.
1. Establish a 'War Room'
The organizations that reinstated their operations most efficiently in the aftermath of the CrowdStrike incident were those that could quickly bring together key decision-makers in a 'war room.' A war room is a centralized command center where specialists gather to manage a crisis in real time. Many organizations make the mistake of assuming a carefully crafted incident response plan is sufficient to reduce operational disruption risks. But, as the CrowdStrike incident so delicately pointed out, you can't prepare for every possible IT disruption.
A war room is a critical safety measure that can bridge the gap between your response plans and an unplanned IT crisis.
To have the capacity to address the broadest scope of potential disruptions, you need to fill your war room with representatives of your primary risk categories. For medium to large enterprises, the list of specialized personnel should at least include your:
- CISO - representing cyber risk exposure
- Information Security Officer - representing IT risk exposure
- Chief Financial Officer - representing financial risk exposure
- Chief Risk Officer - representing operational risk exposure
Beyond C-suite members, other personnel you could include in a war room are:
- Head of compliance - representing compliance risk exposure
- IT manager - representing IT and data security risk exposure
- Cybersecurity manager - representing security and third-party risk exposure
- Legal counsel - representing legal risk exposure
All your war room members should congregate regardless of the specific risk exposure a given event has inflamed. If a disruption is significant enough to trigger a war room gathering, it will likely have ripple effects across multiple risk categories, requiring collaborative response efforts across multiple business functions.
Whether the gathering occurs in-person or remotely, a war room setup should enable the following:
- Rapid information sharing: The efficient breakdown of all critical information regarding the active incident, either through impact analysis reports or vendor risk summary reports
- Decision-making agility: The ability to make swift, informed decisions to mitigate the impact of the outage and expedite recovery efforts
- Real-time impact and remediation monitoring: All members should have access to a real-time monitoring feed of all affected systems. If remediation action has been deployed, members should have visibility into each task's status.
- Development and maintenance of a timeline of events: In the heat of a crisis, it can be difficult to track events occurring in near real time and look for causal relationships between them. A detailed timeline is also essential to manage future audit and compliance processes.
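The timeline requirement above can be sketched as a small append-only log. This is a minimal illustration, not a prescribed tool: the `IncidentTimeline` and `TimelineEntry` names, their fields, and the rendering format are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(order=True)
class TimelineEntry:
    # Only the timestamp takes part in ordering; the other fields are payload.
    timestamp: datetime
    source: str = field(compare=False)       # hypothetical: who reported the event
    description: str = field(compare=False)  # hypothetical: what happened


class IncidentTimeline:
    """Append-only record of incident events, kept in chronological order."""

    def __init__(self) -> None:
        self._entries: List[TimelineEntry] = []

    def record(self, source: str, description: str,
               timestamp: Optional[datetime] = None) -> TimelineEntry:
        # Default to "now" in UTC so entries from different sites are comparable.
        entry = TimelineEntry(timestamp or datetime.now(timezone.utc),
                              source, description)
        self._entries.append(entry)
        # Re-sort so reports that arrive late still land in the right position.
        self._entries.sort()
        return entry

    def entries(self) -> List[TimelineEntry]:
        return list(self._entries)

    def render(self) -> str:
        # One line per event, suitable for pasting into an audit report.
        return "\n".join(
            f"{e.timestamp.isoformat()} [{e.source}] {e.description}"
            for e in self._entries
        )
```

Recording events with explicit UTC timestamps, rather than in the order they happen to be reported, makes it easier to look for causal relationships afterwards.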
In the case of the CrowdStrike incident, UpGuard provided customers with complete awareness of all their impacted third- and even fourth-party vendors.
Determine a third-party incident impact threshold for activating a war room gathering, as it's a significant resource maneuver. Your definition of this threshold will be a relationship between a static component (your third-party risk appetite) and a dynamic component (emerging risks in your external attack surface).
Your threshold for activating a war room is based on a combination of your third-party risk appetite and your current exposure to emerging third-party risks.
A tool like UpGuard Vendor Risk could support the dynamic component of a war room trigger definition with a real-time news feed of emerging risks in your third- and fourth-party networks.
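One way to picture the relationship between the static and dynamic components is a simple comparison of a current exposure score against a fixed risk appetite. The sketch below is an assumption-laden illustration: the 0-to-1 scales, the `VendorIncident` fields, and the "severity times criticality" scoring rule are hypothetical choices, not a formula from any framework or product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VendorIncident:
    vendor: str
    severity: float     # assumed scale: 0.0 (negligible) to 1.0 (critical)
    criticality: float  # assumed scale: how heavily critical systems rely on the vendor


def exposure_score(incidents: Iterable[VendorIncident]) -> float:
    """Dynamic component: the worst severity-weighted exposure currently observed."""
    return max((i.severity * i.criticality for i in incidents), default=0.0)


def should_activate_war_room(incidents: Iterable[VendorIncident],
                             risk_appetite: float) -> bool:
    """Static component: risk_appetite is the agreed threshold on the same 0-1 scale."""
    return exposure_score(incidents) > risk_appetite


# Hypothetical usage: a severe incident at a heavily relied-upon vendor
# exceeds an appetite of 0.6, while a minor incident elsewhere does not.
active = [
    VendorIncident("endpoint-security-vendor", severity=0.9, criticality=0.8),
    VendorIncident("marketing-analytics-vendor", severity=0.5, criticality=0.2),
]
activate = should_activate_war_room(active, risk_appetite=0.6)
```

Using the worst single exposure rather than an average keeps one severe incident from being diluted by many minor ones; an organization could reasonably choose a different aggregation.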
2. Don't become too dependent on automation
In a world where we're spoiled for choice in terms of process automation options, it's tempting to become complacent, allowing all knowledge of manual approaches to atrophy. The CrowdStrike incident, however, inverted years of IT progress, suddenly popularizing an old-school approach to incident response.
Because the faulty CrowdStrike update affected the core functioning of impacted systems, most automated remediation tasks were ineffective, necessitating a time-consuming, hands-on approach to purging millions of devices of the problematic update.
To ensure your IT personnel maintain sharp manual problem-solving instincts, consider reintroducing a regular rotation of hackathons. To enhance resilience to vendor ecosystem disruptions similar to the CrowdStrike incident, choose projects that will enhance the impact of your Third-Party Risk Management program. Here are some examples.
Incident Response Simulation
- Develop and implement comprehensive incident response playbooks that integrate automated response scripts and real-time system telemetry dashboards for large-scale IT outages.
Automated Remediation Tools
- Create sophisticated automation scripts or software agents that can detect, isolate, and remediate issues caused by faulty updates using machine learning models to predict and prevent similar incidents.
Enhanced Monitoring and Alerting Systems
- Design and deploy advanced monitoring solutions using AI-driven anomaly detection algorithms and real-time alerting mechanisms and integrate them into SIEM (Security Information and Event Management) systems.
Risk Assessment and Management Framework
- Build robust risk assessment tools leveraging big data analytics and continuous monitoring capabilities to evaluate and visualize third-party vendor risks dynamically.
Disaster Recovery Plan Development
- Develop detailed disaster recovery frameworks incorporating automated failover systems, continuous data replication techniques, and orchestration tools for seamless recovery processes.
Security Testing Automation
- Create and integrate CI/CD pipeline security testing tools that automatically perform static and dynamic code analysis, vulnerability scanning, and penetration testing before deploying updates.
Multi-Cloud Resilience Strategy
- Develop and implement workload distribution and failover strategies across multiple cloud providers using container orchestration platforms like Kubernetes and multi-cloud management tools.
Real-Time Incident Communication Platform
- Build and deploy a real-time communication platform with incident tracking, automated notification systems, and integrated collaboration tools for efficient incident management and coordination.
For inspiration on designing an integrated collaboration project, watch this video to learn how UpGuard streamlines vendor collaborations within its platform.
3. Map your end-to-end dependency chains for critical systems
One of the most vital lessons from the CrowdStrike incident is the importance of understanding your end-to-end dependency chains for critical systems. Such awareness will help risk management teams predict the likely impact of external disruptions and the effort required to reinstate regular operation.
Your dependency map should identify all interconnected components and services your critical systems rely upon to function correctly. This effort involves several steps:
- Step 1 - Inventory your IT assets: Catalog all hardware, software, and network components of which your critical systems are comprised.
- Step 2 - Identify Interdependencies: Understand how all critical system components interact with each other. This effort should continue along the dependency chain to your vendor ecosystem, noting external dependencies on third-party services and Managed Service Providers.
- Step 3 - Document Processes and Workflows: Produce detailed documentation of all the processes and workflows dependent on these systems. This effort will make it easier to visualize the impact of a failure at any point in the dependency chain.
- Step 4 - Assess Criticality: Evaluate the criticality of each component and dependency. Identify which elements are essential for operations and which have redundancies or failover options.
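The four steps above can be sketched as a small directed graph that answers the question "if this component fails, what else goes down?" The `DependencyMap` class, its method names, and the component names in the usage lines are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Set


class DependencyMap:
    """Directed dependency graph: an edge means 'system depends on component'."""

    def __init__(self) -> None:
        # component -> set of systems that depend on it directly
        self._dependents: DefaultDict[str, Set[str]] = defaultdict(set)
        self._critical: Set[str] = set()

    def add_dependency(self, system: str, depends_on: str) -> None:
        self._dependents[depends_on].add(system)

    def mark_critical(self, system: str) -> None:
        self._critical.add(system)

    def impacted_by(self, failed: str) -> Set[str]:
        """Everything that depends on `failed`, directly or transitively."""
        seen: Set[str] = set()
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries in the defaultdict
            for dependent in self._dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        seen.discard(failed)  # guard against cycles leading back to the start
        return seen

    def critical_impact(self, failed: str) -> Set[str]:
        """Only the impacted systems that were marked essential for operations."""
        return self.impacted_by(failed) & self._critical


# Hypothetical usage: a third-party endpoint agent sits under an app server
# that two internal services rely on.
deps = DependencyMap()
deps.add_dependency("app-server", depends_on="endpoint-agent")
deps.add_dependency("payments-api", depends_on="app-server")
deps.add_dependency("reporting", depends_on="app-server")
deps.mark_critical("payments-api")
at_risk = deps.critical_impact("endpoint-agent")
```

Storing edges from each component to its dependents, rather than the other way around, makes the impact query a plain breadth-first traversal.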
The UpGuard platform can help you quickly identify all assets comprising your external attack surface and their respective security risks, offering insights into potential external points of disruption in your dependency chain. Watch this video for an overview of UpGuard's Attack Surface Management features.
4. Implement Robust Software Change Management Processes
The CrowdStrike event highlights the importance of an effective software change management policy. Here are some key elements to ensure future faulty security updates don't compromise system stability:
- Comprehensive testing: Always test new updates thoroughly in a controlled environment, including unit and integration testing. Once testing confirms the update causes no disruptions, roll it out gradually to progressively larger IT environments, monitoring for disruptions at each expansion. Consider starting the roll-out on IT systems that have been predetermined not to require immediate update installation.
- Approval workflow: Establish a precise approval process for changes. Involve multiple stakeholders, such as IT, cybersecurity, and business units.
- Documentation: Document all new updates and IT ecosystem changes in relation to new releases. This will leave a helpful audit trail for recovery efforts.
- Backup and rollback plans: Before applying updates, always have a rollback plan in place, ready to be instantly activated should an incident occur.
- Change windows: Schedule updates during predefined change windows outside of peak business hours to minimize the impact of a potential disruption. Advise stakeholders in advance of any upcoming updates that could have a major impact.
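The gradual roll-out and rollback elements above can be combined into a ring-based deployment loop. The sketch below assumes three caller-supplied callbacks (`apply_update`, `is_healthy`, `rollback`) and a `RolloutResult` type, all hypothetical; it also assumes rollback can still be run on an unhealthy host, which was not the case for many machines affected by the CrowdStrike update.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    success: bool
    updated: List[str] = field(default_factory=list)      # hosts still on the update
    rolled_back: List[str] = field(default_factory=list)  # hosts reverted after failure
    failed_ring: Optional[int] = None                     # index of the ring that failed


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> RolloutResult:
    """Update one ring at a time; stop and revert everything on the first failure."""
    updated: List[str] = []
    for index, ring in enumerate(rings):
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Check the whole ring before expanding to the next, larger ring.
        if not all(is_healthy(host) for host in ring):
            rolled_back = list(reversed(updated))  # revert newest first
            for host in rolled_back:
                rollback(host)
            return RolloutResult(False, [], rolled_back, index)
    return RolloutResult(True, updated)
```

Ordering the rings from the least critical systems to the most critical ones means a faulty update is caught where it does the least harm.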
5. Consider Multi-Cloud Strategies
A multi-cloud strategy could significantly reduce the risk concentration of relying on a single Cloud Service Provider (CSP). This approach involves strategically distributing workloads across multiple CSPs, thereby reducing the chances of major operational disruptions due to a single CSP failing.
Some examples of Multi-Cloud Strategies include:
- Strategic workload distribution: The distribution of critical system workloads across multiple CSPs such that a greater weight of critical applications is assigned to CSPs with the least likelihood of failure.
- Redundancy and diversification: This is a more general approach to workload distribution with an emphasis on diversification so that the potential of total system outage due to a single CSP failure is greatly reduced.
- Failover mechanisms: Failover mechanisms automatically reroute traffic to an alternate CSP when a CSP fails. The effectiveness of this approach is contingent on traffic being diverted seamlessly, without any discernible effect on service availability. Tools such as Kubernetes or multi-cloud management platforms can monitor the health of services across different CSPs and initiate failovers without manual intervention.
- Performance optimization: Continuously monitor the performance of applications across different CSPs, utilizing load balancing to ensure optimal resource management.
- Cost management: Implement FinOps practices to manage and optimize costs associated with multi-cloud deployments. Use cost management tools to monitor spending across different CSPs and make informed decisions about resource allocation efficiency.
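Two of the strategies above, strategic workload distribution and failover, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the failure likelihoods are estimates you would have to supply yourself, the inverse-likelihood weighting is one possible rule among many, and the function names and CSP labels are hypothetical rather than part of any cloud provider's API.

```python
from typing import Dict, Mapping, Optional, Sequence


def distribution_weights(failure_likelihood: Mapping[str, float]) -> Dict[str, float]:
    """Share of critical workload per CSP, inversely proportional to the
    estimated likelihood of failure. Weights sum to 1."""
    if not failure_likelihood:
        raise ValueError("at least one CSP is required")
    if any(p <= 0 for p in failure_likelihood.values()):
        raise ValueError("failure likelihoods must be positive")
    inverse = {csp: 1.0 / p for csp, p in failure_likelihood.items()}
    total = sum(inverse.values())
    return {csp: value / total for csp, value in inverse.items()}


def pick_active_csp(preference_order: Sequence[str],
                    health: Mapping[str, bool]) -> Optional[str]:
    """Failover: return the first preferred CSP currently reporting healthy,
    or None if every CSP is down. Unknown CSPs are treated as unhealthy."""
    for csp in preference_order:
        if health.get(csp, False):
            return csp
    return None


# Hypothetical usage: csp-a is estimated to fail a third as often as csp-b,
# so it receives three times the share of critical workload.
weights = distribution_weights({"csp-a": 0.01, "csp-b": 0.03})
active = pick_active_csp(["csp-a", "csp-b"], {"csp-a": False, "csp-b": True})
```

In practice, the health map would be fed by the monitoring tools mentioned above rather than set by hand.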
Watch this video to learn how UpGuard helps financial institutions mitigate third-party cyber risk and maintain regulatory compliance.