Understanding Failure Points
A critical first step towards building true operational resilience in banking systems is to build an understanding of the systemic weaknesses in their existing IT landscape that could lead to operational paralysis.
Banks must meticulously identify and assess key failure points – including vulnerabilities in aging legacy systems and unsupported software, the complexity of their IT toolsets, flaws in change management, risks from third-party dependencies, inadequate governance over end-to-end services, core infrastructure limitations, and the challenges of cloud migration or upgrades. These weaknesses are compounded by external threats such as cyberattacks, including DDoS incidents that have previously disrupted banks like HSBC.
While not an exhaustive list, addressing these prevalent issues is a fundamental hygiene factor, enabling financial institutions to develop targeted solutions for delivering uninterrupted essential customer services despite internal or external failures.
Furthermore, banks should proactively “stress-test” their Important Business Services (IBS) by simulating real-world outage scenarios in production. This validation confirms the accuracy of their impact-tolerance assumptions. This proactive identification and remediation of weaknesses, rather than reactive firefighting after an incident, requires a deep and granular understanding of the inherent vulnerabilities within their existing infrastructure that could lead to significant failures of their critical services like payments, cash withdrawal.
Enhancing Failure Prevention
While no system can be entirely failure-proof, financial institutions can significantly reduce incident frequency through several key interventions:
Architectural modernisation: Simplify complex, interconnected systems by adopting microservices that reduce failure points and improve maintainability. Well-implemented cloud-native architectures provide superior scalability and resilience options.
AI-powered monitoring: Implement artificial intelligence to transform how potential issues are detected and addressed before impacting customers. Advanced anomaly detection systems can identify unusual performance patterns, while predictive maintenance flags at-risk components.
Enhanced change management: Strengthen processes with more rigorous testing protocols, automated tracking systems, and comprehensive rollback plans for when deployments don't proceed as expected.
Third-party risk management: Treat third-party dependencies as extensions of your own infrastructure by implementing strong contractual SLAs, regular audits, and vendor diversification for critical services.
Comprehensive testing: Conduct regular scenario-based testing that simulates severe but plausible disruptions to identify vulnerabilities before they're exposed in real-world incidents.
DDoS Defense: Implement robust, multi-layered defenses by deploying high-capacity protection, intelligent filtering, and WAFs. Proactively monitor networks and leverage threat intelligence for early detection and sustained availability.
Building Alternate Service Paths
Some of the leading financial institutions aren't just fixing outages; they're building proactive service paths for uninterrupted customer access.
Consider the inconvenience of a broken-down car. A dealership that offers a courtesy vehicle understands the customer's immediate need for mobility. Financial institutions must adopt a similar "customer-first" mindset, providing reliable alternatives that maintain essential functionalities such as accessing accounts, making payments, or checking balances – when primary systems fail. This isn't just about damage control; it's about anticipating customer needs during stressful situations and providing tangible solutions.
Much like how a car dealership provides a courtesy car when your vehicle is in for repairs, financial institutions must offer backup service channels that maintain essential functionality during system outages.
To ensure customers retain access to crucial financial services during outages, institutions can focus on developing the following alternate servicing paths:
Stand-in architectures: Develop independent backup systems focused exclusively on core banking functions. These architectures can operate on different cloud providers than primary systems, creating true infrastructure diversity. Monzo Bank has pioneered this approach with its Stand-in system that runs on Google Cloud Platform while their primary systems operate on AWS.
Artificial intelligence could further enhance these architectures by providing intelligent decision-making capabilities for triggering failover, dynamically scaling resources within the stand-in based on real-time needs, and optimizing the execution of core banking functions. What makes this approach particularly compelling is its cost-effectiveness – Monzo's Stand-in operates at approximately 1% of their primary platform's cost by focusing only on essential services.2
A Multi-Cloud Stand-in architecture supporting customer operations during an IT outage demonstrates that effective resilience solutions can be achieved without the cost and complexity of primary systems – they can be focused on essentials and still deliver critical customer value.
Intelligent Caching and Offline Functionality:
By intelligently caching user interface (UI) elements and critical data like account balances, beneficiary information, and recent transaction history, banks can build robust offline transaction capabilities to minimize customer disruption during IT outages. A proxy lightweight UI can be presented, allowing customers to raise requests for essential functions such as initiating payments to existing beneficiaries and setting up direct debit instructions, even when the primary infrastructure is unavailable. This offers customers greater autonomy over their finances during disruptions, as they can perform essential tasks without the repeated frustration of logging back into their mobile banking app or web browser.
Artificial intelligence offers the prospect of proactively optimizing this caching based on user behavior and intelligently managing the queueing and processing of offline requests.
These offline requests are securely queued and then processed seamlessly once the core systems are back online, ensuring that while real-time data and processing are temporarily limited, customers can still perform essential banking tasks and maintain uninterrupted access to key financial functions.
Intelligent caching of key data and UI elements allows banks to offer limited but crucial banking functions even during IT outages, minimizing customer disruption and avoiding login frustrations.
Exploring a Collaborative Industry Framework:
Drawing a parallel to card network Stand-In Processing (STIP) for card authorization continuity during issuer outages, a consortium of banks could establish a 'Basic Payments Stand-In Network (BPSN),' potentially with regulatory oversight and multi-cloud infrastructure for enhanced resilience.
This pre-agreed, minimalist infrastructure would enable customers of an impacted bank to conduct essential payment initiation and receipt for critical needs – such as basic account-to-account transfers, receiving salary credits, and facilitating essential bill payments –via a shared, standardized platform during severe IT disruptions. Similar to the card network approach, BPSN would prioritize core payment processing, employ pre-defined rules and limits, and necessitate post-outage reconciliation, offering a targeted solution for maintaining critical payment capabilities when a bank's primary systems fail.
Inspired by card network Stand-In Processing (STIP), banks could create a shared, Basic Payment Stand-in Network (BPSN). This would allow customers of a bank facing a major IT outage to still make essential card payments and limited transfers via a standardized platform.
Establishing a BPSN, a novel concept, presents significant implementation hurdles: technical standardization, governance gridlock, and regulatory challenges. The potential of this consortium lies in delivering vital resilience, enabling banks to mutually support each other and drastically reduce the threat of total outages. Despite these challenges, the industry's increasing focus on robust operational resilience and customer protection highlights the value of exploring such innovative solutions for payment continuity.
Conclusion:
UK and European banks are battling a rising tide of IT outages, demanding a bolder approach to resilience than current frameworks provide. The framework – understand failure points, enhance prevention, build alternate paths – offers a promising path forward. It's no longer enough to just try to prevent failures; banks must architect systems that ensure uninterrupted essential services when they happen.
Operational resilience is no longer just a matter for internal risk and compliance – it spans across all bank functions, payment schemes/network operators, technology providers, and beyond. Failure scenarios are no longer hypothetical ‘never happens’, and as we have seen from the Treasury Select report, they are very real, with far-ranging impacts.Failover to bank private cloud, alternative SaaS platforms, or traditional data centres, much like the analogy of the hire car, are becoming the new normal. – Richard Albery, European Head of Commercial for ACI Worldwide
By focusing on deep vulnerability analysis, robust prevention, and innovative backup solutions, the financial sector can move beyond fragile contingency plans to truly resilient operations. This proactive, customer-first strategy is the key to weathering inevitable digital storms and building lasting trust in an always-on world.
How can Cognizant help?
Cognizant partners with UK and European financial institutions to build true operational resilience against escalating IT outages. We leverage our expertise and partnership with Microsoft to address every layer of the Resilience framework. Strating with business and infrastructure assessments to understand critical failure points, we help clients enhance failure prevention by modernizing architectures with microservices and cloud-native solutions, implementing AI-powered monitoring, strengthening change management processes, bolstering third-party risk management, and establishing comprehensive testing strategies. Further, we architect and implement alternative service paths such as resilient multi-cloud stand-in environments, intelligent offline functionality, and robust API layers for potential collaborative frameworks. Our holistic approach ensures business continuity and strengthens customer trust by making readily available service alternatives a reality.
Cognizant and Microsoft bring together their industry and technology expertise to help payments and financial institutions build and run resilient systems. Access our latest insights on Payments here.