The need for true resilience of payment systems

The world is becoming smaller, and everything we do or want to do is getting faster; in many ways, both are a direct result of our world becoming increasingly digital. Whether as consumers or as large multinational companies, we expect things to be instant and always available the moment we need them. In financial services, these expectations have become more than expectations: we are starting to demand immediate, always-on, always-available services. Payment systems, and by extension settlement services, must therefore deliver immediate, always-on, always-available capabilities, not at some point in the future, but now.

The challenge, though, is one of operational and cyber resilience for both payment systems and settlement services. Building something that is always available, always on, always working and delivering immediacy is hard and expensive. But that is the demand and expectation we all place on our banks, and by extension on the payment and settlement infrastructure our banks use. Regulators around the globe are starting to put additional pressure on banks and payment systems to meet these demands, and frankly, rightfully so.

On 23 October 2020, Target 2, the high-value payment system that serves Europe, went offline. This caused no end of transactional pain and frustration, and no doubt came at a cost for many businesses and some individuals. The pain was felt not just on domestic European transactions, but also on international transactions coming into Europe. The outage was caused by a “third-party network device” in the internal network of the Eurosystem of national central banks. Interestingly, the failure didn’t lead to a failover, with Target 2 “backup” systems taking over; essentially, the resiliency plans didn’t work. The impact was still being felt by some SEPA transactions three days later.

Target 2 isn’t the only high-profile failure in recent years. Spend some time online and you can find plenty of instances of bank IT outages, central bank outages and payment system outages. In 2014, the Bank of England’s core RTGS system suffered a nine-hour service outage; this time the root cause was defects introduced by changes made to the RTGS system in April and May of the previous year, and the failover into MIRS (the contingency solution) was not undertaken. On New Year’s Eve the year before, the UK clearing house suffered a major IT outage, with the nature of the problem creating obstacles to reverting to contingency arrangements. What is consistent across these outages, and the important point to note, is that contingency solutions either were not triggered or could not be triggered. This effectively renders them very expensive systems that sit idle, carry out zero workload and, at the moment of need, cannot actually be used. Resiliency therefore needs to be looked at in a different light from that of a traditional Disaster Recovery (DR) failover.

CPMI: Cyber resilience in financial market infrastructures

It seems an age ago, but in November 2014 (yes, six years before the Target 2 outage) the CPMI (Committee on Payments and Market Infrastructures) published a paper on cyber resilience in financial market infrastructures. The report goes into detail on “why cyber risks are special”, how to adopt an “integrated approach to cyber resilience” and, more interestingly, “sector-wide considerations”. One of the key takeaways in that last section is titled “Non-similar facility”, where the report identifies the resiliency failings of contingency solutions that share components with the primary solution. The thinking is that if a shared component fails in the primary, it will be impacted, or worse, fail entirely, in the contingency solution too. This needs to be examined at every layer of the payment system stack, from the underlying networking to the platforms used, the data storage and even how banks connect into the payment system.

A non-similar facility (NSF) seeks to replicate the core functionality provided by the primary service; it need not provide all the same capabilities, but the core must be covered. This means that in the case of a failure, service outage or cyber security event, banks could switch over to the NSF. The report goes on to state:

“An NSF could create a backup of an FMI’s data to facilitate resumption of operations after data corruption, with services running independently of the FMI’s primary system (and hence remaining uncorrupted). This may require an independent communication channel. One possibility could be for an FMI’s participants (or other holders of its data) to send their data directly and separately to two different facilities (e.g. the primary facility and the NSF)…”

A key element of an NSF is its ability to hold and store data independently of any data store the primary solution interacts with or replicates to. The CPMI report rightly identifies the challenge of data corruption, specifically when caused by a cyber event. As such, an NSF should have a source of data independent of the primary data store, utilise different data technologies and be housed on different infrastructure, ensuring that any corruption of the primary doesn’t impact the NSF.

In all the cases I have highlighted in this post, and in many you can find online, an NSF would have meant services were unavailable for only a very limited period. Interestingly, if the participant banks had a separate submission channel that was in constant use, you could argue that the downtime in all of these cases would have been measured in minutes, maybe even seconds.
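To make that dual-submission idea concrete, here is a minimal sketch (in Python) of what a participant-side routine might look like. It is illustrative only: the endpoint URLs and message shape are assumptions of mine, not details from the CPMI report or from any real payment system.

    # Illustrative sketch only: the endpoints and message shape are hypothetical.
    import json
    import urllib.request

    PRIMARY_URL = "https://primary.example-fmi.net/submit"  # hypothetical primary facility
    NSF_URL = "https://nsf.example-fmi.net/submit"          # hypothetical non-similar facility

    def submit_payment(message: dict) -> dict:
        """Send the same payment message directly and separately to both facilities.

        The NSF copy is not replicated from the primary; each facility receives
        the message over its own independent channel, so corruption of one
        store cannot propagate to the other.
        """
        payload = json.dumps(message).encode("utf-8")
        results = {}
        for name, url in (("primary", PRIMARY_URL), ("nsf", NSF_URL)):
            request = urllib.request.Request(
                url, data=payload, headers={"Content-Type": "application/json"}
            )
            try:
                with urllib.request.urlopen(request, timeout=5) as response:
                    results[name] = response.status
            except OSError as error:
                # A failure on one channel must never block the other.
                results[name] = f"unreachable: {error}"
        return results

The point is architectural rather than code-level: each facility holds its own independently sourced copy of every message, so the NSF is live and current the moment it is needed.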

How can payment systems deliver a non-similar facility?

This has historically proven tricky, largely down to how the payment system was built (a proprietary build) or to the limited communication channels used across the sector. For example, many high-value payment systems are built around SWIFT messages and the SWIFT network; many failover options also come from SWIFT and share the same underlying components, networking, data store technologies and, of course, the submission channel itself. In other geographies there may be a proprietary payment system build, but the failover is provided by the same provider. If you look at most faster payment solutions (the UK’s FPS included), the failover is a secondary site built by the same company with the same underlying shared storage components and technology, networking, business logic and application logic. So in none of these cases are you delivering an NSF.

In some discussions, payments experts tout CBDCs as that resiliency option. However, if we look at that argument, we see far more failings than potential for a real solution. The first point is that CBDCs aren’t fiat currency, so a payment system failing over to a CBDC infrastructure would be fraught with challenges and unanswered questions. Secondly, the investment needed in a CBDC infrastructure used solely for payment system failover makes it immediately prohibitive, and that is before we consider the costs that would be borne by participant banks. Before we even unpack the debate further, these two points alone mean the argument for CBDCs as a resiliency solution makes no sense. And when we add a third point, that NSF solutions already exist, the CBDC discussion becomes a non-starter.

RTGS.global ARK

One of the NSF solutions that central banks and payment systems can utilise is the RTGS.global ARK product. RTGS.global provides a core product to participants around the globe that allows them to source liquidity in central bank funds on demand and to make immediate payments 24*7*365, even when central bank systems are closed. Banks connect to the RTGS.global network through a highly available connector hosted in the Microsoft Azure cloud. RTGS.global implements a high-availability model in which services and data stores are available from three independent physical data centres (availability zones), separated by 10-40 km, within a specific geography for each jurisdiction. In other words, the RTGS.global system is distributed across the globe, but data residency and network connectivity are local: banks in the USA have their connectivity and data residency in Azure facilities within the USA, while UK banks connect to Azure UK facilities across the UK.
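As a very rough illustration of that local-residency model: the mapping below is hypothetical, the region names are merely examples of Azure regions that offer availability zones, and nothing here reflects RTGS.global’s actual deployment.

    # Illustrative only: a hypothetical jurisdiction-to-region map. The point is
    # that the network is global, but each jurisdiction's connectivity and data
    # stay in-region, spread across three availability zones.
    JURISDICTION_DEPLOYMENTS = {
        "US": {"azure_region": "eastus", "availability_zones": ("1", "2", "3")},
        "UK": {"azure_region": "uksouth", "availability_zones": ("1", "2", "3")},
    }

    def residency_for(bank_jurisdiction: str) -> dict:
        """Resolve where a bank's connectivity and data are hosted."""
        return JURISDICTION_DEPLOYMENTS[bank_jurisdiction]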

Some see an ARK as a very large lifeboat; I see it as being prepared for anything the storm can throw at you.

With ARK, payment systems and central banks get a true NSF, one with an independent communication channel for participants that is already being utilised, already on and already available. ARK provides non-similar capabilities across the entire stack, including:

  • Networking
  • Data centres and sites
  • Platform OS
  • Data storage (immutable)
  • Application layer logic
  • Business logic
  • Participant communication channel

ARK may not provide all the capabilities the primary system delivers; however, the core capabilities are covered. There are no shared components whatsoever with a central bank’s core systems or a domestic payment system. It is worth stressing that this includes zero usage of, or dependency on, SWIFT messaging or the SWIFT network.

ARK can be used as a cold standby or as a hot/active standby, in which case it already contains the transactions that have taken place that day. There are also two different methods of populating transactions into ARK: one takes them directly from the communications received by the central bank or payment system; the other takes a replica of each transaction from the participant bank itself. This provides real implementation flexibility. It is also worth pointing out that the data stored in ARK is not the result of replication from the primary; rather, it is a store of the messages the primary solution took in. The ARK data store is also immutable and therefore protected against any data corruption the primary may suffer.
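To illustrate the difference between replicating a primary’s database and storing the inbound messages themselves, here is a minimal append-only log sketch. The class and its hash chaining are my own illustration, assuming nothing about how ARK actually implements immutability.

    # Minimal sketch of an append-only message log; hypothetical, not ARK's
    # actual design. Each record stores the raw inbound message plus a hash
    # chained to the previous record, so any later mutation is detectable.
    import hashlib
    import time

    class AppendOnlyMessageLog:
        def __init__(self):
            self._records = []

        def append(self, raw_message: str) -> str:
            """Store an inbound message as received; never update in place."""
            prev_hash = self._records[-1]["hash"] if self._records else "genesis"
            record = {
                "received_at": time.time(),
                "message": raw_message,
                "prev_hash": prev_hash,
            }
            record["hash"] = hashlib.sha256(
                (prev_hash + raw_message).encode("utf-8")
            ).hexdigest()
            self._records.append(record)
            return record["hash"]

        def verify(self) -> bool:
            """Recompute the hash chain; mutation of any stored record breaks it."""
            prev_hash = "genesis"
            for rec in self._records:
                expected = hashlib.sha256(
                    (prev_hash + rec["message"]).encode("utf-8")
                ).hexdigest()
                if rec["hash"] != expected or rec["prev_hash"] != prev_hash:
                    return False
                prev_hash = rec["hash"]
            return True

Because the log only ever grows, a corruption event on the primary cannot be replicated into it; at worst, the log simply stops receiving new messages.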

The real beauty of ARK, though, is that participants get an independent communication channel, one that is already being utilised for other communications. This alone means banks are using their connectors daily; the technology and the investment they have made are not sitting there gathering dust waiting for a failover, but delivering value back to the banks every single day. The connector also gives banks confidence that, in the scenario where ARK is needed, their systems are ready and able to utilise it: there is no failover as such, rather a re-routing of some payment traffic to a different connector.
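Conceptually, that makes the contingency path a routing decision rather than a disaster-recovery event. A toy sketch, with hypothetical connector names and health check:

    # Toy sketch; the connector names and the health-check flag are
    # hypothetical, not RTGS.global's actual mechanism.
    def route_payment(message_id: str, primary_is_healthy: bool) -> str:
        """Send traffic down the primary rail while it is healthy; otherwise
        re-route the same traffic to the ARK connector. Because that connector
        is already authenticated and exchanging other traffic daily, this is
        a routing decision, not a cold-start failover."""
        target = "primary-connector" if primary_is_healthy else "ark-connector"
        return f"{message_id} routed via {target}"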

The arrival of RTGS.global ARK means payment systems and central banks now have a true NSF solution they can utilise, one that meets the recommendations of the CPMI and one that delivers value back to participants by providing connectivity into the wider RTGS.global network. In the same way that the core RTGS.global solution is built for today but ready for the future, so too is ARK: it supports not just fiat transactions but, on the same ledger, other assets too, yes, including CBDCs. Oh, and it can be deployed in minutes.

For more information on ARK, contact RTGS.global directly.
