FIX, Electronic Trading and Disaster Recovery

An often overlooked part of Electronic Trading implementations is the world of high availability/disaster recovery/business continuity.

Why? Simply put, because it's hard and expensive. Let's look at this more closely...

From first principles, a sell-side firm anticipates making a profit from the order flow it receives through a FIX connection. As such, keeping that FIX session reliably available to its client (during normal trading hours) is a sensible move.

So, how can that be done?

Let's make some working assumptions:

the sell-side is running its market infrastructure in two locations - a production site and a disaster recovery/business continuity (DR/BCP) site
the sell-side establishes FIX connections from their production site to their buy-side client production site
the buy-side has a production site and a disaster recovery site
the logical connection between buy-side and sell-side is via a FIX network (such as Ullink, Fidessa, Autex, STN, FTN - many others are available)
the physical connection runs over a financial markets extranet (such as TNS, BT Radianz, Fixnetix - many others are available).

A simple model of this would be a four-way connectivity model -

sell-side production to buy-side production
sell-side DR/BCP to buy-side production
sell-side production to buy-side DR/BCP
sell-side DR/BCP to buy-side DR/BCP

This gives redundancy of trading infrastructure at both buy-side and sell-side.
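To make the redundancy concrete, here is a minimal sketch that enumerates the four legs and shows which of them remain usable when one or more sites are down. The site labels and the function names (all_legs, surviving_legs) are illustrative assumptions, not part of any FIX engine or vendor API.

```python
from itertools import product

# Site labels taken from the four-way model above.
SELL_SITES = ("sell-side production", "sell-side DR/BCP")
BUY_SITES = ("buy-side production", "buy-side DR/BCP")


def all_legs():
    """Return every sell-side/buy-side site pairing (the four-way model)."""
    return list(product(SELL_SITES, BUY_SITES))


def surviving_legs(failed_sites):
    """Return the legs whose two endpoints are both still running."""
    failed = set(failed_sites)
    return [(s, b) for s, b in all_legs() if s not in failed and b not in failed]


if __name__ == "__main__":
    for leg in surviving_legs(["sell-side production"]):
        print(" to ".join(leg))
```

Losing one site leaves two legs, and losing both production sites still leaves the DR/BCP to DR/BCP leg. The sketch only models site failure, not anything that sits between the sites.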

Right, so we are done?

No...

Let's start with what has been missed. Let's look for single points of failure.

The FIX network.

In various consulting engagements over the years I have seen every major application-level FIX network suffer failure, typically accompanied by poor communication of the expected time to recovery, the root-cause analysis and the remediation strategy. So, in order to avoid that issue, do we need a second FIX network, moving from the four-way model of connections listed above to an eight-way model?
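A rough way to see where the eight comes from is to treat the FIX network as a third dimension of each leg. The sketch below is illustrative only: "network A" and "network B" are placeholder names, and the helper functions are assumptions made for this example.

```python
from itertools import product

SELL_SITES = ("sell-side production", "sell-side DR/BCP")
BUY_SITES = ("buy-side production", "buy-side DR/BCP")


def legs(networks):
    """Every (sell-side site, buy-side site, FIX network) combination."""
    return list(product(SELL_SITES, BUY_SITES, networks))


def usable(all_legs, failed_network):
    """Legs that do not depend on the failed FIX network."""
    return [leg for leg in all_legs if leg[2] != failed_network]


if __name__ == "__main__":
    four_way = legs(("network A",))
    eight_way = legs(("network A", "network B"))
    # With one network, its failure removes every leg at once.
    print(len(four_way), len(usable(four_way, "network A")))
    # With two networks, four legs survive the loss of either one.
    print(len(eight_way), len(usable(eight_way, "network A")))
```

With a single network all four legs share one dependency, so site-level redundancy does not help when that network fails.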

Physical infrastructure provider

If all equipment is racked in industrial-strength data centres, then one may expect a high level of quality. But should a firm use two data centre providers, again to avoid a single point of failure? In this case, perhaps we can retain the four-way model and just use one provider for production and another for DR/BCP.

Physical connectivity providers

[This is a huge topic that merits a whole series of posts on its own, but for now we'll fit it in here]

If the physical connectivity in the four-way model is provided by one firm, then that's a single point of failure as well. Adding a second provider may help, but behind the scenes many of the physical connectivity providers buy bandwidth from the same companies that actually dig up roads and lay fibre. So, diversity of extranet provider may be harder to achieve than it looks on the surface. Once you start looking for diverse routes of long-haul fibre, you swiftly realise that this is a much bigger topic than it appears at first glance. Again, does this mean moving from four-way to eight-way?
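If a firm can obtain (or piece together) which underlying carriers each extranet provider relies on, checking for hidden overlap is a simple set intersection. The sketch below assumes that information is available as a mapping; the provider and carrier names are made up for illustration, and in practice getting this data is the hard part.

```python
from itertools import combinations


def shared_carriers(provider_carriers):
    """Map each pair of providers to the underlying carriers they share.

    provider_carriers: dict of provider name -> set of carrier names.
    Pairs with no carrier in common are omitted from the result.
    """
    overlaps = {}
    for first, second in combinations(sorted(provider_carriers), 2):
        common = provider_carriers[first] & provider_carriers[second]
        if common:
            overlaps[(first, second)] = common
    return overlaps


if __name__ == "__main__":
    # Hypothetical data: two extranets that both lease fibre from "Carrier 1".
    providers = {
        "Extranet X": {"Carrier 1", "Carrier 2"},
        "Extranet Y": {"Carrier 1", "Carrier 3"},
        "Extranet Z": {"Carrier 4"},
    }
    for pair, carriers in shared_carriers(providers).items():
        print(pair, "share", sorted(carriers))
```

An empty result only means no overlap at the level of detail recorded; two carriers can still share a duct or a building entry point.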

Application level issues

Simple example: a sell-side production trading infrastructure fails over to the DR/BCP site. The DR/BCP site needs to re-establish connectivity. In order to do this elegantly (and not via the problematic brute-force approach of re-sending everything), the sell-side DR/BCP site needs to know the last FIX sequence numbers that were processed inbound and outbound. Hence there is a fifth leg to add to the four-way model above:

sell-side production to buy-side production
sell-side DR/BCP to buy-side production
sell-side production to buy-side DR/BCP
sell-side DR/BCP to buy-side DR/BCP
sell-side production to sell-side DR/BCP
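The fifth leg can be sketched as production pushing its sequence-number state to the DR/BCP site after every message, so that after a failover the DR/BCP site can request only the messages it is missing. This is a minimal in-memory sketch under stated assumptions: the class and function names are hypothetical, replication is modelled as a direct method call rather than a real network link, and the ResendRequest is shown as a tag-to-value mapping (35 = MsgType, 7 = BeginSeqNo, 16 = EndSeqNo) rather than a wire-format message.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    """Sequence-number state for one FIX session (illustrative names)."""
    next_out: int = 1  # MsgSeqNum (tag 34) of the next message we will send
    next_in: int = 1   # MsgSeqNum we expect on the next message we receive


class DrReplica:
    """Stand-in for the DR/BCP site: keeps the last replicated state."""

    def __init__(self):
        self.state = SessionState()

    def apply(self, state):
        # Copy the values so later changes in production are not shared.
        self.state = SessionState(state.next_out, state.next_in)


class ProductionSession:
    """Production side: replicates state after each message it handles."""

    def __init__(self, replica):
        self.state = SessionState()
        self.replica = replica

    def on_sent(self):
        self.state.next_out += 1
        self.replica.apply(self.state)

    def on_received(self):
        self.state.next_in += 1
        self.replica.apply(self.state)


def resend_request_fields(expected_in, received_seq):
    """Build ResendRequest fields for the gap only, or None if there is no gap.

    A received number lower than expected is a separate error case that this
    sketch does not handle.
    """
    if received_seq <= expected_in:
        return None
    return {35: "2", 7: expected_in, 16: received_seq - 1}


if __name__ == "__main__":
    replica = DrReplica()
    prod = ProductionSession(replica)
    for _ in range(3):
        prod.on_sent()
    for _ in range(2):
        prod.on_received()
    # After failover, DR/BCP resumes from the replicated numbers.
    print(replica.state)
    # The counterparty's next message arrives with MsgSeqNum 6: ask for 3 to 5.
    print(resend_request_fields(replica.state.next_in, 6))
```

Replicating on every message keeps the DR/BCP copy current but puts the inter-site link on the critical path. Replicating asynchronously avoids that, at the cost of the DR/BCP site possibly being a few messages behind and needing a slightly larger resend.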

So, what does this mean?

Simply put - this issue boils down to a few key points:

Money
Time (and money) spent engineering
Time (and money) spent in testing the failover model for reliability (if your failover fails, you enter a world of pain, rather like BA did)
"War-Gaming" - brainstorming how failures can occur
Money spent on preventative maintenance
Hiring and retaining production support personnel who can improvise in the line of fire.

For ultimate reliability the $ cost is extremely high.

For no reliability the $ cost is low but there is reputational risk which has a different associated cost.

In summary, there is no simple way to offer real-world DR/BCP for FIX. You have to work out the financial impacts and constraints and work backwards from there to the set of valid solutions. Then pick what you think your firm can afford and what level of failure your clients will accept.