Tuesday, 7 April 2015

Sell-side FIX reference implementation (not low latency)

This post covers a reference implementation of a FIX ecosystem for a reasonably sized sell-side organisation. The target here is not a low latency environment. The trading ecosystem suited to this reference implementation is a hub and spoke model, where a sell-side in a region runs connectivity out from a central hub to the regional exchanges (or SEFs, ECNs, FX hubs etc.).
If you want co-location, FPGA based prime broker risk calculations and the like, this is not for you.
In part this follows on from earlier posts on cost cutting: Cost cutting proposal part I, part II and part III.
A key feature of this reference implementation is that the technology stack is thoroughly monitored, so that the status of each component is reported in real time to a management console. The level of interdependence in a FIX technology stack means that any component failure can have a material adverse effect on the business activity of the sell-side, so proper monitoring is essential. This is not merely a FIX engine monitor that checks that the FIX engine is running. It is a complete stack monitoring solution that checks:
  • FIX engine application status
  • Status of the hardware and OS that run the FIX engine
  • Network connectivity between FIX engine HA pairs
  • Network connectivity between FIX engine and outside world
  • Network connectivity between FIX engine and sell-side internal systems
  • Firewalls
  • Packet capture devices
  • Routers
  • Switches
  • Middleware hardware
As the saying goes - "If you cannot measure it, you cannot manage it". In-depth monitoring of the entire stack provides that measurement. In the past this function has been performed by horrible mash-ups of grep, awk, tail, open source web servers and spaghetti code. That is no longer viable for a modern complex stack with many moving parts and many technology vendors involved.
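
As a rough illustration of how per-component checks like those listed above might be rolled up for a management console, here is a minimal Python sketch. The layer names, component names and the three-level status scale are assumptions made for the example; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that a higher value means a worse state.
    OK = 0
    DEGRADED = 1
    DOWN = 2


@dataclass(frozen=True)
class ComponentCheck:
    layer: str      # hypothetical layer label, e.g. "fix-engine", "firewall"
    name: str       # hypothetical component identifier
    status: Status


def summarise(checks):
    """Return (worst status per layer, worst status overall)."""
    per_layer = {}
    for check in checks:
        current = per_layer.get(check.layer, Status.OK)
        per_layer[check.layer] = max(current, check.status)
    overall = max(per_layer.values(), default=Status.OK)
    return per_layer, overall


if __name__ == "__main__":
    sample = [
        ComponentCheck("fix-engine", "fixeng-a", Status.OK),
        ComponentCheck("fix-engine", "fixeng-b", Status.DOWN),
        ComponentCheck("firewall", "fw-1", Status.DEGRADED),
    ]
    layers, overall = summarise(sample)
    for layer, status in layers.items():
        print(layer, status.name)
    print("overall", overall.name)
```

Reporting the worst status per layer keeps the console view short while still showing which part of the stack needs attention first.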

Note that this full-stack monitoring is in addition to the specialised FIX engine monitoring provided by the systems listed in a previous post, "Production support software". There is also a requirement for "Testing and Certification Software", which we have previously covered here.

At the edge of the sell-side network will be a series of routers connected to extranets such as TNS, IPC, BT Radianz, Fixnetix and similar for buy-side clients that connect directly.  Over those connections will also be connections to FIX hubs such as ThomsonReuters Autex, Fidessa Express, Ullink MCS, LSE Hub and others.

The sell-side may also have a series of VPN connections over the internet. Ideally the trading infrastructure will not share the sell-side internet connection for browsing and other interactive services.  I have seen cases where internet browsing traffic has created a "denial-of-service" for trading traffic.  It's possible to implement traffic shaping to mitigate that problem but based on experience I would advise a distinct infrastructure.

This post does not cover router, switch and firewall topology in depth, so I will gloss over that here and perhaps cover it in a further post.

Next up are FIX engines.  A sensible design pattern is for a primary datacentre to run FIX engines in High Availability pairs. Depending on traffic volumes multiple pairs may be required.

The servers in an HA pair need to be able to communicate with each other very quickly to allow for graceful failover. All of the commercial FIX engines I have seen that implement HA use variations on a theme: heartbeats between the pair, with the survivor in a failover issuing a gratuitous Address Resolution Protocol (GARP) request. Some implementations also probe well known network addresses to check whether it is the network rather than the peer server that has failed.
A dedicated network connection between the HA pair is an elegant way to ensure that failover is fast.
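
The heartbeat and well-known-address checks described above come down to a small decision rule on the standby server. The Python sketch below shows one way to express it. The function name, the three actions and the timeout parameter are assumptions for illustration, and it does not reflect any specific vendor's implementation. Sending the gratuitous ARP itself is left out.

```python
from enum import Enum


class Action(Enum):
    STAY_STANDBY = "stay_standby"   # peer is still heartbeating
    TAKE_OVER = "take_over"         # peer looks dead, network looks fine
    HOLD = "hold"                   # we look isolated, so do not take over


def decide(now, last_peer_heartbeat, timeout, reference_reachable):
    """Decide what the standby should do.

    now, last_peer_heartbeat: timestamps in seconds from the same clock
        (for example time.monotonic()).
    timeout: seconds of silence before the peer is presumed failed.
    reference_reachable: True if a well known address (for example the
        default gateway) answered a probe.
    """
    if now - last_peer_heartbeat <= timeout:
        return Action.STAY_STANDBY
    if not reference_reachable:
        # Silence plus an unreachable reference address suggests our own
        # network path has failed rather than the peer server.
        return Action.HOLD
    return Action.TAKE_OVER


if __name__ == "__main__":
    print(decide(now=10.0, last_peer_heartbeat=9.8, timeout=0.5,
                 reference_reachable=True))
    print(decide(now=10.0, last_peer_heartbeat=9.0, timeout=0.5,
                 reference_reachable=True))
    print(decide(now=10.0, last_peer_heartbeat=9.0, timeout=0.5,
                 reference_reachable=False))
```

Holding when the reference address is unreachable is what guards against both servers believing they are primary after a network fault.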
Depending on your network security model the FIX engine may have to sit behind one firewall or sit between two in a DMZ. This has an impact on network design and topology and also latency.
Of course, there are many ways to implement HA, and in reality the implementation largely has to fit in with the existing corporate network standards. This can create a great deal of conflict when the front office is driving a particular design in order to meet latency goals and a network engineering team is trying to ensure that the choices made don't impact existing infrastructure or create a service level expectation that the existing team is not able to meet.
As an example, one way to spot a FIX engine failure is upstream from the FIX engine, using a device such as an F5 load balancer or an Inceptrum FPR1202. The upstream device can then instruct the hot-standby FIX engine to become primary and start processing message traffic entering and leaving the sell-side.
Rather than have the FIX engines create log files, either on storage or locally, a better way to record message traffic is to use packet capture. There are a lot of vendors and a lot of interest in this market sector - this Google search brings 217,000 hits.
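
To make the packet capture approach concrete, here is a minimal Python sketch of what happens after capture: turning a raw FIX tag=value message back into fields and validating its CheckSum (tag 10). It assumes the capture tooling has already reassembled the TCP stream into one complete message. The function names are illustrative and not part of any capture product's API.

```python
SOH = b"\x01"  # FIX field delimiter


def parse_fix(raw: bytes):
    """Split one FIX tag=value message into a list of (tag, value) pairs."""
    fields = []
    for part in raw.split(SOH):
        if not part:
            continue  # skip the empty piece after the trailing SOH
        tag, _, value = part.partition(b"=")
        fields.append((int(tag), value.decode("ascii")))
    return fields


def checksum_ok(raw: bytes) -> bool:
    """Validate tag 10: sum of all bytes before '10=', modulo 256."""
    idx = raw.rfind(SOH + b"10=")
    if idx < 0:
        return False
    body = raw[: idx + 1]               # includes the SOH preceding tag 10
    declared = raw[idx + 4 : idx + 7]   # the three checksum digits
    expected = f"{sum(body) % 256:03d}".encode("ascii")
    return declared == expected


if __name__ == "__main__":
    # Build a small heartbeat-style message with a correct checksum.
    body = b"8=FIX.4.2\x019=5\x0135=0\x01"
    message = body + b"10=" + f"{sum(body) % 256:03d}".encode() + SOH
    print(parse_fix(message))
    print(checksum_ok(message))
```

A check like this lets the captured traffic serve as the message record without relying on what the FIX engine itself chose to log.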
On the sell-side internal network side the FIX engines will of course connect to the sell-side trading systems. This will generally be using middleware which is a topic we have covered before here and here. The beauty of hardware based middleware is something I have covered before...
At this point the sell-side has a resilient, robust, highly available FIX ecosystem that can be managed, supported and monitored by a small team.  The high degree of sophistication in the implementation and design leads to: 

  • reduced operational total cost of ownership
  • reduced headcount needed for operational support
  • improvement in uptime
  • reduction in operational risk
  • scalability
  • reduction in cost of regression testing required when sell-side systems change
  • anti-fragility, by removing single points of failure and "working by accident" technology
Clearly this blog post is not a plan of action, but it sets out a reference implementation in broad brush strokes. If you would like more details please feel free to contact me via LinkedIn.
[I did not include clocks in this original post.  In answer to some feedback, I would use PTP and some decent kit with access to a GPS antenna.  I might write that up in more detail in future.]
Further reading:
FIX Testing and Certification: Production support software
FIX Testing and Certification: Implementation
What is a FIX engine?