Tuesday, 31 March 2015

Technical Debt

A term that has come into popular usage recently is "Technical Debt".  It's a new phrase for an old phenomenon.
Wikipedia has the following:
 
"metaphor referring to the eventual consequences of poor system design, software architecture or software development within a codebase. The debt can be thought of as work that needs to be done before a particular job can be considered complete or proper. If the debt is not repaid, then it will keep on accumulating interest, making it hard to implement changes later on. Unaddressed technical debt increases software entropy.
 
As a change is started on a codebase, there is often the need to make other coordinated changes at the same time in other parts of the codebase or documentation. The other required, but uncompleted changes, are considered debt that must be paid at some point in the future. Just like financial debt, these uncompleted changes incur interest on top of interest, making it cumbersome to build a project. Although the term is used in software development primarily, it can also be applied to other professions."
 
There are two cases of technical debt that I recall and which are worthy of further discussion.  As always on this blog there is a strict policy of no "naming and shaming" so these stories are anonymised.  Comments will be moderated and any comments that breach this policy will be deleted.
 
The (non-relational) relational database
A software vendor had created a buy-side software system: a server-side program, a client-side application and a proprietary database.  While that worked and won a considerable number of clients in the vendor's home market, sales overseas were slow.  One issue was that prospective clients conducting evaluations would see that the database was proprietary.  One way to view that is "fast, the developers who built it are in-house, no license fees payable to a big third-party vendor, the application vendor will never blame the database vendor and so on..."
 
However, the evaluation process often involved "consultants" who would decry this proprietary technology.
 
A small number of senior staff at the vendor then sat down and tried to figure out how to respond.  The issue was that the proprietary system was basically an optimised b-tree structure that took data from large storage arrays in memory and on disk...
 
But the prospective clients wanted relational! Codd! Stored procedures! Indexes! Primary Key! Foreign Key! And all of that jazz...
 
A wise elder statesman of the group said: "No, the prospective clients' RFIs and RFPs simply ask 'Which type of relational database do you use?'  So why don't we pick one of MSSQL, Sybase, Oracle, DB2, Informix, Ingres (etc) and then simply move our storage arrays from the proprietary format into the table format used by the RDBMS?  Just give the clients what they ask for..."
 
And so that's what happened: the array structures in the proprietary database were ported to an RDBMS.  No primary key/foreign key integrity.  No stored procedures.  No indexes.  No constraints.
 
Just give the clients what they asked for.
 
That approach worked for a number of years, until people who had implemented the system realised that the database was indeed relational, but the data in the database was not relational.
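To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module.  SQLite merely stands in for whichever RDBMS was chosen, and the table and column names (orders, executions, order_id, exec_id) are hypothetical and not taken from the system described.  It shows that a schema ported as bare columns will happily accept an execution that points at an order which does not exist, while the same schema with keys declared rejects it.

```python
import sqlite3


def orphan_insert_accepted(with_constraints: bool) -> bool:
    """Return True if a child row that references a missing parent is accepted."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces declared foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    if with_constraints:
        # Relational data: keys are declared, so the engine can police them.
        conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
        conn.execute(
            "CREATE TABLE executions ("
            " exec_id INTEGER PRIMARY KEY,"
            " order_id INTEGER NOT NULL REFERENCES orders(order_id))"
        )
    else:
        # Straight port of storage arrays: columns only, no keys, no constraints.
        conn.execute("CREATE TABLE orders (order_id INTEGER)")
        conn.execute("CREATE TABLE executions (exec_id INTEGER, order_id INTEGER)")
    try:
        # Order 999 was never inserted, so this execution is an orphan.
        conn.execute("INSERT INTO executions VALUES (1, 999)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
```

Both variants live in a "relational database"; only the second one holds relational data, because only there can the engine refuse rows that break the relationships.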
 
This is where I believe many buyers of technology go wrong - they pick a vendor based on functionality but do not enquire deeply enough about the technology used.  I have completed many RFIs/RFPs from a vendor perspective and I have never been asked enough questions such as:
 
What build tools do you use in development?
Which compilers do you use?
Which languages, which versions?
Can you build the entire system from one job?
Do you have daily builds?
[List truncated, I could go on (and on, and on) in painful detail]
 


The single threaded server (with many clients)
Another system I worked with was a fat-client/middle tier/RDBMS-backend "three-tier application".  The system was big and complicated and glued together with message oriented middleware.


One problem was that the middleware implementation was based around having a reliable LAN on which all clients would run, with all servers always able to communicate with each other at all times.


So, take that to a globally distributed WAN and "I'm sure it'll work"...


On a global WAN the MoM messages would time out after a short period and then be lost forever, so trade order executions could be lost between the server and a client.  A dealer might see that an order has 9,000 shares executed and phone a broker to ask "where is the other 1,000?".  The broker sees that 10,000 has been done and IT gets blamed for a screw-up.  The simple resolution for the dealer? Log out and log back in again, whereupon the server detects the new client and re-sends all of the executions, so the dealer then sees 10,000 done and the trade is good.
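The 9,000-versus-10,000 scenario can be sketched in a few lines of Python.  This is an illustration under assumptions, not the vendor's design: the Server and Client classes, the fill size of 1,000 shares and the choice of which message is lost are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Holds the authoritative list of executions (fills) for one order."""
    executions: list = field(default_factory=list)

    def execute(self, shares: int) -> int:
        self.executions.append(shares)
        return shares  # the message that would be published to clients

    def snapshot(self) -> list:
        return list(self.executions)


@dataclass
class Client:
    """Dealer's screen: only knows about the messages that actually arrived."""
    seen: list = field(default_factory=list)

    def on_message(self, shares: int) -> None:
        self.seen.append(shares)

    def login(self, server: Server) -> None:
        # On a fresh login the server replays everything, replacing the local view.
        self.seen = server.snapshot()

    def total(self) -> int:
        return sum(self.seen)


def run_order(lost_message_index: int) -> tuple:
    """Execute ten fills of 1,000 shares; one message times out on the WAN."""
    server, client = Server(), Client()
    for i in range(10):
        message = server.execute(1_000)
        if i != lost_message_index:  # the timed-out message is simply gone
            client.on_message(message)
    before = client.total()
    client.login(server)  # "log out and log back in again"
    return before, client.total(), sum(server.executions)
```

Calling run_order(3) gives the dealer 9,000 before the re-login and 10,000 after it, while the server held 10,000 all along.  The replay masks the loss but does nothing to prevent it.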


The firm to which I was consulting at the time had a very impressive CTO who grasped the issues and asked his team to work with the vendor.  The vendor was offered an open cheque book to resolve the issue by moving to a true transactional usage of the middleware (so that no messages could be lost between client and server or server and server).  After several meetings the middleware vendor was brought into the loop and revealed that the MoM in question was "End-Of-Life" and that they were activating an agreement to give the source code to the vendor that sold the fat-client/middle tier/RDBMS backend three tier application.


The vendor of said application soon found that they had inherited roughly 1,000,000 lines of C, C++, Perl and a few other languages...


The underlying problem was that the main component of the system, which maintained all state, was single threaded.  The singleton pattern was workable while the system workload stayed within the compute cycles available to a single thread on the server running the main component.  The system was therefore using MoM as a queue to smooth over periodic episodes of workload beyond what the single main-component process could handle in real time.  But when the over-capacity workload persisted beyond the time-out period for the messages, "zap" - they were deleted.  And there was no implementation of a dead letter queue.
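A rough simulation of that failure mode, under stated assumptions: a single consumer works through a queue on a logical clock, every message takes the same service time, and a message is expired if it has waited longer than the time-out when the consumer reaches it.  The function name drain and its parameters are hypothetical.  The optional dead_letters list shows what a dead letter queue would have preserved.

```python
def drain(messages, ttl, service_time, dead_letters=None):
    """Simulate one single-threaded consumer on a logical clock.

    messages: list of (arrival_time, payload) pairs, sorted by arrival_time.
    ttl: longest time a message may wait in the queue before it expires.
    service_time: time the consumer needs to process one message.
    dead_letters: optional list; expired payloads are appended here
                  instead of being silently discarded.
    Returns the list of payloads that were actually processed.
    """
    clock = 0
    processed = []
    for arrival, payload in messages:
        # The consumer is idle until the next message has arrived.
        clock = max(clock, arrival)
        if clock - arrival > ttl:
            # Message outlived its time-out while waiting in the queue.
            if dead_letters is not None:
                dead_letters.append(payload)
            continue
        clock += service_time
        processed.append(payload)
    return processed
```

With a burst of ten messages arriving at time 0, a service time of 1 and a time-out of 3, only the first four are processed.  Without a dead_letters list the other six vanish; with one, they are at least still available for replay or investigation.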


A further problem was that recent changes in chip architecture had brought down clock speeds and increased the number of cores, which takes a single-thread-bound process out of its comfort zone and into a world of pain...


So, at this point the vendor decided to offload as much code as possible from the main component and refactor to add more MoM between components on the server side of the application stack.  That meant more use of an unsupported (unsupportable?) MoM product to alleviate the symptoms of a core architectural pattern that was perfectly sensible when the system was designed but which was undone by two trends: the move to multiple cores with lower clock speeds, and the move in financial markets to vastly increased numbers of executions per order due to algorithmic/low latency/high frequency trading.


In conclusion: Technical Debt is a fashionable term, but often it's incurred for good reasons.  The really hard thing to do is repay the debt by taking the long hard road and walking it.  Too many folks think that rewriting an old application in a new language will remove the problems.  It doesn't.  It just moves them somewhere else, out of direct sight.  The way ahead is to re-design appropriately, take sensible guidance from business users and business-aligned technologists and then build in optionality in a cost-effective manner.
