Systemic Failures in High-Performance Trading


by Barry Thompson

Mission-critical, high-performance trading infrastructures remain extremely vulnerable to constituent failures, while burgeoning complexity and increasing system co-dependencies exacerbate risk and time-to-resolution. Today’s competitive environment leaves little room for operational mishaps, yet the foundation of the trading business – technology – is more susceptible than ever to failure.  This blog entry examines key technology hazards, and a subsequent entry explains how to proactively mitigate danger without negatively impacting the performance equation.

 
The Sky is Falling

Recently, a major financial services institution specializing in mutual funds suffered a 48-hour outage in a critical business unit, resulting in over $20 million in losses and numerous litigation issues.  The cause was the inability of algorithmic trading servers to effectively process market data, which forced the data producers to continually rebroadcast information and ultimately brought the underlying network down.

Another leading international financial firm recently saw its large New York City trading floor experience numerous outages that resulted in a suspended order flow of $90 billion and losses of $3 million per hour.   Data volumes wreaking havoc in the middle office and back office triggered network and system outages that brought down the front office.

The trading world continues to evolve as firms adjust their trading strategies to profitably exploit market opportunities before their competitors.  But neither of these variables – trading strategies nor market opportunities – is discrete, autonomous or effectively measurable in real-time.  Instead, they are part of an ecosystem of complex systems with sophisticated inter- and intra-dependencies.    In addition to diverse algorithmic trading strategies, firms must also manage trade structure, order routing and execution costs across diverse asset classes, markets and liquidity venues.  All of this will run on diverse IT infrastructures with diverse Service Level Agreements (if any) and have disparate controls and authority.  It’s critical in high-performance trading environments that all the components in the value chain are understood and their inherent interdependencies and risks are quantified.

The path to interdependent systems

Trading infrastructures will continue to evolve as long as algorithmic and technological advances keep yielding positive financial benefits.  Challenging economic and market conditions inherently present numerous and diverse execution opportunities.  Alternative execution venues, direct market access (DMA), algorithmic containers, electronic communication networks (ECNs), smart order routers, exchange co-location, multi-core processing, distributed caches, low-latency networks  and messaging systems are just some of the ingredients that must be combined into the optimal trading recipe.

Building a high-performance trading infrastructure from scratch is just not a practical option for several reasons.  Specific expertise is rare. There are too many niche components. Integration schedules won’t match the market opportunity window. And, finally, the operational risk of a monolithic system can knock a firm out of the market, which is far too common with many legacy systems today. Hedge funds, market makers, new liquidity venues, algorithmic trading houses and high-frequency trading firms are but a sampling of organizations either leading the charge into high-performance trading or being dragged into it.  They understand that the status quo won’t do.

The overriding factor and the foundational driver for competitive advantage with interdependent systems is effective and efficient automation of all critical components of the trade lifecycle, including market data processing and distribution, risk management and order routing.  Historically, a good deal of emphasis has been on straight-through processing (STP) to automate and integrate information flow within a firm, but contemporary demands emphasize other critical path trade routes.  For example, market data distribution has been a critical challenge for firms, especially as market data volumes continue to grow.  In its latest year-end report on market data capacity, the Financial Information Forum (FIF) reported significant volume growth across all feeds and market centers.  The implication for firms is that any processing challenges in this area will only be exacerbated down the line.  Other parts of the trading ecosystem face similar challenges, which requires a look at the overall risk.  Our goal is to keep our overall risk profile within an acceptable range.

The risk of complexity

With so many disparate systems in the trade lifecycle, the risk equation changes greatly.  If there is one system, it can be readily modeled for risk.  If there are two systems, symbiotic models can be produced to sufficiently define the risk profile.  With more than two systems, the level of complexity becomes increasingly difficult to model.  If there are three systems, for example A, B and C, then there are 10 risk factors: A, B, C, AB, BC, AC, ABC, AB on C, BC on A, AC on B.  That is, the three individual systems, the three pairs, the full combination, and each pair acting on the remaining system.  Beyond that the model becomes significantly more complex, as does the risk profile.  What elements are part of the modern trading equation?  Networks (switches, routers, WAN links), messaging systems (producers, consumers, middleware, queues, servers), OMS, and EMS are some of the numerous elements of this equation.  The assertion is that merely adding a single system to the trading ecosystem increases the risk factor not by the autonomous risk profile of that system, but potentially by orders of magnitude as that system interacts with other systems.
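The count for three systems can be reproduced with a short Python sketch.  The rule for extending it beyond three systems is an assumption made here for illustration (every non-empty group of systems, plus every group of two or more acting on a single system outside the group); the entry itself only enumerates the three-system case, and the function name risk_factors is hypothetical.

```python
from itertools import combinations


def risk_factors(systems):
    """List risk factors under an assumed generalization of the A, B, C example."""
    factors = []
    # Standalone and joint risks: every non-empty group of systems.
    for size in range(1, len(systems) + 1):
        for group in combinations(systems, size):
            factors.append("".join(group))
    # Directed risks (assumed rule): a group of two or more systems
    # acting on one system that is not part of the group.
    for size in range(2, len(systems)):
        for group in combinations(systems, size):
            for target in systems:
                if target not in group:
                    factors.append(f"{''.join(group)} on {target}")
    return factors


three = risk_factors(["A", "B", "C"])
four = risk_factors(["A", "B", "C", "D"])
print(len(three), three)
print(len(four))
```

Under this assumed rule, three systems give the 10 factors listed above, while a fourth system takes the count to 31, which illustrates how quickly the model grows with each added system.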

A simple example is warranted - we’ll call it “The Slow Consumer Problem.” Assume market data is being distributed to 30 systems using conventional message-oriented middleware over an Ethernet network.  Efficiencies have been built by using multicast technology so that a single message can be delivered to all systems simultaneously.  When one system falls behind in processing the market data (the slow consumer), it requests that the producer retransmit information.  This, in turn, slows the producer down and forces all the other consumers to process the retransmitted information (which they will likely discard).  As a result, all systems – producers and consumers – lose valuable processing cycles and the network bandwidth will begin to erode. We have research that shows the slow consumer issue is not an isolated, one-time event.  Excessive requests may ultimately and dangerously impact the producer of information.  This cascades down to all the other components in the trading ecosystem, ultimately resulting in such conditions as price slippage and the inability to trade profitably and effectively.  Many trading environments are that fragile.
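The feedback loop in the slow consumer problem can be illustrated with a toy Python model.  Everything in it is an assumption made for illustration: time advances in discrete ticks, the producer multicasts a fixed number of new messages per tick, the slow consumer re-requests whatever it could not process, and the producer rebroadcasts all of it to every consumer on the next tick.  It is not a model of any particular middleware product, and the function and parameter names are hypothetical.

```python
def simulate_slow_consumer(ticks, rate, slow_capacity, consumers=30):
    """Toy model of multicast retransmission caused by one slow consumer.

    ticks          number of time steps to simulate
    rate           new messages the producer multicasts per tick
    slow_capacity  messages the slow consumer can process per tick
    consumers      total consumers on the multicast group (one is slow)
    """
    backlog = 0          # messages the slow consumer missed and re-requests
    sent_new = 0         # original messages multicast by the producer
    sent_retx = 0        # retransmitted messages multicast by the producer
    for _ in range(ticks):
        retx = backlog                  # producer rebroadcasts what was requested
        offered = rate + retx           # load arriving at every consumer this tick
        sent_new += rate
        sent_retx += retx
        # Whatever exceeds the slow consumer's capacity is missed again.
        backlog = max(0, offered - slow_capacity)
    healthy = consumers - 1
    return {
        "new": sent_new,
        "retransmitted": sent_retx,
        # Healthy consumers already have these messages and throw them away.
        "discarded_by_healthy": sent_retx * healthy,
        "final_backlog": backlog,
    }


result = simulate_slow_consumer(ticks=5, rate=100, slow_capacity=90)
print(result)
```

With these illustrative numbers, a consumer that is only 10 percent too slow causes 100 retransmissions on top of 500 original messages within five ticks, and the 29 healthy consumers discard 2,900 duplicate deliveries between them.  The backlog also grows every tick, because each rebroadcast adds to the load the slow consumer was already failing to absorb.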

Given the less-than-desirable operational profile, there is an effective way to model the risk but there are also distinct challenges.  Chaos theory is the most appropriate model.  IT media company TechTarget provides an excellent definition:

In a scientific context, the word chaos has a slightly different meaning than it does in its general usage as a state of confusion, lacking any order. Chaos, with reference to chaos theory, refers to an apparent lack of order in a system that nevertheless obeys particular laws or rules; this understanding of chaos is synonymous with dynamical instability, a condition discovered by the physicist Henri Poincare in the early 20th century that refers to an inherent lack of predictability in some physical systems…The two main components of chaos theory are the ideas that systems - no matter how complex they may be - rely upon an underlying order, and that very simple or small systems and events can cause very complex behaviors or events.

In his book, Chaos Theory in the Financial Markets, Dimitris N. Chorafas analyzes in great depth the role of nonlinear systems, volatility, risk and cumulative exposure, as well as cognitive models for financial operations.  Dr. Chorafas states:


An overriding need in any business is the ability to represent problem information in such a way that the full complexity and dynamic nature of the underlying structures is captured.  Financial systems are no exception … As the information and the tools needed to solve prediction problems becomes more complex, it is increasingly more challenging to foresee and represent the evolving real-world situation.  From risk management to generation of profits, flexibility is the cornerstone of a successful prediction process.

 
Dr. Chorafas’ assertions address the external dynamics of financial markets.  The premise of this blog entry is that organizations must use the same scrutiny on their internal trading systems, understanding fully that a symbiotic relationship exists between these two environments.  Several challenges exist.  First, seldom do firms assign their quants to develop models for internal systems.  Next, the operational methods of most of these systems are not quantifiable.  Even the application of chaos theory to weather systems has greater measurable components. Last, even if these two challenges were overcome, it would be extremely difficult to bridge the operational – and political – compartments involved in the trading infrastructure.

Mitigation strategies

Due to the complexity of tackling risk at the systemic level, addressing component risk first is both a pragmatic, tactical move and an effective, long-term strategy.  Purists may argue that systems should be engineered from the ground up for effective operational risk mitigation, but this is nearly impossible in practice, especially given the continual changes in technology and infrastructure.  Furthermore, the risk probability curve rises precipitously as components are added, so factoring out hazards in individual components will have the inverse, beneficial effect.
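As a rough baseline for the effect of component count, the following Python sketch computes the chance that at least one component fails.  It assumes every component fails independently with the same probability, which is a textbook simplification and ignores the interaction effects discussed earlier in this entry; the probabilities are made-up illustrative values, and the function name p_any_failure is hypothetical.

```python
def p_any_failure(p_component, n_components):
    """Probability that at least one of n independent components fails.

    Assumes identical, independent failure probabilities (a simplification).
    """
    return 1.0 - (1.0 - p_component) ** n_components


# Illustrative values only: a 1 percent per-component failure probability.
for n in (1, 2, 10, 24):
    print(n, round(p_any_failure(0.01, n), 4))

# Halving the per-component hazard lowers the aggregate figure as well.
print(round(p_any_failure(0.005, 24), 4))
```

Even in this simplified form, the aggregate figure climbs as components are added and falls when the hazard of each individual component is reduced, which is the direction of the argument above.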

Is there an optimum number of systems for a trading infrastructure?  If they were all built the same way, then the answer would be an emphatic “yes.”  However, reality quickly sets in.  One international broker-dealer we work with was able to reduce the number of FIX engines on legacy middleware from 24 down to two (with redundancy) on a modern messaging platform.  Outages dropped by 87 percent while performance increased by over 300 percent.  In another example, one of our hedge fund clients detected network (and latency) deterioration after just five direct mesh connections on options processing servers.  They moved to a centralized communication system yielding predictable and consistent sub-100 microsecond latency.  In both of these examples, many factors were at play including data volumes, intersystem communication, excess messaging traffic, etc.

Before going through the steps, what types of risk are critical to remove from the internal trading infrastructure?  There are several.  The most dangerous is operational risk, the failure of a critical component in the information flow.  Next is performance risk, the executional degradation of a critical component in the order flow, which may include processing slowdowns and system malfunctions that produce errant behavior.  Finally, there is flexibility risk, the inability of systems to dynamically adapt to diverse market conditions.  All of these ultimately aggregate into risk implications for the overall system, like being behind the market or entirely out of the market.

The difficulty of fault detection and isolation also exacerbates these risk conditions.  The more complex the ecosystem, the more difficult the root cause analysis and return to execution.  Understand that a system that comes back up will likely be hit, for example, by the same market data deluge that sank it in the first place, much like the mutual fund company discussed at the beginning of this blog entry.
