Systemic Failures in High-Performance Trading
by Barry Thompson
Mission-critical, high-performance trading infrastructures remain
extremely vulnerable to constituent failures, while burgeoning
complexity and increasing system co-dependencies exacerbate risk and
time-to-resolution. Today’s competitive environment leaves little room
for operational mishaps, yet the foundation of the trading business –
technology – is more susceptible than ever to failure. This blog entry
examines key technology hazards and a subsequent entry explains how to
proactively mitigate danger without negatively impacting the
performance equation.
The Sky is Falling
Recently, a major financial services institution specializing in mutual
funds suffered a 48-hour outage in a critical business unit resulting
in over $20 million in losses and numerous litigation issues. The
cause was the inability of algorithmic trading servers to effectively
process market data, forcing the data producers to continually
rebroadcast information and bring the underlying network down.
Another leading international financial firm recently saw its large New
York City trading floor experience numerous outages that resulted in a
suspended order flow of $90 billion and losses of $3 million per
hour. Data volumes wreaking havoc in the middle office and back
office triggered network and system outages that brought down the front
office.
The trading world continues to evolve as firms adjust their trading
strategies to profitably exploit market opportunities before their
competitors. But neither of these variables – trading strategies and
market opportunities – is discrete, autonomous or effectively
measurable in real-time. Instead, they are part of an ecosystem of
complex systems with sophisticated inter- and intra-dependencies. In
addition to diverse algorithmic trading strategies, firms must also
manage trade structure, order routing and execution costs across
diverse asset classes, markets and liquidity venues. All of this runs
on diverse IT infrastructures, each with its own Service Level Agreement
(if any) and under disparate controls and authority. It’s critical in
high-performance trading environments that all the components in the
value chain are understood and their inherent interdependencies and
risks are quantified.
The path to interdependent systems
Trading infrastructures will continue to evolve as long as algorithmic
and technological advances keep yielding positive financial benefits.
Challenging economic and market conditions inherently present numerous
and diverse execution opportunities. Alternative execution venues,
direct market access (DMA), algorithmic containers, electronic
communication networks (ECNs), smart order routers, exchange
co-location, multi-core processing, distributed caches, low-latency
networks and messaging systems are just some of the ingredients that
must be combined into the optimal trading recipe.
Building a high-performance trading infrastructure from scratch is just
not a practical option for several reasons. Specific expertise is
rare. There are too many niche components. Integration schedules won’t
match the market opportunity window. And, finally, the operational risk
of a monolithic system can knock a firm out of the market, which is far
too common with many legacy systems today. Hedge funds, market makers,
new liquidity venues, algorithmic trading houses and high-frequency
trading firms are but a sampling of organizations either leading the
charge into high-performance trading or being dragged into it. They
understand that the status quo won’t do.
The overriding factor, and the foundational driver for competitive
advantage with interdependent systems, is effective and efficient
automation of all critical components of the trade lifecycle, including
market data processing and distribution, risk management and order
routing. Historically, a good deal of emphasis has been on
straight-through processing (STP) to automate and integrate information
flow within a firm, but contemporary demands emphasize other critical
path trade routes. For example, market data distribution has been a
critical challenge for firms, especially as market data volumes
continue to grow. In its latest year-end report on market data
capacity, the Financial Information Forum (FIF) reported significant
volume growth across all feeds and market centers. The implication
for firms is that any processing challenges in this area will only be
exacerbated down the line. Other parts of the trading ecosystem face
similar challenges, which calls for a look at the overall risk. Our
goal is to keep our overall risk profile within an acceptable range.
The risk of complexity
With so many disparate systems in the trade lifecycle, the risk
equation changes greatly. If there is one system, it can be readily
modeled for risk. If there are two systems, symbiotic models can be
produced to sufficiently define the risk profile. With more than two
systems, the complexity becomes increasingly difficult to model. With
three systems – A, B and C – there are 10 risk factors: A, B, C, AB,
BC, AC, ABC, AB on C, BC on A, AC on B. Beyond that the model becomes
significantly more complex, as does the risk profile. What elements
are part of the modern trading equation? Networks (switches, routers,
WAN links), messaging systems (producers, consumers, middleware,
queues, servers), order management systems (OMS) and execution
management systems (EMS) are just some of them. The assertion is that
adding a single system to the trading ecosystem increases risk not
merely by the autonomous risk profile of that system, but potentially
by orders of magnitude as that system interacts with other systems.
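The three-system count above can be reproduced with a short sketch. The
counting rule used here is an assumption made for illustration, not a
formal model: every non-empty group of systems is one risk factor, and
every group of two or more systems acting on a system outside the group
is another. The function name risk_factors is hypothetical.

```python
from itertools import combinations


def risk_factors(systems):
    """List risk factors under the assumed counting rule described above."""
    factors = []
    # Standalone and joint risks: every non-empty group of systems.
    for size in range(1, len(systems) + 1):
        for group in combinations(systems, size):
            factors.append("".join(group))
    # Directed effects: each group of two or more acting on an outside system.
    for size in range(2, len(systems)):
        for group in combinations(systems, size):
            for target in systems:
                if target not in group:
                    factors.append("".join(group) + " on " + target)
    return factors


if __name__ == "__main__":
    print(risk_factors(["A", "B", "C"]))
    # Show how quickly the count grows as systems are added.
    for n in range(1, 9):
        names = [chr(ord("A") + i) for i in range(n)]
        print(n, len(risk_factors(names)))
```

Under this rule, three systems yield the 10 factors listed above and a
fourth system raises the count to 31. A different counting rule would
give different numbers, but the growth would remain far steeper than
linear.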
A simple example is warranted – we’ll call it “The Slow Consumer
Problem.” Assume market data is being distributed to 30 systems using
conventional message-oriented middleware over an Ethernet network.
Multicast technology provides efficiency, since a single message can
be delivered to all systems simultaneously. When one system falls
behind in processing the market data, it (the slow consumer) requests
that the producer retransmit information. This, in turn, slows the
producer down and forces all the other consumers to process the
retransmitted information (which they will likely discard). As a
result, all systems – producers and consumers – lose valuable
processing cycles and available network bandwidth begins to
erode. We have research that shows the slow consumer issue is not an
isolated, one-time event. Excessive requests may ultimately and
dangerously impact the producer of information. This cascades down to
all the other components in the trading ecosystem, ultimately resulting
in such conditions as price slippage and the inability to trade
profitably and effectively. Many trading environments are that fragile.
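The cascade described above can be illustrated with a toy counting
model. Everything in it is an assumption made for illustration: the
capacities and message rates are hypothetical, one multicast
retransmission is assumed to cover the largest outstanding gap, and
each consumer is assumed to spend a cycle on every duplicate before it
reaches useful data (a worst case). It does not model any particular
middleware product.

```python
def simulate_slow_consumer(capacities, fresh_per_tick, ticks):
    """Toy model of multicast retransmission; returns one tuple per tick:
    (multicast traffic, cycles wasted on duplicates, consumers behind)."""
    backlog = [0] * len(capacities)  # messages each consumer still needs
    history = []
    for _ in range(ticks):
        # One multicast retransmission serves the largest outstanding gap.
        retransmits = max(backlog)
        traffic = fresh_per_tick + retransmits
        wasted_total = 0
        for i, capacity in enumerate(capacities):
            needed = fresh_per_tick + backlog[i]
            duplicates = traffic - needed       # retransmits it did not ask for
            handled = min(capacity, traffic)    # cycles available this tick
            wasted = min(handled, duplicates)   # worst case: duplicates first
            useful = handled - wasted
            backlog[i] = needed - useful        # re-requested next tick
            wasted_total += wasted
        lagging = sum(1 for b in backlog if b > 0)
        history.append((traffic, wasted_total, lagging))
    return history


if __name__ == "__main__":
    # Hypothetical figures: 29 healthy consumers and one slow consumer.
    capacities = [100] * 29 + [60]
    for tick, row in enumerate(simulate_slow_consumer(capacities, 80, 5), 1):
        print(tick, row)
```

With these hypothetical figures, a single slow consumer pushes
multicast traffic past the capacity of the healthy consumers by the
third tick, at which point all 30 systems are behind.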
Given this less-than-desirable operational profile, there is an
effective way to model the risk, though it comes with distinct
challenges. Chaos theory is the most appropriate model. IT media
company TechTarget provides an excellent definition:
In a scientific context, the word chaos has a slightly different meaning than it does in its general usage as a state of confusion, lacking any order. Chaos, with reference to chaos theory, refers to an apparent lack of order in a system that nevertheless obeys particular laws or rules; this understanding of chaos is synonymous with dynamical instability, a condition discovered by the physicist Henri Poincare in the early 20th century that refers to an inherent lack of predictability in some physical systems…The two main components of chaos theory are the ideas that systems - no matter how complex they may be - rely upon an underlying order, and that very simple or small systems and events can cause very complex behaviors or events.
In his book, Chaos Theory in the Financial Markets, Dimitris N. Chorafas analyzes in great depth the role of nonlinear systems, volatility, risk and cumulative exposure, as well as cognitive models for financial operations. Dr. Chorafas states:
An overriding need in any business is the ability to represent problem information in such a way that the full complexity and dynamic nature of the underlying structures is captured. Financial systems are no exception … As the information and the tools needed to solve prediction problems becomes more complex, it is increasingly more challenging to foresee and represent the evolving real-world situation. From risk management to generation of profits, flexibility is the cornerstone of a successful prediction process.
Dr. Chorafas’ assertions address the external
dynamics of financial markets. The premise of this blog entry is that
organizations must use the same scrutiny on their internal trading
systems, understanding fully that a symbiotic relationship exists
between these two environments. Several challenges exist. First,
seldom do firms assign their quants to develop models for internal
systems. Next, the operational methods of most of these systems are
not quantifiable. Even the application of chaos theory to weather
systems has more measurable components. Last, even if these two
challenges were overcome, it would be extremely difficult to bridge the
operational – and political – compartments involved in the trading
infrastructure.
Mitigation strategies
Due to the complexity of tackling risk at the systemic level,
addressing component risk first is both a pragmatic, tactical move and
an effective, long-term strategy. Purists may argue that systems
should be engineered from the ground up for effective operational risk
mitigation, but this is nearly impossible in practice, especially given
the continual changes in technology and infrastructure. Furthermore,
the risk probability curve rises precipitously as components are added,
so factoring out hazards in individual components will have the
inverse, beneficial effect.
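A rough baseline shows why the curve rises as components are added.
The sketch below assumes that components fail independently and that
any single failure disrupts trading. Both are simplifying assumptions,
and the 1 percent figure is hypothetical. Because the sketch ignores
the interaction effects discussed earlier, it likely understates the
real exposure.

```python
import math


def any_failure_probability(failure_probs):
    """Probability that at least one component fails, assuming independence."""
    for p in failure_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
    # All components survive with probability prod(1 - p_i).
    return 1.0 - math.prod(1.0 - p for p in failure_probs)


if __name__ == "__main__":
    p = 0.01  # hypothetical per-component failure probability
    for count in (1, 2, 5, 10, 24):
        print(count, round(any_failure_probability([p] * count), 4))
```

Lowering the failure probability of each component, or removing
components altogether, moves the result in the opposite direction,
which is the beneficial effect described above.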
Is there an optimum number of systems for a trading infrastructure? If
they were all built the same way, then the answer would be an emphatic
“yes.” However, reality quickly sets in. One international
broker-dealer we work with was able to reduce the number of FIX engines
on legacy middleware from 24 down to two (with redundancy) on a modern
messaging platform. Outages dropped by 87 percent while performance
increased by over 300 percent. In another example, one of our hedge
fund clients detected network (and latency) deterioration after just
five direct mesh connections on options processing servers. They moved
to a centralized communication system yielding predictable and
consistent sub-100 microsecond latency. In both of these examples,
many factors were at play including data volumes, intersystem
communication, excess messaging traffic, etc.
Before going through the steps, what types of risk are critical to
remove from the internal trading infrastructure? There are several.
The most dangerous is operational risk, the failure of a critical
component in the information flow. Next is performance risk, the
executional degradation of a critical component in the order flow, which
may include processing slowdowns and system malfunctions that produce
errant behavior. Finally, there is flexibility risk, the inability of
systems to dynamically adapt to diverse market conditions. All of
these ultimately aggregate into risk implications for the overall
system, like being behind the market or entirely out of the market.
The difficulty of fault detection and isolation also exacerbates these
risk conditions. The more complex the ecosystem, the more difficult the
root cause analysis and the return to execution. Understand that a
system that comes back up will likely be hit, for example, by the same
market data deluge that sank it in the first place, much like the
mutual fund company discussed at the beginning of this blog entry.