Architecting for High-Performance Continuity


by Barry Thompson

Architecting for Continuity

Removing Systemic Risk

The goal of the architecture is to remove systemic risk while ensuring predictable, market-leading performance. Fortunately, and realistically, this work does not require a rip-and-replace; it can be applied to existing trading infrastructures. The following are key items to consider:

 

1. Build continuity into systems with the greatest number of adjacencies

 

Redundancy and high availability are critical requirements for all components in a mission-critical system, but more so for those that interconnect other components. These include liquidity connectivity, network infrastructure and messaging middleware. It’s important to note that building a mesh infrastructure as a means to resiliency actually has the opposite effect: risk increases when, for example, five nodes are interconnected in this manner, because each node becomes adjacent to every other. This frequently occurs at both the network and messaging levels. Redundancy does not mean plugging two systems into a risk-prone infrastructure; that doesn’t make the risk go away. The underlying paths must mitigate the “adjacency effect.”
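To make the adjacency effect concrete, here is a minimal sketch that counts the point-to-point links in a full mesh and compares them with a layout that routes through a single interconnecting component. The node names and the choice of "router" as the hub are hypothetical, and a link count is only a rough proxy for exposure, not a risk model.

```python
from itertools import combinations

def full_mesh_links(nodes):
    """Every pair of nodes is directly connected."""
    return list(combinations(nodes, 2))

def hub_links(nodes, hub):
    """Every node connects only to one interconnecting component."""
    return [(hub, node) for node in nodes if node != hub]

# Hypothetical five-node trading infrastructure.
nodes = ["gateway", "router", "pricing", "orders", "risk"]

mesh = full_mesh_links(nodes)
star = hub_links(nodes, "router")

# Adjacencies per node in the mesh: each node touches all the others.
mesh_adjacencies = {n: sum(n in link for link in mesh) for n in nodes}

print(len(mesh), len(star), mesh_adjacencies["pricing"])
```

In the hub layout only one component carries a large number of adjacencies, which is exactly the component where continuity has to be built in first.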

 

2. Eliminate unnecessary integration

 

Much like a great team, each player in the trading infrastructure must play its part and play it well. Too often, technology is adopted that allows systems to do many things. The challenge is that the behavior of an individual component is obscured, and the component is put at risk along with the other functions with which it is commingled. Only do what you need to do. This often generates conflicts because business units prefer to build out a separate infrastructure rather than embrace the economies of scale proposed by a centralized technology group. If the latter could guarantee a specific risk profile, that would be an ideal scenario. Otherwise, the savings may not outweigh potential operational issues.

 

3. Optimize performance characteristics

 

Major technology evolution occurs every five to seven years. When it does, the results are dramatic. Given that the benefits are often an order of magnitude higher and the cost a fraction of the previous incarnation, adoption at the right time is important. This is most often when the technology has been in the market (not a lab) for about nine months. Competitive pressures also influence this number. We saw this with networks evolving to hardware, processors adding more cores, and now messaging moving to silicon. It’s a continual evolution requiring ongoing education and evaluation for even the savviest of firms.

 

4. Effectively provision management of multiple information streams

 

A great deal of market data and order flow is moving through the organization. It’s critical that these streams do not all come together at under-provisioned rendezvous points, because volatility and micro-bursting can turn these into information dams, slowing data flow and causing risk-inducing latency. This is a key consideration in messaging systems for market data distribution and order routing: software-based systems will need distributed streams, while hardware-based systems can handle aggregated flows. In this way, the risk profile is matched to the infrastructure capabilities.
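A rough sketch of why micro-bursts matter when sizing a rendezvous point: the average rate over a second can look harmless while a single millisecond carries hundreds of messages. The timestamps, the one-millisecond window and the capacity figure below are all invented for illustration.

```python
from collections import Counter

def peak_and_average(timestamps_us, window_us, span_us):
    """Return (busiest window's message count, mean messages per window)."""
    counts = Counter(ts // window_us for ts in timestamps_us)
    windows = span_us // window_us
    return max(counts.values()), len(timestamps_us) / windows

# Hypothetical one-second capture: a steady trickle plus one micro-burst.
steady = list(range(5_000, 1_000_000, 10_000))   # 100 messages, evenly spaced
burst = [500_000 + i for i in range(400)]        # 400 messages inside 1 ms
timestamps = steady + burst

peak, average = peak_and_average(timestamps, window_us=1_000, span_us=1_000_000)

# Assumed capacity of the rendezvous point, in messages per millisecond.
capacity_per_window = 50
print(peak, average, peak > capacity_per_window)
```

Provisioning against the average here would suggest ample headroom; provisioning against the peak shows the point would be overrun during the burst.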

Achieving operational excellence

Architecture is one side of the coin; operations is the other.  Operational excellence is not a one-time activity; it requires regular tuning and modification.  Quantified data and the ability to measure it are critical success factors.  The following figure highlights the reconciliation of risk with some of the requisite operational elements.

Figure: Risk Components

1. Establish operational checkpoints

 

Gone are the days when high-performance systems were compromised by monitoring capabilities. Establish checkpoints at the system level to mitigate the risk of component failure. Establish checkpoints at major ingress and egress points to mitigate the risk of systemic failure. Make sure that the checkpoints can report independently of being polled, especially when baseline conditions are breached. Checkpoints need to be managed as well. Too many checkpoints, if improperly set up, can adversely affect the trading cycle by slowing systems down or creating excess network traffic.
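As one possible shape for a checkpoint that reports without being polled, the sketch below pushes an alert through a callback the moment a sample breaches its baseline and stays silent otherwise, which also keeps monitoring traffic low. The `Checkpoint` class, the checkpoint name and the 250-microsecond baseline are assumptions made for this example, not a reference to any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Checkpoint:
    name: str
    max_latency_us: float                      # baseline for this checkpoint
    on_breach: Callable[[str, float], None]    # push channel, not a poll target
    breaches: int = 0

    def record(self, latency_us: float) -> None:
        """Record one sample; report immediately only if the baseline is breached."""
        if latency_us > self.max_latency_us:
            self.breaches += 1
            self.on_breach(self.name, latency_us)

alerts: List[str] = []

# Hypothetical egress checkpoint with an assumed 250 microsecond baseline.
egress = Checkpoint(
    name="order-gateway-egress",
    max_latency_us=250.0,
    on_breach=lambda name, value: alerts.append(f"{name}: {value:.0f}us"),
)

for sample in (120.0, 180.0, 900.0, 140.0):
    egress.record(sample)

print(alerts)
```

In practice the callback would hand off to whatever alerting channel the firm already runs; the point is only that the checkpoint initiates the report.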

 

2. Measure, measure, measure

 

Having the checkpoint in place is one thing, but the performance expectations, both individually and in the aggregate, must also be considered. Not knowing is not an excuse. Set the criteria, establish benchmarks, validate regularly, and warn when operational thresholds are compromised. Measure data volume: not just averages but peaks as well. Measure latency across the entire trading cycle in addition to specific execution points. Measure end-to-end system performance and compare with benchmarks and trending curves. Measure server utilization rates. Measure your service providers and your trading partners. Aggregate what you measure and perform regular statistical analysis.
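The difference between averages and peaks is easy to show with a few lines of analysis. The sketch below summarizes a set of latency samples by mean, median, 99th percentile and peak, then checks the tail against a benchmark. The sample values and the 1,000-microsecond benchmark are made up for the example.

```python
import statistics

def summarize(samples_us):
    """Summarize latency samples; the tail matters as much as the middle."""
    ordered = sorted(samples_us)
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": cuts[49],   # 50th percentile
        "p99": cuts[98],      # 99th percentile
        "peak": ordered[-1],
    }

# Hypothetical samples: 95 fast round trips and 5 slow ones.
samples = [100] * 95 + [5000] * 5
summary = summarize(samples)

# Assumed benchmark for the 99th percentile, in microseconds.
benchmark_p99_us = 1000
breached = summary["p99"] > benchmark_p99_us

print(summary, breached)
```

Here the median says the system is healthy and the mean only hints at trouble, while the 99th percentile and peak expose the slow round trips that would matter in a volatile market.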

 

3. Isolate problems dynamically while maintaining performance
 

Even with proper planning, problems are inevitable. If components can be dynamically isolated while maintaining overall system performance, that’s a major step forward. The slow consumer problem described earlier is an excellent example from the messaging arena. This problem is being solved by contemporary messaging platforms because they have the intelligence to be self-isolating. High availability and resiliency are important, but have historically impacted performance, at least in the short term. Far better to add the requisite intelligence to prune systems (with notification, of course) while ensuring consistent high performance.
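A simplified illustration of self-isolation for the slow consumer case: each subscriber has its own backlog, and a subscriber whose backlog exceeds a limit is pruned with a notification so the others keep flowing. The `Publisher` class, the backlog limit and the consumer names are hypothetical; real messaging platforms implement this in their own ways.

```python
from collections import deque

class Publisher:
    def __init__(self, max_backlog, notify):
        self.max_backlog = max_backlog
        self.notify = notify          # called as notify(consumer, backlog) on isolation
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()

    def publish(self, message):
        # Iterate over a copy of the names so consumers can be pruned mid-loop.
        for name in list(self.queues):
            queue = self.queues[name]
            queue.append(message)
            if len(queue) > self.max_backlog:
                del self.queues[name]             # isolate the slow consumer
                self.notify(name, len(queue))     # and tell operations about it

    def drain(self, name, count):
        """Simulate a consumer reading up to `count` messages from its backlog."""
        queue = self.queues.get(name)
        if queue is None:
            return []
        return [queue.popleft() for _ in range(min(count, len(queue)))]

isolated = []
publisher = Publisher(max_backlog=3, notify=lambda name, backlog: isolated.append((name, backlog)))
publisher.subscribe("fast")
publisher.subscribe("slow")

received_by_fast = []
for message in range(5):
    publisher.publish(message)
    received_by_fast += publisher.drain("fast", 1)   # "slow" never drains

print(isolated, received_by_fast)
```

The fast consumer receives every message without delay, and the slow one is removed as soon as its backlog passes the limit rather than being allowed to hold back the publisher.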

 

Continuity is key

The cost of a lapse in operational continuity in today’s high-performance trading infrastructures is too high to leave things to chance. Numerous and diverse systems create complex interdependencies with complicated risk profiles. The most effective way to model this risk is with chaos theory, but that approach becomes impractical in real-world environments. Capriciously adding new technology may not mitigate the perils either.

Fortunately, risk can be driven down substantially by addressing continuity factors in key, individual components, especially those that have a large number of adjacencies.  Architectural improvement can be done on both new and, more likely, existing trading platforms.  Major technology innovations play a key role here.  Architecture progression in isolation is not enough; operational controls and metrics must be established as well.  Though we’ll never entirely remove all risk, we can certainly reduce the chance of systemic failure by orders of magnitude.  That will keep savvy firms both in and ahead of the market.

 
