Simple Distributed Monitoring for your Trading Adapters

Leading Wall Street funds run a number of trading adapters that connect brokers’ and dealers’ trading systems to exchanges around the world in support of algorithmic or manual trading. Adapters are designed and customized in different ways: many use the FIX protocol, while others use an individual exchange’s native protocol.

So what is the best way to monitor the status of these highly-distributed adapters? How can you know if an adapter is down or is producing unusual data? And how do you control the performance impact of monitoring your performance-critical systems?

Trading adapters are deployed on different hosts, which connect to each other through LAN connections. The monitoring system watches both the status of the adapter application and the status of its connections to multiple internal and external systems. Your adapters probably connect to these systems through various types of middleware such as Tibco, ZeroMQ, RabbitMQ, etc. This variety can cause monitoring headaches.

Bleum’s Solution

At Bleum, when we encounter this situation, we develop an independent library that serves as a common interface to the different types of middleware. Each adapter contains several transports, which are callback threads from the independent library. Our upstream systems cannot know what type of middleware they are dealing with, so to monitor the connections with internal systems, the monitor needs to communicate with the independent library to gather and consolidate information from each middleware system. A major consideration of any monitoring system related to trading systems is that the performance impact on the adapters themselves should be minimal.

To shorten the project lifecycle and limit potential problems, we reuse the current system architecture as much as possible. We then build a monitoring agent that has a specialized transport between the adapters and the monitoring system. A customized distributed key-value store manages all monitored data. The agent reads and keeps all LAN information for each adapter instance during its lifecycle, and it can immediately identify if a transport fails to initialize or is disconnected for an unknown reason. The agent collects and summarizes all other transport information and health data, and sends it to the monitoring cluster’s key-value store. Messages are formatted as JSON, which is simple and readable.
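As a rough illustration of the kind of JSON summary an agent might send, here is a minimal Python sketch. The field names (`adapter_id`, `unhealthy_transports`, and so on), the state labels and the `TransportStatus` type are assumptions made for this example; the post does not describe the actual message schema.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class TransportStatus:
    """One transport inside an adapter (all field names are hypothetical)."""
    name: str        # illustrative label, e.g. "orders-in"
    middleware: str  # e.g. "ZeroMQ", "RabbitMQ", "Tibco"
    state: str       # assumed labels: "connected", "disconnected", "init_failed"


def build_health_message(adapter_id, transports, now=None):
    """Summarise one adapter's transports into a JSON string."""
    unhealthy = [t.name for t in transports if t.state != "connected"]
    message = {
        "adapter_id": adapter_id,
        "timestamp": time.time() if now is None else now,
        "healthy": not unhealthy,
        "unhealthy_transports": unhealthy,
        "transports": [asdict(t) for t in transports],
    }
    return json.dumps(message, sort_keys=True)


if __name__ == "__main__":
    example = [
        TransportStatus("orders-in", "ZeroMQ", "connected"),
        TransportStatus("fills-out", "RabbitMQ", "init_failed"),
    ]
    print(build_health_message("fix-adapter-1", example))
```

Keeping the summary to a flat, sorted JSON object makes it easy to read in a browser or to filter from a shell script, which is the property the paragraph above relies on.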

Distributed Monitoring Architecture

Each adapter sits on its own server between the order routing system and the exchange/broker system. The monitoring agent for each server reports back to the cluster, which aggregates data to be viewed by users.

It is important that the monitoring system has high flexibility in case it needs to support multiple platforms. For this reason, a common transfer protocol like HTTP is a good choice.

  • All messages reach the store through an HTTP service and can later be easily accessed from a browser, a shell script or a GUI.
  • If a port in any adapter is down or missing from the local network, the monitor detects this within a configurable timeout.
  • An adapter can be administratively closed even though the physical adapter is up. This is necessary from a business perspective, because sometimes the remote side is not available (for example, the exchange market is not open) or users want to set the adapter to a mode that allows only cancel requests (so that no new position will be opened).

All of these events and statuses can then be monitored on the server side.
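The timeout and administrative-state behaviour in the list above can be sketched with a small in-memory stand-in for the store. This is only an illustration under stated assumptions: the class name `MonitorStore`, the mode labels (`"open"`, `"closed"`, `"cancel_only"`) and the default timeout are invented for the example, and a real deployment would receive reports over HTTP and hold them in the distributed key-value store.

```python
import time


class MonitorStore:
    """In-memory stand-in for the monitoring store (hypothetical API)."""

    def __init__(self, timeout_seconds=5.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # the configurable timeout
        self._clock = clock                     # injectable so tests can fake time
        self._last_seen = {}                    # adapter_id -> time of last report
        self._admin_mode = {}                   # adapter_id -> administrative mode

    def report(self, adapter_id, admin_mode="open"):
        """Record a report from an agent (in practice, an HTTP request body)."""
        self._last_seen[adapter_id] = self._clock()
        self._admin_mode[adapter_id] = admin_mode

    def status(self, adapter_id):
        """Return "unknown", "down", or the adapter's administrative mode."""
        if adapter_id not in self._last_seen:
            return "unknown"
        silent_for = self._clock() - self._last_seen[adapter_id]
        if silent_for > self.timeout_seconds:
            return "down"  # no report within the timeout
        return self._admin_mode[adapter_id]


if __name__ == "__main__":
    store = MonitorStore(timeout_seconds=5.0)
    store.report("fix-adapter-1", admin_mode="cancel_only")
    print(store.status("fix-adapter-1"))
```

Separating "down" (no report within the timeout) from the administrative mode keeps a deliberately closed adapter from being mistaken for a failed one.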

Security is also very important because the adapter information may be confidential. The server has authentication and authorization components, which help to divide rights and liabilities and increase the usability of the monitor.
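One way to picture this division of rights is the sketch below, which checks an agent's credential before accepting a report and checks per-user view and amend rights per transport. Everything here is an assumption for illustration: the `User` and `AccessControl` names, the shared-token scheme and the two access levels are not taken from the Bleum implementation, and a production system would use a proper authentication mechanism rather than plain tokens.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """A monitor user with per-transport rights (hypothetical model)."""
    name: str
    can_view: set = field(default_factory=set)   # transports the user may view
    can_amend: set = field(default_factory=set)  # transports the user may change


class AccessControl:
    """Accepts reports only from known agents and checks user rights."""

    def __init__(self, agent_tokens):
        # Illustrative only: real systems should not rely on plain shared tokens.
        self._agent_tokens = set(agent_tokens)

    def accept_report(self, token):
        """Only reports carrying a known agent token are accepted."""
        return token in self._agent_tokens

    def may_view(self, user, transport):
        # Amend rights imply view rights in this sketch.
        return transport in user.can_view or transport in user.can_amend

    def may_amend(self, user, transport):
        return transport in user.can_amend


if __name__ == "__main__":
    acl = AccessControl({"agent-secret"})
    ops = User("ops", can_view={"orders-in"})
    print(acl.accept_report("agent-secret"), acl.may_amend(ops, "orders-in"))
```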

The Benefits

There are several benefits to the Bleum solution:

  • Great Performance

    The monitoring system runs on a separate host and on threads independent of the trading flow, so its performance impact is minimal.

  • High Security

    The store has a permission system, so only authenticated transports are accepted, which helps guard against denial-of-service (DoS) attacks. Moreover, the store can assign different access levels to each user, restricting which transports the user can view and whether the user has the right to amend the information.

  • Fully Customizable

    The solution is highly customizable, with configurable features such as support for different environments. Whether users work with a graphical interface or only have a command line, they can still check the monitoring system through different tools.

  • Scalable and Flexible

    The JSON message is customizable for both the adapter and server side monitor. If any new feature or new transport type needs to be added to the system, it is easy to extend the system to support this.

Takeaway

Monitoring real-time trading systems is vital. You need visibility of existing performance to provide assurance, and you need to be able to identify bottlenecks so that you can target your development efforts.

With a scalable, flexible system that has a minimal performance impact on adapter operations, you can collect the valuable data that you need to drive improvements in performance and reliability.

Offshore Partners should have Distributed Agile in their DNA – We Do

Agile emphasizes that all team members should be in one location for daily meetings, requirement discussions, design, coding, testing and so on. Yet as agile becomes the predominant methodology for software development, it’s more and more common to find project members located in different cities or even different countries.

When team members are in different locations, the primary challenge is the lack of face-to-face communication. Time-zone issues arise when team members are in different countries, and cultural differences add to the challenge.

One example is a team that has used agile methodology to work on a Fortune 500 e-commerce website for over four years. Located in Shanghai, China, our team works with a US team (California) and an India team (Bangalore) on new feature development, production issue fixes and production support. Team sizes are roughly similar in each location, and the same is true across more than 10 module teams. We engage and collaborate frequently with the other module teams and, under this model, we face all three challenges: location, time zone and culture.

Overcoming the challenge

To manage these challenges, we adopted the following practices:

  • Over-communication

    You can never over-emphasize the importance of communication when your team is distributed across locations. We have daily standup meetings with the US and India teams. In the daily meeting, everyone talks about what they completed yesterday, what they are going to do next and the impediments blocking progress.

    We’d always rather over-communicate than have a lack of communication.

    At the end of each day, everyone sends a daily status summary to the Scrum Master. This lets the Scrum Master follow the progress of team members in the same time zone as well as those in opposite time zones who are not available at that time, and so lead the team in adapting quickly to changing circumstances and requirements.

    Bleum teams use audio conference calls, screen-sharing tools such as GoToMeeting™ and video conferencing to facilitate communication. We have found these very useful for running sprint planning meetings and review meetings smoothly.

  • Flexible Timing

    As teams are located in different time zones, it’s critical to identify a suitable meeting time for every time zone. It’s impossible to find a time that everyone feels happy about; however, a time should be identified that everyone can accept.

    Team members must be flexible.

    California is 15 hours behind Shanghai and Bangalore is 2.5 hours behind, so we chose to hold the meeting at 11:30am Shanghai time (8:30pm the previous day in California and 9am in Bangalore). The US team has to give up part of every evening and the Shanghai team sometimes has to postpone lunch, but this is the time everyone has accepted.

  • Hybrid Model

    From time to time, a few engineers from the Shanghai team travel to California, where the Product Manager and Scrum Master are located. These engineers spend half the day on California working time and the other half on Shanghai time, communicating with the team there.

    Occasional on-site visits, targeted at specific critical project phases, offer a ‘best-of-both-worlds’ solution.

    This hybrid model is very effective for getting correct, detailed requirements across, and it lets product managers confirm or comment on whether requirements have been met.

  • Dedicated Person for Production Support

    Teams not only work on new feature development but also take responsibility for production support, especially during Black Friday and Cyber Monday.

    We have found it is most effective to assign 1-2 dedicated engineers to provide support service.

    They join new feature development when no support tickets are assigned, and they are the only people who work on support when tickets come in. This approach lets their peers focus on new feature development and lets the support engineers accumulate support knowledge and practice.

  • Continuous Integration

    Continuous integration plays an important role in guarding code quality, especially under a distributed team model.

    Bleum has a dedicated team whose main responsibility is to check out code from SVN, compile it and run automated test cases every day.

    If an issue occurs, they identify who caused it. Everyone, no matter their location, is responsible for fixing their own code and getting it to compile and pass the automated tests as soon as possible once they receive a phone call from the support team. As teams are located in three locations, this becomes a very important part of the process to guard code quality.

  • Knowledge Base

    Team knowledge placed into the knowledge base system includes: requirements, release plans, design documents, domain knowledge, test cases, testing accounts, API documentation and more.

    The best practice is for all teams to maintain all knowledge in one system.

    It’s accessible not only to the US teams, China teams and India teams but also to the peer function teams. It ensures that everyone has the same understanding about the system and it also shortens the learning curve for new team members.
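The meeting-time arithmetic from the Flexible Timing practice above can be checked with a few lines of Python. This sketch uses fixed UTC offsets rather than a time-zone database, and it assumes California is on Pacific Daylight Time (UTC-7), which is what the 15-hour difference implies. The date is arbitrary and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption of this sketch; they ignore daylight saving changes.
SHANGHAI = timezone(timedelta(hours=8), "Shanghai")
BANGALORE = timezone(timedelta(hours=5, minutes=30), "Bangalore")
CALIFORNIA_PDT = timezone(timedelta(hours=-7), "California (PDT)")


def local_times(meeting, zones=(SHANGHAI, BANGALORE, CALIFORNIA_PDT)):
    """Map each zone name to the meeting's local date and time in that zone."""
    return {
        tz.tzname(None): meeting.astimezone(tz).strftime("%Y-%m-%d %H:%M")
        for tz in zones
    }


if __name__ == "__main__":
    # 11:30am Shanghai time on an arbitrary illustrative date.
    meeting = datetime(2015, 7, 1, 11, 30, tzinfo=SHANGHAI)
    for place, when in local_times(meeting).items():
        print(f"{place}: {when}")
```

When California is on Pacific Standard Time (UTC-8), the gap to Shanghai grows to 16 hours, so the same meeting falls at 7:30pm in California.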

Consistent over time

The team has adopted these approaches in agile projects for more than four years, and successfully managed the challenges of distributed teams. The US team, India team and China team have collaborated with each other well and delivered many important functions, during which time the site revenue doubled. The biggest impact has been on delivery speed.

Before the team adopted agile, it performed a release every two months. After adopting agile and the above practices, the team now performs a release at the end of each sprint. This amounts to a fourfold improvement in delivery speed at the same quality level.

Over to you

How do your teams navigate the sometimes complex issues of multiple teams in different time zones and cultures?  Do you have any additional thoughts on this topic?  If you are just scratching your head as to how to improve your team’s performance, reach out to us through the form to the right. We will help.