WO2022031878A1 - Highly deterministic latency in a distributed system - Google Patents

Highly deterministic latency in a distributed system

Publication number
WO2022031878A1
Authority
WO
WIPO (PCT)
Prior art keywords
message
messages
compute
time
tbv
Prior art date
Application number
PCT/US2021/044588
Other languages
French (fr)
Inventor
Anthony D. Amicangioli
Allen Bast
B. Joshua Rosen
Christophe Juhasz
Original Assignee
Hyannis Port Research, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/988,249 external-priority patent/US11088959B1/en
Priority claimed from US16/988,491 external-priority patent/US11328357B2/en
Application filed by Hyannis Port Research, Inc. filed Critical Hyannis Port Research, Inc.
Priority to EP21786612.8A priority Critical patent/EP4193256A1/en
Priority to JP2023508074A priority patent/JP2023540448A/en
Publication of WO2022031878A1 publication Critical patent/WO2022031878A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/008 Reliability or availability analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2097 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/835 Timestamp

Definitions

  • This patent application relates to connected devices, and more particularly to providing deterministic latency.
  • the financial instrument trading systems currently in widespread use in the major stock exchanges allow traders to submit orders and receive confirmations, market data, and other information, electronically, via communications networks.
  • the typical electronic trading system includes a matching engine, typically residing within a central server, and a plurality of gateways that provide access to the matching engine, as well as other distributed processors.
  • the typical order process can be as follows: request messages representing orders (e.g., bid orders and/or ask orders) are received, as sent from client devices (e.g., trader terminals operated by human users or servers executing automated trading algorithms). An order acknowledgement is then typically returned to the client devices via the gateway that forwarded the request. The exchange may perform additional processing before the order processing acknowledgement is returned to the client device.
  • the exchange system may also disseminate information related to the order message, either in the same form as received or otherwise, to other systems to generate market data output.
  • Latency, generally speaking, is the time between the input to a system and an observable response. In the context of communications systems, latency is measured as the difference between the time when a message enters or is received by the system and the time when a corresponding response message is sent out. Latency is a particularly important consideration in high-speed electronic trading systems, where it is desirable to minimize the time it takes to execute a trade.
  • U.S. Pre-grant Publication 2019/0097745 describes a communication network that uses time stamps to reduce the impact of non-deterministic delays. The state of a transmit path is estimated by observing a “non-deterministic” delay of previously transmitted packets. Transmission circuits then hold outgoing packets until packet processing circuitry provides a deterministic latency for the packet.
  • ICON Packet Transport by Schweitzer Engineering Laboratories, Inc. © 2016 is an example of a networking device that provides deterministic, low latency packetization using a jitter buffer.
  • U.S. Patent 7,496,086 is a voice network having a set of gateways that use jitter buffers to equalize delay.
  • U.S. Patent 7,885,296 assigns timestamps to frames, and maintains synchronization among multiple timestamp counters distributed among different Physical Layer (PHY) transceivers.
  • U.S. Pre-grant Publication 2018/0359195 describes a network switch that uses a special type of tree data structure to identify a timestamp range for a received packet, such as may be used for streaming media in a Real-time Transport Protocol (RTP) network.
  • an inbound message, such as a request from a market participant or other client node, enters the system via one of a number of gateway nodes.
  • the gateway node receiving the inbound message then applies an ingress time based value (which may be a “timestamp” that depends on the time of receipt) to the message.
  • the message (including the timestamp now embedded within it) is forwarded to be processed by other nodes in the distributed system.
  • any corresponding response message generated by other nodes in the distributed system also retains the same timestamp value (and/or constant) embedded within it.
  • As a corresponding response message to the request is readied to be sent back out to the participant / client by the gateway node, the response message first goes through an egress “quality of service” (QOS) shaper.
  • the QOS shaper ensures the response message is sent out of the system only at a very precise deterministic time that depends on the ingress timestamp plus some deterministic delay.
  • the QOS shaper may be implemented as a “packet scheduler,” which organizes outgoing messages into a set of indexed, temporary storage locations (or “buckets”), one associated with each discrete high precision timing interval, so that an entry placed in a particular location in the scheduler is guaranteed to be released at the precise associated time interval.
  • the gateway may instead directly associate the inbound message with an indexed location of the packet scheduler associated with a desired egress time. As with other implementations, this egress time is carried with the message through the system so that a corresponding response message processed and generated by the distributed system core will be sent at the allocated time.
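A minimal Python sketch of the packet scheduler described above, assuming integer time units and the fixed 1000-unit delay used in the Fig. 4 example. The names `PacketScheduler`, `schedule`, and `release` are hypothetical and are not taken from the application.

```python
from collections import defaultdict

FIXED_LATENCY = 1000  # assumed deterministic delay, in time units (as in Fig. 4)


class PacketScheduler:
    """Holds outbound messages in buckets indexed by their egress time."""

    def __init__(self):
        self.buckets = defaultdict(list)  # egress time -> queued messages

    def schedule(self, message, ingress_time):
        """Place a response in the bucket for ingress time + fixed delay."""
        egress_time = ingress_time + FIXED_LATENCY
        self.buckets[egress_time].append(message)
        return egress_time

    def release(self, now):
        """Return (and remove) the messages whose bucket matches `now`."""
        return self.buckets.pop(now, [])


scheduler = PacketScheduler()
scheduler.schedule("ack-A", ingress_time=5)
scheduler.schedule("ack-B", ingress_time=7)
early = scheduler.release(now=1004)    # nothing is due yet
on_time = scheduler.release(now=1005)  # "ack-A" leaves at 5 + 1000
```

However quickly the core produces the response, it leaves only when its bucket comes due, which is what makes the observed latency independent of internal processing time in this sketch.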
  • One advantage of the system described herein is that, unlike prior trading systems, all users of the system obtain a response with the same latency. Whether the user is a market participant, or simply a subscriber of a market data feed, every user of the system experiences the same, deterministic response time.
  • the deterministic response time can be a fixed time value that does not vary. However, with other notions of fairness, the deterministic time can instead follow a predetermined pattern, or may be a randomly selected time that is chosen across a range of possible deterministic response times.
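The three ways of choosing the deterministic response time mentioned above (fixed, patterned, or randomly selected from a set of allowed values) could be sketched as interchangeable latency sources. This is an illustration only; the function names are hypothetical, and the values 2000 and 3500 are borrowed from the Fig. 6A and Fig. 6B descriptions.

```python
import itertools
import random


def fixed_latency(value):
    """Same delay for every message."""
    return itertools.repeat(value)


def patterned_latency(pattern):
    """Delays that follow a predetermined repeating pattern."""
    return itertools.cycle(pattern)


def random_latency(candidates, rng):
    """Delays drawn at random from a fixed set of allowed values."""
    while True:
        yield rng.choice(candidates)


def egress_time(ingress_time, latency_source):
    """Egress time is the ingress time plus the next delay from the source."""
    return ingress_time + next(latency_source)


pattern = patterned_latency([2000, 3500])
times = [egress_time(t, pattern) for t in (10, 20, 30)]
# times == [2010, 3520, 2030]
```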
  • Fig. 1A is a high level block diagram of a distributed electronic trading system.
  • Fig. 1B illustrates messages travelling from a gateway to a compute node on a direct path and through a sequencer node.
  • Fig. 1C is an example format of a message.
  • FIG. 2 is a more detailed view of a system component such as a gateway or compute node.
  • Fig. 3A illustrates how a time based value is applied to inbound and outbound messages.
  • Fig. 3B illustrates a packet scheduler.
  • Fig. 4 is an example where the system provides a deterministic latency of 1000 time units.
  • Fig. 5 shows an asynchronous outbound message.
  • Fig. 6A is an example where the system provides a deterministic latency of 2000 time units selected from a deterministic range.
  • Fig. 6B is an example where the system provides a deterministic latency of 3500 time units selected from a deterministic range.
  • Fig. 7 illustrates how a set of deterministic latency values can be determined.
  • Fig. 8 shows how a fixed latency time can be selected.
  • Fig. 9 shows how a set of fixed latency times can be selected across a range.
  • Fig. 10 is an example in which different participants experience varying deterministic latency based on an additional parameter, such as the financial trading protocol used.
  • Example embodiments disclosed herein relate to a high-speed electronic trading system that provides a market where orders to buy and sell financial instruments (such as stocks, bonds, commodities, futures, options, and the like) are traded among market participants (such as traders and brokers).
  • the electronic trading system exhibits low latency, fairness, fault tolerance, deterministic latency, and other features more fully described below.
  • the electronic trading system is primarily responsible for “matching” trade orders to one another.
  • an offer to “buy” an instrument is matched to a corresponding counteroffer to “sell”.
  • the matched offer and counteroffer should at least partially satisfy the desired price, with any residual unsatisfied quantity passed to another suitable counterorder. Matched orders are then paired and the trade is executed.
  • Any wholly unsatisfied or partially satisfied orders are maintained in a data structure referred to as an “order book”.
  • the retained information regarding unmatched trade orders can be used by the matching engine to satisfy subsequent trade orders.
  • An order book is typically maintained for each instrument and generally defines or otherwise represents the state of the market for that particular product. It may include, for example, the recent prices and quantities at which market participants have expressed a willingness to buy or sell.
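As a rough illustration of the matching and order book behavior described above, the following Python sketch matches an incoming order against resting orders on the opposite side and keeps any residual quantity in the book. The `Order` fields, the `match` function, and the price-then-arrival priority rule are assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    side: str       # "buy" or "sell"
    price: int
    quantity: int


def match(incoming, book):
    """Match `incoming` against opposite-side orders resting in `book`.

    Returns a list of (resting_order_id, quantity, price) fills. Any
    unsatisfied remainder of `incoming` is appended to the book.
    """
    fills = []
    opposite = [o for o in book if o.side != incoming.side]
    # Assumed priority: best price first; the stable sort keeps arrival order.
    opposite.sort(key=lambda o: o.price, reverse=(incoming.side == "sell"))
    for resting in opposite:
        if incoming.quantity == 0:
            break
        if incoming.side == "buy":
            crosses = incoming.price >= resting.price
        else:
            crosses = incoming.price <= resting.price
        if not crosses:
            break
        qty = min(incoming.quantity, resting.quantity)
        fills.append((resting.order_id, qty, resting.price))
        incoming.quantity -= qty
        resting.quantity -= qty
        if resting.quantity == 0:
            book.remove(resting)
    if incoming.quantity > 0:
        book.append(incoming)  # residual stays in the order book
    return fills


book = [Order("s1", "sell", 101, 50), Order("s2", "sell", 100, 30)]
fills = match(Order("b1", "buy", 101, 60), book)
# fills == [("s2", 30, 100), ("s1", 30, 101)]; 20 units of s1 remain in the book
```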
  • the results of matching may also be made visible to market participants via streaming data services referred to as market data feeds.
  • a market data feed typically includes individual messages that carry the pricing for each traded instrument, and related information such as volume and other statistics.
  • Fig. 1A illustrates an example electronic trading system 100 that includes a number of gateways 120-1, 120-2, ..., 120-g (collectively referred to as gateways 120), a set of core compute nodes 140-1, 140-2, ..., 140-c (collectively, the core compute nodes 140 or compute nodes 140), and one or more sequencers 150-1, 150-2, ..., 150-s (collectively, the sequencers 150).
  • the gateways 120, core compute nodes 140, and sequencers 150 are thus considered to be nodes in electronic trading system 100.
  • the gateways 120, compute nodes 140 and sequencers 150 are directly connected to one another, preferably via low latency, dedicated connections 180.
  • gateways 120-2, . . ., 120-g are the peers for gateway 120-1
  • core compute nodes 140-2, . . ., 140-c are the peers for core compute node 140-1
  • sequencers 150-2, . . ., 150-s are the peers for sequencer 150-1.
  • The terms “active” and “standby,” in relation to the discussion of the system 100, may refer to a high availability (HA) role/state/mode of a system/component.
  • a standby system/component is a redundant (backup) system/component that is powered on and ready to take over function(s) performed by an active system/component.
  • Switchover/failover, that is, a transition from the standby role/state/mode to the active role/state/mode, may be performed automatically in response to failure of the currently active system/component, for non-limiting example.
  • the electronic trading system 100 processes trade orders from and provides related information to one or more participant computing devices 130-1, 130-2, . . ., 130-p (collectively, the participant devices 130).
  • Participant devices 130 interact with the system 100, and may be one or more personal computers, tablets, smartphones, servers, or other data processing devices configured to display and receive trade order information.
  • the participant devices 130 may be operated by a human via a graphical user interface (GUI), or they may be operated via high-speed automated trading methods running on some physical or virtual data processing platform.
  • Each participant device 130 may exchange messages with (that is, send messages to and receive messages from) the electronic trading system 100 via connections established with a gateway 120. While Fig. 1A illustrates each participant device 130 as being connected to electronic trading system 100 via a single connection to a gateway 120, it should be understood that a participant device 130 may be connected to electronic trading system 100 over multiple connections to one or more gateway devices 120.
  • While each gateway 120-1 may serve a single participant device 130, it typically serves multiple participant devices 130.
  • the compute nodes 140-1, 140-2, . . ., 140-c (also referred to herein as matching engines 140 or compute engines 140) provide the matching functions described above and may also generate outgoing messages to be delivered to one or more participant devices 130.
  • Each compute node 140 is a high-performance data processor and typically maintains one or more data structures to search and maintain one or more order books 145-1, 145-2, ..., 145-b.
  • An order book 145-1 may be maintained, for example, for each instrument for which the core compute node 140-1 is responsible.
  • One or more of the compute nodes 140 and/or one or more of the gateways 120 may also provide market data feeds 147.
  • Market data feeds 147 may be broadcast (for example, multicast), to subscribers, which may be participant devices 130 or any other suitable computing devices.
  • Some outgoing messages generated by core compute nodes 140 may be synchronous, that is, generated directly by a core compute node 140 in response to one or more incoming messages received from one or more participant devices 130, such as an outgoing “acknowledgement message” or “execution message” in response to a corresponding incoming “new order” message. In some embodiments, however, at least some outgoing messages may be asynchronous, initiated by the trading system 100, for example, certain “unsolicited” cancel messages and “trade break” or “trade bust” messages.
  • Distributed computing environments such as the electronic trading system 100, can be configured with multiple matching engines operating in parallel on multiple compute nodes 140.
  • sequencers 150 ensure that the proper sequence of any order-dependent operations is maintained. To ensure that operations on incoming messages are not performed out of order, incoming messages received at one or more gateways 120, for example, a new trade order message from one of participant devices 130, typically may then pass through at least one sequencer 150 (e.g., a single currently active sequencer, and possibly one or more standby sequencers) in which they are marked with a sequence identifier (by the single currently active sequencer, if multiple sequencers are present). That identifier may be a unique, monotonically increasing value which is used in the course of subsequent processing throughout the distributed system 100 (e.g., electronic trading system 100), to determine the relative ordering among messages and to uniquely identify messages throughout electronic trading system 100.
  • the sequence identifier may be indicative of the order (i.e., sequence) in which a message arrived at the sequencer.
  • the sequence identifier may be a value that is monotonically incremented or decremented according to a fixed interval by the sequencer for each arriving message; for example, the sequence identifier may be incremented by one for each arriving message. It should be understood, however, that, while unique, the sequence identifier is not limited to a monotonically increasing or decreasing value.
  • the original, unmarked, messages and the sequence-marked messages may be essentially identical, except for the sequence identifier value included in the marked versions of the messages.
  • The marked incoming messages, that is, the sequence-marked messages, are typically then forwarded by sequencer(s) 150 to other downstream compute nodes 140 to perform potentially order-dependent processing on the messages.
  • sequencer(s) 150 may also determine a relative ordering of each marked message among other marked messages in the electronic trading system 100.
  • the unique sequence identifier disclosed herein may be used for ensuring deterministic order (i.e., sequence) for electronic-trade message processing.
  • the unique sequence identifier represents a unique, deterministic ordering (i.e., sequence) directive for processing of a given electronic trade message relative to other trade messages within an electronic trading system.
  • the sequence identifier may be populated in a sequence ID field 110-14 of a message, as disclosed further below with regard to FIG. 1C for non-limiting example.
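A minimal sketch of a sequencer that marks each arriving message with a unique, monotonically increasing identifier, incremented by one per message as in the example above. The class name, the dictionary message representation, and the `sequence_id` key are hypothetical stand-ins for the sequence ID field 110-14.

```python
import itertools


class Sequencer:
    """Marks messages with a unique, monotonically increasing sequence ID."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def mark(self, message):
        """Return a sequence-marked copy; the unmarked original is unchanged."""
        marked = dict(message)
        marked["sequence_id"] = next(self._counter)
        return marked


sequencer = Sequencer()
original = {"node_id": "gw-1", "payload": "new order"}
first = sequencer.mark(original)
second = sequencer.mark({"node_id": "gw-2", "payload": "cancel"})
# first["sequence_id"] == 1 and second["sequence_id"] == 2
```

Apart from the added identifier, the marked copy is identical to the unmarked message, which mirrors the description of sequence-marked messages above.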
  • messages may also flow in the other direction, that is, from a core compute node 140 to one or more of the participant devices 130, passing through one or more of the gateways 120.
  • Such outgoing messages generated by a core compute node 140 may also be order-dependent (i.e., sequence-order dependent), and accordingly may also typically first pass through a sequencer 150 to be marked with a sequence identifier. The sequencer 150 may then forward the marked response message to the gateways 120 in order to pass on to participant devices 130 in a properly deterministic order.
  • Using the sequencer 150 to generate unique sequence numbers and mark messages or representations thereof with same, that is, to generate sequence-marked messages, ensures the correct ordering of operations is maintained throughout the distributed system, that is, the electronic trading system 100, regardless of which compute node or set of compute nodes 140 processes the messages.
  • This approach provides “state determinism,” for example, an overall state of the system is deterministic and reproducible (possibly somewhere else, such as at a disaster recovery site), to provide fault-tolerance, high availability and disaster recoverability.
  • It may be useful for a generating node (i.e., a node introducing a new message into the electronic trading system 100, for example by generating a new message and/or by forwarding a message received from a participant device 130) and its peer nodes to receive the sequence number assigned to that message. Receiving the sequence number for a message it generated may be useful to the generating node and its peer nodes not only for processing messages in order, according to their sequence numbers, but also to correlate the message generated by the node with the message’s sequence identifier that is used throughout the rest of the electronic trading system 100.
  • Such a correlation between an unmarked version of a message as introduced by a generating node into the electronic trading system and the sequence marked version of the same message outputted by the sequencer may be made via identifying information in both versions of the message, as discussed further below in connection with Fig. 1C.
  • a subsequent message generated within the electronic trading system 100, while also being assigned its own sequence number, may yet reference one or more sequence numbers of related preceding messages. Accordingly, a node may need to quickly reference (by sequence number) a message the node had itself previously generated, because, for example, the sequence number of the message the node had generated was referenced in a subsequent message.
  • the generating node may first send a message to the sequencer 150 and wait to receive the sequence number for the message from the sequencer before the generating node forwards the message to other nodes in electronic trading system 100.
  • sequencer 150 may not only send a sequenced version of the message (e.g., a sequence-marked message) to destination nodes, but may also send substantially simultaneously a sequenced version of the message back to the sending node and its peers. For example, after assigning a sequence number to an incoming message sent from the gateway 120-1 to core compute nodes 140, the sequencer 150 may not only forward the sequenced version of the message to the core compute nodes 140, but may also send a sequenced version of that message back to the gateway 120-1 and the other gateways 120. Accordingly, if any subsequent message generated in a core compute node 140 references that sequence number, any gateway 120 may easily identify the associated message originally generated by gateway 120-1 by its sequence number.
  • a sequenced version of an outgoing message generated by and sent from a core compute node 140 to gateways 120, and sequenced by sequencer 150, may be forwarded by sequencer 150 both to gateways 120 and back to core compute nodes 140.
  • Some embodiments may include multiple sequencers 150 for high availability, for example, to ensure that another sequencer is available if the first sequencer fails.
  • the currently active sequencer 150-1 may maintain a system state log (not shown) of all the messages that passed through sequencer 150-1, as well as the messages’ associated sequence numbers.
  • This system state log may be continuously or periodically transmitted to the standby sequencers to provide them with requisite system state to allow them to take over as an active sequencer, if necessary.
  • the system state log may be stored in a data store that is accessible to the multiple sequencers 150.
  • the system state log may also be continually or periodically replicated to one or more sequencers in a standby replica electronic trading system (not shown in detail) at a disaster recovery site 155, thereby allowing electronic trading to continue with the exact same state at the disaster recovery site 155, should the primary site of system 100 suffer catastrophic failure.
  • a currently active sequencer of a plurality of sequencers may store the system state log in a data store (not shown).
  • the data store may be accessible to the plurality of sequencers via a shared sequencer network, such as the sequencer-wide shared network 182-s disclosed further below with regard to Fig. 1A.
  • If a given sequencer of the plurality of sequencers transitions its role (state) from standby to active, such sequencer may retrieve the system state log from the data store to synchronize state with that of the former active sequencer.
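The standby-to-active transition described above can be sketched as follows, assuming the shared data store is modeled as a plain Python list of (sequence number, message) entries. The class `LoggingSequencer` and its methods are hypothetical; a real system state log and data store would be considerably more involved.

```python
class LoggingSequencer:
    """Sequencer that records every marked message in a shared state log."""

    def __init__(self, role="standby"):
        self.role = role
        self.next_sequence_id = 1

    def mark(self, message, state_log):
        """Assign the next sequence number and append the entry to the log."""
        if self.role != "active":
            raise RuntimeError("only the active sequencer marks messages")
        entry = (self.next_sequence_id, message)
        state_log.append(entry)  # stand-in for the shared data store
        self.next_sequence_id += 1
        return entry

    def take_over(self, state_log):
        """Standby to active: resume numbering after the last logged entry."""
        if state_log:
            self.next_sequence_id = state_log[-1][0] + 1
        self.role = "active"


shared_log = []
primary = LoggingSequencer(role="active")
primary.mark("order-1", shared_log)
primary.mark("order-2", shared_log)

standby = LoggingSequencer()          # primary is assumed to have failed here
standby.take_over(shared_log)
resumed = standby.mark("order-3", shared_log)
# resumed == (3, "order-3")
```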
  • the system state log may also be provided to a drop copy service 152, which may be implemented by one or more of the sequencers, and/or by one or more other nodes in the electronic trading system 100.
  • the drop copy service 152 may provide a record of daily trading activity through electronic trading system 100 that may be delivered to regulatory authorities and/or clients, who may, for example be connected via participant devices 130.
  • the drop copy service 152 may be implemented on one or more of the gateways 120.
  • the drop copy service 152 may provide the record of trading activity based on the contents of incoming and outgoing messages sent throughout electronic trading system 100.
  • a gateway 120 implementing the drop copy service 152 may receive from the sequencer 150 (and/or from core compute nodes 140 and other gateways 120) all messages exchanged throughout the electronic trading system 100.
  • a participant device 130 configured to receive the record of daily trading activity from the drop copy service 152 may not necessarily also be sending trade orders to and utilizing a matching function of electronic trading system 100.
  • Messages exchanged between participant devices 130 and gateways 120 may be according to any suitable protocol that may be used for financial trading (referred to for convenience as, “financial trading protocol”).
  • the messages may be exchanged according to custom protocols or established standard protocols, including both binary protocols (such as Nasdaq OUCH and NYSE UTP), and text-based protocols (such as NYSE FIX CCG).
  • the electronic trading system 100 may support exchanging messages simultaneously according to multiple financial trading protocols, including multiple protocols simultaneously on the same gateway 120.
  • participant devices 130-1, 130-2, and 130-3 may simultaneously have established trading connections and may be exchanging messages with gateway 120-1 according to Nasdaq OUCH, NYSE UTP, and NYSE FIX CCG, respectively.
  • the gateways 120 may translate messages according to a financial trading protocol received from a participant device 130 into a normalized (e.g., standardized) message format used for exchanging messages among nodes within the electronic trading system 100.
  • the normalized trading format may be an existing protocol or may generally be of a different size and data format than that of any financial trading protocol used to exchange messages with participant devices 130.
  • the normalized trading format when compared to a financial trading protocol of the original incoming message received at the gateway 120 from a participant device 130, may include in some cases one or more additional fields or parameters, may omit one or more fields or parameters, and/or each field or parameter of a message in the normalized format may be of a different data type or size than the corresponding message received at gateway 120 from the participant device 130.
  • gateways 120 may translate outgoing messages generated in the normalized format by electronic trading system 100 into messages in the format of one or more financial trading protocols used by participant devices 130 to communicate with gateways 120.
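To illustrate the kind of translation a gateway might perform, the sketch below converts a text-based, FIX-style tag=value order into a normalized dictionary. The tag numbers used (11 ClOrdID, 38 OrderQty, 44 Price, 54 Side, 55 Symbol) are standard FIX tags, but the normalized field names and the function `normalize_fix` are assumptions for this example; the application does not define the normalized format, and this is not the NYSE FIX CCG message layout.

```python
from decimal import Decimal

SOH = "\x01"  # standard FIX field delimiter


def normalize_fix(raw):
    """Translate a FIX-style tag=value string into a normalized dict.

    The output field names are hypothetical; they only illustrate that the
    normalized format may rename, retype, add, or omit fields.
    """
    fields = dict(pair.split("=", 1) for pair in raw.strip(SOH).split(SOH))
    return {
        "client_order_id": fields["11"],
        "symbol": fields["55"],
        "side": {"1": "buy", "2": "sell"}[fields["54"]],
        "quantity": int(fields["38"]),
        "price": Decimal(fields["44"]),  # Decimal avoids float rounding
    }


raw_order = SOH.join(["11=abc123", "55=XYZ", "54=1", "38=100", "44=10.25"]) + SOH
normalized = normalize_fix(raw_order)
```

A gateway handling a binary protocol would need a different parser feeding the same normalized structure, so that downstream nodes see a single format regardless of the participant's protocol.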
  • incoming/outgoing messages (e.g., the incoming message 103 and outgoing message 105) are communicated between the gateway 120-1 and a participant device 130.
  • Fig. 1B is a block diagram of an example embodiment of the electronic trading system 100 of Fig. 1A, disclosed above.
  • the electronic trading system 100 comprises the gateway 120-1 coupled to the core compute node 140-1 via an activation link 180-1-1 and an ordering (i.e., sequencing) path 117.
  • the electronic trading system 100 further comprises the sequencer 150-1 electronically disposed within the ordering path 117.
  • the gateway 120-1 is configured to transmit a message (not shown) to the core compute node 140-1 via the activation link 180-1-1 and the ordering path 117, in response to reception of the incoming message 103.
  • the core compute node 140-1 is configured to receive the message (also referred to as an unsequenced message) from the gateway 120-1 and a sequence-marked version (not shown) of the message from the sequencer 150-1.
  • the sequence-marked version includes a sequence identifier (ID), such as may be included in a sequence ID field 110-14 of the sequence-marked message, as disclosed further below with regard to Fig. 1C for non-limiting example.
  • The sequence ID indicates a deterministic position of the sequence-marked version of the message among a plurality of sequence-marked versions of other messages, the other messages having been communicated via the activation link 180-1-1 and received by the sequencer 150-1 via the ordering path 117.
  • the plurality of messages among which the sequence ID indicates a deterministic position also includes the other sequenced-marked versions of messages received by the core compute node 140-1 via the ordering path 117.
  • the message (e.g., unsequenced message) and sequence-marked version include common metadata (not shown).
  • By correlating the message with its sequence-marked version via the common metadata, the sequence ID of the message is identified.
  • the sequence ID further indicates a deterministic position of the message among all messages communicated throughout the electronic trading system 100 that pass through the sequencer 150-1 and are, thus, sequence-marked by the sequencer 150-1.
  • The sequence ID determined by the sequencer 150-1 determines the position (order/priority) of the messages communicated in the electronic trading system 100. It is possible that multiple systems may timestamp messages with a same timestamp and, thus, order/priority for such messages would need to be resolved at a receiver of same. Such is not the case in the electronic trading system 100 as the sequencer 150-1 may be the sole determiner of order/priority of messages communicated throughout the electronic trading system 100.
  • the core compute node 140-1 may be configured to (i) commence a matching function activity for an electronic trade responsive to receipt of the message via the activation link 180-1-1, and (ii) responsive to receipt of the sequence-marked version via the ordering path 117, use the sequence identifier to prioritize completion of the matching function activity toward servicing the electronic trade.
  • while the core compute node 140-1 may commence the electronic trading function, that is, the matching function activity, upon receipt of the message (i.e., unsequenced message), thereby starting the processing of the unsequenced message, the core compute node 140-1 may not complete the processing and/or commit the results of the processing of the message until the core compute node 140-1 receives the sequence-marked message. Without a deterministic ordering for processing a message, as specified via the sequence identifier in the sequence-marked message, for example, the processing of messages by the compute node 140-1 could be unpredictable. As a non-limiting example of possible unpredictable results, there could be multiple outstanding unsequenced messages, each of which represents a potential match for the contra side in the exchange of a financial security. It is useful for there to be a deterministic way of arbitrating among the multiple potential matches because, perhaps, only a subset among the potential matches may be able to be filled against a given trade order on the contra side.
  • the compute node 140-1 may correlate the unsequenced message with the sequence-marked message via identifying information in both versions of the message, as discussed below in connection with Fig. 1C. Once the compute node 140-1 has received the sequence-marked message via the ordering path 117, the compute node 140-1 may then determine the proper sequence in which the message (or sequence-marked version of the message) should be processed relative to the other messages throughout electronic trading system 100. The compute node 140-1 may then complete the message processing, including sending out an appropriate response message, possibly referencing the sequence identifier assigned by the sequencer 150-1 and included in the sequence-marked message.
  • the compute node 140-1 may determine precisely the sequence in which the possible match(es) are to occur and complete the electronic trading matching function.
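The start-early, commit-in-order behavior described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Message` and `ComputeNode` classes, the use of a (node identifier, timestamp) pair as the common metadata, the use of 0 to mean "unsequenced," and the assumption that sequence identifiers are contiguous starting at 1 are all hypothetical. For simplicity, the sketch commits a message only after both of its copies have arrived.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    node_id: int          # generating node (hypothetical stand-in for field 110-11)
    timestamp: int        # node-specific unique value (stand-in for field 110-13)
    payload: str
    sequence_id: int = 0  # 0 means "not yet sequenced" in this sketch

    @property
    def key(self):
        # Common metadata used to correlate the two versions of a message.
        return (self.node_id, self.timestamp)


@dataclass
class ComputeNode:
    started: dict = field(default_factory=dict)    # key -> unsequenced message
    sequenced: dict = field(default_factory=dict)  # sequence_id -> key
    next_to_commit: int = 1
    committed: list = field(default_factory=list)

    def on_activation(self, msg):
        # Begin work as soon as the unsequenced copy arrives; commit nothing yet.
        self.started[msg.key] = msg
        self._drain()

    def on_sequenced(self, msg):
        # Record the position assigned by the sequencer.
        self.sequenced[msg.sequence_id] = msg.key
        self._drain()

    def _drain(self):
        # Commit strictly in sequence order, and only once both copies are here.
        while self.next_to_commit in self.sequenced:
            key = self.sequenced[self.next_to_commit]
            if key not in self.started:
                break
            msg = self.started.pop(key)
            del self.sequenced[self.next_to_commit]
            self.committed.append((self.next_to_commit, msg.payload))
            self.next_to_commit += 1
```

In this sketch, two messages that arrive over the activation link in one order are still committed in the order the sequencer assigned, regardless of which unsequenced copy arrived first.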
  • the sequencer 150-1 may further transmit the sequence-marked message via the second direct connection 180-gwl-sl (of the ordering path 117) to the gateway 120-1.
  • the sender that is, the gateway 120-1, to correlate the sequence number (assigned to the message) with other identifying information in the message (as discussed below in connection with Fig. 1C) so that the sender can easily deal with subsequent messages that reference that sequence number.
  • the gateway 120-1 may, upon receipt of the unsequenced response message received from the compute node 140-1 via the activation link 180-1-1, activate processing of such response message, even before the gateway 120-1 receives the sequence-marked version of the response message.
  • activating the processing could include updating the state of an open trade order database on the gateway 120-1 and/or building up the outgoing message 105 ready to be sent to the participant device 130.
  • the gateway 120-1 may not complete the processing of the response message, such processing including transmitting the outgoing message 105 to the participant device 130, until the gateway 120-1 has received the sequence-marked response message, which contains a sequence identifier specifying a deterministic position of the response message in a sequence of messages including the other messages in electronic trading system 100.
  • the gateway 120-1 may correlate the unsequenced response message with the sequence-marked response message via identifying information in both versions of the response message, as discussed below in connection with Fig. 1C. The deterministic position of the response message is thereby determined upon receipt of the sequence-marked response message.
  • the processing of the response message may then be completed, such processing including committing the outgoing message 105 to be transmitted to the participant device, such as the participant device 130 of Fig. 1A.
  • the message transmitted via the activation link 180-1-1 and sequence-marked version of the message transmitted via the ordering path 117 may include common metadata.
  • the core compute node 140-1 may be further configured to correlate the message with the sequence-marked version based on the common metadata, responsive to receipt of the sequence-marked version via the ordering path 117.
  • the message is transmitted to the core compute node 140-1 via the activation link 180-1-1 in an activation link forward direction, that is, the act-link-fwd-dir 113a, and to the core compute node 140-1 via the ordering path 117 in an ordering path forward direction, that is, the order-path-fwd-dir 115a.
  • the core compute node 140-1 may transmit a response (not shown) to the gateway 120-1 via the activation link 180-1-1 and the ordering path 117 in an activation link reverse direction (i.e., the act-link-rev-dir 113b) and an ordering path reverse direction (i.e., order-path-rev-dir 115b).
  • the activation link 180-1-1 is a single direct connection while the ordering path 117 includes multiple direct connections.
  • the ordering path 117 in the example embodiment includes both the direct connection 180-gwl-sl and direct connection 180-cl-sl.
  • the gateway 120-1, sequencer 150-1, and core compute node 140-1 are arranged in a point-to-point mesh topology, referred to as a point-to-point mesh system 102.
  • the core compute node 140-1 may be configured to perform a matching function (i.e., an electronic trading matching function) toward servicing trade requests received from participant devices 130 and introduced into the point-to-point mesh topology via the gateway 120-1.
  • the point-to-point mesh system 102 includes a first direct connection (i.e., 180-1-1), second direct connection (i.e., 180-gwl-sl), and third direct connection (i.e., 180-cl-sl).
  • the sequencer 150-1 may be configured to (i) determine a deterministic order (z.e., sequence) for messages communicated between the gateway 120-1 and core compute node 140-1 via the first direct connection and received by the sequencer 150-1 from the gateway 120-1 or core compute node 140-1 via the second or third direct connection, respectively.
  • the sequencer 150-1 may be further configured to (ii) convey position of the messages within the deterministic order by transmitting sequence-marked versions of the messages to the gateway 120-1 and core compute node 140-1 via the second and third direct connections, respectively.
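The two sequencer responsibilities above, (i) determining a deterministic order and (ii) conveying it by transmitting sequence-marked versions, can be sketched as follows. This is a simplified illustration under stated assumptions: the `Sequencer` class, the use of arrival order at the sequencer as the deterministic order, the counter starting at 1, and the modeling of each direct connection as a Python list are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, replace
from itertools import count


@dataclass(frozen=True)
class Message:
    source: str
    payload: str
    sequence_id: int = 0  # 0 = unsequenced (an assumption of this sketch)


class Sequencer:
    def __init__(self, outputs):
        self._next_id = count(1)   # strictly increasing sequence identifiers
        self._outputs = outputs    # one list per direct connection (hypothetical)

    def on_message(self, msg):
        # Arrival order at the sequencer defines the system-wide order here.
        marked = replace(msg, sequence_id=next(self._next_id))
        for link in self._outputs:
            link.append(marked)    # send the sequence-marked version on each link
        return marked
```

Because a single component hands out the identifiers, every receiver observes the same relative order without having to resolve ties between equal timestamps.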
  • the messages represent the trade requests or responses thereto, such as disclosed herein.
  • a message format for such messages is disclosed further below with regard to Fig. 1C.
  • the amount of preprocessing that may be done for an unsequenced message, and whether or not the results of that preprocessing may need to be discarded or rolled back, may depend on fields in the message, such as the message type field 110-1, symbol field 110-2, side field 110-3, or price field 110-4, according to the embodiment of Fig. 1C, disclosed further below.
  • the amount may also depend on whether other unsequenced messages are currently outstanding (that is, for which the corresponding sequence-marked message has not yet been received) that reference the same value for a common parameter in the message, such as the same stock symbol.
  • the core compute node 140-1 may load the symbol information relating to the relevant section of the order book into a fast memory. If the new order would be a match for an open order in the order book, the compute node 140-1 may start to generate a “fill” message, accordingly, but hold off on committing an order book update and on sending the “fill” message out until it receives the sequence-marked version of that message.
  • the core compute node 140-1 may perform its preprocessing differently.
  • the core compute node 140-1 may generate competing potential “fill” messages, for each of the two outstanding unsequenced “new order” messages that could serve as a match for the open order. Based on the sequenced version of the messages, one of the potential “fill” messages may be discarded, while the other would be committed to the order book and sent out to the gateways 120.
  • the compute node 140-1 may not perform any preprocessing that may need to be discarded or rolled back (e.g., may not create any potential “fill” messages), or it may abort or pause any such preprocessing for those outstanding unsequenced messages.
  • an outstanding unsequenced “new order” message that is a potential match for an open order in the order book could be competing with an outstanding unsequenced “replace order” message or “cancel order” message attempting to replace or cancel, respectively, the same open order in the order book that would serve as a potential match to the “new order” message.
  • the end result could either culminate in a match between the open order in the order book and the “new order” message, or it could instead culminate in that open order being canceled or replaced by a new order with a different price or quantity.
  • the compute node 140-1 cannot determine which of these two outcomes should result.
  • the compute node 140-1 may perform preprocessing in different ways. In some embodiments, when there are multiple competing outstanding unsequenced messages, the compute node 140-1 may simply perform preprocessing that would not need to be rolled back or discarded, such as loading into faster memory a relevant section of the order book relating to a symbol referenced in both competing messages. In other embodiments, the compute node 140-1 may perform additional preprocessing, such as forming up one or more provisional potential responses, each corresponding to one of the multiple competing scenarios.
  • the compute node 140-1 may create a potential “fill” message and/or a potential “replace acknowledgement” message or “cancel acknowledgement” message, and possibly also make provisional updates to the order book corresponding to one or more of the multiple possible outcomes. While in some embodiments, the compute node 140-1 may perform this additional preprocessing for all such competing scenarios, in other embodiments, the compute node 140-1 may only perform additional preprocessing on one of, or a subset of, the competing scenarios. For example, the compute node 140-1 may perform the additional preprocessing on an outstanding unsequenced message only if there are no other outstanding competing unsequenced messages.
  • the compute node 140-1 may prioritize the performing of additional preprocessing for outstanding competing unsequenced messages according to the amount of time and/or complexity involved in rolling back or discarding the results of the preprocessing. Upon receiving the sequence-marked versions of the outstanding unsequenced messages, the compute node 140-1 may then determine the sequence (as assigned by the sequencer 150-1) in which the outstanding unsequenced messages should be processed, and complete the processing of the messages in that sequence, which may in some embodiments include rolling back or discarding one or more results of the preprocessing.
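The competing-“fill” scenario described above can be sketched as follows. The sketch assumes, for illustration only, that the open order can be matched by just one of the competing “new order” messages and that the message earliest in the sequence prevails; the function names `prepare_fills` and `resolve`, and the dictionary layout of a provisional fill, are hypothetical.

```python
def prepare_fills(open_order_id, pending):
    # Build one provisional "fill" per competing unsequenced "new order".
    return {key: {"type": "fill", "open_order": open_order_id, "new_order": key}
            for key in pending}


def resolve(provisional, sequence_ids):
    # sequence_ids maps each correlation key to the ID assigned by the sequencer.
    winner = min(provisional, key=lambda k: sequence_ids[k])
    committed = provisional[winner]
    discarded = [fill for key, fill in provisional.items() if key != winner]
    return committed, discarded
```

For example, if provisional fills are prepared for new orders "A" and "B" and the sequencer assigns "B" the lower identifier, the fill for "B" is committed and the fill for "A" is discarded.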
  • the compute node 140-1 may additionally or alternatively perform preprocessing related to validation of the message to determine whether to accept or reject the message.
  • the preprocessing could include performing real-time risk checks on the message, such as checking that the price or quantity specified in the message does not exceed a maximum value (i.e., “max price check” or “max quantity check”), that the symbol in the message is a known symbol (i.e., “unknown symbol check”), that trading is currently permitted on that symbol (i.e., “symbol halt check”), or that the price is specified properly according to a correct number of decimal places (i.e., “sub penny check”).
  • the type of preprocessing could also include a “self trade prevention” validation check, to prevent a particular potential match from resulting in a self-trade in which a trading client matches against itself, if “self trade prevention” is enabled for the particular client or trade order. If a trade order fails one or more of these validation checks, the electronic trading system 100 may respond with an appropriate reject message. It should be understood that, even though these validation checks are described in the embodiments above as being performed by the compute node 140-1, at least some of these types of validation checks could in some embodiments be performed alternatively or additionally by a gateway 120 or other nodes in the electronic trading system 100.
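Several of the validation checks named above can be sketched as a single function. The limits, the symbol sets, the one-cent price increment, and the order in which the checks run are assumptions made for illustration; the “self trade prevention” check is omitted because it depends on order book state.

```python
from decimal import Decimal

MAX_PRICE = Decimal("100000")     # hypothetical limits
MAX_QUANTITY = 1_000_000
KNOWN_SYMBOLS = {"IBM"}
HALTED_SYMBOLS = set()
PRICE_TICK = Decimal("0.01")      # assumes prices are quoted in whole cents


def validate(symbol, price, quantity):
    """Return the name of the first failed check, or None if all pass."""
    price = Decimal(price)        # price is passed as a string to avoid float error
    if symbol not in KNOWN_SYMBOLS:
        return "unknown symbol check"
    if symbol in HALTED_SYMBOLS:
        return "symbol halt check"
    if price > MAX_PRICE:
        return "max price check"
    if quantity > MAX_QUANTITY:
        return "max quantity check"
    if price % PRICE_TICK != 0:
        return "sub penny check"
    return None
```

A non-`None` result would correspond to the system responding with an appropriate reject message.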
  • it may be beneficial or required for the gateway 120-1 to be informed of the unique system-wide sequence identifier associated with a message that originated from a client. This information may enable the gateway 120-1 to match up the original incoming message to the unique sequence number, which is used to ensure proper ordering of messages throughout the electronic trading system 100. Such a configuration at the gateway(s) may be required for the electronic trading system 100 to achieve state determinism and to provide fault-tolerance, high availability, and disaster recoverability with respect to the activity in the gateways.
  • One solution for configuring the gateway 120-1 to maintain information on the sequence identifier associated with an incoming message is for the gateway 120-1 to wait for a response back from the sequencer 150-1 with the sequence identifier before forwarding the message to the compute node 140-1.
  • Such an approach may add latency to the processing of messages.
  • the sequencer 150-1 may also send, in parallel, the sequence-marked message to the gateway 120-1.
  • the gateway 120-1 may maintain information on the sequence identifier while minimizing latency at the electronic trading system 100.
  • Fig. 1C is a table of an example embodiment of fields of a message format 110 for trading messages, such as trading messages exchanged among nodes in the electronic trading system 100 disclosed above.
  • the message format 110 is a normalized message format, intended to be used for an internal (that is, within the electronic trading system 100) representation of trading messages when they are exchanged among nodes within electronic trading system 100.
  • gateways 120 exchange messages between the participants 130 and electronic trading system 100, and translate such messages between format(s) specified by one or more financial trading protocols used by the participants 130 and the normalized trading format used among nodes in the electronic trading system 100.
  • the fields 110-1 through 110-17 are for non-limiting example and that the message format 110 may include more, fewer, or different fields, and that the order of such fields is not limited to as shown in FIG. 1C.
  • the fields in the message format 110 are shown in this example in a single message format, they may be distributed across multiple message formats, or encapsulated in layered protocols. For example, in other embodiments, a subset of fields in the message format 110 may be included as part of a header, trailer, or extension field(s) in a layered protocol that encapsulates other fields of the message format 110 in a message payload.
  • the message format 110 may define one or more fields of data encapsulated in a payload (data) section of another message format, including without limitation a respective payload section of an IP datagram, a UDP datagram, a TCP packet, or of a message data frame format, such as an Ethernet data frame format or other data frame format, including InfiniBand, Universal Serial Bus (USB), PCI Express (PCI-e), and High-Definition Multimedia Interface (HDMI), for non-limiting example.
  • the message format 110 includes fields 110-1... 110-6 which correspond to information that may be included in messages sent or received according to a financial trading protocol for communication with one or more participant devices 130.
  • the message type field 110-1 indicates a trading message type.
  • Some trading message types (such as, message types “new order,” “replace order,” or “cancel order”) correspond to messages received from participant devices 130, while other message types (such as, “new order acknowledgement,” “replace order acknowledgement,” “cancel order acknowledgement,” “fill,” “execution report,” “unsolicited cancel,” “trade bust,” or various reject messages) correspond to messages that are generated by the electronic trading system 100 and are included in trading messages sent to the participant devices 130.
  • the message format 110 also includes a symbol field 110-2, which includes an identifier for a traded financial security, such as a stock symbol or stock ticker. For example, “IBM” is the stock symbol for “International Business Machines Corporation.”
  • the side field 110-3 in the message format 110 may be used to indicate the “side” of the trading message, such as whether the trading message is a “buy,” “sell,” or a “sell short.”
  • the price field 110-4 may be used to indicate a desired price to buy or sell the security.
  • the quantity field 110-5 may be used to indicate a desired quantity of the security (e.g., number of shares).
  • the message format 110 may also include the order token field 110-6, which may be populated with an “order token” or “client order ID” initially provided by a participant device 130 to uniquely identify a new order in the context of a particular trading session (i.e., “connection” or “flow”) established between the participant device 130 and the electronic trading system via a gateway 120.
  • fields 110-1... 110-6 are representative fields that are usually included for most message types according to most financial trading protocols, but that the message format 110 may well include additional or alternate fields, especially for supporting particular message types or particular financial trading protocols.
  • “replace order” and “cancel order” message types require the participant 130 to supply an additional order token to represent the replaced or canceled order, to distinguish it from the original order.
  • a “replace order” and a “cancel order” typically may also include a replaced/canceled quantity field.
  • a “replace order” may include a replace price field.
  • These additional replace/cancel order token fields, replace price fields, and replaced/canceled quantity fields may also be included in corresponding acknowledgement messages sent by electronic trading system 100.
  • the message format 110 includes fields 110-11... 110-17 that may be used internally within electronic trading system 100, and do not necessarily correspond to fields in messages exchanged with participant devices 130.
  • a node identifier field 110-11 may uniquely identify each node in electronic trading system 100.
  • a generating node may include its node identifier in messages it introduces into the electronic trading system 100.
  • each gateway 120 may include its node identifier in messages it forwards from participant devices 130 to compute nodes 140 and/or sequencers 150.
  • each compute node 140 may include its node identifier in messages it generates (for example, acknowledgements, executions, or types of asynchronous messages intended ultimately to be forwarded to one or more participant devices 130) to be sent to other nodes in the electronic trading system 100.
  • each message introduced into the electronic trading system 100 may be associated with the message’s generating node.
  • the message format 110 may also include a flow identifier field 110-12.
  • each trading session (i.e., “connection” or “flow”) may be identified with a flow identifier that is intended to be unique throughout the electronic trading system 100.
  • a participant device 130 may be connected to the electronic trading system 100 over one or more flows, and via one or more of the gateways 120.
  • the version of the messages according to the normalized message format 110 (used among nodes in the electronic trading system 100) of all messages exchanged between a participant device 130 and the electronic trading system 100 over a particular flow would include a unique identifier for that flow in the flow identifier field 110-12.
  • the flow identifier field 110-12 is populated by a message’s generating node.
  • a gateway 120 may populate the flow identifier field 110-12 with the identifier of the flow associated with a message it receives from a participant 130 that the gateway 120 introduces into electronic trading system 100.
  • a core compute node 140 may populate the flow identifier field 110-12 with the flow identifier associated with messages it generates (i.e., response messages, such as acknowledgement messages or fills, or other outgoing messages including asynchronous messages).
  • the flow identifier field 110-12 contains a value that uniquely identifies a logical flow, which actually could be implemented for purposes of high availability as multiple redundant trading session connections, possibly over multiple gateways. That is, in some embodiments, the same flow ID may be assigned to two or more redundant flows between participant device(s) 130 and gateway(s) 120. In such embodiments, the redundant flows may be either in an active/standby configuration or an active/active configuration. In an active/active configuration, functionally equivalent messages may be exchanged between participant device(s) 130 and gateway(s) 120 simultaneously over multiple redundant flows in parallel.
  • a trading client may send in parallel over the multiple redundant flows functionally equivalent messages simultaneously to the electronic trading system 100, and receive in parallel over the multiple redundant flows multiple functionally equivalent responses from the electronic trading system 100, although the electronic trading system 100 may only take action on a single such functionally equivalent message.
  • a single flow at a time among the multiple redundant flows may be designated as an active flow, whereas the other flow(s) among the multiple redundant flows may be designated standby flow(s), and the trading messages would only actually be exchanged over the currently active flow.
  • messages exchanged over any of the redundant flows may be identified with the same flow identifier stored by the messages’ generating nodes in the flow identifier field 110-12 of the normalized message format 110.
  • messages exchanged among nodes in the electronic trading system 100 are sent to the sequencer 150 to be marked with a sequence identifier.
  • the message format 110 includes sequence identifier field 110-14.
  • an “unmarked message” may be sent with a sequence identifier field 110-14 having an empty, blank (e.g., zero) value.
  • the sequence identifier field 110-14 of an unmarked message may be set to a particular predetermined value that the sequencer would never assign to a message, or to an otherwise invalid value.
  • Still other embodiments may specify that a message is unmarked via an indicator in another field (not shown) of the message, such as a Boolean value or a flag value indicating whether a message has been sequenced.
  • the sequencer 150 may then populate the sequence identifier field 110-14 of the unmarked message with a valid sequence identifier value, thereby producing a “sequence marked message.”
  • the valid sequence identifier value in sequence identifier field 110-14 of the sequence marked message uniquely identifies the message and also specifies a deterministic position of the marked message in a relative ordering of the marked message among other marked messages throughout electronic trading system 100.
  • a “sequence marked message” sent by the sequencer 150 may then be identical to a corresponding unmarked message received by the sequencer except that the sequence marked message’s sequence identifier field 110-14 contains a valid sequence identifier value.
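The relationship between an unmarked message and its sequence marked counterpart can be sketched as follows. The `TradingMessage` class models only a subset of the fields of the message format 110, and the choice of 0 as the value that denotes an unmarked message is one of the options described above, selected here as an assumption; `mark` and `differing_fields` are hypothetical helper names.

```python
from dataclasses import dataclass, replace, asdict

UNMARKED = 0  # assumed sentinel that the sequencer never assigns


@dataclass(frozen=True)
class TradingMessage:
    message_type: str    # field 110-1
    symbol: str          # field 110-2
    side: str            # field 110-3
    price: str           # field 110-4
    quantity: int        # field 110-5
    order_token: str     # field 110-6
    sequence_id: int = UNMARKED  # field 110-14

    @property
    def is_marked(self):
        return self.sequence_id != UNMARKED


def mark(msg, sequence_id):
    # Produce the sequence marked version; every other field is copied as-is.
    if sequence_id == UNMARKED:
        raise ValueError("sequencer must not assign the unmarked sentinel")
    return replace(msg, sequence_id=sequence_id)


def differing_fields(a, b):
    # List the field names whose values differ between two messages.
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]
```

Comparing a message with its marked version through `differing_fields` yields only `sequence_id`, which mirrors the statement that the two versions may be identical apart from the sequence identifier field.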
  • the message format 110 may, in some embodiments, also include the reference sequence identifier field 110-15.
  • a generating node may populate the reference sequence identifier field 110-15 of a new message it generates with the value of a sequence number of a prior message related to the message being generated.
  • the value in the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 to correlate a message with a prior associated message.
  • the prior associated message referenced in the reference sequence identifier field 110-15 may be a prior message in the same “order chain” (i.e., “trade order chain”).
  • messages may be logically grouped into an “order chain,” a set of messages over a single flow that reference or “descend from” a common message.
  • An order chain typically starts with a “new order message” sent by a participant device 130.
  • the next message in the order chain is typically a response by the electronic trading system (e.g., either a “new order acknowledgement” message when the message is accepted by the trading system, or a “new order reject” message, when the message is instead rejected by the trading system, perhaps for having an invalid format or invalid parameters, such as an invalid price for non-limiting example).
  • An order chain may also include a “cancel order” message sent by participant device 130, canceling at least a portion of the quantity of a prior acknowledged (but still open, that is, containing at least some quantity that is not canceled and/or not filled) new order.
  • the “cancel order” message may again either be acknowledged or rejected by the electronic trading system with a “cancel order acknowledgement” or a “cancel order reject” message, which would also be part of the order chain.
  • An order chain may also include a “replace order” message sent by participant device 130, replacing the quantity and/or the price of a prior acknowledged (but still open) new order.
  • the “replace order” message may again either be acknowledged or rejected by the electronic trading system with a “replace order acknowledgement” or a “replace order reject” message, which would also be part of the order chain.
  • a prior acknowledged order that is still open may be matched with one or more counter orders of the opposite side (that is, “buy” on one side and “sell” or “sell short” on the other side), and the electronic trading system 100 may then generate a complete “fill” message (when all of the open order’s quantity is filled in a single match) or one or more partial “fill” messages (when only a portion of the open order’s quantity is filled in a single match), and these “fill” messages would also be part of the order chain.
  • the reference sequence identifier in general, may identify another prior message in the same order chain.
  • the value for the reference sequence number may be the sequence number assigned by the sequencer for an “incoming” message originating from a participant device 130 and introduced into electronic trading system 100 by a gateway 120, such that a corresponding “outgoing” message, such as a response message generated by a compute node 140, may reference the sequence number value of the incoming message to which it is responding.
  • a “new order acknowledgement” message or a “fill” message generated by a compute node 140 would include in the reference sequence identifier field 110-15 the value for the sequence identifier assigned to the corresponding “new order” message to which the compute node 140 is responding with a “new order acknowledgement” message or fulfilling the order with the “fill” message.
  • the value for the reference sequence identifier field 110-15 need not necessarily be that of a message that is being directly responded to by the electronic trading system 100, but may be that of a prior message that is part of the same order chain, for example, the sequence number of a “new order” or a “new order acknowledgement.”
  • the gateways 120 may also populate the reference sequence identifier field 110-15 in messages they introduce into electronic trading system 100 with a value of a sequence identifier for a related prior message. For example, a gateway 120 may populate the reference sequence identifier field 110-15 in a “cancel order” or a “replace order” message with the value of the sequence identifier assigned to the prior corresponding “new order” or “new order acknowledgment” message.
  • core compute nodes 140 may also populate the reference sequence identifier field 110-15 for a corresponding “cancel order acknowledgement” message or “replace order acknowledgement” message with the value of the sequence identifier for the “new order” or “new order acknowledgment,” rather than that of the message to which the compute node 140 was directly responding (e.g., rather than the sequence identifier of the “cancel order” or “replace order” message).
  • the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 generally to correlate a message with one or more prior messages in the same order chain.
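Correlating messages into order chains by way of the reference sequence identifier can be sketched as follows. The tuple layout, the use of `None` for a message that starts a chain, and the assumption that a referenced message always appears earlier in the input are hypothetical simplifications.

```python
from collections import defaultdict


def group_order_chains(messages):
    """messages: iterable of (sequence_id, reference_sequence_id, message_type)
    in sequence order; reference_sequence_id is None for a chain's first message."""
    root_of = {}
    chains = defaultdict(list)
    for seq_id, ref_id, msg_type in messages:
        # A message with no reference starts its own chain; otherwise it joins
        # the chain of the message it references.
        root = seq_id if ref_id is None else root_of[ref_id]
        root_of[seq_id] = root
        chains[root].append(msg_type)
    return dict(chains)
```

The sketch works whether a response references the message it directly answers or an earlier message in the same chain, because each reference is resolved to the chain's first message.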
  • a generating node may also include a node-specific timestamp field 110-13 in messages it introduces into electronic trading system 100.
  • the value in the node-specific timestamp field 110-13 may be unique among a subset of messages, those messages introduced into electronic trading system 100 by a particular generating node. While referred to herein as a “timestamp,” a value placed in the node-specific timestamp field 110-13 may be any suitable value that is unique among messages generated by that node. For example, the node-specific timestamp may be in fact a timestamp or any suitable monotonically increasing or decreasing value.
  • Some embodiments may include other timestamp fields in the message format.
  • some message formats may include a reference timestamp field, which may be a timestamp value assigned by the generating node of a prior, related message.
  • a compute node 140 may include a new timestamp value in the node-specific timestamp field 110-13 for messages that it generates, and may also include a timestamp value from a related message in a reference timestamp field of the message the compute node generates.
  • a “new order acknowledgement” message generated by the compute node may include a timestamp value of the “new order” to which it is responding in the reference timestamp field of the “new order acknowledgement message.”
  • compute nodes 140 may not include a new timestamp value in the node-specific timestamp field 110-13 in messages they generate, but may simply populate that node-specific timestamp field 110-13 with a timestamp value from a prior related message.
  • the message format 110 may include a time based value (TBV) field 110-118.
  • the TBV field may correspond to a time that an incoming message was received by the gateway 120-1 (referred to here as the ingress time or arrival time), or in other implementations, may correspond to a desired exit time (referred to herein as the egress time or exit time, Tex) for a corresponding response message to be returned by the system.
  • TBV is determined from the time of receipt plus some deterministic time delay value (as described in more detail below).
  • the TBV is typically not the same field as the “node-specific timestamp” field 110-13 or the sequence numbers (e.g., a field different from either the sequence ID 110-14 or reference sequence ID 110-15) referenced above.
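  • The egress-time variant of the TBV described above (time of receipt plus a deterministic delay) can be illustrated with a minimal sketch. The names `Message`, `stamp_egress_tbv`, `hold_time_ns`, and the value of `FIXED_DELAY_NS` are hypothetical and chosen only for illustration; the hold-time calculation is an assumption about how a gateway might use the value, not something stated in the text.

```python
from dataclasses import dataclass

# Hypothetical delay value; the text only says "some deterministic time delay value".
FIXED_DELAY_NS = 50_000


@dataclass
class Message:
    payload: str
    tbv_ns: int = 0  # stands in for the TBV field of the message format


def stamp_egress_tbv(msg: Message, ingress_ns: int,
                     delay_ns: int = FIXED_DELAY_NS) -> Message:
    """Set the TBV to the desired exit time: time of receipt plus a fixed delay."""
    msg.tbv_ns = ingress_ns + delay_ns
    return msg


def hold_time_ns(msg: Message, now_ns: int) -> int:
    """Assumed usage: how long a response would be held so it exits at the TBV."""
    return max(0, msg.tbv_ns - now_ns)
```

  • For example, a message received at 1,000,000 ns is stamped with a TBV of 1,050,000 ns; if the response is ready at 1,020,000 ns, it would be held for the remaining 30,000 ns under this assumption.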
  • the message format 110 may also include an entity type field 110-16 and entity count field 110-17.
  • entity type of a message may depend on whether it is introduced into the electronic trading system 100 by a gateway 120 or a compute node 140, or in other words, whether the message is an incoming message being received at a gateway 120 from a participant device 130 or whether it is an outgoing message being generated by a compute node 140 to be sent to a participant device 130.
  • incoming messages are considered to be of entity type “flow” (and the entity type field 110-16 is populated by the gateways 120 for incoming messages with a value representing the type “flow”), while outgoing messages are considered to be of entity type “symbol” (and the entity type field 110-16 is populated by the compute nodes 140 for outgoing messages with a value representing the type “symbol”).
  • the entity count of type “flow” is maintained by gateways 120, and the entity count of type “symbol” is maintained by the compute nodes 140.
  • a gateway 120 may maintain a per flow incoming message count, counting incoming messages received by the gateway 120 over each flow active on the gateway 120. For example, if four non-redundant flows are active on a gateway 120, each flow would be assigned a unique flow identifier, as discussed above, and the gateway 120 would maintain a per flow incoming message count, counting the number of incoming messages received over each of those four flows. In such embodiments, the gateway 120 populates the entity count field 110-17 of an incoming message with the per flow incoming message count associated with the incoming message’s flow (as identified throughout the electronic trading system 100 by a flow identifier value, populated in the flow identifier field 110-12 of the message).
  • each underlying redundant flow may be assigned the same flow identifier, yet a per-flow incoming message count may still be maintained separately for each redundant flow, especially when the redundant flows are implemented on separate gateways 120. Because it is the expectation that a participant device 130 will send the same set of messages in the same order (i.e., sequence) to the electronic trading system 100 over each of the redundant flows, it is also the expectation that the entity count assigned to functionally equivalent messages received over separate redundant flows should be identical.
  • These functionally equivalent incoming messages may be forwarded by the gateway(s) 120 to sequencer 150 and the compute nodes 140. Accordingly, in such embodiments, the sequencer 150 and compute nodes 140 could receive multiple functionally equivalent incoming messages associated with the same flow identifier, but the sequencer 150 and compute nodes 140 could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same flow identifier.
  • the sequencer 150 and compute nodes 140 may keep track, on a per flow basis, of the highest entity count that has been included in entity count field 110-17 of incoming messages associated with each flow, which allows the sequencer 150 and compute nodes 140 to take action only on the first to arrive of multiple incoming functionally equivalent messages each node has received, and to ignore other subsequently arriving functionally equivalent incoming messages.
  • the sequencer 150 may in some embodiments only sequence the first such functionally equivalent incoming message to arrive, and the compute nodes 140 may only start processing on the first such functionally equivalent message to arrive.
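  • If an incoming message received by a node (i.e., a sequencer 150 or a compute node 140) has an entity count that is the same as or lower than the highest entity count the node has previously seen for that flow, then the node may assume that the incoming message is functionally equivalent to another previously received incoming message, and may simply ignore the subsequently received functionally equivalent incoming message.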
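  • The per-flow counting and first-to-arrive filtering described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class names `FlowCounter` and `DuplicateFilter` are hypothetical, and starting the count at 1 is an assumption.

```python
from collections import defaultdict


class FlowCounter:
    """Gateway side: maintains a per-flow incoming message count."""

    def __init__(self):
        self._counts = defaultdict(int)

    def stamp(self, flow_id: str) -> int:
        """Return the entity count for the next incoming message on this flow."""
        self._counts[flow_id] += 1  # assumption: counts start at 1
        return self._counts[flow_id]


class DuplicateFilter:
    """Recipient side: tracks the highest entity count seen per flow."""

    def __init__(self):
        self._highest = {}

    def accept(self, flow_id: str, entity_count: int) -> bool:
        """True for the first-to-arrive message, False for an equivalent duplicate."""
        if entity_count <= self._highest.get(flow_id, 0):
            return False  # same or lower count: treat as functionally equivalent
        self._highest[flow_id] = entity_count
        return True
```

  • With two gateways carrying redundant flows that share the identifier `"F1"`, each gateway's counter yields the same entity count for the same participant message, so a recipient accepts whichever copy arrives first and ignores the other.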
  • a compute node 140 may maintain a per symbol outgoing message count, counting outgoing messages generated by and sent from the compute node 140 for each symbol serviced by the compute node 140. For example, if four symbols (e.g., MSFT, GOOG, IBM, ORCL) are serviced by a compute node 140, each symbol is assigned a symbol identifier populated in symbol field 110-2 of the message, as discussed above, and the compute node 140 would maintain a per symbol outgoing message count, counting the number of outgoing messages it generated and sent that serviced each of those four symbols.
  • the compute node 140 populates the entity count field 110-17 of an outgoing message with the per symbol outgoing message count associated with the outgoing message’s symbol (as identified throughout the electronic trading system 100 by the value populated in the symbol identifier field 110-2 of the message).
  • compute nodes may be configured such that multiple compute nodes service a particular symbol in parallel, for reasons of high availability. Because of the deterministic ordering of messages throughout electronic trading system 100 provided by the sequencer 150, it can be guaranteed that even when multiple compute nodes service a given symbol, they will be processing incoming messages referencing the same symbol in the same order (i.e., sequence) and in the same manner, thereby generating functionally equivalent response messages in parallel. When considering the outgoing messages being sent out for a particular symbol across multiple compute nodes 140, each outgoing message referencing that symbol should have a functionally equivalent message being sent out by each other compute node 140 actively servicing that symbol.
  • These functionally equivalent outgoing messages may all be sent by the compute nodes 140 to sequencer 150 and the gateways 120. Accordingly, in such embodiments, the sequencer 150 and gateways 120 could receive multiple functionally equivalent outgoing messages associated with the same symbol, but the sequencer 150 and gateways 120 could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same symbol identifier. In some embodiments, the sequencer 150 and gateways 120 may keep track, on a per symbol basis, of the highest entity count that has been included in entity count field 110-17 of outgoing messages associated with the symbol, which allows the sequencer 150 and gateways 120 to take action only on the first to arrive of multiple outgoing functionally equivalent messages each node has received, and to ignore other subsequently arriving functionally equivalent outgoing messages.
  • the sequencer 150 may in some embodiments only sequence the first such functionally equivalent outgoing message to arrive.
  • the gateways 120 may only start processing the first such functionally equivalent message to arrive. If an outgoing message received by a node (i.e., a sequencer 150 or a gateway 120) has an entity count that is the same or lower than the highest entity count the node has previously seen for that symbol, then the node may assume that the outgoing message is functionally equivalent to another previously received outgoing message, and may simply ignore the subsequently received functionally equivalent outgoing message.
  • In embodiments in which the sequencer 150 only sequences the first message of a plurality of functionally equivalent messages to arrive at the sequencer, the sequencer could do so in a variety of ways.
  • other subsequently arriving messages that are functionally equivalent to that first such functionally equivalent message to arrive may simply be ignored by the sequencer (in which case only a single sequence marked message may be outputted by the sequencer for the set of functionally equivalent messages).
  • sequencer may track a sequence number that it assigns to the first functionally equivalent message, for example, by making an association between the entity count of the message, its flow identifier or symbol identifier (for messages having entity types of “flow” and “symbol”, respectively), and its sequence number, such that the sequencer may output a sequenced version of each functionally equivalent message in which the value of the sequence identifier field 110-14 for all the sequenced versions of the functionally equivalent messages is the same as had been assigned by the sequencer to the first message to arrive among the functionally equivalent messages received by the sequencer 150.
  • the sequencer 150 may not keep track of whether messages are functionally equivalent, and may assign each unsequenced message that arrives at the sequencer 150 a unique sequence number, regardless of whether that message is among a plurality of functionally equivalent messages.
  • the sequenced versions of messages among a plurality of functionally equivalent messages are each assigned different sequence identifiers by the sequencer as the value in the sequence identifier field 110-14.
  • the recipient node of sequenced functionally equivalent messages in such embodiments may use the sequence identifier in the sequenced version of the message among the sequenced functionally equivalent messages that is first to arrive at the node.
  • sequenced versions of the messages are sent out in sequenced order by the sequencer 150, and accordingly, should be received in the same sequenced order among all nodes directly connected to the sequencer. Therefore, for all nodes receiving the sequenced messages via respective direct point-to-point connections with the sequencer, the first sequenced message to arrive among a plurality of functionally equivalent sequenced messages should have the same value in the sequence identifier field 110-14.
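  • Two of the sequencer behaviors described above can be contrasted in a short sketch: reusing the sequence identifier of the first-to-arrive message for later functionally equivalent messages, or assigning every arriving message a unique sequence identifier. The class `Sequencer` and its `dedupe` flag are hypothetical names, and the unbounded dictionary is a simplification for illustration only.

```python
from itertools import count


class Sequencer:
    """Illustrative sequencer; not the patented implementation."""

    def __init__(self, dedupe: bool = True):
        self._next = count(1)   # assumption: sequence identifiers start at 1
        self._assigned = {}     # (entity type, identifier, entity count) -> sequence id
        self._dedupe = dedupe

    def sequence(self, entity_type: str, entity_id: str, entity_count: int) -> int:
        """Return the sequence identifier for a message.

        entity_id is the flow identifier for type "flow" and the symbol
        identifier for type "symbol".
        """
        key = (entity_type, entity_id, entity_count)
        if self._dedupe and key in self._assigned:
            # Functionally equivalent message: reuse the first arrival's identifier.
            return self._assigned[key]
        seq = next(self._next)
        self._assigned.setdefault(key, seq)
        return seq
```

  • With `dedupe=True`, two copies of the message `("flow", "F1", 1)` both receive sequence identifier 1. With `dedupe=False`, they receive 1 and 2, and a recipient would use the identifier on whichever sequenced copy reaches it first.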
  • a combination of a message’s flow identifier and node specific timestamp may be sufficient to uniquely identify the message throughout electronic trading system 100.
  • a combination of a flow identifier and entity count could be sufficient to uniquely identify a message of entity type “flow,” and a combination of a symbol identifier and entity count could be sufficient to uniquely identify a message of entity type “symbol.”
  • sequence identifier is still necessary in order to specify in a fair and deterministic manner the relative ordering of the message among other messages generated by other nodes throughout electronic trading system 100.
  • the node-specific timestamp is in fact implemented as a timestamp value, even if system clocks among nodes are perfectly synchronized, two different messages, each generated by a different node, may each be assigned the same timestamp value by their respective generating node, and the relative ordering between these two messages is then ambiguous. Even if the messages can be identified uniquely, a recipient node of both messages would still need a way to determine the relative ordering of the two messages before taking possible action on the messages.
  • One possible approach for a recipient node to resolve that ambiguity could be through the use of randomness, for example, by randomly selecting one message as preceding the other in the relative ordering of messages throughout the electronic trading system 100. Using randomness to resolve the ambiguity, however, does not support “state determinism” throughout the electronic trading system 100. Different recipient nodes may randomly determine a different relative ordering among the same set of messages, resulting in unpredictable, nondeterministic behavior within electronic trading system 100, and impeding the correct implementation of important features, such as fault tolerance, high availability, and disaster recovery.
  • Another approach for a recipient node to resolve the ambiguity in ordering could be through a predetermined precedence method, for example, based on the node identifier associated with the message.
  • Such an approach works against the important goal of fairness, by giving some messages higher precedence simply based on the node identifier of the node that introduced the message into electronic trading system 100. For example, some participant devices 130 could be favored simply because they happen to be connected to the electronic trading system 100 over a gateway 120 that is deemed higher in the predetermined precedence method.
  • sequence identifier assigned to a message by the sequencer 150 may still be required in order to fairly and deterministically specify the ordering of a message relative to other messages in the electronic trading system 100.
  • the sequencer 150 (or the single currently active sequencer, if multiple sequencers 150 are present) serves as the authoritative source of a truly deterministic ordering among sequence-marked messages throughout the electronic trading system 100.
  • nodes in electronic trading system 100 may receive two versions of a message: an unsequenced (unmarked) version of the message as introduced into the electronic trading system 100 by the generating node, and a (marked) version of the message that includes a sequence identifier assigned by the sequencer 150. This may be the case in embodiments in which a generating node sends the unmarked message to one or more recipient nodes as well as the sequencer 150. The sequencer 150 may then send a sequence-marked version of the same message to a set of nodes including the same recipient nodes.
  • While the sequence-marked version of the message is useful for determining the relative processing order (i.e., position in a sequence) of the message among other marked messages in electronic trading system 100, it may also be useful for a recipient node to receive the unmarked version of the message.
  • It is certainly possible, if not expected (for example, in embodiments in which there are direct connections between nodes), for the unmarked version of the message to be received prior to the marked version of the message, because the marked version of the message is sent via an intervening hop through sequencer 150. Accordingly, there is the opportunity, in some embodiments, for a recipient node to activate processing of the unmarked message upon receiving the unmarked message even before that recipient node has received the marked version of the message which authoritatively indicates the relative ordering of the marked message among other marked messages.
  • a node receiving both the marked and unmarked versions of a same message may correlate the two versions of the message to each other via the same identifying information or “common metadata,” in both versions of the message.
  • a generating node may include in messages it generates (i.e., unmarked messages) a node identifier and a node specific timestamp, which together, may uniquely identify each message throughout electronic trading system 100.
  • the marked message may also include the same node identifier and node specific timestamp that are also included in the corresponding unmarked message, thereby allowing a recipient node of both versions of the message to correlate the marked and unmarked versions. Accordingly, while the marked messages directly indicate relative ordering of a marked message relative to the other marked messages throughout electronic trading system 100, because of the correlation that may be made between the unmarked and marked version of the same message, marked messages, (at least indirectly via the correlation discussed above), indicate the relative ordering of the message relative to other messages (marked or unmarked) throughout electronic trading system 100.
  • nodes in electronic trading system 100 may also correlate sequence marked with unmarked versions of the messages by means of the other manners of uniquely identifying messages discussed above.
  • a correlation between sequence marked and unmarked messages may be made by means of a combination of a flow identifier and a node specific timestamp.
  • Such a correlation may additionally or alternatively be made by means of a message’s entity count along with the symbol identifier or flow identifier in the message, for messages having entity type “symbol” and “flow,” respectively.
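  • The correlation of marked and unmarked versions through common metadata can be sketched as a lookup keyed on the node identifier and node-specific timestamp. The class `Correlator` and its method names are hypothetical; the same structure would apply to the other keys mentioned above (flow identifier with timestamp, or entity count with flow or symbol identifier).

```python
class Correlator:
    """Illustrative recipient-side matching of unmarked and marked messages."""

    def __init__(self):
        self._pending = {}  # (node_id, node_timestamp) -> unmarked payload

    def on_unmarked(self, node_id: str, node_ts: int, payload: str) -> None:
        """Remember an unmarked message so processing can begin early."""
        self._pending[(node_id, node_ts)] = payload

    def on_marked(self, node_id: str, node_ts: int, seq_id: int):
        """Pair the marked version with its unmarked counterpart, if already seen.

        Returns (seq_id, payload); payload is None when the unmarked version
        has not arrived yet.
        """
        payload = self._pending.pop((node_id, node_ts), None)
        return seq_id, payload
```

  • A recipient that receives the unmarked message from gateway `"GW1"` with timestamp 42, and later the marked version carrying sequence identifier 7 with the same two values, obtains the pair `(7, payload)` and thereby learns the authoritative position of work it may already have started.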
  • participant devices 130 exchanging messages with the electronic trading system 100 are often very sensitive to latency, preferring low, predictable latency.
  • the arrangement shown in Fig. 1A accommodates this requirement by providing a point-to-point mesh 172 architecture between at least each of the gateways 120 and each of the compute nodes 140.
  • each gateway 120 in the mesh 172 may have a dedicated high-speed direct connection 180 to the compute nodes 140 and the sequencers 150.
  • dedicated connection 180-1-1 is provided between gateway 1 120-1 and core compute node 1 140-1, dedicated connection 180-1-2 between gateway 1 120-1 and core compute node 2 140-2, and so on, with example connection 180-g-c provided between gateway 120-g and core compute node c 140-c, and example connection 180-s-c provided between sequencer 150 and core compute node c 140-c.
  • each dedicated connection 180 in the point-to-point mesh 172 is, in some embodiments, a point-to-point direct connection that does not utilize a shared switch.
  • a dedicated or direct connection may be referred to interchangeably herein as a direct or dedicated “link” and is a direct connection between two end points that is dedicated (e.g., non-shared) for communication therebetween.
  • a dedicated/direct link may be any suitable interconnect(s) or interface(s), such as disclosed further below, and is not limited to a network link, such as wired Ethernet network connection or other type of wired or wireless network link.
  • the dedicated/direct connection/link may be referred to herein as an end-to-end path between the two end points.
  • Such an end-to-end path may be a single connection/link or may include a series of connections/links; however, bandwidth of the dedicated/direct connection/link in its entirety, that is, from one end point to another end point, is non-shared and neither bandwidth nor latency of the dedicated/direct connection/link can be impacted by resource utilization of element(s) if so traversed.
  • the dedicated/direct connection/link may traverse one or more buffer(s) or other elements that are not bandwidth or latency impacting based on utilization thereof.
  • the dedicated/direct connection/link would not, however, traverse a shared network switch as such a switch can impact bandwidth and/or latency due to its shared usage.
  • the dedicated connections 180 in the point-to-point mesh 172 may be provided in a number of ways, such as 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, InfiniBand, Peripheral Component Interconnect - Express (PCIe), RapidIO, Small Computer System Interface (SCSI), FireWire, Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or custom serial or parallel busses.
  • Although components of the system may sometimes be referred to herein as “nodes”, the use of terms such as “compute node” or “gateway node” or “sequencer node” or “mesh node” should not be interpreted to mean that particular components are necessarily connected using a network link, since other types of interconnects or interfaces are possible.
  • a “node,” as disclosed herein, may be any suitable hardware, software, firmware component(s), or combination thereof, configured to perform the respective function(s) set forth for the node.
  • a node may be a programmed general purpose processor, but may also be a dedicated hardware device, such as a field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other hardware device or group of devices, logic within a hardware device, printed circuit board (PCB), or other hardware component.
  • nodes disclosed herein may be separate elements or may be integrated together within a single element, such as within a single FPGA, ASIC, or other element configured to implement logic to perform the functions of such nodes as set forth herein. Further, a node may be an instantiation of software implementing logic executed by general purpose computer and/or any of the foregoing devices.
  • dedicated connections 180 are also provided directly between each gateway 120 and each sequencer 150, and between each sequencer 150 and each core compute node 140. Furthermore, in some embodiments, dedicated connections 180 are provided among all the sequencers, so that an example sequencer 150-1 has a dedicated connection 180 to each other sequencer 150-2, . . ., 150-s. While not pictured in Fig. 1A, in some embodiments, dedicated connections 180 may also be provided among all the gateways 120, so that each gateway 120-1 has a dedicated connection 180 to each other gateway 120-2, ..., 120-g. Similarly, in some embodiments, dedicated connections 180 are also provided among all the compute nodes 140, so that an example core compute node 140-1 has a dedicated connection 180 to each other core compute node 140-2, ..., 140-c.
  • a dedicated connection 180 between two nodes may in some embodiments be implemented as multiple redundant dedicated connections between those same two nodes, for increased redundancy and reliability.
  • the dedicated connection 180-1-1 between gateway 120-1 and core compute node 140-1 (e.g., Core 1)
  • any message sent out by a node is sent out in parallel to all nodes directly connected to it in the point-to-point mesh 172.
  • Each node in the point-to-point mesh 172 may determine for itself, for example, based on the node’s configuration, whether to take some action upon receipt of a message, or whether instead simply to ignore the message.
  • a node may never completely ignore a message; even if the node, due to its configuration, does not take substantial action upon receipt of a message, it may at least take minimal action, such as consuming any sequence number assigned to the message by the sequencer 150. That is, in such embodiments, the node may keep track of a last received sequence number to ensure that when the node takes more substantial action on a message, it does so in proper sequenced order.
  • a message containing a trade order to “Sell 10 shares of Microsoft at $190.00” might originate from participant device 130-1, such as a trader’s personal computer, and arrive at gateway 120-1 (i.e., GW 1). That message will be sent to all core compute nodes 140-1, 140-2, . . ., 140-c even though only core compute node 140-2 is currently performing matching for Microsoft orders. All other core compute nodes 140-1, 140-3, . . ., 140-c may upon receipt ignore the message or only take minimal action on the message. For example, the only action taken by 140-1, 140-3, . . ., 140-c may be to consume the sequence number assigned to the message by the sequencer 150-1.
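  • The "minimal action" described above, in which a node consumes every sequence number but only processes messages for the symbols it services, can be sketched as follows. The class `ComputeNode` and its fields are hypothetical, and rejecting a sequence number that does not advance is an assumption about how a node might enforce sequenced order.

```python
class ComputeNode:
    """Illustrative node that tracks sequence numbers for all messages."""

    def __init__(self, symbols):
        self.symbols = set(symbols)  # symbols this node is configured to service
        self.last_seq = 0            # last received sequence number
        self.processed = []          # (sequence id, payload) actually acted upon

    def on_sequenced(self, seq_id: int, symbol: str, payload: str) -> bool:
        """Consume the sequence number; process only if the symbol is serviced."""
        if seq_id <= self.last_seq:
            # Assumption: a non-advancing sequence number is treated as an error.
            raise ValueError(f"out-of-order sequence id {seq_id}")
        self.last_seq = seq_id       # minimal action taken on every message
        if symbol not in self.symbols:
            return False             # otherwise ignored by this node
        self.processed.append((seq_id, payload))
        return True
```

  • A node configured only for `"MSFT"` that receives sequence 1 for `"IBM"` and sequence 2 for `"MSFT"` advances `last_seq` to 2 but records only the second message as processed.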
  • That message will also be sent to all of the sequencers 150-1, 150-2, . . ., 150-s even though a single sequencer (in this example, sequencer 150-1) is the currently active sequencer servicing the mesh.
  • the other sequencers 150-2, . . ., 150-s also receive the message to allow them the opportunity to take over as the currently active sequencer should sequencer 150-1 (the currently active sequencer) fail, or if the overall reliability of the electronic trading system 100 would increase by moving to a different active sequencer.
  • One or more of the other sequencers may also be responsible for relaying system state to the disaster recovery site 155.
  • the disaster recovery site 155 may include a replica of electronic trading system 100 at another physical location, the replica comprising physical or virtual instantiations of some or all of the individual components of electronic trading system 100.
  • By sending each message out in parallel to all directly connected nodes, the system 100 reduces complexity and also facilitates redundancy and high availability. If all directly connected nodes receive all messages by default, multiple nodes can be configured to take action on the same message in a redundant fashion. Returning to the example above of the order to “Sell 10 shares of Microsoft at $190.00”, in some embodiments, multiple core compute nodes 140 may simultaneously perform matching for Microsoft orders.
  • both core compute node 140-1 and core compute node 140-2 may simultaneously perform matching for Microsoft messages, and may each independently generate, after having received the incoming message of the “Sell” order, a response message such as an acknowledgement or execution message that each of core compute node 140-1 and core compute node 140-2 sends to the gateways 120 through the sequencer(s) 150 to be passed on to one or more participant devices 130.
  • gateways 120 may receive multiple associated outgoing messages from core compute nodes 140 for the same corresponding incoming message. Due to the fact that it can be guaranteed that these multiple associated response messages are equivalent, the gateways 120 may simply process only the first received outgoing message, ignoring subsequent associated outgoing messages corresponding to the same incoming message.
  • the “first” and “subsequent” messages may be identified by their associated sequence numbers, as such messages may be sequence-marked messages.
  • messages may be identified as being functionally equivalent based on other identifying information in the messages, such as the values in the entity type field 110-16 and entity count field 110-17, as discussed further in connection with Fig. 1C above.
  • Allowing the gateways 120 to take action on the first of several functionally equivalent associated response messages to reach them may, therefore, also improve the overall latency of the electronic trading system 100.
  • the electronic trading system 100 can be easily configured such that any incoming message is processed by multiple compute nodes 140, in which each of those multiple compute nodes 140 generates an equivalent response message that can be processed by the gateways 120 on a first-to-arrive basis.
  • Such an architecture provides for high availability with no perceptible impact to latency in the event that a compute node 140 is not servicing incoming messages for a period of time (whether due to a system failure, a node reconfiguration, or a maintenance operation).
  • Such a point-to-point mesh 172 architecture of system 100, besides supporting low, predictable latency and redundant processing of messages, also provides for built-in redundant, multiple paths. As can be seen, there exist multiple paths between any gateway 120 and any compute node 140. Even if a direct connection 180-1-1 between gateway 120-1 and compute node 140-1 becomes unavailable, communication is still possible between those two elements via an alternate path, such as by traversing one of the sequencers 150 instead. Thus, more generally speaking, there exist multiple paths between any node and any other node in the point-to-point mesh 172.
  • this point-to-point mesh architecture inherently supports another important goal of a financial trading system, namely, fairness.
  • the point-to-point architecture with direct connections between nodes ensures that the path between any gateway 120 and any core compute node 140, or between the sequencer 150 and any other node, has identical, or at least very similar, latency. Therefore, two incoming messages sent out to the sequencer 150 at the same time from two different gateways 120 should reach the sequencer 150 substantially simultaneously. Similarly, an outgoing message being sent from a core compute node 140 is sent to all gateways 120 simultaneously, and should be received by each gateway at substantially the same time. Because the topology of the point-to-point mesh does not favor any single gateway 120, chances are minimized that being connected to a particular gateway 120 may give a participant device 130 an unfair advantage or disadvantage.
  • the point-to-point mesh architecture of system 100 allows for easily reconfiguring the function of a node, that is, whether a node is currently serving as a gateway 120, core compute node 140 or sequencer 150. It is particularly easy to perform such reconfiguration in embodiments in which each node has a direct connection between itself and each other node in the point-to-point mesh.
  • no re-wiring or recabling of connections 180 (whether physical or virtual) within the point-to-point mesh 172 is required in order to change the function of a node in the mesh (for example, changing the function of a node from a core compute node 140 to a gateway 120, or from a gateway 120 to a sequencer 150).
  • the reconfiguration required that is internal to the point-to-point mesh 172 may be easily accomplished through configuration changes that are carried out remotely.
  • the reconfiguration of the function of a node may be accomplished live, even dynamically, during trading hours. For example, due to changes in the characteristics of the load of the electronic trading system 100 or new demand, it may be useful to reconfigure a core compute node 140-1 to instead serve as an additional gateway 120. After some possible redistribution of state or configuration to other compute nodes 140, the new gateway 120 may be available to start accepting new connections from participant devices 130.
  • lower-speed, potentially higher latency shared connections 182 may be provided among the system components, including among the gateways 120 and/or the core compute nodes 140. These shared connections 182 may be used for maintenance, control operations, management operations, and/or similar operations that do not require very low latency communications, in contrast to messages related to trading activity, which are carried over the dedicated connections 180 in the point-to-point mesh 172.
  • the shared connections 182-g and 182-c carry non-trading activity type traffic.
  • Shared connections 182, carrying non-trading traffic, may be over one or more shared networks and via one or more network switches, and nodes in the mesh may be distributed among these shared networks in different ways.
  • gateways 120 may all be in a gateway-wide shared network 182-g, compute nodes 140 may be in their own respective compute node-wide shared network 182-c, and sequencers 150 may be in their own distinct sequencer-wide shared network 182-s, while in other embodiments all the nodes in the mesh may communicate over the same shared network for these non-latency sensitive operations.
  • Distributed computing environments such as electronic trading system 100 sometimes rely on high resolution clocks to maintain tight synchronization among various components.
  • one or more of the nodes 120, 140, 150 might be provided with access to a clock, such as a high-resolution global positioning (GPS) clock 195 in some embodiments.
  • gateways 120, compute nodes 140, and sequencers 150 connected in the mesh 172 may be referred to as “Mesh Nodes”.
  • Fig. 2 illustrates an example embodiment of a Mesh Node 200 in the point-to-point mesh 172 architecture of electronic trading system 100.
  • Mesh node 200 could represent a gateway 120, a sequencer 150, or a core compute node 140, for example.
  • Mesh Node 200 may be implemented in any suitable combination of hardware and software, including pure hardware and pure software implementations, and in some embodiments, any or all of gateways 120, compute nodes 140, and/or sequencers 150 may be implemented with commercial off-the-shelf components.
  • In the embodiment of Fig. 2, in order to achieve low latency, some functionality is implemented in hardware in Fixed Logic Device 230, while other functionality is implemented in software in Device Driver 220 and Mesh Software Application 210.
  • Fixed Logic Device 230 may be implemented in any suitable way, including an Application-Specific Integrated Circuit (ASIC), an embedded processor, or a Field Programmable Gate Array (FPGA).
  • Mesh Software Application 210 and Device Driver 220 may be implemented as instructions executing on one or more programmable data processors, such as central processing units (CPUs).
  • Different versions or configurations of Mesh Software Application 210 may be installed on Mesh Node 200 depending on its role. For example, based on whether Mesh Node 200 is acting as a gateway 120, sequencer 150, or core compute node 140, a different version or configuration of Mesh Software Application 210 may be installed.
  • Mesh Node 200 has multiple low latency 10 Gigabit Ethernet SFP+ connectors (interfaces) 270-1, 270-2, 270-3, . . ., 270-n (known collectively as connectors 270).
  • Connectors 270 may be directly connected to other nodes in the point-to-point mesh via dedicated connections 180, connected via shared connections 182, and/or connected to participant devices 130 via a gateway 120, for example. These connectors 270 are electronically coupled in this example to 10 GigE MAC Cores 260-1, 260-2, 260-3, . . ., 260-n, (known collectively as GigE Cores 260), respectively, which in this embodiment are implemented by Fixed Logic Device 230 to ensure minimal latency. In other embodiments, 10 GigE MAC Cores 260 may be implemented by functionality outside Fixed Logic Device 230, for example, in PCI-E network interface card adapters.
  • Fixed Logic Device 230 may also include other components.
  • Fixed Logic Device 230 also includes a Fixed Logic 240 component.
  • Fixed Logic component 240 may implement different functionality depending on the role of Mesh Node 200, for example, whether it is a gateway 120, sequencer 150, or core compute node 140.
  • Also included in Fixed Logic Device 230 is Fixed Logic Memory 250, which may be a memory that is accessed with minimal latency by Fixed Logic 240.
  • Fixed Logic Device 230 also includes a PCI-E Core 235, which may implement PCI Express functionality.
  • PCI Express is used as a conduit mechanism to transfer data between hardware and software, or more specifically, between Fixed Logic Device 230 and the Mesh Software Application 210, via Device Driver 220 over PCI Express Bus 233.
  • any suitable data transfer mechanism between hardware and software may be employed, including Direct Memory Access (DMA), shared memory buffers, or memory mapping.
  • Mesh Node 200 may also include other hardware components.
  • Mesh Node 200 in some embodiments may also include High-Resolution Clock 195 (also illustrated in and discussed in conjunction with Fig. 1A) used in the implementation of high-resolution clock synchronization among nodes in electronic trading system 100.
  • DRAM 280 may also be included in Mesh Node 200 as an additional memory in conjunction with Fixed Logic Memory 250.
  • DRAM 280 may be any suitable volatile or non-volatile memory, including one or more random-access memory banks, hard disk(s), and solid-state disk(s), and accessed over any suitable memory or storage interface.
  • the architecture of system 100 inherently supports another important goal of a financial trading system, namely, fairness.
  • the basic idea is that delays through the system do not favor the user of any single gateway 120 or core 140, thus minimizing the chance that being connected to a particular gateway 120 may give any participant device 130 an unfair advantage or disadvantage over another.
  • this end is accomplished by controlling latency, that is, the time between when messages arrive at the system and when corresponding messages are permitted to leave the system.
  • an inbound request message is received at a gateway 120-1.
  • Gateway 120-1 processes the request message and generates an internal message, IBmsg, destined for one or more of the cores 140.
  • IBmsg adds at least one field, such as a time based value (TBV) to the request message.
  • the TBV may be inserted into some unused field of a standard message (such as that explained in connection with Fig. 1C), or TBV may be encoded in IBmsg according to some proprietary internal protocol.
  • the time based value TBV can be determined by the fixed logic 240 within gateway 120-1.
  • the TBV may correspond to a time that the request message was received by the gateway 120-1 (referred to here as the ingress timestamp or arrival time, Tar), or in other implementations, may correspond to a desired exit time (referred to herein as the egress time or exit time, Tex) for a corresponding response message to be returned by the system 100.
  • TBV is determined from the time of receipt Tar plus some deterministic time delay value (referred to as Td, as described in more detail below).
  • the TBV is typically not the same field as the “node-specific timestamp” field 110-13 or the sequence numbers (e.g., a field different from either the sequence ID 110-14 or reference sequence ID 110-15) referenced above.
  • the inbound message IBmsg and its corresponding time based value TBV is then forwarded along a path 310-1 through the mesh to the cores 140 for processing.
  • a selected one of the cores, such as core 140-2 then further processes IBmsg and generates an outbound message OBmsg that contains some response data as well as the TBV (or some other value that depends on the TBV).
  • Outbound messages return along a path such as path 310-2 back to gateway 120-1.
  • the gateway 120-1 schedules OBmsg to exit the system 100, sending it to the participant 130-1 at a precise exit time Tex.
  • An element we refer to as an egress Quality of Service (QoS) shaper 320 controls the exit time for outbound messages OBmsg.
  • the precise exit time Tex is determined from the time based value TBV that was carried with IBmsg and OBmsg and a desired deterministic delay or latency, Td.
  • the deterministic delay Td is not necessarily directly dependent on the actual amount of time it takes for any particular response to be returned to the gateway 120 from the cores 140. As explained in more detail below, Td may depend upon a maximum expected time that any inbound message IBmsg requires to be fully processed by the system 100.
  • while the exit time Tex can be determined by the gateway 120 from the TBV (using the arrival time, Tar), it can also be determined in other ways, such as by the sequencer 150.
  • the time based value TBV may take different forms. It may simply be the time at which the corresponding inbound request message was originally received at gateway 120-1, Tar. In that instance, a time delay value (Td) is added to TBV to arrive at the exit time Tex at which the outbound message OBmsg will exit the gateway 120-1. In configurations where the deterministic delay Td is a fixed time, this fixed value can be stored or otherwise implemented in the gateway 120 (or the sequencer 150) at configuration time.
  • the time based value TBV carried with IBmsg and the corresponding OBmsg may be the actual desired time exit value Tex (that is, instead of the arrival time Tar).
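The two forms of the TBV described above can be sketched as follows. This is a minimal illustration only: the `InternalMessage` type, the function names, and the 1000 ns value of `TD_NS` are assumptions made for the example and do not come from the description.

```python
from dataclasses import dataclass

TD_NS = 1000  # assumed fixed deterministic delay Td, in nanoseconds


@dataclass(frozen=True)
class InternalMessage:
    """Hypothetical internal message (IBmsg or OBmsg) carrying a TBV."""
    payload: bytes
    tbv: int                # time based value
    tbv_is_exit_time: bool  # True if the TBV already holds Tex


def stamp_with_arrival(payload: bytes, t_ar: int) -> InternalMessage:
    # Form 1: the TBV is the arrival time Tar; Td is added later, at egress.
    return InternalMessage(payload, t_ar, False)


def stamp_with_exit(payload: bytes, t_ar: int, td: int = TD_NS) -> InternalMessage:
    # Form 2: Td is added at ingress, so the TBV is the desired exit time Tex.
    return InternalMessage(payload, t_ar + td, True)


def exit_time(msg: InternalMessage, td: int = TD_NS) -> int:
    # Either form yields the same Tex = Tar + Td.
    return msg.tbv if msg.tbv_is_exit_time else msg.tbv + td
```

Both stamping choices produce the same exit time for a given arrival time; they differ only in where the addition of Td takes place.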
  • inbound messages IBmsg will not necessarily be forwarded to one of the cores 140.
  • fixed logic within the gateway 120 may reject IBmsg if it is invalid (for example, for having a bad checksum, or some other case for which the gateway 120-1 cannot determine to which core 140 IBmsg is to be sent).
  • the outbound message OBmsg may not actually originate from one of the cores 140, but rather from the gateway 120-1.
  • it may be desirable for OBmsg to still be sent at an exit time Tex that corresponds to a deterministic delay Td after the original request message was received at time Tar.
  • the outbound message OBmsg may be destined for more than one recipient device 130. In such instances, OBmsg is still sent to all such multiple consumers of the response at precisely the same exit time Tex. There is high confidence that the message can be sent to multiple recipients at the same precise exit time, due to the fact that the components of the system are typically implemented as highly synchronized, hardware components as explained above.
  • the inbound request message may originate from a participant device 130-1 who is a buyer in a transaction.
  • Participant device 130-2 connected to gateway 120-2 may be the seller in the transaction, and participant device 130-3 connected to gateway 120-3 may be a service such as a market data feed that reports such transactions.
  • the total time taken to return a response message (or otherwise process an inbound message) by the system may range, for example, between 400 and 800 nanoseconds.
  • the egress QOS shaper 320 may be configured to send any corresponding response at a precise interval of 1000 nanoseconds (1 microsecond).
  • the outbound message OBmsg enters the egress QOS shaper 320-1.
  • OBmsg may remain in the egress shaper 320 for several hundred nanoseconds until exactly one microsecond has elapsed since the original request message was received at the gateway 120-1. This ensures that as long as the configured latency interval is set to a sufficiently high value (such that the distributed system as a whole can comfortably guarantee that it can process any inbound message within that interval), a response message such as OBmsg will always be returned at a precise, deterministic time interval after inbound message IBmsg was received.
  • the QOS level (i.e., the deterministic latency interval Td) may be tuned on an individual connection basis, on a per-gateway basis, or system-wide.
  • the QOS level may be configured on a per-participant or per-connection basis, optionally associated with varying cost subscription levels. While allowing different participants 130 to pay for different levels of QOS militates against fairness as a goal, such a feature may still be desirable to the provider of the distributed service as an additional revenue stream.
  • the system wide QOS level may also be set either manually or even dynamically.
  • the system 100 may dynamically temporarily increase the deterministic latency interval Td across the entire system 100 in the event an exceptional event occurs that threatens to cause the system not to satisfy the typical latency interval Td.
  • This could be an internal event, such as a failure of one or more components or nodes in the distributed system, that will cause the system to temporarily degrade performance until the problematic components can be hot-swapped or new compute nodes brought online.
  • the exceptional event could also be caused external to the distributed system, such as a news event that results in a huge increase in activity on or demand for the distributed system.
  • once the exceptional event is resolved, the latency interval Td could be tuned back down, even dynamically, to its typical level.
  • a message from a single client 130 may trigger the need to generate and send acknowledgements or other types of response messages to numerous participants 130 and/or over multiple participant connections.
  • the content of the multiple messages is substantially identical among the multiple participants, whereas in other cases, a given inbound message IBmsg generates several related but different response messages OBmsgs, each with content that may be specific to a single participant or to a subset of participants.
  • a separate execution message may be generated for each match party.
  • Each execution message will contain information that is specific to that party, such as an “Order Token,” assigned by the client to identify the order.
  • the two execution messages are related to the extent that they will share some information in common, such as a unique market-assigned ‘Execution ID’ or ‘Match number’.
  • These two execution messages may be generated by the same compute core 140, and their egress time Tex should be the same, so that the two messages are sent simultaneously.
  • two or more related response messages are sent to multiple participants 130 simultaneously at the same precise deterministic time interval, regardless of whether the multiple participant connections are on the same or on different gateways.
  • Each related response message should contain a TBV that is the original timestamp value (or the desired egress time, or some other time-based value TBV related to the corresponding ingress time) corresponding to the incoming message IBmsg that triggered the responses.
  • the response from the core 140 may be a single outbound message OBmsg that arrives at the gateway(s) 120. However, that in turn causes the gateway(s) 120 to generate two or more related but not identical messages, the related messages differing in their destination address, and/or by having some other client-specific identifier, such as an order token, in their encoding as a result of the protocol associated with the participant 130, etc.
  • the cores 140 may provide a financial trading system’s matching engine that may generate a match between an order to buy a security and a corresponding order to sell that security.
  • a successful match triggers an acknowledgement or execution message to be sent to both counter parties (e.g., buyer and seller) in the match.
  • although the execution message OBmsg going to each counter party contains information specific to that client’s trade, the execution messages will be sent to the two parties at the exact same time, ensuring fairness.
  • the latency interval for both of the resulting related responses may be set based off the incoming timestamp of one of the orders participating in the match (for example, the last of the two orders to arrive as an inbound request message).
  • client devices 130 may typically subscribe to streams of market data messages sent from the matching engine notifying subscribers of real time activity across the market as a whole, including activity in which the subscribing client is not a participating party, such as recent executions between two third parties.
  • a market data message in some embodiments is sent out to all subscribers simultaneously based on the timestamp value in the message as processed by the egress QOS shaper 320.
  • the sending of the market data messages to subscribers is timed to match the sending time of any acknowledgment or execution messages sent to market participants that are direct parties to the activity reflected in the market data messages also sent to subscribers.
  • this match activity may also be reported to market data subscribers.
  • the system 100 ensures that not only are both acknowledgement messages sent to the counter parties at the same time, but these acknowledgement messages are timed to also be sent at the same time as the market data messages reflecting this execution sent to market data subscribers, thereby ensuring that no market participant has prior access to useful financial information.
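As a rough sketch of the behavior described above, the related responses for one match can all be given a single shared exit time, derived here from the later of the two order arrival times. The dictionary layout, the field names, and the function `related_responses` are hypothetical and serve only to illustrate the idea.

```python
def related_responses(execution_id, buy, sell, td_ns, subscriber_gateways=()):
    """Build execution and market data messages that share one exit time.

    buy and sell are hypothetical order records with the keys
    'order_token', 'gateway' and 't_ar' (arrival time in nanoseconds).
    """
    # One option named in the text: base Tex on the last order to arrive.
    t_ex = max(buy["t_ar"], sell["t_ar"]) + td_ns

    messages = []
    for order in (buy, sell):
        # Party-specific content (the order token) differs per message,
        # while the execution ID and the exit time are shared.
        messages.append({
            "kind": "execution",
            "execution_id": execution_id,
            "order_token": order["order_token"],
            "gateway": order["gateway"],
            "t_ex": t_ex,
        })
    for gateway in subscriber_gateways:
        # Market data reflecting the same execution leaves at the same time.
        messages.append({
            "kind": "market_data",
            "execution_id": execution_id,
            "gateway": gateway,
            "t_ex": t_ex,
        })
    return messages
```

Because every message carries the same `t_ex`, each gateway's egress shaper can release its copy independently and the messages still leave the system together, provided the gateway clocks are synchronized.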
  • the system 100 provides precise control over when outbound messages are sent. In some instances, fairness to everyone is important, and so the response messages to multiple recipients can be released at the same time. Or in another embodiment, the system may be configured to release the response to the participants 130-1, 130-2 who are parties to a transaction a bit sooner than to a market feed 130-3. Or perhaps the market feed 130-3 receives the response sooner than the participants 130-1, 130-2 to the transaction. In yet other configurations, the system 100 rewards participants 130 who are “liquidity adders” by letting them know sooner, and retards “liquidity removers” by letting them know later.
  • Fig. 3B is an example of an egress QOS shaper 320 that may be located inside one or more gateways 120.
  • Each OBmsg is received with its corresponding time based value TBV.
  • the TBV is then used with the desired latency Td to determine an exit time Tex at which the message will be sent out of the gateway 120.
  • where the TBV is the receipt time for the corresponding inbound message IBmsg, the corresponding deterministic delay Td may be added to the TBV to arrive at Tex.
  • the egress QOS shaper 320 may be implemented as a “packet scheduler,” organized into a set of indexed storage locations 380-1, 380-2, . . . , 380-n (or “buckets”), with a location 380 associated with each discrete high precision egress timing interval Tex1, Tex2, . . ., Texn. Each message placed in a particular bucket 380 is released at the precise time interval associated with the bucket. Each bucket may contain a list of zero or more outbound messages OBmsg.
  • TBV may instead be allocated as the desired outbound indexed location or bucket of the packet scheduler 320.
  • when the desired egress time is used as the TBV, it can be considered to directly correlate to the exit time Tex.
  • while message scheduler 320 may be a ring data structure that wraps, other implementations are possible.
  • message scheduler 320 may also be implemented as a set of linked lists with pointers, for example, with one linked list for each exit time Tex.
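The bucketed packet scheduler described above resembles a timing wheel. The following is a minimal software sketch under stated assumptions: the class name `EgressShaper`, its parameters, and the one-bucket-per-time-unit default are illustrative choices, and a real gateway would implement this in fixed logic rather than in software.

```python
class EgressShaper:
    """Ring of buckets, one per discrete exit-time slot (illustrative only)."""

    def __init__(self, num_slots, start=0, slot_width=1):
        self.num_slots = num_slots
        self.slot_width = slot_width
        self.buckets = [[] for _ in range(num_slots)]
        self.now = start // slot_width  # index of the next slot to release

    def schedule(self, msg, t_ex):
        """Place msg in the bucket for exit time t_ex."""
        tick = t_ex // self.slot_width
        if tick < self.now:
            raise ValueError("exit time has already passed")
        if tick - self.now >= self.num_slots:
            raise ValueError("exit time is beyond the ring's horizon")
        # The ring wraps: the slot index is taken modulo the ring size.
        self.buckets[tick % self.num_slots].append(msg)

    def advance_to(self, t):
        """Release every message whose slot is at or before time t.

        Returns (slot_time, msg) pairs in release order.
        """
        released = []
        target = t // self.slot_width
        while self.now <= target:
            index = self.now % self.num_slots
            for msg in self.buckets[index]:
                released.append((self.now * self.slot_width, msg))
            self.buckets[index] = []
            self.now += 1
        return released
```

For example, with `shaper = EgressShaper(num_slots=4096, start=17356)`, two messages scheduled for time 18356 stay in the same bucket and are both returned by `shaper.advance_to(18356)`. The ring must span at least the largest configured latency, otherwise `schedule` rejects the message.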
  • the “time” value, be it an arrival or exit time, can be with reference to an absolute or relative time.
  • Fig. 4 is an example of an inbound message IBmsg arriving from a market participant 130 at a gateway 120, and a reply being sent as an outbound message OBmsg with a deterministic latency (e.g., Td) of 1000 time units (where a time unit may be a nanosecond).
  • in an initial state 401, the request message arrives at the gateway 120 and a timestamp TBV is added to it, to generate an internal message D1 (e.g., IBmsg).
  • This state 401 may occur at a time T17356.
  • the message D1 including the timestamp TBV is sent by the gateway 120 to one of the cores 140.
  • a response message (R1) with the same timestamp TBV is received back at the gateway 120 from the core 140.
  • in state 404 (which may occur at time T17922), an internal response message R1 with the timestamp is fed into the packet scheduler 320, which determines the appropriate outgoing timeslot (which in this example is arrival time T17356 plus 1000, or a time of T18356).
  • finally, the precise time T18356 is reached, at which point the packet scheduler permits the response message (R1) to exit the gateway as the system-level outbound message OBmsg.
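The arithmetic of the Fig. 4 example can be checked with a short sketch. The times are those given in the example (arrival at T17356, response fed to the scheduler at T17922, Td of 1000 time units); the function name `hold_time` is an assumption made for illustration.

```python
T_AR = 17356    # arrival time of the request at the gateway
T_RESP = 17922  # time the response is fed into the packet scheduler
TD = 1000       # deterministic latency, in time units


def hold_time(t_ar, t_resp, td):
    """Return (Tex, time the response waits in the egress shaper)."""
    t_ex = t_ar + td
    if t_resp > t_ex:
        # Td was set too low for this message: the deadline was missed.
        raise ValueError("response arrived after its scheduled exit time")
    return t_ex, t_ex - t_resp


t_ex, hold = hold_time(T_AR, T_RESP, TD)
print(t_ex, hold)  # 18356 434
```

The response therefore waits 434 time units in the shaper, so the participant observes exactly 1000 time units of latency regardless of how long the core actually took.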
  • while Fig. 4 shows the inbound message IBmsg and its TBV being forwarded directly (e.g., over activation link 180-1-1 in FIG. 1B) from a gateway node 120 to one of the cores 140, the IBmsg with the TBV may alternatively travel via a sequencer 150 to one of the cores 140 (e.g., over ordering path 117 in FIG. 1B), or it may travel along both paths, e.g., both directly from the gateway node 120 to the core 140 and from gateway node 120 through the sequencer 150 to the core 140.
  • both unmarked messages and sequence-marked messages may have a TBV included with them.
  • a TBV included in an unmarked message may have the same value as a TBV included in the corresponding sequence-marked message.
  • the TBV is a field in a message that is different from the sequence identifier (e.g., different from either the sequence ID 110-14 or reference sequence ID 110-15) of the message.
  • Fig. 5 shows an example embodiment where an outbound message OBmsg results from an event internal to the system 100, as opposed to being a “response” to some inbound “request” message.
  • the OBmsg is “asynchronous” to any corresponding inbound message.
  • Such asynchronous messages should still be marked and handled with a scheduled egress time Tex.
  • an asynchronous message in a trading system might be an Order Cancel triggered by a timer.
  • a matching engine in the core might generate such a Cancel message when a “time in force” associated with a resting order expires. Since the Cancel message is to be broadcast to more than one client/participant (for example, both the initiator of the order as well as a market data feed), it should exit all gateways to all destination participants at the same time.
  • the exit time Tex may be determined by the core 140 that originates the message. In other implementations, the exit time Tex may be determined by one of the sequencers 150 through which OBmsg will travel.
  • an outbound message needing to be sent to multiple participants will be sent at the same exit time Tex, regardless of whether the outbound message was generated as part of a request-response or whether it instead was an asynchronous message originated internally within the mesh.
  • the concept of having a deterministic latency encompasses more than just a fixed time.
  • the deterministic latency may include using a set of several fixed time values that are evenly distributed across a range according to fairness criteria. The particular sequence of time values may be statistically evenly and randomly distributed across the range.
  • the system 100 also supports deterministic latency variation, in which latency values Td are order-randomized within an evenly spanned latency value range.
  • time stamps are associated with inbound messages IBmsg in much the same way as described above, with the one difference being that the egress QOS shaper assigns the message to a time slot that depends on an order-randomized latency value within a bounded range.
  • the egress QOS shaper 320 could be considered to assign outgoing time slots in a manner similar to that of dealing a card from a deck of playing cards, so that once a latency value in the range has been assigned, that particular latency value will not be assigned again until all other values in the range have also been assigned and the “deck” is reshuffled. While in one embodiment the QOS shaper 320 is responsible for assigning the order-randomized latency value, in other embodiments, other components such as a gateway 120 or sequencer 150 may assign these values at the time an inbound message is received.
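The card-dealing analogy above can be sketched as a shuffled deck of latency values that is only reshuffled once every value has been dealt. The class name `LatencyDeck` and its parameters are hypothetical, and Python's `random` module stands in for whatever source of randomness an actual implementation would use.

```python
import random


class LatencyDeck:
    """Deals each latency value once per pass, in random order (a sketch)."""

    def __init__(self, low, high, step, rng=None):
        # Evenly spaced latency values across the bounded range, inclusive.
        self.values = list(range(low, high + 1, step))
        self.rng = rng if rng is not None else random.Random()
        self._deck = []

    def deal(self):
        """Return the next latency value, reshuffling when the deck is empty."""
        if not self._deck:
            self._deck = list(self.values)
            self.rng.shuffle(self._deck)
        return self._deck.pop()
```

With `LatencyDeck(1000, 5000, 500)`, any nine consecutive deals starting from a fresh deck contain each of the nine values exactly once, so every value in the range is used equally often over time while the order stays unpredictable.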
  • the varying but still deterministic delay Td will not necessarily be the same for each inbound message IBmsg.
  • the gateway 120 determines the correct exit time Tex at egress time for the corresponding outbound message(s) OBmsg.
  • the varying but deterministic delay Td can be added to the arrival time Tar right at ingress time, such that the TBV is the single value, Tex.
  • using the exit time Tex as the TBV in the variable delay case avoids the need to carry both the delay value Td and the ingress time Tar with each message.
  • Fig. 6A illustrates an example where the system assigns a deterministic latency (Td) of 2000 time units to a particular message, as selected from a deterministic range of between 1000 and 5000 time units.
  • the response message with the timestamp is queued at the packet scheduler within the gateway 120-1.
  • One or more response messages are then sent out from the gateway in state 606 (at T19356), because this particular message was assigned a latency of 2000 time units from the set of values that range from 1000 to 5000 time units.
  • the specific latency value assigned to a message (that is, its place in the packet scheduler) can be determined by the gateway at the time the message is received, or in other ways, as already explained elsewhere.
  • Fig. 6B is a similar example, but here the assigned deterministic latency for this message was determined by the system to be 3500 time units (as selected from the same range of 1000-5000 units). The processing is otherwise similar to Fig. 6A, albeit with a different deterministic delay.
  • the message arrives at the gateway at time T17356 and is time stamped.
  • the message exits the gateway and is forwarded to the core.
  • a response message with the timestamp is returned to the gateway from the core.
  • Fig. 7 is an example of how a pattern such as an order-randomized set of values may be assigned to the deterministic latency.
  • the bounded range is from 1000 to 5000 time units, with an increment between slots of 500 time units.
  • the QOS shaper has generated a sequence of nine (9) evenly distributed Tex values as 2000, 3500, 2500, 1500, 4000, 5000, 4500, 3000, and 1000. Once the initial sequence of values is used up, the QOS shaper repeats the process and generates preferably some other evenly distributed, random pattern of the 9 values.
  • the assignment of the range of latency values to messages may be on a system-wide basis, or per matching engine, or per account/participant, or per flow/connection. In some cases, it may be desirable not to have the latency distribution be entirely random or completely evenly spaced throughout the entire latency range. For example, it may be desirable to limit the number of consecutive lower valued latencies assigned, and/or the number of consecutive higher latencies assigned.
  • the method within the QOS shaper 320 that assigns latencies may therefore attempt to distribute the latencies evenly within a relatively small set, such as ensuring that for every set of five consecutive latencies assigned to a participant response, at least two latencies in the set will be in a higher latency range and two other latencies will be in a lower latency range.
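One way to check the windowed constraint described above is sketched below. The text does not define what counts as a “higher” or “lower” latency, so this example assumes a single midpoint threshold; the function name `window_is_balanced` and its parameters are likewise assumptions.

```python
def window_is_balanced(latencies, midpoint, window=5, min_each=2):
    """Check every run of `window` consecutive latencies.

    Each run must hold at least `min_each` values above `midpoint` and at
    least `min_each` values below it. Sequences shorter than `window`
    contain no full run and are accepted.
    """
    for start in range(len(latencies) - window + 1):
        run = latencies[start:start + window]
        high = sum(1 for value in run if value > midpoint)
        low = sum(1 for value in run if value < midpoint)
        if high < min_each or low < min_each:
            return False
    return True
```

A shaper could use such a check to reject a freshly shuffled sequence and reshuffle until the constraint holds, at the cost of making the sequence slightly less random.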
  • limiting the consecutive number of similar latencies i.e., all high or all low
  • the periods surrounding market open and market close tend to exhibit higher levels of activity, as do periods following a news item that has a potential financial impact, such as an interest rate adjustment announcement. Such events might need a more tightly controlled distribution of latencies than during normal times.
  • the range of values selected should take into consideration the expected delays inherent in the components of the system, including at least the gateways 120 and the cores 140. For example, it may be that, depending on the message type and the current load on the distributed system, the time taken to process a message ranges between 400-800 nanoseconds, but the egress QOS shaper is configured to send any corresponding response at a randomized interval within a range of 1000 to 5000 nanoseconds. Accordingly, after the response message has reached the gateway node as it is being readied to be sent out to the trading participant, the message enters the egress QOS shaper, where it remains for a period of time until the precise randomized interval has elapsed since the original message was timestamped upon entry at the gateway node.
  • Adding a randomized but evenly distributed jitter to the latency ensures fairness, since each trading message has an equal chance of being assigned any randomized jitter within the configured latency bounds. At the same time, by adding jitter, the system mitigates participant behaviors that could exploit the predictability of a perfectly consistent latency.
  • Fig. 8 illustrates some of the considerations that may be factored into determining an appropriate fixed value for Td, which should be selected to be greater than a combination of the worst case delay through the cores 140 (Tcore) and the worst case delay (Tpath) of the paths between the gateways 120 and the cores 140 (for both inbound and outbound messages).
  • some of the considerations for determining Td may include: time in gateway ingress,
  • considerations for determining Td for asynchronous messages may include: time in core,
  • the cores 140 are expected to be largely implemented in hardware (e.g., they are fixed-logic, FPGA based) and thus have a relatively fixed predictable “time in core” to execute a task.
  • some tasks executed by the cores 140 may take longer than others.
  • a core 140 might respond to a Cancel request much faster than responding to an Add Order.
  • a different Add Order request may require different times to process, depending upon whether the order is for a security that is frequently traded (and hence the required data to execute the trade is held in a cache local to a core 140 dedicated to that security) or infrequently traded (where the data necessary to execute the order is to be retrieved from a location outside of a local cache on the core 140).
  • the types of messages expected to be sent may also factor into selecting Td.
  • The delay associated with the paths, Tpath, is also expected to cluster around a relatively small and predictable value in an architecture that utilizes a fully connected mesh between all gateways 120 and cores 140, as described above.
  • Td should be chosen to be greater than the combination of these internal system delays, with the goal being that time Td is selected to “hide” the variability of things such as Tcore and Tpath from the participant devices 130. As long as the worst-case scenario sets Td at or above some maximum possible system delay, it is possible for the system 100 to guarantee a deterministic latency for all participant devices 130.
  • Td can be determined at the time the system is designed. However, in another approach, Td can be dynamically tuned by monitoring the time difference between an inbound message and the associated outbound message, in order to guarantee Td will always be greater than the largest delta in time. In this mode, the system 100 can be placed in a special mode and subjected to a series of test request messages (different Add Orders, Cancels, etc.) and the response time noted for each of a set of participant connections, gateways, and cores, to determine a maximum latency time for the system.
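The dynamic tuning approach described above can be sketched as taking the largest observed difference between inbound and outbound times and adding headroom. The safety margin, the rounding to a slot size, and the function name `tune_td` are assumptions made for illustration; the text only requires that Td exceed the largest observed delta.

```python
def tune_td(samples, margin_ns=200, slot_ns=100):
    """Pick a Td strictly greater than the largest observed delay.

    samples: iterable of (t_in, t_out) pairs, in nanoseconds, gathered from
    test request messages across participant connections, gateways and cores.
    margin_ns: assumed positive safety headroom added to the worst case.
    slot_ns: assumed egress slot size; Td is rounded up to a multiple of it.
    """
    if margin_ns <= 0:
        raise ValueError("margin_ns must be positive so Td exceeds the worst case")
    worst = max(t_out - t_in for t_in, t_out in samples)
    target = worst + margin_ns
    # Round up to the next multiple of slot_ns using integer arithmetic.
    return -(-target // slot_ns) * slot_ns
```

For instance, samples with delays of 400, 800, and 566 ns give a worst case of 800 ns and a tuned Td of 1000 ns under the default margin and slot size.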
  • Fig. 9 is similar to Fig. 8 but illustrates the situation where the deterministic latency is evenly distributed across a range of values. In this example, there are multiple possible delay times Td1, Td2, Td3, . . . Tdn. All of these times selected are preferably greater than the expected worst-case internal delay, e.g., the sum of Tcore and Tpath.
  • some implementations of the system 100 may support participants that use different financial trading protocols.
  • the processing of such messages is thus inherently faster or slower depending upon the protocol selected.
  • an inbound request message encoded using a binary protocol would inherently be handled by the cores 140 more rapidly than a request that uses a text-based protocol.
  • gateway 120-1 receives an IBmsg from a first participant 130-1 (for example, a buyer) that uses a text-based financial trading protocol.
  • a different participant 130-2 (for example, a seller)
  • the time (P1) needed for the system to process the IBmsg for the first participant and return the corresponding response (OBmsg) may be longer than the time (P2) needed for the system to process the IBmsg and return the response for the second participant.
  • the system 100 could be configured with a time Td that is sufficiently high to take into account the processing time required for the slowest financial trading protocol, thereby ensuring that messages exchanged according to any protocol exhibit the same response time, as discussed above. Alternatively, the system may instead be configured to treat different protocols differently, providing a message a deterministic time boost or a deterministic time penalty, depending on the message’s protocol, thereby encouraging or discouraging the use of particular protocols.
  • the returned outbound message OBmsg will therefore be scheduled to finish exiting the system 100 at a time that depends upon both the protocol-dependent value as well as the time of receipt (e.g., the timestamp or TBV).
  • IBmsgs will be forwarded to the cores with a protocol-dependent time value (P), which is carried along with the message and the TBV, so that the gateway can determine a time for the outbound message OBmsg to exit in a manner that takes into account the protocol of the message.
  • the gateways 120-1, 120-2 are able to calculate different egress times for market participants 130-1 and 130-2 to account for the message protocol they are using.
  • parameters such as P1 and P2 may be based on considerations other than the protocol in use. For example, they may be used to offer differentiated classes of service, where the participants who pay for “first class” service still are all delayed the same as other first class users, but they are not delayed as much as “second class” users who pay less, etc.
  • the notion of the system 100 being “fair to participants with a deterministic latency” can thus mean different things. It can mean, “treat all participants exactly the same”, or it can be used to incentivize certain types of trading behavior or protocols, or for security reasons.
  • the deterministic latency may also be temporarily adjusted based on current system conditions. For example, it may be increased due to failure of one or more components, or as a result of an exceptional burst of activity in the system.
  • the architecture described above may be of use in applications other than electronic trading systems. For example, it is possible that it may be used to monitor data streams flowing across a network, to capture packets, decode the packets’ raw data, analyze packet content in real time, and provide responses, for applications other than handling securities trade orders.
  • one approach may include a plurality of gateways connected to receive inbound messages from two or more participant devices.
  • One or more of the gateways are each further configured to determine a time based value (TBV) for a selected one of the inbound messages.
  • the gateways forward the selected inbound message with its respective TBV to one or more compute nodes, and then receive a response message from the one or more compute nodes, the response message having information derivable from the TBV.
  • the one or more gateways then send a response message to at least one of the participant devices as an outbound message, the outbound message sent at a deterministic egress time that depends on both the information derivable from the TBV and a deterministic latency.
  • a method transmits a response message with a deterministic latency.
  • a plurality of incoming messages are received by one or more gateways from a plurality of participant devices.
  • a selected one of the gateways determines a corresponding time based value for a selected one of the plurality of incoming messages.
  • the selected incoming message relates to an electronic trading function.
  • the selected one of the gateways then sends the selected incoming message and information derivable from the corresponding time based value to a sequencer node.
  • the sequencer node then sends, to at least a selected one of a plurality of compute engines, a sequence-marked message and the information derivable from the corresponding time-based value.
  • the selected one of the plurality of compute engines receives the sequence-marked message and the information derivable from the corresponding time based value, and determines a selected compute-response message.
  • the selected compute-response message is based on the sequence-marked message, and is further configured to complete the electronic trade matching function.
  • the selected one of the compute engines then returns the selected compute-response message and the information derivable from the corresponding time based value to at least the selected one of the gateways.
  • the selected one of the gateways then receives the selected compute-response message from the selected one of the plurality of compute engines and the information derivable from the corresponding time based value, and transmits the selected compute-response message to at least one of the plurality of participant devices at a deterministic egress time that depends on at least the information derivable from the corresponding time based value for the selected incoming message.
  • the selected one of the plurality of gateways may receive an other incoming message from an other participant device.
  • the selected one of the gateways also forwards an other forwarded message to at least one of the plurality of compute engines, with the other forwarded message including the other incoming message and information derivable from an other time based value.
  • the selected one of the plurality of gateways receives an other compute-response message from the at least one of the plurality of compute engines, with the other compute-response message including the information derivable from the other time based value.
  • the selected one of the plurality of gateways then delays sending the other compute-response message until an other deterministic egress time is reached, where the other deterministic egress time depends on at least the information derivable from the other time based value.
  • the selected one of the gateways then sends the other compute-response message to the other participant device at the other deterministic egress time, such that the selected one of the compute-response messages and the other compute-response message are each delayed by the deterministic latency.
  • the time based value may depend on a time that relates to at least one of a receive time for the selected incoming message or a desired egress time for the selected compute-response message.
  • the deterministic latency may depend on a maximum time for the selected one of the plurality of compute engines to return the compute-response message.
  • the deterministic latency may follow a varying but deterministic pattern.
  • the deterministic latency may be selected from a set of latencies evenly distributed across a predetermined range.
  • the deterministic latency may also be configured on a per-gateway, a per-connection, or a system-wide basis.
  • forwarding from the selected one of the gateways to the compute engine may be over one or more direct, dedicated connections from the selected one of the gateways to the plurality of compute engines.
  • An asynchronous message may be received from one of the compute engines by at least the selected one of the gateways, and then sent as an outbound message simultaneously to two or more participant devices.
  • a transit time may be determined that may be associated with the step of forwarding the selected incoming message and with the step of receiving the selected one of compute-response messages, and the deterministic egress time further depends on the transit time.
  • Transmitting the selected one of the compute-response messages from the selected gateway may additionally comprise transmitting the selected one of the compute-response messages to at least one other participant device at the deterministic egress time.
  • the deterministic egress time may be one of a plurality of deterministic egress times.
  • the selected one of the compute-response messages may be stored in one of a set of indexed locations, with each indexed location associated with one of the plurality of deterministic egress times.
  • the information derivable from the corresponding time based value may be inserted into an unused field in the selected incoming message before forwarding the selected incoming message to the plurality of compute engines.
  • the corresponding time based value may be part of an internal system protocol field within the selected incoming message.
  • the deterministic latency may be dynamically changed.
  • Two or more of the plurality of gateways may each receive the selected compute-response message from the selected one of the compute engines.
  • the selected one of the gateways may forward the selected incoming message with the information derivable from the corresponding time based value to the plurality of compute engines.
  • the selected one of the plurality of compute engines then receives the selected incoming message with the information derivable from the corresponding time based value.
  • the determining, by the selected one of the plurality of compute engines, the selected compute-response message may be further based on at least one of the selected incoming message or the sequence-marked message.
  • forwarding the selected incoming message and receiving the selected compute-response message may be via a plurality of direct connections provided between each of the plurality of gateways and each of the plurality of compute engines.
  • the plurality of gateways may optionally be further configured to receive an asynchronous message from at least a selected one of the plurality of compute engines, and then send the asynchronous message as an outbound message simultaneously to two or more of the participant devices.
  • the selected compute-response message may be associated with a trade match event between two match parties each associated with a respective one of two participant devices.
  • the selected gateway may transmit the selected compute-response message simultaneously to the two participant devices associated with the two match parties.
  • the selected compute-response message may be sent as a market data event message to a third device associated with a subscriber of a market data stream.
  • the deterministic egress time may depend on one or more of a message path delay or compute engine delay.
  • Other embodiments may include a system that has a plurality of compute engines, one or more gateways, and a sequencer node.
  • the one or more gateways are configured to (a) receive a plurality of incoming messages from a plurality of participant devices; (b) determine a corresponding time based value for a selected one of the plurality of incoming messages; (c) forward the selected incoming message with information derivable from the corresponding time based value, to the plurality of compute engines; and (d) send the selected incoming message to the sequencer node.
  • the sequencer node may be configured to (e) receive the selected incoming message; and (f) send a sequence-marked message to a selected one of the compute engines.
  • the selected compute engine may be configured to (g) receive the selected incoming message and the information derivable from the time based value from at least the selected one of the gateways; (h) receive the sequence-marked message from the sequencer node; (i) determine a selected compute-response message based on at least one of the selected incoming message and the sequence-marked message, the compute-response message configured to complete the electronic trade matching function; and (j) return the selected compute-response message and the information derivable from the corresponding time based value to the one or more gateways.
  • At least the selected one of the gateways is further configured to (k) receive the selected compute-response message and the information derivable from the time based value from the selected one of the compute engines; and (l) transmit the selected compute-response message to at least one of the plurality of the participant devices at a deterministic egress time that depends on at least the information derivable from the corresponding time based value for the selected incoming message.
  • the deterministic egress time may follow a varying but deterministic pattern.
  • the selected one of the gateways in such a system may be further configured to forward the selected incoming message and receive the selected compute-response message via a plurality of direct connections provided between each of the plurality of gateways and each of the plurality of compute engines.
  • Such a system may also be configured such that the selected compute-response message is associated with a trade match event between two match parties each associated with a respective one of two participant devices, and the selected one of the gateways may be further configured to transmit the selected compute-response message simultaneously to the two participant devices associated with the two match parties.
  • a method for transmitting a response message with a deterministic latency may include:
  • a system may comprise:
  • a plurality of gateways configured to receive a plurality of incoming messages from a plurality of participant devices
  • a selected one of the gateways is further configured to: determine a corresponding time based value for a selected one of the plurality of incoming messages;
  • sequencer node is further configured to:
  • send a sequence-marked message to at least one of the compute engines, and wherein a selected one of the plurality of compute engines is configured to: receive the selected incoming message and the information derivable from the corresponding time based value from at least the selected one of the gateways;
  • determine a selected compute-response message based on at least one of the selected incoming message and the sequence-marked message, the selected compute-response message configured to complete the electronic trade matching function;
  • the selected one of the gateways is further configured to:
  • the various “data processors” may each be implemented by a physical or virtual general purpose computer having a central processor, memory, disk or other mass storage, communication interface(s), input/output (I/O) device(s), and other peripherals.
  • the general-purpose computer is transformed into the processors and executes the processes described above, for example, by loading software instructions into the processor, and then causing execution of the instructions to carry out the functions described.
  • such a computer may contain a system bus, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system.
  • the bus or busses are essentially shared conduit(s) that connect different elements of the computer system (e.g., one or more central processing units, disks, various memories, input/output ports, network ports, etc.) that enables the transfer of information between the elements.
  • One or more central processor units are attached to the system bus and provide for the execution of computer instructions.
  • I/O device interfaces for connecting the disks, memories, and various input and output devices.
  • Network interface(s) allow connections to various other devices attached to a network.
  • One or more memories provide volatile and/or nonvolatile storage for computer software instructions and data used to implement an embodiment. Disks or other mass storage provides non-volatile storage for computer software instructions and data used to implement, for example, the various procedures described herein.
  • Embodiments may therefore typically be implemented in hardware, custom designed semiconductor logic, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), firmware, software, or any combination thereof.
  • the procedures, devices, and processes described herein are a computer program product, including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the system.
  • a computer program product can be installed by any suitable software installation procedure, as is well known in the art.
  • at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.
  • Embodiments may also be implemented as instructions stored on a non-transient machine-readable medium, which may be read and executed by one or more procedures.
  • a non-transient machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device).
  • a non-transient machine-readable medium may include read only memory (ROM); random access memory (RAM); storage including magnetic disk storage media; optical storage media; flash memory devices; and others.
  • firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
  • block and system diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate the block and network diagrams and the number of block and network diagrams illustrating the execution of the embodiments be implemented in a particular way.

Abstract

A distributed computing system, such as may be used to implement an electronic trading system, supports a notion of fairness in latency. The system does not favor any particular client. Thus, being connected to a particular access point into the system (such as via a gateway) does not give any particular device an unfair advantage or disadvantage over another. That end is accomplished by precisely controlling latency, that is, the time between when request messages arrive at the system and a time at which corresponding response messages are permitted to leave. The precisely controlled, deterministic latency can be fixed over time, or it can vary according to some predetermined pattern, or vary randomly within a pre-determined range of values.

Description

HIGHLY DETERMINISTIC LATENCY IN A DISTRIBUTED SYSTEM
CROSS REFERENCE TO RELATED APPLICATION(S)
BACKGROUND
Cross Reference to Related Application
[0001] This patent application claims priority to co-pending U.S. Patent Application Serial No. 16/988,249 filed August 7, 2020 entitled “Highly Deterministic Latency in a Distributed System”, and co-pending U.S. Patent Application Serial No. 16/988,491 filed August 7, 2020 entitled “Sequencer Bypass with Transactional Preprocessing in Distributed System”, the entire contents of each of which are hereby incorporated by reference.
Technical Field
[0002] This patent application relates to connected devices, and more particularly to providing deterministic latency.
BACKGROUND
[0003] The financial instrument trading systems currently in widespread use in the major stock exchanges allow traders to submit orders and receive confirmations, market data, and other information, electronically, via communications networks. The typical electronic trading system includes a matching engine, typically residing within a central server, and a plurality of gateways that provide access to the matching engine, as well as other distributed processors. The typical order process can be as follows: request messages representing orders (e.g., bid orders and/or ask orders) are received, as sent from client devices (e.g., trader terminals operated by human users or servers executing automated trading algorithms). An order acknowledgement is then typically returned to the client devices via the gateway that forwarded the request. The exchange may perform additional processing before the order processing acknowledgement is returned to the client device.
[0004] The exchange system may also disseminate information related to the order message, either in the same form as received or otherwise, to other systems to generate market data output.
[0005] Latency, generally speaking, is the time between the input to a system and an observable response. In the context of communications systems, latency is measured as the difference in the time when a message enters or is received by the system, and the time when a corresponding response message is sent out. Latency is a particularly important consideration in high-speed electronic trading systems, where it is desirable to minimize the time it takes to execute a trade.
[0006] "Determinism is the New Latency", Solution Brief © 2019 by Arista Networks, Inc. explains that one approach to controlling latency is a “speed bump” approach, which involves introducing an approximately 350 µs delay using a long length of optic fibre in the message path. Every order thus takes exactly the same amount of time to traverse the fibre. In another approach described in this document, frequently used trade data can be kept in a matching engine cache memory to minimize latency. There is also some discussion of the problems associated with trading systems that use multiple gateways to forward orders to multiple matching engines. Gateways may be allocated to participants, but that leads to another source of non-determinism: contention. It is noted that if the time required to process the order by a gateway is not deterministic, it is possible that two orders sent to the exchange in one sequence may actually be executed in a different sequence. But no solution is suggested for these problems.
[0007] U.S. Pre-grant Publication 2019/0097745 describes a communication network that uses time stamps to reduce the impact of non-deterministic delays. The state of a transmit path is estimated by observing a “non-deterministic” delay of previously transmitted packets. Transmission circuits then hold outgoing packets until packet processing circuitry provides a deterministic latency for the packet.
[0008] ICON Packet Transport, by Schweitzer Engineering Laboratories, Inc. © 2016 is an example of a networking device that provides deterministic, low latency packetization using a jitter buffer.
[0009] U.S. Patent 7,496,086 describes a voice network having a set of gateways that use jitter buffers to equalize delay.
[0010] U.S. Patent 7,885,296 assigns timestamps to frames, and maintains synchronization among multiple timestamp counters distributed among different Physical Layer (PHY) transceivers.
[0011] U.S. Pre-grant Publication 2018/0359195 describes a network switch that uses a special type of tree data structure to identify a timestamp range for a received packet, such as may be used for streaming media in a Real-time Transport Protocol (RTP) network.
SUMMARY OF PREFERRED EMBODIMENTS
[0012] As described herein, preferred embodiments of a distributed computing system, such as an electronic trading system, provide perfectly deterministic latency. In one example implementation, an inbound message, such as a request from a market participant or other client node, enters the system via one of a number of gateway nodes. The gateway node receiving the inbound message then applies an ingress time based value (which may be a “timestamp” that depends on the time of receipt) to the message. The message (including the timestamp now embedded within it) is forwarded to be processed by other nodes in the distributed system. As part of processing the message, any corresponding response message generated by other nodes in the distributed system also retains the same timestamp value (and/or constant) embedded within it.
[0013] As a corresponding response message to the request is readied to be sent back out to the participant / client by the gateway node, the response message first goes through an egress “quality of service” (QOS) shaper. The QOS shaper ensures the response message is sent out of the system only at a very precise deterministic time that depends on the ingress timestamp plus some deterministic delay.
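The relationship described above (egress time equals the ingress timestamp plus a deterministic delay) can be illustrated with a minimal sketch. The names `Message`, `egress_time`, and `hold_time` are hypothetical and not taken from this application, and time is modeled as plain integer time units rather than any particular hardware clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    payload: str
    ingress_ts: int  # time based value applied by the gateway on receipt (arbitrary units)


def egress_time(msg: Message, deterministic_delay: int) -> int:
    """Time at which the response for `msg` is permitted to leave the system."""
    return msg.ingress_ts + deterministic_delay


def hold_time(msg: Message, deterministic_delay: int, now: int) -> int:
    """How much longer the shaper must hold a response that is ready at `now`."""
    remaining = egress_time(msg, deterministic_delay) - now
    if remaining < 0:
        # Assumption of this sketch: a late response means the delay was chosen too small.
        raise ValueError("response is late; deterministic delay was set too low")
    return remaining
```

For example, a request stamped at time 5000 with a delay of 1000 time units leaves at 6000 regardless of whether internal processing finished at 5400 or 5900; only the hold time differs.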
[0014] The QOS shaper may be implemented as a “packet scheduler,” which organizes outgoing messages into a set of indexed, temporary storage locations (or “buckets”), one associated with each discrete high precision timing interval, so that an entry placed in a particular location in the scheduler is guaranteed to be released at the precise associated time interval.
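A software sketch of such a bucketed scheduler follows. It is illustrative only: the class name `PacketScheduler`, the dictionary-based storage, and the assumption that `release` is called exactly once per timing interval are choices made for this example, not details stated in this application.

```python
from collections import defaultdict


class PacketScheduler:
    """Indexed buckets, one per discrete timing interval (hypothetical sketch)."""

    def __init__(self, interval: int):
        self.interval = interval          # width of one timing interval, in time units
        self.buckets = defaultdict(list)  # bucket index -> messages queued for that interval

    def schedule(self, message: str, egress_time: int) -> int:
        """Place a message in the bucket for its egress time; return the bucket index."""
        index = egress_time // self.interval
        self.buckets[index].append(message)
        return index

    def release(self, now: int) -> list:
        """Remove and return every message whose bucket matches the current interval."""
        return self.buckets.pop(now // self.interval, [])
```

Because a message is released purely by bucket index, two responses scheduled for the same egress time leave in the same interval even if they finished processing at different moments.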
[0015] As an alternative implementation, rather than assigning a timestamp to the arriving message, the gateway may instead directly associate the inbound message with an indexed location of the packet scheduler associated with a desired egress time. As with other implementations, this egress time is carried with the message through the system so that a corresponding response message processed and generated by the distributed system core will be sent at the allocated time.
[0016] One advantage of the system described herein is that, unlike prior trading systems, all users of the system obtain a response with the same latency. Whether the user is a market participant, or simply a subscriber of a market data feed, every user of the system experiences the same, deterministic response time. The deterministic response time can be a fixed time value that does not vary. However, with other notions of fairness, the deterministic time can instead follow a predetermined pattern, or may be a randomly selected time that is chosen across a range of possible deterministic response times.
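The three notions of deterministic response time mentioned above (fixed, following a predetermined pattern, or randomly chosen from a range of possible values) can be sketched as interchangeable latency sources. The generator names and the example values are hypothetical and serve only to illustrate the idea.

```python
import itertools
import random


def fixed_latency(value):
    """Always yield the same deterministic latency."""
    while True:
        yield value


def patterned_latency(pattern):
    """Cycle through a predetermined sequence of latencies."""
    yield from itertools.cycle(pattern)


def random_latency(candidates, seed=None):
    """Pick each latency at random from a fixed set of permitted values."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(candidates)
```

In each case the latency applied to a given message is selected by the system rather than by internal processing variability, so every value drawn should still exceed the worst-case internal delay.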
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] Additional novel features and advantages of the approaches discussed herein are evident from the text that follows and the accompanying drawings, where:
[0018] Fig. 1A is a high level block diagram of a distributed electronic trading system.
[0019] Fig. 1B illustrates messages travelling from a gateway to a compute node on a direct path and through a sequencer node.
[0020] Fig. 1C is an example format of a message.
[0021] Fig. 2 is a more detailed view of a system component such as a gateway or compute node.
[0022] Fig. 3A illustrates how a time based value is applied to inbound and outbound messages.
[0023] Fig. 3B illustrates a packet scheduler.
[0024] Fig. 4 is an example where the system provides a deterministic latency of 1000 time units.
[0025] Fig. 5 shows an asynchronous outbound message.
[0026] Fig. 6A is an example where the system provides a deterministic latency of 2000 time units selected from a deterministic range.
[0027] Fig. 6B is an example where the system provides a deterministic latency of 3500 time units selected from a deterministic range.
[0028] Fig. 7 illustrates how a set of deterministic latency values can be determined.
[0029] Fig. 8 shows how a fixed latency time can be selected.
[0030] Fig. 9 shows how a set of fixed latency times can be selected across a range.
[0031] Fig. 10 is an example in which different participants experience varying deterministic latency based on an additional parameter, such as the financial trading protocol used.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT(S)
System Overview
[0032] Example embodiments disclosed herein relate to a high-speed electronic trading system that provides a market where orders to buy and sell financial instruments (such as stocks, bonds, commodities, futures, options, and the like) are traded among market participants (such as traders and brokers). The electronic trading system exhibits low latency, fairness, fault tolerance, deterministic latency, and other features more fully described below.
[0033] The electronic trading system is primarily responsible for “matching” trade orders to one another. In one example, an offer to “buy” an instrument is matched to a corresponding counteroffer to “sell”. The matched offer and counteroffer should at least partially satisfy the desired price, with any residual unsatisfied quantity passed to another suitable counterorder. Matched orders are then paired and the trade is executed.
[0034] Any wholly unsatisfied or partially satisfied orders are maintained in a data structure referred to as an “order book”. The retained information regarding unmatched trade orders can be used by the matching engine to satisfy subsequent trade orders. An order book is typically maintained for each instrument and generally defines or otherwise represents the state of the market for that particular product. It may include, for example, the recent prices and quantities at which market participants have expressed a willingness to buy or sell.
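The matching and order book behavior described in the two preceding paragraphs can be sketched as follows. This is a deliberately simplified illustration: the `Order` and `OrderBook` names, integer prices, and best-price-first matching with insertion order as the tie-breaker are assumptions of the example, not the matching rules of any particular embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    side: str      # "buy" or "sell"
    price: int     # limit price
    quantity: int  # remaining unfilled quantity


@dataclass
class OrderBook:
    bids: list = field(default_factory=list)  # resting buy orders
    asks: list = field(default_factory=list)  # resting sell orders

    def submit(self, order: Order) -> list:
        """Match an incoming order against the book; return (price, quantity) fills."""
        fills = []
        if order.side == "buy":
            resting, own = self.asks, self.bids
        else:
            resting, own = self.bids, self.asks
        while order.quantity > 0 and resting:
            if order.side == "buy":
                top = min(resting, key=lambda r: r.price)  # lowest ask
                if top.price > order.price:
                    break
            else:
                top = max(resting, key=lambda r: r.price)  # highest bid
                if top.price < order.price:
                    break
            traded = min(order.quantity, top.quantity)
            fills.append((top.price, traded))
            order.quantity -= traded  # note: the incoming order is updated in place
            top.quantity -= traded
            if top.quantity == 0:
                resting.remove(top)
        if order.quantity > 0:
            own.append(order)  # residual quantity rests in the order book
        return fills
```

A buy order for 100 units at 101 submitted against resting sells of 30 at 100 and 50 at 101 is filled for 80 units, and the remaining 20 units are retained in the book as a resting bid available to satisfy later sell orders.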
[0035] The results of matching may also be made visible to market participants via streaming data services referred to as market data feeds. A market data feed typically includes individual messages that carry the pricing for each traded instrument, and related information such as volume and other statistics.
[0036] Fig. 1A illustrates an example electronic trading system 100 that includes a number of gateways 120-1, 120-2, ..., 120-g (collectively referred to as gateways 120), a set of core compute nodes 140-1, 140-2, ..., 140-c (collectively, the core compute nodes 140 or compute nodes 140), and one or more sequencers 150-1, 150-2, ..., 150-s (collectively, the sequencers 150). In some embodiments, the gateways 120, core compute nodes 140, and sequencers 150 are thus considered to be nodes in electronic trading system 100. As will be described in more detail below, in one embodiment, the gateways 120, compute nodes 140 and sequencers 150 are directly connected to one another, preferably via low latency, dedicated connections 180.
[0037] The term “peer” in relation to the discussion of the system 100 refers to another device that generally serves the same function (e.g., “gateway” vs. “core compute node” vs. “sequencer”) in electronic trading system 100. For example, gateways 120-2, . . ., 120-g are the peers for gateway 120-1, core compute nodes 140-2, . . ., 140-c are the peers for core compute node 140-1, and sequencers 150-2, . . ., 150-s are the peers for sequencer 150-1.
[0038] The terms “active” and “standby,” in relation to the discussion of the system 100, may refer to a high availability (HA) role/state/mode of a system/component. In general, a standby system/component is a redundant (backup) system/component that is powered on and ready to take over function(s) performed by an active system/component. Such switchover/failover, that is, a transition from the standby role/state/mode to the active role/state/mode, may be performed automatically in response to failure of the currently active system/component for non-limiting example.
[0039] The electronic trading system 100 processes trade orders from and provides related information to one or more participant computing devices 130-1, 130-2, . . ., 130-p (collectively, the participant devices 130). Participant devices 130 interact with the system 100, and may be one or more personal computers, tablets, smartphones, servers, or other data processing devices configured to display and receive trade order information. The participant devices 130 may be operated by a human via a graphical user interface (GUI), or they may be operated via high-speed automated trading methods running on some physical or virtual data processing platform.
[0040] Each participant device 130 may exchange messages with (that is, send messages to and receive messages from) the electronic trading system 100 via connections established with a gateway 120. While Fig. 1A illustrates each participant device 130 as being connected to electronic trading system 100 via a single connection to a gateway 120, it should be understood that a participant device 130 may be connected to electronic trading system 100 over multiple connections to one or more gateway devices 120.
[0041] Note that, while each gateway 120-1 may serve a single participant device 130, it typically serves multiple participant devices 130.
[0042] The compute nodes 140-1, 140-2, . . ., 140-c (also referred to herein as matching engines 140 or compute engines 140) provide the matching functions described above and may also generate outgoing messages to be delivered to one or more participant devices 130. Each compute node 140 is a high-performance data processor and typically maintains one or more data structures to search and maintain one or more order books 145-1, 145-2, ..., 145-b. An order book 145-1 may be maintained, for example, for each instrument for which the core compute node 140-1 is responsible. One or more of the compute nodes 140 and/or one or more of the gateways 120 may also provide market data feeds 147. Market data feeds 147 may be broadcast (for example, multicast), to subscribers, which may be participant devices 130 or any other suitable computing devices.
[0043] Some outgoing messages generated by core compute nodes 140 may be synchronous, that is, generated directly by a core compute node 140 in response to one or more incoming messages received from one or more participant devices 130, such as an outgoing “acknowledgement message” or “execution message” in response to a corresponding incoming “new order” message. In some embodiments, however, at least some outgoing messages may be asynchronous, initiated by the trading system 100, for example, certain “unsolicited” cancel messages and “trade break” or “trade bust” messages.
[0044] Distributed computing environments, such as the electronic trading system 100, can be configured with multiple matching engines operating in parallel on multiple compute nodes 140.
[0045] The sequencers 150 ensure that the proper sequence of any order-dependent operations is maintained. To ensure that operations on incoming messages are not performed out of order, incoming messages received at one or more gateways 120, for example, a new trade order message from one of participant devices 130, typically may then pass through at least one sequencer 150 (e.g., a single currently active sequencer, and possibly one or more standby sequencers) in which they are marked with a sequence identifier (by the single currently active sequencer, if multiple sequencers are present). That identifier may be a unique, monotonically increasing value which is used in the course of subsequent processing throughout the distributed system 100 (e.g., electronic trading system 100), to determine the relative ordering among messages and to uniquely identify messages throughout electronic trading system 100. In some embodiments, the sequence identifier may be indicative of the order (i.e., sequence) in which a message arrived at the sequencer. For example, the sequence identifier may be a value that is monotonically incremented or decremented according to a fixed interval by the sequencer for each arriving message; for example, the sequence identifier may be incremented by one for each arriving message. It should be understood, however, that, while unique, the sequence identifier is not limited to a monotonically increasing or decreasing value. In some embodiments, the original, unmarked, messages and the sequence-marked messages may be essentially identical, except for the sequence identifier value included in the marked versions of the messages. Once sequenced, the marked incoming messages, that is, the sequence-marked messages, are typically then forwarded by sequencer(s) 150 to other downstream compute nodes 140 to perform potentially order-dependent processing on the messages.
Thus, besides uniquely identifying a message throughout electronic trading system 100, the sequence identifier assigned by sequencer 150 may also determine a relative ordering of each marked message among other marked messages in the electronic trading system 100.
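For non-limiting example, the marking of messages by a sequencer may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Message` and `Sequencer` names, the `sender_id` and `sender_msg_id` fields, and the choice to start at one and increment by one are all hypothetical illustrations of one of the options described above.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Message:
    sender_id: str                      # hypothetical: identifies the generating node
    sender_msg_id: int                  # hypothetical: per-sender message identifier
    payload: str
    sequence_id: Optional[int] = None   # unset on the original, unmarked message


class Sequencer:
    """Marks each arriving message with the next sequence identifier."""

    def __init__(self, start: int = 1) -> None:
        # Assumption: identifiers increase by one for each arriving message.
        self._next_id = count(start)

    def mark(self, msg: Message) -> Message:
        # The marked copy is identical to the original except for the identifier.
        return replace(msg, sequence_id=next(self._next_id))


sequencer = Sequencer()
first = sequencer.mark(Message("gw-1", 7, "new order"))
second = sequencer.mark(Message("gw-2", 3, "cancel order"))
```

Because the identifier is assigned in arrival order at a single active sequencer, comparing `first.sequence_id` with `second.sequence_id` is sufficient to order the two messages anywhere downstream.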
[0046] As such, in contrast to other purposes for which a sequence identifier may be employed, the unique sequence identifier disclosed herein may be used for ensuring deterministic order (i.e., sequence) for electronic-trade message processing. The unique sequence identifier represents a unique, deterministic ordering (i.e., sequence) directive for processing of a given electronic trade message relative to other trade messages within an electronic trading system. According to an example embodiment, the sequence identifier may be populated in a sequence ID field 110-14 of a message, as disclosed further below with regard to FIG. 1C for non-limiting example.
[0047] In some embodiments, messages may also flow in the other direction, that is, from a core compute node 140 to one or more of the participant devices 130, passing through one or more of the gateways 120. Such outgoing messages generated by a core compute node 140 may also be order-dependent (i.e., sequence-order dependent), and accordingly may also typically first pass through a sequencer 150 to be marked with a sequence identifier. The sequencer 150 may then forward the marked response message to the gateways 120 in order to pass on to participant devices 130 in a properly deterministic order.
[0048] The use of a sequencer 150 to generate unique sequence numbers and mark messages or representations thereof with same, that is, to generate sequence-marked messages, ensures the correct ordering of operations is maintained throughout the distributed system, that is, the electronic trading system 100, regardless of which compute node or set of compute nodes 140 processes the messages. This approach provides “state determinism,” for example, an overall state of the system is deterministic and reproducible (possibly somewhere else, such as at a disaster recovery site), to provide fault-tolerance, high availability and disaster recoverability.
[0049] It may also be important for a generating node (i.e., a node introducing a new message into the electronic trading system 100, for example by generating a new message and/or by forwarding a message received from a participant device 130) and its peer nodes to receive the sequence number assigned to that message. Receiving the sequence number for a message it generated may be useful to the generating node and its peer nodes not only for processing messages in order, according to their sequence numbers, but also to correlate the message generated by the node with the message’s sequence identifier that is used throughout the rest of the electronic trading system 100. Such a correlation between an unmarked version of a message as introduced by a generating node into the electronic trading system and the sequence marked version of the same message outputted by the sequencer may be made via identifying information in both versions of the message, as discussed further below in connection with Fig. 1C. A subsequent message generated within the electronic trading system 100, while also being assigned its own sequence number, may yet reference one or more sequence numbers of related preceding messages. Accordingly, a node may need to quickly reference (by sequence number) a message the node had itself previously generated, because, for example, the sequence number of the message the node had generated was referenced in a subsequent message.
[0050] In some embodiments, the generating node may first send a message to the sequencer 150 and wait to receive the sequence number for the message from the sequencer before the generating node forwards the message to other nodes in electronic trading system 100.
[0051] In alternate example embodiments, to avoid at least one hop, which could add undesirable increased latency within electronic trading system 100, after receiving the unsequenced message from the generating node, sequencer 150 may not only send a sequenced version of the message (e.g., a sequence-marked message) to destination nodes, but may also send substantially simultaneously a sequenced version of the message back to the sending node and its peers. For example, after assigning a sequence number to an incoming message sent from the gateway 120-1 to core compute nodes 140, the sequencer 150 may not only forward the sequenced version of the message to the core compute nodes 140, but may also send a sequenced version of that message back to the gateway 120-1 and the other gateways 120. Accordingly, if any subsequent message generated in a core compute node 140 references that sequence number, any gateway 120 may easily identify the associated message originally generated by gateway 120-1 by its sequence number.
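For non-limiting example, the fan-out described above, in which the sequencer sends the sequence-marked message both onward to the destination nodes and back to the sending node and its peers, may be sketched as follows. The function name `fan_out` and the node labels are hypothetical, and actual transmission over the direct connections is replaced here by building a dictionary of deliveries.

```python
from typing import Dict, List, Tuple


def fan_out(sequence_id: int, payload: str,
            sender_group: List[str],
            destination_group: List[str]) -> Dict[str, Tuple[int, str]]:
    """Return the sequence-marked message delivered to each node."""
    marked = (sequence_id, payload)
    # Destinations and the sender's own group receive the same marked copy,
    # so the sender does not have to wait for the identifier before forwarding.
    return {node: marked for node in destination_group + sender_group}


gateways = ["gw-1", "gw-2"]
compute_nodes = ["cn-1", "cn-2"]
delivered = fan_out(42, "new order", gateways, compute_nodes)
```

In this sketch every gateway ends up holding the identifier 42 for the message that gateway `gw-1` introduced, which is what allows any gateway to resolve a later message that references that identifier.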
[0052] Similarly, in some further embodiments, a sequenced version of an outgoing message generated by and sent from a core compute node 140 to gateways 120, and sequenced by sequencer 150, may be forwarded by sequencer 150 both to gateways 120 and back to core compute nodes 140.
[0053] Some embodiments may include multiple sequencers 150 for high availability, for example, to ensure that another sequencer is available if the first sequencer fails. For embodiments with multiple sequencers 150 (e.g., a currently active sequencer 150-1, and one or more standby sequencers 150-2, ..., 150-s), the currently active sequencer 150-1 may maintain a system state log (not shown) of all the messages that passed through sequencer 150-1, as well as the messages’ associated sequence numbers. This system state log may be continuously or periodically transmitted to the standby sequencers to provide them with requisite system state to allow them to take over as an active sequencer, if necessary. Alternatively, the system state log may be stored in a data store that is accessible to the multiple sequencers 150.
[0054] The system state log may also be continually or periodically replicated to one or more sequencers in a standby replica electronic trading system (not shown in detail) at a disaster recovery site 155, thereby allowing electronic trading to continue with the exact same state at the disaster recovery site 155, should the primary site of system 100 suffer catastrophic failure.
[0055] According to an example embodiment, a currently active sequencer of a plurality of sequencers may store the system state log in a data store (not shown). The data store may be accessible to the plurality of sequencers via a shared sequencer network, such as the sequencer-wide shared network 182-s disclosed further below with regard to Fig. 1A. In an event a given sequencer of the plurality of sequencers transitions its role (state) from standby to active, such sequencer may retrieve the system state log from the data store to synchronize state with that of the former active sequencer.
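For non-limiting example, a standby sequencer synchronizing from a shared system state log may be sketched as follows. This is an illustrative sketch only: the `LoggingSequencer` class, its `take_over` method, and the use of an in-memory list in place of the data store are hypothetical, and the sketch assumes identifiers that start at one and increase by one.

```python
from typing import List, Tuple

StateLog = List[Tuple[int, str]]   # (sequence number, message) entries


class LoggingSequencer:
    def __init__(self) -> None:
        self.role = "standby"
        self.next_id = 1

    def mark(self, payload: str, log: StateLog) -> int:
        if self.role != "active":
            raise RuntimeError("only the active sequencer marks messages")
        sequence_id = self.next_id
        log.append((sequence_id, payload))   # record the message with its number
        self.next_id += 1
        return sequence_id

    def take_over(self, log: StateLog) -> None:
        # Synchronize with the former active sequencer by reading the log,
        # then resume numbering immediately after the last logged entry.
        self.next_id = log[-1][0] + 1 if log else 1
        self.role = "active"


shared_log: StateLog = []                # stands in for the shared data store
primary, standby = LoggingSequencer(), LoggingSequencer()
primary.take_over(shared_log)
primary.mark("new order", shared_log)
primary.mark("cancel order", shared_log)
primary.role = "failed"                  # simulated failure of the active sequencer
standby.take_over(shared_log)
resumed_id = standby.mark("replace order", shared_log)
```

The standby continues the numbering without a gap or a repeat, which is the property that lets downstream nodes keep relying on the sequence identifier after a failover.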
[0056] In some embodiments, the system state log may also be provided to a drop copy service 152, which may be implemented by one or more of the sequencers, and/or by one or more other nodes in the electronic trading system 100. The drop copy service 152 may provide a record of daily trading activity through electronic trading system 100 that may be delivered to regulatory authorities and/or clients, who may, for example, be connected via participant devices 130. In alternate embodiments, the drop copy service 152 may be implemented on one or more of the gateways 120. Furthermore, in addition to or instead of referencing the system state log, the drop copy service 152 may provide the record of trading activity based on the contents of incoming and outgoing messages sent throughout electronic trading system 100. For example, in some embodiments, a gateway 120 implementing the drop copy service 152 may receive from the sequencer 150 (and/or from core compute nodes 140 and other gateways 120) all messages exchanged throughout the electronic trading system 100. A participant device 130 configured to receive the record of daily trading activity from the drop copy service 152 may not necessarily also be sending trade orders to and utilizing a matching function of electronic trading system 100.
[0057] Messages exchanged between participant devices 130 and gateways 120 may be according to any suitable protocol that may be used for financial trading (referred to for convenience as, “financial trading protocol”). For example, the messages may be exchanged according to custom protocols or established standard protocols, including both binary protocols (such as Nasdaq OUCH and NYSE UTP), and text-based protocols (such as NYSE FIX CCG). In some embodiments, the electronic trading system 100 may support exchanging messages simultaneously according to multiple financial trading protocols, including multiple protocols simultaneously on the same gateway 120. For example, participant devices 130-1, 130-2, and 130-3 may simultaneously have established trading connections and may be exchanging messages with gateway 120-1 according to Nasdaq OUCH, NYSE UTP, and NYSE FIX CCG, respectively.
[0058] Furthermore, in some embodiments, the gateways 120 may translate messages according to a financial trading protocol received from a participant device 130 into a normalized (e.g., standardized) message format used for exchanging messages among nodes within the electronic trading system 100. The normalized trading format may be an existing protocol or may generally be of a different size and data format than that of any financial trading protocol used to exchange messages with participant devices 130. For example, the normalized trading format, when compared to a financial trading protocol of the original incoming message received at the gateway 120 from a participant device 130, may include in some cases one or more additional fields or parameters, may omit one or more fields or parameters, and/or each field or parameter of a message in the normalized format may be of a different data type or size than the corresponding message received at gateway 120 from the participant device 130. Similarly, in the other direction, gateways 120 may translate outgoing messages generated in the normalized format by electronic trading system 100 into messages in the format of one or more financial trading protocols used by participant devices 130 to communicate with gateways 120. In Fig. 1B, disclosed below, such incoming/outgoing messages (e.g., the incoming message 103 and outgoing message 105) are communicated between the gateway 120-1 and a participant device 130.
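For non-limiting example, the translation performed by a gateway between a participant-facing protocol and the normalized format may be sketched as follows. The pipe-delimited text layout used here is a hypothetical stand-in and does not correspond to any of the protocols named above; the `NormalizedOrder` type and both function names are likewise hypothetical. The sketch illustrates fields changing data type during normalization (text to `Decimal` and `int`).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NormalizedOrder:
    message_type: str
    symbol: str
    side: str
    price: Decimal    # normalized to an exact decimal rather than text
    quantity: int


def from_text_protocol(line: str) -> NormalizedOrder:
    """Translate a hypothetical 'TYPE|SYMBOL|SIDE|PRICE|QTY' message inward."""
    msg_type, symbol, side, price, qty = line.split("|")
    return NormalizedOrder(msg_type.lower(), symbol, side.lower(),
                           Decimal(price), int(qty))


def to_text_protocol(order: NormalizedOrder) -> str:
    """Translate a normalized message back for the outgoing direction."""
    return "|".join([order.message_type.upper(), order.symbol,
                     order.side.upper(), str(order.price), str(order.quantity)])


order = from_text_protocol("NEW|ACME|BUY|10.25|100")
round_trip = to_text_protocol(order)
```

Keeping the price as a `Decimal` rather than a binary float is a design choice of this sketch; it avoids rounding surprises when prices are later compared or validated.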
[0059] Fig. 1B is a block diagram of an example embodiment of the electronic trading system 100 of Fig. 1A, disclosed above. In the particular embodiment, the electronic trading system 100 comprises the gateway 120-1 coupled to the core compute node 140-1 via an activation link 180-1-1 and an ordering (i.e., sequencing) path 117. The electronic trading system 100 further comprises the sequencer 150-1 electronically disposed within the ordering path 117. The gateway 120-1 is configured to transmit a message (not shown) to the core compute node 140-1 via the activation link 180-1-1 and the ordering path 117, in response to reception of the incoming message 103. The core compute node 140-1 is configured to receive the message (also referred to as an unsequenced message) from the gateway 120-1 and a sequence-marked version (not shown) of the message from the sequencer 150-1.
[0060] The sequence-marked version includes a sequence identifier (ID), such as may be included in a sequence ID field 110-14 of the sequence-marked message, as disclosed further below with regard to Fig. 1C for non-limiting example. The sequence ID indicates a deterministic position of the sequence-marked version of the message among a plurality of sequence-marked versions of other messages, the other messages having been communicated via the activation link 180-1-1 and received by the sequencer 150-1 via the ordering path 117. The plurality of messages among which the sequence ID indicates a deterministic position also includes the other sequence-marked versions of messages received by the core compute node 140-1 via the ordering path 117. The message (e.g., unsequenced message) and sequence-marked version include common metadata (not shown). By correlating the message with its sequence-marked version via the common metadata, the sequence ID of the message is identified. The sequence ID further indicates a deterministic position of the message among all messages communicated throughout the electronic trading system 100 that pass through the sequencer 150-1 and are, thus, sequence-marked by the sequencer 150-1.
[0061] While elements of the electronic trading system 100 may timestamp messages communicated therein, it should be understood that the sequence ID determined by the sequencer 150-1 determines the position (order/priority) of the messages communicated in the electronic trading system 100. It is possible that multiple systems may timestamp messages with a same timestamp and, thus, order/priority for such messages would need to be resolved at a receiver of same. Such is not the case in the electronic trading system 100 as the sequencer 150-1 may be the sole determiner of order/priority of messages communicated throughout the electronic trading system 100.
[0062] The core compute node 140-1 may be configured to (i) commence a matching function activity for an electronic trade responsive to receipt of the message via the activation link 180-1-1, and (ii) responsive to receipt of the sequence-marked version via the ordering path 117, use the sequence identifier to prioritize completion of the matching function activity toward servicing the electronic trade.
[0063] While the core compute node 140-1 may commence the electronic trading function, that is, the matching function activity, upon receipt of the message (i.e., unsequenced message), thereby starting the processing of the unsequenced message, the core compute node 140-1 may not complete the processing and/or commit the results of the processing of the message until the core compute node 140-1 receives the sequence-marked message. Without a deterministic ordering for processing a message, as specified via the sequence identifier in the sequence-marked message, for example, the processing of messages by the compute node 140-1 could be unpredictable. As a non-limiting example of possible unpredictable results, there could be multiple outstanding unsequenced messages, each of which represents a potential match for the contra side in the exchange of a financial security. It is useful for there to be a deterministic way of arbitrating among the multiple potential matches because, perhaps, only a subset among the potential matches may be able to be filled against a given trade order on the contra side.
[0064] According to some embodiments, after having received both the unsequenced message and the sequence-marked message, the compute node 140-1 may correlate the unsequenced message with the sequence-marked message via identifying information in both versions of the message, as discussed below in connection with Fig. 1C. Once the compute node 140-1 has received the sequence-marked message via the ordering path 117, the compute node 140-1 may then determine the proper sequence in which the message (or sequence-marked version of the message) should be processed relative to the other messages throughout electronic trading system 100. The compute node 140-1 may then complete the message processing, including sending out an appropriate response message, possibly referencing the sequence identifier assigned by the sequencer 150-1 and included in the sequence-marked message. Returning to the non-limiting example of multiple messages representing potential matches for the contra side in the exchange of a financial security, once the sequence-marked versions of the messages representing potential matches are received by compute node 140-1, the compute node 140-1 may determine precisely the sequence in which the possible match(es) are to occur and complete the electronic trading matching function.
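For non-limiting example, the behavior described above, in which a compute node commences processing on receipt of the unsequenced message but completes processing only in the order given by the sequence-marked versions, may be sketched as follows. This is a sketch under stated assumptions: the `ComputeNode` class and its method names are hypothetical, the common metadata is assumed to be a (sender identifier, sender message identifier) pair, and sequence identifiers are assumed to start at one and increase by one with no gaps.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]   # hypothetical common metadata: (sender id, sender message id)


class ComputeNode:
    def __init__(self) -> None:
        self.started: Dict[Key, str] = {}     # processing commenced, not committed
        self.sequenced: Dict[int, Key] = {}   # sequence id -> correlated message key
        self.next_to_commit = 1
        self.committed: List[str] = []

    def on_unsequenced(self, key: Key, payload: str) -> None:
        # Activation link: commence work, but commit nothing yet.
        self.started[key] = payload
        self._drain()

    def on_sequence_marked(self, key: Key, sequence_id: int) -> None:
        # Ordering path: correlate via the common metadata, then try to commit.
        self.sequenced[sequence_id] = key
        self._drain()

    def _drain(self) -> None:
        # Complete processing strictly in sequence order.
        while self.next_to_commit in self.sequenced:
            key = self.sequenced[self.next_to_commit]
            if key not in self.started:
                break   # marked version arrived first; wait for the message itself
            self.committed.append(self.started.pop(key))
            del self.sequenced[self.next_to_commit]
            self.next_to_commit += 1


node = ComputeNode()
node.on_unsequenced(("gw-1", 1), "order A")
node.on_unsequenced(("gw-2", 1), "order B")
node.on_sequence_marked(("gw-2", 1), 1)   # the sequencer received order B first
node.on_sequence_marked(("gw-1", 1), 2)
```

Although order A reached the compute node first over the activation link, the node commits order B before order A because the sequencer assigned order B the earlier identifier.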
[0065] According to an example embodiment, besides transmitting the sequence-marked message to the compute node 140-1 via the third direct connection 180-c1-s1, the sequencer 150-1 may further transmit the sequence-marked message via the second direct connection 180-gw1-s1 (of the ordering path 117) to the gateway 120-1. Providing the sequence-marked message to the sender of the message enables the sender, that is, the gateway 120-1, to correlate the sequence number (assigned to the message) with other identifying information in the message (as discussed below in connection with Fig. 1C) so that the sender can easily deal with subsequent messages that reference that sequence number.
[0066] Similar to the compute node 140-1 activating (commencing) processing based on the message received via the activation link 180-1-1, disclosed above, the gateway 120-1 may, upon receipt of the unsequenced response message received from the compute node 140-1 via the activation link 180-1-1, activate processing of such response message, even before the gateway 120-1 receives the sequence-marked version of the response message. As non-limiting examples, activating the processing could include updating the state of an open trade order database on the gateway 120-1 and/or building up the outgoing message 105 ready to be sent to the participant device 130. In some embodiments, the gateway 120-1, however, may not complete the processing of the response message, such processing including transmitting the outgoing message 105 to the participant device 130, until the gateway 120-1 has received the sequence-marked response message, which contains a sequence identifier specifying a deterministic position of the response message in a sequence of messages including the other messages in electronic trading system 100.
[0067] In some embodiments, after having received both the unsequenced response message and the sequence-marked response message, the gateway 120-1 may correlate the unsequenced response message with the sequence-marked response message via identifying information in both versions of the response message, as discussed below in connection with Fig. 1C. The deterministic position of the response message is thereby determined upon receipt of the sequence-marked response message. In some embodiments, the processing of the response message may then be completed, such processing including committing the outgoing message 105 to be transmitted to the participant device, such as the participant device 130 of Fig. 1A.
[0068] Continuing with reference to Fig. 1B, the message transmitted via the activation link 180-1-1 and the sequence-marked version of the message transmitted via the ordering path 117 may include common metadata. The core compute node 140-1 may be further configured to correlate the message with the sequence-marked version based on the common metadata, responsive to receipt of the sequence-marked version via the ordering path 117.
[0069] In the example embodiment of Fig. 1B, the message is transmitted to the core compute node 140-1 via the activation link 180-1-1 in an activation link forward direction, that is, the act-link-fwd-dir 113a, and to the core compute node 140-1 via the ordering path 117 in an ordering path forward direction, that is, the order-path-fwd-dir 115a. Further to completion of the matching function activity, the core compute node 140-1 may transmit a response (not shown) to the gateway 120-1 via the activation link 180-1-1 and the ordering path 117 in an activation link reverse direction (i.e., the act-link-rev-dir 113b) and an ordering path reverse direction (i.e., order-path-rev-dir 115b).
[0070] The activation link 180-1-1 is a single direct connection while the ordering path 117 includes multiple direct connections. For example, the ordering path 117 in the example embodiment includes both the direct connection 180-gw1-s1 and direct connection 180-c1-s1.
[0071] The gateway 120-1, sequencer 150-1, and core compute node 140-1 are arranged in a point-to-point mesh topology, referred to as a point-to-point mesh system 102. The core compute node 140-1 may be configured to perform a matching function (i.e., an electronic trading matching function) toward servicing trade requests received from participant devices 130 and introduced into the point-to-point mesh topology via the gateway 120-1. In the example embodiment of Fig. 1B, the point-to-point mesh system 102 includes a first direct connection (i.e., 180-1-1), second direct connection (i.e., 180-gw1-s1), and third direct connection (i.e., 180-c1-s1). The sequencer 150-1 may be configured to (i) determine a deterministic order (i.e., sequence) for messages communicated between the gateway 120-1 and core compute node 140-1 via the first direct connection and received by the sequencer 150-1 from the gateway 120-1 or core compute node 140-1 via the second or third direct connection, respectively. The sequencer 150-1 may be further configured to (ii) convey position of the messages within the deterministic order by transmitting sequence-marked versions of the messages to the gateway 120-1 and core compute node 140-1 via the second and third direct connections, respectively. The messages represent the trade requests or responses thereto, such as disclosed herein. A message format for such messages is disclosed further below with regard to Fig. 1C.
[0072] The amount of preprocessing that may be done for an unsequenced message, and whether or not the results of that preprocessing may need to be discarded or rolled back, may depend on fields in the message, such as the message type field 110-1, symbol field 110-2, side field 110-3, or price field 110-4, according to the embodiment of Fig. 1C, disclosed further below. The amount may also depend on whether other unsequenced messages are currently outstanding (that is, for which the corresponding sequence-marked message has not yet been received) that reference the same value for a common parameter in the message, such as the same stock symbol.
[0074] For example, if an unsequenced message with a message type of “new order” is received by core compute node 140-1, the core compute node 140-1 may load the symbol information relating to the relevant section of the order book into a fast memory. If the new order would be a match for an open order in the order book, the compute node 140-1 may start to generate a “fill” message, accordingly, but hold off on committing an order book update and on sending the “fill” message out until it receives the sequence-marked version of that message.
[0076] If, however, the compute node 140-1 had also received another outstanding unsequenced “new order” message referencing the same stock symbol, side, price, etc., such that it is also a potential match for the same open order in the order book, the core compute node 140-1 may perform its preprocessing differently. In some embodiments, the core compute node 140-1 may generate competing potential “fill” messages, one for each of the two outstanding unsequenced “new order” messages that could serve as a match for the open order. Based on the sequenced versions of the messages, one of the potential “fill” messages may be discarded, while the other would be committed to the order book and sent out to the gateways 120. In other embodiments, when more than one possible outstanding unsequenced message could potentially match the same open order, the compute node 140-1 may not perform any preprocessing that may need to be discarded or rolled back (e.g., may not create any potential “fill” messages), or it may abort or pause any such preprocessing for those outstanding unsequenced messages.
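For non-limiting example, the resolution of competing potential “fill” messages once the sequence-marked versions arrive may be sketched as follows. The function name `resolve_competing_fills` and the order labels are hypothetical, and the sketch assumes that a lower sequence identifier denotes an earlier position in the sequence and that the open order can be filled by only one of the competing orders.

```python
from typing import Dict, List, Tuple


def resolve_competing_fills(provisional_fills: Dict[str, str],
                            sequence_ids: Dict[str, int]) -> Tuple[str, List[str]]:
    """Commit the fill of the earliest-sequenced order; discard the others."""
    # Assumption: the lowest sequence identifier wins the open order.
    winner = min(provisional_fills, key=lambda order_id: sequence_ids[order_id])
    discarded = sorted(o for o in provisional_fills if o != winner)
    return provisional_fills[winner], discarded


# Two provisional fills prepared while both new orders were still unsequenced.
provisional = {
    "order-X": "fill open-1 against order-X",
    "order-Y": "fill open-1 against order-Y",
}
committed, discarded = resolve_competing_fills(
    provisional, {"order-X": 18, "order-Y": 17})
```

Here order-Y received the earlier identifier, so its provisional fill is committed and the one prepared for order-X is discarded.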
[0077] As another example, an outstanding unsequenced “new order” message that is a potential match for an open order in the order book could be competing with an outstanding unsequenced “replace order” message or “cancel order” message attempting to replace or cancel, respectively, the same open order in the order book that would serve as a potential match to the “new order” message. In such a case, depending on the relative sequence assigned by the sequencer of the “new order” message versus the “replace/cancel order” message, the end result could either culminate in a match between the open order in the order book and the “new order” message, or it could instead culminate in that open order being canceled or replaced by a new order with a different price or quantity. Until the sequence-marked versions of the competing outstanding unsequenced messages are received from the sequencer 150-1, the compute node 140-1 cannot determine which of these two outcomes should result.
[0078] In such instances, the compute node 140-1 may perform preprocessing in different ways. In some embodiments, when there are multiple competing outstanding unsequenced messages, the compute node 140-1 may simply perform preprocessing that would not need to be rolled back or discarded, such as loading into faster memory a relevant section of the order book relating to a symbol referenced in both competing messages. In other embodiments, the compute node 140-1 may perform additional preprocessing, such as forming up one or more provisional potential responses, each corresponding to one of the multiple competing scenarios. For example, the compute node 140-1 may create a potential “fill” message and/or a potential “replace acknowledgement” message or “cancel acknowledgement” message, and possibly also make provisional updates to the order book corresponding to one or more of the multiple possible outcomes. While in some embodiments, the compute node 140-1 may perform this additional preprocessing for all such competing scenarios, in other embodiments, the compute node 140-1 may only perform additional preprocessing on one of, or a subset of, the competing scenarios. For example, the compute node 140-1 may perform the additional preprocessing on an outstanding unsequenced message only if there are no other outstanding competing unsequenced messages. Alternatively, or additionally, the compute node 140-1 may prioritize the performing of additional preprocessing for outstanding competing unsequenced messages according to the amount of time and/or complexity involved in rolling back or discarding the results of the preprocessing. 
Upon receiving the sequence-marked versions of the outstanding unsequenced messages, the compute node 140-1 may then determine the sequence (as assigned by the sequencer 150-1) in which the outstanding unsequenced messages should be processed, and complete the processing of the messages in that sequence, which may in some embodiments include rolling back or discarding one or more results of the preprocessing.
[0079] Besides the types of preprocessing already discussed above, in some embodiments the compute node 140-1 may additionally or alternatively perform preprocessing related to validation of the message to determine whether to accept or reject the message. For example, the preprocessing could include performing real-time risk checks on the message, such as checking that the price or quantity specified in the message does not exceed a maximum value (i.e., “max price check” or “max quantity check”), that the symbol in the message is a known symbol (i.e., “unknown symbol check”), that trading is currently permitted on that symbol (i.e., “symbol halt check”), or that the price is specified properly according to a correct number of decimal places (i.e., “sub penny check”). In some embodiments, the type of preprocessing could also include a “self trade prevention” validation check, to prevent a particular potential match from resulting in a self-trade in which a trading client matches against itself, if “self trade prevention” is enabled for the particular client or trade order. If a trade order fails one or more of these validation checks, the electronic trading system 100 may respond with an appropriate reject message. It should be understood that, even though these validation checks are described in the embodiments above as being performed by the compute node 140-1, at least some of these types of validation checks could in some embodiments be performed alternatively or additionally by a gateway 120 or other nodes in the electronic trading system 100.
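For non-limiting example, several of the validation checks named above may be sketched as follows. The function name `validate`, its parameters, and the limit values are hypothetical, and the “sub penny check” here assumes that at most two decimal places are permitted, which is an illustrative simplification rather than a statement of any market's rules. The self trade prevention check is omitted from the sketch.

```python
from decimal import Decimal
from typing import List, Set


def validate(symbol: str, price: Decimal, quantity: int, *,
             known_symbols: Set[str], halted_symbols: Set[str],
             max_price: Decimal, max_quantity: int) -> List[str]:
    """Return the names of the checks that the message fails (empty if accepted)."""
    failures: List[str] = []
    if symbol not in known_symbols:
        failures.append("unknown symbol check")
    elif symbol in halted_symbols:
        failures.append("symbol halt check")
    if price > max_price:
        failures.append("max price check")
    if quantity > max_quantity:
        failures.append("max quantity check")
    # Assumption: prices may carry at most two decimal places.
    if price != price.quantize(Decimal("0.01")):
        failures.append("sub penny check")
    return failures


limits = dict(known_symbols={"ACME"}, halted_symbols=set(),
              max_price=Decimal("1000"), max_quantity=10_000)
rejections = validate("ACME", Decimal("10.255"), 100, **limits)
```

A non-empty result would correspond to the reject message described above, and because none of these checks depends on message order, they are candidates for preprocessing that never needs to be rolled back.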
[0080] In further embodiments, it may be beneficial or required for the gateway 120-1 to be informed of the unique system-wide sequence identifier associated with a message that originated from a client. This information may enable the gateway 120-1 to match up the original incoming message to the unique sequence number, which is used to ensure proper ordering of messages throughout the electronic trading system 100. Such a configuration at the gateway(s) may be required for the electronic trading system 100 to achieve state determinism and to provide fault-tolerance, high availability, and disaster recoverability with respect to the activity in the gateways. One solution for configuring the gateway 120-1 to maintain information on the sequence identifier associated with an incoming message is for the gateway 120-1 to wait for a response back from the sequencer 150-1 with the sequence identifier before forwarding the message to the compute node 140-1. Such an approach may add latency to the processing of messages. In a further example, in addition to forwarding to the compute node 140-1 a sequence-marked message it had originally received from the gateway 120-1, the sequencer 150-1 may also send, in parallel, the sequence-marked message to the gateway 120-1. As a result, the gateway 120-1 may maintain information on the sequence identifier while minimizing latency at the electronic trading system 100.
[0081] Fig. 1C is a table of an example embodiment of fields of a message format 110 for trading messages, such as trading messages exchanged among nodes in the electronic trading system 100 disclosed above. In the example embodiment of Fig. 1C, the message format 110 is a normalized message format, intended to be used for an internal (that is, within the electronic trading system 100) representation of trading messages when they are exchanged among nodes within electronic trading system 100. In this example embodiment, gateways 120 exchange messages between the participants 130 and electronic trading system 100, and translate such messages between format(s) specified by one or more financial trading protocols used by the participants 130 and the normalized trading format used among nodes in the electronic trading system 100. It should be understood that the fields 110-1 through 110-17 are for non-limiting example and that the message format 110 may include more, fewer, or different fields, and that the order of such fields is not limited to as shown in FIG. 1C.
[0082] While the fields in the message format 110 are shown in this example in a single message format, they may be distributed across multiple message formats, or encapsulated in layered protocols. For example, in other embodiments, a subset of fields in the message format 110 may be included as part of a header, trailer, or extension field(s) in a layered protocol that encapsulates other fields of the message format 110 in a message payload. According to some example embodiments, the message format 110 may define one or more fields of data encapsulated in a payload (data) section of another message format, including without limitation a respective payload section of an IP datagram, a UDP datagram, a TCP packet, or of a message data frame format, such as an Ethernet data frame format or other data frame format, including InfiniBand, Universal Serial Bus (USB), PCI Express (PCI-e), and High-Definition Multimedia Interface (HDMI), for non-limiting example.
[0083] The message format 110 includes fields 110-1... 110-6 which correspond to information that may be included in messages sent or received according to a financial trading protocol for communication with one or more participant devices 130. For non-limiting example, the message type field 110-1 indicates a trading message type. Some trading message types (such as message types “new order,” “replace order,” or “cancel order”) correspond to messages received from participant devices 130, while other message types (such as “new order acknowledgement,” “replace order acknowledgement,” “cancel order acknowledgement,” “fill,” “execution report,” “unsolicited cancel,” “trade bust,” or various reject messages) correspond to messages that are generated by the electronic trading system 100 and are included in trading messages sent to the participant devices 130.
[0084] The message format 110 also includes a symbol field 110-2, which includes an identifier for a traded financial security, such as a stock symbol or stock ticker. For example, “IBM” is the stock symbol for “International Business Machines Corporation.” The side field 110-3 in the message format 110 may be used to indicate the “side” of the trading message, such as whether the trading message is a “buy,” “sell,” or a “sell short.” Similarly, the price field 110-4 may be used to indicate a desired price to buy or sell the security, and the quantity field 110-5 may be used to indicate a desired quantity of the security (e.g., number of shares). The message format 110 may also include the order token field 110-6, which may be populated with an “order token” or “client order ID” initially provided by a participant device 130 to uniquely identify a new order in the context of a particular trading session (i.e., “connection” or “flow”) established between the participant device 130 and the electronic trading system via a gateway 120.
[0085] It should be understood that fields 110-1... 110-6 are representative fields that are usually included for most message types according to most financial trading protocols, but that the message format 110 may well include additional or alternate fields, especially for supporting particular message types or particular financial trading protocols. For example, according to many financial trading protocols, “replace order” and “cancel order” message types require the participant 130 to supply an additional order token to represent the replaced or canceled order, to distinguish it from the original order. Similarly, a “replace order” and a “cancel order” typically may also include a replaced/canceled quantity field, and a “replace order” may include a replace price field. These additional replace/cancel order token fields, replace price fields, and replaced/canceled quantity fields, may also be included in corresponding acknowledgement messages sent by electronic trading system 100.
[0086] Additionally, the message format 110 includes fields 110-11... 110-17 that may be used internally within electronic trading system 100, and do not necessarily correspond to fields in messages exchanged with participant devices 130. For example, a node identifier field 110-11 may uniquely identify each node in electronic trading system 100. In some embodiments, a generating node may include its node identifier in messages it introduces into the electronic trading system 100. For example, each gateway 120 may include its node identifier in messages it forwards from participant devices 130 to compute nodes 140 and/or sequencers 150. Similarly, each compute node 140 may include its node identifier in messages it generates (for example, acknowledgements, executions, or types of asynchronous messages intended ultimately to be forwarded to one or more participant devices 130) to be sent to other nodes in the electronic trading system 100. Thus, via the node identifier field 110-11 in the message, each message introduced into the electronic trading system 100 may be associated with the message’s generating node.
[0087] The message format 110 may also include a flow identifier field 110-12. In some embodiments, each trading session (i.e., “connection” or “flow”) established between a participant device 130 and a gateway 120 may be identified with a flow identifier that is intended to be unique throughout the electronic trading system 100. A participant device 130 may be connected to the electronic trading system 100 over one or more flows, and via one or more of the gateways 120. In such embodiments, the version of the messages according to the normalized message format 110 (used among nodes in the electronic trading system 100) of all messages exchanged between a participant device 130 and the electronic trading system 100 over a particular flow would include a unique identifier for that flow in the flow identifier field 110-12. In some embodiments, the flow identifier field 110-12 is populated by a message’s generating node. For example, a gateway 120 may populate the flow identifier field 110-12 with the identifier of the flow associated with a message it receives from a participant 130 that the gateway 120 introduces into electronic trading system 100. Similarly, a core compute node 140 may populate the flow identifier field 110-12 with the flow identifier associated with messages it generates (i.e., response messages, such as acknowledgement messages or fills, or other outgoing messages including asynchronous messages).
[0088] In some embodiments, the flow identifier field 110-12 contains a value that uniquely identifies a logical flow, which actually could be implemented for purposes of high availability as multiple redundant trading session connections, possibly over multiple gateways. That is, in some embodiments, the same flow ID may be assigned to two or more redundant flows between participant device(s) 130 and gateway(s) 120. In such embodiments, the redundant flows may be either in an active/standby configuration or an active/active configuration. In an active/active configuration, functionally equivalent messages may be exchanged between participant device(s) 130 and gateway(s) 120 simultaneously over multiple redundant flows in parallel. That is, a trading client may send in parallel over the multiple redundant flows functionally equivalent messages simultaneously to the electronic trading system 100, and receive in parallel over the multiple redundant flows multiple functionally equivalent responses from the electronic trading system 100, although the electronic trading system 100 may only take action on a single such functionally equivalent message. In an active/standby configuration, a single flow at a time among the multiple redundant flows may be designated as an active flow, whereas the other flow(s) among the multiple redundant flows may be designated standby flow(s), and the trading messages would only actually be exchanged over the currently active flow. Regardless of whether the redundant flows are configured in an active/active or active/standby configuration, messages exchanged over any of the redundant flows may be identified with the same flow identifier stored by the messages’ generating nodes in the flow identifier field 110-12 of the normalized message format 110.
[0089] As discussed above, in some embodiments, messages exchanged among nodes in the electronic trading system 100 are sent to the sequencer 150 to be marked with a sequence identifier. Accordingly, the message format 110 includes sequence identifier field 110-14. In some embodiments, an “unmarked message” may be sent with a sequence identifier field 110-14 having an empty, blank (e.g., zero) value. In other embodiments, the sequence identifier field 110-14 of an unmarked message may be set to a particular predetermined value that the sequencer would never assign to a message, or to an otherwise invalid value. Still other embodiments may specify that a message is unmarked via an indicator in another field (not shown) of the message, such as a Boolean value or a flag value indicating whether a message has been sequenced. Upon receiving an unmarked message, the sequencer 150 may then populate the sequence identifier field 110-14 of the unmarked message with a valid sequence identifier value, thereby producing a “sequence marked message.” The valid sequence identifier value in sequence identifier field 110-14 of the sequence marked message uniquely identifies the message and also specifies a deterministic position of the marked message in a relative ordering of the marked message among other marked messages throughout electronic trading system 100. In this example, a “sequence marked message” sent by the sequencer 150 may then be identical to a corresponding unmarked message received by the sequencer except that the sequence marked message’s sequence identifier field 110-14 contains a valid sequence identifier value.
[0090] The message format 110 may, in some embodiments, also include the reference sequence identifier field 110-15. A generating node may populate the reference sequence identifier field 110-15 of a new message it generates with the value of a sequence number of a prior message related to the message being generated. The value in the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 to correlate a message with a prior associated message.
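For non-limiting illustration, the marking of an unmarked message by the sequencer might be sketched as follows. The sketch assumes the embodiment in which a zero value denotes an unmarked message; the names `Message`, `Sequencer`, and `mark` are hypothetical, and a simple counter stands in for whatever identifier scheme an actual sequencer uses.

```python
from dataclasses import dataclass, replace
from itertools import count

UNMARKED = 0  # assumed sentinel: a value the sequencer never assigns


@dataclass(frozen=True)
class Message:
    node_id: int
    payload: str
    sequence_id: int = UNMARKED


class Sequencer:
    def __init__(self) -> None:
        # Valid identifiers start at 1, so 0 always means "unmarked".
        self._next = count(1)

    def mark(self, msg: Message) -> Message:
        """Return a copy of msg that differs only in its sequence identifier."""
        if msg.sequence_id != UNMARKED:
            raise ValueError("message is already sequence marked")
        return replace(msg, sequence_id=next(self._next))
```

The marked copy is identical to the unmarked message except for the sequence identifier, and the monotonically increasing counter gives each marked message a deterministic position relative to every other marked message.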
[0091] The prior associated message referenced in the reference sequence identifier field 110-15 may be a prior message in the same “order chain” (i.e., “trade order chain”). According to most financial trading protocols, messages may be logically grouped into an “order chain,” a set of messages over a single flow that reference or “descend from” a common message. An order chain typically starts with a “new order message” sent by a participant device 130. The next message in the order chain is typically a response by the electronic trading system (e.g., either a “new order acknowledgement” message when the message is accepted by the trading system, or a “new order reject” message, when the message is instead rejected by the trading system, perhaps for having an invalid format or invalid parameters, such as an invalid price for non-limiting example). An order chain may also include a “cancel order” message sent by participant device 130, canceling at least a portion of the quantity of a prior acknowledged (but still open, that is, containing at least some quantity that is not canceled and/or not filled) new order. The “cancel order” message may again either be acknowledged or rejected by the electronic trading system with a “cancel order acknowledgement” or a “cancel order reject” message, which would also be part of the order chain. An order chain may also include a “replace order” message sent by participant device 130, replacing the quantity and/or the price of a prior acknowledged (but still open) new order. The “replace order” message may again either be acknowledged or rejected by the electronic trading system with a “replace order acknowledgement” or a “replace order reject” message, which would also be part of the order chain.
A prior acknowledged order that is still open may be matched with one or more counter orders of the opposite side (that is, “buy” on one side and “sell” or “sell short” on the other side), and the electronic trading system 100 may then generate a complete “fill” message (when all of the open order’s quantity is filled in a single match) or one or more partial “fill” messages (when only a portion of the open order’s quantity is filled in a single match), and these “fill” messages would also be part of the order chain. As discussed above, the reference sequence identifier, in general, may identify another prior message in the same order chain.
[0092] For example, returning to the reference sequence identifier field 110-15, the value for the reference sequence number may be the sequence number assigned by the sequencer for an “incoming” message originating from a participant device 130 and introduced into electronic trading system 100 by a gateway 120, such that a corresponding “outgoing” message, such as a response message generated by a compute node 140, may reference the sequence number value of the incoming message to which it is responding. In this example, a “new order acknowledgement” message or a “fill” message generated by a compute node 140 would include in the reference sequence identifier field 110-15 the value for the sequence identifier assigned to the corresponding “new order” message to which the compute node 140 is responding with a “new order acknowledgement” message or fulfilling the order with the “fill” message. In general, however, the value for the reference sequence identifier field 110-15 need not necessarily be that of a message that is being directly responded to by the electronic trading system 100, but may be that of a prior message that is part of the same order chain, for example, the sequence number of a “new order” or a “new order acknowledgement.”
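For non-limiting illustration, the following sketch shows one way a node might use the reference sequence identifier to group sequence marked messages into order chains. The names `Msg` and `group_order_chains` are hypothetical, and the sketch assumes that a message starting a chain carries no reference and that every reference points at an earlier message the node has already seen.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Msg(NamedTuple):
    sequence_id: int
    msg_type: str
    reference_sequence_id: Optional[int] = None  # None: starts an order chain


def group_order_chains(messages: List[Msg]) -> Dict[int, List[str]]:
    """Map the sequence id of each chain's first message to the chain's message types."""
    root_of: Dict[int, int] = {}
    chains: Dict[int, List[str]] = defaultdict(list)
    for m in sorted(messages, key=lambda m: m.sequence_id):
        if m.reference_sequence_id is None:
            root = m.sequence_id
        else:
            # A reference may point at any earlier message of the chain,
            # so resolve it to the root recorded for that message.
            root = root_of[m.reference_sequence_id]
        root_of[m.sequence_id] = root
        chains[root].append(m.msg_type)
    return dict(chains)
```

Because a reference is resolved to the root of its chain, a “cancel order” that references the “new order acknowledgement” and a “cancel order acknowledgement” that references the original “new order” land in the same chain.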
[0093] In some embodiments, at least for some message types, the gateways 120 may also populate the reference sequence identifier field 110-15 in messages they introduce into electronic trading system 100 with a value of a sequence identifier for a related prior message. For example, a gateway 120 may populate the reference sequence identifier field 110-15 in a “cancel order” or a “replace order” message with the value of the sequence identifier assigned to the prior corresponding “new order” or “new order acknowledgment” message. Similarly, core compute nodes 140 may also populate the reference sequence identifier field 110-15 for a corresponding “cancel order acknowledgement” message or “replace order acknowledgement” message with the value of the sequence identifier for the “new order” or “new order acknowledgment,” rather than that of the message to which the compute node 140 was directly responding (e.g., rather than the sequence identifier of the “cancel order” or “replace order” message). Again, the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 generally to correlate a message with one or more prior messages in the same order chain.
[0094] A generating node may also include a node-specific timestamp field 110-13 in messages it introduces into electronic trading system 100. While the sequence identifier included in the sequence identifier field 110-14 of sequence-marked messages outputted by the sequencer 150 is intended to be unique throughout the electronic trading system 100, the value in the node-specific timestamp field 110-13 may be unique among a subset of messages, those messages introduced into electronic trading system 100 by a particular generating node. While referred to herein as a “timestamp,” a value placed in the node-specific timestamp field 110-13 may be any suitable value that is unique among messages generated by that node. For example, the node-specific timestamp may be in fact a timestamp or any suitable monotonically increasing or decreasing value.
[0095] Some embodiments may include other timestamp fields in the message format. For example, some message formats may include a reference timestamp field, which may be a timestamp value assigned by the generating node of a prior, related message. In such embodiments, a compute node 140 may include a new timestamp value in the node-specific timestamp field 110-13 for messages that it generates, and may also include a timestamp value from a related message in a reference timestamp field of the message the compute node generates. For example, a “new order acknowledgement” message generated by the compute node may include a timestamp value of the “new order” to which it is responding in the reference timestamp field of the “new order acknowledgement message.” Furthermore, in some embodiments, compute nodes 140 may not include a new timestamp value in the node-specific timestamp field 110-13 in messages they generate, but may simply populate that node-specific timestamp field 110-13 with a timestamp value from a prior related message.
[0096] The message format 110 may include a time based value (TBV) field 110-18. The TBV field, as more fully explained elsewhere, may correspond to a time that an incoming message was received by the gateway 120-1 (referred to herein as the ingress time or arrival time), or in other implementations, may correspond to a desired exit time (referred to herein as the egress time or exit time, Tex) for a corresponding response message to be returned by the system. In the latter case, the TBV is determined from the time of receipt plus some deterministic time delay value (as described in more detail below). In a typical embodiment, therefore, the TBV is not the same field as the “node-specific timestamp” field 110-13 or the sequence numbers (e.g., it is a field different from either the sequence ID 110-14 or reference sequence ID 110-15) referenced above.
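For non-limiting illustration, the two interpretations of the TBV described above might be sketched as follows. The delay value, the nanosecond time unit, and the function names are hypothetical and serve only to show the arithmetic.

```python
FIXED_DELAY_NS = 50_000  # hypothetical deterministic delay, in nanoseconds


def tbv_as_ingress(ingress_ns: int) -> int:
    """TBV interpreted as the time the gateway received the message."""
    return ingress_ns


def tbv_as_egress(ingress_ns: int, delay_ns: int = FIXED_DELAY_NS) -> int:
    """TBV interpreted as the desired exit time: receipt time plus a fixed delay."""
    return ingress_ns + delay_ns
```

Because the same delay is added to every message in this sketch, two messages that arrive 10 ns apart are also assigned exit times 10 ns apart.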
[0097] The message format 110 may also include an entity type field 110-16 and entity count field 110-17. The entity type of a message may depend on whether it is introduced into the electronic trading system 100 by a gateway 120 or a compute node 140, or in other words, whether the message is an incoming message being received at a gateway 120 from a participant device 130 or whether it is an outgoing message being generated by a compute node 140 to be sent to a participant device 130. For example, in some embodiments, incoming messages are considered to be of entity type “flow,” (and the entity type field 110-16 is populated by the gateways 120 for incoming messages with a value representing the type “flow”), while outgoing messages are considered to be of entity type “symbol,” (and the entity type field 110-16 is populated by the compute nodes 140 for outgoing messages with a value representing the type “symbol”). In such embodiments, the entity count of type “flow” is maintained by gateways 120, and the entity count of type “symbol” is maintained by the compute nodes 140.
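For non-limiting illustration, the fields of the message format 110 discussed so far might be gathered into a single in-memory record as sketched below. The Python types, the attribute names, and the use of zero for an unmarked sequence identifier are assumptions of this sketch; the specification does not prescribe a particular encoding or field order.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class EntityType(Enum):
    FLOW = "flow"      # incoming messages, populated by gateways
    SYMBOL = "symbol"  # outgoing messages, populated by compute nodes


@dataclass
class NormalizedMessage:
    # Fields that mirror the participant-facing trading protocol.
    message_type: str        # field 110-1
    symbol: str              # field 110-2
    side: str                # field 110-3
    price: Decimal           # field 110-4
    quantity: int            # field 110-5
    order_token: str         # field 110-6
    # Fields used internally among nodes.
    node_id: int             # field 110-11
    flow_id: int             # field 110-12
    node_timestamp: int      # field 110-13
    entity_type: EntityType  # field 110-16
    entity_count: int        # field 110-17
    sequence_id: int = 0     # field 110-14; 0 assumed to mean "unmarked"
    reference_sequence_id: Optional[int] = None  # field 110-15
    tbv: Optional[int] = None                    # time based value (TBV)
```

In this sketch a gateway would construct such a record for each incoming message with the entity type set to `EntityType.FLOW` and leave the sequence identifier at its unmarked default for the sequencer to populate.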
[0098] Considering the entity type “flow,” a gateway 120 may maintain a per flow incoming message count, counting incoming messages received by the gateway 120 over each flow active on the gateway 120. For example, if four non-redundant flows are active on a gateway 120, each flow would be assigned a unique flow identifier, as discussed above, and the gateway 120 would maintain a per flow incoming message count, counting the number of incoming messages received over each of those four flows. In such embodiments, the gateway 120 populates the entity count field 110-17 of an incoming message with the per flow incoming message count associated with the incoming message’s flow (as identified throughout the electronic trading system 100 by a flow identifier value, populated in the flow identifier field 110-12 of the message).
[0099] In the case of redundant flows in an active/active configuration, (that is, as discussed above, in which multiple flows receive the same messages, or at least, functionally equivalent messages, in parallel from participant device(s) 130 connected via one or more gateway(s) 120), each underlying redundant flow may be assigned the same flow identifier, yet a per-flow incoming message count may still be maintained separately for each redundant flow, especially when the redundant flows are implemented on separate gateways 120. Because it is the expectation that a participant device 130 will send the same set of messages in the same order (i.e., sequence) to the electronic trading system 100 over each of the redundant flows, it is also the expectation that the entity count assigned to functionally equivalent messages received over separate redundant flows should be identical.
[00100] These functionally equivalent incoming messages may be forwarded by the gateway(s) 120 to sequencer 150 and the compute nodes 140. Accordingly, in such embodiments, the sequencer 150 and compute nodes 140 could receive multiple functionally equivalent incoming messages associated with the same flow identifier, but the sequencer 150 and compute nodes 140 could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same flow identifier.
[00101] In some embodiments, the sequencer 150 and compute nodes 140 may keep track, on a per flow basis, of the highest entity count that has been included in entity count field 110-17 of incoming messages associated with each flow, which allows the sequencer 150 and compute nodes 140 to take action only on the first to arrive of multiple incoming functionally equivalent messages each node has received, and to ignore other subsequently arriving functionally equivalent incoming messages. For example, the sequencer 150 may in some embodiments only sequence the first such functionally equivalent incoming message to arrive, and the compute nodes 140 may only start processing on the first such functionally equivalent message to arrive. If an incoming message received by a node (i.e., a sequencer 150 or a compute node 140) has an entity count that is the same or lower than the highest entity count the node has seen for that flow, then the node may assume that the incoming message is functionally equivalent to another previously received incoming message, and may simply ignore the subsequently received functionally equivalent incoming message.
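For non-limiting illustration, the per flow tracking of the highest entity count might be sketched as follows. The class name `DuplicateFilter` and the assumption that entity counts start at 1 are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict


class DuplicateFilter:
    """Tracks the highest entity count seen for each flow identifier."""

    def __init__(self) -> None:
        # Entity counts are assumed to start at 1, so 0 means "nothing seen yet".
        self._highest: DefaultDict[int, int] = defaultdict(int)

    def accept(self, flow_id: int, entity_count: int) -> bool:
        """Return True for the first message to arrive with a given count."""
        if entity_count <= self._highest[flow_id]:
            # Same or lower than the highest count seen: treat the message as
            # functionally equivalent to one already received, and ignore it.
            return False
        self._highest[flow_id] = entity_count
        return True
```

A recipient node would act on a message only when `accept` returns True. With redundant flows in an active/active configuration, the second copy of each message carries the same flow identifier and entity count as the first and is therefore ignored.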
[00102] Considering now the case of entity type “symbol,” a compute node 140 may maintain a per symbol outgoing message count, counting outgoing messages generated by and sent from the compute node 140 for each symbol serviced by the compute node 140. For example, if four symbols (e.g., MSFT, GOOG, IBM, ORCL) are serviced by a compute node 140, each symbol is assigned a symbol identifier populated in symbol field 110-2 of the message, as discussed above, and the compute node 140 would maintain a per symbol outgoing message count, counting the number of outgoing messages it generated and sent that serviced each of those four symbols. In such embodiments, the compute node 140 populates the entity count field 110-17 of an outgoing message with the per symbol outgoing message count associated with the outgoing message’s symbol (as identified throughout the electronic trading system 100 by the value populated in the symbol identifier field 110-2 of the message).
[00103] In some embodiments, as discussed further below, compute nodes may be configured such that multiple compute nodes service a particular symbol in parallel, for reasons of high availability. Because of the deterministic ordering of messages throughout electronic trading system 100 provided by the sequencer 150, it can be guaranteed that even when multiple compute nodes service a given symbol, they will be processing incoming messages referencing the same symbol in the same order (i.e., sequence) and in the same manner, thereby generating functionally equivalent response messages in parallel. When considering the outgoing messages being sent out for a particular symbol across multiple compute nodes 140, each outgoing message referencing that symbol should have a functionally equivalent message being sent out by each other compute node 140 actively servicing that symbol. These outgoing messages may all be sent by the compute nodes 140 to sequencer 150 and the gateways 120. Accordingly, in such embodiments, the sequencer 150 and gateways 120 could receive multiple functionally equivalent outgoing messages associated with the same symbol, but the sequencer 150 and gateways 120 could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same symbol identifier. In some embodiments, the sequencer 150 and gateways 120 may keep track, on a per symbol basis, of the highest entity count that has been included in entity count field 110-17 of outgoing messages associated with the symbol, which allows the sequencer 150 and gateways 120 to take action only on the first to arrive of multiple outgoing functionally equivalent messages each node has received, and to ignore other subsequently arriving functionally equivalent outgoing messages. For example, the sequencer 150 may in some embodiments only sequence the first such functionally equivalent outgoing message to arrive.
Similarly, the gateways 120 may only start processing the first such functionally equivalent message to arrive. If an outgoing message received by a node (i.e., a sequencer 150 or a gateway 120) has an entity count that is the same or lower than the highest entity count the node has previously seen for that symbol, then the node may assume that the outgoing message is functionally equivalent to another previously received outgoing message, and may simply ignore the subsequently received functionally equivalent outgoing message.
[00104] In embodiments in which the sequencer 150 only sequences the first message of a plurality of functionally equivalent messages to arrive at the sequencer, the sequencer could do so in a variety of ways. In one example, other subsequently arriving messages that are functionally equivalent to that first such functionally equivalent message to arrive may simply be ignored by the sequencer (in which case only a single sequence marked message may be outputted by the sequencer for the set of functionally equivalent messages). Another possibility is for the sequencer to track a sequence number that it assigns to the first functionally equivalent message, for example, by making an association between the entity count of the message, its flow identifier or symbol identifier (for messages having entity types of “flow” and “symbol”, respectively), and its sequence number, such that the sequencer may output a sequenced version of each functionally equivalent message in which the value of the sequence identifier field 110-14 for all the sequenced versions of the functionally equivalent messages is the same as had been assigned by the sequencer to the first message to arrive among the functionally equivalent messages received by the sequencer 150.
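For non-limiting illustration, the second possibility described above, in which the sequencer remembers the sequence number assigned to the first functionally equivalent message, might be sketched as follows. The class and method names are hypothetical, and the sketch keeps every association indefinitely, whereas a practical sequencer would presumably bound or expire this table.

```python
from itertools import count
from typing import Dict, Tuple

# (entity type, flow or symbol identifier, entity count)
Key = Tuple[str, str, int]


class EquivalenceAwareSequencer:
    def __init__(self) -> None:
        self._next = count(1)
        self._assigned: Dict[Key, int] = {}

    def sequence(self, entity_type: str, entity_id: str, entity_count: int) -> int:
        """Return the sequence number for a message, reusing the number already
        assigned to an earlier functionally equivalent message if there is one."""
        key = (entity_type, entity_id, entity_count)
        if key not in self._assigned:
            self._assigned[key] = next(self._next)
        return self._assigned[key]
```

Under these assumptions every sequenced copy of a functionally equivalent message carries the sequence number given to the first copy to arrive, so recipient nodes observe one consistent identifier regardless of which copy reaches them.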
[00105] In other embodiments, the sequencer 150 may not keep track of whether messages are functionally equivalent, and may assign each unsequenced message that arrives at the sequencer 150 a unique sequence number, regardless of whether that message is among a plurality of functionally equivalent messages. In such embodiments, the sequenced versions of messages among a plurality of functionally equivalent messages are each assigned different sequence identifiers by the sequencer as the value in the sequence identifier field 110-14. To determine an effective sequence identifier for the set of functionally equivalent messages, the recipient node of sequenced functionally equivalent messages in such embodiments may use the sequence identifier in the sequenced version of the message among the sequenced functionally equivalent messages that is first to arrive at the node. In embodiments in which there are direct point-to-point connections among the nodes in the electronic trading system 100, sequenced versions of the messages are sent out in sequenced order by the sequencer 150, and accordingly, should be received in the same sequenced order among all nodes directly connected to the sequencer. Therefore, for all nodes receiving the sequenced messages via respective direct point-to-point connections with the sequencer, the first sequenced message to arrive among a plurality of functionally equivalent sequenced messages should have the same value in the sequence identifier field 110-14.
[00106] As may be apparent from the discussion above, in embodiments having a message format, such as the message format 110, besides a message’s sequence identifier, there may exist multiple other ways of uniquely identifying a message throughout electronic trading system 100. For example, in embodiments in which a message includes both a node identifier and a node-specific timestamp, the presence of these two identifiers in a message may be sufficient to uniquely identify the message throughout electronic trading system 100. Such fields may be understood as including metadata, and multiple messages including such identical metadata may be understood as including common metadata. Similarly, in embodiments in which a flow identifier is unique throughout electronic trading system 100, a combination of a message’s flow identifier and node-specific timestamp may be sufficient to uniquely identify the message throughout electronic trading system 100. Furthermore, a combination of a flow identifier and entity count could be sufficient to uniquely identify a message of entity type “flow,” and a combination of a symbol identifier and entity count could be sufficient to uniquely identify a message of entity type “symbol.”
[00107] It should be noted, however, that while there may exist other ways besides the sequence identifier assigned to a message of uniquely identifying the message throughout electronic trading system 100, the sequence identifier is still necessary in order to specify in a fair and deterministic manner the relative ordering of the message among other messages generated by other nodes throughout electronic trading system 100. For example, if the node-specific timestamp is in fact implemented as a timestamp value, even if system clocks among nodes are perfectly synchronized, two different messages, each generated by a different node, may each be assigned the same timestamp value by their respective generating node, and the relative ordering between these two messages is then ambiguous. Even if the messages can be identified uniquely, a recipient node of both messages would still need a way to determine the relative ordering of the two messages before taking possible action on the messages.
[00108] One possible approach for a recipient node to resolve that ambiguity could be through the use of randomness, for example, by randomly selecting one message as preceding the other in the relative ordering of messages throughout the electronic trading system 100. Using randomness to resolve the ambiguity, however, does not support “state determinism” throughout the electronic trading system 100. Different recipient nodes may randomly determine a different relative ordering among the same set of messages, resulting in unpredictable, nondeterministic behavior within electronic trading system 100, and impeding the correct implementation of important features, such as fault tolerance, high availability, and disaster recovery.
[00109] Another approach for a recipient node to resolve the ambiguity in ordering could be through a predetermined precedence method, for example, based on the node identifier associated with the message. Such an approach, however, works against the important goal of fairness, by giving some messages higher precedence simply based on the node identifier of the node that introduced the message into electronic trading system 100. For example, some participant devices 130 could be favored simply because they happen to be connected to the electronic trading system 100 over a gateway 120 that is deemed higher in the predetermined precedence method.
[00110] If messages were to be uniquely identified via the entity count and either symbol identifier or flow identifier, depending on whether the message has an entity type of “symbol,” or “flow,” respectively, there may be a deterministic ordering among other messages associated with that symbol (in the case of a message having entity type of “symbol”) or that flow (in the case of a message having an entity type of “flow”). The ordering, however, among other messages associated with different symbols and flows, respectively, would still be non-deterministic.
[00111] Accordingly, even if other fields in the message format 110 may be sufficient to uniquely identify a message throughout the electronic trading system 100, the sequence identifier assigned to a message by the sequencer 150 may still be required in order to fairly and deterministically specify the ordering of a message relative to other messages in the electronic trading system 100. In such embodiments, the sequencer 150 (or the single currently active sequencer, if multiple sequencers 150 are present) serves as the authoritative source of a truly deterministic ordering among sequence-marked messages throughout the electronic trading system 100.
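As a non-limiting sketch of why a single sequencer removes the ambiguity discussed above, the following example marks messages in the order in which they reach the sequencer. The names `Message`, `Sequencer`, and `mark` are hypothetical, and the use of consecutive integers starting at 1 is an assumption made only for illustration.

```python
import itertools
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Message:
    node_id: int                       # identifier of the generating node
    node_timestamp: int                # node-specific timestamp
    sequence_id: Optional[int] = None  # None until marked by the sequencer


class Sequencer:
    """Assigns a unique, increasing sequence identifier to each arriving message."""

    def __init__(self, first_id: int = 1) -> None:
        # Assumption: identifiers are consecutive integers; the disclosure only
        # requires that they define a deterministic relative ordering.
        self._ids = itertools.count(first_id)

    def mark(self, msg: Message) -> Message:
        # Return a sequence-marked copy; the unmarked original is left unchanged.
        return replace(msg, sequence_id=next(self._ids))
```

Two messages from different nodes that carry the same timestamp value still receive distinct identifiers, and the ordering depends only on arrival order at the sequencer, not on the node identifier or on a random choice.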
[00112] In some embodiments, nodes in electronic trading system 100 may receive two versions of a message: an unsequenced (unmarked) version of the message as introduced into the electronic trading system 100 by the generating node, and a (marked) version of the message that includes a sequence identifier assigned by the sequencer 150. This may be the case in embodiments in which a generating node sends the unmarked message to one or more recipient nodes as well as the sequencer 150. The sequencer 150 may then send a sequence-marked version of the same message to a set of nodes including the same recipient nodes.
[00113] While, as discussed above, the sequence-marked version of the message is useful for determining the relative processing order (i.e., position in a sequence) of the message among other marked messages in electronic trading system 100, it may also be useful for a recipient node to receive the unmarked version of the message. For example, it is certainly possible, if not expected (for example, in embodiments in which there are direct connections between nodes), for the unmarked version of the message to be received prior to the marked version of the message, because the marked version of the message is sent via an intervening hop through sequencer 150. Accordingly, there is the opportunity, in some embodiments, for a recipient node to activate processing of the unmarked message upon receiving the unmarked message even before that recipient node has received the marked version of the message which authoritatively indicates the relative ordering of the marked message among other marked messages.
[00114] A node receiving both the marked and unmarked versions of a same message may correlate the two versions of the message to each other via the same identifying information or “common metadata,” in both versions of the message. For example, as discussed above, a generating node may include in messages it generates (i.e., unmarked messages) a node identifier and a node specific timestamp, which together, may uniquely identify each message throughout electronic trading system 100. In embodiments in which the marked and unmarked versions of a message are essentially identical except for the sequence identifier assigned by sequencer 150, the marked message may also include the same node identifier and node specific timestamp that are also included in the corresponding unmarked message, thereby allowing a recipient node of both versions of the message to correlate the marked and unmarked versions. Accordingly, while the marked messages directly indicate relative ordering of a marked message relative to the other marked messages throughout electronic trading system 100, because of the correlation that may be made between the unmarked and marked version of the same message, marked messages (at least indirectly via the correlation discussed above) indicate the relative ordering of the message relative to other messages (marked or unmarked) throughout electronic trading system 100. It should be understood that nodes in electronic trading system 100 may also correlate sequence marked with unmarked versions of the messages by means of the other manners of uniquely identifying messages discussed above. For example, a correlation between sequence marked and unmarked messages may be made by means of a combination of a flow identifier and a node specific timestamp. Such a correlation may additionally or alternatively be made by means of a message’s entity count along with the symbol identifier or flow identifier in the message, for messages having entity type “symbol” and “flow,” respectively.
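The correlation of marked and unmarked versions via common metadata may be sketched, purely as an example, as follows. The sketch assumes that the node identifier and node specific timestamp together form the correlation key; the names `Correlator`, `on_unmarked`, and `on_marked` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # (node identifier, node specific timestamp)


@dataclass(frozen=True)
class Message:
    node_id: int
    node_timestamp: int
    payload: str
    sequence_id: Optional[int] = None  # None for the unmarked version

    @property
    def key(self) -> Key:
        return (self.node_id, self.node_timestamp)


class Correlator:
    def __init__(self) -> None:
        self._pending: Dict[Key, Message] = {}  # unmarked messages awaiting their marked version
        self._sequence: Dict[Key, int] = {}     # sequence identifiers learned from marked versions

    def on_unmarked(self, msg: Message) -> Optional[int]:
        """Return the sequence identifier if the marked version already arrived, else None."""
        seq = self._sequence.get(msg.key)
        if seq is None:
            self._pending[msg.key] = msg
        return seq

    def on_marked(self, msg: Message) -> Optional[Message]:
        """Record the sequence identifier; return the pending unmarked version, if any."""
        if msg.sequence_id is None:
            raise ValueError("a marked message must carry a sequence identifier")
        self._sequence[msg.key] = msg.sequence_id
        return self._pending.pop(msg.key, None)
```

A node using such a structure could begin work on the unmarked version as soon as it arrives and commit the result once `on_marked` supplies the authoritative ordering.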
[00115] In the era of high-speed trading, in which microseconds or even nanoseconds are consequential, participant devices 130 exchanging messages with the electronic trading system 100 are often very sensitive to latency, preferring low, predictable latency. The arrangement shown in Fig. 1A accommodates this requirement by providing a point-to-point mesh 172 architecture between at least each of the gateways 120 and each of the compute nodes 140. In some embodiments, each gateway 120 in the mesh 172 may have a dedicated high-speed direct connection 180 to the compute nodes 140 and the sequencers 150.
[00116] For example, dedicated connection 180-1-1 is provided between gateway 1 120-1 and core compute node 1 140-1, dedicated connection 180-1-2 between gateway 1 120-1 and core compute node 2 140-2, and so on, with example connection 180-g-c provided between gateway 120-g and core compute node c 140-c, and example connection 180-s-c provided between sequencer 150 and core compute node c 140-c.
[00117] It should be understood that each dedicated connection 180 in the point-to-point mesh 172 is, in some embodiments, a point-to-point direct connection that does not utilize a shared switch. A dedicated or direct connection may be referred to interchangeably herein as a direct or dedicated “link” and is a direct connection between two end points that is dedicated (e.g., non-shared) for communication therebetween. Such a dedicated/direct link may be any suitable interconnect(s) or interface(s), such as disclosed further below, and is not limited to a network link, such as wired Ethernet network connection or other type of wired or wireless network link. The dedicated/direct connection/link may be referred to herein as an end-to-end path between the two end points. Such an end-to-end path may be a single connection/link or may include a series of connections/links; however, bandwidth of the dedicated/direct connection/link in its entirety, that is, from one end point to another end point, is non-shared and neither bandwidth nor latency of the dedicated/direct connection/link can be impacted by resource utilization of element(s) if so traversed. For example, the dedicated/direct connection/link may traverse one or more buffer(s) or other elements that are not bandwidth or latency impacting based on utilization thereof. The dedicated/direct connection/link would not, however, traverse a shared network switch as such a switch can impact bandwidth and/or latency due to its shared usage.
[00118] For example, in some embodiments, the dedicated connections 180 in the point-to-point mesh 172 may be provided in a number of ways, such as a 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, InfiniBand, Peripheral Component Interconnect - Express (PCIe), RapidIO, Small Computer System Interface (SCSI), FireWire, Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or custom serial or parallel busses.
[00119] Therefore, although the compute engines 140, gateways 120, sequencers 150 and other components may sometimes be referred to herein as “nodes”, the use of terms such as “compute node” or “gateway node” or “sequencer node” or “mesh node” should not be interpreted to mean that particular components are necessarily connected using a network link, since other types of interconnects or interfaces are possible. Further, a “node,” as disclosed herein, may be any suitable hardware, software, firmware component(s), or combination thereof, configured to perform the respective function(s) set forth for the node. As explained in more detail below, a node may be a programmed general purpose processor, but may also be a dedicated hardware device, such as a field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other hardware device or group of devices, logic within a hardware device, printed circuit board (PCB), or other hardware component.
[00120] It should be understood that nodes disclosed herein may be separate elements or may be integrated together within a single element, such as within a single FPGA, ASIC, or other element configured to implement logic to perform the functions of such nodes as set forth herein. Further, a node may be an instantiation of software implementing logic executed by general purpose computer and/or any of the foregoing devices.
[00121] Conventional approaches to connecting components, such as the compute engines 140, gateways 120, and sequencers 150 through one or more shared switches, do not provide the lowest possible latency. These conventional approaches also result in unpredictable spikes in latency during periods of heavier message traffic.
[00122] In an example embodiment, dedicated connections 180 are also provided directly between each gateway 120 and each sequencer 150, and between each sequencer 150 and each core compute node 140. Furthermore, in some embodiments, dedicated connections 180 are provided among all the sequencers, so that an example sequencer 150-1 has a dedicated connection 180 to each other sequencer 150-2, . . ., 150-s. While not pictured in Fig. 1A, in some embodiments, dedicated connections 180 may also be provided among all the gateways 120, so that each gateway 120-1 has a dedicated connection 180 to each other gateway 120-2, ..., 120-g. Similarly, in some embodiments, dedicated connections 180 are also provided among all the compute nodes 140, so that an example core compute node 140-1 has a dedicated connection 180 to each other core compute node 140-2, ..., 140-c.
[00123] It should also be understood that a dedicated connection 180 between two nodes (e.g., between any two nodes 120, 150, or 140) may in some embodiments be implemented as multiple redundant dedicated connections between those same two nodes, for increased redundancy and reliability. For example, the dedicated connection 180-1-1 between gateway 120-1 and core compute node 140-1 (e.g., Core 1) may actually be implemented as a pair of dedicated connections.
[00124] In addition, according to some embodiments, any message sent out by a node is sent out in parallel to all nodes directly connected to it in the point-to-point mesh 172. Each node in the point-to-point mesh 172 may determine for itself, for example, based on the node’s configuration, whether to take some action upon receipt of a message, or whether instead simply to ignore the message. In some embodiments, a node may never completely ignore a message; even if the node, due to its configuration, does not take substantial action upon receipt of a message, it may at least take minimal action, such as consuming any sequence number assigned to the message by the sequencer 150. That is, in such embodiments, the node may keep track of a last received sequence number to ensure that when the node takes more substantial action on a message, it does so in proper sequenced order.
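The minimal action of consuming a sequence number, even for a message that a node otherwise ignores, may be sketched as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, relevance is decided by a caller-supplied predicate standing in for the node’s configuration, and sequence numbers are assumed to be strictly increasing integers.

```python
from typing import Callable, List


class MeshNodeSequenceTracker:
    """Consumes every sequence number but takes substantial action only on relevant messages."""

    def __init__(self, is_relevant: Callable[[str], bool]) -> None:
        self._is_relevant = is_relevant   # hypothetical stand-in for the node's configuration
        self.last_sequence = 0            # last sequence number consumed by this node
        self.acted_on: List[int] = []     # sequence numbers on which substantial action was taken

    def on_sequenced(self, sequence: int, symbol: str) -> bool:
        # Assumption: sequence numbers strictly increase; anything else is out of order.
        if sequence <= self.last_sequence:
            raise ValueError(f"out-of-order sequence number {sequence}")
        self.last_sequence = sequence     # minimal action: consume the sequence number
        if self._is_relevant(symbol):
            self.acted_on.append(sequence)
            return True
        return False
```

Tracking `last_sequence` on every message is what allows the node, when it does act, to know that it is acting in proper sequenced order.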
[00125] For example, a message containing a trade order to “Sell 10 shares of Microsoft at $190.00” might originate from participant device 130-1, such as a trader’s personal computer, and arrive at gateway 120-1 (i.e., GW 1). That message will be sent to all core compute nodes 140-1, 140-2, . . ., 140-c even though only core compute node 140-2 is currently performing matching for Microsoft orders. All other core compute nodes 140-1, 140-3, . . ., 140-c may upon receipt ignore the message or only take minimal action on the message. For example, the only action taken by 140-1, 140-3, . . ., 140-c may be to consume the sequence number assigned to the message by the sequencer 150-1. That message will also be sent to all of the sequencers 150-1, 150-2, . . ., 150-s even though a single sequencer (in this example, sequencer 150-1) is the currently active sequencer servicing the mesh. The other sequencers 150-2, . . ., 150-s also receive the message to allow them the opportunity to take over as the currently active sequencer should sequencer 150-1 (the currently active sequencer) fail, or if the overall reliability of the electronic trading system 100 would increase by moving to a different active sequencer. One or more of the other sequencers (sequencer 150-2 for example) may also be responsible for relaying system state to the disaster recovery site 155. The disaster recovery site 155 may include a replica of electronic trading system 100 at another physical location, the replica comprising physical or virtual instantiations of some or all of the individual components of electronic trading system 100.
[00126] By sending each message out in parallel to all directly connected nodes, the system 100 reduces complexity and also facilitates redundancy and high availability. If all directly connected nodes receive all messages by default, multiple nodes can be configured to take action on the same message in a redundant fashion. Returning to the example above of the order to “Sell 10 shares of Microsoft at $190.00”, in some embodiments, multiple core compute nodes 140 may simultaneously perform matching for Microsoft orders. For example, both core compute node 140-1 and core compute node 140-2 may simultaneously perform matching for Microsoft messages, and may each independently generate, after having received the incoming message of the “Sell” order, a response message such as an acknowledgement or execution message that each of core compute node 140-1 and core compute node 140-2 sends to the gateways 120 through the sequencer(s) 150 to be passed on to one or more participant devices 130.
[00127] Because of the strict ordering and state determinism assured by the sequencer(s) 150, it is possible to guarantee that each of the associated response messages independently generated by and sent from the core compute nodes 140-1 and 140-2 are substantially and functionally equivalent; accordingly, the architecture of the electronic trading system 100 readily supports redundant processing of messages, which increases the availability and resiliency of the system. In such embodiments, gateways 120 may receive multiple associated outgoing messages from core compute nodes 140 for the same corresponding incoming message. Due to the fact that it can be guaranteed that these multiple associated response messages are equivalent, the gateways 120 may simply process only the first received outgoing message, ignoring subsequent associated outgoing messages corresponding to the same incoming message. In some embodiments, the “first” and “subsequent” messages may be identified by their associated sequence numbers, as such messages may be sequence-marked messages. In other embodiments, however, such as those in which the sequencer 150 assigns a single sequence identifier among a plurality of functionally equivalent messages, messages may be identified as being functionally equivalent based on other identifying information in the messages, such as the values in the entity type field 110-16 and entity count field 110-17, as discussed further in connection with Fig. 1C above.
[00128] Allowing the gateways 120 to take action on the first of several functionally equivalent associated response messages to reach them may, therefore, also improve the overall latency of the electronic trading system 100. Furthermore, the electronic trading system 100 can be easily configured such that any incoming message is processed by multiple compute nodes 140, in which each of those multiple compute nodes 140 generates an equivalent response message that can be processed by the gateways 120 on a first-to-arrive basis. Such an architecture provides for high availability with no perceptible impact to latency in the event that a compute node 140 is not servicing incoming messages for a period of time (whether due to a system failure, a node reconfiguration, or a maintenance operation).
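As one possible illustration of first-to-arrive processing at a gateway, the sketch below accepts the first of several functionally equivalent response messages and ignores the rest. The class name `FirstArrivalFilter` and the tuple used as the equivalence key (entity type, an entity identifier, and entity count) are assumptions made for this example only.

```python
from typing import Hashable, Set


class FirstArrivalFilter:
    """Accepts only the first response message seen for each equivalence key."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def accept(self, equivalence_key: Hashable) -> bool:
        # A key already seen means an equivalent response was processed earlier,
        # so this later copy can be ignored without loss of information.
        if equivalence_key in self._seen:
            return False
        self._seen.add(equivalence_key)
        return True
```

For brevity, this sketch retains every key indefinitely; a practical implementation would likely bound or age out the set of seen keys.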
[00129] Such a point-to-point mesh 172 architecture of system 100, besides supporting low, predictable latency and redundant processing of messages, also provides for built-in redundant, multiple paths. As can be seen, there exist multiple paths between any gateway 120 and any compute node 140. Even if a direct connection 180-1-1 between gateway 120-1 and compute node 140-1 becomes unavailable, communication is still possible between those two elements via an alternate path, such as by traversing one of the sequencers 150 instead. Thus, more generally speaking, there exist multiple paths between any node and any other node in the point-to-point mesh 172.
[00130] Furthermore, this point-to-point mesh architecture inherently supports another important goal of a financial trading system, namely, fairness. The point-to-point architecture with direct connections between nodes ensures that the path between any gateway 120 and any core compute node 140, or between the sequencer 150 and any other node, has identical, or at least very similar, latency. Therefore, two incoming messages sent out to the sequencer 150 at the same time from two different gateways 120 should reach the sequencer 150 substantially simultaneously. Similarly, an outgoing message being sent from a core compute node 140 is sent to all gateways 120 simultaneously, and should be received by each gateway at substantially the same time. Because the topology of the point-to-point mesh does not favor any single gateway 120, chances are minimized that being connected to a particular gateway 120 may give a participant device 130 an unfair advantage or disadvantage.
[00131] Additionally, the point-to-point mesh architecture of system 100 allows for easily reconfiguring the function of a node, that is, whether a node is currently serving as a gateway 120, core compute node 140 or sequencer 150. It is particularly easy to perform such reconfiguration in embodiments in which each node has a direct connection between itself and each other node in the point-to-point mesh. When each node is connected via a direct connection to each other node in the mesh, no re-wiring or recabling of connections 180 (whether physical or virtual) within the point-to-point mesh 172 is required in order to change the function of a node in the mesh (for example, changing the function of a node from a core compute node 140 to a gateway 120, or from a gateway 120 to a sequencer 150). In such embodiments, the reconfiguration required that is internal to the point-to-point mesh 172 may be easily accomplished through configuration changes that are carried out remotely. In the case of a node being reconfigured to serve as a new gateway 120 or being reconfigured from serving as a gateway 120 to another function, there may be some ancillary networking changes required that are external to the point-to-point mesh 172, but the internal wiring of the mesh may remain intact.
[00132] Accordingly, in some embodiments, the reconfiguration of the function of a node may be accomplished live, even dynamically, during trading hours. For example, due to changes in characteristics of the load of the electronic trading system 100 or new demand, it may be useful to reconfigure a core compute node 140-1 to instead serve as an additional gateway 120. After some possible redistribution of state or configuration to other compute nodes 140, the new gateway 120 may be available to start accepting new connections from participant devices 130.
[00133] In some embodiments, lower-speed, potentially higher latency shared connections 182 may be provided among the system components, including among the gateways 120 and/or the core compute nodes 140. These shared connections 182 may be used for maintenance, control operations, management operations, and/or similar operations that do not require very low latency communications, in contrast to messages related to trading activity, which are carried over the dedicated connections 180 in the point-to-point mesh 172. Whereas the first direct connection 180-1-1, second direct connection 180-gw1-s1, and third direct connection 180-c1-s1 carry traffic related to trading activity, the shared connections 182-g and 182-c carry non-trading activity type traffic. Shared connections 182, carrying non-trading traffic, may be over one or more shared networks and via one or more network switches, and nodes in the mesh may be distributed among these shared networks in different ways. For example, in some embodiments, gateways 120 may all be in a gateway-wide shared network 182-g, compute nodes 140 may be in their own respective compute node-wide shared network 182-c, and sequencers 150 may be in their own distinct sequencer-wide shared network 182-s, while in other embodiments all the nodes in the mesh may communicate over the same shared network for these non-latency sensitive operations.
[00134] Distributed computing environments such as electronic trading system 100 sometimes rely on high resolution clocks to maintain tight synchronization among various components. To that end, one or more of the nodes 120, 140, 150 might be provided with access to a clock, such as a high-resolution global positioning (GPS) clock 195 in some embodiments.
[00135] With reference to Fig. 1A, and for purposes of the following discussion, gateways 120, compute nodes 140, and sequencers 150 connected in the mesh 172 may be referred to as “Mesh Nodes”. Fig. 2 illustrates an example embodiment of a Mesh Node 200 in the point-to-point mesh 172 architecture of electronic trading system 100. Mesh Node 200 could represent a gateway 120, a sequencer 150, or a core compute node 140, for example. Although in this example, functionality in the Mesh Node 200 is distributed across both hardware and software, Mesh Node 200 may be implemented in any suitable combination of hardware and software, including pure hardware and pure software implementations, and in some embodiments, any or all of gateways 120, compute nodes 140, and/or sequencers 150 may be implemented with commercial off-the-shelf components.
[00136] In the embodiment illustrated by Fig. 2, in order to achieve low latency, some functionality is implemented in hardware in Fixed Logic Device 230, while other functionality is implemented in software in Device Driver 220 and Mesh Software Application 210. Fixed Logic Device 230 may be implemented in any suitable way, including an Application-Specific Integrated Circuit (ASIC), an embedded processor, or a Field Programmable Gate Array (FPGA). Mesh Software Application 210 and Device Driver 220 may be implemented as instructions executing on one or more programmable data processors, such as central processing units (CPUs). Different versions or configurations of Mesh Software Application 210 may be installed on Mesh Node 200 depending on its role. For example, based on whether Mesh Node 200 is acting as a gateway 120, sequencer 150, or core compute node 140, a different version or configuration of Mesh Software Application 210 may be installed.
[00137] While any suitable physical communications link layer may be employed (including USB, Peripheral Component Interconnect (PCI)-Express, High Definition Multimedia Interface (HDMI), 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, or InfiniBand (IB), over fiber or copper cables), in this example, Mesh Node 200 has multiple low latency 10 Gigabit Ethernet SFP+ connectors (interfaces) 270-1, 270-2, 270-3, . . ., 270-n (known collectively as connectors 270). Connectors 270 may be directly connected to other nodes in the point-to-point mesh via dedicated connections 180, connected via shared connections 182, and/or connected to participant devices 130 via a gateway 120, for example. These connectors 270 are electronically coupled in this example to 10 GigE MAC Cores 260-1, 260-2, 260-3, . . ., 260-n (known collectively as GigE Cores 260), respectively, which in this embodiment are implemented by Fixed Logic Device 230 to ensure minimal latency. In other embodiments, 10 GigE MAC Cores 260 may be implemented by functionality outside Fixed Logic Device 230, for example, in PCI-E network interface card adapters.
[00138] In some embodiments, Fixed Logic Device 230 may also include other components. In the example of Fig. 2, Fixed Logic Device 230 also includes a Fixed Logic 240 component. In some embodiments, Fixed Logic component 240 may implement different functionality depending on the role of Mesh Node 200, for example, whether it is a gateway 120, sequencer 150, or core compute node 140. Also included in Fixed Logic Device 230 is Fixed Logic Memory 250, which may be a memory that is accessed with minimal latency by Fixed Logic 240. Fixed Logic Device 230 also includes a PCI-E Core 235, which may implement PCI Express functionality. In this example, PCI Express is used as a conduit mechanism to transfer data between hardware and software, or more specifically, between Fixed Logic Device 230 and the Mesh Software Application 210, via Device Driver 220 over PCI Express Bus 233. However, any suitable data transfer mechanism between hardware and software may be employed, including Direct Memory Access (DMA), shared memory buffers, or memory mapping.
[00139] In some embodiments, Mesh Node 200 may also include other hardware components. For example, depending on its role in the electronic trading system 100, Mesh Node 200 in some embodiments may also include High-Resolution Clock 195 (also illustrated in and discussed in conjunction with Fig. 1A) used in the implementation of high-resolution clock synchronization among nodes in electronic trading system 100. A Dynamic Random-Access Memory (DRAM) 280 may also be included in Mesh Node 200 as an additional memory in conjunction with Fixed Logic Memory 250. DRAM 280 may be any suitable volatile or non-volatile memory, including one or more random-access memory banks, hard disk(s), and solid-state disk(s), and may be accessed over any suitable memory or storage interface.
Deterministic Latency
[00140] As mentioned above, the architecture of system 100 inherently supports another important goal of a financial trading system, namely, fairness. The basic idea is that delays through the system do not favor the user of any single gateway 120 or core 140, thus minimizing the chance that being connected to a particular gateway 120 may give any participant device 130 an unfair advantage or disadvantage over another. This end is accomplished by controlling latency, that is, the time between when messages arrive within the system and when corresponding messages are permitted to leave the system.
[00141] Turning attention to Fig. 3A, an inbound request message is received at a gateway 120-1. Gateway 120-1 processes the request message and generates an internal message, IBmsg, destined for one or more of the cores 140. IBmsg adds at least one field, such as a time based value (TBV) to the request message. The TBV may be inserted into some unused field of a standard message (such as that explained in connection with Fig. 1C), or TBV may be encoded in IBmsg according to some proprietary internal protocol.
[00142] In one implementation, the time based value TBV can be determined by the fixed logic 240 within gateway 120-1. The TBV may correspond to a time that the request message was received by the gateway 120-1 (referred to here as the ingress timestamp or arrival time, Tar), or in other implementations, may correspond to a desired exit time (referred to herein as the egress time or exit time, Tex) for a corresponding response message to be returned by the system 100. In the latter case, TBV is determined from the time of receipt Tar plus some deterministic time delay value (referred to as Td, as described in more detail below). In a typical embodiment, therefore, the TBV is not the same field as the “node-specific timestamp” field 110-13 or the sequence numbers (e.g., a field different from either the sequence ID 110-14 or reference sequence ID 110-15) referenced above.
[00143] The inbound message IBmsg and its corresponding time based value TBV is then forwarded along a path 310-1 through the mesh to the cores 140 for processing. Typically, a selected one of the cores, such as core 140-2 then further processes IBmsg and generates an outbound message OBmsg that contains some response data as well as the TBV (or some other value that depends on the TBV). Outbound messages return along a path such as path 310-2 back to gateway 120-1. The gateway 120-1 then schedules OBmsg to exit the system 100, sending it to the participant 130-1 at a precise exit time Tex. An element we refer to as an egress Quality of Service (QoS) shaper 320 controls the exit time for outbound messages OBmsg.
[00144] The precise exit time Tex is determined from the time based value TBV that was carried with IBmsg and OBmsg and a desired deterministic delay or latency, Td. The deterministic delay Td is not necessarily directly dependent on the actual amount of time it takes for any particular response to be returned to the gateway 120 from the cores 140. As explained in more detail below, Td may depend upon a maximum expected time that any inbound message IBmsg requires to be fully processed by the system 100.
[00145] Although it has been mentioned that the exit time Tex can be determined by the gateway 120 from the TBV (using the arrival time, Tar), the exit time Tex can also be determined in other ways, such as by the sequencer 150.
[00146] The time based value TBV may take different forms. It may simply be the time at which the corresponding inbound request message was originally received at gateway 120-1, Tar. In that instance, a time delay value (Td) is added to TBV to arrive at the exit time Tex at which the outbound message OBmsg will exit the gateway 120-1. In configurations where the deterministic delay Td is a fixed time, this fixed value can be stored or otherwise implemented in the gateway 120 (or the sequencer 150) at configuration time.
[00147] However, in other implementations, the time based value TBV carried with IBmsg and the corresponding OBmsg may be the actual desired time exit value Tex (that is, instead of the arrival time Tar).
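The two forms of TBV described above resolve to the same exit time. The following is a minimal sketch only, assuming integer time units; the `Tbv` record, the `TbvKind` tag, and the function name are hypothetical names introduced for illustration and do not appear in the described system.

```python
from dataclasses import dataclass
from enum import Enum


class TbvKind(Enum):
    ARRIVAL = "arrival"  # TBV holds the arrival time Tar
    EXIT = "exit"        # TBV already holds the desired exit time Tex


@dataclass(frozen=True)
class Tbv:
    kind: TbvKind   # hypothetical tag saying which form the value takes
    value: int      # time, in whatever unit the system clock counts


def exit_time(tbv: Tbv, td: int) -> int:
    """Return Tex for a message that carries the given TBV."""
    if tbv.kind is TbvKind.ARRIVAL:
        # TBV is Tar, so the deterministic delay is added at egress.
        return tbv.value + td
    # TBV is already Tex; Td was applied when the TBV was created.
    return tbv.value
```

With an arrival time of 17356 and a Td of 1000, both forms yield an exit time of 18356.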
[00148] It should also be noted that some inbound messages IBmsg will not necessarily be forwarded to one of the cores 140. For example, fixed logic within the gateway 120 may reject IBmsg if it is invalid (for example, because it has a bad checksum, or in some other case for which the gateway 120-1 cannot determine to which core 140 IBmsg is to be sent). Thus, it should be understood that the outbound message OBmsg may not actually originate from one of the cores 140, but rather from the gateway 120-1. However, even in that instance, it may be desirable for OBmsg to still be sent at an exit time Tex that corresponds to a deterministic delay Td after the original request message was received at time Tar.
[00149] In some instances, the outbound message OBmsg may be destined for more than one recipient device 130. In such instances, OBmsg is still sent to all such multiple consumers of the response at precisely the same exit time Tex. There is high confidence that the message can be sent to multiple recipients at the same precise exit time, due to the fact that the components of the system are typically implemented as highly synchronized hardware components as explained above.
[00150] Continuing to refer to Fig. 3A, in one example the inbound request message may originate from a participant device 130-1 that is a buyer in a transaction. Participant device 130-2 connected to gateway 120-2 may be the seller in the transaction, and participant device 130-3 connected to gateway 120-3 may be a service such as a market data feed that reports such transactions.
[00151] Although the buyer, seller, and market feed are shown as each being connected to a different gateway, it should be understood that arrangement is not important. These three participants might all be connected to the same gateway 120-2, or only the buyer and seller to the same gateway 120-1, with the market data feed on another gateway 120-3, etc.
[00152] Depending on the message type and the current load on the distributed system, the total time taken to return a response message (or otherwise process an inbound message) by the system (that is, the total time the message stays within the system 100, from the point at which it enters a gateway, at Tar, to the point at which a corresponding response message is ready to be sent out over that gateway) may range, for example, from 400 to 800 nanoseconds. However, the egress QOS shaper 320 may be configured to send any corresponding response at a precise interval of 1000 nanoseconds (1 microsecond). Accordingly, after the OBmsg message has reached the gateway node 120-1 and is being readied to be sent out to the trading participant 130-1, the outbound message OBmsg enters the egress QOS shaper 320-1. In one implementation, OBmsg may remain in the egress shaper 320 for several hundred nanoseconds until exactly one microsecond has elapsed since the original request message was received at the gateway 120-1. This ensures that as long as the configured latency interval is set to a sufficiently high value (such that the distributed system as a whole can comfortably guarantee that it can process any inbound message within that interval), a response message such as OBmsg will always be returned at a precise, deterministic time interval after inbound message IBmsg was received.
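The time a response spends waiting in the egress shaper follows directly from the figures above. The sketch below is illustrative only: the function name is hypothetical, and raising an error when a response is ready after its exit time is an assumption made here, since the description only states that Td should be set high enough for this not to occur.

```python
def shaper_hold(tar: int, ready: int, td: int = 1000) -> int:
    """Return how long a response waits in the egress shaper.

    tar   -- arrival time of the inbound message at the gateway
    ready -- time at which the response reaches the egress shaper
    td    -- configured deterministic latency (same unit as tar/ready)
    """
    tex = tar + td
    if ready > tex:
        # Assumption: treat a late response as a configuration error.
        raise ValueError("response ready after its exit time; Td is too small")
    return tex - ready
```

A response that takes 400 nanoseconds to process is held for 600 nanoseconds, and one that takes 800 nanoseconds is held for 200 nanoseconds, so both leave exactly 1000 nanoseconds after arrival.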
[00153] Additionally, the QOS level (i.e., the deterministic latency interval Td) may be tuned on an individual connection basis, on a per-gateway basis, or system-wide. In some embodiments, the QOS level may be configured on a per-participant or per-connection basis, optionally associated with varying cost subscription levels. While allowing different participants 130 to pay for different levels of QOS militates against fairness as a goal, such a feature may still be desirable to the provider of the distributed service as an additional revenue stream.
[00154] The system wide QOS level may also be set either manually or even dynamically. For example, the system 100 may dynamically temporarily increase the deterministic latency interval Td across the entire system 100 in the event an exceptional event occurs that threatens to cause the system not to satisfy the typical latency interval Td. This could be an internal event, such as a failure of one or more components or nodes in the distributed system, that will cause the system to temporarily degrade performance until the problematic components can be hot-swapped or new compute nodes brought online. The exceptional event could also be caused external to the distributed system, such as a news event that results in a huge increase in activity on or demand for the distributed system. When the exceptional event is resolved, the latency interval Td could be tuned back down, even dynamically, to its typical level.
[00155] In some embodiments, a message from a single client 130 (market participant) may trigger the need to generate and send acknowledgements or other types of response messages to numerous participants 130 and/or over multiple participant connections. In some cases, the content of the multiple messages is substantially identical among the multiple participants, whereas in other cases, a given inbound message IBmsg generates several related but different response messages OBmsgs, each with content that may be specific to a single participant or to a subset of participants.
[00156] In one example, for many protocols, when a trade match occurs, a separate execution message may be generated for each match party. Each execution message will contain information that is specific to that party, such as an “Order Token,” assigned by the client to identify the order. The two execution messages are related to the extent that they will share some information in common, such as a unique market-assigned ‘Execution ID’ or ‘Match number’. These two execution messages may be generated by the same compute core 140, and their egress time Tex should be the same, so that the two messages are sent simultaneously.
[00157] Thus, according to some embodiments, two or more related response messages are sent to multiple participants 130 simultaneously at the same precise deterministic time interval, regardless of whether the multiple participant connections are on the same or on different gateways. Each related response message should contain a TBV that is the original timestamp value (or the desired egress time, or some other time based value TBV related to the corresponding ingress time) corresponding to the incoming message IBmsg that triggered the responses. Thus, assuming the internal clocks of the various gateways are synchronized, the QOS shaper 320 on each gateway can ensure that each of the related responses is sent at the same precise time interval, in turn ensuring fairness among all clients 130 privy to receiving the related responses.
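The way related execution messages share one TBV, and therefore one exit time across gateways, can be sketched as follows. This is an illustration under stated assumptions: the `Execution` record, its field names, and both helper functions are hypothetical, and synchronized gateway clocks are taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    gateway: str        # gateway serving this party's connection
    order_token: str    # party-specific identifier assigned by the client
    execution_id: int   # market-assigned identifier shared by both parties
    tbv: int            # TBV copied from the triggering inbound message


def fan_out(execution_id: int, tbv: int, parties):
    """Build one execution message per (gateway, order_token) party."""
    return [Execution(gw, token, execution_id, tbv) for gw, token in parties]


def exit_times(messages, td: int):
    """Each gateway computes Tex independently from the shared TBV."""
    return {(m.gateway, m.order_token): m.tbv + td for m in messages}
```

Because every related message carries the same TBV, each gateway arrives at the same Tex without any coordination beyond clock synchronization.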
[00158] In some embodiments, the response from the core 140 may be a single outbound message OBmsg that arrives at the gateway(s) 120. However, that in turn causes the gateway(s) 120 to generate two or more related but not identical messages, the related messages differing in their destination address, and/or by having some other client-specific identifier, such as an order token, in their encoding as a result of the protocol associated with the participant 130, etc.
[00159] The cores 140 may provide a financial trading system’s matching engine that may generate a match between an order to buy a security and a corresponding order to sell that security. In some instances, where these two orders are typically placed at different times by two different parties, a successful match triggers an acknowledgement or execution message to be sent to both counter parties (e.g., buyer and seller) in the match. While the execution message OBmsg going to each counter party contains information specific to that specific client’s trade, such an execution message will be sent to the two parties at the exact same time, ensuring fairness. In this example, the latency interval for both of the resulting related responses may be set based off the incoming timestamp of one of the orders participating in the match (for example, the last of the two orders to arrive as an inbound request message).
[00160] In another related example implementation for a financial system’s matching engine, client devices 130 may typically subscribe to streams of market data messages sent from the matching engine notifying subscribers of real time activity across the market as a whole, including activity in which the subscribing client is not a participating party, such as recent executions between two third parties. Such a market data message in some embodiments is sent out to all subscribers simultaneously based on the timestamp value in the message as processed by the egress QOS shaper 320. In some embodiments, the sending of the market data messages to subscribers is timed to match the sending time of any acknowledgment or execution messages sent to market participants that are direct parties to the activity reflected in the market data messages also sent to subscribers. To return to the example of a match between two market participants (a buyer and a seller) that results in a pair of execution messages OBmsg sent to both counter parties, this match activity may also be reported to market data subscribers.
[00161] In some embodiments, the system 100 ensures that not only are both acknowledgement messages sent to the counter parties at the same time, but these acknowledgement messages are timed to also be sent at the same time as the market data messages reflecting this execution sent to market data subscribers, thereby ensuring that no market participant has prior access to useful financial information.
[00162] More generally, the system 100 provides precise control over when outbound messages are sent. In some instances, fairness to everyone is important, and so the response messages to multiple recipients can be released at the same time. Or in another embodiment, the system may be configured to release the response to the participants 130-1, 130-2 who are parties to a transaction a bit sooner than to a market feed 130-3. Or perhaps the market feed 130-3 receives the response sooner than the participants 130-1, 130-2 to the transaction. In yet other configurations, the system 100 rewards participants 130 who are “liquidity adders” by letting them know sooner, and retards “liquidity removers” by letting them know later.
[00163] Fig. 3B is an example of an egress QOS shaper 320 that may be located inside one or more gateways 120. Each OBmsg is received with its corresponding time based value TBV. The TBV is then used with the desired latency Td to determine an exit time Tex at which the message will be sent out of the gateway 120. Specifically, when the TBV is the receipt time for corresponding inbound message IBmsg, the corresponding deterministic delay, Td, may be added to the TBV to arrive at Tex.
[00164] The egress QOS shaper 320 may be implemented as a “packet scheduler,” organized into a set of indexed storage locations 380-1, 380-2, . . . , 380-n (or “buckets”), with a location 380 associated with each discrete high precision egress timing interval Texl, Tex2, . . ., Texn. Each message placed in a particular bucket 380, is released at the precise time interval associated with the bucket. Each bucket may contain a list of zero or more outbound messages OBmsg.
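The bucket organization described above can be sketched as a mapping from each egress time to a list of messages. This is a simplified software illustration only; the class and method names are hypothetical, and a hardware implementation would differ substantially.

```python
from collections import defaultdict


class BucketScheduler:
    """Indexed storage locations ("buckets"), one per discrete egress time."""

    def __init__(self):
        # Each bucket holds a list of zero or more outbound messages.
        self._buckets = defaultdict(list)

    def enqueue(self, tex: int, message) -> None:
        """Place a message in the bucket for its exit time Tex."""
        self._buckets[tex].append(message)

    def release(self, now: int) -> list:
        """Return, and remove, every message whose bucket matches `now`."""
        return self._buckets.pop(now, [])
```

All messages placed in the same bucket are released together, which is what allows several related responses to leave at one precise time.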
[00165] As explained previously, in an alternative implementation, rather than assigning an inbound timestamp as the TBV associated with each arriving message in the gateway, TBV may instead be allocated as the desired outbound indexed location or bucket of the packet scheduler 320. Thus, when the desired egress time is used as the TBV, it can be considered to directly correlate to the exit time Tex.
[00166] Although the message scheduler 320 may be a ring data structure that wraps, other implementations are possible. For example, message scheduler 320 may also be implemented as a set of linked lists with pointers, for example, with one linked list for each exit time Tex. Note that the “time” value, be it an arrival or exit time, can be with reference to an absolute or relative time.
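A ring data structure that wraps can be sketched as a fixed array of slots indexed by the exit time modulo the ring size. The names below are hypothetical, and the restriction that an exit time must lie less than one full revolution ahead of the current time is an assumption added here so that two different exit times never share a slot.

```python
class RingScheduler:
    """Fixed-size ring of slots; slot index is the exit time modulo the size."""

    def __init__(self, slots: int):
        self._n = slots
        self._ring = [[] for _ in range(slots)]

    def enqueue(self, now: int, tex: int, msg) -> None:
        # Assumption: Tex must be in the future and within one revolution,
        # otherwise it would alias with an earlier or later time slot.
        if not 0 < tex - now < self._n:
            raise ValueError("exit time outside the ring's horizon")
        self._ring[tex % self._n].append(msg)

    def tick(self, now: int) -> list:
        """Release the messages in the slot for `now` and empty that slot."""
        i = now % self._n
        out, self._ring[i] = self._ring[i], []
        return out
```

The ring size bounds the largest latency that can be scheduled, so it would need to exceed the largest Td in use.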
[00167] Fig. 4 is an example of an inbound message IBmsg arriving from a market participant 130 at a gateway 120, and a reply being sent as an outbound message OBmsg with a deterministic latency (e.g., Td) of 1000 time units (where a time unit may be a nanosecond).
[00168] In an initial state 401, the request message arrives at the gateway 120 and a timestamp TBV is added to it, to generate an internal message D1 (e.g., IBmsg). This state 401 may occur at a time T17356. Next, in state 402 (at time T17358), the message D1 including the timestamp TBV is sent by the gateway 120 to one of the cores 140. At a later time, in state 403, such as at time T17918, a response message (R1) with the same timestamp TBV is received back at the gateway 120 from the core 140. Next, in state 404 (which may occur at time T17922), the internal response message R1 with the timestamp is fed into the packet scheduler 320, which determines the appropriate outgoing timeslot (which in this example is arrival time T17356 plus 1000, or a time of T18356). Finally, in state 405, the precise time T18356 is reached, at which point the packet scheduler permits the response message (R1) to exit the gateway as the system-level outbound message OBmsg.
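The arithmetic of this example can be written out as follows, using only the times given for Fig. 4; the row labels are descriptive and are not terms used in the figure.

```latex
\begin{aligned}
T_{ex} &= T_{ar} + T_d = 17356 + 1000 = 18356 \\
\text{round trip to the core} &= 17918 - 17358 = 560 \\
\text{time held in the packet scheduler} &= T_{ex} - 17922 = 18356 - 17922 = 434
\end{aligned}
```

The round trip of 560 time units is hidden from the participant, who observes only the 1000-unit interval between request and response.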
[00169] It should be understood that although the example of Fig. 4 shows the inbound message IBmsg and its TBV being forwarded directly (e.g., over activation link 180-1-1 in FIG. 1B) from a gateway node 120 to one of the cores 140, other arrangements are possible. For example, the IBmsg with the TBV may alternatively travel via a sequencer 150 to one of the cores 140 (e.g., over ordering path 117 in FIG. 1B), or it may travel along both paths - e.g., both directly from the gateway node 120 to the core 140 and from gateway node 120 through the sequencer 150 to the core 140. In embodiments in which the IBmsg with the TBV travels via the sequencer 150, both unmarked messages and sequence-marked messages may have a TBV included with them. In such embodiments, a TBV included in an unmarked message may have the same value as a TBV included in the corresponding sequence-marked message. Accordingly, as mentioned above, the TBV is a field in a message that is different from the sequence identifier (e.g., different from either the sequence ID 110-14 or reference sequence ID 110-15) of the message.
[00170] Fig. 5 shows an example embodiment where an outbound message OBmsg results from an event internal to the system 100, as opposed to being a “response” to some inbound “request” message. In this example, the OBmsg is “asynchronous” to any corresponding inbound message. Such asynchronous messages should still be marked and handled with a scheduled egress time Tex.
[00171] One example of an asynchronous message in a trading system might be an Order Cancel triggered by a timer. A matching engine in the core might generate such a Cancel message when a “time in force” associated with a resting order expires. Since the Cancel message is to be broadcast to more than one client/participant (for example, both the initiator of the order as well as a market data feed), it should exit all gateways to all destination participants at the same time.
[00172] In this instance, the exit time Tex may be determined by the core 140 that originates the message. In other implementations, the exit time Tex may be determined by one of the sequencers 150 through which OBmsg will travel.
[00173] Thus, as part of ensuring deterministic latency, an outbound message needing to be sent to multiple participants will be sent at the same exit time Tex, regardless of whether the outbound message was generated as part of a request-response or whether it instead was an asynchronous message originated internally within the mesh.
[00174] It should also be understood that the concept of having a deterministic latency encompasses more than just a fixed time. In other implementations, therefore, the deterministic latency may include using a set of several fixed time values that are evenly distributed across a range according to fairness criteria. The particular sequence of time values may be statistically evenly and randomly distributed across the range.
[00175] Accordingly, the system 100 also supports deterministic latency variation, in which latency values Td are order-randomized within an evenly spanned latency value range. In such embodiments, time stamps are associated with inbound messages IBmsg in much the same way as described above, with the one difference being that the egress QOS shaper assigns the message to a time slot that depends on an order-randomized latency value within a bounded range.
[00176] In such embodiments, the egress QOS shaper 320 could be considered to assign outgoing time slots in a manner similar to that of dealing a card from a deck of playing cards, so that once a latency value in the range has been assigned, that particular latency value will not be assigned again until all other values in the range have also been assigned and the “deck” is reshuffled. While in one embodiment the QOS shaper 320 is responsible for assigning the order-randomized latency value, in other embodiments, other components such as a gateway 120 or sequencer 150 may assign these values at the time an inbound message is received.
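The card-dealing analogy can be sketched as a shuffled list that is refilled only when empty. This is a minimal illustration: the `LatencyDeck` name and the use of Python's `random.Random` as the source of randomness are assumptions made here and are not taken from the described system.

```python
import random


class LatencyDeck:
    """Deals each latency value once before any value is dealt again."""

    def __init__(self, values, seed=None):
        self._values = list(values)
        self._rng = random.Random(seed)  # assumed randomness source
        self._deck = []

    def deal(self) -> int:
        if not self._deck:
            # All values have been assigned: reshuffle a fresh deck.
            self._deck = self._values[:]
            self._rng.shuffle(self._deck)
        return self._deck.pop()
```

For a hypothetical range of 1000 to 5000 time units in steps of 500, `LatencyDeck(range(1000, 5001, 500))` deals all nine values in some order before repeating any of them.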
[00177] In an implementation where the varying but still deterministic delay, Td, follows some determined pattern, Td will not necessarily be the same for each inbound message, IBmsg. Thus, it may be desirable to carry the Td value assigned to each message along with its associated ingress time Tar, as the “time based values” TBV for each message. This then enables the gateway 120 to determine the correct exit time Tex at egress time for the corresponding outbound message(s) OBmsg. In other embodiments, however, the varying but deterministic delay Td can be added to the arrival time Tar right at ingress time, such that the TBV is the single value, Tex. In other words, using the exit time Tex as the TBV in the variable delay case avoids the need to carry both the delay value Td and the ingress time Tar with each message.
[00178] Fig. 6A illustrates an example where the system assigns a deterministic latency (Td) of 2000 time units to a particular message, as selected from a deterministic range of between 1000 and 5000 time units. Similar to the example of Fig. 4, at state 602 a time T17356 is reached in which an inbound message arrives at gateway 120-1. The message is time stamped with a TBV of 17356. At state 603 (time T17358) the inbound message with that timestamp is sent by the gateway to the core. At state 604 (time T17918) a response message with the same timestamp is received back at the gateway from the cores 140. At state 605 (time T17922), the response message with the timestamp is queued at the packet scheduler within the gateway 120-1. One or more response messages are then sent out from the gateway in state 606 (at T19356), because this particular message was assigned a latency of 2000 time units from the set of values that range from 1000 to 5000 time units. The specific latency value assigned to a message (that is, its place in the packet scheduler) can be determined by the gateway at the time the message is received, or in other ways, as already explained elsewhere.
[00179] Fig. 6B is a similar example, but here the assigned deterministic latency for this message was determined by the system to be 3500 time units (as selected from the same range of 1000-5000 units). The processing is otherwise similar to Fig. 6A, albeit with a different deterministic delay. Starting at state 612, the message arrives at the gateway at time T17356 and is time stamped. At state 614 (time T17358) the message exits the gateway and is forwarded to the core. At state 616 (time T17918) a response message with the timestamp is returned to the gateway from the core. Next in state 618 (time T17922) the response message with the timestamp is queued in the packet scheduler, and the response message does not exit the gateway until state 620 (time T20856).
[00180] Fig. 7 is an example of how a pattern such as an order-randomized set of values may be assigned to the deterministic latency. Here the bounded range is from 1000 to 5000 time units, with an increment between slots of 500 time units. The QOS shaper has generated a sequence of nine (9) evenly distributed Td values as 2000, 3500, 2500, 1500, 4000, 5000, 4500, 3000, and 1000. Once the initial sequence of values is used up, the QOS shaper repeats the process and generates preferably some other evenly distributed, random pattern of the 9 values.
[00181] Other randomization schemes are possible, though, including using something conceptually similar to multiple decks of playing cards, or even a standard pseudorandom number generator to arrive at the deterministic sequence of delays.
[00182] The assignment of the range of latency values to messages may be on a system-wide basis, or per matching engine, or per account/participant, or per flow/connection. In some cases, it may be desirable not to have the latency distribution be entirely random or completely evenly spaced throughout the entire latency range. For example, it may be desirable to limit the number of consecutive lower valued latencies assigned, and/or the number of consecutive higher latencies assigned. The method within the QOS shaper 320 that assigns latencies may therefore attempt to distribute the latencies evenly within a relatively small set, such as ensuring that for every set of five consecutive latencies assigned to a participant response, at least two latencies in the set will be in a higher latency range and two other latencies will be in a lower latency range. For some applications, limiting the consecutive number of similar latencies (i.e., all high or all low) may be particularly desirable during particularly busy periods. For example, the periods surrounding market open and market close tend to exhibit higher levels of activity, as do periods following a news item that has a potential financial impact, such as an interest rate adjustment announcement. Such events might need a more tightly controlled distribution of latencies than during normal times.
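One way to limit runs of similar latencies is to check each candidate against the most recent assignments before accepting it. The sketch below is only one possible scheme and is not described in the source: it assumes that "lower" and "higher" mean below and above the midpoint of the range, and it forces the next pick into whichever class is short so that every five consecutive latencies contain at least two of each. All names are hypothetical.

```python
import random
from collections import deque

WINDOW, NEED = 5, 2  # every 5 consecutive picks need >= 2 low and >= 2 high


def constrained_latencies(values, count, seed=None):
    """Return `count` latencies obeying the sliding-window constraint."""
    rng = random.Random(seed)
    values = list(values)
    mid = (min(values) + max(values)) / 2   # assumed low/high boundary
    lows = [v for v in values if v < mid]
    highs = [v for v in values if v > mid]
    recent = deque(maxlen=WINDOW - 1)       # the last four picks
    out = []
    for _ in range(count):
        # Slots still open in the current window after this pick.
        free_after_pick = WINDOW - 1 - len(recent)
        low_deficit = max(0, NEED - sum(v < mid for v in recent))
        high_deficit = max(0, NEED - sum(v > mid for v in recent))
        if low_deficit + high_deficit > free_after_pick:
            # This pick must reduce a deficit or the window cannot comply.
            pool = (lows if low_deficit else []) + (highs if high_deficit else [])
        else:
            pool = values
        pick = rng.choice(pool)
        recent.append(pick)
        out.append(pick)
    return out
```

Within those limits the picks remain random, so the scheme trades some randomness for a bound on how long a participant can see consistently low or consistently high latencies.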
[00183] The range of values selected should take into consideration the expected delays inherent in the components of the system, including at least the gateways 120 and the cores 140. For example, it may be that, depending on the message type and the current load on the distributed system, the time taken to process a message ranges between 400 and 800 nanoseconds, but the egress QOS shaper is configured to send any corresponding response at a randomized interval within a range of 1000 to 5000 nanoseconds. Accordingly, after the response message has reached the gateway node and is being readied to be sent out to the trading participant, the message enters the egress QOS shaper, where it remains for a period of time until the precise randomized interval has elapsed since the original message was timestamped upon entry at the gateway node. Adding a randomized but evenly distributed jitter to the latency ensures fairness, since each trading message has an equal chance of being assigned any randomized jitter within the configured latency bounds, yet by adding jitter, the system mitigates participant behaviors that could exploit the predictability of a perfectly consistent latency.
[00184] It should be understood that many of the concepts explained for fixed latency are also applicable to this embodiment. For example, it may be desirable to simultaneously send all related response messages that are triggered by a single incoming participant message. It may also be desirable to allow the set of QOS levels to be tuned, either on a per-participant/connection, or system-wide basis.
[00185] Fig. 8 illustrates some of the considerations that may be factored into determining an appropriate fixed value for Td, which should be selected to be greater than a combination of the worst case delay through the cores 140 (Tcore) and the worst case delay (Tpath) of the paths between the gateways 120 and the cores 140 (for both inbound and outbound messages).
[00186] More particularly, for synchronous messages (that is, where the outbound message is a response to an inbound message) some of the considerations for determining Td may include:
- time in gateway ingress,
- transmission time between gateway and core,
- time in core,
- transmission time between core and gateway, or
- time in gateway egress.
[00187] Similarly, considerations for determining Td for asynchronous messages (those that are not a specific response to an inbound message) may include:
- time in core,
- transmission time between core and gateway, or
- time in gateway egress.
[00188] In an example embodiment, the cores 140 are expected to be largely implemented in hardware (e.g., they are fixed-logic, FPGA based) and thus have a relatively fixed predictable “time in core” to execute a task. However, it should be understood that some tasks executed by the cores 140 may take longer than others. For example a core 140 might respond to a Cancel request much faster than responding to an Add Order. A different Add Order request may require different times to process, depending upon whether the order is for a security that is frequently traded (and hence the required data to execute the trade is held in a cache local to a core 140 dedicated to that security) or infrequently traded (where the data necessary to execute the order is to be retrieved from a location outside of a local cache on the core 140). Thus, the types of messages expected to be sent may also factor into selecting Td.
[00189] The delay associated with the paths, Tpath, is also expected to cluster around a relatively small and predictable value in an architecture that utilizes a fully connected mesh between all gateways 120 and cores 140, as described above.
[00190] Thus, Td should be chosen to be greater than the combination of these internal system delays, with the goal being that time Td is selected to “hide” the variability of things such as Tcore and Tpath from the participant devices 130. As long as the worst-case scenario sets Td at or above some maximum possible system delay, it is possible for the system 100 to guarantee a deterministic latency for all participant devices 130.
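The considerations listed above for synchronous and asynchronous messages can be combined into a lower bound that Td must meet or exceed. The sketch below is illustrative: the stage names, the function name, and the optional safety margin are assumptions, and any numbers supplied to it are hypothetical rather than measured values.

```python
SYNC_STAGES = (
    "gateway_ingress", "gateway_to_core", "core",
    "core_to_gateway", "gateway_egress",
)
ASYNC_STAGES = ("core", "core_to_gateway", "gateway_egress")


def td_floor(worst_case: dict, synchronous: bool = True, margin: int = 0) -> int:
    """Sum the worst-case delay of every stage a message passes through.

    Td should be chosen at or above the returned value; `margin` is an
    assumed extra allowance, not something the description prescribes.
    """
    stages = SYNC_STAGES if synchronous else ASYNC_STAGES
    return sum(worst_case[s] for s in stages) + margin
```

Asynchronous messages skip the ingress stages, so their floor is lower, but a single Td sized for the synchronous case covers both.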
[00191] Td can be determined at the time the system is designed. However, in another approach, Td can be dynamically tuned by monitoring the time difference between an inbound message and the associated outbound message, in order to guarantee Td will always be greater than the largest delta in time. In this approach, the system 100 can be placed in a special mode and subjected to a series of test request messages (different Add Orders, Cancels, etc.) and the response time noted for each of a set of participant connections, gateways, and cores, to determine a maximum latency time for the system.
[00192] Fig. 9 is similar to Fig. 8 but illustrates the situation where the deterministic latency is evenly distributed across a range of values. In this example, there are multiple possible delay times Td1, Td2, Td3, . . . Tdn. All of these times selected are preferably greater than the expected worst-case internal delay, e.g., the sum of Tcore and Tpath.
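The measurement-based tuning of Td, and the requirement that every delay in a set exceed the worst-case internal delay, can be sketched together. The function names, the sample format of (arrival time, response-ready time) pairs, and the headroom parameter are all assumptions introduced for illustration.

```python
def measured_floor(samples, headroom: int = 0) -> int:
    """Largest observed inbound-to-response delta, plus optional headroom.

    samples -- iterable of (arrival_time, response_ready_time) pairs
               collected while the system handles test request messages
    """
    return max(ready - arrival for arrival, ready in samples) + headroom


def usable_delays(candidates, floor: int) -> list:
    """Keep only candidate Td values strictly greater than the floor."""
    return [td for td in candidates if td > floor]
```

If the largest measured delta were 800 time units, candidate delays of 500 would be discarded while 1000, 1500, and 2000 would remain usable.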
[00193] As mentioned above, some implementations of the system 100 may support participants that use different financial trading protocols. The processing of such messages is thus inherently faster or slower depending upon the protocol selected. For example, an inbound request message encoded using a binary protocol would inherently be handled by the cores 140 more rapidly than a request that uses a text-based protocol.
[00194] In order for such a system to provide deterministic, fair behavior for all participants, the latency should be the same regardless of the protocol used. Therefore, Td should be chosen to also accommodate such variance.
[00195] In the example of Fig. 10, gateway 120-1 receives an IBmsg from a first participant 130-1 (for example, a buyer) that uses a text-based financial trading protocol. A different participant 130-2 (for example, a seller) may be using a binary financial trading protocol as it sends its IBmsgs to gateway 120-2. The time (P1) needed for the system to process the IBmsg for the first participant and return the corresponding response (OBmsg) may be longer than the time (P2) needed for the system to process the IBmsg and return the response for the second participant. Accordingly, the system 100 could be configured with a time Td that is sufficiently high to take into account the processing time required for the slowest financial trading protocol, thereby ensuring that messages exchanged according to any protocol exhibit the same response time, as discussed above. Alternatively, the system may instead be configured to treat different protocols differently, providing a message a deterministic time boost or a deterministic time penalty, depending on the message’s protocol, thereby encouraging or discouraging the use of particular protocols.
[00196] By adjusting Td by a value that depends on a further parameter (e.g., P1 or P2) such as the financial trading protocol used (perhaps on a per-connection basis), the returned outbound message OBmsg will therefore be scheduled to finish exiting the system 100 at a time that depends upon both the protocol-dependent value as well as the time of receipt (e.g., the timestamp or TBV). Such IBmsgs will be forwarded to the cores with a protocol-dependent time value (P), which is carried along with the message and the TBV, so that the gateway can determine a time for the outbound message OBmsg to exit in a manner that takes into account the protocol of the message. In this manner, the gateways 120-1, 120-2 are able to calculate different egress times for market participants 130-1 and 130-2 to account for the message protocol they are using.
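A protocol-dependent adjustment of the exit time can be sketched as one further term added to the arrival time and Td. The additive form, the `InboundMeta` record, and the adjustment values are assumptions made for illustration; the description says only that the exit time depends on both the protocol-dependent value and the time of receipt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundMeta:
    tbv: int   # time of receipt at the gateway (Tar)
    p: int     # protocol-dependent time value carried with the message


def protocol_exit_time(meta: InboundMeta, td: int) -> int:
    """Assumed combination: Tex = TBV + Td + P.

    A positive P acts as a deterministic penalty and a negative P as a
    deterministic boost relative to the base latency Td.
    """
    return meta.tbv + td + meta.p
```

With a hypothetical P of 0 for one protocol and 200 for another, two messages received at the same instant would exit 200 time units apart, yet each exit time remains fully determined at ingress.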
[00197] However, parameters such as P1 and P2 may be based on considerations other than the protocol in use. For example, they may be used to offer differentiated classes of service, where the participants who pay for “first class” service still are all delayed the same as other first class users, but they are not delayed as much as “second class” users who pay less, etc. The notion of the system 100 being “fair to participants with a deterministic latency” can thus mean different things. It can mean, “treat all participants exactly the same”, or it can be used to incentivize certain types of trading behavior or protocols, or for security reasons.
[00198] The deterministic latency may also be temporarily adjusted based on current system conditions. For example, it may be increased due to failure of one or more components, or as a result of an exceptional burst of activity in the system.
Other use cases
[00199] The architecture described above may be of use in applications other than electronic trading systems. For example, it is possible that it may be used to monitor data streams flowing across a network, to capture packets, decode the packets’ raw data, analyze packet content in real time, and provide responses, for applications other than handling securities trade orders.
[00200] Review of Embodiments
[00201] Those of skill in the art will now recognize that various implementations of the methods and systems described above are possible.
[00202] For example, one approach may include a plurality of gateways connected to receive inbound messages from two or more participant devices. One or more of the gateways are each further configured to determine a time based value (TBV) for a selected one of the inbound messages. The gateways forward the selected inbound message with its respective TBV to one or more compute nodes, and then receive a response message from the one or more compute nodes, the response message having information derivable from the TBV. The one or more gateways then send a response message to at least one of the participant devices as an outbound message, the outbound message sent at a deterministic egress time that depends on both the information derivable from the TBV and a deterministic latency.
[00203] In one embodiment, a method transmits a response message with a deterministic latency. A plurality of incoming messages are received by one or more gateways from a plurality of participant devices. A selected one of the gateways then determines a corresponding time based value for a selected one of the plurality of incoming messages. The selected incoming message relates to an electronic trading function. The selected one of the gateways then sends the selected incoming message and information derivable from the corresponding time based value to a sequencer node. The sequencer node then sends, to at least a selected one of a plurality of compute engines, a sequence-marked message and the information derivable from the corresponding time-based value. The selected one of the plurality of compute engines receives the sequence-marked message and the information derivable from the corresponding time based value, and determines a selected compute-response message. The selected compute-response message is based on the sequence-marked message, and is further configured to complete the electronic trade matching function. The selected one of the compute engines then returns the selected compute-response message and the information derivable from the corresponding time based value to at least the selected one of the gateways. The selected one of the gateways then receives the selected compute-response message from the selected one of the plurality of compute engines and the information derivable from the corresponding time based value, and transmits the selected compute-response message to at least one of the plurality of participant devices at a deterministic egress time that depends on at least the information derivable from the corresponding time based value for the selected incoming message.
[00204] In some embodiments, the selected one of the plurality of gateways may receive an other incoming message from an other participant device. In that embodiment, the selected one of the gateways also forwards an other forwarded message to at least one of the plurality of compute engines, with the other forwarded message including the other incoming message and information derivable from an other time based value. The selected one of the plurality of gateways receives an other compute-response message from the at least one of the plurality of compute engines, with the other compute-response message including the information derivable from the other time based value. The selected one of the plurality of gateways then delays sending the other compute-response message until an other deterministic egress time is reached, where the other deterministic egress time depends on at least the information derivable from the other time based value. The selected one of the gateways then sends the other compute-response message to the other participant device at the other deterministic egress time, such that the selected one of the compute-response messages and the other compute-response message are each delayed by the deterministic latency.
[00205] The time based value may depend on a time that relates to at least one of a receive time for the selected incoming message or a desired egress time for the selected compute-response message.
[00206] The deterministic latency may depend on a maximum time for the selected one of the plurality of compute engines to return the compute-response message. The deterministic latency may follow a varying but deterministic pattern. The deterministic latency may be selected from a set of latencies evenly distributed across a predetermined range. The deterministic latency may also be configured on a per-gateway, per-connection, or system-wide basis.
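As one hedged example of a set of latencies evenly distributed across a predetermined range, and of a varying but deterministic pattern over that set, consider the sketch below. The helper name `evenly_spaced_latencies`, the range limits, and the round-robin ordering are assumptions; the description does not prescribe how the set is generated or traversed.

```python
from itertools import cycle

def evenly_spaced_latencies(low_ns: int, high_ns: int, count: int) -> list:
    """Return `count` latencies evenly distributed over [low_ns, high_ns]."""
    if count < 2:
        return [low_ns]
    step = (high_ns - low_ns) / (count - 1)
    return [round(low_ns + i * step) for i in range(count)]

latencies = evenly_spaced_latencies(40_000, 60_000, 5)
# -> [40000, 45000, 50000, 55000, 60000]

# A varying but deterministic pattern: walk the set in a fixed, repeating order.
pattern = cycle(latencies)
first_seven = [next(pattern) for _ in range(7)]
```

Any other fixed ordering (for example, one derived from a seeded sequence known to the operator) would equally qualify as deterministic; round-robin is chosen here only for brevity.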
[00207] In some embodiments, forwarding from the selected one of the gateways to the compute engine may be over one or more direct, dedicated connections from the selected one of the gateways to the plurality of compute engines.
[00208] An asynchronous message may be received from one of the compute engines by at least the selected one of the gateways, and then sent as an outbound message simultaneously to two or more participant devices.
[00209] A transit time may be determined that may be associated with the step of forwarding the selected incoming message and with the step of receiving the selected one of compute-response messages, and the deterministic egress time further depends on the transit time.
[00210] Transmitting the selected one of the compute-response messages from the selected gateway may additionally comprise transmitting the selected one of the compute-response messages to at least one other participant device at the deterministic egress time.
[00211] The deterministic egress time may be one of a plurality of deterministic egress times. In that instance, the selected one of the compute-response messages may be stored in one of a set of indexed locations, with each indexed location associated with one of the plurality of deterministic egress times.
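One possible way to realize a set of indexed locations, each associated with an egress time, is a ring of time slots (sometimes called a timing wheel). The following is a minimal sketch under that assumption; the class name `EgressWheel`, the slot count, and the tick size are hypothetical and are not taken from the description.

```python
class EgressWheel:
    """Ring of indexed slots; slot i holds messages whose egress tick maps to i."""

    def __init__(self, num_slots: int, tick_ns: int):
        self.num_slots = num_slots
        self.tick_ns = tick_ns
        self.slots = [[] for _ in range(num_slots)]

    def index_for(self, time_ns: int) -> int:
        # Map a time to its slot index by tick number, wrapping around the ring.
        return (time_ns // self.tick_ns) % self.num_slots

    def store(self, egress_ns: int, message: str) -> int:
        idx = self.index_for(egress_ns)
        self.slots[idx].append(message)
        return idx

    def release(self, now_ns: int) -> list:
        """Remove and return every message stored in the slot for `now_ns`."""
        idx = self.index_for(now_ns)
        due, self.slots[idx] = self.slots[idx], []
        return due

wheel = EgressWheel(num_slots=8, tick_ns=1_000)
slot = wheel.store(egress_ns=13_000, message="fill-report")
early = wheel.release(now_ns=12_000)    # slot for tick 12 is empty
on_time = wheel.release(now_ns=13_000)  # slot for tick 13 holds the message
```

This sketch assumes every stored egress time lies within one revolution of the ring (here 8 ticks) of the current time; otherwise two different egress times would share a slot.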
[00212] The information derivable from the corresponding time based value may be inserted into an unused field in the selected incoming message before forwarding the selected incoming message to the plurality of compute engines.
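For illustration, inserting the time based value into an unused field might look like the sketch below. The 16-byte layout (order identifier, reserved field, quantity) is entirely hypothetical and does not correspond to any particular trading protocol; it merely shows a reserved field being overwritten without changing the message length.

```python
import struct

# Hypothetical wire layout (big-endian): order id, reserved field, quantity.
LAYOUT = struct.Struct(">IQI")

def stamp_tbv(raw: bytes, tbv_ns: int) -> bytes:
    """Return a copy of `raw` with the reserved field overwritten by the TBV."""
    order_id, _reserved, qty = LAYOUT.unpack(raw)
    return LAYOUT.pack(order_id, tbv_ns, qty)

def read_tbv(raw: bytes) -> int:
    """Extract the TBV previously written into the reserved field."""
    return LAYOUT.unpack(raw)[1]

inbound = LAYOUT.pack(42, 0, 100)          # reserved field arrives as zero
forwarded = stamp_tbv(inbound, 1_234_567)  # gateway stamps the TBV before forwarding
```

Because the field already exists in the message, the compute engines can carry the value through unchanged and return it with the response.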
[00213] The corresponding time based value may be part of an internal system protocol field within the selected incoming message.
[00214] In some embodiments, the deterministic latency may be dynamically changed.
[00215] Two or more of the plurality of gateways may each receive the selected compute-response message from the selected one of the compute engines.
[00216] In some embodiments, the selected one of the gateways may forward the selected incoming message with the information derivable from the corresponding time based value to the plurality of compute engines. The selected one of the plurality of compute engines then receives the selected incoming message with the information derivable from the corresponding time based value. Here the determining, by the selected one of the plurality of compute engines, the selected compute-response message may be further based on at least one of the selected incoming message or the sequence-marked message. Also in these embodiments, forwarding the selected incoming message and receiving the selected compute-response message may be via a plurality of direct connections provided between each of the plurality of gateways and each of the plurality of compute engines.
[00217] The plurality of gateways may optionally be further configured to receive an asynchronous message from at least a selected one of the plurality of compute engines, and then send the asynchronous message as an outbound message simultaneously to two or more of the participant devices.
[00218] The selected compute-response message may be associated with a trade match event between two match parties each associated with a respective one of two participant devices. In those embodiments, the selected gateway may transmit the selected compute-response message simultaneously to the two participant devices associated with the two match parties.
[00219] The selected compute-response message may be sent as a market data event message to a third device associated with a subscriber of a market data stream.
[00220] In some embodiments, the deterministic egress time may depend on one or more of a message path delay or compute engine delay.
[00221] Other embodiments may include a system that has a plurality of compute engines, one or more gateways, and a sequencer node. The one or more gateways are configured to (a) receive a plurality of incoming messages from a plurality of participant devices; (b) determine a corresponding time based value for a selected one of the plurality of incoming messages; (c) forward the selected incoming message with information derivable from the corresponding time based value, to the plurality of compute engines; and (d) send the selected incoming message to the sequencer node.
The sequencer node may be configured to (e) receive the selected incoming message; and (f) send a sequence-marked message to a selected one of the compute engines. The selected compute engine may be configured to (g) receive the selected incoming message and the information derivable from the time based value from at least the selected one of the gateways; (h) receive the sequence-marked message from the sequencer node; (i) determine a selected compute-response message based on at least one of the selected incoming message and the sequence-marked message, the compute-response message configured to complete the electronic trade matching function; and (j) return the selected compute-response message and the information derivable from the corresponding time based value to the one or more gateways. At least the selected one of the gateways is further configured to (k) receive the selected compute-response message and the information derivable from the time based value from the selected one of the compute engines; and (l) transmit the selected compute-response message to at least one of the plurality of the participant devices at a deterministic egress time that depends on at least the information derivable from the corresponding time based value for the selected incoming message.
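The interaction of steps (e) through (j) may be easier to follow with a small sketch. In the example below the `Sequencer` and `ComputeEngine` classes, the message identifiers, and the "ack" result are all hypothetical; the sketch also assumes that the gateway's copy of a message reaches the compute engine before the corresponding sequence-marked message, which the description does not require.

```python
from itertools import count

class Sequencer:
    """Assigns a monotonically increasing sequence number to each message id."""

    def __init__(self):
        self._next = count(1)

    def mark(self, msg_id: str) -> tuple:
        return (next(self._next), msg_id)

class ComputeEngine:
    """Holds gateway copies until the matching sequence-marked message arrives."""

    def __init__(self):
        self.pending = {}    # msg_id -> (payload, tbv) received from a gateway
        self.processed = []  # (seq, msg_id, tbv) in the order processed

    def on_gateway_message(self, msg_id, payload, tbv):
        self.pending[msg_id] = (payload, tbv)

    def on_sequence_mark(self, seq, msg_id):
        payload, tbv = self.pending.pop(msg_id)
        self.processed.append((seq, msg_id, tbv))
        # The response carries the TBV back so the gateway can schedule egress.
        return {"seq": seq, "msg_id": msg_id, "result": "ack:" + payload, "tbv": tbv}

sequencer = Sequencer()
engine = ComputeEngine()
engine.on_gateway_message("A", "buy 100", tbv=1_000)
engine.on_gateway_message("B", "sell 100", tbv=1_005)
# The sequencer happens to order B ahead of A; each response keeps its own TBV.
r1 = engine.on_sequence_mark(*sequencer.mark("B"))
r2 = engine.on_sequence_mark(*sequencer.mark("A"))
```

The point of the sketch is that the processing order is set by the sequencer, while the egress timing information travels with each message independently of that order.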
[00222] In such a system, the deterministic egress time may follow a varying but deterministic pattern.
[00223] The selected one of the gateways in such a system may be further configured to forward the selected incoming message and receive the selected compute-response message via a plurality of direct connections provided between each of the plurality of gateways and each of the plurality of compute engines.
[00224] Such a system may also be configured such that the selected compute-response message is associated with a trade match event between two match parties each associated with a respective one of two participant devices, and the selected one of the gateways may be further configured to transmit the selected compute-response message simultaneously to the two participant devices associated with the two match parties.
[00225] Also, a method for transmitting a response message with a deterministic latency may include:
[00226] receiving, by a plurality of gateways, a plurality of incoming messages from a plurality of participant devices;
[00227] determining, by a selected one of the gateways, a corresponding time based value for a selected one of the plurality of incoming messages;
[00228] wherein the selected incoming message is related to an electronic trade matching function;
[00229] forwarding, by the selected one of the gateways, the selected incoming message with information derivable from the corresponding time based value, to a plurality of compute engines;
[00230] sending, by the selected one of the gateways, the selected incoming message to a sequencer node;
[00231] sending, by the sequencer node to at least one of the compute engines, a sequence-marked message;
[00232] receiving, by the plurality of compute engines, the selected incoming message and the information derivable from the corresponding time based value from at least the selected one of the gateways;
[00233] receiving, by a selected one of the plurality of compute engines, the sequence-marked message;
[00234] determining, by the selected one of the plurality of compute engines, a selected compute-response message based on at least one of the selected incoming message and the sequence-marked message, the selected compute-response message configured to complete the electronic trade matching function;
[00235] returning, by at least the selected one of the plurality of compute engines, the selected compute-response message and the information derivable from the corresponding time based value, to at least the selected one of the gateways;
[00236] receiving, by at least the selected one of the gateways, the selected compute-response message from at least the selected one of the plurality of compute engines and the information derivable from the corresponding time based value;
[00237] delaying, by at least the selected one of the gateways, the selected compute-response message until a deterministic egress time is reached, where the deterministic egress time depends on at least the deterministic latency and the information derivable from the corresponding time based value; and
[00238] transmitting, by the at least the selected one of the gateways, the selected compute-response message to at least one of the plurality of participant devices at the deterministic egress time.
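The gateway-side steps of the method recited above (determining the time based value, forwarding, receiving the response, delaying, and transmitting) can be sketched with a simulated clock as follows. This is an illustrative example only: the `Gateway` class, the use of the receive timestamp as the TBV, the 50,000 ns latency, and the stand-in `compute_engine` function are assumptions, and the sequencer is omitted for brevity.

```python
import heapq

TD_NS = 50_000  # assumed deterministic latency

class Gateway:
    def __init__(self, td_ns=TD_NS):
        self.td_ns = td_ns
        self._held = []  # min-heap of (egress_ns, order, response)
        self._order = 0  # unique tie-breaker so responses are never compared

    def on_inbound(self, payload, now_ns):
        """Determine the TBV (here, the receive timestamp) and build the forwarded message."""
        return {"payload": payload, "tbv": now_ns}

    def on_compute_response(self, response):
        """Hold the response until its egress time, TBV + Td."""
        egress_ns = response["tbv"] + self.td_ns
        heapq.heappush(self._held, (egress_ns, self._order, response))
        self._order += 1
        return egress_ns

    def transmit_due(self, now_ns):
        """Return (egress_ns, response) pairs whose egress time has been reached."""
        out = []
        while self._held and self._held[0][0] <= now_ns:
            egress_ns, _, response = heapq.heappop(self._held)
            out.append((egress_ns, response))
        return out

def compute_engine(forwarded):
    """Stand-in for a compute engine; it returns the TBV with its response."""
    return {"result": "matched:" + forwarded["payload"], "tbv": forwarded["tbv"]}

gw = Gateway()
fast = compute_engine(gw.on_inbound("order-1", now_ns=1_000))
slow = compute_engine(gw.on_inbound("order-2", now_ns=2_000))
# Responses may return to the gateway out of order...
gw.on_compute_response(slow)
gw.on_compute_response(fast)
# ...yet each is scheduled for Td after its own receive time.
nothing_yet = gw.transmit_due(now_ns=50_999)
sent = gw.transmit_due(now_ns=52_000)
```

In this sketch the order in which responses reach the gateway has no effect on when they leave; only the TBV carried with each response and the deterministic latency do.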
[00239] Furthermore, a system may comprise:
[00240] a plurality of compute engines;
[00241] a plurality of gateways, configured to receive a plurality of incoming messages from a plurality of participant devices; and
[00242] wherein a selected one of the gateways is further configured to:
[00243] determine a corresponding time based value for a selected one of the plurality of incoming messages;
[00244] wherein the selected incoming message is related to an electronic trade matching function;
[00245] forward the selected incoming message with information derivable from the corresponding time based value to the plurality of compute engines; and
[00246] forward the selected incoming message to a sequencer node;
[00247] wherein the sequencer node is further configured to:
[00248] send a sequence-marked message to at least one of the compute engines, and
[00249] wherein a selected one of the plurality of compute engines is configured to:
[00250] receive the selected incoming message and the information derivable from the corresponding time based value from at least the selected one of the gateways;
[00251] receive the sequence-marked message from the sequencer node;
[00252] determine a selected compute-response message based on at least one of the selected incoming message and the sequence-marked message, the selected compute-response message configured to complete the electronic trade matching function;
[00253] return the selected one of the compute-response messages and the information derivable from the corresponding time based value, to at least the selected one of the gateways, and
[00254] wherein the selected one of the gateways is further configured to:
[00255] receive the selected one of the compute-response messages and the information derivable from the corresponding time based value from at least the selected one of the compute engines;
[00256] delay the selected one of the compute-response messages until a deterministic egress time is reached, wherein the deterministic egress time depends on at least a deterministic latency and the information derivable from the corresponding time based value for the selected incoming message; and
[00257] transmit the selected one of the compute-response messages to at least one of the plurality of participant devices at the deterministic egress time.
[00258] Further Implementation Options
[00259] It should be understood that the example embodiments described above may be implemented in many different ways. In some instances, the various “data processors” may each be implemented by a physical or virtual general purpose computer having a central processor, memory, disk or other mass storage, communication interface(s), input/output (I/O) device(s), and other peripherals. The general-purpose computer is transformed into the processors and executes the processes described above, for example, by loading software instructions into the processor, and then causing execution of the instructions to carry out the functions described.
[00260] As is known in the art, such a computer may contain a system bus, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The bus or busses are essentially shared conduit(s) that connect different elements of the computer system (e.g., one or more central processing units, disks, various memories, input/output ports, network ports, etc.) and enable the transfer of information between the elements. One or more central processor units are attached to the system bus and provide for the execution of computer instructions. Also attached to the system bus are typically I/O device interfaces for connecting the disks, memories, and various input and output devices. Network interface(s) allow connections to various other devices attached to a network. One or more memories provide volatile and/or non-volatile storage for computer software instructions and data used to implement an embodiment. Disks or other mass storage provide non-volatile storage for computer software instructions and data used to implement, for example, the various procedures described herein.
[00261] Embodiments may therefore typically be implemented in hardware, custom designed semiconductor logic, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), firmware, software, or any combination thereof.
[00262] In certain embodiments, the procedures, devices, and processes described herein are a computer program product, including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROMs, CD-ROMs, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the system. Such a computer program product can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.
[00263] Embodiments may also be implemented as instructions stored on a non-transient machine-readable medium, which may be read and executed by one or more procedures. A non-transient machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a non-transient machine-readable medium may include read only memory (ROM); random access memory (RAM); storage including magnetic disk storage media; optical storage media; flash memory devices; and others.
[00264] Furthermore, firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
[00265] It also should be understood that the block and system diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate the block and network diagrams and the number of block and network diagrams illustrating the execution of the embodiments be implemented in a particular way.
[00266] Accordingly, further embodiments may also be implemented in a variety of computer architectures, physical, virtual, cloud computers, and/or some combination thereof, and thus the computer systems described herein are intended for purposes of illustration only and not as a limitation of the embodiments.
[00267] The above description has particularly shown and described example embodiments. However, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the legal scope of this patent as encompassed by the appended claims.

Claims

1. A system comprising: a plurality of gateways, connected to receive inbound messages from two or more participant devices; one or more of the gateways each further configured to determine a time based value (TBV) for a selected one of the inbound messages; forward the selected inbound message with its respective TBV to one or more compute nodes; receive a response message from the one or more compute nodes, the response message having information derivable from the TBV; and send a response message to at least one of the participant devices as an outbound message, the outbound message sent at a deterministic egress time that depends on both the information derivable from the TBV and a deterministic latency.
2. The system of claim 1 wherein the TBV depends on a timestamp that relates to a receive time for the selected inbound message.
3. The system of claim 1 wherein the TBV depends on a desired egress time for the response message.
4. The system of claim 3 additionally comprising: a packet scheduler, configured to receive the response message, the packet scheduler comprising a set of indexed locations, each associated with a desired egress time; and wherein the TBV is a value that depends on a value of an indexed location associated with the desired egress time.
5. The system of claim 1 wherein the TBV is inserted into an unused field in the inbound message before being forwarded to one of the compute nodes.
6. The system of claim 1 wherein the TBV is added as part of an internal system protocol field within the inbound message.
7. The system of claim 1 wherein the information derivable from the TBV comprises the TBV.
8. The system of claim 1 wherein the deterministic latency depends on a maximum time for the gateway to receive the response message from the one or more compute nodes.
9. The system of claim 1 wherein the deterministic latency follows a varying but deterministic pattern.
10. The system of claim 1 wherein the deterministic latency is selected from a set of latencies evenly distributed across a predetermined range.
11. The system of claim 1 wherein the deterministic latency is configured on a per-gateway, per-connection, or system-wide basis.
12. The system of claim 1 wherein the deterministic latency is dynamically changed due to system conditions.
13. The system of claim 1 wherein two or more of the gateways each receive the response message from the compute node.
14. The system of claim 1 additionally wherein the selected inbound message is forwarded to the one or more compute nodes and the response message is received from the one or more compute nodes via a plurality of direct connections provided between each of the one or more gateways and each of the one or more compute nodes.
15. The system of claim 1 wherein the response message is associated with a trade match event between two match parties each associated with a respective one of two participant devices, and where the gateways are further configured to send the response message as an outbound message at the egress time simultaneously to the two participant devices.
16. The system of claim 15 wherein the outbound message is also sent simultaneously at the egress time as a market data event message to a device associated with a subscriber of a market data stream.
17. The system of claim 1 wherein the one or more gateways are further configured to receive an asynchronous message from one of the compute nodes, and send the asynchronous message as an outbound message simultaneously at the egress time to two or more participant devices.
18. The system of claim 1 wherein the TBV further depends on a time value related to one or more of a message path delay or compute node delay.
19. The system of claim 1 wherein the one or more gateways are further configured to forward the selected inbound message with its respective TBV to one or more sequencer nodes.
20. The system of claim 1 additionally wherein: the one or more compute nodes are configured to receive the selected inbound message with its respective TBV from the one or more gateways; the one or more compute nodes are also configured to return the response message with the information derivable from the TBV to the one or more gateways;
21. The system of claim 1 wherein the one or more gateways are further configured to forward the selected inbound message with its respective TBV to one or more sequencer nodes; and the one or more sequencer nodes are configured to forward the selected inbound message with its respective TBV as a sequence-marked message to the one or more compute nodes.
22. The system of claim 2 wherein the one or more compute nodes are further configured to determine the response message based on either one of or both of the selected inbound message or the sequence-marked message.
23. The system of claim 1 wherein the selected inbound message is related to an electronic trade matching function; and the response message is configured to complete the electronic trade matching function.
24. A method to transmit a response message from an electronic device with a deterministic latency, the method comprising: receiving an incoming message from a participant device; determining a time based value (TBV) for the incoming message; sending a forwarded message to one or more compute engines, the forwarded message including the incoming message and the TBV; receiving a compute-response message from one or more compute engines, the compute-response message including information derivable from the TBV; delaying the compute-response message until a deterministic egress time is reached, where the deterministic egress time depends on both the TBV and the deterministic latency; and transmitting the compute-response message to the participant device at the deterministic egress time.
PCT/US2021/044588 2020-08-07 2021-08-05 Highly deterministic latency in a distributed system WO2022031878A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21786612.8A EP4193256A1 (en) 2020-08-07 2021-08-05 Highly deterministic latency in a distributed system
JP2023508074A JP2023540448A (en) 2020-08-07 2021-08-05 Highly deterministic latency in distributed systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16/988,491 2020-08-07
US16/988,249 US11088959B1 (en) 2020-08-07 2020-08-07 Highly deterministic latency in a distributed system
US16/988,249 2020-08-07
US16/988,491 US11328357B2 (en) 2020-08-07 2020-08-07 Sequencer bypass with transactional preprocessing in distributed system

Publications (1)

Publication Number Publication Date
WO2022031878A1 true WO2022031878A1 (en) 2022-02-10

Family

ID=78078327

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/044588 WO2022031878A1 (en) 2020-08-07 2021-08-05 Highly deterministic latency in a distributed system

Country Status (3)

Country Link
EP (1) EP4193256A1 (en)
JP (1) JP2023540448A (en)
WO (1) WO2022031878A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11315183B2 (en) 2020-08-07 2022-04-26 Hyannis Port Research, Inc. Electronic trading system and method based on point-to-point mesh architecture
US11683199B2 (en) 2020-08-07 2023-06-20 Hyannis Port Research, Inc. Distributed system with fault tolerance and self-maintenance

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7496086B2 (en) 2002-04-30 2009-02-24 Alcatel-Lucent Usa Inc. Techniques for jitter buffer delay management
US7885296B2 (en) 2006-07-27 2011-02-08 Cisco Technology, Inc. Maintaining consistency among multiple timestamp counters distributed among multiple devices
US20180047099A1 (en) * 2016-08-09 2018-02-15 Chicago Mercantile Exchange Inc. Systems and methods for coordinating processing of scheduled instructions across multiple components
US20180359195A1 (en) 2017-06-07 2018-12-13 Cavium, Inc. Timestamp-based packet switching using a trie data structure
US20190097745A1 (en) 2017-09-27 2019-03-28 Intel Corporation One-Step Time Stamping Of Synchronization Packets For Networked Devices
US20200034929A1 (en) * 2018-07-26 2020-01-30 Nasdaq, Inc. In-Order Processing of Transactions

Also Published As

Publication number Publication date
JP2023540448A (en) 2023-09-25
EP4193256A1 (en) 2023-06-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21786612

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023508074

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021786612

Country of ref document: EP

Effective date: 20230307