WO2022031970A1 - Distributed system with fault tolerance and self-maintenance - Google Patents

Distributed system with fault tolerance and self-maintenance

Info

Publication number
WO2022031970A1
Authority
WO
WIPO (PCT)
Prior art keywords: message, messages, node, compute, compute nodes
Application number
PCT/US2021/044746
Other languages
English (en)
Inventor
Anthony D. Amicangioli
Allen Bast
Christophe Juhasz
Original Assignee
Hyannis Port Research, Inc.
Priority claimed from US16/988,491 external-priority patent/US11328357B2/en
Priority claimed from US16/988,464 external-priority patent/US11683199B2/en
Application filed by Hyannis Port Research, Inc. filed Critical Hyannis Port Research, Inc.
Publication of WO2022031970A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/52 Indexing scheme relating to G06F 9/52
    • G06F 2209/522 Manager

Definitions

  • Typical electronic trading systems performing order matching use a traditional networked architecture, in which multiple computing hosts communicate with one another over a shared network through one or more networking devices such as switches or routers.
  • a number of gateway hosts operate as interfaces between client trader devices operating on the front end and a network of matching engines operating on the back end.
  • the gateway nodes, matching engine, and sequencer nodes all communicate over the same shared matching engine network through the switches or routers.
  • Example embodiments described herein provide computing infrastructure for fault- tolerant maintenance of computing nodes.
  • a distributed system implements fault-tolerant groupings of nodes (e.g., pairs, triplets, quartets), in which each node of the grouping is redundant with respect to other nodes in the grouping.
  • Computing tasks within the computing environment may be partitioned off by grouping such that any node in the grouping can perform the task partitioned off to that grouping.
  • individual stock symbols may be assigned to be serviced by a particular pair of nodes, such that a message referencing a particular stock symbol can be sent to either or both nodes in that pair.
  • Other allocation configurations may be implemented, including partitioning by type of message (e.g., a new order, a fill order, a replace order).
  • This grouping of redundant nodes provides a fault tolerance within the distributed system and reduces the chance that a maintenance operation will impact the performance or availability of the distributed system.
  • Further embodiments may also manage maintenance operations at the nodes, thereby ensuring that a node is always available to process a given message.
  • one or more “tokens” may be passed among the nodes, and possession of the token may grant permission to a node to perform a maintenance operation.
  • a node may perform maintenance if needed, and upon completion of the maintenance, the node may pass the token to the next node to provide the next node the opportunity to perform maintenance.
  • Example embodiments may implement a single token for all computing nodes in the system, or may provide each redundant group with a respective token. As a result, the distributed system can operate without any degradation of latency or risk of unavailability of the nodes.
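To make the token mechanics concrete, here is a minimal Python sketch (an illustration only, not the patent's implementation) in which a single maintenance token circulates around a ring of nodes, and each node may perform self-maintenance only while holding it. The `Node` class, the `maintenance_needed` policy, and the queue-based links are all assumptions.

```python
# Minimal sketch of maintenance-token circulation among redundant nodes.
from queue import Queue

class Node:
    def __init__(self, name, inbox):
        self.name = name
        self.inbox = inbox      # queue on which this node receives the token
        self.next_inbox = None  # inbox of the next node in the ring

    def maintenance_needed(self):
        # Placeholder policy; a real node might check queue depth, memory
        # fragmentation, or pending cleanup work. Always True for the demo.
        return True

    def perform_self_maintenance(self):
        print(f"{self.name}: performing self-maintenance while holding token")

    def run_once(self):
        token = self.inbox.get()    # block until this node holds the token
        if self.maintenance_needed():
            self.perform_self_maintenance()
        self.next_inbox.put(token)  # pass the token to the next node

# Wire three redundant nodes into a ring and inject a single token.
nodes = [Node(f"node-{i}", Queue()) for i in range(3)]
for node, nxt in zip(nodes, nodes[1:] + nodes[:1]):
    node.next_inbox = nxt.inbox
nodes[0].inbox.put("MAINT_TOKEN")
for node in nodes * 2:              # two full laps around the ring
    node.run_once()
```

Because only the token holder maintains itself, the redundant peers for each partition remain available to process messages at all times.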
  • Example embodiments may therefore include a computing-based distributed system with self-maintenance.
  • the distributed system may comprise a plurality of compute nodes configured to process messages, each of the messages corresponding to one of a plurality of values of a common parameter of the messages.
  • the plurality of compute nodes may be configured to 1) circulate at least one token between or among at least two of the plurality of respective compute nodes, and 2) perform a self-maintenance operation during a given state of possession of the token.
  • the plurality of compute nodes may include a first subset of at least two of the plurality of compute nodes, the first subset being configured to process messages corresponding to a first value of the common parameter of the messages.
  • a second subset of at least two of the plurality of compute nodes may be distinct from the first subset, and may be configured to process messages corresponding to a second value of the common parameter of the messages and to refrain from processing messages corresponding to the first value.
  • Each of the compute nodes of the first subset may be configured to process a message corresponding to the first value in parallel and independent from one another.
  • At least one other compute node of the first subset may be configured to process a message corresponding to the first value.
  • the plurality of compute nodes may be further configured to circulate the at least one token between or among the first and second subsets.
  • the first subset may also be configured to circulate a first one of the at least one token between or among the plurality of compute nodes of the first subset, and the second subset may be configured to circulate a second one of the at least one token among the second subset.
  • each of the plurality of values of the common parameter may be distributed among the compute nodes in a “striping” configuration, wherein the values are assigned to at least two of the plurality of compute nodes, and each of the plurality of compute nodes may be assigned to process a respective subset of at least two of the plurality of values of the common parameter. Further, each of the respective subsets may differ from one another by at least one value of the common parameter.
  • Each of the plurality of compute nodes assigned to a given value of the common parameter may be configured to process a message corresponding to the given value in parallel and independent from one another.
  • a second compute node may be assigned to process messages corresponding to a first value in addition to the respective subset.
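As a rough illustration of the striping described above (not part of the patent), the sketch below assigns each symbol to a pair of consecutive nodes so that adjacent nodes' subsets overlap by one value; the symbol and node names are hypothetical.

```python
# Illustrative "striping" of symbol values across redundant node pairs.
# Each value is served by two nodes, and each node's subset differs from
# its neighbors' subsets by at least one value.
symbols = ["AAPL", "IBM", "MSFT", "ORCL"]
nodes = ["node-0", "node-1", "node-2", "node-3"]

# Assign each symbol to a pair of consecutive nodes (wrapping around).
assignment = {
    sym: [nodes[i % len(nodes)], nodes[(i + 1) % len(nodes)]]
    for i, sym in enumerate(symbols)
}

def nodes_for(symbol):
    """Return the subset of nodes allowed to process messages for `symbol`."""
    return assignment[symbol]

print(nodes_for("IBM"))   # ['node-1', 'node-2']
```

A message referencing "IBM" may then be sent to either or both nodes of its pair, while the other nodes refrain from processing it.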
  • Each compute node of the plurality of compute nodes may be configured to refrain from performing the self-maintenance operation when not in the given state of possession of the at least one token.
  • Each compute node of the plurality of compute nodes may be configured to 1) receive the at least one token from a preceding one of the plurality of compute nodes; 2) perform the self-maintenance operation selectively based on a state of the compute node; and 3) forward the at least one token to a subsequent one of the plurality of compute nodes.
  • the messages may be associated with transactions of financial instruments, the respective values of the common parameter each corresponding to a respective financial instrument or a transaction type.
  • the distributed system may further comprise at least one gateway configured to forward the messages to the plurality of compute nodes, as well as at least one sequencer configured to sequence the messages.
  • the plurality of compute nodes may be further configured to forward a response to the at least one gateway after processing a message.
  • the at least one gateway may be further configured to transmit a message to a subset of the plurality of compute nodes as a function of a first value of a common parameter of the message.
  • the self-maintenance operation may include at least one of 1) clearing data associated with at least one previous message processing operation, 2) moving data in memory, 3) adjusting layout of a memory, and 4) modifying a message queue.
  • the at least one token may include a plurality of tokens, each of the plurality of tokens indicating a respective type of self-maintenance operation.
  • the self-maintenance operation may correspond to the respective type.
  • Further embodiments may include a distributed system for processing computing tasks, the distributed system comprising a plurality of compute nodes including at least a first, second and third compute node.
  • the first compute node may be configured to process messages corresponding to a first value and a second value of a common parameter of the messages.
  • the second compute node may be configured to process messages corresponding to the second value and a third value of a common parameter of the messages and refrain from processing messages corresponding to the first value.
  • the third compute node may be configured to process messages corresponding to the third value and a fourth value of the common parameter of the messages and refrain from processing messages corresponding to the second value.
  • the compute nodes may also be configured to circulate at least one token between or among at least some of the plurality of compute nodes, and perform a self-maintenance operation during a given state of possession of the token.
  • Each of the first compute node and the second compute node may be configured to process a message corresponding to the second value in parallel and independent from one another.
  • the second compute node may be configured to process the message corresponding to the second value.
  • the compute nodes may also circulate the at least one token among the first, second and third compute nodes.
  • Each of the plurality of compute nodes may be configured to refrain from performing the self-maintenance operation when not in the given state of possession of the at least one token.
  • Each of the compute nodes may be configured to receive the at least one token from a preceding one of the plurality of compute nodes, perform the self-maintenance operation selectively based on a state of the compute node, and then forward the at least one token to a subsequent one of the plurality of compute nodes.
  • the messages may be associated with transactions of financial instruments, and the first, second, third and fourth values may each correspond to a respective financial instrument or a transaction type.
  • a plurality of values may include the first, second, third and fourth values, and each of the plurality of values may be assigned to a subset of at least two of the plurality of compute nodes.
  • at least one of the plurality of compute nodes may be reconfigured to process messages corresponding to at least one of the first value and the second value.
  • Further embodiments include a method of processing messages.
  • messages may be selectively processed based on a value of a common parameter of the messages.
  • At least one token may be circulated between or among at least two of the plurality of respective compute nodes.
  • a self-maintenance operation may be performed during a given state of possession of the token.
  • Still further embodiments include a method of processing messages at a first compute node.
  • a first message may be parsed to determine a value of a common parameter of the message.
  • the message may be processed selectively based on whether the value corresponds to an assigned common parameter associated with the compute node.
  • a token may be received from a second compute node.
  • a self-maintenance operation may then be performed during a given state of possession of the token.
  • the token may then be sent to the second compute node or a third compute node.
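The per-node method just described (parse the common parameter, process only assigned values, and perform maintenance only while holding a token) might be sketched as follows; the class, field, and callback names are hypothetical, not taken from the patent.

```python
# Sketch of the per-node message-processing and token-handling method.
class ComputeNode:
    def __init__(self, name, assigned_values, send_token):
        self.name = name
        self.assigned_values = set(assigned_values)  # values of the common parameter
        self.send_token = send_token                 # callable forwarding the token

    def on_message(self, message):
        value = message["symbol"]             # parse the common parameter
        if value in self.assigned_values:     # process only assigned values
            return self.process(message)
        return None                           # refrain: another subset owns this value

    def on_token(self, token):
        self.maybe_self_maintain()            # allowed only while holding the token
        self.send_token(token)                # then pass the token onward

    def maybe_self_maintain(self):
        pass                                  # e.g. clear caches, compact memory

    def process(self, message):
        return {"ack": message["order_token"]}

node = ComputeNode("node-1", {"IBM", "AAPL"}, send_token=lambda t: None)
print(node.on_message({"symbol": "IBM", "order_token": "tok-1"}))   # {'ack': 'tok-1'}
print(node.on_message({"symbol": "MSFT", "order_token": "tok-2"}))  # None
```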
  • Fig. 1A is a block diagram of an electronic trading system in which example embodiments may be implemented.
  • Fig. 1B is a block diagram of an example embodiment of an electronic trading system.
  • Fig. 1C is a table of an example embodiment of fields of a message format for trading messages.
  • Fig. 2 is a block diagram of an example embodiment of a mesh node in a point-to-point mesh architecture of an electronic trading system.
  • Figs. 3A-E are block diagrams illustrating a distributed system comprising a plurality of compute nodes in one embodiment.
  • Figs. 4A-B are block diagrams illustrating plural subsets of compute nodes in further embodiments.
  • Fig. 5 is a flow diagram illustrating operation of a compute node in one embodiment.
  • Fig. 6 is a block diagram illustrating an arrangement of compute nodes in a further embodiment.
  • Figs. 7A-D are block diagrams illustrating a distributed system comprising a plurality of compute nodes in a further embodiment.
  • Example embodiments disclosed herein relate to a high-speed electronic trading system that provides a market where orders to buy and sell financial instruments (such as stocks, bonds, commodities, futures, options, and the like) are traded among market participants (such as traders and brokers).
  • the electronic trading system exhibits low latency, fairness, fault tolerance, and other features more fully described below.
  • the electronic trading system is primarily responsible for “matching” trade orders to one another.
  • an offer to “buy” an instrument is matched to a corresponding counteroffer to “sell”.
  • the matched offer and counteroffer should at least partially satisfy the desired price, with any residual unsatisfied quantity passed to another suitable counterorder. Matched orders are then paired and the trade is executed.
  • Any wholly unsatisfied or partially satisfied orders are maintained in a data structure referred to as an “order book”.
  • the retained information regarding unmatched trade orders can be used by the matching engine to satisfy subsequent trade orders.
  • An order book is typically maintained for each instrument and generally defines or otherwise represents the state of the market for that particular product. It may include, for example, the recent prices and quantities at which market participants have expressed a willingness to buy or sell.
  • the results of matching may also be made visible to market participants via streaming data services referred to as market data feeds.
  • a market data feed typically includes individual messages that carry the pricing for each traded instrument, and related information such as volume and other statistics.
  • Fig. 1A illustrates an example electronic trading system 100 that includes a number of gateways 120-1, 120-2, ..., 120-g (collectively referred to as gateways 120), a set of core compute nodes 140-1, 140-2, ..., 140-c (collectively, the core compute nodes 140 or compute nodes 140), and one or more sequencers 150-1, 150-2, ..., 150-s (collectively, the sequencers 150).
  • the gateways 120, core compute nodes 140, and sequencers 150 are thus considered to be nodes in electronic trading system 100.
  • the gateways 120, compute nodes 140 and sequencers 150 are directly connected to one another, preferably via low latency, dedicated connections 180.
  • gateways 120-2, ..., 120-g are the peers for gateway 120-1; core compute nodes 140-2, ..., 140-c are the peers for core compute node 140-1; and sequencers 150-2, ..., 150-s are the peers for sequencer 150-1.
  • The terms “active” and “standby,” in relation to the discussion of the system 100, may refer to a high availability (HA) role/state/mode of a system/component.
  • a standby system/component is a redundant (backup) system/component that is powered on and ready to take over function(s) performed by an active system/component.
  • A switchover/failover, that is, a transition from the standby role/state/mode to the active role/state/mode, may be performed automatically in response to failure of the currently active system/component, for non-limiting example.
  • the electronic trading system 100 processes trade orders from and provides related information to one or more participant computing devices 130-1, 130-2, . . ., 130-p (collectively, the participant devices 130).
  • Participant devices 130 interact with the system 100, and may be one or more personal computers, tablets, smartphones, servers, or other data processing devices configured to display and receive trade order information.
  • the participant devices 130 may be operated by a human via a graphical user interface (GUI), or they may be operated via high-speed automated trading methods running on some physical or virtual data processing platform.
  • Each participant device 130 may exchange messages with (that is, send messages to and receive messages from) the electronic trading system 100 via connections established with a gateway 120. While Fig. 1A illustrates each participant device 130 as being connected to electronic trading system 100 via a single connection to a gateway 120, it should be understood that a participant device 130 may be connected to electronic trading system 100 over multiple connections to one or more gateway devices 120.
  • While each gateway, such as gateway 120-1, may serve a single participant device 130, it typically serves multiple participant devices 130.
  • the compute nodes 140-1, 140-2, . . ., 140-c (also referred to herein as matching engines 140 or compute engines 140) provide the matching functions described above and may also generate outgoing messages to be delivered to one or more participant devices 130.
  • Each compute node 140 is a high-performance data processor and typically maintains one or more data structures to search and maintain one or more order books 145-1, 145-2, . . ., 145-b.
  • An order book 145-1 may be maintained, for example, for each instrument for which the core compute node 140-1 is responsible.
  • One or more of the compute nodes 140 and/or one or more of the gateways 120 may also provide market data feeds 147. Market data feeds 147 may be broadcast (for example, multicast), to subscribers, which may be participant devices 130 or any other suitable computing devices.
  • Some outgoing messages generated by core compute nodes 140 may be synchronous, that is, generated directly by a core compute node 140 in response to one or more incoming messages received from one or more participant devices 130, such as an outgoing “acknowledgement message” or “execution message” in response to a corresponding incoming “new order” message. In some embodiments, however, at least some outgoing messages may be asynchronous, initiated by the trading system 100, for example, certain “unsolicited” cancel messages and “trade break” or “trade bust” messages.
  • Distributed computing environments, such as the electronic trading system 100, can be configured with multiple matching engines operating in parallel on multiple compute nodes 140.
  • sequencers 150 ensure that the proper sequence of any order-dependent operations is maintained. To ensure that operations on incoming messages are not performed out of order, incoming messages received at one or more gateways 120, for example, a new trade order message from one of participant devices 130, typically may then pass through at least one sequencer 150 (e.g., a single currently active sequencer, and possibly one or more standby sequencers) in which they are marked with a sequence identifier (by the single currently active sequencer, if multiple sequencers are present). That identifier may be a unique, monotonically increasing value which is used in the course of subsequent processing throughout the distributed system 100 (e.g., electronic trading system 100), to determine the relative ordering among messages and to uniquely identify messages throughout electronic trading system 100.
  • the sequence identifier may be indicative of the order (i.e., sequence) in which a message arrived at the sequencer.
  • the sequence identifier may be a value that is monotonically incremented or decremented according to a fixed interval by the sequencer for each arriving message; for example, the sequence identifier may be incremented by one for each arriving message. It should be understood, however, that, while unique, the sequence identifier is not limited to a monotonically increasing or decreasing value.
  • the original, unmarked, messages and the sequence-marked messages may be essentially identical, except for the sequence identifier value included in the marked versions of the messages.
  • The marked incoming messages, that is, the sequence-marked messages, are typically then forwarded by sequencer(s) 150 to other downstream compute nodes 140 to perform potentially order-dependent processing on the messages.
  • sequencer(s) 150 may also determine a relative ordering of each marked message among other marked messages in the electronic trading system 100.
  • the unique sequence identifier disclosed herein may be used for ensuring deterministic order (i.e., sequence) for electronic-trade message processing.
  • the unique sequence identifier represents a unique, deterministic ordering (i.e., sequence) directive for processing of a given respective electronic trade message relative to other trade messages within an electronic trading system.
  • the sequence identifier may be populated in a sequence ID field 110-14 of a message, as disclosed further below with regard to FIG. 1C for non-limiting example.
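A minimal sketch of the sequence-marking behavior, assuming a simple monotonically increasing counter and dictionary-shaped messages (both assumptions, not the patent's wire format):

```python
# Minimal sketch of a sequencer that marks messages with a monotonically
# increasing sequence identifier; field names are illustrative.
import itertools

class Sequencer:
    def __init__(self):
        self._next_id = itertools.count(1)   # 1, 2, 3, ... per arriving message

    def mark(self, message):
        """Return a sequence-marked copy; the original stays unmarked."""
        marked = dict(message)
        marked["sequence_id"] = next(self._next_id)
        return marked

seq = Sequencer()
m1 = seq.mark({"type": "new_order", "symbol": "IBM"})
m2 = seq.mark({"type": "cancel_order", "symbol": "IBM"})
assert m1["sequence_id"] < m2["sequence_id"]   # deterministic relative order
```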
  • messages may also flow in the other direction, that is, from a core compute node 140 to one or more of the participant devices 130, passing through one or more of the gateways 120.
  • Such outgoing messages generated by a core compute node 140 may also be order-dependent (i.e., sequence-order dependent), and accordingly may also typically first pass through a sequencer 150 to be marked with a sequence identifier. The sequencer 150 may then forward the marked response message to the gateways 120 in order to pass on to participant devices 130 in a properly deterministic order.
  • Use of the sequencer 150 to generate unique sequence numbers and mark messages or representations thereof with same, that is, to generate sequence-marked messages, ensures that the correct ordering of operations is maintained throughout the distributed system, that is, the electronic trading system 100, regardless of which compute node or set of compute nodes 140 processes the messages.
  • This approach provides “state determinism,” for example, an overall state of the system is deterministic and reproducible (possibly somewhere else, such as at a disaster recovery site), to provide fault-tolerance, high availability and disaster recoverability.
  • It may be useful for a generating node (i.e., a node introducing a new message into the electronic trading system 100, for example by generating a new message and/or by forwarding a message received from a participant device 130) and its peer nodes to receive the sequence number assigned to that message. Receiving the sequence number for a message it generated may be useful to the generating node and its peer nodes not only for processing messages in order, according to their sequence numbers, but also to correlate the message generated by the node with the message’s sequence identifier that is used throughout the rest of the electronic trading system 100.
  • Such a correlation between an unmarked version of a message as introduced by a generating node into the electronic trading system and the sequence marked version of the same message outputted by the sequencer may be made via identifying information in both versions of the message, as discussed further below in connection with Fig. 1C.
  • a subsequent message generated within the electronic trading system 100 while also being assigned its own sequence number, may yet reference one or more sequence numbers of related preceding messages. Accordingly, a node may need to quickly reference (by sequence number) a message the node had itself previously generated, because, for example, the sequence number of the message the node had generated was referenced in a subsequent message.
  • the generating node may first send a message to the sequencer 150 and wait to receive the sequence number for the message from the sequencer before the generating node forwards the message to other nodes in electronic trading system 100.
  • sequencer 150 may not only send a sequenced version of the message (e.g., a sequence-marked message) to destination nodes, but may also send substantially simultaneously a sequenced version of the message back to the sending node and its peers. For example, after assigning a sequence number to an incoming message sent from the gateway 120-1 to core compute nodes 140, the sequencer 150 may not only forward the sequenced version of the message to the core compute nodes 140, but may also send a sequenced version of that message back to the gateway 120-1 and the other gateways 120. Accordingly, if any subsequent message generated in a core compute node 140 references that sequence number, any gateway 120 may easily identify the associated message originally generated by gateway 120-1 by its sequence number.
  • a sequenced version of an outgoing message generated by and sent from a core compute node 140 to gateways 120, and sequenced by sequencer 150 may be forwarded by sequencer 150 both to gateways 120 and back to core compute nodes 140.
  • Some embodiments may include multiple sequencers 150 for high availability, for example, to ensure that another sequencer is available if the first sequencer fails.
  • the currently active sequencer 150-1 may maintain a system state log (not shown) of all the messages that passed through sequencer 150-1, as well as the messages’ associated sequence numbers.
  • This system state log may be continuously or periodically transmitted to the standby sequencers to provide them with requisite system state to allow them to take over as an active sequencer, if necessary.
  • the system state log may be stored in a data store that is accessible to the multiple sequencers 150.
  • the system state log may also be continually or periodically replicated to one or more sequencers in a standby replica electronic trading system (not shown in detail) at a disaster recovery site 155, thereby allowing electronic trading to continue with the exact same state at the disaster recovery site 155, should the primary site of system 100 suffer catastrophic failure.
  • a currently active sequencer of a plurality of sequencers may store the system state log in a data store (not shown).
  • the data store may be accessible to the plurality of sequencers via a shared sequencer network, such as the sequencer-wide shared network 182-s disclosed further below with regard to Fig. 1A.
  • A sequencer taking over as the active sequencer may retrieve the system state log from the data store to synchronize its state with that of the former active sequencer.
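A toy sketch of the system state log and standby synchronization described above; the log structure and replication API are simplifying assumptions, not the patent's mechanism.

```python
# Sketch of the active sequencer's state log and standby takeover.
class SequencerStateLog:
    def __init__(self):
        self.entries = []                # (sequence_id, message) pairs

    def append(self, sequence_id, message):
        self.entries.append((sequence_id, message))

    def replicate_to(self, standby_log):
        # Continuous or periodic copy to a standby (or disaster-recovery) replica.
        standby_log.entries = list(self.entries)

active_log, standby_log = SequencerStateLog(), SequencerStateLog()
active_log.append(1, {"type": "new_order"})
active_log.replicate_to(standby_log)
# On failover, the standby resumes numbering after the last logged entry.
next_id = (standby_log.entries[-1][0] + 1) if standby_log.entries else 1
```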
  • the system state log may also be provided to a drop copy service 152, which may be implemented by one or more of the sequencers, and/or by one or more other nodes in the electronic trading system 100.
  • the drop copy service 152 may provide a record of daily trading activity through electronic trading system 100 that may be delivered to regulatory authorities and/or clients, who may, for example be connected via participant devices 130.
  • the drop copy service 152 may be implemented on one or more of the gateways 120.
  • the drop copy service 152 may provide the record of trading activity based on the contents of incoming and outgoing messages sent throughout electronic trading system 100.
  • a gateway 120 implementing the drop copy service 152 may receive from the sequencer 150 (and/or from core compute nodes 140 and other gateways 120) all messages exchanged throughout the electronic trading system 100.
  • a participant device 130 configured to receive the record of daily trading activity from the drop copy service 152 may not necessarily also be sending trade orders to and utilizing a matching function of electronic trading system 100.
  • Messages exchanged between participant devices 130 and gateways 120 may be according to any suitable protocol that may be used for financial trading (referred to for convenience as, “financial trading protocol”).
  • the messages may be exchanged according to custom protocols or established standard protocols, including both binary protocols (such as Nasdaq OUCH and NYSE UTP), and text-based protocols (such as NYSE FIX CCG).
  • the electronic trading system 100 may support exchanging messages simultaneously according to multiple financial trading protocols, including multiple protocols simultaneously on the same gateway 120.
  • participant devices 130-1, 130-2, and 130-3 may simultaneously have established trading connections and may be exchanging messages with gateway 120-1 according to Nasdaq OUCH, NYSE UTP, and NYSE FIX CCG, respectively.
  • the gateways 120 may translate messages according to a financial trading protocol received from a participant device 130 into a normalized (e.g., standardized) message format used for exchanging messages among nodes within the electronic trading system 100.
  • the normalized trading format may be an existing protocol or may generally be of a different size and data format than that of any financial trading protocol used to exchange messages with participant devices 130.
  • the normalized trading format when compared to a financial trading protocol of the original incoming message received at the gateway 120 from a participant device 130, may include in some cases one or more additional fields or parameters, may omit one or more fields or parameters, and/or each field or parameter of a message in the normalized format may be of a different data type or size than the corresponding message received at gateway 120 from the participant device 130.
  • gateways 120 may translate outgoing messages generated in the normalized format by electronic trading system 100 into messages in the format of one or more financial trading protocols used by participant devices 130 to communicate with gateways 120.
  • Incoming/outgoing messages (e.g., the incoming message 103 and outgoing message 105) are communicated between the gateway 120-1 and a participant device 130.
  • Fig. 1B is a block diagram of an example embodiment of the electronic trading system 100 of Fig. 1A, disclosed above.
  • the electronic trading system 100 comprises the gateway 120-1 coupled to the core compute node 140-1 via an activation link 180-1-1 and an ordering (i.e., sequencing) path 117.
  • the electronic trading system 100 further comprises the sequencer 150-1 electronically disposed within the ordering path 117.
  • the gateway 120-1 is configured to transmit a message (not shown) to the core compute node 140-1 via the activation link 180-1-1 and the ordering path 117, in response to reception of the incoming message 103.
  • the core compute node 140-1 is configured to receive the message (also referred to as an unsequenced message) from the gateway 120-1 and a sequence-marked version (not shown) of the message from the sequencer 150-1.
  • the sequence-marked version includes a sequence identifier (ID), such as may be included in a sequence ID field 110-14 of the sequence-marked message, as disclosed further below with regard to Fig. 1C for non-limiting example.
  • The sequence ID indicates a deterministic position of the sequence-marked version of the message among a plurality of sequence-marked versions of other messages, the other messages having been communicated via the activation link 180-1-1 and received by the sequencer 150-1 via the ordering path 117.
  • the plurality of messages among which the sequence ID indicates a deterministic position also includes the other sequenced-marked versions of messages received by the core compute node 140-1 via the ordering path 117.
  • The message (e.g., unsequenced message) and its sequence-marked version include common metadata (not shown). By correlating the message with its sequence-marked version via the common metadata, the sequence ID of the message is identified.
  • the sequence ID further indicates a deterministic position of the message among all messages communicated throughout the electronic trading system 100 that pass through the sequencer 150-1 and are, thus, sequence-marked by the sequencer 150-1.
  • The sequence ID determined by the sequencer 150-1 determines the position (order/priority) of the messages communicated in the electronic trading system 100. It is possible that multiple systems may timestamp messages with a same timestamp and, thus, order/priority for such messages would need to be resolved at a receiver of same. Such is not the case in the electronic trading system 100, as the sequencer 150-1 may be the sole determiner of order/priority of messages communicated throughout the electronic trading system 100.
  • the core compute node 140-1 may be configured to (i) commence a matching function activity for an electronic trade responsive to receipt of the message via the activation link 180-1-1, and (ii) responsive to receipt of the sequence-marked version via the ordering path 117, use the sequence identifier to prioritize completion of the matching function activity toward servicing the electronic trade.
  • While the core compute node 140-1 may commence the electronic trading function, that is, the matching function activity, upon receipt of the message (i.e., unsequenced message), thereby starting the processing of the unsequenced message, the core compute node 140-1 may not complete the processing and/or commit the results of the processing of the message until the core compute node 140-1 receives the sequence-marked message.
  • Otherwise, the processing of messages by the compute node 140-1 could be unpredictable.
  • there could be multiple outstanding unsequenced messages each of which represents a potential match for the contra side in the exchange of a financial security. It is useful for there to be a deterministic way of arbitrating among the multiple potential matches because, perhaps, only a subset among the potential matches may be able to be filled against a given trade order on the contra side.
  • the compute node 140-1 may correlate the unsequenced message with the sequence-marked message via identifying information in both versions of the message, as discussed below in connection with Fig. 1C. Once the compute node 140-1 has received the sequence-marked message via the ordering path 117, the compute node 140-1 may then determine the proper sequence in which the message (or sequence-marked version of the message) should be processed relative to the other messages throughout electronic trading system 100. The compute node 140-1 may then complete the message processing, including sending out an appropriate response message, possibly referencing the sequence identifier assigned by the sequencer 150-1 and included in the sequence-marked message.
  • the compute node 140-1 may determine precisely the sequence in which the possible match(es) are to occur and complete the electronic trading matching function.
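The two-phase pattern described above (commence processing on the unsequenced copy, then complete and commit on the sequence-marked copy) might be sketched as follows; the correlation key and message shape are assumptions, not the patent's format.

```python
# Sketch of "commence on the unsequenced copy, commit on the sequenced copy."
pending = {}   # correlation key -> provisional preprocessing result

def correlation_key(message):
    # Both versions of a message carry common identifying metadata (Fig. 1C).
    return (message["node_id"], message["order_token"])

def preprocess(message):
    # e.g. load the relevant order-book section, form a provisional fill.
    return {"provisional_fill": message["order_token"]}

def commit(result, sequence_id):
    print(f"committed {result} at position {sequence_id}")

def on_unsequenced(message):
    # Start matching work early, but hold the result back from the order book.
    pending[correlation_key(message)] = preprocess(message)

def on_sequence_marked(message):
    # The sequenced copy fixes the message's deterministic position; commit now.
    result = pending.pop(correlation_key(message))
    commit(result, message["sequence_id"])

msg = {"node_id": "gw-1", "order_token": "tok-7"}
on_unsequenced(msg)
on_sequence_marked({**msg, "sequence_id": 42})
```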
  • the sequencer 150-1 may further transmit the sequence-marked message via the second direct connection 180-gw1-s1 (of the ordering path 117) to the gateway 120-1.
  • This enables the sender, that is, the gateway 120-1, to correlate the sequence number (assigned to the message) with other identifying information in the message (as discussed below in connection with Fig. 1C) so that the sender can easily deal with subsequent messages that reference that sequence number.
  • the gateway 120-1 may, upon receipt of the unsequenced response message received from the compute node 140-1 via the activation link 180-1-1, activate processing of such response message, even before the gateway 120-1 receives the sequence-marked version of the response message.
  • activating the processing could include updating the state of an open trade order database on the gateway 120-1 and/or building up the outgoing message 105 ready to be sent to the participant device 130.
  • the gateway 120-1 may not complete the processing of the response message, such processing including transmitting the outgoing message 105 to the participant device 130, until the gateway 120-1 has received the sequence-marked response message, which contains a sequence identifier specifying a deterministic position of the response message in a sequence of messages including the other messages in electronic trading system 100.
  • the gateway 120-1 may correlate the unsequenced response message with the sequence-marked response message via identifying information in both versions of the response message, as discussed below in connection with Fig. 1C. The deterministic position of the response message is thereby determined upon receipt of the sequence-marked response message.
  • the processing of the response message may then be completed, such processing including committing the outgoing message 105 to be transmitted to the participant device, such as the participant device 130 of Fig. 1 A.
  • the message transmitted via the activation path 180-1-1 and sequence-marked version of the message transmitted via the ordering path 117 may include common metadata.
  • the core compute node 140-1 may be further configured to correlate the message with the sequence-marked version based on the common metadata, responsive to receipt of the sequence-marked version via the ordering path 117.
  • the message is transmitted to the core compute node 140-1 via the activation link 180-1-1 in an activation link forward direction, that is, the act-link-fwd-dir 113a, and to the core compute node 140-1 via the ordering path 117 in an ordering path forward direction, that is the order-path-fwd-dir 115a.
  • the core compute node 140-1 may transmit a response (not shown) to the gateway 120-1 via the activation link 180-1-1 and the ordering path 117 in an activation link reverse direction (i.e., the act-link-rev-dir 113b) and an ordering path reverse direction (i.e., order-path-rev-dir 115b).
  • the activation link 180-1-1 is a single direct connection while the ordering path 117 includes multiple direct connections.
  • the ordering path 117 in the example embodiment includes both the direct connection 180-gw1-s1 and direct connection 180-c1-s1.
  • the gateway 120-1, sequencer 150-1, and core compute node 140-1 are arranged in a point-to-point mesh topology, referred to as a point-to-point mesh system 102.
  • the core compute node 140-1 may be configured to perform a matching function (i.e., an electronic trading matching function) toward servicing trade requests received from participant devices 130 and introduced into the point-to-point mesh topology via the gateway 120-1.
  • the point-to-point mesh system 102 includes a first direct connection (i.e., 180-1-1), second direct connection (i.e., 180-gw1-s1), and third direct connection (i.e., 180-c1-s1).
  • the sequencer 150-1 may be configured to (i) determine a deterministic order (i.e., sequence) for messages communicated between the gateway 120-1 and core compute node 140-1 via the first direct connection and received by the sequencer 150-1 from the gateway 120-1 or core compute node 140-1 via the second or third direct connection, respectively.
  • the sequencer 150-1 may be further configured to (ii) convey position of the messages within the deterministic order by transmitting sequence-marked versions of the messages to the gateway 120-1 and core compute node 140-1 via the second and third direct connections, respectively.
  • the messages represent the trade requests or responses thereto, such as disclosed herein.
  • a message format for such messages is disclosed further below with regard to Fig. 1C.
  • the amount of preprocessing that may be done for an unsequenced message, and whether or not the results of that preprocessing may need to be discarded or rolled back, may depend on fields in the message, such as the message type field 110-1, symbol field 110-2, side field 110-3, or price field 110-4, according to the embodiment of Fig. 1C, disclosed further below.
  • the amount may also depend on whether other unsequenced messages are currently outstanding (that is, for which the corresponding sequence-marked message has not yet been received) that reference the same value for a common parameter in the message, such as the same stock symbol.
  • the core compute node 140-1 may load the symbol information relating to the relevant section of the order book into a fast memory. If the new order would be a match for an open order in the order book, the compute node 140-1 may start to generate a “fill” message, accordingly, but hold off on committing an order book update and on sending the “fill” message out until it receives the sequence-marked version of that message.
  • When multiple unsequenced messages are outstanding, the core compute node 140-1 may perform its preprocessing differently.
  • For example, the core compute node 140-1 may generate competing potential “fill” messages, one for each of the two outstanding unsequenced “new order” messages that could serve as a match for the open order. Based on the sequenced version of the messages, one of the potential “fill” messages may be discarded, while the other would be committed to the order book and sent out to the gateways 120.
  • the compute node 140-1 may not perform any preprocessing that may need to be discarded or rolled back (e.g., may not create any potential “fill” messages), or it may abort or pause any such preprocessing for those outstanding unsequenced messages.
  • an outstanding unsequenced “new order” message that is a potential match for an open order in the order book could be competing with an outstanding unsequenced “replace order” message or “cancel order” message attempting to replace or cancel, respectively, the same open order in the order book that would serve as a potential match to the “new order” message.
  • the end result could either culminate in a match between the open order in the order book and the “new order” message, or it could instead culminate in that open order being canceled or replaced by a new order with a different price or quantity.
  • the compute node 140-1 cannot determine which of these two outcomes should result.
  • the compute node 140-1 may perform preprocessing in different ways. In some embodiments, when there are multiple competing outstanding unsequenced messages, the compute node 140-1 may simply perform preprocessing that would not need to be rolled back or discarded, such as loading into faster memory a relevant section of the order book relating to a symbol referenced in both competing messages. In other embodiments, the compute node 140-1 may perform additional preprocessing, such as forming up one or more provisional potential responses, each corresponding to one of the multiple competing scenarios.
  • the compute node 140-1 may create a potential “fill” message and/or a potential “replace acknowledgement” message or “cancel acknowledgement” message, and possibly also make provisional updates to the order book corresponding to one or more of the multiple possible outcomes. While in some embodiments, the compute node 140-1 may perform this additional preprocessing for all such competing scenarios, in other embodiments, the compute node 140-1 may only perform additional preprocessing on one of, or a subset of, the competing scenarios. For example, the compute node 140-1 may perform the additional preprocessing on an outstanding unsequenced message only if there are no other outstanding competing unsequenced messages.
  • the compute node 140-1 may prioritize the performing of additional preprocessing for outstanding competing unsequenced messages according to the amount of time and/or complexity involved in rolling back or discarding the results of the preprocessing. Upon receiving the sequence-marked versions of the outstanding unsequenced messages, the compute node 140-1 may then determine the sequence (as assigned by the sequencer 150-1) in which the outstanding unsequenced messages should be processed, and complete the processing of the messages in that sequence, which may in some embodiments include rolling back or discarding one or more results of the preprocessing.
  • the compute node 140-1 may additionally or alternatively perform preprocessing related to validation of the message to determine whether to accept or reject the message.
  • the preprocessing could include performing real-time risk checks on the message, such as checking that the price or quantity specified in the message does not exceed a maximum value (i.e., “max price check” or “max quantity check”), that the symbol in the message is a known symbol (i.e., “unknown symbol check”), that trading is currently permitted on that symbol (i.e., “symbol halt check”), or that the price is specified properly according to a correct number of decimal places (i.e., “sub penny check”).
  • the type of preprocessing could also include a “self trade prevention” validation check, to prevent a particular potential match from resulting in a self-trade in which a trading client matches against itself, if “self trade prevention” is enabled for the particular client or trade order. If a trade order fails one or more of these validation checks, the electronic trading system 100 may respond with an appropriate reject message. It should be understood that, even though these validation checks are described in the embodiments above as being performed by the compute node 140-1, at least some of these types of validation checks could in some embodiments be performed alternatively or additionally by a gateway 120 or other nodes in the electronic trading system 100.
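For illustration, the named validation checks might look like the sketch below; the limits, symbol sets, and the float-based price check are made-up simplifications (a production system would likely use decimal arithmetic).

```python
# Sketch of the pre-trade validation checks named above; all values are examples.
MAX_PRICE, MAX_QTY = 200_000.00, 1_000_000
KNOWN_SYMBOLS, HALTED_SYMBOLS = {"IBM", "AAPL"}, {"AAPL"}

def validate(order):
    if order["price"] > MAX_PRICE:
        return "reject: max price check"
    if order["quantity"] > MAX_QTY:
        return "reject: max quantity check"
    if order["symbol"] not in KNOWN_SYMBOLS:
        return "reject: unknown symbol check"
    if order["symbol"] in HALTED_SYMBOLS:
        return "reject: symbol halt check"
    if round(order["price"], 2) != order["price"]:
        return "reject: sub penny check"
    return "accept"

print(validate({"symbol": "IBM", "price": 135.501, "quantity": 100}))
# -> reject: sub penny check
```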
  • It may be beneficial or required for the gateway 120-1 to be informed of the unique system-wide sequence identifier associated with a message that originated from a client. This information may enable the gateway 120-1 to match up the original incoming message to the unique sequence number, which is used to ensure proper ordering of messages throughout the electronic trading system 100. Such a configuration at the gateway(s) may be required for the electronic trading system 100 to achieve state determinism and to provide fault-tolerance, high availability, and disaster recoverability with respect to the activity in the gateways.
  • One solution for configuring the gateway 120-1 to maintain information on the sequence identifier associated with an incoming message is for the gateway 120-1 to wait for a response back from the sequencer 150-1 with the sequence identifier before forwarding the message to the compute node 140-1.
  • Such an approach may add latency to the processing of messages.
  • the sequencer 150-1 may also send, in parallel, the sequence-marked message to the gateway 120-1.
  • the gateway 120-1 may maintain information on the sequence identifier while minimizing latency at the electronic trading system 100.
  • Fig. 1C is a table of an example embodiment of fields of a message format 110 for trading messages, such as trading messages exchanged among nodes in the electronic trading system 100 disclosed above.
  • the message format 110 is a normalized message format, intended to be used for an internal (that is, within the electronic trading system 100) representation of trading messages when they are exchanged among nodes within electronic trading system 100.
  • gateways 120 exchange messages between the participants 130 and electronic trading system 100, and translate such messages between format(s) specified by one or more financial trading protocols used by the participants 130 and the normalized trading format used among nodes in the electronic trading system 100.
  • The fields 110-1 through 110-17 are for non-limiting example; the message format 110 may include more, fewer, or different fields, and the order of such fields is not limited to that shown in Fig. 1C.
  • Although the fields in the message format 110 are shown in this example in a single message format, they may be distributed across multiple message formats, or encapsulated in layered protocols. For example, in other embodiments, a subset of fields in the message format 110 may be included as part of a header, trailer, or extension field(s) in a layered protocol that encapsulates other fields of the message format 110 in a message payload.
  • the message format 110 may define one or more fields of data encapsulated in a payload (data) section of another message format, including without limitation a respective payload section of an IP datagram, a UDP datagram, a TCP packet, or of a message data frame format, such as an Ethernet data frame format or other data frame format, including InfiniBand, Universal Serial Bus (USB), PCI Express (PCI-e), and High-Definition Multimedia Interface (HDMI), for non-limiting example.
  • the message format 110 includes fields 110-1... 110-6 which correspond to information that may be included in messages sent or received according to a financial trading protocol for communication with one or more participant devices 130.
  • the message type field 110-1 indicates a trading message type.
  • Some trading message types (such as, message types “new order,” “replace order,” or “cancel order”) correspond to messages received from participant devices 130, while other message types (such as, “new order acknowledgement,” “replace order acknowledgement,” “cancel order acknowledgement,” “fill,” “execution report,” “unsolicited cancel,” “trade bust,” or various reject messages) correspond to messages that are generated by the electronic system 100 and are included in trading messages sent to the participant devices 130.
  • the message format 110 also includes a symbol field 110-2, which includes an identifier for a traded financial security, such as a stock symbol or stock ticker. For example, “IBM” is the stock symbol for “International Business Machines Corporation.”
  • the side field 110-3 in the message format 110 may be used to indicate the “side” of the trading message, such as whether the trading message is a “buy,” “sell,” or a “sell short.”
  • the price field 110- 4 may be used to indicate a desired price to buy or sell the security
  • the quantity field 110-5 may be used to indicate a desired quantity of the security (e.g., number of shares).
  • the message format 110 may also include the order token field 110-6, which may be populated with an “order token” or “client order ID” initially provided by a participant device 130 to uniquely identify a new order in the context of a particular trading session (i.e., “connection” or “flow”) established between the participant device 130 and the electronic trading system via a gateway 120.
  • Fields 110-1 ... 110-6 are representative fields that are usually included for most message types according to most financial trading protocols, but the message format 110 may well include additional or alternate fields, especially for supporting particular message types or particular financial trading protocols.
  • “replace order” and “cancel order” message types require the participant 130 to supply an additional order token to represent the replaced or canceled order, to distinguish it from the original order.
  • a “replace order” and a “cancel order” typically may also include a replaced/canceled quantity field
  • a “replace order” may include a replace price field.
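For concreteness, fields 110-1 through 110-6 could be rendered as a simple record type; the Python types, names, and example values below are assumptions, not the patent's encoding.

```python
# Illustrative rendering of fields 110-1 ... 110-6 of the normalized
# message format as a dataclass.
from dataclasses import dataclass

@dataclass
class TradingMessage:
    message_type: str   # 110-1: e.g. "new_order", "cancel_order", "fill"
    symbol: str         # 110-2: traded security identifier, e.g. "IBM"
    side: str           # 110-3: "buy", "sell", or "sell_short"
    price: float        # 110-4: desired price
    quantity: int       # 110-5: desired quantity (e.g. number of shares)
    order_token: str    # 110-6: client-supplied ID, unique per flow

msg = TradingMessage("new_order", "IBM", "buy", 135.50, 100, "tok-0001")
```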
  • the message format 110 includes fields 110-11 . . . 110-17 that may be used internally within electronic trading system 100, and do not necessarily correspond to fields in messages exchanged with participant devices 130.
  • a node identifier field 110-11 may uniquely identify each node in electronic trading system 100.
  • a generating node may include its node identifier in messages it introduces into the electronic trading system 100.
  • each gateway 120 may include its node identifier in messages it forwards from participant devices 130 to compute nodes 140 and/or sequencers 150.
  • each compute node 140 may include its node identifier in messages it generates (for example, acknowledgements, executions, or types of asynchronous messages intended ultimately to be forwarded to one or more participant devices 130) to be sent to other nodes in the electronic trading system 100.
  • each message introduced into the electronic trading system 100 may be associated with the message’s generating node.
  • the message format 110 may also include a flow identifier field 110-12.
  • each trading session (i.e., “connection” or “flow”) may be identified with a flow identifier that is intended to be unique throughout the electronic trading system 100.
  • a participant device 130 may be connected to the electronic trading system 100 over one or more flows, and via one or more of the gateways 120.
  • the versions, according to the normalized message format 110 (used among nodes in the electronic trading system 100), of all messages exchanged between a participant device 130 and the electronic trading system 100 over a particular flow would include a unique identifier for that flow in the flow identifier field 110-12.
  • the flow identifier field 110-12 is populated by a message’s generating node.
  • a gateway 120 may populate the flow identifier field 110-12 with the identifier of the flow associated with a message it receives from a participant 130 that the gateway 120 introduces into electronic trading system 100.
  • a core compute node 140 may populate the flow identifier field 110-12 with the flow identifier associated with messages it generates (i.e., response messages, such as acknowledgement messages or fills, or other outgoing messages including asynchronous messages).
  • the flow identifier field 110-12 contains a value that uniquely identifies a logical flow, which actually could be implemented for purposes of high availability as multiple redundant trading session connections, possibly over multiple gateways. That is, in some embodiments, the same flow ID may be assigned to two or more redundant flows between participant device(s) 130 and gateway(s) 120. In such embodiments, the redundant flows may be either in an active/standby configuration or an active/active configuration. In an active/active configuration, functionally equivalent messages may be exchanged between participant device(s) 130 and gateway(s) 120 simultaneously over multiple redundant flows in parallel.
  • a trading client may send in parallel over the multiple redundant flows functionally equivalent messages simultaneously to the electronic trading system 100, and receive in parallel over the multiple redundant flows multiple functionally equivalent responses from the electronic trading system 100, although the electronic trading system 100 may only take action on a single such functionally equivalent message.
  • a single flow at a time among the multiple redundant flows may be designated as an active flow, whereas the other flow(s) among the multiple redundant flows may be designated standby flow(s), and the trading messages would only actually be exchanged over the currently active flow.
  • messages exchanged over any of the redundant flows may be identified with the same flow identifier stored by the messages’ generating nodes in the flow identifier field 110-12 of the normalized message format 110.
  • messages exchanged among nodes in the electronic system 100 are sent to the sequencer 150 to be marked with a sequence identifier.
  • the message format 110 includes sequence identifier field 110-14.
  • an “unmarked message” may be sent with a sequence identifier field 110-14 having an empty, blank (e.g., zero) value.
• the sequence identifier field 110-14 of an unmarked message may be set to a particular predetermined value that the sequencer would never assign to a message, or to an otherwise invalid value.
  • Still other embodiments may specify that a message is unmarked via an indicator in another field (not shown) of the message, such as a Boolean value or a flag value indicating whether a message has been sequenced.
  • the sequencer 150 may then populate the sequence identifier field 110-14 of the unmarked message with a valid sequence identifier value, thereby producing a “sequence marked message.”
• the valid sequence identifier value in sequence identifier field 110-14 of the sequence marked message uniquely identifies the message and also specifies a deterministic position of the marked message in a relative ordering of the marked message among other marked messages throughout electronic trading system 100.
  • a “sequence marked message” sent by the sequencer 150 may then be identical to a corresponding unmarked message received by the sequencer except that the sequence marked message’s sequence identifier field 110-14 contains a valid sequence identifier value.
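• The following is a minimal sketch (in Python, with illustrative field names; not the patented implementation) of the marking step described above: an unmarked message, carrying an empty sequence identifier, is copied and stamped with the next valid sequence identifier value.

```python
UNMARKED = 0  # e.g., an empty/zero value the sequencer would never assign

class Sequencer:
    """Illustrative sequencer: stamps field 110-14 of unmarked messages."""

    def __init__(self):
        self._next_sequence_id = 1  # valid sequence identifiers start at 1 here

    def mark(self, message: dict) -> dict:
        """Return a sequence-marked copy of an unmarked message."""
        assert message.get("sequence_identifier", UNMARKED) == UNMARKED
        marked = dict(message)  # identical except for the sequence identifier
        marked["sequence_identifier"] = self._next_sequence_id
        self._next_sequence_id += 1
        return marked

# The marked message is identical to the unmarked one apart from the
# now-valid sequence identifier, which fixes its deterministic position
# in the system-wide relative ordering of marked messages.
seq = Sequencer()
unmarked = {"node_id": "GW1", "flow_id": 7, "sequence_identifier": UNMARKED}
marked = seq.mark(unmarked)
assert marked["sequence_identifier"] == 1
```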
  • the message format 110 may, in some embodiments, also include the reference sequence identifier field 110-15.
  • a generating node may populate the reference sequence identifier field 110-15 of a new message it generates with the value of a sequence number of a prior message related to the message being generated.
  • the value in the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 to correlate a message with a prior associated message.
• the prior associated message referenced in the reference sequence identifier field 110-15 may be a prior message in the same “order chain” (i.e., “trade order chain”).
  • messages may be logically grouped into an “order chain,” a set of messages over a single flow that reference or “descend from” a common message.
  • An order chain typically starts with a “new order message” sent by a participant device 130.
  • the next message in the order chain is typically a response by the electronic trading system (e.g., either a “new order acknowledgement” message when the message is accepted by the trading system, or a “new order reject” message, when the message is instead rejected by the trading system, perhaps for having an invalid format or invalid parameters, such as an invalid price for non-limiting example).
• An order chain may also include a “cancel order” message sent by participant device 130, canceling at least a portion of the quantity of a prior acknowledged (but still open, that is, containing at least some quantity that is not canceled and/or not filled) new order.
  • the “cancel order” message may again either be acknowledged or rejected by the electronic trading system with a “cancel order acknowledgement” or a “cancel order reject” message, which would also be part of the order chain.
  • An order chain may also include a “replace order” message sent by participant device 130, replacing the quantity and/or the price of a prior acknowledged (but still open) new order.
  • the “replace order” message may again either be acknowledged or rejected by the electronic trading system with a “replace order acknowledgement” or a “replace order reject” message, which would also be part of the order chain.
  • a prior acknowledged order that is still open may be matched with one or more counter orders of the opposite side (that is, “buy” on one side and “sell” or “sell short” on the other side), and the electronic trading system 100 may then generate a complete “fill” message (when all of the open order’s quantity is filled in a single match) or one or more partial “fill” messages (when only a portion of the open order’s quantity is filled in a single match), and these “fill” messages would also be part of the order chain.
  • the reference sequence identifier in general, may identify another prior message in the same order chain.
  • the value for the reference sequence number may be the sequence number assigned by the sequencer for an “incoming” message originating from a participant device 130 and introduced into electronic trading system 100 by a gateway 120, such that a corresponding “outgoing” message, such as a response message generated by a compute node 140, may reference the sequencer number value of the incoming message to which it is responding.
  • a “new order acknowledgement” message or a “fill” message generated by a compute node 140 would include in the reference sequence identifier field 110-15 the value for the sequence identifier assigned to the corresponding “new order” message to which the compute node 140 is responding with a “new order acknowledgement” message or fulfilling the order with the “fill” message.
  • the value for the reference sequence identifier field 110-15 need not necessarily be that of a message that is being directly responded to by the electronic trading system 100, but may be that of a prior message that is part of the same order chain, for example, the sequence number of a “new order” or a “new order acknowledgement.”
  • the gateways 120 may also populate the reference sequence identifier field 110-15 in messages they introduce into electronic trading system 100 with a value of a sequence identifier for a related prior message. For example, a gateway 120 may populate the reference sequence identifier field 110-15 in a “cancel order” or a “replace order” message with the value of the sequence identifier assigned to the prior corresponding “new order” or “new order acknowledgment” message.
• core compute nodes 140 may also populate the reference sequence identifier field 110-15 for a corresponding “cancel order acknowledgement” message or “replace order acknowledgement” message with the value of the sequence identifier for the “new order” or “new order acknowledgment,” rather than that of the message to which the compute node 140 was directly responding (e.g., rather than the sequence identifier of the “cancel order” or “replace order” message).
  • the reference sequence identifier field 110-15 allows nodes in electronic trading system 100 generally to correlate a message with one or more prior messages in the same order chain.
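• As a rough illustration of the bookkeeping this implies, the sketch below (Python; the chain key of flow identifier plus client order identifier is an assumption for illustration, not something the text prescribes) keeps the sequence identifier of each chain’s originating “new order” and stamps it into the reference sequence identifier of later responses in the same chain.

```python
# (flow_id, client_order_id) -> sequence id assigned to the chain's "new order"
chain_roots: dict = {}

def on_new_order_sequenced(flow_id, client_order_id, sequence_id):
    # Remember the sequence identifier assigned to the chain's first message.
    chain_roots[(flow_id, client_order_id)] = sequence_id

def make_response(flow_id, client_order_id, msg_type) -> dict:
    # A "cancel order acknowledgement" (for example) references the original
    # "new order" rather than the "cancel order" it directly answers.
    return {
        "message_type": msg_type,
        "flow_id": flow_id,
        "reference_sequence_id": chain_roots[(flow_id, client_order_id)],
    }

on_new_order_sequenced(flow_id=7, client_order_id="A1", sequence_id=1042)
ack = make_response(7, "A1", "cancel_order_ack")
assert ack["reference_sequence_id"] == 1042
```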
• a generating node may also include a node-specific timestamp field 110-13 in messages it introduces into electronic trading system 100. While the sequence identifier included in the sequence identifier field 110-14 of sequence-marked messages outputted by the sequencer 150 is intended to be unique throughout the electronic trading system 100, the value in the node-specific timestamp field 110-13 may be unique only among a subset of messages: those messages introduced into electronic trading system 100 by a particular generating node. While referred to herein as a “timestamp,” a value placed in the node-specific timestamp field 110-13 may be any suitable value that is unique among messages generated by that node; for example, it may in fact be a timestamp, or it may be any suitable monotonically increasing or decreasing value.
  • Some embodiments may include other timestamp fields in the message format.
  • some message formats may include a reference timestamp field, which may be a timestamp value assigned by the generating node of a prior, related message.
  • a compute node 140 may include a new timestamp value in the node-specific timestamp field 110-13 for messages that it generates, and may also include a timestamp value from a related message in a reference timestamp field of the message the compute node generates.
  • a “new order acknowledgement” message generated by the compute node may include a timestamp value of the “new order” to which it is responding in the reference timestamp field of the “new order acknowledgement message.”
  • compute nodes 140 may not include a new timestamp value in the node-specific timestamp field 110-13 in messages they generate, but may simply populate that node-specific timestamp field 110-13 with a timestamp value from a prior related message.
  • the message format 110 may also include the entity type field 110-16 and entity count field 110-17.
  • entity type of a message may depend on whether it is introduced into the electronic trading system 100 by a gateway 120 or a compute node 140, or in other words, whether the message is an incoming message being received at a gateway 120 from a participant device 130 or whether it is an outgoing message being generated by a compute node 140 to be sent to a participant device 130.
• incoming messages are considered to be of entity type “flow” (and the entity type field 110-16 is populated by the gateways 120 for incoming messages with a value representing the type “flow”), while outgoing messages are considered to be of entity type “symbol” (and the entity type field 110-16 is populated by the compute nodes 140 for outgoing messages with a value representing the type “symbol”).
  • the entity count of type “flow” is maintained by gateways 120, and the entity count of type “symbol” is maintained by the compute nodes 140.
  • a gateway 120 may maintain a per flow incoming message count, counting incoming messages received by the gateway 120 over each flow active on the gateway 120. For example, if four non-redundant flows are active on a gateway 120, each flow would be assigned a unique flow identifier, as discussed above, and the gateway 120 would maintain a per flow incoming message count, counting the number of incoming messages received over each of those four flows. In such embodiments, the gateway 120 populates the entity count field 110-17 of an incoming message with the per flow incoming message count associated with the incoming message’s flow (as identified throughout the electronic trading system 100 by a flow identifier value, populated in the flow identifier field 110-12 of the message).
  • each underlying redundant flow may be assigned the same flow identifier, yet a per-flow incoming message count may still be maintained separately for each redundant flow, especially when the redundant flows are implemented on separate gateways 120. Because it is the expectation that a participant device 130 will send the same set of messages in the same order (i.e., sequence) to the electronic trading system 100 over each of the redundant flows, it is also the expectation that the entity count assigned to functionally equivalent messages received over separate redundant flows should be identical.
  • These functionally equivalent incoming messages may be forwarded by the gateway(s) 120 to sequencer 150 and the compute nodes 140. Accordingly, in such embodiments, the sequencer 150 and compute nodes 140 could receive multiple functionally equivalent incoming messages associated with the same flow identifier, but the sequencer 150 and compute nodes 140 could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same flow identifier.
  • the sequencer 150 and compute nodes 140 may keep track, on a per flow basis, of the highest entity count that has been included in entity count field 110-17 of incoming messages associated with each flow, which allows the sequencer 150 and compute nodes 140 to take action only on the first to arrive of multiple incoming functionally equivalent messages each node has received, and to ignore other subsequently arriving functionally equivalent incoming messages.
  • the sequencer 150 may in some embodiments only sequence the first such functionally equivalent incoming message to arrive, and the compute nodes 140 may only start processing on the first such functionally equivalent message to arrive.
• if an incoming message received by a node (i.e., a sequencer 150 or a compute node 140) has an entity count that is the same as or lower than the highest entity count the node has previously seen for that flow, the node may assume that the incoming message is functionally equivalent to another previously received incoming message, and may simply ignore the subsequently received functionally equivalent incoming message.
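• A minimal sketch of this first-to-arrive filter follows (Python; field names are illustrative). Keyed generically by entity type and entity identifier, the same logic serves both incoming messages counted per flow and, as discussed below, outgoing messages counted per symbol.

```python
highest_seen: dict = {}  # (entity_type, entity_id) -> highest entity count seen

def should_process(message: dict) -> bool:
    """True only for the first of a set of functionally equivalent messages."""
    key = (message["entity_type"], message["entity_id"])
    count = message["entity_count"]
    if count <= highest_seen.get(key, 0):
        return False  # same or lower count: assume a duplicate and ignore it
    highest_seen[key] = count
    return True

assert should_process({"entity_type": "flow", "entity_id": 7, "entity_count": 1})
# A functionally equivalent copy arriving over a redundant flow is ignored:
assert not should_process({"entity_type": "flow", "entity_id": 7, "entity_count": 1})
```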
  • a compute node 140 may maintain a per symbol outgoing message count, counting outgoing messages generated by and sent from the compute node 140 for each symbol serviced by the compute node 140. For example, if four symbols (e.g., MSFT, GOOG, IBM, ORCL) are serviced by a compute node 140, each symbol is assigned a symbol identifier populated in symbol field 110-2 of the message, as discussed above, and the compute node 140 would maintain a per symbol outgoing message count, counting the number of outgoing messages it generated and sent that serviced each of those four symbols.
• the compute node 140 populates the entity count field 110-17 of an outgoing message with the per symbol outgoing message count associated with the outgoing message’s symbol (as identified throughout the electronic trading system 100 by the value populated in the symbol identifier field 110-2 of the message).
  • compute nodes may be configured such that multiple compute nodes service a particular symbol in parallel, for reasons of high availability. Because of the deterministic ordering of messages throughout electronic trading system 100 provided by the sequencer 150, it can be guaranteed that even when multiple compute nodes service a given symbol, they will be processing incoming messages referencing the same symbol in the same order (i.e., sequence) and in the same manner, thereby generating functionally equivalent response messages in parallel. When considering the outgoing messages being sent out for a particular symbol across multiple compute nodes 140, each outgoing message referencing that symbol should have a functionally equivalent message being sent out by each other compute node 140 actively servicing that symbol.
• these functionally equivalent outgoing messages may all be sent by the compute nodes 140 to sequencer 150 and the gateways 120. Accordingly, in such embodiments, the sequencer 150 and gateways 120 could receive multiple functionally equivalent outgoing messages associated with the same symbol, but could identify such messages as being functionally equivalent when the entity count is identical for multiple messages having the same symbol identifier. In some embodiments, the sequencer 150 and gateways 120 may keep track, on a per symbol basis, of the highest entity count that has been included in entity count field 110-17 of outgoing messages associated with the symbol, which allows the sequencer 150 and gateways 120 to take action only on the first to arrive of multiple functionally equivalent outgoing messages each node has received, and to ignore other subsequently arriving functionally equivalent outgoing messages.
  • the sequencer 150 may in some embodiments only sequence the first such functionally equivalent outgoing message to arrive.
  • the gateways 120 may only start processing the first such functionally equivalent message to arrive. If an outgoing message received by a node (i.e., a sequencer 150 or a gateway 120) has an entity count that is the same or lower than the highest entity count the node has previously seen for that symbol, then the node may assume that the outgoing message is functionally equivalent to another previously received outgoing message, and may simply ignore the subsequently received functionally equivalent outgoing message.
• in embodiments in which the sequencer 150 only sequences the first message of a plurality of functionally equivalent messages to arrive at the sequencer, the sequencer could do so in a variety of ways.
  • other subsequently arriving messages that are functionally equivalent to that first such functionally equivalent message to arrive may simply be ignored by the sequencer (in which case only a single sequence marked message may be outputted by the sequencer for the set of functionally equivalent messages).
• the sequencer may track a sequence number that it assigns to the first functionally equivalent message, for example, by making an association between the entity count of the message, its flow identifier or symbol identifier (for messages having entity types of “flow” and “symbol”, respectively), and its sequence number, such that the sequencer may output a sequenced version of each functionally equivalent message in which the value of the sequence identifier field 110-14 for all the sequenced versions of the functionally equivalent messages is the same as had been assigned by the sequencer to the first message to arrive among the functionally equivalent messages received by the sequencer 150.
  • the sequencer 150 may not keep track of whether messages are functionally equivalent, and may assign each unsequenced message that arrives at the sequencer 150 a unique sequence number, regardless of whether that message is among a plurality of functionally equivalent messages.
• the sequenced versions of messages among a plurality of functionally equivalent messages are each assigned different sequence identifiers by the sequencer as the value in the sequence identifier field 110-14.
  • the recipient node of sequenced functionally equivalent messages in such embodiments may use the sequence identifier in the sequenced version of the message among the sequenced functionally equivalent messages that is first to arrive at the node.
  • sequenced versions of the messages are sent out in sequenced order by the sequencer 150, and accordingly, should be received in the same sequenced order among all nodes directly connected to the sequencer. Therefore, for all nodes receiving the sequenced messages via respective direct point-to-point connections with the sequencer, the first sequenced message to arrive among a plurality of functionally equivalent sequenced messages should have the same value in the sequence identifier field 110-14.
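• The tracking approach described a few bullets above might look like the following sketch (Python; illustrative names), in which the sequencer remembers the sequence identifier assigned to the first of a set of functionally equivalent messages and re-uses it for later-arriving equivalents, so all sequenced versions carry the same value in the sequence identifier field 110-14.

```python
class EquivalenceAwareSequencer:
    """Illustrative sequencer that assigns equivalents the same sequence id."""

    def __init__(self):
        self._next_id = 1
        self._assigned: dict = {}  # (entity_type, entity_id, count) -> sequence id

    def sequence(self, message: dict) -> dict:
        key = (message["entity_type"], message["entity_id"], message["entity_count"])
        if key not in self._assigned:  # first of the equivalents to arrive
            self._assigned[key] = self._next_id
            self._next_id += 1
        marked = dict(message)
        marked["sequence_identifier"] = self._assigned[key]
        return marked

s = EquivalenceAwareSequencer()
a = s.sequence({"entity_type": "flow", "entity_id": 7, "entity_count": 1})
b = s.sequence({"entity_type": "flow", "entity_id": 7, "entity_count": 1})
assert a["sequence_identifier"] == b["sequence_identifier"]
```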
  • a combination of a message’s flow identifier and node specific timestamp may be sufficient to uniquely identify the message throughout electronic trading system 100.
  • a combination of a flow identifier and entity count could be sufficient to uniquely identify a message of entity type “flow,” and a combination of a symbol identifier and entity count could be sufficient to uniquely identify a message of entity type “symbol.”
• a sequence identifier is nonetheless still necessary in order to specify in a fair and deterministic manner the relative ordering of the message among other messages generated by other nodes throughout electronic trading system 100.
• when the node-specific timestamp is in fact implemented as a timestamp value, even if system clocks among nodes are perfectly synchronized, two different messages, each generated by a different node, may each be assigned the same timestamp value by their respective generating nodes, leaving the relative ordering between these two messages ambiguous. Even if the messages can be identified uniquely, a recipient node of both messages would still need a way to determine the relative ordering of the two messages before taking possible action on the messages.
  • One possible approach for a recipient node to resolve that ambiguity could be through the use of randomness, for example, by randomly selecting one message as preceding the other in the relative ordering of messages throughout the electronic trading system 100. Using randomness to resolve the ambiguity, however, does not support “state determinism” throughout the electronic trading system 100. Different recipient nodes may randomly determine a different relative ordering among the same set of messages, resulting in unpredictable, nondeterministic behavior within electronic trading system 100, and impeding the correct implementation of important features, such as fault-tolerance, high availability, and disaster recovery.
  • Another approach for a recipient node to resolve the ambiguity in ordering could be through a predetermined precedence method, for example, based on the node identifier associated with the message.
  • Such an approach works against the important goal of fairness, by giving some messages higher precedence simply based on the node identifier of the node that introduced the message into electronic trading system 100. For example, some participant devices 130 could be favored simply because they happen to be connected to the electronic trading system 100 over a gateway 120 that is deemed higher in the predetermined precedence method.
  • sequence identifier assigned to a message by the sequencer 150 may still be required in order to fairly and deterministically specify the ordering of a message relative to other messages in the electronic trading system 100.
  • the sequencer 150 (or the single currently active sequencer, if multiple sequencers 150 are present) serves as the authoritative source of a truly deterministic ordering among sequence-marked messages throughout the electronic trading system 100.
  • nodes in electronic trading system 100 may receive two versions of a message: an unsequenced (unmarked) version of the message as introduced into the electronic trading system 100 by the generating node, and a (marked) version of the message that includes a sequence identifier assigned by the sequencer 150. This may be the case in embodiments in which a generating node sends the unmarked message to one or more recipient nodes as well as the sequencer 150. The sequencer 150 may then send a sequence-marked version of the same message to a set of nodes including the same recipient nodes.
• while the sequence-marked version of the message is useful for determining the relative processing order (i.e., position in a sequence) of the message among other marked messages in electronic trading system 100, it may also be useful for a recipient node to receive the unmarked version of the message.
• it is certainly possible, if not expected (for example, in embodiments in which there are direct connections between nodes), for the unmarked version of the message to be received prior to the marked version of the message, because the marked version of the message is sent via an intervening hop through sequencer 150. Accordingly, there is the opportunity, in some embodiments, for a recipient node to activate processing of the unmarked message upon receiving it, even before that recipient node has received the marked version of the message, which authoritatively indicates the relative ordering of the marked message among other marked messages.
  • a node receiving both the marked and unmarked versions of a same message may correlate the two versions of the message to each other via the same identifying information or “common metadata,” in both versions of the message.
  • a generating node may include in messages it generates (i.e., unmarked messages) a node identifier and a node specific timestamp, which together, may uniquely identify each message throughout electronic trading system 100.
• the marked message may also include the same node identifier and node specific timestamp that are also included in the corresponding unmarked message, thereby allowing a recipient node of both versions of the message to correlate the marked and unmarked versions. Accordingly, while the marked messages directly indicate the relative ordering of a marked message with respect to the other marked messages throughout electronic trading system 100, because of the correlation that may be made between the unmarked and marked versions of the same message, marked messages (at least indirectly, via the correlation discussed above) indicate the relative ordering of the message relative to other messages (marked or unmarked) throughout electronic trading system 100.
  • nodes in electronic trading system 100 may also correlate sequence marked with unmarked versions of the messages by means of the other manners of uniquely identifying messages discussed above.
  • a correlation between sequence marked and unmarked messages may be made by means of a combination of a flow identifier and a node specific timestamp.
  • Such a correlation may additionally or alternatively be made by means of a message’s entity count along with the symbol identifier or flow identifier in the message, for messages having entity type “symbol” and “flow,” respectively.
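• A minimal sketch of such correlation follows (Python; illustrative names), keying pending unmarked messages by the common metadata discussed above, here the generating node’s identifier plus its node-specific timestamp; a flow identifier plus node-specific timestamp, or an entity count plus symbol/flow identifier, could be substituted as the key in the same way.

```python
pending_unmarked: dict = {}  # (node_id, node_timestamp) -> unmarked message

def on_unmarked(message: dict):
    # Hold the unmarked version until its sequence-marked twin arrives.
    pending_unmarked[(message["node_id"], message["node_timestamp"])] = message

def on_marked(message: dict):
    # Return the earlier unmarked version, if one was received first.
    return pending_unmarked.pop(
        (message["node_id"], message["node_timestamp"]), None
    )

on_unmarked({"node_id": "GW1", "node_timestamp": 123, "payload": "new order"})
twin = on_marked({"node_id": "GW1", "node_timestamp": 123, "sequence_identifier": 55})
assert twin is not None and twin["payload"] == "new order"
```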
  • participant devices 130 exchanging messages with the electronic trading system 100 are often very sensitive to latency, preferring low, predictable latency.
  • the arrangement shown in Fig. 1 A accommodates this requirement by providing a point-to-point mesh 172 architecture between at least each of the gateways 120 and each of the compute nodes 140.
  • each gateway 120 in the mesh 172 may have a dedicated high-speed direct connection 180 to the compute nodes 140 and the sequencers 150.
  • dedicated connection 180-1-1 is provided between gateway 1 120-1 and core compute node 1 140-1, dedicated connection 180-1-2 between gateway 1 120-1 and core compute node 2 140-2, and so on, with example connection 180-g-c provided between gateway 120-g and core compute node c 140-c, and example connection 180-s-c provided between sequencer 150 and core compute node c 140-c.
  • each dedicated connection 180 in the point-to-point mesh 172 is, in some embodiments, a point-to-point direct connection that does not utilize a shared switch.
  • a dedicated or direct connection may be referred to interchangeably herein as a direct or dedicated “link” and is a direct connection between two end points that is dedicated (e.g., nonshared) for communication therebetween.
  • Such a dedicated/direct link may be any suitable interconnect(s) or interface(s), such as disclosed further below, and is not limited to a network link, such as wired Ethernet network connection or other type of wired or wireless network link.
  • the dedicated/direct connection/link may be referred to herein as an end-to-end path between the two end points.
• Such an end-to-end path may be a single connection/link or may include a series of connections/links; however, bandwidth of the dedicated/direct connection/link in its entirety, that is, from one end point to the other end point, is non-shared, and neither bandwidth nor latency of the dedicated/direct connection/link can be impacted by resource utilization of any element(s) so traversed.
  • the dedicated/direct connection/link may traverse one or more buffer(s) or other elements that are not bandwidth or latency impacting based on utilization thereof.
  • the dedicated/direct connection/link would not, however, traverse a shared network switch as such a switch can impact bandwidth and/or latency due to its shared usage.
  • the dedicated connections 180 in the point-to- point mesh 172 may be provided in a number of ways, such as a 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, InfiniBand, Peripheral Component Interconnect - Express (PCIe), RapidlO, Small Computer System Interface (SCSI), FireWire, Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or custom serial or parallel busses.
• while components such as the gateways, compute nodes, and sequencers may sometimes be referred to herein as “nodes”, the use of terms such as “compute node” or “gateway node” or “sequencer node” or “mesh node” should not be interpreted to mean that particular components are necessarily connected using a network link, since other types of interconnects or interfaces are possible.
  • a “node,” as disclosed herein, may be any suitable hardware, software, firmware component(s), or combination thereof, configured to perform the respective function(s) set forth for the node.
  • a node may be a programmed general purpose processor, but may also be a dedicated hardware device, such as a field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other hardware device or group of devices, logic within a hardware device, printed circuit board (PCB), or other hardware component.
  • nodes disclosed herein may be separate elements or may be integrated together within a single element, such as within a single FPGA, ASIC, or other element configured to implement logic to perform the functions of such nodes as set forth herein. Further, a node may be an instantiation of software implementing logic executed by general purpose computer and/or any of the foregoing devices.
  • dedicated connections 180 are also provided directly between each gateway 120 and each sequencer 150, and between each sequencer 150 and each core compute node 140. Furthermore, in some embodiments, dedicated connections 180 are provided among all the sequencers, so that an example sequencer 150-1 has a dedicated connection 180 to each other sequencer 150-2, . . ., 150-s. While not pictured in Fig. 1A, in some embodiments, dedicated connections 180 may also be provided among all the gateways 120, so that each gateway 120-1 has a dedicated connection 180 to each other gateway 120-2, ..., 120-g. Similarly, in some embodiments, dedicated connections 180 are also provided among all the compute nodes 140, so that an example core compute node 140-1 has a dedicated connection 180 to each other core compute node 140-2, ..., 140-c.
  • a dedicated connection 180 between two nodes may in some embodiments be implemented as multiple redundant dedicated connections between those same two nodes, for increased redundancy and reliability.
• for example, the dedicated connection 180-1-1 between gateway 120-1 and core compute node 140-1 (e.g., Core 1) may be implemented as multiple redundant dedicated connections.
  • any message sent out by a node is sent out in parallel to all nodes directly connected to it in the point-to-point mesh 172.
  • Each node in the point-to-point mesh 172 may determine for itself, for example, based on the node’s configuration, whether to take some action upon receipt of a message, or whether instead simply to ignore the message.
  • a node may never completely ignore a message; even if the node, due to its configuration, does not take substantial action upon receipt of a message, it may at least take minimal action, such as consuming any sequence number assigned to the message by the sequencer 150. That is, in such embodiments, the node may keep track of a last received sequence number to ensure that when the node takes more substantial action on a message, it does so in proper sequenced order.
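• The per-node behavior described above might look like the following sketch (Python; the symbol-based configuration is an illustrative assumption): every directly connected node sees every message, takes substantial action only on messages it is configured for, and otherwise at least consumes the sequence number so that later substantial work happens in proper sequenced order.

```python
class MeshNodeStub:
    """Illustrative receive-side behavior of a node in the point-to-point mesh."""

    def __init__(self, serviced_symbols: set):
        self.serviced_symbols = serviced_symbols
        self.last_sequence_id = 0  # minimal action: consume sequence numbers

    def on_sequenced_message(self, message: dict):
        # Every message is at least tracked, even if otherwise ignored.
        self.last_sequence_id = message["sequence_identifier"]
        if message.get("symbol") in self.serviced_symbols:
            self.process(message)  # substantial action, e.g., order matching

    def process(self, message: dict):
        print("matching", message["symbol"], "at seq", message["sequence_identifier"])

node = MeshNodeStub(serviced_symbols={"MSFT"})
node.on_sequenced_message({"symbol": "IBM", "sequence_identifier": 1})   # consumed only
node.on_sequenced_message({"symbol": "MSFT", "sequence_identifier": 2})  # processed
```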
  • a message containing a trade order to “Sell 10 shares of Microsoft at $190.00” might originate from participant device 130-1, such as a trader’s personal computer, and arrive at gateway 120-1 (i.e., GW 1). That message will be sent to all core compute nodes 140-1, 140-2, . . ., 140-c even though only core compute node 140-2 is currently performing matching for Microsoft orders. All other core compute nodes 140-1, 140-3, . . ., 140-c may upon receipt ignore the message or only take minimal action on the message. For example, the only action taken by 140-1, 140-3, . . ., 140-c may be to consume the sequence number assigned to the message by the sequencer 150-1.
  • That message will also be sent to all of the sequencers 150-1, 150-2, . . ., 150-s even though a single sequencer (in this example, sequencer 150-1) is the currently active sequencer servicing the mesh.
  • the other sequencers 150-2, . . ., 150-s also received the message to allow them the opportunity to take over as the currently active sequencer should sequencer 150-1 (the currently active sequencer) fail, or if the overall reliability of the electronic trading system 100 would increase by moving to a different active sequencer.
  • One or more of the other sequencers may also be responsible for relaying system state to the disaster recovery site 155.
  • the disaster recovery site 155 may include a replica of electronic trading system 100 at another physical location, the replica comprising physical or virtual instantiations of some or all of the individual components of electronic trading system 100.
• By sending each message out in parallel to all directly connected nodes, the system 100 reduces complexity and also facilitates redundancy and high availability. If all directly connected nodes receive all messages by default, multiple nodes can be configured to take action on the same message in a redundant fashion. Returning to the example above of the order to “Sell 10 shares of Microsoft at $190.00”, in some embodiments, multiple core compute nodes 140 may simultaneously perform matching for Microsoft orders.
  • both core compute node 140-1 and core compute node 140-2 may simultaneously perform matching for Microsoft messages, and may each independently generate, after having received the incoming message of the “Sell” order, a response message such as an acknowledgement or execution message that each of core compute node 140-1 and core compute node 140-2 sends to the gateways 120 through the sequencer(s) 150 to be passed on to one or more participant devices 130.
  • gateways 120 may receive multiple associated outgoing messages from core compute nodes 140 for the same corresponding incoming message. Due to the fact that it can be guaranteed that these multiple associated response messages are equivalent, the gateways 120 may simply process only the first received outgoing message, ignoring subsequent associated outgoing messages corresponding to the same incoming message.
  • the “first” and “subsequent” messages may be identified by their associated sequence numbers, as such messages may be sequence-marked messages.
  • messages may be identified as being functionally equivalent based on other identifying information in the messages, such as the values in the entity type field 110-16 and entity count field 110-17, as discussed further in connection with Fig. 1C above.
  • Allowing the gateways 120 to take action on the first of several functionally equivalent associated response messages to reach them may, therefore, also improve the overall latency of the electronic trading system 100.
  • the electronic trading system 100 can be easily configured such that any incoming message is processed by multiple compute nodes 140, in which each of those multiple compute nodes 140 generates an equivalent response message that can be processed by the gateways 120 on a first-to-arrive basis.
  • Such an architecture provides for high availability with no perceptible impact to latency in the event that a compute node 140 is not servicing incoming messages for a period of time (whether due to a system failure, a node reconfiguration, or a maintenance operation).
  • Such a point-to-point mesh 172 architecture of system 100 besides supporting low, predictable latency and redundant processing of messages, also provides for built-in redundant, multiple paths. As can be seen, there exist multiple paths between any gateway 120 and any compute node 140. Even if a direct connection 180-1-1 between gateway 120-1 and compute node 140-1 becomes unavailable, communication is still possible between those two elements via an alternate path, such as by traversing one of the sequencers 150 instead. Thus, more generally speaking, there exist multiple paths between any node and any other node in the point- to-point mesh 172.
  • this point-to-point mesh architecture inherently supports another important goal of a financial trading system, namely, fairness.
  • the point-to-point architecture with direct connections between nodes ensures that the path between any gateway 120 and any core compute node 140, or between the sequencer 150 and any other node has identical or, at least very similar latency. Therefore, two incoming messages sent out to the sequencer 150 at the same time from two different gateways 120 should reach the sequencer 150 substantially simultaneously. Similarly, an outgoing message being sent from a core compute node 140 is sent to all gateways 120 simultaneously, and should be received by each gateway at substantially the same time. Because the topology of the point-to-point mesh does not favor any single gateway 120, chances are minimized that being connected to a particular gateway 120 may give a participant device 130 an unfair advantage or disadvantage.
  • the point-to-point mesh architecture of system 100 allows for easily reconfiguring the function of a node, that is, whether a node is currently serving as a gateway 120, core compute node 140 or sequencer 150. It is particularly easy to perform such reconfiguration in embodiments in which each node has a direct connection between itself and each other node in the point-to-point mesh.
  • no re-wiring or re-cabling of connections 180 (whether physical or virtual) within the point-to-point mesh 172 is required in order to change the function of a node in the mesh (for example, changing the function of a node from a core compute node 140 to a gateway 120, or from a gateway 120 to a sequencer 150).
• any reconfiguration required internal to the point-to-point mesh 172 may be easily accomplished through configuration changes that are carried out remotely.
• the reconfiguration of the function of a node may be accomplished live, even dynamically, during trading hours. For example, due to changes in the characteristics of the load of the electronic trading system 100 or new demand, it may be useful to reconfigure a core compute node 140-1 to instead serve as an additional gateway 120. After some possible redistribution of state or configuration to other compute nodes 140, the new gateway 120 may be available to start accepting new connections from participant devices 130.
• In some embodiments, lower-speed, potentially higher latency shared connections 182 may be provided among the system components, including among the gateways 120 and/or the core compute nodes 140.
• shared connections 182 may be used for maintenance, control operations, management operations, and/or similar operations that do not require very low latency communications, in contrast to messages related to trading activity carried over the dedicated connections 180 in the point-to-point mesh 172.
• the shared connections 182-g and 182-c carry non-trading activity type traffic.
• Shared connections 182, carrying non-trading traffic, may be over one or more shared networks and via one or more network switches, and nodes in the mesh may be distributed among these shared networks in different ways.
• for example, gateways 120 may all be in a gateway-wide shared network 182-g, compute nodes 140 may be in their own respective compute-node-wide shared network 182-c, and sequencers 150 may be in their own distinct sequencer-wide shared network 182-s, while in other embodiments all the nodes in the mesh may communicate over the same shared network for these non-latency sensitive operations.
• Distributed computing environments such as electronic trading system 100 sometimes rely on high resolution clocks to maintain tight synchronization among various components.
  • one or more of the nodes 120, 140, 150 might be provided with access to a clock, such as a high-resolution global positioning (GPS) clock 195 in some embodiments.
  • gateways 120, compute nodes 140, and sequencers 150 connected in the mesh 172 may be referred to as “Mesh Nodes”.
  • Fig. 2 illustrates an example embodiment of a Mesh Node 200 in the point-to- point mesh 172 architecture of electronic trading system 100.
  • Mesh node 200 could represent a gateway 120, a sequencer 150, or a core compute node 140, for example.
  • Mesh Node 200 may be implemented in any suitable combination of hardware and software, including pure hardware and pure software implementations, and in some embodiments, any or all of gateways 120, compute nodes 140, and/or sequencers 150 may be implemented with commercial off-the-shelf components.
• In the embodiment illustrated by Fig. 2, in order to achieve low latency, some functionality is implemented in hardware in Fixed Logic Device 230, while other functionality is implemented in software in Device Driver 220 and Mesh Software Application 210.
  • Fixed Logic Device 230 may be implemented in any suitable way, including an Application-Specific Integrated Circuit (ASIC), an embedded processor, or a Field Programmable Gate Array (FPGA).
  • Mesh Software Application 210 and Device Driver 220 may be implemented as instructions executing on one or more programmable data processors, such as central processing units (CPUs). Different versions or configurations of Mesh Software Application 210 may be installed on Mesh Node 200 depending on its role. For example, based on whether Mesh Node 200 is acting as a gateway 120, sequencer 150, or core compute node 140, a different version or configuration of Mesh Software Application 210 may be installed.
  • Mesh Node 200 has multiple low latency 10 Gigabit Ethernet SFP+ connectors (interfaces) 270-1, 270-2, 270-3, ..., 270-n, (known collectively as connectors 270).
• Connectors 270 may be directly connected to other nodes in the point-to-point mesh via dedicated connections 180, connected via shared connections 182, and/or connected to participant devices 130 via a gateway 120, for example. These connectors 270 are electronically coupled in this example to 10 GigE MAC Cores 260-1, 260-2, 260-3, . . ., 260-n (known collectively as GigE Cores 260), respectively, which in this embodiment are implemented by Fixed Logic Device 230 to ensure minimal latency. In other embodiments, 10 GigE MAC Cores 260 may be implemented by functionality outside Fixed Logic Device 230, for example, in PCI-E network interface card adapters.
  • Fixed Logic Device 230 may also include other components.
  • Fixed Logic Device 230 also includes a Fixed Logic 240 component.
• the Fixed Logic 240 component may implement different functionality depending on the role of Mesh Node 200, for example, whether it is a gateway 120, sequencer 150, or core compute node 140.
• Fixed Logic Memory 250 is also included in Fixed Logic Device 230.
  • Fixed Logic Device 230 also includes a PCI-E Core 235, which may implement PCI Express functionality.
• PCI Express is used as a conduit mechanism to transfer data between hardware and software, or more specifically, between the Fixed Logic 240 and the Mesh Software Application 210, via Device Driver 220 over PCI Express Bus 233.
• any suitable data transfer mechanism between hardware and software may be employed, including Direct Memory Access (DMA), shared memory buffers, or memory mapping.
  • Mesh Node 200 may also include other hardware components. For example, depending on its role in the electronic trading system 100, Mesh Node 200 in some embodiments may also include High-Resolution Clock 195 (also illustrated in and discussed in conjunction with Fig. 1A) used in the implementation of high-resolution clock synchronization among nodes in electronic trading system 100.
  • a Dynamic Random- Access Memory (DRAM) 280 may also be included in Mesh Node 200 as an additional memory in conjunction with Fixed Logic Memory 250.
  • DRAM 280 may be any suitable volatile or non-volatile memory, including one or more random-access memory banks, hard disk(s), and solid-state disk(s), and accessed over any suitable memory or storage interface.
• the point-to-point mesh architecture (e.g., 102 and 172) of the electronic trading system 100 provides for improved latency in a number of ways.
• because messages are exchanged among nodes in the electronic trading system 100 over direct dedicated connections 180 without traversing a switch, avoiding the switch in and of itself allows for multiple latency-related advantages:
  • c) Communicating via direct point-to-point connections without a switch also eliminates a requirement to exchange messages within the mesh according to certain established protocols, such as TCP/IP, which may be required by the switch but add processing time overhead even in the sending node and receiving node.
  • This protocol specific overhead can optionally be eliminated in some embodiments of the point-to-point mesh architecture 172 that exchange messages in accordance with one or more custom protocols, rather than specific established protocols required by the switch, such as TCP/IP.
  • Messages may be broadcast perfectly simultaneously to directly connected nodes, and received perfectly simultaneously by the directly connected recipient nodes, as discussed further below. Different messages may also be received perfectly simultaneously from multiple directly connected sender nodes, also as discussed further below.
  • Latency may be improved in embodiments in which the direct dedicated connections 180 between nodes are implemented via dedicated communications logic and dedicated interfaces, such as GigE MAC Cores 260 and connectors 270, respectively, as discussed in connection with FIG. 2, above.
• each of the GigE MAC Cores 260 handles the message communication between its node and a single other node, and each of the connectors 270 is connected via a dedicated connection 180 to a single other node in the point-to-point mesh architecture 172. For example, considering three nodes in the point-to-point mesh of the embodiment of Fig. 1A: gateway 120-1 is connected via a dedicated connection 180-1-1 to core compute node 140-1, and gateway 120-1 is also connected via a separate dedicated connection 180-1-2 to core compute node 140-2. If each of these three nodes is implemented via the embodiment of Fig. 2, the GigE MAC Core 260-1 and the connector 270-1 on the gateway 120-1 may be used solely for communication with core compute node 140-1, and the GigE MAC Core 260-1 and the connector 270-1 on the core compute node 140-1 may be used solely for communication with the gateway 120-1, over the dedicated connection 180-1-1.
  • the GigE MAC Core 260-2 and the connector 270-2 on the gateway 120-1 may be used solely for communication with core compute node 140-2, and the GigE MAC Core 260-1 and the connector 270-1 on the core compute node 140-2 may be used solely for communication with the gateway 120-1, over the dedicated connection 180-1-2.
• dedicated compute resources, such as one of the GigE Cores 260 and one of the Connectors 270, are used on each node for communicating with each other node with which it has a dedicated connection 180.
  • These dedicated compute resources per dedicated connection 180 allow messages to be broadcast perfectly simultaneously by a sending node among other mesh nodes directly connected (e.g., via a dedicated connection 180) to that sending node, particularly when the GigE Cores 260 are implemented in hardware, such as the Fixed Logic Device 230.
  • a broadcasted message from the sending node may be received perfectly simultaneously by all the receiving nodes, in which each receiving node has a respective dedicated connection 180 to the sending node, the receiving being implemented by a dedicated connection communications logic and dedicated interface, for each node directly connected to the recipient node.
  • No receiving node is favored, since each receiving node receives the broadcasted message perfectly simultaneously; the internal message latency among different nodes in the mesh should be identical.
  • a receiving node may perfectly simultaneously receive different messages, each from a different node directly connected to it.
• in a traditional switch-based architecture, by contrast, each server has a single connection (or at times, multiple redundant connections, which may still be considered a single logical connection) between itself and the switch. If a server needs to broadcast a message to multiple other servers in the system, those messages will not actually be sent simultaneously to the switch nor received simultaneously by the switch, but instead will need to be sent/received one after another in serial fashion over the single logical connection between the server and the switch.
  • the point-to-point mesh architecture 102 and 172 allows for other latency improvements, as well.
• in order to ensure that a message can be processed throughout the electronic trading system 100 in a deterministic sequence relative to other messages, a message being sent between two nodes in the point-to-point mesh, such as from the gateway 120-1 to the core compute node 140-1, may be sent via the ordering path 117.
• As the message traverses the ordering path 117, it passes through the sequencer 150-1, which marks the message with a unique sequence identifier, producing a sequence-marked message that the sequencer 150-1 then sends to the destination node (the core compute node 140-1, in this example). With the receipt of the sequence-marked message, the destination node may ensure that the message is processed in the correct deterministic order, as specified by the sequence identifier in the sequence-marked message, relative to other messages (that have also been sequence-marked by the sequencer) in the electronic trading system 100.
• this ordering path 117 may include multiple direct connections: a direct connection between the sending node and the sequencer (such as the direct connection 180-gw1-s1 between the gateway 120-1 and the sequencer 150-1), and also a direct connection between the sequencer and the destination node (such as the direct connection 180-c1-s1 between the sequencer 150-1 and the compute node 140-1).
  • These multiple direct connections comprising the ordering path 117 do not pass through a switch and may benefit from the latency advantages already discussed above.
  • the latency of the ordering path 117 is impacted by the processing time of a message in the sequencer 150-1 (i.e., the time necessary for the sequencer 150-1 to receive the non-sequence-marked message, mark it with a sequence identifier, and send the sequence-marked message to the destination) and the transmission time of the message through two direct connections. (That is, the non-sequence-marked version is first sent over a direct connection from the sender node to the sequencer, and then, the sequence-marked version is sent over a separate direct connection from the sequencer to the destination node).
• the non-sequence-marked (i.e., unsequenced) message may also be sent in parallel over the activation link 180-1-1, which may be a single direct connection between the sender node and the destination node, such as the direct connection 180-1-1 between the gateway 120-1 and the core compute node 140-1.
• the non-sequence-marked message may be sent perfectly simultaneously by the sending node (e.g., the gateway 120-1) both to the destination node (e.g., the core compute node 140-1) over the activation link 180-1-1 and, as part of the ordering path 117, to the sequencer 150-1 over the direct connection (e.g., direct connection 180-gw1-s1) between the sending node and the sequencer 150-1.
• the sequencer 150-1 and the destination node (e.g., the core compute node 140-1) may simultaneously receive the non-sequence-marked version of the message, which allows the destination node to activate processing of the non-sequence-marked message immediately upon its receipt.
  • the destination node may, in parallel, start processing the non-sequence-marked message, such processing including loading into faster memory data related to a symbol referenced in the message, building up a preliminary response message, or otherwise commencing a matching function activity for an electronic trade.
  • the destination node may complete the processing of the message in accordance with the proper deterministic order (i.e., sequence) as specified by the sequence identifier.
• By the time the destination node receives the sequence-marked message over the ordering path 117, however, it will already have had the opportunity to perform useful processing of the non-sequence-marked version of the message (e.g., 0.5 to 1.0 microseconds worth of processing time), allowing it to complete the processing of the sequence-marked message more quickly, and produce a response message with correspondingly lower latency, than if it had not already received the non-sequence-marked version of the message via the activation link 180-1-1.
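• The sketch below (Python; the preparatory steps are the illustrative examples given in the text, not a definitive implementation) shows the shape of this optimization: speculative work begins when the unmarked copy arrives over the activation link, and is reused when the sequence-marked copy arrives via the ordering path.

```python
class SpeculativeComputeNode:
    """Illustrative destination-node handling of unmarked/marked message pairs."""

    def __init__(self):
        self.warm: dict = {}  # (node_id, node_timestamp) -> prepared state

    def on_unmarked(self, message: dict):
        key = (message["node_id"], message["node_timestamp"])
        # Speculative work before ordering is known: e.g., load data related
        # to the referenced symbol into faster memory and build up a
        # preliminary response message.
        self.warm[key] = {"preliminary_response": {"symbol": message.get("symbol")}}

    def on_marked(self, message: dict):
        key = (message["node_id"], message["node_timestamp"])
        prepared = self.warm.pop(key, None)  # reuse the head start, if any
        # ...complete processing here in the deterministic order given by
        # message["sequence_identifier"]...
        return prepared
```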
  • the point-to-point mesh architecture (e.g., 102 and 172) provides for improved latency in a number of ways. Through testing, at least some of these latency improvements have been empirically quantified. For example, the latency improvements due to the use of direct connections between nodes, avoiding a switch, amount to a total of at least 1.0 microseconds and, in some cases, up to 3.0 microseconds in each direction (i.e., incoming vs outgoing).
  • the latency improvements due to the opportunity for a destination node to process a non-sequence-marked message received via the activation link 180- 1-1, in parallel with the processing and transmission of the message via the ordering path 117, amount to another 0.5 to 1.0 microseconds in each direction.
  • the possible latency improvements in aggregate from the time an incoming message enters the electronic trading system 100 from a participant device 130 via a gateway 120 to the time a corresponding response message is sent to the participant device 130 from the gateway 120 amount to at least 2.5 to 7.0 microseconds.
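• One way to reconcile these aggregate figures with the component figures above (an assumption about how they combine, consistent with the numbers given): the switch-avoidance savings of 1.0 to 3.0 microseconds apply once in each direction, while the activation-link savings of 0.5 to 1.0 microseconds apply once per round trip, giving 2 × 1.0 + 0.5 = 2.5 microseconds at the low end and 2 × 3.0 + 1.0 = 7.0 microseconds at the high end.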
  • a difference in latency of even microseconds is highly significant.
  • These latency improvements allow some embodiments of the electronic trading system 100 to reliably respond in accordance with a desired response time latency of 5.0 to 7.0 microseconds, which is a marked improvement over prior art trading systems, the best of which currently have a typical response time latency in the range of 50 to 75 microseconds.
  • Figs. 3A-E illustrate a distributed system 300 in which example embodiments may be implemented.
  • the distributed system 300 may include some or all features of the electronic trading system 100 described above with reference to Figs. 1 and 2.
• the distributed system 300 may include participant devices 330 (also referred to as client devices), gateways 320, and compute nodes 340-1, 340-2, . . ., 340-6 (collectively, the compute nodes 340), which may incorporate some or all features of the participant devices 130, the gateways 120, and the compute nodes 140, respectively.
  • Each of the participant devices 330, gateways 320, and compute nodes 340 may also incorporate some or all features of the mesh node 200 described above with reference to Fig. 2.
  • the distributed system 300 may process messages (e.g., trade orders) from and provide related information to the participant computing devices 330.
  • Participant devices 330 may be one or more personal computers, tablets, smartphones, servers, or other data processing devices configured to display and receive information related to the messages (e.g., trade order information).
  • Each participant device 330 may exchange messages with the distributed system 300 via distributed system connections established with the gateways 320.
  • the compute nodes 340 may process the messages and generate a response. For example, if configured as a matching engine as described above, the compute nodes 340 may provide the aforementioned matching functions and may also generate outgoing messages to be delivered to one or more of the participant devices 330.
  • the gateways 320 transmit a message 380 from one of the participant devices 330 to the compute nodes 340.
  • the message 380 may specify a given value of a common parameter of the messages.
  • an example of a common parameter of the messages may be one of the fields of the message format 110, such as the symbol field 110-2, for non-limiting example.
  • the values of the common parameter may each correspond to a respective financial instrument (e.g., a stock symbol or other indicator representing a particular stock or bond), and, according to the illustrated embodiment, a respective value of the common parameter of the messages may be included in the symbol field 110-2 of the messages.
  • the common parameter of the messages may correspond to a type of the transaction (also known as “side”, e.g., buy, sell) that is ordered by the message 380.
  • the compute nodes 340 may each be assigned to one or more of the values such that the assigned compute nodes are configured to process messages corresponding to the assigned common parameter values. For example, as shown in Fig. 3A, compute nodes 340-1 and 340-4 are assigned to value 1, compute nodes 340-2 and 340-5 are assigned to value 2, and compute nodes 340-3 and 340-6 are assigned to value 3.
  • the compute nodes 340 may be divided into multiple subsets or groups, wherein each subset is configured to process messages corresponding to a respective value of the common parameter of the messages, and may refrain from processing messages that do not correspond to the respective value. Further, the compute nodes within a given subset may each process a given message in parallel with, and independently of, one another.
  • the compute nodes 340 may forward a response to the gateways 320, which, in turn, may forward the response or information about the response to the participant computing device(s) 330 originating the message. For example, the compute nodes may generate a response reporting a successful match between a buy order and a sell order, thereby indicating a completed trade of a financial instrument.
  • the gateways 320 may broadcast the message 380 to all of the compute nodes 340, and each of the compute nodes 340, in turn, may selectively process messages based on their common parameter value. For example, compute node 340-1 may evaluate received messages for their common parameter value, and then process only the subset of those messages that correspond to value 1. Alternatively, the gateways 320 may transmit the message 380 to a subset of the compute nodes 340 that are assigned to the value corresponding to the message 380. For example, the gateways 320 may determine that the message 380 has a common parameter value of value 1, and, accordingly, send the message to compute node 340-1 and 340-4. To do so, the gateways 320 or another node or element may maintain an index of values and compute nodes that are assigned to those values.
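  • As a rough sketch of these two delivery strategies, the following fragment contrasts broadcast delivery with index-based targeted delivery. The names `Message`, `value_index`, and `receive` are hypothetical, introduced here only for illustration, and are not identifiers from the system described above.

```python
from dataclasses import dataclass

@dataclass
class Message:
    value: int        # the common-parameter value (e.g., from symbol field 110-2)
    payload: bytes

class Gateway:
    def __init__(self, all_nodes, value_index=None):
        self.all_nodes = all_nodes        # every compute node in the system
        self.value_index = value_index    # optional map: value -> assigned nodes

    def deliver(self, msg: Message):
        if self.value_index is None:
            # Broadcast: every node receives the message and filters
            # by its own assigned values (node-side selection).
            targets = self.all_nodes
        else:
            # Targeted: consult the value -> nodes index and send only
            # to the subset assigned to this message's value.
            targets = self.value_index.get(msg.value, [])
        for node in targets:
            node.receive(msg)
```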
  • the compute nodes 340 may require a self-maintenance operation to maintain optimal operation when processing messages.
  • a compute node 340-1 may require adjustment or updates to the layout of an order book stored to its internal memory, or may need to update a hash collision bucket.
  • some or all of the compute nodes 340 may be reassigned to different values, thereby re-balancing the distribution of the values among the compute nodes to improve efficiency and latency in processing the messages.
  • the distributed system 300 is configured to ensure that all messages received from the participant computing devices 330 are processed in a timely manner, and that a self-maintenance operation by one or more of the compute nodes 340 does not disrupt the processing of messages or result in an increase in latency within the distributed system.
  • the distributed system 300 may circulate a token 390 among the compute nodes 340, and each of the compute nodes may receive the token 390 from a preceding compute node in a sequence of the compute nodes 340, possess the token 390 during an operation or a given period of time, and then pass the token 390 to a subsequent compute node.
  • the compute node 340-2 is currently in possession of the token 390. During this state of possession of the token 390, the compute node 340-2 may perform one or more self-maintenance operations such as those described above. During the self-maintenance operation, the compute node 340-2 may be unavailable to process messages. However, because the compute node 340-5 is also configured to process messages corresponding to value 2, those messages may continue to be processed without delay.
  • the compute node 340-2 may maintain a queue of messages (e.g., at Fixed Logic Memory 250 or DRAM 280) that it receives during the self-maintenance operation. After the maintenance operation is done but before the compute node 340-2 may relinquish the token 390, the compute node 340-2 may be configured to process all messages in the queue and update its state. Newly-received messages may continue to be queued while the queue is non-empty. Once the queue is empty, the compute node 340-2 may then pass the token 390 to a subsequent compute node.
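  • A minimal sketch of this queue-then-drain behavior follows; `process`, `do_self_maintenance`, and `accept_token` are hypothetical placeholders for the node's actual matching, maintenance, and token-passing logic.

```python
import collections

class ComputeNode:
    def __init__(self):
        self.pending = collections.deque()   # messages received during maintenance
        self.in_maintenance = False

    def process(self, msg):
        pass                                 # placeholder for the matching function

    def do_self_maintenance(self):
        pass                                 # placeholder, e.g., rebuild order-book layout

    def accept_token(self, token):
        self.token = token                   # placeholder for token receipt

    def receive(self, msg):
        # Newly received messages continue to be queued while the queue is
        # non-empty, so that they are processed in arrival order.
        if self.in_maintenance or self.pending:
            self.pending.append(msg)
        else:
            self.process(msg)

    def maintain_then_pass_token(self, token, next_node):
        self.in_maintenance = True
        self.do_self_maintenance()
        self.in_maintenance = False
        while self.pending:                  # drain the entire backlog...
            self.process(self.pending.popleft())
        next_node.accept_token(token)        # ...before relinquishing the token
```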
  • one or more of the compute nodes 340 may be temporarily assigned to process messages corresponding to the value of the compute node that is currently undergoing a self-maintenance operation.
  • the compute node 340-1 may be temporarily assigned to process messages corresponding to value 2 in addition to messages corresponding to value 1.
  • one or more of the compute nodes 340 may be dedicated to processing messages corresponding to the value of a compute node currently undergoing self-maintenance, and optionally may refrain from processing messages when a self-maintenance operation is not occurring at any of the other compute nodes 340.
  • Figs. 3B-3E illustrate circulation of the token 390 among the compute nodes 340.
  • the compute node 340-2 upon completing a self-maintenance operation or determining that a self-maintenance operation is not needed, may pass the token 390 to the compute node 340-3.
  • the compute node 340-3 once in possession of the token 390, may perform a self-maintenance operation as described above.
  • the compute node 340-6 may continue to process messages corresponding to value 3, and those messages may also be queued at the compute node 340-3.
  • After the final node in the circulation sequence, compute node 340-6, completes a self-maintenance operation, it may pass the token 390 to the first compute node 340-1 in the sequence. Then, as shown in Fig. 3E, the compute node 340-1, once in possession of the token 390, may perform a self-maintenance operation as described above. During this time, the compute node 340-4 may continue to process messages corresponding to value 1, and those messages may also be queued at the compute node 340-1.
  • the token 390, and/or one or more additional tokens may be circulated among the gateways 320 and/or sequencers 150 (Fig. 1) in a comparable manner, thereby enabling the gateways 320 and/or sequencers 150 to perform maintenance operations without disrupting or substantially delaying the operations performed by those nodes.
  • the distributed system 300 may implement alternative features in place of the token to grant permission to the compute nodes 340 to perform a maintenance operation.
  • each of the compute nodes 340 may be assigned a respective time interval during which it is granted permission to perform a maintenance operation. The time intervals may be distributed among the compute nodes 340 such that the compute nodes 340 perform their respective maintenance operations in a sequence comparable to that enabled by the token 390.
  • each of the compute nodes 340 may be configured to communicate with a central coordinating authority (e.g., a gateway or sequencer) for permission to perform a maintenance operation periodically or in response to a determination that maintenance is required.
  • the central coordinating authority may have access to the maintenance states of all of the compute nodes 340, and, based on these maintenance states, may grant or deny permission to a requesting node to ensure that at least one node assigned to a given value remains operational.
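  • A minimal sketch of such a central coordinating authority, under the hypothetical assumption that it keeps a map from each value to the set of node ids serving that value:

```python
class MaintenanceCoordinator:
    def __init__(self, assignments):
        self.assignments = assignments   # value -> set of node ids serving it
        self.in_maintenance = set()

    def request_permission(self, node_id, node_values):
        # Grant only if, for every value this node serves, at least one
        # other assigned node would remain operational.
        for value in node_values:
            peers = self.assignments[value] - {node_id} - self.in_maintenance
            if not peers:
                return False
        self.in_maintenance.add(node_id)
        return True

    def report_done(self, node_id):
        self.in_maintenance.discard(node_id)
```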
  • Figs. 4A-B illustrate plural subsets of compute nodes in further embodiments, which may be implemented in the distributed system 300 described above.
  • compute nodes 440-1, 440-2, ..., 440-6 (collectively, the compute nodes 440) are divided into three subsets of compute nodes, subsets 410A-C.
  • Each subset 410A-C may be assigned to one or more values of the common parameter of the messages such that the compute nodes within a given subset are configured to process messages corresponding to the assigned value(s).
  • the compute nodes 440 may circulate a plurality of tokens 490A-B in a looping path among the nodes 440, wherein each of the compute nodes 440 may pass the token to another node within the same subset of nodes or a different subset of nodes.
  • the tokens 490A-B may be positioned relative to one another along the path such that multiple compute nodes within the same subset (e.g., compute nodes 440-1 and 440-4 within subset 410A) do not perform maintenance simultaneously.
  • the compute nodes 440 may be configured to enforce a minimum distance along the path between the tokens 490A-B. For example, the compute nodes 440 may receive a token from a preceding compute node in the path only after confirming that a subsequent compute node in the path is not in possession of another token. Further, compute nodes within the same subset may communicate with one another prior to performing a maintenance operation to confirm that at least one other node within the subset is available (e.g., not undergoing maintenance or exhibiting a fault) to process messages assigned to the subset.
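  • One way this spacing check might look, as a sketch; `holds_token`, `is_available`, and `subset_peers` are assumed, hypothetical node attributes rather than features named in the embodiments above:

```python
def may_accept_token(node, path):
    """Accept a token from the predecessor only if the successor on the
    looping path is not already holding another token (keeping tokens
    spaced apart), and a peer in the same subset remains available."""
    successor = path[(path.index(node) + 1) % len(path)]
    if successor.holds_token:
        return False
    # Confirm at least one peer in the same subset can keep processing
    # this subset's assigned values during the maintenance operation.
    return any(peer.is_available() for peer in node.subset_peers)
```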
  • the tokens 490A-B may also be distinct from one another, for example by granting permission to perform different types of maintenance operations.
  • For example, token 490A may grant permission to a compute node to adjust or update the layout of an order book stored to the node’s internal memory, while token 490B may grant permission to a compute node to update a hash collision bucket.
  • particular maintenance operations may require possession of multiple different tokens. For example, if a particular type of maintenance operation requires both updating the layout of an order book and updating a hash collision bucket, then performing that maintenance operation in a compute node may require simultaneous possession by the compute node of tokens 490A and 490B.
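  • Expressed as a small sketch (the token names 490A/490B come from the example above; the operation names and lookup table are hypothetical):

```python
REQUIRED_TOKENS = {
    "orderbook_relayout": {"490A"},
    "hash_bucket_update": {"490B"},
    # A compound operation requires simultaneous possession of both tokens.
    "relayout_and_rehash": {"490A", "490B"},
}

def may_perform(held_tokens: set, operation: str) -> bool:
    # Permitted only when every token the operation requires is held.
    return REQUIRED_TOKENS[operation] <= held_tokens
```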
  • the compute nodes 440 may also circulate additional tokens (not shown), which may grant permission for a maintenance or another operation. For example, some or all of the compute nodes 440 may be reassigned to process messages having different values, thereby rebalancing the distribution of the values among the compute nodes to improve efficiency and latency in processing the messages. Such a reassignment may be triggered by a determination that one or more nodes are not operational (e.g., exhibiting a fault or undergoing a self-maintenance operation), and may provide for one or more compute nodes being assigned to a greater number of values than prior to the reassignment.
  • compute node 440-4 or a gateway may detect that compute node 440-1 is not operational, and, in response, may pass a token to compute node 440-5 to reassign the compute node 440-5 to process messages corresponding to values assigned to subset 410A in addition to subset 410B.
  • the gateway or a compute node may cause a token to be circulated (e.g., among one or more subsets) that reassigns a number of compute nodes to process messages corresponding to that value. It may be advantageous to perform such a reassignment operation using multiple tokens as described above, thereby completing the reassignment more quickly in response to the change in demands on the distributed system.
  • each of the tokens 490A-B may be circulated among the compute nodes 440 as shown in Fig. 4A, but may be associated with a maintenance, reassignment or other operation for a subset of the compute nodes 440.
  • token 490A may be configured to enable the compute nodes 440-1, 440-4 of subset 410A to perform a maintenance operation when in possession of the token 490A, while the compute nodes of subsets 410B-C are configured to pass the token 490A without performing a maintenance operation.
  • the compute nodes of subsets 410B-C may instead perform a maintenance operation when in possession of token 490B.
  • a single token, such as token 490A, may grant permission to perform different operations (e.g., maintenance, reassignment) based on the compute node that is in possession of it.
  • the compute nodes 440 may perform a maintenance, reassignment or other operation based on a state of possession of a token.
  • the compute nodes 440 may circulate several tokens that, when possessed by a compute node, prohibit the node from performing the maintenance or other operation.
  • the compute node 440-1 may perform a maintenance or other operation when it is not in possession of a token.
  • the compute nodes 440 may be configured to perform a maintenance or other operation based on whether another node, such as a node within the same subset, possesses a token.
  • compute node 440-1 may be configured to perform a maintenance operation only when compute node 440-4 does or does not possess a token, thereby ensuring that at least one node of the subset 410A remains operational at all times.
  • Fig. 4B illustrates the compute nodes 440 in a configuration comparable to that of Fig. 4A, with the exception that each subset of nodes 410A-C circulates a respective token 491A-C within the subset.
  • compute nodes 440-1 and 440-4 periodically pass the token 491A between them, performing a self-maintenance operation when in possession of the token 491A.
  • Such a configuration may be advantageous to ensure that each compute node has the opportunity to perform self-maintenance within an optimal timeframe, while also ensuring that at least one node of the subset is available to process messages.
  • the compute nodes 440 may implement a combination of the token configurations shown in Figs. 4A and 4B, passing tokens within and among the subsets of nodes 410A-C.
  • Fig. 5 illustrates a process 500 comprising several steps that may be performed by a compute node such as the compute nodes 340, 440 described above.
  • a compute node 340-1 may parse the message 380 to determine the value of the common parameter of the message (step 505), and compare that value against the one or more values assigned to it (510). If the compute node 340-1 determines a match between the assigned value and the value of the message, the compute node 340-1 may then process the message 380 and forward a reply (e.g., information about the result of the processing) to the gateways 320 or another element (515). Alternatively, if the gateways 320 are configured to perform a sorting function on received messages and forward messages only to compute nodes assigned to process those messages, then the compute node 340-1 may refrain from determining the match.
  • the compute node 340-1 may determine whether to perform a self-maintenance operation. The compute node 340-1 may make this determination based on a period of time since a previous maintenance operation, a state of the node’s internal memory or other stored data, and/or a given type of maintenance operation that may be specified by the token 390. If the compute node 340- 1 determines that a self-maintenance operation is appropriate, then the compute node 340-1 performs the self-maintenance operation (525).
  • the compute node 340-1 may maintain a queue of messages (e.g., at memory 250) that it receives during the self-maintenance operation. After the maintenance operation is done but before the compute node 340-1 may pass the token 390, the compute node 340-1 may be configured to process all messages in the queue and update its state. Newly-received messages may continue to be queued while the queue is non-empty. Once the queue is empty, the compute node 340-1 may then pass the token 390 to a subsequent compute node (e.g., node 340-2) (530).
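  • The steps of process 500 might be sketched as follows; the helper names are hypothetical, and the numbered comments refer to the steps of Fig. 5:

```python
def handle_message_and_token(node, msg, next_node):
    value = parse_common_parameter(msg)          # step 505: parse the message
    if value in node.assigned_values:            # step 510: compare to assignment
        reply = node.process(msg)                # step 515: process and reply
        node.forward_reply(reply)
    if node.holds_token:
        if node.maintenance_due():               # step 520: decide on maintenance
            node.do_self_maintenance()           # step 525: perform maintenance
            node.drain_queued_messages()         # empty the backlog first
        node.pass_token(next_node)               # step 530: pass the token on
```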
  • Fig. 6 illustrates an arrangement 600 of compute nodes in a further embodiment, which may be implemented in the distributed system 300 described above.
  • compute nodes 640-1, 640-2, ..., 640-N (collectively, the compute nodes 640) are configured to circulate a token 690 among them.
  • Each of the compute nodes 640 may be assigned to two or more values of the common parameter of the messages such that the compute nodes 640 are configured to process messages corresponding to the assigned values.
  • values are assigned to the compute nodes 640 in a configuration that may be referred to as a “striping” configuration.
  • each value of the common parameter may be assigned to at least two of the compute nodes 640, and each compute node 640 is assigned to a respective subset of two or more values of the common parameter. Further, each of the subsets may differ from one another by at least one value of the common parameter.
  • under this striping configuration, if any one of the compute nodes 640 becomes unavailable, two or more other compute nodes may be available to process the messages assigned to that node. For example, compute node 640-2 is assigned to process messages corresponding to values 2 and 3.
  • When the compute node 640-2 is in possession of the token 690 and performing a self-maintenance operation, compute node 640-1 may continue to process messages corresponding to value 2, and compute node 640-3 may continue to process messages corresponding to value 3. After completing the self-maintenance operation, the compute node 640-2 may then pass the token to compute node 640-3 and resume processing messages corresponding to values 2 and 3 while compute node 640-3 is undergoing a self-maintenance operation.
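  • A sketch of how such a striped assignment could be generated for N compute nodes; the function is illustrative, not taken from the embodiments above:

```python
def striped_assignment(num_nodes, values):
    # Each value is served by two adjacent nodes on a ring, so every
    # node's subset overlaps with, yet differs from, its neighbours'.
    assignment = {n: set() for n in range(num_nodes)}
    for i, value in enumerate(values):
        assignment[(i - 1) % num_nodes].add(value)
        assignment[i % num_nodes].add(value)
    return assignment

# For num_nodes = 3 and values [1, 2, 3], this yields node 0 -> {1, 2},
# node 1 -> {2, 3}, node 2 -> {3, 1}: each value is served by two nodes,
# and neighbouring subsets differ by exactly one value.
```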
  • Figs. 7A-D illustrate a distributed system 700 in a further embodiment.
  • the distributed system 700 may include some or all features of the electronic trading system 100 and distributed system 300 described above with reference to Figs. 1-3.
  • the distributed system 700 may include participant devices 730 (also referred to as client devices), gateways 720, and compute nodes 740-1, 740-2, 740-3 (collectively, the compute nodes 740), which may incorporate some or all features of the participant devices 130, the gateways 120, and the compute nodes 140, respectively.
  • Each of the participant devices 730, gateways 720, and compute nodes 740 may also incorporate some or all features of the mesh node 200 described above with reference to Fig. 2.
  • the compute nodes 740 are assigned to process messages, such as message 780, corresponding to values of a common parameter in a “striping” configuration as described above with reference to Fig. 6.
  • compute node 740-1 is assigned to process messages corresponding to values 1 and 2
  • compute node 740-2 is assigned to process messages corresponding to values 2 and 3
  • compute node 740-3 is assigned to process messages corresponding to values 3 and 1.
  • compute node 740-1 is currently in possession of the token 790. During this state of possession of the token 790, the compute node 740-1 may perform one or more self-maintenance operations such as those described above.
  • During the self-maintenance operation, the compute node 740-1 may be unavailable to process messages. However, because compute node 740-2 is also configured to process messages corresponding to value 2, and compute node 740-3 is also configured to process messages corresponding to value 1, those messages may continue to be processed without an increased delay.
  • Figs. 7B and 7D illustrate circulation of the token 790 among the compute nodes 740.
  • the compute node 740-1 upon completing the self-maintenance operation, may pass the token 790 to compute node 740-2.
  • the compute node 740-2 once in possession of the token 790, may perform a self-maintenance operation as described above.
  • During this time, compute nodes 740-1 and 740-3 may continue to process messages corresponding to values 2 and 3, respectively, and those messages may also be queued at the compute node 740-2.
  • Fig. 7D illustrates the distributed system 700 during an event wherein one of the compute nodes, compute node 740-3, is exhibiting a fault or is otherwise unresponsive.
  • the compute node 740-3 may be undergoing a maintenance operation.
  • the gateways 720 may detect this state of the compute node 740-3 based on an absence of a response by the compute node 740-3, an indicator (e.g., error message) provided by the compute node 740-3, or another indication that the compute node 740-3 is unavailable to process messages.
  • the remaining compute nodes 740-1, 740-2 may be temporarily assigned to process messages corresponding to the values of compute node 740-3 in addition to the values that they are currently assigned.
  • compute node 740-1 may be temporarily assigned to process messages corresponding to value 3, and compute node 740-2 may be temporarily assigned to process messages corresponding to value 1.
  • This reconfiguration may be initiated by an instruction from the gateways 720, sequencers 150 (see Fig. 1), or may be carried out independently by compute nodes 740-1, 740-2 in response to detecting that compute node 740-3 is down.
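  • As a hedged sketch of such a reconfiguration, using a simple least-loaded heuristic (the actual system may redistribute differently, e.g., via circulated tokens or gateway instruction):

```python
def redistribute_on_failure(assignment, failed_node):
    # Reassign each value served by the failed node to a remaining node
    # not already serving it, choosing the least-loaded candidate so that
    # every value keeps at least two live processors.
    orphaned = assignment.pop(failed_node)
    for value in orphaned:
        candidates = [n for n, vals in assignment.items() if value not in vals]
        if candidates:
            target = min(candidates, key=lambda n: len(assignment[n]))
            assignment[target].add(value)
    return assignment

# With nodes {1: {1, 2}, 2: {2, 3}, 3: {3, 1}} and node 3 failing, value 3
# is added to node 1 and value 1 to node 2, as in the example above.
```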
  • values of the common parameter assigned to compute nodes 740 may be re-distributed even when all nodes are fully operational, for example, to better balance the load throughout the compute nodes 740, due to changing conditions.
  • the operational compute nodes (i.e., nodes 740-1, 740-2) may continue to circulate the token 790 under a modified sequence that skips compute node 740-3.
  • the compute nodes 740 may return to the processing and circulation configurations shown in Figs. 7A-C.
  • one or more of the compute nodes 740 may be dedicated to processing messages corresponding to the value of any compute node that is currently unavailable, and optionally may refrain from processing messages when a self-maintenance operation is not occurring at any of the other compute nodes 740.
  • Further example embodiments disclosed herein may be configured using a computer program product; for example, controls may be programmed in software for implementing example embodiments. Further example embodiments may include a non-transitory computer- readable medium containing instructions that may be executed by a processor, and, when loaded and executed, cause the processor to complete methods described herein. It should be understood that elements of the block and flow diagrams may be implemented in software or hardware, such as via one or more arrangements of the components disclosed above, or equivalents thereof, firmware, custom designed semiconductor logic, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), a combination thereof, or other similar implementation determined in the future.
  • the elements of the block and flow diagrams described herein may be combined or divided in any manner in software, hardware, or firmware. If implemented in software, the software may be written in any language that can support the example embodiments disclosed herein.
  • the software may be stored in any form of computer readable medium, such as one or more random access memor(ies) (RAMs), read only memor(ies) (ROMs), compact disk read-only memor(ies) (CD-ROMs), and so forth.
  • a general purpose or application-specific processor or processing core loads and executes software in a manner well understood in the art.
  • the block and flow diagrams may include more or fewer elements, be arranged or oriented differently, or be represented differently. It should be understood that implementation may dictate the block, flow, and/or network diagrams and the number of block and flow diagrams illustrating the execution of embodiments disclosed herein.
  • For example, rather than assigning an equal number of values to each node, compute nodes 640-1, 640-2, 640-3, and 640-N may be assigned 4 values, 8 values, 3 values, and 7 values, respectively, to distribute the load more evenly among the compute nodes.
  • a value may be assigned to more than two compute nodes.
  • each value may be assigned to three or more compute nodes, and/or, in some embodiments, some values may be assigned to more compute nodes than other values.

Abstract

A distributed system comprises a plurality of compute nodes configured to process messages. The compute nodes each process messages corresponding to an assigned value of a common parameter of the messages. The values are assigned to the compute nodes such that two or more compute nodes are available to process each message. The values may be assigned to the compute nodes in a grouping configuration or a striping configuration. The compute nodes also circulate one or more tokens among the nodes, and perform a self-maintenance operation during a given state of possession of the token. During a self-maintenance operation, the values assigned to the compute node may be reassigned to other compute nodes to ensure processing of corresponding messages.