WO2018007988A1 - System for accelerating data transmission in network interconnections - Google Patents

Publication number
WO2018007988A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
network card
node
receiving
source node
Prior art date
Application number
PCT/IB2017/054100
Other languages
French (fr)
Inventor
Roberto AMMENDOLA
Piero VICINI
Pier Stanislao PAOLUCCI
Alessandro Lonardo
Ottorino FREZZA
Francesca LO CICERO
Michele Martinelli
Andrea BIAGIONI
Francesco SIMULA
Original Assignee
Istituto Nazionale Di Fisica Nucleare (Infn)
Priority date
Filing date
Publication date
Application filed by Istituto Nazionale Di Fisica Nucleare (Infn)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00: Data switching networks
    • H04L12/02: Details
    • H04L12/16: Arrangements for providing special services to substations
    • H04L12/18: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1863: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast comprising mechanisms for improved reliability, e.g. status reports
    • H04L12/1877: Measures taken prior to transmission

Definitions

  • the data transmission system of the present invention makes it possible to activate the transmission of one of such data fractions to each of the receiving nodes in succession. The result is thus that all the receiving nodes may receive a data fraction without waiting for the completion of the data transmission previously sent to other receiving nodes.
  • the network card associated with the source node is configured so as to comprise a device, positioned on the network card itself, able to store information related to the type of data fraction which was transmitted to a particular receiving node and the corresponding address of such receiving node.
  • the data transmission system of the present invention allows the card associated with the source node to progressively store, for each of the receiving nodes, the information related to the various data fractions which were transmitted to it. Furthermore, since the network card associated with the source node is configured so as to store the address of the receiving nodes, whenever a new data fraction must be transmitted to a particular receiving node for which the address was stored, it is not necessary to create such address again, because it is already available, with a consequent reduction of the time necessary for transmitting the data to that particular receiving node.
  • the network card associated with the source node is configured so as to check, for each receiving node, whenever the transmission of a data fraction to that receiving node has been completed, whether such receiving node has received all the data fractions which should have been transmitted to it.
  • the data transmission system of the present invention allows the card associated with the source node to activate a new transmission of data fractions to that particular receiving node, different from the data fraction already transmitted to it.
  • the data transmission to that particular receiving node is considered complete.
  • the network card associated with the source node is configured so that, once the transmission of a data fraction to a receiving node is completed, such data fraction is transmitted to another receiving node to which the data fraction has not yet been transmitted, as soon as a previous data fraction transmission to such another receiving node is completed.
  • said network card associated with said source node is identical to each of said network cards respectively associated with each of said receiving nodes.
  • since all nodes are the same, each node may be used either as source or as receiver; consequently, in the context of high-performance scientific computing, the concept of source node and of receiver node is dynamic and interchangeable during the execution of the application.
  • said source node consists of a computer or a more elementary computing element or aggregation of multiple computing elements; for example, such source node may consist of a processor (CPU) and a memory (RAM).
  • each of said receiving nodes consists, in the same manner, of a computer or a more elementary computing element or the aggregation of multiple computing elements; for example, such receiving node may consist of a processor (CPU) and a memory (RAM).
  • said interconnection network of the present invention between at least one source node and said plurality of distinct receiving nodes is a network between computing elements which may also be heterogeneous (different types and numbers of CPUs).
  • said network card associated with said source node is further configured so as to sub-divide the data package to be transferred or transmitted into a plurality of fractions or sub-packages, as well as to specify the receiving node or receiver of each fraction and/or sub-package. In this manner, the sub-dividing operation of the data package to be transmitted by the source node to the receiving nodes is performed by the hardware present in the network card.
  • said interconnection network is free from an external entity of the switch type. Since the network card is provided with a multiplicity of communication channels, it can switch the data traffic coming from the network; the resulting network consists of point-to-point connections between calculating nodes in an arbitrary topology, which may be source and receiving nodes at the same time.
  • the present invention further relates to a method for transmitting data like the one indicated in claim 10, i.e. to a method for transmitting data in an interconnection network between at least one source node and a plurality of receiving nodes, wherein said source node has an associated network card and each of said plurality of distinct receiving nodes has its own associated network card, wherein, according to the method, by means of said network card associated with said source node, the initial set of data to be transmitted to each of said plurality of distinct receiving nodes is sub-divided into a plurality of distinct data fractions, and the information related to the starting address and to which one or more of the distinct data fractions must be transmitted to each specific receiving node is communicated to said network card associated with said source node.
  • the operation speed in an interconnection network between nodes is increased, in particular by transmitting different parts of a same data package available in the source node to different receiving nodes.
  • said method for transmitting data comprises the following steps: a) Preparing a data packet to be transmitted to each of said plurality of receiving nodes by the at least one source node; b) Notifying the information to said network card associated with said source node by a driver in order to transmit the data to each of said plurality of receiving nodes; c) Reading said data thus notified by the driver and transmitting them to each of said plurality of receiving nodes by said network card associated with said source node; d) Communicating the transmission confirmation by the network card associated with said source node.
  • said method further comprises the step e) of storing information related to the previously performed transfers on a hardware portion of the network card associated with said source node.
  • said method further comprises the step f) in which said network card associated with said source node communicates to said driver the confirmation of the data transmission to the various receiving nodes.
  • FIG. 1 is a schematic view of a data transmission system according to a first embodiment of the present invention in which a source node and three receiving nodes are shown;
  • FIG. 2 is a schematic view of a data transmission system according to a second embodiment of the present invention in which a source node and three receiving nodes are shown;
  • FIG. 3 is a schematic view of a data transmission system according to a third embodiment of the present invention in which a source node and three receiving nodes are shown.
  • a source node 3 e.g. a computer
  • a series of three receiving nodes 4A, 4B and 4C e.g. three other computers
  • each of the receiving nodes 4A, 4B and 4C communicates with the source node 3 by means of the interconnection network 2.
  • a network card 5 is associated with the source node 3
  • each of the network cards 5A, 5B and 5C is respectively associated with one of the receiving nodes 4A, 4B and 4C, wherein the network cards 5, 5A, 5B and 5C are identical to one another.
  • the source node 3 comprises a memory 6, in which a storage area 7, or "buffer", is provided, adapted to temporarily contain the data waiting to be transmitted to the receiving nodes 4A, 4B and 4C which, in turn, each contain a corresponding memory 6A, 6B and 6C to hold the data transmitted to them.
  • the data transmission system performs the following steps: the driver writes a data string in the memory 6 of the computer 3 and notifies the buffer 7 that the data string is ready in such memory 6.
  • the network card 5 associated with the source node sub-divides the data to be transmitted to the receiving nodes 4A, 4B and 4C into a series of data fractions F1, F2, F3, indicating the initial physical memory address and the data size for each data fraction F1, F2, F3.
  • the system checks how many receiving nodes are involved in the data transmission which is about to be started and the size of the data packages which must be prepared starting from such data string for the purpose of being transmitted to each receiving node.
  • the number of fractions X may be higher or lower than the number of receiving nodes N.
  • the first data fraction F1 is then transmitted from the source node 3 to the receiving node 4A by means of the network card 5 associated with the source node 3 and by means of the network card 5A associated with the first receiving node 4A.
  • the network card 5 associated with the source node 3 also activates the transmission of the second data fraction F2 from the source node 3 to the second receiving node 4B by means of the network card 5B associated with the receiving node 4B and, at the same time, the similar transmission of the data fraction F3 to the receiving node 4C.
  • the network card 5 associated with the source node 3 would activate the transmission in sequence of N data fractions FN, one for each of the N receiving nodes.
  • the network card 5 of the source node 3 thus stores the transfer confirmation of such data fraction and the address of the corresponding receiving node 4N in a memory device positioned on the same network card 5. In this manner, since the address of the receiving nodes to which the data fractions have been transmitted is already available, the transmission procedure is accelerated when a new data fraction must be sent to a receiving node 4N which is already known by the network card 5 associated with the source node 3.
  • the network card 5 associated with the source node 3 checks whether all data fractions F1 .. FX intended for such receiving node 4N were transmitted to it. If there is still a data fraction Fy to be transmitted to such receiving node 4N, the network card 5 associated with the source node 3 checks whether such data fraction Fy is ready for transmission to such receiving node 4N or whether a similar transmission of such data fraction Fy to another receiving node different from the receiving node 4N is still in progress. As soon as such data fraction Fy is ready to be transmitted to such receiving node 4N, the network card 5 associated with the source node 3 activates such transmission of data fraction Fy to the receiving node 4N.
  • the network card 5 associated with the source node 3 updates the data stored on the memory device positioned inside it so as to be able to easily check how many data fractions were sent to each single receiving node and, consequently, what and which fractions of data must still be transmitted to each receiving node.
  • the data buffer 7 was sub-divided into the various data fractions F1, F2, F3 so that each of the receiving nodes 4A, 4B and 4C has received a data fraction different from that of the other receiving nodes. Since a network card 5A, 5B and 5C is associated with a corresponding receiving node 4A, 4B and 4C and each of such network cards 5A, 5B and 5C can communicate independently of the others, by means of the interconnection network 2, with the network card 5 associated with the source node 3, it follows that each network card 5A, 5B and 5C can autonomously manage the transmission of data fraction F1 ..
  • the transmissions of data fractions F1 .. FX from the source node 3 to the various receiving nodes 4A, 4B and 4C may occur substantially at the same time.
  • the receiving node 4A receives the data fraction F1
  • the receiving node 4B receives the data fractions F1 and F2
  • the receiving node 4C receives the data fractions F1, F2 and F3. So, the data fraction F1 is transmitted to all the receiving nodes 4A, 4B and 4C, while the data fraction F2 is transmitted to the receiving nodes 4B and 4C, and the data fraction F3 is transmitted only to receiving node 4C.
  • a third embodiment of the data transmission system and of the respective method according to the present invention is described below with reference to Figure 3.
  • Such third embodiment is substantially similar to the embodiment shown above with reference to Fig. 1 and 2, with the only difference being that all data fractions F1, F2, F3 are transmitted to all the receiving nodes 4A, 4B and 4C.
  • each node may be used either as source node or as receiver according to needs. Consequently, within the frame of high-performance scientific computing, the concept of source node and receiver node is dynamic and interchangeable during the execution of an application.
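
The bookkeeping that the excerpts above attribute to the network card of the source node (recording which data fractions each receiving node has already received, and which are still owed to it) can be pictured with a small software model. This is only a sketch under stated assumptions: the patent describes a hardware memory device on the card, while the class name TransferTable, its methods and the use of Python dictionaries are hypothetical and chosen purely for illustration. The example plan is the second embodiment (4A receives F1, 4B receives F1 and F2, 4C receives F1, F2 and F3).

```python
class TransferTable:
    """Toy model of the per-receiver bookkeeping kept on the source card.

    Hypothetical illustration only; the patent describes a hardware device.
    """

    def __init__(self, plan):
        # plan maps a receiver name to the fraction indices it must receive
        self.pending = {node: list(fracs) for node, fracs in plan.items()}
        self.sent = {node: [] for node in plan}

    def confirm(self, node, fraction):
        """Record a completed transfer and return the next fraction owed
        to that receiver, or None when nothing is left to send."""
        self.pending[node].remove(fraction)
        self.sent[node].append(fraction)
        return self.pending[node][0] if self.pending[node] else None

    def is_complete(self, node):
        """True when every fraction planned for the receiver was confirmed."""
        return not self.pending[node]


# Second embodiment: 4A gets F1, 4B gets F1 and F2, 4C gets F1, F2 and F3.
table = TransferTable({"4A": [1], "4B": [1, 2], "4C": [1, 2, 3]})
print(table.confirm("4B", 1))    # next fraction still owed to 4B
print(table.is_complete("4A"))   # 4A has not confirmed F1 yet
```

After each confirmation the table answers the two questions raised in the text: whether the transmission to a given receiving node is complete and, if not, which fraction should be activated next.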

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Communication Control (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention relates to an accelerated data transmission system in an interconnection network between at least one source node and a plurality of receiving nodes. Each of said source and receiving nodes is associated with a network card of the same type. The source node can sub-divide the initial set of data to be transmitted to each of said plurality of receiving nodes and communicate to said network card associated with it the information related to the starting address and to which one or more of the distinct data fractions must be transmitted to each specific receiving node. In this manner, the system of the present invention can increase the data transmission operation speed by means of an interconnection network between nodes, transmitting different parts of the same data package available in the source node to different receiving nodes at the same time. The present invention also relates to a method for transmitting data which uses such network card.

Description

SYSTEM FOR ACCELERATING DATA TRANSMISSION IN NETWORK INTERCONNECTIONS
Field of the invention
The present invention relates to a data transmission system in an interconnection network between calculating nodes and to a corresponding method, the present invention being aimed at accelerating the data transmission and/or transfer operations.
Prior art
Data transmission systems and methods are known in the prior art.
For example, data transmission systems are known which comprise a network card of the Ethernet type or, more generally, of the IEEE 802 network protocol family, typically present in any type of computer. By means of such Ethernet networks it is possible to broadcast (i.e. send a data package to all devices or computers connected by means of the same network) or multicast (i.e. send a data package to some of the devices or computers connected by means of the same network). However, the systems which use Ethernet cards cannot be advantageously used in the context of high-performance computing, for purely technical reasons. Indeed, in the systems which use Ethernet cards it would be necessary to change the network configuration parameters (IP address, netmask, etc.) continuously and variably while the high-performance computing application to be accelerated is being executed. The network protocol implemented in common LAN networks envisages special addresses, which guarantee broadcasting on selectable subnetworks; in order to define multicast or broadcast domains, it would be necessary to modify the network parameters of the nodes related to the networks while the application is being executed. This cannot be easily achieved from the practical point of view and may even be expressly forbidden to a computer user. Finally, LAN-type protocols and peripheral devices have intrinsic latency features which make them not very suitable for building high-performance computing networks.
Other data transmission systems are also known, which comprise a network card of the InfiniBand type, described for example in patents US 9110860 and US 8811417 by Mellanox. Such data transmission systems can achieve an acceleration of some types of collective operations: broadcast operations, in which a data buffer is transferred from a source node to all the other nodes; the so-called "barrier" operation, i.e. the synchronization operation which is technically performed as a broadcast by sending an empty or minimum-size package; and the so-called "reduction" operation, in which all the nodes communicate a data set to a single receiving node, which performs a global arithmetic operation on it (sum, multiplication, etc.). Such operations are mainly performed on the switch (network switch) and are mostly used to optimize reduction operations, i.e. those in which a node collects all the data from the nodes concerned in the operation and performs a global arithmetic operation (e.g. a sum) on them. In such data transmission systems, the acceleration concerns the routing of the packages which belong to the aforesaid collective type communications, i.e. choosing an optimized route for each single package which must reach the destination, as well as the capacity of the switch to perform part of the arithmetic calculations required on the transiting data.
Furthermore, other data transmission systems are known, e.g. applied to non-commercial supercomputer networks of the IBM Blue Gene type, as described in patent US 8001280, in which the collective reduction and broadcasting operations are optimized by using a dedicated network with tree topology. Optimizations of this type are based on the presence of a secondary specialized network and are not effective in the case of generalized multicasts.
Finally, for example in the simplified and general case of a broadcast in a system with one source node and three receiving nodes A, B and C, the typical operation flow comprises the following steps, in sequence:
- the source node prepares the data package to be transmitted;
- the driver notifies the network card of the information needed to transmit such data package to the receiving node A (specifically: the amount of data to be transferred, the memory address where the data are located, the receiving node address, the memory address of the receiving node);
- the network card reads the data and transmits them to the receiving node A;
- the driver waits for its own network card to confirm the transmission to the receiving node A;
- the driver notifies the network card of the same information indicated above, updated for the receiving node B;
- the network card reads the data and transmits them to the receiving node B;
- the driver waits for its own network card to confirm the transmission to the receiving node B;
- the driver notifies the network card of the same information indicated above, updated for the receiving node C;
- the network card reads the data and transmits them to the receiving node C;
- the driver waits for its own network card to confirm the transmission to the receiving node C.
Consequently, in such data transmission systems according to the prior art in which multiple receiving nodes are present, data can be transmitted to each single receiving node only after the driver has received the confirmation communication that the transmission of data to the preceding node has been completed. Such waiting period between the sending of a data block to one node and the successive sending of the same data block to the next node, repeated by the number of receiving nodes present in the system, implies a considerable use and waste of time for transmitting the same data block to each receiving node in the network.
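
The cost of this serialized flow can be illustrated with a simple timing model. The following is a minimal sketch that is not part of the patent text: the function name and the three per-transfer costs (driver notification, transmission, confirmation), expressed in arbitrary time units, are hypothetical parameters introduced only for illustration.

```python
def sequential_broadcast_time(num_receivers, notify, transmit, confirm):
    """Total time when the driver serves one receiver at a time.

    For each receiver the driver notifies the card, the card transmits,
    and the driver waits for the confirmation before moving on.
    All three costs are hypothetical, in arbitrary time units.
    """
    total = 0.0
    for _ in range(num_receivers):
        total += notify + transmit + confirm
    return total


# Three receivers A, B and C, as in the prior-art example.
print(sequential_broadcast_time(3, notify=1.0, transmit=10.0, confirm=1.0))
```

In this model the total time grows linearly with the number of receiving nodes, because no transfer can start before the previous one has been confirmed.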
It is the object of the present invention to overcome or at least minimize the drawbacks affecting the data transmission systems and methods according to the prior art and briefly summarized above. Consequently, in this context, the present invention is designed to deal with the need to find an alternative data transmission system which can minimize or at least reduce transmission times, in particular by accelerating data transmission so that all the receiving nodes present in the system receive the data in acceptable times.
Summary of the invention
In general, the present invention relates to a data transmission system in an interconnection network between calculating nodes according to claim 1.
Indeed, the present invention arises from the general consideration according to which the technical problem illustrated above may be overcome in an efficient and reliable manner by means of a data transmission system, and a method thereof, for the transmission of data through an interconnection network between at least one source node and a plurality of distinct receiving nodes, wherein said source node has an associated network card and each of said plurality of distinct receiving nodes has its own associated network card. Such data transmission system is characterized in that said source node is configured so as to sub-divide the initial set of data to be transmitted into a plurality of X distinct data fractions and to communicate to said network card associated with it the information related to the starting address and to which one or more of the distinct data fractions must be transmitted to each specific receiving node.
Therefore, according to the present invention, the data transmission speed can be increased by means of an interconnection network between nodes, by transmitting different parts of the same data package available in the source node to different receiving nodes, so that such data are transmitted to each receiving node independently of the other nodes.
According to an embodiment, said network card associated with said source node is configured so as to sub-divide said data portion to be transferred to each receiving node into network packages according to a predetermined communication protocol. In this manner, by virtue of the network cards used in the present invention, it is possible to transfer only a desired portion of such data available in the source node to the receiving nodes each time.
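By way of non-limiting illustration, the sub-division of one data portion into network packages can be sketched as follows. The `Packet` structure, the `packetize` function and the 4096-byte payload limit are assumptions made for this sketch; the description does not fix a particular communication protocol or package size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dest: str       # label of the receiving node, e.g. "4A" (illustrative)
    seq: int        # position of this package within the data portion
    payload: bytes  # at most max_payload bytes


def packetize(fraction: bytes, dest: str, max_payload: int = 4096) -> list[Packet]:
    """Cut one data portion into packages of at most max_payload bytes.

    max_payload is a hypothetical protocol limit, not a value from the source.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [
        Packet(dest, seq, fraction[offset:offset + max_payload])
        for seq, offset in enumerate(range(0, len(fraction), max_payload))
    ]


packets = packetize(bytes(10_000), dest="4A")
print([len(p.payload) for p in packets])  # [4096, 4096, 1808]
```

Concatenating the payloads in `seq` order restores the original data portion, which is the property a receiving card would rely on.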
According to an embodiment, at least one data fraction, different from the at least one data fraction transmitted to each other receiving node, is transmitted to each receiving node. In this manner, the data buffer is distributed into parts of even different size between the receiving nodes.
According to an embodiment, at least one same data fraction is transmitted to at least two different receiving nodes.
In this manner, identical parts of the data buffer may also be transmitted between the receiving nodes.
According to an embodiment, the totality of said distinct data fractions is transmitted to each distinctive receiving node.
In this manner, the data buffer may be entirely transmitted to the receiving nodes.
According to an embodiment, said network card associated with said source node is configured to contain a memory portion therein, such as for example a buffer, in which such data fractions to be transferred to each receiving node are temporarily stored.
According to an embodiment, said network card associated with said source node is configured so as to transmit a first data fraction of said data fractions to a first receiving node, via said network card associated with said first receiving node, and also configured to transmit in sequence other data fractions to a sequence of other receiving nodes, different from each other and different from such first receiving node, by means of said network cards respectively associated with said other receiving nodes, without needing to wait for the completion of data transmission to the previously activated receiving node(s).
In this manner, contrary to that occurring in the transmission systems of prior art, in which it was necessary to wait for an entire data package to be transmitted by the source node to a first receiving node before performing a similar transmission of the same data package from the same source node to a second receiving node, the data transmission system of the present invention makes it possible to activate the transmission of one of such data fractions to each of the receiving nodes in succession. The result is thus that all the receiving nodes may receive a data fraction without waiting for the completion of the data transmission previously sent to other receiving nodes.
According to an embodiment, the network card associated with the source node is configured so as to comprise a device, positioned on the network card itself, able to store information related to the type of data fraction which was transmitted to a particular receiving node and the corresponding address of such receiving node.
In this manner, the data transmission system of the present invention allows the card associated with the source node to progressively store, for each of the receiving nodes, the information related to the various data fractions which were transmitted to it. Furthermore, since the network card associated with the source node is configured so as to store the address of the receiving nodes, whenever a new data fraction must be transmitted to a particular receiving node for which the address was stored, it is not necessary to create such address again, because it is already available, with a consequent reduction of the time necessary for transmitting the data to that particular receiving node.
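By way of non-limiting illustration, the information kept on the card can be modelled as a small table holding, for each receiving node, the fractions already sent and the address already created. The class `TransferTable`, its method names and the `resolve` callback are hypothetical names introduced for this sketch only.

```python
from collections import defaultdict


class TransferTable:
    """Hypothetical model of the on-card store: sent fractions and known addresses."""

    def __init__(self):
        self.sent = defaultdict(set)  # node label -> fraction ids already sent
        self.addresses = {}           # node label -> address created earlier

    def address_of(self, node, resolve):
        """Return the stored address; call resolve(node) only on first use."""
        if node not in self.addresses:
            self.addresses[node] = resolve(node)
        return self.addresses[node]

    def record(self, node, fraction_id):
        """Note that fraction_id has been transmitted to node."""
        self.sent[node].add(fraction_id)

    def pending(self, node, intended):
        """Fractions intended for node that are not yet recorded as sent."""
        return sorted(set(intended) - self.sent[node])


calls = []


def resolve(node):
    calls.append(node)  # counts how often an address has to be created
    return f"addr-of-{node}"


table = TransferTable()
table.address_of("4A", resolve)
table.address_of("4A", resolve)  # second request is served from the table
table.record("4A", 1)
print(calls, table.pending("4A", [1, 2, 3]))  # ['4A'] [2, 3]
```

The second call to `address_of` does not invoke `resolve` again, which mirrors the statement that an address already stored need not be created a second time.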
According to an embodiment, the network card associated with the source node is configured so as to check, for each receiving node and whenever the transmission of a data fraction to that receiving node has been completed, whether the receiving node has received all the data fractions which should have been transmitted to it.
Consequently, if a particular receiving node has not yet received all the data fractions intended for it, the data transmission system of the present invention allows the card associated with the source node to activate a new transmission to that particular receiving node of a data fraction different from those already transmitted to it. In the opposite case, the data transmission to that particular receiving node is considered complete. Such check is applied to all receiving nodes.
According to an embodiment, the network card associated with the source node is configured so that, once the transmission of a data fraction to a receiving node is completed, such data fraction is transmitted to another receiving node to which the data fraction has not yet been transmitted, as soon as a previous data fraction transmission to such another receiving node is completed.
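By way of non-limiting illustration, the completion check and the hand-over of a fraction from one receiving node to the next can be sketched as a round-based simulation. Two simplifications are assumptions of this sketch and not statements of the description: time advances in discrete rounds, and in each round a given fraction is in flight to at most one receiving node while each receiving node accepts at most one fraction.

```python
def schedule(intended):
    """Simulate the transfers round by round.

    intended maps a node label to the list of fraction ids it must receive.
    Returns a list of rounds; each round maps node -> fraction sent to it.
    """
    pending = {node: list(fracs) for node, fracs in intended.items()}
    rounds = []
    while any(pending.values()):
        in_flight = set()  # fractions already being sent in this round
        current = {}
        for node, fracs in pending.items():
            # first pending fraction that is not being sent to another node
            choice = next((f for f in fracs if f not in in_flight), None)
            if choice is not None:
                in_flight.add(choice)
                current[node] = choice
        for node, frac in current.items():
            pending[node].remove(frac)  # completion check: update what is left
        rounds.append(current)
    return rounds


# Three receivers: A needs one fraction, B needs two, C needs three.
plan = schedule({"A": [1], "B": [1, 2], "C": [1, 2, 3]})
print(plan)  # [{'A': 1, 'B': 2, 'C': 3}, {'B': 1, 'C': 2}, {'C': 1}]
```

In the first round every receiver is served with a different fraction; a receiver that is already complete simply drops out of later rounds.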
In this manner, by using the data fractions stored in the memory part of the network card associated with the source node, together with the network cards associated with the corresponding receiving nodes, which communicate with the network card associated with the source node by means of the interconnection network, the work load of the source node is decreased when the data transfer operation must be performed. Indeed, each data transfer operation is performed dynamically, since the transfer of such data fractions can be managed directly by the network card associated with the source node.
According to an embodiment, said network card associated with said source node is equal to each of said network cards respectively associated with each of said receiving nodes. In this manner, since all nodes are the same, each node may be used either as source or as receiver; consequently, in the context of high-performance scientific computing, the concept of source node and of receiver node is dynamic and interchangeable during the execution of the application.
According to an embodiment, said source node consists of a computer or a more elementary computing element or aggregation of multiple computing elements; for example, such source node may consist of a processor (CPU) and a memory (RAM).
According to an embodiment, each of said receiving nodes consists, in the same manner, of a computer or a more elementary computing element or an aggregation of multiple computing elements; for example, such receiving node may consist of a processor (CPU) and a memory (RAM).
In this manner, said interconnection network of the present invention between at least one source node and said plurality of distinct receiving nodes is a network between computing elements which may also be heterogeneous (different types and numbers of CPUs).
According to an embodiment, said network card associated with said source node is further configured so as to sub-divide the data package to be transferred or transmitted into a plurality of fractions or sub-packages, as well as to specify the receiving node or receiver of each fraction and/or sub-package.
In this manner, the sub-dividing operation of the data package to be transmitted by the source node to the receiving nodes is performed by the hardware present in the network card.
According to an embodiment, said interconnection network is free from an external entity of the switch type. Since the network card is provided with a multiplicity of communication channels, it can itself switch the data traffic coming from the network; the resulting network consists of point-to-point connections, in an arbitrary topology, between computing nodes which may be source and receiving nodes at the same time.
In this manner, since the presence of such external entity in the interconnection network is not necessary, a saving is achieved in the number of components of the data transmission system.
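By way of non-limiting illustration, a network without an external switch can be pictured as a graph whose edges are the point-to-point channels of the cards, with intermediate cards forwarding the traffic. The description does not specify how a path is chosen; the breadth-first search below and the four-node ring are assumptions used only to make the idea concrete.

```python
from collections import deque


def route(links, src, dst):
    """Return a shortest hop path from src to dst over point-to-point links."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in links.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # dst cannot be reached


# Hypothetical ring: every card uses two channels and no switch is present.
ring = {
    "3": ["4A", "4C"],
    "4A": ["3", "4B"],
    "4B": ["4A", "4C"],
    "4C": ["4B", "3"],
}
print(route(ring, "3", "4B"))  # ['3', '4A', '4B']
```

Here the card of node 4A forwards the traffic addressed to node 4B, playing the role that a switch would play in a conventional network.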
The present invention further relates to a method for transmitting data like the one indicated in claim 10, i.e. to a method for transmitting data in an interconnection network between at least one source node and a plurality of receiving nodes, wherein said source node has an associated network card and each of said plurality of separate receiving nodes has its own associated network card, wherein, according to the method, by means of said network card associated with said source node, the initial set of data to be transmitted to each of said plurality of distinct receiving nodes is sub-divided into a plurality of distinct data fractions, and the information related to the starting address and to which one or more of the distinct data fractions must be transmitted to each specific receiving node is transmitted to said network card associated with said source node.
In this manner, the operation speed in an interconnection network between nodes is increased, in particular by transmitting different parts of a same data package available in the source node to different receiving nodes.
Preferably, said method for transmitting data comprises the following steps:
a) preparing, by the at least one source node, a data packet to be transmitted to each of said plurality of receiving nodes;
b) notifying, by a driver, said network card associated with said source node of the information needed to transmit the data to each of said plurality of receiving nodes;
c) reading, by said network card associated with said source node, the data thus notified by the driver and transmitting them to each of said plurality of receiving nodes;
d) communicating the transmission confirmation by the network card associated with said source node.
According to an embodiment, said method further comprises the step e) of storing information related to the previously performed transfers on a hardware portion of the network card associated with said source node.
In this manner, such information thus stored can be reused if a new data transfer to a same address of the receiving node is required.
According to an embodiment, said method further comprises the step f) in which said network card associated with said source node communicates to said driver the confirmation of the data transmission to the various receiving nodes.
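By way of non-limiting illustration, steps a) to f) of the method can be walked through in software as follows. `SourceCard`, `post`, the descriptor layout and the `send` callback are hypothetical names introduced for this sketch; the loop runs sequentially and therefore shows only the order of the steps, not the overlap of the transfers obtained in hardware.

```python
class SourceCard:
    """Hypothetical stand-in for the network card associated with the source node."""

    def __init__(self, memory, send):
        self.memory = memory  # bytes object standing in for the host memory
        self.send = send      # callable(node, data) standing in for the network
        self.history = []     # step e): information on completed transfers

    def post(self, descriptor, notify_driver):
        """descriptor maps a node label to (offset, size) inside self.memory."""
        for node, (offset, size) in descriptor.items():
            data = self.memory[offset:offset + size]   # step c): read the data
            self.send(node, data)                      # step c): transmit it
            self.history.append((node, offset, size))  # step e): store the info
            notify_driver(node)                        # steps d) and f): confirm


memory = bytes(range(12))  # step a): data prepared by the source node
received, confirmed = {}, []


def send(node, data):
    received[node] = data  # the receiving node keeps the data in its memory


card = SourceCard(memory, send)
# step b): the driver notifies the card of what goes to which receiving node
card.post({"4A": (0, 4), "4B": (4, 4), "4C": (8, 4)}, confirmed.append)
print(received, confirmed)
```

After the call, `card.history` holds one entry per completed transfer, which is the information that step e) proposes to reuse for later transfers to the same receiving node.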
Further possible embodiments of the present invention are specified in the claims.
Brief description of the figures
The present invention is clarified below in greater detail by means of the detailed description of the embodiment depicted by way of non-limiting example in the accompanying drawings, in which:
- Figure 1 is a schematic view of a data transmission system according to a first embodiment of the present invention in which a source node and three receiving nodes are shown;
- Figure 2 is a schematic view of a data transmission system according to a second embodiment of the present invention in which a source node and three receiving nodes are shown;
- Figure 3 is a schematic view of a data transmission system according to a third embodiment of the present invention in which a source node and three receiving nodes are shown.
Detailed description
An embodiment of the data transmission system and of the respective method according to the present invention is described below with reference to Figures 1-3.
In each of Figs. 1-3 there are depicted a source node 3 (e.g. a computer) and a series of three receiving nodes 4A, 4B and 4C (e.g. three other computers), each of which communicates with the source node 3 by means of the interconnection network 2. A network card 5 is associated with the source node 3, while each of the network cards 5A, 5B and 5C is respectively associated with one of the receiving nodes 4A, 4B and 4C, wherein the network cards 5, 5A, 5B and 5C are mutually equal. Furthermore, the source node 3 comprises a memory 6, in which a storage area 7, or "buffer", is provided, adapted to temporarily contain the data waiting to be transmitted to the receiving nodes 4A, 4B and 4C which, in turn, each contain a corresponding memory 6A, 6B and 6C to preserve the data transmitted to them.
Operatively, the data transmission system performs the following steps: the driver writes a data string in the memory 6 of the computer 3 and notifies the buffer 7 that the data string is ready in such memory 6. The network card 5 associated with the source node sub-divides the data to be transmitted to the receiving nodes 4A, 4B and 4C into a series of data fractions F1, F2, F3, indicating the physical initial memory address and the data size for each data fraction F1, F2, F3. At this point, the system checks how many receiving nodes are involved in the data transmission which is about to be started and the size of the data packages which must be prepared starting from such data string for the purpose of being transmitted to each receiving node.
In the example shown in Fig. 1 there are three receiving nodes 4A, 4B and 4C and the data buffer 7 has been sub-divided so as to obtain the three data fractions F1, F2, F3. However, in other embodiments of the invention, the number of fractions X may be higher or lower than the number of receiving nodes N.
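By way of non-limiting illustration, the sub-division of the buffer into X fractions, each described by its initial address and its size, can be sketched as follows. The near-equal split is an assumption of this sketch, since the description allows fractions of different sizes; the function name `split_buffer` and the example address are likewise illustrative.

```python
def split_buffer(base_address, total_size, x):
    """Return x (start_address, size) pairs covering total_size bytes."""
    if x <= 0:
        raise ValueError("x must be positive")
    quotient, remainder = divmod(total_size, x)
    fractions = []
    address = base_address
    for i in range(x):
        # the first `remainder` fractions take one extra byte each
        size = quotient + (1 if i < remainder else 0)
        fractions.append((address, size))
        address += size
    return fractions


print(split_buffer(0x1000, 10, 3))  # [(4096, 4), (4100, 3), (4103, 3)]
```

The number of fractions is a free parameter, so the same function covers the cases in which X is higher or lower than the number of receiving nodes.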
The first data fraction F1 is then transmitted from the source node 3 to the receiving node 4A by means of the network card 5 associated with the source node 3 and by means of the network card 5A associated with the first receiving node 4A. As soon as the transmission of such data fraction F1 to the receiving node 4A starts, the network card 5 associated with the source node 3 also activates the transmission of the second data fraction F2 from the source node 3 to the second receiving node 4B by means of the network card 5B associated with the receiving node 4B and, at the same time, the similar transmission of the data fraction F3 to the receiving node 4C. It is worth noting that in an embodiment (not shown in Figure 1) in which further receiving nodes are present (for a total of N receiving nodes), the network card 5 associated with the source node 3 would activate the transmission in sequence of N data fractions FN, one for each of the N receiving nodes. Subsequently, in the embodiment shown in Figure 1, the network card 5 of the source node 3 stores the transfer confirmation of such data fraction and the address of the corresponding receiving node 4N in a memory device positioned on the same network card 5. In this manner, since the address of the receiving nodes to which the data fractions have been transmitted is already available, the transmission procedure is accelerated when a new data fraction must be sent to a receiving node 4N which is already known by the network card 5 associated with the source node 3.
So, the network card 5 associated with the source node 3 checks whether all data fractions F1 .. Fx intended for such receiving node 4N were transmitted to it. If there is still a data fraction Fy to be transmitted to such receiving node 4N, the network card 5 associated with the source node 3 checks whether such data fraction Fy is ready for transmission to such receiving node 4N or whether a similar transmission of such data fraction Fy to another receiving node different from the receiving node 4N is still in progress. As soon as such data fraction Fy is ready to be transmitted to such receiving node 4N, the network card 5 associated with the source node 3 activates such transmission of data fraction Fy to the receiving node 4N.
Whenever the transmission of a given data fraction to a given receiving node 4A, 4B or 4C is completed, the network card 5 associated with the source node 3 updates the data stored on the memory device positioned inside it so as to be able to easily check how many data fractions were sent to each single receiving node and, consequently, which data fractions must still be transmitted to each receiving node.
Such checks and transmissions of data fractions by the network card 5 associated with the source node 3 continue until all the receiving nodes 4A, 4B and 4C have received all the data fractions F1 .. Fx intended for them. At this point, the network card 5 associated with the source node 3 declares the transmission of data fractions F1 .. Fx concluded, because all the data fractions F1 .. Fx have been transmitted to each receiving node. So, at the end of such transmissions of data fractions F1 .. Fx, all the data fractions F1 .. Fx intended for the receiving nodes are available in the memory 6A, 6B and 6C, respectively present in each receiving node 4A, 4B and 4C. In particular, in the embodiment described above with reference to Fig. 1, the data buffer 7 was sub-divided into the various data fractions F1, F2, F3 so that each of the receiving nodes 4A, 4B and 4C has received a data fraction different from that of the other receiving nodes. Since a network card 5A, 5B and 5C is associated with a corresponding receiving node 4A, 4B and 4C and each of such network cards 5A, 5B and 5C can communicate independently from the others, by means of the interconnection network 2, with the network card 5 associated with the source node 3, it results that each network card 5A, 5B and 5C can autonomously manage the transmission of data fractions F1 .. Fx to the corresponding receiving node 4A, 4B and 4C with which it is associated, without needing to follow the data transmission results from the source node 3 to the other receiving nodes. So, the transmissions of data fractions F1 .. Fx from the source node 3 to the various receiving nodes 4A, 4B and 4C may occur substantially at the same time.
So, according to the present invention, there is the advantage of being able to accelerate the transmission of the data fractions to the receiving nodes, without needing to wait for the data transmission from the source node to a receiving node to be completed before starting the data transmission from the source node to a successive receiving node, as occurs instead in data transmission systems of the prior art which use Ethernet and Infiniband type network cards. High-performance data transmission is thus achieved according to the present invention.
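By way of non-limiting illustration, the difference between the two transmission patterns can be expressed with a simple timing model. The model is an assumption of this sketch: every transfer is taken to last the same time, every receiving node is assumed to be reached without a shared bottleneck, and the numbers are arbitrary units chosen for the example, not measurements.

```python
def serialized_time(n_receivers, t_transfer, t_confirm):
    """Each transfer starts only after the previous confirmation has arrived."""
    return n_receivers * (t_transfer + t_confirm)


def overlapped_time(n_receivers, t_transfer, t_confirm, t_start):
    """Transfers are started t_start apart and then proceed concurrently."""
    return (n_receivers - 1) * t_start + t_transfer + t_confirm


# Arbitrary example: 3 receivers, transfer 100, confirmation 10, start-up 1.
print(serialized_time(3, 100, 10))     # 330
print(overlapped_time(3, 100, 10, 1))  # 112
```

In this model the serialized cost grows with the number of receiving nodes multiplied by the full transfer time, whereas the overlapped cost grows only with the short activation interval between successive transfers.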
A second embodiment of the data transmission system and of the respective method according to the present invention is described below with reference to Figure 2. Such second embodiment is substantially similar to the embodiment as disclosed above with reference to Fig. 1, with the only difference being that some of the data fractions F1, F2, F3 are transmitted to more than one receiving node 4A, 4B and 4C.
Indeed, according to such second embodiment, the receiving node 4A receives the data fraction F1, the receiving node 4B receives the data fractions F1 and F2, and the receiving node 4C receives the data fractions F1, F2 and F3. So, the data fraction F1 is transmitted to all the receiving nodes 4A, 4B and 4C, while the data fraction F2 is transmitted to the receiving nodes 4B and 4C, and the data fraction F3 is transmitted only to receiving node 4C.
A third embodiment of the data transmission system and of the respective method according to the present invention is described below with reference to Figure 3. Such third embodiment is substantially similar to the embodiments shown above with reference to Figs. 1 and 2, with the only difference being that all data fractions F1, F2, F3 are transmitted to all the receiving nodes 4A, 4B and 4C.
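By way of non-limiting illustration, the three embodiments differ only in the table that states which fractions are intended for which receiving node. The table below restates the assignments described with reference to Figs. 1 to 3; the names `ASSIGNMENTS` and `receivers_of` are introduced for this sketch only.

```python
FRACTIONS = ["F1", "F2", "F3"]

ASSIGNMENTS = {
    # first embodiment: a different fraction for each receiving node
    "Fig. 1": {"4A": ["F1"], "4B": ["F2"], "4C": ["F3"]},
    # second embodiment: some fractions go to more than one receiving node
    "Fig. 2": {"4A": ["F1"], "4B": ["F1", "F2"], "4C": ["F1", "F2", "F3"]},
    # third embodiment: every fraction goes to every receiving node
    "Fig. 3": {node: list(FRACTIONS) for node in ("4A", "4B", "4C")},
}


def receivers_of(assignment, fraction):
    """Receiving nodes to which the given fraction is transmitted."""
    return [node for node, fracs in assignment.items() if fraction in fracs]


for figure, assignment in ASSIGNMENTS.items():
    print(figure, {f: receivers_of(assignment, f) for f in FRACTIONS})
```

For the second embodiment the function returns all three receiving nodes for F1, nodes 4B and 4C for F2, and node 4C alone for F3, in agreement with the description.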
The present invention has been clarified above by means of the above description of its preferred embodiments, but it is understood that equivalent changes may be made without departing from the scope of protection of the present invention.
For example, since the network card associated with the source node is equal to each of the network cards associated respectively to each of said receiving nodes, each node may be used either as source node or as receiver according to needs. Consequently, within the frame of high-performance scientific computing, the concept of source node and receiver node is dynamic and interchangeable during the execution of an application.
Consequently, the scope of protection of the present invention cannot be limited to the particular embodiments described above only by way of example but is rather defined by the accompanying claims.

Claims

1. A data transmission system (1) through an interconnection network (2) between at least a source node (3) and a plurality of N separate receiving nodes (4A,..., 4N), wherein a network card (5) is associated to said source node (3) and a network card (5A,...,5N) is respectively associated to each of said plurality of N separate receiving nodes (4A,..., 4N), characterized in that said source node (3) is configured so as to sub-divide the initial set of data (7) into a plurality of X distinct data fractions (F1, .. Fx) and to communicate to said network card (5) associated to it (3) the information related to the starting address and which among the X distinct data fractions (F1, .. Fx) is/are to be transmitted to each specific receiving node (4A,..., 4N).
2. A data transmission system (1) according to claim 1, wherein to each separate receiving node (4A,...,4N) at least a data fraction (F1, .. Fx) is transmitted which is different from the at least one data fraction (F1, .. Fx) transmitted to each other receiving node (4A,...,4N).
3. A data transmission system (1) according to claim 1, wherein at least one same data fraction (F1, .. Fx) is transmitted to at least two different receiving nodes (4A,...,4N).
4. A data transmission system (1) according to claim 1, wherein the totality of said distinct data fractions (F1, .. Fx) is transmitted to each separate receiving node (4A,...,4N).
5. A data transmission system (1) according to any one of the preceding claims, wherein said network card (5) associated to said source node (3) is configured so as to transmit a first data fraction F1 of said data fractions (F1, .. Fx) to a first receiving node (4A) by means of said network card (5A) associated to said first receiving node (4A), and wherein said network card (5) is also configured to transmit in sequence other data fractions (F1, .. Fx) to a sequence of other receiving nodes (4B,...,4N) differing one from each other and differing from the first receiving node (4A), by means of said network cards (5B,...,5N) respectively associated to said other receiving nodes (4B,..,4N), without waiting for the completion of data transmission to the receiving node (4A) previously activated.
6. A data transmission system (1) according to any one of the preceding claims, wherein said network card (5) associated to the source node (3) is configured to notify a driver about the sending of each of said data fractions (F1, .. Fx) to a specific receiving node (4).
7. A data transmission system (1) according to any one of the preceding claims, wherein said network card (5) associated to the source node (3) is configured so that, once completed the transmission of a data fraction (F) to a receiving node (4), such a data fraction (F) is transmitted to another receiving node to which the data fraction (F) has not yet been transmitted, as soon as a previous transmission of data fraction (Fi) to said another receiving node is completed.
8. A data transmission system according to any one of the preceding claims, wherein said network card (5) associated with said source node (3) is equal to each of said network cards (5A,...,5N) respectively associated to each of said receiving nodes (4A,...,4N).
9. A data transmission system (1) according to any one of the preceding claims, wherein said network card (5) associated with the source node (3) is configured so as to comprise a device, placed on the network card itself (5), able to store information about the data fraction type (F) which has been transmitted to a particular receiving node (4) and the corresponding address of such a receiving node (4).
10. A method for transmitting data in an interconnection network (2) between at least a source node (3) and a plurality of receiving nodes (4A,...,4N), wherein a network card (5) is associated to said source node (3) and a network card (5A,...,5N) is respectively associated to each of said plurality of separate receiving nodes (4A,..., 4N), wherein by means of said network card (5) associated to said source node (3) the initial set of data (7) to be transmitted to each of said distinct receiving node (4A,..., 4N) is sub-divided into a plurality of X distinct data fractions (F1, .. Fx) and the information related to the starting address and to which among the X distinct data fractions (F1, .. Fx) is/are to be transmitted to each specific receiving node (4A,...,4N) are transmitted to said network card (5) associated to said source node (3).
11. Method for transmitting data according to claim 10 which comprises the following steps: a) Prepare by the at least one source node (3) a data packet (7) to be transmitted to each of said plurality of receiving nodes (4A,...,4N); b) Notify by a driver to said network card (5) associated to said source node (3) the information about the data transmission to each of said plurality of receiving nodes (4A,...,4N); c) Read through said network card (5) associated to said source node (3) said data so notified by the driver and transmit them to each of said plurality of receiving nodes (4A,...,4N); d) Communicate the data transmission confirmation by the network card (5) associated to said source node (3).
12. Method for transmitting data according to claim 10 or 11 which further comprises the step e) to store on a hardware portion of the network card (5) associated to said source node (3) the information related to the data transmissions previously completed.
PCT/IB2017/054100 2016-07-08 2017-07-07 System for accelerating data transmission in network interconnections WO2018007988A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102016000071637 2016-07-08
IT102016000071637A IT201600071637A1 (en) 2016-07-08 2016-07-08 SYSTEM TO ACCELERATE DATA TRANSMISSION IN NETWORK INTERCONNECTIONS

Publications (1)

Publication Number Publication Date
WO2018007988A1 true WO2018007988A1 (en) 2018-01-11

Family

ID=57796801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2017/054100 WO2018007988A1 (en) 2016-07-08 2017-07-07 System for accelerating data transmission in network interconnections

Country Status (2)

Country Link
IT (1) IT201600071637A1 (en)
WO (1) WO2018007988A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090252168A1 (en) * 2008-04-02 2009-10-08 Alaxala Networks Corporation Multi-plane cell switch fabric system
US20120188934A1 (en) * 2009-10-06 2012-07-26 Hang Liu Method and apparatus for hop-by-hop reliable multicast in wireless networks
US20150039793A1 (en) * 2012-03-14 2015-02-05 Istituto Nazionale Di Fisica Nucleare Network interface card for a computing node of a parallel computer accelerated by general purpose graphics processing units, and related inter-node communication method

Also Published As

Publication number Publication date
IT201600071637A1 (en) 2018-01-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17748937

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17748937

Country of ref document: EP

Kind code of ref document: A1