CN117319288A - Integrated calculation network server and data transmission method - Google Patents

Info

Publication number
CN117319288A
CN117319288A · CN202311070164.7A · CN202311070164A
Authority
CN
China
Prior art keywords
packet
computing unit
computing
message
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311070164.7A
Other languages
Chinese (zh)
Other versions
CN117319288B (en)
Inventor
激扬
张登勇
李斌
戴可毅
任师臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bitdepth Beijing Technology Co ltd
Original Assignee
Bitdepth Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bitdepth Beijing Technology Co ltd filed Critical Bitdepth Beijing Technology Co ltd
Priority to CN202311070164.7A priority Critical patent/CN117319288B/en
Priority claimed from CN202311070164.7A external-priority patent/CN117319288B/en
Publication of CN117319288A publication Critical patent/CN117319288A/en
Application granted granted Critical
Publication of CN117319288B publication Critical patent/CN117319288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00: Routing or path finding of packets in data switching networks
    • H04L45/12: Shortest path evaluation
    • H04L45/22: Alternate routing
    • H04L45/28: Routing or path finding using route fault recovery
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/12: Avoiding congestion; Recovering from congestion
    • H04L47/125: Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1001: Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of intelligent computing servers and discloses a compute-network-integrated server and a data transmission method. The server comprises a computing-unit cluster formed by a switching backplane and a plurality of computing units. The switching backplane uses a lossless Ethernet communication protocol; it builds a data-switching network for the computing units and forwards the packets generated by their cooperative computation. Each computing unit is connected to the switching backplane through a high-speed uplink/downlink communication interface and cooperates in computation with the other computing units in the cluster. Because the server uses lossless Ethernet rather than a CPU-centric design, all computing units communicate and transfer data directly as peers over the lossless Ethernet, which markedly raises communication speed and greatly improves parallel-computing efficiency and the utilization of computing resources.

Description

Integrated calculation network server and data transmission method
Technical Field
The invention relates to the technical field of intelligent computing servers, and in particular to a compute-network-integrated server and a data transmission method.
Background
With the adoption of cloud computing, big data, AI, and the Internet of Things, data volumes have grown exponentially in recent years, placing severe demands on server data-processing capacity. In particular, with the advent of the AGI era, models and their parameter counts keep growing: a single computing unit can no longer complete a computing task, and many computing units must form a parallel computing cluster to do so. Communication between GPUs accounts for roughly eighty percent of the data traffic of such a system, so improving collaborative communication efficiency between GPUs is the key to raising the utilization of a server's computing resources.
In the related art, a conventional AI server is typically configured with one or more CPUs, memory, cache, hard disks, I/O devices, and multiple GPU cards (or FPGAs/ASICs): a typical von Neumann architecture in which the modules communicate over a PCIe bus and all communication is centered on the CPU. That is, communication between any two units is scheduled by the CPU and reaches a computing unit only after the data is buffered and decompressed, which greatly reduces the data communication speed between GPUs; at the same time, because data is shuttled repeatedly between buffers, far more energy is spent on data movement than on computation.
Disclosure of Invention
In view of this, the present invention provides a compute-network-integrated server and a data transmission method to solve the problem of the low computing efficiency of existing server architectures.
In a first aspect, the present invention provides a compute-network-integrated server. The server comprises a computing-unit cluster formed by a switching backplane and a plurality of computing units.
The switching backplane uses a lossless Ethernet communication protocol; it builds a data-switching network for the computing units and forwards the packets generated by their cooperative computation.
Each computing unit is connected to the switching backplane through a high-speed uplink/downlink communication interface and cooperates in computation with the other computing units in the cluster over the data-switching network.
In the invention, the server has no standalone CPU. Instead, multiple Turing-complete computing units with shared memory are connected to a high-speed switching backplane through communication interfaces. This departs from the CPU-centric computing architecture of a traditional server and gives every computing unit peer-to-peer, directly connected data communication based on the RoCE protocol. In addition, the computing units communicate not over a PCIe bus but through high-speed interfaces into a lossless switching backplane, which markedly raises the communication bandwidth, and hence the communication efficiency, between units. Because Ethernet is a widely adopted protocol, equipment construction costs are also significantly reduced.
In an alternative embodiment, the computing unit is an MCU, SoC, or FPGA that is Turing-complete and has a shared-memory architecture.
In this approach, an MCU, SoC, or FPGA with relatively high compute capability and low power consumption serves as the core device of the computing unit, so power draw is far below the per-unit compute power consumption of a traditional server, further reducing the server's power consumption and cost. Moreover, these computing units are extremely energy efficient, for example 4.6 W at 48 TOPS, so the server can be cooled with conventional fans, whereas a traditional server of the same power would need liquid cooling; design difficulty and system construction cost are thus greatly reduced.
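The efficiency figure quoted above can be made concrete with a small calculation (a sketch; the comparison baseline is illustrative, not taken from the patent):

```python
def tops_per_watt(tops: float, watts: float) -> float:
    """Compute efficiency in TOPS per watt."""
    return tops / watts

# Figure quoted in the text: 48 TOPS at 4.6 W of power draw.
unit_efficiency = tops_per_watt(48, 4.6)  # roughly 10.4 TOPS/W
```

At roughly 10 TOPS/W per computing unit, total heat per unit stays in the single-watt range, which is consistent with the text's claim that fan cooling suffices.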
In an alternative embodiment, each computing unit runs the RoCE protocol for direct data communication with the other computing units in the cluster;
each computing unit also runs the CoCE protocol for coordinating and controlling collective communication and computation with the other computing units in the cluster.
In this approach, installing the RoCE protocol on each computing unit enables direct data communication between any pair of computing nodes, minimizing communication overhead and delay and further improving communication efficiency; installing the CoCE protocol on each computing unit enables optimization and scheduling of one-to-many and many-to-one collective communication.
In an alternative embodiment, a global controller is installed on any one computing unit in the cluster to perform load balancing and congestion control for the cluster, achieving optimal forwarding routes for the cluster's data transmission.
In this approach, installing a global controller on one computing unit in the cluster coordinates the sensing, scheduling, distribution, and load balancing of the whole cluster's communication and computation. Through the global controller, network nodes that cannot see one another are mapped into a global queue of the switching network, ultimately achieving optimal forwarding across the whole cluster network.
In an alternative embodiment, a global dynamic-awareness responder is installed on each computing unit to respond to the global controller's routing schedule and realize the optimal forwarding route.
In this approach, the responder on each computing unit follows the global controller's routing-scheduling policy, so each computing unit uses a globally informed, dynamically optimal forwarding route when synchronizing data.
In a second aspect, the present invention provides a data transmission method applied to the server of the first aspect or any of its implementations. The method comprises:
receiving a packet sent by a first computing unit;
grouping packets into a plurality of message pools based on service-message length;
and sending each packet to its corresponding target computing unit.
In the invention, computing and storage resources are placed inside the communication network, so data can be processed and transmitted directly in the high-speed communication network, reducing the time and cost of data transfer. In simulation tests, the utilization of computing resources rises to 75%, versus 35% for a traditional server. Forwarding through fixed-length virtual message pools with dynamic load balancing solves the elephant-flow and related problems that collective communication in parallel cluster computing causes on lossless Ethernet.
In an alternative embodiment, grouping packets into a plurality of message pools based on service-message length comprises:
determining the message-pool length from the service-message length, the pool length being greater than the maximum service-message length;
judging, from the length of the current packet, whether it exceeds the remaining length of the current message pool;
when the current packet's length exceeds the remaining length of the current pool, scheduling the packet into the next pool and tagging it with the next pool's marking information to obtain a marked packet;
and when the current packet's length does not exceed the remaining length of the current pool, scheduling the packet into the current pool and tagging it with the current pool's marking information to obtain a marked packet.
In this approach, load balancing based on packet forwarding must overcome the effect of random message lengths. Establishing a message pool that can hold at least one longest service message slices the data flow finely and substantially improves the balance of the switching network's instantaneous load.
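The pooling rule above can be sketched as follows. This is a hypothetical illustration: `MessagePooler` and its field names are invented for the sketch, not taken from the patent.

```python
class MessagePooler:
    """Assigns packets to fixed-length virtual message pools.

    Rule from the text: a packet stays in the current pool if it fits
    within the pool's remaining length; otherwise it rolls over to the
    next pool. Either way, the packet is tagged with the id of the pool
    it was scheduled into.
    """

    def __init__(self, pool_length: int, max_message_length: int):
        # The pool must hold at least one longest service message.
        assert pool_length >= max_message_length
        self.pool_length = pool_length
        self.pool_id = 0
        self.remaining = pool_length

    def schedule(self, packet_length: int) -> dict:
        if packet_length > self.remaining:
            # Packet exceeds the remaining length: open the next pool.
            self.pool_id += 1
            self.remaining = self.pool_length
        self.remaining -= packet_length
        # Tag the packet with its pool's marking information.
        return {"pool_id": self.pool_id, "length": packet_length}
```

Because every pool carries roughly the same number of bytes, spraying pools (rather than whole flows) across links keeps the instantaneous load of the switching network balanced.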
In an alternative embodiment, sending each packet to its corresponding target computing unit comprises:
adding to each packet the identifier of the global controller for its target computing unit, and distributing the packet to that global controller for authorized scheduling;
after the packet is authorized by the global controller, sending it into the data-switching network;
constructing a forwarding link from the first computing unit to the target computing unit over the data-switching network;
and sending the packet over the forwarding link to the control unit of the target computing unit, so that the control unit sorts the packets into order and delivers the ordered packets to the target computing unit.
In this approach, packets are put into order before delivery to the target computing unit, which relieves the reordering pressure on the target's control unit, further reduces the compute power consumed by message transmission, and improves forwarding efficiency.
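One standard way to realize the ordering step is a sequence-numbered reorder buffer at the control unit. The sketch below is an assumption about the mechanism (the patent does not specify the data structure); names are invented:

```python
import heapq

class ReorderBuffer:
    """Releases packets to the target computing unit in sequence order.

    Out-of-order arrivals are held in a min-heap keyed by sequence
    number; each arrival releases the longest contiguous in-order run.
    """

    def __init__(self):
        self.next_seq = 0
        self.heap = []  # min-heap of (seq, payload)

    def push(self, seq: int, payload) -> list:
        heapq.heappush(self.heap, (seq, payload))
        released = []
        # Drain every packet whose sequence number is now contiguous.
        while self.heap and self.heap[0][0] == self.next_seq:
            released.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return released
```

For example, if packet 1 arrives before packet 0, nothing is delivered until packet 0 lands, at which point both are released in order.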
In an alternative embodiment, sending each packet to its corresponding target computing unit further comprises:
constructing a virtual queue path from the first computing unit to the target computing unit;
judging, from the virtual queue path, whether the forwarding link has failed;
and when the forwarding link has failed, distributing the packet over the remaining fault-free links.
In this approach, with a virtual queue path from the first computing unit to the target computing unit, forwarding at the first computing unit's ingress port does not need to perceive each hop toward the target unit's egress port; it only needs to know the egress port, which improves forwarding efficiency. Switching forwarding links when a device or link fails prevents the load on any single link from suddenly piling up, giving the system robustness and transparent self-healing of link resources.
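The failure handling described above can be sketched as a candidate-list balancer: failed links leave the list, recovered links rejoin it, and per-packet round-robin keeps the survivors evenly loaded. This is an illustrative sketch; the class and method names are invented:

```python
class LinkBalancer:
    """Per-packet load balancing over a failure-aware candidate list."""

    def __init__(self, links):
        self.links = list(links)  # currently healthy candidate links
        self.rr = 0               # round-robin cursor

    def pick(self) -> str:
        # Per-packet round-robin: load stays balanced across whatever
        # links remain in the candidate list.
        link = self.links[self.rr % len(self.links)]
        self.rr += 1
        return link

    def link_down(self, link: str) -> None:
        # Sensed in real time: remove the failed link from candidates.
        if link in self.links:
            self.links.remove(link)

    def link_up(self, link: str) -> None:
        # Self-heal: the recovered link rejoins the candidate list.
        if link not in self.links:
            self.links.append(link)
```

Because the choice is made per packet rather than per flow, removing one link simply redistributes its share over the survivors instead of dumping an entire flow onto a single alternate link.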
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an integrated computing network server according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the internal structure of a compute-network-integrated server according to an embodiment of the invention.
Fig. 3 is a flow chart of a data transmission method according to an embodiment of the invention.
Fig. 4 is a flow chart of a message forwarding method according to an embodiment of the present invention.
Fig. 5 is a flowchart of another data transmission method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a message pool according to an embodiment of the present invention.
Fig. 7 is a flowchart of yet another data transmission method according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of packet ordering according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of a redefined standard ethernet frame according to an embodiment of the present invention.
Fig. 10 is a schematic diagram of an extended protocol header according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, a conventional AI server is typically configured with one or more CPUs, memory, cache, hard disks, I/O devices, and multiple GPU cards (or FPGAs/ASICs): a typical von Neumann architecture in which the modules communicate over a PCIe bus and all communication is centered on the CPU. That is, communication between any two units is scheduled by the CPU and reaches a computing unit only after the data is buffered and decompressed, which greatly reduces the data communication speed between GPUs; at the same time, because data is shuttled repeatedly between buffers, the energy spent on data movement can reach roughly ten times that of the computation itself.
To solve the above problems, embodiments of the invention provide a compute-network-integrated server and a data transmission method suited to parallel cluster computing, such as large-model parallel computing. The server has no standalone CPU: multiple Turing-complete computing units with shared memory connect to a high-speed switching backplane through communication interfaces, departing from the CPU-centric computing architecture of a traditional server and giving all computing units peer-to-peer, directly connected data communication based on the RoCE protocol. In addition, the computing units communicate not over a PCIe bus but through high-speed interfaces into a lossless switching backplane, which markedly raises the communication bandwidth between units, further improves communication efficiency, and breaks through the memory wall and power wall of the traditional server. Because Ethernet is widely adopted, equipment construction costs are also significantly reduced.
According to an embodiment of the present invention, an embodiment of a compute-network-integrated server is provided. Fig. 1 is a schematic structural diagram of such a server. As shown in fig. 1, the server includes a computing-unit cluster formed by a switching backplane 102 and a plurality of computing units 101. The switching backplane 102 uses an Ethernet communication protocol; it builds a data-switching network for the computing units 101 and forwards the packets generated by their cooperative computation. Each computing unit 101 is connected to the switching backplane 102 through a high-speed uplink/downlink communication interface and cooperates in computation with the other computing units 101 in the cluster over the data-switching network.
In the embodiment of the present invention, the computing unit 101 is an MCU, SoC, or FPGA that is Turing-complete and has a shared-memory architecture. The data-switching network is a lossless Ethernet switching network.
In this approach, an MCU, SoC, or FPGA with relatively high compute capability and low power consumption serves as the core device of the computing unit, so power draw is far below the per-unit compute power consumption of a traditional server, further reducing the server's power consumption and cost. Moreover, these computing units are extremely energy efficient, for example 4.6 W at 48 TOPS, so the server can be cooled with conventional fans, whereas a traditional server of the same power would need liquid cooling; design difficulty and system construction cost are thus greatly reduced.
In the embodiment of the present invention, each computing unit 101 runs the RoCE protocol for direct data communication with the other computing units 101 in the cluster.
Each computing unit 101 also runs the CoCE protocol for coordinating and controlling collective communication and computation with the other computing units 101 in the cluster.
In this approach, installing the RoCE protocol on each computing unit enables direct data communication between any pair of computing nodes, minimizing communication overhead and delay and further improving communication efficiency; installing the CoCE protocol on each computing unit enables optimization and scheduling of one-to-many and many-to-one collective communication.
In the embodiment of the present invention, a global controller is installed on any one computing unit 101 in the cluster to perform load balancing and congestion control for the cluster, achieving optimal forwarding routes for the cluster's data transmission.
In this approach, installing a global controller on one computing unit in the cluster coordinates the sensing, scheduling, distribution, and load balancing of the whole cluster's communication and computation. Through the global controller, network nodes that cannot see one another are mapped into a global queue of the switching network, ultimately achieving optimal forwarding across the whole cluster network.
In the embodiment of the present invention, each computing unit 101 has a global dynamic-awareness responder installed, configured to respond to the global controller's routing schedule and realize the optimal forwarding route.
In this approach, the responder on each computing unit follows the global controller's routing-scheduling policy, so each computing unit uses a globally informed, dynamically optimal forwarding route when synchronizing data.
In one example, fig. 2 is a schematic diagram of the internal architecture of a compute-network-integrated server according to embodiments of the present invention. As shown in fig. 2, the server includes a switching backplane and independent computing nodes: 1) a plurality of computing nodes supporting RoCE and CoCE, each designed around an MCU, SoC, or FPGA with computing capability; 2) a high-speed Ethernet switching backplane supporting RoCE, CoCE, and related protocols, built around a high-speed switching chip; 3) the computing nodes connect to the switching backplane through high-speed uplink/downlink communication interfaces, and inter-node communication and data synchronization run over lossless Ethernet via the backplane's RoCE protocol, making the backplane the main functional component for cluster computation across multiple computing units; 4) other parts include the chassis, fans, and network card; components such as a CPU, memory modules, hard disks, power supplies, and I/O are not separately configured.
Specifically, 1. the design of the compute-network-integrated computing node includes:
a) The RoCE/CoCE protocols and algorithms are installed on each computing node. RoCE is an Ethernet remote direct memory access (RDMA) technology that enables direct data communication between any computing nodes, minimizing communication overhead and delay and improving communication efficiency. CoCE (Collective over Converged Ethernet) is a collective-communication protocol that optimizes one-to-many and many-to-one communication.
b) A Global Dynamic Perception Responder (GDPR) driver is installed on each computing unit. The GDPR responds to the routing-scheduling policy of the global controller, which has built-in Global Aware Dynamic Scheduling (GADS), so that each computing node uses a globally informed, dynamically optimal forwarding route when synchronizing data.
c) A global controller is installed on any one computing node to coordinate the sensing, scheduling, distribution, and load balancing of the whole cluster's communication and computation. The controller's built-in GADS maps mutually invisible network nodes into a global queue of the switching network, ultimately achieving optimal forwarding across the whole cluster network.
2. The design of the lossless switching backplane includes:
a) The switching backplane uses an Ethernet communication protocol and a data-center-grade switching chip with ultra-low latency (under 300 ns).
b) CoCE algorithm optimization improves collective-communication efficiency. Parallel computation contains a large number of All-Reduce traffic patterns, and any round of computation finishes only when the last result returns, so reducing the cluster network's long-tail latency directly shortens training completion time. The overall forwarding delay is positively correlated with the congestion of intermediate nodes on the forwarding path, so eliminating intermediate-node congestion eliminates long-tail latency. Fusing GADS with high-precision load balancing is the key to this problem: on one hand, GADS's combined PUSH+PULL mechanism keeps the volume of packet data entering the switching network within the whole network's forwarding capacity; on the other hand, high-precision load balancing removes congestion at any computing node of the switching network.
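Why long-tail latency dominates All-Reduce can be shown in a few lines: a round completes when the slowest node's result returns, so round time is the maximum delay, not the mean. This is a toy illustration with made-up numbers, not measurements from the patent:

```python
def allreduce_round_time(node_delays):
    """A round of All-Reduce finishes only when the last (slowest)
    result returns, so the round time equals the maximum node delay."""
    return max(node_delays)

# Three fast nodes and one node behind a congested intermediate hop.
delays = [1.0, 1.1, 1.0, 4.8]
round_time = allreduce_round_time(delays)        # 4.8: tail dominates
mean_delay = sum(delays) / len(delays)           # ~2.0: misleading
```

Even though the average delay is about 2, the round takes 4.8; removing the one congested path (e.g. via GADS admission plus load balancing) cuts the round time to roughly 1.1.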
c) When traffic is switched over Ethernet, packet loss is indiscriminate, and conventional ECMP load balancing can produce uneven link load and hash polarization; in the presence of elephant flows, congestion and packet loss can occur regardless of how long those flows last. To address congestion and packet loss, the solution uses a packet-by-packet load-sharing technique (forwarding through virtual message pools) in place of traditional Ethernet's per-flow forwarding, thoroughly eliminating the hash-polarization problem and further improving the switching network's bandwidth utilization.
d) Reliability and robustness: ensuring system reliability and robustness while distributing a computing task across multiple computing nodes is an important and complex problem. GADS constructs a virtual queue path from a computing node's ingress port to the target node's egress port; forwarding a service at the ingress port need not perceive each hop toward the egress port, only the egress port itself. The scheme is transparent to the underlying Ethernet communication network, and path reachability and path switching are guaranteed by packet-based load balancing. When a link or computing node in the parallel cluster network fails, the device nodes attached to that link sense the state change in real time, automatically remove the link from the load-balancing candidate list, and revoke the GADS scheduling authorizations tied to that path, so packets are distributed over the other available links. When the device or link recovers, the attached nodes sense the change in real time and self-heal. Packet-based load balancing remains stably balanced through uplink switching; unlike flow-based load balancing, it is not at the mercy of hash results or a reduced link count, avoids sudden load pile-ups on any single link, and achieves system robustness and transparent self-healing of link resources.
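The contrast between flow-hash (ECMP-style) and per-packet (virtual-pool) distribution can be demonstrated directly. This is an illustrative sketch, not the patent's implementation; MD5 stands in for whatever hash a real switch would use:

```python
import hashlib

def flow_hash_link(flow_id: str, n_links: int) -> int:
    """Flow-based ECMP: every packet of a flow pins to one hashed link."""
    h = int(hashlib.md5(flow_id.encode()).hexdigest(), 16)
    return h % n_links

def per_packet_link(packet_index: int, n_links: int) -> int:
    """Per-packet spraying: successive packets rotate across links."""
    return packet_index % n_links

# An elephant flow of 1000 packets over 4 links.
flow_loads = [0] * 4
for _ in range(1000):
    flow_loads[flow_hash_link("elephant-flow", 4)] += 1  # all on one link

spray_loads = [0] * 4
for i in range(1000):
    spray_loads[per_packet_link(i, 4)] += 1  # evenly spread
```

Under flow hashing, the entire elephant flow lands on a single link (the polarization/congestion case); per-packet spraying puts exactly a quarter of the packets on each link, which is why the pool-based scheme needs the reordering step at the receiver.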
In an implementation scenario, the flow of model training based on the integrated server is as follows:
1. data preparation stage: a training dataset and model parameters are prepared. The training data set may be acquired from the local or cloud, and the model parameters may be loaded from the local or cloud.
2. Model parallel stage: the model parameters are divided into a plurality of parts and the parts are distributed to different computing units for processing. Each computing unit independently calculates the gradient of the model parameter and sends the gradient to the other computing units.
3. Data parallel stage: at this stage, the training data set is divided into a plurality of parts and the parts are distributed to different computing units for processing. Each computing unit independently calculates the gradient of the training data and sends the gradient to the other computing units.
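As a hedged illustration only, the model-parallel and data-parallel stages above can be sketched as a toy simulation; the unit count, the stand-in gradient function, and all names are assumptions, not the patent's implementation:

```python
def shard(items, n_units):
    """Split parameters (or training samples) into n_units near-equal parts."""
    k, r = divmod(len(items), n_units)
    shards, start = [], 0
    for i in range(n_units):
        end = start + k + (1 if i < r else 0)
        shards.append(items[start:end])
        start = end
    return shards

def local_gradient(shard_values):
    """Stand-in for a unit's independent local gradient computation."""
    return [-v for v in shard_values]

def exchange(gradients_per_unit):
    """Each unit sends its gradient to every other unit (an all-gather)."""
    return [list(gradients_per_unit) for _ in gradients_per_unit]

params = [0.5, -1.0, 2.0, 3.5, 0.25]
shards = shard(params, n_units=2)            # stage 2: model-parallel split
local = [local_gradient(s) for s in shards]  # each unit computes locally
views = exchange(local)                      # every unit receives all gradients
```

The same `shard`/`exchange` pattern applies to the data-parallel stage, with training samples taking the place of model parameters.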
According to the integrated computing network server provided by this embodiment, the server is not provided with an independent CPU; instead, a plurality of Turing-complete computing units with shared memory are connected to a high-speed switching backplane through communication interfaces, which changes the CPU-centric computing architecture of the traditional server and realizes peer-to-peer direct data communication among all computing units based on the RoCE protocol. In addition, communication between the computing units is not carried out over a PCIE bus but through high-speed interfaces accessing the lossless switching backplane, so that the communication bandwidth between the computing units is significantly improved and the communication efficiency is further improved; meanwhile, because the widely adopted Ethernet protocol is used, the construction cost of the equipment can be significantly reduced.
According to an embodiment of the present invention, a data transmission method embodiment is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, such as by a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
In this embodiment, a data transmission method is provided, which may be used in the server described above, and fig. 3 is a flowchart of a data transmission method according to an embodiment of the present invention, as shown in fig. 3, where the flowchart includes the following steps:
step S301, a packet sent by the first computing unit is received.
In an example, the first computing unit is any computing unit in the server, and is configured to perform computation and forwarding of the message data.
Step S302, based on the service message length, the message packets are formed into a plurality of message pools.
In an example, packet division and marking are performed by a virtual packet pool technique: each virtual packet pool carries a mark, and each packet pool has a unique number. The number of packets that a packet pool can accommodate can be adjusted according to the length distribution of service packets; the pool is required to accommodate at least one longest service packet, and the total length should be kept as short as possible within the forwarding and reordering capability of the switching-backplane chip.
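The pool-sizing rule above (hold at least one longest service packet, keep the total length as short as the backplane chip allows) can be sketched as follows; the alignment granularity and the chip limit are assumed values, not figures from the patent:

```python
def choose_pool_length(max_service_msg_len, granularity=256, chip_limit=16384):
    """Smallest granularity-aligned length that still holds one longest
    service message, capped by the assumed chip reordering capability."""
    length = ((max_service_msg_len + granularity - 1) // granularity) * granularity
    if length > chip_limit:
        raise ValueError("longest service message exceeds assumed chip capability")
    return length

pool_len = choose_pool_length(1500)   # a standard Ethernet MTU-sized message
```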
Step S303, each packet is sent to a corresponding target computing unit.
In an example, the switching backplane uses virtual message pools to form fixed-length virtual logic units for message forwarding and dynamic load balancing. By constructing a DGSQ full-scheduling mechanism, a fine-grained back-pressure mechanism, and an unaware self-healing mechanism based on the packet virtual logic units, accurate control under micro-burst and fault scenarios is realized, comprehensively improving the effective bandwidth and the forwarding-delay stability of the network. The message forwarding flow comprises the following steps: 1) After a packet is received by the control unit, integrated computing node 1 (the source-end device) finds the final egress through the forwarding table and, based on that final egress, distributes the message to the corresponding global dynamic awareness controller for authorized scheduling as required.
2) After the source control device obtains the authorization, the packet is sent to the lossless Ethernet switching network in accordance with the load-balancing requirement.
3) When the message reaches the destination-end device, packet-level sorting is performed; the message is then stored into a physical port queue through the forwarding table and finally sent to a computing node through port scheduling.
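The authorize-then-send flow of steps 1)-3) can be reduced to a minimal credit-based sketch; the class name and the credit counts are assumptions, and the real GADS/DGSQ authorization logic is not specified at this level of detail:

```python
class GlobalScheduler:
    """Grants per-egress transmission credit, standing in for authorized scheduling."""
    def __init__(self, credits_per_egress):
        self.credits = dict(credits_per_egress)

    def request(self, egress):
        """A packet may enter the switching fabric only while credit remains."""
        if self.credits.get(egress, 0) > 0:
            self.credits[egress] -= 1
            return True   # authorized: send to the lossless Ethernet network
        return False      # no grant: the source holds the packet (back pressure)

sched = GlobalScheduler({"egress-7": 2})
sent = [sched.request("egress-7") for _ in range(3)]  # third request is held
```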
Fig. 4 is a flowchart of a message forwarding method according to an embodiment of the present invention. The specific message forwarding flow, as shown in fig. 4, is: 1. receive a packet from a computing unit; 2. classify and store the packet in the control unit; 3. perform fine-grained load balancing on the packet over the switching network; 4. forward the packet to the destination device; 5. sort in units of packets; 6. store the packet in a control unit queue; 7. send the packet to the computing unit.
According to the data transmission method provided by this embodiment, computing and storage resources are placed inside the communication network, so that data can be processed and transmitted directly in the high-speed communication network, reducing the time and cost of data transmission. According to simulation tests, the utilization rate of computing resources is improved to 75%, compared with 35% for a traditional server. By forwarding with fixed-length message pools and dynamic load balancing, problems such as congestion and elephant flows in large-model parallel computation are solved, and high-performance parallel computation for large-model inference scenarios is realized.
In this embodiment, a data transmission method is provided, which may be used in the server described above, and fig. 5 is a flowchart of another data transmission method according to an embodiment of the present invention, as shown in fig. 5, where the flowchart includes the following steps:
step S501, a packet sent by the first computing unit is received. Please refer to step S301 in the embodiment shown in fig. 3 in detail, which is not described herein.
Step S502, based on the service message length, the message packets are formed into a plurality of message pools.
Specifically, the step S502 includes:
step S5021, based on the service message length, the length of the message pool is determined.
The length of the message pool is larger than the maximum service message length.
Step S5022, based on the current packet length, judging whether the current packet length exceeds the remaining length of the current packet pool.
Step S5023, when the length of the current packet exceeds the remaining length of the current packet pool, the current packet is scheduled to the next packet pool, and the marking information of the next packet pool is added to the current packet to obtain the marked current packet.
Step S5024, when the length of the current packet does not exceed the remaining length of the current packet pool, the current packet is scheduled to the current packet pool, and the marking information of the current packet pool is added to the current packet to obtain the marked current packet.
In an example, when implementing load balancing based on per-packet forwarding, the influence of random packet lengths must first be overcome, so the basic forwarding unit of load balancing is normalized and a fixed-length packet group is established. The number of messages a packet group accommodates can be adjusted according to the distribution of service message lengths; it must accommodate at least one longest service message, and the total length should be as short as the chip's forwarding and reordering capability allows, so as to finely segment the data stream and fully improve the instantaneous load balance. Fig. 6 is a schematic diagram of a message pool according to an embodiment of the present invention. As shown in fig. 6, the message pool is logical and virtual: when a message enters an integrated node of the network, the control unit records the number of the message pool to which the message belongs, the number of bytes occupied in that pool, and similar information; when the byte count exceeds the set length of the virtual message pool, the message is scheduled into and recorded against the next message pool. Each network control node forwards messages directly, without caching them to build an actual message pool. All messages belonging to the same message pool are balanced onto a single unique path in the switching network, so that messages within one pool cannot become disordered, thereby reducing the reordering pressure on the egress control unit of the integrated computing network.
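A minimal sketch of steps S5021-S5024 and of the virtual-pool bookkeeping described above; the pool length and packet sizes are illustrative only:

```python
def assign_pools(packet_lengths, pool_length):
    """Tag each packet with the number of the virtual message pool it joins.
    A packet that would overflow the remaining space of the current pool is
    scheduled into the next pool (S5023); otherwise it stays (S5024)."""
    assert pool_length >= max(packet_lengths)  # S5021: pool holds longest message
    tags, current_pool, used = [], 0, 0
    for length in packet_lengths:
        if used + length > pool_length:        # S5022: exceeds remaining length
            current_pool += 1                  # S5023: move to the next pool
            used = 0
        used += length
        tags.append(current_pool)              # mark with the pool number
    return tags

# With pool_length=1500: 600+700 share pool 0, 800 opens pool 1, and so on.
tags = assign_pools([600, 700, 800, 1500, 64], pool_length=1500)
```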
Step S503, each packet is sent to a corresponding target computing unit. Please refer to step S303 in the embodiment shown in fig. 3 in detail, which is not described herein.
According to the data transmission method, when load balancing is realized based on message forwarding, the influence of random message lengths must be overcome; by establishing a message pool that accommodates at least one longest service message, fine segmentation of the data stream is achieved and the instantaneous load balance of the switching network is fully improved.
In this embodiment, a data transmission method is provided, which may be used in the server described above, and fig. 7 is a flowchart of another data transmission method according to an embodiment of the present invention, as shown in fig. 7, where the flowchart includes the following steps:
step S701, receiving a packet sent by the first computing unit. Please refer to step S501 in the embodiment shown in fig. 5 in detail, which is not described herein.
Step S702, based on the service message length, the message packets are formed into a plurality of message pools. Please refer to step S502 in the embodiment shown in fig. 5 in detail, which is not described herein.
In step S703, each packet is sent to a corresponding target computing unit.
Specifically, the step S703 includes:
step S7031, adding a global controller identifier corresponding to the target computing unit to each packet, and distributing the packet to a corresponding global controller for authorized scheduling.
Step S7032, after the packet obtains the authorization of the global controller, the packet is sent to the data exchange network.
Step S7033, based on the data exchange network, constructs a forwarding link from the first computing unit to the target computing unit.
Step S7034, based on the forwarding link, the packet is sent to the control unit corresponding to the target computing unit.
In the embodiment of the invention, the message packets are sent to the control units corresponding to the target computing units, so that the control units sort the message packets to obtain sorted message packets, and the sorted message packets are sent to the target computing units.
In one example, fig. 8 is a schematic diagram of packet ordering according to an embodiment of the present invention. As shown in fig. 8, the per-packet forwarding mechanism needs to carry relevant information in the data packet so that it can be correctly identified and processed by the switching backplane and sent to the target node. The message therefore needs to carry a global controller identifier when entering the switching network. The global scheduling control identifier is related to the system build target. Typically, a unique routing scheduling identifier can be established based on the source device, the destination port, and the priority under that port. The service requirement can also be simplified, for example by setting 4, 2, or 1 priority levels under one target port, which reduces the number of required global dynamic scheduling queues and thus the cost of the switching chip. A packet that has entered the communication control unit can be sent to the switching network only after downlink scheduling authorization. The messages sent from the same ingress switching node to the same egress switching node can then form a de-ordering queue: the same sequence number (the sequence of the virtual message pool) and the same source control node ID are added to all data packets in each virtual message pool, and after the messages are received downstream, de-ordering can be performed based on the source control node ID and the sequence number.
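The de-ordering queue described above can be sketched as follows; the tuple layout (source control node ID, pool sequence number, intra-pool index, payload) is an assumption for illustration:

```python
from collections import defaultdict

def reorder(received):
    """Restore per-source order from packets arriving out of pool order.
    Packets inside one pool already kept their relative order, because a
    whole pool is balanced onto a single path in the switching network."""
    per_src = defaultdict(list)
    for src, seq, idx, payload in received:
        per_src[src].append((seq, idx, payload))
    return {src: [p for _, _, p in sorted(pkts)]
            for src, pkts in per_src.items()}

# Pool 2 of source 1 overtook pool 0 in the fabric; de-ordering repairs it.
arrived = [(1, 2, 0, "c"), (1, 0, 0, "a"), (1, 0, 1, "b"), (2, 0, 0, "x")]
ordered = reorder(arrived)
```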
Specifically, the fully scheduled lossless Ethernet needs to add additional information to the service packet for global load-balancing forwarding and ordering. This information has three carrying modes:
1) Adding a standard extension header outside the standard Ethernet frame: the greatest benefit of this carrying mode is that the original service message is not altered, but there is some loss in compatibility and transmission efficiency. If an Ethernet tunnel is added to improve Ethernet compatibility, the transmission efficiency is further reduced.
2) Redefining the standard Ethernet frame: fig. 9 is a schematic diagram of a redefined standard Ethernet frame according to an embodiment of the present invention. As shown in fig. 9, the redefined standard Ethernet frame comprises: MAC, IP, GSE Header and Payload. The biggest benefit of this carrying mode is high transmission efficiency, but its Ethernet compatibility is poor, so it can only be used in specific scenarios.
3) Extending the protocol header: fig. 10 is a schematic diagram of an extended protocol header according to an embodiment of the present invention. As shown in fig. 10, the extended protocol header comprises: MAC/GSE Header, IP and Payload. The biggest benefit of this approach is that it balances Ethernet compatibility and transmission efficiency by extending the protocol header after the Ethernet MAC or IP header; however, processing the GSE additional information in the network may require inspecting the inner contents of the message, which may affect the forwarding delay.
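As an illustration of carrying mode 2) (MAC / IP / GSE Header / Payload), a GSE-style header can be packed as fixed-width fields; the field set and widths here (16-bit source node ID, 32-bit pool sequence, 8-bit priority) are assumptions, since the patent does not give the header layout:

```python
import struct

GSE_FMT = "!HIB"  # network byte order: node ID (u16), pool sequence (u32), priority (u8)

def pack_gse(src_node_id, pool_seq, priority):
    """Serialize the assumed GSE fields for insertion after the MAC/IP headers."""
    return struct.pack(GSE_FMT, src_node_id, pool_seq, priority)

def unpack_gse(raw):
    """Recover the fields at the destination for de-ordering and scheduling."""
    return struct.unpack(GSE_FMT, raw)

hdr = pack_gse(src_node_id=7, pool_seq=1024, priority=3)  # 7-byte header
```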
In this mode, the message packets are sorted and then sent to the target computing unit, which reduces the reordering pressure on the control unit corresponding to the target computing unit, further reduces the computing power consumed by message transmission, and improves message forwarding efficiency.
In some optional embodiments, step S703 further includes:
step S7035, constructing a virtual queue path from the first computing unit to the target computing unit;
step S7036, based on the virtual queue path, judging whether the forwarding link has a fault;
step S7037, when the forwarding link has a failure, the packet is distributed to the rest links without failure.
In an example, a virtual queue path from the ingress port of a computing node to the egress port of a computing node is constructed through a Global Aware Dynamic Scheduling (GADS) technology; to forward the traffic of an ingress port, the per-hop path toward the egress port does not need to be perceived, and only the egress port needs to be determined. The method is unaware of the communication network formed by the Ethernet; the reachability of paths and the switching between paths are guaranteed by a packet-based load balancing technology. When a link or a computing node in the parallel cluster computing network fails, the device nodes connected to that link sense the change of the link state in real time, automatically remove the corresponding link from the load-balancing candidate list, and reclaim the GADS scheduling authorizations related to the path, so that packets are distributed to the other available links. When the device or the link recovers, the connected device nodes likewise sense the link state change in real time and complete self-healing. The packet-based load balancing technology keeps the load stably balanced during link switching; unlike flow-based load balancing, it is not affected by hash results or by a reduced number of links, avoids a sudden load pile-up on any single link, and achieves robustness of the system and unaware self-healing of link resources.
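The failure-removal and self-healing behavior above can be sketched as maintenance of a load-balancing candidate list; the class name and the round-robin packet spraying are assumptions for illustration:

```python
class LinkSet:
    """Candidate links for packet-based load balancing with unaware self-healing."""
    def __init__(self, links):
        self.available = list(links)
        self._next = 0

    def fail(self, link):
        if link in self.available:
            self.available.remove(link)   # drop the failed link immediately

    def recover(self, link):
        if link not in self.available:
            self.available.append(link)   # the recovered link rejoins

    def pick(self):
        """Per-packet balancing: traffic spreads over all live links."""
        link = self.available[self._next % len(self.available)]
        self._next += 1
        return link

links = LinkSet(["L0", "L1", "L2"])
links.fail("L1")                          # L1's load shifts to L0 and L2
chosen = [links.pick() for _ in range(4)]
```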
According to the data transmission method, the message packets are sorted and then sent to the target computing unit, which reduces the reordering pressure on the control unit corresponding to the target computing unit, further reduces the computing power consumed by message transmission, and improves message forwarding efficiency. By constructing a virtual queue path from the first computing unit to the target computing unit, the forwarding service of the input port of the first computing unit does not need to perceive each path toward the output port of the target computing unit; only the output port needs to be determined, which improves message forwarding efficiency. By switching forwarding links when a device or a forwarding link fails, a sudden load pile-up on any single forwarding link can be avoided, achieving robustness of the system and unaware self-healing of link resources.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (9)

1. A server for an integrated computing network, the server comprising: a switching backplane and a computing unit cluster formed by a plurality of computing units;
the switching backplane adopts a lossless Ethernet communication protocol, and is configured to construct a data exchange network for the computing units and to forward message packets generated by the cooperative computation of the computing units;
each computing unit is connected with the switching backplane through a high-speed uplink/downlink communication interface, and is configured to perform cooperative computation with other computing units in the computing unit cluster based on the data exchange network.
2. The server of claim 1, wherein the computing unit is an MCU, an SoC, or an FPGA with a Turing-complete shared-memory architecture.
3. The server of claim 1, wherein each computing unit has a RoCE protocol installed thereon for direct data communication with other computing units in the cluster of computing units;
and each computing unit is provided with a CoCE protocol for coordinating and controlling the communication and computation of the other computing units in the computing unit cluster.
4. The server of claim 1, wherein a global controller is installed on any one of the computing units in the computing unit cluster, and is configured to perform load balancing and congestion control on the computing unit cluster, so as to implement an optimal forwarding route for data transmission of the computing unit cluster.
5. The server of claim 4, wherein each computing unit is provided with a global dynamic awareness responder for responding to a routing schedule of the global controller to achieve an optimal forwarding route.
6. A data transmission method applied to the server according to any one of claims 1 to 5, characterized in that the method comprises:
receiving a message packet sent by a first computing unit;
based on the length of the service message, forming the message packets into a plurality of message pools;
and sending each packet to a corresponding target computing unit.
7. The method of claim 6, wherein grouping the packets into a plurality of message pools based on the service message length comprises:
determining the length of the message pool based on the service message length, wherein the length of the message pool is larger than the maximum service message length;
judging whether the length of the current message packet exceeds the remaining length of the current message pool based on the length of the current message packet;
when the length of the current packet exceeds the remaining length of the current packet pool, scheduling the current packet to a next packet pool, and adding marking information of the next packet pool to the current packet to obtain a marked current packet;
and when the length of the current message packet does not exceed the remaining length of the current message pool, scheduling the current message packet to the current message pool, and adding the marking information of the current message pool to the current message packet to obtain the marked current message packet.
8. The method of claim 7, wherein the sending each packet to a corresponding target computing unit comprises:
adding a global controller identifier corresponding to the target computing unit into each packet, and distributing the packet to a corresponding global controller for authorized scheduling;
after the packet obtains the authorization of the global controller, the packet is sent to a data exchange network;
constructing a forwarding link from the first computing unit to the target computing unit based on the data exchange network;
based on the forwarding link, the packet is sent to a control unit corresponding to the target computing unit, so that the control unit sorts the packet to obtain a sorted packet, and the sorted packet is sent to the target computing unit.
9. The method of claim 8, wherein said sending each packet to a corresponding target computing unit further comprises:
constructing a virtual queue path from the first computing unit to the target computing unit;
judging whether the forwarding link has faults or not based on the virtual queue path;
and when the forwarding link has faults, the packet is distributed to other links without faults.
CN202311070164.7A 2023-08-23 Integrated calculation network server and data transmission method Active CN117319288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311070164.7A CN117319288B (en) 2023-08-23 Integrated calculation network server and data transmission method


Publications (2)

Publication Number Publication Date
CN117319288A true CN117319288A (en) 2023-12-29
CN117319288B CN117319288B (en) 2024-06-28


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106101262A (en) * 2016-07-21 2016-11-09 广州高能计算机科技有限公司 A kind of Direct Connect Architecture computing cluster system based on Ethernet and construction method
CN206023844U (en) * 2016-07-21 2017-03-15 广州高能计算机科技有限公司 A kind of Direct Connect Architecture computing cluster system based on Ethernet
CN111726410A (en) * 2020-06-22 2020-09-29 中科边缘智慧信息科技(苏州)有限公司 Programmable real-time computing and network load sensing method for decentralized computing network
CN112511587A (en) * 2020-10-23 2021-03-16 许继集团有限公司 Communication data processing method and device for power distribution network differential service
CN112631986A (en) * 2020-12-28 2021-04-09 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant