Non-blocking network reduction computing device and method based on logic tree
Technical Field
The invention belongs to the technical field of hardware integrated circuits, and particularly relates to a non-blocking network reduction computing device and method based on a logic tree.
Background
In a high-performance computing system, in addition to point-to-point communication, there is collective communication in which multiple nodes participate. The number of nodes participating in a collective communication is not fixed but is determined by the running task, and this characteristic makes collective communication harder to implement in hardware than point-to-point communication.
Collective communication also includes a type in which the communicated data must be computed on: the data on all nodes are combined, and the result is returned to all nodes. This type is called reduction communication. In reduction communication, the data on each communication node has the same size; after all the data are accumulated or logically combined, a single piece of result data is formed and must be returned to all the communication nodes.
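The semantics of reduction communication described above can be illustrated with a short software sketch. It models only the result every node should end up with, not the hardware; the function name `all_reduce` and the sample node data are hypothetical.

```python
from functools import reduce
import operator

def all_reduce(node_data, op=operator.add):
    """Combine equally sized vectors element-wise and give every node the result."""
    lengths = {len(v) for v in node_data.values()}
    if len(lengths) != 1:
        raise ValueError("all nodes must contribute data of the same size")
    # Element-wise reduction across all participating nodes.
    result = [reduce(op, column) for column in zip(*node_data.values())]
    # Every participating node receives a copy of the same result.
    return {node: list(result) for node in node_data}

data = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
out = all_reduce(data)
```

Passing a different operator, for example `operator.and_`, gives a logical reduction instead of an accumulation.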
In a high-performance interconnection network, reduction communication among multiple communication nodes is converted, at the software level, into point-to-point communication between pairs of nodes, and the CPUs in the nodes then compute the reduction data. This approach is only suitable for large data volumes. In a high-performance computing system, however, the data volume of collective communication is small most of the time. When the data volume is small, the software method based on point-to-point communication is inefficient, and the data computation interrupts the work of the CPU and reduces the running efficiency of the task. Implementing reduction communication with small data volumes in hardware is therefore of great significance for improving the operating efficiency of a high-performance computing system.
The invention patent application CN91105946.6 discloses a reduction processor. Specifically, the reduction processor is controlled by a program having a structure and is adapted to simplify said structure through several reduction steps of different reduction types. A first-stage processor of this type comprises (a) a fast memory (1, 2), which in turn comprises a plurality of fast memory cells, each capable of executing a reduction operation, and (b) a communication network that informs all connected memory cells of the result of each reduction.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a non-blocking network reduction computing device and method based on a logic tree, which are suitable for reduction communication computing with both large and small data volumes, can remarkably accelerate the processing of collective reduction communication, reduce the disturbance of collective reduction communication to the CPU (central processing unit) of the node, and improve the performance of collective reduction communication.
The invention is realized by the following technical scheme:
the invention provides a non-blocking network reduction computing device based on a logic tree, which comprises:
the network packet receiving module is used for receiving the reduction data packet transmitted on the cache network and sending the reduction data packet to the network packet matching module;
the network packet matching module is used for matching the control information of the reduction data packet with the aggregate message state record; after the matching is successful, the reduction data packet is sent to the reduction calculation module and the reduction calculation module is triggered to start calculation;
the reduction calculation module is used for performing local reduction calculation and network reduction calculation;
and the network packet sending module is used for sending the reduction calculation result after the calculation to the reduction communication indication object.
According to the type of reduction calculation of the collective operation, the invention can automatically receive reduction data packets from the network and complete the reduction calculation between network reduction data packets and local data packets.
Preferably, the network packet receiving module includes:
a receiving unit, configured to receive a reduction packet transmitted on a cache network;
the check unit is used for detecting whether the target ID information of the reduction data packet matches the local node, and if so, the received reduction data packet is sent to the network packet matching module; otherwise, the received reduction data packet is discarded.
Preferably, the network packet matching module includes:
the matching logic unit is used for receiving the matching request and retrieving the aggregate message state record from the aggregate message state recording unit based on the aggregate message ID in the matching request;
the aggregate message state recording unit stores aggregate message state records;
the matching unit is used for matching the control information of the reduction data packet with the aggregate message state record, and if the matching is successful, the reduction data packet is sent to the reduction calculation module; if the matching is not successful, the reduction data packet is discarded.
Preferably, the reduction calculation module includes:
the local reduction calculation engine unit is used for carrying out reduction calculation on the data of the local node and the network data of the calculation data buffer unit;
the network reduction calculation engine unit is used for carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit;
and the calculation data buffer unit is used for storing the first reduction data packet of an aggregate message at the local node, the reduction calculation result of the network reduction calculation engine unit and the reduction calculation result of the local reduction calculation engine unit.
Preferably, the calculation data buffer unit is a dual-port structure and is respectively connected with the local reduction calculation engine unit and the network reduction calculation engine unit.
A non-blocking network reduction computing method based on a logic tree is realized by adopting a network reduction computing device, and the method comprises the following steps:
step S01, receiving the reduction data packet transmitted on the cache network, matching the control information of the reduction data packet with the status record of the aggregate message, and after matching is successful, performing reduction calculation of the aggregate message;
step S02, performing local reduction calculation and network reduction calculation, and sending the reduction calculation result to the reduction communication indication object.
Preferably, the step S01 further includes: before matching the reduction data packet, detecting whether the target ID information of the reduction data packet matches the local node; if so, matching the control information of the reduction data packet with the aggregate message state record, and if not, discarding the received reduction data packet.
Preferably, the process of matching the control information of the reduction packet with the aggregate message status record in step S01 includes:
retrieving the state record of the aggregate message according to the aggregate message ID in the matching request;
matching the control information of the reduction data packet with the state record of the aggregate message, and if the matching is successful, carrying out reduction calculation on the aggregate message; if the matching is not successful, the reduction data packet is discarded.
Preferably, the retrieving the aggregated message status record according to the aggregated message ID in the matching request includes:
when the retrieved aggregate message state record is an empty entry, writing the aggregate message ID and the reduction data packet number in the matching request, and setting the entry to be valid;
and when the retrieved aggregate message state record is a valid entry, executing a matching step of the control information of the reduction data packet and the aggregate message state record.
Preferably, the network reduction calculation process in step S02 includes:
if the reduction data packet is the first data packet of the aggregate message at the local node, storing the reduction data packet into the corresponding entry in the calculation data buffer;
if the reduction data packet is an intermediate data packet of the aggregate message at the local node, carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit, storing the reduction calculation result into the calculation data buffer, and updating the state of the aggregate message;
and if the reduction data packet is the last data packet of the aggregate message at the local node, carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit, and generating a sending signal after the calculation is finished.
The invention has the following beneficial effects:
the non-blocking network reduction computing device and method based on a logic tree can automatically complete functions such as aggregate message ID matching, reduction data calculation and reduction result sending in the reduction communication process. Two calculation engines are arranged in the device, so that the reduction calculation between local node data and buffered network data and the reduction calculation between network data packets and buffered network data can be completed simultaneously without blocking. The supported reduction calculation types include logic operations, bitwise operations, comparison operations and the like with various byte lengths. The invention can therefore remarkably accelerate the processing of collective reduction communication, reduce the disturbance of collective reduction communication to the CPU (central processing unit) of the node, and improve the performance of collective reduction communication.
Drawings
FIG. 1 is a schematic block diagram of a non-blocking network reduction computing device based on a logic tree according to the present invention;
FIG. 2 is a schematic structural diagram of a reduction calculation module in a non-blocking network reduction calculation apparatus based on a logic tree according to the present invention;
FIG. 3 is a flowchart of a non-blocking network reduction calculation method based on a logic tree according to the present invention.
Detailed Description
The following are specific embodiments of the present invention and are further described with reference to the drawings, but the present invention is not limited to these embodiments.
Referring to FIG. 1, the non-blocking network reduction computing device based on the logic tree of the present invention includes a network packet receiving module, a network packet matching module, a reduction calculation module, and a network packet sending module. The network packet receiving module is used for receiving the reduction data packet transmitted on the cache network and sending the reduction data packet to the network packet matching module. The network packet matching module is used for matching the control information of the reduction data packet with the aggregate message state record; after the matching is successful, it sends the reduction data packet to the reduction calculation module and triggers the reduction calculation module to start calculation. The reduction calculation module is used for performing local reduction calculation and network reduction calculation. The network packet sending module is used for sending the reduction calculation result to the reduction communication indication object after the calculation.
The main working procedure of the non-blocking network reduction computing device based on a logic tree is as follows: a reduction data packet is received from the network, checked and matched against local information, and stored in the corresponding buffer; the calculation on the corresponding data, which comprises the other reduction data packets received from the network and the reduction data of the local node, is then completed; and after all reduction calculations are completed, the result data packet is submitted and sent according to the reduction communication instruction of the local node.
Specifically, the network packet receiving module includes a receiving unit and a check unit. The receiving unit is used for receiving the reduction data packet transmitted on the cache network. The check unit is used for detecting whether the target ID information of the reduction data packet matches the local node; if so, the check unit sends the received reduction data packet to the network packet matching module; otherwise, the received reduction data packet is discarded. For example, when local node 4 detects that the target ID information carried by a reduction data packet does not include local node 4, the target ID information is considered not to match, that is, the reduction data packet is not addressed to local node 4. The received reduction data packet is discarded and no subsequent reduction calculation is performed on it. The node then continues to receive reduction data packets and performs the same check on each packet received.
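The check described above can be sketched in software as follows. This is only an illustration: the packet layout is not given in the text, so the `ReductionPacket` class, its `dest_ids` field (assumed here to be a set of target node IDs) and the function `check_packet` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReductionPacket:
    dest_ids: frozenset   # hypothetical: IDs of the nodes the packet targets
    payload: bytes

def check_packet(packet, local_node_id):
    """Return the packet if it targets this node, otherwise None (discarded)."""
    if local_node_id in packet.dest_ids:
        return packet      # forwarded to the network packet matching module
    return None            # discarded; no reduction calculation follows

pkt = ReductionPacket(dest_ids=frozenset({1, 2, 3}), payload=b"\x01")
```

With this packet, `check_packet(pkt, 4)` discards the packet for local node 4, while `check_packet(pkt, 2)` passes it on.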
Specifically, the network packet matching module includes a matching logic unit, an aggregate message state recording unit, and a matching unit. The matching logic unit is used for receiving the matching request and retrieving the aggregate message state record from the aggregate message state recording unit based on the aggregate message ID in the matching request. The aggregate message state recording unit stores aggregate message state records, each stored in correspondence with its index information. The matching unit is used for matching the control information of the reduction data packet with the aggregate message state record; if the matching is successful, the reduction data packet is sent to the reduction calculation module; if the matching is not successful, the reduction data packet is discarded. The matched content includes information such as the reduction operation ID, the operation type and the data length.
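A minimal sketch of the comparison performed by the matching unit, assuming the control information consists only of the three fields named above. The class `ControlInfo`, the function `control_matches` and the operation-type strings are illustrative names, not taken from the invention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlInfo:
    op_id: int      # reduction operation ID
    op_type: str    # operation type, e.g. "sum" or "and" (illustrative values)
    data_len: int   # data length in bytes

def control_matches(packet_info, recorded_info):
    """True only if every control field of the packet equals the recorded one."""
    return (packet_info.op_id == recorded_info.op_id
            and packet_info.op_type == recorded_info.op_type
            and packet_info.data_len == recorded_info.data_len)

recorded = ControlInfo(op_id=5, op_type="sum", data_len=64)
```

A packet that differs in any one field, for example in its operation type, fails the match and would be discarded.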
The specific matching process is as follows. The matching logic unit receives a matching request submitted by the network interface and retrieves the aggregate message state record according to the low-order bits of the aggregate message ID in the matching request. If the entry is empty, the aggregate message ID and the network reduction packet number in the matching request are written into it and the entry is set to valid. If the entry is valid, the aggregate message ID recorded in the entry is read out and compared with the aggregate message ID carried in the network reduction packet: if they match, the number of the network reduction packet is recorded, and the packet is accepted and submitted to the next-stage calculation control module; if they do not match, the network reduction packet is discarded and a matching error response is generated for the source node of the network reduction packet.
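The lookup just described can be sketched as a small table indexed by the low-order bits of the aggregate message ID. The table size (`TABLE_BITS`), the class `StateTable` and the returned status strings are assumptions made for illustration; the text does not specify them.

```python
TABLE_BITS = 4  # assumed: a 16-entry state table

class StateTable:
    def __init__(self):
        self.entries = [None] * (1 << TABLE_BITS)   # None marks an empty entry

    def match(self, msg_id, packet_no):
        index = msg_id & ((1 << TABLE_BITS) - 1)    # low-order bits of the ID
        entry = self.entries[index]
        if entry is None:
            # Empty entry: write the ID and packet number, entry becomes valid.
            self.entries[index] = {"msg_id": msg_id, "packets": [packet_no]}
            return "accepted"
        if entry["msg_id"] == msg_id:
            # Valid entry with the same ID: record the packet number.
            entry["packets"].append(packet_no)
            return "accepted"
        # Valid entry with a different ID: discard and report to the source.
        return "match_error"

table = StateTable()
```

In this sketch, IDs `0x12` and `0x22` share index 2, so once `0x12` occupies the entry a packet carrying `0x22` is rejected with a matching error.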
As shown in FIG. 2, the reduction calculation module includes a local reduction calculation engine unit, a network reduction calculation engine unit, and a calculation data buffer unit. The local reduction calculation engine unit is used for carrying out reduction calculation on the data of the local node and the network data of the calculation data buffer unit. The network reduction calculation engine unit is used for carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit. The calculation data buffer unit is used for storing the first reduction data packet of an aggregate message at the local node, the reduction calculation result of the network reduction calculation engine unit and the reduction calculation result of the local reduction calculation engine unit.
The calculation data buffer unit has a dual-port structure and is connected to the local reduction calculation engine unit and the network reduction calculation engine unit respectively; that is, port 0 is connected with the local reduction calculation engine unit, and port 1 is connected with the network reduction calculation engine unit. The bit width of each port is twice that of the data path, and both ports support read and write operations. The local reduction calculation engine unit is responsible for reading out the data of the local node and performing reduction calculation with the network data held in the calculation data buffer. The network reduction calculation engine unit is responsible for performing reduction calculation on the data of the reduction data packet and the data in the calculation data buffer.
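The arrangement of two engines sharing one dual-port buffer can be modelled in software roughly as below. The sketch runs sequentially, so it does not reproduce the concurrent, non-blocking access of the hardware or any port arbitration; the class names `ComputeBuffer` and `ReductionEngine` and the access counter are hypothetical.

```python
import operator

class ComputeBuffer:
    """Software stand-in for the dual-port calculation data buffer."""
    def __init__(self):
        self.slots = {}
        self.port_accesses = {0: 0, 1: 0}   # accesses counted per port

    def read(self, port, entry):
        self.port_accesses[port] += 1
        return self.slots.get(entry)

    def write(self, port, entry, value):
        self.port_accesses[port] += 1
        self.slots[entry] = value

class ReductionEngine:
    """One engine bound to one buffer port (0 = local, 1 = network)."""
    def __init__(self, buffer, port, op=operator.add):
        self.buffer, self.port, self.op = buffer, port, op

    def reduce_into(self, entry, data):
        current = self.buffer.read(self.port, entry)
        if current is None:
            result = list(data)   # nothing buffered yet: store the data
        else:
            result = [self.op(a, b) for a, b in zip(current, data)]
        self.buffer.write(self.port, entry, result)
        return result

buffer = ComputeBuffer()
local_engine = ReductionEngine(buffer, port=0)
network_engine = ReductionEngine(buffer, port=1)

network_engine.reduce_into(0, [1, 2])       # first network packet is stored
local_engine.reduce_into(0, [10, 20])       # local data folded in via port 0
network_engine.reduce_into(0, [100, 200])   # another network packet via port 1
```

Because the reduction operator used here is associative and commutative, the buffered result does not depend on the order in which the two engines contribute their data.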
The specific processing flow of the network reduction calculation engine unit is as follows:
1. if the reduction data packet is the first data packet of the aggregate message at the local node, the reduction data packet is stored directly into the corresponding entry in the calculation data buffer;
2. if the reduction data packet is an intermediate data packet, the data packet is submitted to the network reduction calculation engine unit, which performs the calculation with the data in the calculation data buffer and stores the result back in the calculation data buffer; meanwhile, the state of the aggregate message is updated to record that the reduction data packet has been received and processed;
3. if the reduction data packet is the last data packet of the aggregate message at the local node, a sending signal is generated after the calculation is finished, and the reduction result data packet is then submitted to the network packet sending module.
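The three cases above can be sketched as a single packet handler. This is a simplified illustration that ignores the local reduction calculation engine; the function `process_packet` and the `state` dictionary with its `received` and `expected` counters are assumptions, not the state encoding of the invention.

```python
import operator

def process_packet(buffer, state, entry, data, op=operator.add):
    """Handle one matched packet; return the result once the last one arrives."""
    state["received"] += 1   # update the aggregate message state
    if state["received"] == 1:
        # Case 1: first packet, stored directly in the buffer entry.
        buffer[entry] = list(data)
    else:
        # Cases 2 and 3: reduce with the buffered data and store the result.
        buffer[entry] = [op(a, b) for a, b in zip(buffer[entry], data)]
    if state["received"] == state["expected"]:
        # Case 3: last packet, hand the result to the sending module.
        return buffer.pop(entry)
    return None

buf = {}
state = {"received": 0, "expected": 3}
first = process_packet(buf, state, 0, [1, 1])
middle = process_packet(buf, state, 0, [2, 2])
last = process_packet(buf, state, 0, [3, 3])
```

Only the call that delivers the last expected packet returns a value, which stands in for the sending signal.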
Referring to FIG. 3, the present invention provides a non-blocking network reduction calculation method based on a logic tree, implemented using the above non-blocking network reduction computing device based on a logic tree. The method comprises the following steps:
step S01, receiving the reduction data packet transmitted on the cache network, matching the control information of the reduction data packet with the status record of the aggregate message, and after matching is successful, performing reduction calculation of the aggregate message;
step S02, performing local reduction calculation and network reduction calculation, and sending the reduction calculation result to the reduction communication indication object.
The step S01 further includes: before matching the reduction data packet, detecting whether the target ID information of the reduction data packet matches the local node; if so, matching the control information of the reduction data packet with the aggregate message state record, and if not, discarding the received reduction data packet. This step is used to detect whether the reduction data packet is addressed to the receiving node.
As shown in Table one, the fields of the aggregate message state have the following meanings:
aggregate message ID: used for distinguishing the different messages handled by the node;
local node attribute: local nodes are divided into leaf nodes, parent nodes and root nodes. A leaf node only sends its own data to the node at the next level up, without calculation; a parent node receives network reduction packets, calculates them with its local data and sends the result to the node at the next level up; the root node receives network reduction packets and generates the reduction result after calculating them with its local data;
number of child nodes: valid when the node is a root node or a parent node, and indicates the number of network reduction packets that the node needs to receive in the current aggregate message;
child node vector: records exactly which child nodes' network reduction packets have been received, and is used for eliminating duplicate network reduction packets; the number of bits in the child node vector depends on the maximum number of child nodes supported.
Table one: schematic diagram of aggregate message state coding
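The fields listed above can be sketched as a record with a bit vector for duplicate elimination. The field widths and encoding of Table one are not reproduced here; the names `NodeRole`, `AggregateMessageState`, `record_child` and `all_received` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class NodeRole(Enum):
    LEAF = "leaf"
    PARENT = "parent"
    ROOT = "root"

@dataclass
class AggregateMessageState:
    msg_id: int             # aggregate message ID
    role: NodeRole          # local node attribute
    child_count: int = 0    # meaningful for parent and root nodes only
    child_vector: int = 0   # bit i is set once child i's packet has arrived

    def record_child(self, child_index):
        """Return True for a new packet, False for a duplicate."""
        bit = 1 << child_index
        if self.child_vector & bit:
            return False
        self.child_vector |= bit
        return True

    def all_received(self):
        # Complete when the number of set bits equals the expected child count.
        return bin(self.child_vector).count("1") == self.child_count

state_record = AggregateMessageState(msg_id=7, role=NodeRole.PARENT, child_count=2)
```

A second packet from the same child finds its bit already set and is reported as a duplicate, which is how the vector eliminates repeated network reduction packets in this sketch.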
The process of matching the control information of the reduction packet with the aggregate message status record in step S01 includes: retrieving the state record of the aggregate message according to the aggregate message ID in the matching request; matching the control information of the reduction data packet with the state record of the aggregate message, and if the matching is successful, carrying out reduction calculation on the aggregate message; if the matching is not successful, the reduction data packet is discarded.
Wherein, the process of retrieving the aggregated message state record according to the aggregated message ID in the matching request comprises:
when the retrieved aggregate message state record is an empty entry, writing the aggregate message ID and the reduction data packet number in the matching request, and setting the entry to be valid;
and when the retrieved aggregate message state record is a valid entry, executing a matching step of the control information of the reduction data packet and the aggregate message state record.
After the matching request is submitted, matching is performed against the reduction control information stored in the pending record selected by the matching index information. The matched content includes the reduction operation ID, the operation type, the data length and the like. If the information does not match, a matching error response is returned; if the information matches, the matched source information is recorded and the reduction calculation module is notified to start calculation. Under the control of the calculation control module, the matched reduction data packet then completes the calculation processing of the message according to the record in the aggregate message state, with the following cases:
1. if the reduction data packet is the first data packet of the aggregate message at the local node, storing the reduction data packet into the corresponding entry in the calculation data buffer;
2. if the reduction data packet is an intermediate data packet of the aggregate message at the local node, carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit, storing the reduction calculation result into the calculation data buffer, and updating the state of the aggregate message;
3. if the reduction data packet is the last data packet of the aggregate message at the local node, carrying out reduction calculation on the reduction data packet and the network data of the calculation data buffer unit, and generating a sending signal after the calculation is finished.
It will be appreciated by persons skilled in the art that the embodiments of the invention described above and shown in the drawings are given by way of example only and are not limiting of the invention. The objects of the present invention have been fully and effectively accomplished. The functional and structural principles of the present invention have been shown and described in the examples, and any variations or modifications of the embodiments of the present invention may be made without departing from the principles.