CN115720211B - Network-computing integrated inter-core router and data packet aggregation method
Abstract
The invention discloses a network-computing integrated inter-core router, which comprises input buffers, an arbiter, parsers, an in-network operation unit, route calculation units, a crossbar switch and output buffers. By adding the parsers and the in-network operation unit to the inter-core router, the router can aggregate the data packets that it forwards. The invention also defines a new data packet format that works with the parser and the in-network operation unit, and provides a reduction-tree routing algorithm that cooperates with the inter-core routers so that as many data packets as possible meet in the same inter-core router and are aggregated there.
Description
Technical Field
The invention relates to the technical field of multi-core integrated circuits, and in particular to a network-computing integrated inter-core router and a data packet aggregation method.
Background
A multi-core-particle integrated system integrates multiple core particles (bare dies, i.e. chiplets) that each implement specific functions into one system chip through advanced packaging technology. Compared with the traditional monolithic integration approach, multi-core-particle integration has advantages and potential in chip performance, power consumption, cost and business model. Deep neural networks (DNNs) are deployed on multi-core-particle integrated systems by partitioning the training data set or the model. Each core particle processes a different subset of the training data in parallel to train a local copy of the model; after every core particle finishes, the training results are gathered, the parameters are aggregated and broadcast to each node to update the local models, and the next iteration begins.
In the above process, the step in which every core particle transmits its data to the same core particle is called reduction, and the step in which one core particle transmits data to all other core particles is called broadcast. A reduction followed by a broadcast together form an all-reduce. All-reduce is in fact a communication process that reduces the data held in the different core particles and distributes the result to each core particle.
In the prior art, all-reduce can become a performance bottleneck even for computation-intensive applications such as neural-network training tasks. Because the communication bandwidth between core particles in a multi-core-particle integrated system is limited, communication over the inter-core network takes considerably longer than communication inside a core particle. If the network transmission rate is low, the parameter-aggregation stage of deep-neural-network training is likely to cause network congestion and increase network delay, degrading overall performance; the communication performance of the inter-core network therefore becomes the performance bottleneck of training.
At present, optimization methods for all-reduce fall mainly into software methods and hardware methods. Software algorithms, such as those in J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra, "Performance analysis of MPI collective operations," Cluster Computing, vol. 10, no. 2, pp. 127-143, 2007, and R. Rabenseifner, "Optimization of collective reduction operations," International Conference on Computational Science, Springer, 2004, pp. 1-9, increase the amount of traffic in the system and therefore incur high delays. Hardware methods mainly target large-scale parallel systems composed of many computers: an operation unit integrated into the switch aggregates data before transmission, reducing the amount of data transferred. However, such designs target coarse-grained network flows and impose complex functional requirements on the operation unit, so they cannot be transplanted directly into a multi-core-particle integrated system whose chip area is limited.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a network-computing integrated inter-core router and a data packet aggregation method.
The first object of the present invention can be achieved by adopting the following technical scheme:
A network-computing integrated inter-core router comprises N input buffers, 1 arbiter, N parsers, 1 in-network operation unit, N+1 route calculation units, 1 crossbar switch and N output buffers, wherein:
the input buffer is connected with a core particle or with the output buffer of another inter-core router; it receives the data packets transmitted by that output buffer or by the core particle and buffers them while they wait for arbitration by the arbiter; each data packet comprises a source address, a destination address, a net calculation flag bit and a data part;
the arbiter reads the contents of the data packets in the input buffers, arbitrates among them, and decides which data packet in which input buffer is transmitted to the parser for parsing;
the parser receives the data packet transmitted by the input buffer and forwards it to either the route calculation unit or the in-network operation unit; it judges whether the data packet needs in-network operation from the packet's net calculation flag bit: if the flag bit equals 0, the packet does not need in-network operation and the parser transmits it to the route calculation unit; if the flag bit is not 0, the packet needs in-network operation and the parser transmits it to the in-network operation unit;
the in-network operation unit receives the data packet transmitted by the parser and, after performing the in-network operation, transmits the packet to the route calculation unit;
the route calculation unit receives the data packet from the parser or from the in-network operation unit, calculates the output port corresponding to the packet from its destination address using a routing algorithm, and transmits the packet together with the routing result to the crossbar switch; for packets transmitted by the parser, the route calculation unit uses the XY routing algorithm, i.e. the packet is forwarded according to columns first and then according to rows; for packets transmitted by the in-network operation unit, the route calculation unit uses the reduction-tree routing algorithm; the routing result is an electrical signal that controls the state of the crossbar switch and indicates to which output buffer the crossbar switch should transmit the incoming packet;
the crossbar switch receives the data packet and the routing result from the route calculation unit; it comprises N multiplexers, each connected to one output buffer, and, according to the routing result, selects through the multiplexer the data packet transmitted by one route calculation unit and transmits it to the corresponding output buffer;
the output buffer receives the data packet transmitted by the crossbar switch and transmits it, on a first-come-first-served basis, to the input buffer of another inter-core router or to the core particle connected with this inter-core router.
In the inter-core router disclosed above, a parser and an in-network operation unit are added for the first time, so that the inter-core router can perform aggregation operations on the data packets it forwards. A new data packet format is defined to work with the parser and the in-network operation unit. To let the improved router aggregate packets as effectively as possible, a matching reduction-tree routing algorithm is also provided, so that more data packets meet in the same router and the aggregation operation is executed as often as possible.
Further, the source address in the data packet indicates which core particle in the network generated the packet, the destination address indicates which core particle in the network should receive it, the net calculation flag bit distinguishes whether the packet needs in-network operation, and the data part is the specific communication content the packet carries. The source address lets the receiving core particle identify where a packet came from, ensuring that packets from different origins are handled correctly; the destination address is the input of the routing algorithm in the packet-forwarding stage; the net calculation flag bit classifies packets: packets that do not need in-network operation have their flag bit set to 0, which keeps them away from the in-network operation unit and speeds up forwarding, while packets that do need in-network operation are divided by the flag bit into several groups, the flag bit serving as a group identifier so that only packets of the same group are aggregated and the inter-core router can handle several groups of packets to be aggregated at the same time; the data part is the specific data the core particle wants to transmit.
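As an illustration only, the four fields described above can be modelled by the following minimal Python sketch; the class name Packet, the field names and the coordinate convention are assumptions made for this example and are not prescribed by the invention.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Packet:
    """Inter-core-network data packet: source address, destination address,
    net calculation flag bit and data part (names are illustrative)."""
    src: Tuple[int, int]   # (row, column) of the core particle that generated the packet
    dst: Tuple[int, int]   # (row, column) of the core particle that should receive it
    flag: int              # 0: no in-network operation; non-zero: aggregation group identifier
    data: float            # payload to be transmitted (and, if flag != 0, aggregated)

    def needs_in_network_op(self) -> bool:
        # The parser forwards packets with flag == 0 straight to route calculation,
        # and packets with flag != 0 to the in-network operation unit.
        return self.flag != 0
```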
Further, the in-network operation unit comprises a multiplexer, a net calculation buffer, a net calculation controller, a comparator and a logic calculation unit. The working process of the in-network operation unit is called in-network operation and proceeds as follows: the comparator compares the net calculation flag bit of the data packet transmitted by the parser with the net calculation flag bit of the data packet in the net calculation buffer; if the two are not equal, the net calculation controller transmits the packet in the net calculation buffer to the route calculation unit; if the two are equal, the net calculation controller sends the data part of the packet transmitted by the parser and the data part of the packet in the net calculation buffer to the logic calculation unit for addition, the result is written back to the net calculation buffer to overwrite the original data part, and at the same time the source address, destination address and net calculation flag bit of the packet transmitted by the parser are written into the net calculation buffer. The multiplexer connects each parser to the in-network operation unit. The net calculation buffer can temporarily hold one data packet, which satisfies the need of packet aggregation to operate on the contents of two packets at once. The comparator compares the net calculation flag bits of the packets and passes the result to the net calculation controller; the net calculation controller receives the comparison result and controls the input and output of the net calculation buffer. Together, the comparator and the net calculation controller ensure that the in-network operation unit aggregates only packets of the same group: for packets of different groups, the packet already in the buffer is sent out first, and the packet transmitted by the multiplexer is then stored in the net calculation buffer. The logic calculation unit performs floating-point addition on the packet contents to obtain the data part of the aggregated packet.
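The net effect of this paragraph (and of the aggregation steps described later in the data packet aggregation method) can be summarised by the following behavioural Python sketch. It is not the hardware design: the multiplexer, comparator, controller and adder are collapsed into one accept() method, and all names are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass
class Packet:
    src: Tuple[int, int]
    dst: Tuple[int, int]
    flag: int    # net calculation flag bit (aggregation group identifier)
    data: float  # data part

class InNetworkOpUnit:
    """Behavioural model of the in-network operation unit: a one-packet
    net calculation buffer, a comparator on the flag bit and a floating-point adder."""

    def __init__(self) -> None:
        self.buffer: Optional[Packet] = None  # the net calculation buffer holds at most one packet

    def accept(self, pkt: Packet) -> Optional[Packet]:
        """Process one packet handed over by the parser.  Returns a packet to pass
        on to the route calculation unit, or None if everything stays buffered."""
        if self.buffer is None:
            self.buffer = pkt                    # nothing buffered yet: just store the packet
            return None
        if self.buffer.flag != pkt.flag:         # different groups: flush the buffered packet,
            out, self.buffer = self.buffer, pkt  # then store the newcomer in the buffer
            return out
        # Same group: add the data parts, keep the newcomer's addresses and flag bit,
        # and hand the merged packet to the route calculation unit.
        merged = replace(pkt, data=self.buffer.data + pkt.data)
        self.buffer = None
        return merged
```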
Further, the reduction-tree routing algorithm works in a mesh network. It plans the routing paths of the data packets that need in-network operation into a tree that is as balanced as possible, so that during forwarding such packets pass through the same inter-core router at similar times and the in-network operation unit of that router can complete as many aggregation tasks as possible. The algorithm proceeds as follows:
T1, let the size of the mesh network be n×n and create n² Node sets in total to represent the core particles in the network, where a Node set contains the following information: the x coordinate and y coordinate of the core particle in the mesh network, the number of child nodes, and the precursor Node; the x coordinate is denoted x, the y coordinate is denoted y, and the position information of a Node set refers to its x and y coordinates, written (x, y);
T2, assigning values for members of the Node set according to the positions of the core particles in the mesh network, wherein the number of child nodes is set to 0, and the precursor nodes of the core particles are set to be empty;
T3, an adjacency matrix M of size n²×n² is used to represent the connection relation of the nodes, where i and j denote integers with 1 ≤ i, j ≤ n; for the (i×n+j)-th row of the matrix, the ((i+1)×n+j)-th, ((i-1)×n+j)-th, (i×n+j+1)-th and (i×n+j-1)-th columns are assigned 1 to represent a connection, and the other columns are assigned 0 to represent no connection;
T4, for a Node set, the (x×n+y)-th row of the adjacency matrix M records its adjacency information; the n² vertices representing the Node sets are combined with the edge information recorded in the adjacency matrix M to form a graph G;
And T5, updating the graph G, wherein the process is as follows:
T5.1, initialization: the path length from each node directly connected with the source point to the source point is 1, and the shortest path length from every other node to the source point is infinity;
T5.2, among the nodes that have not been visited, select the node closest to the source point and denote it u;
T5.3, mark u as visited;
T5.4, select a node connected with node u and denote it v;
T5.5, if v has not been visited and the path from the source point to node v is shorter when u is used as an intermediate point, record the path length from node v to the source point as the path length from node u to the source point plus one, and add u to the candidate precursor nodes of v;
T5.6, repeat steps T5.4 and T5.5 for all vertices reachable from u;
T5.7, from the candidate precursor nodes of u, select the node with the largest number of child nodes and denote it w;
T5.8, record the precursor of u as w, and add 1 to the number of child nodes of w;
T5.9, repeat steps T5.2 to T5.8 until all nodes have been visited;
T6, create a set R of size n²; traverse every Node in the graph G and, through the precursor member v in its Node set, obtain the coordinates of its precursor node as (v→x, v→y) and store them into the set R, where the symbol "v→x" denotes the x coordinate in the Node set v and the symbol "v→y" denotes the y coordinate in the Node set v;
And T7, taking the set R as a result of a routing algorithm, and storing the set R into a routing calculation unit of the inter-core router.
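The following Python sketch is one possible reading of steps T1-T7; the source point of step T5 is taken to be the destination core particle of the reduction packets, as in step F5 of Embodiment 1, and the function and variable names are illustrative. Two points are interpretations rather than literal restatements of the text: nodes at the same shortest distance are also recorded as candidate precursors (otherwise the choice in step T5.7 would be trivial), and the adjacency matrix of steps T3-T4 is replaced by an equivalent neighbour function.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a core particle in the mesh, 1-based

def reduction_tree(n: int, root: Coord) -> Dict[Coord, Coord]:
    """Sketch of the reduction-tree routing algorithm: run a shortest-path search
    over the n x n mesh starting from `root` and, when fixing each node, choose as
    its precursor the candidate with the most children, so that packets heading
    for `root` tend to meet in the same inter-core routers."""
    nodes: List[Coord] = [(x, y) for x in range(1, n + 1) for y in range(1, n + 1)]

    def neighbours(c: Coord) -> List[Coord]:
        # Mesh adjacency (steps T3-T4): each core particle connects to its 4-neighbours.
        x, y = c
        cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [(a, b) for a, b in cand if 1 <= a <= n and 1 <= b <= n]

    INF = float("inf")
    dist: Dict[Coord, float] = {c: INF for c in nodes}
    children: Dict[Coord, int] = {c: 0 for c in nodes}             # number of child nodes
    candidates: Dict[Coord, List[Coord]] = {c: [] for c in nodes}  # candidate precursor nodes
    pred: Dict[Coord, Coord] = {}                                  # the set R (node -> precursor)

    dist[root] = 0
    for nb in neighbours(root):        # T5.1: nodes adjacent to the source point are at distance 1
        dist[nb] = 1
        candidates[nb].append(root)

    visited = {root}
    while len(visited) < len(nodes):
        # T5.2: pick the unvisited node u closest to the source point.
        u = min((c for c in nodes if c not in visited), key=lambda c: dist[c])
        visited.add(u)                 # T5.3
        # T5.7-T5.8: precursor of u = candidate with the largest number of children.
        w = max(candidates[u], key=lambda c: children[c])
        pred[u] = w
        children[w] += 1
        # T5.4-T5.6: relax the neighbours of u.
        for v in neighbours(u):
            if v in visited:
                continue
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                candidates[v] = [u]
            elif dist[u] + 1 == dist[v]:
                candidates[v].append(u)
    return pred  # T6-T7: precursor of every core particle except the root itself
```

Which of several equally short precursors is chosen depends on iteration order, which the algorithm description does not fully specify; the sketch therefore reproduces the structure of the result rather than a unique routing table.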
Further, the arbiter uses round-robin arbitration to poll the data packets in the input buffers and select one as the input of the parser. Round-robin arbitration ensures that the arbiter treats every input buffer equally, so that no data packet stays in an input buffer for too long.
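A minimal behavioural sketch of this round-robin arbitration, assuming the input buffers are modelled as queues (all names are illustrative and the real arbiter is a hardware circuit):

```python
from collections import deque
from typing import List, Optional

class RoundRobinArbiter:
    """Polls the input buffers in cyclic order so that every buffer is served
    equally and no packet waits in an input buffer indefinitely."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.last = num_inputs - 1  # index of the buffer served most recently

    def grant(self, buffers: List[deque]) -> Optional[int]:
        """Return the index of the input buffer whose head packet is passed to the
        parser next, or None if every buffer is empty."""
        for offset in range(1, self.num_inputs + 1):
            i = (self.last + offset) % self.num_inputs
            if buffers[i]:
                self.last = i
                return i
        return None
```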
Further, the input buffer stores data packets in several fixed-length queues. Using queues ensures that the inter-core router can process multiple received data packets one by one, and the fixed queue length reduces circuit complexity and the area of the inter-core router.
The second object of the invention can be achieved by adopting the following technical scheme:
A data packet aggregation method based on the network-computing integrated inter-core router, the method comprising the following steps:
S1, a data packet enters the network-computing integrated inter-core router; it is first stored temporarily in an input buffer and waits for arbitration by the arbiter;
S2, the arbiter reads the contents of the data packets in the input buffers, performs round-robin arbitration over them, polls the data packets in each input buffer, and decides which data packet in which input buffer is transmitted to the parser for parsing;
S3, after passing the arbiter, the data packet enters the parser; in the parser, the router judges whether the packet needs in-network operation from its net calculation flag bit: if the flag bit equals 0, the packet does not need in-network operation, steps S4 and S5 are skipped and step S6 is executed; if the flag bit is not 0, the packet needs in-network operation and step S4 is executed;
S4, after entering the in-network operation unit the data packet is denoted packet A; if the net calculation controller finds that no packet is buffered in the net calculation buffer, packet A is stored in the net calculation buffer and this step ends; if a packet B is already stored in the net calculation buffer, the data part of packet A is first sent to the logic calculation unit and the net calculation flag bits of packet A and packet B enter the comparator for comparison, the comparator passing the result to the net calculation controller; if the flag bits of the two packets are not equal, the net calculation controller makes the net calculation buffer output packet B, empties the buffer, sets the data part of the buffer to 0 and inputs it to the logic calculation unit, and stores the source address, destination address and net calculation flag bit of packet A in the net calculation buffer; if the flag bits of the two packets are equal, the net calculation controller makes the net calculation buffer input the data part of packet B into the logic calculation unit, and the source address, destination address and net calculation flag bit of packet A are stored in the net calculation buffer, overwriting those of packet B;
S5, the logic calculation unit performs floating-point addition on the data part of packet A and the data part of packet B, the result is stored in the net calculation buffer and overwrites the original data part there, and finally the net calculation controller transmits the packet to the route calculation unit;
S6, the route calculation unit calculates the output port corresponding to the data packet by using a routing algorithm according to the destination address in the packet; for packets transmitted by the parser the route calculation unit uses the XY routing algorithm, and for packets transmitted by the in-network operation unit it uses the reduction-tree routing algorithm (see the sketch after this list);
S7, the data packet enters the crossbar switch, which selects through a multiplexer the packet transmitted by one route calculation unit and transmits it to the corresponding output buffer;
S8, the output buffer transmits the data packet out of the router on a first-come-first-served basis.
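The two routing modes used in step S6 can be sketched as follows. The next-hop coordinate stands in for the output-port address; the "correct the column first, then the row" reading of the XY algorithm is inferred from the path of packet K3 in Example 3 and is an assumption, as is the representation of the reduction-tree result (the set R) as a dictionary.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column), 1-based as in the embodiments

def xy_route(cur: Coord, dst: Coord) -> Coord:
    """XY routing for packets coming from the parser: correct the column first,
    then the row, one hop per call."""
    r, c = cur
    dr, dc = dst
    if c != dc:
        return (r, c + (1 if dc > c else -1))
    if r != dr:
        return (r + (1 if dr > r else -1), c)
    return cur  # already at the destination core particle

def reduction_tree_route(cur: Coord, table_r: Dict[Coord, Coord]) -> Coord:
    """Reduction-tree routing for packets coming from the in-network operation unit:
    forward towards the stored precursor of the current node."""
    return table_r[cur]
```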
Further, the net calculation controller in the in-network operation unit records whether a data packet is stored in the net calculation buffer and how long it has been there; if a packet has stayed in the net calculation buffer longer than a preset threshold period, the net calculation controller makes the net calculation buffer transmit the packet to the route calculation unit. Setting the preset threshold period avoids two problems: 1. a data packet waiting too long in the buffer for aggregation, which would slow down packet forwarding; 2. the case where only one packet requiring in-network operation remains in the network and has not yet reached its destination address, in which case the packet would stay in the net calculation buffer and never be forwarded to its destination.
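A small sketch of this timeout mechanism; counting in clock cycles and the class name are illustrative assumptions, since the text does not fix the unit of the threshold period.

```python
class NetCalcBufferTimer:
    """Tracks how long a packet has been resident in the net calculation buffer and
    tells the controller when to flush it to the route calculation unit."""

    def __init__(self, threshold_cycles: int) -> None:
        self.threshold = threshold_cycles
        self.resident_cycles = 0
        self.occupied = False

    def on_store(self) -> None:   # a packet was written into the buffer
        self.occupied = True
        self.resident_cycles = 0

    def on_flush(self) -> None:   # the buffer was emptied
        self.occupied = False
        self.resident_cycles = 0

    def tick(self) -> bool:
        """Advance one cycle; return True when the buffered packet should be
        forwarded even though no partner packet has arrived."""
        if not self.occupied:
            return False
        self.resident_cycles += 1
        return self.resident_cycles > self.threshold
```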
Compared with the prior art, the invention has the following advantages and effects:
1. The invention reduces the number of data packets present in the inter-core network of the multi-core-particle integrated system at any given time, lowers the communication load of the inter-core network, and improves the communication efficiency of the multi-core-particle integrated system.
2. The packet-aggregation task is moved into the packet-transmission process, which reduces the workload of the core particles and shortens the time the multi-core-particle integrated system needs to complete neural-network training tasks.
3. The proposed micro-architecture design occupies little area and meets the chip-area constraints on inter-core routers.
4. The invention has good generality: even if the multi-core-particle integrated system simultaneously carries data packets that need in-network operation and data packets that do not, the inter-core router can correctly complete both packet forwarding and packet aggregation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a network-computing integrated inter-core router microarchitecture according to the invention;
FIG. 2 is a schematic diagram of a data path of a data packet within an inter-core router according to the present invention;
FIG. 3 is a diagram of the architecture of the intra-network arithmetic unit of the present invention;
FIG. 4 is a flowchart of the operation unit in the network of the present invention;
FIG. 5 is a schematic diagram of a multi-core-particle integrated system with 16 core particles according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
This embodiment provides a network-computing integrated inter-core router used in a multi-core-particle integrated system composed of 16 core particles. The inter-core network topology of the multi-core-particle integrated system is a mesh network. The overall architecture of the multi-core-particle integrated system is shown in FIG. 5.
The microarchitecture of the network-computing integrated inter-core router, including the connection relations of its components, is shown in FIG. 1. The main functions of the network-computing integrated inter-core router are to forward data packets and to perform in-network operation tasks in the inter-core network of the multi-core-particle integrated system. A data packet of the inter-core network comprises a source address, a destination address, a net calculation flag bit and a data part: the source address indicates which core particle in the network generated the packet, the destination address indicates which core particle in the network should receive it, the net calculation flag bit distinguishes whether the packet needs in-network operation, and the data part is the specific communication content the packet carries. Forwarding means transmitting a data packet from one inter-core router to the next according to the packet's destination address. In-network operation means that, during forwarding, the data parts of data packets with the same net calculation flag bit are added to obtain an operation result, a new packet carrying that result is generated and forwarded, and the other packets that took part in the operation are discarded. In the inter-core network shown in FIG. 5, 4 inter-core routers are connected with at most 4 other inter-core routers; their components comprise 5 input buffers, 5 parsers, 1 in-network operation unit, 6 route calculation units, 1 arbiter, 1 crossbar switch and 5 output buffers. 8 inter-core routers are connected with at most 3 other inter-core routers; their components comprise 4 input buffers, 4 parsers, 1 in-network operation unit, 5 route calculation units, 1 arbiter, 1 crossbar switch and 4 output buffers. 4 inter-core routers are connected with at most 2 other inter-core routers; their components comprise 3 input buffers, 3 parsers, 1 in-network operation unit, 4 route calculation units, 1 arbiter, 1 crossbar switch and 3 output buffers. The specific functions of the components are as follows:
one input buffer is connected with the core grains, the other input buffers are connected with the output buffer of the other inter-core-grain router, the input buffer receives data packets transmitted by the output buffer of the other inter-core-grain router or data packets transmitted by the core grains, and the data packets are buffered by using a queue with the length of 12 bytes to wait for arbitration by an arbiter; the data packet comprises a source address, a target address, a network computing flag bit and a data part.
The arbiter reads the contents of the data packets in the input buffers, arbitrates among them, and decides which data packet in which input buffer is transmitted to the parser for parsing. The arbiter uses round-robin arbitration to poll the data packets in the input buffers and take them as inputs of the parser.
The parser receives the data packet transmitted by the input buffer, and transmits the data packet to the route calculation unit or the network operation unit after parsing. Analyzing means judging whether the data packet needs to perform the in-network operation according to the network calculation flag bit of the data packet, if the value of the network calculation flag bit of the data packet is equal to 0, the data packet is indicated not to need to perform the in-network operation, and the analyzer transmits the data packet to the route calculation unit; if the value of the network calculation flag bit of the data packet is not 0, the data packet is indicated to need to carry out the network operation, and the parser transmits the data packet to the network operation unit.
The in-network operation unit comprises a multiplexer, a network calculation buffer memory, a network calculation controller, a comparator and a logic calculation unit, wherein the in-network operation process is as follows: the comparator compares the net calculation flag bit of the data packet transmitted by the analyzer with the net calculation flag bit of the data packet in the net calculation cache, and if the net calculation flag bit is unequal to the net calculation flag bit of the data packet in the net calculation cache, the net calculation controller transmits the data packet in the net calculation cache to the route calculation unit; if the two are equal, the network computing controller transmits the data part of the data packet transmitted by the analyzer and the data part of the data packet in the network computing cache to the logic computing unit for addition operation, the operation result is transmitted to the network computing cache to cover the original data part, and meanwhile, the source address, the destination address and the network computing flag bit of the data packet transmitted by the analyzer are transmitted to the network computing cache.
The route calculation unit receives the data packet from the resolver or the in-network operation unit, and transmits the data packet and the route operation result to the cross switch after route calculation. The routing algorithm is used to calculate which output port the data packet should be sent to according to the destination address in the data packet. For the data packet transmitted by the analyzer, the routing calculation unit adopts an XY routing algorithm, and for the data packet transmitted by the in-network operation unit, the routing calculation unit adopts a reduction tree routing algorithm. The routing result refers to an electrical signal that controls the state of the crossbar, indicating to which output buffer the crossbar should transmit incoming packets.
The crossbar switch receives the data packet and the routing result from the route calculation unit and transmits the packet to an output buffer. The crossbar switch comprises N multiplexers, each connected to one output buffer; according to the routing result, the crossbar switch selects through the multiplexer one data packet transmitted by a route calculation unit and transmits it to the corresponding output buffer. The output buffer receives the data packet transmitted by the crossbar switch and transmits it, on a first-come-first-served basis, to the input buffer of another inter-core router or to the core particle connected with this inter-core router.
According to hardware simulation results, under a 45 nm process the inter-core router occupies an area of 70902.3 μm², of which the in-network operation unit occupies 3684 μm², about 5% of the router area.
Now, assume that there are two data packets, respectively denoted as data packet E1 and data packet E2, where data packet E1 is generated by a core particle in the first row and the first column in fig. 5, and needs to be sent to a core particle in the second row and the second column, so the source address is denoted as (1, 1), the destination address is denoted as (2, 2), the net calculation flag bit is set to 103, and the data portion is 736.5; the packet E2 is generated from the core of the third column of the first row in fig. 5 and needs to be sent to the core of the second column of the second row, so the source address is denoted as (1, 3), the destination address is denoted as (2, 2), the net-work flag bit is denoted as 103, and the data portion is denoted as 367.2. For data packet E1 and data packet E2, the reduction tree routing algorithm works as follows:
F1, setting the size of a mesh network to be 4 multiplied by 4, and creating 16 Node sets in total for representing core grains in the network, wherein the Node sets comprise the following information: the method comprises the steps of marking an x coordinate and a y coordinate of a core particle in a mesh network, the number of child nodes and a precursor Node, wherein the x coordinate is marked as x, the y coordinate is marked as y, and defining the position information of a Node set as the x coordinate and the y coordinate of the Node set and marking as (x, y);
F2, assigning values for members of the Node set according to the positions of the core particles in the mesh network, wherein the number of child nodes is set to 0, and the precursor nodes of the core particles are set to be empty;
F3, an adjacency matrix M of size 16×16 is used to represent the connection relation of the nodes, where i and j denote integers with 1 ≤ i, j ≤ n; for the (i×n+j)-th row of the matrix, the ((i+1)×n+j)-th, ((i-1)×n+j)-th, (i×n+j+1)-th and (i×n+j-1)-th columns are assigned 1 to represent a connection, and the other columns are assigned 0 to represent no connection;
F4, for a Node set, the (x×n+y)-th row of the adjacency matrix M records its adjacency information; the n² vertices representing the Node sets are combined with the edge information recorded in the adjacency matrix M to form a graph G;
f5, selecting Node set with position information of (2, 2) as source point, updating graph G, and the process is as follows:
F5.1, initialization: the path length from each node directly connected with the source point to the source point is 1, and the shortest path length from every other node to the source point is infinity;
F5.2, among the nodes that have not been visited, select the node closest to the source point and denote it u;
F5.3, mark u as visited;
F5.4, select a node connected with node u and denote it v;
F5.5, if v has not been visited and the path from the source point to node v is shorter when u is used as an intermediate point, record the path length from node v to the source point as the path length from node u to the source point plus one, and add u to the candidate precursor nodes of v;
F5.6, repeat steps F5.4 and F5.5 for all vertices reachable from u;
F5.7, from the candidate precursor nodes of u, select the node with the largest number of child nodes and denote it w;
F5.8, record the precursor of u as w, and add 1 to the number of child nodes of w;
F5.9, repeat steps F5.2 to F5.8 until all nodes have been visited;
F6, create a set R of size 16; traverse every Node in the graph G and, through the precursor member v in its Node set, obtain the coordinates of its precursor node as (v→x, v→y) and store them into the set R, where the symbol "v→x" denotes the x coordinate in the Node set v and the symbol "v→y" denotes the y coordinate in the Node set v;
and F7, writing the set R into a router of the inter-core network as a routing table of the inter-core network.
In this embodiment, through steps F1 to F7, it is obtained that the precursor node of the inter-core routers located in the first row and the first column in the set R is the inter-core router of the first row and the second column; the precursor node of the inter-core router positioned in the third column of the first row is the inter-core router of the second column of the first row; the precursor node of the inter-core router located in the first row and the second column is the inter-core router of the second row and the second column.
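Using the dictionary representation of the set R assumed in the earlier sketches, the three entries stated above can be written down and queried as follows (illustrative only; the full set R has one entry per core particle):

```python
# Entries of the set R for destination (2, 2) as stated in this embodiment: each
# entry maps a core particle to the precursor it forwards reduction packets to.
R = {
    (1, 1): (1, 2),
    (1, 3): (1, 2),
    (1, 2): (2, 2),
}

def path_to_root(start, table):
    """Follow precursors until no entry exists, i.e. the destination is reached."""
    hops = [start]
    while hops[-1] in table:
        hops.append(table[hops[-1]])
    return hops

print(path_to_root((1, 1), R))  # [(1, 1), (1, 2), (2, 2)]
print(path_to_root((1, 3), R))  # [(1, 3), (1, 2), (2, 2)]  -> E1 and E2 meet at (1, 2)
```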
Both the data packet E1 and the data packet E2 must pass through the inter-core router in the first row and second column during forwarding. Assuming that the interval between the arrival of packet E1 and the arrival of packet E2 at the inter-core router in the first row and second column of FIG. 5 is smaller than the preset threshold period of the net calculation buffer, the aggregation of packet E1 and packet E2 in that router proceeds through the following steps:
p1, a data packet E1 enters an inter-core router of a first row and a second column, firstly, the data packet E1 is temporarily stored in an input buffer memory connected with an output buffer memory of the inter-core router of the first row and the first column, and the data packet E1 waits for an arbiter to arbitrate;
P2, the data packet E2 enters the inter-core router of the second column of the first row, firstly, the data packet E2 is temporarily stored in an input buffer memory connected with an output buffer memory of the inter-core router of the third column of the first row, and the data packet E2 waits for an arbiter to arbitrate;
P3, the arbiter polls the data packet in each input buffer, reads the content of the data packet E1 in the input buffer, and sends a control signal to transmit the data packet E1 from the input buffer to the analyzer;
P4, the data packet E1 enters a parser, in the parser, a router judges that the data packet E1 needs to be subjected to in-network operation according to the network calculation flag bit value 103 of the data packet E1, and the data packet E1 is transmitted to an in-network operation unit;
P5, the data packet E1 enters an in-network operation unit, the network calculation controller determines that the network calculation buffer is empty, and the data packet E1 is stored in the network calculation buffer;
P6, the arbiter reads the content of the data packet E2 in the input buffer, and sends a control signal to transmit the data packet E2 from the input buffer to the analyzer;
P7, the data packet E2 enters a parser, in the parser, a router judges that the data packet E2 needs to be subjected to in-network operation according to the network calculation flag bit value 103 of the data packet E2, and the data packet E2 is transmitted to an in-network operation unit;
P8, the data packet E2 enters an in-network operation unit, the network computation controller checks the data packet E1 stored in the network computation cache, then the data part 367.2 of the data packet E2 is firstly transmitted to the logic computation unit, the network computation flag bit 103 of the data packet E2 and the network computation flag bit 103 of the data packet E1 enter a comparator for comparison, the comparator transmits the result to the network computation controller, the network computation flag bits of the two data packets are equal, the network computation controller controls the network computation cache to input the data part 736.5 of the data packet E1 into the logic computation unit, and the source address, the target address and the network computation flag bit of the data packet E2 are stored in the network computation cache and cover the source address, the target address and the network computation flag bit of the data packet E1;
P9, the logic calculation unit carries out floating point addition operation on 736.5 and 367.2, and the operation result 1103.7 is stored in the network calculation cache and covers the original data part in the network calculation cache. The network computing controller transmits the data packet in the network computing cache to the route computing unit, the data packet is recorded as E3, and the content of the data packet E3 is a source address (1, 3), a target address (2, 2), a network computing flag bit 103 and a data part 1103.7;
And P10, the route calculation unit calculates that the output buffer corresponding to the data packet is the output buffer connected with the inter-core router of the second row and the second column in FIG. 5 by using a reduction tree routing algorithm according to the destination address in the data packet E3.
P11, packet E3 enters the crossbar, which selects packet E3 via the multiplexer and transmits it to the output buffer connected to the inter-die router in the second row and second column of FIG. 5.
And P12, the output buffer transmits the data packet E3 to the outside of the router according to the first come first serve principle.
So far, the aggregation process of the data packet E1 and the data packet E2 in the inter-core router of the first row and the second column is finished, and the data packet E1 and the data packet E2 are aggregated into the data packet E3 through the aggregation process, so that the data packet quantity of the inter-core network in the multi-core integrated system is reduced by 1.
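For reference, the arithmetic of this aggregation can be reproduced with the behavioural model of the in-network operation unit sketched earlier, repeated here in compressed form so that the example runs on its own; all names are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass
class Packet:
    src: Tuple[int, int]
    dst: Tuple[int, int]
    flag: int
    data: float

class InNetworkOpUnit:
    def __init__(self) -> None:
        self.buffer: Optional[Packet] = None

    def accept(self, pkt: Packet) -> Optional[Packet]:
        if self.buffer is None:          # buffer empty: store the packet
            self.buffer = pkt
            return None
        if self.buffer.flag != pkt.flag:  # different groups: flush, then store newcomer
            out, self.buffer = self.buffer, pkt
            return out
        merged = replace(pkt, data=self.buffer.data + pkt.data)  # same group: aggregate
        self.buffer = None
        return merged

unit = InNetworkOpUnit()
e1 = Packet(src=(1, 1), dst=(2, 2), flag=103, data=736.5)
e2 = Packet(src=(1, 3), dst=(2, 2), flag=103, data=367.2)
assert unit.accept(e1) is None          # P5: E1 is stored in the net calculation buffer
e3 = unit.accept(e2)                    # P8-P9: E2 arrives, flag bits match, data parts are added
print(e3.src, e3.dst, e3.flag, round(e3.data, 1))  # (1, 3) (2, 2) 103 1103.7, as in step P9
```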
Example 2
Based on the network-computing integrated inter-core router disclosed in Embodiment 1, and under the same overall system architecture, this embodiment describes the working procedure of the data packet aggregation method in a more complex situation.
Now assume there are two data packets, denoted data packet H1 and data packet H2. Packet H1 is generated by the core particle in the first row and first column of FIG. 5 and needs to be sent to the core particle in the third row and second column, so its source address is (1, 1), its destination address is (3, 2), its net calculation flag bit is 101, and its data part is 61.8. Packet H2 is generated by the core particle in the first row and third column of FIG. 5 and needs to be sent to the core particle in the third row and second column, so its source address is (1, 3), its destination address is (3, 2), its net calculation flag bit is 101, and its data part is 37.2. According to the reduction-tree routing algorithm, the precursor node of the inter-core router in the first row and first column is the inter-core router in the first row and second column; the precursor node of the inter-core router in the first row and third column is the inter-core router in the first row and second column; the precursor node of the inter-core router in the first row and second column is the inter-core router in the second row and second column; and the precursor node of the inter-core router in the second row and second column is the inter-core router in the third row and second column. Thus, both packet H1 and packet H2 must pass through the inter-core router in the first row and second column and the inter-core router in the second row and second column during forwarding.
Assuming that the time interval between the arrival of the data packet H1 and the arrival of the data packet H2 at the inter-core router of the first row and the second column is smaller than the preset threshold period in the network computation buffer, the following describes the aggregation process of the data packet H1 and the data packet H2 in the inter-core router of the first row and the second column in fig. 5 through one multi-core aggregation process, which includes the following steps:
w1, a data packet H1 enters an inter-core router of a first row and a second column, firstly, the data packet H1 is temporarily stored in an input buffer memory connected with an output buffer memory of the inter-core router of the first row and the first column, and the data packet H1 waits for an arbiter to arbitrate;
w2, the data packet H2 enters the inter-core router of the second column of the first row, firstly, the data packet H2 is temporarily stored in an input buffer memory connected with an output buffer memory of the inter-core router of the third column of the first row, and the data packet H2 waits for an arbiter to arbitrate;
W3, the arbiter polls the data packet in each input buffer, reads the content of the data packet H1 in the input buffer, and sends a control signal to transmit the data packet H1 from the input buffer to the analyzer;
W4, the data packet H1 enters a parser, in the parser, a router judges that the data packet H1 needs to be subjected to in-network operation according to the network calculation flag bit value 101 of the data packet H1, and the data packet H1 is transmitted to an in-network operation unit;
w5, the data packet H1 enters an in-network operation unit, the network calculation controller determines that the network calculation buffer memory is empty, and the data packet H1 is stored in the network calculation buffer memory;
W6, the arbiter reads the content of the data packet H2 in the input buffer memory, and sends a control signal to transmit the data packet H2 from the input buffer memory to the analyzer;
W7, the data packet H2 enters a parser, in the parser, a router judges that the data packet H2 needs to be subjected to in-network operation according to the network calculation flag bit value 101 of the data packet H2, and the data packet H2 is transmitted to an in-network operation unit;
W8, the data packet H2 enters an in-network operation unit, the network computation controller checks the data packet H1 stored in the network computation cache, then the data part 37.2 of the data packet H2 is firstly transmitted to the logic computation unit, the network computation bit 101 of the data packet H2 and the network computation bit 101 of the data packet H1 enter a comparator for comparison, the comparator transmits the result to the network computation controller, the network computation bit of the two data packets are equal, the network computation controller controls the network computation cache to input the data part 61.8 of the data packet H1 into the logic computation unit, and the source address, the target address and the network computation bit of the data packet H2 are stored in the network computation cache and cover the source address, the target address and the network computation bit of the data packet H1;
W9, the logic calculation unit performs floating-point addition on 61.8 and 37.2, and the result 99.0 is stored in the net calculation buffer, overwriting the original data part there. The net calculation controller transmits the data packet in the net calculation buffer to the route calculation unit; this packet is denoted H3, and its content is source address (1, 3), destination address (3, 2), net calculation flag bit 101 and data part 99.0;
And W10, the route calculation unit calculates that the output buffer corresponding to the data packet is an output buffer connected with the inter-core router of the second row and the second column in FIG. 5 by using a reduction tree routing algorithm according to the destination address in the data packet H3.
W11, packet H3 enters the crossbar, which selects packet H3 via the multiplexer and transmits it to the output buffer connected to the inter-die router of the second row and second column in FIG. 5.
W12, the output buffer transmits the data packet H3 to the outside of the router according to the first come first serve principle.
The forwarding process of packet H3 in the inter-core router of the second row and the second column in fig. 5 is as follows:
Z1, the data packet H3 enters the inter-core router of the second row and the second column, is temporarily stored in the input buffer connected with the output buffer of the inter-core router of the first row and the second column, and waits for the arbiter to arbitrate;
Z2, the arbiter polls the data packet in each input buffer, reads the content of the data packet H3 in the input buffer, and sends a control signal to transmit the data packet H3 from the input buffer to the analyzer;
Z3, the data packet H3 enters a parser, in the parser, a router judges that the data packet H3 needs to be subjected to in-network operation according to the network calculation flag bit value 101 of the data packet H3, and the data packet H3 is transmitted to an in-network operation unit;
Z4, the data packet H3 enters an in-network operation unit, the network calculation controller determines that the network calculation buffer is empty, and the data packet H3 is stored in the network calculation buffer;
z5, the data packet H3 is stored in the network computing buffer memory for more than a preset threshold period, and the network computing controller controls the network computing buffer memory to transmit the data packet H3 to the route computing unit.
And Z6, the route calculation unit calculates that the output buffer corresponding to the data packet is the output buffer connected with the inter-core router of the third row and the second column in FIG. 5 by using a reduction tree routing algorithm according to the destination address in the data packet H3.
Z7, the data packet H3 enters the crossbar switch, the crossbar switch selects the data packet H3 through the multiplexer and transmits the data packet to the output buffer connected with the inter-core router of the third row and the second column in FIG. 5.
And Z8, the output buffer transmits the data packet H3 to the outside of the router according to the first come first serve principle.
In this embodiment, the data packet H1 and the data packet H2 complete their aggregation in the inter-core router located in the first row and second column; the data packet H3 then stays in the net calculation buffer of the inter-core router located in the second row and second column for longer than the preset threshold period, so the net calculation controller forwards it to the route calculation unit. This illustrates exactly the situation that the threshold period on the time a packet may stay in the net calculation buffer is meant to avoid: if a packet could stay in the net calculation buffer indefinitely, then when only one packet requiring in-network operation remains in the network and has not yet reached its destination address, that packet would remain in the net calculation buffer and never be forwarded to its destination.
Example 3
Based on the network-computing integrated inter-core router disclosed in Embodiment 1, and under the same overall system architecture, this embodiment describes the working procedure of the data packet aggregation method in another complex situation.
Now assume there are three data packets, denoted data packet K1, data packet K2 and data packet K3. Packet K1 is generated by the core particle in the first row and first column of FIG. 5 and needs to be sent to the core particle in the second row and second column, so its source address is (1, 1), its destination address is (2, 2), its net calculation flag bit is 102, and its data part is 15.0. Packet K2 is generated by the core particle in the first row and fourth column of FIG. 5 and needs to be sent to the core particle in the second row and second column, so its source address is (1, 4), its destination address is (2, 2), its net calculation flag bit is 102, and its data part is 22.0. Packet K3 is generated by the core particle in the first row and third column of FIG. 5 and needs to be sent to the core particle in the third row and second column, so its source address is (1, 3), its destination address is (3, 2), its net calculation flag bit is 0, and its data part is 22.0. According to the reduction-tree routing algorithm, the precursor node of the inter-core router in the first row and first column is the inter-core router in the first row and second column; the precursor node of the inter-core router in the first row and fourth column is the inter-core router in the first row and third column; the precursor node of the inter-core router in the first row and third column is the inter-core router in the first row and second column; and the precursor node of the inter-core router in the first row and second column is the inter-core router in the second row and second column. According to the XY routing algorithm, the data packet K3 passes in turn through the inter-core routers located in the first row and third column, the first row and second column, the second row and second column, and the third row and second column during forwarding. Thus, packets K1, K2 and K3 must all pass through the inter-core router in the first row and second column during forwarding. Now assume that the order in which the three data packets arrive at that router is: packet K1, packet K3, packet K2, and that the interval between the arrival of packet K1 and the arrival of packet K2 is smaller than the preset threshold period of the net calculation buffer. The forwarding of the three packets in the inter-core router of the first row and second column proceeds as follows:
L1, the data packet K1 enters the inter-core router of the first row and second column, is temporarily stored in the input buffer connected with the output buffer of the inter-core router of the first row and first column, and waits for the arbiter to arbitrate;
L2, the data packet K3 enters the inter-core router of the first row and second column, is temporarily stored in the input buffer connected with the output buffer of the inter-core router of the first row and third column, and waits for the arbiter to arbitrate;
L3, the arbiter polls the data packets in each input buffer, reads the content of the data packet K1 in the input buffer, and sends a control signal to transmit the data packet K1 from the input buffer to the parser;
L4, the data packet K1 enters the parser; in the parser, the router judges that the data packet K1 needs in-network operation according to its network calculation flag bit value 102, and the data packet K1 is transmitted to the in-network operation unit;
L5, the data packet K1 enters the in-network operation unit; the network calculation controller determines that the network calculation buffer is empty, and the data packet K1 is stored in the network calculation buffer;
L6, the arbiter polls the data packets in each input buffer, reads the content of the data packet K3 in the input buffer, and sends a control signal to transmit the data packet K3 from the input buffer to the parser;
L7, the data packet K3 enters the parser; in the parser, the router judges that the data packet K3 does not need in-network operation according to its network calculation flag bit value 0, and the data packet K3 is transmitted to the route calculation unit;
L8, according to the target address in the data packet K3, the route calculation unit calculates by using the XY routing algorithm that the output buffer corresponding to the data packet is the output buffer connected with the inter-core router of the second row and second column in Fig. 5;
L9, the data packet K3 enters the crossbar switch, which selects the data packet K3 through a multiplexer and transmits it to the output buffer connected with the inter-core router of the second row and second column in Fig. 5;
L10, the output buffer connected with the inter-core router of the second row and second column in Fig. 5 transmits the data packet K3 to the outside of the router according to the first come first served principle;
L11, the data packet K2 enters the inter-core router of the first row and second column, is temporarily stored in the input buffer connected with the output buffer of the inter-core router of the first row and third column, and waits for the arbiter to arbitrate;
L12, the arbiter reads the content of the data packet K2 in the input buffer and sends a control signal to transmit the data packet K2 from the input buffer to the parser;
L13, the data packet K2 enters the parser; in the parser, the router judges that the data packet K2 needs in-network operation according to its network calculation flag bit value 102, and the data packet K2 is transmitted to the in-network operation unit;
L14, the data packet K2 enters the in-network operation unit; the network calculation controller checks the data packet K1 stored in the network calculation buffer; the data part 22.0 of the data packet K2 is first transmitted to the logic calculation unit, and the network calculation flag bit 102 of the data packet K2 and the network calculation flag bit 102 of the data packet K1 enter the comparator for comparison; the comparator transmits the result to the network calculation controller; since the network calculation flag bits of the two data packets are equal, the network calculation controller controls the network calculation buffer to input the data part 15.0 of the data packet K1 into the logic calculation unit, and the source address, target address and network calculation flag bit of the data packet K2 are stored in the network calculation buffer and cover the source address, target address and network calculation flag bit of the data packet K1;
L15, the logic calculation unit performs floating point addition on 15.0 and 22.0, and the operation result 37.0 is stored in the network calculation buffer and covers the original data part in the network calculation buffer; the network calculation controller then transmits the data packet in the network calculation buffer to the route calculation unit, the data packet being recorded as K4, whose content is the source address (1, 4), the target address (2, 2), the network calculation flag bit 102 and the data part 37.0;
L16, according to the target address in the data packet K4, the route calculation unit calculates by using the reduction tree routing algorithm that the output buffer corresponding to the data packet is the output buffer connected with the inter-core router of the second row and second column;
L17, the data packet K4 enters the crossbar switch, which selects the data packet K4 through the multiplexer and transmits it to the output buffer connected with the inter-core router of the second row and second column;
L18, the output buffer connected with the inter-core router of the second row and second column transmits the data packet K4 to the outside of the router according to the first come first served principle.
In this embodiment, K1 and K2 are data packets that need in-network operation, while K3 is a data packet that does not. The forwarding of these three data packets shows that, even when data packets requiring in-network operation and data packets not requiring it are present in the system at the same time, the inter-core router can still complete the forwarding and aggregation tasks accurately.
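As a quick numerical check of steps L14 and L15, the aggregation of K1 and K2 into K4 can be reproduced in a few lines of Python (the variable names are illustrative only):

```python
k1_flag, k1_data = 102, 15.0          # buffered packet K1
k2_flag, k2_data = 102, 22.0          # arriving packet K2
assert k1_flag == k2_flag             # comparator: equal network calculation flag bits
k4_data = k1_data + k2_data           # floating point addition in the logic calculation unit
k4 = ((1, 4), (2, 2), 102, k4_data)   # K2's source address, target address and flag bit cover K1's
print(k4)                             # ((1, 4), (2, 2), 102, 37.0) -> data packet K4
```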
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.
Claims (8)
1. A network-computing integrated inter-core router, characterized by comprising N input buffers, 1 arbiter, N parsers, 1 in-network operation unit, N+1 route calculation units, 1 crossbar switch and N output buffers, wherein,
the input buffer is connected with the output buffer of a core particle or of another inter-core router, receives the data packet transmitted by the output buffer of the other inter-core router or the data packet transmitted by the core particle, and buffers the data packet while it waits for the arbitration of the arbiter; the data packet comprises a source address, a target address, a network calculation flag bit and a data part;
the arbiter reads the content of the data packets in the input buffers, arbitrates among the data packets in the input buffers, and decides which data packet in which input buffer is transmitted to the parser for parsing;
the parser receives the data packet transmitted by the input buffer, judges whether the data packet needs in-network operation according to the network calculation flag bit of the data packet, and transmits the data packet to the route calculation unit or to the in-network operation unit: if the value of the network calculation flag bit of the data packet is equal to 0, the data packet does not need in-network operation and the parser transmits the data packet to the route calculation unit; if the value of the network calculation flag bit of the data packet is not 0, the data packet needs in-network operation and the parser transmits the data packet to the in-network operation unit;
the in-network operation unit receives the data packet transmitted by the parser and, after the in-network operation, transmits the data packet to the route calculation unit;
the route calculation unit receives the data packet from the parser or from the in-network operation unit, calculates the output port address corresponding to the data packet by using a routing algorithm according to the target address in the data packet, and transmits the data packet and the routing operation result to the crossbar switch; for a data packet transmitted by the parser, the route calculation unit adopts the XY routing algorithm, namely the data packet is forwarded according to columns first and then according to rows; for a data packet transmitted by the in-network operation unit, the route calculation unit adopts the reduction tree routing algorithm; the routing operation result serves as an electric signal that controls the state of the crossbar switch and indicates to which output buffer the crossbar switch should transmit the incoming data packet;
the crossbar switch receives the data packets and the routing operation results from the route calculation units; the crossbar switch comprises N multiplexers, each multiplexer being connected with one output buffer; according to the routing operation result, the crossbar switch selects one data packet transmitted by a route calculation unit through the multiplexer and transmits the data packet to the corresponding output buffer;
the output buffer receives the data packet transmitted by the crossbar switch and, according to the first come first served principle, transmits the data packet to the input buffer of another inter-core router or to the core particle connected with the inter-core router.
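To make the XY routing rule of claim 1 concrete, the following Python sketch computes the next output direction for a packet; the port names (EAST, WEST, SOUTH, NORTH, LOCAL) and the assumption that row numbers grow downward are choices of this sketch, not part of the claim.

```python
def xy_route(current: tuple, target: tuple) -> str:
    """XY routing: forward according to columns first, then according to rows."""
    row, col = current
    dst_row, dst_col = target
    if col != dst_col:                        # column (X) dimension first
        return "EAST" if dst_col > col else "WEST"
    if row != dst_row:                        # then row (Y) dimension
        return "SOUTH" if dst_row > row else "NORTH"
    return "LOCAL"                            # deliver to the locally connected core particle

# Example consistent with the embodiments: a packet at the router in row 1,
# column 2 with target (3, 2) is forwarded toward the router in row 2, column 2.
print(xy_route((1, 2), (3, 2)))  # SOUTH
```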
2. The network-computing integrated inter-core router of claim 1, wherein the source address in the data packet indicates which core particle in the network the data packet is generated from, the target address indicates which core particle in the network is to receive the data packet, the network calculation flag bit is used to distinguish whether the data packet needs in-network operation, and the data part is the specific communication content to be transmitted by the data packet.
3. The network-computing integrated inter-core router of claim 1, wherein the in-network operation unit comprises a multiplexer, a network calculation buffer, a network calculation controller, a comparator and a logic calculation unit; the working process of the in-network operation unit is as follows: the comparator compares the network calculation flag bit of the data packet transmitted by the parser with the network calculation flag bit of the data packet in the network calculation buffer; if the two are not equal, the network calculation controller transmits the data packet in the network calculation buffer to the route calculation unit; if the two are equal, the network calculation controller transmits the data part of the data packet transmitted by the parser and the data part of the data packet in the network calculation buffer to the logic calculation unit for addition, the operation result is transmitted to the network calculation buffer to cover the original data part, and at the same time the source address, target address and network calculation flag bit of the data packet transmitted by the parser are transmitted to the network calculation buffer.
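For illustration, a hedged software model of the in-network operation unit of claim 3 is sketched below; the class name NetCalcUnit, the single-packet buffer and the return convention (returning the packet that should go to the route calculation unit) are assumptions of this sketch, and the equal-flag branch hands the aggregated packet over in line with steps S4 and S5 of claim 7.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    src: tuple
    dst: tuple
    flag: int      # network calculation flag bit
    data: float    # data part

class NetCalcUnit:
    def __init__(self):
        self.buffer: Optional[Packet] = None   # network calculation buffer (one packet)

    def receive(self, pkt: Packet) -> Optional[Packet]:
        """Process a packet from the parser; return a packet for the route
        calculation unit, or None if the packet only waits in the buffer."""
        if self.buffer is None:                # buffer empty: store and wait
            self.buffer = pkt
            return None
        if self.buffer.flag != pkt.flag:       # comparator: flag bits unequal
            released = self.buffer             # release the buffered packet for routing
            # the buffer's data part is treated as 0, so the adder yields pkt.data + 0,
            # and the new packet's addresses and flag bit are stored in the buffer
            self.buffer = Packet(pkt.src, pkt.dst, pkt.flag, pkt.data + 0.0)
            return released
        # flag bits equal: floating point addition; the aggregated packet carries the
        # incoming packet's addresses and flag bit and goes to the route calculation unit
        merged = Packet(pkt.src, pkt.dst, pkt.flag, self.buffer.data + pkt.data)
        self.buffer = None
        return merged
```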
4. The network-computing integrated inter-core router of claim 1, wherein the reduction tree routing algorithm operates in a mesh network as follows:
T1, setting the size of the mesh network as n×n, and creating n² Node sets in total for representing the core particles in the network, wherein each Node set comprises the following information: the x coordinate of the core particle in the mesh network, recorded as x; the y coordinate, recorded as y; the number of child nodes; and the precursor node; the position information of a Node set refers to the x coordinate and the y coordinate in the Node set and is recorded as (x, y);
T2, assigning values to the members of each Node set according to the position of the corresponding core particle in the mesh network, wherein the number of child nodes is set to 0 and the precursor node of each core particle is set to be empty;
T3, using an adjacency matrix M of size n²×n² to represent the connection relation of the nodes, wherein the symbols i and j each represent an integer greater than or equal to 1 and less than or equal to n; for the (i×n+j)-th row of the matrix, the ((i+1)×n+j)-th, ((i-1)×n+j)-th, (i×n+j+1)-th and (i×n+j-1)-th columns are assigned 1 to represent connection, and the other columns are assigned 0 to represent no connection;
T4, for each Node set, the (x×n+y)-th row of the adjacency matrix M records its adjacency information, so the n² Node sets are used as vertices and, combined with the edge information recorded by the adjacency matrix M, form a graph G;
T5, updating the graph G, wherein the process is as follows:
T5.1, setting the shortest path length from each node directly connected with the source point in the graph G to the source point to 1, and the shortest path length from the remaining nodes to the source point to infinity;
T5.2, from the nodes that have not yet been accessed, selecting the node nearest to the source point and recording it as u;
T5.3, marking u as accessed;
T5.4, selecting a node connected with the node u and recording it as v;
T5.5, if v has not been accessed and the path from the source point to the node v is shorter when u is taken as an intermediate point, recording the path length from the node v to the source point as the path length from the node u to the source point plus one, and adding u to the candidate precursor nodes of v;
T5.6, repeating steps T5.4 and T5.5 for all vertices that can be reached from u;
T5.7, selecting, from the candidate precursor nodes of u, the node with the largest number of child nodes and recording it as w;
T5.8, recording the precursor of u as w, and adding 1 to the number of child nodes of w;
T5.9, repeating steps T5.2 to T5.8 until all nodes have been accessed;
T6, establishing a set R of size n², traversing each Node in the graph G, calculating the precursor node coordinate of each Node as (v→x, v→y) through the precursor member v of the Node set, and storing the precursor node coordinate into the set R, wherein the symbol "v→x" represents the x coordinate in the Node set v and the symbol "v→y" represents the y coordinate in the Node set v;
T7, taking the set R as the result of the routing algorithm and storing the set R in the route calculation unit of the inter-core router.
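For illustration only, steps T1 to T7 can be prototyped in Python as follows; the function name, the choice of the source point (taken here as the destination of the reduction) and the dictionary-based set R are assumptions of this sketch.

```python
import math

def reduction_tree(n: int, source: tuple) -> dict:
    """Compute, for each core particle (row, col) in an n x n mesh, its precursor
    node toward `source`, preferring precursor candidates that already have many
    child nodes so that packets of the same reduction meet in the same router."""
    nodes = [(r, c) for r in range(1, n + 1) for c in range(1, n + 1)]

    def neighbours(v):
        r, c = v
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 1 <= nb[0] <= n and 1 <= nb[1] <= n:
                yield nb

    dist = {v: math.inf for v in nodes}       # shortest path length to the source point
    dist[source] = 0
    children = {v: 0 for v in nodes}          # number of child nodes (T2)
    precursor = {v: None for v in nodes}      # precursor node, initially empty (T2)
    candidates = {v: [] for v in nodes}       # candidate precursor nodes (T5.5)
    visited = set()

    while len(visited) < len(nodes):
        # T5.2: pick the unvisited node nearest to the source point
        u = min((v for v in nodes if v not in visited), key=lambda v: dist[v])
        visited.add(u)                        # T5.3
        for v in neighbours(u):               # T5.4 and T5.6
            if v not in visited and dist[u] + 1 <= dist[v]:   # T5.5
                dist[v] = dist[u] + 1
                candidates[v].append(u)
        if u != source and candidates[u]:
            # T5.7 and T5.8: choose the candidate precursor with the most children
            w = max(candidates[u], key=lambda x: children[x])
            precursor[u] = w
            children[w] += 1
    return precursor                          # plays the role of the set R (T6, T7)

# Example for a 4 x 4 mesh with the reduction destination in row 2, column 2;
# with this particular tie-breaking the first-row precursors match those used
# in the embodiments: (1, 2), (1, 2), (1, 3).
R = reduction_tree(4, (2, 2))
print(R[(1, 1)], R[(1, 3)], R[(1, 4)])
```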
5. The network-computing integrated inter-core router of claim 1, wherein the arbiter uses round robin arbitration to poll the data packets in each input buffer and take them as the input of the parser.
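A minimal sketch of the round robin polling of claim 5 follows; the deque-based queues and the grant-pointer convention are assumptions of this sketch.

```python
from collections import deque

def round_robin_arbiter(input_buffers: list, start: int):
    """Poll the input buffers starting at `start`, grant the first non-empty one,
    and advance the pointer past the granted buffer for the next arbitration."""
    n = len(input_buffers)
    for offset in range(n):
        i = (start + offset) % n
        if input_buffers[i]:
            return input_buffers[i].popleft(), (i + 1) % n
    return None, start   # nothing to grant in this cycle

# Example: three input buffers, packets granted in round robin order.
buffers = [deque(["p0"]), deque(), deque(["p2a", "p2b"])]
pkt, ptr = round_robin_arbiter(buffers, 0)   # grants "p0", pointer moves to 1
```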
6. The network-computing integrated inter-core router of claim 1, wherein the input buffer stores data packets using a plurality of fixed-length queues.
7. A data packet aggregation method based on the network-computing integrated inter-core router according to any one of claims 1 to 6, characterized in that the data packet aggregation method comprises the following steps:
S1, a data packet enters the network-computing integrated inter-core router, is first temporarily stored in an input buffer, and waits for the arbiter to arbitrate;
S2, the arbiter reads the content of the data packets in the input buffers, performs cyclic arbitration on the data packets in the input buffers, polls the data packets in each input buffer, and determines which data packet in which input buffer is transmitted to the parser for parsing;
S3, after passing through the arbiter, the data packet enters the parser; in the parser, the router judges whether the data packet needs in-network operation according to the network calculation flag bit of the data packet: if the value of the network calculation flag bit of the data packet is equal to 0, the data packet does not need in-network operation, steps S4 and S5 are skipped and step S6 is executed; if the value of the network calculation flag bit of the data packet is not 0, the data packet needs in-network operation and step S4 is executed;
S4, after entering the in-network operation unit the data packet is recorded as data packet A; if the network calculation controller finds that no data packet is cached in the network calculation buffer, the data packet A is stored in the network calculation buffer and the processing ends; if a data packet B is stored in the network calculation buffer of the in-network operation unit, the data part of the data packet A is first transmitted to the logic calculation unit, the network calculation flag bits of the data packet A and the data packet B enter the comparator for comparison, and the comparator transmits the result to the network calculation controller; if the network calculation flag bits of the two data packets are not equal, the network calculation controller controls the network calculation buffer to output the data packet B, the network calculation buffer is emptied, the data part of the network calculation buffer is set to 0 and input to the logic calculation unit, and the source address, target address and network calculation flag bit of the data packet A are stored in the network calculation buffer; if the network calculation flag bits of the two data packets are equal, the network calculation controller controls the network calculation buffer to input the data part of the data packet B into the logic calculation unit, and the source address, target address and network calculation flag bit of the data packet A are stored in the network calculation buffer and cover the source address, target address and network calculation flag bit of the data packet B;
S5, the logic calculation unit performs floating point addition on the data part of the data packet A and the data part of the data packet B, the operation result is stored in the network calculation buffer and covers the original data part in the network calculation buffer, and finally the network calculation controller transmits the data packet to the route calculation unit;
S6, the route calculation unit calculates the output port corresponding to the data packet by using a routing algorithm according to the target address in the data packet, wherein the route calculation unit adopts the XY routing algorithm for data packets transmitted by the parser and adopts the reduction tree routing algorithm for data packets transmitted by the in-network operation unit;
S7, the data packet enters the crossbar switch, and the crossbar switch selects the data packet transmitted by the route calculation unit through a multiplexer and transmits the data packet to the corresponding output buffer;
S8, the output buffer transmits the data packet to the outside of the router according to the first come first served principle.
8. The data packet aggregation method based on the network-computing integrated inter-core router according to claim 7, wherein the network calculation controller in the in-network operation unit records whether a data packet exists in the network calculation buffer and the duration for which the data packet has existed in the network calculation buffer; if the existence time of a data packet in the network calculation buffer exceeds the preset threshold period, the network calculation controller controls the network calculation buffer to transmit the data packet to the route calculation unit.