CN115720211A - Network-computing integrated inter-core-particle router and data packet aggregation method - Google Patents
- Publication number: CN115720211A (application CN202211424707.6A)
- Authority: CN (China)
- Prior art keywords: data packet, network, core, cache, data
- Legal status: Granted (the status listed is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a network-computing integrated inter-core-particle router, which comprises an input cache, an arbiter, a parser, an in-network operation unit, a route calculation unit, a crossbar switch and an output cache. The invention adds a parser and an in-network operation unit to the inter-core-particle router, so that the router can perform aggregation operations on the data packets it forwards. The invention also defines a new data packet format to work with the parser and the in-network operation unit, and provides a reduction-tree routing algorithm that works together with the inter-core-particle router, so that more data packets meet in the same router and aggregation operations are executed as often as possible.
Description
Technical Field
The invention relates to the technical field of multi-core-grain integrated circuits, in particular to a network-computing integrated inter-core-grain router and a data packet aggregation method.
Background
A multi-core-particle integrated system integrates several dies (bare chips), each realizing specific functions, into a system chip through advanced packaging technology. Compared with the traditional monolithic integration approach, it has advantages and potential in chip performance, power consumption optimization, cost and business model. A deep neural network (DNN) can be deployed onto a multi-core-particle integrated system by partitioning the training data set/model: each core particle processes a different subset of the training data in parallel to train a local copy of the model; after every core particle finishes, the training results are gathered and the parameters summarized, the summarized parameters are broadcast to each node to update the local models, and the next iteration begins.
In the above process, the step in which every core particle transmits data to the same core particle is called reduction, and the step in which one core particle transmits data to all other core particles is called broadcasting. One reduction and one broadcast together form an all-reduce. All-reduce is essentially a communication process: it reduces the data held by different core particles and distributes the result to each core particle.
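The all-reduce described above can be illustrated with a minimal sketch (the function name and the list-based representation of per-core values are assumptions for illustration, not part of the invention):

```python
# Minimal illustration of all-reduce: a reduce (summing every core particle's
# local value into one place) followed by a broadcast of the result to all.
def all_reduce(local_values):
    total = sum(local_values)            # reduce step
    return [total] * len(local_values)   # broadcast step
```

For example, `all_reduce([1, 2, 3])` leaves every core particle holding 6.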
In the prior art, even for computation-intensive applications such as neural network training, all-reduce may become a performance bottleneck. Because the communication bandwidth between core particles in a multi-core-particle integrated system is limited, communication across core particles takes longer than communication inside a core particle. In the parameter summarization stage of deep neural network training, a slow network transmission rate easily causes network congestion and increased network delay, reducing overall performance; network communication therefore becomes the performance bottleneck during training.
At present, optimization methods for the all-reduce process fall mainly into software methods and hardware methods. Software algorithms include J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. J. Dongarra, "Performance analysis of MPI collective operations," Cluster Computing, vol. 10, no. 2, pp. 127-143, 2007, and R. Rabenseifner, "Optimization of collective reduction operations," in International Conference on Computational Science, Springer, 2004, pp. 1-9, but these algorithms increase the amount of communication in the system, resulting in high delay. Hardware methods mainly target large-scale parallel systems composed of many computers: by integrating an operation unit into a switch, data are aggregated before transmission so as to reduce the transmitted data volume. However, such designs target coarser network flow granularity and place complex requirements on the operation unit, so they cannot be transplanted directly to a multi-core-particle integrated system with limited chip area.
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art; its object is to provide a network-computing integrated inter-core-particle router and a data packet aggregation method.
The first purpose of the invention can be achieved by adopting the following technical scheme:
A network-computing integrated inter-core-particle router comprises N input caches, 1 arbiter, N parsers, 1 in-network operation unit, N+1 route calculation units, 1 crossbar switch and N output caches, wherein:
the input cache is connected with a core particle or with the output cache of another inter-core-particle router; it receives data packets transmitted by that output cache or by the core particle, caches them, and waits for arbitration by the arbiter. A data packet comprises a source address, a destination address, a network computation flag bit and a data part;
the arbiter reads the content of the data packet in the input cache, arbitrates the data packet in the input cache, and decides which data packet in the input cache is transmitted to the parser for parsing;
the parser receives a data packet transmitted by the input cache and passes it to the route calculation unit or the in-network operation unit. The parser judges whether the packet needs in-network operation according to its network computation flag bit: if the flag bit equals 0, the packet does not need in-network operation and the parser transmits it to the route calculation unit; if the flag bit is not 0, the packet needs in-network operation and the parser transmits it to the in-network operation unit;
the in-network operation unit receives the data packet transmitted by the parser and, after in-network operation, transmits it to the route calculation unit;
the route calculation unit receives the data packet from the parser or the in-network operation unit, calculates the output port address corresponding to the packet with a routing algorithm according to the destination address in the packet, and transmits the packet together with the routing result to the crossbar switch. For packets transmitted by the parser, the route calculation unit adopts the XY routing algorithm, i.e., the packet is first forwarded along columns and then along rows; for packets transmitted by the in-network operation unit, the route calculation unit adopts the reduction-tree routing algorithm. The routing result serves as an electrical signal that controls the state of the crossbar switch and indicates to which output cache the crossbar should transmit the packet;
the crossbar receives the data packet and the routing operation result from the routing calculation unit, the crossbar comprises N multiplexers, each multiplexer is connected with one output cache, and according to the routing operation result, the crossbar selects the data packet transmitted by one routing calculation unit through the multiplexers and transmits the data packet to the corresponding output cache;
and the output cache receives data packets transmitted by the crossbar switch and transmits them, on a first-come, first-served basis, to the input cache of another inter-core-particle router or to the core particle connected to this router.
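The XY routing used by the route calculation unit for ordinary packets can be sketched as follows; the port names and the convention of resolving the x (column) offset before the y (row) offset are illustrative assumptions, not taken from the claims:

```python
# Minimal XY (dimension-ordered) routing sketch for a mesh network, as used
# for packets that skip in-network operation. Port names are assumptions.
def xy_route(cur, dst):
    """Return the output port for a packet at node `cur` headed to `dst`."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                 # resolve the column (x) offset first
        return "EAST" if dx > cx else "WEST"
    if cy != dy:                 # then the row (y) offset
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"               # arrived: deliver to the attached core particle
```

Because every packet finishes its x movement before starting its y movement, XY routing is deadlock-free on a mesh, which is why it is a common default for forwarded (non-aggregated) traffic.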
The inter-core-particle router disclosed above is the first to add a parser and an in-network operation unit, so that the router can perform aggregation operations on forwarded data packets. A new packet format is defined to work with the parser and the in-network operation unit. To let the improved router better exploit its aggregation function, a matching reduction-tree routing algorithm is also provided, so that more data packets meet in the same router and aggregation operations are performed as much as possible.
Further, the source address in the data packet indicates which core particle in the network generated the packet, and the destination address indicates which core particle should receive it; the network computation flag bit distinguishes whether the packet needs in-network operation, and the data part is the specific communication content to be transmitted. The source address helps a core particle identify the origin of a packet, ensuring that received packets are handled correctly according to their sources; the destination address serves as the input of the routing algorithm in the forwarding stage. The network computation flag bit classifies packets: for packets that need no in-network operation it is set to 0, which keeps them away from the in-network operation unit and speeds up forwarding; for packets that do need in-network operation it divides them into groups, acting as a group identifier, so that only packets of the same group can be aggregated and the inter-core-particle router can process several groups of packets to be aggregated at the same time. The data part is the specific data the core particle wants to transmit.
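A minimal sketch of the packet format described above; the field names, types and the helper method are illustrative assumptions, not the patent's concrete bit layout:

```python
# Hypothetical model of the four-field packet: source address, destination
# address, network computation flag bit, and data part.
from dataclasses import dataclass

@dataclass
class Packet:
    src: tuple          # (x, y) of the core particle that generated the packet
    dst: tuple          # (x, y) of the core particle that should receive it
    netcalc_flag: int   # 0 = forward only; nonzero = aggregation group id
    data: float         # payload summed during in-network operation

    def needs_in_network_op(self) -> bool:
        # The parser routes on exactly this predicate.
        return self.netcalc_flag != 0
```

The nonzero flag doubles as a group identifier, so several independent reductions can be in flight at once.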
Furthermore, the in-network operation unit comprises a multiplexer, a network computation cache, a network computation controller, a comparator and a logic calculation unit. The working process of the in-network operation unit is called in-network operation, and proceeds as follows: the comparator compares the network computation flag bit of the packet transmitted by the parser with that of the packet in the network computation cache. If the two are not equal, the network computation controller transmits the cached packet to the route calculation unit; if they are equal, the controller feeds the data part of the parser's packet and the data part of the cached packet to the logic calculation unit for addition, writes the result back into the network computation cache over the original data part, and at the same time writes the source address, destination address and network computation flag bit of the parser's packet into the network computation cache.
The multiplexer connects each parser to the in-network operation unit. The network computation cache can temporarily store one data packet, meeting the requirement that the contents of two packets be operated on simultaneously during aggregation. The comparator compares the network computation flag bits of packets and passes the result to the network computation controller; the controller receives the comparison result and controls the input and output of the network computation cache. Together, the comparator and the controller ensure that the in-network operation unit aggregates only packets of the same group: for packets of different groups, the cached packet is sent out first and the packet arriving through the multiplexer is then stored in the cache. The logic calculation unit performs floating-point addition on packet contents to obtain the data part of the aggregated packet.
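The compare-and-accumulate behaviour of the in-network operation unit can be sketched as follows (the class, method and field names are assumptions, and packets are modelled as dictionaries for brevity):

```python
# Sketch of the in-network operation unit: at most one packet sits in the
# network computation cache; an arriving packet of the same group is summed
# in place, while one of a different group evicts the cached packet toward
# the route calculation unit.
class InNetworkUnit:
    def __init__(self):
        self.cache = None            # network computation cache (one slot)

    def accept(self, pkt):
        """Return a packet to forward to the route unit, or None."""
        if self.cache is None:
            self.cache = pkt         # empty cache: just store and wait
            return None
        if self.cache["flag"] != pkt["flag"]:   # different aggregation group
            evicted, self.cache = self.cache, pkt
            return evicted
        # Same group: floating-point add of the data parts; the header
        # fields of the newer packet overwrite those of the cached one.
        self.cache = {**pkt, "data": self.cache["data"] + pkt["data"]}
        return None
```

Note that this sketch compresses the controller/comparator handshake of steps S4-S5 into one method; the hardware pipelines the comparison and the addition separately.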
Further, the reduction-tree routing algorithm works in a mesh network. It plans the routing paths of packets that need in-network operation into a tree, and the generated tree is balanced as much as possible, so that such packets pass through the same inter-core-particle router at similar times during forwarding and the router's in-network operation unit completes as many aggregation tasks as possible. The algorithm proceeds as follows:
T1. Let the mesh network have size n×n. Create n² Node structures to represent the core particles in the network. Each Node contains the following information: the x and y coordinates of the core particle in the mesh network, the number of child nodes, and the predecessor node. The x coordinate value is recorded as x and the y coordinate value as y; the position of a Node refers to its x and y coordinates and is recorded as (x, y);
T2. Assign values to the members of each Node according to the position of the core particle in the mesh network; set the number of child nodes to 0 and the predecessor node to null;
T3. Use an adjacency matrix M of size n²×n² to represent the connection relation of the nodes. Let i and j be integers with 1 ≤ i, j ≤ n. For the (i×n+j)-th row of the matrix, assign 1 to the ((i+1)×n+j)-th, ((i-1)×n+j)-th, (i×n+j+1)-th and (i×n+j-1)-th columns to represent connection, and assign 0 to the remaining columns to represent disconnection;
T4. For a Node located at (x, y), the (x×n+y)-th row of the adjacency matrix M records its adjacency information; therefore n² vertices, one per Node, combined with the edge information recorded in M, form a graph G;
T5. Update the graph G; the process is as follows:
T5.1. In graph G, set the shortest-path length to the source of every node directly connected with the source to 1, and the shortest-path lengths of all remaining nodes to the source to infinity;
T5.2. Among the nodes not yet visited, select the one closest to the source and record it as u;
T5.3. Mark u as visited;
T5.4. Select a node connected with node u and record it as v;
T5.5. If v has not been visited and using u as an intermediate point makes the path from the source to v shorter, record the path length from v to the source as the path length from u to the source plus one, and add u to the candidate predecessor nodes of v;
T5.6. Repeat steps T5.4 and T5.5 for all vertices reachable from u;
T5.7. From the candidate predecessor nodes of u, select the node with the largest number of child nodes and record it as w;
T5.8. Record the predecessor of u as w, and add 1 to the number of child nodes of w;
T5.9. Repeat steps T5.2 to T5.8 until all nodes have been visited;
T6. Create a set R of size n². For graph G, traverse every Node in the graph and, through the predecessor member v in each Node structure, compute the predecessor node coordinate of that node as (v→x, v→y), then store it into the set R, where the symbol "v→x" denotes the x coordinate taken from Node v and "v→y" the y coordinate taken from Node v;
T7. Store the set R into the route calculation unit of the inter-core-particle router as the result of the routing algorithm.
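A simplified sketch of the reduction-tree construction in steps T1-T7, assuming a breadth-first shortest-path computation over the n×n mesh (equivalent to Dijkstra on unit-weight edges) and the tie-breaking rule of T5.7 — prefer the candidate predecessor that already has the most children, so that paths share routers and packets meet. The data layout (dictionaries keyed by (x, y)) and function name are illustrative:

```python
# Build a balanced reduction tree rooted at `root` on an n-by-n mesh.
# Returns a map node -> predecessor (the next hop toward the root).
from collections import deque

def reduction_tree(n, root):
    dist = {root: 0}
    children = {(x, y): 0 for x in range(n) for y in range(n)}
    cand = {}                            # node -> candidate predecessors
    queue = deque([root])
    while queue:                         # BFS = shortest paths on the mesh
        u = queue.popleft()
        ux, uy = u
        for v in ((ux+1, uy), (ux-1, uy), (ux, uy+1), (ux, uy-1)):
            if not (0 <= v[0] < n and 0 <= v[1] < n):
                continue
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u lies on a shortest path to v
                cand.setdefault(v, []).append(u)
    pred = {root: None}
    # Resolve predecessors level by level so child counts stay up to date,
    # mirroring T5.7/T5.8: pick the candidate with the most children.
    for v in sorted(dist, key=dist.get):
        if v == root:
            continue
        w = max(cand[v], key=lambda p: children[p])
        pred[v] = w
        children[w] += 1
    return pred
```

Each entry of `pred` plays the role of one entry of the set R in step T6: it tells a router where to forward an aggregation packet next.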
Further, the arbiter uses round-robin arbitration, polling the packets of each input cache in turn and taking the selected packet as the input of the parser. Round-robin arbitration ensures that the arbiter treats the packets of every input cache equally, preventing any individual packet from being stored in an input cache for too long.
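Round-robin arbitration as described can be sketched as follows (the class and method names are illustrative; the grant pointer resumes just past the last winner, so every input cache is polled fairly):

```python
# Minimal round-robin arbiter: scan the input caches starting after the
# previous winner and grant the first one with a packet waiting.
class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.last = n - 1           # so the first grant starts at cache 0

    def grant(self, requests):
        """`requests` is one bool per input cache; return the index of the
        granted cache, or None if no packet is waiting."""
        for i in range(1, self.n + 1):
            k = (self.last + i) % self.n
            if requests[k]:
                self.last = k
                return k
        return None
```

With two caches always requesting, grants alternate 0, 1, 0, 1, ..., which is exactly the fairness property claimed above.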
Further, the input cache stores packets using several fixed-length queues. Using queues ensures that the inter-core-particle router can process packets one by one when it receives several at once, and the fixed queue length reduces circuit complexity and the area of the router.
The second purpose of the invention can be achieved by adopting the following technical scheme:
A data packet aggregation method based on the network-computing integrated inter-core-particle router comprises the following steps:
S1. When a data packet enters the network-computing integrated inter-core-particle router, it is temporarily stored in an input cache and awaits arbitration by the arbiter;
S2. The arbiter reads the contents of the packets in the input caches, arbitrates among them in round-robin fashion, polling the packets of each input cache, and decides which packet is transmitted to the parser for parsing;
S3. After passing the arbiter, the packet enters the parser. In the parser, the router judges whether the packet needs in-network operation according to its network computation flag bit: if the flag bit equals 0, the packet needs no in-network operation, steps S4 and S5 are skipped and step S6 is executed; if the flag bit is not 0, the packet needs in-network operation and step S4 is executed;
S4. After the packet enters the in-network operation unit, record it as packet A. If the network computation controller finds no cached packet in the network computation cache, packet A is stored into the cache and the work ends. Otherwise, record the packet stored in the cache as packet B. The data part of packet A is first transmitted to the logic calculation unit, and the network computation flag bits of packets A and B enter the comparator for comparison; the comparator transmits the result to the network computation controller. If the flag bits of the two packets are not equal, the controller makes the cache output packet B and empties the cache, sets the data part in the cache to 0 and inputs it into the logic calculation unit, and stores the source address, destination address and network computation flag bit of packet A into the cache. If the flag bits are equal, the controller makes the cache input the data part of packet B into the logic calculation unit, and stores the source address, destination address and network computation flag bit of packet A into the cache, overwriting those of packet B;
S5. The logic calculation unit performs floating-point addition on the data parts of packet A and packet B; the result is stored into the network computation cache, overwriting the original data part, and finally the network computation controller transmits the packet to the route calculation unit;
S6. The route calculation unit uses a routing algorithm to calculate, from the destination address in the packet, the corresponding output port: for packets transmitted by the parser the XY routing algorithm is adopted, and for packets transmitted by the in-network operation unit the reduction-tree routing algorithm is adopted;
S7. The packet enters the crossbar switch, which selects through a multiplexer the packet transmitted by one route calculation unit and transmits it to the corresponding output cache;
S8. The output cache transmits the packet out of the router on a first-come, first-served basis.
Further, the network computation controller in the in-network operation unit records whether a packet exists in the network computation cache and how long it has been there; if a packet's residence time exceeds a preset threshold period, the controller makes the cache transmit the packet to the route calculation unit. The preset threshold period avoids two problems: 1. a packet waits in the cache too long for aggregation, hurting forwarding efficiency; 2. when only one packet needing in-network operation remains in the network and has not yet reached its destination address, it would otherwise sit in a network computation cache and never reach the destination through forwarding.
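The timeout mechanism can be sketched as a cycle counter inside the network computation controller; the threshold value, class and method names are all assumptions for illustration:

```python
# Age counter for the single-slot network computation cache: once a cached
# packet has waited longer than the threshold for a partner to aggregate
# with, it is flushed toward the route calculation unit.
class NetcalcTimeout:
    def __init__(self, threshold):
        self.threshold = threshold
        self.age = None             # None means the cache slot is empty

    def store(self):
        self.age = 0                # a packet just entered the cache

    def tick(self):
        """Advance one cycle; return True if the cached packet must be
        flushed to the route calculation unit."""
        if self.age is None:
            return False
        self.age += 1
        if self.age > self.threshold:
            self.age = None         # packet leaves toward the route unit
            return True
        return False
```

This directly addresses the second problem above: a lone packet is guaranteed to leave the cache after at most threshold + 1 cycles even if no aggregation partner ever arrives.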
Compared with the prior art, the invention has the following advantages and effects:
1. The invention reduces the number of data packets present at the same time in the inter-core-particle network of a multi-core-particle integrated system, reduces the communication load of the network, and improves the communication efficiency of the system.
2. The invention shifts the packet aggregation task into the packet transmission process, lightening the workload of the core particles and reducing the time the multi-core-particle integrated system needs to complete neural network training tasks.
3. The micro-architecture design provided by the invention occupies a small area, meeting the chip-area constraints on inter-core-particle routers.
4. The invention has good generality: even when packets that need in-network operation and packets that do not coexist in the multi-core-particle integrated system, the inter-core-particle router correctly completes both packet forwarding and packet aggregation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a network-integrated core inter-core router micro-architecture according to the present invention;
FIG. 2 is a schematic diagram of the packet structure and data path of a packet within an inter-core router in accordance with the present invention;
FIG. 3 is a diagram of an intra-network arithmetic unit according to the present invention;
FIG. 4 is a flow chart of the operation of the intra-network arithmetic unit in the present invention;
fig. 5 is a schematic diagram of a multi-core-particle integrated system composed of 16 core particles in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
This embodiment provides a network-computing integrated inter-core-particle router in a multi-core-particle integrated system consisting of 16 core particles. The network topology between the core particles adopts a mesh network. The overall architecture of the multi-core-particle integrated system is shown in fig. 5.
The micro-architecture design of the network-computing integrated inter-core-particle router is shown in FIG. 1, including the connection relations of its components. The main function of the router is to complete packet forwarding and in-network operation in the inter-core-particle network of the multi-core-particle integrated system. A packet in this network comprises a source address, a destination address, a network computation flag bit and a data part: the source address indicates which core particle generated the packet, the destination address indicates which core particle receives it, the network computation flag bit distinguishes whether the packet needs in-network operation, and the data part is the specific communication content. Forwarding means transmitting a packet from one inter-core-particle router to the next according to the packet's destination address. In-network operation means that, during forwarding, the data parts of packets with the same network computation flag bit are added to obtain an operation result, a new packet containing the result is generated and forwarded, and the other packets participating in the operation are discarded.
In the inter-core-particle network shown in fig. 5, 4 inter-core-particle routers connect to at most 4 other routers; their components comprise 5 input caches, 5 parsers, 1 in-network operation unit, 6 route calculation units, 1 arbiter, 1 crossbar switch and 5 output caches. 8 routers connect to at most 3 other routers; their components comprise 4 input caches, 4 parsers, 1 in-network operation unit, 5 route calculation units, 1 arbiter, 1 crossbar switch and 4 output caches. The remaining 4 routers connect to at most 2 other routers; their components comprise 3 input caches, 3 parsers, 1 in-network operation unit, 4 route calculation units, 1 arbiter, 1 crossbar switch and 3 output caches. The specific functions of each component are as follows:
one input cache is connected with the core particles, the other input caches are connected with the output cache of the other inter-core particle router, the input cache receives data packets transmitted by the output cache of the other inter-core particle router or data packets transmitted by the core particles, and caches the data packets by using a queue with the length of 12 bytes to wait for arbitration of the arbiter; the data packet comprises a source address, a target address, a network computation flag bit and a data part.
The arbiter reads the contents of the packets in the input caches, arbitrates among them, and decides which packet is transmitted to the parser. The arbiter uses round-robin arbitration, selecting the packet of each input cache in turn as the input of the parser.
The parser receives packets transmitted by the input cache and, after parsing, transmits them to the route calculation unit or the in-network operation unit. Parsing means judging whether a packet needs in-network operation according to its network computation flag bit: if the flag bit equals 0, the packet needs no in-network operation and the parser transmits it to the route calculation unit; if the flag bit is not 0, the packet needs in-network operation and the parser transmits it to the in-network operation unit.
The in-network operation unit comprises a multiplexer, a network computation cache, a network computation controller, a comparator and a logic calculation unit. The in-network operation proceeds as follows: the comparator compares the network computation flag bit of the packet transmitted by the parser with that of the packet held in the network computation cache. If the two are not equal, the network computation controller transmits the cached packet to the route calculation unit; if they are equal, the controller sends the data part of the incoming packet and the data part of the cached packet to the logic calculation unit for addition, writes the result back into the network computation cache over the original data part, and at the same time writes the source address, destination address and network computation flag bit of the incoming packet into the network computation cache.
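As an illustration of the comparator and controller behavior just described, here is a minimal Python sketch. The packet-as-tuple layout (source, destination, flag, data), the function name, and the choice to hold the new packet after an eviction are assumptions for illustration, not the patent's wire format:

```python
# Sketch of the in-network operation unit's decision rule. A packet is an
# assumed (source, destination, flag, data) tuple; the real unit operates
# on header fields in hardware.
def aggregate(cache, incoming):
    """Return (packet now held in the cache, packet sent to routing or None)."""
    if cache is None:                 # empty cache: just hold the packet
        return incoming, None
    if incoming[2] != cache[2]:       # flags differ: evict the cached packet
        return incoming, cache        # (assumed: the new packet is held)
    src, dst, flag, data = incoming
    # Flags equal: add the payloads, keep the newer packet's header fields.
    return (src, dst, flag, cache[3] + data), None
```

Each call models one packet arriving from the parser; the second return value is what, if anything, the controller forwards to the route calculation unit.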
And the route calculation unit receives the data packet from the resolver or the in-network calculation unit, and transmits the data packet and a route calculation result to the cross switch after route calculation. The routing calculation refers to calculating which output port the data packet should be sent to by using a routing algorithm according to the destination address in the data packet. And for the data packet transmitted by the intra-network operation unit, the route calculation unit adopts a reduction tree routing algorithm. The routing operation result is an electrical signal that controls the state of the crossbar switch and indicates to which output buffer the crossbar switch should transmit the incoming data packet.
The crossbar switch receives the data packet and the routing operation result from the route calculation unit and transmits the packet to an output cache. The crossbar comprises N multiplexers, each connected to one output cache; according to the routing operation result, each multiplexer selects the packet delivered by 1 route calculation unit and passes it to the corresponding output cache. The output cache receives the packet from the crossbar and, on a first-come-first-served basis, transmits it to the input cache of the neighboring inter-core-particle router, or to the core particle connected to this router.
According to the hardware simulation result, under a 45 nm process the inter-core-particle router occupies an area of 70902.3 μm², of which the in-network operation unit occupies 3684 μm², about 5% of the area of the inter-core-particle router.
Now suppose there are two data packets, denoted E1 and E2. Packet E1 is generated by the core particle in the first row and first column of fig. 5 and needs to be sent to the core particle in the second row and second column; its source address is therefore (1, 1), its destination address (2, 2), its network computation flag bit 103 and its data part 736.5. Packet E2 is generated by the core particle in the first row and third column of fig. 5 and also needs to be sent to the core particle in the second row and second column; its source address is (1, 3), its destination address (2, 2), its network computation flag bit 103 and its data part 367.2. For packets E1 and E2, the reduction tree routing algorithm works as follows:
f1, setting the size of the mesh network to be 4 multiplied by 4, and creating 16 Node sets for representing core grains in the network, wherein the Node sets comprise the following information: the method comprises the steps that x coordinates and y coordinates of core particles in a mesh network, the number of child nodes and precursor nodes are obtained, wherein x coordinate values are recorded as x, y coordinate values are recorded as y, position information of a Node set is defined as the x coordinates and the y coordinate values of the Node set, and the position information is recorded as (x, y);
f2, assigning values to the members of the Node set according to the position of the core grain in the mesh network, setting the number of the sub-nodes to be 0, and setting the precursor Node of the core grain to be null;
f3, using an adjacency matrix M with the size of 16 × 16 to represent the connection relation of the nodes, wherein the symbols i and j represent integers which are greater than or equal to 1 and less than or equal to n; for the (i × n + j)-th row of the matrix, the ((i + 1) × n + j)-th, ((i − 1) × n + j)-th, (i × n + j + 1)-th and (i × n + j − 1)-th columns, where the corresponding node exists, are assigned 1 to represent a connection, and the remaining columns are assigned 0 to represent no connection;
f4, for a Node set, the (x × n + y)-th row of the adjacency matrix M records its adjacency information, so n² vertices are used to represent the Node sets, and a graph G is formed by combining them with the edge information recorded in the adjacency matrix M;
f5, selecting the Node set with the position information of (2, 2) as a source point, and updating the graph G, wherein the process is as follows:
f5.1, in the graph G, setting the shortest path length from each node directly communicated with the source point to 1, and the shortest path length from all remaining points to the source point to infinity;
f5.2, selecting one of the nodes which are not accessed yet and is closest to the source point, and marking the selected one as u;
f5.3, the note u has been accessed;
f5.4, selecting a node communicated with the u node and recording as v;
f5.5, if v is not visited and the path from the source point to the node v is shorter by taking u as an intermediate point, recording the path length from the node v to the source point as the path length from the node u to the source point plus one, and adding u into the candidate precursor node of v;
f5.6, repeating the step F5.4 and the step F5.5 for all the vertexes which can be reached from the point u;
f5.7, selecting the node with the largest number of child nodes from the candidate precursor nodes of u, and marking as w;
f5.8, recording the predecessor of u as w, and adding 1 to the number of child nodes of w;
f5.9, repeating the steps F5.2 to F5.8 until all the nodes are accessed;
f6, establishing a set R with the size of 16; for the graph G, traversing each Node in the graph, calculating the precursor node coordinate (v → x, v → y) of each node through the precursor member v in its Node set, and storing it into the set R, wherein the symbol "v → x" represents the x coordinate in the Node set v, and the symbol "v → y" represents the y coordinate in the Node set v;
and F7, writing the set R into a router of the inter-core network as a routing table of the inter-core network.
In this embodiment, steps F1 to F7 yield the following entries in the set R: the precursor node of the inter-core-particle router in the first row and first column is the inter-core-particle router in the first row and second column; the precursor node of the inter-core-particle router in the first row and third column is the inter-core-particle router in the first row and second column; and the precursor node of the inter-core-particle router in the first row and second column is the inter-core-particle router in the second row and second column.
Both data packet E1 and data packet E2 must pass through the inter-core-particle router in the first row and second column during forwarding. Assuming that the interval between the arrival of E1 and the arrival of E2 at that router is less than the preset threshold period of the network computation cache, the aggregation process of E1 and E2 in the inter-core-particle router in the first row and second column of fig. 5 is explained through the following multi-core-particle aggregation steps:
p1, the data packet E1 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and first column, and waits for the arbitration of the arbiter;
p2, the data packet E2 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and third column, and waits for the arbitration of the arbiter;
p3, polling the data packet in each input buffer by the arbiter, reading the content of the data packet E1 in the input buffer, and sending a control signal to transmit the data packet E1 from the input buffer to the parser;
p4, the data packet E1 enters a resolver, in the resolver, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation flag bit value 103 of the data packet E1, and transmits the data packet E1 to an intra-network operation unit;
p5, the data packet E1 enters an in-network operation unit, the network operation controller determines that the network operation cache is empty, and the data packet E1 is stored in the network operation cache;
p6, the arbiter reads the content of the data packet E2 in the input buffer, and sends a control signal to transmit the data packet E2 from the input buffer to the resolver;
p7, the data packet E2 enters a resolver, in the resolver, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation flag bit value 103 of the data packet E2, and transmits the data packet E2 to an intra-network operation unit;
p8, the data packet E2 enters the in-network operation unit. The network computation controller finds the data packet E1 stored in the network computation cache, and first transmits the data part 367.2 of E2 to the logic calculation unit; the network computation flag bit 103 of E2 and the network computation flag bit 103 of E1 enter the comparator for comparison, and the comparator transmits the result to the network computation controller. Since the flag bits of the two packets are equal, the controller makes the network computation cache input the data part 736.5 of E1 into the logic calculation unit, and the source address, destination address and network computation flag bit of E2 are stored in the network computation cache, overwriting those of E1;
p9, the logic calculation unit carries out floating point addition operation on 736.5 and 367.2, and an operation result 1103.7 is stored in the network calculation cache and covers the original data part in the network calculation cache. The network computation controller transmits the data packet in the network computation cache to a route computation unit, records the data packet as E3, and the content of the data packet E3 is a source address (1, 3), a target address (2, 2), a network computation flag bit 103 and a data part 1103.7;
p10, the route calculating unit calculates, according to the destination address in the packet E3, that the output cache corresponding to the packet is the output cache connected to the inter-core router in the second row and the second column in fig. 5 by using the reduction tree routing algorithm.
P11, the data packet E3 enters the crossbar switch, which selects E3 through the multiplexer and transfers it to the output cache connected to the inter-core-particle router in the second row and second column of fig. 5.
And P12, the output buffer transmits the data packet E3 to the outside of the router according to the first come first serve principle.
By this time, the aggregation process of the data packet E1 and the data packet E2 in the inter-core-particle routers in the first row and the second column is finished, and through this aggregation process, the data packet E1 and the data packet E2 are aggregated into the data packet E3, so that the data packet amount of the inter-core-particle network in the multi-core-particle integrated system is reduced by 1.
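The arithmetic of this walkthrough can be written out with an assumed packet-as-tuple layout (source, destination, flag, data); this is purely illustrative and not the patent's packet format:

```python
# The E1/E2 aggregation, written out with an assumed packet-as-tuple layout
# (source, destination, flag, data); purely illustrative.
E1 = ((1, 1), (2, 2), 103, 736.5)
E2 = ((1, 3), (2, 2), 103, 367.2)
assert E1[2] == E2[2]                      # equal flags: packets may merge
E3 = (E2[0], E2[1], E2[2], E1[3] + E2[3])  # E2's header, summed payloads
print(E3)                                  # payload: 736.5 + 367.2 ~ 1103.7
```

Two packets in, one packet out: the traffic toward the destination (2, 2) is halved at this hop.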
Example 2
In this embodiment, based on the network-integrated core-to-core router disclosed in embodiment 1, a working process of a packet aggregation method based on the network-integrated core-to-core router under a complex condition is proposed under the same overall system architecture.
Now, suppose there are two data packets, which are respectively denoted as data packet H1 and data packet H2, where the data packet H1 is generated by the core in the first row and the first column in fig. 5 and needs to be sent to the core in the third row and the second column, therefore, the source address is denoted as (1, 1), the destination address is denoted as (3, 2), the computation flag is set to be 101, and the data portion is set to be 61.8; since the packet H2 is generated by the core in the first row and the third column in fig. 5 and needs to be transmitted to the core in the third row and the second column, the source address is represented by (1, 3), the destination address is represented by (3, 2), the network computation flag is 101, and the data portion is 37.2. Calculating according to a reduction tree routing algorithm that a precursor node of the core inter-particle router positioned in the first row and the first column is the core inter-particle router in the first row and the second column; the precursor node of the core-to-core router positioned in the first row and the third column is the core-to-core router in the first row and the second column; the precursor node of the core-to-core router positioned in the first row and the second column is the core-to-core router in the second row and the second column; the predecessor nodes of the inter-core routers in the second row and the second column are the inter-core routers in the third row and the second column. Therefore, the data packet H1 and the data packet H2 both must pass through the core-to-core routers in the first row and the second column and the core-to-core routers in the second row and the second column during forwarding.
Assuming that the time interval between the data packet H1 and the data packet H2 reaching the inter-core routers in the first row and the second column is smaller than the preset threshold period in the network cache, the following describes an aggregation process of the data packet H1 and the data packet H2 in the inter-core routers in the first row and the second column in fig. 5 through a multi-core aggregation process, including the following steps:
w1, the data packet H1 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and first column, and waits for the arbitration of the arbiter;
w2, the data packet H2 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and third column, and waits for the arbitration of the arbiter;
w3, the arbiter polls the data packet in each input buffer, reads the content of the data packet H1 in the input buffer, and sends a control signal to transmit the data packet H1 from the input buffer to the resolver;
w4, the data packet H1 enters a resolver, in the resolver, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation flag bit value 101 of the data packet H1, and transmits the data packet H1 to an intra-network operation unit;
w5, the data packet H1 enters an in-network operation unit, a network operation controller determines that a network operation cache is empty, and the data packet H1 is stored in the network operation cache;
w6, reading the content of the data packet H2 in the input cache by the arbiter, and sending a control signal to transmit the data packet H2 from the input cache to the parser;
w7, the data packet H2 enters a resolver, in the resolver, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation zone bit value 101 of the data packet H2, and transmits the data packet H2 to an intra-network operation unit;
w8, the data packet H2 enters the in-network operation unit. The network computation controller finds the data packet H1 stored in the network computation cache, and first transmits the data part 37.2 of H2 to the logic calculation unit; the network computation flag bit 101 of H2 and the network computation flag bit 101 of H1 enter the comparator for comparison, and the comparator transmits the result to the network computation controller. Since the flag bits of the two packets are equal, the controller makes the network computation cache input the data part 61.8 of H1 into the logic calculation unit, and the source address, destination address and network computation flag bit of H2 are stored in the network computation cache, overwriting those of H1;
w9, the logic calculation unit carries out floating point addition on 61.8 and 37.2, and the operation result 99.0 is stored in the network computation cache, overwriting the original data part. The network computation controller transmits the data packet in the network computation cache to the route calculation unit and records it as H3; the content of H3 is source address (1, 3), destination address (3, 2), network computation flag bit 101 and data part 99.0;
w10, the route calculation unit calculates, using a reduction tree routing algorithm, that the output cache corresponding to the packet is the output cache connected to the inter-core router in the second row and the second column in fig. 5, based on the destination address in the packet H3.
W11, the data packet H3 enters the crossbar switch, which selects H3 through the multiplexer and transfers it to the output cache connected to the inter-core-particle router in the second row and second column of fig. 5.
W12, the output buffer transmits the data packet H3 to the outside of the router according to the first come first serve principle.
The forwarding process of the data packet H3 in the core inter-core router in the second row and the second column in fig. 5 is as follows:
z1, the data packet H3 enters the inter-core-particle router in the second row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and second column, and waits for the arbitration of the arbiter;
z2, polling the data packet in each input buffer by the arbiter, reading the content of the data packet H3 in the input buffer, and sending a control signal to transmit the data packet H3 from the input buffer to the parser;
z3, the data packet H3 enters an analyzer, in the analyzer, the router judges that the data packet needs to be subjected to in-network operation according to the network operation zone bit value 101 of the data packet H3, and transmits the data packet H3 to an in-network operation unit;
z4, the data packet H3 enters an in-network operation unit, the network operation controller determines that the network operation cache is empty, and the data packet H3 is stored in the network operation cache;
z5, the data packet H3 exceeds a preset threshold value period in the network computing cache, and the network computing controller controls the network computing cache to transmit the data packet H3 to the route computing unit.
And Z6, calculating that the output cache corresponding to the data packet is the output cache connected with the inter-core router in the third row and the second column in the figure 5 by using a reduction tree routing algorithm by the routing calculation unit according to the destination address in the data packet H3.
Z7, packet H3 enters the crossbar which selects packet H3 through the multiplexer and transfers it to the output buffer connected to the inter-core routers in the third row and second column of fig. 5.
And Z8, the output buffer transmits the data packet H3 to the outside of the router according to the first-come first-serve principle.
In the present embodiment, the data packet H1 and the data packet H2 complete data aggregation in the core inter-core routers located in the first row and the second column; and the data packet H3 has a period exceeding a preset threshold period in the network computing cache of the core inter-particle router positioned in the second row and the second column, and the network computing controller forwards the data packet to the route computing unit. The above procedure represents a situation that it is desirable to avoid setting a threshold period for which a packet is already present in the computation cache: if the data packet can be buffered in the network computation cache without a deadline, when only one data packet which needs to be subjected to intra-network computation is left in the network and the data packet does not reach the target address yet, the data packet is stored in the network computation cache and can never reach the target address through forwarding.
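The timeout behavior that this scenario motivates can be sketched as follows, assuming the threshold period is measured in cycles and the network computation cache holds a single packet; all names are illustrative:

```python
# Sketch of the threshold-period timeout (names assumed). Without it, the
# last packet of a reduction group would wait in the cache forever and
# never reach its destination address.
class NetworkComputeCache:
    def __init__(self, threshold):
        self.threshold = threshold   # preset threshold period, in cycles
        self.packet = None
        self.age = 0

    def store(self, packet):
        self.packet, self.age = packet, 0

    def tick(self):
        """Advance one cycle; return a packet evicted to routing, or None."""
        if self.packet is None:
            return None
        self.age += 1
        if self.age >= self.threshold:   # timed out: forward it anyway
            evicted, self.packet = self.packet, None
            return evicted
        return None
```

A lone packet such as H3 is thus held for at most `threshold` cycles before the controller forwards it to the route calculation unit, which bounds its added latency.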
Example 3
In this embodiment, based on the network-integrated core-to-core router disclosed in embodiment 1, under the same system overall architecture, a working process of a packet aggregation method based on the network-integrated core-to-core router under a complex condition is proposed.
Now suppose there are three data packets, denoted K1, K2 and K3. Packet K1 is generated by the core particle in the first row and first column of fig. 5 and needs to be sent to the core particle in the second row and second column, so its source address is (1, 1), destination address (2, 2), network computation flag bit 102 and data part 15.0. Packet K2 is generated by the core particle in the first row and fourth column of fig. 5 and needs to be sent to the core particle in the second row and second column, so its source address is (1, 4), destination address (2, 2), network computation flag bit 102 and data part 22.0. Packet K3 is generated by the core particle in the first row and third column of fig. 5 and needs to be sent to the core particle in the third row and second column, so its source address is (1, 3), destination address (3, 2), network computation flag bit 0 and data part 22.0. According to the reduction tree routing algorithm, the precursor node of the inter-core-particle router in the first row and first column is the inter-core-particle router in the first row and second column; the precursor node of the router in the first row and fourth column is the router in the first row and third column; the precursor node of the router in the first row and third column is the router in the first row and second column; and the precursor node of the router in the first row and second column is the router in the second row and second column.
According to the XY routing algorithm, the data packet K3 will pass in turn through the inter-core-particle routers in the first row and third column, the first row and second column, the second row and second column, and the third row and second column during forwarding. Therefore, packets K1, K2 and K3 all must pass through the inter-core-particle router in the first row and second column. Now assume that the three packets arrive at that router in the order K1, K3, K2, and that the interval between the arrival of K1 and the arrival of K2 is less than the preset threshold period of the network computation cache. The forwarding process of the three packets in the inter-core-particle router in the first row and second column is as follows:
l1, the data packet K1 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and first column, and waits for the arbitration of the arbiter;
l2, the data packet K3 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and third column, and waits for the arbitration of the arbiter;
l3, polling the data packet in each input buffer by the arbiter, reading the content of the data packet K1 in the input buffer, and sending a control signal to transmit the data packet K1 from the input buffer to the parser;
l4, the data packet K1 enters a resolver, in the resolver, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation zone bit value 102 of the data packet K1, and transmits the data packet K1 to an intra-network operation unit;
l5, the data packet K1 enters an in-network operation unit, the network operation controller determines that the network operation cache is empty, and the data packet K1 is stored in the network operation cache;
l6, polling the data packet in each input buffer by the arbitrator, reading the content of the data packet K3 in the input buffer, and sending a control signal to transmit the data packet K3 from the input buffer to the parser;
l7, the data packet K3 enters an analyzer, in the analyzer, the router judges that the data packet does not need to be subjected to in-network operation according to the network operation zone bit value 0 of the data packet K3, and transmits the data packet K3 to a route calculation unit;
and L8, calculating that the output cache corresponding to the data packet is the output cache connected with the core inter-core router in the second row and the second column in the figure 5 by using an XY routing algorithm by the routing calculation unit according to the destination address in the data packet K3.
L9, the data packet K3 enters the crossbar switch, which selects K3 through the multiplexer and transfers it to the output cache connected to the inter-core-particle router in the second row and second column of fig. 5.
L10, that output cache transmits the data packet K3 out of the router on a first-come-first-served basis.
L11, the data packet K2 enters the inter-core-particle router in the first row and second column, is temporarily stored in the input cache connected with the output cache of the inter-core-particle router in the first row and third column, and waits for the arbitration of the arbiter.
L12, the arbiter reads the content of the data packet K2 in the input buffer, and sends a control signal to transmit the data packet K2 from the input buffer to the resolver;
l13, the data packet K2 enters an analyzer, in the analyzer, the router judges that the data packet needs to be subjected to intra-network operation according to the network operation zone bit value 102 of the data packet K2, and transmits the data packet K2 to an intra-network operation unit;
l14, the data packet K2 enters the in-network operation unit. The network computation controller finds the data packet K1 stored in the network computation cache, and first transmits the data part 22.0 of K2 to the logic calculation unit; the network computation flag bit 102 of K2 and the network computation flag bit 102 of K1 enter the comparator for comparison, and the comparator transmits the result to the network computation controller. Since the flag bits of the two packets are equal, the controller makes the network computation cache input the data part 15.0 of K1 into the logic calculation unit, and the source address, destination address and network computation flag bit of K2 are stored in the network computation cache, overwriting those of K1;
l15, the logic calculation unit carries out floating point addition on 15.0 and 22.0, and the operation result 37.0 is stored in the network computation cache, overwriting the original data part. The network computation controller transmits the data packet in the network computation cache to the route calculation unit and records it as K4; the content of K4 is source address (1, 4), destination address (2, 2), network computation flag bit 102 and data part 37.0;
and L16, the route calculation unit calculates that the output cache corresponding to the data packet is the output cache connected with the inter-core router in the second row and the second column by using a reduction tree routing algorithm according to the destination address in the data packet K4.
L17, the data packet K4 enters the crossbar switch, which selects K4 through the multiplexer and transfers it to the output cache connected to the inter-core-particle router in the second row and second column.
L18, that output cache transmits the data packet K4 out of the router on a first-come-first-served basis.
In this embodiment, K1 and K2 are packets that need intra-network operation, K3 is a packet that does not need intra-network operation, and the forwarding process of the three packets by the inter-core-granule router shows that even if there are packets that need intra-network operation and packets that do not need intra-network operation in the system at the same time, the inter-core-granule router can correctly complete the forwarding and aggregation tasks.
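The XY routing used for packets such as K3, which skip the in-network operation, can be sketched as follows, with (row, column) addresses as in the embodiments. The port names and the orientation (columns grow eastward, rows grow southward) are assumptions for illustration:

```python
# Sketch of the XY routing decision for packets that skip the in-network
# operation, with (row, column) addresses as in the embodiments. The port
# names and the mesh orientation are assumptions for illustration. Per the
# claims, the packet is forwarded along columns first, then along rows.
def xy_route(current, dest):
    """Return the output port for a packet at `current` heading to `dest`."""
    crow, ccol = current
    drow, dcol = dest
    if ccol != dcol:                  # forward along columns first
        return "EAST" if dcol > ccol else "WEST"
    if crow != drow:                  # then forward along rows
        return "SOUTH" if drow > crow else "NORTH"
    return "LOCAL"                    # arrived: deliver to the attached core
```

For K3 this reproduces the path given above: (1, 3) → (1, 2) → (2, 2) → (3, 2).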
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (8)
1. A network-integrated core inter-particle router is characterized by comprising N input buffers, 1 arbiter, N parsers, 1 in-network operation unit, N +1 route calculation units, 1 crossbar switch and N output buffers, wherein,
the input cache is connected with the core particle or the output cache of the other inter-core-particle router, receives a data packet transmitted by the output cache of the other inter-core-particle router or a data packet transmitted by the core particle, caches the data packet and waits for the arbitration of the arbiter; the data packet comprises a source address, a target address, a network computation flag bit and a data part;
the arbiter reads the contents of the data packets in the input caches, arbitrates among them, and decides which input cache's data packet is transmitted to the parser for parsing;
the parser receives the data packet transmitted by the input cache and transmits it to either the route calculation unit or the in-network operation unit; the parser judges whether the data packet requires in-network operation according to the packet's network computation flag bit: if the value of the network computation flag bit equals 0, the packet does not require in-network operation and the parser transmits it to the route calculation unit; if the value of the network computation flag bit is not 0, the packet requires in-network operation and the parser transmits it to the in-network operation unit;
the in-network operation unit receives the data packet transmitted by the parser, performs the in-network operation, and then transmits the data packet to the route calculation unit;
the route calculation unit receives the data packet from the parser or the in-network operation unit, calculates the output port address corresponding to the data packet using a routing algorithm according to the destination address in the packet, and transmits the data packet together with the routing result to the crossbar switch; for data packets transmitted by the parser, the route calculation unit adopts the XY routing algorithm, namely forwarding first by column and then by row; for data packets transmitted by the in-network operation unit, the route calculation unit adopts the reduction tree routing algorithm; the routing result serves as an electrical signal that controls the state of the crossbar switch, indicating to which output cache the crossbar switch should transmit the data packet;
the crossbar switch receives the data packet and the routing result from the route calculation unit; it comprises N multiplexers, each connected to one output cache;
and the output cache receives the data packet transmitted by the crossbar switch and, according to the first-come-first-serve principle, transmits the cached data packet to the input cache of another inter-core-particle router or to the core particle.
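The XY routing algorithm named in claim 1 is deterministic dimension-ordered routing and can be illustrated with a minimal Python sketch. Which mesh axis corresponds to "columns", and the port names, are assumptions made for the example:

```python
def xy_route(cur, dst):
    """One hop of XY (dimension-ordered) routing on a mesh.

    Per the claim, a packet is forwarded along one dimension until
    aligned with its destination, then along the other. Coordinates
    are (x, y) tuples; port names are illustrative.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                      # first dimension
        return "east" if dx > cx else "west"
    if cy != dy:                      # then the other dimension
        return "north" if dy > cy else "south"
    return "local"                    # arrived: deliver to the core particle

# Walk a packet from (0, 0) to (2, 1), collecting the ports taken
hops, pos = [], (0, 0)
while (port := xy_route(pos, (2, 1))) != "local":
    hops.append(port)
    pos = {"east": (pos[0] + 1, pos[1]), "west": (pos[0] - 1, pos[1]),
           "north": (pos[0], pos[1] + 1), "south": (pos[0], pos[1] - 1)}[port]
assert hops == ["east", "east", "north"]
```

Because each hop depends only on the current and destination coordinates, the route calculation unit needs no global state for these packets.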
2. The network-computing integrated inter-core-particle router according to claim 1, wherein the source address in the data packet indicates which core particle in the network generated the packet, the destination address indicates which core particle in the network should receive the packet, the network computation flag bit is used to distinguish whether the packet requires in-network operation, and the data part is the specific communication content to be transmitted by the packet.
3. The network-computing integrated inter-core-particle router according to claim 1, wherein the in-network operation unit comprises a multiplexer, a network computation cache, a network computation controller, a comparator and a logic calculation unit; the operation process of the in-network operation unit is as follows: the comparator compares the network computation flag bit of the data packet transmitted by the parser with that of the data packet in the network computation cache; if the two are not equal, the network computation controller transmits the data packet in the network computation cache to the route calculation unit; if the two are equal, the network computation controller transmits the data part of the data packet from the parser and the data part of the data packet in the network computation cache to the logic calculation unit for addition, the result of which is written back to the network computation cache to overwrite the original data part, while the source address, destination address and network computation flag bit of the data packet from the parser are written into the network computation cache.
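The compare-then-add behaviour of the in-network operation unit described in this claim can be sketched as follows. The packet representation (a dict with `flag` and `data` fields) is an assumption, and the multiplexer/controller wiring is collapsed into a single method:

```python
class InNetworkOpUnit:
    """Sketch of the in-network operation unit of claims 3 and 7.

    A packet whose network computation flag matches the cached packet
    is merged by adding the data parts (the cached header is replaced
    by the newer packet's header); a mismatched flag flushes the cached
    packet downstream to the route calculation unit.
    """
    def __init__(self):
        self.cache = None  # the single network computation cache entry

    def accept(self, pkt):
        """Return a packet to forward to the route calculation unit, or None."""
        if self.cache is None:            # empty cache: just store
            self.cache = dict(pkt)
            return None
        if pkt["flag"] != self.cache["flag"]:
            flushed = self.cache          # flags differ: flush old, keep new
            self.cache = dict(pkt)
            return flushed
        # flags equal: floating-point add of the data parts;
        # header fields are taken from the newer packet
        merged = dict(pkt)
        merged["data"] = self.cache["data"] + pkt["data"]
        self.cache = merged
        return None

u = InNetworkOpUnit()
assert u.accept({"flag": 1, "data": 1.5}) is None   # stored
assert u.accept({"flag": 1, "data": 2.5}) is None   # merged: 4.0 cached
out = u.accept({"flag": 2, "data": 9.0})            # mismatch: flush
assert out["data"] == 4.0 and out["flag"] == 1
```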
4. The network-computing integrated inter-core-particle router according to claim 1, wherein the reduction tree routing algorithm operates in a mesh network as follows:
T1, set the size of the mesh network as n×n and create n² Node sets to represent the core particles in the network; each Node set comprises the following information: the x coordinate and y coordinate of the core particle in the mesh network, the number of child nodes, and the predecessor node, wherein the x coordinate value is recorded as x and the y coordinate value is recorded as y; the position information of a Node set refers to the x and y coordinate values in the Node set and is recorded as (x, y);
T2, assign values to the members of each Node set according to the position of the core particle in the mesh network, setting the number of child nodes to 0 and the predecessor node of the core particle to null;
T3, use an adjacency matrix M of size n²×n² to represent the connection relationships of the nodes, where the symbols i and j each range from 1 to n; for row (i×n+j) of the matrix, assign 1 to columns ((i+1)×n+j), ((i-1)×n+j), (i×n+j+1) and (i×n+j-1) to represent a connection, and assign 0 to the remaining columns to represent no connection;
T4, for a Node set, row (x×n+y) of the adjacency matrix M records its adjacency information, so n² vertices, one for each Node set, are combined with the edge information recorded in the adjacency matrix M to form a graph G;
T5, update the graph G, the process being as follows:
T5.1, in the graph G, set the length of the shortest path from each node directly connected with the source point to 1, and the lengths of the shortest paths from all remaining points to the source point to infinity;
T5.2, select one of the nodes that has not yet been visited and is closest to the source point, and record it as u;
T5.3, mark the node u as visited;
T5.4, select a node connected with the node u and record it as v;
T5.5, if v has not been visited and taking u as an intermediate point makes the path from the source point to the node v shorter, record the path length from the node v to the source point as the path length from the node u to the source point plus one, and add u to the candidate predecessor nodes of v;
T5.6, repeat step T5.4 and step T5.5 for all vertices reachable from u;
T5.7, from the candidate predecessor nodes of u, select the node with the largest number of child nodes and record it as w;
T5.8, record the predecessor of u as w, and add 1 to the number of child nodes of w;
T5.9, repeat steps T5.2 to T5.8 until all nodes have been visited;
T6, create a set R of size n²; for the graph G, traverse each Node in the graph, calculate the coordinate of each node's predecessor node as (v→x, v→y) through the predecessor member v in the Node set, and store it into the set R, where the symbol "v→x" denotes the x coordinate in the Node set v and the symbol "v→y" denotes the y coordinate in the Node set v;
T7, store the set R into the route calculation unit of the inter-core-particle router as the result of the routing algorithm.
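Steps T1-T7 amount to a breadth-first shortest-path computation on the mesh in which each node's predecessor is chosen, among the neighbours one hop closer to the source, as the candidate that already has the most children (step T5.7), so that aggregation traffic converges onto shared links. The following is a compact Python sketch under that reading; the explicit adjacency matrix of step T3 is replaced by an equivalent neighbour function, and 0-based coordinates are assumed:

```python
from collections import deque

def build_reduction_tree(n, source):
    """Build the predecessor map (the set R of step T6) for an n x n mesh.

    BFS from the source yields shortest hop counts; each non-source
    node then picks, among neighbours one hop closer to the source,
    the candidate predecessor with the most children so far.
    Returns {node: predecessor} with (x, y) tuples as nodes.
    """
    def neighbours(x, y):
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < n and 0 <= ny < n:
                yield (nx, ny)

    dist = {source: 0}
    children = {source: 0}
    pred = {}
    order = deque([source])
    while order:                      # BFS gives shortest-path distances
        u = order.popleft()
        for v in neighbours(*u):
            if v not in dist:
                dist[v] = dist[u] + 1
                children[v] = 0
                order.append(v)
    for v in sorted(dist, key=dist.get):   # closest unvisited node first
        if v == source:
            continue
        # candidate predecessors: neighbours one hop closer to the source
        cands = [u for u in neighbours(*v) if dist[u] == dist[v] - 1]
        w = max(cands, key=lambda u: children[u])   # most children wins
        pred[v] = w
        children[w] += 1
    return pred

R = build_reduction_tree(3, (0, 0))
assert R[(1, 0)] == (0, 0) and R[(0, 1)] == (0, 0)
assert (0, 0) not in R                 # the source has no predecessor
```

Favouring predecessors that already have many children deepens the sharing of tree edges, which is what lets partial results be added together before they reach the destination.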
5. The network-computing integrated inter-core-particle router according to claim 1, wherein the arbiter uses round-robin arbitration, polling the data packets in each input cache and taking one as the input of the parser.
6. The network-computing integrated inter-core-particle router according to claim 1, wherein the input cache stores data packets using a plurality of fixed-length queues.
7. A data packet aggregation method based on the network-computing integrated inter-core-particle router according to any one of claims 1 to 6, the data packet aggregation method comprising the following steps:
S1, when a data packet enters the network-computing integrated inter-core-particle router, it is temporarily stored in an input cache and awaits arbitration by the arbiter;
S2, the arbiter reads the contents of the data packets in the input caches, performs round-robin arbitration on them, polling the data packets in each input cache, and decides which input cache's data packet is transmitted to the parser for parsing;
S3, after passing the arbiter, the data packet enters the parser; in the parser, the router judges whether the data packet requires in-network operation according to the packet's network computation flag bit: if the value of the flag bit equals 0, the packet does not require in-network operation, so steps S4 and S5 are skipped and step S6 is executed; if the value of the flag bit is not 0, the packet requires in-network operation and step S4 is executed;
S4, after the data packet enters the in-network operation unit, record it as data packet A; if the network computation controller finds no cached data packet in the network computation cache, data packet A is stored into the network computation cache and the work is finished; otherwise, record the data packet stored in the network computation cache of the in-network operation unit as data packet B; first, the data part of data packet A is transmitted to the logic calculation unit, and the network computation flag bits of data packet A and data packet B enter the comparator for comparison, the comparator transmitting the result to the network computation controller; if the network computation flag bits of the two data packets are not equal, the network computation controller controls the network computation cache to output data packet B and empties the network computation cache; meanwhile, the data part in the network computation cache is set to 0 and input into the logic calculation unit, and the source address, destination address and network computation flag bit of data packet A are stored into the network computation cache; if the network computation flag bits of the two data packets are equal, the network computation controller controls the network computation cache to input the data part of data packet B into the logic calculation unit, and stores the source address, destination address and network computation flag bit of data packet A into the network computation cache, overwriting those of data packet B;
S5, the logic calculation unit performs floating-point addition on the data part of data packet A and the data part of data packet B; the result is stored into the network computation cache and overwrites the original data part there; finally, the network computation controller transmits the data packet to the route calculation unit;
S6, according to the destination address in the data packet, the route calculation unit calculates the corresponding output port for the packet using a routing algorithm: for data packets transmitted by the parser, the route calculation unit adopts the XY routing algorithm; for data packets transmitted by the in-network operation unit, it adopts the reduction tree routing algorithm;
S7, the data packet enters the crossbar switch, which selects the data packet transmitted by one of the route calculation units through a multiplexer and transmits it to the corresponding output cache;
S8, the output cache transmits the data packet out of the router according to the first-come-first-serve principle.
8. The data packet aggregation method according to claim 7, wherein the network computation controller in the in-network operation unit records whether a data packet exists in the network computation cache and how long it has existed there; if the time a data packet has existed in the network computation cache exceeds a preset threshold period, the network computation controller controls the network computation cache to transmit the data packet to the route calculation unit.
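The timeout rule of claim 8 prevents a packet from waiting indefinitely in the network computation cache for a partner that never arrives. A minimal Python sketch, where the cycle-based timing and the class interface are assumptions made for illustration:

```python
class TimedNetcalcCache:
    """Sketch of claim 8: a packet parked in the network computation
    cache longer than a threshold number of cycles is flushed to the
    route calculation unit instead of being held forever."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.pkt = None
        self.age = 0

    def store(self, pkt):
        """Park a packet in the cache and reset its residence time."""
        self.pkt, self.age = pkt, 0

    def tick(self):
        """Advance one cycle; return the packet if its stay timed out."""
        if self.pkt is None:
            return None
        self.age += 1
        if self.age > self.threshold:
            flushed, self.pkt = self.pkt, None
            return flushed
        return None

c = TimedNetcalcCache(threshold=2)
c.store({"flag": 1, "data": 3.0})
assert c.tick() is None and c.tick() is None   # within threshold
assert c.tick() == {"flag": 1, "data": 3.0}    # flushed on cycle 3
```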
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211424707.6A CN115720211B (en) | 2022-11-14 | 2022-11-14 | Network-computing integrated inter-core router and data packet aggregation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115720211A true CN115720211A (en) | 2023-02-28 |
CN115720211B CN115720211B (en) | 2024-05-03 |
Family
ID=85255299
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104780122A (en) * | 2015-03-23 | 2015-07-15 | 中国人民解放军信息工程大学 | Control method for hierarchical network-on-chip router based on cache redistribution |
US20210344618A1 (en) * | 2020-05-04 | 2021-11-04 | The George Washington University | Interconnection Network With Adaptable Router Lines For Chiplet-Based Manycore Architecture |
CN114756494A (en) * | 2022-03-31 | 2022-07-15 | 中国电子科技集团公司第五十八研究所 | Conversion interface of standard communication protocol and on-chip packet transmission protocol of multi-die interconnection |
Non-Patent Citations (3)
Title |
---|
JIEMING YIN;ZHIFENG LIN;ONUR KAYIRAN;MATTHEW POREMBA;MUHAMMAD SHOAIB BIN ALTAF;NATALIE ENRIGHT JERGER;GABRIEL H. LOH: "Modular Routing Design for Chiplet-Based Systems", 2018 ACM/IEEE 45TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA), 23 July 2018 (2018-07-23) * |
LIANG CHONGSHAN; DAI YI; XU WEIXIA: "Design of an Ultra-High-Radix Router Based on Chiplet Integration Technology", Computer Engineering and Science, 29 March 2022 (2022-03-29) * |
CHEN GUILIN; WANG GUANWU; HU JIAN; WANG KANG; XU DONGZHONG: "A Survey of Chiplet Packaging Structures and Communication Structures", Journal of Computer Research and Development, 31 March 2021 (2021-03-31) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||