WO2023006006A1 - Roller arbitration method and circuit for on-chip data exchange - Google Patents

Roller arbitration method and circuit for on-chip data exchange

Info

Publication number
WO2023006006A1
Authority
WO
WIPO (PCT)
Prior art keywords
arbitration
voq
transmission pair
column
row
Prior art date
Application number
PCT/CN2022/108409
Other languages
French (fr)
Chinese (zh)
Inventor
王东辉
赵鹏
常亮
桑永奇
李甲
姚飞
Original Assignee
海飞科(南京)信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023006006A1 publication Critical patent/WO2023006006A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 - Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1642 - Handling requests for interconnection or transfer for access to memory bus based on arbitration with request queuing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F13/18 - Handling requests for interconnection or transfer for access to memory bus based on priority control
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 - Interprocessor communication
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package

Definitions

  • the invention relates to the fields of chip design, on-chip network, on-chip system, and computer architecture, in particular to a wheel scheduling method and circuit realization of an on-chip data exchange network.
  • This method can improve the efficiency and speed of on-chip data exchange, and is especially suitable for artificial intelligence and big data processing chips, especially chips with SIMT architecture.
  • Machine learning, scientific computing and graphics rendering require huge computing power, which is generally provided by large chips (such as GPU, TPU, APU, etc.) to achieve highly complex machine learning tasks and graphics processing tasks.
  • Using machine learning for recognition requires a huge deep learning network and massive image data, and the training process is very time-consuming; in a 3D application or game scene, if recursive ray tracing (Recursive Ray-Tracing) is used for rendering and the scene is complex, massive calculations must be performed and massive data must be transferred. This requires extremely high computing performance, and therefore extremely wide data exchange bandwidth to support such demands.
  • High-performance on-chip switches have become an important component of AI and GPU chips.
  • For specific scenarios such as AI and graphics computing, the arbitration method used for on-chip caching and data exchange is very important.
  • An inefficient arbitration method and arbiter design will become the bottleneck of the system and greatly degrade its performance. Therefore, the arbitration method and the arbiter circuit must achieve high performance with low complexity.
  • PIM has problems of fairness and complexity because each selection is random and requires three steps.
  • RRM and iSLIP use priority round-robin arbitration, which is simpler than random arbitration logic.
  • iSLIP improves the grant-pointer update condition, which improves fairness.
  • However, three steps are still required, so the complexity problem remains and high-speed circuits are difficult to implement.
  • DRRM uses two independent round-robin arbitration mechanisms, one at the inputs and one at the outputs, giving a shorter arbitration time than the iSLIP scheme while achieving performance equivalent to iSLIP; GA builds on DRRM by feeding the Grant information of the output ports back to the input ports, which improves arbitration efficiency but adds more complexity than DRRM. Since the complexity grows rapidly with the number of ports (N³logN), it is difficult for DRRM and GA to complete more than two arbitration iterations in high-speed circuits as the port count increases.
  • the present invention proposes a wheel scheduling method and circuit realization of an on-chip data exchange network.
  • the present invention first discloses a roller arbitration method for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports; all output ports corresponding to one input port form a row, all input ports corresponding to one output port form a column, and each input-output switching point is a transmission pair; the method includes the following steps:
  • step S4: after the polling of step S3 is completed, if certain conditions are met, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel remains stationary;
  • W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
  • the row and column where the actual transmission pair is located are cleared, and no longer participate in the next transmission demand arbitration.
  • the rolling condition is that every expected transmission pair with a request has been confirmed as an actual transmission pair.
  • the rolling keeps all expected input ports unchanged and advances all expected output ports by n, where the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically.
  • alternatively, the rolling keeps all expected output ports unchanged and advances all expected input ports by n, where the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically.
  • the invention also discloses a roller arbitration circuit for on-chip data exchange, which is based on an NxN crossover network with N input ports and N output ports, which includes:
  • -Row-column polling arbitration circuit, which performs column arbitration over the column (or row arbitration over the row) in which a switching point is located to obtain candidate actual transmission pairs, then performs row arbitration (or column arbitration) over the candidate actual transmission pairs and selects the switching point with the highest priority as the actual transmission pair.
  • it also includes:
  • each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the expected transmission pair determines the actual transmission pair.
  • each switching point of the NxN crossover network is provided with a comparator, and the value of the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is an expected point.
  • each row and each column of the NxN crossover network is provided with an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel.
  • the present application proposes a rolling scheduling method and circuit implementation of an on-chip data switching network, and realizes the fairness of data transmission opportunities of each switching point based on rolling scheduling.
  • based on the first column/row polling, the arbitration is guaranteed to traverse every switching point; based on the second row/column polling, each input and output is guaranteed to be matched at most once, avoiding conflicts.
  • after a transmission pair is determined, the row and column to which the switching point belongs are cleared, which both ensures that each input and output is matched at most once and reduces the number of arbitration conflicts and arbitration passes, improving transmission efficiency.
  • the wheel priority scheduling algorithm avoids arbitrating the expected transmission pairs on the wheel; it only needs to check whether each expected transmission pair on the wheel has a transmission request.
  • the judgment logic is simple, the arbitration time is shorter, and it is easy to implement high-speed circuits, especially in high-speed circuits with strict timing requirements. Two iterations can be easily implemented on our chip.
  • Figure 1 is a schematic diagram of NxN cross network routing structure
  • Figure 2 is a schematic diagram of a wheel representing priority
  • Figure 3 is a schematic diagram of selecting and clearing rows and columns
  • Figure 4 is a schematic diagram of the movement of the roller
  • Figure 5a is a schematic diagram of a VOQ circuit implemented in a FIFO manner
  • Figure 5b is a schematic diagram of implementing a VOQ circuit by managing linked list pointers
  • Figure 6a is a schematic diagram of the wheel point selection circuit
  • Figure 6b is a schematic diagram of incremental updating of register rows (columns) in the scroll point selection circuit
  • Figure 7 is a schematic diagram of the desired point matching circuit
  • Figure 8 is a schematic diagram of the wheel pattern
  • Figure 9 is an example diagram of the row and column polling arbitration circuit
  • Figure 10 is an example diagram of three steps constituting an iteration of the PIM scheduling algorithm
  • Figure 11 is an example diagram of three steps constituting one iteration of the RRM scheduling algorithm
  • Figure 12 is an example diagram of the iSLIP algorithm
  • Figure 13 is an example diagram of the DRRM algorithm
  • Figure 14 is an example diagram of the GA algorithm
  • Figure 1 shows the routing structure of the NxN crossover network: each intersection of I0, I1, ..., IN-1 and O0, O1, ..., ON-1 is a routing path, also called a transmission pair .
  • Each intersection is a VOQij request path, the first number of the VOQ subscript indicates the input port number, and the second number indicates the output port number.
  • Each input port Ii can only have one routing node selected in one cycle, and each output port can only have one routing node selected in one cycle. There are at most N paths selected in one cycle.
  • to ensure that each input port receives fair arbitration, that each input port obtains as equal a share of data transmission as possible, and that each virtual output queue VOQ of each input port obtains as equal a share of data transmission as possible, the invention discloses an on-chip wheel arbitration method for data exchange that includes the following steps:
  • the scroll wheel represents the expected output port for each input port, and there are N expected output ports for N input ports.
  • the N expected ports are all different, that is, they form a permutation of {0,1,2,...,N-1}; each permutation is called a pattern, and there are N! patterns in total.
  • at initialization, {0,1,2,...,N-1} can be used directly, that is to say, input port PI_0 expects output port PO_0, input port PI_1 expects output port PO_1, ..., input port PI_(N-1) expects output port PO_(N-1).
  • the row and column where the actual transmission pair is located are cleared, and no longer participate in the transmission demand arbitration.
  • a node in the pattern is on VOQ10 (small hexagonal solid node), and if this VOQ10 has a request, then the route of this node will be selected.
  • the above content can be realized by the following code:
  • the S3 step is realized by the following code:
  • step S4 after the polling of step S3, some conditions can be set here.
  • the priority arbitration arrangement W VOQ rolls to obtain a new arbitration arrangement W' VOQ ; this condition can be: ensure that all requested expected transmission pairs have been determined is the actual transfer pair, that is, each requested expected transfer pair has had at least one transfer. This condition can guarantee non-starvation and fairness, and other conditions also need to consider these two characteristics.
  • the scroll wheel needs to do circular scrolling between a set of patterns.
  • the hollow small hexagonal nodes (VOQ00, VOQ11, ..., VOQN-1N-1) on the diagonal are the pattern before the scroll wheel
  • the black solid small hexagonal nodes (VOQ10, VOQ21 ,..., VOQN-1N-2, VOQ0N-1) is the pattern after one scroll cycle of the wheel.
  • the scrolling between other patterns is similar.
  • the value of n needs to ensure that all transmission pairs are covered as the wheel rolls cyclically (for an N*N cross network routing structure with N odd, for example, a value of 2 for n also satisfies the complete coverage requirement).
  • step S4 is realized by the following codes:
  • the present invention also proposes a roller arbitration circuit for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports, the realization of the VOQ circuit is divided into two modes: one is the FIFO mode as shown in Figure 5a, according to Input the destination address information of I_s to route the VOQ routing request to the FIFO of the corresponding output port respectively.
  • This implementation method is simple, but consumes hardware resources; another method is to realize VOQ through the linked list management as shown in Figure 5b: The input is stored in a set of linked lists, and the VOQ information is managed by managing the linked list pointers.
  • Each input port contains a pointer register queue with a queue length of M, a head pointer and a tail pointer with a queue length of N, and a valid request with a width of N.
  • M is the maximum number of requests that can be received and stored
  • N is the number of request destination ports.
  • the wheel arbitration circuit also includes:
  • the row registers are denoted as R[0], R[1], . . . , R[N-1].
  • each node of pattern has a routing node number.
  • each pattern completes the node update of the corresponding column (row) in the manner of incrementing the row (column), and the column (row) is pre-selected and stored in the register.
  • the expected point matching circuit is: each switching point of the NxN crossover network is provided with a comparator, and the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is the desired point.
  • each node on the Pattern will output a row and column routing number.
  • this number is equal to the routing number of the VOQ, that is, when there is a corresponding VOQ request on the node of the pattern, the expected point matches. Traverse and judge all the nodes of the current pattern to find the expected matching point on the wheel.
  • -Row-column polling arbitration circuit which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority
  • the high switching point acts as the actual transmission pair.
  • Fig. 9 is a relatively complete structural diagram of the wheel arbitration circuit. After matching the expected points on the wheel, do two arbitrations in the row and column directions as shown in the figure to complete the row-column polling arbitration (the result of column looping cannot guarantee that the rows of each column arbitration are different. Therefore, the row round-robin circuit is required to arbitrate the result of the column round-robin again, so that the final result has at most one selected point in each row and column, and there will be no conflict of data transmission), and complete all operations in one cycle Access routing arbitration.
  • each row and each column has an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel. In the figure, it can be realized by a shift register.
  • it also includes:
  • each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the input port determines the actual transmission pair.
  • the implementation scheme is: locally exclude the marks of these VOQs at the input port.
  • the early input ports used FIFO queuing to wait for arbitration allocation.
  • the maximum input throughput of this FIFO-based arbitration scheduling method is only 58.6%.
  • the VOQ method was later proposed to raise the throughput to 100%, providing the conditions for finding a maximum match in every cycle.
  • PIM refers to the Parallel Iterative Matching algorithm, which is divided into three steps, as shown in Figure 10, an example of the three steps that constitute one iteration of the PIM scheduling algorithm.
  • Step 1 Request, each input makes a request for each output it has cells;
  • Step 2 Grant, each output consistently chooses an input at random from among the inputs that requested it.
  • inputs 1 and 3 both request output 2.
  • output 2 chooses to grant input 3;
  • Step 3 Accept, each input randomly selects an output among the granted outputs.
  • outputs 2 and 4 are both granted to input 3.
  • Input 3 selects to accept output 2.
  • the first iteration does not match input 4 and output 4, even though it does not conflict with other connections. This connection will be established in the second iteration.
  • each iteration process can select the unmatched part of the previous iteration, so the number of iterations to complete the maximum matching needs logN;
  • RRM refers to the Basic Round-Robin Matching algorithm.
  • the RRM algorithm is also divided into three steps: as shown in Figure 11, this is an example of the three steps that constitute one iteration of the RRM scheduling algorithm.
  • RRM potentially overcomes two problems in PIM: complexity and unfairness.
  • a round-robin arbitrator implemented as a priority encoder is much simpler and faster to execute than a random arbitrator. Round priority helps the algorithm distribute bandwidth fairly and more equitably among requesting connections.
  • the three steps of arbitration are:
  • Step 1 Request, each input sends a request at each output for which it has a queued VOQ.
  • Step 2 Grant: if the output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element.
  • the output notifies each input whether or not its request was granted.
  • the pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the granted input.
  • Step 3 Accept: if an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the accepted output.
  • the iSLIP algorithm uses a rotational priority ("Round”) arbitration to schedule each input and output in turn.
  • the main feature is simplicity; it is easy to implement on hardware and can run at high speed. It is found that the performance under uniform traffic conditions is high; for uniform IID Bernoulli arrival, a single iteration of iSLIP can achieve 100% throughput.
  • RRM simple basic round-robin matching algorithm
  • Figure 12 shows an example of iSLIP.
  • Compared with RRM, iSLIP mainly makes the following change: the grant pointer is not updated unless the grant is accepted. The Grant step of iSLIP is therefore changed to:
  • Step 2 Grant, if the output receives any requests, it will select the next request that comes up in a fixed round-robin schedule, starting with the highest priority element.
  • the output notifies each input whether or not its request was granted. The pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the granted input only if the grant is accepted in step 3. This small change to the algorithm results in the following:
  • Feature 1: The most recently established connection has the lowest priority. This is because when the arbiter moves its pointer, the most recently granted (accepted) input (output) becomes the lowest priority for that output (input), so that connection is the last to be favored again in the next cell time.
  • Feature 2: No connection is starved. This is because an input keeps requesting an output until it succeeds, and the output waits at most N cell times to be accepted by each input; thus a requesting input is always served within a bounded number of cell times.
  • Property 3 Under high load, all queues with a common output have the same throughput. This is a consequence of feature 2, the output pointer is moved to each request input in a fixed order, thus providing the same throughput for each request.
  • DRRM is Dual Round-Robin Matching. It is called DRRM because two independent round-robin arbitration mechanisms perform arbitration at the inputs and outputs.
  • there is a request arbiter and N VOQs at each input port; the request arbiter selects at most one non-empty queue according to its pointer value, which represents the output port with the highest priority.
  • there is a grant arbiter at each output port, which arbitrates among up to N requests to choose one input port and then sends the result to that input port. If the grant arbiter receives a request, it updates its pointer value, and the pointer value of the corresponding request arbiter is also updated.
  • Figure 13 shows an example of a DRRM algorithm
  • the DRRM algorithm consists of two stages:
  • Step 1 Request: each input's request arbiter selects one request according to its round-robin schedule
  • Step 2 Grant, each output arbitrates an input among all the requests of this port.
  • the DRRM scheme has a shorter arbitration time than the iSLIP scheme, and at the same time achieves performance equivalent to that of iSLIP.
  • the biggest advantage of the wheel arbitration algorithm proposed in this patent is that wheel priority arbitration is adopted first, and then the DRRM arbitration algorithm is adopted.
  • the two arbitrations are completed in one cycle.
  • the first arbitration follows the wheel priority principle, and the arbitration can be completed by judging whether there is a request on the wheel node.
  • the hardware circuit is very easy to implement, which saves a lot of time for the second arbitration.
  • the second arbitration is based on filtering out the nodes of the first arbitration, and according to the DRRM algorithm, the second arbitration is performed.
  • the advantage of the wheel priority algorithm is that it is simpler, more direct, and easier to implement.
  • Each iteration of these algorithms requires arbitration of the input and output ports.
  • in those algorithms the priority is handled inside the arbitration logic, whereas in the wheel priority algorithm the arbitration priority of the input and output ports is given by a pattern structure that is independent of the arbitration logic decision, leaving sufficient time for the arbitration decision, so more iterations can be realized in one clock cycle.
  • compared with the 63% efficiency achieved by one iteration per clock cycle, it is 23% higher and can reach an efficiency of 86%. It has been implemented on the AI GPU chip.
  • the arbitration time in actual application will be much shorter than that of GA, and more iterations can be achieved to improve the arbitration efficiency.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A roller arbitration method for on-chip data exchange. On the basis of an N×N cross network of N input ports and N output ports, the method comprises the following steps: S1, determining a priority arbitration arrangement; S2, determining whether each expected transmission pair in the arrangement has a transmission requirement; if yes, determining that the expected transmission pair is an actual transmission pair, and immediately performing data transmission on the determined actual transmission pair; if not, considering that the expected transmission pair is a non-transmission pair; S3, performing sequential row/column or column/row arbitration, and selecting switching points having a high priority as an actual transmission pair; S4, after the polling is completed in step S3, scrolling the priority arbitration arrangement to obtain a new arbitration arrangement; and S5, cyclically performing S2-S4.

Description

片上数据交换的滚轮仲裁方法及电路Roller arbitration method and circuit for on-chip data exchange 技术领域technical field
本发明涉及芯片设计、片上网络、片上系统、和计算机体系结构领域,尤其是一种片上数据交换网络的滚轮调度方法和电路实现。此种方法可以提高片上数据交换效率和速度,特别适合人工智能和大数据处理芯片,尤其是SIMT架构的芯片。The invention relates to the fields of chip design, on-chip network, on-chip system, and computer architecture, in particular to a wheel scheduling method and circuit realization of an on-chip data exchange network. This method can improve the efficiency and speed of on-chip data exchange, and is especially suitable for artificial intelligence and big data processing chips, especially chips with SIMT architecture.
背景技术Background technique
机器学习、科学计算和图形渲染需要巨大的计算能力,一般由大型芯片(如GPU、TPU、APU等)提供这样的算力,来实现高度复杂的机器学习任务和图形处理任务。用机器学习来做识别需要巨大的深度(Deep Learning)网络和海量的图像数据,训练过程非常耗时;一个三维应用或游戏场景中,若采用递归光追踪(Recursive Ray-Tracing)渲染,且场景复杂,则需要做海量运算,也需要传递海量数据。这就要求极高的计算性能,也因此需要极宽的数据交换带宽来支持这样的需求。高性能的片上交换器就成了AI和GPU芯片的重要组成部件。Machine learning, scientific computing and graphics rendering require huge computing power, which is generally provided by large chips (such as GPU, TPU, APU, etc.) to achieve highly complex machine learning tasks and graphics processing tasks. Using machine learning to do recognition requires a huge deep learning network and massive image data, and the training process is very time-consuming; in a 3D application or game scene, if recursive ray-tracing (Recursive Ray-Tracing) is used for rendering, and the scene If it is complex, it needs to do massive calculations and transfer massive data. This requires extremely high computing performance, and therefore requires extremely wide data exchange bandwidth to support such demands. High-performance on-chip switches have become an important component of AI and GPU chips.
对于AI和图形计算这类特定场景,片上缓存和数据交换的仲裁方法非常重要。效率低的仲裁(arbitration)方法和仲裁器(arbiter)设计会成为系统的瓶颈,极大影响系统的性能。因此,仲裁方法和仲裁器电路必须实现高性能和低复杂度。For specific scenarios such as AI and graphics computing, the arbitration method of on-chip caching and data exchange is very important. Low-efficiency arbitration (arbitration) method and arbiter (arbiter) design will become the bottleneck of the system, greatly affecting the performance of the system. Therefore, the arbitration method and the arbiter circuit must achieve high performance and low complexity.
网络交换器中的仲裁方法研究有悠久的历史,特别是在互联网快速发展阶段的研究成果众多。Jonathan Chao的著作《High Performance Switches and Routers》(Wiley-IEEE Press,2007)和George Varghese的著作《Network Algorithmics,:An Interdisciplinary Approach to Designing Fast Networked Devices》(Morgan Kaufmann,2004)对这些做了综述和各种方法的叙述。虚拟输出排队(Virtual Output Queue-VOQ)交换器是典型的数据交换方式,这方面的研究成果颇丰。The research on arbitration methods in network switches has a long history, especially in the rapid development stage of the Internet, there are many research results. Jonathan Chao's book "High Performance Switches and Routers" (Wiley-IEEE Press, 2007) and George Varghese's book "Network Algorithmics,: An Interdisciplinary Approach to Designing Fast Networked Devices" (Morgan Kaufmann, 2004) made a review of these and Description of various methods. Virtual output queue (Virtual Output Queue-VOQ) switch is a typical data exchange method, and the research results in this area are quite abundant.
这方面的重要成果包括PIM、RRM、iSLIP、DRRM以及GA这几类方法。Important results in this area include methods such as PIM, RRM, iSLIP, DRRM, and GA.
其中,PIM由于每次选择是随机的,且需要三个步骤,存在公平性和复杂性的问题。Among them, PIM has problems of fairness and complexity because each selection is random and requires three steps.
而RRM和iSLIP采用优先级轮询仲裁比随机仲裁逻辑更简单，iSLIP又对grant指针跳转条件做了改进，公平性有了改善，但还是需要三个步骤，使得同样存在复杂性问题，很难实现高速电路。RRM and iSLIP use priority round-robin arbitration, which is simpler than random arbitration logic; iSLIP further improves the grant-pointer update condition, improving fairness. However, three steps are still required, so the complexity problem remains and high-speed circuits are difficult to implement.
DRRM在输入和输出有两个独立的轮询仲裁机制执仲裁，比iSLIP方案的仲裁时间更短，同时实现了和iSLIP相当的性能；GA则在DRRM的基础上将输出端口的Grant信息给输入端口，虽然提高了仲裁效率但复杂性比DRRM增加了更多。由于复杂度随着端口增加而指数增长（N³logN），DRRM和GA随着端口增加很难在高速电路上实现两次以上的迭代仲裁。 DRRM has two independent round-robin arbitration mechanisms, one at the inputs and one at the outputs, giving a shorter arbitration time than the iSLIP scheme while achieving performance equivalent to iSLIP; GA builds on DRRM by feeding the Grant information of the output ports back to the input ports, which improves arbitration efficiency but adds more complexity than DRRM. Since the complexity grows rapidly with the number of ports (N³logN), it is difficult for DRRM and GA to complete more than two arbitration iterations in high-speed circuits as the port count increases.
发明内容Contents of the invention
本发明针对背景技术中存在的问题,提出了一种片上数据交换网络的滚轮调度方法和电路实现。Aiming at the problems existing in the background technology, the present invention proposes a wheel scheduling method and circuit realization of an on-chip data exchange network.
本发明首先公开了一种片上数据交换的滚轮仲裁方法,基于N输入端口N输出端口的的NxN交叉网络,1个输入端口对应的所有输出端口为一行,1个输出端口对应的所有输入端口为一列,每个输入输出交换点为一个传输对;它包括以下步骤:The present invention firstly discloses a roller arbitration method for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports, all output ports corresponding to one input port are one row, and all input ports corresponding to one output port are One column, each input and output switching point is a transmission pair; it includes the following steps:
S1、确定优先仲裁排列W_VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]}，a、b、c…x∈[0,N-1]且互不相同；优先仲裁排列W_VOQ中的N个元素表示N个预期传输对，其中：VOQ[0,a]表示预期输入端口为PI_0，预期输出端口为PO_a的预期传输对；非优先仲裁排列W_VOQ中的传输对为非预期传输对； S1. Determine the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0,N-1] and are mutually distinct; the N elements in the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are non-expected transmission pairs;
S2、判断排列中各预期传输对是否有传输需求,是则确定为实际传输对,确定的实际传输对即可进行数据传输;S2. Judging whether each expected transmission pair in the arrangement has a transmission demand, if yes, it is determined as an actual transmission pair, and the determined actual transmission pair can perform data transmission;
S3、对于非预期传输对的交换点,首先进行每个输出端口的列仲裁或所处输入端口的列仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的非预期传输对作为实际传输对;S3. For the switching point of an unexpected transmission pair, first perform the column arbitration of each output port or the column arbitration of the input port where it is located to obtain the possible actual transmission pair; then perform row arbitration or column arbitration for the possible actual transmission pair, Select the unexpected transmission pair with high priority as the actual transmission pair;
S4、步骤S3轮询完毕后,满足一定条件时,优先仲裁排列W VOQ滚动获得新的仲裁排列W' VOQ;否则滚轮保持不动; S4, after the polling of step S3 is completed, when certain conditions are met, the priority arbitration arrangement W VOQ scrolls to obtain a new arbitration arrangement W'VOQ ; otherwise, the scroll wheel remains motionless;
S5、循环进行S2-S4。S5. Perform S2-S4 in a loop.
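For illustration only (this worked example is not part of the original text), take N = 4 with the identity arrangement and a roll step of n = 1; the helper function below is a hypothetical sketch of the resulting expected-output mapping.

```cpp
// N = 4 illustration of the priority arbitration arrangement and one roll with n = 1:
//   W_VOQ  = {VOQ[0,0], VOQ[1,1], VOQ[2,2], VOQ[3,3]}   (expected pairs before rolling)
//   W'_VOQ = {VOQ[0,1], VOQ[1,2], VOQ[2,3], VOQ[3,0]}   (expected pairs after one roll)
int expected_output(int input_port, int rolls_done, int n_ports) {
    return (input_port + rolls_done) % n_ports;  // inputs keep their row, outputs shift by +1 per roll
}
```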
优选的，S1中，确定优先仲裁排列W_VOQ={VOQ[0,0],VOQ[1,1],VOQ[2,2],…,VOQ[N-1,N-1]}。 Preferably, in S1, the priority arbitration arrangement is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
优选的,S2和S3中,对于burst传输应用,在确定为实际传输对前,需要确认该传输对上没有未完成的传输。Preferably, in S2 and S3, for the burst transmission application, before determining the actual transmission pair, it needs to confirm that there is no unfinished transmission on the transmission pair.
优选的,S2和S3中,确定为实际传输对后,对实际传输对所处的行和列进行清除,不再参与下次传输需求仲裁。Preferably, in S2 and S3, after the actual transmission pair is determined, the row and column where the actual transmission pair is located are cleared, and no longer participate in the next transmission demand arbitration.
优选的,S4中,所述滚动条件为每个有请求的预期传输对都被确认过实际传输对。Preferably, in S4, the rolling condition is that each requested expected transmission pair has been confirmed by an actual transmission pair.
作为一种滚轮滚动方式,S4中,所述滚动为所有的预期输入端口不变,所有的预期输出端口+n,n的取值需保证滚轮循环滚动时,覆盖所有的传输对。As a scrolling mode of the wheel, in S4, the scrolling is that all expected input ports remain unchanged, and all expected output ports+n, and the value of n needs to ensure that all transmission pairs are covered when the wheel scrolls cyclically.
作为另一种滚轮滚动方式,S4中,所述滚动为所有的预期输出端口不变,所有的预期 输入端口+n,n的取值需保证滚轮循环滚动时,覆盖所有的传输对。As another wheel scrolling method, in S4, the scrolling is that all expected output ports remain unchanged, all expected input ports+n, and the value of n needs to ensure that all transmission pairs are covered when the wheel is cyclically scrolled.
本发明还公开了一种片上数据交换的滚轮仲裁电路,基于N输入端口N输出端口的的NxN交叉网络,它包括:The invention also discloses a roller arbitration circuit for on-chip data exchange, which is based on an NxN crossover network with N input ports and N output ports, which includes:
-滚动点选择电路,用于确定NxN交叉网络中的优先仲裁排列W VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]},以及优先仲裁排列W VOQ的滚动更新; -Rolling point selection circuit, used to determine the priority arbitration arrangement in NxN crossover network W VOQ ={VOQ[0,a], VOQ[1,b], VOQ[2,c],...,VOQ[N-1, x]}, and the rolling update of priority arbitration ranking W VOQ ;
-期望点匹配电路,用于判断排列中各预期传输对是否有传输需求,将有传输需求的标记为实际传输对,无传输需求的标记为非传输对;- Expected point matching circuit, used to judge whether each expected transmission pair in the arrangement has a transmission demand, mark the one with the transmission demand as the actual transmission pair, and mark the one without the transmission demand as the non-transmission pair;
-行列轮询仲裁电路,用于进行交换点所处列的列仲裁或所处行的行仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的交换点作为实际传输对。-Row-column polling arbitration circuit, which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority The high switching point acts as the actual transmission pair.
优选的,它还包括:Preferably, it also includes:
-行列清除电路,NxN交叉网络的每一个交换点设置一个清除逻辑,用于预期传输对确定实际传输对后,禁止匹配点所处的行和列再参与传输需求仲裁。- Row and column clearing circuit, each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the expected transmission pair determines the actual transmission pair.
具体的,滚动点选择电路:NxN交叉网络的每一行、每一列各设置1个k位的寄存器,k=ceiling(N),行/列的寄存器顺序移动寄存值实现滚动。Specifically, the scrolling point selection circuit: each row and each column of the NxN crossover network is provided with a k-bit register, k=ceiling(N), and the register values of the row/column are sequentially moved to realize scrolling.
具体的,期望点匹配电路:NxN交叉网络的每一个交换点设置比较器,通过交换点所属的行寄存器与列寄存器的值比较,判断交换点是否为期望点。Specifically, the expected point matching circuit: each switching point of the NxN crossover network is provided with a comparator, and the value of the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is an expected point.
具体的,行列轮询仲裁电路:NxN交叉网络的每一行、每一列都设置一个arbiter,每个arbiter有一个优先级指针通过滚轮的跳转而滚动。Specifically, the row-column polling arbitration circuit: each row and each column of the NxN crossover network is provided with an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel.
本发明的有益效果Beneficial effects of the present invention
本申请提出了一种片上数据交换网络的滚轮调度方法和电路实现,基于滚动调度实现了各交换点数据传输机会的公平。基于初次列/行轮询保障了各交换点仲裁的遍历;基于再次行/列轮询保障了各输入-输出的唯一,避免冲突。基于传输对确定后,交换点所属行列的清除,既保障了各输入-输出的唯一;又减少了仲裁冲突和次数,提高了传输效率。基于滚轮优先级调度算法,避免了对滚轮上的预传输对做仲裁,只需判断滚轮上的预期传输对是否存在传输请求。判断逻辑简单,仲裁时间更短,易于实现高速电路,尤其在对时序要求比较严苛的高速电路中优势尤为明显,在我们芯片上轻松实现两次迭代。The present application proposes a rolling scheduling method and circuit implementation of an on-chip data switching network, and realizes the fairness of data transmission opportunities of each switching point based on rolling scheduling. Based on the initial column/row polling, the traversal of the arbitration of each switching point is guaranteed; based on the second row/column polling, the uniqueness of each input-output is guaranteed to avoid conflicts. After the transmission pair is determined, the ranks and columns to which the switching point belongs are cleared, which not only ensures the uniqueness of each input-output, but also reduces arbitration conflicts and times, and improves transmission efficiency. Based on the wheel priority scheduling algorithm, it avoids the arbitration of the pre-transmission pairs on the wheel, and only needs to judge whether there is a transmission request for the expected transmission pair on the wheel. The judgment logic is simple, the arbitration time is shorter, and it is easy to implement high-speed circuits, especially in high-speed circuits with strict timing requirements. Two iterations can be easily implemented on our chip.
附图说明Description of drawings
图1为NxN的交叉网络路由结构示意图Figure 1 is a schematic diagram of NxN cross network routing structure
图2为代表优先权的滚轮示意图Figure 2 is a schematic diagram of a wheel representing priority
图3为行列选择清除示意图Figure 3 is a schematic diagram of selecting and clearing rows and columns
图4为滚轮的移动示意图Figure 4 is a schematic diagram of the movement of the roller
图5a为以FIFO方式实现VOQ电路的示意图Figure 5a is a schematic diagram of a VOQ circuit implemented in a FIFO manner
图5b为以管理链表指针方式实现VOQ电路的示意图Figure 5b is a schematic diagram of implementing a VOQ circuit by managing linked list pointers
图6a为滚轮点选择电路示意图Figure 6a is a schematic diagram of the wheel point selection circuit
图6b为滚动点选择电路中寄存器行(列)递增更新示意图Figure 6b is a schematic diagram of incremental updating of register rows (columns) in the scroll point selection circuit
图7为期望点匹配电路示意图Figure 7 is a schematic diagram of the desired point matching circuit
图8为滚轮pattern示意图Figure 8 is a schematic diagram of the wheel pattern
图9为行列轮询仲裁电路示例图Figure 9 is an example diagram of the row and column polling arbitration circuit
图10为构成PIM调度算法的一次迭代的三个步骤的示例图Figure 10 is an example diagram of three steps constituting an iteration of the PIM scheduling algorithm
图11为构成RRM调度算法的一次迭代的三个步骤的示例图Figure 11 is an example diagram of three steps constituting one iteration of the RRM scheduling algorithm
图12为iSLIP算法的示例图Figure 12 is an example diagram of the iSLIP algorithm
图13为DRRM算法的示例图Figure 13 is an example diagram of the DRRM algorithm
图14为GA算法的示例图Figure 14 is an example diagram of the GA algorithm
具体实施方式Detailed ways
下面结合实施例对本发明作进一步说明,但本发明的保护范围不限于此:The present invention will be further described below in conjunction with embodiment, but protection scope of the present invention is not limited to this:
图1给出了NxN的交叉网络路由结构:I0,I1,……,IN-1与O0,O1,……,ON-1的每个交叉点都是一条路由路径,也称为一个传输对。每个交叉点为一个VOQij请求路径,VOQ的下标第一个数字表示输入端口编号,第二个数字表示输出端口编号。每个输入端口Ii在一个周期只能有一个路由节点被选中,同样每个输出端口在一个周期也只能有一个路由节点被选中。一个周期最多有N条路径被选中。Figure 1 shows the routing structure of the NxN crossover network: each intersection of I0, I1, ..., IN-1 and O0, O1, ..., ON-1 is a routing path, also called a transmission pair . Each intersection is a VOQij request path, the first number of the VOQ subscript indicates the input port number, and the second number indicates the output port number. Each input port Ii can only have one routing node selected in one cycle, and each output port can only have one routing node selected in one cycle. There are at most N paths selected in one cycle.
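As a concrete illustration (not taken from the patent), the request state of such an NxN crossbar can be modeled in software as a matrix of per-VOQ records. The field names below (req, cnt, stat, row) are assumptions chosen to mirror the identifiers that appear later in this description; the same types are reused by the sketches that follow.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;  // number of input/output ports (example value)

// One record per crossing point: VOQ[i][j] is the transmission pair (input i, output j).
struct VoqCell {
    bool req  = false;  // a request is queued for this transmission pair
    int  cnt  = 0;      // remaining beats of a burst on this pair (0 = no unfinished transfer)
    bool stat = false;  // set when the pair wins the wheel-priority step in this cycle
    bool row  = false;  // set when the pair wins the row/column round-robin step
};

using VoqMatrix = std::array<std::array<VoqCell, N>, N>;
```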
为确保每个输入端口实现公平仲裁,每个输入端口获得尽可能均等的数据传输量,每个输入端口的每个虚拟输出排队VOQ获得尽可能均等的数据传输量,本发明公开了一种片上数据交换的滚轮仲裁方法,包括以下步骤:In order to ensure that each input port realizes fair arbitration, each input port obtains the data transmission volume as equal as possible, and each virtual output queue VOQ of each input port obtains the data transmission volume as equal as possible, the invention discloses an on-chip The wheel arbitration method for data exchange includes the following steps:
S1、确定优先仲裁排列W_VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]}，a、b、c…x∈[0,N-1]且互不相同；优先仲裁排列W_VOQ中的N个元素表示N个预期传输对，其中：VOQ[0,a]表示预期输入端口为PI_0，预期输出端口为PO_a的预期传输对；非优先仲裁排列W_VOQ中的传输对为非预期传输对； S1. Determine the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0,N-1] and are mutually distinct; the N elements in the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are non-expected transmission pairs;
结合图8，在某一时刻，滚轮代表了对于每个输入端口的预期输出端口，N个输入端口就有N个预期的输出端口。这N个预期的端口都是不同的，也就是{0,1,2,…,N-1}的某个排列(permutation)，一种排列成为一个pattern，一共有N!个pattern。初始化时，可以直接使用{0,1,2,…,N-1}，也就是说输入端口PI_0预期输出端口PO_0，输入端口PI_1预期输出端口PO_1，…，输入端口PI_(N-1)预期输出端口PO_(N-1)。我们将这一排列记作W={(0,0),(1,1),(2,2),…,(N,N)}，如图2中小六边形空心节点所示(本实施例中，a=1,b=2,c=3,……,x=N-1)。Referring to FIG. 8, at a certain moment the wheel represents the expected output port for each input port, so the N input ports have N expected output ports. These N expected ports are all different, that is, they form a permutation of {0,1,2,...,N-1}; each permutation is called a pattern, and there are N! patterns in total. At initialization, {0,1,2,...,N-1} can be used directly, that is to say, input port PI_0 expects output port PO_0, input port PI_1 expects output port PO_1, ..., input port PI_(N-1) expects output port PO_(N-1). We record this arrangement as W={(0,0),(1,1),(2,2),...,(N,N)}, as shown by the small hollow hexagonal nodes in Figure 2 (in this embodiment, a=1, b=2, c=3, ..., x=N-1).
S2、判断排列中各预期传输对是否有传输需求,是则确定为实际传输对,确定的实际传输对即可进行数据传输;S2. Judging whether each expected transmission pair in the arrangement has a transmission demand, if yes, it is determined as an actual transmission pair, and the determined actual transmission pair can perform data transmission;
假定在t时刻W(t)={(0,(t+0)%N),(1,(t+1)%N),(2,(t+2)%N),…,(N-1,(t+N-1)%N)}，这个对应具有最高优先权的队列。也就是说，如果输入端口i有给输出端口j_i的请求，则这个请求必须被准许。根据控制根据不同的输入输出之间对应关系，可以选择不同的pattern，并在这些pattern之间滚动循环以达到公平性。Suppose that at time t, W(t)={(0,(t+0)%N),(1,(t+1)%N),(2,(t+2)%N),...,(N-1,(t+N-1)%N)}; this corresponds to the queue with the highest priority. That is, if input port i has a request for output port j_i, then this request must be granted. Depending on the desired correspondence between inputs and outputs, different patterns can be selected, and the wheel rolls cyclically among these patterns to achieve fairness.
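A minimal sketch (an assumption, not the patent's code) of how the highest-priority pattern W(t) could be computed directly from the formula above, reusing N from the request-matrix sketch:

```cpp
// Expected output port of each input port i at wheel position t:
// W(t) = {(i, (t + i) % N) | i = 0 .. N-1}, as in the formula quoted above.
std::array<std::size_t, N> pattern_at(std::size_t t) {
    std::array<std::size_t, N> expected{};
    for (std::size_t i = 0; i < N; ++i)
        expected[i] = (t + i) % N;
    return expected;
}
```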
优选的实施例中,确定实际传输对后,对实际传输对所处的行和列进行清除,不再参与传输需求仲裁。结合图3所示,pattern种一个节点在VOQ10(小六边形实心节点)上,如果这个VOQ10有请求,那么这个节点路由会被选中。同时,节点所在的行的其他节点,也就是输入端口I1的其它所有输出节点(小六边形空心节点)的请求VOQ11,VOQ12,……,VOQ1N-1,和节点所在的列,也就是输出O0的所有输入节点(黄色小六边形节点)的请求VOQ00,VOQ20,……,VOQN-10,都会被清除,而不会参与S3的行列仲裁,这样可以提高第二次行列仲裁的效率。In a preferred embodiment, after the actual transmission pair is determined, the row and column where the actual transmission pair is located are cleared, and no longer participate in the transmission demand arbitration. As shown in Figure 3, a node in the pattern is on VOQ10 (small hexagonal solid node), and if this VOQ10 has a request, then the route of this node will be selected. At the same time, other nodes in the row where the node is located, that is, all other output nodes (small hexagonal hollow nodes) of the input port I1 request VOQ11, VOQ12, ..., VOQ1N-1, and the column where the node is located, that is, the output The requests VOQ00, VOQ20, ..., VOQN-10 of all input nodes (yellow hexagonal nodes) of O0 will be cleared, and will not participate in the rank arbitration of S3, which can improve the efficiency of the second rank arbitration.
优选的实施例中,以上内容可通过以下代码实现:In a preferred embodiment, the above content can be realized by the following code:
[Code listing published as image PCTCN2022108409-appb-000001 in the original document.]
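The listing itself is only available as an image, so the following is a hedged reconstruction of what steps S1-S2 describe: walk the wheel pattern, grant every expected pair that has a pending request (and, for burst traffic, no unfinished transfer), and clear its row and column from further arbitration. It reuses the VoqMatrix sketch above; the function and flag names are illustrative, not the patent's own.

```cpp
// Steps S1-S2 sketch: grant expected pairs on the wheel and clear their rows/columns.
void grant_expected_pairs(VoqMatrix& voq,
                          const std::array<std::size_t, N>& expected,
                          std::array<bool, N>& row_free,
                          std::array<bool, N>& col_free) {
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t j = expected[i];
        if (voq[i][j].req && voq[i][j].cnt == 0) {  // request present, no unfinished burst
            voq[i][j].stat = true;  // the expected pair becomes an actual transmission pair
            row_free[i] = false;    // input port i no longer takes part in this cycle
            col_free[j] = false;    // output port j no longer takes part in this cycle
        }
    }
}
```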
S3、对于非预期传输对的交换点,首先进行每个输出端口的列仲裁或所处输入端口的列仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的非预期传输对作为实际传输对;S3. For the switching point of an unexpected transmission pair, first perform the column arbitration of each output port or the column arbitration of the input port where it is located to obtain the possible actual transmission pair; then perform row arbitration or column arbitration for the possible actual transmission pair, Select the unexpected transmission pair with high priority as the actual transmission pair;
优选的实施例中,S3步骤通过以下代码实现:In a preferred embodiment, the S3 step is realized by the following code:
[Code listing published as image PCTCN2022108409-appb-000002 in the original document.]
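Again, the published listing is an image; the sketch below is one plausible reading of step S3: a column round-robin first picks, for each still-free output, one candidate among the still-free requesting inputs, then a row round-robin keeps at most one winner per input. Pointer handling is simplified and all names are illustrative.

```cpp
// Step S3 sketch (two-stage round-robin), continuing the VoqMatrix sketch above.
void arbitrate_remaining(VoqMatrix& voq,
                         std::array<bool, N>& row_free,
                         std::array<bool, N>& col_free,
                         const std::array<std::size_t, N>& col_ptr,   // per-output priority pointer
                         const std::array<std::size_t, N>& row_ptr) { // per-input priority pointer
    std::array<int, N> col_winner{};
    col_winner.fill(-1);  // candidate input chosen by each column, -1 = none

    // First pass: each free output scans the inputs starting from its priority pointer.
    for (std::size_t j = 0; j < N; ++j) {
        if (!col_free[j]) continue;
        for (std::size_t k = 0; k < N; ++k) {
            std::size_t i = (col_ptr[j] + k) % N;
            if (row_free[i] && voq[i][j].req && voq[i][j].cnt == 0) {
                col_winner[j] = static_cast<int>(i);
                break;
            }
        }
    }
    // Second pass: each free input accepts at most one of the columns that chose it.
    for (std::size_t i = 0; i < N; ++i) {
        if (!row_free[i]) continue;
        for (std::size_t k = 0; k < N; ++k) {
            std::size_t j = (row_ptr[i] + k) % N;
            if (col_winner[j] == static_cast<int>(i)) {
                voq[i][j].row = true;  // this pair wins the row/column arbitration
                row_free[i] = false;
                col_free[j] = false;
                break;
            }
        }
    }
}
```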
最后，VOQ[i,t_i].stat==true或者VOQ[i,j].row==true的点就是仲裁成功的VOQ；这些VOQ上的数据可以立即传输，同时将突发计数器减一(VOQ[i,t_i].cnt--或者VOQ[i,j].cnt--)。 Finally, the points with VOQ[i,t_i].stat==true or VOQ[i,j].row==true are the VOQs that won arbitration; the data on these VOQs can be transmitted immediately, while the burst counter is decremented by one (VOQ[i,t_i].cnt-- or VOQ[i,j].cnt--).
在S2和S3中,对于单周期长度的传输应用,各预期传输对有传输需求时,直接认定为实际传输对并进行数据传输。对于多周期长度的传输应用(如burst传输应用),在确定为实际传输对前,需要确认该传输对上没有未完成的传输。In S2 and S3, for the transmission application with a single cycle length, when each expected transmission pair has a transmission demand, it is directly identified as an actual transmission pair and performs data transmission. For a transmission application with a multi-cycle length (such as a burst transmission application), it is necessary to confirm that there is no unfinished transmission on the transmission pair before determining it as an actual transmission pair.
S4、步骤S3轮询完毕后，这里可以设置一些条件，满足条件后，优先仲裁排列W_VOQ滚动获得新的仲裁排列W'_VOQ；这个条件可以是：保证有请求的预期传输对都被确定过是实际传输对，也就是每个有请求的预期传输对都有过至少一次传输。这个条件可以保证无饥饿性和公平性，其他条件则也需要考虑这两个特性。 S4. After the polling of step S3 is completed, certain conditions can be set here; once they are met, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ. One such condition is: every expected transmission pair that had a request has been determined to be an actual transmission pair, that is, each requested expected transmission pair has completed at least one transfer. This condition guarantees freedom from starvation and fairness; any other condition also needs to take these two properties into account.
结合图4，滚轮需要在一组pattern之间做循环滚动。如图4所示示例，对角线上的空心小六边形节点(VOQ00,VOQ11,……,VOQN-1N-1)是滚轮滚动前的pattern，黑色实心小六边形节点(VOQ10,VOQ21,……,VOQN-1N-2,VOQ0N-1)是滚轮一次滚动循环后的pattern。其他pattern之间的滚动类似，可以根据预设的滚动模式+n滚动，n的取值需保证滚轮循环滚动时，覆盖所有的传输对(对于N*N的交叉网络路由结构，以N为奇数为例，n取值为2亦能满足完全覆盖要求)。Referring to Figure 4, the wheel needs to roll cyclically through a set of patterns. In the example shown in Figure 4, the hollow small hexagonal nodes on the diagonal (VOQ00, VOQ11, ..., VOQN-1N-1) are the pattern before rolling, and the black solid small hexagonal nodes (VOQ10, VOQ21, ..., VOQN-1N-2, VOQ0N-1) are the pattern after one rolling cycle of the wheel. Rolling between other patterns is similar and can follow a preset rolling mode of +n; the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically (for an N*N cross network routing structure with N odd, for example, a value of 2 for n also satisfies the complete coverage requirement).
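On our reading of the coverage requirement (rolling by +n modulo N must eventually visit every output offset, and hence every transmission pair), a simple gcd test captures it; for instance, n = 2 covers everything exactly when N is odd. This check is an illustration, not part of the patent.

```cpp
#include <numeric>  // std::gcd (C++17)

// Rolling by +n (mod n_ports) visits every output offset iff gcd(n, n_ports) == 1.
bool roll_covers_all_pairs(unsigned n, unsigned n_ports) {
    return std::gcd(n % n_ports, n_ports) == 1u;
}
```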
优选的实施例中,S4步骤通过以下代码实现:In a preferred embodiment, step S4 is realized by the following codes:
[Code listing published as images PCTCN2022108409-appb-000003 and PCTCN2022108409-appb-000004 in the original document.]
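The S4 listing is likewise published only as an image; the sketch below assumes the rolling condition stated in the text (every expected pair that had a request has been served at least once since the last roll) and advances every expected output port by n modulo N. The bookkeeping arrays are hypothetical.

```cpp
// Step S4 sketch: roll the wheel only when every requested expected pair has been served.
// 'requested[i]' / 'served[i]' refer to the expected pair of input port i since the last roll.
bool try_roll(std::array<std::size_t, N>& expected,
              const std::array<bool, N>& requested,
              const std::array<bool, N>& served,
              std::size_t n = 1) {
    for (std::size_t i = 0; i < N; ++i)
        if (requested[i] && !served[i])
            return false;                     // condition not met: the wheel stays still
    for (std::size_t i = 0; i < N; ++i)
        expected[i] = (expected[i] + n) % N;  // inputs keep their row, outputs advance by +n
    return true;
}
```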
S5、循环进行S2-S4。S5. Perform S2-S4 in a loop.
本发明还提出了一种片上数据交换的滚轮仲裁电路，基于N输入端口N输出端口的NxN交叉网络，VOQ电路的实现分为两种方式：一种是如图5a所示FIFO方式，根据输入I_s的目的地址信息将VOQ路由请求分别路由到对应的输出端口的FIFO，这种实现方式简单，但比较耗费硬件资源；另一种方式是如图5b所示通过链表管理的方式实现VOQ：将输入存储在一组链表中，通过管理链表指针来管理VOQ信息。每一个输入端口包含了一个队列长度为M的指针寄存器队列，一个队列长度为N的头指针和尾指针，一个宽度为N的有效请求。其中M是能接收存储的最大请求个数，N是请求目标端口的个数。相对于FIFO的实现方式电路面积更小，使用更少的硬件资源。The present invention also proposes a wheel arbitration circuit for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports. The VOQ circuit can be implemented in two ways. One is the FIFO approach shown in Figure 5a: according to the destination address information of input I_s, each VOQ routing request is routed to the FIFO of the corresponding output port; this implementation is simple but consumes more hardware resources. The other approach, shown in Figure 5b, implements the VOQ through linked-list management: the inputs are stored in a set of linked lists, and the VOQ information is managed by managing the linked-list pointers. Each input port contains a pointer register queue of length M, head and tail pointers of length N, and a valid-request vector of width N, where M is the maximum number of requests that can be received and stored and N is the number of request destination ports. Compared with the FIFO implementation, the circuit area is smaller and fewer hardware resources are used.
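A hedged software model of the linked-list VOQ bookkeeping described for Figure 5b: M pointer slots shared by all N destinations of one input port, with per-destination head/tail pointers and valid bits. The structure is our interpretation of the text, not the patent's circuit.

```cpp
#include <array>
#include <cstddef>

// Linked-list VOQ for one input port: M buffered requests shared by N destination ports.
template <std::size_t M, std::size_t N_PORTS>
struct LinkedListVoq {
    std::array<std::size_t, M>       next{};   // pointer-register queue: index of the next slot in a list
    std::array<std::size_t, N_PORTS> head{};   // head pointer per destination port
    std::array<std::size_t, N_PORTS> tail{};   // tail pointer per destination port
    std::array<bool, N_PORTS>        valid{};  // request-valid flag per destination port

    // Append the buffered request stored in 'slot' to the list of destination 'dst'.
    void push(std::size_t slot, std::size_t dst) {
        if (!valid[dst]) head[dst] = slot;
        else             next[tail[dst]] = slot;
        tail[dst]  = slot;
        valid[dst] = true;
    }
};
```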
滚轮仲裁电路还包括:The wheel arbitration circuit also includes:
-滚动点选择电路,用于确定NxN交叉网络中的优先仲裁排列W VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]},以及优先仲裁排列W VOQ的滚动更新; -Rolling point selection circuit, used to determine the priority arbitration arrangement in NxN crossover network W VOQ ={VOQ[0,a], VOQ[1,b], VOQ[2,c],...,VOQ[N-1, x]}, and the rolling update of priority arbitration ranking W VOQ ;
优选的实施例中,滚动点选择电路为:NxN交叉网络的每一行、每一列各设置1个k位的寄存器,k=ceiling(N),一行的寄存器可以顺序移动寄存的值,一列的寄存器也可以顺序移动寄存值。行寄存器记作R[0],R[1],…,R[N-1]。列寄存器记作C[0],C[1],…,C[N-1]。初始化时,R[0]=0,R[1]=1,…,R[N-1]=N-1,C[0]=0,C[1]=1,…,C[N-1]=N-1。每次滚动时,C[0]=C[N-1],C[N-1]=C[N-2],C[N-2]=C[N-3],…,C[2]=C[1],C[1]=C[0],而R[N-1],…,R[0]保持不变。In a preferred embodiment, the rolling point selection circuit is: each row and each column of the NxN crossover network are respectively provided with a k-bit register, k=ceiling (N), and the registers of one row can move the registered value sequentially, and the registers of one column Registered values can also be moved sequentially. The row registers are denoted as R[0], R[1], . . . , R[N-1]. The column registers are denoted as C[0], C[1], . . . , C[N-1]. When initializing, R[0]=0, R[1]=1,...,R[N-1]=N-1, C[0]=0, C[1]=1,...,C[N- 1] = N-1. Every time you scroll, C[0]=C[N-1], C[N-1]=C[N-2], C[N-2]=C[N-3],...,C[2 ]=C[1], C[1]=C[0], while R[N-1],...,R[0] remain unchanged.
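A small sketch of the register rotation just described (the column registers rotate by one position per roll, the row registers stay fixed); this illustrates the update rule only and is not RTL.

```cpp
// Rolling-point selection sketch: C[0] takes the old C[N-1] and every other C[j] takes
// the old C[j-1], while the row registers R[0..N-1] remain unchanged.
void roll_column_registers(std::array<std::size_t, N>& C) {
    std::size_t last = C[N - 1];
    for (std::size_t j = N - 1; j > 0; --j)
        C[j] = C[j - 1];
    C[0] = last;
}
```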
结合图6a,给出了一个示例pattern,每个结点的行列分别互斥。pattern每个节点都有一个路由节点编号。结合图6b,是每个pattern按照行(列)递增的方式,完成对应列(行)的节点更新,列(行)预选选定被存在寄存器中。Combined with Figure 6a, an example pattern is given, and the rows and columns of each node are mutually exclusive. Each node of pattern has a routing node number. In combination with Fig. 6b, each pattern completes the node update of the corresponding column (row) in the manner of incrementing the row (column), and the column (row) is pre-selected and stored in the register.
-期望点匹配电路,用于判断排列中各预期传输对是否有传输需求,将有传输需求的标记为实际传输对,无传输需求的标记为非传输对;- Expected point matching circuit, used to judge whether each expected transmission pair in the arrangement has a transmission demand, mark the one with the transmission demand as the actual transmission pair, and mark the one without the transmission demand as the non-transmission pair;
优选的实施例中,期望点匹配电路为:NxN交叉网络的每一个交换点设置比较器,通过交换点所属的行寄存器与列寄存器的值比较,判断交换点是否为期望点。In a preferred embodiment, the expected point matching circuit is: each switching point of the NxN crossover network is provided with a comparator, and the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is the desired point.
结合图7,Pattern上的每一个节点都会输出一个行列路由编号,当这个编号和VOQ的 路由编号相等时即pattern的节点上有对应的VOQ请求时即期望点匹配。将当前pattern所有节点遍历判断就找出滚轮上期望匹配点。Combined with Figure 7, each node on the Pattern will output a row and column routing number. When this number is equal to the routing number of the VOQ, that is, when there is a corresponding VOQ request on the node of the pattern, the expected point matches. Traverse and judge all the nodes of the current pattern to find the expected matching point on the wheel.
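An illustrative model (not the patent's circuit) of the expected-point match: switching point (i, j) lies on the current wheel pattern when its row register equals its column register, and it is a matched expected point when the corresponding VOQ also has a request.

```cpp
// Expected-point matching sketch: one comparator per crossing point, continuing the sketches above.
bool is_expected_match(const VoqMatrix& voq,
                       const std::array<std::size_t, N>& R,   // row registers
                       const std::array<std::size_t, N>& C,   // column registers
                       std::size_t i, std::size_t j) {
    return R[i] == C[j] && voq[i][j].req;
}
```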
-行列轮询仲裁电路,用于进行交换点所处列的列仲裁或所处行的行仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的交换点作为实际传输对。-Row-column polling arbitration circuit, which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority The high switching point acts as the actual transmission pair.
图9是较完整的滚轮仲裁电路结构图。当做完滚轮上的期望点匹配后,按照图中所示在行和列两个方向上做两次仲裁就可以完成行列轮询仲裁(列循环的结果不能保证每一列仲裁的行都不相同。因此,需要行轮循电路把列轮循的结果再做一次仲裁,使得最终的结果每行和每列中最多只有一个选中的点,不会产生数据传输的冲突),完成一个周期内的所有通路路由仲裁。图9中每一行每一列都有一个arbiter,每个arbiter有一个优先级指针通过滚轮的跳转而滚动在图中可以通过移位寄存器实现。Fig. 9 is a relatively complete structural diagram of the wheel arbitration circuit. After matching the expected points on the wheel, do two arbitrations in the row and column directions as shown in the figure to complete the row-column polling arbitration (the result of column looping cannot guarantee that the rows of each column arbitration are different. Therefore, the row round-robin circuit is required to arbitrate the result of the column round-robin again, so that the final result has at most one selected point in each row and column, and there will be no conflict of data transmission), and complete all operations in one cycle Access routing arbitration. In Figure 9, each row and each column has an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel. In the figure, it can be realized by a shift register.
优选的实施例中,它还包括:In a preferred embodiment, it also includes:
-行列清除电路,NxN交叉网络的每一个交换点设置一个清除逻辑,用于输入端口确定实际传输对后,禁止匹配点所处的行和列参与传输需求仲裁。实现方案为:在输入端口本地将这些VOQ的标记排除。- Row and column clearing circuit, each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the input port determines the actual transmission pair. The implementation scheme is: locally exclude the marks of these VOQs at the input port.
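Putting the earlier sketches together, one arbitration cycle as described here might look like the driver below; it is illustrative only and omits the burst handling and pointer updates a real circuit would need.

```cpp
// One illustrative arbitration cycle combining the sketches above.
void arbitration_cycle(VoqMatrix& voq,
                       const std::array<std::size_t, N>& expected,
                       const std::array<std::size_t, N>& col_ptr,
                       const std::array<std::size_t, N>& row_ptr) {
    std::array<bool, N> row_free, col_free;
    row_free.fill(true);
    col_free.fill(true);

    grant_expected_pairs(voq, expected, row_free, col_free);        // wheel-priority step (S1-S2)
    arbitrate_remaining(voq, row_free, col_free, col_ptr, row_ptr); // row/column round-robin (S3)
    // Rolling (S4) is driven separately, e.g. by try_roll(), once every requested
    // expected pair has been served at least once.
}
```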
对比现有的CrossBar的仲裁算法Compared with the existing CrossBar arbitration algorithm
目前大多数调度算法都是最大化每次仲裁的连接通路，从而实现最大化带宽，但是这些算法过于复杂，且复杂度随着端口增加而指数增长（N³logN）无法在硬件中实现，而且需要很长时间才能完成。现在一般的CrossBar都基于迭代或者非迭代的循环算法。比较经典的算法有FIFO，PIM，iSLIP、DRRM和GA算法等。基于这些算法分析滚轮仲裁算法和这些算法之间的区别和优势。At present, most scheduling algorithms maximize the number of connections matched in each arbitration in order to maximize bandwidth, but these algorithms are too complex; their complexity grows rapidly with the number of ports (N³logN), so they cannot be implemented in hardware and take a long time to complete. Current CrossBar designs are generally based on iterative or non-iterative round-robin algorithms. The classic algorithms include FIFO, PIM, iSLIP, DRRM and GA. The differences and advantages of the wheel arbitration algorithm relative to these algorithms are analyzed below.
1. Introduction of VOQ
Early input ports used a single FIFO per input to queue cells waiting for arbitration; with such FIFO-based scheduling the maximum achievable throughput is only 58.6% because of head-of-line blocking. Virtual output queues (VOQ) were later introduced, raising the achievable throughput to 100% and making it possible to find a maximum match in every cycle. In our implementation the VOQs are managed as linked lists, which minimizes the hardware resources required; the advantage of this implementation is even more pronounced in large, high-latency network routers.
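One way to picture the linked-list VOQ management (a sketch under assumed data structures; the patent does not provide this code) is a shared buffer pool per input port in which each output's queue is just a chain of slot indices:

```python
# Minimal sketch of linked-list VOQ bookkeeping (illustrative; field names
# and sizes are assumptions, not taken from the patent).
class LinkedListVOQ:
    """One shared cell buffer per input port; each output's VOQ is a
    head/tail pair into a common `next` array, so N queues share storage."""

    def __init__(self, n_outputs, depth):
        self.free = list(range(depth))      # free-slot list
        self.next = [None] * depth          # next-pointer per buffer slot
        self.cell = [None] * depth          # payload per buffer slot
        self.head = [None] * n_outputs      # per-VOQ head pointer
        self.tail = [None] * n_outputs      # per-VOQ tail pointer

    def enqueue(self, out_port, payload):
        slot = self.free.pop()
        self.cell[slot], self.next[slot] = payload, None
        if self.head[out_port] is None:
            self.head[out_port] = slot      # queue was empty
        else:
            self.next[self.tail[out_port]] = slot
        self.tail[out_port] = slot

    def dequeue(self, out_port):
        slot = self.head[out_port]
        self.head[out_port] = self.next[slot]
        if self.head[out_port] is None:
            self.tail[out_port] = None
        self.free.append(slot)
        return self.cell[slot]

q = LinkedListVOQ(n_outputs=4, depth=8)
q.enqueue(2, "cell-A"); q.enqueue(2, "cell-B")
print(q.dequeue(2), q.dequeue(2))           # cell-A cell-B
```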
2. PIM algorithm
PIM is the Parallel Iterative Matching algorithm. It proceeds in three steps, as shown in Figure 10, which gives an example of the three steps making up one iteration of the PIM scheduler.
Step 1: Request. Each input sends a request to every output for which it has a queued cell.
Step 2: Grant. Each output independently selects, uniformly at random, one input from among those requesting it. In this example inputs 1 and 3 both request output 2, and output 2 chooses to grant input 3.
Step 3: Accept. Each input randomly selects one output from among those that granted it. In this example outputs 2 and 4 both grant input 3, and input 3 chooses to accept output 2.
In this example the first iteration leaves input 4 and output 4 unmatched even though this pair conflicts with no other connection; that connection is established in the second iteration.
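As a software sketch only (the variable names and the random-choice model are assumptions, not taken from the patent or Figure 10), one PIM iteration can be written as:

```python
# Sketch of one PIM (Parallel Iterative Matching) iteration; illustrative only.
import random

def pim_iteration(requests, matched_in, matched_out):
    """requests[i][j]: input i has a cell for output j.
    matched_in / matched_out: ports already matched in earlier iterations."""
    n = len(requests)
    # Grant: each unmatched output picks one requesting unmatched input at random.
    grants = {}                              # input -> list of granting outputs
    for j in range(n):
        if j in matched_out:
            continue
        reqs = [i for i in range(n) if requests[i][j] and i not in matched_in]
        if reqs:
            grants.setdefault(random.choice(reqs), []).append(j)
    # Accept: each input that received grants accepts one of them at random.
    new_pairs = []
    for i, outs in grants.items():
        j = random.choice(outs)
        new_pairs.append((i, j))
        matched_in.add(i); matched_out.add(j)
    return new_pairs

reqs = [[0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
print(pim_iteration(reqs, set(), set()))
```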
Three characteristics of the PIM algorithm:
First, each iteration only considers the pairs left unmatched by previous iterations, so on the order of logN iterations are needed to complete the maximum match.
Second, it ensures that every request is eventually granted; no input VOQ is permanently passed over by the arbitration.
Third, it keeps no memory or state to track when connections were established in the past.
Performance of the PIM algorithm: because every choice in PIM's arbitration is random, the following limitations apply.
First, random arbitration is not conducive to high speed: each arbiter must make a uniformly random selection among all of its eligible candidates, which is costly in hardware.
Second, when the crossbar is overloaded, PIM can allocate bandwidth unfairly between connections.
Finally, PIM performs poorly with a single iteration: it limits throughput to roughly 63%, only slightly higher than a FIFO-queued switch. Under overload, the efficiency of a single-iteration PIM crossbar is only 1 − 1/e ≈ 63%.
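For completeness, the 63% figure follows from a standard argument (sketched here; the text above only quotes the result): under saturation every input requests every output and each of the N outputs grants one requesting input uniformly at random, so a given input receives no grant with probability $\left(\tfrac{N-1}{N}\right)^{N}$, which tends to $1/e$ as $N$ grows; the expected fraction of inputs matched in a single iteration therefore tends to $1-\tfrac{1}{e}\approx 63\%$.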
3. RRM algorithm
RRM is the Basic Round-Robin Matching algorithm. It is likewise divided into three steps; Figure 11 gives an example of the three steps making up one iteration of the RRM scheduler.
RRM potentially overcomes two problems of PIM: complexity and unfairness. A round-robin arbiter implemented as a priority encoder is much simpler and faster than a random arbiter, and rotating the priority helps the algorithm share bandwidth more fairly among the requesting connections. The three arbitration steps are:
Step 1: Request. Each input sends a request to every output for which it has a queued VOQ.
Step 2: Grant. If an output receives any requests, it grants the next request in a fixed round-robin schedule starting from its highest-priority element. The output notifies every input whether its request was granted, and the output's pointer is advanced to one position beyond the granted input, regardless of whether that grant is later accepted.
Step 3: Accept. If an input receives one or more grants, it accepts the next one in a fixed round-robin schedule starting from its highest-priority element, and its pointer is advanced to one position beyond the accepted output.
RRM performance analysis
RRM becomes unstable at offered loads of only about 63%. The reason for RRM's poor performance lies in the rule for updating the pointers of the output arbiters. Consider the example shown in the figure above: inputs 1 and 2 are both heavily loaded and receive a new cell for both outputs in every cell time, but because the output grant pointers move in lock-step, only one input is served per cell time. Note how the grant pointers stay synchronized: in cell time 1 both point to input 1, in cell time 2 both point to input 2, and so on. This synchronization limits the maximum throughput for this traffic pattern to only 50%, and synchronization of the grant pointers also limits performance under random arrival patterns.
4. iSLIP algorithm
The iSLIP algorithm uses rotating-priority ("round-robin") arbitration to schedule the inputs and outputs in turn. Its main virtue is simplicity: it is easy to implement in hardware and can run at high speed. Studies show good performance under uniform traffic; for uniform, independent and identically distributed Bernoulli arrivals, a single iteration of iSLIP achieves 100% throughput. On close comparison, iSLIP is in fact a variant of the simple Basic Round-Robin Matching algorithm (RRM), which is perhaps the simplest and most important of these schemes.
Figure 12 shows an example of iSLIP.
Relative to RRM, iSLIP makes essentially one change: a grant pointer is not moved unless its grant is accepted. The Grant step of iSLIP therefore becomes:
Step 2: Grant. If an output receives any requests, it grants the next request in a fixed round-robin schedule starting from its highest-priority element and notifies every input whether its request was granted. The pointer to the highest-priority element is advanced to one position beyond the granted input only if the grant is accepted in Step 3. This small change to the algorithm has the following consequences:
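The difference can be made concrete with a small sketch (illustrative only; the class and method names are assumptions): the grant pointer advances only when the corresponding grant is actually accepted.

```python
# Sketch of an iSLIP-style grant arbiter (illustrative only).  The only
# difference from RRM is WHEN the pointer moves: here it moves only after
# its grant has been accepted, which breaks pointer synchronization.
class GrantArbiter:
    def __init__(self, n):
        self.n = n
        self.ptr = 0                  # highest-priority input

    def grant(self, requesting_inputs):
        """Round-robin pick starting at self.ptr; does NOT move the pointer."""
        for k in range(self.n):
            i = (self.ptr + k) % self.n
            if i in requesting_inputs:
                return i
        return None

    def accepted(self, granted_input):
        """Called only if the granted input accepted; pointer moves past it."""
        self.ptr = (granted_input + 1) % self.n

arb = GrantArbiter(4)
g = arb.grant({1, 3})      # grants input 1
arb.accepted(g)            # accepted -> pointer now points to input 2
print(g, arb.ptr)          # 1 2
```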
Property 1: The connection made most recently gets the lowest priority. When an arbiter advances its pointer, the input (output) that was just granted (accepted) becomes the lowest-priority candidate at that output (input), so other requesters are favoured in the next cell time.
Property 2: No connection is starved. An input keeps requesting an output until it succeeds, and because a served connection drops to lowest priority, every persistent request is granted within at most N cell times.
Property 3: Under heavy load, all queues sharing a common output receive the same throughput. This follows from Property 2: the output pointer visits each requesting input in a fixed order, giving each of them the same share.
Most importantly, this small change prevents the output arbiters from moving in lock-step, which yields a large improvement in performance.
5. DRRM algorithm
DRRM stands for Dual Round-Robin Matching. It is so named because arbitration is carried out by two independent sets of round-robin arbiters, one at the inputs and one at the outputs.
Each input port has one request arbiter and N VOQs. The request arbiter selects at most one non-empty queue according to its pointer value, i.e. the highest-priority output port. Each output port has a grant arbiter that selects one input port from among the requests it receives and returns the result to that input. When a grant arbiter serves a request it updates its pointer, and the corresponding request arbiter updates its pointer as well.
Figure 13 shows an example of the DRRM algorithm.
The DRRM algorithm consists of two stages:
Step 1: Request. Each input's request arbiter selects one request in round-robin order.
Step 2: Grant. Each output arbitrates among all the requests addressed to it and grants one input.
The DRRM scheme has a shorter arbitration time than iSLIP while achieving performance comparable to iSLIP.
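A behavioural sketch of one DRRM cycle (illustrative only; function and variable names are assumptions, and the pointer-update rule follows the description above):

```python
# Sketch of one DRRM cycle: request arbitration at the inputs, then grant
# arbitration at the outputs (illustrative only; not the patent's circuit).
def rr_pick(candidates, ptr, n):
    for k in range(n):
        idx = (ptr + k) % n
        if idx in candidates:
            return idx
    return None

def drrm_cycle(voq_nonempty, req_ptr, gnt_ptr):
    """voq_nonempty[i]: set of outputs with queued cells at input i."""
    n = len(voq_nonempty)
    # Step 1 (Request): each input proposes at most one output.
    proposals = {}                        # output -> list of proposing inputs
    for i in range(n):
        j = rr_pick(voq_nonempty[i], req_ptr[i], n)
        if j is not None:
            proposals.setdefault(j, []).append(i)
    # Step 2 (Grant): each output grants one proposing input; both pointers
    # advance only for the granted pair.
    grants = []
    for j, inputs in proposals.items():
        i = rr_pick(set(inputs), gnt_ptr[j], n)
        grants.append((i, j))
        req_ptr[i] = (j + 1) % n
        gnt_ptr[j] = (i + 1) % n
    return grants

voq = [{1, 2}, {1}, {3}, set()]
print(drrm_cycle(voq, req_ptr=[0] * 4, gnt_ptr=[0] * 4))   # e.g. [(0, 1), (2, 3)]
```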
6. GA algorithm
The GA algorithm is the Grant-Aware scheduling algorithm for VOQ-based input-buffered packet switches. As shown in Figure 14, this scheme iterates the DRRM algorithm several times within one cycle to approach the maximum arbitration efficiency, feeding the output-port arbitration results back to the input ports after each iteration. Efficiency improves, but the complexity is greater than DRRM: for even moderately large arbiters (N ≥ 8), more than two iterations are already difficult to realize in a high-speed circuit, because every iteration amounts to running a complete DRRM pass again.
7. Wheel arbitration algorithm (this application)
The greatest advantage of the wheel arbitration algorithm proposed in this patent is that wheel-priority arbitration is performed first and DRRM arbitration second, so that two arbitrations are completed within a single cycle. The first arbitration follows the wheel-priority principle: it only has to check whether there is a request at each wheel node, which makes the hardware trivial to implement and saves a large amount of time for the second arbitration. The second arbitration then applies the DRRM algorithm to the ports left over after the first-arbitration nodes have been filtered out, as sketched below.
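The following is a rough behavioural approximation of that flow, not the patented circuit itself; every data structure, name and simplification here (for example, pointer updates are omitted in the second pass) is an assumption made for illustration only.

```python
# Behavioural approximation of the two arbitrations described above
# (illustrative only; all data structures and details are assumptions).
N = 4

def rr_pick(cands, ptr):
    for k in range(N):
        idx = (ptr + k) % N
        if idx in cands:
            return idx
    return None

def wheel_then_drrm(requests, wheel_col, req_ptr, gnt_ptr):
    """requests[i][j]: input i has a queued cell for output j.
    wheel_col[i]: output favoured for input i by the current wheel pattern."""
    grants, used_in, used_out = [], set(), set()
    # First arbitration: wheel priority - accept every wheel point with a request.
    for i in range(N):
        j = wheel_col[i]
        if requests[i][j]:
            grants.append((i, j)); used_in.add(i); used_out.add(j)
    # Second arbitration: DRRM-style request/grant over the remaining ports.
    proposals = {}
    for i in range(N):
        if i in used_in:
            continue
        cands = {j for j in range(N) if requests[i][j] and j not in used_out}
        j = rr_pick(cands, req_ptr[i])
        if j is not None:
            proposals.setdefault(j, []).append(i)
    for j, inputs in proposals.items():
        i = rr_pick(set(inputs), gnt_ptr[j])
        grants.append((i, j))
    return grants

reqs = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]
print(wheel_then_drrm(reqs, wheel_col=[1, 2, 3, 0],
                      req_ptr=[0] * N, gnt_ptr=[0] * N))
# e.g. [(1, 2), (2, 3), (0, 0)]: two wheel grants plus one second-pass grant
```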
Compared with the PIM, RRM, iSLIP, DRRM and GA algorithms, the wheel-priority algorithm is simpler, more direct and easier to implement. In today's high-speed circuit designs it is difficult to complete more than two iterations of PIM, RRM, iSLIP, DRRM or GA within one clock cycle, because every iteration of those algorithms must compute new arbitration priorities for the input and output ports. In the wheel-priority algorithm the arbitration priority of the input and output ports is the pattern structure itself, which is independent of the arbitration logic; this leaves ample time for the arbitration decision, so more iterations can be completed within one clock cycle. Compared with the roughly 63% efficiency of schemes that complete one iteration per clock cycle, the wheel scheme reaches about 86%, an improvement of 23 percentage points. It has already been implemented in an AI GPU chip.
Because the wheel is a pattern chosen by the user according to the routing characteristics of the data, the arbitration time in practical applications is much shorter than that of GA, and more iterations can be performed to further raise arbitration efficiency.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (15)

  1. A wheel arbitration method for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, in which all output ports corresponding to one input port form a row, all input ports corresponding to one output port form a column, and each input-output switching point is a transmission pair; characterized in that it comprises the following steps:
    S1. Determine a priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0, N-1] and are pairwise distinct; the N elements of the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are unexpected transmission pairs;
    S2. Determine whether each expected transmission pair in the arrangement has a transmission demand; if so, it is confirmed as an actual transmission pair, and the confirmed actual transmission pair may then carry out data transmission;
    S3. For the switching points of the unexpected transmission pairs, first perform column arbitration for each output port (or row arbitration for the input port where the point is located) to obtain candidate actual transmission pairs; then perform row arbitration (or column arbitration) over those candidates, selecting the higher-priority unexpected transmission pair as an actual transmission pair;
    S4. After the polling of step S3 is complete, when a certain condition is satisfied, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel stays still;
    S5. Repeat S2-S4 in a loop.
  2. The method according to claim 1, characterized in that in S1 the priority arbitration arrangement is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
  3. The method according to claim 1, characterized in that in S2 and S3, for burst-transfer applications, before a pair is confirmed as an actual transmission pair it must be verified that there is no unfinished transfer on that transmission pair.
  4. The method according to claim 1, characterized in that in S2 and S3, after a pair is confirmed as an actual transmission pair, the row and column of that actual transmission pair are cleared and no longer take part in the next transmission-demand arbitration.
  5. The method according to claim 1, characterized in that in S4 the certain condition is that every expected transmission pair with a pending request has been confirmed as an actual transmission pair.
  6. The method according to claim 1, characterized in that in S4 the rolling keeps all expected input ports unchanged and adds n to all expected output ports, the value of n being chosen so that the wheel, rolling cyclically, covers all transmission pairs.
  7. The method according to claim 1, characterized in that in S4 the rolling keeps all expected output ports unchanged and adds n to all expected input ports, the value of n being chosen so that the wheel, rolling cyclically, covers all transmission pairs.
  8. A wheel arbitration circuit for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, characterized in that it comprises:
    - a rolling-point selection circuit for determining the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]} in the NxN crossbar network and for rolling updates of the priority arbitration arrangement W_VOQ;
    - an expected-point matching circuit for determining whether each expected transmission pair in the arrangement has a transmission demand, marking those with a demand as actual transmission pairs and those without as non-transmission pairs;
    - a row-column polling arbitration circuit for performing column arbitration of the column (or row arbitration of the row) in which a switching point is located to obtain candidate actual transmission pairs, and then performing row arbitration (or column arbitration) over those candidates, selecting the higher-priority switching point as the actual transmission pair; the circuits operate according to the following logic:
    S1. The rolling-point selection circuit determines the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0, N-1] and are pairwise distinct; the N elements of the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are unexpected transmission pairs;
    S2. The expected-point matching circuit determines whether each expected transmission pair in the arrangement has a transmission demand; if so, it is confirmed as an actual transmission pair, and the confirmed actual transmission pair may then carry out data transmission;
    S3. For the switching points of the unexpected transmission pairs, the row-column polling arbitration circuit first performs column arbitration for each output port (or row arbitration for the input port where the point is located) to obtain candidate actual transmission pairs, and then performs row arbitration (or column arbitration) over those candidates, selecting the higher-priority unexpected transmission pair as an actual transmission pair;
    S4. After the polling of step S3 is complete, when a certain condition is satisfied, the rolling-point selection circuit rolls the priority arbitration arrangement W_VOQ to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel stays still;
    S5. S2-S4 are repeated in a loop.
  9. The circuit according to claim 8, characterized in that it further comprises:
    - a row-column clearing circuit, in which every switching point of the NxN crossbar network has clearing logic for barring the row and column of a matched point from further participation in the transmission-demand arbitration once an expected transmission pair has been confirmed as an actual transmission pair; the operating logic of this circuit is:
    In S2: after a pair is confirmed as an actual transmission pair, the row-column clearing circuit clears the row and column of that actual transmission pair, which no longer take part in the next transmission-demand arbitration;
    In S3: after a pair is confirmed as an actual transmission pair, the row-column clearing circuit clears the row and column of that actual transmission pair, which no longer take part in the next transmission-demand arbitration.
  10. The circuit according to claim 8, characterized in that in operating logic S4 the certain condition is that every expected transmission pair with a pending request has been confirmed as an actual transmission pair.
  11. The circuit according to claim 8, characterized in that in the rolling-point selection circuit each row and each column of the NxN crossbar network is provided with one k-bit register, k = ceiling(log2(N)), and the row/column registers shift their stored values in sequence to implement the rolling.
  12. The circuit according to claim 8, characterized in that in the expected-point matching circuit every switching point of the NxN crossbar network is provided with a comparator, and the values of the row register and column register to which the switching point belongs are compared to determine whether the switching point is an expected point.
  13. The circuit according to claim 8, characterized in that in the row-column polling arbitration circuit every row and every column of the NxN crossbar network is provided with an arbiter, each arbiter having a priority pointer that rolls with the jumps of the wheel.
  14. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when executed by the processor, performs the method according to any one of claims 1-7.
  15. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method according to any one of claims 1-7.
PCT/CN2022/108409 2021-07-29 2022-07-27 Roller arbitration method and circuit for on-chip data exchange WO2023006006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110866467.4A CN113568849B (en) 2021-07-29 2021-07-29 Roller arbitration method and circuit for on-chip data exchange
CN202110866467.4 2021-07-29

Publications (1)

Publication Number Publication Date
WO2023006006A1 true WO2023006006A1 (en) 2023-02-02





