WO2023006006A1 - Roller arbitration method and circuit for on-chip data exchange - Google Patents

Roller arbitration method and circuit for on-chip data exchange

Info

Publication number
WO2023006006A1
Authority
WO
WIPO (PCT)
Prior art keywords
arbitration
voq
transmission pair
column
row
Prior art date
Application number
PCT/CN2022/108409
Other languages
French (fr)
Chinese (zh)
Inventor
王东辉
赵鹏
常亮
桑永奇
李甲
姚飞
Original Assignee
海飞科(南京)信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023006006A1 publication Critical patent/WO2023006006A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 - Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1642 - Handling requests for interconnection or transfer for access to memory bus based on arbitration with request queuing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F13/18 - Handling requests for interconnection or transfer for access to memory bus based on priority control
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 - Interprocessor communication
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 - System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package

Definitions

  • the invention relates to the fields of chip design, on-chip network, on-chip system, and computer architecture, in particular to a wheel scheduling method and circuit realization of an on-chip data exchange network.
  • This method can improve the efficiency and speed of on-chip data exchange, and is especially suitable for artificial intelligence and big data processing chips, especially chips with SIMT architecture.
  • Machine learning, scientific computing and graphics rendering require huge computing power, which is generally provided by large chips (such as GPU, TPU, APU, etc.) to achieve highly complex machine learning tasks and graphics processing tasks.
  • Using machine learning for recognition requires a huge deep learning network and massive image data, and the training process is very time-consuming; in a 3D application or game scene, if recursive ray tracing (Recursive Ray-Tracing) is used for rendering and the scene is complex, massive calculations must be performed and massive data must be transferred. This requires extremely high computing performance, and therefore extremely wide data exchange bandwidth to support such demands.
  • High-performance on-chip switches have become an important component of AI and GPU chips.
  • For specific scenarios such as AI and graphics computing, the arbitration method used for on-chip caching and data exchange is very important.
  • An inefficient arbitration method and arbiter design will become the bottleneck of the system and greatly degrade its performance. Therefore, the arbitration method and the arbiter circuit must achieve high performance with low complexity.
  • PIM has problems of fairness and complexity because each selection is random and requires three steps.
  • RRM and iSLIP use priority round-robin arbitration, which is simpler than random arbitration logic.
  • iSLIP improves the grant-pointer update condition, which improves fairness.
  • However, three steps are still required, so the complexity problem remains and high-speed circuits are difficult to implement.
  • DRRM uses two independent round-robin arbitration mechanisms, one at the inputs and one at the outputs, giving a shorter arbitration time than the iSLIP scheme while achieving performance equivalent to iSLIP; GA builds on DRRM by feeding the Grant information of the output ports back to the input ports, which improves arbitration efficiency but adds more complexity than DRRM. Since the complexity grows rapidly with the number of ports (N³logN), it is difficult for DRRM and GA to complete more than two arbitration iterations in high-speed circuits as the port count increases.
  • the present invention proposes a wheel scheduling method and circuit realization of an on-chip data exchange network.
  • the present invention first discloses a roller arbitration method for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports; all output ports corresponding to one input port form a row, all input ports corresponding to one output port form a column, and each input-output switching point is a transmission pair; the method includes the following steps:
  • step S4: after the polling of step S3 is completed, if certain conditions are met, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel remains stationary;
  • W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
  • the row and column where the actual transmission pair is located are cleared, and no longer participate in the next transmission demand arbitration.
  • the rolling condition is that every expected transmission pair with a request has been confirmed as an actual transmission pair.
  • the rolling keeps all expected input ports unchanged and advances all expected output ports by n, where the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically.
  • alternatively, the rolling keeps all expected output ports unchanged and advances all expected input ports by n, where the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically.
  • the invention also discloses a roller arbitration circuit for on-chip data exchange, which is based on an NxN crossover network with N input ports and N output ports, which includes:
  • -Row-column polling arbitration circuit, which performs column arbitration over the column (or row arbitration over the row) in which a switching point is located to obtain candidate actual transmission pairs, then performs row arbitration (or column arbitration) over the candidate actual transmission pairs and selects the switching point with the highest priority as the actual transmission pair.
  • it also includes:
  • each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the expected transmission pair determines the actual transmission pair.
  • each switching point of the NxN crossover network is provided with a comparator, and the value of the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is an expected point.
  • each row and each column of the NxN crossover network is provided with an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel.
  • the present application proposes a rolling scheduling method and circuit implementation of an on-chip data switching network, and realizes the fairness of data transmission opportunities of each switching point based on rolling scheduling.
  • based on the first column/row polling, the arbitration is guaranteed to traverse every switching point; based on the second row/column polling, each input and output is guaranteed to be matched at most once, avoiding conflicts.
  • after a transmission pair is determined, the row and column to which the switching point belongs are cleared, which both ensures that each input and output is matched at most once and reduces the number of arbitration conflicts and arbitration passes, improving transmission efficiency.
  • the wheel priority scheduling algorithm avoids arbitrating the expected transmission pairs on the wheel; it only needs to check whether each expected transmission pair on the wheel has a transmission request.
  • the judgment logic is simple, the arbitration time is shorter, and it is easy to implement high-speed circuits, especially in high-speed circuits with strict timing requirements. Two iterations can be easily implemented on our chip.
  • Figure 1 is a schematic diagram of NxN cross network routing structure
  • Figure 2 is a schematic diagram of a wheel representing priority
  • Figure 3 is a schematic diagram of selecting and clearing rows and columns
  • Figure 4 is a schematic diagram of the movement of the roller
  • Figure 5a is a schematic diagram of a VOQ circuit implemented in a FIFO manner
  • Figure 5b is a schematic diagram of implementing a VOQ circuit by managing linked list pointers
  • Figure 6a is a schematic diagram of the wheel point selection circuit
  • Figure 6b is a schematic diagram of incremental updating of register rows (columns) in the scroll point selection circuit
  • Figure 7 is a schematic diagram of the desired point matching circuit
  • Figure 8 is a schematic diagram of the wheel pattern
  • Figure 9 is an example diagram of the row and column polling arbitration circuit
  • Figure 10 is an example diagram of three steps constituting an iteration of the PIM scheduling algorithm
  • Figure 11 is an example diagram of three steps constituting one iteration of the RRM scheduling algorithm
  • Figure 12 is an example diagram of the iSLIP algorithm
  • Figure 13 is an example diagram of the DRRM algorithm
  • Figure 14 is an example diagram of the GA algorithm
  • Figure 1 shows the routing structure of the NxN crossover network: each intersection of I0, I1, ..., IN-1 and O0, O1, ..., ON-1 is a routing path, also called a transmission pair .
  • Each intersection is a VOQij request path, the first number of the VOQ subscript indicates the input port number, and the second number indicates the output port number.
  • Each input port Ii can only have one routing node selected in one cycle, and each output port can only have one routing node selected in one cycle. There are at most N paths selected in one cycle.
  • to ensure that each input port receives fair arbitration, that each input port obtains as equal a share of data transmission as possible, and that each virtual output queue VOQ of each input port obtains as equal a share of data transmission as possible, the invention discloses an on-chip wheel arbitration method for data exchange that includes the following steps:
  • the scroll wheel represents the expected output port for each input port, and there are N expected output ports for N input ports.
  • the N expected ports are all different, that is, they form a permutation of {0,1,2,...,N-1}; each permutation is called a pattern, and there are N! patterns in total.
  • at initialization, {0,1,2,...,N-1} can be used directly, that is to say, input port PI_0 expects output port PO_0, input port PI_1 expects output port PO_1, ..., input port PI_(N-1) expects output port PO_(N-1).
  • the row and column where the actual transmission pair is located are cleared, and no longer participate in the transmission demand arbitration.
  • a node in the pattern is on VOQ10 (small hexagonal solid node), and if this VOQ10 has a request, then the route of this node will be selected.
  • the above content can be realized by the following code:
  • the S3 step is realized by the following code:
  • step S4 after the polling of step S3, some conditions can be set here.
  • the priority arbitration arrangement W VOQ rolls to obtain a new arbitration arrangement W' VOQ ; this condition can be: ensure that all requested expected transmission pairs have been determined is the actual transfer pair, that is, each requested expected transfer pair has had at least one transfer. This condition can guarantee non-starvation and fairness, and other conditions also need to consider these two characteristics.
  • the scroll wheel needs to do circular scrolling between a set of patterns.
  • the hollow small hexagonal nodes (VOQ00, VOQ11, ..., VOQN-1N-1) on the diagonal are the pattern before the scroll wheel
  • the black solid small hexagonal nodes (VOQ10, VOQ21 ,..., VOQN-1N-2, VOQ0N-1) is the pattern after one scroll cycle of the wheel.
  • the scrolling between other patterns is similar.
  • the value of n needs to ensure that all transmission pairs are covered as the wheel rolls cyclically (for an N*N cross network routing structure with N odd, for example, a value of 2 for n also satisfies the complete coverage requirement).
  • step S4 is realized by the following codes:
  • the present invention also proposes a roller arbitration circuit for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports, the realization of the VOQ circuit is divided into two modes: one is the FIFO mode as shown in Figure 5a, according to Input the destination address information of I_s to route the VOQ routing request to the FIFO of the corresponding output port respectively.
  • This implementation method is simple, but consumes hardware resources; another method is to realize VOQ through the linked list management as shown in Figure 5b: The input is stored in a set of linked lists, and the VOQ information is managed by managing the linked list pointers.
  • Each input port contains a pointer register queue with a queue length of M, a head pointer and a tail pointer with a queue length of N, and a valid request with a width of N.
  • M is the maximum number of requests that can be received and stored
  • N is the number of request destination ports.
  • the wheel arbitration circuit also includes:
  • the row registers are denoted as R[0], R[1], . . . , R[N-1].
  • each node of pattern has a routing node number.
  • each pattern completes the node update of the corresponding column (row) in the manner of incrementing the row (column), and the column (row) is pre-selected and stored in the register.
  • the expected point matching circuit is: each switching point of the NxN crossover network is provided with a comparator, and the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is the desired point.
  • each node on the Pattern will output a row and column routing number.
  • this number is equal to the routing number of the VOQ, that is, when there is a corresponding VOQ request on the node of the pattern, the expected point matches. Traverse and judge all the nodes of the current pattern to find the expected matching point on the wheel.
  • -Row-column polling arbitration circuit which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority
  • the high switching point acts as the actual transmission pair.
  • Fig. 9 is a relatively complete structural diagram of the wheel arbitration circuit. After matching the expected points on the wheel, do two arbitrations in the row and column directions as shown in the figure to complete the row-column polling arbitration (the result of column looping cannot guarantee that the rows of each column arbitration are different. Therefore, the row round-robin circuit is required to arbitrate the result of the column round-robin again, so that the final result has at most one selected point in each row and column, and there will be no conflict of data transmission), and complete all operations in one cycle Access routing arbitration.
  • each row and each column has an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel. In the figure, it can be realized by a shift register.
  • it also includes:
  • each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the input port determines the actual transmission pair.
  • the implementation scheme is: locally exclude the marks of these VOQs at the input port.
  • the early input ports used FIFO queuing to wait for arbitration allocation.
  • the maximum input throughput of this FIFO-based arbitration scheduling method is only 58.6%.
  • the VOQ method was later proposed to raise the throughput to 100%, providing the conditions for finding a maximum match in every cycle.
  • PIM refers to the Parallel Iterative Matching algorithm, which is divided into three steps, as shown in Figure 10, an example of the three steps that constitute one iteration of the PIM scheduling algorithm.
  • Step 1 Request, each input makes a request for each output it has cells;
  • Step 2 Grant, each output consistently chooses an input at random from among the inputs that requested it.
  • inputs 1 and 3 both request output 2.
  • output 2 chooses to grant input 3;
  • Step 3 Accept, each input randomly selects an output among the granted outputs.
  • outputs 2 and 4 are both granted to input 3.
  • Input 3 selects to accept output 2.
  • the first iteration does not match input 4 and output 4, even though it does not conflict with other connections. This connection will be established in the second iteration.
  • each iteration process can select the unmatched part of the previous iteration, so the number of iterations to complete the maximum matching needs logN;
  • RRM refers to the Basic Round-Robin Matching algorithm.
  • the RRM algorithm is also divided into three steps: as shown in Figure 11, this is an example of the three steps that constitute one iteration of the RRM scheduling algorithm.
  • RRM potentially overcomes two problems in PIM: complexity and unfairness.
  • a round-robin arbitrator implemented as a priority encoder is much simpler and faster to execute than a random arbitrator. Round priority helps the algorithm distribute bandwidth fairly and more equitably among requesting connections.
  • the three steps of arbitration are:
  • Step 1 Request, each input sends a request at each output for which it has a queued VOQ.
  • Step 2 Grant: if the output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element.
  • the output notifies each input whether or not its request was granted.
  • the pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the granted input.
  • Step 3 Accept: if an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the accepted output.
  • the iSLIP algorithm uses a rotational priority ("Round”) arbitration to schedule each input and output in turn.
  • the main feature is simplicity; it is easy to implement on hardware and can run at high speed. It is found that the performance under uniform traffic conditions is high; for uniform IID Bernoulli arrival, a single iteration of iSLIP can achieve 100% throughput.
  • RRM simple basic round-robin matching algorithm
  • Figure 12 shows an example of iSLIP.
  • Compared with RRM, iSLIP mainly makes the following change: the grant pointer is not updated unless the grant is accepted. The Grant step of iSLIP is therefore changed to:
  • Step 2 Grant, if the output receives any requests, it will select the next request that comes up in a fixed round-robin schedule, starting with the highest priority element.
  • the output notifies each input whether or not its request was granted. The pointer to the highest-priority element of the round-robin schedule is incremented to one position beyond the granted input only if the grant is accepted in step 3. This small change to the algorithm results in the following:
  • Feature 1: The most recently established connection has the lowest priority. This is because when the arbiter moves its pointer, the most recently granted (accepted) input (output) becomes the lowest priority for that output (input), so that connection is the last to be favored again in the next cell time.
  • Feature 2: No connection is starved. This is because an input keeps requesting an output until it succeeds, and the output waits at most N cell times to be accepted by each input; thus a requesting input is always served within a bounded number of cell times.
  • Property 3 Under high load, all queues with a common output have the same throughput. This is a consequence of feature 2, the output pointer is moved to each request input in a fixed order, thus providing the same throughput for each request.
  • DRRM is Dual Round-Robin Matching. It is called DRRM because two independent round-robin arbitration mechanisms perform arbitration at the inputs and outputs.
  • there is a request arbiter and N VOQs at each input port; the request arbiter selects at most one non-empty queue according to its pointer value, which represents the output port with the highest priority.
  • there is a grant arbiter at each output port, which arbitrates among up to N requests to choose one input port and then sends the result to that input port. If the grant arbiter receives a request, it updates its pointer value, and the pointer value of the corresponding request arbiter is also updated.
  • Figure 13 shows an example of a DRRM algorithm
  • the DRRM algorithm consists of two stages:
  • Step 1 Request: each input's request arbiter selects one request according to its round-robin schedule
  • Step 2 Grant, each output arbitrates an input among all the requests of this port.
  • the DRRM scheme has a shorter arbitration time than the iSLIP scheme, and at the same time achieves performance equivalent to that of iSLIP.
  • the biggest advantage of the wheel arbitration algorithm proposed in this patent is that wheel priority arbitration is adopted first, and then the DRRM arbitration algorithm is adopted.
  • the two arbitrations are completed in one cycle.
  • the first arbitration follows the wheel priority principle, and the arbitration can be completed by judging whether there is a request on the wheel node.
  • the hardware circuit is very easy to implement, which saves a lot of time for the second arbitration.
  • the second arbitration is based on filtering out the nodes of the first arbitration, and according to the DRRM algorithm, the second arbitration is performed.
  • the advantage of the wheel priority algorithm is that it is simpler, more direct, and easier to implement.
  • Each iteration of these algorithms requires arbitration of the input and output ports.
  • in those algorithms the priority is handled inside the arbitration logic, whereas in the wheel priority algorithm the arbitration priority of the input and output ports is given by a pattern structure that is independent of the arbitration logic decision, leaving sufficient time for the arbitration decision, so more iterations can be realized in one clock cycle.
  • compared with the 63% efficiency achieved by one iteration per clock cycle, it is 23% higher and can reach an efficiency of 86%. It has been implemented on the AI GPU chip.
  • the arbitration time in actual application will be much shorter than that of GA, and more iterations can be achieved to improve the arbitration efficiency.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A roller arbitration method for on-chip data exchange. On the basis of an N×N cross network of N input ports and N output ports, the method comprises the following steps: S1, determining a priority arbitration arrangement; S2, determining whether each expected transmission pair in the arrangement has a transmission requirement; if yes, determining that the expected transmission pair is an actual transmission pair, and immediately performing data transmission on the determined actual transmission pair; if not, considering that the expected transmission pair is a non-transmission pair; S3, performing sequential row/column or column/row arbitration, and selecting switching points having a high priority as an actual transmission pair; S4, after the polling is completed in step S3, scrolling the priority arbitration arrangement to obtain a new arbitration arrangement; and S5, cyclically performing S2-S4.

Description

片上数据交换的滚轮仲裁方法及电路Roller arbitration method and circuit for on-chip data exchange 技术领域technical field
本发明涉及芯片设计、片上网络、片上系统、和计算机体系结构领域,尤其是一种片上数据交换网络的滚轮调度方法和电路实现。此种方法可以提高片上数据交换效率和速度,特别适合人工智能和大数据处理芯片,尤其是SIMT架构的芯片。The invention relates to the fields of chip design, on-chip network, on-chip system, and computer architecture, in particular to a wheel scheduling method and circuit realization of an on-chip data exchange network. This method can improve the efficiency and speed of on-chip data exchange, and is especially suitable for artificial intelligence and big data processing chips, especially chips with SIMT architecture.
背景技术Background technique
机器学习、科学计算和图形渲染需要巨大的计算能力,一般由大型芯片(如GPU、TPU、APU等)提供这样的算力,来实现高度复杂的机器学习任务和图形处理任务。用机器学习来做识别需要巨大的深度(Deep Learning)网络和海量的图像数据,训练过程非常耗时;一个三维应用或游戏场景中,若采用递归光追踪(Recursive Ray-Tracing)渲染,且场景复杂,则需要做海量运算,也需要传递海量数据。这就要求极高的计算性能,也因此需要极宽的数据交换带宽来支持这样的需求。高性能的片上交换器就成了AI和GPU芯片的重要组成部件。Machine learning, scientific computing and graphics rendering require huge computing power, which is generally provided by large chips (such as GPU, TPU, APU, etc.) to achieve highly complex machine learning tasks and graphics processing tasks. Using machine learning to do recognition requires a huge deep learning network and massive image data, and the training process is very time-consuming; in a 3D application or game scene, if recursive ray-tracing (Recursive Ray-Tracing) is used for rendering, and the scene If it is complex, it needs to do massive calculations and transfer massive data. This requires extremely high computing performance, and therefore requires extremely wide data exchange bandwidth to support such demands. High-performance on-chip switches have become an important component of AI and GPU chips.
对于AI和图形计算这类特定场景,片上缓存和数据交换的仲裁方法非常重要。效率低的仲裁(arbitration)方法和仲裁器(arbiter)设计会成为系统的瓶颈,极大影响系统的性能。因此,仲裁方法和仲裁器电路必须实现高性能和低复杂度。For specific scenarios such as AI and graphics computing, the arbitration method of on-chip caching and data exchange is very important. Low-efficiency arbitration (arbitration) method and arbiter (arbiter) design will become the bottleneck of the system, greatly affecting the performance of the system. Therefore, the arbitration method and the arbiter circuit must achieve high performance and low complexity.
网络交换器中的仲裁方法研究有悠久的历史,特别是在互联网快速发展阶段的研究成果众多。Jonathan Chao的著作《High Performance Switches and Routers》(Wiley-IEEE Press,2007)和George Varghese的著作《Network Algorithmics,:An Interdisciplinary Approach to Designing Fast Networked Devices》(Morgan Kaufmann,2004)对这些做了综述和各种方法的叙述。虚拟输出排队(Virtual Output Queue-VOQ)交换器是典型的数据交换方式,这方面的研究成果颇丰。The research on arbitration methods in network switches has a long history, especially in the rapid development stage of the Internet, there are many research results. Jonathan Chao's book "High Performance Switches and Routers" (Wiley-IEEE Press, 2007) and George Varghese's book "Network Algorithmics,: An Interdisciplinary Approach to Designing Fast Networked Devices" (Morgan Kaufmann, 2004) made a review of these and Description of various methods. Virtual output queue (Virtual Output Queue-VOQ) switch is a typical data exchange method, and the research results in this area are quite abundant.
这方面的重要成果包括PIM、RRM、iSLIP、DRRM以及GA这几类方法。Important results in this area include methods such as PIM, RRM, iSLIP, DRRM, and GA.
其中,PIM由于每次选择是随机的,且需要三个步骤,存在公平性和复杂性的问题。Among them, PIM has problems of fairness and complexity because each selection is random and requires three steps.
而RRM和iSLIP采用优先级轮询仲裁比随机仲裁逻辑更简单，iSLIP又对grant指针跳转条件做了改进，公平性有了改善，但还是需要三个步骤，使得同样存在复杂性问题，很难实现高速电路。RRM and iSLIP use priority round-robin arbitration, which is simpler than random arbitration logic; iSLIP further improves the grant-pointer update condition, improving fairness. However, three steps are still required, so the complexity problem remains and high-speed circuits are difficult to implement.
DRRM在输入和输出有两个独立的轮询仲裁机制执仲裁，比iSLIP方案的仲裁时间更短，同时实现了和iSLIP相当的性能；GA则在DRRM的基础上将输出端口的Grant信息给输入端口，虽然提高了仲裁效率但复杂性比DRRM增加了更多。由于复杂度随着端口增加而指数增长（N³logN），DRRM和GA随着端口增加很难在高速电路上实现两次以上的迭代仲裁。 DRRM has two independent round-robin arbitration mechanisms, one at the inputs and one at the outputs, giving a shorter arbitration time than the iSLIP scheme while achieving performance equivalent to iSLIP; GA builds on DRRM by feeding the Grant information of the output ports back to the input ports, which improves arbitration efficiency but adds more complexity than DRRM. Since the complexity grows rapidly with the number of ports (N³logN), it is difficult for DRRM and GA to complete more than two arbitration iterations in high-speed circuits as the port count increases.
发明内容Contents of the invention
本发明针对背景技术中存在的问题,提出了一种片上数据交换网络的滚轮调度方法和电路实现。Aiming at the problems existing in the background technology, the present invention proposes a wheel scheduling method and circuit realization of an on-chip data exchange network.
本发明首先公开了一种片上数据交换的滚轮仲裁方法,基于N输入端口N输出端口的的NxN交叉网络,1个输入端口对应的所有输出端口为一行,1个输出端口对应的所有输入端口为一列,每个输入输出交换点为一个传输对;它包括以下步骤:The present invention firstly discloses a roller arbitration method for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports, all output ports corresponding to one input port are one row, and all input ports corresponding to one output port are One column, each input and output switching point is a transmission pair; it includes the following steps:
S1、确定优先仲裁排列W_VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]}，a、b、c…x∈[0,N-1]且互不相同；优先仲裁排列W_VOQ中的N个元素表示N个预期传输对，其中：VOQ[0,a]表示预期输入端口为PI_0，预期输出端口为PO_a的预期传输对；非优先仲裁排列W_VOQ中的传输对为非预期传输对； S1. Determine the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0,N-1] and are mutually distinct; the N elements in the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are non-expected transmission pairs;
S2、判断排列中各预期传输对是否有传输需求,是则确定为实际传输对,确定的实际传输对即可进行数据传输;S2. Judging whether each expected transmission pair in the arrangement has a transmission demand, if yes, it is determined as an actual transmission pair, and the determined actual transmission pair can perform data transmission;
S3、对于非预期传输对的交换点,首先进行每个输出端口的列仲裁或所处输入端口的列仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的非预期传输对作为实际传输对;S3. For the switching point of an unexpected transmission pair, first perform the column arbitration of each output port or the column arbitration of the input port where it is located to obtain the possible actual transmission pair; then perform row arbitration or column arbitration for the possible actual transmission pair, Select the unexpected transmission pair with high priority as the actual transmission pair;
S4、步骤S3轮询完毕后,满足一定条件时,优先仲裁排列W VOQ滚动获得新的仲裁排列W' VOQ;否则滚轮保持不动; S4, after the polling of step S3 is completed, when certain conditions are met, the priority arbitration arrangement W VOQ scrolls to obtain a new arbitration arrangement W'VOQ ; otherwise, the scroll wheel remains motionless;
S5、循环进行S2-S4。S5. Perform S2-S4 in a loop.
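For illustration only (this worked example is not part of the original text), take N = 4 with the identity arrangement and a roll step of n = 1; the helper function below is a hypothetical sketch of the resulting expected-output mapping.

```cpp
// N = 4 illustration of the priority arbitration arrangement and one roll with n = 1:
//   W_VOQ  = {VOQ[0,0], VOQ[1,1], VOQ[2,2], VOQ[3,3]}   (expected pairs before rolling)
//   W'_VOQ = {VOQ[0,1], VOQ[1,2], VOQ[2,3], VOQ[3,0]}   (expected pairs after one roll)
int expected_output(int input_port, int rolls_done, int n_ports) {
    return (input_port + rolls_done) % n_ports;  // inputs keep their row, outputs shift by +1 per roll
}
```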
优选的，S1中，确定优先仲裁排列W_VOQ={VOQ[0,0],VOQ[1,1],VOQ[2,2],…,VOQ[N-1,N-1]}。 Preferably, in S1, the priority arbitration arrangement is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
优选的,S2和S3中,对于burst传输应用,在确定为实际传输对前,需要确认该传输对上没有未完成的传输。Preferably, in S2 and S3, for the burst transmission application, before determining the actual transmission pair, it needs to confirm that there is no unfinished transmission on the transmission pair.
优选的,S2和S3中,确定为实际传输对后,对实际传输对所处的行和列进行清除,不再参与下次传输需求仲裁。Preferably, in S2 and S3, after the actual transmission pair is determined, the row and column where the actual transmission pair is located are cleared, and no longer participate in the next transmission demand arbitration.
优选的,S4中,所述滚动条件为每个有请求的预期传输对都被确认过实际传输对。Preferably, in S4, the rolling condition is that each requested expected transmission pair has been confirmed by an actual transmission pair.
作为一种滚轮滚动方式,S4中,所述滚动为所有的预期输入端口不变,所有的预期输出端口+n,n的取值需保证滚轮循环滚动时,覆盖所有的传输对。As a scrolling mode of the wheel, in S4, the scrolling is that all expected input ports remain unchanged, and all expected output ports+n, and the value of n needs to ensure that all transmission pairs are covered when the wheel scrolls cyclically.
作为另一种滚轮滚动方式,S4中,所述滚动为所有的预期输出端口不变,所有的预期 输入端口+n,n的取值需保证滚轮循环滚动时,覆盖所有的传输对。As another wheel scrolling method, in S4, the scrolling is that all expected output ports remain unchanged, all expected input ports+n, and the value of n needs to ensure that all transmission pairs are covered when the wheel is cyclically scrolled.
本发明还公开了一种片上数据交换的滚轮仲裁电路,基于N输入端口N输出端口的的NxN交叉网络,它包括:The invention also discloses a roller arbitration circuit for on-chip data exchange, which is based on an NxN crossover network with N input ports and N output ports, which includes:
-滚动点选择电路,用于确定NxN交叉网络中的优先仲裁排列W VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]},以及优先仲裁排列W VOQ的滚动更新; -Rolling point selection circuit, used to determine the priority arbitration arrangement in NxN crossover network W VOQ ={VOQ[0,a], VOQ[1,b], VOQ[2,c],...,VOQ[N-1, x]}, and the rolling update of priority arbitration ranking W VOQ ;
-期望点匹配电路,用于判断排列中各预期传输对是否有传输需求,将有传输需求的标记为实际传输对,无传输需求的标记为非传输对;- Expected point matching circuit, used to judge whether each expected transmission pair in the arrangement has a transmission demand, mark the one with the transmission demand as the actual transmission pair, and mark the one without the transmission demand as the non-transmission pair;
-行列轮询仲裁电路,用于进行交换点所处列的列仲裁或所处行的行仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的交换点作为实际传输对。-Row-column polling arbitration circuit, which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority The high switching point acts as the actual transmission pair.
优选的,它还包括:Preferably, it also includes:
-行列清除电路,NxN交叉网络的每一个交换点设置一个清除逻辑,用于预期传输对确定实际传输对后,禁止匹配点所处的行和列再参与传输需求仲裁。- Row and column clearing circuit, each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the expected transmission pair determines the actual transmission pair.
具体的,滚动点选择电路:NxN交叉网络的每一行、每一列各设置1个k位的寄存器,k=ceiling(N),行/列的寄存器顺序移动寄存值实现滚动。Specifically, the scrolling point selection circuit: each row and each column of the NxN crossover network is provided with a k-bit register, k=ceiling(N), and the register values of the row/column are sequentially moved to realize scrolling.
具体的,期望点匹配电路:NxN交叉网络的每一个交换点设置比较器,通过交换点所属的行寄存器与列寄存器的值比较,判断交换点是否为期望点。Specifically, the expected point matching circuit: each switching point of the NxN crossover network is provided with a comparator, and the value of the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is an expected point.
具体的,行列轮询仲裁电路:NxN交叉网络的每一行、每一列都设置一个arbiter,每个arbiter有一个优先级指针通过滚轮的跳转而滚动。Specifically, the row-column polling arbitration circuit: each row and each column of the NxN crossover network is provided with an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel.
本发明的有益效果Beneficial effects of the present invention
本申请提出了一种片上数据交换网络的滚轮调度方法和电路实现,基于滚动调度实现了各交换点数据传输机会的公平。基于初次列/行轮询保障了各交换点仲裁的遍历;基于再次行/列轮询保障了各输入-输出的唯一,避免冲突。基于传输对确定后,交换点所属行列的清除,既保障了各输入-输出的唯一;又减少了仲裁冲突和次数,提高了传输效率。基于滚轮优先级调度算法,避免了对滚轮上的预传输对做仲裁,只需判断滚轮上的预期传输对是否存在传输请求。判断逻辑简单,仲裁时间更短,易于实现高速电路,尤其在对时序要求比较严苛的高速电路中优势尤为明显,在我们芯片上轻松实现两次迭代。The present application proposes a rolling scheduling method and circuit implementation of an on-chip data switching network, and realizes the fairness of data transmission opportunities of each switching point based on rolling scheduling. Based on the initial column/row polling, the traversal of the arbitration of each switching point is guaranteed; based on the second row/column polling, the uniqueness of each input-output is guaranteed to avoid conflicts. After the transmission pair is determined, the ranks and columns to which the switching point belongs are cleared, which not only ensures the uniqueness of each input-output, but also reduces arbitration conflicts and times, and improves transmission efficiency. Based on the wheel priority scheduling algorithm, it avoids the arbitration of the pre-transmission pairs on the wheel, and only needs to judge whether there is a transmission request for the expected transmission pair on the wheel. The judgment logic is simple, the arbitration time is shorter, and it is easy to implement high-speed circuits, especially in high-speed circuits with strict timing requirements. Two iterations can be easily implemented on our chip.
附图说明Description of drawings
图1为NxN的交叉网络路由结构示意图Figure 1 is a schematic diagram of NxN cross network routing structure
图2为代表优先权的滚轮示意图Figure 2 is a schematic diagram of a wheel representing priority
图3为行列选择清除示意图Figure 3 is a schematic diagram of selecting and clearing rows and columns
图4为滚轮的移动示意图Figure 4 is a schematic diagram of the movement of the roller
图5a为以FIFO方式实现VOQ电路的示意图Figure 5a is a schematic diagram of a VOQ circuit implemented in a FIFO manner
图5b为以管理链表指针方式实现VOQ电路的示意图Figure 5b is a schematic diagram of implementing a VOQ circuit by managing linked list pointers
图6a为滚轮点选择电路示意图Figure 6a is a schematic diagram of the wheel point selection circuit
图6b为滚动点选择电路中寄存器行(列)递增更新示意图Figure 6b is a schematic diagram of incremental updating of register rows (columns) in the scroll point selection circuit
图7为期望点匹配电路示意图Figure 7 is a schematic diagram of the desired point matching circuit
图8为滚轮pattern示意图Figure 8 is a schematic diagram of the wheel pattern
图9为行列轮询仲裁电路示例图Figure 9 is an example diagram of the row and column polling arbitration circuit
图10为构成PIM调度算法的一次迭代的三个步骤的示例图Figure 10 is an example diagram of three steps constituting an iteration of the PIM scheduling algorithm
图11为构成RRM调度算法的一次迭代的三个步骤的示例图Figure 11 is an example diagram of three steps constituting one iteration of the RRM scheduling algorithm
图12为iSLIP算法的示例图Figure 12 is an example diagram of the iSLIP algorithm
图13为DRRM算法的示例图Figure 13 is an example diagram of the DRRM algorithm
图14为GA算法的示例图Figure 14 is an example diagram of the GA algorithm
具体实施方式Detailed ways
下面结合实施例对本发明作进一步说明,但本发明的保护范围不限于此:The present invention will be further described below in conjunction with embodiment, but protection scope of the present invention is not limited to this:
图1给出了NxN的交叉网络路由结构:I0,I1,……,IN-1与O0,O1,……,ON-1的每个交叉点都是一条路由路径,也称为一个传输对。每个交叉点为一个VOQij请求路径,VOQ的下标第一个数字表示输入端口编号,第二个数字表示输出端口编号。每个输入端口Ii在一个周期只能有一个路由节点被选中,同样每个输出端口在一个周期也只能有一个路由节点被选中。一个周期最多有N条路径被选中。Figure 1 shows the routing structure of the NxN crossover network: each intersection of I0, I1, ..., IN-1 and O0, O1, ..., ON-1 is a routing path, also called a transmission pair . Each intersection is a VOQij request path, the first number of the VOQ subscript indicates the input port number, and the second number indicates the output port number. Each input port Ii can only have one routing node selected in one cycle, and each output port can only have one routing node selected in one cycle. There are at most N paths selected in one cycle.
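As a concrete illustration (not taken from the patent), the request state of such an NxN crossbar can be modeled in software as a matrix of per-VOQ records. The field names below (req, cnt, stat, row) are assumptions chosen to mirror the identifiers that appear later in this description; the same types are reused by the sketches that follow.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;  // number of input/output ports (example value)

// One record per crossing point: VOQ[i][j] is the transmission pair (input i, output j).
struct VoqCell {
    bool req  = false;  // a request is queued for this transmission pair
    int  cnt  = 0;      // remaining beats of a burst on this pair (0 = no unfinished transfer)
    bool stat = false;  // set when the pair wins the wheel-priority step in this cycle
    bool row  = false;  // set when the pair wins the row/column round-robin step
};

using VoqMatrix = std::array<std::array<VoqCell, N>, N>;
```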
为确保每个输入端口实现公平仲裁,每个输入端口获得尽可能均等的数据传输量,每个输入端口的每个虚拟输出排队VOQ获得尽可能均等的数据传输量,本发明公开了一种片上数据交换的滚轮仲裁方法,包括以下步骤:In order to ensure that each input port realizes fair arbitration, each input port obtains the data transmission volume as equal as possible, and each virtual output queue VOQ of each input port obtains the data transmission volume as equal as possible, the invention discloses an on-chip The wheel arbitration method for data exchange includes the following steps:
S1、确定优先仲裁排列W_VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]}，a、b、c…x∈[0,N-1]且互不相同；优先仲裁排列W_VOQ中的N个元素表示N个预期传输对，其中：VOQ[0,a]表示预期输入端口为PI_0，预期输出端口为PO_a的预期传输对；非优先仲裁排列W_VOQ中的传输对为非预期传输对； S1. Determine the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0,N-1] and are mutually distinct; the N elements in the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are non-expected transmission pairs;
结合图8，在某一时刻，滚轮代表了对于每个输入端口的预期输出端口，N个输入端口就有N个预期的输出端口。这N个预期的端口都是不同的，也就是{0,1,2,…,N-1}的某个排列(permutation)，一种排列成为一个pattern，一共有N!个pattern。初始化时，可以直接使用{0,1,2,…,N-1}，也就是说输入端口PI_0预期输出端口PO_0，输入端口PI_1预期输出端口PO_1，…，输入端口PI_(N-1)预期输出端口PO_(N-1)。我们将这一排列记作W={(0,0),(1,1),(2,2),…,(N,N)}，如图2中小六边形空心节点所示(本实施例中，a=1,b=2,c=3,……,x=N-1)。Referring to FIG. 8, at a certain moment the wheel represents the expected output port for each input port, so the N input ports have N expected output ports. These N expected ports are all different, that is, they form a permutation of {0,1,2,...,N-1}; each permutation is called a pattern, and there are N! patterns in total. At initialization, {0,1,2,...,N-1} can be used directly, that is to say, input port PI_0 expects output port PO_0, input port PI_1 expects output port PO_1, ..., input port PI_(N-1) expects output port PO_(N-1). We record this arrangement as W={(0,0),(1,1),(2,2),...,(N,N)}, as shown by the small hollow hexagonal nodes in Figure 2 (in this embodiment, a=1, b=2, c=3, ..., x=N-1).
S2、判断排列中各预期传输对是否有传输需求,是则确定为实际传输对,确定的实际传输对即可进行数据传输;S2. Judging whether each expected transmission pair in the arrangement has a transmission demand, if yes, it is determined as an actual transmission pair, and the determined actual transmission pair can perform data transmission;
假定在t时刻W(t)={(0,(t+0)%N),(1,(t+1)%N),(2,(t+2)%N),…,(N-1,(t+N-1)%N)}，这个对应具有最高优先权的队列。也就是说，如果输入端口i有给输出端口j_i的请求，则这个请求必须被准许。根据控制根据不同的输入输出之间对应关系，可以选择不同的pattern，并在这些pattern之间滚动循环以达到公平性。Suppose that at time t, W(t)={(0,(t+0)%N),(1,(t+1)%N),(2,(t+2)%N),...,(N-1,(t+N-1)%N)}; this corresponds to the queue with the highest priority. That is, if input port i has a request for output port j_i, then this request must be granted. Depending on the desired correspondence between inputs and outputs, different patterns can be selected, and the wheel rolls cyclically among these patterns to achieve fairness.
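A minimal sketch (an assumption, not the patent's code) of how the highest-priority pattern W(t) could be computed directly from the formula above, reusing N from the request-matrix sketch:

```cpp
// Expected output port of each input port i at wheel position t:
// W(t) = {(i, (t + i) % N) | i = 0 .. N-1}, as in the formula quoted above.
std::array<std::size_t, N> pattern_at(std::size_t t) {
    std::array<std::size_t, N> expected{};
    for (std::size_t i = 0; i < N; ++i)
        expected[i] = (t + i) % N;
    return expected;
}
```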
优选的实施例中,确定实际传输对后,对实际传输对所处的行和列进行清除,不再参与传输需求仲裁。结合图3所示,pattern种一个节点在VOQ10(小六边形实心节点)上,如果这个VOQ10有请求,那么这个节点路由会被选中。同时,节点所在的行的其他节点,也就是输入端口I1的其它所有输出节点(小六边形空心节点)的请求VOQ11,VOQ12,……,VOQ1N-1,和节点所在的列,也就是输出O0的所有输入节点(黄色小六边形节点)的请求VOQ00,VOQ20,……,VOQN-10,都会被清除,而不会参与S3的行列仲裁,这样可以提高第二次行列仲裁的效率。In a preferred embodiment, after the actual transmission pair is determined, the row and column where the actual transmission pair is located are cleared, and no longer participate in the transmission demand arbitration. As shown in Figure 3, a node in the pattern is on VOQ10 (small hexagonal solid node), and if this VOQ10 has a request, then the route of this node will be selected. At the same time, other nodes in the row where the node is located, that is, all other output nodes (small hexagonal hollow nodes) of the input port I1 request VOQ11, VOQ12, ..., VOQ1N-1, and the column where the node is located, that is, the output The requests VOQ00, VOQ20, ..., VOQN-10 of all input nodes (yellow hexagonal nodes) of O0 will be cleared, and will not participate in the rank arbitration of S3, which can improve the efficiency of the second rank arbitration.
优选的实施例中,以上内容可通过以下代码实现:In a preferred embodiment, the above content can be realized by the following code:
[Code listing published as image PCTCN2022108409-appb-000001 in the original document.]
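The listing itself is only available as an image, so the following is a hedged reconstruction of what steps S1-S2 describe: walk the wheel pattern, grant every expected pair that has a pending request (and, for burst traffic, no unfinished transfer), and clear its row and column from further arbitration. It reuses the VoqMatrix sketch above; the function and flag names are illustrative, not the patent's own.

```cpp
// Steps S1-S2 sketch: grant expected pairs on the wheel and clear their rows/columns.
void grant_expected_pairs(VoqMatrix& voq,
                          const std::array<std::size_t, N>& expected,
                          std::array<bool, N>& row_free,
                          std::array<bool, N>& col_free) {
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t j = expected[i];
        if (voq[i][j].req && voq[i][j].cnt == 0) {  // request present, no unfinished burst
            voq[i][j].stat = true;  // the expected pair becomes an actual transmission pair
            row_free[i] = false;    // input port i no longer takes part in this cycle
            col_free[j] = false;    // output port j no longer takes part in this cycle
        }
    }
}
```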
S3、对于非预期传输对的交换点,首先进行每个输出端口的列仲裁或所处输入端口的列仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的非预期传输对作为实际传输对;S3. For the switching point of an unexpected transmission pair, first perform the column arbitration of each output port or the column arbitration of the input port where it is located to obtain the possible actual transmission pair; then perform row arbitration or column arbitration for the possible actual transmission pair, Select the unexpected transmission pair with high priority as the actual transmission pair;
优选的实施例中,S3步骤通过以下代码实现:In a preferred embodiment, the S3 step is realized by the following code:
[Code listing published as image PCTCN2022108409-appb-000002 in the original document.]
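Again, the published listing is an image; the sketch below is one plausible reading of step S3: a column round-robin first picks, for each still-free output, one candidate among the still-free requesting inputs, then a row round-robin keeps at most one winner per input. Pointer handling is simplified and all names are illustrative.

```cpp
// Step S3 sketch (two-stage round-robin), continuing the VoqMatrix sketch above.
void arbitrate_remaining(VoqMatrix& voq,
                         std::array<bool, N>& row_free,
                         std::array<bool, N>& col_free,
                         const std::array<std::size_t, N>& col_ptr,   // per-output priority pointer
                         const std::array<std::size_t, N>& row_ptr) { // per-input priority pointer
    std::array<int, N> col_winner{};
    col_winner.fill(-1);  // candidate input chosen by each column, -1 = none

    // First pass: each free output scans the inputs starting from its priority pointer.
    for (std::size_t j = 0; j < N; ++j) {
        if (!col_free[j]) continue;
        for (std::size_t k = 0; k < N; ++k) {
            std::size_t i = (col_ptr[j] + k) % N;
            if (row_free[i] && voq[i][j].req && voq[i][j].cnt == 0) {
                col_winner[j] = static_cast<int>(i);
                break;
            }
        }
    }
    // Second pass: each free input accepts at most one of the columns that chose it.
    for (std::size_t i = 0; i < N; ++i) {
        if (!row_free[i]) continue;
        for (std::size_t k = 0; k < N; ++k) {
            std::size_t j = (row_ptr[i] + k) % N;
            if (col_winner[j] == static_cast<int>(i)) {
                voq[i][j].row = true;  // this pair wins the row/column arbitration
                row_free[i] = false;
                col_free[j] = false;
                break;
            }
        }
    }
}
```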
最后，VOQ[i,t_i].stat==true或者VOQ[i,j].row==true的点就是仲裁成功的VOQ；这些VOQ上的数据可以立即传输，同时将突发计数器减一(VOQ[i,t_i].cnt--或者VOQ[i,j].cnt--)。 Finally, the points with VOQ[i,t_i].stat==true or VOQ[i,j].row==true are the VOQs that won arbitration; the data on these VOQs can be transmitted immediately, while the burst counter is decremented by one (VOQ[i,t_i].cnt-- or VOQ[i,j].cnt--).
在S2和S3中,对于单周期长度的传输应用,各预期传输对有传输需求时,直接认定为实际传输对并进行数据传输。对于多周期长度的传输应用(如burst传输应用),在确定为实际传输对前,需要确认该传输对上没有未完成的传输。In S2 and S3, for the transmission application with a single cycle length, when each expected transmission pair has a transmission demand, it is directly identified as an actual transmission pair and performs data transmission. For a transmission application with a multi-cycle length (such as a burst transmission application), it is necessary to confirm that there is no unfinished transmission on the transmission pair before determining it as an actual transmission pair.
S4、步骤S3轮询完毕后，这里可以设置一些条件，满足条件后，优先仲裁排列W_VOQ滚动获得新的仲裁排列W'_VOQ；这个条件可以是：保证有请求的预期传输对都被确定过是实际传输对，也就是每个有请求的预期传输对都有过至少一次传输。这个条件可以保证无饥饿性和公平性，其他条件则也需要考虑这两个特性。 S4. After the polling of step S3 is completed, certain conditions can be set here; once they are met, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ. One such condition is: every expected transmission pair that had a request has been determined to be an actual transmission pair, that is, each requested expected transmission pair has completed at least one transfer. This condition guarantees freedom from starvation and fairness; any other condition also needs to take these two properties into account.
结合图4，滚轮需要在一组pattern之间做循环滚动。如图4所示示例，对角线上的空心小六边形节点(VOQ00,VOQ11,……,VOQN-1N-1)是滚轮滚动前的pattern，黑色实心小六边形节点(VOQ10,VOQ21,……,VOQN-1N-2,VOQ0N-1)是滚轮一次滚动循环后的pattern。其他pattern之间的滚动类似，可以根据预设的滚动模式+n滚动，n的取值需保证滚轮循环滚动时，覆盖所有的传输对(对于N*N的交叉网络路由结构，以N为奇数为例，n取值为2亦能满足完全覆盖要求)。Referring to Figure 4, the wheel needs to roll cyclically through a set of patterns. In the example shown in Figure 4, the hollow small hexagonal nodes on the diagonal (VOQ00, VOQ11, ..., VOQN-1N-1) are the pattern before rolling, and the black solid small hexagonal nodes (VOQ10, VOQ21, ..., VOQN-1N-2, VOQ0N-1) are the pattern after one rolling cycle of the wheel. Rolling between other patterns is similar and can follow a preset rolling mode of +n; the value of n must ensure that all transmission pairs are covered as the wheel rolls cyclically (for an N*N cross network routing structure with N odd, for example, a value of 2 for n also satisfies the complete coverage requirement).
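On our reading of the coverage requirement (rolling by +n modulo N must eventually visit every output offset, and hence every transmission pair), a simple gcd test captures it; for instance, n = 2 covers everything exactly when N is odd. This check is an illustration, not part of the patent.

```cpp
#include <numeric>  // std::gcd (C++17)

// Rolling by +n (mod n_ports) visits every output offset iff gcd(n, n_ports) == 1.
bool roll_covers_all_pairs(unsigned n, unsigned n_ports) {
    return std::gcd(n % n_ports, n_ports) == 1u;
}
```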
优选的实施例中,S4步骤通过以下代码实现:In a preferred embodiment, step S4 is realized by the following codes:
[Code listing published as images PCTCN2022108409-appb-000003 and PCTCN2022108409-appb-000004 in the original document.]
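The S4 listing is likewise published only as an image; the sketch below assumes the rolling condition stated in the text (every expected pair that had a request has been served at least once since the last roll) and advances every expected output port by n modulo N. The bookkeeping arrays are hypothetical.

```cpp
// Step S4 sketch: roll the wheel only when every requested expected pair has been served.
// 'requested[i]' / 'served[i]' refer to the expected pair of input port i since the last roll.
bool try_roll(std::array<std::size_t, N>& expected,
              const std::array<bool, N>& requested,
              const std::array<bool, N>& served,
              std::size_t n = 1) {
    for (std::size_t i = 0; i < N; ++i)
        if (requested[i] && !served[i])
            return false;                     // condition not met: the wheel stays still
    for (std::size_t i = 0; i < N; ++i)
        expected[i] = (expected[i] + n) % N;  // inputs keep their row, outputs advance by +n
    return true;
}
```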
S5、循环进行S2-S4。S5. Perform S2-S4 in a loop.
本发明还提出了一种片上数据交换的滚轮仲裁电路，基于N输入端口N输出端口的NxN交叉网络，VOQ电路的实现分为两种方式：一种是如图5a所示FIFO方式，根据输入I_s的目的地址信息将VOQ路由请求分别路由到对应的输出端口的FIFO，这种实现方式简单，但比较耗费硬件资源；另一种方式是如图5b所示通过链表管理的方式实现VOQ：将输入存储在一组链表中，通过管理链表指针来管理VOQ信息。每一个输入端口包含了一个队列长度为M的指针寄存器队列，一个队列长度为N的头指针和尾指针，一个宽度为N的有效请求。其中M是能接收存储的最大请求个数，N是请求目标端口的个数。相对于FIFO的实现方式电路面积更小，使用更少的硬件资源。The present invention also proposes a wheel arbitration circuit for on-chip data exchange, based on an NxN crossover network with N input ports and N output ports. The VOQ circuit can be implemented in two ways. One is the FIFO approach shown in Figure 5a: according to the destination address information of input I_s, each VOQ routing request is routed to the FIFO of the corresponding output port; this implementation is simple but consumes more hardware resources. The other approach, shown in Figure 5b, implements the VOQ through linked-list management: the inputs are stored in a set of linked lists, and the VOQ information is managed by managing the linked-list pointers. Each input port contains a pointer register queue of length M, head and tail pointers of length N, and a valid-request vector of width N, where M is the maximum number of requests that can be received and stored and N is the number of request destination ports. Compared with the FIFO implementation, the circuit area is smaller and fewer hardware resources are used.
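A hedged software model of the linked-list VOQ bookkeeping described for Figure 5b: M pointer slots shared by all N destinations of one input port, with per-destination head/tail pointers and valid bits. The structure is our interpretation of the text, not the patent's circuit.

```cpp
#include <array>
#include <cstddef>

// Linked-list VOQ for one input port: M buffered requests shared by N destination ports.
template <std::size_t M, std::size_t N_PORTS>
struct LinkedListVoq {
    std::array<std::size_t, M>       next{};   // pointer-register queue: index of the next slot in a list
    std::array<std::size_t, N_PORTS> head{};   // head pointer per destination port
    std::array<std::size_t, N_PORTS> tail{};   // tail pointer per destination port
    std::array<bool, N_PORTS>        valid{};  // request-valid flag per destination port

    // Append the buffered request stored in 'slot' to the list of destination 'dst'.
    void push(std::size_t slot, std::size_t dst) {
        if (!valid[dst]) head[dst] = slot;
        else             next[tail[dst]] = slot;
        tail[dst]  = slot;
        valid[dst] = true;
    }
};
```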
滚轮仲裁电路还包括:The wheel arbitration circuit also includes:
-滚动点选择电路,用于确定NxN交叉网络中的优先仲裁排列W VOQ={VOQ[0,a],VOQ[1,b],VOQ[2,c],…,VOQ[N-1,x]},以及优先仲裁排列W VOQ的滚动更新; -Rolling point selection circuit, used to determine the priority arbitration arrangement in NxN crossover network W VOQ ={VOQ[0,a], VOQ[1,b], VOQ[2,c],...,VOQ[N-1, x]}, and the rolling update of priority arbitration ranking W VOQ ;
优选的实施例中,滚动点选择电路为:NxN交叉网络的每一行、每一列各设置1个k位的寄存器,k=ceiling(N),一行的寄存器可以顺序移动寄存的值,一列的寄存器也可以顺序移动寄存值。行寄存器记作R[0],R[1],…,R[N-1]。列寄存器记作C[0],C[1],…,C[N-1]。初始化时,R[0]=0,R[1]=1,…,R[N-1]=N-1,C[0]=0,C[1]=1,…,C[N-1]=N-1。每次滚动时,C[0]=C[N-1],C[N-1]=C[N-2],C[N-2]=C[N-3],…,C[2]=C[1],C[1]=C[0],而R[N-1],…,R[0]保持不变。In a preferred embodiment, the rolling point selection circuit is: each row and each column of the NxN crossover network are respectively provided with a k-bit register, k=ceiling (N), and the registers of one row can move the registered value sequentially, and the registers of one column Registered values can also be moved sequentially. The row registers are denoted as R[0], R[1], . . . , R[N-1]. The column registers are denoted as C[0], C[1], . . . , C[N-1]. When initializing, R[0]=0, R[1]=1,...,R[N-1]=N-1, C[0]=0, C[1]=1,...,C[N- 1] = N-1. Every time you scroll, C[0]=C[N-1], C[N-1]=C[N-2], C[N-2]=C[N-3],...,C[2 ]=C[1], C[1]=C[0], while R[N-1],...,R[0] remain unchanged.
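A small sketch of the register rotation just described (the column registers rotate by one position per roll, the row registers stay fixed); this illustrates the update rule only and is not RTL.

```cpp
// Rolling-point selection sketch: C[0] takes the old C[N-1] and every other C[j] takes
// the old C[j-1], while the row registers R[0..N-1] remain unchanged.
void roll_column_registers(std::array<std::size_t, N>& C) {
    std::size_t last = C[N - 1];
    for (std::size_t j = N - 1; j > 0; --j)
        C[j] = C[j - 1];
    C[0] = last;
}
```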
结合图6a,给出了一个示例pattern,每个结点的行列分别互斥。pattern每个节点都有一个路由节点编号。结合图6b,是每个pattern按照行(列)递增的方式,完成对应列(行)的节点更新,列(行)预选选定被存在寄存器中。Combined with Figure 6a, an example pattern is given, and the rows and columns of each node are mutually exclusive. Each node of pattern has a routing node number. In combination with Fig. 6b, each pattern completes the node update of the corresponding column (row) in the manner of incrementing the row (column), and the column (row) is pre-selected and stored in the register.
-期望点匹配电路,用于判断排列中各预期传输对是否有传输需求,将有传输需求的标记为实际传输对,无传输需求的标记为非传输对;- Expected point matching circuit, used to judge whether each expected transmission pair in the arrangement has a transmission demand, mark the one with the transmission demand as the actual transmission pair, and mark the one without the transmission demand as the non-transmission pair;
优选的实施例中,期望点匹配电路为:NxN交叉网络的每一个交换点设置比较器,通过交换点所属的行寄存器与列寄存器的值比较,判断交换点是否为期望点。In a preferred embodiment, the expected point matching circuit is: each switching point of the NxN crossover network is provided with a comparator, and the row register to which the switching point belongs is compared with the value of the column register to determine whether the switching point is the desired point.
结合图7,Pattern上的每一个节点都会输出一个行列路由编号,当这个编号和VOQ的 路由编号相等时即pattern的节点上有对应的VOQ请求时即期望点匹配。将当前pattern所有节点遍历判断就找出滚轮上期望匹配点。Combined with Figure 7, each node on the Pattern will output a row and column routing number. When this number is equal to the routing number of the VOQ, that is, when there is a corresponding VOQ request on the node of the pattern, the expected point matches. Traverse and judge all the nodes of the current pattern to find the expected matching point on the wheel.
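An illustrative model (not the patent's circuit) of the expected-point match: switching point (i, j) lies on the current wheel pattern when its row register equals its column register, and it is a matched expected point when the corresponding VOQ also has a request.

```cpp
// Expected-point matching sketch: one comparator per crossing point, continuing the sketches above.
bool is_expected_match(const VoqMatrix& voq,
                       const std::array<std::size_t, N>& R,   // row registers
                       const std::array<std::size_t, N>& C,   // column registers
                       std::size_t i, std::size_t j) {
    return R[i] == C[j] && voq[i][j].req;
}
```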
-行列轮询仲裁电路,用于进行交换点所处列的列仲裁或所处行的行仲裁,获取可能的实际传输对;再针对可能的实际传输对进行行仲裁或列仲裁,选择优先级高的交换点作为实际传输对。-Row-column polling arbitration circuit, which is used to perform column arbitration or row arbitration in the column where the switching point is located, to obtain possible actual transmission pairs; then perform row arbitration or column arbitration for possible actual transmission pairs, and select the priority The high switching point acts as the actual transmission pair.
图9是较完整的滚轮仲裁电路结构图。当做完滚轮上的期望点匹配后,按照图中所示在行和列两个方向上做两次仲裁就可以完成行列轮询仲裁(列循环的结果不能保证每一列仲裁的行都不相同。因此,需要行轮循电路把列轮循的结果再做一次仲裁,使得最终的结果每行和每列中最多只有一个选中的点,不会产生数据传输的冲突),完成一个周期内的所有通路路由仲裁。图9中每一行每一列都有一个arbiter,每个arbiter有一个优先级指针通过滚轮的跳转而滚动在图中可以通过移位寄存器实现。Fig. 9 is a relatively complete structural diagram of the wheel arbitration circuit. After matching the expected points on the wheel, do two arbitrations in the row and column directions as shown in the figure to complete the row-column polling arbitration (the result of column looping cannot guarantee that the rows of each column arbitration are different. Therefore, the row round-robin circuit is required to arbitrate the result of the column round-robin again, so that the final result has at most one selected point in each row and column, and there will be no conflict of data transmission), and complete all operations in one cycle Access routing arbitration. In Figure 9, each row and each column has an arbiter, and each arbiter has a priority pointer that scrolls through the jump of the scroll wheel. In the figure, it can be realized by a shift register.
优选的实施例中,它还包括:In a preferred embodiment, it also includes:
-行列清除电路,NxN交叉网络的每一个交换点设置一个清除逻辑,用于输入端口确定实际传输对后,禁止匹配点所处的行和列参与传输需求仲裁。实现方案为:在输入端口本地将这些VOQ的标记排除。- Row and column clearing circuit, each switching point of the NxN crossover network is provided with a clearing logic, which is used to prohibit the row and column where the matching point is located from participating in the transmission demand arbitration after the input port determines the actual transmission pair. The implementation scheme is: locally exclude the marks of these VOQs at the input port.
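Putting the earlier sketches together, one arbitration cycle as described here might look like the driver below; it is illustrative only and omits the burst handling and pointer updates a real circuit would need.

```cpp
// One illustrative arbitration cycle combining the sketches above.
void arbitration_cycle(VoqMatrix& voq,
                       const std::array<std::size_t, N>& expected,
                       const std::array<std::size_t, N>& col_ptr,
                       const std::array<std::size_t, N>& row_ptr) {
    std::array<bool, N> row_free, col_free;
    row_free.fill(true);
    col_free.fill(true);

    grant_expected_pairs(voq, expected, row_free, col_free);        // wheel-priority step (S1-S2)
    arbitrate_remaining(voq, row_free, col_free, col_ptr, row_ptr); // row/column round-robin (S3)
    // Rolling (S4) is driven separately, e.g. by try_roll(), once every requested
    // expected pair has been served at least once.
}
```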
对比现有的CrossBar的仲裁算法Compared with the existing CrossBar arbitration algorithm
目前大多数调度算法都是最大化每次仲裁的连接通路，从而实现最大化带宽，但是这些算法过于复杂，且复杂度随着端口增加而指数增长（N³logN）无法在硬件中实现，而且需要很长时间才能完成。现在一般的CrossBar都基于迭代或者非迭代的循环算法。比较经典的算法有FIFO，PIM，iSLIP、DRRM和GA算法等。基于这些算法分析滚轮仲裁算法和这些算法之间的区别和优势。At present, most scheduling algorithms maximize the number of connections matched in each arbitration in order to maximize bandwidth, but these algorithms are too complex; their complexity grows rapidly with the number of ports (N³logN), so they cannot be implemented in hardware and take a long time to complete. Current CrossBar designs are generally based on iterative or non-iterative round-robin algorithms. The classic algorithms include FIFO, PIM, iSLIP, DRRM and GA. The differences and advantages of the wheel arbitration algorithm relative to these algorithms are analyzed below.
1. Introduction of VOQ
Early input ports used a single FIFO per input to queue cells waiting for arbitration; with such FIFO-based scheduling the maximum achievable throughput is only 58.6% because of head-of-line blocking. Virtual output queues (VOQ) were later introduced, raising the achievable throughput to 100% and making it possible to find a maximum match in every cycle. In our implementation the VOQs are managed as linked lists, which minimizes the hardware resources required; the advantage of this implementation is even more pronounced in large, high-latency network routers.
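One way to picture the linked-list VOQ management (a sketch under assumed data structures; the patent does not provide this code) is a shared buffer pool per input port in which each output's queue is just a chain of slot indices:

```python
# Minimal sketch of linked-list VOQ bookkeeping (illustrative; field names
# and sizes are assumptions, not taken from the patent).
class LinkedListVOQ:
    """One shared cell buffer per input port; each output's VOQ is a
    head/tail pair into a common `next` array, so N queues share storage."""

    def __init__(self, n_outputs, depth):
        self.free = list(range(depth))      # free-slot list
        self.next = [None] * depth          # next-pointer per buffer slot
        self.cell = [None] * depth          # payload per buffer slot
        self.head = [None] * n_outputs      # per-VOQ head pointer
        self.tail = [None] * n_outputs      # per-VOQ tail pointer

    def enqueue(self, out_port, payload):
        slot = self.free.pop()
        self.cell[slot], self.next[slot] = payload, None
        if self.head[out_port] is None:
            self.head[out_port] = slot      # queue was empty
        else:
            self.next[self.tail[out_port]] = slot
        self.tail[out_port] = slot

    def dequeue(self, out_port):
        slot = self.head[out_port]
        self.head[out_port] = self.next[slot]
        if self.head[out_port] is None:
            self.tail[out_port] = None
        self.free.append(slot)
        return self.cell[slot]

q = LinkedListVOQ(n_outputs=4, depth=8)
q.enqueue(2, "cell-A"); q.enqueue(2, "cell-B")
print(q.dequeue(2), q.dequeue(2))           # cell-A cell-B
```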
2. PIM algorithm
PIM is the Parallel Iterative Matching algorithm. It proceeds in three steps, as shown in Figure 10, which gives an example of the three steps making up one iteration of the PIM scheduler.
Step 1: Request. Each input sends a request to every output for which it has a queued cell.
Step 2: Grant. Each output independently selects, uniformly at random, one input from among those requesting it. In this example inputs 1 and 3 both request output 2, and output 2 chooses to grant input 3.
Step 3: Accept. Each input randomly selects one output from among those that granted it. In this example outputs 2 and 4 both grant input 3, and input 3 chooses to accept output 2.
In this example the first iteration leaves input 4 and output 4 unmatched even though this pair conflicts with no other connection; that connection is established in the second iteration.
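As a software sketch only (the variable names and the random-choice model are assumptions, not taken from the patent or Figure 10), one PIM iteration can be written as:

```python
# Sketch of one PIM (Parallel Iterative Matching) iteration; illustrative only.
import random

def pim_iteration(requests, matched_in, matched_out):
    """requests[i][j]: input i has a cell for output j.
    matched_in / matched_out: ports already matched in earlier iterations."""
    n = len(requests)
    # Grant: each unmatched output picks one requesting unmatched input at random.
    grants = {}                              # input -> list of granting outputs
    for j in range(n):
        if j in matched_out:
            continue
        reqs = [i for i in range(n) if requests[i][j] and i not in matched_in]
        if reqs:
            grants.setdefault(random.choice(reqs), []).append(j)
    # Accept: each input that received grants accepts one of them at random.
    new_pairs = []
    for i, outs in grants.items():
        j = random.choice(outs)
        new_pairs.append((i, j))
        matched_in.add(i); matched_out.add(j)
    return new_pairs

reqs = [[0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
print(pim_iteration(reqs, set(), set()))
```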
Three characteristics of the PIM algorithm:
First, each iteration only considers the pairs left unmatched by previous iterations, so on the order of logN iterations are needed to complete the maximum match.
Second, it ensures that every request is eventually granted; no input VOQ is permanently passed over by the arbitration.
Third, it keeps no memory or state to track when connections were established in the past.
Performance of the PIM algorithm: because every choice in PIM's arbitration is random, the following limitations apply.
First, random arbitration is not conducive to high speed: each arbiter must make a uniformly random selection among all of its eligible candidates, which is costly in hardware.
Second, when the crossbar is overloaded, PIM can allocate bandwidth unfairly between connections.
Finally, PIM performs poorly with a single iteration: it limits throughput to roughly 63%, only slightly higher than a FIFO-queued switch. Under overload, the efficiency of a single-iteration PIM crossbar is only 1 − 1/e ≈ 63%.
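For completeness, the 63% figure follows from a standard argument (sketched here; the text above only quotes the result): under saturation every input requests every output and each of the N outputs grants one requesting input uniformly at random, so a given input receives no grant with probability $\left(\tfrac{N-1}{N}\right)^{N}$, which tends to $1/e$ as $N$ grows; the expected fraction of inputs matched in a single iteration therefore tends to $1-\tfrac{1}{e}\approx 63\%$.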
3. RRM algorithm
RRM is the Basic Round-Robin Matching algorithm. It is likewise divided into three steps; Figure 11 gives an example of the three steps making up one iteration of the RRM scheduler.
RRM potentially overcomes two problems of PIM: complexity and unfairness. A round-robin arbiter implemented as a priority encoder is much simpler and faster than a random arbiter, and rotating the priority helps the algorithm share bandwidth more fairly among the requesting connections. The three arbitration steps are:
Step 1: Request. Each input sends a request to every output for which it has a queued VOQ.
Step 2: Grant. If an output receives any requests, it grants the next request in a fixed round-robin schedule starting from its highest-priority element. The output notifies every input whether its request was granted, and the output's pointer is advanced to one position beyond the granted input, regardless of whether that grant is later accepted.
Step 3: Accept. If an input receives one or more grants, it accepts the next one in a fixed round-robin schedule starting from its highest-priority element, and its pointer is advanced to one position beyond the accepted output.
RRM performance analysis
RRM becomes unstable at offered loads of only about 63%. The reason for RRM's poor performance lies in the rule for updating the pointers of the output arbiters. Consider the example shown in the figure above: inputs 1 and 2 are both heavily loaded and receive a new cell for both outputs in every cell time, but because the output grant pointers move in lock-step, only one input is served per cell time. Note how the grant pointers stay synchronized: in cell time 1 both point to input 1, in cell time 2 both point to input 2, and so on. This synchronization limits the maximum throughput for this traffic pattern to only 50%, and synchronization of the grant pointers also limits performance under random arrival patterns.
4. iSLIP algorithm
The iSLIP algorithm uses rotating-priority ("round-robin") arbitration to schedule the inputs and outputs in turn. Its main virtue is simplicity: it is easy to implement in hardware and can run at high speed. Studies show good performance under uniform traffic; for uniform, independent and identically distributed Bernoulli arrivals, a single iteration of iSLIP achieves 100% throughput. On close comparison, iSLIP is in fact a variant of the simple Basic Round-Robin Matching algorithm (RRM), which is perhaps the simplest and most important of these schemes.
Figure 12 shows an example of iSLIP.
Relative to RRM, iSLIP makes essentially one change: a grant pointer is not moved unless its grant is accepted. The Grant step of iSLIP therefore becomes:
Step 2: Grant. If an output receives any requests, it grants the next request in a fixed round-robin schedule starting from its highest-priority element and notifies every input whether its request was granted. The pointer to the highest-priority element is advanced to one position beyond the granted input only if the grant is accepted in Step 3. This small change to the algorithm has the following consequences:
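The difference can be made concrete with a small sketch (illustrative only; the class and method names are assumptions): the grant pointer advances only when the corresponding grant is actually accepted.

```python
# Sketch of an iSLIP-style grant arbiter (illustrative only).  The only
# difference from RRM is WHEN the pointer moves: here it moves only after
# its grant has been accepted, which breaks pointer synchronization.
class GrantArbiter:
    def __init__(self, n):
        self.n = n
        self.ptr = 0                  # highest-priority input

    def grant(self, requesting_inputs):
        """Round-robin pick starting at self.ptr; does NOT move the pointer."""
        for k in range(self.n):
            i = (self.ptr + k) % self.n
            if i in requesting_inputs:
                return i
        return None

    def accepted(self, granted_input):
        """Called only if the granted input accepted; pointer moves past it."""
        self.ptr = (granted_input + 1) % self.n

arb = GrantArbiter(4)
g = arb.grant({1, 3})      # grants input 1
arb.accepted(g)            # accepted -> pointer now points to input 2
print(g, arb.ptr)          # 1 2
```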
Property 1: The connection made most recently gets the lowest priority. When an arbiter advances its pointer, the input (output) that was just granted (accepted) becomes the lowest-priority candidate at that output (input), so other requesters are favoured in the next cell time.
Property 2: No connection is starved. An input keeps requesting an output until it succeeds, and because a served connection drops to lowest priority, every persistent request is granted within at most N cell times.
Property 3: Under heavy load, all queues sharing a common output receive the same throughput. This follows from Property 2: the output pointer visits each requesting input in a fixed order, giving each of them the same share.
Most importantly, this small change prevents the output arbiters from moving in lock-step, which yields a large improvement in performance.
5. DRRM algorithm
DRRM stands for Dual Round-Robin Matching. It is so named because arbitration is carried out by two independent sets of round-robin arbiters, one at the inputs and one at the outputs.
Each input port has one request arbiter and N VOQs. The request arbiter selects at most one non-empty queue according to its pointer value, i.e. the highest-priority output port. Each output port has a grant arbiter that selects one input port from among the requests it receives and returns the result to that input. When a grant arbiter serves a request it updates its pointer, and the corresponding request arbiter updates its pointer as well.
Figure 13 shows an example of the DRRM algorithm.
The DRRM algorithm consists of two stages:
Step 1: Request. Each input's request arbiter selects one request in round-robin order.
Step 2: Grant. Each output arbitrates among all the requests addressed to it and grants one input.
The DRRM scheme has a shorter arbitration time than iSLIP while achieving performance comparable to iSLIP.
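A behavioural sketch of one DRRM cycle (illustrative only; function and variable names are assumptions, and the pointer-update rule follows the description above):

```python
# Sketch of one DRRM cycle: request arbitration at the inputs, then grant
# arbitration at the outputs (illustrative only; not the patent's circuit).
def rr_pick(candidates, ptr, n):
    for k in range(n):
        idx = (ptr + k) % n
        if idx in candidates:
            return idx
    return None

def drrm_cycle(voq_nonempty, req_ptr, gnt_ptr):
    """voq_nonempty[i]: set of outputs with queued cells at input i."""
    n = len(voq_nonempty)
    # Step 1 (Request): each input proposes at most one output.
    proposals = {}                        # output -> list of proposing inputs
    for i in range(n):
        j = rr_pick(voq_nonempty[i], req_ptr[i], n)
        if j is not None:
            proposals.setdefault(j, []).append(i)
    # Step 2 (Grant): each output grants one proposing input; both pointers
    # advance only for the granted pair.
    grants = []
    for j, inputs in proposals.items():
        i = rr_pick(set(inputs), gnt_ptr[j], n)
        grants.append((i, j))
        req_ptr[i] = (j + 1) % n
        gnt_ptr[j] = (i + 1) % n
    return grants

voq = [{1, 2}, {1}, {3}, set()]
print(drrm_cycle(voq, req_ptr=[0] * 4, gnt_ptr=[0] * 4))   # e.g. [(0, 1), (2, 3)]
```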
6. GA algorithm
The GA algorithm is the Grant-Aware scheduling algorithm for VOQ-based input-buffered packet switches. As shown in Figure 14, this scheme iterates the DRRM algorithm several times within one cycle to approach the maximum arbitration efficiency, feeding the output-port arbitration results back to the input ports after each iteration. Efficiency improves, but the complexity is greater than DRRM: for even moderately large arbiters (N ≥ 8), more than two iterations are already difficult to realize in a high-speed circuit, because every iteration amounts to running a complete DRRM pass again.
7. Wheel arbitration algorithm (this application)
The greatest advantage of the wheel arbitration algorithm proposed in this patent is that wheel-priority arbitration is performed first and DRRM arbitration second, so that two arbitrations are completed within a single cycle. The first arbitration follows the wheel-priority principle: it only has to check whether there is a request at each wheel node, which makes the hardware trivial to implement and saves a large amount of time for the second arbitration. The second arbitration then applies the DRRM algorithm to the ports left over after the first-arbitration nodes have been filtered out, as sketched below.
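The following is a rough behavioural approximation of that flow, not the patented circuit itself; every data structure, name and simplification here (for example, pointer updates are omitted in the second pass) is an assumption made for illustration only.

```python
# Behavioural approximation of the two arbitrations described above
# (illustrative only; all data structures and details are assumptions).
N = 4

def rr_pick(cands, ptr):
    for k in range(N):
        idx = (ptr + k) % N
        if idx in cands:
            return idx
    return None

def wheel_then_drrm(requests, wheel_col, req_ptr, gnt_ptr):
    """requests[i][j]: input i has a queued cell for output j.
    wheel_col[i]: output favoured for input i by the current wheel pattern."""
    grants, used_in, used_out = [], set(), set()
    # First arbitration: wheel priority - accept every wheel point with a request.
    for i in range(N):
        j = wheel_col[i]
        if requests[i][j]:
            grants.append((i, j)); used_in.add(i); used_out.add(j)
    # Second arbitration: DRRM-style request/grant over the remaining ports.
    proposals = {}
    for i in range(N):
        if i in used_in:
            continue
        cands = {j for j in range(N) if requests[i][j] and j not in used_out}
        j = rr_pick(cands, req_ptr[i])
        if j is not None:
            proposals.setdefault(j, []).append(i)
    for j, inputs in proposals.items():
        i = rr_pick(set(inputs), gnt_ptr[j])
        grants.append((i, j))
    return grants

reqs = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]
print(wheel_then_drrm(reqs, wheel_col=[1, 2, 3, 0],
                      req_ptr=[0] * N, gnt_ptr=[0] * N))
# e.g. [(1, 2), (2, 3), (0, 0)]: two wheel grants plus one second-pass grant
```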
Compared with the PIM, RRM, iSLIP, DRRM and GA algorithms, the wheel-priority algorithm is simpler, more direct and easier to implement. In today's high-speed circuit designs it is difficult to complete more than two iterations of PIM, RRM, iSLIP, DRRM or GA within one clock cycle, because every iteration of those algorithms must compute new arbitration priorities for the input and output ports. In the wheel-priority algorithm the arbitration priority of the input and output ports is the pattern structure itself, which is independent of the arbitration logic; this leaves ample time for the arbitration decision, so more iterations can be completed within one clock cycle. Compared with the roughly 63% efficiency of schemes that complete one iteration per clock cycle, the wheel scheme reaches about 86%, an improvement of 23 percentage points. It has already been implemented in an AI GPU chip.
Because the wheel is a pattern chosen by the user according to the routing characteristics of the data, the arbitration time in practical applications is much shorter than that of GA, and more iterations can be performed to further raise arbitration efficiency.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (15)

  1. A wheel arbitration method for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, in which all output ports corresponding to one input port form a row, all input ports corresponding to one output port form a column, and each input-output switching point is a transmission pair; characterized in that it comprises the following steps:
    S1. Determine a priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0, N-1] and are pairwise distinct; the N elements of the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are unexpected transmission pairs;
    S2. Determine whether each expected transmission pair in the arrangement has a transmission demand; if so, it is confirmed as an actual transmission pair, and the confirmed actual transmission pair may then carry out data transmission;
    S3. For the switching points of the unexpected transmission pairs, first perform column arbitration for each output port (or row arbitration for the input port where the point is located) to obtain candidate actual transmission pairs; then perform row arbitration (or column arbitration) over those candidates, selecting the higher-priority unexpected transmission pair as an actual transmission pair;
    S4. After the polling of step S3 is complete, when a certain condition is satisfied, the priority arbitration arrangement W_VOQ rolls to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel stays still;
    S5. Repeat S2-S4 in a loop.
  2. The method according to claim 1, characterized in that in S1 the priority arbitration arrangement is determined as W_VOQ = {VOQ[0,0], VOQ[1,1], VOQ[2,2], ..., VOQ[N-1,N-1]}.
  3. The method according to claim 1, characterized in that in S2 and S3, for burst-transfer applications, before a pair is confirmed as an actual transmission pair it must be verified that there is no unfinished transfer on that transmission pair.
  4. The method according to claim 1, characterized in that in S2 and S3, after a pair is confirmed as an actual transmission pair, the row and column of that actual transmission pair are cleared and no longer take part in the next transmission-demand arbitration.
  5. The method according to claim 1, characterized in that in S4 the certain condition is that every expected transmission pair with a pending request has been confirmed as an actual transmission pair.
  6. The method according to claim 1, characterized in that in S4 the rolling keeps all expected input ports unchanged and adds n to all expected output ports, the value of n being chosen so that the wheel, rolling cyclically, covers all transmission pairs.
  7. The method according to claim 1, characterized in that in S4 the rolling keeps all expected output ports unchanged and adds n to all expected input ports, the value of n being chosen so that the wheel, rolling cyclically, covers all transmission pairs.
  8. A wheel arbitration circuit for on-chip data exchange, based on an NxN crossbar network with N input ports and N output ports, characterized in that it comprises:
    - a rolling-point selection circuit for determining the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]} in the NxN crossbar network and for rolling updates of the priority arbitration arrangement W_VOQ;
    - an expected-point matching circuit for determining whether each expected transmission pair in the arrangement has a transmission demand, marking those with a demand as actual transmission pairs and those without as non-transmission pairs;
    - a row-column polling arbitration circuit for performing column arbitration of the column (or row arbitration of the row) in which a switching point is located to obtain candidate actual transmission pairs, and then performing row arbitration (or column arbitration) over those candidates, selecting the higher-priority switching point as the actual transmission pair; the circuits operate according to the following logic:
    S1. The rolling-point selection circuit determines the priority arbitration arrangement W_VOQ = {VOQ[0,a], VOQ[1,b], VOQ[2,c], ..., VOQ[N-1,x]}, where a, b, c, ..., x ∈ [0, N-1] and are pairwise distinct; the N elements of the priority arbitration arrangement W_VOQ represent N expected transmission pairs, where VOQ[0,a] denotes the expected transmission pair whose expected input port is PI_0 and whose expected output port is PO_a; transmission pairs not in the priority arbitration arrangement W_VOQ are unexpected transmission pairs;
    S2. The expected-point matching circuit determines whether each expected transmission pair in the arrangement has a transmission demand; if so, it is confirmed as an actual transmission pair, and the confirmed actual transmission pair may then carry out data transmission;
    S3. For the switching points of the unexpected transmission pairs, the row-column polling arbitration circuit first performs column arbitration for each output port (or row arbitration for the input port where the point is located) to obtain candidate actual transmission pairs, and then performs row arbitration (or column arbitration) over those candidates, selecting the higher-priority unexpected transmission pair as an actual transmission pair;
    S4. After the polling of step S3 is complete, when a certain condition is satisfied, the rolling-point selection circuit rolls the priority arbitration arrangement W_VOQ to obtain a new arbitration arrangement W'_VOQ; otherwise the wheel stays still;
    S5. S2-S4 are repeated in a loop.
  9. The circuit according to claim 8, characterized in that it further comprises:
    - a row-column clearing circuit, in which every switching point of the NxN crossbar network has clearing logic for barring the row and column of a matched point from further participation in the transmission-demand arbitration once an expected transmission pair has been confirmed as an actual transmission pair; the operating logic of this circuit is:
    In S2: after a pair is confirmed as an actual transmission pair, the row-column clearing circuit clears the row and column of that actual transmission pair, which no longer take part in the next transmission-demand arbitration;
    In S3: after a pair is confirmed as an actual transmission pair, the row-column clearing circuit clears the row and column of that actual transmission pair, which no longer take part in the next transmission-demand arbitration.
  10. The circuit according to claim 8, characterized in that in operating logic S4 the certain condition is that every expected transmission pair with a pending request has been confirmed as an actual transmission pair.
  11. The circuit according to claim 8, characterized in that in the rolling-point selection circuit each row and each column of the NxN crossbar network is provided with one k-bit register, k = ceiling(log2(N)), and the row/column registers shift their stored values in sequence to implement the rolling.
  12. The circuit according to claim 8, characterized in that in the expected-point matching circuit every switching point of the NxN crossbar network is provided with a comparator, and the values of the row register and column register to which the switching point belongs are compared to determine whether the switching point is an expected point.
  13. The circuit according to claim 8, characterized in that in the row-column polling arbitration circuit every row and every column of the NxN crossbar network is provided with an arbiter, each arbiter having a priority pointer that rolls with the jumps of the wheel.
  14. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when executed by the processor, performs the method according to any one of claims 1-7.
  15. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to perform the method according to any one of claims 1-7.
PCT/CN2022/108409 2021-07-29 2022-07-27 Roller arbitration method and circuit for on-chip data exchange WO2023006006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110866467.4A CN113568849B (en) 2021-07-29 2021-07-29 Roller arbitration method and circuit for on-chip data exchange
CN202110866467.4 2021-07-29

Publications (1)

Publication Number Publication Date
WO2023006006A1 true WO2023006006A1 (en) 2023-02-02





