CN114741353A

CN114741353A - Coarse-grained reconfigurable array based on fully-interconnected network and processor

Info

Publication number: CN114741353A
Application number: CN202210312847.8A
Authority: CN
Inventors: 唐士斌; 欧阳鹏
Original assignee: Beijing Qingwei Intelligent Information Technology Co ltd
Current assignee: Beijing Qingwei Intelligent Information Technology Co ltd
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-07-12

Abstract

The invention provides a coarse-grained reconfigurable array (CGRA) based on a fully interconnected network, and a processor. The array includes multiple types of heterogeneous operators connected by a fully interconnected network. Because the CGRA is built from heterogeneous operators, no operator carries a large set of duplicated complex functions, which avoids substantial waste of internal logic and solves the intra-operator logic waste caused by port sharing in homogeneous operators. Because a point-to-point fully interconnected network connects the operators, any two operators can be interconnected, which solves the resource mapping waste caused by insufficient interconnection resources in the traditional MESH structure.

Description

Coarse-grained reconfigurable array based on full-interconnection network and processor

Technical Field

The invention belongs to the technical field of processors, and particularly relates to a full-interconnection-network-based coarse-grained reconfigurable array and a processor.

Background

Reconfigurable computing was proposed in 1999 by DeHon and Wawrzynek of the reconfigurable technology research center at the University of California, Berkeley. It is a computing organization with two characteristics: first, after manufacture, the functionality of the chip can still be customized to solve any problem; second, computation is carried out largely through a spatial mapping from the task onto the chip. Any computation that satisfies these characteristics may be referred to as reconfigurable computing.

A coarse-grained reconfigurable array (CGRA) processor contains a large array of processing elements. For example, the Pixel Visual Core in the Pixel 2 mobile phone released by Google in 2017 adopts CGRA technology. It includes a computing array of 12 × 12 processing elements (PEs): the central 8 × 8 array supports mathematical operations, while the peripheral PEs support only data transmission. All PEs are connected through a torus network; the overall structure is shown in Fig. 2. Fig. 3 shows a typical MESH-based CGRA structure. Each RC (reconfigurable cell) is a homogeneous reconfigurable computing unit that can perform addition, subtraction, multiplication, division, shift, AND, OR, NOT and similar arithmetic and logic operations. All RCs are connected through a MESH interconnect structure and can perform typical data flow operations.

Compared with traditional computing devices such as DSPs and GPUs, the CGRA offers advantages such as spatial mapping and data flow computing. However, the CGRA suffers from the same resource waste problem as the FPGA, which arises mainly in two ways. The first is the intra-operator logic waste introduced by port sharing in homogeneous operators (PEs). For example, a PE has all the capabilities of ALU, EALU, LOOP, MAC, SPU, etc., and while the PE performs an ALU operation it cannot perform any other operation, so its internal resources are wasted. The second is the resource mapping waste of the traditional MESH structure. Mapping the data flow graph of an algorithm onto the CGRA requires not only a correspondence between the operations of the data flow graph and the PEs, but also routing channels on the MESH that follow the data transfer relations in the data flow graph. Because of the constraint of MESH resources, many PEs often cannot be used for mapping operations of the data flow graph.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a coarse-grained reconfigurable array and a processor based on a full interconnect network.

In order to achieve the above object, the present invention adopts the following technical solutions.

In a first aspect, the invention provides a full-interconnection-network-based coarse-grained reconfigurable array, which comprises a plurality of types of heterogeneous operators connected by a full-interconnection network.

Further, the array comprises 7 kinds of heterogeneous operators, namely a branch operator SEL, a numerical operator ALU, an enhanced numerical operator EALU, a multiply-add operator MAC, a special operator SPU, a BUFFER operator BUFFER and a LOOP control operator LOOP.

Further, the fully interconnected network comprises a Benes network or a Clos network or a Crossbar network.

Furthermore, the fully interconnected network is an enhanced Benes network and comprises a first sub-Benes network and a second sub-Benes network, each with N/2 input and output ends, and N switching units; each switching unit has a first input end, a second input end, a first output end and a second output end, and has 4 working modes: direct connection, exchange, upper broadcast and lower broadcast; wherein,

N/2 switching units are connected between the input ends of the enhanced Benes network and the input ends of the first and second sub-Benes networks; the first output ends of these N/2 switching units are connected with the input ends of the first sub-Benes network, and their second output ends are connected with the input ends of the second sub-Benes network;

the other N/2 switching units are connected between the output ends of the first and second sub-Benes networks and the output ends of the enhanced Benes network; the first input ends of these N/2 switching units are connected with the output ends of the first sub-Benes network, and their second input ends are connected with the output ends of the second sub-Benes network.

Still further, the array further includes a set of compute units, Cluster, that includes one or more types of heterogeneous operators connected by an enhanced Benes network.

Further, the set of computing units Cluster comprises 24 enhanced numerical operators EALU, 48 numerical operators ALU, 12 branch operators SEL, 2 LOOP control operators LOOP, 48 BUFFER operators BUFFER connected by an enhanced Benes network.

Preferably, the number of input and output ends of the enhanced Benes network is N = 256.

Further, the array includes 4 computing unit sets Cluster, 50 multiply-add operators MAC, 10 special operators SPU, 64 BUFFER operators BUFFER and 6 LOOP control operators LOOP, connected by an enhanced Benes network.

Preferably, the number of input and output ends of the enhanced Benes network is N = 512.

In a second aspect, the present invention provides a processor comprising a coarse grain reconfigurable array as described above.

Compared with the prior art, the invention has the following beneficial effects.

The coarse-grained reconfigurable array CGRA provided by the invention consists of a plurality of types of heterogeneous operators, and the plurality of types of heterogeneous operators are connected by a fully interconnected network. Because the CGRA is formed by adopting the heterogeneous operators, each operator has no too many repeated complex functions, so that a large amount of internal logic waste can be avoided, and the problem of logic waste in the operators caused by port sharing of the homogeneous operators is solved. Because the invention adopts the point-to-point full interconnection network to connect each operator, the interconnection between any two operators is realized, and the problem of resource mapping waste caused by insufficient interconnection resources in the traditional MESH structure is solved.

Drawings

Fig. 1 is a schematic structural diagram of a coarse-grained reconfigurable array based on a fully-interconnected network according to an embodiment of the present invention.

In the figure: 1-fully interconnected network, 2-heterogeneous operator.

Fig. 2 is a schematic structural diagram of a CGRA connected through a torus network.

Fig. 3 is a schematic diagram of a CGRA structure based on MESH interconnect.

Fig. 4 is a schematic diagram of an enhanced Benes network structure.

Fig. 5 is a schematic structural diagram of a CGRA according to another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a block diagram of a full interconnection network-based coarse-grained reconfigurable array according to an embodiment of the present invention, where the coarse-grained reconfigurable array includes multiple types of heterogeneous operators 2 connected by a full interconnection network 1.

This embodiment provides a coarse-grained reconfigurable array (CGRA) composed mainly of multiple types of heterogeneous operators 2, characterized in that these operators are connected through a fully interconnected network 1. The heterogeneous operators 2 are, for example, a numerical operator ALU (Arithmetic Logic Unit), a multiply-add operator MAC (Multiply-Accumulate unit), and the like. The fully interconnected network 1 can connect any input end with any output end, so that any two heterogeneous operators 2 can be interconnected. Existing CGRAs are generally composed of homogeneous operators, mostly connected in a grid (MESH) structure. Port sharing in homogeneous operators (PEs) introduces intra-operator logic waste. For example, a PE has the capabilities of all operators such as ALU, EALU (enhanced numerical operation), LOOP (loop control), MAC and SPU (special unit); while the PE executes an ALU operation, other operations such as EALU and LOOP cannot be executed, which wastes the internal resources of the PE. To eliminate the internal resource waste caused by homogeneous operators, this embodiment adopts heterogeneous operators 2; because no operator carries a large set of duplicated complex functions, no substantial internal logic is wasted. Connecting operators with a MESH structure is the other important cause of resource waste in existing CGRAs. Mapping the data flow graph of an algorithm onto a CGRA requires establishing a correspondence between the operations of the data flow graph and the PEs, and also requires establishing routing channels on the MESH for the data transfer relations in the data flow graph, so that many PEs cannot be used for mapping operations of the data flow graph because of the constraint of MESH resources. To eliminate the resource waste caused by the operator connection structure, this embodiment connects the operators with a fully interconnected network. Because the fully interconnected network can interconnect any two operators, an operation in the data flow graph never fails to map onto an operator for lack of interconnection resources.

As an alternative embodiment, the array includes 7 kinds of heterogeneous operators 2, namely a branch operator SEL, a numerical operator ALU, an enhanced numerical operator EALU, a multiply-add operator MAC, a special operator SPU, a BUFFER operator BUFFER, and a LOOP control operator LOOP.

This embodiment specifies the kinds of heterogeneous operators 2. The CGRA of this embodiment includes 7 kinds of heterogeneous operators 2: SEL, ALU, EALU, MAC, SPU, BUFFER and LOOP. Each operator performs a different function; for example, the MAC can perform multiply-add, multiply-accumulate, multiply and accumulate operations. The functions of the heterogeneous operators 2 are shown in Table 1.

Table 1. Functional description of the heterogeneous operators

As an alternative embodiment, said fully interconnected network 1 comprises a Benes network or a Clos network or a Crossbar network.

This embodiment presents several network structures for the fully interconnected network 1. Benes, Clos and Crossbar are three common network structures, and the fully interconnected network 1 of this embodiment may be any one of them. The Benes network is a switching network that can interconnect any two points. It was originally proposed at Bell Laboratories for telecommunication systems and is still widely used in today's switch and router networks. Crossbar is also a common point-to-point network, but the interconnection cost it incurs to achieve point-to-point interconnection is N², whereas the cost of Benes is only N × Log₂N. The technical principles of these three network structures belong to mature prior art and are not described in detail here.
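To make the cost comparison concrete, the following minimal Python sketch evaluates both expressions for networks with 256 and 512 ends, the sizes named in later embodiments. The function names are illustrative and do not come from the source, and "cost" here is simply the value of N² or N × Log₂N, not a gate-count or area estimate.

```python
def crossbar_cost(n: int) -> int:
    """Cost of an N x N Crossbar, taken as N^2 (illustrative measure)."""
    return n * n


def benes_cost(n: int) -> int:
    """Cost of an N x N Benes network, taken as N * log2(N).

    Assumes N is a power of two, so log2(N) is an exact integer.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two >= 2")
    return n * (n.bit_length() - 1)


if __name__ == "__main__":
    for n in (256, 512):
        print(f"N={n}: crossbar={crossbar_cost(n)}, benes={benes_cost(n)}")
```

Under this measure the Crossbar value grows quadratically, while the Benes value grows only slightly faster than linearly.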

As an optional embodiment, the fully interconnected network 1 is an enhanced Benes network, and includes a first sub-Benes network and a second sub-Benes network, each with N/2 input and output ends, and N switching units; each switching unit has a first input end, a second input end, a first output end and a second output end, and has 4 working modes: direct connection, exchange, upper broadcast and lower broadcast; wherein,

N/2 switching units are connected between the input ends In0 to In(N-1) of the enhanced Benes network and the input ends of the first sub-Benes network and the second sub-Benes network; the first output ends of the first N/4 switching units are respectively connected with the first N/4 input ends of the first sub-Benes network, and the second output ends of the first N/4 switching units are respectively connected with the first N/4 input ends of the second sub-Benes network; the first output ends of the last N/4 switching units are respectively connected with the last N/4 input ends of the first sub-Benes network, and the second output ends of the last N/4 switching units are respectively connected with the last N/4 input ends of the second sub-Benes network;

in addition, N/2 switching units are connected between the output ends of the first sub-Benes network and the second sub-Benes network and the output ends Out0 to Out(N-1) of the enhanced Benes network; the first input ends of the first N/4 switching units are respectively connected with the first N/4 output ends of the first sub-Benes network, and the second input ends of the first N/4 switching units are respectively connected with the first N/4 output ends of the second sub-Benes network; the first input ends of the last N/4 switching units are respectively connected with the last N/4 output ends of the first sub-Benes network, and the second input ends of the last N/4 switching units are respectively connected with the last N/4 output ends of the second sub-Benes network.
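The connection rule above can be sketched in Python as a list of links for the two outer stages. This is an illustration only: it assumes that "respectively" means switching unit i is wired to end i of each sub-network, and the names `outer_stage_links`, `sub1`, `sub2`, `out1`, `out2`, `in1` and `in2` are hypothetical labels that do not appear in the source.

```python
def outer_stage_links(n: int):
    """List the links of the two outer stages of an N x N enhanced Benes network.

    Each link is a tuple (switch_index, switch_port, sub_network, sub_network_end).
    Assumption: switching unit i connects to end i of each sub-network.
    """
    if n < 4 or n % 4:
        raise ValueError("N must be a multiple of 4")
    half = n // 2  # N/2 switching units on each side
    input_side = []
    output_side = []
    for i in range(half):
        # Input side: the switch outputs feed the sub-network inputs.
        input_side.append((i, "out1", "sub1", i))
        input_side.append((i, "out2", "sub2", i))
        # Output side: the switch inputs are fed by the sub-network outputs.
        output_side.append((i, "in1", "sub1", i))
        output_side.append((i, "in2", "sub2", i))
    return input_side, output_side


if __name__ == "__main__":
    ins, outs = outer_stage_links(8)
    for link in ins:
        print(link)
```

With this indexing, each of the N/2 ends of each sub-network is used exactly once on each side, which matches the split into the first N/4 and last N/4 switching units.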

This embodiment presents a specific fully interconnected network 1, which adopts the enhanced Benes network structure shown in Fig. 4. A Benes network consists of 2 × 2 (2 inputs, 2 outputs) switching units. An N × N (N inputs, N outputs) Benes network is formed as follows: N/2 switching units are placed on each of the two sides, N switching units in total, and two N/2 × N/2 Benes networks (namely the first sub-Benes network and the second sub-Benes network) are placed in the middle, connected as shown in Fig. 4; the two middle Benes networks are then decomposed in the same way, repeatedly, until each middle sub-network consists of a single switching unit. The enhanced Benes network of this embodiment differs from an ordinary Benes network in its switching units: the switching unit of an ordinary Benes network has only 2 working modes, direct connection and exchange, whereas the switching unit of the enhanced Benes network has 4 working modes, direct connection, exchange, upper broadcast and lower broadcast. The enhanced Benes network can therefore not only connect any input end of the network with any output end, but also connect one input end with multiple output ends simultaneously. Because its function is enhanced relative to the ordinary Benes network, it is called an enhanced Benes network.
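The four working modes and the recursive construction can be illustrated with a short Python sketch. Two points are assumptions rather than statements from the source: "upper broadcast" is taken to mean that the first input drives both outputs, and "lower broadcast" that the second input drives both outputs. The names `Mode`, `switch` and `switch_count` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = 0
    EXCHANGE = 1
    UPPER_BROADCAST = 2
    LOWER_BROADCAST = 3


def switch(mode: Mode, in1, in2):
    """Return (out1, out2) of a 2 x 2 switching unit in the given mode."""
    if mode is Mode.DIRECT:
        return in1, in2
    if mode is Mode.EXCHANGE:
        return in2, in1
    if mode is Mode.UPPER_BROADCAST:
        # Assumption: the first (upper) input is copied to both outputs.
        return in1, in1
    if mode is Mode.LOWER_BROADCAST:
        # Assumption: the second (lower) input is copied to both outputs.
        return in2, in2
    raise ValueError(f"unknown mode: {mode}")


def switch_count(n: int) -> int:
    """Count the switching units of an N x N Benes network built recursively.

    N outer switching units plus two N/2 x N/2 sub-networks, down to a
    single switching unit when N == 2. Assumes N is a power of two.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two >= 2")
    if n == 2:
        return 1
    return n + 2 * switch_count(n // 2)


if __name__ == "__main__":
    print(switch(Mode.EXCHANGE, "a", "b"))
    print(switch_count(8))
```

The two broadcast modes are what allow one input end to reach several output ends at once; the direct and exchange modes alone can only permute the inputs.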

As an optional embodiment, the array further comprises a set of computing units Cluster comprising one or more types of heterogeneous operators 2 connected by an enhanced Benes network.

This embodiment presents another scheme for the array. In this embodiment, the array includes not only heterogeneous operators 2 but also a computing unit set Cluster composed of one or more types of heterogeneous operators 2. The interconnection cost of a fully interconnected network rises sharply as the number of connected resources increases; for an enhanced Benes network, for example, it grows as N × Log₂N. This embodiment therefore groups one or more types of heterogeneous operators 2 into a Cluster, which reduces the number of endpoints that the array-level network must connect and thus reduces the interconnection cost. To limit resource waste, the heterogeneous operators 2 within the Cluster are still connected by a fully interconnected network 1, namely an enhanced Benes network.
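The effect of clustering on the N × Log₂N measure can be sketched as follows. The Cluster and array network sizes (256 and 512 ends) and the count of 4 Clusters are taken from the later embodiments. The size of the flat, single-network alternative is not given anywhere in the source: the value of 2048 ends is purely an assumption, chosen as the next power of two above 4 × 256, and all names are hypothetical.

```python
def benes_cost(n: int) -> int:
    """N * log2(N) for a power-of-two N (illustrative cost measure)."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two >= 2")
    return n * (n.bit_length() - 1)


CLUSTER_ENDS = 256   # ends of the network inside each Cluster (from the source)
ARRAY_ENDS = 512     # ends of the array-level network (from the source)
NUM_CLUSTERS = 4     # Clusters per array (from the source)
FLAT_ENDS = 2048     # ASSUMPTION: size of a hypothetical single flat network


def hierarchical_cost() -> int:
    """Cost of 4 Cluster networks plus one array-level network."""
    return NUM_CLUSTERS * benes_cost(CLUSTER_ENDS) + benes_cost(ARRAY_ENDS)


def flat_cost() -> int:
    """Cost of the assumed flat network that would replace the hierarchy."""
    return benes_cost(FLAT_ENDS)


if __name__ == "__main__":
    print("hierarchical:", hierarchical_cost())
    print("flat (assumed size):", flat_cost())
```

The comparison only indicates the direction of the saving; the actual ratio depends on how many network ends each operator needs, which the source does not state.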

As an alternative embodiment, the set of computing units Cluster includes 24 enhanced numerical operators EALU, 48 numerical operators ALU, 12 branch operators SEL, 2 LOOP control operators LOOP, and 48 BUFFER operators BUFFER connected by an enhanced Benes network.

This embodiment shows a specific Cluster structure. The Cluster of this embodiment is composed of 5 kinds of heterogeneous operators: 24 EALUs, 48 ALUs, 12 SELs, 2 LOOPs and 48 BUFFERs, connected by an enhanced Benes network. Because this many heterogeneous operators form a single Cluster, and the Cluster is then connected to other operators as if it were one operator, the interconnection cost is reduced.

As an alternative embodiment, the number of input and output ends of the enhanced Benes network is N = 256.

This embodiment refines the previous embodiment by further defining the enhanced Benes network within the specific Cluster: the number of input and output ends is N = 256.

As an alternative embodiment, the array comprises 4 sets of compute units Cluster, 50 multiply-add operators MAC, 10 special operators SPU, 64 BUFFER operators BUFFER, and 6 LOOP control operators LOOP connected by an enhanced Benes network.

This embodiment shows a specific structure of the array. In addition to the heterogeneous operators 2 (50 MACs, 10 SPUs, 64 BUFFERs and 6 LOOPs), the array of this embodiment includes 4 Clusters. These heterogeneous operators 2 and Clusters are connected by an enhanced Benes network; the structure is shown in Fig. 5.
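The operator counts given for the Cluster and for the array can be tallied with a small Python sketch. The dictionary and function names are hypothetical. The sketch counts operators only; it makes no claim about how many network ends each operator or Cluster occupies, since the source does not state this.

```python
from collections import Counter

# Operators inside one Cluster (counts from the Cluster embodiment).
CLUSTER = {"EALU": 24, "ALU": 48, "SEL": 12, "LOOP": 2, "BUFFER": 48}

# Operators attached directly to the array-level network (counts from the
# array embodiment), alongside the Clusters.
STANDALONE = {"MAC": 50, "SPU": 10, "BUFFER": 64, "LOOP": 6}

NUM_CLUSTERS = 4


def cluster_operator_count() -> int:
    """Number of operators inside a single Cluster."""
    return sum(CLUSTER.values())


def top_level_unit_count() -> int:
    """Units on the array-level network, counting each Cluster as one unit."""
    return NUM_CLUSTERS + sum(STANDALONE.values())


def total_by_kind() -> Counter:
    """Operators of each kind in the whole array, Clusters included."""
    totals = Counter(STANDALONE)
    for kind, count in CLUSTER.items():
        totals[kind] += NUM_CLUSTERS * count
    return totals


if __name__ == "__main__":
    print("operators per Cluster:", cluster_operator_count())
    print("top-level units:", top_level_unit_count())
    print("total by kind:", dict(total_by_kind()))
```

Counting each Cluster as a single unit reflects the description of the Cluster being connected to other operators as if it were one operator.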

As an alternative embodiment, the number of input and output ends of the enhanced Benes network is N = 512.

This embodiment refines the previous embodiment by further defining its enhanced Benes network: the number of input and output ends is N = 512.

The embodiment of the invention also provides a processor, which comprises the coarse-grained reconfigurable array based on the fully interconnected network, which is described in any one of the embodiments.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A full interconnect network based coarse grain reconfigurable array, characterized in that the array comprises a plurality of types of heterogeneous operators connected by a full interconnect network.

2. The full interconnect network based coarse grain reconfigurable array of claim 1, wherein the multiple types of heterogeneous operators include a branch operator SEL, a numerical operator ALU, an enhanced numerical operator EALU, a multiply-add operator MAC, a special operator SPU, a BUFFER operator BUFFER, and a LOOP control operator LOOP.

3. The full interconnect network-based coarse grain reconfigurable array of claim 1, wherein the full interconnect network comprises a Benes network, a Clos network, or a Crossbar network.

4. The full interconnection network-based coarse-grained reconfigurable array according to claim 3, wherein the full interconnection network is an enhanced Benes network, and comprises a first sub-Benes network and a second sub-Benes network, each with N/2 input and output ends, and N switching units; each switching unit has a first input end, a second input end, a first output end and a second output end, and has 4 working modes: direct connection, exchange, upper broadcast and lower broadcast; wherein,

N/2 switching units are connected between the input ends In0 to In(N-1) of the enhanced Benes network and the input ends of the first sub-Benes network and the second sub-Benes network, the first output ends of the N/2 switching units are connected with the input ends of the first sub-Benes network, and the second output ends of the N/2 switching units are connected with the input ends of the second sub-Benes network;

in addition, N/2 switching units are connected between the output ends of the first sub-Benes network and the second sub-Benes network and the output ends Out0 to Out(N-1) of the enhanced Benes network, the first input ends of the N/2 switching units are connected with the output ends of the first sub-Benes network, and the second input ends of the N/2 switching units are connected with the output ends of the second sub-Benes network.

5. The full interconnect network based coarse grain reconfigurable array of claim 4, wherein the array further comprises a set of compute units Cluster comprising one or more types of heterogeneous operators connected by an enhanced Benes network.

6. The full interconnect network based coarse grain reconfigurable array of claim 5, wherein the compute unit set Cluster comprises 24 enhanced numerical operators EALU, 48 numerical operators ALU, 12 branch operators SEL, 2 LOOP control operators LOOP, and 48 BUFFER operators BUFFER connected by an enhanced Benes network.

7. The full interconnection network based coarse grain reconfigurable array of claim 6, wherein the number of input and output ends of the enhanced Benes network is N = 256.

8. The full interconnect network based coarse grain reconfigurable array of claim 6, wherein the array comprises 4 compute unit sets Cluster, 50 multiply-add operators MAC, 10 special operators SPU, 64 BUFFER operators BUFFER, 6 LOOP control operators LOOP connected by an enhanced Benes network.

9. The full interconnection network based coarse grain reconfigurable array of claim 8, wherein the number of input and output ends of the enhanced Benes network is N = 512.

10. A processor comprising the coarse grained reconfigurable array of any one of claims 1 to 9.