CN113094030A - Easily compiling method and system for reconfigurable chip

Info

Publication number: CN113094030A
Application number: CN202110176676.6A
Authority: CN
Inventors: 胡俊宝; 张振; 欧阳鹏
Original assignee: Beijing Qingwei Intelligent Technology Co ltd
Current assignee: Beijing Qingwei Intelligent Technology Co ltd
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2021-07-09

Abstract

The invention provides an easy compiling method for a reconfigurable chip, comprising the following steps: acquiring a DFG data flow graph of the software model to be compiled; acquiring the computing node set and the edge set of the DFG data flow graph, where the edges express the data dependencies between computing nodes; establishing mathematical dependency relationships among the computing node set, the edge set, the loading units, the storage units and the operation units; establishing the loading, storage and operation mapping relationships; and linearly compiling these mapping relationships with a linear compiler to obtain a compilation configuration file that can be mapped onto the accelerated computing unit. By establishing the mapping and mathematical dependency relationships between the computing node set and edge set of the DFG data flow graph and the units of the accelerated computing unit, the invention makes program structures quick and easy to compile, with an algorithm that is simple to implement and a short solving time. The invention also provides an easy compiling system for a reconfigurable chip.

Description

Easily compiling method and system for reconfigurable chip

Technical Field

The invention relates to the development of reconfigurable processors and applies to compilers for reconfigurable processors and their compiling processes. In particular, the invention relates to an easy compiling method and system for a reconfigurable chip.

Background

As a promising option for domain-specific accelerators, coarse-grained reconfigurable architectures (hereinafter CGRA) are drawing increasing attention because of their power efficiency approaching that of ASICs and their high software programmability. A CGRA is typically composed of a host controller (typically a CPU), a PE array, a main memory and a local storage (typically a multi-bank memory structure).

Fig. 1 shows such an architecture. The components labeled in fig. 1 are: host controller, context memory, instruction memory, data memory, PEA shared memory (SM, local data memory), parallel-access data bus, PEA global register file, output register, local register file, context buffer (instruction cache), multi-bank PEA shared memory (local data memory), PE (calculation module), ALU (operation unit), LSU (read-write unit), GR (global register) and MUX (multiplexer).

The execution flow of the CGRA computing system is as follows. First, the host controller initializes the CGRA instructions and input data and stores them in the main memory. Before the CGRA accelerates an application, the input data are transferred from the main memory to the local memory and the instructions are loaded into the configuration memory of the CGRA. When the CGRA completes the computation, the output data are transferred from the local memory back to the main memory. In a CGRA, compute-intensive applications typically map instructions onto different computing units (hereinafter PEs) for parallel execution.

The high power efficiency of a CGRA comes from the large number of computing resources distributed across its computing array, its complex interconnection, and its multi-level storage system. However, achieving better performance and energy efficiency in an application requires these resources to cooperate; if they are not scheduled and coordinated well, a CGRA used as an accelerator may even degrade the performance of the system. In addition, because the hardware architecture of a CGRA differs greatly from that of mainstream general-purpose processors, conventional compiling technology for general-purpose processors cannot be ported to a CGRA unchanged. It is therefore necessary to research and develop dedicated CGRA compiling technology. Such technology must be able to exploit the parallelism in an application and reduce data read latency, and then produce configuration information according to the hardware architecture characteristics of the CGRA, so as to achieve high performance and high energy efficiency.

In order for a CGRA to efficiently accomplish different types of computational tasks, a corresponding object must be generated for the CGRA's master controller and data path. Therefore, a CGRA compiler needs to provide control codes running in the reconfigurable controller and configuration information of the corresponding data path. Due to the huge difference between the hardware structure of the CGRA and the hardware structure of the general-purpose processor, the compiling technology and flow of the compiler are also different from those of the conventional compiler. The core work of the CGRA compiler is to analyze an application program, divide the program into a hardware execution part code and a software running part code, and then compile the two parts of codes respectively to generate a controller running code and configuration information of a reconfigurable data path. As shown in fig. 2, the compiling technology of the reconfigurable processor includes task partitioning and scheduling, operation mapping, memory mapping optimization, and the like.

Disclosure of Invention

The invention aims to provide an easy compiling method for a reconfigurable chip. By establishing the mapping and mathematical dependency relationships between the computing node set and edge set of a DFG data flow graph and the computing units of the accelerated computing unit, the method makes program structures quick and easy to compile. The algorithm is simple to implement, the solving time is short, and the method is highly practical, so that the compiled coarse-grained reconfigurable architecture achieves better operating performance.

The invention also aims to provide an easy compiling system for a reconfigurable chip, which greatly shortens the compiling time and makes program structures easy to compile. Its algorithm is simple to implement, the solving time is short, and the system is highly practical, so that the compiled coarse-grained reconfigurable architecture achieves better operating performance.

In a first aspect of the present invention, an easy compiling method for a reconfigurable chip is provided. The reconfigurable chip includes an acceleration calculation unit. The acceleration calculation unit includes a plurality of loading units with loading functions, a plurality of storage units with storage functions, and an operation unit array. The operation unit array has P rows and Q columns of operation units (PEs). The loading units are respectively connected with the PE units in the first row and the first column of the operation unit array. The storage units are respectively connected with the PE units in the P-th row and the Q-th column of the operation unit array.

The easily compiling method of the reconfigurable chip comprises the following steps:

Step S101, acquiring a DFG data flow graph of the software model to be compiled.

Step S102, acquiring a computing node set and an edge set in the DFG data flow graph. The edges express the data dependencies between computing nodes.

Step S103, establishing mathematical dependency relations among the computing node set, the edge set, the loading units, the storage units and the operation units.

Step S104, establishing a loading mapping relation between the computing node set and the plurality of loading units according to the computing node set, the edge set, the mathematical dependency relations and the number of loading units.

Establishing a storage mapping relation between the computing node set and the plurality of storage units according to the computing node set, the edge set, the mathematical dependency relations and the number of storage units.

Establishing an operation mapping relation among the computing node set, the edge set and the operation unit array according to the computing node set, the edge set, the mathematical dependency relations and the operation unit array.

Step S105, a linear compiler linearly compiles the load mapping relationship, the storage mapping relationship, and the operation mapping relationship to obtain a compilation configuration file that can be mapped to the accelerated computing unit.

In one embodiment of the compiler-friendly method of the reconfigurable chip of the invention, the operation unit array has 4 rows and 4 columns of PE units.

The number of load units is eight, wherein 4 load units are connected one-to-one with the four PE units in the first column. The 4 load units are respectively connected with the four PE units in the first row and are connected with the four PE units in the first column in a one-to-one mode.

In another embodiment of the easy compiling method for the reconfigurable chip, the number of the memory cells is seven, wherein 1 memory cell is connected to the PE unit in the first row of the fourth column, 1 memory cell is connected to the PE unit in the second row of the fourth column, and 1 memory cell is connected to the PE unit in the third row of the fourth column. The 4 memory cells are connected one-to-one with the four PE cells in the fourth row.

In another embodiment of the easy compiling method for a reconfigurable chip according to the present invention, for each operation unit in the operation unit array, it is determined whether other operation units exist in the row direction, the column direction and the 45° oblique direction. If so, the operation unit is connected with the two units to its left and right in the row direction, the two units above and below it in the column direction, and the units diagonally above and below it in the 45° oblique direction.

If not, the missing positions are filled by treating the operation units as arranged cyclically in the forward row direction and in the forward column direction, and the operation unit is connected with the units at the corresponding positions. The forward row direction is the direction from the first row to the fourth row. The forward column direction is the direction from the first column to the fourth column.

In another embodiment of the easy compiling method for a reconfigurable chip according to the present invention, it is determined whether each operation unit has another operation unit at an interval in the forward column direction. If so, that operation unit is connected. If not, the missing position is filled by arranging the operation units cyclically in the forward column direction, and the unit at the corresponding position is connected.

In another embodiment of the method for easily compiling a reconfigurable chip according to the present invention, the arithmetic operations that can be implemented by the arithmetic unit array include: multiplication, selection, unsigned addition, signed addition, unsigned subtraction, signed subtraction, absolute value taking, nop, route operation.

In another embodiment of the method for easily compiling a reconfigurable chip of the present invention, the mathematical dependency relationship includes:

A mathematical dependency relationship in which computing nodes correspond one-to-one with operation units.

A mathematical dependency relationship in which the interconnections among the loading units, the storage units and the operation unit array correspond one-to-one with edges.

A mathematical dependency relationship ensuring that the same operation unit can be used under different configurations.

A mathematical dependency relationship ensuring that each interconnection edge in the DFG graph corresponds one-to-one with an interconnection line among the computing units in the PEA array.

A mathematical dependency relationship in which an interconnection cannot be used again within the current configuration but can still be used in other configurations.

A mathematical dependency relationship in which, when a computing node is mapped to an operation unit, the edges leaving that computing node are correspondingly mapped to the interconnections leaving that operation unit.

A mathematical dependency relationship in which, when a computing node is mapped to an operation unit, the edges entering that computing node are correspondingly mapped to the interconnections entering that operation unit.

A mathematical dependency relationship whose objective is that the sum over all edges is minimal.

In a second aspect of the present invention, an easy compiling system for a reconfigurable chip is provided. The reconfigurable chip includes an acceleration calculation unit. The acceleration calculation unit includes a plurality of loading units with loading functions, a plurality of storage units with storage functions, and an operation unit array. The operation unit array has P rows and Q columns of operation units (PEs). The loading units are respectively connected with the PE units in the first row and the first column of the operation unit array. The storage units are respectively connected with the PE units in the P-th row and the Q-th column of the operation unit array.

The easy compiling system of the reconfigurable chip comprises:

An obtaining unit configured to obtain a DFG data flow graph of the software model to be compiled.

A set acquisition unit configured to acquire a computing node set and an edge set in the DFG data flow graph. The edges express the data dependencies between computing nodes.

A dependency relationship establishing unit configured to establish mathematical dependency relationships among the computing node set, the edge set, the loading units, the storage units and the operation units.

A mapping relationship establishing unit configured to establish a loading mapping relationship between the computing node set and the plurality of loading units according to the computing node set, the edge set, the mathematical dependency relationships and the number of loading units.

A compiling output unit configured to linearly compile the loading mapping relationship, the storage mapping relationship and the operation mapping relationship to obtain a compilation configuration file that can be mapped onto the acceleration calculation unit.

In another embodiment of the compiler-friendly system of the reconfigurable chip of the present invention, the mathematical dependency relationship comprises:

And ensuring the mathematical dependency relationship of one-to-one correspondence between each interconnection edge in the DFG graph and the interconnection lines among the calculation units in the PEA array.

In another embodiment of the compiler-friendly system of the reconfigurable chip of the invention, the operation unit array has 4 rows and 4 columns of PE units.

The number of the memory units is seven, wherein 1 memory unit is connected with the PE unit at the first row of the fourth column, 1 memory unit is connected with the PE unit at the second row of the fourth column, and 1 memory unit is connected with the PE unit at the third row of the fourth column. The 4 memory cells are connected one-to-one with the four PE cells in the fourth row.

For each operation unit in the operation unit array, it is determined whether other operation units exist in the row direction, the column direction and the 45° oblique direction. If so, the operation unit is connected with the two units to its left and right in the row direction, the two units above and below it in the column direction, and the units diagonally above and below it in the 45° oblique direction.

It is determined whether each operation unit has another operation unit at an interval in the forward column direction. If so, that operation unit is connected. If not, the missing position is filled by arranging the operation units cyclically in the forward column direction, and the unit at the corresponding position is connected.

The characteristics, technical features, advantages and implementation manners of the method and system for easily compiling a reconfigurable chip will be further described in a clear and understandable manner by referring to the attached drawings.

Drawings

Fig. 1 is a schematic diagram illustrating the composition of a coarse-grained reconfigurable architecture.

Fig. 2 is a schematic diagram of a compiling system framework for explaining coarse-grained reconfigurable chips.

Fig. 3 is a schematic diagram for illustrating internal unit connections of a reconfigurable chip in an embodiment of the invention.

Fig. 4 is a flowchart for explaining a compiling-facilitating method of a reconfigurable chip in an embodiment of the present invention.

Fig. 5 is a flow chart for explaining a compiler-facilitating system of a reconfigurable chip according to an embodiment of the present invention.

Fig. 6 is a flow chart illustrating how a CGRA compiler extracts loop information and converts it into a data flow in one embodiment of the present invention.

Fig. 7 is a mapping result for explaining a compiling-facilitated method of a reconfigurable chip in one embodiment of the present invention.

Detailed Description

In order to more clearly understand the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings, in which the same reference numerals indicate the same or structurally similar but functionally identical elements.

"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings only schematically show the parts relevant to the present exemplary embodiment, and they do not represent the actual structure and the true scale of the product.

In a first aspect of the present invention, an easy compiling method for a reconfigurable chip is provided. The reconfigurable chip includes an acceleration calculation unit. As shown in fig. 3, the acceleration calculation unit includes a plurality of loading units with loading functions, a plurality of storage units with storage functions, and an operation unit array. Wherein:

the loading units with loading function are respectively: a zero bit loading unit L0, a first bit loading unit L1, a second bit loading unit L2, a third bit loading unit L3, a fourth bit loading unit L4, a fifth bit loading unit L5, a sixth bit loading unit L6, and a seventh bit loading unit L7.

A plurality of memory cells having a memory function, each of which is: a zero bit storage unit S0, a first bit storage unit S1, a second bit storage unit S2, a third bit storage unit S3, a fourth bit storage unit S4, a fifth bit storage unit S5 and a sixth bit storage unit S6.

An operation unit array PEA101. The operation unit array has 4 rows and 4 columns of operation units (PEs), namely: a 0th bit calculation unit PE0, a 1st bit calculation unit PE1, a 2nd bit calculation unit PE2, and so on up to a 15th bit calculation unit PE15.

The loading units are respectively connected with the PE units in the first row and the first column of the operation unit array. The storage units are respectively connected with the PE units in the P-th row and the Q-th column of the operation unit array.

In the present invention, as shown in fig. 4, the method for easily compiling a reconfigurable chip includes:

Step S101, acquiring a DFG dataflow graph.

In this step, a DFG dataflow graph of the software model to be compiled is obtained.

Step S102, a computing node set and an edge set are obtained.

In this step, a compute node set and an edge set in the DFG dataflow graph are obtained. The edges express the data dependencies between compute nodes.
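As an illustration of what the node set and edge set of step S102 might look like in software, the following minimal Python sketch stores a DFG as a dictionary of operator nodes and a set of directed edges. The class name `DFG`, the node names, and the (producer, consumer) edge orientation are assumptions made for this sketch and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class DFG:
    """Hypothetical container: operator nodes plus directed dependency edges."""
    nodes: dict = field(default_factory=dict)  # node name -> operator kind
    edges: set = field(default_factory=set)    # (producer, consumer) pairs

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, src, dst):
        # Both endpoints must already be known compute nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in edge ({src}, {dst})")
        self.edges.add((src, dst))

    def predecessors(self, name):
        """Nodes whose results the given node depends on."""
        return sorted(s for s, d in self.edges if d == name)


def build_example():
    """Two loads feeding one multiply whose result is stored."""
    g = DFG()
    for name, kind in [("ld_a", "load"), ("ld_b", "load"),
                       ("mul", "alu"), ("st", "store")]:
        g.add_node(name, kind)
    g.add_edge("ld_a", "mul")
    g.add_edge("ld_b", "mul")
    g.add_edge("mul", "st")
    return g
```

Keeping the operator kind (load, alu, store) on each node is what later allows the node to be restricted to the matching group of hardware units.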

Step S103, establishing a mathematical dependency relationship.

In this step, mathematical dependency relationships among the compute node set, the edge set, the load unit, the storage unit, and the operation unit are established.

Step S104, establishing a mapping relation.

In this step, a load mapping relationship between the compute node set and the plurality of load units is established according to the compute node set, the edge set, the mathematical dependency relationship, and the number of load units.

In this step, the mapping variables are first initialized as follows:

(1) Initialize the load operators, which correspond to L0-L7:

if (v_p == load) then
    V_p,q ∈ [0, 1],  q ∈ [ALURange, ALURange + LoadRange)
else
    V_p,q = 0,  q ∈ [ALURange, ALURange + LoadRange)
end if

(2) Initialize the store operators, which correspond to S0-S6:

if (v_p == store) then
    V_p,q ∈ [0, 1],  q ∈ [ALURange + LoadRange, ALURange + LoadRange + StoreRange)
else
    V_p,q = 0,  q ∈ [ALURange + LoadRange, ALURange + LoadRange + StoreRange)
end if

(3) Initialize the ALU operators, which correspond to PE0-PE15:

if (v_p == alu) then
    V_p,q ∈ [0, 1],  q ∈ [0, ALURange)
else
    V_p,q = 0,  q ∈ [0, ALURange)
end if
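The three initialization rules above can be sketched in Python as follows. The sketch assumes the sizes of this embodiment (16 ALU columns for PE0-PE15, 8 load columns for L0-L7, 7 store columns for S0-S6) and represents each V_p,q by its allowed values: `(0, 1)` for a free variable and `(0,)` for a variable fixed to zero. The function names and this domain representation are illustrative assumptions, not part of the patent.

```python
ALU_RANGE, LOAD_RANGE, STORE_RANGE = 16, 8, 7  # assumed sizes of the embodiment


def column_ranges():
    """Column index ranges of V for ALU, load and store units."""
    load_start = ALU_RANGE
    store_start = ALU_RANGE + LOAD_RANGE
    return {
        "alu": range(0, ALU_RANGE),
        "load": range(load_start, load_start + LOAD_RANGE),
        "store": range(store_start, store_start + STORE_RANGE),
    }


def init_domains(node_kinds):
    """Return domains[p][q]: (0, 1) if V[p][q] is free, (0,) if fixed to 0."""
    ranges = column_ranges()
    total = ALU_RANGE + LOAD_RANGE + STORE_RANGE
    domains = []
    for kind in node_kinds:
        allowed = ranges[kind]  # only units of the matching kind stay free
        domains.append([(0, 1) if q in allowed else (0,) for q in range(total)])
    return domains


# Example: one load node, one ALU node, one store node.
example = init_domains(["load", "alu", "store"])
```

Fixing the out-of-range variables to zero before solving keeps a load node from ever being placed on an ALU or store unit, and likewise for the other two kinds.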

Step S105, acquiring a compiling configuration file.

In this step, a linear compiler linearly compiles the load mapping relationship, the storage mapping relationship, and the operation mapping relationship to obtain a compilation configuration file that can be mapped to the accelerated computing unit.

Therefore, the invention greatly lowers the technical threshold for mapping. The algorithm is simple and easy to implement; its core is the establishment of a mathematical model, which is easy to understand and easy to realize. The solving time is short: with dozens of nodes, a solution takes only a few seconds. The method produces no incorrect mappings, is widely applicable, and is also suitable for route-planning algorithms.

The eight loading units are respectively: a zero bit loading unit L0, a first bit loading unit L1, a second bit loading unit L2, a third bit loading unit L3, a fourth bit loading unit L4, a fifth bit loading unit L5, a sixth bit loading unit L6, and a seventh bit loading unit L7.

Wherein 4 load units are connected one-to-one with the four PE units in the first column (compute unit PE0, compute unit PE4, compute unit PE8, and compute unit PE 12). The 4 load units are respectively connected with the four PE units in the first row and are connected with the four PE units in the first column in a one-to-one mode.

As shown in FIG. 3, the zero bit loading unit L0, the first bit loading unit L1, the second bit loading unit L2 and the third bit loading unit L3 are connected to PE0, PE1, PE2 and PE3 in the first row in a one-to-one correspondence. In addition, each of the loading units L0, L1, L2 and L3 is connected to PE0, PE1, PE2 and PE3 in the first row.

As shown in FIG. 3, the first bit loading unit L0, the first bit loading unit L1, the second bit loading unit L2, and the third bit loading unit L3 in the first row are connected to the PE0, the PE4, the PE8, and the PE12 in the first column in a one-to-one correspondence.

In another embodiment of the easy compiling method for the reconfigurable chip, the number of storage units is seven, namely: a zero bit storage unit S0, a first bit storage unit S1, a second bit storage unit S2, a third bit storage unit S3, a fourth bit storage unit S4, a fifth bit storage unit S5 and a sixth bit storage unit S6. One storage unit is connected with the PE unit in the first row of the fourth column, one with the PE unit in the second row of the fourth column, and one with the PE unit in the third row of the fourth column. The other four storage units are connected one-to-one with the four PE units in the fourth row.

As shown in FIG. 3, the fourth load unit L4, the fifth load unit L5, the sixth load unit L6 and the seventh load unit L7 are connected to the PE0, the PE4, the PE8 and the PE12 in the first column in a one-to-one correspondence.

As shown in FIG. 3, the calculation units PE3, PE7 and PE11 in the fourth column are connected to the fourth bit storage unit S4, the fifth bit storage unit S5 and the sixth bit storage unit S6 one by one. The calculation unit PE12, the calculation unit PE13, the calculation unit PE14, and the calculation unit PE15 in the fourth row are connected to the zero bit storage unit S0, the first bit storage unit S1, the second bit storage unit S2, and the third bit storage unit S3 one by one.
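For illustration, the attachment of loading and storage units to the array can be written as two lookup tables. This sketch follows the reading in which L0-L3 attach to the first row, L4-L7 attach to the first column, S0-S3 attach to the fourth row and S4-S6 attach to the fourth column. The dictionary names and the row-major PE numbering (PE index = 4 × row + column) are assumptions of the sketch.

```python
COLS = 4  # assumed 4 x 4 array, row-major PE numbering

# Loading units -> index of the PE they feed (assumed reading of fig. 3).
LOAD_TO_PE = {f"L{i}": i for i in range(4)}                    # L0..L3 -> PE0..PE3
LOAD_TO_PE.update({f"L{4 + i}": COLS * i for i in range(4)})   # L4..L7 -> PE0, PE4, PE8, PE12

# Storage units -> index of the PE they read from.
STORE_FROM_PE = {f"S{i}": 12 + i for i in range(4)}                    # S0..S3 <- PE12..PE15
STORE_FROM_PE.update({f"S{4 + i}": 3 + COLS * i for i in range(3)})    # S4..S6 <- PE3, PE7, PE11


def pes_fed_by_loads():
    """Sorted PE indices that have at least one loading unit attached."""
    return sorted(set(LOAD_TO_PE.values()))
```

With these tables, a mapper can check directly whether a load or store node placed on a given unit is wired to the PE that consumes or produces its data.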

The connections among the units are physical wire connections, which effectively shortens the call paths among the computing units, storage units and loading units, better matches the software structure of programs, and suits software structures with various logical relationships.

As shown in fig. 3, each of the computing elements PE in the arithmetic element array PEA101 is connected to its adjacent nodes, up, down, left, and right, for example, the computing element PE5 is connected to the computing element PE1, the computing element PE4, the computing element PE6, and the computing element PE9, and is connected to the computing element PE8 at the bottom left, the computing element PE2 at the top right, and the computing element PE10 at the bottom right.

If a calculation unit has no adjacent unit in one of the above directions, the units are taken in cyclic order along the forward row direction. For example, the forward row direction in the first row is: compute unit PE0 → compute unit PE1 → compute unit PE2 → compute unit PE3. The forward row direction in the second row is: compute unit PE4 → compute unit PE5 → compute unit PE6 → compute unit PE7. The forward row direction in the third row is: compute unit PE8 → compute unit PE9 → compute unit PE10 → compute unit PE11. The forward row direction in the fourth row is: compute unit PE12 → compute unit PE13 → compute unit PE14 → compute unit PE15.

Likewise, if a calculation unit has no adjacent unit in one of the above directions, the units are taken in cyclic order along the forward column direction. For example, the forward column direction in the first column is: compute unit PE0 → compute unit PE4 → compute unit PE8 → compute unit PE12. The forward column direction in the second column is: compute unit PE1 → compute unit PE5 → compute unit PE9 → compute unit PE13. The forward column direction in the third column is: compute unit PE2 → compute unit PE6 → compute unit PE10 → compute unit PE14. The forward column direction in the fourth column is: compute unit PE3 → compute unit PE7 → compute unit PE11 → compute unit PE15.

For example, if there are no adjacent compute nodes in the upper direction of the compute unit PE0, the compute unit PE1, the compute unit PE2, and the compute unit PE3 in the first row, then the corresponding compute unit PE12, compute unit PE13, compute unit PE14, and compute unit PE15 in the fourth row are connected, respectively.

For example, if there is no adjacent compute node to the left of the compute unit PE0, compute unit PE4, compute unit PE8, compute unit PE12 in the first column, then the corresponding compute unit PE3, compute unit PE7, compute unit PE11, and compute unit PE15 in the fourth column are connected, respectively.

For example, the upper-right and lower-right neighbours of compute unit PE7 in the fourth column are missing, so PE7 is connected to PE0 and PE8 at the corresponding positions in the first column. Likewise, compute unit PE3 is connected to compute unit PE12 and compute unit PE4.

For example, the computing unit PE15 is connected to the computing unit PE13, and the computing unit PE6 is connected to the computing unit PE14. The computing unit PE9 has no other computing unit at an interval in the forward column direction, so it is connected to the computing unit PE1 (that is, the missing position is filled by arranging the computing units cyclically in the forward column direction, and the unit at the corresponding position is connected).
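The wrap-around connection rule described above can be sketched in Python as follows. The sketch assumes row-major numbering (PE index = 4 × row + column), models the four row and column neighbours, the lower-left, upper-right and lower-right diagonal neighbours, and the spaced connection two rows ahead in the forward column direction, and uses modular arithmetic for positions that fall outside the array. The names `OFFSETS`, `pe_index` and `neighbors` are hypothetical, and the sketch reproduces only the examples given for PE5, PE7, PE3, PE6 and PE9.

```python
ROWS, COLS = 4, 4

# Assumed (row offset, column offset) of each modelled connection.
OFFSETS = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
    "down_left": (1, -1), "up_right": (-1, 1), "down_right": (1, 1),
    "skip_down": (2, 0),  # spaced connection in the forward column direction
}


def pe_index(row, col):
    """Row-major PE number, wrapping cyclically past the array borders."""
    return (row % ROWS) * COLS + (col % COLS)


def neighbors(pe):
    """Map each modelled direction to the PE reached from the given PE."""
    row, col = divmod(pe, COLS)
    return {name: pe_index(row + d_row, col + d_col)
            for name, (d_row, d_col) in OFFSETS.items()}


# Example: the connections of PE5 and the wrapped diagonals of PE7.
pe5 = neighbors(5)
pe7 = neighbors(7)
```

Using the modulo operation for both coordinates makes a border unit behave exactly like an interior unit, which is what "filling the missing positions cyclically" amounts to in this sketch.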

It should be noted that the connection relationships of the computing units are all physical wire connections, so that the interconnection between the computing units is improved, and the method is suitable for various program structures.

For a given loop data flow graph (DFG) D = (V_d, E_d), V_d denotes the set of nodes of the data flow graph, and v ∈ V_d is a compute node in the loop. E_d is the set of edges, and e = (v_i, v_j) ∈ E_d is an edge of the data flow graph D, indicating that the two computation nodes v_i and v_j have a data dependency and that v_i must be executed after v_j. Here i and j denote the numbers of the computing units, for example computing unit PE0 and computing unit PE4. Based on these concepts, the following mathematical model is established:

1. A mathematical dependency in which computing nodes correspond one-to-one with operation units, that is, each vertex is guaranteed to be mapped to one PE, as shown in equation 1.

where p (DFGnode) denotes a node in the DFG; q (PEARange) denotes the operation unit array region; F denotes the q regions corresponding to each node; each node has PEARange candidate units.
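One plausible reading of this constraint, assuming a 0/1 indicator F[p][q] that equals 1 when node p is placed on unit q, is that every row of F sums to exactly 1 and, within one configuration, no column sums to more than 1. The Python check below is a sketch under that assumption; the function name and the column condition are not taken from the patent.

```python
def is_one_to_one(F):
    """Check an assumed 0/1 placement matrix F[p][q] (nodes x units)."""
    # Every node is placed on exactly one unit.
    rows_ok = all(sum(row) == 1 for row in F)
    # No unit hosts more than one node in the same configuration.
    cols_ok = all(sum(col) <= 1 for col in zip(*F))
    return rows_ok and cols_ok


# Two nodes on three units: node 0 -> unit 1, node 1 -> unit 2.
valid = [[0, 1, 0],
         [0, 0, 1]]
# Both nodes on unit 1: violates the one-to-one requirement.
clash = [[0, 1, 0],
         [0, 1, 0]]
```

In a linear solver the same two conditions would appear as one equality per node and one inequality per unit.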

2. A mathematical dependency ensuring that the same operation unit can be used in different configurations.

When the DFG graph is too large to be placed in the PEA array all at once, the original DFG graph needs to be cut into a plurality of DFG graph units, each with its own configuration of graph parameters. This step therefore expresses that, under each graph parameter configuration, the same operation unit can be used in every DFG graph unit, that is, the operation units can be reused across DFG graph units, as in formula 2:

where SplitRange denotes the split region; PEARange denotes the operation unit array region; Split[m], q denotes the q region corresponding to each node under configuration m; the q region is a region in the PEARange operation unit array.
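To illustrate how operation units can be reused across configurations, the following naive Python sketch cuts a node list into chunks of at most 16 nodes and numbers the units from 0 again in every chunk. It is only a stand-in for the partitioning step: the function name, the sequential chunking, and the placement order are assumptions, and a real partitioner would also have to respect the edges of the DFG.

```python
def split_into_configs(nodes, capacity=16):
    """Cut a node list into configurations of at most `capacity` nodes.

    Each configuration maps node -> PE index, and PE indices restart at 0
    in every configuration, so the same unit is reused across configurations.
    """
    configs = []
    for start in range(0, len(nodes), capacity):
        chunk = nodes[start:start + capacity]
        configs.append({node: pe for pe, node in enumerate(chunk)})
    return configs


# Example: 20 nodes need two configurations on a 16-unit array.
example_nodes = [f"n{i}" for i in range(20)]
example_configs = split_into_configs(example_nodes)
```

Within one configuration every unit is used at most once, while across configurations the unit numbers overlap, which is the reuse that this constraint is meant to permit.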

3. A mathematical dependency ensuring that each interconnection edge in the DFG graph corresponds one-to-one with an interconnection line among the computing units in the PEA array, as in equation 3:

where q (EdgeRange) denotes the edge region; p (DFGedge) denotes an edge in the DFG; E2L denotes the connecting edges between nodes in the DFG.

4. A mathematical dependency in which an interconnection cannot be used again within the current configuration but can still be used in other configurations. That is, when the DFG is relatively large, it needs to be divided into several blocks, each corresponding to one configuration unit; a unit already mapped in the current configuration cannot be mapped again in that configuration, but it can still be mapped again in the remaining configurations, as in equation 4:

where m (SplitEdgeRange) denotes the split edge region; EdgeRange denotes the edge region; SplitEdge[m], q denotes the q region corresponding to each edge under configuration m; the q region is a region in the PEARange operation unit array.

5. The mathematical dependency relationship that, when a compute node is mapped to an operation unit, the edges leaving that node are correspondingly mapped to interconnections leaving that unit; that is, when DFG node p is mapped to PE q, the edges out of p are also mapped to the interconnections out of q. As in equation 5:

where p ∈ DFGnode denotes a node in the DFG; q ∈ PEARange denotes the operation unit array region; a ∈ DFGedge and b ∈ DFGedge denote edges in the DFG; and Graph_matrix, indexed as Graph_matrix_{p, a+DFGnode}, is the interconnection relation table of two nodes.

6. The mathematical dependency relationship that, when a compute node is mapped to an operation unit, the edges entering that node are correspondingly mapped to interconnections entering that unit; that is, when DFG node p is mapped to PE q, the edges into p are also mapped to the interconnections into q. As in equation 6:

where p ∈ DFGnode denotes a node in the DFG; q ∈ PEARange denotes the operation unit array region; a ∈ DFGedge denotes an edge in the DFG; and Graph_matrix, indexed as Graph_matrix_{a+DFGnode, p}, is the interconnection relation table of two nodes.

7. The objective is set to minimize the sum over all edges. The interconnection relations of all nodes should be satisfied as far as possible, while each node uses the fewest connecting edges. As in equation 7:

where a ∈ DFGedge denotes an edge in the DFG; b ∈ EdgeRange denotes the range of edges; lim denotes a limit; l denotes a connecting edge between nodes in the DFG; and E2L denotes the connecting edges between nodes in the DFG.
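Because the equations themselves are not reproduced in this text, the seven dependency relationships can be summarized as the following integer program. This is one plausible formalization written from the verbal descriptions, not the patent's own equations. The binary variables V_{p,q} (node p on unit q) and E2L_{a,b} (edge a on interconnection b) follow the symbols above, while the sets N_m, A_m, out(p), in(p), Out(q) and In(q) are notation assumed here for illustration.

```latex
% Assumed notation: N_m / A_m = nodes / edges of configuration m,
% out(p), in(p) = DFG edges leaving / entering node p,
% Out(q), In(q) = interconnections leaving / entering unit q.
\begin{align}
\sum_{q \in \mathrm{PEARange}} V_{p,q} &= 1
  && \forall p \in \mathrm{DFGnode} \tag{1}\\
\sum_{p \in N_m} V_{p,q} &\le 1
  && \forall m \in \mathrm{SplitRange},\ q \in \mathrm{PEARange} \tag{2}\\
\sum_{b \in \mathrm{EdgeRange}} E2L_{a,b} &\ge 1
  && \forall a \in \mathrm{DFGedge} \tag{3}\\
\sum_{a \in A_m} E2L_{a,b} &\le 1
  && \forall m \in \mathrm{SplitEdgeRange},\ b \in \mathrm{EdgeRange} \tag{4}\\
V_{p,q} &\le \sum_{b \in \mathrm{Out}(q)} E2L_{a,b}
  && \forall p,\ q,\ a \in \mathrm{out}(p) \tag{5}\\
V_{p,q} &\le \sum_{b \in \mathrm{In}(q)} E2L_{a,b}
  && \forall p,\ q,\ a \in \mathrm{in}(p) \tag{6}\\
\min \ \sum_{a \in \mathrm{DFGedge}} \sum_{b \in \mathrm{EdgeRange}} E2L_{a,b} \tag{7}
\end{align}
```

Constraints (5) and (6) tie the two variable families together: placing p on q forces each outgoing or incoming edge of p onto an interconnection that actually leaves or enters q.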

In a second aspect of the present invention, an easy-compiling system for a reconfigurable chip is provided. As shown in fig. 5, the reconfigurable chip includes an acceleration computing unit. The acceleration computing unit includes a plurality of loading units with loading functions, a plurality of storage units with storage functions, and an operation unit array. The operation unit array has P rows and Q columns of operation units PE. The loading units are respectively connected with the PE units in the first row and the first column of the operation unit array. The storage units are respectively connected with the operation units PE in the P-th row and the Q-th column of the operation unit array.

The easy compiling system of the reconfigurable chip comprises:

An obtaining unit 10 configured to obtain a DFG data flow graph of a software model to be compiled.

A set acquisition unit 20 configured to acquire a compute node set and an edge set in the DFG data flow graph. The edges express the data dependencies between compute nodes.

A dependency relationship establishing unit 30 configured to establish mathematical dependency relationships among the compute node set, the edge set, the loading units, the storage units and the operation units.

A mapping relationship establishing unit 40 configured to establish a loading mapping relationship between the compute node set and the plurality of loading units according to the compute node set, the edge set, the mathematical dependency relationships and the number of loading units.

A compiling output unit 50 configured to linearly compile the loading mapping relationship, the storage mapping relationship and the operation mapping relationship to obtain a compiling configuration file that can be mapped onto the acceleration computing unit.

In a preferred embodiment of the present invention, the CGRA acceleration engine is an array of 31 acceleration PE units, as shown in fig. 3:

The acceleration PE units are of three functional types: Load type (red), Store type (green) and Alu type (white). Load type units read data from the SharedMemory, Store type units store data into the SharedMemory, and Alu type units perform the actual accelerated computation. The Alu type units support 24 types of arithmetic operations, such as multiplication (mul), selection (sel), unsigned addition (uadd), signed addition (sadd), unsigned subtraction (usub), signed subtraction (ssub), absolute value (abs), null operation (nop), and routing (route).

The specific interconnection relationship among the Alu type unit, the Load type unit and the Store type unit is as follows: the purple arrows indicate the interconnection between the Load/Store unit and the PEs, and the black arrows indicate the interconnection between the PEs. The interconnection between the Load unit and the Store unit and the PE unit is asymmetric interconnection, and all interconnections are marked by purple arrows in an interconnection structure diagram.

The interconnection among the PE units is symmetric, and the interconnection structure of every PE is identical. For example, the fan-out interconnections of the PE5 unit include interconnections with the left, right, left-down, lower, right-down, upper, right-up, and lower PEs. When a PE unit at the edge of the PEA has no upper, lower, left or right neighbor, the PEA is extended by the following wrap-around rules: the left neighbor of a left-column PE is the right-column PE (left wraps to right); the right neighbor of a right-column PE is the left-column PE (right wraps to left); the upper neighbor of a top-row PE is the bottom-row PE (top wraps to bottom); the lower neighbor of a bottom-row PE is the top-row PE (bottom wraps to top).
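The wrap-around rules amount to treating the array as a torus, which modular arithmetic expresses directly. The sketch below assumes a 4 x 4 array with row-major PE numbering (PE0 top-left) and connections in all eight surrounding directions; the numbering scheme and the function name are assumptions, and any additional longer-distance connections are not modeled.

```python
ROWS, COLS = 4, 4  # assumed array size, matching a 16-PE operation unit array


def torus_neighbors(pe, rows=ROWS, cols=COLS):
    """Return the eight neighbors of a PE, wrapping around at the array edges."""
    r, c = divmod(pe, cols)
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the PE itself
            # The modulo implements "left of the left column is the right column", etc.
            result.append(((r + dr) % rows) * cols + (c + dc) % cols)
    return result


print(torus_neighbors(5))          # interior PE: no wrap-around needed
print(sorted(torus_neighbors(0)))  # corner PE: wraps in both directions
```

Because every PE gets the same neighbor pattern, the interconnect table used by the mapper can be generated once rather than special-casing edge and corner units.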

An example program is shown above. As fig. 6 shows, the CGRA compiler extracts the loop and converts its information into a data flow graph (hereinafter DFG), in which a[i] and load3, b[i] and load5, c[i] and load7, '+' and sadd6, '+' and sadd8, '+' and sadd11, abs and abs0, '+' and sadd13, and d[i] and store12 are mapped one-to-one. The final aim is to map the DFG onto the CGRA array, so that the CGRA accelerator computes in real time according to the mapped topology and the data flow graph. The mapping result obtained by the method is shown in fig. 7; it ensures not only that every unit can be mapped onto the array, but also that the interconnection relationships between the units are correct.

To solve the problem of loop mapping on the reconfigurable processor CGRA, the data transmission resources on the various heterogeneous reconfigurable computing arrays are considered and unified, and graph theory is used to convert them into a mathematical model.

In this model, each v is a compute node in the loop, E_d is the set of edges, and each element of E_d is an edge in the data flow graph D, indicating that two compute nodes v_i and v_j have a data dependency between them and that v_i must be executed after v_j. Based on these concepts, the following mathematical model is established:

(1) Initializing the load operators, corresponding to L0-L7:

if (v_p == load) then

V_p,q ∈ {0, 1}, q ∈ [ALURange, ALURange+LoadRange)

else

V_p,q = 0, q ∈ [ALURange, ALURange+LoadRange)

end if

(2) Initializing the store operators, corresponding to S0-S6:

if (v_p == store) then

V_p,q ∈ {0, 1}, q ∈ [ALURange+LoadRange, ALURange+LoadRange+StoreRange)

else

V_p,q = 0, q ∈ [ALURange+LoadRange, ALURange+LoadRange+StoreRange)

end if

(3) Initializing the alu operators, corresponding to P0-P15:

if (v_p == alu) then

V_p,q ∈ {0, 1}, q ∈ [0, ALURange)

else

V_p,q = 0, q ∈ [0, ALURange)

end if
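The three initialization rules partition the unit index q into an ALU range, a load range and a store range. A minimal sketch follows, assuming the unit counts implied by P0-P15, L0-L7 and S0-S6 (16, 8 and 7); the function names and the mask representation are illustrative assumptions.

```python
ALU_RANGE, LOAD_RANGE, STORE_RANGE = 16, 8, 7   # P0-P15, L0-L7, S0-S6
TOTAL_UNITS = ALU_RANGE + LOAD_RANGE + STORE_RANGE


def allowed_units(op_type):
    """Index range q in which V_p,q may be 1 for a node of the given type."""
    if op_type == "load":
        return range(ALU_RANGE, ALU_RANGE + LOAD_RANGE)
    if op_type == "store":
        return range(ALU_RANGE + LOAD_RANGE, TOTAL_UNITS)
    return range(0, ALU_RANGE)  # every other operator is treated as an alu operator


def init_domains(node_types):
    """Per node, a mask over q: 1 = V_p,q is a free binary variable, 0 = fixed to 0."""
    return {
        node: [1 if q in allowed_units(op) else 0 for q in range(TOTAL_UNITS)]
        for node, op in node_types.items()
    }


domains = init_domains({"a": "load", "d": "store", "add": "alu"})
print([sum(mask) for mask in domains.values()])
```

Fixing the out-of-range variables to 0 before solving shrinks the search space, since a load node can never be placed on an ALU or store unit.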

(4) Ensure that each vertex is mapped to one PE, as in equation 8;

(5) Ensure that the same PE can be used in different configurations, as in equation 9;

(6) Ensure that each edge is mapped to at least one interconnection, as in equation 10;

(7) An interconnection cannot be reused within the current configuration file but can be reused outside that configuration, as in equation 11;

(8) Ensure that when DFG node p is mapped to PE q, the edges out of p are also mapped to the interconnections out of q, as in equation 12;

(9) Ensure that when DFG node p is mapped to PE q, the edges into p are also mapped to the interconnections into q, as in equation 13;

After the mathematical model is built, the constraints can be passed to an integer linear programming solver (e.g., OR-Tools developed by Google, or Gurobi).

(10) The objective is set to minimize the sum over all edges, as in equation 14;
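To make the effect of these constraints concrete on a toy instance, the sketch below replaces the integer linear programming solver with plain exhaustive enumeration. This is a deliberately swapped-in technique, practical only for a handful of nodes. All names and data are hypothetical, and the objective is not modeled because every DFG edge here occupies exactly one direct interconnection.

```python
from itertools import permutations


def map_dfg(nodes, dfg_edges, allowed, links):
    """Return the first valid placement (node -> unit index), or None.

    allowed: node -> set of unit indices permitted for that node's type.
    links:   set of (src_unit, dst_unit) physical interconnections.
    """
    units = sorted({u for n in nodes for u in allowed[n]})
    # permutations() yields injective assignments, so no unit is used twice.
    for combo in permutations(units, len(nodes)):
        placement = dict(zip(nodes, combo))
        if any(placement[n] not in allowed[n] for n in nodes):
            continue  # violates the load / store / alu initialization rules
        if all((placement[u], placement[v]) in links for u, v in dfg_edges):
            return placement  # every DFG edge lies on a physical interconnection
    return None


nodes = ["ld", "add", "st"]
dfg_edges = [("ld", "add"), ("add", "st")]
allowed = {"ld": {16}, "add": {0, 1}, "st": {24}}
links = {(16, 1), (1, 24)}
print(map_dfg(nodes, dfg_edges, allowed, links))
```

A real solver explores the same feasible set far more efficiently through branch-and-bound instead of enumerating every permutation.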

Therefore, the beneficial effects of the invention are as follows. The technical threshold of mapping is greatly reduced. The algorithm is simple and easy to implement; its core is the establishment of a mathematical model, which is easy to understand. The solving time is short: a problem with dozens of nodes is solved in only a few seconds. No incorrect mapping occurs, and the method has strong applicability; it is also suitable for route-planning algorithms.

It should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

The detailed description above is only a specific description of feasible embodiments of the present invention and is not intended to limit its scope of protection. Equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall be included in the scope of protection of the present invention.

Claims

1. A method for easy compiling of a reconfigurable chip is characterized in that the reconfigurable chip comprises: an acceleration calculation unit; the acceleration calculation unit includes: the system comprises a plurality of loading units with loading functions, a plurality of storage units with storage functions and an arithmetic unit array; the arithmetic unit array is provided with P rows and Q columns of arithmetic units PE; the loading units are respectively connected with the PE units in the first row and the first column of the arithmetic unit array; the plurality of storage units are respectively connected with the arithmetic unit PE of the P-th row and the Q-th column of the arithmetic unit array;

the method for easily compiling the reconfigurable chip comprises the following steps:

step S101, a DFG data flow diagram of a software model to be compiled is obtained;

step S102, acquiring a computing node set and an edge set in the DFG data flow graph; the edge is capable of expressing a data dependency between the compute node and the compute node;

step S103, establishing mathematical dependency relations among the computing node set, the edge set, the loading unit, the storage unit and the operation unit;

step S104, establishing a loading mapping relation between the calculation node set and the plurality of loading units according to the calculation node set, the edge set, the mathematical dependency relation and the number of the loading units;

establishing a storage mapping relation between the computing node set and the plurality of storage units according to the computing node set, the edge set, the mathematical dependency relation and the number of the storage units;

establishing an operation mapping relation among the calculation node set, the edge set and the operation unit array according to the calculation node set, the edge set, the mathematical dependency relation and the operation unit array;

step S105, linearly compiling the load mapping relationship, the storage mapping relationship and the operation mapping relationship by using a linear compiler to obtain a compilation configuration file that can be mapped to the accelerated computing unit.

2. The method of claim 1, wherein the operation unit array comprises 4 rows and 4 columns of PE units;

the number of the loading units is eight, wherein 4 loading units are connected with the four PE units in the first column in a one-to-one manner, and the other 4 loading units are connected with the four PE units in the first row in a one-to-one manner.

3. The method for easy compiling of a reconfigurable chip according to claim 2, wherein the number of the storage units is seven, wherein 1 storage unit is connected with the PE unit at the first row of the fourth column, 1 storage unit is connected with the PE unit at the second row of the fourth column, and 1 storage unit is connected with the PE unit at the third row of the fourth column; the other 4 storage units are connected one-to-one with the four PE units in the fourth row.

4. The method of claim 2 or 3, wherein it is determined whether each operation unit in the operation unit array has other operation units in the row direction, the column direction and the 45° inclined directions; if so, the operation unit is connected with the two units on its left and right in the row direction, the two units above and below it in the column direction, the two units on its upper right and lower right in the 45° inclined direction, and the two units on its upper left and lower left in the 45° inclined direction;

if not: the operation units at the missing positions are filled in according to a sequential cyclic arrangement in the forward row direction and a sequential cyclic arrangement in the forward column direction, so that the operation unit is connected with the units at the corresponding positions; the forward row direction is the direction from the first row to the fourth row; the forward column direction is the direction from the first column to the fourth column.

5. The method of claim 4, further comprising determining whether any operation unit has another operation unit spaced apart from it in the forward column direction; if so, connecting it with that operation unit; if not, filling in the operation unit at the missing position according to the sequential cyclic arrangement in the forward column direction, and then connecting it with the unit at the corresponding position.

6. The method for facilitating compiling of the reconfigurable chip according to claim 1, wherein the arithmetic operation that the arithmetic unit array can realize comprises: multiplication, selection, unsigned addition, signed addition, unsigned subtraction, signed subtraction, absolute value taking, nop, route operation.

7. The method of claim 1, wherein the mathematical dependency comprises:

the mathematical dependency relationship of the calculation nodes and the operation units in one-to-one correspondence;

the mathematical dependency relationship of the interconnection relationship among the loading unit, the storage unit and the operation unit array and the edges in one-to-one correspondence is realized;

ensuring that the same mathematical dependency of the arithmetic units can be used in different configurations;

mathematical dependency relations that can not be interconnected in the current configuration file and can be interconnected outside the configuration;

when the computing nodes are mapped to the computing units, the edges connected from the computing nodes are correspondingly mapped to mathematical dependency relations on the interconnection relations connected from the computing units;

when the computing nodes are mapped to the computing units, the edges connected to the computing nodes are correspondingly mapped to mathematical dependency relations on the interconnection relations connected to the computing units.

8. A compiler-friendly system of a reconfigurable chip, the reconfigurable chip comprising: an acceleration calculation unit; the acceleration calculation unit includes: the system comprises a plurality of loading units with loading functions, a plurality of storage units with storage functions and an arithmetic unit array; the arithmetic unit array is provided with P rows and Q columns of arithmetic units PE; the loading units are respectively connected with the PE units in the first row and the first column of the arithmetic unit array; the plurality of storage units are respectively connected with the arithmetic unit PE of the P-th row and the Q-th column of the arithmetic unit array;

the easy compiling system of the reconfigurable chip comprises:

an acquisition unit configured to acquire a DFG dataflow graph of a software model to be compiled;

a set acquisition unit configured to acquire a set of compute nodes and a set of edges in the DFG dataflow graph; the edge is capable of expressing a data dependency between the compute node and the compute node;

a dependency relationship establishing unit configured to establish mathematical dependency relationships between the set of compute nodes, the set of edges, and the load unit, the store unit, and the compute unit;

a mapping relationship establishing unit configured to establish a load mapping relationship between the compute node set and the plurality of load units according to the compute node set, the edge set, the mathematical dependency relationship, and the number of load units;

9. The system of claim 8, wherein the mathematical dependencies comprise:

10. The system of claim 8, wherein the array of arithmetic units has 4 rows and 4 columns of PE units;

the number of the loading units is eight, wherein 4 loading units are connected with the four PE units in the first column in a one-to-one manner, and the other 4 loading units are connected with the four PE units in the first row in a one-to-one manner;

the number of the storage units is seven, wherein 1 storage unit is connected with the PE unit at the first row of the fourth column, 1 storage unit is connected with the PE unit at the second row of the fourth column, and 1 storage unit is connected with the PE unit at the third row of the fourth column; the 4 storage units are connected with the four PE units in the fourth row one by one;

it is determined whether each operation unit in the operation unit array has other operation units in the row direction, the column direction and the 45° inclined directions; if so, the operation unit is connected with the two units on its left and right in the row direction, the two units above and below it in the column direction, the two units on its upper right and lower right in the 45° inclined direction, and the two units on its upper left and lower left in the 45° inclined direction;

if not: the operation units at the missing positions are filled in according to a sequential cyclic arrangement in the forward row direction and a sequential cyclic arrangement in the forward column direction, so that the operation unit is connected with the units at the corresponding positions; the forward row direction is the direction from the first row to the fourth row; the forward column direction is the direction from the first column to the fourth column;

it is determined whether any operation unit has another operation unit spaced apart from it in the forward column direction; if so, it is connected with that operation unit; if not, the operation unit at the missing position is filled in according to the sequential cyclic arrangement in the forward column direction, and it is then connected with the unit at the corresponding position.