CN113407483A - Data intensive application oriented dynamic reconfigurable processor - Google Patents


Info

Publication number: CN113407483A (granted as CN113407483B)
Application number: CN202110703118.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘大江, 朱蓉
Original and current assignee: Chongqing University
Application filed by Chongqing University; priority to CN202110703118.0A
Legal status: Active (granted)

Classifications

    • G06F15/8023: Architectures of general-purpose stored-program computers; SIMD multiprocessors; two-dimensional arrays, e.g. mesh, torus
    • G06F15/7867: Architectures comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871: Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y02D10/00: Energy-efficient computing, e.g. low-power processors, power management or thermal management


Abstract

The invention provides a dynamically reconfigurable processor for data-intensive applications. The processor comprises a processing unit array, an on-chip multi-bank scratchpad memory, and a configuration memory. The processing unit array is composed of m × n processing units PE in a two-dimensional array, where m and n are positive integers; the PEs of a row are connected to the same bus, and each bus accesses the m banks of the scratchpad memory through a crossbar selection matrix unit. The architecture provided by the application lets reusable data flow efficiently within the processing unit array, avoids repeated accesses to the same storage location, reduces the volume of memory traffic at its source, and greatly improves the loop-pipelining performance of the dynamically reconfigurable processor.

Description

Data intensive application oriented dynamic reconfigurable processor
Technical Field
The invention relates to the technical field of integrated circuits, and in particular to dynamically reconfigurable processors.
Background
With the development of technologies such as cloud computing, big data, and the Internet of Things, and the popularization of all kinds of intelligent terminal devices, data traffic is growing ever faster and the demand for high-performance chips is increasingly urgent. The dynamically reconfigurable processor is a new processor architecture whose energy efficiency approaches that of an application-specific integrated circuit (ASIC) without sacrificing much programming flexibility, making it one of the ideal architectures for accelerating data-intensive applications. Unlike a conventional general-purpose processor (GPP), a dynamically reconfigurable processor incurs no latency or energy overhead for instruction fetch and decode. Unlike an ASIC, it can dynamically reconfigure the function of its circuits at run time, offering better flexibility. Unlike a field-programmable gate array (FPGA), it uses a coarse-grained configuration mode, which reduces the cost of configuration information and yields higher computational energy efficiency.
A typical dynamically reconfigurable processor generally consists of a processing unit array, a data memory, and a configuration memory. The processing unit array is composed of multiple processing elements (PEs), and the function of the whole array is defined by configuring the connectivity and operation mode of each PE. The configuration is mainly derived from the mapping produced by the compilation algorithm. Modulo-scheduled loop software pipelining is one of the most common mapping optimizations in compilation; it improves the parallel execution performance of an application by minimizing the initiation interval (II) of loop iterations. However, high computational parallelism requires a large amount of data to be accessed in parallel between the data memory and the processing unit array. To cope with this parallel-access pressure, conventional reconfigurable processors generally use an on-chip multi-bank scratchpad memory (SPM) to feed the processing unit array in parallel. A 4 × 4 processing unit array is typically equipped with a 4-bank SPM, in which each row of the array can access each bank of the SPM in parallel, but different PEs within the same row can only access data serially because they share a data bus. To further improve parallel data access capability, HReA [L. Liu, Z. Li, C. Yang, C. Deng, S. Yin and S. Wei, "HReA: An Energy-Efficient Embedded Dynamically Reconfigurable Fabric for 13-Dwarfs Processing," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 3, pp. 381–385, March 2018] equips a 4 × 4 processing unit array with a 16-bank SPM. In that architecture, every PE can access each bank of the SPM in parallel, greatly enhancing parallel access capability, but this also increases the difficulty of bank management and adds chip area and power consumption.
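The bus-sharing constraint can be illustrated with a toy model. The sketch below is not part of the patent: the `access_cycles` helper and its request encoding are assumptions, modeling a 4 × 4 array where each row shares one bus and can therefore issue only one memory access per cycle, while different rows proceed in parallel.

```python
from collections import Counter

def access_cycles(requests):
    """Cycles needed to serve memory requests on a hypothetical 4x4 PE
    array with a per-row shared bus.  `requests` maps (row, pe) -> bank.
    Rows run in parallel; within a row, accesses are serialized."""
    per_row = Counter(row for (row, _pe) in requests)
    # A row with k pending accesses needs k bus cycles; rows overlap.
    return max(per_row.values(), default=0)

# Four PEs in the same row each need a word: 4 serial bus cycles.
print(access_cycles({(0, p): p for p in range(4)}))   # 4
# One access per row across 4 rows completes in a single cycle.
print(access_cycles({(r, 0): r for r in range(4)}))   # 1
```

This is why simply adding banks (as in the 16-bank HReA configuration) is not free: the serialization here comes from the shared row bus, not from the bank count alone.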
To make full use of the limited bandwidth of the SPM, the literature [S. Yin, X. Yao, T. Lu, D. Liu, J. Gu, L. Liu, and S. Wei, "Conflict-free loop mapping for coarse-grained reconfigurable architecture with multi-bank memory," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, pp. 2471–2485, 2017] proposes a conflict-free loop mapping method from the compilation perspective. The method schedules the access operators in the DFG to different time steps to reduce the amount of parallel data access, then organizes the storage locations of data in the SPM through a memory partitioning algorithm, ultimately reducing data access conflicts. But as the initiation interval of the loop pipeline becomes smaller and smaller, multiple access operators have to execute simultaneously, and conflict-free data access can only be guaranteed at the expense of performance. The above work enables the original access operators of an application to operate conflict-free from the architecture or compilation perspective, but the number of access operators is essentially unchanged, which limits how far access conflicts can be optimized.
From the perspective of data-intensive applications, there are many data-reuse opportunities, such as stencil computation, in an application's loop kernel. Although the loop kernel contains many memory accesses, some of them actually read the same data. If the same data is fetched once and used multiple times, access conflicts can be reduced by reducing the number of access operators. An unavoidable question, however, is how to route this reusable data between PEs.
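The arithmetic behind this reuse argument can be sketched as follows. The three-tap stencil and the helper names `loads_naive` and `loads_reused` are hypothetical, chosen only to show how fetching each value once shrinks the number of access operators that hit memory.

```python
def loads_naive(n, taps=3):
    """Loads issued by a kernel like y[i] = f(x[i], x[i+1], x[i+2])
    when every operand is fetched from memory at every iteration."""
    return n * taps

def loads_reused(n, taps=3):
    """Loads when the last (taps - 1) values stay in registers, so each
    iteration fetches only the one new value x[i + taps - 1]."""
    return (taps - 1) + n  # warm-up loads plus one load per iteration

# 100 iterations of a 3-tap stencil: 300 loads naively vs 102 with reuse.
print(loads_naive(100), loads_reused(100))  # 300 102
```

With a 3× reduction in access operators, the pressure on the shared row buses drops by the same factor, which is exactly the opportunity the following sections exploit.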
In a conventional processing unit array, the processing units form a single-channel network: data can be transferred between PEs only by entering through the two multiplexer (MUX) inputs of a PE and leaving through its output register, and a PE used this way can perform no operation other than data routing, which wastes resources. If, instead of routing the data out, the data is retained for multiple cycles in the local register file (LRF) of a PE, then the mapping of operators onto the processing unit array becomes very restricted at compile time: an operator can only be mapped onto that PE in order to use the data in its LRF. Therefore, how to provide an architecture that routes data between PEs efficiently, makes full use of PE resources, and keeps operator mapping flexible at compile time is a technical problem urgently awaiting a solution by those skilled in the art. Once efficient data routing is solved, reusable data can be fully exploited, reducing access operators and access conflicts, and ultimately improving the execution performance of the dynamically reconfigurable processor.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the first purpose of the present invention is to provide a dynamic reconfigurable processor oriented to data intensive applications, so as to improve the parallel data access and storage capability of the reconfigurable processor.
A second object of the invention is to propose a non-transitory computer-readable storage medium.
To achieve the above object, an embodiment of the first aspect of the present invention provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory SPM, and a configuration memory. The processing unit array is composed of m × n processing units PE in a two-dimensional array, where m and n are positive integers; the PEs of a row are connected to the same bus, and each bus accesses the m banks of the scratchpad memory through a crossbar selection matrix unit.
Optionally, in an embodiment of the application, each PE comprises a functional unit FU, a local register file RF, an output register, and a configuration register. The functional unit FU performs various fixed-point operations and has two multiplexers at its inputs for fetching data from different sources. The local register file RF is divided into r separate registers, where r is a positive integer, and each register selects data coming either from the functional unit FU or from the previous register. The information in the configuration register of each processing unit PE comes from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
Optionally, in an embodiment of the application, the processing units PE are interconnected by a dual-channel network comprising a result network for passing the computation results of the functional unit FU and a value network for passing values of the local register file RF.
Optionally, in an embodiment of the application, when a value fetched from memory by an access operator must be referenced multiple times within a short period, the value is distributed through the value network to the other processing units PE that need to reference it.
Optionally, in an embodiment of the present application, a serial-shift data channel is added among the internal registers of the local register file RF.
Optionally, in an embodiment of the present application, operation includes the following steps:
Step 1: convert the application pseudocode into the original data dependency graph. Because the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, the two access operators L2 and L3 are removed and two new reuse-dependence edges, (L1, *) and (L1, +), are added, yielding a new data dependency graph. (L1, *) means that the data fetched by access operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) means that the data fetched by L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool, a configuration information stream is generated, and a placement result with II = 1 is obtained: access operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the access operator S1 on PE4;
Step 2: at time t, driven by the configuration stream, after the L1 operator of time t finishes fetching its data, it places the data in the last register of PE1;
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is passed out through multiplexer Mb of PE1 and through multiplexers Ma and M1 of PE2 into the first register R1 of PE2; after the L1 operator of time t+1 finishes fetching its data, it places the data in the output register of PE1 and also in the last register of PE1;
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is passed out through multiplexer Mb of PE2 and through multiplexers Ma and M1 of PE3 into the first register R1 of PE3; at the same time, driven by the configuration stream, the time-t L1 data also passes through multiplexers Md and Mf in front of the FU of PE2 to the second input port of the FU; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through multiplexer Me in front of the FU of PE2; driven by the configuration stream, the FU performs the multiplication "*" operation and the result is stored in the output register of PE2;
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transmitted from the output port of the first register R1 of PE3 through multiplexer M2 into the second register R2 of PE3; meanwhile, the FU of PE2 performs a multiplication operation and the result is stored in the output register of PE2;
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU through multiplexers Md and Mf in front of the FU of PE3; meanwhile, the data in the output register of PE2 is transmitted through multiplexer Me in front of the FU of PE3 to the first input port of the FU; driven by the configuration stream, the FU performs the addition "+" operation and the result is kept in the output register of PE3;
Step 7: at time t+5, the data in the output register of PE3 is transmitted through multiplexer Me to the first input port of the FU of PE4, and the data is stored into a bank through the bus.
To achieve the above object, an embodiment of the second aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the reusable-data transfer method described in the embodiment of the first aspect of the present application.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a diagram of a data-intensive application-oriented dynamically reconfigurable processor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an example application of the architecture provided by an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A data intensive application oriented dynamically reconfigurable processor according to an embodiment of the present invention is described below with reference to the accompanying drawings.
FIG. 1 is a schematic structural diagram of a data-intensive-application-oriented dynamically reconfigurable processor according to an embodiment of the present invention.
To achieve the above object, as shown in FIG. 1, an embodiment of the first aspect of the present invention provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory SPM, and a configuration memory. The processing unit array is composed of m × n processing units PE in a two-dimensional array, where m and n are positive integers; the PEs of a row are connected to the same bus, and each bus accesses the m banks of the scratchpad memory through a crossbar selection matrix unit.
In an embodiment of the application, furthermore, each PE comprises a functional unit FU, a local register file RF, an output register, and a configuration register. The functional unit FU performs various fixed-point operations and has two multiplexers at its inputs for fetching data from different sources. The local register file RF is divided into r separate registers, where r is a positive integer, and each register selects data coming either from the functional unit FU or from the previous register. The information in the configuration register of each processing unit PE comes from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In an embodiment of the application, further, the processing units PE are interconnected by a dual-channel network comprising a result network for passing the computation results of the functional unit FU and a value network for passing values of the local register file RF.
In an embodiment of the application, further, when a value fetched from memory by an access operator must be referenced multiple times within a short period, the value is distributed through the value network to the other processing units PE that need to reference it.
In an embodiment of the application, further, a serial-shift data channel is added among the internal registers of the local register file RF.
In one embodiment of the present application, specifically, the FU can perform various fixed-point operations, including arithmetic and logical operations such as addition, subtraction, and multiplication. The input of the FU has two 6-to-1 multiplexers (Me, Mf) that can fetch data from different sources, such as the FUs of neighbouring PEs, the individual local registers of the PE's own register file RF, and the memory. The output of the FU has three destinations: the memory, the output register, and the registers of the register file RF.
The RF is divided into r separate registers (R1, R2, …, Rr, where r is a positive integer), each preceded by a 2-to-1 multiplexer (M1, M2, …, Mr) that selects data from either the FU or the previous register (for the first register R1, the "previous register" is a register of a neighbouring PE's RF). The output port of each register is connected to the next register (the last register Rr has no following register) and to three r-to-1 multiplexers (Mb, Mc, Md). Through multiplexer Mc or Md, the data in a register can be selected into the local FU for computation; through multiplexer Mb, it can be selected into the first register (R1) of a neighbouring RF. Meanwhile, the first register R1 of an RF selects data from one of the neighbouring RFs through the 4-to-1 multiplexer Ma.
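As a rough software model of such a register file, the sketch below (an illustration, not the patent's implementation) implements r registers, each fed by a 2-to-1 multiplexer choosing between the FU result and the previous register; the neighbouring-RF input of R1 is simplified to the constant 0.

```python
class RegisterFile:
    """Toy model of an r-register local RF.  Each register Ri has a
    2-to-1 multiplexer: select 'fu' to latch the FU result, or 'prev'
    to shift in the value of R(i-1) (for R1, 'prev' is modelled as 0
    instead of a neighbouring PE's RF output)."""

    def __init__(self, r):
        self.regs = [0] * r

    def tick(self, fu_value, sel):
        # One clock edge: every register latches its selected input
        # simultaneously, so we compute all new values before writing.
        new = []
        for i, s in enumerate(sel):
            if s == 'fu':
                new.append(fu_value)
            else:
                new.append(self.regs[i - 1] if i > 0 else 0)
        self.regs = new

rf = RegisterFile(3)
rf.tick(7, ['fu', 'prev', 'prev'])   # latch FU result 7 into R1
rf.tick(9, ['fu', 'prev', 'prev'])   # latch 9 into R1, shift 7 to R2
print(rf.regs)  # [9, 7, 0]
```

Selecting `'prev'` on every register turns the file into the shift chain described below; selecting `'fu'` on one register and holding the rest gives the normal-mode behaviour.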
The information in the configuration register of each PE comes from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the FU, and the read/write enables of the registers.
In one embodiment of the application, specifically, with dual-channel interconnection between the processing units, the original single-channel network between PEs becomes a dual-channel network: one network transfers the computation results of the FU (the result network), and the other transfers RF values (the value network). The result network passes a FU's computation result either through the output register to the downstream PE or through the bus into the memory. The value network flexibly configures the output direction of a value in the RF through multiplexers, transferring it to the first register of a neighbouring RF or to other local registers.
When a value fetched from memory by an access operator needs to be referenced multiple times within a short period, it can be distributed through the constructed value network to the other PEs that need it, enhancing the data multiplexing capability of the reconfigurable computing array. The dual-channel interconnection network can reduce the number of access operators in a data flow graph at the source, thereby reducing access conflicts on the data memory.
In one embodiment of the present application, specifically, although the dual-channel inter-PE interconnection network enhances the flexibility of data transfer, in pipelined execution the register interconnections within the processing units can only guarantee functional correctness when the clock cycle at which data should arrive (Required Time, RT) equals the clock cycle at which it actually arrives (Arrival Time, AT). The AT of reused data depends on the Manhattan distance between the producer PE and the consumer PE, and during compilation it is difficult to guarantee that this distance matches the RT. Therefore, a serial-shift data path is added to the internal registers of the RF: the registers inside each RF are connected end to end in sequence, so that reused data can remain inside the RF of the same PE for multiple clock cycles in pipelined mode.
After the serial-shift data path is added, each register can select either data from the FU or data from the previous register, since each register is preceded by a 2-to-1 multiplexer. The RF can therefore operate in either normal mode or shift-register mode. In normal mode, a register temporarily stores data for the next time step. In shift-register mode, the registers of the processing units form a register chain whose length can be selected through the multiplexers, flexibly configuring the number of clock cycles that data spends in transit. Assuming the Manhattan distance between the data producer and consumer is M1 and the number of registers inside a single RF is r, the adjustable range of the data's AT is [M1 + 1, r × (M1 + 1)], which greatly enhances the ability of data to arrive synchronously. The register interconnection network inside the PE thus provides a hardware basis for data synchronization and a flexibility guarantee for subsequent compilation and mapping.
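The stated arrival-time window can be computed directly. The helper name `at_range` below is invented for this sketch; the formula follows the [M1 + 1, r × (M1 + 1)] range given in the description.

```python
def at_range(manhattan, r):
    """Adjustable arrival-time window (in clock cycles) for reused data
    when each of the (manhattan + 1) pipeline stages on the path can be
    stretched by up to r serial RF registers."""
    lo = manhattan + 1        # shortest path: one register per hop
    hi = r * (manhattan + 1)  # longest path: full shift chain per hop
    return lo, hi

# Producer and consumer two hops apart, 4 registers per RF:
print(at_range(2, 4))  # (3, 12)
```

At compile time, a mapper can check that the required time RT falls inside this window before committing a placement; if RT is outside it, the operators must be placed closer together or further apart.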
The technical effects of this application: the embodiments provide a dynamically reconfigurable processor oriented to data-intensive applications. Addressing the drawback of conventional inter-PE interconnection networks, in which only the functional units (FUs) are connected through an output register and no direct interconnection channel exists between the register files of the PEs, the architecture of the invention adds an interconnection function to the register file RF of every PE in the processing unit array, so that reusable data flows efficiently within the array, repeated accesses to the same storage location are avoided, memory traffic is reduced at its source, and the loop-pipelining performance of the dynamically reconfigurable processor can be greatly improved.
As shown in FIG. 2, in an embodiment of the present application, the following steps are further included:
Step 1: convert the application pseudocode into the original data dependency graph. Because the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, the two access operators L2 and L3 are removed and two new reuse-dependence edges, (L1, *) and (L1, +), are added, yielding a new data dependency graph. (L1, *) means that the data fetched by access operator L1 is transmitted through the value network to the multiplication operator "*" for consumption; (L1, +) means that the data fetched by L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is obtained with a compilation tool, a configuration information stream is generated, and a placement result with II = 1 is obtained: access operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the access operator S1 on PE4;
Step 2: at time t, driven by the configuration stream, after the L1 operator of time t finishes fetching its data, it places the data in the last register of PE1;
Step 3: at time t+1, driven by the configuration stream, the time-t L1 data is passed out through multiplexer Mb of PE1 and through multiplexers Ma and M1 of PE2 into the first register R1 of PE2; after the L1 operator of time t+1 finishes fetching its data, it places the data in the output register of PE1 and also in the last register of PE1;
Step 4: at time t+2, driven by the configuration stream, the time-t L1 data is passed out through multiplexer Mb of PE2 and through multiplexers Ma and M1 of PE3 into the first register R1 of PE3; at the same time, driven by the configuration stream, the time-t L1 data also passes through multiplexers Md and Mf in front of the FU of PE2 to the second input port of the FU; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through multiplexer Me in front of the FU of PE2; driven by the configuration stream, the FU performs the multiplication "*" operation and the result is stored in the output register of PE2;
Step 5: at time t+3, driven by the configuration stream, the time-t L1 data is transmitted from the output port of the first register R1 of PE3 through multiplexer M2 into the second register R2 of PE3; meanwhile, the FU of PE2 performs a multiplication operation and the result is stored in the output register of PE2;
Step 6: at time t+4, driven by the configuration stream, the time-t L1 data reaches the second input port of the FU through multiplexers Md and Mf in front of the FU of PE3; meanwhile, the data in the output register of PE2 is transmitted through multiplexer Me in front of the FU of PE3 to the first input port of the FU; driven by the configuration stream, the FU performs the addition "+" operation and the result is kept in the output register of PE3;
Step 7: at time t+5, the data in the output register of PE3 is transmitted through multiplexer Me to the first input port of the FU of PE4, and the data is stored into a bank through the bus.
In one embodiment of the present application, specifically, after 6 clock cycles one complete iteration of the loop pipeline has been executed. Since II = 1, the processing unit array executes the same configuration information every clock cycle.
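This iteration can be mimicked in software. The patent does not give the exact loop pseudocode, so the kernel y[i] = x[i] * x[i+1] + x[i+2] below is an assumption that merely matches the described reuse pattern (the value x[i+2] loaded by L1 at iteration i is reused as x[i+1] and x[i] in the next two iterations); `kernel_with_reuse` is a hypothetical name.

```python
def kernel_with_reuse(x):
    """Hypothetical kernel y[i] = x[i]*x[i+1] + x[i+2], issuing exactly
    one load (the L1 operator) per iteration; the two older operands
    come from a 2-deep register chain, as in the shift-register RF."""
    y = []
    r1, r2 = x[1], x[0]          # warm-up: two values already in registers
    for i in range(len(x) - 2):
        new = x[i + 2]           # the only memory load per iteration
        y.append(r2 * r1 + new)  # x[i] * x[i+1] + x[i+2]
        r2, r1 = r1, new         # shift the register chain by one
    return y

x = [1, 2, 3, 4, 5]
print(kernel_with_reuse(x))  # [5, 10, 17]
```

A naive version of the same kernel would load x[i], x[i+1], and x[i+2] every iteration; here the register chain supplies two of the three operands, which is what removing the L2 and L3 operators accomplishes in the mapped DDG.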
In one embodiment of the present application, specifically, FIG. 2 shows: (a) an example loop pseudocode; (b) the original DDG obtained from (a); (c) the data-reuse DDG obtained from (b); (d) an example (m = 2, n = 2, r = 2) of the inventive architecture; (e) the transmission over the example architecture of the data fetched by L1 at time t.
In order to implement the above embodiments, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for transferring reusable data according to the embodiments of the first aspect of the present application.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and does not limit the scope of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the schematic or otherwise described herein, e.g., as a sequential list of executable instructions that may be thought of as being useful to implement logical functions, may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (7)

1. A dynamic reconfigurable processor for data-intensive applications, characterized in that the dynamic reconfigurable processor comprises a processing unit array, an on-chip multi-bank scratchpad memory and a configuration memory; the processing unit array is composed of m × n processing units PE in a two-dimensional array form, m and n being positive integers, wherein the PEs in the same row are connected to the same bus, and each bus accesses m banks in the scratchpad memory through a cross selection matrix unit.
2. The dynamic reconfigurable processor according to claim 1, characterized in that each PE comprises a functional unit FU, a local register file RF, an output register and a configuration register; the functional unit FU is used for performing various fixed-point operations and has two multiplexers at its inputs, the multiplexers being used to select data from different sources; the local register file RF is divided into r separate registers, r being a positive integer, and each register selects data originating from the functional unit FU or from the previous register; the information of the configuration register in each processing unit PE originates from the configuration memory, which is connected to the various components within the processing unit PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the functional unit FU, and the read-write enables of the registers.
3. The dynamic reconfigurable processor according to claim 2, characterized in that the processing units PE are interconnected by a two-channel network comprising a result network for passing the computation results of the functional units FU and a numerical network for passing the values of the local register files RF.
4. The dynamic reconfigurable processor according to claim 3, characterized in that, when a value obtained from memory by an access operator is referenced a plurality of times within a short time, the value is distributed through the numerical network to the other processing units PE that need to reference it.
5. The dynamic reconfigurable processor according to claim 3 or 4, characterized in that a serial-shift data channel is added to the internal registers of the local register file RF.
6. A method of transferring reusable data using the dynamic reconfigurable processor for data-intensive applications according to any one of claims 1 to 5, comprising the steps of:
step 1, converting the application pseudocode into an original data dependency graph; because the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, the two access operators L2 and L3 are removed and two new reuse dependency edges (L1, *) and (L1, +) are added to obtain a new data dependency graph; (L1, *) indicates that the data fetched by the operator L1 will be transmitted through the numerical network to the multiplication operator "*" for consumption, and (L1, +) indicates that the data fetched by the operator L1 will be transmitted through the numerical network to the addition operator "+" for consumption; a compilation scheme is obtained through a compilation tool, a configuration information stream is generated, and a placement result with II = 1 is obtained: the access operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the access operator S1 on PE4;
step 2, at time t, driven by the configuration stream, after the operator L1 finishes fetching the data, it places the data in the last register of PE1;
step 3, at time t+1, driven by the configuration stream, the data fetched by L1 at time t is passed out through the multiplexer Mb of PE1 and through the multiplexers Ma and M1 of PE2 into the first register R1 of PE2; after L1 finishes fetching data at time t+1, it places the data in the output register of PE1 and also in the last register of PE1;
step 4, at time t+2, driven by the configuration stream, the data fetched by L1 at time t is passed out through the multiplexer Mb of PE2 and through the multiplexers Ma and M1 of PE3 into the first register R1 of PE3; at the same time, driven by the configuration stream, the data fetched by L1 at time t also passes through the multiplexers Md and Mf in front of the FU of PE2 to the second input port of the FU; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2; the FU performs a multiplication operation under the drive of the configuration stream, and the result is stored in the output register of PE2;
step 5, at time t+3, driven by the configuration stream, the data fetched by L1 at time t passes through the output port of the first register R1 of PE3 and through the multiplexer M2 into the second register R2 of PE3; meanwhile, PE2 performs a multiplication operation, and the result is stored in the output register of PE2;
step 6, at time t+4, driven by the configuration stream, the data fetched by L1 at time t reaches the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE3; meanwhile, the data in the output register of PE2 is transmitted to the first input port of the FU through the multiplexer Me in front of the FU of PE3; the FU performs the addition operation "+" under the drive of the configuration stream, and the result is stored in the output register of PE3;
step 7, at time t+5, the data in the output register of PE3 is transferred to the first input port of the FU of PE4 through the multiplexer Me, and the data is stored in a Bank through the bus.
7. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method of transferring reusable data according to any one of claims 1-6.
CN202110703118.0A 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application Active CN113407483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703118.0A CN113407483B (en) 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703118.0A CN113407483B (en) 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application

Publications (2)

Publication Number Publication Date
CN113407483A true CN113407483A (en) 2021-09-17
CN113407483B CN113407483B (en) 2023-12-12

Family

ID=77683003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703118.0A Active CN113407483B (en) 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application

Country Status (1)

Country Link
CN (1) CN113407483B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050171990A1 (en) * 2001-12-06 2005-08-04 Benjamin Bishop Floating point intensive reconfigurable computing system for iterative applications
US20100211747A1 (en) * 2009-02-13 2010-08-19 Shim Heejun Processor with reconfigurable architecture
CN102253921A (en) * 2011-06-14 2011-11-23 清华大学 Dynamic reconfigurable processor
US20130089102A1 (en) * 2011-10-05 2013-04-11 Woong Seo Coarse-grained reconfigurable array based on a static router
CN103218347A (en) * 2013-04-28 2013-07-24 清华大学 Multiparameter fusion performance modeling method for reconfigurable array
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN112506853A (en) * 2020-12-18 2021-03-16 清华大学 Reconfigurable processing unit array of zero-buffer flow and zero-buffer flow method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JONGEUN LEE et al.: "Fast shared on-chip memory architecture for efficient hybrid computing with CGRAs", 2013 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), pages 1 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838498A (en) * 2021-09-27 2021-12-24 华中科技大学 Data multiplexing operation circuit and method for memory calculation
CN113838498B (en) * 2021-09-27 2023-02-28 华中科技大学 Data multiplexing operation circuit and method for memory calculation
WO2023151216A1 (en) * 2022-02-14 2023-08-17 华为技术有限公司 Graph data processing method and chip

Also Published As

Publication number Publication date
CN113407483B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
JP6059413B2 (en) Reconfigurable instruction cell array
US10469397B2 (en) Processors and methods with configurable network-based dataflow operator circuits
CN108268278B (en) Processor, method and system with configurable spatial accelerator
US10387319B2 (en) Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features
EP3005139B1 (en) Incorporating a spatial array into one or more programmable processor cores
EP3343388A1 (en) Processors, methods, and systems with a configurable spatial accelerator
US8055881B2 (en) Computing nodes for executing groups of instructions
US20190004878A1 (en) Processors, methods, and systems for a configurable spatial accelerator with security, power reduction, and performace features
CN107229463B (en) Computing device and corresponding computing method
Hartenstein et al. A new FPGA architecture for word-oriented datapaths
US20200310797A1 (en) Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator
CN113407483B (en) Dynamic reconfigurable processor for data intensive application
CN102402415B (en) Device and method for buffering data in dynamic reconfigurable array
Abnous et al. Pipelining and bypassing in a VLIW processor
US9727526B2 (en) Apparatus and method of vector unit sharing
Abdelhamid et al. A highly-efficient and tightly-connected many-core overlay architecture
Schneider et al. Virtual buffers for exposed datapath architectures
Lee et al. Mapping loops on coarse-grain reconfigurable architectures using memory operation sharing
Hussain et al. HARP 2: An X-Scale Reconfigurable Accelerator-Rich Platform for Massively-Parallel Signal Processing Algorithms
KR20060090512A (en) Resource sharing and pipelining in coarse-grained reconfigurable architecture
US7260709B2 (en) Processing method and apparatus for implementing systolic arrays
CN112506853A (en) Reconfigurable processing unit array of zero-buffer flow and zero-buffer flow method
CN112559442A (en) Array digital signal processing system based on software defined hardware
Rettkowski et al. Application-specific processing using high-level synthesis for networks-on-chip
Gokhale et al. Malleable architecture generator for FPGA computing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant