CN112631610B - Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure - Google Patents


Info

Publication number: CN112631610B
Application number: CN202110061629.7A
Authority: CN (China)
Prior art keywords: reuse, load, pair, data, producer
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112631610A
Inventors: 绳伟光, 陈雨歌, 蒋剑飞, 景乃锋, 王琴, 毛志刚
Current and original assignee: Shanghai Jiaotong University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by: Shanghai Jiaotong University
Priority: PCT/CN2021/079524 (WO2022110567A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A method for eliminating memory access conflicts through data reuse on coarse-grained reconfigurable architectures. The method provides a loop transformation model, applied to perfectly nested loop kernels, that maximizes the data reuse available between iterations at run time, together with a configuration file modification strategy in the code generation stage that eliminates redundant memory access operations within data reuse pairs, thereby reducing memory access conflicts during loop execution.

Description

Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
Technical Field
The application relates to the field of coarse-grained reconfigurable architecture compilers, and in particular to a loop transformation model that maximizes effective data reuse on a reconfigurable architecture and a configuration file modification strategy used in the code generation stage.
Background
A Coarse-Grained Reconfigurable Architecture (CGRA) is a new hardware architecture with a high energy-efficiency ratio. It is mainly applied to hardware acceleration of compute-intensive applications such as video processing and neural network operations. Research shows that application execution time is mainly spent in the loop kernels of a program. The ADRES model of reference [1] gives a typical CGRA structure, as shown in FIG. 1. The compiler abstracts the loop kernel of a program into a Data Flow Graph (DFG), maps the DFG onto the Processing Elements (PEs) of a coarse-grained Processing Element Array (PEA), and executes the DFG using software pipelining. The number of machine cycles between the starts of two successive loop iterations is called the Initiation Interval (II); the smaller the II, the better the acceleration performance. One of the main goals of a CGRA compiler is to select scheduling and mapping strategies that minimize the II of the loop kernel.
However, software pipelining increases the demand for parallel access to the on-chip global memory (OGM), and pipeline stalls caused by memory access conflicts become a bottleneck in CGRA efficiency. The OGM is connected to the columns of the PEA through crossbars and column buses, through which different PEs can access the on-chip memory banks. When multiple operations occupy the same hardware resource in the same cycle (e.g., simultaneously accessing the same bank of a multi-bank on-chip memory, or exceeding the column bus bandwidth when accessing on-chip memory via the column bus), the software pipeline stalls and the II increases.
A study of 22 common computation kernels shows that memory access operations account for 52.2% of all operations, and the delay caused by conflicts between memory accesses in the same control cycle accounts for 68.4% of total running time. Existing work on eliminating memory access conflicts focuses on adjusting the placement of memory access operations in the scheduling and mapping stages and on adjusting the storage locations of data in the on-chip memory, but it fails to address the fundamental cause of these conflicts: the excessive number of memory access operations introduced by software pipelining.
Related studies are analyzed as follows:
Research on applying the polyhedral model to coarse-grained reconfigurable architectures
Loop transformation based on the polyhedral model is a newer loop optimization method. Compared with the traditional approach of deriving affine transformations from a unimodular matrix model, it offers a wider application range, stronger expressive power, and a larger optimization space. Today, polyhedral models are used in various compilers such as Google MLIR [2], Polly [3], and TVM [4]. However, little work applies the polyhedral compilation model to the CGRA compilation process. PolyMap, proposed in reference [5], uses a polyhedral compilation model to adjust the hierarchical structure of the loop kernel based on an analysis of the whole mapping flow, unrolling inner loops of the kernel for parallelism. Reference [6] uses a polyhedral model to represent programs and program transformations, and applies genetic algorithms to choose loop transformations for higher computational efficiency. However, none of these studies consider applying the polyhedral model to increase data reuse between loop iterations so as to reduce the number of redundant memory access operations.
Research on reducing memory access conflicts on CGRAs
Memory conflicts are primarily caused by large numbers of simultaneous memory accesses. Existing research on reducing access conflicts mostly focuses on optimizing the scheduling and mapping processes and the layout of data in the on-chip memory so as to reduce multi-bank conflicts. Reference [7] divides data into clusters so that the probabilities of different clusters being accessed are as equal as possible; this resolves conflicts at the granularity of whole arrays, and the coarse granularity limits the optimization. Reference [8] analyzes all memory addresses of a single array accessed by the loop kernel in a single control cycle and uses a linear transformation to map different addresses to different storage locations of different on-chip memories. Reference [9] builds on [8] by further subdividing the bank structure and adding intra-block data partitions, increasing the flexibility of the linear transformation. Reference [10] extends the selection of linear transformation parameters from single-array accesses in a single control cycle to different arrays accessed in different control cycles, further reducing multi-bank conflicts; it also proposes a bank-merging algorithm that improves the area-performance ratio of the memory banks. Reference [11] builds on [10] by changing the execution time of each memory access operator during scheduling, further reducing access conflicts, and proposes a dual force-directed scheduling scheme that spreads memory access operations across different control cycles as much as possible. However, none of these studies consider reducing the number of memory access operations as a way to reduce memory access conflicts.
The information of the references mentioned above is as follows:
[1] Y. Park, J. J. K. Park, and S. Mahlke. 2012. Efficient performance scaling of future CGRAs for mobile applications. In International Conference on Field-Programmable Technology (FPT). 335-342.
[2] Lattner C, Amini M, Bondhugula U, et al. MLIR: A Compiler Infrastructure for the End of Moore's Law [J]. 2020.
[3] Grosser, T., Groesslinger, A., & Lengauer, C. (2012). Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(4), 1-28.
[4] Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan et al. "TVM: An automated end-to-end optimizing compiler for deep learning." In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594. 2018.
[5] Liu, D., Yin, S., Peng, Y., Liu, L., & Wei, S. (2015). Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(11), 2581-2594.
[6] Ganser S, Größlinger A, Siegmund N, et al. Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance Models [J]. ACM Transactions on Architecture & Code Optimization, 2018, 15(4): 1-27.
[7] Kim, Y., Lee, J., Shrivastava, A., & Paek, Y. (2010). Operation and data mapping for CGRAs with multi-bank memory. Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 17-25.
[8] Wang Y, Li P, Zhang P, et al. Memory partitioning for multidimensional arrays in high-level synthesis [C] // ACM/EDAC/IEEE Design Automation Conference. IEEE, 2013.
[9] Wang, Y., Li, P., & Cong, J. (2014). Theory and algorithm for generalized memory partitioning in high-level synthesis. ACM/SIGDA International Symposium on Field Programmable Gate Arrays - FPGA, 199-208.
[10] Yin S, Xie Z, Meng C, et al. Multi-bank memory optimization for parallel data access in multiple data arrays [C] // IEEE/ACM International Conference on Computer-Aided Design. IEEE, 2017.
[11] Park, Hyunchul & Fan, Kevin & Mahlke, Scott & Oh, Taewook & Kim, Heeseok & Kim, Hong-seok. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. 166-176.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present application aims to provide a new method for CGRA architectures that optimizes software loop code at compile time so as to maximize the data reuse available across iterations of program execution and to eliminate redundant memory access operations in data reuse pairs, thereby reducing memory access conflicts during loop execution.
The method comprises the following steps:
step 1, compiling the loop kernel part of the source program into a first intermediate representation using the front end of a CGRA (Coarse-Grained Reconfigurable Architecture) compiler;
step 2, carrying out cyclic transformation for reusing effective data between iterations to obtain a transformed second intermediate representation;
step 3, generating a data flow graph to obtain a mapping result;
step 4, selecting a modification strategy for data reuse;
step 5, generating a configuration file;
the step 2 further comprises:
step 2.1, analyzing the dependency relationships and the data reuse relationships in the original code to obtain a dependency set and a reuse set; a reuse relationship means that two operations in different iterations access the same memory location; the two operations in a reuse relationship form a producer/consumer pair, comprising a producer and a consumer, the types of producer/consumer pairs being Load-Load reuse pairs, Load-Store reuse pairs, Store-Load reuse pairs, and Store-Store reuse pairs;
step 2.2, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that satisfies the constraint in order, and finding the innermost time vector that maximizes the number of effective reuses in the reuse set; recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 2.3, generating the second intermediate representation according to each layer of time vectors;
the step 4 further comprises the following steps:
4.1, analyzing the data reuse relation in the cyclic kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
and 4.2, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node.
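For illustration, the producer/consumer pairing of step 2.1 can be sketched as follows. This is a simplified sketch, not the patent's implementation; the function name and the trace format are assumptions:

```python
def classify_reuse_pairs(trace):
    # trace: accesses to one memory location, ordered by iteration,
    # as (operation_name, kind) with kind in {"load", "store"}.
    # Each adjacent pair forms a producer/consumer reuse pair whose type
    # is named by the producer kind followed by the consumer kind.
    pairs = []
    for (op1, kind1), (op2, kind2) in zip(trace, trace[1:]):
        pairs.append((op1, op2, f"{kind1.capitalize()}-{kind2.capitalize()}"))
    return pairs

# read, read, write, read of the same address across four iterations
trace = [("p0", "load"), ("p1", "load"), ("p2", "store"), ("p3", "load")]
pairs = classify_reuse_pairs(trace)
# yields one Load-Load, one Load-Store, and one Store-Load reuse pair
```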
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in step 2, the output of the producer is buffered using a local register LRF (local register file), from which the consumer loads data.
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in step 2, the output of the producer is buffered using a global register GRF (global register file) from which the consumer loads data.
Further, in the step 2, an affine transformation-based method is used to arrange the execution order of the loop iteration.
Further, in the step 2.2, a first strategy is adopted to select the loop transformation, where the first strategy narrows the selection range of candidate loop transformations according to the dependency relationships, as follows:

Define $\vec{t}_m = (x_{m1}, x_{m2}, \ldots, x_{mn})$ as the time vector of layer m, representing the change of all loop indexes between two successive iterations of that layer, where $x_{mi}$ indicates that at the mth level of the perfectly nested loop, the ith loop variable increases by $x_{mi}$ per loop iteration. Let the dependency set in the perfectly nested loop be $R = \{\langle s, d\rangle\}$, meaning loop iteration s must be performed before loop iteration d; the difference $d - s$ of each element in the dependency set is called a dependency vector and is stored in $M_R$. For the set of dependency vectors $M_R$, the innermost time vector $\vec{t}_1$ of the selected time vector group satisfies at least one of formula (1) and formula (2):

$$\{\,c\,\vec{t}_1 \mid c \in \mathbb{R}\,\} \cap cone(M_R) = \varnothing \quad (1)$$

$$\vec{t}_1 = c\,\vec{i}_r,\ \ \vec{i}_r \in M_R,\ \ \vec{i}_r \notin cone(M_R \setminus \{\vec{i}_r\}) \quad (2)$$

where $cone(M_R)$ denotes the cone spanned by the elements of $M_R$, and $cone(M_R \setminus \{\vec{i}_r\})$ denotes the cone spanned by the elements other than $\vec{i}_r$. Formula (1) states that the line on which the currently selected time vector lies must not intersect the cone spanned by the dependency vectors. Formula (2) states that the currently selected time vector may be collinear with a vector on the surface of the cone spanned by all dependency vectors, and that dependency vector must be greater than or equal to the time vector, i.e., $\vec{t}_1 = c\,\vec{i}_r$ where c is a real number with $0 < c \le 1$.
Further, in step 2.2, an innermost time vector is iteratively selected starting from the innermost loop using the first strategy; after an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively; finally, a loop transformation satisfying the loop dependency relationships is obtained.
Further, the step 2.2 includes a second strategy, which selects the loop transformation that maximizes effective data reuse, as follows:

A memory access operation p is represented as $\vec{A}_p(\vec{i})$, indicating that in iteration $\vec{i}$ operation p accesses address $\vec{A}_p(\vec{i})$ of the on-chip memory. A given reuse pair $r = \langle s, t, d\rangle$ is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

$$\vec{A}_s(\vec{i}) = \vec{A}_t(\vec{i} + d\,\vec{t}_1)\ \ \forall \vec{i},\ \ 0 < d \le c \quad (3)$$

Each reuse pair has its own instance of formula (3); the innermost time vector that makes as many instances of formula (3) hold as possible is the optimal time vector, so that as many reuse pairs as possible are effective. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer-layer time vectors using formula (2) based on the inner-layer time vector.
Further, in the step 4.2, modifying the policy includes:
the Store-Store reuse pair is not constrained by interval iterations between the producer and the consumer, and producers in this reuse relationship pair are eliminated.
Further, in step 4.2, modifying the policy includes:
for the Store-Load reuse pair and the Load-Load reuse pair, data is transferred from a producer to a consumer.
Further, data is transferred from the producer to the consumer using the interconnection resources.
Further, the data is transferred from the producer to the consumer using the registered resource.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying the policy includes:
the clock period I (r) of the interval between producer and consumer ≦ 0, and the reuse pair is not valid.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying the policy includes:
the clock period I (r) of the interval between the producer and the consumer meets 1 ≧ I (r) >0, and interconnection line resource connection exists between the PE where the producer is located and the PE where the consumer is located, and the reused consumer is converted into a routing operation for accessing data from an output register of the producer.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
the clock cycle I (r) >1 of the interval between producer and consumer uses register resources to temporarily store data.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
when the remaining LRF space on PE(s) and PE(t) satisfies

$$LRF_{free} \ge \lceil I(r)/II \rceil$$

the reused data is stored into the LRF, and the consumer is converted into a routing operation that obtains the data from the LRF, where II is the initiation interval of the software pipeline;

where PE(s) and PE(t) denote the PEs where the producer and consumer are located, respectively.
Further, modifying the policy includes:
in the case where no LRF is selected and the remaining GRF space satisfies

$$GRF_{free} \ge \lceil I(r)/II \rceil$$

the GRF is used to temporarily store the data.
Further, modifying the policy includes:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
for the case of Con(r) = false and PE(s) ≠ PE(t), an additional routing node is added to move the data through the LRF;
where PE(s) and PE(t) respectively denote the PEs where the producer and consumer are located, and Con(r) = true indicates that an interconnection line resource connection exists between PE(s) and PE(t).
Further, modifying the policy includes:
in the case where no LRF is selected and the remaining GRF space satisfies

$$GRF_{free} \ge \lceil I(r)/II \rceil$$

the GRF is used to temporarily store the data.
Further, modifying the policy includes:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
Compared with the prior art, the beneficial effects of this application are as follows:
1. Compared with existing schemes for reducing memory access conflicts, the present application innovatively exploits the data reuse relationships between loop iterations and modifies the configuration file at its generation stage, reducing the number of memory access operations in the CGRA loop kernel and avoiding memory access conflicts.
2. The method is orthogonal to existing methods that target the scheduling and mapping stage or the rearrangement of data in storage; it is highly extensible and can simply cooperate with existing data placement, scheduling, and mapping schemes to obtain a higher application speedup ratio.
Drawings
FIG. 1 is a diagram of a typical architecture of a 4x4 CGRA of the prior art;
FIG. 2 is a compiler back-end flow diagram of an embodiment of the present application;
FIG. 3 is a circular transformation model process of an embodiment of the present application;
FIG. 4 is a schematic diagram of a configuration file modification policy of an embodiment of the present application;
FIG. 5 is a graph of the run time of an embodiment of the present application for 22 kernels on a 4x4 PEA.
Detailed Description
The preferred embodiments of the present application will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present application may be embodied in many different forms of embodiments and the scope of the present application is not limited to only the embodiments set forth herein.
The conception, specific structure and technical effects of the present application will be further described below to fully understand the purpose, features and effects of the present application, but the present application is not limited thereto.
Fig. 2 is a schematic diagram of a compiler back-end flow according to an embodiment of the present application, wherein,
step 201, compiling the loop kernel part of the source program into an Intermediate Representation (IR) using an LLVM (Low Level Virtual Machine)-based CGRA compiler front end;
steps 202 to 204, performing the loop transformation that reuses effective data between iterations to obtain the transformed intermediate representation; wherein:
step 202, analyzing the dependency relationship and the data reuse relationship in the original code to obtain a dependency set and a reuse set;
step 203, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that satisfies the constraint in order, and finding the innermost time vector that maximizes the number of effective reuses in the reuse set; recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 204, generating a new intermediate representation according to each layer of time vectors;
steps 205 to 210, generating a data flow graph from the modified intermediate representation and obtaining a mapping result using existing scheduling, data partitioning, and spatial mapping schemes; these schemes are known in the prior art and can be obtained from documents [12] and [13], i.e., documents [12] and [13] are incorporated into the present application by reference and are listed at the end of the detailed description section.
steps 211 to 213, selecting data reuse modification strategies and generating the configuration file; wherein:
step 211, analyzing all data reuse relations in the cyclic kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
step 212, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node;
step 213, generating a configuration file according to the previous modification policy.
In the above flow, the number of memory access conflicts during loop kernel execution is reduced by reducing the number of redundant memory access operations; this distinguishes the present application from other research. In addition, unlike existing work that reduces access conflicts by targeting the placement of data in the on-chip memory or the scheduling and mapping strategies, the present application can be combined orthogonally with existing scheduling, data placement, and mapping strategies, further improving the performance of existing optimization work. Because the optimization targets the front-end loop transformation and the configuration file generation stage, the method can be combined with most existing conflict-reduction methods, which is another difference from other research.
Circular transformation
Two operations in different iterations that access the same memory location have a reuse relationship. Depending on the order in which the data is accessed, the two operations form a producer/consumer pair, also referred to as a reuse pair. Reuse pairs fall into four classes: Load-Load (two read operations), Load-Store (read before write), Store-Load (write before read), and Store-Store (two write operations). For Load-Load or Store-Load reuse pairs, we buffer the producer's output using a Local Register File (LRF) or a Global Register File (GRF), so that the consumer only needs to load the data from the LRF or GRF without accessing the OGM, thereby reducing memory access operations. However, because the GRF and LRF provided by the hardware are of limited size, our compiler must ensure that the reused data survives in the LRF or GRF throughout its life cycle. Therefore, the goal of our loop transformation model is to use an affine-transformation-based method to arrange the execution order of loop iterations so as to minimize the life cycle of data reused between iterations.
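As a minimal illustration of this buffering (plain Python standing in for CGRA operations, with a local variable playing the role of an LRF entry; all names are assumptions), a loop that reads a[i] and a[i+1] contains a Load-Load reuse pair, and the consumer's load can be eliminated by carrying the value in a register across iterations:

```python
def sum_pairs_naive(a):
    # two loads per iteration: a[i] and a[i+1] form a Load-Load reuse pair,
    # since the value loaded as a[i+1] is loaded again as a[i] next iteration
    return [a[i] + a[i + 1] for i in range(len(a) - 1)]

def sum_pairs_buffered(a):
    # the consumer's load is replaced by a variable ("register") carrying
    # the previous iteration's value: one memory load per iteration
    out, prev = [], a[0]
    for i in range(1, len(a)):
        cur = a[i]          # single memory load
        out.append(prev + cur)
        prev = cur          # reuse via register instead of a second load
    return out
```

Both versions compute the same result; the buffered form halves the number of loads in the loop body.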
In step 203 shown in fig. 2, the present application proposes a strategy for narrowing the selection range of candidate loop transformations according to the dependency relationships.
Define $\vec{t}_m = (x_{m1}, x_{m2}, \ldots, x_{mn})$ as the time vector of layer m, representing the change of all loop indexes between two successive iterations of that layer, where $x_{mi}$ indicates that at the mth level of a perfectly nested loop, the ith loop variable increases by $x_{mi}$ per loop iteration. For the loop program in FIG. 3(a), the set of time vectors is represented as $\vec{t}_1$ and $\vec{t}_2$. A group of time vectors corresponds to a determined loop execution order; an affine transformation can be applied to the group of time vectors to adjust the execution order of iterations in the loop, but the transformation must not violate the dependency relationships of the original loop. Let the dependency set in the perfectly nested loop be $R = \{\langle s, d\rangle\}$, i.e., loop iteration s has to be performed before loop iteration d. The difference $d - s$ of each element in the dependency set is called a dependency vector and is stored in $M_R$. For the set of dependency vectors $M_R$, the group of time vectors is valid only if its innermost time vector $\vec{t}_1$ satisfies at least one of formulas (1) and (2):

$$\{\,c\,\vec{t}_1 \mid c \in \mathbb{R}\,\} \cap cone(M_R) = \varnothing \quad (1)$$

$$\vec{t}_1 = c\,\vec{i}_r,\ \ \vec{i}_r \in M_R,\ \ \vec{i}_r \notin cone(M_R \setminus \{\vec{i}_r\}) \quad (2)$$

where $cone(M_R)$ denotes the cone spanned by the elements of $M_R$, and $cone(M_R \setminus \{\vec{i}_r\})$ denotes the cone spanned by the elements other than $\vec{i}_r$. Formula (1) states that the line on which the currently selected time vector lies must not intersect the cone spanned by the dependency vectors. Formula (2) states that the currently selected time vector may be collinear with a vector on the surface of the cone spanned by all dependency vectors, and that dependency vector must be greater than or equal to the time vector, i.e., $\vec{t}_1 = c\,\vec{i}_r$ where c is a real number with $0 < c \le 1$.
According to the above selection strategy, the present application iteratively applies the strategy to select a legal innermost time vector starting from the innermost loop. After an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively. Finally, a loop transformation satisfying the loop dependency relationships is obtained.
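Under the assumption of a doubly nested loop (two-dimensional dependency vectors, where cone membership reduces to solving a 2×2 linear system), the legality test of formulas (1) and (2) can be sketched as follows; the function names and the two-generator restriction are illustrative, not from the patent:

```python
def in_cone_2d(v, g1, g2):
    # v is in cone({g1, g2}) iff v = a*g1 + b*g2 with a, b >= 0
    # (g1 and g2 are assumed linearly independent)
    det = g1[0] * g2[1] - g1[1] * g2[0]
    a = (v[0] * g2[1] - v[1] * g2[0]) / det
    b = (g1[0] * v[1] - g1[1] * v[0]) / det
    return a >= 0 and b >= 0

def legal_innermost(t, deps):
    g1, g2 = deps
    # formula (1): the line through t must avoid cone(M_R),
    # so neither t nor -t may lie in the cone
    if not in_cone_2d(t, g1, g2) and not in_cone_2d((-t[0], -t[1]), g1, g2):
        return True
    # formula (2): t = c * i_r for a dependency vector i_r on the cone
    # surface (both generators are surface vectors here), with 0 < c <= 1
    for d in deps:
        collinear = t[0] * d[1] - t[1] * d[0] == 0
        same_direction = t[0] * d[0] + t[1] * d[1] > 0
        not_longer = t[0] ** 2 + t[1] ** 2 <= d[0] ** 2 + d[1] ** 2
        if collinear and same_direction and not_longer:
            return True
    return False
```

With dependency vectors (1, 0) and (0, 1), the time vector (-1, 1) is legal by formula (1), the vector (1, 0) is legal by formula (2), while (1, 1) lies strictly inside the cone and is rejected.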
In step 203 shown in fig. 2, the present application also proposes a strategy of selecting the loop transformation that maximizes effective data reuse. In searching for the optimal transformation, whether a Load-Load or Store-Load reuse pair is valid depends on the innermost time vector. A memory access operation p can be represented as $\vec{A}_p(\vec{i})$, indicating that in iteration $\vec{i}$ operation p accesses address $\vec{A}_p(\vec{i})$ of the on-chip memory. A given reuse pair $r = \langle s, t, d\rangle$ is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

$$\vec{A}_s(\vec{i}) = \vec{A}_t(\vec{i} + d\,\vec{t}_1)\ \ \forall \vec{i},\ \ 0 < d \le c \quad (3)$$

Each reuse pair has a corresponding instance of formula (3); the innermost time vector that makes as many instances of formula (3) hold as possible is the optimal time vector, so that as many reuse pairs as possible are effective. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer-layer time vectors using formula (2) based on the inner-layer time vector.
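For affine accesses $\vec{A}_p(\vec{i}) = M\vec{i} + \vec{o}_p$ with a shared access matrix M, formula (3) reduces to finding an iteration distance d with $M(d\,\vec{t}_1) = \vec{o}_s - \vec{o}_t$. A sketch under that assumption (the affine form and all names are illustrative, not the patent's exact algorithm):

```python
def reuse_distance(M, off_s, off_t, t_vec, c=4):
    # Smallest d in 1..c such that M @ (d * t_vec) == off_s - off_t,
    # i.e. the producer's address in iteration i equals the consumer's
    # address d innermost iterations later; None if the pair is invalid
    # under threshold c.
    diff = [os - ot for os, ot in zip(off_s, off_t)]
    for d in range(1, c + 1):
        step = [d * sum(M[r][k] * t_vec[k] for k in range(len(t_vec)))
                for r in range(len(M))]
        if step == diff:
            return d
    return None

# producer A_s(i, j) = (i, j), consumer A_t(i, j) = (i, j - 1):
# with innermost time vector t_1 = (0, 1) the consumer re-touches the
# producer's address one iteration later (d = 1)
identity = [[1, 0], [0, 1]]
d = reuse_distance(identity, (0, 0), (0, -1), (0, 1))
```

Counting how many reuse pairs yield a non-None distance for a candidate innermost time vector gives the objective that the optimal transformation search maximizes.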
As shown in fig. 3, the program in fig. 3(a) is used as an example to further explain the above loop transformation method. The goal of the optimization problem is to maximize the number of equations that hold in FIG. 3(e). The nodes in fig. 3(g) represent the candidate innermost time vectors, where the feasible solutions of the optimization problem are highlighted as gray nodes. Fig. 3(f) shows the time vector of each loop layer. FIG. 3(h) shows the transformed program, with the innermost loop divided into two layers to traverse all iterations.
Configuration file correction
During profile generation, we propose a profile modification strategy that eliminates unnecessary memory access operations through inter-iteration data reuse. The algorithm first analyzes whether each reuse pair meets the timing constraint and the register space constraint, and then applies different modification strategies to reduce memory access operations. The modification strategies are shown in fig. 4. Fig. 4(a) shows the structure of a PE, with Op representing an operation placed on the PE: P denotes the producer of a reuse pair, C the consumer, and R a routing node. The LRF owned by each PE is also marked; an LRF drawn with a diagonal-line pattern indicates that the LRF holds data.
Store-Store reuse is not constrained by the number of interval iterations between producer and consumer. If such a reuse relationship exists, the producer of the reuse pair is simply eliminated. Fig. 4(b) shows the modification strategy for Store-Store reuse.
For Store-Load and Load-Load reuse, data should be transferred from the producer to the consumer using either interconnection resources or register resources. Given a data reuse pair r = &lt;s, t, d&gt;, where s is the producer, t is the consumer, and d is the number of interval iterations between s and t, and assuming that operation s is mapped to PE(s) at control step time(s) of the software pipeline, the clock-cycle interval I(r) between the producer and the consumer of the reuse pair can be calculated as I(r) = d × II + (time(t) − time(s)), where II is the initiation interval of the software pipeline.
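The interval formula just given is directly executable; as a sanity check (the function and parameter names are ours):

```python
def reuse_interval_cycles(d, II, time_s, time_t):
    # I(r) = d * II + (time(t) - time(s)): clock cycles separating the
    # producer s and consumer t of reuse pair r in the software pipeline,
    # where II is the initiation interval and d the number of iterations
    # between the two operations.
    return d * II + (time_t - time_s)
```

For example, with d = 1, II = 4, time(s) = 3 and time(t) = 1, I(r) = 1 × 4 + (1 − 3) = 2 cycles.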
If I(r) ≤ 0, the producer executes after or simultaneously with the consumer, and the reuse pair is invalid.
When I(r) > 0, the spatial constraint of the reuse pair should be checked. If an interconnection resource connects PE(s) and PE(t), where the producer and the consumer are located, Con(r) is true. Fig. 4(d) shows the strategy selected when I(r) = 1 and Con(r) is true: the consumer of the reuse pair is converted into a routing operation that reads the data from the producer's output register.
When I (r) >1, register resources must be used to temporarily store data.
Fig. 4(c) shows the strategy when PE(s) = PE(t). If the remaining LRF space on PE(s) [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the reused data should be stored into the LRF, and the consumer is converted into a routing operation that retrieves the data from the LRF.
However, when I(r) ≥ 1, Con(r) ≠ true and PE(s) ≠ PE(t), additional routing nodes are required for data movement. Assuming there are two null operations np and nc satisfying PE(np) = PE(s), PE(nc) = PE(t) and (time(nc) − time(np)) mod II = 1, the reused data is moved by the two routing nodes through the LRF. Note that the producer and the consumer themselves may also be chosen as np and nc. Fig. 4(e) illustrates the different policies, including those that select the producer or the consumer as a routing node. The four diagrams, from left to right, show the locations of the producer and the consumer when the producer is selected as np, when the consumer is selected as nc, and when spare nodes other than the producer and the consumer are selected as np and nc.
If none of the above strategies is selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF will be used to temporarily store the data. The modification strategy for this case is shown in fig. 4(f).
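The chain of cases above (figs. 4(b) through 4(f)) amounts to a small decision procedure. The sketch below consolidates it under our own naming; the real compiler evaluates these conditions on the mapped dataflow graph, and the register-space constraints are simplified here to scalar counts:

```python
def pick_strategy(I, con=False, same_pe=False,
                  lrf_free=0, grf_free=0, lrf_need=1, grf_need=1):
    # I: clock-cycle interval I(r); con: Con(r), i.e. an interconnection
    # resource exists between PE(s) and PE(t); same_pe: PE(s) == PE(t).
    # Register space terms are simplified to scalar counts for illustration.
    if I <= 0:
        return "invalid"                      # producer not earlier than consumer
    if I == 1 and con:
        return "route-from-output-register"   # fig. 4(d)
    if I > 1 and same_pe and lrf_free >= lrf_need:
        return "buffer-in-LRF"                # fig. 4(c)
    if I > 1 and not same_pe:
        return "add-routing-nodes"            # fig. 4(e)
    if grf_free >= grf_need:
        return "buffer-in-GRF"                # fig. 4(f)
    return "keep-memory-access"               # reuse pair left unexploited
```

For example, a pair with I(r) = 2 whose producer and consumer share a PE with free LRF space would be buffered in the LRF, while the same pair split across two PEs would receive routing nodes.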
In addition, there is accumulation reuse, in which an operation repeatedly accumulates into the same memory location; the running value can be kept in a local register instead of repeatedly accessing main memory. Given a Store-Load reuse pair r = &lt;s, t, d&gt;, if r' = &lt;t, s, d&gt; is a Load-Store reuse pair, we call r an accumulation reuse. The modification strategies for accumulation reuse are shown in figs. 4(g) and 4(h), where the rectangular box containing the I and O blocks represents the set of all operations other than memory access operations. The nodes labeled I and O are the operations directly connected to the load and store operations (referred to as the input operation and the output operation). The output operation sends data to the input operation according to the Store-Load reuse modification strategy, and the memory access operations during accumulation are converted into null operations.
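Detecting accumulation reuse as defined above is a simple pairing check. A sketch under assumed data shapes (the tuple encoding and kind labels are ours):

```python
def find_accumulation_reuse(reuse_pairs):
    # Each pair is (producer, consumer, interval_iterations, kind).
    # A Store-Load pair r = <s, t, d> is an accumulation reuse when the
    # mirrored Load-Store pair <t, s, d> also exists in the reuse set.
    pair_set = set(reuse_pairs)
    return [
        (s, t, d)
        for (s, t, d, kind) in reuse_pairs
        if kind == "store-load" and (t, s, d, "load-store") in pair_set
    ]
```

Given the pairs [("st1", "ld1", 1, "store-load"), ("ld1", "st1", 1, "load-store")], only the Store-Load pair is reported as an accumulation reuse; its load and store can then be replaced by null operations per figs. 4(g) and 4(h).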
The selected policy is recorded and the generated profile is modified by the selected policy during the profile generation phase of step 213.
Evaluation of results
The loop transformation model and the configuration file modification strategy proposed in the present application were tested in a simulation environment implementing the typical ADRES-based CGRA structure mentioned in the background section, on 22 typical compute-intensive kernels selected from existing benchmark suites such as EEMBC, Polybench and Machsuite, covering digital signal processing, computer vision and dynamic programming. Fig. 5 compares the execution time of configuration packets generated by the modulo scheduling compiler integrating the present application against the original compiler (i.e., without the present application, using the compilation flow based on references [12] and [13] below, denoted THP+DP). The results show that the configuration information generated with the present application obtains an average performance improvement of 44.4%, indicating that, compared with the existing scheme, the present scheme effectively improves the performance of the configuration information packets generated by the coarse-grained reconfigurable architecture compiler.
Table 1 shows a comparison of the number of memory accesses before and after applying the scheme described in the present application:

[Table 1, shown as an image in the original]
It can be seen that the number of memory access operations is reduced on average to 55.4% of the original, which shows that our model greatly reduces memory access operations. In particular, potential data reuse relationships in aes3, bezier1 and strassen1 are discovered by the loop transformation model, increasing the number of data reuse pairs between two consecutive iterations in these three kernels from 0, 1 and 14 to 4, 9 and 23, respectively. The remaining kernels already achieve maximum data reuse in their default form, so the loop transformation model effectively maximizes all available reuse relationships between loop iterations.
Reference to the literature
[12] Z. Zhao et al., "Towards Higher Performance and Robust Compilation for CGRA Modulo Scheduling," in IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, pp. 2201-2219, Sept. 2020, doi: 10.1109/TPDS.2020.2989149.
[13] Z. Zhao, Y. Liu, W. Sheng, T. Krishna, Q. Wang, and Z. Mao, "Optimizing the data placement and transformation for multi-bank CGRA computing system," in DATE, 2018, pp. 1087-1092.
The foregoing is a detailed description of the preferred embodiments of the present application. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions that can be obtained by those skilled in the art through logical analysis, reasoning or limited experimentation based on the concepts of the present application shall fall within the scope of protection defined by the claims.

Claims (20)

1. A method for eliminating access conflict for data reuse of a coarse-grained reconfigurable structure comprises the following steps:
step 1, compiling a loop kernel part of a source program into a first intermediate representation by adopting a CGRA (coarse-grained reconfigurable architecture) compiler front end;
step 2, carrying out cyclic transformation for reusing effective data between iterations to obtain a transformed second intermediate representation;
step 3, generating a data flow graph to obtain a mapping result;
step 4, selecting a modification strategy for data reuse;
step 5, generating a configuration file;
the step 2 further comprises:
step 2.1, analyzing the dependency relationship and the data reuse relationship in the original code to obtain a dependency set and a reuse set; a reuse relationship means that two operations in different iterations access the same memory location; the two operations in a reuse relationship form a producer/consumer pair comprising a producer and a consumer, the types of producer/consumer pairs comprising a Load-Load reuse pair, a Load-Store reuse pair, a Store-Load reuse pair and a Store-Store reuse pair;
step 2.2, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that meets the constraint in sequence to find the innermost time vector that maximizes the number of valid reuses in the reuse set; and recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 2.3, generating the second intermediate representation according to each layer of time vectors;
the step 4 further comprises the following steps:
step 4.1, analyzing the data reuse relation in the cycle kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
and 4.2, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node.
2. The method of claim 1, wherein in step 2, for the Load-Load reuse pair and the Store-Load reuse pair, the output of the producer is buffered using a local register LRF from which the consumer loads data.
3. The method of claim 1, wherein in step 2, for the Load-Load reuse pair and the Store-Load reuse pair, the output of the producer is buffered using a global register GRF from which the consumer loads data.
4. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 2, the execution order of the loop iterations is rescheduled using an affine-transformation-based method.
5. The method as claimed in claim 4, wherein in the step 2.2, a first policy is adopted to select the cyclic transformation, the first policy being a policy in which the dependency relationship narrows the selection range of the optional cyclic transformations, comprising the following steps:
defining a time vector [shown as an image in the original] of dimension m, representing the change of all loop indexes between two successive iterations of the layer as [shown as an image in the original], wherein x_mi indicates that at the m-th level of a perfectly nested loop, the i-th loop variable is increased by x_mi in each loop iteration; letting the dependency set in the perfectly nested loop be R = {&lt;s, d&gt;}, i.e. loop iteration s must be performed before loop iteration d; the difference between d and s for each element in the dependency set is called a dependency vector and is stored in M_R; for the set of dependency vectors M_R, the selected innermost time vector [shown as an image in the original] satisfies at least one of formula (1) and formula (2):

[formula (1), shown as an image in the original]

[formula (2), shown as an image in the original]

wherein cone(M_R) denotes the cone spanned by the elements of M_R, and cone(M_R − i_r) denotes the cone spanned by the elements other than i_r; formula (1) indicates that the straight line on which the currently selected time vector lies cannot be inside the cone formed by the dependency vectors; formula (2) indicates that the currently selected time vector may be collinear with a vector on the surface of the cone formed by all the dependency vectors, and the dependency vector must be greater than or equal to the time vector, i.e. [shown as an image in the original], wherein c is a real number with 0 &lt; c ≤ 1.
6. The method of claim 5, wherein in the step 2.2, the first policy is used iteratively, starting from the innermost loop, to select an innermost time vector; after an innermost time vector is selected, all the dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively; finally, a cyclic transformation satisfying the loop dependency relationship is obtained.
7. A method of access conflict resolution as claimed in claim 5, wherein the step 2.2 includes a second strategy, the second strategy being a strategy of choosing the cyclic transformation that maximizes effective data reuse, comprising the following steps:
a memory access operation p is represented by its address function [shown as an image in the original], indicating the on-chip memory address that operation p accesses in iteration i; a given reuse pair r = &lt;s, t, d&gt; is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

[formula (3), shown as an image in the original]

each reuse pair has its own instance of formula (3), and the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector; the optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively calculates the outer-layer time vectors from it using formula (2).
8. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 4.2, the modifying strategy comprises:
the Store-Store reuse pair is not constrained by interval iterations between the producer and the consumer, and the producer in this reuse relationship pair is eliminated.
9. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 4.2, the modifying strategy comprises:
transferring data from the producer to the consumer for the Store-Load reuse pair and the Load-Load reuse pair.
10. The method of claim 9, wherein the data is transferred from the producer to the consumer using an interconnection resource.
11. The method of claim 9, wherein the data is transferred from the producer to the consumer using register resources.
12. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies I(r) ≤ 0, the reuse pair is invalid.
13. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies 1 ≥ I(r) &gt; 0 and an interconnection resource connection exists between the PE where the producer is located and the PE where the consumer is located, the consumer of the reuse pair is converted into a routing operation that accesses the data from an output register of the producer.
14. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies I(r) &gt; 1, register resources are used to temporarily store the data.
15. The method of claim 14, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when PE(s) = PE(t) and the remaining LRF space on PE(s) [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the reused data is stored into the LRF and the consumer is converted into a routing operation that retrieves the data from the LRF, wherein II is the initiation interval of the software pipeline;
wherein PE(s) and PE(t) denote the PEs where the producer and the consumer are located, respectively.
16. The method of resolving memory conflicts as claimed in claim 15, wherein the modification policy comprises:
in the case that the LRF is not selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF is used to temporarily store the data.
17. The method of resolving memory conflicts as claimed in claim 16, wherein the modification policy comprises:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
18. The method of claim 14, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when Con(r) = true and PE(s) ≠ PE(t), additional routing nodes are added for data movement through the LRF;
wherein PE(s) and PE(t) respectively denote the PEs where the producer and the consumer are located, and Con(r) = true indicates that an interconnection resource connection exists between PE(s) and PE(t).
19. The method of resolving memory conflicts as claimed in claim 18, wherein the modification policy comprises:
in the case that the LRF is not selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF is used to temporarily store the data.
20. The method of resolving memory conflicts as claimed in claim 19, wherein the modification policy comprises:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
CN202110061629.7A 2020-11-30 2021-01-18 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure Active CN112631610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/079524 WO2022110567A1 (en) 2020-11-30 2021-03-08 Data reuse memory access conflict elimination method for coarse-grained reconfigurable structure

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011377746 2020-11-30
CN2020113777466 2020-11-30

Publications (2)

Publication Number Publication Date
CN112631610A CN112631610A (en) 2021-04-09
CN112631610B true CN112631610B (en) 2022-04-26

Family

ID=75294506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061629.7A Active CN112631610B (en) 2020-11-30 2021-01-18 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure

Country Status (2)

Country Link
CN (1) CN112631610B (en)
WO (1) WO2022110567A1 (en)

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043659A (en) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflict and implementation method thereof
CN102156666A (en) * 2011-04-20 2011-08-17 上海交通大学 Temperature optimizing method for resource scheduling of coarse reconfigurable array processor
KR20110121313A (en) * 2010-04-30 2011-11-07 서울대학교산학협력단 Method and apparatus of optimal application mapping on coarse-grained reconfigurable array
EP2523120A1 (en) * 2011-05-12 2012-11-14 Imec Microcomputer architecture for low power efficient baseband processing
CN102855197A (en) * 2011-11-08 2013-01-02 东南大学 Storage system implementing method for large-scale coarse-grained reconfigurable system
KR101293701B1 (en) * 2012-02-23 2013-08-06 국립대학법인 울산과학기술대학교 산학협력단 Method and apparatus of executing nested loop on coarse-grained reconfigurable array
CN103377035A (en) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming application
CN203706197U (en) * 2014-02-10 2014-07-09 东南大学 Coarse-granularity dynamic and reconfigurable data regularity control unit structure
CN103942082A (en) * 2014-04-02 2014-07-23 南阳理工学院 Complier optimization method for eliminating redundant storage access operations
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105159737A (en) * 2015-07-28 2015-12-16 哈尔滨工程大学 Similar affine array subscript application-oriented parameterized parallel storage structure template
CN105260222A (en) * 2015-10-13 2016-01-20 哈尔滨工程大学 Optimization method for initiation interval between circulating pipeline iterations in reconfigurable compiler
CN105302624A (en) * 2015-09-17 2016-02-03 哈尔滨工程大学 Automatic analysis method capable of reconstructing start interval of periodic pipeline iteration in complier
CN105302525A (en) * 2015-10-16 2016-02-03 上海交通大学 Parallel processing method for reconfigurable processor with multilayer heterogeneous structure
CN105335331A (en) * 2015-12-04 2016-02-17 东南大学 SHA256 realizing method and system based on large-scale coarse-grain reconfigurable processor
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN105528243A (en) * 2015-07-02 2016-04-27 中国科学院计算技术研究所 A priority packet scheduling method and system utilizing data topological information
CN105700933A (en) * 2016-01-12 2016-06-22 上海交通大学 Parallelization and loop optimization method and system for a high-level language of reconfigurable processor
CN105718245A (en) * 2016-01-18 2016-06-29 清华大学 Reconfigurable computation cyclic mapping optimization method
CN105867994A (en) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier
WO2019059927A1 (en) * 2017-09-22 2019-03-28 Intel Corporation Loop nest reversal

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200924B (en) * 2011-05-17 2014-07-16 北京北大众志微系统科技有限责任公司 Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling
CN103106067B (en) * 2013-03-01 2016-01-20 清华大学 The optimization method of processor cyclic mapping and system
US9910650B2 (en) * 2014-09-25 2018-03-06 Intel Corporation Method and apparatus for approximating detection of overlaps between memory ranges
US10528356B2 (en) * 2015-11-04 2020-01-07 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110121313A (en) * 2010-04-30 2011-11-07 서울대학교산학협력단 Method and apparatus of optimal application mapping on coarse-grained reconfigurable array
CN102043659A (en) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflict and implementation method thereof
CN102156666A (en) * 2011-04-20 2011-08-17 上海交通大学 Temperature optimizing method for resource scheduling of coarse reconfigurable array processor
EP2523120A1 (en) * 2011-05-12 2012-11-14 Imec Microcomputer architecture for low power efficient baseband processing
CN102855197A (en) * 2011-11-08 2013-01-02 东南大学 Storage system implementing method for large-scale coarse-grained reconfigurable system
KR101293701B1 (en) * 2012-02-23 2013-08-06 국립대학법인 울산과학기술대학교 산학협력단 Method and apparatus of executing nested loop on coarse-grained reconfigurable array
CN103377035A (en) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming application
CN203706197U (en) * 2014-02-10 2014-07-09 东南大学 Coarse-granularity dynamic and reconfigurable data regularity control unit structure
CN103942082A (en) * 2014-04-02 2014-07-23 南阳理工学院 Complier optimization method for eliminating redundant storage access operations
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105528243A (en) * 2015-07-02 2016-04-27 中国科学院计算技术研究所 A priority packet scheduling method and system utilizing data topological information
CN105159737A (en) * 2015-07-28 2015-12-16 哈尔滨工程大学 Similar affine array subscript application-oriented parameterized parallel storage structure template
CN105302624A (en) * 2015-09-17 2016-02-03 哈尔滨工程大学 Automatic analysis method capable of reconstructing start interval of periodic pipeline iteration in complier
CN105260222A (en) * 2015-10-13 2016-01-20 哈尔滨工程大学 Optimization method for initiation interval between circulating pipeline iterations in reconfigurable compiler
CN105302525A (en) * 2015-10-16 2016-02-03 上海交通大学 Parallel processing method for reconfigurable processor with multilayer heterogeneous structure
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN105335331A (en) * 2015-12-04 2016-02-17 东南大学 SHA256 realizing method and system based on large-scale coarse-grain reconfigurable processor
CN105700933A (en) * 2016-01-12 2016-06-22 上海交通大学 Parallelization and loop optimization method and system for a high-level language of reconfigurable processor
CN105718245A (en) * 2016-01-18 2016-06-29 清华大学 Reconfigurable computation cyclic mapping optimization method
CN105867994A (en) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier
WO2019059927A1 (en) * 2017-09-22 2019-03-28 Intel Corporation Loop nest reversal

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Muhammad Ali Shami; Ahmed Hemani. Control Scheme for a CGRA. In: 2010 22nd International Symposium on Computer Architecture and High Performance Computing. 2010, pp. 17-24. *
Peng Cao; Huiyan Jiang; Bo Liu; Weiwei Shan. Memory Bandwidth Optimization Strategy of Coarse-Grained Reconfigurable Architecture. In: 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. 2012. *
A fast and efficient compilation framework for coarse-grained reconfigurable architectures; Yin Wenzhi; Zhao Zhongyuan; Mao Zhigang; Wang Qin; Sheng Weiguang; Microelectronics & Computer; 20190805; vol. 36, no. 8; full text *
Analysis of algorithm mapping on typical reconfigurable architectures; Fang Chen; He Weifeng; Mao Zhigang; Microelectronics & Computer; 20130805; vol. 30, no. 8; full text *
Pipelined mapping of kernel loops onto coarse-grained reconfigurable architectures; Wang Dawei; Dou Yong; Li Sikun; Chinese Journal of Computers; 20090615; vol. 32, no. 6; full text *
Hardware implementation of loop self-pipelining on coarse-grained reconfigurable platforms; Xu Jinhui; Yang Mengmeng; Dou Yong; Zhou Xingming; Chinese Journal of Computers; 20090615; vol. 32, no. 6; full text *
Research on co-optimization of coarse-grained reconfigurable computing architectures and their mapping algorithms; Yin Wenzhi; China Master's Theses Full-text Database, Information Science and Technology; 20200615; no. 6; full text *
Research on performance optimization techniques for coarse-grained reconfigurable array processors; Xu Jiaqing; China Master's Theses Full-text Database, Information Science and Technology; 20090715; no. 7; full text *

Also Published As

Publication number Publication date
CN112631610A (en) 2021-04-09
WO2022110567A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112465108B (en) Neural network compiling method for storage and calculation integrated platform
Wang et al. Supporting very large models using automatic dataflow graph partitioning
US8661422B2 (en) Methods and apparatus for local memory compaction
TW202127238A (en) Compiler flow logic for reconfigurable architectures
US20120331278A1 (en) Branch removal by data shuffling
Wuytack et al. Memory management for embedded network applications
CN112835627A (en) Approximate nearest neighbor search for single instruction multi-thread or single instruction multiple data type processors
Jiang et al. Boyi: A systematic framework for automatically deciding the right execution model of OpenCL applications on FPGAs
Lam et al. A data locality optimizing algorithm
Yin et al. Conflict-free loop mapping for coarse-grained reconfigurable architecture with multi-bank memory
US6324629B1 (en) Method for determining an optimized data organization
US20090064120A1 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
WO2022068205A1 (en) Data storage method and system, and data reading method and system
CN112306500B (en) Compiling method for reducing multi-class access conflict aiming at coarse-grained reconfigurable structure
Deest et al. Towards scalable and efficient FPGA stencil accelerators
CN112631610B (en) Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
US20230076473A1 (en) Memory processing unit architecture mapping techniques
Nakano Sequential memory access on the unified memory machine with application to the dynamic programming
Danckaert et al. Platform Independent Data Transfer and Storage Exploration Illustrated on Parallel Cavity Detection Algorithm.
Chen et al. Reducing memory access conflicts with loop transformation and data reuse on coarse-grained reconfigurable architecture
Ozdal Improving efficiency of parallel vertex-centric algorithms for irregular graphs
Cheng et al. Synthesis of statically analyzable accelerator networks from sequential programs
Koike et al. A novel computational model for GPUs with applications to efficient algorithms
US7363459B2 (en) System and method of optimizing memory usage with data lifetimes
Zhuang et al. A framework for parallelizing load/stores on embedded processors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant