CN112631610B - Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure - Google Patents


Info

Publication number: CN112631610B
Application number: CN202110061629.7A
Authority: CN (China)
Prior art keywords: reuse, load, pair, data, producer
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112631610A
Inventors: 绳伟光, 陈雨歌, 蒋剑飞, 景乃锋, 王琴, 毛志刚
Current and original assignee: Shanghai Jiaotong University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by: Shanghai Jiaotong University
Priority: PCT/CN2021/079524 (WO2022110567A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A method for eliminating memory access conflicts through data reuse on coarse-grained reconfigurable architectures. The method provides a loop transformation model, applied to perfectly nested loop kernels, that maximizes the data reuse available between iterations at run time, together with a configuration file modification strategy in the code generation stage that eliminates redundant memory access operations within data reuse pairs, thereby reducing memory access conflicts during loop execution.

Description

Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
Technical Field
The application relates to the field of coarse-grained reconfigurable architecture compilers, and in particular to a loop transformation model that maximizes effective data reuse on a reconfigurable architecture and a configuration file modification strategy used in the code generation stage.
Background
A Coarse-Grained Reconfigurable Architecture (CGRA) is a new hardware architecture with a high energy-efficiency ratio. It is mainly applied to hardware acceleration of compute-intensive applications such as video processing and neural network operations. Research shows that application execution time is mainly spent in the loop kernels of a program. The ADRES model of reference [1] gives a typical CGRA structure, as shown in FIG. 1. The compiler abstracts the loop kernel of a program into a Data Flow Graph (DFG), maps the DFG onto the Processing Elements (PEs) of a coarse-grained Processing Element Array (PEA), and executes the DFG using software pipelining. The number of machine cycles between the starts of two successive loop iterations is called the Initiation Interval (II); the smaller the II, the better the acceleration performance. One of the main goals of a CGRA compiler is to select scheduling and mapping strategies that minimize the II of the loop kernel.
However, software pipelining increases the demand for parallel access to the on-chip global memory (OGM), and pipeline stalls caused by memory access conflicts become a bottleneck in CGRA efficiency. The OGM is connected to the columns of the PEA through crossbars and column buses, through which different PEs can access the on-chip memory banks. When multiple operations occupy the same hardware resource in the same cycle (e.g., simultaneously accessing the same bank of a multi-bank on-chip memory, or exceeding the column bus bandwidth when accessing on-chip memory via the column bus), the software pipeline stalls and the II increases.
A study of 22 common computation kernels shows that memory access operations account for 52.2% of all operations, and the delay caused by conflicts between memory accesses in the same control cycle accounts for 68.4% of total running time. Existing work on eliminating memory access conflicts focuses on adjusting the placement of memory access operations in the scheduling and mapping stages and on adjusting the storage locations of data in the on-chip memory, but it fails to address the fundamental cause of these conflicts: the excessive number of memory access operations introduced by software pipelining.
Related studies are analyzed as follows:
Research on applying the polyhedral model to coarse-grained reconfigurable architectures
Loop transformation based on the polyhedral model is a newer loop optimization method. Compared with the traditional approach of deriving affine transformations from a unimodular matrix model, it offers a wider application range, stronger expressive power, and a larger optimization space. Today, polyhedral models are used in various compilers such as Google MLIR [2], Polly [3], and TVM [4]. However, little work applies the polyhedral compilation model to the CGRA compilation process. PolyMap, proposed in reference [5], uses a polyhedral compilation model to adjust the hierarchical structure of the loop kernel based on an analysis of the whole mapping flow, unrolling inner loops of the kernel for parallelism. Reference [6] uses a polyhedral model to represent programs and program transformations, and applies genetic algorithms to choose loop transformations for higher computational efficiency. However, none of these studies consider applying the polyhedral model to increase data reuse between loop iterations so as to reduce the number of redundant memory access operations.
Research on reducing memory access conflicts on CGRAs
Memory conflicts are primarily caused by large numbers of simultaneous memory accesses. Existing research on reducing access conflicts mostly focuses on optimizing the scheduling and mapping processes and the layout of data in the on-chip memory so as to reduce multi-bank conflicts. Reference [7] divides data into clusters so that the probabilities of different clusters being accessed are as equal as possible; this resolves conflicts at the granularity of whole arrays, and the coarse granularity limits the optimization. Reference [8] analyzes all memory addresses of a single array accessed by the loop kernel in a single control cycle and uses a linear transformation to map different addresses to different storage locations of different on-chip memories. Reference [9] builds on [8] by further subdividing the bank structure and adding intra-block data partitions, increasing the flexibility of the linear transformation. Reference [10] extends the selection of linear transformation parameters from single-array accesses in a single control cycle to different arrays accessed in different control cycles, further reducing multi-bank conflicts; it also proposes a bank-merging algorithm that improves the area-performance ratio of the memory banks. Reference [11] builds on [10] by changing the execution time of each memory access operator during scheduling, further reducing access conflicts, and proposes a dual force-directed scheduling scheme that spreads memory access operations across different control cycles as much as possible. However, none of these studies consider reducing the number of memory access operations as a way to reduce memory access conflicts.
The information of the references mentioned above is as follows:
[1] Y. Park, J. J. K. Park, and S. Mahlke. 2012. Efficient performance scaling of future CGRAs for mobile applications. In International Conference on Field-Programmable Technology (FPT). 335-342.
[2] Lattner C, Amini M, Bondhugula U, et al. MLIR: A Compiler Infrastructure for the End of Moore's Law [J]. 2020.
[3] Grosser, T., Groesslinger, A., & Lengauer, C. (2012). Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(4), 1-28.
[4] Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan et al. "TVM: An automated end-to-end optimizing compiler for deep learning." In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594. 2018.
[5] Liu, D., Yin, S., Peng, Y., Liu, L., & Wei, S. (2015). Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(11), 2581-2594.
[6] Ganser S, Größlinger A, Siegmund N, et al. Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance Models [J]. ACM Transactions on Architecture & Code Optimization, 2018, 15(4): 1-27.
[7] Kim, Y., Lee, J., Shrivastava, A., & Paek, Y. (2010). Operation and data mapping for CGRAs with multi-bank memory. Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 17-25.
[8] Wang Y, Li P, Zhang P, et al. Memory partitioning for multidimensional arrays in high-level synthesis [C] // ACM/EDAC/IEEE Design Automation Conference. IEEE, 2013.
[9] Wang, Y., Li, P., & Cong, J. (2014). Theory and algorithm for generalized memory partitioning in high-level synthesis. ACM/SIGDA International Symposium on Field Programmable Gate Arrays - FPGA, 199-208.
[10] Yin S, Xie Z, Meng C, et al. Multi-bank memory optimization for parallel data access in multiple data arrays [C] // IEEE/ACM International Conference on Computer-Aided Design. IEEE, 2017.
[11] Park, Hyunchul & Fan, Kevin & Mahlke, Scott & Oh, Taewook & Kim, Heeseok & Kim, Hong-seok. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. 166-176.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present application aims to provide a new method for CGRA architectures that optimizes software loop code at compile time so as to maximize the data reuse available across iterations of program execution and to eliminate redundant memory access operations in data reuse pairs, thereby reducing memory access conflicts during loop execution.
The method comprises the following steps:
step 1, compiling the loop kernel part of the source program into a first intermediate representation using the front end of a CGRA (Coarse-Grained Reconfigurable Architecture) compiler;
step 2, carrying out cyclic transformation for reusing effective data between iterations to obtain a transformed second intermediate representation;
step 3, generating a data flow graph to obtain a mapping result;
step 4, selecting a modification strategy for data reuse;
step 5, generating a configuration file;
the step 2 further comprises:
step 2.1, analyzing the dependency relationships and the data reuse relationships in the original code to obtain a dependency set and a reuse set; a reuse relationship means that two operations in different iterations access the same memory location; the two operations in a reuse relationship form a producer/consumer pair, comprising a producer and a consumer, the types of producer/consumer pairs being Load-Load reuse pairs, Load-Store reuse pairs, Store-Load reuse pairs, and Store-Store reuse pairs;
step 2.2, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that satisfies the constraint in order, and finding the innermost time vector that maximizes the number of effective reuses in the reuse set; recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 2.3, generating the second intermediate representation according to each layer of time vectors;
the step 4 further comprises the following steps:
4.1, analyzing the data reuse relation in the cyclic kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
and 4.2, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node.
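For illustration, the producer/consumer pairing of step 2.1 can be sketched as follows. This is a simplified sketch, not the patent's implementation; the function name and the trace format are assumptions:

```python
def classify_reuse_pairs(trace):
    # trace: accesses to one memory location, ordered by iteration,
    # as (operation_name, kind) with kind in {"load", "store"}.
    # Each adjacent pair forms a producer/consumer reuse pair whose type
    # is named by the producer kind followed by the consumer kind.
    pairs = []
    for (op1, kind1), (op2, kind2) in zip(trace, trace[1:]):
        pairs.append((op1, op2, f"{kind1.capitalize()}-{kind2.capitalize()}"))
    return pairs

# read, read, write, read of the same address across four iterations
trace = [("p0", "load"), ("p1", "load"), ("p2", "store"), ("p3", "load")]
pairs = classify_reuse_pairs(trace)
# yields one Load-Load, one Load-Store, and one Store-Load reuse pair
```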
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in step 2, the output of the producer is buffered using a local register LRF (local register file), from which the consumer loads data.
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in step 2, the output of the producer is buffered using a global register GRF (global register file) from which the consumer loads data.
Further, in the step 2, an affine transformation-based method is used to arrange the execution order of the loop iteration.
Further, in the step 2.2, a first strategy is adopted to select the loop transformation, where the first strategy narrows the selection range of candidate loop transformations according to the dependency relationships, as follows:

Define $\vec{t}_m = (x_{m1}, x_{m2}, \ldots, x_{mn})$ as the time vector of layer m, representing the change of all loop indexes between two successive iterations of that layer, where $x_{mi}$ indicates that at the mth level of the perfectly nested loop, the ith loop variable increases by $x_{mi}$ per loop iteration. Let the dependency set in the perfectly nested loop be $R = \{\langle s, d\rangle\}$, meaning loop iteration s must be performed before loop iteration d; the difference $d - s$ of each element in the dependency set is called a dependency vector and is stored in $M_R$. For the set of dependency vectors $M_R$, the innermost time vector $\vec{t}_1$ of the selected time vector group satisfies at least one of formula (1) and formula (2):

$$\{\,c\,\vec{t}_1 \mid c \in \mathbb{R}\,\} \cap cone(M_R) = \varnothing \quad (1)$$

$$\vec{t}_1 = c\,\vec{i}_r,\ \ \vec{i}_r \in M_R,\ \ \vec{i}_r \notin cone(M_R \setminus \{\vec{i}_r\}) \quad (2)$$

where $cone(M_R)$ denotes the cone spanned by the elements of $M_R$, and $cone(M_R \setminus \{\vec{i}_r\})$ denotes the cone spanned by the elements other than $\vec{i}_r$. Formula (1) states that the line on which the currently selected time vector lies must not intersect the cone spanned by the dependency vectors. Formula (2) states that the currently selected time vector may be collinear with a vector on the surface of the cone spanned by all dependency vectors, and that dependency vector must be greater than or equal to the time vector, i.e., $\vec{t}_1 = c\,\vec{i}_r$ where c is a real number with $0 < c \le 1$.
Further, in step 2.2, an innermost time vector is iteratively selected starting from the innermost loop using the first strategy; after an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively; finally, a loop transformation satisfying the loop dependency relationships is obtained.
Further, the step 2.2 includes a second strategy, which selects the loop transformation that maximizes effective data reuse, as follows:

A memory access operation p is represented as $\vec{A}_p(\vec{i})$, indicating that in iteration $\vec{i}$ operation p accesses address $\vec{A}_p(\vec{i})$ of the on-chip memory. A given reuse pair $r = \langle s, t, d\rangle$ is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

$$\vec{A}_s(\vec{i}) = \vec{A}_t(\vec{i} + d\,\vec{t}_1)\ \ \forall \vec{i},\ \ 0 < d \le c \quad (3)$$

Each reuse pair has its own instance of formula (3); the innermost time vector that makes as many instances of formula (3) hold as possible is the optimal time vector, so that as many reuse pairs as possible are effective. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer-layer time vectors using formula (2) based on the inner-layer time vector.
Further, in the step 4.2, modifying the policy includes:
the Store-Store reuse pair is not constrained by interval iterations between the producer and the consumer, and producers in this reuse relationship pair are eliminated.
Further, in step 4.2, modifying the policy includes:
for the Store-Load reuse pair and the Load-Load reuse pair, data is transferred from a producer to a consumer.
Further, data is transferred from the producer to the consumer using the interconnection resources.
Further, the data is transferred from the producer to the consumer using the registered resource.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying the policy includes:
the clock period I (r) of the interval between producer and consumer ≦ 0, and the reuse pair is not valid.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying the policy includes:
the clock period I (r) of the interval between the producer and the consumer meets 1 ≧ I (r) >0, and interconnection line resource connection exists between the PE where the producer is located and the PE where the consumer is located, and the reused consumer is converted into a routing operation for accessing data from an output register of the producer.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
the clock cycle I (r) >1 of the interval between producer and consumer uses register resources to temporarily store data.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
when the remaining LRF space on PE(s) and PE(t) satisfies

$$LRF_{free} \ge \lceil I(r)/II \rceil$$

the reused data is stored into the LRF, and the consumer is converted into a routing operation that obtains the data from the LRF, where II is the initiation interval of the software pipeline;

where PE(s) and PE(t) denote the PEs where the producer and consumer are located, respectively.
Further, modifying the policy includes:
in the case where no LRF is selected and the remaining GRF space satisfies

$$GRF_{free} \ge \lceil I(r)/II \rceil$$

the GRF is used to temporarily store the data.
Further, modifying the policy includes:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
Further, in the step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, modifying a policy includes:
for the case of Con(r) = false and PE(s) ≠ PE(t), an additional routing node is added to move the data through the LRF;
where PE(s) and PE(t) respectively denote the PEs where the producer and consumer are located, and Con(r) = true indicates that an interconnection line resource connection exists between PE(s) and PE(t).
Further, modifying the policy includes:
in the case where no LRF is selected and the remaining GRF space satisfies

$$GRF_{free} \ge \lceil I(r)/II \rceil$$

the GRF is used to temporarily store the data.
Further, modifying the policy includes:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
Compared with the prior art, the beneficial effects of this application are as follows:
1. Compared with existing schemes for reducing memory access conflicts, the present application innovatively exploits the data reuse relationships between loop iterations and modifies the configuration file at its generation stage, reducing the number of memory access operations in the CGRA loop kernel and avoiding memory access conflicts.
2. The method is orthogonal to existing methods that target the scheduling and mapping stage or the rearrangement of data in storage; it is highly extensible and can simply cooperate with existing data placement, scheduling, and mapping schemes to obtain a higher application speedup ratio.
Drawings
FIG. 1 is a diagram of a typical architecture of a 4x4 CGRA of the prior art;
FIG. 2 is a compiler back-end flow diagram of an embodiment of the present application;
FIG. 3 is a circular transformation model process of an embodiment of the present application;
FIG. 4 is a schematic diagram of a configuration file modification policy of an embodiment of the present application;
FIG. 5 is a graph of the run time of an embodiment of the present application for 22 kernels on a 4x4 PEA.
Detailed Description
The preferred embodiments of the present application will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present application may be embodied in many different forms of embodiments and the scope of the present application is not limited to only the embodiments set forth herein.
The conception, specific structure and technical effects of the present application will be further described below to fully understand the purpose, features and effects of the present application, but the present application is not limited thereto.
Fig. 2 is a schematic diagram of a compiler back-end flow according to an embodiment of the present application, wherein,
step 201, compiling the loop kernel part of the source program into an Intermediate Representation (IR) using an LLVM (Low Level Virtual Machine)-based CGRA compiler front end;
steps 202 to 204, performing the loop transformation that reuses effective data between iterations to obtain the transformed intermediate representation; wherein:
step 202, analyzing the dependency relationship and the data reuse relationship in the original code to obtain a dependency set and a reuse set;
step 203, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that satisfies the constraint in order, and finding the innermost time vector that maximizes the number of effective reuses in the reuse set; recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 204, generating a new intermediate representation according to each layer of time vectors;
steps 205 to 210, generating a data flow graph from the modified intermediate representation and obtaining a mapping result using existing scheduling, data partitioning, and spatial mapping schemes; these schemes are known in the prior art and can be obtained from documents [12] and [13], i.e., documents [12] and [13] are incorporated into the present application by reference and are listed at the end of the detailed description section.
steps 211 to 213, selecting data reuse modification strategies and generating the configuration file; wherein:
step 211, analyzing all data reuse relations in the cyclic kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
step 212, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node;
step 213, generating a configuration file according to the previous modification policy.
In the above flow, the number of memory access conflicts during loop kernel execution is reduced by reducing the number of redundant memory access operations; this distinguishes the present application from other research. In addition, unlike existing work that reduces access conflicts by targeting the placement of data in the on-chip memory or the scheduling and mapping strategies, the present application can be combined orthogonally with existing scheduling, data placement, and mapping strategies, further improving the performance of existing optimization work. Because the optimization targets the front-end loop transformation and the configuration file generation stage, the method can be combined with most existing conflict-reduction methods, which is another difference from other research.
Circular transformation
Two operations in different iterations that access the same memory location have a reuse relationship. Depending on the order in which the data is accessed, the two operations form a producer/consumer pair, also referred to as a reuse pair. Reuse pairs fall into four classes: Load-Load (two read operations), Load-Store (read before write), Store-Load (write before read), and Store-Store (two write operations). For Load-Load or Store-Load reuse pairs, we buffer the producer's output using a Local Register File (LRF) or a Global Register File (GRF), so that the consumer only needs to load the data from the LRF or GRF without accessing the OGM, thereby reducing memory access operations. However, because the GRF and LRF provided by the hardware are of limited size, our compiler must ensure that the reused data survives in the LRF or GRF throughout its life cycle. Therefore, the goal of our loop transformation model is to use an affine-transformation-based method to arrange the execution order of loop iterations so as to minimize the life cycle of data reused between iterations.
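As a minimal illustration of this buffering (plain Python standing in for CGRA operations, with a local variable playing the role of an LRF entry; all names are assumptions), a loop that reads a[i] and a[i+1] contains a Load-Load reuse pair, and the consumer's load can be eliminated by carrying the value in a register across iterations:

```python
def sum_pairs_naive(a):
    # two loads per iteration: a[i] and a[i+1] form a Load-Load reuse pair,
    # since the value loaded as a[i+1] is loaded again as a[i] next iteration
    return [a[i] + a[i + 1] for i in range(len(a) - 1)]

def sum_pairs_buffered(a):
    # the consumer's load is replaced by a variable ("register") carrying
    # the previous iteration's value: one memory load per iteration
    out, prev = [], a[0]
    for i in range(1, len(a)):
        cur = a[i]          # single memory load
        out.append(prev + cur)
        prev = cur          # reuse via register instead of a second load
    return out
```

Both versions compute the same result; the buffered form halves the number of loads in the loop body.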
In step 203 shown in fig. 2, the present application proposes a strategy for narrowing the selection range of candidate loop transformations according to the dependency relationships.
Define $\vec{t}_m = (x_{m1}, x_{m2}, \ldots, x_{mn})$ as the time vector of layer m, representing the change of all loop indexes between two successive iterations of that layer, where $x_{mi}$ indicates that at the mth level of a perfectly nested loop, the ith loop variable increases by $x_{mi}$ per loop iteration. For the loop program in FIG. 3(a), the set of time vectors is represented as $\vec{t}_1$ and $\vec{t}_2$. A group of time vectors corresponds to a determined loop execution order; an affine transformation can be applied to the group of time vectors to adjust the execution order of iterations in the loop, but the transformation must not violate the dependency relationships of the original loop. Let the dependency set in the perfectly nested loop be $R = \{\langle s, d\rangle\}$, i.e., loop iteration s has to be performed before loop iteration d. The difference $d - s$ of each element in the dependency set is called a dependency vector and is stored in $M_R$. For the set of dependency vectors $M_R$, the group of time vectors is valid only if its innermost time vector $\vec{t}_1$ satisfies at least one of formulas (1) and (2):

$$\{\,c\,\vec{t}_1 \mid c \in \mathbb{R}\,\} \cap cone(M_R) = \varnothing \quad (1)$$

$$\vec{t}_1 = c\,\vec{i}_r,\ \ \vec{i}_r \in M_R,\ \ \vec{i}_r \notin cone(M_R \setminus \{\vec{i}_r\}) \quad (2)$$

where $cone(M_R)$ denotes the cone spanned by the elements of $M_R$, and $cone(M_R \setminus \{\vec{i}_r\})$ denotes the cone spanned by the elements other than $\vec{i}_r$. Formula (1) states that the line on which the currently selected time vector lies must not intersect the cone spanned by the dependency vectors. Formula (2) states that the currently selected time vector may be collinear with a vector on the surface of the cone spanned by all dependency vectors, and that dependency vector must be greater than or equal to the time vector, i.e., $\vec{t}_1 = c\,\vec{i}_r$ where c is a real number with $0 < c \le 1$.
According to the above selection strategy, the present application iteratively applies the strategy to select a legal innermost time vector starting from the innermost loop. After an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively. Finally, a loop transformation satisfying the loop dependency relationships is obtained.
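Under the assumption of a doubly nested loop (two-dimensional dependency vectors, where cone membership reduces to solving a 2×2 linear system), the legality test of formulas (1) and (2) can be sketched as follows; the function names and the two-generator restriction are illustrative, not from the patent:

```python
def in_cone_2d(v, g1, g2):
    # v is in cone({g1, g2}) iff v = a*g1 + b*g2 with a, b >= 0
    # (g1 and g2 are assumed linearly independent)
    det = g1[0] * g2[1] - g1[1] * g2[0]
    a = (v[0] * g2[1] - v[1] * g2[0]) / det
    b = (g1[0] * v[1] - g1[1] * v[0]) / det
    return a >= 0 and b >= 0

def legal_innermost(t, deps):
    g1, g2 = deps
    # formula (1): the line through t must avoid cone(M_R),
    # so neither t nor -t may lie in the cone
    if not in_cone_2d(t, g1, g2) and not in_cone_2d((-t[0], -t[1]), g1, g2):
        return True
    # formula (2): t = c * i_r for a dependency vector i_r on the cone
    # surface (both generators are surface vectors here), with 0 < c <= 1
    for d in deps:
        collinear = t[0] * d[1] - t[1] * d[0] == 0
        same_direction = t[0] * d[0] + t[1] * d[1] > 0
        not_longer = t[0] ** 2 + t[1] ** 2 <= d[0] ** 2 + d[1] ** 2
        if collinear and same_direction and not_longer:
            return True
    return False
```

With dependency vectors (1, 0) and (0, 1), the time vector (-1, 1) is legal by formula (1), the vector (1, 0) is legal by formula (2), while (1, 1) lies strictly inside the cone and is rejected.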
In step 203 shown in fig. 2, the present application also proposes a strategy of selecting the loop transformation that maximizes effective data reuse. In searching for the optimal transformation, whether a Load-Load or Store-Load reuse pair is valid depends on the innermost time vector. A memory access operation p can be represented as $\vec{A}_p(\vec{i})$, indicating that in iteration $\vec{i}$ operation p accesses address $\vec{A}_p(\vec{i})$ of the on-chip memory. A given reuse pair $r = \langle s, t, d\rangle$ is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

$$\vec{A}_s(\vec{i}) = \vec{A}_t(\vec{i} + d\,\vec{t}_1)\ \ \forall \vec{i},\ \ 0 < d \le c \quad (3)$$

Each reuse pair has a corresponding instance of formula (3); the innermost time vector that makes as many instances of formula (3) hold as possible is the optimal time vector, so that as many reuse pairs as possible are effective. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer-layer time vectors using formula (2) based on the inner-layer time vector.
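For affine accesses $\vec{A}_p(\vec{i}) = M\vec{i} + \vec{o}_p$ with a shared access matrix M, formula (3) reduces to finding an iteration distance d with $M(d\,\vec{t}_1) = \vec{o}_s - \vec{o}_t$. A sketch under that assumption (the affine form and all names are illustrative, not the patent's exact algorithm):

```python
def reuse_distance(M, off_s, off_t, t_vec, c=4):
    # Smallest d in 1..c such that M @ (d * t_vec) == off_s - off_t,
    # i.e. the producer's address in iteration i equals the consumer's
    # address d innermost iterations later; None if the pair is invalid
    # under threshold c.
    diff = [os - ot for os, ot in zip(off_s, off_t)]
    for d in range(1, c + 1):
        step = [d * sum(M[r][k] * t_vec[k] for k in range(len(t_vec)))
                for r in range(len(M))]
        if step == diff:
            return d
    return None

# producer A_s(i, j) = (i, j), consumer A_t(i, j) = (i, j - 1):
# with innermost time vector t_1 = (0, 1) the consumer re-touches the
# producer's address one iteration later (d = 1)
identity = [[1, 0], [0, 1]]
d = reuse_distance(identity, (0, 0), (0, -1), (0, 1))
```

Counting how many reuse pairs yield a non-None distance for a candidate innermost time vector gives the objective that the optimal transformation search maximizes.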
As shown in fig. 3, the program in fig. 3(a) is used as an example to further explain the above loop transformation method. The goal of the optimization problem is to maximize the number of equations that hold in FIG. 3(e). The nodes in fig. 3(g) represent the candidate innermost time vectors, where the feasible solutions of the optimization problem are highlighted as gray nodes. Fig. 3(f) shows the time vector of each loop layer. FIG. 3(h) shows the transformed program, with the innermost loop divided into two layers to traverse all iterations.
Configuration file correction
During profile generation, we propose a profile modification strategy that eliminates unnecessary memory access operations through inter-iteration data reuse. The algorithm first analyzes whether each reuse pair meets the timing constraint and the register space constraint, and then applies different modification strategies to reduce memory access operations. The modification strategies are shown in fig. 4. Fig. 4(a) shows the structure of a PE, with Op representing an operation placed on the PE: P denotes the producer of a reuse pair, C the consumer, and R a routing node. The LRF owned by each PE is also marked; an LRF drawn with a diagonal-line pattern indicates that the LRF holds data.
Store-Store reuse is not constrained by the number of interval iterations between producer and consumer. If such a reuse relationship exists, the producer of the reuse pair is simply eliminated. Fig. 4(b) shows the modification strategy for Store-Store reuse.
For Store-Load and Load-Load reuse, data should be transferred from the producer to the consumer using either interconnection resources or register resources. Given a data reuse pair r = &lt;s, t, d&gt;, where s is the producer, t is the consumer, and d is the number of interval iterations between s and t, and assuming that operation s is mapped to PE(s) at control step time(s) of the software pipeline, the clock-cycle interval I(r) between the producer and the consumer of the reuse pair can be calculated as I(r) = d × II + (time(t) − time(s)), where II is the initiation interval of the software pipeline.
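The interval formula just given is directly executable; as a sanity check (the function and parameter names are ours):

```python
def reuse_interval_cycles(d, II, time_s, time_t):
    # I(r) = d * II + (time(t) - time(s)): clock cycles separating the
    # producer s and consumer t of reuse pair r in the software pipeline,
    # where II is the initiation interval and d the number of iterations
    # between the two operations.
    return d * II + (time_t - time_s)
```

For example, with d = 1, II = 4, time(s) = 3 and time(t) = 1, I(r) = 1 × 4 + (1 − 3) = 2 cycles.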
If I(r) ≤ 0, the producer executes after or simultaneously with the consumer, and the reuse pair is invalid.
When I(r) > 0, the spatial constraint of the reuse pair should be checked. If an interconnection resource connects PE(s) and PE(t), where the producer and the consumer are located, Con(r) is true. Fig. 4(d) shows the strategy selected when I(r) = 1 and Con(r) is true: the consumer of the reuse pair is converted into a routing operation that reads the data from the producer's output register.
When I (r) >1, register resources must be used to temporarily store data.
Fig. 4(c) shows the strategy when PE(s) = PE(t). If the remaining LRF space on PE(s) [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the reused data should be stored into the LRF, and the consumer is converted into a routing operation that retrieves the data from the LRF.
However, when I(r) ≥ 1, Con(r) ≠ true and PE(s) ≠ PE(t), additional routing nodes are required for data movement. Assuming there are two null operations np and nc satisfying PE(np) = PE(s), PE(nc) = PE(t) and (time(nc) − time(np)) mod II = 1, the reused data is moved by the two routing nodes through the LRF. Note that the producer and the consumer themselves may also be chosen as np and nc. Fig. 4(e) illustrates the different policies, including those that select the producer or the consumer as a routing node. The four diagrams, from left to right, show the locations of the producer and the consumer when the producer is selected as np, when the consumer is selected as nc, and when spare nodes other than the producer and the consumer are selected as np and nc.
If none of the above strategies is selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF will be used to temporarily store the data. The modification strategy for this case is shown in fig. 4(f).
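The chain of cases above (figs. 4(b) through 4(f)) amounts to a small decision procedure. The sketch below consolidates it under our own naming; the real compiler evaluates these conditions on the mapped dataflow graph, and the register-space constraints are simplified here to scalar counts:

```python
def pick_strategy(I, con=False, same_pe=False,
                  lrf_free=0, grf_free=0, lrf_need=1, grf_need=1):
    # I: clock-cycle interval I(r); con: Con(r), i.e. an interconnection
    # resource exists between PE(s) and PE(t); same_pe: PE(s) == PE(t).
    # Register space terms are simplified to scalar counts for illustration.
    if I <= 0:
        return "invalid"                      # producer not earlier than consumer
    if I == 1 and con:
        return "route-from-output-register"   # fig. 4(d)
    if I > 1 and same_pe and lrf_free >= lrf_need:
        return "buffer-in-LRF"                # fig. 4(c)
    if I > 1 and not same_pe:
        return "add-routing-nodes"            # fig. 4(e)
    if grf_free >= grf_need:
        return "buffer-in-GRF"                # fig. 4(f)
    return "keep-memory-access"               # reuse pair left unexploited
```

For example, a pair with I(r) = 2 whose producer and consumer share a PE with free LRF space would be buffered in the LRF, while the same pair split across two PEs would receive routing nodes.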
In addition, there is accumulation reuse, in which an operation repeatedly accumulates into the same memory location; the running value can be kept in a local register instead of repeatedly accessing main memory. Given a Store-Load reuse pair r = &lt;s, t, d&gt;, if r' = &lt;t, s, d&gt; is a Load-Store reuse pair, we call r an accumulation reuse. The modification strategies for accumulation reuse are shown in figs. 4(g) and 4(h), where the rectangular box containing the I and O blocks represents the set of all operations other than memory access operations. The nodes labeled I and O are the operations directly connected to the load and store operations (referred to as the input operation and the output operation). The output operation sends data to the input operation according to the Store-Load reuse modification strategy, and the memory access operations during accumulation are converted into null operations.
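Detecting accumulation reuse as defined above is a simple pairing check. A sketch under assumed data shapes (the tuple encoding and kind labels are ours):

```python
def find_accumulation_reuse(reuse_pairs):
    # Each pair is (producer, consumer, interval_iterations, kind).
    # A Store-Load pair r = <s, t, d> is an accumulation reuse when the
    # mirrored Load-Store pair <t, s, d> also exists in the reuse set.
    pair_set = set(reuse_pairs)
    return [
        (s, t, d)
        for (s, t, d, kind) in reuse_pairs
        if kind == "store-load" and (t, s, d, "load-store") in pair_set
    ]
```

Given the pairs [("st1", "ld1", 1, "store-load"), ("ld1", "st1", 1, "load-store")], only the Store-Load pair is reported as an accumulation reuse; its load and store can then be replaced by null operations per figs. 4(g) and 4(h).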
The selected policy is recorded and the generated profile is modified by the selected policy during the profile generation phase of step 213.
Evaluation of results
The loop transformation model and the configuration file modification strategy proposed in the present application were tested in a simulation environment implementing the typical ADRES-based CGRA structure mentioned in the background section, on 22 typical compute-intensive kernels selected from existing benchmark suites such as EEMBC, Polybench and Machsuite, covering digital signal processing, computer vision and dynamic programming. Fig. 5 compares the execution time of configuration packets generated by the modulo scheduling compiler integrating the present application against the original compiler (i.e., without the present application, using the compilation flow based on references [12] and [13] below, denoted THP+DP). The results show that the configuration information generated with the present application obtains an average performance improvement of 44.4%, indicating that, compared with the existing scheme, the present scheme effectively improves the performance of the configuration information packets generated by the coarse-grained reconfigurable architecture compiler.
Table 1 shows a comparison of the number of memory accesses before and after applying the scheme described in the present application:

[Table 1, shown as an image in the original]
It can be seen that the number of memory access operations is reduced on average to 55.4% of the original, which shows that our model greatly reduces memory access operations. In particular, potential data reuse relationships in aes3, bezier1 and strassen1 are discovered by the loop transformation model, increasing the number of data reuse pairs between two consecutive iterations in these three kernels from 0, 1 and 14 to 4, 9 and 23, respectively. The remaining kernels already achieve maximum data reuse in their default form, so the loop transformation model effectively maximizes all available reuse relationships between loop iterations.
Reference to the literature
[12] Z. Zhao et al., "Towards Higher Performance and Robust Compilation for CGRA Modulo Scheduling," in IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, pp. 2201-2219, Sept. 2020, doi: 10.1109/TPDS.2020.2989149.
[13] Z. Zhao, Y. Liu, W. Sheng, T. Krishna, Q. Wang, and Z. Mao, "Optimizing the data placement and transformation for multi-bank CGRA computing system," in DATE, 2018, pp. 1087-1092.
The foregoing is a detailed description of the preferred embodiments of the present application. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions that can be obtained by those skilled in the art through logical analysis, reasoning or limited experimentation based on the concepts of the present application shall fall within the scope of protection defined by the claims.

Claims (20)

1. A method for eliminating access conflict for data reuse of a coarse-grained reconfigurable structure comprises the following steps:
step 1, compiling a loop kernel part of a source program into a first intermediate representation by adopting a CGRA (coarse-grained reconfigurable architecture) compiler front end;
step 2, carrying out cyclic transformation for reusing effective data between iterations to obtain a transformed second intermediate representation;
step 3, generating a data flow graph to obtain a mapping result;
step 4, selecting a modification strategy for data reuse;
step 5, generating a configuration file;
the step 2 further comprises:
step 2.1, analyzing the dependency relationship and the data reuse relationship in the original code to obtain a dependency set and a reuse set; a reuse relationship means that two operations in different iterations access the same memory location; the two operations in a reuse relationship form a producer/consumer pair comprising a producer and a consumer, the types of producer/consumer pairs comprising a Load-Load reuse pair, a Load-Store reuse pair, a Store-Load reuse pair and a Store-Store reuse pair;
step 2.2, constraining the search space of the innermost time vector according to the dependency set; searching the remaining search space that meets the constraint in sequence to find the innermost time vector that maximizes the number of valid reuses in the reuse set; and recursively obtaining the outer time vectors from the obtained optimal innermost time vector;
step 2.3, generating the second intermediate representation according to each layer of time vectors;
the step 4 further comprises the following steps:
step 4.1, analyzing the data reuse relation in the cycle kernel to obtain available different data reuse sets; sequentially checking whether each reuse relation meets hardware constraint, and recording the finally available data reuse relation in the information of the producer and consumer nodes of the reuse relation;
and 4.2, when the configuration file is generated, selecting a corresponding modification strategy according to the reuse relation of the node.
2. The method of claim 1, wherein in step 2, for the Load-Load reuse pair and the Store-Load reuse pair, the output of the producer is buffered using a local register LRF from which the consumer loads data.
3. The method of claim 1, wherein in step 2, for the Load-Load reuse pair and the Store-Load reuse pair, the output of the producer is buffered using a global register GRF from which the consumer loads data.
4. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 2, the execution order of the loop iterations is rescheduled using an affine-transformation-based method.
5. The method as claimed in claim 4, wherein in the step 2.2, a first policy is adopted to select the cyclic transformation, the first policy being a policy in which the dependency relationship narrows the selection range of the optional cyclic transformations, comprising the following steps:
defining a time vector [shown as an image in the original] of dimension m, representing the change of all loop indexes between two successive iterations of the layer as [shown as an image in the original], wherein x_mi indicates that at the m-th level of a perfectly nested loop, the i-th loop variable is increased by x_mi in each loop iteration; letting the dependency set in the perfectly nested loop be R = {&lt;s, d&gt;}, i.e. loop iteration s must be performed before loop iteration d; the difference between d and s for each element in the dependency set is called a dependency vector and is stored in M_R; for the set of dependency vectors M_R, the selected innermost time vector [shown as an image in the original] satisfies at least one of formula (1) and formula (2):

[formula (1), shown as an image in the original]

[formula (2), shown as an image in the original]

wherein cone(M_R) denotes the cone spanned by the elements of M_R, and cone(M_R − i_r) denotes the cone spanned by the elements other than i_r; formula (1) indicates that the straight line on which the currently selected time vector lies cannot be inside the cone formed by the dependency vectors; formula (2) indicates that the currently selected time vector may be collinear with a vector on the surface of the cone formed by all the dependency vectors, and the dependency vector must be greater than or equal to the time vector, i.e. [shown as an image in the original], wherein c is a real number with 0 &lt; c ≤ 1.
6. The method of claim 5, wherein in the step 2.2, the first policy is used iteratively, starting from the innermost loop, to select an innermost time vector; after an innermost time vector is selected, all the dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively; finally, a cyclic transformation satisfying the loop dependency relationship is obtained.
7. A method of access conflict resolution as claimed in claim 5, wherein the step 2.2 includes a second strategy, the second strategy being a strategy of choosing the cyclic transformation that maximizes effective data reuse, comprising the following steps:
a memory access operation p is represented by its address function [shown as an image in the original], indicating the on-chip memory address that operation p accesses in iteration i; a given reuse pair r = &lt;s, t, d&gt; is valid when the corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs:

[formula (3), shown as an image in the original]

each reuse pair has its own instance of formula (3), and the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector; the optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively calculates the outer-layer time vectors from it using formula (2).
8. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 4.2, the modifying strategy comprises:
the Store-Store reuse pair is not constrained by interval iterations between the producer and the consumer, and the producer in this reuse relationship pair is eliminated.
9. The method for eliminating memory access conflict as claimed in claim 1, wherein in the step 4.2, the modifying strategy comprises:
transferring data from the producer to the consumer for the Store-Load reuse pair and the Load-Load reuse pair.
10. The method of claim 9, wherein the data is transferred from the producer to the consumer using an interconnection resource.
11. The method of claim 9, wherein the data is transferred from the producer to the consumer using register resources.
12. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies I(r) ≤ 0, the reuse pair is invalid.
13. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies 1 ≥ I(r) &gt; 0 and an interconnection resource connection exists between the PE where the producer is located and the PE where the consumer is located, the consumer of the reuse pair is converted into a routing operation that accesses the data from an output register of the producer.
14. The method of claim 9, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when the clock-cycle interval I(r) between the producer and the consumer satisfies I(r) &gt; 1, register resources are used to temporarily store the data.
15. The method of claim 14, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when PE(s) = PE(t) and the remaining LRF space on PE(s) [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the reused data is stored into the LRF and the consumer is converted into a routing operation that retrieves the data from the LRF, wherein II is the initiation interval of the software pipeline;
wherein PE(s) and PE(t) denote the PEs where the producer and the consumer are located, respectively.
16. The method of resolving memory conflicts as claimed in claim 15, wherein the modification policy comprises:
in the case that the LRF is not selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF is used to temporarily store the data.
17. The method of resolving memory conflicts as claimed in claim 16, wherein the modification policy comprises:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
18. The method of claim 14, wherein in step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification policy comprises:
when Con(r) = true and PE(s) ≠ PE(t), additional routing nodes are added for data movement through the LRF;
wherein PE(s) and PE(t) respectively denote the PEs where the producer and the consumer are located, and Con(r) = true indicates that an interconnection resource connection exists between PE(s) and PE(t).
19. The method of resolving memory conflicts as claimed in claim 18, wherein the modification policy comprises:
in the case that the LRF is not selected and the remaining GRF space [shown as an image in the original] satisfies the space constraint [shown as an image in the original], the GRF is used to temporarily store the data.
20. The method of resolving memory conflicts as claimed in claim 19, wherein the modification policy comprises:
for accumulation reuse, the output operation sends data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operation during accumulation is converted into a null operation.
CN202110061629.7A 2020-11-30 2021-01-18 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure Active CN112631610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/079524 WO2022110567A1 (en) 2020-11-30 2021-03-08 Data reuse memory access conflict elimination method for coarse-grained reconfigurable structure

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011377746 2020-11-30
CN2020113777466 2020-11-30

Publications (2)

Publication Number Publication Date
CN112631610A CN112631610A (en) 2021-04-09
CN112631610B true CN112631610B (en) 2022-04-26

Family

ID=75294506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061629.7A Active CN112631610B (en) 2020-11-30 2021-01-18 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure

Country Status (2)

Country Link
CN (1) CN112631610B (en)
WO (1) WO2022110567A1 (en)

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043659A (en) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflict and implementation method thereof
CN102156666A (en) * 2011-04-20 2011-08-17 上海交通大学 Temperature optimizing method for resource scheduling of coarse reconfigurable array processor
KR20110121313A (en) * 2010-04-30 2011-11-07 서울대학교산학협력단 Method and apparatus of optimal application mapping on coarse-grained reconfigurable array
EP2523120A1 (en) * 2011-05-12 2012-11-14 Imec Microcomputer architecture for low power efficient baseband processing
CN102855197A (en) * 2011-11-08 2013-01-02 东南大学 Storage system implementing method for large-scale coarse-grained reconfigurable system
KR101293701B1 (en) * 2012-02-23 2013-08-06 국립대학법인 울산과학기술대학교 산학협력단 Method and apparatus of executing nested loop on coarse-grained reconfigurable array
CN103377035A (en) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming application
CN203706197U (en) * 2014-02-10 2014-07-09 东南大学 Coarse-granularity dynamic and reconfigurable data regularity control unit structure
CN103942082A (en) * 2014-04-02 2014-07-23 南阳理工学院 Complier optimization method for eliminating redundant storage access operations
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105159737A (en) * 2015-07-28 2015-12-16 哈尔滨工程大学 Similar affine array subscript application-oriented parameterized parallel storage structure template
CN105260222A (en) * 2015-10-13 2016-01-20 哈尔滨工程大学 Optimization method for initiation interval between circulating pipeline iterations in reconfigurable compiler
CN105302624A (en) * 2015-09-17 2016-02-03 哈尔滨工程大学 Automatic analysis method capable of reconstructing start interval of periodic pipeline iteration in complier
CN105302525A (en) * 2015-10-16 2016-02-03 上海交通大学 Parallel processing method for reconfigurable processor with multilayer heterogeneous structure
CN105335331A (en) * 2015-12-04 2016-02-17 东南大学 SHA256 realizing method and system based on large-scale coarse-grain reconfigurable processor
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN105528243A (en) * 2015-07-02 2016-04-27 中国科学院计算技术研究所 A priority packet scheduling method and system utilizing data topological information
CN105700933A (en) * 2016-01-12 2016-06-22 上海交通大学 Parallelization and loop optimization method and system for a high-level language of reconfigurable processor
CN105718245A (en) * 2016-01-18 2016-06-29 清华大学 Reconfigurable computation cyclic mapping optimization method
CN105867994A (en) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier
WO2019059927A1 (en) * 2017-09-22 2019-03-28 Intel Corporation Loop nest reversal

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200924B (en) * 2011-05-17 2014-07-16 北京北大众志微系统科技有限责任公司 Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling
CN103106067B (en) * 2013-03-01 2016-01-20 清华大学 The optimization method of processor cyclic mapping and system
US9910650B2 (en) * 2014-09-25 2018-03-06 Intel Corporation Method and apparatus for approximating detection of overlaps between memory ranges
US10528356B2 (en) * 2015-11-04 2020-01-07 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110121313A (en) * 2010-04-30 2011-11-07 서울대학교산학협력단 Method and apparatus of optimal application mapping on coarse-grained reconfigurable array
CN102043659A (en) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflict and implementation method thereof
CN102156666A (en) * 2011-04-20 2011-08-17 上海交通大学 Temperature optimizing method for resource scheduling of coarse reconfigurable array processor
EP2523120A1 (en) * 2011-05-12 2012-11-14 Imec Microcomputer architecture for low power efficient baseband processing
CN102855197A (en) * 2011-11-08 2013-01-02 东南大学 Storage system implementing method for large-scale coarse-grained reconfigurable system
KR101293701B1 (en) * 2012-02-23 2013-08-06 국립대학법인 울산과학기술대학교 산학협력단 Method and apparatus of executing nested loop on coarse-grained reconfigurable array
CN103377035A (en) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming application
CN203706197U (en) * 2014-02-10 2014-07-09 东南大学 Coarse-granularity dynamic and reconfigurable data regularity control unit structure
CN103942082A (en) * 2014-04-02 2014-07-23 南阳理工学院 Complier optimization method for eliminating redundant storage access operations
CN103984560A (en) * 2014-05-30 2014-08-13 东南大学 Embedded reconfigurable system based on large-scale coarseness and processing method thereof
CN105528243A (en) * 2015-07-02 2016-04-27 中国科学院计算技术研究所 A priority packet scheduling method and system utilizing data topological information
CN105159737A (en) * 2015-07-28 2015-12-16 哈尔滨工程大学 Similar affine array subscript application-oriented parameterized parallel storage structure template
CN105302624A (en) * 2015-09-17 2016-02-03 哈尔滨工程大学 Automatic analysis method capable of reconstructing start interval of periodic pipeline iteration in complier
CN105260222A (en) * 2015-10-13 2016-01-20 哈尔滨工程大学 Optimization method for initiation interval between circulating pipeline iterations in reconfigurable compiler
CN105302525A (en) * 2015-10-16 2016-02-03 上海交通大学 Parallel processing method for reconfigurable processor with multilayer heterogeneous structure
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN105335331A (en) * 2015-12-04 2016-02-17 东南大学 SHA256 realizing method and system based on large-scale coarse-grain reconfigurable processor
CN105700933A (en) * 2016-01-12 2016-06-22 上海交通大学 Parallelization and loop optimization method and system for a high-level language of reconfigurable processor
CN105718245A (en) * 2016-01-18 2016-06-29 清华大学 Reconfigurable computation cyclic mapping optimization method
CN105867994A (en) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier
WO2019059927A1 (en) * 2017-09-22 2019-03-28 Intel Corporation Loop nest reversal

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Muhammad Ali Shami; Ahmed Hemani. Control Scheme for a CGRA. In: 2010 22nd International Symposium on Computer Architecture and High Performance Computing. 2010, pp. 17-24. *
Peng Cao; Huiyan Jiang; Bo Liu; Weiwei Shan. Memory Bandwidth Optimization Strategy of Coarse-Grained Reconfigurable Architecture. In: 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery. 2012. *
A fast and efficient compilation framework for coarse-grained reconfigurable architectures; Yin Wenzhi; Zhao Zhongyuan; Mao Zhigang; Wang Qin; Sheng Weiguang; Microelectronics & Computer; 20190805; vol. 36, no. 8; full text *
Analysis of algorithm mapping on typical reconfigurable architectures; Fang Chen; He Weifeng; Mao Zhigang; Microelectronics & Computer; 20130805; vol. 30, no. 8; full text *
Pipelined mapping of kernel loops onto coarse-grained reconfigurable architectures; Wang Dawei; Dou Yong; Li Sikun; Chinese Journal of Computers; 20090615; vol. 32, no. 6; full text *
Hardware implementation of loop self-pipelining on coarse-grained reconfigurable platforms; Xu Jinhui; Yang Mengmeng; Dou Yong; Zhou Xingming; Chinese Journal of Computers; 20090615; vol. 32, no. 6; full text *
Research on co-optimization of coarse-grained reconfigurable computing architectures and their mapping algorithms; Yin Wenzhi; China Master's Theses Full-text Database, Information Science and Technology; 20200615; no. 6; full text *
Research on performance optimization techniques for coarse-grained reconfigurable array processors; Xu Jiaqing; China Master's Theses Full-text Database, Information Science and Technology; 20090715; no. 7; full text *

Also Published As

Publication number Publication date
CN112631610A (en) 2021-04-09
WO2022110567A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112465108B (en) Neural network compiling method for storage and calculation integrated platform
Wang et al. Supporting very large models using automatic dataflow graph partitioning
US8661422B2 (en) Methods and apparatus for local memory compaction
TW202127238A (en) Compiler flow logic for reconfigurable architectures
US20120331278A1 (en) Branch removal by data shuffling
Wuytack et al. Memory management for embedded network applications
CN112835627A (en) Approximate nearest neighbor search for single instruction multi-thread or single instruction multiple data type processors
Jiang et al. Boyi: A systematic framework for automatically deciding the right execution model of OpenCL applications on FPGAs
Lam et al. A data locality optimizing algorithm
Yin et al. Conflict-free loop mapping for coarse-grained reconfigurable architecture with multi-bank memory
US6324629B1 (en) Method for determining an optimized data organization
US20090064120A1 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
WO2022068205A1 (en) Data storage method and system, and data reading method and system
CN112306500B (en) Compiling method for reducing multi-class access conflict aiming at coarse-grained reconfigurable structure
Deest et al. Towards scalable and efficient FPGA stencil accelerators
CN112631610B (en) Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
US20230076473A1 (en) Memory processing unit architecture mapping techniques
Nakano Sequential memory access on the unified memory machine with application to the dynamic programming
Danckaert et al. Platform Independent Data Transfer and Storage Exploration Illustrated on Parallel Cavity Detection Algorithm.
Chen et al. Reducing memory access conflicts with loop transformation and data reuse on coarse-grained reconfigurable architecture
Ozdal Improving efficiency of parallel vertex-centric algorithms for irregular graphs
Cheng et al. Synthesis of statically analyzable accelerator networks from sequential programs
Koike et al. A novel computational model for GPUs with applications to efficient algorithms
US7363459B2 (en) System and method of optimizing memory usage with data lifetimes
Zhuang et al. A framework for parallelizing load/stores on embedded processors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant