CN101441564B - Method for implementing reconfigurable accelerator customized for program - Google Patents
- Publication number
- CN101441564B
- Authority
- CN
- China
- Prior art keywords
- program
- reconfigurable accelerator
- accelerator
- reconfigurable
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Stored Programmes (AREA)
Abstract
The invention discloses a method for implementing a reconfigurable accelerator customized for a program. An FPGA is added to an existing general-purpose computer system, and the program is accelerated on that FPGA. The method analyzes the program, samples it at run time at function granularity to collect timing information, identifies the compute-intensive hot spot functions, implements those hot spot functions as reconfigurable accelerators on the FPGA, and replaces the calls to the hot spot functions in the program with calls to the corresponding reconfigurable accelerators, thereby accelerating their execution. Implementing the hot spot functions as reconfigurable accelerators improves the overall speed-up ratio of the program; implementing the accelerators on an FPGA approaches the performance of an application-specific integrated circuit while retaining the flexibility of a general-purpose processor.
Description
Technical field
The present invention relates to the fields of program optimization and FPGA design, and in particular to a method for implementing a reconfigurable accelerator customized for a program.
Background art
With the application of new materials and the development of new processes, very-large-scale integration has advanced greatly, and the number of transistors integrated in the area of an existing processor is about to exceed 10 billion. However, because of problems with transistor utilization, leakage, heat dissipation and power consumption, raising the processor clock frequency to gain performance no longer pays off. Multi-core architectures have therefore become the mainstream processor technology. Packaging several processing cores in a single chip provides true physical parallelism, which improves transistor utilization, eases the heat dissipation and power problems, and brings a larger performance gain. Judging from current trends, the number of cores integrated on a processor chip will continue to grow rapidly. Yet the parallelism of general applications is hard to increase, so once a processor has more than 16 general-purpose cores, adding more of them yields little further gain. Simply adding homogeneous general-purpose cores can use up the rapidly growing transistor budget, but applications cannot fully exploit the growing number of cores, and computing performance does not rise automatically with the core count.
Customized coprocessors and accelerators are another means of meeting users' ever-growing demand for performance. Modern computing systems often contain dedicated coprocessors or accelerators, including domain-specific coprocessors for scientific computing and industry-specific processors for graphics and image processing and digital signal processing, such as the auxiliary processing cores of Cell and Intel's Graphics Media Accelerator 950. The architecture of these dedicated coprocessors and accelerators is tailored to the characteristics of a particular application, so they achieve high performance and high efficiency for that application. However, such a specialized design delivers its performance only when running the application it targets, so its utilization and flexibility are low, and specialized customization greatly increases design cost.
Against this background, adding a reconfigurable accelerator built from reconfigurable devices to a conventional computer system offers another way to raise computing performance. Through dynamic reconfiguration of the device, a reconfigurable accelerator can support many different types of application, reaching high performance over a wider range and improving the utilization of the reconfigurable hardware. It combines the flexibility with which a general-purpose processor adapts to most applications with the high performance and efficiency of an application-specific processor. Besides handling application diversity, it also helps with accelerator hardware utilization, design complexity, system reliability, cost and power consumption.
Summary of the invention
To make better use of reconfigurable resources, design customized accelerators and improve the execution performance of applications, the object of the present invention is to provide a method for implementing a reconfigurable accelerator customized for a program.
The technical solution adopted by the present invention to solve its technical problem is:
A method for implementing a reconfigurable accelerator customized for a program:
1) Computation assisted by the reconfigurable accelerator:
The reconfigurable accelerator accepts calls from the program and handles the compute-intensive part of the program; while the reconfigurable accelerator is computing, the program pauses and waits for it to return;
2) Process of implementing the reconfigurable accelerator customized for the program:
1. Program analysis: the program analysis process comprises two steps:
I. Determining the hot spot functions
Determining the hot spot functions is a dynamic profiling process that identifies the functions taking the most execution time in the program. A profiler traces the program at run time and samples it at function granularity; the sampled data are then aggregated per function to obtain the number of calls and the execution time of each function, and the functions are sorted by execution time in descending order. The function with the largest execution time is the hot spot function of the program and is a candidate for implementation as a reconfigurable accelerator;
II. Analyzing data dependences
Data dependence analysis is a static analysis process; applying it to the hot spot function determines the degree of parallelism of the function. If there are no data dependences between loop iterations, the different iterations of the loop can be unrolled in parallel, making full use of the high physical parallelism of the FPGA. If the performance prediction indicates that the hot spot function would gain performance, it is implemented as a reconfigurable accelerator to accelerate the execution of the program;
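The dependence test described in step II can be illustrated with a deliberately small model. The sketch below is not the analysis used by the invention; it assumes a single array accessed with constant offsets from the loop index, and the function names are hypothetical. Iteration i writes `a[i + write_offset]` and iteration j reads `a[j + r]`, so the two touch the same element exactly when `j - i == write_offset - r`; any nonzero distance smaller than the trip count is a dependence between iterations.

```python
def carried_distances(write_offset, read_offsets, trip_count):
    """Dependence distances between iterations for a loop of the assumed form
        for i in range(trip_count):
            a[i + write_offset] = f(a[i + r] for r in read_offsets)
    A distance d means iteration i and iteration i + d touch the same element.
    """
    distances = []
    for r in read_offsets:
        d = write_offset - r
        # d == 0 stays inside one iteration; |d| >= trip_count never occurs.
        if d != 0 and abs(d) < trip_count:
            distances.append(d)
    return distances


def iterations_independent(write_offset, read_offsets, trip_count):
    """True when no iteration depends on another, so the loop can be unrolled in parallel."""
    return not carried_distances(write_offset, read_offsets, trip_count)


# a[i] = 2 * a[i]      -> every iteration touches only its own element
elementwise_ok = iterations_independent(0, [0], trip_count=8)
# a[i] = a[i - 1] + 1  -> iteration i needs the result of iteration i - 1
prefix_distances = carried_distances(0, [-1], trip_count=8)
```

In this model the element-wise loop is a candidate for parallel unrolling on the FPGA, while the prefix-style loop is not, at least not without restructuring.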
2. Hardware-software partitioning:
Once the function to be implemented as a reconfigurable accelerator has been chosen, the partitioning itself is in effect complete; the hardware-software partitioning step is mainly responsible for defining the interface and parameters between the program and the reconfigurable accelerator. Because each call from the program to the reconfigurable accelerator incurs overhead, a data buffer should be added to the reconfigurable accelerator so that several calls are merged into one and communication is concentrated, eliminating the overhead of the repeated calls; this increases the execution time of each call and reduces the number of calls the program makes to the reconfigurable accelerator;
3. Implementing the hot spot function on the FPGA:
The hardware interface of the reconfigurable accelerator is implemented according to the interface and parameters between the program and the reconfigurable accelerator defined in step 2, and a buffer is added to support reducing the number of accelerator calls made by software; with the buffer, the input data for several accelerator calls are transferred to the accelerator's buffer in a single call, reducing the overall communication cost;
The reconfigurable logic is used to implement, in parallel, the same functionality as the hot spot function, with the goals of raising the clock frequency and reducing the number of execution cycles; both raising the frequency of the reconfigurable accelerator and reducing its execution cycles directly improve its performance;
4. Modifying the program to call the accelerator; implementation steps:
Finally, the program must call the accelerator on the FPGA:
I. Add code before the hot spot accelerated by the reconfigurable accelerator to prepare the accelerator's input data;
II. Call the reconfigurable accelerator through its software interface; the program pauses and waits for the accelerator to return its result;
III. Receive the result returned by the reconfigurable accelerator, post-process it and return it to the program, which then continues executing.
The beneficial effects of the present invention are:
The present invention is an FPGA-based method for implementing a reconfigurable accelerator customized for a program. Its main function is to use an FPGA in the computer architecture to implement the hot spot functions of a program as reconfigurable accelerators, and to replace the calls to the hot spot functions in the program with calls to the corresponding reconfigurable accelerators, accelerating the execution of the hot spot functions.
1) Implementing the hot spot functions of the program as reconfigurable accelerators improves the overall speed-up ratio of the program;
2) Implementing the reconfigurable accelerators on an FPGA approaches the performance of an application-specific integrated circuit while retaining the flexibility of a general-purpose processor.
Description of drawings
The accompanying drawing is an overall flow chart of the present invention.
Embodiment
The specific implementation flow of the method for implementing a reconfigurable accelerator customized for a program is as follows.
1) Adding computation assisted by the reconfigurable accelerator:
An FPGA is added to a conventional general-purpose computer system as the reconfigurable component and is connected to the system through the PCI-E bus.
The reconfigurable accelerator handles the compute-intensive part of the program and accepts calls from the program. After the program calls the reconfigurable accelerator, the accelerator begins processing the input data, and the program pauses while the accelerator computes; when the accelerator finishes, it returns the result to the program, which then continues executing.
2) Process of implementing the reconfigurable accelerator customized for the program, as shown in the accompanying drawing:
1. Program analysis:
I. Determining the hot spot functions
Determining the hot spot functions is a dynamic profiling process that identifies the functions taking the most execution time in the program;
a. A profiler traces the program at run time, recording the number of calls to each function and the start time and return time of each call;
b. The sampled data are aggregated per function to obtain the number of calls and the execution time of each function; the functions are sorted by execution time in descending order, and the result is recorded as the queue L_Func;
c. The first function in the queue L_Func has the largest execution time; it is the hot spot function of the program and is a candidate for implementation as a reconfigurable accelerator.
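Steps a to c amount to aggregating a call trace per function and sorting the result. The following is a minimal sketch of that aggregation, assuming the profiler has already produced records of the form (function name, start time, return time); the record layout, the function name `rank_hot_spots` and the sample trace are illustrative and not taken from the patent. Times are inclusive, so time spent in nested calls is also counted in the caller.

```python
from collections import defaultdict


def rank_hot_spots(trace):
    """Aggregate (name, start_time, return_time) records per function and
    return (name, call_count, total_time) tuples sorted by total_time, largest first."""
    calls = defaultdict(int)
    total = defaultdict(float)
    for name, start_time, return_time in trace:
        calls[name] += 1
        total[name] += return_time - start_time
    ranked = [(name, calls[name], total[name]) for name in calls]
    ranked.sort(key=lambda entry: entry[2], reverse=True)
    return ranked


# Hypothetical trace: two calls to "fft", one each to "load" and "store".
trace = [
    ("fft", 0.0, 6.0),
    ("load", 6.0, 7.0),
    ("fft", 7.0, 12.0),
    ("store", 12.0, 14.0),
]
l_func = rank_hot_spots(trace)  # plays the role of the sorted queue
hot_spot = l_func[0][0]         # first entry is the hot spot candidate
```

With this trace, `fft` heads the list because its two calls together take the most time.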
II. Analyzing data dependences
Data dependence analysis is a static analysis process; applying it to the hot spot function determines the degree of parallelism of the function. Loops usually account for most of a function's execution time, so loops are accelerated first;
a. Perform data dependence analysis on the loops of the first function in the queue L_Func. If there are no data dependences between iterations, the different iterations of the loop can be unrolled in parallel; then carry out the performance prediction for the reconfigurable accelerator corresponding to the function;
b. Carry out the performance prediction for the hot spot function. If a performance gain is predicted, the function can be implemented as a reconfigurable accelerator; if no gain is predicted, select the next function from the queue L_Func and analyze it in turn, until the execution time of the next function accounts for less than 10% of the program's total execution time, which indicates that no function in the program is suitable for implementation as a reconfigurable accelerator.
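The selection loop in step b can be sketched as follows. This is one reading of the text, not a definitive implementation: the first function in the queue is always analyzed, and the walk stops as soon as the next function's share of total execution time falls below 10%. The names `select_candidate` and `predicts_gain` are hypothetical; `predicts_gain` stands in for the dependence analysis plus performance prediction.

```python
def select_candidate(l_func, total_time, predicts_gain, min_share=0.10):
    """Walk the queue (sorted by execution time, largest first) and return the
    first function predicted to gain from acceleration, or None if none qualifies.

    l_func        -- list of (name, exec_time) tuples, sorted descending
    total_time    -- total execution time of the program
    predicts_gain -- callable(name) -> bool, placeholder for the prediction step
    """
    for index, (name, exec_time) in enumerate(l_func):
        # Stop once the next function is too small to be worth accelerating.
        if index > 0 and exec_time / total_time < min_share:
            return None
        if predicts_gain(name):
            return name
    return None


# Hypothetical profile: execution times in arbitrary units, total of 100.
queue = [("fft", 50.0), ("sort", 30.0), ("log", 5.0)]
chosen = select_candidate(queue, 100.0, lambda name: name == "sort")
nothing = select_candidate(queue, 100.0, lambda name: name == "log")
```

Here `sort` is chosen after `fft` is rejected, while `log` is never considered because it accounts for only 5% of the run time.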
The performance prediction is as follows:
a. Compute the execution time of the function on the processor, denoted Time_CPU:
$$\mathrm{Time}_{CPU} = \frac{\mathrm{ClockCycles}_{CPU}}{\mathrm{Frequency}_{CPU}} = \frac{\mathrm{InstructionNum} \times \mathrm{CPI}}{\mathrm{Frequency}_{CPU}}$$
where
ClockCycles_CPU is the number of cycles the CPU needs to complete one execution;
InstructionNum is the number of instructions the CPU executes to complete one execution;
CPI is the number of cycles per instruction (Cycles Per Instruction);
Frequency_CPU is the processor clock frequency.
When CPI is 1, the execution time is approximately:
$$\mathrm{Time}_{CPU} \approx \frac{\mathrm{InstructionNum}}{\mathrm{Frequency}_{CPU}}$$
b. Compute the execution time of the reconfigurable accelerator corresponding to the function, denoted Time_AFU:
$$\mathrm{Time}_{AFU} = \frac{\mathrm{ClockCycles}_{AFU}}{\mathrm{Frequency}_{AFU}}$$
where
ClockCycles_AFU is the number of cycles the FPGA accelerator needs to complete one execution;
Frequency_AFU is the clock frequency of the FPGA accelerator.
c. Compare the processor execution time of the function with the execution time of the corresponding reconfigurable accelerator. Time_AFU < Time_CPU holds when:
$$\frac{\mathrm{ClockCycles}_{AFU}}{\mathrm{Frequency}_{AFU}} < \frac{\mathrm{InstructionNum}}{\mathrm{Frequency}_{CPU}}$$
According to empirical statistics, the processor clock frequency is about 20 times the FPGA accelerator frequency, so Time_AFU < Time_CPU when:
$$\mathrm{ClockCycles}_{AFU} < \frac{\mathrm{InstructionNum}}{20}$$
This shows that for the speed-up ratio of the FPGA accelerator to exceed 1, the number of execution cycles of the reconfigurable accelerator must be less than 1/20 of the number of instructions the program executes; in other words, one cycle of the reconfigurable accelerator must accomplish the work of more than 20 instructions.
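The prediction can be worked through numerically. The sketch below simply evaluates Time_CPU = InstructionNum × CPI / Frequency_CPU and Time_AFU = ClockCycles_AFU / Frequency_AFU; the function names and the sample figures (a 2 GHz processor, a 100 MHz accelerator, 2,000,000 instructions, 50,000 accelerator cycles) are assumptions chosen to match the 20:1 frequency ratio mentioned in the text.

```python
def time_cpu(instruction_num, frequency_cpu, cpi=1.0):
    """Processor execution time: InstructionNum * CPI / Frequency_CPU."""
    return instruction_num * cpi / frequency_cpu


def time_afu(clock_cycles_afu, frequency_afu):
    """Accelerator execution time: ClockCycles_AFU / Frequency_AFU."""
    return clock_cycles_afu / frequency_afu


def predicted_speedup(instruction_num, clock_cycles_afu,
                      frequency_cpu, frequency_afu, cpi=1.0):
    """Speed-up ratio Time_CPU / Time_AFU; acceleration pays off when it exceeds 1."""
    return (time_cpu(instruction_num, frequency_cpu, cpi)
            / time_afu(clock_cycles_afu, frequency_afu))


def break_even_cycles(instruction_num, frequency_cpu, frequency_afu, cpi=1.0):
    """Accelerator cycle count at which Time_AFU equals Time_CPU."""
    return instruction_num * cpi * frequency_afu / frequency_cpu


# Assumed figures: 2 GHz CPU, 100 MHz FPGA accelerator (ratio 20), CPI = 1.
ratio = predicted_speedup(2_000_000, 50_000, 2e9, 1e8)
limit = break_even_cycles(2_000_000, 2e9, 1e8)
```

With these figures the break-even point is 100,000 accelerator cycles, which is InstructionNum / 20, and an accelerator that needs 50,000 cycles is predicted to be about twice as fast as the processor.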
2. Hardware-software partitioning:
The hardware-software partitioning step is mainly responsible for defining the interface and parameters between the program and the reconfigurable accelerator, and comprises the following steps:
I. Define the software call interface of the reconfigurable accelerator, i.e. the software interface offered to the program; it should minimize the time needed to prepare the accelerator's input data;
II. Define the hardware interface of the reconfigurable accelerator and add a data buffer to the reconfigurable accelerator, so that several calls are merged into one and communication is concentrated, eliminating the overhead of the repeated calls; this increases the execution time of each call and reduces the number of calls the program makes to the reconfigurable accelerator;
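The benefit of merging calls can be seen with a simple cost model. The model below is an assumption made for illustration, not something stated in the patent: every call to the accelerator pays a fixed overhead, and the computation itself costs the same per data item regardless of how the items are grouped. All names and numbers are hypothetical, and times are in microseconds.

```python
def total_time_us(n_items, batch_size, call_overhead_us, time_per_item_us):
    """Total time to process n_items when batch_size items are sent per call.

    Assumed model: each call costs call_overhead_us once, plus
    time_per_item_us for every item it carries.
    """
    n_calls = -(-n_items // batch_size)  # ceiling division
    return n_calls * call_overhead_us + n_items * time_per_item_us


# 1000 items, 10 us of overhead per call, 1 us of accelerator work per item.
unbatched = total_time_us(1000, 1, 10, 1)    # one call per item
batched = total_time_us(1000, 100, 10, 1)    # buffer holds 100 items per call
```

Under this model the call overhead drops from 10,000 µs to 100 µs while the compute time stays at 1,000 µs, which is why each call becomes longer but the overall run is shorter.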
3. Implementing the hot spot function on the FPGA:
I. Implement the hardware interface of the reconfigurable accelerator according to the interface and parameters between the program and the reconfigurable accelerator defined in step 2, and add a buffer to support reducing the number of accelerator calls made by software; with the buffer, the input data for several accelerator calls can be transferred to the accelerator's buffer in a single call, reducing the overall communication cost;
II. Use the reconfigurable logic to implement, in parallel, the same functionality as the hot spot function, with the goals of raising the clock frequency and reducing the number of execution cycles; both raising the frequency of the reconfigurable accelerator and reducing its execution cycles directly improve its performance;
4. Modifying the program to call the accelerator; implementation steps:
Finally, the program must call the accelerator on the FPGA:
I. Add code before the hot spot accelerated by the reconfigurable accelerator to prepare the accelerator's input data;
II. Call the reconfigurable accelerator through its software interface; the program pauses and waits for the accelerator to return its result;
III. Receive the result returned by the reconfigurable accelerator, post-process it and return it to the program, which then continues executing.
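The three steps of the modified call site can be sketched in software. Since the patent does not define a concrete software interface, everything named here is hypothetical: `SimulatedAccelerator` stands in for the FPGA accelerator and simply runs the kernel on the host, and `accelerated_map` plays the role of the code inserted at the hot spot.

```python
class SimulatedAccelerator:
    """Hypothetical stand-in for the FPGA accelerator; computes on the host."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.calls = 0  # number of times the program called the accelerator

    def run(self, input_buffer):
        """Blocking call: returns only after the whole buffer is processed."""
        self.calls += 1
        return [self.kernel(value) for value in input_buffer]


def hot_spot(x):
    """Original software version of the hot spot function (illustrative)."""
    return x * x + 1


def accelerated_map(accelerator, values):
    # I.   Prepare the accelerator's input data as one contiguous buffer.
    input_buffer = list(values)
    # II.  Call the accelerator; the program waits here for the result.
    raw_results = accelerator.run(input_buffer)
    # III. Collect the results and hand them back to the program.
    return list(raw_results)


accelerator = SimulatedAccelerator(hot_spot)
results = accelerated_map(accelerator, range(5))
```

Because all five inputs travel in one buffer, the accelerator is called once rather than five times, which matches the buffering described for the hardware interface.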
Claims (1)
1. A method for implementing a reconfigurable accelerator customized for a program, characterized in that:
1) Computation assisted by the reconfigurable accelerator:
The reconfigurable accelerator accepts calls from the program and handles the compute-intensive part of the program; while the reconfigurable accelerator is computing, the program pauses and waits for it to return;
2) Process of implementing the reconfigurable accelerator customized for the program:
1. Program analysis: the program analysis process comprises two steps:
I. Determining the hot spot functions
Determining the hot spot functions is a dynamic profiling process that identifies the functions taking the most execution time in the program. A profiler traces the program at run time and samples it at function granularity; the sampled data are then aggregated per function to obtain the number of calls and the execution time of each function, and the functions are sorted by execution time in descending order. The function with the largest execution time is the hot spot function of the program and is a candidate for implementation as a reconfigurable accelerator;
II. Analyzing data dependences
Data dependence analysis is a static analysis process; applying it to the hot spot function determines the degree of parallelism of the function. If there are no data dependences between loop iterations, the different iterations of the loop can be unrolled in parallel, making full use of the high physical parallelism of the FPGA. If the performance prediction indicates that the hot spot function would gain performance, it is implemented as a reconfigurable accelerator to accelerate the execution of the program;
2. Hardware-software partitioning:
Once the function to be implemented as a reconfigurable accelerator has been chosen, the partitioning itself is in effect complete; the hardware-software partitioning step is mainly responsible for defining the interface and parameters between the program and the reconfigurable accelerator. Because each call from the program to the reconfigurable accelerator incurs overhead, a data buffer should be added to the reconfigurable accelerator so that several calls are merged into one and communication is concentrated, eliminating the overhead of the repeated calls; this increases the execution time of each call and reduces the number of calls the program makes to the reconfigurable accelerator;
3. Implementing the hot spot function on the FPGA:
The hardware interface of the reconfigurable accelerator is implemented according to the interface and parameters between the program and the reconfigurable accelerator defined in step 2, and a buffer is added to support reducing the number of accelerator calls made by software; with the buffer, the input data for several accelerator calls are transferred to the accelerator's buffer in a single call, reducing the overall communication cost;
The FPGA is used to implement, in parallel, the same functionality as the hot spot function, with the goals of raising the clock frequency and reducing the number of execution cycles;
4. Modifying the program to call the accelerator; implementation steps:
Finally, the program must call the accelerator on the FPGA:
I. Add code before the hot spot accelerated by the reconfigurable accelerator to prepare the accelerator's input data;
II. Call the reconfigurable accelerator through its software interface; the program pauses and waits for the accelerator to return its result;
III. Receive the result returned by the reconfigurable accelerator, post-process it and return it to the program, which then continues executing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008101629053A CN101441564B (en) | 2008-12-04 | 2008-12-04 | Method for implementing reconfigurable accelerator customized for program |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101441564A (en) | 2009-05-27 |
CN101441564B (en) | 2011-07-20 |
Family
ID=40726012
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008101629053A Expired - Fee Related CN101441564B (en) | 2008-12-04 | 2008-12-04 | Method for implementing reconfigurable accelerator customized for program |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101441564B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193827A (en) * | 2010-03-12 | 2011-09-21 | 西安交通大学 | CBEA (Cell Broadband Engine Architecture)-oriented transplanting method for isomorphic platform application |
EP2442228A1 (en) * | 2010-10-13 | 2012-04-18 | Thomas Lippert | A computer cluster arrangement for processing a computation task and method for operation thereof |
CN102902581B (en) | 2011-07-29 | 2016-05-11 | 国际商业机器公司 | Hardware accelerator and method, CPU, computing equipment |
KR101861742B1 (en) * | 2011-08-30 | 2018-05-30 | 삼성전자주식회사 | Data processing system and method for switching between heterogeneous accelerators |
CN102929812A (en) * | 2012-09-28 | 2013-02-13 | 无锡江南计算技术研究所 | Reconfigurable accelerator mapping method based on storage interface |
CN104572134B (en) * | 2015-02-10 | 2018-03-06 | 中国农业银行股份有限公司 | A kind of optimization method and device |
CN106445678B (en) * | 2016-07-21 | 2020-02-07 | 天津大学 | Parallelism adjusting algorithm for reducing power consumption of instruction level parallel processor |
US11645059B2 (en) * | 2017-12-20 | 2023-05-09 | International Business Machines Corporation | Dynamically replacing a call to a software library with a call to an accelerator |
US10572250B2 (en) | 2017-12-20 | 2020-02-25 | International Business Machines Corporation | Dynamic accelerator generation and deployment |
WO2020191549A1 (en) * | 2019-03-22 | 2020-10-01 | 华为技术有限公司 | Soc chip, method for determination of hotspot function and terminal device |
CN114153494B (en) * | 2021-12-02 | 2024-02-13 | 中国核动力研究设计院 | Hot code optimization method and device based on thermodynamic diagram |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1867893A (en) * | 2003-10-14 | 2006-11-22 | 史坦利·M·海德克 | Method and apparatus for accelerating the verification of application specific integrated circuit designs |
Also Published As
Publication number | Publication date |
---|---|
CN101441564A (en) | 2009-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101441564B (en) | Method for implementing reconfigurable accelerator customized for program | |
US10387319B2 (en) | Processors, methods, and systems for a configurable spatial accelerator with memory system performance, power reduction, and atomics support features | |
US10416999B2 (en) | Processors, methods, and systems with a configurable spatial accelerator | |
US10558575B2 (en) | Processors, methods, and systems with a configurable spatial accelerator | |
Wang et al. | Kernel fusion: An effective method for better power efficiency on multithreaded GPU | |
US20190095383A1 (en) | Processors, methods, and systems for debugging a configurable spatial accelerator | |
Luo et al. | A performance and energy consumption analytical model for GPU | |
JPH04307625A (en) | Loop optimization system | |
CN104657219A (en) | Application program thread count dynamic regulating method used under isomerous many-core system | |
CN110427337B (en) | Processor core based on field programmable gate array and operation method thereof | |
CN101989192A (en) | Method for automatically parallelizing program | |
Kumar et al. | Speculative parallelism on multicore chip architecture strengthen green computing concept: A survey | |
Capalija et al. | Microarchitecture of a coarse-grain out-of-order superscalar processor | |
Whaley et al. | Heuristics for profile-driven method-level speculative parallelization | |
CN100583042C (en) | Compiling method, apparatus for loop in program | |
Marowka | Energy consumption modeling for hybrid computing | |
Lee et al. | Performance benefits of heterogeneous computing in HPC workloads | |
Wang et al. | A flexible chip multiprocessor simulator dedicated for thread level speculation | |
Curzel et al. | Higher-Level Synthesis: experimenting with MLIR polyhedral representations for accelerator design | |
Wang et al. | Investigation of factors impacting thread-level parallelism from desktop, multimedia and HPC applications | |
Tianxu | Convolutional Neural Network FPGA-accelerator on Intel DE10-Standard FPGA | |
Lai et al. | Performance improvement on heterogeneous platforms: a machine learning based approach | |
Lee et al. | Raptor: A single chip multiprocessor | |
Wang et al. | A bandwidth enhancement method of VTA based on paralleled memory access design | |
Liu et al. | An online profile guided optimization approach for speculative parallel threading |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20110720 Termination date: 20111204 |