CN101441564B

CN101441564B - Method for implementing reconfigurable accelerator customized for program

Info

Publication number: CN101441564B
Application number: CN2008101629053A
Authority: CN
Inventors: 陈天洲; 严力科; 陈度; 王罡; 王勇刚
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2008-12-04
Filing date: 2008-12-04
Publication date: 2011-07-20
Anticipated expiration: 2028-12-04
Also published as: CN101441564A

Abstract

The invention discloses a method for implementing a reconfigurable accelerator customized for a program. An FPGA is added to an existing general-purpose computer system, and the reconfigurable accelerator customized for the program accelerates the program on the FPGA. The main function of the method is to analyze the program: it samples the running program's runtime information at function granularity, identifies the computation-intensive hotspot functions in the program, implements the hotspot functions as reconfigurable accelerators on the FPGA, and replaces calls to the hotspot functions in the program with calls to the corresponding reconfigurable accelerators, thereby accelerating execution of the hotspot functions. Implementing the program's hotspot functions with reconfigurable accelerators improves the program's overall speedup; implementing the reconfigurable accelerator on an FPGA approaches the performance of an application-specific integrated circuit while retaining the flexibility of a general-purpose processor.

Description

Method for implementing a reconfigurable accelerator customized for a program

Technical field

The present invention relates to the fields of program optimization design and FPGA design, and in particular to a method for implementing a reconfigurable accelerator customized for a program.

Background technology

With the application of new materials and the development of new processes, very-large-scale integration technology has made great progress, and the number of transistors integrated in the same die area of existing processors is about to exceed 10 billion. However, because of problems with transistor utilization, leakage, heat dissipation, and power consumption, raising the processor clock frequency to gain performance now costs more than it returns. Multi-core architecture has therefore become the mainstream processor technology: by packaging multiple processing cores in a single chip, it achieves true physical parallelism, which improves transistor utilization, eases heat dissipation and power consumption problems, and brings a larger performance gain to the computer. Judging from current trends, the number of cores integrated in a processor chip will continue to grow rapidly. However, because the parallelism of general applications is hard to increase, once the number of general-purpose cores in a processor exceeds 16, adding more general-purpose cores yields little further performance gain. Simply increasing the number of homogeneous general-purpose cores can use up the rapidly growing transistor budget, but applications cannot fully exploit the ever-increasing number of cores, and computing performance does not rise automatically with the core count.

Customized coprocessors and accelerators are another technical means of meeting users' ever-growing demand for performance. Modern computing systems often contain dedicated coprocessors or accelerators, including "domain-specific coprocessors" for scientific computing and "industry-specific processors" for graphics and image processing, digital signal processing, and the like, such as the auxiliary processing cores of Cell and the Intel Graphics Media Accelerator 950. The architectures of these dedicated coprocessors and accelerators are customized to the characteristics of particular applications, so they can achieve high performance and high efficiency for those applications. However, such specialized designs deliver that performance only when running the applications they target, so their utilization and flexibility are low, and specialized customization greatly increases design cost.

In this context, adding a reconfigurable accelerator built from reconfigurable devices to a conventional computer system provides another way to raise computing performance. Through dynamic reconfiguration of the reconfigurable device, a reconfigurable accelerator can support many different types of applications, so it can reach high performance over a wider range of applications and improve the utilization of reconfigurable hardware resources, combining the flexibility of a general-purpose processor, which adapts to most applications, with the high performance and efficiency of a dedicated processor. Besides handling the diversity of applications, it can also address accelerator hardware utilization, design complexity, and system reliability, and reduce cost and power consumption.

Summary of the invention

To make better use of reconfigurable resources, design customized accelerators, and improve the execution performance of application programs, the object of the present invention is to provide a method for implementing a reconfigurable accelerator customized for a program.

The technical solution adopted by the present invention to solve its technical problem is:

A method for implementing a reconfigurable accelerator customized for a program:

1) Computation assisted by the reconfigurable accelerator:

The reconfigurable accelerator accepts calls from the program and handles the computation-intensive parts of the program; while the reconfigurable accelerator is computing, the program pauses and waits for the reconfigurable accelerator to return;

2) Implementation process for customizing a reconfigurable accelerator for a program:

1. Program analysis: the program analysis process comprises 2 steps:

I. Determine the hotspot functions

Determining the hotspot functions is a dynamic profiling process that identifies the functions taking the most execution time in the program. A profiler traces the running program and samples it at function granularity; the sampled data are then aggregated per function to obtain each function's call count and execution time, and the functions are sorted by execution time from most to least. The function with the most execution time is the program's hotspot function and can serve as a candidate for implementation as a reconfigurable accelerator;

II. Analyze data dependences

Data dependence analysis is a static analysis process; performing it on the hotspot function determines the function's degree of parallelism. If there are no data dependences between loop iterations, the different iterations of the loop can be unrolled in parallel, making full use of the physical parallelism of the FPGA. If the performance prediction evaluation indicates that the hotspot function will gain performance, it is implemented as a reconfigurable accelerator to speed up program execution;
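The loop-level check described above can be illustrated with a minimal Python sketch. It assumes a deliberately simplified loop model that the patent does not describe: every iteration i accesses a single array, reading the elements at i + r for each read offset r and writing the elements at i + w for each write offset w, and the loop runs long enough for any two such iterations to exist. The function name and the offset representation are hypothetical.

```python
def has_loop_carried_dependence(read_offsets, write_offsets):
    """Return True if two different iterations touch the same element
    and at least one of them writes it.

    Simplified model (an assumption, not the patent's analysis):
    iteration i reads a[i + r] for each r in read_offsets and
    writes a[i + w] for each w in write_offsets, on one array.
    """
    for w in write_offsets:
        # Iteration i writes a[i + w]; iteration j reads a[j + r].
        # They meet on the same element when j - i == w - r, which
        # involves two different iterations whenever r != w.
        for r in read_offsets:
            if r != w:
                return True
        # Two different write offsets mean two iterations write
        # the same element (an output dependence).
        for other_w in write_offsets:
            if other_w != w:
                return True
    return False


# a[i] = a[i] * 2      -> iterations are independent
print(has_loop_carried_dependence([0], [0]))
# a[i] = a[i - 1] + 1  -> each iteration needs the previous result
print(has_loop_carried_dependence([-1], [0]))
```

A loop for which this check returns False is a candidate for parallel unrolling on the FPGA; a real analysis would also have to handle multiple arrays, non-constant subscripts, and scalar variables carried between iterations.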

2. Hardware-software partitioning:

Once the functions to be implemented as reconfigurable accelerators have been determined, the partition is in effect complete; the hardware-software partitioning step is mainly responsible for defining the interface and parameters between the program and the reconfigurable accelerator. Because each call from the program to the reconfigurable accelerator incurs extra overhead, a data buffer should be added to the reconfigurable accelerator so that multiple calls can be merged and their communication concentrated. This eliminates the extra overhead of repeated calls, lengthens the execution time of each call, and reduces the number of times the program calls the reconfigurable accelerator;

3. Implementation of the hotspot function on the FPGA:

According to the interface and parameters between the program and the reconfigurable accelerator defined in step 2, implement the hardware interface of the reconfigurable accelerator and add a buffer so that software can reduce the number of calls to the reconfigurable accelerator. With the buffer, the input data for multiple calls to the reconfigurable accelerator are transferred to the accelerator's buffer in a single call, which reduces the overall communication cost;

Use the reconfigurable logic to implement, in parallel, the same functionality as the hotspot function, with the aims of raising the clock frequency and reducing the number of execution cycles; both raising the frequency of the reconfigurable accelerator and reducing its execution cycles directly improve its performance;

4. Modify the program to call the accelerator; implementation steps:

Finally, the program must call the accelerator on the FPGA:

I. Add code before the hotspot that the reconfigurable accelerator accelerates, to prepare the accelerator's input data;

II. Invoke the reconfigurable accelerator through its software interface; the program pauses and waits for the reconfigurable accelerator to return its result;

III. Receive the result returned by the reconfigurable accelerator, organize it, and return it to the program, which then resumes execution.
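The three call steps above can be sketched in Python as follows. The patent does not specify a software interface, so `SimulatedAccelerator`, its `start` and `wait` methods, and `call_hotspot_on_accelerator` are hypothetical names; the accelerator is simulated by a worker thread rather than by real FPGA hardware.

```python
import threading


class SimulatedAccelerator:
    """Software stand-in for an FPGA accelerator (hypothetical interface)."""

    def __init__(self, kernel):
        self._kernel = kernel          # the hotspot computation
        self._done = threading.Event()
        self._result = None

    def start(self, input_buffer):
        """Begin processing the input data without blocking the caller."""
        self._done.clear()

        def run():
            self._result = [self._kernel(x) for x in input_buffer]
            self._done.set()

        threading.Thread(target=run).start()

    def wait(self):
        """Block until the accelerator signals completion, then return."""
        self._done.wait()
        return self._result


def call_hotspot_on_accelerator(accelerator, values):
    # Step I: prepare the accelerator's input data.
    input_buffer = list(values)
    # Step II: invoke the accelerator; the program pauses until it returns.
    accelerator.start(input_buffer)
    raw_result = accelerator.wait()
    # Step III: organize the result and hand it back to the program.
    return tuple(raw_result)


accelerator = SimulatedAccelerator(lambda x: x * x)
print(call_hotspot_on_accelerator(accelerator, [1, 2, 3]))
```

The wrapper keeps the original function's call shape, so replacing a hotspot call with the accelerator call touches only the call site.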

The beneficial effects of the present invention are:

The present invention is an FPGA-based method for implementing a reconfigurable accelerator customized for a program. Its main function is to use an FPGA within the computer architecture to implement the program's hotspot functions as reconfigurable accelerators, and to replace calls to the hotspot functions in the program with calls to the corresponding reconfigurable accelerators, thereby accelerating execution of the hotspot functions.

1) Implementing the program's hotspot functions with reconfigurable accelerators improves the program's overall speedup;

2) Implementing the reconfigurable accelerator on an FPGA approaches the performance of an application-specific integrated circuit while retaining the flexibility of a general-purpose processor.

Description of drawings

The accompanying drawing is an overall flowchart of the present invention.

Embodiment

The specific implementation flow of the method for implementing a reconfigurable accelerator customized for a program is as follows.

1) Add a reconfigurable accelerator to assist computation:

An FPGA is added to a conventional general-purpose computer system as a configurable component; the FPGA is connected to the conventional computer system through the PCI-E bus.

The reconfigurable accelerator handles the computation-intensive parts of the program and accepts calls from the program. After the program calls the reconfigurable accelerator, the accelerator begins processing the input data, and the program pauses while the accelerator computes. When the reconfigurable accelerator finishes, it returns the result to the program, and the program resumes execution.

2) Implementation process for customizing a reconfigurable accelerator for a program, as shown in the accompanying drawing:

1. Program analysis:

I. Determine the hotspot functions

Determining the hotspot functions is a dynamic profiling process that identifies the functions taking the most execution time in the program;

a. Use a profiler to trace the running program, recording each function's call count and the start time and return time of each call;

b. Aggregate the sampled data per function to obtain each function's call count and execution time, and sort the functions by execution time from most to least; the result is denoted queue L_Func;

c. The first function in queue L_Func is the one with the most execution time; it is the program's hotspot function and can serve as a candidate for implementation as a reconfigurable accelerator.
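Steps a to c can be sketched in Python as follows. The trace record format `(function_name, start_time, return_time)` and the function name `rank_functions` are assumptions made for illustration; the sketch also sums the full duration of every call, so time spent in nested calls is counted in both caller and callee.

```python
from collections import defaultdict


def rank_functions(trace):
    """Build the queue L_Func from trace records.

    trace: iterable of (function_name, start_time, return_time) tuples,
    one per call (a hypothetical record format).
    Returns a list of (function_name, call_count, execution_time),
    sorted by execution time from most to least.
    """
    call_count = defaultdict(int)
    execution_time = defaultdict(float)
    for name, start_time, return_time in trace:
        call_count[name] += 1
        execution_time[name] += return_time - start_time
    return sorted(
        ((name, call_count[name], execution_time[name]) for name in call_count),
        key=lambda entry: entry[2],
        reverse=True,
    )


trace = [("f", 0.0, 1.0), ("g", 1.0, 4.0), ("f", 4.0, 5.0)]
l_func = rank_functions(trace)
hotspot = l_func[0][0]  # first entry of L_Func is the hotspot function
print(l_func, hotspot)
```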

II. Analyze data dependences

Data dependence analysis is a static analysis process; performing it on the hotspot function determines the function's degree of parallelism. Loops usually account for the largest share of a function's execution time, so loops are the parts to accelerate first;

a. Perform data dependence analysis on the loops in the first-ranked function in queue L_Func. If there are no data dependences between iterations, the different iterations of the loop can be unrolled in parallel, and a performance prediction evaluation is carried out for the reconfigurable accelerator corresponding to the function;

b. Carry out the performance prediction evaluation on the hotspot function. If a performance gain is predicted, the function can be implemented as a reconfigurable accelerator. If no performance gain is predicted, select the next function from queue L_Func in order and analyze it, until the execution time of the next function accounts for less than 10% of the program's total execution time, which indicates that no function in this program is suitable for implementation as a reconfigurable accelerator.
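The selection procedure in step b can be sketched as a short loop. This is a minimal illustration, assuming that L_Func is available as a list of `(name, execution_time)` pairs sorted from most to least and that the performance prediction is supplied as a callable; `select_accelerator_candidate` and `predicts_gain` are hypothetical names.

```python
def select_accelerator_candidate(l_func, total_program_time, predicts_gain,
                                 threshold=0.10):
    """Walk L_Func in order and return the first function predicted to gain.

    l_func: list of (name, execution_time), sorted from most to least.
    predicts_gain: callable taking a function name and returning True when
    the performance prediction evaluation expects a gain (supplied by caller).
    Returns None when no suitable function is found.
    """
    for index, (name, execution_time) in enumerate(l_func):
        # The first-ranked function is always evaluated; every later
        # function is considered only while its share of total execution
        # time stays at or above the 10% threshold.
        if index > 0 and execution_time / total_program_time < threshold:
            return None
        if predicts_gain(name):
            return name
    return None


l_func = [("a", 50.0), ("b", 30.0), ("c", 5.0)]
print(select_accelerator_candidate(l_func, 100.0, lambda name: name == "b"))
print(select_accelerator_candidate(l_func, 100.0, lambda name: name == "c"))
```

In the second call the search stops at function c, whose 5% share falls below the threshold, so no candidate is returned.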

The performance prediction evaluation is as follows:

a. Compute the execution time of the function on the processor, denoted Time_CPU:

{Time}_{CPU} = \frac{{ClockCycles}_{CPU}}{{Frequency}_{CPU}} = \frac{InstructionNum \times CPI}{{Frequency}_{CPU}}

where

ClockCycles_CPU is the number of cycles the CPU takes to complete one execution;

InstructionNum is the number of instructions the CPU executes to complete one execution;

CPI is the number of cycles per instruction (Cycles Per Instruction);

Frequency_CPU is the processor clock frequency.

When CPI is 1, the execution time is approximately:

{Time}_{CPU} \approx \frac{InstructionNum}{{Frequency}_{CPU}}

b. Compute the execution time of the reconfigurable accelerator corresponding to the function, denoted Time_AFU:

{Time}_{AFU} = \frac{{ClockCycles}_{AFU}}{{Frequency}_{AFU}}

where

ClockCycles_AFU is the number of cycles the FPGA accelerator takes to complete one execution;

Frequency_AFU is the clock frequency of the FPGA accelerator.

c. Compare the function's execution time on the processor with the execution time of the corresponding reconfigurable accelerator. The accelerator is faster when Time_AFU < Time_CPU, that is:

\frac{{ClockCycles}_{AFU}}{{Frequency}_{AFU}} < \frac{InstructionNum}{{Frequency}_{CPU}}

According to empirical statistics, the processor clock frequency is approximately 20 times the FPGA accelerator frequency, so Time_AFU < Time_CPU when:

{ClockCycles}_{AFU} < \frac{InstructionNum}{20}

This shows that for the FPGA accelerator to achieve a speedup greater than 1, its execution cycle count must be less than 1/20 of the number of instructions the program executes; that is, the reconfigurable accelerator must complete, on average, the work of more than 20 instructions in each of its cycles.
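The performance prediction evaluation can be written out directly from the formulas above. The function names below are hypothetical, and the numbers in the usage lines (a 2 GHz processor, a 100 MHz FPGA accelerator, 1,000,000 instructions, 40,000 accelerator cycles) are illustrative values chosen to match the 20-to-1 frequency ratio, not figures from the patent.

```python
def cpu_time(instruction_num, cpi, frequency_cpu_hz):
    """Time_CPU = InstructionNum * CPI / Frequency_CPU, in seconds."""
    return instruction_num * cpi / frequency_cpu_hz


def afu_time(clock_cycles_afu, frequency_afu_hz):
    """Time_AFU = ClockCycles_AFU / Frequency_AFU, in seconds."""
    return clock_cycles_afu / frequency_afu_hz


def predicts_speedup(instruction_num, clock_cycles_afu, frequency_ratio=20):
    """Check ClockCycles_AFU < InstructionNum / frequency_ratio.

    Assumes CPI is 1 and that the processor clock runs frequency_ratio
    times faster than the FPGA accelerator clock.
    """
    return clock_cycles_afu * frequency_ratio < instruction_num


# Illustrative values only: 2 GHz CPU, 100 MHz accelerator.
print(cpu_time(1_000_000, 1, 2e9))          # processor execution time in seconds
print(afu_time(40_000, 100e6))              # accelerator execution time in seconds
print(predicts_speedup(1_000_000, 40_000))  # accelerator does 25 instructions' work per cycle
```

With these values the accelerator needs 40,000 cycles for work that takes the processor 1,000,000 instructions, which is 25 instructions per accelerator cycle and therefore above the break-even point of 20.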

2. Hardware-software partitioning:

The hardware-software partitioning step is mainly responsible for defining the interface and parameters between the program and the reconfigurable accelerator, and comprises the following steps:

I. Define the software call interface of the reconfigurable accelerator, that is, the software interface provided for the program to call; it should reduce the time needed to prepare the reconfigurable accelerator's input data;

II. Define the hardware interface of the reconfigurable accelerator and add a data buffer to the accelerator so that multiple calls can be merged and their communication concentrated; this eliminates the extra overhead of repeated calls, lengthens the execution time of each call, and reduces the number of times the program calls the reconfigurable accelerator;
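The benefit of merging calls through a data buffer can be seen in a simple cost model. This is a sketch under assumed conditions: each call pays a fixed overhead, each data item takes a fixed processing time, and the buffer holds `batch_size` items per call; the function name and the cost figures are hypothetical.

```python
def total_execution_time(num_items, batch_size, call_overhead, per_item_time):
    """Total time to process num_items when each call carries batch_size items.

    Simplified model (an assumption): every call costs call_overhead,
    and every item costs per_item_time regardless of batching.
    """
    num_calls = -(-num_items // batch_size)  # ceiling division
    return num_calls * call_overhead + num_items * per_item_time


# 1000 items, call overhead of 10 time units, 1 time unit per item.
print(total_execution_time(1000, 1, 10, 1))    # one item per call
print(total_execution_time(1000, 100, 10, 1))  # 100 items buffered per call
```

In this model the computation time is unchanged, and only the number of calls, and with it the accumulated call overhead, shrinks as the buffer grows.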

3. Implementation of the hotspot function on the FPGA:

I. Implement the hardware interface of the reconfigurable accelerator according to the interface and parameters between the program and the reconfigurable accelerator defined in step 2, and add a buffer so that software can reduce the number of calls to the reconfigurable accelerator. With the buffer, the input data for multiple calls to the reconfigurable accelerator can be transferred to the accelerator's buffer in a single call, which reduces the overall communication cost;

II. Use the reconfigurable logic to implement, in parallel, the same functionality as the hotspot function, with the aims of raising the clock frequency and reducing the number of execution cycles; both raising the frequency of the reconfigurable accelerator and reducing its execution cycles directly improve its performance;

4. Modify the program to call the accelerator; implementation steps:

Finally, the program must call the accelerator on the FPGA:

Claims

1. A method for implementing a reconfigurable accelerator customized for a program, characterized in that:

1) Computation assisted by the reconfigurable accelerator:

2) Implementation process for customizing a reconfigurable accelerator for a program:

1. Program analysis: the program analysis process comprises 2 steps:

I. Determine the hotspot functions

II. Analyze data dependences

2. Hardware-software partitioning:

3. Implementation of the hotspot function on the FPGA:

Use the FPGA to implement, in parallel, the same functionality as the hotspot function, with the aims of raising the clock frequency and reducing the number of execution cycles;

4. Modify the program to call the accelerator; implementation steps:

Finally, the program must call the accelerator on the FPGA: