CN102110052A

CN102110052A - Parallel acceleration method for dynamic analysis of program behavior

Info

Publication number: CN102110052A
Application number: CN2011100509272A
Authority: CN
Inventors: 金海; 张伟富; 喻之斌; 涂旭平
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2011-03-03
Filing date: 2011-03-03
Publication date: 2011-06-29

Abstract

The invention discloses a parallel acceleration method for dynamic analysis of program behavior. The method comprises the following steps: acquiring an analyzed program, and generating fragments of all threads of the analyzed program on the basis of resource and load states; instrumenting all the fragments with analysis code; allocating the instrumented fragments to specific processor cores and executing them concurrently with the analyzed threads; performing reduction on the execution results of the fragments after the concurrent execution is finished; and acquiring program behavior information on the basis of the reduction results. Because idle computing resources are used to collect program behavior information, the dynamic analysis of the program is accelerated.

Description

A parallel acceleration method for dynamic analysis of program behavior

Technical field

The invention belongs to the field of dynamic analysis of program behavior. It specifically relates to a parallel acceleration method for dynamic behavior analysis of multithreaded programs (mainly programs written in multithreaded programming models for shared-memory architectures, for example programs based on programming models such as OpenMP and Intel TBB). It is applicable to dynamic program analysis, dynamic instrumentation, and research on parallel acceleration using multiple cores.

Background technology

Program behavior refers to the set of characteristics a program exhibits while executing on a CPU, for example cache miss rate, branch prediction information, memory usage and running time, and synchronization and communication between threads. Capturing the dynamic behavior characteristics of a program plays an important role in architecture design, compiler optimization, and program performance and bottleneck analysis.

As processor frequencies keep rising and manufacturing process sizes keep shrinking, processors face power consumption and heat dissipation bottlenecks, which make it difficult to improve processor performance simply by raising the clock frequency. Processors have begun to evolve toward multiple cores and multiple threads, and chip multiprocessors (Chip Multiprocessor) and simultaneous multithreading (Simultaneous Multi-Threading) are gradually becoming the mainstream computing platforms. The continuing increase in the number of hardware-supported cores and threads is making multithreaded programming the mainstream programming model. However, because of the inherent complexity of multithreaded programming, for example task decomposition, deadlocks, races, and the nondeterminism of multithreaded execution, designing and writing efficient multithreaded programs is a difficult task, and methods for analyzing multithreaded programs efficiently are urgently needed. On the other hand, as software engineering ideas and software design techniques develop, more levels of indirection, such as virtual machines, middleware, and programming models like OpenMP, are commonly introduced to improve design flexibility and the dynamic configurability of systems. These levels of indirection simplify application development, but they also introduce abstraction and complexity and therefore a certain overhead. Software developers need more efficient, more transparent, and easier-to-use debugging and analysis tools to understand the dynamic behavior of software and find its defects and performance bottlenecks, so as to ensure software reliability and shorten the development and testing process. In addition, capturing the dynamic behavior characteristics of a program also plays an important role in architecture design and compiler optimization. Architecture designers often need dynamic program behavior information to optimize architecture design and system emulation and simulation; compiler designers can use run-time program information to optimize code generation and automatic parallelization.

At present there are two main methods of program profiling: one is based on hardware performance counters (Hardware Performance Counter); the other is software-based dynamic instrumentation (Dynamic Instrumentation). Profiling based on hardware counters usually has low overhead but collects relatively coarse data. Profiling based on instrumentation can usually collect detailed and accurate data and is flexible and highly customizable; tools for cache simulation, robustness analysis, memory leak detection, and the like can be built on dynamic instrumentation. However, the overhead it introduces is high. With mainstream dynamic instrumentation tools such as Pin, DynamoRIO, and Valgrind, even simple basic block counting makes the instrumented program take more than 2.5 times as long as the uninstrumented program, and more complex analyses degrade performance further; for cache simulation, for example, Valgrind is usually more than 100 times slower than native execution. In general, a good profiler has two features: it collects as much accurate data as possible, and it has low overhead. In practice the two conflict: collecting complete and accurate performance data usually means high overhead, and vice versa.

The present invention addresses the conflict in profilers between data accuracy and profiling overhead, as well as the conflict between multithreaded program developers' urgent need for efficient dynamic analysis of multithreaded program behavior and the high overhead of dynamic instrumentation. It proposes using the idle resources of a multicore processor to split the dynamically instrumented program into fragments that execute in parallel, thereby accelerating dynamic analysis of program behavior while balancing data completeness and accuracy against profiling overhead, which makes complex analysis of long-running programs feasible.

Summary of the invention

The purpose of the invention is to provide a parallel acceleration method for dynamic behavior analysis of multithreaded programs that speeds up program behavior analysis.

A parallel acceleration method for dynamic analysis of program behavior, comprising the following steps:

(1) capturing the analyzed program;

(2) generating fragments of each thread of the analyzed program;

(3) instrumenting each fragment with analysis code;

(4) assigning each instrumented fragment to a specific processor core, where it executes concurrently with the analyzed program;

(5) performing reduction on the execution results of the fragments that have finished executing, and obtaining program behavior information from the reduction results.

Step (2) specifically comprises:

(21) sampling each thread of the analyzed program separately and sending the corresponding sampling signal to each thread;

(22) after a thread receives a sampling signal, determining whether the CPU load is less than the predetermined load threshold; if so, proceeding to step (23); otherwise, proceeding to step (24);

(23) when the number of fragments currently executing in the system is less than the predetermined fragment threshold, directly generating a fragment of the thread that received the sampling signal; otherwise, ending the fragments that have timed out or are blocked and then generating a fragment of the thread that received the sampling signal; proceeding to step (25);

(24) when the number of fragments currently executing in the system is less than the predetermined fragment threshold, proceeding to step (25); otherwise, ending the fragments that have timed out or are blocked and proceeding to step (25);

(25) returning to step (22) until the analyzed program ends.

The predetermined fragment threshold is greater than or equal to 1.

The predetermined fragment threshold is the number of processor cores minus the number of threads of the analyzed program.
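The decision made in steps (22) to (24) can be sketched as a small pure function. This is only an illustrative sketch, not the patented implementation: the names `Action`, `Thresholds`, and `decide` are hypothetical, and the caller is assumed to supply the current CPU load and the number of running fragments.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Hypothetical labels for the four outcomes of steps (23) and (24)."""
    CREATE = auto()            # generate a fragment directly
    REAP_THEN_CREATE = auto()  # end timed-out/blocked fragments, then generate one
    IGNORE = auto()            # take no action for this sampling signal
    REAP_ONLY = auto()         # only end timed-out/blocked fragments


@dataclass
class Thresholds:
    load: float = 1.0   # predetermined load threshold
    fragments: int = 1  # predetermined fragment threshold (>= 1)


def decide(cpu_load: float, running_fragments: int, t: Thresholds) -> Action:
    """Map the checks of steps (22)-(24) onto one of four actions."""
    below_limit = running_fragments < t.fragments
    if cpu_load < t.load:
        # Step (23): the CPU has spare capacity, so a fragment is always generated.
        return Action.CREATE if below_limit else Action.REAP_THEN_CREATE
    # Step (24): the CPU is loaded, so no new fragment is generated.
    return Action.IGNORE if below_limit else Action.REAP_ONLY
```

Keeping the decision free of side effects makes it easy to check each branch separately from the code that actually creates or ends fragments.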

The technical effect of the invention is as follows. The invention generates fragments for each thread, instruments each fragment, and assigns it to a specific processor core to execute concurrently with the analyzed thread, so that the execution of the analyzed program is not affected by the instrumentation and dynamic program analysis is accelerated. Further, the invention uses two-stage sampling to drive the instrumentation and parallelization work of the whole system. In the first sampling stage, when the sampling condition occurs, the checkpoint code of the method is inserted. In the second sampling stage, the checkpoint code tests the second-stage sampling condition; when that condition is satisfied, a fragment is created and analysis code is inserted into it. The fragment and the analyzed program execute in parallel in a multicore environment, which decouples the analysis code from the analyzed program and parallelizes them. Instrumentation is applied to fragments generated according to the sampling signal and the CPU load rather than to the original thread. In essence, the invention uses the idle resources of a multicore processor to split the dynamically instrumented program into fragments that execute in parallel, thereby accelerating dynamic analysis of program behavior while balancing data completeness and accuracy against profiling overhead, which makes complex analysis of long-running programs feasible.

The invention has the following characteristics and advantages: (1) it has good scalability and speedup, accurate data sampling and data analysis capabilities, and can present evaluation results intuitively; (2) it balances data completeness and accuracy against profiling overhead, making complex analysis of long-running programs feasible; (3) it provides an easy-to-use programming API, through which the user obtains parallel acceleration and hardware-counter-based sampling; (4) it allows users to define their own data processing and analysis functions, extending the parallel acceleration framework.

Description of drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is an architecture diagram of the invention;

Fig. 3 is a schematic diagram of the operational process of the invention.

Embodiment

The invention is described in detail below with reference to the accompanying drawings and an example.

As shown in Fig. 1, the steps of the method are:

(1) Start.

(2) After the system starts, when the analyzed program is about to execute, the monitor captures the event that the analyzed program begins execution, loads the client (that is, the code to be inserted) and itself into the address space of the analyzed program, and initializes CPU-related information (for example the number of cores), the fragment scheduling policy, and the error handling policy.

(3) The sampling policy is initialized, the sampling event and the relevant thresholds are set, the sampling handler function is registered, and sampling is started. The CPU load is expressed as the ratio of the number of currently executing threads to the number of CPU cores; the user may specify the load threshold as a value greater than or equal to 1, and the default load threshold is 1. The fragment threshold may be specified by the user and defaults to the number of CPU cores minus the number of threads of the analyzed program.

(4) On receiving a first-stage sampling signal, the monitor instruments the application program, inserting the checkpoint code.

(5) When the checkpoint code is executed, it performs the second-stage sampling. The monitor decides according to the CPU load and the number of fragments currently executing in the system:

If the CPU load is less than the predetermined load threshold: when the number of fragments currently executing in the system is less than the predetermined fragment threshold, a fragment is generated and executed with instrumentation; when the number of fragments currently executing in the system is greater than or equal to the predetermined fragment threshold, the fragments that have timed out or have remained blocked are ended first, and then a new fragment is generated.

If the CPU load is greater than or equal to the predetermined load threshold: when the number of fragments currently executing in the system is less than the predetermined fragment threshold, this sampling signal is ignored; when the number of fragments currently executing in the system is greater than or equal to the predetermined fragment threshold, the fragments that have timed out or have remained blocked are ended.

(6) After a fragment is generated, the scheduler schedules and assigns it according to the utilization of each CPU core and the configured scheduling policy.

(7) After being dispatched to the specified core, the fragment executes concurrently with the analyzed thread. When the fragment's execution time is up, the fragment ends and its data is reduced by the user-defined merge function. If an error occurs during fragment execution, the fragment is re-executed or discarded according to the configured policy.

(8) Steps (2)-(7) are repeated until the analyzed program ends.

(9) When the analyzed program ends and the monitor captures the corresponding event, the corresponding resources are released and the dynamic analysis process is complete.
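The execute-and-reduce behaviour of step (7) can be illustrated with a minimal sketch. It assumes each fragment is modelled as a plain Python callable, and the names `run_fragment`, `run_and_reduce`, and `max_retries` are hypothetical. Python worker threads stand in for fragments only to show the control flow; they do not pin work to a specific processor core.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, Iterable, Optional, TypeVar

R = TypeVar("R")


def run_fragment(work: Callable[[], R], max_retries: int) -> Optional[R]:
    """Run one fragment; re-execute it on error, discard it after max_retries."""
    for _ in range(max_retries + 1):
        try:
            return work()
        except Exception:
            continue  # assumed policy: re-execute the fragment
    return None       # assumed policy: discard the fragment


def run_and_reduce(fragments: Iterable[Callable[[], R]],
                   merge: Callable[[R, R], R],
                   initial: R,
                   workers: int = 2,
                   max_retries: int = 1) -> R:
    """Execute fragments concurrently, then fold their results with `merge`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda f: run_fragment(f, max_retries), fragments))
    kept = [r for r in results if r is not None]  # drop discarded fragments
    return reduce(merge, kept, initial)
```

Here `merge` plays the role of the user-defined merge function, so the same driver could serve different analyses by swapping in a different reduction.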

Example:

The operation of the invention is described in detail below by way of an example, with reference to Fig. 3:

(1) Resource configuration and analyzed program information

An analyzed program with two threads runs on a four-core Linux system. The instrumentation code is a basic block counting function that counts the basic blocks the program executes.

(2) Initialization

When the analyzed program is about to execute, the monitor loads the instrumentation code and itself into the address space of the analyzed program and performs initialization. In this example, the following is initialized: the sampling policy uses the PAPI_TOT_INS hardware counter event with a threshold of 5M instructions per sample; CPU load = number of currently executing threads / number of CPU cores, with the load threshold set to 1; the fragment threshold is set to the number of CPU cores minus the number of threads of the analyzed program.
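The initialization values of this example can be worked through in a few lines. The sketch below is an assumption-laden illustration: `ProfilingConfig` and its field names are hypothetical, and the event name is stored as a plain string without calling any PAPI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfilingConfig:
    """Hypothetical container for the example's initialization values."""
    num_cores: int
    analyzed_threads: int
    sample_event: str = "PAPI_TOT_INS"  # hardware counter event named in the example
    sample_period: int = 5_000_000      # one sample every 5M instructions
    load_threshold: float = 1.0

    @property
    def fragment_threshold(self) -> int:
        # Number of CPU cores minus the number of threads of the analyzed program.
        return self.num_cores - self.analyzed_threads

    def cpu_load(self, running_threads: int) -> float:
        # CPU load = currently executing threads / CPU cores.
        return running_threads / self.num_cores


# Four-core system, analyzed program with two threads.
cfg = ProfilingConfig(num_cores=4, analyzed_threads=2)
```

With these values the fragment threshold is 4 - 2 = 2, and with only the two analyzed threads running the CPU load is 2 / 4 = 0.5, which is below the load threshold of 1, so fragments may be generated.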

(3) Fragment generation and execution

On receiving a first-stage sampling signal, the monitor instruments the application program, inserting the checkpoint code of the method. When the checkpoint code is executed, it performs the second-stage sampling, and the monitor decides, according to the CPU load and the current number of fragments, whether to generate a fragment and carries out the corresponding follow-up operations.

(4) Fragment scheduling

In this example, the fragment scheduling policy is initialized to the default scheduling policy of the operating system; that is, after a fragment is generated it is scheduled by the operating system.

(5) Parallel execution and instrumentation

The fragment and the analyzed thread execute in parallel, and the basic block counting code is inserted during execution.

(6) Result analysis and output

In this example, the data of each fragment is merged and summarized directly, and the analysis and statistics are output.
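For basic block counting, the direct merge can be as simple as summing per-fragment counts. The following is a minimal sketch under the assumption that each fragment reports a mapping from basic block address to execution count; the function name `merge_block_counts` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def merge_block_counts(per_fragment: Iterable[Mapping[int, int]]) -> Counter:
    """Sum basic block execution counts across fragments, keyed by block address."""
    total: Counter = Counter()
    for counts in per_fragment:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total
```

Because addition is associative and commutative, the fragments can be merged in whatever order they finish.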

The above is only the preferred embodiment of the invention, and implementation of the invention is not limited to it; any change that does not depart from the spirit of the invention shall fall within the scope of the invention.

Claims

1. A parallel acceleration method for dynamic analysis of program behavior, comprising the following steps:

(1) capturing the analyzed program;

(2) generating fragments of each thread of the analyzed program;

(3) instrumenting each fragment with analysis code;

2. The parallel acceleration method according to claim 1, characterized in that step (2) specifically comprises:

(25) returning to step (22) until the analyzed program ends.

3. The parallel acceleration method according to claim 2, characterized in that the predetermined fragment threshold is greater than or equal to 1.

4. The parallel acceleration method according to claim 2, characterized in that the predetermined fragment threshold is the number of processor cores minus the number of threads of the analyzed program.