CN102110052A - Parallel acceleration method for dynamic analysis of program behavior

Info

Publication number
CN102110052A
Authority
CN
China
Prior art keywords
burst
program
analyzed
thread
execution
Legal status
Pending
Application number
CN2011100509272A
Other languages
Chinese (zh)
Inventor
金海
张伟富
喻之斌
涂旭平
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a parallel acceleration method for dynamic analysis (profiling) of program behavior. The method comprises the following steps: capturing the analyzed program and generating fragments of each of its threads on the basis of resource and load states; instrumenting each fragment with analysis code; assigning the instrumented fragments to specific processor cores, where they execute concurrently with the analyzed threads; performing reduction on the execution results of the fragments once they have finished; and obtaining program behavior information from the reduction results. Because idle computing resources are used to collect the program's behavior information, the dynamic analysis process is accelerated.

Description

A parallel acceleration method for dynamic profiling of program behavior
Technical field
The invention belongs to the field of dynamic profiling of program behavior. It specifically relates to a parallel acceleration method for dynamic profiling of the behavior of multithreaded programs (mainly programs written for shared-memory multithreaded programming models, for example OpenMP and Intel TBB). It is applicable to dynamic program profiling, dynamic instrumentation, and research on parallel acceleration using multiple cores.
Background
Program behavior refers to the set of characteristics a program exhibits while executing on a CPU, for example cache miss rate, branch prediction information, memory usage and running time, and synchronization and communication among threads. Understanding the dynamic behavior of a program plays an important role in architecture design, compiler optimization, and the analysis of program performance and bottlenecks.
As processor frequencies keep rising and manufacturing process sizes keep shrinking, processors run into power and heat-dissipation bottlenecks, so performance can no longer be improved simply by raising the clock frequency. Processors have begun to evolve toward multicore and multithreaded designs, and chip multiprocessors (Chip Multiprocessor) and simultaneous multithreading (Simultaneous Multi-Threading) are gradually becoming the mainstream computing platforms. As the number of hardware-supported cores and threads grows, multithreaded programming is gradually becoming the mainstream programming model. However, multithreaded programming is inherently complex (task decomposition, deadlock, races, and the nondeterminism of multithreaded execution, for example), so designing and writing efficient multithreaded programs is a formidable task. This creates an urgent need for methods that analyze multithreaded programs efficiently.
On the other hand, as software engineering ideas and software design techniques develop, more and more levels of indirection are introduced to improve design flexibility and the dynamic configurability of systems, for example virtual machines, middleware, and programming models such as OpenMP. These levels of indirection simplify application development, but they also introduce abstraction and complexity and therefore bring a certain overhead. Software developers need more efficient, more transparent, and easier-to-use debugging and analysis tools to understand the dynamic behavior of software and to find its defects and performance bottlenecks, so as to guarantee software reliability and shorten development and testing.
In addition, understanding the dynamic behavior of programs also plays an important role in architecture design and compiler optimization. Architects often use dynamic program behavior information to optimize architecture design and system simulation and emulation; compiler designers can use run-time information to optimize code generation and automatic parallelization.
At present there are two main approaches to program profiling: one is based on hardware performance counters (Hardware Performance Counter); the other is software-based dynamic instrumentation (Dynamic Instrumentation). Profiling based on hardware counters usually has low overhead, but the data it collects is coarse. Instrumentation-based analysis can usually collect detailed and accurate data and is flexible and highly customizable; tools for cache simulation, robustness analysis, memory leak detection and the like can be built on dynamic instrumentation. However, the overhead it introduces is high. With mainstream dynamic instrumentation tools such as Pin, DynamoRIO, and Valgrind, even simple basic-block counting makes the instrumented program take more than 2.5 times as long as the uninstrumented program. For more complex analyses the slowdown is larger; for example, when simulating a cache, Valgrind is usually more than 100 times slower than native execution. In general, a good profiler has two characteristics: first, it collects as much accurate data as possible; second, it has low overhead. In practice these two goals conflict: collecting complete and accurate performance data usually means high overhead, and vice versa.
The invention addresses the conflict between the accuracy of the data a profiler collects and its profiling overhead, as well as the conflict between multithreaded program developers' urgent need for efficient dynamic analysis of program behavior and the high cost of dynamic instrumentation. It proposes using the idle resources of a multicore processor to partition the dynamically instrumented program and execute the parts in parallel, thereby accelerating dynamic profiling of program behavior. At the same time, it balances data completeness and accuracy against profiling overhead, making complex analysis of long-running programs feasible.
Summary of the invention
The purpose of the invention is to provide a parallel acceleration method for dynamic profiling of multithreaded program behavior that speeds up program behavior analysis.
A parallel acceleration method for dynamic profiling of program behavior comprises the following steps:
(1) capture the analyzed program;
(2) generate fragments of each thread of the analyzed program;
(3) instrument each fragment with analysis code;
(4) assign each instrumented fragment to a designated processor core, where it executes concurrently with the analyzed program;
(5) perform reduction on the execution results of the fragments that have finished, and obtain program behavior information from the reduction results.
Step (2) is specifically:
(21) sample each thread of the analyzed program separately, and send the corresponding sampling signal to each thread;
(22) after a thread receives a sampling signal, judge whether the CPU load is less than the predetermined load threshold; if so, go to step (23); otherwise, go to step (24);
(23) if the number of fragments currently executing in the system is less than the predetermined fragment threshold, directly generate a fragment of the thread that received the sampling signal; otherwise, terminate the fragments that have timed out or are blocked, and then generate a fragment of the thread that received the sampling signal; go to step (25);
(24) if the number of fragments currently executing in the system is less than the predetermined fragment threshold, go to step (25); otherwise, terminate the fragments that have timed out or are in a blocked state, and go to step (25);
(25) return to step (22), until the analyzed program ends.
The predetermined fragment threshold is greater than or equal to 1.
The predetermined fragment threshold is the number of processor cores minus the number of threads of the analyzed program.
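The decision made in steps (22) to (24) can be sketched as a small pure function. This is only an illustrative sketch: the names `Action`, `decide`, and `default_fragment_threshold` are hypothetical and do not come from the patent, and "stale" is used here as shorthand for fragments that have timed out or are blocked.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the four outcomes described in steps (23) and (24).
    CREATE = "create a fragment"
    REAP_THEN_CREATE = "end stale fragments, then create a fragment"
    IGNORE = "do nothing for this sampling signal"
    REAP_ONLY = "end stale fragments"


def default_fragment_threshold(num_cores: int, num_threads: int) -> int:
    # Default from the text: processor cores minus threads of the analyzed program.
    return num_cores - num_threads


def decide(cpu_load: float, load_threshold: float,
           running_fragments: int, fragment_threshold: int) -> Action:
    """Choose what to do when a thread receives a sampling signal."""
    if cpu_load < load_threshold:                    # step (22) leads to step (23)
        if running_fragments < fragment_threshold:
            return Action.CREATE
        return Action.REAP_THEN_CREATE
    if running_fragments < fragment_threshold:       # step (22) leads to step (24)
        return Action.IGNORE
    return Action.REAP_ONLY
```

Keeping the decision free of side effects makes it easy to check every branch separately from whatever mechanism actually creates or ends fragments.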
The technical effect of the invention is as follows. The invention generates fragments for each thread, instruments each fragment, and assigns it to a designated processor core where it executes concurrently with the analyzed thread, so the execution of the analyzed program is not affected by instrumentation and dynamic profiling is accelerated. Further, the invention uses two-stage sampling to drive the instrumentation and parallelization work of the whole system. In the first sampling stage, when the sampling condition occurs, the check code of the method is inserted. In the second sampling stage, the check code tests the second-stage sampling condition and, when that condition is satisfied, creates a fragment and inserts analysis code into it. Fragments and the analyzed program execute in parallel in a multicore environment, which decouples the analysis code from the analyzed program and parallelizes the two. Instrumentation is applied to fragments generated according to the sampling signal and the CPU load rather than to the original threads. In essence, the method uses the idle resources of a multicore processor to partition the dynamically instrumented program and execute the parts in parallel, thereby accelerating dynamic profiling of program behavior while balancing data completeness and accuracy against profiling overhead, which makes complex analysis of long-running programs feasible.
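The two-stage sampling described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the patented implementation: the class `TwoStageSampler` and its method names are hypothetical, the first-stage condition is modeled as an event counter reaching a period, and the second-stage condition is passed in as a callable.

```python
from typing import Callable


class TwoStageSampler:
    """Toy model of two-stage sampling; every name here is hypothetical."""

    def __init__(self, period: int, second_stage_ok: Callable[[], bool]):
        self.period = period                    # first-stage sampling period, in events
        self.second_stage_ok = second_stage_ok  # second-stage sampling condition
        self.count = 0                          # events seen since the last first-stage sample
        self.checkpoint_armed = False           # True once check code has been "inserted"
        self.fragments_created = 0

    def on_events(self, n: int) -> None:
        """Stage 1: when the event count reaches the period, insert the check code."""
        self.count += n
        if self.count >= self.period:
            self.count %= self.period
            self.checkpoint_armed = True

    def on_checkpoint(self) -> bool:
        """Stage 2: the check code tests its condition and may create a fragment."""
        if not self.checkpoint_armed:
            return False
        self.checkpoint_armed = False
        if self.second_stage_ok():
            # Stands in for creating a fragment and inserting analysis code into it.
            self.fragments_created += 1
            return True
        return False
```

The point of the split is that the cheap first stage runs all the time, while the more expensive decision and fragment creation happen only at the inserted checkpoint.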
The invention has the following characteristics and advantages: (1) it has good scalability and speedup, accurate data sampling and data analysis capabilities, and can present evaluation results intuitively; (2) it balances data completeness and accuracy against profiling overhead, making complex analysis of long-running programs feasible; (3) it provides an easy-to-use programming API, through which the user obtains parallel acceleration and hardware-counter-based sampling; (4) it allows users to define their own data processing and analysis functions to extend the parallel acceleration framework.
Description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 is an architecture diagram of the invention;
Fig. 3 is a schematic diagram of the operation of the invention.
Embodiment
The invention is described in detail below with reference to the drawings and an example.
As shown in Fig. 1, the steps of the method are:
(1) Start.
(2) After the system starts, when the analyzed program is about to execute, the monitor captures the event that the analyzed program begins execution, loads the client (i.e., the code to be inserted) and itself into the address space of the analyzed program, and initializes the CPU-related information (for example the number of cores), fragment scheduling, and the error-handling policy.
(3) Initialize the sampling policy, set the sampling event and the related thresholds, register the sampling handler function, and start sampling. The CPU load is expressed as the ratio of the number of currently executing threads to the number of CPU cores; the user can specify the load threshold as a value greater than or equal to 1, and the default load threshold is 1. The fragment threshold can be specified by the user; by default it is the number of CPU cores minus the number of threads of the analyzed program.
(4) On receiving the first-stage sampling signal, the monitor instruments the application, inserting the checkpoint code.
(5) When the checkpoint code is executed, it performs the second-stage sampling. The monitor decides according to the CPU load and the number of fragments currently executing in the system:
If the CPU load is less than the predetermined load threshold: when the number of fragments currently executing is less than the predetermined fragment threshold, generate a fragment and execute it with instrumentation; when the number of fragments currently executing is greater than or equal to the predetermined fragment threshold, first terminate the fragments that have timed out or remain blocked, and then generate a new fragment;
If the CPU load is greater than or equal to the predetermined load threshold: when the number of fragments currently executing is less than the predetermined fragment threshold, ignore this sampling signal; when the number of fragments currently executing is greater than or equal to the predetermined fragment threshold, terminate the fragments that have timed out or remain blocked.
(6) After a fragment is generated, the scheduler dispatches it according to the utilization of each CPU core and the configured scheduling policy.
(7) After the fragment is dispatched to the designated core, it executes concurrently with the analyzed thread. When the fragment finishes executing, it ends, and its data is reduced by the user-defined merge function. If an error occurs while a fragment is executing, the fragment is re-executed or discarded according to the configured policy.
(8) Repeat steps (2)-(7) until the analyzed program ends.
(9) When the analyzed program ends, the monitor captures the program-end event, releases the corresponding resources, and completes the dynamic profiling process.
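The load definition from step (3) and the fragment handling from step (7) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `cpu_load`, `run_fragment`, and `profile`, the policy strings `"retry"` and `"discard"`, and the retry count are all hypothetical, and Python worker threads merely stand in for fragments running on separate processor cores.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def cpu_load(running_threads: int, num_cores: int) -> float:
    # Load as defined in step (3): executing threads divided by CPU cores.
    return running_threads / num_cores


def run_fragment(work: Callable[[], T], policy: str = "retry",
                 max_retries: int = 1) -> Optional[T]:
    """Run one fragment; on error, re-execute it or discard it (step 7)."""
    attempts = 1 + (max_retries if policy == "retry" else 0)
    for _ in range(attempts):
        try:
            return work()
        except Exception:
            continue
    return None  # the fragment is discarded


def profile(fragments: Iterable[Callable[[], T]], merge: Callable[[T, T], T],
            initial: T, workers: int = 2, policy: str = "retry") -> T:
    """Run fragments concurrently, then fold the surviving results with `merge`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda f: run_fragment(f, policy), fragments))
    # A result of None marks a discarded fragment and is skipped in the reduction.
    return reduce(merge, (r for r in results if r is not None), initial)
```

Passing `merge` as a parameter mirrors the text's user-defined merge function: the framework only needs to know how to combine two partial results, not what they contain.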
Example:
The operation of the invention is described in detail below with an example, with reference to Fig. 3:
(1) Resource configuration and analyzed program information
An analyzed program with two threads runs on a four-core Linux system. The instrumentation code is a basic-block counting function, which counts the basic blocks the program executes.
(2) Initialization
When the analyzed program is about to execute, the monitor loads the instrumentation code and itself into the address space of the analyzed program and performs initialization. In this example, the following information is initialized: the sampling policy samples by instruction count, using the PAPI_TOT_INS hardware counter event with a threshold of 5M instructions; CPU load = number of currently executing threads / number of CPU cores, with the load threshold set to 1; the fragment threshold is set to the number of CPU cores minus the number of threads of the analyzed program.
(3) Fragment generation and execution
On receiving the first-stage sampling signal, the monitor instruments the application, inserting the checkpoint code of the method. When the checkpoint code is executed, it performs the second-stage sampling: the monitor decides, according to the CPU load and the current number of fragments, whether to generate a fragment, and carries out the corresponding follow-up operations.
(4) Fragment scheduling
In this example, the fragment scheduling policy is initialized to the default scheduling policy of the operating system; that is, after a fragment is generated, the operating system schedules it.
(5) Parallel execution and instrumentation
Fragments execute in parallel with the analyzed threads, and the basic-block counting code is inserted during execution.
(6) Result statistics, analysis, and output
In this example, the data of each fragment is merged and summarized directly, and the analysis and statistics are output.
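The numbers in this example, and the final merge of per-fragment basic-block counts, can be worked through in a short sketch. The traces and the basic-block identifiers (`bb0`, `bb1`, `bb2`) are made up for illustration, the function names are hypothetical, and the load figure assumes that only the two analyzed threads are running.

```python
from collections import Counter

NUM_CORES = 4      # from the example: a four-core Linux system
NUM_THREADS = 2    # from the example: an analyzed program with two threads

load = NUM_THREADS / NUM_CORES                # 0.5, below the load threshold of 1
fragment_threshold = NUM_CORES - NUM_THREADS  # at most 2 fragments at a time


def count_basic_blocks(trace):
    """Count how often each basic-block id appears in one fragment's trace."""
    return Counter(trace)


def merge_counts(per_fragment):
    """Reduction step: sum the per-fragment counters into one total."""
    total = Counter()
    for counts in per_fragment:
        total.update(counts)
    return total


# Hypothetical traces: each list holds the basic-block ids one fragment executed.
traces = [["bb0", "bb1", "bb1"], ["bb1", "bb2"]]
totals = merge_counts(count_basic_blocks(t) for t in traces)
print(totals["bb1"], sum(totals.values()))    # prints: 3 5
```

Because addition of counts is associative and commutative, the fragments can finish in any order without changing the merged totals.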
The above is only the preferred embodiment of the invention, and the implementation of the invention is not limited to it. Any change made in the field of the invention without departing from its spirit shall fall within the scope of the invention.

Claims (4)

1. A parallel acceleration method for dynamic profiling of program behavior, comprising the steps of:
(1) capturing the analyzed program;
(2) generating fragments of each thread of the analyzed program;
(3) instrumenting each fragment with analysis code;
(4) assigning each instrumented fragment to a designated processor core, where it executes concurrently with the analyzed program;
(5) performing reduction on the execution results of the fragments that have finished, and obtaining program behavior information from the reduction results.
2. The parallel acceleration method according to claim 1, characterized in that step (2) is specifically:
(21) sampling each thread of the analyzed program separately, and sending the corresponding sampling signal to each thread;
(22) after a thread receives a sampling signal, judging whether the CPU load is less than the predetermined load threshold; if so, going to step (23); otherwise, going to step (24);
(23) if the number of fragments currently executing in the system is less than the predetermined fragment threshold, directly generating a fragment of the thread that received the sampling signal; otherwise, terminating the fragments that have timed out or are blocked, and then generating a fragment of the thread that received the sampling signal; going to step (25);
(24) if the number of fragments currently executing in the system is less than the predetermined fragment threshold, going to step (25); otherwise, terminating the fragments that have timed out or are in a blocked state, and going to step (25);
(25) returning to step (22), until the analyzed program ends.
3. The parallel acceleration method according to claim 2, characterized in that the predetermined fragment threshold is greater than or equal to 1.
4. The parallel acceleration method according to claim 2, characterized in that the predetermined fragment threshold is the number of processor cores minus the number of threads of the analyzed program.
Priority Applications (1)

Application number: CN2011100509272A
Priority date: 2011-03-03
Filing date: 2011-03-03
Title: Parallel acceleration method for dynamic analysis of program behavior

Publications (1)

Publication number: CN102110052A
Publication date: 2011-06-29

Family ID: 44174218

Country Status (1)

Country: CN
Link: CN102110052A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110629