CN105243023B - Parallel Runtime error checking method - Google Patents
- Publication number
- CN105243023B
- Authority
- CN
- China
- Prior art keywords
- counter
- mpi
- message queue
- dump
- unfinished
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a parallel runtime error checking method, comprising: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is blocked inside an MPI call, triggering a software interrupt signal when the timer expires, so that control enters an interrupt handler, which compares the current values of the first counter and the second counter. If the two values are unequal, a state dump is performed, followed by deadlock detection; if they are equal, the handler returns and the parallel program continues executing.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a parallel runtime error checking method.
Background art
In the field of HPC (High Performance Computing), MPI (Message Passing Interface) is currently the most widely used and most flexible parallel programming model and the mainstream message-passing standard, and it has several open-source implementations. Because MPI does not parallelize automatically, programmers usually develop parallel computations by increasing the parallel scale step by step, from small to large. When a program runs at small scale, print statements or a debugging tool can be used to detect and locate parallel runtime errors. But when facing hundreds or thousands of processes and tens of thousands of lines of code, the massive print output not only significantly disturbs the parallel execution of the program but also creates a huge workload for analyzing the output. Existing debugging tools for large-scale parallel programs are all oriented to processes and program statements: the user must choose which of thousands of processes to debug and which program statements to debug. As a result, debugging large-scale parallel programs today relies almost entirely on the programmer's experience and manual analysis.
On the other hand, a large-scale parallel program may run for a very long time. Whether it is running normally, or whether a communication deadlock has occurred because of a program error or a system failure such as a network fault, is of particular concern to programmers, because it affects the input-output ratio (cost-effectiveness) of a computing center. What is needed, therefore, is a lightweight apparatus that detects communication deadlock promptly and automatically analyzes the interaction state of a large number of parallel processes, greatly reducing manual work.
STAT is a monitoring tool for running parallel programs. It collects the function call stack relationships of all parallel processes, aggregates the information over MRNET (a dynamically established hierarchical communication network), performs cluster analysis, and finally displays the running state of the large-scale parallel program visually. Because the tool does not significantly disturb parallel interaction and its monitoring information is simple, it can be applied to fairly large parallel runs. However, it cannot detect any runtime error by itself: a person must compare several monitoring results to discover whether some processes behave abnormally, and then use a conventional debugger on those processes to confirm and analyze the program error.
MUST is an MPI runtime error checking tool. Using library-level interposition, it inserts error analysis at every MPI library call in the program, collects all MPI-based parallel interaction information, and aggregates it over MRNET to a central point. There it builds a directed dependency graph of message passing, detects circular dependencies in that graph, and diagnoses communication deadlock. Because of the communication and storage overhead of information collection and the bottleneck effect of the central server, the parallel scale to which the technique applies is limited to within 100 processes.
The open-source implementations of the MPI runtime library have provided two debugging techniques in succession. Early parallel programs were small, and the debugging approach was to collect and display every interaction event between processes for the programmer to analyze manually. As parallel scale grew, an access interface to the message queues of the abstract device layer (ADI) was provided. Through this interface library one can obtain, at a given instant, each process's unfinished send message queue (pending send), unfinished receive message queue (pending receive) and unexpected receive message queue (unexpected receive, i.e. messages that have been received but not yet claimed by the local program). TotalView (a parallel debugging tool that supports multiple architectures and platforms and provides conventional process- and thread-oriented debugger functions for MPI, OpenMP, OpenACC and other kinds of parallel programs) first uses the inter-process debugging interface to stop the activity of a set of processes, then uses the queue access interface to enter the MPI process space and capture the message queues, and displays the messages that the process set has not completed at that instant. It provides no deadlock detection beyond this, so a person must judge whether a communication deadlock has occurred. For a parallel program with thousands of processes, stopping all process activity may endanger the stability of the debugging tool itself, so TotalView displays the messages of only some of the processes, i.e. a subset.
Summary of the invention
The technical problem to be solved by the invention is, in view of the above defects in the prior art, to provide a parallel runtime error checking method that targets the detection of MPI communication deadlock. It addresses two defects of the prior art: large runtime overhead, and the lack of in-depth analysis of runtime errors in parallel interaction. The method has almost no effect on the performance of the parallel program when it is not activated, can be implemented on the open interface of the MPI runtime library, is highly portable, and is fully transparent to the user program.
According to the invention, a parallel runtime error checking method is provided for locating errors that occur while an MPI parallel program is running.
The parallel runtime error checking method includes: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is blocked inside an MPI call, triggering a software interrupt signal when the timer expires, so that control enters an interrupt handler, which compares the current values of the first counter and the second counter. If the two values are unequal, a state dump is performed, followed by deadlock detection; if they are equal, the handler returns and the parallel program continues executing.
Preferably, when performing the state dump, each MPI process that has entered an interaction steady state writes its current process state, the address trace of its function calls, its unfinished send message queue, its unfinished receive message queue and its unexpected receive message queue to a dump file; after the dump completes, it returns to the parallel program and continues executing.
Preferably, when performing deadlock detection, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set (communicator) have entered an MPI operation, but they are not executing the same global (collective) operation;
Second condition: the unfinished receive message queues form a circular dependency.
Preferably, when the first condition is satisfied, the process states with the lowest similarity are output; when the second condition is satisfied, then, along the circular dependency, the unfinished send message queue and the unexpected receive message queue of each of a predetermined number of lowest-similarity processes are queried, to look for messages that match the unfinished receive message queue.
The invention uses steady-state detection and automatically pushes key information to external files, with error detection and analysis completed by an external program; this reduces the disturbance that error detection causes to the parallel program itself. Communication deadlock is detected and analyzed on the basis of message queues, which avoids the large overhead of collecting and recording MPI communication semantics. Moreover, the invention is transparent to application programs.
The invention addresses the problem that communication deadlock in large-scale parallel programs lowers the utilization of computing resources: once entry into a steady state is detected, key information is automatically pushed to external files and an external program completes error detection and analysis. Because communication deadlock is detected and analyzed from message queues, the whole course of parallel interaction need not be traced and analyzed. This solves the problem that collecting and recording MPI communication semantics significantly degrades parallel program performance, and gives the method the scalability and practicality needed for debugging large-scale parallel programs. In addition, the technique is implemented at the runtime library level and is fully transparent to the user program.
Brief description of the drawings
A more complete understanding of the invention, and of its attendant advantages and features, will be gained more easily by reference to the following detailed description taken together with the accompanying drawing, in which:
Fig. 1 schematically shows a flowchart of the parallel runtime error checking method according to a preferred embodiment of the invention.
It should be noted that the drawing is intended to illustrate the invention, not to limit it. Note that drawings representing structure may not be drawn to scale. Also, in the drawings, the same or similar elements are given the same or similar reference labels.
Detailed description
To make the content of the invention clearer and easier to understand, the invention is described in detail below with reference to specific embodiments and the drawing.
Fig. 1 schematically shows a flowchart of the parallel runtime error checking method according to a preferred embodiment of the invention.
Specifically, for example, the parallel runtime error checking method according to the preferred embodiment of the invention is used to locate errors that occur while an MPI parallel program is running.
As shown in Fig. 1, the parallel runtime error checking method according to a preferred embodiment of the invention includes:
First step S1: perform steady-state detection;
When performing steady-state detection, for example, a first counter A and a second counter B, each with an initial value of 0, are set in the MPI function MPI_INIT (which initializes the MPI execution environment and establishes the connections among the MPI processes in preparation for subsequent communication). When the process enters an MPI blocking operation, the first counter A is incremented by one and a timer (alarm) is started; when the process returns from the blocking operation, the value of the first counter A is assigned to the second counter B and the timer is cancelled. If the process remains blocked in an MPI call, the timer triggers a software interrupt signal when it expires, and control enters an interrupt handler. The handler compares the current values of the first counter A and the second counter B. If they are unequal, the process is considered to be in an interaction steady state (further detection is needed to determine whether a communication deadlock has occurred), so deadlock detection (third step S3) is performed after the state dump (second step S2). If the two counters are equal, the handler returns and the parallel program continues executing.
Here, interaction steady state means the following: if, from some instant on, a process no longer interacts with other processes, the process is considered to have entered a steady state of parallel interaction.
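The counter logic of step S1 can be sketched in a few lines. This is a minimal illustration only, not the patented implementation: the class name `SteadyStateDetector` and its method names are hypothetical, and the timer is represented by a flag rather than by a real alarm signal, so that the decision made inside the interrupt handler can be seen in isolation.

```python
class SteadyStateDetector:
    """Tracks the two counters of step S1 for a single process (sketch)."""

    def __init__(self):
        self.counter_a = 0        # first counter A: blocking operations entered
        self.counter_b = 0        # second counter B: value of A at the last return
        self.timer_armed = False  # stand-in for the alarm timer (assumption)

    def enter_blocking(self):
        """Called just before a blocking operation starts."""
        self.counter_a += 1
        self.timer_armed = True   # a real wrapper would start the alarm here

    def leave_blocking(self):
        """Called right after the blocking operation returns."""
        self.counter_b = self.counter_a
        self.timer_armed = False  # a real wrapper would cancel the alarm here

    def on_timer_expired(self):
        """Body of the interrupt handler: compare A and B."""
        if self.counter_a != self.counter_b:
            return "dump_then_detect"  # steady state: go on to S2, then S3
        return "continue"              # the call has already returned


detector = SteadyStateDetector()
detector.enter_blocking()
print(detector.on_timer_expired())  # dump_then_detect (still inside the call)
detector.leave_blocking()
print(detector.on_timer_expired())  # continue (the call has returned)
```

Comparing the counters inside the handler, rather than trusting the timer alone, covers the case where the timer fires just as the blocking call returns.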
Second step S2: perform the state dump;
When performing the state dump, each MPI process that has entered the steady state writes data such as its current process state (computing or communicating), the address trace of its function calls, and its three message queues (the unfinished send message queue, the unfinished receive message queue and the unexpected receive message queue) to a dump file; after the dump completes, it returns to the parallel program and continues executing. Processes that have not entered the steady state do not perform this step.
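As an illustration of what such a dump might contain, the following sketch writes one file per process and reads all files back for later analysis. The JSON layout, the field names and the `dump.<rank>.json` file naming are assumptions made for this example; the text does not specify a dump format.

```python
import json
import os
import tempfile


def write_dump(directory, rank, state, call_trace,
               unfinished_sends, unfinished_recvs, unexpected_recvs):
    """Write one process's steady-state snapshot to its own dump file."""
    record = {
        "rank": rank,
        "state": state,                  # "compute" or "communicate"
        "call_trace": call_trace,        # addresses on the function call stack
        "unfinished_sends": unfinished_sends,
        "unfinished_recvs": unfinished_recvs,
        "unexpected_recvs": unexpected_recvs,
    }
    path = os.path.join(directory, f"dump.{rank}.json")  # hypothetical naming
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    return path


def read_dumps(directory):
    """Read every dump file in the directory, keyed by rank."""
    dumps = {}
    for name in sorted(os.listdir(directory)):
        if name.startswith("dump.") and name.endswith(".json"):
            with open(os.path.join(directory, name), encoding="utf-8") as handle:
                record = json.load(handle)
            dumps[record["rank"]] = record
    return dumps


with tempfile.TemporaryDirectory() as tmp:
    write_dump(tmp, 0, "communicate", ["0x4007f0", "0x400a10"],
               [], [{"source": 1, "tag": 7}], [])
    print(read_dumps(tmp)[0]["unfinished_recvs"])
```

Writing one file per process keeps the dump step free of inter-process communication, which matters because the processes being dumped may be deadlocked.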
Third step S3: perform deadlock detection;
When performing deadlock detection, all dump files are read in and deadlock detection is carried out. Deadlock detection relies on two conditions: (1) a global operation without global participation, i.e. all processes in the communication set have entered an MPI operation but are not executing the same global operation; (2) a circular dependency among unfinished receive message queues, i.e. the unfinished receive message queues of some set of processes form a circular dependency. If either condition holds, a communication deadlock is considered to have occurred.
Fourth step S4: perform deadlock inference.
If no deadlock condition is satisfied, return immediately. Otherwise, cluster analysis of process states is performed on the set of processes that produced dumps. If condition (1) is satisfied, the process states with the lowest similarity are output. If condition (2) is satisfied, then, along the circular dependency, the other two message queues of each low-similarity process (for example, each of a predetermined number of lowest-similarity processes) are queried for messages that match (for example, satisfy a predetermined matching condition with) the unfinished receive message queue; if any are found, the similar messages of the matching processes are output.
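One possible reading of step S4 is sketched below. The text does not define the similarity measure or the matching condition, so both are assumptions here: similarity is taken as the number of processes sharing the same call trace, and a message "matches" an unfinished receive when their tags are equal. All function and field names are hypothetical.

```python
from collections import Counter


def least_similar(dumps, count):
    """Return the `count` ranks whose call trace is shared by the fewest peers."""
    signature = {rank: tuple(rec["call_trace"]) for rank, rec in dumps.items()}
    group_size = Counter(signature.values())
    ranked = sorted(dumps, key=lambda rank: (group_size[signature[rank]], rank))
    return ranked[:count]


def same_tag(recv, message):
    """Assumed matching condition: equal message tags."""
    return recv["tag"] == message["tag"]


def matching_messages(dumps, cycle, suspects, matches=same_tag):
    """Walk the cycle and search each suspect's other two queues for matches."""
    hits = []
    for rank in cycle:
        if rank not in suspects:
            continue
        record = dumps[rank]
        candidates = record["unfinished_sends"] + record["unexpected_recvs"]
        for recv in record["unfinished_recvs"]:
            hits.extend((rank, recv, msg) for msg in candidates
                        if matches(recv, msg))
    return hits


dumps = {
    0: {"call_trace": ["0x400a10", "0x4007f0"], "unfinished_sends": [],
        "unfinished_recvs": [{"source": 1, "tag": 5}], "unexpected_recvs": []},
    1: {"call_trace": ["0x400a10", "0x4007f0"], "unfinished_sends": [],
        "unfinished_recvs": [{"source": 2, "tag": 5}], "unexpected_recvs": []},
    2: {"call_trace": ["0x400b20", "0x4007f0"],
        "unfinished_sends": [{"dest": 0, "tag": 9}],
        "unfinished_recvs": [{"source": 0, "tag": 9}],
        "unexpected_recvs": [{"source": 1, "tag": 9}]},
}
suspects = least_similar(dumps, 1)
print(suspects)
for hit in matching_messages(dumps, [0, 1, 2], suspects):
    print(hit)
```

Passing the matching condition as a parameter keeps the sketch neutral about what "predetermined matching condition" means in a given deployment.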
In the above method, the module that performs the first two steps is integrated into the MPI parallel program. It is based on the PMPI open interface of the MPI runtime library and achieves library-level interposition on the MPI runtime library at link time, so the source code of the MPI library need not be modified; integration is convenient and transparent to the user program. The apparatus runs without manual operation: it automatically reports communication deadlocks that may have occurred and provides an inferred analysis report on the root cause of the deadlock, which greatly improves debugging efficiency and the utilization of computing resources.
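The idea of interposition, wrapping a blocking call so that the caller's code is unchanged, can be illustrated by analogy. Python has no PMPI interface, so this sketch uses a decorator in place of link-time interposition and `threading.Timer` in place of the alarm signal; the class `BlockingCallMonitor` and the function `fake_recv` are hypothetical names introduced for the example.

```python
import functools
import threading


class BlockingCallMonitor:
    """Wraps blocking calls with the counter and timer logic (analogy only)."""

    def __init__(self, timeout, on_steady_state):
        self.timeout = timeout
        self.on_steady_state = on_steady_state
        self.counter_a = 0
        self.counter_b = 0

    def _timer_expired(self):
        # Plays the role of the interrupt handler.
        if self.counter_a != self.counter_b:
            self.on_steady_state(self.counter_a)

    def interpose(self, blocking_call):
        @functools.wraps(blocking_call)
        def wrapper(*args, **kwargs):
            self.counter_a += 1
            timer = threading.Timer(self.timeout, self._timer_expired)
            timer.daemon = True
            timer.start()
            try:
                return blocking_call(*args, **kwargs)
            finally:
                self.counter_b = self.counter_a
                timer.cancel()
        return wrapper


released = threading.Event()
reports = []


def report(call_number):
    reports.append(call_number)
    released.set()  # unblock the demo call; a real deadlock would stay blocked


monitor = BlockingCallMonitor(timeout=0.05, on_steady_state=report)


@monitor.interpose
def fake_recv():
    # Stand-in for a blocking receive that waits until something arrives.
    released.wait(timeout=5)
    return "message"


print(fake_recv(), reports)
```

The caller invokes `fake_recv()` exactly as before; all monitoring lives in the wrapper, which is the property the text describes as transparency to the user program.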
In summary, locating runtime errors in large-scale MPI parallel programs requires analyzing the execution state and interactions of a large number of MPI processes. Collecting, storing and centrally analyzing process-related information incurs a huge runtime overhead, which not only lowers the utilization of large-scale computing resources but also limits the scalability and practicality of existing debugging techniques. The present apparatus improves the efficiency of runtime error detection for MPI parallel programs and reduces both the interference with parallel execution and the runtime overhead.
In addition, it should be noted that, unless otherwise indicated, terms such as "first", "second" and "third" in the specification are used only to distinguish the components, elements, steps and so on in the specification, and are not intended to indicate a logical or ordinal relationship among them.
It is to be understood that, although the invention is disclosed above by way of preferred embodiments, those embodiments are not intended to limit the invention. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution, or revise it into equivalent embodiments. Therefore, any simple modification, equivalent variation or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the invention.
Claims (3)
1. A parallel runtime error checking method for locating errors that occur while an MPI parallel program is running, characterized by comprising:
setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is blocked inside an MPI call, triggering a software interrupt signal when the timer expires, so that control enters an interrupt handler, and comparing the current values of the first counter and the second counter in the interrupt handler; if the current values of the first counter and the second counter are unequal, performing a state dump and then performing deadlock detection; if the current values of the first counter and the second counter are equal, returning from the interrupt handler and continuing to execute the parallel program; wherein, when the state dump is performed, each MPI process that has entered an interaction steady state writes its current process state, the address trace of its function calls, its unfinished send message queue, its unfinished receive message queue and its unexpected receive message queue to a dump file, and after the dump completes returns to the parallel program and continues executing.
2. The parallel runtime error checking method according to claim 1, characterized in that, when deadlock detection is performed, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set (communicator) have entered an MPI operation, but they are not executing the same global (collective) operation;
Second condition: the unfinished receive message queues form a circular dependency.
3. The parallel runtime error checking method according to claim 2, characterized in that, when the first condition is satisfied, the process states with the lowest similarity are output; and when the second condition is satisfied, then, along the circular dependency, the unfinished send message queue and the unexpected receive message queue of each of a predetermined number of lowest-similarity processes are queried, to look for messages that match the unfinished receive message queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510831795.5A CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510831795.5A CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105243023A CN105243023A (en) | 2016-01-13 |
CN105243023B true CN105243023B (en) | 2017-09-26 |
Family ID: 55040676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510831795.5A Active CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243023B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301125B (en) * | 2017-06-19 | 2021-08-24 | 广州华多网络科技有限公司 | Method and device for searching root error and electronic equipment |
CN109213684B (en) * | 2018-09-18 | 2022-01-28 | 北京工业大学 | Program detection method based on OpenMP thread heartbeat detection technology and application |
CN112631816B (en) * | 2019-09-24 | 2022-11-15 | 无锡江南计算技术研究所 | Debugging log-based parallel program error positioning method |
CN111090528B (en) * | 2019-12-25 | 2023-09-26 | 北京天融信网络安全技术有限公司 | Deadlock determination method and device and electronic equipment |
CN111538599A (en) * | 2020-04-23 | 2020-08-14 | 杭州涂鸦信息技术有限公司 | LINUX-based multithreading deadlock problem positioning method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101833479A (en) * | 2010-04-16 | 2010-09-15 | 中国人民解放军国防科学技术大学 | MPI (Moldflow Plastics Insight) information scheduling method based on reinforcement learning under multi-network environment |
CN101937365A (en) * | 2009-06-30 | 2011-01-05 | 国际商业机器公司 | Deadlock detection method of parallel programs and system |
CN103365852A (en) * | 2012-03-28 | 2013-10-23 | 天津书生软件技术有限公司 | Concurrency control method and system for document library systems |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8527740B2 (en) * | 2009-11-13 | 2013-09-03 | International Business Machines Corporation | Mechanism of supporting sub-communicator collectives with O(64) counters as opposed to one counter for each sub-communicator |
TW201329863A (en) * | 2012-01-10 | 2013-07-16 | Nat Univ Tsing Hua | Deadlock-free synchronization synthesizer for must-happen-before relations in parallel programs and method thereof |
Non-Patent Citations (1)
Title |
---|
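A deadlock detection algorithm based on parallel technology; Chen Lan (陈岚); Journal of Guangxi Academy of Sciences (《广西科学院学报》); 2003-05-31; Vol. 19, No. 2; pp. 64-68 * |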
Also Published As
Publication number | Publication date |
---|---|
CN105243023A (en) | 2016-01-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |