CN105243023B - Parallel Runtime error checking method

Info

Publication number: CN105243023B
Authority: CN (China)
Prior art keywords: counter, MPI, message queue, dump, unfinished
Legal status: Active
Application number: CN201510831795.5A
Other languages: Chinese (zh)
Other versions: CN105243023A (en)
Inventors: 刘勇, 彭超, 陈华蓉, 王敬宇, 冯赟龙, 王雯霞
Current Assignee: Wuxi Jiangnan Computing Technology Institute
Original Assignee: Wuxi Jiangnan Computing Technology Institute
Application filed by: Wuxi Jiangnan Computing Technology Institute
Priority to: CN201510831795.5A

Abstract

The invention provides a parallel runtime error checking method, comprising: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the MPI process remains blocked in an MPI call, triggering a soft interrupt signal when the timer expires, thereby entering an interrupt handler that compares the current values of the first counter and the second counter. If the two values are unequal, a state dump is performed, followed by deadlock detection; if they are equal, the interrupt handler returns and the parallel program continues executing.

Description

Parallel Runtime error checking method
Technical field
The present invention relates to the field of computer technology, and in particular to a parallel runtime error checking method.
Background
In the field of HPC (High Performance Computing), MPI (Message Passing Interface) is currently the most widely used and most flexible parallel programming model. It is also the current mainstream message passing standard and has multiple open-source implementations. Because MPI does not parallelize programs automatically, programmers generally develop parallel computations by scaling up the degree of parallelism step by step, from small to large.
When a program runs at small parallel scale, run-time errors can be detected and located by adding print statements to the program or by using a debugging tool. But when the parallel scale reaches the hundreds or thousands and the code runs to tens of thousands of lines, the massive volume of printed output not only significantly perturbs the parallel execution of the program but also makes analyzing the output a huge workload. Existing debugging tools for large-scale parallel programs are all oriented toward processes and program statements: the user must choose which of thousands of processes to debug and which program statements to debug. As a result, debugging large-scale parallel programs today relies almost entirely on the programmer's experience and manual analysis.
On the other hand, a large-scale parallel program may run for a very long time. Whether a communication deadlock has occurred during otherwise normal operation, because of a program error or a system fault such as a network failure, is of particular concern to programmers, since it affects the return on investment of a computing center. What is needed, therefore, is a lightweight mechanism that can detect communication deadlocks promptly and analyze the interaction state of a large number of parallel processes automatically, with far less manual work.
STAT is a monitoring tool for running parallel programs. It collects the function call stacks of all parallel processes, aggregates the information over MRNET (a dynamically established hierarchical communication network), performs cluster analysis, and finally presents the running state of the large-scale parallel program visually. Because the tool does not significantly perturb parallel interaction and its monitoring information is simple, it scales to fairly large parallel runs. However, it cannot detect any run-time error by itself: the user must manually compare several monitoring results to find processes that behave abnormally, and then select those processes for further confirmation and error analysis with a conventional debugger.
MUST is an MPI run-time error detection tool. Using library-level interposition, it inserts error analysis at every MPI library call in the program, collects all MPI-based parallel interaction information, and gathers it over MRNET at a central point. There it builds a directed dependency graph of message passing and detects circular dependencies in that graph to diagnose communication deadlock. Because of the communication and storage overhead of information collection and the bottleneck at the central server, the parallel scale at which the technique can be applied is limited to within 100 processes.
Open-source implementations of the MPI run-time library have provided two debugging techniques in succession. Early on, when parallel programs were small, the debugging approach was to collect and display each interaction event between processes for the programmer to analyze manually. After parallel scale grew, an access interface to the message queues of the abstract device layer (ADI) was provided. Through this interface library, one can obtain, at a given instant, each process's unfinished send message queue (pending send), unfinished receive message queue (pending receive), and unexpected receive message queue (unexpected receive; messages that have been received but not yet claimed by the local program). TotalView (a parallel debugging tool that supports multiple architectures and platforms and provides conventional process- and thread-oriented debugging for MPI, OpenMP, OpenACC, and other kinds of parallel programs) first uses the inter-process debugging interface to halt the activity of a set of processes, then uses the queue access interface to enter the MPI process space and capture the message queues, and displays the messages that the process set has not completed at that instant. It provides no further deadlock detection, so whether a communication deadlock has occurred must be judged manually. For a parallel program with thousands of processes, halting all process activity may jeopardize the stability of the debugging tool itself, so TotalView displays the messages of only some of the processes, that is, a subset.
Summary of the invention
The technical problem to be solved by the invention is to provide, in view of the above drawbacks of the prior art, a parallel runtime error checking method that detects MPI communication deadlocks. It addresses the high run-time overhead of the prior art and its lack of analysis of run-time parallel interaction errors. It barely affects the performance of the parallel program when not activated, can be implemented on the open interface of the MPI run-time library, is portable, and is fully transparent to user programs.
According to the invention, a parallel runtime error checking method is provided for locating errors that occur while an MPI parallel program is running.
The parallel runtime error checking method comprises: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the MPI process remains blocked in an MPI call, triggering a soft interrupt signal when the timer expires, thereby entering an interrupt handler that compares the current values of the first counter and the second counter. If the two values are unequal, a state dump is performed, followed by deadlock detection; if they are equal, the interrupt handler returns and the parallel program continues executing.
Preferably, when the state dump is performed, each MPI process that has entered the interaction stable state writes its current process state, the address trace of its function calls, and the data of its pending send message queue, pending receive message queue, and unexpected receive message queue to a dump file; after the dump completes, it returns to the parallel program and continues executing.
Preferably, when deadlock detection is performed, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set have entered MPI operations, but they are not executing the same global (collective) operation;
Second condition: the pending receive message queues form a circular dependency.
Preferably, when the first condition is satisfied, the process state with the lowest similarity is output; when the second condition is satisfied, the circular dependency is followed, and for each of a predetermined number of processes with the lowest similarity, the pending send message queue and the unexpected receive message queue are queried to look for messages that match the pending receive message queue.
The invention uses stable-state detection and automatically pushes key information to external files, so that error detection and analysis are completed by an external program, which reduces the perturbation that error detection causes to the parallel program itself. Communication deadlock is detected and analyzed from the message queues, which avoids the large overhead of collecting and recording MPI communication semantics. Moreover, the invention is transparent to application programs.
The invention addresses the problem that communication deadlocks in large-scale parallel programs reduce the utilization of computing resources. After detecting that a process has entered a stable state, it automatically pushes key information to external files, and an external program completes error detection and analysis. Because communication deadlock is detected and analyzed from the message queues, the whole course of parallel interaction need not be traced and analyzed; this avoids the significant performance loss caused by collecting and recording MPI communication semantics and gives the method the scalability and practicality needed for debugging large-scale parallel programs. In addition, the technique is implemented at the run-time library level and is fully transparent to user programs.
Brief description of the drawings
A more complete understanding of the invention, and of its attendant advantages and features, can be gained from the following detailed description read in conjunction with the accompanying drawing, in which:
Fig. 1 schematically shows a flowchart of the parallel runtime error checking method according to a preferred embodiment of the invention.
Note that the drawing illustrates the invention and does not limit it. Drawings that show structure may not be drawn to scale. In the drawings, identical or similar elements bear identical or similar reference numerals.
Detailed description
To make the content of the invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the drawing.
Fig. 1 schematically shows a flowchart of the parallel runtime error checking method according to a preferred embodiment of the invention.
Specifically, the parallel runtime error checking method according to the preferred embodiment is used, for example, to locate errors that occur while an MPI parallel program is running.
As shown in Fig. 1, the parallel runtime error checking method according to the preferred embodiment of the invention comprises:
First step S1: perform stable-state detection;
When stable-state detection is performed, for example, a first counter A and a second counter B, both with an initial value of 0, are set up in the MPI function MPI_INIT (which initializes the MPI execution environment and establishes the connections among the MPI processes in preparation for subsequent communication). When a process enters an MPI blocking operation, the first counter A is incremented by one and a timer (alarm) is started; when the process returns from the blocking operation, the value of A is assigned to B and the timer is cancelled. If the process remains blocked in an MPI call, the timer triggers a soft interrupt signal when it expires, and control enters an interrupt handler that compares the current values of A and B. If they are unequal, the process is considered to be in an interaction stable state (further detection is needed to determine whether a communication deadlock has occurred), so a state dump (second step S2) is performed, followed by deadlock detection (third step S3). If the two counters are equal, the interrupt handler returns and the parallel program continues executing.
Here, an interaction stable state means the following: if, from some instant on, a process no longer interacts with other processes, the process is considered to have entered a stable state of parallel interaction.
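The counter logic of step S1 can be illustrated with a minimal Python sketch. It is only a model under stated assumptions: the class name, the method names, and the timeout value are hypothetical, and the timer is represented by an explicit call to `on_timer_expired` rather than by a real alarm signal, so that the comparison of A and B can be followed in isolation.

```python
class StableStateDetector:
    """Model of the two-counter check in step S1 (all names are hypothetical)."""

    def __init__(self, timeout_s=60.0):
        self.a = 0                  # first counter A: blocking operations entered
        self.b = 0                  # second counter B: copy of A at the last return
        self.timeout_s = timeout_s  # assumed timer interval; the source gives no value
        self.timer_armed = False

    def enter_blocking(self):
        """Called on entry to an MPI blocking operation."""
        self.a += 1
        self.timer_armed = True     # stands in for starting the alarm timer

    def leave_blocking(self):
        """Called when the blocking operation returns."""
        self.b = self.a
        self.timer_armed = False    # stands in for cancelling the timer

    def on_timer_expired(self):
        """Body of the interrupt handler: decide what to do next."""
        if self.a != self.b:
            return "dump"           # still blocked: dump state, then detect deadlock
        return "continue"           # operation already returned: resume the program
```

A process that is still inside the blocking call when the timer fires has A one ahead of B, so the handler asks for a dump; a process whose call has already returned has equal counters and simply resumes.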
Second step S2: perform the state dump;
When the state dump is performed, each MPI process that has entered the stable state writes data such as its current process state (computing or communicating), the address trace of its function calls, and its three message queues (pending send, pending receive, and unexpected receive) to a dump file. After the dump completes, the process returns to the parallel program and continues executing. Processes that have not entered the stable state do not perform this step.
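As a rough illustration of step S2, the sketch below writes one dump file per process and reads all of them back. The source does not specify a dump format; the JSON layout, the file naming scheme, the field names, and the representation of a message as a `{"peer", "tag"}` dictionary are all assumptions made for this example.

```python
import json
import os


def write_dump(dump_dir, rank, state, call_trace,
               pending_send, pending_recv, unexpected_recv):
    """Write one process's dump file and return its path (format is assumed)."""
    record = {
        "rank": rank,
        "state": state,                      # "computing" or "communicating"
        "call_trace": call_trace,            # function-call address trace
        "pending_send": pending_send,        # unfinished sends
        "pending_recv": pending_recv,        # unfinished receives
        "unexpected_recv": unexpected_recv,  # received but not yet claimed
    }
    path = os.path.join(dump_dir, f"rank{rank}.dump.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return path


def read_dumps(dump_dir):
    """Read every dump file in the directory into a dict keyed by rank."""
    dumps = {}
    for name in sorted(os.listdir(dump_dir)):
        if name.endswith(".dump.json"):
            with open(os.path.join(dump_dir, name), encoding="utf-8") as f:
                record = json.load(f)
            dumps[record["rank"]] = record
    return dumps
```

Writing to separate files keeps the dumping process independent of the others, which matches the idea that analysis is left to an external program.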
Third step S3: perform deadlock detection;
When deadlock detection is performed, all dump files are read in and checked against two conditions: (1) a global operation lacks global participation, that is, all processes in the communication set have entered MPI operations but are not executing the same global operation; (2) the pending receive message queues are circularly dependent, that is, the pending receive message queues of some set of processes form a circular dependency. If either condition holds, a communication deadlock is considered to have occurred.
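The two conditions of step S3 can be sketched in Python as follows. This is one possible reading, not the patented implementation: the record fields `mpi_op` and `pending_recv`, the short list of collective names, and the rule that condition (1) requires at least one process to be inside a collective are assumptions. Condition (2) is checked with an ordinary depth-first search for a cycle in the graph whose edges point from a receiver to the rank it is waiting on.

```python
# Assumed (incomplete) set of MPI collective operation names.
COLLECTIVES = {"MPI_Barrier", "MPI_Bcast", "MPI_Reduce", "MPI_Allreduce"}


def mismatched_collective(dumps):
    """Condition (1): everyone is inside MPI, but not in the same collective."""
    ops = [rec["mpi_op"] for rec in dumps.values()]
    if not ops or any(op is None for op in ops):
        return False                    # some process is not in an MPI operation
    in_collective = [op for op in ops if op in COLLECTIVES]
    return bool(in_collective) and len(set(ops)) > 1


def find_receive_cycle(dumps):
    """Condition (2): return a list of ranks forming a cycle, or None."""
    graph = {
        rank: sorted({m["peer"] for m in rec["pending_recv"] if m["peer"] in dumps})
        for rank, rec in dumps.items()
    }
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(graph, WHITE)
    stack = []

    def visit(node):                    # recursive DFS; fine for a small sketch
        color[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:      # back edge closes a cycle
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def detect_deadlock(dumps):
    """Report a deadlock if either condition holds."""
    return mismatched_collective(dumps) or find_receive_cycle(dumps) is not None
```

Receives whose peer is not among the dumped ranks (for example a wildcard source) are ignored here, which is a simplification of this sketch.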
Fourth step S4: perform deadlock inference.
If neither deadlock condition is met, return immediately. Otherwise, cluster analysis of process states is carried out over the set of processes that produced dumps. If condition (1) holds, the process state with the lowest similarity is output. If condition (2) holds, the circular dependency is followed: for each low-similarity process (for example, each of a predetermined number of processes with the lowest similarity), its other two message queues are queried for messages that match (for example, satisfy a predetermined matching condition against) the pending receive message queue; if any are found, the similar messages of the matching process are output.
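Step S4 leaves both the similarity measure and the matching condition open, so the following Python sketch fills them with simple stand-ins that are purely assumptions: a process's similarity is taken to be the fraction of dumped processes that share its state and call trace, and a queued message is treated as "similar" to a pending receive when it involves the same peer rank. The field names are hypothetical as well.

```python
from collections import Counter


def similarity_scores(dumps):
    """Score each rank by the share of ranks with the same (state, call trace)."""
    signatures = {
        rank: (rec["state"], tuple(rec["call_trace"]))
        for rank, rec in dumps.items()
    }
    counts = Counter(signatures.values())
    total = len(dumps)
    return {rank: counts[sig] / total for rank, sig in signatures.items()}


def least_similar(dumps, k=1):
    """Return the k ranks with the lowest similarity (ties broken by rank)."""
    scores = similarity_scores(dumps)
    return sorted(scores, key=lambda rank: (scores[rank], rank))[:k]


def similar_messages(rec):
    """Search the other two queues for messages from the awaited peer."""
    hits = []
    for want in rec["pending_recv"]:
        for queue in ("pending_send", "unexpected_recv"):
            for msg in rec[queue]:
                if msg["peer"] == want["peer"]:   # assumed matching condition
                    hits.append((queue, want, msg))
    return hits
```

With this stand-in, a receive posted for tag 5 from rank 1 would be reported next to an unclaimed message from rank 1 carrying tag 6, which is the kind of near miss a programmer would want to see when looking for the root cause.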
In the above method, the modules that perform the first two steps are integrated into the MPI parallel program. They are based on the open PMPI interface of the MPI run-time library and are interposed at the library level at link time, so the source code of the MPI library need not be modified; integration is convenient and transparent to user programs. The execution flow requires no manual operation: a communication deadlock that may have occurred is reported automatically, together with an inferred analysis of the root cause of the deadlock, which greatly improves debugging efficiency and the utilization of computing resources.
In summary, locating run-time errors in large-scale MPI parallel programs requires analyzing the execution state and interactions of a large number of MPI processes. Collecting, storing, and centrally analyzing process information incurs huge run-time overhead, which not only reduces the utilization of large-scale computing resources but also limits the scalability and practicality of existing debugging techniques. The present method improves the efficiency of run-time error detection for MPI parallel programs and reduces both the perturbation of parallel execution and the run-time overhead.
Furthermore, unless otherwise indicated, terms such as "first", "second", and "third" in the specification are used only to distinguish components, elements, steps, and the like, and are not intended to indicate any logical or ordinal relationship among them.
It should be understood that although the invention has been disclosed above by way of preferred embodiments, these embodiments are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the invention.

Claims (3)

1. A parallel runtime error checking method for locating errors that occur while an MPI parallel program is running, characterized by comprising:
setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the MPI process remains blocked in an MPI call, triggering a soft interrupt signal when the timer expires, thereby entering an interrupt handler, and comparing the current values of the first counter and the second counter in the interrupt handler; if the current values of the first counter and the second counter are unequal, performing a state dump and then performing deadlock detection; if the current values of the first counter and the second counter are equal, returning from the interrupt handler and continuing to execute the parallel program; wherein, when the state dump is performed, each MPI process that has entered the interaction stable state writes its current process state, the address trace of its function calls, and the data of its pending send message queue, pending receive message queue, and unexpected receive message queue to a dump file, and after the dump completes, returns to the parallel program and continues executing.
2. The parallel runtime error checking method according to claim 1, characterized in that, when deadlock detection is performed, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set have entered MPI operations, but they are not executing the same global operation;
Second condition: the pending receive message queues form a circular dependency.
3. The parallel runtime error checking method according to claim 2, characterized in that, when the first condition is satisfied, the process state with the lowest similarity is output; and when the second condition is satisfied, the circular dependency is followed, and for each of a predetermined number of processes with the lowest similarity, the pending send message queue and the unexpected receive message queue are queried to look for messages that match the pending receive message queue.
Application CN201510831795.5A (priority date 2015-11-24, filing date 2015-11-24): Parallel Runtime error checking method; status Active; granted as CN105243023B (en)

Priority Applications (1)

Application number: CN201510831795.5A
Publication: CN105243023B (en)
Priority date: 2015-11-24
Filing date: 2015-11-24
Title: Parallel Runtime error checking method

Applications Claiming Priority (1)

Application number: CN201510831795.5A
Publication: CN105243023B (en)
Priority date: 2015-11-24
Filing date: 2015-11-24
Title: Parallel Runtime error checking method

Publications (2)

Publication number: CN105243023A; Publication date: 2016-01-13
Publication number: CN105243023B; Publication date: 2017-09-26

Family

ID=55040676

Family Applications (1)

Application number: CN201510831795.5A
Status: Active
Publication: CN105243023B (en)
Priority date: 2015-11-24
Filing date: 2015-11-24
Title: Parallel Runtime error checking method

Country Status (1)

Country: CN (1)
Link: CN105243023B (en)

Also Published As

Publication number Publication date
CN105243023A (en) 2016-01-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant