CN105243023A - Method for detecting errors generated during parallel running - Google Patents
- Publication number
- CN105243023A
- Authority
- CN
- China
- Prior art keywords
- counter
- mpi
- message queue
- error
- completed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Debugging And Monitoring (AREA)
Abstract
The invention provides a method for detecting errors generated during parallel running. The method comprises: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is still blocked in an MPI call when the timer expires, triggering a software interrupt signal so that the process enters an interrupt handler, in which the current values of the first counter and the second counter are compared. If the two values are not equal, the state is dumped and deadlock detection is then carried out; if the two values are equal, the process returns from the interrupt handler and the parallel program continues to execute.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method for detecting errors during parallel running.
Background technology
In the HPC (High Performance Computing) field, MPI (Message Passing Interface) is currently the most widely used and most flexible parallel programming model and the mainstream message-passing standard, with several open-source implementations. Because MPI does not parallelize programs automatically, programmers generally develop parallel computations by gradually increasing the parallel scale from small to large.
When a program runs at small parallel scale, errors in parallel execution can be detected and located by adding print statements to the program or by using a debugging tool. With hundreds or thousands of nodes and tens of thousands of lines of code, however, the enormous volume of printed output not only significantly disturbs the parallel execution of the program but also makes analyzing the output a huge workload. Existing debugging tools for large-scale parallel programs are all oriented toward processes and program statements, so the programmer must choose which of thousands of processes to debug and which statements to examine. As a result, debugging large-scale parallel programs currently relies almost entirely on the programmer's experience and manual analysis.
On the other hand, a large-scale parallel program may run for a very long time. Programmers pay particular attention to whether it is running normally and whether a communication deadlock has occurred because of a program error or a system fault such as a network failure, since this affects the input-output ratio of a computing center. What is needed is therefore a lightweight device that detects communication deadlock promptly and automatically analyzes the interaction state of a large number of parallel processes, significantly reducing manual work.
STAT is a monitoring tool for running parallel programs. It collects the function call stack relationships of all parallel processes, gathers the information over MRNET (a dynamically established layered communication network), performs cluster analysis, and finally displays the running state of a large-scale parallel program visually. Because the tool does not significantly disturb parallel interaction and the monitored information is simple, it scales to fairly large parallel runs. However, it cannot itself detect any runtime error: a person must compare repeated monitoring results to find processes that behave abnormally, then select those processes and use a conventional debugger to confirm and analyze the program error.
MUST is an MPI runtime error checking tool. It uses library-level instrumentation, inserting error analysis at every call to the MPI library in the program, collects all parallel interaction information based on MPI calls, and aggregates it at a central point over MRNET. It then builds a directed dependency graph of message passing, detects circular dependencies among messages, and diagnoses communication deadlock. Because of the overhead of collecting and storing communication information and the bottleneck at the central server, the parallel scale at which the technique can be applied is limited to within 100 processes.
Open-source implementations of the MPI runtime library have successively provided two debugging techniques. Early on, when parallel programs were small, the approach was to collect and display every interaction event between processes and leave the analysis to the programmer. As parallel scale grew, an access interface to the message queues of the abstract device layer (ADI) was provided. Through this interface library one can obtain, at a given instant, each process's pending send queue (pending send), pending receive queue (pending receive), and unexpected receive queue (unexpected receive), that is, messages that have been received but not yet claimed by the local program. TotalView (a parallel debugging tool that supports multiple architectures and provides conventional process- and thread-oriented debugging for MPI, OpenMP, OpenACC and other kinds of parallel programs) first uses the inter-process debugging interface to stop the processes of a process set, then uses this queue access interface to capture the message queues from the MPI process space and display the messages that the programs in that process set have not completed at that instant. It provides no further deadlock detection, so whether a communication deadlock has occurred must be judged manually. For parallel programs on the scale of thousands of nodes, stopping all processes may jeopardize the stability of the debugging tool itself, so TotalView shows the messages of only a subset of the processes.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above defects in the prior art, to provide a method for detecting errors during parallel running. The method targets MPI communication deadlock detection and addresses the large runtime overhead of the prior art and its lack of exploratory analysis of parallel interaction errors at run time. It hardly affects the performance of the parallel program when not activated, can be implemented as a runtime library on top of the open interface of the MPI library, is highly portable, and is completely transparent to user programs.
According to the present invention, a method for detecting errors during parallel running is provided, for locating errors produced when an MPI parallel program runs.
The method comprises: setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is still blocked in an MPI call when the timer expires, triggering a software interrupt signal so that the process enters an interrupt handler, in which the current values of the first counter and the second counter are compared. If the two values are not equal, a state dump is performed, followed by deadlock detection; if the two values are equal, the handler returns and the parallel program continues to execute.
Preferably, during the state dump, each MPI process that has entered the interaction steady state writes to a dump file its current process state, the address trace of its function calls, its pending send queue, its pending receive queue, and its unexpected receive queue; after the dump completes, the process returns to the parallel program and continues to execute.
Preferably, during deadlock detection, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set have entered an MPI operation, but they are not executing the same global operation;
Second condition: the pending receive queues form a circular dependency.
Preferably, when the first condition is satisfied, the process state with the lowest similarity is output; when the second condition is satisfied, the pending send queue and the unexpected receive queue of each of a predetermined number of lowest-similarity processes are queried along the circular dependency, to search for messages matching the pending receive queue.
The present invention uses steady-state detection and automatically pushes key information to external files, so that error detection and analysis are completed by an external program, which reduces the interference of error detection with the running parallel program. Detecting and analyzing communication deadlock from the message queues avoids the large overhead of collecting and recording MPI communication semantics. The invention is also transparent to application programming.
The present invention addresses the problem that communication deadlock in large-scale parallel programs reduces the utilization of computing resources. After detecting that processes have entered a steady state, it automatically pushes key information to external files, and an external program completes error detection and analysis. Because communication deadlock is detected and analyzed from the message queues, the whole course of parallel interaction need not be traced and analyzed. This solves the problem that collecting and recording MPI communication semantics causes parallel program performance to drop significantly, and gives the method the scalability and practicality needed for debugging large-scale parallel programs. In addition, the technique is implemented at the runtime-library level and is completely transparent to user programs.
Brief description of the drawings
A more complete understanding of the present invention, and of its attendant advantages and features, will be more readily obtained by reference to the following detailed description taken in conjunction with the accompanying drawing, in which:
Fig. 1 schematically shows a flow chart of the method for detecting errors during parallel running according to a preferred embodiment of the present invention.
It should be noted that the drawing is intended to illustrate the invention, not to limit it. Drawings representing structure may not be drawn to scale, and identical or similar elements are given identical or similar reference labels.
Embodiment
To make the content of the present invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the drawing.
Fig. 1 schematically shows a flow chart of the method for detecting errors during parallel running according to a preferred embodiment of the present invention.
Specifically, the method according to the preferred embodiment is used, for example, to locate errors produced when an MPI parallel program runs.
As shown in Fig. 1, the method for detecting errors during parallel running according to the preferred embodiment comprises:
First step S1: perform steady-state detection;
During steady-state detection, a first counter A and a second counter B, each with an initial value of 0, are set, for example in the MPI function MPI_INIT (which initializes the MPI execution environment and establishes contact among the MPI processes in preparation for subsequent communication). When a process enters an MPI blocking operation, counter A is incremented by one and a timer (alarm) is started; when the process returns from the blocking operation, the value of A is assigned to B and the timer is cancelled. If the process is still blocked in an MPI call when the timer expires, a software interrupt signal is triggered and the process enters an interrupt handler, which compares the current values of A and B. If they are not equal, the process is considered to be in an interaction steady state (further detection is needed to determine whether a communication deadlock has occurred), and deadlock detection (third step S3) is performed after the state dump (second step S2). If the two values are equal, the handler returns and the parallel program continues to execute.
Here, an interaction steady state means that, from some instant onward, a process no longer interacts with other processes; such a process is considered to have entered a parallel-interaction steady state.
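The counter logic of step S1 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, and the timer is modelled as a flag with `on_timer()` called by hand, where a runtime library would presumably arm a real alarm and run the comparison inside its signal handler.

```python
class SteadyStateDetector:
    """Hypothetical model of the two counters described in step S1."""

    def __init__(self):
        self.a = 0                # first counter: blocking operations entered
        self.b = 0                # second counter: value of A at the last return
        self.timer_armed = False  # stands in for the alarm timer

    def enter_blocking(self):
        """Called when the process enters a blocking MPI operation."""
        self.a += 1
        self.timer_armed = True   # start the timer

    def leave_blocking(self):
        """Called when the process returns from the blocking operation."""
        self.b = self.a
        self.timer_armed = False  # cancel the timer

    def on_timer(self):
        """Body of the interrupt handler: True means a state dump is needed."""
        if not self.timer_armed:
            return False          # a cancelled timer never fires
        return self.a != self.b   # unequal counters: still blocked


detector = SteadyStateDetector()

# A blocking call that completes before the timer expires.
detector.enter_blocking()
detector.leave_blocking()
completed_case = detector.on_timer()

# A blocking call that is still outstanding when the timer expires.
detector.enter_blocking()
stuck_case = detector.on_timer()
```

In this sketch `completed_case` is `False` and `stuck_case` is `True`: only the process that is still inside its blocking call is treated as being in the interaction steady state.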
Second step S2: perform the state dump;
During the state dump, each MPI process that has entered the steady state writes to a dump file data such as its current process state (computing or communicating), the address trace of its function calls, and its three message queues (the pending send queue, the pending receive queue, and the unexpected receive queue). After the dump completes, the process returns to the parallel program and continues to execute. Processes that have not entered the steady state do not perform this step.
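As an illustration of step S2, the sketch below writes one dump file per process. The patent does not specify a file format, so the JSON layout, the `rank_<n>.json` file name, the field names, and the function `dump_state` are all assumptions made for this example.

```python
import json
import os
import tempfile


def dump_state(directory, rank, state, call_trace,
               pending_send, pending_recv, unexpected_recv):
    """Write the state of one process to a dump file and return its path.

    The record layout is hypothetical; it only mirrors the items listed
    in step S2 (process state, call address trace, three message queues).
    """
    record = {
        "rank": rank,
        "state": state,                  # "computing" or "communicating"
        "call_trace": call_trace,        # addresses of the active calls
        "pending_send": pending_send,
        "pending_recv": pending_recv,
        "unexpected_recv": unexpected_recv,
    }
    path = os.path.join(directory, f"rank_{rank}.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
    return path


# Usage: dump rank 3, which is waiting for a message from rank 2.
with tempfile.TemporaryDirectory() as dump_dir:
    dump_path = dump_state(
        dump_dir, 3, "communicating", ["0x4005d0", "0x400a10"],
        pending_send=[],
        pending_recv=[{"source": 2, "dest": 3, "tag": 7}],
        unexpected_recv=[],
    )
    with open(dump_path, encoding="utf-8") as handle:
        loaded = json.load(handle)
```

Writing one self-describing file per process keeps the dump step local to each process, which fits the text's point that analysis is left to an external program.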
Third step S3: perform deadlock detection;
During deadlock detection, all dump files are read in and examined. Detection is based on two conditions: (1) a global operation lacks global participation, that is, all processes in the communication set have entered an MPI operation but are not executing the same global operation; (2) the pending receive queues form a circular dependency, that is, within some set of processes there is a circular dependency among their pending receives. If either condition holds, a communication deadlock is considered to have occurred.
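The two conditions of step S3 can be sketched as two small checks. The input shapes are assumptions for illustration: `current_ops` maps each rank to the name of the MPI operation it is in (or `None` if it is not inside MPI), and `waits_on` maps each rank to the ranks it has pending receives from. The cycle check is an ordinary depth-first search, which is one common way to find a circular dependency and is not necessarily the method the patent uses.

```python
def collective_mismatch(current_ops):
    """Condition (1): every process is inside an MPI operation,
    but they are not all executing the same one."""
    ops = list(current_ops.values())
    return all(op is not None for op in ops) and len(set(ops)) > 1


def has_receive_cycle(waits_on):
    """Condition (2): the pending receives form a circular dependency."""
    in_progress, done = set(), set()

    def visit(rank):
        if rank in done:
            return False
        if rank in in_progress:
            return True               # reached a rank already on the path
        in_progress.add(rank)
        found = any(visit(src) for src in waits_on.get(rank, ()))
        in_progress.discard(rank)
        done.add(rank)
        return found

    return any(visit(rank) for rank in waits_on)


def is_deadlocked(current_ops, waits_on):
    """A deadlock is reported if either condition holds."""
    return collective_mismatch(current_ops) or has_receive_cycle(waits_on)


mismatch = collective_mismatch({0: "MPI_Barrier", 1: "MPI_Bcast"})
agreement = collective_mismatch({0: "MPI_Barrier", 1: "MPI_Barrier"})
cycle = has_receive_cycle({0: [1], 1: [2], 2: [0]})
chain = has_receive_cycle({0: [1], 1: [2], 2: []})
```

Here `mismatch` and `cycle` are `True`, while `agreement` and `chain` are `False`. The recursive search is adequate for a sketch; at very large process counts an iterative traversal would avoid Python's recursion limit.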
Fourth step S4: perform deadlock inference.
If neither deadlock condition is met, return immediately. Otherwise, perform a cluster analysis of process states over the set of processes that produced dumps. If condition (1) is satisfied, output the process state with the lowest similarity. If condition (2) is satisfied, follow the circular dependency and query the other two message queues of each low-similarity process (for example, each of a predetermined number of lowest-similarity processes), searching for messages that match (for example, satisfy a predetermined matching condition with) the pending receive queue; if any are found, output the similar messages of the matching processes.
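The inference step can be illustrated with a deliberately simple stand-in. The patent does not define its similarity measure or its matching condition, so both are assumptions here: a process counts as less similar the fewer processes share its exact call trace, and a message matches a pending receive when source, destination, and tag are equal. The function names are hypothetical.

```python
from collections import Counter


def least_similar(call_traces, count=1):
    """Return the `count` ranks whose call trace is shared by the fewest
    processes (a crude stand-in for the cluster analysis in step S4)."""
    frequency = Counter(call_traces.values())
    ranked = sorted(call_traces,
                    key=lambda rank: (frequency[call_traces[rank]], rank))
    return ranked[:count]


def find_matching(pending_recv, other_queue):
    """Return messages from another queue that match a pending receive,
    assuming a match means equal source, destination, and tag."""
    wanted = {(m["source"], m["dest"], m["tag"]) for m in pending_recv}
    return [m for m in other_queue
            if (m["source"], m["dest"], m["tag"]) in wanted]


# Three ranks are stuck in the same place; rank 3 took a different path.
traces = {
    0: ("main", "solve", "MPI_Recv"),
    1: ("main", "solve", "MPI_Recv"),
    2: ("main", "solve", "MPI_Recv"),
    3: ("main", "write_output", "MPI_Send"),
}
outliers = least_similar(traces)

# Rank 0 waits for tag 7 from rank 3; rank 3 has two pending sends.
matches = find_matching(
    [{"source": 3, "dest": 0, "tag": 7}],
    [{"source": 3, "dest": 0, "tag": 7}, {"source": 3, "dest": 1, "tag": 9}],
)
```

In this example `outliers` is `[3]` and `matches` contains only the tag-7 message, which is the kind of hint a report on the likely root cause could present to the programmer.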
In the above method, the module that performs the first two steps is integrated into the MPI parallel program. It is based on the PMPI open interface of the MPI runtime library and is inserted at the library level at link time, so the source code of the MPI library need not be modified; integration is convenient and transparent to user programs. The execution flow of the device requires no manual operation: it automatically reports possible communication deadlocks and provides an inferential analysis report on the root cause of the deadlock, greatly improving debugging efficiency and the utilization of computing resources.
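The idea of library-level interposition can be shown by analogy in Python. A real PMPI wrapper is written in C and forwards to the `PMPI_` entry point of the routine it replaces; the decorator below only imitates that shape, and `intercepted`, `events`, and `fake_recv` are hypothetical names introduced for this sketch.

```python
import functools

events = []  # hypothetical log standing in for counter updates and timer control


def intercepted(blocking_call):
    """Bracket a blocking call with entry and exit hooks, the way a
    profiling wrapper brackets the underlying library routine."""
    @functools.wraps(blocking_call)
    def wrapper(*args, **kwargs):
        events.append(("enter", blocking_call.__name__))      # A += 1, start timer
        try:
            return blocking_call(*args, **kwargs)              # forward the call
        finally:
            events.append(("leave", blocking_call.__name__))  # B = A, cancel timer
    return wrapper


@intercepted
def fake_recv(source):
    """Hypothetical stand-in for a blocking receive."""
    return f"message from {source}"


result = fake_recv(2)
```

The caller's code is unchanged apart from which function the name resolves to, which is the sense in which the text calls the approach transparent to user programs.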
In summary, locating runtime errors in large-scale MPI parallel programs requires analyzing the execution state and interaction of a large number of MPI processes. Collecting, storing, and centrally analyzing process information incurs huge runtime overhead, which not only lowers the utilization of large amounts of computing resources but also limits the scalability and practicality of existing debugging techniques. The device improves the efficiency of runtime error checking for MPI parallel programs and reduces both the interference with parallel execution and the runtime overhead.
In addition, it should be noted that, unless otherwise indicated, the terms "first", "second", "third", and so on in the specification are used only to distinguish the components, elements, steps, and so on in the specification, not to indicate a logical or ordinal relationship among them.
It will be understood that although the present invention has been disclosed above by way of preferred embodiments, the above embodiments are not intended to limit the invention. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent variation, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.
Claims (4)
1. A method for detecting errors during parallel running, for locating errors produced when an MPI parallel program runs, characterized by comprising:
setting a first counter and a second counter, each with an initial value of 0; when a process enters an MPI blocking operation, incrementing the first counter by one and starting a timer; when the process returns from the blocking operation, assigning the value of the first counter to the second counter and cancelling the timer; and, if the process is still blocked in an MPI call when the timer expires, triggering a software interrupt signal so that the process enters an interrupt handler, in which the current values of the first counter and the second counter are compared; if the two values are not equal, performing a state dump followed by deadlock detection; if the two values are equal, returning from the interrupt handler and continuing to execute the parallel program.
2. The method for detecting errors during parallel running according to claim 1, characterized in that, during the state dump, each MPI process that has entered the interaction steady state writes to a dump file its current process state, the address trace of its function calls, its pending send queue, its pending receive queue, and its unexpected receive queue, and after the dump completes returns to the parallel program and continues to execute.
3. The method for detecting errors during parallel running according to claim 1 or 2, characterized in that, during deadlock detection, all dump files are read in, and a communication deadlock is judged to have occurred when either of the following two conditions holds:
First condition: all processes in the communication set have entered an MPI operation, but they are not executing the same global operation;
Second condition: the pending receive queues form a circular dependency.
4. The method for detecting errors during parallel running according to claim 3, characterized in that, when the first condition is satisfied, the process state with the lowest similarity is output; and when the second condition is satisfied, the pending send queue and the unexpected receive queue of each of a predetermined number of lowest-similarity processes are queried along the circular dependency, to search for messages matching the pending receive queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510831795.5A CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510831795.5A CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105243023A true CN105243023A (en) | 2016-01-13 |
CN105243023B CN105243023B (en) | 2017-09-26 |
Family
ID=55040676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510831795.5A Active CN105243023B (en) | 2015-11-24 | 2015-11-24 | Parallel Runtime error checking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243023B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301125A (en) * | 2017-06-19 | 2017-10-27 | 广州华多网络科技有限公司 | A kind of method, device and electronic equipment for finding root mistake |
CN109213684A (en) * | 2018-09-18 | 2019-01-15 | 北京工业大学 | Program detecting method and application based on OpenMP thread heartbeat detection technology |
CN111090528A (en) * | 2019-12-25 | 2020-05-01 | 北京天融信网络安全技术有限公司 | Deadlock determination method and device and electronic equipment |
CN111538599A (en) * | 2020-04-23 | 2020-08-14 | 杭州涂鸦信息技术有限公司 | LINUX-based multithreading deadlock problem positioning method and system |
CN112631816A (en) * | 2019-09-24 | 2021-04-09 | 无锡江南计算技术研究所 | Debugging log-based parallel program error positioning method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101833479A (en) * | 2010-04-16 | 2010-09-15 | 中国人民解放军国防科学技术大学 | MPI (Moldflow Plastics Insight) information scheduling method based on reinforcement learning under multi-network environment |
CN101937365A (en) * | 2009-06-30 | 2011-01-05 | 国际商业机器公司 | Deadlock detection method of parallel programs and system |
US20110119468A1 (en) * | 2009-11-13 | 2011-05-19 | International Business Machines Corporation | Mechanism of supporting sub-communicator collectives with o(64) counters as opposed to one counter for each sub-communicator |
US20130179864A1 (en) * | 2012-01-10 | 2013-07-11 | National Tsing Hua University | Deadlock free synchronization synthesizer for must-happen-before relations in parallel programs and method thereof |
CN103365852A (en) * | 2012-03-28 | 2013-10-23 | 天津书生软件技术有限公司 | Concurrency control method and system for document library systems |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937365A (en) * | 2009-06-30 | 2011-01-05 | 国际商业机器公司 | Deadlock detection method of parallel programs and system |
US20110119468A1 (en) * | 2009-11-13 | 2011-05-19 | International Business Machines Corporation | Mechanism of supporting sub-communicator collectives with o(64) counters as opposed to one counter for each sub-communicator |
CN101833479A (en) * | 2010-04-16 | 2010-09-15 | 中国人民解放军国防科学技术大学 | MPI (Moldflow Plastics Insight) information scheduling method based on reinforcement learning under multi-network environment |
US20130179864A1 (en) * | 2012-01-10 | 2013-07-11 | National Tsing Hua University | Deadlock free synchronization synthesizer for must-happen-before relations in parallel programs and method thereof |
CN103365852A (en) * | 2012-03-28 | 2013-10-23 | 天津书生软件技术有限公司 | Concurrency control method and system for document library systems |
Non-Patent Citations (1)
Title |
---|
陈岚: "一种基于并行技术的死锁检测算法", 《广西科学院学报》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301125A (en) * | 2017-06-19 | 2017-10-27 | 广州华多网络科技有限公司 | A kind of method, device and electronic equipment for finding root mistake |
CN107301125B (en) * | 2017-06-19 | 2021-08-24 | 广州华多网络科技有限公司 | Method and device for searching root error and electronic equipment |
CN109213684A (en) * | 2018-09-18 | 2019-01-15 | 北京工业大学 | Program detecting method and application based on OpenMP thread heartbeat detection technology |
CN112631816A (en) * | 2019-09-24 | 2021-04-09 | 无锡江南计算技术研究所 | Debugging log-based parallel program error positioning method |
CN112631816B (en) * | 2019-09-24 | 2022-11-15 | 无锡江南计算技术研究所 | Debugging log-based parallel program error positioning method |
CN111090528A (en) * | 2019-12-25 | 2020-05-01 | 北京天融信网络安全技术有限公司 | Deadlock determination method and device and electronic equipment |
CN111090528B (en) * | 2019-12-25 | 2023-09-26 | 北京天融信网络安全技术有限公司 | Deadlock determination method and device and electronic equipment |
CN111538599A (en) * | 2020-04-23 | 2020-08-14 | 杭州涂鸦信息技术有限公司 | LINUX-based multithreading deadlock problem positioning method and system |
Also Published As
Publication number | Publication date |
---|---|
CN105243023B (en) | 2017-09-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||