CN103577273A - Second failure data capture in co-operating multi-image systems - Google Patents

Second failure data capture in co-operating multi-image systems

Publication number
CN103577273A
Authority
CN
China
Prior art keywords
information
software
fault
images
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310343980.0A
Other languages
Chinese (zh)
Other versions
CN103577273B (en)
Inventor
R. N. Chamberlain
A. J. Pilkington
H. J. Hellyer
M. F. Peters
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN103577273A
Application granted
Publication of CN103577273B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment in a virtual computing platform, e.g. logically partitioned systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0778 Dumping, i.e. gathering error/state information after a fault for later diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/362 Software debugging
    • G06F11/3636 Software debugging by tracing the execution of the program

Abstract

A method, computer system, and computer program capture diagnostic trace information in a computer system having a plurality of software images. Information associated with a first failure in a first one of the plurality of software images is received. The received information is distributed to others of the plurality of software images. Further information associated with a second failure in another one of the plurality of software images is captured. Distributing the information comprises distributing a first part of the information to a first plurality of the software images and distributing a second part of the information to a second plurality of the software images.

Description

Second failure data capture in co-operating multi-image systems
Technical field
The present invention relates to the automatic capture of diagnostic data in computer systems, and in particular to the automatic capture of diagnostic data in co-operating multi-image computer systems.
Background art
The automatic capture of diagnostic data in computer systems is well known. In particular, it is commonly used in complex and/or long-running applications to allow problems to be resolved quickly without the need to reproduce the failure on site or on a backup system. A known solution is First Failure Data Capture (FFDC), which provides diagnostic data in the form of dumps, logs, and trace files, with data capture being triggered when a problem is detected.
A problem with this known solution is that there is a trade-off between the need to obtain enough diagnostic information to analyze and resolve the problem and the cost of producing that diagnostic information. The cost of producing diagnostic information can include (a) the performance cost to the application of continuous logging and tracing, (b) the time taken to produce a dump on failure (which may delay restarting the application), and (c) the amount of disk space required to store the diagnostic output.
WO2012/026035A discloses a failure processing system having: a storage-location information acquiring unit, which acquires, from a storage unit of the component in which a failure has occurred, storage-location information indicating where the failure information generated when the failure occurred is stored; a failure information acquiring unit, which uses the storage-location information to acquire the failure information generated when the failure occurred in an information processing device from a storage device that is connected so as to be able to communicate with both the information processing device and the failure processing device; and a configuration control unit, which uses the acquired failure information to modify the configuration of the failure processing device to match the information processing device. The failure processing system can thus easily reproduce the failure that occurred in the information processing device, so that reproduction tests can be carried out efficiently.
Therefore, there is a need in the art to address the aforementioned problem.
Summary of the invention
Embodiments of the invention provide a method for capturing diagnostic trace information, for use in a computer system having a plurality of software images, the method comprising the steps of: receiving information associated with a first failure in a first one of the plurality of software images; distributing the information to others of the plurality of software images; and capturing information associated with a second failure in another one of the plurality of software images. The advantage of this method is that the cost of capturing trace diagnostic information is kept to a minimum until a first failure occurs. Thereafter, the value of the captured trace diagnostic information is maximized, while its cost is still kept down because only detailed trace diagnostic information related to the first failure is captured.
In an embodiment, the step of distributing the information is carried out by a load balancer, a hypervisor, an operating system, monitoring software, or a peer-to-peer communication mechanism.
In a preferred embodiment, the step of distributing the information to others of the plurality of software images comprises: distributing a first part of the information to a first plurality of the software images, and distributing a second part of the information to a second plurality of the software images. This has the advantage that the load of collecting diagnostic trace information is spread across the software images while comprehensive trace diagnostic information can still be collected.
In a preferred embodiment, the step of capturing information expires after a predetermined period of time. In an alternative embodiment, the step of capturing information expires after the second failure. These embodiments have the advantage of limiting the period during which additional diagnostic trace information is captured, and therefore limiting the additional cost of capturing it.
In another embodiment, each of the software images further comprises processes or threads; the received information is associated with a first failure in a first one of the processes or threads; the distributed information is distributed to others of the processes or threads; and the captured information is associated with a second failure in another one of the processes or threads.
In another embodiment, the received diagnostic trace information identifies a factor external to the software image as the cause of the first failure. This has the advantage that a failure caused by an external factor, such as a network failure, can lead to additional trace diagnostic information related to that external factor being collected in each software image.
In another embodiment, the method further comprises the step of: after the receiving step, checking whether one or more others of the plurality of software images are executing the same software as the first one of the plurality of software images.
In another embodiment, the method further comprises the steps of: combining the information associated with the first failure in the first one of the plurality of software images with the information associated with the second failure in the other one of the plurality of software images; and analyzing the combined information to determine the cause of the first failure. This combination and analysis of trace diagnostic information allows the cause of the failure to be determined without the need to reproduce the failure on site or on a backup system.
In another embodiment, the step of capturing information continues until the step of analyzing the combined information to determine the cause of the first failure has completed. This allows information from any further failures to be captured while the trace diagnostic information from the earlier failures is being combined and analyzed, but allows capture to stop once the analysis is complete.
Embodiments of the invention also provide a computer system and a computer program product for implementing the above method of capturing diagnostic trace information.
Viewed from a further aspect, the invention provides a computer program product for capturing diagnostic trace information, the computer program product comprising: a computer-readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method for carrying out the steps of the invention.
Viewed from a further aspect, the invention provides a computer program stored on a computer-readable medium and loadable into the internal memory of a digital computer, comprising software code portions for carrying out the steps of the invention when the program is run on a computer.
Viewed from a further aspect, the invention provides a method substantially as described with reference to the figures.
Viewed from a further aspect, the invention provides a system substantially as described with reference to the figures.
Brief description of the drawings
Preferred embodiments of the present invention will now be described in more detail, by way of example only, with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a plurality of software images having a communication mechanism, in which the present invention may be used;
Fig. 2 is a block diagram of one of the software images of Fig. 1;
Fig. 3 is a block diagram of the application software of Fig. 2;
Fig. 4 shows the temporal relationship between the plurality of images of Fig. 1, a first failure event, and a second failure event;
Fig. 5 is a flow chart of the capture of diagnostic trace information according to an embodiment of the present invention; and
Fig. 6 is a flow chart of the analysis of the diagnostic trace information captured by the embodiment of Fig. 5.
Detailed description of the embodiments
With reference to Fig. 1, application servers having software images 102-112 each process data independently and communicate with one another using a communication mechanism 120. The communication mechanism 120 may be a load balancer, a hypervisor, an operating system, or monitoring software. In another embodiment, the communication mechanism 120 may simply be a peer-to-peer communication mechanism.
Fig. 2 shows one of the software images 102 of Fig. 1. Typically, a software image comprises an operating system 202, middleware 204, and application software 206. Any of these elements may be absent from a software image, and other components not mentioned above may be present. In a preferred embodiment, each software image is identical to the other software images. In other embodiments, each software image has components in common with the other software images.
Fig. 3 shows the application software of Fig. 2. Typically, the application software is implemented as a number of processes 302, each of which has a number of threads 304. Although Fig. 3 shows only one process 302 with one thread 304, any number of processes may be executing, and each process may have any number of threads. Each of the executing processes 302 may have a different number of threads 304.
Fig. 4 shows a timeline for the system of Fig. 1. Image 2 104, image 3 106, image 5 110, and image 6 112 each start executing and continue to execute without failure. Image 1 102 starts executing at time 406 and continues to execute until time 408, when a failure occurs. This failure causes a failure event, which in turn causes trace diagnostic information to be recorded in log file 402. The trace diagnostic information is typically set up as straightforward First Failure Data Capture (FFDC) data; that is, it is a general selection of trace diagnostic information, optimized so that the failing software component and any external cause of the failure (such as a process signal or an I/O error) can be identified. Detailed trace diagnostic information is not captured all the time because of the cost of producing it, such as the performance cost, the time taken to produce a dump on failure, and the amount of disk space required to store the diagnostic output.
With reference to Fig. 5, the method of an embodiment of the invention starts at step 502. At step 504, first failure data is received by the communication mechanism. At step 506, a check is made as to whether any other images are running the same software. As mentioned above, in other embodiments each software image has components in common with the other software images. If no other images are running the same software and, optionally, no other images have components in common with the failing image, the method ends at step 512.
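The check at step 506 can be sketched as a set intersection over the components each software image runs. This is a minimal illustration only: the `Image` record, the component names, and `images_to_notify` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical record of one software image and the components it runs."""
    name: str
    components: frozenset


def images_to_notify(failed, others):
    """Return the images that share at least one component with the failed image."""
    return [img for img in others if img.components & failed.components]


# Assumed example data: image2 matches image1, image3 shares nothing with it.
image1 = Image("image1", frozenset({"os", "middleware", "app"}))
image2 = Image("image2", frozenset({"os", "middleware", "app"}))
image3 = Image("image3", frozenset({"other-os", "other-app"}))

targets = images_to_notify(image1, [image2, image3])
```

An empty result corresponds to the branch in which the method ends without distributing anything.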
If other images are running the same software or, optionally, have components in common, then at step 508 the failure event also causes information related to the failure to be passed from image 1 102 to the other images 2 to 6 104-112 through the communication mechanism 120. Images 2 to 6 104-112 run at least some of the same software components as were running in image 1 102 when it failed at time 408. Images 2 to 6 104-112 can then adjust their diagnostic configuration in anticipation of the same failure occurring as occurred in image 1 102. For example, if a particular software component in image 1 102 has been identified as causing the failure, images 2 to 6 104-112 can start more detailed logging of the operation of that particular software component. This may include turning on additional tracing in the software component. As another example, if the cause of the failure in image 1 102 was a shortage of memory, images 2 to 6 104-112 can start logging more detail about memory usage in their own images.
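One way a receiving software image might adjust its diagnostic configuration is sketched below. The shape of the failure information (a dictionary with `component` and `cause` keys) and the per-component level map are assumptions made for illustration; the patent does not specify a format.

```python
import logging


def apply_failure_info(failure_info, levels):
    """Return a new per-component logging-level map raised for the failing area.

    failure_info is a hypothetical dict such as {"component": "middleware"}
    or {"cause": "out_of_memory"}; levels maps component name -> logging level.
    """
    updated = dict(levels)  # leave the caller's baseline untouched
    component = failure_info.get("component")
    if component in updated:
        # A specific component was blamed: trace it in detail.
        updated[component] = logging.DEBUG
    if failure_info.get("cause") == "out_of_memory":
        # Memory shortage was blamed: start detailed memory-usage logging.
        updated["memory"] = logging.DEBUG
    return updated


baseline = {"os": logging.WARNING, "middleware": logging.WARNING, "app": logging.WARNING}
after_component_failure = apply_failure_info({"component": "middleware"}, baseline)
after_memory_failure = apply_failure_info({"cause": "out_of_memory"}, baseline)
```

Returning a new map rather than mutating the baseline makes it easy to restore the earlier level of capture later.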
Fig. 4 also shows a second failure, occurring at time 410 in image 4 108. This failure causes a failure event. At step 510 of Fig. 5, the failure event causes trace diagnostic information to be logged to log file 404. Log file 404 contains more detailed trace diagnostic information about the software component that failed in image 1 102 at time 408, or about the cause of the failure in image 1 102 at time 408. If the cause of the failure in software image 4 108 is the same as, or similar to, the cause of the earlier failure in image 1 102, the more detailed trace diagnostic information captured is likely to be of considerable help in identifying the cause of the failure and the action that should be taken to prevent further failures. In Fig. 5, the method ends at step 512.
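The overall flow of Fig. 5 (receive the first failure, distribute it, capture more detail on the second failure) can be sketched end to end as follows. The classes and method names are hypothetical stand-ins, and the in-memory `log` list takes the place of the log files; this is not the patented implementation.

```python
class CommunicationMechanism:
    """Hypothetical stand-in for the communication mechanism 120."""

    def __init__(self):
        self.images = []

    def register(self, image):
        self.images.append(image)

    def report_failure(self, source, info):
        # Receive failure information and pass it to every other image.
        for image in self.images:
            if image is not source:
                image.detailed_components.add(info["component"])


class SoftwareImage:
    """Hypothetical software image that records what it captured on failure."""

    def __init__(self, name, mechanism):
        self.name = name
        self.mechanism = mechanism
        self.detailed_components = set()
        self.log = []
        mechanism.register(self)

    def fail(self, component):
        # Capture detailed trace only if an earlier failure pointed at this component.
        level = "detailed" if component in self.detailed_components else "general"
        self.log.append((component, level))
        self.mechanism.report_failure(self, {"component": component})


mechanism = CommunicationMechanism()
images = {n: SoftwareImage(f"image{n}", mechanism) for n in range(1, 7)}
images[1].fail("middleware")  # first failure: general FFDC-style capture
images[4].fail("middleware")  # second failure: detailed capture
```

In this sketch image 1 records only a general capture, while image 4, having been told about the first failure, records a detailed one.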
In another embodiment, which may be called a "theory" or "ladder" embodiment, the increased level of capture of trace diagnostic information is load balanced across the images 102-112. Each image is configured to capture more comprehensive trace diagnostic information for a specific part, or several specific parts, of the software stack. Between them, the images 102-112 capture trace diagnostic information for all required parts of the software stack. The images can also be configured to capture any desired subset of the trace diagnostic information, with coverage of that subset divided between some or all of the images.
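Dividing the detailed tracing across software images could be done in many ways; the sketch below uses a simple round-robin split, which is an assumption for illustration rather than something the patent prescribes. The image names and stack parts are likewise hypothetical.

```python
def assign_trace_parts(images, stack_parts):
    """Round-robin the stack parts over the images so every part is traced somewhere."""
    assignment = {image: [] for image in images}
    for index, part in enumerate(stack_parts):
        assignment[images[index % len(images)]].append(part)
    return assignment


parts = ["os", "middleware", "app", "network", "storage"]
plan = assign_trace_parts(["image2", "image3", "image4"], parts)
```

Each image carries only a fraction of the tracing cost, yet the union of the assignments still covers the whole stack.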
In another embodiment, the above method may be applied not across the images 102-112 but across processes 302 or across threads 304. The first process to fail captures trace diagnostic information, which is used to reconfigure what trace diagnostic information the other processes capture if and when they fail. Similarly, the first thread to fail captures trace diagnostic information, which is used to reconfigure what trace diagnostic information the other threads capture if and when they fail. These cross-process and cross-thread methods can be combined with the cross-image method or used on their own.
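The thread-level variant can be sketched with a small shared configuration object guarded by a lock. `SharedTraceConfig`, the worker function, and the `parser` component are hypothetical; the threads are joined one after another only so that the order of failures is deterministic in this example.

```python
import threading


class SharedTraceConfig:
    """Hypothetical trace configuration shared by the threads of one process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._focus = None  # component blamed by the most recent failure

    def record_failure(self, component):
        """Return the capture level for this failure and share its component."""
        with self._lock:
            level = "detailed" if self._focus == component else "general"
            self._focus = component
            return level


config = SharedTraceConfig()
captured = {}


def worker(name, component):
    try:
        raise RuntimeError(f"{component} failed")  # simulated failure
    except RuntimeError:
        captured[name] = config.record_failure(component)


for name in ("thread-1", "thread-2"):
    thread = threading.Thread(target=worker, args=(name, "parser"))
    thread.start()
    thread.join()  # joined immediately so the failure order is fixed
```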
In another embodiment, the reconfigured capture of trace diagnostic information is applied across the other images, processes, or threads for a predetermined period of time after the first failure event, after which the level of capture of trace diagnostic information returns to its level before the first failure or is set to another predetermined level.
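A time-limited increase in capture level can be sketched as below. The `TraceLevel` class and the 300-second window are assumptions for illustration; the clock is injected so that expiry can be demonstrated without waiting.

```python
import time


class TraceLevel:
    """Hypothetical capture level that is raised for a limited period."""

    def __init__(self, baseline="general", clock=time.monotonic):
        self.baseline = baseline
        self._clock = clock
        self._expires_at = None

    def raise_for(self, seconds):
        """Raise the capture level until `seconds` have elapsed."""
        self._expires_at = self._clock() + seconds

    def current(self):
        if self._expires_at is not None and self._clock() < self._expires_at:
            return "detailed"
        return self.baseline


now = [0.0]  # fake clock value, advanced by hand below
level = TraceLevel(clock=lambda: now[0])
level.raise_for(300)      # first failure reported: detailed capture for 300 s
during = level.current()
now[0] = 301.0            # the predetermined period has passed
after = level.current()
```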
In another embodiment, the level of capture of trace diagnostic information on all images returns to its level before the first failure event after a second or subsequent failure event has occurred and/or after sufficient trace diagnostic information has been captured.
In another embodiment, the above method can be applied to software stacks or workloads that are not identical. For example, for a failure caused by a common external factor, such as a network failure, one or more images, processes, or threads can be configured to capture additional trace diagnostic information, with a different configuration for each image, process, or thread as appropriate to the network failure it is expected to see.
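For non-identical workloads, each software image might map a shared external cause to its own additional tracing. The roles, trace names, and fallback in this sketch are all hypothetical and serve only to show the idea of a per-image configuration.

```python
# Hypothetical extra tracing per kind of image for a network failure.
NETWORK_TRACE_BY_ROLE = {
    "web": ["http_requests", "tls_handshakes"],
    "database": ["connection_pool", "query_timeouts"],
}


def extra_trace_for(role, cause):
    """Return the additional trace points an image of this role should enable."""
    if cause != "network_failure":
        return []
    # Roles without a specific entry fall back to generic socket tracing.
    return NETWORK_TRACE_BY_ROLE.get(role, ["socket_errors"])


web_trace = extra_trace_for("web", "network_failure")
batch_trace = extra_trace_for("batch", "network_failure")
```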
Turning to Fig. 6, analysis of the failure using the trace diagnostic information starts at step 602. At step 604, the first failure data and the second failure data are combined. The combined information is then analyzed at step 606. The analysis ends at step 608. In another embodiment, the first failure data is analyzed first, and the second failure data is then analyzed in the light of the findings from the first failure data. The analysis may be carried out in the first image 102, or it may be carried out by one of the other images 104-112 once it has received the failure information from the first image 102.
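The combine-then-analyze steps of Fig. 6 can be sketched as follows. The record format and the analysis heuristic (pick the component with the most error records) are assumptions for illustration; the patent does not say how the analysis is performed.

```python
from collections import Counter


def combine(first_failure, second_failure):
    """Merge the trace records captured for the two failures into one list."""
    return list(first_failure) + list(second_failure)


def most_suspect_component(records):
    """Return the component with the most error records, or None if there are none."""
    counts = Counter(r["component"] for r in records if r["severity"] == "error")
    return counts.most_common(1)[0][0] if counts else None


# Assumed example data: a sparse first capture and a more detailed second one.
first = [{"component": "middleware", "severity": "error"}]
second = [
    {"component": "middleware", "severity": "error"},
    {"component": "os", "severity": "info"},
    {"component": "app", "severity": "error"},
]

combined = combine(first, second)
suspect = most_suspect_component(combined)
```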
In another embodiment, images 102-112 that are started or restarted after a failure can also be configured to capture an increased level of trace diagnostic information.
Those skilled in the art will appreciate that aspects of the present invention may be embodied as a system, method, computer program, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a "circuit", "module", or "system". Furthermore, in some embodiments, aspects of the present invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer-readable media may be used. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of these. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these. In this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of these. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of these.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Below with reference to describing the present invention according to process flow diagram and/or the block diagram of the method for the embodiment of the present invention, device (system) and computer program.Should be appreciated that the combination of each square frame in each square frame of process flow diagram and/or block diagram and process flow diagram and/or block diagram, can be realized by computer program instructions.These computer program instructions can offer the processor of multi-purpose computer, special purpose computer or other programmable data treating apparatus, thereby produce a kind of machine, make these computer program instructions when the processor by computing machine or other programmable data treating apparatus is carried out, produced the device of the function/action of stipulating in the one or more square frames in realization flow figure and/or block diagram.
Also these computer program instructions can be stored in computer-readable medium, these instructions make computing machine, other programmable data treating apparatus or other equipment with ad hoc fashion work, thereby the instruction being stored in computer-readable medium just produces the manufacture (article of manufacture) of the instruction of the function/action of stipulating in the one or more square frames that comprise in realization flow figure and/or block diagram.
Computer program instructions can also be loaded into computing machine, other programmable data treating apparatus or other equipment, so that sequence of operations step is carried out on computing machine, other programmable devices or other equipment, to produce computer implemented processing, the instruction that makes to carry out on computing machine or other programmable devices is provided for realizing the function/action of appointment in process flow diagram and/or calcspar square or a plurality of square.
Process flow diagram in accompanying drawing and block diagram have shown the system according to a plurality of embodiment of the present invention, architectural framework in the cards, function and the operation of method and computer program product.In this, each square frame in process flow diagram or block diagram can represent a part for module, program segment or a code, and a part for described module, program segment or code comprises one or more for realizing the executable instruction of the logic function of regulation.Also it should be noted that what the function marking in square frame also can be marked to be different from accompanying drawing occurs in sequence in some realization as an alternative.For example, in fact two continuous square frames can be carried out substantially concurrently, and they also can be carried out by contrary order sometimes, and this determines according to related function.Also be noted that, each square frame in block diagram and/or process flow diagram and the combination of the square frame in block diagram and/or process flow diagram, can realize by the special-purpose hardware based system of the function putting rules into practice or action, or can realize with the combination of specialized hardware and computer instruction.
For the avoidance of doubt, the term "comprising", as used herein throughout the description and claims, is not to be construed as meaning "consisting only of".

Claims (15)

1. A method for capturing diagnostic trace information, the method being for use in a computer system having a plurality of software images, the method comprising the steps of:
receiving information associated with a first failure in a first one of the plurality of software images;
distributing the information to others of the plurality of software images; and
capturing information associated with a second failure in another of the plurality of software images.
2. The method according to claim 1, wherein the step of distributing the information is carried out by one of a load balancer, a hypervisor, an operating system, monitoring software, or a peer-to-peer communication mechanism.
3. The method according to any one of claims 1 to 2, wherein the step of distributing the information to others of the plurality of software images comprises: distributing a first portion of the information to a first plurality of the software images, and distributing a second portion of the information to a second plurality of the software images.
4. The method according to any one of claims 1 to 3, wherein the step of capturing information expires after a predetermined period of time.
5. The method according to any one of claims 1 to 3, wherein the step of capturing information expires after the second failure.
6. The method according to any one of claims 1 to 5, wherein:
each of the software images further comprises processes or threads;
the received information is associated with a first failure in a first one of the processes or threads;
the distributed information is distributed to others of the processes or threads; and
the captured information is associated with a second failure in another of the processes or threads.
7. The method according to any one of claims 1 to 6, wherein the received information identifies a factor external to the software image as the cause of the first failure.
8. The method according to any one of claims 1 to 7, further comprising the step of: after the receiving step, checking whether one or more others of the plurality of software images are executing the same software as the first one of the plurality of software images.
9. The method according to any one of claims 1 to 8, further comprising the steps of:
combining the information associated with the first failure in the first one of the plurality of software images with the information associated with the second failure in another of the plurality of software images; and
analysing the combined information to determine a cause of the first failure.
10. The method according to claim 9, wherein the step of capturing information continues until the step of analysing the combined information to determine the cause of the first failure has completed.
11. A computer system, comprising:
a plurality of software images;
a log file comprising diagnostic trace information associated with a first failure in a first one of the plurality of software images; and
a communication mechanism for distributing information from the log file to others of the plurality of software images;
wherein the others of the plurality of software images capture information associated with a second failure in another of the plurality of software images.
12. The computer system according to claim 11, wherein the communication mechanism distributes a first portion of the information to a first plurality of the software images, and distributes a second portion of the information to a second plurality of the software images.
13. The computer system according to any one of claims 11 to 12, wherein:
each of the software images further comprises processes or threads;
the received information is associated with a first failure in a first one of the processes or threads;
the distributed information is distributed to others of the processes or threads; and
the captured information is associated with a second failure in another of the processes or threads.
14. The computer system according to any one of claims 11 to 13, wherein one of the communication mechanism or one of the plurality of software images:
combines the information associated with the first failure in the first one of the plurality of software images with the information associated with the second failure in another of the plurality of software images; and
analyses the combined information to determine a cause of the first failure.
15. A system for capturing diagnostic trace information according to any one of claims 1 to 10.
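The claimed method (receive first-failure information, distribute it to the other software images, capture data on a second failure) can be illustrated with a minimal Python sketch. Every name here (`FailureInfo`, `SoftwareImage`, `Distributor`, `arm`, `on_failure`, `combine`, the `component` field and the `ttl` parameter) is a hypothetical chosen for illustration and does not appear in the patent. Matching failures by component name is likewise an assumption, not the claimed mechanism. Time is passed in explicitly as `now` so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureInfo:
    """Diagnostic trace information about one failure (fields are illustrative)."""
    image_id: str
    component: str
    detail: str


@dataclass
class SoftwareImage:
    """One of the co-operating software images."""
    image_id: str
    software: str
    watch: Optional[FailureInfo] = None   # first-failure information received, if any
    deadline: float = 0.0                 # end of the capture window
    captured: List[FailureInfo] = field(default_factory=list)

    def arm(self, info: FailureInfo, ttl: float, now: float) -> None:
        """Start watching for a failure similar to the distributed first failure."""
        self.watch = info
        self.deadline = now + ttl

    def on_failure(self, component: str, detail: str, now: float) -> Optional[FailureInfo]:
        """Capture second-failure data if this image is armed and the window is open."""
        if self.watch is None:
            return None
        if now > self.deadline:
            self.watch = None             # capture expires after a predetermined period
            return None
        if component != self.watch.component:
            return None                   # unrelated failure: not captured in this sketch
        second = FailureInfo(self.image_id, component, detail)
        self.captured.append(second)
        self.watch = None                 # capture expires after the second failure
        return second


class Distributor:
    """Stand-in for a load balancer, hypervisor or monitoring software."""

    def __init__(self, images: List[SoftwareImage]) -> None:
        self.images = {image.image_id: image for image in images}

    def report_first_failure(self, info: FailureInfo, ttl: float, now: float) -> List[str]:
        """Receive first-failure information and distribute it to the other images."""
        source = self.images[info.image_id]
        armed = []
        for image in self.images.values():
            # Only images running the same software as the failed image are armed.
            if image.image_id != source.image_id and image.software == source.software:
                image.arm(info, ttl, now)
                armed.append(image.image_id)
        return armed


def combine(first: FailureInfo, second: FailureInfo) -> List[FailureInfo]:
    """Put first- and second-failure information side by side for later analysis."""
    return [first, second]


images = [SoftwareImage("a", "app-v1"), SoftwareImage("b", "app-v1"), SoftwareImage("c", "other")]
distributor = Distributor(images)
first = FailureInfo("a", "db-pool", "connection timeout")
armed = distributor.report_first_failure(first, ttl=60.0, now=0.0)
second = images[1].on_failure("db-pool", "connection timeout", now=10.0)
```

In this sketch image "b" is armed because it runs the same software as the failed image "a", while image "c" is left alone. The sketch clears the watch both on timeout and after the first captured failure, which merges two alternative expiry conditions from the dependent claims into one illustration; a real system would likely choose one of them.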
CN201310343980.0A 2012-08-08 2013-08-08 Method and computer system for capturing diagnosis tracking information Active CN103577273B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1214159.4 2012-08-08
GB1214159.4A GB2504728A (en) 2012-08-08 2012-08-08 Second failure data capture in co-operating multi-image systems

Publications (2)

Publication Number Publication Date
CN103577273A (en) 2014-02-12
CN103577273B (en) 2017-06-06

Family

ID=46935094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310343980.0A Active CN103577273B (en) 2012-08-08 2013-08-08 Method and computer system for capturing diagnosis tracking information

Country Status (3)

Country Link
US (4) US9436590B2 (en)
CN (1) CN103577273B (en)
GB (1) GB2504728A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013162596A1 (en) * 2012-04-27 2013-10-31 Hewlett-Packard Development Company, L.P. Mapping application dependencies at runtime
US10970152B2 (en) 2017-11-21 2021-04-06 International Business Machines Corporation Notification of network connection errors between connected software systems
US10684910B2 (en) * 2018-04-17 2020-06-16 International Business Machines Corporation Intelligent responding to error screen associated errors
JP7367495B2 (en) * 2019-11-29 2023-10-24 富士通株式会社 Information processing equipment and communication cable log information collection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1077037A (en) * 1992-03-06 1993-10-06 国际商业机器公司 Multi-media computer diagnostic system
CN101226495A (en) * 2007-01-19 2008-07-23 国际商业机器公司 System and method for the capture and preservation of intermediate error state data
US20080222456A1 (en) * 2007-03-05 2008-09-11 Angela Richards Jones Method and System for Implementing Dependency Aware First Failure Data Capture

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5761739A (en) * 1993-06-08 1998-06-02 International Business Machines Corporation Methods and systems for creating a storage dump within a coupling facility of a multisystem enviroment
US6651183B1 (en) * 1999-10-28 2003-11-18 International Business Machines Corporation Technique for referencing failure information representative of multiple related failures in a distributed computing environment
CA2315449A1 (en) * 2000-08-10 2002-02-10 Ibm Canada Limited-Ibm Canada Limitee Generation of runtime execution traces of applications and associated problem determination
US6813731B2 (en) * 2001-02-26 2004-11-02 Emc Corporation Methods and apparatus for accessing trace data
US7120685B2 (en) * 2001-06-26 2006-10-10 International Business Machines Corporation Method and apparatus for dynamic configurable logging of activities in a distributed computing system
US6779132B2 (en) * 2001-08-31 2004-08-17 Bull Hn Information Systems Inc. Preserving dump capability after a fault-on-fault or related type failure in a fault tolerant computer system
US7080287B2 (en) * 2002-07-11 2006-07-18 International Business Machines Corporation First failure data capture
US7840856B2 (en) * 2002-11-07 2010-11-23 International Business Machines Corporation Object introspection for first failure data capture
CA2433750A1 (en) * 2003-06-27 2004-12-27 Ibm Canada Limited - Ibm Canada Limitee Automatic collection of trace detail and history data
GB0412104D0 (en) 2004-05-29 2004-06-30 Ibm Apparatus method and program for recording diagnostic trace information
US7519510B2 (en) * 2004-11-18 2009-04-14 International Business Machines Corporation Derivative performance counter mechanism
US7383471B2 (en) * 2004-12-28 2008-06-03 Hewlett-Packard Development Company, L.P. Diagnostic memory dumping
US20060195731A1 (en) * 2005-02-17 2006-08-31 International Business Machines Corporation First failure data capture based on threshold violation
US7487407B2 (en) * 2005-07-12 2009-02-03 International Business Machines Corporation Identification of root cause for a transaction response time problem in a distributed environment
WO2007088575A1 (en) * 2006-01-31 2007-08-09 Fujitsu Limited System monitor device control method, program, and computer system
US8949671B2 (en) * 2008-01-30 2015-02-03 International Business Machines Corporation Fault detection, diagnosis, and prevention for complex computing systems
US8381014B2 (en) * 2010-05-06 2013-02-19 International Business Machines Corporation Node controller first failure error management for a distributed system
WO2012026035A1 (en) 2010-08-27 2012-03-01 富士通株式会社 Fault processing method, fault processing system, fault processing device and fault processing program
JP5252014B2 (en) * 2011-03-15 2013-07-31 オムロン株式会社 Control device, control system, tool device, and collection instruction program
US8615676B2 (en) * 2011-03-24 2013-12-24 International Business Machines Corporation Providing first field data capture in a virtual input/output server (VIOS) cluster environment with cluster-aware vioses

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105988882A (en) * 2015-02-12 2016-10-05 广东欧珀移动通信有限公司 Application software fault recovery method and terminal equipment
CN105988882B (en) * 2015-02-12 2019-08-27 Oppo广东移动通信有限公司 A kind of application software fault repairing method and terminal device
CN109757771A (en) * 2019-02-22 2019-05-17 红云红河烟草(集团)有限责任公司 Filter-stick forming device shuts down duration calculation method and computing device

Also Published As

Publication number Publication date
US20160196177A1 (en) 2016-07-07
US20140047280A1 (en) 2014-02-13
GB201214159D0 (en) 2012-09-19
US20160203037A1 (en) 2016-07-14
US9852051B2 (en) 2017-12-26
GB2504728A (en) 2014-02-12
US9921950B2 (en) 2018-03-20
US9436590B2 (en) 2016-09-06
CN103577273B (en) 2017-06-06
US9424170B2 (en) 2016-08-23
US20140372808A1 (en) 2014-12-18

Similar Documents

Publication Publication Date Title
CN103577273A (en) Second failure data capture in co-operating multi-image systems
CN111752799A (en) Service link tracking method, device, equipment and storage medium
CN105556482A (en) Monitoring mobile application performance
US20160274997A1 (en) End user monitoring to automate issue tracking
US9276819B2 (en) Network traffic monitoring
CN107045475B (en) Test method and device
CN103544095A (en) Server program monitoring method and system of server program
CN108762966A (en) System exception hold-up interception method, device, computer equipment and storage medium
CN112333044B (en) Shunting equipment performance test method, device and system, electronic equipment and medium
CN103514075A (en) Method and device for monitoring API function calling in mobile terminal
CN112860569A (en) Automatic testing method and device, electronic equipment and storage medium
CN110515821A (en) Event processing method based on tracking points, electronic device and computer storage medium
CN114745295A (en) Data acquisition method, device, equipment and readable storage medium
CN111970151A (en) Flow fault positioning method and system for virtual and container network
CN116431443A (en) Log recording method, device, computer equipment and computer readable storage medium
US10462234B2 (en) Application resilience system and method thereof for applications deployed on platform
CN115658500A (en) Vue-based front-end error log uploading method and system in hybrid development
US10432472B1 (en) Network operation center (NOC) tool pattern detection and trigger to real-time monitoring operation mode
KR101828156B1 (en) Transaction Monitoring System and Operating method thereof
CN112799910A (en) Hierarchical monitoring method and device
US20080154657A1 (en) System for monitoring order fulfillment of telecommunication services
CN113542796B (en) Video evaluation method, device, computer equipment and storage medium
CN107577546B (en) Information processing method and device and electronic equipment
CN117687870A (en) Mobile terminal white screen monitoring method, system, electronic equipment and medium
CN114245052A (en) Video data storage method and device, storage medium and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant