CN103577273A

CN103577273A - Second failure data capture in co-operating multi-image systems

Info

Publication number: CN103577273A
Application number: CN201310343980.0A
Authority: CN
Inventors: R. N. Chamberlain; A. J. Pilkington; H. J. Hellyer; M. F. Peters
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2012-08-08
Filing date: 2013-08-08
Publication date: 2014-02-12
Anticipated expiration: 2033-08-08
Also published as: GB201214159D0; US9852051B2; GB2504728A; US20140047280A1; US9921950B2; US20140372808A1; US20160203037A1; US9436590B2; US9424170B2; CN103577273B; US20160196177A1

Abstract

A method, computer system, and computer program capture diagnostic trace information in a computer system having a plurality of software images. Information associated with a first failure in a first one of the plurality of software images is received. The received information is distributed to others of the plurality of software images. Further information associated with a second failure in another one of the plurality of software images is captured. Distributing the information comprises distributing a first part of the information to a first plurality of the software images and distributing a second part of the information to a second plurality of the software images.

Description

Second failure data capture in co-operating multi-image systems

Technical field

The present invention relates to the automated capture of diagnostic data in computer systems, and more particularly to the automated capture of diagnostic data in co-operating multi-image computer systems.

Background art

Automated capture of diagnostic data is well known in computer systems. In particular, it is commonly used in complex and/or long-running applications to allow problems to be resolved quickly without having to reproduce the failure on live or backup systems. A known solution is to provide First Failure Data Capture (FFDC) in the form of dumps, logs, and trace files, with data capture triggered when a problem is detected.

A problem with this known solution is that there is a trade-off between the need to obtain enough diagnostic information to analyze and resolve the problem and the cost of producing that diagnostic information. The cost of producing diagnostic information can include a) the performance cost to the application of continuously logging and tracing, b) the time taken to produce dumps on failure (which may delay restarting the application), and c) the amount of disk space required to store the diagnostic output.

WO2012/026035A discloses a fault processing system having: a storage location information acquiring unit for acquiring, from a storage unit of a component in which a fault has occurred, storage location information indicating where the failure information generated when the fault occurred is stored; a failure information acquiring unit for acquiring, based on the storage location information, the failure information generated when the fault occurred in an information processing device, from a storage device that is connected so that it can communicate with the information processing device and with a fault handling device; and a configuration control unit for modifying the configuration of the fault handling device to match the information processing device, based on the acquired failure information. The fault processing system can thus easily reproduce a fault that occurred in the information processing device, so that reproduction tests can be carried out efficiently.

Therefore, there is a need in the art to address the aforementioned problems.

Summary of the invention

Embodiments of the present invention provide a method for capturing diagnostic trace information, the method for use in a computer system having a plurality of software images, the method comprising the steps of: receiving information associated with a first failure in a first one of the plurality of software images; distributing the information to other software images of the plurality of software images; and capturing information associated with a second failure in another one of the plurality of software images. An advantage of this method is that the cost of capturing trace diagnostic information is minimized until a first failure occurs. Thereafter, the value of the captured trace diagnostic information is maximized, and the cost of capturing it is kept low by capturing only the detailed trace diagnostic information related to the first failure.

In an embodiment, the step of distributing the information is performed by one of a load balancer, a hypervisor, an operating system, monitoring software, or a peer-to-peer communication mechanism.

In a preferred embodiment, the step of distributing the information to other software images of the plurality of software images comprises distributing a first part of the information to a first plurality of the software images and distributing a second part of the information to a second plurality of the software images. This has the advantage that the load of collecting diagnostic trace information is spread across the software images while still allowing comprehensive trace diagnostic information to be collected.

In a preferred embodiment, the step of capturing information expires after a predetermined period of time. In an alternative embodiment, the step of capturing information expires after the second failure. These embodiments have the advantage of limiting the period during which additional diagnostic trace information is captured, and therefore of limiting the additional cost of capturing it.

In another embodiment, each of the software images further comprises processes or threads; the received information is associated with a first failure in a first one of the processes or threads; the distributed information is distributed to others of the processes or threads; and the captured information is associated with a second failure in another one of the processes or threads.

In another embodiment, the received diagnostic trace information identifies a factor external to the software images as the cause of the first failure. This has the advantage that a failure caused by an external factor (such as a network failure) can cause additional trace diagnostic information related to that external factor to be collected in each software image.

In another embodiment, the method further comprises the step of: after the receiving step, checking whether one or more other software images of the plurality of software images are executing the same software as the first software image of the plurality of software images.

In another embodiment, the method further comprises the steps of: combining the information associated with the first failure in the first one of the plurality of software images with the information associated with the second failure in the other one of the plurality of software images; and analyzing the combined information to determine the cause of the first failure. Combining and analyzing the trace diagnostic information in this way allows the cause of the failure to be determined without having to reproduce the failure on live or backup systems.

In another embodiment, the step of capturing information continues until the step of analyzing the combined information to determine the cause of the first failure has completed. This allows information from any further failures to be captured while the trace diagnostic information from earlier failures is being combined and analyzed, and allows capture to stop once the analysis is complete.

Embodiments of the present invention also provide a computer system and a computer program for carrying out the above method of capturing diagnostic trace information.

Viewed from a further aspect, the present invention provides a computer program product for capturing diagnostic trace information, the computer program product comprising a computer-readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method for performing the steps of the invention.

Viewed from a further aspect, the present invention provides a computer program stored on a computer-readable medium and loadable into the internal memory of a digital computer, comprising software code portions for performing the steps of the invention when the program is run on a computer.

Viewed from a further aspect, the present invention provides a method substantially as described with reference to the figures.

Viewed from a further aspect, the present invention provides a system substantially as described with reference to the figures.

Brief description of the drawings

Preferred embodiments of the present invention will now be described in more detail, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 is a block diagram of a plurality of software images having a communication mechanism, in which the present invention may be used;

Fig. 2 is a block diagram of one of the software images of Fig. 1;

Fig. 3 is a block diagram of the application software of Fig. 2;

Fig. 4 shows the temporal relationship between the plurality of images of Fig. 1, a first failure event, and a second failure event;

Fig. 5 is a flow chart of capturing diagnostic trace information according to an embodiment of the present invention; and

Fig. 6 is a flow chart of analyzing the diagnostic trace information captured by the embodiment of Fig. 5.

Detailed description of the embodiments

Referring to Fig. 1, application servers having software images 102-112 each run independently, processing data, and communicate with each other using a communication mechanism 120. The communication mechanism 120 may be a load balancer, a hypervisor, an operating system, or monitoring software. In another embodiment, the communication mechanism 120 may simply be a peer-to-peer communication mechanism.

Fig. 2 shows one of the software images 102 of Fig. 1. Typically, a software image comprises an operating system 202, middleware 204, and application software 206. Any of these elements may be absent from a software image, and other components not mentioned above may be present. In a preferred embodiment, each software image is identical to the other software images. In other embodiments, each software image has components in common with the other software images.

Fig. 3 shows the application software of Fig. 2. Typically, the application software is implemented as a plurality of processes 302, each of which has a plurality of threads 304. Although Fig. 3 shows only one process 302 with one thread 304, any number of processes may be executing, and each process may have any number of threads. Each of the executing processes 302 may have a different number of threads 304.

Fig. 4 shows a timeline for the system of Fig. 1. Image 2 104, image 3 106, image 5 110, and image 6 112 each start executing and continue to execute without failure. Image 1 102 starts executing at time 406. It executes continuously until time 408, when a failure occurs. The failure results in a failure event, which causes trace diagnostic information to be recorded in log file 402. The trace diagnostic information is typically First Failure Data Capture (FFDC) data that is always enabled; that is, it is a general selection of trace diagnostic information, optimized so that the failing software component and any external cause of the failure (such as a process signal or an I/O error) can be identified. Detailed trace diagnostic information is not captured all the time because of the cost of producing it, such as the performance cost, the time taken to produce dumps on failure, and the amount of disk space required to store the diagnostic output.

Referring to Fig. 5, the method of an embodiment of the present invention starts at step 502. At step 504, first failure data is received by the communication mechanism. A check is made at step 506 to see whether there are any other images running the same software. As described above, in other embodiments each software image has components in common with the other software images. If there are no other images running the same software and, optionally, no other images that have components in common with the failing image, the method ends at step 512.
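By way of illustration only, the check at step 506 can be sketched in Python as a set intersection over component names. The `Image` class, the `images_to_notify` function, and the component names are hypothetical and are not taken from the embodiments; the sketch assumes that each image can list the software components it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """Hypothetical description of a software image and the components it runs."""
    name: str
    components: frozenset


def images_to_notify(failed, others):
    """Return the images sharing at least one software component with the failed image."""
    return [img for img in others if img.components & failed.components]


# Illustrative data: image2 matches image1, image3 has nothing in common with it.
image1 = Image("image1", frozenset({"os", "middleware", "app"}))
image2 = Image("image2", frozenset({"os", "middleware", "app"}))
image3 = Image("image3", frozenset({"other-os"}))

targets = images_to_notify(image1, [image2, image3])
```

An empty result corresponds to the branch in which the method ends without distributing anything.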

If there are other images running the same software or, optionally, having components in common, then at step 508 the failure event also causes information related to the failure to be passed from image 1 102 to the other images 2 to 6 104-112 through the communication mechanism 120. These images 2 to 6 104-112 are running at least some of the same software components as were running in image 1 102, which failed at time 408. Images 2 to 6 104-112 can then adjust their diagnostic configuration in anticipation of the failure that occurred in image 1 102 also occurring in them. For example, if a particular software component in image 1 102 has been identified as causing the failure, images 2 to 6 104-112 can enable more detailed logging of the operation of that particular software component. This may include turning on additional tracing in the software component. As another example, if the cause of the failure in image 1 102 was a lack of memory, images 2 to 6 104-112 can start logging more detail about memory usage in their own images.
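A minimal Python sketch of step 508 follows, for illustration only. It assumes that the failure information names the component blamed for the first failure; the `ImageDiagnostics` class, the `distribute` function, and the keys of the notice dictionary are all hypothetical. The level constants come from Python's standard `logging` module and merely stand in for whatever trace levels a real image would use.

```python
import logging


class ImageDiagnostics:
    """Hypothetical per-image diagnostic configuration, keyed by component name."""

    def __init__(self, name, components):
        self.name = name
        # Every component starts at a cheap, general-purpose capture level.
        self.levels = {c: logging.WARNING for c in components}

    def on_failure_notice(self, notice):
        """Raise the capture level for the component blamed for the first failure."""
        component = notice["component"]
        if component in self.levels:
            self.levels[component] = logging.DEBUG


def distribute(notice, images):
    """Stand-in for communication mechanism 120: forward the notice to every other image."""
    for image in images:
        if image.name != notice["image"]:
            image.on_failure_notice(notice)


images = [ImageDiagnostics(f"image{i}", ["os", "middleware", "app"]) for i in range(1, 7)]
distribute({"image": "image1", "component": "middleware"}, images)
```

Only the blamed component is traced in more detail, so the other components keep their low-cost default level.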

Fig. 4 also shows a second failure occurring in image 4 108 at time 410. This failure results in a failure event. At step 510 in Fig. 5, the failure event causes trace diagnostic information to be logged to log file 404. Log file 404 includes more detailed trace diagnostic information about the software component that failed in image 1 102 at time 408, or about the cause of the failure in image 1 102 at time 408. If the cause of the failure in software image 4 108 is the same as or similar to the cause of the earlier failure in image 1 102, the more detailed trace diagnostic information that has been captured is likely to be of considerable help in identifying the cause of the failure and the action that should be taken to prevent further failures. The method of Fig. 5 ends at step 512.

In another embodiment, which may be called a "theory" or "ladder" embodiment, the increased level of capture of trace diagnostic information is load balanced across the images 102-112. Each image is configured to capture more comprehensive trace diagnostic information for a specific part, or several specific parts, of the software stack. Between them, the images 102-112 capture trace diagnostic information for all the required parts of the software stack. The images may also be configured to capture any desired subset of the trace diagnostic information, with coverage of that subset divided between some or all of the images.
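One possible way to divide coverage is sketched below in Python, for illustration only. The round-robin assignment is an assumption of this sketch; the embodiment does not prescribe how the parts of the software stack are allocated to images. The `stripe_components` function and the component names are hypothetical.

```python
def stripe_components(components, image_names):
    """Assign each stack component to one image in turn, so that together
    the images cover every component while each traces only a few."""
    assignment = {name: [] for name in image_names}
    for index, component in enumerate(components):
        owner = image_names[index % len(image_names)]
        assignment[owner].append(component)
    return assignment


stack = ["os", "network", "middleware", "database", "app"]
plan = stripe_components(stack, ["image2", "image3", "image4"])
```

With five components and three images, no image traces more than two components in detail, yet every component is traced by some image.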

In another embodiment, the method described above may be applied not across images 102-112, but across processes 302 or across threads 304. The first process to fail captures trace diagnostic information, which is used to reconfigure what trace diagnostic information is captured by the other processes if and when they fail. Similarly, the first thread to fail may capture trace diagnostic information, which is used to reconfigure what trace diagnostic information is captured by the other threads if and when they fail. These cross-process and cross-thread methods may be combined with the cross-image method or used separately.

In another embodiment, the reconfigured capture of trace diagnostic information may be applied across the other images, processes, or threads for a predetermined period of time after the first failure event, before the level of capture of trace diagnostic information returns to its level before the first failure or is set to another predetermined level.
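The time-limited capture can be sketched in Python as follows, for illustration only. The `ExpiringTraceLevel` class and its method names are hypothetical, and the clock is injectable purely so that the example can be exercised without waiting; this sketch returns to the original level on expiry, which is one of the two options described.

```python
import logging
import time


class ExpiringTraceLevel:
    """Hypothetical helper: an elevated capture level that lapses after a fixed period."""

    def __init__(self, baseline=logging.WARNING, clock=time.monotonic):
        self.baseline = baseline
        self._clock = clock
        self._elevated = None
        self._expires_at = 0.0

    def elevate(self, level, duration_seconds):
        """Start capturing at `level` for `duration_seconds` from now."""
        self._elevated = level
        self._expires_at = self._clock() + duration_seconds

    def current(self):
        """Return the level in force, dropping back to the baseline once expired."""
        if self._elevated is not None and self._clock() >= self._expires_at:
            self._elevated = None
        return self.baseline if self._elevated is None else self._elevated


# Drive the helper with a fake clock instead of real elapsed time.
now = [0.0]
trace = ExpiringTraceLevel(clock=lambda: now[0])
trace.elevate(logging.DEBUG, duration_seconds=600)
level_during = trace.current()
now[0] = 601.0
level_after = trace.current()
```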

In another embodiment, the level of capture of trace diagnostic information on all images returns to its level before the first failure event after a second or subsequent failure event has occurred and/or after sufficient trace diagnostic information has been captured.

In another embodiment, the method described above may be applied to software stacks or workloads that are not identical. For example, for a failure caused by a common external factor (such as a network failure), one or more images, processes, or threads may be configured to capture additional trace diagnostic information, with a different configuration for each image, process, or thread as appropriate to the network failure it is expected to see.

Turning to Fig. 6, analysis of the failure using the trace diagnostic information starts at step 602. At step 604, the first failure data and the second failure data are combined. The combined information is then analyzed at step 606. The analysis ends at step 608. In another embodiment, the first failure data is analyzed first, and the second failure data is then analyzed in the light of the findings from the first failure data. The analysis may be carried out in the first image 102, or may be carried out by the other images 104-112 when they receive the failure information from the first image 102.
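Steps 604 and 606 can be sketched in Python as follows, for illustration only. The record layout (`time`, `component`, `level`) and both function names are hypothetical, and counting error records per component is a deliberately naive stand-in for whatever analysis is actually performed at step 606.

```python
from collections import Counter


def combine(first_failure, second_failure):
    """Step 604: merge both sets of trace records into one time-ordered list."""
    return sorted(first_failure + second_failure, key=lambda record: record["time"])


def most_implicated_component(records):
    """Step 606 (naive stand-in): pick the component with the most error records."""
    counts = Counter(r["component"] for r in records if r["level"] == "error")
    return counts.most_common(1)[0][0] if counts else None


# Illustrative contents for log file 402 (first failure) and log file 404 (second failure).
log_402 = [{"time": 408, "component": "middleware", "level": "error"}]
log_404 = [
    {"time": 409, "component": "app", "level": "debug"},
    {"time": 410, "component": "middleware", "level": "error"},
    {"time": 410, "component": "os", "level": "error"},
]

combined = combine(log_402, log_404)
suspect = most_implicated_component(combined)
```

In this made-up data the same component appears in the error records of both failures, so it is reported as the suspect.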

In another embodiment, images 102-112 that are started or restarted after the failure may also be configured to capture the increased level of trace diagnostic information.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, computer program product, or computer program. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects, all of which may generally be referred to herein as a "circuit", "module", or "system". Furthermore, in some embodiments, aspects of the present invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.

Any combination of one or more computer-readable media may be utilized. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

For the avoidance of doubt, the term "comprising", as used herein throughout the description and claims, is not to be construed as meaning "consisting only of".

Claims

1. A method for capturing diagnostic trace information, the method for use in a computer system having a plurality of software images, the method comprising the steps of:

receiving information associated with a first failure in a first one of the plurality of software images;

distributing the information to other software images of the plurality of software images; and

capturing information associated with a second failure in another one of the plurality of software images.

2. The method according to claim 1, wherein the step of distributing the information is performed by one of a load balancer, a hypervisor, an operating system, monitoring software, or a peer-to-peer communication mechanism.

3. The method according to any one of claims 1 to 2, wherein the step of distributing the information to other software images of the plurality of software images comprises distributing a first part of the information to a first plurality of the software images and distributing a second part of the information to a second plurality of the software images.

4. The method according to any one of claims 1 to 3, wherein the step of capturing information expires after a predetermined period of time.

5. The method according to any one of claims 1 to 3, wherein the step of capturing information expires after the second failure.

6. The method according to any one of claims 1 to 5, wherein:

each of the software images further comprises processes or threads;

the received information is associated with a first failure in a first one of the processes or threads;

the distributed information is distributed to others of the processes or threads; and

the captured information is associated with a second failure in another one of the processes or threads.

7. The method according to any one of claims 1 to 6, wherein the received information identifies a factor external to the software images as the cause of the first failure.

8. The method according to any one of claims 1 to 7, further comprising the step of: after the receiving step, checking whether one or more other software images of the plurality of software images are executing the same software as the first software image of the plurality of software images.

9. The method according to any one of claims 1 to 8, further comprising the steps of:

combining the information associated with the first failure in the first one of the plurality of software images with the information associated with the second failure in the other one of the plurality of software images; and

analyzing the combined information to determine the cause of the first failure.

10. The method according to claim 9, wherein the step of capturing information continues until the step of analyzing the combined information to determine the cause of the first failure has completed.

11. A computer system, comprising:

a plurality of software images;

a log file comprising trace diagnostic information associated with a first failure in a first one of the plurality of software images; and

a communication mechanism for distributing information from the log file to other software images of the plurality of software images;

wherein the other software images of the plurality of software images capture information associated with a second failure in another one of the plurality of software images.

12. The computer system according to claim 11, wherein the communication mechanism distributes a first part of the information to a first plurality of the software images and distributes a second part of the information to a second plurality of the software images.

13. The computer system according to any one of claims 11 to 12, wherein:

each of the software images further comprises processes or threads; and

14. The computer system according to any one of claims 11 to 13, wherein one of the communication mechanism or one of the plurality of software images:

15. A system for capturing diagnostic trace information according to any one of claims 1 to 10.