CN101201786B

CN101201786B - Method and device for monitoring fault log

Info

Publication number: CN101201786B
Application number: CN200610165154A
Authority: CN
Inventors: 田丰
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2006-12-13
Filing date: 2006-12-13
Publication date: 2010-05-19
Anticipated expiration: 2026-12-13
Also published as: CN101201786A

Abstract

The invention discloses a fault log monitoring method by which a management host monitors a controlled machine. The method includes: a configuration step, in which fault analysis parameters are set for the controlled machine; a calling step, in which a call is made to generate the fault log of the controlled machine; a filtering step, in which the fault information in the fault log is filtered according to the fault analysis parameters and an alarm level is assigned to the filtered fault information according to set rules; and an alarm step, in which an alarm message carrying the alarm level is sent to the management host for the filtered fault information. The monitoring method provides automatic and accurate monitoring of fault logs, so that many service incidents can be avoided and heavy losses reduced.

Description

Method and device for monitoring a fault log

Technical field

The present invention relates to fault handling technology for information equipment, and in particular to a method for monitoring fault logs.

Background technology

Information equipment such as computers and servers is widely used in all trades and professions. Because of defects in the equipment's own software and hardware, or because of user operating errors, equipment faults or latent faults inevitably arise; if they are not found and eliminated in time, they tend to cause serious accidents and losses. For some information equipment, a common fault monitoring method is to use a logging tool to record the large volume of information produced while the equipment runs, and to locate faults by analyzing the fault information in the log. However, the log information produced by running equipment is often massive, and because practical applications are diverse and complex, the types of faults and suspected faults are also numerous, so ordinary personnel cannot find faults accurately and efficiently in the mass of log information. The limitations of existing log monitoring methods are described below, taking the widely used IBM AIX machine as an example:

IBM AIX machines (for example, IBM minicomputers, or other machines running the AIX operating system, such as IBM OpenPower servers) are widely deployed, and on many occasions they occupy a very important position (for example, they may run important database systems). In actual operation, an IBM AIX machine can suffer software and hardware faults. A fault that affects service operation immediately can be discovered from its effect on the service. Other faults, however, do not affect normal service operation for the time being (some hardware has redundancy or backup, for example when one of two mirrored hard disks is damaged, or the hardware fault has not yet reached a serious level). If such faults are not found and handled in time, then after the system has continued to run for a while, a more serious fault may occur, or even a major accident in which the system is interrupted.

Maintenance personnel might find these faults, which do not affect normal service operation for the time being, by logging in to the operating system of the IBM AIX machine and checking (for example, checking the errpt log information), or by observing the equipment's alarm indicator lamps. In practice, however, maintenance personnel may seldom log in to check (which also places relatively high technical demands on the operator), and their main working position may not be next to these machines, so the alarm lamps on the machine panel may also go unnoticed.

The errpt monitoring log is a function of the IBM AIX operating system: calling #errpt generates errpt monitoring log information, which may include fault information about software and hardware. However, the generated log information is often voluminous and contains different information classes and different severity levels; some entries are real faults, while others are not necessarily faults (only temporary error reports), so direct analysis by maintenance personnel is difficult. Moreover, the errpt monitoring log of IBM AIX cannot automatically let maintenance personnel know that important information has been produced; that is, there is no effective fault log monitoring method that raises fault alarms automatically.

Based on the above analysis, a fault log monitoring method is needed that can find faults accurately and efficiently.

Summary of the invention

The problem to be solved by the present invention is to provide a fault log monitoring method that performs fault detection automatically at timed intervals and filters fault information according to its class, severity level, and frequency of occurrence, thereby finding faults in the equipment automatically and accurately, so that the relevant maintenance personnel can arrange to repair these faults in time and avoid accidents that might otherwise occur.

The invention discloses a fault log monitoring method, used by a management host to monitor a controlled machine, comprising:

A configuration step: setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter for error reports that need to be shielded;

A calling step: calling a fault log generation script to generate the fault log file of the controlled machine;

A filtering step: setting and activating a timer according to the detection interval parameter among the fault analysis parameters; filtering the fault information in the fault log file each time the interval set by the timer elapses; and assigning an alarm level to the fault information that is filtered out, according to preset rules;

An alarm step: for the fault information that is filtered out, sending alarm information carrying the alarm level to the management host.

Before the calling step is carried out, the method further comprises:

Setting and activating a monitoring timer according to the detection interval parameter, the timer controlling the interval at which the step of calling the fault log generation script to generate the fault log file of the controlled machine is executed.

Before the calling step is carried out, the method further comprises:

Detecting whether the process that generates the fault log file exists; if it exists, continuing with the step of calling the fault log generation script to generate the fault log file of the controlled machine; if it does not exist, sending alarm information directly to the management host and ending the flow.

The preset rules comprise:

Within the time set by the statistics time range parameter, fault information of type P and class H that occurs no fewer times than the fault count threshold parameter is assigned a level-1 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type P and class S that occurs no fewer times than the fault count threshold parameter is assigned a level-1 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type P and class O that occurs no fewer times than the fault count threshold parameter is assigned a level-2 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type P and class U that occurs no fewer times than the fault count threshold parameter is assigned a level-2 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type P and class H that occurs fewer times than the fault count threshold parameter is assigned notification information; and/or

Within the time set by the statistics time range parameter, fault information of type P and class S that occurs fewer times than the fault count threshold parameter is assigned notification information; and/or

Within the time set by the statistics time range parameter, fault information of type P and class O that occurs fewer times than the fault count threshold parameter is assigned notification information; and/or

Within the time set by the statistics time range parameter, fault information of type P and class U that occurs fewer times than the fault count threshold parameter is assigned notification information; and/or

Within the time set by the statistics time range parameter, fault information of type U and class H that occurs no fewer times than the fault count threshold parameter is assigned a level-2 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type U and class S that occurs no fewer times than the fault count threshold parameter is assigned a level-2 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type U and class O that occurs no fewer times than the fault count threshold parameter is assigned a level-3 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type U and class U that occurs no fewer times than the fault count threshold parameter is assigned a level-3 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type T and class H that occurs no fewer times than the fault count threshold parameter is assigned a level-2 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type T and class S that occurs no fewer times than the fault count threshold parameter is assigned a level-3 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type T and class O that occurs no fewer times than the fault count threshold parameter is assigned a level-3 alarm; and/or

Within the time set by the statistics time range parameter, fault information of type T and class U that occurs no fewer times than the fault count threshold parameter is assigned a level-3 alarm.
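The rules listed above amount to a lookup keyed by fault type and fault class. The following Python sketch is one possible encoding, not the patent's own implementation; the names ALARM_RULES, NOTIFICATION and alarm_level are hypothetical, and the sketch assumes that faults below the threshold whose type is not P produce nothing.

```python
# Hypothetical encoding of the preset rules: (T, C) -> alarm level
# applied when the occurrence count is not less than the threshold.
NOTIFICATION = "notification"

ALARM_RULES = {
    ("P", "H"): 1, ("P", "S"): 1, ("P", "O"): 2, ("P", "U"): 2,
    ("U", "H"): 2, ("U", "S"): 2, ("U", "O"): 3, ("U", "U"): 3,
    ("T", "H"): 2, ("T", "S"): 3, ("T", "O"): 3, ("T", "U"): 3,
}


def alarm_level(fault_type, fault_class, count, threshold):
    """Return 1, 2 or 3 for an alarm, NOTIFICATION, or None for no action."""
    if count >= threshold:
        return ALARM_RULES.get((fault_type, fault_class))
    if fault_type == "P":
        # Type-P faults below the threshold only produce notification information.
        return NOTIFICATION
    return None  # assumption: other types below the threshold are ignored
```

A dictionary keeps each rule on one line, so a changed level touches a single entry rather than a chain of conditionals.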

The filtering step further comprises:

Detecting whether the fault information appearing in the fault log file reaches, within the time set by the statistics time range parameter, the number of occurrences set by the fault count threshold parameter; if so, carrying out the step of assigning an alarm level according to the set rules; if not, continuing to detect.

After the alarm step is carried out, the method further comprises a step of recording the identification number of the fault information corresponding to the alarm information in the memory of the controlled machine.

The filtering step further comprises:

Detecting whether fault information whose identification number has been recorded in the memory appears again in the fault log file within the time set by the recovery time parameter; if it does not, sending an alarm recovery message for that fault information to the management host and deleting its identification number from the memory.

Before the alarm step is carried out, the method further comprises:

Judging whether the identification number of the fault information corresponding to the alarm information is stored in the memory; if not, carrying out the step of sending alarm information carrying the alarm level to the management host for the fault information that is filtered out; if so, not carrying out that step.

The filtering step further comprises:

Detecting whether fault information specified by the parameter for error reports that need to be shielded appears in the fault log file; if so, shielding that fault information and not treating it as a fault.

The invention also discloses a fault log monitoring device, used by a management host to monitor a controlled machine, comprising:

A configuration module, for setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter for error reports that need to be shielded;

A calling module, for calling a fault log generation script to generate the fault log file of the controlled machine;

A filtering module, for filtering the fault information appearing in the fault log file at the time interval set by the detection interval parameter among the fault analysis parameters, and for assigning an alarm level to the fault information that is filtered out according to preset rules;

An alarm module, for sending alarm information carrying the alarm level to the management host for the fault information that is filtered out;

A monitoring timer, for controlling, according to the detection interval parameter among the fault analysis parameters set by the configuration module, the interval at which the calling module calls the fault log generation script to generate the fault log file of the controlled machine.

The filtering module comprises a frequency filtering module, for detecting whether the frequency of occurrence of fault information in the fault log file reaches the frequency value specified by the fault analysis parameters and, if so, assigning an alarm level to that fault information according to the set rules.

The filtering module comprises a fault recovery module, for detecting whether fault information whose identification number has been recorded in the memory of the controlled machine appears again in the fault log file within the time set by the fault analysis parameters and, if not, sending an alarm recovery message for that fault information to the management host.

The filtering module comprises a fault shielding module, for judging whether the fault information includes fault information specified by the fault analysis parameters and, if so, shielding that fault information.

With the monitoring method of the present invention, fault log information can be monitored automatically and accurately, which helps avoid many service interruptions and reduces heavy losses.

Description of drawings

Figure 1 is a flow chart of the fault log monitoring method of the present invention;

Figure 2 is a structural diagram of the fault log monitoring device of the present invention;

Figure 3 is a structural diagram of the filtering module of the fault log monitoring device of the present invention.

Embodiment

The specific implementation of the fault log monitoring method of the present invention is described in detail below with reference to the accompanying drawings, taking an IBM AIX machine as an example.

The present invention runs a monitoring program on the controlled machine to monitor the fault information in the fault log file that the controlled machine generates. For a fault that meets the predefined monitoring conditions, alarm information is sent to the management host connected to the controlled machine, prompting the administrator to investigate the fault in time. The alarm information can be displayed on the interface of the management host, or a warning can be given by other means such as an alarm box.

To implement the monitoring method of the present invention, a configuration step is carried out first, in which fault analysis parameters are set for the monitoring program. The fault analysis parameters specify the conditions under which a fault needs to raise an alarm; that is, at run time the monitoring program filters the fault information in the fault log file according to these parameters and sends alarm information for the fault information that meets them.

The fault analysis parameters comprise:

(1) Detection interval parameter.

This is the time between two successive filtering passes that the monitoring program starts on the fault information in the fault log file. It is expressed in minutes and can be set, for example, to 10 minutes.

(2) Statistics time range parameter and fault count threshold parameter.

Together, these two parameters represent the frequency with which the same fault information appears in the fault log file, that is, the number of times a fault with the same IDENTIFIER and RESOURCE_NAME appears within a set time (determined by the statistics time range parameter). IDENTIFIER and RESOURCE_NAME represent the serial number and the source of a fault. The statistics time range parameter can be set, for example, to 30 minutes. The fault count threshold parameter differs according to the fault type. For example, for a fault whose T (Type) in the fault log information is P (permanent or catastrophic fault), it can be set to 2; for a fault whose T (Type) is U (unknown fault), it can be set to 4; for a fault whose T (Type) is T (temporary fault), it can be set to 6. If, within the set time, the number of occurrences of the same fault information reaches or exceeds the configured fault count threshold, the monitoring program considers that the criterion for sending an alarm has been met. For a detailed description of T (Type), refer to the relevant IBM AIX technical documentation.

(3) Recovery time parameter.

If a fault for which alarm information was previously sent does not appear again in the fault log file within a set period, the fault is considered eliminated; the length of that period is given by the recovery time parameter. This parameter is expressed, for example, in minutes and can be set to 30 minutes. That is, if the IDENTIFIER and RESOURCE_NAME of the fault corresponding to previously sent alarm information do not appear again in the fault log file within the time set by the recovery time parameter, the fault is considered recovered.

(4) Parameter for error reports that need to be shielded.

In the fault log file, some frequently occurring error reports have no significant impact in practice but are troublesome to fix. An example is the error report whose IDENTIFIER is 864D2CE3 and whose RESOURCE_NAME is topsvcs (that is, the "topology services" daemon). This error report is caused by a bug in some versions of IBM HACMP. It can occur very frequently; for a system in online operation, fixing the bug carries operational risk, and the error report generally has no other significant impact. The monitoring program can therefore shield this error report when filtering the log and no longer treat it as a fault. This parameter consists of the IDENTIFIER and RESOURCE_NAME that identify a given fault.

The four kinds of fault analysis parameters above can be modified manually according to how well they work in actual operation, so that faults are identified and found more accurately and unnecessary false alarms are reduced.
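As an illustration, the four kinds of parameters could be held in a single configuration object. The Python sketch below is an assumption about layout only: the class name FaultAnalysisParams and its field names are hypothetical, while the default values are the example values given in the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultAnalysisParams:
    """Hypothetical container for the four kinds of fault analysis parameters."""
    # (1) minutes between two filtering passes
    detection_interval_min: int = 10
    # (2) window, in minutes, over which occurrences are counted
    statistics_window_min: int = 30
    # (2) occurrence threshold per T (Type): permanent, unknown, temporary
    fault_count_threshold: dict = field(
        default_factory=lambda: {"P": 2, "U": 4, "T": 6}
    )
    # (3) minutes without recurrence after which a fault counts as recovered
    recovery_time_min: int = 30
    # (4) (IDENTIFIER, RESOURCE_NAME) pairs to shield
    shielded: frozenset = frozenset({("864D2CE3", "topsvcs")})


params = FaultAnalysisParams()
```

Because the parameters are meant to be tuned by hand, a deployment would more likely load them from the configuration parameter file than hard-code defaults as shown here.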

After the fault analysis parameters have been set successfully, the monitoring program carries out the next step: it sets and activates a timer according to the detection interval parameter. It then carries out a calling step, in which the monitoring program calls #errpt to generate the fault log file.
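A minimal Python sketch of the calling step follows. It is not the patent's own implementation, which generates a shell script, and it calls errpt directly instead. It assumes the default errpt summary layout with the columns IDENTIFIER, TIMESTAMP (MMDDhhmmYY), T, C, RESOURCE_NAME and DESCRIPTION; the names ErrptEntry, run_errpt and parse_errpt are hypothetical, and the sample rows are invented for illustration.

```python
import subprocess
from datetime import datetime
from typing import NamedTuple


class ErrptEntry(NamedTuple):
    identifier: str
    timestamp: datetime
    fault_type: str    # T column
    fault_class: str   # C column
    resource_name: str
    description: str


def run_errpt() -> str:
    """Run errpt and return its summary output (only works on AIX)."""
    result = subprocess.run(["errpt"], capture_output=True, text=True, check=True)
    return result.stdout


def parse_errpt(text: str) -> list[ErrptEntry]:
    """Parse errpt summary text into entries, skipping the header line."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 6 or parts[0] == "IDENTIFIER":
            continue  # header or incomplete line
        ident, stamp, typ, cls, resource, desc = parts
        when = datetime.strptime(stamp, "%m%d%H%M%y")  # assumed MMDDhhmmYY
        entries.append(ErrptEntry(ident, when, typ, cls, resource, desc.strip()))
    return entries


# Invented sample rows; real identifiers and descriptions will differ.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
AAAA0001   0106110213 P H hdisk1         EXAMPLE DISK ERROR
AAAA0002   0106105913 T S ent0           EXAMPLE TEMPORARY ERROR
"""
entries = parse_errpt(SAMPLE)
```

Keeping parsing separate from the subprocess call lets the parser be exercised on saved log text on any machine, while run_errpt is needed only on the controlled machine.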

In the filtering step, the monitoring program reads the fault information in the fault log file and filters it according to the configured fault analysis parameters. Each time the interval set by the timer elapses, the monitoring program filters the fault information in the fault log file, determines an alarm level for the filtering result according to an established rule, sends alarm information of that level, and records the IDENTIFIER and RESOURCE_NAME of the fault in the monitoring program's memory, which resides in the memory of the controlled machine.

The filtering process comprises:

Detecting whether the fault information appearing in the fault log file reaches, within the time set by the statistics time range parameter, the number of occurrences set by the fault count threshold parameter, that is, whether the frequency of the fault meets the requirement set in the fault analysis parameters; if it does, proceeding to the step of determining the alarm level.

At the same time, the monitoring program detects whether fault information whose identification number has been recorded in the memory appears again in the fault log file within the time set by the recovery time parameter; if it does not, the program sends an alarm recovery message for that fault information to the management host and deletes its identification number from the memory.

In addition, the monitoring program detects whether fault information specified by the parameter for error reports that need to be shielded appears in the fault log file; if so, that fault information is shielded and not treated as a fault.
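The three parts of the filtering process (shielding, frequency check, recovery check) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: each log entry is assumed to be a tuple (identifier, resource_name, fault_type, timestamp), and the function names drop_shielded, reaching_threshold and pop_recovered are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed entry layout: (identifier, resource_name, fault_type, timestamp).


def drop_shielded(entries, shielded):
    """Remove entries whose (identifier, resource_name) pair is shielded."""
    return [e for e in entries if (e[0], e[1]) not in shielded]


def reaching_threshold(entries, now, window_min, thresholds):
    """Map (identifier, resource_name) to its count inside the statistics
    window, keeping only faults whose count is not less than the threshold
    configured for their type."""
    start = now - timedelta(minutes=window_min)
    counts = Counter(
        (ident, res, typ)
        for ident, res, typ, ts in entries
        if start <= ts <= now
    )
    return {
        (ident, res): n
        for (ident, res, typ), n in counts.items()
        if typ in thresholds and n >= thresholds[typ]
    }


def pop_recovered(alarmed, entries, now, recovery_min):
    """Return the alarmed faults not seen again within the recovery time and
    delete them from the in-memory set `alarmed` (modified in place)."""
    start = now - timedelta(minutes=recovery_min)
    seen = {(ident, res) for ident, res, _typ, ts in entries if ts >= start}
    recovered = alarmed - seen
    alarmed.difference_update(recovered)
    return recovered
```

Shielding is applied first in this sketch so that shielded error reports never reach the frequency count; the text does not fix an order, so this is a design choice.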

The established rule used to determine the alarm level is described below with reference to Table 1.

Table 1

Here, C (Class) is the class of the fault that occurred; for details, refer to the relevant IBM AIX technical documentation. A fault whose C (Class) is H is a hardware fault; one whose C (Class) is S is a software fault; one whose C (Class) is U is an undetermined fault. The recommended count value in the table is the fault count threshold parameter set above, and the frequency is the number of times the fault information appears within the statistics time range.

In the alarm step, the monitoring program sends alarm information only when the fault information meets the fault frequency configured in the fault analysis parameters and the fault is not one for which an alarm has already been sent and which has not yet recovered. That is, for fault information that passes the filter, the program checks whether the IDENTIFIER and RESOURCE_NAME of the fault are stored in the monitoring program's memory. If they are, the fault has not yet recovered and alarm information need not be sent again. If they are not, alarm information of the corresponding alarm level is sent. The data structure of the alarm information includes the time the fault occurred, the IP address of the equipment, and the fault's IDENTIFIER, RESOURCE_NAME, C (Class), T (Type), DESCRIPTION, and alarm level, among other fields.
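The alarm data structure and the check against faults already alarmed can be sketched in Python as below. The field list follows the text; the class name AlarmMessage, the function send_if_new and the send callback are hypothetical, and the transport to the management host is deliberately left abstract.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AlarmMessage:
    """Fields named in the text; the Python names themselves are assumptions."""
    occurred_at: datetime
    device_ip: str
    identifier: str
    resource_name: str
    fault_class: str   # C (Class)
    fault_type: str    # T (Type)
    description: str
    alarm_level: int


def send_if_new(alarm, alarmed, send):
    """Send `alarm` unless the same fault is already recorded as alarmed.

    `alarmed` is the in-memory set of (identifier, resource_name) pairs;
    `send` is any callable that delivers the message to the management host.
    Returns True if the alarm was sent.
    """
    key = (alarm.identifier, alarm.resource_name)
    if key in alarmed:
        return False  # already alarmed and not yet recovered
    send(alarm)
    alarmed.add(key)  # remember the fault until it recovers
    return True
```

Passing the sender in as a callable keeps the duplicate check independent of how the management host is actually reached.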

For a fault whose T (Type) is P and whose frequency of occurrence is less than the fault frequency set in the fault analysis parameters, notification information is sent, as shown in Table 1. Notification information differs from alarm information in that it does not assert that a fault has currently occurred; it is only a general reminder. The different alarm levels determined by the monitoring program can be displayed on the interface of the management host.

The specific flow of the fault log monitoring method of the present invention is shown, by way of example, in Figure 1.

Step 100: the administrator sets up the configuration parameter file to determine the fault analysis parameters.

Step 101: the monitoring program sets and activates the timer.

The monitoring program sets and activates the timer according to the detection interval parameter in the configuration parameter file.

Step 102: the monitoring program detects whether the errdemon process of AIX (that is, the error logging daemon) exists. If it exists, step 103 is executed; if it does not, alarm information is sent directly to report that this process has failed, and the flow ends, waiting for the administrator to repair it.

Step 103: the monitoring program dynamically generates a shell script and calls it to generate the fault log file.

Step 104: the monitoring program reads the fault log information in the fault log file, filters it according to the fault frequency and the error reports to be shielded that are set in the configuration parameter file and according to the fault records already held in the monitoring program's memory, and determines the fault alarm level according to the predetermined rule.

Step 105: according to the filtering result, the monitoring program decides whether fault alarm information needs to be sent; if it is sent, the IDENTIFIER and RESOURCE_NAME corresponding to the alarm are recorded in the monitoring program's memory.

That is, if the frequency of a fault reaches or exceeds the frequency threshold set in the configuration parameter file, and the fault either has not previously raised an alarm or raised one but has since recovered, alarm information is sent and the IDENTIFIER and RESOURCE_NAME of the fault are recorded in the monitoring program's memory.

Step 106: for each IDENTIFIER and RESOURCE_NAME recorded in the monitoring program's memory, if the monitoring program detects that it has not appeared again in the fault log file within the most recent period (that is, the recovery time parameter in the configuration parameter file), it sends an alarm recovery message and then deletes that IDENTIFIER and RESOURCE_NAME from its memory.

Step 107: when the time set by the timer is up, the monitoring program sets and activates the timer again.
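The control flow of steps 101 to 107 can be sketched as a simple Python loop. This is a simplified sketch, not the patent's implementation: the timer is approximated by sleeping between passes, the work of steps 103 to 106 is passed in as a callable, and the names monitor, run_cycle, daemon_alive and send_alarm are hypothetical.

```python
import time


def monitor(run_cycle, daemon_alive, send_alarm, interval_min,
            max_cycles=None, sleep=time.sleep):
    """Run filtering passes at a fixed interval; return the number completed.

    run_cycle:    callable doing steps 103-106 (generate log, filter, alarm, recover)
    daemon_alive: callable returning True if the error logging daemon exists
    send_alarm:   callable taking a text message for the management host
    max_cycles:   optional cap on passes (None means run until the daemon is missing)
    """
    done = 0
    while max_cycles is None or done < max_cycles:
        if not daemon_alive():            # step 102
            send_alarm("error logging daemon (errdemon) not found")
            return done                   # flow ends; the administrator must repair
        run_cycle()                       # steps 103-106
        done += 1
        sleep(interval_min * 60)          # steps 101/107: wait for the next pass
    return done
```

Injecting sleep and the step callables makes the loop testable without waiting and without an AIX machine; a production version might use a real timer or a scheduler instead.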

Another embodiment of the present invention provides a fault log monitoring device, whose structure is shown in Figure 2. The monitoring device 200 is connected to the controlled machine 100 and to the management host 300. It monitors the fault log file of the controlled machine 100, assigns an alarm level to the fault information that requires an alarm, and sends alarm information to the management host.

The monitoring device 200 comprises a configuration module 201, a calling module 202, a filtering module 203, an alarm module 204, and a monitoring timer 205. The configuration module 201 is connected to the management host 300; it receives configuration commands from the management host 300 and configures the fault analysis parameters of the monitoring device 200. The fault analysis parameters comprise the detection interval parameter, the statistics time range parameter, the fault count threshold parameter, the recovery time parameter, and the parameter for error reports that need to be shielded, as described in detail in the embodiment above.

The calling module 202 is connected to the controlled machine 100; it calls up the monitoring log file produced by the controlled machine 100 and sends the fault information in that file to the filtering module 203. The filtering module 203 filters the fault information according to the fault analysis parameters configured in the configuration module 201.

The detailed structure of the filtering module is shown in Figure 3. The filtering module 203 further comprises a frequency filtering module 2031, a fault recovery module 2032, a fault shielding module 2033, and an alarm setting module 2034. The frequency filtering module 2031 judges whether the fault information reaches the standard specified by the fault frequency value, which is determined by the statistics time range parameter and the fault count threshold parameter. If the standard is reached or exceeded, the alarm setting module 2034 assigns an alarm level to the fault information, and the alarm module 204 sends alarm information of the corresponding level.

The fault recovery module 2032 detects whether fault information whose identification number has been recorded in the memory of the controlled machine appears again in the fault log file within the time set by the recovery time parameter; if it does not, the module sends an alarm recovery message for that fault information to the management host and deletes its identification number from the memory.

The fault shielding module 2033 judges whether the fault information includes fault information specified by the parameter for error reports that need to be shielded; if so, the fault is shielded and not treated as a fault.

The monitoring timer 205 in the fault log monitoring device 200 of this embodiment controls, according to the fault analysis parameters set by the configuration module, the interval at which the step of calling to generate the fault log file of the controlled machine is executed.

In sum, the embodiment that uses the inventive method can realize that self-timing detects the information in errpt (error message) daily record among the IBM AIX, realization is to the parts of IBM AIX machine and the part parts of part associated devices, carry out the self-timing fault detect, and classification according to failure message, severity level, the frequency that takes place, carry out the filtration of certain failure message, find out real possible fault, find that automatically and more exactly those are not to cause delaying the fault of machine (temporarily not influencing service operation) thereby reach at once, make the associated maintenance personnel can in time manage to repair these faults, the accident of avoiding further operation to take place.The present invention also can be applicable in other types.

Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims

1. A fault log monitoring method, used by a management host to monitor a controlled machine, characterized by comprising:

a configuration step of setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising: a detection time interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter for error information to be masked;

a filtering step of setting and activating a timer according to the detection time interval parameter in the fault analysis parameters and, according to the setting of the timer, filtering the fault information in the fault log file at each set interval; and setting an alarm level for the filtered fault information according to a preset rule;

2. The fault log monitoring method as claimed in claim 1, characterized in that, before the calling step is performed, the method further comprises:

3. The fault log monitoring method as claimed in claim 1, characterized in that, before the calling step is performed, the method further comprises:

4. The fault log monitoring method as claimed in claim 1, characterized in that the preset rule comprises:

5. The fault log monitoring method as claimed in claim 1, characterized in that the filtering step further comprises,

6. The fault log monitoring method as claimed in claim 1, characterized in that, after the alarm step is performed, the method further comprises: a step of recording the identification number of the fault information corresponding to the alarm information in the memory of the controlled machine.

7. The fault log monitoring method as claimed in claim 6, characterized in that the filtering step further comprises,

8. The fault log monitoring method as claimed in claim 6, characterized in that, before the alarm step is performed, the method further comprises,

9. The fault log monitoring method as claimed in claim 1, characterized in that the filtering step further comprises,

10. A fault log monitoring device, used by a management host to monitor a controlled machine, characterized by comprising:

11. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a frequency filtering module for detecting whether the frequency of occurrence of fault information in the fault log file reaches the frequency value specified by the fault analysis parameters and, if so, setting an alarm level for the fault information according to the preset rule.

12. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a fault recovery module for detecting whether fault information whose identification number has been recorded in the memory of the controlled machine appears again in the fault log file within the time set by the fault analysis parameters and, if it does not appear, sending an alarm recovery message for the fault information to the management host.

13. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a fault masking module for judging whether the fault information contains the error information specified by the fault analysis parameters and, if so, masking the fault information.