CN101201786B - Method and device for monitoring fault log - Google Patents


Publication number
CN101201786B
Authority
CN
China
Prior art keywords
parameter
failure message
fault
time
fault log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200610165154A
Other languages
Chinese (zh)
Other versions
CN101201786A (en
Inventor
田丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN200610165154A
Publication of CN101201786A
Application granted
Publication of CN101201786B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault log monitoring method by which a management host monitors a controlled machine. The method comprises: a configuration step, in which fault analysis parameters are set for the controlled machine; a calling step, in which the fault log of the controlled machine is generated; a filtering step, in which the fault information in the fault log is filtered according to the fault analysis parameters and an alarm level is assigned to the filtered fault information according to set rules; and an alarm step, in which alarm information carrying the alarm level is sent to the management host for the filtered fault information. With this monitoring method, fault logs can be monitored automatically and accurately, so that many service interruptions are avoided and heavy losses are reduced.

Description

Method and device for monitoring a fault log
Technical field
The present invention relates to fault handling technology for information equipment, and in particular to a method for monitoring a fault log.
Background technology
Information equipment such as computers and servers is widely used in all trades and professions. Because of software and hardware defects in the equipment itself, or operating errors by users, equipment faults and latent faults are unavoidable; if they are not found and eliminated in time, they can lead to serious accidents and losses. For some information equipment, a common fault monitoring approach is to use a logging tool to record the large volume of information produced while the equipment runs, and to locate faults by analyzing the fault information in the log. However, the log information produced by running equipment is often massive, and because practical applications are diverse and complex, the types of faults and suspected faults are also numerous, so ordinary personnel cannot find faults accurately and efficiently in such a volume of log information. The limitations of existing log monitoring methods are described below, taking the widely used IBM AIX machines as an example:
IBM AIX machines (for example, IBM minicomputers, or other machines running the AIX operating system, such as IBM OpenPower servers) are now widely deployed, and in many settings they occupy a very important position (for example, they may run important database systems). In actual operation, IBM AIX machines experience some software and hardware faults. Faults that affect service operation immediately can be discovered from their effect on the service. Other faults, however, do not affect normal service operation for the time being (some hardware has redundancy or backup, for example when one of two mirrored hard disks is damaged, or the hardware fault has not yet reached a certain severity). If such faults are not found and handled in time, the system may, after continuing to run for a while, suffer a more serious fault, or even a major accident in which the system is interrupted.
For the above faults that do not affect normal service operation for the time being, maintenance personnel may still discover them by logging in to the operating system of the IBM AIX machine and checking (for example, checking the errpt log information), or by observing the alarm indicators on the equipment. In practice, however, maintenance personnel may seldom log in to check (which also places relatively high technical demands on the operator), and their usual workplace may not be next to these machines, so the alarm lamps on the machine panels may also go unnoticed.
The errpt monitoring log is a function of the IBM AIX operating system: calling #errpt generates errpt monitoring log information, which may include information about software and hardware faults. However, the generated log information is often voluminous and contains different information categories and severity levels. Some entries are real faults; others are not necessarily faults (only transient error reports). Direct analysis by maintenance personnel is therefore difficult. Moreover, the errpt monitoring log of IBM AIX cannot automatically notify maintenance personnel that important information has been produced; that is, an effective fault log monitoring method that raises fault alarms automatically is lacking.
Based on the above analysis, a fault log monitoring method is needed that can find faults accurately and efficiently.
Summary of the invention
The problem to be solved by the invention is to provide a fault log monitoring method that performs fault detection automatically at timed intervals and filters fault information according to its category, severity level, and frequency of occurrence, so that faults in the equipment are found automatically and accurately, the relevant maintenance personnel can arrange timely repairs, and possible accidents are avoided.
The invention discloses a fault log monitoring method, used by a management host to monitor a controlled machine, comprising:
A configuration step: setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter identifying error reports to be masked;
A calling step: calling a fault log generation script to generate the fault log file of the controlled machine;
A filtering step: setting and activating a timer according to the detection interval parameter among the fault analysis parameters; as set by the timer, filtering the fault information in the fault log file at each set interval; and assigning an alarm level to the filtered fault information according to preset rules;
An alarm step: for the filtered fault information, sending alarm information of the alarm level to the management host.
Before the calling step is performed, the method further comprises:
According to the detection interval parameter, setting and activating a monitoring timer that controls the interval at which the step of calling the fault log generation script to generate the fault log file of the controlled machine is executed.
Before the calling step is performed, the method further comprises:
Detecting whether the process that generates the fault log file exists; if it exists, continuing with the step of calling the fault log generation script to generate the fault log file of the controlled machine; if it does not exist, sending alarm information directly to the management host and ending the flow.
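The process check above can be illustrated with a short Python sketch. It is only an assumption about how such a check might look: the helper name `process_listed` is hypothetical, and the input is assumed to be a process listing with one command name or path per line (for instance, the output of a POSIX `ps -e -o comm=` call). The embodiment later names the AIX error logging daemon errdemon as the process being checked.

```python
def process_listed(ps_output: str, name: str) -> bool:
    """Return True if `name` appears as a command in a process listing.

    Assumption: `ps_output` holds one command name or full path per line.
    """
    for line in ps_output.splitlines():
        # Keep only the last path component, so "/some/dir/errdemon" matches.
        command = line.strip().rsplit("/", 1)[-1]
        if command == name:
            return True
    return False
```

If the function returns False, the monitoring flow would send an alarm and stop instead of generating the fault log file.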
The preset rules comprise:
Within the time set by the statistics time range parameter, for Type P, Class H fault information whose number of occurrences is not less than the fault count threshold parameter, a level-1 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class S fault information whose number of occurrences is not less than the fault count threshold parameter, a level-1 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class O fault information whose number of occurrences is not less than the fault count threshold parameter, a level-2 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class U fault information whose number of occurrences is not less than the fault count threshold parameter, a level-2 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class H fault information whose number of occurrences is less than the fault count threshold parameter, a notification is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class S fault information whose number of occurrences is less than the fault count threshold parameter, a notification is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class O fault information whose number of occurrences is less than the fault count threshold parameter, a notification is set; and/or
Within the time set by the statistics time range parameter, for Type P, Class U fault information whose number of occurrences is less than the fault count threshold parameter, a notification is set; and/or
Within the time set by the statistics time range parameter, for Type U, Class H fault information whose number of occurrences is not less than the fault count threshold parameter, a level-2 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type U, Class S fault information whose number of occurrences is not less than the fault count threshold parameter, a level-2 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type U, Class O fault information whose number of occurrences is not less than the fault count threshold parameter, a level-3 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type U, Class U fault information whose number of occurrences is not less than the fault count threshold parameter, a level-3 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type T, Class H fault information whose number of occurrences is not less than the fault count threshold parameter, a level-2 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type T, Class S fault information whose number of occurrences is not less than the fault count threshold parameter, a level-3 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type T, Class O fault information whose number of occurrences is not less than the fault count threshold parameter, a level-3 alarm is set; and/or
Within the time set by the statistics time range parameter, for Type T, Class U fault information whose number of occurrences is not less than the fault count threshold parameter, a level-3 alarm is set.
The filtering step further comprises:
Detecting whether the fault information appearing in the fault log file reaches, within the time set by the statistics time range parameter, the number of occurrences set by the fault count threshold parameter; if so, performing the step of assigning an alarm level according to the set rules; if not, continuing to detect.
After the alarm step is performed, the method further comprises a step of recording the identification number of the fault information corresponding to the alarm information in the memory of the controlled machine.
The filtering step further comprises:
Detecting whether fault information whose identification number has been recorded in the memory appears again in the fault log file within the time set by the recovery time parameter; if it does not, sending an alarm recovery message for the fault information to the management host and deleting its identification number from the memory.
Before the alarm step is performed, the method further comprises:
Judging whether the identification number of the fault information corresponding to the alarm information is stored in the memory; if not, performing the step of sending alarm information of the alarm level to the management host for the filtered fault information; if so, not performing that step.
The filtering step further comprises:
Detecting whether fault information specified by the parameter identifying error reports to be masked appears in the fault log file; if so, masking that fault information and not treating it as a fault.
The invention also discloses a fault log monitoring device, used by a management host to monitor a controlled machine, comprising:
A configuration module, used to set fault analysis parameters for the controlled machine, the fault analysis parameters comprising a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter identifying error reports to be masked;
A calling module, used to call a fault log generation script to generate the fault log file of the controlled machine;
A filtering module, used to filter the fault information appearing in the fault log file at the interval set by the detection interval parameter among the fault analysis parameters, and to assign an alarm level to the filtered fault information according to preset rules;
An alarm module, used to send alarm information of the alarm level to the management host for the filtered fault information;
A monitoring timer, used to control, according to the detection interval parameter among the fault analysis parameters set by the configuration module, the interval at which the calling module calls the fault log generation script to generate the fault log file of the controlled machine.
The filtering module comprises a frequency filtering module, used to detect whether the frequency of occurrence of the fault information in the fault log file reaches the frequency value specified by the fault analysis parameters and, if so, to assign an alarm level to the fault information according to the set rules.
The filtering module comprises a fault recovery module, used to detect whether fault information whose identification number has been recorded in the memory of the controlled machine appears again in the fault log file within the time set by the fault analysis parameters and, if it does not, to send an alarm recovery message for the fault information to the management host.
The filtering module comprises a fault masking module, used to judge whether the fault information includes fault information specified by the fault analysis parameters and, if so, to mask that fault information.
With the monitoring method of the invention, fault log information can be monitored automatically and accurately, so that many service interruptions are avoided and heavy losses are reduced.
Description of drawings
Figure 1 is a flowchart of the fault log monitoring method of the invention;
Figure 2 is a structural diagram of the fault log monitoring device of the invention;
Figure 3 is a structural diagram of the filtering module of the fault log monitoring device of the invention.
Embodiment
The specific implementation of the fault log monitoring method of the invention is described in detail below with reference to the accompanying drawings, taking an IBM AIX machine as an example.
In the invention, a monitoring program runs on the controlled machine and monitors the fault information in the fault log file generated by that machine. For faults that meet the preset monitoring conditions, it sends alarm information to the management host connected to the controlled machine, prompting the administrator to investigate the fault in time. The alarm information can be displayed on the interface of the management host, or a warning can be issued by other means such as an alarm box.
To implement the monitoring method, a configuration step is performed first, in which fault analysis parameters are set for the monitoring program. These parameters specify the conditions under which a fault must be alarmed: while running, the monitoring program filters the fault information in the fault log file according to the parameters and sends alarm information for the fault information that meets them.
The fault analysis parameters comprise:
(1) Detection interval parameter.
This is the time between two successive filtering passes that the monitoring program runs over the fault information in the fault log file. It is expressed in minutes and may be set, for example, to 10 minutes.
(2) Statistics time range parameter and fault count threshold parameter.
Together, these two parameters define how frequently the same fault information appears in the fault log file, that is, the number of times a fault with the same IDENTIFIER and RESOURCE_NAME (which represent the serial number and the source of a fault) appears within one set period (determined by the statistics time range parameter). The statistics time range parameter may be set, for example, to 30 minutes. The fault count threshold parameter differs by fault type. For example, for faults whose T (Type) in the fault log information is P (permanent or catastrophic fault), it may be set to 2; for faults whose T (Type) is U (unknown fault), to 4; and for faults whose T (Type) is T (temporary fault), to 6. If, within the set period, the number of occurrences of the same fault information reaches or exceeds the configured fault count threshold, the monitoring program considers the criterion for sending an alarm to be met. For a detailed description of the T (Type) values, refer to the relevant IBM AIX technical documentation.
(3) Recovery time parameter.
If a fault for which alarm information was previously sent does not appear again in the fault log file within a set period, the fault is considered eliminated; the length of that period is given by the recovery time parameter. It is expressed in minutes and may be set, for example, to 30 minutes. That is, if the IDENTIFIER and RESOURCE_NAME of the fault corresponding to previously sent alarm information do not appear again in the fault log file within the time set by the recovery time parameter, the fault is considered to have recovered.
(4) Parameter identifying error reports to be masked.
Some error reports that appear frequently in the fault log file in fact have no great impact, yet are troublesome to fix. One example is the error report whose IDENTIFIER is 864D2CE3 and whose RESOURCE_NAME is topsvcs (the "topology services" daemon). This error report is caused by a bug in some versions of IBM HACMP. It can occur very frequently; for a system running online, the operation to fix the bug carries operational risk, and the error report generally has no other significant impact. The monitoring program can therefore mask this error report when filtering the log, and no longer treat it as a fault. This parameter consists of the IDENTIFIER and RESOURCE_NAME that identify a given fault.
The four kinds of fault analysis parameters above can be modified manually according to results in actual operation, so that faults are identified and found more accurately and unnecessary false alarms are reduced.
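As an illustration only, the four kinds of parameters could be held in a small configuration object. The class and field names below are hypothetical; the default values are the example settings quoted in the text (10 minutes, 30 minutes, thresholds of 2, 4 and 6 by type, 30 minutes, and the topsvcs error report).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultAnalysisParams:
    """Hypothetical container for the four kinds of fault analysis parameters."""

    # (1) Minutes between two filtering passes over the fault log file.
    detection_interval_min: int = 10
    # (2) Counting window in minutes, and occurrence threshold per T (Type).
    stats_window_min: int = 30
    count_threshold: dict = field(
        default_factory=lambda: {"P": 2, "U": 4, "T": 6}
    )
    # (3) Minutes without recurrence after which an alarmed fault counts as recovered.
    recovery_time_min: int = 30
    # (4) (IDENTIFIER, RESOURCE_NAME) pairs that are masked and never alarmed.
    masked: frozenset = frozenset({("864D2CE3", "topsvcs")})
```

An administrator tuning the monitor would construct the object with different values, for example `FaultAnalysisParams(detection_interval_min=5)`.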
After the fault analysis parameters have been set successfully, the monitoring program proceeds to the next step: it sets and activates a timer according to the detection interval parameter. A calling step is then performed, in which the monitoring program calls #errpt to generate the fault log file.
In the filtering step, the monitoring program reads the fault information in the fault log file and filters it according to the configured fault analysis parameters. As set by the timer, the monitoring program filters the fault information in the fault log file at each set interval, determines an alarm level for the filtering result according to established rules, sends alarm information of that level, and records the IDENTIFIER and RESOURCE_NAME of the fault in the monitoring program's memory, which resides in the memory of the controlled machine.
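Once the errpt output has been written to a file, it has to be turned into records before it can be filtered. The following Python sketch shows one way this might be done. It assumes the one-line summary layout of errpt, with the columns IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME and DESCRIPTION, and a TIMESTAMP of the form MMDDhhmmYY; the function name and the dictionary keys are hypothetical.

```python
from datetime import datetime


def parse_errpt_summary(text: str) -> list:
    """Parse errpt summary lines into dictionaries (assumed column layout)."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 5)  # the description may contain spaces
        if len(parts) < 6 or parts[0] == "IDENTIFIER":
            continue  # skip blank lines, short lines and the header row
        ident, stamp, err_type, err_class, resource, description = parts
        try:
            timestamp = datetime.strptime(stamp, "%m%d%H%M%y")
        except ValueError:
            continue  # not a summary line in the assumed format
        entries.append({
            "identifier": ident,
            "timestamp": timestamp,
            "type": err_type,
            "class": err_class,
            "resource_name": resource,
            "description": description.strip(),
        })
    return entries
```

Each returned record carries the fields that the later filtering and alarm steps refer to.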
The filtering process comprises the following.
The monitoring program detects whether the fault information appearing in the fault log file reaches, within the time set by the statistics time range parameter, the number of occurrences set by the fault count threshold parameter, that is, whether the frequency of the fault reaches the requirement set in the fault analysis parameters; if so, it proceeds to the step of determining the alarm level.
At the same time, the monitoring program detects whether fault information whose identification number has been recorded in the memory appears again in the fault log file within the time set by the recovery time parameter; if it does not, the program sends an alarm recovery message for the fault information to the management host and deletes its identification number from the memory.
In addition, the monitoring program detects whether fault information specified by the parameter identifying error reports to be masked appears in the fault log file; if so, it masks that fault information and does not treat it as a fault.
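The frequency check in this filtering process can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name and the tuple layout of a log entry are assumptions, and faults are keyed by the (IDENTIFIER, RESOURCE_NAME) pair as the text describes.

```python
from collections import Counter
from datetime import datetime, timedelta


def faults_reaching_threshold(entries, now: datetime, window: timedelta,
                              thresholds: dict) -> dict:
    """Count occurrences per fault inside the window and keep those at or
    above the threshold configured for the fault's type.

    entries: iterable of (identifier, resource_name, err_type, timestamp).
    thresholds: occurrence threshold per type, e.g. {"P": 2, "U": 4, "T": 6}.
    """
    cutoff = now - window
    counts = Counter()
    types = {}
    for ident, resource, err_type, stamp in entries:
        if cutoff <= stamp <= now:
            key = (ident, resource)
            counts[key] += 1
            types[key] = err_type
    return {key: n for key, n in counts.items() if n >= thresholds[types[key]]}
```

With a 30-minute window and a threshold of 2 for type P, a type P fault seen twice in the last 30 minutes is returned, while older occurrences are ignored.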
The established rules used to determine the alarm level are described below with reference to Table 1.
Table 1
Here, C (Class) is the class of the fault that occurred; for details, refer to the relevant IBM AIX technical documentation. A fault whose C (Class) is H is a hardware fault; one whose C (Class) is S is a software fault; and one whose C (Class) is U is an undetermined fault. The recommended count is the fault count threshold parameter set above, and the frequency of occurrence is the number of times the fault information appears within the statistics time range parameter.
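Table 1 itself is not reproduced in this text. As a hedged sketch, the rules listed in the summary of the invention can be written as a Python lookup from (Type, Class) to alarm level; the names `ALARM_LEVEL` and `classify` are hypothetical, and the mapping follows that list of rules rather than the table.

```python
# Alarm level per (T, C) once the occurrence count reaches the threshold,
# transcribed from the rules listed in the summary of the invention.
ALARM_LEVEL = {
    ("P", "H"): 1, ("P", "S"): 1, ("P", "O"): 2, ("P", "U"): 2,
    ("U", "H"): 2, ("U", "S"): 2, ("U", "O"): 3, ("U", "U"): 3,
    ("T", "H"): 2, ("T", "S"): 3, ("T", "O"): 3, ("T", "U"): 3,
}


def classify(err_type: str, err_class: str, count: int, threshold: int):
    """Return ("alarm", level), ("notice", None), or None for no message."""
    if count >= threshold:
        level = ALARM_LEVEL.get((err_type, err_class))
        return ("alarm", level) if level is not None else None
    if err_type == "P":
        # Type P faults below the threshold produce a notification only.
        return ("notice", None)
    return None
```

For example, a Type P, Class H fault that reaches its threshold maps to a level-1 alarm, while a Type U fault below its threshold produces no message.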
In the alarm step, the monitoring program sends alarm information when the fault information meets the fault frequency set in the fault analysis parameters and the fault is not one for which an alarm has already been sent and which has not yet recovered. That is, for fault information that passes the filter, the program checks whether the IDENTIFIER and RESOURCE_NAME of the fault are stored in the monitoring program's memory. If they are, the fault has not yet recovered and alarm information need not be sent again. If they are not, alarm information of the corresponding alarm level is sent. The data structure of the alarm information includes the time of occurrence of the fault, the IP address of the equipment, the IDENTIFIER, RESOURCE_NAME, C (Class), T (Type) and DESCRIPTION of the fault, the alarm level, and so on.
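The alarm record and the duplicate check can be sketched in Python as below. The field names mirror the items listed in the paragraph; the class name `Alarm`, the function `send_if_new`, and the `send` callback are hypothetical placeholders for whatever transport reaches the management host.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical alarm record with the fields listed in the text."""
    occurred_at: str
    device_ip: str
    identifier: str
    resource_name: str
    err_class: str
    err_type: str
    description: str
    level: int


def send_if_new(alarm: Alarm, active: set, send) -> bool:
    """Send the alarm unless the same fault is already alarmed and unrecovered.

    active: set of (IDENTIFIER, RESOURCE_NAME) pairs kept in program memory.
    send: callable that delivers the alarm to the management host.
    """
    key = (alarm.identifier, alarm.resource_name)
    if key in active:
        return False  # already alarmed and not yet recovered
    send(alarm)
    active.add(key)  # remember the fault until it recovers
    return True
```

Calling `send_if_new` twice with the same fault delivers only one alarm, which matches the behavior described above.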
For faults whose T (Type) is P and whose frequency of occurrence is below the fault frequency set in the fault analysis parameters, a notification is sent, as shown in Table 1. A notification differs from alarm information in that it does not assert that a fault has currently occurred; it is only a general reminder. The different alarm levels assigned by the monitoring program can be displayed on the interface of the management host.
A specific flow of the fault log monitoring method of the invention is shown, for example, in Figure 1.
Step 100: the administrator sets up a configuration parameter file to define the fault analysis parameters.
Step 101: the monitoring program sets and activates a timer.
The monitoring program sets and activates this timer according to the detection interval parameter in the configuration parameter file.
Step 102: the monitoring program detects whether the errdemon process of AIX (the error logging daemon) exists. If it exists, step 103 is executed. If it does not exist, alarm information is sent directly to report that this process has failed, the flow ends, and the program waits for the administrator to repair it.
Step 103: the monitoring program dynamically generates a shell script and calls it to generate the fault log file.
Step 104: the monitoring program reads the fault log information in the fault log file, filters it according to the fault frequency set in the configuration parameter file, the parameter identifying error reports to be masked, and the fault records already held in the monitoring program's memory, and determines the fault alarm level according to the predetermined rules.
Step 105: based on the filtering result, the monitoring program decides whether fault alarm information needs to be sent; if it is sent, the IDENTIFIER and RESOURCE_NAME corresponding to the alarm are recorded in the monitoring program's memory.
That is, if the frequency of the fault exceeds the frequency threshold set in the configuration parameter file, and either no alarm was previously sent for this fault or an alarm was sent but the fault has since recovered, alarm information is sent and the IDENTIFIER and RESOURCE_NAME of the fault are recorded in the monitoring program's memory.
Step 106: for each IDENTIFIER and RESOURCE_NAME recorded in the monitoring program's memory, if the monitoring program finds that it has not appeared again in the fault log file within the most recent period (that is, the recovery time parameter in the configuration parameter file), it sends an alarm recovery message and then deletes that IDENTIFIER and RESOURCE_NAME from the monitoring program's memory.
Step 107: when the time set in the timer expires, the monitoring program sets and activates the timer again.
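The overall flow of steps 101 to 107 can be summarized as a timed loop. The Python sketch below is one possible arrangement under stated assumptions: every callable passed in (`check_process`, `generate_log`, `filter_and_alarm`, `alert`) is a hypothetical stand-in for the corresponding step, and a plain sleep is used in place of the timer described in the text.

```python
import time


def monitor_loop(check_process, generate_log, filter_and_alarm, alert,
                 interval_s: float, cycles=None, sleep=time.sleep) -> str:
    """Run the monitoring cycle repeatedly.

    cycles: number of passes to run, or None to run until the process check fails.
    """
    done = 0
    while cycles is None or done < cycles:
        if not check_process():           # step 102: is the logging process alive?
            alert("log process missing")  # alarm directly, then end the flow
            return "process-missing"
        entries = generate_log()          # step 103: produce and read the fault log
        filter_and_alarm(entries)         # steps 104-106: filter, alarm, recover
        done += 1
        sleep(interval_s)                 # step 107: wait for the next interval
    return "stopped"
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting; with a 10-minute detection interval, `interval_s` would be 600.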
In another embodiment of the present invention, a kind of monitoring fault log device also is provided, be illustrated in figure 2 as the structural drawing of this device.Wherein, this supervising device 200 is connected with controlled machine 100 and management host 300 respectively, this supervising device 200 is used for the fault log file of controlled machine 100 is monitored, and the failure message that needs are alarmed is provided with the alarm grade, sends a warning message to management host.
Wherein this supervising device 200 comprises configuration module 201, calling module 202, filtering module 203, alarm module 204, supervision timer 205.Configuration module 201 is connected with management host 300, is used for the configuration order of receiving management main frame 300, and the fault analysis parameter of supervising device 200 is configured.This fault analysis parameter comprises threshold parameter, the release time parameter and the information parameter that reports an error of needs shielding of spacing parameter detection time, timing statistics range parameter, the number of stoppages, as above describes in detail among the embodiment.
Calling module 202 is connected with controlled machine 100, be used for calling the monitoring journal file that controlled machine 100 produces, and the failure message that will monitor in the journal file sends to filtering module 203.Filtering module 203 filters described failure message according to the fault analysis parameter that is disposed in the configuration module 201.
The detailed structure of this filtering module sees also Fig. 3, and wherein, filtering module 203 comprises that further frequency filtering module 2031, fault recovery module 2032, fault masking module 2033, alarm are provided with module 2034.Frequency filtering module 2031 is used for the failure-frequency value definite according to the threshold parameter of the timing statistics range parameter and the number of stoppages, judges whether this failure message reaches described failure-frequency value specified standard.If reach or surpass then 2034 pairs of these failure messages of module are set the alarm grade is set, and sends the warning information of corresponding alarm grade by alarm module 204 by alarm.
Fault recovery module 2032 is used for the described failure message that identification number has been recorded in controlled machine internal memory, in the time of described release time of parameter setting, whether appear at once more in the described fault log file, if do not occur, send the alarm recovery message of described failure message to management host, the identification number of the described failure message of deletion in described internal memory.
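The recovery check performed by this module can be sketched in Python as follows. It is an illustration under assumptions: the function name and the tuple layout of a log entry are hypothetical, and the set `alarmed` stands for the identification numbers kept in memory.

```python
from datetime import datetime, timedelta


def recovered_faults(alarmed: set, entries, now: datetime,
                     recovery_time: timedelta) -> set:
    """Return alarmed faults that did not reappear within the recovery time.

    alarmed: set of (identifier, resource_name) pairs already alarmed.
    entries: iterable of (identifier, resource_name, timestamp) from the log.
    """
    cutoff = now - recovery_time
    seen_recently = {(ident, res) for ident, res, stamp in entries
                     if stamp >= cutoff}
    return alarmed - seen_recently
```

For each returned pair, the caller would send an alarm recovery message to the management host and remove the pair from `alarmed`.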
Fault masking module 2033 is used for judging whether this failure message exists the failure message of the information parameter defined that reports an error of needs shielding, if exist then with this fault masking, do not think fault.
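Masking amounts to dropping entries whose (IDENTIFIER, RESOURCE_NAME) pair is on a configured list before any counting takes place. A minimal Python sketch, with a hypothetical function name and the topsvcs example from the text as the default list:

```python
# Pairs to mask; the single entry is the example given in the description.
MASKED = frozenset({("864D2CE3", "topsvcs")})


def drop_masked(entries, masked=MASKED) -> list:
    """Remove entries whose (identifier, resource_name) pair is masked.

    entries: iterable of tuples whose first two items are the identifier
    and the resource name; any further items are passed through unchanged.
    """
    return [entry for entry in entries if (entry[0], entry[1]) not in masked]
```

Applying the mask first keeps frequent but harmless error reports from ever reaching the frequency check.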
The supervision timer 205 of the fault log monitoring device 200 in this embodiment controls, according to the fault analysis parameters set by the configuration module, the interval at which the step of calling for generation of the controlled machine's fault log file is executed.
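The role of the supervision timer can be sketched as a loop that runs a task at a fixed interval. This Python example is an assumption about one possible realization; the function name, the `max_runs` limit, and the injectable `sleep` argument (added so the loop can be exercised without waiting) are not part of the patent.

```python
import time


def run_periodically(task, interval_s, max_runs=None, sleep=time.sleep):
    """Call `task` repeatedly, waiting `interval_s` seconds between calls.

    Runs forever when `max_runs` is None; returns the number of runs otherwise.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        task()  # e.g. generate and filter the fault log file
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_s)
    return runs


# Demonstration with a stand-in sleep so no real time passes.
calls = []
waits = []
total = run_periodically(lambda: calls.append("checked"), 300,
                         max_runs=3, sleep=waits.append)
```

In the demonstration the task runs three times and the loop requests two waits of 300 seconds, one between each pair of consecutive runs.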
In summary, embodiments of the method can periodically and automatically check the errpt (error report) log on IBM AIX, performing scheduled fault detection on the components of an IBM AIX machine and on some components of certain attached devices. Fault messages are filtered according to their type, severity class, and frequency of occurrence in order to identify genuine potential faults. Faults that do not immediately bring the machine down (and so do not yet affect service operation) are thereby found automatically and more accurately, so that maintenance personnel can repair them in time and avoid accidents during continued operation. The invention is not limited to this type of machine and can also be applied to other types.
Of course, the invention may have various other embodiments. Those of ordinary skill in the art can make corresponding changes and modifications without departing from the spirit and essence of the invention, but all such changes and modifications fall within the scope of protection of the appended claims.

Claims (13)

1. A fault log monitoring method, used by a management host to monitor a controlled machine, characterized in that it comprises:
A configuration step of setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising: a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter specifying error messages to be masked;
A calling step of calling a fault log generation script to generate a fault log file of the controlled machine;
A filtering step of setting and activating a timer according to the detection interval parameter in the fault analysis parameters, filtering the fault messages in the fault log file at the interval set by the timer, and assigning an alarm level to each filtered-out fault message according to preset rules;
An alarm step of sending, for the filtered-out fault messages, alarm information of the alarm level to the management host.
2. The fault log monitoring method as claimed in claim 1, characterized in that, before the calling step is executed, it further comprises:
Setting and activating a supervision timer according to the detection interval parameter, the supervision timer controlling the interval at which the step of calling the fault log generation script to generate the fault log file of the controlled machine is executed.
3. The fault log monitoring method as claimed in claim 1, characterized in that, before the calling step is executed, it further comprises:
Detecting whether the process of the management host that generates the fault log file exists; if it exists, continuing with the step of calling the fault log generation script to generate the fault log file of the controlled machine; if it does not exist, sending alarm information directly to the management host and ending the flow.
4. The fault log monitoring method as claimed in claim 1, characterized in that the preset rules comprise:
Within the time set by the statistics time range parameter, a P-type, H-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-1 alarm; and/or
Within the time set by the statistics time range parameter, a P-type, S-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-1 alarm; and/or
Within the time set by the statistics time range parameter, a P-type, O-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-2 alarm; and/or
Within the time set by the statistics time range parameter, a P-type, U-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-2 alarm; and/or
Within the time set by the statistics time range parameter, a P-type, H-class fault message whose frequency of occurrence is less than the fault count threshold parameter is assigned notification information; and/or
Within the time set by the statistics time range parameter, a P-type, S-class fault message whose frequency of occurrence is less than the fault count threshold parameter is assigned notification information; and/or
Within the time set by the statistics time range parameter, a P-type, O-class fault message whose frequency of occurrence is less than the fault count threshold parameter is assigned notification information; and/or
Within the time set by the statistics time range parameter, a P-type, U-class fault message whose frequency of occurrence is less than the fault count threshold parameter is assigned notification information; and/or
Within the time set by the statistics time range parameter, a U-type, H-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-2 alarm;
Within the time set by the statistics time range parameter, a U-type, S-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-2 alarm; and/or
Within the time set by the statistics time range parameter, a U-type, O-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-3 alarm; and/or
Within the time set by the statistics time range parameter, a U-type, U-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-3 alarm; and/or
Within the time set by the statistics time range parameter, a T-type, H-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-2 alarm; and/or
Within the time set by the statistics time range parameter, a T-type, S-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-3 alarm; and/or
A T-type, O-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-3 alarm; and/or
Within the time set by the statistics time range parameter, a T-type, U-class fault message whose frequency of occurrence is not less than the fault count threshold parameter is assigned a level-3 alarm.
5. The fault log monitoring method as claimed in claim 1, characterized in that the filtering step further comprises:
Detecting whether a fault message appearing in the fault log file reaches, within the time set by the statistics time range parameter, the number of occurrences set by the fault count threshold parameter; if it does, executing the step of assigning an alarm level according to the preset rules; if it does not, continuing detection.
6. The fault log monitoring method as claimed in claim 1, characterized in that, after the alarm step is executed, it further comprises a step of recording the identification number of the fault message corresponding to the alarm information in the memory of the controlled machine.
7. The fault log monitoring method as claimed in claim 6, characterized in that the filtering step further comprises:
Detecting whether a fault message whose identification number has been recorded in the memory appears again in the fault log file within the time set by the recovery time parameter; if it does not, sending an alarm recovery message for the fault message to the management host and deleting the identification number of the fault message from the memory.
8. The fault log monitoring method as claimed in claim 6, characterized in that, before the alarm step is executed, it further comprises:
Judging whether the identification number of the fault message corresponding to the alarm information is stored in the memory; if it is not, executing the step of sending, for the filtered-out fault message, alarm information of the alarm level to the management host; if it is, not executing that step.
9. The fault log monitoring method as claimed in claim 1, characterized in that the filtering step further comprises:
Detecting whether a fault message specified by the parameter for error messages to be masked appears in the fault log file; if so, masking the fault message and not treating it as a fault.
10. A fault log monitoring device, used by a management host to monitor a controlled machine, characterized in that it comprises:
A configuration module for setting fault analysis parameters for the controlled machine, the fault analysis parameters comprising: a detection interval parameter, a statistics time range parameter, a fault count threshold parameter, a recovery time parameter, and a parameter specifying error messages to be masked;
A calling module for calling a fault log generation script to generate a fault log file of the controlled machine;
A filtering module for filtering the fault messages appearing in the fault log file at the interval set by the detection interval parameter of the fault analysis parameters, and for assigning an alarm level to each filtered-out fault message according to preset rules;
An alarm module for sending, for the filtered-out fault messages, alarm information of the alarm level to the management host;
A supervision timer for controlling, according to the detection interval parameter of the fault analysis parameters set by the configuration module, the interval at which the calling module calls the fault log generation script to generate the fault log file of the controlled machine.
11. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a frequency filtering module for detecting whether the frequency of occurrence of a fault message in the fault log file reaches the frequency value specified by the fault analysis parameters and, if it does, assigning an alarm level to the fault message according to the preset rules.
12. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a fault recovery module for detecting whether a fault message whose identification number has been recorded in the memory of the controlled machine appears again in the fault log file within the time set by the fault analysis parameters and, if it does not, sending an alarm recovery message for the fault message to the management host.
13. The fault log monitoring device as claimed in claim 10, characterized in that the filtering module comprises a fault masking module for judging whether a fault message matches one specified by the fault analysis parameters and, if it does, masking the fault message.
CN200610165154A 2006-12-13 2006-12-13 Method and device for monitoring fault log Expired - Fee Related CN101201786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200610165154A CN101201786B (en) 2006-12-13 2006-12-13 Method and device for monitoring fault log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200610165154A CN101201786B (en) 2006-12-13 2006-12-13 Method and device for monitoring fault log

Publications (2)

Publication Number Publication Date
CN101201786A CN101201786A (en) 2008-06-18
CN101201786B true CN101201786B (en) 2010-05-19

Family

ID=39516960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200610165154A Expired - Fee Related CN101201786B (en) 2006-12-13 2006-12-13 Method and device for monitoring fault log

Country Status (1)

Country Link
CN (1) CN101201786B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314392A (en) * 2011-08-09 2012-01-11 浪潮(北京)电子信息产业有限公司 Computer monitoring system and monitoring alarm method
CN102932194B (en) * 2011-08-09 2015-08-12 中国银行股份有限公司 Based on the internet, applications service monitoring system and method for bayes method
DE102012001083A1 (en) * 2012-01-20 2013-07-25 Heidelberger Druckmaschinen Ag Dynamic logfile
CN102857365A (en) * 2012-06-07 2013-01-02 中兴通讯股份有限公司 Fault preventing and intelligent repairing method and device for network management system
US9565080B2 (en) 2012-11-15 2017-02-07 Microsoft Technology Licensing, Llc Evaluating electronic network devices in view of cost and service level considerations
US9325748B2 (en) * 2012-11-15 2016-04-26 Microsoft Technology Licensing, Llc Characterizing service levels on an electronic network
CN103248421B (en) * 2013-01-09 2016-12-28 上海斐讯数据通信技术有限公司 Detection method to ONU fault in a kind of PON system
CN103200050B (en) * 2013-04-12 2016-12-28 北京百度网讯科技有限公司 The hardware state monitoring method and system of server
CN103617109B (en) * 2013-10-23 2016-04-27 上海华力微电子有限公司 The warning disposal system of probe board journal file and method
CN103546350B (en) * 2013-11-06 2018-07-13 北京国双科技有限公司 The detection method and device that daily record generates
CN104932428A (en) * 2014-03-18 2015-09-23 中芯国际集成电路制造(上海)有限公司 Method and device for detecting early failure of hardware
CN104166563B (en) * 2014-08-11 2017-12-12 Tcl通讯(宁波)有限公司 The method and system being controlled based on mobile terminal to the log for repeating output
CN104301136B (en) * 2014-09-11 2018-06-19 青岛海信电器股份有限公司 Fault information reporting and the method and apparatus of processing
CN105630647A (en) * 2014-11-28 2016-06-01 中兴通讯股份有限公司 Equipment detection method and detection equipment
CN104486106A (en) * 2014-12-04 2015-04-01 珠海金山网络游戏科技有限公司 Grading warning service system
CN104378246B (en) * 2014-12-09 2018-04-06 福建星网锐捷网络有限公司 A kind of network equipment failure alignment system, method and device
CN105787242B (en) * 2014-12-25 2019-02-26 华为技术有限公司 A kind of method and device that prediction non-volatile memory medium breaks down
CN106161135B (en) * 2015-04-23 2019-10-18 中国移动通信集团福建有限公司 Business transaction failure analysis methods and device
CN105099762B (en) * 2015-06-29 2018-09-18 北京宇航时代科技发展有限公司 A kind of self checking method and self-checking system of system O&M function
CN104932978B (en) * 2015-06-29 2018-04-13 北京宇航时代科技发展有限公司 A kind of system operation automatic fault selftesting and the method and system of selfreparing
CN104991852A (en) * 2015-06-29 2015-10-21 浪潮(北京)电子信息产业有限公司 System operating state indication method and host system
CN105116842B (en) * 2015-07-13 2018-05-11 华中科技大学 A kind of fault data visualization analytic method based on digital control system daily record
CN105159964B (en) * 2015-08-24 2019-06-21 Oppo广东移动通信有限公司 A kind of log monitoring method and system
CN106506185A (en) * 2015-09-08 2017-03-15 小米科技有限责任公司 The recognition methodss of hardware fault and device
CN105528280B (en) * 2015-11-30 2018-11-23 中电科华云信息技术有限公司 System log and health monitoring relationship determine the method and system of log alarm grade
CN106992900A (en) * 2016-01-20 2017-07-28 北京国双科技有限公司 The method and intelligent early-warning notification platform of monitoring and early warning
CN105739408A (en) * 2016-01-30 2016-07-06 山东大学 Business monitoring method used for power scheduling system and business monitoring system
CN105656699B (en) * 2016-03-29 2018-12-04 网宿科技股份有限公司 The alarm management method and system of content distributing network
CN107844110B (en) * 2016-09-21 2020-05-22 中车株洲电力机车研究所有限公司 Fault data recording system for current transformer
CN106338982A (en) * 2016-09-26 2017-01-18 深圳前海弘稼科技有限公司 Fault processing method, fault processing device and server
CN108268021A (en) * 2016-12-30 2018-07-10 北京金风科创风电设备有限公司 Fault handling method and device
CN106649055A (en) * 2017-01-10 2017-05-10 山东浪潮云服务信息科技有限公司 Domestic CPU (central processing unit) and operating system based software and hardware fault alarming system and method
CN107426005B (en) * 2017-05-15 2021-03-09 苏州浪潮智能科技有限公司 Control method and system for restarting nodes in cloud platform
CN107220162A (en) * 2017-07-04 2017-09-29 鹏元征信有限公司 A kind of service alarm method, storage medium and device
CN107358660A (en) * 2017-07-25 2017-11-17 北京微影时代科技有限公司 Receipt printing machine abnormality eliminating method and device
CN107483268A (en) * 2017-09-20 2017-12-15 深圳市中润四方信息技术有限公司 A kind of alert processing method and system
CN109818763B (en) * 2017-11-20 2022-04-15 北京绪水互联科技有限公司 Equipment fault analysis and statistics method and system and equipment real-time quality control method and system
CN108132868A (en) * 2018-01-15 2018-06-08 政采云有限公司 A kind of data monitoring method, device, computing device and storage medium
CN108896910A (en) * 2018-04-13 2018-11-27 湖南小步科技有限公司 A kind of fault handling method of dynamic lithium battery, device and battery management system
CN108768739A (en) * 2018-06-08 2018-11-06 山东超越数控电子股份有限公司 A kind of fault alarm method based on interchanger daily record
CN108880907B (en) * 2018-07-06 2022-03-04 上海财经大学 Network equipment automatic inspection and maintenance system based on operation log
CN109445993A (en) * 2018-11-02 2019-03-08 郑州云海信息技术有限公司 A kind of detection method and relevant apparatus of file system health status
CN111124817A (en) * 2019-12-06 2020-05-08 江苏智臻能源科技有限公司 Multi-type alarm judgment algorithm based on cache mechanism
CN113014884A (en) * 2021-03-10 2021-06-22 中信百信银行股份有限公司 Alarm processing method and device
CN113608908B (en) * 2021-07-28 2023-12-22 烽火超微信息科技有限公司 Server fault processing method, system, equipment and readable storage medium
CN114495316A (en) * 2022-02-15 2022-05-13 北京半导体专用设备研究所(中国电子科技集团公司第四十五研究所) Data monitoring method and device for precision motion platform
CN114302065B (en) * 2022-03-07 2022-06-03 广东电网有限责任公司东莞供电局 Self-adaptive operation and maintenance method for transformer substation video
CN117370052B (en) * 2023-09-14 2024-04-26 广州宇中网络科技有限公司 Microservice fault analysis method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6598179B1 (en) * 2000-03-31 2003-07-22 International Business Machines Corporation Table-based error log analysis
CN1490982A (en) * 2003-08-18 2004-04-21 北京港湾网络有限公司 Network fault analysing and monitoring method and apparatus
US6925586B1 (en) * 2002-05-09 2005-08-02 Ronald Perrella Methods and systems for centrally-controlled client-side filtering
CN1266881C (en) * 2002-11-20 2006-07-26 华为技术有限公司 Fault coherence analysis of network management system and implement method
CN1852540A (en) * 2005-11-29 2006-10-25 华为技术有限公司 System and method for analyzing call in mobile communication system

Also Published As

Publication number Publication date
CN101201786A (en) 2008-06-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100519

Termination date: 20171213
