CN104243192B - Fault handling method and system

Info

Publication number: CN104243192B
Application number: CN201310237951.6A
Authority: CN
Inventors: 李宏琳
Original assignee: Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Beijing Shenzhou Taiyue Software Co Ltd
Priority date: 2013-06-17
Filing date: 2013-06-17
Publication date: 2017-11-10
Anticipated expiration: 2033-06-17
Also published as: CN104243192A

Abstract

The invention discloses a fault handling method and system and relates to the technical field of fault analysis. In the fault correlation method of the embodiments, the working condition of the equipment is monitored in real time. When a primary fault A is detected, the method searches forward and backward within a time window T around the occurrence time of primary fault A to determine whether a fault B has occurred; if so, an association is established between the two. Establishing fault associations helps operation and maintenance personnel handle associated faults together, which improves fault-handling efficiency. Further, the embodiments establish an in-memory cache queue: faults are cached in memory as queue entries, so that when fault associations are determined, only this queue needs to be queried to find out quickly whether an associated fault has occurred. This avoids analyzing a large sample and further improves fault-handling efficiency.

Description

Fault handling method and system

Technical field

The present invention relates to the technical field of fault analysis, and in particular to a fault handling method and system.

Background art

In routine equipment maintenance, equipment is usually watched by monitoring personnel. When a fault is found, it is submitted to maintenance personnel, who investigate and handle it so that normal operation is restored in time.

With this approach, however, the reported faults that maintenance personnel receive are disordered and follow no apparent pattern, so troubleshooting and handling are inefficient. An effective fault handling solution is therefore urgently needed to improve fault-handling efficiency.

Summary of the invention

In view of the above problems, embodiments of the present invention provide a fault handling method and system that report faults in an ordered manner and thereby enable efficient and fast fault handling.

The embodiments of the present invention adopt the following technical solutions:

One embodiment of the present invention provides a fault handling method, the method comprising:

when a first fault is detected, searching forward and backward within a predetermined time window to determine whether a second fault has occurred;

if a second fault is found, treating the first fault and the second fault as associated faults and reporting the associated faults; if no second fault is found, reporting the first fault;

for a reported fault, if it is an associated fault, handling the associated faults together; if it is only the first fault, handling the first fault.

The method further comprises:

establishing a relation template for recording the associations between faults; and

establishing an in-memory cache queue and, if a fault occurs during monitoring, caching the fault in memory as a queue entry;

in which case the step of searching forward and backward within the predetermined time window for a second fault when the first fault is detected specifically comprises:

when the first fault is detected, querying the relation template to determine whether the first fault has an associated fault; if it has none, reporting the first fault;

if it has an associated fault, querying the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, continuing to monitor whether a second fault occurs.

The method further comprises: monitoring and managing the in-memory cache queue, and removing faults that occurred earlier than the predetermined time window before the currently processed fault;

in which case, if an associated fault exists, the in-memory cache queue is queried for an associated second fault and, within the predetermined time window after the first fault occurred, the in-memory cache queue continues to be monitored for a second fault.

If an associated fault exists, the method further comprises:

according to the fault associations in the relation template, establishing in the in-memory cache queue a group identified by a network element unique identifier and an association rule ID; then, when a fault is detected, caching the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID.

The second fault is one fault or several faults.

The first fault and the second fault that are associated with each other are related as follows:

the first fault is a primary fault and the second fault is a secondary fault; or

the first fault is a secondary fault and the second fault is a primary fault.

An embodiment of the present invention further provides a fault handling system, the system comprising:

an associated fault search module, configured to search forward and backward within a predetermined time window, when a first fault is detected, to determine whether a second fault has occurred;

a reporting module, configured to treat the first fault and the second fault as associated faults and report the associated faults if the associated fault search module finds a second fault, and to report the first fault if the associated fault search module finds no second fault;

a processing module, configured, for a reported fault, to handle the associated faults together if it is an associated fault, and to handle the first fault if it is only the first fault.

The system further comprises:

a relation template module, configured to establish a relation template that records the associations between faults;

a cache module, configured to establish an in-memory cache queue and, if a fault occurs during monitoring, to cache the fault in memory as a queue entry;

in which case the associated fault search module specifically comprises:

a fault type judging unit, configured to query the relation template when the first fault is detected and determine whether the first fault has an associated fault;

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, to continue monitoring whether a second fault occurs;

a report triggering unit, configured to trigger the reporting module to report the first fault when the fault type judging unit determines that no associated fault exists, and to trigger the reporting module to report faults according to the search result of the search unit.

The system further comprises:

an in-memory cache queue management module, configured to monitor and manage the in-memory cache queue and to remove faults that occurred earlier than the predetermined time window before the currently processed fault;

in which case the associated fault search module specifically comprises:

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault and, within the predetermined time window after the first fault occurred, to continue monitoring the in-memory cache queue for a second fault;

the cache module further comprises:

a grouping unit, configured, when an associated fault exists, to establish in the in-memory cache queue, according to the fault associations in the relation template, a group identified by a network element unique identifier and an association rule ID, and then, when a fault is detected, to cache the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID;

the second fault is one fault or several faults;

the first fault is a primary fault and the second fault is a secondary fault; or

the first fault is a secondary fault and the second fault is a primary fault.

In the fault handling method and system provided by the embodiments of the present invention, the working condition of the equipment is monitored in real time. When a primary fault A is detected, the method searches forward and backward within a time window T around the occurrence time of primary fault A to determine whether a fault B has occurred; if so, an association is established. Establishing fault associations helps operation and maintenance personnel handle associated faults together, which improves fault-handling efficiency.

Further, the embodiments establish an in-memory cache queue in which faults are cached as queue entries, so that when fault associations are determined, only this queue needs to be queried to find out quickly whether an associated fault has occurred. This avoids analyzing a large sample and further improves fault-handling efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of a fault handling method provided by one embodiment of the present invention;

Fig. 2 is a flowchart of a fault handling method provided by another embodiment of the present invention;

Fig. 3 is a flowchart of a specific example of the fault handling method provided by an embodiment of the present invention;

Fig. 4 is a block diagram of a fault handling system provided by one embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.

In routine equipment maintenance, monitoring personnel summarize patterns of fault occurrence through continuous observation and analysis. In general, if several faults usually occur together, those faults are said to influence one another and are called associated faults. For example, if fault B usually also occurs within 10 minutes before or after fault A, alarm A and alarm B are considered to have an influence relationship. Depending on the application scenario, associated faults may have a primary-secondary relationship; in the association above, for example, A is the primary fault and B is the secondary fault.

In this fault correlation method, the working condition of the equipment is monitored in real time. When a primary fault A is detected, the method searches forward and backward within a time window T around the occurrence time of primary fault A to determine whether a fault B has occurred; if so, an association is established. Establishing fault associations helps maintenance personnel handle associated faults together, which improves fault-handling efficiency.

Specifically, referring to Fig. 1, a fault handling method provided by an embodiment of the present invention comprises the following steps:

S101: Monitor for faults.

S102: When a first fault is detected, search forward and backward within a predetermined time window to determine whether a second fault has occurred.

Depending on the application scenario, the length of the predetermined time window can be set differently. For example, in a communication equipment maintenance scenario in the telecommunications industry, the window length can be set to 10 minutes.

S103: If a second fault is found, treat the first fault and the second fault as associated faults and report the associated faults; if no second fault is found, report the first fault.

The first fault and the second fault are associated faults, that is, they normally occur together. In practice, if faults are correlated before they are reported and are then reported together, maintenance personnel can handle the associated faults together, which can greatly improve fault-handling efficiency.

Note that the second fault may be one fault or several. That is, if the first fault is fault A, the second fault may be fault B alone, or faults B, C, and D, and so on; no limitation is imposed here.

Note further that the first fault and the second fault that are associated with each other may be related as follows:

the first fault is a primary fault and the second fault is a secondary fault; or the first fault is a secondary fault and the second fault is a primary fault. For example, if a base station fault is the primary fault, a downlink signal transmission failure can be set as the secondary fault.

S104: For a reported fault, if it is an associated fault, handle the associated faults together; if it is only the first fault, handle the first fault.

In this embodiment, secondary faults B that occurred within time T before primary fault A (that is, within the time window T preceding A) are searched for, and an association is established with each secondary fault B found. Secondary faults B continue to be received and associated after A occurs. Once time window T has elapsed since primary fault A, primary fault A is no longer associated with secondary fault B.
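
As a non-limiting illustration, steps S102 and S103 can be sketched in Python as follows. The names `Fault`, `find_associated`, and `report`, the use of minutes as the time unit, and the set `related_names` standing in for the known associations are assumptions introduced here for illustration; they do not appear in the original disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    name: str    # fault type, e.g. "A" or "B" (hypothetical labels)
    time: float  # occurrence time in minutes


def find_associated(first, faults, window, related_names):
    """Return faults of a related type within +/- window of the first fault."""
    return [
        f for f in faults
        if f is not first
        and f.name in related_names
        and abs(f.time - first.time) <= window
    ]


def report(first, faults, window, related_names):
    """S103: report associated faults together, or the first fault alone."""
    second = find_associated(first, faults, window, related_names)
    if second:
        return ("associated", [first, *second])
    return ("single", [first])


a = Fault("A", 100.0)
log = [Fault("B", 95.0), a, Fault("B", 108.0), Fault("B", 120.0), Fault("C", 101.0)]
kind, group = report(a, log, window=10.0, related_names={"B"})
```

In this sketch the two B faults at 95 and 108 minutes fall inside the 10-minute window around A and are reported with it, while the B fault at 120 minutes and the unrelated C fault are not.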

It can be seen that, in the fault handling method and system provided by the embodiments of the present invention, the working condition of the equipment is monitored in real time. When a primary fault A is detected, the method searches forward and backward within a time window T around the occurrence time of primary fault A to determine whether a fault B has occurred; if so, an association is established. Establishing fault associations helps operation and maintenance personnel handle associated faults together, which improves fault-handling efficiency.

Preferably, referring to Fig. 2, another embodiment of the present invention provides another fault handling method. This embodiment additionally establishes an in-memory cache queue in which faults are cached as queue entries, so that when fault associations are determined, only this queue needs to be queried to find out quickly whether an associated fault has occurred. This avoids analyzing large-sample data and can further improve fault-handling efficiency.

The specific steps are as follows:

S201: Establish a relation template for recording the associations between faults.

S202: Establish an in-memory cache queue; if a fault occurs during monitoring, cache the fault in memory as a queue entry.

In practice, when fault B is received, its timeout is calculated (that is, fault occurrence time + time window of T minutes), and the fault is stored in the memory cache as a queue entry. If the currently processed fault is fault A, the in-memory cache queue is checked for a fault B within the T minutes before fault A. Thus, by adding the in-memory cache queue, the step of analyzing large-sample data is avoided and only the in-memory cache queue is queried.
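
A minimal sketch of such a cache queue, assuming Python and times expressed in minutes, is given below. The class name `FaultCache` and its methods are hypothetical; the disclosure specifies only that each fault is queued together with a timeout equal to its occurrence time plus T.

```python
from collections import deque


class FaultCache:
    """In-memory queue of faults, each stored with its timeout (hypothetical API)."""

    def __init__(self, window):
        self.window = window  # time window T, in minutes
        self.queue = deque()  # entries: (timeout, occurred_at, name)

    def add(self, name, occurred_at):
        # Timeout = fault occurrence time + time window T.
        self.queue.append((occurred_at + self.window, occurred_at, name))

    def find_before(self, name, first_time):
        """Occurrence times of `name` within the T minutes before first_time."""
        return [
            occurred
            for (_, occurred, n) in self.queue
            if n == name and first_time - self.window <= occurred <= first_time
        ]


cache = FaultCache(window=10)
cache.add("B", 93)
cache.add("B", 85)
cache.add("C", 95)
hits = cache.find_before("B", first_time=100)
```

Only the queue is scanned here, which reflects the stated aim of avoiding a query over the full fault history.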

S203: When a first fault is detected, query the relation template to determine whether the first fault has an associated fault. If it has none, perform step S204; if it has one, perform step S205.

S204: Report the first fault, then perform step S208.

S205: Query the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, continue monitoring whether a second fault occurs. If a second fault is found, perform step S206; if no second fault is found, perform step S204.

Preferably, this embodiment further comprises the following step: monitor and manage the in-memory cache queue, and remove faults that occurred earlier than the predetermined time window before the currently processed fault.

For this step, when fault B is received, its timeout is calculated (that is, fault occurrence time + time window of T minutes) and the fault is stored in the memory cache as a queue entry. If the currently processed fault is fault A, the in-memory cache queue is checked for a fault B before fault A (the queue holds only faults that occurred within the T minutes before the currently processed fault A), and, within the predetermined time window after the first fault (fault A) occurred, the queue continues to be monitored for a second fault (fault B). Thus, by adding the in-memory cache queue, the step of analyzing large-sample data is avoided and only the in-memory cache queue is queried.

Further, if an associated fault exists, the method of this embodiment further comprises:

according to the fault associations in the relation template, establishing in the in-memory cache queue a group identified by "network element unique identifier + association rule ID"; then, when a fault is detected, caching the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID.

Here, the network element unique identifier uniquely identifies a device in the network.

The association rule ID identifies an association rule, for example the rule that fault A is the primary fault and fault B is the secondary fault.

Accordingly, when associated faults are looked up in the in-memory cache queue, the query is made specifically within the group identified by "network element unique identifier + association rule ID". This further reduces the data sample to be processed and further improves fault-handling efficiency.
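
The grouping can be sketched as a dictionary keyed by the pair (network element unique identifier, association rule ID), so that a lookup touches only one group. This is an illustrative Python sketch; `GroupedFaultCache`, the identifier strings, and the minute-based times are assumptions, not part of the disclosure.

```python
from collections import defaultdict, deque


class GroupedFaultCache:
    """Cache queue split into groups keyed by (ne_id, rule_id) (hypothetical API)."""

    def __init__(self, window):
        self.window = window
        # (ne_id, rule_id) -> queue of (fault_name, occurred_at)
        self.groups = defaultdict(deque)

    def add(self, ne_id, rule_id, name, occurred_at):
        self.groups[(ne_id, rule_id)].append((name, occurred_at))

    def lookup(self, ne_id, rule_id, name, first_time):
        """Search only the group of this network element and rule."""
        group = self.groups.get((ne_id, rule_id), ())
        return [
            t for (n, t) in group
            if n == name and abs(t - first_time) <= self.window
        ]


cache = GroupedFaultCache(window=10)
cache.add("ne-1", "rule-AB", "B", 96)
cache.add("ne-2", "rule-AB", "B", 97)
cache.add("ne-1", "rule-AB", "B", 80)
same_device = cache.lookup("ne-1", "rule-AB", "B", first_time=100)
```

Faults cached for other network elements are never scanned, which is the sample-size reduction the paragraph above describes.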

S206: Treat the first fault and the second fault as associated faults and report the associated faults.

S207: Handle the associated faults together; end.

S208: Handle the first fault.

In this embodiment, secondary faults B that occurred within time T before primary fault A (that is, within the time window T preceding A) are searched for in the in-memory cache queue, and an association is established with each secondary fault B found. Secondary faults B continue to be received and associated after A occurs. Once time window T has elapsed since primary fault A, primary fault A is no longer associated with secondary fault B.
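
The decision flow of steps S203 to S208 can be sketched as a single function. This is only an illustration under stated assumptions: the relation template is modeled as a Python dictionary from fault type to the set of associated types, faults are `(name, time)` tuples with time in minutes, and the faults observed after the first fault are passed in as a list rather than monitored live. None of these representations comes from the original disclosure.

```python
def process(first, template, cached, later, window):
    """Return the reporting decision for `first`, a (name, time) tuple.

    template: dict mapping a fault name to the set of associated fault names
    cached:   (name, time) faults already in the cache queue
    later:    (name, time) faults observed after the first fault
    """
    name, t0 = first
    related = template.get(name)
    if not related:
        # S203 -> S204/S208: no association defined, report the fault alone.
        return ("report_single", [first])
    # S205, backward search: window before the first fault.
    before = [f for f in cached if f[0] in related and t0 - window <= f[1] <= t0]
    # S205, forward search: window after the first fault.
    after = [f for f in later if f[0] in related and t0 < f[1] <= t0 + window]
    second = before + after
    if second:
        # S206/S207: report and handle the associated faults together.
        return ("report_associated", [first] + second)
    return ("report_single", [first])


template = {"A": {"B"}, "B": {"A"}}
result = process(
    ("A", 50), template,
    cached=[("B", 45), ("B", 30)],
    later=[("B", 58), ("B", 61)],
    window=10,
)
```

Consulting the template first means that faults with no defined association skip the cache lookup entirely.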

Referring to Fig. 3, a specific example of the fault handling method provided by an embodiment of the present invention is shown. The overall idea is as follows. First, cache the faults that need to be associated. Then, group the faults according to fault location information. Finally, when both the primary and the secondary fault of a faulty device have occurred, associate them. The implementation broadly follows the description below; the detailed sub-steps are shown in Fig. 3 and are not repeated here.

S301: Define the association rule.

On the same device, fault A is the primary alarm and fault B is the child alarm. The time window length is defined as T minutes.

S302: Receive active alarms.

i. Receive fault A (or fault B), establish a group identified by "network element unique identifier + association rule ID", and calculate the fault's timeout (that is, fault occurrence time + time window of T minutes). The fault is stored in the memory cache as a queue entry.

ii. Receive fault B (or fault A). Look up whether unexpired data exists under "network element unique identifier + association rule ID". If it exists and the two are primary and secondary alarms of each other, associate fault A with fault B.

S303: Discard on timeout.

Scan the queue and delete from the group queue any alarm older than time window T; it is no longer used for association.
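
Steps S301 to S303 can be combined into one small correlator, sketched below in Python. The names `Rule` and `Correlator` are hypothetical, times are in minutes, and the sketch assumes that alarms arrive in order of occurrence time, so that expired entries always sit at the head of each group queue; the disclosure does not state this ordering.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    primary: str    # e.g. fault A
    secondary: str  # e.g. fault B
    window: float   # time window T, in minutes


class Correlator:
    def __init__(self, rule):
        self.rule = rule  # S301: the association rule
        # (ne_id, rule_id) -> queue of (timeout, name, occurred_at)
        self.groups = defaultdict(deque)

    def _expire(self, key, now):
        # S303: drop alarms whose timeout has passed (assumes time-ordered arrival).
        queue = self.groups[key]
        while queue and queue[0][0] < now:
            queue.popleft()

    def receive(self, ne_id, name, time):
        """S302: cache the alarm and return unexpired counterpart alarms."""
        if name not in (self.rule.primary, self.rule.secondary):
            return []
        key = (ne_id, self.rule.rule_id)
        self._expire(key, time)
        counterpart = (
            self.rule.secondary if name == self.rule.primary else self.rule.primary
        )
        matches = [(n, t) for (_, n, t) in self.groups[key] if n == counterpart]
        self.groups[key].append((time + self.rule.window, name, time))
        return matches


c = Correlator(Rule("R1", primary="A", secondary="B", window=10))
first = c.receive("ne1", "B", 0)   # nothing cached yet
second = c.receive("ne1", "A", 5)  # B at minute 0 is still unexpired
```

Because each alarm is compared only against unexpired entries of its own group, the same code path handles both arrival orders, primary first or secondary first.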

It can be seen that the beneficial effect of this example is that complex queries over large data samples are greatly reduced and fault association is accelerated, which substantially improves fault-handling efficiency.

Referring to Fig. 4, an embodiment of the present invention provides a fault handling system, comprising:

an associated fault search module 401, configured to search forward and backward within a predetermined time window, when a first fault is detected, to determine whether a second fault has occurred;

a reporting module 402, configured to treat the first fault and the second fault as associated faults and report the associated faults if the associated fault search module 401 finds a second fault, and to report the first fault if the associated fault search module 401 finds no second fault;

a processing module 403, configured, for a reported fault, to handle the associated faults together if it is an associated fault, and to handle the first fault if it is only the first fault.

Further, the fault handling system provided by this embodiment also comprises:

a relation template module 404, configured to establish a relation template that records the associations between faults;

a cache module 405, configured to establish an in-memory cache queue and, if a fault occurs during monitoring, to cache the fault in memory as a queue entry.

In this case the associated fault search module 401 specifically comprises:

a fault type judging unit, configured to query the relation template when the first fault is detected and determine whether the first fault has an associated fault;

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, to continue monitoring whether a second fault occurs;

and a report triggering unit, configured to trigger the reporting module to report the first fault when the fault type judging unit determines that no associated fault exists, and to trigger the reporting module to report faults according to the search result of the search unit.

The system also comprises an in-memory cache queue management module 406, configured to monitor and manage the in-memory cache queue and to remove faults that occurred earlier than the predetermined time window before the currently processed fault.

In this case the associated fault search module 401 specifically comprises:

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault and, within the predetermined time window after the first fault occurred, to continue monitoring the in-memory cache queue for a second fault.

Preferably, the cache module further comprises:

a grouping unit, configured, when an associated fault exists, to establish in the in-memory cache queue, according to the fault associations in the relation template, a group identified by a network element unique identifier and an association rule ID, and then, when a fault is detected, to cache the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID.

Note that, for the working principle and processing procedure of each module or unit in the system embodiment, reference may be made to the related description of the method embodiments shown in Fig. 1, Fig. 2, and Fig. 3, which is not repeated here.

To describe the technical solutions of the embodiments clearly, the words "first", "second", and the like are used in the embodiments to distinguish between identical or similar items whose functions and effects are essentially the same. Those skilled in the art will understand that the words "first", "second", and the like do not limit quantity or order of execution.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A fault handling method, characterized in that the method comprises:

establishing an in-memory cache queue and, if a fault occurs during monitoring, caching the fault in memory as a queue entry;

when a first fault is detected, searching forward and backward within a predetermined time window to determine whether a second fault has occurred;

if a second fault is found, treating the first fault and the second fault as associated faults and reporting the associated faults; if no second fault is found, reporting the first fault;

for a reported fault, if it is an associated fault, handling the associated faults together; if it is only the first fault, handling the first fault;

wherein the step of searching forward and backward within the predetermined time window for a second fault when the first fault is detected specifically comprises:

when the first fault is detected, querying the relation template to determine whether the first fault has an associated fault; if it has none, reporting the first fault;

if it has an associated fault, querying the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, continuing to monitor whether a second fault occurs.

2. The fault handling method according to claim 1, characterized in that the method further comprises: monitoring and managing the in-memory cache queue, and removing faults that occurred earlier than the predetermined time window before the currently processed fault;

in which case the step of searching forward and backward within the predetermined time window for a second fault when the first fault is detected specifically comprises:

if an associated fault exists, querying the in-memory cache queue for an associated second fault and, within the predetermined time window after the first fault occurred, continuing to monitor the in-memory cache queue for a second fault.

3. The fault handling method according to claim 1 or 2, characterized in that, if an associated fault exists, the method further comprises:

according to the fault associations in the relation template, establishing in the in-memory cache queue a group identified by a network element unique identifier and an association rule ID; then, when a fault is detected, caching the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID.

4. The fault handling method according to any one of claims 1-2, characterized in that the second fault is one fault or several faults.

5. The fault handling method according to any one of claims 1-2, characterized in that the first fault and the second fault that are associated with each other are related as follows:

the first fault is a primary fault and the second fault is a secondary fault; or

the first fault is a secondary fault and the second fault is a primary fault.

6. A fault handling system, characterized in that the system comprises:

a cache module, configured to establish an in-memory cache queue and, if a fault occurs during monitoring, to cache the fault in memory as a queue entry;

an associated fault search module, configured to search forward and backward within a predetermined time window, when a first fault is detected, to determine whether a second fault has occurred;

a reporting module, configured to treat the first fault and the second fault as associated faults and report the associated faults if the associated fault search module finds a second fault, and to report the first fault if the associated fault search module finds no second fault; and a processing module, configured, for a reported fault, to handle the associated faults together if it is an associated fault, and to handle the first fault if it is only the first fault;

wherein the associated fault search module specifically comprises:

a fault type judging unit, configured to query the relation template when the first fault is detected and determine whether the first fault has an associated fault;

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault within the predetermined time window before the first fault occurred and, within the predetermined time window after the first fault occurred, to continue monitoring whether a second fault occurs.

7. The fault handling system according to claim 6, characterized by

a report triggering unit, configured to trigger the reporting module to report the first fault when the fault type judging unit determines that no associated fault exists, and to trigger the reporting module to report faults according to the search result of the search unit.

8. The fault handling system according to claim 7, characterized in that the system further comprises:

an in-memory cache queue management module, configured to monitor and manage the in-memory cache queue and to remove faults that occurred earlier than the predetermined time window before the currently processed fault;

in which case the associated fault search module specifically comprises:

a search unit, configured, when the fault type judging unit determines that an associated fault exists, to query the in-memory cache queue for an associated second fault and, within the predetermined time window after the first fault occurred, to continue monitoring the in-memory cache queue for a second fault.

9. The fault handling system according to claim 7 or 8, characterized in that the cache module further comprises:

a grouping unit, configured, when an associated fault exists, to establish in the in-memory cache queue, according to the fault associations in the relation template, a group identified by a network element unique identifier and an association rule ID, and then, when a fault is detected, to cache the currently detected fault in the group identified by its corresponding network element unique identifier and association rule ID;

the second fault is one fault or several faults;

the first fault is a primary fault and the second fault is a secondary fault; or

the first fault is a secondary fault and the second fault is a primary fault.