WO2011103778A1

WO2011103778A1 - Fault monitoring method, monitoring device, and communication system

Info

Publication number: WO2011103778A1
Application number: PCT/CN2011/070390
Authority: WO
Inventors: 杨胜强
Original assignee: 华为技术有限公司
Priority date: 2010-02-25
Filing date: 2011-01-19
Publication date: 2011-09-01
Also published as: CN101800675A; CN101800675B

Abstract

The embodiments of the present invention provide a fault monitoring method, a monitoring device, and a communication system, wherein the fault monitoring method includes: obtaining a service processing failure event reported from a communication unit, the service processing failure event including the address information of an object entity which fails to process the service; determining abnormal entities according to the service processing failure event reported from the communication unit and a preset failure criterion; transmitting a fault warning notification message for indicating the implementation of fault detection, the fault warning notification message including the information of at least one entity in the determined abnormal entities. With the technical solutions provided by the embodiments of the present invention, the efficiency of fault detection can be improved.

Description

Fault monitoring method, monitoring device and communication system

The present invention relates to the field of communications technologies, and in particular, to a fault monitoring method, a monitoring device, and a communication system.

Background

Communication network equipment requires high reliability. To achieve high reliability and minimize product life-cycle cost, equipment developers must spend considerable time and money performing a detailed Failure Mode and Effects Analysis (FMEA) of the entire communication device. The aim is to identify as many failure modes of the device as possible and to provide effective fault-handling measures, so that the device returns to normal as soon as possible after a failure and the impact on its services is minimized.

As communication devices, and especially their software, grow ever more complex, analyzing every failure mode with the traditional FMEA method becomes very costly and time consuming. In today's fiercely competitive commercial environment no device developer can afford this cost, so most telecom devices contain, to a greater or lesser extent, faults that the device itself cannot detect. In addition, some fault detection mechanisms consume so much of the device's performance that they are generally designed to run only when the device is idle, which makes it impossible to detect the corresponding faults in real time.

The disadvantage of the prior art is as follows:

For both types of fault (faults that the communication device cannot detect, and faults that it cannot detect in real time), the communication device cannot detect the fault in time and therefore cannot recover in time.

Summary of the invention

Embodiments of the present invention provide a fault monitoring method, a monitoring device, and a communication system, which can improve the efficiency of fault detection.

In view of this, the embodiments of the present invention provide:

A fault monitoring method includes:

obtaining a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity that failed the service processing;

determining an abnormal entity according to the service processing failure event reported by the communication unit and a preset failure criterion; and

sending a fault warning notification message for instructing that fault detection be performed, where the fault warning notification message includes information about at least one of the determined abnormal entities.

A monitoring device, comprising:

a first acquiring unit, configured to acquire a service processing failure event reported by the communication unit; the service processing failure event includes: address information of the object entity that fails the service processing;

a determining unit, configured to determine an entity that has an abnormality according to a service processing failure event reported by the communication unit and a preset failure criterion;

a sending unit, configured to send a fault warning notification message, where the fault warning notification message includes information about at least one of the determined abnormal entities and is used to instruct that fault detection be performed.

A communication system, comprising a communication unit, a sub-monitoring unit, and a parent monitoring unit, wherein the sub-monitoring unit is configured to acquire a service processing failure event reported by the communication unit and, upon determining from the address information of the object entity carried in the event that the object entity that failed the service processing does not fall within its own management scope, to report the service processing failure event to the parent monitoring unit;

The parent monitoring unit is configured to receive the service processing failure event reported by the sub-monitoring unit and to determine, according to the address information of the object entity carried in the event, whether the object entity that failed the service processing falls within its own management scope. If so, it determines an abnormal entity according to the service processing failure event and the preset failure criterion and sends a fault warning notification message for instructing that fault detection be performed, where the message includes information about at least one of the determined abnormal entities; if not, it reports the service processing failure event onward to its own parent monitoring unit.

A communication system, comprising a first communication unit, a second communication unit, and a monitoring unit, wherein the monitoring unit is configured to acquire a service processing failure event reported by the first communication unit and a service processing failure event reported by the second communication unit, and, when the address information of the failed object entity carried in the event reported by the first communication unit is the address information of the second communication unit and the address information of the failed object entity carried in the event reported by the second communication unit is the address information of the first communication unit, not to judge the first communication unit and the second communication unit as failed.

The embodiments of the present invention determine an abnormal entity by analyzing the service processing failure events reported by communication units and send a corresponding fault warning notification message, so that the system can take the corresponding fault-handling action and the abnormal entity can recover in time. Faults are thus eliminated at an early stage, their spread is avoided, and system reliability is improved.

Brief description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention. One of ordinary skill in the art can also obtain other drawings based on these drawings without undue creative effort.

FIG. 1 is a flowchart of a fault monitoring method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a fault monitoring and processing method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another fault monitoring and processing method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention;

FIG. 6 is a structural diagram of a monitoring device according to an embodiment of the present invention;

FIG. 7 is a structural diagram of a communication system according to an embodiment of the present invention;

FIG. 8 is a structural diagram of another communication system according to an embodiment of the present invention.

Detailed description

Referring to FIG. 1, an embodiment of the present invention provides a fault monitoring method, including:

101. Acquire a service processing failure event reported by the communication unit, where the service processing failure event includes: address information of the object entity that fails the service processing.

In a communication system, every communication service is ultimately carried out by the communication units of the system exchanging and processing messages or service code streams. A communication unit may be a network element in the communication system or a processing unit within a network element: a hardware entity such as a chassis, a board, a chip, a processor, or an I/O device; a software entity running on a chip or processor, such as a software module, a process, or a thread; or a logical resource entity deployed in the system program, such as a memory resource, a semaphore, a service processing resource, a bandwidth resource, or a link resource.

The service processing failure event reported by the communication unit may be obtained in either of two ways. In the first, the monitoring unit directly receives the service processing failure event reported by the communication unit. In the second, a parent monitoring unit receives the service processing failure event sent by a sub-monitoring unit. The second way applies to distributed failure analysis, whose levels include but are not limited to board-level, frame-level, network-element-level (NE-level), and network-level failure analysis. Monitoring units of different levels (that is, the units that perform failure analysis) may be logically deployed together or deployed on different hardware; to improve processing efficiency they are generally deployed on different hardware.

Board-level failure analysis covers the hardware modules on a board and the software modules running on it, and is deployed directly on the board. Frame-level failure analysis covers the content of board-level failure analysis as well as whatever board-level analysis cannot handle, and is deployed on the central control board of the frame. NE-level failure analysis is deployed on the central control board of the NE. Network-level failure analysis is deployed on a central control node of the network, such as a central network management device.

Accordingly, the parent monitoring unit may be a network-level monitoring unit located on the central network management device, with the sub-monitoring unit being an NE-level monitoring unit located on the central control board of the NE; or the parent monitoring unit may be an NE-level monitoring unit located on the central control board of the NE, with the sub-monitoring unit being a frame-level monitoring unit located on the central control board of the frame; or the parent monitoring unit may be a frame-level monitoring unit located on the central control board of the frame, with the sub-monitoring unit being a board-level monitoring unit located on the board where the communication unit resides.

Generally, if failure analysis at a given level can reach a definite failure decision, the service processing failure event of the communication unit terminates at that level and is not reported upward; if it cannot, that level reports the service processing failure event to the failure analysis of the next level up. For example, suppose board A receives a response from board B in which some fields are incorrectly assigned. Board A reports a service processing failure event to its board-level monitoring unit, and the event carries the address information of board B. Because the board-level monitoring unit of board A cannot analyze failures of other boards, it reports the event to the frame-level monitoring unit to which board A belongs. Similarly, if board A and board B are in different frames, the frame-level monitoring unit to which board A belongs cannot analyze the event effectively and reports it to the NE-level monitoring unit to which board A belongs. If board A and board B are located on different NEs, the NE-level monitoring unit cannot analyze the event effectively either, and the event is reported onward to the network-level monitoring unit for analysis.
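The level-by-level reporting described above can be sketched as follows. This is only an illustration under stated assumptions: the `Monitor` class, the string addresses such as "ne1/f1/A", and the idea of modelling a management scope as a set of addresses are all hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Monitor:
    """One level of failure analysis (board, frame, NE, or network)."""
    name: str
    managed: Set[str]                  # addresses this level can analyze (assumed model)
    parent: Optional["Monitor"] = None

    def handle(self, failed_address: str) -> Optional[str]:
        """Return the name of the level that terminates the event, or None."""
        if failed_address in self.managed:
            return self.name           # a definite decision is possible: stop here
        if self.parent is not None:
            return self.parent.handle(failed_address)  # report to the next level up
        return None                    # no level manages this address


# Hypothetical topology: boards A and B share a frame, C is in another frame
# of the same NE, and D is on a different NE.
network = Monitor("network", {"ne1/f1/A", "ne1/f1/B", "ne1/f2/C", "ne2/f1/D"})
ne1 = Monitor("ne1", {"ne1/f1/A", "ne1/f1/B", "ne1/f2/C"}, parent=network)
frame1 = Monitor("ne1/f1", {"ne1/f1/A", "ne1/f1/B"}, parent=ne1)
board_a = Monitor("ne1/f1/A", {"ne1/f1/A"}, parent=frame1)
```

With this topology, an event from board A that names board B stops at the frame level, one that names board C stops at the NE level, and one that names board D reaches the network level.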

The object entity that failed the service processing is either the communication unit itself or the peer communication unit with which it communicates. The service processing failure event may be a signaling message processing failure event, a management message processing failure event, a service code stream processing failure event, or an interface call processing failure event.

Specifically, the communication unit reports a signaling message processing failure event when it fails to perform the function corresponding to a signaling message; reports a management message processing failure event when it fails to process a management message; reports a service code stream processing failure event when it fails to process a service code stream; and reports an interface call processing failure event when an interface call fails.

If the received message is normal but internal processing fails, the address information of the object entity that failed the service processing is the address information of the communication unit that processed the message.

If processing fails because the received message contains an abnormal information element, the address information of the object entity that failed the service processing is the address information of the communication unit that sent the message.

If the sent message is normal but no response is received from the peer communication unit before the timeout expires, the address information of the object entity that failed the service processing is the address information of the communication unit that was to receive the message (that is, the peer communication unit).

If interface call processing fails, the interface device may be faulty, and the address information of the object entity that failed the service processing is the address information of the communication unit corresponding to the interface device. For example, if an interface call fails while reading or writing a hard disk, the hard disk may be faulty.
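The four cases above amount to a small mapping from the kind of failure to the address that is placed in the event. The sketch below illustrates that mapping; the enum members, the function name, and the plain-string addresses are assumptions made for the example and are not defined by the embodiment.

```python
from enum import Enum, auto


class FailureKind(Enum):
    INTERNAL_PROCESSING = auto()   # received message was normal, local handling failed
    BAD_RECEIVED_MESSAGE = auto()  # received message carried an abnormal information element
    RESPONSE_TIMEOUT = auto()      # sent message was normal, peer did not answer in time
    INTERFACE_CALL = auto()        # call to an interface device (e.g. a hard disk) failed


def failed_entity_address(kind, local, peer=None, device=None):
    """Pick the address to carry in the failure event (illustrative only)."""
    if kind is FailureKind.INTERNAL_PROCESSING:
        return local               # the unit that processed the message
    if kind in (FailureKind.BAD_RECEIVED_MESSAGE, FailureKind.RESPONSE_TIMEOUT):
        return peer                # the sender, or the unit that should have replied
    if kind is FailureKind.INTERFACE_CALL:
        return device              # the unit corresponding to the interface device
    raise ValueError(f"unknown failure kind: {kind!r}")
```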

The service processing failure event may further include cause indication information indicating why the service processing failed. It may also include key context parameters from the service processing, such as the current load and the total number of service processing operations.

In particular, when the current load exceeds a preset threshold, the communication unit may refrain from reporting the service processing failure event, thereby avoiding unnecessary failure analysis.

In particular, when the communication unit determines that a field in a signaling message from the peer communication unit is abnormally assigned because the accessing terminal device (a user terminal or an operation and maintenance terminal) is illegal, it may either not report the service processing failure event, or report it with a specific field in the event that identifies this situation. The former amounts to performing the control at the communication unit itself. For example, if a communication unit in a Home Location Register (HLR) device receives a call request message and finds that the International Mobile Subscriber Identity (IMSI) or Electronic Serial Number (ESN) of the terminal carried in the message is invalid, it may refrain from reporting the service processing failure event, avoiding unnecessary failure analysis.

The address information of the object entity that failed the service processing includes the physical address information of the hardware to which the object entity belongs, which uniquely identifies that hardware within the entire communication system. If the communication unit is a processing unit within a network element, such as a chassis, a board, a chip, a processor, or an I/O device, the physical address information of the hardware to which the object entity belongs may be a signaling point identifier, an IP address, or a physical address expressed as [rack number, board slot number, subsystem number].

If the object entity that failed the service processing is a software entity, its address information may further include logical address information of the software entity. The logical address information may be a software module address or process address, or a software module number or process number in one-to-one correspondence with that address.

The cause indication information of the service processing failure may indicate that the service processing failed because a resource application failed, where the resource may be a memory resource, a semaphore, a service processing resource, a bandwidth resource, a link resource, or the like in the system. The cause indication information may be a specific number that corresponds to the resource whose application failed. It is generally recommended that the mapping between numbers and resources be kept fixed, so that throughout the system every service processing failure caused by a failed application for the same resource carries the same cause indication; this facilitates failure analysis of the resources in the system.
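For orientation, the fields that a service processing failure event may carry can be gathered into one record. The following is a minimal sketch, not a message format from the embodiment: the class name, field names, the tuple encoding of the physical address, and the cause numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ServiceFailureEvent:
    # [rack number, board slot number, subsystem number] of the hosting hardware
    physical_address: Tuple[int, int, int]
    # software module or process number; None when no software entity is named
    logical_address: Optional[int] = None
    # number identifying the resource whose application failed; None otherwise
    cause_code: Optional[int] = None
    # optional context, e.g. current load or total number of service operations
    context: Dict[str, float] = field(default_factory=dict)


# Hypothetical cause numbering, kept fixed across the whole system so that the
# same resource always yields the same cause indication.
CAUSE_MEMORY = 1
CAUSE_SEMAPHORE = 2

event = ServiceFailureEvent(
    physical_address=(0, 3, 1),
    logical_address=17,
    cause_code=CAUSE_MEMORY,
    context={"current_load": 0.42},
)
```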

Generally, the communication unit does not need to report any event when the object entity processes a service successfully. After a service processing failure event has been reported, however, the communication unit reports a service processing success event to the monitoring unit when the object entity next processes a service successfully. Whether the communication unit reports service processing success events may also be controlled by the monitoring unit. For example, after receiving a service processing failure event reported by the communication unit, the monitoring unit returns a message instructing the communication unit to report a service processing success event when the object entity's service processing succeeds.

The communication unit may use the same interface to report service processing success events and service processing failure events, carrying failure indication information in the failure event and success indication information in the success event; for example, a specific identifier carried in the success event indicates that the service processing succeeded.

102. Determine an abnormal entity according to the service processing failure event reported by the communication unit and the preset failure criterion.

Specifically, the service processing failure events reported by the communication unit may be used to compute a failure indicator value for each of one or more analysis objects, and whether an analysis object is abnormal is determined from its failure indicator value and the corresponding failure threshold in the failure criterion.

If an object entity fails at service processing, the related function inevitably fails or is impaired; seen from outside, the entity is abnormal. The failure criterion defines the failure threshold and the analysis object. It may specify that the analysis object is the hardware entity identified by the physical address of the hardware to which the failed object entity belongs; or the software entity identified jointly by that physical address and the logical address of the object entity; or the logical resource entity identified jointly by that physical address and the cause indication information of the service processing failure. The failure indicator value may be the accumulated number of consecutive service processing failures, the ratio of the number of service processing failures to the total number of service processing operations over a period of time, or a Key Performance Indicator (KPI) collected by the system, such as the call loss rate or the call drop rate. Which failure indicator value is used depends on the failure criterion that has been established.

If the failure indicator value is the accumulated number of consecutive service processing failures, then on receiving a service processing failure event from the communication unit the monitoring unit adds one to the failure indicator value of each applicable analysis object. If the failure indicator value is the ratio of failures to total operations over a period of time, the monitoring unit adds one to the failure count of each applicable analysis object and then computes the ratio of the current failure count to the total number of service processing operations. On receiving a service processing success event from the communication unit, the monitoring unit clears the failure indicator value of each corresponding analysis object.

If the failure indicator value is a key performance indicator, a service processing failure event reported by the communication unit triggers the monitoring unit to query the key performance indicator and compare it with the preset threshold.

Generally, the failure criterion may be a threshold comparison: a failure threshold is preset on the monitoring unit, and when the failure indicator value exceeds that threshold the object entity that failed the service processing is determined to be abnormal. In particular, if the criterion is that the number of consecutive service processing failures exceeds a threshold, the object entity is determined to be abnormal once its count of consecutive failures exceeds the failure threshold in the criterion.
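The consecutive-failure criterion can be illustrated with a small counter. This is a sketch under assumptions: the class and method names are invented for the example, analysis objects are represented by arbitrary hashable keys, and only the consecutive-count indicator is shown (not the ratio or KPI variants).

```python
from collections import defaultdict


class ConsecutiveFailureCounter:
    """Counts consecutive failures per analysis object and applies a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold        # preset failure threshold
        self.counts = defaultdict(int)    # failure indicator value per analysis object

    def on_failure(self, analysis_object):
        """Record one failure event; return True if the object is now judged abnormal."""
        self.counts[analysis_object] += 1
        return self.counts[analysis_object] > self.threshold

    def on_success(self, analysis_object):
        """A success event clears the indicator for that analysis object."""
        self.counts[analysis_object] = 0
```

With a threshold of 3, the first three failure events for an object return False and the fourth returns True; a success event in between restarts the count.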

Referring to step 101, the service processing failure event carries three parameters: the physical address information of the hardware to which the failed object entity belongs, the logical address information of the failed object entity, and the cause indication information of the service processing failure. When the monitoring unit receives a service processing failure event from the communication unit, it may perform failure analysis separately for one or more analysis objects:

If the analysis object is the hardware entity identified by the physical address of the hardware to which the failed object entity belongs, and the failure indicator value of that analysis object exceeds a first failure threshold, then the number of consecutive service processing failures of the hardware entity exceeds the first failure threshold and the hardware entity is determined to be abnormal.

If the analysis object is the software entity identified jointly by the physical address of the hardware to which the failed object entity belongs and the logical address of the object entity, and the failure indicator value of that analysis object exceeds a second failure threshold, then the number of consecutive service processing failures of the software entity exceeds the second failure threshold and the software entity is determined to be abnormal.

If the analysis object is the logical resource entity identified jointly by the physical address of the hardware to which the failed object entity belongs and the cause indication information of the service processing failure, and the failure indicator value of that analysis object exceeds a third failure threshold, then the number of consecutive service processing failures caused by invoking the logical resource entity exceeds the third failure threshold and the logical resource entity is determined to be abnormal.
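The three analysis objects above differ only in which event parameters identify them. The sketch below shows one way a single failure event could update all applicable indicators at once; the threshold values, the key layout, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical first, second, and third failure thresholds.
THRESHOLDS = {"hardware": 5, "software": 3, "resource": 3}

# Failure indicator value per analysis object.
counts = defaultdict(int)


def analysis_keys(physical, logical=None, cause=None):
    """Derive the analysis objects that one failure event contributes to."""
    keys = [("hardware", physical)]                  # physical address only
    if logical is not None:
        keys.append(("software", physical, logical))  # physical + logical address
    if cause is not None:
        keys.append(("resource", physical, cause))    # physical address + cause
    return keys


def record_failure(physical, logical=None, cause=None):
    """Increment every applicable indicator; return the objects judged abnormal."""
    abnormal = []
    for key in analysis_keys(physical, logical, cause):
        counts[key] += 1
        if counts[key] > THRESHOLDS[key[0]]:
            abnormal.append(key)
    return abnormal
```

Because the software threshold here is lower than the hardware threshold, repeated failures of one process are flagged at the software level before the hosting hardware is flagged.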

The monitoring unit saves the current failure analysis results separately for subsequent calls.

Specifically, if the service processing failure event carries the current load and the current load exceeds the preset threshold, the monitoring unit may take the running load of the entire system into account in deciding whether to discard the event. If the event is discarded, the failure indicator value of the corresponding analysis object is not incremented.

In particular, if the service processing failure event carries the specific identifier indicating that the accessing terminal device (a user terminal or an operation and maintenance terminal) is illegal, the monitoring unit discards the event or merely logs it; in either case the failure indicator value of the corresponding analysis object is not incremented.
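The two discard rules above can be expressed as a filter applied before any indicator is incremented. The sketch assumes the event is a dictionary with hypothetical keys `illegal_terminal` and `current_load`, and it reduces "taking the system load into account" to a single boolean supplied by the caller, which is a simplification.

```python
def should_count(event, load_threshold, system_overloaded):
    """Return True if the event should increment failure indicators."""
    if event.get("illegal_terminal"):
        return False   # discard (or log only): the fault lies with the terminal
    load = event.get("current_load")
    if load is not None and load > load_threshold and system_overloaded:
        return False   # the failure is attributed to overload, not to a fault
    return True
```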

103. Send a fault warning notification message, where the message includes: information about at least one entity of the determined abnormal entity.

If failure analysis in step 102 takes the hardware entity as the analysis object and the result indicates that the hardware entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the failed object entity belongs.

If failure analysis in step 102 takes the software entity as the analysis object and the result indicates that the software entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the failed object entity belongs and the logical address information of the object entity.

If failure analysis in step 102 takes the logical resource entity as the analysis object and the result indicates that the logical resource entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the failed object entity belongs and the cause indication information of the service processing failure.

If the hardware entity, the software entity, and the logical resource entity in step 102 are respectively used as the failure analysis object, and the failure analysis is performed, and multiple analysis objects are invalid, the multiple failure warning notification messages may be reported at the same time, or may be reported only. A fault warning notification message may also report the fault warning notification message one by one. For example, when it is determined that the hardware entity and the software entity are abnormal, the fault warning notification message corresponding to the software entity may be reported first, and the fault warning notification message corresponding to the hardware entity is not reported. When it is determined that the hardware entity and the logical resource entity are abnormal, the logical resource entity may be reported first. The fault warning notification message is not reported, and the fault warning notification message corresponding to the hardware entity is not reported. Preferably, when there are multiple analysis objects at the same time, the fault warning notification message corresponding to the minimum granularity of the failure analysis object is initiated. Perform the most accurate failure warning. In particular, if the subsequent analysis finds that the system is still faulty, the fault warning notification message corresponding to the hardware entity is reported. 
In particular, the hardware entity's fault warning notification message may also distinguish hardware entities of different granularity, wherein the physical address information of the hardware to which the object entity belongs includes: a first level subaddress; the object entity belongs to the first level a component of the hardware corresponding to the sub-address; after the monitoring unit sends the fault warning notification message including the physical address information of the hardware to which the object entity belongs, if the hardware of the object entity is abnormally determined within the preset time period, the sending includes the first level Sub-address failure warning notification message. Optionally, the first level sub-address includes: a second-level sub-address, where the hardware corresponding to the first-level sub-address is a component of hardware corresponding to the second-level sub-address; and the monitoring unit sends the sub-address including the first-level sub-address In the preset time period after the failure warning notification message, if it is determined that the hardware of the target entity is still abnormal, the failure warning notification message including the second-level sub-address is sent. For example, if the hardware entity represented by the physical address in the form of the chassis number, board slot number, or subsystem number is abnormal, you can send the corresponding number of the chassis number, board slot number, and subsystem number. The fault alarm notification message of the hardware entity (subsystem); then the fault alarm notification message of the hardware entity (board) corresponding to the [chassis number, board slot number] can be sent. Finally, the [rack number] can be sent. Corresponding hardware entity (frame) fault warning notification message. 
Specifically, when fault warning notification messages corresponding to failure analysis objects of different granularity are sent in succession, a waiting time may be preset after each message is reported. After the waiting time expires, the current failure analysis result is rechecked. If the current failure analysis result indicates that the failure analysis object is still abnormal, the next fault warning notification message is reported. Here, [frame number, board slot number, subsystem number] is the physical address information of the hardware to which the object entity belongs, [frame number, board slot number] is the first-level sub-address, and [frame number] is the second-level sub-address.
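The granularity escalation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tuple representation of an address and the names `escalation_targets`, `warn_with_escalation`, `still_abnormal`, `send_warning`, and `wait` are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

# Assumed representation: (frame, slot, subsystem), (frame, slot) or (frame,).
Address = Tuple[int, ...]


def escalation_targets(address: Address) -> List[Address]:
    """Return the full address followed by each coarser sub-address."""
    return [address[:n] for n in range(len(address), 0, -1)]


def warn_with_escalation(address: Address,
                         still_abnormal: Callable[[Address], bool],
                         send_warning: Callable[[Address], None],
                         wait: Callable[[], None]) -> List[Address]:
    """Send warnings from the finest to the coarsest hardware entity.

    After each warning, wait for the preset time and recheck the current
    failure analysis result; stop as soon as the object is no longer abnormal.
    """
    sent = []
    for target in escalation_targets(address):
        send_warning(target)
        sent.append(target)
        wait()  # hypothetical hook standing in for the preset waiting time
        if not still_abnormal(address):
            break
    return sent
```

For the address (3, 1, 1), the sketch warns the subsystem first, then the board (3, 1), then the frame (3,), and stops early once the recheck reports that the object is normal.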

The fault warning notification message may be sent to the abnormal entity itself, or to the management module of the abnormal entity. For example, the fault warning notification message corresponding to a frame is sent to the management module of the frame; the message corresponding to a board is sent to the management module of the board; the message corresponding to a DSP chip subsystem is sent to the management module of the DSP chip subsystem; the message corresponding to a memory resource is sent to the management module of the memory resource; and the message corresponding to a software module can be sent either to the software module itself or to the management module of the software module. Preferably, the fault warning notification message is sent to the management module of the abnormal entity.

After receiving the fault warning notification message, the abnormal entity or its management module performs fault detection and fault recovery for the abnormal entity. See the description of the corresponding parts of the subsequent embodiments for details.

In particular, after the monitoring unit sends a failure warning notification message to an analysis object, a timer can be started. Before the timer expires, the subsequent failure analysis for the analysis object no longer sends a failure warning notification message.
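A minimal sketch of this hold-off timer, assuming one timer per analysis object and a monotonic clock; the class name `WarningSuppressor` and its parameters are hypothetical and not taken from the patent.

```python
import time


class WarningSuppressor:
    """Suppress repeated warnings for an analysis object until a timer expires."""

    def __init__(self, hold_off_seconds, clock=time.monotonic):
        self._hold_off = hold_off_seconds
        self._clock = clock          # injectable so the timer can be tested
        self._expires_at = {}        # analysis object -> timer expiry time

    def should_send(self, analysis_object) -> bool:
        now = self._clock()
        if now < self._expires_at.get(analysis_object, float("-inf")):
            return False             # timer still running: do not send again
        # Sending is allowed; start (or restart) the timer for this object.
        self._expires_at[analysis_object] = now + self._hold_off
        return True
```

The monitoring unit would call `should_send` each time failure analysis finds the object abnormal and send the fault warning notification message only when it returns `True`.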

In the embodiment of the present invention, the communication unit reports a service processing failure event in time when service processing for the object entity fails, and the monitoring unit determines the abnormal entity from the reported events and the preset failure criterion and sends a fault warning notification message. The notification message promptly triggers the fault detection process and the fault recovery process of the abnormal entity, so that the abnormal entity can be repaired automatically and in time and the fault is eliminated at an early stage, which keeps the system running stably over the long term, effectively prevents faults from spreading, and improves system reliability. In addition, the fault detection process is triggered only after the analysis finds that the system has failed, and only for the abnormal entity, so that the fault alarms generated by fault detection are consistent with the observed system failure and irrelevant alarms are effectively suppressed. The technical solution provided in this embodiment can monitor all service processing failures in the system, including failures of signaling message processing, management message processing, and service code stream processing. This ensures that the failure of any communication unit can be detected and that detection is complete. Even if no dedicated fault detection technique has been designed for some communication units in the system, the solution described in the present invention can basically determine that such a communication unit has failed and then take targeted fault recovery measures, so that the abnormal communication unit is automatically repaired or isolated in time and the system returns to normal.

Referring to FIG. 2, another embodiment of the present invention provides a fault monitoring method for the case where a communication unit continuously fails to process signaling messages, which includes:

201. The communication unit fails to process a signaling message and reports a signaling message processing failure event. The event includes: physical address information of the board to which the communication unit belongs.

The signaling message can be any normal message of the signaling plane. The communication unit may fail to process the signaling message because of various abnormal conditions encountered during message processing, such as failure to apply for a memory resource, failure to apply for a timer, failure to query the configuration, or abnormal configuration data returned by the query.

202. The monitoring unit acquires the signaling message processing failure event reported by the communication unit.

203. The monitoring unit determines that the board to which the communication unit belongs is abnormal according to the signaling message processing failure event reported by the communication unit and the preset failure determination criterion.

The signaling message processing failure event includes the physical address information of the board to which the communication unit belongs. The monitoring unit keeps a cumulative count of consecutive service processing failures for the board: each time it receives a signaling message processing failure event reported by the communication unit, it increases the number of consecutive service processing failures corresponding to the board by one. The monitoring unit determines that the board is abnormal when the number of consecutive service processing failures corresponding to the board is greater than the failure threshold set by the system.

204. The monitoring unit sends a fault warning notification message to the board, where the message includes: physical address information of the board.

After the fault warning notification message is sent, the monitoring unit starts a timer. Before the timer expires, subsequent failure analysis of the board does not send another fault warning notification message. This mainly prevents the monitoring unit from repeatedly sending fault warning notification messages.

205. After receiving the fault warning notification message, the board triggers a fault detection process.

After receiving the fault warning notification message, the board triggers its fault detection process to perform comprehensive fault detection on the board and determine the fault point and the cause of the fault. Generally, when a specific fault point and cause are detected, the corresponding fault alarm information is reported to prompt the operation and maintenance personnel of the device. For example, the fault detection process includes memory chip failure detection for the board. If the memory chip is found to have failed, fault alarm information indicating the memory chip failure can be reported.

206. After performing the fault detection process, the board performs a failure confirmation process according to the fault detection result.

If the fault detection result of the board indicates that no fault is detected, the board sends a failure query message to the monitoring unit, and the monitoring unit returns a response message that includes the current latest failure analysis result. If the current latest failure analysis result indicates that the board is still failing, the next step is performed; if it indicates that the board is normal, the entire process ends.

If the fault detection result indicates that the board does have a fault, the next step can be performed without failure confirmation.

207. The board triggers a fault recovery process.

If the fault recovery process of the board is a board reset, the board reset process is performed. If the fault recovery process of the board is an active/standby switchover, the active/standby switchover process is performed. If the fault recovery process of the board is isolation, the board isolation process is performed. In particular, the fault recovery process of the board can be configured as a combination of multiple fault recovery measures. For example, it can be configured to perform the active/standby switchover first, then the board reset, and finally board isolation. After each fault recovery measure is performed, steps 205-206 are executed again to repeat fault detection and failure confirmation. If the fault detection result or the current latest failure analysis result indicates that the board is still faulty or failing, the next fault recovery measure is executed; otherwise the board is normal and the process ends.
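The combined recovery sequence can be sketched as a loop over configured measures. This is only an illustration under stated assumptions: `run_recovery`, `detect_fault`, and `still_failed` are hypothetical names, with `detect_fault` standing in for step 205 and `still_failed` for the failure confirmation query of step 206.

```python
from typing import Callable, List, Sequence, Tuple


def run_recovery(measures: Sequence[Tuple[str, Callable[[], None]]],
                 detect_fault: Callable[[], bool],
                 still_failed: Callable[[], bool]) -> List[str]:
    """Apply recovery measures in order until the board checks out as normal.

    Returns the names of the measures that were actually applied.
    """
    applied = []
    for name, action in measures:
        action()
        applied.append(name)
        # Re-run fault detection; the monitoring unit is only queried
        # when detection finds no fault (short-circuit evaluation).
        if not detect_fault() and not still_failed():
            break  # board is normal, stop escalating
    return applied
```

A configuration such as `[("switchover", ...), ("reset", ...), ("isolate", ...)]` would reproduce the ordering given in the example above.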

In the embodiment of the present invention, when the communication unit fails to process a signaling message, the signaling message processing failure event is reported in time, and the monitoring unit performs failure analysis, determines that the board to which the communication unit belongs is abnormal, and sends a fault warning notification message to the board. The fault detection process and the fault recovery process of the board are thus triggered in time, so that the board can be automatically repaired or isolated in time and the fault is eliminated at an early stage, which keeps the system running stably and normally over the long term, effectively prevents faults from spreading, and improves system reliability. In addition, because the fault detection process is triggered only after the board is found to be abnormal, compared with the original timed fault detection trigger mechanism, timeliness is ensured and the impact on system performance is minimized.

Referring to FIG. 3, the technical solutions provided by the embodiments of the present invention are described in detail below by way of a specific example. This embodiment assumes that the DSP chip with frame number 3, board slot number 1, and subsystem number 1 fails, and that the software module running on the DSP chip is a single process.

301. The DSP chip fails to process a service and reports a service processing failure event to the monitoring unit of the DSP chip. The event includes: the physical address of the DSP chip (frame number 3, board slot number 1, subsystem number 1) and cause indication information for the service processing failure.

Since the software module running on the DSP chip is a single process, there is no need to distinguish between software modules, so the logical address of the software module that failed the service processing need not be carried here.

The service processing failure event may be a signaling message processing failure event of the DSP, or a management message processing failure event of the DSP or a service code stream processing failure event of the DSP.

The cause indication information for the service processing failure may indicate that the service processing failed because a resource application failed, where the resource may be a memory resource of the DSP chip, a timer resource of the DSP chip, a service channel processing resource of the DSP chip, and so on. In the system, the cause indication information is generally a specific number that corresponds to the resource whose application failed.

302. The monitoring unit acquires a service processing failure event reported by the DSP chip.

After the monitoring unit obtains the service processing failure event reported by the DSP chip, it extracts the information carried in the event, which includes: the physical address of the DSP chip (frame number 3, board slot number 1, subsystem number 1) and the cause indication information for the service processing failure.

303. The monitoring unit determines whether the DSP chip is abnormal according to the service processing failure event reported by the DSP chip and the preset failure judging criterion.

The preset failure criterion is whether the number of consecutive service processing failures of the DSP chip exceeds the configured failure threshold; here the system's failure threshold is 5. If the threshold is exceeded, the monitoring unit determines that the DSP chip is abnormal. Otherwise, the DSP chip has not met the failure criterion and the monitoring unit determines that the DSP chip is normal.

According to the preset failure criterion, the monitoring unit needs to count the number of consecutive service processing failures of the DSP chip from the service processing failure events reported by the DSP chip. Each time the monitoring unit receives a service processing failure event reported by the DSP chip, it identifies the physical entity corresponding to the physical address of the DSP chip carried in the event and increases the number of consecutive service processing failures of that physical entity by one; that is, the count for the DSP chip with frame number 3, board slot number 1, and subsystem number 1 is increased by one. It then checks whether the number of consecutive service processing failures of the DSP chip exceeds the configured failure threshold. For example, if the DSP chip fails to process a service 5 times in a row, it reports the service processing failure event to the monitoring unit 5 times. When the monitoring unit performs failure analysis on each of the first 4 events, the failure threshold of 5 has not yet been reached, so the first 4 failure analysis results are that the DSP chip is normal. When the fifth service processing failure event is received, failure analysis finds that the number of consecutive service processing failures of the DSP chip has reached the failure threshold of 5, and the failure analysis result is that the DSP chip is abnormal. If the cause indication of all 5 service processing failures is the same and points to the memory resource of the DSP chip, the memory resource of the DSP chip is also taken as an analysis object, and the failure analysis also outputs the result that the memory resource of the DSP chip is abnormal.

It should be noted that if, after receiving service processing failure events reported by the DSP chip, the monitoring unit receives a service processing success event reported by the DSP chip, the counted number of service processing failures is cleared. For example, if the DSP chip fails to process a service three times in a row but processes the fourth service successfully, it reports a service processing success event, and the monitoring unit resets the counted number of consecutive service processing failures of the DSP chip from 3 to 0.
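The counting rule can be sketched as a small per-address counter. This is a hedged illustration rather than the patent's implementation: the class and method names are hypothetical, and the comparison follows the worked example above, in which the fifth consecutive failure with a threshold of 5 already yields an abnormal result.

```python
from collections import defaultdict


class ConsecutiveFailureAnalyzer:
    """Count consecutive service processing failures per physical address."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.counts = defaultdict(int)  # address -> consecutive failure count

    def on_failure(self, address) -> bool:
        """Record one failure event; return True if the entity is judged abnormal."""
        self.counts[address] += 1
        # Assumption: "abnormal" once the count reaches the threshold,
        # matching the 5-failure example in the text.
        return self.counts[address] >= self.threshold

    def on_success(self, address) -> None:
        """A success event clears the consecutive failure count."""
        self.counts[address] = 0


analyzer = ConsecutiveFailureAnalyzer(threshold=5)
dsp = (3, 1, 1)  # frame number 3, board slot number 1, subsystem number 1
results = [analyzer.on_failure(dsp) for _ in range(5)]
```

With these inputs the first four analysis results are normal and the fifth is abnormal; a call to `on_success(dsp)` at any point resets the count to 0.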

The monitoring unit saves the result of the failure analysis (that is, whether the DSP chip is abnormal or normal) as the current latest failure analysis result.

304. When the monitoring unit determines that the DSP chip is abnormal, it sends a fault warning notification message to the DSP chip management unit.

The fault warning notification message includes: the address information of the abnormal DSP chip (frame number 3, board slot number 1, subsystem number 1).

After sending the fault warning notification message, the monitoring unit starts a timer. Before the timer expires, subsequent failure analysis does not send another fault warning notification message. This mainly prevents the monitoring unit from repeatedly sending fault warning notification messages.

305. The DSP chip management unit calls the DSP fault detection processing program to perform fault detection. The DSP fault detection processing function can be registered in the DSP chip management unit, and calling this function triggers the DSP fault detection processing flow. For example, a message is sent to the abnormal DSP chip to trigger it to perform CRC data verification of its program segment and data segment and to return the CRC data verification result to the DSP chip management unit. When a fault is found, the DSP fault detection processing flow reports the corresponding alarm and log to help the user locate the problem.

306. The DSP chip management unit performs failure confirmation with the monitoring unit according to the DSP fault detection result.

If the DSP fault detection result indicates that no fault is detected, a failure query message is sent to the monitoring unit, and the monitoring unit returns a response message that includes the current latest failure analysis result.

If the DSP fault detection result indicates that a fault is detected, the failure query message may or may not be sent to the monitoring unit for failure confirmation. Preferably, since the fault has already been detected, the failure query message is not sent, which improves system processing efficiency.

In particular, if the DSP fault detection result indicates that a fault is detected, or the current latest failure analysis result obtained from the monitoring unit through failure confirmation indicates that the DSP chip is abnormal, the next step is performed. If the DSP fault detection result indicates that no fault is detected and the current latest failure analysis result obtained from the monitoring unit indicates that the DSP chip is normal, the DSP chip has returned to normal and the entire process can end. This prevents transient failures from triggering unnecessary fault recovery measures that would affect the system.

307. The DSP chip management unit calls the DSP fault recovery processing program to perform fault recovery. The DSP fault recovery processing function can be registered in the DSP chip management unit, and calling this function triggers the DSP fault recovery processing flow. For example, a reset message is sent to the abnormal DSP chip to trigger the DSP chip to reset and restart, and a timer is started to wait for the DSP chip to run normally again.

After executing the DSP fault recovery processing program, the DSP chip management unit can perform fault detection on the DSP chip again and perform failure confirmation with the monitoring unit. If the DSP fault detection result indicates that a fault is detected, or the current latest failure analysis result obtained from the monitoring unit indicates that the DSP chip is still abnormal, the DSP chip isolation measure is executed to isolate the abnormal DSP chip.

In the embodiment of the present invention, when service processing on the DSP chip fails, a service processing failure event is reported to the monitoring unit, and the monitoring unit performs failure analysis on the event and determines in time whether the DSP chip is abnormal. When the DSP chip is abnormal, the monitoring unit sends a fault warning notification message to the DSP chip management unit, which promptly calls the DSP chip fault detection process and fault recovery process. This not only detects the specific cause of the DSP chip fault in time and reports an alarm indicating the root cause, but also repairs or isolates the DSP chip in time, eliminating the fault at an early stage, preventing the fault from spreading, and improving system reliability. In addition, since the fault detection process is triggered only after the fault warning notification message is received, compared with the original timed trigger mechanism, timeliness is ensured and the impact on system performance is minimized; the original timed DSP chip fault detection mechanism can even be turned off. The embodiment of the present invention can monitor all service processing failures of the DSP chip, including failures of signaling message processing, management message processing, and service code stream processing, which ensures that failure detection for the DSP chip is complete. Even if no fault detection technique has been designed for a particular failure mode of the DSP chip, the solution described in the present invention can basically determine from the external behavior of the DSP chip that it has failed, and then take DSP chip fault recovery measures so that the abnormal DSP chip is automatically repaired or isolated in time and the system returns to normal.

Referring to FIG. 4, an embodiment of the present invention provides a fault monitoring and processing method. This embodiment assumes that the first communication unit sends a message to the second communication unit and that service processing fails because no response message is received from the second communication unit. The failure handling process for this situation is as follows:

401. The first communication unit sends a message to the second communication unit, and service processing fails because no response message is received from the second communication unit. The first communication unit reports a service processing failure event to its local monitoring unit. The service processing failure event includes: address information of the object entity (the second communication unit) for which service processing failed.

402. The superior monitoring unit acquires the service processing failure event reported by the first communication unit.

Since the second communication unit may not be within the monitoring range of the monitoring unit of the first communication unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit and needs to report the event to the monitoring unit at the next higher level. Eventually, the service processing failure event reported by the first communication unit is received by a monitoring unit capable of monitoring both the first communication unit and the second communication unit.

The monitoring units may include: a board-level monitoring unit, a frame-level monitoring unit, a network-element-level monitoring unit, and a network-level monitoring unit. Monitoring units (that is, units that perform failure analysis) at different levels can handle different scopes of failure analysis. Generally, the board-level monitoring unit can perform failure analysis only on the hardware chips in the board or the software modules running in the board. The frame-level monitoring unit covers not only the failure analysis of each board in the frame but also frame-level failure analysis between boards. The network-element-level monitoring unit can perform failure analysis on all hardware chips or software modules in the network element. The network-level monitoring unit can perform failure analysis on all hardware chips or software modules in the entire network.
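The level-by-level reporting can be sketched as a walk up a chain of monitoring units until one is found whose scope covers both the reporter and the object entity. The `MonitoringUnit` structure, its `scope` and `parent` fields, and `route_event` are assumptions made for illustration; the patent does not specify how scopes are represented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class MonitoringUnit:
    name: str
    scope: FrozenSet[str]                      # entities this unit can analyze
    parent: Optional["MonitoringUnit"] = None  # next higher-level unit, if any


def route_event(local: MonitoringUnit, reporter: str,
                target: str) -> Optional[MonitoringUnit]:
    """Pass the event upward until a unit covers both reporter and target."""
    unit: Optional[MonitoringUnit] = local
    while unit is not None:
        if reporter in unit.scope and target in unit.scope:
            return unit
        unit = unit.parent  # report to the monitoring unit one level up
    return None  # no monitoring unit in the chain covers both entities
```

For instance, if a board-level unit covers only the first communication unit and its frame-level parent covers both units, an event about the second communication unit ends up at the frame-level unit.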

403. The superior monitoring unit determines whether the second communication unit is abnormal according to the service processing failure event reported by the first communication unit and the preset failure determination criterion.

If the second communication unit has indeed failed, any communication unit that sends a message to the second communication unit will experience a service processing failure due to a response timeout, and these service processing failure events will be sent to the superior monitoring unit. When the superior monitoring unit finds that the object entity pointed to by the service processing failure events sent by multiple communication units is the second communication unit, and the counted number of consecutive service processing failures for that object entity exceeds the configured failure threshold, it determines that the second communication unit is abnormal.

404. The monitoring unit sends a failure warning notification message to the management unit of the second communication unit, where the failure warning notification message carries the address information of the second communication unit.

The processing steps of subsequent fault detection and fault recovery are basically the same as steps 205-207, and are not described here.

In the embodiment of the present invention, if the object entity pointed to by the service processing failure events sent by multiple communication units is the same object entity, and the number of consecutive service processing failures for that object entity exceeds the configured failure threshold, the object entity is determined to be faulty and a fault warning notification message is sent. The fault detection process and the fault recovery process of the object entity are triggered in time, so that the object entity can be automatically repaired or isolated in time and the fault is eliminated at an early stage, which keeps the system running stably and normally over the long term, effectively prevents faults from spreading, and improves system reliability.

Referring to FIG. 5, an embodiment of the present invention provides a fault monitoring and processing method. This embodiment assumes that the first communication unit sends a message to the second communication unit and service processing fails because no response message is received from the second communication unit. At the same time, the second communication unit sends a message to the first communication unit and service processing fails because no response message is received from the first communication unit. In this case, both the first communication unit and the second communication unit report service processing failure events, and the object entity in each event is the peer communication unit, that is, the second communication unit and the first communication unit respectively. What this actually reflects, however, is a failure of the communication path between the first communication unit and the second communication unit. The path may include a third communication unit used for switching, and a failure of the third communication unit also causes this problem. Failure analysis of such problems therefore needs to be performed by a monitoring unit that covers all communication units on the entire path. The failure handling process for this situation is as follows:

501. The first communication unit sends a message to the second communication unit, and service processing fails because no response message is received from the second communication unit. The first communication unit reports a service processing failure event for the second communication unit to its local monitoring unit. The event includes: address information of the object entity (the second communication unit) for which service processing failed. The second communication unit sends a message to the first communication unit, and service processing fails because no response message is received from the first communication unit. The second communication unit reports a service processing failure event for the first communication unit to its local monitoring unit. The event includes: address information of the object entity (the first communication unit) for which service processing failed.

502. The superior monitoring unit acquires the service processing failure event reported by the first communication unit for the second communication unit and the service processing failure event reported by the second communication unit for the first communication unit.

Since the second communication unit may not be within the monitoring range of the monitoring unit of the first communication unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit, so the service processing failure event reported by the first communication unit for the second communication unit needs to be reported to the monitoring unit at the next higher level. Similarly, the service processing failure event reported by the second communication unit for the first communication unit also needs to be reported to the monitoring unit at the next higher level. Eventually, a monitoring unit that can monitor both the first communication unit and the second communication unit receives the service processing failure events reported by both.

503. According to the service processing failure event reported by the first communication unit, the service processing failure event reported by the second communication unit, and the preset failure determination criterion, the superior monitoring unit does not perform failure analysis on the first communication unit or the second communication unit, but performs failure analysis on the third communication unit on the path between the first communication unit and the second communication unit.

The preset failure determination criterion specifies that when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is not performed on those two communication units. Further, if the system is configured with a third communication unit on the path between the first communication unit and the second communication unit, the preset failure determination criterion specifies that in this situation failure analysis is performed on the communication unit on the path between the two communication units. In this case, failure analysis can therefore be performed on the third communication unit on the path between the first communication unit and the second communication unit. For example, if the superior monitoring unit determines that the object entities pointed to by the service processing failure events sent by multiple communication units (including the first communication unit and the second communication unit) are all the third communication unit, and the counted number of consecutive service processing failures for the third communication unit exceeds the configured failure threshold, the superior monitoring unit determines that the third communication unit is abnormal.
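The peer-reporting rule of step 503 can be sketched as a function that chooses which entities to analyze. This is an assumption-laden illustration: the event representation as `(reporter, target)` pairs, the `path_units` configuration table, and the function name are all hypothetical.

```python
def select_analysis_objects(events, path_units):
    """Choose failure analysis objects from (reporter, target) event pairs.

    path_units maps frozenset({a, b}) to the units configured on the path
    between a and b (an assumed configuration format).
    """
    events = list(events)
    pairs = set(events)
    objects = []
    for reporter, target in events:
        if (target, reporter) in pairs:
            # Both units blame each other: analyze the path between them,
            # not the two units themselves.
            candidates = path_units.get(frozenset((reporter, target)), [])
        else:
            candidates = [target]
        for unit in candidates:
            if unit not in objects:
                objects.append(unit)
    return objects
```

The selected objects would then go through the same consecutive-failure counting against the configured failure threshold as in the earlier embodiments.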

504. The monitoring unit sends a failure warning notification message to the management unit of the third communication unit, where the failure warning notification message carries the address information of the third communication unit.

In the embodiment of the present invention, when two communication units that communicate with each other (such as the first communication unit and the second communication unit described above) each report a service processing failure event pointing to the other party, failure analysis is not performed on the two communication units according to the preset failure determination criterion. Instead, failure analysis is performed on the third communication unit on the path between the first communication unit and the second communication unit, so that the failed node on the communication path is found in time, and the failure warning notification message is sent to trigger the fault detection process and the fault recovery process of the failed node in time. As a result, the failed node can be repaired automatically and promptly, the fault is eliminated at an early stage, the long-term, stable, and normal operation of the system is ensured, the spread of faults is effectively avoided, and system reliability is improved.
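The criterion described in step 503 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `(reporter, target)` event shape, and the `path_units` lookup table are all hypothetical, and a plain failure count stands in for the consecutive-failure count mentioned in the text.

```python
from collections import Counter

def analysis_targets(events, path_units):
    """Map (reporter, target) failure events to the entities to be analyzed.

    path_units is a hypothetical table: frozenset({a, b}) -> units on the
    path between a and b.
    """
    reported = set(events)
    targets = []
    for reporter, target in events:
        if (target, reporter) in reported:
            # Both peers point at each other: skip both of them and
            # attribute the failure to the units on the path instead.
            key = frozenset((reporter, target))
            targets.extend(path_units.get(key, []))
        else:
            targets.append(target)
    return targets

def abnormal_units(events, path_units, failure_threshold):
    """Return units whose failure count exceeds the configured threshold."""
    counts = Counter(analysis_targets(events, path_units))
    return sorted(unit for unit, n in counts.items() if n > failure_threshold)
```

With two peers that report each other and a third unit configured on their path, both events are counted against the third unit, and neither peer is ever marked abnormal.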

Referring to FIG. 6, an embodiment of the present invention provides a monitoring device, including: a first acquiring unit 61, configured to acquire a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity whose service processing failed;

a determining unit 62, configured to determine an entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure criterion;

a sending unit 63, configured to send a fault warning notification message, where the fault warning notification message includes information about at least one of the determined abnormal entities and is used to indicate that fault detection is to be performed.

The determining unit 62 includes: an obtaining subunit 621, configured to obtain a failure indicator value by using the service processing failure event reported by the communication unit; and a determining subunit 622, configured to determine the object entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion.

The monitoring device may further include a configuration unit 68, configured to configure and save the failure criterion described above. Specifically, the obtaining subunit 621 is configured to use the service processing failure event reported by the communication unit to count the accumulated number of consecutive service processing failures, where the accumulated number of consecutive service processing failures is the failure indicator value; or, the obtaining subunit 621 is configured to use the service processing failure event reported by the communication unit to obtain the ratio of the number of service processing failures within a period of time to the total number of service processing operations within that period, where this ratio is the failure indicator value; or, the obtaining subunit 621 is configured to query a key performance indicator after receiving the service processing failure event reported by the communication unit, where the key performance indicator is the failure indicator value.
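The first two failure indicator values described above can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, the reset on success follows the clearing unit described later in this embodiment, and "exceeds the threshold" is taken to mean a strict comparison.

```python
class ConsecutiveFailureCounter:
    """Accumulated number of consecutive service processing failures."""

    def __init__(self):
        self.count = 0

    def on_failure(self):
        self.count += 1
        return self.count

    def on_success(self):
        # A service processing success event clears the indicator to zero.
        self.count = 0

def failure_ratio(failures, total):
    """Ratio of failed to total service processing operations in a period."""
    return failures / total if total else 0.0

def is_abnormal(indicator_value, failure_threshold):
    """An entity is treated as abnormal when the indicator exceeds the threshold."""
    return indicator_value > failure_threshold
```

The same `is_abnormal` comparison can be reused for the hardware, software, and logical resource entities, each with its own threshold taken from the failure criterion.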

Specifically, the obtaining subunit 621 includes a first statistic subunit 6211, a second statistic subunit 6212, and a third statistic subunit 6213.

The first statistic subunit 6211 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the hardware entity, where the hardware entity is the hardware to which the object entity belongs;

The second statistic subunit 6212 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the software entity, where the software entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity;

The third statistic subunit 6213 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the logical resource entity, where the logical resource entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.

The determining subunit 622 includes a first determining subunit 6221, a second determining subunit 6222, and a third determining subunit 6223.

The first determining subunit 6221 is specifically configured to determine whether the hardware entity is abnormal according to the failure indicator value calculated for the hardware entity and the first failure threshold for the hardware entity in the failure criterion.

The second determining subunit 6222 is configured to determine whether the software entity is abnormal according to the failure indicator value calculated for the software entity and the second failure threshold for the software entity in the failure criterion.

The third determining subunit 6223 is configured to determine whether the logical resource entity is abnormal according to the failure indicator value calculated for the logical resource entity and the third failure threshold for the logical resource entity in the failure criterion.

Specifically, the sending unit 63 is configured to: when only the hardware entity is abnormal, send a fault warning notification message that includes the physical address information of the hardware to which the object entity belongs; when both the hardware entity and the software entity are abnormal, send a fault warning notification message that includes only the software entity information, where the software entity information includes the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity; and when both the hardware entity and the logical resource entity are abnormal, send a fault warning notification message that includes only the logical resource entity information, where the logical resource entity information includes the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
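The selection rule applied by the sending unit 63 can be sketched as follows. This is a hedged illustration: the function name, the dictionary layout of a notification, and the example addresses are assumptions, and combinations that the text does not specify (for example, a software entity that is abnormal while the hardware entity is not) simply produce no notification here.

```python
def notification_payloads(hw_abnormal, sw_abnormal, res_abnormal,
                          physical_address, logical_address, cause):
    """Choose which entity information goes into the fault warning message."""
    payloads = []
    if hw_abnormal and sw_abnormal:
        # Hardware and software both abnormal: report only the software entity.
        payloads.append({"entity": "software",
                         "physical_address": physical_address,
                         "logical_address": logical_address})
    if hw_abnormal and res_abnormal:
        # Hardware and logical resource both abnormal: report only the resource.
        payloads.append({"entity": "logical_resource",
                         "physical_address": physical_address,
                         "cause": cause})
    if hw_abnormal and not sw_abnormal and not res_abnormal:
        # Only the hardware entity is abnormal: report the hardware address.
        payloads.append({"entity": "hardware",
                         "physical_address": physical_address})
    return payloads
```

Reporting the narrower entity first keeps fault detection focused on the smallest suspect scope before the hardware as a whole is examined.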

Specifically, the physical address information of the hardware to which the object entity belongs includes a first-level sub-address; the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address.

To ensure that the abnormal entity can be repaired in time and the fault is eliminated at an early stage, the monitoring device further includes a first control unit 69 and a second control unit 610.

The first control unit 69 is configured to control the sending unit 63 to send a fault warning notification message including the first-level sub-address if, within the preset time period after the fault warning notification message including the physical address information of the hardware to which the object entity belongs is sent, the first determining subunit 6221 determines that the hardware entity remains abnormal. In this case, the sending unit 63 is further configured to send the fault warning notification message including the first-level sub-address.

The second control unit 610 is configured to control the sending unit 63 to send a fault warning notification message including hardware entity information if, within the preset time period after the fault warning notification message used to indicate fault detection is sent, the first determining subunit 6221 determines that the hardware entity remains abnormal, where the hardware entity information includes the physical address information of the hardware. In this case, the sending unit 63 is further configured to send the fault warning notification message including the hardware entity information.
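The escalation performed by the two control units can be sketched as follows. This is a minimal sketch under assumptions: the scope labels, the function name, and the message layout are hypothetical, and the caller is assumed to invoke the function once the preset time period after the previous notification has elapsed.

```python
def next_notification(last_scope, hardware_still_abnormal,
                      physical_address, first_level_subaddress):
    """Decide the follow-up message after the preset time period.

    last_scope is the entity type named in the previous notification:
    "software", "logical_resource", or "hardware".
    """
    if not hardware_still_abnormal:
        return None  # the earlier notification was sufficient
    if last_scope in ("software", "logical_resource"):
        # Widen from the software or resource scope to the hardware entity.
        return {"entity": "hardware", "physical_address": physical_address}
    if last_scope == "hardware":
        # Widen from the component to the hardware that contains it.
        return {"entity": "parent_hardware", "address": first_level_subaddress}
    raise ValueError(f"unknown scope: {last_scope}")
```

Each step widens the suspect scope by one level only, so fault detection is repeated on a larger entity only when the narrower attempt did not clear the abnormality.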

Optionally, the service processing failure event further includes the current load of the communication unit. Optionally, to ensure the accuracy of the failure analysis, the monitoring device further includes a first judging unit 64 and a second judging unit 65.

The first judging unit 64 is configured to determine whether the current load of the communication unit is less than a preset threshold and, if not, discard the service processing failure event. In this case, the determining unit 62 is configured to determine, when the determination result of the first judging unit 64 is yes, the entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure determination criterion.

The second judging unit 65 is configured to determine whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by the terminal device and, if yes, discard the service processing failure event. In this case, the determining unit 62 is configured to determine, when the determination result of the second judging unit 65 is no, the entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure determination criterion.

Optionally, the first acquiring unit 61 is specifically configured to acquire a service processing failure event that is reported by the communication unit and forwarded by a sub-monitoring device, where the sub-monitoring device forwards the service processing failure event when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.
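The two pre-filters applied before failure analysis can be sketched as follows. The `FailureEvent` fields and the `accept_event` name are hypothetical; only the two discard rules come from the description above.

```python
from dataclasses import dataclass

@dataclass
class FailureEvent:
    target_address: str               # address of the object entity that failed
    reporter_load: float              # current load of the reporting unit
    caused_by_terminal: bool = False  # specific indication identifier

def accept_event(event, load_threshold):
    """Return True if the event should enter failure analysis."""
    if event.reporter_load >= load_threshold:
        # Load is not below the preset threshold: discard the event.
        return False
    if event.caused_by_terminal:
        # Failure attributed to the terminal device: discard the event.
        return False
    return True
```

Both checks discard events that do not indicate a fault in the object entity itself, which is why they run before any failure indicator value is updated.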

Optionally, the sending unit 63 is specifically configured to send the fault warning notification message to the object entity whose service processing failed, or to the management module of the object entity whose service processing failed.

To ensure the accuracy of the failure analysis, the monitoring device further includes: a second acquiring unit 66, configured to acquire a service processing success event reported by the communication unit; and a clearing unit 67, configured to clear the counted failure indicator value to zero after the second acquiring unit 66 acquires the service processing success event. Specifically, the failure indicator value counted by the first statistic subunit 6211, the second statistic subunit 6212, or the third statistic subunit 6213 is cleared. Optionally, to ensure that the abnormal entity can be repaired in time and the fault is eliminated at an early stage, the monitoring device may further include a receiving unit 611.

The receiving unit 611 is configured to receive a failure query message, where the failure query message is sent by the object entity whose service processing failed or by the management module of that object entity. The sending unit 63 is further configured to send a response message according to the determination result of the determining subunit, where the response message includes the current latest failure analysis result. Specifically, the response message includes the current latest failure analysis result of the abnormal entity targeted by the fault warning notification message that was sent. If the fault warning notification message was sent for the hardware entity (that is, the message includes information about the hardware entity), the response message includes the current latest failure analysis result of the hardware entity, that is, information indicating whether the hardware entity is abnormal. If the fault warning notification message was sent for the software entity (that is, the message includes information about the software entity), the response message includes the current latest failure analysis result of the software entity, that is, information indicating whether the software entity is abnormal. If the fault warning notification message was sent for the logical resource entity (that is, the message includes information about the logical resource entity), the response message includes the current latest failure analysis result of the logical resource entity, that is, information indicating whether the logical resource entity is abnormal.

In the embodiment of the present invention, the communication unit reports a service processing failure event promptly when service processing by the object entity fails, and the monitoring device performs failure analysis to determine the specific entity that has an abnormality and sends a fault warning notification message to promptly trigger the fault detection process and the fault recovery process of that entity. This not only enables the abnormal entity to be automatically repaired or isolated in time, eliminating the fault at an early stage, but also ensures the long-term, stable operation of the system, effectively avoids the spread of faults, and improves system reliability. In addition, the fault detection process is triggered only after the analysis finds a failure in the system, and only for the entity that has an abnormality, so that the fault alarms generated by fault detection are consistent with the observed system failure and the reporting of irrelevant alarms is effectively suppressed. The technical solution provided in this embodiment can monitor all service processing failures in the system, including failures of signaling message processing, management message processing, and service code stream processing. It can therefore detect the failure of any communication unit, which ensures the completeness of the detection. Even if the system design provides no dedicated fault detection technique for some communication units, the failure of such a communication unit can still be determined by the solution described in the present invention, and targeted fault recovery measures can then be taken so that the abnormal communication unit is automatically repaired in time and the system returns to normal.

Referring to FIG. 7, an embodiment of the present invention provides a communication system, which is applicable to a distributed failure analysis processing mode and includes a communication unit 701, a sub-monitoring unit 702, and a parent monitoring unit 703.

Specifically, the sub-monitoring unit 702 is configured to acquire the service processing failure event reported by the communication unit 701 and, when it determines according to the address information of the object entity whose service processing failed (carried in the service processing failure event) that the object entity does not belong to the scope managed by itself, report the service processing failure event to the parent monitoring unit 703.

The parent monitoring unit 703 is configured to determine, according to the address information of the object entity whose service processing failed (carried in the service processing failure event), whether the object entity belongs to the scope managed by itself. If yes, it determines the entity that has an abnormality according to the service processing failure event reported by the communication unit 701 and the preset failure determination criterion, and sends a fault warning notification message used to indicate fault detection, where the fault warning notification message includes information about at least one of the determined abnormal entities. If not, it continues to report the service processing failure event to the parent monitoring unit of the parent monitoring unit 703.

The parent monitoring unit is a network-level monitoring unit located on the central network management device, and the sub-monitoring unit is a network-element-level monitoring unit located on the central control board of the network element; or, the parent monitoring unit is a network-element-level monitoring unit located on the central control board of the network element, and the sub-monitoring unit is a frame-level monitoring unit located on the central control board of the frame; or, the parent monitoring unit is a frame-level monitoring unit located on the central control board of the frame, and the sub-monitoring unit is a board-level monitoring unit located on the board where the communication unit is located. For details, refer to the corresponding description in the method embodiment of this specification; details are not described herein again.

In the embodiment of the present invention, the communication unit reports a service processing failure event promptly when service processing by the object entity fails, which triggers the fault detection process and the fault recovery process of the abnormal entity in time. This enables the abnormal entity to be automatically repaired in time, eliminates the fault at an early stage, ensures the long-term, stable, and normal operation of the system, effectively avoids the spread of faults, and improves system reliability.
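The distributed forwarding between monitoring levels can be sketched as follows. The `MonitoringUnit` class, its `route` method, and the address strings are hypothetical; the sketch only shows that an event is analyzed by the lowest-level unit whose management scope contains the object entity.

```python
class MonitoringUnit:
    """One level of the monitoring hierarchy (board, frame, network element, network)."""

    def __init__(self, name, managed_addresses, parent=None):
        self.name = name
        self.managed = set(managed_addresses)
        self.parent = parent

    def route(self, target_address):
        """Return the name of the unit that should analyze the event."""
        if target_address in self.managed:
            return self.name  # within own scope: analyze locally
        if self.parent is None:
            return None       # no higher level is left to forward to
        # Outside own scope: report the event to the parent monitoring unit.
        return self.parent.route(target_address)
```

For example, a board-level unit that manages only its own board forwards an event about a different board to the frame-level unit, which analyzes it if that board lies within the frame.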

Referring to FIG. 8, an embodiment of the present invention provides a communication system, including a first communication unit 801, a second communication unit 802, and a monitoring unit 803. The monitoring unit 803 is configured to acquire the service processing failure event reported by the first communication unit 801 and the service processing failure event reported by the second communication unit 802. When the address information of the object entity whose service processing failed, carried in the event reported by the first communication unit 801, is the address information of the second communication unit 802, and the address information of the object entity whose service processing failed, carried in the event reported by the second communication unit 802, is the address information of the first communication unit 801, the monitoring unit 803 does not perform failure analysis on the first communication unit 801 or the second communication unit 802. Specifically, the monitoring unit 803 refrains from performing failure analysis on the first communication unit 801 and the second communication unit 802 according to the service processing failure event reported by the first communication unit 801, the service processing failure event reported by the second communication unit 802, and the preset failure determination criterion. The preset failure determination criterion specifies that when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is not performed on the two communication units.

It should be noted that, when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, this indicates that the communication path between the first communication unit and the second communication unit is faulty. Therefore, failure analysis of the two communication units is not required. For details, refer to the corresponding description in the method embodiment of this specification; details are not described herein again.

In the communication system provided by the embodiment of the present invention, when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is not performed on the two communication units, which avoids erroneous failure analysis results.

A person skilled in the art can understand that all or some of the steps of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

The above description of the fault monitoring method, the monitoring device, and the communication system provided by the embodiments of the present invention is intended only to assist in understanding the method and core idea of the present invention. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope according to the idea of the present invention. The content of this specification should therefore not be construed as limiting the present invention.

Claims

1. A fault monitoring method, characterized by comprising:

Obtaining a service processing failure event reported by a communication unit, where the service processing failure event includes address information of an object entity whose service processing failed;

Determining an entity that has an abnormality according to the service processing failure event reported by the communication unit and a preset failure criterion;

Sending a fault warning notification message used to indicate fault detection, where the fault warning notification message includes information about at least one of the determined abnormal entities.

2. The method of claim 1, wherein

Determining the entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure criterion is specifically:

Obtaining a failure indicator value by using the service processing failure event reported by the communication unit;

Determining the entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion.

3. The method of claim 2, wherein

Obtaining the failure indicator value by using the service processing failure event reported by the communication unit includes: using the service processing failure event reported by the communication unit to count the accumulated number of consecutive service processing failures, where the accumulated number of consecutive service processing failures is the failure indicator value;

Or,

Using the service processing failure event reported by the communication unit to obtain the ratio of the number of service processing failures within a period of time to the total number of service processing operations within that period, where this ratio is the failure indicator value.

4. The method of claim 2, wherein

Obtaining the failure indicator value by using the service processing failure event reported by the communication unit includes: after receiving the service processing failure event reported by the communication unit, querying a key performance indicator, where the key performance indicator is the failure indicator value.

5. The method of claim 3, wherein

The address information of the object entity that fails the service processing includes: physical address information of hardware to which the object entity belongs;

Determining the entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion includes:

Determining whether the hardware entity is abnormal according to a failure indicator value for a hardware entity and a first failure threshold for the hardware entity in the failure criterion, wherein the hardware entity is hardware to which the object entity belongs.

6. The method of claim 5, wherein

Sending the fault warning notification message used to indicate fault detection is specifically:

Sending a fault warning notification message including the physical address information of the hardware to which the object entity belongs, where the physical address information of the hardware to which the object entity belongs includes a first-level sub-address, and the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address;

After the fault warning notification message including the physical address information of the hardware to which the object entity belongs is sent, the method further includes: if it is determined that the hardware to which the object entity belongs remains abnormal within the preset time period, sending a fault warning notification message including the first-level sub-address.

7. The method of claim 5, wherein

The address information of the object entity that fails the service processing further includes: logical address information of the object entity;

Determining the entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion further includes:

Determining whether the software entity is abnormal according to a failure indicator value for the software entity and a second failure threshold for the software entity in the failure criterion, where the software entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.

8. The method of claim 7 wherein:

Sending the fault warning notification message used to indicate fault detection is specifically:

When both the hardware entity and the software entity are abnormal, sending a fault warning notification message that includes only the software entity information, where the software entity information includes the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.

9. The method of claim 5, wherein

The service processing failure event further includes cause indication information of the service processing failure;

Determining the entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion includes: determining whether the logical resource entity is abnormal according to a failure indicator value for the logical resource entity and a third failure threshold for the logical resource entity in the failure criterion, where the logical resource entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.

10. The method of claim 9 wherein:

When both the hardware entity and the logical resource entity are abnormal, the fault warning notification message includes only the logical resource entity information, where the logical resource entity information includes the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.

11. The method according to claim 8 or 10, further comprising: if it is determined that the hardware entity remains abnormal within a preset time period after the fault warning notification message used to indicate fault detection is sent, sending a fault warning notification message including hardware entity information, where the hardware entity information includes the physical address information of the hardware to which the object entity belongs.

12. The method according to any one of claims 2 to 10, characterized in that

The service processing failure event further includes the current load of the communication unit;

The method further includes: determining whether the current load of the communication unit is less than a preset threshold; if yes, triggering the step of obtaining the failure indicator value; if not, discarding the service processing failure event.

13. The method according to any one of claims 2 to 10, characterized in that

The method further includes: determining whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by the terminal device; if not, triggering the step of obtaining the failure indicator value; if yes, discarding the service processing failure event.

14. A method according to any one of claims 1 to 10, characterized in that

The service processing failure event is a signaling message processing failure event, a management message processing failure event, a service code stream processing failure event, or an interface call processing failure event.

15. A method according to any one of claims 1 to 10, characterized in that

The service processing failure event is a signaling message processing failure event reported by the communication unit when determining that the field assignment value in the signaling message from the peer communication unit is abnormal;

Alternatively, the service processing failure event is a signaling message processing failure event reported when the communication unit does not receive the response message of the peer communication unit within a predetermined time period.

16. A method according to any one of claims 1 to 10, characterized in that obtaining the service processing failure event reported by the communication unit is specifically:

Obtaining a service processing failure event that is reported by the communication unit and forwarded by a sub-monitoring device, where the sub-monitoring device forwards the service processing failure event when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.

17. The method according to any one of claims 3 to 5, wherein the method further comprises: acquiring a service processing success event reported by the communication unit, and clearing the failure indicator value to zero.

18. The method of any of claims 1-10, wherein

Sending the fault warning notification message is specifically:

Sending the fault warning notification message to the object entity whose service processing failed;

Or, sending the fault warning notification message to the management module of the object entity whose service processing failed.

19. A monitoring device, comprising:

20. The monitoring device of claim 19, wherein

The determining unit includes:

An obtaining subunit, configured to obtain a failure indicator value by using the service processing failure event reported by the communication unit;

A determining subunit, configured to determine the entity that has an abnormality according to the failure indicator value and the corresponding failure threshold in the failure criterion.

21. The monitoring device of claim 20, wherein

The obtaining subunit is configured to use the service processing failure event reported by the communication unit to count the accumulated number of consecutive service processing failures, where the accumulated number of consecutive service processing failures is the failure indicator value;

Or,

The obtaining subunit is configured to use the service processing failure event reported by the communication unit to obtain the ratio of the number of service processing failures within a period of time to the total number of service processing operations within that period, where this ratio is the failure indicator value.

22. The monitoring device of claim 20, wherein

The obtaining sub-unit is configured to query a key performance indicator after receiving a service processing failure event reported by the communication unit, where the key performance indicator is the failure indicator value.

23. The monitoring device of claim 21, wherein

The obtaining subunit includes a first statistical subunit,

The first statistic subunit is configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the hardware entity, where the hardware entity is the hardware to which the object entity belongs;

The determining subunit includes a first determining subunit,

The first determining subunit is configured to determine whether the hardware entity is abnormal according to a failure indicator value for a hardware entity and a first failure threshold for the hardware entity in the failure criterion.

24. The monitoring device of claim 23, wherein

The fault warning notification message sent by the sending unit includes: physical address information of the hardware to which the object entity belongs;

The physical address information of the hardware to which the object entity belongs includes: a first-level sub-address; the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address;

The monitoring device also includes:

A first control unit, configured to: when the first determining subunit determines that the hardware entity remains abnormal after the fault warning notification message including the physical address information of the hardware to which the object entity belongs has been sent, control the sending unit to send a fault warning notification message including the first-level sub-address;

The sending unit is further configured to send the fault warning notification message including the first-level sub-address.

25. The monitoring device of claim 23, wherein

The obtaining subunit further includes a second statistical subunit,

The second statistical subunit is configured to use the service processing failure event reported by the communication unit to collect a failure indicator value for the software entity, where the software entity is the entity corresponding to the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity;

The determining subunit includes a second determining subunit,

The second determining subunit is configured to determine whether the software entity is abnormal according to the failure indicator value collected for the software entity and a second failure threshold value for the software entity in the failure criterion.

26. The monitoring device of claim 25, wherein

The sending unit is configured to send, when both the hardware entity and the software entity are abnormal, a fault warning notification message that includes only the software entity information, where the software entity information includes: the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
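The per-entity statistics of claims 23 and 25 and the reporting rule of claim 26 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the threshold values, the class and field names, the use of a plain failure count as the failure indicator value, and the behaviour when only the software entity is abnormal are all hypothetical and are not specified by the claims.

```python
from collections import defaultdict

HARDWARE_THRESHOLD = 5  # hypothetical first failure threshold value
SOFTWARE_THRESHOLD = 3  # hypothetical second failure threshold value


class EntityFailureStats:
    """Counts failures per hardware entity and per software entity."""

    def __init__(self):
        self.hardware = defaultdict(int)  # key: physical address
        self.software = defaultdict(int)  # key: (physical, logical) address

    def on_failure_event(self, physical_addr, logical_addr):
        self.hardware[physical_addr] += 1
        self.software[(physical_addr, logical_addr)] += 1

    def entity_to_report(self, physical_addr, logical_addr):
        hw_count = self.hardware.get(physical_addr, 0)
        sw_count = self.software.get((physical_addr, logical_addr), 0)
        hw_abnormal = hw_count >= HARDWARE_THRESHOLD
        sw_abnormal = sw_count >= SOFTWARE_THRESHOLD
        if sw_abnormal:
            # Claim 26: when both are abnormal, report only the software
            # entity. Assumption: the same applies when only it is abnormal.
            return {"physical_addr": physical_addr,
                    "logical_addr": logical_addr}
        if hw_abnormal:
            return {"physical_addr": physical_addr}
        return None
```

Reporting the narrower software entity first keeps the resulting fault detection targeted; the hardware entity is still available as a fallback if the abnormality persists.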

27. The monitoring device of claim 23, wherein

The obtaining subunit further includes a third statistical subunit,

The third statistical subunit is configured to use the service processing failure event reported by the communication unit to collect a failure indicator value for the logical resource entity, where the logical resource entity is the entity corresponding to the physical address information of the hardware to which the object entity belongs and the service processing failure cause indication information;

The determining subunit further includes a third determining subunit,

The third determining subunit is configured to determine whether the logical resource entity is abnormal according to the failure indicator value collected for the logical resource entity and a third failure threshold value for the logical resource entity in the failure criterion.

28. The monitoring device of claim 27, wherein

The sending unit is configured to send, when both the hardware entity and the logical resource entity are abnormal, a fault warning notification message that includes only the logical resource entity information, where the logical resource entity information includes: the physical address information of the hardware to which the object entity belongs and the service processing failure cause indication information.

29. The monitoring device according to claim 26 or 28, wherein the monitoring device further comprises: a second monitoring unit, configured to: if the first determining subunit determines that the hardware entity remains abnormal within a preset time period after the sending unit sends the fault warning notification message, control the sending unit to send a fault warning notification message including hardware entity information, where the hardware entity information includes: the physical address information of the hardware to which the object entity belongs;

The sending unit is further configured to send the fault warning notification message including the hardware entity information.

30. The monitoring device according to any one of claims 19 to 28, wherein the service processing failure event further comprises: a current load amount of the communication unit; the monitoring device further includes: a first determining unit, configured to determine whether the current load amount of the communication unit is less than a preset threshold, and if not, discard the service processing failure event;

The determining unit is configured to determine, when the determination result of the first determining unit is yes, the entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure criterion.

31. The monitoring device according to any one of claims 19 to 28, wherein

The monitoring device further includes: a second determining unit, configured to determine whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by a terminal device, and if yes, discard the service processing failure event;

The determining unit is configured to determine, when the determination result of the second determining unit is no, the entity that has an abnormality according to the service processing failure event reported by the communication unit and the preset failure criterion.
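The two pre-filters of claims 30 and 31 can be combined into a single check before an event is counted. The following Python sketch is illustrative only: the event fields, the threshold value, and the function name are assumptions, and the load is modelled as a fraction between 0 and 1.

```python
from dataclasses import dataclass

LOAD_THRESHOLD = 0.8  # hypothetical preset threshold for the load amount


@dataclass
class FailureEvent:
    target_address: str               # address of the failed object entity
    reporter_load: float              # current load of the reporting unit
    caused_by_terminal: bool = False  # specific indication identifier


def should_count(event, load_threshold=LOAD_THRESHOLD):
    """Return True if the event should feed the failure statistics."""
    # Claim 30: discard the event unless the load is below the threshold,
    # since failures under heavy load may only reflect overload.
    if event.reporter_load >= load_threshold:
        return False
    # Claim 31: discard failures attributed to the terminal device.
    if event.caused_by_terminal:
        return False
    return True
```

Only events for which `should_count` returns `True` would then be passed on to the determining unit.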

32. The monitoring device according to any one of claims 18 to 28, wherein the first acquiring unit is configured to acquire a service processing failure event that is reported by the communication unit and forwarded by a sub-monitoring device, where the service processing failure event is forwarded by the sub-monitoring device when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.

33. The monitoring device according to any one of claims 21 to 23, wherein the monitoring device further comprises:

A second acquiring unit, configured to acquire a service processing success event reported by the communication unit;

A clearing unit, configured to clear the counted failure indicator value after the second acquiring unit acquires the service processing success event.

34. A communication system, comprising: a communication unit, a sub-monitoring unit, and a parent monitoring unit, wherein the sub-monitoring unit is configured to acquire a service processing failure event reported by the communication unit, and, when determining according to the address information of the object entity whose service processing failed, carried in the service processing failure event, that the object entity does not belong to its own management scope, report the service processing failure event to the parent monitoring unit;

The parent monitoring unit is configured to receive the service processing failure event reported by the sub-monitoring unit, and determine, according to the address information of the object entity whose service processing failed in the service processing failure event, whether the object entity belongs to its own management scope; if yes, determine an entity that has an abnormality according to the service processing failure event and a preset failure criterion, and send a fault warning notification message for indicating fault detection, where the fault warning notification message includes: information of at least one entity of the determined entities that have an abnormality; if no, continue to report the service processing failure event to the parent monitoring unit of the parent monitoring unit.
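The escalation behaviour of claim 34 amounts to walking up a tree of monitoring units until one of them manages the failed object entity. The following minimal Python sketch shows only that routing step; the class name, the representation of a management scope as a set of addresses, and the return value are assumptions made for illustration, and the abnormality determination itself is omitted.

```python
class Monitor:
    """A monitoring unit with a management scope and an optional parent."""

    def __init__(self, name, managed_addresses, parent=None):
        self.name = name
        self.managed = set(managed_addresses)
        self.parent = parent
        self.handled = []  # events this unit accepted for its own statistics

    def on_failure_event(self, target_address):
        """Handle the event locally or escalate it to the parent.

        Returns the name of the unit that handled the event, or None if
        no unit in the chain manages the failed object entity.
        """
        if target_address in self.managed:
            self.handled.append(target_address)
            return self.name
        if self.parent is not None:
            return self.parent.on_failure_event(target_address)
        return None
```

For example, a sub-monitoring unit that manages only one board would pass an event about another board to its parent, which either handles it or reports it further up.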

35. A communication system, comprising: a first communication unit, a second communication unit, and a monitoring unit,

The monitoring unit is configured to obtain a service processing failure event reported by the first communication unit and a service processing failure event reported by the second communication unit; if the address information of the object entity whose service processing failed, carried in the service processing failure event reported by the first communication unit, is the address information of the second communication unit, and the address information of the object entity whose service processing failed, carried in the service processing failure event reported by the second communication unit, is the address information of the first communication unit, the first communication unit and the second communication unit are not used. Communication unit performs failure