WO2011103778A1 - Fault monitoring method, monitoring device, and communication system

Info

Publication number
WO2011103778A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
failure
service processing
communication unit
hardware
Prior art date
Application number
PCT/CN2011/070390
Other languages
French (fr)
Chinese (zh)
Inventor
杨胜强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2011103778A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0677: Localisation of faults

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a fault monitoring method, a monitoring device, and a communication system.
  • FMEA Failure Mode and Effects Analysis
  • some faults are undetectable, that is, the communication device cannot detect them at all or cannot detect them in real time;
  • because the communication device cannot detect such faults in time, recovery cannot be performed in time.
  • Embodiments of the present invention provide a fault monitoring method, a monitoring device, and a communication system, which can improve the efficiency of fault detection.
  • a fault monitoring method includes:
  • the service processing failure event includes: address information of the object entity that fails the service processing;
  • a monitoring device comprising:
  • a first acquiring unit configured to acquire a service processing failure event reported by the communication unit;
  • the service processing failure event includes: address information of the object entity that fails the service processing;
  • a determining unit configured to determine an entity in which an abnormality has occurred, according to the service processing failure event reported by the communication unit and a preset failure criterion;
  • a sending unit configured to send a fault warning notification message, where the fault warning notification message includes information about at least one of the entities determined to be abnormal and is used to indicate that fault detection should be performed.
  • a communication system comprising: a communication unit, a sub-monitoring unit, and a parent monitoring unit, wherein the sub-monitoring unit is configured to acquire a service processing failure event reported by the communication unit, determine, according to the address information of the object entity that fails the service processing carried in the service processing failure event, that the object entity does not belong to the scope of its own management, and report the service processing failure event to the parent monitoring unit;
  • the parent monitoring unit is configured to receive the service processing failure event reported by the sub-monitoring unit and determine, according to the address information of the object entity that fails the service processing carried in the service processing failure event, whether the object entity belongs to the scope of its own management. If yes, it determines, according to the service processing failure event and the preset failure criterion, the entity in which an abnormality has occurred, and sends a fault warning notification message for indicating fault detection, where the fault warning notification message includes information about at least one of the entities determined to be abnormal; if not, it continues to report the service processing failure event to its own parent monitoring unit.
  • a communication system comprising: a first communication unit, a second communication unit, and a monitoring unit, wherein the monitoring unit is configured to acquire a service processing failure event reported by the first communication unit and a service processing failure event reported by the second communication unit. When the address information of the object entity that fails the service processing carried in the event reported by the first communication unit is the address information of the second communication unit, and the address information of the object entity that fails the service processing carried in the event reported by the second communication unit is the address information of the first communication unit, the monitoring unit determines that neither the first communication unit nor the second communication unit has failed.
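  • The mutual-report condition in the second communication system can be sketched as follows. This is a minimal illustration rather than the claimed implementation; the `FailureEvent` class, its field names, and the `is_mutual_report` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical event shape; field names are assumptions."""
    reporter: str        # address of the communication unit that reported the event
    failed_entity: str   # address of the object entity whose service processing failed


def is_mutual_report(a: FailureEvent, b: FailureEvent) -> bool:
    """Return True when two distinct units each name the other as the failed entity."""
    return (
        a.reporter != b.reporter
        and a.failed_entity == b.reporter
        and b.failed_entity == a.reporter
    )
```

  When `is_mutual_report` returns True, a monitoring unit following the text above would treat neither unit as failed.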
  • the embodiment of the present invention determines the entity in which an abnormality has occurred by analyzing the service processing failure events reported by the communication unit, and sends a corresponding fault warning notification message so that the system can take corresponding fault handling measures. The abnormal entity can thus recover in time, the fault is fixed while still in the bud, fault spreading is avoided, and system reliability is improved.
  • FIG. 1 is a flowchart of a fault monitoring method according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a fault monitoring and processing method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of another fault monitoring and processing method according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention.
  • FIG. 6 is a structural diagram of a monitoring device according to an embodiment of the present invention.
  • FIG. 7 is a structural diagram of a communication system according to an embodiment of the present invention.
  • FIG. 8 is a structural diagram of another communication system according to an embodiment of the present invention.
  • an embodiment of the present invention provides a fault monitoring method, including:
  • the communication unit may be a network element in the communication system, or a processing unit in the network element. A processing unit may be a hardware entity such as a chassis, a board, a chip, a processor, or an I/O device; a software entity running on a chip or processor, such as a software module, process, or thread; or a logical resource entity deployed in system programs, such as a memory resource, semaphore, service processing resource, bandwidth resource, or link resource.
  • the service processing failure event reported by the communication unit may be obtained in either of two modes: in the first mode, the monitoring unit directly receives the service processing failure event reported by the communication unit; in the second mode, a parent monitoring unit receives the service processing failure event sent by a sub-monitoring unit.
  • the second mode is applicable to the distributed failure analysis processing mode.
  • the distributed failure analysis processing modes include but are not limited to: board-level failure analysis, frame-level failure analysis, network-element-level (NE-level) failure analysis, and network-level failure analysis. Monitoring units of different levels (that is, units that perform failure analysis) can be logically deployed together or deployed on different hardware; to improve processing efficiency, they are generally deployed on different hardware.
  • board-level failure analysis covers failure analysis of the hardware modules on the board or the software modules running on the board, and is deployed directly on the board.
  • frame-level failure analysis covers not only the content of board-level failure analysis but also what board-level failure analysis cannot process, and is deployed on the central control board of the frame.
  • the NE-level failure analysis is deployed on the central control board of the NE.
  • network-level failure analysis is deployed on the central control node of the network, such as a central network management device. Therefore, the parent monitoring unit may be a network-level monitoring unit located on the central network management device, with the sub-monitoring unit being an NE-level monitoring unit located on the central control board of the network element;
  • or the parent monitoring unit may be an NE-level monitoring unit located on the central control board of the NE, with the sub-monitoring unit being a frame-level monitoring unit located on the central control board of the frame;
  • or the parent monitoring unit may be a frame-level monitoring unit located on the central control board of the frame,
  • with the sub-monitoring unit being a board-level monitoring unit located on the board where the communication unit is located.
  • a service processing failure event that the failure analysis at the current level can handle is terminated at that level and is not reported to the upper level;
  • otherwise, the failure analysis at the current level reports the service processing failure event of the communication unit to the failure analysis at the upper level. For example, suppose the A board receives a response from the B board in which some fields are incorrectly assigned. The A board reports a service processing failure event for the B board to its board-level monitoring unit, and the event carries the address information of the B board. Because the board-level monitoring unit of the A board cannot analyze failures of other boards, it reports the service processing failure event to the frame-level monitoring unit to which the A board belongs.
  • if the frame-level monitoring unit to which the A board belongs cannot analyze the event effectively, the event is reported to the NE-level monitoring unit to which the A board belongs.
  • if the A board and the B board are located on different NEs,
  • the NE-level monitoring unit of the A board cannot analyze the event effectively either, and the event continues to be reported to the network-level monitoring unit for analysis.
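  • The level-by-level escalation in the A-board/B-board example can be sketched as a chain of monitoring units, each of which either analyzes an event or passes it to its parent. This is an illustrative sketch only; the `MonitoringUnit` class, the `managed` scope set, and the address strings are assumptions made for this example and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class MonitoringUnit:
    """One level of failure analysis (board, frame, NE, or network)."""
    name: str
    managed: Set[str]                          # addresses this level can analyze
    parent: Optional["MonitoringUnit"] = None  # next level up, if any
    analyzed: List[str] = field(default_factory=list)

    def handle(self, failed_entity: str) -> Optional[str]:
        """Analyze locally if in scope, else escalate; return the handling level."""
        if failed_entity in self.managed:
            self.analyzed.append(failed_entity)
            return self.name
        if self.parent is None:
            return None  # no level manages this entity
        return self.parent.handle(failed_entity)


# Board A and board B sit on different NEs, so only the network level covers both.
network = MonitoringUnit("network", {"board-A", "board-B"})
ne_a = MonitoringUnit("ne-A", {"board-A"}, parent=network)
frame_a = MonitoringUnit("frame-A", {"board-A"}, parent=ne_a)
board_a = MonitoringUnit("board-A", {"board-A"}, parent=frame_a)
```

  In this setup, an event about board B raised on board A climbs through the frame and NE levels and is analyzed at the network level, while an event about board A itself stops at the board level.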
  • the object entity that fails the service processing is the communication unit or the peer communication unit that communicates with the communication unit;
  • the service processing failure event may be a signaling message processing failure event, a management message processing failure event, a service code stream processing failure event, or an interface call processing failure event.
  • the communication unit reports a signaling message processing failure event when it fails to execute the function corresponding to a signaling message; reports a management message processing failure event when it fails to execute a management message; reports a service code stream processing failure event when it fails to process a service code stream; and reports an interface call processing failure event when an interface call fails.
  • the address information of the object entity that fails the service processing is the address information of the message processing communication unit.
  • the address information of the object entity that failed the service processing is the address information of the message sending communication unit.
  • the address information of the object entity that failed the service processing is the address information of the message receiving communication unit (ie, the peer communication unit).
  • a failed interface call indicates that the interface device may be faulty, so the address information of the object entity that fails the service processing is the address information of the communication unit of the interface device. For example, if an interface call fails while reading or writing the hard disk, the hard disk may be faulty.
  • the service processing failure event may further include cause indication information indicating why the service processing failed. It may also include key contextual operating parameters at the time of service processing, such as the current load and the total number of service processing operations.
  • the communication unit may not report the service failure event, thereby avoiding unnecessary failure analysis.
  • if the communication unit determines that an abnormal field assignment in a signaling message from the peer communication unit is caused by an illegal accessing terminal device (including user terminals and operation and maintenance terminals), it may either not report the service processing failure event, or report the event with a specific field for identification. In this case, control can also be performed at the communication unit, that is, the communication unit is controlled not to report the service processing failure event.
  • for example, if a communication unit in a Home Location Register (HLR) device finds that the international mobile subscriber identity (IMSI) or electronic serial number (ESN) of the terminal carried in a call request message is invalid, it may not report the service failure event, thereby avoiding unnecessary failure analysis.
  • the address information of the object entity that fails the service processing includes physical address information of the hardware to which the object entity belongs, which uniquely identifies that hardware within the entire communication system. If the communication unit is a processing unit in the network element, such as a chassis, board, chip, processor, or I/O device, the physical address information of the hardware to which the object entity belongs may be a signaling point identifier or IP address, or a physical address expressed as [rack number, board slot number, subsystem number].
  • the address information of the object entity that fails the service processing may further include logical address information of the software entity. The logical address information may be a software module address or process address, or a software module number or process number in one-to-one correspondence with the software module address or process address.
  • the cause indication information may indicate that the service processing failed because a resource application failed, where the resource may be a memory resource, semaphore, service processing resource, bandwidth resource, link resource, or similar resource in the system.
  • the cause indication information may be a specific number that corresponds to the resource whose application failed. It is generally recommended that numbers and resources be kept in one-to-one correspondence, so that whenever service processing fails because the same resource application failed, the cause indication information is the same anywhere in the system, which facilitates failure analysis of resources in the system.
  • the communication unit does not need to report any event when the service processing of the object entity succeeds.
  • alternatively, the communication unit reports a service processing success event to the monitoring unit when the object entity subsequently performs service processing successfully.
  • whether the communication unit reports the service processing success event may also be controlled by the monitoring unit. For example, after receiving a service processing failure event reported by the communication unit, the monitoring unit returns a message instructing the communication unit to report a service processing success event when the service processing of the object entity succeeds.
  • the communication unit may use the same interface to report service processing success events and service processing failure events, carrying service processing failure indication information in the failure event and service processing success indication information in the success event; for example, a specific identifier carried in the success event indicates that the service processing succeeded.
  • the service processing failure events reported by the communication unit may be used to compute a failure indicator value for one or more analysis objects; whether an analysis object is abnormal is then determined by comparing the computed failure indicator value with the corresponding failure threshold in the failure criterion.
  • the failure criterion defines the failure threshold and the analysis object.
  • the failure criterion may specify that the analysis object is the hardware entity corresponding to the physical address of the hardware to which the object entity that fails the service processing belongs; or the software entity corresponding to both that physical address and the logical address of the object entity;
  • or the logical resource entity corresponding to both the physical address of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
  • the failure indicator value may be the accumulated number of consecutive service processing failures, the ratio of the number of service processing failures to the total number of service processing operations over a period of time, or a key performance indicator (KPI) collected by the system, such as the call loss rate or call drop rate.
  • the failure indicator value chosen depends on the established failure criterion. If the failure indicator value is the accumulated number of consecutive service processing failures, then each time the monitoring unit receives a service processing failure event reported by the communication unit, it increments by one the failure indicator value of each corresponding analysis object.
  • if the failure indicator value is the ratio of the number of service processing failures to the total number of service processing operations over a period of time, the number of service processing failures of each corresponding analysis object is incremented by one, and the ratio of the current number of failures to the total number of service processing operations is then computed.
  • the failure indicator value corresponding to each analysis object is cleared.
  • if the failure indicator value is a key performance indicator, the service processing failure event reported by the communication unit triggers the monitoring unit to query the key performance indicator and compare it with the preset threshold.
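  • The first two kinds of failure indicator value described above (a count of consecutive failures and a failure ratio) might be maintained as in the following sketch. The `FailureIndicators` class and its method names are hypothetical, and the choice to clear the consecutive count when a success event arrives is an assumption, since the text does not state what triggers the clearing.

```python
from collections import defaultdict


class FailureIndicators:
    """Per-analysis-object failure indicator values (illustrative only)."""

    def __init__(self):
        self.consecutive = defaultdict(int)  # consecutive failures per object
        self.failures = defaultdict(int)     # failures in the current period
        self.total = defaultdict(int)        # all processing attempts in the period

    def on_failure(self, obj):
        """Record one service processing failure for the analysis object."""
        self.consecutive[obj] += 1
        self.failures[obj] += 1
        self.total[obj] += 1

    def on_success(self, obj):
        """Assumption: a success event clears the consecutive-failure count."""
        self.consecutive[obj] = 0
        self.total[obj] += 1

    def failure_ratio(self, obj):
        """Failures divided by total attempts, or 0.0 before any attempt."""
        total = self.total[obj]
        return self.failures[obj] / total if total else 0.0
```

  Whether an object is abnormal would then be decided by comparing `consecutive[obj]` or `failure_ratio(obj)` against the threshold in the failure criterion.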
  • the failure criterion can be a threshold value comparison method.
  • the failure threshold is preset on the monitoring unit.
  • the object entity that fails the service processing may be determined to be abnormal.
  • for example, with the criterion that the failure indicator value for consecutive service processing failures exceeds a certain threshold, when the number of consecutive service processing failures exceeds the failure threshold in the failure criterion, the object entity that fails the service processing can be determined to be abnormal.
  • the service processing failure event carries three parameters: physical address information of the hardware to which the object entity that fails the service processing belongs, logical address information of the object entity, and cause indication information indicating why the service processing failed.
  • the failure analysis may be performed separately for one or more analysis objects:
  • if the hardware entity corresponding to the physical address of the hardware to which the object entity belongs is used as the analysis object, then when the failure indicator value corresponding to the analysis object exceeds the first failure threshold, it is determined that an abnormality has occurred in the hardware entity.
  • if the software entity corresponding to both the physical address of the hardware to which the object entity belongs and the logical address of the object entity is used as the analysis object, then a failure indicator value exceeding the second failure threshold indicates that the number of consecutive service processing failures of the software entity exceeds the second failure threshold, and it is determined that the software entity is abnormal.
  • if the logical resource entity corresponding to both the physical address of the hardware to which the object entity belongs and the cause indication information is used as the analysis object, then a failure indicator value exceeding the third failure threshold indicates that the number of consecutive service processing failures caused by invoking the logical resource entity exceeds the third failure threshold, and it is determined that the logical resource entity is abnormal.
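  • The three analysis objects and their separate thresholds can be sketched as follows. This is a minimal illustration under stated assumptions: the `ServiceFailureEvent` and `FailureAnalyzer` names, the tuple keys used to identify analysis objects, and the threshold values are all hypothetical, and the indicator used is a simple running count.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ServiceFailureEvent:
    physical_addr: str                  # hardware to which the object entity belongs
    logical_addr: Optional[str] = None  # software entity, if reported
    cause: Optional[str] = None         # cause indication, e.g. a failed resource


class FailureAnalyzer:
    """Counts failures per analysis object and applies per-kind thresholds."""

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)  # keys: hardware, software, resource
        self.counts = Counter()

    def on_event(self, ev: ServiceFailureEvent) -> List[Tuple]:
        """Update every applicable analysis object; return those now abnormal."""
        objects = [("hardware", ev.physical_addr)]
        if ev.logical_addr is not None:
            objects.append(("software", ev.physical_addr, ev.logical_addr))
        if ev.cause is not None:
            objects.append(("resource", ev.physical_addr, ev.cause))
        abnormal = []
        for obj in objects:
            self.counts[obj] += 1
            if self.counts[obj] > self.thresholds[obj[0]]:  # "exceeds" the threshold
                abnormal.append(obj)
        return abnormal
```

  With thresholds of 5, 3, and 3 for hardware, software, and resource objects, the fourth identical event would flag the software and resource objects while the hardware object stays below its threshold.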
  • the monitoring unit saves the current failure analysis results separately for subsequent calls.
  • the monitoring unit may combine the running load of the entire system to decide whether to discard the service processing failure event.
  • the monitoring unit discards the service processing failure event, or only records it in the log; that is, in this case the failure indicator value corresponding to the analysis object is not incremented.
  • if the failure analysis is performed with the hardware entity in step 102 as the analysis object, then when the failure analysis result indicates that the hardware entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the object entity that fails the service processing belongs.
  • if the software entity is the analysis object and is found to be abnormal, the fault warning notification message includes the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
  • if the logical resource entity is the analysis object and is found to be abnormal, a fault warning notification message is sent that includes the physical address information of the hardware and the cause indication information.
  • when several analysis objects are determined to be abnormal, the corresponding fault warning notification messages may be reported at the same time, only one of them may be reported, or they may be reported one by one.
  • for example, when it is determined that both the hardware entity and the software entity are abnormal, the fault warning notification message corresponding to the software entity may be reported first, without yet reporting the one corresponding to the hardware entity. Likewise, when both the hardware entity and the logical resource entity are abnormal, the fault warning notification message corresponding to the logical resource entity may be reported first, without yet reporting the one corresponding to the hardware entity.
  • in other words, the fault warning notification message corresponding to the failure analysis object of the smallest granularity is sent first, so that the most precise warning is issued. If subsequent analysis finds that the system is still faulty, the fault warning notification message corresponding to the hardware entity is then reported.
  • fault warning notification messages for hardware entities may also distinguish hardware entities of different granularity. The physical address information of the hardware to which the object entity belongs includes a first-level sub-address, and the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address. After the monitoring unit sends a fault warning notification message including the physical address information of the hardware to which the object entity belongs, if it determines within a preset time period that the hardware is still abnormal, it sends a fault warning notification message including the first-level sub-address.
  • the first-level sub-address includes a second-level sub-address, and the hardware corresponding to the first-level sub-address is a component of the hardware corresponding to the second-level sub-address. Within a preset time period after sending the fault warning notification message including the first-level sub-address, if the monitoring unit determines that the hardware to which the object entity belongs is still abnormal, it sends a fault warning notification message including the second-level sub-address. For example, if the hardware entity identified by a physical address of the form [rack number, board slot number, subsystem number] is abnormal, fault warning notification messages can be sent successively for the hardware entity identified by [rack number, board slot number, subsystem number] (the subsystem), by [rack number, board slot number] (the board), and by [rack number] (the frame).
  • when fault warning notification messages corresponding to failure analysis objects of different granularity are sent successively, a waiting time may be preset after each message is reported. When the waiting time expires, the current failure analysis result is rechecked; if it indicates that the failure analysis object is still abnormal, the next fault warning notification message is reported.
  • here, [rack number, board slot number, subsystem number] is the physical address information of the hardware to which the object entity belongs.
  • [rack number, board slot number] is the first-level sub-address, and [rack number] is the second-level sub-address.
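  • The step-by-step widening from [rack number, board slot number, subsystem number] to [rack number, board slot number] to [rack number] can be sketched as below. The function names and the `still_abnormal` and `send` callbacks are hypothetical; in particular, `still_abnormal` is assumed to wait for the preset time and then recheck the current failure analysis result, which is not implemented here.

```python
def escalation_addresses(physical_addr):
    """Return the full address followed by each coarser sub-address."""
    addr = tuple(physical_addr)
    return [addr[:n] for n in range(len(addr), 0, -1)]


def send_warnings(physical_addr, still_abnormal, send):
    """Warn at the finest granularity first; widen only while the fault persists.

    still_abnormal: callable assumed to wait for the preset time and recheck.
    send: callable that delivers a warning for the given address.
    """
    for addr in escalation_addresses(physical_addr):
        send(addr)
        if not still_abnormal():
            break
```

  For an address such as `(1, 4, 2)`, warnings would go out for the subsystem, then the board, then the frame, stopping as soon as a recheck shows that the abnormality has cleared.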
  • the fault warning notification message may be sent to the entity that generated the abnormality itself, or may be sent to the management module of the entity that generated the abnormality.
  • for example, the fault warning notification message corresponding to the chassis is sent to the management module of the chassis; the one corresponding to the board is sent to the management module of the board; the one corresponding to the DSP chip subsystem is sent to the management module of the DSP chip subsystem; the one corresponding to the memory resource is sent to the management module of the memory resource; and the one corresponding to the software module can be sent either to the software module itself or to the management module of the software module.
  • the fault warning notification message is sent to the management module of the entity in which the abnormality occurred.
  • after receiving the fault warning notification message, the abnormal entity or its management module performs a fault detection and fault recovery process for the abnormal entity. See the descriptions of the corresponding parts of the subsequent embodiments for details.
  • after a fault warning notification message is sent, a timer can be started. Until the timer expires, subsequent failure analysis for that analysis object does not send another fault warning notification message.
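  • The timer-based suppression of repeated warnings can be sketched as a per-object hold-off. The `WarningSuppressor` class is a hypothetical name, and the injectable `clock` parameter is an assumption added so that the behaviour can be exercised without real waiting.

```python
import time


class WarningSuppressor:
    """Allows one warning per analysis object until its hold-off timer expires."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold = hold_seconds
        self.clock = clock   # injectable time source (assumption, eases testing)
        self.expiry = {}     # analysis object -> time at which its timer expires

    def should_send(self, obj):
        """Return True and start the timer if no timer is running for obj."""
        now = self.clock()
        if now < self.expiry.get(obj, float("-inf")):
            return False     # timer still running: suppress the repeat
        self.expiry[obj] = now + self.hold
        return True
```

  A monitoring unit would call `should_send` just before emitting a warning and skip the message when it returns False.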
  • the communication unit reports a service processing failure event promptly when the processing of the object entity fails,
  • and the fault warning notification message promptly triggers the fault detection process and fault recovery process of the abnormal entity. The abnormal entity can thus be repaired automatically and in time, with the fault fixed while still in the bud, which keeps the system running stably over the long term, effectively prevents faults from spreading, and improves system reliability.
  • the fault detection process is triggered only after analysis finds that the system has failed, and only for the entity in which the abnormality occurred. As a result, the fault alarms produced by fault detection are consistent with the observed system failure, and irrelevant alarms are effectively suppressed.
  • the technical solution provided in this embodiment can monitor all service processing failures in the system, including signaling message processing failures, management message processing failures, and service code stream processing failures. This coverage ensures that the failure of any communication unit can be detected, so the detection is complete. Even if the system has no dedicated fault detection technique designed for some communication units, the solution described in the present invention can basically determine that such a unit has failed and then take targeted fault recovery measures, so that the abnormal communication unit is automatically repaired or isolated in time and the system returns to normal.
  • another embodiment of the present invention provides a fault monitoring method for the case in which a communication unit continuously fails to execute signaling messages, which includes:
  • the communication unit fails to execute a signaling message and reports a signaling message processing failure event.
  • the event includes: physical address information of the board to which the communication unit belongs.
  • the signaling message can be any normal message of the signaling plane.
  • the failure to execute the signaling message may be caused by various abnormal conditions that the communication unit encounters during message processing, such as failure to apply for a memory resource, failure to apply for a timer, failure to query the configuration, or abnormal configuration data returned by the query.
  • the monitoring unit acquires a signaling message processing failure event reported by the communication unit.
  • the monitoring unit determines, according to the signaling message processing failure event reported by the communication unit and the preset failure criterion, that the board to which the communication unit belongs is abnormal.
  • based on the physical address information of the board to which the communication unit belongs, carried in the signaling message processing failure event, the monitoring unit keeps a cumulative count of consecutive service processing failures for the board: each time it receives a signaling message processing failure event reported by the communication unit, it increments the count for the board by one.
  • the monitoring unit determines that the board is abnormal when the number of consecutive service processing failures corresponding to the board is greater than the failure threshold set by the system.
  • the monitoring unit sends a fault warning notification message to the board, where the message includes: physical address information of the board.
  • After the fault warning notification message is sent, the monitoring unit starts a timer. Until the timer expires, failure analysis of the board does not send another fault warning notification message. This mainly prevents the monitoring unit from sending fault warning notification messages repeatedly and frequently.
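  • The counting and suppression behaviour described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `BoardMonitor`, the parameter names, and the injectable `clock` are hypothetical, and the threshold and timer length are assumed to be configurable values.

```python
import time


class BoardMonitor:
    """Counts consecutive signaling failures per board and rate-limits warnings."""

    def __init__(self, failure_threshold, suppress_seconds, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed: set by the system
        self.suppress_seconds = suppress_seconds    # assumed: timer length
        self.clock = clock                          # injectable for testing
        self.consecutive_failures = {}  # board address -> failure count
        self.suppressed_until = {}      # board address -> timer expiry time

    def on_failure_event(self, board_address):
        """Return True if a fault warning notification should be sent now."""
        count = self.consecutive_failures.get(board_address, 0) + 1
        self.consecutive_failures[board_address] = count
        if count <= self.failure_threshold:
            return False  # count is not yet greater than the threshold
        now = self.clock()
        if now < self.suppressed_until.get(board_address, float("-inf")):
            return False  # timer still running: do not repeat the warning
        self.suppressed_until[board_address] = now + self.suppress_seconds
        return True
```

  • With a threshold of 2, the third consecutive failure event for a board yields a warning, and further events yield nothing until the suppression timer has expired.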
  • After receiving the fault warning notification message, the board triggers a fault detection process.
  • The fault detection process performs comprehensive fault detection on the board to determine the fault point and fault cause. Generally, when a specific fault point and fault cause are detected, the corresponding fault alarm information is reported to prompt the operation and maintenance personnel of the device.
  • The fault detection process includes memory chip failure detection for the board. If a memory chip failure is detected, fault alarm information for the memory chip failure can be reported.
  • After performing the fault detection process, the board performs a fault failure confirmation process according to the fault detection result.
  • The board sends a fault failure query message to the monitoring unit, and the monitoring unit returns a response message that includes the current latest failure analysis result. If that result indicates that the board is still failing, the next step is performed; if it indicates that the board is normal, the entire process ends.
  • the board triggers a fault recovery process.
  • If the fault recovery process configured for the board is a reset, the board reset process is performed. If it is an active/standby switchover, the active/standby switchover process is performed. If it is isolation, the board isolation process is performed.
  • The fault recovery process of the board can also be configured as a combination of several fault recovery measures. For example, it can be configured to perform the active/standby switchover first, then the board reset, and finally board isolation. After each fault recovery measure, steps 205-206 are executed again to repeat the fault detection and fault failure confirmation processes. If the fault detection result or the current latest failure analysis result indicates that the board is still faulty or failing, the next fault recovery measure is executed; otherwise the board is normal and the process ends.
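  • The combined recovery sequence above can be read as a simple escalation loop. The sketch below is illustrative only: `run_recovery_chain`, `measures`, and `still_faulty` are hypothetical names, and `still_faulty` stands in for repeating fault detection and fault failure confirmation after each measure.

```python
def run_recovery_chain(measures, still_faulty):
    """Apply recovery measures in order until the board is confirmed normal.

    measures: list of (name, action) pairs, for example switchover,
        reset, isolation, in the configured order.
    still_faulty: callable that stands in for re-running fault detection
        and fault failure confirmation; returns True while the board is
        still faulty or failing.
    Returns the names of the measures that were actually executed.
    """
    executed = []
    for name, action in measures:
        action()
        executed.append(name)
        if not still_faulty():
            break  # board is normal again, so the process ends
    return executed
```

  • If the reset clears the fault, the function returns after the second measure and isolation is never attempted.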
  • In this embodiment, when the communication unit fails to process a signaling message, it reports the signaling message processing failure event in time; the monitoring unit performs failure analysis, determines that the board to which the communication unit belongs is abnormal, and sends a fault warning notification message to the board.
  • The fault detection process and fault recovery process of the board can therefore be triggered in time, so that the board is automatically repaired or isolated in time and the fault is repaired at an early stage. This ensures long-term, stable, and normal operation of the system, effectively prevents the fault from spreading, and improves system reliability.
  • Because the fault detection process is triggered only after an abnormality is found in the board, this approach not only improves timeliness compared with the original timed fault detection trigger mechanism but also minimizes the impact on system performance. Referring to FIG.
  • the technical solutions provided by the embodiments of the present invention are described in detail below by way of specific examples.
  • This embodiment of the present invention assumes that the DSP chip with frame number 3, board slot number 1, and subsystem number 1 fails, and that the software module running on the DSP chip is a single process.
  • The DSP chip fails to process a service and reports a service processing failure event to the monitoring unit of the DSP chip.
  • The event includes: the physical address of the DSP chip (frame number 3, board slot number 1, and subsystem number 1) and cause indication information for the service processing failure.
  • Because the DSP chip runs its software module as a single process, there is no need to distinguish among software modules, so the logical address of the software module whose service processing failed need not be carried here.
  • the service processing failure event may be a signaling message processing failure event of the DSP, or a management message processing failure event of the DSP or a service code stream processing failure event of the DSP.
  • The cause indication information may indicate that the service processing failed because a resource application failed, where the resource may be, for example, a memory resource, a timer resource, or a service channel processing resource of the DSP chip.
  • The cause indication information for the service processing failure is generally a specific number that corresponds to the resource whose application failed.
  • the monitoring unit acquires a service processing failure event reported by the DSP chip.
  • The information in the event includes: the physical address of the DSP chip (frame number 3, board slot number 1, and subsystem number 1) and the cause indication information of the service processing failure.
  • the monitoring unit determines whether the DSP chip is abnormal according to the service processing failure event reported by the DSP chip and the preset failure judging criterion.
  • The preset failure criterion is whether the number of consecutive service processing failures of the DSP chip exceeds the configured failure threshold, which the system sets to 5. If the count exceeds 5, the monitoring unit determines that the DSP chip is abnormal. Otherwise, the DSP chip has not met the failure criterion, and the monitoring unit determines that the DSP chip is normal.
  • The monitoring unit needs to count the number of consecutive service processing failures of the DSP chip according to the service processing failure events reported by the DSP chip.
  • When the monitoring unit receives a service processing failure event reported by the DSP chip, it identifies the physical entity corresponding to the physical address of the DSP chip carried in the event and increases that entity's consecutive service processing failure count by one.
  • That is, the consecutive service processing failure count of the DSP chip with slot number 1 and subsystem number 1 is increased by one, and the monitoring unit then checks whether the number of consecutive service processing failures of the DSP chip exceeds the configured failure threshold. For example, if the DSP chip fails to process a service 5 times in a row, it reports a service processing failure event to the monitoring unit 5 times.
  • Each of the first 4 times it receives a service processing failure event reported by the DSP chip, the monitoring unit performs failure analysis and finds that the failure threshold of 5 has not been reached, so the results of the first 4 failure analyses are that the DSP chip is normal.
  • When the fifth event is received, failure analysis finds that the number of consecutive service processing failures of the DSP chip has reached the failure threshold of 5, and the failure analysis result indicates that the DSP chip is abnormal. If the cause of all 5 service processing failures is the same and points to the memory resource of the DSP chip, the memory resource of the DSP chip is taken as the analysis object, and the failure analysis result also indicates that the memory resource of the DSP chip is abnormal.
  • If, after receiving service processing failure events reported by the DSP, the monitoring unit receives a service processing success event reported by the DSP, it clears the accumulated service processing failure count. For example, if the DSP chip fails three consecutive times but processes the fourth service successfully, it reports a service processing success event, and the monitoring unit resets the DSP chip's consecutive service processing failure count from 3 to 0.
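  • The worked example above can be traced with a small counter. This is a sketch under stated assumptions: the class `DspFailureCounter` and the string cause codes are hypothetical (the text says cause indications are generally numbers), and the comparison uses "reaches the threshold" because the worked example flags the chip on the fifth failure; a strict reading of "exceeds" would use `>` instead.

```python
FAILURE_THRESHOLD = 5  # value used in the worked example


class DspFailureCounter:
    """Tracks one DSP chip's run of consecutive service processing failures."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.count = 0
        self.causes = []  # cause codes seen in the current failure run

    def on_failure(self, cause_code):
        """Record one failure; return (abnormal, suspected_resource)."""
        self.count += 1
        self.causes.append(cause_code)
        abnormal = self.count >= self.threshold
        # If every failure in the run shares one cause, also flag that resource.
        same_cause = abnormal and len(set(self.causes)) == 1
        return abnormal, (cause_code if same_cause else None)

    def on_success(self):
        """A success event clears the consecutive failure count."""
        self.count = 0
        self.causes.clear()
```

  • Five failures that all carry the same memory-related cause yield a normal result four times and then an abnormal result that also names the memory resource; a success event after three failures returns the count to 0.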
  • The monitoring unit saves the result of the failure analysis (that is, whether the DSP chip is abnormal or normal) as the current latest failure analysis result.
  • 304. When the monitoring unit determines that the DSP chip is abnormal, it sends a fault warning notification message to the DSP chip management unit.
  • The fault warning notification message includes: the address information of the abnormal DSP chip (frame number 3, board slot number 1, and subsystem number 1).
  • After sending the fault warning notification message, the monitoring unit starts a timer. Until the timer expires, subsequent failure analysis does not send another fault warning notification message. This mainly prevents the monitoring unit from sending fault warning notification messages repeatedly and frequently.
  • the DSP chip management unit calls the DSP fault detection processing program to perform fault detection.
  • the DSP fault detection processing function can be registered in the DSP chip management unit, and calling this function triggers the DSP fault detection processing flow. For example: Send a message to the DSP chip that has an abnormality, trigger the DSP chip to perform CRC data verification of the program segment and the data segment, and return the CRC data verification result to the DSP chip management unit.
  • the DSP fault detection processing flow reports the corresponding alarm and log when the fault is found, so as to facilitate the user's problem location.
  • the DSP chip management unit performs fault failure confirmation with the monitoring unit according to the DSP fault detection result.
  • If the DSP fault detection result indicates that no fault has been detected, a fault failure query message is sent to the monitoring unit, and the monitoring unit returns a response message that includes the current latest failure analysis result.
  • If the DSP fault detection result indicates that a fault has been detected, the fault failure query message may also be sent to the monitoring unit for fault failure confirmation, or it may be omitted.
  • In that case the fault failure query message is generally not sent to the monitoring unit, in order to improve system processing efficiency.
  • If the DSP fault detection result indicates that a fault has been detected, or the current latest failure analysis result obtained from the monitoring unit indicates that the DSP chip is still abnormal, the next step is performed. If the DSP fault detection result indicates that no fault has been detected, and the current latest failure analysis result obtained from the monitoring unit indicates that the DSP chip is normal, the DSP chip has returned to normal and the entire process can end. This prevents transient faults from triggering unnecessary fault recovery measures that would affect the system.
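  • One reading of the confirmation logic above is that recovery proceeds when either the local fault detection or the monitoring unit's latest failure analysis still reports a problem, and that the query to the monitoring unit is skipped when detection has already found a fault. The sketch below encodes that reading; `confirm_failure`, `query_monitor`, and the `"abnormal"`/`"normal"` result strings are hypothetical.

```python
def confirm_failure(fault_detected, query_monitor):
    """Decide whether fault recovery should proceed.

    fault_detected: result of the DSP fault detection flow.
    query_monitor: callable that stands in for sending the fault failure
        query message; returns the latest failure analysis result as
        "abnormal" or "normal".
    """
    if fault_detected:
        # Detection already found a fault, so the query is skipped here
        # (assumed, following the remark that it is generally not sent).
        return True
    # No fault found locally: rely on the monitoring unit's latest result.
    return query_monitor() == "abnormal"
```

  • When no fault is detected and the monitoring unit reports the chip as normal, the function returns False, which corresponds to ending the process without recovery measures.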
  • the DSP chip management unit calls the DSP fault recovery processing program to perform fault recovery.
  • The DSP fault recovery processing function can be registered in the DSP chip management unit, and calling this function triggers the DSP fault recovery processing flow. For example, a reset message is sent to the abnormal DSP chip to trigger the DSP chip to reset and restart, and a timer is started to wait for the DSP chip to resume normal operation.
  • The DSP chip management unit can then perform fault detection on the DSP chip again and perform fault failure confirmation with the monitoring unit. If the DSP fault detection result indicates that a fault is detected, or the current latest failure analysis result obtained from the monitoring unit indicates that the DSP is still abnormal, the DSP chip isolation measure is executed to isolate the abnormal DSP chip.
  • In this embodiment of the present invention, a fault warning notification message is sent to the DSP chip management unit, and the DSP chip management unit promptly invokes the DSP chip fault detection process and fault recovery process. This not only detects the specific fault cause of the DSP chip in time and reports an alarm indicating the root cause of the fault, but also allows the DSP chip to be repaired or isolated in time.
  • The fault is thus repaired at an early stage and the abnormal DSP chip is quickly recovered or isolated, which avoids the spread of faults and improves system reliability.
  • This embodiment of the present invention can monitor all service processing failures of the DSP chip, including failures in signaling message processing, management message processing, and service code stream processing, and therefore covers all service processing failures of the DSP chip.
  • An embodiment of the present invention provides a fault monitoring and processing method. This embodiment assumes that the first communication unit sends a message to the second communication unit and that service processing fails because no response message is received from the second communication unit.
  • the failure handling process for this situation is as follows:
  • The first communication unit sends a message to the second communication unit, service processing fails because no response message is received from the second communication unit, and the first communication unit reports a service processing failure event to its local monitoring unit.
  • The service processing failure event includes: address information of the object entity (the second communication unit) whose service processing failed.
  • 402. The upper-level monitoring unit acquires the service processing failure event reported by the first communication unit.
  • Because the second communication unit may not be within the monitoring range of the first communication unit's monitoring unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit, and the event therefore needs to be reported to the monitoring unit at the next level up. Ultimately, the service processing failure event reported by the first communication unit is received by a monitoring unit capable of monitoring both the first communication unit and the second communication unit.
  • the monitoring unit may include: a board level monitoring unit, a frame level monitoring unit, a network element level monitoring unit, and a network level monitoring unit.
  • the scope of failure analysis that can be handled by different levels of monitoring units is different.
  • the board-level monitoring unit can perform failure analysis only on the hardware chips in the board or the software modules running in the board.
  • The frame-level monitoring unit covers not only failure analysis of each board in the frame but also failure analysis between boards at the frame level.
  • The network element level monitoring unit can perform failure analysis on all hardware chips or software modules in the network element.
  • The network level monitoring unit can perform failure analysis on all hardware chips or software modules in the entire network.
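  • The escalation through monitoring levels can be illustrated by finding the lowest level whose scope contains both the reporting unit and the object entity. This is a sketch under assumptions: the address layout `(network_element, frame, board)` and the function name `covering_level` are hypothetical and chosen only to mirror the four levels listed above.

```python
LEVELS = ["board", "frame", "network element", "network"]


def covering_level(reporter, target):
    """Return the lowest monitoring level whose scope contains both addresses.

    reporter, target: hypothetical (network_element, frame, board) tuples.
    """
    shared = 0
    for a, b in zip(reporter, target):
        if a != b:
            break
        shared += 1
    # 3 shared fields -> board, 2 -> frame, 1 -> network element, 0 -> network
    return LEVELS[3 - shared]
```

  • For example, two units on different boards in the same frame would be handled by the frame-level monitoring unit, while units in different network elements require the network-level monitoring unit.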
  • The upper-level monitoring unit determines whether the second communication unit is abnormal according to the service processing failure event reported by the first communication unit and the preset failure determination criterion.
  • If the upper-level monitoring unit finds that the object entity pointed to by the service processing failure events sent by multiple communication units is the second communication unit, and the counted number of consecutive service processing failures for that object entity exceeds the configured failure threshold, it determines that the second communication unit is abnormal.
  • the monitoring unit sends a failure warning notification message to the management unit of the second communication unit, where the failure warning notification message carries the address information of the second communication unit.
  • The subsequent fault detection and fault recovery steps are basically the same as steps 205-207 and are not described again here.
  • In this embodiment of the present invention, if the object entity pointed to by the service processing failure events sent by multiple communication units is the same object entity, and the number of consecutive service processing failures for that object entity exceeds the configured failure threshold, the object entity is determined to be faulty and a fault warning notification message is sent.
  • The fault detection process and fault recovery process of the object entity are triggered in time, so that the object entity can be automatically repaired or isolated in time and the fault is repaired at an early stage. This ensures long-term, stable, and normal operation of the system, effectively avoids the spread of faults, and improves system reliability.
  • an embodiment of the present invention provides a fault monitoring and processing method.
  • This embodiment assumes that the first communications unit sends a message to the second communications unit, and the service processing fails because the response message of the second communications unit is not received.
  • the second communication unit also sends a message to the first communication unit, and the service processing fails due to the failure to receive the response message of the first communication unit.
  • In this case, the first communication unit reports a service processing failure event and the second communication unit also reports a service processing failure event, and the object entities of the two failures each point to the peer communication unit, namely the second communication unit and the first communication unit respectively.
  • What this actually reflects, however, is a failure of the communication path between the first communication unit and the second communication unit. The path may also include a third communication unit used for switching, and a failure of the third communication unit causes the same problem. Failure analysis of such problems therefore needs to be performed by a monitoring unit that covers all communication units on the entire path.
  • the failure handling process for this situation is as follows:
  • The first communication unit sends a message to the second communication unit, service processing fails because no response message is received from the second communication unit, and the first communication unit reports a service processing failure event for the second communication unit to its local monitoring unit.
  • The service processing failure event includes: address information of the object entity (the second communication unit) whose service processing failed.
  • The second communication unit sends a message to the first communication unit, service processing fails because no response message is received from the first communication unit, and the second communication unit reports a service processing failure event for the first communication unit to its local monitoring unit.
  • The service processing failure event includes: address information of the object entity (the first communication unit) whose service processing failed.
  • The upper-level monitoring unit acquires the service processing failure event reported by the first communication unit for the second communication unit and the service processing failure event reported by the second communication unit for the first communication unit. Because the second communication unit may not be within the monitoring range of the first communication unit's monitoring unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit, so the service processing failure event reported by the first communication unit for the second communication unit needs to be reported to the monitoring unit at the next level up. Similarly, the service processing failure event reported by the second communication unit for the first communication unit also needs to be reported upward. Ultimately, a monitoring unit that can monitor both the first communication unit and the second communication unit receives the service processing failure events reported by both units.
  • According to the service processing failure event reported by the first communication unit, the service processing failure event reported by the second communication unit, and the preset failure determination criterion, the upper-level monitoring unit does not perform failure analysis on the first communication unit and the second communication unit. Instead, it performs failure analysis on the third communication unit on the path between the first communication unit and the second communication unit.
  • The preset failure determination criterion specifies that when the object entities pointed to by the service processing failure events reported by two mutually communicating communication units are each other's peer communication units, failure analysis is not performed on those two communication units; instead, failure analysis is performed on the communication units on the path between them.
  • In this case, failure analysis can be performed for the third communication unit on the path between the first communication unit and the second communication unit.
  • If the upper-level monitoring unit finds that the object entities pointed to by the service processing failure events sent by multiple communication units (including the first communication unit and the second communication unit) are all the third communication unit, and the counted number of consecutive service processing failures for the third communication unit exceeds the configured failure threshold, it determines that the third communication unit is abnormal.
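  • The redirection rule above, under which two units that blame each other cause the path between them to be analysed instead, can be sketched as follows. The function `analysis_targets` and the `path_between` lookup are hypothetical; how the monitoring unit actually learns the path topology is not specified in the text.

```python
def analysis_targets(events, path_between):
    """Map reported failure events to the entities that should be analysed.

    events: list of (reporter, target) pairs taken from failure events.
    path_between: hypothetical lookup from an unordered pair of units to
        the intermediate units on the path between them, for example
        {frozenset({"A", "B"}): ["C"]}.
    """
    reported = set(events)
    targets = []
    for reporter, target in events:
        if (target, reporter) in reported:
            # Both ends blame each other: analyse the path, not the endpoints.
            targets.extend(path_between.get(frozenset((reporter, target)), []))
        else:
            targets.append(target)
    return targets
```

  • Each redirected event then counts as a failure against the intermediate unit, so the usual threshold check can be applied to that unit.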
  • the monitoring unit sends a failure warning notification message to the management unit of the third communication unit, where the failure warning notification message carries the address information of the third communication unit.
  • The subsequent fault detection and fault recovery steps are basically the same as steps 205-207 and are not described again here.
  • In this embodiment, when two mutually communicating communication units each report a service processing failure event for the other, failure analysis is not performed on the two communication units, in accordance with the preset failure determination criterion.
  • Instead, failure analysis is performed on the third communication unit on the path between the first communication unit and the second communication unit, so that the failed node on the communication path is found in time and a fault warning notification message is sent to trigger the failed node's fault detection process and fault recovery process in time.
  • The failed node can thus be automatically repaired in time and the fault repaired at an early stage, which ensures long-term, stable, and normal operation of the system, effectively avoids the spread of faults, and improves system reliability.
  • An embodiment of the present invention provides a monitoring device, including: a first acquiring unit 61, configured to acquire a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity whose service processing failed;
  • a determining unit 62, configured to determine an entity in which an abnormality has occurred according to the service processing failure event reported by the communication unit and the preset failure criterion; and
  • a sending unit 63, configured to send a fault warning notification message, where the fault warning notification message includes information of at least one of the determined abnormal entities and is used to indicate that fault detection is to be performed.
  • The determining unit 62 includes: an obtaining subunit 621, configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value; and a determining subunit 622, configured to determine, according to the failure indicator value and the failure threshold in the failure criterion, the object entity in which the abnormality has occurred.
  • the monitoring device may further include: a configuration unit 68, configured to configure and save the failure criterion described above.
  • The obtaining subunit 621 is specifically configured to use the service processing failure events reported by the communication unit to count the accumulated number of consecutive service processing failures, where that accumulated number is the failure indicator value;
  • or, the obtaining subunit 621 is configured to use the service processing failure events reported by the communication unit to obtain the ratio of the number of service processing failures in a period of time to the total number of service processing operations in that period, where that ratio is the failure indicator value;
  • or, the obtaining subunit 621 is configured to query a key performance indicator after receiving the service processing failure event reported by the communication unit, where the key performance indicator is the failure indicator value.
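  • The first two failure indicator values listed above can be computed from a record of processing outcomes, as in the sketch below. The representation of outcomes as a list of booleans and the function names are hypothetical; the third option, a key performance indicator, would simply be looked up from wherever the system maintains it and is not modelled here.

```python
def consecutive_failures(outcomes):
    """Length of the failure run at the end of the record.

    outcomes: list of booleans in time order, True meaning success.
    """
    run = 0
    for ok in reversed(outcomes):
        if ok:
            break
        run += 1
    return run


def failure_ratio(outcomes):
    """Failures divided by total service processing attempts in the period."""
    if not outcomes:
        return 0.0  # assumed: no attempts means no evidence of failure
    return outcomes.count(False) / len(outcomes)
```

  • Either value can then be compared with the corresponding failure threshold in the failure criterion.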
  • the obtaining subunit 621 includes a first statistic subunit 6211, a second statistic subunit 6212, and a third statistic subunit 6213.
  • The first statistic subunit 6211 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the hardware entity, where the hardware entity is the hardware to which the object entity belongs;
  • The second statistic subunit 6212 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the software entity, where the software entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity;
  • The third statistic subunit 6213 is specifically configured to use the service processing failure event reported by the communication unit to calculate a failure indicator value for the logical resource entity, where the logical resource entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
  • The determining subunit 622 includes a first determining subunit 6221, a second determining subunit 6222, and a third determining subunit 6223.
  • The first determining subunit 6221 is specifically configured to determine whether the hardware entity is abnormal according to the failure indicator value calculated for the hardware entity and the first failure threshold for the hardware entity in the failure criterion.
  • The second determining subunit 6222 is configured to determine whether the software entity is abnormal according to the failure indicator value calculated for the software entity and the second failure threshold for the software entity in the failure criterion.
  • The third determining subunit 6223 is configured to determine whether the logical resource entity is abnormal according to the failure indicator value calculated for the logical resource entity and the third failure threshold for the logical resource entity in the failure criterion.
  • The sending unit 63 is configured to: when only the hardware entity is abnormal, send a fault warning notification message that includes the physical address information of the hardware to which the object entity belongs;
  • when both the hardware entity and the software entity are abnormal, send a fault warning notification message that includes only the software entity information, where the software entity information includes the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity;
  • when both the hardware entity and the logical resource entity are abnormal, send a fault warning notification message that includes only the logical resource entity information, where the logical resource entity information includes the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
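  • The sending unit's choice of which entity to name in the warning can be sketched as a selection of the most specific abnormal entity. This is illustrative only: the function name, the dictionary keys, and the handling of combinations the text does not cover (for example, all three entities abnormal, or no hardware abnormality) are assumptions.

```python
def select_warning_entity(hardware_abnormal, software_abnormal, resource_abnormal,
                          physical_address, logical_address=None, cause_code=None):
    """Pick the entity information to carry in the fault warning notification."""
    if hardware_abnormal and software_abnormal:
        # Only the software entity information is sent.
        return {"physical_address": physical_address,
                "logical_address": logical_address}
    if hardware_abnormal and resource_abnormal:
        # Only the logical resource entity information is sent.
        return {"physical_address": physical_address, "cause": cause_code}
    if hardware_abnormal:
        # Only the hardware entity is abnormal.
        return {"physical_address": physical_address}
    return None  # assumed: cases without a hardware abnormality are not described
```

  • Naming the narrower entity first lets fault detection start with the software module or resource before the whole hardware unit is examined.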
  • the physical address information of the hardware to which the target entity belongs includes: a first-level sub-address; the hardware to which the object entity belongs is a component of hardware corresponding to the first-level sub-address;
  • the monitoring device further includes: a first control unit 69 and a second control unit 610,
  • The first control unit 69 is configured to control the sending unit 63 to send a fault warning notification message including the first-level sub-address if the first determining subunit 6221 determines that the hardware entity has remained abnormal during a preset time period after the fault warning notification message including the physical address information of the hardware to which the object entity belongs was sent.
  • In this case, the sending unit 63 is further configured to send a fault warning notification message including the first-level sub-address.
  • The second control unit 610 is configured to control the sending unit 63 to send a fault warning notification message including the hardware entity information, where the hardware entity information includes the physical address information of the hardware.
  • In this case, the sending unit 63 is further configured to send a fault warning notification message including the hardware entity information.
  • The service processing failure event further includes: the current load of the communication unit. Optionally, to ensure the accuracy of failure analysis, the monitoring device further includes a first judging unit 64 and a second judging unit 65.
  • The first judging unit 64 is configured to determine whether the current load of the communication unit is less than a preset threshold and, if not, discard the service processing failure event.
  • In this case, the determining unit 62 is specifically configured to determine the abnormal entity according to the service processing failure event reported by the communication unit and the preset failure determination criterion when the judgment result of the first judging unit 64 is YES.
  • The second judging unit 65 is configured to determine whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by the terminal device and, if so, discard the service processing failure event. In this case, the determining unit 62 is specifically configured to determine the abnormal entity according to the service processing failure event reported by the communication unit and the preset failure determination criterion when the judgment result of the second judging unit 65 is NO.
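  • The two optional pre-checks above act as filters in front of failure analysis. A minimal sketch, assuming the event is a dictionary with hypothetical keys `current_load` and `caused_by_terminal`, and assuming both checks are applied together:

```python
def should_analyse(event, load_threshold):
    """Return True if the failure event should be passed on to failure analysis."""
    if event["current_load"] >= load_threshold:
        # Load is not below the preset threshold: discard the event.
        return False
    if event.get("caused_by_terminal", False):
        # The event is marked as caused by the terminal device: discard it.
        return False
    return True
```

  • Events that pass both checks proceed to the determining unit; the others are dropped and do not affect the failure indicator value.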
  • the first obtaining unit 61 is specifically configured to acquire a service processing failure event that was reported by the communication unit and forwarded by the sub-monitoring device, where the sub-monitoring device forwards the service processing failure event when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.
  • the sending unit 63 is specifically configured to send a fault warning notification message to the object entity whose service processing failed or to the management module of the object entity whose service processing failed.
  • the monitoring device further includes: a second obtaining unit 66, configured to acquire a service processing success event reported by the communication unit; and a clearing unit 67, configured to clear the counted failure indicator value to zero after the second obtaining unit 66 obtains a service processing success event. Specifically, the failure indicator value counted by the first statistical subunit 6211, the second statistical subunit 6212, or the third statistical subunit 6213 is cleared.
  • the monitoring device may further include: a receiving unit 611,
  • the receiving unit 611 is configured to receive a fault invalidation query message, where the fault invalidation query message is sent by the object entity whose service processing failed or by the management module of that object entity, and the sending unit 63 is further configured to send a response message according to the determination result of the determining subunit, where the response message includes the current latest failure analysis result.
  • the response message includes: a current latest failure analysis result of the abnormal entity targeted by the sent failure warning notification message. If the sent fault warning notification message is for the hardware entity (ie, the fault warning notification message includes information of the hardware entity), the response message includes the current latest failure analysis result of the hardware entity, that is, information indicating whether the hardware entity is abnormal.
  • the response message includes the current latest failure analysis result of the software entity, that is, information indicating whether the software entity is abnormal.
  • if the sent fault warning notification message is for a logical resource entity (that is, the fault warning notification message includes information of a logical resource entity), the response message includes the current latest failure analysis result of the logical resource entity, that is, information indicating whether the logical resource entity is abnormal.
  • the communication unit promptly reports the service processing failure event when service processing of the object entity fails, and the monitoring device performs failure analysis to determine the specific entity that has an abnormality and sends a fault warning notification message to promptly trigger the fault detection process and the fault recovery process of that entity.
  • this not only enables the abnormal entity to be automatically repaired or isolated in time, but also fixes the fault at an early stage, ensuring long-term stable operation of the system, effectively avoiding fault propagation, and improving system reliability.
  • the fault detection process is triggered only after the analysis finds that the system has failed, and can be triggered only for the entity that has an abnormality, so that the fault alarms generated by fault detection are consistent with the observed system failure and the reporting of irrelevant alarms is effectively suppressed.
  • the technical solution provided in this embodiment can monitor all service processing failures in the system, including failures in processing signaling messages, management messages, and service code streams. Because this covers all service processing failures of the system, the system can detect the failure of any communication unit, which ensures the completeness of detection. Even if no dedicated fault detection mechanism was designed for some communication units in the system, their failure can essentially be determined by the solution described in the present invention, and targeted fault recovery measures can then be taken so that the abnormal communication unit is automatically repaired in time and the system returns to normal.
  • an embodiment of the present invention provides a communication system, which is applicable to a distributed failure analysis processing mode and includes: a communication unit 701, a sub-monitoring unit 702, and a parent monitoring unit 703.
  • the sub-monitoring unit 702 is configured to acquire the service processing failure event reported by the communication unit 701, determine, according to the address information of the failed object entity carried in the service processing failure event, that the object entity whose service processing failed does not belong to the scope it manages, and report the service processing failure event to the parent monitoring unit 703.
  • the parent monitoring unit 703 is configured to determine, according to the address information of the object entity that the service processing fails in the service processing failure event, whether the object entity that failed the service processing belongs to the scope managed by itself.
  • the parent monitoring unit is a network-level monitoring unit located on the central network management device, and the sub-monitoring unit is a network-element-level monitoring unit located on the central control board of the network element; or, the parent monitoring unit is a network-element-level monitoring unit located on the central control board of the network element, and the sub-monitoring unit is a frame-level monitoring unit located on the central control board of the frame; or,
  • the parent monitoring unit is a frame-level monitoring unit located on the central control board of the frame, and the sub-monitoring unit is a board-level monitoring unit located on the board where the communication unit is located.
  • the communication unit promptly reports the service processing failure event when service processing of the object entity fails, and the fault detection process and the fault recovery process of the abnormal entity are triggered in time. This enables the abnormal entity to be automatically repaired in time, fixing the fault at an early stage, ensuring long-term stable operation of the system, effectively avoiding fault propagation, and improving system reliability.
  • an embodiment of the present invention provides a communication system, including: a first communication unit 801, a second communication unit 802, and a monitoring unit 803, where the monitoring unit 803 is configured to acquire a service processing failure event reported by the first communication unit 801 and a service processing failure event reported by the second communication unit 802, and, when the address information of the failed object entity carried in the event reported by the first communication unit 801 is the address information of the second communication unit 802 and the address information of the failed object entity carried in the event reported by the second communication unit 802 is the address information of the first communication unit 801, not to perform failure analysis on the first communication unit 801 and the second communication unit 802.
  • specifically, the monitoring unit 803 determines, according to the service processing failure event reported by the first communication unit 801, the service processing failure event reported by the second communication unit 802, and the preset failure decision criterion, not to perform failure analysis on the first communication unit 801 and the second communication unit 802.
  • the preset failure decision criterion specifies that when the object entities pointed to by the service processing failure events reported by the two communication units that are in communication with each other are the opposite communication units, the failure analysis is not performed on the two communication units.
  • when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each other's peer communication unit, failure analysis is not performed on the two communication units, which avoids erroneous failure analysis results.
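The mutual-report rule above can be illustrated with a small Python sketch. It is not taken from the patent: the `FailureEvent` record, the function names, and the `unit-80x` address strings are hypothetical, and the sketch assumes that an event carries only the reporter's address and the failed object entity's address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    reporter: str       # address of the communication unit that reported the event
    failed_object: str  # address of the object entity whose service processing failed


def mutual_blame_pairs(events):
    """Return the pairs of units in which each unit reported the other as failed."""
    blamed = {(e.reporter, e.failed_object) for e in events}
    return {
        frozenset(pair)
        for pair in blamed
        if pair[0] != pair[1] and (pair[1], pair[0]) in blamed
    }


def events_for_analysis(events):
    """Drop events that belong to a mutual-blame pair; keep the rest for analysis."""
    skip = mutual_blame_pairs(events)
    return [e for e in events if frozenset((e.reporter, e.failed_object)) not in skip]


events = [
    FailureEvent("unit-801", "unit-802"),
    FailureEvent("unit-802", "unit-801"),
    FailureEvent("unit-803", "unit-804"),
]
remaining = events_for_analysis(events)  # only the unit-803 event is left
```

Filtering before any counting keeps the two units' contradictory reports from ever reaching the failure indicator statistics, which is one possible way to realize "failure analysis is not performed".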

Abstract

The embodiments of the present invention provide a fault monitoring method, a monitoring device, and a communication system, wherein the fault monitoring method includes: obtaining a service processing failure event reported from a communication unit, the service processing failure event including the address information of an object entity which fails to process the service; determining abnormal entities according to the service processing failure event reported from the communication unit and a preset failure criterion; transmitting a fault warning notification message for indicating the implementation of fault detection, the fault warning notification message including the information of at least one entity in the determined abnormal entities. With the technical solutions provided by the embodiments of the present invention, the efficiency of fault detection can be improved.

Description

Fault Monitoring Method, Monitoring Device, and Communication System

Technical Field
The present invention relates to the field of communications technologies, and in particular, to a fault monitoring method, a monitoring device, and a communication system.
Background
Communication network equipment must be highly reliable. To achieve high reliability while minimizing product life-cycle cost, equipment developers spend a great deal of time and money performing a detailed Failure Mode and Effects Analysis (FMEA) of the entire communication device, aiming to identify as many of its failure modes as possible and to provide effective fault-handling measures, so that the device returns to normal as soon as possible after a fault and the impact on its services is minimized.
As communication devices grow more complex, and software in particular grows larger, exhausting all failure modes with the traditional FMEA approach is very costly and time-consuming. In today's fiercely competitive commercial environment no equipment developer can afford this, so the vast majority of telecommunication devices have at least some faults that the device itself cannot detect. In addition, some fault detection mechanisms consume so much device performance that they are generally designed to run only when the device is idle, which means such faults cannot be detected in real time.
The prior art has the following disadvantage:
For both kinds of fault described above (faults that the communication device cannot detect and faults that cannot be detected in real time), the communication device can neither detect the fault in time nor recover from it in time.
Summary of the Invention
Embodiments of the present invention provide a fault monitoring method, a monitoring device, and a communication system that can improve the efficiency of fault detection.
In view of this, embodiments of the present invention provide the following.
A fault monitoring method, including:
obtaining a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity whose service processing failed;
determining the abnormal entity according to the service processing failure event reported by the communication unit and a preset failure determination criterion, and sending a fault warning notification message used to instruct fault detection, where the fault warning notification message includes information about at least one of the determined abnormal entities.
A monitoring device, including:
a first obtaining unit, configured to obtain a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity whose service processing failed;
a determining unit, configured to determine the abnormal entity according to the service processing failure event reported by the communication unit and a preset failure determination criterion;
a sending unit, configured to send a fault warning notification message, where the fault warning notification message includes information about at least one of the determined abnormal entities and is used to instruct fault detection.
A communication system, including a communication unit, a sub-monitoring unit, and a parent monitoring unit, where:
the sub-monitoring unit is configured to obtain a service processing failure event reported by the communication unit, determine, according to the address information of the failed object entity carried in the service processing failure event, that the object entity whose service processing failed does not fall within the scope it manages, and report the service processing failure event to the parent monitoring unit;
the parent monitoring unit is configured to receive the service processing failure event reported by the sub-monitoring unit and determine, according to the address information of the failed object entity carried in the service processing failure event, whether that object entity falls within the scope it manages; if so, determine the abnormal entity according to the service processing failure event and a preset failure determination criterion, and send a fault warning notification message used to instruct fault detection, where the fault warning notification message includes information about at least one of the determined abnormal entities; if not, report the service processing failure event onward to the parent monitoring unit of this parent monitoring unit.
A communication system, including a first communication unit, a second communication unit, and a monitoring unit, where:
the monitoring unit is configured to obtain a service processing failure event reported by the first communication unit and a service processing failure event reported by the second communication unit, and, when the address information of the failed object entity carried in the event reported by the first communication unit is the address information of the second communication unit and the address information of the failed object entity carried in the event reported by the second communication unit is the address information of the first communication unit, not to perform failure analysis on the first communication unit and the second communication unit.
Embodiments of the present invention determine the abnormal entity by analyzing the service processing failure events reported by communication units and send a corresponding fault warning notification message, so that the system can take appropriate fault-handling action. Fault recovery can then be performed promptly for the abnormal entity, fixing the fault at an early stage, preventing fault propagation, and improving system reliability.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a fault monitoring method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a fault monitoring and processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another fault monitoring and processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of still another fault monitoring and processing method according to an embodiment of the present invention;
FIG. 6 is a structural diagram of a monitoring device according to an embodiment of the present invention;
FIG. 7 is a structural diagram of a communication system according to an embodiment of the present invention;
FIG. 8 is a structural diagram of another communication system according to an embodiment of the present invention.
Detailed Description
Referring to FIG. 1, an embodiment of the present invention provides a fault monitoring method, including:
101. Obtain a service processing failure event reported by a communication unit, where the service processing failure event includes address information of the object entity whose service processing failed.
In a communication system, every communication service is in essence completed by the communication units of the system cooperating through the exchange of messages or service code streams. A communication unit may be a network element in the communication system, or a processing unit within a network element, for example a hardware entity such as a frame, board, chip, processor, or I/O device; a software entity running on a chip or processor, such as a software module, process, or thread; or a logical resource entity deployed in a system program, such as a memory resource, semaphore, service processing resource, bandwidth resource, or link resource.
The service processing failure event reported by the communication unit may be obtained in either of two ways. In the first way, the event is received directly from the communication unit. In the second way, a parent monitoring unit receives the event from a sub-monitoring unit. The second way applies to a distributed failure analysis processing mode, which includes but is not limited to board-level, frame-level, network-element-level, and network-level failure analysis. Monitoring units at different levels (that is, the units that perform failure analysis) may logically be deployed together or on different hardware; to improve processing efficiency, they are generally distributed across different hardware. Typically, board-level failure analysis covers failure analysis of the hardware chips on a board and of the software modules running on that board, and is deployed on the board itself. Frame-level failure analysis covers both the content of board-level failure analysis and what board-level failure analysis cannot handle, and is deployed on the central control board of the frame. Network-element-level failure analysis is deployed on the central control board of the network element. Network-level failure analysis is deployed on a central control node of the network, such as the central network management device. Accordingly, the parent monitoring unit is a network-level monitoring unit located on the central network management device, and the sub-monitoring unit is a network-element-level monitoring unit located on the central control board of the network element; or, the parent monitoring unit is a network-element-level monitoring unit located on the central control board of the network element, and the sub-monitoring unit is a frame-level monitoring unit located on the central control board of the frame; or, the parent monitoring unit is a frame-level monitoring unit located on the central control board of the frame, and the sub-monitoring unit is a board-level monitoring unit located on the board where the communication unit is located.
Generally, if failure analysis at one level can reach a definite failure decision, the service processing failure event of the communication unit is terminated at that level and is not reported to the level above. If no definite decision can be reached, failure analysis at that level must report the service processing failure event onward to failure analysis at the level above. For example, when board A finds that some fields in a response message received from board B are assigned incorrect values, board A reports a service processing failure event for board B, carrying the address information of board B, to the board-level monitoring unit on board A. Because the board-level monitoring unit on board A cannot effectively analyze the failure of another board, it reports the event onward to the frame-level monitoring unit to which board A belongs. Likewise, if board A and board B are in different frames, the frame-level monitoring unit to which board A belongs still cannot analyze the event effectively and reports it onward to the network-element-level monitoring unit to which board A belongs; and if board A and board B are in different network elements, that network-element-level monitoring unit still cannot analyze the event effectively and reports it onward to the network-level monitoring unit.
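The level-by-level reporting described above can be sketched as a walk up a chain of monitoring units. This is an illustrative Python sketch only: the `MonitoringUnit` class, its `managed` address set, and the unit names are hypothetical, and it assumes that "able to reach a definite decision" is the same as "the failed object entity's address is in this unit's managed scope", which simplifies the text.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class MonitoringUnit:
    name: str                                   # illustrative label for the level
    managed: Set[str]                           # addresses this unit can analyze itself
    parent: Optional["MonitoringUnit"] = None   # next level up, if any

    def handle(self, failed_object: str) -> Optional[str]:
        """Return the name of the level that terminates the event, or None."""
        unit = self
        while unit is not None:
            if failed_object in unit.managed:
                return unit.name   # analyzed here; not reported further up
            unit = unit.parent     # outside this unit's scope: report to the parent
        return None


# Boards A and B share a frame, C is in another frame of the same network
# element, and D is in another network element.
network = MonitoringUnit("network", {"A", "B", "C", "D"})
ne_1 = MonitoringUnit("ne-1", {"A", "B", "C"}, parent=network)
frame_1 = MonitoringUnit("frame-1", {"A", "B"}, parent=ne_1)
board_a = MonitoringUnit("board-A", {"A"}, parent=frame_1)

handled_by = board_a.handle("B")  # board A reports a failure of board B
```

In this layout the event about board B stops at the frame-level unit, matching the example in the text where the board-level unit on board A cannot analyze another board.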
The object entity whose service processing failed is either the communication unit itself or the peer communication unit with which it communicates. The service processing failure event may be a signaling message processing failure event, a management message processing failure event, a service code stream processing failure event, or an interface call processing failure event.
Specifically, the communication unit reports a signaling message processing failure event when it fails to perform the function corresponding to a signaling message; or reports a management message processing failure event when it fails to perform the function corresponding to a management message; or reports a service code stream processing failure event when it fails to process a service code stream; or reports an interface call processing failure event when an interface call fails.
If the received message is normal but internal processing fails, the address information of the failed object entity is that of the communication unit processing the message.
If the failure is caused by an abnormal information element contained in the received message, the address information of the failed object entity is that of the communication unit that sent the message.
If the sent message is normal but the failure is caused by a timeout in which no response message is received from the peer communication unit, the address information of the failed object entity is that of the communication unit that was to receive the message (that is, the peer communication unit).
An interface call processing failure indicates that the interface device may be faulty, and the address information of the failed object entity is that of the interface device communication unit. For example, an interface call failure while reading from or writing to a hard disk indicates that the hard disk may be faulty.
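The four cases above amount to a lookup from the kind of failure to the address placed in the event. A minimal Python sketch follows; the `FailureKind` enum, the function name, and the address strings are hypothetical names introduced for illustration, not identifiers from the patent.

```python
from enum import Enum, auto


class FailureKind(Enum):
    INTERNAL_PROCESSING = auto()  # received message was normal, local processing failed
    ABNORMAL_CONTENT = auto()     # received message carried an abnormal information element
    RESPONSE_TIMEOUT = auto()     # sent message was normal, peer did not respond in time
    INTERFACE_CALL = auto()       # call to an interface device (e.g. a hard disk) failed


def failed_object_address(kind, local, peer=None, interface_device=None):
    """Pick the address to carry in the failure event for each of the four cases."""
    if kind is FailureKind.INTERNAL_PROCESSING:
        return local              # the unit processing the message is the suspect
    if kind in (FailureKind.ABNORMAL_CONTENT, FailureKind.RESPONSE_TIMEOUT):
        return peer               # the sender of the bad message, or the silent receiver
    if kind is FailureKind.INTERFACE_CALL:
        return interface_device   # the interface device may be faulty
    raise ValueError(f"unknown failure kind: {kind!r}")


address = failed_object_address(
    FailureKind.RESPONSE_TIMEOUT, local="board-A", peer="board-B"
)
```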
The service processing failure event may further include cause indication information for the service processing failure. It may also include key contextual operating parameters at the time of service processing, such as the current load and the total number of service processing operations.
In particular, when the current load of the communication unit exceeds a preset threshold, the communication unit may refrain from reporting the service processing failure event, so that unnecessary failure analysis is avoided.
In particular, when the communication unit determines that an abnormal field value in a signaling message from the peer communication unit is caused by an illegal accessing terminal device (including user terminals and operation and maintenance terminals), it may either not report a service processing failure event, or report the event with a specific field that marks this condition. This case can thus also be controlled at the communication unit, that is, by having the communication unit not report the service processing failure event. For example, when a communication unit in a Home Location Register (HLR) device receives a call request message and finds that the International Mobile Subscriber Identity (IMSI) or Electronic Serial Number (ESN) of the terminal carried in the message is invalid, it may refrain from reporting a service processing failure event, avoiding unnecessary subsequent failure analysis.
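The two reporting-side filters above (overload and terminal-caused failures) could be combined roughly as follows. This Python sketch is an assumption-laden illustration: the function name, the dictionary keys, and the `suppress_terminal_caused` switch are hypothetical, and the event is modeled as a plain dictionary.

```python
from typing import Optional


def build_report(
    failed_object: str,
    current_load: int,
    load_threshold: int,
    terminal_caused: bool,
    suppress_terminal_caused: bool = True,
) -> Optional[dict]:
    """Return the failure event to report, or None if reporting is suppressed."""
    if current_load > load_threshold:
        return None  # overloaded: failures are expected, so skip the report
    if terminal_caused and suppress_terminal_caused:
        return None  # illegal terminal: the communication unit is not at fault
    return {
        "failed_object": failed_object,
        "current_load": current_load,
        "terminal_caused": terminal_caused,  # marker field for the monitor to inspect
    }


report = build_report("board-B", current_load=40, load_threshold=80, terminal_caused=False)
```

Setting `suppress_terminal_caused=False` corresponds to the alternative in the text where the event is still reported but carries a field marking it as terminal-caused, leaving the decision to discard it to the monitoring side.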
The address information of the failed object entity includes physical address information of the hardware to which the object entity belongs, which uniquely identifies, within the entire communication system, the hardware to which the communication unit belongs. If the communication unit is a processing unit within a network element, such as a frame, board, chip, processor, or I/O device, the physical address information of the hardware may be a signaling point identifier or an IP address, or a physical address expressed in the form [frame number, board slot number, subsystem number].
If the failed object entity is a software entity, the address information of the failed object entity may further include logical address information of the software entity. The logical address information may be a software module address or process address, or a software module number or process number in one-to-one correspondence with the software module address or process address.
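One way to picture this address information is as a small record with a mandatory physical part and an optional logical part. The Python sketch below assumes the [frame number, board slot number, subsystem number] form and an integer module or process number; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalAddress:
    frame: int      # frame number
    slot: int       # board slot number
    subsystem: int  # subsystem number


@dataclass(frozen=True)
class EntityAddress:
    physical: PhysicalAddress      # always present: identifies the hardware
    logical: Optional[int] = None  # module or process number, for software entities only

    def is_software(self) -> bool:
        return self.logical is not None


board = EntityAddress(PhysicalAddress(frame=1, slot=4, subsystem=0))
process = EntityAddress(PhysicalAddress(frame=1, slot=4, subsystem=0), logical=17)
```

Making both records frozen lets an address serve directly as a dictionary key, which is convenient if failure statistics are kept per entity.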
其中, 业务处理失败的原因指示信息可以指示因哪种资源申请失败而导致 业务处理失败, 其中, 上述资源可以是内存资源, 信号量, 业务处理资源, 带 宽资源, 链路资源等, 在系统中, 业务处理失败的原因指示信息可以为一个具 体编号, 其与申请失败的资源相对应。 一般地, 建议编号与资源保持——对应 关系, 这样在系统内, 只要是因为同一种资源申请失败导致的业务处理失败, 该业务处理失败的原因指示信息是相同的, 这有利于系统中资源的失效分析处 理。  The service processing failure indication information may indicate that the resource processing fails due to the failure of the resource application, wherein the foregoing resource may be a memory resource, a semaphore, a service processing resource, a bandwidth resource, a link resource, etc., in the system. The reason for the failure of the service processing indication information may be a specific number, which corresponds to the resource for which the application fails. Generally, it is recommended that the number and the resource remain in the corresponding relationship, so that in the system, as long as the service processing fails due to the failure of the same resource application, the reason for the failure of the service processing is the same, which is beneficial to the resources in the system. Failure analysis processing.
一般地, 对象实体在业务处理成功时通信单元不需要上报任何事件, 但在 上报过业务处理失败事件后, 在对象实体再次执行业务处理成功时, 通信单元 要上报业务处理成功事件给监控单元。 另外, 通信单元是否上报业务处理成功 事件也可以由监控单元来控制, 例如: 监控单元在收到通信单元上报的业务处 理失败事件后, 回消息给通信单元, 通知通信单元在对象实体的业务处理成功 时上报业务处理成功事件。  Generally, the object entity does not need to report any event when the service entity is successful. However, after the service processing failure event is reported, the communication unit reports the service processing success event to the monitoring unit when the object entity performs the service processing again. In addition, whether the communication unit reports the service processing success event may also be controlled by the monitoring unit. For example, after receiving the service processing failure event reported by the communication unit, the monitoring unit returns a message to the communication unit to notify the communication unit of the service processing of the target entity. When successful, the business processing success event is reported.
其中, 通信单元可以使用同一个接口上报业务处理成功事件和业务处理失 败事件, 在业务处理失败事件中携带业务处理失败的原因指示信息, 在业务处 理成功事件中携带业务处理成功指示信息, 比如业务处理成功事件中携带特定 标识表示业务处理成功。  The communication unit may use the same interface to report the service processing success event and the service processing failure event, and carry the service processing failure indication information in the service processing failure event, and carry the service processing success indication information, such as the service, in the service processing success event. Carrying a specific identifier in the processing success event indicates that the service processing is successful.
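The event contents described above can be sketched as a single record type that serves for both success and failure reports. This is only an illustrative sketch: the class name, field names, and the numeric cause codes are assumptions and are not defined by the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical cause codes: one number per resource type (one-to-one mapping).
CAUSE_MEMORY = 1
CAUSE_SEMAPHORE = 2
CAUSE_BANDWIDTH = 3
SUCCESS = 0  # assumed "specific identifier" that marks a success event


@dataclass(frozen=True)
class ServiceEvent:
    # [chassis number, board slot number, subsystem number]
    physical_address: Tuple[int, int, int]
    # Software module or process number; None when no logical address is carried.
    logical_address: Optional[int]
    # SUCCESS for a success event, otherwise a cause code.
    result_code: int

    @property
    def is_failure(self) -> bool:
        return self.result_code != SUCCESS


failure = ServiceEvent((3, 1, 1), logical_address=None, result_code=CAUSE_MEMORY)
success = ServiceEvent((3, 1, 1), logical_address=None, result_code=SUCCESS)
```

Using one record type for both outcomes mirrors the statement that the same interface may carry success and failure events.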
102. Determine the entity in which an abnormality has occurred, according to the service processing failure event reported by the communication unit and a preset failure criterion.
Specifically, the service processing failure events reported by the communication unit may be used to compute a failure indicator value for each of one or more analysis objects. Whether an analysis object is abnormal is then determined from its computed failure indicator value and the corresponding failure threshold in the failure criterion.
A service processing failure of an object entity inevitably causes its related functions to fail or degrade, which appears externally as an abnormality of some entity. The failure criterion specifies the failure threshold and the analysis object. It may specify that the analysis object is the hardware entity identified by the physical address of the hardware to which the failed object entity belongs; or the software entity identified by that physical address together with the logical address; or the logical resource entity identified by that physical address together with the cause indication information of the service processing failure.
Generally, the failure indicator value may be the accumulated number of consecutive service processing failures, the ratio of service processing failures to total service processing attempts over a period of time, or a key performance indicator (KPI) collected by the system, such as the call loss rate or the call drop rate. Which failure indicator values are used depends on the failure criterion that has been defined. If the failure indicator value is the accumulated number of consecutive service processing failures, then each time the monitoring unit receives a service processing failure event reported by the communication unit, it increments by one the failure indicator value of each applicable analysis object. If the failure indicator value is the ratio of service processing failures to total service processing attempts over a period of time, the monitoring unit increments by one the failure count of each applicable analysis object and then computes the ratio of the current failure count to the total number of service processing attempts. When a service processing success event reported by the communication unit is received, the failure indicator value of each analysis object is reset to zero.
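The two counting-based indicators described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, and the ratio helper assumes the total number of attempts is available from system statistics.

```python
from collections import defaultdict


class ConsecutiveFailureCounter:
    """Consecutive-failure indicator, kept per analysis object (any hashable key)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def on_failure(self, key):
        # Each reported failure event adds one to the indicator value.
        self._counts[key] += 1
        return self._counts[key]

    def on_success(self, key):
        # A reported success event resets the indicator value to zero.
        self._counts[key] = 0

    def value(self, key):
        return self._counts.get(key, 0)


def failure_ratio(failures, total):
    """Ratio indicator: failures divided by total attempts in the period."""
    return failures / total if total > 0 else 0.0
```

For instance, three failure events for the key `(3, 1, 1)` give an indicator value of 3, and a following success event returns it to 0.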
If the failure indicator value is a key performance indicator, a service processing failure event reported by the communication unit triggers the monitoring unit to query the KPI and compare it with a preset threshold.
Generally, the failure criterion may use threshold comparison. Specifically, a failure threshold is preset on the monitoring unit, and when the failure indicator value exceeds that threshold, the object entity whose service processing failed is determined to be abnormal. In particular, if the failure criterion is that the number of consecutive service processing failures exceeds a given threshold, the object entity is determined to be abnormal as soon as its consecutive failure count exceeds the failure threshold in the criterion.
As described in step 101, assume that the service processing failure event carries three parameters: the physical address information of the hardware to which the failed object entity belongs, the logical address information of the failed object entity, and the cause indication information of the service processing failure. On receiving a service processing failure event reported by the communication unit, the monitoring unit may then run failure analysis separately for one or more analysis objects:
If the analysis object is the hardware entity identified by the physical address of the hardware to which the failed object entity belongs, and the failure indicator value of that analysis object exceeds the first failure threshold, the hardware entity has failed service processing consecutively more times than the first failure threshold allows, and the hardware entity is determined to be abnormal.
If the analysis object is the software entity identified by both the physical address of the hardware to which the failed object entity belongs and the logical address of the object entity, and the failure indicator value of that analysis object exceeds the second failure threshold, the software entity has failed service processing consecutively more times than the second failure threshold allows, and the software entity is determined to be abnormal.
If the analysis object is the logical resource entity identified by both the physical address of the hardware to which the failed object entity belongs and the cause indication information of the service processing failure, and the failure indicator value of that analysis object exceeds the third failure threshold, the number of consecutive service processing failures caused by the system invoking that logical resource entity exceeds the third failure threshold, and the logical resource entity is determined to be abnormal.
The monitoring unit saves each current failure analysis result separately for later use.
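The three kinds of analysis object can be sketched as three counter keys derived from one event, each compared with its own threshold. The threshold values, key layout, and function names below are assumptions chosen for illustration, and the comparison uses "greater than" to follow the wording "exceeds".

```python
from collections import defaultdict

# Hypothetical values for the first, second, and third failure thresholds.
THRESHOLDS = {"hardware": 5, "software": 3, "resource": 3}

counts = defaultdict(int)  # consecutive failure count per analysis object


def analysis_keys(physical, logical, cause):
    """Derive one key per analysis object from the three event parameters."""
    keys = [("hardware", physical)]
    if logical is not None:
        keys.append(("software", physical, logical))
    if cause is not None:
        keys.append(("resource", physical, cause))
    return keys


def on_failure_event(physical, logical, cause):
    """Increment every applicable counter and return the keys now abnormal."""
    abnormal = []
    for key in analysis_keys(physical, logical, cause):
        counts[key] += 1
        if counts[key] > THRESHOLDS[key[0]]:
            abnormal.append(key)
    return abnormal
```

With these sample thresholds, the fourth consecutive failure event for the same addresses and cause marks the software entity and the logical resource entity as abnormal while the hardware entity is still below its threshold.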
In particular, if the service processing failure event carries the current load and the current load exceeds a preset threshold, the monitoring unit may decide, taking the operating load of the whole system into account, whether to discard the event. When it decides to discard the event, the failure indicator value of the analysis object is not incremented.
In particular, if the service processing failure event carries a specific identifier indicating that the accessing terminal device (either a user terminal or an operation and maintenance terminal) is illegitimate, the monitoring unit discards the event or only writes a log entry. In this case the failure indicator value of the analysis object is not incremented.
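The two discard rules above can be sketched as a filter that runs before any counter is incremented. The dictionary keys `illegal_terminal` and `current_load`, and the `system_overloaded` flag standing in for the monitoring unit's system-wide load decision, are hypothetical names introduced for this sketch.

```python
def should_count(event, load_threshold, system_overloaded):
    """Return True if the failure event should increment failure indicators."""
    # Failures caused by an illegitimate terminal say nothing about entity health.
    if event.get("illegal_terminal"):
        return False
    # Under overload, the monitoring unit may decide to drop the event.
    load = event.get("current_load")
    if load is not None and load > load_threshold and system_overloaded:
        return False
    return True
```

An event that carries no load and no illegitimate-terminal identifier is always counted.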
103. Send a fault warning notification message that includes information about at least one of the entities determined to be abnormal.
If failure analysis is performed only on the hardware entity of step 102, then when the failure analysis result shows that the hardware entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the failed object entity belongs.
If failure analysis is performed only on the software entity of step 102, then when the failure analysis result shows that the software entity is abnormal, a fault warning notification message is sent that includes the physical address information of the hardware to which the failed object entity belongs and the logical address information of the object entity.
If failure analysis is performed only on the logical resource entity of step 102, then when the failure analysis result shows that the logical resource entity is abnormal, a fault warning notification message is sent that includes the physical address of the hardware to which the failed object entity belongs and the failure cause indication information.
If the hardware entity, the software entity, and the logical resource entity of step 102 are all used as analysis objects and analyzed separately, and several of them are found to be abnormal, the monitoring unit may report several fault warning notification messages at the same time, report only one, or report them one by one. For example, when both the hardware entity and the software entity are determined to be abnormal, the fault warning notification message for the software entity may be reported first and the message for the hardware entity withheld for the time being. Likewise, when both the hardware entity and the logical resource entity are determined to be abnormal, the message for the logical resource entity may be reported first and the message for the hardware entity withheld. Preferably, when several analysis objects are abnormal at the same time, the fault warning notification message for the analysis object of the finest granularity is sent first, so that the most precise warning is issued first. If subsequent analysis finds that the system is still faulty, the fault warning notification message for the hardware entity is then reported.
For fault warning notification messages about hardware entities, hardware entities of different granularity may also be distinguished. The physical address information of the hardware to which the object entity belongs includes a first-level sub-address, and that hardware is a component of the hardware identified by the first-level sub-address. After sending a fault warning notification message that includes the physical address information of the hardware to which the object entity belongs, the monitoring unit sends a fault warning notification message that includes the first-level sub-address if it determines that the hardware remained abnormal throughout a preset period. Optionally, the first-level sub-address includes a second-level sub-address, and the hardware identified by the first-level sub-address is a component of the hardware identified by the second-level sub-address. If, within a preset period after sending the message that includes the first-level sub-address, the monitoring unit determines that the hardware is still abnormal, it sends a fault warning notification message that includes the second-level sub-address.
For example, when the hardware entity identified by a physical address of the form [chassis number, board slot number, subsystem number] is abnormal, the monitoring unit may first send the fault warning notification message for the hardware entity (subsystem) identified by [chassis number, board slot number, subsystem number], then the message for the hardware entity (board) identified by [chassis number, board slot number], and finally the message for the hardware entity (chassis) identified by [chassis number]. Specifically, when sending fault warning notification messages for analysis objects of successively coarser granularity, the monitoring unit may set a waiting time after each message. When the waiting time expires, it rechecks the current failure analysis result and reports the next message only if the result shows that the analysis object is still abnormal. Here [chassis number, board slot number, subsystem number] is the physical address information of the hardware to which the object entity belongs, [chassis number, board slot number] is the first-level sub-address, and [chassis number] is the second-level sub-address.
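The escalation from subsystem to board to chassis can be sketched by repeatedly dropping the last component of the physical address. The function names and the callback parameters (`send_warning`, `wait`, `still_abnormal`) are assumptions made for this sketch; the text does not prescribe any particular interface.

```python
def escalation_levels(physical_address):
    """Yield the address, then each parent sub-address, finest granularity first."""
    address = tuple(physical_address)
    while address:
        yield address
        address = address[:-1]


def escalate(physical_address, send_warning, wait, still_abnormal):
    """Send warnings level by level while the analysis object stays abnormal."""
    sent = []
    for level in escalation_levels(physical_address):
        send_warning(level)
        sent.append(level)
        wait()  # preset waiting time before rechecking
        if not still_abnormal():  # recheck the current failure analysis result
            break
    return sent
```

For the address `(3, 1, 1)` the levels are `(3, 1, 1)`, `(3, 1)`, and `(3,)`, which correspond to the subsystem, the board, and the chassis.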
The fault warning notification message may be sent to the abnormal entity itself or to the management module of the abnormal entity. For example, the message for a chassis is sent to the management module of that chassis; the message for a board is sent to the management module of that board; the message for a DSP chip subsystem is sent to the management module of that DSP chip subsystem; the message for a memory resource is sent to the management module of that memory resource; and the message for a software module may be sent either to the software module itself or to its management module. Preferably, the fault warning notification message is sent to the management module of the abnormal entity.
After receiving the fault warning notification message, the abnormal entity or its management module runs the fault detection and fault recovery processes for the abnormal entity. See the corresponding parts of the subsequent embodiments for details.
In particular, after sending a fault warning notification message for an analysis object, the monitoring unit may start a timer. Until the timer expires, subsequent failure analysis for that analysis object does not send any further fault warning notification message.
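The timer-based suppression can be sketched with a per-object expiry time. The class name, the `hold_seconds` parameter, and the injectable `clock` are assumptions for illustration; a real monitoring unit might use a hardware or operating-system timer instead.

```python
import time


class WarningSuppressor:
    """Allow at most one warning per analysis object within a hold period."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self._hold = hold_seconds
        self._clock = clock
        self._expires = {}  # analysis object key -> time the timer expires

    def try_send(self, key, send):
        now = self._clock()
        # While the timer for this object is running, send nothing.
        if now < self._expires.get(key, float("-inf")):
            return False
        send(key)
        self._expires[key] = now + self._hold  # start the timer
        return True
```

Passing a fake clock makes the behavior easy to check without waiting in real time.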
In this embodiment of the present invention, the communication unit promptly reports a service processing failure event when the object entity's service processing fails. The fault warning notification message then promptly triggers the fault detection process and the fault recovery process for the abnormal entity. This allows the abnormal entity to be repaired automatically and in time, nipping the fault in the bud, which keeps the system running stably over the long term, prevents the fault from spreading, and improves system reliability. In addition, the fault detection process is triggered only after analysis has found a failure in the system, and it may be triggered only for the abnormal entity. As a result, the fault alarms produced by fault detection are consistent with the observed system failure, and unrelated alarms are effectively suppressed.
The technical solution of this embodiment can monitor all service processing failures in the system, including failures in processing signaling messages, management messages, and service code streams. Because it covers every service processing failure, the system can detect the failure of any communication unit, which ensures the completeness of detection. Even for communication units for which no dedicated fault detection technique has been designed in the system, the failure of the unit can essentially be determined through the described solution, and targeted fault recovery measures can then be taken so that the abnormal communication unit is repaired or isolated automatically and in time and the system returns to normal.
Referring to FIG. 2, another embodiment of the present invention provides a fault monitoring method for the case where a communication unit repeatedly fails to process signaling messages. The method includes:
201. The communication unit fails to process a signaling message and reports a signaling message processing failure event. The event includes the physical address information of the board to which the communication unit belongs.
The signaling message may be any normal message on the signaling plane. The failure to process it may result from any abnormal condition that the communication unit encounters during message processing, such as a failed memory resource request, a failed timer request, a failed configuration query, or abnormal configuration data returned by a query.
202. The monitoring unit acquires the signaling message processing failure event reported by the communication unit.
203. The monitoring unit determines, according to the signaling message processing failure event reported by the communication unit and a preset failure criterion, that the board to which the communication unit belongs is abnormal.
Using the physical address information of the board carried in the event, the monitoring unit keeps a running count of consecutive service processing failures for that board. Each time it receives a signaling message processing failure event from the communication unit, it increments the board's consecutive failure count by one. When the count exceeds the failure threshold set by the system, the monitoring unit determines that the board is abnormal.
204. The monitoring unit sends a fault warning notification message to the board. The message includes the physical address information of the board.
After sending the fault warning notification message, the monitoring unit starts a timer. Until the timer expires, subsequent failure analysis for the board sends no further fault warning notification message. This prevents the monitoring unit from sending fault warning notification messages repeatedly and too frequently.
205. After receiving the fault warning notification message, the board triggers a fault detection process.
On receiving the fault warning notification message, the board triggers its fault detection process and performs comprehensive fault detection to pinpoint the fault location and the fault cause. Generally, when a specific fault location and cause are detected, the corresponding fault alarm information is reported to alert the operation and maintenance personnel of the equipment. For example, if the fault detection process includes memory chip failure detection and that detection finds that a memory chip has failed, fault alarm information for the memory chip failure may be reported.
206. After completing the fault detection process, the board performs a failure confirmation process according to the fault detection result.
If the board's fault detection result shows that no fault was detected, the board sends a failure query message to the monitoring unit, and the monitoring unit returns a response message that includes the latest failure analysis result. If the latest failure analysis result shows that the board is still failing, the next step is performed. If it shows that the board has returned to normal, the whole process ends.
If the fault detection result shows that the board does have a fault, failure confirmation may be skipped and the next step performed directly.
207. The board triggers a fault recovery process.
If the board's fault recovery process is a board reset, the board reset process is performed. If it is an active/standby switchover, the active/standby switchover process is performed. If it is board isolation, the board isolation process is performed.
In particular, the board's fault recovery process may be configured as a combination of several fault recovery measures. For example, it may be configured to perform an active/standby switchover first, then a board reset, and finally board isolation. After each fault recovery measure, steps 205 to 206 are performed again to rerun fault detection and failure confirmation. If the fault detection result or the latest failure analysis result shows that the board is still faulty or failing, the next fault recovery measure is performed. Otherwise the board has returned to normal and the process ends.
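The combined recovery sequence can be sketched as a loop that applies each measure in its configured order and then repeats steps 205 to 206. The function name and the three callables are hypothetical. `detect_fault` stands for the board's fault detection and `still_failed` for the failure query sent to the monitoring unit.

```python
def recover(measures, detect_fault, still_failed):
    """Apply recovery measures in order until the board is healthy again.

    measures: list of (name, action) pairs in the configured order.
    Returns (names of applied measures, True if the board recovered).
    """
    applied = []
    for name, action in measures:
        action()
        applied.append(name)
        # Step 205: rerun fault detection. Step 206: only when no fault is
        # found, ask the monitoring unit for the latest failure analysis.
        if not detect_fault() and not still_failed():
            return applied, True
    return applied, False
```

Because `and` short-circuits, the failure query is skipped whenever fault detection already found a fault, which matches the rule that failure confirmation may be omitted in that case.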
In this embodiment of the present invention, when the communication unit repeatedly fails to process signaling messages, it promptly reports signaling message processing failure events. When the monitoring unit's failure analysis determines that the board to which the communication unit belongs is abnormal, it sends a fault warning notification message to the board, which promptly triggers the board's fault detection and fault recovery processes. The board can thus be repaired or isolated automatically and in time, nipping the fault in the bud, which keeps the system running stably over the long term, prevents the fault from spreading, and improves system reliability. In addition, because the fault detection process is triggered only after analysis has found the board to be abnormal, it is both timely and has minimal impact on system performance compared with the earlier timer-driven fault detection trigger mechanism.
Referring to FIG. 3, the technical solution provided by the embodiments of the present invention is described in detail below using a specific example.
This embodiment assumes that a DSP chip with chassis number 3, board slot number 3, and subsystem number 1 suffers consecutive service processing failures, and that the software module running on the DSP chip is a single process.
301. The DSP chip fails to process a service and reports a service processing failure event to the monitoring unit of the DSP chip. The event includes the physical address of the DSP chip (chassis number 3, board slot number 1, subsystem number 1) and the cause indication information of the service processing failure.
Because the software module running on the DSP chip is a single process and needs no further distinction, the logical address of the software module whose service processing failed need not be carried here.
The service processing failure event may be a signaling message processing failure event, a management message processing failure event, or a service code stream processing failure event of the DSP.
The cause indication information of the service processing failure may indicate which resource request failed and thereby caused the service processing failure. The resource may be a memory resource of the DSP chip, a timer resource of the DSP chip, a service channel processing resource of the DSP chip, and so on. Within the system, the cause indication information is generally a specific number in one-to-one correspondence with the resource whose request failed.
302、 监控单元获取到 DSP芯片上报的业务处理失败事件。  302. The monitoring unit acquires a service processing failure event reported by the DSP chip.
监控单元获取到 DSP芯片上报的业务处理失败事件后, 解析出事件中携带 的信息, 包括: DSP芯片的物理地址(DSP芯片所在的框号为 3、 单板槽位号 为 1和子系统号为 1 ), 和业务处理失败的原因指示信息。 After the monitoring unit obtains the service processing failure event reported by the DSP chip, the monitoring unit carries the event The information includes: the physical address of the DSP chip (the frame number of the DSP chip is 3, the slot number of the board is 1 and the subsystem number is 1), and the cause indication information of the service processing failure.
303、监控单元根据 DSP芯片上报的业务处理失败事件和预置的失效判别准 则, 判断 DSP芯片是否异常。  303. The monitoring unit determines whether the DSP chip is abnormal according to the service processing failure event reported by the DSP chip and the preset failure judging criterion.
The preset failure determination criterion here is whether the number of consecutive service processing failures of the DSP chip exceeds the configured failure threshold (assume that the system is configured with a failure threshold of 5). If the count exceeds 5, the monitoring unit determines that the DSP chip is abnormal; otherwise, the DSP chip has not yet met the failure determination criterion and the monitoring unit determines that the DSP chip is normal.
Under the preset failure determination criterion, the monitoring unit needs to count the number of consecutive service processing failures of the DSP chip from the service processing failure events that the chip reports. Each time the monitoring unit receives a service processing failure event reported by the DSP chip, it takes the physical entity corresponding to the DSP chip physical address carried in the event as the analysis object and increments that entity's count of consecutive service processing failures by one. Here, the count for the DSP chip with frame number 3, board slot number 1, and subsystem number 1 is incremented by one, and the monitoring unit then checks whether the count exceeds the configured failure threshold. For example, if the DSP chip fails service processing 5 times in a row, it reports 5 consecutive service processing failure events to the monitoring unit. The monitoring unit performs failure analysis on each of the first 4 events; because the failure threshold of 5 has not yet been reached, each of these 4 analyses concludes that the DSP chip is normal. On the 5th event, failure analysis finds that the count of consecutive service processing failures has reached the failure threshold of 5, so the analysis result is that the DSP chip is abnormal. If the cause indication information of all 5 failures is the same, for example all pointing to the memory resource of the DSP chip, then with the memory resource of the DSP chip taken as the analysis object, the failure analysis likewise outputs the result that the memory resource of the DSP chip is abnormal.
Note that if, after having received service processing failure events reported by the DSP, the monitoring unit receives a service processing success event reported by the DSP for the first time, it clears the accumulated count of service processing failures. For example, if the DSP chip fails service processing 3 times in a row but succeeds on the 4th attempt, it reports a service processing success event, and the monitoring unit resets the counted number of consecutive service processing failures of the DSP chip from 3 to 0.
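The counting rule of step 303, including the reset on a success event, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the class name `ConsecutiveFailureMonitor`, its method names, and the tuple form of the physical address are hypothetical, and the entity is treated as abnormal once the count reaches the threshold, following the worked example of 5 consecutive failures.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 5  # example value from the text; configurable in practice


class ConsecutiveFailureMonitor:
    """Hypothetical monitoring unit counting consecutive failures per entity."""

    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(int)  # address -> consecutive failure count

    def on_failure(self, address):
        """Handle one failure event and return the failure analysis result."""
        self.counts[address] += 1
        # Assumption: "abnormal" as soon as the count reaches the threshold.
        if self.counts[address] >= self.threshold:
            return "abnormal"
        return "normal"

    def on_success(self, address):
        """A success event clears the consecutive failure count."""
        self.counts[address] = 0


dsp = (3, 1, 1)  # (frame, slot, subsystem) of the DSP chip in the example
monitor = ConsecutiveFailureMonitor()
results = [monitor.on_failure(dsp) for _ in range(5)]
print(results)  # four "normal" results followed by one "abnormal"
```

Keying the counter by address lets the same structure count failures for a chip, a software module, or a resource, depending on what the address tuple contains.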
The monitoring unit saves the result of the failure analysis (that is, DSP chip abnormal or normal) as the current latest failure analysis result.
304. When the monitoring unit determines that the DSP chip is abnormal, it sends a fault warning notification message to the DSP chip management unit.
The fault warning notification message includes the address information of the abnormal DSP chip (here, frame number 3, board slot number 1, and subsystem number 1).
After sending the fault warning notification message, the monitoring unit starts a timer. Until the timer expires, subsequent failure analyses do not send further fault warning notification messages; this mainly prevents the monitoring unit from sending repeated, frequent fault warning notification messages.
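The timer-based suppression of repeated warnings can be sketched as below. The helper name `WarningThrottle`, the per-address bookkeeping, and the injectable `clock` parameter are assumptions for illustration; the text only states that a timer is started after a warning is sent and that no further warning is sent before it expires.

```python
import time


class WarningThrottle:
    """Hypothetical helper that suppresses repeated warnings for one entity."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock   # injectable so the behaviour can be tested
        self._sent_at = {}   # address -> time the last warning was sent

    def should_send(self, address):
        """Return True if a warning may be sent now, and start the timer."""
        now = self.clock()
        sent_at = self._sent_at.get(address)
        if sent_at is not None and now - sent_at < self.hold_seconds:
            return False  # timer still running: suppress the warning
        self._sent_at[address] = now
        return True
```

A monotonic clock is used so that wall-clock adjustments do not shorten or extend the suppression period.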
305. The DSP chip management unit calls the DSP fault detection handler to perform fault detection.
A DSP fault detection handling function can be registered in the DSP chip management unit; calling this function triggers the DSP fault detection procedure. For example, a message is sent to the abnormal DSP chip to trigger the chip to perform a CRC check of its program segment and data segment and to return the CRC check result to the DSP chip management unit. When the DSP fault detection procedure finds a specific fault cause, it reports the corresponding alarm and writes a log to help the user locate the problem.
306. Based on the DSP fault detection result, the DSP chip management unit performs failure confirmation with the monitoring unit.
If the DSP fault detection result indicates that no fault was detected, the DSP chip management unit sends a failure query message to the monitoring unit, and the monitoring unit returns a response message that includes the current latest failure analysis result.
If the DSP fault detection result indicates that a fault was detected, the DSP chip management unit may or may not send a failure query message to the monitoring unit for failure confirmation. Preferably, because a fault has already been detected, the query is generally not sent, which improves system processing efficiency.
In particular, if the DSP fault detection result indicates that a fault was detected, or the current latest failure analysis result obtained through failure confirmation with the monitoring unit indicates that the DSP is abnormal, the procedure continues to the next step. If the DSP fault detection result indicates that no fault was detected and the current latest failure analysis result obtained through failure confirmation also indicates that the DSP chip is normal, the DSP chip has returned to normal and the whole procedure can end. This prevents transient (flash-type) faults from triggering unnecessary fault recovery measures that would affect the system.
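The decision made in step 306 can be written as a small function. The name `should_continue_to_recovery` and the callable `query_latest_result`, which stands in for the failure query exchanged with the monitoring unit, are hypothetical; the sketch follows the preferred variant in which the query is skipped once a fault has been detected.

```python
def should_continue_to_recovery(fault_detected, query_latest_result):
    """Decide whether the procedure continues to fault recovery (step 307).

    fault_detected: bool result of the DSP fault detection procedure.
    query_latest_result: callable returning "abnormal" or "normal"; it stands
    in for the failure query sent to the monitoring unit (an assumption).
    """
    if fault_detected:
        # Preferred variant in the text: do not query when a fault was found.
        return True
    # No fault detected: rely on the monitoring unit's latest analysis result.
    return query_latest_result() == "abnormal"
```

Returning False corresponds to the case where a transient fault has already cleared and the procedure ends without recovery.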
307. The DSP chip management unit calls the DSP fault recovery handler to perform fault recovery.
A DSP fault recovery handling function can be registered in the DSP chip management unit; calling this function triggers the DSP fault recovery procedure. For example, a reset message is sent to the abnormal DSP chip to trigger it to reset and restart, and a timer can be started to wait for the DSP chip to resume normal operation.
After executing the DSP fault recovery handler, the DSP chip management unit can perform fault detection on the DSP chip again and perform failure confirmation with the monitoring unit. If the DSP fault detection result indicates that a fault was detected, or the current latest failure analysis result obtained through failure confirmation with the monitoring unit indicates that the DSP is still abnormal, the DSP chip isolation measure is executed to isolate the abnormal DSP chip.
In this embodiment of the present invention, when service processing on the DSP chip fails, a service processing failure event is reported to the monitoring unit. The monitoring unit performs failure analysis based on these events, determines promptly that the DSP chip is abnormal, and then sends a fault warning notification message to the DSP chip management unit, which promptly invokes the DSP chip fault detection procedure and fault recovery procedure. This not only allows the specific fault cause of the DSP chip to be detected in time and an alarm indicating the root cause to be reported, but also allows the DSP chip to be repaired or isolated in time, so that the fault is fixed at an early stage, the abnormal DSP chip is quickly recovered or isolated, fault propagation is avoided, and system reliability is improved. In addition, because the fault detection procedure is triggered only after a fault warning notification message is received, compared with the original timer-triggered mechanism it not only ensures timeliness but also has minimal impact on system performance; the original timer-triggered DSP chip fault detection mechanism can even be disabled. Because this embodiment can monitor all service processing failures of the DSP chip, including signaling message processing failures, management message processing failures, and service code stream processing failures, the completeness of DSP chip failure detection can be ensured. Thus, even if fault detection for some failure modes was omitted from the DSP chip design, the scheme described in the present invention can essentially determine the failure of the DSP chip from its externally observable behavior and then take DSP chip fault recovery measures, so that the abnormal DSP chip is automatically repaired or isolated in time and normal operation is restored.
Referring to FIG. 4, an embodiment of the present invention provides a fault monitoring and handling method. This embodiment assumes that a first communication unit sends a message to a second communication unit and that service processing fails because no response message from the second communication unit is received before a timeout. The failure handling procedure for this case is as follows:
401. The first communication unit sends a message to the second communication unit. Service processing fails because no response message from the second communication unit is received before a timeout, and the first communication unit reports a service processing failure event to the monitoring unit at its own level. The service processing failure event includes the address information of the object entity of the failed service processing (the second communication unit).
402. An upper-level monitoring unit obtains the service processing failure event reported by the first communication unit.
Because the second communication unit may be outside the monitoring scope of the first communication unit's monitoring unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit, so the event needs to be reported to a monitoring unit at the next level up. Ultimately, the service processing failure event reported by the first communication unit is received by a monitoring unit that can monitor both the first communication unit and the second communication unit.
The monitoring unit may be a board-level monitoring unit, a frame-level monitoring unit, a network-element-level monitoring unit, or a network-level monitoring unit. Monitoring units at different levels (that is, the units that perform failure analysis) differ in the scope of failure analysis they can handle. In general, a board-level monitoring unit can perform failure analysis only on the hardware chips within the board or the software modules running on the board. A frame-level monitoring unit covers not only the board-level failure analysis of each board in the frame but also failure analysis between the boards of the frame. A network-element-level monitoring unit can perform failure analysis on all hardware chips or software modules within the network element. A network-level monitoring unit can perform failure analysis on all hardware chips or software modules in the entire network.
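One way to picture the escalation is to pick the lowest monitoring level whose scope covers both the reporting unit and the object entity. The sketch below assumes, purely for illustration, that each unit is addressed by a `(network_element, frame, board)` tuple; the function name `lowest_common_level` and the level labels are hypothetical and not taken from the patent.

```python
def lowest_common_level(addr_a, addr_b):
    """Return the lowest monitoring level that covers both units.

    Each address is a hypothetical (network_element, frame, board) tuple.
    """
    ne_a, frame_a, board_a = addr_a
    ne_b, frame_b, board_b = addr_b
    if ne_a != ne_b:
        return "network"          # different network elements
    if frame_a != frame_b:
        return "network_element"  # same element, different frames
    if board_a != board_b:
        return "frame"            # same frame, different boards
    return "board"                # both units sit on the same board


# Two units on different boards of the same frame: a frame-level
# monitoring unit is the lowest level that can analyse both.
print(lowest_common_level(("ne1", 3, 1), ("ne1", 3, 2)))  # frame
```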
403. The upper-level monitoring unit determines whether the second communication unit is abnormal according to the service processing failure event reported by the first communication unit and the preset failure determination criterion.
If the second communication unit has indeed failed, then every communication unit that sends it a message experiences a service processing failure caused by a response timeout, and all of these service processing failure events are sent to the upper-level monitoring unit. When the upper-level monitoring unit determines that the object entity indicated by the service processing failure events sent by multiple communication units is in every case the second communication unit, and the number of consecutive service processing failures counted for that object entity exceeds the configured failure threshold, it determines that the second communication unit is abnormal.
404. The monitoring unit sends a fault warning notification message to the management unit of the second communication unit. The fault warning notification message carries the address information of the second communication unit.
The subsequent fault detection and fault recovery steps are essentially the same as steps 205-207 and are not described again here.
In this embodiment of the present invention, when the service processing failure events sent by multiple communication units indicate the same object entity and the number of consecutive service processing failures counted for that object entity exceeds the configured failure threshold, the object entity is determined to be faulty and a fault warning notification message is sent. This promptly triggers the fault detection and fault recovery procedures for the object entity, so that it can be automatically repaired or isolated in time and the fault is fixed at an early stage. This keeps the system running stably over the long term, effectively avoids fault propagation, and improves system reliability.
Referring to FIG. 5, an embodiment of the present invention provides a fault monitoring and handling method. This embodiment assumes that the first communication unit sends a message to the second communication unit and service processing fails because no response message from the second communication unit is received before a timeout; at the same time, the second communication unit also sends a message to the first communication unit and service processing fails because no response message from the first communication unit is received before a timeout. In this case, the first communication unit and the second communication unit each report a service processing failure event, and the object entity of each failure is the peer communication unit (the second communication unit and the first communication unit, respectively). What this actually reflects, however, is a failure of the communication path between the first communication unit and the second communication unit. That path may also contain other, third communication units used for switching, and the failure of a third communication unit causes the same symptom, so failure analysis for this kind of problem needs to be performed by a monitoring unit that covers all communication units along the entire path. The failure handling procedure for this case is as follows:
501. The first communication unit sends a message to the second communication unit. Service processing fails because no response message from the second communication unit is received before a timeout, and the first communication unit reports a service processing failure event for the second communication unit to the monitoring unit at its own level. This event includes the address information of the object entity of the failed service processing (the second communication unit). The second communication unit sends a message to the first communication unit. Service processing fails because no response message from the first communication unit is received before a timeout, and the second communication unit reports a service processing failure event for the first communication unit to the monitoring unit at its own level. This event includes the address information of the object entity of the failed service processing (the first communication unit).
502. The upper-level monitoring unit obtains the service processing failure event for the second communication unit reported by the first communication unit and the service processing failure event for the first communication unit reported by the second communication unit.
Because the second communication unit may be outside the monitoring scope of the first communication unit's monitoring unit, that monitoring unit cannot effectively perform failure analysis on the second communication unit, so the service processing failure event for the second communication unit reported by the first communication unit needs to be reported to a monitoring unit at the next level up. Likewise, the service processing failure event for the first communication unit reported by the second communication unit also needs to be reported to a monitoring unit at the next level up. Ultimately, both events are received by a monitoring unit that can monitor the first communication unit and the second communication unit.
503. Based on the service processing failure event reported by the first communication unit, the service processing failure event reported by the second communication unit, and the preset failure determination criterion, the upper-level monitoring unit does not perform failure analysis on the first communication unit and the second communication unit, and instead performs failure analysis on the third communication unit on the path between them.
The preset failure determination criterion specifies that when the object entities indicated by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is not performed on these two communication units.
Further, if the system configuration shows that the path between the first communication unit and the second communication unit contains a third communication unit, and the preset failure determination criterion specifies that, when the object entities indicated by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is performed on the communication units on the path between these two communication units, then failure analysis can be performed on the third communication unit on the path between the first communication unit and the second communication unit. For example, when the upper-level monitoring unit determines that the object entities indicated by the service processing failure events sent by multiple communication units (including the first communication unit and the second communication unit) are all the third communication unit, and the number of consecutive service processing failures counted for the third communication unit exceeds the configured failure threshold, the upper-level monitoring unit determines that the third communication unit is abnormal.
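The criterion in step 503 can be sketched as a function that chooses which entities to analyse. Everything named here is hypothetical: events are modelled as `(reporter, target)` pairs, and `paths` is an assumed configuration table mapping an unordered pair of units to the intermediate units on the path between them.

```python
def select_analysis_targets(events, paths):
    """Choose the entities on which failure analysis should be performed.

    events: iterable of (reporter, target) pairs from failure events.
    paths: dict mapping frozenset({a, b}) to the list of intermediate
           units on the path between a and b (assumed configuration).
    """
    events = list(events)
    reported = set(events)
    targets = []
    for reporter, target in events:
        if (target, reporter) in reported:
            # Both ends blame each other: analyse the path, not the peers.
            candidates = paths.get(frozenset((reporter, target)), [])
        else:
            candidates = [target]
        for unit in candidates:
            if unit not in targets:
                targets.append(unit)
    return targets


paths = {frozenset(("unit1", "unit2")): ["unit3"]}
print(select_analysis_targets([("unit1", "unit2"), ("unit2", "unit1")], paths))
# ['unit3']
```

When only one side reports a failure, the function falls back to analysing the reported target, which matches the FIG. 4 case.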
504. The monitoring unit sends a fault warning notification message to the management unit of the third communication unit. The fault warning notification message carries the address information of the third communication unit.
The subsequent fault detection and fault recovery steps are essentially the same as steps 205-207 and are not described again here.
In this embodiment of the present invention, when two communication units that communicate with each other (such as the first communication unit and the second communication unit above) each report a service processing failure event for the other, failure analysis is, according to the preset failure determination criterion, not performed on these two communication units but on the third communication unit on the path between them. The failed node on the communication path is thus found in time, and sending a fault warning notification message promptly triggers the fault detection and fault recovery procedures for that node, so that it can be automatically repaired in time and the fault is fixed at an early stage. This keeps the system running stably over the long term, effectively avoids fault propagation, and improves system reliability.
Referring to FIG. 6, an embodiment of the present invention provides a monitoring device, which includes:
a first acquiring unit 61, configured to obtain a service processing failure event reported by a communication unit, where the service processing failure event includes the address information of the object entity of the failed service processing;
a determining unit 62, configured to determine the abnormal entity according to the service processing failure event reported by the communication unit and the preset failure determination criterion; and
a sending unit 63, configured to send a fault warning notification message, where the fault warning notification message includes information about at least one of the entities determined to be abnormal and is used to instruct that fault detection be performed.
The determining unit 62 includes: an obtaining subunit 621, configured to compute a failure indicator value from the service processing failure events reported by the communication unit; and a determining subunit 622, configured to determine the abnormal object entity according to the failure indicator value and the corresponding failure threshold in the failure determination criterion.
The monitoring device may further include a configuration unit 68, configured to configure and save the failure determination criterion.
Specifically, the obtaining subunit 621 is configured to use the service processing failure events reported by the communication unit to count the accumulated number of consecutive service processing failures, which serves as the failure indicator value; or to use these events to obtain the ratio of failed service processing attempts to total service processing attempts within a period of time, which serves as the failure indicator value; or, after receiving a service processing failure event reported by the communication unit, to query a key performance indicator, which serves as the failure indicator value.
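Of the three failure indicator values, the ratio of failures to total attempts within a period of time can be sketched with a sliding window. The class name `FailureRatioWindow` and its methods are hypothetical, and the sketch assumes that the monitoring unit also learns about successful attempts (for example through success events), since a ratio cannot be formed from failure events alone.

```python
from collections import deque


class FailureRatioWindow:
    """Hypothetical failure indicator: failures / attempts in a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, succeeded) in arrival order

    def record(self, timestamp, succeeded):
        """Record one service processing attempt and drop expired samples."""
        self.samples.append((timestamp, succeeded))
        while self.samples and timestamp - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def failure_ratio(self):
        """Return the share of failed attempts among the retained samples."""
        if not self.samples:
            return 0.0
        failures = sum(1 for _, succeeded in self.samples if not succeeded)
        return failures / len(self.samples)
```

The resulting ratio would then be compared with the corresponding failure threshold in the failure determination criterion, in the same way as the consecutive failure count.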
Specifically, the obtaining subunit 621 includes a first statistics subunit 6211, a second statistics subunit 6212, and a third statistics subunit 6213.
The first statistics subunit 6211 is specifically configured to use the service processing failure events reported by the communication unit to compute a failure indicator value for the hardware entity, where the hardware entity is the hardware to which the object entity belongs.
The second statistics subunit 6212 is specifically configured to use the service processing failure events reported by the communication unit to compute a failure indicator value for the software entity, where the software entity is the entity identified jointly by the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
The third statistics subunit 6213 is specifically configured to use the service processing failure events reported by the communication unit to compute a failure indicator value for the logical resource entity, where the logical resource entity is the entity identified jointly by the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
The determining subunit 622 includes a first determining subunit 6221, a second determining subunit 6222, and a third determining subunit 6223.
The first determining subunit 6221 is specifically configured to determine whether the hardware entity is abnormal according to the failure indicator value computed for the hardware entity and the first failure threshold for the hardware entity in the failure determination criterion.
The second determining subunit 6222 is configured to determine whether the software entity is abnormal according to the failure indicator value computed for the software entity and the second failure threshold for the software entity in the failure determination criterion.
The third determining subunit 6223 is configured to determine whether the logical resource entity is abnormal according to the failure indicator value computed for the logical resource entity and the third failure threshold for the logical resource entity in the failure determination criterion.
Specifically, the sending unit 63 is configured to: when only the hardware entity is faulty, send a fault warning notification message that includes the physical address information of the hardware to which the object entity belongs; when both the hardware entity and the software entity are abnormal, send a fault warning notification message that includes only the software entity information, where the software entity information includes the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity; and when both the hardware entity and the logical resource entity are abnormal, send a fault warning notification message that includes only the logical resource entity information, where the logical resource entity information includes the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
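The selection performed by the sending unit 63 can be sketched as a function that builds the warning contents from the three determination results. The function name `build_warnings` and the dictionary layout are hypothetical. The text covers only three combinations; emitting two warnings when all three entities are abnormal, and none when the hardware entity is normal, are assumptions of this sketch.

```python
def build_warnings(physical_address, logical_address, cause,
                   hardware_abnormal, software_abnormal, resource_abnormal):
    """Return the contents of the fault warning notifications to send."""
    warnings = []
    if hardware_abnormal and software_abnormal:
        # Narrow the warning down to the software entity only.
        warnings.append({"entity": "software",
                         "physical_address": physical_address,
                         "logical_address": logical_address})
    if hardware_abnormal and resource_abnormal:
        # Narrow the warning down to the logical resource entity only.
        warnings.append({"entity": "logical_resource",
                         "physical_address": physical_address,
                         "cause": cause})
    if hardware_abnormal and not warnings:
        # Only the hardware entity is abnormal.
        warnings.append({"entity": "hardware",
                         "physical_address": physical_address})
    return warnings
```

Preferring the narrower entity means that detection and recovery start with the smallest affected scope rather than with the whole chip.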
Specifically, the physical address information of the hardware to which the object entity belongs includes a first-level sub-address, and the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address.
To ensure that an abnormal entity is repaired in time, so that the fault is fixed at an early stage, the monitoring device further includes a first control unit 69 and a second control unit 610.
The first control unit 69 is configured to control the sending unit 63 to send a fault warning notification message including the first-level sub-address if, within a preset time period after the fault warning notification message including the physical address information of the hardware to which the object entity belongs is sent, the first determining subunit 6221 determines that the hardware entity has remained abnormal. In this case, the sending unit 63 is further configured to send the fault warning notification message including the first-level sub-address.
The second control unit 610 is configured to control the sending unit 63 to send a fault warning notification message including hardware entity information if, within a preset time period after the fault warning notification message is sent, the first determining subunit 6221 determines that the hardware entity has remained abnormal, where the hardware entity information includes the physical address information of the hardware to which the object entity belongs. In this case, the sending unit 63 is further configured to send the fault warning notification message including the hardware entity information.
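The escalation performed by the first control unit 69 can be sketched as follows. The sketch assumes, purely for illustration, that the physical address is a tuple of sub-addresses whose first element is the first-level sub-address, and that the determination results observed during the preset time period are available as a list of booleans. The function name and the address values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def escalated_address(physical_address: Tuple[str, ...],
                      results_in_period: Sequence[bool]) -> Optional[str]:
    """Return the first-level sub-address to report, or None if no escalation is needed.

    results_in_period holds the hardware determination results (True = abnormal)
    observed during the preset time period after the first notification was sent.
    """
    # Escalate only if the hardware entity stayed abnormal for the whole period.
    if results_in_period and all(results_in_period):
        return physical_address[0]
    return None
```

For example, with the hypothetical address `("board-5", "dsp-1")`, a period in which every result is abnormal yields `"board-5"`, so the follow-up notification points at the enclosing hardware rather than at the component.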
Optionally, the service processing failure event further includes the current load of the communication unit.
Optionally, to ensure the accuracy of the failure analysis, the monitoring device further includes a first judging unit 64 and a second judging unit 65.
The first judging unit 64 is configured to judge whether the current load of the communication unit is less than a preset threshold and, if not, discard the service processing failure event. In this case, the determining unit 62 is configured to determine the entity in which an abnormality occurs, according to the service processing failure event reported by the communication unit and the preset failure determination criterion, when the judgment result of the first judging unit 64 is yes.
The second judging unit 65 is configured to judge whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by a terminal device and, if so, discard the service processing failure event. In this case, the determining unit 62 is configured to determine the entity in which an abnormality occurs, according to the service processing failure event reported by the communication unit and the preset failure determination criterion, when the judgment result of the second judging unit 65 is no.
Optionally, the first obtaining unit 61 is specifically configured to obtain a service processing failure event that was reported by a communication unit and forwarded by a sub-monitoring device, where the sub-monitoring device forwards the event when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.
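The two discard checks made by the first judging unit 64 and the second judging unit 65 can be sketched together. The specification presents them as independent optional units; chaining them in one function, as well as the `FailureEvent` field names and the threshold value, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """Hypothetical subset of a service processing failure event."""
    target_address: str               # address of the object entity whose processing failed
    current_load: float               # current load of the reporting communication unit
    caused_by_terminal: bool = False  # stands in for the specific indication identifier


def accept_for_analysis(event: FailureEvent, load_threshold: float) -> bool:
    """Return True if the event should be passed on to failure analysis."""
    # First judging unit: a load that is not below the preset threshold means the
    # failure may be due to overload, so the event is discarded.
    if event.current_load >= load_threshold:
        return False
    # Second judging unit: a failure caused by the terminal device says nothing
    # about the monitored entities, so the event is discarded.
    if event.caused_by_terminal:
        return False
    return True
```

Only events that pass both checks contribute to the failure indicator value, which keeps overload and terminal-side errors from being counted against a healthy entity.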
Optionally, the sending unit 63 is specifically configured to send the fault warning notification message to the object entity whose service processing failed, or to the management module of that object entity.
To ensure the accuracy of the failure analysis, the monitoring device further includes: a second obtaining unit 66, configured to obtain a service processing success event reported by the communication unit; and a clearing unit 67, configured to clear the counted failure indicator value to zero after the second obtaining unit obtains the service processing success event, specifically, to clear the failure indicator value counted by the first statistical subunit 6211, the second statistical subunit 6212, or the third statistical subunit 6213.
Optionally, to ensure that an abnormal entity is repaired in time, so that the fault is fixed at an early stage, the monitoring device may further include a receiving unit 611.
The receiving unit 611 is configured to receive a failure query message, where the failure query message is sent by the object entity whose service processing failed or by the management module of that object entity.
The sending unit 63 is further configured to send a response message according to the determination result of the determining subunit, where the response message includes the latest failure analysis result. Specifically, the response message includes the latest failure analysis result of the abnormal entity targeted by the fault warning notification message already sent. If the fault warning notification message that was sent is for a hardware entity (that is, it includes hardware entity information), the response message includes the latest failure analysis result of that hardware entity, that is, information indicating whether the hardware entity is abnormal. If the fault warning notification message that was sent is for a software entity (that is, it includes software entity information), the response message includes the latest failure analysis result of that software entity, that is, information indicating whether the software entity is abnormal. If the fault warning notification message that was sent is for a logical resource entity (that is, it includes logical resource entity information), the response message includes the latest failure analysis result of that logical resource entity, that is, information indicating whether the logical resource entity is abnormal.
In this embodiment of the present invention, the communication unit promptly reports a service processing failure event when service processing of an object entity fails, and the monitoring device performs failure analysis to determine the specific entity in which an abnormality occurs and sends a fault warning notification message, promptly triggering the fault detection procedure and the fault recovery procedure for that entity. This enables the abnormal entity to be automatically repaired or isolated in time, so that the fault is fixed at an early stage, which keeps the system running stably over the long term, effectively prevents the fault from spreading, and improves system reliability. In addition, the fault detection procedure is triggered only after the analysis finds a system failure, and can be triggered only for the entity in which the abnormality occurs, so the fault alarms produced by fault detection are consistent with the observed system failure and unrelated alarm reports are effectively suppressed.
The technical solution provided in this embodiment can monitor all service processing failures in the system, including failures in processing signaling messages, management messages, and service code streams. Because it covers all service processing failures of the system, the system can detect the failure of every communication unit, which ensures the completeness of detection. Thus, even if no dedicated fault detection technique has been designed for some communication units in the system, the failure of such a communication unit can still be basically determined by the solution described in the present invention, and targeted fault recovery measures can then be taken so that the abnormal communication unit is automatically repaired in time and the system returns to normal.
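The interplay of counting, clearing on success, and answering a failure query can be sketched as below. The sketch uses the consecutive-failure count as the failure indicator value and assumes that a success event identifies the entity it concerns. The class name, method names, and threshold are hypothetical.

```python
from typing import Dict


class ConsecutiveFailureMonitor:
    """Minimal sketch: per-entity consecutive failure counts with reset on success."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold          # stands in for the preset failure threshold
        self._counts: Dict[str, int] = {}

    def on_failure(self, entity: str) -> None:
        # Statistical subunit: accumulate consecutive service processing failures.
        self._counts[entity] = self._counts.get(entity, 0) + 1

    def on_success(self, entity: str) -> None:
        # Clearing unit: a service processing success event resets the indicator.
        self._counts[entity] = 0

    def query(self, entity: str) -> bool:
        # Response to a failure query: the latest analysis result for the entity.
        return self._counts.get(entity, 0) >= self.threshold
```

With a threshold of 3, three failures in a row make `query` return `True`, and a single success event afterwards makes it return `False` again, so an entity that has recovered is no longer reported as abnormal.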
Referring to FIG. 7, an embodiment of the present invention provides a communication system applicable to a distributed failure analysis processing mode. The communication system includes a communication unit 701, a sub-monitoring unit 702, and a parent monitoring unit 703. Specifically:
The sub-monitoring unit 702 is configured to obtain the service processing failure event reported by the communication unit 701 and, on determining, according to the address information of the object entity whose service processing failed that is carried in the event, that the object entity does not belong to the scope it manages, report the service processing failure event to the parent monitoring unit 703.
The parent monitoring unit 703 is configured to determine, according to the address information of the object entity whose service processing failed that is carried in the service processing failure event, whether the object entity belongs to the scope it manages. If so, it determines the entity in which an abnormality occurs according to the service processing failure event reported by the communication unit 701 and the preset failure determination criterion, and sends a fault warning notification message used to instruct fault detection, where the fault warning notification message includes information about at least one of the entities determined to be abnormal. If not, it continues to report the service processing failure event to the parent monitoring unit of the parent monitoring unit 703.
The parent monitoring unit is a network-level monitoring unit located on the central network management device, and the sub-monitoring unit is a network-element-level monitoring unit located on the central control board of the network element; or the parent monitoring unit is a network-element-level monitoring unit located on the central control board of the network element, and the sub-monitoring unit is a frame-level monitoring unit located on the central control board of the frame; or the parent monitoring unit is a frame-level monitoring unit located on the central control board of the frame, and the sub-monitoring unit is a board-level monitoring unit located on the board where the communication unit resides. For details, refer to the corresponding description in the method embodiments of the specification; details are not repeated here.
In this embodiment of the present invention, the communication unit promptly reports a service processing failure event when service processing of an object entity fails, and the monitoring unit performs failure analysis to determine the specific entity in which an abnormality occurs and sends a fault warning notification message, promptly triggering the fault detection procedure and the fault recovery procedure for that entity. This enables the abnormal entity to be automatically repaired in time, so that the fault is fixed at an early stage, which keeps the system running stably over the long term, effectively prevents the fault from spreading, and improves system reliability.
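The upward forwarding between monitoring levels described for FIG. 7 can be sketched as a walk along a chain of monitoring units. The class name, the representation of a management scope as a set of address strings, and the example addresses are all hypothetical.

```python
from typing import Optional, Set


class MonitoringUnit:
    """Minimal sketch of a monitoring unit with a management scope and a parent."""

    def __init__(self, name: str, managed: Set[str],
                 parent: Optional["MonitoringUnit"] = None) -> None:
        self.name = name
        self.managed = managed    # addresses of the object entities this unit manages
        self.parent = parent      # next level up, or None at the top of the hierarchy

    def route(self, target_address: str) -> Optional[str]:
        """Return the name of the unit that analyses the event, or None if nobody does."""
        unit = self
        while unit is not None:
            if target_address in unit.managed:
                return unit.name  # this level performs the failure analysis
            unit = unit.parent    # otherwise report the event to the parent level
        return None


# Hypothetical four-level hierarchy: board -> frame -> network element -> network.
network = MonitoringUnit("network", {"ne-2/frame-1/board-4"})
element = MonitoringUnit("element", {"ne-1/frame-2/board-1"}, parent=network)
frame = MonitoringUnit("frame", {"ne-1/frame-1/board-9"}, parent=element)
board = MonitoringUnit("board", {"ne-1/frame-1/board-3"}, parent=frame)
```

An event about an entity on the reporting board is analysed locally, while an event about an entity in another frame climbs until it reaches a level whose scope contains that entity.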
Referring to FIG. 8, an embodiment of the present invention provides a communication system that includes a first communication unit 801, a second communication unit 802, and a monitoring unit 803.
The monitoring unit 803 is configured to obtain the service processing failure event reported by the first communication unit 801 and the service processing failure event reported by the second communication unit 802. When the address information of the object entity whose service processing failed that is carried in the event reported by the first communication unit 801 is the address information of the second communication unit 802, and the address information of the object entity whose service processing failed that is carried in the event reported by the second communication unit 802 is the address information of the first communication unit 801, the monitoring unit 803 does not perform failure analysis on the first communication unit 801 and the second communication unit 802.
Not performing failure analysis on the first communication unit 801 and the second communication unit 802 specifically means that the monitoring unit 803, according to the service processing failure event reported by the first communication unit 801, the service processing failure event reported by the second communication unit 802, and the preset failure determination criterion, does not perform failure analysis on the first communication unit 801 and the second communication unit 802. The preset failure determination criterion specifies that when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, failure analysis is not performed on these two communication units.
It should be noted that when the object entities pointed to by the service processing failure events reported by two communication units that communicate with each other are each the peer communication unit, this indicates a fault in the communication path between the first communication unit and the second communication unit, so failure analysis does not need to be performed on these two communication units. For details, refer to the corresponding description in the method embodiments of the specification; details are not repeated here.
In the communication system provided by this embodiment of the present invention, when the object entities pointed to by the service processing failure events reported by two communication units that should communicate with each other are each the peer communication unit, failure analysis is not performed on these two communication units, which avoids producing an erroneous failure analysis result.
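The mutual-pointing rule applied by the monitoring unit 803 can be sketched as a check over the reported events. The sketch assumes each communication unit has at most one pending report, represented as a mapping from the reporter's address to the address of the object entity named in its event. The function name and the addresses are hypothetical.

```python
from typing import Dict, Set


def units_excluded_from_analysis(reports: Dict[str, str]) -> Set[str]:
    """Return the units that blame each other and are therefore not analysed.

    reports maps the address of a reporting communication unit to the address
    of the object entity named in its service processing failure event.
    """
    excluded: Set[str] = set()
    for reporter, target in reports.items():
        # Two units pointing at each other suggest a fault on the path between
        # them, so neither unit is subjected to failure analysis.
        if reporter != target and reports.get(target) == reporter:
            excluded.update((reporter, target))
    return excluded
```

In the hypothetical case where `unit-801` reports `unit-802`, `unit-802` reports `unit-801`, and a third unit also reports `unit-801`, only the first two are excluded; the third report is one-sided and is still analysed.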
A person of ordinary skill in the art can understand that all or part of the steps of the methods in the foregoing embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The fault monitoring method, communication device, and communication system provided by the embodiments of the present invention have been described above. The description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In conclusion, the content of this specification should not be construed as a limitation on the present invention.

Claims

1. A fault monitoring method, comprising:
obtaining a service processing failure event reported by a communication unit, wherein the service processing failure event comprises address information of an object entity whose service processing failed; and
determining, according to the service processing failure event reported by the communication unit and a preset failure determination criterion, an entity in which an abnormality occurs, and sending a fault warning notification message used to instruct fault detection, wherein the fault warning notification message comprises information about at least one of the entities determined to be abnormal.
2. The method according to claim 1, wherein
the determining, according to the service processing failure event reported by the communication unit and the preset failure determination criterion, an entity in which an abnormality occurs specifically comprises:
obtaining a failure indicator value by using the service processing failure event reported by the communication unit; and
determining the entity in which an abnormality occurs according to the failure indicator value and the corresponding failure threshold in the failure determination criterion.
3. The method according to claim 2, wherein
the obtaining a failure indicator value by using the service processing failure event reported by the communication unit comprises: counting, by using the service processing failure event reported by the communication unit, an accumulated number of consecutive service processing failures, wherein the accumulated number of consecutive service processing failures is the failure indicator value;
or
obtaining, by using the service processing failure event reported by the communication unit, the ratio of the number of service processing failures within a period of time to the total number of service processing operations, wherein the ratio is the failure indicator value.
4. The method according to claim 2, wherein
the obtaining a failure indicator value by using the service processing failure event reported by the communication unit comprises: querying a key performance indicator after receiving the service processing failure event reported by the communication unit, wherein the key performance indicator is the failure indicator value.
5. The method according to claim 3, wherein
the address information of the object entity whose service processing failed comprises physical address information of the hardware to which the object entity belongs; and
the determining the entity in which an abnormality occurs according to the failure indicator value and the corresponding failure threshold in the failure determination criterion comprises:
determining whether a hardware entity is abnormal according to the failure indicator value counted for the hardware entity and a first failure threshold for the hardware entity in the failure determination criterion, wherein the hardware entity is the hardware to which the object entity belongs.
6. The method according to claim 5, wherein
the sending a fault warning notification message used to instruct fault detection specifically comprises:
sending a fault warning notification message comprising the physical address information of the hardware to which the object entity belongs, wherein the physical address information of the hardware to which the object entity belongs comprises a first-level sub-address, and the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address; and
after the sending of the fault warning notification message comprising the physical address information of the hardware to which the object entity belongs, the method further comprises: if it is determined that the hardware to which the object entity belongs has remained abnormal within a preset time period, sending a fault warning notification message comprising the first-level sub-address.
7. The method according to claim 5, wherein
the address information of the object entity whose service processing failed further comprises logical address information of the object entity; and
the determining the entity in which an abnormality occurs according to the failure indicator value and the corresponding failure threshold in the failure determination criterion further comprises:
determining whether a software entity is abnormal according to the failure indicator value counted for the software entity and a second failure threshold for the software entity in the failure determination criterion, wherein the software entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
8. The method according to claim 7, wherein
the sending a fault warning notification message used to instruct fault detection specifically comprises:
when both the hardware entity and the software entity are abnormal, sending a fault warning notification message comprising only software entity information, wherein the software entity information comprises the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
9. The method according to claim 5, wherein
the service processing failure event further comprises cause indication information of the service processing failure; and
the determining the entity in which an abnormality occurs according to the failure indicator value and the corresponding failure threshold in the failure determination criterion further comprises: determining whether a logical resource entity is abnormal according to the failure indicator value counted for the logical resource entity and a third failure threshold for the logical resource entity in the failure determination criterion, wherein the logical resource entity is the entity corresponding to both the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
10. The method according to claim 9, wherein
the sending a fault warning notification message used to instruct fault detection specifically comprises:
when both the hardware entity and the logical resource entity are abnormal, sending a fault warning notification message comprising only logical resource entity information, wherein the logical resource entity information comprises the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
11. The method according to claim 8 or 10, wherein the method further comprises: if it is determined that the hardware entity has remained abnormal within a predetermined time period after the fault warning notification message used to instruct fault detection is sent, sending a fault warning notification message comprising hardware entity information, wherein the hardware entity information comprises the physical address information of the hardware to which the object entity belongs.
12. The method according to claims 2 to 10, wherein
the service processing failure event further comprises the current load of the communication unit; and
the method further comprises: judging whether the current load of the communication unit is less than a preset threshold; if so, triggering the step of counting the failure indicator value; and if not, discarding the service processing failure event.
13. The method according to claims 2 to 10, wherein
the method further comprises: judging whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by a terminal device; if not, triggering the step of counting the failure indicator value; and if so, discarding the service processing failure event.
14. The method according to any one of claims 1 to 10, wherein
the service processing failure event is a signaling message processing failure event, a management message processing failure event, a service code stream processing failure event, or an interface invocation processing failure event.
15. The method according to any one of claims 1 to 10, wherein
the service processing failure event is a signaling message processing failure event reported by the communication unit when it determines that a field in a signaling message from the peer communication unit is assigned an abnormal value;
or the service processing failure event is a signaling message processing failure event reported by the communication unit when it does not receive a response message from the peer communication unit within a predetermined time period.
16. The method according to any one of claims 1 to 10, wherein the obtaining a service processing failure event reported by a communication unit specifically comprises:
obtaining a service processing failure event that was reported by the communication unit and forwarded by a sub-monitoring device, wherein the service processing failure event is forwarded by the sub-monitoring device when the object entity whose service processing failed does not belong to the management scope of the sub-monitoring device.
17. The method according to any one of claims 3 and 5 to 10, wherein the method further comprises: obtaining a service processing success event reported by the communication unit, and clearing the failure indicator value to zero.
18. The method according to any one of claims 1 to 10, wherein
the sending a fault warning notification message specifically comprises:
sending the fault warning notification message to the object entity whose service processing failed;
or sending the fault warning notification message to the management module of the object entity whose service processing failed.
19. A monitoring device, comprising:
a first obtaining unit, configured to obtain a service processing failure event reported by a communication unit, wherein the service processing failure event comprises address information of an object entity whose service processing failed;
a determining unit, configured to determine, according to the service processing failure event reported by the communication unit and a preset failure determination criterion, an entity in which an abnormality occurs; and
a sending unit, configured to send a fault warning notification message, wherein the fault warning notification message comprises information about at least one of the entities determined to be abnormal, and the fault warning notification message is used to instruct fault detection.
20. The monitoring device according to claim 19, wherein
the determining unit comprises:
an obtaining subunit, configured to obtain a failure indicator value by using the service processing failure event reported by the communication unit; and
a determining subunit, configured to determine the entity in which an abnormality occurs according to the failure indicator value and the corresponding failure threshold in the failure determination criterion.
21. The monitoring device according to claim 20, wherein
the obtaining subunit is configured to count, by using the service processing failure events reported by the communication unit, an accumulated number of consecutive service processing failures, the accumulated number of consecutive service processing failures being the failure indicator value;
or
the obtaining subunit is configured to obtain, by using the service processing failure events reported by the communication unit, the ratio of the number of service processing failures to the total number of service processing operations within a period of time, the ratio being the failure indicator value.
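The two failure indicators named in claim 21 can be sketched as follows. This is a minimal illustration only: the class names, the timestamp-based sliding window, and the idea that successful operations are also recorded (so that a total is available) are assumptions, not details taken from the claims.

```python
from collections import deque


class ConsecutiveFailureCounter:
    """Accumulated number of consecutive failures (first alternative)."""

    def __init__(self):
        self.count = 0

    def on_failure(self):
        # Each reported failure event increases the indicator by one.
        self.count += 1
        return self.count


class WindowedFailureRatio:
    """Failures divided by total operations within a time window (second alternative)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        self.events.append((timestamp, failed))
        # Drop operations that fall outside the window (hypothetical policy).
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()

    def ratio(self):
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)
```

Either value would then be compared against the failure threshold of the failure determination criterion, as recited in claim 20.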
22. The monitoring device according to claim 20, wherein
the obtaining subunit is configured to query a key performance indicator after receiving the service processing failure event reported by the communication unit, the key performance indicator being the failure indicator value.
23. The monitoring device according to claim 21, wherein
the address information of the object entity whose service processing failed comprises physical address information of the hardware to which the object entity belongs;
the obtaining subunit comprises a first statistics subunit,
the first statistics subunit being configured to collect, by using the service processing failure events reported by the communication unit, a failure indicator value for a hardware entity, wherein the hardware entity is the hardware to which the object entity belongs;
the determining subunit comprises a first determining subunit,
the first determining subunit being configured to determine whether the hardware entity is abnormal according to the failure indicator value collected for the hardware entity and a first failure threshold for the hardware entity in the failure determination criterion.
24. The monitoring device according to claim 23, wherein
the fault early-warning notification message sent by the sending unit comprises the physical address information of the hardware to which the object entity belongs;
wherein the physical address information of the hardware to which the object entity belongs comprises a first-level sub-address, and the hardware to which the object entity belongs is a component of the hardware corresponding to the first-level sub-address;
the monitoring device further comprises:
a first control unit, configured to control the sending unit to send a fault early-warning notification message comprising the first-level sub-address if, within a preset time period after the fault early-warning notification message comprising the physical address information of the hardware to which the object entity belongs is sent, the first determining subunit determines that the hardware entity remains abnormal;
the sending unit is further configured to send the fault early-warning notification message comprising the first-level sub-address.
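The escalation step of claim 24 can be illustrated with a short sketch. It assumes, purely for illustration, that a physical address is a `(first_level, component)` tuple and that the caller already knows whether the hardware entity stayed abnormal for the whole preset period; the function name and parameters are hypothetical.

```python
def escalation_address(physical_address, warned_at, now, preset_period, still_abnormal):
    """Return the first-level sub-address to warn about, or None if no escalation is due.

    physical_address is assumed to be a (first_level, component) tuple.
    """
    first_level, _component = physical_address
    period_elapsed = (now - warned_at) >= preset_period
    if period_elapsed and still_abnormal:
        # The component did not recover, so the enclosing hardware is reported.
        return first_level
    return None
```

For example, if a warning about `("board-2", "dsp-5")` was sent at time 100 with a 60-second period and the entity is still abnormal at time 160, the sketch returns `"board-2"`.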
25. The monitoring device according to claim 23, wherein
the obtaining subunit further comprises a second statistics subunit,
the second statistics subunit being configured to collect, by using the service processing failure events reported by the communication unit, a failure indicator value for a software entity, wherein the software entity is the entity identified jointly by the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity;
the determining subunit comprises a second determining subunit,
the second determining subunit being configured to determine whether the software entity is abnormal according to the failure indicator value collected for the software entity and a second failure threshold for the software entity in the failure determination criterion.
26. The monitoring device according to claim 25, wherein
the sending unit is configured to send, when both the hardware entity and the software entity are abnormal, a fault early-warning notification message comprising only information about the software entity, wherein the software entity information comprises the physical address information of the hardware to which the object entity belongs and the logical address information of the object entity.
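The interplay of claims 23, 25 and 26 can be sketched as below: failures are counted once per hardware entity (physical address) and once per software entity (physical plus logical address), and when both are abnormal only the software entity is reported. The function name, the dictionary-shaped warnings, the use of a plain failure count as the indicator, and the `>=` comparison are assumptions made for this illustration.

```python
from collections import Counter


def build_warnings(events, hw_threshold, sw_threshold):
    """events: iterable of (physical_address, logical_address) taken from failure events."""
    hw_counts = Counter()
    sw_counts = Counter()
    for physical, logical in events:
        hw_counts[physical] += 1             # indicator for the hardware entity
        sw_counts[(physical, logical)] += 1  # indicator for the software entity

    abnormal_sw = [key for key, n in sw_counts.items() if n >= sw_threshold]
    hw_with_abnormal_sw = {physical for physical, _ in abnormal_sw}

    # Abnormal software entities are always reported with both addresses.
    warnings = [{"physical": p, "logical": l} for p, l in abnormal_sw]
    for physical, n in hw_counts.items():
        # Abnormal hardware is reported only if no software entity on it is abnormal.
        if n >= hw_threshold and physical not in hw_with_abnormal_sw:
            warnings.append({"physical": physical})
    return warnings
```

Reporting the narrower software entity first keeps the subsequent fault detection focused; claim 29 then covers the case where the hardware entity stays abnormal afterwards.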
27. The monitoring device according to claim 23, wherein
the statistics subunit further comprises a third statistics subunit,
the third statistics subunit being configured to collect, by using the service processing failure events reported by the communication unit, a failure indicator value for a logical resource entity, wherein the logical resource entity is the entity identified jointly by the physical address information of the hardware to which the object entity belongs and cause indication information of the service processing failure;
the determining subunit further comprises a third determining subunit,
the third determining subunit being configured to determine whether the logical resource entity is abnormal according to the failure indicator value collected for the logical resource entity and a third failure threshold for the logical resource entity in the failure determination criterion.
28. The monitoring device according to claim 27, wherein
the sending unit is configured to send, when both the hardware entity and the logical resource entity are abnormal, a fault early-warning notification message comprising only logical resource entity information, wherein the logical resource entity information comprises the physical address information of the hardware to which the object entity belongs and the cause indication information of the service processing failure.
29. The monitoring device according to claim 26 or 28, wherein the monitoring device further comprises:
a second monitoring unit, configured to control the sending unit to send a fault early-warning notification message comprising hardware entity information if, within a preset time period after the sending unit sends the fault early-warning notification message, the first determining subunit determines that the hardware entity remains abnormal, wherein the hardware entity information comprises the physical address information of the hardware to which the object entity belongs;
the sending unit is further configured to send the fault early-warning notification message comprising the hardware entity information.
30. The monitoring device according to any one of claims 19 to 28, wherein
the service processing failure event further comprises a current load of the communication unit;
the monitoring device further comprises a first judging unit, configured to judge whether the current load of the communication unit is less than a preset threshold and, if not, discard the service processing failure event;
the determining unit is configured to determine, when the judgment result of the first judging unit is yes, the abnormal entity according to the service processing failure event reported by the communication unit and the preset failure determination criterion.
31. The monitoring device according to any one of claims 19 to 28, wherein
the monitoring device further comprises a second judging unit, configured to judge whether the service processing failure event carries a specific indication identifier indicating that the service processing failure was caused by a terminal device and, if so, discard the service processing failure event;
the determining unit is configured to determine, when the judgment result of the second judging unit is no, the abnormal entity according to the service processing failure event reported by the communication unit and the preset failure determination criterion.
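The two pre-filters of claims 30 and 31 can be combined in one small predicate. This is a sketch under stated assumptions: the event is modelled as a dictionary, and the keys `load` and `terminal_caused` are hypothetical names for the current load and the terminal-caused indication identifier.

```python
def should_analyse(event, load_threshold):
    """Return True if the failure event should be passed to the determining unit."""
    # Claim 30: discard the event unless the reporting unit's load is below the threshold.
    if event.get("load", 0) >= load_threshold:
        return False
    # Claim 31: discard the event if the terminal device caused the failure.
    if event.get("terminal_caused", False):
        return False
    return True
```

Both filters remove failures that are expected under overload or that originate outside the monitored equipment, so that they do not inflate the failure indicator value.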
32. The monitoring device according to any one of claims 18 to 28, wherein
the first acquiring unit is configured to acquire a service processing failure event that is reported by the communication unit and forwarded by a child monitoring device, the service processing failure event being forwarded by the child monitoring device when the object entity whose service processing failed does not fall within the management scope of the child monitoring device.
33. The monitoring device according to any one of claims 21 and 23 to 28, wherein the monitoring device further comprises:
a second acquiring unit, configured to acquire a service processing success event reported by the communication unit;
a clearing unit, configured to reset the collected failure indicator value to zero after the second acquiring unit acquires the service processing success event.
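The clearing behaviour of claim 33 can be sketched as follows. The class name is hypothetical, and the sketch assumes that a success event identifies the same entity key as the failure events, which the claim does not state explicitly.

```python
from collections import defaultdict


class FailureIndicators:
    """Per-entity failure counts that are reset by a success event."""

    def __init__(self):
        self.values = defaultdict(int)

    def on_failure(self, entity):
        self.values[entity] += 1
        return self.values[entity]

    def on_success(self, entity):
        # A successful operation clears the accumulated indicator for that entity.
        self.values[entity] = 0
```

With this reset, only an unbroken run of failures can reach the failure threshold, which matches the "consecutive failures" indicator of claim 21.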
34. A communication system, comprising a communication unit, a child monitoring unit, and a parent monitoring unit, wherein
the child monitoring unit is configured to acquire a service processing failure event reported by the communication unit, determine, according to the address information of the object entity whose service processing failed that is carried in the service processing failure event, that the object entity does not fall within its own management scope, and report the service processing failure event to the parent monitoring unit;
the parent monitoring unit is configured to receive the service processing failure event reported by the child monitoring unit and determine, according to the address information of the object entity whose service processing failed that is carried in the service processing failure event, whether the object entity falls within its own management scope; if so, determine an abnormal entity according to the service processing failure event and a preset failure determination criterion and send a fault early-warning notification message used to instruct that fault detection be performed, the fault early-warning notification message comprising information about at least one of the determined abnormal entities; and if not, continue to report the service processing failure event to the parent monitoring unit of the parent monitoring unit.
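The hierarchical forwarding of claim 34 can be sketched as a chain of monitors, each with its own management scope. The class, its attributes, and the `target` key of the event are hypothetical; the sketch only shows which monitor ends up analysing an event, not the analysis itself.

```python
class Monitor:
    """A monitoring unit that analyses an event or forwards it to its parent."""

    def __init__(self, name, managed, parent=None):
        self.name = name
        self.managed = set(managed)  # addresses within this unit's management scope
        self.parent = parent

    def handle(self, event):
        """Return the name of the monitor that analyses the event, or None."""
        if event["target"] in self.managed:
            return self.name  # within scope: failure determination happens here
        if self.parent is not None:
            return self.parent.handle(event)  # out of scope: report upwards
        return None
```

For instance, with `root = Monitor("root", {"rnc-1"})` and `leaf = Monitor("leaf", {"cpu-1"}, parent=root)`, an event targeting `"rnc-1"` that arrives at `leaf` is passed up and handled by `root`.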
35. A communication system, comprising a first communication unit, a second communication unit, and a monitoring unit, wherein
the monitoring unit is configured to acquire a service processing failure event reported by the first communication unit and a service processing failure event reported by the second communication unit, and, when the address information of the object entity whose service processing failed that is carried in the event reported by the first communication unit is the address information of the second communication unit, and the address information of the object entity whose service processing failed that is carried in the event reported by the second communication unit is the address information of the first communication unit, not to perform failure analysis on the first communication unit and the second communication unit
PCT/CN2011/070390 2010-02-25 2011-01-19 Fault monitoring method, monitoring device, and communication system WO2011103778A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 201010115943 CN101800675B (en) 2010-02-25 2010-02-25 Failure monitoring method, monitoring equipment and communication system
CN201010115943.0 2010-02-25

Publications (1)

Publication Number Publication Date
WO2011103778A1 (en) 2011-09-01

Family

ID=42596179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/070390 WO2011103778A1 (en) 2010-02-25 2011-01-19 Fault monitoring method, monitoring device, and communication system

Country Status (2)

Country Link
CN (1) CN101800675B (en)
WO (1) WO2011103778A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104135739A (en) * 2014-07-14 2014-11-05 大唐移动通信设备有限公司 Selection method and device of user access single board
CN107888649A (en) * 2016-09-29 2018-04-06 三菱电机大楼技术服务株式会社 Failure detector

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101800675B (en) * 2010-02-25 2013-03-20 华为技术有限公司 Failure monitoring method, monitoring equipment and communication system
CN103167539B (en) * 2011-12-13 2015-12-02 华为技术有限公司 Fault handling method, equipment and system
CN102541613B (en) * 2011-12-27 2015-09-30 华为技术有限公司 For the method and apparatus of fault detect and process
CN102857365A (en) * 2012-06-07 2013-01-02 中兴通讯股份有限公司 Fault preventing and intelligent repairing method and device for network management system
CN103701625B (en) * 2012-09-28 2017-06-23 中国电信股份有限公司 Home gateway WLAN network failure locating methods and network management system
CN103002487B (en) * 2012-09-29 2015-09-16 深圳友讯达科技股份有限公司 A kind of fault repairing method and network node being applied to self-organizing network
CN103929334B (en) * 2013-01-11 2018-02-23 华为技术有限公司 Network Abnormal Notification Method and device
EP3001606B1 (en) 2013-06-27 2018-12-19 Huawei Technologies Co., Ltd. Fault processing method, device and system
CN104346246B (en) * 2013-08-05 2017-12-15 华为技术有限公司 Failure prediction method and device
CN106506185A (en) * 2015-09-08 2017-03-15 小米科技有限责任公司 The recognition methodss of hardware fault and device
CN107548089A (en) * 2016-06-28 2018-01-05 中兴通讯股份有限公司 The method and device that a kind of base station fault is repaired automatically
CN107547238B (en) * 2016-06-29 2020-11-24 阿里巴巴集团控股有限公司 Event monitoring system, method and device
CN107769943B (en) * 2016-08-17 2021-01-08 阿里巴巴集团控股有限公司 Method and equipment for switching main and standby clusters
CN107920360B (en) * 2016-10-08 2022-07-29 中兴通讯股份有限公司 Method, device and system for positioning network problem
CN108432219B (en) 2016-10-25 2020-09-11 华为技术有限公司 Recovery method for boot failure of terminal equipment and terminal equipment
CN110752939B (en) * 2018-07-24 2022-09-16 成都华为技术有限公司 Service process fault processing method, notification method and device
JP6724960B2 (en) * 2018-09-14 2020-07-15 株式会社安川電機 Resource monitoring system, resource monitoring method, and program
CN109214129B (en) * 2018-10-25 2023-06-09 中国运载火箭技术研究院 LVC simulation fault tolerance method based on virtual-real substitution under limited network condition
CN109634252B (en) * 2018-11-06 2020-06-26 华为技术有限公司 Root cause diagnosis method and device
CN110519098B (en) * 2019-08-30 2022-06-21 新华三信息安全技术有限公司 Method and device for processing abnormal single board
CN111427676B (en) * 2020-03-20 2024-03-29 达观数据有限公司 Robot flow automatic task processing method and device
CN111367769B (en) * 2020-03-30 2023-07-21 浙江大华技术股份有限公司 Application fault processing method and electronic equipment
CN111475386B (en) * 2020-06-05 2024-01-23 中国银行股份有限公司 Fault early warning method and related device
CN111782456B (en) * 2020-06-30 2022-09-30 深圳赛安特技术服务有限公司 Anomaly detection method, device, computer equipment and storage medium
CN113641524B (en) * 2021-08-09 2024-02-02 国家计算机网络与信息安全管理中心 Reset method, device and equipment for single board starting overtime and readable storage medium
CN115118575A (en) * 2022-06-23 2022-09-27 奇安信科技集团股份有限公司 Monitoring method, monitoring device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499933A (en) * 2008-02-03 2009-08-05 突触计算机系统(上海)有限公司 Method and apparatus for error control in network system
CN101500249A (en) * 2008-02-02 2009-08-05 中兴通讯股份有限公司 Implementing method for single board state detection
CN101800675A (en) * 2010-02-25 2010-08-11 华为技术有限公司 Failure monitoring method, monitoring equipment and communication system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100370762C (en) * 2006-03-08 2008-02-20 华为技术有限公司 Method device and system for processing warning message
CN1852054A (en) * 2006-05-29 2006-10-25 中兴通讯股份有限公司 Communication apparatus alarm processing method
CN100456711C (en) * 2007-02-09 2009-01-28 华为技术有限公司 Network element state detecting method and network management equipment
CN101621404B (en) * 2008-07-05 2012-07-18 中兴通讯股份有限公司 Method and system for layering processing of failure
CN101494572B (en) * 2009-03-10 2011-04-20 中国电信股份有限公司 Remote management method and system for equipment alarm information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104135739A (en) * 2014-07-14 2014-11-05 大唐移动通信设备有限公司 Selection method and device of user access single board
CN104135739B (en) * 2014-07-14 2017-12-05 大唐移动通信设备有限公司 A kind of user accesses system of selection and the device of veneer
CN107888649A (en) * 2016-09-29 2018-04-06 三菱电机大楼技术服务株式会社 Failure detector
CN107888649B (en) * 2016-09-29 2022-01-11 三菱电机大楼技术服务株式会社 Fault detection device

Also Published As

Publication number Publication date
CN101800675A (en) 2010-08-11
CN101800675B (en) 2013-03-20

Similar Documents

Publication Publication Date Title
WO2011103778A1 (en) Fault monitoring method, monitoring device, and communication system
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
WO2017050130A1 (en) Failure recovery method and device
US7213179B2 (en) Automated and embedded software reliability measurement and classification in network elements
CA2493525C (en) Method and apparatus for outage measurement
US20110185235A1 (en) Apparatus and method for abnormality detection
CN106789445B (en) Status polling method and system for network equipment in broadcast television network
CN102404141B (en) Method and device of alarm inhibition
WO2010025674A1 (en) Method and apparatus for monitoring operating status of node in short message service center
US20110122761A1 (en) KPI Driven High Availability Method and apparatus for UMTS radio access networks
US20050204214A1 (en) Distributed montoring in a telecommunications system
US20210105179A1 (en) Fault management method and related apparatus
CN107294767B (en) Live broadcast network transmission fault monitoring method and system
US9578524B2 (en) Method, device and program for validation of sleeping cells in a communications network
CN101989933A (en) Method and system for failure detection
US8775869B2 (en) Device and method for coordinating automatic protection switching operation and recovery operation
CN112994971A (en) Equipment offline monitoring method based on cloud server and related device
JP5780553B2 (en) Fault monitoring apparatus and fault monitoring method
US11652682B2 (en) Operations management apparatus, operations management system, and operations management method
CN103731315A (en) Server failure detecting method
US10277484B2 (en) Self organizing network event reporting
CN102195824B (en) Method, device and system for out-of-service alarm of data service system
CN111865667A (en) Network connectivity fault root cause positioning method and device
CN113612647B (en) Alarm processing method and device
WO2012051778A1 (en) System and method for realizing service switching of network element in multimedia message service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11746820; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 11746820; Country of ref document: EP; Kind code of ref document: A1)