WO2017107014A1 - Network sub-health diagnosis method and apparatus - Google Patents

Network sub-health diagnosis method and apparatus

Info

Publication number
WO2017107014A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
notification information
network element
communication
health state
Prior art date
Application number
PCT/CN2015/098107
Other languages
English (en)
French (fr)
Inventor
印杰
辛波
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201580083650.XA (CN108141374B)
Priority to PCT/CN2015/098107 (WO2017107014A1)
Publication of WO2017107014A1

Definitions

  • the embodiments of the present invention relate to the field of communications technologies, and in particular, to a network sub-health diagnosis method and apparatus.
  • IP: Internet Protocol
  • IMS: IP Multimedia Core Network Subsystem
  • The service layer's ability to tolerate packet loss is the most important means of dealing with communication sub-health at the service level.
  • The main way to handle lost packets is a reasonable retransmission mechanism.
  • The service side can detect the sub-health state of the network, but a fault in the underlying hardware may go undetected and cannot be repaired in time, which may still cause service damage.
  • The embodiments of the present invention provide a network sub-health diagnosis method and apparatus, to solve the prior-art problem that the service side detects the sub-health state of the network but the underlying hardware fault cannot be detected and repaired in time, which still leads to service damage.
  • an embodiment of the present invention provides a network sub-health diagnosis method, including:
  • a management and orchestration module (MANO) receives communication sub-health status notification information detected based on service transmission;
  • the communication sub-health status notification information includes at least the network element identifiers of the two network elements whose service communication is in a sub-health state;
  • the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, and saves the communication sub-health status notification information in a fault information database when no hardware failure is detected;
  • when the MANO determines that the number of pieces of communication sub-health status notification information stored in the fault information database is greater than a predetermined threshold, the MANO parses each piece of communication sub-health status notification information and determines, based on the parsing result, the network element in which a hardware failure has occurred.
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred includes:
  • determining the network element in which the communication failure has occurred according to the connection path topology between the network elements corresponding to the network element identifiers.
  • The method further includes:
  • the MANO repairs the detected hardware failure when the hardware fault detection detects a hardware failure.
  • Before the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the method further includes:
  • the MANO receives trigger information for triggering hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
  • The method further includes:
  • when the MANO determines that the number of pieces of communication sub-health status notification information stored in the fault information database is 1, the MANO determines that the network element in which the hardware failure has occurred is a virtual machine (VM).
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred includes:
  • if every piece of communication sub-health status notification information includes the same network element identifier, and the network element corresponding to that identifier is the same VM on the same Host, determining that the network element in which the hardware failure has occurred is that VM.
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred includes:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that the pairs of network elements do not have a network element located in the same Host, determining that the switch to which all the network elements whose service communication is in a sub-health state are connected has failed.
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred includes:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that each pair of network elements has a network element located in the same Host, but the network elements on that Host are different VMs, determining that the Host has failed.
  • After the network element in which the hardware failure has occurred is determined based on the parsing result, the method further includes:
  • deleting the communication sub-health status notification information saved in the fault information database.
  • an embodiment of the present invention provides a network sub-health diagnosis apparatus, including:
  • a receiving unit, configured to receive communication sub-health status notification information detected based on service transmission, where the communication sub-health status notification information includes at least the network element identifiers of the two network elements whose service communication is in a sub-health state;
  • a processing unit, configured to: perform hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, as indicated in the communication sub-health status notification information received by the receiving unit; save the communication sub-health status notification information in a fault information database when no hardware failure is detected; and, when it is determined that the number of pieces of communication sub-health status notification information stored in the fault information database is greater than a predetermined threshold, parse each piece of communication sub-health status notification information and determine, based on the parsing result, the network element in which a hardware failure has occurred.
  • When parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred, the processing unit is specifically configured to:
  • determine the network element in which the communication failure has occurred according to the connection path topology between the network elements corresponding to the network element identifiers.
  • the processing unit is further configured to:
  • The receiving unit is further configured to receive, before the processing unit performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, trigger information for triggering the processing unit to perform hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
  • The processing unit is further configured to determine, when the number of pieces of communication sub-health status notification information stored in the fault information database is 1, that the network element in which the hardware failure has occurred is a virtual machine (VM).
  • When parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred, the processing unit is specifically configured to:
  • if every piece of communication sub-health status notification information includes the same network element identifier, and the network element corresponding to that identifier is the same VM on the same Host, determine that the network element in which the hardware failure has occurred is that VM.
  • When parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred, the processing unit is specifically configured to:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that the pairs of network elements do not have a network element located in the same Host, determine that the switch to which all the network elements whose service communication is in a sub-health state are connected has failed.
  • When parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred, the processing unit is specifically configured to:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that each pair of network elements has a network element located in the same Host, but the network elements on that Host are different VMs, determine that the Host has failed.
  • The processing unit is further configured to delete the communication sub-health status notification information saved in the fault information database after the network element in which the hardware failure has occurred is determined based on the parsing result.
  • The management and orchestration module (MANO) receives communication sub-health status notification information detected based on service transmission; the communication sub-health status notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. The MANO then performs hardware fault detection on the hardware devices on the path corresponding to the two network elements, and saves the communication sub-health status notification information in the fault information database when no hardware failure is detected.
  • When the MANO determines that the number of pieces of communication sub-health status notification information stored in the fault information database is greater than a predetermined threshold, it parses each piece of communication sub-health status notification information and determines, based on the parsing result, the network element in which a hardware failure has occurred. Therefore, when communication sub-health occurs at the service level and hardware fault detection finds no hardware failure, the faulty network element is diagnosed from the sub-health status notification information in the fault information database, so that the faulty network element can be repaired in time.
  • FIG. 1 is a schematic diagram of a network application system for network sub-health diagnosis according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a network sub-health diagnosis method according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a path topology structure in an application scenario according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a path topology structure in another application scenario according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of another network sub-health diagnosis method according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a network sub-health diagnosis apparatus according to an embodiment of the present invention.
  • The embodiments of the present invention provide a network sub-health diagnosis method and apparatus, to solve the prior-art problem that the service side detects the sub-health state of the network but the underlying hardware fault cannot be detected and repaired in time, which still leads to service damage.
  • The method and the apparatus are based on the same inventive concept. Because the method and the apparatus solve the problem on similar principles, their implementations can be referred to each other, and repeated descriptions are omitted.
  • The network application system includes a host (Host), a switch (Switch), and a customer edge device (Customer Edge, CE).
  • Figure 1 is only an example and does not limit the number of devices.
  • the network application system includes multiple hosts and multiple switches.
  • The host includes a virtual machine (Virtual Machine, VM) and a physical network card (Physical Network Interface Card, pNIC). The virtual machine contains a virtual network card (Virtual Network Interface Card, vNIC).
  • The virtual machine and the physical NIC are connected by a virtual channel, that is, a virtual Ethernet bridge (Virtual Ethernet Bridge, VEB).
  • The virtual Ethernet bridge can be regarded as a virtual switch (Virtual Switch, vSwitch) and is responsible for packet forwarding between two virtual machines.
  • The network application system also includes a management and orchestration module (Management and Orchestration, MANO), which is responsible for allocating and scheduling system resources, managing the life cycle of virtual network functions, and so on.
  • Virtual network functions can be implemented by one virtual machine or multiple virtual machines. Multiple virtual machines can be virtual machines in one host or virtual machines in different hosts.
  • System resources include hardware resources as well as software resources.
  • The hardware resources include computing hardware, storage hardware, and network hardware.
  • The computing hardware can be a dedicated processor or a general-purpose processor that provides processing and computing functions. The storage hardware provides storage capability, which can be provided by the storage hardware itself (for example, a server's local memory) or through a network (for example, a server connects to a network storage device through a network).
  • The network hardware can be a switch, a router, and/or another network device. The network hardware implements communication between multiple devices, which are connected wirelessly or by wire.
  • network sub-health caused by the following hardware failure may occur:
  • the link failure between the Host and the Host causes the network to be sub-healthy.
  • the link between the Host and the Host may pass through switches, routers, and so on.
  • The embodiments of the present invention provide a network sub-health diagnosis method.
  • The method may be executed by a MANO, or by a mobile service platform (Mobile Service Platform, MSP).
  • the method includes:
  • the MANO receives the communication sub-health status notification information detected based on the service transmission.
  • the communication sub-health status notification information includes network element information of two network elements in which the service communication is in a sub-health state.
  • the network element information includes at least a network element identifier, and may include device information to which the network element belongs.
  • the NE information of the two NEs can be the ID of the VM and the Host ID of the VM.
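  • As an illustration only, the notification content described above could be modeled as follows. This is a minimal sketch: the class and field names (`NetworkElement`, `SubHealthNotification`, `ne_id`, `host_id`) are hypothetical and do not appear in the source, and the VM and Host values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkElement:
    """One endpoint of a service communication (hypothetical model)."""
    ne_id: str    # network element identifier, e.g. the ID of a VM
    host_id: str  # identifier of the Host on which the VM runs


@dataclass(frozen=True)
class SubHealthNotification:
    """Names the two network elements whose service communication is sub-healthy."""
    ne_a: NetworkElement
    ne_b: NetworkElement

    def ne_ids(self) -> frozenset:
        # The pair is unordered, so expose the identifiers as a set.
        return frozenset({self.ne_a.ne_id, self.ne_b.ne_id})


# Example values are assumptions chosen for illustration.
note = SubHealthNotification(
    NetworkElement("VM1", "Host1"),
    NetworkElement("VM2", "Host2"),
)
```

  • Making the classes immutable lets notifications be stored and compared safely once they are placed in a fault information database.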
  • The sender of the communication sub-health status notification information to the MANO may be a pipeline operating system (Operating System, OS).
  • the pipeline OS can continuously detect the status of the service communication and then report it to MANO or MSP periodically.
  • The MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, and saves the communication sub-health status notification information in the fault information database when no hardware failure is detected.
  • The communication sub-health status notification information also serves to trigger the MANO to perform hardware fault detection on the path corresponding to the two network elements whose service communication is in a sub-health state. That is, on receiving the communication sub-health status notification information, the MANO performs hardware fault detection on the hardware devices on that path.
  • The MANO can also be triggered by an external trigger device, which specifies the path to be detected.
  • the MANO receives trigger information for triggering hardware failure detection.
  • the trigger information carries path information of a path corresponding to two network elements in which the service communication is in a sub-health state; and then the MANO performs hardware fault detection on the hardware device on the path corresponding to the path information.
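  • The receive, detect, and store steps above can be sketched as follows. This is a hedged illustration, not the patented implementation: `handle_notification` is a hypothetical name, the hardware probe is passed in as a callable because the source does not say how detection is performed, and the fault information database is modeled as a plain list.

```python
from typing import Callable, List, Optional, Tuple

# A notification is modeled as the identifiers of the two affected network elements.
Notification = Tuple[str, str]


def handle_notification(
    note: Notification,
    detect_hardware_fault: Callable[[Notification], Optional[str]],
    fault_db: List[Notification],
) -> Optional[str]:
    """Run hardware fault detection for the path between the two elements.

    Returns the faulty device name if the probe finds one (the caller would
    then repair it). Otherwise the notification is saved in the fault
    information database and None is returned.
    """
    faulty_device = detect_hardware_fault(note)
    if faulty_device is not None:
        return faulty_device
    fault_db.append(note)
    return None


db: List[Notification] = []
# Assumed probes: the first finds nothing, the second reports a made-up device name.
found = handle_notification(("VM1", "VM2"), lambda n: None, db)
repaired = handle_notification(("VM1", "VM3"), lambda n: "pNIC-1", db)
```

  • Only notifications for which the probe finds nothing accumulate in the database, which is what later makes the count-based diagnosis meaningful.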
  • When the MANO determines that the number of pieces of communication sub-health status notification information stored in the fault information database is greater than a predetermined threshold, it parses each piece of communication sub-health status notification information and determines, based on the parsing result, the network element in which a hardware failure has occurred.
  • the communication sub-health status notification information is parsed, and the network element in which the communication failure occurs is determined based on the parsing result obtained by the parsing, and may be implemented as follows:
  • the network element information of the two network elements that are in the sub-health state for the service communication included in each communication sub-health status notification information is determined, and then the network element in which the communication failure occurs is determined according to the connection path topology between the network elements.
  • connection path topology between the network elements is pre-stored in MANO or MSP.
  • When a hardware failure is detected, the detected hardware failure is repaired.
  • When the number of pieces of communication sub-health status notification information stored in the fault information database is 1, the network element in which the hardware failure has occurred is determined to be a VM.
  • When there is only one piece of communication sub-health status notification information, no similar situation has occurred before, and only a VM failure can be determined.
  • A VM failure is determined because the pipeline OS has detected the failure, and the pipeline OS can detect failures between VMs through service transmission.
  • A VM failure may be caused by a failure of the VM's vNIC.
  • When the MANO determines that the VM is faulty, the MANO performs VM self-healing according to a preset rule.
  • VM self-healing mainly includes VM restart, migration, and reconstruction. The VM can be migrated to another suitable host depending on the configuration of the VM.
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred may be implemented as follows:
  • if every piece of communication sub-health status notification information includes the same network element identifier, and the network element corresponding to that identifier is the same VM on the same Host, the network element in which the hardware failure has occurred is determined to be that VM.
  • If the network elements at the two ends of the service communication in every piece of communication sub-health status information include the same VM on the same Host, all the communication sub-health is caused by the failure of that VM.
  • For example, the network elements at the two ends of the first service communication include VM1, the network elements at the two ends of the second service communication are VM1 and VM3, and the network elements at the two ends of the third service communication are VM1 and VM4. This indicates that VM1 has failed and cannot communicate normally.
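  • The "same VM in every notification" check can be sketched as a set intersection over the stored notifications. The function name `common_vm` is hypothetical, and the first pair in the example uses VM2 as an assumed peer of VM1, since the source does not name it here.

```python
from typing import Iterable, Optional, Set, Tuple


def common_vm(notifications: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the VM that appears in every notification, if exactly one exists."""
    shared: Optional[Set[str]] = None
    for vm_a, vm_b in notifications:
        pair = {vm_a, vm_b}
        # Keep only the VMs seen in every notification so far.
        shared = pair if shared is None else shared & pair
    if shared is not None and len(shared) == 1:
        return next(iter(shared))
    return None


# VM2 in the first pair is an assumption made for this example.
suspect = common_vm([("VM1", "VM2"), ("VM1", "VM3"), ("VM1", "VM4")])
```

  • If no single VM is common to all notifications, the function returns `None`, and the diagnosis would have to consider the Host or the switch instead.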
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred may be implemented as follows:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that the pairs of network elements do not have a network element located in the same Host, it is determined that the switch to which all the network elements whose service communication is in a sub-health state are connected has failed.
  • For example, the communication network includes three VMs: VM1, VM2, and VM3. VM1 and VM2 are connected through a switch, VM1 and VM3 are connected through a switch, and VM2 and VM3 are also connected through a switch.
  • The first piece of communication sub-health status information indicates that service communication between VM1 and VM2 is abnormal, the second indicates that service communication between VM1 and VM3 is abnormal, and the third indicates that service communication between VM3 and VM2 is abnormal. It can therefore be determined that the switch has failed, which produced the three pieces of communication sub-health status information.
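  • One way to express the switch case is to check whether any Host is common to all stored notifications; if none is, the shared switch is the remaining suspect. This is a sketch under assumptions: `hosts_common_to_all` is a hypothetical helper, and the VM-to-Host mapping is assumed (one VM per Host), because this example in the text does not state where the VMs run.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def hosts_common_to_all(
    notifications: Iterable[Tuple[str, str]],
    vm_to_host: Dict[str, str],
) -> Set[str]:
    """Return the Hosts that are involved in every notification."""
    shared: Optional[Set[str]] = None
    for vm_a, vm_b in notifications:
        hosts = {vm_to_host[vm_a], vm_to_host[vm_b]}
        shared = hosts if shared is None else shared & hosts
    return shared or set()


# Assumed placement: each VM runs on its own Host.
vm_to_host = {"VM1": "Host1", "VM2": "Host2", "VM3": "Host3"}
notes = [("VM1", "VM2"), ("VM1", "VM3"), ("VM3", "VM2")]

# No Host appears in all three notifications, so the switch is suspected.
switch_suspected = not hosts_common_to_all(notes, vm_to_host)
```

  • With only the first two notifications, Host1 would be common to both, and the switch would not be the first suspect.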
  • Parsing each piece of communication sub-health status notification information and determining, based on the parsing result, the network element in which a hardware failure has occurred includes:
  • if it is determined, according to the network element identifiers of the two network elements included in each piece of communication sub-health status notification information, that each pair of network elements has a network element located in the same Host, but the network elements on that Host are different VMs, the Host is determined to have failed.
  • A Host failure may be a failure of the virtual channel from the vNIC to the pNIC, or it may be a physical NIC failure.
  • VM self-healing can be performed according to the configuration of the VM. If self-healing does not resolve the fault, it can be further determined whether a physical network card or similar component has failed.
  • After the network element in which the hardware failure has occurred is determined, the method further includes:
  • deleting the communication sub-health status notification information saved in the fault information database.
  • the communication network includes three hosts, Host1, Host2, and Host3.
  • VM1 and VM4 are installed in Host1, VM2 is installed in Host2, and VM3 is installed in Host3.
  • Host 1 is connected to the P11 interface of the switch through the P1 interface.
  • Host 2 is connected to the P12 interface of the switch through the P2 interface.
  • Host 3 is connected to the P13 interface of the switch through the P3 interface.
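  • The example topology above can be captured in a few lookup tables, from which the hardware path between two VMs follows directly. This is a sketch only: the table and function names are hypothetical, and the rule that traffic between two VMs on the same Host does not reach the switch is an assumption based on the VEB forwarding packets between virtual machines.

```python
from typing import Dict, List

# Placement and port wiring as described in the example topology.
HOST_OF: Dict[str, str] = {"VM1": "Host1", "VM4": "Host1", "VM2": "Host2", "VM3": "Host3"}
HOST_PORT: Dict[str, str] = {"Host1": "P1", "Host2": "P2", "Host3": "P3"}
SWITCH_PORT: Dict[str, str] = {"Host1": "P11", "Host2": "P12", "Host3": "P13"}


def path_between(vm_a: str, vm_b: str) -> List[str]:
    """List the elements a service flow between two VMs passes through."""
    host_a, host_b = HOST_OF[vm_a], HOST_OF[vm_b]
    if host_a == host_b:
        # Assumption: intra-Host traffic is forwarded inside the Host.
        return [vm_a, host_a, vm_b]
    return [
        vm_a, host_a, HOST_PORT[host_a], SWITCH_PORT[host_a],
        "Switch",
        SWITCH_PORT[host_b], HOST_PORT[host_b], host_b, vm_b,
    ]
```

  • A path such as `path_between("VM1", "VM2")` gives the devices on which hardware fault detection would be run for that pair.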
  • S501 The MANO receives the communication sub-health status notification information sent by the pipeline OS. Go to S502.
  • The MANO periodically receives the communication sub-health status notification information sent by the pipeline OS.
  • The communication sub-health status notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state.
  • The sub-health status notification information is used to trigger the MANO to perform hardware fault detection on the hardware devices on the path corresponding to the two network elements in the sub-health state.
  • S502 After receiving the communication sub-health status notification information sent by the pipeline OS, the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements in the sub-health state. Go to S503.
  • S503 MANO determines whether a hardware failure is detected. If yes, execute S504; if not, execute S505.
  • S504 The MANO handles the hardware failure according to a pre-stored rule. After the hardware failure is handled, the communication sub-health status notification information for the path can also be cleared.
  • S505 The MANO stores the received communication sub-health status notification information in the fault information database. Execute S506.
  • S506 The MANO determines whether the number of pieces of communication sub-health status notification information in the fault information database is greater than 1. If yes, execute S508; if not, execute S507.
  • S507 MANO determines that the VM is faulty, and then performs VM self-healing according to the VM configuration.
  • VM self-healing mainly includes VM restart, migration, and reconstruction. The VM can be migrated to a suitable host based on the VM's configuration.
  • S508 MANO determines whether the two network elements in each piece of communication sub-health status notification information in the fault information database include a network element located in the same Host. If not, execute S509; if yes, execute S510.
  • S509 MANO diagnoses that the switch has failed. For example, the fault information database includes three pieces of communication sub-health status information: the first indicates that service communication between VM1 and VM2 is abnormal, the second indicates that service communication between VM1 and VM3 is abnormal, and the third indicates that service communication between VM3 and VM2 is abnormal. According to the topology shown in FIG. 4, all three paths pass through the switch, so it can be determined that the switch has failed.
  • S510 The MANO determines whether the network elements located in the same Host, among the two network elements in each piece of communication sub-health status notification information in the fault information database, are the same VM. If yes, execute S511; if not, execute S512.
  • S511 MANO diagnoses that the VM is faulty.
  • For example, the fault information database includes two pieces of communication sub-health status information. The network elements in the sub-health state in the first are VM1 and VM2, and those in the second are VM1 and VM3. Whichever VM communicates with VM1, communication is abnormal, so VM1 is determined to have failed.
  • VM self-healing mainly includes VM restart, migration, and rebuild. The VM can also be migrated to a suitable host according to the VM configuration.
  • The fault information database can then be emptied, or it can be retained. If communication sub-health status information is received after the fault is handled and is stored in the fault information database, and a VM fault is diagnosed again, other VM self-healing modes can be considered. For example, the self-healing modes can be assigned priorities: if the VM is diagnosed as faulty twice, the self-healing mode used the second time has a lower priority than the one used the first time.
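  • The priority-based choice of self-healing mode can be sketched as a simple escalation list. The ordering below (restart first, then migration, then rebuild) is an assumption made for illustration; the text only says that a repeated diagnosis uses a lower-priority mode than the previous one. The names `SELF_HEALING_MODES` and `next_self_healing_mode` are hypothetical.

```python
from typing import List

# Assumed priority order, highest priority first.
SELF_HEALING_MODES: List[str] = ["restart", "migrate", "rebuild"]


def next_self_healing_mode(previous_attempts: int) -> str:
    """Pick the mode for a VM that has already been healed `previous_attempts` times.

    Each repeated diagnosis moves one step down the priority list and stays
    on the last mode once the list is exhausted.
    """
    if previous_attempts < 0:
        raise ValueError("previous_attempts must be non-negative")
    index = min(previous_attempts, len(SELF_HEALING_MODES) - 1)
    return SELF_HEALING_MODES[index]
```

  • Keeping the fault information database instead of emptying it is what would allow the number of previous attempts to be tracked per VM.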
  • S512 MANO diagnoses that the Host has failed. Specifically, according to the configurations of all the VMs running on the Host, a suitable host is selected for migration and reconstruction.
  • For example, the fault information database includes two pieces of communication sub-health status information. The network elements at the two ends of the first service communication are VM1 and VM2, and the network elements at the two ends of the second service communication are VM4 and VM3. According to the network topology, both VM4 and VM1 belong to Host1, so it is determined that Host1 has failed.
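  • The decision steps S506 to S512 can be combined into one small function. This is a simplified sketch, not the patented implementation: `diagnose` and its return format are hypothetical, the threshold is fixed at 1 as in this flow, and the tests reuse the example VM placement (VM1 and VM4 on Host1, VM2 on Host2, VM3 on Host3).

```python
from typing import Dict, List, Optional, Set, Tuple


def diagnose(
    notifications: List[Tuple[str, str]],
    vm_to_host: Dict[str, str],
) -> Tuple[str, Optional[object]]:
    """Return (kind, detail) where kind is "VM", "Switch" or "Host"."""
    if not notifications:
        raise ValueError("fault information database is empty")
    if len(notifications) == 1:
        # S506 -> S507: a single notification only supports a VM diagnosis;
        # the text does not say which of the two VMs, so no identifier is given.
        return ("VM", None)

    shared_hosts: Optional[Set[str]] = None
    shared_vms: Optional[Set[str]] = None
    for vm_a, vm_b in notifications:
        hosts = {vm_to_host[vm_a], vm_to_host[vm_b]}
        vms = {vm_a, vm_b}
        shared_hosts = hosts if shared_hosts is None else shared_hosts & hosts
        shared_vms = vms if shared_vms is None else shared_vms & vms

    if not shared_hosts:
        # S508 -> S509: no Host is common to all notifications.
        return ("Switch", None)
    if shared_vms is not None and len(shared_vms) == 1:
        # S510 -> S511: the same VM appears in every notification.
        return ("VM", next(iter(shared_vms)))
    # S510 -> S512: a common Host, but different VMs on it.
    return ("Host", sorted(shared_hosts))


PLACEMENT = {"VM1": "Host1", "VM4": "Host1", "VM2": "Host2", "VM3": "Host3"}
```

  • For instance, `diagnose([("VM1", "VM2"), ("VM4", "VM3")], PLACEMENT)` points at Host1, matching the example above, while the three-way VM1/VM2/VM3 case points at the switch.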
  • The management and orchestration module (MANO) receives communication sub-health status notification information detected based on service transmission; the communication sub-health status notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. The MANO then performs hardware fault detection on the hardware devices on the path corresponding to the two network elements, and saves the communication sub-health status notification information in the fault information database when no hardware failure is detected.
  • When the MANO determines that the number of pieces of communication sub-health status notification information stored in the fault information database is greater than a predetermined threshold, it parses each piece of communication sub-health status notification information and determines, based on the parsing result, the network element in which a hardware failure has occurred. Therefore, when communication sub-health occurs at the service level and hardware fault detection finds no hardware failure, the faulty network element is diagnosed from the sub-health status notification information in the fault information database, so that the faulty network element can be repaired in time.
  • The embodiments of the present invention further provide a network sub-health diagnosis device, which may be a MANO or an MSP.
  • the device includes:
  • a receiving unit 601, configured to receive communication sub-health state notification information detected based on service transmission, where the notification information includes at least the network element identifiers of the two network elements whose service communication is in a sub-health state; and
  • a processing unit 602, configured to: perform hardware fault detection on the hardware devices on the path corresponding to the two network elements indicated in the notification information received by the receiving unit 601; when no hardware fault is detected, save the notification information in the fault information database; and, when the number of pieces of communication sub-health state notification information saved in the fault information database is determined to be greater than a predetermined threshold, parse each piece and determine the network element with a hardware fault based on the parsing result.
  • Optionally, when parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result, the processing unit 602 is configured to: determine the network element identifiers of the two network elements whose service communication is in a sub-health state, as included in each piece of notification information; and determine the network element with a communication fault according to the connection path topology between the network elements corresponding to those identifiers.
  • Optionally, the processing unit 602 is further configured to repair the detected hardware fault when the hardware fault detection detects a hardware fault.
  • Before hardware fault detection is performed on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the receiving unit is further configured to receive trigger information used to trigger the processing unit to perform hardware fault detection. The trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
  • Optionally, the processing unit 602 is further configured to: when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determine that the network element with a hardware fault is a virtual machine (VM).
  • Optionally, when parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result, the processing unit 602 is configured to: determine, from the network element identifiers included in each piece of notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same Host; and then determine that the network element with a hardware fault is that VM.
  • Optionally, when parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result, the processing unit 602 is configured to: determine, from the network element identifiers included in each piece of notification information, that it is not the case that every piece has one of its two network elements located on the same Host; and then determine that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
  • Optionally, when parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result, the processing unit 602 is configured to: determine, from the network element identifiers included in each piece of notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs; and then determine that the Host has failed.
  • Optionally, the processing unit 602 is further configured to: after determining the network element with a hardware fault based on the parsing result, delete the communication sub-health state notification information saved in the fault information database.
  • The network sub-health diagnosis apparatus may further include a storage unit 603, configured to store the fault information database and, optionally, the programs to be executed by the processing unit and the receiving unit. The fault information database may alternatively be stored in an external memory.
  • The division into units in the embodiments of the present invention is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. In addition, the functional units in the embodiments of this application may be integrated into one processor, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
  • When the integrated unit is implemented in the form of hardware, the physical hardware corresponding to the receiving unit 601 is a transceiver, and the physical hardware corresponding to the processing unit 602 is a processor. The processor may be a central processing unit (CPU), a digital processing unit, or the like.
  • The storage unit in the network sub-health diagnosis apparatus may be a memory, configured to store the program executed by the processor. The processor is configured to execute the program stored in the memory, specifically to carry out the solutions performed by the processing unit 602 and the receiving unit 601.
  • The memory may be a volatile memory, such as a random-access memory (RAM). The memory may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). Alternatively, the memory may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may also be a combination of the foregoing memories.
  • The network sub-health diagnosis apparatus provided by the embodiments of the present invention receives communication sub-health state notification information detected based on service transmission, where the notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. It then performs hardware fault detection on the hardware devices on the path corresponding to those two network elements and, when no hardware fault is detected, saves the notification information in the fault information database. When the number of pieces of notification information saved in the fault information database is determined to be greater than a predetermined threshold, the apparatus parses each piece and determines the network element with a hardware fault based on the parsing result. In this way, when communication sub-health occurs at the service level but hardware fault detection finds nothing, the faulty network element can still be diagnosed from the notification information in the fault information database and repaired in time.
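  • A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
  • The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. Each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks, can be implemented by computer program instructions. These instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by that processor produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.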

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

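Embodiments of the present invention provide a network sub-health diagnosis method and apparatus, to solve the problem that the service side detects a network sub-health state but the underlying hardware cannot detect it, so that the hardware fault cannot be repaired in time and services are still impaired. The method includes: a management and orchestration module receives communication sub-health state notification information detected based on service transmission, where the notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state; when hardware fault detection performed on the hardware devices on the path corresponding to those two network elements detects no fault, the notification information is saved in a fault information database; then, when the number of pieces of communication sub-health state notification information saved in the fault information database is determined to be greater than a predetermined threshold, each piece is parsed, and the network element with a hardware fault is determined based on the parsing result.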

Description

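Network sub-health diagnosis method and apparatus
Technical Field
The embodiments of the present invention relate to the field of communications technologies, and in particular, to a network sub-health diagnosis method and apparatus.
Background
In a communication system, for example in the Internet Protocol (IP) multimedia subsystem (IP Multimedia Core Network Subsystem, IMS), a fault in the service bearer network between network elements puts the network between those network elements into a sub-health state. Alternatively, a network element itself enters a sub-health state because of insufficient memory, an internal communication fault, or a similar cause. Both the sub-health state of the network between network elements and the sub-health state of a network element impair services. To avoid service impairment in a sub-health state, the sub-health state of the network must therefore be detected promptly and accurately.
The ability of the service layer to withstand packet loss is the main means of coping with communication sub-health at the service level, and the main way to withstand packet loss is a reasonable retransmission mechanism. In some cases, however, the sub-health is caused by physical hardware: the service side detects the network sub-health state, but the underlying hardware cannot detect it, so the fault cannot be repaired in time and services are still impaired.
Summary
Embodiments of the present invention provide a network sub-health diagnosis method and apparatus, to solve the prior-art problem that the service side detects a network sub-health state but the underlying hardware cannot detect it, so that the hardware fault cannot be repaired in time and services are still impaired.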
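According to a first aspect, an embodiment of the present invention provides a network sub-health diagnosis method, including:
receiving, by a management and orchestration module (MANO), communication sub-health state notification information detected based on service transmission, where the communication sub-health state notification information includes at least the network element identifiers of two network elements whose service communication is in a sub-health state;
performing, by the MANO, hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, and when no hardware fault is detected, saving the communication sub-health state notification information in a fault information database; and
when the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, parsing each piece of communication sub-health state notification information, and determining, based on the parsing result, the network element with a hardware fault.
With reference to the first aspect, in a first possible implementation of the first aspect, the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault includes:
determining the network element identifiers of the two network elements whose service communication is in a sub-health state, as included in each piece of communication sub-health state notification information; and
determining the network element with a communication fault according to the connection path topology between the network elements corresponding to the network element identifiers.
With reference to the first aspect, in a second possible implementation of the first aspect, the method further includes:
when the MANO detects a hardware fault through the hardware fault detection, repairing the detected hardware fault.
With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, before the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the method further includes:
receiving, by the MANO, trigger information used to trigger hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the method further includes:
when the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determining that the network element with a hardware fault is a virtual machine (VM).
With reference to the first aspect, in a fifth possible implementation of the first aspect, the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault includes:
determining, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same host (Host), and then determining that the network element with a hardware fault is that VM.
With reference to the first aspect, in a sixth possible implementation of the first aspect, the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault includes:
determining, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that it is not the case that every piece has one of its two network elements located on the same Host, and then determining that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
With reference to the first aspect, in a seventh possible implementation of the first aspect, the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault includes:
determining, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs, and then determining that the Host has failed.
With reference to any one of the fifth to seventh possible implementations of the first aspect, in an eighth possible implementation of the first aspect, after the network element with a hardware fault is determined based on the parsing result, the method further includes:
deleting the communication sub-health state notification information saved in the fault information database.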
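According to a second aspect, an embodiment of the present invention provides a network sub-health diagnosis apparatus, including:
a receiving unit, configured to receive communication sub-health state notification information detected based on service transmission, where the communication sub-health state notification information includes at least the network element identifiers of two network elements whose service communication is in a sub-health state; and
a processing unit, configured to: perform hardware fault detection on the hardware devices on the path corresponding to the two network elements indicated in the communication sub-health state notification information received by the receiving unit; when no hardware fault is detected, save the communication sub-health state notification information in a fault information database; and, when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, parse each piece of communication sub-health state notification information and determine, based on the parsing result, the network element with a hardware fault.
With reference to the second aspect, in a first possible implementation of the second aspect, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
determine the network element identifiers of the two network elements whose service communication is in a sub-health state, as included in each piece of communication sub-health state notification information; and
determine the network element with a communication fault according to the connection path topology between the network elements corresponding to the network element identifiers.
With reference to the second aspect, in a second possible implementation of the second aspect, the processing unit is further configured to:
repair the detected hardware fault when the hardware fault detection detects a hardware fault.
With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, before hardware fault detection is performed on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the receiving unit is further configured to receive trigger information used to trigger the processing unit to perform hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the processing unit is further configured to: when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determine that the network element with a hardware fault is a virtual machine (VM).
With reference to the second aspect, in a fifth possible implementation of the second aspect, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same host (Host), and then determine that the network element with a hardware fault is that VM.
With reference to the second aspect, in a sixth possible implementation of the second aspect, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that it is not the case that every piece has one of its two network elements located on the same Host, and then determine that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
With reference to the second aspect, in a seventh possible implementation of the second aspect, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs, and then determine that the Host has failed.
With reference to any one of the fifth to seventh possible implementations of the second aspect, in an eighth possible implementation of the second aspect, the processing unit is further configured to: after determining the network element with a hardware fault based on the parsing result, delete the communication sub-health state notification information saved in the fault information database.
In the solution provided by the embodiments of the present invention, the management and orchestration module (MANO) receives communication sub-health state notification information detected based on service transmission, where the notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. The MANO then performs hardware fault detection on the hardware devices on the path corresponding to those two network elements and, when no hardware fault is detected, saves the notification information in the fault information database. When the MANO determines that the number of pieces of notification information saved in the fault information database is greater than a predetermined threshold, it parses each piece and determines the network element with a hardware fault based on the parsing result. In this way, when communication sub-health occurs at the service level but hardware fault detection finds nothing, the faulty network element can still be diagnosed from the notification information in the fault information database and repaired in time.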
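Brief Description of the Drawings
FIG. 1 is a schematic diagram of a network application system for network sub-health diagnosis according to an embodiment of the present invention;
FIG. 2 is a flowchart of a network sub-health diagnosis method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a path topology in one application scenario according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a path topology in another application scenario according to an embodiment of the present invention;
FIG. 5 is a flowchart of another network sub-health diagnosis method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a network sub-health diagnosis apparatus according to an embodiment of the present invention.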
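Detailed Description
Embodiments of the present invention provide a network sub-health diagnosis method and apparatus, to solve the prior-art problem that the service side detects a network sub-health state but the underlying hardware cannot detect it, so that the hardware fault cannot be repaired in time and services are still impaired. The method and the apparatus are based on the same inventive concept. Because they solve the problem on similar principles, the implementations of the apparatus and the method may refer to each other, and repeated content is not described again.
The embodiments of the present invention mainly address communication sub-health of a network element and between network elements. As shown in FIG. 1, the network application system includes a host (Host), a switch (Switch), and a customer edge device (Customer Edge, CE). FIG. 1 is merely an example and does not limit the number of devices. For example, the network application system may include multiple Hosts and multiple switches.
A host includes virtual machines (Virtual Machine, VM) and a physical network interface card (pNIC). Each virtual machine has a corresponding virtual network interface card (vNIC). A virtual machine is connected to the physical NIC through a virtual channel, namely a virtual Ethernet bridge (VEB). The virtual Ethernet bridge can be regarded as a virtual switch (vSwitch) that forwards packets between two virtual machines.
The network application system further includes a management and orchestration module (Management and Orchestration, MANO), which is responsible for allocating and scheduling system resources, managing the life cycle of virtual network functions, and so on. A virtual network function may be implemented by one virtual machine or by multiple virtual machines, and the multiple virtual machines may reside on one host or on different hosts. System resources include hardware resources and software resources. Hardware resources include computing hardware, storage hardware, and network hardware. Computing hardware may be a dedicated processor or a general-purpose processor that provides processing and computing functions. Storage hardware provides storage capacity, which may be supplied by the storage hardware itself (for example, the local memory of a server) or over a network (for example, a server connected to a network storage device). Network hardware may be a switch, a router, and/or another network device. It enables communication between multiple devices, which are connected wirelessly or by wire.
In the foregoing network application system, network sub-health may arise from the following hardware faults:
1. Network sub-health caused by a fault in the vNIC of a VM.
2. Network sub-health caused by a fault in the virtual channel from the vNIC to the pNIC.
3. Network sub-health caused by a fault in the physical NIC.
4. Network sub-health caused by a fault in the link between Hosts. The link between Hosts may pass through switches, routers, and so on.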
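To solve the network sub-health problems that may arise in the foregoing network application system, an embodiment of the present invention provides a network sub-health diagnosis method, shown in FIG. 2. The method may be executed by the MANO or by a mobile service platform (Mobile Service Platform, MSP). The method includes:
S201: The MANO receives communication sub-health state notification information detected based on service transmission.
The communication sub-health state notification information includes network element information of the two network elements whose service communication is in a sub-health state. The network element information includes at least a network element identifier and may further include information about the device to which the network element belongs, and so on.
For example, if packet transmission between two virtual machines fails, the network element information of the two network elements may be the identifiers of the virtual machines, the identifiers of the hosts (Host) to which the virtual machines belong, and similar information.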
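The notification content described above can be modelled as a small record type. The following is only a sketch under assumptions: the class and field names (`NetworkElement`, `SubHealthNotification`, `ne_id`, `host_id`) are hypothetical and do not come from the patent, which only requires that each notification carry the identifiers of the two affected network elements and, optionally, the device each one belongs to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkElement:
    """One end of a degraded service communication (hypothetical model)."""
    ne_id: str                      # network element identifier (mandatory per the text)
    host_id: Optional[str] = None   # owning device, e.g. the Host of a VM (optional)


@dataclass(frozen=True)
class SubHealthNotification:
    """A communication sub-health state notification naming two network elements."""
    first: NetworkElement
    second: NetworkElement

    def ne_ids(self) -> frozenset:
        """Return the identifiers of both ends as an unordered pair."""
        return frozenset({self.first.ne_id, self.second.ne_id})


# Example: packet transmission between VM1 (on Host1) and VM2 (on Host2) is degraded.
note = SubHealthNotification(
    NetworkElement("VM1", "Host1"),
    NetworkElement("VM2", "Host2"),
)
```

Keeping the pair unordered reflects that the text does not distinguish a sender from a receiver in the notification.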
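In the embodiments of the present invention, the entity that sends the communication sub-health state notification information to the MANO may be a pipeline operating system (pipeline OS). The pipeline OS can continuously monitor the service communication state and periodically report it to the MANO or MSP.
S202: The MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state and, when no hardware fault is detected, saves the communication sub-health state notification information in the fault information database.
The communication sub-health state notification information also triggers the MANO to perform hardware fault detection on the path corresponding to the two network elements. That is, upon receiving the notification information, the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state.
Optionally, the MANO may also be triggered by an external triggering device, which specifies the path to be checked. Specifically, before performing hardware fault detection on the hardware devices on the path corresponding to the two network elements, the MANO receives trigger information used to trigger hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state. The MANO then performs hardware fault detection on the hardware devices on the path indicated by the path information.
S203: When the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, it parses each piece and determines the network element with a hardware fault based on the parsing result.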
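Steps S201 to S203 can be sketched as a single handler. This is an illustrative outline only: the function name `handle_notification`, the list used as the fault information database, and the `detect_hw_fault` callback are assumptions, since the patent does not specify how the hardware check is carried out or how records are stored.

```python
def handle_notification(note, fault_db, detect_hw_fault, threshold):
    """Process one notification along the lines of S201-S203.

    note            -- the received notification (any object)
    fault_db        -- a list standing in for the fault information database
    detect_hw_fault -- hypothetical callback; returns True if the hardware
                       check on the path finds a fault
    threshold       -- the predetermined threshold from S203

    Returns "repair" if the hardware check found a fault, "analyze" if the
    number of stored notifications now exceeds the threshold, and "stored"
    otherwise.
    """
    if detect_hw_fault(note):          # S202: check hardware on the path
        return "repair"
    fault_db.append(note)              # S202: nothing found, keep the record
    if len(fault_db) > threshold:      # S203: strictly greater than the threshold
        return "analyze"
    return "stored"
```

The comparison is strict because the text says the count must be greater than the threshold before the stored notifications are parsed.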
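Optionally, parsing each piece of communication sub-health state notification information and determining the network element with a communication fault based on the parsing result may be implemented as follows:
Determine the network element information of the two network elements whose service communication is in a sub-health state, as included in each piece of notification information, and then determine the network element with a communication fault according to the connection path topology between the network elements.
The connection path topology between the network elements is stored in advance in the MANO or MSP.
Optionally, when the hardware fault detection detects a hardware fault, the detected hardware fault is repaired.
Optionally, when the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, it determines that the hardware fault is a VM fault.
When there is only one piece of notification information, no similar situation has occurred before, so the fault can only be judged to be a VM fault. A VM fault is assumed because the pipeline OS has already detected a fault, and the pipeline OS can detect faults between VMs through service transmission. The VM fault may specifically be a fault in the vNIC of the VM. When the MANO determines that a VM has failed, it performs VM self-healing according to preset rules. VM self-healing mainly includes restarting, migrating, and rebuilding the VM. The VM may be migrated to another suitable host according to the VM configuration.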
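Optionally, parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result may be implemented as follows:
Determine, from the network element identifiers of the two network elements included in each piece of notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same Host. The network element with a hardware fault is then determined to be that VM.
Because one of the two ends of the service communication in every piece of communication sub-health state information is the same VM on the same Host, all of the communication sub-health is attributed to a fault in that VM. Suppose there are three pieces of communication sub-health state information: the two ends in the first piece are VM1 and VM2, in the second piece VM1 and VM3, and in the third piece VM1 and VM4. This indicates that VM1 has failed and cannot communicate normally.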
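The "same VM in every record" rule amounts to intersecting the endpoint pairs of all stored records. The sketch below assumes each record is a simple pair of VM identifiers; the function name `common_vm` is hypothetical.

```python
def common_vm(records):
    """Return the set of VM identifiers that appear in every record.

    records -- iterable of (vm_a, vm_b) pairs, one per stored notification
    """
    pairs = [set(pair) for pair in records]
    if not pairs:
        return set()
    return set.intersection(*pairs)


# The three-record example from the text: VM1 appears in every pair.
records = [("VM1", "VM2"), ("VM1", "VM3"), ("VM1", "VM4")]
suspects = common_vm(records)
```

With the records above, the only identifier shared by all pairs is VM1, which matches the conclusion drawn in the text.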
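Optionally, parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result may be implemented as follows:
Determine, from the network element identifiers of the two network elements included in each piece of notification information, that it is not the case that every piece has one of its two network elements located on the same Host. The same switch traversed by the two network elements in all pieces of communication sub-health state information is then determined to have failed.
For example, as shown in FIG. 3, the communication network includes three VMs: VM1, VM2, and VM3. VM1 and VM2 are connected through a switch, VM1 and VM3 are connected through the switch, and VM2 and VM3 are also connected through the switch. Suppose there are three pieces of communication sub-health state information: the first indicates that service communication between VM1 and VM2 is abnormal, the second that service communication between VM1 and VM3 is abnormal, and the third that service communication between VM3 and VM2 is abnormal. It can then be determined that the switch has failed and thereby produced these three pieces of communication sub-health state information.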
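The switch rule can be sketched as a check that no single Host is involved in every record. The placement of VMs on hosts below (`HostA`, `HostB`, `HostC`) is an assumption made for illustration, because FIG. 3 as described only names the VMs and the switch; the function name `is_switch_suspect` is likewise hypothetical.

```python
# Assumed placement: each VM of FIG. 3 on its own host (not stated in the text).
HOST_OF = {"VM1": "HostA", "VM2": "HostB", "VM3": "HostC"}


def is_switch_suspect(records, host_of):
    """Return True when several records exist and no Host is common to all of them."""
    host_sets = [{host_of[a], host_of[b]} for a, b in records]
    return len(records) > 1 and not set.intersection(*host_sets)


records = [("VM1", "VM2"), ("VM1", "VM3"), ("VM3", "VM2")]
switch_fault = is_switch_suspect(records, HOST_OF)
```

When a Host is common to every record, the function returns False and the VM or Host rules apply instead.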
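Optionally, parsing each piece of communication sub-health state notification information and determining the network element with a hardware fault based on the parsing result includes:
Determine, from the network element identifiers of the two network elements included in each piece of notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs. The Host is then determined to have failed. A Host fault may be a fault in the virtual channel from the vNIC to the pNIC, or it may be a fault in the physical NIC.
VM self-healing may first be performed according to the VM configuration. If that does not resolve the problem, it can be further determined whether the physical NIC or another component has failed.
Optionally, after the network element with a hardware fault is determined based on the parsing result, the method further includes:
deleting the communication sub-health state notification information saved in the fault information database.
The embodiments of the present invention are described in detail below with reference to a specific application scenario.
As shown in FIG. 4, the communication network includes three Hosts: Host1, Host2, and Host3. VM1 and VM4 are installed on Host1, VM2 is installed on Host2, and VM3 is installed on Host3. Host1 connects through its P1 interface to the P11 interface of the switch, Host2 connects through its P2 interface to the P12 interface of the switch, and Host3 connects through its P3 interface to the P13 interface of the switch.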
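The FIG. 4 topology can be written down as plain lookup tables, from which the hardware hops on the path between two VMs follow. The data layout and the `path` helper are illustrative assumptions; the patent only states that the connection path topology is stored in advance in the MANO or MSP, not in what form.

```python
# Host -> VMs installed on it, as described for FIG. 4.
HOST_VMS = {"Host1": ["VM1", "VM4"], "Host2": ["VM2"], "Host3": ["VM3"]}

# Host -> (host-side interface, switch-side interface), as described for FIG. 4.
UPLINKS = {"Host1": ("P1", "P11"), "Host2": ("P2", "P12"), "Host3": ("P3", "P13")}

# Reverse lookup: VM -> Host.
HOST_OF = {vm: host for host, vms in HOST_VMS.items() for vm in vms}


def path(vm_a, vm_b):
    """List the hardware hops between two VMs in this single-switch topology."""
    ha, hb = HOST_OF[vm_a], HOST_OF[vm_b]
    if ha == hb:
        return [ha]  # same Host: traffic stays on the Host's virtual switch
    return [
        ha, f"{ha}:{UPLINKS[ha][0]}", f"Switch:{UPLINKS[ha][1]}",
        f"Switch:{UPLINKS[hb][1]}", f"{hb}:{UPLINKS[hb][0]}", hb,
    ]
```

A hardware check like the one in S502 could then walk the returned hops; how each hop is probed is outside what the text specifies.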
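The specific network sub-health diagnosis procedure is then as shown in FIG. 5. The following description uses the MANO as an example.
S501: The MANO receives communication sub-health state notification information sent by the pipeline OS. Go to S502.
The MANO periodically receives the communication sub-health state notification information sent by the pipeline OS.
The communication sub-health state notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. The notification information is used to trigger the MANO to perform hardware fault detection on the hardware devices on the path corresponding to those two network elements.
S502: After receiving the communication sub-health state notification information sent by the pipeline OS, the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements in a sub-health state. Go to S503.
S503: The MANO determines whether a hardware fault is detected. If yes, go to S504; if no, go to S505.
S504: The MANO handles the hardware fault according to pre-stored rules. After the hardware fault is handled, the communication sub-health state notification information for that path may also be cleared.
S505: The MANO stores the received communication sub-health state notification information in the fault information database. Go to S506.
S506: The MANO determines whether the number of pieces of communication sub-health state notification information in the fault information database is greater than 1. If yes, go to S508; if no, go to S507.
S507: The MANO determines that a VM has failed. The MANO then performs self-healing according to the VM configuration.
When there is only one piece of information, no similar sub-health state has occurred before, so the fault can only be judged to be a VM fault, and VM self-healing is performed. A VM fault is assumed because the pipeline OS has already detected a fault, and the pipeline OS can detect faults between VMs. VM self-healing mainly includes restarting, migrating, and rebuilding the VM. The VM may be migrated to a suitable host according to the VM configuration.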
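S508: The MANO determines whether, in every piece of communication sub-health state notification information in the fault information database, one of the two network elements whose service communication is in a sub-health state is located on the same Host. If no, go to S509; if yes, go to S510.
S509: The MANO diagnoses a switch fault and tentatively restarts the switch. It then clears all communication sub-health state information in the fault information database.
For example, the fault information database includes three pieces of communication sub-health state information: the first indicates that service communication between VM1 and VM2 is abnormal, the second that service communication between VM1 and VM3 is abnormal, and the third that service communication between VM3 and VM2 is abnormal. According to the topology shown in FIG. 4, all three paths pass through the switch, so it can be determined that the switch has failed.
S510: The MANO determines whether, in every piece of communication sub-health state notification information in the fault information database, one of the two network elements whose service communication is in a sub-health state is the same VM. If yes, go to S511; if no, go to S512.
S511: The MANO diagnoses that this VM has failed.
For example, the fault information database includes two pieces of communication sub-health state information: the two network elements in the first piece are VM1 and VM2, and those in the second piece are VM1 and VM3. Communication is abnormal regardless of which VM communicates with VM1, so VM1 is determined to have failed.
VM self-healing is then performed according to the VM configuration. VM self-healing mainly includes restarting, migrating, and rebuilding the VM; the VM may also be migrated to a suitable host according to its configuration.
After the fault is handled, the fault information database may be cleared. It may also be retained. If, after the fault is handled, further communication sub-health state information is received and saved in the fault information database and the diagnosis is still a fault in the same VM, a different VM self-healing method may be considered. For example, priorities may be assigned to the self-healing methods: if both diagnoses point to the same VM, the self-healing method used the second time has a lower priority than the one used the first time.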
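The idea of stepping down to a lower-priority self-healing method on a repeated diagnosis can be sketched as follows. The priority order used here (restart, then migrate, then rebuild) is purely an assumption for illustration; the text names the three methods but does not rank them. The names `HEAL_ORDER` and `next_heal_action` are hypothetical.

```python
# Assumed priority order, highest first; the text does not specify a ranking.
HEAL_ORDER = ["restart", "migrate", "rebuild"]


def next_heal_action(history, vm_id):
    """Pick a self-healing action for vm_id, one priority lower per repeat diagnosis.

    history -- dict mapping a VM identifier to how many times it was already healed
    """
    attempts = history.get(vm_id, 0)
    # Stay on the last (lowest-priority) method once the list is exhausted.
    action = HEAL_ORDER[min(attempts, len(HEAL_ORDER) - 1)]
    history[vm_id] = attempts + 1
    return action
```

Calling the function twice for the same VM yields a higher-priority action first and a lower-priority one second, which is the behaviour the text suggests when the fault information database is retained between diagnoses.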
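S512: The MANO diagnoses that the Host has failed. Specifically, based on the configurations of all VMs running on that host, a suitable host may be selected to which they are migrated or on which they are rebuilt.
For example, the fault information database includes two pieces of communication sub-health state information: the two ends of the service communication in the first piece are VM1 and VM2, and those in the second piece are VM4 and VM3. According to the network topology shown in FIG. 4, VM4 and VM1 both belong to Host1, so Host1 is determined to have failed.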
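The decision steps S506 to S512 can be combined into one classification function. This is a sketch under assumptions: records are modelled as pairs of VM identifiers, the VM-to-Host mapping follows the FIG. 4 description, and the function name `diagnose` and its return format are hypothetical. When several candidates remain, the sketch simply returns the first in sorted order, which the patent does not prescribe.

```python
# VM -> Host mapping following the FIG. 4 description.
HOST_OF = {"VM1": "Host1", "VM4": "Host1", "VM2": "Host2", "VM3": "Host3"}


def diagnose(records, host_of):
    """Classify the fault along the lines of S506-S512. Returns (kind, element)."""
    if not records:
        return ("none", None)
    if len(records) == 1:
        return ("vm", None)            # S507: the text does not say which VM
    host_sets = [{host_of[a], host_of[b]} for a, b in records]
    shared_hosts = set.intersection(*host_sets)
    if not shared_hosts:
        return ("switch", None)        # S509: no Host common to all records
    shared_vms = set.intersection(*(set(r) for r in records))
    if shared_vms:
        return ("vm", sorted(shared_vms)[0])     # S511: same VM in every record
    return ("host", sorted(shared_hosts)[0])     # S512: same Host, different VMs


result = diagnose([("VM1", "VM2"), ("VM4", "VM3")], HOST_OF)
```

The three worked examples in the text (switch, VM1, Host1) correspond to the three non-trivial branches of this function.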
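In the solution provided by the embodiments of the present invention, the management and orchestration module (MANO) receives communication sub-health state notification information detected based on service transmission, where the notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. The MANO then performs hardware fault detection on the hardware devices on the path corresponding to those two network elements and, when no hardware fault is detected, saves the notification information in the fault information database. When the MANO determines that the number of pieces of notification information saved in the fault information database is greater than a predetermined threshold, it parses each piece and determines the network element with a hardware fault based on the parsing result. In this way, when communication sub-health occurs at the service level but hardware fault detection finds nothing, the faulty network element can still be diagnosed from the notification information in the fault information database and repaired in time.
Based on the same inventive concept as the foregoing method embodiments, an embodiment of the present invention further provides a network sub-health diagnosis apparatus, which may be a MANO or an MSP. As shown in FIG. 6, the apparatus includes:
a receiving unit 601, configured to receive communication sub-health state notification information detected based on service transmission, where the communication sub-health state notification information includes at least the network element identifiers of the two network elements whose service communication is in a sub-health state; and
a processing unit 602, configured to: perform hardware fault detection on the hardware devices on the path corresponding to the two network elements indicated in the communication sub-health state notification information received by the receiving unit 601; when no hardware fault is detected, save the communication sub-health state notification information in the fault information database; and, when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, parse each piece and determine, based on the parsing result, the network element with a hardware fault.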
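Optionally, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit 602 is configured to:
determine the network element identifiers of the two network elements whose service communication is in a sub-health state, as included in each piece of communication sub-health state notification information; and
determine the network element with a communication fault according to the connection path topology between the network elements corresponding to the network element identifiers.
Optionally, the processing unit 602 is further configured to:
repair the detected hardware fault when the hardware fault detection detects a hardware fault.
Before hardware fault detection is performed on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the receiving unit is further configured to receive trigger information used to trigger the processing unit to perform hardware fault detection, where the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
Optionally, the processing unit 602 is further configured to: when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determine that the network element with a hardware fault is a virtual machine (VM).
Optionally, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit 602 is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same host (Host), and then determine that the network element with a hardware fault is that VM.
Optionally, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit 602 is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that it is not the case that every piece has one of its two network elements located on the same Host, and then determine that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
Optionally, when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit 602 is configured to:
determine, from the network element identifiers of the two network elements included in each piece of communication sub-health state notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs, and then determine that the Host has failed.
Optionally, the processing unit 602 is further configured to: after determining the network element with a hardware fault based on the parsing result, delete the communication sub-health state notification information saved in the fault information database.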
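The network sub-health diagnosis apparatus provided by the embodiments of the present invention may further include a storage unit 603, configured to store the fault information database and, optionally, the programs to be executed by the processing unit and the receiving unit. The fault information database may alternatively be stored in an external memory.
The division into units in the embodiments of the present invention is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. In addition, the functional units in the embodiments of this application may be integrated into one processor, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
When the integrated unit is implemented in the form of hardware, the physical hardware corresponding to the receiving unit 601 is a transceiver, and the physical hardware corresponding to the processing unit 602 is a processor. The processor may be a central processing unit (CPU), a digital processing unit, or the like.
The storage unit in the network sub-health diagnosis apparatus may be a memory, configured to store the program executed by the processor. The processor is configured to execute the program stored in the memory, specifically to carry out the solutions performed by the processing unit 602 and the receiving unit 601.
The memory may be a volatile memory, such as a random-access memory (RAM). The memory may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). Alternatively, the memory may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may also be a combination of the foregoing memories.
The network sub-health diagnosis apparatus provided by the embodiments of the present invention receives communication sub-health state notification information detected based on service transmission, where the notification information includes the network element identifiers of the two network elements whose service communication is in a sub-health state. It then performs hardware fault detection on the hardware devices on the path corresponding to those two network elements and, when no hardware fault is detected, saves the notification information in the fault information database. When the number of pieces of notification information saved in the fault information database is determined to be greater than a predetermined threshold, the apparatus parses each piece and determines the network element with a hardware fault based on the parsing result. In this way, when communication sub-health occurs at the service level but hardware fault detection finds nothing, the faulty network element can still be diagnosed from the notification information in the fault information database and repaired in time.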
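A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. Each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks, can be implemented by computer program instructions. These instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by that processor produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, a person skilled in the art can make further changes and modifications to these embodiments once the basic inventive concept is known. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, a person skilled in the art can make various changes and variations to the embodiments of the present invention without departing from their spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to cover them.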

Claims (18)

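  1. A network sub-health diagnosis method, comprising:
    receiving, by a management and orchestration module (MANO), communication sub-health state notification information detected based on service transmission, wherein the communication sub-health state notification information comprises at least the network element identifiers of two network elements whose service communication is in a sub-health state;
    performing, by the MANO, hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, and when no hardware fault is detected, saving the communication sub-health state notification information in a fault information database; and
    when the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, parsing each piece of communication sub-health state notification information, and determining, based on the parsing result, the network element with a hardware fault.
  2. The method according to claim 1, wherein the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault comprises:
    determining the network element identifiers of the two network elements whose service communication is in a sub-health state, as comprised in each piece of communication sub-health state notification information; and
    determining the network element with a communication fault according to the connection path topology between the network elements corresponding to the network element identifiers.
  3. The method according to claim 1, further comprising:
    when the MANO detects a hardware fault through the hardware fault detection, repairing the detected hardware fault.
  4. The method according to any one of claims 1 to 3, wherein before the MANO performs hardware fault detection on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the method further comprises:
    receiving, by the MANO, trigger information used to trigger hardware fault detection, wherein the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
  5. The method according to any one of claims 1 to 4, further comprising:
    when the MANO determines that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determining that the network element with a hardware fault is a virtual machine (VM).
  6. The method according to claim 1, wherein the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault comprises:
    determining, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same host (Host), and then determining that the network element with a hardware fault is that VM.
  7. The method according to claim 1, wherein the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault comprises:
    determining, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that it is not the case that every piece has one of its two network elements located on the same Host, and then determining that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
  8. The method according to claim 1, wherein the parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault comprises:
    determining, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs, and then determining that the Host has failed.
  9. The method according to any one of claims 6 to 8, wherein after the network element with a hardware fault is determined based on the parsing result, the method further comprises:
    deleting the communication sub-health state notification information saved in the fault information database.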
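  10. A network sub-health diagnosis apparatus, comprising:
    a receiving unit, configured to receive communication sub-health state notification information detected based on service transmission, wherein the communication sub-health state notification information comprises at least the network element identifiers of two network elements whose service communication is in a sub-health state; and
    a processing unit, configured to: perform hardware fault detection on the hardware devices on the path corresponding to the two network elements indicated in the communication sub-health state notification information received by the receiving unit; when no hardware fault is detected, save the communication sub-health state notification information in a fault information database; and, when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is greater than a predetermined threshold, parse each piece of communication sub-health state notification information and determine, based on the parsing result, the network element with a hardware fault.
  11. The apparatus according to claim 10, wherein when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
    determine the network element identifiers of the two network elements whose service communication is in a sub-health state, as comprised in each piece of communication sub-health state notification information; and
    determine the network element with a communication fault according to the connection path topology between the network elements corresponding to the network element identifiers.
  12. The apparatus according to claim 10, wherein the processing unit is further configured to:
    repair the detected hardware fault when the hardware fault detection detects a hardware fault.
  13. The apparatus according to any one of claims 10 to 12, wherein before hardware fault detection is performed on the hardware devices on the path corresponding to the two network elements whose service communication is in a sub-health state, the receiving unit is further configured to receive trigger information used to trigger the processing unit to perform hardware fault detection, wherein the trigger information carries path information of the path corresponding to the two network elements whose service communication is in a sub-health state.
  14. The apparatus according to any one of claims 10 to 13, wherein the processing unit is further configured to: when determining that the number of pieces of communication sub-health state notification information saved in the fault information database is 1, determine that the network element with a hardware fault is a virtual machine (VM).
  15. The apparatus according to claim 10, wherein when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
    determine, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that every piece contains the same network element identifier and that the network element corresponding to that identifier is the same VM on the same host (Host), and then determine that the network element with a hardware fault is that VM.
  16. The apparatus according to claim 10, wherein when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
    determine, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that it is not the case that every piece has one of its two network elements located on the same Host, and then determine that the same switch traversed by the two network elements in all pieces of communication sub-health state information has failed.
  17. The apparatus according to claim 10, wherein when parsing each piece of communication sub-health state notification information and determining, based on the parsing result, the network element with a hardware fault, the processing unit is configured to:
    determine, from the network element identifiers of the two network elements comprised in each piece of communication sub-health state notification information, that every piece has one of its two network elements located on the same Host, but that the network elements located on that Host are different VMs, and then determine that the Host has failed.
  18. The apparatus according to any one of claims 15 to 17, wherein the processing unit is further configured to: after determining the network element with a hardware fault based on the parsing result, delete the communication sub-health state notification information saved in the fault information database.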
PCT/CN2015/098107 2015-12-21 2015-12-21 Network sub-health diagnosis method and apparatus WO2017107014A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580083650.XA CN108141374B (zh) 2015-12-21 2015-12-21 Network sub-health diagnosis method and apparatus
PCT/CN2015/098107 WO2017107014A1 (zh) 2015-12-21 2015-12-21 Network sub-health diagnosis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/098107 WO2017107014A1 (zh) 2015-12-21 2015-12-21 Network sub-health diagnosis method and apparatus

Publications (1)

Publication Number Publication Date
WO2017107014A1 (zh) 2017-06-29

Family

ID=59088772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/098107 WO2017107014A1 (zh) 2015-12-21 2015-12-21 Network sub-health diagnosis method and apparatus

Country Status (2)

Country Link
CN (1) CN108141374B (zh)
WO (1) WO2017107014A1 (zh)

Also Published As

Publication number Publication date
CN108141374B (zh) 2020-12-18
CN108141374A (zh) 2018-06-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15911014

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15911014

Country of ref document: EP

Kind code of ref document: A1