WO2022083540A1 - Method, apparatus, and system for determining a fault recovery plan, and computer storage medium - Google Patents

Method, apparatus, and system for determining a fault recovery plan, and computer storage medium

Info

Publication number
WO2022083540A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
root cause
target
failure
faults
Prior art date
Application number
PCT/CN2021/124377
Other languages
English (en)
French (fr)
Inventor
谢于明
李野
李举厂
高云鹏
侯延祥
杨延城
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP21881956.3A (published as EP4221004A4)
Publication of WO2022083540A1
Priority to US18/302,629 (published as US20230318906A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H04L41/0627 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time by acting on the notification or alarm source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Definitions

  • The present application relates to the field of network technologies, and in particular, to a method, apparatus, and system for determining a fault recovery plan, and a computer storage medium.
  • A network fault is a situation in which a network cannot provide normal services, or provides poor service quality, because of hardware problems, software problems, or network attacks. With traditional operation and maintenance methods, recovering from a network fault relies on manual judgment followed by a recovery plan drawn from experience, so both the degree of automation and the efficiency are low.
  • A series of expert rules is usually formulated based on expert experience and fault cases from the live network.
  • The expert rules include faults and the fault recovery plans corresponding to those faults.
  • The management device determines the fault recovery plan corresponding to a fault based on the formulated expert rules and then executes that plan to repair the network fault, which shortens the time it takes for the network device to change from the fault state to the working state.
  • The time it takes for a network device to change from a fault state to a working state may also be called the mean time to recovery (MTTR).
  • However, because the expert rules usually specify the fault recovery plan corresponding to each fault in a hard-coded manner, only the faults contained in the expert rules can be processed, so the range of faults that can be handled is limited.
  • The present application provides a method, apparatus, and system for determining a fault recovery plan, and a computer storage medium, which can solve the problem that the range of faults that can currently be handled based on expert rules is limited.
  • In a first aspect, a method for determining a fault recovery plan is provided. The method includes: the control device acquires, from among a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in the network satisfy a similarity condition. The control device obtains the fault recovery plan corresponding to the similar known fault. The control device determines the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault.
  • The control device can determine the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to a similar known fault of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault whose fault root cause and the fault root cause of the target fault satisfy the similarity condition can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • In addition, because one fault in the network may cause multiple cascading faults, and the fault root cause reflects the underlying cause of the fault, the similar known faults of a fault in the network are searched for based on the fault root cause. The similar known faults found in this way match the fault more closely.
  • The higher the degree of matching with the fault, the more likely it is that the fault recovery plan corresponding to the similar known fault is applicable to the fault, which makes the determined fault recovery plan more reliable.
  • The fault root cause is represented by a fault root cause feature.
  • The fault root cause feature includes a fault root cause object and a fault root cause event, where the fault root cause event is the abnormal event that causes the fault, and the fault root cause object is used to indicate the fault root cause network entity.
  • The fault root cause network entity is the network entity to which the fault root cause event belongs.
  • The fault root cause object can be understood as the ontology of the fault root cause network entity, and the fault root cause network entity can be understood as an instantiation of the fault root cause object.
  • Types of fault root cause objects include devices, interfaces, protocols, or services.
  • When the fault root cause network entity is a physical interface, the fault root cause feature further includes one or more of: an interface flapping indication of the fault root cause network entity, an interface suspended-animation indication of the fault root cause network entity, the packet sending and receiving status of the fault root cause network entity, the interface protocol state of the fault root cause network entity, or the physical interface state of the device where the fault root cause network entity is located.
  • When the fault root cause network entity is a BGP peer, the fault root cause feature further includes a BGP route flapping indication of the fault root cause network entity and/or the physical interface state of the device where the fault root cause network entity is located.
  • The fault root cause feature also includes the physical interface state of the device where the fault root cause network entity is located.
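As an illustration only, the fault root cause feature described above could be modelled as a small record. The following Python sketch is not taken from the application: the class name, the field names, and the example event and attribute strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# Hypothetical set of object types, following the four types named in the text.
OBJECT_TYPES = {"device", "interface", "protocol", "service"}


@dataclass
class FaultRootCauseFeature:
    """Illustrative record for a fault root cause feature (all names are assumptions)."""
    root_cause_object: str   # type of the root cause object, e.g. "interface"
    root_cause_event: str    # abnormal event that caused the fault
    # Entity-specific indications, e.g. flapping flags or interface states.
    extra: Dict[str, Union[bool, str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.root_cause_object not in OBJECT_TYPES:
            raise ValueError(f"unknown root cause object type: {self.root_cause_object}")

    def flatten(self) -> Dict[str, Union[bool, str]]:
        """Return one flat mapping that a similarity function could consume."""
        flat: Dict[str, Union[bool, str]] = {
            "root_cause_object": self.root_cause_object,
            "root_cause_event": self.root_cause_event,
        }
        flat.update(self.extra)
        return flat


# Example: a physical interface as the fault root cause network entity.
interface_feature = FaultRootCauseFeature(
    root_cause_object="interface",
    root_cause_event="interface_down",  # hypothetical event name
    extra={"interface_flapping": True, "interface_protocol_state": "down"},
)
```

Keeping the entity-specific indications in a separate mapping is one way to let physical interfaces and BGP peers carry different attributes without a separate class for each.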
  • The implementation process in which the control device acquires, from among multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the control device obtains the fault root cause features of the multiple known faults. For each known fault in the multiple known faults, the control device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause feature of the target fault and the fault root cause feature of the known fault. According to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the control device determines, among the multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • The implementation process in which the control device determines the similar known faults according to these similarities includes: the control device determines, as similar known faults, the known faults among the multiple known faults whose fault root cause similarity to the target fault is higher than a similarity threshold.
  • Alternatively, the control device ranks the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, and uses the n known faults whose fault root causes have the highest similarity to the fault root cause of the target fault as the similar known faults of the target fault, where n is a positive integer.
  • In this implementation, a known fault similar to the target fault can always be found among the known faults, so a fault recovery plan corresponding to the target fault can always be determined, and the range of faults that can be handled is large.
  • Alternatively, the control device uses, as the similar known faults of the target fault, the known faults among the multiple known faults whose fault root cause similarity to the target fault is higher than the similarity threshold and that rank in the top m when ordered by that similarity, where m is a positive integer.
  • In this implementation, only known faults whose fault root cause similarity to the target fault is higher than the similarity threshold are retained, which ensures the accuracy of the determined similar faults, and limiting the number of determined similar faults reduces the amount of subsequent calculation.
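The three selection rules above (threshold only, top n only, and threshold combined with top m) can be sketched in one small function. This is a minimal Python sketch under assumptions: the function name, the parameter names, and the example fault identifiers and scores are hypothetical, and the similarities are assumed to have been computed already.

```python
from typing import Dict, List, Optional


def select_similar_faults(
    similarities: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[str]:
    """Pick similar known faults from {known_fault_id: similarity to the target fault}."""
    # Rank known faults from most to least similar.
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    # Optional rule 1: keep only faults whose similarity is higher than the threshold.
    if threshold is not None:
        ranked = [(fault_id, score) for fault_id, score in ranked if score > threshold]
    # Optional rule 2: keep at most the top_k highest-ranked faults.
    if top_k is not None:
        ranked = ranked[:top_k]
    return [fault_id for fault_id, _ in ranked]


# Hypothetical similarities between the target fault and three known faults.
sims = {"KF-1": 0.9, "KF-2": 0.4, "KF-3": 0.75}
by_threshold = select_similar_faults(sims, threshold=0.7)       # threshold only
by_rank = select_similar_faults(sims, top_k=1)                  # top n only
combined = select_similar_faults(sims, threshold=0.7, top_k=1)  # threshold and top m
```

With only `top_k` set, the result is non-empty whenever any known fault exists, which mirrors the remark that a similar known fault can always be found under the ranking rule.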
  • The implementation process in which the control device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault, according to the fault root cause feature of the target fault and the fault root cause feature of the known fault, includes:
  • The control device inputs the fault root cause feature of the target fault and the fault root cause feature of the known fault into a similarity model to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault. The similarity model is trained based on the fault root cause features of multiple sample faults. The sample faults are marked with category labels, and sample faults marked with the same category label correspond to the same fault recovery plan.
  • The control device trains the similarity model using the fault root cause features of the multiple sample faults.
  • The control device may also input the fault root cause features of multiple sample fault pairs into the similarity model separately, to obtain the similarity between the fault root causes of each sample fault pair output by the similarity model.
  • The multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs.
  • A first-type sample fault pair includes two sample faults marked with the same category label, and a second-type sample fault pair includes two sample faults marked with different category labels.
  • The control device determines the similarity threshold according to the similarities between the fault root causes of the multiple sample fault pairs.
  • Alternatively, the control device receives the similarity model and/or the similarity threshold from the training device.
  • The training device is the upper-layer device of the control device.
  • The training device trains the similarity model and/or determines the similarity threshold centrally, so that all control devices managed by the training device can share the similarity model and/or the similarity threshold, which reduces the amount of calculation on the control devices.
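The text does not say how the similarity threshold is derived from the sample fault pairs. One plausible approach, shown here purely as an assumption, is to pick the candidate threshold that best separates first-type pairs (same category label) from second-type pairs (different labels). The function name and the example similarities are hypothetical.

```python
from typing import List, Tuple


def choose_similarity_threshold(pairs: List[Tuple[float, bool]]) -> float:
    """Choose a threshold from (similarity, same_label) sample fault pairs.

    Assumed rule (not from the source): a pair is predicted "same label" when its
    similarity is strictly higher than the threshold, and the threshold that
    classifies the most pairs correctly wins. Ties keep the lowest candidate.
    """
    best_threshold, best_correct = 0.0, -1
    for candidate in sorted({similarity for similarity, _ in pairs}):
        correct = sum((similarity > candidate) == same_label
                      for similarity, same_label in pairs)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical model outputs: two first-type pairs and two second-type pairs.
sample_pairs = [(0.9, True), (0.8, True), (0.6, False), (0.3, False)]
threshold = choose_similarity_threshold(sample_pairs)
```

In this example the chosen threshold is 0.6, because both same-label pairs score above it and both different-label pairs do not.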
  • The control device may also receive a fault root cause feature set from the training device, where the fault root cause feature set includes the correspondences between multiple known faults and their fault root cause features.
  • The implementation process in which the control device acquires the fault root cause features of the multiple known faults includes: the control device acquires the fault root cause features of the multiple known faults based on the fault root cause feature set.
  • The control device can also send the identifier of the target fault and the fault root cause feature of the target fault to the training device, so that the training device adds the correspondence between the target fault and its fault root cause feature to the fault root cause feature set to obtain an updated fault root cause feature set.
  • The control device can report the fault root cause features of live-network faults to the training device in real time, so that the training device can automatically update the fault root cause feature set, thereby expanding the range of faults that can be handled.
  • the control device acquires abnormal events generated in the network.
  • the control device determines the fault root cause characteristics of the fault based on the abnormal events generated in the network.
  • the target failure can be any failure in the network.
  • The implementation process in which the control device acquires, from among the multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the control device receives, from the analysis device, similar fault information corresponding to the target fault. The similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
  • The similar fault information further includes the fault root cause feature of the target fault.
  • In this implementation, the analysis device determines the similar known faults of the target fault among the multiple known faults and sends the information about the similar known faults to the control device.
  • the analysis device determines the similar known faults of the target fault
  • The implementation process in which the control device determines the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault includes: the control device evaluates, based on the network configuration of the network, whether the fault recovery plan corresponding to the similar known fault is feasible. The network configuration includes the networking topology and/or device data, and the device data includes one or more of management plane data, data plane data, or control plane data. The control device determines one or more of the feasible fault recovery plans as the fault recovery plan corresponding to the target fault.
  • The implementation process in which the control device determines one or more of the feasible fault recovery plans as the fault recovery plan corresponding to the target fault includes: in response to multiple fault recovery plans being feasible, the control device evaluates, based on the network configuration of the network, the impact of each of the multiple fault recovery plans on the services running on the network. The control device determines the plan that has the least impact on the services running on the network, among the multiple fault recovery plans, as the fault recovery plan corresponding to the target fault.
  • By selecting, among the fault recovery plans corresponding to the similar known faults of the target fault, the plan that has the least impact on the services running on the network, the control device can not only resolve the target fault but also minimize the impact on the services running on the network, which improves the reliability and stability of network operation.
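The two-step selection above (filter by feasibility, then minimize service impact) can be sketched as follows. This is a hedged Python sketch: the plan records, the `links_up` network configuration, and the feasibility and impact functions are hypothetical stand-ins, since the text does not specify how either evaluation is carried out.

```python
from typing import Any, Callable, Dict, List, Optional

Plan = Dict[str, Any]  # hypothetical plan record


def choose_recovery_plan(
    candidate_plans: List[Plan],
    is_feasible: Callable[[Plan], bool],
    service_impact: Callable[[Plan], float],
) -> Optional[Plan]:
    """Keep the feasible plans, then return the one with the least service impact."""
    feasible = [plan for plan in candidate_plans if is_feasible(plan)]
    if not feasible:
        return None  # no plan of a similar known fault fits the current network
    return min(feasible, key=service_impact)


# Hypothetical network configuration: the links currently usable in the topology.
network_config = {"links_up": {"A-B", "A-C"}}

# Hypothetical plans taken from similar known faults.
plans = [
    {"name": "switch_to_backup_link", "requires_link": "A-C", "affected_services": 1},
    {"name": "restart_device", "requires_link": None, "affected_services": 5},
    {"name": "reroute_via_D", "requires_link": "A-D", "affected_services": 0},
]

chosen = choose_recovery_plan(
    plans,
    # Assumed feasibility rule: any link the plan needs must be up.
    is_feasible=lambda p: p["requires_link"] is None
    or p["requires_link"] in network_config["links_up"],
    # Assumed impact measure: number of services the plan would disturb.
    service_impact=lambda p: p["affected_services"],
)
```

Here `reroute_via_D` would have the lowest impact but is filtered out because link A-D is not available, so the feasible plan with the next-lowest impact is chosen.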
  • The control device may further determine, based on the target fault and the fault recovery plan corresponding to the target fault, the target network device in the network that is to execute the plan.
  • The control device sends a plan execution instruction to the target network device. The plan execution instruction instructs the target network device to execute the fault recovery plan corresponding to the target fault, and it includes that fault recovery plan.
  • The control device may also distribute the fault recovery plan to the relevant network devices in the network that need to execute it, so as to realize end-to-end fault recovery.
  • The control device may also send a plan execution rollback instruction to the target network device, where the plan execution rollback instruction instructs the target network device to restore the state it was in before executing the fault recovery plan corresponding to the target fault.
  • In response to receiving a rollback trigger instruction, the control device sends the plan execution rollback instruction to the target network device.
  • Because the control device can send a plan execution rollback instruction that instructs the target network device to restore the state it was in before executing the fault recovery plan, a state rollback function is realized for the network device.
  • When a network device has executed an unreasonable fault recovery plan, this function can quickly restore the network device to its original state and improve the reliability of network operation.
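The execute-then-rollback behaviour described above can be illustrated with a toy device that snapshots its state before applying a plan. This Python sketch is an assumption-laden illustration, not a real device API: the class, its method names, and the configuration keys are all hypothetical.

```python
import copy
from typing import Dict, Optional


class SimulatedNetworkDevice:
    """Toy stand-in for a target network device (not a real device interface)."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.config = dict(config)
        self._snapshot: Optional[Dict[str, str]] = None

    def handle_plan_execution(self, plan_changes: Dict[str, str]) -> None:
        # Remember the pre-plan state so a later rollback can restore it.
        self._snapshot = copy.deepcopy(self.config)
        self.config.update(plan_changes)

    def handle_rollback(self) -> bool:
        # Restore the state from before the plan; return False if nothing to undo.
        if self._snapshot is None:
            return False
        self.config = self._snapshot
        self._snapshot = None
        return True


device = SimulatedNetworkDevice({"primary_link": "up", "active_path": "primary"})
device.handle_plan_execution({"active_path": "backup"})  # plan execution instruction
state_after_plan = dict(device.config)
rolled_back = device.handle_rollback()                   # plan execution rollback instruction
```

Taking the snapshot at execution time is what makes the rollback cheap: the device does not need to know how to invert the plan, only how to restore the saved state.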
  • The control device receives a set of fault recovery plans from the training device, where the set of fault recovery plans includes the correspondences between multiple known faults and their fault recovery plans.
  • The implementation process in which the control device acquires the fault recovery plan corresponding to the similar known fault includes: the control device acquires the fault recovery plan corresponding to the similar known fault based on the set of fault recovery plans.
  • The control device sends the identifier of the target fault and the fault recovery plan corresponding to the target fault to the training device, so that the training device adds the correspondence between the target fault and its fault recovery plan to the set of fault recovery plans to obtain an updated set of fault recovery plans.
  • The control device can report the fault recovery plans corresponding to live-network faults to the training device in real time, so that the training device can automatically update the set of fault recovery plans, thereby expanding the range of faults that can be handled.
  • Compared with hard-coding the fault recovery plan, this improves the flexibility with which fault recovery plans can be extended and reduces the maintenance difficulty.
  • In a second aspect, a method for determining a fault recovery plan is provided. The method includes: an analysis device acquires, from among a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in the network satisfy a similarity condition.
  • The analysis device sends similar fault information corresponding to the target fault to the control device.
  • The similar fault information includes the identifier of the target fault and a similar fault list.
  • The similar fault list includes one or more similar known faults of the target fault.
  • The similar fault information is used by the control device to determine the fault recovery plan corresponding to the target fault.
  • The similar fault information further includes the fault root cause feature of the target fault.
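The similar fault information sent from the analysis device to the control device can be pictured as a small message. The Python sketch below is only an illustration: the field names, the JSON encoding, and the example identifiers are assumptions, since the text names the contents of the message but not its format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class SimilarFaultInfo:
    """Hypothetical message carrying similar fault information for one target fault."""
    target_fault_id: str                  # identifier of the target fault
    similar_faults: List[str]             # similar fault list (known fault identifiers)
    # Optional fault root cause feature of the target fault.
    target_root_cause_feature: Optional[Dict[str, str]] = None

    def to_json(self) -> str:
        # JSON is an assumed wire format chosen for readability.
        return json.dumps(asdict(self), sort_keys=True)


message = SimilarFaultInfo(
    target_fault_id="TF-42",
    similar_faults=["KF-1", "KF-7"],
    target_root_cause_feature={
        "root_cause_object": "interface",
        "root_cause_event": "interface_down",
    },
)
decoded = json.loads(message.to_json())
```

Carrying the target fault's root cause feature in the same message would let the control device forward it to the training device without a second lookup.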
  • The implementation process in which the analysis device acquires, from among the multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the analysis device acquires the fault root cause features of the multiple known faults. For each known fault in the multiple known faults, the analysis device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause feature of the target fault and the fault root cause feature of the known fault. According to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the analysis device determines, among the multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • The implementation process in which the analysis device determines the similar known faults according to these similarities includes: the analysis device determines, as similar known faults, the known faults among the multiple known faults whose fault root cause similarity to the target fault is higher than the similarity threshold.
  • The implementation process in which the analysis device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault, according to the fault root cause feature of the target fault and the fault root cause feature of the known fault, includes:
  • The analysis device inputs the fault root cause feature of the target fault and the fault root cause feature of the known fault into the similarity model to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault.
  • The similarity model is trained based on the fault root cause features of multiple sample faults marked with category labels, where sample faults marked with the same category label correspond to the same fault recovery plan.
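The per-known-fault scoring loop can be sketched with a pluggable similarity function. In the Python sketch below, `overlap_similarity` is a deliberately simple stand-in and is not the trained similarity model described in the text. The feature keys and fault identifiers are hypothetical.

```python
from typing import Callable, Dict

Feature = Dict[str, object]


def overlap_similarity(a: Feature, b: Feature) -> float:
    """Stand-in for the trained model: share of feature keys with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for key in keys if key in a and key in b and a[key] == b[key])
    return matches / len(keys)


def score_known_faults(
    target: Feature,
    known: Dict[str, Feature],
    model: Callable[[Feature, Feature], float] = overlap_similarity,
) -> Dict[str, float]:
    """Compute the similarity between the target fault and every known fault."""
    return {fault_id: model(target, feature) for fault_id, feature in known.items()}


target_feature = {"object": "interface", "event": "interface_down", "flapping": True}
known_features = {
    "KF-1": {"object": "interface", "event": "interface_down", "flapping": False},
    "KF-2": {"object": "protocol", "event": "bgp_peer_down"},
}
scores = score_known_faults(target_feature, known_features)
```

Passing the model in as a callable keeps the loop unchanged if the stand-in is later replaced by a model received from a training device.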
  • In a third aspect, a method for determining a fault recovery plan is provided. The method includes: a training device obtains a similarity model. The similarity model is trained based on the fault root cause features of multiple sample faults. The sample faults are marked with category labels, and sample faults marked with the same category label correspond to the same fault recovery plan.
  • The training device sends the similarity model to the analysis device, so that the analysis device can determine the similar known faults of a target fault in the network. The similar known faults are used to determine the fault recovery plan corresponding to the target fault.
  • The implementation process in which the training device obtains the similarity model includes: the training device trains the similarity model using the fault root cause features of the multiple sample faults.
  • The training device can also input the fault root cause features of multiple sample fault pairs into the similarity model separately, to obtain the similarity between the fault root causes of each sample fault pair output by the similarity model.
  • The multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs.
  • A first-type sample fault pair includes two sample faults marked with the same category label, and a second-type sample fault pair includes two sample faults marked with different category labels.
  • The training device determines the similarity threshold according to the similarities between the fault root causes of the multiple sample fault pairs.
  • The training device sends the similarity threshold to the analysis device.
  • The training device may also send a fault root cause feature set to the analysis device, where the fault root cause feature set includes the correspondences between multiple known faults and their fault root cause features.
  • The training device may also send a set of fault recovery plans to the control device, where the set of fault recovery plans includes the correspondences between multiple known faults and their fault recovery plans.
  • The training device may also receive, from the control device, an identifier of the target fault, the fault root cause feature of the target fault, and the fault recovery plan corresponding to the target fault.
  • The training device adds the correspondence between the target fault and its fault root cause feature to the fault root cause feature set to obtain an updated fault root cause feature set, and adds the correspondence between the target fault and its fault recovery plan to the set of fault recovery plans to obtain an updated set of fault recovery plans.
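The two sets kept by the training device, and the way a reported target fault extends them, can be sketched with plain dictionaries. This Python sketch rests on assumptions: the class and method names, the plan representation as a list of step names, and the identifiers are hypothetical.

```python
from typing import Dict, List


class TrainingDeviceStore:
    """Hypothetical holder for the fault root cause feature set and the plan set."""

    def __init__(self) -> None:
        self.feature_set: Dict[str, Dict[str, str]] = {}  # fault id -> root cause feature
        self.plan_set: Dict[str, List[str]] = {}          # fault id -> recovery plan steps

    def add_reported_fault(
        self, fault_id: str, feature: Dict[str, str], plan: List[str]
    ) -> None:
        # Add both correspondences so the fault becomes a known fault.
        self.feature_set[fault_id] = feature
        self.plan_set[fault_id] = plan

    def plans_for(self, fault_ids: List[str]) -> Dict[str, List[str]]:
        # Look up the recovery plans of the given similar known faults.
        return {fid: self.plan_set[fid] for fid in fault_ids if fid in self.plan_set}


store = TrainingDeviceStore()
store.add_reported_fault(
    "KF-1",
    {"root_cause_object": "interface", "root_cause_event": "interface_down"},
    ["switch_to_backup_link"],
)
# A target fault handled on the live network is reported and becomes known.
store.add_reported_fault(
    "TF-42",
    {"root_cause_object": "protocol", "root_cause_event": "bgp_peer_down"},
    ["reset_bgp_session"],
)
looked_up = store.plans_for(["KF-1", "KF-9"])
```

Because both sets are ordinary data rather than code, a new fault and its plan can be added at run time, which is the extensibility benefit the text contrasts with hard-coded expert rules.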
  • In a fourth aspect, a control device is provided. The control device includes a plurality of functional modules, and the functional modules interact to implement the methods in the first aspect and its embodiments.
  • The functional modules may be implemented in software, hardware, or a combination of software and hardware, and they may be combined or divided arbitrarily depending on the specific implementation.
  • In a fifth aspect, an analysis device is provided. The analysis device includes a plurality of functional modules, and the functional modules interact to implement the methods in the second aspect and its embodiments.
  • The functional modules may be implemented in software, hardware, or a combination of software and hardware, and they may be combined or divided arbitrarily depending on the specific implementation.
  • In a sixth aspect, a training device is provided. The training device includes a plurality of functional modules, and the functional modules interact to implement the methods in the third aspect and its embodiments.
  • The functional modules may be implemented in software, hardware, or a combination of software and hardware, and they may be combined or divided arbitrarily depending on the specific implementation.
  • In a seventh aspect, a fault recovery plan determination system is provided, including a control device and an analysis device.
  • The analysis device is configured to acquire, from among a plurality of known faults, similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition, and to send similar fault information corresponding to the target fault to the control device.
  • The similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
  • The control device is configured to acquire the fault recovery plan corresponding to the similar known fault, and to determine the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault.
  • The system further includes a training device.
  • The training device is configured to train a similarity model using the fault root cause features of a plurality of sample faults, and to send the similarity model to the analysis device.
  • The sample faults are marked with category labels, and sample faults marked with the same category label correspond to the same fault recovery plan.
  • The analysis device is configured to, for each known fault in the plurality of known faults, input the fault root cause feature of the target fault and the fault root cause feature of the known fault into the similarity model to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault, and to determine, according to the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known faults among the plurality of known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • The training device is further configured to send a set of fault recovery plans to the control device, where the set of fault recovery plans includes the correspondences between the multiple known faults and their fault recovery plans.
  • The control device is configured to acquire, based on the set of fault recovery plans, the fault recovery plan corresponding to the similar known fault.
  • The control device is further configured to send the identifier of the target fault and the fault recovery plan corresponding to the target fault to the training device.
  • The training device is further configured to add the correspondence between the target fault and its fault recovery plan to the set of fault recovery plans, to obtain an updated set of fault recovery plans.
  • The training device is further configured to send a fault root cause feature set to the analysis device, where the fault root cause feature set includes the correspondences between the multiple known faults and their fault root cause features.
  • The analysis device is configured to acquire the fault root cause features of the multiple known faults based on the fault root cause feature set.
  • The similar fault information further includes the fault root cause feature of the target fault.
  • The control device is further configured to send the identifier of the target fault and the fault root cause feature of the target fault to the training device.
  • The training device is further configured to add the correspondence between the target fault and its fault root cause feature to the fault root cause feature set, to obtain an updated fault root cause feature set.
  • a control device including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the methods in the first aspect and various embodiments thereof.
  • an analysis device including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the methods in the second aspect and various embodiments thereof.
  • a training device including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the methods in the third aspect and the various embodiments thereof.
  • a computer storage medium is provided. Instructions are stored on the computer storage medium, and when the instructions are executed by a processor of a computer device, the methods in the first aspect and its various embodiments are implemented, or the methods in the second aspect and its various embodiments are implemented, or the methods in the third aspect and its various embodiments are implemented.
  • a twelfth aspect provides a chip, where the chip includes a programmable logic circuit and/or program instructions, and when the chip runs, it implements the methods in the first aspect and its various embodiments, or the methods in the second aspect and its various embodiments, or the methods in the third aspect and its various embodiments.
  • the control device determines the failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to a similar known failure of the target failure. That is, regardless of whether the target failure is itself a known failure, as long as a similar known failure of the target failure can be found among the known failures, the failure recovery plan corresponding to the target failure can be determined, which expands the range of faults that can be handled. In addition, since one fault in the network may cause multiple cascading faults, and the fault root cause reflects the underlying cause of the fault, similar known faults of a fault in the network are searched for based on the fault root cause.
  • the control device can also report the fault root cause features of faults on the live network and the corresponding fault recovery plans to the cloud device in real time.
  • FIG. 1 is a schematic structural diagram of a system for determining a fault recovery plan provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for determining a fault recovery plan provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of an implementation process of a method for determining a fault recovery plan provided by an embodiment of the present application
  • FIG. 4 is a functional schematic diagram of a system for determining a fault recovery plan provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a control device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of another control device provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another control device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of still another control device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another control device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a control device provided by another embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an analysis device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a cloud device provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of another cloud device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another cloud device provided by an embodiment of the present application.
  • FIG. 15 is a block diagram of an apparatus for determining a fault recovery plan provided by an embodiment of the present application.
  • the embodiment of the present application provides a method for determining a fault recovery plan.
  • the control device acquires, from among multiple known faults, a similar known fault whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition.
  • the control device obtains a fault recovery plan corresponding to the similar known fault.
  • the control device determines a failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to the similar known failure.
  • a known failure can correspond to one or more failure recovery plans, and a failure recovery plan can also correspond to one or more known failures.
  • the target failure in the network can be any failure in the network.
  • the control device can determine the failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to a similar known failure of the target failure in the network. That is, regardless of whether the target failure is itself a known failure, as long as a similar known fault whose fault root cause and the fault root cause of the target fault satisfy the similarity condition can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • because similar known faults of a fault in the network are searched for based on the fault root cause, the similar known fault that is found matches the fault more closely, and the fault recovery plan corresponding to the similar known fault is more likely to apply to the fault, which makes the determined fault recovery plan more reliable.
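  • The three steps above (find similar known faults, fetch their recovery plans, derive the plan for the target fault) can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the function names, the dictionary layouts, the placeholder `toy_similarity`, the feature values, and the reading of the similarity condition as "similarity >= threshold" are all hypothetical.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]


def find_similar_known_faults(
    target: Features,
    known: Dict[str, Features],
    similarity: Callable[[Features, Features], float],
    threshold: float,
) -> List[str]:
    """Return IDs of known faults similar to the target, most similar first.

    Assumption: the similarity condition is "similarity >= threshold".
    """
    scored = [(similarity(target, feats), fid) for fid, feats in known.items()]
    return [fid for score, fid in sorted(scored, reverse=True) if score >= threshold]


def determine_recovery_plans(similar_ids: List[str], plan_set: Dict[str, List[str]]) -> List[str]:
    """Collect the plans of the similar known faults, dropping duplicates."""
    plans: List[str] = []
    for fid in similar_ids:
        for plan in plan_set.get(fid, []):
            if plan not in plans:
                plans.append(plan)
    return plans


def toy_similarity(a: Features, b: Features) -> float:
    """Hypothetical stand-in: fraction of feature values that are equal."""
    keys = sorted(set(a) | set(b))
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


# Illustrative data only; the feature values are assumptions.
known_faults = {
    "10000001": {"object": "physical interface", "event": "physical interface interruption", "flash": 1},
    "10000002": {"object": "board", "event": "main control board abnormal", "flash": 0},
}
plan_set = {"10000001": ["isolate interface"], "10000002": ["isolate board", "restart board"]}
target_fault = {"object": "physical interface", "event": "physical interface interruption", "flash": 0}

similar = find_similar_known_faults(target_fault, known_faults, toy_similarity, threshold=0.6)
plans = determine_recovery_plans(similar, plan_set)
print(similar, plans)
```

  • Passing the similarity function in as a parameter keeps the flow independent of how the similarity is actually computed, so a trained similarity model could replace the placeholder.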
  • the fault root cause is represented by the fault root cause feature.
  • the fault root cause feature includes a fault root cause object and a fault root cause event.
  • the fault root cause event is an abnormal event that causes the fault.
  • "physical interface interruption" may be a fault root cause event, indicating that the cause of the current fault is that the physical interface is interrupted.
  • the fault root cause object is used to indicate the type of the fault root cause network entity, the fault root cause network entity is used to indicate the specific location of the fault, and the fault root cause network entity is the network entity to which the fault root cause event belongs.
  • the fault root cause object can be understood as the ontology of the fault root cause network entity, and the fault root cause network entity can be understood as the instantiation of the fault root cause object.
  • Types of fault root cause objects include devices, interfaces, protocols, or services.
  • the device class specifically includes a board or a subcard.
  • the interface class specifically includes a physical interface, a loopback interface, or a virtual local area network (VLAN) interface.
  • the protocol class specifically includes Open Shortest Path First (OSPF) or Border Gateway Protocol (BGP).
  • the service class specifically includes a virtual private network (VPN) service or a Dynamic Host Configuration Protocol (DHCP) service, and the like. For example, if the fault root cause network entity is physical interface A on device A, the fault root cause object of this fault is the physical interface.
  • the fault root cause object of this fault is the OSPF network.
  • For another example, if the fault root cause network entity is a virtual extensible local area network (VXLAN) tunnel, expressed as VXLAN tunnel-1.1.1.1-2.2.2.2, where 1.1.1.1 is the source VXLAN tunnel end point (VTEP) address and 2.2.2.2 is the destination VTEP address, the fault root cause object of this fault is the VXLAN tunnel.
  • the fault root cause feature may also include one or more of the following: the interface flashing indication of the fault root cause network entity (that is, the interface flashing indication of the physical interface), the interface suspended animation indication of the fault root cause network entity (that is, the interface suspended animation indication of the physical interface), the packet sending and receiving status of the fault root cause network entity (that is, the packet sending and receiving status of the physical interface), the interface protocol status of the fault root cause network entity (that is, the interface protocol status of the physical interface), or the physical interface state of the device where the fault root cause network entity is located.
  • the fault root cause feature may also include the BGP route flapping indication of the fault root cause network entity and/or the physical interface state of the device where the fault root cause network entity is located.
  • the fault root cause feature may also include the physical interface state of the device where the fault root cause network entity is located.
  • the interface flashing indication is used to indicate whether the corresponding physical interface is interrupted multiple times in a short time. For example, if the physical interface is interrupted multiple times in a short time, the interface flashing indication of the physical interface is set to 1; otherwise, it is set to 0.
  • the interface suspended animation indication is used to indicate whether, while the corresponding physical interface is in the normal (up) state, its number of received packets or number of sent packets is 0. For example, if the physical interface is up but its number of received packets or sent packets is 0, the interface suspended animation indication of the physical interface is set to 1; otherwise, it is set to 0.
  • the BGP route flapping indication is used to indicate whether BGP route flapping occurs on the corresponding BGP peer. For example, if BGP route flapping occurs on the BGP peer, the BGP route flapping indication of the BGP peer is set to 1; otherwise, it is set to 0.
  • the physical interface state of the device where the fault root cause network entity is located is used to reflect whether the physical interfaces of the device are normal (up) or down. For example, if all the physical interfaces of the device are down, the physical interface state of the device is set to 1; otherwise, it is set to 0.
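  • As an illustration of how these 0/1 indications could be derived and stored together with the fault root cause object and event, consider the sketch below. The class and function names are hypothetical, and the 60-second window and the count of 3 are assumed values, since the text only says "multiple times in a short time".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultRootCauseFeature:
    root_cause_object: str           # type of the root cause entity, e.g. "physical interface"
    root_cause_event: str            # abnormal event that caused the fault
    interface_flash: int = 0         # 1 if the interface was interrupted repeatedly in a short time
    interface_suspended: int = 0     # 1 if the interface is up but received or sent no packets
    device_interfaces_down: int = 0  # 1 if every physical interface of the device is down


def flash_indication(interrupt_times: List[float], window_s: float = 60.0, min_count: int = 3) -> int:
    """1 if at least `min_count` interruptions fall within `window_s` seconds (assumed values)."""
    times = sorted(interrupt_times)
    for i in range(len(times) - min_count + 1):
        if times[i + min_count - 1] - times[i] <= window_s:
            return 1
    return 0


def suspended_indication(is_up: bool, rx_packets: int, tx_packets: int) -> int:
    """1 if the interface is up but its received or sent packet count is 0."""
    return int(is_up and (rx_packets == 0 or tx_packets == 0))


def device_interface_state(interface_states: List[str]) -> int:
    """1 if all physical interfaces of the device are down, otherwise 0."""
    return int(bool(interface_states) and all(s == "down" for s in interface_states))


feature = FaultRootCauseFeature(
    root_cause_object="physical interface",
    root_cause_event="physical interface interruption",
    interface_flash=flash_indication([0.0, 10.0, 25.0]),
    interface_suspended=suspended_indication(True, rx_packets=120, tx_packets=98),
    device_interfaces_down=device_interface_state(["up", "down"]),
)
print(feature)
```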
  • the fault recovery plan is an emergency response plan formulated for possible faults in the network based on expert experience combined with fault cases in the existing network.
  • the failure recovery plan mainly includes the following:
  • Isolate the device. For example, a fault in the network is: repeated restarts of a network device or abnormal heartbeats cause the multichassis link aggregation group (MLAG) of the switch to enter a dual-active state. For this fault, a recovery plan for isolating the device can be formulated.
  • the faults in the network are: the main control board is abnormal, the main control board is abnormal repeatedly, the switching network board is abnormal, or the switching network board is abnormal repeatedly.
  • Isolate the interface. For example, the faults in the network are: interface suspended animation, interface protocol status down, interface flashing, a one-way (unidirectional) link failure on an interface, an increase in cyclic redundancy check (CRC) errors, a Transmission Control Protocol (TCP) synchronization (SYN) flood attack, or an Address Resolution Protocol (ARP) attack.
  • Use a Layer 3 access control list (ACL) to isolate a virtual machine (VM). For example, the fault in the network is a TCP SYN flood attack.
  • For this fault, a recovery plan that uses a Layer 3 ACL to isolate VMs can be formulated. This recovery plan resolves the TCP SYN flood attack by configuring ACL rules based on interfaces or sub-interfaces on the relevant devices.
  • In the ACL rule, the source Internet Protocol (IP) address is the attacker's IP address, the destination IP address is the global IP address, and the configured traffic policy is applied in the inbound direction of the interface.
  • Use an ARP ACL to isolate VMs. For example, a fault in the network is an ARP attack.
  • For this fault, a recovery plan that uses an ARP ACL to isolate VMs can be formulated. This recovery plan resolves ARP attacks by configuring ACL rules for ARP packets on the relevant devices.
  • For another example, a fault in the network is a neighbor discovery (ND) attack, for which a recovery plan that uses advanced ACL6 to isolate VMs can be formulated.
  • This recovery plan resolves ND attacks by determining, based on interfaces and VLANs, the virtual routing and forwarding table to which the attacking ND packets belong, and configuring advanced ACL6 rules on the relevant devices.
  • the faults in the network are: soft failure of the device chip, loss of ARP hard table entries, loss of hard routing table entries, suspected jumping of device entries, etc.
  • a recovery plan for restarting the device can be formulated.
  • the hard table stores the operating data of the chip, in contrast to the soft table, which stores configuration data.
  • the faults in the network are: the main control board is abnormal, the main control board is abnormal repeatedly, the switch fabric board is abnormal, and the switch fabric board is abnormal repeatedly.
  • the recovery plan smoothly recovers abnormal forwarding information base (FIB) entries by calling an application programming interface (API) of the device.
  • FIG. 1 is a schematic structural diagram of a system for determining a fault recovery plan provided by an embodiment of the present application.
  • the system includes: a management device 101 and network devices 102a-102c in the network (collectively referred to as network devices 102).
  • the number of network devices in FIG. 1 is only used for illustration, and is not a limitation on the system for determining the fault recovery plan provided by the embodiment of the present application.
  • the network involved in the embodiments of the present application may be a data center network (DCN), a radio access network (RAN), a packet transport network (PTN), a metropolitan area network, a wide area network, a campus network, a VLAN, or a VXLAN, etc.
  • the embodiment of the present application does not limit the type of the network.
  • the management device 101 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • the management device 101 includes a collection device, an analysis device and a control device.
  • the collection device, analysis device and control device may be physical servers or virtual servers.
  • the collection device, the analysis device, and the control device are separate servers; alternatively, the collection device and the analysis device are integrated into one server; alternatively, the analysis device and the control device are integrated into one server; alternatively, the collection device, the analysis device, and the control device are all integrated into one server. That is, the management device 101 may function as a collection device, an analysis device, and/or a control device.
  • the management device 101 is used to manage and control the network devices 102 in the network, and the network may be a site network. Different site networks may be different networks divided along a given dimension, for example, networks in different regions, networks of different operators, networks carrying different services, or different network domains.
  • the management device 101 may be one or more devices.
  • the network device 102 may be a router or switch, or the like.
  • the management device 101 and the network device 102 are connected through a wired network or a wireless network.
  • the collection device in the management device 101 is configured to collect device data of the network device 102 in the network, and store the collected data in a database for use by the analysis device.
  • the analysis device in the management device 101 is used to perform anomaly detection on the network based on the device data of the network devices 102, then locate the network fault according to the multiple abnormal events generated during anomaly detection, and determine, among the known faults, the similar known faults of the located fault.
  • the control device in the management device 101 is configured to determine a failure recovery plan corresponding to the failure based on a similar known failure of the failure located by the analysis device, and send a plan execution instruction to the relevant network device in the network.
  • the management device 101 may also store the networking topology of the network.
  • the acquired device data may include at least one of management plane data, data plane data, or control plane data.
  • the management plane data includes configuration data and alarm data, for example, the configuration data includes security control policies.
  • Data plane data includes the ARP table, the media access control (MAC) table, the routing table, the tunnel status table (for a VXLAN network), and the interface status.
  • Control plane data includes central processing unit (CPU) data, memory data, link layer discovery protocol (LLDP) status, BGP status, and OSPF status. Both BGP and OSPF are routing protocols.
  • the collection device periodically collects device data of the network device 102 .
  • the collection device uses the Simple Network Management Protocol (SNMP) or network telemetry technology to collect the device data of the network devices.
  • the network device 102 actively reports the changed device data to the collection device.
  • the system further includes a training device 103.
  • the training device 103 may be a server, or a server cluster composed of several servers, or a cloud computing service center.
  • the training device 103 is an upper-level device of the management device 101 and can manage one or more management devices 101 .
  • the training device 103 can train a model for processing data (e.g., a similarity model), and provide the management device 101 with sets of processing data (e.g., a fault root cause feature set and a fault recovery plan set) and/or the model for processing data, and so on.
  • multiple management devices 101 may share the sets of processing data and/or the model for processing data provided by the training device 103.
  • the training device 103 and the management device 101 may be separate devices, or may also be integrated into one device, which is not limited in this embodiment of the present application.
  • the training device 103 may also be referred to as a cloud device.
  • FIG. 2 is a schematic flowchart of a method for determining a fault recovery plan provided by an embodiment of the present application. This method can be applied to the system for determining a fault recovery plan shown in FIG. 1. As shown in FIG. 2, the method includes:
  • Step 201 The cloud device sends the fault root cause feature set and the fault recovery plan set to the management device.
  • the fault root cause feature set includes the correspondence between a plurality of known faults and the fault root cause features.
  • the fault root cause characteristics in the fault root cause characteristic set are stored in the form of groups, and each group of fault root cause characteristics belongs to a known fault.
  • the fault root cause feature of a known fault includes at least a fault root cause object and a fault root cause event.
  • the fault root cause feature set includes a corresponding relationship between the fault identifier of the known fault and the fault root cause feature.
  • Fault identification includes fault ID and/or fault name.
  • the fault root cause feature set can be as shown in Table 1.
  • fault root cause characteristics include fault root cause objects, fault root cause events, and interface flashing indications.
  • the set of failure recovery plans includes the correspondence between a plurality of known failures and the failure recovery plans.
  • One known failure may correspond to one or more failure recovery plans, and one failure recovery plan may also correspond to one or more known failures.
  • the set of failure recovery plans includes a corresponding relationship between failure identifiers of known failures and failure recovery plans.
  • the set of failure recovery plans can be as shown in Table 2.
  • Table 2

    | Fault ID | Fault name | Failure recovery plan |
    |----------|------------|-----------------------|
    | 10000001 | Interface flashing | Isolate the interface |
    | 10000002 | Main control board abnormal | 1. Isolate the board; 2. Restart the board |
    | ... | ... | ... |
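  • The many-to-many correspondence in the fault recovery plan set can be pictured as a small keyed structure. The sketch below assumes a plain dictionary keyed by fault ID. The function names and the fault ID 10000003 are hypothetical, while the two initial rows follow Table 2.

```python
from collections import defaultdict
from typing import Dict, List

# Known fault ID -> fault name and recovery plans (rows taken from Table 2).
plan_set: Dict[str, dict] = {
    "10000001": {"name": "interface flashing", "plans": ["isolate interface"]},
    "10000002": {"name": "main control board abnormal", "plans": ["isolate board", "restart board"]},
}


def plans_for(fault_id: str) -> List[str]:
    """Recovery plans recorded for a known fault (empty list if the fault is unknown)."""
    return list(plan_set[fault_id]["plans"]) if fault_id in plan_set else []


def faults_by_plan() -> Dict[str, List[str]]:
    """Reverse index: one plan may correspond to several known faults."""
    index = defaultdict(list)
    for fault_id, entry in plan_set.items():
        for plan in entry["plans"]:
            index[plan].append(fault_id)
    return dict(index)


def add_target_fault(fault_id: str, name: str, plans: List[str]) -> None:
    """Record a handled target fault so that it becomes a known fault."""
    plan_set[fault_id] = {"name": name, "plans": list(plans)}


# Hypothetical new entry: a target fault resolved by an existing plan.
add_target_fault("10000003", "interface suspended animation", ["isolate interface"])
print(plans_for("10000002"))
print(faults_by_plan()["isolate interface"])
```

  • Adding the target fault back into the set mirrors the update step in which the cloud device obtains an updated fault recovery plan set.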
  • a set of fault root cause characteristics and a set of fault recovery plans are stored in the cloud device.
  • the cloud device can collect a large number of fault cases in the network, and extract the fault root cause characteristics of each fault according to the set type of fault root cause characteristics to generate an initial fault root cause characteristic set.
  • the failure recovery plans in the initial failure recovery plan set can be formulated based on expert experience.
  • Step 202 When a network failure occurs, the management device acquires abnormal events generated in the network.
  • the management device after acquiring the device data of the network devices in the network, performs abnormality detection on the device data of each network device to acquire abnormal events generated in the network.
  • the implementation process for the management device to perform anomaly detection on the device data of the network devices includes: the management device performs alarm analysis and aggregation on the alarm data to reduce the amount of alarm data, and then extracts abnormal events from the aggregated alarm data. And/or, the management device performs log anomaly detection on a large number of logs, for example, by using log template mining and/or log rarity analysis, so as to obtain abnormal events. And/or, the management device performs anomaly detection on the reported key performance indicators (KPIs); for example, a KPI that changes abruptly is regarded as an abnormal KPI.
  • the abnormal event includes one or more of an alarm log, a state change log or an abnormal KPI.
  • the alarm log includes the identifier of the abnormal network entity and the alarm type.
  • the state change log includes configuration file change information and/or routing table entry change information.
  • the state change log may include information such as "deletion of access sub-interface" and "deletion of destination IP host route”.
  • Abnormal KPIs are used to describe an abnormality in a certain indicator of a network entity.
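  • The text does not specify how an abrupt KPI change is detected. As one possible illustration, the sketch below uses a simple rolling z-score rule chosen purely for the example: a sample is flagged when it deviates from the mean of the preceding window by more than `k` standard deviations. The function name, the window size, `k`, `min_spread`, and the sample data are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def sudden_changes(values: List[float], window: int = 5, k: float = 3.0, min_spread: float = 1.0) -> List[int]:
    """Indices whose value deviates from the preceding `window` samples by more than k spreads.

    `min_spread` is a floor on the standard deviation so that a perfectly flat
    history does not turn negligible noise into an anomaly.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        spread = max(pstdev(history), min_spread)
        if abs(values[i] - mean(history)) > k * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU usage samples (percent) reported by one network entity.
cpu_usage = [30, 31, 29, 30, 32, 31, 95, 30]
abnormal_indices = sudden_changes(cpu_usage)
abnormal_events = [
    {"type": "abnormal KPI", "kpi": "cpu_usage", "index": i, "value": cpu_usage[i]}
    for i in abnormal_indices
]
print(abnormal_events)
```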
  • the fault that occurs in the network is referred to as a target fault.
  • Step 203 The management device determines the fault root cause characteristic of the target fault based on the abnormal event generated in the network.
  • the management device performs expert rule-based fault location or network knowledge graph-based source tracing reasoning for abnormal events generated in the network, so as to locate the fault root cause object and fault root cause event of the target fault in the network.
  • the implementation process in which the management device performs source tracing reasoning based on the network knowledge graph for abnormal events generated in the network includes: first, the management device generates the knowledge graph of the network it manages; then, after acquiring the abnormal events caused by the fault, the management device determines the abnormal network entities that generate the abnormal events in the network, for example, by marking them on the knowledge graph; finally, based on the fault propagation relationship between network entities, the management device determines one or more fault root cause network entities among all the abnormal network entities.
  • the types of network entities on the knowledge graph are devices, interfaces, protocols or services.
  • the management device determines the fault root cause object of the fault according to the fault root cause network entity.
  • the fault root cause object of this fault is the physical interface.
  • the management device determines the abnormal event associated with the fault root cause network entity (that is, the abnormal event causing the fault) as the fault root cause event of the fault.
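  • A simplified picture of this source tracing step: given the abnormal network entities and directed fault propagation relationships (cause to effect), treat as fault root cause network entities those abnormal entities that no other abnormal entity propagates a fault to. This rule, the function name, the entity names, and the propagation edges below are assumptions made for the sketch, and the text does not prescribe a specific reasoning algorithm.

```python
from typing import Dict, List, Set, Tuple


def locate_root_causes(abnormal: Set[str], propagation: List[Tuple[str, str]]) -> List[str]:
    """Abnormal entities that receive no fault from another abnormal entity."""
    affected = {
        dst for src, dst in propagation
        if src in abnormal and dst in abnormal and src != dst
    }
    return sorted(abnormal - affected)


# Hypothetical knowledge graph fragment: entity -> entity type.
entity_type: Dict[str, str] = {
    "if-A": "physical interface",
    "ospf-A": "protocol",
    "vpn-1": "service",
}
# Abnormal event observed on each abnormal entity.
abnormal_events: Dict[str, str] = {
    "if-A": "physical interface interruption",
    "ospf-A": "OSPF neighbor down",
    "vpn-1": "VPN service unreachable",
}
# Assumed fault propagation relationships (cause -> effect).
propagation = [("if-A", "ospf-A"), ("ospf-A", "vpn-1")]

roots = locate_root_causes(set(abnormal_events), propagation)
root_cause_object = entity_type[roots[0]]      # type of the root cause entity
root_cause_event = abnormal_events[roots[0]]   # event associated with that entity
print(roots, root_cause_object, root_cause_event)
```

  • Note that this rule returns nothing when the abnormal entities form a propagation cycle, so a fuller implementation would need an additional tie-breaking strategy.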
  • the process for the management device to obtain the fault propagation relationship includes: the management device obtains a plurality of knowledge graph samples, where each knowledge graph sample is annotated with all the abnormal network entities that generate abnormal events, and with the fault root cause network entity, when a fault occurs in the network to which the knowledge graph sample belongs.
  • the management device determines the fault propagation relationship based on the multiple knowledge graph samples. Among them, each knowledge graph sample is a fault case, and the abnormal network entities and the fault root cause network entities in the knowledge graph samples can be determined manually.
  • the management device may use a graph embedding algorithm or the like to learn the fault propagation relationship in the multiple knowledge graph samples. Alternatively, when the probability of simultaneous abnormality of two network entities in the same knowledge graph triplet is greater than a certain threshold, the management device may determine that fault propagation will occur between the two network entities.
  • the management device can obtain a set of fault propagation relationships, such as: if an interface fails, the route IP address used by the interface becomes unreachable.
  • when the management device acquires a first abnormal event indicating that the interface has failed and a second abnormal event indicating that the route IP address used by the interface is unreachable, the management device determines that the first abnormal event is the fault root cause event and that the interface is the fault root cause object.
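  • The threshold-based alternative described above can be sketched as a co-occurrence count over knowledge graph samples. The sketch assumes that "probability of simultaneous abnormality" means the fraction of samples containing a triplet in which both of its entities are abnormal. That reading, the sample layout, the relation names, and the 0.5 threshold are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (head entity type, relation, tail entity type)


def learn_propagation(samples: List[dict], threshold: float = 0.5) -> Dict[Triplet, float]:
    """Keep triplets whose two entities are abnormal together in more than `threshold` of samples."""
    seen: Counter = Counter()
    both: Counter = Counter()
    for sample in samples:
        abnormal = sample["abnormal"]
        for triplet in sample["triplets"]:
            head, _relation, tail = triplet
            seen[triplet] += 1
            if head in abnormal and tail in abnormal:
                both[triplet] += 1
    return {t: both[t] / n for t, n in seen.items() if both[t] / n > threshold}


# Hypothetical fault cases; entities are identified by type for brevity.
triplets = [("interface", "carries", "route"), ("device", "has", "interface")]
samples = [
    {"triplets": triplets, "abnormal": {"interface", "route"}},
    {"triplets": triplets, "abnormal": {"interface", "route"}},
    {"triplets": triplets, "abnormal": {"device"}},
]
relationships = learn_propagation(samples)
print(relationships)
```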
  • the management device may use multiple knowledge graph samples to learn the fault propagation relationship between network entities, and based on the fault propagation relationship, determine the fault root cause network entity in the abnormal network entities on the knowledge graph of the target network, Then the characteristics of the root cause of the fault are determined, and the automatic reasoning and location of the root cause of the network fault is realized.
  • the fault propagation relationship between network entities can also be determined by other devices and then sent to the management device.
  • for the way in which other devices determine the fault propagation relationship between network entities, refer to the above-mentioned way in which the management device determines the fault propagation relationship between network entities; details are not repeated here.
  • the fault propagation relationship can also be formulated based on expert rules.
  • Step 204: For each known fault among the multiple known faults, the management device calculates the similarity between the target fault and the known fault according to the fault root cause features of the target fault and the fault root cause features of the known fault.
  • the similarity between the target fault and the known fault is equal to the weighted average of the similarity between each fault root cause feature of the target fault and each fault root cause feature of the known fault.
  • the types of fault root cause characteristics include fault root cause objects, fault root cause events, and interface flash indications.
  • the similarity between the fault root cause object of the target fault and the fault root cause object of the known fault is the first similarity, the similarity between the fault root cause event of the target fault and the fault root cause event of the known fault is the second similarity, and the similarity between the interface flashing indication of the target fault and the interface flashing indication of the known fault is the third similarity.
  • the similarity between the target fault and the known fault is then equal to the weighted average of the first similarity, the second similarity, and the third similarity.
  • the implementation process of step 204 includes: the management device inputs the fault root cause features of the target fault and the fault root cause features of the known fault into the similarity model, so as to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault.
  • the similarity model is trained based on the fault root cause features of multiple sample faults. Sample faults are annotated with class labels. Among them, the fault recovery plans corresponding to the sample faults marked with the same category labels are the same. The fault recovery plans corresponding to sample faults marked with different category labels may be the same or different.
  • the similarity model is a machine learning model trained by supervised learning.
  • the management device uses the fault root cause characteristics of the multiple sample faults to train to obtain a similarity model.
  • each fault has d fault root cause characteristics
  • the similarity model is expressed as follows:
  • similarity(a, b) represents the similarity between fault a and fault b.
  • s(a_k, b_k) represents the similarity between the k-th fault root cause feature of fault a and the k-th fault root cause feature of fault b.
  • w_k represents the weight of the k-th fault root cause feature.
  • the value ranges of a_k and b_k are the same; max represents the maximum value in the value range of a_k or b_k, and min represents the minimum value in the value range of a_k or b_k.
  • when the management device trains the above similarity model, it determines the weights of the d fault root cause features such that the similarity between sample faults marked with the same category label is greater than the similarity between sample faults marked with different category labels.
  • the management device may also use the fault root cause features of multiple sample faults and call the similarity model to determine the similarity threshold. Specifically, the management device inputs the fault root cause features of each of multiple sample fault pairs into the similarity model to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair. The management device then determines the similarity threshold according to the similarities between the fault root causes of the multiple sample fault pairs.
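The training goal described above can be illustrated with a minimal Python sketch. It assumes a weighted-sum model with a range-normalized per-feature similarity (an assumed form, since the formula is not reproduced in this text), and it only checks whether a given weight vector meets the goal; it is not a training procedure. The sample values, labels and function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

def feature_similarity(a_k: float, b_k: float, lo: float, hi: float) -> float:
    """Assumed per-feature similarity: 1 minus the range-normalized distance."""
    if hi == lo:
        return 1.0
    return 1.0 - abs(a_k - b_k) / (hi - lo)

def similarity(a: Sequence[float], b: Sequence[float],
               weights: Sequence[float],
               ranges: Sequence[Tuple[float, float]]) -> float:
    """Weighted sum of the d per-feature similarities (assumed model form)."""
    return sum(w * feature_similarity(x, y, lo, hi)
               for x, y, w, (lo, hi) in zip(a, b, weights, ranges))

def separates_labels(samples: List[Sequence[float]], labels: List[str],
                     weights: Sequence[float],
                     ranges: Sequence[Tuple[float, float]]) -> bool:
    """True if every same-label pair is more similar than every different-label pair."""
    same, diff = [], []
    for i, j in combinations(range(len(samples)), 2):
        sim = similarity(samples[i], samples[j], weights, ranges)
        (same if labels[i] == labels[j] else diff).append(sim)
    return bool(same) and bool(diff) and min(same) > max(diff)

# Hypothetical sample faults with d = 3 numerically encoded features in [0, 1].
SAMPLES = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
LABELS = ["A", "A", "B"]
RANGES = [(0.0, 1.0)] * 3
```

With these samples, weights that emphasize the first two features satisfy the goal, while weights that emphasize the third feature do not, which is the kind of distinction a training step would have to learn.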
  • each sample fault pair includes two sample faults, and the similarity between the fault root causes of one sample fault pair is the similarity between the fault root causes of two sample faults in the sample fault pair.
  • the multiple sample fault pairs include a first type of sample fault pair and a second type of sample fault pair: the first type includes two sample faults marked with the same category label, and the second type includes two sample faults marked with different category labels.
  • the similarity threshold may be a dividing value between the similarity between fault root causes of sample faults marked with the same category label and the similarity between fault root causes of sample faults marked with different category labels.
  • the fault root cause features of the multiple sample fault pairs are input into the trained similarity model pair by pair. If, for the first type of sample fault pairs, most of the similarities output by the similarity model are greater than a target threshold, and, for the second type of sample fault pairs, most of the similarities output by the similarity model are smaller than the target threshold, the target threshold can be determined as the similarity threshold.
  • the similarity threshold may take a value of 0.9.
  • the multiple sample faults used to train the similarity model and the multiple sample faults used to determine the similarity threshold may be the same or different.
  • the former are used to adjust the parameters of the similarity model, and the latter are used to determine the similarity threshold.
  • the similarities between the fault root causes of sample faults of the same category and the similarities between the fault root causes of sample faults of different categories are collected, and an appropriate similarity threshold is then chosen from them.
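One way to pick such a dividing value is sketched below in Python. The selection rule (try the midpoints between neighbouring observed similarities and keep the one that separates the most pairs) is an assumption made for illustration; the text only requires that most same-label pairs fall above the threshold and most different-label pairs fall below it. The function name is hypothetical.

```python
from typing import Sequence

def choose_threshold(same_label_sims: Sequence[float],
                     diff_label_sims: Sequence[float]) -> float:
    """Return the midpoint between neighbouring observed similarities that
    puts the most same-label pairs above it and different-label pairs below it."""
    candidates = sorted(set(same_label_sims) | set(diff_label_sims))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct similarity values")
    best_threshold, best_score = candidates[0], -1
    for lower, upper in zip(candidates, candidates[1:]):
        threshold = (lower + upper) / 2
        score = (sum(s > threshold for s in same_label_sims)
                 + sum(s < threshold for s in diff_label_sims))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold
```

For example, with same-label similarities 0.95, 0.97, 0.92, 0.99 and different-label similarities 0.4, 0.7, 0.85, 0.88, the chosen value lies between 0.88 and 0.92.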
  • the management device receives the similarity model and/or the similarity threshold from the cloud device. That is, after acquiring multiple sample faults, the cloud device uses the fault root cause features of multiple sample faults to train to obtain a similarity model. After the cloud device obtains the similarity model after training, it can also use the fault root cause characteristics of multiple sample faults, and call the similarity model to determine the similarity threshold.
  • for the process in which the cloud device trains the similarity model and determines the similarity threshold, reference may be made to the above process in which the management device trains the similarity model and determines the similarity threshold, which is not repeated in this embodiment of the present application.
  • Step 205 The management device determines, according to the similarity between the fault root cause of the target fault and the fault root causes of the multiple known faults, a similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the management device determines, among the multiple known faults, a known fault whose fault root cause has a similarity to the fault root cause of the target fault that is higher than the similarity threshold as a similar known fault of the target fault.
  • the management device determines all known faults, among the multiple known faults, whose fault root causes have a similarity to the fault root cause of the target fault that is higher than the similarity threshold as similar known faults of the target fault; that is, the target fault may have one or more similar known faults.
  • with this approach, the management device may also find no similar known fault of the target fault among the known faults.
  • the management device sorts the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, and takes the n known faults whose fault root causes have the highest similarity to the fault root cause of the target fault as the similar known faults of the target fault, where n is a positive integer.
  • the management device takes the three known faults with the highest similarity between the fault root cause and the fault root cause of the target fault as similar known faults of the target fault.
  • with this approach, a known fault similar to the target fault can always be found among the known faults, so a fault recovery plan corresponding to the target fault can always be determined, and the range of faults that can be handled is large.
  • the management device takes, as the similar known faults of the target fault, the known faults among the multiple known faults whose fault root causes have a similarity to the fault root cause of the target fault that is higher than the similarity threshold and that rank in the top m when sorted by that similarity in descending order, where m is a positive integer.
  • with this approach, only known faults whose fault root causes have a similarity to the fault root cause of the target fault higher than the similarity threshold are retained, which ensures the accuracy of the determined similar known faults, and the number of determined similar known faults is limited, which reduces the amount of subsequent computation.
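The three selection strategies described for step 205 (threshold only, top n only, and threshold combined with top m) can be sketched with one small Python function. The function name and the fault identifiers in the usage note are hypothetical.

```python
from typing import Dict, List, Optional

def select_similar_faults(similarities: Dict[str, float],
                          threshold: Optional[float] = None,
                          top: Optional[int] = None) -> List[str]:
    """Rank known faults by similarity to the target fault, then optionally
    keep only those above the threshold and/or only the first `top` of them."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [item for item in ranked if item[1] > threshold]
    if top is not None:
        ranked = ranked[:top]
    return [fault_id for fault_id, _ in ranked]
```

Called with `threshold=0.9` alone, the result may be empty; called with `top=3` alone, it is never empty as long as known faults exist; called with both, it is bounded in size and contains only faults above the threshold.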
  • Step 206 The management device determines a failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to a similar known failure of the target failure.
  • the implementation process of step 206 includes: the management device evaluates, based on the network configuration of the network, the feasibility of the fault recovery plan corresponding to the similar known fault of the target fault.
  • the management device determines one or more of the feasible failure recovery plans as a failure recovery plan corresponding to the target failure.
  • Network configuration includes networking topology and/or device data.
  • for example, if the link where a device is located has a redundant link, the "isolate device" fault recovery plan is feasible for the device; if the link where a device is located does not have a redundant link, the "isolate device" fault recovery plan is not feasible for the device.
  • the management device first evaluates the degree of impact of the multiple failure recovery plans on services running on the network based on the network configuration of the network. Then, the management device determines, among the multiple failure recovery plans, the failure recovery plan with the least impact on the services running on the network as the failure recovery plan corresponding to the target failure.
  • for example, the similar known fault of the target fault is an ARP attack fault, and the ARP attack fault corresponds to two fault recovery plans: isolating the VM and isolating the interface.
  • after the management device determines that both fault recovery plans, isolating the VM and isolating the interface, are feasible, it evaluates the impact of each on the services running on the network.
  • isolating the VM requires ACL blocking for each MAC address of an ARP attack source, that is, the number of ACL resources required equals the number of ARP attack sources. If the network has sufficient ACL resources, the fault recovery plan of isolating the VM is feasible.
  • isolating the VM affects only the isolated attack source VM and has a small impact on the services running on the network; isolating the interface affects all VMs mounted on the isolated interface (including attack source VMs and normal VMs) and has a greater impact on the services running on the network. Because isolating the VM has a smaller impact on the services running on the network than isolating the interface, the management device uses isolating the VM as the fault recovery plan corresponding to the target fault. If isolating the VM is not feasible because of insufficient ACL resources on the network, the management device can use isolating the interface as the fault recovery plan corresponding to the target fault.
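The ARP attack example can be expressed as a short Python sketch of "filter by feasibility, then pick the least impact". Measuring impact as the number of affected VMs, treating interface isolation as always feasible, and all names used here are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RecoveryPlan:
    name: str
    feasible: bool
    affected_vms: int  # assumed impact measure: number of VMs that lose service

def evaluate_arp_plans(attack_sources: int, free_acl_entries: int,
                       vms_on_interface: int) -> List[RecoveryPlan]:
    """Evaluate the two ARP attack plans against the network configuration."""
    return [
        # One ACL entry is needed per attack source MAC address.
        RecoveryPlan("isolate VM", free_acl_entries >= attack_sources, attack_sources),
        # Assumed to be always feasible in this sketch.
        RecoveryPlan("isolate interface", True, vms_on_interface),
    ]

def choose_plan(plans: List[RecoveryPlan]) -> Optional[RecoveryPlan]:
    """Return the feasible plan with the smallest impact, or None if none is feasible."""
    feasible = [plan for plan in plans if plan.feasible]
    return min(feasible, key=lambda plan: plan.affected_vms) if feasible else None
```

With two attack sources, ten free ACL entries and eight VMs on the interface, the sketch selects "isolate VM"; with only one free ACL entry it falls back to "isolate interface".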
  • the management device may also output all the fault recovery plans corresponding to the similar known fault, and use the fault recovery plan specified by a selection instruction as the fault recovery plan corresponding to the target fault. The selection instruction can be triggered by operation and maintenance personnel.
  • the management device can send all failure recovery plans corresponding to similar known failures of the target failure to the operations support system (OSS) or other terminal devices connected to the management device for display by the OSS or the terminal device.
  • the management device can also display all failure recovery plans corresponding to similar known failures of the target failure on its own display interface.
  • the operation and maintenance personnel can designate one of the fault recovery plans as the fault recovery plan corresponding to the target fault, or can input another fault recovery plan to serve as the fault recovery plan corresponding to the target fault, which is not limited in this embodiment of the present application.
  • in response to there being no similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition, the management device outputs the fault identifier and the fault root cause features of the target fault, so that the operation and maintenance personnel can determine the fault recovery plan corresponding to the target fault.
  • the management device sends the fault identification of the target fault and the fault root cause characteristics to the OSS or other terminal devices connected to the management device for display by the OSS or the terminal device.
  • the management device can also display the fault identifier of the target fault and the fault root cause feature on its own display interface.
  • the management device can determine the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to a similar known fault of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • similar known faults of a fault in the network are searched for based on the fault root cause, so the similar known faults that are found match the fault closely. The higher the degree of matching, the more likely it is that the fault recovery plan corresponding to the similar known fault applies to the fault, which makes the determined fault recovery plan more reliable.
  • the management device may also distribute the fault recovery plan to the relevant network devices in the network that need to execute it, so as to realize end-to-end fault recovery. For this process, see steps 207 to 208 below.
  • Step 207 The management device determines, based on the target fault and the fault recovery plan corresponding to the target fault, the target network device in the network that is to execute the plan.
  • the target network device that is to execute the plan includes the device where the fault root cause network entity of the target fault is located and/or an access device in the network, and the like.
  • the target network device that is to execute the plan is usually the device where the fault root cause network entity is located.
  • for example, if the fault root cause event of the target fault is an OSPF router ID conflict that causes the DHCP service to time out, the target network device that is to execute the plan is usually the network device on which the OSPF router ID conflict occurs.
  • as another example, if the target fault is an ARP attack fault, the target network device that is to execute the plan is the edge device (that is, the access device) on which the attack source VM is mounted.
  • Step 208 The management device sends a plan execution instruction to the target network device.
  • the plan execution instruction is used to instruct the target network device to execute the failure recovery plan corresponding to the target failure.
  • the plan execution instruction includes a failure recovery plan corresponding to the target failure.
  • the failure recovery plan corresponding to the target failure included in the plan execution instruction may be an execution script of the failure recovery plan.
  • the content of the execution script of the "isolated device" failure recovery plan includes:
  • a) Determine the role of the device, for example, determine whether the current device is a spine device or a leaf device.
  • b) If the current device is a leaf device or a device that is not a combined spine-leaf device, record the cost value of the spine-side interface of the current device, and then adjust the cost value to the maximum value; traverse the access-side interfaces of the current device, record the current state of each access-side interface, and then execute shutdown on every access-side interface that is not in the down state, except the in-band management interface.
  • c) If the current device is a spine device: if the spine device is a standalone device group, or a member of its device group has already been isolated, the current device cannot perform the isolation operation; otherwise, record the cost value of the interface that connects the current device to the spine, and then adjust the cost value to the maximum value; traverse the interfaces connecting leaf to spine, record the current state of each interface, and execute shutdown on every interface that is not in the down state.
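The script logic above can be sketched in Python against a toy device model. Everything in the model is an assumption made for illustration: interfaces are tagged either "fabric" (spine-leaf links) or "access", the maximum cost is taken to be 65535, and the function returns the recorded values so that they could later be used to restore the device.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_COST = 65535  # assumed maximum cost value

@dataclass
class Interface:
    name: str
    side: str                  # "fabric" (spine-leaf link) or "access"
    cost: int = 10
    state: str = "up"
    inband_mgmt: bool = False  # in-band management interface, never shut down

@dataclass
class Device:
    role: str                  # "leaf" or "spine"
    interfaces: List[Interface]
    standalone_group: bool = False
    group_member_isolated: bool = False

def isolate(device: Device) -> Dict[str, Dict[str, object]]:
    """Isolate the device and return the recorded pre-isolation cost and state values."""
    if device.role == "spine" and (device.standalone_group or device.group_member_isolated):
        raise RuntimeError("spine device cannot be isolated")
    # A leaf shuts down its access-side interfaces; a spine shuts down its fabric links.
    shutdown_side = "access" if device.role == "leaf" else "fabric"
    snapshot: Dict[str, Dict[str, object]] = {"cost": {}, "state": {}}
    for itf in device.interfaces:
        if itf.side == "fabric":
            snapshot["cost"][itf.name] = itf.cost   # record, then raise the cost
            itf.cost = MAX_COST
        if itf.side == shutdown_side:
            snapshot["state"][itf.name] = itf.state  # record before any change
            if itf.state != "down" and not itf.inband_mgmt:
                itf.state = "down"
    return snapshot
```

Recording each value before changing it is what makes a later return to the pre-plan state possible.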
  • Step 209 The management device sends a plan execution rollback instruction to the target network device.
  • the plan execution rollback instruction is used to instruct the target network device to restore to the state before executing the failure recovery plan corresponding to the target failure.
  • the management device in response to receiving the rollback trigger instruction, sends a plan execution rollback instruction to the target network device.
  • the rollback trigger instruction may be triggered by the operation and maintenance personnel performing a specified operation on the management device. For example, when the management device detects a pressing operation on a certain key, it determines that the rollback trigger instruction is received.
  • the rollback trigger instruction may also come from another device, that is, the management device may send the plan execution rollback instruction to the target network device under a control instruction (the rollback trigger instruction) from the other device.
  • the management device may also send the plan execution rollback instruction to the target network device, so as to instruct the target network device to return to the state before executing the fault recovery plan, thereby giving the network device a state rollback capability. In a scenario where a network device has executed an unreasonable fault recovery plan, this function can quickly restore the network device to its original state and improve the reliability of network operation.
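A minimal Python sketch of the rollback idea follows. It assumes the target network device keeps a snapshot of its configuration taken just before the plan is executed; the class, the method names and the flat key-value configuration are all hypothetical.

```python
import copy
from typing import Dict, Optional

class TargetDevice:
    """Toy network device that can undo the most recently executed plan."""

    def __init__(self, config: Dict[str, object]) -> None:
        self.config = config
        self._snapshot: Optional[Dict[str, object]] = None

    def execute_plan(self, changes: Dict[str, object]) -> None:
        # Record the pre-plan state first so it can be restored later.
        self._snapshot = copy.deepcopy(self.config)
        self.config.update(changes)

    def rollback(self) -> bool:
        """Restore the pre-plan state; return False if there is nothing to undo."""
        if self._snapshot is None:
            return False
        self.config = self._snapshot
        self._snapshot = None
        return True
```

In this sketch, handling a plan execution rollback instruction amounts to calling `rollback()` on the device that previously ran `execute_plan()`.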
  • the management device may also send the fault recovery plan corresponding to the target fault and/or the fault root cause features of the target fault to the cloud device, so that the cloud device can update the fault recovery plan set and/or the fault root cause feature set. For the implementation process, refer to steps 210 to 213 below.
  • Step 210 The management device sends an identifier of the target failure and a failure recovery plan corresponding to the target failure to the cloud device.
  • the identifier of the target failure and the failure recovery plan corresponding to the target failure are used for the cloud device to add the corresponding relationship between the target failure and the failure recovery plan corresponding to the target failure in the failure recovery plan set, and obtain the updated failure recovery plan set .
  • Step 211 The cloud device adds the correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set to obtain an updated fault recovery plan set.
  • every time the cloud device updates the set of failure recovery plans it may send the updated set of failure recovery plans to the management device, or the cloud device periodically sends the latest set of failure recovery plans to the management device.
  • Step 212 The management device sends the identifier of the target fault and the fault root cause features of the target fault to the cloud device.
  • the identifier of the target fault and the fault root cause features of the target fault are used by the cloud device to add the correspondence between the target fault and its fault root cause features to the fault root cause feature set, so as to obtain the updated fault root cause feature set.
  • steps 210 and 212 may be performed simultaneously, that is, the management device synchronously sends the fault recovery plan corresponding to the target fault and the fault root cause characteristics of the target fault to the cloud device.
  • for example, if the target fault is an interface flashing fault, the content sent by the management device to the cloud device can be as shown in Table 3.
  • Step 213 The cloud device adds the corresponding relationship between the target fault and the fault root cause characteristic of the target fault in the fault root cause characteristic set to obtain an updated fault root cause characteristic set.
  • the cloud device may send the updated fault root cause feature set to the management device, or the cloud device periodically sends the latest fault root cause feature set to the management device.
  • after receiving the fault root cause features of the existing network fault and the corresponding fault recovery plan reported by the management device, the cloud device can continue to train and update the similarity model, which improves the accuracy and reliability of the model.
  • the management device can report the fault root cause characteristics of the existing network fault and the corresponding fault recovery plan to the cloud device in real time, so that the cloud device can automatically update the set of fault recovery plans and the set of fault root cause characteristics.
  • this expands the range of faults that can be handled and removes the need to manually summarize the fault recovery plans corresponding to faults and hard-code them, which improves the extensibility of the fault recovery plans and reduces the difficulty of maintenance.
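The cloud-side update in steps 211 and 213 can be sketched as two mappings keyed by the fault identifier. The class and method names are hypothetical, and representing the two sets as in-memory dictionaries is an assumption made purely for illustration.

```python
from typing import Dict, List, Optional

class FaultKnowledgeBase:
    """Toy cloud-side store for the fault recovery plan set and the
    fault root cause feature set."""

    def __init__(self) -> None:
        self.recovery_plans: Dict[str, List[str]] = {}
        self.root_cause_features: Dict[str, Dict[str, str]] = {}

    def report(self, fault_id: str, plan: Optional[str] = None,
               features: Optional[Dict[str, str]] = None) -> None:
        """Add the correspondences reported for one fault by the management device."""
        if plan is not None:
            plans = self.recovery_plans.setdefault(fault_id, [])
            if plan not in plans:  # avoid storing the same plan twice
                plans.append(plan)
        if features is not None:
            self.root_cause_features[fault_id] = dict(features)
```

Because a fault can be reported with a plan, with features, or with both, the same entry point covers steps 210 and 212 whether they are performed separately or together.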
  • the above management device may be one device (control device), or the above management device may include multiple devices (collection device, analysis device and/or control device).
  • the management device includes an analysis device and a control device, and the analysis device integrates the function of a collection device. In this case, the implementation process of the above step 201 may include: the cloud device sends the fault root cause feature set to the analysis device, and sends the fault recovery plan set to the control device.
  • the above steps 202 to 205 are performed by the analysis device.
  • the above steps 206 to 210 and step 212 are performed by the control device.
  • FIG. 3 is a schematic diagram of an implementation process of a method for determining a fault recovery plan provided by an embodiment of the present application. As shown in Figure 3, the implementation process includes:
  • Step 301 The cloud device sends the fault root cause feature set to the analysis device.
  • the fault root cause feature set includes the correspondence between a plurality of known faults and the fault root cause features.
  • for this step, reference may be made to the relevant content in the foregoing step 201 and step 204, and details are not described herein again in this embodiment of the present application.
  • Step 302 The cloud device sends a set of failure recovery plans to the control device.
  • the set of failure recovery plans includes the correspondence between a plurality of known failures and the failure recovery plans.
  • for this step, reference may be made to the relevant content in the foregoing step 201, and details are not described herein again in this embodiment of the present application.
  • Step 303 When a network failure occurs, the analysis device acquires abnormal events generated in the network.
  • Step 304 The analyzing device determines the fault root cause characteristic of the target fault based on the abnormal event generated in the network.
  • Step 305 For each known fault in the multiple known faults, the analysis device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause features of the target fault and the fault root cause features of the known fault.
  • the cloud device may also send the similarity model and the similarity threshold to the analysis device.
  • alternatively, the analysis device can train the similarity model and determine the similarity threshold on its own. The analysis device can then input the fault root cause features of the target fault and the fault root cause features of the known fault into the similarity model, so as to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault.
  • Step 306 The analysis device determines, according to the similarity between the fault root cause of the target fault and the fault root causes of the multiple known faults, a similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the analyzing device determines, among the multiple known faults, a known fault whose similarity between the fault root cause and the fault root cause of the target fault is higher than a similarity threshold as a known fault similar to the target fault.
  • Step 307 The analysis device sends the similar fault information corresponding to the target fault to the control device.
  • the similar fault information includes an identification of the target fault and a list of similar faults.
  • the list of similar failures includes one or more similar known failures of the target failure.
  • the similar fault information is used for the control device to determine a fault recovery plan corresponding to the target fault.
  • the similar fault information further includes the fault root cause characteristic of the target fault.
  • the similar fault information can be as shown in Table 4.
  • the similar fault information may also include the occurrence time stamp of the target fault, and the like.
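Table 4 is not reproduced in this text. As a rough illustration of the fields listed above, the similar fault information sent in step 307 could be modelled as follows; the field names and the JSON encoding are assumptions, not the format used in the application.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional

@dataclass
class SimilarFaultInfo:
    target_fault_id: str                                  # identifier of the target fault
    similar_faults: List[str]                             # list of similar known faults
    root_cause_features: Optional[Dict[str, str]] = None  # optional, per the text
    occurred_at: Optional[float] = None                   # optional occurrence timestamp

def encode(info: SimilarFaultInfo) -> str:
    """Serialize the message for sending from the analysis device to the control device."""
    return json.dumps(asdict(info), sort_keys=True)

def decode(payload: str) -> SimilarFaultInfo:
    """Rebuild the message on the control device side."""
    return SimilarFaultInfo(**json.loads(payload))
```

The two optional fields default to `None`, which mirrors the text: only the target fault identifier and the similar fault list are always present.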
  • Step 308 The control device determines a failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to a similar known failure of the target failure.
  • Step 309 The control device determines, based on the target fault and the fault recovery plan corresponding to the target fault, the target network device in the network that is to execute the plan.
  • Step 310 The control device sends a plan execution instruction to the target network device.
  • the plan execution instruction is used to instruct the target network device to execute the failure recovery plan corresponding to the target failure.
  • the plan execution instruction includes a failure recovery plan corresponding to the target failure.
  • Step 311 The control device sends a plan execution rollback instruction to the target network device.
  • the plan execution rollback instruction is used to instruct the target network device to restore to the state before executing the failure recovery plan corresponding to the target failure.
  • the control device in response to receiving the rollback trigger instruction, sends a plan execution rollback instruction to the target network device.
  • Step 312 The control device sends the fault recovery plan corresponding to the target fault and the fault root cause features of the target fault to the cloud device.
  • Step 313 The cloud device adds the correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set to obtain an updated fault recovery plan set, and adds the correspondence between the target fault and the fault root cause features of the target fault to the fault root cause feature set to obtain an updated fault root cause feature set.
  • after receiving the fault root cause features of the existing network fault and the corresponding fault recovery plan reported by the management device, the cloud device can continue to train and update the similarity model.
  • the cloud device can also update the similarity threshold to improve the accuracy of judging the similarity of faults.
  • FIG. 4 is a functional schematic diagram of a system for implementing a fault recovery plan provided by an embodiment of the present application for implementing the method shown in FIG. 3 .
  • the fault recovery plan system includes cloud equipment, analysis equipment and control equipment.
  • the cloud equipment includes a model training module and a fault knowledge base.
  • the model training module is used to train the similarity model and determine the similarity threshold.
  • the fault knowledge base includes a fault feature library and a fault recovery plan library.
  • the fault feature library is used to store the fault root cause feature set
  • the fault recovery plan library is used to store the fault recovery plan set.
  • the cloud device is used to send the fault root cause feature set, similarity model and similarity threshold to the analysis device, and send the fault recovery plan set to the control device.
  • the analysis equipment includes a fault location module, a similar fault determination module and a fault feature library.
  • the fault location module is used to locate faults in the network (referred to as existing network faults for short) and to extract their fault root cause features.
  • the similar fault determination module is used to call the similarity model to determine the similar known faults of an existing network fault from the known faults in the fault feature library.
  • the analysis device is used to send the similar fault information corresponding to the existing network fault to the control device.
  • the control equipment includes a plan evaluation module, a plan management module and a fault recovery plan library.
  • the plan evaluation module is used to obtain, from the fault recovery plan library, the fault recovery plans corresponding to the similar known faults of an existing network fault, and to evaluate their feasibility.
  • the plan management module is used to determine the fault recovery plan corresponding to the existing network failure and the network devices that need to execute the plan.
  • the control device is used to send the fault root cause characteristics of the existing network fault and the corresponding fault recovery plan to the cloud device.
  • the management device determines the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to a similar known fault of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • similar known faults of a fault in the network are searched for based on the fault root cause, so the similar known faults that are found match the fault closely, and the fault recovery plans corresponding to them are more likely to apply to the fault.
  • the management device can also report the fault root cause features of the existing network fault and the corresponding fault recovery plan to the cloud device in real time, so that the cloud device automatically updates the fault recovery plan set and the fault root cause feature set, thereby expanding the range of faults that can be handled.
  • FIG. 5 is a schematic structural diagram of a control device provided by an embodiment of the present application. As shown in FIG. 5, the control device 50 includes:
  • the first obtaining module 501 is configured to obtain, from a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in the network satisfy a similarity condition.
  • the first determining module 502 is configured to obtain a fault recovery plan corresponding to a similar known fault, and determine a fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault.
  • the fault root cause is represented by a fault root cause feature
  • the fault root cause feature includes a fault root cause object and a fault root cause event, wherein the fault root cause event is an abnormal event that causes the fault, and the fault root cause object is used to indicate the fault.
  • the fault root cause network entity is the network entity to which the fault root cause event belongs.
  • when the fault root cause network entity is a physical interface, the fault root cause feature further includes one or more of: an interface flash indication of the fault root cause network entity, an interface suspended animation indication of the fault root cause network entity, a packet sending and receiving status of the fault root cause network entity, an interface protocol state of the fault root cause network entity, or a physical interface state of the device where the fault root cause network entity is located.
  • when the fault root cause network entity is a BGP peer, the fault root cause feature further includes a BGP route flapping indication of the fault root cause network entity and/or the physical interface state of the device where the fault root cause network entity is located.
  • the fault root cause characteristic also includes the physical interface state of the device where the fault root cause network entity is located.
  • the first obtaining module 501 is configured to: obtain the fault root cause features of multiple known faults; for each known fault among the multiple known faults, calculate the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause feature of the target fault and the fault root cause feature of the known fault; and determine, according to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the first obtaining module 501 is configured to: determine, as a similar known fault, each known fault among the multiple known faults whose fault root cause has a similarity with the fault root cause of the target fault that is higher than a similarity threshold.
  • the first acquisition module 501 is configured to: input the fault root cause feature of the target fault and the fault root cause feature of the known fault into a similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault. The similarity model is trained on the fault root cause features of multiple sample faults. Each sample fault is annotated with a category label, and sample faults annotated with the same category label correspond to the same fault recovery plan.
  • control device 50 further includes: a training module 503 , configured to obtain a similarity model by training using the fault root cause features of multiple sample faults.
  • control device 50 further includes: a second acquisition module 504, configured to input the fault root cause features of multiple sample fault pairs into the similarity model one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair. The multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs: a first-type sample fault pair consists of two sample faults annotated with the same category label, and a second-type sample fault pair consists of two sample faults annotated with different category labels.
  • the second determining module 505 is configured to determine the similarity threshold according to the similarity between the fault root causes of the multiple sample fault pairs.
  • control device 50 further includes: a receiving module 506 .
  • the receiving module 506 is configured to receive the similarity model and/or the similarity threshold from the training device. And/or, the receiving module 506 is configured to receive a fault root cause feature set from the training device, where the fault root cause feature set includes correspondences between multiple known faults and fault root cause features.
  • the first acquiring module 501 is configured to acquire fault root cause characteristics of multiple known faults based on the fault root cause characteristic set.
  • control device 50 further includes: a first sending module 507 .
  • the first sending module 507 is configured to send the identifier of the target fault and the fault root cause feature of the target fault to the training device, so that the training device adds the correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set, obtaining an updated fault root cause feature set.
  • control device 50 further includes: a third acquiring module 508, configured to acquire abnormal events generated in the network when a network failure occurs.
  • the third determining module 509 is configured to determine the fault root cause characteristic of the fault based on the abnormal event generated in the network.
  • the first acquisition module 501 is configured to: receive, from the analysis device, similar fault information corresponding to the target fault, where the similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
  • the similar fault information further includes the fault root cause characteristic of the target fault.
  • the first determining module 502 is configured to: evaluate, based on the network configuration of the network, the feasibility of the fault recovery plans corresponding to the similar known faults, where the network configuration includes the networking topology and/or device data, and the device data includes one or more of management plane data, data plane data, or control plane data; and determine one or more of the feasible fault recovery plans as the fault recovery plan corresponding to the target fault.
  • the first determination module 502 is configured to: in response to multiple fault recovery plans being feasible, evaluate, based on the network configuration of the network, the degree of impact of each of the multiple fault recovery plans on the services running on the network; and determine, among the multiple fault recovery plans, the one with the least impact on the services running on the network as the fault recovery plan corresponding to the target fault.
  • control device 50 further includes: a fourth determination module 510 for determining, based on the target failure and the failure recovery plan corresponding to the target failure, the target network device in the network to execute the plan.
  • the second sending module 511 is configured to send a plan execution instruction to the target network device, where the plan execution instruction is used to instruct the target network device to execute the failure recovery plan corresponding to the target failure, and the plan execution command includes the failure recovery plan corresponding to the target failure.
  • the second sending module 511 is further configured to send a plan execution rollback instruction to the target network device, where the plan execution rollback instruction is used to instruct the target network device to restore to the state before executing the failure recovery plan corresponding to the target failure.
  • the receiving module 506 is further configured to receive a set of failure recovery plans from the training device, where the set of failure recovery plans includes a correspondence between multiple known failures and the failure recovery plans.
  • the first determining module 502 is configured to obtain a failure recovery plan corresponding to a similar known failure based on the set of failure recovery plans.
  • the first sending module 507 is further configured to send the identifier of the target fault and the fault recovery plan corresponding to the target fault to the training device, so that the training device adds the correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, obtaining an updated fault recovery plan set.
  • the control device provided by the embodiment of the present application can determine, through the first determining module, the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to a similar known fault of the target fault. That is, regardless of whether the target fault is a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • in addition, because one fault in the network may trigger multiple cascading faults, and the fault root cause reflects where the fault fundamentally lies, similar known faults of a fault in the network are searched for based on the fault root cause. A similar known fault found in this way matches the fault closely, so its fault recovery plan is more likely to apply to the fault, which makes the determined fault recovery plan more reliable.
  • FIG. 11 is a schematic structural diagram of an analysis device provided in an embodiment of the present application. As shown in FIG. 11, the analysis device 110 includes:
  • the acquiring module 1101 is configured to acquire similar known faults whose root cause and the fault root cause of a target fault in the network satisfy a similarity condition among a plurality of known faults.
  • the sending module 1102 is configured to send similar fault information corresponding to the target fault to the control device, where the similar fault information includes an identifier of the target fault and a similar fault list, the similar fault list includes one or more similar known faults of the target fault, and the similar fault information is used by the control device to determine the fault recovery plan corresponding to the target fault.
  • the similar fault information further includes the fault root cause characteristic of the target fault.
  • the acquiring module 1101 is configured to: acquire the fault root cause features of multiple known faults; for each known fault among the multiple known faults, calculate the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause feature of the target fault and the fault root cause feature of the known fault; and determine, according to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the obtaining module 1101 is configured to: determine a known fault whose similarity between the root cause of the fault and the root cause of the target fault is higher than a similarity threshold among the multiple known faults as similar known faults.
  • the obtaining module 1101 is configured to: input the fault root cause feature of the target fault and the fault root cause feature of the known fault into a similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault. The similarity model is trained on the fault root cause features of multiple sample faults. The sample faults are annotated with category labels, and sample faults annotated with the same category label correspond to the same fault recovery plan.
  • the analysis device determines the similar known faults of the target fault through the acquiring module, and sends the similar fault information corresponding to the target fault to the control device through the sending module, so that the control device can determine the fault recovery plan corresponding to the target fault based on the fault recovery plans corresponding to the similar known faults. That is, regardless of whether the target fault is a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.
  • in addition, because one fault in the network may trigger multiple cascading faults, and the fault root cause reflects where the fault fundamentally lies, similar known faults of a fault in the network are searched for based on the fault root cause. A similar known fault found in this way matches the fault closely, so its fault recovery plan is more likely to apply to the fault, which makes the determined fault recovery plan more reliable.
  • FIG. 12 is a schematic structural diagram of a cloud device provided by an embodiment of the present application.
  • the cloud device is the training device in the above embodiment.
  • the cloud device 120 includes:
  • the first obtaining module 1201 is configured to obtain a similarity model, where the similarity model is trained on the fault root cause features of multiple sample faults, the sample faults are annotated with category labels, and sample faults annotated with the same category label correspond to the same fault recovery plan.
  • the first sending module 1202 is configured to send the similarity model to the analyzing device, so that the analyzing device can determine the similar known faults of the target faults occurring in the network, and the similar known faults are used to determine the fault recovery plan corresponding to the target faults.
  • the first acquisition module 1201 is configured to use the fault root cause features of multiple sample faults to train to obtain a similarity model.
  • the cloud device 120 further includes: a second obtaining module 1203, configured to input the fault root cause features of multiple sample fault pairs into the similarity model one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair. The multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs: a first-type sample fault pair consists of two sample faults annotated with the same category label, and a second-type sample fault pair consists of two sample faults annotated with different category labels.
  • the determining module 1204 is configured to determine a similarity threshold according to the similarity between the fault root causes of the multiple sample fault pairs.
  • the first sending module 1202 is further configured to send the similarity threshold to the analysis device.
  • the cloud device 120 further includes a second sending module 1205 .
  • the first sending module 1202 is configured to send the fault root cause feature set to the analysis device, where the fault root cause feature set includes multiple fault root cause feature subsets corresponding to multiple known faults, and each fault root cause feature subset includes the fault root cause features of one known fault.
  • the second sending module 1205 is configured to send a set of failure recovery plans to the control device, where the set of failure recovery plans includes the correspondence between multiple known failures and the failure recovery plans.
  • the cloud device 120 further includes: a receiving module 1206 for receiving from the control device the identifier of the target fault, the fault root cause characteristic of the target fault, and the corresponding fault recovery plan of the target fault.
  • the update module 1207 is configured to add the correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set to obtain an updated fault root cause feature set, and to add the correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set to obtain an updated fault recovery plan set.
  • Embodiments of the present application also provide a control device, including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the actions performed by the management device in the method embodiment corresponding to FIG. 2 or the actions performed by the control device in the method embodiment corresponding to FIG. 3 .
  • the embodiment of the present application also provides an analysis device, including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the actions performed by the analyzing device in the method embodiment corresponding to FIG. 3 .
  • Embodiments of the present application also provide a cloud device, including: a processor and a memory;
  • the memory for storing a computer program, the computer program including program instructions
  • the processor is configured to invoke the computer program to implement the actions performed by the cloud device in the method embodiment corresponding to FIG. 3 .
  • FIG. 15 is a block diagram of an apparatus for determining a fault recovery plan provided by an embodiment of the present application.
  • the device can be a control device, an analysis device or a cloud device.
  • the apparatus 150 includes: a processor 1501 and a memory 1502 .
  • a memory 1502 for storing a computer program, the computer program including program instructions
  • the processor 1501 is configured to invoke the computer program. When the apparatus 150 is a control device, the processor implements the actions performed by the management device in the method embodiment corresponding to FIG. 2, or the actions performed by the control device in the method embodiment corresponding to FIG. 3; when the apparatus 150 is an analysis device, the processor implements the actions performed by the analysis device in the method embodiment corresponding to FIG. 3; when the apparatus 150 is a cloud device, the processor implements the actions performed by the cloud device in the method embodiment corresponding to FIG. 3.
  • the apparatus 150 further includes a communication bus 1503 and a communication interface 1504 .
  • the processor 1501 includes one or more processing cores, and the processor 1501 executes various functional applications and data processing by running a computer program.
  • Memory 1502 may be used to store computer programs.
  • the memory may store the operating system and application program elements required for at least one function.
  • the operating system may be an operating system such as a real-time operating system (Real Time eXecutive, RTX), LINUX, UNIX, WINDOWS, or OS X.
  • the memory 1502 and the communication interface 1504 are respectively connected to the processor 1501 through the communication bus 1503 .
  • Embodiments of the present application further provide a computer storage medium storing instructions that, when executed by a processor, implement the actions performed by the management device, control device, analysis device, or cloud device in the foregoing method embodiments.
  • the embodiment of the present application also provides a system for determining a fault recovery plan, including: a control device and an analysis device.
  • the analysis device is configured to obtain, from among multiple known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in the network satisfy a similarity condition, and to send similar fault information corresponding to the target fault to the control device. The similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
  • the control device is used to obtain a failure recovery plan corresponding to a similar known failure, and determine a failure recovery plan corresponding to the target failure based on the failure recovery plan corresponding to the similar known failure.
  • the system further includes: a cloud device (ie, a training device).
  • the cloud device is configured to train a similarity model using the fault root cause features of multiple sample faults, and to send the similarity model to the analysis device. The sample faults are annotated with category labels, and sample faults annotated with the same category label correspond to the same fault recovery plan.
  • the analysis device is configured to, for each known fault among the multiple known faults, input the fault root cause feature of the target fault and the fault root cause feature of the known fault into the similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault; and to determine, according to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the cloud device is further configured to send a set of failure recovery plans to the control device, where the set of failure recovery plans includes a correspondence between multiple known failures and the failure recovery plans.
  • the control device is used to obtain a fault recovery plan corresponding to a similar known fault based on the set of fault recovery plans.
  • control device is further configured to send an identifier of the target failure and a failure recovery plan corresponding to the target failure to the cloud device.
  • the cloud device is further configured to add the correspondence between the target failure and the failure recovery plan corresponding to the target failure in the set of failure recovery plans, so as to obtain an updated set of failure recovery plans.
  • the cloud device is further configured to send a fault root cause feature set to the analysis device, where the fault root cause feature set includes a correspondence between multiple known faults and fault root cause features.
  • the analysis device is used to acquire fault root cause characteristics of multiple known faults based on the fault root cause characteristic set.
  • the similar fault information further includes the fault root cause characteristic of the target fault.
  • the control device is further configured to send the identification of the target fault and the fault root cause characteristic of the target fault to the cloud device.
  • the cloud device is further configured to add the corresponding relationship between the target fault and the fault root cause characteristics of the target fault in the fault root cause characteristic set, so as to obtain an updated fault root cause characteristic set.
  • the embodiment of the present application also provides another system for determining a fault recovery plan, including: a control device and a cloud device.
  • the cloud device is configured to send a set of failure recovery plans to the control device, where the set of failure recovery plans includes a correspondence between multiple known faults and the failure recovery plans.
  • the control device is configured to obtain, based on the set of fault recovery plans, the fault recovery plan corresponding to a similar known fault, among multiple known faults, whose fault root cause and the fault root cause of a target fault in the network satisfy a similarity condition, and to determine the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault.
  • control device is further configured to send an identifier of the target failure and a failure recovery plan corresponding to the target failure to the cloud device.
  • the cloud device is further configured to add the correspondence between the target failure and the failure recovery plan corresponding to the target failure in the set of failure recovery plans, so as to obtain an updated set of failure recovery plans.
  • the system further includes: an analysis device.
  • the analysis device is configured to obtain, from among multiple known faults, a similar known fault whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition, and to send similar fault information corresponding to the target fault to the control device. The similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
  • the cloud device is configured to train a similarity model using the fault root cause features of multiple sample faults, and to send the similarity model to the analysis device. The sample faults are annotated with category labels, and sample faults annotated with the same category label correspond to the same fault recovery plan.
  • the analysis device is configured to, for each known fault among the multiple known faults, input the fault root cause feature of the target fault and the fault root cause feature of the known fault into the similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault; and to determine, according to the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the similar known fault among the multiple known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.
  • the cloud device is further configured to send a fault root cause feature set to the analysis device, where the fault root cause feature set includes a correspondence between multiple known faults and the fault root cause features.
  • the analysis device is used to acquire fault root cause characteristics of multiple known faults based on the fault root cause characteristic set.
  • the similar fault information further includes the fault root cause characteristic of the target fault.
  • the control device is further configured to send the identification of the target fault and the fault root cause characteristic of the target fault to the cloud device.
  • the cloud device is further configured to add the corresponding relationship between the target fault and the fault root cause characteristics of the target fault in the fault root cause characteristic set, so as to obtain an updated fault root cause characteristic set.

Abstract

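The present application discloses a method, apparatus, and system for determining a fault recovery plan, and a computer storage medium, and belongs to the field of network technologies. First, a control device obtains, from among a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in a network satisfy a similarity condition. The control device then obtains the fault recovery plan corresponding to the similar known fault. Based on the fault recovery plan corresponding to the similar known fault, the control device determines the fault recovery plan corresponding to the target fault. In the present application, regardless of whether the target fault is a known fault, as long as a similar known fault whose fault root cause and the fault root cause of the target fault satisfy the similarity condition can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled.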

Description

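Method, Apparatus, and System for Determining a Fault Recovery Plan, and Computer Storage Medium
This application claims priority to Chinese Patent Application No. 202011123661.5, filed on October 20, 2020 and entitled "Method, Apparatus, and System for Implementing Fault Recovery Plan Recommendation", and to Chinese Patent Application No. 202011622270.8, filed on December 31, 2020 and entitled "Method, Apparatus, and System for Determining a Fault Recovery Plan, and Computer Storage Medium", both of which are incorporated herein by reference in their entireties.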
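Technical Field
The present application relates to the field of network technologies, and in particular to a method, apparatus, and system for determining a fault recovery plan, and a computer storage medium.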
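Background
A network fault is a condition in which a network cannot provide normal service, or provides poor quality of service, because of a hardware problem, a software problem, a network attack, or the like. After a network fault occurs, recovery under a traditional operation and maintenance approach relies on manual judgment followed by an experience-based fault recovery plan, which offers little automation and low efficiency.
Currently, a series of expert rules is usually formulated based on expert experience and fault cases from the live network. An expert rule includes a fault and the fault recovery plan corresponding to that fault. When a network fault occurs, a management device determines the fault recovery plan corresponding to the fault based on the formulated expert rules, and then implements the plan to repair the network, which shortens the time a network device takes to go from the faulty state to the working state. This time is also referred to as the mean time to recovery (MTTR).
However, because expert rules currently specify the fault recovery plan for a fault in a hard-coded manner, only the faults included in the expert rules can be handled, so the range of faults that can be handled is limited.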
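Summary
The present application provides a method, apparatus, and system for determining a fault recovery plan, and a computer storage medium, which can solve the problem that the range of faults that can currently be handled based on expert rules is limited.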
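In a first aspect, a method for determining a fault recovery plan is provided. The method includes: a control device obtains, from among a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in a network satisfy a similarity condition. The control device obtains the fault recovery plan corresponding to the similar known fault. The control device determines the fault recovery plan corresponding to the target fault based on the fault recovery plan corresponding to the similar known fault.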
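In the present application, the control device can determine the fault recovery plan corresponding to a target fault in the network based on the fault recovery plan corresponding to a similar known fault of the target fault. That is, regardless of whether the target fault is a known fault, as long as a similar known fault whose fault root cause and the fault root cause of the target fault satisfy the similarity condition can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which expands the range of faults that can be handled. In addition, because one fault in the network may trigger multiple cascading faults, and the fault root cause reflects where the fault fundamentally lies, similar known faults of a fault in the network are searched for based on the fault root cause. A similar known fault found in this way matches the fault closely, and its fault recovery plan is more likely to apply to the fault, so the determined fault recovery plan is more reliable.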
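Optionally, the fault root cause is represented by a fault root cause feature, which includes a fault root cause object and a fault root cause event. The fault root cause event is the abnormal event that causes the fault, the fault root cause object indicates the type of the fault root cause network entity, and the fault root cause network entity is the network entity to which the fault root cause event belongs. The fault root cause object can be understood as the ontology of the fault root cause network entity, and the fault root cause network entity as an instantiation of the fault root cause object. The types of fault root cause object include device, interface, protocol, or service.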
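Optionally, the fault root cause network entity is a physical interface, and the fault root cause feature further includes one or more of: an interface flapping indication of the fault root cause network entity, an interface suspended-animation indication of the fault root cause network entity, the packet sending and receiving status of the fault root cause network entity, the interface protocol state of the fault root cause network entity, or the physical interface state of the device where the fault root cause network entity is located. Alternatively, the fault root cause network entity is a BGP peer, and the fault root cause feature further includes a BGP route flapping indication of the fault root cause network entity and/or the physical interface state of the device where the fault root cause network entity is located. Alternatively, regardless of the type of the fault root cause network entity, the fault root cause feature further includes the physical interface state of the device where the fault root cause network entity is located.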
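The fault root cause feature described above can be pictured as a small record. The following is only a minimal sketch: the class names, field names, and the example event string are hypothetical and are not taken from the application, which does not prescribe any encoding.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Union


class RootCauseObject(Enum):
    """Types of fault root cause object listed in the text."""
    DEVICE = "device"
    INTERFACE = "interface"
    PROTOCOL = "protocol"
    SERVICE = "service"


@dataclass
class FaultRootCauseFeature:
    # Type of the network entity to which the root cause event belongs.
    root_cause_object: RootCauseObject
    # Abnormal event that caused the fault (hypothetical string encoding).
    root_cause_event: str
    # Optional extra indications, e.g. interface flapping for a physical
    # interface or BGP route flapping for a BGP peer (hypothetical keys).
    extras: Dict[str, Union[bool, str]] = field(default_factory=dict)


feature = FaultRootCauseFeature(
    root_cause_object=RootCauseObject.INTERFACE,
    root_cause_event="physical_interface_down",
    extras={"interface_flapping": True, "interface_protocol_state": "down"},
)
```

Keeping the optional indications in a separate mapping reflects that they depend on the entity type; a real implementation might instead use one subclass per entity type.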
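In a first case, the process in which the control device obtains, from among the plurality of known faults, the similar known fault whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the control device obtains the fault root cause features of the plurality of known faults. For each known fault among the plurality of known faults, the control device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to the fault root cause feature of the target fault and the fault root cause feature of the known fault. The control device determines, according to the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known fault among the plurality of known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.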
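In one possible implementation, the process in which the control device determines, according to the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known fault that satisfies the similarity condition includes: the control device determines, as a similar known fault, each known fault among the plurality of known faults whose fault root cause has a similarity with the fault root cause of the target fault that is higher than a similarity threshold.
In this implementation, similar faults of the target fault are searched for among the known faults based on the similarity threshold, so the similar faults determined are more accurate, which in turn makes the fault recovery plan determined for the target fault more reliable.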
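In another possible implementation, the control device sorts the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, and takes the n known faults whose fault root causes are most similar to the fault root cause of the target fault as the similar known faults of the target fault, where n is a positive integer.
In this implementation, a similar known fault of the target fault can always be found among the known faults, so a fault recovery plan can always be determined for the target fault, and the range of faults that can be handled is large.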
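In yet another possible implementation, the management device takes, as the similar known faults of the target fault, the known faults among the plurality of known faults whose fault root cause similarity with the fault root cause of the target fault is higher than the similarity threshold and that rank in the top m when sorted by that similarity from high to low, where m is a positive integer.
In this implementation, the known faults whose fault root cause similarity with the fault root cause of the target fault is higher than the similarity threshold can be further filtered, which both ensures the accuracy of the similar faults determined and limits their number, reducing subsequent computation.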
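The three selection strategies above (threshold only, top n only, threshold followed by top m) can be sketched with a single helper. This is an illustrative sketch only: the function name, parameter names, and fault identifiers are assumptions, and the similarities are taken as already computed.

```python
from typing import Dict, List, Optional


def select_similar_faults(
    similarities: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[str]:
    """Return known-fault ids ordered from most to least similar.

    similarities maps a known-fault id to its root cause similarity with
    the target fault. threshold keeps only faults strictly above it;
    top_k keeps at most that many of the remaining faults.
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(fid, s) for fid, s in ranked if s > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [fid for fid, _ in ranked]


sims = {"F1": 0.9, "F2": 0.4, "F3": 0.75}
by_threshold = select_similar_faults(sims, threshold=0.7)
by_top_n = select_similar_faults(sims, top_k=1)
combined = select_similar_faults(sims, threshold=0.7, top_k=1)
```

With a threshold alone the result may be empty, whereas top n alone always returns a candidate when any known fault exists, which mirrors the trade-off between accuracy and coverage described in the text.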
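Optionally, the process in which the control device calculates the similarity between the fault root cause of the target fault and the fault root cause of the known fault according to their fault root cause features includes: the control device inputs the fault root cause feature of the target fault and the fault root cause feature of the known fault into a similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault. The similarity model is trained on the fault root cause features of multiple sample faults. The sample faults are annotated with category labels, and sample faults annotated with the same category label correspond to the same fault recovery plan.
Optionally, the control device trains the similarity model using the fault root cause features of the multiple sample faults.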
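Optionally, the control device may further input the fault root cause features of multiple sample fault pairs into the similarity model one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair. The multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs: a first-type sample fault pair consists of two sample faults annotated with the same category label, and a second-type sample fault pair consists of two sample faults annotated with different category labels. The control device determines the similarity threshold according to the similarities between the fault root causes of the multiple sample fault pairs.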
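The text does not state how the similarity threshold is derived from the pair similarities. One plausible reading, shown here purely as an assumption, is to pick the candidate threshold that best separates first-type pairs (same label) from second-type pairs (different labels). The function name and the selection criterion are hypothetical.

```python
from typing import List, Tuple


def choose_threshold(pairs: List[Tuple[float, bool]]) -> float:
    """Pick a similarity threshold from labelled sample fault pairs.

    Each pair is (similarity, same_label). A pair is predicted "similar"
    when its similarity is strictly above the threshold. The candidate
    that classifies the most pairs correctly wins; ties go to the lowest
    candidate. This criterion is an assumption, not the patented method.
    """
    best_threshold, best_correct = 0.0, -1
    for candidate in sorted({s for s, _ in pairs}):
        correct = sum((s > candidate) == same for s, same in pairs)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


sample_pairs = [(0.9, True), (0.8, True), (0.3, False), (0.4, False)]
threshold = choose_threshold(sample_pairs)
```

In this toy data the two classes are cleanly separable, so the highest similarity among different-label pairs is chosen; with overlapping classes the same rule trades false matches against missed matches.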
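Alternatively, the control device receives the similarity model and/or the similarity threshold from a training device. The training device is an upper-layer device of the control device.
In the present application, having the training device train the similarity model and/or determine the similarity threshold centrally allows all control devices managed by the training device to share the similarity model and/or the similarity threshold, which reduces the computation load on the control devices.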
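Optionally, the control device may further receive a fault root cause feature set from the training device, where the fault root cause feature set includes the correspondence between a plurality of known faults and fault root cause features. In this case, the process in which the control device obtains the fault root cause features of the plurality of known faults includes: the control device obtains the fault root cause features of the plurality of known faults based on the fault root cause feature set.
Optionally, the control device may further send the identifier of the target fault and the fault root cause feature of the target fault to the training device, so that the training device adds the correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set, obtaining an updated fault root cause feature set.
In the present application, the control device can report the fault root cause features of live-network faults to the training device in real time, so that the training device automatically updates the fault root cause feature set, which can expand the range of faults that can be handled.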
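Optionally, when a fault occurs in the network, the control device obtains the abnormal events generated in the network. The control device determines the fault root cause feature of the fault based on the abnormal events generated in the network. The target fault may be any fault in the network.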
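In a second case, the process in which the control device obtains, from among the plurality of known faults, the similar known fault whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the control device receives, from an analysis device, similar fault information corresponding to the target fault, where the similar fault information includes an identifier of the target fault and a similar fault list, and the similar fault list includes one or more similar known faults of the target fault.
Optionally, the similar fault information further includes the fault root cause feature of the target fault.
In the second case, the analysis device determines the similar known faults of the target fault among the plurality of known faults, and then sends information about the similar known faults to the control device. For the manner in which the analysis device determines the similar known faults of the target fault, refer to the manner in which the control device does so in the first case; details are not repeated here.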
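The similar fault information exchanged in the second case can be sketched as a small message type. The field names, the identifier strings, and the choice of JSON as a wire format are all assumptions made for illustration; the application does not specify an encoding.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class SimilarFaultInfo:
    # Identifier of the target fault.
    target_fault_id: str
    # Similar fault list: one or more similar known faults of the target.
    similar_faults: List[str]
    # Optional fault root cause feature of the target fault.
    target_root_cause_feature: Optional[Dict[str, str]] = None


# What the analysis device might send to the control device.
message = SimilarFaultInfo("fault-42", ["known-7", "known-3"])
payload = json.dumps(asdict(message))

# What the control device might reconstruct on receipt.
received = SimilarFaultInfo(**json.loads(payload))
```

Carrying the optional root cause feature in the same message lets the control device later forward it to the training device without recomputing it.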
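Optionally, the process in which the control device determines the fault recovery plan corresponding to the target fault based on the fault recovery plans corresponding to the similar known faults includes: the control device evaluates, based on the network configuration of the network, the feasibility of the fault recovery plans corresponding to the similar known faults, where the network configuration includes the networking topology and/or device data, and the device data includes one or more of management plane data, data plane data, or control plane data. The control device determines one or more of the feasible fault recovery plans as the fault recovery plan corresponding to the target fault.
Optionally, the process in which the control device determines one or more of the feasible fault recovery plans as the fault recovery plan corresponding to the target fault includes: in response to multiple fault recovery plans being feasible, the control device evaluates, based on the network configuration of the network, the degree of impact of each of the multiple fault recovery plans on the services running on the network. The control device determines, among the multiple fault recovery plans, the one with the least impact on the services running on the network as the fault recovery plan corresponding to the target fault.
In the present application, the control device determines, among the fault recovery plans corresponding to the similar known faults of the target fault, the one with the least impact on the services running on the network as the fault recovery plan corresponding to the target fault. This both resolves the target fault and minimizes the impact on the services running on the network, improving the reliability and stability of network operation.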
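The two-stage choice described above (filter by feasibility, then pick the least service impact) can be sketched as follows. How feasibility and impact are actually evaluated against the network configuration is not specified here, so both are passed in as caller-supplied functions; the function name, plan names, and impact scores are hypothetical.

```python
from typing import Callable, Iterable, Optional, TypeVar

Plan = TypeVar("Plan")


def choose_recovery_plan(
    candidates: Iterable[Plan],
    is_feasible: Callable[[Plan], bool],
    service_impact: Callable[[Plan], float],
) -> Optional[Plan]:
    """Return the feasible plan with the smallest service impact.

    Returns None when no candidate plan is feasible.
    """
    feasible = [plan for plan in candidates if is_feasible(plan)]
    if not feasible:
        return None
    return min(feasible, key=service_impact)


# Hypothetical candidate plans gathered from similar known faults.
plans = ["isolate_device", "shutdown_interface", "reset_board"]
feasible_plans = {"shutdown_interface", "reset_board"}
impact = {"isolate_device": 0.1, "shutdown_interface": 0.3, "reset_board": 0.8}

chosen = choose_recovery_plan(plans, feasible_plans.__contains__, impact.__getitem__)
```

Note that the lowest-impact plan overall ("isolate_device" in this toy data) is skipped because it is not feasible, which is why feasibility is checked before impact.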
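Optionally, the control device may further determine, based on the target fault and the fault recovery plan corresponding to the target fault, the target network device in the network that is to execute the plan. The control device sends a plan execution instruction to the target network device, where the plan execution instruction instructs the target network device to execute the fault recovery plan corresponding to the target fault and includes that fault recovery plan.
In the present application, after determining the fault recovery plan corresponding to the target fault, the control device may further distribute the plan to the relevant network devices in the network that need to execute it, to achieve end-to-end fault recovery.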
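Optionally, the control device may further send a plan execution rollback instruction to the target network device, where the plan execution rollback instruction instructs the target network device to return to the state it was in before executing the fault recovery plan corresponding to the target fault.
Optionally, in response to receiving a rollback trigger instruction, the control device sends the plan execution rollback instruction to the target network device.
In the present application, after sending the plan execution instruction to the target network device, the control device may further send the plan execution rollback instruction to instruct the target network device to return to the state it was in before executing the fault recovery plan, which provides a state rollback function for network devices. When a network device has executed an unsuitable fault recovery plan, this function can quickly restore the device to its original state and improve the reliability of network operation.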
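Optionally, the control device receives a fault recovery plan set from the training device, where the fault recovery plan set includes the correspondence between a plurality of known faults and fault recovery plans. In this case, the process in which the control device obtains the fault recovery plan corresponding to the similar known fault includes: the control device obtains the fault recovery plan corresponding to the similar known fault based on the fault recovery plan set.
Optionally, the control device sends the identifier of the target fault and the fault recovery plan corresponding to the target fault to the training device, so that the training device adds the correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, obtaining an updated fault recovery plan set.
In the present application, the control device can report the fault recovery plans corresponding to live-network faults to the training device in real time, so that the training device automatically updates the fault recovery plan set, which expands the range of faults that can be handled. In addition, fault recovery plans no longer need to be summarized manually and hard-coded, which makes the plans easier to extend and reduces maintenance effort.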
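The update performed by the training device can be sketched as adding one correspondence to a mapping from fault identifiers to plans. Representing the fault recovery plan set as a dictionary of lists, and the function and identifier names, are assumptions made for illustration.

```python
from typing import Dict, List


def add_recovery_plan(
    plan_set: Dict[str, List[str]], fault_id: str, plan: str
) -> Dict[str, List[str]]:
    """Return an updated copy of the fault recovery plan set.

    The input set is left untouched, so devices still using the old set
    are unaffected until the updated set is distributed to them.
    """
    updated = {fid: list(plans) for fid, plans in plan_set.items()}
    plans_for_fault = updated.setdefault(fault_id, [])
    if plan not in plans_for_fault:  # avoid duplicate correspondences
        plans_for_fault.append(plan)
    return updated


known_plans = {"known-7": ["shutdown_interface"]}
updated_plans = add_recovery_plan(known_plans, "fault-42", "shutdown_interface")
```

Once the reported fault is in the set, it becomes a known fault that later target faults can be matched against, which is how the handled range grows without hard-coded rules.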
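In a second aspect, a method for determining a fault recovery plan is provided. The method includes: an analysis device obtains, from among a plurality of known faults, a similar known fault whose fault root cause and the fault root cause of a target fault in a network satisfy a similarity condition. The analysis device sends similar fault information corresponding to the target fault to a control device, where the similar fault information includes an identifier of the target fault and a similar fault list, the similar fault list includes one or more similar known faults of the target fault, and the similar fault information is used by the control device to determine the fault recovery plan corresponding to the target fault.
Optionally, the similar fault information further includes the fault root cause feature of the target fault.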
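Optionally, the process in which the analysis device obtains, from multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition includes: the analysis device obtains the fault root cause features of the multiple known faults. For each known fault among the multiple known faults, the analysis device calculates, from the fault root cause features of the target fault and the fault root cause features of that known fault, the similarity between the fault root cause of the target fault and the fault root cause of that known fault. The analysis device then determines, from the similarities between the fault root cause of the target fault and the fault root causes of the multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault satisfy the similarity condition.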
可选地,分析设备根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障的实现过程,包括:分析设备将多个已知故障中,故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障确定为相似已知故障。
可选地,分析设备根据目标故障的故障根因特征以及已知故障的故障根因特征,计算目标故障的故障根因与已知故障的故障根因之间的相似度的实现过程,包括:分析设备向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取相似度模型输出的目标故障的故障根因与已知故障的故障根因之间的相似度,相似度模型基于标注有类别标签的多个样本故障的故障根因特征训练得到,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。
第三方面,提供了一种故障恢复预案确定方法。该方法包括:训练设备获取相似度模型,相似度模型基于多个样本故障的故障根因特征训练得到,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。训练设备向分析设备发送相似度模型,供分析设备确定网络中发生的目标故障的相似已知故障,相似已知故障用于确定目标故障对应的故障恢复预案。
可选地,训练设备获取相似度模型的实现过程,包括:训练设备采用多个样本故障的故障根因特征,训练得到相似度模型。
可选地,训练设备还可以向相似度模型分次输入多个样本故障对的故障根因特征,以获取相似度模型输出的每个样本故障对的故障根因之间的相似度,多个样本故障对包括第一类样本故障对和第二类样本故障对,第一类样本故障对包括两个标注有相同类别标签的样本故障,第二类样本故障对包括两个标注有不同类别标签的样本故障。训练设备根据多个样本故障对的故障根因之间的相似度,确定相似度阈值。训练设备向分析设备发送相似度阈值。
可选地,训练设备还可以向分析设备发送故障根因特征集合,故障根因特征集合包括多个已知故障与故障根因特征之间的对应关系。训练设备还可以向控制设备发送故障恢复预案集合,故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。
可选地,训练设备还可以接收来自控制设备的目标故障的标识、目标故障的故障根因特征以及目标故障对应的故障恢复预案。训练设备在故障根因特征集合中添加目标故障与目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合,并在故障恢复预案集合中添加目标故障与目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
第四方面,提供了一种控制设备。所述控制设备包括多个功能模块,所述多个功能模块相互作用,实现上述第一方面及其各实施方式中的方法。所述多个功能模块可以基于软件、硬件或软件和硬件的结合实现,且所述多个功能模块可以基于具体实现进行任意组合或分割。
第五方面,提供了一种分析设备。所述分析设备包括多个功能模块,所述多个功能模块相互作用,实现上述第二方面及其各实施方式中的方法。所述多个功能模块可以基于软件、硬件或软件和硬件的结合实现,且所述多个功能模块可以基于具体实现进行任意组合或分割。
第六方面,提供了一种训练设备。所述训练设备包括多个功能模块,所述多个功能模块相互作用,实现上述第三方面及其各实施方式中的方法。所述多个功能模块可以基于软件、硬件或软件和硬件的结合实现,且所述多个功能模块可以基于具体实现进行任意组合或分割。
第七方面,提供了一种故障恢复预案确定系统。包括:控制设备和分析设备。
所述分析设备用于获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障,并向所述控制设备发送所述目标故障对应的相似故障信息,所述相似故障信息包括所述目标故障的标识和相似故障列表,所述相似故障列表包括所述目标故障的一个或多个相似已知故障。所述控制设备用于获取所述相似已知故障对应的故障恢复预案,并基于所述相似已知故障对应的故障恢复预案,确定所述目标故障对应的故障恢复预案。
可选地,所述系统还包括:训练设备。
所述训练设备用于采用多个样本故障的故障根因特征,训练得到相似度模型,并向所述分析设备发送所述相似度模型,所述样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。所述分析设备用于对于所述多个已知故障中的每个已知故障,向所述相似度模型输入所述目标故障的故障根因特征以及所述已知故障的故障根因特征,以获取所述相似度模型输出的所述目标故障的故障根因与所述已知故障的故障根因之间的相似度,并根据所述目标故障的故障根因与所述多个已知故障的故障根因之间的相似度,确定所述多个已知故障中故障根因与所述目标故障的故障根因满足所述相似度条件的相似已知故障。
可选地,所述训练设备还用于向所述控制设备发送故障恢复预案集合,所述故障恢复预案集合包括所述多个已知故障与故障恢复预案之间的对应关系。所述控制设备用于基于所述故障恢复预案集合,获取所述相似已知故障对应的故障恢复预案。
可选地,所述控制设备还用于向所述训练设备发送所述目标故障的标识以及所述目标故障对应的故障恢复预案。所述训练设备还用于在所述故障恢复预案集合中添加所述目标故障与所述目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
可选地,所述训练设备还用于向所述分析设备发送故障根因特征集合,所述故障根因特征集合包括所述多个已知故障与故障根因特征之间的对应关系。所述分析设备用于基于所述故障根因特征集合,获取所述多个已知故障的故障根因特征。
可选地,所述相似故障信息还包括所述目标故障的故障根因特征。所述控制设备还用于向所述训练设备发送所述目标故障的标识以及所述目标故障的故障根因特征。所述训练设备还用于在所述故障根因特征集合中添加所述目标故障与所述目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
第八方面,还提供了一种控制设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现上述第一方面及其各实施方式中的方法。
第九方面,还提供了一种分析设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现上述第二方面及其各实施方式中的方法。
第十方面,还提供了一种训练设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现上述第三方面及其各实施方式中的方法。
第十一方面,提供了一种计算机存储介质,所述计算机存储介质上存储有指令,当所述指令被计算机设备的处理器执行时,实现上述第一方面及其各实施方式中的方法,或者,实现上述第二方面及其各实施方式中的方法,又或者,实现上述第三方面及其各实施方式中的方法。
第十二方面,提供了一种芯片,芯片包括可编程逻辑电路和/或程序指令,当芯片运行时,实现上述第一方面及其各实施方式中的方法,或者,实现上述第二方面及其各实施方式中的方法,又或者,实现上述第三方面及其各实施方式中的方法。
本申请提供的技术方案带来的有益效果至少包括:
控制设备基于目标故障的相似已知故障对应的故障恢复预案确定该目标故障对应的故障恢复预案,也即是,无论目标故障是否是已知故障,只要能够在已知故障中找到该目标故障的相似已知故障,就能确定该目标故障对应的故障恢复预案,因此扩大了能够处理的故障范围。另外,由于网络中的一个故障可能会引发多个连锁故障,而故障根因能够反映故障的根本所在,因此基于故障根因来查找网络中的故障的相似已知故障,找到的相似已知故障与该故障的匹配度较高,该相似已知故障对应的故障恢复预案适用于该故障的可能性也较高,进而使得确定的故障恢复预案的可靠性较高。进一步地,控制设备还可以向云端设备实时上报现网故障的故障根因特征及其对应的故障恢复预案,实现了云端设备对故障恢复预案集合和故障根因特征集合的自动更新,从而扩大能够处理的故障范围,另外无需人工方式总结故障对应的故障恢复预案并采用硬编码方式,提高了故障恢复预案的扩展灵活性,降低了维护难度。
附图说明
图1是本申请实施例提供的一种故障恢复预案确定系统的结构示意图;
图2是本申请实施例提供的一种故障恢复预案确定方法的流程示意图;
图3是本申请实施例提供的故障恢复预案确定方法的实现过程示意图;
图4是本申请实施例提供的一种故障恢复预案确定系统的功能示意图;
图5是本申请实施例提供的一种控制设备的结构示意图;
图6是本申请实施例提供的另一种控制设备的结构示意图;
图7是本申请实施例提供的又一种控制设备的结构示意图;
图8是本申请实施例提供的再一种控制设备的结构示意图;
图9是本申请实施例提供的还一种控制设备的结构示意图;
图10是本申请另一实施例提供的一种控制设备的结构示意图;
图11是本申请实施例提供的一种分析设备的结构示意图;
图12是本申请实施例提供的一种云端设备的结构示意图;
图13是本申请实施例提供的另一种云端设备的结构示意图;
图14是本申请实施例提供的又一种云端设备的结构示意图;
图15是本申请实施例提供的一种故障恢复预案确定装置的框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请实施例提供了一种故障恢复预案确定方法。首先,控制设备获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障。其次,控制设备获取该相似已知故障对应的故障恢复预案。然后,控制设备基于该相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。一个已知故障可以对应一个或多个故障恢复预案,一个故障恢复预案也可以对应一个或多个已知故障。网络中的目标故障可以是网络中的任一故障。
由于本申请实施例中,控制设备能够基于网络中的目标故障的相似已知故障对应的故障恢复预案确定该目标故障对应的故障恢复预案,也即是,无论目标故障是否是已知故障,只要能够在已知故障中找到故障根因与该目标故障的故障根因满足相似度条件的相似已知故障,就能确定该目标故障对应的故障恢复预案,扩大了能够处理的故障范围。另外,由于网络中的一个故障可能会引发多个连锁故障,而故障根因能够反映故障的根本所在,因此基于故障根因来查找网络中的故障的相似已知故障,找到的相似已知故障与该故障的匹配度较高,该相似已知故障对应的故障恢复预案适用于该故障的可能性也较高,进而使得确定的故障恢复预案的可靠性较高。
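Optionally, a fault root cause is represented by fault root cause features. The fault root cause features include a fault root cause object and a fault root cause event. The fault root cause event is the abnormal event that caused the fault; for example, "physical interface interrupted" may be a fault root cause event, indicating that the current fault was caused by the interruption of a physical interface. The fault root cause object indicates the type of the fault root cause network entity. The fault root cause network entity indicates the specific location where the fault occurred and is the network entity to which the fault root cause event belongs. The fault root cause object can be understood as the ontology of the fault root cause network entity, and the fault root cause network entity as an instance of the fault root cause object. The types of fault root cause object include device, interface, protocol, and service. The device class includes boards, subcards, and the like. The interface class includes physical interfaces, loopback interfaces, virtual local area network (VLAN) interfaces, and the like. The protocol class includes open shortest path first (OSPF), Border Gateway Protocol (BGP), and the like. The service class includes virtual private network (VPN) services, dynamic host configuration protocol (DHCP) services, and the like. For example, if the fault root cause network entity is physical interface A on device A, the fault root cause object of the fault is a physical interface. As another example, if the fault root cause network entity is an OSPF network segment (OSPF network), denoted OSPF network-112.172.7.0-0.0.0.3, the fault root cause object of the fault is OSPF network. As yet another example, if the fault root cause network entity is a virtual extensible local area network (VXLAN) tunnel end point (VTEP), denoted VXLAN tunnel-1.1.1.1-2.2.2.2, where 1.1.1.1 is the source VTEP address and 2.2.2.2 is the destination VTEP address, the fault root cause object of the fault is a VXLAN tunnel.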
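Optionally, when the fault root cause network entity is a physical interface, the fault root cause features may further include one or more of the following: the interface flapping indication of the fault root cause network entity (that is, of the physical interface), the interface hang indication of the physical interface, the packet sending and receiving status of the physical interface, the interface protocol status of the physical interface, or the physical interface status of the device on which the fault root cause network entity resides. Alternatively, when the fault root cause network entity is a BGP peer, the fault root cause features may further include the BGP route flapping indication of the fault root cause network entity and/or the physical interface status of the device on which it resides. Alternatively again, regardless of the type of the fault root cause network entity, the fault root cause features may further include the physical interface status of the device on which the entity resides. The interface flapping indication indicates whether the corresponding physical interface was interrupted multiple times within a short period; for example, if it was, the interface flapping indication of that physical interface is set to 1, and otherwise to 0. The interface hang indication indicates whether, while the corresponding physical interface is in the normal state, its number of received packets or its number of sent packets is 0; for example, if so, the interface hang indication of that physical interface is set to 1, and otherwise to 0. The BGP route flapping indication indicates whether BGP route flapping occurred on the corresponding BGP peer; for example, if it did, the BGP route flapping indication of that BGP peer is set to 1, and otherwise to 0. The physical interface status of the device on which the fault root cause network entity resides reflects whether the physical interfaces of that device are normal (up) or interrupted (down); for example, if all physical interfaces of the device are down, the physical interface status of the device is set to 1, and otherwise to 0.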
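The feature group described above can be sketched as a small record type. This is only an illustration: the field names, the 60-second window, and the "at least three interruptions" rule for flapping are assumptions, since the text says only "multiple interruptions within a short period".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RootCauseFeatures:
    """One known or target fault's root cause features (field names are hypothetical)."""
    root_cause_object: str                      # e.g. "physical_interface", "bgp_peer"
    root_cause_event: str                       # e.g. "physical_interface_interrupted"
    interface_flapping: Optional[int] = None    # 1 = flapping, 0 = not, None = not applicable
    interface_hung: Optional[int] = None
    bgp_route_flapping: Optional[int] = None
    device_all_ports_down: Optional[int] = None


def flapping_indicator(down_timestamps: Sequence[float],
                       window_s: float = 60.0,
                       min_count: int = 3) -> int:
    """Return 1 if at least `min_count` interruptions fall within `window_s` seconds.

    Both thresholds are assumed values chosen for this sketch.
    """
    ts = sorted(down_timestamps)
    for i in range(len(ts) - min_count + 1):
        # Compare the first and last of `min_count` consecutive interruptions.
        if ts[i + min_count - 1] - ts[i] <= window_s:
            return 1
    return 0


features = RootCauseFeatures(
    root_cause_object="physical_interface",
    root_cause_event="physical_interface_interrupted",
    interface_flapping=flapping_indicator([0.0, 10.0, 20.0]),
)
```

Keeping the optional indications as `None` when they do not apply lets one record type cover physical interfaces, BGP peers, and other entity types.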
本申请实施例中,故障恢复预案是根据专家经验结合现网的故障案例,针对网络中可能发生的故障制定的应急处置方案。示例地,故障恢复预案主要包括以下几种:
(1)隔离设备。例如,网络中的故障为:网络设备反复重启或者心跳异常导致交换机跨设备链路聚合组(multichassis link aggregation Group,MLAG)呈双主状态,针对该故障可以制定隔离设备的恢复预案。
(2)隔离单板。例如,网络中的故障为:主控板异常,主控板反复异常,交换网板异常,或者交换网板反复异常等,针对这类故障可以制定隔离单板的恢复预案。
(3)隔离接口。例如,网络中的故障为:接口假死,接口协议状态down,接口闪断,接口链路单通故障,循环冗余校验(cyclic redundancy check,CRC)错误增多,传输控制协议(transmission control protocol,TCP)同步(synchronization,SYN)洪水(flood)攻击,或者地址解析协议(address resolution protocol,ARP)攻击等,针对这类故障可以制定隔离接口的恢复预案。
(4)采用三层访问控制列表(access control list,ACL)隔离虚拟机(virtual machine,VM)。例如,网络中的故障为:TCP SYN flood攻击,针对该故障可以制定采用三层ACL隔离VM的恢复预案。该恢复预案通过在相关设备中配置基于接口或子接口的ACL规则来解决TCP SYN flood攻击。该ACL规则中的源互联网协议(internet protocol,IP)地址为攻击者的IP地址,目的IP地址为全局IP地址,配置流策略应用到入接口方向。
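(5) Isolating the VM with an ARP ACL. For example, if the fault in the network is an ARP attack, a recovery plan of isolating the VM with an ARP ACL can be formulated for it. This recovery plan resolves the ARP attack by configuring ACL rules for ARP packets on the relevant devices.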
(6)采用高级ACL6隔离VM。例如,网络中的故障为:邻居发现协议(neighbor discovery,ND)攻击,针对该故障可以制定采用高级ACL6隔离VM的恢复预案。该恢复预案通过根据接口和VLAN判断发送攻击ND报文所属的虚拟路由转发(virtual routing forwarding)表,并在相关设备中配置高级ACL6规则来解决ND攻击。
(7)重启设备。例如,网络中的故障为:设备芯片软失效、ARP硬表表项丢失、路由表硬表表项丢失、设备表项疑似跳变等,针对这类故障可以制定重启设备的恢复预案。其中,硬表用于存储芯片的运行数据,硬表区别于软表的定义,软表用于存储配置数据。
(8)重启单板。例如,网络中的故障为:主控板异常,主控板反复异常,交换网板异常,以及交换网板反复异常等,针对这类故障可以制定重启单板的恢复预案。
(9)重刷软硬表路由。例如,网络中的故障为:软表和硬表不一致导致业务中断,针对该故障可以制定重刷软硬表路由的恢复预案。
(10)重设OSPF接口的IP地址。
(11)路由平滑对账。该恢复预案通过调用设备的应用程序接口(application program interface,API),对异常的转发表(forwarding info base,FIB)表项进行平滑恢复。
图1是本申请实施例提供的一种故障恢复预案确定系统的结构示意图。如图1所示,该系统包括:管理设备101以及网络中的网络设备102a-102c(统称为网络设备102)。图1中网络设备的数量仅用作示意,不作为对本申请实施例提供的故障恢复预案确定系统的限制。本申请实施例涉及的网络可以是数据中心网络(data center network,DCN)、无线接入网(radio access network,RAN)、分组传送网(packet transport network,PTN)、城域网络、广域网络、园区网络、VLAN或VXLAN等,本申请实施例对网络的类型不做限定。
管理设备101可以是一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心。可选地,管理设备101包括采集设备、分析设备和控制设备。其中,采集设备、分析设备和控制设备可以是物理服务器,或者也可以是虚拟服务器。采集设备、分析设备和控制设备是单独的服务器;或者,采集设备和分析设备集成在一台服务器中;又或者,分析设备和控制设备集成在一台服务器中;又或者,采集设备、分析设备和控制设备集成在一台服务器中。也即是,管理设备101可用作采集设备、分析设备和/或控制设备。管理设备101用于管理和控制网络中的网络设备102,该网络可以是局点网络。不同局点网络可以是按照相应维度划分的不同网络,如,可以是不同地域的网络、不同运营商的网络、不同业务网络、不同网络域等。管理设备101可以是一个或多个设备。网络设备102可以是路由器或交换机等。管理设备101与网络设备102之间通过有线网络或无线网络连接。
可选地,管理设备101中的采集设备用于采集网络中的网络设备102的设备数据,并将采集到的数据存储至数据库供分析设备使用。管理设备101中的分析设备用于基于网络设备102的设备数据对网络进行异常检测,然后根据异常检测过程中产生的多个异常事件对网络进行故障定位,并在已知故障中确定所定位到的故障的相似已知故障。管理设备101中的控制设备用于基于分析设备所定位到的故障的相似已知故障确定该故障对应的故障恢复预案,并向网络中的相关网络设备发送预案执行指令。管理设备101中还可以存储有网络的组网拓扑。
网络中常见的故障类型包括:配置类、表项类、硬件类、拥塞类、攻击类、状态类、资源类和非网络侧故障等,根据以上分析设备故障定位时所需的信息,采集设备获取的设备数据可以包括管理面数据、数据面数据或控制面数据中的至少一种。其中,管理面数据包括配置数据和告警数据等,例如,配置数据包括安全控制策略。数据面数据包括ARP表、媒体访问控制(Media Access Control,MAC)表、路由表、隧道状态表(VXLAN网络)和接口状态等。控制面数据包括中央处理器(central processing unit,CPU)数据、内存数据、链路层发现协议(link layer discovery protocol,LLDP)状态、BGP状态和OSPF状态等,BGP和OSPF均为路由协议。
可选地,采集设备周期性地采集网络设备102的设备数据。例如采集设备采用简单网络管理协议(simple network management protocol,SNMP)或网络遥测(network telemetry)技术采集网络设备的设备数据。或者,当网络设备102的设备数据发生变更时,网络设备102主动向采集设备上报变更后的设备数据。
可选地,请继续参见图1,该系统还包括训练设备103。训练设备103可以是一台服务器,或者由若干台服务器组成的服务器集群,或者是一个云计算服务中心。该训练设备103为管理设备101的上级设备,能够管理一个或多个管理设备101。训练设备103可以训练用于处理数据的模型(例如相似度模型),并为管理设备101提供处理数据的集合(例如故障根因特征集合、故障恢复预案集合)和/或用于处理数据的模型等。当训练设备103用于管理多个管理设备101时,该多个管理设备101可以共享训练设备103提供的处理数据的集合和/或用于处理数据的模型。训练设备103与管理设备101可以是单独的设备,或者也可以集成在一台设备中,本申请实施例对此不做限定。训练设备103也可称为云端设备。
图2是本申请实施例提供的一种故障恢复预案确定方法的流程示意图。该方法可以应用于如图1所示的故障恢复预案系统中。如图2所示,该方法包括:
步骤201、云端设备向管理设备发送故障根因特征集合和故障恢复预案集合。
故障根因特征集合包括多个已知故障与故障根因特征之间的对应关系。故障根因特征集合中的故障根因特征是以组的形式存储的,每组故障根因特征属于一个已知故障。本申请实施例中,一个已知故障的故障根因特征至少包括故障根因对象和故障根因事件。
可选地,故障根因特征集合包括已知故障的故障标识与故障根因特征的对应关系。故障标识包括故障ID和/或故障名称。例如,故障根因特征集合可以如表1所示。
表1
Figure PCTCN2021124377-appb-000001
参见表1,故障根因特征的类型包括故障根因对象、故障根因事件和接口闪断指示。
故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。其中,一个已知故障可以对应一个或多个故障恢复预案,一个故障恢复预案也可以对应一个或多个已知故障。
可选地,故障恢复预案集合包括已知故障的故障标识与故障恢复预案的对应关系。例如,故障恢复预案集合可以如表2所示。
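Table 2
Fault ID | Fault name | Fault recovery plan
10000001 | Interface flapping | Isolate interface
10000002 | Main control board abnormal | 1. Isolate board; 2. Restart board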
云端设备中存储有故障根因特征集合以及故障恢复预案集合。云端设备可以通过收集网络中大量的故障案例,并根据设定的故障根因特征的类型,提取每个故障的故障根因特征,以生成初始故障根因特征集合。初始故障恢复预案集合中的故障恢复预案可以基于专家经验制定得到。
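As a rough illustration of the two collections, both can be held as mappings keyed by fault ID. The recovery plans below mirror Table 2. The feature values are hypothetical placeholders, and the helper name `plans_for` is an assumption of this sketch.

```python
# Fault root cause feature set: fault ID -> one group of root cause features.
# The values are hypothetical placeholders, not taken from the publication.
ROOT_CAUSE_FEATURE_SET = {
    "10000001": {"object": "physical_interface", "event": "interface_down", "flapping": 1},
    "10000002": {"object": "board", "event": "main_control_board_abnormal", "flapping": 0},
}

# Fault recovery plan set: fault ID -> one or more recovery plans (as in Table 2).
RECOVERY_PLAN_SET = {
    "10000001": ["isolate interface"],
    "10000002": ["isolate board", "restart board"],
}


def plans_for(fault_ids, plan_set=RECOVERY_PLAN_SET):
    """Collect the plans of the given known faults, de-duplicated, in first-seen order."""
    seen, ordered = set(), []
    for fault_id in fault_ids:
        for plan in plan_set.get(fault_id, []):
            if plan not in seen:
                seen.add(plan)
                ordered.append(plan)
    return ordered
```

A mapping from fault ID to a list of plans covers one side of the many-to-many relation directly; the other side (one plan shared by several faults) is expressed simply by repeating the plan name under each fault.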
步骤202、当网络发生故障时,管理设备获取网络中产生的异常事件。
可选地,管理设备在获取网络中的网络设备的设备数据后,对各个网络设备的设备数据进行异常检测,以获取网络中产生的异常事件。
可选地,管理设备对网络设备的设备数据进行异常检测的实现过程包括:管理设备对告警数据进行告警分析与聚合,以减少告警数据量,再从聚合后的告警数据中提取异常事件。和/或,管理设备对海量的日志进行日志异常检测,例如采用日志模板挖掘和/或日志罕见度分析的方式进行日志异常检测,以得到异常事件。和/或,管理设备对上报的关键绩效指标(key performance indicator,KPI)进行异常检测,例如将发生突变的KPI作为异常KPI。
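The text does not say how a sudden change in a KPI is recognized. One simple possibility, shown purely as an assumption, is to flag a sample that deviates from the mean of the preceding window by more than a multiple of that window's standard deviation. The window length and the multiplier `k` are illustrative values.

```python
from statistics import mean, pstdev


def abnormal_kpi_indices(values, window=5, k=3.0):
    """Return indices of samples that jump away from the preceding `window` samples.

    A sample is flagged when it differs from the window mean by more than
    k standard deviations (with a tiny floor so a flat history still works).
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if abs(values[i] - mu) > k * max(sigma, 1e-9):
            flagged.append(i)
    return flagged


# A spike at index 5 is flagged; the return to normal at index 6 is not,
# because the spike itself widens the spread of the preceding window.
spikes = abnormal_kpi_indices([10, 10, 11, 10, 10, 50, 10])
```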
可选地,异常事件包括告警日志、状态变化日志或异常KPI中的一个或多个。告警日志中包括异常网络实体的标识以及告警类型。状态变化日志中包括配置文件变化信息和/或路由表项变化信息等,例如状态变化日志中可以包括“接入子接口删除”以及“目的IP主机路由删除”等信息。异常KPI用于描述某个网络实体的某种指标出现异常。
为了便于说明,本申请以下实施例中将网络中发生的该次故障称为目标故障。
步骤203、管理设备基于网络中产生的异常事件,确定目标故障的故障根因特征。
可选地,管理设备对网络中产生的异常事件进行基于专家规则的故障定位或者基于网络知识图谱的溯源推理,以在网络中定位目标故障的故障根因对象和故障根因事件。
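Optionally, the process in which the management device performs source-tracing inference, based on a network knowledge graph, on the abnormal events generated in the network includes the following. First, the management device generates a knowledge graph of the network that it manages. After obtaining the abnormal events generated because of the network fault, the management device determines the abnormal network entities in the network that generated those abnormal events; for example, it may mark these abnormal network entities on the knowledge graph. The management device then determines, based on the fault propagation relationships between network entities, one or more fault root cause network entities among all the abnormal network entities. The type of a network entity on the knowledge graph is device, interface, protocol, or service. The management device determines the fault root cause object of the fault from the fault root cause network entity; for example, if the fault root cause network entity is physical interface A on device A, the fault root cause object of the fault is a physical interface. The management device determines the abnormal events associated with the fault root cause network entity (that is, the abnormal events that caused the fault) as the fault root cause events of the fault.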
可选地,管理设备获取故障传播关系的过程包括:管理设备获取多个知识图谱样本,每个知识图谱样本上分别标识有该知识图谱样本所属的网络发生一次故障时,该知识图谱样本所属的网络中产生异常事件的所有异常网络实体以及故障根因网络实体。管理设备基于该多个知识图谱样本,确定故障传播关系。其中,每个知识图谱样本为一个故障案例,知识图谱样本中的异常网络实体以及故障根因网络实体可以是人工确定的。可选地,管理设备可以采用图嵌入算法等学习该多个知识图谱样本中的故障传播关系。或者,当同一知识图谱三元组中的两个网络实体同时发生异常的概率大于某个阈值时,管理设备可以确定该两个网络实体之间会进行故障传播。
示例地,当网络设备的接口发生故障时,会导致该接口无法正常通信,进而会导致该接口采用的路由IP不通。因此管理设备可以得到一组故障传播关系:接口故障会导致该接口采用的路由IP不通。当管理设备获取到用于指示接口故障的第一异常事件以及用于指示该接口采用的路由IP不通的第二异常事件时,管理设备确定第一异常事件为故障根因事件,并确定该接口为故障根因对象。
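The interface example can be sketched as a small graph rule: among the abnormal entities, a root cause candidate is one whose abnormality is not explained by another abnormal entity through a known propagation relationship. The entity names and the `(cause, effect)` edge encoding are assumptions made for illustration. A real inference over a knowledge graph would also have to handle cycles and multi-hop chains more carefully.

```python
def root_cause_entities(abnormal_entities, propagation_edges):
    """Return abnormal entities that no other abnormal entity propagates a fault to.

    propagation_edges: iterable of (cause_entity, effect_entity) pairs.
    """
    abnormal = set(abnormal_entities)
    explained = {
        effect
        for cause, effect in propagation_edges
        if cause in abnormal and effect in abnormal
    }
    return sorted(abnormal - explained)


# The interface fault makes the route that uses the interface unreachable.
edges = [("interface:A/1", "route:10.0.0.1")]
roots = root_cause_entities({"interface:A/1", "route:10.0.0.1"}, edges)
```

Here `roots` contains only the interface, which matches the text: the first abnormal event is the root cause event and the interface is the root cause object.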
本申请实施例中,管理设备可以采用多个知识图谱样本学习网络实体间的故障传播关系,并基于该故障传播关系,确定目标网络的知识图谱上的异常网络实体中的故障根因网络实体,进而确定故障根因特征,实现了网络故障根因的自动推理和定位。
可选地,网络实体间的故障传播关系也可以由其它设备确定后发送至管理设备,其它设备确定网络实体间的故障传播关系的方式可参考上述管理设备确定网络实体间的故障传播关系的方式,本申请实施例在此不做赘述。当然,故障传播关系也可以基于专家规则制定得到。
步骤204、对于多个已知故障中的每个已知故障,管理设备根据目标故障的故障根因特征以及该已知故障的故障根因特征,计算目标故障与该已知故障之间的相似度。
可选地,目标故障与已知故障之间的相似度等于目标故障的各个故障根因特征与已知故障的各个故障根因特征之间的相似度的加权平均值。例如,故障根因特征的类型包括故障根因对象、故障根因事件和接口闪断指示。目标故障的故障根因对象与已知故障的故障根因对象之间的相似度为第一相似度,目标故障的故障根因事件与已知故障的故障根因事件之间的相似度为第二相似度,目标故障的接口闪断指示与已知故障的接口闪断指示之间的相似度为第三相似度,则目标故障与已知故障之间的相似度等于第一相似度、第二相似度与第三相似度的加权平均值。
可选地,步骤204的实现过程包括:管理设备向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取该相似度模型输出的目标故障的故障根因与该已知故障的故障根因之间的相似度。相似度模型基于多个样本故障的故障根因特征训练得到。样本故障标注有类别标签。其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。标注有不同类别标签的样本故障对应的故障恢复预案可能相同,也可能不同。该相似度模型为采用有监督学习方式训练得到的机器学习模型。
在一种实现方式中,管理设备在获取多个样本故障后,采用多个样本故障的故障根因特征,训练得到相似度模型。
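Optionally, each fault has d fault root cause features, and the similarity model is expressed as follows:
$$\mathrm{similarity}(a,b)=\sum_{k=1}^{d} w_k\, s(a_k,b_k)$$
where similarity(a,b) denotes the similarity between fault a and fault b, $s(a_k,b_k)$ denotes the similarity between the k-th fault root cause feature of fault a and the k-th fault root cause feature of fault b, and $w_k$ denotes the weight of the k-th fault root cause feature, with
$$\sum_{k=1}^{d} w_k = 1$$
For a discrete fault root cause feature, the similarity satisfies:
$$s(a_k,b_k)=\begin{cases}1, & a_k=b_k\\ 0, & a_k\neq b_k\end{cases}$$
For a continuous fault root cause feature, the similarity satisfies:
$$s(a_k,b_k)=1-\frac{\lvert a_k-b_k\rvert}{\max-\min}$$
where $a_k$ and $b_k$ have the same value range, max denotes the maximum value in the value range of $a_k$ or $b_k$, and min denotes the minimum value in that range.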
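A minimal sketch of the weighted-sum similarity follows, assuming that discrete features score 1 on an exact match and 0 otherwise, that continuous features score one minus the range-normalized absolute difference, and that the weights sum to 1. The feature names and weight values are hypothetical.

```python
def feature_similarity(a, b, value_range=None):
    """Similarity of one feature pair.

    value_range=None      -> discrete feature: 1.0 on exact match, else 0.0.
    value_range=(lo, hi)  -> continuous feature: 1 - |a - b| / (hi - lo).
    """
    if value_range is None:
        return 1.0 if a == b else 0.0
    lo, hi = value_range
    return 1.0 - abs(a - b) / (hi - lo)


def fault_similarity(fa, fb, weights, ranges=None):
    """Weighted sum of per-feature similarities; `weights` is assumed to sum to 1."""
    ranges = ranges or {}
    return sum(
        w * feature_similarity(fa[name], fb[name], ranges.get(name))
        for name, w in weights.items()
    )


weights = {"object": 0.4, "event": 0.4, "flapping": 0.2}   # hypothetical weights
target = {"object": "physical_interface", "event": "link_down", "flapping": 1}
known = {"object": "physical_interface", "event": "link_down", "flapping": 0}
score = fault_similarity(target, known, weights)
```

With these inputs the object and event features match and the flapping indication does not, so the score is the sum of the first two weights.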
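In the embodiments of this application, training the above similarity model means determining the weights of the d fault root cause features such that the similarity between sample faults labeled with the same class label is greater than the similarity between sample faults labeled with different class labels.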
可选地,管理设备在训练得到相似度模型后,还可以采用多个样本故障的故障根因特征,调用该相似度模型确定相似度阈值,具体包括:管理设备向相似度模型分次输入多个样本故障对的故障根因特征,以获取该相似度模型输出的每个样本故障对的故障根因之间的相似度。然后管理设备根据多个样本故障对的故障根因之间的相似度,确定相似度阈值。其中,每个样本故障对包括两个样本故障,一个样本故障对的故障根因之间的相似度即该样本故障对中的两个样本故障的故障根因之间的相似度。该多个样本故障对包括第一类样本故障对和第二类样本故障对,第一类样本故障对包括两个标注有相同类别标签的样本故障,第二类样本故障对包括两个标注有不同类别标签的样本故障。
相似度阈值可以是标注有相同类别标签的样本故障的故障根因之间的相似度与标注有不同类别标签的样本故障的故障根因之间的相似度的分界值。例如,向训练好的相似度模型分次输入多个样本故障对的故障根因特征,针对第一类样本故障对,该相似度模型输出的相似度绝大多数大于目标阈值,针对第二类样本故障对,该相似度模型输出的相似度绝大多数小于目标阈值,则可以将该目标阈值确定为相似度阈值。例如该相似度阈值可以取值为0.9。
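One way to locate such a dividing value, offered only as a sketch, is to try the midpoints between adjacent observed scores and keep the one that classifies the most sample pairs correctly: same-label pairs above the threshold, different-label pairs at or below it. The selection rule and the function name are assumptions, because the text does not prescribe a procedure.

```python
def choose_threshold(same_label_scores, diff_label_scores):
    """Pick the candidate threshold that separates the two kinds of pairs best.

    Both lists together must contain at least one score.
    """
    scores = sorted(set(same_label_scores) | set(diff_label_scores))
    # Midpoints between adjacent distinct scores; fall back to the single score.
    candidates = [(x + y) / 2 for x, y in zip(scores, scores[1:])] or scores

    def correctly_separated(t):
        above = sum(s > t for s in same_label_scores)
        below = sum(s <= t for s in diff_label_scores)
        return above + below

    return max(candidates, key=correctly_separated)


# Similarities output by the model for first-type and second-type sample pairs.
threshold = choose_threshold([0.95, 0.92, 0.97], [0.30, 0.50, 0.88])
```

In this toy data the two groups are cleanly separable, and the chosen threshold is the midpoint between 0.88 and 0.92.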
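In the embodiments of this application, the sample faults used to train the similarity model and the sample faults used to determine the similarity threshold may be the same or different. The former are used to adjust the parameters of the similarity model. The latter are used to collect statistics on the similarities between the fault root causes of same-class sample faults and between the fault root causes of different-class sample faults, in order to find a suitable similarity threshold.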
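In another implementation, the management device receives the similarity model and/or the similarity threshold from the cloud device. That is, after obtaining multiple sample faults, the cloud device trains the similarity model using the fault root cause features of the multiple sample faults. After training the similarity model, the cloud device may also use the fault root cause features of multiple sample faults to invoke the similarity model and determine the similarity threshold. For how the cloud device trains the similarity model and determines the similarity threshold, refer to the process described above for the management device; details are not repeated here.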
步骤205、管理设备根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定该多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
在一种可能实现方式中,管理设备将多个已知故障中,故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障确定为目标故障的相似已知故障。
可选地,管理设备将多个已知故障中故障根因与目标故障的故障根因之间的相似度高于相似度阈值的所有已知故障均确定为目标故障的相似已知故障,也即是,目标故障可以具有一个或多个相似已知故障。当然,采用该方式管理设备也可能在已知故障中找不到目标故障的相似已知故障。
该实现方式中,基于相似度阈值在已知故障中查找目标故障的相似故障,确定的相似故障的准确度较高,进而可以使得确定的目标故障对应的故障恢复预案的可靠性较高。
在另一种可能实现方式中,管理设备对目标故障的故障根因与多个已知故障的故障根因之间的相似度进行排序,并将故障根因与目标故障的故障根因之间相似度最高的n个已知故障作为目标故障的相似已知故障,n为正整数。
例如,管理设备将故障根因与目标故障的故障根因之间相似度最高的3个已知故障作为目标故障的相似已知故障。
该实现方式中,总能在已知故障中找到目标故障的相似已知故障,进而总能确定目标故障对应的故障恢复预案,能够处理的故障范围较大。
在又一种可能实现方式中,管理设备将多个已知故障中满足故障根因与目标故障的故障根因之间的相似度高于相似度阈值,且按照故障根因与目标故障的故障根因之间的相似度由高至低的排序方式属于前m个的已知故障作为目标故障的相似已知故障,m为正整数。
该实现方式中,能够对故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障进行筛选,既能保证确定的相似故障的准确度,又能限制确定的相似故障的数量,减小后续计算量。
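The three selection strategies above (threshold only, top n only, threshold plus top m) differ only in which filters are applied, so they can be sketched with one helper. The function and parameter names are assumptions.

```python
def select_similar(similarities, threshold=None, top=None):
    """Select similar known faults from {fault_id: similarity}.

    threshold only -> every fault whose similarity is above the threshold
    top only       -> the `top` most similar faults
    both           -> the `top` most similar faults among those above the threshold
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(fault, s) for fault, s in ranked if s > threshold]
    if top is not None:
        ranked = ranked[:top]
    return [fault for fault, _ in ranked]


sims = {"fault-A": 0.95, "fault-B": 0.91, "fault-C": 0.40}
by_threshold = select_similar(sims, threshold=0.9)
by_top_n = select_similar(sims, top=3)
combined = select_similar(sims, threshold=0.9, top=1)
```

The threshold-only variant can return an empty list, which corresponds to the case where no similar known fault is found. The top-n variant always returns a result as long as any known fault exists.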
步骤206、管理设备基于目标故障的相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。
可选地,步骤206的实现过程包括:管理设备基于网络的网络配置,评估目标故障的相似已知故障对应的故障恢复预案的可行性。管理设备将可行的故障恢复预案中的一个或多个故障恢复预案确定为目标故障对应的故障恢复预案。网络配置包括组网拓扑和/或设备数据。
例如,若一个设备所在链路具有冗余链路(备份链路),则针对该设备“隔离设备”这个故障恢复预案可行;若一个设备所在链路不具有冗余链路,则针对该设备“隔离设备”这个故障恢复预案不可行。
可选地,响应于目标故障的相似已知故障对应的多个故障恢复预案可行,首先管理设备基于该网络的网络配置,分别评估多个故障恢复预案对网络所运行业务的影响程度。然后管理设备将多个故障恢复预案中,对网络所运行业务的影响程度最小的故障恢复预案确定为目标故障对应的故障恢复预案。
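For example, suppose the similar known fault of the target fault is an ARP attack fault, and the ARP attack fault has two corresponding fault recovery plans: isolating the VM and isolating the interface. After determining that both plans are feasible, the management device evaluates the degree of impact of each on the services running on the network. Isolating the VM requires an ACL block for each MAC address of the ARP attack sources, that is, the ACL resources required equal the number of ARP attack sources, so this plan is feasible provided that the network has sufficient ACL resources. Isolating the VM affects only the isolated attack-source VMs, so its impact on the services running on the network is small. Isolating the interface affects all VMs attached to the isolated interface (both attack-source VMs and normal VMs), so its impact on the services running on the network is larger. Because isolating the VM has less impact than isolating the interface, the management device takes isolating the VM as the fault recovery plan corresponding to the target fault. If insufficient ACL resources in the network make isolating the VM infeasible, the management device may take isolating the interface as the fault recovery plan corresponding to the target fault.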
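The ARP attack example can be sketched as a two-stage choice: drop infeasible plans, then take the feasible plan with the smallest impact. Measuring impact as the number of affected VMs, and treating interface isolation as always feasible, are simplifying assumptions of this sketch. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanAssessment:
    plan: str
    feasible: bool
    impact: float   # smaller means less disruption to running services


def choose_plan(assessments: List[PlanAssessment]) -> Optional[str]:
    """Return the feasible plan with the least impact, or None if none is feasible."""
    feasible = [a for a in assessments if a.feasible]
    if not feasible:
        return None
    return min(feasible, key=lambda a: a.impact).plan


def assess_arp_plans(attack_sources: int, free_acl_entries: int,
                     vms_on_interface: int) -> List[PlanAssessment]:
    """Assess the two plans of the ARP attack example."""
    return [
        # One ACL entry is needed per attack source; only attackers are affected.
        PlanAssessment("isolate VM", free_acl_entries >= attack_sources, attack_sources),
        # Every VM attached to the interface is affected.
        PlanAssessment("isolate interface", True, vms_on_interface),
    ]


enough_acl = choose_plan(assess_arp_plans(2, 10, 8))
short_of_acl = choose_plan(assess_arp_plans(2, 1, 8))
```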
或者,管理设备在获取目标故障的相似已知故障对应的故障恢复预案之后,也可以输出该相似已知故障对应的所有故障恢复预案,并将选择指令所指定的故障恢复预案作为目标故障对应的故障恢复预案,该选择指令可以由运维人员触发。例如管理设备可以将目标故障的相似已知故障对应的所有故障恢复预案发送给运维支撑系统(operations support system,OSS)或其它与管理设备连接的终端设备,供OSS或终端设备显示。当然,若管理设备自身具有显示功能,则管理设备也可以在自身的显示界面上显示目标故障的相似已知故障对应的所有故障恢复预案。管理设备在输出目标故障的相似已知故障对应的故障恢复预案后,可以由运维人员指定其中的一个故障恢复预案作为目标故障对应的故障恢复预案,或者也可以由运维人员输入其它故障恢复预案作为目标故障对应的故障恢复预案,本申请实施例对此不做限定。
可选地,响应于多个已知故障中不存在故障根因与目标故障的故障根因满足相似度条件的相似已知故障,管理设备输出该目标故障的故障标识以及故障根因特征,以便由运维人员确定该目标故障对应的故障恢复预案。例如管理设备将目标故障的故障标识以及故障根因特征发送给OSS或其它与管理设备连接的终端设备,供OSS或终端设备显示。当然,若管理设备自身具有显示功能,则管理设备也可以在自身的显示界面上显示目标故障的故障标识以及故障根因特征。
本申请实施例中,管理设备能够基于目标故障的相似已知故障对应的故障恢复预案确定该目标故障对应的故障恢复预案,也即是,无论目标故障是否是已知故障,只要能够在已知故障中找到该目标故障的相似已知故障,就能确定该目标故障对应的故障恢复预案,因此扩大了能够处理的故障范围。另外,由于网络中的一个故障可能会引发多个连锁故障,而故障根因能够反映故障的根本所在,因此基于故障根因来查找网络中的故障的相似已知故障,找到的相似已知故障与该故障的匹配度较高,该相似已知故障对应的故障恢复预案适用于该故障的可能性也较高,进而使得确定的故障恢复预案的可靠性较高。
可选地,管理设备在确定目标故障对应的故障恢复预案之后,还可以向网络中需要执行该故障恢复预案的相关网络设备分发该故障恢复预案,以实现端到端地故障恢复,该过程参见以下步骤207至步骤208。
步骤207、管理设备基于目标故障以及目标故障对应的故障恢复预案,确定网络中待执行预案的目标网络设备。
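Optionally, the target network devices in the network that are to execute the plan include the device on which the fault root cause network entity of the target fault resides and/or access devices in the network. For example, when the fault root cause object of the target fault is a device or an interface, the target network device is usually the device on which the fault root cause network entity resides. As another example, when the fault root cause event of the target fault is an OSPF router ID conflict that causes DHCP service timeouts, the target network devices are usually the network devices on which the OSPF router ID conflict occurred. As yet another example, when the target fault is an ARP attack fault, the target network device is the edge device (that is, the access device) to which the attack-source VM is attached.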
步骤208、管理设备向目标网络设备发送预案执行指令。
该预案执行指令用于指示目标网络设备执行目标故障对应的故障恢复预案。该预案执行指令包括目标故障对应的故障恢复预案。
可选地,预案执行指令中包括的目标故障对应的故障恢复预案可以是该故障恢复预案的执行脚本。例如,“隔离设备”这个故障恢复预案的执行脚本内容包括:
a)判断设备角色,例如判断当前设备为spine(脊)设备还是leaf(叶)设备。
b)如果当前设备为leaf设备或非spine的leaf合设备,记录当前设备的spine侧接口的cost(代价)值,然后将该cost值调整至最大值;遍历当前设备的接入侧接口,记录接入侧接口当前状态,然后对除带内管理的管理口以外的非down状态的接入侧接口执行shutdown。
c)如果当前设备为spine设备,若该spine设备是独立设备组或设备组成员已经被隔离,则当前设备不能执行隔离操作;否则记录当前设备连spine接口的cost值,然后将该cost值调整至最大值;遍历接leaf连spine的接口,记录接口当前状态,然后将非down状态的接口执行shutdown。
步骤209、管理设备向目标网络设备发送预案执行回退指令。
该预案执行回退指令用于指示目标网络设备恢复至执行目标故障对应的故障恢复预案之前的状态。可选地,响应于接收到回退触发指令,管理设备向目标网络设备发送预案执行回退指令。该回退触发指令可以是由运维人员在管理设备上执行指定操作触发,例如,当管理设备检测到对某个按键的按压操作,则确定接收到回退触发指令。或者,该回退触发指令也可以来自其它设备,也即是,管理设备可以在其它设备的控制指令(回退触发指令)下向目标网络设备发送预案执行回退指令。
本申请实施例中,管理设备在向目标网络设备发送预案执行指令之后,还可以向目标网络设备发送预案执行回退指令,以指示目标网络设备恢复至执行故障恢复预案之前的状态,实现了网络设备的状态回滚功能。在网络设备执行了不合理的故障恢复预案的场景下,该功能可以快速使网络设备恢复至原始状态,提高网络运行可靠性。
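The rollback behaviour can be illustrated with an in-memory stand-in for a network device: the state is recorded before the plan is executed, and the rollback instruction restores it. The class and method names are hypothetical, and no real device API is implied.

```python
class DeviceAgent:
    """Toy model of a device that records its state before executing a plan."""

    def __init__(self, interfaces):
        self.interfaces = dict(interfaces)   # interface name -> "up" / "down"
        self._snapshot = None

    def execute_isolation(self, keep_up=()):
        """Record the current state, then shut down every non-down interface
        except those listed in keep_up (for example the management port)."""
        self._snapshot = dict(self.interfaces)
        for name, state in self.interfaces.items():
            if state != "down" and name not in keep_up:
                self.interfaces[name] = "down"

    def rollback(self):
        """Restore the recorded state; return False if there is nothing to restore."""
        if self._snapshot is None:
            return False
        self.interfaces = self._snapshot
        self._snapshot = None
        return True


device = DeviceAgent({"mgmt0": "up", "eth1": "up", "eth2": "down"})
device.execute_isolation(keep_up=("mgmt0",))
isolated_state = dict(device.interfaces)
rolled_back = device.rollback()
```

Recording the pre-execution state on the device side is what makes the rollback instruction cheap: the controller only has to send the trigger, not the previous configuration.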
可选地,管理设备在确定目标故障对应的故障恢复预案之后,还可以向云端设备发送该目标故障对应的故障恢复预案和/或目标故障的故障根因特征,以供云端设备更新故障恢复预案集合和/或故障根因特征集合,该实现过程参见以下步骤210至步骤213。
步骤210、管理设备向云端设备发送目标故障的标识以及目标故障对应的故障恢复预案。
该目标故障的标识以及目标故障对应的故障恢复预案用于供云端设备在故障恢复预案集合中添加目标故障与该目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
步骤211、云端设备在故障恢复预案集合中添加目标故障与该目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
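Optionally, each time the cloud device updates the fault recovery plan set, it may send the updated fault recovery plan set to the management device; alternatively, the cloud device periodically sends the latest fault recovery plan set to the management device.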
步骤212、管理设备向云端设备发送目标故障的标识以及目标故障的故障根因特征。
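The identifier of the target fault and the fault root cause features of the target fault are used by the cloud device to add, to the fault root cause feature set, the correspondence between the target fault and the fault root cause features of the target fault, yielding an updated fault root cause feature set.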
可选地,步骤210与步骤212可以同时执行,即管理设备向云端设备同步发送目标故障对应的故障恢复预案以及目标故障的故障根因特征。例如,目标故障为接口闪断故障,管理设备向云端设备发送的内容可以如表3所示。
表3
Figure PCTCN2021124377-appb-000006
步骤213、云端设备在故障根因特征集合中添加目标故障与该目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
可选地,云端设备每更新故障根因特征集合后,可以将更新后的故障根因特征集合发送给管理设备,或者,云端设备周期性地向管理设备发送最新的故障根因特征集合。
可选地,当相似度模型由云端设备训练得到,云端设备在接收到管理设备上报的现网故障的故障根因特征及其对应的故障恢复预案后,可以持续训练和更新相似度模型,以提高模型的准确性和可靠性。
本申请实施例中,管理设备可以向云端设备实时上报现网故障的故障根因特征及其对应的故障恢复预案,实现了云端设备对故障恢复预案集合和故障根因特征集合的自动更新,从而扩大能够处理的故障范围,另外无需人工方式总结故障对应的故障恢复预案并采用硬编码方式,提高了故障恢复预案的扩展灵活性,降低了维护难度。
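The automatic update on the cloud device amounts to adding two correspondences when a report arrives. A minimal sketch follows, with a hypothetical class name and report signature.

```python
class FaultKnowledgeBase:
    """Holds the fault root cause feature set and the fault recovery plan set."""

    def __init__(self):
        self.feature_set = {}   # fault ID -> root cause features
        self.plan_set = {}      # fault ID -> list of recovery plans

    def report(self, fault_id, features=None, plan=None):
        """Add the reported correspondences; either part of the report may be absent."""
        if features is not None:
            self.feature_set[fault_id] = dict(features)
        if plan is not None:
            plans = self.plan_set.setdefault(fault_id, [])
            if plan not in plans:        # avoid duplicate plans for the same fault
                plans.append(plan)


kb = FaultKnowledgeBase()
kb.report("10000003",
          features={"object": "physical_interface", "event": "interface_down"},
          plan="isolate interface")
kb.report("10000003", plan="isolate interface")   # repeated report changes nothing
```

Because plans are stored as data rather than hard-coded, a newly reported fault becomes a known fault for later similarity searches without any code change.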
可选地,上述管理设备可以是一个设备(控制设备),或者,上述管理设备可以包括多个设备(采集设备、分析设备和/或控制设备)。本申请实施例以管理设备包括分析设备和控制设备,分析设备集成有采集设备的功能为例,则上述步骤201的实现过程可以包括:云端设备向分析设备发送故障根因特征集合,并向控制设备发送故障恢复预案集合。上述步骤202至步骤205由分析设备执行。上述步骤206至步骤210以及步骤212由控制设备执行。
示例地,图3是本申请实施例提供的故障恢复预案确定方法的实现过程示意图。如图3所示,该实现过程包括:
步骤301、云端设备向分析设备发送故障根因特征集合。
故障根因特征集合包括多个已知故障与故障根因特征之间的对应关系。此步骤的解释可参考上述步骤201和步骤204中的相关内容,本申请实施例在此不再赘述。
步骤302、云端设备向控制设备发送故障恢复预案集合。
故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。此步骤的解释可参考上述步骤201中的相关内容,本申请实施例在此不再赘述。
步骤303、当网络发生故障时,分析设备获取网络中产生的异常事件。
此步骤的解释可参考上述步骤202中的相关内容,本申请实施例在此不再赘述。
步骤304、分析设备基于网络中产生的异常事件,确定目标故障的故障根因特征。
此步骤的解释可参考上述步骤203中的相关内容,本申请实施例在此不再赘述。
步骤305、对于多个已知故障中的每个已知故障,分析设备根据目标故障的故障根因特征以及该已知故障的故障根因特征,计算目标故障与该已知故障之间的相似度。
可选地,云端设备还可以向分析设备发送相似度模型和相似度阈值。或者,分析设备可以自行训练相似度模型并确定相似度阈值。则分析设备可以向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取该相似度模型输出的目标故障的故障根因与该已知故障的故障根因之间的相似度。
此步骤的解释可参考上述步骤204中的相关内容,本申请实施例在此不再赘述。
步骤306、分析设备根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
可选地,分析设备将多个已知故障中,故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障确定为目标故障的相似已知故障。
此步骤的解释可参考上述步骤205中的相关内容,本申请实施例在此不再赘述。
步骤307、分析设备向控制设备发送目标故障对应的相似故障信息。
该相似故障信息包括目标故障的标识和相似故障列表。该相似故障列表包括目标故障的一个或多个相似已知故障。该相似故障信息用于控制设备确定目标故障对应的故障恢复预案。
可选地,该相似故障信息还包括目标故障的故障根因特征。例如,该相似故障信息可以如表4所示。
表4
Figure PCTCN2021124377-appb-000007
可选地,相似故障信息还可以包括目标故障的发生时间戳等。
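The similar fault information sent from the analysis device to the control device could be carried as a small serializable record. The field names and the JSON encoding are assumptions of this sketch. The text fixes only the content: the target fault identifier, the similar fault list, and optionally the root cause features and an occurrence timestamp.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class SimilarFaultInfo:
    target_fault_id: str
    similar_fault_ids: List[str]
    root_cause_features: Dict[str, object] = field(default_factory=dict)  # optional
    occurred_at: Optional[float] = None                                   # optional

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SimilarFaultInfo":
        return cls(**json.loads(text))


message = SimilarFaultInfo(
    target_fault_id="20000001",
    similar_fault_ids=["10000001"],
    root_cause_features={"object": "physical_interface", "flapping": 1},
)
decoded = SimilarFaultInfo.from_json(message.to_json())
```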
步骤308、控制设备基于目标故障的相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。
此步骤的解释可参考上述步骤206中的相关内容,本申请实施例在此不再赘述。
步骤309、控制设备基于目标故障以及目标故障对应的故障恢复预案,确定网络中待执行预案的目标网络设备。
此步骤的解释可参考上述步骤207中的相关内容,本申请实施例在此不再赘述。
步骤310、控制设备向目标网络设备发送预案执行指令。
该预案执行指令用于指示目标网络设备执行目标故障对应的故障恢复预案。该预案执行指令包括目标故障对应的故障恢复预案。此步骤的解释可参考上述步骤208中的相关内容,本申请实施例在此不再赘述。
步骤311、控制设备向目标网络设备发送预案执行回退指令。
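The plan execution rollback instruction instructs the target network device to return to the state it was in before executing the fault recovery plan corresponding to the target fault. Optionally, in response to receiving a rollback trigger instruction, the control device sends the plan execution rollback instruction to the target network device. For an explanation of this step, refer to the description of step 209 above; details are not repeated here.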
步骤312、控制设备向云端设备发送目标故障对应的故障恢复预案以及目标故障的故障根因特征。
此步骤的解释可参考上述步骤210和步骤212中的相关内容,本申请实施例在此不再赘述。
步骤313、云端设备在故障恢复预案集合中添加目标故障与该目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合,并在故障根因特征集合中添加目标故障与该目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
此步骤的解释可参考上述步骤211和步骤213中的相关内容,本申请实施例在此不再赘述。
可选地,当相似度模型和相似度阈值由云端设备训练得到,云端设备在接收到管理设备上报的现网故障的故障根因特征及其对应的故障恢复预案后,可以持续训练和更新相似度模型,以提高模型的准确性和可靠性,云端设备还可以更新相似度阈值,以提高对故障相似度的判断准确性。
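For example, FIG. 4 is a functional schematic diagram of a fault recovery plan system, provided by an embodiment of this application, for implementing the method shown in FIG. 3. As shown in FIG. 4, the fault recovery plan system includes a cloud device, an analysis device, and a control device. The cloud device includes a model training module and a fault knowledge base. The model training module is used to train the similarity model and determine the similarity threshold. The fault knowledge base includes a fault feature library, which stores the fault root cause feature set, and a fault recovery plan library, which stores the fault recovery plan set. The cloud device sends the fault root cause feature set, the similarity model, and the similarity threshold to the analysis device, and sends the fault recovery plan set to the control device. The analysis device includes a fault locating module, a similar fault determining module, and a fault feature library. The fault locating module locates faults occurring in the network (live-network faults for short) and extracts their fault root cause features. The similar fault determining module invokes the similarity model to determine, among the known faults in the fault feature library, the similar known faults of a live-network fault. The analysis device sends the similar fault information corresponding to the live-network fault to the control device. The control device includes a plan evaluation module, a plan management module, and a fault recovery plan library. The plan evaluation module obtains, from the fault recovery plan library, the fault recovery plans corresponding to the similar known faults of the live-network fault and evaluates their feasibility. The plan management module determines the fault recovery plan corresponding to the live-network fault and the network devices that need to execute the plan. The control device sends the fault root cause features of the live-network fault and the corresponding fault recovery plan to the cloud device.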
本申请实施例提供的故障恢复预案确定方法的步骤先后顺序可以进行适当调整,步骤也可以根据情况进行相应增减。任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化的方法,都应涵盖在本申请的保护范围之内,因此不再赘述。
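In summary, in the fault recovery plan determination method provided by the embodiments of this application, the management device determines the fault recovery plan corresponding to the target fault based on the fault recovery plans corresponding to the similar known faults of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which widens the range of faults that can be handled. In addition, because one fault in a network may trigger multiple chained faults, and the fault root cause reflects where the fault fundamentally lies, searching for similar known faults on the basis of the fault root cause yields similar known faults that match the fault closely. The fault recovery plans corresponding to those similar known faults are therefore more likely to apply to the fault, which makes the determined fault recovery plan more reliable. Further, the management device can report to the cloud device, in real time, the fault root cause features of live-network faults and their corresponding fault recovery plans, so that the cloud device updates the fault recovery plan set and the fault root cause feature set automatically, which further widens the range of faults that can be handled. Moreover, fault recovery plans no longer have to be summarized manually and hard-coded, which makes the plans easier to extend and reduces maintenance effort.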
图5是本申请实施例提供的一种控制设备的结构示意图。如图5所示,控制设备50包括:
第一获取模块501,用于获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障。
第一确定模块502,用于获取相似已知故障对应的故障恢复预案,并基于相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。
可选地,故障根因采用故障根因特征表示,故障根因特征包括故障根因对象和故障根因事件,其中,故障根因事件为导致故障的异常事件,故障根因对象用于指示故障根因网络实体的类型,故障根因网络实体为故障根因事件所属的网络实体。
可选地,故障根因网络实体为物理接口,故障根因特征还包括故障根因网络实体的接口闪断指示、故障根因网络实体的接口假死指示、故障根因网络实体的收发报文状态、故障根因网络实体的接口协议状态或故障根因网络实体所在设备的物理接口状态中的一个或多个。或者,故障根因网络实体为BGP对等体,故障根因特征还包括故障根因网络实体的BGP路由震荡指示和/或故障根因网络实体所在设备的物理接口状态。又或者,不限于故障根因网络实体的类型,故障根因特征还包括故障根因网络实体所在设备的物理接口状态。
可选地,第一获取模块501,用于:获取多个已知故障的故障根因特征;对于多个已知故障中的每个已知故障,根据目标故障的故障根因特征以及已知故障的故障根因特征,计算目标故障的故障根因与已知故障的故障根因之间的相似度;根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
可选地,第一获取模块501,用于:将多个已知故障中,故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障确定为相似已知故障。
可选地,第一获取模块501,用于:向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取相似度模型输出的目标故障的故障根因与已知故障的故障根因之间的相似度,相似度模型基于多个样本故障的故障根因特征训练得到,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。
可选地,如图6所示,控制设备50还包括:训练模块503,用于采用多个样本故障的故障根因特征,训练得到相似度模型。
可选地,请继续参见图6,控制设备50还包括:第二获取模块504,用于向相似度模型分次输入多个样本故障对的故障根因特征,以获取相似度模型输出的每个样本故障对的故障根因之间的相似度,多个样本故障对包括第一类样本故障对和第二类样本故障对,第一类样本故障对包括两个标注有相同类别标签的样本故障,第二类样本故障对包括两个标注有不同类别标签的样本故障。第二确定模块505,用于根据多个样本故障对的故障根因之间的相似度,确定相似度阈值。
可选地,如图7所示,控制设备50还包括:接收模块506。
该接收模块506,用于接收来自训练设备的相似度模型和/或相似度阈值。和/或,该接收模块506,用于接收来自训练设备的故障根因特征集合,故障根因特征集合包括多个已知故障与故障根因特征之间的对应关系。相应地,第一获取模块501,用于基于故障根因特征集合,获取多个已知故障的故障根因特征。
可选地,如图8所示,控制设备50还包括:第一发送模块507。
该第一发送模块507,用于向训练设备发送目标故障的标识以及目标故障的故障根因特征,以供训练设备在故障根因特征集合中添加目标故障与目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
可选地,如图9所示,控制设备50还包括:第三获取模块508,用于当网络发生故障时,获取网络中产生的异常事件。第三确定模块509,用于基于网络中产生的异常事件,确定故障的故障根因特征。
可选地,第一获取模块501,用于:接收来自分析设备的目标故障对应的相似故障信息,相似故障信息包括目标故障的标识和相似故障列表,相似故障列表包括目标故障的一个或多个相似已知故障。
可选地,相似故障信息还包括目标故障的故障根因特征。
可选地,第一确定模块502,用于:基于网络的网络配置,评估相似已知故障对应的故障恢复预案的可行性,网络配置包括组网拓扑和/或设备数据,设备数据包括管理面数据、数据面数据或控制面数据中的一种或多种;将可行的故障恢复预案中的一个或多个故障恢复预案确定为目标故障对应的故障恢复预案。
可选地,第一确定模块502,用于:响应于多个故障恢复预案可行,基于网络的网络配置,分别评估多个故障恢复预案对网络所运行业务的影响程度;将多个故障恢复预案中,对网络所运行业务的影响程度最小的故障恢复预案确定为目标故障对应的故障恢复预案。
可选地,如图10所示,控制设备50还包括:第四确定模块510,用于基于目标故障以及目标故障对应的故障恢复预案,确定网络中待执行预案的目标网络设备。第二发送模块511,用于向目标网络设备发送预案执行指令,预案执行指令用于指示目标网络设备执行目标故障对应的故障恢复预案,预案执行指令包括目标故障对应的故障恢复预案。
可选地,第二发送模块511,还用于向目标网络设备发送预案执行回退指令,预案执行回退指令用于指示目标网络设备恢复至执行目标故障对应的故障恢复预案之前的状态。
可选地,接收模块506,还用于接收来自训练设备的故障恢复预案集合,故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。相应地,第一确定模块502,用于基于故障恢复预案集合,获取相似已知故障对应的故障恢复预案。
可选地,第一发送模块507,还用于向训练设备发送目标故障的标识以及目标故障对应的故障恢复预案,以供训练设备在故障恢复预案集合中添加目标故障与目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
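In summary, the control device provided by the embodiments of this application can determine, through the first determining module, the fault recovery plan corresponding to the target fault based on the fault recovery plans corresponding to the similar known faults of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which widens the range of faults that can be handled. In addition, because one fault in a network may trigger multiple chained faults, and the fault root cause reflects where the fault fundamentally lies, searching for similar known faults on the basis of the fault root cause yields similar known faults that match the fault closely. The fault recovery plans corresponding to those similar known faults are therefore more likely to apply to the fault, which makes the determined fault recovery plan more reliable.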
图11是本申请实施例提供的一种分析设备的结构示意图。如图11所示,分析设备110包括:
获取模块1101,用于获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障。
发送模块1102,用于向控制设备发送目标故障对应的相似故障信息,相似故障信息包括目标故障的标识和相似故障列表,相似故障列表包括目标故障的一个或多个相似已知故障,相似故障信息用于控制设备确定目标故障对应的故障恢复预案。
可选地,相似故障信息还包括目标故障的故障根因特征。
可选地,获取模块1101,用于:获取多个已知故障的故障根因特征。对于多个已知故障中的每个已知故障,根据目标故障的故障根因特征以及该已知故障的故障根因特征,计算目标故障的故障根因与该已知故障的故障根因之间的相似度。根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
可选地,获取模块1101,用于:将多个已知故障中,故障根因与目标故障的故障根因之间的相似度高于相似度阈值的已知故障确定为相似已知故障。
可选地,获取模块1101,用于:向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取相似度模型输出的目标故障的故障根因与已知故障的故障根因之间的相似度,相似度模型基于多个样本故障的故障根因特征训练得到,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。
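In summary, the analysis device provided by the embodiments of this application determines the similar known faults of the target fault through the obtaining module and sends the similar fault information corresponding to the target fault to the control device through the sending module, so that the control device can determine the fault recovery plan corresponding to the target fault based on the fault recovery plans corresponding to the similar known faults of the target fault. That is, regardless of whether the target fault is itself a known fault, as long as a similar known fault of the target fault can be found among the known faults, the fault recovery plan corresponding to the target fault can be determined, which widens the range of faults that can be handled. In addition, because one fault in a network may trigger multiple chained faults, and the fault root cause reflects where the fault fundamentally lies, searching for similar known faults on the basis of the fault root cause yields similar known faults that match the fault closely. The fault recovery plans corresponding to those similar known faults are therefore more likely to apply to the fault, which makes the determined fault recovery plan more reliable.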
图12是本申请实施例提供的一种云端设备的结构示意图。该云端设备即上述实施例中的训练设备。如图12所示,云端设备120包括:
第一获取模块1201,用于获取相似度模型,相似度模型基于多个样本故障的故障根因特征训练得到,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。
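A first sending module 1202, configured to send the similarity model to the analysis device, for the analysis device to determine the similar known faults of a target fault occurring in the network, the similar known faults being used to determine the fault recovery plan corresponding to the target fault.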
可选地,第一获取模块1201,用于采用多个样本故障的故障根因特征,训练得到相似度模型。
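Optionally, as shown in FIG. 13, the cloud device 120 further includes: a second obtaining module 1203, configured to input the fault root cause features of multiple sample fault pairs into the similarity model, one pair at a time, to obtain the similarity output by the similarity model between the fault root causes of each sample fault pair, where the multiple sample fault pairs include first-type sample fault pairs and second-type sample fault pairs, a first-type sample fault pair includes two sample faults labeled with the same class label, and a second-type sample fault pair includes two sample faults labeled with different class labels. A determining module 1204, configured to determine the similarity threshold from the similarities between the fault root causes of the multiple sample fault pairs. The first sending module 1202 is further configured to send the similarity threshold to the analysis device.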
可选地,如图14所示,云端设备120还包括第二发送模块1205。第一发送模块1202,用于向分析设备发送故障根因特征集合,故障根因特征集合包括多个已知故障对应的多个故障根因特征子集,每个故障根因特征子集包括一个已知故障的故障根因特征。第二发送模块1205,用于向控制设备发送故障恢复预案集合,故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。
可选地,请继续参见图14,云端设备120还包括:接收模块1206,用于接收来自控制设备的目标故障的标识、目标故障的故障根因特征以及目标故障对应的故障恢复预案。更新模块1207,用于在故障根因特征集合中添加目标故障与目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合,并在故障恢复预案集合中添加目标故障与目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
本申请实施例还提供了一种控制设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现图2对应的方法实施例中管理设备执行的动作,或者实现图3对应的方法实施例中控制设备执行的动作。
本申请实施例还提供了一种分析设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现图3对应的方法实施例中分析设备执行的动作。
本申请实施例还提供了一种云端设备,包括:处理器和存储器;
所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
所述处理器,用于调用所述计算机程序,实现图3对应的方法实施例中云端设备执行的动作。
示例地,图15是本申请实施例提供的一种故障恢复预案确定装置的框图。该装置可以是控制设备、分析设备或云端设备。如图15所示,装置150包括:处理器1501和存储器1502。
存储器1502,用于存储计算机程序,所述计算机程序包括程序指令;
处理器1501,用于调用所述计算机程序,当该装置150为控制设备时,实现图2对应的方法实施例中管理设备执行的动作,或者实现图3对应的方法实施例中控制设备执行的动作;当该装置150为分析设备时,实现图3对应的方法实施例中分析设备执行的动作;当该装置150为云端设备时,实现图3对应的方法实施例中云端设备执行的动作。
可选地,该装置150还包括通信总线1503和通信接口1504。
其中,处理器1501包括一个或者一个以上处理核心,处理器1501通过运行计算机程序,执行各种功能应用以及数据处理。
存储器1502可用于存储计算机程序。可选地,存储器可存储操作系统和至少一个功能所需的应用程序单元。操作系统可以是实时操作系统(Real Time eXecutive,RTX)、LINUX、UNIX、WINDOWS或OS X之类的操作系统。
通信接口1504可以为多个,通信接口1504用于与其它设备进行通信。
存储器1502与通信接口1504分别通过通信总线1503与处理器1501连接。
本申请实施例还提供了一种计算机存储介质,所述计算机存储介质上存储有指令,当所述指令被处理器执行时,实现上述方法实施例中管理设备、控制设备、分析设备或云端设备执行的动作。
本申请实施例还提供了一种故障恢复预案确定系统,包括:控制设备和分析设备。
分析设备用于获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障,并向控制设备发送目标故障对应的相似故障信息,相似故障信息包括目标故障的标识和相似故障列表,相似故障列表包括目标故障的一个或多个相似已知故障。控制设备用于获取相似已知故障对应的故障恢复预案,并基于相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。
可选地,该系统还包括:云端设备(即训练设备)。
云端设备用于采用多个样本故障的故障根因特征,训练得到相似度模型,并向分析设备发送相似度模型,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。分析设备用于对于多个已知故障中的每个已知故障,向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取相似度模型输出的目标故障的故障根因与已知故障的故障根因之间的相似度,并根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
可选地,云端设备还用于向控制设备发送故障恢复预案集合,故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。控制设备用于基于故障恢复预案集合,获取相似已知故障对应的故障恢复预案。
可选地,控制设备还用于向云端设备发送目标故障的标识以及目标故障对应的故障恢复预案。云端设备还用于在故障恢复预案集合中添加目标故障与目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
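Optionally, the cloud device is further configured to send a fault root cause feature set to the analysis device, the fault root cause feature set including the correspondences between multiple known faults and fault root cause features. The analysis device is configured to obtain the fault root cause features of the multiple known faults based on the fault root cause feature set.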
可选地,相似故障信息还包括目标故障的故障根因特征。控制设备还用于向云端设备发送目标故障的标识以及目标故障的故障根因特征。云端设备还用于在故障根因特征集合中添加目标故障与目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
本申请实施例还提供了另一种故障恢复预案确定系统,包括:控制设备和云端设备。
云端设备用于向控制设备发送故障恢复预案集合,该故障恢复预案集合包括多个已知故障与故障恢复预案之间的对应关系。控制设备用于基于故障恢复预案集合,获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障对应的故障恢复预案,并基于相似已知故障对应的故障恢复预案,确定目标故障对应的故障恢复预案。
可选地,控制设备还用于向云端设备发送目标故障的标识以及目标故障对应的故障恢复预案。云端设备还用于在故障恢复预案集合中添加目标故障与目标故障对应的故障恢复预案之间的对应关系,得到更新后的故障恢复预案集合。
可选地,该系统还包括:分析设备。
分析设备用于获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障,并向控制设备发送目标故障对应的相似故障信息,相似故障信息包括目标故障的标识和相似故障列表,相似故障列表包括目标故障的一个或多个相似已知故障。
可选地,云端设备用于采用多个样本故障的故障根因特征,训练得到相似度模型,并向分析设备发送相似度模型,样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。分析设备用于对于多个已知故障中的每个已知故障,向相似度模型输入目标故障的故障根因特征以及已知故障的故障根因特征,以获取相似度模型输出的目标故障的故障根因与已知故障的故障根因之间的相似度,并根据目标故障的故障根因与多个已知故障的故障根因之间的相似度,确定多个已知故障中故障根因与目标故障的故障根因满足相似度条件的相似已知故障。
可选地,云端设备还用于向分析设备发送故障根因特征集合,故障根因特征集合包括多个已知故障与故障根因特征之间的对应关系。分析设备用于基于故障根因特征集合,获取多个已知故障的故障根因特征。
可选地,相似故障信息还包括目标故障的故障根因特征。控制设备还用于向云端设备发送目标故障的标识以及目标故障的故障根因特征。云端设备还用于在故障根因特征集合中添加目标故障与目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
在本申请实施例中,术语“第一”、“第二”和“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性。
本申请中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的构思和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (50)

  1. 一种故障恢复预案确定方法,其特征在于,所述方法包括:
    控制设备获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障;
    所述控制设备获取所述相似已知故障对应的故障恢复预案;
    所述控制设备基于所述相似已知故障对应的故障恢复预案,确定所述目标故障对应的故障恢复预案。
  2. 根据权利要求1所述的方法,其特征在于,所述故障根因采用故障根因特征表示,所述故障根因特征包括故障根因对象和故障根因事件,其中,所述故障根因事件为导致故障的异常事件,所述故障根因对象用于指示故障根因网络实体的类型,所述故障根因网络实体为所述故障根因事件所属的网络实体。
  3. 根据权利要求2所述的方法,其特征在于,
    所述故障根因网络实体为物理接口,所述故障根因特征还包括所述故障根因网络实体的接口闪断指示、所述故障根因网络实体的接口假死指示、所述故障根因网络实体的收发报文状态、所述故障根因网络实体的接口协议状态或所述故障根因网络实体所在设备的物理接口状态中的一个或多个;
    或者,所述故障根因网络实体为边界网关协议BGP对等体,所述故障根因特征还包括所述故障根因网络实体的BGP路由震荡指示和/或所述故障根因网络实体所在设备的物理接口状态;
    又或者,所述故障根因特征还包括所述故障根因网络实体所在设备的物理接口状态。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述控制设备获取多个已知故障中,故障根因与网络中的目标故障的故障根因满足相似度条件的相似已知故障,包括:
    所述控制设备获取所述多个已知故障的故障根因特征;
    对于所述多个已知故障中的每个已知故障,所述控制设备根据所述目标故障的故障根因特征以及所述已知故障的故障根因特征,计算所述目标故障的故障根因与所述已知故障的故障根因之间的相似度;
    所述控制设备根据所述目标故障的故障根因与所述多个已知故障的故障根因之间的相似度,确定所述多个已知故障中故障根因与所述目标故障的故障根因满足所述相似度条件的相似已知故障。
  5. 根据权利要求4所述的方法,其特征在于,所述控制设备根据所述目标故障的故障根因与所述多个已知故障的故障根因之间的相似度,确定所述多个已知故障中故障根因与所述目标故障的故障根因满足所述相似度条件的相似已知故障,包括:
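    the control device determining, as the similar known faults, those of the multiple known faults for which the similarity between the fault root cause and the fault root cause of the target fault is higher than a similarity threshold.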
  6. 根据权利要求4或5所述的方法,其特征在于,所述控制设备根据所述目标故障的故障根因特征以及所述已知故障的故障根因特征,计算所述目标故障的故障根因与所述已知故障的故障根因之间的相似度,包括:
    所述控制设备向相似度模型输入所述目标故障的故障根因特征以及所述已知故障的故障根因特征,以获取所述相似度模型输出的所述目标故障的故障根因与所述已知故障的故障根因之间的相似度,所述相似度模型基于多个样本故障的故障根因特征训练得到,所述样本故障标注有类别标签,其中,标注有相同类别标签的样本故障对应的故障恢复预案相同。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    所述控制设备采用多个所述样本故障的故障根因特征,训练得到所述相似度模型。
  8. 根据权利要求6或7所述的方法,其特征在于,所述方法还包括:
    所述控制设备向所述相似度模型分次输入多个样本故障对的故障根因特征,以获取所述相似度模型输出的每个所述样本故障对的故障根因之间的相似度,所述多个样本故障对包括第一类样本故障对和第二类样本故障对,所述第一类样本故障对包括两个标注有相同类别标签的样本故障,所述第二类样本故障对包括两个标注有不同类别标签的样本故障;
    所述控制设备根据所述多个样本故障对的故障根因之间的相似度,确定相似度阈值。
  9. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    所述控制设备接收来自训练设备的所述相似度模型和/或相似度阈值。
  10. 根据权利要求4至9任一所述的方法,其特征在于,所述方法还包括:
    所述控制设备接收来自训练设备的故障根因特征集合,所述故障根因特征集合包括所述多个已知故障与故障根因特征之间的对应关系;
    所述控制设备获取所述多个已知故障的故障根因特征,包括:
    所述控制设备基于所述故障根因特征集合,获取所述多个已知故障的故障根因特征。
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:
    所述控制设备向所述训练设备发送所述目标故障的标识以及所述目标故障的故障根因特征,以供所述训练设备在所述故障根因特征集合中添加所述目标故障与所述目标故障的故障根因特征之间的对应关系,得到更新后的故障根因特征集合。
  12. 根据权利要求4至11任一所述的方法,其特征在于,所述方法还包括:
    当所述网络发生故障时,所述控制设备获取所述网络中产生的异常事件;
    所述控制设备基于所述网络中产生的异常事件,确定所述故障的故障根因特征。
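  13. The method according to any one of claims 1 to 3, characterized in that the control device obtaining, from multiple known faults, the similar known faults whose fault root cause and the fault root cause of the target fault in the network satisfy the similarity condition comprises: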
    所述控制设备接收来自分析设备的所述目标故障对应的相似故障信息,所述相似故障信息包括所述目标故障的标识和相似故障列表,所述相似故障列表包括所述目标故障的一个或多个相似已知故障。
  14. 根据权利要求13所述的方法,其特征在于,所述相似故障信息还包括所述目标故障的故障根因特征。
  15. 根据权利要求1至14任一所述的方法,其特征在于,所述控制设备基于所述相似已知故障对应的故障恢复预案,确定所述目标故障对应的故障恢复预案,包括:
    所述控制设备基于所述网络的网络配置,评估所述相似已知故障对应的故障恢复预案的可行性,所述网络配置包括组网拓扑和/或设备数据,所述设备数据包括管理面数据、数据面数据或控制面数据中的一种或多种;
    所述控制设备将可行的故障恢复预案中的一个或多个故障恢复预案确定为所述目标故障对应的故障恢复预案。
  16. 根据权利要求15所述的方法,其特征在于,所述控制设备将可行的故障恢复预案中的一个或多个故障恢复预案确定为所述目标故障对应的故障恢复预案,包括:
    响应于多个故障恢复预案可行,所述控制设备基于所述网络的网络配置,分别评估所述多个故障恢复预案对所述网络所运行业务的影响程度;
    所述控制设备将所述多个故障恢复预案中,对所述网络所运行业务的影响程度最小的故障恢复预案确定为所述目标故障对应的故障恢复预案。
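  17. The method according to any one of claims 1 to 16, wherein the method further comprises:
    determining, by the control device based on the target fault and the fault recovery plan corresponding to the target fault, a target network device in the network that is to execute the plan;
    sending, by the control device, a plan execution instruction to the target network device, wherein the plan execution instruction instructs the target network device to execute the fault recovery plan corresponding to the target fault, and the plan execution instruction comprises the fault recovery plan corresponding to the target fault.
  18. The method according to claim 17, wherein the method further comprises:
    sending, by the control device, a plan execution rollback instruction to the target network device, wherein the plan execution rollback instruction instructs the target network device to restore the state it was in before executing the fault recovery plan corresponding to the target fault.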
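The execute-then-roll-back behavior of claims 17 and 18 might look like this on the target network device. The snapshot approach, the dictionary-based configuration, and the class and method names are assumptions for illustration only.

```python
import copy
from typing import Dict, Optional


class TargetNetworkDevice:
    """Toy device whose state is a flat key/value configuration."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.config = dict(config)
        self._snapshot: Optional[Dict[str, str]] = None

    def execute_plan(self, plan: Dict[str, str]) -> None:
        """Handle a plan execution instruction: snapshot the state, then apply the plan."""
        self._snapshot = copy.deepcopy(self.config)
        self.config.update(plan)

    def rollback(self) -> None:
        """Handle a rollback instruction: restore the pre-plan state, if one exists."""
        if self._snapshot is not None:
            self.config = self._snapshot
            self._snapshot = None


device = TargetNetworkDevice({"port1": "up", "route": "primary"})
device.execute_plan({"port1": "down", "route": "backup"})
after_plan = dict(device.config)
device.rollback()
```

Keeping the snapshot on the device means the rollback instruction itself does not need to carry the previous state.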
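  19. The method according to any one of claims 1 to 18, wherein the method further comprises:
    receiving, by the control device, a fault recovery plan set from a training device, wherein the fault recovery plan set comprises correspondences between the plurality of known faults and fault recovery plans;
    and the obtaining, by the control device, of the fault recovery plan corresponding to the similar known fault comprises:
    obtaining, by the control device, the fault recovery plan corresponding to the similar known fault based on the fault recovery plan set.
  20. The method according to claim 19, wherein the method further comprises:
    sending, by the control device to the training device, an identifier of the target fault and the fault recovery plan corresponding to the target fault, so that the training device adds a correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, to obtain an updated fault recovery plan set.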
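The lookup in claim 19 and the update in claim 20 can be sketched with a plain mapping from known fault to plan. The dictionary representation, the function names, and the sample plan names are assumptions; the claims only require that the set holds the correspondences.

```python
from typing import Dict, List

# Hypothetical fault recovery plan set: known fault ID -> plan name.
plan_set: Dict[str, str] = {
    "fault-A": "switch_to_backup_path",
    "fault-B": "restart_bgp_session",
}


def plans_for_similar_faults(similar_ids: List[str], plans: Dict[str, str]) -> List[str]:
    """Look up plans for similar known faults, skipping unknown IDs and duplicates."""
    found: List[str] = []
    for fault_id in similar_ids:
        plan = plans.get(fault_id)
        if plan is not None and plan not in found:
            found.append(plan)
    return found


def add_target_fault(plans: Dict[str, str], target_id: str, plan: str) -> Dict[str, str]:
    """Return an updated set that also maps the target fault to its plan."""
    updated = dict(plans)
    updated[target_id] = plan
    return updated


candidates = plans_for_similar_faults(["fault-A", "fault-C"], plan_set)
updated_set = add_target_fault(plan_set, "fault-42", candidates[0])
```

Once the target fault is added, it can serve as a known fault when later faults are matched.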
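  21. A fault recovery plan determining method, wherein the method comprises:
    obtaining, by an analysis device, a similar known fault among a plurality of known faults, wherein the fault root cause of the similar known fault and the fault root cause of a target fault in a network meet a similarity condition;
    sending, by the analysis device to a control device, similar fault information corresponding to the target fault, wherein the similar fault information comprises an identifier of the target fault and a similar fault list, the similar fault list comprises one or more similar known faults of the target fault, and the similar fault information is used by the control device to determine a fault recovery plan corresponding to the target fault.
  22. The method according to claim 21, wherein the obtaining, by the analysis device, of the similar known fault among the plurality of known faults comprises:
    obtaining, by the analysis device, fault root cause features of the plurality of known faults;
    for each known fault of the plurality of known faults, calculating, by the analysis device based on the fault root cause feature of the target fault and the fault root cause feature of the known fault, the similarity between the fault root cause of the target fault and the fault root cause of the known fault;
    determining, by the analysis device based on the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known fault among the plurality of known faults whose fault root cause and the fault root cause of the target fault meet the similarity condition.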
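  23. A fault recovery plan determining method, wherein the method comprises:
    obtaining, by a training device, a similarity model, wherein the similarity model is obtained through training based on fault root cause features of a plurality of sample faults, each sample fault is annotated with a category label, and sample faults annotated with the same category label correspond to the same fault recovery plan;
    sending, by the training device, the similarity model to an analysis device, for the analysis device to determine a similar known fault of a target fault that occurs in a network, wherein the similar known fault is used to determine a fault recovery plan corresponding to the target fault.
  24. The method according to claim 23, wherein the obtaining, by the training device, of the similarity model comprises:
    training, by the training device, the similarity model by using the fault root cause features of the plurality of sample faults.
  25. The method according to claim 23 or 24, wherein the method further comprises:
    inputting, by the training device, fault root cause features of a plurality of sample fault pairs into the similarity model, one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair, wherein the plurality of sample fault pairs comprise first-type sample fault pairs and second-type sample fault pairs, each first-type sample fault pair comprises two sample faults annotated with the same category label, and each second-type sample fault pair comprises two sample faults annotated with different category labels;
    determining, by the training device, a similarity threshold based on the similarities between the fault root causes of the plurality of sample fault pairs;
    sending, by the training device, the similarity threshold to the analysis device.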
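The first-type and second-type sample fault pairs described in this claim can be built from the category labels alone, as in the sketch below. The function name, the sample identifiers, and the label values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_sample_pairs(labels: Dict[str, str]) -> Tuple[List[Pair], List[Pair]]:
    """Split all unordered sample pairs into same-label and different-label pairs."""
    same_label: List[Pair] = []
    different_label: List[Pair] = []
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            same_label.append((a, b))       # first-type pair
        else:
            different_label.append((a, b))  # second-type pair
    return same_label, different_label


# Hypothetical sample faults and their category labels.
labels = {"s1": "optical_module", "s2": "optical_module", "s3": "bgp_config"}
same_pairs, different_pairs = build_sample_pairs(labels)
```

Each pair would then be fed to the similarity model in turn, and the resulting scores used to determine the threshold.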
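  26. The method according to any one of claims 23 to 25, wherein the method further comprises:
    sending, by the training device, a fault root cause feature set to the analysis device, wherein the fault root cause feature set comprises correspondences between a plurality of known faults and fault root cause features;
    sending, by the training device, a fault recovery plan set to a control device, wherein the fault recovery plan set comprises correspondences between the plurality of known faults and fault recovery plans.
  27. The method according to claim 26, wherein the method further comprises:
    receiving, by the training device from the control device, an identifier of the target fault, the fault root cause feature of the target fault, and the fault recovery plan corresponding to the target fault;
    adding, by the training device, a correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set, to obtain an updated fault root cause feature set, and adding a correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, to obtain an updated fault recovery plan set.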
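  28. A control device, comprising:
    a first obtaining module, configured to obtain a similar known fault among a plurality of known faults, wherein the fault root cause of the similar known fault and the fault root cause of a target fault in a network meet a similarity condition;
    a first determining module, configured to obtain a fault recovery plan corresponding to the similar known fault, and determine, based on the fault recovery plan corresponding to the similar known fault, a fault recovery plan corresponding to the target fault.
  29. The control device according to claim 28, wherein the first obtaining module is configured to:
    obtain fault root cause features of the plurality of known faults;
    for each known fault of the plurality of known faults, calculate, based on the fault root cause feature of the target fault and the fault root cause feature of the known fault, the similarity between the fault root cause of the target fault and the fault root cause of the known fault;
    determine, based on the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known fault among the plurality of known faults whose fault root cause and the fault root cause of the target fault meet the similarity condition.
  30. The control device according to claim 29, wherein the first obtaining module is configured to:
    input the fault root cause feature of the target fault and the fault root cause feature of the known fault into a similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault, wherein the similarity model is obtained through training based on fault root cause features of a plurality of sample faults, each sample fault is annotated with a category label, and sample faults annotated with the same category label correspond to the same fault recovery plan.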
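  31. The control device according to claim 30, wherein the control device further comprises:
    a training module, configured to train the similarity model by using the fault root cause features of the plurality of sample faults.
  32. The control device according to claim 30 or 31, wherein the control device further comprises:
    a second obtaining module, configured to input fault root cause features of a plurality of sample fault pairs into the similarity model, one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair, wherein the plurality of sample fault pairs comprise first-type sample fault pairs and second-type sample fault pairs, each first-type sample fault pair comprises two sample faults annotated with the same category label, and each second-type sample fault pair comprises two sample faults annotated with different category labels;
    a second determining module, configured to determine a similarity threshold based on the similarities between the fault root causes of the plurality of sample fault pairs.
  33. The control device according to claim 30, wherein the control device further comprises:
    a receiving module, configured to receive the similarity model and/or a similarity threshold from a training device.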
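  34. The control device according to any one of claims 28 to 33, wherein the control device further comprises:
    a receiving module, configured to receive a fault root cause feature set from a training device, wherein the fault root cause feature set comprises correspondences between the plurality of known faults and fault root cause features;
    and the first obtaining module is configured to obtain the fault root cause features of the plurality of known faults based on the fault root cause feature set.
  35. The control device according to claim 34, wherein the control device further comprises:
    a first sending module, configured to send, to the training device, an identifier of the target fault and the fault root cause feature of the target fault, so that the training device adds a correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set, to obtain an updated fault root cause feature set.
  36. The control device according to any one of claims 29 to 35, wherein the control device further comprises:
    a third obtaining module, configured to obtain, when a fault occurs in the network, abnormal events generated in the network;
    a third determining module, configured to determine a fault root cause feature of the fault based on the abnormal events generated in the network.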
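  37. The control device according to any one of claims 28 to 36, wherein the control device further comprises:
    a fourth determining module, configured to determine, based on the target fault and the fault recovery plan corresponding to the target fault, a target network device in the network that is to execute the plan;
    a second sending module, configured to send a plan execution instruction to the target network device, wherein the plan execution instruction instructs the target network device to execute the fault recovery plan corresponding to the target fault, and the plan execution instruction comprises the fault recovery plan corresponding to the target fault.
  38. The control device according to claim 37, wherein
    the second sending module is further configured to send a plan execution rollback instruction to the target network device, wherein the plan execution rollback instruction instructs the target network device to restore the state it was in before executing the fault recovery plan corresponding to the target fault.
  39. The control device according to any one of claims 28 to 38, wherein the control device further comprises:
    a receiving module, configured to receive a fault recovery plan set from a training device, wherein the fault recovery plan set comprises correspondences between the plurality of known faults and fault recovery plans;
    and the first determining module is configured to obtain the fault recovery plan corresponding to the similar known fault based on the fault recovery plan set.
  40. The control device according to claim 39, wherein the control device further comprises:
    a first sending module, configured to send, to the training device, an identifier of the target fault and the fault recovery plan corresponding to the target fault, so that the training device adds a correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, to obtain an updated fault recovery plan set.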
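  41. An analysis device, comprising:
    an obtaining module, configured to obtain a similar known fault among a plurality of known faults, wherein the fault root cause of the similar known fault and the fault root cause of a target fault in a network meet a similarity condition;
    a sending module, configured to send, to a control device, similar fault information corresponding to the target fault, wherein the similar fault information comprises an identifier of the target fault and a similar fault list, the similar fault list comprises one or more similar known faults of the target fault, and the similar fault information is used by the control device to determine a fault recovery plan corresponding to the target fault.
  42. A training device, comprising:
    a first obtaining module, configured to obtain a similarity model, wherein the similarity model is obtained through training based on fault root cause features of a plurality of sample faults, each sample fault is annotated with a category label, and sample faults annotated with the same category label correspond to the same fault recovery plan;
    a first sending module, configured to send the similarity model to an analysis device, for the analysis device to determine a similar known fault of a target fault that occurs in a network, wherein the similar known fault is used to determine a fault recovery plan corresponding to the target fault.
  43. The training device according to claim 42, wherein the first obtaining module is configured to:
    train the similarity model by using the fault root cause features of the plurality of sample faults.
  44. The training device according to claim 42 or 43, wherein the training device further comprises:
    a second obtaining module, configured to input fault root cause features of a plurality of sample fault pairs into the similarity model, one pair at a time, to obtain the similarity, output by the similarity model, between the fault root causes of each sample fault pair, wherein the plurality of sample fault pairs comprise first-type sample fault pairs and second-type sample fault pairs, each first-type sample fault pair comprises two sample faults annotated with the same category label, and each second-type sample fault pair comprises two sample faults annotated with different category labels;
    a determining module, configured to determine a similarity threshold based on the similarities between the fault root causes of the plurality of sample fault pairs;
    and the first sending module is further configured to send the similarity threshold to the analysis device.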
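  45. The training device according to any one of claims 42 to 44, wherein
    the first sending module is further configured to send a fault root cause feature set to the analysis device, wherein the fault root cause feature set comprises correspondences between a plurality of known faults and fault root cause features;
    and the training device further comprises a second sending module, configured to send a fault recovery plan set to a control device, wherein the fault recovery plan set comprises correspondences between the plurality of known faults and fault recovery plans.
  46. The training device according to claim 45, wherein the training device further comprises:
    a receiving module, configured to receive, from the control device, an identifier of the target fault, the fault root cause feature of the target fault, and the fault recovery plan corresponding to the target fault;
    an updating module, configured to add a correspondence between the target fault and the fault root cause feature of the target fault to the fault root cause feature set, to obtain an updated fault root cause feature set, and add a correspondence between the target fault and the fault recovery plan corresponding to the target fault to the fault recovery plan set, to obtain an updated fault recovery plan set.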
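  47. A fault recovery plan determining system, comprising a control device and an analysis device, wherein
    the analysis device is configured to obtain a similar known fault among a plurality of known faults, wherein the fault root cause of the similar known fault and the fault root cause of a target fault in a network meet a similarity condition, and to send, to the control device, similar fault information corresponding to the target fault, wherein the similar fault information comprises an identifier of the target fault and a similar fault list, and the similar fault list comprises one or more similar known faults of the target fault;
    the control device is configured to obtain a fault recovery plan corresponding to the similar known fault, and determine, based on the fault recovery plan corresponding to the similar known fault, a fault recovery plan corresponding to the target fault.
  48. The system according to claim 47, wherein the system further comprises a training device;
    the training device is configured to train a similarity model by using fault root cause features of a plurality of sample faults, and send the similarity model to the analysis device, wherein each sample fault is annotated with a category label, and sample faults annotated with the same category label correspond to the same fault recovery plan,
    the analysis device is configured to, for each known fault of the plurality of known faults, input the fault root cause feature of the target fault and the fault root cause feature of the known fault into the similarity model, to obtain the similarity, output by the similarity model, between the fault root cause of the target fault and the fault root cause of the known fault, and determine, based on the similarities between the fault root cause of the target fault and the fault root causes of the plurality of known faults, the similar known fault among the plurality of known faults whose fault root cause and the fault root cause of the target fault meet the similarity condition;
    and/or,
    the training device is configured to send a fault recovery plan set to the control device, wherein the fault recovery plan set comprises correspondences between the plurality of known faults and fault recovery plans,
    the control device is configured to obtain the fault recovery plan corresponding to the similar known fault based on the fault recovery plan set;
    and/or,
    the training device is configured to send a fault root cause feature set to the analysis device, wherein the fault root cause feature set comprises correspondences between the plurality of known faults and fault root cause features,
    the analysis device is configured to obtain the fault root cause features of the plurality of known faults based on the fault root cause feature set.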
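The division of work among the three devices in claims 47 and 48 can be sketched end to end as below. Everything here is a simplified assumption: the toy similarity model stands in for a trained model, in-process objects stand in for networked devices, and the class, method, and plan names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence

Feature = Sequence[float]
SimilarityModel = Callable[[Feature, Feature], float]


def toy_similarity_model(a: Feature, b: Feature) -> float:
    """Placeholder for a trained model: fraction of positions with equal values."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), 1)


@dataclass
class TrainingDevice:
    """Owns the model and both sets, and distributes them to the other devices."""

    model: SimilarityModel
    feature_set: Dict[str, Feature]  # known fault -> fault root cause feature
    plan_set: Dict[str, str]         # known fault -> fault recovery plan


@dataclass
class AnalysisDevice:
    model: SimilarityModel
    feature_set: Dict[str, Feature]
    threshold: float

    def similar_fault_info(self, target_id: str, target_feature: Feature) -> Dict[str, Any]:
        """Build the similar fault information for a target fault."""
        similar = [
            fault_id
            for fault_id, feature in self.feature_set.items()
            if self.model(target_feature, feature) > self.threshold
        ]
        return {"target_fault_id": target_id, "similar_fault_list": similar}


@dataclass
class ControlDevice:
    plan_set: Dict[str, str]

    def candidate_plans(self, info: Dict[str, Any]) -> List[str]:
        """Look up the plans of the similar known faults listed in the message."""
        return [
            self.plan_set[fault_id]
            for fault_id in info["similar_fault_list"]
            if fault_id in self.plan_set
        ]


trainer = TrainingDevice(
    model=toy_similarity_model,
    feature_set={"fault-A": (1, 0, 1), "fault-B": (0, 1, 0)},
    plan_set={"fault-A": "switch_to_backup_path", "fault-B": "restart_bgp_session"},
)
analyzer = AnalysisDevice(trainer.model, trainer.feature_set, threshold=0.5)
controller = ControlDevice(trainer.plan_set)

info = analyzer.similar_fault_info("fault-42", (1, 0, 1))
plans = controller.candidate_plans(info)
```

In this arrangement the analysis device never sees recovery plans and the control device never runs the model, which mirrors the split of the feature set and the plan set between the two devices.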
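  49. A device, comprising a processor and a memory, wherein
    the memory is configured to store a computer program, and the computer program comprises program instructions;
    the processor is configured to invoke the computer program to implement the fault recovery plan determining method according to any one of claims 1 to 27.
  50. A computer storage medium, wherein the computer storage medium stores instructions, and when the instructions are executed by a processor of a computer device, the fault recovery plan determining method according to any one of claims 1 to 27 is implemented.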
PCT/CN2021/124377 2020-10-20 2021-10-18 Fault recovery plan determining method, apparatus, and system, and computer storage medium WO2022083540A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21881956.3A EP4221004A4 (en) 2020-10-20 2021-10-18 METHOD, APPARATUS AND SYSTEM FOR DETERMINING A FAULT RECOVERY PLAN AND COMPUTER STORAGE MEDIUM
US18/302,629 US20230318906A1 (en) 2020-10-20 2023-04-18 Fault recovery plan determining method, apparatus, and system, and computer storage medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011123661 2020-10-20
CN202011123661.5 2020-10-20
CN202011622270.8A CN114389940A (zh) 2020-10-20 2020-12-31 Fault recovery plan determining method, apparatus, and system, and computer storage medium
CN202011622270.8 2020-12-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/302,629 Continuation US20230318906A1 (en) 2020-10-20 2023-04-18 Fault recovery plan determining method, apparatus, and system, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022083540A1 (zh) 2022-04-28

Family

ID=81194671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124377 WO2022083540A1 (zh) 2020-10-20 2021-10-18 Fault recovery plan determining method, apparatus, and system, and computer storage medium

Country Status (4)

Country Link
US (1) US20230318906A1 (zh)
EP (1) EP4221004A4 (zh)
CN (1) CN114389940A (zh)
WO (1) WO2022083540A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
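CN115174312A (zh) * 2022-07-06 2022-10-11 中国联合网络通信集团有限公司 Broadcast information sending method, tunnel endpoint device, electronic device, and medium
CN115225460A (zh) * 2022-07-15 2022-10-21 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN115333923A (zh) * 2022-10-14 2022-11-11 成都飞机工业(集团)有限责任公司 Fault point tracing analysis method, apparatus, device, and medium
CN115619383A (zh) * 2022-12-19 2023-01-17 中国空气动力研究与发展中心超高速空气动力研究所 Knowledge graph-based fault diagnosis method, apparatus, and computing device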

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
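CN115001946A (zh) * 2022-06-01 2022-09-02 中国建设银行股份有限公司 Method and system for handling error-packet and transient-disconnection faults, electronic device, and storage medium
CN115102844A (zh) * 2022-06-09 2022-09-23 摩拜(北京)信息技术有限公司 Fault monitoring and handling method, apparatus, and electronic device
CN115048533B (zh) * 2022-06-21 2023-06-27 四维创智(北京)科技发展有限公司 Knowledge graph construction method, apparatus, electronic device, and readable storage medium
CN115766404A (zh) * 2022-10-24 2023-03-07 浪潮通信信息系统有限公司 Intelligent analysis-based network fault management method and system for communication operators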
US11943131B1 (en) * 2023-07-26 2024-03-26 Cisco Technology, Inc. Confidence reinforcement of automated remediation decisions through service health measurements

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008310582A (ja) * 2007-06-14 2008-12-25 Hitachi Ltd Maintenance work support apparatus and system, and maintenance work support method
US20180329404A1 (en) * 2017-05-15 2018-11-15 Doosan Heavy Industries & Construction Co., Ltd. Fault signal recovery system and method
CN111082401A (zh) * 2019-11-15 2020-04-28 国网河南省电力公司郑州供电公司 Power distribution network fault recovery method based on a self-learning mechanism
CN111224805A (zh) * 2018-11-26 2020-06-02 中兴通讯股份有限公司 Network fault root cause detection method, system, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
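CN108289034B (zh) * 2017-06-21 2019-04-09 新华三大数据技术有限公司 Fault discovery method and apparatus
CN107612756A (zh) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 Operation and maintenance management system with intelligent fault analysis and handling functions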

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4221004A4

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
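CN115174312A (zh) * 2022-07-06 2022-10-11 中国联合网络通信集团有限公司 Broadcast information sending method, tunnel endpoint device, electronic device, and medium
CN115174312B (zh) * 2022-07-06 2023-04-18 中国联合网络通信集团有限公司 Broadcast information sending method, tunnel endpoint device, electronic device, and medium
CN115225460A (zh) * 2022-07-15 2022-10-21 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN115225460B (zh) * 2022-07-15 2023-11-28 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN115333923A (zh) * 2022-10-14 2022-11-11 成都飞机工业(集团)有限责任公司 Fault point tracing analysis method, apparatus, device, and medium
CN115333923B (zh) * 2022-10-14 2023-03-14 成都飞机工业(集团)有限责任公司 Fault point tracing analysis method, apparatus, device, and medium
CN115619383A (zh) * 2022-12-19 2023-01-17 中国空气动力研究与发展中心超高速空气动力研究所 Knowledge graph-based fault diagnosis method, apparatus, and computing device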

Also Published As

Publication number Publication date
CN114389940A (zh) 2022-04-22
EP4221004A1 (en) 2023-08-02
US20230318906A1 (en) 2023-10-05
EP4221004A4 (en) 2024-02-21

Legal Events

Date Code Title Description
ENP Entry into the national phase (Ref document number: 2021881956; Country of ref document: EP; Effective date: 20230426)
NENP Non-entry into the national phase (Ref country code: DE)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21881956; Country of ref document: EP; Kind code of ref document: A1)