WO2016107425A1 - Data center-based fault analysis method and apparatus - Google Patents

Data center-based fault analysis method and apparatus

Info

Publication number
WO2016107425A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
virtual
data center
virtual machine
unit
Prior art date
Application number
PCT/CN2015/097903
Other languages
English (en)
French (fr)
Inventor
王烽
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP15875103.2A (EP3232620B1)
Publication of WO2016107425A1
Priority to US15/638,109 (US10831630B2)


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01R MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/08 Locating faults in cables, transmission lines, or networks
    • G01R31/088 Locating faults in cables, transmission lines, or networks — Aspects of digital computing
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F11/3048 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the topology of the computing system or computing system component explicitly influences the monitoring activity, e.g. serial, hierarchical systems
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02H EMERGENCY PROTECTIVE CIRCUIT ARRANGEMENTS
    • H02H7/00 Emergency protective circuit arrangements specially adapted for specific types of electric machines or apparatus or for sectionalised protection of cable or line systems, and effecting automatic switching in the event of an undesired change from normal working conditions
    • H02H7/26 Sectionalised protection of cable or line systems, e.g. for disconnecting a section on which a short-circuit, earth fault, or arc discharge has occurred
    • H02H7/261 Sectionalised protection of cable or line systems involving signal transmission between at least two stations
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • H04L41/12 Discovery or management of network topologies
    • H04L41/40 Arrangements for maintenance, administration or management of data switching networks using virtualisation of network functions or resources, e.g. SDN or NFV entities

Definitions

  • The present invention relates to virtual machine technology, and in particular to a data center-based fault analysis method and apparatus.
  • A data center is mainly composed of hosts and switching devices. A host, also called a physical machine, is mainly used to support the running of virtual machines. The switching devices are mainly used to support communication between the devices in the data center and generally include switches, routers, gateways, and other network nodes with a data switching function. It should be noted that, because a host carries a virtual switch (vSwitch), the host can also support data exchange between virtual machines.
  • A data center contains many devices. If a fault occurs, an impact analysis of the fault on the data center needs to be obtained so that the fault can be handled properly.
  • Existing data center fault analysis judges the fault level only by the type of the faulty device, or by whether the received fault alarm is a device fault alarm or a device performance alarm; it cannot perform accurate fault analysis according to the impact of the fault on the services running in the data center.
  • The present invention is proposed in view of the above prior art; it can solve the problem that, in prior-art data center fault analysis, fault analysis cannot be performed accurately according to the impact of a fault on the services running in the data center.
  • A first aspect of the present invention provides a data center-based fault analysis method. The constituent devices of the data center include at least two hosts and at least one switching device; at least one virtual machine runs on each of the at least two hosts; the at least one switching device is configured to establish communication paths between the constituent devices of the data center; and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group. The fault analysis method includes:
  • acquiring a topology map, where the nodes in the topology map include the constituent devices and the virtual machines running on the at least two hosts; and
  • when a fault occurs in the data center, acquiring a fault alarm, and determining according to the topology map whether the fault reduces the communication paths between the virtual machines in the virtual machine group.
  • Optionally, the determining according to the topology map whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically includes: determining, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the group and another virtual machine in the group.
  • Optionally, there are at least two virtual machine groups in the data center, and the method further includes: acquiring the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.
  • Optionally, the determining according to the topology map whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically includes: determining, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, where the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the group because of the fault to the total number of communication paths between the virtual machines in the group.
  • Optionally, there are at least two virtual machine groups in the data center, and the method further includes: acquiring the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.
  • The at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
  • A second aspect of the embodiments of the present invention provides a fault analysis apparatus applied to a data center. The constituent devices of the data center include at least two hosts and at least one switching device; at least one virtual machine runs on each of the at least two hosts; the at least one switching device is configured to establish communication paths between the constituent devices of the data center; and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group. The fault analysis apparatus includes:
  • an acquiring module, configured to acquire a topology map, where the nodes in the topology map include the constituent devices and the virtual machines running on the at least two hosts; and
  • an analysis module, configured to: when a fault occurs in the data center, acquire a fault alarm, and determine according to the topology map whether the fault reduces the communication paths between the virtual machines in the virtual machine group.
  • Optionally, the analysis module is specifically configured to determine, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the group and another virtual machine in the group.
  • Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus further includes: a first calculating module, configured to acquire the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.
  • Optionally, the analysis module is specifically configured to determine, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, where the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the group because of the fault to the total number of communication paths between the virtual machines in the group.
  • Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus further includes: a second calculating module, configured to acquire the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.
  • The at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
  • An embodiment of the present invention provides a data center-based fault analysis method: when a fault occurs in the data center, a fault alarm is sent to the device that performs fault analysis, and that device analyzes, according to a pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis.
  • FIG. 1 is a schematic diagram of the composition of a data center to which an embodiment of the present invention is applied;
  • FIG. 2 is a schematic flowchart of the fault analysis method according to a method embodiment of the present invention;
  • FIG. 3 is a schematic diagram of the composition of another data center to which an embodiment of the present invention is applied;
  • FIG. 4 is a schematic diagram of the composition of still another data center to which an embodiment of the present invention is applied;
  • FIG. 5 is a schematic diagram of the composition of the fault analysis apparatus according to an apparatus embodiment of the present invention;
  • FIG. 6 is a schematic diagram of the composition of the fault analysis device according to a device embodiment of the present invention.
  • The term "virtual machine group" in this specification may refer to one virtual machine group or to multiple virtual machine groups. Each virtual machine group includes at least two virtual machines that have communication dependencies with each other. Specifically, the communication dependencies between virtual machines in the same group may mean that those virtual machines cooperatively execute the same application or service and therefore need to communicate frequently with one another. Because different virtual machine groups execute different applications or services, they generally do not need to communicate with each other; even if communication between different groups is interrupted, the application or service executed by each group is not affected.
  • The term "service weight" in this specification specifically indicates the importance of the application or service running on a virtual machine group, for example, the user level to which the service belongs and the scope affected by the service.
  • The term "communication path" in this specification specifically indicates any communication channel between any two virtual machines in a virtual machine group. Taking FIG. 1 as an example, if virtual machine 202, virtual machine 208, and virtual machine 210 belong to the same virtual machine group, then between virtual machine 202 and virtual machine 208 there are two communication paths: host 214 - switching device 222 - switching device 228 - switching device 224 - host 218, and host 214 - switching device 222 - switching device 228 - switching device 226 - host 218. By analogy, there are two communication paths between virtual machine 202 and virtual machine 210 and two between virtual machine 208 and virtual machine 210, six communication paths in the group in total; if switching device 224 fails and is disconnected from the other devices, three communication paths in the group are interrupted accordingly.
  • FIG. 1 is a schematic diagram of the composition of the data center provided by an embodiment of the present invention. The constituent devices of the data center include hosts 214-220 and switching devices 222-228. Host 214 runs virtual machine 202 and virtual machine 204, host 216 runs virtual machine 206, host 218 runs virtual machine 208, and host 220 runs virtual machine 210 and virtual machine 212. Virtual machine 202 and virtual machine 212 constitute a first virtual machine group, virtual machine 204 and virtual machine 206 constitute a second virtual machine group, and virtual machine 208 and virtual machine 210 constitute a third virtual machine group. Switching devices 222-228 provide the communication connections between any two constituent devices in the data center.
  • Referring to FIG. 2, this method embodiment provides a fault analysis method based on the data center shown in FIG. 1. It should be noted that the method may be executed by any server or host in the data center; in specific implementations, vendors generally perform fault analysis with software installed on that server or host, such as EMC's Business Impact Manager or HP's Service Impact Analysis. For ease of description, host 214 is set as the executor of the method in this embodiment. The fault analysis method includes:
  • Step 402: Host 214 acquires a topology map, where the nodes in the topology map include the constituent devices of the data center and the virtual machines running on each host in the data center. Meanwhile, the connection lines in the topology map include the communication paths between the constituent devices of the data center and the communication paths between each host in the data center and the virtual machines running on it.
  • Specifically, when the data center starts, host 214 traverses the constituent devices of the data center; the traversal may specifically be a constituent-device discovery service, and commonly used traversal algorithms include breadth-first traversal, depth-first traversal, and the like. Host 214 then obtains the topology map of the data center according to the hosts in the data center and the virtual machines running on each host. The nodes in the topology map include virtual machines 202-212, hosts 214-220, and switching devices 222-228, and the connection lines include the communication paths between the virtual machines, hosts, and switching devices of the data center. Illustratively, FIG. 1 is itself the topology map of the data center.
  • Step 404: When a fault occurs in the data center, host 214 acquires a fault alarm and determines, according to the topology map acquired in step 402, whether the fault reduces the communication paths between the virtual machines included in a virtual machine group in the data center.
  • Specifically, the fault may be a fault of a constituent device of the data center or a fault of a communication path between constituent devices. Taking FIG. 1 as an example, the alarm may indicate that any switching device or host is faulty, or that the communication path between any two constituent devices is faulty, for example the communication path between switching device 222 and switching device 228. In this method embodiment, because host 214 is the fault analysis device, whenever any constituent device in the data center fails or a communication path between constituent devices fails, a fault alarm indicating the fault is sent to host 214.
  • After host 214 acquires the fault alarm, it determines, according to the topology map acquired in step 402, whether the fault reduces the communication paths between the virtual machines included in any of the first, second, and third virtual machine groups. For example, the communication paths between virtual machine 202 and virtual machine 212 in the first virtual machine group originally include host 214 - switching device 222 - switching device 228 - switching device 224 - host 220 and host 214 - switching device 222 - switching device 228 - switching device 226 - host 220, two communication paths in total. In step 404, host 214 performs fault analysis on the first virtual machine group, that is, determines whether the fault reduces these two communication paths; correspondingly, host 214 may also perform the corresponding fault analysis on the second and third virtual machine groups.
  • It should be noted that, in practice, step 402 and step 404 may be performed consecutively; alternatively, after host 214 performs step 402 once to acquire the topology map, it may, when it subsequently acquires multiple fault alarms, perform step 404 once for each fault alarm to complete the fault analysis.
  • Optionally, step 404 specifically includes: after acquiring the fault alarm, host 214 determines, according to the connectivity between the nodes corresponding, in the topology map, to the virtual machines included in any virtual machine group in the data center, whether the fault leaves at least one virtual machine included in the group with no available communication path to another virtual machine in the group; if so, the virtual machine group has an error. For example, after acquiring the fault alarm, host 214 deletes from the topology map the constituent device, or the communication path between constituent devices, indicated by the fault alarm, and then initiates a first traversal in the topology map starting from any virtual machine included in any virtual machine group. If the first traversal cannot reach all nodes, all the constituent devices passed by the first traversal form a first sub-topology; host 214 then initiates a second traversal starting from any constituent device not reached by the first traversal to obtain a second sub-topology, and so on until all nodes have been traversed. The resulting first sub-topology, second sub-topology, ..., n-th sub-topology have no communication connection with one another; therefore, if the virtual machines included in any virtual machine group run in two sub-topologies at the same time, the fault leaves the group's two parts of virtual machines, located in the two sub-topologies, with no available communication path between them, and the virtual machine group has an error.
  • For example, for a fault alarm indicating a fault of the communication path between switching device 222 and switching device 228 in FIG. 1, the topology map of the data center in FIG. 1 becomes the topology map shown in FIG. 3: virtual machine 202, virtual machine 204, virtual machine 206, host 214, host 216, and switching device 222 form a first sub-topology, while virtual machine 208, virtual machine 210, virtual machine 212, host 218, host 220, switching device 224, switching device 226, and switching device 228 form a second sub-topology. The first virtual machine group includes virtual machine 202 and virtual machine 212, located in the first and second sub-topologies respectively, so the fault leaves no available communication path between virtual machine 202 and virtual machine 212, and the first virtual machine group has an error. By analogy, this fault alarm does not leave any virtual machine of the second or third virtual machine group without an available communication path to the other virtual machines of its group.
  • As another example: after acquiring the fault alarm and deleting from the topology map the constituent device or communication path indicated by the alarm, host 214 determines whether a shortest path exists in the topology map between the virtual machines included in any virtual machine group; if no shortest path exists, the group's virtual machines lie in two sub-topologies that cannot be connected, that is, there is no available communication path.
  • Optionally, the aforementioned virtual machine group may refer to multiple virtual machine groups. Host 214 performs the foregoing fault analysis method on all virtual machine groups in the data center to determine the number of groups in which the fault causes an error, for example m, where a group with an error is one in which at least one of its virtual machines has no available communication path to another of its virtual machines; together with the service weights of the groups that fail because of the fault, host 214 then acquires the impact parameter of the fault alarm. Specifically, taking FIG. 3 as an example, the service weights of the first, second, and third virtual machine groups are n1, n2, and n3 respectively. Continuing the example above, if the fault alarm indicates a fault of the communication path between switching device 222 and switching device 228, only the first virtual machine group has an error (that is, m = 1), and host 214 computes the impact parameter of the fault as A×m+B×n1, or as f(m, n1). After acquiring the impact parameter, host 214 further outputs the impact level of the fault: if the impact parameter is greater than a preset threshold, the fault alarm is an urgent fault and needs to be repaired first; if the result is less than or equal to the preset threshold, the fault alarm is a minor fault, and repair can wait until the urgent faults have been repaired. The parameters A and B in the preceding formula can be set as required, and f(m, n1) is any function that takes m and n1 as input parameters and can likewise be set as required.
  • Meanwhile, there may be many communication paths between the constituent devices of the data center, and some fault alarms do not interrupt any communication path between constituent devices, that is, they do not leave any two virtual machines of any group without an available communication path; under the preceding optional scheme, such alarms would be judged to have no impact. For example, switching device 224 and switching device 226 in FIG. 1 are two parallel switching devices, and the failure of either of them does not interrupt the communication path between any two constituent devices. However, switching device 224 and switching device 226 together constitute the communication paths between host 218, host 220, and switching device 228; if one of them fails, the communication paths are not interrupted, but their reliability decreases and their bandwidth and quality of service are also affected. Therefore, in step 404, host 214's determining, according to the topology map, whether the communication paths between the virtual machines in each virtual machine group are reduced may also include the following optional scheme.
  • Optionally, after acquiring the fault alarm, host 214 determines, according to the connectivity between the nodes corresponding, in the topology map, to the virtual machines included in a virtual machine group, whether the fault reduces the communication paths between the group's virtual machines, that is, whether there are interrupted communication paths between the virtual machines; if the fault reduces the communication paths between the group's virtual machines, the group has an error. Meanwhile, host 214 also acquires the fault ratio of each failed virtual machine group; specifically, the fault ratio of any virtual machine group indicates the ratio of the number of communication paths interrupted between the group's virtual machines because of the fault to the total number of communication paths between the group's virtual machines.
  • For example, host 214 acquires a fault alarm indicating that switching device 226 has failed. Because of this failure, the topology map of the data center in FIG. 1 becomes the topology map shown in FIG. 4. Under the preceding scheme, the conclusion would be that this fault alarm interrupts no communication path between any two virtual machines of any group. But switching device 226 and switching device 224 are functionally equivalent: both serve the communication among virtual machine 208, virtual machine 210, and virtual machine 212, and between those machines and virtual machine 202, virtual machine 204, and virtual machine 206. The failure of switching device 226 therefore reduces the reliability of those communications; that is, it reduces the reliability of the communication paths of the first virtual machine group (between virtual machine 202 and virtual machine 212) and of the third virtual machine group (between virtual machine 208 and virtual machine 210). The total number of communication paths of the first virtual machine group is 2, namely host 214 - switching device 222 - switching device 228 - switching device 224 - host 220 and host 214 - switching device 222 - switching device 228 - switching device 226 - host 220; the fault interrupts the latter path, so the fault ratio of the first virtual machine group is 0.5, and likewise the fault ratio of the third virtual machine group is 0.5.
  • Optionally, the aforementioned virtual machine group may refer to multiple virtual machine groups. Host 214 performs the foregoing fault analysis method on all virtual machine groups in the data center to determine the number of groups in which the fault causes an error, for example M, where a group with an error is one in which communication paths between its virtual machines are interrupted. Given the service weights of the groups affected by the fault alarm, for example N1, N2, ..., NM, and the fault ratios of the M groups, X1, X2, ..., XM, host 214 acquires the impact parameter of the fault according to N1, N2, ..., NM and X1, X2, ..., XM; specifically, host 214 computes f(N1, N2, ..., NM, X1, X2, ..., XM), which is the impact parameter. After acquiring the impact parameter, host 214 may further output the impact level of the fault: if the impact parameter is greater than a preset threshold, the fault alarm is an urgent fault and needs to be repaired first; if the result is less than or equal to the preset threshold, the fault alarm is a minor fault, and repair can wait until the urgent faults have been repaired. Here f(N1, N2, ..., NM, X1, X2, ..., XM) is any function that takes N1, N2, ..., NM and X1, X2, ..., XM as input parameters and can be set as required.
  • Optionally, the virtual machines included in the aforementioned virtual machine group specifically are virtual machines that cooperatively execute the same service or application.
  • It should be noted that the various optional methods in this method embodiment can, when multiple fault alarms occur in the data center, analyze the impact of each fault on the communication paths of the virtual machines in the virtual machine groups running in the data center, acquire the impact level of each fault, and determine the priority of repairing the multiple faults, so that faulty devices with a high impact on the virtual machine groups are repaired first and the working performance of the data center is preserved as far as possible. They can also simulate failures of each constituent device, or of the communication paths between constituent devices, to acquire the impact level of each such failure on the working performance of the data center; for example, by simulating in turn host 214 receiving fault alarms indicating that host 214 - host 220 and switching device 222 - switching device 228 have failed, and acquiring the impact level of each device's failure, the importance priorities of host 214 - host 220 and switching device 222 - switching device 228 are derived. When maintaining the data center, the constituent devices with high importance priority can therefore be maintained first, reducing the probability that important constituent devices fail.
  • The foregoing provides a data center-based fault analysis method: when a fault occurs in the data center, a fault alarm is sent to the device that performs fault analysis, and that device analyzes, according to a pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center; it can also comprehensively acquire the impact level of the fault alarm on the data center from the number of affected virtual machine groups, their service weights, and their fault ratios. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.
  • This apparatus embodiment provides a fault analysis apparatus 600, whose schematic structure is shown in FIG. 5. The fault analysis apparatus 600 is applied in practice to the data center shown in FIG. 1 and may be any host or server in that data center. It includes:
  • an acquiring module 602, configured to acquire a topology map, where the nodes in the topology map include the constituent devices of the data center and the virtual machines running in the data center;
  • specifically, the acquiring module 602 performs step 402 of the method embodiment and its optional schemes, which are not repeated here;
  • an analysis module 604, configured to acquire a fault alarm when a fault occurs in the data center and determine, according to the topology map, whether the fault reduces the communication paths between the virtual machines in a virtual machine group;
  • specifically, the analysis module 604 performs step 404 of the method embodiment and its optional schemes, which are not repeated here.
  • Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus 600 further includes: a first calculating module, configured to acquire the impact level of the fault according to the number of virtual machine groups in which the fault causes an error and the service weight corresponding to each virtual machine group in which an error occurs.
  • Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus 600 further includes: a second calculating module, configured to acquire the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.
  • Optionally, the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
  • The foregoing provides a data center-based fault analysis apparatus. The fault analysis apparatus first obtains the topology map of the data center; after acquiring a fault alarm, it analyzes, according to the pre-obtained topology map, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center, and can comprehensively acquire the impact level of the fault alarm on the data center from the number of affected virtual machine groups, their service weights, and their fault ratios. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.
  • This device embodiment provides a fault analysis device 800, whose schematic structure is shown in FIG. 6. The fault analysis device 800 is applied in practice to the data center shown in FIG. 1 and may be any host or server in that data center. It includes:
  • The fault analysis device 800 includes a processor 802, a memory 804, a communication interface 806, and a bus 808, where the processor 802, the memory 804, and the communication interface 806 are communicatively connected to one another through the bus 808.
  • The processor 802 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute a related program.
  • The memory 804 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM), and may store an operating system and other applications.
  • The communication interface 806 is used to communicate with the other constituent devices or virtual machines in the data center.
  • The bus 808 may include a path that transfers information between the components of the fault analysis device 800.
  • The foregoing provides a data center-based fault analysis device. The fault analysis device runs its stored program code: it first obtains the topology map of the data center and, after acquiring a fault alarm, analyzes, according to the pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Manipulator (AREA)

Abstract

An embodiment of the present invention discloses a data center-based fault analysis method, including: acquiring a topology map, where the nodes in the topology map include the constituent devices of the data center and the virtual machines running in the data center; and, when a fault occurs in the data center, acquiring a fault alarm and determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in a virtual machine group running in the data center. The method can analyze the actual impact of a fault on each service running in the data center and improves the accuracy of the data center's fault analysis.

Description

Data center-based fault analysis method and apparatus

Technical Field

The present invention relates to virtual machine technology, and in particular to a data center-based fault analysis method and apparatus.

Background

A data center is mainly composed of hosts and switching devices. A host, also called a physical machine, is mainly used to support the running of virtual machines. The switching devices are mainly used to support communication between the devices in the data center and generally include switches, routers, gateways, and other network nodes with a data switching function. It should be noted that, because a host carries a virtual switch (vSwitch), the host can also support data exchange between virtual machines. A data center contains many devices; if a fault occurs, an impact analysis of the fault on the data center needs to be obtained so that the fault can be handled properly.

Existing data center fault analysis judges the fault level only by the type of the faulty device, or by whether the received fault alarm is a device fault alarm or a device performance alarm; it cannot perform accurate fault analysis according to the impact of the fault on the services running in the data center.

Summary of the Invention

The present invention is proposed in view of the above prior art; it can solve the problem that, in prior-art data center fault analysis, fault analysis cannot be performed accurately according to the impact of a fault on the services running in the data center.
A first aspect of the embodiments of the present invention provides a data center-based fault analysis method. The constituent devices of the data center include at least two hosts and at least one switching device, where at least one virtual machine runs on each of the at least two hosts, the at least one switching device is configured to establish communication paths between the constituent devices of the data center, and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group. The fault analysis method includes:

acquiring a topology map, where the nodes in the topology map include the constituent devices and the virtual machines running on the at least two hosts; and

when a fault occurs in the data center, acquiring a fault alarm, and determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group.

With reference to the first aspect, in a first implementation of the first aspect, the determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically includes:

determining, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the virtual machine group and another virtual machine in the virtual machine group.

With reference to the first implementation of the first aspect, in a second implementation of the first aspect, there are at least two virtual machine groups in the data center, and the method further includes:

acquiring the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.

With reference to the first aspect, in a third implementation of the first aspect, the determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically includes:

determining, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, where the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the virtual machine group because of the fault to the total number of communication paths between the virtual machines in the virtual machine group.

With reference to the third implementation of the first aspect, in a fourth implementation of the first aspect, there are at least two virtual machine groups in the data center, and the method further includes:

acquiring the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.

With reference to the first aspect and the first to fourth implementations of the first aspect, in a fifth implementation, the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
A second aspect of the embodiments of the present invention provides a fault analysis apparatus. The fault analysis apparatus is applied to a data center, and the constituent devices of the data center include at least two hosts and at least one switching device, where at least one virtual machine runs on each of the at least two hosts, the at least one switching device is configured to establish communication paths between the constituent devices of the data center, and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group. The fault analysis apparatus includes:

an acquiring module, configured to acquire a topology map, where the nodes in the topology map include the constituent devices and the virtual machines running on the at least two hosts; and

an analysis module, configured to: when a fault occurs in the data center, acquire a fault alarm, and determine, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group.

With reference to the second aspect, in a first implementation of the second aspect, the analysis module is specifically configured to: determine, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the virtual machine group and another virtual machine in the virtual machine group.

With reference to the first implementation of the second aspect, in a second implementation, there are at least two virtual machine groups in the data center, and the fault analysis apparatus further includes:

a first calculating module, configured to acquire the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.

With reference to the second aspect, in a third implementation, the analysis module is specifically configured to: determine, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, where the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the virtual machine group because of the fault to the total number of communication paths between the virtual machines in the virtual machine group.

With reference to the third implementation of the second aspect, in a fourth implementation, there are at least two virtual machine groups in the data center, and the fault analysis apparatus further includes:

a second calculating module, configured to acquire the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.

With reference to the second aspect and the first to fourth implementations of the second aspect, in a fifth implementation, the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.

An embodiment of the present invention provides a data center-based fault analysis method: when a fault occurs in the data center, a fault alarm is sent to the device that performs fault analysis, and that device analyzes, according to a pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required in the embodiments are briefly introduced below. Apparently, the drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of the composition of a data center to which an embodiment of the present invention is applied;

FIG. 2 is a schematic flowchart of the fault analysis method according to a method embodiment of the present invention;

FIG. 3 is a schematic diagram of the composition of another data center to which an embodiment of the present invention is applied;

FIG. 4 is a schematic diagram of the composition of still another data center to which an embodiment of the present invention is applied;

FIG. 5 is a schematic diagram of the composition of the fault analysis apparatus according to an apparatus embodiment of the present invention;

FIG. 6 is a schematic diagram of the composition of the fault analysis device according to a device embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

The term "virtual machine group" in this specification may specifically refer to one virtual machine group or to multiple virtual machine groups. Each virtual machine group includes at least two virtual machines that have communication dependencies with each other. Specifically, the communication dependencies between virtual machines in the same group may mean that the virtual machines in the same group cooperatively execute the same application or service and therefore need to communicate frequently with one another. Because different virtual machine groups execute different applications or services, different groups generally do not need to communicate with each other; even if communication between different groups is interrupted, the application or service executed by each group is not affected.

The term "service weight" in this specification specifically indicates the importance of the application or service running on a virtual machine group, for example, the user level to which the service belongs and the scope affected by the service.

The term "communication path" in this specification specifically indicates any communication channel between any two virtual machines in a virtual machine group. Taking FIG. 1 as an example, if virtual machine 202, virtual machine 208, and virtual machine 210 belong to the same virtual machine group, then between virtual machine 202 and virtual machine 208 there are two communication paths: host 214 - switching device 222 - switching device 228 - switching device 224 - host 218, and host 214 - switching device 222 - switching device 228 - switching device 226 - host 218. By analogy, there are two communication paths between virtual machine 202 and virtual machine 210 and two communication paths between virtual machine 208 and virtual machine 210, six communication paths in the group in total. If switching device 224 fails and is disconnected from the other devices, three communication paths in the group are interrupted accordingly.
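To make the path counting concrete, the following is a minimal Python sketch (not part of the patent text) that enumerates the loop-free communication paths between two virtual machines over a small graph modeling the relevant part of FIG. 1; the node names are illustrative assumptions.

```python
# Minimal sketch: enumerate communication paths between two VMs over the
# part of the FIG. 1 topology discussed above (illustrative node names).
TOPOLOGY = {
    "vm202": {"host214"},
    "vm208": {"host218"},
    "host214": {"vm202", "sw222"},
    "host218": {"vm208", "sw224", "sw226"},
    "sw222": {"host214", "sw228"},
    "sw224": {"host218", "sw228"},
    "sw226": {"host218", "sw228"},
    "sw228": {"sw222", "sw224", "sw226"},
}

def simple_paths(graph, src, dst, seen=frozenset()):
    """Yield every loop-free path from src to dst."""
    seen = seen | {src}
    if src == dst:
        yield (src,)
        return
    for nxt in graph.get(src, ()):
        if nxt not in seen:
            for rest in simple_paths(graph, nxt, dst, seen):
                yield (src,) + rest

# Two paths between virtual machine 202 and virtual machine 208,
# one via switching device 224 and one via switching device 226.
print(len(list(simple_paths(TOPOLOGY, "vm202", "vm208"))))  # 2
```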
Composition of the data center in the embodiments of the present invention

FIG. 1 is a schematic diagram of the composition of the data center provided by an embodiment of the present invention. The constituent devices of the data center include hosts 214-220 and switching devices 222-228, where host 214 runs virtual machine 202 and virtual machine 204, host 216 runs virtual machine 206, host 218 runs virtual machine 208, and host 220 runs virtual machine 210 and virtual machine 212. Virtual machine 202 and virtual machine 212 constitute a first virtual machine group, virtual machine 204 and virtual machine 206 constitute a second virtual machine group, and virtual machine 208 and virtual machine 210 constitute a third virtual machine group. Switching devices 222-228 provide the communication connections between any two constituent devices in the data center.
Method Embodiment

Referring to FIG. 2, this method embodiment provides a fault analysis method based on the data center shown in FIG. 1. It should be noted that the method may be executed by any server or host in the data center; in specific implementations, vendors generally perform fault analysis with software installed on that server or host, such as EMC's Business Impact Manager or HP's Service Impact Analysis. For ease of description, host 214 is set as the executor of the method in this embodiment. The fault analysis method includes:

Step 402: Host 214 acquires a topology map, where the nodes in the topology map include the constituent devices of the data center and the virtual machines running on each host in the data center. Meanwhile, the connection lines in the topology map include the communication paths between the constituent devices of the data center and the communication paths between each host in the data center and the virtual machines running on it.

Specifically, when the data center starts, host 214 traverses the constituent devices of the data center; the traversal may specifically be a constituent-device discovery service, and commonly used traversal algorithms include breadth-first traversal, depth-first traversal, and the like. Host 214 then obtains the topology map of the data center according to the hosts in the data center and the virtual machines running on each host. The nodes in the topology map include virtual machines 202-212, hosts 214-220, and switching devices 222-228, and the connection lines in the topology map include the communication paths between the virtual machines, hosts, and switching devices of the data center. Illustratively, FIG. 1 is itself the topology map of the data center.
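As a minimal sketch of such a discovery traversal (breadth-first here), assuming a hypothetical neighbors() probe that returns the devices, and for a host also the virtual machines, directly connected to a given device:

```python
from collections import deque

def build_topology(seed, neighbors):
    """Discover the topology map as an adjacency dict, starting from `seed`.

    `neighbors(device)` is a hypothetical probe standing in for the
    constituent-device discovery service mentioned above.
    """
    graph, seen, queue = {}, {seed}, deque([seed])
    while queue:
        dev = queue.popleft()
        graph[dev] = set(neighbors(dev))
        for peer in graph[dev]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return graph
```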
Step 404: When a fault occurs in the data center, host 214 acquires a fault alarm and determines, according to the topology map acquired in step 402, whether the fault reduces the communication paths between the virtual machines included in a virtual machine group in the data center.

Specifically, the fault may be a fault of a constituent device of the data center or a fault of a communication path between constituent devices of the data center. Taking FIG. 1 as an example, the alarm may indicate that any switching device or host is faulty, or that the communication path between any two constituent devices is faulty, for example the communication path between switching device 222 and switching device 228. In this method embodiment, because host 214 is the fault analysis device, whenever any constituent device in the data center fails or a communication path between constituent devices fails, a fault alarm indicating the fault is sent to host 214.

After host 214 acquires the fault alarm, it determines, according to the topology map acquired in step 402, whether the fault reduces the communication paths between the virtual machines included in any of the first, second, and third virtual machine groups. For example, the communication paths between virtual machine 202 and virtual machine 212 in the first virtual machine group originally include host 214 - switching device 222 - switching device 228 - switching device 224 - host 220 and host 214 - switching device 222 - switching device 228 - switching device 226 - host 220, two communication paths in total. In step 404, host 214 performs fault analysis on the first virtual machine group, that is, determines whether the fault reduces these two communication paths; correspondingly, host 214 may also perform the corresponding fault analysis on the second and third virtual machine groups.

It should be noted that, in practice, step 402 and step 404 may be performed consecutively; alternatively, after host 214 performs step 402 once to acquire the topology map, it may, when it subsequently acquires multiple fault alarms, perform step 404 once for each fault alarm to complete the fault analysis.

Optionally, step 404 specifically includes: after acquiring the fault alarm, host 214 determines, according to the connectivity between the nodes corresponding, in the topology map, to the virtual machines included in any virtual machine group in the data center, whether the fault leaves at least one virtual machine included in the group with no available communication path to another virtual machine in the group; if so, the virtual machine group has an error. For example, after acquiring the fault alarm, host 214 deletes from the topology map the constituent device, or the communication path between constituent devices, indicated by the fault alarm, and then initiates a first traversal in the topology map starting from any virtual machine included in any virtual machine group. If the first traversal cannot reach all nodes, all the constituent devices passed by the first traversal form a first sub-topology; host 214 then initiates a second traversal starting from any constituent device not reached by the first traversal to obtain a second sub-topology, and so on until all nodes have been traversed. The resulting first sub-topology, second sub-topology, ..., n-th sub-topology have no communication connection with one another; therefore, if the virtual machines included in any virtual machine group run in two sub-topologies at the same time, the fault leaves the group's two parts of virtual machines, located in the two sub-topologies, with no available communication path between them, and the virtual machine group has an error.

Taking a fault alarm indicating a fault of the communication path between switching device 222 and switching device 228 in FIG. 1 as an example: because of this communication path fault, the topology map of the data center in FIG. 1 becomes the topology map shown in FIG. 3. Virtual machine 202, virtual machine 204, virtual machine 206, host 214, host 216, and switching device 222 form a first sub-topology, and virtual machine 208, virtual machine 210, virtual machine 212, host 218, host 220, switching device 224, switching device 226, and switching device 228 form a second sub-topology. The first virtual machine group includes virtual machine 202 and virtual machine 212, located in the first sub-topology and the second sub-topology respectively, so the fault leaves no available communication path between virtual machine 202 and virtual machine 212, and the first virtual machine group has an error. By analogy, this fault alarm does not leave any virtual machine of the second or third virtual machine group without an available communication path to the other virtual machines of its group.

As another example: after acquiring the fault alarm and deleting from the topology map the constituent device or the communication path between constituent devices indicated by the fault alarm, host 214 determines whether a shortest path exists in the topology map between the virtual machines included in any virtual machine group; if no shortest path exists, the virtual machines included in the group lie in two sub-topologies that cannot be connected, that is, there is no available communication path.
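Both checks amount to a connectivity test on the topology map with the failed element removed. A minimal sketch (the helper names are assumptions, not the patent's): label each node's sub-topology with repeated traversals, then flag a group whose virtual machines land in more than one sub-topology.

```python
def sub_topologies(graph):
    """Label every node with the index of its sub-topology."""
    label, next_id = {}, 0
    for start in graph:
        if start in label:
            continue
        label[start] = next_id
        stack = [start]            # one traversal per undiscovered node
        while stack:
            node = stack.pop()
            for peer in graph.get(node, ()):
                if peer not in label:
                    label[peer] = next_id
                    stack.append(peer)
        next_id += 1
    return label

def group_has_error(degraded, vm_group):
    """True if the group's VMs fall in >= 2 sub-topologies, i.e. some pair
    of them has no available communication path. `degraded` is the topology
    map with the faulty device or link removed from all adjacency sets."""
    labels = sub_topologies(degraded)
    return len({labels[vm] for vm in vm_group}) > 1
```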
Optionally, the aforementioned virtual machine group may refer to multiple virtual machine groups. Host 214 performs the foregoing fault analysis method on all virtual machine groups in the data center to determine the number of virtual machine groups in which the fault causes an error, for example m, where a group with an error is one in which at least one of its virtual machines has no available communication path to another of its virtual machines; together with the service weights of the groups that fail because of the fault, host 214 then acquires the impact parameter of the fault alarm. Specifically, for the service weights of the virtual machine groups running in the data center, taking FIG. 3 as an example, the service weights of the first, second, and third virtual machine groups are n1, n2, and n3 respectively. Continuing the example above, if the fault alarm indicates a fault of the communication path between switching device 222 and switching device 228, only the first virtual machine group has an error (that is, m = 1), and host 214 computes the impact parameter of the fault as A×m+B×n1, or as f(m, n1). After acquiring the impact parameter, host 214 further outputs the impact level of the fault: if the impact parameter is greater than a preset threshold, the fault alarm is an urgent fault and needs to be repaired first; if the result is less than or equal to the preset threshold, the fault alarm is a minor fault, and repair can wait until the urgent faults have been repaired. The parameters A and B in the preceding formula can be set as required, and f(m, n1) is any function that takes m and n1 as input parameters and can likewise be set as required.
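A minimal sketch of this computation, using the linear form A×m+B×n1 given in the text; generalizing to the sum of the failed groups' weights, and the choices of A, B, and the threshold, are assumptions the text leaves open.

```python
def impact_of_fault(failed_group_weights, A=1.0, B=1.0, threshold=2.0):
    """Return (impact parameter, impact level) for one fault alarm."""
    m = len(failed_group_weights)        # number of groups with errors
    param = A * m + B * sum(failed_group_weights)
    level = "urgent" if param > threshold else "minor"
    return param, level

# e.g. only the first group (service weight n1 = 3) has an error, so m = 1:
print(impact_of_fault([3]))  # (4.0, 'urgent')
```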
Meanwhile, there may be many communication paths between the constituent devices of the data center, and some fault alarms do not interrupt any communication path between constituent devices, that is, they do not leave any two virtual machines of any virtual machine group without an available communication path. Under the fault analysis method of the preceding optional scheme, the conclusion would be that such fault alarms have no impact on the services or applications running in the data center, that is, no virtual machine group in the data center has an error. For example, switching device 224 and switching device 226 in FIG. 1 are two parallel switching devices, and the failure of either of them does not interrupt the communication path between any two constituent devices. However, switching device 224 and switching device 226 together constitute the communication paths between host 218, host 220, and switching device 228; if one of them fails, the communication paths are not interrupted, but their reliability decreases and their bandwidth and quality of service are also affected. Therefore, in step 404, host 214's determining, according to the topology map, whether the communication paths between the virtual machines in each virtual machine group are reduced may also include the following optional scheme.

Optionally, after acquiring the fault alarm, host 214 determines, according to the connectivity between the nodes corresponding, in the topology map, to the virtual machines included in a virtual machine group in the data center, whether the fault reduces the communication paths between the virtual machines included in the group, that is, whether there are interrupted communication paths between the virtual machines; if the fault reduces the communication paths between the virtual machines included in the group, the virtual machine group has an error. Meanwhile, host 214 also acquires the fault ratio of each failed virtual machine group; specifically, the fault ratio of any virtual machine group indicates the ratio of the number of communication paths interrupted between the group's virtual machines because of the fault to the total number of communication paths between the group's virtual machines.

For example, host 214 acquires a fault alarm indicating that switching device 226 has failed. Because of the failure of switching device 226, the topology map of the data center in FIG. 1 becomes the topology map shown in FIG. 4. Under the preceding scheme, the conclusion would be that this fault alarm interrupts no communication path between any two virtual machines of any virtual machine group. But switching device 226 and switching device 224 are functionally equivalent: both serve the communication among virtual machine 208, virtual machine 210, and virtual machine 212, and between virtual machine 208, virtual machine 210, and virtual machine 212 on the one hand and virtual machine 202, virtual machine 204, and virtual machine 206 on the other. The failure of switching device 226 therefore reduces the reliability of communication among virtual machine 208, virtual machine 210, and virtual machine 212 and between them and virtual machine 202, virtual machine 204, and virtual machine 206; that is, it reduces the reliability of the communication paths of the first virtual machine group (between virtual machine 202 and virtual machine 212) and of the third virtual machine group (between virtual machine 208 and virtual machine 210). The total number of communication paths of the first virtual machine group is 2, namely host 214 - switching device 222 - switching device 228 - switching device 224 - host 220 and host 214 - switching device 222 - switching device 228 - switching device 226 - host 220; the fault interrupts the latter communication path, so the fault ratio of the first virtual machine group is 0.5, and by the same reasoning the fault ratio of the third virtual machine group is 0.5.
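A minimal sketch of the fault ratio, reusing simple_paths() from the earlier sketch: compare each pair's path count before and after the failed element is removed.

```python
def fault_ratio(original, degraded, vm_group):
    """Interrupted paths over total paths for one virtual machine group."""
    vms = list(vm_group)
    total = interrupted = 0
    for i, a in enumerate(vms):
        for b in vms[i + 1:]:
            before = sum(1 for _ in simple_paths(original, a, b))
            after = sum(1 for _ in simple_paths(degraded, a, b))
            total += before
            interrupted += before - after
    return interrupted / total if total else 0.0

# For the first virtual machine group above: total = 2, interrupted = 1,
# so the fault ratio is 0.5.
```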
Optionally, the aforementioned virtual machine group may refer to multiple virtual machine groups. Host 214 performs the foregoing fault analysis method on all virtual machine groups in the data center to determine the number of virtual machine groups in which the fault causes an error, for example M, where a group with an error is one in which communication paths between its virtual machines are interrupted. Given the service weights of the virtual machine groups affected by the fault alarm, for example N1, N2, ..., NM for the M groups, and the fault ratios X1, X2, ..., XM of the M groups, host 214 acquires the impact parameter of the fault according to N1, N2, ..., NM and X1, X2, ..., XM; specifically, host 214 computes f(N1, N2, ..., NM, X1, X2, ..., XM), which is the impact parameter. After acquiring the impact parameter, host 214 may further output the impact level of the fault: if the impact parameter is greater than a preset threshold, the fault alarm is an urgent fault and needs to be repaired first; if the result is less than or equal to the preset threshold, the fault alarm is a minor fault, and repair can wait until the urgent faults have been repaired. Here f(N1, N2, ..., NM, X1, X2, ..., XM) is any function that takes N1, N2, ..., NM and X1, X2, ..., XM as input parameters and can be set as required.
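The text leaves f open; a weighted sum over the fault ratios is one plausible choice, sketched here as an assumption.

```python
def impact_parameter(weights, ratios):
    """One plausible f(N1..NM, X1..XM): weight each group's fault ratio."""
    return sum(n * x for n, x in zip(weights, ratios))

# e.g. two affected groups with service weights [3, 1] and fault ratios
# [0.5, 0.5] give an impact parameter of 2.0:
print(impact_parameter([3, 1], [0.5, 0.5]))  # 2.0
```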
Optionally, the virtual machines included in the aforementioned virtual machine group specifically are virtual machines that cooperatively execute the same service or application.

It should be noted that the various optional methods in this method embodiment can, when multiple fault alarms occur in the data center, analyze the impact of each fault on the communication paths of the virtual machines in the virtual machine groups running in the data center, acquire the impact level of each fault, and determine the priority of repairing the multiple faults, so that faulty devices with a high impact on the virtual machine groups are repaired first and the working performance of the data center is preserved as far as possible. They can also simulate failures of each constituent device of the data center, or of the communication paths between constituent devices, to acquire the impact level of each such failure on the working performance of the data center; for example, by simulating in turn host 214 receiving fault alarms indicating that host 214 - host 220 and switching device 222 - switching device 228 have failed, and acquiring the impact level of each constituent device's failure, the importance priorities of host 214 - host 220 and switching device 222 - switching device 228 are derived. When maintaining the data center, the constituent devices with high importance priority can therefore be maintained first, reducing the probability that important constituent devices fail.
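A minimal sketch of this simulated-failure ranking; impact_level() stands in for the whole analysis of steps 402 and 404, and all helper names are assumptions.

```python
def rank_devices(graph, vm_groups, impact_level):
    """Simulate each device's failure and rank devices by resulting impact."""
    scores = {}
    for dev in graph:
        degraded = {node: {p for p in peers if p != dev}
                    for node, peers in graph.items() if node != dev}
        scores[dev] = impact_level(degraded, vm_groups)
    # Devices whose simulated failure has the highest impact should be
    # maintained first.
    return sorted(scores, key=scores.get, reverse=True)
```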
The foregoing provides a data center-based fault analysis method. When a fault occurs in the data center, a fault alarm is sent to the device that performs fault analysis, and that device analyzes, according to a pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center; it can also comprehensively acquire the impact level of the fault alarm on the data center from the number of affected virtual machine groups, the service weights of the affected groups, and, optionally, the fault ratios of the affected groups. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.
Apparatus Embodiment

This apparatus embodiment provides a fault analysis apparatus 600, whose schematic structure is shown in FIG. 5. The fault analysis apparatus 600 is applied in practice to the data center shown in FIG. 1 and may be any host or server in the data center shown in FIG. 1, including:

an acquiring module 602, configured to acquire a topology map, where the nodes in the topology map include the constituent devices of the data center and the virtual machines running in the data center;

specifically, the acquiring module 602 performs step 402 of the method embodiment and its optional schemes, which are not repeated here;

an analysis module 604, configured to acquire a fault alarm when a fault occurs in the data center and determine, according to the topology map, whether the fault reduces the communication paths between the virtual machines in a virtual machine group;

specifically, the analysis module 604 performs step 404 of the method embodiment and its optional schemes, which are not repeated here.

Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus 600 further includes:

a first calculating module, configured to acquire the impact level of the fault according to the number of virtual machine groups in which the fault causes an error and the service weight corresponding to each virtual machine group in which an error occurs.

Optionally, there are at least two virtual machine groups in the data center, and the fault analysis apparatus 600 further includes:

a second calculating module, configured to acquire the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.

Optionally, the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.

The foregoing provides a data center-based fault analysis apparatus. The fault analysis apparatus first obtains the topology map of the data center; after acquiring a fault alarm, it analyzes, according to the pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center, and can comprehensively acquire the impact level of the fault alarm on the data center from the number of affected virtual machine groups, the service weights of the affected groups, and, optionally, the fault ratios of the affected groups. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.
Device Embodiment

This device embodiment provides a fault analysis device 800, whose schematic structure is shown in FIG. 6. The fault analysis device 800 is applied in practice to the data center shown in FIG. 1 and may be any host or server in the data center shown in FIG. 1, including:

The fault analysis device 800 includes a processor 802, a memory 804, a communication interface 806, and a bus 808, where the processor 802, the memory 804, and the communication interface 806 are communicatively connected to one another through the bus 808.

The processor 802 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute a related program to implement the technical solutions provided by the foregoing method embodiment of the present invention.

The memory 804 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM), and may store an operating system and other applications. When the technical solutions provided by the embodiments of the present invention are implemented by software or firmware, the program code implementing the technical solutions of the foregoing method embodiment is stored in the memory 804 and executed by the processor 802.

The communication interface 806 is used to communicate with the other constituent devices or virtual machines in the data center.

The bus 808 may include a path that transfers information between the components of the fault analysis device 800.

The foregoing provides a data center-based fault analysis device. The fault analysis device runs its stored program code: it first obtains the topology map of the data center and, after acquiring a fault alarm, analyzes, according to the pre-obtained topology map of the data center, whether the fault affects the communication paths between the virtual machines in the virtual machine groups running in the data center. This avoids the problem of existing fault analysis methods, which judge the importance of a fault only by the type of the faulty device or the degree of the fault and cannot comprehensively analyze the actual impact of the fault on each service running in the data center, and thereby improves the accuracy of the data center's fault analysis and the data center's fault analysis and fault response capabilities.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but a person skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. In addition, a person skilled in the art should also know that the embodiments described in this specification are all preferred embodiments, and the actions and units involved are not necessarily required by the present invention.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

Finally, it should be noted that the foregoing embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. A data center-based fault analysis method, wherein the constituent devices of the data center comprise at least two hosts and at least one switching device, at least one virtual machine runs on each of the at least two hosts, the at least one switching device is configured to establish communication paths between the constituent devices of the data center, and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group, the fault analysis method comprising:
    acquiring a topology map, wherein the nodes in the topology map comprise the constituent devices and the virtual machines running on the at least two hosts; and
    when a fault occurs in the data center, acquiring a fault alarm, and determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group.
  2. The fault analysis method according to claim 1, wherein the determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically comprises:
    determining, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the virtual machine group and another virtual machine in the virtual machine group.
  3. The fault analysis method according to claim 2, wherein there are at least two virtual machine groups in the data center, and the method further comprises:
    acquiring the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.
  4. The fault analysis method according to claim 1, wherein the determining, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group specifically comprises:
    determining, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, wherein the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the virtual machine group because of the fault to the total number of communication paths between the virtual machines in the virtual machine group.
  5. The fault analysis method according to claim 4, wherein there are at least two virtual machine groups in the data center, and the method further comprises:
    acquiring the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.
  6. The method according to any one of claims 1 to 5, wherein the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
  7. A fault analysis apparatus, wherein the fault analysis apparatus is applied to a data center, the constituent devices of the data center comprise at least two hosts and at least one switching device, at least one virtual machine runs on each of the at least two hosts, the at least one switching device is configured to establish communication paths between the constituent devices of the data center, and at least two virtual machines with communication dependencies running on the at least two hosts constitute a virtual machine group, the fault analysis apparatus comprising:
    an acquiring module, configured to acquire a topology map, wherein the nodes in the topology map comprise the constituent devices and the virtual machines running on the at least two hosts; and
    an analysis module, configured to: when a fault occurs in the data center, acquire a fault alarm, and determine, according to the topology map, whether the fault reduces the communication paths between the virtual machines in the virtual machine group.
  8. The fault analysis apparatus according to claim 7, wherein the analysis module is specifically configured to: determine, according to the connectivity between nodes in the topology map, that the virtual machine group has an error when the fault leaves no available communication path between at least one virtual machine in the virtual machine group and another virtual machine in the virtual machine group.
  9. The fault analysis apparatus according to claim 8, wherein there are at least two virtual machine groups in the data center, and the fault analysis apparatus further comprises:
    a first calculating module, configured to acquire the impact level of the fault according to the number of virtual machine groups in which the fault causes an error, and the service weight corresponding to each virtual machine group in which an error occurs.
  10. The fault analysis apparatus according to claim 7, wherein the analysis module is specifically configured to: determine, according to the connectivity between nodes in the topology map, a fault ratio of the virtual machine group, wherein the fault ratio specifically is the ratio of the number of communication paths interrupted between the virtual machines in the virtual machine group because of the fault to the total number of communication paths between the virtual machines in the virtual machine group.
  11. The fault analysis apparatus according to claim 10, wherein there are at least two virtual machine groups in the data center, and the fault analysis apparatus further comprises:
    a second calculating module, configured to acquire the impact level of the fault according to the fault ratio of each virtual machine group affected by the fault and the corresponding service weight.
  12. The fault analysis apparatus according to any one of claims 7 to 11, wherein the at least two virtual machines with communication dependencies that constitute the virtual machine group specifically are at least two virtual machines that cooperatively execute the same service or application.
PCT/CN2015/097903 2014-12-31 2015-12-18 Data center-based fault analysis method and apparatus WO2016107425A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15875103.2A EP3232620B1 (en) 2014-12-31 2015-12-18 Data center based fault analysis method and device
US15/638,109 US10831630B2 (en) 2014-12-31 2017-06-29 Fault analysis method and apparatus based on data center

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410856613.5 2014-12-31
CN201410856613.5A 2014-12-31 2014-12-31 Data center-based fault analysis method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/638,109 Continuation US10831630B2 (en) 2014-12-31 2017-06-29 Fault analysis method and apparatus based on data center

Publications (1)

Publication Number Publication Date
WO2016107425A1 (zh) 2016-07-07

Family

ID=56284217

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097903 WO2016107425A1 (zh) Data center-based fault analysis method and apparatus 2014-12-31 2015-12-18

Country Status (4)

Country Link
US (1) US10831630B2 (zh)
EP (1) EP3232620B1 (zh)
CN (1) CN105812170B (zh)
WO (1) WO2016107425A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639107A (zh) * 2020-05-26 2020-09-08 姜沛松 Power Internet of Things fault detection method, apparatus, and detection terminal

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170235816A1 (en) 2016-02-12 2017-08-17 Nutanix, Inc. Entity database data aggregation
US11909603B2 (en) * 2017-12-01 2024-02-20 Cisco Technology, Inc. Priority based resource management in a network functions virtualization (NFV) environment
US11929869B2 (en) * 2018-05-14 2024-03-12 Netflix, Inc. Scalable and real-time anomaly detection
CN112115390A (zh) * 2019-06-20 2020-12-22 Huawei Technologies Co., Ltd. Topology map display method, apparatus, and device, and storage medium
US11966319B2 (en) * 2021-02-23 2024-04-23 Mellanox Technologies, Ltd. Identifying anomalies in a data center using composite metrics and/or machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102143008A (zh) * 2010-01-29 2011-08-03 International Business Machines Corporation Method and apparatus for diagnosing fault events in a data center
CN102455951A (zh) * 2011-07-21 2012-05-16 China Standard Software Co., Ltd. Virtual machine fault tolerance method and system
CN103403689A (zh) * 2012-07-30 2013-11-20 Huawei Technologies Co., Ltd. Resource fault management method, apparatus, and system
US20140047444A1 (en) * 2011-04-20 2014-02-13 Nec Corporation Virtual machine managing apparatus, virtual machine managing method, and program thereof

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7380017B2 (en) * 2001-05-03 2008-05-27 Nortel Networks Limited Route protection in a communication network
US8751866B2 (en) * 2006-09-28 2014-06-10 International Business Machines Corporation Autonomic fault isolation in a highly interconnected system
US8031634B1 (en) * 2008-03-31 2011-10-04 Emc Corporation System and method for managing a virtual domain environment to enable root cause and impact analysis
US8661295B1 (en) * 2011-03-31 2014-02-25 Amazon Technologies, Inc. Monitoring and detecting causes of failures of network paths
US8811212B2 (en) * 2012-02-22 2014-08-19 Telefonaktiebolaget L M Ericsson (Publ) Controller placement for fast failover in the split architecture
US9003027B2 (en) * 2012-08-17 2015-04-07 Vmware, Inc. Discovery of storage area network devices for a virtual machine
US9183033B2 (en) * 2012-12-06 2015-11-10 Industrial Technology Research Institute Method and system for analyzing root causes of relating performance issues among virtual machines to physical machines
CN103294521B (zh) * 2013-05-30 2016-08-10 天津大学 一种降低数据中心通信负载及能耗的方法
US9811435B2 (en) * 2013-09-03 2017-11-07 Cisco Technology, Inc. System for virtual machine risk monitoring
US10348628B2 (en) * 2013-09-12 2019-07-09 Vmware, Inc. Placement of virtual machines in a virtualized computing environment
US9882805B2 (en) * 2013-09-30 2018-01-30 Vmware, Inc. Dynamic path selection policy for multipathing in a virtualized environment
US9389970B2 (en) * 2013-11-01 2016-07-12 International Business Machines Corporation Selected virtual machine replication and virtual machine restart techniques
US9164695B2 (en) * 2013-12-03 2015-10-20 Vmware, Inc. Placing a storage network device into a maintenance mode in a virtualized computing environment
US20150172222A1 (en) * 2013-12-16 2015-06-18 James Liao Data center ethernet switch fabric
US9946614B2 (en) * 2014-12-16 2018-04-17 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102143008A (zh) * 2010-01-29 2011-08-03 International Business Machines Corporation Method and apparatus for diagnosing fault events in a data center
US20140047444A1 (en) * 2011-04-20 2014-02-13 Nec Corporation Virtual machine managing apparatus, virtual machine managing method, and program thereof
CN102455951A (zh) * 2011-07-21 2012-05-16 China Standard Software Co., Ltd. Virtual machine fault tolerance method and system
CN103403689A (zh) * 2012-07-30 2013-11-20 Huawei Technologies Co., Ltd. Resource fault management method, apparatus, and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3232620A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639107A (zh) * 2020-05-26 2020-09-08 姜沛松 Power Internet of Things fault detection method, apparatus, and detection terminal
CN111639107B (zh) * 2020-05-26 2023-08-04 广东信通通信有限公司 Power Internet of Things fault detection method, apparatus, and detection terminal

Also Published As

Publication number Publication date
EP3232620A1 (en) 2017-10-18
CN105812170A (zh) 2016-07-27
EP3232620A4 (en) 2017-12-20
EP3232620B1 (en) 2019-05-22
US20170299645A1 (en) 2017-10-19
CN105812170B (zh) 2019-01-18
US10831630B2 (en) 2020-11-10

Similar Documents

Publication Publication Date Title
WO2016107425A1 (zh) Data center-based fault analysis method and apparatus
US10601643B2 (en) Troubleshooting method and apparatus using key performance indicator information
US9483343B2 (en) System and method of visualizing historical event correlations in a data center
US20170310432A1 (en) Network link monitoring and testing
RU2641706C1 Method for handling a network service failure, service management system, and system management module
US10567232B2 (en) System and method for mapping a connectivity state of a network
US20140078882A1 (en) Automated Datacenter Network Failure Mitigation
WO2017083024A1 (en) Methods, systems, and computer readable media for testing network function virtualization (nfv)
CN113973042B (zh) Method and system for root cause analysis of network problems
CN111030873A (zh) Fault diagnosis method and apparatus
US10855546B2 (en) Systems and methods for non-intrusive network performance monitoring
CN109787865B (zh) Upgrade status verification method and system, switch, and storage medium
WO2021103800A1 (zh) Fault repair operation recommendation method and apparatus, and storage medium
CN110708715A (zh) Method and apparatus for locating service faults of a 5G base station
US8886506B2 (en) Path failure importance sampling
Oi et al. Method for estimating locations of service problem causes in service function chaining
US11516073B2 (en) Malfunction point estimation method and malfunction point estimation apparatus
WO2012106914A1 (zh) Dynamic tunnel fault diagnosis method, device, and system
CN108141374B (zh) Network sub-health diagnosis method and apparatus
US10656988B1 (en) Active monitoring of packet loss in networks using multiple statistical models
US10461992B1 (en) Detection of failures in network devices
JP6052150B2 (ja) Relay device
JP6326383B2 (ja) Network evaluation system, network evaluation method, and network evaluation program
Stidsen et al. Complete rerouting protection
Holbert et al. Effects of partial topology on fault diagnosis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15875103

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2015875103

Country of ref document: EP