CN110635954A - Method and system for processing network fault of data center - Google Patents

Publication number
CN110635954A
Authority
CN
China
Prior art keywords
fault
alarm information
equipment
data center
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911002517.3A
Other languages
Chinese (zh)
Other versions
CN110635954B (en)
Inventor
朱聿津
戴之光
张维嘉
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Travelsky Technology Co Ltd
Original Assignee
China Travelsky Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Travelsky Technology Co Ltd
Priority to CN201911002517.3A
Publication of CN110635954A
Application granted
Publication of CN110635954B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for processing network faults of a data center. The method comprises: acquiring alarm information generated when the data center network is abnormal; based on the alarm information, looking up the faulty device and the fault cause corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices and fault causes, and taking them as the fault information of the data center network fault; and closing all device ports connected to the faulty device, thereby isolating the faulty device, and enabling the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly saves labor cost, improves fault-handling efficiency and effectively reduces operating errors caused by manual operation.

Description

Method and system for processing network fault of data center
Technical Field
The invention relates to the technical field of network data processing, in particular to a method and a system for processing network faults of a data center.
Background
With the rapid advance of information technology, data volumes are growing explosively, data centers are developing ever faster, and network structures are becoming increasingly complex. A data center network is the network deployed inside a data center. Because its traffic is characterized by concentrated data exchange, growing east-west traffic and the like, further requirements are placed on it: high scalability, high robustness, flexible control of topology and link capacity, energy efficiency, and so on. However, data concentration also means concentrated risk, concentrated response and concentrated complexity. Failures, including emergencies, are therefore inevitable in a data center network.
Data center network failures come in many types, chiefly the failure of a device, link or server in the data center network such that it can no longer provide normal service externally. Because the number of network devices is huge, a fault produces a large volume of alarm information and is difficult to locate. In an emergency especially, locating and handling the fault manually on experience alone easily causes operating accidents, takes a long time and consumes a great deal of manpower.
Disclosure of Invention
In view of this, the invention discloses a method and a system for processing a data center network fault, so as to realize automatic positioning and fault processing of the data center network fault, thereby not only greatly saving labor cost and improving fault processing efficiency, but also effectively reducing operation faults caused by manual operation.
A method for processing a data center network fault comprises the following steps:
acquiring alarm information generated when a data center network is abnormal;
based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reason, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
searching all equipment interfaces connected with the fault equipment;
and sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
Optionally, the finding, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and using the found faulty device and fault reason as the fault information for generating the data center network fault specifically includes:
matching the alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
Optionally, the finding, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and using the found faulty device and fault reason as the fault information for generating the data center network fault specifically includes:
performing absorption matching of the alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
Optionally, the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram specifically includes:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, performing absorption matching of the alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
Optionally, after the sending a device interface shutdown instruction to the target device connected to the faulty device through the device interface, so that the target device is disconnected from the faulty device and a standby device of the faulty device is enabled, the method further includes:
judging whether the data center network has recovered to normal, wherein recovery to normal comprises: the faulty device is isolated, the standby device is enabled and the network is operating normally;
and if so, storing the configuration of the fault equipment and the fault reason in a corresponding relationship form.
A system for handling data center network failures, comprising:
the acquisition unit is used for acquiring alarm information generated when the data center network is abnormal;
the first searching unit is used for searching the fault equipment and the fault reason corresponding to the alarm information from the corresponding relation among the pre-stored alarm information, fault equipment and fault reasons based on the alarm information, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
the second searching unit is used for searching all the equipment interfaces connected with the fault equipment;
and the fault processing unit is used for sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, standby equipment of the fault equipment is started.
Optionally, the first searching unit specifically includes:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
Optionally, the first searching unit specifically includes:
the second matching subunit is used for performing absorption matching of the alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and the second fault selecting subunit is used for determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
Optionally, the second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, performing absorption matching of the alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
Optionally, the system further includes:
a determining unit, configured to determine, after the fault processing unit has sent a device interface closing instruction to the target device connected to the faulty device through the device interface, so that the target device is disconnected from the faulty device and the standby device of the faulty device is enabled, whether the data center network has recovered to normal, where recovery to normal includes: the faulty device is isolated, the standby device is enabled and the network is operating normally;
and a storage unit, configured to store the configuration of the faulty device and the fault cause in the form of a correspondence when the determining unit determines that the data center network has recovered to normal.
According to the above technical scheme, the invention discloses a method for processing network faults of a data center. The method comprises: acquiring alarm information generated when the data center network is abnormal; based on the alarm information, looking up the faulty device and the fault cause corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices and fault causes, and taking them as the fault information of the data center network fault; and closing all device ports connected to the faulty device, thereby isolating the faulty device, and enabling the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly saves labor cost, improves fault-handling efficiency and effectively reduces operating errors caused by manual operation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the disclosed drawings without creative efforts.
FIG. 1 is a flowchart of a method for processing a data center network fault according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alarm information database according to an embodiment of the present invention;
FIG. 3 is an alarm information absorption tree diagram according to an embodiment of the present invention;
FIG. 4 is a flowchart of another method for processing a data center network fault according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a system for processing a data center network fault according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the invention disclose a method and a system for processing a data center network fault. Alarm information generated when the data center network is abnormal is acquired; based on the alarm information, the faulty device and the fault cause corresponding to the alarm information are looked up in a pre-stored correspondence among alarm information, faulty devices and fault causes, and taken as the fault information of the data center network fault; all device ports connected to the faulty device are then closed, isolating the faulty device, and the standby device of the faulty device is enabled. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly saves labor cost, improves fault-handling efficiency and effectively reduces operating errors caused by manual operation.
Referring to FIG. 1, a flowchart of a method for processing a data center network fault disclosed in an embodiment of the present invention, the method includes the steps of:
Step S101, acquiring alarm information generated when a data center network is abnormal;
Specifically, in practical applications, the alarm information generated when the network is abnormal may be detected by an operation and maintenance monitoring platform, where a network abnormality includes a network error condition.
In this embodiment, the alarm information indicates that a data center network fault may have occurred. In an actual production environment, a single fault generally generates multiple pieces of alarm information, and each piece of alarm information includes: event first occurrence time, event latest occurrence time, alarm count, event name, alarm device IP, alarm object (a specific port, motherboard, etc.), alarm event source and script (for example, ICM alarm source IP and device reachability script), alarm event details (for example, an SNMP Trap alarm, or an alarm from a device vendor's management system such as H3C's IMC), and the like.
It should be noted that, in practical applications, the content of the alarm information may be listed in an entry manner, and the alarm information may be classified according to the source or the device model of the alarm information.
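As an illustration only, the alarm entry described above could be modeled as a small record type and grouped by source. This is a minimal sketch: the field names, the sample values and the `classify_by_source` helper are assumptions made for the example and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Hypothetical field names that mirror the items listed in the text.
    first_seen: str      # event first occurrence time
    last_seen: str       # event latest occurrence time
    count: int           # number of times the alarm was raised
    event_name: str
    device_ip: str       # IP of the alarming device
    alarm_object: str    # e.g. a specific port or board
    source: str          # e.g. an SNMP trap or a vendor management system
    details: str = ""


def classify_by_source(alarms):
    """Group alarm entries by source, one of the two groupings the text mentions."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)


# Made-up sample data for illustration.
alarms = [
    Alarm("2019-10-21T10:00:00", "2019-10-21T10:00:05", 3,
          "link down", "10.0.0.1", "port 1/0/1", "snmp_trap"),
    Alarm("2019-10-21T10:00:01", "2019-10-21T10:00:01", 1,
          "device unreachable", "10.0.0.1", "board 0", "reachability_script"),
    Alarm("2019-10-21T10:00:02", "2019-10-21T10:00:04", 2,
          "link down", "10.0.0.2", "port 1/0/7", "snmp_trap"),
]
by_source = classify_by_source(alarms)
```

Grouping by device model would follow the same pattern with a different key.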
Step S102, based on the alarm information, finding out fault equipment and fault reasons corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reasons, and using the fault equipment and the fault reasons as fault information for generating the data center network fault;
wherein, step S102 may specifically include:
matching the acquired alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
In this embodiment, the matching degree is the percentage of all alarm items listed under a fault cause that coincide with the acquired alarm information.
It should be noted that, an alarm information database is established in advance, and the alarm information database includes: fault equipment and fault cause, and alarm information caused by the fault equipment.
Specifically, referring to FIG. 2, a schematic diagram of the composition of an alarm information database disclosed in an embodiment of the present invention, each entry in the alarm information database consists of a faulty device and its fault cause together with the alarm entries caused by that faulty device: alarm entry 1, alarm entry 2, alarm entry 3, …, alarm entry n, where n is a positive integer.
The value of the preset matching degree is determined according to actual needs, for example, the preset matching degree is 80%, and the present invention is not limited herein.
And when the alarm information database does not have alarm items with the matching degree not lower than the preset matching degree, updating the alarm information database.
This database-matching approach is suited to data centers whose database is relatively complete, that is, where sufficient information about faulty devices, fault causes and alarm information is available.
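The database lookup with a preset matching degree can be sketched as follows. This is a minimal sketch under stated assumptions: the database layout, the device and alarm names, and the choice to return the single best-scoring entry are hypothetical; the text only requires that the matching degree be not lower than the preset value (80% in the example above).

```python
def match_degree(observed, expected):
    """Fraction of a database entry's alarm items found among the observed alarms."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected & set(observed)) / len(expected)


def locate_fault(observed, database, threshold=0.8):
    """Return (device, cause, degree) for the best entry meeting the threshold, else None."""
    best = None
    for entry in database:
        degree = match_degree(observed, entry["alarms"])
        # Assumption: when several entries qualify, keep the highest degree.
        if degree >= threshold and (best is None or degree > best[2]):
            best = (entry["device"], entry["cause"], degree)
    return best


# Hypothetical alarm information database with two entries.
database = [
    {"device": "core-sw-1", "cause": "power module failure",
     "alarms": ["power lost", "device unreachable", "link down",
                "vrrp state change", "ospf neighbor down"]},
    {"device": "access-sw-7", "cause": "uplink fibre break",
     "alarms": ["link down", "lldp neighbor lost"]},
]
observed = ["device unreachable", "link down", "power lost", "vrrp state change"]
result = locate_fault(observed, database)
```

Here four of the five alarm items for `core-sw-1` are observed, so that entry reaches the 80% threshold, while `access-sw-7` reaches only 50%. A `None` result corresponds to the case in which the database has no qualifying entry and would be updated.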
In the foregoing embodiment, step S102 may further include:
performing absorption matching of the acquired alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
The alarm information absorption tree diagram is established based on the equipment network topological diagram and the known alarm information causal relationship.
Referring to FIG. 3, in the alarm information absorption tree diagram disclosed in an embodiment of the present invention, the root alarm information includes multiple pieces of parent alarm information, such as parent alarm information 1 and parent alarm information 2, and each piece of parent alarm information includes multiple pieces of child alarm information. For example, parent alarm information 1 includes child alarm information 1 and child alarm information 2, and parent alarm information 2 includes child alarm information 3 and child alarm information 4.
When new alarm information is generated, alarm absorption association analysis is performed on the basis of the alarm information absorption tree diagram to judge whether the new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram. If it can, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbs it. If it cannot, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way the alarm information absorption tree diagram can be generated and updated in real time.
Because alarm information associated by the alarm information rules is usually generated within a limited period of time, an associated alarm information time period can be set in practical applications. Before judging whether new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram, it is first judged whether the generation time of the new alarm information falls within the set associated alarm information time period. If it does, the absorption judgment proceeds; otherwise, the new alarm information is discarded.
In order to further optimize the above embodiment, the absorbing and matching of the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically includes:
judging whether the generation time of the acquired alarm information is within a set associated alarm information time period;
if yes, performing absorption matching of the acquired alarm information against each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
taking fault equipment and fault reasons corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault;
and if not, discarding the acquired alarm information.
It should be noted that the alarm information absorption tree diagram is established from the alarm information acquired within the set associated alarm information time period.
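The time-window check followed by the walk up to the root alarm might look like the sketch below. It is illustrative only: the tree contents, the five-minute window, and the `PARENT`, `ROOT_FAULT` and `absorb` names are assumptions, and absorption is simplified to an exact lookup of a known alarm in the tree rather than a rule-based match.

```python
from datetime import datetime, timedelta

# Hypothetical tree: each known alarm maps to its parent; the root maps to None.
PARENT = {
    "core-sw-1 down": None,
    "agg-sw-1 unreachable": "core-sw-1 down",
    "agg-sw-2 unreachable": "core-sw-1 down",
    "server-11 link down": "agg-sw-1 unreachable",
    "server-12 link down": "agg-sw-1 unreachable",
    "server-21 link down": "agg-sw-2 unreachable",
}
# Hypothetical mapping from a root alarm to (faulty device, fault cause).
ROOT_FAULT = {"core-sw-1 down": ("core-sw-1", "power module failure")}


def root_of(alarm):
    """Walk parent links until the root alarm is reached."""
    while PARENT[alarm] is not None:
        alarm = PARENT[alarm]
    return alarm


def absorb(alarm, raised_at, window_start, window=timedelta(minutes=5)):
    """Return (device, cause) of the root alarm, or None if discarded or unmatched."""
    if not (window_start <= raised_at <= window_start + window):
        return None  # outside the associated-alarm time period: discard
    if alarm not in PARENT:
        return None  # cannot be absorbed; the text marks it and updates the tree
    return ROOT_FAULT[root_of(alarm)]


start = datetime(2019, 10, 21, 10, 0, 0)
in_window = absorb("server-12 link down", start + timedelta(seconds=30), start)
late = absorb("server-12 link down", start + timedelta(minutes=10), start)
```

An alarm raised inside the window resolves to the fault recorded for the root alarm, whereas the same alarm raised after the window closes is dropped.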
Step S103, searching all equipment interfaces connected with the fault equipment;
Specifically, in practical applications, all device interfaces connected to the faulty device are found from the device topology and the CMDB (Configuration Management Database).
Step S104, sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
Specifically, in practical application, an equipment interface closing instruction may be sent to a target equipment connected to the faulty equipment through an equipment interface, so that the connection between the faulty equipment and the target equipment is disconnected, the faulty equipment is isolated from the equipment network, and fault isolation is achieved.
The faulty device can be isolated in either of two ways, as follows:
Method one
In a data center system with an independently deployed out-of-band management network, once the faulty device is determined, the faulty device is logged into automatically over the management network through a NETCONF interface and the CLI, using the login IP, user name and password stored in the CMDB. All available device ports of the faulty device are queried in the CMDB, and an instruction to close all of those ports is executed automatically according to the labels in the CMDB.
Method two
In a data center system without an independently deployed out-of-band management network, the faulty device cannot be logged into and operated directly. When a device fault occurs, all device ports with an uplink or downlink connection to the faulty device are therefore found from the device network topology by means of the CMDB and the Link Layer Discovery Protocol (LLDP); each target device with an uplink or downlink connection to the faulty device is logged into in turn on the basis of those ports, and a device interface closing instruction is executed for the port connected to the faulty device. It should be noted that, because the data center architecture and scenarios use an active/standby dual-active mode rather than a single-point structure, the order in which the target devices are logged into need not be considered.
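For method two, deriving which ports to close on which neighbouring devices can be sketched as follows. The link records, device names and port names are made up for illustration, and the sketch only builds the list of (target device, port) pairs; actually logging into the devices and issuing the close instruction is outside its scope.

```python
# Hypothetical link records, as might be assembled from CMDB and LLDP neighbour data:
# (device_a, port_a, device_b, port_b)
LINKS = [
    ("core-sw-1", "Te1/0/1", "agg-sw-1", "Te1/0/49"),
    ("core-sw-1", "Te1/0/2", "agg-sw-2", "Te1/0/49"),
    ("core-sw-2", "Te1/0/1", "agg-sw-1", "Te1/0/50"),
    ("fw-1", "Ge0/0/1", "core-sw-1", "Ge1/0/10"),
]


def neighbor_ports(faulty, links):
    """Map each neighbouring device to its own ports that face the faulty device."""
    plan = {}
    for dev_a, port_a, dev_b, port_b in links:
        if dev_a == faulty:
            plan.setdefault(dev_b, []).append(port_b)
        elif dev_b == faulty:
            plan.setdefault(dev_a, []).append(port_a)
    return plan


def shutdown_plan(faulty, links):
    """List (target_device, port) pairs on which a close instruction would be issued."""
    ports_by_device = neighbor_ports(faulty, links)
    return [(dev, port)
            for dev in sorted(ports_by_device)
            for port in sorted(ports_by_device[dev])]


plan = shutdown_plan("core-sw-1", LINKS)
```

The plan is sorted only to make the output deterministic; as the text notes, the login order itself is not significant in a redundant architecture.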
In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation.
After the data center network fault is processed, the fault processing result is also verified.
Referring to FIG. 4, a flowchart of a method for processing a data center network fault according to another embodiment of the present invention, on the basis of the embodiment shown in FIG. 1, after step S104 the method may further include the steps of:
Step S105, judging whether the data center network has recovered to normal, and if so, executing step S106;
This embodiment determines whether the data center network fault has been fully handled by judging whether the data center network has recovered to normal.
Judging whether the data center network has recovered to normal covers: whether the faulty device is isolated, whether the standby device of the faulty device is enabled, and whether the network is operating normally. When all three hold, the data center network is determined to have recovered to normal; otherwise, it is determined not to have recovered, and the unprocessed fault continues to be handled until the data center network recovers to normal.
For example, suppose machine A and machine B are each other's active and standby. When machine A fails and fault handling has been performed, machine B is logged into automatically through a NETCONF interface and the CLI, regardless of whether machine A is in the management network, to check whether machine B is enabled and whether the network is unobstructed. The state of machine A is checked through VRRP (Virtual Router Redundancy Protocol); if machine A is offline, it has been isolated successfully.
Step S106, storing the configuration of the faulty device and the fault cause in the form of a correspondence.
After the device is determined to have a fault, the configuration of the faulty device and the fault cause are stored in the form of a correspondence, to provide a reference for subsequent device configuration changes, emergency fault repair and the like; the configuration can also be rolled back according to the original configuration of the faulty device.
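Steps S105 and S106 can be sketched together as below. This is a minimal sketch: the `RecoveryChecks` fields simply name the three conditions given in the text, and the in-memory `store` dictionary, the device name and the configuration string are hypothetical stand-ins for whatever the platform actually probes and persists.

```python
from dataclasses import dataclass


@dataclass
class RecoveryChecks:
    faulty_isolated: bool    # e.g. the faulty device is reported offline
    standby_enabled: bool    # the standby device has been enabled
    network_normal: bool     # connectivity checks pass


def network_recovered(checks):
    """The network counts as recovered only when all three conditions hold."""
    return checks.faulty_isolated and checks.standby_enabled and checks.network_normal


def record_fault(store, device, config, cause, checks):
    """Store the (configuration, cause) pair for the device once recovery is confirmed."""
    if not network_recovered(checks):
        return False  # keep handling the fault; nothing is stored yet
    store.setdefault(device, []).append({"config": config, "cause": cause})
    return True


store = {}
ok = record_fault(store, "machine-A", "vlan 10\ninterface Te1/0/1",
                  "power module failure", RecoveryChecks(True, True, True))
pending = record_fault(store, "machine-A", "vlan 10", "unknown",
                       RecoveryChecks(True, False, True))
```

Keeping the stored configuration next to the cause is what allows a later rollback to the original configuration of the faulty device.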
In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation. In addition, in order to ensure complete fault processing of the data center network, the invention also verifies whether the data center network is recovered to be normal or not, and stores the configuration of the fault equipment and the fault reason in a corresponding relation form after the verification is passed so as to provide reference for subsequent equipment configuration change, fault first-aid repair and the like, and can realize configuration backtracking according to the original configuration of the fault equipment.
Corresponding to the embodiment of the method, the invention discloses a system for processing the network fault of the data center.
Referring to fig. 5, an embodiment of the present invention discloses a schematic structural diagram of a system for processing a data center network failure, where the system includes:
an obtaining unit 201, configured to obtain alarm information generated when an abnormality occurs in a data center network;
Specifically, in practical applications, the alarm information generated when the network is abnormal may be detected by the operation and maintenance monitoring platform, where a network abnormality includes a network error condition.
In this embodiment, the alarm information indicates that a data center network fault may have occurred. In an actual production environment, a single fault generally generates multiple pieces of alarm information.
Each piece of alarm information includes: event first occurrence time, event latest occurrence time, number of alarms, event name, alarm device IP, alarm object (a specific port, mainboard, etc.), alarm event source and script (for example, the IMC alarm source IP and the device reachability script), alarm event details (for example, an SNMP Trap alarm, or an alarm from a device vendor management system such as H3C's IMC), and the like.
It should be noted that, in practical applications, the content of the alarm information may be listed as entries, and the alarm information may be classified by its source or by device model.
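As an illustration of the alarm fields and the classification by source described above, the following is a minimal sketch. The class name `Alarm`, its field names, and the helper `group_by_source` are assumptions made for this example and do not appear in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    """One alarm entry; field names are illustrative, not taken from the patent."""
    first_seen: datetime   # event first occurrence time
    last_seen: datetime    # event latest occurrence time
    count: int             # number of alarms
    event_name: str
    device_ip: str         # alarm device IP
    alarm_object: str      # e.g. a specific port or mainboard
    source: str            # alarm event source, e.g. "snmp-trap"
    details: str = ""      # alarm event details


def group_by_source(alarms):
    """Classify alarms by their source (one of the two groupings mentioned)."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)
```

Grouping by device model would follow the same pattern with a different key.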
A first searching unit 202, configured to look up, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices, and fault reasons, and to use the faulty device and the fault reason as the fault information of the data center network fault;
the first searching unit 202 may specifically include:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
It should be noted that the alarm information database is established in advance and contains the faulty devices, the fault reasons, and the alarm information caused by each faulty device.
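The database lookup with a preset matching degree can be sketched as follows. The patent does not define how the matching degree is computed; this example assumes it is the fraction of an entry's pattern fields that the alarm matches exactly. The names `matching_degree`, `locate_fault`, the entry layout, and the default threshold of 0.75 are all hypothetical.

```python
def matching_degree(alarm: dict, entry_pattern: dict) -> float:
    """Fraction of the entry's pattern fields that the alarm matches exactly (assumed metric)."""
    if not entry_pattern:
        return 0.0
    hits = sum(1 for key, value in entry_pattern.items() if alarm.get(key) == value)
    return hits / len(entry_pattern)


def locate_fault(alarm: dict, database: list, threshold: float = 0.75):
    """Return (faulty_device, fault_reason) of the best entry at or above threshold, else None."""
    best_entry, best_score = None, 0.0
    for entry in database:
        score = matching_degree(alarm, entry["pattern"])
        # Keep only entries whose matching degree is not lower than the preset value.
        if score >= threshold and score > best_score:
            best_entry, best_score = entry, score
    if best_entry is None:
        return None
    return best_entry["faulty_device"], best_entry["fault_reason"]
```

Returning `None` when no entry reaches the threshold leaves room for a second strategy, such as the absorption tree matching, to handle the alarm.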
The first searching unit 202 may further include:
the second matching subunit is used for absorption-matching the alarm information against each piece of child alarm information and each piece of parent alarm information in a pre-established alarm information absorption tree diagram;
and determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
The second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorption-matching the alarm information against each piece of child alarm information and each piece of parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
It should be noted that, when new alarm information is generated, alarm information absorption association analysis is performed based on the alarm information absorption tree diagram to determine whether the new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram. If it can be absorbed, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbs it. If it cannot be absorbed, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way, the alarm information absorption tree diagram can be generated and updated in real time.
Because alarm information associated through the alarm information rules is usually generated within a certain period of time, an associated alarm information time period can be set in practical applications. Before judging whether new alarm information can be absorbed by parent or child alarm information in the tree diagram, it is first judged whether the generation time of the new alarm information falls within the set associated alarm information time period. If it does, it is then judged whether the new alarm information can be absorbed by parent or child alarm information in the tree diagram; otherwise, the new alarm information is discarded.
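The absorption procedure with its time window can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the names `AlarmNode`, `AbsorptionForest`, and `offer` are hypothetical, the absorption rule is supplied by the caller as `can_absorb(existing, new)`, the time window is assumed to start at the first alarm offered, and an alarm that cannot be absorbed is assumed to be marked and added as a new root.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmNode:
    name: str
    device: str
    timestamp: float                    # generation time, in seconds
    marked: bool = False                # set when no existing alarm absorbed this one
    children: list = field(default_factory=list)


def walk(node):
    """Yield a node and all of its descendants, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


class AbsorptionForest:
    def __init__(self, window_seconds, can_absorb):
        self.window_seconds = window_seconds
        self.can_absorb = can_absorb    # hypothetical rule: (existing, new) -> bool
        self.window_start = None
        self.roots = []

    def offer(self, alarm):
        """Return the root alarm that absorbed `alarm`, `alarm` itself if it became
        a new marked root, or None if it fell outside the time window."""
        if self.window_start is None:
            self.window_start = alarm.timestamp
        end = self.window_start + self.window_seconds
        if not (self.window_start <= alarm.timestamp <= end):
            return None                 # outside the associated time period: discard
        for root in self.roots:
            for node in walk(root):
                if self.can_absorb(node, alarm):
                    node.children.append(alarm)
                    return root         # the root alarm identifies the fault
        alarm.marked = True
        self.roots.append(alarm)
        return alarm
```

The root returned by `offer` is the alarm whose faulty device and fault reason would be reported as the fault information.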
A second searching unit 203, configured to search all device interfaces connected to the faulty device;
a fault processing unit 204, configured to send an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device is disconnected from the faulty device, and a standby device of the faulty device is enabled at the same time.
Specifically, in practical application, an equipment interface closing instruction may be sent to a target equipment connected to the faulty equipment through an equipment interface, so that the connection between the faulty equipment and the target equipment is disconnected, the faulty equipment is isolated from the equipment network, and fault isolation is achieved.
There are two methods for isolating the faulty device, as follows:
Method 1
In a data center system with an independently deployed out-of-band management network, once the faulty device is determined, it is logged in to automatically through the management network via the NETCONF interface and the CLI; the login IP, user name, and password are stored in the CMDB. All available device ports of the faulty device are queried in the CMDB, and an instruction to close all available device ports of the faulty device is executed automatically according to the labels in the CMDB.
Method 2
In a data center system without an independently deployed out-of-band management network, the faulty device cannot be logged in to and operated directly. When a device fault occurs, all device ports that are connected to the faulty device by an uplink or a downlink are therefore found through the CMDB and the Link Layer Discovery Protocol (LLDP) according to the device network topology. All target devices connected to the faulty device by an uplink or a downlink are then logged in to in turn based on those device ports, and a device interface closing instruction is executed for each device port connected to the faulty device. It should be noted that, because the data center architecture and scenarios use an active/standby dual-active mode rather than a single-point structure, the order in which the target devices are logged in to does not need to be considered.
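Method 2 can be sketched as follows, assuming the topology obtained from the CMDB and LLDP has already been reduced to a list of links of the form `(device_a, port_a, device_b, port_b)`. The function names and the `shut_port` callback are hypothetical; in practice the callback would wrap whatever login session is used on the target device.

```python
def neighbor_ports(links, faulty_device):
    """Return sorted (target_device, target_port) pairs directly linked to faulty_device."""
    ports = set()
    for dev_a, port_a, dev_b, port_b in links:
        if dev_a == faulty_device and dev_b != faulty_device:
            ports.add((dev_b, port_b))
        elif dev_b == faulty_device and dev_a != faulty_device:
            ports.add((dev_a, port_a))
    return sorted(ports)


def isolate(faulty_device, links, shut_port):
    """Shut every neighbor port facing the faulty device; return the ports that were shut."""
    closed = []
    for target, port in neighbor_ports(links, faulty_device):
        shut_port(target, port)  # hypothetical callback that issues the interface closing instruction
        closed.append((target, port))
    return closed
```

The ports are processed in sorted order only to make the result deterministic; as the text notes, the login order itself does not matter.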
In summary, the invention discloses a system for processing a data center network fault. The system acquires the alarm information generated when the data center network is abnormal; looks up, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information in the pre-stored correspondence among alarm information, faulty devices, and fault reasons, and uses them as the fault information of the data center network fault; closes all device ports connected to the faulty device to isolate it; and enables the standby device of the faulty device. Compared with the traditional scheme, the invention automatically locates and handles data center network faults, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operational errors caused by manual work.
After the data center network fault is processed, the fault processing result is also verified.
To further optimize the above embodiment, the system for processing the data center network failure may further include:
a determining unit, configured to, after the fault processing unit 204 sends an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device disconnects from the faulty device and activates a standby device of the faulty device, determine whether a data center network is recovered to normal, where the data center network recovering to normal includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and the storage unit is used for storing the configuration of the faulty device and the fault reason in association with each other when the determining unit determines that the data center network has recovered to normal.
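The verification and storage step can be illustrated with a short sketch. The three recovery conditions come from the text above; the function names, the record layout, and the use of an in-memory list as the store are assumptions made for this example.

```python
def network_recovered(device_isolated: bool, standby_enabled: bool, network_normal: bool) -> bool:
    """Recovery requires all three conditions: isolation, standby enabled, network normal."""
    return device_isolated and standby_enabled and network_normal


def record_fault(history: list, device: str, config: str, reason: str, recovered: bool) -> bool:
    """Store the faulty device's configuration with the fault reason only after verification passes."""
    if not recovered:
        return False
    history.append({"device": device, "config": config, "reason": reason})
    return True
```

Keeping the original configuration in each record is what makes a later configuration rollback possible.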
In summary, the invention discloses a system for processing a data center network fault. The system acquires the alarm information generated when the data center network is abnormal; looks up, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information in the pre-stored correspondence among alarm information, faulty devices, and fault reasons, and uses them as the fault information of the data center network fault; closes all device ports connected to the faulty device to isolate it; and enables the standby device of the faulty device. Compared with the traditional scheme, the invention automatically locates and handles data center network faults, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operational errors caused by manual work. In addition, to ensure that the data center network fault is fully handled, the invention also verifies whether the data center network has recovered to normal and, after the verification passes, stores the configuration of the faulty device and the fault reason in association with each other. This provides a reference for subsequent device configuration changes, emergency fault repair, and the like, and allows the configuration to be rolled back based on the original configuration of the faulty device.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the parts that are the same or similar among the embodiments may be cross-referenced.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing a data center network fault is characterized by comprising the following steps:
acquiring alarm information generated when a data center network is abnormal;
based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reason, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
searching all equipment interfaces connected with the fault equipment;
and sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
2. The processing method according to claim 1, wherein the searching for the fault equipment and the fault reason corresponding to the alarm information from the pre-stored correspondence among the alarm information, the fault equipment, and the fault reason based on the alarm information, and using the fault equipment and the fault reason as the fault information for generating the data center network fault specifically includes:
matching the alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
3. The processing method according to claim 1, wherein the searching for the fault equipment and the fault reason corresponding to the alarm information from the pre-stored correspondence among the alarm information, the fault equipment, and the fault reason based on the alarm information, and using the fault equipment and the fault reason as the fault information for generating the data center network fault specifically includes:
absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
4. The processing method according to claim 3, wherein the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically comprises:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
5. The processing method according to claim 1, wherein after the sending of the device interface shutdown instruction to the target device connected to the failed device through the device interface causes the target device to disconnect from the failed device while enabling the standby device of the failed device, further comprising:
judging whether the data center network is recovered to be normal or not, wherein the step of recovering to be normal comprises the following steps: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and if so, storing the configuration of the fault equipment and the fault reason in a corresponding relationship form.
6. A system for handling data center network failures, comprising:
the acquisition unit is used for acquiring alarm information generated when the data center network is abnormal;
the first searching unit is used for searching the fault equipment and the fault reason corresponding to the alarm information from the corresponding relation among the pre-stored alarm information, fault equipment and fault reasons based on the alarm information, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
the second searching unit is used for searching all the equipment interfaces connected with the fault equipment;
and the fault processing unit is used for sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, standby equipment of the fault equipment is started.
7. The processing system of claim 6, wherein the first lookup unit specifically comprises:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
8. The processing system of claim 6, wherein the first lookup unit specifically comprises:
the second matching subunit is used for absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
9. The processing system of claim 8, wherein the second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
10. The processing system of claim 6, further comprising:
a determining unit, configured to, after the failure processing unit sends an equipment interface shutdown instruction to a target equipment connected to the failed equipment through the equipment interface, so that the target equipment is disconnected from the failed equipment, and a standby equipment of the failed equipment is enabled, determine whether a data center network is recovered to be normal, where the recovering to be normal of the data center network includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and the storage unit is used for storing the configuration of the fault equipment and the fault reason in a corresponding relationship form when the determining unit determines that the data center network is recovered to be normal.
CN201911002517.3A 2019-10-21 2019-10-21 Method and system for processing network fault of data center Active CN110635954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911002517.3A CN110635954B (en) 2019-10-21 2019-10-21 Method and system for processing network fault of data center

Publications (2)

Publication Number Publication Date
CN110635954A true CN110635954A (en) 2019-12-31
CN110635954B CN110635954B (en) 2022-10-21

Family

ID=68976905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911002517.3A Active CN110635954B (en) 2019-10-21 2019-10-21 Method and system for processing network fault of data center

Country Status (1)

Country Link
CN (1) CN110635954B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111865673A (en) * 2020-07-08 2020-10-30 上海燕汐软件信息科技有限公司 Automatic fault management method, device and system
CN114285725A (en) * 2021-12-24 2022-04-05 中国电信股份有限公司 Network fault determination method and device, storage medium and electronic equipment
WO2022193617A1 (en) * 2021-03-16 2022-09-22 通号通信信息集团有限公司 Fault location method, fault location system, and video management system
CN114285725B (en) * 2021-12-24 2024-07-02 中国电信股份有限公司 Network fault determining method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021195A (en) * 2014-06-13 2014-09-03 中国民航信息网络股份有限公司 Warning association analysis method based on knowledge base
WO2016086705A1 (en) * 2014-12-02 2016-06-09 中兴通讯股份有限公司 Fault locating method, and server
CN106130761A (en) * 2016-06-22 2016-11-16 北京百度网讯科技有限公司 The recognition methods of the failed network device of data center and device
CN108989132A (en) * 2018-08-24 2018-12-11 深圳前海微众银行股份有限公司 Fault warning processing method, system and computer readable storage medium

Also Published As

Publication number Publication date
CN110635954B (en) 2022-10-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant