CN110635954A - Method and system for processing network fault of data center - Google Patents
- Publication number
- CN110635954A CN201911002517.3A
- Authority
- CN
- China
- Prior art keywords
- fault
- alarm information
- equipment
- data center
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a method for processing network faults of a data center. The method comprises: acquiring alarm information generated when the data center network is abnormal; based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from a prestored corresponding relation among alarm information, fault equipment and fault reasons, and taking them as the fault information of the data center network fault; and closing all equipment ports connected with the fault equipment, thereby isolating the fault equipment, while starting the standby equipment of the fault equipment. Compared with the traditional scheme, the invention locates and processes data center network faults automatically, which greatly saves labor cost, improves fault processing efficiency and effectively reduces operating errors caused by manual operation.
Description
Technical Field
The invention relates to the technical field of network data processing, in particular to a method and a system for processing network faults of a data center.
Background
With the rapid advance of information technology, data volumes are growing explosively, data centers are developing ever faster, and network structures are increasingly complex. A data center network is the network deployed inside a data center. Because traffic in such a network is characterized by concentrated data exchange, growing east-west traffic and the like, further requirements are placed on it: high scalability, high robustness, flexible control of topology and link capacity, energy efficiency, and so on. However, data concentration also means concentration of risk, of response and of complexity. Data center network failures are therefore inevitable, and emergency situations in particular will arise.
Data center network faults are of many types, mainly failures of the equipment, links or servers of the data center network that prevent it from providing normal service externally. Because the number of network devices is huge, a fault produces a large amount of alarm information and is difficult to locate. In an emergency in particular, locating and handling the fault manually by experience alone easily causes operating accidents, takes a long time and consumes a great deal of manpower.
Disclosure of Invention
In view of this, the invention discloses a method and a system for processing a data center network fault, so as to realize automatic positioning and fault processing of the data center network fault, thereby not only greatly saving labor cost and improving fault processing efficiency, but also effectively reducing operation faults caused by manual operation.
A method for processing a data center network fault comprises the following steps:
acquiring alarm information generated when a data center network is abnormal;
based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reason, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
searching all equipment interfaces connected with the fault equipment;
and sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
Optionally, the finding, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and using the found faulty device and fault reason as the fault information for generating the data center network fault specifically includes:
matching the alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
Optionally, the finding, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and using the found faulty device and fault reason as the fault information for generating the data center network fault specifically includes:
absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
Optionally, the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram specifically includes:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
Optionally, after the sending a device interface shutdown instruction to the target device connected to the faulty device through the device interface, so that the target device is disconnected from the faulty device and a standby device of the faulty device is enabled, the method further includes:
judging whether the data center network has recovered to normal, wherein the data center network having recovered to normal comprises: the fault equipment is isolated, the standby equipment is started and the network has returned to normal;
and if so, storing the configuration of the fault equipment and the fault reason in a corresponding relationship form.
A system for handling data center network failures, comprising:
the acquisition unit is used for acquiring alarm information generated when the data center network is abnormal;
the first searching unit is used for searching the fault equipment and the fault reason corresponding to the alarm information from the corresponding relation among the pre-stored alarm information, fault equipment and fault reasons based on the alarm information, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
the second searching unit is used for searching all the equipment interfaces connected with the fault equipment;
and the fault processing unit is used for sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, standby equipment of the fault equipment is started.
Optionally, the first searching unit specifically includes:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
Optionally, the first searching unit specifically includes:
the second matching subunit is used for absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
Optionally, the second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
Optionally, the system further includes:
a determining unit, configured to, after the failure processing unit sends an equipment interface shutdown instruction to a target equipment connected to the failed equipment through the equipment interface, so that the target equipment is disconnected from the failed equipment, and a standby equipment of the failed equipment is enabled, determine whether a data center network is recovered to be normal, where the recovering to be normal of the data center network includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and the storage unit is used for storing the configuration of the fault equipment and the fault reason in a corresponding relation mode when the judging unit judges that the data center network has recovered to normal.
According to the technical scheme, the invention discloses a method for processing the network fault of the data center. The method comprises: obtaining the alarm information generated when the data center network is abnormal; based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among alarm information, fault equipment and fault reasons, and taking them as the fault information of the data center network fault; and closing all equipment ports connected with the fault equipment, thereby isolating the fault equipment, while starting the standby equipment of the fault equipment. Compared with the traditional scheme, the invention locates and processes data center network faults automatically, which greatly saves labor cost, improves fault processing efficiency and effectively reduces operating errors caused by manual operation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the disclosed drawings without creative efforts.
FIG. 1 is a flowchart of a method for processing a data center network fault according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alarm information database according to an embodiment of the present invention;
FIG. 3 is a tree diagram of an alarm information absorption according to an embodiment of the present invention;
FIG. 4 is a flowchart of another data center network failure processing method disclosed in the embodiments of the present invention;
FIG. 5 is a schematic structural diagram of a system for processing a data center network failure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method and a system for processing a data center network fault. Alarm information generated when the data center network is abnormal is acquired; based on the alarm information, the fault equipment and the fault reason corresponding to the alarm information are searched out from a prestored corresponding relation among alarm information, fault equipment and fault reasons and taken as the fault information of the data center network fault; and all equipment ports connected with the fault equipment are closed, thereby isolating the fault equipment, while the standby equipment of the fault equipment is started. Compared with the traditional scheme, the invention locates and processes data center network faults automatically, which greatly saves labor cost, improves fault processing efficiency and effectively reduces operating errors caused by manual operation.
Referring to fig. 1, an embodiment of the present invention discloses a flow chart of a method for processing a data center network fault, where the method includes the steps of:
Step S101, acquiring alarm information generated when a data center network is abnormal;
specifically, in practical application, the alarm information generated when the network is abnormal may be detected by the operation and maintenance monitoring platform, where a network abnormality includes a network error condition.
In this embodiment, the alarm information indicates that a data center network failure may have occurred. In an actual production environment, a single failure generally generates a plurality of pieces of alarm information, and each piece of alarm information includes: event first occurrence time, event latest occurrence time, alarm count, event name, alarm device IP, alarm object (a specific port, motherboard, etc.), alarm event source and script (for example, IMC alarm source IP and device reachability script), alarm event details (for example, an SNMP Trap alarm, or a device vendor management system alarm such as an IMC alarm of H3C), and the like.
It should be noted that, in practical applications, the content of the alarm information may be listed in an entry manner, and the alarm information may be classified according to the source or the device model of the alarm information.
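The alarm fields listed above can be pictured as a simple record type. The following is a minimal sketch only; the class name `Alarm`, its field names and the helper `group_by_source` are illustrative assumptions and do not come from the patent text.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    """One piece of alarm information (field names are hypothetical)."""
    first_seen: datetime   # event first occurrence time
    last_seen: datetime    # event latest occurrence time
    count: int             # alarm count
    event_name: str
    device_ip: str         # alarm device IP
    alarm_object: str      # e.g. a specific port or motherboard
    source: str            # alarm event source, e.g. "snmp-trap"
    details: str = ""      # alarm event details


def group_by_source(alarms):
    """Classify alarms by their source, as the description suggests."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)
```

Grouping by device model instead would only require a different key; the patent leaves the classification criterion open.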
Step S102, based on the alarm information, finding out fault equipment and fault reasons corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reasons, and using the fault equipment and the fault reasons as fault information for generating the data center network fault;
wherein, step S102 may specifically include:
matching the acquired alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
In this embodiment, the matching degree is the percentage of the alarm entries recorded under a fault reason that coincide with the acquired alarm information.
It should be noted that, an alarm information database is established in advance, and the alarm information database includes: fault equipment and fault cause, and alarm information caused by the fault equipment.
Specifically, referring to fig. 2, a schematic diagram of a composition of an alarm information database disclosed in an embodiment of the present invention is shown, where each record in the alarm information database consists of a fault equipment and a fault reason, together with the alarm entries caused by that fault equipment: alarm entry 1, alarm entry 2, alarm entry 3, ..., alarm entry n, where n is a positive integer.
The value of the preset matching degree is determined according to actual needs, for example, the preset matching degree is 80%, and the present invention is not limited herein.
And when the alarm information database does not have alarm items with the matching degree not lower than the preset matching degree, updating the alarm information database.
This way of establishing the alarm information database is suitable for a data center whose records are relatively complete, that is, one with relatively sufficient information on fault equipment, fault reasons and alarm information.
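The database lookup described above can be sketched as follows. This is a hedged illustration, not the patented implementation: it assumes the matching degree is the fraction of a record's alarm entries that appear among the acquired alarms, it represents alarms as plain strings, and the names `matching_degree` and `locate_fault` are hypothetical. The 0.8 default mirrors the 80% example in the description.

```python
def matching_degree(observed, entry_alarms):
    """Fraction of a record's alarm entries found among the observed alarms."""
    entries = set(entry_alarms)
    if not entries:
        return 0.0
    return len(entries & set(observed)) / len(entries)


def locate_fault(observed, database, threshold=0.8):
    """Return ((device, cause), score) for the best record at or above threshold.

    `database` maps (fault_device, fault_cause) to a list of alarm entries.
    Returns (None, 0.0) when no record reaches the threshold, which is the
    case in which the description says the database should be updated.
    """
    best, best_score = None, 0.0
    for key, entry_alarms in database.items():
        score = matching_degree(observed, entry_alarms)
        if score >= threshold and score > best_score:
            best, best_score = key, score
    return best, best_score
```

Picking the highest-scoring record when several exceed the threshold is a design choice of this sketch; the patent only requires that the matching degree be not lower than the preset value.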
In the foregoing embodiment, step S102 may further include:
carrying out absorption matching on the acquired alarm information and each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
The alarm information absorption tree diagram is established based on the equipment network topological diagram and the known alarm information causal relationship.
Referring to fig. 3, in an alarm information absorption tree diagram disclosed in an embodiment of the present invention, a root alarm information includes: a plurality of parent alarm information, such as parent alarm information 1, parent alarm information 2; each parent alarm message includes: a plurality of child alarm messages, for example, the parent alarm message 1 includes: sub alarm information 1 and sub alarm information 2; the parent alarm information 2 includes: sub alarm information 3 and sub alarm information 4.
When new alarm information is generated, alarm information absorption association analysis is carried out based on the alarm information absorption tree diagram to judge whether the new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram. If it can be absorbed, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbed it. If it cannot be absorbed, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way the alarm information absorption tree diagram can be generated and updated in real time.
Because alarm information associated through the alarm information rules is usually generated within a certain period of time, an associated alarm information time period can be set in practical application. Before judging whether the new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram, it is first judged whether the generation time of the new alarm information is within the set associated alarm information time period. If it is, the absorption judgment is carried out; otherwise, the new alarm information is discarded.
In order to further optimize the above embodiment, the absorbing and matching of the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically includes:
judging whether the generation time of the acquired alarm information is within a set associated alarm information time period;
if yes, the acquired alarm information is subjected to absorption matching with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;
taking fault equipment and fault reasons corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault;
and if not, discarding the acquired alarm information.
It should be noted that the alarm information absorption tree diagram is established according to the alarm information acquired within the set associated alarm information time period.
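The time-window check and the walk from a matched alarm up to its root alarm can be sketched as below. This is a simplified illustration under stated assumptions: alarms are identified by name, the tree is stored as a child-to-parent map, and the class `AbsorptionTree` with its methods is hypothetical. Adding unmatched alarms to the tree, which the description also mentions, is left out.

```python
from datetime import datetime, timedelta


class AbsorptionTree:
    """One root alarm with parent and child alarms beneath it (hypothetical layout)."""

    def __init__(self, root, device, cause, window_start, window):
        self.root = root
        self.device = device              # fault equipment tied to the root alarm
        self.cause = cause                # fault reason tied to the root alarm
        self.window_start = window_start
        self.window_end = window_start + window
        self.parent_of = {root: None}     # alarm name -> name of its parent alarm

    def add(self, alarm_name, parent_name):
        """Register a known alarm under an alarm that is already in the tree."""
        if parent_name not in self.parent_of:
            raise KeyError(f"unknown parent alarm: {parent_name}")
        self.parent_of[alarm_name] = parent_name

    def absorb(self, alarm_name, generated_at):
        """Return (root, device, cause) if the alarm is absorbed, else None."""
        if not (self.window_start <= generated_at <= self.window_end):
            return None                   # outside the associated time period: discard
        if alarm_name not in self.parent_of:
            return None                   # no parent or child alarm matches
        node = alarm_name
        while self.parent_of[node] is not None:
            node = self.parent_of[node]   # climb to the root alarm
        return node, self.device, self.cause
```

A real deployment would derive the parent links from the device topology and known alarm causality, as the description states; here they are supplied by hand.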
Step S103, searching all equipment interfaces connected with the fault equipment;
specifically, in practical applications, all the device interfaces connected to the faulty device are searched according to the device topology and a CMDB (Configuration Management Database).
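As a rough sketch of this lookup, assume the topology and CMDB data have already been exported as a list of point-to-point links. The `Link` tuple and the function `neighbor_interfaces` are illustrative names, not part of the patent or of any CMDB product.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One cable between two device ports (hypothetical CMDB/topology record)."""
    device_a: str
    port_a: str
    device_b: str
    port_b: str


def neighbor_interfaces(links, faulty_device):
    """List (target_device, target_port) pairs that face the faulty device."""
    result = []
    for link in links:
        if link.device_a == faulty_device:
            result.append((link.device_b, link.port_b))
        elif link.device_b == faulty_device:
            result.append((link.device_a, link.port_a))
    return result
```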
Step S104, sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
Specifically, in practical application, an equipment interface closing instruction may be sent to a target equipment connected to the faulty equipment through an equipment interface, so that the connection between the faulty equipment and the target equipment is disconnected, the faulty equipment is isolated from the equipment network, and fault isolation is achieved.
The fault equipment can be isolated by either of two methods, as follows:
Method 1
In a data center system with an independently deployed out-of-band management network, once the fault equipment is determined, it is logged in to automatically over the management network through the NETCONF interface and the CLI, using the login IP, user name and password stored in the CMDB. All available equipment ports of the fault equipment are then queried from the CMDB, and an instruction closing all of those ports is executed automatically according to the labels in the CMDB.
Method 2
In a data center system without an independently deployed out-of-band management network, the fault equipment cannot be logged in to and operated directly. When an equipment fault occurs, all equipment ports connected to the fault equipment on its uplinks or downlinks are therefore found from the equipment network topology, using the CMDB and the Link Layer Discovery Protocol (LLDP). Based on these equipment ports, all target equipment connected to the fault equipment on its uplinks or downlinks is logged in to in turn, and an equipment interface closing instruction is executed for each equipment port connected to the fault equipment. It should be noted that, because the data center architecture and scenarios all use an active/standby dual-active mode rather than a single-point structure, the order in which the target equipment is logged in to need not be considered.
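The choice between the two isolation methods can be sketched as a small planning function. It only decides which device to log in to and which ports to close; how the shutdown is actually issued (NETCONF, CLI) is vendor-specific and deliberately not modeled. The function name `plan_isolation` and its parameters are assumptions made for illustration.

```python
def plan_isolation(faulty_device, faulty_ports, neighbor_ports, has_oob_mgmt):
    """Return a list of (login_device, port_to_close) pairs.

    faulty_ports   -- available ports on the faulty device (from the CMDB)
    neighbor_ports -- (target_device, target_port) pairs facing the faulty device
    has_oob_mgmt   -- True if an out-of-band management network is deployed
    """
    if has_oob_mgmt:
        # Method 1: log in to the faulty device itself and close its ports.
        return [(faulty_device, port) for port in faulty_ports]
    # Method 2: log in to each connected target device and close the facing port.
    return [(device, port) for device, port in neighbor_ports]
```

Because the description notes that login order does not matter in an active/standby design, the returned list can be processed in any order.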
In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation.
After the data center network fault is processed, the fault processing result is also verified.
Referring to fig. 4, a flowchart of a method for processing a data center network fault according to another embodiment of the present invention is disclosed, and on the basis of the embodiment shown in fig. 1, after step S104, the method may further include the steps of:
step S105, judging whether the data center network is recovered to be normal, if so, executing step S106;
the embodiment determines whether the data center network failure is completely processed by judging whether the data center network is recovered to be normal.
Judging whether the data center network has recovered to normal comprises checking whether the fault equipment is isolated, whether the standby equipment of the fault equipment is started, and whether the network has returned to normal. When the fault equipment is isolated, the standby equipment is started and the network has returned to normal, the data center network is determined to have recovered to normal; otherwise, it is determined not to have recovered, and the unprocessed fault continues to be processed until the data center network recovers to normal.
For example, machine A and machine B are active and standby for each other. When machine A fails and the fault has been processed, machine B is logged in to automatically through the NETCONF interface and the CLI, regardless of whether machine A is in the management network, to check whether machine B has been enabled and whether the network is unblocked. The state of machine A is checked through VRRP (Virtual Router Redundancy Protocol); if machine A is in an offline state, machine A has been isolated successfully.
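The three-part recovery check can be expressed as a small status record. This is a sketch under the assumption that each condition has already been probed elsewhere (for instance through the login and VRRP checks in the example); the class `RecoveryStatus` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    """Outcome of the three checks named in step S105 (names are illustrative)."""
    fault_isolated: bool    # faulty device is offline or cut off
    standby_enabled: bool   # standby device has taken over
    network_normal: bool    # traffic is flowing again

    def recovered(self):
        """The network counts as recovered only when all three checks pass."""
        return self.fault_isolated and self.standby_enabled and self.network_normal

    def pending(self):
        """Names of the checks that still fail and need further processing."""
        names = ("fault_isolated", "standby_enabled", "network_normal")
        return [name for name in names if not getattr(self, name)]
```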
And step S106, storing the configuration of the fault equipment and the fault reason in a corresponding relationship mode.
After the equipment is determined to have a fault, the configuration of the fault equipment and the fault reason are stored in a corresponding relation mode so as to provide reference for subsequent equipment configuration change, fault first-aid repair and the like, and the configuration backtracking can be realized according to the original configuration of the fault equipment.
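Storing the configuration and the fault reason as a corresponding pair might look like the following in-memory sketch. The storage layout, the function names `archive_fault` and `original_config`, and the use of a plain dictionary are all assumptions; a real system would presumably persist these records in the CMDB or a database.

```python
def archive_fault(store, device, config_text, cause):
    """Append a (configuration, fault reason) record for the device."""
    store.setdefault(device, []).append({"config": config_text, "cause": cause})
    return store


def original_config(store, device):
    """Return the most recently archived configuration for backtracking, or None."""
    records = store.get(device)
    if not records:
        return None
    return records[-1]["config"]
```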
In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation. In addition, in order to ensure complete fault processing of the data center network, the invention also verifies whether the data center network is recovered to be normal or not, and stores the configuration of the fault equipment and the fault reason in a corresponding relation form after the verification is passed so as to provide reference for subsequent equipment configuration change, fault first-aid repair and the like, and can realize configuration backtracking according to the original configuration of the fault equipment.
Corresponding to the embodiment of the method, the invention discloses a system for processing the network fault of the data center.
Referring to fig. 5, an embodiment of the present invention discloses a schematic structural diagram of a system for processing a data center network failure, where the system includes:
an obtaining unit 201, configured to obtain alarm information generated when an abnormality occurs in a data center network;
specifically, in practical application, the alarm information generated when the network is abnormal may be detected by the operation and maintenance monitoring platform, where a network abnormality includes a network error condition.
In this embodiment, the alarm information indicates that a data center network failure may have occurred. In an actual production environment, a single failure generally generates a plurality of pieces of alarm information, and each piece of alarm information includes: event first occurrence time, event latest occurrence time, alarm count, event name, alarm device IP, alarm object (a specific port, motherboard, etc.), alarm event source and script (for example, IMC alarm source IP and device reachability script), alarm event details (for example, an SNMP Trap alarm, or a device vendor management system alarm such as an IMC alarm of H3C), and the like.
It should be noted that, in practical applications, the content of the alarm information may be listed as individual entries, and the alarm information may be classified by its source or by device model.
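The alarm entry described above can be pictured as a small record type. The following is a minimal sketch only: the class name `Alarm`, its field names and the helper `group_by_source` are illustrative assumptions and do not come from the patent, which lists the fields but does not prescribe a data layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    """One alarm entry; field names are hypothetical, chosen to mirror the listed fields."""
    first_seen: datetime   # event first occurrence time
    last_seen: datetime    # event latest occurrence time
    count: int             # number of alarms
    event_name: str
    device_ip: str         # IP of the alarming device
    alarm_object: str      # e.g. a specific port or board
    source: str            # alarm event source, e.g. "snmp-trap" or "imc"
    details: str = ""      # free-form alarm event details


def group_by_source(alarms):
    """Classify alarms by source, keeping arrival order inside each group."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)
```

Grouping by device model instead would only require a different key; the patent mentions both options without preferring one.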
a first searching unit 202, configured to look up, based on the alarm information, the faulty device and the fault cause corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices and fault causes, and to take the faulty device and the fault cause as the fault information for the data center network fault;
the first searching unit 202 may specifically include:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
It should be noted that the alarm information database is established in advance and records, for each alarm item, the faulty device, the fault cause and the alarm information caused by that faulty device.
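The database lookup with a preset matching degree can be sketched as follows. The patent does not define how the matching degree is computed, so this sketch assumes a simple metric (the fraction of an item's pattern fields that the alarm reproduces exactly). The names `AlarmEntry`, `matching_degree`, `locate_fault` and the default threshold of 0.8 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlarmEntry:
    """A database row: an alarm pattern plus the fault it points to."""
    pattern: dict        # expected alarm fields, e.g. {"event_name": "link-down"}
    faulty_device: str
    fault_cause: str


def matching_degree(alarm: dict, pattern: dict) -> float:
    """Assumed metric: share of pattern fields the alarm matches exactly."""
    if not pattern:
        return 0.0
    hits = sum(1 for key, value in pattern.items() if alarm.get(key) == value)
    return hits / len(pattern)


def locate_fault(alarm: dict, database: list, threshold: float = 0.8) -> Optional[tuple]:
    """Return (faulty_device, fault_cause) for the best item at or above the threshold."""
    best = max(database, key=lambda e: matching_degree(alarm, e.pattern), default=None)
    if best is None or matching_degree(alarm, best.pattern) < threshold:
        return None  # no alarm item reaches the preset matching degree
    return best.faulty_device, best.fault_cause
```

Returning `None` when no item reaches the threshold leaves room for the second lookup path, the absorption tree diagram, to handle the alarm.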
The first searching unit 202 may further include:
the second matching subunit is used for performing absorption matching between the alarm information and each piece of child alarm information and parent alarm information in a pre-established alarm information absorption tree diagram;
and determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
The second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, performing absorption matching between the alarm information and each piece of child alarm information and parent alarm information in the pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
It should be noted that, when new alarm information is generated, absorption association analysis is performed on the basis of the alarm information absorption tree diagram to determine whether the new alarm information can be absorbed by parent alarm information or child alarm information in the tree diagram. If it can be absorbed, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbs it. If it cannot be absorbed by any parent or child alarm information in the tree diagram, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way the alarm information absorption tree diagram can be generated and updated in real time.
Because alarm information associated through the alarm information rules is usually generated within a limited period of time, an associated alarm information time period can be set in practical application. Before judging whether new alarm information can be absorbed by parent or child alarm information in the tree diagram, it is first judged whether the generation time of the new alarm information falls within the set associated alarm information time period. If it does, the absorption judgment is carried out; otherwise, the new alarm information is discarded.
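The absorption logic and the time-window check can be illustrated with a small tree structure. This is a sketch under stated assumptions, not the patented implementation: the rule format (a mapping from an event to the set of events it can absorb), the choice to start the time period at the first alarm in the tree, and the names `AlarmNode` and `AbsorptionTree` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass(eq=False)
class AlarmNode:
    event: str
    time: datetime
    parent: Optional["AlarmNode"] = None
    children: list = field(default_factory=list)

    def root(self) -> "AlarmNode":
        """Walk up to the root alarm, which identifies the underlying fault."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


class AbsorptionTree:
    def __init__(self, absorbs: dict, window: timedelta):
        self.absorbs = absorbs  # assumed rule form: event -> set of events it absorbs
        self.window = window    # the set associated alarm information time period
        self.nodes = []         # every alarm kept in the tree, in arrival order

    def add(self, event: str, time: datetime) -> Optional[AlarmNode]:
        """Insert a new alarm; return its root alarm, or None if it is discarded."""
        if self.nodes and time - self.nodes[0].time > self.window:
            return None  # outside the time period: discard the alarm
        node = AlarmNode(event, time)
        for candidate in self.nodes:
            if event in self.absorbs.get(candidate.event, ()):
                node.parent = candidate          # absorbed: becomes a child
                candidate.children.append(node)
                break
        self.nodes.append(node)  # unabsorbed alarms stay in the tree as new roots
        return node.root()
```

With rules such as `{"device-down": {"link-down"}, "link-down": {"bgp-peer-lost"}}`, a late `bgp-peer-lost` alarm resolves to the `device-down` root, so the fault lookup is done once for the root rather than for every derived alarm.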
A second searching unit 203, configured to search all device interfaces connected to the faulty device;
a fault processing unit 204, configured to send an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device is disconnected from the faulty device, and a standby device of the faulty device is enabled at the same time.
Specifically, in practical application, an equipment interface closing instruction may be sent to a target equipment connected to the faulty equipment through an equipment interface, so that the connection between the faulty equipment and the target equipment is disconnected, the faulty equipment is isolated from the equipment network, and fault isolation is achieved.
The faulty device can be isolated by either of the following two methods:
Method one
In a data center system with an independently deployed out-of-band management network, once the faulty device has been determined, the system logs in to the faulty device automatically over the management network by means of a NETCONF interface and the CLI, using the login IP, user name and password stored in the CMDB. It then queries the CMDB for all available device ports of the faulty device and, according to the labels in the CMDB, automatically executes an instruction that closes all of those ports.
Method two
In a data center system without an independently deployed out-of-band management network, the faulty device cannot be logged in to and operated directly. When a device fault occurs, all device ports connected to the faulty device in the uplink or downlink direction are therefore found from the device network topology, using the CMDB and the Link Layer Discovery Protocol (LLDP). Based on these device ports, the system logs in, in turn, to every target device connected to the faulty device in the uplink or downlink direction and executes the device interface closing instruction on the device port that connects to the faulty device. It should be noted that, because the data center architecture and scenario use an active/standby dual-active mode rather than a single-point structure, the order in which the target devices are logged in to does not need to be considered.
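The port discovery step of method two amounts to a lookup over topology links. The sketch below assumes the links have already been collected (for example from the CMDB or from LLDP neighbor tables) into simple records. The `Link` type and the `neighbor_ports` function are hypothetical names, and the vendor-specific login and interface closing commands are deliberately left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One physical link, as it might be recorded in a CMDB or learned via LLDP."""
    local_device: str
    local_port: str
    remote_device: str
    remote_port: str


def neighbor_ports(links, faulty_device):
    """Map each target device to the sorted ports on it that face the faulty device."""
    plan = defaultdict(set)
    for link in links:
        if link.local_device == link.remote_device:
            continue  # ignore self-loops; they never isolate another device
        if link.remote_device == faulty_device:
            plan[link.local_device].add(link.local_port)
        elif link.local_device == faulty_device:
            plan[link.remote_device].add(link.remote_port)
    return {device: sorted(ports) for device, ports in plan.items()}
```

The result is a per-device shutdown plan. Since the text states that login order does not matter in an active/standby dual-active design, the plan can be applied to the target devices in any order.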
In summary, the invention discloses a system for processing a data center network fault. The system acquires alarm information generated when the data center network is abnormal; based on the alarm information, looks up the faulty device and fault cause corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices and fault causes; takes the faulty device and fault cause as the fault information for the data center network fault; closes all device ports connected to the faulty device so as to isolate it; and enables the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency and effectively reduces operational errors caused by manual work.
After the data center network fault is processed, the fault processing result is also verified.
To further optimize the above embodiment, the system for processing the data center network failure may further include:
a determining unit, configured to, after the fault processing unit 204 sends an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device disconnects from the faulty device and activates a standby device of the faulty device, determine whether a data center network is recovered to normal, where the data center network recovering to normal includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and a storage unit, configured to store the configuration of the faulty device and the fault cause in the form of a correspondence when the determining unit determines that the data center network has recovered to normal.
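The interplay of the determining unit and the storage unit can be sketched as a guarded write. This is an illustrative assumption about structure only: the `RecoveryCheck` type, the `record_if_recovered` function and the dictionary-based store are hypothetical, and how each of the three conditions is actually probed is outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryCheck:
    """The three conditions that together mean the network has recovered."""
    faulty_device_isolated: bool
    standby_enabled: bool
    network_normal: bool

    def passed(self) -> bool:
        return self.faulty_device_isolated and self.standby_enabled and self.network_normal


def record_if_recovered(check, device, config, cause, store):
    """Store the (configuration, fault cause) pair only after verification passes."""
    if not check.passed():
        return False  # nothing is stored until the network is back to normal
    store.setdefault(device, []).append({"config": config, "cause": cause})
    return True
```

Keeping the original configuration next to the fault cause is what makes later configuration rollback and fault lookup possible.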
In summary, the invention discloses a system for processing a data center network fault. The system acquires alarm information generated when the data center network is abnormal; based on the alarm information, looks up the faulty device and fault cause corresponding to the alarm information in a pre-stored correspondence among alarm information, faulty devices and fault causes; takes the faulty device and fault cause as the fault information for the data center network fault; closes all device ports connected to the faulty device so as to isolate it; and enables the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency and effectively reduces operational errors caused by manual work. In addition, to ensure that the fault is handled completely, the invention also verifies whether the data center network has recovered to normal and, once verification passes, stores the configuration of the faulty device together with the fault cause in the form of a correspondence. This record serves as a reference for subsequent device configuration changes, emergency fault repair and the like, and allows the configuration to be rolled back to the original configuration of the faulty device.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for processing a data center network fault is characterized by comprising the following steps:
acquiring alarm information generated when a data center network is abnormal;
based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reason, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
searching all equipment interfaces connected with the fault equipment;
and sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.
2. The processing method according to claim 1, wherein the searching for the faulty device and the faulty reason corresponding to the alarm information from the pre-stored correspondence among the alarm information, the faulty device, and the faulty reason based on the alarm information, and using the faulty device and the faulty reason as the fault information for generating the data center network fault specifically includes:
matching the alarm information with each alarm item in a pre-established alarm information database;
and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.
3. The processing method according to claim 1, wherein the searching for the faulty device and the faulty reason corresponding to the alarm information from the pre-stored correspondence among the alarm information, the faulty device, and the faulty reason based on the alarm information, and using the faulty device and the faulty reason as the fault information for generating the data center network fault specifically includes:
absorbing and matching the alarm information with each child alarm information and each father alarm information in a pre-established alarm information absorption tree diagram;
and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.
4. The processing method according to claim 3, wherein the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically comprises:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each father alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
5. The processing method according to claim 1, wherein after the sending of the device interface shutdown instruction to the target device connected to the failed device through the device interface causes the target device to disconnect from the failed device while enabling the standby device of the failed device, further comprising:
judging whether the data center network is recovered to be normal or not, wherein the step of recovering to be normal comprises the following steps: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and if so, storing the configuration of the fault equipment and the fault reason in a corresponding relationship form.
6. A system for handling data center network failures, comprising:
the acquisition unit is used for acquiring alarm information generated when the data center network is abnormal;
the first searching unit is used for searching the fault equipment and the fault reason corresponding to the alarm information from the corresponding relation among the pre-stored alarm information, fault equipment and fault reasons based on the alarm information, and using the fault equipment and the fault reason as fault information for generating the data center network fault;
the second searching unit is used for searching all the equipment interfaces connected with the fault equipment;
and the fault processing unit is used for sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, standby equipment of the fault equipment is started.
7. The processing system of claim 6, wherein the first lookup unit specifically comprises:
the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;
and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.
8. The processing system of claim 6, wherein the first lookup unit specifically comprises:
the second matching subunit is used for performing absorption matching between the alarm information and each piece of child alarm information and parent alarm information in a pre-established alarm information absorption tree diagram;
and determining the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as the fault information for generating the data center network fault.
9. The processing system of claim 8, wherein the second matching subunit is specifically configured to:
judging whether the generation time of the alarm information is within a set associated alarm information time period or not;
if yes, absorbing and matching the alarm information with each child alarm information and each father alarm information in a pre-established alarm information absorption tree diagram;
and establishing the alarm information absorption tree diagram according to the alarm information acquired in the set associated alarm information time period.
10. The processing system of claim 6, further comprising:
a determining unit, configured to, after the failure processing unit sends an equipment interface shutdown instruction to a target equipment connected to the failed equipment through the equipment interface, so that the target equipment is disconnected from the failed equipment, and a standby equipment of the failed equipment is enabled, determine whether a data center network is recovered to be normal, where the recovering to be normal of the data center network includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;
and a storage unit, configured to store the configuration of the faulty device and the fault cause in the form of a correspondence when the determining unit determines that the data center network has recovered to normal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911002517.3A CN110635954B (en) | 2019-10-21 | 2019-10-21 | Method and system for processing network fault of data center |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110635954A (en) | 2019-12-31 |
CN110635954B (en) | 2022-10-21 |
Family
ID=68976905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911002517.3A Active CN110635954B (en) | 2019-10-21 | 2019-10-21 | Method and system for processing network fault of data center |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110635954B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111865673A (en) * | 2020-07-08 | 2020-10-30 | 上海燕汐软件信息科技有限公司 | Automatic fault management method, device and system |
CN114285725A (en) * | 2021-12-24 | 2022-04-05 | 中国电信股份有限公司 | Network fault determination method and device, storage medium and electronic equipment |
WO2022193617A1 (en) * | 2021-03-16 | 2022-09-22 | 通号通信信息集团有限公司 | Fault location method, fault location system, and video management system |
CN114285725B (en) * | 2021-12-24 | 2024-07-02 | 中国电信股份有限公司 | Network fault determining method and device, storage medium and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104021195A (en) * | 2014-06-13 | 2014-09-03 | 中国民航信息网络股份有限公司 | Warning association analysis method based on knowledge base |
WO2016086705A1 (en) * | 2014-12-02 | 2016-06-09 | 中兴通讯股份有限公司 | Fault locating method, and server |
CN106130761A (en) * | 2016-06-22 | 2016-11-16 | 北京百度网讯科技有限公司 | The recognition methods of the failed network device of data center and device |
CN108989132A (en) * | 2018-08-24 | 2018-12-11 | 深圳前海微众银行股份有限公司 | Fault warning processing method, system and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110635954B (en) | 2022-10-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||