CN110635954A

CN110635954A - Method and system for processing network fault of data center

Info

Publication number: CN110635954A
Application number: CN201911002517.3A
Authority: CN
Inventors: 朱聿津; 戴之光; 张维嘉; 王勇
Original assignee: China Travelsky Technology Co Ltd
Current assignee: China Travelsky Technology Co Ltd
Priority date: 2019-10-21
Filing date: 2019-10-21
Publication date: 2019-12-31
Anticipated expiration: 2039-10-21
Also published as: CN110635954B

Abstract

The invention discloses a method for processing a data center network fault. The method comprises: acquiring alarm information generated when the data center network is abnormal; based on the alarm information, looking up the faulty device and the fault cause corresponding to the alarm information in a prestored correspondence among alarm information, faulty devices, and fault causes, and taking them as the fault information of the data center network fault; closing all device ports connected to the faulty device so that the faulty device is isolated; and enabling the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operating errors caused by manual operation.

Description

Method and system for processing network fault of data center

Technical Field

The invention relates to the technical field of network data processing, in particular to a method and a system for processing network faults of a data center.

Background

With the rapid advance of information technology, data volumes are growing explosively, data centers are developing ever more rapidly, and network structures are becoming increasingly complex. A data center network is the network deployed inside a data center. Because traffic in such a network is characterized by concentrated data exchange, growing east-west traffic, and the like, further requirements are placed on the network: high scalability, high robustness, flexible control of topology and link capacity, energy efficiency, and so on. However, data concentration also means concentration of risk, of response, and of complexity. A data center network will therefore inevitably experience faults, including emergencies.

Data center network faults are of many types, chiefly that a device, link, or server of the data center network fails and can no longer provide normal service externally. Because the number of network devices is huge, a fault produces a large amount of alarm information and is difficult to locate. In an emergency in particular, locating and handling the fault manually on the basis of experience alone easily leads to operating accidents, takes a long time, and consumes considerable manpower.

Disclosure of Invention

In view of this, the invention discloses a method and a system for processing a data center network fault, so as to realize automatic positioning and fault processing of the data center network fault, thereby not only greatly saving labor cost and improving fault processing efficiency, but also effectively reducing operation faults caused by manual operation.

A method for processing a data center network fault comprises the following steps:

acquiring alarm information generated when a data center network is abnormal;

based on the alarm information, searching out the fault equipment and the fault reason corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reason, and using the fault equipment and the fault reason as fault information for generating the data center network fault;

searching all equipment interfaces connected with the fault equipment;

and sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.

Optionally, the finding, based on the alarm information, the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and using the found faulty device and fault reason as the fault information for generating the data center network fault specifically includes:

matching the alarm information with each alarm item in a pre-established alarm information database;

and when the alarm items with the matching degree not lower than the preset matching degree exist, taking fault equipment and fault reasons corresponding to the alarm items obtained through matching as fault information for generating the data center network fault.

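Optionally, the finding, based on the alarm information, of the faulty device and the fault reason corresponding to the alarm information from the pre-stored corresponding relationship among the alarm information, the faulty device and the fault reason, and the using of them as the fault information for generating the data center network fault specifically includes:

performing absorption matching between the alarm information and each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;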

and taking the fault equipment and the fault reason corresponding to the root alarm information corresponding to the matched child alarm information and/or parent alarm information as fault information for generating the data center network fault.

Optionally, the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram specifically includes:

judging whether the generation time of the alarm information is within a set associated alarm information time period;

if yes, performing absorption matching between the alarm information and each child alarm information and each parent alarm information in the pre-established alarm information absorption tree diagram;

wherein the alarm information absorption tree diagram is established according to the alarm information acquired within the set associated alarm information time period.

Optionally, after the sending a device interface shutdown instruction to the target device connected to the faulty device through the device interface, so that the target device is disconnected from the faulty device and a standby device of the faulty device is enabled, the method further includes:

judging whether the data center network has recovered to normal, wherein the data center network having recovered to normal comprises: the faulty device is isolated, the standby device is enabled, and the network has returned to normal;

and if so, storing the configuration of the fault equipment and the fault reason in a corresponding relationship form.

A system for handling data center network failures, comprising:

the acquisition unit is used for acquiring alarm information generated when the data center network is abnormal;

the first searching unit is used for searching the fault equipment and the fault reason corresponding to the alarm information from the corresponding relation among the pre-stored alarm information, fault equipment and fault reasons based on the alarm information, and using the fault equipment and the fault reason as fault information for generating the data center network fault;

the second searching unit is used for searching all the equipment interfaces connected with the fault equipment;

and the fault processing unit is used for sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, standby equipment of the fault equipment is started.

Optionally, the first searching unit specifically includes:

the first matching subunit is used for matching the alarm information with each alarm item in a pre-established alarm information database;

and the first fault selecting subunit is used for determining fault equipment and fault reasons corresponding to the alarm items obtained by matching as fault information for generating the data center network fault when the alarm items with the matching degree not lower than the preset matching degree exist.

Optionally, the first searching unit specifically includes:

the second matching subunit is used for performing absorption matching between the alarm information and each child alarm information and each parent alarm information in the pre-established alarm information absorption tree diagram;

the fault equipment and the fault reason corresponding to the root alarm information of the matched child alarm information and/or parent alarm information being determined as the fault information for generating the data center network fault.

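Optionally, the second matching subunit is specifically configured to: judge whether the generation time of the alarm information is within a set associated alarm information time period and, if yes, perform absorption matching between the alarm information and each child alarm information and each parent alarm information in the pre-established alarm information absorption tree diagram.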

Optionally, the system further includes:

a determining unit, configured to, after the failure processing unit sends an equipment interface shutdown instruction to a target equipment connected to the failed equipment through the equipment interface, so that the target equipment is disconnected from the failed equipment, and a standby equipment of the failed equipment is enabled, determine whether a data center network is recovered to be normal, where the recovering to be normal of the data center network includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;

and the storage unit is used for storing the configuration of the fault equipment and the fault reason in a corresponding relationship form when the determining unit determines that the data center network has recovered to normal.

According to the above technical solution, the invention discloses a method for processing a data center network fault: alarm information generated when the data center network is abnormal is acquired; based on the alarm information, the faulty device and the fault cause corresponding to the alarm information are looked up in a prestored correspondence among alarm information, faulty devices, and fault causes, and are taken as the fault information of the data center network fault; all device ports connected to the faulty device are closed so that the faulty device is isolated; and the standby device of the faulty device is enabled. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operating errors caused by manual operation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the disclosed drawings without creative efforts.

FIG. 1 is a flowchart of a method for processing a data center network fault according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an alarm information database according to an embodiment of the present invention;

FIG. 3 is an alarm information absorption tree diagram according to an embodiment of the present invention;

FIG. 4 is a flowchart of another method for processing a data center network fault according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a system for processing a data center network fault according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a method and a system for processing a data center network fault. Alarm information generated when the data center network is abnormal is acquired; based on the alarm information, the faulty device and the fault cause corresponding to the alarm information are looked up in a prestored correspondence among alarm information, faulty devices, and fault causes, and are taken as the fault information of the data center network fault; all device ports connected to the faulty device are closed so that the faulty device is isolated; and the standby device of the faulty device is enabled. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operating errors caused by manual operation.

Referring to FIG. 1, a flowchart of a method for processing a data center network fault disclosed in an embodiment of the present invention, the method includes the steps of:

Step S101, acquiring alarm information generated when a data center network is abnormal;

Specifically, in practical applications, the alarm information generated when the network is abnormal may be detected by the operation and maintenance monitoring platform, where a network abnormality includes a network error condition.

In this embodiment, alarm information indicates that a data center network fault may have occurred. In an actual production environment, a single fault generally generates multiple pieces of alarm information, and each piece of alarm information includes: the time the event first occurred, the time the event most recently occurred, the number of alarms, the event name, the IP of the alarming device, the alarm object (a specific port, mainboard, etc.), the alarm event source and script (for example, the ICM alarm source IP and the device reachability script), the alarm event details (for example, an SNMP Trap alarm, or an alarm from a device vendor's management system such as the IMC of H3C), and the like.

It should be noted that, in practical applications, the content of the alarm information may be listed in an entry manner, and the alarm information may be classified according to the source or the device model of the alarm information.
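As an illustration only, the alarm fields listed above and the classification by source could be represented as follows. This is a minimal sketch: the class name `Alarm`, its field names, the function `classify_by_source`, and the sample values are assumptions made for this example and do not come from the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    """One piece of alarm information (field names are illustrative)."""
    first_seen: datetime   # time the event first occurred
    last_seen: datetime    # time the event most recently occurred
    count: int             # number of alarms
    event_name: str
    device_ip: str         # IP of the alarming device
    target: str            # alarm object, e.g. a specific port or mainboard
    source: str            # alarm event source
    details: str = ""      # alarm event details


def classify_by_source(alarms):
    """Group alarm entries by the source that reported them."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)


# Hypothetical sample data.
t = datetime(2019, 10, 21, 9, 0, 0)
alarms = [
    Alarm(t, t, 1, "link down", "10.0.0.1", "port-1", "snmp-trap"),
    Alarm(t, t, 3, "device unreachable", "10.0.0.1", "device", "reachability-script"),
    Alarm(t, t, 1, "link down", "10.0.0.2", "port-7", "snmp-trap"),
]
by_source = classify_by_source(alarms)
```

The same grouping could be keyed on the device model instead of the source if the alarm record carried that field.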

Step S102, based on the alarm information, finding out fault equipment and fault reasons corresponding to the alarm information from the prestored corresponding relation among the alarm information, the fault equipment and the fault reasons, and using the fault equipment and the fault reasons as fault information for generating the data center network fault;

wherein, step S102 may specifically include:

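matching the acquired alarm information with each alarm entry in a pre-established alarm information database;

and when an alarm entry with a matching degree not lower than the preset matching degree exists, taking the faulty device and the fault cause corresponding to the matched alarm entry as the fault information of the data center network fault.

In this embodiment, the matching degree is the percentage of overlap between the acquired alarm information and all the alarm entries listed under the corresponding fault cause.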

It should be noted that, an alarm information database is established in advance, and the alarm information database includes: fault equipment and fault cause, and alarm information caused by the fault equipment.

Specifically, FIG. 2 shows the composition of the alarm information database disclosed in an embodiment of the present invention. Each record in the alarm information database consists of a faulty device and fault cause together with the alarm entries caused by that faulty device: alarm entry 1, alarm entry 2, alarm entry 3, ..., alarm entry n, where n is a positive integer.

The value of the preset matching degree is determined according to actual needs, for example, the preset matching degree is 80%, and the present invention is not limited herein.

And when the alarm information database does not have alarm items with the matching degree not lower than the preset matching degree, updating the alarm information database.
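The database lookup can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the matching degree is computed here as the share of a record's alarm entries that appear among the acquired alarms (one possible reading of the definition), and the function names, device names, and alarm names are hypothetical.

```python
def matching_degree(observed, expected_entries):
    """Share of a record's alarm entries that appear among the observed alarms."""
    expected = set(expected_entries)
    if not expected:
        return 0.0
    return len(set(observed) & expected) / len(expected)


def locate_fault(observed, database, threshold=0.8):
    """Return (device, cause, degree) for the best record at or above the
    threshold, or None when no record qualifies."""
    best = None
    for (device, cause), entries in database.items():
        degree = matching_degree(observed, entries)
        if degree >= threshold and (best is None or degree > best[2]):
            best = (device, cause, degree)
    return best


# Hypothetical database: (faulty device, fault cause) -> alarm entries it causes.
database = {
    ("core-switch-1", "power failure"): [
        "link down", "device unreachable", "bgp peer lost", "fan alarm", "psu alarm",
    ],
    ("access-switch-3", "port flapping"): ["link down", "link up"],
}
observed = ["link down", "device unreachable", "bgp peer lost", "psu alarm"]
result = locate_fault(observed, database)   # 4 of 5 entries match the first record
no_match = locate_fault(["link up"], database)
```

A `None` result corresponds to the case in which no alarm entry reaches the preset matching degree, which is when the alarm information database would be updated.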

This approach of establishing an alarm information database suits data centers whose database is relatively complete, that is, data centers that hold sufficient information about faulty devices, fault causes, and alarm information.

In the foregoing embodiment, step S102 may further include:

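performing absorption matching between the acquired alarm information and each child alarm information and each parent alarm information in a pre-established alarm information absorption tree diagram;

and taking the faulty device and the fault cause corresponding to the root alarm information of the matched child alarm information and/or parent alarm information as the fault information of the data center network fault.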

The alarm information absorption tree diagram is established based on the equipment network topological diagram and the known alarm information causal relationship.

Referring to FIG. 3, in the alarm information absorption tree diagram disclosed in an embodiment of the present invention, the root alarm information includes a plurality of parent alarm information, for example parent alarm information 1 and parent alarm information 2. Each parent alarm information includes a plurality of child alarm information: for example, parent alarm information 1 includes child alarm information 1 and child alarm information 2, and parent alarm information 2 includes child alarm information 3 and child alarm information 4.

When new alarm information is generated, alarm absorption association analysis is performed on the basis of the alarm information absorption tree diagram to judge whether the new alarm information can be absorbed by a parent alarm information or a child alarm information in the tree diagram. If it can, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbed it. If it cannot, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way the alarm information absorption tree diagram can be generated and updated in real time.
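The absorption step can be sketched as below. This is a simplified sketch, not the disclosed implementation: it assumes the known causal relationships are available as a table mapping an alarm name to the alarm name that causes it, and it only records unabsorbed alarms in a list rather than rebuilding the tree. The names `AlarmNode`, `AbsorptionTree`, `causes`, and `root_fault`, and all sample values, are hypothetical.

```python
class AlarmNode:
    """A node of the absorption tree; the root has no parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def root(self):
        node = self
        while node.parent is not None:
            node = node.parent
        return node


class AbsorptionTree:
    def __init__(self, root_name, causes):
        # causes: alarm name -> name of the alarm that causes it (assumed rule table)
        self.causes = causes
        self.nodes = {root_name: AlarmNode(root_name)}
        self.unabsorbed = []

    def absorb(self, alarm_name):
        """Attach alarm_name under the node that causes it and return the name
        of the root alarm, or None if no node can absorb it."""
        if alarm_name in self.nodes:
            return self.nodes[alarm_name].root().name
        parent_name = self.causes.get(alarm_name)
        if parent_name in self.nodes:
            node = AlarmNode(alarm_name, self.nodes[parent_name])
            self.nodes[alarm_name] = node
            return node.root().name
        self.unabsorbed.append(alarm_name)  # marked for a later tree update
        return None


causes = {"uplink down": "core-switch-1 unreachable", "bgp peer lost": "uplink down"}
root_fault = {"core-switch-1 unreachable": ("core-switch-1", "power failure")}
tree = AbsorptionTree("core-switch-1 unreachable", causes)

r1 = tree.absorb("uplink down")     # absorbed by the root alarm
r2 = tree.absorb("bgp peer lost")   # absorbed by "uplink down"
r3 = tree.absorb("disk full")       # cannot be absorbed
fault = root_fault[r2]              # faulty device and fault cause of the root alarm
```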

Because alarm information associated through the alarm rules is usually generated within a certain period of time, an associated alarm information time period can be set in practical applications. Before judging whether new alarm information can be absorbed by a parent or child alarm information in the tree diagram, it is first judged whether the generation time of the new alarm information falls within the set associated alarm information time period. If it does, the absorption judgment is performed; otherwise, the new alarm information is discarded.

In order to further optimize the above embodiment, the absorbing and matching of the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically includes:

judging whether the generation time of the acquired alarm information is within the set associated alarm information time period;

if yes, performing absorption matching between the acquired alarm information and each child alarm information and each parent alarm information in the pre-established alarm information absorption tree diagram;

taking the faulty device and the fault cause corresponding to the root alarm information of the matched child alarm information and/or parent alarm information as the fault information of the data center network fault;

and if not, discarding the acquired alarm information.

It should be noted that the alarm information absorption tree diagram is established according to the alarm information acquired within the set associated alarm information time period.
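The time-period check that precedes absorption matching might look like the following sketch. The window is modelled here as a start time plus a length, which is an assumption; the text does not specify how the associated alarm information time period is defined. Function names and sample alarms are hypothetical.

```python
from datetime import datetime, timedelta


def within_window(alarm_time, window_start, window_length):
    """True if the alarm was generated inside the associated-alarm time period."""
    return window_start <= alarm_time <= window_start + window_length


def filter_alarms(alarms, window_start, window_length):
    """Split (name, time) pairs into alarms kept for absorption matching
    and alarms that are discarded."""
    kept, discarded = [], []
    for name, when in alarms:
        if within_window(when, window_start, window_length):
            kept.append(name)
        else:
            discarded.append(name)
    return kept, discarded


start = datetime(2019, 10, 21, 9, 0)
window = timedelta(minutes=5)          # hypothetical period length
alarms = [
    ("uplink down", datetime(2019, 10, 21, 9, 1)),
    ("bgp peer lost", datetime(2019, 10, 21, 9, 4)),
    ("fan alarm", datetime(2019, 10, 21, 9, 20)),
]
kept, discarded = filter_alarms(alarms, start, window)
```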

Step S103, searching all equipment interfaces connected with the fault equipment;

specifically, in practical applications, all the device interfaces connected to the faulty device are searched according to the device topology and a CMDB (Configuration Management Database).
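A minimal sketch of this lookup follows, assuming the topology and CMDB data have already been exported as a list of links of the form (device A, port A, device B, port B). That record layout, the function name, and the device and port names are assumptions for illustration.

```python
def find_connected_interfaces(links, faulty_device):
    """Return (target_device, target_port, faulty_port) for every link that
    touches the faulty device."""
    result = []
    for dev_a, port_a, dev_b, port_b in links:
        if dev_a == faulty_device:
            result.append((dev_b, port_b, port_a))
        elif dev_b == faulty_device:
            result.append((dev_a, port_a, port_b))
    return result


# Hypothetical topology export.
links = [
    ("core-switch-1", "eth1", "agg-switch-1", "eth10"),
    ("agg-switch-2", "eth10", "core-switch-1", "eth2"),
    ("agg-switch-1", "eth11", "access-switch-3", "eth1"),
]
interfaces = find_connected_interfaces(links, "core-switch-1")
```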

Step S104, sending an equipment interface closing instruction to target equipment connected with the fault equipment through the equipment interface, so that the target equipment is disconnected from the fault equipment, and meanwhile, starting standby equipment of the fault equipment.

Specifically, in practical application, an equipment interface closing instruction may be sent to a target equipment connected to the faulty equipment through an equipment interface, so that the connection between the faulty equipment and the target equipment is disconnected, the faulty equipment is isolated from the equipment network, and fault isolation is achieved.

The faulty device can be isolated in either of two ways, as follows:

Method 1

In a data center system with an independently deployed out-of-band management network, once the faulty device is determined, the faulty device is logged in to automatically over the management network through the NETCONF interface and the CLI, the login IP address, user name, and password being stored in the CMDB. All available device ports of the faulty device are queried in the CMDB, and an instruction closing all of these ports is executed automatically according to the labels in the CMDB.

Method 2

In a data center system without an independently deployed out-of-band management network, the faulty device cannot be logged in to and operated directly. When a device fault occurs, all device ports with an uplink or downlink connection to the faulty device are therefore found from the device network topology by means of the CMDB and the Link Layer Discovery Protocol (LLDP); the target devices with an uplink or downlink connection to the faulty device are then logged in to one after another, and on each of them a device interface closing instruction is executed for the port connected to the faulty device. It should be noted that, because the data center architecture and scenarios use an active/standby dual-active mode rather than a single-point structure, the order in which the target devices are logged in to does not need to be considered.
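The difference between the two isolation methods can be sketched as a planning function that only decides which device to log in to and which port to close; it does not talk to any device. This is a simplified illustration: the link record layout, the function name `plan_isolation`, and the device names are assumptions, and for Method 1 the sketch closes only the linked ports of the faulty device, whereas the text closes all of its available ports as recorded in the CMDB.

```python
def plan_isolation(faulty_device, links, has_out_of_band_mgmt):
    """Return a list of (device_to_log_in_to, port_to_close) actions."""
    actions = []
    for dev_a, port_a, dev_b, port_b in links:
        if dev_a == faulty_device:
            local_port, peer, peer_port = port_a, dev_b, port_b
        elif dev_b == faulty_device:
            local_port, peer, peer_port = port_b, dev_a, port_a
        else:
            continue  # link does not touch the faulty device
        if has_out_of_band_mgmt:
            # Method 1: log in to the faulty device itself and close its port.
            actions.append((faulty_device, local_port))
        else:
            # Method 2: log in to the neighbour and close the port facing the fault.
            actions.append((peer, peer_port))
    return actions


# Hypothetical topology export.
links = [
    ("core-switch-1", "eth1", "agg-switch-1", "eth10"),
    ("agg-switch-2", "eth10", "core-switch-1", "eth2"),
    ("agg-switch-1", "eth11", "access-switch-3", "eth1"),
]
method_1 = plan_isolation("core-switch-1", links, has_out_of_band_mgmt=True)
method_2 = plan_isolation("core-switch-1", links, has_out_of_band_mgmt=False)
```

Because the login order does not matter in Method 2, the actions are simply emitted in the order the links are listed.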

In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation.

After the data center network fault is processed, the fault processing result is also verified.

Referring to FIG. 4, a flowchart of a method for processing a data center network fault according to another embodiment of the present invention, on the basis of the embodiment shown in FIG. 1, after step S104 the method may further include the steps of:

Step S105, judging whether the data center network has recovered to normal, and if so, executing step S106;

This embodiment determines whether the data center network fault has been completely handled by judging whether the data center network has recovered to normal.

Whether the data center network is recovered to be normal or not comprises the following steps: whether the fault equipment is isolated, whether the standby equipment of the fault equipment is started and whether the network is recovered to be normal or not, and when the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal, determining that the data center network is recovered to be normal; otherwise, determining that the data center network is not recovered to be normal, and at the moment, continuously processing the unprocessed fault until the data center network is recovered to be normal.
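The three-part recovery check and the "continue processing until recovered" loop can be sketched as follows. The status fields mirror the three conditions in the text; the callbacks `check` and `remediate`, the `max_rounds` limit, and the simulated state are assumptions added so that the example terminates and runs without real devices.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStatus:
    faulty_isolated: bool   # the faulty device is isolated
    standby_enabled: bool   # the standby device is enabled
    network_ok: bool        # the network has returned to normal


def network_recovered(status):
    """All three conditions must hold for the network to count as recovered."""
    return status.faulty_isolated and status.standby_enabled and status.network_ok


def handle_until_recovered(check, remediate, max_rounds=3):
    """Re-run remediation until check() reports recovery.
    Returns the round in which recovery was seen, or None if it never was."""
    for round_no in range(1, max_rounds + 1):
        status = check()
        if network_recovered(status):
            return round_no
        remediate(status)
    return None


# Simulated state: the standby device has not been enabled yet.
state = RecoveryStatus(faulty_isolated=True, standby_enabled=False, network_ok=False)


def check():
    return state


def remediate(status):
    # Stand-in for processing the unhandled fault.
    status.standby_enabled = True
    status.network_ok = True


rounds = handle_until_recovered(check, remediate)
```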

For example, machine A and machine B are the active and standby devices of each other. After machine A fails and fault processing has been performed, machine B is logged in to automatically through the NETCONF interface and the CLI, regardless of whether machine A is on the management network, to check whether machine B has been enabled and whether the network is unobstructed. The state of machine A is checked through VRRP (Virtual Router Redundancy Protocol); if machine A is offline, machine A has been isolated successfully.

And step S106, storing the configuration of the fault equipment and the fault reason in a corresponding relationship mode.

After a device has been determined to be faulty, the configuration of the faulty device and the fault cause are stored in correspondence with each other, to provide a reference for subsequent device configuration changes, emergency fault repair, and the like; configuration rollback can also be performed on the basis of the original configuration of the faulty device.
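Storing the configuration and the fault cause in correspondence could be as simple as the sketch below, which keeps one list of records per device and serializes it to JSON. The storage layout, the function names, and the placeholder configuration text are assumptions for illustration; the disclosure does not specify a storage format.

```python
import json


def record_fault(store, device, config_text, cause):
    """Append one (configuration, fault cause) record for the device."""
    store.setdefault(device, []).append({"config": config_text, "cause": cause})


def original_config(store, device):
    """Return the earliest stored configuration of the device, or None."""
    records = store.get(device)
    return records[0]["config"] if records else None


store = {}
# Placeholder configuration text, not a real device configuration.
record_fault(store, "core-switch-1", "hostname core-switch-1\n", "power failure")

serialized = json.dumps(store)      # what would be written to persistent storage
restored = json.loads(serialized)   # what a later rollback would read back
```

Keeping the earliest record retrievable is what allows the configuration to be rolled back to the original configuration of the faulty device.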

In summary, the invention discloses a method for processing a network fault of a data center, which includes acquiring alarm information generated when a data center network is abnormal, searching a fault device and a fault reason corresponding to the alarm information from a pre-stored corresponding relationship among the alarm information, the fault device and the fault reason based on the alarm information, using the fault device and the fault reason as fault information for generating the network fault of the data center, closing all device ports connected with the fault device, isolating the fault device, and starting a standby device of the fault device. Compared with the traditional scheme, the method and the device can realize automatic positioning and fault processing of the network fault of the data center, greatly save labor cost, improve fault processing efficiency and effectively reduce operation faults caused by manual operation. In addition, in order to ensure complete fault processing of the data center network, the invention also verifies whether the data center network is recovered to be normal or not, and stores the configuration of the fault equipment and the fault reason in a corresponding relation form after the verification is passed so as to provide reference for subsequent equipment configuration change, fault first-aid repair and the like, and can realize configuration backtracking according to the original configuration of the fault equipment.

Corresponding to the embodiment of the method, the invention discloses a system for processing the network fault of the data center.

Referring to FIG. 5, a schematic structural diagram of a system for processing a data center network failure disclosed in an embodiment of the present invention, the system includes:

an obtaining unit 201, configured to obtain alarm information generated when an abnormality occurs in a data center network;

a first searching unit 202, configured to search, based on the alarm information, a faulty device and a fault reason corresponding to the alarm information from a pre-stored correspondence relationship among the alarm information, the faulty device, and the fault reason, and use the faulty device and the fault reason as fault information for generating a data center network fault;

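The first searching unit 202 may specifically include a first matching subunit, configured to match the alarm information with each alarm entry in the pre-established alarm information database, and a first fault selecting subunit, configured to take the faulty device and the fault reason corresponding to the matched alarm entry as the fault information when an alarm entry with a matching degree not lower than the preset matching degree exists.

The first searching unit 202 may further include a second matching subunit, configured to perform absorption matching between the alarm information and each child alarm information and each parent alarm information in the pre-established alarm information absorption tree diagram, the faulty device and the fault reason corresponding to the root alarm information of the matched child alarm information and/or parent alarm information being taken as the fault information.

The second matching subunit is specifically configured to judge whether the generation time of the alarm information is within the set associated alarm information time period and, if yes, to perform the absorption matching.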

It should be noted that, when new alarm information is generated, alarm absorption association analysis is performed on the basis of the alarm information absorption tree diagram to judge whether the new alarm information can be absorbed by a parent alarm information or a child alarm information in the tree diagram. If it can, the new alarm information is added directly to the tree diagram as child alarm information of the alarm information that absorbed it. If it cannot, the new alarm information is marked and added to the tree diagram, and the tree diagram is updated. In this way the alarm information absorption tree diagram can be generated and updated in real time.

A second searching unit 203, configured to search all device interfaces connected to the faulty device;

a fault processing unit 204, configured to send an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device is disconnected from the faulty device, and a standby device of the faulty device is enabled at the same time.

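As in the method embodiment, the faulty device can be isolated in either of two ways: Method 1, used in a data center system with an independently deployed out-of-band management network, in which the ports of the faulty device itself are closed; and Method 2, used in a data center system without such a network, in which the ports of the target devices connected to the faulty device are closed.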

In summary, the invention discloses a system for processing a data center network fault. The system acquires alarm information generated when the data center network is abnormal; based on the alarm information, looks up the faulty device and the fault cause corresponding to the alarm information in a prestored correspondence among alarm information, faulty devices, and fault causes, and takes them as the fault information of the data center network fault; closes all device ports connected to the faulty device so that the faulty device is isolated; and enables the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly reduces labor cost, improves fault-handling efficiency, and effectively reduces operating errors caused by manual operation.

To further optimize the above embodiment, the system for processing the data center network failure may further include:

a determining unit, configured to, after the fault processing unit 204 sends an equipment interface shutdown instruction to a target device connected to the faulty device through the equipment interface, so that the target device disconnects from the faulty device and activates a standby device of the faulty device, determine whether a data center network is recovered to normal, where the data center network recovering to normal includes: the fault equipment is isolated, the standby equipment is started and the network is recovered to be normal;

In summary, the invention discloses a system for processing a network fault of a data center. The system acquires the alarm information generated when the data center network is abnormal; based on the alarm information, searches a pre-stored correspondence among alarm information, faulty devices and fault causes for the faulty device and the fault cause corresponding to the alarm information, and takes them as the fault information of the data center network fault; and then shuts down all device ports connected to the faulty device, thereby isolating the faulty device, and enables the standby device of the faulty device. Compared with the traditional scheme, the invention locates and handles data center network faults automatically, which greatly saves labor cost, improves fault handling efficiency, and effectively reduces operating errors caused by manual operation. In addition, to ensure that the fault is handled completely, the invention verifies whether the data center network has returned to normal and, once the verification passes, stores the configuration of the faulty device together with the fault cause as a correspondence. The stored record serves as a reference for subsequent device configuration changes, emergency fault repair and the like, and allows the configuration to be traced back to the original configuration of the faulty device.
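The verification and storage steps described above can be sketched as follows, purely as an illustration. The three boolean inputs correspond to the three conditions named for a recovered network; how each condition is actually measured is not specified here, and the names `network_recovered`, `archive_fault`, `original_config` and the in-memory `history` store are hypothetical.

```python
from typing import Dict, List, Optional


def network_recovered(faulty_isolated: bool, standby_enabled: bool, network_normal: bool) -> bool:
    """The network counts as recovered only if all three conditions hold."""
    return faulty_isolated and standby_enabled and network_normal


def archive_fault(history: Dict[str, List[dict]], device: str, config: str,
                  cause: str, recovered: bool) -> bool:
    """Store the faulty device's configuration with the fault cause, only after verification passes."""
    if not recovered:
        return False
    history.setdefault(device, []).append({"config": config, "cause": cause})
    return True


def original_config(history: Dict[str, List[dict]], device: str) -> Optional[str]:
    """Return the earliest stored configuration of a device, for configuration backtracking."""
    records = history.get(device)
    return records[0]["config"] if records else None


history: Dict[str, List[dict]] = {}
ok = network_recovered(faulty_isolated=True, standby_enabled=True, network_normal=True)
stored = archive_fault(history, "core-1", "interface eth1/1 ...", "power module failure", ok)
print(stored, original_config(history, "core-1"))
```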

Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for processing a data center network fault, characterized by comprising the following steps:

acquiring alarm information generated when a data center network is abnormal;

searching for all device interfaces connected to the faulty device;

2. The processing method according to claim 1, wherein searching, based on the alarm information, a pre-stored correspondence among alarm information, faulty devices and fault causes for the faulty device and the fault cause corresponding to the alarm information, and using the faulty device and the fault cause as the fault information of the data center network fault, specifically includes:

3. The processing method according to claim 1, wherein searching, based on the alarm information, a pre-stored correspondence among alarm information, faulty devices and fault causes for the faulty device and the fault cause corresponding to the alarm information, and using the faulty device and the fault cause as the fault information of the data center network fault, specifically includes:

4. The processing method according to claim 3, wherein the absorbing and matching the alarm information with each child alarm information and each parent alarm information in a pre-established alarm information absorbing tree diagram specifically comprises:

5. The processing method according to claim 1, further comprising, after sending the device interface shutdown instruction to the target device connected to the faulty device through the device interface, so that the target device disconnects from the faulty device, while enabling the standby device of the faulty device:

6. A system for handling data center network failures, comprising:

7. The processing system of claim 6, wherein the first lookup unit specifically comprises:

8. The processing system of claim 6, wherein the first lookup unit specifically comprises:

9. The processing system of claim 8, wherein the second matching subunit is specifically configured to:

10. The processing system of claim 6, further comprising: