CN113810239A - Data center network fault detection method, device, equipment and storage medium - Google Patents

Data center network fault detection method, device, equipment and storage medium

Publication number
CN113810239A
Authority
CN
China
Prior art keywords
detection information
sub
detection
packet loss
fault
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010544540.1A
Other languages
Chinese (zh)
Inventor
曹紫莹
李诗逸
古亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Application filed by Sangfor Technologies Co Ltd
Priority to CN202010544540.1A
Publication of CN113810239A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H04L43/0829 Packet loss
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data center network fault detection method, apparatus, device, and storage medium. The method comprises: determining that a network fault is a host-side packet loss fault, and starting a host-side packet loss fault detection process; acquiring detection information of each sub-service item in the host-side packet loss fault detection process; and screening the detection information of each sub-service item based on the sub-service item to obtain valid detection information. Because the host-side packet loss fault detection process is started only after a host-side packet loss fault occurs, it does not preempt system resources while services are running normally. In addition, acquiring the detection information of each sub-service item and screening it to obtain valid detection information yields a detection result from which the fault point corresponding to the network fault can be determined, so the host-side packet loss fault can be located accurately.

Description

Data center network fault detection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of network fault detection technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting a network fault in a data center.
Background
With the development of information technology, data volumes are growing explosively, data centers are developing ever faster, and network structures are becoming increasingly complex. A data center network is the network used inside a data center; as the medium for information exchange among all hosts of the data center, it plays an extremely important role. Resource pooling among hosts, resource sharing, and the consistency of important configurations all require a stable, reliable, and fast network as a bearer. Data center network faults take many forms, and packet loss is particularly prominent among them. Once packet loss occurs in a data center network, every component on the network path involved becomes a suspected object of the fault. The suspected link can therefore be very long, and the physical hardware alone spans several types, such as routers, switches, network cards, and optical modules. Active detection tools and packet loss link fault location schemes can effectively distinguish whether the point where a packet loss fault occurs is on the host side or on the non-host side (routers and switches). However, this is often not enough: the set of suspected objects for packet loss caused by the host side is still very large, so the problem of locating a host-side packet loss fault remains unsolved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for detecting a network fault in a data center, which aim to accurately locate a host-side packet loss fault.
The technical solution of the embodiments of the present invention is implemented as follows:
An embodiment of the present invention provides a data center network fault detection method, comprising:
determining that the network fault is a host-side packet loss fault, and starting a host-side packet loss fault detection process;
acquiring detection information of each sub-service item in the host-side packet loss fault detection process;
and screening the detection information of each sub-service item based on the sub-service item to obtain valid detection information.
An embodiment of the present invention further provides a data center network fault detection apparatus, comprising:
a starting module, configured to determine that the network fault is a host-side packet loss fault and to start a host-side packet loss fault detection process;
an acquisition module, configured to acquire detection information of each sub-service item in the host-side packet loss fault detection process;
and a screening module, configured to screen the detection information of each sub-service item based on the sub-service item to obtain valid detection information.
An embodiment of the present invention further provides a data center network fault detection device, including: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor, when running the computer program, is configured to perform the steps of the method according to an embodiment of the invention.
The embodiment of the invention also provides a storage medium, wherein a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the method of the embodiment of the invention are realized.
According to the technical solution provided by the embodiments of the present invention, the host-side packet loss fault detection process is started only after the network fault has been determined to be a host-side packet loss fault, so the process does not preempt system resources while services are running normally. The detection information of each sub-service item in the host-side packet loss fault detection process is acquired and then screened to obtain valid detection information. This yields a detection result from which the fault point corresponding to the network fault can be determined, so the host-side packet loss fault can be located accurately.
Drawings
FIG. 1 is a schematic flow chart illustrating a data center network fault detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a data center network fault detection method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data center network fault detection apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data center network fault detection device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
In the related art, fault location tools for data center networks include active detection tools and packet loss fault location algorithms. Although these can effectively distinguish whether the fault point (also called the suspected object) of a packet loss fault is on the host side or the non-host side, they cannot meet the need for finer-grained fault location. For example, even after the fault location tool has narrowed the suspected object to the host side, the path that traffic takes through the host-side network is still long, and the tool cannot determine the stage at which packets are lost. For instance, for the TCP (Transmission Control Protocol) and OSI (Open Systems Interconnection) seven-layer protocol stack (application layer, presentation layer, session layer, transport layer, network layer, data link layer, and physical layer), determining at which layer the packet loss occurs is a pain point of current fault location.
In addition, current fault location tools are strongly tied to specific protocols: for example, ping can only detect network problems involving ICMP (Internet Control Message Protocol), and arping can only detect network problems involving ARP (Address Resolution Protocol). If a packet loss fault in the data center affects every protocol (for example, TCP, UDP, ICMP, and ARP all lose packets), several location tools are required for troubleshooting. This tool dependency makes the troubleshooting process very cumbersome and greatly reduces the possibility of automated troubleshooting.
Based on this, in various embodiments of the present invention, after the network fault is determined to be a host-side packet loss fault, a host-side packet loss fault detection process is started, detection information of each sub-service item in the process is acquired, and the detection information of each sub-service item is screened to obtain valid detection information. This yields a detection result from which the fault point corresponding to the network fault can be determined, so the host-side packet loss fault can be located accurately. In addition, because the host-side packet loss fault detection process in the embodiments of the present invention involves several types of location tools and consumes more resources (such as CPU, memory, and network traffic) than an active detection tool, it is started only under a specific condition, namely once the network fault has been determined to be a host-side packet loss fault.
An embodiment of the present invention provides a data center network fault detection method, which is applied to a data center network fault detection device. As shown in FIG. 1, the method comprises the following steps:
Step 101, determining that the network fault is a host-side packet loss fault, and starting a host-side packet loss fault detection process.
Here, the data center network fault detection device may use the active detection tool and the packet loss fault location algorithm to determine whether the network fault is a packet loss fault and whether it occurs on the host side. If the network fault is a packet loss fault and is located on the host side, the device determines that a host-side packet loss fault exists and starts the host-side packet loss fault detection process. The active detection tool can send probe packets of different protocols to perform end-to-end detection, and whether the network fault is a packet loss fault is determined from the end-to-end detection result. The packet loss fault location algorithm can determine from the packet path whether the fault point is at a switch, a router, or the host side, and thereby whether the network fault is located on the host side.
Step 102, obtaining detection information of each sub-service item in the host side packet loss fault detection process.
Here, the data center network fault detection device relies on a plurality of location tools to run the host-side packet loss fault detection process. In practical applications, the location tools on which the host-side packet loss fault detection process depends may be referred to collectively as a tool set. The tool set is divided into an open source tool set (ethtool, etc.) and a vendor tool set (Intel, etc.).
In an application example, the tool set supports three types of detection processes: a protocol classification troubleshooting process, a device-level troubleshooting process, and a network stack detection process. Each type of detection process supports the detection of a plurality of sub-service items. Specifically, the protocol classification troubleshooting process may include: detection of a Transmission Control Protocol (TCP) sub-service item, detection of a User Datagram Protocol (UDP) sub-service item, detection of an Internet Control Message Protocol (ICMP) sub-service item, detection of an Address Resolution Protocol (ARP) sub-service item, and detection of a Remote Direct Memory Access (RDMA) protocol sub-service item. The device-level troubleshooting process may include: detection of a network card sub-service item, detection of a network port sub-service item, and detection of an optical module sub-service item. The network stack detection process may support detection of sub-service items at each of the seven layers of the OSI reference model.
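As a non-limiting illustration, the three detection processes and their sub-service items listed above could be held in a simple lookup table. The Python sketch below is an assumption made for illustration only; the identifiers `DETECTION_PROCESSES` and `process_of` and the snake_case item names are hypothetical and do not appear in the disclosure.

```python
from typing import Dict, List, Optional

# Hypothetical catalogue: detection process -> sub-service items, as listed in the text.
DETECTION_PROCESSES: Dict[str, List[str]] = {
    "protocol_classification": ["tcp", "udp", "icmp", "arp", "rdma"],
    "device_level": ["network_card", "network_port", "optical_module"],
    "network_stack": [
        "application", "presentation", "session",
        "transport", "network", "data_link", "physical",
    ],
}


def process_of(sub_service_item: str) -> Optional[str]:
    """Return the detection process a sub-service item belongs to, or None if unknown."""
    for process, items in DETECTION_PROCESSES.items():
        if sub_service_item in items:
            return process
    return None
```

For instance, `process_of("optical_module")` resolves to the device-level troubleshooting process under this assumed naming.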
In practical application, after the data center network fault detection device starts the host-side packet loss fault detection process, the detection information of each sub-service item in the host-side packet loss fault detection process can be acquired.
In practical application, the detection of each sub-service item can support exception filtering. For example, for each sub-service item, interference information and noise points are removed, the data are normalized, and the detection information is returned as a key-value dictionary, where the key represents a sub-service item (such as the sub-service item of a given layer of the network stack) and the value is the fault identification code corresponding to that sub-service item. Representing detection information as (sub-service item, fault identification code) pairs normalizes it and establishes a mapping between detection information and fault points, so that the fault point can be determined from the detection information.
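As a non-limiting illustration, the normalization into a key-value dictionary might be sketched as follows in Python. The function name `normalize`, the fault identification codes, and the rule that a record without a code counts as noise are all assumptions made for illustration; the disclosure does not specify the filtering criteria.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical raw record: (sub-service item, fault identification code or None).
RawRecord = Tuple[str, Optional[str]]


def normalize(records: Iterable[RawRecord]) -> Dict[str, str]:
    """Return {sub-service item: fault identification code}, dropping empty records."""
    detections: Dict[str, str] = {}
    for item, code in records:
        # Assumption: a record with no fault code carries no information and is noise.
        if code is None or not code.strip():
            continue
        # Normalize spelling so that keys and codes compare consistently.
        detections[item.strip().lower()] = code.strip().upper()
    return detections
```

With this shape, every downstream screening step can operate on the same dictionary regardless of which tool produced the record.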
In an embodiment, the obtaining of the detection information of each sub-service item in the host side packet loss fault detection process includes at least one of:
acquiring detection information corresponding to a protocol classification troubleshooting process in a host side packet loss fault detection process;
acquiring detection information corresponding to a device level troubleshooting process in a host side packet loss fault detection process;
and acquiring detection information corresponding to a network stack detection flow in a host side packet loss fault detection flow.
In practical application, if the data center network fault detection device determines, based on the active detection tool and the packet loss fault location algorithm, that the network fault is related to a specific protocol, it can trigger detection of only that protocol's sub-service item; if the network fault is not tied to a specific protocol, the complete protocol classification troubleshooting process is executed. Here, the protocol classification troubleshooting process may include detection of the sub-service items of the TCP, UDP, ICMP, ARP, and RDMA protocols.
Therefore, the data center network fault detection device can obtain the detection information corresponding to the protocol classification troubleshooting process in the host side packet loss fault detection process. Specifically, the data center network fault detection device may obtain detection information of a sub-service item of a target protocol based on an open source tool and/or a vendor tool, where the target protocol includes at least one of: TCP, UDP, ICMP, ARP, and RDMA protocols.
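As a non-limiting illustration, the choice between checking a single protocol and running the complete protocol classification troubleshooting process might be sketched as follows in Python. The names `PROTOCOL_ITEMS` and `protocols_to_check` are hypothetical, and raising an error for an unlisted protocol is an assumption.

```python
from typing import List, Optional

# Protocols named in the text for the protocol classification troubleshooting process.
PROTOCOL_ITEMS = ("TCP", "UDP", "ICMP", "ARP", "RDMA")


def protocols_to_check(related_protocol: Optional[str] = None) -> List[str]:
    """Return the protocol sub-service items to detect.

    If the fault is tied to one protocol, only that protocol is checked;
    otherwise the complete list is returned.
    """
    if related_protocol is None:
        return list(PROTOCOL_ITEMS)
    protocol = related_protocol.upper()
    if protocol not in PROTOCOL_ITEMS:
        # Assumption: an unlisted protocol is treated as a caller error.
        raise ValueError(f"unsupported protocol: {related_protocol}")
    return [protocol]
```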
Many network fault phenomena, such as packet loss, are to a large extent caused by hardware problems of the device itself, for example a loose transmission medium (network cable), aging of a product that has been in service too long, or local overheating caused by a lack of timely dust-proofing. Packet loss faults caused by such problems are often unrelated to any protocol, and even a complete pass through the protocol troubleshooting process cannot accurately identify where the packet loss occurs.
Based on this, in the embodiment of the present invention, the data center network fault detection device may further obtain detection information corresponding to a device-level troubleshooting process in the host-side packet loss fault detection process. Specifically, a device-level troubleshooting process may be initiated based on the vendor tool and the open source tool, depending on the functionality of such tools. For example, the detection of the sub-service items of the network card, the network port and the optical module is started.
In an embodiment, the obtaining detection information corresponding to a device-level troubleshooting process in a host-side packet loss fault detection process includes: and acquiring the detection information of the sub-service item of at least one of the network card, the network port and the optical module based on the open source tool and/or the manufacturer tool.
In the embodiment of the present invention, because the network stack involved in a network fault has many layers, the data center network fault detection device may further start a network stack detection process based on an open source tool and/or a vendor tool in order to perform finer-grained detection. In this way, the data center network fault detection device can obtain the detection information corresponding to the network stack detection process in the host-side packet loss fault detection process.
In an embodiment, the obtaining detection information corresponding to a network stack detection procedure in a host side packet loss fault detection procedure includes: detection information for sub-service items of seven layers of the OSI reference model is obtained based on an open source tool and/or a vendor tool.
In practical application, the data center network fault detection device analyzes the results against the standard packet sending and receiving flow, screens out the network stack layers where the network fault most probably occurs, infers the root cause of the problem, reproduces and simulates the scenario, and collects the key data indicators in that scenario (such as stack configuration and increases in abnormal statistics) as reference data for similar problems encountered later.
Step 103, screening the detection information of each sub-service item based on the sub-service item to obtain valid detection information.
Here, since the detection information of each sub-service item is a normalized result, it can be screened based on the sub-service item to remove invalid detection information and obtain valid detection information, which helps to pin down the fault point of the network fault more accurately.
In an embodiment, the screening the detection information of each sub-service item based on the sub-service item to obtain effective detection information includes:
screening the detection information of each sub-service item based on at least one of cascading analysis, homogeneity filtering, similarity normalization, rule-based judgment and threshold-based judgment to obtain effective detection information;
the cascading analysis is used for screening a plurality of detection information of the sub-service items belonging to the cascading relation; the homogeneity filtering is used for screening a plurality of detection information of the sub-service items belonging to the homogeneity relation; the similarity normalization is used for screening a plurality of detection information of the sub-service items belonging to similar fault categories; the rule-based judgment is used for screening the detection information of the sub-service items based on the set rule; the threshold-based determination is used for screening the detection information of the sub-service items based on a set threshold.
In practical application, the number of candidate root causes of a network fault can be reduced based on the principles of similarity classification and homogeneity comparison. For example, homogeneity filtering screens multiple pieces of detection information of sub-service items that are in a homogeneous relation and removes the redundant ones. The homogeneous relation may be an inclusion relation between two pieces of detection information: since a network port is part of a network card, network port damage is a case of network card damage, so the detection information for network card damage can be discarded and the root cause is located more precisely as the damaged network port. As another example, similarity normalization screens multiple pieces of detection information of sub-service items belonging to similar fault categories; based on predefined groupings of sub-service items into similar fault categories, the redundant detection information can be deleted, which reduces the number of candidate root causes of the network fault.
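As a non-limiting illustration, homogeneity filtering over an inclusion relation might be sketched as follows in Python. The containment map `PART_OF`, the function name `homogeneity_filter`, and the fault identification codes are hypothetical; only the network port and network card relation comes from the text.

```python
from typing import Dict

# Hypothetical containment map: part -> whole (a network port is part of a network card).
PART_OF: Dict[str, str] = {"network_port": "network_card"}


def homogeneity_filter(detections: Dict[str, str], part_of: Dict[str, str]) -> Dict[str, str]:
    """Drop the coarser (whole) entry when its finer (part) entry is also flagged."""
    redundant = {
        whole
        for part, whole in part_of.items()
        if part in detections and whole in detections
    }
    return {item: code for item, code in detections.items() if item not in redundant}
```

When only the network card is flagged and no finer-grained entry exists, the sketch keeps the network card entry, since there is nothing more precise to prefer.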
Here, the cascading analysis is used to determine multi-level root causes. Besides being similar to one another, the root causes behind a packet loss phenomenon may influence one another across several levels: a first-level root cause gives rise to a second-level root cause, which finally produces the packet loss phenomenon. For example, packet loss at the application layer may be caused by improper configuration of the transport layer of the OSI seven-layer protocol stack; the real first-level root cause is then the improper transport layer configuration, the second-level root cause is the application layer packet loss, and the observed phenomenon is network packet loss. Based on the cascading analysis and the determination of multi-level root causes, the suspected objects or stages of the fault can be effectively distinguished, providing supporting evidence for subsequent handling suggestions.
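As a non-limiting illustration, tracing a flagged sub-service item back to its first-level root cause might be sketched as follows in Python. The causal map passed as `caused_by` and the function name `root_cause_chain` are hypothetical; the disclosure does not state how cascading relations are represented.

```python
from typing import Dict, List


def root_cause_chain(
    item: str,
    caused_by: Dict[str, str],
    detections: Dict[str, str],
) -> List[str]:
    """Return the chain from the first-level root cause down to the given item.

    `caused_by` maps an effect to its assumed cause. A cause is followed only
    if it was itself flagged in `detections`.
    """
    chain = [item]
    seen = {item}
    while True:
        cause = caused_by.get(chain[-1])
        # Stop at an unknown cause, an unflagged cause, or a cycle in the map.
        if cause is None or cause not in detections or cause in seen:
            break
        chain.append(cause)
        seen.add(cause)
    chain.reverse()  # first-level root cause comes first
    return chain
```

In the example from the text, application layer packet loss caused by a transport layer misconfiguration yields the chain transport, then application.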
According to the embodiment of the invention, the detection information of each sub-service item is screened based on the sub-service items, so that the types of the root causes of packet loss faults can be effectively reduced, and the complexity of the root causes is reduced.
In an embodiment, the valid detection information includes at least two filtered detection information, and the method further includes:
and sorting the valid detection information to obtain sorted detection information.
Because the valid detection information contains multiple pieces of detection information, these pieces need to be sorted, which improves the accuracy of locating the fault point of the network fault.
In an embodiment, the sorting the valid detection information to obtain sorted detection information includes:
scoring the valid detection information based on the influence factors corresponding to the respective sub-service items;
and sorting by the scores of all pieces of detection information in the valid detection information to obtain the sorted detection information.
In an embodiment, the scoring the valid detection information based on the influence factor corresponding to the corresponding sub-service item includes:
carrying out comprehensive scoring on the detection information of each sub-service item in the effective detection information based on the weight coefficient, the association coefficient and the destruction coefficient to obtain the score of each detection information;
the weight coefficient is a first coefficient representing the probability that the item corresponding to the current detection information caused the fault, the association coefficient is a second coefficient determined based on the association between the current detection information and the other detection information in the valid detection information, and the destruction coefficient is a third coefficient representing the degree of network disruption caused by the repair operation corresponding to the current detection information.
In practical applications, a mapping relationship exists between the sub service item and the first coefficient, and the corresponding first coefficient may be determined for the detection information based on the mapping relationship between the sub service item and the first coefficient. And a mapping relation exists between the sub service item and the third coefficient, and a corresponding third coefficient can be determined for the detection information based on the mapping relation between the sub service item and the third coefficient. The second coefficient is a coefficient determined based on the correlation analysis.
In one application example, the score of a piece of detection information is: score = weight coefficient + association coefficient - destruction coefficient. Each piece of valid detection information is sorted by its score. The fault point corresponding to detection information ranked earlier has a higher priority than the fault point corresponding to detection information ranked later.
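As a non-limiting illustration, the comprehensive scoring and sorting might be sketched as follows in Python. The class name `Detection`, the function name `rank`, and all coefficient values are hypothetical; only the formula (weight coefficient plus association coefficient minus destruction coefficient) comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Detection:
    """One piece of valid detection information with its three coefficients."""
    item: str            # sub-service item
    fault_code: str      # fault identification code
    weight: float        # first coefficient: probability of causing the fault
    association: float   # second coefficient: association with other detections
    destruction: float   # third coefficient: disruption caused by the repair

    @property
    def score(self) -> float:
        # score = weight coefficient + association coefficient - destruction coefficient
        return self.weight + self.association - self.destruction


def rank(detections: Iterable[Detection]) -> List[Detection]:
    """Sort detections so that the highest-priority fault point comes first."""
    return sorted(detections, key=lambda d: d.score, reverse=True)
```

Subtracting the destruction coefficient means that, of two equally likely candidates, the one whose repair is less disruptive is tried first.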
In an embodiment, the method further comprises:
and triggering a repair process for automatically repairing the network fault based on the effective detection information.
Here, based on the fault point corresponding to the effective detection information, the corresponding repair process can be triggered to automatically repair the network fault, thereby realizing the automatic repair of the network fault.
In an embodiment, the triggering a repair procedure for automatically repairing a network failure based on the valid detection information includes:
and determining that target detection information in the valid detection information has an automatic repair program and that its destruction coefficient is smaller than a set threshold, and triggering the automatic repair program of the target detection information.
In practical application, the valid detection information is sorted, and the piece of detection information ranked first is taken as the target detection information. If the target detection information has an automatic repair program and its destruction coefficient is smaller than the set threshold, the automatic repair program of the target detection information is triggered. If the target detection information has no automatic repair program, or its destruction coefficient is greater than or equal to the set threshold, the fault is handled manually; in that case a handling suggestion with a high confidence level can still be given, which realizes a semi-automatic troubleshooting process.
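As a non-limiting illustration, the decision between automatic repair and manual handling might be sketched as follows in Python. The function name `choose_action`, the repair program registry, and the threshold value are hypothetical.

```python
from typing import Dict, Optional, Tuple


def choose_action(
    item: str,
    destruction: float,
    repair_programs: Dict[str, str],
    threshold: float,
) -> Tuple[str, Optional[str]]:
    """Decide how to handle the top-ranked (target) detection.

    Automatic repair is chosen only if a repair program exists for the item
    AND its destruction coefficient is strictly below the threshold.
    """
    program = repair_programs.get(item)
    if program is not None and destruction < threshold:
        return ("auto_repair", program)
    # No program, or the repair is too disruptive: hand over to an operator.
    return ("manual", None)
```

A destruction coefficient exactly equal to the threshold falls on the manual side, matching the "greater than or equal to" condition in the text.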
The present invention will be described in further detail with reference to the following application examples.
As shown in FIG. 2, in this application example, the data center network fault detection method comprises:
Step 201, determining whether the network fault occurs on the host side; if so, executing step 202, and otherwise starting a non-host-side troubleshooting process.
Here, the packet loss fault location algorithm can determine from the packet path whether the fault point is at a switch, a router, or the host side, and thereby whether the network fault is located on the host side. If so, step 202 is executed; if not, the non-host-side troubleshooting process is started.
Step 202, determining whether the network fault is a packet loss fault; if so, executing step 203, and if not, starting another type of troubleshooting process.
Here, the active detection tool can send probe packets of different protocols to perform end-to-end detection, and whether the network fault is a packet loss fault is determined from the end-to-end detection result. If so, step 203 is executed; if not, another type of troubleshooting process is started.
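As a non-limiting illustration, the two decisions of steps 201 and 202 might be sketched as follows in Python. The function name `select_process` and the returned labels are hypothetical; the inputs stand in for the outputs of the packet loss fault location algorithm and the active detection tool.

```python
def select_process(fault_side: str, is_packet_loss: bool) -> str:
    """Route a network fault to a troubleshooting process.

    fault_side: "host", "switch", or "router" (assumed labels).
    is_packet_loss: result of end-to-end probing.
    """
    # Step 201: anything not on the host side goes to non-host-side troubleshooting.
    if fault_side != "host":
        return "non_host_side_troubleshooting"
    # Step 202: host-side faults other than packet loss go to other processes.
    if not is_packet_loss:
        return "other_troubleshooting"
    # Step 203: only host-side packet loss starts the resource-heavy detection process.
    return "host_side_packet_loss_detection"
```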
Step 203, starting the host side packet loss fault detection process, and acquiring detection information of each sub-service item in the host side packet loss fault detection process.
Here, the toolset supports three types of detection flows: the method comprises a protocol classification troubleshooting process, a device level troubleshooting process and a network stack detection process. The method can acquire detection information corresponding to sub-service items of TCP, UDP, ICMP, ARP and RDMA protocols in the protocol classification troubleshooting process, can also acquire detection information of at least one sub-service item in a network card, a network port and an optical module, and can also acquire detection information of seven layers of sub-service items of an OSI reference model.
Here, each sub-service item returns detection information in a dictionary form of key-value, where key represents a sub-service item, such as a sub-service item of each layer in a network stack, and value is a fault identification code of the corresponding sub-service item. Here, by key-value representation (sub service item, failure identification code), normalization of the detection information may be achieved, and a mapping relationship between the detection information and the failure point may be established, so that the failure point is determined by the detection information.
Step 204, screening the detection information of each sub-service item based on the sub-service item.
Here, the detection information of each sub-service item is screened based on at least one of cascade analysis, homogeneity filtering, similarity normalization, rule-based determination, and threshold-based determination, to obtain effective detection information.
Step 205, scoring based on association rules.
Here, each piece of valid detection information is scored based on the association rule to obtain the association coefficient of each piece of detection information.
Step 206, scoring based on the destruction coefficient.
Here, each piece of detection information in the valid detection information is scored based on the destruction coefficient, and the destruction coefficient of each piece of detection information is obtained.
Step 207, determining the priority of troubleshooting.
Here, a comprehensive score is computed for each piece of valid detection information based on its weight coefficient, association coefficient, and destruction coefficient, and the troubleshooting priority is determined from these scores.
In practical applications, association rule scoring may be performed on the suspected objects indicated by the valid detection information, and the suspected objects may be ranked by their association rule scores. This criterion alone is often insufficient, however. The remedy suggested for some suspected objects, such as a network card problem, may require restarting the network card, which carries a risk of interrupting services; such a remedy has a high destructive impact on network quality, and this impact is expressed by the destruction coefficient.
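The effect of the destruction coefficient on the ranking can be illustrated with a minimal sketch. The source does not give a scoring formula; the expression `weight * association - penalty * destruction`, the `penalty` parameter, and all numeric values below are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    item: str           # sub-service item
    weight: float       # probability that this item causes the fault
    association: float  # association with the other valid detection information
    destruction: float  # network damage caused by the corresponding repair


def composite_score(c: Candidate, penalty: float = 0.5) -> float:
    """Assumed formula: reward likely causes, penalize destructive repairs."""
    return c.weight * c.association - penalty * c.destruction


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Highest composite score first, i.e. highest troubleshooting priority."""
    return sorted(candidates, key=composite_score, reverse=True)


candidates = [
    Candidate("nic", weight=0.6, association=0.9, destruction=0.8),
    Candidate("tcp", weight=0.5, association=0.8, destruction=0.1),
    Candidate("arp", weight=0.2, association=0.3, destruction=0.0),
]
ranked = [c.item for c in prioritize(candidates)]
```

With these made-up values the network card has the strongest association score, but the TCP item is ranked first because its repair is far less destructive than restarting the network card.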
Step 208, generating a log alarm.
Here, a log alarm is generated based on the troubleshooting priority of the valid detection information.
Step 209, determining whether to start the automatic repair process; if so, executing step 210, and if not, executing step 211.
Here, the detection information having the highest priority is taken as the target detection information. If the target detection information has an automatic repair program and its destruction coefficient is smaller than the set threshold, step 210 is executed; otherwise, step 211 is executed.
Step 210, performing automatic processing.
Here, if the target detection information has an automatic repair program and its destruction coefficient is smaller than the set threshold, the automatic repair program of the target detection information is triggered.
Step 211, performing manual processing.
Here, if the target detection information has no automatic repair program, or its destruction coefficient is greater than or equal to the set threshold, the fault is handed over for manual processing.
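The decision in steps 209 to 211 can be sketched as below. The function name `handle`, the registry of repair programs, and the threshold value are hypothetical; only the two conditions (a repair program exists, and the destruction coefficient is below the set threshold) come from the description.

```python
from typing import Callable, Dict, List


def handle(target_item: str, destruction: float,
           repairs: Dict[str, Callable[[], None]], threshold: float) -> str:
    """Run the automatic repair only if one exists and it is safe enough."""
    repair = repairs.get(target_item)
    if repair is not None and destruction < threshold:
        repair()
        return "automatic"
    return "manual"


log: List[str] = []
# Hypothetical registry: only the TCP item has an automatic repair program.
repairs = {"tcp": lambda: log.append("tcp repair program executed")}

first = handle("tcp", 0.1, repairs, threshold=0.5)   # safe repair exists
second = handle("nic", 0.1, repairs, threshold=0.5)  # no repair program
third = handle("tcp", 0.5, repairs, threshold=0.5)   # at threshold, too risky
```

A destruction coefficient equal to the threshold is routed to manual processing, matching the "greater than or equal to" condition of step 211.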
In order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides a data center network fault detection apparatus, where the data center network fault detection apparatus corresponds to the data center network fault detection method, and each step in the data center network fault detection method is also completely applicable to the embodiment of the data center network fault detection apparatus.
As shown in fig. 3, the data center network fault detection apparatus includes: a starting module 301, an obtaining module 302 and a screening module 303. The starting module 301 is configured to determine that the network fault is a host side packet loss fault, and start a host side packet loss fault detection process; the obtaining module 302 is configured to obtain detection information of each sub-service item in the host side packet loss fault detection process; the screening module 303 is configured to screen the detection information of each sub-service item based on the sub-service item to obtain effective detection information.
In an embodiment, the obtaining module 302 is configured to perform at least one of the following:
acquiring detection information corresponding to a protocol classification troubleshooting process in a host side packet loss fault detection process;
acquiring detection information corresponding to a device level troubleshooting process in a host side packet loss fault detection process;
and acquiring detection information corresponding to a network stack detection flow in a host side packet loss fault detection flow.
In an embodiment, the obtaining module 302 is specifically configured to: acquiring detection information of a sub-service item of a target protocol based on an open source tool and/or a vendor tool, wherein the target protocol comprises at least one of the following: TCP, UDP, ICMP, ARP, and RDMA protocols.
In an embodiment, the obtaining module 302 is specifically configured to: and acquiring the detection information of the sub-service item of at least one of the network card, the network port and the optical module based on the open source tool and/or the manufacturer tool.
In an embodiment, the obtaining module 302 is specifically configured to: detection information for sub-service items of seven layers of the OSI reference model is obtained based on an open source tool and/or a vendor tool.
In an embodiment, the screening module 303 is specifically configured to: screening the detection information of each sub-service item based on at least one of cascading analysis, homogeneity filtering, similarity normalization, rule-based judgment and threshold-based judgment to obtain effective detection information;
the cascading analysis is used for screening multiple pieces of detection information of sub-service items that are in a cascading relation; the homogeneity filtering is used for screening multiple pieces of detection information of sub-service items that are in a homogeneity relation; the similarity normalization is used for screening multiple pieces of detection information of sub-service items that belong to similar fault categories; the rule-based judgment is used for screening the detection information of the sub-service items based on a set rule; the threshold-based judgment is used for screening the detection information of the sub-service items based on a set threshold.
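Three of these screening techniques can be sketched as simple filters. The source does not define them in detail, so the interpretations here are assumptions: threshold-based judgment keeps faults whose measured value reaches a minimum, homogeneity filtering drops repeated identical reports, and cascading analysis drops items whose fault is already explained by an upstream item. The `CASCADE` relation, the `metric` field, and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Detection:
    item: str            # sub-service item
    code: int            # fault identification code (0 assumed to mean no fault)
    metric: float = 0.0  # hypothetical measured value, e.g. a drop rate


# Assumed cascading relation: a fault in the key can explain faults in the values.
CASCADE: Dict[str, List[str]] = {"optical_module": ["port"], "port": ["nic"]}


def threshold_filter(ds: List[Detection], minimum: float) -> List[Detection]:
    """Keep only reported faults whose metric reaches the set threshold."""
    return [d for d in ds if d.code != 0 and d.metric >= minimum]


def homogeneity_filter(ds: List[Detection]) -> List[Detection]:
    """Keep one report per (item, code) pair, preserving order."""
    seen = set()
    kept = []
    for d in ds:
        key = (d.item, d.code)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept


def cascade_filter(ds: List[Detection]) -> List[Detection]:
    """Drop items whose fault is explained by a faulty upstream item."""
    faulty = {d.item for d in ds}
    explained = {child for parent in faulty for child in CASCADE.get(parent, [])}
    return [d for d in ds if d.item not in explained]


raw = [
    Detection("optical_module", 301, 0.9),
    Detection("port", 302, 0.8),
    Detection("port", 302, 0.8),
    Detection("tcp", 202, 0.01),
    Detection("arp", 0),
]
valid = cascade_filter(homogeneity_filter(threshold_filter(raw, minimum=0.05)))
```

Here the duplicate port report and the low-rate TCP report are dropped, and the port fault is attributed to the optical module, leaving a single piece of valid detection information.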
In an embodiment, the valid detection information includes at least two pieces of screened detection information, and the data center network fault detection apparatus further includes a sorting module 304, configured to sort the valid detection information to obtain sorted detection information.
In an embodiment, the sorting module 304 is specifically configured to:
scoring the valid detection information based on the influence factors of the corresponding sub-service items;
and sorting based on the score of each piece of detection information in the valid detection information to obtain the sorted detection information.
In an embodiment, the sorting module 304 is specifically configured to:
comprehensively scoring the detection information of each sub-service item in the valid detection information based on the weight coefficient, the association coefficient, and the destruction coefficient to obtain the score of each piece of detection information;
the weight coefficient is a first coefficient representing the probability that the current detection information corresponds to the cause of the fault, the association coefficient is a second coefficient determined based on the association between the current detection information and the other detection information in the valid detection information, and the destruction coefficient is a third coefficient representing the degree of network damage caused by the repair operation corresponding to the current detection information.
In an embodiment, the data center network failure detection apparatus further includes:
a repairing module 305, configured to trigger a repairing process for automatically repairing the network failure based on the valid detection information.
In an embodiment, the repair module 305 is specifically configured to:
determining that target detection information in the valid detection information has an automatic repair program and that its destruction coefficient is smaller than a set threshold, and triggering the automatic repair program of the target detection information.
In practical applications, the starting module 301, the obtaining module 302, the screening module 303, the sorting module 304, and the repairing module 305 may be implemented by a processor in a data center network fault detection apparatus. Of course, the processor needs to run a computer program in memory to implement its functions.
It should be noted that the division into program modules described for the data center network fault detection apparatus of the foregoing embodiment is only an example. In practical applications, the processing may be allocated to different program modules as needed; that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the data center network fault detection apparatus provided in the above embodiment and the data center network fault detection method embodiment belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides a data center network fault detection device. Fig. 4 shows only an exemplary structure of the data center network fault detection device, not its entire structure; part or all of the structure shown in fig. 4 may be implemented as necessary.
As shown in fig. 4, a data center network failure detection apparatus 400 provided in an embodiment of the present invention includes: at least one processor 401, memory 402, a user interface 403, and at least one network interface 404. The various components in the data center network failure detection device 400 are coupled together by a bus system 405. It will be appreciated that the bus system 405 is used to enable communications among the components. The bus system 405 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 405 in fig. 4.
The user interface 403 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
Memory 402 in embodiments of the present invention is used to store various types of data to support the operation of data center network failure detection equipment. Examples of such data include: any computer program for operating on a data center network failure detection device.
The data center network fault detection method disclosed by the embodiment of the invention can be applied to the processor 401, or can be realized by the processor 401. The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the data center network failure detection method may be implemented by hardware integrated logic circuits or instructions in software in the processor 401. The Processor 401 described above may be a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. Processor 401 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software module may be located in a storage medium located in the memory 402, and the processor 401 reads information in the memory 402, and completes the steps of the data center network fault detection method provided by the embodiment of the present invention in combination with hardware thereof.
In an exemplary embodiment, the data center network fault detection device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the aforementioned methods.
It will be appreciated that the memory 402 may be volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of the present invention is intended to include, without being limited to, these and any other suitable types of memory.
In an exemplary embodiment, the embodiment of the present invention further provides a storage medium, that is, a computer storage medium, which may be specifically a computer readable storage medium, for example, a memory 402 storing a computer program, where the computer program is executable by a processor 401 of a data center network failure detection device to complete the steps described in the method of the embodiment of the present invention. The computer readable storage medium may be a ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM, among others.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A data center network fault detection method is characterized by comprising the following steps:
determining that the network fault is a host side packet loss fault, and starting a host side packet loss fault detection process;
acquiring detection information of each sub-service item in a host side packet loss fault detection process;
and screening the detection information of each sub-service item based on the sub-service item to obtain effective detection information.
2. The method according to claim 1, wherein the obtaining of the detection information of each sub-service item in the host side packet loss fault detection process includes at least one of:
acquiring detection information corresponding to a protocol classification troubleshooting process in a host side packet loss fault detection process;
acquiring detection information corresponding to a device level troubleshooting process in a host side packet loss fault detection process;
and acquiring detection information corresponding to a network stack detection flow in a host side packet loss fault detection flow.
3. The method according to claim 2, wherein the obtaining of the detection information corresponding to the protocol classification troubleshooting process in the host side packet loss fault detection process includes:
acquiring detection information of a sub-service item of a target protocol based on an open source tool and/or a vendor tool, wherein the target protocol comprises at least one of the following: TCP, UDP, ICMP, ARP and RDMA.
4. The method according to claim 2, wherein the obtaining of the detection information corresponding to the device-level troubleshooting process in the host-side packet loss fault detection process includes:
and acquiring the detection information of the sub-service item of at least one of the network card, the network port and the optical module based on the open source tool and/or the manufacturer tool.
5. The method according to claim 2, wherein the obtaining of the detection information corresponding to the network stack detection procedure in the host side packet loss fault detection procedure includes:
and acquiring detection information of the sub-service items of the seven layers of the open system interconnection OSI reference model based on the open source tool and/or the vendor tool.
6. The method of claim 1, wherein the screening the detection information of each sub-service item based on the sub-service item to obtain valid detection information comprises:
screening the detection information of each sub-service item based on at least one of cascading analysis, homogeneity filtering, similarity normalization, rule-based judgment and threshold-based judgment to obtain effective detection information;
the cascading analysis is used for screening multiple pieces of detection information of sub-service items that are in a cascading relation; the homogeneity filtering is used for screening multiple pieces of detection information of sub-service items that are in a homogeneity relation; the similarity normalization is used for screening multiple pieces of detection information of sub-service items that belong to similar fault categories; the rule-based judgment is used for screening the detection information of the sub-service items based on a set rule; the threshold-based judgment is used for screening the detection information of the sub-service items based on a set threshold.
7. The method of claim 1, wherein the valid detection information comprises at least two pieces of screened detection information, the method further comprising:
sorting the valid detection information to obtain sorted detection information.
8. The method of claim 7, wherein the sorting the valid detection information to obtain sorted detection information comprises:
scoring the valid detection information based on the influence factors of the corresponding sub-service items;
and sorting based on the score of each piece of detection information in the valid detection information to obtain the sorted detection information.
9. The method of claim 8, wherein scoring the valid detection information based on the impact factors corresponding to the respective sub-service items comprises:
carrying out comprehensive scoring on the detection information of each sub-service item in the effective detection information based on the weight coefficient, the association coefficient and the destruction coefficient to obtain the score of each detection information;
the weight coefficient is a first coefficient representing the probability that the current detection information corresponds to the cause of the fault, the association coefficient is a second coefficient determined based on the association between the current detection information and the other detection information in the valid detection information, and the destruction coefficient is a third coefficient representing the degree of network damage caused by the repair operation corresponding to the current detection information.
10. The method according to claim 1, characterized in that it further comprises:
determining that target detection information in the valid detection information has an automatic repair program and that its destruction coefficient is smaller than a set threshold, and triggering the automatic repair program of the target detection information.
11. A data center network fault detection apparatus, characterized by comprising:
the starting module is used for determining that the network fault is a host side packet loss fault and starting a host side packet loss fault detection process;
the acquisition module is used for acquiring detection information of each sub-service item in the host side packet loss fault detection process;
and the screening module is used for screening the detection information of each sub-service item based on the sub-service item to obtain effective detection information.
12. A data center network failure detection device, comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein,
the processor, when executing the computer program, is adapted to perform the steps of the method of any of claims 1 to 10.
13. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 10.
CN202010544540.1A 2020-06-15 2020-06-15 Data center network fault detection method, device, equipment and storage medium Pending CN113810239A (en)

Publications (1)

Publication Number Publication Date
CN113810239A true CN113810239A (en) 2021-12-17

Family

ID=78944383

Country Status (1)

Country Link
CN (1) CN113810239A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1276118A (en) * 1997-10-14 2000-12-06 诺基亚网络有限公司 Network monitoring method for telecommunications network
CN102158360A (en) * 2011-04-01 2011-08-17 华中科技大学 Network fault self-diagnosis method based on causal relationship positioning of time factors
US20140078882A1 (en) * 2012-09-14 2014-03-20 Microsoft Corporation Automated Datacenter Network Failure Mitigation
CN107171819A (en) * 2016-03-07 2017-09-15 北京华为数字技术有限公司 A kind of network fault diagnosis method and device
CN108199874A (en) * 2017-12-29 2018-06-22 上海司南卫星导航技术股份有限公司 A kind of Network Fault Detection and method, GNSS receiver and the computer-readable medium repaired

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
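朱正红 (Zhu Zhenghong): "Fault detection and maintenance of computer room network systems" (计算机机房网络系统故障检测与维护), 《电脑学习》 *
李华燕 (Li Huayan): "A brief discussion of network fault diagnosis and troubleshooting methods" (略论网络故障诊断和排除方法), 《电脑知识与技术》 *
王志燕 (Wang Zhiyan): "Common computer network faults and maintenance measures" (计算机常见网络故障及维护措施), 《山西煤炭管理干部学院学报》 *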

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118632A (en) * 2022-06-21 2022-09-27 中电信数智科技有限公司 Automatic host packet loss detection method based on cloud network fusion
CN115118632B (en) * 2022-06-21 2024-02-06 中电信数智科技有限公司 Automatic detection method for packet loss of host based on cloud network integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211217