CN117714325A - Network monitoring method and device for server cluster, electronic equipment and storage medium - Google Patents

Info

Publication number
CN117714325A
Authority
CN
China
Prior art keywords
network
telemetry
probe
port
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311812679.XA
Other languages
Chinese (zh)
Inventor
江卓
袁乾宸
李福亮
张磊
王兴伟
张璋
王磊
赖育霖
柏洋洋
霍朋飞
董永彬
叶剑西
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Lemon Inc Cayman Island
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd, Lemon Inc Cayman Island filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202311812679.XA priority Critical patent/CN117714325A/en
Publication of CN117714325A publication Critical patent/CN117714325A/en
Pending legal-status Critical Current

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The disclosure relates to the technical field of computers and provides a network monitoring method and device for a server cluster, electronic equipment and a storage medium. The method comprises: obtaining network topology information of the server cluster to be monitored, the network topology information representing the connection relation between network devices in the server cluster to be monitored; generating a first type in-band network telemetry probe based on the network topology information to obtain the network quality of ports in the network devices; generating a second type in-band network telemetry probe based on flow quintuple information of the current service flow to obtain a service flow telemetry path, wherein the transmission frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe; and querying the port network quality based on the ports in the service flow telemetry path to determine a network monitoring result of the server cluster to be monitored. The method ensures the real-time performance and accuracy of network monitoring.

Description

Network monitoring method and device for server cluster, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a network monitoring method and device for a server cluster, electronic equipment and a storage medium.
Background
With the development of network applications such as large model training and distributed storage, the collaborative operation of a large number of servers in a server cluster places higher demands on network stability: the network must provide both efficient transmission and high quality. The network is therefore required to have accurate network monitoring capabilities.
Disclosure of Invention
In view of this, the disclosure provides a network monitoring method, device, electronic equipment and storage medium for a server cluster, so as to solve the problem of network monitoring.
In a first aspect, the present disclosure provides a network monitoring method of a server cluster, the method including:
acquiring network topology information of a server cluster to be monitored, wherein the network topology information is used for representing the connection relation between network devices in the server cluster to be monitored;
generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of a port in the network device;
generating a second type in-band network telemetry probe based on flow quintuple information of the current service flow to obtain a service flow telemetry path, wherein the transmission frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe;
and querying the port network quality based on the ports in the service flow telemetry path to determine a network monitoring result of the server cluster to be monitored.
In a second aspect, the present disclosure provides a network monitoring apparatus for a server cluster, the apparatus comprising:
the network topology information acquisition module is used for acquiring network topology information of the server cluster to be monitored, and the network topology information is used for representing the connection relation between network devices in the server cluster to be monitored;
a first type in-band network telemetry probe generation module for generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of a port in the network device;
the second type in-band network telemetry probe generation module is used for generating a second type in-band network telemetry probe based on flow quintuple information of current service flow so as to obtain a service flow telemetry path, and the sending frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe;
and the network quality query module is used for querying the port network quality based on the port in the service flow telemetry path and determining the network monitoring result of the server cluster to be monitored.
In a third aspect, the present disclosure provides an electronic device comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to perform the network monitoring method of the server cluster of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present disclosure provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the network monitoring method of the server cluster of the first aspect or any of its corresponding embodiments.
According to the network monitoring method of the server cluster, the first type in-band network telemetry probe is generated based on the network topology information of the server cluster to be monitored, and network quality detection is carried out on the ports of the network equipment in the server cluster to be monitored, namely, detection of the first type in-band network telemetry probe is irrelevant to actual service flow. And generating a second type in-band network telemetry probe based on the flow quintuple information of the current service flow, and detecting a service flow telemetry path. Because the second type in-band network telemetry probe is related to the actual service flow, the service flow telemetry path of the current service flow can be obtained through the second type in-band network telemetry probe, the network quality of each port can be obtained in the detection result of the first type in-band network telemetry probe, the service flow telemetry path comprises a plurality of ports, and the network quality corresponding to each service flow telemetry path can be accurately obtained by combining the network quality of the plurality of ports, so that the real-time performance and the accuracy of network monitoring are ensured.
Drawings
To illustrate the embodiments of the present disclosure or the prior art more clearly, the drawings required for the detailed description or the prior art are briefly described below. The drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a system schematic diagram of fault localization according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of network monitoring of a server cluster according to an embodiment of the present disclosure;
FIG. 3 is a flow diagram of a method of network monitoring of a server cluster according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a network topology according to an embodiment of the present disclosure;
FIGS. 5a-5d are schematic diagrams of updates to a heap to be screened according to embodiments of the present disclosure;
FIG. 6 is a flow diagram of a method of network monitoring of a server cluster according to an embodiment of the disclosure;
FIG. 7 is a schematic diagram of a fault path according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a network monitoring device of a server cluster according to an embodiment of the disclosure;
FIG. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
For example, in response to receiving an active request from a user, prompt information is sent to the user to state explicitly that the requested operation will require obtaining and using the user's personal information. The user can thus decide autonomously, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server or a storage medium, that executes the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
In the related art, network fault localization for a server cluster is realized through active network measurement. However, when the network fails, the specific port at which the fault occurred is difficult to locate directly, and port-level connectivity problems can be located only through flow sampling. Moreover, when several faults occur simultaneously, massive amounts of data must be analyzed and processed centrally, which makes locating network problems difficult.
In other related art, network fault localization for a server cluster is realized through passive network measurement. Although this approach enables switch-based network monitoring, it cannot capture the instantaneous network state at the moment a packet passes through, so a problem with service traffic cannot be associated directly with a network problem. Troubleshooting relies on comparing whether the times at which the traffic fluctuation and the network fault occurred coincide, which cannot directly determine whether the traffic fluctuation was caused by the network fault.
In some related technologies, a network measurement scheme based on in-band network telemetry (In-band Network Telemetry, INT) is adopted, in which the switch inserts port-level telemetry information into a probe packet, including the time at which the packet passes, the ingress port, the egress port, the queue depth and the hop-by-hop forwarding delay, so that the network state can be detected at a finer granularity. But because INT provides only a basic source domain, it is difficult to meet the network-awareness requirements of full ports and full traffic directly. In addition, the number of telemetry probes is positively correlated with the size and number of traffic flows, which results in significant bandwidth occupancy and server resource overhead.
The network measurement method based on INT is difficult to meet network sensing requirements of full ports and full traffic at the same time, for example, a network telemetry scheme combined with Segment Routing (SR) technology realizes light-weight full-port telemetry without overlapping coverage through a path planning algorithm, but is difficult to meet the requirements of a remote direct memory access (Remote Direct Memory Access, RDMA) network for full traffic measurement; for another example, a network telemetry scheme based on data compression realizes the requirement of full-flow measurement through statistics of network flow, but cannot collect network information such as queue depth, hop-by-hop time delay and the like.
Because the INT-based network measurement method requires the switch to insert telemetry data and to transmit it in the data plane, it consumes more network bandwidth and CPU resources, so full-port and full-traffic network measurement is difficult to make lightweight. INT in the related art needs to support both fine-grained network monitoring and efficient network congestion control, and reporting network telemetry information from the switches alone would result in excessive southbound overhead. In addition, such network measurement schemes have difficulty directly locating the specific port at which a network failure occurred, and network administrators need a great deal of time to analyze massive network telemetry data in order to troubleshoot the failure.
Based on this, the embodiment of the disclosure provides a network monitoring method for a server cluster, which can not only detect the network state of all ports, but also measure the real-time network information of all traffic and accurately locate the position where the network problem occurs while not affecting the server and the network performance. Further, the network monitoring method of the server cluster provided by the embodiment of the disclosure is realized based on a network monitoring system, wherein the network monitoring system is a lightweight network measurement architecture, and a high-real-time fine-grained full-port full-flow measurement scheme is provided for an RDMA network.
In the network monitoring system provided by the embodiment of the disclosure, as shown in FIG. 1, components such as the INT controller, the INT analyzer and the INT database are deployed in a distributed server cluster, which ensures that the telemetry system does not become abnormal because of a single point of failure and enhances the data processing capability of these components. An INT agent and an INT collector are deployed in each server, ensuring that each device can send and receive INT probes. The RDMA network card identifies the packet header of an INT probe packet and automatically delivers the INT probe packet to the corresponding network application. The data plane needs to support hash-based equal-cost multi-path routing (Equal-Cost Multi-Path, ECMP) and the INT transmission function. To realize an in-band full-network telemetry system that is general, robust and scalable, the INT agent actively queries the INT controller and then sends INT probes, rather than the INT controller actively sending INT probes, which yields a loosely coupled design.
Specifically, the INT controller adopts a full-port, full-traffic network measurement scheme based on the network topology information, and obtains target probe packet quintuples by screening probe packet quintuple information, so as to realize lightweight and high-performance full-network telemetry. The INT agent periodically sends INT probes according to the telemetry scheme of the INT controller and telemetry requirements such as the configured probe transmission frequency. The RDMA network card identifies the header of an INT probe packet and delivers the packet to the specified network application. An INT collector is deployed on each server to collect the telemetry data that the RDMA network card writes directly into memory and to forward it periodically to the INT analyzer; because the RDMA network card writes directly into memory without CPU processing, data collection incurs no additional overhead. The INT analyzer is responsible for parsing the received telemetry data and storing it in the INT database as needed. Because it is deployed on a server cluster, the INT analyzer parses large amounts of telemetry data efficiently. The INT database is responsible for storing telemetry data and can provide network telemetry information in real time; the telemetry data is split into port data obtained by the first type in-band network telemetry probe and path data obtained by the second type in-band network telemetry probe, and the two are stored separately.
When the network monitoring system works, network measurement of a first frequency is carried out on each port through a first type in-band network telemetry probe, telemetry paths of all service flows are obtained through a second type in-band network telemetry probe of a second frequency corresponding to the service flows, lightweight whole-network measurement is carried out through a telemetry data acquisition scheme of a data plane, high-performance network telemetry data processing is achieved based on RDMA network cards and network telemetry data, and finally accurate network monitoring is achieved through port network quality obtained through the first type in-band network telemetry probe and the service flow telemetry paths obtained through the second type in-band network telemetry probe.
According to an embodiment of the present disclosure, there is provided a network monitoring method embodiment of a server cluster, it being noted that the steps shown in the flowchart of the figures may be performed in a computer system such as a set of computer executable instructions, and, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order different from that shown or described herein.
In this embodiment, a network monitoring method of a server cluster is provided, which may be used in the servers described above, and fig. 2 is a flowchart of a network monitoring method of a server cluster according to an embodiment of the disclosure, as shown in fig. 2, where the flowchart includes the following steps:
Step S201, obtaining network topology information of a server cluster to be monitored.
The network topology information is used for representing the connection relation between network devices in the server cluster to be monitored.
The server cluster to be monitored comprises a plurality of network devices, the types of the network devices include, but are not limited to, a switch, a server and a network card in the server, for example, the servers can be connected through the switch, and a multi-stage connection mode can be deployed between the switch and the switch according to network deployment requirements. At least one network card is arranged in the same server, and different network cards in the same server can be connected with the same network equipment or different network equipment, and the like.
The network topology information is obtained through the connection between the network devices after the network deployment is completed. The network topology information characterizes the connection condition between the network devices, and each network device adopts the corresponding identification representation in the network topology information so as to distinguish different network devices.
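As an illustration of how such topology information might be held in memory, the following is a minimal sketch that stores each physical link in an adjacency structure keyed by device identifier. The class name `Topology`, its methods and the device and port identifiers are assumptions made for this example; the disclosure does not prescribe a data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Connection relation between network devices, keyed by device identifier."""
    # device id -> {local port name -> (peer device id, peer port name)}
    links: dict = field(default_factory=lambda: defaultdict(dict))

    def connect(self, dev_a, port_a, dev_b, port_b):
        """Record one bidirectional physical link between two ports."""
        self.links[dev_a][port_a] = (dev_b, port_b)
        self.links[dev_b][port_b] = (dev_a, port_a)

    def ports(self):
        """Return every (device, port) pair that appears in the topology."""
        return sorted((dev, port) for dev, m in self.links.items() for port in m)

    def neighbors(self, dev):
        """Return the identifiers of devices directly connected to `dev`."""
        return sorted({peer for peer, _ in self.links[dev].values()})


# Hypothetical fragment of a topology: one server with two network cards.
topo = Topology()
topo.connect("server1", "eth1", "switch0-1", "p1")
topo.connect("server1", "eth2", "switch0-2", "p1")
topo.connect("switch0-1", "p2", "switch1-1", "p1")
```

The list returned by `topo.ports()` is the set of ports that the first type probes would need to cover.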
Step S202, generating a first type in-band network telemetry probe based on network topology information to obtain network quality of ports in the network device.
The same network device includes a plurality of physical ports, and when the network device communicates, it is necessary to obtain the corresponding physical ports in addition to the network addresses of both communication parties. That is, the network probe information is obtained based on the detection of the first type in-band network telemetry probe, and the network quality of the port obtained by further analysis based on the network probe information corresponds to the quality of the physical port. Because the network topology information characterizes the connection of all network devices in the server cluster to be monitored, a first type in-band network telemetry probe is generated based on the connection, and is used for detecting each port of the network devices to obtain the network quality of each port in the network devices. It should be noted that, the determination manner of the network quality of each port includes, but is not limited to, a self-monitoring result of the network device, or an aggregation result combined with the related log, and the like, and the determination manner of the network quality is not limited in any way. For the first type in-band network telemetry probe, each port adds the respective network quality to the detection data of the first type in-band network telemetry probe, so as to acquire network detection information of each port in the network equipment, and further analyze the network detection information to acquire the network quality of the port.
The first type in-band network telemetry probe may be generated randomly based on network topology information to cover all ports; or screening is performed on the basis of random generation, and each port only needs to be ensured to be covered for a plurality of times, so that a large number of repeated or redundant probes are avoided.
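The screening idea can be sketched as a simple greedy pass: a candidate probe is kept only if at least one port on its path has not yet been covered the required number of times. This is an illustrative simplification, not the screening procedure of the disclosure; the function name, the probe identifiers and the port strings are hypothetical.

```python
from collections import Counter


def screen_probes(candidates, min_cover=1):
    """Keep a candidate only while it still helps some port reach `min_cover`.

    `candidates` maps a hypothetical probe id to the list of ports on its
    telemetry path. Returns the ids of the probes that were kept.
    """
    coverage = Counter()
    kept = []
    for probe_id, path in candidates.items():
        if any(coverage[port] < min_cover for port in path):
            kept.append(probe_id)
            coverage.update(set(path))
    return kept


candidates = {
    "probe-a": ["s0-1:p1", "s1-1:p1"],
    "probe-b": ["s0-1:p1", "s1-1:p1"],  # redundant: same ports as probe-a
    "probe-c": ["s0-1:p1", "s1-2:p1"],  # adds one port not yet covered
}
print(screen_probes(candidates))  # ['probe-a', 'probe-c']
```

Raising `min_cover` keeps more probes, trading bandwidth for redundancy in the per-port measurements.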
Step S203, generating a second type in-band network telemetry probe based on the flow quintuple information of the current service flow to obtain a service flow telemetry path.
Wherein the transmission frequency of the first type in-band network telemetry probe is higher than the transmission frequency of the second type in-band network telemetry probe.
The flow quintuple information may be obtained from the current service flow in ways that include, but are not limited to, an approach based on the extended Berkeley Packet Filter (eBPF). The flow quintuple information includes a source IP (Internet Protocol) address, a destination IP address, a source port, a destination port, and a protocol number. Once the flow quintuple information is determined, both communication parties of the service flow are determined. A second type in-band network telemetry probe is then generated from the flow quintuple information, and the service flow telemetry path of the current service flow is obtained.
Because the first type in-band network telemetry probe is independent of the service flow, it has no influence on the service flow, and because the second type in-band network telemetry probe, which is related to the service flow, is sent at a lower frequency than the first type, network telemetry does not disturb normal service traffic. For example, the first type in-band network telemetry probe may be sent at 100 PPS and the second type in-band network telemetry probe at 1 PPS, which occupies very little network bandwidth (e.g., 200 Kb/s) and achieves lightweight full-traffic and full-port network monitoring.
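The bandwidth cost of probing is the product of the probe rate and the probe size. The sketch below works through that arithmetic for the example rates given above, for a single sender with a single monitored flow. The 256-byte probe size is an assumption chosen only for illustration; the disclosure states the rates and an example bandwidth but no packet size.

```python
def probe_bandwidth_bps(rate_pps, packet_bytes):
    """Bandwidth in bits per second for probes of a fixed size sent at `rate_pps`."""
    return rate_pps * packet_bytes * 8


FIRST_TYPE_PPS = 100       # example rate from the text
SECOND_TYPE_PPS = 1        # example rate from the text
ASSUMED_PROBE_BYTES = 256  # assumption: not stated in the source

total_bps = (probe_bandwidth_bps(FIRST_TYPE_PPS, ASSUMED_PROBE_BYTES)
             + probe_bandwidth_bps(SECOND_TYPE_PPS, ASSUMED_PROBE_BYTES))
print(total_bps)  # 206848
```

Under this assumed size the total is about 207 Kb/s, the same order of magnitude as the example figure in the text.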
Step S204, inquiring port network quality based on ports in the traffic telemetry path, and determining network monitoring results of the server cluster to be monitored.
The service flow telemetry path passes through at least one network device and includes an ingress port and an egress port for each network device on the path. Parsing the service flow telemetry path yields the ports on the path; combining these with the port network quality obtained in step S202 gives the network quality of every port on the service flow telemetry path. Network monitoring of all ports is thereby realized, and the network monitoring result of the server cluster to be monitored is obtained.
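The lookup in step S204 amounts to joining two data sets: the per-port quality records produced by the first type probes and the ordered port list produced by the second type probes. A minimal sketch follows; the function names, the port strings and the metric names `queue_depth` and `hop_delay_us` are assumptions for illustration.

```python
def path_quality(path_ports, port_quality):
    """Look up the quality record of every port on a traffic telemetry path.

    Returns (records, missing): per-port records in path order, plus the ports
    for which the first type probes have reported nothing yet.
    """
    records = [(port, port_quality[port]) for port in path_ports if port in port_quality]
    missing = [port for port in path_ports if port not in port_quality]
    return records, missing


def worst_port(records, metric="hop_delay_us"):
    """Return the port with the largest value of `metric`, or None if empty."""
    if not records:
        return None
    return max(records, key=lambda item: item[1][metric])[0]


# Hypothetical port data (first type probes) and path data (second type probes).
port_quality = {
    "switch0-1:p1": {"queue_depth": 2, "hop_delay_us": 5},
    "switch1-1:p1": {"queue_depth": 40, "hop_delay_us": 180},
}
flow_path = ["switch0-1:p1", "switch1-1:p1", "switch0-3:p2"]

records, missing = path_quality(flow_path, port_quality)
print(worst_port(records))  # switch1-1:p1
print(missing)              # ['switch0-3:p2']
```

Returning the missing ports separately keeps an unmeasured port from being mistaken for a healthy one.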
According to the network monitoring method of the server cluster, the first type in-band network telemetry probe is generated based on the network topology information of the server cluster to be monitored, and network quality detection is carried out on the ports of the network equipment in the server cluster to be monitored, namely, detection of the first type in-band network telemetry probe is irrelevant to actual service flow. And generating a second type in-band network telemetry probe based on the flow quintuple information of the current service flow, and detecting a service flow telemetry path. Because the second type in-band network telemetry probe is related to the actual service flow, the service flow telemetry path of the current service flow can be obtained through the second type in-band network telemetry probe, the network quality of each port can be obtained in the detection result of the first type in-band network telemetry probe, the service flow telemetry path comprises a plurality of ports, and the network quality corresponding to each service flow telemetry path can be accurately obtained by combining the network quality of the plurality of ports, so that the real-time performance and the accuracy of network monitoring are ensured.
In this embodiment, a network monitoring method of a server cluster is provided, which may be used in the servers described above, and fig. 3 is a flowchart of a network monitoring method of a server cluster according to an embodiment of the disclosure, as shown in fig. 3, where the flowchart includes the following steps:
Step S301, obtaining network topology information of a server cluster to be monitored.
The network topology information is used for representing the connection relation between network devices in the server cluster to be monitored. For details, refer to step S201 of the embodiment shown in FIG. 2; they are not repeated here.
Step S302, generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of ports in the network device.
Specifically, the step S302 includes:
Step S3021, generating random probe packets based on the network topology information to obtain the correspondence between probe packet quintuples and telemetry paths.
The source port in each random probe packet is randomly generated.
The random probe packets are generated from the network topology information: because the network topology information characterizes the connection relation between network devices, it is used to form the quintuple of each random probe packet. Specifically, a random probe packet includes a source IP address, a destination IP address, a source port, a destination port and a protocol number. The source and destination IP addresses are fixed once the two communicating parties are fixed, the protocol number is fixed, and the destination port is designated by the network card, so the only field that can be randomized is the source port. Randomizing it ensures that the random probe packets can cover all ports of all network devices in the server cluster to be monitored.
In some alternative embodiments, step S3021 includes:
Step a1, labeling the network devices in the server cluster to be monitored based on the network topology information.
Step a2, grouping all network devices based on the network cluster in which each network device is located, so as to obtain a unique number for each network card in the servers of the server cluster to be monitored, wherein network devices in the same group have different second cluster labels and identical other cluster labels, and a second cluster is a cluster of devices that remain connected after the top-layer network devices are excluded.
Step a3, generating, based on the unique numbers, random probe packets for transmission between the network cards inside each group, so as to obtain the correspondence between probe packet quintuples and telemetry paths.
All network devices are labeled according to the network topology information, and every network card of every server has a unique label. For example, as shown in FIG. 4, switches are numbered S0, S1 and S2 according to their distance from the nearest server. The S0-level switches are switch 0-1, switch 0-2, switch 0-3 and switch 0-4; the S1-level switches are switch 1-1, switch 1-2, switch 1-3 and switch 1-4; and the S2-level switches are switch 2-1 and switch 2-2. The bottom layer comprises servers 1-4, and each server has two network cards, eth1 and eth2.
Of course, fig. 4 is merely an example, and may be further extended according to the network hierarchy. The clusters of all network devices (including switches, servers and network cards) are called a first Cluster Bigpod, the clusters with connection relations after the top-layer switch (S2 in the three-layer network shown in fig. 4) is removed are called a second Cluster Minipod, the servers connected with the same switch are clustered into a third Cluster, and when the servers have a plurality of network cards, all the connected up switches of the servers in the same Cluster are the same, namely the network topology positions are equivalent. Based on this, all network cards Eth on all servers have unique network numbers (Bigpod id, minipod id, cluster id, server id, eth id). Wherein all the id values are automatically generated by the acquisition sequence of the network topology information.
For example, eth1 of server1 in fig. 4 corresponds to the reference numeral (Bigpod 1,Minipod 1,Cluster1,Server 1,Eth 1), and eth1 of server3 in fig. 4 corresponds to the reference numeral (Bigpod 1,Minipod2,Cluster 1,Server 1,Eth 1).
After the unique label of each network card is obtained, all network cards are grouped according to the following principle: network cards whose Bigpod id, Cluster id, Server id and Eth id are all the same but whose Minipod ids differ are placed in one group, for example (Bigpod 1, Minipod 1, Cluster 1, Server 1, Eth 1) and (Bigpod 1, Minipod 2, Cluster 1, Server 1, Eth 1).
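The grouping principle above can be illustrated with a minimal Python sketch. The `NicLabel` type and the `group_nics` function are hypothetical names introduced only for illustration; the sketch assumes each network card is already labeled with the five ids described above and simply groups labels that differ only in the Minipod id.

```python
from collections import defaultdict
from typing import NamedTuple


class NicLabel(NamedTuple):
    """Hypothetical container for the (Bigpod, Minipod, Cluster, Server, Eth) label."""
    bigpod: int
    minipod: int
    cluster: int
    server: int
    eth: int


def group_nics(labels):
    """Group network cards whose labels differ only in the Minipod id."""
    groups = defaultdict(list)
    for label in labels:
        # The Minipod id is deliberately left out of the grouping key.
        key = (label.bigpod, label.cluster, label.server, label.eth)
        groups[key].append(label)
    return dict(groups)


labels = [
    NicLabel(1, 1, 1, 1, 1),
    NicLabel(1, 2, 1, 1, 1),
    NicLabel(1, 1, 1, 2, 1),
    NicLabel(1, 2, 1, 2, 1),
]
groups = group_nics(labels)
```

With these four example labels, the two network cards from the example in the text end up in the same group, and probes are then exchanged only inside each group.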
Taking (Bigpod 1, Minipod 1, Cluster 1, Server 1, Eth 1) as an example, random probe packets are generated by using the address of this network card as the source IP address, randomizing the source port, and targeting every other network card in the same group. The overall sending rate of the random probe packets is limited to below 100 PPS. Because the source port is randomized, each random probe packet carries a different quintuple, and because the data plane enables equal-cost multi-path (ECMP) routing, which selects a path based on the hash of the quintuple, network telemetry can probe more paths. For example, when the source port is 65534, the corresponding random probe packet passes through switch 0-1, switch 1-1, switch 2-1, switch 1-3 and switch 0-3. When the source port is 65535, the corresponding random probe packet passes through switch 0-1, switch 1-2, switch 2-2, switch 1-4 and switch 0-3. Since the network cards of every server participate in network telemetry, with sufficiently many random probe packets it is possible to cover all ports in the system to be processed, including the last hop (the links between servers and switches).
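As a rough illustration of the generation step, the following Python sketch builds probe quintuples in which only the source port varies. The `FiveTuple` type, the `random_probe_tuples` function, the IP addresses, the destination port 4791, the protocol number 17 and the source-port range are all assumptions chosen for the example, not values taken from the disclosure.

```python
import random
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def random_probe_tuples(src_ip, peer_ips, dst_port, protocol, count, rng):
    """Build `count` probe quintuples per peer; only the source port is random."""
    tuples = []
    for dst_ip in peer_ips:
        for _ in range(count):
            # Assumed port range; the fixed fields come from the caller.
            src_port = rng.randint(1024, 65535)
            tuples.append(FiveTuple(src_ip, dst_ip, src_port, dst_port, protocol))
    return tuples


rng = random.Random(0)  # seeded only to make the example repeatable
probes = random_probe_tuples("10.0.1.1", ["10.0.2.1"], 4791, 17, 8, rng)
```

Each resulting quintuple would hash to some equal-cost path in the data plane; the sketch does not model the hash itself, nor the rate limit on sending.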
To further illustrate the advantage of this grouping, suppose instead that network cards with the same Minipod id were grouped together, that is, probing took place between network cards in the same Minipod. A random probe packet would then pass through fewer ports. Taking (Bigpod 1, Minipod 1, Cluster 1, Server 1, Eth 1) as an example, a random probe packet sent to (Bigpod 1, Minipod 1, Cluster 1, Server 2, Eth 1) in the same Minipod 1 can only obtain information about switch 0-1. If it is sent to (Bigpod 1, Minipod 2, Cluster 1, Server 1, Eth 1) in a different Minipod, the random probe packet can probe all switches and network ports along the entire path (switch 0-1, switch 1-1, switch 2-1, switch 1-3, switch 0-3). In addition, because random source ports yield different random probe packets, the resulting telemetry probes can cover more switches and ports, such as switch 0-1, switch 0-3, switch 1-1, switch 1-2, switch 1-3, switch 1-4, switch 2-1 and switch 2-2. Thus, random probe packets that cross Minipods carry more telemetry information and cover more network ports.
The network devices in the server cluster to be monitored are labeled based on the network topology information and grouped on that basis, which yields the unique number of each network card in the servers. This grouping also ensures that all servers and network cards participate, so that all network ports are covered.
Step S3022, screening the probe packet quintuples corresponding to the random probe packets based on the correspondence between probe packet quintuples and telemetry paths to obtain target probe packet quintuples, so as to determine the ports corresponding to the target probe packet quintuples.
Together, the target probe packet quintuples cover every port at least a preset number of times, where the preset number of times is the length of the longest telemetry path corresponding to the random probe packets.
The randomness of the source ports in the probe packet quintuples ensures that as many ports as possible are covered, and the grouping brings full coverage of all ports. To reduce redundant probe packet quintuples, the probe packet quintuples corresponding to the random probe packets are then screened. The screening principle is that the target probe packet quintuples obtained after screening cover every port at least the preset number of times, the preset number of times being the length of the longest telemetry path corresponding to the random probe packets. For example, if the length of the longest telemetry path is 5, the target probe packet quintuples together ensure that every port is covered at least 5 times. The length of the longest telemetry path is derived from the network topology information.
The probe packet quintuples may be screened by recording, each time a random probe packet is sent, the number of times each port has been covered, and discarding further random probe packets for a port once the preset number of times is reached. Alternatively, the probe packet quintuples may be screened with an optimization algorithm to obtain the target probe packet quintuples.
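The first, count-based screening option can be sketched as follows. This is one possible reading of the description, assuming that a quintuple is kept only if at least one port on its telemetry path has not yet been covered the preset number of times; the function name `screen_by_count` and the example data are hypothetical.

```python
from collections import Counter


def screen_by_count(probe_paths, preset_times):
    """Keep a quintuple only while some port on its path still needs coverage.

    probe_paths: list of (quintuple_id, ports_on_path) in sending order.
    """
    covered = Counter()
    kept = []
    for quintuple_id, ports in probe_paths:
        if any(covered[port] < preset_times for port in ports):
            kept.append(quintuple_id)
            covered.update(set(ports))  # each port on the path is covered once more
        # Otherwise every port is already covered enough: discard the quintuple.
    return kept


probe_paths = [
    ("q1", ["p1", "p2"]),
    ("q2", ["p1", "p2"]),
    ("q3", ["p1", "p2"]),
    ("q4", ["p2", "p3"]),
]
kept = screen_by_count(probe_paths, preset_times=2)
```

In this example the third quintuple is discarded because both of its ports have already been covered twice, while the fourth is kept because it reaches a port that still needs coverage.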
In some alternative embodiments, step S3022 includes:
Step b1, constructing a heap to be screened, wherein the heap to be screened comprises heap elements and the connection relations among them, the heap elements correspond one-to-one to the probe packet quintuples, the value of a heap element represents the number of valid ports that the corresponding probe packet quintuple can cover, and the number of valid ports is obtained based on the telemetry path.
Step b2, determining the probe packet quintuple corresponding to the top element of the heap to be screened as a target probe packet quintuple, and deleting the top element.
Step b3, making the remaining heap element with the largest value the new top element, so as to update the heap elements in the heap to be screened and their values.
Step b4, screening the probe packet quintuples based on the updated heap to be screened to determine the target probe packet quintuples.
The probe packet quintuples corresponding to the random probe packets are screened with a heap optimization algorithm to finally obtain the target probe packet quintuples. Note that the screening ultimately yields multiple target probe packet quintuples, not just one.
Specifically, a heap to be screened is constructed; the initial heap contains all heap elements and the connection relations among them. The heap elements correspond one-to-one to the probe packet quintuples, and the value of a heap element indicates the number of valid ports that the probe packet quintuple can cover. The ports that a probe packet quintuple can cover are obtained from its telemetry path; once a port has been covered the preset number of times, it becomes an invalid port and is no longer counted.
For example, fig. 5a shows the initial heap to be screened with the preset number of times set to 2 and five probe packet quintuples, quintuple 1 to quintuple 5, taking part in the screening. In the initial state, the count recorded for the ports that each quintuple can cover is 2. The top element of the initial heap is quintuple 1, and the number of valid ports it can cover is 2. In each round, the probe packet quintuple corresponding to the top element is selected as a target probe packet quintuple, and the remaining heap element with the largest value becomes the new top element. For example, the top element in fig. 5a is deleted and quintuple 5 becomes the new top element, giving the updated heap to be screened shown in fig. 5b. After the heap is updated, the number of times each port has been covered must also be updated so that screening can continue. Iterating in this way yields the heaps to be screened shown in fig. 5c and fig. 5d in turn.
In each round, the heap element with the largest number of valid ports is taken from the top of the heap, its quintuple is determined as a target probe packet quintuple, and it is removed from the heap. The port information related to the selected quintuple is then updated: a port that has been covered 2 times is marked as an invalid port, and the number of valid ports of every quintuple that passes through it is updated. This process is repeated until all ports have been covered 2 times.
This processing ensures that the target probe packet quintuples obtained by screening cover every port at least the preset number of times.
Screening the probe packet quintuples with a heap yields as few target probe packet quintuples as possible, which minimizes the cost of full-port network telemetry and greatly reduces both the number of target probe packet quintuples and the volume of data to be analyzed. High-value probe packet quintuples are first obtained through random network telemetry across Minipods, and as few target probe packet quintuples as possible are then selected by heap optimization to achieve telemetry coverage of all ports. This realizes lightweight measurement of all network ports and allows the location of a network fault to be pinpointed subsequently.
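The heap-based screening described in steps b1 to b4 can be approximated by the following Python sketch. It is an illustration under stated assumptions rather than the disclosed implementation: it uses the standard `heapq` module as a max-heap by negating values, refreshes stale heap values lazily instead of updating every affected element eagerly, and the function name `screen_quintuples` and the example paths are hypothetical.

```python
import heapq


def screen_quintuples(paths, preset_times):
    """Greedily pick quintuples until every port is covered `preset_times` times.

    paths: dict mapping a quintuple id to the list of ports on its telemetry path.
    Returns the selected quintuple ids in selection order.
    """
    # How many more times each port still needs to be covered.
    remaining = {port: preset_times for ports in paths.values() for port in ports}

    def valid_ports(quintuple_id):
        # A port is valid while it has not reached the preset coverage count.
        return sum(1 for port in set(paths[quintuple_id]) if remaining[port] > 0)

    # heapq is a min-heap, so values are negated to keep the largest on top.
    heap = [(-valid_ports(qid), qid) for qid in paths]
    heapq.heapify(heap)

    selected = []
    while heap and any(count > 0 for count in remaining.values()):
        negated_value, qid = heapq.heappop(heap)
        current = valid_ports(qid)
        if current == 0:
            continue  # covers only invalid ports, so it is redundant
        if current != -negated_value:
            heapq.heappush(heap, (-current, qid))  # stale value: reinsert and retry
            continue
        selected.append(qid)
        for port in set(paths[qid]):
            if remaining[port] > 0:
                remaining[port] -= 1
    return selected


paths = {
    "q1": ["p1", "p2", "p3"],
    "q2": ["p1", "p2"],
    "q3": ["p3"],
    "q4": ["p2", "p3"],
}
targets = screen_quintuples(paths, preset_times=2)
```

Because valid-port counts only ever decrease, a popped element whose stored value still matches its current count is guaranteed to be the true maximum, which is why the lazy refresh is sufficient here. The loop also stops when the heap is exhausted, which covers the case where some port cannot reach the preset count with the available quintuples.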
Step S3023, obtaining the target telemetry paths corresponding to the target probe packet quintuples by using the correspondence between probe packet quintuples and telemetry paths, so as to determine the target probe packet quintuple set corresponding to each port in the target telemetry paths.
The correspondence between probe packet quintuples and telemetry paths is obtained in step S3021. Once the target probe packet quintuples are obtained, the correspondence is queried to obtain the target telemetry path of each target probe packet quintuple, and hence all ports corresponding to that quintuple. One target probe packet quintuple corresponds to multiple ports, and different target probe packet quintuples may pass through the same port, so one port may correspond to multiple target probe packet quintuples. For convenience of description, the target probe packet quintuples corresponding to the same port are collected into that port's target probe packet quintuple set.
In some alternative embodiments, step S3023 includes:
Step c1, querying the correspondence based on the target probe packet quintuples to obtain the target telemetry paths, so as to obtain the target probe packet quintuples corresponding to each port in the target telemetry paths, wherein a target telemetry path comprises a plurality of ports.
Step c2, integrating the target probe packet quintuples corresponding to the same port to obtain the target probe packet quintuple set corresponding to each port.
The target telemetry path corresponding to a target probe packet quintuple is obtained based on the correspondence between probe packet quintuples and telemetry paths. Since a target telemetry path includes multiple ports, a target probe packet quintuple corresponds to multiple ports. Accordingly, the target probe packet quintuples corresponding to the same port are integrated to obtain the target probe packet quintuple set of each port. Here, integration simply means placing the target probe packet quintuples in one set for representation; no further processing is applied to them.
Because probe packet quintuples correspond to telemetry paths and a telemetry path comprises multiple ports, integrating the target probe packet quintuples corresponding to the same port yields the target probe packet quintuple set of each port, which is used for subsequent fault location.
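Steps c1 and c2 amount to inverting the quintuple-to-path mapping into a port-to-quintuples mapping. A minimal Python sketch follows; the function name `build_port_index`, the quintuple ids and the `(switch, port)` tuples are hypothetical and serve only as example data.

```python
from collections import defaultdict


def build_port_index(target_paths):
    """Invert {quintuple: [ports on its path]} into {port: {quintuples}}."""
    port_index = defaultdict(set)
    for quintuple_id, ports in target_paths.items():
        for port in ports:
            port_index[port].add(quintuple_id)
    return dict(port_index)


target_paths = {
    "q1": [(1, 1), (2, 1), (3, 1)],
    "q4": [(1, 1), (2, 1), (3, 1)],
    "q5": [(1, 1)],
}
port_index = build_port_index(target_paths)
```

The resulting sets are plain collections, in line with the statement that integration only gathers the quintuples and applies no further processing.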
Step S3024, generating a first type in-band network telemetry probe based on the target probe packet quintuple set corresponding to each port, and obtaining the network quality of each port in the network device.
Once the target probe packet quintuple set of each port is obtained, multiple first type in-band network telemetry probes are generated from these sets to obtain the network quality of each port. The first type in-band network telemetry probes correspond one-to-one to the target probe packet quintuples. Each time a first type in-band network telemetry probe is sent, the network quality of all ports on the corresponding target telemetry path is obtained and stored per port, so that the network quality of each port is obtained.
Step S303, generating a second type in-band network telemetry probe based on the flow quintuple information of the current service flow to obtain a service flow telemetry path.
Wherein the transmission frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe. For details, refer to step S203 in the embodiment shown in fig. 2, which is not repeated here.
Step S304, querying the port network quality based on the ports in the traffic telemetry path, and determining the network monitoring result of the server cluster to be monitored. For details, refer to step S204 in the embodiment shown in fig. 2, which is not repeated here.
According to this network monitoring method of the server cluster, the randomness of the source ports in the random probe packets ensures that the resulting probe packet quintuples cover as many ports as possible. On this basis, screening yields target probe packet quintuples that cover every port at least the preset number of times while removing redundant and repeated probe packet quintuples. This reduces the amount of first type in-band network telemetry probe data to be processed and analyzed, and thereby improves network monitoring efficiency.
In this embodiment, a network monitoring method of a server cluster is provided, which may be used in the servers described above. Fig. 6 is a flowchart of a network monitoring method of a server cluster according to an embodiment of the disclosure. As shown in fig. 6, the flow includes the following steps:
Step S601, obtaining network topology information of a server cluster to be monitored.
The network topology information is used for representing the connection relation between network devices in the server cluster to be monitored. For details, refer to step S201 in the embodiment shown in fig. 2, which is not repeated here.
Step S602, generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of a port in the network device.
For a specific implementation of obtaining network quality by using the first type in-band network telemetry probe, refer to step S302 in the embodiment shown in fig. 3, which is not repeated here.
In some embodiments, probes are sent and received through a preset network card provided in the server, where the preset network card may be an RDMA network card. After receiving the probe data corresponding to a probe, the RDMA network card stores the probe data directly in the memory without processing by the CPU. Based on this, the step S602 includes:
Step d1, identifying the packet header information of the first type in-band network telemetry probe through the preset network card, and determining the target network device corresponding to the first type in-band network telemetry probe.
Step d2, transmitting the first type in-band network telemetry probe to the target network device through the preset network card.
Step d3, receiving the network telemetry information of the first type in-band network telemetry probe through the preset network card, and parsing the network telemetry information to obtain the value of the target field.
Step d4, performing target processing on the network telemetry information through the preset network card based on the value of the target field, wherein the target processing comprises sending a feedback data packet to the target network device and storing the network telemetry information in the memory.
With reference to the network monitoring system shown in fig. 1, the INT agent acquires the first type in-band network telemetry probe from the INT controller according to the detection requirement and passes it to the RDMA network card. The RDMA network card identifies the data packet, determines the target network device, and sends the first type in-band network telemetry probe to the target network device. Correspondingly, the RDMA network card receives the probe data of the first type in-band network telemetry probe and parses it to obtain the value of the target field. Different values of the target field indicate different target processing for the probe data, so after the value is parsed, the corresponding target processing is performed. The target processing includes sending a feedback data packet (ACK) to the target network device and storing the probe data in the memory. In this method, all switches are set as transit nodes and the network card of the server is regarded as the tail-hop node; acquiring the telemetry data through the RDMA network card reduces the soft interrupts that network telemetry imposes on the server CPU and provides more accurate telemetry data.
In some embodiments, if the value of the target field indicates that the telemetry data must be fed back in real time, the RDMA network card repackages the telemetry data that needs real-time sensing and feedback, such as that used for congestion control and routing: it swaps the source IP address and the destination IP address and encapsulates the telemetry data into the payload to obtain the feedback data packet, which it sends back to the target network device. If the value of the target field indicates that the network telemetry information is to be stored directly in the memory, the RDMA network card writes the network telemetry information directly into a designated region of server memory, and the server periodically reports it to a telemetry database.
Offloading the telemetry data processing originally performed by the switch to the RDMA network card reduces soft interrupts on the server and thus CPU usage. The switch no longer needs to report data actively, which reduces southbound overhead, and the inaccurate tail-hop latency of the INT technology is also addressed.
The probe is sent, received and parsed through the preset network card. By identifying the network telemetry information, probe data packets can be passed directly to the corresponding network application, and network monitoring information is written directly into the designated memory through the RDMA technology, which reduces CPU soft interrupts and improves the accuracy of the telemetry data.
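The target-field handling described above can be modeled in software as a simple dispatch. The sketch below is purely illustrative: it does not use any real RDMA or network card API, and the field values `FEEDBACK` and `STORE`, the `TelemetryPacket` type and the `NicModel` class are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

FEEDBACK = 1  # hypothetical value: telemetry must be fed back in real time
STORE = 0     # hypothetical value: telemetry is written straight to memory


@dataclass
class TelemetryPacket:
    src_ip: str
    dst_ip: str
    target_field: int
    telemetry: bytes


@dataclass
class NicModel:
    """Software stand-in for the network card; lists replace memory and the wire."""
    memory: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def handle(self, packet):
        if packet.target_field == FEEDBACK:
            # Swap source and destination and carry the telemetry as payload.
            reply = TelemetryPacket(packet.dst_ip, packet.src_ip,
                                    packet.target_field, packet.telemetry)
            self.sent.append(reply)
        elif packet.target_field == STORE:
            # Write the telemetry to the designated memory region.
            self.memory.append(packet.telemetry)
        else:
            raise ValueError(f"unknown target field: {packet.target_field}")


nic = NicModel()
nic.handle(TelemetryPacket("10.0.0.1", "10.0.0.2", FEEDBACK, b"congestion"))
nic.handle(TelemetryPacket("10.0.0.1", "10.0.0.2", STORE, b"latency"))
```

In a real deployment this branching would run on the programmable network card rather than on the host CPU, which is the point of the offloading described in the text.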
Step S603, generating a second type in-band network telemetry probe based on the flow quintuple information of the current traffic to obtain a fault path.
Wherein the transmission frequency of the first type in-band network telemetry probe is higher than the transmission frequency of the second type in-band network telemetry probe.
Because the second type in-band network telemetry probe can acquire the traffic telemetry paths of all traffic in real time, it can acquire the fault path corresponding to abnormal traffic when traffic becomes abnormal. Abnormal traffic may be detected through traffic monitoring or in other ways; this is not limited here and is set according to actual requirements.
It should be noted that the abnormality of the current traffic may be caused by a port failure or by a failure on the host side. If analysis determines that no port has failed, the abnormality of the current traffic is determined to be caused by a host-side problem.
Specifically, the step S603 includes:
Step S6031, if the current service flow fluctuates, obtaining the flow quintuple information of the current service flow.
After a network fluctuation of the current service flow is detected, the flow quintuple information of the current service flow is obtained in order to obtain the path information of the service flow before the fault occurred. By default, the path information of a given service flow does not change.
Step S6032, generating a second type in-band network telemetry probe based on the flow quintuple information to obtain a traffic telemetry path and determining the traffic telemetry path as a fault path.
Wherein the failure path includes a plurality of ports.
When a network fluctuation occurs, the path information of the current traffic changes. Such faults are referred to as path change events, and the path taken before the path change event is acquired as the fault path. The fault path is obtained through a second type in-band network telemetry probe generated from the flow quintuple information of the current service flow.
For example, as shown in fig. 7, when network telemetry is performed on traffic, a second type in-band network telemetry probe consistent with the flow quintuple information of the traffic is sent first. As it travels through the network, the second type in-band network telemetry probe collects the traffic telemetry path of that traffic (switch 1 in port, switch 1 out port, switch 2 in port, switch 2 out port, switch 4 in port, switch 4 out port). The network quality of the traffic is obtained by querying, for each port, the network quality obtained by the first type in-band network telemetry probes and then combining the states of all ports.
Step S604, inquiring port network quality based on ports in the traffic telemetry path, and determining fault positioning results of the server cluster to be monitored.
Specifically, the step S604 includes:
Step S6041, parsing the fault path to obtain the ports in the fault path and the fault timestamp.
The fault path includes a plurality of ports, and the time point corresponding to the fault path is called the fault timestamp. Since the transmission frequency of the second type in-band network telemetry probe is lower than that of the first type in-band network telemetry probe, a fault query period is formed around the fault timestamp to ensure the reliability of the fault information associated with it.
Step S6042, determining a fault query period based on the fault timestamp.
The period spanning a preset duration before and after the fault timestamp is determined as the fault query period. For example, the period from 5 s before to 5 s after the fault timestamp is determined as the fault query period.
Step S6043, querying the network quality of each port in the fault path within the fault query period.
Because the first type in-band network telemetry probes are sent at a higher frequency, multiple network quality measurements are available for each port within the fault query period; these are aggregated to obtain the network quality within the fault query period.
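A small Python sketch can make steps S6042 and S6043 concrete. The aggregation method is not specified in the text, so the sketch assumes a hypothetical per-sample latency metric and aggregates by sample count and maximum; the function names, the sample layout `(timestamp, port, latency_us)` and the values are all illustrative.

```python
def fault_query_window(fault_ts, margin_s=5.0):
    """Return the (start, end) period around the fault timestamp."""
    return (fault_ts - margin_s, fault_ts + margin_s)


def aggregate_port_quality(samples, port, window):
    """Aggregate the samples of one port that fall inside the query window.

    samples: iterable of (timestamp, port, latency_us) tuples.
    """
    start, end = window
    values = [latency for ts, p, latency in samples
              if p == port and start <= ts <= end]
    if not values:
        return None  # no first type probe covered this port in the window
    return {"count": len(values), "max_latency_us": max(values)}


samples = [
    (95.0, "p1", 10),
    (98.0, "p1", 30),
    (104.0, "p1", 20),
    (106.0, "p1", 99),  # outside the window, ignored
    (100.0, "p2", 5),
]
window = fault_query_window(100.0)
quality = aggregate_port_quality(samples, "p1", window)
```

Widening the timestamp into a window is what lets the sparser second type probe result be matched against the denser first type probe measurements.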
Step S6044, determining a fault location result of the server cluster to be monitored based on the queried network quality.
The inquired network quality is used for representing the network state of each port, and if the network quality represents network abnormality, the port abnormality is determined; otherwise, it can be regarded as a host-side anomaly. Further, after determining that the port is abnormal, the fault type is determined in combination with detailed information of the network quality so as to obtain detailed fault information.
In some alternative embodiments, step S6044 includes:
Step e1, acquiring the network telemetry information detected by each first type in-band network telemetry probe corresponding to a port in the fault path, wherein the network quality is obtained by analyzing the network telemetry information.
Step e2, if the network telemetry information detected by all first type in-band network telemetry probes of a target port indicates that the network quality is faulty, determining that the target port is a fault port and determining the fault type based on the network telemetry information detected for all target probe packet quintuples.
As the above analysis shows, each port corresponds to multiple first type in-band network telemetry probes, each of which obtains its own network telemetry information. Thus, for each port in the fault path, the first type in-band network telemetry probes of that port yield multiple pieces of network telemetry information, which are further processed to obtain the network quality. If the network telemetry information of every first type in-band network telemetry probe of a port indicates that the network quality is faulty, the port is determined to have failed and is taken as the target port. The fault type is then determined based on the network telemetry information detected for all target probe packet quintuples of the target port.
Because the same port corresponds to multiple target probe packet quintuples, it is probed by multiple first type in-band network telemetry probes. A port is determined to be a fault port only when the network telemetry information of all first type in-band network telemetry probes of that port indicates that the network quality is faulty, which further ensures the accuracy of fault location.
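The decision rule in steps e1 and e2 can be sketched as follows, assuming each probe result has already been reduced to a boolean "faulty" flag. The function name `find_fault_ports`, the port names and the quintuple ids are hypothetical example data.

```python
def find_fault_ports(fault_path, port_index, probe_faulty):
    """Return the ports for which every associated probe reports a fault.

    fault_path: ordered list of ports on the fault path.
    port_index: dict mapping a port to its set of target quintuple ids.
    probe_faulty: dict mapping a quintuple id to True if its probe saw a fault.
    """
    fault_ports = []
    for port in fault_path:
        quintuples = port_index.get(port, set())
        # A port with no probes cannot be judged faulty by this rule.
        if quintuples and all(probe_faulty[q] for q in quintuples):
            fault_ports.append(port)
    return fault_ports


fault_path = ["A", "B"]
port_index = {"A": {"q1", "q2"}, "B": {"q1", "q3"}}
probe_faulty = {"q1": True, "q2": True, "q3": False}
fault_ports = find_fault_ports(fault_path, port_index, probe_faulty)
```

If the returned list is empty, no port on the fault path is judged faulty, which corresponds to the case the text treats as a host-side anomaly.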
According to the network monitoring method for the server cluster, when network fluctuation occurs, the second type in-band network telemetry probe is generated based on the flow quintuple information of the current service flow, fault paths are detected, and accurate fault paths are obtained. Because the transmission frequency difference exists between the first type in-band network telemetry probe and the second type in-band network telemetry probe, the fault time stamp is extended to the inquiry time period, and the inquiry of the network quality is performed by utilizing the inquiry time period, so that the accuracy of the obtained network quality is ensured.
In some optional embodiments, the network monitoring method of a server cluster further includes:
Step f1, storing the network telemetry information of each port obtained by the first type in-band network telemetry probe in a first area of a target storage location, so that the network quality can be obtained by analyzing the network telemetry information.
Step f2, storing the traffic telemetry path obtained by the second type in-band network telemetry probe in a second area of the target storage location.
The first type in-band network telemetry probe obtains the network telemetry information of each port; the network quality of the port is obtained by further analyzing this information and is stored in the first area of the target storage location. That is, the first area stores both the acquired network telemetry information and the network quality obtained by analysis. The traffic telemetry path obtained by the second type in-band network telemetry probe is stored in the second area of the target storage location. For example, the target storage location is the INT database shown in fig. 1, which is accordingly divided into two areas: a first area for storing network data and a second area for storing path data.
Because the network telemetry information obtained by the first type in-band network telemetry probe and the traffic telemetry path obtained by the second type in-band network telemetry probe are processed differently, they are stored in different areas, which ensures that each processing mode can proceed normally.
According to this network monitoring method of the server cluster, a high-performance, lightweight in-band network telemetry framework is provided based on the RDMA network. A high-performance data packet identification scheme is designed based on the programmability of the RDMA network card, and telemetry data processing is offloaded to the RDMA network card, which makes INT-based monitoring compatible with INT-based congestion control. Probe data is written directly into the memory by means of the RDMA technology, and real-time, fine-grained measurement of all flows and all ports is finally achieved through data analysis. Specifically, the target probe packet quintuple set uniquely corresponding to each port is obtained, and the corresponding set of first type in-band network telemetry probes is obtained from it. The ports traversed by the service flow when the network fault occurs are then obtained through service flow measurement. Finally, the network fault and the network performance bottleneck of the service flow are located quickly by combining this port information with the detection results of the first type in-band network telemetry probes corresponding to each port. The scheme reduces the total amount of data that must be analyzed to locate network faults and can support real-time fault location.
As a specific application example of the embodiments of the present disclosure, the current traffic telemetry path is port <1,1>, port <2,1>, port <3,1>. The target probe packet five-tuple set corresponding to port <1,1> is: five-tuple 1, five-tuple 4 and five-tuple 5; the set corresponding to port <2,1> is: five-tuple 1, five-tuple 3 and five-tuple 4; and the set corresponding to port <3,1> is: five-tuple 1, five-tuple 4 and five-tuple 6.
If network fluctuation occurs, a fault path is identified through the second type in-band network telemetry probe, and the ports on the fault path are obtained. For each such port, the corresponding target detection packet five-tuple set is obtained, and the network telemetry information detected by the corresponding first type in-band network telemetry probes is retrieved. If a fault exists at that moment, the network telemetry information of the probe corresponding to the nth five-tuple of the faulty port indicates that the network quality is faulty. In this example, the detection results of the probes corresponding to the five-tuples are: five-tuple 1, fault; five-tuple 2, non-fault; five-tuple 3, non-fault; five-tuple 4, fault; five-tuple 5, fault; five-tuple 6, non-fault. The fault status of each port is then determined as follows:
Port <1,1>: the detection results of the probes corresponding to five-tuple 1, five-tuple 4 and five-tuple 5 are all faults;
Port <2,1>: the detection result of the probe corresponding to five-tuple 1 is a fault, and the detection results of the probes corresponding to five-tuple 3 and five-tuple 4 are non-faults;
Port <3,1>: the detection results of the probes corresponding to five-tuple 1, five-tuple 4 and five-tuple 6 are all non-faults.
Since the detection results of all probes corresponding to the five-tuples of port <1,1> are faults, port <1,1> is the faulty port.
If the detection results of the probes corresponding to all five-tuples in the five-tuple set of any port are faults, that port is located as the faulty port. Otherwise, the problem is judged to be a host-side problem. The fault type can then be determined as at least one of: network congestion, network jitter, network packet loss, link failure, and switch failure.
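The decision rule of this example can be written out as a short sketch. The per-port results are taken from the per-port breakdown given above for ports <1,1>, <2,1> and <3,1>; the function name `locate_fault` and the shape of the returned dictionary are illustrative assumptions, not part of the disclosure.

```python
# Per-port detection results from the example above:
# port -> {five-tuple number: True if the probe reports a fault}.
port_probe_results = {
    (1, 1): {1: True, 4: True, 5: True},
    (2, 1): {1: True, 3: False, 4: False},
    (3, 1): {1: False, 4: False, 6: False},
}


def locate_fault(fault_path, port_probe_results):
    """Return the ports whose probes ALL report a fault; otherwise blame the host side."""
    faulty_ports = [
        port
        for port in fault_path
        # A port with no probe results is never declared faulty.
        if port_probe_results.get(port)
        and all(port_probe_results[port].values())
    ]
    if faulty_ports:
        return {"side": "network", "ports": faulty_ports}
    return {"side": "host", "ports": []}


result = locate_fault([(1, 1), (2, 1), (3, 1)], port_probe_results)
print(result)
```

Only port <1,1> satisfies the all-probes-faulty condition, which matches the conclusion of the example. If no port on the fault path satisfied it, the sketch would report a host-side problem.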
The present embodiment also provides a network monitoring device for a server cluster, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a network monitoring device of a server cluster, as shown in fig. 8, including:
A network topology information obtaining module 801, configured to obtain network topology information of a server cluster to be monitored, where the network topology information characterizes the connection relationships between network devices in the server cluster to be monitored.
A first type in-band network telemetry probe generation module 802, configured to generate a first type in-band network telemetry probe based on the network topology information, so as to obtain the network quality of ports in the network devices.
A second type in-band network telemetry probe generation module 803, configured to generate a second type in-band network telemetry probe based on the flow quintuple information of the current service flow, so as to obtain a service flow telemetry path, where the transmission frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe.
A network quality query module 804, configured to query port network quality based on the ports in the service flow telemetry path, and to determine a network monitoring result of the server cluster to be monitored.
In some alternative embodiments, first type in-band network telemetry probe generation module 802 includes:
A detection packet quintuple generation unit, configured to generate random detection packets based on the network topology information, so as to obtain the correspondence between detection packet quintuples and telemetry paths, where the source port in each random detection packet is randomly generated.
A detection packet quintuple screening unit, configured to screen the detection packet quintuples corresponding to the random detection packets based on the correspondence between detection packet quintuples and telemetry paths, to obtain target detection packet quintuples and thereby determine the ports corresponding to each target detection packet quintuple, where the target detection packet quintuples together cover every port at least a preset number of times, and the preset number of times is the length of the longest telemetry path corresponding to the random detection packets.
A target telemetry path determining unit, configured to obtain the target telemetry path corresponding to each target detection packet quintuple by using the correspondence between detection packet quintuples and telemetry paths, so as to determine the target detection packet quintuple set corresponding to each port in the target telemetry path.
A port network quality determining unit, configured to generate the first type in-band network telemetry probes based on the target detection packet quintuple set corresponding to each port, to obtain the network quality of each port in the network devices.
In some alternative embodiments, the probe packet five-tuple generating unit includes:
A labeling subunit, configured to label the network devices in the server cluster to be monitored based on the network topology information.
A grouping subunit, configured to group all network devices based on the network cluster in which each network device is located, so as to obtain a unique number for each network card in the servers of the server cluster to be monitored, where the second cluster labels corresponding to network devices in the same group are different and the other cluster labels are the same, and the second cluster labels are labels of clusters that are other than the top network device and have a connection relationship.
A random detection packet generation subunit, configured to generate, based on the unique numbers, random detection packets to be transmitted between the network cards within each group, so as to obtain the correspondence between detection packet quintuples and telemetry paths.
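The generation of random detection packets within each group can be sketched as follows. This is a toy model under stated assumptions: the group contents, the destination port constant, and the function `toy_trace_path` (which picks an uplink from the source port to imitate hash-based multipath forwarding) are all hypothetical. In the disclosed scheme the telemetry path is reported by the probes themselves, not computed locally.

```python
import itertools
import random

# Assumption: a fixed UDP destination port for probes; the source does not name one.
PROBE_DST_PORT = 4791


def generate_random_probes(groups, probes_per_pair, rng):
    """Create five-tuples with random source ports for every NIC pair inside a group."""
    probes = []
    for nics in groups.values():
        for src, dst in itertools.permutations(nics, 2):
            for _ in range(probes_per_pair):
                src_port = rng.randint(1024, 65535)
                probes.append((src, dst, src_port, PROBE_DST_PORT, "UDP"))
    return probes


def toy_trace_path(five_tuple, uplinks=2):
    """Hypothetical path model: the source port selects one of `uplinks` spines."""
    src, dst, src_port, _dst_port, _proto = five_tuple
    spine = src_port % uplinks
    return ((f"leaf-{src}", spine), (f"spine-{spine}", 0), (f"leaf-{dst}", 0))


def build_correspondence(probes, trace_path):
    """Map each detection packet five-tuple to the telemetry path it traverses."""
    return {probe: trace_path(probe) for probe in probes}


rng = random.Random(7)  # seeded only to make the sketch repeatable
groups = {"group-0": ["nic-1", "nic-2"], "group-1": ["nic-3", "nic-4", "nic-5"]}
probes = generate_random_probes(groups, probes_per_pair=4, rng=rng)
correspondence = build_correspondence(probes, toy_trace_path)
```

Varying only the source port is what lets different probes between the same pair of network cards land on different paths, which is why the correspondence table has to be learned before probes are screened.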
In some alternative embodiments, the probe packet five-tuple screening unit comprises:
A to-be-screened heap construction subunit, configured to construct a heap to be screened, where the heap comprises heap elements and the connection relationships among them, the heap elements correspond one-to-one with the detection packet quintuples, the value of a heap element represents the number of effective ports that the corresponding detection packet quintuple can cover, and the number of effective ports is obtained based on the telemetry path.
A first target detection packet quintuple determining subunit, configured to determine the detection packet quintuple corresponding to the top element of the heap to be screened as a target detection packet quintuple, and to delete the top element.
A to-be-screened heap updating subunit, configured to make the remaining heap element with the largest value the new top element, so as to update the heap elements in the heap to be screened and their values.
A second target detection packet quintuple determining subunit, configured to continue screening the detection packet quintuples based on the updated heap to be screened, to determine the target detection packet quintuples.
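The heap-based screening can be sketched as a greedy selection over a max-heap. The sketch below is one possible reading, not the disclosed implementation: it treats a port as "effective" while it still needs coverage, and it refreshes stale heap values lazily when an element is popped rather than updating every element after each selection. The function name `screen_probes` and the `required` parameter are assumptions; `required` defaults to the length of the longest telemetry path, as the preset number of times is defined above.

```python
import heapq


def screen_probes(paths, required=None):
    """Greedily pick five-tuples until every port is covered `required` times.

    paths: dict mapping a five-tuple to the sequence of ports on its telemetry path.
    Returns (chosen five-tuples in selection order, remaining coverage need per port).
    """
    if required is None:
        required = max(len(path) for path in paths.values())
    remaining = {port: required for path in paths.values() for port in path}

    def effective_ports(five_tuple):
        # Number of ports on this path that still need coverage.
        return sum(1 for port in set(paths[five_tuple]) if remaining[port] > 0)

    # heapq is a min-heap, so values are negated; the index breaks ties.
    heap = [(-effective_ports(ft), i, ft) for i, ft in enumerate(paths)]
    heapq.heapify(heap)

    chosen = []
    while heap and any(need > 0 for need in remaining.values()):
        stored, index, five_tuple = heapq.heappop(heap)
        current = effective_ports(five_tuple)
        if current == 0:
            continue  # this five-tuple no longer adds coverage
        if current < -stored:
            # Stale value: reinsert with the refreshed count and try again.
            heapq.heappush(heap, (-current, index, five_tuple))
            continue
        chosen.append(five_tuple)
        for port in set(paths[five_tuple]):
            if remaining[port] > 0:
                remaining[port] -= 1
    return chosen, remaining


paths = {
    "ft1": ["A", "B", "C"],
    "ft2": ["A", "D"],
    "ft3": ["B", "D"],
    "ft4": ["C"],
}
chosen, remaining = screen_probes(paths, required=1)
print(chosen)
```

With `required=1`, the sketch selects `ft1` first because it covers three ports, then `ft2` to cover the one port still missing. Because a five-tuple's count of effective ports can only shrink as others are chosen, a popped element whose refreshed value equals its stored value is guaranteed to be the current maximum.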
In some alternative embodiments, the target telemetry path determination unit includes:
A correspondence querying subunit, configured to query the correspondence based on the target detection packet quintuples to obtain the target telemetry paths, and thereby obtain the target detection packet quintuple corresponding to each port in each target telemetry path, where a target telemetry path comprises a plurality of ports.
An integration subunit, configured to merge the target detection packet quintuples corresponding to the same port, to obtain the target detection packet quintuple set corresponding to each port.
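The query-and-integrate step amounts to inverting the five-tuple-to-path correspondence into a port-to-five-tuple-set mapping. In the sketch below, the function name `build_port_probe_sets` and the paths in `correspondence` are illustrative assumptions; the paths were chosen so that the resulting sets match the three ports of the application example given earlier.

```python
from collections import defaultdict


def build_port_probe_sets(target_five_tuples, correspondence):
    """Invert five-tuple -> path into port -> set of target five-tuples."""
    port_sets = defaultdict(set)
    for five_tuple in target_five_tuples:
        for port in correspondence[five_tuple]:
            port_sets[port].add(five_tuple)
    return dict(port_sets)


# Hypothetical paths for five-tuples 1, 3, 4, 5 and 6 (numbers stand in for full five-tuples).
correspondence = {
    1: [(1, 1), (2, 1), (3, 1)],
    3: [(2, 1)],
    4: [(1, 1), (2, 1), (3, 1)],
    5: [(1, 1)],
    6: [(3, 1)],
}
port_probe_sets = build_port_probe_sets([1, 3, 4, 5, 6], correspondence)
print(port_probe_sets)
```

Once this mapping exists, looking up the probes relevant to a port on a fault path is a single dictionary access.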
In some alternative embodiments, the second type in-band network telemetry probe generation module 803 includes:
A flow detection packet quintuple acquisition unit, configured to acquire the flow detection packet quintuple of the current service flow if the current service flow experiences network fluctuation.
A fault path acquisition unit, configured to generate a second type in-band network telemetry probe based on the flow detection packet quintuple, so as to obtain the service flow telemetry path and determine it as the fault path, where the fault path comprises a plurality of ports.
In some alternative embodiments, the network quality query module 804 includes:
A fault timestamp determining unit, configured to analyze the fault path to obtain the ports in the fault path and a fault timestamp.
A fault query time period determining unit, configured to determine a fault query time period based on the fault timestamp.
A network quality query unit, configured to query the network quality of each port in the fault path within the fault query time period.
A fault locating unit, configured to determine a fault locating result of the server cluster to be monitored based on the queried network quality.
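The time-window query performed by these units can be sketched as follows. The disclosure does not state how the fault query time period is derived from the fault timestamp, so the symmetric-style margins `margin_before` and `margin_after`, the function names, and the record layout of `first_area` are assumptions made for illustration.

```python
def fault_query_period(fault_timestamp, margin_before=2.0, margin_after=1.0):
    """Assumed rule: a window around the fault timestamp, in seconds."""
    return (fault_timestamp - margin_before, fault_timestamp + margin_after)


def query_network_quality(first_area, fault_path, period):
    """Return, per port on the fault path, the quality records inside the window."""
    start, end = period
    return {
        port: [
            (ts, quality)
            for ts, quality in first_area.get(port, [])
            if start <= ts <= end
        ]
        for port in fault_path
    }


# Hypothetical first-area contents: port -> list of (timestamp, quality).
first_area = {
    (1, 1): [(97.0, "non-fault"), (99.5, "fault"), (100.2, "fault"), (104.0, "non-fault")],
    (2, 1): [(99.0, "non-fault"), (100.5, "non-fault")],
}
period = fault_query_period(100.0)
hits = query_network_quality(first_area, [(1, 1), (2, 1), (3, 1)], period)
print(hits)
```

Restricting the query to the ports on the fault path and to a short window is what keeps the amount of data analyzed per fault small.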
In some alternative embodiments, the fault locating unit includes:
A network telemetry information acquisition subunit, configured to acquire the network telemetry information detected by each first type in-band network telemetry probe corresponding to the ports in the fault path, where the network quality is obtained by analyzing the network telemetry information.
A fault type determining subunit, configured to, if the network telemetry information detected by all first type in-band network telemetry probes of a target port indicates that the network quality is faulty, determine that the target port is a faulty port and determine the fault type based on that network telemetry information.
In some alternative embodiments, first type in-band network telemetry probe generation module 802 further comprises:
A packet header information identification unit, configured to identify, through a preset network card, the packet header information of the first type in-band network telemetry probe, and to determine the target network device corresponding to the first type in-band network telemetry probe.
A first type in-band network telemetry probe sending unit, configured to send the first type in-band network telemetry probe to the target network device through the preset network card.
A detection data receiving unit, configured to receive the network telemetry information of the first type in-band network telemetry probe through the preset network card, and to parse the network telemetry information to obtain the value of a target field.
A target processing unit, configured to perform target processing on the network telemetry information through the preset network card based on the value of the target field, where the target processing comprises sending a feedback data packet to the target network device and storing the detection data into memory.
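The dispatch on the target field can be sketched in host-side Python, although in the disclosed scheme this processing runs on the network card. Everything specific here is hypothetical: the disclosure does not give the target field's name or values, nor which value leads to which action. The sketch assumes one value marks monitoring probes (stored to memory) and another marks congestion control packets (answered with a feedback packet).

```python
from enum import Enum


class ProbeKind(Enum):
    """Hypothetical values of the target field."""

    MONITORING = 1
    CONGESTION_CONTROL = 2


def process_telemetry(packet, memory, send_feedback):
    """Choose the target processing from the packet's target field."""
    kind = ProbeKind(packet["target_field"])
    if kind is ProbeKind.MONITORING:
        # Monitoring data is stored for later analysis.
        memory.append(packet["telemetry"])
        return "stored"
    # Congestion control data is returned to the peer as feedback.
    send_feedback(packet["src"], packet["telemetry"])
    return "feedback"


memory = []
sent = []
outcome_a = process_telemetry(
    {"target_field": 1, "src": "nic-1", "telemetry": {"queue_depth": 3}},
    memory,
    lambda dst, data: sent.append((dst, data)),
)
outcome_b = process_telemetry(
    {"target_field": 2, "src": "nic-2", "telemetry": {"queue_depth": 9}},
    memory,
    lambda dst, data: sent.append((dst, data)),
)
```

Branching on a single header field is one way a single receive path could serve both INT-based monitoring and INT-based congestion control.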
In some optional embodiments, the network monitoring device of the server cluster further includes:
A first storage module, configured to store the network telemetry information of each port obtained by the first type in-band network telemetry probe in a first area of the target storage location, so that the network quality can be obtained by analyzing the network telemetry information.
A second storage module, configured to store the service flow telemetry path obtained by the second type in-band network telemetry probe in a second area of the target storage location.
The network monitoring device of the server cluster in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functions.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiments of the present disclosure also provide an electronic device that includes the network monitoring device of the server cluster shown in fig. 8.
Referring to fig. 9, which is a schematic structural diagram of an electronic device according to an alternative embodiment of the disclosure, the electronic device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 9.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the methods shown in the above embodiments.
The memory 20 may include a program storage area and a data storage area. The program storage area may store an operating system and at least one application program required for a function; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The electronic device also includes a communication interface 30 for the electronic device to communicate with other devices or communication networks.
The embodiments of the present disclosure also provide a computer-readable storage medium. The methods according to the embodiments of the present disclosure described above may be implemented in hardware or firmware, or as computer code that is recorded on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored on a local storage medium, so that the methods described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; the storage medium may also comprise a combination of the above kinds of memory. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.

Claims (13)

1. A method for monitoring a network of a server cluster, the method comprising:
acquiring network topology information of a server cluster to be monitored, wherein the network topology information is used for representing the connection relation between network devices in the server cluster to be monitored;
generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of a port in the network device;
generating a second type in-band network telemetry probe based on flow quintuple information of the current service flow to obtain a service flow telemetry path, wherein the transmission frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe;
and inquiring port network quality based on the ports in the traffic telemetry path, and determining a network monitoring result of the server cluster to be monitored.
2. The method of claim 1, wherein generating a first type of in-band network telemetry probe based on the network topology information to obtain network quality for a port in the network device comprises:
generating a random detection packet based on the network topology information to obtain a corresponding relation between a detection packet quintuple and a telemetry path, wherein a source port in the random detection packet is randomly generated;
screening the detection packet quintuples corresponding to the random detection packets based on the corresponding relation between the detection packet quintuple and the telemetry path to obtain target detection packet quintuples, so as to determine the ports corresponding to the target detection packet quintuples, wherein the target detection packet quintuples together cover all the ports at least a preset number of times, and the preset number of times is the length of the longest telemetry path corresponding to the random detection packets;
obtaining a target telemetry path corresponding to the target detection packet quintuple by utilizing the corresponding relation between the detection packet quintuple and the telemetry path so as to determine a target detection packet quintuple set corresponding to each port in the target telemetry path;
and generating the first type in-band network telemetry probe based on the target detection packet quintuple set corresponding to each port, and obtaining the network quality of each port in the network equipment.
3. The method of claim 2, wherein generating random probe packets based on the network topology information to obtain a probe packet quintuple to telemetry path correspondence comprises:
labeling network equipment in the server cluster to be monitored based on the network topology information;
grouping all the network devices based on the network clusters in which the network devices are located, to obtain unique numbers of the network cards in the servers in the server cluster to be monitored, wherein the second cluster labels corresponding to the network devices in the same group are different and the other cluster labels are the same, and the second cluster labels are labels of clusters which are other than the top network device and have a connection relation;
and generating, based on the unique numbers, random probe packets for transmission among the network cards inside each group, so as to obtain the corresponding relation between the probe packet quintuple and the telemetry path.
4. The method according to claim 2, wherein the screening the probe packet quintuple corresponding to the random probe packet based on the correspondence between the probe packet quintuple and the telemetry path to obtain a target probe packet quintuple includes:
constructing a heap to be screened, wherein the heap to be screened comprises heap elements and connection relations among the heap elements, the heap elements in the heap to be screened are in one-to-one correspondence with the detection packet quintuples, the value of a heap element represents the number of effective ports which can be covered by the corresponding detection packet quintuple, and the number of the effective ports is obtained based on the telemetry path;
determining the detection packet quintuple corresponding to the top element of the heap to be screened as a target detection packet quintuple, and deleting the top element;
updating the remaining heap element with the largest value into a new heap top element, so as to update the heap elements in the heap to be screened and the values of the heap elements;
and screening the detection packet quintuples based on the updated heap to be screened to determine the target detection packet quintuples.
5. The method according to claim 2, wherein the obtaining the target telemetry path corresponding to the target probe packet quintuple by using the correspondence between the probe packet quintuple and the telemetry path to determine the target probe packet quintuple set corresponding to each port in the target telemetry path includes:
inquiring the corresponding relation based on the target detection packet quintuple to obtain the target telemetry path so as to obtain target detection packet quintuple corresponding to each port in the target telemetry path, wherein the target telemetry path comprises a plurality of ports;
and integrating the target detection packet quintuples corresponding to the same port to obtain the target detection packet quintuple set corresponding to each port.
6. The method of claim 1, wherein generating a second type in-band network telemetry probe based on traffic quintuple information of current traffic to obtain a traffic telemetry path comprises:
if network fluctuation occurs in the current service flow, acquiring flow quintuple information of the current service flow;
and generating a second type in-band network telemetry probe based on the traffic quintuple information to obtain the traffic telemetry path and determine the traffic telemetry path as a fault path, wherein the fault path comprises a plurality of ports.
7. The method of claim 6, wherein the querying port network quality based on the ports in the traffic telemetry path and determining the network monitoring result of the server cluster to be monitored comprises:
analyzing based on the fault path to obtain a port in the fault path and a fault time stamp;
determining a fault inquiry time period based on the fault timestamp;
inquiring the network quality of each port in the fault path in the fault inquiry time period;
and determining a fault positioning result of the server cluster to be monitored based on the inquired network quality.
8. The method of claim 7, wherein determining a failure location result for the cluster of servers to be monitored based on the queried network quality comprises:
acquiring network telemetry information detected by each first type in-band network telemetry probe corresponding to a port in the fault path, wherein the network quality is obtained based on analysis of the network telemetry information;
if all the network telemetry information detected by the first type in-band network telemetry probes of the target port represent that the network quality has faults, determining that the target port is a fault port and determining the fault type based on the network telemetry information detected by all the first type in-band network telemetry probes.
9. The method of claim 1, wherein generating a first type of in-band network telemetry probe based on the network topology information to obtain network quality for a port in the network device further comprises:
identifying the packet header information of the first type in-band network telemetry probe through a preset network card, and determining target network equipment corresponding to the first type in-band network telemetry probe;
transmitting the first type in-band network telemetry probe to the target network device through the preset network card;
receiving network telemetry information of the first type in-band network telemetry probe through the preset network card, and analyzing the network telemetry information to obtain a value of a target field;
and performing target processing on the network telemetry information through the preset network card based on the value of the target field, wherein the target processing comprises the steps of sending a feedback data packet to the target network equipment and storing the network telemetry information into a memory.
10. The method according to claim 1, wherein the method further comprises:
storing network telemetry information of each port obtained by the first type in-band network telemetry probe to a first area of a target storage location to analyze based on the network telemetry information to obtain the network quality;
and storing the service flow telemetry path obtained by the second type in-band network telemetry probe to a second area of the target storage location.
11. A network monitoring device for a server cluster, the device comprising:
the network topology information acquisition module is used for acquiring network topology information of the server cluster to be monitored, and the network topology information is used for representing the connection relation between network devices in the server cluster to be monitored;
a first type in-band network telemetry probe generation module for generating a first type in-band network telemetry probe based on the network topology information to obtain network quality of a port in the network device;
the second type in-band network telemetry probe generation module is used for generating a second type in-band network telemetry probe based on flow quintuple information of current service flow so as to obtain a service flow telemetry path, and the sending frequency of the first type in-band network telemetry probe is higher than that of the second type in-band network telemetry probe;
and the network quality query module is used for querying the port network quality based on the port in the service flow telemetry path and determining the network monitoring result of the server cluster to be monitored.
12. An electronic device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the network monitoring method of the server cluster of any one of claims 1 to 10.
13. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the network monitoring method of a server cluster according to any of claims 1 to 10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311812679.XA CN117714325A (en) 2023-12-26 2023-12-26 Network monitoring method and device for server cluster, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117714325A true CN117714325A (en) 2024-03-15

Family

ID=90149716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311812679.XA Pending CN117714325A (en) 2023-12-26 2023-12-26 Network monitoring method and device for server cluster, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117714325A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination