CN114095341A - Network recovery method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN114095341A
Authority
CN
China
Prior art keywords
network
host
state
link
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111375328.8A
Other languages
Chinese (zh)
Inventor
周玉坤
王正
古亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202111375328.8A
Publication of CN114095341A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities

Abstract

The application provides a network recovery method and apparatus, a computer device, and a storage medium. The method comprises: acquiring isolation state information of a target network port in a first host, where the isolation state information represents the isolation state in which the target network port is isolated from the cluster network; acquiring monitoring data corresponding to the isolation state; and releasing the isolation of the target network port according to the monitoring data. By recovering network ports or links that have been handled, the application restores the distributed storage aggregation network to a normal state or automatically resolves network interruptions.

Description

Network recovery method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a network recovery method and apparatus, a computer device, and a storage medium.
Background
In public cloud and private cloud scenarios, a distributed storage system carries the core services of customers. To ensure the reliability of the distributed network, aggregation networks are typically employed, which provide redundancy and double the network transmission bandwidth. The aggregation configuration of a distributed storage network supports single-switch link aggregation and dual-switch link aggregation, and aims to meet the requirements of port redundancy and load balancing. In normal service scenarios, a single link in an aggregated link may become abnormal and degrade network performance, for example because a single network port is in a physical sub-health state (affected by voltage, current, temperature, and the like), an optical module is faulty, or an optical fiber has poor contact. A single abnormal link in the aggregation causes high IO latency, and the impact extends to every virtual machine with data in the cluster, so customer service performance degrades and problems such as stuttering occur.
To guarantee service performance, the sub-health network port or link is handled. After the sub-health network port is handled, service performance returns to normal, but only some of the host's network ports are in a normal communication state; compared with the case where all of the host's network ports communicate normally, the network is still in an abnormal state that can barely sustain service operation. How to restore the network to normal, and how to preserve service performance as far as possible if a normally communicating network port also becomes abnormal, are therefore urgent problems to be solved.
Disclosure of Invention
The present application aims to solve the technical problems in the prior art that network performance is not optimal after a network port has been handled, and that service performance cannot be guaranteed if a normal network port becomes abnormal. The application provides a network recovery method and apparatus, a computer device, and a storage medium, whose main purpose is to restore the distributed storage aggregation network to a normal state or to resolve network interruptions automatically.
In order to achieve the above object, the present application provides a network recovery method, including:
acquiring isolation state information of a target network port in a first host, wherein the isolation state information represents an isolation state of the target network port isolated from a cluster network;
acquiring monitoring data corresponding to the isolation state;
and releasing the isolation of the target network port according to the monitoring data.
In addition, to achieve the above object, the present application further provides a network recovery apparatus, including:
the state data acquisition module is used for acquiring isolation state information of a target network port in the first host, wherein the isolation state information represents an isolation state of the target network port isolated from the cluster network;
the monitoring module is used for acquiring monitoring data corresponding to the isolation state;
and the recovery module is used for removing the isolation of the target network port according to the monitoring data.
To achieve the above object, the present application further provides a computer device comprising a memory, a processor and computer readable instructions stored on the memory and executable on the processor, wherein the processor executes the computer readable instructions to perform the steps of the network recovery method according to any one of the preceding claims.
To achieve the above object, the present application further provides a computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the network recovery method according to any one of the preceding claims.
According to the network recovery method and apparatus, computer device, and storage medium of the present application, after a network port or link has been handled because of sub-health, the corresponding monitoring data is acquired according to the handling mode, that is, the isolation state of the handled network port, and whether a recovery condition is met is determined from the monitoring data. When the recovery condition is met, the isolation of the handled network port is released, and recovering the network port recovers the network of its links. The distributed storage aggregation network is thereby restored to a normal state and can even support double network bandwidth, which enhances system availability. In addition, the handled network port can be brought back online automatically after a normal network port is disconnected, which temporarily relieves the service problems caused by the network abnormality.
Drawings
Fig. 1 is a diagram illustrating an application scenario of a network recovery method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a network recovery method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a network handling method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of network port communication between hosts in a single-switch link aggregation mode according to an embodiment of the present application;
Fig. 5 is a schematic diagram of network port communication between hosts in a dual-switch link aggregation mode according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of a method for acquiring network analysis data according to an embodiment of the present application;
Fig. 7 is a schematic flowchart of a method for acquiring network analysis data according to another embodiment of the present application;
Fig. 8 is a block diagram illustrating a network recovery apparatus according to an embodiment of the present application;
Fig. 9 is a block diagram illustrating an internal structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The network recovery method provided by the present application can be applied to the application environment shown in Fig. 1, in which any one of host 121, host 122, and host 123 can serve as the first host and the other hosts serve as second hosts. At least one link exists between every two hosts. In addition to the two hosts, each link includes a switch (not shown in the figure).
Taking host 121 as the first host and host 122 and host 123 as second hosts as an example, host 121 sends a plurality of probe packets, at a preset probe frequency, to the corresponding second hosts in the cluster (such as host 122 and host 123) over at least one link between the first host and each second host, and obtains the probe data of each link of a first network port of host 121 within a probe period. From the packet loss rates and first delays of all links of the first network port within the probe period, host 121 determines the health state of the first network port and its network, as well as the link state of each link between the first network port and a second network port of a second host in the cluster. The health state is either a sub-health state or a normal state, and the link state is either a normal state or an abnormal state.
Host 121 obtains the network aggregation mode information of the cluster, which is used to determine the cluster's network aggregation mode; acquires the network analysis data and handling condition information corresponding to the plurality of first network ports of host 121 in that network aggregation mode; and determines from the network analysis data whether host 121 meets the handling condition indicated by the handling condition information. If the handling condition is met, host 121 isolates a target network port among its first network ports from the cluster network. The target network port is in the sub-health state, and the network analysis data is obtained from the preceding probe analysis.
Host 121 acquires isolation state information of its target network port, where the isolation state information represents the isolation state in which the target network port is isolated from the cluster network; acquires monitoring data corresponding to the isolation state; and releases the isolation of the target network port according to the monitoring data.
Fig. 2 is a schematic flowchart of a network recovery method according to an embodiment of the present application. Referring to Fig. 2, the method is described as applied to a host in Fig. 1. The network recovery method includes the following steps S100 to S300.
S100: Acquire isolation state information of the target network port in the first host, where the isolation state information represents the isolation state in which the target network port is isolated from the cluster network.
Specifically, the target network port is a first network port of the first host that has been isolated from the cluster network; it is also a sub-health network port, that is, a network port of the first host whose health state is the sub-health state. The isolation state information is determined by the mode in which the target network port was isolated. The isolation modes include deactivating the network port and removing the network port from its binding group.
S200: Acquire monitoring data corresponding to the isolation state.
Specifically, monitoring data is acquired for the target network port in its isolation state; the monitoring data differs depending on the isolation state of the target network port.
S300: Release the isolation of the target network port according to the monitoring data.
Specifically, whether the isolated target network port meets the recovery condition is determined from the monitoring data, and if it does, the isolation of the target network port is released. The mode of releasing the isolation corresponds to the isolation mode, or isolation state, of the target network port: different isolation states call for different release modes.
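Steps S100 to S300 can be pictured as a small dispatch routine. The following Python sketch is only an illustration under stated assumptions, not the applicant's implementation: the names `IsolationState`, `release_action`, and `recover`, and the action labels they return, are hypothetical and do not appear in the application.

```python
from enum import Enum
from typing import Optional


class IsolationState(Enum):
    """Hypothetical labels for the two isolation modes named in the description."""
    DEACTIVATED = "deactivated"  # the network port was taken down
    REMOVED = "removed"          # the network port was removed from the binding group


def release_action(state: IsolationState) -> str:
    """Return the release action that mirrors the isolation mode."""
    if state is IsolationState.DEACTIVATED:
        return "enable_port"
    if state is IsolationState.REMOVED:
        return "rejoin_binding_group"
    raise ValueError(f"unknown isolation state: {state!r}")


def recover(state: IsolationState, recovery_condition_met: bool) -> Optional[str]:
    """Given the isolation state (S100) and the outcome of evaluating the
    monitoring data (S200), pick the release action (S300).
    Returns None when the port should stay isolated."""
    if not recovery_condition_met:
        return None
    return release_action(state)
```

How the recovery condition is evaluated depends on the isolation state, as the embodiments below describe; here it is reduced to a boolean for brevity.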
Before a network port is isolated, the health state of the network corresponding to the network port and the link state of its links are probed and analyzed, and whether to isolate the network port is decided according to the health state and the link state. Meanwhile, if the network corresponding to the network port is in the sub-health state, logs and alarms can be output to prompt network operation and maintenance personnel to intervene. Log analysis and the reported cause help the operation and maintenance personnel locate the fault quickly and then resolve it, for example by replacing or reconfiguring a network card, an optical module, or a switch, or by bringing the network back online. An isolated network port may therefore return to normal after maintenance, and releasing the isolation of a network port that has returned to normal makes the network more robust and increases the network bandwidth it can support.
In this embodiment, after a network port or link has been handled because of sub-health, the corresponding monitoring data is acquired according to the handling mode, that is, the isolation state of the handled network port, and whether a recovery condition is met is determined from the monitoring data. When the recovery condition is met, the isolation of the handled network port is released, which recovers the network of its links. The distributed storage aggregation network is thereby restored to a normal state and can even support double network bandwidth, which enhances system availability. In addition, the handled network port can be brought back online automatically after a normal network port is disconnected, which temporarily relieves the service problems caused by the network abnormality.
In one embodiment, the isolation state includes a deactivated state and the monitoring data includes the deactivation duration of the target network port. Step S200 specifically includes: starting a timer once the isolated target network port has been deactivated, and acquiring the deactivation duration.
Step S300 specifically includes:
determining whether the deactivation duration has reached a first preset duration;
and if the first preset duration has been reached, enabling the target network port.
Specifically, if the deactivation duration reaches the first preset duration, the target network port is enabled so as to release its isolation. The recovery condition is met when the deactivation duration reaches the first preset duration. The first preset duration is set according to the actual situation and may be, for example, 0.5 hour or 1 hour, but is not limited thereto.
When the handling condition is met, the target network port whose health state is the sub-health state, that is, the sub-health network port, is isolated from the cluster network and becomes the isolated target network port. In this embodiment, the network port is isolated from the cluster network as follows: a network port deactivation instruction is called to deactivate the target network port of the first host whose health state is the sub-health state, that is, the sub-health network port. The network port deactivation instruction takes the sub-health network port offline (down), so that the first host can no longer communicate with the other hosts in the cluster through the deactivated network port and instead communicates with other hosts or the outside through the normal network ports that have not been deactivated; this is equivalent to disabling all link networks corresponding to the deactivated network port. The network port deactivation instruction is, for example, the ifconfig down command; to deactivate a first network port named eth0, the complete instruction is ifconfig eth0 down.
In this embodiment, once the target network port has been isolated (deactivated) for the first preset duration, an attempt to release its isolation is made automatically. The release mode is the reverse of the isolation mode: a network port enabling instruction is called to try to enable the isolated target network port, and once the isolated target network port has been enabled successfully, its isolation is released.
The network port enabling instruction brings the deactivated target network port online (up), so that the first host can communicate with the other hosts in the cluster through the enabled network port and all link networks corresponding to the enabled network port are enabled. The network port enabling instruction is, for example, the ifconfig up command; to enable a first network port named eth0, the complete instruction is ifconfig eth0 up.
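The timer-based recovery condition can be sketched as follows. This is a minimal Python illustration under assumptions: the function names `deactivation_elapsed` and `enable_command` are hypothetical, and the 30-minute constant simply reuses the 0.5 hour example given above. Only the `ifconfig <port> up` command itself comes from the description.

```python
import time
from typing import List, Optional

# Assumed value: the description gives 0.5 hour as one example of the first preset duration.
FIRST_PRESET_SECONDS = 30 * 60


def deactivation_elapsed(deactivated_at: float,
                         now: Optional[float] = None,
                         preset: float = FIRST_PRESET_SECONDS) -> bool:
    """Return True once the port has been deactivated for at least `preset` seconds.

    `deactivated_at` and `now` are readings of the same monotonic clock."""
    if now is None:
        now = time.monotonic()
    return now - deactivated_at >= preset


def enable_command(port_name: str) -> List[str]:
    """Build the enabling command quoted in the description, e.g. `ifconfig eth0 up`."""
    return ["ifconfig", port_name, "up"]
```

On a real host the returned argument list could be passed to `subprocess.run`, which typically requires root privileges; the sketch deliberately stops short of executing anything.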
In another embodiment, the isolation state includes the deactivated state and the monitoring data includes a recovery instruction from a user. Step S200 specifically includes: receiving, in the deactivated state, a recovery instruction from the user;
step S300 specifically includes: enabling the target network port according to the recovery instruction.
Specifically, after the target network port is deactivated, the first host listens for a recovery instruction from the user. If a recovery instruction is received, the recovery condition is determined to be met; otherwise, it is determined not to be met. The target network port is enabled when the recovery condition is met.
The recovery instruction is issued by the user, specifically a network operation and maintenance person, who decides when to release the isolation. The management end of the cluster provides a cluster management interface with a recovery button, and the user sends the recovery instruction to the first host by triggering the recovery button. After receiving the recovery instruction, the first host calls the network port enabling instruction to try to enable the isolated target network port; once the isolated target network port has been enabled successfully, its isolation is released.
The network port enabling instruction brings the deactivated network port online (up), so that the first host can communicate with the other hosts in the cluster through the enabled network port and all link networks corresponding to the enabled network port are enabled. The network port enabling instruction is, for example, the ifconfig up command; to enable a first network port named eth0, the complete instruction is ifconfig eth0 up.
After the target network port is deactivated, the first host cannot send probe packets through it to monitor whether its health state has returned to normal. The first host can therefore only start a timer once the target network port is isolated and attempt on its own to enable the port after the first preset duration, or attempt to enable the port when manually triggered by the user.
In one embodiment, the isolation state includes a removed state, which indicates that the target network port has been removed from the binding group of the first host, and the monitoring data includes the health state of the target network port. Step S200 specifically includes: if the time for which the target network port has been removed from the binding group exceeds a second preset duration, acquiring the health state of the isolated target network port;
step S300 specifically includes: if the health state of the target network port is the normal state, adding the target network port back into the binding group of the first host.
Specifically, when the isolated target network port has been removed from the binding group of the first host, its health state is acquired through probe analysis once it has been removed for the second preset duration; this is the health state of the target network port after removal. If the health state of the isolated target network port has returned to the normal state, the isolated target network port is determined to meet the recovery condition and is added back into the binding group of the first host to release its isolation.
When the handling condition is met, the target network port whose health state is the sub-health state, that is, the sub-health network port, is isolated from the cluster network and becomes the isolated target network port. In this embodiment, the network port is isolated from the cluster network by removing the sub-health network port of the first host, that is, the target network port, from the binding group of the first host. Port aggregation treats multiple ports of a device as a single logical interface, allowing the ports to transfer data in parallel and thereby providing higher bandwidth and greater throughput. It suits link bandwidth expansion and redundancy scenarios, and addresses link throughput bottlenecks and single-link failures. When the first host has multiple network ports, it can use port aggregation (port bonding) to bind a plurality of first network ports into a binding group that forms one logical network port; the binding group comprises the first network ports of the first host that have been added to it. Externally, it is the logical network port that communicates with devices outside the first host; internally, the first host selects the working first network port according to the bond mode in use. Once the sub-health network port, that is, the target network port, is removed from the binding group of the first host (removed from the aggregated port bond), it can no longer be selected as a working network port. The first host therefore no longer communicates with the other hosts in the cluster through the removed target network port, which is equivalent to removing all link networks corresponding to it.
The removed target network port is thereby isolated from the cluster network, and service performance is restored to a normal level.
Although the removed target network port has been removed from the aggregated port bond (the binding group), the first host can still send probe packets through it to the second hosts in the cluster. Therefore, once the sub-health network port has been removed for the second preset duration, its health state after removal can be obtained through probe analysis, and whether its isolation can be released is determined from that health state. If the health state of the target network port has returned from the sub-health state to the normal state, the isolation of the removed target network port can be released; if the health state is still the sub-health state, the isolation cannot be released. The second preset duration is set according to the actual application and may be, for example, 0.5 hour or 1 hour, but is not limited thereto.
In this embodiment, once the target network port has been isolated for the second preset duration, its new health state continues to be obtained through probe analysis. The release mode is the reverse of the isolation mode: the isolated target network port is added back into the binding group of the first host to release its isolation. After the isolation is released, the first host can communicate with devices in the cluster or with the outside through the released target network port.
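The recovery logic for the removed state can be sketched as a single decision step. The Python below is an illustration under assumptions: `removal_recovery_step` and its return labels are hypothetical, and the probe and rejoin operations are passed in as callables because the description does not specify how probing or re-adding to the binding group is implemented.

```python
from typing import Callable


def removal_recovery_step(removed_for_s: float,
                          second_preset_s: float,
                          probe_health: Callable[[], str],
                          rejoin_binding_group: Callable[[], None]) -> str:
    """Decide what to do with a port that was removed from the binding group.

    probe_health: hypothetical callable returning "normal" or "sub-health".
    rejoin_binding_group: hypothetical callable that re-adds the port.
    """
    # Probe only after the port has been removed for longer than the second preset duration.
    if removed_for_s <= second_preset_s:
        return "waiting"
    # The port can still carry probe packets, so its health can be re-measured.
    if probe_health() == "normal":
        rejoin_binding_group()
        return "released"
    return "still_isolated"
```

Calling this step periodically approximates the continued probing described above: the port stays isolated until a probe reports the normal state.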
In one embodiment, the isolation state includes the deactivated state or the removed state, where the removed state indicates that the target network port has been removed from the binding group of the first host, and the monitoring data includes kernel network events of the first host. Step S200 specifically includes: if the isolation state is the deactivated state or the removed state, acquiring the kernel network events of the first host.
Step S300 specifically includes: determining from the kernel network events whether a normal network port of the first host, that is, a network port in the normal state, has been disconnected, and if so, releasing the isolation of the target network port.
Specifically, after the target network port is isolated, the kernel network events of the first host are monitored; if the kernel network events show that a first network port of the first host whose health state is normal, that is, a normal network port, has been disconnected, the isolation of the isolated target network port is released.
The kernel network events of the first host are monitored through Netlink. Netlink is a mechanism for bidirectional data transmission between the kernel and user-space applications, and user space can use it through the standard socket API. By communicating with the Linux kernel, the monitor obtains rtnetlink events, such as network port up events (port online) and network port down events (port offline). The network state of a network port can thus be monitored through Netlink: if the network port is disconnected, the corresponding network is interrupted. When an originally normal network port of the first host is disconnected while the sub-health network port is isolated from the cluster network, the isolation of the isolated target network port must be released regardless of whether the health state of its network has returned to normal, so that the first host can communicate through the released target network port and service performance can be restored. In this case, releasing the isolation of the target network port is a rescue measure for when the normal network port cannot work; although the best network state cannot be guaranteed, normal operation of the service can be maintained to a certain extent.
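One possible way to receive such link events from user space is sketched below in Python. It is an illustration under assumptions rather than the applicant's code: the function names are hypothetical, the numeric constants are written out from the Linux rtnetlink headers, and treating a cleared `IFF_RUNNING` flag as "port down" is one interpretation of the down event mentioned above. The listener runs only on Linux.

```python
import socket
import struct
from typing import Iterator, List, Tuple

NLMSG_HDR = struct.Struct("=IHHII")   # nlmsghdr: len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # ifinfomsg: family, pad, type, index, flags, change
RTM_NEWLINK = 16                      # link state notification
RTMGRP_LINK = 1                       # multicast group for link events
IFF_RUNNING = 0x40                    # interface is operationally up


def parse_link_events(data: bytes) -> List[Tuple[int, bool]]:
    """Extract (interface index, is_running) pairs from a raw netlink datagram."""
    events = []
    offset = 0
    while offset + NLMSG_HDR.size <= len(data):
        msg_len, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(data, offset)
        if msg_len < NLMSG_HDR.size or offset + msg_len > len(data):
            break  # truncated or malformed message
        if msg_type == RTM_NEWLINK and msg_len >= NLMSG_HDR.size + IFINFOMSG.size:
            _family, _dev_type, index, flags, _change = IFINFOMSG.unpack_from(
                data, offset + NLMSG_HDR.size)
            events.append((index, bool(flags & IFF_RUNNING)))
        offset += (msg_len + 3) & ~3  # netlink messages are 4-byte aligned
    return events


def watch_link_events() -> Iterator[Tuple[int, bool]]:
    """Yield link events from the kernel indefinitely (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
    sock.bind((0, RTMGRP_LINK))  # pid 0 lets the kernel assign the port id
    while True:
        for event in parse_link_events(sock.recv(65535)):
            yield event
```

A monitor could iterate over `watch_link_events()` and, when an event for a normal network port reports `is_running` as false, trigger the release of the isolated target network port.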
In one embodiment, the releasing the isolation of the target portal specifically includes:
under the condition that the isolation state of the target internet access is the deactivation state, calling an internet access activation instruction and activating the target internet access;
and under the condition that the isolation state of the target internet access is the removal state, the target internet access is added into the binding group of the first host again to release the isolation.
Specifically, this embodiment builds on the previous embodiment. Regardless of whether the target network port was isolated from the cluster network by the network port deactivation instruction or by removal from the binding group, the isolation of the target network port is released as soon as a down event of the normal network port is detected, so that the network can be rescued.
In this embodiment, when the target network port has been deactivated, the network port enabling instruction is not invoked after the target network port has been deactivated for the first preset duration, nor upon receiving a recovery instruction from the user; it is invoked only when a down event of the normal network port occurs. The network port enabling instruction brings the deactivated network port online (up), so that the first host can communicate with other hosts in the cluster through the enabled network port; this is equivalent to enabling all links corresponding to that network port. The network port enabling instruction is, for example, the ifconfig up command; to enable a first network port named eth0, the complete instruction is ifconfig eth0 up.
In this embodiment, when the target network port has been removed from the binding group corresponding to the first host, releasing the isolation does not wait until detection and analysis confirm that the health state of the target network port has returned to the normal state. Instead, the isolation is released whenever a down event of the normal network port occurs, regardless of whether the isolated target network port has recovered. Releasing the isolation reverses the isolation operation: the isolated target network port is added back into the binding group of the first host. After the isolation is released, the first host can communicate with devices in the cluster or with the outside through the de-isolated target network port.
In this embodiment, after a down event of the normal network port is detected, the isolated target network port is temporarily enabled to quickly repair the link and rescue the network, restoring normal service performance and providing a rapid response to the emergency.
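The event-driven release described above can be sketched as a small decision function. This is only an illustrative sketch: the function name, the two isolation-method labels, and the action names are hypothetical and do not come from the application; the actual enabling or re-binding would be carried out by whatever mechanism the host uses.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical labels for how a port was isolated.
DEACTIVATED = "deactivated"   # isolated via the deactivation instruction
UNBOUND = "unbound"           # isolated by removal from the binding group


def release_actions(down_port: str,
                    normal_ports: Set[str],
                    isolated: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the de-isolation actions to take when `down_port` goes down.

    Isolation is released only if the port that went down was a normal port;
    the release action mirrors the way each port was isolated.
    """
    if down_port not in normal_ports:
        return []
    actions: List[Tuple[str, str]] = []
    for port, method in sorted(isolated.items()):
        if method == DEACTIVATED:
            actions.append(("enable", port))
        elif method == UNBOUND:
            actions.append(("rejoin_binding_group", port))
    return actions
```

For example, if eth1 is the normal port and eth0 was deactivated, a down event on eth1 yields a single "enable eth0" action, while a down event on any other port yields no action.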
In one embodiment, enabling the target network port specifically includes: invoking the network port enabling instruction to enable the isolated target network port; detecting whether the target network port is connected; and, if the target network port is not connected, determining whether the number of invocations of the network port enabling instruction exceeds a threshold and, if not, invoking the network port enabling instruction again.
Specifically, the network port enabling operation is executed in a loop until the isolated target network port is enabled or the number of invocations reaches the preset number. The network port enabling operation includes: invoking the network port enabling instruction to try to enable the isolated target network port, detecting the connectivity of the isolated target network port to determine whether it has been enabled, and incrementing the invocation count.
When the isolated target network port has been deactivated, it is offline, so the first host can no longer send detection packets through it. Detection data therefore cannot be used to analyze whether the deactivated target network port has returned to normal; that is, it cannot be learned directly whether network operation and maintenance personnel or engineering personnel have repaired it. Therefore, when the isolated target network port is deactivated, an attempt is made to invoke the network port enabling instruction to recover the isolated target network port after it has been isolated for the first preset duration.
An attempt to invoke the network port enabling instruction to recover the isolated target network port may succeed or fail. Therefore, after the network port enabling instruction is invoked, the connectivity of the isolated target network port needs to be detected through ethtool, and the connectivity is used to check whether the isolated target network port is already in the enabled state (up state). If the isolated target network port is not in the enabled state, the enabling attempt has failed. If the cumulative number of invocations of the network port enabling instruction has not reached the preset number, the steps of the network port enabling operation are executed again to retry enabling the isolated target network port. If the cumulative number of invocations has reached the preset number, enabling the isolated target network port has failed; that is, the isolated target network port cannot be de-isolated at this time. The preset number is the maximum number of attempts to release the isolation.
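The bounded retry loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: `enable` and `is_link_up` are hypothetical callables standing in for the enabling instruction and the connectivity check, and `max_attempts` plays the role of the preset number.

```python
from typing import Callable


def enable_with_retries(port: str,
                        enable: Callable[[str], None],
                        is_link_up: Callable[[str], bool],
                        max_attempts: int = 3) -> bool:
    """Try to enable `port`, re-invoking the enabling instruction on failure.

    Returns True as soon as the connectivity check reports the port as up,
    or False once `max_attempts` invocations have been made without success.
    """
    attempts = 0
    while attempts < max_attempts:
        enable(port)            # invoke the enabling instruction
        attempts += 1           # accumulate the invocation count
        if is_link_up(port):    # connectivity check after each attempt
            return True
    return False
```

Passing the two operations in as callables keeps the loop independent of how the port is actually enabled and checked, which also makes it easy to exercise with fakes.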
In one embodiment, the method further comprises:
if the isolated target network port is enabled, acquiring the health state of the enabled target network port through detection and analysis;
and if the health state of the enabled target network port is still in a sub-health state, calling a network port disabling instruction, and isolating the enabled target network port from the cluster network again.
Specifically, this embodiment builds on the previous embodiment. If the isolated target network port is successfully enabled by invoking the network port enabling instruction, a new health state of the enabled target network port is obtained through detection and analysis to determine whether its health state has returned to the normal state. If the new health state is the normal state, the enabled target network port is determined to have recovered to a normal network port; otherwise, it is determined to still be a sub-healthy network port. If the enabled target network port is still sub-healthy, the network port deactivation instruction can be invoked again to deactivate it. Then, after waiting for the first preset duration, enabling is attempted again, and these steps are repeated until the isolated target network port returns to normal.
Determining whether the enabled target network port has returned to the normal state specifically includes: acquiring detection data of each link of the target network port in a detection period after the target network port is enabled; determining the packet loss rate and the first time delay of each link in the detection period according to the detection data; and determining the health state of the target network port according to the packet loss rates and first time delays of all its links. If, among all links of the target network port, there is a link whose first time delay is smaller than the third threshold and whose packet loss rate is smaller than the fourth threshold, the target network port is determined to have recovered to a normal network port. The detection period may be, for example, 3 minutes or 5 minutes, but is not limited thereto. If the target network port has recovered to a normal network port, its health state is updated and output.
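The enable, probe, and re-deactivate cycle above can be sketched as follows. This is a rough sketch, not the application's implementation: all callables are hypothetical stand-ins, each link is represented as a (first time delay, packet loss rate) pair, and `max_cycles` is an added assumption so that the loop terminates (the text itself repeats until the port recovers).

```python
from typing import Callable, Iterable, Tuple

LinkSample = Tuple[float, float]  # (first time delay, packet loss rate)


def has_recovered(links: Iterable[LinkSample],
                  delay_t3: float, loss_t4: float) -> bool:
    """Normal if any link is below both the third and fourth thresholds."""
    return any(delay < delay_t3 and loss < loss_t4 for delay, loss in links)


def recover_cycle(port: str,
                  enable: Callable[[str], bool],
                  disable: Callable[[str], None],
                  probe: Callable[[str], Iterable[LinkSample]],
                  wait: Callable[[], None],
                  delay_t3: float, loss_t4: float,
                  max_cycles: int) -> bool:
    """Repeat: enable, probe for one detection period, re-deactivate if needed."""
    for _ in range(max_cycles):
        if enable(port):                      # True if the port came up
            if has_recovered(probe(port), delay_t3, loss_t4):
                return True                   # port is a normal port again
            disable(port)                     # still sub-healthy: isolate again
        wait()                                # wait the first preset duration
    return False
```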
After a sub-healthy network port or link has been handled, the health state of the network port continues to be actively detected, the time for which the handled network port has been deactivated is monitored, or a recovery instruction from the user is monitored. Handled network ports or links that have been repaired or that meet the recovery condition are then recovered, so that the distributed storage aggregation network returns to a normal state, the availability of the system is enhanced, and the doubled network bandwidth is restored.
In one embodiment, a network handling method is further performed before step S100. Referring to Fig. 3, the network handling method includes the following steps:
S010: acquiring network aggregation mode information of the cluster, where the network aggregation mode information is used to determine the network aggregation mode of the cluster.
Specifically, the network aggregation mode of the present embodiment includes a single switch link aggregation mode and a dual switch link aggregation mode. The single switch link aggregation mode means that a plurality of network ports of one host are connected to the same switch, and the network ports of a plurality of hosts can be connected to the same switch; the dual-switch link aggregation mode means that a plurality of network ports of one host are distributed and connected on two switches, and the same switch can be connected with the network ports of a plurality of hosts.
S020: acquiring network analysis data and handling condition information corresponding to a plurality of first network ports of a first host in the network aggregation mode.
Specifically, the network analysis data obtained differ between network aggregation modes, and so do the corresponding handling conditions. The network analysis data are obtained through detection and analysis and include at least the health states of the plurality of first network ports of the first host. The health state of a first network port is determined as follows: the first host sends a plurality of detection packets to a corresponding second host in the cluster through at least one link between the first host and the second host at a preset detection frequency, obtains detection data of each link of the first network port in a detection period, and determines the packet loss rate and the first time delay of each link in the detection period according to the detection data; the first host then determines the health state of the first network port according to the first time delays and packet loss rates of all links of the first network port in the detection period.
More specifically, if the first time delays of all links of the first portal exceed the first threshold value in the probing period, or the packet loss rates of all links of the first portal exceed the second threshold value, it is determined that the health state of the first portal is the sub-health state.
Under the condition that the first time delays of all links of the first network interface are smaller than a first threshold and the packet loss rates of all links of the first network interface are smaller than a second threshold in a detection period, if the first time delays of all links of the first network interface exceed a third threshold or the packet loss rates of all links of the first network interface exceed a fourth threshold, the health state of the first network interface is determined to be a sub-health state, wherein the third threshold is smaller than the first threshold, and the fourth threshold is smaller than the second threshold.
And if the links with the first time delay smaller than the third threshold and the packet loss rate smaller than the fourth threshold exist in all the links of the first network interface in the detection period, determining that the health state of the first network interface is a normal state.
Whether the first network port is an available network port is determined according to the acquired port state of the first network port of the first host, where the port state is either a connected state or a non-connected state. If the port state of the first network port is the non-connected state, the health state of the first network port is determined to be the non-connected state.
For an available network port, the negotiated bandwidth and the rated bandwidth are acquired. If the negotiated bandwidth is smaller than the corresponding rated bandwidth, the available network port is determined to have negotiated-bandwidth degradation, and the health state of the first network port is determined to be the sub-health state.
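The health-state rules above can be collected into one classification function. This is an illustrative sketch only: the type and function names are hypothetical, the state labels are plain strings chosen for the example, and the "undetermined" result for combinations the text does not cover is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkStats:
    delay: float      # first time delay of the link in the detection period
    loss_rate: float  # packet loss rate of the link in the detection period


@dataclass
class Thresholds:
    t1_delay: float   # first threshold
    t2_loss: float    # second threshold
    t3_delay: float   # third threshold, smaller than the first
    t4_loss: float    # fourth threshold, smaller than the second


def classify_port(links: List[LinkStats], th: Thresholds,
                  connected: bool = True,
                  negotiated_bw: Optional[float] = None,
                  rated_bw: Optional[float] = None) -> str:
    if not connected:
        return "not_connected"
    # Negotiated bandwidth below rated bandwidth counts as sub-health.
    if negotiated_bw is not None and rated_bw is not None and negotiated_bw < rated_bw:
        return "sub_health"
    if not links:
        return "undetermined"  # assumption: no detection data, no verdict
    # Normal: at least one link is below both the third and fourth thresholds.
    if any(l.delay < th.t3_delay and l.loss_rate < th.t4_loss for l in links):
        return "normal"
    # Sub-health: every link exceeds the first or every link exceeds the second threshold.
    if all(l.delay > th.t1_delay for l in links) or all(l.loss_rate > th.t2_loss for l in links):
        return "sub_health"
    # Sub-health: all links under the upper thresholds, but all above a lower one.
    under_upper = all(l.delay < th.t1_delay and l.loss_rate < th.t2_loss for l in links)
    if under_upper and (all(l.delay > th.t3_delay for l in links)
                        or all(l.loss_rate > th.t4_loss for l in links)):
        return "sub_health"
    return "undetermined"  # assumption: case not covered by the stated rules
```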
The handling condition is specifically a handling condition for network ports: whether the network ports of the first host that can be handled are actually handled is decided by judging whether the first host meets the handling condition.
Because their topologies differ, the single switch link aggregation mode and the dual switch link aggregation mode are subject to different constraints when active network handling is performed.
S030: it is determined whether the first host meets the handling condition indicated by the handling condition information according to the network analysis data.
Specifically, the handling condition at least includes that the first host has a sub-health portal with a health state being a sub-health state and a normal portal with a health state being a normal state, and whether the first host meets the handling condition is determined according to the sub-health portal, the normal portal and the network aggregation mode.
S040: if the handling condition is met, isolating the target network port from the cluster network, where the target network port is in the sub-health state.
Specifically, this embodiment determines whether the first host meets the handling condition according to the network analysis data of the plurality of first network ports of the first host and, when the handling condition is met, isolates the target network port, that is, the sub-healthy network port, from the cluster network. The health state of the target network port isolated from the cluster network changes from the sub-health state to the isolated state (infected). After handling, network bandwidth and system performance are monitored, and the isolated target network port may be de-isolated if the recovery condition is met.
The cluster is a distributed storage aggregation network composed of a plurality of hosts, and as known from the characteristics of the aggregation network, if one of the multiple network ports of the same host is isolated, all the receiving and sending packets and network traffic of the host are automatically processed through the other non-isolated and normal network ports, so that the availability of the network is ensured. Isolating the sub-health portal or the target portal from the cluster network specifically means that all sub-health links of the first portal with the health state being the sub-health state are disconnected, so that the first host cannot communicate with other hosts or the outside through the first portal with the sub-health state, and the first host communicates with the other hosts or the outside through the normal portal without isolation.
If, in the link aggregation, the links of some first network ports of the first host are sub-healthy while the links of other first network ports are normal, the sub-healthy first network ports are isolated and only the normal links corresponding to the normal network ports are used, which reduces the impact of network sub-health on customer services.
In this embodiment, whether the sub-health portal can be handled is determined by the health status of the portal and the link status of the link corresponding to the portal, so as to avoid the negative impact on the network caused by blind handling. The sub-health network port is processed when the processing condition is met, so that the sub-health link isolation is realized on the premise of not interrupting the network and not influencing the service performance, the service performance is ensured to be restored to a normal level, and the possibility of service performance reduction caused by network sub-health is reduced.
In one embodiment, the network aggregation mode includes a single switch link aggregation mode, the network analysis data includes health states of a plurality of first portals in the first host, and the step S030 specifically includes: and if the plurality of first network ports of the first host comprise a target network port with a sub-health state and at least one normal network port with a normal state, determining that the first host meets the handling condition.
Specifically, fig. 4 is a schematic diagram illustrating a network port communication between hosts in a single-switch link aggregation mode.
Fig. 4 shows a plurality of first network ports of a first host communicating with a plurality of second network ports of a second host. The first host A includes two first network ports, namely first network port 1 and first network port 2, and the second host B includes second network port 1 and second network port 2. First network port 1 communicates with second network port 1 through the switch to form a first link, first network port 2 communicates with second network port 2 through the switch to form a second link, first network port 1 communicates with second network port 2 through the switch to form a third link, and first network port 2 communicates with second network port 1 through the switch to form a fourth link.
Normally, the first host A may communicate with the second host B through any one of the first link, the second link, the third link, and the fourth link. Besides the two hosts, each of these links includes forwarding devices such as switches.
However, if the health state of first network port 1 of the first host A and its corresponding network is the sub-health state, that is, first network port 1 is a sub-healthy network port, and the health state of first network port 2 of the first host A is the normal state, that is, first network port 2 is a normal network port, then first network port 1 is isolated from the cluster network. The first host A then cannot communicate with host B through first network port 1 and instead communicates with host B through first network port 2, the normal network port that is not isolated.
First network port 1 and first network port 2 are two slave ports of the first host. They are bonded into one master port using port bonding, and the master port sends and receives data at layer 3. The two slave ports are scheduled by the master port at layer 2, so one slave port may be taken out of scheduling while the other carries all data of the master port; at least one slave port must therefore remain in normal service. Consequently, before a sub-healthy network port is handled in the single switch link aggregation mode, it must be determined whether the other slave port of the bond is normal, and handling proceeds only if it is. The first host can then communicate with other hosts through the remaining normal slave port; in this example, the first host communicates with other hosts or the outside through first network port 2 (slave port). After one network port of the aggregated link is isolated, all sent and received packets and network traffic automatically pass through the other network port, which ensures the availability of the network.
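The single switch handling condition can be sketched as a check over the health states of the first host's ports. This is a minimal sketch; the function name and the string labels for the health states are hypothetical.

```python
from typing import Dict, List


def single_switch_targets(port_health: Dict[str, str]) -> List[str]:
    """Return the sub-healthy ports that may be isolated in single switch mode.

    The handling condition holds only if there is at least one sub-healthy
    port and at least one normal port left to carry the traffic; otherwise
    nothing is isolated and an empty list is returned.
    """
    sub_healthy = sorted(p for p, h in port_health.items() if h == "sub_health")
    has_normal = any(h == "normal" for h in port_health.values())
    return sub_healthy if sub_healthy and has_normal else []
```

With the Fig. 4 example, a host whose port 1 is sub-healthy and whose port 2 is normal yields port 1 as the target; if both ports are sub-healthy, no port is isolated.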
In one embodiment, the network aggregation mode includes a dual switch link aggregation mode, the network analysis data includes health states of a plurality of first ports in the first host and link states of links between the plurality of first ports and a second port of the second host in the cluster, and the step S030 specifically includes: and if the plurality of first network ports of the first host comprise a target network port with a sub-health state and at least one normal network port with a normal state, and the link states of the links between the at least one normal network port and the second network port connected to the same switch are both normal states, determining that the first host meets the handling condition.
Specifically, when a sub-healthy network port exists, the network handling condition is determined to be satisfied if the first host has at least one first network port whose health state is the normal state, that is, at least one normal network port, and the links between that normal network port and the second network ports connected to the same switch are all in the normal state. In other words, the normal first network port can communicate normally, through the switch to which it is connected, with the second network ports of the other hosts connected to that switch.
Fig. 5 is a schematic diagram of network port communication between hosts in the dual switch link aggregation mode. Fig. 5 shows a plurality of first network ports of a first host communicating with a plurality of second network ports of second hosts. The first host A includes two first network ports, namely first network port 1 and first network port 2; the second host B includes second network port 1 and second network port 2; and the second host C includes second network port 3 and second network port 4. First network port 1 communicates with second network port 1 through switch 1 to form a first link, first network port 2 communicates with second network port 2 through switch 2 to form a second link, first network port 1 communicates with second network port 3 through switch 1 to form a third link, first network port 2 communicates with second network port 4 through switch 2 to form a fourth link, second network port 1 communicates with second network port 3 through switch 1 to form a fifth link, and second network port 2 communicates with second network port 4 through switch 2 to form a sixth link.
Normally, the first host A may communicate with the second host B through the first link or the second link, the first host A may communicate with the second host C through the third link or the fourth link, and the second host B may communicate with the second host C through the fifth link or the sixth link. Besides the two hosts, each of the first to sixth links includes forwarding devices such as switches.
Suppose the health state of first network port 1 of the first host A and its corresponding network is the sub-health state, that is, first network port 1 is a sub-healthy network port, and the health state of first network port 2 of the first host A is the normal state, that is, first network port 2 is a normal network port. If first network port 1 is isolated, then for services to remain normal, that is, for the first host A to be able to communicate with both the second host B and the second host C, the second link formed by first network port 2 and second network port 2 through switch 2 and the fourth link formed by first network port 2 and second network port 4 through switch 2 must both be in the normal state. In that case, even if first network port 1 is isolated from the cluster network, communication between the first host A and the other hosts in the cluster is not affected; the sub-healthy links corresponding to the sub-healthy network port are isolated without interrupting the network or affecting service performance, and service performance is restored to a normal level.
The first portal 1 is isolated from the cluster network, so that the first host cannot communicate with the host B and the host C through the first portal 1, and communicates with the host B and the host C through the normal portal 2 which is not isolated. Due to the characteristics of the aggregation network, after 1 network port is isolated, all the receiving and sending packets and network traffic automatically pass through the other 1 network port to be processed, so that the availability of the network is ensured.
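The dual switch handling condition adds a link-state check on the switch of the surviving port. The sketch below models the Fig. 5 topology from the point of view of host A; the `Link` type, the function name, and the state labels are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Link:
    local_port: str    # first network port on the first host
    remote_port: str   # second network port on another host
    switch: str        # switch the link passes through
    state: str         # "normal" or "sub_health"


def meets_dual_switch_condition(port_health: Dict[str, str],
                                port_switch: Dict[str, str],
                                links: List[Link]) -> bool:
    """True if a sub-healthy port exists and some normal port has only
    normal links to the second network ports on its own switch."""
    if "sub_health" not in port_health.values():
        return False
    for port, health in port_health.items():
        if health != "normal":
            continue
        same_switch = [l for l in links
                       if l.local_port == port and l.switch == port_switch[port]]
        if same_switch and all(l.state == "normal" for l in same_switch):
            return True
    return False
```

In the Fig. 5 example, isolating first network port 1 is allowed only while the second and fourth links, both through switch 2, are normal; if the fourth link degrades, the condition no longer holds.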
In one embodiment, the network aggregation mode comprises a single switch link aggregation mode, the network analysis data comprises health states of a plurality of first portals in the first host, the network handling method further comprising: acquiring the total bandwidth of a plurality of first network ports of a first host and the normal network port bandwidth of a normal network port with a normal state in the plurality of first network ports;
step S030 specifically includes: if the plurality of first network ports of the first host include a target network port with a sub-health state and at least one normal network port with a normal state, and the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed a first preset ratio, determining that the first host meets the handling condition, wherein the first preset ratio is less than or equal to 1.
Specifically, the first host includes at least two first network ports, and the total bandwidth is the sum of the bandwidths of the first network ports of the first host that are online or enabled, including first network ports in the sub-health state and in the normal state. In theory, if all network ports of the same host are normal, the more the total bandwidth exceeds the normal network port bandwidth, the higher the bandwidth utilization of the network ports. Conversely, if the total bandwidth differs little from the normal network port bandwidth while a plurality of network ports are enabled, some of the network ports (for example, the sub-healthy network ports) have low bandwidth utilization and contribute little. Therefore, by comparing the total bandwidth with the product of the normal network port bandwidth and the first preset ratio, this embodiment determines whether, if the sub-healthy network port is isolated, the bandwidth of the remaining normal network ports would fall too far below the total bandwidth before isolation. If the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed the first preset ratio, then even if the sub-healthy network port is isolated, the bandwidth is not reduced after the network port switch, and the bandwidth of the normal network port can still meet the needs of normal network communication. The first preset ratio may be 80% or 90%, but is not limited thereto.
In this embodiment, in the single switch link aggregation mode with a sub-healthy network port present, the sub-healthy network port is isolated from the cluster network only when both of the following hold: at least one first network port is in the normal state, and the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed the first preset ratio. Sub-healthy links are thus isolated, and service performance is restored to a normal level, without interrupting the network, without affecting service performance, and without reducing the bandwidth after the network port switch.
In one embodiment, the network aggregation mode includes a dual switch link aggregation mode, the network analysis data includes health states of a plurality of first portals in the first host and link states of links between the plurality of first portals and a second portal of a second host in the cluster, and the network handling method further includes: acquiring the total bandwidth of a plurality of first network ports of a first host and the normal network port bandwidth of a normal network port with a normal state in the plurality of first network ports;
step S030 specifically includes: if the plurality of first network ports of the first host comprise a target network port with a sub-health state and at least one normal network port with a normal state, and the link states of the link between the at least one normal network port and the second network port connected to the same switch are both normal states, and the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed a second preset ratio, determining that the first host meets the disposal condition, wherein the second preset ratio is less than or equal to 1.
Specifically, the first host includes at least two first network ports, and the total bandwidth is the sum of the bandwidths of the first network ports of the first host that are online or enabled, including first network ports in the sub-health state and in the normal state. In theory, if all network ports of the same host are normal, the more the total bandwidth exceeds the normal network port bandwidth, the higher the bandwidth utilization of the network ports. Conversely, if the total bandwidth differs little from the normal network port bandwidth while a plurality of network ports are enabled, some of the network ports (the sub-healthy network ports) have low bandwidth utilization and contribute little. Therefore, by comparing the total bandwidth with the product of the normal network port bandwidth and the second preset ratio, this embodiment determines whether, if the sub-healthy network port is isolated, the bandwidth of the remaining normal network ports would fall too far below the total bandwidth before isolation. If the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed the second preset ratio, then even if the sub-healthy network port is isolated, the bandwidth is not reduced after the network port switch, and the bandwidth of the normal network port can still meet the needs of normal network communication. The second preset ratio may be 80% or 90%, but is not limited thereto.
In this embodiment, in the dual switch link aggregation mode with a sub-healthy network port present, the sub-healthy network port is isolated from the cluster network only when both of the following hold: the first host has at least one first network port in the normal state whose links to the second network ports connected to the same switch are all in the normal state, and the ratio of the total bandwidth of the plurality of first network ports to the normal network port bandwidth of the normal network port does not exceed the second preset ratio. Sub-healthy links are thus isolated, and service performance is restored to a normal level, without interrupting the network, without affecting service performance, and without reducing the bandwidth after the network port switch.
In the foregoing embodiments, isolating a target network port among the plurality of first network ports from the cluster network specifically includes:
invoking a network port deactivation instruction to deactivate the target network port.
Specifically, the network port deactivation instruction takes the sub-healthy network port, that is, the target network port, offline (down), so that the first host can no longer communicate with other hosts in the cluster through the deactivated network port and instead communicates with other hosts or the outside through the normal network ports that are not deactivated. This is equivalent to deactivating all links corresponding to the deactivated network port. The network port deactivation instruction is, for example, the ifconfig down command; to deactivate a first network port named eth0, the complete instruction is ifconfig eth0 down.
The embodiment isolates the sub-health network port by calling the network port stopping instruction, and ensures that the service performance is recovered to a normal level.
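As a small illustration of the deactivation and enabling instructions named above, the following sketch builds the corresponding ifconfig argument list without executing it. The function name and the interface-name validation are assumptions of this sketch; actually running the command (for example with `subprocess.run(cmd, check=True)`) would typically require administrator privileges on the host.

```python
import re
from typing import List

# Assumption: accept only simple interface names of at most 15 characters.
_IFNAME = re.compile(r"^[A-Za-z0-9_.-]{1,15}$")


def port_command(name: str, up: bool) -> List[str]:
    """Build the ifconfig argument list that brings a port up or down."""
    if not _IFNAME.match(name):
        raise ValueError(f"unexpected interface name: {name!r}")
    return ["ifconfig", name, "up" if up else "down"]
```

Building an argument list rather than a shell string avoids passing an unvalidated port name through a shell.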
In the above embodiments, isolating a target portal from the plurality of first portals out of the cluster network includes:
and removing the target internet access from the binding group corresponding to the first host.
In particular, port aggregation treats multiple ports of a device as a single logical interface, which allows the ports to transfer data in parallel and thereby provides higher bandwidth and greater throughput. Port aggregation is suitable for link bandwidth expansion and redundancy scenarios, and addresses link throughput bottlenecks and single-link failures. When the first host has a plurality of network ports, the first host can use a port aggregation or port binding technology to bind the plurality of first network ports into a binding group that forms one logical network port; the binding group comprises the first network ports of the first host that have been added to it. Externally, the logical network port communicates with devices outside the first host; internally, the first host selects which first network port works according to the network port bond mode in use.
In this embodiment, a sub-health network port, that is, the target network port, is removed from the binding group of the first host (removed from the aggregated port binding), so that the removed first network port can no longer be selected as a working network port. The first host therefore can no longer communicate with other hosts in the cluster through the removed network port, which is equivalent to removing all link networks corresponding to that network port. The removed first network port is thus isolated from the cluster network, which ensures that service performance recovers to a normal level.
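Removal from the binding group can be sketched as follows, under a loudly stated assumption: the source does not say how the binding group is implemented, so the sketch assumes a Linux bonding device, whose members can be detached by writing `-<port>` to `/sys/class/net/<bond>/bonding/slaves`. The function names and the in-memory model are hypothetical.

```python
from __future__ import annotations


def bond_removal_request(bond_name: str, port_name: str) -> tuple[str, str]:
    """Return (sysfs_path, payload) that would detach port_name from bond_name.

    Assumption: the binding group is a Linux bonding device managed via sysfs.
    Nothing is written here; the caller decides whether to apply the request.
    """
    return f"/sys/class/net/{bond_name}/bonding/slaves", f"-{port_name}"


def remove_from_group(members: list[str], target_port: str) -> list[str]:
    """In-memory model: the binding group without the isolated target port."""
    return [port for port in members if port != target_port]


if __name__ == "__main__":
    print(bond_removal_request("bond0", "eth1"))
    print(remove_from_group(["eth0", "eth1"], "eth1"))
```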
In an embodiment, the network aggregation mode includes a single switch link aggregation mode, and the network analysis data includes the health states of a plurality of first network ports in the first host. Fig. 6 is a schematic flowchart of a method for acquiring network analysis data according to an embodiment of the present application. Step S020 of acquiring the network analysis data corresponding to the plurality of first network ports of the first host in the network aggregation mode specifically includes:
S021: sending a plurality of detection packets to a corresponding second host in the cluster through at least one link between the first host and the second host according to a preset detection frequency to obtain detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
S022: determining the health state of the first network port according to the first time delay and the packet loss rate of all links of the first network port in the detection period, wherein the health state comprises a normal state and a sub-health state.
Specifically, in this embodiment, one host in the cluster serves as the first host and the other hosts serve as second hosts; the first host is the local host and the execution subject. Each host in the cluster can act as the execution subject and actively send detection packets to the other hosts, that is, the hosts in the cluster actively probe one another to detect whether their network ports are normal. This embodiment uses raw-socket layer-2 communication: packets are sent and received directly between host network ports using MAC addresses.
The first host comprises at least one first network port, the cluster comprises at least one second host, each second host comprises at least one second network port, and a network port is a physical network interface. The first host sends a plurality of detection packets through a first network port to the corresponding second network port of any second host, and obtains data such as the receiving timestamp from the reply packet that the second host returns through that second network port. In this embodiment, the detection data of the link corresponding to each host is collected by sending detection packets between the hosts, where a link is the communication channel between a first network port of the first host and the corresponding second network port of a second host.
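A layer-2 detection packet addressed purely by MAC address might be laid out as in the sketch below. The frame format is an assumption made for illustration: the source does not specify an EtherType or payload, so the sketch uses 0x88B5 (an IEEE local experimental EtherType) and a hypothetical payload carrying a sequence number and the sending timestamp.

```python
import struct

# Assumption: 0x88B5 is an IEEE "local experimental" EtherType; the source names none.
PROBE_ETHERTYPE = 0x88B5
# Minimum Ethernet payload size; shorter payloads are zero-padded.
MIN_PAYLOAD = 46


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def build_probe_frame(dst_mac: str, src_mac: str, seq: int, t1_ns: int) -> bytes:
    """Build an Ethernet frame: 14-byte header plus (seq, T1) payload."""
    header = struct.pack(
        "!6s6sH", mac_to_bytes(dst_mac), mac_to_bytes(src_mac), PROBE_ETHERTYPE
    )
    # Hypothetical payload: 32-bit sequence number and 64-bit send timestamp.
    payload = struct.pack("!IQ", seq, t1_ns).ljust(MIN_PAYLOAD, b"\x00")
    return header + payload


if __name__ == "__main__":
    frame = build_probe_frame("aa:bb:cc:dd:ee:ff", "11:22:33:44:55:66", 1, 1_000)
    print(len(frame), frame[:14].hex())
```

On Linux, such a frame could be sent through a `socket.socket(socket.AF_PACKET, socket.SOCK_RAW)` bound to the sending network port, which requires root privileges; the sketch stops at building the bytes so it can run anywhere.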
Each first network port of the first host forms different links with the second network ports corresponding to different second hosts, so that each first network port can have a plurality of links.
The detection data is data corresponding to the whole event from the time when the first network port available for the first host sends the detection packet to the corresponding second network port to the time when the reply packet returned by the second network port is received. The same first network port sends a detection packet to the corresponding second network port according to the preset detection frequency, so that the detection data corresponding to each link comprises data or accumulated data correspondingly generated by all times of detection.
The sub-detection data of each link during each detection comprises the corresponding network port information of the first network port and the network port information of the second network port, the packet sending number and the packet receiving number of each detection corresponding to the first network port, and the second time delay of each detection.
The second time delay of each detection on a link is calculated as: T = (T4 - T1) - (T3 - T2). In end-to-end host communication, T1 is the first sending timestamp, recorded when the first host sends the packet P through the first network port to the corresponding second network port of the second host; T2 is the first receiving timestamp, recorded when the second network port of the second host receives the packet P; T3 is the second sending timestamp, recorded when the second host returns the reply packet H through the second network port to the first network port of the first host; and T4 is the second receiving timestamp, recorded when the first host receives the reply packet H through the first network port.
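The formula can be checked with a small worked example. The function name `second_delay` is hypothetical. Because T1 and T4 are read on the first host and T2 and T3 on the second host, the result does not depend on any offset between the two hosts' clocks, which the example illustrates.

```python
def second_delay(t1: float, t2: float, t3: float, t4: float) -> float:
    """Network round-trip time: total elapsed time minus remote processing time."""
    return (t4 - t1) - (t3 - t2)


if __name__ == "__main__":
    # Local clock: sent at 100, reply received at 130 (30 units elapsed).
    # Remote clock (offset by an arbitrary amount): received 250, replied 260.
    print(second_delay(t1=100, t2=250, t3=260, t4=130))
```

Here the 30 units measured locally include 10 units of processing on the second host, leaving 20 units spent on the link.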
In this embodiment, a first port of a first host is controlled to send a probe packet to a second port corresponding to a second host based on a sliding window protocol and a preset probe frequency, so as to collect and acquire probe data.
The sliding window protocol maintains, at any moment, a contiguous, fixed-length range of sequence numbers for the data packets that may be sent, and is used for flow control during network data transmission to avoid congestion. The protocol allows a sender to send several data packets before it stops and waits for acknowledgement, which speeds up data transmission and improves network throughput. In this embodiment, the speed at which the first host sends detection packets to the second host is coordinated through the sliding window protocol, and the packet sending frequency is further constrained by the preset detection frequency, so that the detection operation meets the requirements of network health analysis while avoiding network congestion as far as possible.
The network port information may include, but is not limited to, the MAC address of the network port, the switch port to which the network port is connected, and the like.
The detection period is specifically the time interval of the network health analysis. The first host sends detection packets to the second host to acquire detection data, but the acquired detection data does not need to be processed in real time; instead, once per detection period, the detection data of that period is extracted to analyze the health state of the network port and the corresponding network.
The packet loss rate of each link in the detection period = (number of packets sent by the link network port - number of packets received by the link network port) / number of packets sent by the link network port × 100%.
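The combination of a send window and a preset detection frequency can be sketched as below. This is a simplified model, not the protocol implementation of the embodiment: the class `ProbeWindow` and its methods are hypothetical, the window is modeled only as a bound on unacknowledged detection packets, and the caller is assumed to call `release` both when a reply arrives and when a detection packet times out.

```python
from typing import Optional


class ProbeWindow:
    """Bound the number of in-flight probes and pace sends to a fixed frequency."""

    def __init__(self, window_size: int, probes_per_second: float) -> None:
        if window_size < 1 or probes_per_second <= 0:
            raise ValueError("window_size and probes_per_second must be positive")
        self.window_size = window_size
        self.min_interval = 1.0 / probes_per_second
        self.next_seq = 0
        self.in_flight = set()
        self.last_send = None  # type: Optional[float]

    def can_send(self, now: float) -> bool:
        window_open = len(self.in_flight) < self.window_size
        paced = self.last_send is None or now - self.last_send >= self.min_interval
        return window_open and paced

    def send(self, now: float) -> int:
        """Reserve the next sequence number; raises if the window or pacing forbids it."""
        if not self.can_send(now):
            raise RuntimeError("window full or sending too fast")
        seq = self.next_seq
        self.next_seq += 1
        self.in_flight.add(seq)
        self.last_send = now
        return seq

    def release(self, seq: int) -> None:
        """Free a window slot after a reply arrives or the probe times out."""
        self.in_flight.discard(seq)


if __name__ == "__main__":
    window = ProbeWindow(window_size=2, probes_per_second=10.0)
    print(window.send(0.0), window.can_send(0.05), window.send(0.5))
```

Taking the current time as a parameter, rather than reading a clock internally, keeps the pacing logic deterministic and easy to test.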
The packet sending number of the link network port is the number of the detection packets sent by the first network port of the link in the detection period, and the packet receiving number of the link network port is the number of the reply packets received by the first network port of the link and returned by the second network port of the link.
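The per-link packet loss rate defined above can be computed as in this short sketch. The function name and the input validation are assumptions added for illustration.

```python
def packet_loss_rate(sent: int, received: int) -> float:
    """Packet loss rate of a link in a detection period, in percent."""
    if sent <= 0:
        raise ValueError("no detection packets were sent in this period")
    if not 0 <= received <= sent:
        raise ValueError("received count must be between 0 and sent")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    # 200 detection packets sent, 190 reply packets received.
    print(packet_loss_rate(200, 190))
```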
In a detection period, each link is probed multiple times: the first network port of a link sends detection packets to the corresponding second network port repeatedly according to the preset detection frequency, and each detection produces one second time delay, so one link has a plurality of second time delays in one detection period. The packet sending number and the packet receiving number of the first network port of the link are the accumulated totals over these detections. Each first network port may communicate with the second network ports of different second hosts to form different links, and the first time delay of a link is calculated from all second time delays of that link in the detection period. The first time delay may be, but is not limited to, the P99 time delay or the average time delay of the link in the detection period. The packet loss rate of a link is specifically the packet loss rate of that link in one detection period.
The first host has at least one first network port, and the health state of each first network port and of the corresponding network is obtained by jointly evaluating the first time delay and the packet loss rate of all links of that first network port. The health state of the first network port may be a normal state, a sub-health state, a connectionless state, or the like. When the health state of the first network port is a sub-health state, the health state of the corresponding network is also a sub-health state; when the health state of the first network port is a connectionless state, the health state of the corresponding network is also a connectionless state; when the health state of the first network port is normal, the health state of the corresponding network may still be a sub-health state, in which case the sub-health is not necessarily caused by the first network port and may instead be caused by a second network port communicating with it. In addition, sub-health states may have different causes. The health states of different first network ports of the same host, and of their corresponding networks, may be the same or different.
The system can alarm and treat according to the sub-health state of the network link, so that the problem of service performance reduction caused by the sub-health state of the network is solved.
In this embodiment, the links from each host network port to the other network ports in the cluster are determined, and detection packets are actively sent to collect the time delay and packet loss rate of each specified link, yielding the detection data. The states of all links in the cluster are analyzed according to the detection data, the sub-health state of a network link is judged using empirical thresholds, and the causes of the network sub-health state are accurately identified and analyzed. The approach is compatible with various application scenarios and provides a solid basis for quickly restoring the network to a healthy state.
In one embodiment, step S022 specifically includes:
if the first time delays of all links of the first network port exceed a first threshold value in a detection period, and/or the packet loss rates of all links of the first network port exceed a second threshold value, determining that the health states of the first network port and the corresponding network are sub-health states;
and recording first reason information corresponding to the sub-health state.
Specifically, the first threshold and the second threshold may each be a sensitive value (adjustable according to the actual situation) or an insensitive value (a fixed value). The first threshold is the upper limit for the time delay, and the second threshold is the upper limit for the packet loss rate. In this embodiment, the health state of the first network port is determined to be a sub-health state, that is, the first network port is a sub-health network port, and the health state of the network corresponding to the sub-health network port is also determined to be a sub-health state, in any of the following cases: the first time delays of all links exceed the first threshold; the packet loss rates of all links exceed the second threshold; or both conditions hold at the same time.
The first cause information may specifically be a link failure. The network sub-health state refers to a state of the network when a single link of the vs aggregation network has problems of packet loss, large time delay, low negotiation bandwidth and the like; although the above problems do not cause network disruption, network transmission performance is affected to varying degrees. When the network port is in a sub-health state, the corresponding network is also in the sub-health state, and the network port still can work, but the efficiency is low, and the performance is poor.
In one embodiment, the probe data includes a packet sending number and a packet receiving number of the corresponding link and a plurality of second time delays corresponding to a plurality of probe packets;
Step S020 further includes: calculating the average time delay of the first network port in the detection period according to the second time delays corresponding to all the links of the first network port, and calculating the packet loss rate of the first network port in the detection period according to the packet sending number and the packet receiving number of all the links of the first network port.
Specifically, the average time delay of the first network port may be the average of all second time delays of all links of the first network port in a detection period, or the average of the P99 time delays of all links of the first network port in a detection period. The P99 time delay is calculated as follows: the second time delays of a link in a detection period are sorted in ascending order, and the second time delay at the 99% position of the ordering is the P99 time delay of that link. The P99 time delays of all links of the first network port are then averaged to obtain the average time delay of the first network port. Alternatively, the second time delays are sorted in descending order, and the second time delay at the 1% position from the top of the ordering is taken as the P99 time delay.
The packet loss rate of the first network port is obtained by dividing the difference between the sum of the packet sending numbers of all links of the first network port and the sum of the packet receiving numbers of all links by the sum of the packet sending numbers of all links in a detection period.
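The port-level statistics can be sketched as follows. The function names are hypothetical, and the "99% position" is interpreted here with the nearest-rank rule (an assumption; the source does not say how fractional positions are rounded).

```python
from __future__ import annotations


def p99_delay(delays: list[float]) -> float:
    """P99 of one link: sort ascending and take the value at the 99% position."""
    if not delays:
        raise ValueError("no second time delays recorded for this link")
    ordered = sorted(delays)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based nearest rank
    return ordered[rank - 1]


def port_average_delay(link_delays: dict[str, list[float]]) -> float:
    """Average of the per-link P99 delays of one first network port."""
    p99s = [p99_delay(delays) for delays in link_delays.values()]
    return sum(p99s) / len(p99s)


def port_packet_loss_rate(link_counts: dict[str, tuple[int, int]]) -> float:
    """Port loss rate in percent from per-link (sent, received) counts."""
    total_sent = sum(sent for sent, _ in link_counts.values())
    total_received = sum(received for _, received in link_counts.values())
    return 100.0 * (total_sent - total_received) / total_sent


if __name__ == "__main__":
    delays = {"to-host-B": [float(i) for i in range(1, 101)], "to-host-C": [10.0]}
    print(port_average_delay(delays))
    print(port_packet_loss_rate({"to-host-B": (100, 90), "to-host-C": (100, 100)}))
```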
Of course, the packet loss rate of each link of the first network port, all the second time delays, and all the first time delays may also be recorded in the probing period.
In one embodiment, step S022 further comprises:
under the condition that the first time delays of all links of the first network interface are smaller than a first threshold and the packet loss rates of all links of the first network interface are smaller than a second threshold in a detection period, if the first time delays of all links of the first network interface exceed a third threshold and/or the packet loss rates of all links of the first network interface exceed a fourth threshold, determining that the health state of the network corresponding to the first network interface is a sub-health state, and recording second reason information corresponding to the sub-health state, wherein the third threshold is smaller than the first threshold, and the fourth threshold is smaller than the second threshold;
step S020 further includes:
and if the links with the first time delay smaller than the third threshold and the packet loss rate smaller than the fourth threshold exist in all the links of the first network interface in the detection period, determining that the health state of the first network interface is a normal state.
Specifically, under the condition that the first time delays of all links of the first network port are smaller than the first threshold and the packet loss rates of all links of the first network port are smaller than the second threshold in the detection period, if at least one of the following is satisfied, namely that the first time delays of all links of the first network port exceed the third threshold or that the packet loss rates of all links of the first network port exceed the fourth threshold, the health state of the network corresponding to the first network port is determined to be a sub-health state. The sub-health state determined in the foregoing embodiment, where the first time delays of all links of the first network port exceed the first threshold and/or the packet loss rates of all links exceed the second threshold, is more severe than the sub-health state of this embodiment.
When, among all links of the first network port, there is a link whose first time delay is smaller than the third threshold and whose packet loss rate is smaller than the fourth threshold, that link is a normal link. That is, within the detection period it is neither the case that the first time delays of all links of the first network port exceed the first threshold, nor that they all exceed the third threshold while remaining below the first threshold; likewise, it is neither the case that the packet loss rates of all links exceed the second threshold, nor that they all exceed the fourth threshold while remaining below the second threshold. In this case a link in the normal state exists on the first network port, so the first network port itself is normal, and any abnormal links of the first network port are caused by other network ports or other factors. Therefore, the health state of the first network port is determined and recorded as a normal state.
The first threshold, the second threshold, the third threshold and the fourth threshold may be sensitive values (adjustable values according to actual conditions) or insensitive values (fixed values). The third threshold is a lower limit value corresponding to the time delay and the fourth threshold is a lower limit value corresponding to the packet loss rate. The second cause information may specifically be exceeding a threshold.
If the health state of the network corresponding to the first network port is determined to be normal and the health state of the first network port is also normal, the state cause may be recorded as unknown or Null. When the network corresponding to the first network port is judged normal, some of the links of the first network port have a first time delay smaller than the third threshold and a packet loss rate smaller than the fourth threshold; because these links are normal, the network corresponding to the first network port is determined to be normal. An excessive first time delay or packet loss rate on the remaining links may therefore be caused by faults in the second network ports of other hosts or in the switch.
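The two-tier threshold judgment described in the embodiments above can be sketched as one decision function. The names `Link`, `Thresholds` and `classify_port` are hypothetical. Where the text does not specify a state (the port state in the second tier, and mixed cases that match no rule), the sketch returns None rather than guessing.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    first_delay: float  # first time delay of the link in the detection period
    loss_rate: float    # packet loss rate of the link, in percent


@dataclass(frozen=True)
class Thresholds:
    first: float   # upper limit for time delay
    second: float  # upper limit for packet loss rate
    third: float   # lower limit for time delay (smaller than first)
    fourth: float  # lower limit for packet loss rate (smaller than second)


def classify_port(links: list[Link], th: Thresholds) -> tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (port_state, network_state, cause); None means not specified."""
    if not links:
        raise ValueError("the first network port has no links")
    # Tier 1: all links exceed the upper delay limit and/or the upper loss limit.
    if all(l.first_delay > th.first for l in links) or all(l.loss_rate > th.second for l in links):
        return "sub-health", "sub-health", "link failure"
    # Tier 2: everything is below the upper limits but all links exceed a lower limit.
    below_upper = all(l.first_delay < th.first and l.loss_rate < th.second for l in links)
    if below_upper and (
        all(l.first_delay > th.third for l in links) or all(l.loss_rate > th.fourth for l in links)
    ):
        return None, "sub-health", "exceeding a threshold"
    # At least one fully healthy link exists: the port itself is normal.
    if any(l.first_delay < th.third and l.loss_rate < th.fourth for l in links):
        return "normal", "normal", None
    return None, None, None


if __name__ == "__main__":
    limits = Thresholds(first=10.0, second=5.0, third=2.0, fourth=1.0)
    print(classify_port([Link(1.0, 0.0), Link(20.0, 0.0)], limits))
```

The threshold values in the usage example are placeholders chosen only to exercise the branches; the source leaves the actual values configurable.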
In one embodiment, the probing data includes a packet sending number and a packet receiving number of a corresponding link in the probing period, and a plurality of second delays corresponding to a plurality of probing packets in the corresponding link, and step S020 further includes: and determining the average time delay of the first network interface in the detection period according to the detection data, and determining the packet loss rate of the first network interface in the detection period according to the detection data.
Specifically, when the first port is determined to be normal or in a sub-health state, the average time delay of the first port in the detection period may be calculated, and the packet loss rate of the first port in the detection period may be calculated.
The average time delay of the first network port may be the average of all second time delays of all links of the first network port in the detection period, or the average of the P99 time delays of all links in one detection period. The P99 time delay is calculated as follows: the second time delays of a link in a detection period are sorted in ascending order, and the second time delay at the 99% position of the ordering is the P99 time delay of that link. The P99 time delays of all links of the first network port are then averaged to obtain the average time delay of the first network port. Alternatively, the second time delays are sorted in descending order, and the second time delay at the 1% position from the top of the ordering is taken as the P99 time delay.
The packet loss rate of the first network port is obtained by dividing the difference between the sum of the packet sending numbers of all links of the first network port and the sum of the packet receiving numbers of all links by the sum of the packet sending numbers of all links in a detection period.
The average time delay of the first network port in the detection period and/or all second time delays of the first network port in the detection period, together with the packet loss rate of the first network port in the detection period, can be recorded in a log file for output, so that engineers can quickly troubleshoot and locate the cause of a network fault from the output log file.
In one embodiment, before step S021, S020 further includes:
and generating a detection list through cluster topology based on network aggregation mode information between the first host and a second host in the cluster, wherein the detection list comprises link information of each link between the first host and the second host in the cluster, and the link information comprises network port information of a first network port of the first host and network port information of a second network port of the second host, corresponding to the first network port, for receiving the packet.
Specifically, the network aggregation mode information includes a single switch link aggregation mode and a dual-switch link aggregation mode. In the single switch link aggregation mode, the hosts in the cluster communicate with each other through the same switch, and there is at least one link between any two hosts through that switch. In the dual-switch link aggregation mode, any two hosts in the cluster can communicate with each other through two switches, and for those two hosts the links through different switches are different links.
Specifically, the first host sends a cluster topology detection request to each second host in the cluster, and generates a detection list according to host information returned by the second hosts. The host information includes a host name, an IP address, a network port included in the host, and a Mac address of the network port.
Each second host includes at least one second portal.
Step S021 specifically includes:
and sending a plurality of detection packets to the corresponding second network ports through the first network ports of the first host according to the detection list and the preset detection frequency.
Specifically, in the single switch link aggregation mode, the MAC addresses of the two network ports are different, and which second network port of the second host receives the packet is determined by the MAC address of the target (the second host). In the dual-switch link aggregation mode, the MAC addresses of the two network ports are the same but the two links are different, so when a specified network port sends a packet, the receiving side is necessarily a fixed network port. For example, when host A designates the first network port A-eth3 to send a packet, the second network port of host B that receives it is necessarily B-eth3; after receiving the detection packet, the second network port B-eth3 of host B returns a reply packet to the first network port A-eth3 through the second network port B-eth3.
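Building the detection list from the cluster topology can be sketched as below. The data model (`Port`, `LinkInfo`) and the pairing rules are assumptions: for the dual-switch mode the sketch pairs ports of the same name, following the A-eth3 to B-eth3 example; for the single switch mode the source only says that the target MAC address selects the receiving port, so the sketch simply enumerates every port pair.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    mac: str


@dataclass(frozen=True)
class LinkInfo:
    second_host: str
    first_port: Port   # sending network port on the first host
    second_port: Port  # receiving network port on the second host


def build_detection_list(
    first_ports: list[Port], second_hosts: dict[str, list[Port]], dual_switch: bool
) -> list[LinkInfo]:
    """Enumerate the links to probe from the first host to every second host."""
    links = []
    for host, ports in second_hosts.items():
        for first_port in first_ports:
            for second_port in ports:
                # Assumption: in dual-switch mode only same-named ports form a link.
                if dual_switch and second_port.name != first_port.name:
                    continue
                links.append(LinkInfo(host, first_port, second_port))
    return links


if __name__ == "__main__":
    local = [Port("eth2", "02:00:00:00:0a:02"), Port("eth3", "02:00:00:00:0a:03")]
    remote = {"B": [Port("eth2", "02:00:00:00:0b:02"), Port("eth3", "02:00:00:00:0b:03")]}
    for link in build_detection_list(local, remote, dual_switch=True):
        print(link.first_port.name, "->", link.second_host, link.second_port.name)
```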
In one embodiment, before step S021, step S020 further includes:
and determining whether the first network port is an available network port or not according to the acquired network port state of the first network port of the first host, wherein the network port state comprises a connection state and a non-connection state.
Step S021 specifically includes:
and sending a plurality of detection packets to corresponding second hosts in the cluster through links formed by the available network ports and the corresponding second network ports according to the preset detection frequency.
Specifically, the network port state (link state) of the first network port is detected, that is, the network connectivity of the first network port is detected. The first host calls a connectivity detection command, for example realethtool eth4 | grep "Link detected" | awk -F: '{print $2}', to obtain the network port state of the first network port. If the connectivity detection command returns yes, the network port state of the first network port is a connection state, and the first network port is judged to be an available network port. If the connectivity detection command returns no, the network port state of the first network port is a connectionless state, and the first network port is judged to be an unavailable network port.
In this embodiment, the connectivity of each network port is detected in advance. Only a first network port that is an available network port is used to send detection packets to the corresponding second network port of the second host; a first network port that is an unavailable network port does not send detection packets. This reduces invalid probing and reduces the interference of unavailable network ports, or of connectivity problems, with the network port health analysis. At the same time, the first network ports receive a preliminary diagnosis in advance.
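The same check can be expressed by parsing the tool's text output directly. This sketch assumes the output follows the usual ethtool layout, with a line of the form `Link detected: yes`; the function name and the sample text are illustrative.

```python
def parse_link_detected(ethtool_output: str) -> bool:
    """Return True when the output reports 'Link detected: yes'."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Link detected":
            return value.strip() == "yes"
    raise ValueError("no 'Link detected' line found")


if __name__ == "__main__":
    sample = "Settings for eth4:\n\tSpeed: 10000Mb/s\n\tDuplex: Full\n\tLink detected: yes\n"
    print(parse_link_detected(sample))
```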
In one embodiment, before step S021, step S020 further includes:
determining whether the first network port is an available network port or not according to the acquired network port state of the first network port of the first host, wherein the network port state comprises a connection state and a non-connection state,
and acquiring the negotiated bandwidth and the rated bandwidth of the available network port, and determining the negotiated bandwidth degradation of the corresponding available network port if the negotiated bandwidth is smaller than the corresponding rated bandwidth.
Step S021 specifically includes:
and sending a plurality of detection packets to a corresponding second host in the cluster through a link formed by an available network port with undegraded negotiation bandwidth and a corresponding second network port according to a preset detection frequency.
Specifically, after detecting that the port status of the first port is a connection status, that is, the first port is an available port, it is further required to detect whether the negotiated bandwidth of the first port is normal. The command for acquiring the network port negotiation bandwidth is any one of the following commands:
realethtool eth4 | grep Speed | awk -F: '{print $2}',
realethtool eth4 | grep Duplex | awk -F: '{print $2}'.
If the obtained negotiated bandwidth of the first network port is smaller than the corresponding rated bandwidth, it is determined that the bandwidth negotiation of the first network port is abnormal, that is, the negotiated bandwidth is degraded, and the first network port is a faulty network port. Ultimately, only a first network port that is both an available network port and has a non-degraded negotiated bandwidth sends detection packets to the corresponding second network port. The rated bandwidth is the bandwidth of the physical network card as reported by lspci. This embodiment detects the connectivity and the negotiated bandwidth of the first network port in advance, which further eliminates the interference of connectivity and negotiated bandwidth with the health diagnosis of the network port.
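The bandwidth comparison can be sketched as follows, assuming the tool prints the negotiated speed in the usual ethtool form `Speed: 10000Mb/s` (and `Speed: Unknown!` when no speed is negotiated). The function names are hypothetical, and the rated bandwidth is passed in as a number rather than read from lspci.

```python
import re
from typing import Optional


def parse_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Extract the negotiated speed in Mb/s, or None if it is not reported."""
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_bandwidth_degraded(negotiated_mbps: int, rated_mbps: int) -> bool:
    """Negotiated bandwidth below the rated bandwidth counts as degraded."""
    return negotiated_mbps < rated_mbps


if __name__ == "__main__":
    sample = "Settings for eth4:\n\tSpeed: 1000Mb/s\n\tDuplex: Full\n"
    speed = parse_speed_mbps(sample)
    print(speed, is_bandwidth_degraded(speed, rated_mbps=10000))
```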
In one embodiment, step S020 further includes: and if the network port state of the first network port is connectionless, determining that the health state of the network corresponding to the first network port is connectionless, and determining third cause information corresponding to the connectionless state.
In one embodiment, step S020 further includes: and if the first network port negotiates bandwidth degradation, determining that the health state of the network corresponding to the first network port is a sub-health state, and determining fourth reason information corresponding to the sub-health state.
Specifically, for a first port that is an unavailable port, since the probe packet is not sent to a second port, and the health state of the network corresponding to the port can be directly determined, the health state of the first port is directly recorded as a no-connection (no link) state, and the third cause may specifically be a no-connection (no link).
If the first network port is an available network port but its negotiated bandwidth is degraded, the health state of the corresponding network is directly recorded as a sub-health state, and no detection packet needs to be sent to the second network port. The fourth cause information corresponding to the sub-health state is recorded as rate negotiation failed (SpeedNegotiation Failed).
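The pre-probe decision described in the preceding paragraphs can be summarized in a small sketch. The function name and the returned tuple layout are assumptions; the state and cause strings follow the text.

```python
from __future__ import annotations

from typing import Optional


def precheck_port(link_detected: bool, bandwidth_degraded: bool) -> tuple[bool, Optional[str], Optional[str]]:
    """Return (should_probe, network_health_state, cause) before any probing."""
    if not link_detected:
        # Unavailable network port: recorded directly, no detection packets sent.
        return False, "no link", "no link"
    if bandwidth_degraded:
        # Available but degraded: sub-health is recorded without probing.
        return False, "sub-health", "SpeedNegotiation Failed"
    # Available and not degraded: health is decided later from probe results.
    return True, None, None


if __name__ == "__main__":
    print(precheck_port(link_detected=True, bandwidth_degraded=True))
```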
In one embodiment, step S020 further includes:
if the health state of the network corresponding to the first network port is a sub-health state, monitoring network port operation data of the first network port;
determining whether each operation index in the network port operation data is increased in a preset time period according to the network port operation data, wherein the operation indexes of the network port operation data comprise a first error packet number and a second error packet number, the first error packet number is the total number of error packets generated by multiple reasons, and the second error packet number is the number of error packets generated by the same reason;
and if the operation indexes with increased numerical values exist in the network port operation data in the preset time period, updating the first reason information corresponding to the sub-health state into fifth reason information.
Specifically, if the health state of the network corresponding to the first portal is a sub-health state, deep analysis and mining are continued to be performed to find the reason for the sub-health state of the first portal.
The first number of error packets includes fifo_errors, that is, the total count of buffer error packets; it comprises rx_fifo_errors (the error packets counted on the receive queue) and tx_fifo_errors (the error packets counted on the send queue). This covers error packets resulting from too-long-frames errors, Ring Buffer overflow errors, CRC check errors, frame sync errors, FIFO errors, missedpkg, and so on.
The second number of error packets is the number of error packets generated by one specific cause, such as overruns. Overruns indicate a receive-queue (FIFO) overflow error: a computer may produce overruns when more packets arrive than the kernel can handle. More specifically, a packet is dropped before it enters the FIFO queue of the network card because the FIFO is already full. Because the system is busy and cannot respond to the network card interrupt in time, the packets in the network card are not copied to system memory in time; once the FIFO is full, subsequent packets cannot enter and are dropped by the network card hardware. This happens when the IO delivered through the Ring Buffer (aka Driver Queue) exceeds what the kernel can process, where the Ring Buffer refers to the buffer in front of the IRQ request. An increase in overruns therefore means that packets are discarded at the physical layer of the network card before reaching the Ring Buffer, and the Ring Buffer is full because the CPU cannot service the interrupts in time.
The first host can view the discarded packet statistics through ethtool or /proc/net/dev; the statistics items are identified by errors:
realethtool -S eth4 | grep tx_fifo | awk -F: '{print $1}',
realethtool -S eth4 | grep rx_fifo | awk -F: '{print $2}'.
The first host may obtain the overruns value with the following commands:
for i in `seq 1 100`; do ifconfig ethX | grep RX | grep overruns | awk '{print $3}' | awk -F: '{print $2}'; sleep 1; done,
for i in `seq 1 100`; do ifconfig ethX | grep TX | grep overruns | awk '{print $3}' | awk -F: '{print $2}'; sleep 1; done.
The preset time period may be set to 2s, 1 minute, 2 minutes, etc. without being limited thereto.
If the value of at least one operation index is increased within a preset time period, the reason causing the sub-health of the network corresponding to the first network port is updated from the first reason information to fifth reason information, and the fifth reason information can be specifically network port failure (interface fault).
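The counter check described above can be illustrated with a minimal sketch. It assumes that the counters are read from text in the layout of /proc/net/dev (eight receive fields followed by eight transmit fields per port, with the fifo counter as the fifth field of each group). The function names parse_fifo_counters, increased_indexes and cause_after_counter_check are hypothetical and are not part of this application.

```python
def parse_fifo_counters(proc_net_dev_text, port):
    """Return the rx/tx fifo error counters of one port from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # Header lines contain no colon and are skipped.
        if sep and name.strip() == port:
            fields = [int(x) for x in rest.split()]
            # Assumed layout: 8 receive fields, then 8 transmit fields.
            return {"rx_fifo_errors": fields[4], "tx_fifo_errors": fields[12]}
    raise KeyError(port)


def increased_indexes(before, after):
    """Names of operation indexes whose value grew between two samples."""
    return sorted(k for k in before if after.get(k, before[k]) > before[k])


def cause_after_counter_check(before, after, current_cause):
    """Update the cause to 'interface fault' if any counter grew in the period."""
    return "interface fault" if increased_indexes(before, after) else current_cause
```

The two samples would be taken at the start and at the end of the preset time period; how the period is scheduled is left open here.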
In one embodiment, step S020 further includes:
if the operation indexes with increased numerical values do not exist in the network port operation data in the preset time period, detecting whether the first network port comprises a corresponding optical module or not;
if the first network port comprises the corresponding optical module, detecting whether the temperature, the input power and the output power of the optical module are normal;
and if at least one of the temperature, the input power and the output power of the optical module is abnormal, updating the fifth cause information corresponding to the sub-health state into sixth cause information.
Specifically, if no operation index of the first network port increases within the preset time period, further analysis is performed to find the reason why the first network port is sub-healthy.
The first host automatically detects whether the first network port includes an optical module and, if so, a plurality of operating parameters of the optical module.
The first host detects the optical module information and whether the optical module types at the two ends match, using the following commands:
realethtool -m ethX | grep "driver type", which checks the optical module transmission type;
realethtool -m ethX | grep "Length (OM3)", which checks the transmission distance;
realethtool -m ethX | grep "Laser wavelength", which checks the optical module wavelength.
If the command returns an error, no optical module is present; otherwise, an optical module is present. If there is no optical module, the health state of the network corresponding to the first network port is recorded as still sub-health, and the fifth cause information is kept unchanged.
If the optical module exists, its temperature, input power and output power can be obtained through the ethtool -m command. Whether the temperature, the input power and the output power of the optical module are normal is then judged; if at least one of them is abnormal, the health state of the network corresponding to the first network port is determined to still be a sub-health state, and the fifth cause information is updated to sixth cause information. The sixth cause information is specifically an optical module fault.
If the optical module exists and the temperature, the input power and the output power of the optical module are normal, recording that the health state of the network corresponding to the first network port is still a sub-health state, and recording seventh reason information corresponding to the sub-health state at the moment, wherein the seventh reason information may be a link fault (link fault).
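The three outcomes of the optical module check can be summarized in a short sketch. The function name classify_optical_cause and the boolean inputs are hypothetical; how "normal" is decided for temperature and power (for example, which limits apply) is not specified here and is left to the caller.

```python
def classify_optical_cause(module_present, temperature_ok, rx_power_ok, tx_power_ok):
    """Map the optical-module checks to the cause recorded for a sub-healthy port."""
    if not module_present:
        # No optical module: the fifth cause information is kept unchanged.
        return "interface fault"
    if not (temperature_ok and rx_power_ok and tx_power_ok):
        # At least one parameter is abnormal: sixth cause information.
        return "optical module fault"
    # Module present and all parameters normal: seventh cause information.
    return "link fault"
```

In every branch the health state of the port remains sub-health; only the recorded cause changes.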
In one embodiment, step S020 further includes:
counting the number of fault network ports of a first network port of which the corresponding network state is a sub-health state in a first host;
under the condition that the network aggregation mode of the cluster is a single switch link aggregation mode and the binding mode of the network ports is a seventh mode, if the number of the failed network ports is the total number of the first network ports contained in the first host, updating the reason information corresponding to the sub-health state to be a single switch failure;
under the condition that the network aggregation mode of the cluster is a double-switch link aggregation mode and the binding mode of the network ports is a first mode, if the number of the failed network ports is less than the total number of the first network ports contained in the first host, updating the reason information corresponding to the sub-health state to be a single-switch failure under the double-switch mode,
and if the number of the failed network ports is the total number of the first network ports contained in the first host, updating the reason information corresponding to the sub-health state to be the double-switch failure in the double-switch mode.
Specifically, the first mode is mode 0, i.e. (balance-rr) Round-robin policy. Its characteristics are: data packets are transmitted in sequence (i.e. the 1st packet goes through eth0, the next packet goes through eth1, and so on in a loop until the last packet is transmitted), and this mode provides load balancing and fault tolerance. However, if the packets of one connection or session are sent from different interfaces and pass through different links, they are likely to arrive out of order at the client, and packets arriving out of order need to be sent again, so the throughput of the network is reduced. The seventh mode is mode 6, i.e. (balance-alb) Adaptive load balancing. Its characteristics are: the mode includes the balance-tlb mode and adds receive load balancing (rlb) for IPv4 traffic, and it does not require any switch support. Receive load balancing is achieved through ARP negotiation. The bonding driver intercepts the ARP responses sent by the local machine and rewrites the source hardware address to the unique hardware address of one of the slaves in the bond, so that different peers use different hardware addresses for communication.
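The switch-level classification in this embodiment can be sketched as follows. The function name classify_switch_fault and the string labels are hypothetical. The sketch also assumes that the dual-switch rule for "fewer than all ports failed" only applies when at least one port has failed, which the text does not state explicitly.

```python
def classify_switch_fault(aggregation, bond_mode, failed_ports, total_ports):
    """Return the switch-level cause, or None when no rule applies.

    aggregation: "single" or "dual" (switch link aggregation mode)
    bond_mode:   numeric bonding mode (0 = first mode, 6 = seventh mode)
    """
    if aggregation == "single" and bond_mode == 6:
        if failed_ports == total_ports:
            return "single-switch failure"
    elif aggregation == "dual" and bond_mode == 0:
        if failed_ports == total_ports:
            return "dual-switch failure in dual-switch mode"
        # Assumption: require at least one failed port before blaming a switch.
        if 0 < failed_ports < total_ports:
            return "single-switch failure in dual-switch mode"
    return None
```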
In an embodiment, the network aggregation mode includes a dual switch link aggregation mode, the network analysis data includes health states of a plurality of first ports in the first host and link states of links between the plurality of first ports and the second port of the second host in the cluster, fig. 7 is a schematic flow diagram of a method for acquiring network analysis data in another embodiment of the present application, and the step S020 of acquiring the network analysis data corresponding to the plurality of first ports of the first host in the network aggregation mode specifically includes:
S021: sending a plurality of detection packets to a corresponding second host in the cluster through at least one link between the first host and the second host according to a preset detection frequency to obtain detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
S022: determining the health state of the first network port according to the first time delay and the packet loss rate of all links of the first network port in the detection period, wherein the health state comprises a normal state and a sub-health state;
S023: and determining the link state of the corresponding link according to the first time delay and the packet loss rate of the corresponding link, wherein the link state comprises a normal state and an abnormal state.
Specifically, the steps S021-S022 are referred to the above steps, and are not described herein. Step S023 specifically includes: if the first time delay corresponding to the link is smaller than a third threshold and the packet loss rate corresponding to the link is smaller than a fourth threshold, determining that the link state of the link is a normal state; and if the first time delay corresponding to the link is not less than the third threshold and/or the packet loss rate corresponding to the link is not less than the fourth threshold, determining that the link state of the link is an abnormal state.
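Step S023 can be sketched as below. The sketch assumes that the first time delay of a link is the mean delay of the probes received in the detection period, which is one possible reading. The names ProbeResult, loss_and_delay and link_state, and any threshold values passed in, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    sent: int        # detection packets sent on the link in the period
    received: int    # replies received
    delays_ms: list  # delays of the received replies, in milliseconds


def loss_and_delay(p):
    """Packet loss rate and (assumed) mean delay for one link."""
    loss = 1.0 - p.received / p.sent if p.sent else 1.0
    delay = sum(p.delays_ms) / len(p.delays_ms) if p.delays_ms else float("inf")
    return loss, delay


def link_state(delay_ms, loss_rate, third_threshold_ms, fourth_threshold):
    """Normal only if both values are strictly below their thresholds."""
    ok = delay_ms < third_threshold_ms and loss_rate < fourth_threshold
    return "normal" if ok else "abnormal"
```

A link with no replies gets an infinite delay and a loss rate of 1.0, so it is classified as abnormal for any finite thresholds.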
In one embodiment, the network handling method further comprises the following steps:
and acquiring and outputting a network log of the first host and sending first alarm information, wherein the network log comprises a first network port of the first host, a health state of a corresponding network and corresponding reason information.
Specifically, the reason information corresponding to the sub-health status or the no-connection status is determined with reference to the previous steps, which are not described herein again. In this embodiment, the network operation and maintenance personnel are reminded to intervene by outputting the log and the alarm. The log analysis and the reason output help operation and maintenance personnel quickly locate the fault and then resolve it, for example by replacing or reconfiguring a network card, an optical module or a switch, or by bringing the network back online.
The first alarm information is specifically to send an alarm prompt to a management end corresponding to the cluster so as to remind network operation and maintenance personnel to timely process and repair the first network port with the health state being the sub-health state or the sub-health network corresponding to the sub-health network port. The first warning information can be more specifically text prompt information and/or voice prompt information on the terminal equipment.
In one embodiment, the network handling method further comprises the following steps:
and if the first host has a target network port whose health state is a sub-health state and does not meet the disposal condition, sending out second alarm information.
Specifically, if the first host has a sub-healthy network port but does not meet the disposal condition, the second alarm information is sent out. The second alarm information reports to the network operation and maintenance personnel the reason why the disposal cannot be carried out. For example, the reason may be that no link is available: after the sub-healthy network port is isolated, the first host cannot communicate normally with all the second hosts in the cluster through the first network ports that are not isolated. The second alarm information can also provide the operation and maintenance personnel with more information about the network sub-health, so as to guide them to quickly recover the network.
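The "no available link" example can be sketched as a reachability check. This is only an illustration of that one example, not the full disposal condition. The data layout (a mapping from (local port, peer host) to link state) and the names unreachable_peers_after_isolation and second_alarm are hypothetical.

```python
def unreachable_peers_after_isolation(links, sub_healthy_ports):
    """Peers that no remaining port could reach over a normal link.

    links maps (local_port, peer_host) -> "normal" or "abnormal".
    """
    peers = {peer for _, peer in links}
    reachable = {
        peer
        for (port, peer), state in links.items()
        if port not in sub_healthy_ports and state == "normal"
    }
    return sorted(peers - reachable)


def second_alarm(links, sub_healthy_ports):
    """Return an alarm message when isolation would cut off a peer, else None."""
    missing = unreachable_peers_after_isolation(links, sub_healthy_ports)
    if sub_healthy_ports and missing:
        return "cannot isolate: no available link to " + ", ".join(missing)
    return None
```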
In addition, a cluster management interface is arranged at a management end corresponding to the cluster, an isolation button is arranged on the cluster management interface, and a user sends an isolation instruction to the first host computer by triggering the isolation button to realize manual isolation.
In order to solve the problem that service performance may be reduced due to network sub-health, the application provides a network handling method suitable for a distributed storage aggregation network. Specifically, all links are accurately identified and detection packets are actively sent, and the detection data obtained for the links are used to analyze and judge the health state of the network port (or of a disconnected network port) and the link state of each link of the network port. Then, according to the analysis result, the sub-healthy network port is handled when a sub-healthy network port exists and the disposal condition is met. The handling includes automatically isolating the sub-healthy network port, that is, isolating the sub-healthy links corresponding to it, so that the network and the service performance recover to a normal state, and outputting logs and raising an alarm. There are two methods of isolation: (1) deactivating the sub-healthy network port (ifconfig down); (2) removing the sub-healthy network port from the bond (binding group) of the aggregated network port, so that only the available normal network ports remain in the bond and the first host sends packets through the normal links corresponding to the normal network ports and no longer through the sub-healthy link. Through the log and the alarm, the sub-health reasons can be output to help the operation and maintenance personnel quickly locate the fault and to remind them to intervene, for example by replacing or reconfiguring network cards, optical modules or switches, or by rewiring the network, without being limited thereto.
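The two isolation methods can be illustrated by a sketch that only builds the command lines and does not execute them. The text names ifconfig down for the first method; the use of ifenslave -d for the second method, the default bond name bond0, and the function name isolation_command are assumptions made for illustration.

```python
def isolation_command(port, method, bond="bond0"):
    """Build (but do not run) the command for one of the two isolation methods."""
    if method == "deactivate":
        # Method (1): deactivate the sub-healthy network port.
        return ["ifconfig", port, "down"]
    if method == "remove":
        # Method (2): detach the port from the bond (assumed tool: ifenslave -d).
        return ["ifenslave", "-d", bond, port]
    raise ValueError("unknown isolation method: " + method)
```

Returning the argument list rather than running it keeps the sketch free of side effects; a caller could pass the list to subprocess.run on a host where these tools are installed.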
The network handling method of this embodiment is applied to handling sub-health of a distributed storage network. For the specific scenarios of single-switch and dual-switch link aggregation, the connectivity of the appropriate network ports and the time delay and packet loss rate of the links are actively detected, and the health state and sub-health cause of each first network port and its corresponding network, as well as the link state of each link of the first network port, are analyzed according to the connectivity, the time delay and the packet loss rate; whether the first host meets a disposal condition is determined according to the health state and the link state of the first network port, and the sub-healthy network port is handled when the disposal condition is met. On the premise that the network is not interrupted and the service performance is not affected, sub-health link alarms and link isolation can be carried out quickly and automatically, ensuring that the service performance is restored to the normal level.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 8 is a block diagram of a network recovery apparatus according to an embodiment of the present application. Referring to fig. 8, the network recovery apparatus includes:
a status data obtaining module 100, configured to obtain isolation state information of a target network port in the first host, where the isolation state information indicates an isolation state in which the target network port is isolated from a cluster network;
a monitoring module 200, configured to obtain monitoring data corresponding to the isolation state;
and a recovery module 300, configured to release the isolation of the target network port according to the monitoring data.
In one embodiment, the isolation state includes a deactivated state, the monitoring data includes a deactivation duration of the target network port, and the recovery module 300 specifically includes:
the first judgment module is used for determining whether the deactivation duration reaches a first preset duration;
and the first recovery module is used for enabling the target network port if the first preset duration is reached.
In one embodiment, the isolation state comprises a deactivated state, and the monitoring data comprises a recovery instruction from a user;
a monitoring module 200, specifically configured to receive a recovery instruction from a user in the deactivated state;
the recovery module 300 is specifically configured to enable the target network port according to the recovery instruction.
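The two recovery triggers for the deactivated state (elapsed deactivation duration and a recovery instruction from a user) can be sketched together. Combining them in one function is an illustrative choice, since they are described as separate embodiments, and the name release_deactivated is hypothetical.

```python
def release_deactivated(deactivated_seconds, first_preset_duration, user_requested=False):
    """Decide whether a deactivated target network port should be enabled again."""
    if user_requested:
        # A recovery instruction from a user releases the isolation directly.
        return True
    # Otherwise wait until the deactivation duration reaches the preset duration.
    return deactivated_seconds >= first_preset_duration
```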
In one embodiment, the isolation state comprises a removal state, the removal state indicates that the target network port is removed from the binding group corresponding to the first host, and the monitoring data comprises a health state of the target network port;
a monitoring module 200, configured to obtain a health state of the isolated target network port if the duration for which the target network port has been removed from the binding group exceeds a second preset duration;
the recovery module 300 is specifically configured to, if the health state of the target network port is a normal state, add the target network port to the binding group of the first host again.
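Recovery from the removal state can be sketched as follows. The health probe is passed in as a callable and the binding group is modelled as a plain set; both, along with the name try_rejoin, are hypothetical stand-ins for the host's real detection and bonding operations.

```python
def try_rejoin(port, removed_seconds, second_preset_duration, probe_health, bond_members):
    """Re-add a removed port to the binding group once it is healthy again.

    probe_health: callable returning "normal" or "sub-health" for a port.
    bond_members: set of port names currently in the binding group.
    """
    if removed_seconds <= second_preset_duration:
        return False  # not removed long enough; do not probe yet
    if probe_health(port) != "normal":
        return False  # still sub-healthy; keep it isolated
    bond_members.add(port)
    return True
```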
In one embodiment, the isolation state comprises a deactivated state or a removal state, the removal state indicates that the target network port is removed from the binding group corresponding to the first host, and the monitoring data comprises a kernel network event of the first host;
a monitoring module 200, configured to obtain a kernel network event of the first host if the isolation state is a deactivated state or a removal state;
the recovery module 300 is specifically configured to determine, according to the kernel network event, whether a normal network port in a normal state in the first host is disconnected, and if so, release the isolation of the target network port.
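The event-driven release can be sketched with kernel network events modelled as simple (port, state) pairs. How the events are actually obtained from the kernel is not shown, and the pair format and the name ports_to_release are assumptions for illustration.

```python
def ports_to_release(events, normal_ports, isolated_ports):
    """Return the isolated ports to release when a normal port goes down.

    events: iterable of (port, state) pairs, e.g. ("eth0", "down").
    """
    lost_normal_port = any(
        port in normal_ports and state == "down" for port, state in events
    )
    # Losing a healthy port makes a sub-healthy link preferable to no link.
    return sorted(isolated_ports) if lost_normal_port else []
```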
In an embodiment, the recovery module 300 specifically includes:
the calling unit is used for calling a network port enabling instruction to enable the isolated target network port;
a connectivity detection unit, configured to detect whether the target network port is connected;
and the recall unit is used for determining, if the target network port is not connected, whether the number of times the network port enabling instruction has been called exceeds a threshold, and calling the network port enabling instruction again if the threshold is not exceeded.
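The call, check and recall sequence of these three units can be sketched as a bounded retry loop. The enabling instruction and the connectivity check are injected as callables, and the name enable_with_retry is hypothetical.

```python
def enable_with_retry(enable, is_connected, threshold):
    """Call the enabling instruction until the port is connected or calls exceed threshold.

    Returns (connected, number_of_calls).
    """
    calls = 0
    while True:
        enable()          # calling unit
        calls += 1
        if is_connected():  # connectivity detection unit
            return True, calls
        if calls > threshold:  # recall unit: stop once the call count exceeds the threshold
            return False, calls
```

With this reading of "exceeds", a threshold of N allows at most N + 1 calls before the loop gives up.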
In one embodiment, the network recovery apparatus further comprises:
the first data acquisition module is used for acquiring network aggregation mode information of the cluster, and the network aggregation mode information is used for determining the network aggregation mode of the cluster;
the second data acquisition module is used for acquiring network analysis data and disposal condition information corresponding to a plurality of first network ports of the first host in the network aggregation mode;
a disposal condition determining module, configured to determine, according to the network analysis data, whether the first host meets the disposal condition indicated by the disposal condition information;
and the isolation module is used for isolating a target network port among the first network ports from the cluster network if the disposal condition is met, wherein the target network port is in a sub-health state.
In one embodiment, the network aggregation mode includes a single switch link aggregation mode, the network analysis data includes health states of a plurality of first network ports in the first host, and the second data obtaining module specifically includes:
the detection module is used for sending a plurality of detection packets to a corresponding second host in the cluster through at least one link between the first host and the second host according to a preset detection frequency, obtaining detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
and the analysis module is used for determining the health state of the first network port according to the first time delay and the packet loss rate of all links of the first network port in the detection period, wherein the health state comprises a sub-health state and a normal state.
In one embodiment, the network aggregation mode includes a dual switch link aggregation mode, the network analysis data includes health statuses of a plurality of first ports in the first host and link statuses of links between the plurality of first ports and a second port of a second host in the cluster, and the second data obtaining module specifically includes:
the detection module is used for sending a plurality of detection packets to a corresponding second host in the cluster through at least one link between the first host and the second host according to a preset detection frequency, obtaining detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
the first analysis module is used for determining the health state of the first network port according to the first time delay and the packet loss rate of all links of the first network port in the detection period, wherein the health state comprises a sub-health state and a normal state;
and the second analysis module is used for determining the link state of the corresponding link according to the first time delay and the packet loss rate of the corresponding link, wherein the link state comprises a normal state and an abnormal state.
The terms "first" and "second" in the above modules/units are only used to distinguish different modules/units, and are not used to define which module/unit has higher priority or to impose any other limitation. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus. The division of modules presented in this application is merely a logical division and may be implemented in other ways in a practical application.
For specific limitations of the network recovery apparatus, reference may be made to the above limitations of the network recovery method, which are not described herein again. The modules in the network recovery apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, and can also be stored in a memory in the computer device in software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 9 is a block diagram illustrating an internal structure of a computer device according to an embodiment of the present application. The computer device may specifically be the host computer in fig. 1. As shown in fig. 9, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory includes a storage medium and an internal memory. The storage medium may be a nonvolatile storage medium or a volatile storage medium. The storage medium stores an operating system and may also store computer readable instructions that, when executed by the processor, may cause the processor to implement a network recovery method. The internal memory provides an environment for the operating system and execution of computer readable instructions in the storage medium. The internal memory may also have computer readable instructions stored therein that, when executed by the processor, cause the processor to perform a network recovery method. The network interface of the computer device is used for communicating with an external server through a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided, which includes a memory, a processor, and computer readable instructions (e.g., a computer program) stored on the memory and executable on the processor, and when the processor executes the computer readable instructions, the steps of the network recovery method in the above embodiments are implemented, for example, steps S100 to S300 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer readable instructions, implements the functions of the modules/units of the network recovery apparatus in the above embodiments, such as the functions of the modules 100 to 300 shown in fig. 8. To avoid repetition, further description is omitted here.
The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general purpose processor may be a microprocessor or any conventional processor or the like; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store computer readable instructions and/or modules, and the processor implements various functions of the computer device by running or executing the computer readable instructions and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, video data, etc.) created according to the use of the computer device.
The memory may be integrated in the processor or may be provided separately from the processor.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, on which computer readable instructions are stored, which when executed by a processor implement the steps of the network recovery method in the above embodiments, such as steps S100 to S300 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the computer readable instructions, when executed by the processor, implement the functions of the modules/units of the network recovery apparatus in the above embodiments, such as the functions of the modules 100 to 300 shown in fig. 8. To avoid repetition, further description is omitted here.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the embodiments described above may be implemented by computer readable instructions instructing the relevant hardware; the instructions may be stored in a computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present application may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (12)

1. A network recovery method applied to a first host that is a local host, the method comprising:
acquiring isolation state information of a target network port in the first host, wherein the isolation state information represents an isolation state in which the target network port is isolated from a cluster network;
acquiring monitoring data corresponding to the isolation state;
and releasing the isolation of the target network port according to the monitoring data.
2. The method of claim 1, wherein the isolation state comprises a deactivated state, the monitoring data comprises a deactivation duration of the target network port, and the releasing the isolation of the target network port according to the monitoring data comprises:
determining whether the deactivation duration reaches a first preset duration;
and if the first preset duration is reached, enabling the target network port.
3. The method of claim 1, wherein the isolation state comprises a deactivated state, the monitoring data comprises a recovery instruction from a user, and the acquiring monitoring data corresponding to the isolation state and releasing the isolation of the target network port according to the monitoring data comprise:
receiving a recovery instruction from a user in the deactivated state;
and enabling the target network port according to the recovery instruction.
4. The method of claim 1, wherein the isolation state comprises a removed state indicating that the target network port has been removed from the binding group corresponding to the first host, the monitoring data comprises a health state of the target network port, and the acquiring monitoring data corresponding to the isolation state and releasing the isolation of the target network port according to the monitoring data comprise:
if the duration for which the target network port has been removed from the binding group exceeds a second preset duration, acquiring the health state of the isolated target network port;
and if the health state of the target network port is a normal state, adding the target network port to the binding group of the first host again.
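The two conditions of claim 4 can be sketched as follows. This is an illustrative sketch under stated assumptions: `probe_health` is a hypothetical callback that returns `"normal"` or `"sub-health"`, and the binding group is modelled as a plain set of port names rather than a real bonding interface.

```python
from typing import Callable, Set


def try_rejoin(port: str,
               removed_for: float,
               second_preset_duration: float,
               probe_health: Callable[[str], str],
               bond_members: Set[str]) -> bool:
    """Re-add a removed port to the binding group if it has been out long enough and is healthy."""
    # Condition 1: the removal duration must exceed the second preset duration.
    if removed_for <= second_preset_duration:
        return False
    # Condition 2: the health state is only acquired after condition 1 holds.
    if probe_health(port) != "normal":
        return False
    bond_members.add(port)
    return True
```

The health probe is deliberately called only after the duration check passes, which mirrors the order of the two steps in the claim.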
5. The method of claim 1, wherein the isolation state comprises a deactivated state or a removed state, the removed state indicates that the target network port has been removed from the binding group corresponding to the first host, the monitoring data comprises a kernel network event of the first host, and the acquiring monitoring data corresponding to the isolation state and releasing the isolation of the target network port according to the monitoring data comprise:
if the isolation state is the deactivated state or the removed state, acquiring a kernel network event of the first host;
and determining, according to the kernel network event, whether a normal network port in a normal state in the first host has been disconnected, and if so, releasing the isolation of the target network port.
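The decision in claim 5 can be sketched as a pure function over already-parsed events. The claim does not specify how kernel network events are obtained or encoded, so the `LinkEvent` type and its fields below are assumptions made for illustration only.

```python
from typing import Iterable, NamedTuple, Set


class LinkEvent(NamedTuple):
    # Hypothetical, already-parsed kernel network event; the field names are assumptions.
    ifname: str
    link_up: bool


def normal_port_disconnected(events: Iterable[LinkEvent],
                             normal_ports: Set[str]) -> bool:
    """Return True if any port currently regarded as normal reports a link-down event."""
    return any(ev.ifname in normal_ports and not ev.link_up for ev in events)
```

When this returns `True`, the host would release the isolation of the target network port, so that a sub-healthy port remains available if a healthy one has dropped out.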
6. The method of claim 2 or 3, wherein the enabling the target network port comprises:
calling a network port enabling instruction to enable the isolated target network port;
detecting whether the target network port is connected;
and if the target network port is not connected, determining whether the number of times the network port enabling instruction has been called exceeds a threshold, and if not, calling the network port enabling instruction again.
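The bounded retry in claim 6 can be sketched as below. The `enable` and `is_connected` callbacks are hypothetical stand-ins for the network port enabling instruction and the connectivity check; the claim does not name concrete commands. The sketch follows the claim wording literally: retrying stops once the call count exceeds the threshold.

```python
from typing import Callable


def enable_with_retry(enable: Callable[[], None],
                      is_connected: Callable[[], bool],
                      threshold: int) -> bool:
    """Call the enabling instruction until the port is connected or the call count exceeds the threshold."""
    calls = 0
    while True:
        enable()          # call the (hypothetical) port enabling instruction
        calls += 1
        if is_connected():
            return True   # the port is connected, recovery succeeded
        if calls > threshold:
            return False  # call count exceeds the threshold, stop retrying
```

Read literally, a threshold of `n` allows up to `n + 1` calls; an implementation that wants exactly `n` attempts would compare with `>=` instead.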
7. The method of claim 1, further comprising:
acquiring network aggregation mode information of a cluster, wherein the network aggregation mode information is used for determining a network aggregation mode of the cluster;
acquiring network analysis data and handling condition information corresponding to a plurality of first network ports of the first host in the network aggregation mode;
determining, from the network analysis data, whether the first host meets a handling condition indicated by the handling condition information;
and if the handling condition is met, isolating the target network port from the cluster network, wherein the target network port is in a sub-health state.
8. The method of claim 7, wherein the network aggregation mode comprises a single-switch link aggregation mode, the network analysis data comprises health states of a plurality of first network ports of the first host, and the acquiring network analysis data corresponding to the plurality of first network ports of the first host in the network aggregation mode comprises:
sending, according to a preset detection frequency, a plurality of detection packets to a corresponding second host in a cluster through at least one link between the first host and the second host, to obtain detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
and determining the health state of the first network port according to the first time delays and packet loss rates of all links of the first network port in the detection period, wherein the health state comprises a sub-health state and a normal state.
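The per-link statistics and per-port classification in claim 8 can be sketched as follows. Several details are assumptions, because the claim does not fix them: the first time delay is taken as the mean round-trip time of the received detection packets, the thresholds `max_loss` and `max_delay_ms` are hypothetical values, and a port is treated as sub-healthy only when every one of its links is degraded.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def link_stats(rtts_ms: List[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (packet loss rate, mean delay in ms) for one link in one detection period.

    Each entry is the round-trip time of one detection packet, or None if it was lost.
    """
    if not rtts_ms:
        raise ValueError("at least one detection packet is required")
    received = [r for r in rtts_ms if r is not None]
    loss_rate = 1.0 - len(received) / len(rtts_ms)
    delay = mean(received) if received else None
    return loss_rate, delay


def port_health(links: Dict[str, List[Optional[float]]],
                max_loss: float = 0.05,
                max_delay_ms: float = 50.0) -> str:
    """Classify a port from all of its links (assumed rule: sub-health if every link is degraded)."""
    if not links:
        raise ValueError("a port needs at least one link")

    def degraded(rtts: List[Optional[float]]) -> bool:
        loss, delay = link_stats(rtts)
        return loss > max_loss or delay is None or delay > max_delay_ms

    return "sub-health" if all(degraded(r) for r in links.values()) else "normal"
```

Requiring all links to be degraded keeps a single bad peer from condemning the local port; a stricter policy could use `any` instead, which is an equally plausible reading of the claim.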
9. The method of claim 7, wherein the network aggregation mode comprises a dual-switch link aggregation mode, the network analysis data comprises health states of a plurality of first network ports of the first host and link states of links between the plurality of first network ports and a second network port of a second host in a cluster, and the acquiring network analysis data corresponding to the plurality of first network ports of the first host in the network aggregation mode comprises:
sending, according to a preset detection frequency, a plurality of detection packets to a corresponding second host in a cluster through at least one link between the first host and the second host, to obtain detection data of each link of a first network port of the first host in a detection period, and determining a packet loss rate and a first time delay corresponding to each link in the detection period according to the detection data;
determining the health state of the first network port according to the first time delays and packet loss rates of all links of the first network port in the detection period, wherein the health state comprises a sub-health state and a normal state;
and determining the link state of each link according to the first time delay and the packet loss rate of that link, wherein the link state comprises a normal state and an abnormal state.
10. A network recovery apparatus applied to a first host that is a local host, the apparatus comprising:
a state data obtaining module, configured to obtain isolation state information of a target network port in the first host, wherein the isolation state information represents an isolation state in which the target network port is isolated from a cluster network;
a monitoring module, configured to acquire monitoring data corresponding to the isolation state;
and a recovery module, configured to release the isolation of the target network port according to the monitoring data.
11. A computer device comprising a memory, a processor and computer readable instructions stored on the memory and executable on the processor, wherein the processor when executing the computer readable instructions performs the steps of the network recovery method of any of claims 1-9.
12. A computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the network recovery method of any of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111375328.8A CN114095341A (en) 2021-11-19 2021-11-19 Network recovery method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114095341A (en) 2022-02-25

Family

ID=80302435

Country Status (1)

Country Link
CN (1) CN114095341A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination