CN117354267A - Link failure processing method, device, equipment, storage medium and program product - Google Patents

Publication number
CN117354267A
Authority
CN
China
Prior art keywords
output port
switch
target
notification message
state
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311325217.5A
Other languages
Chinese (zh)
Inventor
万伟
李俊宏
欧阳长冬
纵瑞博
李�柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Co Ltd
Original Assignee
Dawning Information Industry Co Ltd
Application filed by Dawning Information Industry Co Ltd
Priority to CN202311325217.5A
Publication of CN117354267A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/557 Error correction, e.g. fault recovery or fault tolerance
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/24 Multipath
    • H04L45/247 Multipath using M:N active or standby paths
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/25 Routing or path finding in a switch fabric
    • H04L49/253 Routing or path finding in a switch fabric using establishment or release of connections between ports

Abstract

The application relates to a link failure processing method, apparatus, device, storage medium and program product. The method includes: determining the state of a first output port of a first switch; and, when all first output ports are disconnected and the port type of the target output port is an inter-switch link, sending a first recovery notification message to a second switch through a second output port. The first output port includes a target output port and a standby output port corresponding to the target output port; the second output port is a static output port of the first switch other than the target output port; and the first recovery notification message notifies the second switch that the target output port has failed. The method shortens the link recovery time and thereby reduces network delay.

Description

Link failure processing method, device, equipment, storage medium and program product
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for processing a link failure.
Background
High-performance computing (HPC) is used to solve complex problems. As data volumes grow exponentially, larger computing clusters are required to address current and future computing challenges. In HPC, an efficient link failure handling method is needed to reduce the latency of inter-process communication within the cluster.
In the conventional technology, if a failure occurs in a switch link, for example, a target output port of a switch is not available, the switch needs to send a notification message to a server to trigger the server to recalculate a switch route by using a subnet management service, and send the recalculated route information to each switch, and each switch updates a route table according to the sent route information to bypass the failure link.
However, the above processing method may result in a long link recovery time, resulting in network delay.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a link failure processing method, apparatus, device, storage medium, and program product capable of reducing network latency.
In a first aspect, the present application provides a link failure processing method, including:
determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is the static output port corresponding to a destination network card identifier in a routing table of the first switch;
when all the first output ports are disconnected and the port type of the target output port is an inter-switch link, sending a first recovery notification message to a second switch through a second output port; the second output port is a static output port of the first switch other than the target output port, and the first recovery notification message is used for notifying the second switch that the target output port has failed.
In the above link failure processing method, the state of the first output port of the first switch is determined, and, when all first output ports are disconnected and the port type of the target output port is an inter-switch link, a first recovery notification message is sent to the second switch through the second output port. The first output port includes a target output port and a standby output port corresponding to the target output port; the second output port is a static output port of the first switch other than the target output port; and the first recovery notification message notifies the second switch that the target output port has failed. In the conventional technology, if a switch link fails, for example because a target output port of a switch is unavailable, the switch must send a notification message to a server, which triggers the server to recalculate the switch routes using the subnet management service and to send the recalculated routing information to each switch; each switch then updates its routing table accordingly to bypass the failed link. This results in a long link recovery time and therefore network delay. In the embodiments of the present application, when the first output port of the first switch is unavailable, the first recovery notification message is sent to the other switches connected to the first switch, so that they do not send subsequently forwarded data messages toward the failed target output port. Because the other switches are notified as soon as the first output port becomes unavailable, the link recovery time is shortened and the network delay is reduced.
In addition, because route recovery is preferentially performed autonomously by the switch, fewer trap messages need to be sent to the server. This alleviates both the performance degradation of the server's subnet controller caused by a large number of trap messages and the repeated transmission of trap messages. Furthermore, the fault handling mechanism of the subnet controller cannot completely eliminate the possibility of faults; in the embodiments of the present application the switch performs route recovery autonomously, which improves the likelihood that a fault is eliminated.
In one embodiment, the method further comprises:
determining a state of an adaptive routing function of the first switch;
generating a risk table based on the state of the adaptive routing function and the state of the standby output port in the routing table of the first switch;
and generating the first recovery notification message according to the risk table.
In this embodiment, the state of the adaptive routing function of the first switch is determined, a risk table is generated based on the state of the adaptive routing function and the state of the standby output port in the routing table of the first switch, and a first recovery notification message is generated according to the risk table. Because the effect of the adaptive routing state on the first recovery notification message is taken into account when the message is generated, the information in the message is more accurate. Furthermore, when a link failure is detected that the first switch cannot recover from by itself, the first recovery notification message notifies the other switches in the network, so that the whole network can quickly update the state of the failed link and obtain an available path. This speeds up the recovery of the network from the failed state to the normal state and reduces the link failure recovery time from the order of seconds to the order of milliseconds.
In one embodiment, the generating a risk table based on the state of the adaptive routing function and the state of the standby output port in the routing table of the first switch includes:
generating the risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; wherein,
meeting the preset trigger condition comprises any one of the following:
the adaptive routing function is disabled, and no standby output port is set in the routing table;
the adaptive routing function is enabled, and the number of standby output ports in the routing table is a preset number.
In this embodiment, a risk table is generated when a preset trigger condition is met, and the risk table includes the destination network card identifier. Meeting the preset trigger condition includes any one of the following: the adaptive routing function is disabled and no standby output port is set in the routing table; or the adaptive routing function is enabled and the number of standby output ports in the routing table is a preset number. When the adaptive routing function is disabled, the method considers whether the routing table of the first switch contains a standby output port; when it is enabled, the method considers whether the number of standby output ports in the routing table equals the preset number. Taking both cases into account improves the accuracy of the information in the risk table and, in turn, of the first recovery notification message. The whole network can therefore quickly update the state of the failed link and obtain an available path, which reduces network delay and improves the robustness of the whole network.
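As an illustration only, the two trigger conditions could be sketched in Python as follows. The function names, the dictionary layout of the routing table and the `preset_count` parameter are assumptions introduced for this sketch; the application does not specify the preset number or the storage layout. The risk table is modeled as a mapping from output port to a list of destination network card identifiers, because the description later looks the list up by port.

```python
from typing import Dict, List, Tuple

def meets_trigger(ar_enabled: bool, standby_ports: List[int], preset_count: int) -> bool:
    """Return True when a routing entry should be recorded in the risk table."""
    if not ar_enabled:
        # Adaptive routing disabled: at risk when no standby port is set.
        return len(standby_ports) == 0
    # Adaptive routing enabled: at risk when the standby count equals the
    # preset number (the value itself is an assumption of the caller).
    return len(standby_ports) == preset_count

def build_risk_table(
    routes: Dict[int, Tuple[int, List[int]]],  # LID -> (static port, standby ports)
    ar_enabled: bool,
    preset_count: int,
) -> Dict[int, List[int]]:
    """Group at-risk destination LIDs by their static output port."""
    risk: Dict[int, List[int]] = {}
    for lid, (static_port, standby_ports) in routes.items():
        if meets_trigger(ar_enabled, standby_ports, preset_count):
            risk.setdefault(static_port, []).append(lid)
    return risk
```

With adaptive routing disabled, only entries without any standby port end up in the table; with it enabled, the caller decides through `preset_count` which standby count marks an entry as at risk.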
In one embodiment, the generating the first recovery notification message according to the risk table includes:
querying, in the risk table, the destination network card identifier corresponding to the target output port;
and generating the first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
In this embodiment, the destination network card identifier of the target output port is queried in the risk table, and the first recovery notification message is generated according to the destination network card identifier and the preset generation format corresponding to the first recovery notification message. Because the message is generated from the destination network card identifier, it contains the identifier of the unreachable destination network card and can therefore inform other switches that the target output port corresponding to that identifier has failed. The server and the other switches can then switch to a standby path automatically and transparently, which realizes self-repair of the link failure and preserves network connectivity.
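A minimal sketch of this lookup-and-encode step is given below. The application does not disclose the preset generation format, so the byte layout used here (a 16-bit count followed by 16-bit identifiers in network byte order) is purely hypothetical, as are the function names.

```python
import struct
from typing import Dict, List

def build_frn_payload(risk_table: Dict[int, List[int]], failed_port: int) -> bytes:
    """Look up the LIDs at risk behind failed_port and pack them into a payload."""
    lids = risk_table.get(failed_port, [])
    # Hypothetical layout: 16-bit count, then one 16-bit value per LID.
    return struct.pack(f"!H{len(lids)}H", len(lids), *lids)

def parse_frn_payload(payload: bytes) -> List[int]:
    """Recover the list of unreachable LIDs from a payload built above."""
    (count,) = struct.unpack_from("!H", payload, 0)
    return list(struct.unpack_from(f"!{count}H", payload, 2))
```

A receiving switch would apply the inverse function to obtain the unreachable identifiers before consulting its own routing table.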
In one embodiment, the method further comprises:
receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch;
if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
In this embodiment, the second recovery notification message sent by the third switch is received through the target receiving port of the first switch, and if the target receiving port is the target output port, the target output port in the routing table of the first switch is updated according to the standby output port corresponding to the destination network card identifier. In this way, the second recovery notification message can be sent to each switch on the switch link; that is, once a link failure is detected, the link state is updated immediately and the other switches in the network are notified, so that the whole network can quickly update the state of the failed link and select a new available path. This speeds up the recovery of the network from the failed state to the normal state. At the same time, problems in the network are repaired automatically, which reduces the complexity of network management. In addition, because the recovery notification message is sent to each switch, the effect is equivalent to the whole network detecting the fault point, which greatly reduces the probability of packet loss and retransmission that would occur if data messages sent by other nodes or switches were forwarded to the failed port.
In one embodiment, the updating the destination output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier includes:
if the adaptive routing function of the first switch is disabled, writing the standby output port corresponding to the destination network card identifier into the target output port field in the routing table of the first switch;
and if the adaptive routing function of the first switch is enabled, writing the ports of the first switch other than the target receiving port into the target output port field in the routing table of the first switch.
In this embodiment, if the adaptive routing function of the first switch is disabled, the standby output port corresponding to the destination network card identifier is written into the target output port field in the routing table of the first switch; if the adaptive routing function is enabled, the ports of the first switch other than the target receiving port are written into that field. Because the route is updated automatically when a link failure occurs, the network repairs itself without manual intervention by an administrator, and the network delay is reduced.
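The update rule on the receiving side could be sketched as follows. The representation of the routing table (identifier mapped to a list of output ports), the `standby` mapping and the `switch_ports` list are assumptions made for illustration; entries without a standby port are simply left unchanged here because the text above does not describe that case.

```python
from typing import Dict, List

def apply_notification(
    routing_table: Dict[int, List[int]],  # LID -> current target output port(s)
    standby: Dict[int, int],              # LID -> standby output port
    unreachable_lids: List[int],
    recv_port: int,
    ar_enabled: bool,
    switch_ports: List[int],
) -> None:
    """Update routes for LIDs reported unreachable via recv_port (in place)."""
    for lid in unreachable_lids:
        if recv_port not in routing_table.get(lid, []):
            continue  # the message did not arrive on this LID's target output port
        if ar_enabled:
            # Adaptive routing enabled: use every port except the receiving port.
            routing_table[lid] = [p for p in switch_ports if p != recv_port]
        elif lid in standby:
            # Adaptive routing disabled: switch to the standby output port.
            routing_table[lid] = [standby[lid]]
```

The function mutates the table in place, which mirrors the idea of writing a new value into the existing target output port field.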
In a second aspect, the present application further provides a link failure processing apparatus, including:
a first determining module, configured to determine a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is the static output port corresponding to a destination network card identifier in a routing table of the first switch;
a first sending module, configured to send a first recovery notification message to a second switch through a second output port when all the first output ports are disconnected and the port type of the target output port is an inter-switch link; the second output port is a static output port of the first switch other than the target output port, and the first recovery notification message is used for notifying the second switch that the target output port has failed.
In a third aspect, the present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
The above link failure processing method, apparatus, device, storage medium and program product determine the state of a first output port of a first switch and, when all first output ports are disconnected and the port type of the target output port is an inter-switch link, send a first recovery notification message to a second switch through a second output port. The first output port includes a target output port and a standby output port corresponding to the target output port; the second output port is a static output port of the first switch other than the target output port; and the first recovery notification message notifies the second switch that the target output port has failed. In the conventional technology, if a switch link fails, for example because a target output port of a switch is unavailable, the switch must send a notification message to a server, which triggers the server to recalculate the switch routes using the subnet management service and to send the recalculated routing information to each switch; each switch then updates its routing table accordingly to bypass the failed link. This results in a long link recovery time and therefore network delay. In the embodiments of the present application, when the first output port of the first switch is unavailable, the first recovery notification message is sent to the other switches connected to the first switch, so that they do not send subsequently forwarded data messages toward the failed target output port. Because the other switches are notified as soon as the first output port becomes unavailable, the link recovery time is shortened and the network delay is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for a person having ordinary skill in the art.
FIG. 1 is an application environment diagram of a link failure handling method in one embodiment;
FIG. 2 is a flow diagram of a method of link failure handling in one embodiment;
FIG. 3 is a flowchart illustrating a method for processing port exceptions according to one embodiment;
FIG. 4 is a schematic diagram of a conventional link failure handling method in one embodiment;
FIG. 5 is a schematic diagram of a link failure handling method according to one embodiment;
FIG. 6 is a flow chart of a method of handling link failure in another embodiment;
FIG. 7 is a schematic diagram of a risk table provided by one embodiment;
FIG. 8 is a schematic diagram of a routing table according to an embodiment;
FIG. 9 is a schematic diagram of a backup output port set according to one embodiment;
FIG. 10 is a flowchart of a risk table generation method when the adaptive routing function is disabled, according to an embodiment;
FIG. 11 is a flowchart of a risk table generation method when the adaptive routing function is enabled, according to an embodiment;
FIG. 12 is a flowchart of a method for generating a recovery notification message according to one embodiment;
FIG. 13 is a flow chart of a method of handling link failure in yet another embodiment;
FIG. 14 is a schematic diagram of an update table provided by one embodiment;
FIG. 15 is a flowchart of a method for parsing a recovery notification message when the adaptive routing function is disabled, according to an embodiment;
FIG. 16 is a flowchart of a method for parsing a recovery notification message when the adaptive routing function is enabled, according to an embodiment;
FIG. 17 is a simulated diagram of an uplink failure provided by one embodiment;
FIG. 18 is a simulated diagram of a downlink failure provided by one embodiment;
FIG. 19 is a flow chart of a detailed link failure handling method;
FIG. 20 is a block diagram of a link failure handling device in one embodiment;
FIG. 21 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
High-performance computing (HPC) has traditionally been used to solve complex problems. As data volumes grow exponentially, larger computing clusters are required to address current and future computing challenges. In HPC, efficient inter-process communication depends on an interconnect that provides high bandwidth and low latency and supports a large number of endpoints, so a high-speed, reliable interconnection network is needed. MPI, shared storage, machine learning frameworks and new heterogeneous computing architectures also need such a network. MPI (Message Passing Interface) is a cross-language communication protocol used for programming parallel computers. As high-speed interconnection networks continue to expand to accommodate larger computing and storage capacities, 40K or even 100K network nodes may be needed in the future. However, as the number of host channel adapters (e.g., HCA cards), switches and, in particular, fiber optic cables grows, more components will suffer physical or electrical damage, resulting in link failures. To address this problem, the conventional technology provides a job checkpointing method, which creates a point-in-time snapshot of a job so that, if the job fails at a later point in time, it continues from the last successfully saved state. Alternatively, the data integrity check and retransmission mechanism of the InfiniBand protocol is used. These methods degrade performance and are impractical for large-scale communication.
Currently, in a high-speed network, if a link failure occurs, the network node sends notification information (e.g., a trap message packet) to the subnet management service of a server. After receiving the trap message packet, the subnet management service triggers a heavy sweep (a full rescan) and recalculates the routes to bypass the failed link. This process takes a long time, typically at least several seconds: it may take up to 5 seconds for 1000 nodes, and 30 seconds or even longer for clusters with 10000 or more nodes. The long link recovery time causes network delay, so the stability and reliability of the network cannot be guaranteed and jobs may fail, which is intolerable. The trap message packet is a mechanism used with the subnet management service to send notifications to the network manager; it allows a message to be sent automatically when a fault or abnormality occurs in the network.
In addition, handling link failures with the above method can make the system unreliable. In parallel computing such as MPI, network quality is important. If a network failure occurs and the routing link is not recovered for a long time, jobs may fail, which greatly affects the stability and reliability of the system.
The related art may also cause network congestion: if there are a large number of link failures in the network, there may be a large number of trap messages, which may cause congestion in the network, thereby affecting the normal operation of the network.
The related art may also degrade the performance of the subnet management service: the subnet management service is the only management tool in the high-speed interconnection network and is responsible for subnet scanning, route calculation and distribution, network configuration management, load balancing, fault detection and other functions. If it has to handle a large number of trap messages while also processing other network events, its performance and processing efficiency may suffer.
The related art may also fail to eliminate faults completely: although the subnet management service aims to improve network resilience and reliability and includes a mechanism for failure handling, it cannot completely eliminate the possibility of failure. This means that other fault countermeasures, such as backup and monitoring systems, still need to be implemented.
The related art may also lead to repeated trap messages: a trap message may be sent repeatedly, causing the management node to process duplicate alarms. When the network is unstable, the trap message for the same link failure may be transmitted many times, so that a large number of trap messages circulate in the network. In summary, the above processing method results in a long link recovery time, and therefore network delay, among other problems.
In view of the above problems, an embodiment of the present application provides a link failure processing method, which may be applied to the application environment shown in FIG. 1. FIG. 1 is an application environment diagram of the link failure processing method in one embodiment. The switch 101 may communicate with a server 102, which may host the subnet management service. A data storage system may store the data that the server 102 needs to process. There may be a plurality of switches 101, which are interconnected to form links and connected to the server 102 to receive information issued by the server 102. The switch 101 may check the on-off state of its target output port and of the standby output port corresponding to the target output port, and forward the data message or send a fast recovery notification message according to that state. The server 102 may be implemented as a stand-alone server or as a server cluster including a plurality of servers.
In an exemplary embodiment, as shown in FIG. 2, a link failure processing method is provided. The method is described by taking its application to the switch 101 in FIG. 1 (for example, a first switch) as an example, and includes the following steps:
S201, determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is the static output port corresponding to the destination network card identifier in the routing table of the first switch.
The first output port may include a target output port and a standby output port corresponding to the target output port, where the target output port may be the static output port corresponding to a destination network card identifier in the routing table of the first switch. The on-off state of the first output port is either connected or disconnected; that is, the target output port and each standby output port is either connected or disconnected. The routing table may be a linear routing table.
In this embodiment of the present application, the first switch may query the destination network card identifier of the first switch in the routing table, and determine the static output port corresponding to the destination network card identifier of the first switch, that is, the first switch may query the destination output port of the first switch in the routing table. The first switch determines the on-off state of a target output port of the first switch and the on-off state of a standby output port corresponding to the target output port through the routing table information in the routing table. The routing table information may include a destination network card identifier, a static output port corresponding to the destination network card identifier, and an on-off state of the standby output port.
Specifically, if the first switch detects that neither the target output port nor the standby output port carries a signal, it determines that the on-off state of the first output port is disconnected. Alternatively, if the first switch finds that its routing table records both the target output port and the corresponding standby output port as disconnected, it determines that the on-off state of the first output port is disconnected.
In addition, if the on-off state of the target output port is disconnected, but the on-off state of the standby output port corresponding to the target output port is not disconnected, the standby output port can be directly utilized to forward the data message. Namely: if only one spare output port exists and the spare output port is not disconnected, forwarding the data message by using the spare output port; if there are a plurality of spare output ports, one of the spare output ports which is not disconnected can be utilized to forward the data message.
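The port selection described in the two paragraphs above can be sketched in Python as follows. The function name and the `link_up` mapping are assumptions for illustration; a return value of `None` stands for the case in which all first output ports are disconnected.

```python
from typing import Dict, List, Optional

def pick_output_port(
    target: int, standbys: List[int], link_up: Dict[int, bool]
) -> Optional[int]:
    """Return the port to forward on, or None if every candidate is disconnected."""
    if link_up.get(target, False):
        return target
    # Target is down: fall back to the first standby port that is still connected.
    for port in standbys:
        if link_up.get(port, False):
            return port
    return None  # all first output ports are disconnected
```

Only the `None` case leads on to the notification step; in every other case the data message is forwarded normally.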
It should be noted that, during topology discovery, the subnet management service in the server marks, for each switch, its positional relationship to the other switches. For example, using a fat-tree structure, it marks whether the switch belongs to the core layer, the aggregation layer or the access layer, and notifies each switch to record its own level. The subnet management service calculates routes in each scanning period. If the adaptive routing function is disabled, one output port and one standby output port may be calculated for each switch, where the route through the output port and the route through the standby output port must not overlap; if no suitable standby output port exists, the standby output port is set to 0. If the adaptive routing function is enabled, one output port and all standby output ports corresponding to it may be calculated for each switch, and all standby output ports corresponding to one output port are placed in one port group. That is, the server may send MAD messages to each switch through the subnet management service to instruct the switch to record its own level and the routing table information. A MAD message is a management message used for sending network management information.
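One possible in-memory shape for such a routing entry is sketched below. The class and field names are hypothetical; only the convention that a standby port of 0 means "no suitable standby port" and the grouping of standby ports into a port group are taken from the paragraph above.

```python
from dataclasses import dataclass, field
from typing import List

NO_STANDBY = 0  # per the text, 0 marks the absence of a suitable standby port

@dataclass
class RouteEntry:
    dest_lid: int                  # destination network card identifier
    static_port: int               # static (target) output port
    standby_port: int = NO_STANDBY  # used when adaptive routing is disabled
    port_group: List[int] = field(default_factory=list)  # used when it is enabled

    def standby_ports(self, ar_enabled: bool) -> List[int]:
        """Return the standby ports that apply in the given adaptive routing mode."""
        if ar_enabled:
            return list(self.port_group)
        return [] if self.standby_port == NO_STANDBY else [self.standby_port]
```

Keeping both fields in one entry lets the same table serve either mode; a real switch chip would more likely use separate compact tables.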
S202, under the condition that the states of the first output ports are all disconnected, if the port type of the target output port is an inter-switch link, a first recovery notification message is sent to a second switch through the second output port; the second output port is a static output port except for the target output port in the first switch, and the first recovery notification message is used for notifying the second switch of the target output port fault.
The port type of the second output port is an inter-switch link, and the second output port is a static output port except the target output port in the first switch. There may be a plurality of second output ports.
Specifically, in the case that the states of the first output ports are all disconnected, if the port type of the disconnected target output port is an inter-switch link, the first switch may directly discard the data message. In addition, the first switch may generate a first recovery notification message (FRN_MAD, Fast Recovery Notification MAD) using the risk table stored in the storage structure of its chip, and send the first recovery notification message to the second switch through the second output port to inform the second switch that the target output port of the first switch has failed. For example, fig. 3 is a flow chart of a method for processing port exceptions according to one embodiment, which may include the following steps:
S301, the first switch detects that neither the target output port K nor the standby output port A can carry data traffic, and processes the data according to the InfiniBand protocol.
S302, if the port type of port K is an inter-switch link, the network card identifier (Lid) list corresponding to port K is selected from the risk table and packaged into a first recovery notification message FRN_MAD.
S303, the first recovery notification message FRN_MAD is sent through all ports, other than port K, whose port type is an inter-switch link.
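Steps S302 and S303 can be sketched as below. The port-type labels, the dictionary used as a message, and the function name `fan_out_frn` are hypothetical; the source does not specify how the switch chip represents them. Skipping the notification when no Lid is at risk is likewise an assumption.

```python
from typing import Dict, List, Tuple

ISL = "isl"    # hypothetical label for an inter-switch link port
HOST = "host"  # hypothetical label for a port facing a network card


def fan_out_frn(
    failed_port: int,
    port_types: Dict[int, str],
    risk_table: Dict[int, List[int]],
) -> List[Tuple[int, dict]]:
    """Return (egress_port, message) pairs for steps S302 and S303."""
    if port_types.get(failed_port) != ISL:
        return []  # S302 only applies to inter-switch links
    lids = sorted(risk_table.get(failed_port, []))
    if not lids:
        return []  # assumption: nothing at risk, nothing to announce
    message = {"type": "FRN_MAD", "lids": lids}
    # S303: every inter-switch link port except the failed one.
    return [
        (port, message)
        for port, kind in sorted(port_types.items())
        if kind == ISL and port != failed_port
    ]
```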
Fig. 4 is a schematic diagram of a conventional link failure processing method in one embodiment. As shown in fig. 4, in the conventional technology, after a link failure occurs, the switch sends a notification to the server, and the subnet manager in the server calculates routes and updates the routing tables of all switches, which lengthens link recovery time and thus increases network delay. Fig. 5 is a schematic diagram of a link failure processing method according to an embodiment. After a link failure occurs, the switch first attempts route recovery autonomously, that is, it forwards the data message through a standby output port. Only if the switch cannot recover the route autonomously does it send a notification to the server, whereupon the subnet manager in the server calculates routes and updates the routing tables of all switches.
When the server calculates routes in each scanning period using the subnet management service, it also synchronizes the routing table data that the switches modified autonomously. For example, the server copies the routing table modifications completed autonomously by a switch into the data structure of the subnet management service, so that the routes recalculated by the subnet management service stay consistent with the autonomously updated routes. If the adaptive routing function is off, the subnet management service of the server can predict the result of a switch's autonomous route modification; if the adaptive routing function is on, the subnet management service cannot predict that result, and a provisioning or synchronization mechanism needs to be added.
In the link failure processing method, the state of the first output port of the first switch is determined; in the case that the states of the first output port are all disconnected, if the port type of the target output port is an inter-switch link, a first recovery notification message is sent to the second switch through the second output port. The first output port includes the target output port and the standby output port corresponding to the target output port, the second output port is a static output port other than the target output port in the first switch, and the first recovery notification message is used to notify the second switch that the target output port has failed. In the conventional technology, if a switch link fails, for example when a target output port of a switch becomes unavailable, the switch must send a notification message to the server to trigger the server to recalculate the switch routes using the subnet management service and send the recalculated routing information to each switch; each switch then updates its routing table accordingly to bypass the failed link. This lengthens link recovery time and increases network delay. In the embodiments of the present application, when the first output port of the first switch is unavailable, the first recovery notification message can be sent to the other switches connected to the first switch, so that those switches no longer send messages toward the failed target output port when forwarding subsequently. Because the other switches are informed directly, without waiting for the server to recalculate routes, link recovery time is shortened and network delay is reduced.
In addition, because route recovery is preferentially performed autonomously by the switch, fewer trap messages need to be sent to the server. This alleviates both the performance degradation of the subnet controller of the server caused by a large number of trap messages and the duplication among those trap messages. Moreover, the fault handling mechanism of the subnet controller cannot by itself eliminate every fault; in the embodiments of the present application the switch also performs route recovery autonomously, which improves the likelihood that a fault is eliminated.
In an exemplary embodiment, as shown in fig. 6, the link failure processing method further includes the steps of:
S601, determining the state of the adaptive routing function of the first switch.
The adaptive routing function may be Adaptive Routing (AR). The state of the adaptive routing function is either on or off.
In the embodiment of the application, it may be determined whether the state of the adaptive routing function of the first switch is on or not. Specifically, the first switch may query the routing table information in the local routing table to determine whether the state of the adaptive routing function recorded in the routing table information is on or not on.
S602, a risk table is generated based on the state of the self-adaptive routing function and the state of the standby output port in the routing table of the first switch.
The risk table may be stored in a memory structure of the switch chip.
In this embodiment of the present application, the state of the spare output ports in the routing table of the first switch may include that no spare output ports are displayed in the routing table, one spare output port exists, and a plurality of spare output ports exist. The network card identifier to be recorded can be determined based on the state of the self-adaptive routing function and the state of the standby output port in the routing table of the first switch, and a risk table is generated according to the network card identifier.
For example, fig. 7 is a schematic diagram of a risk table according to one embodiment. As shown in fig. 7, the port numbers of the target output ports are 1-48, and the network card identifiers corresponding to the target output port with port number 1 include 127-130; these network card identifiers represent all addresses that may be affected if that target output port is disconnected.
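As a rough illustration of the risk table, the sketch below keeps a mapping from a target output port to the network card identifiers that become unreachable when that port fails. The class name `RiskTable` and its methods are hypothetical; only the example values (port 1 with Lids 127-130, port 48 with Lid 10103) come from the description of fig. 7.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class RiskTable:
    """Maps a target output port to the Lids affected if that port fails."""

    def __init__(self) -> None:
        self._by_port: DefaultDict[int, Set[int]] = defaultdict(set)

    def record(self, port: int, lid: int) -> None:
        self._by_port[port].add(lid)

    def lids_at_risk(self, port: int) -> List[int]:
        # .get avoids creating an empty entry for a port that was never recorded.
        return sorted(self._by_port.get(port, set()))


# Populate the table with the values mentioned for fig. 7.
table = RiskTable()
for lid in range(127, 131):
    table.record(1, lid)
table.record(48, 10103)
```

A set is used per port so that recording the same Lid twice leaves the table unchanged.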
S603, generating a first recovery notification message according to the risk table.
Specifically, the first switch may query a risk table stored in a storage structure of the first switch chip, and generate a first recovery notification message after merging the destination network card identifiers in the risk table.
In this embodiment, the state of the adaptive routing function of the first switch is determined, a risk table is generated based on that state and the state of the standby output port in the routing table of the first switch, and a first recovery notification message is generated according to the risk table. Because the effect of the different adaptive routing states is taken into account when the first recovery notification message is generated, the information in the message is more accurate. Furthermore, when a link failure is detected that the switch cannot recover from on its own, the first recovery notification message notifies the other switches in the network, so that the whole network can quickly update the failed link state and obtain an available path. This speeds up recovery of the network from the fault state to the normal state and reduces link failure recovery time from the second level to the millisecond level.
In an exemplary embodiment, a risk table is generated if a preset trigger condition is met; the risk table includes a destination network card identifier.
Meeting the preset trigger condition includes any one of the following:
First item: the state of the adaptive routing function is off, and there is no standby output port in the routing table.
Second item: the state of the adaptive routing function is on, and the number of standby output ports in the routing table is a preset number.
The preset number may include 0. The routing table may be a linear routing table whose structure is shown in fig. 8 and fig. 9. Fig. 8 is a schematic structural diagram of the routing table provided in an embodiment; as shown in fig. 8, the routing table includes a destination network card identifier, a target output port, the state of the adaptive routing function, and the standby output port group used when the adaptive routing function is on. Fig. 9 is a schematic structural diagram of a standby output port group provided in an embodiment; the standby output port group shows the state of each standby output port when the adaptive routing function is on, for example, G1 has standby output ports P2 and P3.
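One possible in-memory shape for such a routing table entry and port-group table is sketched below. The field names, the `RouteEntry` class and the helper `spare_ports` are hypothetical; the exact column layout of fig. 8 and fig. 9 is not reproduced in the text, so the separate `spare_port` field for the non-adaptive case is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RouteEntry:
    lid: int             # destination network card identifier
    out_port: int        # target output port
    ar_on: bool          # state of the adaptive routing function
    spare_port: int = 0  # single standby port used when AR is off; 0 = none
    group: str = ""      # standby output port group used when AR is on


# Port-group table in the spirit of fig. 9: G1 holds standby ports 2 and 3.
PORT_GROUPS: Dict[str, List[int]] = {"G1": [2, 3]}


def spare_ports(entry: RouteEntry, groups: Dict[str, List[int]]) -> List[int]:
    """Return the standby ports that apply to this entry."""
    if entry.ar_on:
        return list(groups.get(entry.group, []))
    return [entry.spare_port] if entry.spare_port != 0 else []
```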
Specifically, if the first item is satisfied, that is, the first switch has not turned on the adaptive routing function, then according to the routing table information for the first switch carried in the MAD message, the destination network card identifiers whose standby output port is 0 in the routing table of the first switch are recorded, and the risk table is generated from these network card identifiers. For example, fig. 10 is a flowchart of a risk table generation method with the adaptive routing function off according to an embodiment; the steps are as follows:
S1001, receiving the MAD message and analyzing the MAD message.
S1002, determining whether the MAD message is analyzed.
S1003, if the MAD message has not been fully analyzed, determining whether the standby output port of the target output port corresponding to a network card identifier is 0.
S1004, if the standby output port is 0, adding that network card identifier to the risk table.
S1005, if the standby output port is not 0, returning to the step of determining whether the MAD message is analyzed.
By using the risk table generation method without starting the self-adaptive routing function to update the risk table, the accuracy of information in the risk table can be improved, and further, the speed of recovering the network from the fault state to the normal state can be improved.
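Steps S1001 to S1005 amount to a single pass over the routing records carried in the MAD message. The sketch below assumes the records have already been parsed into `(lid, target_port, spare_port)` tuples; that tuple layout and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One parsed routing record: (lid, target_port, spare_port). Hypothetical layout.
Record = Tuple[int, int, int]


def build_risk_table_ar_off(records: Iterable[Record]) -> Dict[int, List[int]]:
    """Collect, per target output port, the Lids that have no standby port."""
    risk: Dict[int, List[int]] = {}
    for lid, target_port, spare_port in records:  # S1002: until fully parsed
        if spare_port == 0:                        # S1003
            risk.setdefault(target_port, []).append(lid)  # S1004
        # S1005: otherwise continue with the next record
    return risk
```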
Specifically, if the second item is satisfied, that is, the first switch has turned on the adaptive routing function, then according to the routing table information for the first switch carried in the MAD message, the destination network card identifiers whose standby output port group contains only one standby output port in the routing table of the first switch are recorded, and the risk table is generated from these destination network card identifiers. For example, fig. 11 is a flowchart of a risk table generation method with the adaptive routing function on according to an embodiment; the steps are as follows:
S1101, receiving the MAD message and analyzing the MAD message.
S1102, determining whether the MAD message is analyzed.
S1103, if the MAD message has not been fully analyzed, determining whether the standby output port group of the target output port corresponding to a network card identifier has only one standby output port.
S1104, if the standby output port group has only one standby output port, adding that network card identifier to the risk table.
S1105, if the standby output port group includes a plurality of standby output ports, returning to the step of determining whether the MAD message is analyzed.
By using the risk table generation method for starting the self-adaptive routing function to update the risk table, the accuracy of information in the risk table can be improved, and further, the speed of recovering the network from the fault state to the normal state can be improved.
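Steps S1101 to S1105 can be sketched in the same way for the case where the adaptive routing function is on. The record layout, the function name and the `preset_numbers` parameter are hypothetical; the default of exactly one port follows the flow above, and passing `(0, 1)` is one way to also cover a preset number of 0.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# One parsed routing record: (lid, target_port, group_id). Hypothetical layout.
Record = Tuple[int, int, str]


def build_risk_table_ar_on(
    records: Iterable[Record],
    groups: Dict[str, List[int]],
    preset_numbers: Sequence[int] = (1,),
) -> Dict[int, List[int]]:
    """Collect the Lids whose standby port group size is a preset number."""
    risk: Dict[int, List[int]] = {}
    for lid, target_port, group_id in records:      # S1102: until fully parsed
        ports = groups.get(group_id, [])
        if len(ports) in preset_numbers:             # S1103
            risk.setdefault(target_port, []).append(lid)  # S1104
        # S1105: groups with several ports are skipped
    return risk
```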
In this embodiment, a risk table is generated when a preset trigger condition is satisfied, where the risk table includes a destination network card identifier. Meeting the preset trigger condition includes any one of the following: the state of the adaptive routing function is off, and there is no standby output port in the routing table; or the state of the adaptive routing function is on, and the number of standby output ports in the routing table is a preset number. When the state of the adaptive routing function is off, generation of the risk table takes into account whether the routing table of the first switch contains a standby output port; when the state of the adaptive routing function is on, it takes into account whether the number of standby output ports in the routing table of the first switch equals the preset number. This improves the accuracy of the information in the risk table and, in turn, of the first recovery notification message. The whole network can then quickly update the failed link state and obtain an available path, which reduces network delay and improves the robustness of the whole network.
In an exemplary embodiment, as shown in fig. 12, fig. 12 is a flowchart of a method for generating a recovery notification message according to an embodiment, where S603 includes the following steps:
S1201, querying the destination network card identifier of the target output port in the risk table.
Specifically, the first switch may query the risk table stored in the storage structure of the switch chip and determine the destination network card identifier of the target output port from the risk table. The target output port is the output port whose on-off state is disconnected in the above embodiment. Referring to fig. 7, if the port number of the target output port is 48, the destination network card identifier of that port is 10103.
S1202, generating a first recovery notification message according to a target network card identifier and a preset generation format corresponding to the first recovery notification message.
Specifically, the target output port and the target network card identifier may be processed according to a preset generation format corresponding to the first recovery notification message, so as to obtain the first recovery notification message.
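The preset generation format is not specified in the text, so the following is purely a hypothetical encoding chosen for illustration: a header with the failed port number and a Lid count, followed by the Lids as 16-bit big-endian integers (InfiniBand LIDs are 16 bits wide). It is not the real FRN_MAD layout.

```python
import struct
from typing import List, Tuple

# Hypothetical header: failed port number, number of Lids (both 16-bit).
_HEADER = struct.Struct("!HH")


def pack_frn(failed_port: int, lids: List[int]) -> bytes:
    """Encode the target output port and its unreachable Lids."""
    unique = sorted(set(lids))  # merge duplicate Lids before encoding
    body = struct.pack(f"!{len(unique)}H", *unique)
    return _HEADER.pack(failed_port, len(unique)) + body


def unpack_frn(payload: bytes) -> Tuple[int, List[int]]:
    """Decode a payload produced by pack_frn."""
    failed_port, count = _HEADER.unpack_from(payload, 0)
    lids = struct.unpack_from(f"!{count}H", payload, _HEADER.size)
    return failed_port, list(lids)
```

For example, `pack_frn(48, [10103])` encodes the fig. 7 entry for port 48, and `unpack_frn` recovers the same port and Lid list on the receiving side.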
In this embodiment, the destination network card identifier of the target output port is queried in the risk table, and the first recovery notification message is generated according to the destination network card identifier and the preset generation format corresponding to the first recovery notification message. Because the first recovery notification message is generated from the destination network card identifier, it carries the unreachable destination network card identifiers and can therefore inform other switches that the target output port corresponding to those identifiers has failed. The server and the other switches can then switch to a standby path automatically and transparently, which realizes self-repair of the link failure and preserves network connectivity.
In an exemplary embodiment, as shown in fig. 13, the link failure processing method further includes the steps of:
S1301, receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch.
The second recovery notification message may be a recovery notification message sent by the third switch to the first switch when the on-off state of the target output port and the standby output port of the third switch is off and the port type of the target output port of the third switch is an inter-switch link.
Specifically, the third switch sends a second recovery notification message to the first switch in either of two cases. In the first case, the state of the adaptive routing function of the third switch is off, the on-off states of the target output port and the standby output port of the third switch are both disconnected, and the port type of the target output port of the third switch is an inter-switch link. In the second case, the state of the adaptive routing function is on, all standby output ports in the standby output port group are disconnected, and the port type of the target output port of the third switch is an inter-switch link. The first switch may receive the second recovery notification message through the target receiving port and, on receipt, record which of its ports (the target receiving port) received the message, for example, port S of the first switch.
S1302, if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
In this embodiment of the present application, if the target receiving port is a target output port of the first switch, the target output port in the routing table of the first switch may be updated by using the standby output port according to the state of the adaptive routing function of the first switch and the state of the standby output port corresponding to the destination network card identifier.
It should be noted that, if the state of the adaptive routing function of the first switch is off and the standby output port in the routing table is 0, the target output port in the routing table of the first switch is not updated; instead, a first recovery notification message is generated and sent to the second switch through the second output port. The second output port is an output port other than the target output port in the first switch, its port type is an inter-switch link, and the first recovery notification message is used to notify the second switch that the target output port has failed. For example, if the target receiving port S is the target output port OutputPort of the first switch, the adaptive routing function of the first switch is off, and the standby output port corresponding to the destination network card identifier is 0, the first recovery notification message is sent to the second switch through the second output port.
In addition, if the state of the adaptive routing function of the first switch is on and the routing table contains no standby output port other than the target receiving port, the target output port in the routing table of the first switch is not updated; instead, a first recovery notification message is generated and sent to the second switch through the second output port. The second output port is an output port other than the target output port in the first switch, its port type is an inter-switch link, and the first recovery notification message is used to notify the second switch that the target output port has failed. For example, if the target receiving port S is the target output port OutputPort of the first switch, the adaptive routing function of the first switch is on, and no standby output port other than S can be found in the routing table, then no usable standby output port is configured for this network card identifier. The first switch packages the network card identifiers Lid corresponding to its target output port into a first recovery notification message and sends it to the other switches (including the second switch) through all output ports, other than the target output port, whose port type is an inter-switch link, so as to notify the other switches not to route these network card identifiers to the first switch.
It should be noted that fig. 14 is a schematic diagram of an update table provided in an embodiment. A processor in the chip of the first switch may store the switching information of route Fast Recovery (FR) in the update table, so that the subnet management service of the server can query what has been updated.
In this embodiment, the second recovery notification message sent by the third switch is received through the target receiving port of the first switch, and if the target receiving port is the target output port, the target output port in the routing table of the first switch is updated according to the standby output port corresponding to the destination network card identifier. In this way the recovery notification message can reach each switch along the switch links; that is, once a link failure is detected, the other switches in the network are notified and the link state is updated immediately, so that the whole network can quickly update the failed link state and select a new available path. This speeds up recovery of the network from the fault state to the normal state. It also allows problems in the network to be repaired automatically, which reduces the complexity of network management. In addition, because the recovery notification message is sent to each switch, the effect is equivalent to the whole network detecting the fault point, which greatly reduces the probability of packet loss and retransmission when data messages sent by other nodes or switches are forwarded toward the failed port.
In an exemplary embodiment, the step S1302 includes the following cases:
Case one: if the state of the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch.
Specifically, if the state of the adaptive routing function of the first switch is off and the standby output port corresponding to the destination network card identifier in the routing table is not 0, the standby output port is written into the target output port in the routing table of the first switch, thereby updating it. The modified content is recorded in the memory of the first switch, and the state of the first switch is marked as fault-tolerant update.
For example, fig. 15 is a flow chart of a recovery notification message parsing method with the adaptive routing function off according to an embodiment. As shown in fig. 15, the method may include the following steps:
S1501, port S receives the recovery notification message FRN_MAD and empties the network card identification list FRN_Lid of the original recovery notification message.
S1502, determining whether the MAD message is analyzed.
Then either steps S1503 to S1504 are performed, or step S1505 and the subsequent steps are performed:
S1503, if the MAD message analysis is completed, determining whether the network card identification list of the original recovery notification message is empty.
S1504, if the network card identification list of the original recovery notification message is not empty, constructing a recovery notification message FRN_MAD through the list FRN_Lid, and sending the recovery notification message FRN_MAD out of all switch ports except the port S.
S1505, if the MAD message analysis is not completed, a network card identification number Lid in a recovery notification message FRN_MAD is obtained.
S1506, reading the target output port and the standby output port of the Lid in the linear routing table, and determining whether the target output port of the Lid is the port S.
The steps of S1507 to S1510 described below are performed, or S1511 described below is performed.
S1507, if the target output port of Lid is port S, it is determined whether the standby output port is 0.
The following step S1508 is performed, or the following steps S1509 to S1510 are performed.
S1508, if the standby output port is 0, adding the Lid to the list FRN_Lid, and executing the step of determining whether the MAD message is analyzed.
S1509, if the spare output port is not 0, writing the spare output port into the target output port, and setting the spare output port to 0.
S1510, adding the Lid and the modified target output port into an update table, and updating the switch state into fault-tolerant update.
S1511, if the target output port of the Lid is not the port S, returning to execute the step of determining whether the MAD message is analyzed.
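The loop formed by steps S1501 to S1511 can be sketched as follows. The dictionary-based routing table, the function name and the handling of a Lid that is missing from the routing table (it is skipped) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def process_frn_ar_off(
    receive_port: int,
    frn_lids: List[int],
    routes: Dict[int, Dict[str, int]],
    update_table: List[Tuple[int, int]],
) -> List[int]:
    """Apply S1501-S1511 and return the Lids to re-announce (FRN_Lid)."""
    frn_lid: List[int] = []                       # S1501: start from an empty list
    for lid in frn_lids:                          # S1502 / S1505
        entry = routes.get(lid)                   # S1506
        if entry is None or entry["out_port"] != receive_port:
            continue                              # S1511 (unknown Lid: assumption)
        if entry["spare_port"] == 0:              # S1507
            frn_lid.append(lid)                   # S1508: no standby port, pass it on
        else:
            entry["out_port"] = entry["spare_port"]        # S1509
            entry["spare_port"] = 0
            update_table.append((lid, entry["out_port"]))  # S1510
    return frn_lid                                # S1503 / S1504: send if non-empty
```

If the returned list is non-empty, the caller would build a new FRN_MAD from it and send it out of every switch port except the receiving port S; a non-empty `update_table` corresponds to the fault-tolerant update state.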
Case two: if the state of the adaptive routing function of the first switch is on, writing a standby output port other than the target receiving port into the target output port in the routing table of the first switch.
Specifically, if the state of the adaptive routing function of the first switch is on and the standby output port group corresponding to the destination network card identifier in the routing table contains a standby output port other than the target receiving port, any such standby output port in the group is written into the target output port in the routing table of the first switch, thereby updating it. At the same time, the adaptive routing function corresponding to the destination network card identifier is set to static mode, that is, adaptive routing is no longer used for that destination network card identifier. The modified content is recorded in the memory of the first switch, and the state of the first switch is marked as fault-tolerant update.
For example, fig. 16 is a flow chart of a recovery notification message parsing method with the adaptive routing function on according to an embodiment. As shown in fig. 16, the method may include the following steps:
S1601, port S receives the recovery notification message FRN_MAD and empties the network card identification list FRN_Lid of the original recovery notification message.
S1602, determining whether the MAD message is analyzed.
Then either steps S1603 to S1604 are performed, or step S1605 and the subsequent steps are performed:
S1603, if the MAD message analysis is completed, determining whether the network card identification list of the original recovery notification message is empty.
S1604, if the network card identification list of the original recovery notification message is not empty, constructing a recovery notification message FRN_MAD through the list FRN_Lid, and sending the recovery notification message FRN_MAD out from all switch ports except the port S.
S1605, if the MAD message analysis is not finished, obtaining a network card identification number Lid in a recovery notification message FRN_MAD.
S1606, the target output port of the Lid in the linear routing table, the state of the self-adaptive routing function and the standby output port group are read, and whether the target output port of the Lid is the port S is determined.
Steps S1607 to S1612 described below are performed, or S1613 described below is performed.
S1607, if the destination output port of the Lid is the port S, it is determined whether the state of the adaptive routing function is on.
The steps of S1608 to S1611 described below are performed, or S1612 described below is performed:
S1608, if the state of the adaptive routing function is on, determining whether there is only one port in the standby output port group.
The steps of S1609 described below, or S1610 to S1611 described below are performed.
S1609, if only one port is in the standby output port group, adding the Lid to the list FRN_Lid, and executing the step of determining whether the MAD message is analyzed.
S1610, if a plurality of ports exist in the standby output port group, writing a port other than port S from the standby output port group into the target output port, and setting the state of the adaptive routing function of the Lid to 0.
S1611, adding the Lid and the modified target output port to an update table, and updating the switch state to be fault-tolerant update.
S1612, if the state of the adaptive routing function is not on, adding the Lid to the list FRN_Lid, and executing the step of determining whether the MAD message is analyzed.
S1613, if the target output port of the Lid is not the port S, returning to execute the step of determining whether the MAD message is analyzed.
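A comparable sketch for steps S1601 to S1613 is given below. The data layout and function name are hypothetical. As a defensive assumption not stated in the steps, a group that contains no port other than the receiving port S is treated like a single-port group, and a Lid missing from the routing table is skipped.

```python
from typing import Dict, List, Tuple


def process_frn_ar_on(
    receive_port: int,
    frn_lids: List[int],
    routes: Dict[int, dict],
    groups: Dict[str, List[int]],
    update_table: List[Tuple[int, int]],
) -> List[int]:
    """Apply S1601-S1613 and return the Lids to re-announce (FRN_Lid)."""
    frn_lid: List[int] = []                           # S1601
    for lid in frn_lids:                              # S1602 / S1605
        entry = routes.get(lid)                       # S1606
        if entry is None or entry["out_port"] != receive_port:
            continue                                  # S1613
        ports = groups.get(entry["group"], [])
        alternatives = [p for p in ports if p != receive_port]
        if not entry["ar_on"]:                        # S1607
            frn_lid.append(lid)                       # S1612
        elif len(ports) <= 1 or not alternatives:     # S1608
            frn_lid.append(lid)                       # S1609
        else:
            entry["out_port"] = alternatives[0]       # S1610: a port other than S
            entry["ar_on"] = False                    # fall back to static mode
            update_table.append((lid, entry["out_port"]))  # S1611
    return frn_lid
```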
In this embodiment, if the state of the adaptive routing function of the first switch is off, the standby output port corresponding to the destination network card identifier is written into the target output port in the routing table of the first switch; if the state of the adaptive routing function of the first switch is on, a standby output port other than the target receiving port is written into the target output port in the routing table of the first switch. Because the route is updated automatically when a link failure occurs, the network can repair itself without manual intervention by an administrator, and network delay is reduced.
Fig. 17 is a simulated diagram of an uplink failure provided by an embodiment. As shown in fig. 17, when an uplink failure occurs, the switch has multiple routing paths to reach the destination; this typically occurs on the upstream links of a three-layer fat-tree network, such as from the access layer to the convergence layer or from the convergence layer to the core layer. If one of the paths of the switch fails, the data message is forwarded to another port of the switch and reaches the destination via the new path. When the adaptive routing function is on, the standby port group field of the linear routing table is queried according to the destination network card identifier Lid, the bitmap of the corresponding port group is looked up in the standby port group table, and an available standby output port is selected to forward the data message. When the adaptive routing function is off, the standby output port in the linear routing table is queried according to the destination network card identifier Lid, and the data message is forwarded using that standby output port.
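The bitmap lookup mentioned for the adaptive routing case can be illustrated as below. The encoding (bit i set means port i belongs to the group or is up) and the choice of the lowest-numbered available port are assumptions; the source does not say how the available standby output port is chosen.

```python
from typing import Iterable, Optional


def ports_to_bitmap(ports: Iterable[int]) -> int:
    """Encode a set of port numbers as a bitmap (bit i <-> port i)."""
    bitmap = 0
    for port in ports:
        bitmap |= 1 << port
    return bitmap


def pick_spare_port(group_bitmap: int, up_bitmap: int) -> Optional[int]:
    """Return the lowest-numbered group port whose link is up, else None."""
    candidates = group_bitmap & up_bitmap
    if candidates == 0:
        return None
    lowest = candidates & -candidates  # isolate the lowest set bit
    return lowest.bit_length() - 1
```

With group G1 holding ports 2 and 3, `pick_spare_port(ports_to_bitmap([2, 3]), ports_to_bitmap([3]))` selects port 3 because port 2 is down.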
Fig. 18 is a simulation diagram of a downlink failure provided by an embodiment. As shown in fig. 18, when a downlink of a switch fails, the switch may have no alternative port that can reach the destination. This typically occurs at the downstream switches of a three-layer fat-tree network, for example from the core layer to the aggregation layer, or from the aggregation layer to the access layer. In this case, the switch passes a recovery notification message to other switches in the network, so that the best route to the destination can be selected at another switch. If the switch has only one port capable of reaching the node corresponding to the destination Lid and that port fails, each switch that receives the recovery notification message determines whether its receiving port is the port it uses to reach the node corresponding to the destination network card identifier Lid. If the receiving port differs from the port for the destination network card identifier Lid, processing of the recovery notification message ends; if the receiving port is the same as the port for the destination network card identifier Lid and an available standby output port exists for that Lid, the linear routing table is updated and processing of the recovery notification message ends; if the receiving port is the same as the port for the destination network card identifier Lid and there is no other standby output port, the switch regenerates a recovery notification message and sends it from its other switch ports, and the switches that receive it repeat the step of judging whether the receiving port is the same as the port for the destination network card identifier Lid.
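The three-way decision made by a switch that receives a recovery notification message can be written down as a short Python sketch. The enum and function names are hypothetical and serve only to make the three outcomes explicit.

```python
from enum import Enum


class Action(Enum):
    END = "end processing"
    UPDATE_TABLE = "update the linear routing table, then end processing"
    PROPAGATE = "regenerate the message and send it from the other switch ports"


def classify(recv_port: int, lid_port: int, standby_available: bool) -> Action:
    """Decide how to handle a recovery notification message for one Lid.

    recv_port: port on which the message arrived.
    lid_port: port this switch currently uses to reach the destination Lid.
    standby_available: whether a usable standby output port exists for the Lid.
    """
    if recv_port != lid_port:
        return Action.END
    if standby_available:
        return Action.UPDATE_TABLE
    return Action.PROPAGATE
```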
It should be noted that the following situations may occur during transmission of the recovery notification message:
(1) A switch may receive recovery notification messages from several switches in succession. For example, sw100 receives the recovery notification message sent by sw201; because a standby output port exists at that time, sw100 updates its linear routing table. When sw100 then receives the recovery notification message sent by sw200, the standby output port is no longer reachable, so sw100 itself becomes a switch that cannot reach the server and must generate a recovery notification message and send it to other switches.
(2) A switch that has already sent a recovery notification message, and has therefore become a switch that cannot reach the destination network card identifier, may still receive recovery notification messages sent by other switches. For example, sw201 receives the recovery notification message of sw130, determines that a recovery notification message must be generated and sent again, and sw100 then makes the same determination and sends a recovery notification message to sw201. Because sw201 finds no match for the destination network card identifier when it queries its linear routing table, it ends processing of the recovery notification message, so no transmission loop is formed.
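The following Python sketch simulates both situations on a small hypothetical topology in order to show why propagation terminates. The topology (one core switch and two aggregation switches, `agg_a` and `agg_b`), the data layout and the rule that a switch deletes its entry once it can no longer reach the Lid are assumptions made for illustration; they are not taken from fig. 18.

```python
from collections import deque
from typing import Dict, List, Tuple

Links = Dict[str, Dict[int, Tuple[str, int]]]  # switch -> port -> (peer, peer port)
Routes = Dict[str, Dict[int, List[int]]]       # switch -> lid -> [target, standby...]


def propagate(links: Links, routes: Routes, origin: str, lid: int) -> List[Tuple[str, str]]:
    """Flood recovery notifications after `origin` lost its only port to `lid`.

    `links` holds only healthy inter-switch links. Returns a log of
    (switch, action) pairs in processing order.
    """
    routes[origin].pop(lid, None)  # origin can no longer reach the Lid
    queue = deque((origin, port) for port in links[origin])
    log: List[Tuple[str, str]] = []
    while queue:
        sender, out_port = queue.popleft()
        receiver, recv_port = links[sender][out_port]
        ports = routes[receiver].get(lid)
        # No entry for the Lid, or the message did not arrive on the target port.
        if not ports or ports[0] != recv_port:
            log.append((receiver, "end"))
            continue
        ports.pop(0)  # the current target port no longer leads to the Lid
        if ports:
            log.append((receiver, "update"))  # a standby port becomes the target
        else:
            del routes[receiver][lid]
            log.append((receiver, "propagate"))
            queue.extend((receiver, p) for p in links[receiver] if p != recv_port)
    return log


links = {
    "core": {1: ("agg_a", 1), 2: ("agg_b", 1)},
    "agg_a": {1: ("core", 1)},
    "agg_b": {1: ("core", 2)},
}
routes = {"core": {9: [1, 2]}, "agg_a": {9: [2]}, "agg_b": {9: [2]}}
first = propagate(links, routes, "agg_a", 9)   # situation (1), first message
second = propagate(links, routes, "agg_b", 9)  # situation (1) second message, then (2)
```

In this sketch every propagation step deletes one routing entry, so the number of messages is bounded and a switch that has already lost its entry simply ends processing.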
In a detailed embodiment, as shown in fig. 19, fig. 19 is a flow chart of a detailed link failure processing method, and the link failure processing method may include the following steps:
S1901, determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to the destination network card identifier in the routing table of the first switch.
The following S1902 or S1903-S1908 are performed:
S1902, if the target output port is disconnected and the standby output port is not disconnected, forwarding the data message through the standby output port.
S1903, if both the target output port and the standby output port are disconnected and the port type of the target output port is an inter-switch link, determining the state of the adaptive routing function of the first switch.
S1904, generating a risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; the preset trigger condition is met in either of the following cases: the adaptive routing function is off and there is no standby output port in the routing table; or the adaptive routing function is on and the number of standby output ports in the routing table equals a preset number.
S1905, querying the risk table for the destination network card identifier of the target output port.
S1906, generating a first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
S1907, sending a first recovery notification message to the second switch through the second output port; the second output port is an output port except for the target output port in the first switch, the port type of the second output port is an inter-switch link, and the first recovery notification message is used for notifying the second switch of the target output port fault.
S1908, receiving a second recovery notification message sent by the third switch through the target receiving port of the first switch.
The following S1909 or S1910 is performed:
S1909, if the target receiving port is the target output port of the first switch and the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch.
S1910, if the target receiving port is the target output port of the first switch and the adaptive routing function of the first switch is on, writing the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch.
In this embodiment, when the standby output port is available, the first switch preferentially switches to the standby output port to forward the data message, which shortens the link recovery time and reduces the network delay. When the standby output port is not available, the first switch notifies the other switches that messages should not be sent to the failed target output port, which likewise shortens the link recovery time and reduces the network delay.
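The branch taken by the first switch in S1901 to S1903 can be condensed into the Python sketch below. The function name and the returned labels are hypothetical; the first and last branches (a healthy target output port, and a failed port that is not an inter-switch link) are not spelled out in the steps above and are included only as assumptions to make the function total.

```python
def choose_action(target_up: bool, standby_up: bool, target_is_isl: bool) -> str:
    """Map the port states determined in S1901 to the next step of the flow."""
    if target_up:
        # Assumption: nothing changes while the target output port is healthy.
        return "forward_on_target"
    if standby_up:
        # S1902: target disconnected, standby still connected.
        return "forward_on_standby"
    if target_is_isl:
        # S1903 to S1907: both disconnected on an inter-switch link.
        return "send_recovery_notification"
    # Assumption: no notification is generated for other port types.
    return "no_notification"
```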
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a link failure processing device for implementing the above-mentioned link failure processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the link failure processing device or devices provided below may refer to the limitation of the link failure processing method hereinabove, and will not be repeated herein.
In an exemplary embodiment, as shown in fig. 20, there is provided a link failure processing apparatus 2000, including: a first determination module 2001 and a first transmission module 2002, wherein:
a first determining module 2001, configured to determine a state of a first output port of the first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
a first sending module 2002, configured to, when the states of the first output ports are all disconnected, send a first recovery notification message to the second switch through the second output port if the port type of the target output port is an inter-switch link; the second output port is a static output port of the first switch other than the target output port, and the first recovery notification message is used for notifying the second switch of the fault of the target output port.
In an exemplary embodiment, the link failure processing apparatus 2000 further includes:
a second determining module, configured to determine the state of the adaptive routing function of the first switch;
a first generation module, configured to generate a risk table based on the state of the adaptive routing function and the state of the standby output port in the routing table of the first switch;
a second generation module, configured to generate the first recovery notification message according to the risk table.
In one exemplary embodiment, the first generation module includes:
a first generation sub-module, configured to generate a risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; the preset trigger condition is met in either of the following cases: the adaptive routing function is off and there is no standby output port in the routing table; or the adaptive routing function is on and the number of standby output ports in the routing table equals a preset number.
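One way to picture the risk table is the Python sketch below. It assumes, without support in the text beyond the later query step, that the table is keyed by target output port and lists the destination Lids at risk on that port; the field names `target_port` and `standby_ports` and the parameter `preset_count` are hypothetical, and the value of the preset number is left to the caller because the application does not state it.

```python
from typing import Dict, List


def build_risk_table(
    routing_table: Dict[int, dict],
    ar_enabled: bool,
    preset_count: int,
) -> Dict[int, List[int]]:
    """Collect the destination Lids that meet the preset trigger condition.

    Returns a mapping from target output port to the Lids routed through it.
    """
    risk_table: Dict[int, List[int]] = {}
    for lid, entry in sorted(routing_table.items()):
        standby_count = len(entry["standby_ports"])
        if ar_enabled:
            # Adaptive routing on: standby port count equals the preset number.
            triggered = standby_count == preset_count
        else:
            # Adaptive routing off: no standby output port in the routing table.
            triggered = standby_count == 0
        if triggered:
            risk_table.setdefault(entry["target_port"], []).append(lid)
    return risk_table
```

When a target output port fails, looking up that port in the returned mapping yields the destination network card identifiers to be carried in the first recovery notification message.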
In one exemplary embodiment, the second generating module includes:
a query sub-module, configured to query the risk table for the destination network card identifier of the target output port;
a second generation sub-module, configured to generate the first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
In an exemplary embodiment, the link failure processing apparatus 2000 further includes:
a second sending module, configured to receive, through the target receiving port of the first switch, a second recovery notification message sent by the third switch;
a receiving module, configured to update the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier if the target receiving port is the target output port.
In one exemplary embodiment, the receiving module includes:
a first writing sub-module, configured to write the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch if the adaptive routing function of the first switch is off;
a second writing sub-module, configured to write the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch if the adaptive routing function of the first switch is on.
Each module in the above link failure processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 21. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a link failure handling method.
It will be appreciated by those skilled in the art that the structure shown in fig. 21 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one exemplary embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
under the condition that the states of the first output ports are all disconnected, if the port type of the target output port is an inter-switch link, a first recovery notification message is sent to a second switch through the second output port; the second output port is a static output port except for the target output port in the first switch, and the first recovery notification message is used for notifying the second switch of the target output port fault.
In one embodiment, the processor when executing the computer program further performs the steps of:
determining a state of an adaptive routing function of the first switch;
generating a risk table based on the state of the self-adaptive routing function and the state of the standby output port in the routing table of the first switch;
and generating a first recovery notification message according to the risk table.
In one embodiment, the processor when executing the computer program further performs the steps of:
generating a risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; wherein
the preset trigger condition is met in either of the following cases:
the adaptive routing function is off, and there is no standby output port in the routing table;
the adaptive routing function is on, and the number of standby output ports in the routing table equals a preset number.
In one embodiment, the processor when executing the computer program further performs the steps of:
querying the risk table for the destination network card identifier of the target output port;
and generating a first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
In one embodiment, the processor when executing the computer program further performs the steps of:
receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch;
if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
In one embodiment, the processor when executing the computer program further performs the steps of:
updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier, which includes:
if the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch;
if the adaptive routing function of the first switch is on, writing the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
under the condition that the states of the first output ports are all disconnected, if the port type of the target output port is an inter-switch link, a first recovery notification message is sent to a second switch through the second output port; the second output port is a static output port of the first switch other than the target output port, and the first recovery notification message is used for notifying the second switch of the fault of the target output port.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a state of an adaptive routing function of the first switch;
generating a risk table based on the state of the self-adaptive routing function and the state of the standby output port in the routing table of the first switch;
and generating a first recovery notification message according to the risk table.
In one embodiment, the computer program when executed by the processor further performs the steps of:
generating a risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; wherein
the preset trigger condition is met in either of the following cases:
the adaptive routing function is off, and there is no standby output port in the routing table;
the adaptive routing function is on, and the number of standby output ports in the routing table equals a preset number.
In one embodiment, the computer program when executed by the processor further performs the steps of:
querying the risk table for the destination network card identifier of the target output port;
and generating a first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch;
if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
In one embodiment, the computer program when executed by the processor further performs the steps of:
updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier, which includes:
if the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch;
if the adaptive routing function of the first switch is on, writing the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
under the condition that the states of the first output ports are all disconnected, if the port type of the target output port is an inter-switch link, a first recovery notification message is sent to a second switch through the second output port; the second output port is a static output port except for the target output port in the first switch, and the first recovery notification message is used for notifying the second switch of the target output port fault.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a state of an adaptive routing function of the first switch;
generating a risk table based on the state of the self-adaptive routing function and the state of the standby output port in the routing table of the first switch;
generating a first recovery notification message according to the risk table.
In one embodiment, the computer program when executed by the processor further performs the steps of:
generating a risk table when a preset trigger condition is met; the risk table comprises the destination network card identifier; wherein
the preset trigger condition is met in either of the following cases:
the adaptive routing function is off, and there is no standby output port in the routing table;
the adaptive routing function is on, and the number of standby output ports in the routing table equals a preset number.
In one embodiment, the computer program when executed by the processor further performs the steps of:
querying the risk table for the destination network card identifier of the target output port;
and generating a first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch;
if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
In one embodiment, the computer program when executed by the processor further performs the steps of:
updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier, which includes:
if the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch;
if the adaptive routing function of the first switch is on, writing the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples represent only a few embodiments of the present application, which are described in more detail and are not thereby to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of link failure handling, the method comprising:
determining a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
under the condition that the states of the first output port are all disconnected, if the port type of the target output port is an inter-switch link, sending a first recovery notification message to a second switch through a second output port; the second output port is a static output port of the first switch other than the target output port, and the first recovery notification message is used for notifying the second switch of the fault of the target output port.
2. The method according to claim 1, wherein the method further comprises:
determining a state of an adaptive routing function of the first switch;
generating a risk table based on the state of the adaptive routing function and the state of the standby output port in the routing table of the first switch;
and generating the first recovery notification message according to the risk table.
3. The method of claim 2, wherein the generating a risk table based on the state of the adaptive routing function and the state of the backup output port in the routing table of the first switch comprises:
generating the risk table under the condition that a preset triggering condition is met; the risk table comprises the destination network card identifier; wherein,
the preset triggering condition is met in either of the following cases:
the adaptive routing function is off, and there is no standby output port in the routing table;
the adaptive routing function is on, and the number of the standby output ports in the routing table equals a preset number.
4. The method of claim 3, wherein generating the first recovery notification message according to the risk table comprises:
querying the risk table for the destination network card identifier of the target output port;
and generating the first recovery notification message according to the destination network card identifier and a preset generation format corresponding to the first recovery notification message.
5. The method according to claim 4, wherein the method further comprises:
receiving a second recovery notification message sent by a third switch through a target receiving port of the first switch;
and if the target receiving port is the target output port, updating the target output port in the routing table of the first switch according to the standby output port corresponding to the destination network card identifier.
6. The method of claim 5, wherein updating the destination output port in the routing table of the first switch according to the backup output port corresponding to the destination network card identifier comprises:
if the adaptive routing function of the first switch is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch;
and if the adaptive routing function of the first switch is on, writing the ports of the first switch other than the target receiving port into the target output port in the routing table of the first switch.
7. A link failure handling apparatus, the apparatus comprising:
a first determining module, configured to determine a state of a first output port of a first switch; the first output port comprises a target output port and a standby output port corresponding to the target output port; the target output port is a static output port corresponding to a destination network card identifier in a routing table of the first switch;
the first sending module is used for sending a first recovery notification message to the second switch through the second output port if the port type of the target output port is an inter-switch link under the condition that the states of the first output port are all disconnected; the second output port is a static output port except the target output port in the first switch, and the first recovery notification message is used for notifying the second switch of the fault of the target output port.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
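The logic recited in claims 6 and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Switch` class, its field names, the `static_ports` list, and the message format are all hypothetical, and the target receiving port is simply passed in as a parameter.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical model of the first switch; every field name is an assumption."""
    ports: list[str]                     # all ports of the switch
    link_up: dict[str, bool]             # port -> True if the link is connected
    port_type: dict[str, str]            # port -> "inter_switch" or "host"
    routing_table: dict[str, list[str]]  # network card identifier -> output port(s)
    standby: dict[str, str]              # network card identifier -> standby output port
    static_ports: list[str]              # static output ports of the switch
    adaptive_routing: bool = False       # state of the adaptive routing function
    sent: list[tuple[str, str]] = field(default_factory=list)  # (port, message) log


def notify_on_total_failure(sw: Switch, nic_id: str) -> bool:
    """Claim 7 sketch: notify neighbours when target and standby ports are both down."""
    target = sw.routing_table[nic_id][0]
    standby = sw.standby[nic_id]
    # The "first output port" covers the target port and its standby port.
    if sw.link_up[target] or sw.link_up[standby]:
        return False
    # Only a failed inter-switch link is reported to other switches.
    if sw.port_type[target] != "inter_switch":
        return False
    # Second output ports: static output ports other than the target port.
    for port in sw.static_ports:
        if port != target:
            sw.sent.append((port, f"fault:{target}"))  # assumed message format
    return True


def update_target_output_port(sw: Switch, nic_id: str, receiving_port: str) -> None:
    """Claim 6 sketch: rewrite the routing-table entry for nic_id."""
    if not sw.adaptive_routing:
        # Adaptive routing disabled: fall back to the standby output port.
        sw.routing_table[nic_id] = [sw.standby[nic_id]]
    else:
        # Adaptive routing enabled: allow every port except the receiving port.
        sw.routing_table[nic_id] = [p for p in sw.ports if p != receiving_port]
```

In this sketch a routing-table entry is a list of ports, so the same structure can hold either the single standby port or the set of ports permitted under adaptive routing.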
CN202311325217.5A 2023-10-12 2023-10-12 Link failure processing method, device, equipment, storage medium and program product Pending CN117354267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311325217.5A CN117354267A (en) 2023-10-12 2023-10-12 Link failure processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117354267A true CN117354267A (en) 2024-01-05

Family

ID=89370598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311325217.5A Pending CN117354267A (en) 2023-10-12 2023-10-12 Link failure processing method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN117354267A (en)

Similar Documents

Publication Publication Date Title
CN100452797C (en) High-available distributed boundary gateway protocol system based on cluster router structure
KR102014433B1 (en) System and method for supporting discovery and routing degraded fat-trees in a middleware machine environment
CN106059791B (en) Link switching method of service in storage system and storage device
JP5941404B2 (en) Communication system, path switching method, and communication apparatus
KR20070026327A (en) Redundant routing capabilities for a network node cluster
CN105827419A (en) Forwarding equipment fault processing method, equipment and controller
CN112787960B (en) Stack splitting processing method, device and equipment and storage medium
CN102394914A (en) Cluster brain-split processing method and device
US11403319B2 (en) High-availability network device database synchronization
EP3316555A1 (en) Mac address synchronization method, device and system
EP3213441B1 (en) Redundancy for port extender chains
CN109639773A (en) A kind of the distributed data cluster control system and its method of dynamic construction
US9509523B2 (en) Method for protection switching in ethernet ring network
CN108512753B (en) Method and device for transmitting messages in cluster file system
CN113489149B (en) Power grid monitoring system service master node selection method based on real-time state sensing
US11695856B2 (en) Scheduling solution configuration method and apparatus, computer readable storage medium thereof, and computer device
US20180048487A1 (en) Method for handling network partition in cloud computing
US8972771B2 (en) Connection control apparatus, storage system, and control method of connection control apparatus
CN117354267A (en) Link failure processing method, device, equipment, storage medium and program product
CN112131201B (en) Method, system, equipment and medium for high availability of network additional storage
CN114124803B (en) Device management method and device, electronic device and storage medium
US10516625B2 (en) Network entities on ring networks
CN112491633B (en) Fault recovery method, system and related components of multi-node cluster
CN108282346B (en) Software upgrading method and device
CN117411840A (en) Link failure processing method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination