CN117411840A - Link failure processing method, device, equipment, storage medium and program product - Google Patents


Info

Publication number
CN117411840A
Authority
CN
China
Prior art keywords
output port
switch
state
port
standby
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311324249.3A
Other languages
Chinese (zh)
Inventor
万伟
纵瑞博
李俊宏
欧阳长冬
李�柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Co Ltd
Original Assignee
Dawning Information Industry Co Ltd
Application filed by Dawning Information Industry Co Ltd
Priority to CN202311324249.3A
Publication of CN117411840A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/557 Error correction, e.g. fault recovery or fault tolerance
    • H04L49/25 Routing or path finding in a switch fabric
    • H04L49/253 Routing or path finding in a switch fabric using establishment or release of connections between ports
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/24 Multipath
    • H04L45/247 Multipath using M:N active or standby paths
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery

Abstract

The application relates to a link failure processing method, apparatus, device, storage medium and program product. The method comprises: checking the on-off state of a target output port of a first switch; if the target output port is off, determining a standby output port corresponding to the target output port; and forwarding the data message through the standby output port. The method shortens link recovery time and thereby reduces network delay.

Description

Link failure processing method, device, equipment, storage medium and program product
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for processing a link failure.
Background
High-performance computing (HPC) is used to solve complex problems. Because data is growing exponentially, larger computing clusters are needed to meet current and future computing challenges. HPC therefore needs an efficient link failure handling method to reduce the latency of inter-process communication across the cluster.
In the conventional art, when a link failure occurs in the network, a network node sends notification information (e.g., a trap message) to the subnet management service on a server. On receiving the trap message, the subnet management service triggers a heavy sweep of the subnet and recalculates routes to bypass the failed link.
However, this approach can take a long time to recover the link, which causes network delay.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a link failure processing method, apparatus, device, storage medium and program product that reduce the network delay caused by excessively long link recovery times.
In a first aspect, the present application provides a link failure processing method, including:
checking the on-off state of a target output port of a first switch, the target output port being the output port corresponding to a destination network card identifier in a routing table of the first switch;
if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port;
and forwarding the data message through the standby output port.
In this link failure processing method, the on-off state of a target output port of a first switch is checked; if the target output port is off, a standby output port corresponding to it is determined, and the data message is forwarded through the standby output port. The target output port is the output port corresponding to the destination network card identifier in the routing table of the first switch. In the conventional technology, after a link failure the switch sends a notification to the subnet management service, which recalculates routes and updates the routing tables of all switches; this makes link recovery slow and causes network delay. In the embodiments of the present application, the switch first recovers the route autonomously after a link failure by forwarding the data message through the standby output port, and notifies the server only when autonomous recovery is impossible. This shortens link recovery time and reduces network delay.
In one embodiment, forwarding the data message through the standby output port includes:
determining a state of an adaptive routing function of the first switch;
and forwarding the data message through the standby output port based on the state of the adaptive routing function.
In this embodiment, the state of the adaptive routing function of the first switch may be determined, and the data packet is forwarded through the standby output port based on the state of the adaptive routing function.
In one embodiment, forwarding the data message through the standby output port based on the state of the adaptive routing function includes:
if the adaptive routing function is not enabled, forwarding the data message through the standby output port when the standby output port is connected;
if the adaptive routing function is enabled, selecting a connected candidate output port from a plurality of candidate output ports as the standby output port, and forwarding the data message through the standby output port.
In this embodiment, if the adaptive routing function is not enabled, the data message is forwarded through the single standby output port provided that port is connected; if the adaptive routing function is enabled, a connected port among several candidate output ports is chosen as the standby output port and the data message is forwarded through it.
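The two branches can be illustrated with a minimal Python sketch. It assumes port numbers are plain integers, that a standby port of 0 means none is configured (the convention the subnet management service uses later in this description), and that a caller-supplied `is_connected` callback reports link state; the function name `select_standby_port` is hypothetical.

```python
from typing import Callable, Optional, Sequence


def select_standby_port(
    ar_enabled: bool,
    standby_port: int,
    candidate_ports: Sequence[int],
    is_connected: Callable[[int], bool],
) -> Optional[int]:
    """Return the port to forward on, or None when no usable port exists.

    Hypothetical helper: the names and the "0 = no standby port" convention
    are illustrative assumptions, not a switch vendor API.
    """
    if not ar_enabled:
        # AR disabled: exactly one standby port, used only if it is connected.
        if standby_port != 0 and is_connected(standby_port):
            return standby_port
        return None
    # AR enabled: take the first connected port among the candidates.
    for port in candidate_ports:
        if is_connected(port):
            return port
    return None
```

A `None` result corresponds to the preset condition under which the switch stops trying locally and generates a recovery notification.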
In one embodiment, the method further comprises:
generating a first recovery notification message from the data message when a preset condition is met, the first recovery notification message indicating that the first switch has no available output port, where the preset condition is either that the adaptive routing function is not enabled and the standby output port is disconnected, or that the adaptive routing function is enabled and all of the plurality of candidate output ports are disconnected;
sending the first recovery notification message to a second switch, the second switch being the switch that sent the data message to the first switch.
In this embodiment, when the preset condition is met (the adaptive routing function is not enabled and the standby output port is disconnected, or the adaptive routing function is enabled and all candidate output ports are disconnected), a first recovery notification message indicating that the first switch has no available output port is generated from the data message and sent to the second switch, i.e., the switch that sent the data message to the first switch. In this way, a switch that detects a link failure but cannot recover from it on its own can inform other switches in the network, so the whole network quickly learns the state of the failed link and finds an available path. This speeds up the network's return from the failure state to normal operation and reduces link failure recovery time from seconds to milliseconds.
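The preset condition and the construction of the first recovery notification can be sketched as follows. This is a minimal illustration, assuming the data message carries its destination identifier and the port it arrived on, and that the notification is sent back out of that ingress port toward the second switch; the classes `DataMessage` and `RecoveryNotification` and their fields are hypothetical, since the message format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DataMessage:
    dlid: int          # destination network card identifier
    ingress_port: int  # port the message arrived on, i.e. the link to the second switch


@dataclass
class RecoveryNotification:
    dlid: int          # destination the first switch can no longer reach
    out_port: int      # port the notification is sent back on


def preset_condition_met(
    ar_enabled: bool, standby_connected: bool, candidates_connected: Sequence[bool]
) -> bool:
    """True when the first switch has no available output port."""
    if not ar_enabled:
        return not standby_connected
    # any() over an empty sequence is False, so "no candidates" also counts as met.
    return not any(candidates_connected)


def build_first_notification(
    msg: DataMessage,
    ar_enabled: bool,
    standby_connected: bool,
    candidates_connected: Sequence[bool],
) -> Optional[RecoveryNotification]:
    """Build the first recovery notification, or None if forwarding is still possible."""
    if preset_condition_met(ar_enabled, standby_connected, candidates_connected):
        # Send it back toward the second switch, which delivered the data message.
        return RecoveryNotification(dlid=msg.dlid, out_port=msg.ingress_port)
    return None
```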
In one embodiment, the method further comprises:
receiving a second recovery notification message sent by a third switch and determining the port type of the next hop, the second recovery notification message being sent by the third switch when the preset condition is met;
if the port type of the next hop is a network card and the adaptive routing function is not enabled, writing the standby output port corresponding to the destination network card identifier into the target output port entry in the routing table of the first switch;
if the port type of the next hop is a network card and the adaptive routing function is enabled, writing a target standby output port into the target output port entry in the routing table of the first switch, the target standby output port being a port that has not received the second recovery notification message;
if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
In this embodiment, the first switch receives the second recovery notification message sent by the third switch (which sends it when the preset condition is met) and determines the port type of the next hop. If the next hop is a network card and the adaptive routing function is not enabled, the standby output port corresponding to the destination network card identifier is written into the target output port entry of the routing table. If the next hop is a network card and the adaptive routing function is enabled, a target standby output port, i.e., a port that has not received the second recovery notification message, is written into that entry. If the next hop is an inter-switch link, the second recovery notification message is forwarded to the second switch, so the notification is relayed back hop by hop across inter-switch links until it reaches a switch that can update its route. In other words, a switch that detects a link failure but cannot recover from it by itself can notify other switches and servers in the network, so that the whole network quickly updates the state of the failed link and finds an available path. This reduces network delay and improves the robustness of the whole network.
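The three branches for handling a second recovery notification can be sketched in Python. This is a minimal illustration under stated assumptions: the routing table is a dict keyed by destination identifier, the next-hop type is a plain string tag, `notified_ports` holds the ports on which a second recovery notification arrived, and `upstream_port` is the port toward the second switch. All names are hypothetical. Skipping the rewrite when no standby port (0) is configured, and leaving the route unchanged when every candidate has been notified, are assumptions, since those cases are not described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

NIC = "nic"  # next hop is a network card
ISL = "isl"  # next hop is an inter-switch link


@dataclass
class RouteEntry:
    out_port: int                     # target output port for this destination
    standby_port: int = 0             # single standby port (AR disabled); 0 = none
    candidate_ports: List[int] = field(default_factory=list)  # AR group (AR enabled)


def handle_second_notification(
    table: Dict[int, RouteEntry],
    dlid: int,
    notified_ports: Set[int],
    next_hop_type: str,
    ar_enabled: bool,
    upstream_port: int,
) -> Optional[int]:
    """Apply a second recovery notification for `dlid`.

    Returns the port to relay the notification on, or None if the route
    was handled locally.
    """
    if next_hop_type == ISL:
        # Cannot repair here: pass the notification on toward the second switch.
        return upstream_port
    if next_hop_type != NIC:
        raise ValueError(f"unknown next-hop type: {next_hop_type!r}")
    entry = table[dlid]
    if not ar_enabled:
        # Assumption: skip the rewrite when no standby port (0) was configured.
        if entry.standby_port != 0:
            entry.out_port = entry.standby_port
    else:
        # Target standby port: a candidate on which no notification arrived.
        for port in entry.candidate_ports:
            if port not in notified_ports:
                entry.out_port = port
                break
    return None
```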
In one embodiment, the method further comprises:
receiving port information sent by the server, the port information including the target output port and the standby output port.
In this embodiment, receiving port information containing the target output port and the standby output port from the server allows the switch to forward data messages based on that information, lets the switch obtain port information more promptly, and improves network robustness.
In a second aspect, the present application further provides a link failure processing apparatus, including:
the checking module is configured to check the on-off state of a target output port of a first switch, the target output port being the output port corresponding to a destination network card identifier in a routing table of the first switch;
the determining module is configured to determine a standby output port corresponding to the target output port if the target output port is off;
and the first forwarding module is configured to forward the data message through the standby output port.
In a third aspect, the present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
The link failure processing method, apparatus, device, storage medium and program product check the on-off state of the target output port of the first switch; if the target output port is off, they determine the standby output port corresponding to it and forward the data message through that port. The target output port is the output port corresponding to the destination network card identifier in the routing table of the first switch. In the conventional technology, after a link failure the switch sends a notification to the server, and the subnet manager on the server recalculates routes and updates the routing tables of all switches, which makes link recovery slow and causes network delay. In the embodiments of the present application, the switch first recovers the route autonomously after a link failure by forwarding the data message through the standby output port, and notifies the server only when autonomous recovery is impossible, which shortens link recovery time and reduces network delay.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for a person having ordinary skill in the art.
FIG. 1 is an application environment diagram of a link failure processing method in one embodiment;
FIG. 2 is a flow chart of a link failure processing method in one embodiment;
FIG. 3 is a schematic diagram of a conventional link failure processing method in one embodiment;
FIG. 4 is a schematic diagram of a link failure processing method in one embodiment;
FIG. 5 is a flow chart of a method for forwarding a data message in one embodiment;
FIG. 6 is a diagram of the routing table when the adaptive routing function is not enabled, in one embodiment;
FIG. 7 is a diagram of the routing table when the adaptive routing function is enabled, in one embodiment;
FIG. 8 is a flow chart of a link failure processing method in another embodiment;
FIG. 9 is a flow chart of port forwarding exception handling in one embodiment;
FIG. 10 is a flow chart of a link failure processing method in yet another embodiment;
FIG. 11 is a flow chart of processing when the adaptive routing function is not enabled, in one embodiment;
FIG. 12 is a flow chart of processing when the adaptive routing function is enabled, in one embodiment;
FIG. 13 is a simulation diagram of an uplink failure in one embodiment;
FIG. 14 is a simulation diagram of a downlink failure in one embodiment;
FIG. 15 is a flow chart of a link failure processing method in one embodiment;
FIG. 16 is a block diagram of a link failure processing apparatus in one embodiment;
FIG. 17 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Traditionally, high-performance computing (HPC) has been used to solve complex problems. Because data is growing exponentially, larger computing clusters are needed to meet current and future computing challenges. In HPC, efficient inter-process communication depends on an interconnect that provides high bandwidth and low latency and supports a large number of endpoints, so a high-speed, reliable interconnection network is needed. MPI, shared storage, machine learning frameworks and new heterogeneous computing architectures also require high-speed, reliable networks. As high-speed interconnection networks keep expanding to accommodate larger computing and storage capacities, 40K or even 100K network nodes may be needed in the future. However, as the number of host channel adapters (e.g., HCA cards) and switches grows, and especially the number of fiber optic cables that comes with them, physical or electrical damage becomes more likely and leads to link failures. The conventional art addresses this with job checkpointing, i.e., creating a point-in-time snapshot of a job so that, if the job fails later, it can resume from the last successful state, or with the data integrity check and retransmission mechanism of the InfiniBand protocol. These methods hurt performance and are impractical for large-scale communication. Here, MPI (Message Passing Interface) is a language-independent communication protocol for programming parallel computers.
Currently, in a high-speed network, when a link failure occurs, a network node sends notification information (e.g., a trap message) to the subnet management service on a server. On receiving the trap message, the subnet management service triggers a heavy sweep and recalculates routes to bypass the failed link. This process is slow: it typically takes at least several seconds, up to 5 seconds for 1000 nodes and 30 seconds or more for clusters of 10,000 nodes or more. The long link recovery time causes network delay, so the reliability and stability of the network cannot be guaranteed and jobs may fail, which is unacceptable. A trap message is the mechanism the subnet management service uses to notify the network manager; it is sent automatically when a fault or anomaly occurs in the network.
In addition, handling link failures this way makes the system unreliable: in parallel computing such as MPI, network quality is critical. If a network failure occurs and the route is not recovered for a long time, jobs may fail, which severely affects system stability and reliability.
The related art may also cause network congestion: if there are a large number of link failures in the network, there may be a large number of trap messages, which may cause congestion in the network, thereby affecting the normal operation of the network.
The related art may also degrade the performance of the subnet management service. The subnet management service is the only management tool in the high-speed interconnection network and is responsible for subnet scanning, route calculation and distribution, network configuration management, load balancing, fault detection and so on. If it has to handle a large number of trap messages while also processing other network events, its performance and processing efficiency suffer.
The related art may also fail to eliminate faults completely: although the subnet management service aims to improve network resilience and reliability and includes fault handling mechanisms, it cannot completely eliminate the possibility of failure, so other countermeasures, such as backup and monitoring systems, are still needed.
The related art may also produce duplicate trap messages: a trap message may be sent repeatedly, forcing the management node to process duplicate alarms. In an unstable network, the trap message for the same failed link may be sent many times, flooding the network with trap messages.
In summary, the above processing method may result in a long link recovery time, resulting in network delay, and other problems.
In view of the above problems, an embodiment of the present application provides a link failure processing method that may be applied in the application environment shown in FIG. 1, which is an application environment diagram of the link failure processing method in one embodiment. The switch 101 may communicate with a server 102, which may host the subnet management service. A data storage system may store the data that the server 102 needs to process. The switch 101 may comprise a plurality of switches interconnected to form links and connected to the server so as to receive information issued by the server. The switch 101 may check the on-off state of the target output port of the first switch, determine the standby output port corresponding to the target output port, and forward the data message to other switches through the standby output port. The server 102 may be implemented as a stand-alone server or as a server cluster comprising a plurality of servers.
In an exemplary embodiment, as shown in FIG. 2, a link failure processing method is provided. Taking its application to the switch 101 in FIG. 1 (e.g., a first switch) as an example, the method includes the following steps:
S201, checking the on-off state of a target output port of a first switch, the target output port being the output port corresponding to the destination network card identifier in the routing table of the first switch.
The target output port is the output port corresponding to the destination network card identifier in the routing table of the first switch. The routing table may be a linear routing table.
Specifically, the output port corresponding to the destination network card identifier in the routing table of the first switch is determined as the target output port, and the on-off state of that target output port is checked.
It should be noted that, during topology discovery, the subnet management service in the server marks, for each switch, its position relative to the other switches. For example, in a fat-tree topology it marks whether the switch belongs to the core, aggregation or access layer, and notifies each switch to record its own layer. The subnet management service calculates routes in each sweep period. If the adaptive routing function is not enabled, it calculates one output port and one standby output port for each switch; the route through the output port and the route through the standby output port must not overlap, and the standby output port is set to 0 if no suitable one exists. If the adaptive routing function is enabled, it calculates one output port and all standby output ports corresponding to that output port for each switch, and places all of those standby output ports in one port group. The server may send MAD messages to each switch through the subnet management service to tell each switch to record its own layer and routing information. A MAD (management datagram) message is a management message used to carry network management information.
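The subnet manager's per-switch port calculation can be sketched as follows. This is a simplified model, not the subnet manager's actual algorithm: it assumes the primary output port has already been chosen, that each candidate port's route to the destination is known as a set of link IDs, and that "do not overlap" means the two routes share no link. Picking the lowest-numbered disjoint port is an arbitrary choice made for illustration; the function names are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Each candidate output port maps to the set of link IDs its route to a
# destination traverses (a simplified, hypothetical model of the topology).
Paths = Dict[int, FrozenSet[str]]


def pick_disjoint_standby(paths: Paths, primary: int) -> int:
    """AR disabled: first port whose route shares no link with the primary; 0 if none."""
    for port in sorted(paths):
        if port != primary and paths[port].isdisjoint(paths[primary]):
            return port
    return 0


def build_port_group(paths: Paths, primary: int) -> List[int]:
    """AR enabled: every other candidate port goes into one port group."""
    return [port for port in sorted(paths) if port != primary]
```

For example, if port 1's route uses links L1 and L4, port 2's uses L2 and L4, and port 3's uses L3 and L5, then with AR disabled port 3 becomes the standby port (port 2 shares L4), and with AR enabled ports 2 and 3 form the group.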
S202, if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port.
The on-off state of the target output port is either disconnected or connected. If it is connected, the data message is forwarded directly through the target output port.
Specifically, if the first switch detects no signal on the target output port, or finds that the state of the target output port recorded in its routing table is off, it determines that the target output port is off. It may then query its routing table, determine the candidate output ports corresponding to the target output port, and select a candidate output port that has not failed as the standby output port. If there is only one candidate output port, that port is used as the standby output port.
S203, forwarding the data message through the standby output port.
Specifically, if the standby output port of the first switch is not disconnected, the data message may be forwarded through it. For example, if there is only one standby output port and it is available, the data message is forwarded through it; if there are multiple standby output ports, the data message may be forwarded through any available one.
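Steps S201 to S203 can be sketched as a single lookup function. This minimal Python illustration assumes the linear routing table is a dict from destination identifier to target output port, that candidate standby ports are kept in a separate dict keyed by target port, and that the two detection signals (no signal on the port, or the port recorded as off in the routing table) are given as sets; `forward_port` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional, Set


def forward_port(
    dlid: int,
    routing_table: Dict[int, int],
    candidates: Dict[int, List[int]],
    no_signal: Set[int],
    marked_off: Set[int],
) -> Optional[int]:
    """S201-S203 sketch: return the port to forward on, or None if none is usable."""

    def is_off(port: int) -> bool:
        # A port counts as off if it has no signal or the routing table records it as off.
        return port in no_signal or port in marked_off

    target = routing_table[dlid]             # S201: look up the target output port
    if not is_off(target):
        return target                        # connected: forward directly
    for port in candidates.get(target, []):  # S202: first non-failed candidate
        if not is_off(port):
            return port                      # S203: forward through the standby port
    return None
```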
As shown in fig. 3, fig. 3 is a schematic diagram of a conventional link failure processing method in an embodiment, where after a link failure occurs, a switch sends a notification to a server, and a subnet manager in the server can calculate routes and update all switch routing tables, resulting in a longer link recovery time, and further, network delay. Fig. 4 is a schematic diagram of a link failure processing method according to an embodiment, and with the method provided by the embodiment of the present application, after a link failure occurs, a switch performs autonomous processing to perform route recovery, that is, the switch forwards a data packet through a standby output port. After the switch can not perform autonomous processing to recover the route, the switch sends a notification to the server, and the subnet manager in the server calculates the route and updates all the switch routing tables.
The link fault processing method comprises the steps of checking the on-off state of a target output port of a first switch, and determining a standby output port corresponding to the target output port if the on-off state of the target output port is disconnected; and forwarding the data message through the standby output port. In the conventional technology, after a link failure occurs, a switch sends a notification to a server, and a subnet manager in the server can calculate routes and update all switch routing tables, resulting in a long link recovery time and thus network delay. In the embodiment of the application, since the switch can perform the route recovery by autonomous processing after the link failure, that is, the switch forwards the data message through the standby output port, the switch sends the notification to the server only after the switch cannot perform the route recovery by autonomous processing, the time of link recovery is shortened, and the network delay can be reduced.
In addition, because the switch preferentially recovers routes autonomously, fewer trap messages need to be sent to the server, which alleviates both the degradation of OpenSM performance on the server and the problem of duplicate trap messages. Moreover, the fault handling mechanism of OpenSM cannot completely eliminate the possibility of faults; in the embodiments of the present application, the switch performs route recovery autonomously, which increases the likelihood that faults are resolved.
For forwarding the data message through the standby output port, in an exemplary embodiment, as shown in FIG. 5, step S203 includes:
S501, determining the state of the adaptive routing function of the first switch.
The adaptive routing function may be, for example, adaptive routing (AR).
In the embodiment of the application, it is determined whether the adaptive routing function of the first switch is enabled. Specifically, the first switch may query the routing table information in its local routing table to determine whether the state of the adaptive routing function recorded there is enabled or not enabled.
S502, forwarding the data message through the standby output port based on the state of the adaptive routing function.
In this embodiment of the present application, the first switch forwards the data message through the standby output port in a way that depends on the state of the adaptive routing function. For example, when the adaptive routing function is enabled, there are multiple standby output ports, and the data message is forwarded through one that is connected and has not failed; when the adaptive routing function is not enabled, there is a single standby output port, and the data message can be forwarded through it.
In this embodiment, the state of the adaptive routing function of the first switch may be determined, and the data packet is forwarded through the standby output port based on the state of the adaptive routing function.
In the scenario of forwarding the data packet through the standby output port based on the state of the adaptive routing function, in an exemplary embodiment, the step S502 includes:
Case 1: if the adaptive routing function is off, the data packet is forwarded through the standby output port, provided that the standby output port is up.
Specifically, if the adaptive routing function is off, the target output port has only one corresponding standby output port. The first switch further determines whether that standby output port is up and, if it is, forwards the data packet through it.
For example, FIG. 6 shows the routing table when the adaptive routing function is off. As shown in FIG. 6, the DLID field is the destination network card identifier. Taking DLID 1 as an example, the output port field of the first row is the output port corresponding to destination network card identifier 1, the AR mode field indicates whether the adaptive routing function is on, and the AR Group field is the identifier of the standby output port table corresponding to the target output port. GroupTable is the standby output port table corresponding to the target output port. When the adaptive routing function is off, the AR Group field holds backupPort, the port number of the single standby output port; P denotes a port number, G is a flag in the AR Group field, and a value of 0 indicates that the network card identifier (LID) has no standby output port.
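To make the FIG. 6 layout concrete, the following is a minimal sketch of one routing-table row and of the Case 1 lookup. The class and function names (`RouteEntry`, `static_backup_port`) and the table contents are hypothetical; only the fields (output port, AR mode, AR Group holding backupPort) and the convention that 0 means "no standby output port" come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass

NO_BACKUP = 0  # in FIG. 6, 0 means the LID has no standby output port


@dataclass
class RouteEntry:
    """One row of a hypothetical in-memory model of the FIG. 6 routing table."""
    output_port: int  # output port field: target output port for this DLID
    ar_mode: bool     # AR mode field: True if adaptive routing is on
    ar_group: int     # AR Group field; with AR off it holds backupPort


def static_backup_port(table: dict[int, RouteEntry], dlid: int) -> int | None:
    """Return the single standby port for `dlid` (AR off), or None if there is none."""
    entry = table[dlid]
    if entry.ar_mode:
        # With AR on, standby ports come from the GroupTable instead.
        raise ValueError("AR is on for this DLID")
    return None if entry.ar_group == NO_BACKUP else entry.ar_group


# Hypothetical table contents for illustration only.
table = {
    1: RouteEntry(output_port=3, ar_mode=False, ar_group=7),
    2: RouteEntry(output_port=4, ar_mode=False, ar_group=NO_BACKUP),
}
print(static_backup_port(table, 1))  # 7
print(static_backup_port(table, 2))  # None
```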
Case 2: if the adaptive routing function is on, a candidate output port that is up is selected from a plurality of candidate output ports as the standby output port, and the data packet is forwarded through that standby output port.
Specifically, if the adaptive routing function is on, the target output port corresponds to a plurality of candidate output ports. A candidate output port that is up is selected from them, used as the standby output port, and the data packet is forwarded through it.
Optionally, the standby output port table corresponding to the target output port is looked up in the routing table by the destination network card identifier, and a candidate output port that is up is then selected from that table in a round-robin or load-balancing manner to forward the data packet. If a candidate output port is disconnected or faulty, it is skipped automatically and another candidate output port is selected to forward the data packet.
For example, FIG. 7 shows the routing table when the adaptive routing function is on; the figure also marks an example of Case 1 above. The AR Group field in the routing table is looked up by the network card identifier (LID), the bitmap of the corresponding standby output port group is then read from the GroupTable entry referenced by that AR Group field, and a candidate output port that is up is selected from the group in a round-robin or load-balancing manner to forward the data packet. When the candidate output port corresponding to a given bit is detected as disconnected, that port is skipped automatically and another candidate output port is used as the standby output port to forward the data packet.
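The round-robin selection over a GroupTable bitmap described above can be sketched as follows. This assumes, hypothetically, that bit i of the bitmap marks port i as a candidate and that link state is available through a `port_up` callback; `ArGroupSelector` is an illustrative name, not the switch chip's actual interface, and load-balancing variants are not modeled.

```python
from __future__ import annotations

from typing import Callable


class ArGroupSelector:
    """Round-robin choice of a live candidate port from an AR group bitmap.

    Hypothetical model: bit i of `bitmap` set means port i is a candidate.
    """

    def __init__(self, bitmap: int) -> None:
        self.ports = [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
        self.cursor = 0  # index of the next candidate to try

    def pick(self, port_up: Callable[[int], bool]) -> int | None:
        """Return the next candidate port that is up, skipping down ones; None if all are down."""
        n = len(self.ports)
        for step in range(n):
            idx = (self.cursor + step) % n
            if port_up(self.ports[idx]):
                self.cursor = (idx + 1) % n  # continue after the chosen port next time
                return self.ports[idx]
        return None


selector = ArGroupSelector(0b10110)  # candidate ports 1, 2 and 4
up = {1, 4}                          # port 2 is disconnected
print([selector.pick(lambda p: p in up) for _ in range(3)])  # [1, 4, 1]
```

A `None` result corresponds to the situation in which every candidate output port is disconnected.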
In this embodiment, if the adaptive routing function is off, the data packet is forwarded through the standby output port provided that the standby output port is up; if the adaptive routing function is on, a candidate output port that is up is selected from a plurality of candidate output ports as the standby output port, and the data packet is forwarded through it.
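The two cases of S502 can be summarized in a single decision function. In this minimal sketch, `choose_standby_port` and its parameters are hypothetical, a standby value of 0 means "no standby output port" as in FIG. 6, and a first-match scan stands in for the round-robin or load-balancing choice.

```python
from __future__ import annotations

from typing import Callable, Sequence


def choose_standby_port(
    ar_on: bool,
    standby: int,
    candidates: Sequence[int],
    port_up: Callable[[int], bool],
) -> int | None:
    """Pick the port for a packet whose target output port is down (S501/S502).

    Case 1 (AR off): the single standby port, if it exists (non-zero) and is up.
    Case 2 (AR on): the first candidate port that is up (first-match stands in
    for the round-robin or load-balancing choice described in the text).
    Returns None when no port is usable, i.e. when the preset condition holds.
    """
    if not ar_on:
        return standby if standby != 0 and port_up(standby) else None
    return next((p for p in candidates if port_up(p)), None)


is_up = {5}.__contains__  # hypothetical link state: only port 5 is up
print(choose_standby_port(False, 5, [], is_up))     # 5
print(choose_standby_port(True, 0, [2, 5], is_up))  # 5
print(choose_standby_port(True, 0, [2, 3], is_up))  # None
```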
In one exemplary embodiment, as shown in fig. 8, the method further comprises:
S801, when a preset condition is met, generating a first recovery notification message based on the data packet, where the first recovery notification message indicates that the first switch has no available output port, and the preset condition is that the adaptive routing function is off and the standby output port is disconnected, or that the adaptive routing function is on and all of the plurality of candidate output ports are disconnected.
The first recovery notification message (FRN_MAD, Fast Recovery Notification MAD) may include the layer of the first switch in the fat-tree topology, the destination network card identifier, the data packet, information about the disconnected standby output port(s), information about the target output port, and the like.
Specifically, if the adaptive routing function is off and the standby output port is disconnected, no standby port is available to forward the data packet; the data packet may be discarded, and the first recovery notification message is generated from the data packet and the information of the disconnected standby output port. If the adaptive routing function is on and all of the candidate output ports are disconnected, likewise no standby output port is available; the data packet may be discarded, and the first recovery notification message is generated from the data packet and the information of each disconnected candidate output port.
S802, sending the first recovery notification message to a second switch, the second switch being the switch that sent the data packet to the first switch.
Specifically, the first recovery notification message is sent back to the switch from which the first switch received the data packet, that is, to the second switch.
In this embodiment, when the first switch finds that the adaptive routing state recorded in its local routing table is off and the standby output port is disconnected, or that the state is on and all of the candidate output ports are disconnected, it generates a first recovery notification message from the data packet. In this way, a switch that detects a link failure but cannot recover from it on its own can notify other switches in the network, so that the whole network quickly updates the state of the failed link and obtains an available path. This speeds up the network's return from the fault state to the normal state and reduces the link failure recovery time from seconds to milliseconds.
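The preset-condition check of S801 and the resulting FRN_MAD contents can be sketched as below. The `FrnMad` fields mirror the list given for the first recovery notification message, but the names, types, and byte payload are hypothetical, as is treating "no standby port at all" (standby value 0) as meeting the condition; a real FRN_MAD is an InfiniBand management datagram, not a Python object.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class FrnMad:
    """Hypothetical container for the FRN_MAD fields listed in the text."""
    level: int          # layer of the first switch in the fat tree
    dlid: int           # destination network card identifier
    packet: bytes       # the data packet being dropped
    target_port: int    # target output port that went down
    down_ports: list[int] = field(default_factory=list)  # disconnected standby/candidate ports


def build_frn(level: int, dlid: int, packet: bytes, target_port: int,
              ar_on: bool, standby: int, candidates: Sequence[int],
              port_up: Callable[[int], bool]) -> FrnMad | None:
    """Return an FRN_MAD if the preset condition holds (S801), else None."""
    # AR on: every candidate port; AR off: the single standby port (0 = none, assumed to qualify).
    ports = list(candidates) if ar_on else ([standby] if standby != 0 else [])
    if any(port_up(p) for p in ports):
        return None  # a usable port exists, so forward normally instead
    return FrnMad(level, dlid, packet, target_port, ports)


frn = build_frn(2, 133, b"payload", 3, False, 7, [], lambda p: False)
print(frn.down_ports if frn else None)  # [7]
```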
FIG. 9 is a flowchart of port-forwarding exception handling in one embodiment. As shown in FIG. 9, the handling may include the following steps:
S901, if a link failure occurs while a data packet is being forwarded, forwarding the data packet according to the processing flow of the InfiniBand network protocol.
S902, if the preset condition is met, constructing a first recovery notification message, taking the identifier of the second switch as the identifier of the target network card, and taking the identifier of the source network card as the identifier of the target network card.
S903, the first recovery notification message is sent to the second switch.
In this embodiment of the present application, if a link failure occurs while a data packet is being forwarded, the first switch forwards the data packet according to the processing flow of the InfiniBand network protocol, that is, steps S501 to S502 above. If the adaptive routing function is off and the standby output port is disconnected, or the adaptive routing function is on and all of the candidate output ports are disconnected, the first switch constructs a first recovery notification message, uses the identifier of the second switch in the first recovery notification message as the destination network card identifier, uses the source network card identifier as the destination network card identifier, and sends the first recovery notification message to the second switch.
In an exemplary embodiment, as shown in fig. 10, the method further comprises:
S1001, receiving a second recovery notification message sent by a third switch and determining the port type of the next hop, where the second recovery notification message is sent by the third switch when the preset condition is met.
The third switch may be a switch that receives the data packet sent by the first switch.
Specifically, if the adaptive routing function of the third switch is off and its standby output port is disconnected, or its adaptive routing function is on and all of its candidate output ports are disconnected, the third switch sends a second recovery notification message to the first switch, which receives it. On receipt, the first switch records which of its ports received the second recovery notification message, for example port S. Based on the second recovery notification message, the first switch determines whether the next hop of the message is a network card or an inter-switch link.
One of S1002, S1003, or S1004 described below may be performed.
S1002, if the port type of the next hop is a network card and the adaptive routing function is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch.
Specifically, if the next hop of the second recovery notification message is a network card and the adaptive routing function is off, the standby output port corresponding to the destination network card identifier is written into the target output port in the routing table of the first switch, the standby output port is set to 0 in the routing table, the modification is recorded in the memory of the first switch, and the state of the first switch is marked as fault-tolerant update.
It should be noted that the switch chip may set a flag indicating whether the switch has performed fast fault-tolerant route switching (fault-tolerant update) on its own, and the server may reset this flag through the subnet management service.
S1003, if the port type of the next hop is a network card and the adaptive routing function is on, writing a target standby output port into the target output port in the routing table of the first switch, the target standby output port being a port that did not receive the second recovery notification message.
Specifically, if the next hop of the second recovery notification message is a network card and the adaptive routing function is on, a port that did not receive the second recovery notification message, for example any port other than port S, is written into the target output port in the routing table of the first switch. In addition, the adaptive routing function corresponding to the destination network card identifier may be turned off (that is, set to static mode); the modification is recorded in the memory of the first switch, and the state of the first switch is marked as fault-tolerant update.
S1004, if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
Specifically, if the next hop of the second recovery notification message is an inter-switch link, the first switch forwards the second recovery notification message to the second switch, that is, the switch that sent the data packet to the first switch.
When the server calculates routes in each scan period using the subnet management service, it synchronizes the routing table data that the switches modified automatically. For example, the modifications that a switch made to its routing table on its own are synchronized into the data structures of the subnet management service, so that the routes recalculated by the subnet management service are consistent with the routes already updated on the switch. If the adaptive routing function is off, there is only one standby output port, so the subnet management service can predict how the switch will modify the route. If the adaptive routing function is on, there are multiple candidate output ports, that is, multiple standby output ports, so the subnet management service cannot predict the route the switch will choose; in that case an agreed rule or a synchronization mechanism may be added. The server may also use the subnet management service to recalculate the standby output port for entries whose standby output port was set to 0, and to calculate a new standby output port for destination network card identifiers whose adaptive routing function was set to static.
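The branches S1002 to S1004 can be sketched as a handler on a simplified switch model. `Entry`, `Switch`, and `on_frn` are hypothetical names. The routing-table updates (consume the standby port, or pick a group port other than receiving port S and switch the DLID to static mode), the recorded modification, and the fault-tolerant update flag follow the text. The early return when no alternative port exists is an added assumption that the text does not cover.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical routing-table row for one destination LID."""
    output_port: int
    ar_on: bool
    backup_port: int = 0  # AR off: single standby port, 0 = none
    group_ports: list[int] = field(default_factory=list)  # AR on: candidate ports


@dataclass
class Switch:
    table: dict[int, Entry]
    fault_tolerant_update: bool = False  # flag the server may later reset
    changes: list[tuple[int, int]] = field(default_factory=list)  # (DLID, new port)

    def on_frn(self, dlid: int, rx_port: int, next_hop_is_ca: bool) -> str:
        """Handle a second recovery notification that arrived on `rx_port`."""
        if not next_hop_is_ca:
            return "forward to second switch"  # S1004: next hop is an inter-switch link
        entry = self.table[dlid]
        if entry.ar_on:  # S1003: any group port other than the receiving port S
            new_port = next((p for p in entry.group_ports if p != rx_port), 0)
        else:            # S1002: the single standby port
            new_port = entry.backup_port
        if new_port == 0:
            return "no alternative port"  # assumption: case not covered by the text
        entry.output_port = new_port
        if entry.ar_on:
            entry.ar_on = False      # this DLID now uses static routing
        else:
            entry.backup_port = 0    # the standby port has been consumed
        self.changes.append((dlid, new_port))  # record the modification
        self.fault_tolerant_update = True
        return f"DLID {dlid} -> port {new_port}"


sw = Switch({133: Entry(1, False, backup_port=2), 134: Entry(1, True, group_ports=[1, 3, 4])})
print(sw.on_frn(133, rx_port=1, next_hop_is_ca=True))  # DLID 133 -> port 2
print(sw.on_frn(134, rx_port=1, next_hop_is_ca=True))  # DLID 134 -> port 3
```

The `changes` list and the flag model the state that the subnet management service would later synchronize and reset.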
FIG. 11 is a flowchart of the handling performed when the adaptive routing function is off in one embodiment. As shown in FIG. 11, it includes the following steps:
S1101, port S receives the second recovery notification message sent by the third switch.
S1102, the port type of the next hop is a network card.
S1103, analyzing the second recovery notification message to obtain the destination network card identifier.
S1104, writing the standby output port corresponding to the destination network card identifier into a target output port in a routing table of the first switch, and setting the standby output port to be 0 in the routing table.
S1105, recording the network card identifier and the modified port entry in the memory of the first switch.
S1106, marking the state of the first switch as fault tolerant update.
S1107, forwarding the second recovery notification message.
In this embodiment of the present application, port S of the first switch receives the second recovery notification message sent by the third switch, and the first switch determines whether the port type of the next hop is a network card (CA port). If it is, the first switch parses the second recovery notification message to obtain the destination network card identifier, writes the standby output port corresponding to that identifier into the target output port in its routing table, sets the standby output port to 0 in the routing table, records the network card identifier and the modified port entry in its memory, and marks its state as fault-tolerant update. If the port type of the next hop is not a network card, the first switch forwards the second recovery notification message to the second switch.
FIG. 12 is a flowchart of the handling performed when the adaptive routing function is on in one embodiment. As shown in FIG. 12, it includes the following steps:
S1201, port S receives the second recovery notification message sent by the third switch.
S1202, the port type of the next hop is a network card.
S1203, parsing the second recovery notification message to obtain the destination network card identifier.
S1204, selecting a standby output port other than port S from the standby output port group corresponding to the destination network card identifier, writing it into the target output port in the routing table of the first switch, and setting the adaptive routing state for the destination network card identifier in the routing table to off.
S1205, recording the network card identifier and the modified port entry in the memory of the first switch.
S1206, marking the state of the first switch as fault tolerant update.
S1207, forwarding the second recovery notification message.
In this embodiment of the present application, port S of the first switch receives the second recovery notification message sent by the third switch, and the first switch determines whether the port type of the next hop is a network card (CA port). If it is, the first switch parses the second recovery notification message to obtain the destination network card identifier, selects a standby output port other than port S from the standby output port group corresponding to that identifier, writes it into the target output port in its routing table, sets the adaptive routing state for that identifier in the routing table to off, records the network card identifier and the modified port entry in its memory, and marks its state as fault-tolerant update. If the port type of the next hop is not a network card, the first switch forwards the second recovery notification message to the second switch.
In this embodiment, the second recovery notification message sent by the third switch is received and the port type of the next hop is determined. If the next hop is a network card and the adaptive routing function is off, the standby output port corresponding to the destination network card identifier is written into the target output port in the routing table of the first switch; if the next hop is a network card and the adaptive routing function is on, the target standby output port is written into the target output port in the routing table of the first switch; if the next hop is an inter-switch link, the second recovery notification message is forwarded to the second switch. Here the second recovery notification message is sent by the third switch when the preset condition is met, and the target standby output port is a port that did not receive the second recovery notification message. With this method, the second recovery notification message is passed back hop by hop for as long as the next hop is an inter-switch link; that is, a switch that detects a link failure but cannot recover from it on its own can notify other switches and the server in the network, so that the whole network quickly updates the state of the failed link and obtains an available path. This reduces network latency and improves the robustness of the whole network.
In an exemplary embodiment, the method further comprises:
Receiving port information sent by the server, the port information including the target output port and the standby output port.
Specifically, the server calculates routes in each scan period using the subnet management service. If the adaptive routing function is off, one output port and one standby output port can be calculated for each switch; the route through the output port and the route through the standby output port must not overlap, and the standby output port is set to 0 if no suitable standby output port exists. A MAD (management datagram) message is a management message used to carry network management information. The server uses the subnet management service to send the port information to each switch as MAD messages, and the first switch receives this port information, which may include the output port and the standby output port.
In this embodiment, the first switch receives the port information, including the target output port and the standby output port, sent by the server. This allows the switch to forward data packets according to the acquired port information, lets the switch obtain port information in a more timely manner, and improves the robustness of the network.
FIG. 13 is a simulated diagram of an uplink failure in one embodiment. As shown in FIG. 13, the switch has multiple routing paths to the destination; such failures typically occur on the upstream links of a three-layer fat-tree network, such as from the access layer to the aggregation layer or from the aggregation layer to the core layer. If one of the switch's paths fails, fast fault-tolerant route switching takes effect and forwards the data through another port of the switch along a new path to the destination. That is, for an uplink failure, the switch switches ports locally without returning the first recovery notification message FRN_MAD, which reduces the probability of packet loss.
FIG. 14 is a simulated diagram of a downlink failure in one embodiment. As shown in FIG. 14, when a downlink fails, for example the link from switch 330 to switch 231, and no standby output port is available, switch 330 returns a first recovery notification message FRN_MAD to switch 401, switch 401 returns it to switch 300, and switch 300 returns it to switch 201. Switch 201 then sends the data packet to switch 301, switch 301 sends it to switch 402, switch 402 sends it to switch 331, and switch 331 sends it to switch 231, so that the data packet is successfully delivered to host 133.
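The non-overlap requirement between the output port and the standby output port can be illustrated with a small selection function. This is a sketch under stated assumptions: `pick_ports` is hypothetical, each candidate port's route is represented as a set of link names, and choosing the lowest-numbered port as the primary is a simplification of whatever the subnet manager's routing engine actually does.

```python
from __future__ import annotations


def pick_ports(paths: dict[int, set[str]]) -> tuple[int, int]:
    """Choose (output_port, standby_port) for one destination LID with AR off.

    `paths` maps each candidate port to the set of links its route uses (a
    hypothetical representation). The standby route must share no link with
    the primary route; 0 means no suitable standby output port exists.
    """
    if not paths:
        raise ValueError("destination unreachable from this switch")
    ports = sorted(paths)
    primary = ports[0]  # simplification: a real routing engine chooses this
    for port in ports[1:]:
        if paths[port].isdisjoint(paths[primary]):
            return primary, port
    return primary, 0


print(pick_ports({1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}))  # (1, 3)
print(pick_ports({1: {"a"}, 2: {"a"}}))                      # (1, 0)
```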
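The FIG. 14 walk-through can be reproduced with a small simulation of FRN_MAD back-propagation. The function `trace_recovery` and the `alt_routes` map are hypothetical. The switch numbers follow the text, and the assumption that only switch 201 has an alternative route matches the sequence described for FIG. 14. When the switch at the failed link can reroute itself, as in the uplink case of FIG. 13, no FRN_MAD hops are produced.

```python
from __future__ import annotations


def trace_recovery(path: list[int], fail_index: int,
                   alt_routes: dict[int, list[int]]) -> tuple[list[tuple[int, int]], list[int] | None]:
    """Simulate FRN_MAD back-propagation after a link on `path` fails.

    path: switches on the original route, source side first.
    fail_index: the link path[fail_index] -> path[fail_index + 1] is down.
    alt_routes: switches able to reroute, mapped to their alternative route
                (starting with themselves) toward the same destination switch.
    Returns the FRN_MAD hops (sender, receiver) and the new route, or None.
    """
    hops: list[tuple[int, int]] = []
    for i in range(fail_index, -1, -1):
        switch = path[i]
        if switch in alt_routes:  # this switch can recover locally
            return hops, path[:i] + alt_routes[switch]
        if i > 0:                 # otherwise notify the upstream switch
            hops.append((switch, path[i - 1]))
    return hops, None


# FIG. 14: the 330 -> 231 link fails and only switch 201 has another route.
hops, new_route = trace_recovery([201, 300, 401, 330, 231], 3,
                                 {201: [201, 301, 402, 331, 231]})
print(hops)       # [(330, 401), (401, 300), (300, 201)]
print(new_route)  # [201, 301, 402, 331, 231]
```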
In one detailed embodiment, as shown in FIG. 15, the method includes:
S1501, receiving port information sent by a server, the port information including a target output port and a standby output port.
S1502, checking the on-off state of a target output port of a first switch, the target output port being the output port corresponding to the destination network card identifier in the routing table of the first switch.
S1503, if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port.
S1504, determining a state of an adaptive routing function of the first switch.
One of S1505, S1506, and S1507 below is performed.
S1505, if the adaptive routing function is off, forwarding the data packet through the standby output port when the standby output port is up.
S1506, if the adaptive routing function is on, selecting a candidate output port that is up from the plurality of candidate output ports as the standby output port, and forwarding the data packet through the standby output port.
S1507, when the preset condition is met, generating a first recovery notification message based on the data packet, where the first recovery notification message indicates that the first switch has no available output port, and the preset condition is that the adaptive routing function is off and the standby output port is disconnected, or that the adaptive routing function is on and all of the plurality of candidate output ports are disconnected.
S1508, receiving a second recovery notification message sent by a third switch and determining the port type of the next hop, where the second recovery notification message is sent by the third switch when the preset condition is met.
One of S1509, S1510, and S1511 below is performed.
S1509, if the port type of the next hop is a network card and the adaptive routing function is off, writing the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch.
S1510, if the port type of the next hop is a network card and the adaptive routing function is on, writing the target standby output port into the target output port in the routing table of the first switch, the target standby output port being a port that did not receive the second recovery notification message.
S1511, if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
According to the link failure processing method, after a link failure the switch first attempts route recovery on its own, that is, it forwards the data packet through the standby output port. Only when it cannot recover on its own does it send a notification to the server, so that the server recalculates routes and updates the switch routing tables, and data packets are then forwarded according to the updated routing tables. In the conventional art, by contrast, the switch notifies the server after every link failure, and the subnet manager on the server calculates routes and updates all switch routing tables. The link failure processing method provided by the embodiments of the present application therefore shortens link recovery time and can further reduce network latency.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiments of the present application further provide a link failure processing device for implementing the link failure processing method described above. The solution implemented by the device is similar to that described for the method, so for the specific limitations in the one or more embodiments of the link failure processing device below, reference may be made to the limitations of the link failure processing method above; they are not repeated here.
In one exemplary embodiment, as shown in FIG. 16, a link failure processing device 1600 is provided, comprising an inspection module 1601, a determining module 1602, and a first forwarding module 1603, wherein:
The inspection module 1601 is configured to inspect the on-off state of a target output port of the first switch, the target output port being the output port corresponding to the destination network card identifier in the routing table of the first switch.
The determining module 1602 is configured to determine a standby output port corresponding to the target output port if the on-off state of the target output port is off.
The first forwarding module 1603 is configured to forward the data packet through the standby output port.
In one exemplary embodiment, the first forwarding module 1603 includes:
A first determining submodule, configured to determine the state of the adaptive routing function of the first switch.
A forwarding submodule, configured to forward the data packet through the standby output port based on the state of the adaptive routing function.
In one exemplary embodiment, the forwarding sub-module includes:
A first forwarding unit, configured to forward the data packet through the standby output port if the adaptive routing function is off and the standby output port is up.
A second forwarding unit, configured to, if the adaptive routing function is on, select a candidate output port that is up from the plurality of candidate output ports as the standby output port and forward the data packet through that standby output port.
In one exemplary embodiment, the apparatus 1600 further comprises:
A generating module, configured to generate a first recovery notification message based on the data packet when a preset condition is met, where the first recovery notification message indicates that the first switch has no available output port, and the preset condition is that the adaptive routing function is off and the standby output port is disconnected, or that the adaptive routing function is on and all of the plurality of candidate output ports are disconnected.
A sending module, configured to send the first recovery notification message to the second switch, the second switch being the switch that sends the data packet to the first switch.
In one exemplary embodiment, the apparatus 1600 further comprises:
A first receiving module, configured to receive a second recovery notification message sent by the third switch and determine the port type of the next hop, where the second recovery notification message is sent by the third switch when the preset condition is met.
A first writing module, configured to write the standby output port corresponding to the destination network card identifier into the target output port in the routing table of the first switch if the port type of the next hop is a network card and the adaptive routing function is off.
A second writing module, configured to write the target standby output port into the target output port in the routing table of the first switch if the port type of the next hop is a network card and the adaptive routing function is on, the target standby output port being a port that did not receive the second recovery notification message.
A second forwarding module, configured to forward the second recovery notification message to the second switch if the port type of the next hop is an inter-switch link.
In one exemplary embodiment, the apparatus 1600 further comprises:
A second receiving module, configured to receive port information sent by the server, the port information including the target output port and the standby output port.
Each module in the above link failure processing device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one exemplary embodiment, a computer device is provided, which may be a switch and whose internal structure may be as shown in FIG. 17. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The database of the computer device stores data. The input/output interface of the computer device exchanges information between the processor and external devices. The communication interface of the computer device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements a link failure processing method.
Those skilled in the art will appreciate that the structure shown in Fig. 17 is merely a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one exemplary embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
checking the on-off state of a target output port of the first switch; the target output port is an output port corresponding to a target network card identifier in a routing table of the first switch;
if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port;
and forwarding the data message through the standby output port.
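The three steps above can be illustrated with a minimal Python sketch. The `Switch` model, its dictionaries, and `transmit()` are hypothetical stand-ins for switch hardware state and are not taken from the patent. The sketch also assumes that a standby port is only used when it is itself on, a condition that the adaptive-routing embodiment states explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Switch:
    """Hypothetical model of the first switch's forwarding state."""
    port_up: Dict[int, bool]            # output port -> True if its on-off state is "on"
    routing_table: Dict[str, int]       # target network card identifier -> target output port
    standby_port: Dict[int, int] = field(default_factory=dict)  # target port -> standby port
    sent: List[Tuple[int, bytes]] = field(default_factory=list)

    def transmit(self, port: int, message: bytes) -> None:
        # Stand-in for the hardware send path: just record what was sent where.
        self.sent.append((port, message))


def forward(sw: Switch, nic_id: str, message: bytes) -> Optional[int]:
    """Forward a data message; return the output port used, or None if none is usable."""
    target = sw.routing_table[nic_id]
    # Step 1: check the on-off state of the target output port.
    if sw.port_up.get(target, False):
        sw.transmit(target, message)
        return target
    # Step 2: the target port is off, so look up its standby output port.
    standby = sw.standby_port.get(target)
    # Step 3: forward through the standby port (assumed to require that it is on).
    if standby is not None and sw.port_up.get(standby, False):
        sw.transmit(standby, message)
        return standby
    return None


if __name__ == "__main__":
    sw = Switch(port_up={1: False, 2: True}, routing_table={"nic-A": 1}, standby_port={1: 2})
    print(forward(sw, "nic-A", b"payload"))  # prints 2: port 1 is off, standby port 2 is used
```

Returning `None` when no port is usable leaves room for the recovery-notification handling described in the later embodiments.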
In one embodiment, the processor when executing the computer program further performs the steps of:
determining a state of an adaptive routing function of the first switch;
and forwarding the data message through the standby output port based on the state of the adaptive routing function.
In one embodiment, the processor when executing the computer program further performs the steps of:
if the adaptive routing function is not enabled, forwarding the data message through the standby output port provided that the standby output port is on;
if the adaptive routing function is enabled, determining a candidate output port that is on from a plurality of candidate output ports as the standby output port, and forwarding the data message through the standby output port.
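The choice between the fixed standby port and adaptive routing can be sketched as a small selection function. The function name and parameters are hypothetical. The text does not say which of several candidate ports that are on is picked, so this sketch simply takes the first one in list order; a real switch might instead weigh load or path length.

```python
from typing import Dict, List, Optional


def select_standby_port(adaptive_routing_enabled: bool,
                        standby: Optional[int],
                        candidates: List[int],
                        port_up: Dict[int, bool]) -> Optional[int]:
    """Return the output port to use for the data message, or None if none is available."""
    if not adaptive_routing_enabled:
        # Adaptive routing not enabled: only the preconfigured standby port may be used,
        # and only if its on-off state is "on".
        if standby is not None and port_up.get(standby, False):
            return standby
        return None
    # Adaptive routing enabled: pick one candidate output port that is on.
    # Assumption: the first match in list order is chosen.
    for port in candidates:
        if port_up.get(port, False):
            return port
    return None


if __name__ == "__main__":
    up = {2: False, 3: True, 4: True}
    print(select_standby_port(False, 2, [3, 4], up))       # None: fixed standby port 2 is off
    print(select_standby_port(True, None, [2, 3, 4], up))  # 3: first candidate that is on
```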
In one embodiment, the processor when executing the computer program further performs the steps of:
when a preset condition is met, generating a first recovery notification message according to the data message; the first recovery notification message indicates that the first switch has no available output port; the preset condition is either: the adaptive routing function is not enabled and the standby output port is off, or the adaptive routing function is enabled and all of the plurality of candidate output ports are off;
sending the first recovery notification message to a second switch; the second switch is the switch that sent the data message to the first switch.
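The preset condition and the upstream notification can be sketched as follows. The `RecoveryNotification` fields and the `ingress_port` parameter are hypothetical: the text only says that the notification is generated from the data message and sent to the switch that sent that message, so this sketch assumes it is sent back out of the port on which the data message arrived.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RecoveryNotification:
    """Hypothetical first recovery notification meaning 'no available output port'."""
    target_nic_id: str      # copied from the data message's destination
    reporting_switch: str   # identifier of the switch that has no usable port


def preset_condition_met(adaptive_routing_enabled: bool,
                         standby: Optional[int],
                         candidates: List[int],
                         port_up: Dict[int, bool]) -> bool:
    if not adaptive_routing_enabled:
        # Adaptive routing not enabled and the standby output port is off (or absent).
        return standby is None or not port_up.get(standby, False)
    # Adaptive routing enabled and every candidate output port is off.
    return all(not port_up.get(p, False) for p in candidates)


def build_first_notification(switch_id: str, target_nic_id: str, ingress_port: int,
                             adaptive_routing_enabled: bool, standby: Optional[int],
                             candidates: List[int], port_up: Dict[int, bool]
                             ) -> Optional[Tuple[int, RecoveryNotification]]:
    """Return (port toward the second switch, notification), or None if a port is usable."""
    if preset_condition_met(adaptive_routing_enabled, standby, candidates, port_up):
        # Assumption: the second switch is reached through the data message's ingress port.
        return ingress_port, RecoveryNotification(target_nic_id, switch_id)
    return None
```

Note that `all()` over an empty candidate list returns `True`, so in this sketch an enabled adaptive routing function with no candidates is treated as having no available output port.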
In one embodiment, the processor when executing the computer program further performs the steps of:
receiving a second recovery notification message sent by a third switch, and determining the port type of the next hop; the third switch sends the second recovery notification message when the preset condition is met;
if the port type of the next hop is a network card and the adaptive routing function is not enabled, writing the standby output port corresponding to the target network card identifier into the target output port entry of the routing table of the first switch;
if the port type of the next hop is a network card and the adaptive routing function is enabled, writing a target standby output port into the target output port entry of the routing table of the first switch; the target standby output port is a port on which the second recovery notification message has not been received;
if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
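The handling of a second recovery notification can be sketched as below. `SwitchState`, `handle_second_notification`, and their fields are hypothetical. The sketch assumes that the next-hop port type has already been determined (passed in as `next_hop_is_nic`), records the port on which each notification arrives so that ports which have not received the second recovery notification message can be identified, and leaves the routing table unchanged if no such candidate exists, a case the text does not address.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class SwitchState:
    routing_table: Dict[str, int]          # target NIC identifier -> target output port
    standby_port: Dict[str, int]           # target NIC identifier -> preconfigured standby port
    candidates: Dict[str, List[int]]       # target NIC identifier -> candidate ports (adaptive routing)
    adaptive_routing_enabled: bool
    notified_ports: Set[int] = field(default_factory=set)  # ports a second notification arrived on


def handle_second_notification(sw: SwitchState, nic_id: str, arrival_port: int,
                               next_hop_is_nic: bool, upstream_port: int) -> Optional[int]:
    """Update the routing table locally, or return the port to forward the notification on."""
    sw.notified_ports.add(arrival_port)
    if not next_hop_is_nic:
        # Next hop is an inter-switch link: pass the notification on to the second switch.
        return upstream_port
    if not sw.adaptive_routing_enabled:
        # Write the standby port for this NIC identifier into the target output port entry.
        sw.routing_table[nic_id] = sw.standby_port[nic_id]
    else:
        # Write the first candidate on which no second notification has been received.
        for port in sw.candidates.get(nic_id, []):
            if port not in sw.notified_ports:
                sw.routing_table[nic_id] = port
                break
    return None
```

Returning the upstream port instead of sending directly keeps the sketch free of any assumed transport API.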
In one embodiment, the processor when executing the computer program further performs the steps of:
receiving port information sent by a server; the port information includes a target output port and a standby output port.
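Installing the port information sent by the server can be sketched as below. The JSON message layout and the field names (`nic_id`, `target_port`, `standby_port`) are hypothetical; the text states only that the port information includes the target output port and the standby output port.

```python
import json
from typing import Dict


def apply_port_info(payload: str,
                    routing_table: Dict[str, int],
                    standby_table: Dict[str, int]) -> int:
    """Install target/standby output ports from a server message; return entries applied."""
    entries = json.loads(payload)  # assumed: a JSON list of per-NIC entries
    for entry in entries:
        nic_id = entry["nic_id"]
        routing_table[nic_id] = int(entry["target_port"])   # target output port
        standby_table[nic_id] = int(entry["standby_port"])  # standby output port
    return len(entries)


if __name__ == "__main__":
    routes: Dict[str, int] = {}
    standby: Dict[str, int] = {}
    msg = '[{"nic_id": "nic-A", "target_port": 1, "standby_port": 2}]'
    print(apply_port_info(msg, routes, standby), routes, standby)
```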
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program performs the following steps:
checking the on-off state of a target output port of the first switch; the target output port is an output port corresponding to a target network card identifier in a routing table of the first switch;
if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port;
and forwarding the data message through the standby output port.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a state of an adaptive routing function of the first switch;
and forwarding the data message through the standby output port based on the state of the adaptive routing function.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the adaptive routing function is not enabled, forwarding the data message through the standby output port provided that the standby output port is on;
if the adaptive routing function is enabled, determining a candidate output port that is on from a plurality of candidate output ports as the standby output port, and forwarding the data message through the standby output port.
In one embodiment, the computer program when executed by the processor further performs the steps of:
when a preset condition is met, generating a first recovery notification message according to the data message; the first recovery notification message indicates that the first switch has no available output port; the preset condition is either: the adaptive routing function is not enabled and the standby output port is off, or the adaptive routing function is enabled and all of the plurality of candidate output ports are off;
sending the first recovery notification message to a second switch; the second switch is the switch that sent the data message to the first switch.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving a second recovery notification message sent by a third switch, and determining the port type of the next hop; the third switch sends the second recovery notification message when the preset condition is met;
if the port type of the next hop is a network card and the adaptive routing function is not enabled, writing the standby output port corresponding to the target network card identifier into the target output port entry of the routing table of the first switch;
if the port type of the next hop is a network card and the adaptive routing function is enabled, writing a target standby output port into the target output port entry of the routing table of the first switch; the target standby output port is a port on which the second recovery notification message has not been received;
if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving port information sent by a server; the port information includes a target output port and a standby output port.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
checking the on-off state of a target output port of the first switch; the target output port is an output port corresponding to a target network card identifier in a routing table of the first switch;
if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port;
and forwarding the data message through the standby output port.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining a state of an adaptive routing function of the first switch;
and forwarding the data message through the standby output port based on the state of the adaptive routing function.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the adaptive routing function is not enabled, forwarding the data message through the standby output port provided that the standby output port is on;
if the adaptive routing function is enabled, determining a candidate output port that is on from a plurality of candidate output ports as the standby output port, and forwarding the data message through the standby output port.
In one embodiment, the computer program when executed by the processor further performs the steps of:
when a preset condition is met, generating a first recovery notification message according to the data message; the first recovery notification message indicates that the first switch has no available output port; the preset condition is either: the adaptive routing function is not enabled and the standby output port is off, or the adaptive routing function is enabled and all of the plurality of candidate output ports are off;
sending the first recovery notification message to a second switch; the second switch is the switch that sent the data message to the first switch.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving a second recovery notification message sent by a third switch, and determining the port type of the next hop; the third switch sends the second recovery notification message when the preset condition is met;
if the port type of the next hop is a network card and the adaptive routing function is not enabled, writing the standby output port corresponding to the target network card identifier into the target output port entry of the routing table of the first switch;
if the port type of the next hop is a network card and the adaptive routing function is enabled, writing a target standby output port into the target output port entry of the routing table of the first switch; the target standby output port is a port on which the second recovery notification message has not been received;
if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
In one embodiment, the computer program when executed by the processor further performs the steps of:
receiving port information sent by a server; the port information includes a target output port and a standby output port.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other media in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; the non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The above embodiments represent only a few implementations of the present application. Although they are described specifically and in detail, they shall not be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and all of these fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of link failure handling, the method comprising:
checking the on-off state of a target output port of the first switch; the target output port is an output port corresponding to a target network card identifier in a routing table of the first switch;
if the on-off state of the target output port is off, determining a standby output port corresponding to the target output port;
and forwarding the data message through the standby output port.
2. The method of claim 1, wherein forwarding the data message through the standby output port comprises:
determining a state of an adaptive routing function of the first switch;
and forwarding the data message through the standby output port based on the state of the adaptive routing function.
3. The method of claim 2, wherein forwarding the data message through the standby output port based on the state of the adaptive routing function comprises:
if the adaptive routing function is not enabled, forwarding the data message through the standby output port provided that the standby output port is on;
and if the adaptive routing function is enabled, determining a candidate output port that is on from a plurality of candidate output ports as the standby output port, and forwarding the data message through the standby output port.
4. The method of claim 3, further comprising:
generating a first recovery notification message according to the data message when a preset condition is met; the first recovery notification message indicates that the first switch has no available output port; the preset condition is either: the adaptive routing function is not enabled and the standby output port is off, or the adaptive routing function is enabled and all of the plurality of candidate output ports are off;
sending the first recovery notification message to a second switch; the second switch is the switch that sent the data message to the first switch.
5. The method of claim 4, further comprising:
receiving a second recovery notification message sent by a third switch, and determining the port type of the next hop; the third switch sends the second recovery notification message when the preset condition is met;
if the port type of the next hop is a network card and the adaptive routing function is not enabled, writing the standby output port corresponding to the target network card identifier into the target output port entry of the routing table of the first switch;
if the port type of the next hop is a network card and the adaptive routing function is enabled, writing a target standby output port into the target output port entry of the routing table of the first switch; the target standby output port is a port on which the second recovery notification message has not been received;
and if the port type of the next hop is an inter-switch link, forwarding the second recovery notification message to the second switch.
6. The method of any one of claims 1 to 5, further comprising:
receiving port information sent by a server; the port information includes the target output port and the standby output port.
7. A link failure handling apparatus, the apparatus comprising:
the checking module is used for checking the on-off state of the target output port of the first switch; the target output port is an output port corresponding to a target network card identifier in a routing table of the first switch;
the determining module is used for determining a standby output port corresponding to the target output port if the on-off state of the target output port is off;
and the first forwarding module is used for forwarding the data message through the standby output port.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311324249.3A CN117411840A (en) 2023-10-12 2023-10-12 Link failure processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117411840A true CN117411840A (en) 2024-01-16

Family

ID=89497256

Country Status (1)

Country Link
CN (1) CN117411840A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination