CN116760763B - Link switching method, device, computing system, electronic equipment and storage medium - Google Patents

Publication number
CN116760763B
Authority
CN
China
Prior art keywords
computing node
path
value
target
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311029289.5A
Other languages
Chinese (zh)
Other versions
CN116760763A (en)
Inventor
苏康
席鑫
李亚民
王梦龙
索曌君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202311029289.5A
Publication of CN116760763A
Application granted
Publication of CN116760763B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/24 Multipath
    • H04L45/247 Multipath using M:N active or standby paths
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/12 Shortest path evaluation
    • H04L45/123 Evaluation of link metrics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/22 Alternate routing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a link switching method, a device, a computing system, an electronic device and a storage medium, and relates to the technical field of data transmission. The method comprises: when it is determined that the original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by data links between a plurality of computing nodes; acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in that alternative path; and determining a target path from the plurality of alternative paths according to the path congestion values, and switching the original data link between the first target computing node and the second target computing node to the target path. The invention improves the efficiency of data link fault handling in a heterogeneous computing system with multiple computing nodes.

Description

Link switching method, device, computing system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data transmission technologies, and in particular, to a link switching method, a device, a computing system, an electronic device, and a storage medium.
Background
With the continuous development of technologies such as the Internet, big data and artificial intelligence, computing demand is growing explosively. Traditional computing architectures have limited computing capacity because they rely on a single type of computing node, and they cannot keep up with the rapid growth and diversification of computing demand, which has given rise to heterogeneous computing systems based on multiple computing nodes.
A heterogeneous computing system with multiple computing nodes connects different types of computing nodes together, so that the advantages of each type can be used to improve computing efficiency and performance and to achieve more efficient data processing, analysis and computation. However, because the data transmission volume in such a system is large and the interconnection links are numerous and complex, the links are susceptible to interference that causes link faults, which degrades system performance. In the prior art, link faults are usually cleared manually, and for a link fault that cannot be repaired immediately, a new link is enabled by changing the routing table to restore data transmission between the nodes. As a result, link faults between computing nodes are handled inefficiently.
Therefore, a link switching method, apparatus, computing system, electronic device and storage medium are needed to solve the above-mentioned problems.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a link switching method, a device, a computing system, electronic equipment and a storage medium.
The invention provides a link switching method, which comprises the following steps:
when determining that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by the data links between the plurality of computing nodes;
acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path;
and determining a target path from a plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path.
According to the link switching method provided by the invention, the method further comprises the following steps:
Acquiring a first credit value variation and a second credit value variation, wherein the first credit value variation represents the variation of the credit value of the first target computing node in a preset period; the second credit value variation amount represents the variation amount of the credit value of the second target computing node in the preset period; the credit value is set based on the parallel calculation quantity of the calculation nodes, and comprises an input port credit value and an output port credit value;
and judging the fault condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation, so as to execute data link switching operation on the first target computing node and the second target computing node under the condition that the fault is determined to exist.
According to the link switching method provided by the invention, the determining the fault condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation includes:
judging, based on the preset period, whether the first credit value variation and the second credit value variation are equal to a preset value, and determining the fault condition according to the judgment result.
According to the link switching method provided by the invention, the determining whether the first credit value variation and the second credit value variation are preset values based on the preset time period, and determining the fault condition according to the determination result comprises:
and if any credit value variation in the first credit value variation and the second credit value variation is 0 in the preset period, determining that the original data link between the first target computing node and the second target computing node has a fault.
According to the link switching method provided by the invention, the method further comprises the following steps:
acquiring port credit values corresponding to the first target computing node and the second target computing node respectively in the preset period;
if the port credit value is smaller than the corresponding maximum credit threshold value in the preset period, judging that the original data link between the first target computing node and the second target computing node is in a non-idle state;
If the port credit value is equal to the corresponding maximum credit threshold value within the preset period, judging that the original data link between the first target computing node and the second target computing node is in an idle state;
if any one of the first credit value variation and the second credit value variation is 0 in the preset period, determining that the original data link between the first target computing node and the second target computing node has a fault includes:
and if the original data link between the first target computing node and the second target computing node is in a non-idle state within the preset period, and any credit value variation in the first credit value variation and the second credit value variation is 0, determining that the original data link between the first target computing node and the second target computing node has a fault.
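By way of a non-limiting sketch, the fault judgment described above can be expressed in a few lines of Python. The function names and the list-based representation of port credit values are hypothetical and are not taken from the invention; the sketch also assumes that the link counts as idle only when every monitored port holds its maximum credit.

```python
def link_is_idle(port_credits: list[int], max_credits: list[int]) -> bool:
    """Idle when every monitored port still holds its maximum credit."""
    return all(c == m for c, m in zip(port_credits, max_credits))


def link_has_fault(first_variation: int, second_variation: int,
                   port_credits: list[int], max_credits: list[int]) -> bool:
    """Fault: the link is non-idle, yet at least one credit variation is 0."""
    if link_is_idle(port_credits, max_credits):
        return False  # nothing is in flight, so a zero variation is expected
    return first_variation == 0 or second_variation == 0


# The output port of the first node holds 1 of 4 credits and the input port
# of the second node holds 2 of 4 credits, so the link is non-idle.
print(link_has_fault(0, 3, [1, 2], [4, 4]))  # True
print(link_has_fault(2, 3, [1, 2], [4, 4]))  # False
print(link_has_fault(0, 0, [4, 4], [4, 4]))  # False (idle link)
```

The idle check comes first because an idle link naturally shows no credit change, which would otherwise be mistaken for a fault.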
According to the link switching method provided by the invention, before the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each computing node in each alternative path, the method further comprises:
Based on the calculation nodes in each alternative path, acquiring the data transmission load value of each calculation node at the last moment in the preset period;
acquiring the total amount of the credit value change of the input port and the total amount of the credit value change of the output port of each computing node between the current moment and the last moment;
and based on the ratio between the credit value change total quantity of the input port and the credit value change total quantity of the output port, acquiring the data transmission load value of each calculation node in each alternative path at the current moment through the data transmission load value of each calculation node at the last moment.
According to the link switching method provided by the invention, the obtaining, based on the ratio between the total amount of credit change of the input port and the total amount of credit change of the output port, the data transmission load value of each computing node in each candidate path at the current moment through the data transmission load value of each computing node at the previous moment includes:
constructing a node transmission load value calculation formula, wherein the node transmission load value calculation formula is as follows:
[Formula not preserved in this text.]
wherein $W_{t+\Delta t}$ is the data transmission load value of the computing node at the current moment; $W_t$ is the data transmission load value of the computing node at the last moment $t$; $\Delta t$ is the preset time interval duration; $C_{\mathrm{in}}$ represents the total credit value of the input ports when all input ports in the computing node are unused; $C_{\mathrm{out}}$ represents the total credit value of the output ports when all output ports in the computing node are unused; $i$ denotes the $i$-th input port in the computing node; $n$ denotes that the computing node has $n$ input ports in total; $\Delta c^{\mathrm{in}}_i$ represents the credit value variation of the $i$-th input port within the time period $\Delta t$; $\sum_{i=1}^{n} \Delta c^{\mathrm{in}}_i$ represents the total input port credit value variation of all input ports in the computing node within the time period $\Delta t$; $j$ denotes the $j$-th output port in the computing node; $m$ denotes that the computing node has $m$ output ports in total; $\Delta c^{\mathrm{out}}_j$ represents the credit value variation of the $j$-th output port within the time period $\Delta t$; and $\sum_{j=1}^{m} \Delta c^{\mathrm{out}}_j$ represents the total output port credit value variation of all output ports in the computing node within the time period $\Delta t$. The formula further specifies a special case of the form "when ..., ...", whose condition and value are not preserved in this text.
And calculating the data transmission load value of each calculation node in each alternative path at the current moment according to the ratio and the data transmission load value of each calculation node at the last moment based on the node transmission load value calculation formula.
According to the link switching method provided by the invention, the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each computing node in each alternative path, and the method comprises the following steps:
calculating to obtain a link congestion value of a data link between each computing node and a corresponding next computing node according to the data transmission load value of each computing node in each alternative path and the data transmission load value and the credit value used amount of the next computing node corresponding to each computing node;
and obtaining the path congestion value corresponding to each alternative path according to the link congestion values of the data links among all the calculation nodes in each alternative path.
According to the link switching method provided by the invention, the obtaining of the path congestion value corresponding to each alternative path according to the link congestion values of the data links between all the calculation nodes in each alternative path comprises the following steps:
constructing a path congestion value calculation formula, wherein the path congestion value calculation formula is as follows:
[Formula not preserved in this text.]
wherein $P$ represents the path congestion value corresponding to the alternative path; $k$ denotes the $k$-th computing node in the alternative path; $N$ denotes that the alternative path has $N$ computing nodes in total; $U_{k+1}$ represents the credit value used amount of the input port of the next computing node $k+1$ corresponding to the $k$-th computing node in the alternative path; $W_k$ represents the data transmission load value of the $k$-th computing node in the alternative path; and $W_{k+1}$ represents the data transmission load value of the $(k+1)$-th computing node in the alternative path;
and calculating the path congestion value corresponding to each alternative path according to the link congestion values of the data links among all the calculation nodes in each alternative path based on the path congestion value calculation formula.
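As a non-limiting illustration of how link congestion values may be aggregated into a path congestion value, the following Python sketch walks over consecutive node pairs of one alternative path. The exact formula is not reproduced here, so both the per-link weighting (`assumed_link_congestion`) and the aggregation by summation are hypothetical choices made only for illustration; the per-link function is passed in as a parameter so that another weighting can be substituted.

```python
from typing import Callable

LinkFn = Callable[[float, float, int], float]


def assumed_link_congestion(load: float, next_load: float,
                            next_used_credit: int) -> float:
    # HYPOTHETICAL weighting, not the formula of the invention: scale the used
    # credit of the next node's input port by the mean load of both endpoints.
    return next_used_credit * (load + next_load) / 2


def path_congestion(loads: list[float], used_credits: list[int],
                    link_fn: LinkFn = assumed_link_congestion) -> float:
    """Sum link congestion over consecutive node pairs of one alternative path.

    loads[k] is the data transmission load value of node k; used_credits[k] is
    the used credit of node k's input port (the source node entry is not read).
    """
    if len(loads) != len(used_credits):
        raise ValueError("loads and used_credits must have the same length")
    return sum(link_fn(loads[k], loads[k + 1], used_credits[k + 1])
               for k in range(len(loads) - 1))


print(path_congestion([1.0, 2.0, 1.0], [0, 2, 1]))  # 4.5
```

In this example the two links contribute 3.0 and 1.5 respectively under the assumed weighting.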
According to the link switching method provided by the invention, the obtaining a plurality of alternative paths between the first target computing node and the second target computing node includes:
determining a plurality of data transmission paths between the first target computing node and the second target computing node;
and taking, from the plurality of data transmission paths, the data transmission paths whose number of computing nodes is smaller than a preset number of computing nodes as the alternative paths, thereby acquiring the plurality of alternative paths between the first target computing node and the second target computing node.
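The acquisition of alternative paths can be sketched, without limitation, as a depth-first enumeration of loop-free paths. The adjacency-dictionary representation, the function name `alternative_paths` and the node labels are hypothetical; the sketch further assumes that the node count of a path includes the source and destination nodes and that the failed original link is excluded from the search.

```python
def alternative_paths(graph: dict[str, list[str]], source: str, destination: str,
                      max_nodes: int,
                      failed_link: tuple[str, str]) -> list[list[str]]:
    """Return loop-free paths source -> destination with fewer than max_nodes nodes."""
    blocked = {failed_link, failed_link[::-1]}
    paths: list[list[str]] = []

    def extend(path: list[str]) -> None:
        node = path[-1]
        if node == destination:
            paths.append(path.copy())
            return
        if len(path) + 1 >= max_nodes:  # one more node would reach the limit
            return
        for neighbor in graph.get(node, []):
            if neighbor in path or (node, neighbor) in blocked:
                continue
            path.append(neighbor)
            extend(path)
            path.pop()

    extend([source])
    return paths


graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "E"],
    "C": ["A", "B"],
    "D": ["A", "E"],
    "E": ["D", "B"],
}
print(alternative_paths(graph, "A", "B", 5, ("A", "B")))
# [['A', 'C', 'B'], ['A', 'D', 'E', 'B']]
```

Lowering `max_nodes` to 4 in this example leaves only the path through node C.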
According to the link switching method provided by the invention, the determining a target path from a plurality of alternative paths according to the path congestion value comprises the following steps:
and sorting the path congestion values corresponding to the alternative paths, and determining the alternative path with the smallest path congestion value as the target path according to the sorting result.
According to the link switching method provided by the invention, if a plurality of path congestion values with the same size exist in the sorting result, the method further comprises:
taking a plurality of alternative paths corresponding to the path congestion values with the same size as pending paths;
and acquiring the number of computing nodes in each undetermined path, and determining the undetermined path corresponding to the minimum number of computing nodes as the target path.
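The selection rule above, namely the smallest path congestion value with ties resolved by the smallest number of computing nodes, can be sketched as a single minimum over a composite key. The function name `select_target_path` and the pairing of each path with its congestion value are hypothetical; ties are detected here by exact equality of the congestion values, which is an assumption.

```python
def select_target_path(candidates: list[tuple[list[str], float]]) -> list[str]:
    """Pick the lowest congestion value; break ties by the fewest computing nodes."""
    if not candidates:
        raise ValueError("no alternative path available")
    # Tuples compare element by element: congestion first, node count second.
    path, _ = min(candidates, key=lambda item: (item[1], len(item[0])))
    return path


candidates = [
    (["A", "D", "E", "B"], 2.0),
    (["A", "C", "B"], 2.0),      # same congestion value, fewer nodes
    (["A", "F", "B"], 3.5),
]
print(select_target_path(candidates))  # ['A', 'C', 'B']
```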
According to the link switching method provided by the invention, a plurality of monitoring moments are set in the preset period, and the obtaining of the first credit value variation and the second credit value variation comprises the following steps:
and respectively summing the variation of the credit value of the first target computing node and the variation of the credit value of the second target computing node at each monitoring moment to obtain the first credit value variation and the second credit value variation.
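As a minimal sketch of the summation over monitoring moments, the credit value read at each monitoring moment can be compared with the previous reading and the differences accumulated. The function name `credit_variation` is hypothetical, and taking absolute differences is an assumption made so that increases and decreases within the preset period do not cancel each other out.

```python
def credit_variation(samples: list[int]) -> int:
    """Sum the credit change observed at each monitoring moment.

    samples[0] is the credit value at the start of the preset period, and each
    later entry is the credit value read at one monitoring moment.
    """
    return sum(abs(cur - prev) for prev, cur in zip(samples, samples[1:]))


print(credit_variation([4, 2, 3, 3]))  # 3
print(credit_variation([2, 2, 2, 2]))  # 0
```

A result of 0 on a non-idle link corresponds to the fault condition described above.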
According to the link switching method provided by the invention, the method further comprises the following steps:
determining a monitoring frequency type based on hardware types of the first target computing node and the second target computing node, wherein the monitoring frequency type comprises a high-frequency monitoring type, a medium-frequency monitoring type and a low-frequency monitoring type;
and configuring a plurality of corresponding monitoring moments in the preset period according to the monitoring frequency type.
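The configuration of monitoring moments may, for example, be sketched as follows. The number of monitoring moments assigned to each monitoring frequency type (8, 4 and 2) and the even spacing are hypothetical values chosen for illustration, and the mapping from hardware type to monitoring frequency type is not shown because it is not specified here.

```python
from enum import Enum


class MonitoringFrequency(Enum):
    HIGH = 8    # monitoring moments per preset period (hypothetical counts)
    MEDIUM = 4
    LOW = 2


def monitoring_moments(period_start: float, period_length: float,
                       frequency: MonitoringFrequency) -> list[float]:
    """Evenly spaced monitoring moments, the last one at the end of the period."""
    step = period_length / frequency.value
    return [period_start + step * (i + 1) for i in range(frequency.value)]


print(monitoring_moments(0.0, 1.0, MonitoringFrequency.MEDIUM))
# [0.25, 0.5, 0.75, 1.0]
```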
According to the link switching method provided by the invention, after determining a target path from a plurality of alternative paths according to the path congestion value and switching the original data link between the first target computing node and the second target computing node to the target path, the method further comprises:
and switching from the target path back to the original data link if the fault of the original data link between the first target computing node and the second target computing node has been eliminated.
According to the link switching method provided by the invention, the switching the original data link between the first target computing node and the second target computing node to the target path includes:
Acquiring Internet protocol address information and input/output port information of each computing node in the target path;
and establishing a data transmission path between the first target computing node and the second target computing node through each computing node in the target path based on the internet protocol address information and the input/output port information.
The invention also provides a link switching device, which comprises:
a first configuration module, configured to obtain a plurality of alternative paths between a first target computing node and a second target computing node when it is determined that an original data link between the first target computing node and the second target computing node has a failure, where each of the alternative paths is formed by data links between the plurality of computing nodes;
the second configuration module is used for acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each calculation node in each alternative path;
and the link switching module is used for determining a target path from a plurality of alternative paths according to the path congestion value and switching the original data link between the first target computing node and the second target computing node to the target path.
The invention also provides a computing system which comprises a plurality of computing nodes and the link switching device.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the link switching methods described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a link switching method as described in any of the above.
According to the link switching method, the device, the computing system, the electronic equipment and the storage medium, when the original data links among the nodes are determined to have faults, a plurality of alternative paths among the nodes are obtained, then the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each computing node in each alternative path, and then the target path is determined from the plurality of alternative paths through the path congestion value, so that the original data links among the nodes are switched to the target path, and the data link fault processing efficiency of the heterogeneous computing system of the multiple computing nodes is improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of the link switching method provided by the present invention;
Fig. 2 is a schematic diagram of switching a failed link of a computing node provided by the present invention;
Fig. 3 is a schematic structural diagram of the link switching device provided by the present invention;
Fig. 4 is a first schematic diagram of module interconnection of the link switching device provided by the present invention;
Fig. 5 is a second schematic diagram of module interconnection of the link switching device provided by the present invention;
Fig. 6 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In existing heterogeneous computing systems, when a link between computing nodes fails, maintenance personnel typically confirm the failure by means of network diagnostic tools, log analysis or physical inspection to determine whether the link is broken or damaged. For a link that cannot be repaired immediately, maintenance personnel may enable a backup link or an alternative path to maintain communication between the nodes; the backup link may be a redundant network link, a backup network device or another available communication channel, and a pre-configured backup link, if available, can be enabled by configuring the network device. At present, handling a link failure in a heterogeneous computing system therefore requires manual operations such as changing the routing table or creating a new physical connection or virtual link, which reduces the efficiency of link failure handling.
The invention detects link faults by monitoring the variation of link credit values in real time, and then switches the link between the nodes at the two ends of the faulty link according to a weighted calculation over the data transmission load values of the nodes and the credit values used on the links. The whole process is implemented in hardware without software participation, which improves the efficiency of link fault monitoring and link switching and improves the availability and reliability of the multi-node heterogeneous computing system.
Fig. 1 is a flow chart of a link switching method provided by the present invention, and as shown in fig. 1, the present invention provides a link switching method, including:
step 101, when it is determined that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by data links between a plurality of computing nodes.
In the invention, the data links between the computing nodes can be monitored in various ways to determine whether a fault exists. For example, the log information of the network equipment may be analyzed periodically so as to identify potential link faults or abnormal conditions from link-related error, warning or exception messages. Alternatively, link availability in the heterogeneous computing system may be monitored through a heartbeat mechanism: heartbeat signals are exchanged periodically between nodes, and if the corresponding reply is not received, the link is judged to be faulty.
Preferably, in the present invention, whether a data link has a fault is analyzed by monitoring, in real time, the credit values of the links between the computing nodes. The credit value of each computing node changes with the usage of the node's data transmission channels. For example, suppose the total credit value of a computing node is 8, of which the total credit value of the input port is 4 and the total credit value of the output port is 4. When 2 data packets are being received through the input port, the input port uses 2 credit values and 2 credit values remain; if, at the same time, 1 data packet is being transmitted to another computing node through the output port, the output port uses 1 credit value and 3 credit values remain. That is, at the current moment the credit value variation of the input port of the computing node is 2, and the credit value variation of the output port is 1.
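The numerical example above can be reproduced with a short Python sketch. The `Port` class and its method names are hypothetical, and the sketch assumes, as the example suggests, that each data packet in flight occupies exactly one credit value.

```python
from dataclasses import dataclass


@dataclass
class Port:
    total_credit: int     # credit available when the port is unused
    used_credit: int = 0  # credit currently held by in-flight data packets

    def acquire(self, packets: int = 1) -> None:
        """Occupy one credit per packet (assumed one-to-one mapping)."""
        if self.used_credit + packets > self.total_credit:
            raise ValueError("not enough credit")
        self.used_credit += packets

    @property
    def remaining(self) -> int:
        return self.total_credit - self.used_credit


input_port = Port(total_credit=4)
output_port = Port(total_credit=4)
input_port.acquire(2)   # two data packets are being received
output_port.acquire(1)  # one data packet is being sent

print(input_port.used_credit, input_port.remaining)    # 2 2
print(output_port.used_credit, output_port.remaining)  # 1 3
```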
Further, take a data link between two computing nodes, namely a first target computing node and a second target computing node, as an example, in which the output port of the first target computing node transmits data to the input port of the second target computing node. If, within the preset period, the credit values corresponding to the output port of the first target computing node and the input port of the second target computing node do not change, it is judged that the data link between the first target computing node and the second target computing node (namely the original data link) has a fault.
Further, based on a heterogeneous computing system where the first target computing node and the second target computing node are located, the first target computing node is used as a source node, the second target computing node is used as a destination node, other computing nodes in the computing system are used as transfer nodes, and data links are formed between each computing node by selecting different transfer nodes, so that a plurality of alternative paths are formed.
Step 102, obtaining a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path.
In the present invention, before calculating the path congestion value of each alternative path, the data transmission load value of the calculation node in each alternative path needs to be calculated first. Specifically, on the basis of the foregoing embodiment, before the obtaining, according to the data transmission load value of each computing node in each of the alternative paths, a path congestion value corresponding to each of the alternative paths, the method further includes:
based on the calculation nodes in each alternative path, acquiring the data transmission load value of each calculation node at the last moment in the preset period;
acquiring the total amount of the credit value change of the input port and the total amount of the credit value change of the output port of each computing node between the current moment and the last moment;
and based on the ratio between the credit value change total quantity of the input port and the credit value change total quantity of the output port, acquiring the data transmission load value of each calculation node in each alternative path at the current moment through the data transmission load value of each calculation node at the last moment.
In the invention, after the heterogeneous computing system is initialized, the initial value corresponding to the data transmission load value of each computing node is 1 (for facilitating subsequent computation), and in the subsequent data processing process, the data transmission load value of each computing node also changes along with the change condition of the credit values of the input port and the output port of the computing node. It should be noted that, in the present invention, the data transmission load value represents the data amount or data flow carried by a node in a certain period of time, and an appropriate index may be selected to measure the data transmission load value of the node according to a specific application scenario and a requirement, for example, according to a bandwidth utilization rate, a data packet transmission rate, a data throughput, and the like.
According to the invention, the data transmission load value of each computing node at the current moment is obtained in real time from the node state at the previous moment (namely the data transmission load value of the node at the previous moment) and the credit value variations of its input ports and output ports. When the credit value variation of all input ports of a computing node is large, the node is receiving a large amount of data, and its capacity to receive data at the next moment can be predicted to decrease; the data transmission load value is therefore positively correlated with the input port credit value variation. When the credit value variation of all output ports of the computing node is large, the node is transmitting a large amount of data to the next computing node, and its capacity to output data at the next moment can be predicted to be stronger; the data transmission load value is therefore inversely correlated with the output port credit value variation. Analyzing the credit value variations of the input ports and output ports of each computing node allows the load of the computing node at the current moment to be determined quickly, which provides more accurate data support for the subsequent path switching.
Further, the obtaining, based on the ratio between the total amount of credit change of the input port and the total amount of credit change of the output port, the data transmission load value of each computing node in each candidate path at the current time through the data transmission load value of each computing node at the previous time includes:
Constructing a node transmission load value calculation formula, wherein the node transmission load value calculation formula is as follows:

$$L_{t+\Delta t} = L_t \times \frac{C_{in} + \sum_{i=1}^{p} \Delta I_i}{C_{out} + \sum_{j=1}^{q} \Delta O_j}$$

wherein $L_{t+\Delta t}$ is the data transmission load value of the computing node at the current moment; $L_t$ is the data transmission load value of the computing node at the last moment $t$; $\Delta t$ is the preset time interval duration; $C_{in}$ represents the total credit value of the input ports when all input ports in the computing node are unused; $C_{out}$ represents the total credit value of the output ports when all output ports in the computing node are unused; $i$ represents the $i$-th input port in the computing node; $p$ represents that the computing node has a total of $p$ input ports; $\Delta I_i$ represents the input port credit value variation of the $i$-th input port within the $\Delta t$ time period; $\sum_{i=1}^{p} \Delta I_i$ represents the total amount of input port credit value variation of all input ports in the computing node within the $\Delta t$ time period; $j$ represents the $j$-th output port in the computing node; $q$ represents that the computing node has a total of $q$ output ports; $\Delta O_j$ represents the output port credit value variation of the $j$-th output port within the $\Delta t$ time period; $\sum_{j=1}^{q} \Delta O_j$ represents the total amount of output port credit value variation of all output ports in the computing node within the $\Delta t$ time period; wherein, when $t = 0$, $L_t = 1$;
And calculating the data transmission load value of each calculation node in each alternative path at the current moment according to the ratio and the data transmission load value of each calculation node at the last moment based on the node transmission load value calculation formula.
In the invention, the data transmission load value of each computing node at the current moment is mainly calculated from the data transmission load value at the previous moment and the total credit value variation of the input ports and output ports. For each computing node, after initialization is completed, the heterogeneous computing system calculates the data transmission load value of the computing node at each moment, so that the load value of the computing node at any moment is obtained in real time. It should be noted that $C_{in}$ and $C_{out}$ are the total credit values corresponding to the input ports and the output ports of the computing node (namely, the maximum credit values of the ports in the idle state); these two parameters prevent the denominator of the node transmission load value calculation formula from being 0 and do not affect the meaning of the formula.
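The calculation described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `node_load` and its parameters are hypothetical, and the formula form (previous load multiplied by the ratio of idle input total plus input variation to idle output total plus output variation) is an assumption chosen because it reproduces the value 0.613 given in the embodiment for idle totals of 64 and variations of 12 and 60.

```python
def node_load(prev_load, in_changes, out_changes, c_in, c_out):
    """Data transmission load value of a node at the current moment.

    prev_load:   load value at the previous moment (1.0 after initialization)
    in_changes:  credit value variation of each input port over the interval
    out_changes: credit value variation of each output port over the interval
    c_in, c_out: total credit values of all input / output ports when idle
    """
    total_in = sum(in_changes)
    total_out = sum(out_changes)
    # ASSUMED formula form; a positive c_out keeps the denominator non-zero
    # even when no output port credit changed during the interval.
    return prev_load * (c_in + total_in) / (c_out + total_out)


# Values taken from the embodiment: idle totals 64, variations 12 and 60.
print(round(node_load(1.0, [12], [60], c_in=64, c_out=64), 3))
```

With no credit variation on either side the ratio is 1, so an untouched node keeps its previous load value.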
Further, after the data transmission load value of each calculation node in each alternative path at the current moment is calculated by the node transmission load value calculation formula, the corresponding path congestion value needs to be calculated according to the data transmission load values of all calculation nodes in each alternative path. On the basis of the foregoing embodiment, the obtaining, according to the data transmission load value of each computing node in each candidate path, a path congestion value corresponding to each candidate path includes:
Calculating to obtain a link congestion value of a data link between each computing node and a corresponding next computing node according to the data transmission load value of each computing node in each alternative path and the data transmission load value and the credit value used amount of the next computing node corresponding to each computing node;
and obtaining the path congestion value corresponding to each alternative path according to the link congestion values of the data links among all the calculation nodes in each alternative path.
In the invention, for any computing node in an alternative path, the credit value used amount of the input port of the next computing node in the path is obtained. For example, the next node of the $k$-th computing node in the alternative path is the $(k+1)$-th node; if the credit value used amount of the input port of the $(k+1)$-th node is 3, the $(k+1)$-th node has consumed 3 input port credit values during data interaction at the current moment. Further, in the present invention, the credit value used amount of the input port of the next computing node in the path is used as a weight, the link congestion value of the data link between two adjacent computing nodes is calculated from the data transmission load values of the two adjacent computing nodes, and the link congestion values of all data links in the alternative path are then summed to obtain the path congestion value of the alternative path. In this way, for all possible paths between the two target computing nodes whose link has failed, the optimal path is quickly determined by calculating the congestion value of each path, which improves link switching efficiency.
Specifically, the obtaining, according to the link congestion values of the data links between all the computing nodes in each of the alternative paths, the path congestion value corresponding to each of the alternative paths includes:
constructing a path congestion value calculation formula, wherein the path congestion value calculation formula is as follows:
wherein $P$ represents the path congestion value corresponding to the alternative path; $k$ represents the $k$-th computing node in the alternative path; $n$ represents that the alternative path has a total of $n$ computing nodes; $u_{k+1}$ represents the credit value used amount of the input port of the next computing node $k+1$ corresponding to the $k$-th computing node in the alternative path; $L^{(k)}$ represents the data transmission load value of the $k$-th computing node in the alternative path, and $L^{(k+1)}$ represents the data transmission load value of the $(k+1)$-th computing node in the alternative path;
and calculating the path congestion value corresponding to each alternative path according to the link congestion values of the data links among all the calculation nodes in each alternative path based on the path congestion value calculation formula.
In the invention, the congestion value of each alternative path is calculated respectively through the path congestion value calculation formula, and the data transmission load value and the used credit value of each calculation node are obtained and calculated in real time through related hardware (such as monitoring equipment), so that the efficiency of link fault monitoring and link switching is improved.
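The summation over links can be sketched as follows. The text states that the used credit amount of the next node's input port acts as a weight and that the per-link values are summed, but the exact way the two adjacent load values are combined is not recoverable here. The sketch therefore assumes, purely for illustration, that the link congestion value is the used credit amount multiplied by the sum of the two load values; the function name `path_congestion` is hypothetical.

```python
def path_congestion(loads, credits_used):
    """Path congestion value as a sum of per-link congestion values.

    loads:        data transmission load value of each node along the path
    credits_used: for each link k -> k+1, the used credit amount of the
                  input port of node k+1 (one entry per link)
    """
    if len(credits_used) != len(loads) - 1:
        raise ValueError("need exactly one credit entry per link")
    total = 0.0
    for k, used in enumerate(credits_used):
        # ASSUMPTION: link congestion = weight * (load_k + load_{k+1}).
        total += used * (loads[k] + loads[k + 1])
    return total


# Hypothetical three-node path with two links.
print(path_congestion([1.0, 0.5, 1.0], [2, 3]))
```

Any other combination of the two load values (for example their product) can be substituted on the marked line without changing the summation structure.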
And step 103, determining a target path from a plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path.
Fig. 2 is a schematic diagram of switching a failed link of a computing node according to the present invention, and referring to fig. 2, in the present invention, there are 16 nodes in a heterogeneous computing system, where node 0 to node 7 are fully interconnected, and node 8 to node 15 are fully interconnected, so as to form a plurality of data links. Assuming that the initial credit value of each data link in the heterogeneous computing system is 8, the node 0 and the node 8 perform data interaction through the direct communication channels of the two, and the process consumes the link credit value and makes the credit value change continuously. It is assumed that at a certain moment, by monitoring the credit value change value of the direct connection channel between the node 0 and the node 8, it is determined that the credit value does not change in a preset period (for example, 10 clock cycles are set in the preset period, credit value monitoring is performed every clock cycle, a specific number value of the clock cycles can be set according to the actual situation of the heterogeneous computing system), and the node 0 and the node 8 are not in an idle state at this moment, so that it can be determined that the direct connection channel between the node 0 and the node 8 has a fault. At this point, the heterogeneous computing system triggers the flow of automatic switching of the failed link to resume the data interaction between node 0 and node 8.
Further, the alternative paths for data transmission between node 0 and node 8 are calculated, wherein the number of nodes $n$ in an alternative path may be set according to the actual circumstances of the heterogeneous computing system. In one embodiment, $n$ is set to 5, so that each alternative path includes node 0 and node 8 and involves at most 5 computing nodes. As shown in fig. 2, four alternative paths can be enumerated, for example:
alternative path 1: node 0-node 1-node 9-node 8;
alternative path 2: node 0-node 2-node 10-node 8;
alternative path 3: node 0-node 1-node 9-node 12-node 8;
alternative path 4: node 0-node 3-node 11-node 9-node 8.
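The enumeration of alternative paths can be sketched with a depth-first search over the node graph. This is an illustrative sketch only: the function names are hypothetical, and the topology assumes, beyond the two fully interconnected groups stated above, that node i is directly linked to node i + 8 (consistent with the links 0-8, 1-9, 2-10 and 3-11 used in the example paths).

```python
def build_topology():
    """16 nodes: 0-7 fully interconnected, 8-15 fully interconnected.

    ASSUMPTION: node i is also directly linked to node i + 8.
    """
    adj = {v: set() for v in range(16)}
    for group in (range(0, 8), range(8, 16)):
        for a in group:
            for b in group:
                if a != b:
                    adj[a].add(b)
    for i in range(8):
        adj[i].add(i + 8)
        adj[i + 8].add(i)
    return adj


def candidate_paths(adj, src, dst, max_nodes, failed_link):
    """Simple paths from src to dst with at most max_nodes nodes,
    never traversing the failed link."""
    failed = frozenset(failed_link)
    results = []

    def dfs(path):
        node = path[-1]
        if node == dst:
            results.append(list(path))
            return
        if len(path) == max_nodes:
            return  # node budget exhausted without reaching dst
        for nxt in sorted(adj[node]):
            if nxt in path or frozenset((node, nxt)) == failed:
                continue
            path.append(nxt)
            dfs(path)
            path.pop()

    dfs([src])
    return results


paths = candidate_paths(build_topology(), 0, 8, max_nodes=5, failed_link=(0, 8))
print(len(paths), [0, 1, 9, 12, 8] in paths)
```

The search returns many more than four paths on this topology; the four listed above are examples drawn from that set.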
Further, the data transmission load value of each computing node involved in each alternative path is calculated through the node transmission load value calculation formula. In this embodiment, for ease of calculation, it is assumed that the total credit values of the input ports and of the output ports of every computing node, $C_{in}$ and $C_{out}$, are both 64. Taking the link from node 0 to node 1 as an example, the total amount of input port credit value change of node 1 in the last 10 clock cycles (i.e., within the preset period) is 12, i.e. $\sum_i \Delta I_i = 12$; the total amount of output port credit value change is 60, i.e. $\sum_j \Delta O_j = 60$; and the transmission load value of node 1 at the previous moment is 1 (for ease of calculation). From these values, the data transmission load value of node 1 at the current moment is calculated to be 0.613. Meanwhile, the credit value used amount of the input port of the next computing node corresponding to each computing node in the alternative path is collected. Table 1 shows the data transmission load values of each computing node based on the 4 alternative paths:
TABLE 1
In Table 1, the data transmission load value of each computing node at the current moment is calculated by the node transmission load value calculation formula.
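As a check on the value given for node 1 in this embodiment, and assuming the node transmission load value calculation formula takes the form of the previous load value multiplied by the ratio of the idle input total plus the input variation to the idle output total plus the output variation, the arithmetic works out as:

$$L_{\text{node 1}} = 1 \times \frac{64 + 12}{64 + 60} = \frac{76}{124} \approx 0.613$$

The larger output variation (60 against 12) pulls the load value below 1, which matches the stated inverse correlation between load and output port credit variation.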
Table 2 shows the credit value used amount of the input port of the next computing node corresponding to each computing node based on the above 4 alternative paths, and specifically referring to table 2:
TABLE 2
Using the path congestion value calculation formula together with Table 1 and Table 2, the congestion values of the 4 alternative paths are calculated: the path congestion value of alternative path 1 is calculated first, and the path congestion values of alternative path 2, alternative path 3 and alternative path 4 are then obtained in the same way.
Further, the congestion values of the 4 alternative paths are compared. In this embodiment, the congestion value of alternative path 3 is the smallest, so alternative path 3 is the least congested of the alternative paths; that is, the alternative path node 0-node 1-node 9-node 12-node 8 is the target path. The original data link from node 0 to node 8 is therefore switched to the target path, which restores data transmission from node 0 to node 8. Similarly, when a link fault occurs during data transmission from node 8 to node 0, path selection and switching are completed through the same process.
According to the link switching method provided by the invention, when the original data links among the nodes are determined to have faults, a plurality of alternative paths among the nodes are obtained, then the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each calculation node in each alternative path, and then the target path is determined from the plurality of alternative paths through the path congestion value, so that the original data links among the nodes are switched to the target path, and the data link fault processing efficiency of the heterogeneous calculation system of the multiple calculation nodes is improved.
On the basis of the above embodiment, the method further includes:
acquiring a first credit value variation and a second credit value variation, wherein the first credit value variation represents the variation of the credit value of the first target computing node in a preset period; the second credit value variation amount represents the variation amount of the credit value of the second target computing node in the preset period; the credit value is set based on the parallel calculation quantity of the calculation nodes, and comprises an input port credit value and an output port credit value;
and judging the fault condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation, so as to execute data link switching operation on the first target computing node and the second target computing node under the condition that the fault is determined to exist.
In the invention, the credit value changes dynamically with the data input and output of the computing nodes, and each data link is preconfigured with a corresponding credit value. Owing to the parallel computing capability of the computing nodes, each computing node has a plurality of input ports and output ports and can process a plurality of data tasks in parallel; when a data link is used, the credit value of the link corresponding to the relevant input port or output port is consumed. For example, if the credit value of each link of a computing node is 8 and the computing node is interconnected with 8 other computing nodes, the computing node has 8 input ports and 8 output ports in total. That is, the input ports of the computing node form 8 data input links with the other 8 computing nodes, and the total credit value of the links corresponding to the input ports is 64; the output ports of the computing node form 8 data output links with the other 8 computing nodes, and the total credit value of the links corresponding to the output ports is 64.
Further, when the computing node performs data transmission with any other computing node, the corresponding credit value is consumed. For example, assuming that the computing node had consumed no credit value at the previous moment, and that within the preset period its input ports consume 2 credit values and its output ports consume 1 credit value, the credit value variation of the computing node within the preset period is 3. It should be noted that, within the preset period, the data transmission condition of the data links of each computing node changes dynamically; that is, after a data link completes its data transmission, the credit values it used are recovered. If the only monitoring moments in the preset period are the previous moment and the current moment, the credit value change within the period cannot be reflected faithfully. For example, if an input port consumes 5 credit values at one moment in the preset period and the data processing completes at a later moment in the same period, the credit value variation of that input port over the preset period is 0. Such a monitoring result is too coarse to determine accurately whether the computing node is idle or faulty, so the credit value variation of the computing node needs to be monitored more precisely. Therefore, in an embodiment, a plurality of monitoring moments can be set within the preset period to increase the monitoring frequency, so that the fault condition of the computing node is judged more accurately from the credit value variation.
Further, in the present invention, for the data link between the nodes, the credit value is dynamically changed, the change range is relatively stable in the normal state, and when the credit value change amount between the two nodes is abnormal, the data link is at risk of failure.
In the invention, a network monitoring system can be built, which comprises monitoring equipment and software and is used for monitoring the change condition of the credit value of the link between the computing nodes in real time; meanwhile, various performance indexes such as delay, packet loss rate, bandwidth utilization rate and the like on the link are collected through equipment such as a network management tool, a flow analyzer or network monitoring equipment and the like, and auxiliary data is provided for link fault analysis.
On the basis of the foregoing embodiment, the determining, according to the first credit value variation and the second credit value variation, a failure condition of the original data link between the first target computing node and the second target computing node includes:
and judging whether the first credit value variation and the second credit value variation are preset values or not based on the preset period, and determining a fault condition according to a judging result.
In the invention, the credit value variation between two computing nodes is collected in real time to judge whether the credit value variation meets the preset value, thereby rapidly judging the fault condition of the data link between the computing nodes. Specifically, the determining, based on the preset period, whether the first credit value variation and the second credit value variation are preset values, and determining a fault condition according to a determination result includes:
and if any credit value variation in the first credit value variation and the second credit value variation is 0 in the preset period, determining that the original data link between the first target computing node and the second target computing node has a fault.
In the invention, the output port of the first target computing node and the input port of the second target computing node perform data transmission, credit values of data links between the first target computing node and the second target computing node are acquired in real time within a preset period, and then whether the data link at the current moment has faults or not is judged through a link fault judging formula, wherein the link fault judging formula is as follows:
wherein $\Delta C_A$ is the link credit value variation at the output end of computing node $A$ (namely, the first target computing node), i.e. the first credit value variation; $\Delta C_B$ is the link credit value variation at the input end of computing node $B$ (namely, the second target computing node), i.e. the second credit value variation; $T$ is the duration of the preset period. Based on the link failure judgment formula, the calculation is performed according to the credit value variations of the two computing nodes. When the result of the link failure judgment formula equals 0, the credit values at the two ends of the link have not changed within the preset period, and the link is judged to be faulty. Because the link fault does not require subjective human judgment, both the accuracy and the efficiency of fault judgment are improved, and the heterogeneous computing system can trigger the subsequent link switching operation more quickly.
On the basis of the above embodiment, the method further includes:
acquiring port credit values corresponding to the first target computing node and the second target computing node respectively in the preset period;
if the port credit value is smaller than the corresponding maximum credit threshold value in the preset period, judging that the original data link between the first target computing node and the second target computing node is in a non-idle state;
If the port credit value is equal to the corresponding maximum credit threshold value within the preset period, judging that the original data link between the first target computing node and the second target computing node is in an idle state;
if any one of the first credit value variation and the second credit value variation is 0 in the preset period, determining that the original data link between the first target computing node and the second target computing node has a fault includes:
and if the original data link between the first target computing node and the second target computing node is in a non-idle state within the preset period, and any credit value variation in the first credit value variation and the second credit value variation is 0, determining that the original data link between the first target computing node and the second target computing node has a fault.
In the present invention, the computing nodes of some heterogeneous computing systems are usually kept running for long periods, so in order to further improve the accuracy of fault determination, the idle state between the computing nodes also needs to be considered. For each computing node, its input ports and output ports are preconfigured with corresponding credit values; that is, when the computing node is in the idle state, the credit values of its input ports and output ports are at their maximum. This maximum can therefore be set as the maximum credit threshold. When judging a link failure, it is first determined whether the credit value of the corresponding port between the two computing nodes is smaller than the maximum credit threshold; only when the credit value is smaller than the maximum credit threshold and the result of the link failure judgment formula equals 0 is the link determined to have failed, which further improves the accuracy of failure detection.
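The combined idle check and variation check can be sketched as follows. This is a hedged illustration rather than the patent's link failure judgment formula, which is not reproduced here. The function names are hypothetical, and two interpretations are assumed: the link counts as non-idle if any sampled port credit value at either end is below the maximum credit threshold, and the link counts as stalled if either credit value variation is 0.

```python
def is_non_idle(credit_samples, max_credit):
    """Non-idle if any sampled port credit value is below the maximum
    credit threshold, i.e. some credit is currently in use."""
    return any(c < max_credit for c in credit_samples)


def link_has_fault(first_variation, second_variation,
                   first_samples, second_samples, max_credit):
    """Report a fault only for a non-idle link whose credit has stalled.

    first_variation / second_variation: credit value variation of the first
        and second target computing nodes over the preset period
    first_samples / second_samples: port credit values sampled in the period
    """
    # ASSUMPTION: one non-idle end is enough to treat the link as in use.
    non_idle = (is_non_idle(first_samples, max_credit)
                or is_non_idle(second_samples, max_credit))
    stalled = first_variation == 0 or second_variation == 0
    return non_idle and stalled


# Credits are held (5 of 8 remaining) but never move: treated as a fault.
print(link_has_fault(0, 0, [5, 5, 5], [5, 5, 5], max_credit=8))
# Credits sit at the maximum and never move: idle, not a fault.
print(link_has_fault(0, 0, [8, 8, 8], [8, 8, 8], max_credit=8))
```

Separating the two predicates keeps the idle case from being misreported as a failure, which is the point of the maximum credit threshold.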
On the basis of the above embodiment, the acquiring a plurality of alternative paths between the first target computing node and the second target computing node includes:
determining a plurality of data transmission paths between the first target computing node and the second target computing node;
and taking the data transmission paths with the number of the computing nodes smaller than the number of the preset computing nodes in the plurality of data transmission paths as the alternative paths, and acquiring a plurality of alternative paths between the first target computing node and the second target computing node.
In the invention, the value range of the preset number of computing nodes $n$ can be set according to actual use. The value of $n$ should not be too large, to avoid an excessive amount of calculation caused by too many nodes in a path, which would reduce the switching rate. Preferably, in an embodiment, the preset number of computing nodes $n$ is set to 5, so that all data transmission paths meeting the preset number of computing nodes are screened out as alternative paths, and the congestion value of each alternative path is calculated through the node transmission load value calculation formula and the path congestion value calculation formula to determine the target path.
On the basis of the foregoing embodiment, the determining, according to the path congestion value, a target path from a plurality of alternative paths includes:
and sorting the path congestion values corresponding to the alternative paths, and determining the alternative path with the smallest path congestion value as the target path according to the sorting result.
In the invention, after the congestion value of each alternative path is calculated by a path congestion value calculation formula, the path congestion values are ordered, and the lower the path congestion value is, the smoother the alternative path is, so that the target path is rapidly determined according to the alternative path with the minimum congestion value, and the subsequent link switching efficiency is improved.
On the basis of the foregoing embodiment, if there are a plurality of path congestion values with the same size in the sorting result, the method further includes:
taking a plurality of alternative paths corresponding to the path congestion values with the same size as pending paths;
and acquiring the number of computing nodes in each undetermined path, and determining the undetermined path corresponding to the minimum number of computing nodes as the target path.
In the invention, when several alternative paths share the same minimum path congestion value, a further selection is made based on the number of computing nodes in each pending path, so that a unique target path is determined and system stability problems are avoided. In an embodiment, the selection may also be made manually, generating a corresponding link switching instruction for the link switching process, thereby improving the stability of the heterogeneous computing system.
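The selection rule, smallest path congestion value first and fewest computing nodes as the tie-breaker, can be sketched with a composite sort key. The function name `select_target_path` and the congestion values in the usage example are hypothetical and serve only to illustrate the rule.

```python
def select_target_path(candidates):
    """Pick the target path from (path, congestion_value) pairs.

    The primary key is the path congestion value; among paths with the
    same congestion value, the one with the fewest computing nodes wins.
    """
    if not candidates:
        raise ValueError("no alternative paths available")
    best_path, _ = min(candidates, key=lambda pc: (pc[1], len(pc[0])))
    return best_path


# Hypothetical congestion values: two paths tie at 4.0.
candidates = [
    ([0, 1, 9, 12, 8], 4.0),
    ([0, 2, 10, 8], 4.0),
    ([0, 1, 9, 8], 6.5),
]
print(select_target_path(candidates))
```

If two paths tie on both keys, `min` returns the first one encountered; the embodiment's manual selection would be one way to resolve that remaining case.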
On the basis of the foregoing embodiment, the preset period is provided with a plurality of monitoring moments, and the acquiring the first credit value variation and the second credit value variation includes:
and respectively summing the variation of the credit value of the first target computing node and the variation of the credit value of the second target computing node at each monitoring moment to obtain the first credit value variation and the second credit value variation.
In the invention, in the fault monitoring process, a plurality of monitoring moments are set, so that the monitoring frequency is improved, and the fault condition of the computing node is more accurately judged according to the credit value variation. Further, the credit value variation quantities collected at a plurality of monitoring moments are summed to obtain a first credit value variation quantity and a second credit value variation quantity respectively, and whether a fault condition exists at the current moment is judged according to a link fault judging formula.
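The benefit of summing over several monitoring moments can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the variation at each monitoring moment is counted as an absolute change, so that consumption and recovery do not cancel each other.

```python
def endpoint_variation(samples):
    """Variation seen when only the first and last moments are monitored."""
    return abs(samples[-1] - samples[0])


def summed_variation(samples):
    """Variation summed over every pair of consecutive monitoring moments.

    ASSUMPTION: each step contributes its absolute change.
    """
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


# An input port with 8 credits consumes 5 and later recovers them
# within the same preset period.
samples = [8, 3, 8]
print(endpoint_variation(samples))  # activity is invisible
print(summed_variation(samples))    # activity is visible
```

With only two monitoring moments, an active link looks identical to a stalled one; the summed form distinguishes them.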
On the basis of the above embodiment, the method further includes:
determining a monitoring frequency type based on hardware types of the first target computing node and the second target computing node, wherein the monitoring frequency type comprises a high-frequency monitoring type, a medium-frequency monitoring type and a low-frequency monitoring type;
and configuring a plurality of corresponding monitoring moments in the preset period according to the monitoring frequency type.
In the present invention, the monitoring frequency is set according to the hardware type of the computing node. If the response time requirement for link failures is high, a shorter monitoring interval (i.e. the high-frequency monitoring type) can be selected, for example monitoring once every few seconds, so that link problems are found in time and corrective measures can be taken quickly.
If the response time requirement for link failures is less stringent, a moderate monitoring interval (i.e. the medium-frequency monitoring type) may be selected, for example monitoring once per minute, which balances monitoring overhead against the timeliness of link failure detection.
In some stable network environments, where the probability of link failure is low, a longer monitoring interval (i.e. the low-frequency monitoring type) may be selected, which reduces monitoring overhead while still ensuring that potential problems are detected in time.
On the basis of the above embodiment, after the determining a target path from a plurality of the alternative paths according to the path congestion value and switching the original data link between the first target computing node and the second target computing node to the target path, the method further includes:
and switching the target path to the original data link if the failure of the original data link between the first target computing node and the second target computing node has been eliminated.
In the invention, because the original data link is generally a direct connection channel, its transmission efficiency is higher. When the fault of the original data link is determined to have been eliminated, the target path can be switched back to the original data link, which improves data transmission efficiency and reduces the load on the other computing nodes.
Based on the above embodiment, the switching the original data link between the first target computing node and the second target computing node to the target path includes:
acquiring Internet protocol address information and input/output port information of each computing node in the target path;
and establishing a data transmission path between the first target computing node and the second target computing node through each computing node in the target path based on the internet protocol address information and the input/output port information.
In the invention, after each computing node in the target path is determined, the computing nodes are positioned through Internet protocol (Internet Protocol, abbreviated as IP) address information and input/output port information, so that the computing nodes are used as transit nodes, and a data transmission path between the first target computing node and the second target computing node is quickly established.
The link switching device provided by the invention is described below, and the link switching device described below and the link switching method described above can be referred to correspondingly.
Fig. 3 is a schematic structural diagram of a link switching device provided by the present invention, and as shown in fig. 3, the present invention provides a link switching device, which includes a first configuration module 301, a second configuration module 302, and a link switching module 303, where the first configuration module 301 is configured to obtain, when determining that an original data link between a first target computing node and a second target computing node has a failure, a plurality of alternative paths between the first target computing node and the second target computing node, where each of the alternative paths is formed by data links between a plurality of computing nodes; the second configuration module 302 is configured to obtain a path congestion value corresponding to each of the alternative paths according to the data transmission load value of each computing node in each of the alternative paths; the link switching module 303 is configured to determine a target path from a plurality of the alternative paths according to the path congestion value, and switch the original data link between the first target computing node and the second target computing node to the target path.
According to the link switching device provided by the invention, when the original data links among the nodes are determined to have faults, a plurality of alternative paths among the nodes are obtained, then the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each calculation node in each alternative path, and then the target path is determined from the plurality of alternative paths through the path congestion value, so that the original data links among the nodes are switched to the target path, and the data link fault processing efficiency of the heterogeneous calculation system of the multiple calculation nodes is improved.
On the basis of the above embodiment, the apparatus further includes a first monitoring module and a second monitoring module. The first monitoring module is configured to obtain a first credit value variation and a second credit value variation, where the first credit value variation represents the variation of the credit value of the first target computing node within a preset period, and the second credit value variation represents the variation of the credit value of the second target computing node within the preset period. The credit value is set based on the amount of parallel computation of the computing node, and includes an input port credit value and an output port credit value. The second monitoring module is configured to determine, according to the first credit value variation and the second credit value variation, the fault condition of the original data link between the first target computing node and the second target computing node, so that a data link switching operation is performed on the first target computing node and the second target computing node when it is determined that a fault exists.
In the invention, the link switching device includes a monitoring module (comprising the first monitoring module and the second monitoring module) and a configuration module (comprising the first configuration module and the second configuration module). The monitoring module predicts whether a link has failed according to whether the credit values of the ports at the two ends of the link change within the preset period. When a link fault occurs, the monitoring module calculates the data transmission load value of each node in the alternative paths; this value is related to the total credit value variation of the node's input ports and output ports since the previous moment, and the initial value for a node that has not been enabled is 1. The configuration module then, through the interface module of each node, selects the transmission path with the lowest congestion value according to a weighted sum of the node transmission load values and the credit values used by the links in the alternative paths; the relevant parameters of non-enabled nodes and links are configured by the link initialization module. Fig. 4 and Fig. 5 are schematic diagrams of the module interconnection of the link switching device provided by the invention; according to actual needs, the configuration module may be disposed inside each computing node, or may be disposed separately (that is, on the same side as the monitoring module).
On the basis of the foregoing embodiment, the second monitoring module includes a fault judging unit configured to judge, based on the preset period, whether the first credit value variation and the second credit value variation are preset values, and determine a fault condition according to a judgment result.
On the basis of the foregoing embodiment, the fault judging unit is specifically configured to determine that the original data link between the first target computing node and the second target computing node has a fault if either of the first credit value variation and the second credit value variation is 0 within the preset period.
On the basis of the above embodiment, the apparatus further includes a credit monitoring module and a link idle state judging module. The credit monitoring module is configured to obtain the port credit values corresponding to the first target computing node and the second target computing node within the preset period. The link idle state judging module is configured to determine that the original data link between the first target computing node and the second target computing node is in a non-idle state if the port credit value is less than the corresponding maximum credit threshold within the preset period, and to determine that the original data link is in an idle state if the port credit value is equal to the corresponding maximum credit threshold within the preset period.
The fault judging unit is further configured to determine that the original data link between the first target computing node and the second target computing node has a fault if, within the preset period, the original data link is in a non-idle state and either of the first credit value variation and the second credit value variation is 0.
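The fault test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `link_is_idle` and `link_has_fault` are hypothetical, the port credit values are assumed to be sampled several times within the preset period, and the link is assumed to count as idle only when the ports at both ends stay at their maximum credit threshold.

```python
from typing import Sequence


def link_is_idle(first_samples: Sequence[int], first_max: int,
                 second_samples: Sequence[int], second_max: int) -> bool:
    """Idle when every sampled port credit value equals its maximum credit threshold.

    Assumption: both endpoints must be at their maximum for the link to be idle.
    """
    return (all(c == first_max for c in first_samples)
            and all(c == second_max for c in second_samples))


def link_has_fault(first_variation: int, second_variation: int, idle: bool) -> bool:
    """Fault when the link is non-idle and either endpoint's credit variation is 0."""
    if idle:
        return False  # an idle link legitimately shows no credit change
    return first_variation == 0 or second_variation == 0
```

The idle check prevents an unused link, whose credits naturally do not change, from being reported as faulty.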
On the basis of the above embodiment, the apparatus further includes a first processing module, a second processing module, and a load value calculation module. The first processing module is configured to obtain, for the computing nodes in each alternative path, the data transmission load value of each computing node at the previous moment in the preset period. The second processing module is configured to obtain the total input port credit value variation and the total output port credit value variation of each computing node between the previous moment and the current moment. The load value calculation module is configured to obtain, based on the ratio between the total input port credit value variation and the total output port credit value variation, the data transmission load value of each computing node in each alternative path at the current moment from the data transmission load value of each computing node at the previous moment.
On the basis of the foregoing embodiment, the load value calculation module is specifically configured to construct a node transmission load value calculation formula, where the node transmission load value calculation formula is:
[Formula not reproduced in this text; in the original publication the formula and its symbols appear as images.]
wherein the formula relates the following quantities: the data transmission load value of the computing node at the current moment; the data transmission load value of the computing node at the previous moment t; the preset time interval duration; the total credit value of the input ports when all input ports in the computing node are unused; the total credit value of the output ports when all output ports in the computing node are unused; the index of an input port and the total number of input ports in the computing node; the credit value variation of each input port within the time interval, and the total input port credit value variation of all input ports in the computing node within that interval; the index of an output port and the total number of output ports in the computing node; the credit value variation of each output port within the time interval, and the total output port credit value variation of all output ports in the computing node within that interval. The formula further specifies the value taken in a special case, the condition of which is likewise given only in the formula image.
Based on the node transmission load value calculation formula, the data transmission load value of each computing node in each alternative path at the current moment is calculated from the ratio and the data transmission load value of each computing node at the previous moment.
On the basis of the foregoing embodiment, the second configuration module includes a link congestion value calculation unit and a path congestion value calculation unit. The link congestion value calculation unit is configured to calculate the link congestion value of the data link between each computing node and its corresponding next computing node according to the data transmission load value of each computing node in each alternative path, together with the data transmission load value and the credit value used amount of the next computing node. The path congestion value calculation unit is configured to obtain the path congestion value corresponding to each alternative path according to the link congestion values of the data links between all the computing nodes in that alternative path.
On the basis of the foregoing embodiment, the path congestion value calculation unit is specifically configured to construct a path congestion value calculation formula, where the path congestion value calculation formula is:
[Formula not reproduced in this text; in the original publication the formula and its symbols appear as images.]
wherein the formula relates the following quantities: the path congestion value corresponding to the alternative path; the index of a computing node in the alternative path and the total number of computing nodes in the alternative path; the credit value used amount of the input port of the next computing node corresponding to the indexed computing node; the data transmission load value of the indexed computing node; and the data transmission load value of its corresponding next computing node.
Based on the path congestion value calculation formula, the path congestion value corresponding to each alternative path is calculated from the link congestion values of the data links between all the computing nodes in that alternative path.
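The relationship between link congestion values and the path congestion value can be illustrated with the sketch below. The exact formula is not available in this text, so the additive form and the weights `w_load` and `w_credit` are assumptions chosen to match the description of a weighted sum of node load values and link credit usage; the function names are hypothetical.

```python
from typing import Sequence


def link_congestion(load_cur: float, load_next: float, credit_used_next: float,
                    w_load: float = 1.0, w_credit: float = 1.0) -> float:
    """Assumed link congestion: weighted sum of the two node loads and the
    credit value used amount of the next node's input port."""
    return w_load * (load_cur + load_next) + w_credit * credit_used_next


def path_congestion(loads: Sequence[float], credit_used: Sequence[float],
                    w_load: float = 1.0, w_credit: float = 1.0) -> float:
    """Sum the link congestion values over consecutive node pairs in a path.

    loads[k] is the data transmission load value of the k-th node;
    credit_used[k] is the used credit of the k-th node's input port
    (the entry for the first node is ignored).
    """
    if len(loads) != len(credit_used):
        raise ValueError("loads and credit_used must have the same length")
    return sum(
        link_congestion(loads[k], loads[k + 1], credit_used[k + 1], w_load, w_credit)
        for k in range(len(loads) - 1)
    )
```

For a three-node path with loads 1, 2, 1 and input port credit usage 0, 3, 1, the two links contribute 6 and 4 under unit weights.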
On the basis of the above embodiment, the first configuration module is specifically configured to determine a plurality of data transmission paths between the first target computing node and the second target computing node; and taking the data transmission paths with the number of the computing nodes smaller than the number of the preset computing nodes in the plurality of data transmission paths as the alternative paths, and acquiring a plurality of alternative paths between the first target computing node and the second target computing node.
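A possible way to collect the alternative paths is a depth-first enumeration that discards any path reaching the preset node count. The sketch below is illustrative only: the adjacency-list topology, the function name `alternative_paths`, and the exclusion of the failed link via `failed_link` are assumptions not stated in the text.

```python
from typing import Dict, List, Optional, Tuple


def alternative_paths(graph: Dict[str, List[str]], src: str, dst: str,
                      max_nodes: int,
                      failed_link: Optional[Tuple[str, str]] = None) -> List[List[str]]:
    """Enumerate simple paths from src to dst that contain fewer than max_nodes nodes."""
    blocked = {failed_link, failed_link[::-1]} if failed_link else set()
    found: List[List[str]] = []

    def extend(path: List[str]) -> None:
        node = path[-1]
        if node == dst:
            found.append(list(path))
            return
        if len(path) + 1 >= max_nodes:
            return  # adding one more node would reach the preset node count
        for nxt in graph.get(node, []):
            if nxt in path or (node, nxt) in blocked:
                continue  # skip visited nodes and the failed link
            path.append(nxt)
            extend(path)
            path.pop()

    extend([src])
    return found
```

Bounding the node count keeps the search small and excludes long detours before any congestion value is computed.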
On the basis of the foregoing embodiment, the link switching module includes an alternative path sorting unit, configured to sort the path congestion values corresponding to the alternative paths, and determine, according to a sorting result, the alternative path with the smallest path congestion value as the target path.
On the basis of the above embodiment, if the sorting result contains a plurality of path congestion values of the same size, the alternative path sorting unit is further configured to take the alternative paths corresponding to those equal path congestion values as pending paths, obtain the number of computing nodes in each pending path, and determine the pending path with the smallest number of computing nodes as the target path.
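The selection rule, lowest path congestion value first and fewest computing nodes as the tie-breaker, can be sketched as below. The function name `select_target_path` and the list-based representation of paths are assumptions made for illustration.

```python
from typing import List, Sequence


def select_target_path(paths: Sequence[List[str]],
                       congestion: Sequence[float]) -> List[str]:
    """Return the path with the smallest congestion value; among equal
    congestion values, return the one with the fewest computing nodes."""
    if not paths or len(paths) != len(congestion):
        raise ValueError("paths and congestion must be non-empty and equal in length")
    # Tuple keys compare congestion first, then node count.
    best_path, _ = min(zip(paths, congestion), key=lambda pc: (pc[1], len(pc[0])))
    return best_path
```

Because the key is a tuple, the node count is consulted only when congestion values compare equal.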
On the basis of the foregoing embodiment, a plurality of monitoring moments are set in the preset period, and the first monitoring module is specifically configured to sum the variation of the credit value of the first target computing node and the variation of the credit value of the second target computing node at each monitoring moment, so as to obtain the first credit value variation and the second credit value variation.
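The accumulation over monitoring moments can be sketched as follows. It is assumed here, and not stated in the text, that the magnitude of each change is summed so that an increase followed by an equal decrease does not cancel to zero; `credit_variation` is a hypothetical name.

```python
from typing import Sequence


def credit_variation(samples: Sequence[int]) -> int:
    """Sum the credit value changes observed between consecutive monitoring moments.

    Assumption: absolute changes are summed, so any activity yields a non-zero total.
    """
    return sum(abs(cur - prev) for prev, cur in zip(samples, samples[1:]))
```

Applied separately to the samples of the first and second target computing nodes, this gives the first and second credit value variations.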
On the basis of the above embodiment, the apparatus further includes a monitoring frequency determining module and a monitoring mode configuration module. The monitoring frequency determining module is configured to determine a monitoring frequency type based on the hardware types of the first target computing node and the second target computing node, where the monitoring frequency type is one of a high-frequency monitoring type, a medium-frequency monitoring type, and a low-frequency monitoring type. The monitoring mode configuration module is configured to configure a corresponding plurality of monitoring moments within the preset period according to the monitoring frequency type.
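One way to turn a monitoring frequency type into monitoring moments is to space a fixed number of samples evenly over the preset period. In the sketch below the sample counts in `SAMPLES_PER_PERIOD` are purely hypothetical placeholders, as are the names `MonitoringFrequency` and `monitoring_moments`; the text does not give concrete values or a spacing rule.

```python
from enum import Enum
from typing import List


class MonitoringFrequency(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical number of monitoring moments per preset period for each type.
SAMPLES_PER_PERIOD = {
    MonitoringFrequency.HIGH: 8,
    MonitoringFrequency.MEDIUM: 4,
    MonitoringFrequency.LOW: 2,
}


def monitoring_moments(period_start: float, period_length: float,
                       freq: MonitoringFrequency) -> List[float]:
    """Return evenly spaced monitoring moments inside one preset period."""
    count = SAMPLES_PER_PERIOD[freq]
    step = period_length / count
    return [period_start + step * (k + 1) for k in range(count)]
```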
On the basis of the above embodiment, the apparatus is further configured to switch the target path to the original data link if the failure of the original data link between the first target computing node and the second target computing node has been eliminated.
On the basis of the above embodiment, the link switching module includes a node information acquiring unit and a data transmission path constructing unit, where the node information acquiring unit is configured to acquire internet protocol address information and input/output port information of each computing node in the target path; the data transmission path construction unit is used for establishing a data transmission path between the first target computing node and the second target computing node through each computing node in the target path based on the Internet protocol address information and the input/output port information.
The apparatus provided by the invention is used to execute the above method embodiments; for the specific flow and details, refer to those embodiments, which are not repeated here.
The invention also provides a computing system which comprises a plurality of computing nodes and the link switching device in each embodiment.
With the computing system provided by the invention, when the internal link switching device determines that the original data link between nodes has a fault, a plurality of alternative paths between the nodes are obtained; the path congestion value corresponding to each alternative path is then obtained according to the data transmission load value of each computing node in that path; and the target path is determined from the plurality of alternative paths by path congestion value, so that the original data link between the nodes is switched to the target path. This improves the efficiency of data link fault handling in a heterogeneous computing system with multiple computing nodes.
Fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in Fig. 6, the electronic device may include a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604. The processor 601 may invoke logic instructions in the memory 603 to perform a link switching method comprising: when determining that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by data links between a plurality of computing nodes; acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path; and determining a target path from the plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path.
Further, the logic instructions in the memory 603 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a link switching method provided by the above methods, the method comprising: when determining that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by the data links between the plurality of computing nodes; acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path; and determining a target path from a plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the link switching method provided in the above embodiments, the method comprising: when determining that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by data links between a plurality of computing nodes; acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path; and determining a target path from the plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (17)

1. A method for link switching, comprising:
when determining that an original data link between a first target computing node and a second target computing node has a fault, acquiring a plurality of alternative paths between the first target computing node and the second target computing node, wherein each alternative path is formed by the data links between the plurality of computing nodes;
acquiring a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path;
determining a target path from a plurality of alternative paths according to the path congestion value, and switching the original data link between the first target computing node and the second target computing node to the target path;
the method further comprises the steps of:
acquiring a first credit value variation and a second credit value variation, wherein the first credit value variation represents the variation of the credit value of the first target computing node in a preset period; the second credit value variation amount represents the variation amount of the credit value of the second target computing node in the preset period; the credit value is set based on the parallel calculation quantity of the calculation nodes, and comprises an input port credit value and an output port credit value;
judging the fault condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation, so as to perform a data link switching operation on the first target computing node and the second target computing node when it is determined that a fault exists;
before the path congestion value corresponding to each alternative path is obtained according to the data transmission load value of each computing node in each alternative path, the method further comprises:
acquiring, for the computing nodes in each alternative path, the data transmission load value of each computing node at the previous moment in the preset period;
acquiring the total input port credit value variation and the total output port credit value variation of each computing node between the previous moment and the current moment;
obtaining, based on the ratio between the total input port credit value variation and the total output port credit value variation, the data transmission load value of each computing node in each alternative path at the current moment from the data transmission load value of each computing node at the previous moment;
the obtaining the path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path includes:
calculating a link congestion value of the data link between each computing node and its corresponding next computing node according to the data transmission load value of each computing node in each alternative path, together with the data transmission load value and the credit value used amount of the next computing node corresponding to each computing node;
and obtaining the path congestion value corresponding to each alternative path according to the link congestion values of the data links between all the computing nodes in each alternative path.
2. The link switching method according to claim 1, wherein determining a failure condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation includes:
and judging whether the first credit value variation and the second credit value variation are preset values or not based on the preset period, and determining a fault condition according to a judging result.
3. The link switching method according to claim 2, wherein the determining, based on the preset period, whether the first credit value variation and the second credit value variation are preset values, and determining a fault condition according to a determination result, includes:
and if any credit value variation in the first credit value variation and the second credit value variation is 0 in the preset period, determining that the original data link between the first target computing node and the second target computing node has a fault.
4. A link switching method according to claim 3, characterized in that the method further comprises:
acquiring port credit values corresponding to the first target computing node and the second target computing node respectively in the preset period;
if the port credit value is smaller than the corresponding maximum credit threshold value in the preset period, judging that the original data link between the first target computing node and the second target computing node is in a non-idle state;
if the port credit value is equal to the corresponding maximum credit threshold value within the preset period, judging that the original data link between the first target computing node and the second target computing node is in an idle state;
the determining that the original data link between the first target computing node and the second target computing node has a fault if any one of the first credit value variation and the second credit value variation is 0 within the preset period includes:
and if the original data link between the first target computing node and the second target computing node is in a non-idle state within the preset period, and any credit value variation in the first credit value variation and the second credit value variation is 0, determining that the original data link between the first target computing node and the second target computing node has a fault.
5. The link switching method according to claim 1, wherein the obtaining, based on the ratio between the total amount of input port credit change and the total amount of output port credit change, the data transmission load value of each computing node in each of the alternative paths at the current time by the data transmission load value of each computing node at the previous time includes:
constructing a node transmission load value calculation formula, wherein the node transmission load value calculation formula is as follows:
[Formula not reproduced in this text; in the original publication the formula and its symbols appear as images.]
wherein the formula relates the following quantities: the data transmission load value of the computing node at the current moment; the data transmission load value of the computing node at the previous moment t; the preset time interval duration; the total credit value of the input ports when all input ports in the computing node are unused; the total credit value of the output ports when all output ports in the computing node are unused; the index of an input port and the total number of input ports in the computing node; the credit value variation of each input port within the time interval, and the total input port credit value variation of all input ports in the computing node within that interval; the index of an output port and the total number of output ports in the computing node; the credit value variation of each output port within the time interval, and the total output port credit value variation of all output ports in the computing node within that interval; the formula further specifies the value taken in a special case, the condition of which is likewise given only in the formula image;
and calculating, based on the node transmission load value calculation formula, the data transmission load value of each computing node in each alternative path at the current moment from the ratio and the data transmission load value of each computing node at the previous moment.
6. The link switching method according to claim 1, wherein the obtaining the path congestion value corresponding to each of the alternative paths according to the link congestion values of the data links between all the computing nodes in each of the alternative paths includes:
constructing a path congestion value calculation formula, wherein the path congestion value calculation formula is as follows:
[Formula not reproduced in this text; in the original publication the formula and its symbols appear as images.]
wherein the formula relates the following quantities: the path congestion value corresponding to the alternative path; the index of a computing node in the alternative path and the total number of computing nodes in the alternative path; the credit value used amount of the input port of the next computing node corresponding to the indexed computing node; the data transmission load value of the indexed computing node; and the data transmission load value of its corresponding next computing node;
and calculating, based on the path congestion value calculation formula, the path congestion value corresponding to each alternative path from the link congestion values of the data links between all the computing nodes in each alternative path.
7. The link switching method of claim 1, wherein the acquiring a plurality of alternative paths between the first target computing node and the second target computing node comprises:
determining a plurality of data transmission paths between the first target computing node and the second target computing node;
and taking the data transmission paths with the number of the computing nodes smaller than the number of the preset computing nodes in the plurality of data transmission paths as the alternative paths, and acquiring a plurality of alternative paths between the first target computing node and the second target computing node.
8. The link switching method according to claim 1, wherein said determining a target path from among a plurality of said alternative paths according to said path congestion value comprises:
and sorting the path congestion values corresponding to the alternative paths, and determining the alternative path with the smallest path congestion value as the target path according to the sorting result.
9. The link switching method according to claim 8, wherein if there are a plurality of the path congestion values of the same size in the sorting result, the method further comprises:
taking a plurality of alternative paths corresponding to the path congestion values with the same size as pending paths;
and acquiring the number of computing nodes in each pending path, and determining the pending path with the smallest number of computing nodes as the target path.
10. The link switching method according to claim 1, wherein a plurality of monitoring moments are set in the preset period, and the acquiring the first credit value variation and the second credit value variation includes:
and respectively summing the variation of the credit value of the first target computing node and the variation of the credit value of the second target computing node at each monitoring moment to obtain the first credit value variation and the second credit value variation.
11. The link switching method according to claim 10, wherein the method further comprises:
determining a monitoring frequency type based on hardware types of the first target computing node and the second target computing node, wherein the monitoring frequency type comprises a high-frequency monitoring type, a medium-frequency monitoring type and a low-frequency monitoring type;
and configuring a plurality of corresponding monitoring moments in the preset period according to the monitoring frequency type.
12. The link switching method according to any one of claims 1 to 11, wherein after said determining a target path from a plurality of said alternative paths according to said path congestion value and switching said original data link between said first target computing node and said second target computing node to said target path, said method further comprises:
switching the target path back to the original data link if the fault of the original data link between the first target computing node and the second target computing node has been eliminated.
13. The link switching method according to any one of claims 1 to 11, wherein said switching the original data link between the first target computing node and the second target computing node to the target path comprises:
acquiring Internet protocol address information and input/output port information of each computing node in the target path;
and establishing a data transmission path between the first target computing node and the second target computing node through each computing node in the target path based on the internet protocol address information and the input/output port information.
14. A link switching apparatus, comprising:
a first configuration module, configured to obtain a plurality of alternative paths between a first target computing node and a second target computing node when it is determined that an original data link between the first target computing node and the second target computing node has a failure, where each of the alternative paths is formed by data links between the plurality of computing nodes;
a second configuration module, configured to obtain a path congestion value corresponding to each alternative path according to the data transmission load value of each computing node in each alternative path;
a link switching module, configured to determine a target path from a plurality of alternative paths according to the path congestion value, and switch the original data link between the first target computing node and the second target computing node to the target path;
the apparatus is further configured to perform:
acquiring a first credit value variation and a second credit value variation, wherein the first credit value variation represents the variation of the credit value of the first target computing node in a preset period; the second credit value variation represents the variation of the credit value of the second target computing node in the preset period; the credit value is set based on the parallel computation quantity of the computing node, and comprises an input port credit value and an output port credit value;
judging the fault condition of the original data link between the first target computing node and the second target computing node according to the first credit value variation and the second credit value variation, so as to execute data link switching operation on the first target computing node and the second target computing node under the condition that the fault exists;
the apparatus further comprises a first processing module, a second processing module and a load value calculation module, wherein:
the first processing module is configured to obtain, for the computing nodes in each alternative path, the data transmission load value of each computing node at the previous moment in the preset period;
the second processing module is configured to obtain the total credit value change of the input port and the total credit value change of the output port of each computing node between the previous moment and the current moment;
the load value calculation module is configured to obtain the data transmission load value of each computing node in each alternative path at the current moment from the data transmission load value of that computing node at the previous moment, based on the ratio between the total credit value change of the input port and the total credit value change of the output port;
the second configuration module comprises a link congestion value calculation unit and a path congestion value calculation unit, wherein:
the link congestion value calculation unit is configured to calculate a link congestion value of the data link between each computing node and its corresponding next computing node according to the data transmission load value of each computing node in each alternative path and the data transmission load value and the credit value used amount of the corresponding next computing node;
the path congestion value calculation unit is configured to obtain the path congestion value corresponding to each alternative path according to the link congestion values of the data links between all computing nodes in that alternative path.
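The data flow through the modules of claim 14 can be sketched as below. The claim names the inputs of each calculation but not the formulas, so every formula here is an assumption chosen only to make the example concrete: the load update multiplies the previous load by the input/output credit change ratio, the link congestion value averages the two node loads and adds the credit value used amount of the next node, the path congestion value is the sum over links, and the target path is the one with the lowest path congestion value.

```python
def update_load(prev_load: float, in_credit_delta: float, out_credit_delta: float) -> float:
    """Current load from the previous load and the input/output credit change ratio.

    Assumed formula; the previous load is kept when the output change is zero.
    """
    if out_credit_delta == 0:
        return prev_load
    return prev_load * (in_credit_delta / out_credit_delta)


def link_congestion(load_current: float, load_next: float, credit_used_next: float) -> float:
    """Congestion of the link from a node to its next node (assumed formula)."""
    return (load_current + load_next) / 2 + credit_used_next


def path_congestion(path: list[str], load: dict[str, float],
                    credit_used: dict[str, float]) -> float:
    """Sum the link congestion values over all consecutive node pairs (assumed aggregation)."""
    return sum(
        link_congestion(load[u], load[v], credit_used[v])
        for u, v in zip(path, path[1:])
    )


def choose_target_path(paths: list[list[str]], load: dict[str, float],
                       credit_used: dict[str, float]) -> list[str]:
    """Pick the alternative path with the lowest path congestion value (assumed criterion)."""
    return min(paths, key=lambda p: path_congestion(p, load, credit_used))
```

With loads `{"A": 1.0, "B": 4.0, "C": 2.0, "D": 1.0}` and used credits `{"A": 0, "B": 3, "C": 1, "D": 0}`, the path A-C-D scores 4.0 against 8.0 for A-B-D under these assumed formulas, so A-C-D would be selected.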
15. A computing system comprising a plurality of computing nodes and the link switching apparatus of claim 14.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the link switching method of any one of claims 1 to 13 when executing the computer program.
17. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the link switching method according to any one of claims 1 to 13.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029289.5A CN116760763B (en) 2023-08-16 2023-08-16 Link switching method, device, computing system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311029289.5A CN116760763B (en) 2023-08-16 2023-08-16 Link switching method, device, computing system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116760763A (en) 2023-09-15
CN116760763B (en) 2024-01-09

Family

ID=87948148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029289.5A Active CN116760763B (en) 2023-08-16 2023-08-16 Link switching method, device, computing system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116760763B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108322406A (en) * 2017-12-28 2018-07-24 广东电网有限责任公司电力调度控制中心 A kind of SDN data plane failure restoration methods based on link performance and flow point class
CN108667727A (en) * 2018-04-27 2018-10-16 广东电网有限责任公司 network link failure processing method, device and controller
CN116112423A (en) * 2022-12-29 2023-05-12 新华三信息技术有限公司 Path determination method, device and equipment
CN116320068A (en) * 2022-09-05 2023-06-23 Oppo广东移动通信有限公司 Data transmission method and device, electronic equipment and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140040526A1 (en) * 2012-07-31 2014-02-06 Bruce J. Chang Coherent data forwarding when link congestion occurs in a multi-node coherent system

Also Published As

Publication number Publication date
CN116760763A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN111147287B (en) Network simulation method and system in SDN scene
CN113098773B (en) Data processing method, device and system
US6496941B1 (en) Network disaster recovery and analysis tool
CN102415059B (en) Bus control device
TWI389475B (en) Dynamic load balancing of fibre channel traffic
CN107547249A (en) Link switch-over method, device, SDN switch, controller and storage medium
US20030058800A1 (en) System and method for selection of redundant control path links in a multi-shelf network element
CN107872457B (en) Method and system for network operation based on network flow prediction
CA2369351A1 (en) System and method for providing error analysis and correlation in a network element
CN110809060B (en) Monitoring system and monitoring method for application server cluster
WO2014067268A1 (en) Node partition dividing method, device and server
CN109039795B (en) Cloud server resource monitoring method and system
CN113938407A (en) Data center network fault detection method and device based on in-band network telemetry system
CN104283780A (en) Method and device for establishing data transmission route
CN114500218A (en) Method and device for controlling network equipment
US8441929B1 (en) Method and system for monitoring a network link in network systems
CN116760763B (en) Link switching method, device, computing system, electronic equipment and storage medium
CN113079427B (en) ASON network service availability evaluation method based on network evolution model
WO2017101997A1 (en) Monitoring arrangement, network manager and respective methods performed thereby for enabling resource management in a data centre
Yi et al. A safe and reliable heterogeneous controller deployment approach in SDN
CN113543246B (en) Network switching method and device
Li et al. Data-driven routing optimization based on programmable data plane
Zacharis et al. Performance evaluation of topology discovery protocols in software defined networks
Qin et al. Interference and topology-aware VM live migrations in software-defined networks
CN114244692B (en) Fault rapid positioning method suitable for ultra-large scale interconnection network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant