CN112039747B - Mass computing node communication tree construction method based on fault rate prediction - Google Patents

Info

Publication number
CN112039747B
Authority
CN
China
Prior art keywords
node
target node
linked list
target
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010953328.0A
Other languages
Chinese (zh)
Other versions
CN112039747A (en)
Inventor
卢凯
戴屹钦
王睿伯
董勇
谢旻
周恩强
迟万庆
张伟
张文喆
邬会军
李佳鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202010953328.0A
Publication of CN112039747A
Application granted
Publication of CN112039747B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/44 Star or tree networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for constructing a communication tree of massive computing nodes based on failure rate prediction. The method comprises: obtaining the predicted failure probability of each target node in a target node linked list; determining the key nodes in the communication tree and the corresponding key positions in the target node linked list; adjusting the order of the target node linked list according to the predicted failure probabilities of the target nodes and the key positions in the linked list, so that nodes with a high failure probability avoid the key positions as far as possible and therefore sink toward the bottom layer of the communication tree, which yields a new target node linked list; and constructing the communication tree from the new target node linked list. By predicting node failure probabilities and reordering the target node linked list accordingly, the invention moves failing nodes toward the bottom layer of the communication tree as far as possible, which reduces the delay of state feedback and the influence of the node failure pattern on the total timeout time.

Description

Mass computing node communication tree construction method based on fault rate prediction
Technical Field
The invention relates to resource management technology for massive computing nodes in high-performance computers, and in particular to a method for constructing a communication tree of massive computing nodes based on fault rate prediction.
Background
Currently, the large pool of computing nodes in a high-performance computer is managed by a single control node. While the system is running, the control node must monitor and record the real-time state of every computing node to support work such as task allocation. It does so by continuously generating requests to send messages to computing nodes (called message sending requests in this patent), reading the current state of each computing node from the returned message, and updating the data structures it uses to manage the computing nodes. These message sending requests share one characteristic: the message content is identical for all targets, but the number of target nodes is usually large, and some requests target every computing node. To process a message sending request, the control node sends the message using either a star structure or a tree structure. In the star structure, the control node sends the message directly to every target computing node. In the tree structure, the control node and the computing nodes jointly build a communication tree that carries out the sending and receiving of messages. In general, tree-structured sending places less load on the control node than star-structured sending.
Tree-structured sending depends on how the communication tree is constructed. As shown in fig. 1, the control node divides the target nodes into groups, and the number of groups is the communication tree width (3 in fig. 1). The control node communicates only with the first target node of each group, and that node forwards the message to the other nodes in its group, again following the tree structure, so that a communication tree is formed. When a single control node monitors and manages computing node state by generating and processing message sending requests in this way, two problems arise on high-performance computers with very large numbers of computing nodes.
First, as the node scale grows, state feedback becomes less timely, and the delay between the node state held by the control node and the actual node state increases.
During tree-structured sending, if a socket connection fails or a target node fails, the sender of a message may never receive a response. Because the sender cannot wait forever, an upper limit on the waiting time must be set; if no response arrives from the target node within that limit, the node is judged to have failed. This upper limit is called the timeout time in this patent. In this application, the receiver is called a failed node whether the problem is a socket connection failure or a failure of the target node itself, so the timeout time can be regarded as the extra communication time overhead caused by a failed node. Failures fall into two categories:
Connection failure. When a target node has a connection failure, the sending node cannot establish a socket connection with the receiving node. The connection timeout time is determined by operating system kernel settings, is roughly 3-10 seconds, and does not depend on where the failed node sits in the communication tree.
Reception failure. When a target node has a reception failure, the sending node can establish a socket connection with the receiving node, but the receiving node cannot return a message. The sending node keeps waiting for the return message until the timeout time is reached, and only then judges the receiving node to have failed. This timeout is roughly 10-30 seconds and depends on where the receiving node sits in the communication tree: the closer the failed node is to the root node, the longer the sending node must wait. The reason is that a node's timeout must also allow for the time the node spends forwarding the message downward; otherwise a healthy node whose lower-layer nodes fail in large numbers could respond so slowly that the upper-layer node mistakes it for a failed node. In the current code implementation, let m be the number of other nodes in the node's group (that is, the number of lower-layer target nodes the node must forward to), let n be the tree width, and let MESSAGE_TIMEOUT be the general timeout set by the system. The reception timeout is then (2 × [(m + 1)/n] + 1) × MESSAGE_TIMEOUT.
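The reception timeout formula above can be evaluated with a few lines of Python. This is a minimal sketch, not the system's actual code: the function and parameter names are hypothetical, and it assumes that the square brackets in the formula denote integer division.

```python
def reception_timeout(m: int, n: int, message_timeout: float) -> float:
    """Reception timeout, in seconds, for a node that forwards to m lower-layer nodes.

    m: number of other nodes in the node's group
    n: communication tree width
    message_timeout: the general timeout set by the system, in seconds
    """
    if n <= 0:
        raise ValueError("tree width must be positive")
    # Assumption: [x] in the formula is read as integer division.
    steps = (m + 1) // n
    return (2 * steps + 1) * message_timeout


# A node with nothing to forward waits one base timeout;
# a node heading a large group waits much longer.
for m in (0, 99, 999):
    print(m, reception_timeout(m, n=50, message_timeout=10.0))
```

Under these assumptions the timeout grows linearly with the size of the group a node heads, which is why a reception failure near the root is so costly.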
The total extra communication overhead caused by node failures is called the total timeout time. It depends on the individual connection and reception timeout times, but because the tree structure works concurrently, it is not simply their sum.
The control node judges the current state of each computing node from the return messages it receives and updates its node-management data structures accordingly. The time it spends waiting for return messages therefore translates directly into a delay between the node state it holds and the actual node state. Since the total timeout time is a major component of that waiting time, its length directly affects this delay.
A larger node scale leads to a larger total timeout time, and hence to more time spent by the control node waiting for return messages, for two reasons. First, the total number of failed nodes increases, and since each failed node contributes some timeout time, the total timeout time inevitably grows. Second, the larger total number of nodes in the communication tree increases the depth of the tree. This strongly affects the reception timeout: once a node with a reception failure appears in an upper layer, it incurs a large timeout time, which increases the total timeout time.
The increase in total timeout time ultimately increases the delay between the node state held by the control node and the actual node state. A simulation experiment shows how the timeout varies with node scale. In the experiment, the node scale is set to ten thousand, one hundred thousand, and one million; connection failures and reception failures are considered separately; the connection timeout time and the reception timeout time are set to 3 seconds and 10 seconds respectively; the failure rate is 0.1; and failed nodes appear at random. The results are shown in table 1.
Table 1: simulation experiment 1 results
[Table 1 is provided as an image in the original publication: Figure BDA0002677767940000031]
At a tree width of 100, when the node scale grows from ten thousand to one hundred thousand, the connection timeout increases by about 5 seconds on average and the reception timeout by about 34 seconds on average. At a tree width of 300, when the node scale grows from one hundred thousand to one million, the connection timeout increases by about 4 seconds on average and the reception timeout by about 48 seconds on average. This confirms that a larger node scale increases both timeout times, and therefore increases the state feedback delay.
Second, all else being equal, a pattern of consecutive node failures results in a larger total timeout time.
While the system is running, failed nodes do not always appear at random. In tree-structured sending, for the same failure rate and the same total number of target nodes, failed nodes that cluster together increase the total timeout time considerably compared with failed nodes that appear at random. This is because, after finding a failed node, the sending node always moves on to the next node in the group in order and may therefore hit failed nodes repeatedly, which lengthens the total communication time. The specific reasons are as follows:
First, the concurrency of message sending is reduced. Failed nodes tend to concentrate in a few groups. Even if the other groups contain few failed nodes and obtain their return messages quickly, they must wait for the groups with more failed nodes before everything can be returned together. If the failed nodes were spread more evenly across groups, the timeout time could be hidden by the concurrency among groups. This effect applies to both the connection timeout time and the reception timeout time.
Second, consecutive failed nodes occupy the same position in the communication tree one after another, which multiplies the reception timeout. If the failed nodes are relatively scattered, by contrast, some of them move down to deeper layers and incur a smaller reception timeout. This effect applies only to the reception timeout.
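The first effect, loss of concurrency, can be illustrated with a deliberately simplified model. The sketch below is not the simulation used in the experiments: it assumes that groups run fully in parallel, that each failed node in a group costs one fixed timeout, and that timeouts within a group add up. The function name and the example numbers are hypothetical, and the position-dependent reception timeout is ignored.

```python
def total_timeout(failures_per_group: list[int], per_failure_timeout: float) -> float:
    """Extra waiting time when groups run concurrently.

    The slowest group determines the total, so the result is a maximum
    over groups rather than a sum over all failed nodes.
    """
    return max(failures_per_group, default=0) * per_failure_timeout


spread = [1, 1, 1, 1]     # four failed nodes, one in each group
clustered = [4, 0, 0, 0]  # the same four failed nodes, all in one group

print(total_timeout(spread, 3.0))
print(total_timeout(clustered, 3.0))
```

With the same number of failed nodes, the clustered case waits four times as long in this model, because the three healthy groups cannot hide the timeouts accumulated in the fourth.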
More seriously, the influence of the consecutive failure pattern on the total timeout time is more pronounced at large node scales. This effect can also be seen experimentally. In the experiment, the number of nodes is ten thousand, one hundred thousand, and one million, with tree widths of 100, 350, and 1000 respectively, and the failure rate is 0.1. The results are shown in table 2.
Table 2: simulation experiment 2 results
[Table 2 is provided as an image in the original publication: Figure BDA0002677767940000032]
The experimental results show that, as the node scale grows from ten thousand to one million at the same failure rate, the consecutive failure pattern causes a larger total timeout time. The trend is especially clear for the reception timeout and becomes more pronounced as the node scale grows. The simulation thus verifies that consecutive failed nodes lead to a larger total timeout time and that the effect is stronger at large node scales.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a method for constructing a communication tree of massive computing nodes based on fault rate prediction.
To solve the above technical problem, the invention adopts the following technical solution:
A method for constructing a communication tree of massive computing nodes based on fault rate prediction, comprising the following steps:
1) obtaining the predicted failure probability of each target node in a target node linked list;
2) determining the key nodes in the communication tree, and determining the key positions in the target node linked list according to the one-to-one correspondence between the target node linked list and the communication tree;
3) adjusting the order of the target node linked list according to the predicted failure probabilities of the target nodes and the key positions in the target node linked list, so that nodes with a high failure probability avoid the key positions of the linked list as far as possible and therefore sink toward the bottom layer of the communication tree, and obtaining a new target node linked list after the adjustment;
4) constructing the communication tree using the new target node linked list.
Optionally, step 1) comprises: sending an ICMP request to each target node in the target node linked list, calculating the current response time of each target node, calculating the time difference between the current response time and the recorded last normal response time, and assigning each target node the predicted failure probability of the preset level whose value range contains that time difference.
Optionally, step 1) comprises: sending an ICMP request to each target node in the target node linked list, judging from each target node's response whether a failure has occurred, and assigning each target node the predicted failure probability of a preset level according to the number of failures it has experienced within the most recent specified period.
Optionally, step 2) comprises:
2.1) determining the communication tree width and the number of target nodes;
2.2) judging whether the number of target nodes is no larger than the communication tree width; if so, only a one-layer communication tree can be formed and there are no key nodes, so a key array is generated in which each element corresponds to one target node and all elements are 0, and execution jumps to step 2.4);
2.3) judging whether the number of target nodes is enough to form a communication tree of more than two layers; if so, the communication tree has three or more layers, the key nodes are the first-layer and second-layer nodes of the communication tree, and the first-layer nodes are more critical than the second-layer nodes; a key array is generated in which each element corresponds to one target node, elements corresponding to first-layer nodes have the value 1, elements corresponding to second-layer nodes have the value 2, and elements corresponding to nodes in other layers have the value 3, and execution jumps to step 2.4); otherwise, the number of target nodes exceeds the communication tree width but only a two-layer communication tree can be formed, the key nodes are the first-layer nodes of the communication tree, and a key array is generated in which each element corresponds to one target node, elements corresponding to first-layer nodes have the value 1, and elements corresponding to second-layer nodes have the value 3;
2.4) outputting the generated key array, which contains the key position information for the target node linked list.
Optionally, the value of any element in the key array is 0, 1, 2, or 3.
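Steps 2.1) to 2.4) can be sketched in Python as follows. The text does not spell out how list positions map to tree positions, so this sketch makes an assumption: each group is a contiguous, near-equal span of the list, the head of a span receives from the sender, and the rest of the span is split again in the same way. The function names `node_layers` and `key_array` are hypothetical.

```python
def node_layers(count: int, width: int) -> list[int]:
    """Layer of each list position (1 = contacted directly by the control node)."""
    layers = [0] * count
    pending = [(0, count, 1)]  # (start index, span length, layer of the receivers)
    while pending:
        start, length, layer = pending.pop()
        if length <= width:
            # Few enough nodes: the sender contacts all of them directly.
            for i in range(start, start + length):
                layers[i] = layer
            continue
        base, extra = divmod(length, width)
        pos = start
        for group in range(width):
            size = base + (1 if group < extra else 0)
            layers[pos] = layer  # the head of the group receives from the sender
            if size > 1:
                pending.append((pos + 1, size - 1, layer + 1))
            pos += size
    return layers


def key_array(count: int, width: int) -> list[int]:
    """Key array with values 0, 1, 2, 3 as described in steps 2.2) and 2.3)."""
    if count <= width:
        return [0] * count  # one layer only, no key nodes
    layers = node_layers(count, width)
    three_or_more = max(layers) > 2
    result = []
    for layer in layers:
        if layer == 1:
            result.append(1)
        elif layer == 2 and three_or_more:
            result.append(2)
        else:
            result.append(3)
    return result


print(key_array(3, 3))  # one layer
print(key_array(6, 2))  # two layers
print(key_array(7, 2))  # three layers
```

Under this grouping assumption, the number of elements with the value 1 always equals the tree width once the node count exceeds it, which matches the size of the first-layer list used in step 3).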
Optionally, step 3) comprises:
3.1) obtaining the key array containing the key position information for the target node linked list, the original target node linked list, and the predicted failure rate of each target node;
3.2) judging whether the number of target nodes is larger than the communication tree width; if so, executing the next step; otherwise, assigning the original target node linked list to the new target node linked list new_hostlist without changing its order, and jumping to step 3.7);
3.3) splitting the original target node linked list into two parts, linked list hh1 and linked list hh2, according to the predicted failure probability of each target node, wherein the number of target nodes stored in hh1 equals the set first-layer width, hh2 stores all remaining target nodes, the predicted failure probability grade of any target node stored in hh1 is not less than that of any target node stored in hh2, and, for any two target nodes stored in the same list (hh1 or hh2), the grade of the node nearer the head of the list is not less than the grade of the node nearer the tail, so that hh1 and hh2 together carry all the input information of the original target node linked list and the node failure rates;
3.4) judging whether the number of target nodes is enough to form a communication tree of more than two layers; if so, jumping to step 3.5); otherwise, jumping to step 3.6);
3.5) traversing the key array: each time an element with the value 1 is encountered, removing a target node from the head of hh1 and appending it to the tail of the new target node linked list new_hostlist; each time an element with the value 2 is encountered, removing a target node from the head of hh2 and appending it to the tail of new_hostlist; each time an element with the value 3 is encountered, removing a target node from the tail of hh1 and appending it to the tail of new_hostlist; after the traversal, jumping to step 3.7);
3.6) traversing the key array: each time an element with the value 1 is encountered, removing a target node from the head of hh1 and appending it to the tail of new_hostlist; each time an element with the value 3 is encountered, removing a target node from the head of hh2 and appending it to the tail of new_hostlist; after the traversal, jumping to step 3.7);
3.7) outputting the new target node linked list new_hostlist.
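The reordering in steps 3.1) to 3.7) can be sketched as below. This is an illustration under two stated assumptions, not the patented implementation. First, both lists are ordered with the least failure-prone nodes at the head (level 0 meaning least likely to fail), so that such nodes land on the key positions as the method intends. Second, in the three-layer case, nodes for value-3 positions are taken from the tail of hh2, because hh1 holds exactly one node per first-layer position. The function name `reorder_hosts` is hypothetical, and the key array is passed in as an input.

```python
from collections import deque


def reorder_hosts(hosts, levels, key, width):
    """Return new_hostlist: hosts reordered so risky nodes avoid key positions."""
    if len(hosts) <= width:
        return list(hosts)  # step 3.2: one layer, keep the original order
    # Step 3.3: stable sort, least failure-prone first (assumed direction).
    ranked = [h for _, h in sorted(zip(levels, hosts), key=lambda pair: pair[0])]
    hh1 = deque(ranked[:width])
    hh2 = deque(ranked[width:])
    three_or_more = 2 in key  # step 3.4
    new_hostlist = []
    for value in key:
        if value == 1:
            new_hostlist.append(hh1.popleft())
        elif value == 2:
            new_hostlist.append(hh2.popleft())
        elif three_or_more:
            new_hostlist.append(hh2.pop())  # step 3.5, value 3 (assumed: tail of hh2)
        else:
            new_hostlist.append(hh2.popleft())  # step 3.6, value 3
    return new_hostlist


hosts = ["n0", "n1", "n2", "n3", "n4", "n5", "n6"]
levels = [3, 2, 1, 0, 0, 1, 2]  # predicted failure level per host
key = [1, 2, 3, 2, 1, 2, 2]     # key array for a three-layer tree of width 2
print(reorder_hosts(hosts, levels, key, width=2))
```

In this example the two level-0 hosts take the first-layer positions, and the level-3 host n0 ends up on the only third-layer position.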
In addition, the invention also provides a system for constructing the massive computing node communication tree based on the failure rate prediction, which comprises a computer device, wherein the computer device is programmed or configured to execute the steps of the method for constructing the massive computing node communication tree based on the failure rate prediction.
In addition, the invention also provides a system for constructing the massive computing node communication tree based on the failure rate prediction, which comprises a computer device, wherein a computer program which is programmed or configured to execute the method for constructing the massive computing node communication tree based on the failure rate prediction is stored in a memory of the computer device.
In addition, the invention also provides a high-performance computer system, which comprises a control node and a plurality of computing nodes, wherein the control node is programmed or configured to execute the steps of the method for constructing the communication tree of the mass computing nodes based on the fault rate prediction, or a computer program which is programmed or configured to execute the method for constructing the communication tree of the mass computing nodes based on the fault rate prediction is stored in a memory of the control node.
Furthermore, the present invention also provides a computer-readable storage medium having stored therein a computer program programmed or configured to execute the method for constructing a communication tree of a large number of computing nodes based on failure rate prediction.
Compared with the prior art, the invention has the following advantages. The method obtains the predicted failure probability of each target node in a target node linked list; determines the key nodes in the communication tree and, according to the one-to-one correspondence between the target node linked list and the communication tree, the key positions in the linked list; adjusts the order of the target node linked list according to the predicted failure probabilities and the key positions, so that nodes with a high failure probability avoid the key positions as far as possible and therefore sink toward the bottom layer of the communication tree, which yields a new target node linked list; and constructs the communication tree from the new target node linked list. By predicting node failure probabilities and reordering the target node linked list accordingly, the invention moves failing nodes toward the bottom layer of the communication tree as far as possible, which reduces the delay of state feedback and the influence of the node failure pattern on the total timeout time.
Drawings
Fig. 1 is a schematic diagram of a prior-art communication tree.
Fig. 2 is a schematic diagram of the basic flow of the method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the basic flow of step 2) in the embodiment of the present invention.
Fig. 4 shows the positions of the key nodes in the communication tree and the corresponding positions in the target node linked list when the communication tree has only one layer in the embodiment of the present invention.
Fig. 5 shows the positions of the key nodes in the communication tree and the corresponding positions in the target node linked list when the communication tree has only two layers in the embodiment of the present invention.
Fig. 6 shows the positions of the key nodes in the communication tree and the corresponding positions in the target node linked list when the communication tree has more than two layers in the embodiment of the present invention.
Fig. 7 is a schematic diagram of the basic flow of step 3) in the embodiment of the present invention.
Detailed Description
As shown in fig. 2, the method for constructing a communication tree of massive computing nodes based on failure rate prediction in this embodiment comprises:
1) obtaining the predicted failure probability of each target node in a target node linked list;
2) determining the key nodes in the communication tree, and determining the key positions in the target node linked list according to the one-to-one correspondence between the target node linked list and the communication tree;
3) adjusting the order of the target node linked list according to the predicted failure probabilities of the target nodes and the key positions in the target node linked list, so that nodes with a high failure probability avoid the key positions of the linked list as far as possible and therefore sink toward the bottom layer of the communication tree, and obtaining a new target node linked list after the adjustment;
4) constructing the communication tree using the new target node linked list.
Because node failure probability can be predicted in various ways, this embodiment implements the prediction as an independent function, which makes future optimization easier. As an optional implementation, the function outputs a 16-bit integer array that stores the predicted failure probability of each target node in the target node linked list. The predicted failure probability is expressed as a failure level; this embodiment uses four levels (0-3), where 0 means least likely to fail and 3 means most likely to fail.
The predicted node failure probability may be obtained by probing the nodes.
For example, in one optional embodiment, step 1) comprises: sending an ICMP request to each target node in the target node linked list, calculating the current response time of each target node, calculating the time difference between the current response time and the recorded last normal response time, and assigning each target node the predicted failure probability of the preset level whose value range contains that time difference. This method is inspired by the ping mechanism in the current code: the failure probability is divided into three levels according to the time difference between the node's last_response time (the last normal response time) and the current time, and the larger the time difference, the higher the failure probability.
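The mapping from time difference to failure level might look like the sketch below. It does not send ICMP requests; it takes the recorded last normal response times as input. The thresholds, the function names, and the choice to number the three levels 0 to 2 are all hypothetical, since the text gives only the rule that a larger time difference means a higher failure probability. The result is stored in an unsigned 16-bit array, following the 16-bit integer array described for this embodiment.

```python
from array import array

# Hypothetical thresholds, in seconds, separating the three levels.
LEVEL_THRESHOLDS = (30.0, 300.0)


def level_from_gap(gap_seconds: float) -> int:
    """Map the time since the last normal response to a failure level (0 to 2)."""
    level = 0
    for threshold in LEVEL_THRESHOLDS:
        if gap_seconds > threshold:
            level += 1
    return level


def predict_levels(now: float, last_response_times: list[float]) -> array:
    """Return one failure level per target node, in linked-list order."""
    return array("H", (level_from_gap(now - t) for t in last_response_times))


# Three nodes that last responded 5 s, 100 s, and 1000 s ago.
print(list(predict_levels(1000.0, [995.0, 900.0, 0.0])))
```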
For example, in another optional embodiment, step 1) comprises: sending an ICMP request to each target node in the target node linked list, judging from each target node's response whether a failure has occurred, and assigning each target node the predicted failure probability of a preset level according to the number of failures it has experienced within the most recent specified period. This method estimates the failure probability from the node's recent state changes: a node that has failed more often recently is more likely to fail in the near future.
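A count-based variant can be sketched in the same style. The window length, the function name, and the rule that the level equals the number of recent failures capped at 3 are hypothetical choices for illustration; the text only requires that more recent failures map to a higher level.

```python
def level_from_recent_failures(failure_times: list[float], now: float,
                               window: float = 3600.0) -> int:
    """Failure level (0 to 3) from the number of failures in the last `window` seconds."""
    recent = sum(1 for t in failure_times if now - window <= t <= now)
    # Hypothetical mapping: one level per recent failure, capped at 3.
    return min(recent, 3)


# Failures recorded at these timestamps; only the last three fall in the window.
print(level_from_recent_failures([100.0, 3500.0, 3900.0, 3950.0], now=4000.0, window=600.0))
```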
Step 2) determines the key nodes in the communication tree and, from the one-to-one correspondence between the target node linked list and the communication tree, the key positions in the target node linked list. Because grouping preserves the relative order of target nodes between each sub-list and the original linked list, a target node's position in the target node linked list directly determines its position in the communication tree. In other words, once the key nodes in the communication tree are known, their positions in the target node linked list can be inferred. A key position in the communication tree is one where a node failure has a large impact on the system.
As shown in fig. 3, the step of step 2) in this embodiment includes:
2.1) determining the communication tree width and the number of target nodes;
2.2) judging whether the number of target nodes is larger than the communication tree width; if not, judging that only a one-layer communication tree can be formed and that there are no key nodes, and generating a key array (import array) in which each element corresponds to one target node and all elements are 0; skipping to step 2.4); otherwise, executing the next step;
2.3) judging whether the number of target nodes is enough to form a communication tree of more than two layers (that is, three or more layers); if so, the key nodes are the first-layer nodes and the second-layer nodes of the communication tree, the first-layer nodes being more critical than the second-layer nodes, and a key array is generated in which each element corresponds to one target node, the elements corresponding to first-layer nodes have the value 1, the elements corresponding to second-layer nodes have the value 2, and the elements corresponding to nodes of other layers have the value 3; skipping to step 2.4); otherwise, judging that the number of target nodes exceeds the communication tree width but can only form a two-layer communication tree, in which case the key nodes are the first-layer nodes of the communication tree, and a key array is generated in which each element corresponds to one target node, the elements corresponding to first-layer nodes have the value 1, and the elements corresponding to second-layer nodes have the value 3;
2.4) outputting the generated key array, which contains the key position information of the target node linked list.
The value of any element in the key array is 0, 1, 2 or 3.
In this embodiment, the key nodes are chosen according to the following three cases: first, the communication tree has only one layer (the root node, namely the control node, is the zeroth layer), in which case there are no key nodes; second, the communication tree has only two layers, in which case the first-layer nodes are the key nodes; third, the communication tree has more than two layers, in which case the key nodes are the nodes in the first and second layers of the communication tree, the first-layer nodes being the more critical. This choice is based on the following considerations:
(1) the closer a failed node is to the root node of the communication tree, the larger the timeout it causes for the system;
(2) treating at most the first two layers of the communication tree as key nodes already brings a sufficient optimization effect, as the following simulation experiment shows. Assume a tree width of 50 (the default tree width of the current system). As can be seen from table 3, the communication tree does not exceed 5 layers even when the number of nodes reaches one million. It is therefore sufficient to consider the importance of only the first two layers of nodes.
Table 3: Number of communication tree layers.
Number of target nodes | Number of communication tree layers
1000 | 2
10000 | 3
100000 | 3
1000000 | 5
(3) Treating at most the first two layers of the communication tree as key nodes reduces the complexity of the algorithm. As can be seen from table 4 below, when the number of target nodes is large, the first-layer and second-layer nodes account for only a small portion of all target nodes, so attending only to the failure probability of these nodes greatly reduces the complexity of the algorithm. It should be noted that, since the communication tree itself has only two layers when there are 1000 target nodes, only the first-layer target nodes are counted in that case.
Table 4: Proportion of key nodes among all target nodes.
Number of target nodes | Number of key nodes | Key node proportion
1000 | 50 | 5%
10000 | 2501 | 25.1%
100000 | 2501 | 2.41%
1000000 | 2551 | 0.2551%
Figs. 4-6 show the locations of the key nodes in the communication tree and the corresponding locations in the target node linked list under the three different cases. Fig. 4 shows the case where the communication tree has only one layer, fig. 5 the case where it has only two layers, and fig. 6 the case where it has more than two layers. In figs. 4-6, the white computing nodes are ordinary nodes, and the dark gray and light gray computing nodes are nodes at key positions in the communication tree, the dark gray nodes being more critical than the light gray nodes. The one-to-one correspondence between these nodes and the positions in the target node linked list is also shown.
In a specific code implementation, this embodiment implements the functions of step 2) through a get_import function. The inputs to the get_import function are the communication tree width and the total number of target nodes, which together describe the current application scenario. The get_import function first obtains the positions of the key nodes in the communication tree for the current application scenario, and then derives the positions of the key nodes in the target node linked list from the one-to-one correspondence between positions in the communication tree and positions in the target node linked list. It finally outputs an import array that indicates how critical each position in the target node linked list is.
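The behavior described for get_import can be illustrated with a minimal Python sketch. The source does not give the function body, so the following rests on stated assumptions: the target node linked list is split into `width` contiguous, near-equal groups, the first node of each group becomes a first-layer node, and the remainder of each group is split again in the same way to find the second-layer nodes. The helper `split_ranges` and the three-layer threshold `n > width * (width + 1)` are hypothetical and follow only from that assumed grouping scheme.

```python
def split_ranges(start, stop, parts):
    """Split the index range [start, stop) into at most `parts`
    contiguous, near-equal sub-ranges (hypothetical helper)."""
    base, extra = divmod(stop - start, parts)
    ranges, pos = [], start
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((pos, pos + size))
        pos += size
    return ranges


def get_import(width, n):
    """Return the import (key) array for n target nodes.

    0 = one-layer tree (no key nodes), 1 = first-layer position,
    2 = second-layer position, 3 = any other position.
    """
    if n <= width:
        # All nodes hang directly under the root: no key nodes.
        return [0] * n

    imp = [3] * n
    # Assumption: a third layer appears once some group has more than
    # `width` nodes left after its first node is taken out.
    three_layers = n > width * (width + 1)

    for g_start, g_stop in split_ranges(0, n, width):
        imp[g_start] = 1                      # head of the group: layer 1
        if not three_layers:
            continue                          # two-layer tree: the rest stay 3
        rest_start = g_start + 1
        if g_stop - rest_start <= width:
            # The whole remainder fits directly under the layer-1 node.
            for i in range(rest_start, g_stop):
                imp[i] = 2
        else:
            # Heads of the sub-groups are the layer-2 nodes.
            for s_start, _ in split_ranges(rest_start, g_stop, width):
                imp[s_start] = 2
    return imp
```

Under these assumptions, `get_import(2, 5)` marks positions 0 and 3 as first-layer positions and every other position as 3, and the number of elements with value 1 always equals the tree width whenever the number of target nodes exceeds it.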
In this embodiment, step 3) adjusts the order of the target node linked list according to the failure probability of the target nodes and the key positions in the target node linked list, so that nodes with a high failure probability occupy the key positions of the target node linked list as rarely as possible. As shown in fig. 7, step 3) includes:
3.1) determining the key array containing the key position information of the target node linked list, the original target node linked list, and the predicted failure rate of each target node;
3.2) judging whether the number of target nodes is larger than the communication tree width; if so, executing the next step; otherwise, assigning the original target node linked list to the new target node linked list new_hostlist without changing its order, and skipping to step 3.7);
3.3) splitting the original target node linked list into two parts according to the failure prediction probability of each target node: linked list hh1 and linked list hh2, wherein the number of target nodes stored in linked list hh1 equals the set first-layer width, linked list hh2 stores the remaining target nodes, the predicted failure probability grade of any target node stored in hh1 is not less than that of any target node stored in hh2, and, for any two target nodes stored in the same linked list (hh1 or hh2), the predicted failure probability grade of the node closer to the head is not less than that of the node closer to the tail, so that hh1 and hh2 together carry all the input information of the original target node linked list and the node failure rates;
3.4) judging whether the number of target nodes is enough to form a communication tree of more than two layers (that is, three or more layers); if so, skipping to step 3.5); otherwise, skipping to step 3.6);
3.5) traversing the key array: each time an element with the value 1 is encountered, a target node is taken from the head of linked list hh1 and appended to the tail of the new target node linked list new_hostlist; each time an element with the value 2 is encountered, a target node is taken from the head of linked list hh2 and appended to the tail of new_hostlist; each time an element with the value 3 is encountered, a target node is taken from the tail of linked list hh2 and appended to the tail of new_hostlist; after the traversal, skipping to step 3.7);
3.6) traversing the key array: each time an element with the value 1 is encountered, a target node is taken from the head of linked list hh1 and appended to the tail of the new target node linked list new_hostlist; each time an element with the value 3 is encountered, a target node is taken from the head of linked list hh2 and appended to the tail of new_hostlist; after the traversal, skipping to step 3.7);
3.7) outputting the new target node linked list new_hostlist.
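The split in step 3.3) can be sketched in Python as follows. This is only an illustration under assumptions that the source does not state explicitly: each node carries a numeric `risk` value in which a smaller number means a lower predicted failure probability, and hh1 receives the `width` most reliable nodes, ordered most reliable first, in line with the stated goal of keeping failure-prone nodes away from first-layer positions. How this maps onto the "grade" ordering of the text is an assumption, and the name `split_by_risk` is hypothetical.

```python
def split_by_risk(hostlist, risk, width):
    """Split hostlist into (hh1, hh2).

    hostlist: node names in their original linked-list order.
    risk:     dict mapping node name -> predicted failure level
              (assumption: smaller means more reliable).
    width:    communication tree width, i.e. the size of hh1.
    """
    # sorted() is stable, so nodes with the same risk level keep
    # their original relative order.
    ranked = sorted(hostlist, key=lambda host: risk[host])
    hh1 = ranked[:width]   # most reliable nodes, for first-layer positions
    hh2 = ranked[width:]   # all remaining nodes, most reliable first
    return hh1, hh2
```

For example, with nodes `a` to `e` at risk levels 2, 0, 1, 0 and 3 and a width of 2, hh1 holds `b` and `d`, and hh2 holds `c`, `a` and `e` in that order.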
As shown in fig. 7, step 3) has three inputs: the import array obtained in the previous part, the original target node linked list, and the predicted failure rate of each target node. It has a single output, the new target node linked list new_hostlist, which contains the same nodes as the original target node linked list but in a different order. The main process is as follows: when the communication tree has only one layer, this embodiment does not change the order of the nodes in the target node linked list. In the other cases, the original target node linked list is first split into two parts according to the failure prediction probability of each target node: linked list hh1 and linked list hh2.
In step 2) of this embodiment, an import array is obtained in which each element indicates the importance of the corresponding position in the target node linked list. The order of the nodes in the target node linked list can be adjusted using the split linked lists hh1 and hh2 together with the import array. The method is to traverse the import array and, according to the value of each element, select the appropriate node from hh1 or hh2 and add it to new_hostlist. The processing can be divided into the following three cases:
(1) When the number of target nodes is less than the communication tree width, no processing is performed, and all elements in the import array are 0. In this case all target nodes are in the same layer, and changing their positions in the target node linked list does nothing to reduce the total timeout.
(2) When the number of target nodes is enough to form a two-layer communication tree but not a three-layer one, it is only necessary to ensure that the first-layer target nodes have a low failure probability. The elements in the import array then take two values: 1 means that the corresponding node is a first-layer node of the communication tree, and 3 means that the corresponding node is a node of another (non-first) layer. Therefore, each time a position marked as a first-layer node in the import array is encountered, the next target node is taken from the head of linked list hh1 and appended to the tail of the new_hostlist linked list. Each time a position marked as a non-first-layer node is encountered, a target node is taken from the head of linked list hh2 and appended to the tail of the new_hostlist linked list.
(3) When the number of target nodes is enough to form a communication tree of three layers, it is necessary to ensure that the first-layer nodes have the lowest failure probability level, that the second-layer nodes have the next lowest, and that the nodes with higher failure probability are all located in the other layers. The elements in the import array then take three values: 1 means that the corresponding node is a first-layer node of the communication tree, 2 means that it is a second-layer node, and 3 means that it is a node of another layer (neither the first nor the second). Therefore, each time a position marked as a first-layer node in the import array is encountered, the next target node is taken from the head of linked list hh1 and appended to the tail of the new_hostlist linked list. Each time a position marked as a second-layer node is encountered, the next target node is taken from the head of linked list hh2 and appended to the tail of the new_hostlist linked list. Each time a position marked as belonging to another layer is encountered, the next target node is taken from the tail of linked list hh2 and appended to the tail of the new_hostlist linked list. After these operations are completed, new_hostlist should contain the same number of target nodes as the original target node linked list, differing only in order. Finally, in this embodiment, new_hostlist replaces the original target node linked list.
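The traversal described for the two-layer and three-layer cases can be sketched in Python as below. It is a minimal illustration rather than the embodiment's actual code: it assumes hh1 and hh2 are already ordered most reliable first, detects the three-layer case by the presence of a 2 in the import array, and uses `collections.deque` in place of linked lists. The function name `reorder` is hypothetical. The one-layer case (all elements 0) is not handled here, because the original order is kept unchanged in that case.

```python
from collections import deque


def reorder(import_arr, hh1, hh2):
    """Build new_hostlist by traversing the import array.

    1 -> take from the head of hh1
    2 -> take from the head of hh2
    3 -> take from the tail of hh2 (three-layer case)
         or from the head of hh2 (two-layer case)
    """
    q1, q2 = deque(hh1), deque(hh2)
    three_layers = 2 in import_arr
    new_hostlist = []
    for value in import_arr:
        if value == 1:
            new_hostlist.append(q1.popleft())
        elif value == 2:
            new_hostlist.append(q2.popleft())
        elif value == 3:
            # In the three-layer case the least reliable remaining node
            # goes to a non-key position.
            new_hostlist.append(q2.pop() if three_layers else q2.popleft())
        else:
            raise ValueError("import array must contain only 1, 2 or 3 here")
    return new_hostlist
```

With the import array `[1, 2, 3, 2, 1, 2, 2]`, hh1 `["a", "b"]` and hh2 `["c", "d", "e", "f", "g"]`, the least reliable node `g` lands at the only position marked 3, while `a` and `b` take the two first-layer positions.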
Step 4) uses the new_hostlist linked list generated in step 3) as the adjusted target node linked list. Only the original target node linked list is replaced by new_hostlist; the construction method of the communication tree itself is not changed. The final result is that the new target node linked list keeps the failure probability of the nodes at key positions in the constructed communication tree as low as possible, while the nodes with a high failure probability sink to the bottom layer of the communication tree as far as possible.
Referring to the background art, the fundamental way to address the increased state feedback delay is to increase concurrency, so that the detection times of multiple failed nodes overlap and the delay is hidden. This can be achieved by optimizing the construction of the communication tree so that potentially faulty nodes appear as far as possible in the last layer of the communication tree. Sinking a failed node to the bottom of the communication tree can significantly reduce the receive timeout caused by that node; for example, if the failed node is a leaf node, it causes a receive timeout of only 1 MESSAGE TIMEOUT, and the resulting reduction in the total timeout reduces the state feedback delay. In addition, when failed nodes appear consecutively in the target node linked list, predicting the failure probability of the nodes in advance allows the linked list to be adjusted so that the failed nodes no longer appear consecutively but are distributed over different positions of the target node linked list. This avoids repeatedly placing failed nodes in the same group for sending and mitigates, to some extent, the influence of the failure pattern on the total timeout. The method for constructing a massive computing node communication tree based on failure rate prediction predicts the failure probability of the nodes, adjusts the node order in the target node linked list according to the predicted failure rate, and moves failed nodes to the bottom layer of the communication tree as far as possible, thereby reducing the state feedback delay and the influence of the node failure pattern on the total timeout.
In addition, the present embodiment also provides a system for constructing a massive computing node communication tree based on failure rate prediction, which includes a computer device programmed or configured to execute the steps of the aforementioned method for constructing a massive computing node communication tree based on failure rate prediction.
In addition, the embodiment also provides a system for constructing a massive computing node communication tree based on failure rate prediction, which includes a computer device, and a memory of the computer device stores a computer program programmed or configured to execute the method for constructing the massive computing node communication tree based on failure rate prediction.
In addition, the present embodiment also provides a high-performance computer system, which includes a control node and a plurality of computing nodes, where the control node is programmed or configured to execute the steps of the aforementioned method for building a communication tree of mass computing nodes based on failure rate prediction, or a memory of the control node stores therein a computer program programmed or configured to execute the aforementioned method for building a communication tree of mass computing nodes based on failure rate prediction.
Furthermore, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured to execute the aforementioned method for constructing a communication tree of mass computing nodes based on failure rate prediction.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application, wherein instructions executed by a processor of a computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (9)

1. A method for constructing a massive computing node communication tree based on fault rate prediction is characterized by comprising the following steps:
1) acquiring the fault prediction probability of each target node in a target node linked list;
2) determining key nodes in the communication tree, and determining key positions in a target node linked list according to the one-to-one correspondence of the target node linked list and the communication tree;
3) adjusting the order of the target node linked list according to the failure probability of the target nodes and the key positions in the target node linked list, so that nodes with a high failure probability occupy the key positions of the target node linked list as rarely as possible and sink to the bottom layer of the communication tree as far as possible, and obtaining a new target node linked list after the adjustment;
4) constructing a communication tree by using the new target node linked list;
the step 3) comprises the following steps:
3.1) determining the key array containing the key position information of the target node linked list, the original target node linked list, and the predicted failure rate of each target node;
3.2) judging whether the number of target nodes is larger than the communication tree width; if so, executing the next step; otherwise, assigning the original target node linked list to the new target node linked list new_hostlist without changing its order, and skipping to step 3.7);
3.3) splitting the original target node linked list into two parts according to the failure prediction probability of each target node: linked list hh1 and linked list hh2, wherein the number of target nodes stored in linked list hh1 equals the set first-layer width, linked list hh2 stores the remaining target nodes, the predicted failure probability grade of any target node stored in hh1 is not less than that of any target node stored in hh2, and, for any two target nodes stored in the same linked list (hh1 or hh2), the predicted failure probability grade of the node closer to the head is not less than that of the node closer to the tail, so that hh1 and hh2 together carry all the input information of the original target node linked list and the node failure rates;
3.4) judging whether the number of target nodes is enough to form a communication tree of more than two layers (that is, three or more layers); if so, skipping to step 3.5); otherwise, skipping to step 3.6);
3.5) traversing the key array: each time an element with the value 1 is encountered, a target node is taken from the head of linked list hh1 and appended to the tail of the new target node linked list new_hostlist; each time an element with the value 2 is encountered, a target node is taken from the head of linked list hh2 and appended to the tail of new_hostlist; each time an element with the value 3 is encountered, a target node is taken from the tail of linked list hh2 and appended to the tail of new_hostlist; after the traversal, skipping to step 3.7);
3.6) traversing the key array: each time an element with the value 1 is encountered, a target node is taken from the head of linked list hh1 and appended to the tail of the new target node linked list new_hostlist; each time an element with the value 3 is encountered, a target node is taken from the head of linked list hh2 and appended to the tail of new_hostlist; after the traversal, skipping to step 3.7);
3.7) outputting the new target node linked list new_hostlist.
2. The method for constructing the communication tree of the mass computing nodes based on the failure rate prediction as claimed in claim 1, wherein the step 1) comprises: sending an ICMP request to each target node in the target node linked list, calculating the current response time of each target node in the target node linked list, calculating the time difference between the current response time and the recorded last normal response time, and determining the fault prediction probability of each target node corresponding to the preset level according to the value range of the time difference.
3. The method for constructing the communication tree of the mass computing nodes based on the failure rate prediction as claimed in claim 1, wherein the step 1) comprises: sending an ICMP request to each target node in the target node linked list, judging whether a fault occurs according to the response of each target node in the target node linked list, and determining the fault prediction probability of each target node corresponding to a preset level according to the number of the faults occurring in the latest specified time length of each target node.
4. The method for constructing the communication tree of the mass computing nodes based on the failure rate prediction as claimed in claim 1, wherein the step 2) comprises:
2.1) determining the communication tree width and the number of target nodes;
2.2) judging whether the number of target nodes is larger than the communication tree width; if not, judging that only a one-layer communication tree can be formed and that there are no key nodes, and generating a key array in which each element corresponds to one target node and all elements are 0; skipping to step 2.4); otherwise, executing the next step;
2.3) judging whether the number of target nodes is enough to form a communication tree of more than two layers (that is, three or more layers); if so, the key nodes are the first-layer nodes and the second-layer nodes of the communication tree, the first-layer nodes being more critical than the second-layer nodes, and a key array is generated in which each element corresponds to one target node, the elements corresponding to first-layer nodes have the value 1, the elements corresponding to second-layer nodes have the value 2, and the elements corresponding to nodes of other layers have the value 3; skipping to step 2.4); otherwise, judging that the number of target nodes exceeds the communication tree width but can only form a two-layer communication tree, in which case the key nodes are the first-layer nodes of the communication tree, and a key array is generated in which each element corresponds to one target node, the elements corresponding to first-layer nodes have the value 1, and the elements corresponding to second-layer nodes have the value 3;
2.4) outputting the generated key array, which contains the key position information of the target node linked list.
5. The method for constructing the communication tree of the massive computing nodes based on the failure rate prediction according to claim 4, wherein the value of any element in the key array is 0, 1, 2 or 3.
6. A system for constructing a communication tree of mass computing nodes based on failure rate prediction, comprising a computer device, wherein the computer device is programmed or configured to execute the steps of the method for constructing a communication tree of mass computing nodes based on failure rate prediction according to any one of claims 1 to 5.
7. A system for constructing a communication tree of mass computing nodes based on failure rate prediction, comprising a computer device, wherein a computer program programmed or configured to execute the method for constructing the communication tree of mass computing nodes based on failure rate prediction according to any one of claims 1 to 5 is stored in a memory of the computer device.
8. A high performance computer system comprising a control node and a plurality of computing nodes, wherein the control node is programmed or configured to perform the steps of the method for building a communication tree of mass computing nodes based on failure rate prediction according to any one of claims 1 to 5, or wherein a memory of the control node has stored therein a computer program programmed or configured to perform the method for building a communication tree of mass computing nodes based on failure rate prediction according to any one of claims 1 to 5.
9. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program being programmed or configured to perform the method for constructing a communication tree of mass computing nodes based on failure rate prediction according to any one of claims 1 to 5.
CN202010953328.0A 2020-09-11 2020-09-11 Mass computing node communication tree construction method based on fault rate prediction Active CN112039747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010953328.0A CN112039747B (en) 2020-09-11 2020-09-11 Mass computing node communication tree construction method based on fault rate prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010953328.0A CN112039747B (en) 2020-09-11 2020-09-11 Mass computing node communication tree construction method based on fault rate prediction

Publications (2)

Publication Number Publication Date
CN112039747A CN112039747A (en) 2020-12-04
CN112039747B true CN112039747B (en) 2021-10-26

Family

ID=73588670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010953328.0A Active CN112039747B (en) 2020-09-11 2020-09-11 Mass computing node communication tree construction method based on fault rate prediction

Country Status (1)

Country Link
CN (1) CN112039747B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101672482B1 (en) * 2015-05-15 2016-11-16 세종대학교산학협력단 One-time traversal device to search modules in a fault tree for the risk analysis of safety-critical systems and method for the same
CN109919181A (en) * 2019-01-24 2019-06-21 南京航空航天大学 Dynamic fault tree quantitative analysis method based on probabilistic model checking
CN111581036A (en) * 2020-03-31 2020-08-25 西安电子科技大学 Internet of things fault detection method, detection system and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10769007B2 (en) * 2018-06-08 2020-09-08 Microsoft Technology Licensing, Llc Computing node failure and health prediction for cloud-based data center

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alleviating Network Congestion for HPC Clusters with Fat-tree Interconnection Leveraging Software-Defined Networking;Zhenwei Wu;《IEEE》;20161121;全文 *
StageFS: A Parallel File System Optimizing Metadata Performance for SSD Based Clusters;Kai Lu;《IEEE》;20160826;全文 *

Also Published As

Publication number Publication date
CN112039747A (en) 2020-12-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant