CN112000486A - Mass computing node resource monitoring and management method for high-performance computer - Google Patents

Mass computing node resource monitoring and management method for high-performance computer

Info

Publication number
CN112000486A
CN112000486A
Authority
CN
China
Prior art keywords
node
message
state
thread
intermediate node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010952582.9A
Other languages
Chinese (zh)
Other versions
CN112000486B (en)
Inventor
戴屹钦
卢凯
董勇
王睿伯
张伟
张文喆
邬会军
李佳鑫
谢旻
周恩强
迟万庆
陈娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010952582.9A priority Critical patent/CN112000486B/en
Publication of CN112000486A publication Critical patent/CN112000486A/en
Application granted granted Critical
Publication of CN112000486B publication Critical patent/CN112000486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/542 Event management; Broadcasting; Multicasting; Notifications
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/508 Monitor

Abstract

The invention discloses a mass computing node resource monitoring and management method for high-performance computers, which comprises the following steps by which a control node sends a message sending request through an intermediate node: the control node takes out a message sending request and generates a working thread for processing it; a normal intermediate node is selected through the working thread; the message sending request is forwarded to the selected intermediate node through the working thread, which then waits for the message returned by the intermediate node and proceeds to the next step after receiving it; the working thread processes the returned message, updates the states of the intermediate node and the computing nodes, and terminates. The invention adds a layer of intermediate nodes between the control node and the mass computing nodes to share the load of the control node during the monitoring and management of mass computing node resources, while also reducing the related load on the computing nodes in this process.

Description

Mass computing node resource monitoring and management method for high-performance computer
Technical Field
The invention relates to mass computing node resource management technology for high-performance computers, and in particular to a mass computing node resource monitoring and management method for high-performance computers.
Background
Currently, high-performance computers manage their massive computing node resources in a mode where a single control node controls a large number of computing nodes. During system operation, the control node needs to monitor and record the real-time state of each computing node to support tasks such as job allocation. This is mainly achieved by the control node continuously generating requests to send messages to the computing nodes (message sending requests), obtaining the current state of each computing node from its returned message, and modifying the data structure on the control node used to manage the computing nodes. These message sending requests share the characteristic that the message content is the same but the number of target nodes is often large; the target nodes of some message sending requests may even include all computing nodes. When processing a message sending request, the control node sends the message using either a star structure or a tree structure. In the star structure the control node sends messages directly to all target computing nodes, while the tree structure requires the control node and the computing nodes to jointly construct a communication tree to complete message sending and receiving. Specifically, the control node groups the target nodes, the number of groups being the communication tree width; the control node sends messages only to the first target node of each group, and that first node continues to send messages to the other nodes in its group according to the tree structure. In general, tree-structured transmission imposes less load on the control node than star-structured transmission.
Considering the functions by which a single control node monitors and manages the states of massive computing nodes through generating and processing message sending requests, it can be seen that once the node scale increases, several problems arise in the system:
First, as the node scale grows, the load on the control node related to monitoring the computing node states grows.
When processing a message sending request, the control node can send messages to the target nodes in either a star mode or a tree mode.
In the star-shaped transmission mode, the load on the control node is directly related to the number of target nodes. For a message sending request with n target nodes, the control node generates n sending threads, each directly responsible for sending a message to and receiving the reply from one target node, plus one monitoring thread that monitors the sending and receiving status of all threads under the request and updates the node state information. When the node scale expands, the same sending request has more target nodes; in the star mode this directly increases the number of sending threads and brings a larger CPU and network load.
In the tree transmission mode, the load on the control node is related to the tree width. When the node scale increases, keeping the previously set tree width unchanged greatly increases the depth of the communication tree, increases the number of message forwarding hops, lengthens the interval between the control node sending a message and receiving the reply, and reduces the timeliness of the returned messages. Therefore, to preserve system performance, the communication tree width must be increased appropriately so as to decrease the communication tree depth. Once the communication tree width increases, the number of groups obtained by grouping the target nodes on the control node increases, which increases the number of sending-related threads on the control node, proportionally increases the total number of threads on the control node, and ultimately increases the CPU and network load on the control node. This is even more evident when the control node is constantly generating a large number of message sending requests.
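As an illustrative calculation, with n target nodes and a communication tree width of w, the tree depth is on the order of log_w(n): for 10,000 target nodes, a width of 2 gives a depth of about 14 forwarding levels, while a width of 100 reduces the depth to 2 but requires the control node to open about 100 first-level sending threads per request instead of 2.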
In practice, to avoid running too many threads at the same time, the control node will typically set an upper limit (e.g., 256) on the number of such threads running concurrently, and additionally, for each sending request, an upper limit (e.g., 12) on the number of its threads allowed to be in the running state at the same time. This, however, leaves the system facing a conflict: on the one hand, if these two upper limits are left unchanged, then when the node scale increases the number of concurrent threads stays within the limits, but a large number of sending requests enter a waiting state and cannot be processed in time, seriously impairing system performance; on the other hand, once these two upper limits are raised, sending requests are processed in time but the load on the control node also increases.
Second, when the node scale increases, the load related to message forwarding on the computing nodes increases.
In the tree-shaped message sending mode, the computing nodes also group target nodes and forward messages, so the increase of the communication tree width caused by the expansion of the node scale also brings a larger CPU and network load to the computing nodes.
In summary, when the node scale increases, if the tree structure is used to transmit messages, the communication tree width cannot be increased much, otherwise a greater load is placed on both the control node and the computing nodes and system performance degrades.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a mass computing node resource monitoring and management method for high-performance computers.
In order to solve the technical problems, the invention adopts the technical scheme that:
a mass computing node resource monitoring and management method for high-performance computers comprises the following steps, by which a control node sends a message sending request through an intermediate node:
1) the control node takes out a message sending request and generates a working thread for processing the message sending request;
2) selecting a normal intermediate node through the working thread;
3) forwarding the message sending request to the selected intermediate node through the working thread, then waiting for the message returned by the intermediate node, and skipping to execute the next step after receiving the message returned by the intermediate node;
4) the working thread processes the returned message, updates the states of the intermediate node and the computing nodes, and then terminates.
Optionally, taking out a message sending request by the control node in step 1) specifically means that a control thread of the control node takes out a message sending request from a global chain, where the global chain is used to store the message sending requests of the control node, and the control thread is used to manage each message sending request and its corresponding working thread.
Optionally, selecting a normal intermediate node through the working thread in step 2) specifically means sequentially selecting a normal intermediate node by polling from an intermediate node list formed by all intermediate nodes, and recording the state of each intermediate node with a state machine. The state machine comprises two states, state 0 and state 1, and three events, event 1 to event 3: state 0 represents a node fault and state 1 represents a normal node; event 1 is the control node sending a PING message to the intermediate node and obtaining a correct return value; event 2 is the control node sending a PING message to the intermediate node and obtaining no return value; event 3 is the control node forwarding a message sending request to a normal node and obtaining no return message. When event 1 occurs, if the state machine is in state 0 it changes to state 1, and if it is in state 1 it remains unchanged; when event 2 occurs, if the state machine is in state 0 it remains unchanged, and if it is in state 1 it changes to state 0; when event 3 occurs, the state machine changes from state 1 to state 0.
Optionally, when waiting for the message returned by the intermediate node in step 3), if the message returned by the intermediate node is not received after the waiting timeout, skipping to execute step 2) to reselect the next normal intermediate node to process the message sending request.
Optionally, when the working thread forwards the message sending request to the selected intermediate node in step 3), the data structure agent_t used to forward the message sending request includes the following information fields:
the target node count node_count, used to store the number of target nodes in the message sending request;
the retry flag retry, used to record whether a retry is needed after a transmission failure;
the target node chain hostlist, used to record the chain of target nodes;
the message type msg_type, used to record the type of message to be sent;
the message body msg_args, used to record the body of the message to be transmitted.
Optionally, when the message returned by the intermediate node is received in step 3), the data structure agent_response_t of the message returned by the intermediate node includes the following information fields:
the communication error node list comm_err_nodelist, used to store the target nodes for which the communication function reported an error when sending through the star structure;
the task number jobid, used to store the task number return value of a message sending request related to the srun command;
the step number step_id, used to store the step number return value of a message sending request related to the srun command;
the retry node list retry_nodelist, used to store the target nodes to which the message needs to be resent;
the no-response node list no_resp_nodelist, used to store the target nodes contained in sending threads whose final state is DSH_NO_RESP;
the failed node list failed_nodelist, used to store the target nodes contained in sending threads whose final state is DSH_FAILED;
the duplicate node list dupid_nodelist, used to store the target nodes contained in sending threads whose final state is DSH_DUP_JOBID;
the done node list done_nodelist, used to store the target nodes contained in sending threads whose final state is DSH_DONE;
the error node list error_nodelist, used to store the target nodes contained in sending threads whose final state is unrecognized;
the information chain ret_list, used to record the return information chain, which consists of the return values of all target nodes of the message sending request, each node on the chain corresponding to one computing node.
Optionally, after the message sending request is forwarded to the selected intermediate node through the working thread in step 3), the method further includes the following steps by which the intermediate node processes the message sending request:
S1) receiving the message sending request forwarded by the control node;
S2) performing data preparation: grouping the target nodes, preparing data for each sending thread, and then generating one or more sending threads and a monitoring thread;
S3) the sending threads send messages to the target nodes and receive the return messages, while the monitoring thread monitors the state of each sending thread;
S4) the intermediate node collates the return information according to the return messages of the target nodes and the states of the sending threads, fills it into the data structure agent_response_t, and sends it to the control node.
In addition, the invention also provides a high-performance computer-oriented mass computing node resource monitoring and management system, which comprises computer equipment, wherein the computer equipment is programmed or configured to execute the steps of the high-performance computer-oriented mass computing node resource monitoring and management method.
In addition, the invention also provides a high-performance computer-oriented mass computing node resource monitoring and management system, which comprises computer equipment, wherein a memory of the computer equipment stores a computer program which is programmed or configured to execute the high-performance computer-oriented mass computing node resource monitoring and management method.
In addition, the present invention also provides a computer readable storage medium, in which a computer program programmed or configured to execute the method for monitoring and managing mass computing node resources for high-performance computers is stored.
Compared with the prior art, the invention has the following advantages:
The mass computing node resource monitoring and management method for high-performance computers of the invention comprises the following steps, by which a control node sends a message sending request through an intermediate node: the control node takes out a message sending request and generates a working thread for processing it; a normal intermediate node is selected through the working thread; the message sending request is forwarded to the selected intermediate node through the working thread, which then waits for the message returned by the intermediate node and proceeds to the next step after receiving it; the working thread processes the returned message, updates the states of the intermediate node and the computing nodes, and terminates. The invention adopts the approach of adding a layer of intermediate nodes between the control node and the mass computing nodes to share the load of the control node during the monitoring and management of mass computing node resources, while also reducing the related load on the computing nodes in this process:
One, the intermediate nodes can share the load of the control node. A layer of intermediate nodes is added between the control node and the computing nodes, and whether a star-shaped or a tree-shaped sending mode is used, the related load of the control node can be shared. The specific function of an intermediate node is to receive a message sending request from the control node, which contains at least the type and content of the message to be sent and the target node linked list for this transmission; the intermediate node then performs the node grouping, sending, and receiving work in place of the control node, and can even take on the preliminary processing of return values and send the preliminarily processed return values to the control node. Because the data structure used to manage the computing nodes resides on the control node, even with intermediate nodes added, the control node still has to carry out the actual modification of node states. Thus, for a sending request, the control node needs only one thread, which directly forwards the sending request to a suitable intermediate node, waits for the intermediate node to return the preliminarily processed message, and finally updates the state of each node according to the content of the returned message. The threads saved on the control node in this way allow more message sending requests to be processed at the same time and improve system performance.
Two, the intermediate nodes can reduce the load of the computing nodes. The presence of intermediate nodes also reduces the load on the computing nodes in the tree-like communication mode. Because an intermediate node has a single function and does not have to carry computation tasks, almost all of its capacity can be used for sending and receiving messages, and it can run more sending threads at the same time. In other words, a higher communication tree width can be used at the level where intermediate nodes forward messages to computing nodes, thereby keeping the forwarding between computing nodes at a lower communication tree width. Since the current grouping algorithm yields earlier groups with many nodes and later groups with only a few computing nodes, the grouping algorithm for the first layer is modified to make the number of nodes in each group as even as possible.
Drawings
FIG. 1 is a flow chart of a basic scheme of a pre-improvement implementation of an embodiment of the present invention.
Fig. 2 is a flowchart of a control node in an improved implementation scheme according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a state machine in an embodiment of the invention.
Fig. 4 is a flowchart of an intermediate node in the solution of the improved implementation of the embodiment of the present invention.
Detailed Description
First, this embodiment describes an implementation in which the control node independently completes the whole process of handling a sending request; in an application scenario with large-scale nodes, this places load on the control node in several respects. The whole processing procedure is shown in fig. 1, and the specific workflow is as follows:
In the first step, the control thread (agent_init thread) continuously takes the next message sending request from the chain, on the premise that the total number of related threads does not exceed the thread upper limit, and generates a working thread (agent thread) to process the request.
In the second step, the working thread performs data preparation. Data preparation mainly determines whether the message is sent through a star structure or a tree structure; if it is sent through a tree structure, the target nodes are also grouped.
In the third step, the working thread on the one hand generates sending threads to send messages to the target computing nodes and receive return messages, and on the other hand generates a monitoring thread to monitor the state of each sending thread. There are 6 possible thread states in total: three normal states, DSH_NEW (the thread is ready to run), DSH_ACTIVE (the thread is running), and DSH_DONE (the thread has finished), and three exception states, DSH_NO_RESP (the thread got no response), DSH_FAILED (the thread got a return value indicating an error), and DSH_DUP_JOBID (the thread got a duplicate job id number).
In the fourth step, the sending threads perform the corresponding processing on the return messages, and the monitoring thread performs the corresponding processing according to the observed state of each sending thread.
On the basis of the above implementation manner, this embodiment further provides a method for monitoring and managing a large number of computing node resources for a high-performance computer, where, for each message sending request, a part of functions on a control node are handed over to an intermediate node to be completed, and the processing flows of the control node and the intermediate node after modification are shown in fig. 2 and fig. 4.
As shown in fig. 2, this embodiment further provides a mass computing node resource monitoring and management method for high-performance computers, comprising the following steps by which a control node sends a message sending request through an intermediate node:
1) the control node takes out a message sending request and generates a working thread for processing the message sending request;
2) selecting a normal intermediate node through the working thread;
3) forwarding the message sending request to the selected intermediate node through the working thread, then waiting for the message returned by the intermediate node, and skipping to execute the next step after receiving the message returned by the intermediate node;
4) the working thread processes the returned message, updates the states of the intermediate node and the computing nodes, and then terminates.
In this embodiment, taking out a message sending request by the control node in step 1) specifically means that a control thread of the control node takes out a message sending request from a global chain, where the global chain is used to store the message sending requests of the control node, and the control thread is used to manage each message sending request and its corresponding working thread.
In this embodiment, selecting a normal intermediate node through the working thread in step 2) specifically means sequentially selecting a normal intermediate node by polling from an intermediate node list formed by all intermediate nodes. Tasks are distributed to the intermediate nodes in turn: the control node records the number of the intermediate node selected last time and directly selects the next normal intermediate node each time a selection is made; once an error occurs in the communication between the control node and an intermediate node, the control node hands the message sending request to the next normal intermediate node for processing.
As shown in fig. 3, in this embodiment a state machine is used to record the state of each intermediate node. The state machine comprises two states, state 0 and state 1, and three events, event 1 to event 3: state 0 represents a node fault and state 1 represents a normal node; event 1 is the control node sending a PING message to the intermediate node and obtaining a correct return value; event 2 is the control node sending a PING message to the intermediate node and obtaining no return value; event 3 is the control node forwarding a message sending request to a normal node and obtaining no return message. When event 1 occurs, if the state machine is in state 0 it changes to state 1, and if it is in state 1 it remains unchanged; when event 2 occurs, if the state machine is in state 0 it remains unchanged, and if it is in state 1 it changes to state 0; when event 3 occurs, the state machine changes from state 1 to state 0, as shown in Table 1.
Table 1: state transition table of state machine.
Event      Original state    New state
Event 1    0                 1
Event 1    1                 1
Event 2    0                 0
Event 2    1                 0
Event 3    1                 0
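For illustration, a minimal, self-contained C sketch of this state machine is given below; the enum names, the transition function, and the example in main are assumptions, since the patent text itself contains no code:

    /* Sketch of the intermediate-node state machine described above.
     * States and events follow Table 1; names are illustrative. */
    #include <stdio.h>

    enum node_state { STATE_FAILED = 0, STATE_NORMAL = 1 };
    enum node_event {
        EV_PING_OK      = 1,  /* event 1: PING answered correctly            */
        EV_PING_TIMEOUT = 2,  /* event 2: PING got no return value           */
        EV_FWD_TIMEOUT  = 3   /* event 3: forwarded request got no reply     */
    };

    static enum node_state transition(enum node_state s, enum node_event e)
    {
        switch (e) {
        case EV_PING_OK:      return STATE_NORMAL;  /* 0 -> 1, 1 -> 1            */
        case EV_PING_TIMEOUT: return STATE_FAILED;  /* 0 -> 0, 1 -> 0            */
        case EV_FWD_TIMEOUT:  return STATE_FAILED;  /* only fired from state 1   */
        }
        return s;
    }

    int main(void)
    {
        enum node_state s = STATE_FAILED;
        s = transition(s, EV_PING_OK);      /* node comes up:    0 -> 1 */
        s = transition(s, EV_FWD_TIMEOUT);  /* forwarding fails: 1 -> 0 */
        printf("final state: %d\n", s);     /* prints 0 (failed)        */
        return 0;
    }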
In this embodiment, when waiting for the message returned by the intermediate node in step 3), if the message is not received before the wait times out, execution jumps back to step 2) to reselect the next normal intermediate node to process the message sending request. When the control node sends a message sending request to an intermediate node that has failed, it cannot receive a return message; in that case, after waiting for a period of time without a return message, the control node hands the message sending request to the next intermediate node for processing. Although this ensures that every message sending request is eventually processed by a normal intermediate node, each encounter with a failed intermediate node introduces a delay into the system. To avoid the large delays caused by sending a message sending request to an intermediate node that has already failed, a state machine is designed to manage the state of the intermediate nodes.
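A minimal C sketch of the corresponding control-node working-thread logic (forward, wait with a timeout, and retry on the next normal intermediate node) might look as follows; the helper functions, types, and timeout value are assumptions rather than names from the original text:

    /* Sketch of the control-node working thread for one message sending request
     * (steps 2 to 4). All helpers, types, and the timeout value are assumed. */
    #include <stddef.h>

    #define WAIT_TIMEOUT_SEC 30                          /* assumed timeout */

    typedef struct agent agent_t;                        /* see the agent_t sketch below */
    typedef struct agent_response agent_response_t;      /* see the agent_response_t sketch below */

    extern int               select_next_normal_intermediate(void);    /* polling over the state array */
    extern int               forward_request(int inter_idx, agent_t *req);
    extern agent_response_t *wait_for_response(int inter_idx, int timeout_sec); /* NULL on timeout */
    extern void              mark_intermediate_failed(int inter_idx);  /* event 3: state 1 -> 0 */
    extern void              process_response(agent_response_t *resp); /* step 4: update node states */

    void *agent_worker(void *arg)
    {
        agent_t *req = (agent_t *)arg;

        for (;;) {
            int idx = select_next_normal_intermediate();              /* step 2 */
            if (idx < 0)
                break;                                                /* no normal intermediate node left */
            if (forward_request(idx, req) == 0) {                     /* step 3: forward the request */
                agent_response_t *resp = wait_for_response(idx, WAIT_TIMEOUT_SEC);
                if (resp != NULL) {
                    process_response(resp);                           /* step 4 */
                    break;
                }
            }
            mark_intermediate_failed(idx);                            /* timeout: try the next normal node */
        }
        return NULL;                                                  /* the working thread terminates */
    }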
In this embodiment, when the working thread forwards the message sending request to the selected intermediate node in step 3), the data structure agent_t used to forward the message sending request includes the following information fields:
the target node count node_count, used to store the number of target nodes in the message sending request;
the retry flag retry, used to record whether a retry is needed after a transmission failure;
the target node chain hostlist, used to record the chain of target nodes;
the message type msg_type, used to record the type of message to be sent;
the message body msg_args, used to record the body of the message to be transmitted.
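A minimal C sketch of this agent_t structure, with field types assumed from the descriptions above (the original text does not specify them here), could be:

    /* Sketch of the agent_t forwarding structure; exact field types are assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct agent {
        uint32_t  node_count;   /* number of target nodes in the request          */
        bool      retry;        /* whether to retry after a transmission failure  */
        char     *hostlist;     /* target node chain (list of node names)         */
        uint16_t  msg_type;     /* type of the message to be sent                 */
        void     *msg_args;     /* body of the message to be transmitted          */
    } agent_t;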
In this embodiment, when the message returned by the intermediate node is received in step 3), the data structure agent_response_t of the message returned by the intermediate node includes the following information fields:
the communication error node list comm_err_nodelist, of string type, used to store the linked list of target nodes for which errors occurred when sending through the star structure;
the job number jobid, a 32-bit unsigned integer, used to store the job number return value of a message sending request related to the srun command;
the step number step_id, a 32-bit unsigned integer, used to store the step number return value of a message sending request related to the srun command;
the retry node list retry_nodelist, of string type, used to store the target nodes to which the message needs to be resent;
the no-response node list no_resp_nodelist, of string type, used to store the target nodes contained in sending threads whose final state is DSH_NO_RESP;
the failed node list failed_nodelist, of string type, used to store the target nodes contained in sending threads whose final state is DSH_FAILED;
the duplicate node list dupid_nodelist, of string type, used to store the target nodes contained in sending threads whose final state is DSH_DUP_JOBID;
the done node list done_nodelist, of string type, used to store the target nodes contained in sending threads whose final state is DSH_DONE;
the error node list error_nodelist, of string type, used to store the target nodes contained in sending threads whose final state is unrecognized;
the information chain ret_list, of linked-list type, used to record the return information chain, which consists of the return values of all target nodes of the message sending request, each node on the chain corresponding to one computing node.
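Following the field types stated above (strings for the node lists, 32-bit unsigned integers for the numbers, and a linked list for ret_list), a minimal C sketch of agent_response_t could be as follows; the opaque list_t handle is an assumption:

    /* Sketch of the agent_response_t return structure; list_t is an assumed
     * opaque linked-list type, the other field types follow the text above. */
    #include <stdint.h>

    typedef struct list *list_t;              /* assumed opaque linked-list handle */

    typedef struct agent_response {
        char     *comm_err_nodelist;  /* nodes with communication errors (star mode) */
        uint32_t  jobid;              /* job number for srun-related requests        */
        uint32_t  step_id;            /* step number for srun-related requests       */
        char     *retry_nodelist;     /* nodes whose messages must be resent          */
        char     *no_resp_nodelist;   /* nodes in threads ending in DSH_NO_RESP       */
        char     *failed_nodelist;    /* nodes in threads ending in DSH_FAILED        */
        char     *dupid_nodelist;     /* nodes in threads ending in DSH_DUP_JOBID     */
        char     *done_nodelist;      /* nodes in threads ending in DSH_DONE          */
        char     *error_nodelist;     /* nodes in threads with an unrecognized state  */
        list_t    ret_list;           /* chain of per-target-node return values       */
    } agent_response_t;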
The control node needs to send all information contained in the message sending request to the intermediate node correctly; since message sending requests previously did not need to be transmitted between nodes, a completely new data structure agent_t has to be designed to ensure the correctness of the forwarding process. After the intermediate node has processed the message sending request, it needs to send the result to the control node in the form of a return message. Since there are many types of message sending requests and their return messages differ, it is important to design a common data structure agent_response_t general enough to cover all cases for the transmission of the return messages. In this embodiment, the following table shows the processing functions used for some key return items and the source files in which these processing functions are located.
Table 1: the processing function for which the key return item is used and the source file in which the processing function is located.
Figure BDA0002677505780000091
Once the control node receives the return message, it must perform the corresponding processing for each non-empty return item.
Communication error node list comm_err_nodelist: the control node takes the erroneous node names from the node list one by one as input to the _comm_err function, which outputs error information reporting the communication error of each node;
Task number jobid and step number step_id: the control node passes the task number and the step number to the srun_response function, which marks the task and the job status as responded;
Retry node list retry_nodelist: the control node replaces the node list in the current message sending request with this node list to form a new message sending request, and then adds that sending request to the global chain;
No-response node list no_resp_nodelist: the control node takes the non-responding node names from the node list one by one as input to the node_resp function, which marks each non-responding node and updates its current state;
Failed node list failed_nodelist: the control node takes the failed node names from the node list one by one as input to the drain_nodes function, which marks each failed node and updates its current state;
Duplicate node list dupid_nodelist: the control node takes the names of the nodes with duplicate job numbers from the node list one by one as input to the drain_nodes function; these nodes can essentially be regarded as a special case of failed nodes, and the drain_nodes function marks them and updates their current state;
Done node list done_nodelist: the control node takes the successful node names from the node list one by one as input to the node_did_resp function, which marks each successful node and updates its current state;
Error node list error_nodelist: the control node takes the names of the nodes with other errors from the node list one by one and outputs error information;
Information chain ret_list: the control node processes the return message of each target computing node stored in the information chain according to the original logic; this part of the processing is unrelated to the control node's monitoring and management of computing node states.
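A compact C sketch of this per-item dispatch is shown below; it assumes the agent_response_t sketch given earlier, the handler names come from the list above, and their signatures together with the for_each_node and other helpers are assumptions:

    /* Sketch of how the control node processes the non-empty items of a return
     * message; relies on the agent_response_t definition sketched above. */
    #include <stdint.h>

    typedef void (*node_handler_fn)(const char *node_name);

    extern void for_each_node(const char *nodelist, node_handler_fn fn);  /* assumed helper           */
    extern void _comm_err(const char *node_name);
    extern void node_resp(const char *node_name);
    extern void drain_nodes(const char *node_name);
    extern void node_did_resp(const char *node_name);
    extern void srun_response(uint32_t jobid, uint32_t step_id);
    extern void requeue_request(const char *retry_nodelist);              /* re-adds to the global chain */
    extern void report_unknown_errors(const char *error_nodelist);
    extern void process_ret_list(void *ret_list);                         /* original per-message logic  */

    void process_response(agent_response_t *r)
    {
        if (r->comm_err_nodelist)   for_each_node(r->comm_err_nodelist, _comm_err);
        if (r->jobid || r->step_id) srun_response(r->jobid, r->step_id);
        if (r->retry_nodelist)      requeue_request(r->retry_nodelist);
        if (r->no_resp_nodelist)    for_each_node(r->no_resp_nodelist, node_resp);
        if (r->failed_nodelist)     for_each_node(r->failed_nodelist, drain_nodes);
        if (r->dupid_nodelist)      for_each_node(r->dupid_nodelist, drain_nodes);
        if (r->done_nodelist)       for_each_node(r->done_nodelist, node_did_resp);
        if (r->error_nodelist)      report_unknown_errors(r->error_nodelist);
        if (r->ret_list)            process_ret_list(r->ret_list);
    }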
In general, for a message sending request, the control node only needs to generate one thread. In the workflow of the control node, the second and third steps are newly added code, while the first and fourth steps can reuse the pre-optimization code on the control node.
As shown in fig. 4, after the message sending request is forwarded to the selected intermediate node through the working thread in step 3), the method further includes the following steps by which the intermediate node processes the message sending request:
S1) receiving the message sending request forwarded by the control node;
S2) performing data preparation: grouping the target nodes, preparing data for each sending thread, and then generating one or more sending threads and a monitoring thread;
S3) the sending threads send messages to the target nodes and receive the return messages, while the monitoring thread monitors the state of each sending thread;
S4) the intermediate node collates the return information according to the return messages of the target nodes and the states of the sending threads, fills it into the data structure agent_response_t, and sends it to the control node.
Viewed as a whole, in the workflow of the intermediate node, S2) and S3) are the load that the intermediate node shares on behalf of the control node; these two steps were originally completed by the control node alone and are now moved to the intermediate node, so the pre-optimization code on the control node can be reused for them. S1) and S4) are new code.
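A minimal C sketch of the intermediate node's handling of one forwarded request (S1 to S4) might look as follows; all helper functions and the group_list_t type are assumptions:

    /* Sketch of the intermediate node's processing of one forwarded request.
     * Helper functions and types are illustrative assumptions. */
    typedef struct agent agent_t;
    typedef struct agent_response agent_response_t;
    typedef struct group_list group_list_t;

    extern agent_t          *recv_agent_request(int ctrl_sock);                      /* S1 */
    extern group_list_t     *group_targets(const agent_t *req, int tree_width);      /* S2 */
    extern void              spawn_send_threads(group_list_t *groups);               /* S2/S3 */
    extern void              spawn_monitor_thread(group_list_t *groups);             /* S3 */
    extern void              wait_all_threads(group_list_t *groups);
    extern agent_response_t *collect_results(group_list_t *groups);                  /* S4 */
    extern void              send_agent_response(int ctrl_sock, agent_response_t *resp);

    void handle_forwarded_request(int ctrl_sock, int tree_width)
    {
        agent_t *req = recv_agent_request(ctrl_sock);              /* S1: receive agent_t            */
        group_list_t *groups = group_targets(req, tree_width);     /* S2: group the target nodes     */
        spawn_send_threads(groups);                                /* S2/S3: one thread per group    */
        spawn_monitor_thread(groups);                              /* S3: monitor thread states      */
        wait_all_threads(groups);
        agent_response_t *resp = collect_results(groups);          /* S4: fill agent_response_t      */
        send_agent_response(ctrl_sock, resp);                      /* S4: return to the control node */
    }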
In this embodiment, to implement the basic functions of the state machine, the following data structure and operations are added: an array medium_states that records the state of each intermediate node, each entry of the array corresponding to the state of one intermediate node. The control node periodically generates PING messages, sends them to all intermediate nodes, and updates the intermediate node states stored in the medium_states array according to the return values. When the control node selects an intermediate node to forward a message sending request, since step 2) only selects intermediate nodes currently recorded as normal (i.e., with state 1), failing to receive the return message of the selected intermediate node proves that it has failed and is no longer normal, and its state must be changed to failed (i.e., the corresponding entry of that intermediate node in the medium_states array is set to state 0).
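A small C sketch of the medium_states bookkeeping (the periodic PING update and the polling selection of the next normal intermediate node) could look like this; the array size, the ping_intermediate helper, and the locking are assumptions:

    /* Sketch of the medium_states array and the operations that use it.
     * NUM_INTERMEDIATE, ping_intermediate, and the mutex usage are assumptions. */
    #include <pthread.h>

    #define NUM_INTERMEDIATE 16                   /* assumed number of intermediate nodes */

    static int medium_states[NUM_INTERMEDIATE];   /* 1 = normal, 0 = failed               */
    static int last_selected = -1;                /* index of the last selected node      */
    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

    extern int ping_intermediate(int idx);        /* assumed: 0 if PING answered correctly */

    /* Periodic health check: events 1 and 2 of the state machine. */
    void refresh_intermediate_states(void)
    {
        for (int i = 0; i < NUM_INTERMEDIATE; i++) {
            int ok = (ping_intermediate(i) == 0);
            pthread_mutex_lock(&state_lock);
            medium_states[i] = ok ? 1 : 0;
            pthread_mutex_unlock(&state_lock);
        }
    }

    /* Polling selection used in step 2): the next node recorded as normal. */
    int select_next_normal_intermediate(void)
    {
        int chosen = -1;
        pthread_mutex_lock(&state_lock);
        for (int step = 1; step <= NUM_INTERMEDIATE; step++) {
            int idx = (last_selected + step) % NUM_INTERMEDIATE;
            if (medium_states[idx] == 1) {
                last_selected = chosen = idx;
                break;
            }
        }
        pthread_mutex_unlock(&state_lock);
        return chosen;                            /* -1 if no normal intermediate node */
    }

    /* Event 3: a forwarded request got no return message. */
    void mark_intermediate_failed(int idx)
    {
        pthread_mutex_lock(&state_lock);
        medium_states[idx] = 0;
        pthread_mutex_unlock(&state_lock);
    }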
In summary, the monitoring and management of mass computing node resources in the method of this embodiment run concurrently with the operation of the high-performance computer system and are an important guarantee of normal system operation, especially of normal job loading. The first key point of the method of this embodiment is to separate the functions related to resource monitoring and management from the other functions on the control node and offload them entirely to the intermediate nodes, reducing the load on the control node and the computing nodes so that they can meet the performance requirements of application scenarios with ultra-large-scale nodes. In the method, a message sending request is sent to one intermediate node as the basic unit, and that intermediate node is fully responsible for processing it, so the processing of each message sending request is clear and unambiguous and the tasks of different intermediate nodes do not overlap. A polling approach is used when selecting intermediate nodes so that the load on each intermediate node stays relatively balanced, and the current state of the intermediate nodes is managed with a state machine to avoid sending a message sending request to a failed intermediate node; this is the second key point of the design. The types of message sending requests are quite complex, several hundred in total, and the return messages of each type differ; the method therefore designs a general data structure so that, after finishing the processing of any message sending request, the intermediate node can correctly transmit the return message to the control node. Designing this general data structure requires examining almost all message sending requests, and this is the third key point of the method of this embodiment. The mass computing node resource monitoring and management method for high-performance computers comprises the following steps, by which a control node sends a message sending request through an intermediate node: the control node takes out a message sending request and generates a working thread for processing it; a normal intermediate node is selected through the working thread; the message sending request is forwarded to the selected intermediate node through the working thread, which then waits for the message returned by the intermediate node and proceeds to the next step after receiving it; the working thread processes the returned message, updates the states of the intermediate node and the computing nodes, and terminates. In this embodiment, a layer of intermediate nodes is added between the control node and the mass computing nodes to share the load of the control node during the monitoring and management of mass computing node resources and to reduce the related load on the computing nodes in this process:
one, the intermediate node can share the load of the control node.
A layer of intermediate nodes is added between the control node and the computing nodes, and whether a star-shaped or a tree-shaped sending mode is used, the related load of the control node can be shared. The specific function of an intermediate node is to receive a message sending request from the control node, which contains at least the type and content of the message to be sent and the target node linked list for this transmission; the intermediate node then performs the node grouping, sending, and receiving work in place of the control node, and can even take on the preliminary processing of return values and send the preliminarily processed return values to the control node. Because the data structure used to manage the computing nodes resides on the control node, even with intermediate nodes added, the control node still has to carry out the actual modification of node states. Thus, for a sending request, the control node needs only one thread, which directly forwards the sending request to a suitable intermediate node, waits for the intermediate node to return the preliminarily processed message, and finally updates the state of each node according to the content of the returned message. The threads saved on the control node in this way allow more message sending requests to be processed at the same time and improve system performance.
And secondly, the intermediate node can reduce the load of the computing node.
The presence of intermediate nodes also reduces the load on the computing nodes in the tree-like communication mode. Because an intermediate node has a single function and does not have to carry computation tasks, almost all of its capacity can be used for sending and receiving messages, and it can run more sending threads at the same time. In other words, a higher communication tree width can be used at the level where intermediate nodes forward messages to computing nodes, thereby keeping the forwarding between computing nodes at a lower communication tree width. The current grouping algorithm yields earlier groups with many nodes and later groups with only a few computing nodes, so the grouping algorithm for the first layer is modified to make the number of nodes in each group as even as possible. For example, with ten thousand nodes and a first-layer tree width of 200, the original grouping algorithm would put 200 nodes in each of the earlier groups and only 1 node in the later groups, which greatly weakens the optimization obtained by expanding the number of first-layer groups. This result can be predicted by the following simulation experiment. In the simulation experiment, the number of nodes is ten million, one hundred thousand, and one million respectively, and the failure rate is 0.1. The communication tree width is set as follows:
table 1: and setting the communication tree width.
Figure BDA0002677505780000121
The experimental results are as follows:
table 2: and (5) simulating an experimental result.
Figure BDA0002677505780000122
As can be seen from the simulation results, by expanding the first-layer tree width and reducing the tree width of the remaining layers, the connection timeout can be kept substantially constant while the reception timeout becomes significantly smaller. Since the tree width of the remaining layers is reduced, the message forwarding load of the computing nodes is reduced.
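As an illustration of the modified first-layer grouping, a short, self-contained C sketch that splits n target nodes into g groups whose sizes differ by at most one is shown below; the function name and the node representation are assumptions:

    /* Sketch of a balanced first-layer grouping: n nodes into g groups whose
     * sizes differ by at most one. Names and the node representation are assumed. */
    #include <stdio.h>

    /* Writes the size of each of the g groups into group_size[]. */
    static void balanced_group_sizes(int n, int g, int group_size[])
    {
        int base  = n / g;         /* minimum nodes per group           */
        int extra = n % g;         /* first 'extra' groups get one more */
        for (int i = 0; i < g; i++)
            group_size[i] = base + (i < extra ? 1 : 0);
    }

    int main(void)
    {
        int sizes[200];
        balanced_group_sizes(10000, 200, sizes);   /* 10,000 nodes, first-layer width 200 */
        printf("first group: %d nodes, last group: %d nodes\n", sizes[0], sizes[199]);
        /* prints: first group: 50 nodes, last group: 50 nodes */
        return 0;
    }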
In addition, the embodiment also provides a system for monitoring and managing the resources of the mass computing nodes facing the high-performance computer, which includes a computer device programmed or configured to execute the steps of the method for monitoring and managing the resources of the mass computing nodes facing the high-performance computer.
In addition, the embodiment also provides a system for monitoring and managing a mass computing node resource for a high-performance computer, which includes a computer device, where a memory of the computer device stores a computer program programmed or configured to execute the method for monitoring and managing a mass computing node resource for a high-performance computer.
In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program programmed or configured to execute the foregoing method for monitoring and managing mass computing node resources for a high-performance computer is stored.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing apparatus to produce a machine, such that the instructions executed by the processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus so that a series of operational steps are performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A method for monitoring and managing massive computing node resources facing a high-performance computer is characterized by comprising the following steps that a control node sends a message sending request through an intermediate node:
1) the control node takes out a message sending request and generates a working thread for processing the message sending request;
2) selecting a normal intermediate node through the working thread;
3) forwarding the message sending request to the selected intermediate node through the working thread, then waiting for the message returned by the intermediate node, and skipping to execute the next step after receiving the message returned by the intermediate node;
4) and the working thread processes the returned message, updates the states of the intermediate node and the computing node and finishes the working thread.
2. The method for monitoring and managing the resources of the massive computing nodes facing the high-performance computer according to claim 1, wherein the step 1) of taking out a message sending request by the control node specifically means that a control thread of the control node takes out a message sending request from a global chain, the global chain is used for storing the message sending request of the control node, and the control thread is used for managing each message sending request and a corresponding working thread thereof.
3. The method for monitoring and managing the resources of the massive computing nodes facing the high-performance computer according to claim 1, wherein the step 2) of selecting a normal intermediate node through the working thread specifically means that a polling method is adopted to sequentially select a normal intermediate node from an intermediate node list formed by all the intermediate nodes, and a state machine is adopted to record the state of each intermediate node, wherein the state machine comprises two states of 0 and 1 and three events of 1 to 3, and the state of 0 represents a node fault; the state 1 represents that the node is normal, and the event 1 is that the control node sends a PING message to the intermediate node and obtains a correct return value; event 2 is that the control node sends a PING message to the intermediate node and cannot obtain a return value; the event 3 is that the control node forwards a message sending request to a normal node and cannot obtain a return message, when the event 1 occurs, if the original state of the state machine is the state 0, the state is changed to the state 1, and if the original state of the state machine is the state 1, the state is kept unchanged; when an event 2 occurs, if the original state of the state machine is a state 0, the state is kept unchanged, and if the original state of the state machine is a state 1, the state is changed to a state 0; when the event 3 occurs, the original state of the state machine is changed to the state 1 and then to the state 0.
4. The method for monitoring and managing resources of mass computing nodes oriented to high-performance computers, as recited in claim 1, wherein when waiting for the message returned by the intermediate node in step 3), if the message returned by the intermediate node is not received after waiting for timeout, then skipping to execute step 2) to reselect the next normal intermediate node to process the message transmission request.
5. The method for monitoring and managing the resources of the massive computing nodes facing the high-performance computer according to claim 1, wherein when the message sending request is forwarded to the selected intermediate node through the working thread in step 3), the data structure agent_t for forwarding the message sending request includes the following information fields:
the target node count node_count, used for storing the number of target nodes in the message sending request;
the retry flag retry, used for recording whether a retry is needed after transmission failure;
the target node chain hostlist, used for recording the target node chain;
the message type msg_type, used for recording the message type to be sent;
and the message body msg_args, used for recording the message body to be transmitted.
6. The mass computing node resource monitoring and management method for a high-performance computer according to claim 5, wherein, when the message returned by the intermediate node is received in step 3), the data structure agent_response_t of the returned message includes the following information fields (a hedged sketch of this structure follows the claim):
a communication error node list comm_err_nodelist, used to store target nodes for which the communication function reported an error when sending through the star structure;
a task number jobid, used to store the task number return value of a message sending request related to the srun command;
a step number step_id, used to store the step number return value of a message sending request related to the srun command;
a retry node list retry_nodelist, used to store target nodes to which the message needs to be resent;
a no-response node list no_resp_nodelist, used to store target nodes handled by a sending thread whose final state is DSH_NO_RESP;
a failed node list failed_nodelist, used to store target nodes handled by a sending thread whose final state is DSH_FAILED;
a duplicate node list dupid_nodelist, used to store target nodes handled by a sending thread whose final state is DSH_DUP_JOBID;
a done node list done_nodelist, used to store target nodes handled by a sending thread whose final state is DSH_DONE;
an error node list error_nodelist, used to store target nodes handled by a sending thread whose final state is unrecognized;
and a return information chain ret_list, used to record the chain formed by the return values of all target nodes of the message sending request, each node on the chain corresponding to one computing node.
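A hedged C sketch of this return structure; as above, the field types and the opaque list placeholders are assumptions, and only the field names and meanings come from the claim.

#include <stdint.h>

typedef struct hostlist *hostlist_t;   /* opaque node-name chain (assumed type)     */
typedef struct xlist    *List;         /* opaque return-value chain (assumed type)  */

/* Return structure sent from an intermediate node back to the control node,
 * as listed in claim 6; field types are assumptions. */
typedef struct agent_response {
    hostlist_t comm_err_nodelist;   /* targets that hit a communication error       */
    uint32_t   jobid;               /* task number for srun-related requests        */
    uint32_t   step_id;             /* step number for srun-related requests        */
    hostlist_t retry_nodelist;      /* targets whose message must be resent         */
    hostlist_t no_resp_nodelist;    /* targets in threads ending DSH_NO_RESP        */
    hostlist_t failed_nodelist;     /* targets in threads ending DSH_FAILED         */
    hostlist_t dupid_nodelist;      /* targets in threads ending DSH_DUP_JOBID      */
    hostlist_t done_nodelist;       /* targets in threads ending DSH_DONE           */
    hostlist_t error_nodelist;      /* targets in threads with unrecognized state   */
    List       ret_list;            /* chain of per-target return values            */
} agent_response_t;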
7. The mass computing node resource monitoring and management method for a high-performance computer according to claim 6, wherein, after the message sending request is forwarded to the selected intermediate node through the working thread in step 3), the intermediate node processes the message sending request through the following steps (sketched after this claim):
s1) receiving the message sending request forwarded by the control node;
s2) preparing data: grouping the target nodes, preparing data for each sending thread, and then creating one or more sending threads and one monitoring thread;
s3) the sending threads send messages to the target nodes and receive the return messages, while the monitoring thread monitors the state of each sending thread;
s4) the intermediate node collates the return information according to the return messages of the target nodes and the states of the sending threads, fills it into the data structure agent_response_t, and sends it to the control node.
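A hedged C sketch of steps s1) to s4) on the intermediate node, building on the agent_t and agent_response_t sketches above; GROUP_SIZE and all helper functions are assumptions used only to make the control flow concrete.

#include <pthread.h>
#include <stdlib.h>

#define GROUP_SIZE 64   /* assumed number of target nodes per sending thread */

extern void *make_group_arg(agent_t *req, int group, int size);   /* assumed helper */
extern void *send_group(void *arg);         /* s3: send to targets, collect replies */
extern void *monitor_threads(void *arg);    /* s3: watch the sending-thread states  */
extern void  fill_response(agent_response_t *resp);   /* s4: collate the results    */

void handle_forwarded_request(agent_t *req, agent_response_t *resp)
{
    /* s1: the message sending request (req) has been received from the control node */

    /* s2: group the target nodes and start one sending thread per group */
    int groups = (int)((req->node_count + GROUP_SIZE - 1) / GROUP_SIZE);
    pthread_t *senders = calloc(groups, sizeof(pthread_t));
    for (int g = 0; g < groups; g++)
        pthread_create(&senders[g], NULL, send_group,
                       make_group_arg(req, g, GROUP_SIZE));

    /* s2/s3: one monitoring thread watches the state of every sending thread */
    pthread_t monitor;
    pthread_create(&monitor, NULL, monitor_threads, senders);

    for (int g = 0; g < groups; g++)         /* s3: wait for all sends to finish */
        pthread_join(senders[g], NULL);
    pthread_join(monitor, NULL);

    /* s4: collate return messages and thread states into agent_response_t */
    fill_response(resp);
    free(senders);
}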
8. A mass computing node resource monitoring and management system for a high-performance computer, comprising a computer device, characterized in that the computer device is programmed or configured to execute the steps of the mass computing node resource monitoring and management method for a high-performance computer according to any one of claims 1 to 7.
9. A mass computing node resource monitoring and management system for a high-performance computer, comprising a computer device, characterized in that a memory of the computer device stores a computer program programmed or configured to execute the mass computing node resource monitoring and management method for a high-performance computer according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program programmed or configured to execute the mass computing node resource monitoring and management method for a high-performance computer according to any one of claims 1 to 7.
CN202010952582.9A 2020-09-11 2020-09-11 Mass computing node resource monitoring and management method for high-performance computer Active CN112000486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010952582.9A CN112000486B (en) 2020-09-11 2020-09-11 Mass computing node resource monitoring and management method for high-performance computer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010952582.9A CN112000486B (en) 2020-09-11 2020-09-11 Mass computing node resource monitoring and management method for high-performance computer

Publications (2)

Publication Number Publication Date
CN112000486A true CN112000486A (en) 2020-11-27
CN112000486B CN112000486B (en) 2022-10-28

Family

ID=73468679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010952582.9A Active CN112000486B (en) 2020-09-11 2020-09-11 Mass computing node resource monitoring and management method for high-performance computer

Country Status (1)

Country Link
CN (1) CN112000486B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324539A (en) * 2013-06-24 2013-09-25 浪潮电子信息产业股份有限公司 Job scheduling management system and method
WO2015128276A1 (en) * 2014-02-27 2015-09-03 Alcatel Lucent Improved traffic control in packet transport networks
CN106776984A (en) * 2016-12-02 2017-05-31 航天星图科技(北京)有限公司 A kind of cleaning method of distributed system mining data
CN111190714A (en) * 2019-12-27 2020-05-22 西安交通大学 Cloud computing task scheduling system and method based on block chain

Also Published As

Publication number Publication date
CN112000486B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
US10838777B2 (en) Distributed resource allocation method, allocation node, and access node
US9703610B2 (en) Extensible centralized dynamic resource distribution in a clustered data grid
US20080052322A1 (en) Conflict resolution in database replication through autonomous node qualified folding
CN106817408B (en) Distributed server cluster scheduling method and device
WO2016025333A1 (en) Fault tolerant federation of computing clusters
CN111045811A (en) Task allocation method and device, electronic equipment and storage medium
CN111400041A (en) Server configuration file management method and device and computer readable storage medium
CN115297124B (en) System operation and maintenance management method and device and electronic equipment
CN106936620B (en) Alarm event processing method and processing device
CN111652728A (en) Transaction processing method and device
CN112492022A (en) Cluster, method, system and storage medium for improving database availability
CN103164262B (en) A kind of task management method and device
CN111163140A (en) Method, apparatus and computer readable storage medium for resource acquisition and allocation
US10205630B2 (en) Fault tolerance method for distributed stream processing system
CN106452899A (en) Distributed data mining system and method
CN112000486B (en) Mass computing node resource monitoring and management method for high-performance computer
CN106559278B (en) Data processing state monitoring method and device
WO2017169471A1 (en) Processing system and processing method
CN106294445A (en) The method and device stored based on the data across machine room Hadoop cluster
US11381642B2 (en) Distributed storage system suitable for sensor data
CN115291891A (en) Cluster management method and device and electronic equipment
CN111400100B (en) Management method and system for distributed software backup
CN109818767B (en) Method and device for adjusting Redis cluster capacity and storage medium
CN112039747B (en) Mass computing node communication tree construction method based on fault rate prediction
CN113934792A (en) Processing method and device of distributed database, network equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant