WO2023115975A1 - Slow node detection method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023115975A1
Application number: PCT/CN2022/111137
Authority: WO
Inventors: 付浩瀚; 王雁鹏; 黎世勇; 孙鹏; 张恒华; 骆宝童; 张建宇; 王帅俭; 刘伟
Original assignee: 北京百度网讯科技有限公司
Priority date: 2021-12-23
Filing date: 2022-08-09
Publication date: 2023-06-29
Also published as: CN114328098A; CN114328098B

Abstract

A slow node detection method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence, and in particular, to the fields such as cluster systems, distributed machine learning, and node fault detection. The specific implementation solution comprises: a sensing module initiates a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system (S201); the sensing module receives timing information fed back by the first node (S202); and the sensing module detects the existence of a slow node in the cluster system according to the timing information (S203).

Description

Slow node detection method, device, electronic equipment and storage medium

Technical field

The present disclosure relates to the technical field of artificial intelligence, in particular to cluster systems, distributed machine learning, node failure detection and other fields.

Background art

In multiple nodes of the cluster system or multiple electronic devices under one node (such as terminal devices or servers, etc.), large-scale model training tasks can be performed based on artificial intelligence technology to obtain models with higher processing efficiency. The trained model is deployed in the cluster system, which can improve the overall operating efficiency of the cluster system.

Summary of the invention

The disclosure provides a slow node detection method, device, electronic equipment and storage medium.

According to an aspect of the present disclosure, a slow node detection method is provided, including: a perception module initiates a timing request to a first node, wherein the first node is one or more training nodes that perform training tasks in a cluster system; the perception module receives timing information fed back by the first node; and the perception module detects, according to the timing information, that there is a slow node in the cluster system.

According to another aspect of the present disclosure, a slow node detection method is provided, including: a first node receives a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system; the first node performs a collective communication operation based on the timing request, completes data exchange in the cluster system, and obtains timing information; and the first node sends the timing information to the perception module.

According to another aspect of the present disclosure, a slow node detection device is provided, including a perception module configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes that perform training tasks in a cluster system; receive the timing information fed back by the first node; and detect, according to the timing information, that there is a slow node in the cluster system.

According to another aspect of the present disclosure, a slow node detection device is provided, including a first node configured to: receive a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system; perform a collective communication operation based on the timing request, complete data exchange in the cluster system, and obtain timing information; and send the timing information to the perception module.

According to another aspect of the present disclosure, there is provided an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any method provided by the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute any one of the methods provided in the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, including computer instructions. When the computer instructions are executed by a processor, any method provided in the embodiments of the present disclosure is implemented.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.

Description of drawings

The accompanying drawings are used to better understand the present solution and do not constitute a limitation to the present disclosure. In the drawings:

FIG. 1 is a scene diagram of a distributed cluster system according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow diagram of a slow node detection method according to an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a cluster system architecture applied to an application example of an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of an architecture of an application example of an embodiment of the present disclosure to realize a perception process;

FIG. 6 is a schematic diagram of the composition and structure of a slow node detection device according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of the composition and structure of a slow node detection device according to an embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device used to implement the slow node detection method of the embodiment of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The term "and/or" in this document merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone. The term "at least one" herein means any one of multiple items or any combination of at least two of them; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set formed by A, B, and C. The terms "first" and "second" herein are used to refer to and distinguish multiple similar technical terms, and are not meant to impose an order or to limit the number to two; for example, a first feature and a second feature refer to two types of features, where the first feature can be one or more and the second feature can also be one or more.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific implementations. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, components, and circuits that are well known to those skilled in the art have not been described in detail, so as not to obscure the gist of the present disclosure.

In multiple nodes of a cluster system (such as a distributed cluster system), or in multiple electronic devices (such as terminal devices or servers) under one node, the processor may be a central processing unit (CPU) or a graphics processing unit (Graphics Processing Unit, GPU). With the development of deep learning technology, both CPUs and GPUs can carry out model training through deep learning, and the distributed cluster system can also be a sampling-based distributed machine learning system. Such training improves the operating efficiency of the above electronic devices and nodes, so that the cluster system has stronger software and hardware capabilities, such as operating efficiency and scheduling functions.

Taking GPUs as an example, GPU training has grown to the scale of thousands and even tens of thousands of cards. Large-scale GPU training tasks place high demands on the stability and ease of use of the service.

First, compared with traditional CPU training, GPU training has some distinctive characteristics. For example, the basic software and hardware iterate faster, so unpredictable situations arise that have to be dealt with in practice; power consumption and heat dissipation pressure are higher, so the hardware stays under heavy load for long periods and is prone to failure, while highly integrated hardware leads to low operation and maintenance efficiency; and the rapid expansion of training scale puts great pressure on network deployment and cluster scheduling. These characteristics can easily cause service or node failures in GPU training tasks, which in turn affect GPU training.

Second, large-scale training also has some distinctive characteristics. For example, the training scale is large and there is generally no redundant computing design, so a single point of failure affects the entire training task; training runs measured in days or months are very common, that is, training large models on large amounts of data takes a long time, and without timely alarms a problem may be discovered only after several days, causing great losses; and in synchronous mode, a single-point performance problem spreads to the entire large-scale training task.

To sum up, GPU training tasks, especially large-scale ones, are more likely to encounter failures, and the losses caused by failures are greater. Problematic nodes that cause a training task to fail or its performance to degrade can be called slow nodes. To ensure the stability of large-scale GPU training tasks and to reduce the overall operating cost of cluster operation and scheduling, such slow nodes need to be detected in time. However, for large-scale GPU training tasks, detection of such slow nodes has not been achievable.

With this disclosure, for the large-scale training tasks of nodes in the cluster system (not limited to the GPU large-scale training tasks in the above example), after the perception module initiates a timing request to the first node, the timing information fed back by the first node can be received. Therefore, it can be detected that there is a slow node in the cluster system according to the timing information, and the detection of the slow node is realized.

According to an embodiment of the present disclosure, FIG. 1 is a scene diagram of a distributed cluster system to which the slow node detection method of the embodiment of the present disclosure is applied. The distributed cluster system is one example of a cluster system, and FIG. 1 illustrates by way of example that the distributed cluster system can be used to perform model training to complete a training task. As shown in FIG. 1, the distributed cluster system 100 includes multiple nodes (such as server cluster 101, server 102, server cluster 103, server 104, and server 105; server 105 can also be connected to electronic devices, such as mobile phone 1051 and desktop computer 1052). The multiple nodes, as well as the multiple nodes together with the connected electronic devices, can jointly execute one or more model training tasks. Optionally, if the multiple nodes in the distributed cluster system adopt a data-parallel model training method, the multiple nodes can perform the training task based on the same training method to better train the model; if the multiple nodes adopt a model-parallel model training method, the multiple nodes can perform the training task based on different training methods to better train the model. Optionally, after each round of model training is completed, data exchange (such as data synchronization) can be performed between the multiple nodes.

According to an embodiment of the present disclosure, a slow node detection method is provided. FIG. 2 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure. The method can be applied to a slow node detection device. For example, the device can be deployed in a terminal, a server, or another processing device in the cluster system, and can perform processing such as timing and slow node detection when running. The terminal may be user equipment (UE, User Equipment), a mobile device, a personal digital assistant (PDA, Personal Digital Assistant), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in FIG. 2, the method is applied to the perception module in the cluster system and includes S201-S203.

S201. The perception module initiates a timing request to a first node, where the first node is one or more training nodes executing training tasks in a cluster system.

S202. The sensing module receives timing information fed back by the first node.

S203. The perception module detects that there is a slow node in the cluster system according to the timing information.

In an example of S201-S203, the first node, as a training node in the cluster system, may include a training framework program and a collective communication library (both can run on the CPU of the training node). The perception module can initiate a timing request to the training node, and the training framework program calls the collective communication library based on the timing request to perform a collective communication operation. Any training node in the cluster system may include a CPU and multiple GPUs; the CPU may issue the calculation instructions and communication instructions of the training process to the multiple GPUs, and the collective communication operation is carried out on the multiple GPUs, for example, to exchange data (such as data synchronization) between the GPUs when a large-scale training task runs on them. When performing the collective communication operation, the collective communication library can start a timing function to obtain the timing information. After receiving the timing information, the perception module can detect, according to the timing information, that there is a slow node in the cluster system.

With the embodiment of the present disclosure, for large-scale training tasks of nodes in the cluster system (not limited to the large-scale GPU training tasks in the above example), after the perception module initiates a timing request to the first node, it can receive the timing information fed back by the first node, and can therefore detect, according to the timing information, that there is a slow node in the cluster system, thereby realizing the detection of slow nodes.
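The request-and-reply flow of S201 and S202 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the name `collect_timing`, the `TimingHandler` type, and the node IDs are hypothetical, and each training node is modeled as a local callable rather than a remote endpoint.

```python
from typing import Callable, Dict

# Hypothetical type: a training node is modeled as a callable that handles a
# timing request and returns the measured duration in seconds.
TimingHandler = Callable[[], float]


def collect_timing(nodes: Dict[str, TimingHandler]) -> Dict[str, float]:
    """S201/S202: send a timing request to each first node and gather the replies."""
    timing = {}
    for node_id, handle_timing_request in nodes.items():
        # Request goes out, timing information comes back.
        timing[node_id] = handle_timing_request()
    return timing


# Stubbed nodes returning fixed durations (illustrative values only).
nodes = {"node-0": lambda: 0.8, "node-1": lambda: 0.9}
timing_info = collect_timing(nodes)
```

In a real cluster the callable would be replaced by whatever transport connects the perception module to the training framework program; the source does not specify one.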

In an embodiment, the sensing module detects that there is a slow node in the cluster system according to timing information, which may include: the sensing module detects that there is a slow node in the cluster system when the timing information is greater than a threshold. With this embodiment, by comparing the timing information fed back by the first node with the threshold, it is possible to detect that there is a slow node in the cluster system, thereby realizing the detection of the slow node.
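The threshold comparison in this embodiment can be sketched as a one-line check. The function name `slow_node_present`, the per-node dictionary layout, and the unit (seconds) are assumptions made for illustration; the source does not say how the threshold is chosen.

```python
from typing import Dict


def slow_node_present(timing: Dict[str, float], threshold_s: float) -> bool:
    """Return True if any reported duration is greater than the threshold."""
    return any(duration > threshold_s for duration in timing.values())
```

Note that a True result only indicates that a slow node exists somewhere among the participants; it does not identify which node is slow.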

In an embodiment, the slow node detection method may further include: the perception module initiates a request to suspend the training task to the first node, and the perception module runs a slow node detection program to detect the position of the slow node in the cluster system. With this embodiment, it is possible not only to detect that there is a slow node in the cluster system, but also to detect the position of the slow node by running the slow node detection program, so that troubleshooting of the slow node can be completed in time and its specific location in the cluster system identified.

In one embodiment, the perception module running the slow node detection program to detect the position of the slow node in the cluster system may include: the perception module cyclically executes a detection mode of collective communication detection to run the slow node detection program and detect the position of the slow node in the cluster system. The detection mode includes at least one of the following: single-machine detection, cluster (multi-machine) detection, bisection (dichotomy), and combinations thereof. With this embodiment, one detection mode or a combination of several can be used to run the corresponding slow node detection program, so that efficient troubleshooting methods such as single-machine, multi-machine, and bisection detection can be applied in the cluster system to locate the slow node faster and more accurately.

In an embodiment, the slow node detection method may further include: the perception module notifies the scheduling module of slow node information, where the slow node information is used to characterize the position of the slow node in the cluster system. The scheduling module (such as the module running the scheduler program) can be located at the first node (such as a training node in the cluster system) or at a second node that communicates with the first node (such as a control node in the cluster system). With this embodiment, the scheduling module can be deployed on the first node or on the second node according to actual needs. When the slow node information is received, the training tasks performed by the multiple training nodes in the cluster system can be scheduled. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task performed by the slow node (that is, the progress state of the slow node) is saved, so as to trigger an active-standby switchover operation (that is, a normal candidate node takes over the progress state of the slow node and continues to execute the training task).
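The bisection (dichotomy) mode can be illustrated with the sketch below. It assumes, for simplicity, that exactly one slow node is present and that a hypothetical probe `group_is_slow` can run a collective communication test on any subset of nodes and report whether that subset is slow; neither the probe nor the function name `locate_slow_node` comes from the source.

```python
from typing import Callable, List, Sequence


def locate_slow_node(
    nodes: Sequence[str],
    group_is_slow: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect the node list, assuming exactly one slow node is present."""
    if not nodes:
        raise ValueError("at least one node is required")
    candidates: List[str] = list(nodes)
    while len(candidates) > 1:
        half = len(candidates) // 2
        left, right = candidates[:half], candidates[half:]
        # Test one half; the slow node is in whichever half is slow.
        candidates = left if group_is_slow(left) else right
    return candidates[0]
```

Under these assumptions the number of probes grows logarithmically with the number of nodes, which is why bisection is attractive at the scale of thousands of cards. Handling several slow nodes at once would require testing both halves and is not shown here.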
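The switchover triggered by the slow node information can be sketched as follows. The `Scheduler` class, its fields, and the `on_slow_node` method are hypothetical names used only for illustration, and the progress state is reduced to a plain dictionary; the source does not describe the scheduler's data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scheduler:
    """Hypothetical scheduling module that swaps a slow node for a standby."""
    active: List[str]
    standby: List[str]
    progress: Dict[str, dict] = field(default_factory=dict)  # saved progress per node

    def on_slow_node(self, slow_node_id: str) -> str:
        if slow_node_id not in self.active:
            raise ValueError(f"{slow_node_id} is not an active training node")
        if not self.standby:
            raise RuntimeError("no healthy candidate node available")
        candidate = self.standby.pop(0)
        # Hand the slow node's saved progress state to the candidate node.
        self.progress[candidate] = self.progress.pop(slow_node_id, {})
        self.active[self.active.index(slow_node_id)] = candidate
        return candidate
```

The candidate takes the slow node's slot in the active list, so the training task can continue with the same number of participants.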

According to an embodiment of the present disclosure, a slow node detection method is provided. FIG. 3 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure. As shown in FIG. 3, the method is applied to a perception module in a cluster system and includes S301-S303.

S301. The first node receives a timing request initiated by the sensing module; wherein, the first node is one or more training nodes executing training tasks in a cluster system.

S302. The first node performs a collective communication operation based on the timing request, completes data exchange in the cluster system, and obtains timing information.

S303. The first node sends the timing information to the perception module, so that the perception module detects that there is a slow node in the cluster system according to the timing information.

In an example of S301-S303, the first node, as a training node in the cluster system, may include a training framework program and a collective communication library (both can run on the CPU of the training node). The training framework program may receive the timing request initiated by the perception module and call the collective communication library based on the timing request to perform a collective communication operation. Any training node in the cluster system may include a CPU and multiple GPUs; the CPU may issue the calculation instructions and communication instructions of the training process to the multiple GPUs, and the collective communication operation is carried out on the multiple GPUs, for example, to exchange data (such as data synchronization) between the GPUs when a large-scale training task runs on them. When performing the collective communication operation, the collective communication library can start a timing function to obtain the timing information. After the training node sends the timing information to the perception module, the perception module can detect, based on the timing information, that there is a slow node in the cluster system.

With the embodiment of the present disclosure, for large-scale training tasks of nodes in the cluster system (not limited to the large-scale GPU training tasks in the above example), the first node obtains timing information based on the timing request and sends it to the perception module, so that the perception module can detect that there are slow nodes in the cluster system, thereby realizing the detection of slow nodes.
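On the training node side, the timing function can be pictured as a thin wrapper around each collective call. The sketch below is an assumption-laden illustration: `TimedCollectives`, `on_timing_request`, and `run` are hypothetical names, and the collective operation is passed in as an ordinary callable instead of a call into a real collective communication library.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TimedCollectives:
    """Hypothetical wrapper that times collective calls once timing is enabled."""

    def __init__(self) -> None:
        self.timing_enabled = False
        self.records: List[float] = []

    def on_timing_request(self) -> None:
        # Called when the perception module's timing request arrives.
        self.timing_enabled = True

    def run(self, collective_op: Callable[[], T]) -> T:
        if not self.timing_enabled:
            return collective_op()
        start = time.perf_counter()
        result = collective_op()
        # Record the duration so it can be sent back as timing information.
        self.records.append(time.perf_counter() - start)
        return result
```

Keeping timing off until a request arrives avoids measurement overhead during normal training; whether the disclosed library behaves this way is not stated in the source.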

In an embodiment, the slow node detection method may further include: the first node receives a request for suspending the training task initiated by the sensing module; the first node responds to the request for suspending the training task, suspends the training task, and stores the progress status of the training task. The first node notifies the perception module to run a slow node detection program. With this embodiment, the perception module can run the slow node detection program through the notification fed back by the first node to detect the presence of slow nodes in the cluster system, thereby realizing the detection of slow nodes.
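The suspend-and-save behavior of the first node can be sketched as follows. All names here (`ProgressState`, `TrainingTask`, `on_suspend_request`, and the returned notification string) are hypothetical, and the progress state is reduced to two counters written to a JSON file; a real training task would also persist model and optimizer data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProgressState:
    """Hypothetical progress state of a training task."""
    step: int
    epoch: int


class TrainingTask:
    def __init__(self, state_path: str) -> None:
        self.state = ProgressState(step=0, epoch=0)
        self.state_path = state_path
        self.paused = False

    def on_suspend_request(self) -> str:
        # Suspend the task, persist its progress, then signal that detection may run.
        self.paused = True
        with open(self.state_path, "w", encoding="utf-8") as f:
            json.dump(asdict(self.state), f)
        return "run_slow_node_detection"  # assumed notification payload
```

Saving the state before detection starts means the task can later resume, on the same node or on a candidate node, from the recorded step.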

In an embodiment, the slow node detection method may further include: the scheduling module receives the slow node information sent by the perception module, and the slow node information is used to represent the position of the slow node in the cluster system. The scheduling module (such as the module running the scheduler program) can be located at the first node (such as a training node in the cluster system) or at a second node that communicates with the first node (such as a control node in the cluster system).

When the scheduling module is located at the first node (such as a training node in the cluster system), the first node accepts the scheduling control of the scheduling module and, according to the slow node information, hands the progress state of the training task performed by the slow node over to a normal candidate node, which continues to execute the training task. With this embodiment, after the scheduling module receives the slow node information, it can schedule the training tasks performed by the multiple training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task performed by the slow node (that is, the progress state of the slow node) is saved, and then an active-standby switchover is performed, that is, a normal candidate node (which has an active-standby switchover relationship with the slow node) takes over the progress state of the slow node and continues to execute the training task.

When the scheduling module is located at a second node that communicates with the first node (such as a control node in the cluster system), the first node receives the slow node information from the second node; that is, the slow node information is information that the second node forwards to the first node after accepting the scheduling control of the scheduling module. According to the slow node information, the first node hands the progress state of the training task performed by the slow node over to a normal candidate node, which continues to execute the training task. With this embodiment, after the scheduling module receives the slow node information, it can schedule the training tasks performed by the multiple training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task performed by the slow node (that is, the progress state of the slow node) is saved, and then an active-standby switchover is performed, that is, a normal candidate node (which has an active-standby switchover relationship with the slow node) takes over the progress state of the slow node and continues to execute the training task.

The method for detecting a slow node provided by the above-mentioned embodiments of the present disclosure will be illustrated below with an example.

Large-scale training tasks can be divided, according to the type of training nodes in the cluster system, into pure-CPU, pure-GPU, and mixed CPU-GPU configurations, and, according to the training mode, into synchronous training and asynchronous training. Synchronous training further includes the parameter server (Parameter Server, PS) architecture and the collective architecture: the PS architecture is the most commonly used distributed training architecture for deep learning, while the collective architecture is an architecture that implements multi-GPU collective communication. In large-scale training tasks, slow node detection is required, especially in the pure-GPU, synchronous-training, collective-architecture scenario, which is the most difficult case for slow node localization and therefore deserves special attention.

In a collective communication operation in the cluster system, if the collective communication is blocked, every participant slows down at once. As a result, the cause of the slowdown cannot be accurately analyzed from within the collective communication, nor can the abnormal training node be accurately located in the cluster system. In collective communication, the collective communication library provides high-performance inter-GPU or inter-CPU communication between training nodes. In a collective communication operation, the multiple training nodes in the cluster system complete their respective data calculation and transmission operations and then wait synchronously; no node perceives progress of the training task until all training nodes have completed the operation. In other words, a single slow node or a subset of slow nodes cannot be distinguished directly. Slow nodes therefore need to be detected more efficiently before and during training tasks.
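The reason a synchronous collective hides the slow node can be shown with a small thread-based simulation. This is a toy model under stated assumptions, not a real collective communication library: `simulate_synchronous_collective` is a hypothetical name, each node is a thread, local work is a sleep, and the synchronous wait is a `threading.Barrier`.

```python
import threading
import time
from typing import Dict


def simulate_synchronous_collective(compute_delay_s: Dict[str, float]) -> Dict[str, float]:
    """Each node works for its own delay, then waits at a barrier for all others."""
    barrier = threading.Barrier(len(compute_delay_s))
    elapsed: Dict[str, float] = {}
    t0 = time.perf_counter()

    def node(node_id: str, delay_s: float) -> None:
        time.sleep(delay_s)   # local computation and transmission
        barrier.wait()        # synchronous wait until every node has finished
        elapsed[node_id] = time.perf_counter() - t0

    threads = [threading.Thread(target=node, args=item) for item in compute_delay_s.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elapsed


# Illustrative delays: only node-2 is slow, yet every node observes a similar duration.
observed = simulate_synchronous_collective({"node-0": 0.0, "node-1": 0.0, "node-2": 0.2})
```

Because every node leaves the barrier at about the same moment, the per-node durations are nearly identical and reveal only that the group was slow, which is why a separate localization step is needed.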

Using the cluster system architecture shown in Figure 4 below and the perception architecture based on the perception process of the interaction between the perception module and the training nodes shown in Figure 5, slow nodes can be detected more efficiently before and during the execution of the training task.

FIG. 4 is a schematic diagram of a cluster system architecture applied to an application example of an embodiment of the present disclosure. The architecture includes one or more control nodes (such as control node 1 to control node M, where M is an integer greater than 1) and one or more training nodes (such as training node 1 to training node N, where N is an integer greater than 1) that communicate and interact with the control nodes through a network. Any control node can be deployed with a network card for communication and a CPU for issuing control instructions to any training node; any training node can be deployed with a network card for communication and a CPU for issuing control instructions to the GPUs.

As for the CPU and GPUs on a training node, the training framework program runs on the CPU of the training node in multi-process multi-thread or single-process multi-thread form, and sends the main calculation instructions and communication instructions of the training task to the GPUs of the training node. The collective communication library also runs on the CPU of the training node and provides high-performance inter-GPU or inter-CPU communication between multiple training nodes (such as data exchange between GPUs). One part of the scheduler program can run on the CPU of one or more training nodes, and the other part can run on the CPU of a control node that is outside these training nodes and can communicate with them. The scheduler program is used to allocate node resources and prepare the operating environment for training tasks. The slow nodes detected in this application example are all training nodes.

It should be pointed out that the perception module can be a module in the cluster system and is not limited to being deployed on a training node or a control node: it can be located on any node in the cluster system, or it can be set up independently, that is, inside the cluster system but configured separately from the nodes. Slow node detection and slow node localization can be performed through the perception process in which the perception module interacts with the training node (for example, with the training framework program and the collective communication library in the training node).

Fig. 5 is a schematic diagram of an architecture for realizing a perception process applied to an application example of an embodiment of the present disclosure. The diagram of the architecture shown in Fig. 5 includes the following content.

In this application example, slow node detection can be performed by the perception module, and other programs or modules can respond to the slow node detection. For example, a training framework program can be set up to run on the CPU of the training node, and a scheduler program can also be set up (one part of the scheduler program can run on the CPU of one or more training nodes, and another part can run on the CPU of a control node that is outside these training nodes and can communicate with them). The process of detecting the slow node may include the following steps 1-4.

1. The perception module sends the timing request to the training framework program (which can run on the CPU of the training node in multi-process multi-thread or single-process multi-thread mode), and the training framework program calls the collective communication library based on the timing request (the collective communication library runs on the CPU of the training node and can provide high-performance inter-GPU or inter-CPU communication between training nodes, such as data exchange between GPUs). When performing a collective communication operation, the collective communication library starts the timing function, executes the collective communication inside the cluster system, and times and records the operation. For example, it triggers data exchange between multiple GPUs in one training node, or between the GPUs of multiple training nodes, and finally completes the collective communication operation together with its timing and recording.

It should be pointed out that when the above collective communication operation is performed, all training nodes and/or electronic devices participating in the data exchange (a training node may include multiple electronic devices) conduct one collective data exchange (such as data synchronization) in response to the operation; this is not limited to the above-mentioned data exchange between GPUs. Considering large-scale training tasks on GPUs, this application example uses data exchange between GPUs as an example for description. While the data exchange is performed, the duration of the data exchange process can be recorded through a timing operation (for example, recorded as "time 1").
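The timing of a collective operation described in step 1 can be illustrated with a minimal sketch. It is not the disclosed implementation: `timed_collective` and `simulated_all_reduce` are hypothetical names, and the "collective" here is an in-process element-wise sum standing in for a real inter-GPU data exchange.

```python
import time
from typing import Callable, List, Tuple


def timed_collective(op: Callable[[], List[float]]) -> Tuple[List[float], float]:
    """Run a collective operation and return (result, elapsed seconds).

    `op` is a hypothetical stand-in for a call into a collective
    communication library; the elapsed time plays the role of "time 1".
    """
    start = time.perf_counter()
    result = op()
    elapsed = time.perf_counter() - start
    return result, elapsed


def simulated_all_reduce(per_rank_values: List[List[float]]) -> List[float]:
    """Element-wise sum across ranks, simulating a data synchronization."""
    return [sum(column) for column in zip(*per_rank_values)]


# Two simulated ranks, each contributing a two-element vector.
result, time_1 = timed_collective(
    lambda: simulated_all_reduce([[1.0, 2.0], [3.0, 4.0]])
)
```

In a real deployment the recorded duration would be reported back to the perception module as the timing information.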

2. The perception module judges whether "time 1" is slower than a time threshold, that is, whether it is greater than the time threshold. If it is greater, the result is determined to be "slow", that is, there are slow nodes in the current cluster system, and the perception module sends a request to suspend the training task to the training framework program running on the CPU of the training node. In other words, comparing "time 1" with the time threshold reveals that there are slow nodes among one or more training nodes of the cluster system, but it cannot determine their exact location; the slow node detection program is needed to locate the slow nodes exactly. The training task refers to a training task whose main process is completed on the GPU, and the above data exchange is a subtask of the training task.
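The comparison in step 2 can be sketched as follows. This is only an illustration under assumed names: `check_timing`, `DetectionResult`, and the action strings are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    slow: bool    # True if a slow node is presumed to exist in the cluster
    action: str   # hypothetical label for what the perception module does next


def check_timing(time_1: float, threshold: float) -> DetectionResult:
    """Compare the recorded duration with the time threshold.

    A duration strictly greater than the threshold only indicates that a
    slow node exists somewhere; it does not say which node is slow.
    """
    if time_1 > threshold:
        return DetectionResult(slow=True, action="suspend_training")
    return DetectionResult(slow=False, action="continue")


slow_case = check_timing(time_1=2.5, threshold=1.0)
normal_case = check_timing(time_1=0.4, threshold=1.0)
```

How the threshold is chosen is not specified in the text, so the values above are arbitrary.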

3. After receiving the request to suspend the training task initiated by the perception module, the training framework program running on the CPU of the training node suspends the current training task, saves the progress status of the multiple training nodes participating in the training task, and notifies the perception module to execute the slow node detection programs described below, which can be used individually or in combination. The perception module runs the slow node detection program to locate the slow node among the multiple training nodes, and notifies the scheduling module (the module that runs the scheduler program) of the slow node information (that is, information used to characterize the position of the slow node in the cluster system, such as the slow node ID).

Running the slow node detection program includes at least one of the following modes 3.1-3.3.

In mode 3.1, stand-alone detection is performed to check the basic environment, that is, the hardware status of all the training nodes participating in the training task is scanned to find the slow nodes.

In mode 3.2, cluster detection is performed: all basic environment configuration parameters of the cluster are scanned to find the slow nodes.

In mode 3.3, collective communication detection is performed cyclically by bisection (dichotomy) to find the slow nodes. Specifically, all the training nodes participating in the training task are partitioned into multiple sub-areas, and collective communication bandwidth and delay tests are performed on each sub-area to obtain its delay and bandwidth. The delay and bandwidth of each sub-area are compared with the expected delay and bandwidth. If the delay is greater than the expected value and/or the bandwidth is less than the expected value, it is determined that there is a slow node in the sub-area. That sub-area is then divided further, and the bisection is repeated until the slow node is located.
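The bisection in mode 3.3 can be sketched as a recursive search. The sketch rests on several assumptions that are not in the text: `locate_slow_nodes` and `make_simulated_probe` are hypothetical names, the probe is simulated (a group's delay is the worst per-node delay and its bandwidth is the worst per-node bandwidth), and a group of one node is treated as still measurable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A probe runs a collective delay/bandwidth test on a group of nodes and
# returns (delay, bandwidth). Here it is a hypothetical callable.
Probe = Callable[[Sequence[str]], Tuple[float, float]]


def locate_slow_nodes(nodes: Sequence[str], probe: Probe,
                      expected_delay: float,
                      expected_bandwidth: float) -> List[str]:
    """Bisect the node set until each slow node is isolated."""
    if not nodes:
        return []
    delay, bandwidth = probe(nodes)
    if delay <= expected_delay and bandwidth >= expected_bandwidth:
        return []                 # this sub-area meets expectations
    if len(nodes) == 1:
        return list(nodes)        # a single failing node is the slow node
    mid = len(nodes) // 2
    return (locate_slow_nodes(nodes[:mid], probe, expected_delay, expected_bandwidth)
            + locate_slow_nodes(nodes[mid:], probe, expected_delay, expected_bandwidth))


def make_simulated_probe(delays: Dict[str, float],
                         bandwidths: Dict[str, float]) -> Probe:
    """Simulated test: the slowest member dominates the group result."""
    def probe(group: Sequence[str]) -> Tuple[float, float]:
        return (max(delays[n] for n in group),
                min(bandwidths[n] for n in group))
    return probe


nodes = [f"n{i}" for i in range(8)]
delays = {n: 1.0 for n in nodes}
delays["n5"] = 9.0                # n5 is artificially made slow
bandwidths = {n: 100.0 for n in nodes}
probe = make_simulated_probe(delays, bandwidths)
slow = locate_slow_nodes(nodes, probe, expected_delay=2.0, expected_bandwidth=50.0)
```

With one slow node among N, this sketch needs on the order of log N rounds of probing, but each probe is itself a collective test, which is consistent with the text calling this mode time-consuming.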

To improve the detection speed of slow nodes, considering that mode 3.3 is time-consuming, mode 3.1 or mode 3.2 can be executed first, followed by mode 3.3. After a slow node is detected through mode 3.1 or mode 3.2, it is excluded from the set of training nodes to be tested, and mode 3.3 is then performed to check whether there are slow nodes among the remaining training nodes in the set.
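The ordering described above, cheap checks first and the expensive collective search on what remains, can be sketched as follows. All names (`detect_slow_nodes`, `cheap_checks`, `collective_search`) are hypothetical, and the checks are passed in as plain callables so the sketch stays independent of any particular detection program.

```python
from typing import Callable, Iterable, List, Sequence


def detect_slow_nodes(
    nodes: Sequence[str],
    cheap_checks: Iterable[Callable[[str], bool]],
    collective_search: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run per-node checks first, then search only the remaining nodes.

    Each cheap check returns True if it flags the node as slow (standing in
    for stand-alone or cluster detection). `collective_search` stands in for
    the slower collective communication search.
    """
    found: List[str] = []
    remaining = list(nodes)
    for check in cheap_checks:
        flagged = [n for n in remaining if check(n)]
        found.extend(flagged)
        # Exclude nodes already identified so they are not tested again.
        remaining = [n for n in remaining if n not in flagged]
    found.extend(collective_search(remaining))
    return found


searched: List[List[str]] = []   # records what the expensive search was given


def fake_collective_search(group: List[str]) -> List[str]:
    searched.append(list(group))
    return [n for n in group if n == "d"]


result = detect_slow_nodes(
    ["a", "b", "c", "d"],
    cheap_checks=[lambda n: n == "b"],   # pretend a hardware scan flags "b"
    collective_search=fake_collective_search,
)
```

Excluding the already-flagged node also keeps it from distorting the collective measurements taken on the remaining nodes.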

4. After the perception module obtains the slow node information through the slow node detection program, it sends the slow node information to the training framework program, and the training framework program synchronizes it to the scheduler program. Taking as an example the case where the scheduler program runs on a control node in the cluster system, the scheduler program can replace the located slow node and resume the training task so that it continues to run. Specifically, the scheduler program prepares a candidate node that has a switching relationship with the slow node (the candidate node is in a normal operating state), prepares the running environment of the training task for the candidate node, and copies the progress status of the slow node located by the perception module to the candidate node, so that the candidate node takes over the progress status of the slow node. The training task is resumed after the candidate node has taken over. In other words, during the entire slow node detection process, there is no need to stop the slow node first and then restart the training task.
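The replacement in step 4 can be sketched with an in-memory model of a suspended training task. This is an illustration only: `TrainingJob`, `replace_slow_node`, and the shape of the progress status are assumptions, and preparing the candidate's running environment is outside the sketch.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingJob:
    participants: List[str]       # IDs of nodes taking part in the task
    progress: Dict[str, dict]     # saved progress status per node
    paused: bool = True           # the task is suspended during detection


def replace_slow_node(job: TrainingJob, slow_node: str, candidate: str) -> None:
    """Move the slow node's saved progress to the candidate and resume."""
    if not job.paused:
        raise RuntimeError("the training task must be suspended first")
    if slow_node not in job.participants:
        raise ValueError(f"{slow_node} is not part of this training task")
    # Copy the saved progress status to the candidate node.
    job.progress[candidate] = copy.deepcopy(job.progress.pop(slow_node))
    # The candidate takes the slow node's place among the participants.
    job.participants[job.participants.index(slow_node)] = candidate
    # Resume without restarting the whole training task.
    job.paused = False


job = TrainingJob(
    participants=["n0", "n1", "n2"],
    progress={"n0": {"step": 120}, "n1": {"step": 120}, "n2": {"step": 120}},
)
replace_slow_node(job, slow_node="n1", candidate="standby0")
```

The other participants keep their saved state untouched, which is what allows the task to continue from the point of suspension.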

In related technologies, a synchronous collective communication mode is used. If a slow node in the cluster system fails, the other training nodes quit the training task and release their computing resources. When the fault is recovered, the training nodes obtain the training task and the model data as of the exit, and execute the training task again. Because the training nodes need to reload the model data and reallocate computing resources, stopping the slow node and then restarting the training task is very time-consuming. With this application example, by contrast, slow nodes can be detected and replaced without restarting the training task. This saves much of the time required to stop and recover training tasks under large-scale GPU training, so a trained model can be obtained more efficiently. The trained model can be deployed in the cluster system to improve the operating efficiency of the cluster system.

According to an embodiment of the present disclosure, a slow node detection device is provided. FIG. 6 is a schematic diagram of the composition and structure of the slow node detection device according to an embodiment of the present disclosure. As shown in FIG. 6, the slow node detection device 600 includes a perception module 601, configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes performing training tasks in a cluster system; receive timing information fed back by the first node; and detect, according to the timing information, that there is a slow node in the cluster system.

In an implementation manner, the sensing module 601 may be configured to detect that there is a slow node in the cluster system when the timing information is greater than a threshold.

In one embodiment, the sensing module 601 may be configured to initiate a request to the first node to suspend the training task; and run a slow node detection program to detect the position of the slow node in the cluster system .

In an embodiment, the perception module 601 may be configured to cyclically execute a detection mode in collective communication detection, so as to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection mode includes at least one item of a set consisting of stand-alone detection, cluster detection, dichotomy, and combinations thereof.

In one embodiment, the perception module 601 may be configured to notify the scheduling module of slow node information, where the slow node information is used to characterize the position of the slow node in the cluster system, and the scheduling module is located at the first node or at a second node that has communication interaction with the first node.

According to an embodiment of the present disclosure, a slow node detection device is provided. FIG. 7 is a schematic diagram of the composition and structure of the slow node detection device according to an embodiment of the present disclosure. As shown in FIG. 7, the slow node detection device 700 includes a first node 701, configured to: receive a timing request initiated by the perception module, wherein the first node 701 is one or more training nodes that perform training tasks in a cluster system; perform collective communication operations based on the timing request and complete the data exchange in the cluster system to obtain timing information; and send the timing information to the perception module.

In an embodiment, the first node 701 may be configured to: receive a request for suspending the training task initiated by the perception module; in response to the request, suspend the training task and store the progress status of the training task; and notify the perception module to run a slow node detection program.

In an embodiment, the slow node detection device 700 may further include a scheduling module located at the first node 701, configured to receive the slow node information sent by the perception module, where the slow node information is used to represent the position of the slow node in the cluster system. The first node 701 may be configured to accept the scheduling control of the scheduling module, transfer the progress status of the training task executed by the slow node to a normal candidate node according to the slow node information, and continue to execute the training task.

In an embodiment, the slow node detection device 700 may further include a scheduling module located at the second node, configured to receive the slow node information sent by the perception module, where the slow node information is used to indicate the position of the slow node in the cluster system. The first node 701 may be configured to: receive the slow node information, which is the information forwarded to the first node 701 after the second node accepts the scheduling control of the scheduling module, wherein there is communication interaction between the second node and the first node 701; and, according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node and continue to execute the training task.

In an implementation manner, the candidate node and the slow node have an active/standby switchover relationship.

For the functions of the modules in the devices in the embodiments of the present disclosure, reference may be made to the corresponding descriptions in the foregoing methods, and details are not repeated here.

In the technical solution of the present disclosure, the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.

According to the embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 8, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data necessary for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Multiple components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, etc.; an output unit 807, such as various types of displays, speakers, etc.; a storage unit 808, such as a magnetic disk, an optical disk etc.; and a communication unit 809, such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 801 executes the various methods and processes described above, such as the slow node detection method. For example, in some embodiments, the slow node detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed on the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the slow node detection method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to execute the slow node detection method in any other appropriate manner (for example, by means of firmware).

Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor can be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.

To provide for interaction with the user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

The specific implementation manners described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims

1. A slow node detection method, comprising:

a perception module initiating a timing request to a first node, wherein the first node is one or more training nodes performing training tasks in a cluster system;

the perception module receiving timing information fed back by the first node; and

the perception module detecting, according to the timing information, that there is a slow node in the cluster system.
2. The method according to claim 1, wherein the perception module detecting, according to the timing information, that the slow node exists in the cluster system comprises:

the perception module detecting that the slow node exists in the cluster system when the timing information is greater than a threshold.
3. The method according to claim 1, further comprising:

the perception module initiating a request to the first node to suspend the training task; and

the perception module running a slow node detection program to detect the position of the slow node in the cluster system.
4. The method according to claim 3, wherein the perception module running the slow node detection program to detect the position of the slow node in the cluster system comprises:

the perception module cyclically executing a detection mode in collective communication detection to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection mode includes at least one item of a set, the set including stand-alone detection, cluster detection, dichotomy, and combinations thereof.
5. The method according to any one of claims 1-4, further comprising:

the perception module notifying a scheduling module of slow node information, wherein the slow node information is used to characterize the position of the slow node in the cluster system;

wherein the scheduling module is located at the first node, or at a second node that communicates with the first node.
6. A slow node detection method, comprising:

a first node receiving a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system;

the first node performing collective communication operations based on the timing request, completing data exchange in the cluster system, and obtaining timing information; and

the first node sending the timing information to the perception module.
7. The method according to claim 6, further comprising:

the first node receiving a request to suspend the training task initiated by the perception module;

the first node, in response to the request to suspend the training task, suspending the training task and storing the progress status of the training task; and

the first node notifying the perception module to run a slow node detection program.
8. The method according to claim 7, further comprising:

a scheduling module receiving slow node information sent by the perception module, wherein the slow node information is used to characterize the position of the slow node in the cluster system; and

when the scheduling module is located at the first node, the first node accepting the scheduling control of the scheduling module, transferring the progress status of the training task executed by the slow node to a normal candidate node according to the slow node information, and continuing to execute the training task.
9. The method according to claim 7, further comprising:

a scheduling module receiving slow node information sent by the perception module, wherein the slow node information is used to represent the position of the slow node in the cluster system;

when the scheduling module is located at a second node that communicates with the first node, the first node receiving the slow node information, wherein the slow node information is the information forwarded to the first node after the second node accepts the scheduling control of the scheduling module; and

the first node transferring the progress status of the training task executed by the slow node to a normal candidate node according to the slow node information, and continuing to execute the training task.
10. The method according to claim 8 or 9, wherein the candidate node has an active/standby switching relationship with the slow node.
11. A slow node detection device, including a perception module, configured to:

initiate a timing request to a first node, wherein the first node is one or more training nodes executing training tasks in a cluster system;

receive timing information fed back by the first node; and

detect, according to the timing information, that there is a slow node in the cluster system.
12. The device according to claim 11, wherein the perception module is configured to:

detect that the slow node exists in the cluster system when the timing information is greater than a threshold.
13. The device according to claim 11, wherein the perception module is configured to:

initiate a request to the first node to suspend the training task; and

run a slow node detection program to detect the position of the slow node in the cluster system.
14. The device according to claim 13, wherein the perception module is configured to:

cyclically execute a detection mode in collective communication detection to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection mode includes at least one item of a set, the set including stand-alone detection, cluster detection, dichotomy, and combinations thereof.
15. The device according to any one of claims 11-14, wherein the perception module is configured to:

notify a scheduling module of slow node information, wherein the slow node information is used to characterize the position of the slow node in the cluster system;

wherein the scheduling module is located at the first node, or at a second node that communicates with the first node.
16. A slow node detection device, including a first node, configured to:

receive a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system;

perform collective communication operations based on the timing request, complete data exchange in the cluster system, and obtain timing information; and

send the timing information to the perception module.
17. The device according to claim 16, wherein the first node is configured to:

receive a request to suspend the training task initiated by the perception module;

in response to the request to suspend the training task, suspend the training task and store the progress status of the training task; and

notify the perception module to run a slow node detection program.
18. The device according to claim 17, further comprising:

a scheduling module located at the first node, configured to receive slow node information sent by the perception module, wherein the slow node information is used to characterize the position of the slow node in the cluster system;

wherein the first node is configured to accept the scheduling control of the scheduling module, transfer the progress status of the training task executed by the slow node to a normal candidate node according to the slow node information, and continue to execute the training task.
19. The device according to claim 17, further comprising:

a scheduling module located at a second node, configured to receive slow node information sent by the perception module, wherein the slow node information is used to characterize the position of the slow node in the cluster system;

wherein the first node is configured to:

receive the slow node information, wherein the slow node information is the information that the second node forwards to the first node after accepting the scheduling control of the scheduling module, and there is communication interaction between the second node and the first node; and

according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node, and continue to execute the training task.
20. The device according to claim 18 or 19, wherein the candidate node has an active/standby switching relationship with the slow node.
21. An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method according to any one of claims 1-10.
22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.