WO2023115975A1 - Slow node detection method and apparatus, electronic device, and storage medium - Google Patents

Slow node detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023115975A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
slow
training
cluster system
information
Prior art date
Application number
PCT/CN2022/111137
Inventor
付浩瀚
王雁鹏
黎世勇
孙鹏
张恒华
骆宝童
张建宇
王帅俭
刘伟
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023115975A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The present disclosure relates to the technical field of artificial intelligence, and in particular to cluster systems, distributed machine learning, node failure detection, and related fields.
  • Large-scale model training tasks can be performed based on artificial intelligence technology to obtain models with higher processing efficiency.
  • Deploying the trained model in the cluster system can improve the overall operating efficiency of the cluster system.
  • The disclosure provides a slow node detection method and device, an electronic device, and a storage medium.
  • A slow node detection method is provided, including: a perception module initiates a timing request to a first node, wherein the first node is one or more training nodes that perform training tasks in a cluster system; the perception module receives timing information fed back by the first node; and the perception module detects, according to the timing information, that there is a slow node in the cluster system.
  • A slow node detection method is provided, including: a first node receives a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system; the first node performs a collective communication operation based on the timing request, completes data exchange in the cluster system, and obtains timing information; and the first node sends the timing information to the perception module.
  • A slow node detection device is provided, including a perception module configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes that perform training tasks in a cluster system; receive timing information fed back by the first node; and detect, according to the timing information, that there is a slow node in the cluster system.
  • A slow node detection device is provided, including a first node configured to: receive a timing request initiated by a perception module, wherein the first node is one or more training nodes that perform training tasks in a cluster system; perform a collective communication operation based on the timing request, complete data exchange in the cluster system, and obtain timing information; and send the timing information to the perception module.
  • An electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any method provided by the embodiments of the present disclosure.
  • A non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to make a computer execute any one of the methods provided in the embodiments of the present disclosure.
  • A computer program product is provided, including computer instructions which, when executed by a processor, implement any method provided in the embodiments of the present disclosure.
  • FIG. 1 is a scenario diagram of a distributed cluster system according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a cluster system architecture in an application example of an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of an architecture that implements the perception process in an application example of an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the composition and structure of a slow node detection device according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of the composition and structure of a slow node detection device according to an embodiment of the present disclosure.
  • FIG. 8 is a block diagram of an electronic device used to implement the slow node detection method of the embodiment of the present disclosure.
  • The terms "first" and "second" herein are used to refer to and distinguish between multiple similar technical terms; they do not limit the order, nor do they limit the number to two. For example, a first feature and a second feature refer to two types of features; there may be one or more first features, and one or more second features.
  • Taking GPUs as an example, as the scale of GPU training increases, it has reached the level of thousands or even tens of thousands of cards. Large-scale GPU training tasks place high demands on the stability and ease of use of the service.
  • GPU training has several characteristics. For example, the basic software and hardware iterate quickly, so unpredictable situations must be handled in practice. As another example, power consumption and heat dissipation pressure are high, so the hardware stays under high load for long periods and is prone to failure, and highly integrated hardware leads to low operation and maintenance efficiency. As a further example, the rapid expansion of training scale puts great pressure on network deployment and cluster scheduling.
  • These characteristics make service or node failures likely in GPU training tasks, which affects GPU training.
  • Large-scale training also has several characteristics. For example, the training scale is large and there is generally no redundant computing design, so a single point of failure affects the entire training task. As another example, training runs that last days or months are common; that is, training large models on large amounts of data takes a long time, and without timely alarms a problem may go undiscovered for days and cause great losses. As a further example, in synchronous mode a single-point performance problem spreads to the entire large-scale training task.
  • GPU training tasks, especially large-scale ones, are more likely to encounter failures, and the losses caused by failures are greater. Faults need to be detected in time to ensure the stability of large-scale GPU training tasks and to reduce the overall cost of cluster operation and scheduling.
  • Problematic nodes that cause training task failures or performance degradation can be called slow nodes.
  • In existing approaches, such slow nodes cannot be detected.
  • With the present disclosure, the perception module initiates a timing request to the first node and receives the timing information fed back by the first node. The presence of a slow node in the cluster system can therefore be detected according to the timing information, which realizes slow node detection.
  • FIG. 1 is a scenario diagram of a distributed cluster system to which the slow node detection method of the embodiment of the present disclosure is applied.
  • The distributed cluster system is one example of a cluster system; the figure illustrates model training being performed with the distributed cluster system to complete a training task.
  • The distributed cluster system 100 includes a plurality of nodes (such as server cluster 101, server 102, server cluster 103, server 104, and server 105; server 105 may also be connected to electronic devices such as mobile phone 1051 and desktop computer 1052). The nodes, and the nodes together with the connected electronic devices, can jointly execute one or more model training tasks.
  • If the nodes in the distributed cluster system adopt data-parallel model training, the nodes can perform training tasks based on the same training method to better train the model; if the nodes adopt model-parallel model training, the nodes can perform training tasks based on different training methods to better train the model.
  • Data exchange (such as data synchronization) can be performed between the nodes.
  • FIG. 2 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure.
  • The method can be applied to a slow node detection device.
  • For example, when the device is deployed on a terminal, a server, or another processing device in the cluster system, it can perform processing such as timing and slow node detection.
  • The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like.
  • The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.
  • The method is applied to the perception module in the cluster system and includes S201-S203.
  • S201: The perception module initiates a timing request to a first node, where the first node is one or more training nodes executing training tasks in a cluster system.
  • S202: The perception module receives timing information fed back by the first node.
  • S203: The perception module detects, according to the timing information, that there is a slow node in the cluster system.
  • The first node, as a training node in the cluster system, may include a training framework program and a collective communication library (both can run on the CPU of the training node). The perception module can initiate a timing request to the training node, and the training framework program calls the collective communication library based on the timing request to perform collective communication operations.
  • Any training node in the cluster system may include a CPU and multiple GPUs. During training, the CPU can issue computation instructions and communication instructions to the GPUs so that the collective communication operation is carried out on the GPUs; for example, when large-scale training tasks run on multiple GPUs, data exchange (such as data synchronization) between the GPUs is realized.
  • The collective communication library can start a timing function to obtain the timing information. After receiving the timing information, the perception module can detect the presence of slow nodes in the cluster system according to it.
  • In this way, the timing information fed back by the first node enables the detection of slow nodes in the cluster system.
  • The perception module detecting, according to the timing information, that there is a slow node in the cluster system may include: the perception module detects that there is a slow node in the cluster system when the timing information is greater than a threshold.
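  • A minimal sketch of this threshold check follows. The function name, the representation of the timing information as a list of durations in seconds, and the threshold value are illustrative assumptions; the disclosure does not specify them.

```python
from typing import Iterable


def slow_node_present(timing_info_s: Iterable[float], threshold_s: float) -> bool:
    """Return True when any reported duration is greater than the threshold.

    timing_info_s is a hypothetical representation of the timing information:
    one collective-communication duration (in seconds) per reporting node.
    """
    return any(duration > threshold_s for duration in timing_info_s)


# Example usage with made-up durations and a made-up threshold.
needs_investigation = slow_node_present([0.8, 0.9, 2.4], threshold_s=1.5)
```

  • A positive result only indicates that some slow node exists; it does not say which node is slow, which is why a separate locating step follows.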
  • The slow node detection method may further include: the perception module initiates a request to the first node to suspend the training task, and the perception module runs a slow node detection program to detect the position of the slow node in the cluster system.
  • The perception module running the slow node detection program to detect the position of the slow node in the cluster system may include: the perception module cyclically executes a detection mode of collective communication detection to run the slow node detection program and detect the position of the slow node in the cluster system, where the detection mode includes at least one of stand-alone detection, cluster detection, dichotomy, and combinations thereof.
  • One detection mode or a combination of detection modes can be used to run the corresponding slow node detection program. Efficient troubleshooting methods such as stand-alone detection, multi-machine detection, and dichotomy can thus be applied in the cluster system to locate the slow node faster and more accurately.
  • The slow node detection method may further include: the perception module notifies the scheduling module of the slow node information, where the slow node information characterizes the position of the slow node in the cluster system.
  • The scheduling module (such as the module running the scheduler program) can be deployed, according to actual needs, on the first node (such as a training node in the cluster system) or on a second node that communicates with the first node (such as a control node in the cluster system).
  • When the scheduling module receives the slow node information, it can schedule the training tasks performed by the training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the training nodes, the progress state of the training task performed by the slow node (that is, the progress state of the slow node) is saved, and a master-standby switchover is triggered (that is, a normal candidate node takes over the progress state of the slow node and continues to execute the training task).
  • FIG. 3 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure. As shown in FIG. 3, the method is applied to the first node in a cluster system and includes S301-S303.
  • S301: The first node receives a timing request initiated by the perception module, where the first node is one or more training nodes executing training tasks in a cluster system.
  • S302: The first node performs a collective communication operation based on the timing request, completes data exchange in the cluster system, and obtains timing information.
  • S303: The first node sends the timing information to the perception module, so that the perception module detects, according to the timing information, that there is a slow node in the cluster system.
  • The first node may include a training framework program and a collective communication library (both can run on the CPU of the training node). The training framework program may receive the timing request initiated by the perception module and call the collective communication library based on the timing request to perform collective communication operations.
  • Any training node in the cluster system may include a CPU and multiple GPUs. During training, the CPU can issue computation instructions and communication instructions to the GPUs so that the collective communication operation is carried out on the GPUs; for example, when large-scale training tasks run on multiple GPUs, data exchange (such as data synchronization) between the GPUs is realized.
  • The collective communication library can start a timing function to obtain the timing information. After the training node sends the timing information to the perception module, the perception module can detect the presence of slow nodes in the cluster system based on the timing information.
  • In this way, the perception module can detect that there are slow nodes in the cluster system, thereby realizing slow node detection.
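  • The timing of a collective communication operation on the first node can be sketched as follows. This is an illustration under stated assumptions: `all_reduce_sum` is a pure-Python stand-in for a real collective operation, and `timed_collective` and the `timing_info` dictionary layout are hypothetical names, not the API of any actual collective communication library.

```python
import time
from typing import Callable, List, Tuple


def all_reduce_sum(per_rank_values: List[float]) -> List[float]:
    """Stand-in for a collective operation: every rank receives the global sum."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


def timed_collective(
    op: Callable[[List[float]], List[float]],
    per_rank_values: List[float],
) -> Tuple[List[float], float]:
    """Run one collective operation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = op(per_rank_values)
    elapsed_s = time.perf_counter() - start
    return result, elapsed_s


# The elapsed time is what the node would report back as timing information.
result, elapsed_s = timed_collective(all_reduce_sum, [1.0, 2.0, 3.0])
timing_info = {"operation": "all_reduce_sum", "elapsed_s": elapsed_s}
```

  • A monotonic clock (`time.perf_counter`) is used here so that wall-clock adjustments do not distort the measured duration.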
  • The slow node detection method may further include: the first node receives a request, initiated by the perception module, to suspend the training task; in response to the request, the first node suspends the training task and stores the progress state of the training task; and the first node notifies the perception module to run a slow node detection program.
  • Based on the notification fed back by the first node, the perception module can run the slow node detection program to detect the slow nodes in the cluster system, thereby realizing slow node detection.
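  • The suspend-and-notify handshake on the first node might look like the following sketch. The `TrainingNode` class, its fields, and the callback used to notify the perception module are all hypothetical; the progress state is reduced to a step counter for illustration.

```python
from typing import Callable, Dict, Optional


class TrainingNode:
    """Hypothetical training node that can pause and keep its progress state."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.running = True
        self.step = 0
        self.saved_progress: Optional[Dict[str, object]] = None

    def train_one_step(self) -> None:
        """Advance training by one step, unless the task is suspended."""
        if self.running:
            self.step += 1

    def handle_pause_request(self, notify_perception: Callable[[str], None]) -> None:
        """Suspend training, store the progress state, then notify the perception module."""
        self.running = False
        self.saved_progress = {"node_id": self.node_id, "step": self.step}
        notify_perception(self.node_id)


notified = []
node = TrainingNode("train-0")
for _ in range(3):
    node.train_one_step()
node.handle_pause_request(notified.append)
node.train_one_step()  # has no effect while the task is suspended
```

  • Storing the progress state before notifying the perception module means the detection program only starts once the state needed for a later handover is safe.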
  • The slow node detection method may further include: the scheduling module receives the slow node information sent by the perception module, where the slow node information represents the position of the slow node in the cluster system.
  • The scheduling module (such as the module running the scheduler program) can be located at the first node (such as a training node in the cluster system) or at a second node that communicates with the first node (such as a control node in the cluster system).
  • When the scheduling module is located at the first node (such as a training node in the cluster system), the first node accepts the scheduling control of the scheduling module and, according to the slow node information, hands the progress state of the training task performed by the slow node over to a normal candidate node, which continues to execute the training task.
  • The scheduling module can schedule the training tasks performed by the training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the training nodes, the progress state of the training task performed by the slow node (that is, the progress state of the slow node) is saved, and a master-standby switchover is then performed; that is, a normal candidate node (a node that has an active-standby switchover relationship with the slow node) takes over the progress state of the slow node and continues to execute the training task.
  • When the scheduling module is located at a second node that communicates with the first node (such as a control node in the cluster system), the first node receives the slow node information from the second node; that is, the slow node information is forwarded to the first node by the second node after the second node accepts the scheduling control of the scheduling module. According to the slow node information, the first node hands the progress state of the training task executed by the slow node over to a normal candidate node, which continues to execute the training task.
  • Large-scale training tasks can be divided into pure-CPU, pure-GPU, and mixed CPU and GPU settings according to the type of training nodes in the cluster system, and into synchronous training and asynchronous training according to the training mode.
  • Synchronous training further includes the parameter server (PS) architecture and the collective architecture. The PS architecture is the most commonly used distributed training architecture for deep learning; the collective architecture is an architecture that implements collective communication among multiple GPUs.
  • Slow node detection is required, especially in scenarios that combine pure GPU, synchronous training, and the collective architecture. This is the most difficult case for locating slow nodes and requires special attention.
  • The collective communication library can provide high-performance inter-GPU or inter-CPU communication between training nodes.
  • In synchronous training, the training nodes in the cluster system complete their own data computation and transmission operations and then wait synchronously; no node perceives the progress of the training task until all training nodes have completed the operation. As a result, a single slow node or a subset of slow nodes cannot be directly distinguished, and slow nodes need to be detected more efficiently before and during training tasks.
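  • The following small simulation illustrates why a slow node cannot be singled out from inside a synchronous collective step. It is a simplified model with made-up numbers: the time each node observes is assumed to be the maximum of the per-node local work times, and the function name is hypothetical.

```python
from typing import Dict


def synchronous_step_times(local_work_s: Dict[str, float]) -> Dict[str, float]:
    """Time each node observes for one synchronous collective step.

    Every node waits for the slowest participant, so all nodes observe the
    same duration regardless of how fast their own local work was.
    """
    step_s = max(local_work_s.values())
    return {node: step_s for node in local_work_s}


healthy = synchronous_step_times({"n0": 1.0, "n1": 1.1, "n2": 0.9})
degraded = synchronous_step_times({"n0": 1.0, "n1": 4.0, "n2": 0.9})
```

  • In the degraded case every node reports the same inflated duration, so the timing alone shows that the step is slow but not which node caused it.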
  • FIG. 4 is a schematic diagram of a cluster system architecture in an application example of an embodiment of the present disclosure. The architecture includes one or more control nodes (such as control node 1 to control node M, where M is an integer greater than 1) and one or more training nodes that communicate and interact with the control nodes through the network (such as training node 1 to training node N, where N is an integer greater than 1).
  • Any control node may be deployed with a network card for communication and a CPU for issuing control instructions to any training node; any training node may be deployed with a network card for communication and a CPU for issuing control instructions to the GPU.
  • The training framework program runs on the CPU of the training node in multi-process multi-thread or single-process multi-thread form, and sends the main computation instructions and communication instructions to the GPU of the training node during execution of the training task.
  • The collective communication library also runs on the CPU of the training node to provide high-performance inter-GPU or inter-CPU communication between multiple training nodes (such as data exchange between GPUs).
  • Part of the scheduler program can run on the CPU of one or more training nodes, and the other part can run on the CPU of a control node that is outside these training nodes and can communicate with them.
  • The scheduler program is used for node resource allocation and operating environment preparation for training tasks.
  • The slow nodes detected in this application example are all training nodes.
  • The perception module can be a module in the cluster system. It is not limited to being deployed on a training node or control node: it can be located on any node in the cluster system, or it can be set up independently, that is, inside the cluster system but configured separately from the nodes. Slow node detection and slow node location can be performed through the perception process in which the perception module interacts with the training node (for example, with the training framework program and the collective communication library in the training node).
  • FIG. 5 is a schematic diagram of an architecture that implements the perception process in an application example of an embodiment of the present disclosure.
  • The architecture shown in FIG. 5 includes the following content.
  • Slow node detection can be performed through the perception module, and other programs/modules can also be used to respond to the slow node detection.
  • The process of detecting the slow node may include the following steps 1-4.
  • The perception module sends the timing request to the training framework program (which can run on the CPU of the training node in multi-process multi-thread or single-process multi-thread mode), and the training framework program calls the collective communication library based on the timing request (the collective communication library runs on the CPU of the training node and can provide high-performance inter-GPU or inter-CPU communication between training nodes, such as data exchange between GPUs).
  • The collective communication library starts the timing function, executes the collective communication inside the cluster system, and times and records the operation. For example, it triggers data exchange between multiple GPUs in one training node, or between GPUs in multiple training nodes, and finally completes the collective communication operation together with its timing and recording.
  • In response to the collective communication operation, all training nodes and/or electronic devices participating in the data exchange (a training node may include multiple electronic devices) perform a collective data exchange (such as data synchronization), which is not limited to the data exchange between GPUs described above.
  • This application example uses data exchange between GPUs for description. The data exchange is performed, and its duration is recorded through a timing operation (for example, recorded as "time 1").
  • The perception module judges whether "time 1" is greater than a time threshold. If it is, the result is determined to be "slow", that is, there are slow nodes in the current cluster system, and the perception module sends a request to suspend the training task to the training framework program running on the CPU of the training node.
  • In other words, comparing "time 1" with the time threshold shows that there are slow nodes among the training nodes of the cluster system, but it does not give their exact location; the slow node detection program is needed to locate them.
  • The training task refers to a training task whose main process is completed on the GPU, and the above data exchange is a subtask of the training task.
  • The training framework program running on the CPU of the training node suspends the current training task, saves the progress state of the training nodes participating in the training task, and notifies the perception module to execute the slow node detection programs described below, which can be used individually or in combination.
  • The perception module runs the slow node detection program to locate the slow node among the training nodes, and notifies the scheduling module (the module that runs the scheduler program) of the slow node information (information that characterizes the position of the slow node in the cluster system, such as the slow node ID).
  • Running the slow node detection program includes at least one of the following manners 3.1-3.3.
  • Stand-alone detection is performed to check the basic environment; that is, the hardware status of all training nodes participating in the training task is scanned to find the slow nodes.
  • Collective communication detection is performed cyclically by dichotomy to find the slow nodes.
  • All training nodes participating in the training task are partitioned into multiple sub-areas, and collective communication bandwidth and delay tests are performed on each sub-area to obtain its delay and bandwidth.
  • The delay and bandwidth of each sub-area are compared with the expected delay and bandwidth. If the delay is greater than the expected value and/or the bandwidth is less than the expected value, it is determined that there is a slow node in the sub-area. That sub-area is then divided further, and the dichotomy is repeated until the slow node is located.
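  • The dichotomy described above can be sketched as a recursive bisection. This is an illustration, not the disclosed implementation: the per-node `(latency_ms, bandwidth_gbps)` measurements are simulated, the group test stands in for a real collective communication bandwidth and delay test on a sub-area, and all names and expected values are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

GroupTest = Callable[[Sequence[str]], bool]


def make_group_test(
    measured: Dict[str, Tuple[float, float]],
    expected_latency_ms: float,
    expected_bandwidth_gbps: float,
) -> GroupTest:
    """Build a simulated sub-area test from per-node (latency, bandwidth) values."""

    def group_is_slow(group: Sequence[str]) -> bool:
        # A collective over the group is limited by its worst participant.
        worst_latency = max(measured[n][0] for n in group)
        worst_bandwidth = min(measured[n][1] for n in group)
        return (
            worst_latency > expected_latency_ms
            or worst_bandwidth < expected_bandwidth_gbps
        )

    return group_is_slow


def locate_slow_nodes(nodes: Sequence[str], group_is_slow: GroupTest) -> List[str]:
    """Bisect the node list, descending only into sub-areas that test as slow."""
    if not nodes or not group_is_slow(nodes):
        return []
    if len(nodes) == 1:
        return [nodes[0]]
    mid = len(nodes) // 2
    return locate_slow_nodes(nodes[:mid], group_is_slow) + locate_slow_nodes(
        nodes[mid:], group_is_slow
    )


nodes = [f"node-{i}" for i in range(8)]
measured = {n: (1.0, 100.0) for n in nodes}
measured["node-5"] = (9.0, 20.0)  # simulated degraded node
slow = locate_slow_nodes(nodes, make_group_test(measured, 2.0, 80.0))
```

  • With a single slow node among N nodes, this performs on the order of log2(N) rounds of sub-area tests rather than testing every node individually. The sketch assumes slowness is attributable to a node; a fault on a link between two halves would need a different test.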
  • the perception module After the perception module obtains the slow node information through the slow node detection program, it sends the slow node information to the training framework program, and the training framework program is synchronized to the scheduler program.
  • the scheduler program can replace the located slow node and resume the training task so that it can continue to run.
  • Prepare the candidate node of the switching relationship (the candidate node is in a normal operating state)
  • prepare the running environment of the training task for the candidate node prepare the running environment of the training task for the candidate node, and copy the progress status of the slow node located by the perception module to the candidate node , thus, replace the progress status of the slow node with the candidate node.
  • Resume the training task after replacing the progress state of the slow node with the candidate node. In other words, during the entire slow node detection process, there is no need to stop the slow node first and then restart the training task.
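The replacement steps above (prepare a healthy candidate, prepare its environment, copy the progress state, resume) can be sketched as below. The `Node` class, its fields, and `replace_slow_node` are hypothetical names introduced only for illustration; the disclosure does not specify how the progress state is represented.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    healthy: bool = True
    env_ready: bool = False
    progress: Dict[str, int] = field(default_factory=dict)


def replace_slow_node(workers: List[Node], slow_id: str, candidate: Node) -> List[Node]:
    """Return a new worker list in which the candidate has taken over the slow node's slot."""
    if not candidate.healthy:
        raise ValueError("candidate node must be in a normal operating state")
    slow = next(node for node in workers if node.node_id == slow_id)
    candidate.env_ready = True                         # prepare the running environment
    candidate.progress = copy.deepcopy(slow.progress)  # copy the slow node's progress state
    # The other workers are left untouched, so the task resumes rather than restarts.
    return [candidate if node.node_id == slow_id else node for node in workers]


if __name__ == "__main__":
    workers = [Node("a", progress={"step": 10}), Node("b", progress={"step": 10})]
    resumed = replace_slow_node(workers, "b", Node("c"))
    print([node.node_id for node in resumed])
```

The sketch copies the state rather than sharing it, so later changes on the retired slow node cannot leak into the candidate.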
  • A synchronous collective communication mode is used. If a slow node exists in the cluster system and a failure occurs, the other training nodes quit the training task and release their computing resources. When the fault is recovered, the training nodes obtain the training task again, together with the model data from the time of exit, and re-execute the training task. Because the training nodes need to reload the model data and reallocate computing resources, stopping the slow node and then restarting the training task is very time-consuming.
  • With this application example, slow node detection and replacement can be realized without restarting the training task. This saves much of the time needed to stop and recover training tasks in large-scale GPU training, reduces the time cost, and yields a trained model more efficiently.
  • The trained model can be deployed in the cluster system to improve the operating efficiency of the cluster system.
  • FIG. 6 is a schematic diagram of the composition and structure of the slow node detection device according to an embodiment of the present disclosure.
  • The slow node detection device 600 includes a perception module 601, configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes performing training tasks in a cluster system; receive timing information fed back by the first node; and detect, according to the timing information, that a slow node exists in the cluster system.
  • the sensing module 601 may be configured to detect that there is a slow node in the cluster system when the timing information is greater than a threshold.
  • The sensing module 601 may be configured to initiate a request to the first node to suspend the training task, and to run a slow node detection program to detect the position of the slow node in the cluster system.
  • The perception module 601 may be configured to cyclically execute a detection mode of collective communication detection, so as to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection mode includes at least one of stand-alone detection, cluster detection, bisection (dichotomy), and combinations thereof.
  • The sensing module 601 may be configured to notify the scheduling module of slow node information, the slow node information being used to characterize the position of the slow node in the cluster system, wherein the scheduling module is located at the first node or at a second node that has communication interaction with the first node.
  • FIG. 7 is a schematic diagram of the composition and structure of the device for detecting a slow node according to an embodiment of the present disclosure.
  • The slow node detection device 700 includes a first node 701, configured to: receive a timing request initiated by the sensing module, wherein the first node 701 is one or more training nodes that perform training tasks in a cluster system; perform a collective communication operation based on the timing request and complete the data exchange in the cluster system to obtain timing information; and send the timing information to the sensing module.
  • The first node 701 may be configured to: receive a request for suspending the training task initiated by the perception module; in response to the request, suspend the training task and store the progress status of the training task; and notify the perception module to run a slow node detection program.
  • The slow node detection device 700 may further include a scheduling module located at the first node 701, configured to receive the slow node information sent by the perception module, the slow node information being used to represent the position of the slow node in the cluster system.
  • The first node 701 may be configured to accept the scheduling control of the scheduling module, replace the slow node with a normal candidate node that takes over the progress status of the training task executed by the slow node according to the slow node information, and continue to execute the training task.
  • The slow node detection device 700 may further include a scheduling module located at the second node, configured to receive the slow node information sent by the sensing module, the slow node information being used to indicate the position of the slow node in the cluster system.
  • The first node 701 may be configured to: receive the slow node information, which is forwarded to the first node 701 after the second node accepts the scheduling control of the scheduling module, wherein there is communication interaction between the second node and the first node 701; and, according to the slow node information, replace the slow node with a normal candidate node that takes over the progress status of the training task executed by the slow node, and continue to execute the training task.
  • the candidate node and the slow node have an active/standby switchover relationship.
  • the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure.
  • The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data necessary for the operation of the electronic device 800.
  • the computing unit 801, ROM 802, and RAM 803 are connected to each other through a bus 804.
  • An input/output (I/O) interface 805 is also connected to the bus 804 .
  • Multiple components of the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disk; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller.
  • The computing unit 801 executes the various methods and processes described above, such as the slow node detection method.
  • the slow node detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808 .
  • part or all of the computer program can be loaded and/or installed on the electronic device 800 via the ROM 802 and/or the communication unit 809.
  • When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the slow node detection method described above can be performed.
  • the computing unit 801 may be configured to execute the slow node detection method in any other appropriate manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • Examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • The systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., a data server), a computing system that includes middleware components (e.g., an application server), a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • the server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A slow node detection method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence, and in particular, to the fields such as cluster systems, distributed machine learning, and node fault detection. The specific implementation solution comprises: a sensing module initiates a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system (S201); the sensing module receives timing information fed back by the first node (S202); and the sensing module detects the existence of a slow node in the cluster system according to the timing information (S203).

Description

Slow node detection method and apparatus, electronic device, and storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular to fields such as cluster systems, distributed machine learning, and node fault detection.
Background
Across multiple nodes of a cluster system, or multiple electronic devices (such as terminal devices or servers) under one node, large-scale model training tasks can be executed based on artificial intelligence technology to obtain a model with higher processing efficiency. Deploying the trained model in the cluster system can improve the overall operating efficiency of the cluster system.
Summary
The present disclosure provides a slow node detection method and apparatus, an electronic device, and a storage medium.
According to one aspect of the present disclosure, a slow node detection method is provided, including: a perception module initiates a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system; the perception module receives timing information fed back by the first node; and the perception module detects, according to the timing information, that a slow node exists in the cluster system.
According to another aspect of the present disclosure, a slow node detection method is provided, including: a first node receives a timing request initiated by a perception module, wherein the first node is one or more training nodes executing a training task in a cluster system; the first node performs a collective communication operation based on the timing request, completes the data exchange in the cluster system, and obtains timing information; and the first node sends the timing information to the perception module.
According to another aspect of the present disclosure, a slow node detection apparatus is provided, including a perception module configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system; receive timing information fed back by the first node; and detect, according to the timing information, that a slow node exists in the cluster system.
According to another aspect of the present disclosure, a slow node detection apparatus is provided, including a first node configured to: receive a timing request initiated by a perception module, wherein the first node is one or more training nodes executing a training task in a cluster system; perform a collective communication operation based on the timing request, complete the data exchange in the cluster system, and obtain timing information; and send the timing information to the perception module.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any method provided by the embodiments of the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to execute any method provided by the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, including computer instructions that, when executed by a processor, implement any method provided by the embodiments of the present disclosure.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings
The accompanying drawings are provided for a better understanding of the solution and do not constitute a limitation on the present disclosure. In the drawings:
FIG. 1 is a scene diagram of a distributed cluster system according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a slow node detection method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a cluster system architecture in an application example of an embodiment of the present disclosure;
FIG. 5 is a schematic architecture diagram for implementing the perception flow in an application example of an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the composition and structure of a slow node detection apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the composition and structure of a slow node detection apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device used to implement the slow node detection method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean that A exists alone, that A and B exist together, or that B exists alone. The term "at least one" herein means any one of multiple items or any combination of at least two of multiple items; for example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set consisting of A, B, and C. The terms "first" and "second" herein refer to and distinguish between multiple similar technical terms; they neither impose an order nor limit the number to two. For example, "first feature" and "second feature" refer to two types of features (or two features); there may be one or more first features, and likewise one or more second features.
In addition, numerous specific details are given in the following detailed description to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can be practiced without some of these details. In some instances, methods, means, components, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Central processing units (CPUs) and graphics processing units (GPUs) can be used in multiple nodes of a cluster system (such as a distributed cluster system) or in multiple electronic devices (such as terminal devices or servers) under one node. With the development of deep learning technology, both CPUs and GPUs can perform model training through deep learning. The distributed cluster system may also be a sampling-based distributed machine learning system, so as to improve the operating efficiency of the above electronic devices and nodes and thereby strengthen the software and hardware capabilities of the cluster system, such as its operating efficiency and scheduling functions.
Taking GPUs as an example, GPU training has grown to the scale of thousands or even tens of thousands of cards, and large-scale GPU training tasks place high demands on the stability and ease of use of the service.
First, compared with traditional CPU training, GPU training has several characteristics. The underlying software and hardware iterate faster, but unforeseeable situations arise that have to be handled in practice. Power consumption and heat dissipation pressure are higher, the hardware stays under heavy load for long periods and is prone to failure, and highly integrated hardware lowers operation and maintenance efficiency. The training scale is expanding rapidly, which puts great pressure on network deployment and cluster scheduling. These characteristics make GPU training tasks more prone to service or node failures, which affects GPU training.
Second, large-scale training also has its own characteristics. The training scale is large and generally designed without redundant computation, so a single point of failure affects the entire training task. Training large models on large volumes of data commonly takes days or months of computation; in other words, the training time is long, and without timely alarms a problem may be discovered only days later, causing considerable loss. In synchronous mode, a single-point performance problem propagates to the whole large-scale training task.
In summary, GPU training tasks, especially large-scale ones, are more likely to encounter failures, and the losses caused by failures are greater. Failures need to be discovered in time to guarantee the stability of large-scale GPU training tasks and reduce the overall cost of cluster operation and scheduling. A problem node that causes a training task to fail or its performance to degrade can be called a slow node. To guarantee the stability of large-scale GPU training tasks and reduce the overall cost of cluster operation and scheduling, the slow node needs to be detected. However, for large-scale GPU training tasks, detection of the slow node cannot be realized.
With the present disclosure, for large-scale training tasks of nodes in a cluster system (not limited to the large-scale GPU training tasks in the example above), the perception module can initiate a timing request to the first node and then receive the timing information fed back by the first node, so that the existence of a slow node in the cluster system can be detected according to the timing information. Slow node detection is thereby realized.
According to an embodiment of the present disclosure, FIG. 1 is a scene diagram of a distributed cluster system to which the slow node detection method of the embodiment is applied. The distributed cluster system is one example of a cluster system, and the figure illustrates that model training can be performed with it to complete a training task. As shown in FIG. 1, the distributed cluster system 100 includes multiple nodes (such as server cluster 101, server 102, server cluster 103, server 104, and server 105; server 105 may also be connected to electronic devices such as mobile phone 1051 and desktop computer 1052). The multiple nodes, and the multiple nodes together with the connected electronic devices, can jointly execute one or more model training tasks. Optionally, the multiple nodes in the distributed cluster system may adopt a data-parallel model training approach, in which case they can execute the training task based on the same training method to better train the model; if the multiple nodes adopt a model-parallel training approach, they can execute the training task based on different training methods to better train the model. Optionally, after each round of model training is completed, data exchange (such as data synchronization) can be performed among the multiple nodes.
According to an embodiment of the present disclosure, a slow node detection method is provided. FIG. 2 is a schematic flowchart of the slow node detection method according to an embodiment of the present disclosure. The method can be applied to a slow node detection apparatus; for example, the apparatus can be deployed on a terminal, a server, or another processing device in the cluster system and, when running, can perform processing such as timing and slow node detection. The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method can also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in FIG. 2, the method is applied to the perception module in the cluster system and includes S201-S203.
S201. The perception module initiates a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system.
S202. The perception module receives the timing information fed back by the first node.
S203. The perception module detects, according to the timing information, that a slow node exists in the cluster system.
In an example of S201-S203, the first node, as a training node in the cluster system, may include a training framework program and a collective communication library (both of which may be placed on the CPU of the training node). The perception module may initiate a timing request to the training node, and the training framework program calls the collective communication library based on the timing request to perform a collective communication operation. Any training node in the cluster system may include one CPU and multiple GPUs; the CPU may issue the computation instructions and communication instructions of the training process to the multiple GPUs, and the collective communication operation is carried out on the multiple GPUs. For example, when a large-scale training task runs on multiple GPUs, data exchange (such as data synchronization) among the GPUs is carried out. When the collective communication operation is performed, the collective communication library may start a timing function to obtain the timing information. After receiving the timing information, the perception module may detect, according to the timing information, that a slow node exists in the cluster system.
With this embodiment of the present disclosure, for a large-scale training task of nodes in the cluster system (not limited to the large-scale GPU training task in the above example of FIG. 2), the perception module may initiate a timing request to the first node and then receive the timing information fed back by the first node, so that the existence of a slow node in the cluster system can be detected according to the timing information, thereby achieving slow node detection.
In an implementation, the perception module detecting, according to the timing information, that a slow node exists in the cluster system may include: the perception module detecting that a slow node exists in the cluster system when the timing information is greater than a threshold. With this implementation, the timing information fed back by the first node is compared with the threshold to detect that a slow node exists in the cluster system, thereby achieving slow node detection.
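The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `TimingReport` structure, the `cluster_has_slow_node` function, and the numeric values are all hypothetical names and data introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimingReport:
    """Timing information fed back by one training node (hypothetical structure)."""
    node_id: str
    elapsed_s: float  # duration of the timed collective communication operation


def cluster_has_slow_node(reports: List[TimingReport], threshold_s: float) -> bool:
    """Return True if any reported duration is greater than the threshold."""
    return any(report.elapsed_s > threshold_s for report in reports)


# Made-up example values: node-2 reports a duration above the 1.5 s threshold.
reports = [TimingReport("node-1", 0.8), TimingReport("node-2", 2.4)]
slow_detected = cluster_has_slow_node(reports, threshold_s=1.5)
print(slow_detected)  # prints True
```

Note that this check only answers whether a slow node exists; it does not say where the slow node is, which is the job of the slow node detection program.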
In an implementation, the slow node detection method may further include: the perception module initiating a request to the first node to suspend the training task, and the perception module running a slow node detection program to detect the position of the slow node in the cluster system. With this implementation, not only can the existence of a slow node in the cluster system be detected, but the position of the slow node in the cluster system can also be detected by the running slow node detection program, so that the slow node can be tracked down in the cluster system in a timely manner and its specific position located.
In an implementation, the perception module running the slow node detection program to detect the position of the slow node in the cluster system may include: the perception module cyclically executing a detection mode of collective communication detection to run the slow node detection program and detect the position of the slow node in the cluster system, where the detection mode includes at least one of a set consisting of single-machine detection, cluster detection, bisection, and combinations thereof. With this implementation, one of the detection modes, or a combination of them, can be used to run the corresponding slow node detection program and detect the position of the slow node in the cluster system, so that more efficient means of tracking down slow nodes, such as single-machine detection, multi-machine detection, and bisection, are available in the cluster system and the specific position of the slow node can be located faster and more accurately.
In an implementation, the slow node detection method may further include: the perception module notifying a scheduling module of slow node information, where the slow node information characterizes the position of the slow node in the cluster system. The scheduling module (such as a module running a scheduler program) may be located on the first node (such as a training node in the cluster system) or on a second node that communicates with the first node (such as a control node in the cluster system). With this implementation, the scheduling module can be deployed, as actually needed, on the first node or on the second node. After receiving the slow node information, the scheduling module can schedule the training tasks executed by the multiple training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task executed by the slow node (that is, the progress state of the slow node) is saved so as to trigger an active/standby switchover (that is, the progress state of the slow node is transferred to a normal standby node, which continues to execute the training task).
According to an embodiment of the present disclosure, a slow node detection method is provided. FIG. 3 is a schematic flowchart of the slow node detection method according to an embodiment of the present disclosure. As shown in FIG. 3, the method is applied to the perception module in the cluster system and includes S301-S303.
S301. The first node receives a timing request initiated by the perception module, where the first node is one or more training nodes that execute a training task in the cluster system.
S302. The first node performs a collective communication operation based on the timing request, completes the data exchange in the cluster system, and obtains timing information.
S303. The first node sends the timing information to the perception module, so that the perception module detects, according to the timing information, that a slow node exists in the cluster system.
In an example of S301-S303, the first node, as a training node in the cluster system, may include a training framework program and a collective communication library (both of which may be placed on the CPU of the training node). The training framework program may receive the timing request initiated by the perception module and call the collective communication library based on the timing request to perform a collective communication operation. Any training node in the cluster system may include one CPU and multiple GPUs; the CPU may issue the computation instructions and communication instructions of the training process to the multiple GPUs, and the collective communication operation is carried out on the multiple GPUs. For example, when a large-scale training task runs on multiple GPUs, data exchange (such as data synchronization) among the GPUs is carried out. When the collective communication operation is performed, the collective communication library may start a timing function to obtain the timing information. After the training node sends the timing information to the perception module, the perception module can detect, according to the timing information, that a slow node exists in the cluster system.
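The node-side handling in S301-S303 can be illustrated with a small sketch. It assumes a stand-in `CollectiveLibrary` class whose data exchange is an arbitrary callable; the class names, the `handle_timing_request` method, and the dictionary layout of the timing information are hypothetical and are not taken from any real collective communication library.

```python
import time
from typing import Callable, Dict, Union


class CollectiveLibrary:
    """Stand-in for a collective communication library with a timing function."""

    def __init__(self, exchange: Callable[[], None]) -> None:
        # `exchange` is a hypothetical callable that performs the data exchange.
        self._exchange = exchange

    def timed_exchange(self) -> float:
        """Run the data exchange once and return its duration in seconds."""
        start = time.perf_counter()
        self._exchange()
        return time.perf_counter() - start


class TrainingFramework:
    """Stand-in for the training framework program on a training node's CPU."""

    def __init__(self, node_id: str, library: CollectiveLibrary) -> None:
        self._node_id = node_id
        self._library = library

    def handle_timing_request(self) -> Dict[str, Union[str, float]]:
        """S301-S303: receive the request, time the collective, return the result."""
        elapsed_s = self._library.timed_exchange()
        return {"node_id": self._node_id, "elapsed_s": elapsed_s}


# The sleep below merely simulates a data exchange among GPUs.
framework = TrainingFramework("node-1", CollectiveLibrary(lambda: time.sleep(0.01)))
timing_info = framework.handle_timing_request()
print(timing_info["node_id"], f"{timing_info['elapsed_s']:.3f}s")
```

Keeping the timing inside the library wrapper mirrors the description, in which the collective communication library starts the timing function when the operation is performed.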
With this embodiment of the present disclosure, for a large-scale training task of nodes in the cluster system (not limited to the large-scale GPU training task in the above example of FIG. 3), the training framework program may feed the timing information back to the perception module, so that the perception module can detect that a slow node exists in the cluster system, thereby achieving slow node detection.
In an implementation, the slow node detection method may further include: the first node receiving a request, initiated by the perception module, to suspend the training task; the first node, in response to the request, suspending the training task and storing the progress state of the training task; and the first node notifying the perception module to run the slow node detection program. With this implementation, the notification fed back by the first node causes the perception module to run the slow node detection program to detect that a slow node exists in the cluster system, thereby achieving slow node detection.
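The suspend-and-notify sequence above might look like the following sketch. The `TrainingNode` class, its fields, and the use of a step counter as the "progress state" are assumptions made for illustration; the disclosure does not specify what the progress state contains.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TrainingNode:
    """Hypothetical training node that can suspend its task on request."""
    node_id: str
    step: int = 0                      # assumed stand-in for training progress
    paused: bool = False
    saved_state: Optional[Dict[str, int]] = None

    def handle_pause_request(self, notify_perception: Callable[[str], None]) -> None:
        """Suspend the task, store the progress state, then notify the perception module."""
        self.paused = True
        self.saved_state = {"step": self.step}
        notify_perception(self.node_id)


# The perception module is modeled as a list that collects notifications.
ready_nodes: List[str] = []
node = TrainingNode("node-1", step=1200)
node.handle_pause_request(ready_nodes.append)
print(node.paused, node.saved_state, ready_nodes)
```

The order matters in this sketch: the progress state is stored before the notification is sent, so the detection program only starts once the state is safe.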
In an implementation, the slow node detection method may further include: the scheduling module receiving the slow node information sent by the perception module, where the slow node information characterizes the position of the slow node in the cluster system. The scheduling module (such as a module running a scheduler program) may be located on the first node (such as a training node in the cluster system) or on a second node that communicates with the first node (such as a control node in the cluster system).
When the scheduling module is located on the first node (such as a training node in the cluster system), the first node accepts the scheduling control of the scheduling module and, according to the slow node information, transfers the progress state of the training task executed by the slow node to a normal standby node, which continues to execute the training task. With this implementation, after receiving the slow node information, the scheduling module can schedule the training tasks executed by the multiple training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task executed by the slow node (that is, the progress state of the slow node) is saved, and an active/standby switchover is then performed: the progress state of the slow node is transferred to a normal standby node (which has an active/standby switchover relationship with the slow node), and the training task continues to be executed.
In an implementation, the slow node detection method may further include: the scheduling module receiving the slow node information sent by the perception module, where the slow node information characterizes the position of the slow node in the cluster system. The scheduling module (such as a module running a scheduler program) may be located on the first node (such as a training node in the cluster system) or on a second node that communicates with the first node (such as a control node in the cluster system).
When the scheduling module is located on a second node that communicates with the first node (such as a control node in the cluster system), the first node receives the slow node information from the second node; that is, the slow node information is information that the second node forwards to the first node after accepting the scheduling control of the scheduling module. According to the slow node information, the first node transfers the progress state of the training task executed by the slow node to a normal standby node, which continues to execute the training task. With this implementation, after receiving the slow node information, the scheduling module can schedule the training tasks executed by the multiple training nodes in the cluster system. For example, after the slow node and its position in the cluster system are detected among the multiple training nodes, the progress state of the training task executed by the slow node (that is, the progress state of the slow node) is saved, and an active/standby switchover is then performed: the progress state of the slow node is transferred to a normal standby node (which has an active/standby switchover relationship with the slow node), and the training task continues to be executed.
The slow node detection method provided by the above embodiments of the present disclosure is illustrated below by way of example.
Large-scale training tasks can be classified, by the type of training node in the cluster system, into pure-CPU, pure-GPU, and mixed CPU-GPU configurations, and, by training mode, into synchronous training and asynchronous training. Synchronous training further includes the parameter server (PS) architecture and the collective architecture: the PS architecture is the distributed training architecture most commonly used in deep learning, and the collective architecture is an architecture that implements collective communication among multiple GPUs. Slow node detection is needed in large-scale training tasks; slow node detection for the pure-GPU, synchronous-training, collective-architecture scenario in particular is the most difficult research direction in slow node localization and a problem that deserves special attention.
When collective communication operations are implemented in a cluster system and collective communication is blocked, one slow node makes all nodes slow; the cause of the slow node in the collective communication cannot be analyzed precisely, and so the training node with the slow node anomaly cannot be located precisely in the cluster system. In collective communication, the collective communication library can provide high-performance inter-GPU or inter-CPU communication among training nodes. In one collective communication operation, the multiple training nodes in the cluster system wait synchronously after completing their own data computation and transmission operations, and each node perceives the progress of the training task only after all training nodes have completed the operation. In other words, a single slow node or a subset of slow nodes cannot be distinguished directly. Slow nodes need to be detected more efficiently both before and while the training task runs.
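Why a single slow node cannot be told apart from the others can be shown with a deliberately simplified model. The sketch assumes that every participant in a synchronous collective operation finishes only when the slowest participant finishes, and it ignores network effects; the function name and the numbers are made up for illustration.

```python
from typing import Dict


def observed_durations(local_work_s: Dict[str, float]) -> Dict[str, float]:
    """Duration each node observes for one synchronous collective operation.

    Simplifying assumption: every node waits for the slowest participant,
    so all nodes observe the same (maximum) duration.
    """
    slowest = max(local_work_s.values())
    return {node_id: slowest for node_id in local_work_s}


# node-3 is the only slow participant, yet every node observes 1.9 s.
local_work = {"node-1": 0.2, "node-2": 0.2, "node-3": 1.9}
observed = observed_durations(local_work)
print(observed)
```

Because the observed durations are identical, the timing result alone reveals that something is slow but not which node is responsible.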
With the cluster system architecture shown in FIG. 4 below and the perception architecture shown in FIG. 5, which is based on a perception flow in which the perception module interacts with the training nodes, slow nodes can be detected more efficiently before and while the training task runs.
FIG. 4 is a schematic diagram of a cluster system architecture in an application example of an embodiment of the present disclosure. The architecture includes one or more control nodes (such as control node 1 to control node M, where M is an integer greater than 1) and one or more training nodes that communicate with the control nodes over a network (such as training node 1 to training node N, where N is an integer greater than 1). Any control node may be provided with a network card for communication and a CPU for issuing control instructions to any training node; any training node may be provided with a network card for communication and a CPU for issuing control instructions to GPUs.
As for the CPU and GPUs on a training node, the training framework program runs on the CPU of the training node in a multi-process multi-thread or single-process multi-thread manner, so as to issue the main computation instructions and communication instructions of the training task to the GPUs of the training node. The collective communication library also runs on the CPU of the training node to provide high-performance inter-GPU or inter-CPU communication among multiple training nodes (such as data exchange among GPUs). One part of the scheduler program may run on the CPUs of one or more training nodes, and another part may run on the CPU of a control node that is outside these training nodes and can communicate with them; the scheduler program is responsible for allocating node resources and preparing the running environment for the training task. The slow nodes detected in this application example are all training nodes.
It should be noted that the perception module may be a module in the cluster system. It is not limited to being deployed on a training node or a control node in the cluster system: it may be located on any node in the cluster system, or it may be set up independently, that is, located in the cluster system but configured separately from the nodes. Slow node detection and slow node localization can be performed through the perception flow in which the perception module interacts with the training node (for example, with the training framework program and the collective communication library in the training node).
FIG. 5 is a schematic diagram of an architecture that implements the perception flow in an application example of an embodiment of the present disclosure. The architecture shown in FIG. 5 involves the following.
In this application example, slow node detection may be performed by the perception module, and other programs or modules may respond to the slow node detection. For example, a training framework program running on the CPU of the training node may be provided, and a scheduler program may also be provided (one part of the scheduler program may run on the CPUs of one or more training nodes, and another part on the CPU of a control node that is outside these training nodes and can communicate with them). The slow node detection flow may include steps 1-4 below.
1. The perception module sends a timing request to the training framework program (which may run on the CPU of the training node in a multi-process multi-thread or single-process multi-thread manner). The training framework program calls the collective communication library based on the timing request (the library runs on the CPU of the training node and can provide high-performance inter-GPU or inter-CPU communication among training nodes, such as data exchange among GPUs). When the collective communication operation is performed, the collective communication library starts the timing function, executes the collective communication inside the cluster system, and times and records the operation. For example, data exchange among multiple GPUs in one training node, or among the GPUs of multiple training nodes, is triggered, and the operation is timed and recorded once the collective communication operation finally completes.
It should be noted that, when the above collective communication operation is performed, all training nodes and/or electronic devices participating in the data exchange (one training node may include multiple electronic devices) may, in response to the operation, perform one collective data exchange (such as data synchronization). This is not limited to the data exchange among GPUs described above; in view of large-scale training tasks on GPUs, this application example uses data exchange among GPUs for illustration. The data exchange is performed, and the duration of the data exchange process may be recorded by the timing operation (for example, recorded as "time 1").
2. The perception module determines whether "time 1" is slower than a time threshold, that is, whether it is greater than the time threshold. If it is, the result is determined to be "slow", meaning that a slow node currently exists in the cluster system, and the perception module sends a request to suspend the training task to the training framework program running on the CPU of the training node. In other words, comparing "time 1" with the time threshold reveals that a slow node exists among the one or more training nodes of the cluster system, but it cannot locate the exact position of the slow node; that requires the slow node detection program of step 3 below. The training task here refers to a training task whose main process is completed on the GPUs, and the above data exchange is a subtask of the training task.
3. After receiving the request to suspend the training task from the perception module, the training framework program running on the CPU of the training node suspends the current training task, saves the progress state of the multiple training nodes participating in the training task, and notifies the perception module to execute the slow node detection programs below, which may be used individually or in combination. Running the slow node detection program locates the slow node among the multiple training nodes, and the located slow node information (which characterizes the position of the slow node in the cluster system, such as a slow node ID) is notified to the scheduling module (the module running the scheduler program).
Running the slow node detection program includes at least one of modes 3.1-3.3 below.
In mode 3.1, single-machine detection is performed to check the basic environment: the hardware state of all the training nodes participating in the training task is scanned to find slow nodes.
In mode 3.2, cluster detection is performed: all basic environment configuration parameters of the cluster are scanned to find slow nodes.
In mode 3.3, collective communication detection is executed cyclically by bisection to find slow nodes. Specifically, all the training nodes participating in the training task are partitioned into multiple sub-partitions, and a collective communication bandwidth and latency test is run on each sub-partition to obtain its latency and bandwidth. The latency and bandwidth of each sub-partition are compared with their expected values; if the latency is greater than the expected value and/or the bandwidth is less than the expected value, the sub-partition is determined to contain a slow node. That sub-partition is divided further, and the bisection is repeated until the slow node is located.
To speed up slow node detection, and because mode 3.3 is relatively time-consuming, mode 3.1 or mode 3.2 may be executed first, followed by mode 3.3. After a slow node is detected by mode 3.1 or mode 3.2, it is excluded from the set of all training nodes currently under test, and mode 3.3 is then executed to check whether slow nodes exist among the training nodes remaining in the set.
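The ordering described above (cheap checks first, then bisection on what remains) can be sketched as follows. The function `detect_slow_nodes`, the check callables, and the `fake_bisect` stand-in are hypothetical; in particular, the bisection is replaced by a trivial lookup so that the example stays self-contained.

```python
from typing import Callable, List, Sequence, Set


def detect_slow_nodes(
    nodes: Sequence[str],
    cheap_checks: Sequence[Callable[[str], bool]],
    bisect_locator: Callable[[Sequence[str]], List[str]],
) -> List[str]:
    """Run the cheap per-node checks first, then bisect only the remaining nodes.

    Each cheap check returns True when it flags a node as slow.
    """
    flagged = [n for n in nodes if any(check(n) for check in cheap_checks)]
    remaining = [n for n in nodes if n not in flagged]
    return flagged + bisect_locator(remaining)


# Made-up findings for illustration.
bad_hardware: Set[str] = {"n2"}   # would be found by single-machine detection
bad_config: Set[str] = set()      # would be found by cluster detection
slow_links: Set[str] = {"n5"}     # only visible to the collective test
tested_by_bisection: List[str] = []


def fake_bisect(remaining: Sequence[str]) -> List[str]:
    """Stand-in for mode 3.3; records which nodes it was asked to test."""
    tested_by_bisection.extend(remaining)
    return [n for n in remaining if n in slow_links]


nodes = [f"n{i}" for i in range(6)]
result = detect_slow_nodes(
    nodes,
    cheap_checks=[lambda n: n in bad_hardware, lambda n: n in bad_config],
    bisect_locator=fake_bisect,
)
print(result)               # prints ['n2', 'n5']
print(tested_by_bisection)  # prints ['n0', 'n1', 'n3', 'n4', 'n5']
```

The second line of output shows the point of the ordering: `n2` never reaches the time-consuming collective test because the cheap check has already excluded it.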
4. After obtaining the slow node information through the slow node detection program, the perception module sends the slow node information to the training framework program, which synchronizes it to the scheduler program. Taking the case where the scheduler program runs on a control node in the cluster system as an example, the scheduler program may replace the located slow node and resume the training task so that it continues to run. For example, a standby node that has an active/standby switchover relationship with the slow node is found (the standby node is in a normal running state), the running environment of the training task is prepared for the standby node, and the progress state of the slow node located by the perception module is copied to the standby node, so that the progress state of the slow node is transferred to the standby node. The training task is resumed after the transfer. In other words, throughout the slow node detection process there is no need to first stop the slow node and then restart the training task.
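The replacement in step 4 can be sketched as copying the slow node's progress state to a standby node and swapping the two in the active set. The `Node` class, the `switch_over` function, and the use of a step counter as progress state are assumptions for illustration; preparing the running environment on the standby node is outside the scope of this sketch.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical node with a saved progress state."""
    node_id: str
    progress: Dict[str, int] = field(default_factory=dict)
    running: bool = False


def switch_over(active: List[Node], slow_id: str, standby: Node) -> List[Node]:
    """Return the active set with the slow node replaced by the standby node."""
    if all(node.node_id != slow_id for node in active):
        raise ValueError(f"slow node {slow_id!r} is not in the active set")
    replaced: List[Node] = []
    for node in active:
        if node.node_id == slow_id:
            # Copy the progress state so the standby node continues from it.
            standby.progress = copy.deepcopy(node.progress)
            standby.running = True
            node.running = False
            replaced.append(standby)
        else:
            replaced.append(node)
    return replaced


active_nodes = [
    Node("n0", {"step": 1200}, running=True),
    Node("n1", {"step": 1200}, running=True),
]
standby_node = Node("standby-0")
active_nodes = switch_over(active_nodes, slow_id="n1", standby=standby_node)
print([n.node_id for n in active_nodes], standby_node.progress)
```

Only the slow node is swapped out; the other nodes keep their state, which matches the stated goal of resuming the task without restarting it.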
The related art uses a synchronization-based collective communication mode. If a slow node exists in the cluster system and a fault occurs, the other training nodes exit the training task and release the computing resources of the training nodes. After the fault is recovered, the training nodes obtain the training task again, exit the current model data, and re-execute the training task. Because the training nodes need to reload the model data and reallocate computing resources, that is, the slow node must first be stopped and the training task then restarted, this is very time-consuming. With this application example, slow nodes can be detected and replaced without restarting the training task, which saves the large amount of time needed to stop and resume training tasks in large-scale GPU training and allows the trained model to be obtained more efficiently. Deploying the trained model in the cluster system improves the operating efficiency of the cluster system.
According to an embodiment of the present disclosure, a slow node detection apparatus is provided. FIG. 6 is a schematic structural diagram of the slow node detection apparatus according to an embodiment of the present disclosure. As shown in FIG. 6, the slow node detection apparatus 600 includes a sensing module 601 configured to: initiate a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system; receive timing information fed back by the first node; and detect, according to the timing information, that a slow node exists in the cluster system.
In an implementation, the sensing module 601 may be configured to detect that a slow node exists in the cluster system when the timing information is greater than a threshold.
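By way of non-limiting illustration, the threshold comparison described above might be sketched in Python as follows. The function names, the dictionary of per-node timings, and the use of seconds as the unit are assumptions made for this sketch and do not appear in the disclosure.

```python
from typing import Dict, List


def find_slow_reports(timings_s: Dict[str, float], threshold_s: float) -> List[str]:
    """Return the ids of nodes whose reported timing is strictly greater than the threshold."""
    return sorted(node for node, elapsed in timings_s.items() if elapsed > threshold_s)


def cluster_has_slow_node(timings_s: Dict[str, float], threshold_s: float) -> bool:
    """Report whether at least one timing exceeds the threshold."""
    return bool(find_slow_reports(timings_s, threshold_s))
```

A timing equal to the threshold is not flagged here, since the text says "greater than"; how the threshold itself is chosen is left open.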
In an implementation, the sensing module 601 may be configured to: initiate a request to the first node to suspend the training task; and run a slow node detection program to detect the position of the slow node in the cluster system.
In an implementation, the sensing module 601 may be configured to cyclically execute the detection modes of collective communication detection so as to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection modes include at least one member of a set consisting of stand-alone detection, cluster detection, dichotomy (bisection), and combinations thereof.
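As a hedged illustration of the dichotomy mode only, the following Python sketch narrows a list of nodes down to one slow node by repeatedly probing half of the remaining candidates. It assumes exactly one slow node, and it assumes a hypothetical `probe` callable that runs a collective communication test on a group of nodes and returns its duration in seconds; neither assumption is stated in the disclosure.

```python
from typing import Callable, List, Sequence


def locate_slow_node(
    nodes: Sequence[str],
    probe: Callable[[List[str]], float],
    threshold_s: float,
) -> str:
    """Bisect the node list until a single candidate remains."""
    candidates = list(nodes)
    if not candidates:
        raise ValueError("at least one node is required")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        # If the left half is slow as a group, the slow node is in it;
        # otherwise, under the single-slow-node assumption, it is in the right half.
        candidates = left if probe(left) > threshold_s else right
    return candidates[0]
```

Under these assumptions the number of probes grows with the logarithm of the cluster size rather than with the cluster size itself, which is one plausible reason to combine bisection with the stand-alone and cluster modes.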
In an implementation, the sensing module 601 may be configured to notify a scheduling module of slow node information, the slow node information characterizing the position of the slow node in the cluster system, wherein the scheduling module is located at the first node or at a second node that has communication interaction with the first node.
According to an embodiment of the present disclosure, a slow node detection apparatus is provided. FIG. 7 is a schematic structural diagram of the slow node detection apparatus according to an embodiment of the present disclosure. As shown in FIG. 7, the slow node detection apparatus 700 includes a first node 701 configured to: receive a timing request initiated by a sensing module, wherein the first node 701 is one or more training nodes executing a training task in a cluster system; perform a collective communication operation based on the timing request to complete data exchange in the cluster system and obtain timing information; and send the timing information to the sensing module.
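The node-side handling of a timing request can be illustrated with the minimal Python sketch below. The collective communication operation is passed in as an opaque callable, and the report format, the function name, and the injectable clock are assumptions of this sketch rather than details taken from the disclosure.

```python
import time
from typing import Callable, Dict, Union


def handle_timing_request(
    node_id: str,
    collective_op: Callable[[], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Union[str, float]]:
    """Time one collective communication operation and build the report to send back."""
    start = clock()
    collective_op()  # e.g. a data exchange across the cluster; opaque to this sketch
    elapsed_s = clock() - start
    return {"node": node_id, "elapsed_s": elapsed_s}
```

Passing the clock as a parameter keeps the sketch testable; a real node would simply use the default monotonic timer.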
In an implementation, the first node 701 may be configured to: receive a request, initiated by the sensing module, to suspend the training task; in response to the request, suspend the training task and store the progress status of the training task; and notify the sensing module to run a slow node detection program.
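One possible shape of the suspend-and-store step is sketched below in Python. The `TrainingNode` class, its fields, and the returned notification dictionary are hypothetical names introduced only for illustration; the disclosure does not specify how progress status is represented.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class TrainingNode:
    node_id: str
    step: int = 0
    model_state: Dict[str, Any] = field(default_factory=dict)
    paused: bool = False
    saved_progress: Optional[Dict[str, Any]] = None

    def handle_pause_request(self) -> Dict[str, Any]:
        """Suspend training, snapshot the progress status, and return a notification."""
        self.paused = True
        # Deep copy so later changes to the live state do not alter the snapshot.
        self.saved_progress = {
            "step": self.step,
            "model_state": copy.deepcopy(self.model_state),
        }
        # The notification tells the sensing module it may run the detection program.
        return {"node": self.node_id, "status": "paused", "run_detection": True}
```

Keeping the snapshot in memory reflects the stated goal of avoiding a restart; whether progress is held in memory or written elsewhere is an open design choice here.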
In an implementation, the slow node detection apparatus 700 may further include a scheduling module located at the first node 701 and configured to receive slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system. The first node 701 may be configured to accept scheduling control by the scheduling module and, according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node, which continues to execute the training task.
In an implementation, the slow node detection apparatus 700 may further include a scheduling module located at a second node and configured to receive slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system. The first node 701 may be configured to: receive the slow node information, the slow node information being information that the second node forwards to the first node 701 after accepting scheduling control by the scheduling module, wherein the second node has communication interaction with the first node 701; and, according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node, which continues to execute the training task.
In an implementation, the candidate node and the slow node have an active/standby switchover relationship.
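The replacement of a slow node by a candidate node can be sketched as follows. This is a minimal Python illustration under assumed data structures (a list of active nodes, a list of standby candidates, and a dictionary of stored progress keyed by node id); none of these names come from the disclosure.

```python
from typing import Any, Dict, List


def replace_slow_node(
    active: List[str],
    standby: List[str],
    progress: Dict[str, Any],
    slow_node: str,
) -> str:
    """Swap the slow node for the first standby candidate and hand over its progress."""
    if slow_node not in active:
        raise ValueError("unknown slow node: %s" % slow_node)
    if not standby:
        raise RuntimeError("no candidate node is available")
    candidate = standby.pop(0)
    # The candidate takes the slow node's position in the active set.
    active[active.index(slow_node)] = candidate
    # The stored progress status moves with the task, so training resumes without a restart.
    progress[candidate] = progress.pop(slow_node)
    return candidate
```

The other active nodes are left untouched, which matches the stated aim of continuing the training task rather than restarting it.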
For the functions of the modules in the apparatuses of the embodiments of the present disclosure, reference may be made to the corresponding descriptions of the foregoing methods; details are not repeated here.
In the technical solutions of the present disclosure, the acquisition, storage, and use of any user personal information involved comply with relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 8, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 may also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Multiple components of the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 801 executes the methods and processing described above, such as the slow node detection method. For example, in some embodiments, the slow node detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the slow node detection method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to execute the slow node detection method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed herein can be achieved; no limitation is imposed in this respect.
The specific implementations described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (23)

  1. A slow node detection method, comprising:
    initiating, by a sensing module, a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system;
    receiving, by the sensing module, timing information fed back by the first node; and
    detecting, by the sensing module according to the timing information, that a slow node exists in the cluster system.
  2. The method according to claim 1, wherein detecting, by the sensing module according to the timing information, that the slow node exists in the cluster system comprises:
    detecting, by the sensing module, that the slow node exists in the cluster system when the timing information is greater than a threshold.
  3. The method according to claim 1, further comprising:
    initiating, by the sensing module, a request to the first node to suspend the training task; and
    running, by the sensing module, a slow node detection program to detect the position of the slow node in the cluster system.
  4. The method according to claim 3, wherein running, by the sensing module, the slow node detection program to detect the position of the slow node in the cluster system comprises:
    cyclically executing, by the sensing module, detection modes of collective communication detection so as to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection modes include at least one member of a set, the set consisting of stand-alone detection, cluster detection, dichotomy, and combinations thereof.
  5. The method according to any one of claims 1-4, further comprising:
    notifying, by the sensing module, a scheduling module of slow node information, the slow node information characterizing the position of the slow node in the cluster system;
    wherein the scheduling module is located at the first node or at a second node that has communication interaction with the first node.
  6. A slow node detection method, comprising:
    receiving, by a first node, a timing request initiated by a sensing module, wherein the first node is one or more training nodes executing a training task in a cluster system;
    performing, by the first node, a collective communication operation based on the timing request to complete data exchange in the cluster system and obtain timing information; and
    sending, by the first node, the timing information to the sensing module.
  7. The method according to claim 6, further comprising:
    receiving, by the first node, a request initiated by the sensing module to suspend the training task;
    in response to the request to suspend the training task, suspending, by the first node, the training task and storing the progress status of the training task; and
    notifying, by the first node, the sensing module to run a slow node detection program.
  8. The method according to claim 7, further comprising:
    receiving, by a scheduling module, slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system; and
    when the scheduling module is located at the first node, accepting, by the first node, scheduling control by the scheduling module and, according to the slow node information, transferring the progress status of the training task executed by the slow node to a normal candidate node to continue executing the training task.
  9. The method according to claim 7, further comprising:
    receiving, by a scheduling module, slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system;
    when the scheduling module is located at a second node that has communication interaction with the first node, receiving, by the first node, the slow node information, the slow node information being information that the second node forwards to the first node after accepting scheduling control by the scheduling module; and
    transferring, by the first node according to the slow node information, the progress status of the training task executed by the slow node to a normal candidate node to continue executing the training task.
  10. The method according to claim 8 or 9, wherein the candidate node and the slow node have an active/standby switchover relationship.
  11. A slow node detection apparatus, comprising a sensing module configured to:
    initiate a timing request to a first node, wherein the first node is one or more training nodes executing a training task in a cluster system;
    receive timing information fed back by the first node; and
    detect, according to the timing information, that a slow node exists in the cluster system.
  12. The apparatus according to claim 11, wherein the sensing module is configured to:
    detect that the slow node exists in the cluster system when the timing information is greater than a threshold.
  13. The apparatus according to claim 11, wherein the sensing module is configured to:
    initiate a request to the first node to suspend the training task; and
    run a slow node detection program to detect the position of the slow node in the cluster system.
  14. The apparatus according to claim 13, wherein the sensing module is configured to:
    cyclically execute detection modes of collective communication detection so as to run the slow node detection program and detect the position of the slow node in the cluster system, wherein the detection modes include at least one member of a set, the set consisting of stand-alone detection, cluster detection, dichotomy, and combinations thereof.
  15. The apparatus according to any one of claims 11-14, wherein the sensing module is configured to:
    notify a scheduling module of slow node information, the slow node information characterizing the position of the slow node in the cluster system;
    wherein the scheduling module is located at the first node or at a second node that has communication interaction with the first node.
  16. A slow node detection apparatus, comprising a first node configured to:
    receive a timing request initiated by a sensing module, wherein the first node is one or more training nodes executing a training task in a cluster system;
    perform a collective communication operation based on the timing request to complete data exchange in the cluster system and obtain timing information; and
    send the timing information to the sensing module.
  17. The apparatus according to claim 16, wherein the first node is configured to:
    receive a request initiated by the sensing module to suspend the training task;
    in response to the request to suspend the training task, suspend the training task and store the progress status of the training task; and
    notify the sensing module to run a slow node detection program.
  18. The apparatus according to claim 17, further comprising:
    a scheduling module located at the first node and configured to receive slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system;
    wherein the first node is configured to accept scheduling control by the scheduling module and, according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node to continue executing the training task.
  19. The apparatus according to claim 17, further comprising:
    a scheduling module located at a second node and configured to receive slow node information sent by the sensing module, the slow node information characterizing the position of the slow node in the cluster system;
    wherein the first node is configured to:
    receive the slow node information, the slow node information being information that the second node forwards to the first node after accepting scheduling control by the scheduling module, wherein the second node has communication interaction with the first node; and
    according to the slow node information, transfer the progress status of the training task executed by the slow node to a normal candidate node to continue executing the training task.
  20. The apparatus according to claim 18 or 19, wherein the candidate node and the slow node have an active/standby switchover relationship.
  21. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-10.
  22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
  23. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
PCT/CN2022/111137 2021-12-23 2022-08-09 Slow node detection method and apparatus, electronic device, and storage medium WO2023115975A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111588055.5 2021-12-23
CN202111588055.5A CN114328098B (en) 2021-12-23 2021-12-23 Slow node detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023115975A1 true WO2023115975A1 (en) 2023-06-29

Family

ID=81054847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/111137 WO2023115975A1 (en) 2021-12-23 2022-08-09 Slow node detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114328098B (en)
WO (1) WO2023115975A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628508A (en) * 2023-07-20 2023-08-22 科大讯飞股份有限公司 Model training process anomaly detection method, device, equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328098B (en) * 2021-12-23 2023-04-18 北京百度网讯科技有限公司 Slow node detection method and device, electronic equipment and storage medium
CN114979180B (en) * 2022-05-24 2024-05-17 超聚变数字技术有限公司 Data synchronization method, system and equipment
CN115600687B (en) * 2022-11-08 2023-06-09 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824934A (en) * 2016-03-18 2016-08-03 杭州数梦工场科技有限公司 Method and device for finding slow nodes in distributive ETL
CN106878388A (en) * 2017-01-04 2017-06-20 北京百度网讯科技有限公司 Detection to slow node in distributed memory system
CN110852445A (en) * 2019-10-28 2020-02-28 广州文远知行科技有限公司 Distributed machine learning training method and device, computer equipment and storage medium
US20210132855A1 (en) * 2019-11-04 2021-05-06 Beijing Xiaomi Mobile Software Co., Ltd. Method and device for detecting slow node and computer-readable storage medium
CN114328098A (en) * 2021-12-23 2022-04-12 北京百度网讯科技有限公司 Slow node detection method and device, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8315278B2 (en) * 2008-12-23 2012-11-20 Nokia Corporation Network synchronization method
US11087234B2 (en) * 2016-01-29 2021-08-10 Verizon Media Inc. Method and system for distributed deep machine learning
CN107181608B (en) * 2016-03-11 2020-06-09 阿里巴巴集团控股有限公司 Method for recovering service and improving performance and operation and maintenance management system
CN106650922B (en) * 2016-09-29 2019-05-03 清华大学 Hardware neural network conversion method, computing device, software and hardware cooperative system
CN108446770B (en) * 2017-02-16 2020-12-04 中国科学院上海高等研究院 Distributed machine learning slow node processing system and method based on sampling
US10769007B2 (en) * 2018-06-08 2020-09-08 Microsoft Technology Licensing, Llc Computing node failure and health prediction for cloud-based data center
CN111753973A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Optimization method, system, equipment and storage medium of neural network chip
CN112711422B (en) * 2020-12-31 2024-01-19 北京清微智能科技有限公司 Neural network compiling optimization method and system
CN113190405B (en) * 2021-04-29 2022-08-19 山东英信计算机技术有限公司 Node health detection method and device, electronic equipment and storage medium
CN113656494B (en) * 2021-07-27 2024-06-07 中南大学 Synchronization method and system of parameter server and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628508A (en) * 2023-07-20 2023-08-22 科大讯飞股份有限公司 Model training process anomaly detection method, device, equipment and storage medium
CN116628508B (en) * 2023-07-20 2023-12-01 科大讯飞股份有限公司 Model training process anomaly detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114328098A (en) 2022-04-12
CN114328098B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2023115975A1 (en) Slow node detection method and apparatus, electronic device, and storage medium
US11544137B2 (en) Data processing platform monitoring
CN113656175B (en) Method and equipment for training model based on distributed system
CN111726413B (en) Equipment connection method and device
CN102789305A (en) Postponing suspend
US10261874B2 (en) Enabling a cloud controller to communicate with power systems
WO2016173450A1 (en) Graphic processing device, resource service device, resource scheduling method and device thereof
WO2017075989A1 (en) Method, device and system for virtual machines migration
US20180260256A1 (en) Fine-grain synchronization in data-parallel jobs for distributed machine learning
KR20210156243A (en) Training methods of deep-running frameworks, devices and storage media
EP4030736A1 (en) Load balancing system, method and apparatus, and storage medium
CN108401453B (en) Method and device for controlling display screen and intelligent terminal
EP4224317A1 (en) Method and apparatus for controlling distributed operation system, and device, medium and program product
KR20210103415A (en) Speech chip and electronic device
US20200125380A1 (en) Guest operating system wake-up method, device, electronic apparatus, and readable medium
CN111625949A (en) Simulation engine system, simulation processing method, device and medium
US20220244990A1 (en) Method for performing modification task, electronic device and readable storage medium
CN116467100B (en) Data processing method, device, electronic equipment and storage medium
US9703646B2 (en) Centralized database system
US20180285232A1 (en) Management apparatus and management method
CN113656239A (en) Monitoring method and device for middleware and computer program product
CN116938826B (en) Network speed limiting method, device and system and electronic equipment
CN113596172B (en) Method and device for updating nodes in distributed cluster
CN116662276B (en) Data processing method, device, electronic equipment and storage medium
CN117768441A (en) Data transmission method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22909305; Country of ref document: EP; Kind code of ref document: A1