
Model training method, device, equipment, system and storage medium

Info

Publication number
CN117725976A
Authority
CN
China
Prior art keywords
node
training
fault
communication file
task
Prior art date
Legal status
Pending
Application number
CN202211139622.3A
Other languages
Chinese (zh)
Inventor
郝日佩
蔡志方
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202211139622.3A
Publication of CN117725976A

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The application provides a model training method, device, equipment, system and storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: the model training system comprises at least one computing node; when a computing node becomes a fault node in the process of executing a training task, a management node stops the training task, then generates a second cluster communication file according to a first cluster communication file, where the first cluster communication file and the second cluster communication file indicate a consistent cluster communication mode for at least one non-fault node in which no fault has occurred, and then restarts the training task according to the second cluster communication file. Because the cluster communication mode of the at least one non-fault node is kept consistent, that is, the cluster communication mode of the non-fault nodes is not changed, breakpoint resume training can be realized without restoring the non-fault nodes from node state backup data, which reduces the consumption of storage resources and communication resources caused by backing up and reading the node state backup data.

Description

Model training method, device, equipment, system and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a model training method, apparatus, device, system, and storage medium.
Background
When a neural network model is trained in a distributed manner, elastic training can be applied: resources are scheduled flexibly, and the number of nodes in the training cluster is automatically increased or decreased during training according to the resources currently available. If some nodes go down during training, new nodes can automatically join the training cluster so that training continues.
However, when the number of nodes executing the training task is increased or decreased, or new nodes are introduced for training, the set of nodes running the training task changes: all nodes are released, the training task is redistributed according to the new set of nodes, and the original nodes are restored from node state backup data in order to restart training, which consumes a large amount of storage resources and communication resources.
Disclosure of Invention
The embodiments of the present application provide a model training method, device, equipment, system and storage medium, which can solve the problem that, in distributed training of a neural network model, restoring the original nodes from node state backup data after nodes are added or removed consumes a large amount of storage resources and communication resources.
In a first aspect, a model training method is provided, which is applied to a management node in a distributed model training system. The method comprises the following steps: the model training system comprises at least one computing node; when a computing node becomes a fault node in the process of executing a training task, the management node stops the training task, then generates a second cluster communication file according to a first cluster communication file, where the first cluster communication file and the second cluster communication file indicate a consistent cluster communication mode for at least one non-fault node in which no fault has occurred, and then restarts the training task according to the second cluster communication file.
Therefore, compared with the approach in which, when the training task fails, the management node releases all nodes executing the training task and then randomly redistributes the training task to the nodes, the first cluster communication file and the second cluster communication file indicate a consistent cluster communication mode for the at least one non-fault node in which no fault has occurred, so the cluster communication mode of the non-fault nodes is not changed. As a result, as long as the training task itself is unchanged, each non-fault node can continue to obtain and run, in the state it was in before the restart, the same part of the training task that it ran before the restart, even without being restored from node state backup data; the part of the training task run by each non-fault node is the same before and after the restart. Therefore, when the training task is restarted after a fault, the management node does not need to restore the non-fault nodes from node state backup data, which reduces the consumption of storage resources and communication resources caused by backing up and reading the node state backup data.
The first cluster communication file is generated when the management node last started the training task, that is, at the most recent start of the training task before the management node stops it. The cluster communication file may be a chip resource configuration file (rank table file); for example, the cluster communication file is part of the data contained in the chip resource configuration file, that is, the chip resource configuration file contains the correspondence between a process identifier and the chip running that process. Cluster communication means that the operation units (pods) included in the nodes of the model training system communicate with one another based on Internet Protocol (IP) addresses. A pod is the smallest resource management unit in the model training system and the smallest resource object that runs a containerized application. One pod may include one or more containers, which are logical packaging entities for executing processes. Containers in the same pod share the same network namespace and can communicate with each other. A pod is a transient rather than a persistent entity.
As a possible implementation manner, when the management node detects that any computing node fails in the process of executing the training task, it stops task scheduling for all nodes in the model training system and releases the training process that failed in the fault node.
Optionally, the management node determines that a computing node has failed in performing the training task when fault information is obtained. The fault information indicates a node-unreachable fault or a chip fault.
Optionally, the management node monitors process running information, determines the fault training process in which the fault occurred according to the process running information, and releases the fault training process. The process running information is obtained by the management node monitoring the processes run by the pods in each computing node, and is recorded when the training task is started.
Therefore, compared with a scheme in which the relation between training processes and pods is unknown and all pods of all computing nodes running the training task must be released, the management node can accurately stop the fault training process in the fault node according to the process running information, that is, release only the pod running the fault training process, without releasing the pods of the non-fault nodes, so the non-fault nodes do not need to be restored from node state backup data when the training task is restarted. In addition, because the process running information is recorded, the management node can also, after restarting the training task, re-issue the fault training process to any computing node based on the process running information.
As one possible implementation, the management node may reassign the training task according to the resource state (for example, whether the computing power, the storage resource capacity, and the like meet the requirements of the training task) of each computing node included in the model training system.
Optionally, the management node may increase or decrease the computing nodes running the training task when reassigning the training task, may assign a portion of the training task to the restarted fault node, and may assign a portion of the training task to a non-fault node.
For example, if the hardware resources used by the fault training process before the fault are not recovered before the second cluster communication file is generated, the cluster communication mode of the fault node in the second cluster communication file generated by the management node is different from the cluster communication mode of the fault node in the first cluster communication file; that is, the second cluster communication file indicates that the fault training process is allocated to a redundant node or to at least one non-fault node for running. A redundant node is a computing node that was not running the training task before the restart.
For another example, if the hardware resources used by the fault training process before the fault are recovered before the second cluster communication file is generated, the cluster communication mode of the fault node in the second cluster communication file generated by the management node is the same as the cluster communication mode of the fault node in the first cluster communication file; that is, the second cluster communication file indicates that the fault training process is allocated to the fault node for running.
In this way, the management node can scale the number of computing nodes running the training task more flexibly, and the training process that the fault node ran before the restart can be redistributed even when the model training system contains no redundant node, which reduces the occupation of hardware resources.
As a possible implementation manner, when the management node schedules the computing nodes that run the training task for a user according to the user's training request, the management node determines the number of nodes to invoke according to a preset node number interval of the training task input by the user, so as to generate the cluster communication file.
The preset node number interval indicates the maximum and minimum number of nodes that the training task may invoke. Therefore, the management node can flexibly determine the number of computing nodes running the training task through the preset node number interval and reduce the number of redundant nodes in the model training system, thereby improving model training efficiency and node resource utilization.
In a second aspect, a model training apparatus is provided, the apparatus comprising respective modules for performing the model training method of the first aspect or any one of the possible implementations of the first aspect.
The model training apparatus according to the second aspect may be a terminal device or a network device, or may be a chip (system) or other components or assemblies that may be disposed in the terminal device or the network device, or may be an apparatus including the terminal device or the network device, which is not limited in this application.
In addition, the technical effects of the model training apparatus according to the second aspect may refer to the technical effects of the model training method according to the first aspect, which are not described herein.
In a third aspect, a computing device is provided, comprising a processor and a memory, where the memory is configured to store a set of computer instructions, and when the processor executes the set of computer instructions, the computing device performs the operational steps of the model training method in any one of the possible designs of the first aspect.
In addition, the technical effects of the computing device described in the third aspect may refer to the technical effects of the model training method described in the first aspect, which are not described herein.
In a fourth aspect, a model training system is provided, comprising a management node and at least one computing node, the at least one computing node being configured to perform training tasks, the management node being configured to perform the operational steps of the method as described in any one of the possible implementations of the first aspect, when at least one computing node fails in performing the training tasks. Wherein the computing node may be a computing device in the third aspect.
In a fifth aspect, there is provided a computer readable storage medium comprising: computer software instructions; the computer software instructions, when executed in a training system, cause a computing device to perform the operational steps of the method as described in any one of the possible implementations of the first aspect.
In a sixth aspect, there is provided a computer program product for, when run on a computer, causing a computing device to perform the operational steps of the method as described in any one of the possible implementations of the first aspect.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Drawings
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a model training system according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a model training method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of training task initiation steps according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a model training device according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a system formed by a computing device according to an embodiment of the present application.
Detailed Description
For ease of understanding, the following description will first be given of the relevant terms related to the embodiments of the present application.
(1) Neural network
The neural network may be composed of neurons. A neuron may be an operation unit that takes x_s and an intercept of 1 as inputs. The output of the operation unit satisfies the following formula:

h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neuron. f is the activation function of the neuron, which is used to introduce a nonlinear characteristic into the neural network and convert the input signal of the neuron into an output signal. The output signal of the activation function may be used as the input of the next layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neurons together, that is, the output of one neuron may be the input of another neuron. The input of each neuron may be connected to a local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be a region composed of several neurons. The weights characterize the strength of the connection between different neurons: a weight determines the influence of an input on the output, a weight close to 0 means that changing the input does not change the output, and a negative weight means that increasing the input decreases the output.
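For illustration only, the formula above can be written as a short Python sketch of a single neuron with a sigmoid activation (the input values, weights, and bias below are arbitrary example numbers, not values from this application):

```python
import math

def sigmoid(z):
    # Activation function f: maps the weighted sum to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b):
    # Weighted sum of the inputs x_s plus the bias b, passed through f.
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return sigmoid(z)

# Example: n = 3 inputs with arbitrary weights and bias.
x = [0.5, -1.2, 3.0]
w = [0.8, 0.1, -0.4]
b = 0.2
print(neuron_output(x, w, b))  # output signal fed to the next layer
```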
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the present application. The neural network 100 includes N processing layers, N being an integer greater than or equal to 3. The first layer of the neural network 100 is the input layer 110, which is responsible for receiving the input signal, and the last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing result of the neural network. The other layers except the first layer and the last layer are intermediate layers 140, and these intermediate layers 140 together form a hidden layer 120, and each intermediate layer 140 in the hidden layer 120 may either receive an input signal or output a signal. The hidden layer 120 is responsible for the processing of the input signal. Each layer represents a logic level of signal processing through which data signals may be processed through multiple levels of logic.
In some possible embodiments, the input signal of the neural network may take various forms, such as a video signal, a voice signal, a text signal, an image signal, or a temperature signal. The voice signal may be a sensor signal such as a human voice audio signal of speech or singing recorded by a microphone (sound sensor). The input signals of the neural network also include various other computer-processable engineering signals, which are not listed here. If the neural network is used for deep learning on image signals, the quality of the images processed by the neural network can be improved.
(2) Loss function
When training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and adjusting according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
A gradient is a vector indicating the direction along which the directional derivative of a function at a point takes its maximum, that is, the direction along which the function changes fastest, with the greatest rate of change, at that point. When finding the optimal parameters of each network layer during training of the deep neural network, the parameters that minimize the value of the loss function are sought. In order to find parameters at which the value of the loss function is as small as possible, the gradient of the loss function with respect to the parameters needs to be calculated; when the gradient vector approaches 0, the loss function reaches a minimum point and the model accuracy reaches a maximum point.
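As an illustration of this idea, the following is a minimal Python sketch of gradient descent for a one-parameter model with a squared-error loss (the data, learning rate, and iteration count are arbitrary assumptions):

```python
# Model: prediction = w * x; loss = (prediction - target)^2.
# Gradient of the loss with respect to w: 2 * (w * x - target) * x.
def train_step(w, x, target, lr=0.1):
    pred = w * x
    grad = 2.0 * (pred - target) * x   # derivative of the loss w.r.t. w
    return w - lr * grad               # move against the gradient direction

w = 0.0
for _ in range(50):
    w = train_step(w, x=2.0, target=6.0)
print(w)  # approaches 3.0, where the loss is minimal and the gradient is near 0
```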
(3) Distributed deep learning training (distributed deep learning training, DDLT)
In the model training process, using a single chip (for example, a graphics processing unit (GPU), a central processing unit (CPU), or a neural network processing unit (NPU)), that is, a single training card, always runs into problems: for example, the original data samples are too large to be loaded onto the training card, or the model is too large to train. Distributed deep learning training technology is then needed to divide the large amount of data into small blocks that are computed separately by multiple training cards; after the computation results are updated, they are unified and combined to obtain the final usable model. The multiple training cards may be different pieces of hardware in the same device, or hardware on different components (or nodes) of a distributed system. A distributed system is a system whose components reside on different networked computers, which communicate and coordinate their actions by passing messages to one another and interact with one another to accomplish a common task objective.
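The data-parallel idea described above can be sketched in plain Python as follows; the gradient function is a stand-in for the computation each training card performs on its own block of data, and no real training-card API is used:

```python
def local_gradient(w, data_shard):
    # Stand-in for the gradient one training card computes on its own data block.
    return sum(2.0 * (w * x - y) * x for x, y in data_shard) / len(data_shard)

def data_parallel_step(w, dataset, num_cards, lr=0.01):
    # Divide the data into small blocks, one per training card.
    shards = [dataset[i::num_cards] for i in range(num_cards)]
    # Each card computes a gradient on its block (done sequentially here for clarity).
    grads = [local_gradient(w, shard) for shard in shards]
    # Unify and combine the results: average the gradients, then update once.
    return w - lr * sum(grads) / num_cards

dataset = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, dataset, num_cards=4)
print(w)  # close to 3.0, the same result a single card would converge to
```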
(4) Elastic training (elastic training)
In conventional deep learning distributed training tasks, the instance configuration of a task is typically fixed. This limits the flexibility and training speed of the task to a large extent and is not friendly to the resource utilization of the whole cluster. Elastic training means that a training task can dynamically adjust the number of instances participating in the computation at run time. This makes training more flexible, and allows better scaling and scheduling in cooperation with the load of the cluster.
For example, the Horovod elastic training mechanism includes two kinds of processes, namely driver and worker processes. The driver process runs on a CPU node, and a worker process may run on a CPU node or a GPU node. The driver process assists in constructing the communication domain between the worker processes, so that the driver process can access the communication domain to communicate with the worker processes. The driver process is also used to start/restart worker processes on the worker nodes and to monitor the overall state of the system. The worker processes are responsible for running the training task to implement model training and model iteration, and each worker node communicates with the other worker nodes to construct the communication domain. In this elastic training mechanism, if the driver node cannot find a worker node, the exception is captured and the communication domain is re-established for training using the worker nodes that are operating normally; if the driver node finds a new worker node joining the training cluster, the communication domain is re-established so that the new worker node joins the training cluster.
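A typical elastic training loop written against Horovod's elastic API looks roughly like the sketch below (the model, optimizer, and epoch count are placeholders, and exact API details may differ between Horovod versions):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(16, 1)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# State that is synchronized to workers that (re)join after a scaling event.
state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)

@hvd.elastic.run          # re-runs the function when the worker set changes
def train(state):
    for epoch in range(state.epoch, 10):
        # ... run one epoch of training on this worker's data shard ...
        state.epoch = epoch
        state.commit()    # mark a point surviving workers can roll back to

train(state)
```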
However, in the existing elastic training mode in distributed training, when the number of nodes executing the training task is increased or decreased or new nodes are introduced for training, for example when a computing node fails and a redundant node is introduced to carry out the task, the set of nodes running the training task changes: the management node instructs all nodes to be released, the communication domain is reconstructed from the new nodes and the original nodes to configure the cluster communication mode among the nodes, the training task is then redistributed based on the new cluster communication mode, and the nodes running the training task are reconfigured. Because the correspondence between the chip of each node and the training processes of the training task in the new cluster communication mode can be matched randomly, the cluster communication mode of each node after the restart differs from the original cluster communication mode before the restart, the way each node calls chip resources after the restart changes, and the nodes need to be restored to their state before the restart using node state backup data in order to realize breakpoint resume training. Restoring the original nodes from node state backup data to restart training consumes a large amount of storage resources and communication resources.
The node state backup data may be a checkpoint file, where the checkpoint file records part of the intermediate results of each node during forward propagation in the distributed training process, and the management node can restore the data state of a computing node according to these intermediate results.
The embodiments of the present application provide a model training method in which, during breakpoint resume training of distributed model training, the cluster communication mode of the non-fault nodes is not changed and the non-fault nodes do not need to be restored from node state backup data. Specifically, when one or more computing nodes contained in the model training system fail, the management node stops the training task, then generates a second cluster communication file according to the first cluster communication file, where the first cluster communication file and the second cluster communication file indicate a consistent cluster communication mode for the at least one non-fault node in which no fault has occurred, and then restarts the training task according to the second cluster communication file. In this way, a non-fault node can continue to obtain and run, in the state it was in before the fault, the part of the training task it ran before the restart, so the management node does not need to restore the non-fault nodes from node state backup data during breakpoint resume training, which reduces the consumption of storage resources and communication resources caused by breakpoint resume training.
The implementation of the examples of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 2 is a schematic architecture diagram of a model training system according to an embodiment of the present application. As shown in fig. 2, taking the example of a container cluster management system being Kubernetes, model training system 200 includes a cluster of management nodes, a cluster of computing nodes, and a cluster of storage nodes.
Wherein the management node cluster includes one or more management nodes (one management node 210 is shown in fig. 2, but is not limited to one management node). In Kubernetes clusters, the management node is a master node, which is a cluster control node of the Kubernetes cluster, and is responsible for management and control of the whole cluster, and user commands are usually run on the management node. For example, the management node 210 is responsible for distributing training, reasoning tasks to each computing node for execution, and performing functions of data management, task management, model management, and log management of the model training system 200. In this embodiment, a single "cluster" represents a Kubernetes cluster composed of all nodes included in the model training system 200, and a management node cluster, a computing node cluster, and a storage node cluster represent a cluster composed of management nodes, a cluster composed of computing nodes, and a cluster composed of storage nodes, respectively.
At the software level, the management node 210 runs control plane components, which may include, but are not limited to, one or more of an application programming interface server (API server) component 211, an etcd component 212, an elastic component 213, a scheduling (scheduler) component 214, and the like.
The API server component 211 is responsible for providing the Kubernetes API externally, provides the single entry point for resource operations, and handles tasks such as accepting requests, authentication, authorization, access control, and API registration and discovery. Commands entered by the user, and commands from other components, implement their respective functions by calling interfaces provided by the API server component 211. For example, the API server component 211 provides a hub for data interaction and communication between components; other components query or modify data through the API server component 211, and only the API server component 211 operates etcd directly.
The etcd component 212 is a highly available key-value database provided by Kubernetes for maintaining state information for all network configurations and resource objects of the cluster, i.e., the state of the entire cluster. The data changes in etcd component 212 are made through API server component 211. For example, etcd component 212 is used to store cluster communication files, process run information, etc. of model training system 200.
The elastic component 213 is used to dynamically adjust the number of instances participating in computation at run time, and to cooperate with the load of the cluster for better scaling and scheduling. For example, the elastic component 213 is configured to monitor the fault information in the node annotation component 223, stop the training task when fault information exists in the node annotation component 223, generate a second cluster communication file based on the fault information and the first cluster communication file, and restart the training task according to the second cluster communication file. The first cluster communication file is generated when the training task was started before the fault and indicates the cluster communication mode of each node in the training task; the second cluster communication file indicates the cluster communication mode of each node in the restarted training task. The communication modes of the non-fault computing nodes indicated by the first cluster communication file and the second cluster communication file are the same. For another example, the elastic component 213 continuously records the training task and the various events of its pods by using the index mechanism of Kubernetes, and reads the chip information of the pods to generate a corresponding configmap. The configmap is used to inject configuration data into the pods; the configuration data includes the cluster communication configuration on which the training task depends, which facilitates scheduling that better matches the training task to the underlying chips without manual configuration by the user.
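As a rough sketch of how a component of this kind might watch node annotations for fault information using the Kubernetes Python client, consider the following; the annotation key "fault-info" and the callback are assumptions for illustration and are not defined by this application:

```python
from kubernetes import client, config, watch

def watch_node_annotations(on_fault):
    config.load_incluster_config()          # running inside the cluster
    v1 = client.CoreV1Api()
    w = watch.Watch()
    # Blocks and handles node events as they arrive.
    for event in w.stream(v1.list_node):
        node = event["object"]
        annotations = node.metadata.annotations or {}
        fault_info = annotations.get("fault-info")   # assumed annotation key
        if fault_info:
            # e.g. stop the training task, regenerate the cluster
            # communication file, and restart the task.
            on_fault(node.metadata.name, fault_info)

watch_node_annotations(lambda name, info: print(name, info))
```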
The scheduling component 214 is configured to monitor newly created pod replica information and select the most appropriate computing node for each pod via a scheduling algorithm. The scheduling component 214 retrieves all computing nodes that meet the pod's requirements and executes the pod scheduling logic. After scheduling succeeds, the scheduling component 214 binds the pod information to the target node and writes the information into etcd.
The computing node cluster includes one or more computing nodes (three computing nodes, namely computing node 220a, computing node 220b, and computing node 220c, are shown in FIG. 2, but the number is not limited to three computing nodes). In the Kubernetes cluster, the computing nodes are worker nodes, which are the workload nodes of the Kubernetes cluster; each computing node is assigned a workload by the management node and actually executes training and reasoning tasks.
At the software level, node components run on computing nodes 220a, 220b, and 220c. Optionally, taking computing node 220a as an example, the node components included in computing node 220a may include, but are not limited to, one or more of a proxy component 221, a driver component 222, a node annotation component 223, a reporting component 224, a chip status component 225, a pod component, and the like.
A proxy component 221, such as the kubelet component, runs on each node in the cluster. The proxy component 221 is used to ensure that the containers are running in their pods. Specifically, the proxy component 221 monitors the pods that have been assigned to its node and is responsible for pod lifecycle management, while cooperating with the management node 210 to maintain and manage all containers on the computing node, implementing the basic functions of cluster management. For example, the proxy component 221 registers the node's information with the API server component 211 and periodically reports the node's resource usage to the management node 210.
The driver component 222 is configured to add device discovery, device allocation, and device health status reporting functions for the processor based on the Kubernetes device plug-in mechanism, so that Kubernetes can manage chip resources. In actual operation, the user does not need to provide a discover_hosts.sh script to report the currently available chip resources in real time; the management node 210 can implement functions such as node scaling and breakpoint resume training according to the states of the nodes and chips.
The node annotation component 223 is used to attach arbitrary non-identifying metadata to the node using Kubernetes annotations.
The reporting component 224 provides a node monitoring function. Upon detecting a chip fault or that a node is unreachable, the reporting component 224 reports the fault information to the proxy component 221 and writes the fault information to the node annotation component 223.
The chip status component 225 is configured to provide real-time management of various indexes of chip resources, and can obtain information such as chip utilization, frequency, temperature, voltage, memory, and allocation status of chips in a container in real time.
A single compute node contains one or more pod components, each of which may be referred to as a pod.
The storage node cluster (not shown in fig. 2) comprises one or more storage nodes for storing data such as data sets and models of training output.
In this embodiment, each of the management node 210, computing node 220a, computing node 220b, computing node 220c, the storage nodes, and so on included in the model training system 200 may be a physical host or a virtual machine (VM). For example, computing node 220a, 220b, or 220c may be a virtual machine virtualized from the hardware resources of at least one training server. Each training server includes one or more accelerator cards, each accelerator card may include one or more chips, and a chip may be a GPU, CPU, NPU, or the like. The management node 210, computing node 220a, computing node 220b, computing node 220c, and the storage nodes may be different nodes or the same node.
It should be noted that fig. 2 is only a schematic diagram, and should not be construed as limiting the application, each node in the model training system 200 and the components included in each node are only one example of a distributed architecture based on Kubernetes, and in other embodiments, the model training system 200 may also configure different types of nodes and components according to architectures other than Kubernetes, which are not shown in fig. 2.
Next, please refer to fig. 3, a detailed description of the model training method will be provided. The management node 210 in fig. 2 is illustrated here as an example.
The model training method provided in this embodiment is used to implement breakpoint continuous training for a training task in operation, and the model training system 200 needs to execute the training task first.
The process of the management node 210 starting the training task may be that the management node 210 builds a training environment, acquires the training task to be performed by the model training system 200 indicated by the user, determines one or more computing nodes running the training task, and invokes the pod of the computing nodes to run the training task.
When the pod of the computing node is invoked to run the training task, the management node 210 generates a first cluster communication file, and then issues the training task to each computing node based on the first cluster communication file.
The cluster communication file in this embodiment may be a chip resource configuration file, which defines the chip resource information used for training. The chip resource configuration file includes a list of the training server instances participating in the training task, server identifiers, chip identifiers, network card identifiers, rank identifiers, and the like. A server identifier is the physical IP address of a training server; a chip identifier is the physical identifier of a chip, namely the serial number of the chip on the training server; a network card identifier is the IP address of the chip's integrated network card; and a rank identifier is the identifier of the rank allocated to the chip in the training task, where a rank represents a training process in the training task.
For example, the portion of the chip resource configuration file corresponding to a certain training server may include the following:
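The concrete file content is not reproduced here; a hypothetical entry, expressed as a Python dictionary whose field names and addresses are assumptions modeled on the description above, might look like this:

```python
# Illustrative only: field names, addresses, and identifiers are assumptions.
rank_table_entry = {
    "server_id": "192.168.1.10",   # server identifier: physical IP of the training server
    "device": [
        {
            "device_id": "0",                    # chip identifier: serial number on the server
            "device_ip": "192.168.100.101",      # network card identifier: IP of the chip's NIC
            "rank_id": "0",                      # rank identifier: training process on this chip
        },
        {
            "device_id": "1",
            "device_ip": "192.168.100.102",
            "rank_id": "1",
        },
    ],
}
```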
In this way, the management node 210 can construct the communication domain between the nodes and the pods according to the chip identifiers and network card identifiers in the chip resource configuration file, and issue the training task to the chips included in each computing node according to the rank identifiers; for example, the management node 210 issues the training task to computing nodes 220a and 220b according to the chip resource configuration file.
The specific steps for the management node 210 to initiate the training task may refer to fig. 4 and steps 410-430, which are not described herein.
Next, the management node 210 performs the following steps 310-330 to implement breakpoint training when a failed node occurs in the training task.
Step 310, when a fault occurs in the training task, the management node 210 stops the training task.
When a fault occurs in the training task, the management node 210 stops task scheduling for all computing nodes in the training task, namely computing nodes 220a and 220b.
As one possible implementation, computing node 220a experiences a node-unreachable fault such as a power failure. The reporting component 224 reports the fault information, namely network unhealthy information, to the proxy component 221 and writes the network unhealthy information to the node annotation component 223. The elastic component 213 determines that network unhealthy information is present in the node annotation component 223 and stops all processes of the training task. The scheduling component 214 determines that network unhealthy information is present in the node annotation component 223 and releases the failed pod.
Optionally, the reporting component 224 writes the network unhealthy information to the network unhealthy field of the node annotation component 223, and the elastic component 213 stops all processes of the training task when the network unhealthy field is non-empty and updates the state information of the cluster's resource objects and network configuration in the etcd component 212.
As another possible implementation, computing node 220a experiences a chip fault, for example an error checking and correcting (ECC) error. The reporting component 224 reports the fault information to the proxy component 221 and writes the fault information to the node annotation component 223. The elastic component 213 determines that the fault information is present in the node annotation component 223 and stops all processes of the training task. The scheduling component 214 determines that the fault information is present in the node annotation component 223 and releases the failed pod.
In this embodiment, after the management node 210 stops all processes of the training task, only the failed pod is released, not all pods of all computing nodes participating in the training task. Because the chip status component 225 reports the allocation status of chips in containers to the elastic component 213, the elastic component 213 obtains the process running information of the chips according to that allocation status, stores the process running information in the etcd component 212, and updates it in real time. Therefore, when a chip fails, the scheduling component 214 can determine the pod corresponding to the failed chip according to the process running information and release the failed pod, thereby releasing the training process that failed in the fault node.
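A simplified sketch of this bookkeeping in Python, with the process running information kept as a plain dictionary (the structure, node names, and pod names are assumptions for illustration):

```python
# Assumed structure: rank id -> which node, chip, and pod run that training process.
process_run_info = {
    0: {"node": "compute-220a", "chip": "0", "pod": "train-pod-0"},
    1: {"node": "compute-220a", "chip": "1", "pod": "train-pod-1"},
    2: {"node": "compute-220b", "chip": "0", "pod": "train-pod-2"},
}

def pods_to_release(process_run_info, failed_node, failed_chip=None):
    # Release only the pod(s) whose training process actually failed;
    # pods of non-fault nodes are left untouched.
    return [
        info["pod"]
        for info in process_run_info.values()
        if info["node"] == failed_node
        and (failed_chip is None or info["chip"] == failed_chip)
    ]

print(pods_to_release(process_run_info, failed_node="compute-220a", failed_chip="1"))
# ['train-pod-1'] -- train-pod-0 and train-pod-2 keep their state untouched
```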
Therefore, when the training task fails, the management node 210 can accurately determine the failed pod and release only that pod, instead of releasing all pods contained in all nodes participating in the training task, so the nodes that have not failed do not need to be restored from node state backup data (such as checkpoint data) during breakpoint resume training; this saves the storage resources required to store the node state backup data and the communication resources required to read it when restoring nodes.
Step 320, the management node 210 generates a second cluster communication file according to the first cluster communication file.
The elastic component 213 of the management node 210 generates the second cluster communication file according to the first cluster communication file and the fault information; that is, the elastic component 213 retains the portions of the first cluster communication file for the nodes in which no fault occurred and modifies the portion for the fault node.
Optionally, the fault information is written to a configmap by the scheduling component 214, and the elastic component 213 generates the second cluster communication file from the configmap and the first cluster communication file. A configmap is a configuration management component of Kubernetes that can pass configuration in the form of key-value pairs. The fault information stored in the configmap may include a fault identifier, a training task identifier, a faulty chip identifier, and the like.
Retaining the portions of the first cluster communication file for the nodes in which no fault occurred means that the elastic component 213 does not modify the server identifiers, chip identifiers, network card identifiers, rank identifiers, and so on of the nodes that did not fail, so that after the restart the non-fault nodes can communicate in the cluster communication mode used before the fault and obtain the training processes they carried before the fault to continue model training.
Modifying the portion for the fault node means that the elastic component 213 determines, according to the container information maintained by the proxy component 221, whether there is a computing node not currently performing model training. If there is such a computing node, for example computing node 220c, the training server of computing node 220c is added to the training server list of the second cluster communication file. If the elastic component 213 determines that there is no computing node not currently performing model training, the chip identifier, network card identifier, training server identifier, and so on corresponding to the rank identifier of the fault training process are modified, for example to the chip identifier, network card identifier, and training server identifier of the training server of a non-fault node.
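A simplified Python sketch of this regeneration logic over the illustrative rank-table structure shown earlier (the helper name, the file layout with a "server_list" key, and the replacement lookup are assumptions, not the exact implementation of this application):

```python
import copy

def build_second_file(first_file, failed_rank, replacement):
    """Regenerate the cluster communication file after a fault.

    first_file:  dict of the form {"server_list": [<entries like rank_table_entry>]}
    failed_rank: rank identifier of the fault training process
    replacement: dict with the server_id / device_id / device_ip that should now
                 host the failed rank (a redundant node's chip, or a chip of a
                 non-fault node, or the repaired fault node itself).
    Entries for all non-failed ranks are copied unchanged, so the cluster
    communication mode of the non-fault nodes does not change.
    """
    second = copy.deepcopy(first_file)
    for server in second["server_list"]:
        server["device"] = [d for d in server["device"] if d["rank_id"] != failed_rank]
    # Find (or add) the server that should now host the failed rank.
    for server in second["server_list"]:
        if server["server_id"] == replacement["server_id"]:
            break
    else:
        server = {"server_id": replacement["server_id"], "device": []}
        second["server_list"].append(server)
    server["device"].append({
        "device_id": replacement["device_id"],
        "device_ip": replacement["device_ip"],
        "rank_id": failed_rank,
    })
    return second

# e.g. move rank "1" from the failed chip onto a chip of a redundant node:
# second = build_second_file(first_file, failed_rank="1",
#                            replacement={"server_id": "192.168.1.20",
#                                         "device_id": "0",
#                                         "device_ip": "192.168.100.201"})
```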
Therefore, when the training task fails, the elastic component 213 achieves this by modifying the cluster communication file, without needing to define the cluster's aggregation operations under an hvd.elastic.run function, and without the ranks of the non-fault nodes being randomly reassigned by such a function after the training task is restarted. A new node can thus be added to the communication domain, or the target node of the fault training process can be changed to an existing non-fault node, while the cluster communication mode of the non-fault nodes is maintained, which realizes expansion or contraction of the cluster and improves the flexibility with which the cluster runs training tasks.
Step 330, the management node 210 restarts the training task according to the second cluster communication file.
The management node 210 issues a training process corresponding to each computing node in the training task to each computing node according to the second cluster communication file.
For example, if the training server list in the second cluster communication file includes the training server of computing node 220c, the management node 210 issues the fault training process to computing node 220c and issues each non-fault training process to the computing node that ran it before the fault.
For another example, if the chip identifier, network card identifier, and training server identifier corresponding to the rank identifier of the fault training process in the second cluster communication file are modified to those of the training server of computing node 220a and/or computing node 220b, the fault training process is issued to the non-fault node, that is, computing node 220a and/or computing node 220b.
As a possible implementation, before restarting the training task according to the second cluster communication file, the management node 210 also detects the state of each computing node, and restarts the training task when the resource state of each computing node is stable. For example, the management node 210 determines, according to the device health status reported by the reporting component 224 of each computing node, whether each pod can run its training process normally, and if so, restarts the training task.
Therefore, compared with the approach in which, when the training task fails, the management node releases all nodes executing the training task and then randomly redistributes the training task to the nodes, the first cluster communication file and the second cluster communication file indicate a consistent cluster communication mode for the at least one non-fault node in which no fault occurred, so the cluster communication mode of the non-fault nodes is not changed. As a result, as long as the training task is unchanged, the management node does not need to release the non-fault nodes or restore them from node state backup data; after the training task is restarted, each non-fault node can continue, in the state it was in before the fault, to obtain and run the training process it ran before the restart. This saves the storage resources required to store the node state backup data and the communication resources required to read it when restoring nodes.
The overall steps of the model training method are described above in steps 310-330; the specific steps by which the management node 210 initiates a training task are described in detail below in conjunction with FIG. 4. As shown in FIG. 4, the specific steps of the management node 210 initiating a training task may include steps 410-430.
Step 410, the management node 210 determines the computing node according to the preset node number interval input by the user.
The management node 210 determines which computing nodes run the training task according to the preset node number interval input by the user.
The management node 210 acquires the preset node number interval input by the user and assigns the training task to the available computing nodes when the number of available computing nodes falls within the preset node number interval. Available computing nodes are the computing nodes not currently running a training task. The preset node number interval has a maximum value and a minimum value; the number of available computing nodes falling within the preset node number interval means that it is greater than or equal to the minimum value of the interval and less than or equal to the maximum value of the interval.
In the management node 210, the API server component 211 obtains a task request from the user through a communication interface with the client, where the task request includes the training task and the preset node number interval. The elastic component 213 determines the number of computing nodes currently able to run the training task according to the container information maintained by the proxy component 221, and determines the computing nodes for running the training task, such as computing node 220a and computing node 220b, when the number of computing nodes able to run the training task is greater than or equal to the minimum value of the preset node number interval and less than or equal to the maximum value of the preset node number interval.
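A minimal sketch of this admission check (the function and variable names are illustrative):

```python
def can_schedule(num_available_nodes, node_interval):
    # node_interval = (minimum, maximum) number of nodes the training task may invoke.
    lo, hi = node_interval
    return lo <= num_available_nodes <= hi

# e.g. the task asks for between 2 and 4 nodes and 3 nodes are currently free.
print(can_schedule(3, (2, 4)))   # True  -> assign the training task
print(can_schedule(1, (2, 4)))   # False -> wait or reject
```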
Step 420, the management node 210 generates a first cluster communication file according to the computing nodes.
For the first cluster communication file, refer to the description of the chip resource configuration file above; details are not repeated here.
Step 430, the management node 210 creates pods to run the training task according to the first cluster communication file.
The elastic component 213 of the management node 210 constructs the pod of each computing node according to the first cluster communication file, establishes communication between the chips used by the pods and between those chips and the management node 210, and issues the training task to the pod of each computing node, so that the pod of each computing node runs its own training process in the training task.
In this way, the management node 210 can determine the number of nodes running the training task according to the preset node number interval when starting the training task, so that the number of nodes can be scaled flexibly and no redundant node is required in the model training system 200 when the training task is restarted during breakpoint resume training.
It will be appreciated that, in order to implement the functions in the above embodiments, the server includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application scenario and design constraints imposed on the solution.
Fig. 5 is a schematic structural diagram of a model training device according to an embodiment of the present application. The model training device can be used for realizing the function of the management node in the method embodiment, so that the beneficial effects of the method embodiment can be realized. In the embodiment of the present application, the model training apparatus may be the management node 210 shown in fig. 2, or may be a module (such as a chip) applied to a server.
Model training apparatus 500 includes a management module 510 and a configuration module 520. The management module 510 is configured to implement scheduling functions such as start-stop of training tasks of the management node 210 in the above-described method embodiment. The configuration module 520 is configured to implement the function of creating and modifying the trunking communication file by the management node 210 in the above-described method embodiment. For example, when model training apparatus 500 is used to implement the method embodiment shown in FIG. 3, management module 510 is used to perform steps 310 and 330 and configuration module 520 is used to perform step 320.
The specific process of the model training apparatus 500 for implementing the model training method includes:
and the management module 510 is configured to stop the training task when a failure node occurs in the training task.
The configuration module 520 is configured to generate a second cluster communication file according to the first cluster communication file, where the first cluster communication file and the second cluster communication file are used to indicate that the cluster communication mode of at least one non-faulty node that does not have a fault is consistent.
The management module 510 is further configured to restart the training task according to the second cluster communication file, where each non-fault node in the restarted training task runs, based on the cluster communication mode, the part of the training task that it ran before the restart.
As a possible implementation manner, when a fault node occurs in the training task and the training task is to be stopped, the management module 510 is specifically configured to:
when a fault node appears in the training task, stopping task scheduling of the fault node and at least one non-fault node;
and releasing the training process with faults in the fault node.
As a possible implementation manner, when the management module 510 releases the training process that fails in the failed node, the management module is specifically configured to:
determining a fault training process with faults according to the process operation information;
and releasing the fault training process, wherein the process running information is used for indicating the running state of the running process contained in the training task.
As a possible implementation manner, if the hardware resource used before the failure training process fails is recovered before the second cluster communication file is generated, the cluster communication mode of the failure node in the second cluster communication file is consistent with the cluster communication mode of the failure node in the first cluster communication file.
The configuration module 520 is specifically configured to, when restarting the training task according to the second cluster communication file:
starting a fault node according to the second cluster communication file and the process operation information;
And restarting the training task by using the fault node and at least one non-fault node, wherein each non-fault node and the fault node in the restarted training task operate the same part of the training tasks in the training task before restarting.
As a possible implementation manner, if the hardware resource used before the failure training process fails is not recovered before the second cluster communication file is generated, the cluster communication mode of the failure node in the second cluster communication file is different from the cluster communication mode of the failure node in the first cluster communication file.
When restarting the training task according to the second cluster communication file, the management module 510 is specifically configured to:
determine that, apart from the faulty node, no redundant node capable of running the faulted training process is available;
distribute the faulted training process to one or more of the at least one non-faulty node according to the second cluster communication file and the process running information; and
restart the training task by using the at least one non-faulty node, where each non-faulty node in the restarted training task additionally runs part of the training task that ran before the restart.
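The sketch below illustrates this branch under the same assumptions as the earlier sketches: the faulted node's partitions are spread over the surviving nodes, whose own entries in the second cluster communication file stay unchanged.

```python
def redistribute_failed_partitions(second_comm_file: dict,
                                   failed_partitions: list,
                                   launch_partition) -> None:
    """Restart path when no redundant node is available: the faulty node's
    work is shared out among the remaining non-faulty nodes."""
    faulty = set(second_comm_file.get("faulty_nodes", []))
    survivors = [n["node_id"] for n in second_comm_file["nodes"]
                 if n["node_id"] not in faulty]
    for i, partition in enumerate(failed_partitions):
        target = survivors[i % len(survivors)]  # simple round-robin placement
        # The chosen non-faulty node keeps its own partition and additionally
        # runs this partition of the faulty node's work.
        launch_partition(node_id=target, partition=partition)
```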
In a possible implementation, the number of nodes invoked by the first cluster communication file is determined based on a preset node number interval of the training task, where the preset node number interval indicates the maximum and minimum numbers of nodes that the training task may invoke.
In a possible implementation, the first cluster communication file and the second cluster communication file are chip resource configuration files.
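Purely as an illustrative assumption (the description above does not define the file format), such a chip resource configuration file might record something like the following, shown here as a Python dict; every field name, address, and value is hypothetical.

```python
# Hypothetical contents of a first cluster communication file, written as a
# Python dict for readability; the preset node number interval and the
# per-node chip (device) entries are shown only to make the preceding
# description concrete, not to specify an actual format.
example_first_comm_file = {
    "node_count_range": {"min": 2, "max": 8},   # preset node number interval
    "nodes": [
        {"node_id": "node-0", "server_ip": "10.0.0.1",
         "devices": [{"device_id": 0, "rank_id": 0},
                     {"device_id": 1, "rank_id": 1}]},
        {"node_id": "node-1", "server_ip": "10.0.0.2",
         "devices": [{"device_id": 0, "rank_id": 2},
                     {"device_id": 1, "rank_id": 3}]},
    ],
}
```

Under this assumed layout, the second cluster communication file would repeat the entries of the non-faulty nodes verbatim, which is what keeps their cluster communication mode unchanged across the restart.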
It should be understood that the model training apparatus 500 according to the embodiments of the present application may be implemented by a CPU, an ASIC, or a programmable logic device (programmable logic device, PLD), where the PLD may be a complex programmable logic device (complex programmable logic device, CPLD), an FPGA, a generic array logic (generic array logic, GAL), or any combination thereof. When the model training apparatus 500 implements the model training method shown in FIG. 3 by software, the model training apparatus 500 and its modules may be software modules.
For a more detailed description of the model training apparatus 500, refer directly to the related description in the embodiment shown in FIG. 3; details are not repeated herein.
By way of example, when the model training apparatus 500 is implemented in hardware, the hardware may be a computing device, such as the server described above, or a processor, a chip, or the like applied to a server. For example, the computing device includes an interface circuit and a control circuit.
The interface circuit is used for receiving signals from other devices outside the computing device and transmitting the signals to the control circuit or sending the signals from the control circuit to the other devices outside the computing device.
The control circuit is configured to implement the method in any one of the possible implementations of the above embodiments by using a logic circuit or by executing code instructions. For the beneficial effects, refer to the description in any of the above embodiments; details are not repeated here.
It should be understood that the computing node, the management node, and the like in the embodiments of the present application may correspond to the model training apparatus 500 in the embodiments of the present application, and may correspond to the respective entities that perform the method in FIG. 3. The foregoing and other operations and/or functions of the modules in the model training apparatus 500 are intended to implement the corresponding procedures of the method in FIG. 3; for brevity, details are not repeated herein.
It should be appreciated that the processor in the embodiments of the present application may be a CPU, an NPU, or a GPU, or may be another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The present application also provides a system as shown in FIG. 6, comprising a plurality of computing devices 600, each computing device 600 comprising a memory 601, a processor 602, a communication interface 603, and a bus 604. The memory 601, the processor 602, and the communication interface 603 are connected to each other by the bus 604.
The memory 601 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory. The memory 601 may store computer instructions; when the computer instructions stored in the memory 601 are executed by the processor 602, the processor 602 and the communication interface 603 are configured to perform the model training method.
The processor 602 may employ a general-purpose central processing unit (central processing unit, CPU), an application-specific integrated circuit (application specific integrated circuit, ASIC), a graphics processor (graphics processing unit, GPU), or any combination thereof. The processor 602 may include one or more chips. The processor 602 may include an AI accelerator, for example, a neural network processor (neural processing unit, NPU). In addition, although FIG. 6 shows an example in which each computing device 600 includes one processor 602, the number and types of processors 602 in each computing device 600 may be set according to service requirements; when one computing device 600 includes multiple processors, the types of the processors are not limited in this application.
The communication interface 603 enables communication between the computing device 600 and other devices or communication networks using a transceiver module such as, but not limited to, a transceiver. For example, a snapshot query request may be obtained through the communication interface 603.
A bus 604 may include a path to transfer information between various components of the computing device 600 (e.g., the memory 601, the processor 602, the communication interface 603).
A communication path is established between each of the computing devices 600 described above through a communication network. Any of the computing devices 600 may be a computer in a distributed storage system (e.g., a server), or a computer in an edge data center, or a terminal computing device.
Each computing device 600 may have disposed thereon the functionality of a computing node and/or a storage node, such as performing steps 310, 320, and 330 shown in FIG. 3, or performing the functionality of the management module 510 and the configuration module 520 in the model training apparatus 500.
The method steps in this embodiment may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (random access memory, RAM), a flash memory, a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a terminal device. Alternatively, the processor and the storage medium may reside as discrete components in a network device or a terminal device.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; an optical medium, for example, a digital video disc (digital video disc, DVD); or a semiconductor medium, for example, a solid state drive (solid state drive, SSD).
While the present application has been described with reference to certain preferred embodiments, a person skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

1. A method of model training, the method comprising:
stopping the training task when a fault node appears in the training task;
generating a second cluster communication file according to the first cluster communication file, wherein the first cluster communication file and the second cluster communication file are used for indicating that the cluster communication mode of at least one non-fault node which does not have a fault is consistent;
restarting the training task according to the second cluster communication file, wherein each non-fault node in the restarted training task runs part of the training task running before restarting according to the cluster communication mode.
2. The method of claim 1, wherein stopping the training task when a failed node occurs in the training task comprises:
stopping task scheduling of the failed node and the at least one non-failed node when the failed node occurs in the training task;
and releasing the training process with the fault in the fault node.
3. The method of claim 2, wherein said releasing the failed training process in the failed node comprises:
determining a fault training process with faults according to the process operation information;
and releasing the fault training process, wherein the process running information is used for indicating the running state of the running process contained in the training task.
4. A method according to claim 3, wherein if the hardware resource used before the failure of the failure training process is recovered before the second trunking communication file is generated, the trunking communication mode of the failed node in the second trunking communication file is consistent with the trunking communication mode of the failed node in the first trunking communication file.
5. The method of claim 4, wherein restarting the training task from the second cluster communication file comprises:
starting the fault node according to the second cluster communication file and the process running information;
restarting the training task by using the fault node and the at least one non-fault node, wherein each non-fault node and the fault node in the restarted training task run the same part of training tasks in the training task before restarting.
6. A method according to claim 3, wherein if the hardware resources used before the failure of the failure training process are not recovered before the generation of the second trunking communication file, the trunking communication mode of the failed node in the second trunking communication file is different from the trunking communication mode of the failed node in the first trunking communication file.
7. The method of claim 6, wherein restarting the training task from the second cluster communication file comprises:
determining that there are no redundant nodes other than the failed node that are capable of running the failure training process;
distributing the failure training process to one or more of the at least one non-failure node according to the second cluster communication file and the process running information;
and restarting the training task by using the at least one non-fault node, wherein each non-fault node in the restarted training task also independently runs part of the training task which runs before the restarting of the training task.
8. The method of any of claims 1-7, wherein the number of nodes invoked by the first cluster communication file is determined based on a preset number of nodes interval of the training task, the preset number of nodes interval being used to indicate a maximum and a minimum of the number of nodes invoked by the training task.
9. The method of any of claims 1-8, wherein the first cluster communication file and the second cluster communication file are chip resource configuration files.
10. A model training apparatus, the apparatus comprising:
the management module is used for stopping the training task when a fault node appears in the training task;
the configuration module is used for generating a second cluster communication file according to the first cluster communication file, wherein the first cluster communication file and the second cluster communication file are used for indicating that the cluster communication mode of the at least one non-fault node which does not have the fault is consistent;
and the management module is further used for restarting the training tasks according to the second cluster communication file, and each non-fault node in the restarted training tasks runs part of the training tasks running before restarting according to the cluster communication mode.
11. The apparatus of claim 10, wherein the management module is specifically configured to:
stopping task scheduling of the failed node and the at least one non-failed node when the failed node occurs in the training task;
and releasing the training process with the fault in the fault node.
12. The apparatus of claim 11, wherein the management module is specifically configured to:
determining a fault training process with faults according to the process operation information;
and releasing the fault training process, wherein the process running information is used for indicating the running state of the running process contained in the training task.
13. The apparatus of claim 12, wherein if the hardware resources used before the failure training process fails are restored before the second trunking communication file is generated, the trunking communication mode of the failed node in the second trunking communication file is consistent with the trunking communication mode of the failed node in the first trunking communication file.
14. The apparatus of claim 13, wherein the management module is specifically configured to:
starting the fault node according to the second cluster communication file and the process running information;
restarting the training task by using the fault node and the at least one non-fault node, wherein each non-fault node and the fault node in the restarted training task run the same part of training tasks in the training task before restarting.
15. The apparatus of claim 12, wherein if the hardware resources used before the failure training process fails are not recovered before the second trunking communication file is generated, the trunking communication mode of the failed node in the second trunking communication file is different from the trunking communication mode of the failed node in the first trunking communication file.
16. The apparatus of claim 15, wherein the management module is specifically configured to:
determining that there are no redundant nodes other than the failed node that are capable of running the failure training process;
distributing the failure training process to one or more of the at least one non-failure node according to the second cluster communication file and the process running information;
and restarting the training task by using the at least one non-fault node, wherein each non-fault node in the restarted training task also independently runs part of the training task which runs before the restarting of the training task.
17. The apparatus of any of claims 10-16, wherein the number of nodes invoked by the first cluster communication file is determined based on a preset number of nodes interval of the training task, the preset number of nodes interval being used to indicate a maximum and a minimum of the number of nodes invoked by the training task.
18. The apparatus of any of claims 10-17, wherein the first cluster communication file and the second cluster communication file are chip resource configuration files.
19. A computing device, comprising a memory and at least one processor, wherein the memory is configured to store a set of computer instructions; and when the processor executes the set of computer instructions, the computing device performs the operational steps of the method according to any one of the preceding claims 1-9.
20. A model training system, the system comprising a management node and at least one computing node, wherein the at least one computing node is configured to perform a training task, and the management node is configured to perform the operational steps of the method according to any one of the preceding claims 1-9 when a fault node occurs in the at least one computing node during execution of the training task.
21. A computer-readable storage medium, the computer-readable storage medium comprising computer software instructions; the computer software instructions, when executed by a processor, perform the operational steps of the method of any of the preceding claims 1-9.
CN202211139622.3A 2022-09-19 2022-09-19 Model training method, device, equipment, system and storage medium Pending CN117725976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211139622.3A CN117725976A (en) 2022-09-19 2022-09-19 Model training method, device, equipment, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211139622.3A CN117725976A (en) 2022-09-19 2022-09-19 Model training method, device, equipment, system and storage medium

Publications (1)

Publication Number Publication Date
CN117725976A true CN117725976A (en) 2024-03-19

Family

ID=90198490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211139622.3A Pending CN117725976A (en) 2022-09-19 2022-09-19 Model training method, device, equipment, system and storage medium

Country Status (1)

Country Link
CN (1) CN117725976A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination