WO2023020355A1 - Distributed training method for AI model and related device - Google Patents

Distributed training method for AI model and related device

Info

Publication number
WO2023020355A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
computing node
computing
model
task
Application number
PCT/CN2022/111716
Other languages
English (en)
French (fr)
Inventor
练韵文
李亿
金小贤
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd.
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Priority to EP22857667.4A (published as EP4375892A1)
Publication of WO2023020355A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the embodiments of the present application relate to the technical field of artificial intelligence (AI), and in particular, to a distributed training method of an AI model and related equipment.
  • AI: artificial intelligence.
  • the current AI field mainly involves three key aspects: training data, AI models, and hardware computing power.
  • the training process of an AI model inputs a large amount of training data into the AI model deployed on hardware, and the AI model uses the hardware's computing power to process and learn from the training data. In most cases, the more training data, the better the learning effect and the higher the accuracy of the AI model. As the scale of the problems solved by AI models grows, the amount of data required for AI model training grows as well, and with it the demand for hardware computing power. For example, some current AI models have 170 billion parameters, the training data used for training amounts to 45 TB, and 355 GPUs running for one year are required to complete the training.
  • the general approach is to increase the scale of parallel computing resources used for AI model training jobs; for example, increasing the computing resources of an AI model training job to 4096 GPUs, more than 11 times the original 355 GPUs, reduces the training time of the AI model to about one month.
  • This application provides a distributed training method and related equipment for an AI model, which can reduce the duration of fault recovery during the training process.
  • the present application relates to a distributed training method for an artificial intelligence (AI) model, applied to an AI platform, where the AI platform is associated with a computing resource pool and the computing resource pool includes multiple computing nodes used for the distributed training of the AI model.
  • the AI platform can perform distributed training on the AI model, and the AI platform is associated with a computing resource pool.
  • the computing resource pool includes multiple computing nodes for distributed training of the AI model.
  • each of the multiple computing nodes executes a training task of the distributed training of the AI model. During the distributed training process of the AI model, the AI platform can determine whether a faulty first computing node exists among the multiple computing nodes; if so, the AI platform performs fault isolation on the first computing node so that the first computing node is no longer used. In addition, the AI platform can determine, from the computing resource pool, a second computing node other than the aforementioned multiple computing nodes, and configure the second computing node so that the second computing node replaces the first computing node to perform the training task of the distributed training of the AI model.
  • when a computing node used for the distributed training of the AI model fails, the failed first computing node is dynamically isolated and a second computing node is added to replace it and continue the training. This ensures that the training process is not interrupted, so the overall training time is not affected, which reduces the time needed for fault recovery.
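The isolate-and-replace flow described above can be sketched in Python. The `ResourcePool` class and its method names are illustrative assumptions made for this sketch, not part of any actual AI-platform API.

```python
# Hypothetical sketch of the fault-recovery flow: isolate the failed node,
# then pick a spare node from the pool and assign it the affected task.

class ResourcePool:
    def __init__(self, nodes, spares):
        self.nodes = set(nodes)      # nodes currently running the training job
        self.spares = set(spares)    # idle nodes elsewhere in the pool
        self.isolated = set()        # fault-isolated nodes, no longer scheduled
        self.assignments = {}        # replacement node -> training task

    def isolate(self, node):
        """Fault-isolate a node so it is no longer used for training."""
        self.nodes.discard(node)
        self.isolated.add(node)

    def configure_replacement(self, task):
        """Pick a spare node and configure it to take over the training task."""
        replacement = self.spares.pop()
        self.nodes.add(replacement)
        self.assignments[replacement] = task
        return replacement

def recover(pool, failed_node, task):
    pool.isolate(failed_node)                 # step 1: fault isolation
    return pool.configure_replacement(task)   # step 2: substitute a new node
```

Because the replacement comes from outside the original node set, the job's degree of parallelism is preserved and training continues uninterrupted.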
  • the computing capability of the second computing node is the same as or equivalent to that of the first computing node, or the specification of the second computing node is the same as or equivalent to that of the first computing node, so as to ensure that the second computing node can successfully replace the first computing node.
  • if the first computing node executes training tasks of the distributed training of other AI models in addition to the training task of the distributed training of this AI model, then after fault isolation is performed on the first computing node, the first computing node is no longer used to execute the training tasks affected by its failure, and the second computing node replaces the first computing node to perform those training tasks; the training tasks affected by the failure include one or more of the following: the training task of the distributed training of the AI model, and the training tasks of the distributed training of other AI models.
  • the first computing node is determined to be a faulty computing node when one or more of the following occurs: a hardware failure of the first computing node, exit of the training process corresponding to the training task executed by the first computing node, or a fault reported by the first computing node.
  • a hardware failure of the first computing node, exit of the training process corresponding to the training task performed by the first computing node, and a fault reported to the AI platform by the first computing node can all be monitored by the AI platform. If the AI platform detects one or more of these, it determines that the first computing node is a faulty computing node and triggers the determination of a second computing node to replace the first computing node in performing the training task. In this way, the AI platform can promptly detect a fault in the distributed training of the AI model, which helps reduce the time needed for fault recovery.
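The three monitored fault signals can be illustrated with a minimal check; the signal names below are assumptions made for this sketch.

```python
# Illustrative check for the three fault conditions the platform monitors.
# Any single signal is sufficient to mark the node as faulty and trigger
# the replacement flow.
FAULT_SIGNALS = ("hardware_failure", "process_exited", "fault_reported")

def is_faulty(node_status: dict) -> bool:
    """A node is faulty if any one of the monitored signals is raised."""
    return any(node_status.get(sig, False) for sig in FAULT_SIGNALS)
```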
  • if the first computing node executes training tasks of the distributed training of other AI models in addition to the training task of the distributed training of this AI model, then when the hardware of the first computing node fails, the training tasks affected by the fault include both the training task of the distributed training of the AI model and the training tasks of the distributed training of other AI models.
  • exit of the training process corresponding to a training task executed by the first computing node covers both exit of the training process corresponding to the training task of the distributed training of the AI model and exit of the training processes corresponding to the training tasks of the distributed training of other AI models; that is, as long as any training process on the first computing node exits, the first computing node is a faulty computing node. When the training process corresponding to the training task of the distributed training of the AI model exits, the training task affected by the failure of the first computing node is the training task of the distributed training of the AI model; when a training process corresponding to a training task of the distributed training of another AI model exits, the affected training task is the training task of that other AI model; when both kinds of training processes exit, the affected training tasks include the training task of the distributed training of the AI model and the training tasks of the distributed training of other AI models.
  • similarly, a fault reported by the first computing node covers faults reported for the training task of the distributed training of the AI model and faults reported for the training tasks of the distributed training of other AI models; that is, as long as the first computing node reports a fault, it is a faulty computing node. When the reported fault concerns the training task of the distributed training of the AI model, the affected training task is the training task of the distributed training of the AI model; when it concerns a training task of the distributed training of another AI model, the affected training task is the training task of that other AI model; when faults are reported for both, the affected training tasks include both.
  • the method further includes: sending a notification of stopping the training process to the first computing node, where the notification instructs the first computing node to stop the training process corresponding to the training task it is executing.
  • some types of hardware failures do not cause the training process on a computing node to exit or stop, but only degrade the node's computing performance. To ensure that the second computing node can successfully replace the first computing node in executing the training task, the AI platform sends the first computing node a notification to stop the training process corresponding to the training task; this avoids the situation in which the second computing node is already executing the training task originally executed by the first computing node while the first computing node is still executing it. It should be understood that the notification instructs the first computing node to stop the training process corresponding to the training task affected by the failure of the first computing node.
  • when the affected training task is the training task of the distributed training of the AI model, the notification of stopping the training process instructs the first computing node to stop the training process corresponding to that training task; when the affected training task is a training task of the distributed training of another AI model, the notification instructs the first computing node to stop the training process corresponding to that training task; when the affected training tasks include both the training task of the distributed training of the AI model and training tasks of the distributed training of other AI models, the notification instructs the first computing node to stop the training processes corresponding to both.
  • the method further includes: sending a notification of suspending the training process to a third computing node, where the third computing node is a computing node among the multiple computing nodes that has not failed, and the notification instructs the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.
  • the distributed training of the AI model includes gradient calculation and gradient synchronization across the multiple computing nodes. When the first computing node fails, if the training process on the unfailed third computing node is not suspended, the third computing node will proceed to gradient synchronization after computing its gradient; however, the first computing node has been isolated due to the fault and cannot participate in the gradient synchronization, so the synchronization would fail. To avoid this, the training process executed by the third computing node is suspended until the newly added second computing node joins the training.
  • the notification of suspending the training process specifically instructs the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after it finishes the gradient calculation of the distributed training.
  • that is, the training process executed by the third computing node is suspended after the gradient calculation is complete; in this way, once the newly added second computing node joins the training, gradient synchronization can be performed directly, which helps reduce the fault recovery time.
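The suspend-after-gradient-calculation behavior can be sketched as a worker step that computes its local gradient, then waits before gradient synchronization if a suspension has been requested. `compute_gradient` and `synchronize` are stand-in callables invented for this sketch, not a real training framework's API.

```python
import threading

# Sketch of one training step on a surviving node: on a "suspend" request,
# the node finishes the current gradient computation but pauses *before*
# gradient synchronization, so sync can resume directly once the new node
# has joined.

def training_step(compute_gradient, synchronize, pause_event, resume_event):
    grad = compute_gradient()          # local gradient is still computed
    if pause_event.is_set():           # suspension requested by the platform
        resume_event.wait()            # hold here, before gradient sync
    return synchronize(grad)           # sync runs once training resumes
```

Pausing at the synchronization boundary (rather than mid-computation) is what lets the replacement node join with no wasted work on the survivors.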
  • after determining the second computing node, the method further includes: sending a notification of continuing training to the third computing node, where the notification instructs the third computing node to delete the first computing node and add the second computing node in the communication topology of the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task of the distributed training of the AI model; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • the AI platform sends the notification of continuing training to the third computing node; on receiving it, the third computing node learns that the second computing node will replace the failed first computing node in performing the training, so it deletes the first computing node and adds the second computing node in the communication topology of the training framework. The third computing node can then perform gradient synchronization with the second computing node, so that the second computing node obtains the synchronized training parameters.
  • alternatively, the method further includes: sending a notification of continuing training to the third computing node, where the notification instructs the third computing node to delete the first computing node from the communication topology of the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task of the distributed training of the AI model; the communication topology is used for gradient synchronization of the distributed training of the AI model. In this case the failed first computing node is discarded, and only the third computing nodes that have not failed continue to perform the training.
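Both topology updates described above (replace the failed node with the new one, or simply drop it and continue with the survivors) can be sketched with the communication topology modeled as an ordered list of node names, which is an assumption of this sketch.

```python
# Sketch of the two communication-topology updates. In the first variant the
# replacement takes the failed node's position, so the survivors can perform
# gradient synchronization with it; in the second the failed node is simply
# removed and the survivors continue alone.

def replace_node(topology, failed, replacement):
    """Delete the failed node and add the replacement in its place."""
    return [replacement if n == failed else n for n in topology]

def drop_node(topology, failed):
    """Delete the failed node; the remaining nodes continue training."""
    return [n for n in topology if n != failed]
```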
  • the present application relates to a distributed training device for an artificial intelligence AI model.
  • the distributed training device of the AI model has the function of realizing the behavior in the method embodiment of the first aspect above.
  • the functions described above may be implemented by hardware, or may be implemented by executing corresponding software on the hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the distributed training device of the AI model is applied to an AI platform, and the AI platform is associated with a computing resource pool, and the computing resource pool includes multiple computing nodes, each computing node in the plurality of computing nodes executes a training task of the distributed training of the AI model;
  • the device includes: a resource management module, configured to perform fault isolation on a first computing node, where the first computing node is a faulty computing node among the multiple computing nodes;
  • a task scheduling module, configured to determine a second computing node, where the second computing node is a computing node in the computing resource pool other than the multiple computing nodes, and to configure the second computing node so that the second computing node replaces the first computing node to perform the training task.
  • the first computing node is determined to be a faulty computing node when one or more of the following occurs: a hardware failure of the first computing node, exit of the training process corresponding to the training task executed by the first computing node, or a fault reported by the first computing node.
  • the task scheduling module is further configured to: send a notification of stopping the training process to the first computing node, where the notification instructs the first computing node to stop the training process corresponding to the training task it is executing.
  • the task scheduling module is further configured to: send a notification of suspending the training process to a third computing node, where the third computing node is a computing node among the multiple computing nodes that has not failed, and the notification instructs the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.
  • the notification of suspending the training process specifically instructs the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after it finishes the gradient calculation of the distributed training.
  • the task scheduling module is further configured to: send a notification of continuing training to the third computing node, where the notification instructs the third computing node to delete the first computing node and add the second computing node in the communication topology of the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task of the distributed training of the AI model; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • alternatively, the task scheduling module is further configured to: send a notification of continuing training to the third computing node, where the notification instructs the third computing node to delete the first computing node from the communication topology of the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task of the distributed training of the AI model; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • the present application relates to a computing device, the computing device includes a processor and a memory, wherein: the memory stores computer instructions, and the processor executes the computer instructions to implement the method of the first aspect and its possible implementation manners.
  • the present application relates to a computer-readable storage medium in which computer instructions are stored; when the computer instructions are executed by a computing device, the computing device executes the method of the above first aspect and its possible implementations, or implements the functions of the apparatus of the above second aspect and its possible implementations.
  • the present application relates to a computer program product containing instructions which, when run on a computing device, cause the computing device to execute the method of the above first aspect and its possible implementations, or enable the computing device to implement the functions of the apparatus of the above second aspect and its possible implementations.
  • Fig. 1 is a schematic diagram of data-parallel distributed training;
  • Fig. 2 is a schematic structural diagram of an AI platform 210 provided by an exemplary embodiment of the present application;
  • Fig. 3 is a schematic diagram of an application scenario of the AI platform 210 provided by an exemplary embodiment of the present application;
  • Fig. 4 is a schematic diagram of the deployment of the AI platform 210 provided by an exemplary embodiment of the present application;
  • Fig. 5 is a schematic structural diagram of a computing device 500 deploying the AI platform 210 provided by an exemplary embodiment of the present application;
  • Fig. 6 is a schematic diagram of the processing flow timeline of a training job;
  • Fig. 7 is a schematic flowchart of a distributed training method for an AI model provided by an exemplary embodiment of the present application;
  • Fig. 8 is a schematic diagram of a user interaction interface provided by an exemplary embodiment of the present application;
  • Fig. 9 is a schematic diagram of gradient synchronization provided by an exemplary embodiment of the present application;
  • Fig. 10 is a schematic diagram of updating the communication topology of a training framework provided by an exemplary embodiment of the present application;
  • Fig. 11 is a schematic diagram of another way of updating the communication topology of the training framework provided by an exemplary embodiment of the present application;
  • Fig. 12 is a schematic diagram of the processing flow timeline of a training job provided by an exemplary embodiment of the present application;
  • Fig. 13 is a schematic flowchart of another distributed training method for an AI model provided by an exemplary embodiment of the present application;
  • Fig. 14 is a schematic structural diagram of a computing device provided by an exemplary embodiment of the present application.
  • Machine learning is a core means of realizing AI.
  • Machine learning has penetrated various industries such as medicine, transportation, education, and finance. Not only professional technical personnel, but even practitioners in these industries without an AI background, look forward to using AI and machine learning to complete specific tasks.
  • an AI model is a mathematical algorithm model that uses machine learning ideas to solve practical problems.
  • the AI model includes a large number of parameters and calculation formulas (or calculation rules). The parameters of the AI model are values that can be obtained by training the AI model on a training data set; for example, the parameters of an AI model are the weights of the calculation factors in its calculation formulas.
  • the AI model also includes some hyperparameters. Hyperparameters are parameters that cannot be obtained by training the AI model on the training data set; they can be used to guide the construction of the AI model or the training of the AI model. There are many kinds of hyperparameters.
  • for example: the number of iterations of AI model training, the learning rate, the batch size, the number of layers of the AI model, and the number of neurons in each layer.
  • the difference between the hyperparameters and the parameters of an AI model is that the values of the hyperparameters cannot be obtained by analyzing the training data set, while the values of the parameters can be modified and confirmed by analyzing the training data set during the training process.
  • the AI model mentioned in this application is a general term, and the AI model includes a deep learning model, a machine learning model, and the like.
  • the neural network model is a mathematical algorithm model that imitates the structure and function of the biological neural network (the central nervous system of animals).
  • a neural network model can include multiple neural network layers with different functions, and each layer includes parameters and calculation formulas. According to different calculation formulas or different functions, different layers in the neural network model have different names. For example, a layer that performs convolutional calculations is called a convolutional layer, and a convolutional layer is often used to extract features from an input signal (such as an image).
  • a neural network model can also be composed of multiple existing neural network models. Neural network models with different structures can be used in different scenarios (such as classification, recognition, etc.) or provide different effects when used in the same scenario.
  • the different structure of the neural network model specifically includes one or more of the following: the number of network layers in the neural network model is different, the order of each network layer is different, and the weights, parameters or calculation formulas in each network layer are different.
  • some neural network models can be trained by a specific training data set and then used alone to complete a task, or combined with other neural network models (or other functional modules) to complete a task.
  • a general AI model needs to be trained before it can be used to complete a task.
  • training the AI model refers to using existing data to make the AI model fit the patterns in those data through a certain method, so as to determine the parameters in the AI model.
  • to train an AI model, a training data set needs to be prepared. Depending on whether the training data in the data set are labeled (that is, whether each datum has corresponding label information such as type, name, or bounding boxes), AI model training is divided into supervised training and unsupervised training. When performing supervised training on an AI model, the training data in the training data set used for training are labeled.
  • the training data in the training data set is used as the input of the AI model, and the AI model calculates the input training data to obtain the output value of the AI model, and the label corresponding to the training data is used as a reference for the output value of the AI model.
  • each training datum in the training data set is used to iteratively train the AI model, and the parameters of the AI model are continuously adjusted until, for the input training data, the AI model can output with high accuracy values that are the same as or close to the corresponding labels.
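As a toy illustration of supervised training, the loop below fits a one-parameter model y = w * x to labeled samples by repeatedly adjusting the parameter toward the labels; the data and model are invented for this example.

```python
# Minimal supervised-training illustration: labels guide iterative parameter
# updates until the model's outputs approach the labels.

def train(samples, lr=0.1, epochs=100):
    w = 0.0                                   # single model parameter
    for _ in range(epochs):
        for x, label in samples:              # labeled training data
            pred = w * x                      # model output for this datum
            grad = 2 * (pred - label) * x     # d(squared error)/dw
            w -= lr * grad                    # adjust parameter toward label
    return w

# Labels generated by y = 3x, so training should recover w close to 3.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
```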
  • in unsupervised training, the training data in the data set used for training are unlabeled; the training data in the training data set are input to the AI model in turn, and the AI model gradually recognizes the associations and potential patterns among the training data in the training data set.
  • a trained AI model can be used to complete a specific task.
  • AI models in machine learning need to be trained in a supervised learning manner.
  • training an AI model in a supervised learning manner enables the model to learn, in a more targeted way, the association between the training data and the corresponding labels in the labeled training data set, which makes the trained AI model more accurate when used to predict on other input inference data.
  • the loss function is a function used to measure how well the AI model has been trained (that is, to calculate the difference between the result predicted by the AI model and the real target).
  • during training, the loss function is used to judge the difference between the value predicted by the current AI model and the real target value, and the parameters of the AI model are updated until the value predicted by the AI model is very close to the real target value, that is, until the loss function is smaller than a threshold and relatively stable; the AI model is then considered trained.
  • Gradient: a vector containing the partial derivatives of a function.
  • in the training process of an AI model, the gradient descent method is often used to update the parameters of the model; therefore, it is necessary to calculate the gradient of the loss function with respect to the parameters, and then update the parameters of the AI model according to that gradient.
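A minimal sketch of this relationship: the gradient is approximated as the vector of partial derivatives of the loss, and a descent step moves the parameters against it. The `loss` function here is an invented example whose minimum is at (1, -2).

```python
# The gradient of the loss is the vector of its partial derivatives
# (approximated numerically here); a gradient-descent step updates the
# parameters in the opposite direction of the gradient.

def gradient(loss, params, eps=1e-6):
    """Numerical gradient: vector of partial derivatives of `loss`."""
    g = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        g.append((loss(bumped) - loss(params)) / eps)
    return g

def descent_step(loss, params, lr=0.1):
    """One gradient-descent update of the parameters."""
    return [p - lr * d for p, d in zip(params, gradient(loss, params))]

# Example loss with minimum at (1, -2); repeated steps approach it.
loss = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2
```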
  • Distributed training is one of the commonly used acceleration methods in the AI model training process.
  • Distributed training refers to splitting the training among multiple independent computing nodes for independent calculation, and then periodically summarizing and redistributing the results, thereby accelerating the training process of the AI model.
  • Distributed training can include data-parallel distributed training.
  • in data-parallel distributed training, the same AI model is deployed on multiple computing nodes, and the training data in the training data set are distributed to the multiple computing nodes, which perform calculations simultaneously; after the gradients of the model parameters generated on the computing nodes are aggregated, the model parameters are updated.
  • when the training data set is distributed across m computing nodes, there are two common strategies: (1) the batch size on each of the m computing nodes is the same as the batch size when computing with a single computing node, where the batch size refers to the number of training samples selected from the training data set before each parameter update; (2) the batch size on each computing node is the single-node batch size divided by m, so that the global batch size after aggregation remains the same.
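The two batch-size strategies above can be illustrated with a small helper; the function name and parameters are hypothetical, chosen only for this sketch:

```python
def per_node_batch_size(single_node_batch_size, m, keep_global=True):
    """Return the batch size each of m nodes should use.

    keep_global=True  -> strategy (2): divide by m so the aggregated
                         global batch size stays unchanged.
    keep_global=False -> strategy (1): every node keeps the single-node
                         batch size, multiplying the effective global
                         batch size by m.
    """
    if keep_global:
        assert single_node_batch_size % m == 0, "batch size must divide evenly"
        return single_node_batch_size // m
    return single_node_batch_size
```

For example, with a single-node batch size of 256 and m = 8 nodes, strategy (2) gives each node a batch of 32.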
  • the AI model training method is described by taking data parallel distributed training as an example.
  • FIG. 1 is a schematic diagram of an exemplary data parallel distributed training.
  • the computation of the data-parallel distributed training is performed by m computing nodes (computing node 1, computing node 2, ..., computing node m);
  • in each round of training, the training samples on each computing node are different, so the m computing nodes process m batches of samples (the first batch of samples, the second batch of samples, ..., the m-th batch of samples);
  • each of the m computing nodes calculates a gradient, so the m computing nodes produce m gradients (gradient 1, gradient 2, ..., gradient m);
  • during gradient synchronization, the m gradients are averaged; the parameters of the AI model are updated according to the average of the m gradients, and then the next round of training is performed with the updated AI model.
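One round of the data-parallel training shown in FIG. 1 can be simulated in-process. The scalar model, mean-squared-error loss, and all names below are illustrative assumptions, not the platform's actual implementation:

```python
# Simulate one round of data-parallel training with m nodes.
# The "model" is a single scalar weight w in the prediction y = w * x.
def node_gradient(w, batch):
    # mean-squared-error gradient computed on one node's batch
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

def data_parallel_round(w, node_batches, lr=0.05):
    grads = [node_gradient(w, b) for b in node_batches]  # m gradients, one per node
    avg_grad = sum(grads) / len(grads)                   # gradient synchronization: average
    return w - lr * avg_grad                             # every node applies the same update
```

After the averaged update, all nodes hold identical parameters, so the next round can begin from a consistent model.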
  • the AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users. There are various pre-trained AI models or AI sub-models built into the AI platform to solve different problems.
  • the AI platform can search for and build applicable AI models according to the needs of users. A user only needs to specify their needs in the AI platform, prepare the training data set by following the prompts, and upload it to the AI platform; the AI platform can then train an AI model that meets the user's needs. Alternatively, the user prepares their own algorithm (also called the initial AI model) and training data set according to the prompts and uploads them to the AI platform; based on the user's algorithm and training data set, the AI platform can train an AI model.
  • the AI model before being trained by the AI platform (for example, an algorithm uploaded by a user, an algorithm preset by the AI platform, or a pre-trained model) is called an initial AI model.
  • Deep learning is a type of machine learning technology based on deep neural network algorithms, and its main feature is to use multiple nonlinear transformation structures to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
  • a container is a relatively independent and isolated environment for process running constructed using a virtualization technology in a computer operating system.
  • the environment can contain separate file systems, namespaces, resource views, etc.
  • Using containers can simplify the software deployment process, enhance software portability and security, and improve system resource utilization.
  • a job is a collection of a set of programs that need to be executed to complete a specific computing service, and usually corresponds to a set of processes, containers, or other runtime entities on one or more computers.
  • a task is a single program in a set of programs corresponding to a job, usually corresponding to a process, container or other runtime entity on a computer. Wherein, a job includes at least one task.
  • a training job is a set of programs that need to be executed to complete the training of an initial AI model.
  • the completion of a training job represents the completion of the training of an initial AI model, and obtains the trained AI model.
  • a training task is a single program in the set of programs corresponding to a training job; that is, an instance of user-submitted task logic that distinguishes one task from another.
  • the training task of an initial AI model is used to perform multiple rounds of iterative training on the initial AI model.
  • a training job includes at least one training task.
  • the training process is the process of a training task executed on the computing node.
  • one training process corresponds to one training task, and one computing node can execute one or more training tasks, so one or more training processes exist on one computing node.
  • the training framework is the toolkit or function package that needs to be relied on during the AI model training process, and is the running program framework that each training task in the training job needs to rely on.
  • each deep learning researcher needs to write a lot of repetitive code; to improve work efficiency, researchers write this code into a framework and put it on the Internet for all researchers to use together, and such shared code is what is meant by a framework.
  • the most popular deep learning frameworks in the world include TensorFlow, Caffe, Theano, MXNet, Torch and PyTorch.
  • the computing resource pool consists of computing resources that can be used for AI model training.
  • the computing resources can be computing nodes.
  • computing resources refer to all computing nodes used in the training process, and each computing node can be a computing device (such as a server) or a computing card (such as a GPU).
  • FIG. 2 is a schematic structural diagram of an AI platform 210 provided by an embodiment of the present application. It should be understood that FIG. 2 shows only an exemplary division of modules. As shown in FIG. 2, the AI platform 210 includes a task scheduling module 211, a resource management module 212 and a data storage module 213. The AI platform 210 is associated with a computing resource pool 220, which includes multiple computing nodes; the AI platform can schedule the computing nodes in the computing resource pool 220 for AI model training.
  • the task scheduling module 211 is used for: configuring training jobs, scheduling training jobs; receiving training jobs submitted by users, managing the training jobs, and applying for computing resources to run the training jobs.
  • how an initial AI model is trained includes: how many training tasks the training job corresponding to the initial AI model is divided into, and which training tasks they are. What training data is used includes: how much training data the training job corresponding to the initial AI model requires, which training data it requires, and how much and which training data each training task in that training job requires. What computing resources are used includes: how many computing nodes are used to execute the training job corresponding to the initial AI model, what specifications those computing nodes have, and what specifications of computing nodes are used to execute each training task in that training job.
  • the resource management module 212 is used for: computing resource management, scheduling computing resources, and allocating computing resources for training jobs.
  • the resource management module 212 needs to understand the topology (Topo) information of the cluster, where the cluster refers to the cluster composed of all computing resources; when allocating computing resources, affinity-based allocation is performed according to physical location.
  • Topo: topology.
  • the affinity principle refers to preferentially allocating resources that are in the same cabinet or the same physical location.
  • the task scheduling module 211 is further configured to: configure the training tasks in the training job to be executed on the computing resources allocated by the resource management module 212 .
  • the task scheduling module 211 can divide a training job into one or more training tasks according to the number of computing nodes required by the training job; for example, if a training job requires a certain number of computing nodes, it is divided into the same number of training tasks, and each training task is then configured to be executed on the corresponding computing node.
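The one-task-per-node division described above can be sketched as follows; the function name and the dictionary layout are hypothetical illustrations, not the platform's actual data structures:

```python
def split_job_into_tasks(job_name, num_nodes):
    """Divide a training job into one training task per computing node."""
    return [
        {"job": job_name, "task_id": i, "node": i}  # task i runs on node i
        for i in range(num_nodes)
    ]
```

A job that requires four computing nodes is thus divided into four training tasks, each configured onto its own node.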
  • Data storage module 213 (for example, it may be the data storage resource corresponding to the OBS provided by the cloud service provider): used to store the training framework, the training data set uploaded by the user, the initial AI model uploaded by the user, the initial AI model uploaded by other users, and The trained AI model, etc.
  • one or more training jobs can be executed at the same time, and each training job is used to train an AI model.
  • the training of an AI model is based on a single training framework, which is the running program framework that every training task of the corresponding training job needs to depend on; each training job includes one or more training tasks, and all training tasks in a given training job rely on the same running program framework.
  • when the computing resource pool 220 executes n training jobs, the computing resource pool 220 is used to train n AI models; for any one of the n training jobs, the running program framework that all of its training tasks depend on is the same training framework, and that training framework can be obtained from the data storage module 213.
  • taking one training job as an example, the start of the training job by the task scheduling module 211 is described below.
  • the task scheduling module 211 applies to the resource management module 212 for computing resources to execute the multiple training tasks in the training job; the resource management module 212 allocates multiple computing nodes for the training job or for these multiple training tasks, and returns the allocation result to the task scheduling module 211; the task scheduling module 211 delivers the training framework, training data set, initial AI model, and so on to these computing nodes, or these computing nodes obtain the training framework, training data set, initial AI model, and so on themselves.
  • the task scheduling module 211 configures the multiple training tasks on the multiple computing nodes, thereby starting the training.
  • the task scheduling module 211 can also inform each of the multiple computing nodes which computing node or nodes it is jointly executing the training job with, so that each node knows which computing node or nodes to synchronize training parameters with; training parameter synchronization includes gradient synchronization.
  • the task scheduling module 211 and the resource management module 212 can communicate. In this way, the task scheduling module 211 can apply to the resource management module 212 for computing resources for executing training tasks.
  • the task scheduling module 211 can communicate with the computing resource pool 220, so that the task scheduling module 211 can call the computing nodes in the computing resource pool 220 to execute the training task.
  • the resource management module 212 can communicate with the computing resource pool 220 , so that the resource management module 212 can allocate and schedule computing resources in the computing resource pool 220 .
  • the computing nodes in the computing resource pool 220 can communicate with each other, so that multiple computing nodes corresponding to the same training job can perform gradient synchronization.
  • the gradient synchronization process described in this application includes the following three possible situations:
  • each of the multiple computing nodes calculates a gradient and sends the calculated gradient to the AI platform 210; the AI platform 210 therefore receives multiple gradients and aggregates them to obtain an aggregated gradient; the AI platform 210 then sends the aggregated gradient back to each computing node, and each computing node updates the model parameters based on the aggregated gradient.
  • each of the multiple computing nodes calculates a gradient and sends the calculated gradient to the other computing nodes; each computing node can therefore obtain multiple gradients and aggregates them locally to obtain the aggregated gradient; each computing node updates the model parameters based on the aggregated gradient.
  • each of the multiple computing nodes calculates a gradient, and one of the computing nodes is used to aggregate the gradients; the other computing nodes send their calculated gradients to that one computing node, which therefore obtains multiple gradients and aggregates them to obtain the aggregated gradient; that computing node then sends the aggregated gradient back to the other computing nodes, and each computing node updates the model parameters based on the aggregated gradient.
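The three gradient-synchronization situations above can be simulated in ordinary Python. In a real deployment the gradients would be exchanged over the network; all function names below are illustrative:

```python
# In-process simulation of the three gradient-synchronization situations.
def aggregate(grads):
    # element-wise average of the m gradient vectors
    return [sum(g) / len(grads) for g in zip(*grads)]

def sync_via_platform(node_grads):
    # Situation 1: each node sends its gradient to the AI platform,
    # which aggregates and returns the result to every node.
    agg = aggregate(node_grads)
    return [list(agg) for _ in node_grads]

def sync_all_to_all(node_grads):
    # Situation 2: each node receives every other node's gradient
    # and aggregates locally.
    return [aggregate(node_grads) for _ in node_grads]

def sync_via_root(node_grads, root=0):
    # Situation 3: one designated node aggregates the gradients,
    # then sends the aggregated gradient back to the other nodes.
    agg = aggregate(node_grads)
    return [list(agg) for _ in node_grads]
```

All three situations deliver the same aggregated gradient to every node; they differ only in where the aggregation happens and in the communication pattern.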
  • the AI platform further includes an algorithm management module 214 (not shown in FIG. 2 ).
  • the algorithm management module 214 is used to: provide an initial AI model management interface for users to upload initial AI models created based on their own training objectives; or, users obtain existing initial AI models in the initial AI model library.
  • the algorithm management module 214 can also be used to obtain the initial AI model preset on the AI platform according to the task goal input by the user.
  • the initial AI model created by users based on their own training goals can be written based on the framework provided by the AI platform.
  • the initial AI model may include an AI model that has not been trained, and an AI model that has been partially trained but not fully trained.
  • An AI model that has not been trained means that the constructed AI model has not been trained using the training data set, and the parameters in the constructed AI model are all preset values.
  • the task scheduling module 211 and the algorithm management module 214 can communicate, and are used to obtain the access address of the initial AI model from the algorithm management module 214 .
  • the AI platform 210 further includes a human-computer interaction module 215 (not shown in FIG. 2 ), which provides an interactive interface with the user.
  • the human-computer interaction module 215 communicates with the task scheduling module 211, forwards the user's instructions to the task scheduling module 211, obtains the status information of the training process, the trained AI model, etc., and provides the status information and the AI model to the user.
  • the AI platform in this application can be a system that can interact with users.
  • This system can be a software system, a hardware system, or a combination of software and hardware, which is not limited in this application.
  • FIG. 3 is a schematic diagram of an application scenario of an AI platform 210 provided by an embodiment of the present application.
  • the AI platform 210 may be fully deployed in a cloud environment.
  • Cloud environment is an entity that uses basic resources to provide users with cloud services under the cloud computing model.
  • the cloud environment includes cloud data centers and cloud service platforms.
  • Cloud data centers include a large number of basic resources (including computing resource pools, storage resources, and network resources) owned by cloud service providers.
  • the computing resource pool included in a cloud data center can be a large number of computing nodes (such as servers).
  • the AI platform 210 can be independently deployed on a server or a virtual machine in the cloud data center, or deployed in a distributed manner across multiple servers in the cloud data center, across multiple virtual machines in the cloud data center, or across both servers and virtual machines in the cloud data center.
  • the AI platform 210 is abstracted by the cloud service provider into an AI cloud service on the cloud service platform and provided to users; the cloud environment uses the AI platform 210 deployed in the cloud data center to provide the AI platform cloud service to users.
  • users can determine the tasks to be completed by the AI model, upload training data sets to the cloud environment, etc. through the application program interface (application program interface, API) or graphical user interface (graphical user interface, GUI).
  • the AI platform 210 in the environment receives user task information and training data sets, and performs data preprocessing and AI model training.
  • the AI platform returns the status information of the training process of the AI model to the user through the API or GUI.
  • the trained AI model can be downloaded by users or used online to complete specific tasks.
  • the user when the AI platform in the cloud environment is abstracted into an AI cloud service and provided to the user, the user can purchase the usage time of the container with a fixed resource usage. The longer the usage time, the higher the cost, and vice versa. During this usage time, the AI platform trains the AI model. Alternatively, the user can pre-charge, and after the training is completed, the settlement will be made according to the number of GPUs used and the duration of use.
  • the deployment of the AI platform 210 provided by the present application is relatively flexible, as shown in FIG. 4 , in another embodiment, the AI platform 210 provided by the present application can also be distributed and deployed in different environments.
  • the AI platform 210 provided in this application can be logically divided into multiple parts, and each part has different functions.
  • the AI platform 210 includes a task scheduling module 211 , a resource management module 212 and a data storage module 213 .
  • Each part of the AI platform 210 can be deployed in any two or three environments of the terminal computing device, the edge environment, and the cloud environment.
  • Terminal computing devices include: terminal servers, smart phones, notebook computers, tablet computers, personal desktop computers, smart cameras, etc.
  • the edge environment is an environment that includes a collection of edge computing devices that are relatively close to the terminal computing device, and the edge computing devices include: edge servers, edge small stations with computing capabilities, and the like.
  • the AI platform 210 deployed in different environments or devices cooperate to provide users with functions such as training AI models.
  • the task scheduling module 211 in the AI platform 210 is deployed in the terminal computing device
  • the resource management module 212 in the AI platform 210 is deployed in the edge computing device in the edge environment
  • the data storage module 213 in the AI platform 210 is deployed in the cloud computing device in the cloud environment
  • the user sends the training job to the task scheduling module 211 in the terminal computing device
  • the terminal computing device applies for computing resources to the resource management module 212 in the edge computing device
  • the edge computing device allocates computing resources for the training job
  • the terminal computing device configures the training tasks in the training job to be executed on the allocated computing resources, and the data required for executing the training tasks, such as the sample set and the initial AI model, are obtained from the data storage module 213 in the cloud computing device.
  • this application does not restrict which parts of the AI platform 210 are deployed in which environment; in actual application, the deployment can be adapted according to the computing power of the terminal computing device, the resource occupancy of the edge environment and the cloud environment, or specific application requirements.
  • FIG. 5 is a schematic diagram of a hardware structure of a computing device 500 deployed with an AI platform 210 .
  • the computing device 500 shown in FIG. 5 includes a memory 501 , a processor 502 , a communication interface 503 and a bus 504 .
  • the memory 501, the processor 502, and the communication interface 503 realize the communication connection between each other through the bus 504.
  • the memory 501 may be a read only memory (read only memory, ROM), a random access memory (random access memory, RAM), a hard disk, a flash memory or any combination thereof.
  • the memory 501 can store programs, and when the programs stored in the memory 501 are executed by the processor 502, the processor 502 and the communication interface 503 are used to execute the AI platform 210 to train the AI model for the user.
  • the memory can also store training data sets. For example, a part of storage resources in the memory 501 is divided into a data storage module 213 for storing data required by the AI platform 210 .
  • the processor 502 may be a central processing unit (central processing unit, CPU), an application specific integrated circuit (application specific integrated circuit, ASIC), GPU or any combination thereof.
  • Processor 502 may include one or more chips.
  • the processor 502 may include an AI accelerator, such as a neural network processor (neural processing unit, NPU).
  • the communication interface 503 uses a transceiver module such as a transceiver to implement communication between the computing device 500 and other devices or communication networks. For example, data can be acquired through the communication interface 503 .
  • Bus 504 may include pathways for transferring information between various components of computing device 500 (eg, memory 501 , processor 502 , communication interface 503 ).
  • Fig. 6 is a schematic diagram of the timeline of the processing flow of the relevant training job.
  • the processing flow of the training job is implemented based on the AI platform shown in Fig. 2.
  • the processing flow of the training job includes the following steps:
  • the task scheduling module 211 starts the training job according to the user's configuration and applies to the resource management module 212 for computing resources for executing the training job; the resource management module 212 allocates multiple computing nodes for the training job, and the task scheduling module 211 starts the training tasks in the training job on the multiple computing nodes.
  • Each of the plurality of computing nodes performs data loading, and starts to perform training (calculation) after completing the data loading.
  • data loading refers to preparing the data required for training, including obtaining the training framework, training data set, initial AI model, etc., and deploying the training framework.
  • each computing node periodically saves a Ckpt (checkpoint) file through the training script of the training task.
  • the training script is the training program run by the training task;
  • the Ckpt file is a file saved during the execution of the training task; it is a binary file that stores variables such as weights, biases, and gradients, and is used to restore the training progress after the training task fails.
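A minimal sketch of periodic checkpoint saving and restoring, using Python's pickle as a stand-in for a training framework's binary Ckpt format; the file paths and field names are illustrative assumptions:

```python
import os
import pickle

def save_ckpt(path, step, weights):
    """Periodically save training state (step counter and weights) as a binary file."""
    with open(path, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)

def restore_ckpt(path):
    """Restore training progress from the last checkpoint, or start fresh."""
    if not os.path.exists(path):       # no checkpoint yet: begin from step 0
        return 0, None
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["weights"]
```

After a failure, a re-applied computing node would call `restore_ckpt` on the last saved file and resume training from the recorded step instead of starting over.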
  • when a hardware failure or a software failure (for example, the training task freezes, times out, or exits) occurs on any computing node, the execution of the training task becomes abnormal, and the training task fails and exits. It should be understood that as long as one training task in the training job fails and exits, the entire training job fails and exits, that is, the training job is interrupted.
  • the AI platform updates the status of the training job. For example, the AI platform can show the user that the training job was interrupted.
  • the AI platform re-applies for computing resources for the training job, that is, the AI platform re-applies for multiple computing nodes for the training job, and each re-applied computing node is used to execute the training job.
  • the Ckpt file pulled by each re-applied computing node is the Ckpt file that was saved, before the failure occurred, by the training script of the training task that the node needs to execute. It should be understood that the training task executed by each re-applied computing node is a training task in the training job.
  • Each re-applied computing node continues training based on the pulled Ckpt file, that is, each re-applied computing node executes the required training task based on the pulled Ckpt file.
  • the AI platform re-applies for computing resources for the training job, and both re-applying for computing resources and starting the training tasks on the re-applied computing resources take a long time, which makes fault recovery time-consuming.
  • the training state is saved in the Ckpt file for recovery after a failure; since the Ckpt file is relatively large, obtaining it takes a long time, which further lengthens fault recovery.
  • this application mainly solves the problem of long time-consuming fault recovery due to failure of computing nodes used to perform training tasks during the AI model training process.
  • the technical solution provided by this application provides a dynamic fault recovery capability and guarantees lossless recovery of the entire training when a fault occurs during the training process. Specifically, during the training of the AI model, when a computing node used for training fails, the faulty computing node is dynamically isolated and a new computing node is added to replace it without interrupting the training job, ensuring that the training process of the AI model is not interrupted and that the training duration is not affected.
  • the new computing node is a computing node that was not being used to perform training when the resource was applied for; or the new computing node is a computing node that was already being used to perform training when the resource was applied for, but the training task it executes and the training task executed by the faulty computing node do not belong to the same training job.
  • this application improves the functions of the AI platform 210 shown in Figure 2, including enhancing the capabilities of the task scheduling module 211 and the resource management module 212, so that the task scheduling module 211 has functions such as fault recovery, and the resource management module 212 has functions such as fault isolation and dynamic resource adjustment. The details are as follows.
  • the resource management module 212 of the present application is also used to: perform fault monitoring on any computing node used to execute training tasks; perform fault isolation on a first computing node when the first computing node fails; and report the fault to the task scheduling module 211, that is, notify the task scheduling module 211 that the first computing node is faulty.
  • the first computing node refers to a type of computing node that fails, and may be one or more computing nodes.
  • the resource management module 212 of the present application is specifically used to: monitor whether a hardware failure occurs on any computing node used to execute a training task, and monitor whether the training process on any such computing node exits. If one or more of the following conditions is met: a hardware failure occurs on the computing node, or the training process on the computing node exits, then the computing node has failed, that is, the computing node is a first computing node.
  • fault isolation has two meanings: first, the training tasks of a training job are executed by multiple computing nodes, and if a first computing node exists among them, the first computing node is removed from those computing nodes so that it is no longer used to execute the training tasks of the training job; second, in the case of a hardware failure of the first computing node, after the first computing node is fault-isolated, it will not be used to execute any training task before it has recovered from the fault.
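The two meanings of fault isolation above can be sketched as a small bookkeeping class; the class name, fields, and methods are hypothetical illustrations, not the platform's actual interfaces:

```python
# Bookkeeping sketch of the two-part fault-isolation behaviour.
class ResourceManager:
    def __init__(self, job_nodes):
        self.job_nodes = set(job_nodes)  # nodes currently executing the training job
        self.isolated = set()            # nodes barred from all scheduling

    def isolate(self, node, hardware_fault=False):
        # First meaning: remove the faulty node from the job's node set.
        self.job_nodes.discard(node)
        # Second meaning: on a hardware fault, bar the node from any
        # training task until it has recovered from the fault.
        if hardware_fault:
            self.isolated.add(node)

    def schedulable(self, node):
        # A hardware-isolated node cannot be assigned any training task.
        return node not in self.isolated
```

A node isolated for a software fault leaves the job but may still be scheduled elsewhere, while a hardware-isolated node is excluded from all scheduling until recovery.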
  • the task scheduling module 211 is further configured to: receive a fault reported from the first computing node.
  • each computing node in this application also monitors whether the training process corresponding to its training task has a running fault; when a running fault is detected, the computing node determines that it is a first computing node and reports the fault to the task scheduling module 211. From a software perspective, when each computing node executes a training task, a monitoring program in the computing node monitors whether the training process corresponding to the training task has a running fault.
  • when the monitoring program detects that a fault occurs in the training process corresponding to the training task, it is determined that the computing node is faulty, that is, the computing node is determined to be a first computing node, and the monitoring program reports the fault to the task scheduling module 211.
  • the resource management module 212 is further configured to: receive a fault reported from the first computing node; after receiving the fault reported from the first computing node, perform fault isolation on the first computing node, And report the fault to the task scheduling module 211 .
  • when each computing node in this application executes a training task and detects that the training process corresponding to the training task has a running fault, it determines that it has failed, that is, it determines that it is a first computing node, and reports the fault to the resource management module 212; after receiving the fault reported by the first computing node, the resource management module 212 performs fault isolation on the first computing node and forwards the fault to the task scheduling module 211.
  • the monitoring program in the computing node monitors whether the training process corresponding to the training task has a running fault; when the monitoring program detects that a fault occurs in the training process, it is determined that the computing node is faulty, that is, the computing node is determined to be a first computing node, and the monitoring program reports the fault to the resource management module 212.
  • the aforementioned failure of the training process does not include the exit of the training process from the computing node.
  • the task scheduling module 211 of the present application is also used to: perform fault recovery after a computing node fails during the execution of a training task; that is, after the fault is reported by the first computing node, fault recovery is performed.
  • the task scheduling module 211 of the present application is specifically used to: notify the third computing node, which has not failed, to suspend execution of the training task, that is, notify the third computing node to suspend the training process, and apply to the resource management module 212 for a second computing node to replace the first computing node.
  • the second computing node refers to a type of computing node used to replace the first computing node, and can be one or more computing nodes; the second computing node is used to execute the training task originally executed by the first computing node; the second computing node may be a computing node in the computing resource pool 220 not used to execute a training task, or a computing node in the computing resource pool 220 that is already executing a training task, provided the training task executed by the second computing node and the training task executed by the first computing node do not belong to the same training job.
  • the third computing node refers to a type of computing node that has not failed, and may be one or more computing nodes; the third computing node and the first computing node are used to execute training tasks in the same training job.
  • the task scheduling module 211 of the present application is specifically configured to: notify the resource management module 212 to perform fault isolation on the first computing node, notify the third computing node to suspend execution of the training task, and apply to the resource management module 212 for a second computing node to replace the first computing node.
  • the resource management module 212 of the present application is further configured to: after receiving the application for the second computing node from the task scheduling module 211 , reallocate computing resources, that is, allocate a second computing node from the computing resource pool 220 to replace the first computing node; and after reallocating computing resources, notify the task scheduling module 211 of the result of reallocating computing resources.
  • the resource management module 212 can add computing nodes during the execution of the training task, that is, dynamically adjust resources. For example, if a certain computing node fails while executing one or more training tasks, the resource management module 212 may add a second computing node to replace that first computing node in executing the training task.
  • the task scheduling module 211 of the present application is also configured to: receive the result of reallocating computing resources from the resource management module 212 , call the added second computing node to execute the training task originally executed by the first computing node, and notify the third computing node to continue the execution of the training task, that is, notify the third computing node to continue executing the previously suspended training process.
  • the resource management module 212 can report a fault to the task scheduling module 211, and the task scheduling module 211 can notify the resource management module 212 to perform fault isolation on the first computing node.
  • Because the task scheduling module 211 can communicate with the computing resource pool 220 , the task scheduling module 211 can notify the computing nodes in the computing resource pool 220 to suspend the execution of the training task and to continue the execution of the training task.
  • Because the resource management module 212 can communicate with the computing resource pool 220 , the resource management module 212 can monitor whether a computing node in the computing resource pool 220 fails when executing a training task, perform fault isolation on the first computing node in the computing resource pool 220 , and so on.
  • data-parallel distributed training is divided into two stages: computing by multiple computing nodes and synchronizing gradients.
  • the training job is divided into multiple training tasks, which are executed by multiple computing nodes; in the computing phase, the multiple computing nodes independently complete their own calculations and obtain the corresponding gradients; in the gradient synchronization phase, each of the multiple computing nodes provides the gradient obtained by its own calculation, and the nodes jointly complete the gradient synchronization.
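The two stages above can be sketched as one data-parallel iteration (a minimal illustration with scalar gradients; the function names are assumptions, not from this application):

```python
def data_parallel_step(local_grad_fns):
    """One data-parallel iteration: each node computes its own gradient
    independently, then all nodes jointly synchronize (here: average)."""
    # Computing phase: every node finishes its own calculation.
    local_grads = [fn() for fn in local_grad_fns]
    # Gradient synchronization phase: each node contributes its gradient.
    return sum(local_grads) / len(local_grads)

# Four illustrative nodes whose local gradients are 1.0, 2.0, 3.0, 4.0.
print(data_parallel_step([lambda: 1.0, lambda: 2.0, lambda: 3.0, lambda: 4.0]))  # 2.5
```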
  • the multiple training tasks are based on the same training framework, so the communication topology can be set in the training framework and used for gradient synchronization of the multiple computing nodes.
  • the communication topology is a topology composed of multiple computing nodes that execute a training job.
  • the communication topology records which computing nodes jointly execute the training job and which computing nodes participate in gradient synchronization; based on the communication topology record, communication can be performed among the plurality of computing nodes.
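A minimal, hypothetical form of such a communication topology record (the job and node names are illustrative assumptions):

```python
# The record lists which nodes jointly execute the training job and
# therefore take part in gradient synchronization with each other.
communication_topology = {
    "training_job": "job-A",                            # illustrative name
    "nodes": ["node-0", "node-1", "node-2", "node-3"],  # jointly execute the job
}

def sync_peers(topology, self_node):
    """The peers a node communicates with during gradient synchronization,
    read from the communication topology record."""
    return [n for n in topology["nodes"] if n != self_node]

print(sync_peers(communication_topology, "node-1"))  # ['node-0', 'node-2', 'node-3']
```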
  • the present application also optimizes the capabilities of the training framework and adds fault-tolerant processing of training tasks to the existing training framework; that is, it supports dynamically adding and deleting computing nodes to ensure high availability of tasks.
  • multiple training tasks in a training job are executed by multiple computing nodes.
  • When the first computing node fails, the resource management module 212 will perform fault isolation on the first computing node and assign a second computing node to replace it; in this case, the communication topology in the training framework needs to be updated, that is, the first computing node is deleted from the communication topology in the training framework and the second computing node is added.
  • After the resource management module 212 assigns the second computing node to replace the first computing node, it notifies the task scheduling module 211 ; the task scheduling module 211 notifies the third computing node among the multiple computing nodes of the information of the second computing node, and the third computing node deletes the first computing node from, and adds the second computing node to, the communication topology in the training framework on it.
  • The task scheduling module 211 sends the training framework, training data set, initial AI model, etc. to the second computing node, or the second computing node can obtain the training framework, training data set, initial AI model, etc. by itself.
  • The task scheduling module 211 also sends the information of the third computing node to the second computing node, so the second computing node can deploy the training framework and build a communication topology in the deployed training framework based on its own information and the information of the third computing node; after the training framework is deployed on the second computing node, the task scheduling module 211 can also configure and start, on the second computing node, the training task originally executed by the first computing node.
  • the communication topology in the training framework on the third computing node and the second computing node is the same, so that the third computing node and the second computing node can perform gradient synchronization.
  • Because the second computing node has not previously executed the training task originally executed by the first computing node, it has no corresponding training parameters, so the second computing node does not provide training parameters during this gradient synchronization.
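This round of gradient synchronization can be sketched as follows (a minimal illustration with scalar gradients; the node names are assumptions): the freshly added second computing node contributes nothing and is skipped.

```python
def synchronize_gradients(gradients_by_node):
    """Average the gradients of the participating nodes; a freshly added
    second computing node has no training parameters yet, contributes
    None, and is skipped in this round of gradient synchronization."""
    provided = [g for g in gradients_by_node.values() if g is not None]
    return sum(provided) / len(provided)

# node-4 is the replacement (second) node and provides nothing this round.
print(synchronize_gradients({"node-0": 2.0, "node-2": 4.0, "node-4": None}))  # 3.0
```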
  • When a fault occurs during training, the AI platform 210 provided in the embodiment of the present application dynamically isolates the failed first computing node and supplements a second computing node, and the second computing node replaces the failed first computing node to perform training; this ensures that the training process is not interrupted, so that the overall training time is not affected and the time for fault recovery is reduced.
  • FIG. 7 is a schematic flowchart of a distributed training method for an AI model provided in an embodiment of the present application.
  • the distributed training method for an AI model shown in FIG. 7 can be implemented based on the AI platform 210 shown in FIG. 2 .
  • the task scheduling module and resource management module in FIG. 7 can be the task scheduling module 211 and the resource management module 212 in FIG. 2 respectively, and the computing resources in FIG. 7 can be computing resource pool 220 or computing resources in computing resource pool 220 .
  • the distributed training of the initial AI model corresponds to one training job, and the computing resources required for the distributed training of the initial AI model are multiple computing nodes; the training job can be divided into multiple training tasks, the multiple training tasks use the same training framework, and the multiple computing nodes correspond to the multiple training tasks one by one; each of the multiple computing nodes executes the corresponding training task, that is, each of the multiple computing nodes runs a training process for the corresponding training task, so there are multiple training processes, and the multiple training tasks correspond to the multiple training processes one by one.
  • each of the multiple computing nodes can only execute one training task, that is, there is only one training process on each computing node, and there is only one training framework on each computing node.
  • the computing resources in Figure 7 may represent all or part of the computing nodes in the multiple computing nodes, and each computing node in the multiple computing nodes is used to execute one of the multiple training tasks;
  • the first computing node in FIG. 7 can represent any computing node among the multiple computing nodes that fails;
  • the third computing node in FIG. 7 can represent any computing node among the multiple computing nodes that does not fail;
  • the second computing node in FIG. 7 can represent a computing node used to replace any first computing node, and the second computing node is used to execute the training task originally executed by the first computing node.
  • the process of the distributed training method of the AI model includes four stages: task initiation, status monitoring, fault isolation, and fault recovery. The foregoing stages are described in detail below.
  • The first stage is task initiation.
  • Step S1 The user starts the training job.
  • the user creates and submits a training job for training the initial AI model through the human-computer interaction module 215 ; the human-computer interaction module 215 generates the user's instruction according to the training job submitted by the user and forwards the user's instruction to the task scheduling module 211 .
  • the task scheduling module 211 receives the training job submitted by the user, so as to enable the user to start the training job.
  • FIG. 8 is a schematic diagram of a user interaction interface provided by an embodiment of the present application.
  • the interface shown in FIG. 8 is an interface for creating a training job displayed in the human-computer interaction module 215 .
  • Creating a training job includes three steps: service selection, specification confirmation, and completion.
  • the AI platform 210 of this application can provide the user with AI model training services based on the billing mode of pay-as-you-go.
  • the algorithms used for training can be obtained on demand from different sources, such as previously used algorithms, preset algorithms, common frameworks (that is, common training frameworks), and custom algorithms; when selecting an algorithm, it can be chosen based on its name.
  • the computing resource pool 220 is divided into a public resource pool and a dedicated resource pool. Users can select the corresponding computing resource pool for training according to their needs; both the public resource pool and the dedicated resource pool can include computing resources of different specifications.
  • Based on the scale of computing resources, suitable computing resources for training can be selected from computing resources of different specifications; when selecting computing resources, the number of computing nodes can be set based on training requirements.
  • After the user completes the service selection, the user confirms the specifications; after the specification confirmation is completed, the user completes the creation of the training job.
  • the task scheduling module 211 learns that the user has initiated the execution of the training job, and applies for computing resources for executing the training job, that is, executes step S2.
  • Step S2 The task scheduling module applies for computing resources from the resource management module.
  • the task scheduling module 211 will apply to the resource management module 212 for computing resources according to the user's settings. For example, if the user sets how many computing nodes are needed to execute the training job, the task scheduling module 211 will apply to the resource management module 212 for that many computing nodes; if the user sets the specification of the computing nodes, it will apply to the resource management module 212 for computing nodes of that specification.
  • Step S3 The resource management module allocates computing resources.
  • the resource management module 212 allocates multiple computing nodes for the training job from the computing resource pool 220 according to the application for computing resources, and the specification of the allocated computing nodes is the specification of the computing nodes requested by the task scheduling module 211 .
  • the resource management module 212 will return the result of allocating computing resources to the task scheduling module 211 , so the task scheduling module 211 can know which computing nodes are the computing resources allocated by the resource management module 212 for the training job; the result of allocating computing resources may optionally include: the name of the computing node, the identifier of the computing node, and the specification of the computing node.
  • Step S4 The task scheduling module starts training.
  • the task scheduling module 211 divides the training job into multiple training tasks according to the number of computing nodes set by the user when creating the training job, wherein the number of training tasks equals the number of computing nodes set; each training task in the training job is a training task of the distributed training of the initial AI model and is used to perform multiple rounds of iterative training on the initial AI model. The task scheduling module 211 determines, from the multiple computing nodes, a computing node for executing each training task, and configures each training task to be executed on the computing node determined for it, so the multiple training tasks correspond one-to-one to the multiple computing nodes. It should be understood that one training task runs on one computing node, so that each of the multiple computing nodes runs a training process for one training task in the training job; therefore the multiple training tasks also correspond one-to-one to the multiple training processes.
  • When the task scheduling module 211 determines the computing node for executing each training task from the multiple computing nodes, it can match a suitable computing node from the multiple computing nodes to each training task based on the specification of the computing node required by that training task; when the task scheduling module 211 configures a training task to be executed on a certain computing node, it can precisely configure the training task to execute on the corresponding computing node based on the name or the identifier of the computing node.
  • For example, if the user sets 4 computing nodes when creating the training job, the task scheduling module 211 will divide the training job into 4 training tasks, apply to the resource management module 212 for 4 computing nodes to execute the training job, and configure the four training tasks to be executed on the four computing nodes in one-to-one correspondence.
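The division and one-to-one configuration described above can be sketched as follows (the task and node names are illustrative assumptions):

```python
def split_training_job(job_name, nodes):
    """Divide a training job into as many training tasks as there are
    allocated computing nodes and map tasks to nodes one-to-one."""
    tasks = [f"{job_name}-task-{i}" for i in range(len(nodes))]
    return dict(zip(tasks, nodes))

# A training job executed by 4 computing nodes, as in the example above.
assignment = split_training_job("job-A", ["node-0", "node-1", "node-2", "node-3"])
print(assignment["job-A-task-2"])  # node-2
```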
  • the second stage is status monitoring.
  • For each computing node among the multiple computing nodes used to execute the training job, the AI platform 210 provides a status monitoring capability while the computing node executes the training task, and the AI platform 210 periodically performs status monitoring on the computing node; status monitoring includes fault monitoring of the computing nodes by the resource management module 212 and self-fault monitoring by the computing nodes, specifically as follows:
  • Step S5 The resource management module performs fault monitoring on computing resources.
  • the resource management module 212 performs fault monitoring on computing resources, that is, the resource management module 212 performs fault monitoring on computing nodes. Specifically, the resource management module 212 performs periodic fault monitoring on each of the multiple computing nodes to determine whether each of the multiple computing nodes fails when executing the training task. The fault monitoring includes monitoring whether a hardware fault occurs on the computing node and/or monitoring whether the training process on the computing node exits; when a hardware fault occurs or the training process exits, the computing node is confirmed to have failed.
  • the hardware faults in this application can be classified into the first type of hardware fault and the second type of hardware fault.
  • The first type of hardware fault causes the training process on the computing node to exit or stop; for example, the computing node is powered off, or the network between the computing node and the other computing nodes used to execute the same training job is disconnected.
  • The second type of hardware fault does not cause the training process on the computing node to exit or stop, but only affects the computing performance of the computing node; for example, computation on the computing node is very slow while the training process on it does not exit.
  • A hardware fault of the computing node is only one possible cause of the exit of the training process on the computing node; this application does not specifically limit the reasons for the exit of the training process. Whenever the training process exits, it can be detected by the resource management module 212 .
  • Step S6 Each computing node performs self-fault monitoring.
  • each of the multiple computing nodes executes the training task, it monitors whether a training process corresponding to the training task has an operation failure. From the perspective of software, when each computing node of this application executes a training task, the monitoring program in it will monitor whether the training process corresponding to the training task has an operation failure.
  • Steps S5 and S6 have no fixed order of execution; when step S5 is executed, step S6 is optional, and when step S6 is executed, step S5 is optional.
  • The third stage is fault isolation.
  • The resource management module 212 performs fault monitoring on computing resources (that is, computing nodes), and/or the computing nodes perform self-fault monitoring; a fault detected by either monitoring method can be regarded as a fault of the computing node.
  • The failure of a computing node causes the failure of the training task executed by that computing node; the failure of a training task means the failure of the training job, and the failure of the training job means the failure of the distributed training of the initial AI model.
  • Mode 1 The resource management module triggers fault isolation.
  • the following steps S7 to S9 are steps for triggering fault isolation by the resource management module.
  • Step S7 The resource management module performs fault isolation on the first computing node.
  • When the resource management module 212 detects that a fault occurs on the first computing node among the multiple computing nodes, it performs fault isolation on the first computing node, so that the first computing node is no longer used to execute the training tasks in the training job; moreover, in the case that the first computing node has a hardware fault, the first computing node is prevented from being called again before the fault is recovered.
  • the failures monitored by the resource management module 212 include hardware failures of the computing nodes and/or the exit of the training process on the computing nodes.
  • Step S8 The resource management module reports the failure to the task scheduling module.
  • the resource management module 212 reports the detected fault to the task scheduling module 211 , for example, reports the hardware fault of the first computing node or the exit of the training process on the first computing node to the task scheduling module 211 , so that the task scheduling module 211 can process the training process on the first computing node.
  • Step S9 the task scheduling module sends a notification to stop the training process to the first computing node.
  • step S9 is optional.
  • When the resource management module 212 detects that the first type of hardware fault occurs on the first computing node, the training process on the first computing node stops automatically, and step S9 is not executed; when the resource management module 212 detects that the second type of hardware fault occurs on the first computing node, the training process on the first computing node has not stopped, and step S9 needs to be executed.
  • the task scheduling module 211 sends a notification to stop the training process to the first computing node where the second type of hardware failure occurs, and the notification to stop the training process is used to stop the training process on the first computing node where the second type of hardware failure occurs.
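The decision of whether step S9 is needed follows directly from the fault classification above; a minimal sketch (the fault labels are illustrative assumptions):

```python
# Illustrative labels for the two hardware fault types described above.
TYPE1_FAULTS = {"power_off", "network_disconnect"}  # training process exits by itself
TYPE2_FAULTS = {"severe_slowdown"}                  # training process keeps running

def needs_stop_notification(fault):
    """Step S9 (stop-training-process notification) is only needed for
    second-type hardware faults, where the training process on the
    faulty node has not stopped on its own."""
    return fault in TYPE2_FAULTS

print(needs_stop_notification("severe_slowdown"))  # True
print(needs_stop_notification("power_off"))        # False
```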
  • Mode 2 Computing nodes trigger fault isolation.
  • the following steps S10 to S12 are steps for triggering fault isolation by computing nodes.
  • Step S10 The first computing node detects that the training process fails.
  • each computing node in the plurality of computing nodes determines that it has a fault when it detects that the training process on it has a running fault, that is, it determines that it is the first computing node. From the perspective of software, when the monitoring program on each of the multiple computing nodes detects that the training process on the computing node fails, it determines that the computing node is faulty, that is, it determines that the computing node is the first A computing node. Wherein, the training process on the computing node has an operation failure, that is, the training task corresponding to the training process has an operation failure.
  • Step S11 The first computing node reports the fault to the task scheduling module.
  • the first computing node reports the fault to the task scheduling module; from the perspective of software, the monitoring program on the first computing node reports the fault to the task scheduling module 211 .
  • Step S12 The task scheduling module sends a notification of fault isolation to the first computing node to the resource management module, and the resource management module performs fault isolation on the first computing node.
  • the task scheduling module 211 first sends a notification of fault isolation to the first computing node to the resource management module 212, and after the resource management module 212 receives the notification from the task scheduling module 211 of fault isolation of the first computing node, Fault isolation is performed on the first computing node.
  • the first computing node first reports the fault to the task scheduling module 211, and then is isolated by the resource management module 212.
  • Alternatively, the first computing node may first report the fault to the resource management module 212 ; the resource management module 212 then reports the fault to the task scheduling module 211 and performs fault isolation on the first computing node.
  • In this case, steps S11 and S12 become:
  • S11 The first computing node reports the fault to the resource management module, and the resource management module reports the fault to the task scheduling module;
  • S12 The resource management module performs fault isolation on the first computing node.
  • the fourth stage is failure recovery.
  • the failure recovery phase is optional.
  • Optionally, the interface for creating a training job provided by this application has a configuration item "Start failure recovery"; for a training job jointly executed by multiple computing nodes, if the user has checked the configuration item "Start failure recovery" and a first computing node among the multiple computing nodes fails, steps S13 to S18 are executed; otherwise steps S13 to S18 are not executed.
  • Optionally, the user can also set a failure rate threshold; for a training job jointly executed by multiple computing nodes, if the ratio of the number of failed first computing nodes to the total number of computing nodes among the multiple computing nodes exceeds the failure rate threshold, steps S13 to S18 are not executed; otherwise steps S13 to S18 are executed.
  • For example, when creating a training job, the user sets the training job to be executed by 4 computing nodes and sets the failure rate threshold to 50%. If the number of first computing nodes among the 4 computing nodes exceeds 2, failure recovery is not performed; otherwise failure recovery is performed.
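The failure rate threshold check can be sketched as follows (a minimal illustration of the example above; the function name is an assumption):

```python
def should_recover(num_failed, num_nodes, failure_rate_threshold):
    """Run failure recovery (steps S13-S18) only while the share of
    failed first computing nodes does not exceed the threshold."""
    return num_failed / num_nodes <= failure_rate_threshold

# The example above: 4 computing nodes, threshold 50%.
print(should_recover(2, 4, 0.5))  # True  (2 of 4 failed, ratio 0.5)
print(should_recover(3, 4, 0.5))  # False (3 of 4 failed, ratio 0.75)
```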
  • the resource management module 212 detects the failure of the computing node and reports the failure to the task scheduling module 211, and the task scheduling module 211 determines the training tasks affected by the failure of the computing node.
  • When a computing node monitors that a training process on it has a running fault, the computing node has failed.
  • The computing node directly reports the fault to the task scheduling module 211 , or indirectly reports it (the computing node first reports the fault to the resource management module 212 , and the resource management module 212 then reports the fault to the task scheduling module 211 ); when reporting the fault, the computing node directly informs the task scheduling module 211 which training process or processes have a running fault, that is, which training task or tasks have a running fault.
  • In the case where the resource management module 212 reports the fault, the task scheduling module 211 judges that the training task executed on the faulty computing node is faulty; in the case where the computing node monitors a running fault of a training process and reports the fault to the task scheduling module 211 directly or indirectly, the task scheduling module 211 is informed directly that a training task executed by the computing node has failed.
  • the task scheduling module 211 determines that the training task executed on the computing node is faulty or receives that the training task executed by the computing node is faulty, it triggers a fault recovery operation.
  • the stages of failure recovery are described in detail below:
  • Step S13 The task scheduling module sends a notification of suspending the training process to the third computing node that does not fail.
  • the calculation and gradient synchronization of multiple computing nodes are performed alternately.
  • the first computing node cannot participate in gradient synchronization; thus, due to the lack of the first computing node, there may be problems with gradient synchronization. Therefore, in the fault recovery process, it is necessary to notify the third computing node, which has not failed, among the multiple computing nodes to suspend training.
  • the task scheduling module 211 sends a notification of suspending the training process to the third computing node among the multiple computing nodes.
  • the notification of suspending the training process is used to suspend the training process on the third computing node, wherein the training task corresponding to the suspended training process is a training task in the faulty training job.
  • Step S14 The third computing node suspends the training process after receiving the notification of suspending the training process from the task scheduling module.
  • the third computing node suspends the training process on it, and waits for receiving the notification of continuing the training, so as to continue to execute the suspended training process.
  • Optionally, after receiving the notification of suspending the training process, the third computing node continues to complete the calculation of the training process and suspends the training process after obtaining the gradient corresponding to the training process.
  • After the third computing node completes the calculation of the training process and obtains the gradient corresponding to the training process, it should enter gradient synchronization; but because it has received the notification of suspending the training process, it suspends the gradient synchronization and starts to wait in a loop (without timing out or exiting) until a notification of continuing training is received.
  • FIG. 9 is a schematic diagram of gradient synchronization provided by an embodiment of the present application.
  • In normal training, multiple computing nodes complete their respective calculations, and then the multiple computing nodes perform gradient synchronization; when a fault occurs, for example the first computing node among the multiple computing nodes fails, the first computing node is removed (that is, the resource management module 212 performs fault isolation on the first computing node), so that the first computing node is no longer used for the current training, and the third computing node among the multiple computing nodes suspends the current training, that is, suspends the gradient synchronization.
  • After the third computing node suspends the training process, it enters a waiting loop. This application can set a maximum waiting time: if the third computing node's waiting time in the loop exceeds the maximum waiting time, it exits the waiting loop and the training fails, to be repaired by operation and maintenance personnel; this avoids hanging indefinitely and increases the robustness of the program.
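The bounded waiting loop can be sketched as follows (a minimal illustration; the function names and polling interval are assumptions):

```python
import time

def wait_for_resume(resume_received, max_wait_s, poll_s=0.005):
    """Loop-wait for the continue-training notification; exit the loop
    after max_wait_s so the node never hangs indefinitely."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if resume_received():
            return True   # notification arrived: resume the training process
        time.sleep(poll_s)
    return False          # timed out: training fails, handed to O&M staff
```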
  • Step S15 The task scheduling module re-applies for computing resources from the resource management module.
  • the task scheduling module 211 applies to the resource management module 212 for a second computing node to replace the first computing node among the multiple computing nodes, so as to continue training based on the second computing node and the third computing node among the multiple computing nodes; the second computing node is a computing node other than the multiple computing nodes. When the task scheduling module 211 applies to the resource management module 212 for the second computing node, it applies based on the specification of the first computing node; for example, the specification of the requested second computing node is the same as or equivalent to that of the first computing node.
  • Step S16: The resource management module reallocates computing resources.
  • Specifically, the resource management module 212 reallocates the second computing node from the computing resource pool 220 and returns the result of reallocating the computing resource to the task scheduling module 211.
  • In one possible case, the result of reallocating computing resources is that the second computing node has been allocated. The result may optionally also include the name of the second computing node, the identifier of the second computing node, the specifications of the second computing node, and so on; this facilitates the task scheduling module 211 in configuring the faulty training task to be executed on the second computing node, so that the second computing node replaces the first computing node.
  • In another possible case, the result of reallocating computing resources is that no second computing node is allocated. This can occur because the computing resource pool 220 is limited and no computing resource is available for reallocation. In this case, training can be continued based on the third computing node according to the configuration before the failure.
  • Step S17: The task scheduling module sends a notification of continuing training to the third computing node.
  • Specifically, the task scheduling module 211 sends a notification of continuing training to the third computing node. The notification is used by the third computing node to update the communication topology in its training framework and to continue the training process after the communication topology is updated. If there are multiple training processes, the notification needs to indicate that the process to be continued is the previously suspended training process. The training framework whose communication topology needs to be updated on the third computing node is the training framework corresponding to the faulty training job, that is, the training framework training the faulty initial AI model.
  • When the resource management module 212 returns the result of reallocating computing resources to the task scheduling module 211, there are two possible situations: the second computing node is allocated, or the second computing node is not allocated. Correspondingly, there are two possibilities for the content and purpose of the notification of continuing training that the task scheduling module 211 sends to the third computing node, described as follows.
  • Case 1: If the result of reallocating computing resources is that the second computing node has been allocated, the notification of continuing training may include the information of the second computing node (such as the name of the second computing node, the identifier of the second computing node, and the specifications of the second computing node). The notification is used by the third computing node to delete the first computing node from, and add the second computing node to, the communication topology in the training framework, and to continue the training process after updating the communication topology; the training framework whose communication topology is updated is the training framework corresponding to the faulty training job, that is, the training framework training the faulty initial AI model. In this case, step S18 is executed, and training then continues based on the third computing node and the second computing node.
  • FIG. 10 is a schematic diagram of a communication topology for updating a training framework provided by an embodiment of the present application.
  • For example, the training job is divided into four training tasks (training task 1, training task 2, training task 3, and training task 4), so that the training job is executed by four computing nodes (computing node 1, computing node 2, computing node 3, and computing node 4). Before the failure, the communication topology in the training framework of each of computing node 1, computing node 2, computing node 3, and computing node 4 is a communication network formed by computing node 1, computing node 2, computing node 3, and computing node 4.
  • Computing node 1, computing node 2, and computing node 3 do not fail, that is, they are third computing nodes; computing node 4 fails, that is, computing node 4 is the first computing node. Computing node 5 is reassigned to replace computing node 4, that is, computing node 5 is the second computing node. Computing node 1, computing node 2, and computing node 3 each delete computing node 4 from, and add computing node 5 to, the communication topology in their training frameworks, updating it to a communication network formed by computing node 1, computing node 2, computing node 3, and computing node 5. In this way, computing node 1, computing node 2, computing node 3, and computing node 5 can continue training, that is, the training job is executed by computing node 1, computing node 2, computing node 3, and computing node 5.
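The topology update of FIG. 10 (deleting computing node 4 and adding computing node 5) can be sketched as follows. The helper function and node names are illustrative assumptions, not an API of any actual training framework.

```python
def update_topology(topology, failed, replacement=None):
    """Rebuild a communication topology after fault isolation: remove the
    failed node and, if a replacement node was allocated, add it in its place."""
    updated = [node for node in topology if node != failed]
    if replacement is not None:
        updated.append(replacement)
    return updated

# Before failure: computing nodes 1-4; node 4 fails and node 5 replaces it.
new_topology = update_topology(["node1", "node2", "node3", "node4"],
                               failed="node4", replacement="node5")
# new_topology == ["node1", "node2", "node3", "node5"]
```

When no replacement node is allocated, calling `update_topology` with `replacement=None` yields a reduced topology containing only the surviving nodes.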
  • Case 2: If the result of reallocating computing resources is that no second computing node is allocated, the notification of continuing training does not include information about a second computing node. The notification is used by the third computing node to delete the first computing node from the communication topology in the training framework and to continue the training process after updating the communication topology; the training framework whose communication topology is updated is the training framework corresponding to the faulty training job, that is, the training framework training the faulty initial AI model. In this case, step S18 is not executed, and training then continues based on the third computing node.
  • It should be understood that, in this case, training continues with fewer computing nodes. For example, the training job corresponds to m computing nodes in total, and the samples for each round of training are divided into m shares, each computing node using 1/m of the samples in each round. If n first computing nodes fail, so that n computing nodes cannot continue training, then each round of training trains n/m fewer of the samples.
  • FIG. 11 is a schematic diagram of another communication topology for updating the training framework provided by the embodiment of the present application.
  • For example, the training job is divided into four training tasks (training task 1, training task 2, training task 3, and training task 4), so that the training job is executed by four computing nodes (computing node 1, computing node 2, computing node 3, and computing node 4). Before the failure, the communication topology in the training framework of each computing node is a communication network formed by computing node 1, computing node 2, computing node 3, and computing node 4. In FIG. 11, computing node 4 fails and no second computing node is allocated, so computing node 1, computing node 2, and computing node 3 each delete computing node 4 from the communication topology, updating it to a communication network formed by computing node 1, computing node 2, and computing node 3, and continue training.
  • Step S18: The second computing node performs data recovery.
  • Specifically, the task scheduling module 211 sends the training framework, the training data set, the initial AI model, and so on to the second computing node, or the second computing node can obtain the training framework, training data set, initial AI model, and so on from the data storage module 213, so that the second computing node can deploy the training framework and use the training data in the training data set to train the initial AI model. The task scheduling module 211 also sends the information of the third computing node (such as the name of the third computing node, the identifier of the third computing node, and the specifications of the third computing node) to the second computing node, and the second computing node can deploy the training framework and build the communication topology in the deployed training framework based on its own information and the information of the third computing node. The task scheduling module 211 configures the training task originally executed by the first computing node to be executed on the second computing node, that is, the second computing node runs a training process for the training task originally executed by the first computing node.
  • After the third computing node updates the communication topology in its training framework and the second computing node builds the communication topology in its training framework, the communication topologies in the training frameworks on the third computing node and the second computing node are the same, so that the third computing node and the second computing node can perform gradient synchronization.
  • In addition, when the second computing node first participates in gradient synchronization, it starts by loading the Ckpt file saved by the first computing node before the failure occurred, so as to recover the first computing node's pre-failure data.
  • The second computing node and the third computing node then perform gradient synchronization; after the gradient synchronization, the model parameters in the AI models trained by the second computing node and the third computing node are the same, so that the next round of training of the AI model can be performed based on the second computing node and the third computing node.
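The data recovery and gradient synchronization of the second computing node can be sketched as follows. The pickle-based Ckpt file layout and the averaging all-reduce are illustrative assumptions, not the format or collective operation actually used by any particular training framework.

```python
import pickle

def save_ckpt(path, params, rnd):
    """Save model parameters and the current training round to a Ckpt file."""
    with open(path, "wb") as f:
        pickle.dump({"params": params, "round": rnd}, f)

def load_ckpt(path):
    """Load the Ckpt file saved before the failure, so the replacement node
    resumes from the first computing node's pre-failure state."""
    with open(path, "rb") as f:
        return pickle.load(f)

def gradient_sync(grads_per_node):
    """All-reduce by averaging: after synchronization, every node holds the
    same gradient, so model parameters stay identical across nodes."""
    n = len(grads_per_node)
    return [sum(column) / n for column in zip(*grads_per_node)]
```

After loading the checkpoint and joining one `gradient_sync`, the replacement node's parameters match the surviving nodes', so the next training round can proceed.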
  • It should be noted that the distributed training method of the AI model shown in FIG. 7 takes as an example the training of one initial AI model with one training job, where each computing node executes only one training task, runs only one training process, and has only one training framework.
  • the AI platform 210 may train multiple initial AI models at the same time.
  • In this case, the multiple initial AI models correspond to multiple training jobs; each computing node executes one or more training tasks (each of the multiple training tasks belongs to one of the multiple training jobs), and there are one or more training processes and one or more training frameworks on each computing node. When multiple initial AI models are trained, the training steps are the same as the process described in FIG. 7; please refer to the description of FIG. 7 for the specific steps, but the following steps need further explanation:
  • Steps S2 and S3: For multiple training jobs, the computing resources used may or may not overlap; when there is overlap, the overlapping computing nodes are used to execute multiple training tasks, and these training tasks belong to different training jobs.
  • Step S5: Each computing node is used to execute one or more training tasks (each of the multiple training tasks belongs to one of the multiple training jobs), so there are one or more training processes on each computing node. When a hardware failure occurs on the first computing node, all of the one or more training processes on the first computing node are affected by the failure; when it is detected that a training process on the first computing node has exited, the exited training process is affected, while the training processes on the first computing node that have not exited continue to run normally.
  • Step S6: Each computing node monitors whether the one or more training processes on it have failed, that is, each computing node monitors each training process on it for faults.
  • Step S7: The first computing node may execute one or more training tasks (each of the multiple training tasks belongs to one of the multiple training jobs), so there may be one or more training processes on the first computing node. When a hardware failure of the first computing node is detected, the first computing node is a faulty computing node; after fault isolation is performed on it, the first computing node is no longer used to execute the one or more training tasks.
  • When it is detected that at least one training process on the first computing node has exited, the first computing node is a faulty computing node; that is, as long as a training process exits on the first computing node, this indicates that the first computing node is faulty. After fault isolation is performed on the first computing node, it is no longer used to execute the training task corresponding to the exited training process, while the training processes on it that have not exited continue to run normally. Fault recovery is performed on the training task corresponding to the exited training process, and there is no need to perform fault recovery on the training tasks corresponding to the processes that have not exited.
  • Step S9: When a hardware failure of the first type occurs on the first computing node, the one or more training processes on the first computing node are all affected, and the notification of stopping the training process is used to stop those one or more training processes, so that fault recovery can subsequently be performed on the training tasks corresponding to them.
  • Step S10: If the first computing node detects that any one of its one or more training processes has a running fault, the first computing node is a faulty computing node; that is, as long as any training process on the first computing node has a running fault, this indicates that the first computing node is faulty.
  • Step S11: When the first computing node reports a fault, it reports the fault only for the training process with a running fault and does not report a fault for training processes that run normally; that is, the first computing node only requests the AI platform to perform fault recovery on the training task corresponding to the faulty training process, and there is no need to perform fault recovery on the training tasks corresponding to the normally running training processes.
  • Step S12: After the first computing node is fault-isolated, it is no longer used to execute the training task corresponding to the training process with a running fault, but continues to be used to execute the training tasks corresponding to the training processes without running faults; that is, the training processes on the first computing node without running faults continue to run normally.
  • The third computing node may be used to execute one or more training tasks (each of the multiple training tasks belongs to one of the multiple training jobs), so there may be one or more training processes on the third computing node; the training process that needs to be suspended on the third computing node is the training process corresponding to the training task in the faulty training job.
  • Steps S15 and S16: Since the first computing node may be used to execute one or more training tasks, one or more second computing nodes may be supplemented for one first computing node. For example, one second computing node may be supplemented to replace the first computing node in executing the one or more training tasks; or multiple second computing nodes may be supplemented to replace the first computing node, each of the multiple second computing nodes executing at least one of the training tasks.
  • Step S17: The training of different initial AI models is implemented on different training frameworks, that is, different training jobs correspond to different training frameworks, and training tasks that do not belong to the same training job correspond to different training frameworks. Because the third computing node may be used to execute one or more training tasks, there may be one or more training frameworks on the third computing node; the training framework whose communication topology needs to be updated on the third computing node is the training framework corresponding to the faulty training task, that is, the training framework corresponding to the faulty training job.
  • FIG. 12 is a schematic diagram of a processing flow timeline of a training job provided by an embodiment of the present application.
  • The processing flow timeline of the training job corresponds to the distributed training of the AI model shown in FIG. 7; the processing flow of the training job includes the following steps:
  • Each of the multiple computing nodes is used to execute one of the training tasks of the training job. After the training job is started, each computing node performs data loading and starts to execute training after the data loading is completed.
  • When the first computing node among the multiple computing nodes fails, fault recovery of the training job includes a training-job hardware repair phase and a training-job software recovery phase: the hardware repair phase includes the following steps (3) and (4), and the software recovery phase includes the following steps (5), (6), and (7).
  • (3) The AI platform automatically performs fault isolation on the first computing node.
  • (4) The AI platform automatically allocates the second computing node to replace the first computing node.
  • (5) The second computing node performs data loading.
  • (6) The third computing node, that is, a computing node among the multiple computing nodes that has not failed, updates the communication topology in its training framework, and the second computing node creates the communication topology in its training framework.
  • (7) The second computing node and the third computing node synchronize training parameters. Training then continues normally based on the second computing node and the third computing node.
  • In this way, the second computing node performs gradient synchronization with the third computing node to obtain the latest training result, which reduces the training loss caused by the failure without relying on high-frequency saving of Ckpt files.
  • The affected stage is the period from the failure of the first computing node to the joining of the second computing node, during which the first computing node does not participate in the computation of its portion of the training samples.
  • The fraction of affected sample computation is (T/t) × (n/m), where T is the fault recovery time, t is the training duration of each round of training, n is the number of first computing nodes, and m is the total number of computing nodes required by the training job.
  • The fault recovery time T can be reduced through optimization of the task scheduling module, thereby reducing the impact of the fault on the entire training job. The fault recovery time T is generally 1 to 2 minutes; for training jobs whose execution time is measured in hours (large-scale training jobs), this essentially achieves lossless fault recovery of the training job.
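As a worked check of the expression above (an illustrative calculation, not part of the claims):

```python
def affected_sample_fraction(T, t, n, m):
    """Fraction of sample computation lost to a fault: roughly T / t rounds
    are affected, and each of those rounds misses n of the m nodes' shares."""
    return (T / t) * (n / m)

# Example: recovery time T = 2 minutes, round duration t = 1 minute,
# n = 1 failed node out of m = 8 nodes in total:
# affected_sample_fraction(2, 1, 1, 8) == 0.25
```

For an hours-long job (hundreds of rounds), this fraction is small, which is why recovery is essentially lossless.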
  • FIG. 13 is a flowchart of a process 1300 of another distributed training method for an artificial intelligence AI model provided by an embodiment of the present application.
  • the process 1300 is described as a series of steps or operations. It should be understood that the process 1300 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 13 .
  • The process 1300 is applied to an AI platform. The AI platform is associated with a computing resource pool, and the computing resource pool includes a plurality of computing nodes for the distributed training of the AI model; each computing node of the plurality of computing nodes executes a training task of the distributed training of the AI model. The process 1300 includes, but is not limited to, the following steps or operations:
  • Step 1301: Perform fault isolation on a first computing node, where the first computing node is a faulty computing node among the plurality of computing nodes;
  • Step 1302: Determine a second computing node, where the second computing node is a computing node in the computing resource pool other than the plurality of computing nodes;
  • Step 1303: Configure the second computing node, so that the second computing node replaces the first computing node in executing the training task.
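Steps 1301 to 1303 can be sketched as follows; the list-based resource pool and the choice of the first free node are simplifying assumptions for illustration, not the claimed scheduling logic.

```python
def recover_from_fault(resource_pool, active_nodes, failed_node):
    """Sketch of steps 1301-1303: isolate the faulty node, determine a
    replacement from the pool, and configure it to take over the task."""
    # Step 1301: fault isolation -- stop using the faulty node.
    active_nodes = [n for n in active_nodes if n != failed_node]
    # Step 1302: determine a second computing node outside the current set.
    candidates = [n for n in resource_pool
                  if n not in active_nodes and n != failed_node]
    if not candidates:
        return active_nodes, None        # no spare node: continue with survivors
    replacement = candidates[0]
    # Step 1303: configure the replacement to execute the failed node's task.
    active_nodes.append(replacement)
    return active_nodes, replacement
```

When the pool has no spare node, the function returns the surviving nodes only, matching the case in which no second computing node is allocated.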
  • In this embodiment, the AI platform can perform distributed training on the AI model. The AI platform is associated with a computing resource pool, which includes multiple computing nodes for the distributed training of the AI model; each of the multiple computing nodes executes a training task of the distributed training of the AI model. During the distributed training process, the AI platform can determine whether there is a faulty first computing node among the multiple computing nodes. If the AI platform determines that there is a faulty first computing node, it performs fault isolation on the first computing node so that the first computing node is no longer used for the training. In addition, the AI platform can determine a second computing node other than the aforementioned multiple computing nodes from the computing resource pool and configure the second computing node, so that the second computing node replaces the first computing node in executing the training task of the distributed training of the AI model.
  • When a computing node used for the distributed training of the AI model fails, this application dynamically isolates the failed first computing node and supplements a second computing node to replace it and continue training, ensuring that the training process is not interrupted; the overall training time is thus not affected, which reduces the time needed for fault recovery.
  • It should be noted that the computing capability of the second computing node is the same as or equivalent to that of the first computing node, or the specification of the second computing node is the same as or equivalent to that of the first computing node, so as to ensure that the second computing node can successfully replace the first computing node.
  • When the first computing node executes training tasks of other AI models' distributed training in addition to the training task of this AI model's distributed training, then after fault isolation is performed on the first computing node, the first computing node is no longer used to execute the training tasks affected by its failure, and the second computing node replaces the first computing node in executing those affected training tasks.
  • The training tasks affected by the failure include one or more of the following: the training task of the distributed training of the AI model, and the training tasks of the distributed training of other AI models.
  • In a possible implementation, any one or more of the following indicate that the first computing node is a faulty computing node: a hardware failure of the first computing node, the exit of a training process corresponding to a training task executed by the first computing node, and a fault reported by the first computing node.
  • The hardware failure of the first computing node, the exit of a training process corresponding to a training task executed by the first computing node, and a fault reported to the AI platform by the first computing node can all be monitored by the AI platform. If the AI platform detects one or more of the above, it determines that the first computing node is a faulty computing node, which triggers determining a second computing node to replace the first computing node in executing the training task. In this way, the AI platform can detect faults in the distributed training of the AI model in time, which helps reduce the time needed for fault recovery.
  • When the first computing node executes training tasks of other AI models' distributed training in addition to the training task of this AI model's distributed training, then when a hardware failure occurs on the first computing node, the training tasks affected by the failure of the first computing node include the training task of the distributed training of the AI model and the training tasks of the distributed training of other AI models.
  • The exit of a training process corresponding to a training task executed by the first computing node includes the exit of the training process corresponding to the training task of this AI model's distributed training and the exit of training processes corresponding to training tasks of other AI models' distributed training; that is, as long as a training process on the first computing node exits, the first computing node is a faulty computing node. When the training process corresponding to the training task of this AI model's distributed training exits, the training task affected by the failure of the first computing node is the training task of this AI model's distributed training; when a training process corresponding to a training task of another AI model's distributed training exits, the affected training task is the training task of that other AI model's distributed training; when both the training process corresponding to this AI model's training task and the training processes corresponding to other AI models' training tasks exit, the training tasks affected by the failure of the first computing node include the training task of this AI model's distributed training and the training tasks of other AI models' distributed training.
  • A fault reported by the first computing node includes a fault reported for the training task of this AI model's distributed training and a fault reported for training tasks of other AI models' distributed training; that is, as long as the first computing node reports a fault, the first computing node is a faulty computing node. When the reported fault concerns the training task of this AI model's distributed training, the training task affected by the failure of the first computing node is the training task of this AI model's distributed training; when the reported fault concerns training tasks of other AI models' distributed training, the affected training tasks are the training tasks of the other AI models' distributed training; when faults are reported for both, the affected training tasks include the training task of this AI model's distributed training and the training tasks of other AI models' distributed training.
  • In a possible implementation, the method further includes: sending a notification of stopping the training process to the first computing node, where the notification of stopping the training process is used to instruct the first computing node to stop the training process corresponding to the training task it executes.
  • Some types of hardware failures do not cause the training process on a computing node to exit or stop, but only degrade the node's computing performance. To ensure that the second computing node can successfully replace the first computing node in executing the training task, the AI platform sends a notification of stopping the training process to the first computing node, instructing it to stop the training process corresponding to the training task; this avoids the situation in which the second computing node is already executing the training task originally executed by the first computing node while the first computing node is still executing it. It should be understood that the notification of stopping the training process instructs the first computing node to stop the training processes corresponding to the training tasks affected by its failure.
  • When the affected training task is the training task of this AI model's distributed training, the notification of stopping the training process instructs the first computing node to stop the training process corresponding to that training task; when the affected training task is a training task of another AI model's distributed training, the notification instructs the first computing node to stop the training process corresponding to that other AI model's training task; when the affected training tasks include both the training task of this AI model's distributed training and training tasks of other AI models' distributed training, the notification instructs the first computing node to stop the training processes corresponding to both.
  • In a possible implementation, the method further includes: sending a notification of suspending the training process to a third computing node, where the third computing node is a computing node among the plurality of computing nodes that has not failed, and the notification of suspending the training process is used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.
  • The distributed training of the AI model includes the calculation and gradient synchronization of the multiple computing nodes. When the first computing node fails, if the training process of a non-failed third computing node is not suspended, the third computing node will proceed to gradient synchronization after computing its gradients; however, the first computing node has been isolated due to the fault and cannot participate in gradient synchronization, so the gradient synchronization would encounter problems. Therefore, to avoid such problems, the training process executed by the third computing node needs to be suspended until a newly added second computing node joins the training.
  • In a possible implementation, the notification of suspending the training process is specifically used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after performing the gradient calculation of the distributed training. In other words, the training process executed by the third computing node is suspended after the gradient calculation is completed; in this way, once the newly added second computing node joins the training, gradient synchronization can be performed directly, which helps reduce the fault recovery time.
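The suspend-after-gradient-calculation behavior can be sketched with a `threading.Event`. This is a minimal illustration only; a real system would use the training framework's collective-communication hooks rather than this hypothetical class.

```python
import threading

class GradientBarrier:
    """Pause surviving nodes after gradient computation, before gradient
    synchronization, until the scheduler signals that training may resume."""

    def __init__(self):
        self._resume = threading.Event()

    def after_gradients(self, timeout=None):
        # Called by a third computing node once its gradients are computed;
        # blocks until the continue-training notification arrives (or timeout).
        return self._resume.wait(timeout)

    def notify_continue(self):
        # Called when the second computing node is ready to join the sync.
        self._resume.set()
```

Because the pause happens after gradient computation, the replacement node can enter gradient synchronization immediately on arrival.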
  • In a possible implementation, after the second computing node is determined, the method further includes: sending a notification of continuing training to the third computing node, where the notification of continuing training is used to instruct the third computing node to delete the first computing node from, and add the second computing node to, the communication topology in the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task of the distributed training; the communication topology is used for the gradient synchronization of the distributed training of the AI model.
  • the AI platform sends the notification to continue training to the third computing node; after receiving it, the third computing node knows that the second computing node will replace the failed first computing node, so it deletes the first computing node from, and adds the second computing node to, the communication topology in the training framework of the distributed training of the AI model; the third computing node can then perform gradient synchronization with the second computing node, allowing the second computing node to obtain the synchronized training parameters.
  • the method further includes: sending a notification to continue training to the third computing node, the notification being used to instruct the third computing node to delete the first computing node from the communication topology in the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • the failed first computing node is discarded and only the unfailed third computing node is used to continue the training.
  • the computing resource pool includes multiple computing nodes used for the distributed training of the AI model, and each of the multiple computing nodes executes one training task of the distributed training.
  • the distributed training apparatus for the AI model can be implemented, in whole or in part, by software, hardware, or a combination of the two.
  • the distributed training apparatus for the AI model can implement the processes described in other embodiments of this application.
  • the distributed training apparatus for the AI model includes: a resource management module 212, configured to perform fault isolation on the first computing node, the first computing node being a failed computing node among the plurality of computing nodes; and a task scheduling module 211, configured to determine a second computing node, the second computing node being a computing node in the computing resource pool other than the plurality of computing nodes, and to configure the second computing node so that it replaces the first computing node to execute the training task.
  • the first computing node is determined to be a failed computing node when the AI platform detects one or more of the following: a hardware fault of the first computing node, exit of the training process corresponding to the training task executed by the first computing node, or a fault reported by the first computing node.
  • the task scheduling module 211 is further configured to send the first computing node a notification to stop the training process, the notification being used to instruct the first computing node to stop the training process corresponding to the training task it executes.
  • the task scheduling module 211 is further configured to send a third computing node a notification to suspend the training process, the third computing node being a computing node among the plurality of computing nodes that has not failed, and the notification being used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.
  • the notification to suspend the training process is specifically used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after completing the gradient computation of the distributed training.
  • the task scheduling module 211 is further configured to send the third computing node a notification to continue training, the notification being used to instruct the third computing node to delete the first computing node from, and add the second computing node to, the communication topology in the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • the task scheduling module 211 is further configured to send the third computing node a notification to continue training, the notification being used to instruct the third computing node to delete the first computing node from the communication topology in the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.
  • each functional module in the embodiments of this application may be integrated into one processor, may exist separately physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • the present application also provides a computing device 500 as shown in FIG. 5 .
  • the processor 502 in the computing device 500 reads the program and data set stored in the memory 501 to execute the method performed by the aforementioned AI platform.
  • since each module of the AI platform 210 provided by this application can be deployed in a distributed manner on multiple computers in the same environment or in different environments, this application also provides a computing device as shown in Figure 14. The computing device includes multiple computers 1400, and each computer 1400 includes a memory 1401, a processor 1402, a communication interface 1403 and a bus 1404, which are connected to one another through the bus 1404.
  • the memory 1401 may be a read-only memory, a static storage device, a dynamic storage device or a random access memory.
  • the memory 1401 may store programs, and when the programs stored in the memory 1401 are executed by the processor 1402, the processor 1402 and the communication interface 1403 are used to execute part of the method for training the AI model on the AI platform.
  • the memory can also store training data sets; for example, a part of the storage resources in the memory 1401 is divided into a data set storage module for storing the training data sets required by the AI platform.
  • the processor 1402 may be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit, a graphics processor, or one or more integrated circuits.
  • the communication interface 1403 uses a transceiver module such as but not limited to a transceiver to implement communication between the computer 1400 and other devices or communication networks.
  • the training data set can be obtained through the communication interface 1403 .
  • Bus 1404 may include pathways for transferring information between various components of computer 1400 (eg, memory 1401 , processor 1402 , communication interface 1403 ).
  • a communication path is established between each of the above-mentioned computers 1400 through a communication network.
  • Any one or more of the task scheduling module 211 , the resource management module 212 , the data storage module 213 , the algorithm management module 214 and the human-computer interaction module 215 runs on each computer 1400 .
  • Any computer 1400 may be a computer (for example, a server) in a cloud data center, or a computer in an edge data center, or a terminal computing device.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or twisted pair) or wireless means (such as infrared, radio or microwave).
  • the computer-readable storage medium stores the computer program instructions of the provided AI platform.
  • the computer-readable storage medium can be any medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more usable media.
  • the medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., optical disc), or a semiconductor medium (e.g., solid state drive).
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions.
  • the computing device executes the procedures or functions described in the embodiments of the present application.
  • all or part of the embodiments may be implemented by software, hardware or a combination thereof.
  • when implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product of the AI platform provided by this application includes one or more computer instructions for the AI platform; when these computer program instructions are loaded and executed on the computer, all or part of the procedures or functions described in the embodiments of this application are produced.
  • the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods shown in the embodiments of the present application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and other media that can store program code.
  • the modules in the device of the embodiment of the present application can be combined, divided and deleted according to actual needs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Neurology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Hardware Redundancy (AREA)

Abstract

This application relates to the field of artificial intelligence, and provides a distributed training method for an AI model and related devices. The method is applied to an AI platform associated with a computing resource pool; the computing resource pool includes multiple computing nodes used for the distributed training of the AI model, and each of the multiple computing nodes executes one training task of the distributed training. The method includes: performing fault isolation on a first computing node, the first computing node being a failed computing node among the multiple computing nodes; determining a second computing node, the second computing node being a computing node in the computing resource pool other than the multiple computing nodes; and configuring the second computing node so that it replaces the first computing node to execute the training task. Embodiments of this application can reduce the time required for fault recovery.

Description

Distributed training method for an AI model and related devices

This application claims priority to Chinese patent application No. 202110963715.7, entitled "Distributed training method for an AI model and related devices", filed with the Chinese Patent Office on August 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field

Embodiments of this application relate to the technical field of artificial intelligence (AI), and in particular to a distributed training method for an AI model and related devices.
Background

The current AI field mainly involves three key aspects: training data, the AI model, and hardware computing power. Training an AI model is the process of feeding a large amount of training data into an AI model deployed on hardware, where the AI model processes and learns from the training data with the support of the hardware's computing power. In most cases, the more training data, the better the learning effect and the higher the accuracy of the AI model. As the scale of problems solved with AI models grows, the amount of data required for training also keeps increasing, leading to ever greater demands on hardware computing capability. For example, some current AI models have 170 billion parameters and 45 T of training data, and completing their training requires 355 GPUs running for a year. To reduce training time, the common practice is to increase the scale of parallel computing resources used for the training job; for example, raising the computing resources of such a training job to 4096 GPUs, more than 11 times the original 355 GPUs, can reduce the training time of the AI model to about one month.

However, as the scale of computing resources used for training jobs increases, the failure rate of hardware and software during training rises sharply, and a failure during training causes the entire training job to fail and exit or be interrupted. At the same time, the increased resource scale of the training job lengthens the recovery time after the job fails, which in turn increases the overall completion time of the training job.
Summary

This application provides a distributed training method for an AI model and related devices, which can reduce the time required for fault recovery during training.

The above and other objectives are achieved by the subject matter of the independent claims. Further implementations are apparent from the dependent claims, the description and the figures.

Particular embodiments are outlined in the attached independent claims, and other embodiments in the dependent claims.
According to a first aspect, this application relates to a distributed training method for an artificial intelligence (AI) model, applied to an AI platform. The AI platform is associated with a computing resource pool that includes multiple computing nodes used for the distributed training of the AI model, and each of the multiple computing nodes executes one training task of the distributed training. The method includes: performing fault isolation on a first computing node, the first computing node being a failed computing node among the multiple computing nodes; determining a second computing node, the second computing node being a computing node in the computing resource pool other than the multiple computing nodes; and configuring the second computing node so that it replaces the first computing node to execute the training task.

In this application, the AI platform can perform distributed training of the AI model. The AI platform is associated with a computing resource pool that includes multiple computing nodes used for the distributed training, and each of these nodes executes one training task of the distributed training. During the distributed training, the AI platform can determine whether a failed first computing node exists among the multiple computing nodes; if so, it performs fault isolation on the first computing node so that this node is no longer used to execute the training tasks of the distributed training. In addition, the AI platform can determine, from the computing resource pool, a second computing node other than the aforementioned multiple computing nodes, and configure the second computing node so that it replaces the first computing node to execute the training task of the distributed training. In this way, when a computing node used for the distributed training of the AI model fails, this application dynamically isolates the failed first computing node and adds a second computing node to continue training in its place, ensuring that the training process is not interrupted; the overall training duration is thus unaffected, and the fault recovery time is reduced. It should be understood that the computing capability of the second computing node is the same as, or comparable to, that of the first computing node (in other words, the two nodes have the same or comparable specifications), to ensure that the second computing node can successfully replace the first. It should be noted that if, besides the training task of this AI model's distributed training, the first computing node also executes training tasks of the distributed training of other AI models, then after fault isolation the first computing node is no longer used to execute the training tasks affected by its failure, and the second computing node replaces it in executing those affected tasks; the affected tasks include one or more of: the training task of this AI model's distributed training, and the training tasks of the distributed training of other AI models.
In a possible implementation, the first computing node is determined to be a failed computing node when the AI platform detects one or more of the following: a hardware fault of the first computing node, exit of the training process corresponding to a training task executed by the first computing node, or a fault reported by the first computing node.

In this implementation, a hardware fault of the first computing node, the exit of a training process corresponding to a training task executed by the first computing node, and a fault reported by the first computing node to the AI platform can all be monitored by the AI platform. If the AI platform detects one or more of the above, it determines that the first computing node has failed and triggers the determination of a second computing node to replace it in executing the training task; in this way, the AI platform can promptly discover faults in the distributed training of the AI model, which helps reduce the fault recovery time. It should be noted that if the first computing node also executes training tasks of other AI models' distributed training, then when the first computing node has a hardware fault, the tasks affected by its failure include both this AI model's training task and the other AI models' training tasks. Further, the exit of a training process on the first computing node includes the exit of the process corresponding to this AI model's training task as well as the processes corresponding to other AI models' training tasks; that is, as long as any training process exits on the first computing node, the node is a failed computing node. When the process of this AI model's training task exits, the affected task is this AI model's training task; when a process of another AI model's training task exits, the affected task is that other AI model's training task; when both exit, both are affected. Likewise, faults reported by the first computing node include faults reported for this AI model's training task and faults reported for other AI models' training tasks; as long as the first computing node reports a fault, it is a failed computing node, and the tasks affected by its failure are those for which the faults were reported.
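As a purely illustrative sketch (not part of the claimed method), the fault-determination rule above can be expressed as a small predicate together with a helper that derives the affected tasks; all function and parameter names are hypothetical:

```python
def is_faulty(hardware_fault, exited_processes, reported_faults):
    """A node is treated as failed if any monitored signal fires:
    a hardware fault, any training process exiting, or any fault report."""
    return hardware_fault or bool(exited_processes) or bool(reported_faults)

def affected_tasks(hardware_fault, exited_processes, reported_faults, tasks_on_node):
    """Return the training tasks affected by the node's fault.
    A hardware fault affects every task on the node; otherwise only the
    tasks whose processes exited or that were named in fault reports."""
    if hardware_fault:
        return set(tasks_on_node)
    return set(exited_processes) | set(reported_faults)
```

For example, a hardware fault on a node running two jobs' tasks marks both as affected, whereas a single process exit affects only that task.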
In a possible implementation, if the AI platform detects a hardware fault of the first computing node but does not detect the exit of the training process corresponding to the training task executed by the first computing node, then after performing fault isolation on the first computing node, the method includes: sending a notification to stop the training process to the first computing node, the notification being used to instruct the first computing node to stop the training process corresponding to the training task it executes.

In this implementation, some types of hardware faults do not cause training processes on a computing node to exit or stop, and only affect its computing performance. When the first computing node has a hardware fault, to ensure that the second computing node can successfully take over the training task, the AI platform sends the first computing node a notification to stop the training process, instructing it to stop the process corresponding to the training task it executes; this avoids the situation where the first computing node is still executing a training task that the second computing node has already taken over. It should be understood that the notification instructs the first computing node to stop the training processes corresponding to the tasks affected by its failure. It should be noted that if the first computing node also executes training tasks of other AI models' distributed training, the notification instructs it to stop the training processes of whichever affected tasks apply: this AI model's training task, the other AI models' training tasks, or both.
In a possible implementation, after performing fault isolation on the first computing node and before determining the second computing node, the method further includes: sending a notification to suspend the training process to a third computing node, the third computing node being a computing node among the multiple computing nodes that has not failed, and the notification being used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.

In this implementation, the distributed training of the AI model includes computation on multiple computing nodes and gradient synchronization. When the first computing node fails, if the training process of the unfailed third computing node is not suspended, the third computing node will proceed to gradient synchronization once it has computed its gradient; however, the first computing node has been isolated due to the fault and cannot participate in the gradient synchronization, so the synchronization would go wrong. Therefore, to avoid problems with gradient synchronization, the training process executed by the third computing node needs to be suspended until a newly added second computing node joins the training.
In a possible implementation, the notification to suspend the training process is specifically used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after completing the gradient computation of the distributed training.

In this implementation, the training process executed by the unfailed third computing node is suspended only after its gradient computation finishes; in this way, once the newly added second computing node joins the training, gradient synchronization can be performed directly, which helps reduce the fault recovery time.
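A minimal sketch of the idea that the pause point sits between the gradient-computation phase and the synchronization phase on a healthy node; the class and its members are illustrative assumptions, not the patent's implementation:

```python
class Worker:
    """Sketch of a healthy (third) node's training loop. The pause flag is
    checked AFTER gradient computation but BEFORE gradient synchronization,
    so that when a replacement node joins, sync can proceed immediately."""

    def __init__(self):
        self.paused = False
        self.log = []

    def compute_gradient(self, step):
        self.log.append(("compute", step))
        return float(step)  # stand-in for a real gradient

    def synchronize(self, grad, step):
        self.log.append(("sync", step))

    def train_step(self, step):
        grad = self.compute_gradient(step)  # always finish the local gradient
        if self.paused:                     # pause point between the two phases
            self.log.append(("paused", step))
            return False
        self.synchronize(grad, step)
        return True
```

When `paused` is set by the scheduler, the step completes its computation and stops just short of synchronization.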
In a possible implementation, after the second computing node is determined, the method further includes: sending a notification to continue training to the third computing node, the notification being used to instruct the third computing node to delete the first computing node from, and add the second computing node to, the communication topology in the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task of the distributed training; the communication topology is used for gradient synchronization of the distributed training of the AI model.

In this implementation, the AI platform sends the notification to continue training to the third computing node; after receiving it, the third computing node knows that the second computing node will replace the failed first computing node, so it deletes the first computing node from, and adds the second computing node to, the communication topology in the training framework of the distributed training of the AI model; the third computing node can then perform gradient synchronization with the second computing node, allowing the second computing node to obtain the synchronized training parameters.
In a possible implementation, if no second computing node can be determined, the method further includes: sending a notification to continue training to the third computing node, the notification being used to instruct the third computing node to delete the first computing node from the communication topology in the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.

In this implementation, if no second computing node can be obtained to replace the failed first computing node, then to ensure that the training is neither interrupted nor exited and can continue, the failed first computing node is discarded and only the unfailed third computing node is used to carry on the training.
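The two topology updates just described (replace the failed node when a replacement exists, otherwise shrink the topology) can be sketched as follows; this is an illustrative helper, with hypothetical names, rather than the training framework's actual API:

```python
def update_topology(topology, failed_node, replacement_node=None):
    """Rebuild the communication topology used for gradient synchronization:
    drop the failed node and, if a replacement node was allocated, splice it
    in at the same position; otherwise the topology simply shrinks."""
    new_topo = list(topology)
    idx = new_topo.index(failed_node)
    if replacement_node is not None:
        new_topo[idx] = replacement_node  # delete first node, add second node
    else:
        del new_topo[idx]                 # no replacement: continue without it
    return new_topo
```

In the first case the remaining nodes synchronize gradients with the newcomer; in the second, training continues on the smaller set of nodes.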
According to a second aspect, this application relates to a distributed training apparatus for an artificial intelligence (AI) model; for its beneficial effects, refer to the description of the first aspect, which is not repeated here. The distributed training apparatus has the function of implementing the behaviors in the method embodiments of the first aspect. The function can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above function. In a possible implementation, the apparatus is applied to an AI platform associated with a computing resource pool that includes multiple computing nodes used for the distributed training of the AI model, each of which executes one training task of the distributed training; the apparatus includes: a resource management module, configured to perform fault isolation on a first computing node, the first computing node being a failed computing node among the multiple computing nodes; and a task scheduling module, configured to determine a second computing node, the second computing node being a computing node in the computing resource pool other than the multiple computing nodes, and to configure the second computing node so that it replaces the first computing node to execute the training task.
In a possible implementation, the first computing node is determined to be a failed computing node when the AI platform detects one or more of the following: a hardware fault of the first computing node, exit of the training process corresponding to a training task executed by the first computing node, or a fault reported by the first computing node.

In a possible implementation, if the AI platform detects a hardware fault of the first computing node but does not detect the exit of the training process corresponding to the training task executed by the first computing node, then after the fault isolation of the first computing node, the task scheduling module is further configured to send the first computing node a notification to stop the training process, the notification instructing the first computing node to stop the training process corresponding to the training task it executes.

In a possible implementation, after the fault isolation of the first computing node and before the determination of the second computing node, the task scheduling module is further configured to send a third computing node a notification to suspend the training process, the third computing node being a computing node among the multiple computing nodes that has not failed, and the notification instructing the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model.

In a possible implementation, the notification to suspend the training process is specifically used to instruct the third computing node to suspend the training process corresponding to the training task of the distributed training of the AI model after completing the gradient computation of the distributed training.

In a possible implementation, after the second computing node is determined, the task scheduling module is further configured to send the third computing node a notification to continue training, the notification instructing the third computing node to delete the first computing node from, and add the second computing node to, the communication topology in the training framework of the distributed training of the AI model, and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.

In a possible implementation, if no second computing node can be determined, the task scheduling module is further configured to send the third computing node a notification to continue training, the notification instructing the third computing node to delete the first computing node from the communication topology in the training framework of the distributed training of the AI model and to resume the training process corresponding to the training task; the communication topology is used for gradient synchronization of the distributed training of the AI model.
According to a third aspect, this application relates to a computing device including a processor and a memory, where the memory stores computer instructions and the processor executes the computer instructions to implement the method of the first aspect and its possible implementations.

According to a fourth aspect, this application relates to a computer-readable storage medium storing computer instructions; when the computer instructions in the computer-readable storage medium are executed by a computing device, the computing device executes the method of the first aspect and its possible implementations, or implements the functions of the apparatus of the second aspect and its possible implementations.

According to a fifth aspect, this application relates to a computer program product containing instructions which, when run on a computing device, cause the computing device to execute the method of the first aspect and its possible implementations, or to implement the functions of the apparatus of the second aspect and its possible implementations.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objectives and advantages will be apparent from the description, the drawings and the claims.
Brief Description of the Drawings

The drawings used in the embodiments of this application are introduced below.
Fig. 1 is a schematic diagram of data-parallel distributed training;

Fig. 2 is a schematic structural diagram of an AI platform 210 provided by an exemplary embodiment of this application;

Fig. 3 is a schematic diagram of an application scenario of the AI platform 210 provided by an exemplary embodiment of this application;

Fig. 4 is a schematic deployment diagram of the AI platform 210 provided by an exemplary embodiment of this application;

Fig. 5 is a schematic structural diagram of a computing device 500 on which the AI platform 210 is deployed, provided by an exemplary embodiment of this application;

Fig. 6 is a schematic timeline of the processing flow of a training job;

Fig. 7 is a schematic flowchart of a distributed training method for an AI model provided by an exemplary embodiment of this application;

Fig. 8 is a schematic diagram of a user interaction interface provided by an exemplary embodiment of this application;

Fig. 9 is a schematic diagram of gradient synchronization provided by an exemplary embodiment of this application;

Fig. 10 is a schematic diagram of updating the communication topology of a training framework provided by an exemplary embodiment of this application;

Fig. 11 is a schematic diagram of another way of updating the communication topology of a training framework provided by an exemplary embodiment of this application;

Fig. 12 is a schematic timeline of the processing flow of a training job provided by an exemplary embodiment of this application;

Fig. 13 is a schematic flowchart of another distributed training method for an AI model provided by an exemplary embodiment of this application;

Fig. 14 is a schematic structural diagram of a computing device provided by an exemplary embodiment of this application.
Detailed Description

At present, the wave of artificial intelligence continues, and machine learning is a core means of realizing AI; machine learning has penetrated industries such as medicine, transportation, education and finance. Not only technical professionals but also non-AI specialists in various industries expect to use AI and machine learning to accomplish specific tasks.

To facilitate understanding of the technical solutions and embodiments provided by this application, concepts such as the AI model, AI model training, distributed training and the AI platform are explained in detail below:
An AI model is a class of mathematical algorithm models that solve practical problems using machine-learning ideas. An AI model includes a large number of parameters and computation formulas (or computation rules); the parameters of an AI model are values that can be obtained by training the AI model on a training data set, for example, the weights of the computation formulas or computation factors in the model. An AI model also contains hyperparameters, which are parameters that cannot be obtained by training the model on the training data set; hyperparameters can guide the construction or training of the AI model, and there are many kinds, such as the number of training iterations, the learning rate, the batch size, the number of layers of the AI model, and the number of neurons per layer. In other words, the difference between the hyperparameters and the parameters of an AI model is that the values of hyperparameters cannot be obtained by analyzing the training data set, whereas the values of parameters can be modified and determined by analyzing the training data set during training. It should be noted that the AI model mentioned in this application is a general term, including deep-learning models, machine-learning models, and so on.
AI models are diverse; a widely used class is the neural network model, a class of mathematical algorithm models that imitate the structure and function of biological neural networks (the central nervous system of animals). A neural network model can include multiple neural network layers with different functions, each layer including parameters and computation formulas. Different layers in a neural network model have different names according to their formulas or functions; for example, a layer performing convolution is called a convolutional layer, which is often used to extract features from an input signal (such as an image). A neural network model can also be composed of a combination of multiple existing neural network models. Neural network models of different structures can be used for different scenarios (such as classification or recognition) or provide different effects when used in the same scenario. Differences in neural network model structure specifically include one or more of: different numbers of network layers, different orders of the layers, and different weights, parameters or computation formulas in each layer. Many neural network models with high accuracy for application scenarios such as recognition or classification already exist in the industry; some of them, after being trained with a specific training data set, can be used alone to complete a task, or can complete a task in combination with other neural network models (or other functional modules).
A typical AI model needs to be trained before it can be used to complete a task.

Training an AI model means using existing data, through a certain method, to make the AI model fit the regularities of the existing data and determine the parameters of the AI model. Training an AI model requires preparing a training data set. Depending on whether the training data in the training data set are labeled (that is, whether each datum has corresponding label information such as a type, a name, or the bounding boxes it contains), the training of an AI model can be divided into supervised training and unsupervised training. In supervised training, the training data in the training data set carry labels. During training, the training data in the training data set are used as the input of the AI model, the AI model computes on the input training data to obtain output values, the labels corresponding to the training data serve as references for the AI model's output values, a loss function is used to compute the loss between the AI model's output value and the label corresponding to the training data, and the parameters of the AI model are adjusted according to the loss value. The AI model is trained iteratively with each training datum in the training data set, and its parameters are continuously adjusted until the model can, with high accuracy, output values that are the same as or similar to the labels corresponding to the input training data. In unsupervised training, the training data in the data set used for training have no labels; the training data are fed into the AI model in turn, and the AI model gradually identifies the associations and latent rules among the training data until it can be used to judge or identify the type or characteristics of input data. For example, in clustering, an AI model used for clustering, after receiving a large amount of training data, can learn the characteristics of each training datum and the associations and differences among the training data, and automatically divide the training data into multiple classes. Different task types can use different AI models: some AI models can only be trained by supervised learning, some only by unsupervised learning, and some by either supervised or unsupervised learning. A fully trained AI model can be used to complete a specific task. Generally speaking, AI models in machine learning need to be trained in a supervised manner; supervised training lets the AI model learn, from a labeled training data set and in a more targeted way, the associations between the training data and their corresponding labels, so that the trained AI model has higher accuracy when used to predict on other inference data.
The loss function is a function used to measure how well an AI model has been trained (that is, to compute the difference between the result predicted by the AI model and the true target). During training of an AI model, because we want the output of the AI model to be as close as possible to the value we actually want to predict, we can compare the current AI model's prediction for the input data with the truly desired target value (that is, the label of the input data), and update the parameters of the AI model according to the difference between the two. In each training round, the loss function is used to judge the difference between the current AI model's predicted value and the true target value, and the parameters of the AI model are updated, until the AI model can predict the truly desired target value or a value very close to it, i.e., the loss function is below a threshold and relatively stable, at which point the AI model is considered trained.
The gradient is a vector of the partial derivatives of a function. During the training of an AI model, the parameters of the model need to be adjusted so that the loss function of the next iteration is smaller; gradient descent is commonly used to update the model's parameters, so in each iteration the gradient of the loss function for the current round's training data must be computed, and the parameters of the AI model are then updated according to the gradient.
Distributed training is one of the common acceleration methods in AI model training. Distributed training means splitting the training across multiple independent computing nodes for independent computation, and periodically aggregating and redistributing the results, thereby accelerating the training process of the AI model. Distributed training can include data-parallel distributed training.
Data-parallel distributed training deploys the same AI model on multiple computing nodes, distributes the training data of the training data set to the multiple nodes for simultaneous computation, executes the training of the AI model on each node, aggregates the gradients of the model parameters produced on each node, and then updates the model parameters. Specifically, when splitting the training data set across m computing nodes there are two choices: (1) the batch size on each of the m nodes is the same as the batch size when computing with a single node, where batch size refers to the number of training data selected from the training data set before each parameter adjustment; (2) the batch size on each node is the single-node batch size divided by m, so that the aggregated global batch size remains unchanged. In the subsequent description of the embodiments of this application, data-parallel distributed training is used as the example to describe the training method of the AI model.
The process of data-parallel distributed training can be roughly divided into two phases: computation on the multiple computing nodes, and gradient synchronization. Fig. 1 is a schematic diagram of exemplary data-parallel distributed training. As shown in Fig. 1, the computation of the data-parallel distributed training is performed by m computing nodes (computing node 1, computing node 2, ..., computing node m). In each round of training, the samples trained on each node are different, so the m nodes have m batches of samples (batch 1, batch 2, ..., batch m), and each of the m nodes computes one gradient, giving m gradients (gradient 1, gradient 2, ..., gradient m). During gradient synchronization, the m gradients are averaged; the parameters of the AI model are updated according to the average of the m gradients, and the next round of training then proceeds on the AI model with the updated parameters.
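The two phases above (per-node computation, then gradient averaging and a synchronized update) can be sketched as a single round of data-parallel training; this is a toy illustration with a made-up scalar loss, not a real training framework:

```python
def data_parallel_step(params, per_node_batches, grad_fn, lr=0.1):
    """One round as in Fig. 1 (sketch): each of the m nodes computes a
    gradient on its own batch, the m gradients are averaged, and every
    replica applies the same parameter update."""
    grads = [grad_fn(params, batch) for batch in per_node_batches]  # m local gradients
    avg = [sum(g) / len(grads) for g in zip(*grads)]                # gradient averaging
    return [p - lr * g for p, g in zip(params, avg)]                # synchronized update

def mse_grad(params, batch):
    """Toy gradient: for loss (p - t)^2 per target t, d/dp = 2(p - t)."""
    return [sum(2 * (p - t) for t in batch) / len(batch) for p in params]
```

With one parameter and two nodes holding targets 0.0 and 2.0, the two local gradients cancel on averaging and the parameter stays put, illustrating why the averaged update equals a single large-batch step.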
An AI platform is a platform that provides AI developers and users with a convenient AI development environment and convenient development tools. The AI platform has built-in pre-trained AI models or AI sub-models for solving different problems; the AI platform can search for and build a suitable AI model according to the user's needs. The user only needs to specify their needs on the AI platform and, following the prompts, prepare and upload a training data set, and the AI platform can train, for the user, an AI model that can serve the user's needs. Alternatively, the user prepares their own algorithm (also called an initial AI model) and a training data set following the prompts and uploads them to the AI platform; based on the user's own algorithm and training data set, the AI platform can train an AI model that serves the user's needs. The user can use the trained AI model to complete their specific task. It should be understood that, in this application, an AI model before being trained by the AI platform (for example, an algorithm uploaded by the user, an algorithm preset on the AI platform, or a pre-trained model) is called an initial AI model.
Deep learning is a class of machine-learning techniques based on deep neural network algorithms, whose main characteristic is processing and analyzing data using multiple nonlinear transformations. It is mainly applied to perception, decision-making and similar scenarios in the AI field, such as image and speech recognition, natural language translation, and computer game playing.

A container is a relatively independent and isolated environment for running processes, constructed using a virtualization technology of a computer operating system. The environment can contain an independent file system, namespaces, resource views and so on. Using containers can simplify the software deployment process, enhance the portability and security of software, and improve system resource utilization.

A job is a set of programs that need to be executed to complete a specific computing task, usually corresponding to a group of processes, containers or other runtime entities on one or more computers.

A task is a single program in the set of programs corresponding to a job, usually corresponding to one process, container or other runtime entity on one computer. A job includes at least one task.
A training job is the set of programs that need to be executed to complete the training of one initial AI model. The completion of one training job represents the completion of the training of one initial AI model, producing a trained AI model.

A training task is a single program in the set of programs corresponding to a training job; that is, an instance of the task logic submitted by the user, identifying the distinction between tasks. For example, a training task of an initial AI model is used to perform multiple rounds of iterative training on that initial AI model. A training job includes at least one training task.

A training process is the process of a training task executed on a computing node. One training process corresponds to one training task, and one computing node can execute one or more training tasks, so one or more training processes exist on a computing node.

A training framework is the toolkit or function package that the AI model training process depends on; it is the runtime program framework that every training task in a training job depends on. In the early stage of deep learning, every deep-learning researcher had to write a lot of repetitive code; to improve efficiency, researchers wrote this code into frameworks and published them online for all researchers to use together, and such a framework is a training framework. The most popular deep-learning frameworks worldwide currently include Tensorflow, Caffe, Theano, MXNet, Torch and PyTorch.

A computing resource pool consists of computing resources that can be used for AI model training; a computing resource can be a computing node. For a training job, the computing resources are all the computing nodes used during training; each computing node can be a computing device (such as a server) or a computing card (such as a GPU).
Fig. 2 is a schematic structural diagram of an AI platform 210 provided by an embodiment of this application. It should be understood that Fig. 2 only shows one exemplary structural diagram of the AI platform 210, and this application does not limit the division of the modules in the AI platform 210. As shown in Fig. 2, the AI platform 210 includes a task scheduling module 211, a resource management module 212 and a data storage module 213. The AI platform 210 is associated with a computing resource pool 220 that includes multiple computing nodes; the AI platform can schedule the computing nodes in the computing resource pool 220 for AI model training.
The functions of the modules in the AI platform 210 are briefly described below:

The task scheduling module 211 is used to configure and schedule training jobs: it receives training jobs submitted by users, manages the training jobs, and applies for computing resources to run the training jobs.

It should be understood that how an initial AI model is trained, what training data it is trained with, and what computing resources it is trained on can be set by the user when creating the training job corresponding to that initial AI model; when the user does not set them, they are configured by the task scheduling module 211. Here, how an initial AI model is trained includes: how many training tasks its training job is divided into, and which training tasks these are. What training data an initial AI model is trained with includes: how much training data the training job needs, which training data the training job needs, and how much and which training data each training task in the job needs. What computing resources an initial AI model is trained on includes: how many computing nodes execute the training job, what specification of computing nodes execute the training job, and what specification of computing node executes each training task in the job.
The resource management module 212 is used for computing resource management: scheduling computing resources and allocating computing resources to training jobs. The resource management module 212 needs to know the topology (Topo) information of the cluster, where the cluster refers to the cluster composed of all computing resources; when allocating computing resources, it allocates by physical-location affinity, where the affinity principle means that resources in the same physical location, in the same rack, are preferentially allocated.
The task scheduling module 211 is further used to configure the training tasks of a training job onto the computing resources allocated by the resource management module 212 for execution. The task scheduling module 211 can divide a training job into one or more training tasks according to the number of computing nodes required by the job; for example, however many computing nodes a training job needs for execution, the job is divided into that many training tasks, and each training task is then configured to execute on its corresponding computing node.
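The one-task-per-node scheduling rule above can be sketched in a few lines; the function and the task-naming scheme are purely illustrative assumptions:

```python
def schedule_job(job_name, compute_nodes):
    """Sketch of the scheduling rule: a training job is split into exactly
    as many training tasks as allocated computing nodes, one per node."""
    return {node: f"{job_name}/task-{i}" for i, node in enumerate(compute_nodes)}
```

For a job allocated two nodes, this yields two tasks, one configured on each node.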
The data storage module 213 (which may be, for example, the data storage resource corresponding to an OBS provided by a cloud service provider) is used to store the training framework, training data sets uploaded by users, initial AI models uploaded by users, initial AI models uploaded by other users, trained AI models, and so on.
In the computing resource pool 220, one or more training jobs can be executed simultaneously, each training job being used to train one AI model. The training of one AI model is based on the same training framework, which is the runtime program framework that every training task in the training job depends on; each training job includes one or more training tasks, and all training tasks in a training job depend on the same training framework. For example, if the computing resource pool 220 executes n training jobs, the computing resource pool 220 is used to train n AI models; for any one of the n training jobs, all its training tasks depend on the same training framework, and that framework can be obtained from the data storage module 213.
Taking one training job as an example to describe the startup of a training job: after receiving a training job submitted by a user, the task scheduling module 211 applies to the resource management module 212 for computing resources to execute the multiple training tasks in the job; the resource management module 212 allocates multiple computing nodes for the training job or for the multiple training tasks and returns the allocation result to the task scheduling module 211; the task scheduling module 211 sends the training framework, the training data set, the initial AI model and so on to the multiple computing nodes, or the multiple computing nodes can all obtain the training framework, the training data set, the initial AI model and so on from the data storage module 213, so that the training framework is deployed on each of the multiple computing nodes; the task scheduling module 211 then configures the multiple training tasks on the multiple computing nodes respectively, thereby starting training. In addition, the task scheduling module 211 can also inform each of the multiple computing nodes which node or nodes jointly execute the training job with it, so that each node knows with which node or nodes to synchronize training parameters; training parameter synchronization includes gradient synchronization.
The task scheduling module 211 and the resource management module 212 can communicate; thus the task scheduling module 211 can apply to the resource management module 212 for computing resources for executing training tasks.

The task scheduling module 211 and the computing resource pool 220 can communicate; thus the task scheduling module 211 can invoke computing nodes in the computing resource pool 220 to execute training tasks.

The resource management module 212 and the computing resource pool 220 can communicate; thus the resource management module 212 can allocate and schedule the computing resources in the computing resource pool 220.

The computing nodes in the computing resource pool 220 can communicate with one another; thus the multiple computing nodes corresponding to the same training job can perform gradient synchronization.
It should be noted that, for a training job, the gradient synchronization process described in this application includes the following three possible cases:

(1) Each of the multiple computing nodes computes a gradient and sends the computed gradient to the AI platform 210; the AI platform 210 thus receives multiple gradients, aggregates them to obtain an aggregated gradient, and sends the aggregated gradient back to each computing node, and each computing node updates the model parameters based on the aggregated gradient.

(2) Each of the multiple computing nodes computes a gradient and sends the computed gradient to the other computing nodes; each computing node thus obtains multiple gradients, aggregates them to obtain an aggregated gradient, and updates the model parameters based on the aggregated gradient.

(3) Each of the multiple computing nodes computes a gradient, and one of the computing nodes is responsible for aggregating the gradients; the other computing nodes send their computed gradients to that node, which thus obtains multiple gradients, aggregates them to obtain an aggregated gradient, and sends the aggregated gradient back to the other computing nodes; each computing node updates the model parameters based on the aggregated gradient.
In a possible implementation, the AI platform further includes an algorithm management module 214 (not shown in Fig. 2). The algorithm management module 214 is used to provide an initial-AI-model management interface, through which the user uploads an initial AI model created for their own training objective, or obtains an existing initial AI model from an initial-AI-model library. Alternatively, the algorithm management module 214 can also be used to obtain an initial AI model preset on the AI platform according to a task objective entered by the user. An initial AI model created by the user for their own training objective can be written based on a framework provided by the AI platform. Initial AI models can include AI models that have not been trained and AI models that have been trained but not fully; an untrained AI model means that the constructed AI model has not yet been trained with a training data set, and its parameters are all preset values.

The task scheduling module 211 can communicate with the algorithm management module 214 to obtain the access address of the initial AI model from the algorithm management module 214.

In a possible implementation, the AI platform 210 further includes a human-computer interaction module 215 (not shown in Fig. 2), which provides an interaction interface with the user. The human-computer interaction module 215 communicates with the task scheduling module 211, forwards the user's instructions to the task scheduling module 211, obtains the state information of the training process, the trained AI model and so on, and provides the state information and the AI model to the user.

It should be noted that the AI platform in this application can be a system that interacts with the user; this system can be a software system, a hardware system, or a combination of software and hardware, which is not limited in this application.
Fig. 3 is a schematic diagram of an application scenario of the AI platform 210 provided by an embodiment of this application. As shown in Fig. 3, in one embodiment, the AI platform 210 can be deployed entirely in a cloud environment. A cloud environment is an entity that uses basic resources to provide cloud services to users under the cloud computing model. The cloud environment includes a cloud data center and a cloud service platform; the cloud data center includes a large number of basic resources owned by the cloud service provider (including computing resource pools, storage resources and network resources), and the computing resource pool included in the cloud data center can be a large number of computing nodes (for example, servers). The AI platform 210 can be deployed independently on a server or a virtual machine in the cloud data center, or deployed in a distributed manner on multiple servers in the cloud data center, on multiple virtual machines in the cloud data center, or on both servers and virtual machines in the cloud data center. As shown in Fig. 3, the AI platform 210 is abstracted by the cloud service provider into an AI cloud service on the cloud service platform and provided to users; after the user purchases this cloud service on the cloud service platform (the user can pre-charge and then settle according to the final resource usage), the cloud environment uses the AI platform 210 deployed in the cloud data center to provide the AI platform cloud service to the user. When using the AI platform cloud service, the user can, through an application program interface (API) or a graphical user interface (GUI), specify the task the AI model is to complete, upload a training data set to the cloud environment, and so on; the AI platform 210 in the cloud environment receives the user's task information and training data set, and performs data preprocessing and AI model training. The AI platform returns to the user, through the API or GUI, content such as the state information of the AI model's training process. The trained AI model can be downloaded by the user or used online to complete a specific task.

In another embodiment of this application, when the AI platform in the cloud environment is abstracted into an AI cloud service and provided to users, the user can purchase a usage duration of a container with a fixed resource amount; for a fixed resource amount, the longer the usage duration, the higher the fee, and vice versa. Within that usage duration, the AI platform trains the AI model. Alternatively, the user can pre-charge and, after the training is completed, settle according to the number of GPUs actually used and the usage duration.
The deployment of the AI platform 210 provided by this application is relatively flexible. As shown in Fig. 4, in another embodiment, the AI platform 210 provided by this application can also be deployed in a distributed manner in different environments. The AI platform 210 provided by this application can be logically divided into multiple parts, each with a different function; for example, in one embodiment the AI platform 210 includes the task scheduling module 211, the resource management module 212 and the data storage module 213. The parts of the AI platform 210 can be deployed respectively in any two or three of the terminal computing device, the edge environment and the cloud environment. Terminal computing devices include: terminal servers, smartphones, laptops, tablets, personal desktop computers, smart cameras and so on. The edge environment is an environment that includes a collection of edge computing devices close to the terminal computing devices; edge computing devices include: edge servers, edge stations with computing capability, and so on. The parts of the AI platform 210 deployed in different environments or devices cooperate to provide users with functions such as training AI models.

For example, in one scenario, the task scheduling module 211 of the AI platform 210 is deployed on a terminal computing device, the resource management module 212 of the AI platform 210 is deployed on an edge computing device in the edge environment, and the data storage module 213 of the AI platform 210 is deployed on a cloud computing device in the cloud environment. The user sends a training job to the task scheduling module 211 on the terminal computing device; the terminal computing device applies to the resource management module 212 on the edge computing device for computing resources; the edge computing device allocates computing resources for the training job; the terminal computing device configures the training tasks of the job to execute on the allocated computing resources; and when executing the training tasks, data such as the required sample set and initial AI model are obtained from the data storage module 213 on the cloud computing device.

It should be understood that this application does not restrictively divide which parts of the AI platform 210 are deployed in which environment; in actual applications, the deployment can be adapted according to the computing capability of the terminal computing devices, the resource occupancy of the edge environment and the cloud environment, or specific application requirements.
The AI platform 210 can also be deployed on a computing device in any of the foregoing environments (for example, on an edge server in the edge environment). Fig. 5 is a schematic diagram of the hardware structure of a computing device 500 on which the AI platform 210 is deployed. The computing device 500 shown in Fig. 5 includes a memory 501, a processor 502, a communication interface 503 and a bus 504, where the memory 501, the processor 502 and the communication interface 503 are communicatively connected to one another through the bus 504.

The memory 501 can be a read-only memory (ROM), a random access memory (RAM), a hard disk, a flash memory, or any combination thereof. The memory 501 can store programs; when the programs stored in the memory 501 are executed by the processor 502, the processor 502 and the communication interface 503 are used to execute the AI platform 210's training of AI models for users. The memory can also store training data sets; for example, a part of the storage resources in the memory 501 is divided into a data storage module 213 for storing the data required by the AI platform 210.

The processor 502 can be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a GPU, or any combination thereof. The processor 502 can include one or more chips. The processor 502 can include an AI accelerator, for example a neural processing unit (NPU).

The communication interface 503 uses a transceiver module, such as a transceiver, to implement communication between the computing device 500 and other devices or communication networks; for example, data can be obtained through the communication interface 503.

The bus 504 can include a path for transferring information between the components of the computing device 500 (for example, the memory 501, the processor 502 and the communication interface 503).
To facilitate understanding of the embodiments of this application, the technical problems specifically to be solved by this application are further analyzed and presented below.

Fig. 6 is a schematic timeline of the processing flow of a related training job; this processing flow is implemented based on the AI platform shown in Fig. 2 and includes the following steps:
(1) The task scheduling module 211 starts the training job according to the user's configuration and applies to the resource management module 212 for computing resources to execute the training job; the resource management module 212 allocates multiple computing nodes for the training job, and the task scheduling module 211 starts the training tasks of the job on the multiple computing nodes. Each of the multiple computing nodes performs data loading, and begins executing training (computation) after the data loading is completed. Data loading means preparing the data needed for training, including obtaining the training framework, the training data set, the initial AI model and so on, and deploying the training framework.
(2) While executing a training task, each computing node periodically saves a Ckpt (checkpoint) file in the training script of the training task. The training script is the training program that the training task runs; the Ckpt file is a file saved during the execution of the training task. It is a binary file that stores all variables such as weights, biases and gradients, and is used to restore training progress after the training task fails.
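The save-and-restore cycle around a Ckpt file can be sketched as follows. This is a minimal illustration using Python's standard `pickle`; real training frameworks serialize far more state (optimizer state, RNG state, gradients) in their own binary formats, and the function names here are assumptions:

```python
import os
import pickle
import tempfile

def save_ckpt(path, step, weights):
    """Periodically persist training state so a restarted task can resume."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a torn file

def load_ckpt(path):
    """Restore the most recently saved training state after a failure."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A restarted task would call `load_ckpt` and continue from the saved step; any progress made after the last save is lost, which is exactly the loss window discussed below in problem (4).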
(3) While executing a training task, if a hardware fault or a software fault occurs on a computing node (for example, the training task hangs, times out, or exits), the executed training task becomes abnormal, so the training task fails and exits. It should be understood that as long as one training task in a training job fails and exits, the entire training job fails and exits, i.e., the training job is interrupted.

(4) After the training job is interrupted, the AI platform updates the status of the training job; for example, the AI platform can show the user that the training job has been interrupted.

(5) The user discovers on the AI platform that the training job has been interrupted, and restarts the training job on the AI platform.

(6) After the user restarts the training job, the AI platform re-applies for computing resources for the job, i.e., the AI platform re-applies for multiple computing nodes for the job, each re-applied computing node being used to execute one training task of the job; each re-applied computing node performs data loading.

(7) After completing data loading, each re-applied computing node pulls the Ckpt file; the Ckpt file pulled by each re-applied node is the Ckpt file saved, before the fault occurred, in the training script of the training task that node is to execute. It should be understood that the training tasks the re-applied nodes are to execute are the training tasks of this training job.

(8) Each re-applied computing node continues training based on the pulled Ckpt file, i.e., each re-applied node executes the training task it is to execute based on the pulled Ckpt file.
As can be seen from Fig. 6, the above AI model training flow has the following problems:

(1) When a training task fails, the training job it belongs to is interrupted and manual intervention is needed to restart the job; the AI platform re-applies for computing resources for the job, and both the re-application of computing resources and the startup of training tasks on the re-applied resources take a long time, making fault recovery slow.

(2) Because computing resources must be re-applied for the training job, the scale of computing resources used for executing the job grows; constrained by the scale of the whole computing resource pool, after the training job is restarted, the re-application of large-scale computing resources for the job may fail, so the fault cannot be recovered.

(3) During AI model training, training data are saved through Ckpt files for recovery after a fault; because Ckpt files are relatively large, fetching a Ckpt file takes a long time, making fault recovery slow.

(4) Because Ckpt files are relatively large, saving them also takes a long time, and they cannot be saved at high frequency during training; therefore, during fault recovery, the training data between the last Ckpt save and the failure of the training task cannot be recovered, i.e., all the training data from the Ckpt save to the failure of the training task are lost.

In summary, in the related technology, after a computing node fails, fault recovery takes a long time or may be impossible, and the training loss caused by the fault is large.
In view of the above problems in the related technology, this application mainly solves the problem that fault recovery takes a long time when a computing node used for executing a training task fails during AI model training. The technical solution provided by this application provides a dynamic fault-recovery capability, ensuring lossless recovery of the whole training when a fault occurs during training. Specifically, during the training of an AI model, when a computing node used for executing the training fails, on the premise that the training job is not interrupted, the failed computing node is dynamically isolated and a new computing node is added to replace the failed one, ensuring that the training process of the AI model is not interrupted, so that the time to complete the training of the AI model is unaffected. The new computing node is a computing node not used for executing training at the time of resource application; or the new computing node is a computing node already used for executing training at the time of resource application, but the training task it executes does not belong to the same training job as the training task executed by the failed computing node.

The technical solution provided by this application is described in detail below with reference to specific implementations.
本申请对图2所示的AI平台210的功能进行改进,包括对任务调度模块211以及资源管理模块212进行能力增强,使得任务调度模块211具备故障恢复等功能,以及使得资源管理模块212具备故障隔离、资源动态调整等功能。具体如下介绍。
在一种可能的实现方式中,本申请的资源管理模块212还用于:对用于执行训练任务的任意一个计算节点进行故障监测;在监测到计算节点发生故障后,对该发生故障的第一计算节点进行故障隔离;以及向任务调度模块211进行故障上报,也即告知任务调度模块211第一计算节点发生故障。其中,第一计算节点是指发生故障的一类计算节点,可以为一个或多个计算节点。
作为一示例,在进行故障监测方面,本申请的资源管理模块212具体用于:监测用于执行训练任务的任意一个计算节点是否发生硬件故障,以及监测该任意一个计算节点上的训练进程是否退出。其中,若满足以下一种或多种的情况时:该任意一个计算节点发生硬件故障、该任意一个计算节点上的训练进程退出,则该任意一个计算节点发生故障,也即该任意一个计算节点为第一计算节点。
需要说明的是,故障隔离包括两层含义:第一层含义,一个训练作业的训练任务由多个计算节点执行,在该多个计算节点中存在第一计算节点的情况下,将该第一计算节点从该多个计算节点中剔除,以使该第一计算节点不再用于执行该训练作业的训练任务;第二层含义,在该第一计算节点是发生硬件故障的情况下,该第一计算节点在被故障隔离之后、被故障恢复之前,不会被用于执行任何训练作业的训练任务。
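故障隔离的上述两层含义可以用如下Python草图示意;其中ResourceManager、isolate等类名与方法名均为便于说明而假设的,并非本申请限定的接口:

```python
class ResourceManager:
    """资源管理模块故障隔离逻辑的极简示意(字段与方法名均为示例性假设)。"""

    def __init__(self, pool):
        self.pool = set(pool)       # 计算资源池中的全部计算节点
        self.quarantined = set()    # 因硬件故障被隔离、恢复前不可再分配的节点

    def isolate(self, job_nodes, node, hardware_fault=False):
        # 第一层含义:将第一计算节点从该训练作业的节点列表中剔除
        if node in job_nodes:
            job_nodes.remove(node)
        # 第二层含义:硬件故障的节点在故障恢复之前不再用于任何训练作业
        if hardware_fault:
            self.quarantined.add(node)

    def allocate(self, count):
        # 分配计算资源时跳过被隔离的节点
        free = sorted(self.pool - self.quarantined)
        return free[:count]
```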
在一种可能的实现方式中,任务调度模块211还用于:接收来自第一计算节点上报的故障。本申请的每个计算节点在执行训练任务时,也会监测该训练任务对应的训练进程是否发生运行故障,当监测到该训练任务对应的训练进程发生运行故障时,确定自己发生故障,也即确定自己为第一计算节点,并向任务调度模块211进行故障上报。从软件层面来看,本申请的每个计算节点在执行训练任务时,该计算节点中的监测程序会监测该训练任务对应的训练进程是否发生运行故障,当该监测程序监测到该训练任务对应的训练进程发生运行故障时,确定该计算节点发生故障,也即确定该计算节点为第一计算节点,该监测程序向任务调度模块211进行故障上报。
在一种可能的实现方式中,资源管理模块212还用于:接收来自第一计算节点上报的故障;在接收到来自第一计算节点上报的故障之后,对该第一计算节点进行故障隔离,以及将该故障向任务调度模块211进行上报。本申请的每个计算节点在执行训练任务时,当监测到该训练任务对应的训练进程发生运行故障时,确定自己发生故障,也即确定自己为第一计算节点,并向资源管理模块212进行故障上报;资源管理模块212在接收到该第一计算节点上报的故障后,对该第一计算节点进行故障隔离,以及将该故障转发给任务调度模块211。从软件层面来看,本申请的每个计算节点在执行训练任务时,该计算节点中的监测程序会监测该训练任务对应的训练进程是否发生运行故障,当该监测程序监测到该训练任务对应的训练进程发生运行故障时,确定该计算节点发生故障,也即确定该计算节点为第一计算节点,该监测程序向资源管理模块212进行故障上报。
需要说明的是,前述训练进程发生运行故障不包括训练进程从计算节点上退出。
在一种可能的实现方式中,本申请的任务调度模块211还用于:在计算节点在执行训练任务的过程中发生故障后,进行故障恢复;也即,在接收到来自资源管理模块212或第一计算节点上报的故障后,进行故障恢复。
作为一示例,在资源管理模块212向任务调度模块211进行故障上报的情况下,在故障恢复方面,本申请的任务调度模块211具体用于:通知未发生故障的第三计算节点暂停训练任务的执行,也即通知第三计算节点暂停训练进程,向资源管理模块212申请第二计算节点用于替代第一计算节点。其中,第二计算节点是指用于替代第一计算节点的一类计算节点,可以为一个或多个计算节点;第二计算节点用于执行原来由第一计算节点执行的训练任务;第二计算节点可以为计算资源池220中未用于执行训练任务的计算节点;或第二计算节点可以为计算资源池220中已经用于执行训练任务的计算节点,但第二计算节点执行的训练任务与第一计算节点执行的训练任务不归属于同一个训练作业。第三计算节点是指未发生故障的一类计算节点,可以为一个或多个计算节点;第三计算节点与第一计算节点用于执行同一个训练作业中的训练任务。
作为另一示例,在第一计算节点向任务调度模块211进行故障上报的情况下,在故障恢复方面,本申请的任务调度模块211具体用于:通知资源管理模块212对该第一计算节点进行故障隔离,通知第三计算节点暂停训练任务的执行,向资源管理模块212申请第二计算节点用于替代该第一计算节点。
在一种可能的实现方式中,本申请的资源管理模块212还用于:在接收到来自任务调度模块211的申请第二计算节点的申请后,重新分配计算资源,也即从计算资源池220中分配用于替代该第一计算节点的第二计算节点;以及在重新分配计算资源之后,将重新分配计算资源的结果告知任务调度模块211。
如此,资源管理模块212在训练任务执行过程中可以进行计算节点的增加,也即资源动态调整。例如,若某个计算节点在执行某个或某些训练任务时,发生故障,资源管理模块212可以增加第二计算节点替代该第一计算节点执行该训练任务。
在一种可能的实现方式中,本申请的任务调度模块211还用于:接收来自资源管理模块212的重新分配计算资源的结果,调用增加的第二计算节点执行该第一计算节点原来执行的训练任务;以及通知第三计算节点继续训练任务的执行,也即通知第三计算节点继续执行之前暂停的训练进程。
由于任务调度模块211与资源管理模块212可以通信,如此,资源管理模块212可以向任务调度模块211进行故障上报,以及任务调度模块211可以通知资源管理模块212对第一计算节点进行故障隔离等。
由于任务调度模块211与计算资源池220可以通信,如此,任务调度模块211可以通知计算资源池220中的计算节点暂停训练任务的执行以及继续训练任务的执行等。
由于资源管理模块212与计算资源池220可以通信,如此,资源管理模块212可以监测计算资源池220中的计算节点在执行训练任务时是否发生故障,以及对计算资源池220中的第一计算节点进行故障隔离等。
如前所述,数据并行的分布式训练分为多个计算节点计算和梯度同步两个阶段。如此,对于一个AI模型的训练作业来说,该训练作业分为多个训练任务,这多个训练任务由多个计算节点执行;在多个计算节点计算阶段,多个计算节点分别独立完成自己的计算得到对应的梯度;在梯度同步阶段,多个计算节点中的每个计算节点均提供自己计算得到的梯度,共同完成梯度同步。由于多个计算节点在计算阶段是独立完成的,对于多个计算节点中的任意一个计算节点来说,其并不知道自己需要与哪个或哪些节点进行梯度同步,而多个计算节点在执行训练任务时是基于同一个训练框架的,因此可以在该训练框架中设置通讯拓扑,该通讯拓扑用于多个计算节点进行梯度同步。通讯拓扑为由执行一个训练作业的多个计算节点组成的拓扑结构,该通讯拓扑记录该训练作业由哪些计算节点共同执行,以及在梯度同步时有哪些计算节点参与;该通讯拓扑还可以用于这多个计算节点之间进行通信。
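训练框架中通讯拓扑“记录参与节点、支持动态增删”的能力,可以用如下草图示意;类名与方法名均为示例性假设:

```python
class CommTopology:
    """训练框架中通讯拓扑的极简示意:记录共同执行一个训练作业的计算节点,
    用于梯度同步;支持在训练过程中动态增删节点(接口名为示例性假设)。"""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def remove_node(self, node):
        # 第一计算节点发生故障并被隔离后,从通讯拓扑中删除
        self.nodes.remove(node)

    def add_node(self, node):
        # 补充进来的第二计算节点加入通讯拓扑
        self.nodes.append(node)

    def peers(self, me):
        # 某个计算节点进行梯度同步时需要通信的其他节点
        return [n for n in self.nodes if n != me]
```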
在一种可能的实现方式中,本申请还对训练框架的能力进行优化,在现有的训练框架中增加训练任务的容错处理;也即,支持在训练过程中在训练框架的通讯拓扑中动态增删计算节点,保证任务高可用。
具体地,一个训练作业中的多个训练任务由多个计算节点执行,在训练过程中,当多个计算节点中存在第一计算节点时,资源管理模块212会对该第一计算节点进行故障隔离,以及分配第二计算节点替代该第一计算节点;此种情况下,需要更新训练框架中的通讯拓扑,也即在训练框架中的通讯拓扑中删除该第一计算节点,并且增加该第二计算节点。
作为一种示例,资源管理模块212分配第二计算节点替代该第一计算节点之后,通知任务调度模块211;任务调度模块211将该第二计算节点的信息告知多个计算节点中的第三计算节点,第三计算节点在其上的训练框架中的通讯拓扑中删除该第一计算节点以及增加该第二计算节点;任务调度模块211将训练框架、训练数据集、初始AI模型等发送给第二计算节点,或者说第二计算节点可以从数据存储模块213中获取训练框架、训练数据集、初始AI模型等;任务调度模块211还将第三计算节点的信息发给第二计算节点,第二计算节点可以部署训练框架以及基于自己的信息和第三计算节点的信息在部署的训练框架中构建通讯拓扑;任务调度模块211还可以在第二计算节点部署训练框架之后,在该第二计算节点中配置启动原来由第一计算节点执行的训练任务。
应理解,在第三计算节点以及该第二计算节点均更新了训练框架中的通讯拓扑之后,以及第二计算节点基于自己的信息和第三计算节点的信息在部署的训练框架中构建通讯拓扑之后,第三计算节点以及第二计算节点上的训练框架中的通讯拓扑是相同的,从而第三计算节点以及第二计算节点可以进行梯度同步。其中,因第二计算节点还没有执行过原来由第一计算节点执行的训练任务,故不存在相应的训练参数,所以第二计算节点在本次梯度同步时不提供训练参数。
通过上述各模块的功能,本申请实施例提供的AI平台210,训练发生故障时,动态隔离发生故障的第一计算节点,补充第二计算节点,第二计算节点替代发生故障的第一计算节点执行训练,保障训练过程不被中断,从而整体训练时长不受影响,实现降低故障恢复的时长。
请参阅图7,图7是本申请实施例提供的一种AI模型的分布式训练方法的流程示意图,图7所示的AI模型的分布式训练方法可以基于图2所示的AI平台210实现。图7中的任务调度模块、资源管理模块可以分别为图2中的任务调度模块211、资源管理模块212,图7中的计算资源可以为计算资源池220或计算资源池220中的计算资源。
下面以AI平台210对一个初始AI模型进行分布式训练以得到训练完成的AI模型为例,介绍本申请提供的AI模型的分布式训练方法。该初始AI模型的分布式训练对应一个训练作业,该初始AI模型的分布式训练需要的计算资源为多个计算节点;该训练作业可以分为多个训练任务,多个训练任务采用同一训练框架,多个计算节点与多个训练任务一一对应;多个计算节点中的每个计算节点执行对应的训练任务,也即多个计算节点中的每个计算节点为对应的训练任务跑一个训练进程,从而有多个训练进程,且多个训练任务与多个训练进程一一对应。此外,多个计算节点中的每个计算节点可以仅执行一个训练任务,即每个计算节点上仅有一个训练进程,每个计算节点上仅有一个训练框架。
如图7所示,图7中的计算资源可以表示多个计算节点中的全部或部分计算节点,多个计算节点中的每个计算节点用于执行多个训练任务中的其中一个训练任务;图7中的第一计算节点可以表示多个计算节点中的任意一个发生故障的计算节点;图7中的第三计算节点可以表示多个计算节点中的任意一个未发生故障的计算节点;图7中的第二计算节点可以表示用于替代任意一个第一计算节点的计算节点,第二计算节点用于执行原来由第一计算节点执行的训练任务。该AI模型的分布式训练方法的流程包括任务启动、状态监测、故障隔离以及故障恢复四个阶段,下面对前述各个阶段进行详细介绍。
第一阶段,任务启动。
步骤S1:用户启动训练作业。
具体地,用户通过人机交互模块215创建并提交用于对初始AI模型进行训练的训练作业,人机交互模块215根据用户提交的训练作业生成用户的指令,并将用户的指令转发给任务调度模块211;任务调度模块211接收用户提交的训练作业,从而实现用户启动训练作业。
请参阅图8,图8是本申请实施例提供的一种用户交互界面的示意图,图8所示的界面为人机交互模块215中展示的创建训练作业的界面。创建训练作业包括服务选型、规格确认和完成三个步骤。在服务选型过程中,本申请AI平台210可以基于按需计费的计费模式向用户提供AI模型的训练服务。在创建训练作业过程中,可以设定训练作业的名称,以及进行一键式参数设置。训练使用的算法可以基于不同的来源,按需获取,例如选择使用过的算法、预置算法、常用框架(也即常用训练框架)、自定义算法;在选择算法时,可以基于算法的名称进行选择。计算资源池220分为公共资源池和专用资源池,用户可以根据需求选择对应的计算资源池用于训练;公共资源池和专用资源池中均可以包括不同规格的计算资源,用户可以根据所需计算资源的规模从不同规格的计算资源中选择合适的计算资源用于训练;在选择计算资源时,可以基于训练需求设置计算节点的数量。在用户服务选型完成后,用户进行规格确认,规格确认完成后,用户完成创建训练作业。
应理解,任务调度模块211接收到用户提交的训练作业之后,获知用户启动了训练作业的执行,从而申请计算资源用于执行训练作业,也即执行步骤S2。
步骤S2:任务调度模块向资源管理模块申请计算资源。
具体地,由于用户在创建训练作业时,设置了用于执行训练作业的计算节点的数量、计算节点的规格等,任务调度模块211会根据用户的设置向资源管理模块212申请计算资源。例如,用户设置需要多少个计算节点执行训练作业,任务调度模块211就会向资源管理模块212申请多少个计算节点;以及用户设置用于执行训练作业的计算节点的规格是什么,任务调度模块211就会向资源管理模块212申请什么规格的计算节点。
步骤S3:资源管理模块分配计算资源。
具体地,资源管理模块212在接收到来自任务调度模块211的申请计算资源的申请之后,根据申请计算资源的申请从计算资源池220中为训练作业分配多个计算节点,且分配的多个计算节点的规格是任务调度模块211申请的计算节点的规格。
此外,资源管理模块212会将分配计算资源的结果返回给任务调度模块211,因此任务调度模块211可以知晓资源管理模块212为训练作业分配的计算资源是哪些计算节点;其中,分配计算资源的结果可选地包括:计算节点的名称、计算节点的标识以及计算节点的规格等。
步骤S4:任务调度模块启动训练。
具体地,任务调度模块211会根据用户在创建训练作业时设置的计算节点的数量将训练作业划分为多个训练任务,其中,多个训练任务中的训练任务的数量等于设置的计算节点的数量,训练作业中的每个训练任务为初始AI模型的分布式训练的训练任务,训练作业中的每个训练任务用于对初始AI模型进行多轮迭代训练;任务调度模块211会从多个计算节点中确定用于执行每个训练任务的计算节点,并将每个训练任务配置在为该训练任务确定的计算节点上执行,故多个训练任务与多个计算节点一一对应。应理解,一个训练任务会在一个计算节点中运行,从而多个计算节点中的每个计算节点均会针对训练作业中的一个训练任务跑一个训练进程用于训练,故多个训练任务与多个训练进程一一对应。
其中,任务调度模块211从多个计算节点中确定用于执行每个训练任务的计算节点时,可以基于每个训练任务所需的计算节点的规格从多个计算节点中匹配合适的计算节点用于执行每个训练任务;任务调度模块211将训练任务配置在确定的计算节点上执行时,可以基于计算节点的名称或计算节点的标识等精准将训练任务配置在对应的计算节点上执行。
如图8所示,用户在创建训练作业时,设置计算节点的数量为4个,那么任务调度模块211会将训练作业分为4个训练任务,并且向资源管理模块212申请4个计算节点用于执行训练作业,以及将4个训练任务一一对应配置在这4个计算节点上执行。
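以图8中“4个计算节点”的设置为例,任务调度模块将训练作业划分为训练任务并与计算节点一一对应配置的过程,可以用如下草图示意;函数名与任务命名方式均为示例性假设:

```python
def split_job(job_name, num_nodes):
    # 按用户设置的计算节点数量,将训练作业划分为等量的训练任务
    return [f"{job_name}-task{i}" for i in range(num_nodes)]

def assign(tasks, nodes):
    # 训练任务与计算节点一一对应;每个计算节点为对应的训练任务跑一个训练进程
    assert len(tasks) == len(nodes)
    return dict(zip(nodes, tasks))
```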
第二阶段,状态监测。
本申请中,对于用于执行训练作业的多个计算节点中的每个计算节点,计算节点在执行训练任务的过程中,AI平台210提供状态监测能力,AI平台210周期性地对计算节点进行状态监测;状态监测包括资源管理模块212对计算节点进行故障监测以及计算节点进行自我故障监测,具体如下:
步骤S5:资源管理模块对计算资源进行故障监测。
资源管理模块212对计算资源进行故障监测,也即资源管理模块212对计算节点进行故障监测。具体地,资源管理模块212对多个计算节点中的每个计算节点进行周期性的故障监测,以确定这多个计算节点中的每个计算节点在执行训练任务时是否发生故障。其中,资源管理模块212对计算节点进行故障监测包括监测计算节点是否发生硬件故障和/或监测计算节点上的训练进程是否退出;当监测到计算节点发生硬件故障和/或监测到计算节点上的训练进程退出时,确认计算节点发生故障。
需要说明的是,本申请的硬件故障可以分为第一类硬件故障和第二类硬件故障。第一类硬件故障:会导致计算节点上的训练进程退出或停止;例如,计算节点发生掉电、计算节点和与其共同用于执行同一训练作业的其他计算节点之间的网络断开。第二类硬件故障:不会导致计算节点上的训练进程退出或停止,仅会影响计算节点的计算性能;例如,计算节点的计算很慢且计算节点上的训练进程没有退出。此外,计算节点发生硬件故障仅是导致计算节点上的训练进程退出的其中一种可能,本申请对导致计算节点上的训练进程退出的原因不进行具体限定,任何一种原因导致计算节点上的训练进程退出时,均可以被资源管理模块212监测到。
步骤S6:每个计算节点进行自我故障监测。
具体地,多个计算节点中的每个计算节点在执行训练任务时,均会监测训练任务对应的训练进程是否发生运行故障。从软件层面来看,本申请的每个计算节点在执行训练任务时,其中的监测程序均会监测训练任务对应的训练进程是否发生运行故障。
应理解,上述步骤S5和S6不存在执行时间的先后的关系,且当执行了上述步骤S5时,S6是可选的,当执行了上述步骤S6时,上述步骤S5是可选的。
第三阶段,故障隔离。
由状态监测阶段可知,本申请中,资源管理模块212对计算资源(也即计算节点)进行故障监测和/或计算节点进行自我故障监测,两种监测方式监测到故障均可以认为计算节点发生故障;其中,计算节点发生故障则导致计算节点执行的训练任务发生故障,训练任务发生故障也即训练作业发生故障,训练作业发生故障也即初始AI模型的分布式训练发生故障。为保证初始AI模型的分布式训练能顺利进行,一旦监测到发生故障,则需要对发生故障的第一计算节点进行故障隔离。由于存在两种监测方式监测到的故障,故触发故障隔离的触发方式也有两种,下面分别描述。
方式一:资源管理模块触发故障隔离。如下步骤S7-步骤S9为由资源管理模块触发故障隔离的步骤。
步骤S7:资源管理模块对第一计算节点进行故障隔离。
具体地,资源管理模块212监测到多个计算节点中的第一计算节点发生故障时,对第一计算节点进行故障隔离,以使第一计算节点不再用于执行训练作业中的训练任务;以及第一计算节点是发生硬件故障的情况下,避免第一计算节点在故障恢复之前被再次调用。其中,资源管理模块212监测到的故障包括计算节点发生硬件故障和/或计算节点上的训练进程退出。
步骤S8:资源管理模块向任务调度模块进行故障上报。
具体地,资源管理模块212将监测到的故障上报到任务调度模块211,例如将第一计算节点发生硬件故障或第一计算节点上的训练进程退出的信息上报到任务调度模块211,由任务调度模块211对第一计算节点上的训练进程进行处理。
步骤S9:任务调度模块向第一计算节点发送停止训练进程的通知。
应理解,步骤S9是可选的,当资源管理模块212监测到第一计算节点发生第一类硬件故障或监测到第一计算节点上的训练进程退出时,第一计算节点上的训练进程自动停止,不执行步骤S9;当资源管理模块212监测到第一计算节点发生第二类硬件故障时,第一计算节点上的训练进程并没有停止,需要执行步骤S9。
具体地,任务调度模块211向发生第二类硬件故障的第一计算节点发送停止训练进程的通知,停止训练进程的通知用于停止发生第二类硬件故障的第一计算节点上的训练进程。
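步骤S9的执行条件可以归纳为如下判定逻辑的草图;其中故障类型的标识字符串与映射表均为示例性假设:

```python
# 故障类型到处理结果的映射为便于说明而假设,并非本申请限定的实现
FAULT_HANDLING = {
    "hw_type1": "进程已自动停止",   # 第一类硬件故障:如掉电、网络断开
    "hw_type2": "需发送停止通知",   # 第二类硬件故障:仅影响计算性能,进程未停止
    "proc_exit": "进程已自动停止",  # 训练进程退出
}

def need_stop_notice(fault):
    # 仅在第二类硬件故障时,任务调度模块才需要向第一计算节点发送停止训练进程的通知
    return FAULT_HANDLING.get(fault) == "需发送停止通知"
```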
方式二:计算节点触发故障隔离。如下步骤S10-步骤S12为由计算节点触发故障隔离的步骤。
步骤S10:第一计算节点监测到训练进程运行故障。
具体地,多个计算节点中的每个计算节点在监测到其上的训练进程发生运行故障时,确定自己发生故障,也即确定自己为第一计算节点。从软件层面来看,多个计算节点中的每个计算节点上的监测程序在监测到该计算节点上的训练进程发生运行故障时,确定该计算节点发生故障,也即确定该计算节点为第一计算节点。其中,计算节点上的训练进程发生运行故障,也即训练进程对应的训练任务发生运行故障。
步骤S11:第一计算节点向任务调度模块进行故障上报。
具体地,第一计算节点上的训练进程发生运行故障,第一计算节点向任务调度模块进行故障上报;从软件层面来看,由第一计算节点上的监测程序将故障上报到任务调度模块211。
步骤S12:任务调度模块向资源管理模块发送对第一计算节点进行故障隔离的通知,资源管理模块对第一计算节点进行故障隔离。
也即,任务调度模块211先向资源管理模块212发送对第一计算节点进行故障隔离的通知,资源管理模块212在接收到来自任务调度模块211的对第一计算节点进行故障隔离的通知之后,对第一计算节点进行故障隔离。
上述步骤S11和S12中,第一计算节点先将故障上报到任务调度模块211,再被资源管理模块212故障隔离。可选地,第一计算节点也可以先将故障上报到资源管理模块212,然后资源管理模块212将故障上报到任务调度模块211以及资源管理模块212对第一计算节点进行故障隔离。需要说明的是,此种情况下,资源管理模块212向任务调度模块211进行故障上报以及资源管理模块212对第一计算节点进行故障隔离的动作执行时间可以有先后顺序,也可以同时发生。
在一种可能的实现方式中,S11:第一计算节点向资源管理模块进行故障上报,资源管理模块向任务调度模块进行故障上报;S12:资源管理模块对第一计算节点进行故障隔离。
第四阶段,故障恢复。
需要说明的是,故障恢复阶段是可选执行的。如图8所示,本申请提供的创建训练作业的界面存在“启动故障恢复”的可选配置项;对于由多个计算节点共同执行的训练作业来说,用户在创建训练作业时勾选了“启动故障恢复”的配置项的情况下,若多个计算节点中存在发生故障的第一计算节点,则执行步骤S13-步骤S18,否则不执行步骤S13-步骤S18。同时,本申请提供的创建训练作业的界面存在“故障率”的配置项,用户在勾选了“启动故障恢复”的配置项的情况下,还可以设置故障率的阈值;对于由多个计算节点共同执行的训练作业来说,若多个计算节点中发生故障的第一计算节点的数量与多个计算节点中的计算节点的数量的比值超过故障率的阈值,则不执行步骤S13-步骤S18,否则执行步骤S13-步骤S18。例如,用户在创建训练作业时,设置该训练作业由4个计算节点执行,故障率的阈值设置为50%,若4个计算节点中的第一计算节点的数量超过2个,则不进行故障恢复,否则进行故障恢复。
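上述“启动故障恢复”配置项与故障率阈值的判定逻辑,可以用如下草图示意;函数名与参数均为示例性假设:

```python
def should_recover(num_faulty, num_total, enable_recovery, fault_rate_threshold):
    # 未勾选“启动故障恢复”配置项时,不执行故障恢复
    if not enable_recovery:
        return False
    # 发生故障的第一计算节点占比超过故障率阈值时,不执行故障恢复
    return num_faulty / num_total <= fault_rate_threshold
```

以正文中的例子验证:4个计算节点、阈值50%时,故障节点不超过2个才进行故障恢复。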
本申请中,当计算节点发生硬件故障或计算节点上的训练进程退出时,该计算节点发生故障。此种情况,由资源管理模块212监测到该计算节点发生故障以及将故障上报到任务调度模块211,由任务调度模块211判断因该计算节点发生故障而受影响的训练任务。
本申请中,当计算节点监测到其上的训练进程发生运行故障时,该计算节点发生故障。此种情况,由该计算节点直接将故障上报到任务调度模块211,或由计算节点间接将故障上报到任务调度模块211(计算节点先将故障上报到资源管理模块212,资源管理模块212再将故障上报到任务调度模块211),且在故障上报时直接告知任务调度模块211哪个或哪些训练进程发生运行故障,也即告知任务调度模块211哪个或哪些训练任务发生运行故障。
综上,在资源管理模块212监测到计算节点发生故障,并上报到任务调度模块211的情况下,任务调度模块211判断到计算节点上执行的训练任务发生故障;在计算节点监测到训练进程发生运行故障,并直接或间接将故障上报到任务调度模块211的情况下,任务调度模块211接收到计算节点执行的训练任务发生故障。
任务调度模块211在判断到计算节点上执行的训练任务发生故障或接收到计算节点执行的训练任务发生故障后,触发故障恢复操作。下面对故障恢复的阶段进行具体说明:
步骤S13:任务调度模块向未发生故障的第三计算节点发送暂停训练进程的通知。
由于数据并行的分布式训练过程中,多个计算节点计算和梯度同步是交替进行的。在执行初始AI模型的训练过程中,当多个计算节点中存在第一计算节点时,第一计算节点无法参与梯度同步;如此,因为缺少了第一计算节点,故梯度同步可能会出现问题,所以在故障恢复过程中,需要通知多个计算节点中的未发生故障的第三计算节点暂停训练。
具体地,由于训练作业中的训练任务发生故障,则该训练作业发生故障,任务调度模块211向多个计算节点中的第三计算节点发送暂停训练进程的通知,暂停训练进程的通知用于暂停第三计算节点上的训练进程,其中,暂停的训练进程对应的训练任务为发生故障的训练作业中的训练任务。
步骤S14:第三计算节点在接收到来自任务调度模块的暂停训练进程的通知之后,进行训练进程的暂停。
具体地,第三计算节点在接收到暂停训练进程的通知之后,暂停其上的训练进程,且等待接收继续训练的通知,以便继续执行被暂停的训练进程。
在一种可能的实现方式中,第三计算节点在接收到暂停训练进程的通知之后,继续完成训练进程的计算,在得到训练进程对应的梯度之后,暂停训练进程。
具体地,第三计算节点完成训练进程的计算,得到训练进程对应的梯度之后,本应进入梯度同步;但因接收到暂停训练进程的通知,故暂停梯度同步,开始循环等待(不超时不退出),直到接收到继续训练的通知。
请参阅图9,图9是本申请实施例提供的一种梯度同步的示意图。如图9所示,分布式训练在未发生故障时,多个计算节点完成各自的计算,然后多个计算节点进行梯度同步;当发生故障时,例如多个计算节点中的第一计算节点发生故障,则去掉第一计算节点(也即资源管理模块212对第一计算节点进行故障隔离),从而第一计算节点不再用于当前的训练,并且多个计算节点中的第三计算节点在完成计算后暂停执行当前的训练,也即暂停梯度同步。
在一种可能的实现方式中,第三计算节点暂停训练进程之后,进入循环等待,本申请可以设置一个最长等待时长,若第三计算节点循环等待的时长超过最长等待时长,则退出循环等待,训练失败,由运维人员修复,从而避免无限挂死,增加程序的健壮性。
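第三计算节点“暂停后循环等待、超过最长等待时长则退出”的逻辑,可以用如下草图示意;其中以轮询方式检查是否收到继续训练的通知仅为示例性假设,真实实现也可能基于事件通知:

```python
import time

def wait_for_resume(resume_flag, max_wait_seconds, poll_interval=0.01):
    """第三计算节点暂停梯度同步后的循环等待示意:
    收到继续训练的通知则返回True;超过最长等待时长则退出循环等待,
    训练失败、交由运维人员修复,避免无限挂死(参数约定为示例性假设)。"""
    deadline = time.monotonic() + max_wait_seconds
    while time.monotonic() < deadline:
        if resume_flag():          # 轮询是否已收到继续训练的通知
            return True
        time.sleep(poll_interval)
    return False                   # 超时:退出循环等待,增加程序的健壮性
```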
步骤S15:任务调度模块向资源管理模块重新申请计算资源。
具体地,任务调度模块211向资源管理模块212申请第二计算节点用于替代多个计算节点中的第一计算节点,以便基于第二计算节点和多个计算节点中的第三计算节点继续执行训练;其中,第二计算节点为多个计算节点之外的计算节点;任务调度模块211向资源管理模块212申请第二计算节点时,基于第一计算节点的规格进行申请,例如申请的第二计算节点与第一计算节点的规格相同或相当。
步骤S16:资源管理模块重新分配计算资源。
具体地,资源管理模块212接收到来自任务调度模块211的申请第二计算节点的申请后,从计算资源池220中重新分配第二计算节点,以及向任务调度模块211返回重新分配计算资源的结果。其中,重新分配计算资源的结果存在两种可能情况,分别如下描述。
第一种情况,重新分配计算资源的结果为已分配第二计算节点;重新分配计算资源的结果还可以可选地包括:第二计算节点的名称、第二计算节点的标识以及第二计算节点的规格等。此种可能情况,便于任务调度模块211将发生故障的训练任务配置在第二计算节点上执行,以实现采用第二计算节点替代第一计算节点。
第二种情况,重新分配计算资源的结果为未分配第二计算节点。此种可能情况是因为计算资源池220受限,其中没有可供重新分配的计算资源。发生此种可能情况时,可按发生故障之前的配置,基于第三计算节点继续训练。
步骤S17:任务调度模块向第三计算节点发送继续训练的通知。
具体地,任务调度模块211向第三计算节点发送继续训练的通知,该继续训练的通知用于第三计算节点更新其上的训练框架中的通讯拓扑,以及在更新训练框架中的通讯拓扑之后继续执行训练进程。其中,继续训练的通知需要通知继续执行的训练进程为之前暂停的训练进程。
需要说明的是,第三计算节点上需要更新通讯拓扑的训练框架为发生故障的训练作业对应的训练框架,也即需要更新通讯拓扑的训练框架为训练发生故障的初始AI模型的训练框架。
由于资源管理模块212向任务调度模块211返回的重新分配计算资源的结果有已分配第二计算节点和未分配第二计算节点两种可能情况,则任务调度模块211向第三计算节点发送的继续训练的通知包括的内容以及用途也有两种可能情况,分别如下描述。
第一种情况:若重新分配计算资源的结果为已分配第二计算节点,继续训练的通知可以包括第二计算节点的信息(例如第二计算节点的名称、第二计算节点的标识以及第二计算节点的规格等);继续训练的通知用于第三计算节点在训练框架中的通讯拓扑中删除第一计算节点以及增加第二计算节点,以及在更新训练框架中的通讯拓扑之后继续执行训练进程;其中,更新通讯拓扑的训练框架为发生故障的训练作业对应的训练框架,也即训练发生故障的初始AI模型的训练框架。此种情况下,执行步骤S18,之后基于第三计算节点和第二计算节点继续训练。
请参阅图10,图10是本申请实施例提供的一种更新训练框架的通讯拓扑的示意图。如图10所示,训练作业分为4个训练任务(分别为训练任务1、训练任务2、训练任务3和训练任务4),从而训练作业由4个计算节点(分别为计算节点1、计算节点2、计算节点3和计算节点4)执行;在发生故障之前,计算节点1、计算节点2、计算节点3和计算节点4中的训练框架中的通讯拓扑均是以计算节点1、计算节点2、计算节点3和计算节点4构成的通信网络。若计算节点1、计算节点2和计算节点3未发生故障,也即计算节点1、计算节点2和计算节点3为第三计算节点;以及计算节点4发生故障,也即计算节点4为第一计算节点;重新分配计算节点5替代计算节点4,也即计算节点5为第二计算节点;计算节点1、计算节点2和计算节点3均在训练框架中的通讯拓扑中删除计算节点4以及增加计算节点5,更新为以计算节点1、计算节点2、计算节点3和计算节点5构成的通信网络。如此,计算节点1、计算节点2、计算节点3和计算节点5可以继续训练,也即之后训练作业由计算节点1、计算节点2、计算节点3和计算节点5执行。
第二种情况:若重新分配计算资源的结果为未分配第二计算节点,继续训练的通知不包括第二计算节点的信息;继续训练的通知用于第三计算节点在训练框架中的通讯拓扑中删除第一计算节点,以及在更新训练框架中的通讯拓扑之后继续执行训练进程;其中,更新通讯拓扑的训练框架为发生故障的训练作业对应的训练框架,也即训练发生故障的初始AI模型的训练框架。此种情况下,不执行步骤S18,之后基于第三计算节点继续训练。
需要说明的是,上述第二种情况,由于第一计算节点发生故障后,剔除第一计算节点,仅基于第三计算节点继续训练,因为每个计算节点训练的样本数(BatchSize)没有变更,所以在发生故障过程中,每个第三计算节点的训练时间没有变化,从而整体训练时间没有变化。对于整个训练作业来说,若每个训练任务的每轮训练的样本数(BatchSize)不变,则每轮训练少训练了n/m份样本,n为发生故障的第一计算节点的数量,m为用于执行训练作业的计算节点的总数量。也即,训练作业总共对应m个计算节点,每轮训练的样本分为1/m份,每个计算节点每次使用1/m份样本训练,发生故障的第一计算节点有n个,故有n个计算节点无法继续训练,则每轮训练少训练n/m份样本。
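上述“每轮训练少训练n/m份样本”的关系可以用如下草图验证;函数名为示例性假设:

```python
from fractions import Fraction

def samples_lost_per_round(n_faulty, m_total):
    # 训练作业共对应m个计算节点,每轮训练的样本分为1/m份,
    # 每个计算节点每次使用1/m份样本训练;n个第一计算节点被剔除后,
    # 每轮训练少训练n/m份样本
    return Fraction(n_faulty, m_total)
```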
请参阅图11,图11是本申请实施例提供的另一种更新训练框架的通讯拓扑的示意图。如图11所示,训练作业分为4个训练任务(分别为训练任务1、训练任务2、训练任务3和训练任务4),从而训练作业由4个计算节点(分别为计算节点1、计算节点2、计算节点3和计算节点4)执行;在发生故障之前,计算节点1、计算节点2、计算节点3和计算节点4中的训练框架中的通讯拓扑均是以计算节点1、计算节点2、计算节点3和计算节点4构成的通信网络;若计算节点1、计算节点2和计算节点3未发生故障,也即计算节点1、计算节点2和计算节点3为第三计算节点;以及计算节点4发生故障,也即计算节点4为第一计算节点;计算节点1、计算节点2和计算节点3均在训练框架中的通讯拓扑中删除计算节点4,更新为以计算节点1、计算节点2和计算节点3构成的通信网络。如此,计算节点1、计算节点2和计算节点3可以继续训练,也即之后训练作业仅由3个计算节点执行,每轮训练少训练1/4份样本。
步骤S18:第二计算节点进行数据恢复。
具体地,任务调度模块211将训练框架、训练数据集、初始AI模型等发送给第二计算节点,或者说第二计算节点可以从数据存储模块213中获取训练框架、训练数据集、初始AI模型等,从而第二计算节点可以部署训练框架以及采用训练数据集中的训练数据对初始AI模型进行训练;任务调度模块211还将第三计算节点的信息(例如第三计算节点的名称、第三计算节点的标识以及第三计算节点的规格等)发给第二计算节点,第二计算节点可以部署训练框架以及基于自己的信息和第三计算节点的信息在部署的训练框架中构建通讯拓扑;任务调度模块211将原来由第一计算节点执行的训练任务配置在第二计算节点上执行,也即第二计算节点会针对原来由第一计算节点执行的训练任务跑一个训练进程。
应理解,在第三计算节点更新了训练框架中的通讯拓扑,以及第二计算节点在训练框架中构建通讯拓扑之后,第三计算节点以及第二计算节点上的训练框架中的通讯拓扑是相同的,从而第三计算节点与第二计算节点可以进行梯度同步。
其中,第二计算节点参与梯度同步时,通过加载第一计算节点在发生故障之前保存的Ckpt文件启动,该Ckpt文件是针对发生故障的训练任务保存的,从而第二计算节点可以恢复得到第一计算节点在发生故障之前的数据。
如图9所示,在第三计算节点更新训练框架中的通讯拓扑,且第二计算节点在训练框架中构建通讯拓扑以及数据恢复完成之后,第二计算节点与第三计算节点进行梯度同步;第二计算节点与第三计算节点在完成一次梯度同步之后,第二计算节点和第三计算节点训练的AI模型中的模型参数相同,从而可以基于第二计算节点与第三计算节点进行AI模型的下一轮训练。
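第二计算节点加入后与第三计算节点进行梯度同步、使各节点模型参数一致的过程,可以用如下取平均的allreduce风格草图示意;其中以None表示本轮无梯度可提供的新加入节点,该约定为示例性假设:

```python
def gradient_sync(grads_by_node):
    """数据并行梯度同步(allreduce取平均)的极简示意:
    各计算节点提供本轮计算得到的梯度,同步后所有节点持有相同的平均梯度;
    新加入的第二计算节点若本轮无梯度可提供,则不参与平均(以None表示)。"""
    contributed = [g for g in grads_by_node.values() if g is not None]
    avg = sum(contributed) / len(contributed)
    # 同步结果下发到通讯拓扑中的所有节点,包括新加入的第二计算节点
    return {node: avg for node in grads_by_node}
```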
应理解,图7所示的AI模型的分布式训练方法是以一个初始AI模型的训练、一个训练作业、每个计算节点仅执行一个训练任务、每个计算节点上仅有一个训练进程、每个计算节点上仅有一个训练框架为例描述的。
需要说明的是,AI平台210可能同时对多个初始AI模型进行训练。当AI平台210同时对多个初始AI模型进行训练时,多个初始AI模型对应多个训练作业;此种情况下,每个计算节点执行一个或多个训练任务(多个训练任务中的每个训练任务归属于多个训练作业中的其中一个训练作业),每个计算节点上存在一个或多个训练进程以及一个或多个训练框架;多个初始AI模型训练时,训练的步骤与图7描述的过程是相同的,具体的步骤请参阅图7的描述,但以下几个步骤需要进一步说明:
步骤S2和步骤S3:对于多个训练作业来说,使用的计算资源可能存在重合,也可能不重合;当存在重合时,重合的计算节点用于执行多个训练任务,这多个训练任务归属于不同的训练作业。
步骤S5:每个计算节点用于执行一个或多个训练任务(多个训练任务中的每个训练任务归属于多个训练作业中的其中一个训练作业),那么每个计算节点上存在一个或多个训练进程;当监测到第一计算节点发生硬件故障时,第一计算节点上的一个或多个训练进程均因第一计算节点发生故障而受到影响;当监测到第一计算节点上的训练进程退出时,第一计算节点上退出的训练进程受到影响,第一计算节点上没有退出的训练进程继续正常运行。
步骤S6:每个计算节点会监测其上的一个或多个训练进程是否发生运行故障,也即每个计算节点会监测其上的每个训练进程是否发生运行故障。
步骤S7:第一计算节点可能执行一个或多个训练任务(多个训练任务中的每个训练任务归属于多个训练作业中的其中一个训练作业),那么第一计算节点上可能存在一个或多个训练进程;当监测到第一计算节点发生硬件故障时,第一计算节点为发生故障的计算节点,对第一计算节点进行故障隔离后,第一计算节点不再用于执行这一个或多个训练任务。当监测到第一计算节点上的至少一个训练进程退出时,第一计算节点为发生故障的计算节点;也即,只要第一计算节点上有训练进程退出,则说明第一计算节点发生故障;对第一计算节点进行故障隔离后,第一计算节点仅不再用于执行退出的训练进程对应的训练任务,而第一计算节点上没有退出的训练进程继续正常运行;且后面仅需对退出的训练进程对应的训练任务进行故障恢复,而无需对没有退出的训练进程对应的训练任务进行故障恢复。
步骤S9:当第一计算节点发生第一类硬件故障时,第一计算节点上的一个或多个训练进程均会受到影响,停止训练进程的通知可选地用于停止第一计算节点上的一个或多个训练进程;从而后面可选的对这一个或多个训练进程对应的训练任务进行故障恢复。
步骤S10:第一计算节点监测到一个或多个训练进程中的任意一个训练进程发生运行故障,则第一计算节点为发生故障的计算节点;也即,只要第一计算节点上有训练进程运行故障,则说明第一计算节点发生故障。
步骤S11:第一计算节点故障上报时,仅针对发生运行故障的训练进程进行故障上报,不对没有发生运行故障的训练进程进行故障上报;也即,第一计算节点仅请求AI平台对运行故障的训练进程对应的训练任务进行故障恢复,而无需对正常运行的训练进程对应的训练任务进行故障恢复。
步骤S12:对第一计算节点进行故障隔离后,第一计算节点仅不再用于执行发生运行故障的训练进程对应的训练任务,而第一计算节点继续用于执行没有发生运行故障的训练进程对应的训练任务,也即第一计算节点上没有发生运行故障的训练进程继续正常运行。
步骤S13和步骤14:第三计算节点可能用于执行一个或多个训练任务(多个训练任务中的每个训练任务归属于多个训练作业中的其中一个训练作业),那么第三计算节点上可能存在一个或多个训练进程;第三计算节点上需要暂停的训练进程为发生故障的训练作业中的训练任务对应的训练进程。
步骤S15和步骤16:因第一计算节点可能用于执行一个或多个训练任务,针对一个第一计算节点,补充进来的第二计算节点可能是一个或多个。例如,补充一个第二计算节点替代一个第一计算节点执行这一个或多个训练任务;或者,补充多个第二计算节点替代一个第一计算节点执行这一个或多个训练任务,这多个第二计算节点中的每个第二计算节点执行这多个训练任务中的至少一个训练任务。
步骤S17:不同的初始AI模型的训练基于不同的训练框架实现,也即不同的训练作业对应不同的训练框架,不归属于同一训练作业的训练任务对应的训练框架不同;因第三计算节点可能用于执行一个或多个训练任务,那么第三计算节点上可能存在一个或多个训练框架;第三计算节点上需要更新通讯拓扑的训练框架为发生故障的训练任务对应的训练框架,也即第三计算节点上需要更新通讯拓扑的训练框架为发生故障的训练作业对应的训练框架。
请参阅图12,图12是本申请实施例提供的一种训练作业的处理流程时间线示意图。该训练作业的处理流程时间线为图7所示的AI模型的分布式训练的训练作业的处理流程时间线,该训练作业的处理流程包括以下步骤:
(1)多个计算节点中的每个计算节点用于执行该训练作业的其中一个训练任务,在启动该训练作业后,每个计算节点进行数据加载以及在完成数据加载之后开始执行训练。
(2)多个计算节点中存在发生故障的第一计算节点。
当多个计算节点中存在第一计算节点时,进行训练作业的故障恢复,训练作业的故障恢复包括训练作业硬件修复阶段和训练作业软件恢复阶段,训练作业硬件修复阶段包括以下步骤(3)和(4),训练作业软件恢复阶段包括以下步骤(5)、(6)和(7)。
(3)AI平台自动对第一计算节点进行故障隔离。
(4)AI平台自动分配第二计算节点,用于替代第一计算节点。
(5)第二计算节点进行数据加载。
(6)多个计算节点中未发生故障的第三计算节点更新训练框架中的通讯拓扑,且第二计算节点在训练框架中创建通讯拓扑。
(7)第二计算节点与第三计算节点进行训练参数同步。之后基于第二计算节点与第三计算节点正常训练。
由图12可知,本申请提供的AI模型的训练流程中,至少存在以下有益效果:
(1)训练过程中,第一计算节点发生故障时,训练作业发生故障,但训练作业不是直接退出;AI平台210在知道训练作业发生故障后,自动为训练作业分配第二计算节点以用于替代第一计算节点继续执行训练;如此,在训练作业不中断的情况下,完成故障恢复,无需人工介入重启发生故障的训练作业,减少故障恢复的耗时。
(2)第一计算节点发生故障时,剔除第一计算节点并暂停未发生故障的第三计算节点上的训练任务的执行,分配第二计算节点替代第一计算节点后,再恢复第三计算节点上的训练任务的执行。如此,仅需重新申请第二计算节点替代第一计算节点,避免重新申请训练作业所需的所有计算节点,减少了因重新申请计算资源失败而导致故障恢复失败的可能性。
(3)由于仅重新分配了第二计算节点用于替代发生故障的第一计算节点,故仅需第二计算节点拉取Ckpt文件启动,避免训练作业所需的所有计算节点均拉取Ckpt文件重启,减少了通过Ckpt文件启动时导致的训练时长的损失。
(4)故障恢复过程中,仅需第二计算节点进行数据加载,避免训练作业所需的计算节点均进行数据加载,减少了加载数据时所需要的带宽。
(5)第二计算节点通过与第三计算节点进行梯度同步以获取最新的训练结果,减少了因Ckpt文件无法高频保存而带来的训练损失。对于发生故障的训练作业,其受影响的阶段为:第一计算节点从发生故障到第二计算节点加入的过程中,第一计算节点没有参与训练的那部分样本的计算。其中,受影响的样本计算结果为(T/t)×(n/m),T为故障恢复时间,t为每轮训练的训练时长,n为第一计算节点的数量,m为训练作业所需的计算节点的总数量。并且,可以通过任务调度模块的优化减少故障恢复时间T,从而可以减少因发生故障对整个训练作业的影响;故障恢复时间T一般是在1~2分钟,对于执行时间为小时级的训练作业(大规模的训练作业)来说,基本可以做到训练作业的无损故障恢复。
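上述受影响的样本计算结果(T/t)×(n/m)可以用如下草图计算;函数名与参数名为示例性假设:

```python
def affected_sample_fraction(T, t, n, m):
    # 受影响的样本计算结果为 (T/t) × (n/m):
    # T为故障恢复时间,t为每轮训练的训练时长,
    # n为第一计算节点的数量,m为训练作业所需的计算节点的总数量
    return (T / t) * (n / m)
```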
请参阅图13,图13是本申请实施例提供的另一种人工智能AI模型的分布式训练方法的过程1300的流程图。过程1300描述为一系列的步骤或操作,应当理解的是,过程1300可以以各种顺序执行和/或同时发生,不限于图13所示的执行顺序。过程1300应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于所述AI模型分布式训练的多个计算节点,所述多个计算节点中的每个计算节点执行所述AI模型分布式训练的一个训练任务;过程1300包括但不限于如下步骤或操作:
步骤1301:对第一计算节点进行故障隔离,所述第一计算节点为所述多个计算节点中发生故障的计算节点;
步骤1302:确定第二计算节点,所述第二计算节点为所述计算资源池中除所述多个计算节点之外的计算节点;
步骤1303:配置所述第二计算节点,以使所述第二计算节点替代所述第一计算节点执行训练任务。
在本申请中,AI平台可以对AI模型进行分布式训练,AI平台与计算资源池相关联,计算资源池包括用于AI模型分布式训练的多个计算节点,多个计算节点中的每个计算节点执行该AI模型分布式训练的一个训练任务,例如每个计算节点执行一个AI模型分布式训练的训练任务;在AI模型的分布式训练过程中,AI平台可以确定多个计算节点中是否存在发生故障的第一计算节点,AI平台如果确定到多个计算节点中存在发生故障的第一计算节点,则对该第一计算节点进行故障隔离,以使得该第一计算节点不再用于执行该AI模型分布式训练的训练任务;并且,AI平台可以从计算资源池中确定除前述多个计算节点之外的第二计算节点,以及配置该第二计算节点,以使得采用该第二计算节点来替代该第一计算节点执行该AI模型分布式训练的训练任务。如此,本申请用于AI模型的分布式训练的计算节点发生故障时,动态隔离发生故障的第一计算节点,补充第二计算节点替代第一计算节点继续训练,保障训练过程不被中断,从而整体训练时长不受影响,实现降低故障恢复的时长。应理解,该第二计算节点的计算能力与该第一计算节点的计算能力相同或相当,或者说该第二计算节点的规格与该第一计算节点的规格相同或相当,以确保该第二计算节点可以成功替代该第一计算节点。需要说明的是,若第一计算节点除执行该AI模型分布式训练的训练任务外,还执行其他AI模型分布式训练的训练任务,对第一计算节点进行故障隔离后,第一计算节点不再用于执行因第一计算节点发生故障而受影响的训练任务;第二计算节点替代第一计算节点执行因第一计算节点发生故障而受影响的训练任务;其中,因第一计算节点发生故障而受影响的训练任务包括以下一项或多项:该AI模型分布式训练的训练任务,其他AI模型分布式训练的训练任务。
在一种可能的实现方式中,所述AI平台在监测到以下一项或多项的情况下,所述第一计算节点为发生故障的计算节点:所述第一计算节点硬件故障,所述第一计算节点执行的训练任务对应的训练进程退出,所述第一计算节点上报的故障。
在本实现方式中,第一计算节点硬件故障、第一计算节点执行的训练任务对应的训练进程退出以及第一计算节点上报到AI平台的故障均可以被AI平台监测到;如果AI平台监测到前述一项或多项,则确定第一计算节点为发生故障的计算节点,并触发确定第二计算节点替代第一计算节点执行训练任务;如此,AI平台可以及时发现AI模型分布式训练存在故障,有利于降低故障恢复的时长。需要说明的是,若第一计算节点除执行该AI模型分布式训练的训练任务外,还执行其他AI模型分布式训练的训练任务,则当第一计算节点硬件故障,因第一计算节点发生故障而受影响的训练任务包括该AI模型分布式训练的训练任务和其他AI模型分布式训练的训练任务。进一步地,第一计算节点执行的训练任务对应的训练进程退出包括第一计算节点执行的该AI模型分布式训练的训练任务对应的训练进程退出以及其他AI模型分布式训练的训练任务对应的训练进程退出,也即只要第一计算节点上有训练进程退出,第一计算节点就为发生故障的计算节点;当该AI模型分布式训练的训练任务对应的训练进程退出,因第一计算节点发生故障而受影响的训练任务为该AI模型分布式训练的训练任务;当其他AI模型分布式训练的训练任务对应的训练进程退出,因第一计算节点发生故障而受影响的训练任务为其他AI模型分布式训练的训练任务;当该AI模型分布式训练的训练任务对应的训练进程和其他AI模型分布式训练的训练任务对应的训练进程均退出,因第一计算节点发生故障而受影响的训练任务包括该AI模型分布式训练的训练任务和其他AI模型分布式训练的训练任务。此外,第一计算节点上报的故障包括第一计算节点针对该AI模型分布式训练的训练任务上报的故障以及针对其他AI模型分布式训练的训练任务上报的故障,也即只要第一计算节点上报故障,第一计算节点就为发生故障的计算节点;当第一计算节点上报的故障为第一计算节点针对该AI模型分布式训练的训练任务上报的故障,因第一计算节点发生故障而受影响的训练任务为该AI模型分布式训练的训练任务;当第一计算节点上报的故障包括第一计算节点针对其他AI模型分布式训练的训练任务上报的故障,因第一计算节点发生故障而受影响的训练任务为其他AI模型分布式训练的训练任务;当第一计算节点上报的故障包括第一计算节点针对该AI模型分布式训练的训练任务上报的故障和针对其他AI模型分布式训练的训练任务上报的故障,因第一计算节点发生故障而受影响的训练任务包括该AI模型分布式训练的训练任务和其他AI模型分布式训练的训练任务。
在一种可能的实现方式中,若所述AI平台监测到所述第一计算节点硬件故障,且未监测到所述第一计算节点执行的训练任务对应的训练进程退出;在所述对第一计算节点进行故障隔离之后,所述方法包括:向所述第一计算节点发送停止训练进程的通知,所述停止训练进程的通知用于指示所述第一计算节点停止执行的训练任务对应的训练进程。
在本实现方式中,有些类型的硬件故障不会导致计算节点上的训练进程退出或停止,仅会影响计算节点的计算性能;在第一计算节点发生硬件故障的情况下,为确保第二计算节点能成功替代第一计算节点执行训练任务,AI平台向第一计算节点发送停止训练进程的通知,指示第一计算节点停止执行的训练任务对应的训练进程;从而避免第二计算节点已经在执行原来由第一计算节点执行的训练任务的情况下,第一计算节点还在执行该训练任务。应理解,停止训练进程的通知用于指示第一计算节点停止因第一计算节点发生故障而受影响的训练任务对应的训练进程。需要说明的是,若第一计算节点除执行该AI模型分布式训练的训练任务外,还执行其他AI模型分布式训练的训练任务,则当因第一计算节点发生故障而受影响的训练任务为该AI模型分布式训练的训练任务,停止训练进程的通知用于指示第一计算节点停止该AI模型分布式训练的训练任务对应的训练进程;当因第一计算节点发生故障而受影响的训练任务为其他AI模型分布式训练的训练任务,停止训练进程的通知用于指示第一计算节点停止其他AI模型分布式训练的训练任务对应的训练进程;当因第一计算节点发生故障而受影响的训练任务包括该AI模型分布式训练的训练任务和其他AI模型分布式训练的训练任务,停止训练进程的通知用于指示第一计算节点停止该AI模型分布式训练的训练任务对应的训练进程和其他AI模型分布式训练的训练任务对应的训练进程。
在一种可能的实现方式中,在所述对第一计算节点进行故障隔离之后,在所述确定第二计算节点之前,所述方法还包括:向第三计算节点发送暂停训练进程的通知,所述第三计算节点为所述多个计算节点中未发生故障的计算节点,所述暂停训练进程的通知用于指示所述第三计算节点暂停所述AI模型分布式训练的训练任务对应的训练进程。
在本实现方式中,该AI模型分布式训练包括多个计算节点计算和梯度同步,当第一计算节点发生故障,若不暂停未发生故障的第三计算节点的训练进程,则第三计算节点计算得到梯度后,就会进行梯度同步;但是,第一计算节点因为发生故障而被故障隔离,无法参与梯度同步,在这种情况下梯度同步会出现问题;因此,为避免梯度同步出现问题,需要将第三计算节点执行的训练进程进行暂停,直到有新增的第二计算节点加入用于执行训练。
在一种可能的实现方式中,所述暂停训练进程的通知具体用于:指示所述第三计算节点在执行完所述AI模型分布式训练的梯度计算之后,暂停所述AI模型分布式训练的训练任务对应的训练进程。
在本实现方式中,在未发生故障的第三计算节点梯度计算结束后,再暂停第三计算节点执行的训练进程;如此,等新增的第二计算节点加入用于执行训练后,即可直接进行梯度同步,有利于降低故障恢复时长。
在一种可能的实现方式中,在所述确定第二计算节点之后,所述方法还包括:向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点和增加所述第二计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
在本实现方式中,AI平台向第三计算节点发送继续训练的通知;第三计算节点在接收到继续训练的通知之后,知晓第二计算节点会替代发生故障的第一计算节点执行训练,故在AI模型分布式训练的训练框架中的通讯拓扑中删除第一计算节点以及增加第二计算节点;从而第三计算节点可以与第二计算节点进行梯度同步,使得第二计算节点获得同步后的训练参数。
在一种可能的实现方式中,若未确定到第二计算节点,所述方法还包括:向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
在本实现方式中,如果无法申请到第二计算节点用于替代发生故障的第一计算节点,为了保证训练不中断或不退出,训练能够继续进行,则舍弃发生故障的第一计算节点,仅采用未发生故障的第三计算节点用于执行训练。
需要说明的是,图13所示的实施例的描述可以参阅图1-图12所示的实施例的描述。
前述图2中AI平台210中的任务调度模块211、资源管理模块212的功能可以由AI模型的分布式训练装置执行,该AI模型的分布式训练装置应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于所述AI模型分布式训练的多个计算节点,所述多个计算节点中的每个计算节点执行所述AI模型分布式训练的一个训练任务。该AI模型的分布式训练装置可以通过软件、硬件或者两者的结合实现成为装置中的部分或者全部。该AI模型的分布式训练装置可以实现本申请其他实施例所描述的流程。该AI模型的分布式训练装置包括:资源管理模块212,用于对第一计算节点进行故障隔离,所述第一计算节点为所述多个计算节点中发生故障的计算节点;任务调度模块211,用于确定第二计算节点,所述第二计算节点为所述计算资源池中除所述多个计算节点之外的计算节点;以及配置所述第二计算节点,以使所述第二计算节点替代所述第一计算节点执行训练任务。
在一种可能的实现方式中,所述AI平台在监测到以下一项或多项的情况下,所述第一计算节点为发生故障的计算节点:所述第一计算节点硬件故障,所述第一计算节点执行的训练任务对应的训练进程退出,所述第一计算节点上报的故障。
在一种可能的实现方式中,若所述AI平台监测到所述第一计算节点硬件故障,且未监测到所述第一计算节点执行的训练任务对应的训练进程退出;在所述对第一计算节点进行故障隔离之后,所述任务调度模块211还用于:向所述第一计算节点发送停止训练进程的通知,所述停止训练进程的通知用于指示所述第一计算节点停止执行的训练任务对应的训练进程。
在一种可能的实现方式中,在所述对第一计算节点进行故障隔离之后,在所述确定第二计算节点之前,所述任务调度模块211还用于:向第三计算节点发送暂停训练进程的通知,所述第三计算节点为所述多个计算节点中未发生故障的计算节点,所述暂停训练进程的通知用于指示所述第三计算节点暂停所述AI模型分布式训练的训练任务对应的训练进程。
在一种可能的实现方式中,所述暂停训练进程的通知具体用于:指示所述第三计算节点在执行完所述AI模型分布式训练的梯度计算之后,暂停所述AI模型分布式训练的训练任务对应的训练进程。
在一种可能的实现方式中,在所述确定第二计算节点之后,所述任务调度模块211还用于:向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点和增加所述第二计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
在一种可能的实现方式中,若未确定到第二计算节点,所述任务调度模块211还用于:向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时也可以有另外的划分方式,另外,在本申请各个实施例中的各功能模块可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上模块集成为一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
本申请还提供一种如图5所示的计算设备500,计算设备500中的处理器502读取存储器501存储的程序和数据集合以执行前述AI平台执行的方法。
由于本申请提供的AI平台210中的各个模块可以分布式地部署在同一环境或不同环境中的多个计算机上,因此,本申请还提供一种如图14所示的计算设备,该计算设备包括多个计算机1400,每个计算机1400包括存储器1401、处理器1402、通信接口1403以及总线1404。其中,存储器1401、处理器1402、通信接口1403通过总线1404实现彼此之间的通信连接。
存储器1401可以是只读存储器,静态存储设备,动态存储设备或者随机存取存储器。存储器1401可以存储程序,当存储器1401中存储的程序被处理器1402执行时,处理器1402和通信接口1403用于执行AI平台训练AI模型的部分方法。存储器还可以存储训练数据集,例如,存储器1401中的一部分存储资源被划分成一个数据集存储模块,用于存储AI平台所需的训练数据集。
处理器1402可以采用通用的中央处理器,微处理器,应用专用集成电路,图形处理器或者一个或多个集成电路。
通信接口1403使用例如但不限于收发器一类的收发模块,来实现计算机1400与其他设备或通信网络之间的通信。例如,可以通过通信接口1403获取训练数据集。
总线1404可包括在计算机1400各个部件(例如,存储器1401、处理器1402、通信接口1403)之间传送信息的通路。
上述每个计算机1400间通过通信网络建立通信通路。每个计算机1400上运行任务调度模块211、资源管理模块212、数据存储模块213、算法管理模块214和人机交互模块215中的任意一个或多个。任一计算机1400可以为云数据中心中的计算机(例如,服务器),或边缘数据中心中的计算机,或终端计算设备。
所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、双绞线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质存储有提供AI平台的计算机程序指令。所述计算机可读存储介质可以是计算机能够存取的任何介质或者是包含一个或多个介质集成的服务器、数据中心等数据存储设备。所述介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,光盘)、或者半导体介质(如固态硬盘)。
上述各个附图对应的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
本申请还提供一种计算机可读存储介质,计算机可读存储介质存储有计算机指令,当计算机可读存储介质中的计算机指令被计算设备执行时,使得计算设备执行本申请实施例所描述的流程或功能。
在上述实施例中,可以全部或部分地通过软件、硬件或者其组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。本申请提供AI平台的计算机程序产品包括一个或多个用于AI平台的计算机指令,在计算机上加载和执行这些计算机程序指令时,全部或部分地产生按照本申请实施例所描述的流程或功能。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
上述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所示方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (17)

  1. 一种人工智能AI模型的分布式训练方法,其特征在于,应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于所述AI模型分布式训练的多个计算节点,所述多个计算节点中的每个计算节点执行所述AI模型分布式训练的一个训练任务;所述方法包括:
    对第一计算节点进行故障隔离,所述第一计算节点为所述多个计算节点中发生故障的计算节点;
    确定第二计算节点,所述第二计算节点为所述计算资源池中除所述多个计算节点之外的计算节点;
    配置所述第二计算节点,以使所述第二计算节点替代所述第一计算节点执行训练任务。
  2. 根据权利要求1所述的方法,其特征在于,所述AI平台在监测到以下一项或多项的情况下,所述第一计算节点为发生故障的计算节点:
    所述第一计算节点硬件故障,所述第一计算节点执行的训练任务对应的训练进程退出,所述第一计算节点上报的故障。
  3. 根据权利要求2所述的方法,其特征在于,若所述AI平台监测到所述第一计算节点硬件故障,且未监测到所述第一计算节点执行的训练任务对应的训练进程退出;在所述对第一计算节点进行故障隔离之后,所述方法包括:
    向所述第一计算节点发送停止训练进程的通知,所述停止训练进程的通知用于指示所述第一计算节点停止执行的训练任务对应的训练进程。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,在所述对第一计算节点进行故障隔离之后,在所述确定第二计算节点之前,所述方法还包括:
    向第三计算节点发送暂停训练进程的通知,所述第三计算节点为所述多个计算节点中未发生故障的计算节点,所述暂停训练进程的通知用于指示所述第三计算节点暂停所述AI模型分布式训练的训练任务对应的训练进程。
  5. 根据权利要求4所述的方法,其特征在于,所述暂停训练进程的通知具体用于:指示所述第三计算节点在执行完所述AI模型分布式训练的梯度计算之后,暂停所述AI模型分布式训练的训练任务对应的训练进程。
  6. 根据权利要求4或5所述的方法,其特征在于,在所述确定第二计算节点之后,所述方法还包括:
    向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点和增加所述第二计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
  7. 根据权利要求4或5所述的方法,其特征在于,若未确定到第二计算节点,所述方法还包括:
    向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
  8. 一种人工智能AI模型的分布式训练装置,其特征在于,应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于所述AI模型分布式训练的多个计算节点,所述多个计算节点中的每个计算节点执行所述AI模型分布式训练的一个训练任务;所述装置包括:
    资源管理模块,用于对第一计算节点进行故障隔离,所述第一计算节点为所述多个计算节点中发生故障的计算节点;
    任务调度模块,用于确定第二计算节点,所述第二计算节点为所述计算资源池中除所述多个计算节点之外的计算节点;
    以及配置所述第二计算节点,以使所述第二计算节点替代所述第一计算节点执行训练任务。
  9. 根据权利要求8所述的装置,其特征在于,所述AI平台在监测到以下一项或多项的情况下,所述第一计算节点为发生故障的计算节点:
    所述第一计算节点硬件故障,所述第一计算节点执行的训练任务对应的训练进程退出,所述第一计算节点上报的故障。
  10. 根据权利要求9所述的装置,其特征在于,若所述AI平台监测到所述第一计算节点硬件故障,且未监测到所述第一计算节点执行的训练任务对应的训练进程退出;在所述对第一计算节点进行故障隔离之后,所述任务调度模块还用于:
    向所述第一计算节点发送停止训练进程的通知,所述停止训练进程的通知用于指示所述第一计算节点停止执行的训练任务对应的训练进程。
  11. 根据权利要求8-10任一项所述的装置,其特征在于,在所述对第一计算节点进行故障隔离之后,在所述确定第二计算节点之前,所述任务调度模块还用于:
    向第三计算节点发送暂停训练进程的通知,所述第三计算节点为所述多个计算节点中未发生故障的计算节点,所述暂停训练进程的通知用于指示所述第三计算节点暂停所述AI模型分布式训练的训练任务对应的训练进程。
  12. 根据权利要求11所述的装置,其特征在于,所述暂停训练进程的通知具体用于:指示所述第三计算节点在执行完所述AI模型分布式训练的梯度计算之后,暂停所述AI模型分布式训练的训练任务对应的训练进程。
  13. 根据权利要求11或12所述的装置,其特征在于,在所述确定第二计算节点之后,所述任务调度模块还用于:
    向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点和增加所述第二计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
  14. 根据权利要求11或12所述的装置,其特征在于,若未确定到第二计算节点,所述任务调度模块还用于:
    向所述第三计算节点发送继续训练的通知,所述继续训练的通知用于指示所述第三计算节点在所述AI模型分布式训练的训练框架中的通讯拓扑中删除所述第一计算节点,以及恢复所述AI模型分布式训练的训练任务对应的训练进程,所述通讯拓扑用于所述AI模型分布式训练的梯度同步。
  15. 一种计算设备,其特征在于,所述计算设备包括存储器和处理器,所述存储器用于存储计算机指令;
    所述处理器执行所述存储器存储的计算机指令,以执行上述权利要求1-7中任一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序代码,当所述计算机程序代码被计算设备执行时,所述计算设备执行上述权利要求1-7中任一项所述的方法。
  17. 一种计算机程序产品,当其在计算设备上运行时,使得所述计算设备执行上述权利要求1-7中任一项所述的方法。
PCT/CN2022/111716 2021-08-20 2022-08-11 Ai模型的分布式训练方法和相关设备 WO2023020355A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22857667.4A EP4375892A1 (en) 2021-08-20 2022-08-11 Distributed training method for ai model and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110963715.7 2021-08-20
CN202110963715.7A CN115712830A (zh) 2021-08-20 2021-08-20 Ai模型的分布式训练方法和相关设备

Publications (1)

Publication Number Publication Date
WO2023020355A1 true WO2023020355A1 (zh) 2023-02-23

Family

ID=85230161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/111716 WO2023020355A1 (zh) 2021-08-20 2022-08-11 Ai模型的分布式训练方法和相关设备

Country Status (3)

Country Link
EP (1) EP4375892A1 (zh)
CN (1) CN115712830A (zh)
WO (1) WO2023020355A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116755941B (zh) * 2023-08-21 2024-01-09 之江实验室 一种节点故障感知的分布式模型训练的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130290223A1 (en) * 2012-04-27 2013-10-31 Yahoo! Inc. Method and system for distributed machine learning
US20190114537A1 (en) * 2017-10-16 2019-04-18 Facebook, Inc. Distributed training and prediction using elastic resources
CN110852445A (zh) * 2019-10-28 2020-02-28 广州文远知行科技有限公司 分布式机器学习训练方法、装置、计算机设备和存储介质
CN111078480A (zh) * 2019-12-17 2020-04-28 北京奇艺世纪科技有限公司 一种异常恢复方法和服务器
CN113656175A (zh) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 基于分布式系统训练模型的方法、设备及程序产品

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349026A (zh) * 2023-12-04 2024-01-05 环球数科集团有限公司 一种用于aigc模型训练的分布式算力调度系统
CN117349026B (zh) * 2023-12-04 2024-02-23 环球数科集团有限公司 一种用于aigc模型训练的分布式算力调度系统

Also Published As

Publication number Publication date
CN115712830A (zh) 2023-02-24
EP4375892A1 (en) 2024-05-29

Similar Documents

Publication Publication Date Title
WO2023020355A1 (zh) Ai模型的分布式训练方法和相关设备
EP2614436B1 (en) Controlled automatic healing of data-center services
US11301307B2 (en) Predictive analysis for migration schedulers
CN110134495B (zh) 一种容器跨主机在线迁移方法、存储介质及终端设备
US7779298B2 (en) Distributed job manager recovery
US8549536B2 (en) Performing a workflow having a set of dependancy-related predefined activities on a plurality of task servers
Bhattacharjee et al. IBM deep learning service
US10454801B2 (en) Methods and systems that diagnose and manage undesirable operational states of computing facilities
US20130042003A1 (en) Smart cloud workload balancer
US9940598B2 (en) Apparatus and method for controlling execution workflows
US10970649B2 (en) Automated reinforcement-learning-based application manager that uses local agents
US10797938B2 (en) Automatic monitoring, correlation, and resolution of network alarm conditions
US11042640B2 (en) Safe-operation-constrained reinforcement-learning-based application manager
CN111880934A (zh) 一种资源管理方法、装置、设备及可读存储介质
CN112905297A (zh) 容器集群资源调度方法和装置
CN110413369A (zh) 用于虚拟化环境中的备份的系统和方法
WO2023165512A1 (zh) 一种故障文件保存方法及相关装置
CN111445027B (zh) 机器学习模型的训练方法和装置
US11900325B2 (en) Utilizing a combination of machine learning models to determine a success probability for a software product
CN115437766A (zh) 一种任务处理方法和装置
CN114153427A (zh) 持续集成流水线的优化方法及系统
US20240143369A1 (en) Using rule engine with push mechanism for configuration data of a containerized computing cluster
US20240028388A1 (en) Application usage and auto maintenance driven migration of applications and their dependencies
US20240028387A1 (en) Device health driven migration of applications and its dependencies
US20240143368A1 (en) Using rule engine with polling mechanism for configuration data of a containerized computing cluster

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857667

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022857667

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022857667

Country of ref document: EP

Effective date: 20240219

NENP Non-entry into the national phase

Ref country code: DE