WO2022048557A1 - Training method, apparatus, computing device and storage medium for an AI model - Google Patents

Training method, apparatus, computing device and storage medium for an AI model

Info

Publication number
WO2022048557A1
WO2022048557A1 · PCT/CN2021/115881 · CN2021115881W
Authority
WO
WIPO (PCT)
Prior art keywords
training
task
model
user
mode
Prior art date
Application number
PCT/CN2021/115881
Other languages
English (en)
French (fr)
Inventor
朱疆成
黄哲思
吴仁科
白小龙
杨兵兵
李亿
钟京华
戴宗宏
Original Assignee
华为云计算技术有限公司 (Huawei Cloud Computing Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Priority to EP21863621.5A (published as EP4209972A4)
Publication of WO2022048557A1
Priority to US18/179,661 (published as US20230206132A1)


Classifications

    • G06N20/00 Machine learning
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5061 Partitioning or combining of resources
    • G06N3/098 Distributed learning, e.g. federated learning
    • G06F2209/5011 Pool (indexing scheme relating to G06F9/50)
    • G06F9/45558 Hypervisor-specific management and integration aspects

Definitions

  • The present application relates to the technical field of artificial intelligence (AI), and in particular to a training method, apparatus, computing device and storage medium for an AI model.
  • Training the initial AI model is a relatively critical process. Training refers to inputting the data of the training data set into the initial AI model, having the initial AI model perform computation, and updating the parameters of the initial AI model based on the computation results, thereby obtaining an AI model with certain capabilities (e.g., image classification, target detection, natural language recognition).
  • the present application provides an AI model training method, apparatus, computing device and storage medium, so as to perform distributed training more flexibly.
  • In a first aspect, the present application provides an AI model training method. The method is applied to an AI platform that is associated with a computing resource pool, the computing resource pool including computing nodes used for model training. The method includes: providing a training configuration interface, where the interface offers multiple training modes for the user to choose from, each training mode representing an allocation strategy for the computing nodes required to train the initial AI model; generating at least one training task according to the user's selection in the training configuration interface; and executing the at least one training task to train the initial AI model, obtaining an AI model that is available for the user to download or use.
  • In this way, the AI platform provides the user with the ability to select a training mode; the user can select an appropriate training mode to generate at least one training task instead of relying on conventional distributed training. Distributed training thus becomes flexible and can balance the user's training needs against resource utilization.
  • the multiple training modes include a first mode and/or a second mode, the first mode indicates that the number of training tasks is automatically adjusted in the process of training the initial AI model, and the second mode indicates a different Training tasks share the resources of the same compute node.
  • The plurality of training modes may further include a third mode, a regular mode in which distributed training is performed using preset or preselected computing nodes.
  • the first mode may also be referred to as a performance mode or a turbo mode
  • the second mode may also be referred to as a shared mode or an economic mode.
  • the first mode indicates that the number of training tasks for one training job is automatically adjusted in the process of training the initial AI model
  • the second mode indicates that different training tasks share the resources of the same computing node.
  • different training tasks may belong to the same training job or may belong to different training jobs.
  • In the first mode, the number of training tasks can be dynamically adjusted to speed up training.
  • In the second mode, training resources can be shared with other training tasks to improve resource utilization.
  • In a possible implementation, the at least one training task runs in containers, and the method further includes: in the process of training the initial AI model, providing the user with state information of the training process, where the state information includes at least one of the following: the number of containers executing the training task, the resource usage of each container, the number of computing nodes executing the training task, and the resource usage of those computing nodes.
  • At least one training task runs in a container, and each container contains a complete runtime environment: the training task, all dependencies required to execute it, and so on.
  • the AI platform can also provide users with status information of the training process. In this way, the training process can be displayed to the user more intuitively.
  • Optionally, the multiple training modes include a first mode and a second mode.
  • Generating at least one training task according to the user's selection on the training configuration interface includes: generating at least one training task according to the first mode and the second mode selected by the user on the training configuration interface.
  • That is, the AI platform can generate at least one training task according to the first mode and the second mode selected by the user on the training configuration interface.
  • When the user selects the first mode in the training configuration interface, the interface also allows the user to input or select the number of containers that can run the training task. Generating the at least one training task according to the user's selection then includes: generating at least one training task according to the training mode selected by the user on the training configuration interface and the number of containers, input or selected by the user, that can run the training task.
  • the training configuration interface may also allow the user to input or select the number of containers that can run the training task.
  • the user can input or select the number of containers that can run training tasks in the training configuration interface.
  • the AI platform can generate at least one training task according to the training mode selected by the user on the training configuration interface and the number of containers. In this way, since the number of containers that can run training tasks can be selected by the user, training is made more intelligent.
  • Optionally, the training configuration interface also allows the user to input or select the resource usage of the container running the training task. Generating at least one training task according to the user's selection includes: generating at least one training task according to the training mode selected by the user on the training configuration interface and the resource usage, input or selected by the user, of the container running the training task.
  • the training configuration interface can also allow the user to input or select the resource usage of the container running the training task.
  • the user can enter or select the resource usage of the container running the training task in the training configuration interface.
  • the AI platform can generate at least one training task according to the training mode selected by the user on the training configuration interface and the resource usage. In this way, since the resource usage of the container running the training task can be selected by the user, the training is made more intelligent.
  • Optionally, the resource usage of the container running the training task includes a GPU usage smaller than a single graphics processing unit (GPU) and/or a video memory usage smaller than the memory of a single card. In this way, since the resource usage of a single container is relatively small, resource utilization can be higher.
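  • As an illustrative sketch only (the field names and numbers here are hypothetical, not taken from the application): a per-container resource request that asks for less than one full GPU and less than one card's video memory, together with a check that several such containers fit on a single shared accelerator, could look as follows.

```python
# Hypothetical per-container resource request: the training task asks for a
# fractional share of one GPU's compute and a slice of its video memory, so
# that several containers can share a single accelerator.
container_resources = {
    "gpu": 0.25,          # a quarter of one GPU's compute (fractional share)
    "gpu_memory_gb": 4,   # 4 GB of a card's (e.g. 16 GB) video memory
    "cpu_cores": 2,
    "memory_gb": 8,
}

def fits_on_gpu(requests, gpu_total=1.0, mem_total_gb=16):
    """Check whether a set of container requests can share one GPU."""
    return (sum(r["gpu"] for r in requests) <= gpu_total
            and sum(r["gpu_memory_gb"] for r in requests) <= mem_total_gb)
```

With the values above, four such containers exactly fill one 16 GB GPU, while a fifth would not fit.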
  • When the first mode is selected, performing at least one training task to train the initial AI model includes: during execution of the at least one training task, when it is detected that the conditions for elastic scaling are met, obtaining the idle amount of computing resources in the computing resource pool; adjusting the number of training tasks and the number of containers used to run them according to that idle amount; and running the adjusted training tasks in the adjusted containers to train the initial AI model.
  • That is, when the first mode is selected, the AI platform can detect, while executing the at least one training task to train the initial AI model, whether the conditions for elastic scaling are met.
  • When it detects that the conditions are met, the AI platform obtains the idle amount of computing resources in the computing resource pool.
  • The AI platform uses this idle amount to adjust the number of training tasks and the number of containers running them.
  • The AI platform then runs the adjusted training tasks in the adjusted containers to train the initial AI model. Because capacity can be elastically scaled out and in, training can be accelerated.
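  • The elastic scaling decision described above can be sketched as follows. The function name, the slot-based accounting, and the halving policy on scale-in are illustrative assumptions, not the application's actual algorithm: the point is only that the task count grows when the pool has idle capacity and shrinks when it does not.

```python
def adjust_tasks(current_tasks, idle_slots, max_tasks=8):
    """Sketch of an elastic-scaling decision: given the number of currently
    running training tasks and the idle capacity of the resource pool
    (counted in task-sized slots), return the new task count.  Scale out
    when idle resources exist; scale in (toward a floor of 1) when the
    pool is exhausted."""
    if idle_slots > 0:                      # free capacity: scale out
        return min(current_tasks + idle_slots, max_tasks)
    return max(current_tasks // 2, 1)       # pool exhausted: scale in
```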
  • Optionally, adjusting the number of training tasks and the number of containers used to run them, and running the adjusted training tasks in the adjusted containers to train the initial AI model, includes: adding part of the training tasks to a target container that is already running a training task, and running the multiple training tasks serially in that target container; during training, the average of the model parameters obtained from the serially run training tasks is used as the updated value of the model parameters.
  • When scaling in, the number of containers is reduced.
  • The training tasks that were running on the removed containers are added to a target container that is already running a training task; the target container therefore holds multiple training tasks.
  • These multiple training tasks are run serially in the target container.
  • The average of the model parameters obtained by running the multiple training tasks serially is used as the updated parameter value. Because the target container runs the multiple training tasks serially, this is equivalent to the tasks still being executed in a distributed manner, the same as before scaling in, so the training accuracy of the AI model is not reduced.
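  • The serial execution with parameter averaging described above can be sketched minimally as follows; the representation of a "task" as a callable and of the parameters as a NumPy array is an illustrative assumption.

```python
import numpy as np

def run_serially_and_average(tasks, params):
    """Run several training tasks one after another in the same container,
    each producing its own updated copy of the model parameters, then use
    the element-wise average as the new parameter values."""
    results = [task(params.copy()) for task in tasks]  # serial execution
    return np.mean(results, axis=0)                    # averaged update
```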
  • When the second mode is selected, the method includes: determining the remaining resources of the computing node corresponding to each container according to the resource usage of the containers running the at least one training task in the second mode; and running one or more other training tasks using those remaining resources.
  • That is, when the second mode is selected, the AI platform subtracts the used resources from the total resources of the computing node corresponding to each container to obtain that node's remaining resources, and uses those remaining resources to run one or more other training tasks. In this way, the resources left over on each computing node can be utilized, improving resource utilization.
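  • The subtraction of used resources from a node's total, as described above, can be sketched as follows; the dictionary-based resource representation is an illustrative assumption.

```python
def remaining_resources(node_total, containers):
    """Subtract the resources used by the containers already scheduled on a
    node from the node's total, giving the capacity left for other tasks."""
    used = {k: sum(c.get(k, 0) for c in containers) for k in node_total}
    return {k: node_total[k] - used[k] for k in node_total}
```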
  • In a second aspect, the present application provides an AI model training apparatus. The apparatus is applied to an AI platform that is associated with a computing resource pool, the computing resource pool including computing nodes for model training. The apparatus includes: a training configuration module for providing a training configuration interface to the user, where the interface offers multiple training modes for the user to select, each training mode representing an allocation strategy for the computing nodes required to train the initial AI model.
  • It further includes a task management module configured to: generate at least one training task according to the user's selection on the training configuration interface; and execute the at least one training task to train the initial AI model, obtaining an AI model that is available for the user to download or use.
  • In this way, the AI platform provides the user with the ability to select a training mode; the user can select an appropriate training mode to generate at least one training task instead of relying on conventional distributed training, so that distributed training can be performed flexibly, balancing the user's training needs against resource utilization.
  • the multiple training modes include a first mode and/or a second mode
  • the first mode indicates that the number of training tasks is automatically adjusted in the process of training the initial AI model
  • the second mode indicates that different training tasks share the resources of the same computing node.
  • the at least one training task runs on the container, and the apparatus further includes:
  • a presentation module configured to provide the user with state information of the training process during the process of training the initial AI model, wherein the state information includes at least one of the following information: a container for executing a training task The number of containers, the resource usage of each container, the number of computing nodes that perform training tasks, and the resource usage of computing nodes that perform training tasks.
  • the multiple training modes include a first mode and a second mode
  • the task management module is configured to:
  • At least one training task is generated according to the first mode and the second mode selected by the user in the training configuration interface.
  • the training configuration interface is further for the user to input or select the number of containers that can run the training task;
  • the task management module is used for:
  • At least one training task is generated according to the training mode selected by the user in the training configuration interface and the number of containers, input or selected by the user, that can run the training task.
  • the training configuration interface is further for the user to input or select the resource usage of the container running the training task ;
  • the task management module is used for:
  • At least one training task is generated according to the training mode selected by the user in the training configuration interface and the resource usage, input or selected by the user, of the container running the training task.
  • The resource usage of the container running the training task includes a GPU usage smaller than a single GPU and/or a video memory usage smaller than the memory of a single card.
  • When the first mode is selected, the task management module is configured so that:
  • the adjusted training task is run in the adjusted container to train the initial AI model.
  • the task management module is used for:
  • Part of the training tasks in the at least one training task is added to the target container in which the training tasks in the at least one training task have been run, and multiple training tasks are serially run in the target container.
  • the average value of the model parameters obtained by running the multiple training tasks in series is used as the updated value of the model parameters.
  • When the second mode is selected, the task management module is further configured so that:
  • One or more other training tasks are run using the remaining resources of the computing node corresponding to each container.
  • In a third aspect, a computing device is provided, including a processor and a memory, where computer instructions are stored in the memory and the processor executes the computer instructions to implement the method of the first aspect and its possible implementations.
  • In a fourth aspect, a computer-readable storage medium is provided, where computer instructions are stored in the computer-readable storage medium; when the computer instructions are executed by a computing device, the computing device is made to perform the method of the first aspect and its possible implementations.
  • In a fifth aspect, a computer program product comprising instructions is provided; when run on a computing device, the instructions cause the computing device to perform the method of the first aspect and its possible implementations, or to implement the functions of the apparatus of the second aspect and its possible implementations.
  • FIG. 1 is a schematic structural diagram of an AI platform 100 provided by an exemplary embodiment of the present application;
  • FIG. 2 is a schematic diagram of an application scenario of the AI platform 100 provided by an exemplary embodiment of the present application;
  • FIG. 3 is a schematic diagram of the deployment of the AI platform 100 provided by an exemplary implementation of the present application;
  • FIG. 4 is a schematic structural diagram of a computing device 400 for deploying the AI platform 100 provided by an exemplary implementation of the present application;
  • FIG. 5 is a schematic flowchart of a training method for an AI model provided by an exemplary implementation of the present application;
  • FIG. 6 is a schematic diagram of state information of a training process provided by an exemplary implementation of the present application;
  • FIG. 7 is a schematic flowchart of a training method for an AI model provided by an exemplary implementation of the present application;
  • FIG. 8 is a schematic diagram of a training configuration interface provided by an exemplary implementation of the present application;
  • FIG. 9 is a schematic diagram of capacity expansion provided by an exemplary implementation of the present application;
  • FIG. 10 is a schematic diagram of capacity reduction provided by an exemplary implementation of the present application;
  • FIG. 11 is a schematic flowchart of a training method for an AI model provided by an exemplary implementation of the present application;
  • FIG. 12 is a schematic flowchart of a training method for an AI model provided by an exemplary implementation of the present application;
  • FIG. 13 is a schematic structural diagram of a computing device provided by an exemplary implementation of the present application.
  • Machine learning is a core method to realize AI.
  • Machine learning penetrates into various industries such as medicine, transportation, education, and finance. Not only professional technicians, but even non-AI technology professionals in various industries are looking forward to using AI and machine learning to complete specific tasks.
  • An AI model is a type of mathematical algorithm model that uses machine learning ideas to solve practical problems.
  • The AI model includes a large number of parameters and calculation formulas (or calculation rules).
  • The parameters in the AI model are the numerical values obtained by training the AI model on the training data set; for example, a parameter of the AI model may be the weight of a calculation factor in one of the model's calculation formulas.
  • the AI model also includes some hyperparameters. Hyperparameters are parameters that cannot be obtained by training the AI model through the training data set. Hyperparameters can be used to guide the construction of AI models or the training of AI models. There are many types of hyperparameters. For example, the number of iterations of AI model training, the learning rate, the batch size, the number of layers of the AI model, and the number of neurons in each layer.
  • The difference between the hyperparameters and the parameters of the AI model is that the values of the hyperparameters cannot be obtained by analyzing the training data set, whereas the values of the parameters can be modified and determined by analyzing the training data set during the training process.
  • the neural network model is a mathematical algorithm model that imitates the structure and function of a biological neural network (animal central nervous system).
  • a neural network model can include a variety of neural network layers with different functions, and each layer includes parameters and calculation formulas. According to different calculation formulas or different functions, different layers in the neural network model have different names. For example, layers that perform convolution calculations are called convolutional layers, and convolutional layers are often used for feature extraction on input signals such as images.
  • a neural network model can also be composed of a combination of multiple existing neural network models. Neural network models with different structures can be used in different scenarios (such as classification, recognition, etc.) or provide different effects when used in the same scenario.
  • the different structure of the neural network model specifically includes one or more of the following: the number of layers of the network layers in the neural network model is different, the order of each network layer is different, and the weights, parameters or calculation formulas in each network layer are different.
  • Training an AI model refers to using the existing data to make the AI model fit the rules of the existing data through a certain method, and determine the parameters in the AI model.
  • the training data in the training dataset used for training has labels.
  • In supervised learning, the training data in the training data set is used as the input of the AI model; the AI model computes on the input training data to obtain an output value, and the annotation corresponding to the training data serves as the reference for that output value.
  • A loss function is used to calculate the loss between the AI model's output value and the annotation corresponding to the training data, and the parameters of the AI model are adjusted according to the loss value.
  • The AI model is trained iteratively on each piece of training data in the training data set, and its parameters are continually adjusted, until the AI model can, given the input training data, output values that match or closely approximate the corresponding annotations with high accuracy.
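  • The supervised training loop described above can be illustrated with a minimal sketch (an illustration, not the application's method): fitting y = w·x by gradient descent on a squared-error loss, where each prediction is compared with its annotation and the parameter is adjusted according to the loss gradient.

```python
# Minimal supervised-training sketch: learn the parameter w of y = w * x
# by repeatedly comparing the model's output with the annotation and
# adjusting w along the gradient of the squared-error loss.
def train(xs, ys, lr=0.1, epochs=100):
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w * x                # model output for this sample
            grad = 2 * (pred - y) * x   # d/dw of the loss (pred - y)^2
            w -= lr * grad              # adjust the parameter by the loss
    return w
```

For annotations generated by y = 2x, the learned parameter converges to 2.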
  • In unsupervised learning, the training data in the training data set is not annotated; the training data is input to the AI model in turn, and the AI model gradually identifies the correlations and potential patterns among the training data in the training data set.
  • the AI model used for clustering can learn the characteristics of each training data and the correlation and difference between the training data, and automatically divide the training data into multiple types. Different task types can use different AI models.
  • Some AI models can only be trained with supervised learning, some can only be trained with unsupervised learning, and some can be trained with either supervised or unsupervised learning.
  • a trained AI model can be used to complete a specific task.
  • Many AI models in machine learning need to be trained with a supervised learning method. Training an AI model with supervised learning enables it to learn, in a more targeted way, the association between the training data and the corresponding annotations in the labeled training data set, so that the trained AI model achieves higher accuracy when used to make predictions on other input inference data.
  • the loss function is a function used to measure the degree to which the AI model is trained (that is, used to calculate the difference between the results predicted by the AI model and the real target).
  • During training, the loss function is used to judge the difference between the value predicted by the current AI model and the real target value, and the parameters of the AI model are updated accordingly; when the AI model can predict the real target value, or a value very close to it, the AI model is considered trained.
  • Distributed training is one of the commonly used acceleration methods in the AI model training process.
  • Distributed training refers to splitting training into multiple independent computing nodes for independent computing, and then periodically summarizing and redistributing the results, thereby accelerating the AI model training process.
  • the current mainstream distributed computing topologies include ps-worker and all-reduce.
  • Distributed training can include data-parallel distributed training and model-parallel distributed training.
  • Data-parallel distributed training distributes the training data in the training data set across multiple computing nodes for simultaneous computation; AI model training is performed on each computing node, and the gradients of the model parameters generated on each node are aggregated before the model parameters are updated.
  • the batch size on each computing node in the K computing nodes and using a single computing node The batch size is the same when performing the calculation, and the batch size refers to the number of training data selected in the training data set before each parameter adjustment.
  • In another policy, the batch size on each computing node is the batch size used on a single computing node divided by K, so that the aggregated global batch size remains unchanged.
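The second batch-size policy can be illustrated with a minimal, framework-free sketch (the worker count K, the toy linear model, and the learning rate below are all hypothetical; a real system would use a ps-worker or all-reduce topology instead of this in-process loop):

```python
# Minimal data-parallel sketch: K workers each compute a gradient on their
# shard of the global batch; gradients are averaged (as an all-reduce would
# do) and the shared parameter is updated once per global step.

def local_gradient(w, shard):
    # Gradient of mean squared error for the toy model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, batch, K, lr=0.01):
    shard_size = len(batch) // K  # per-worker batch = global batch / K
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(K)]
    grads = [local_gradient(w, s) for s in shards]  # parallel in practice
    avg_grad = sum(grads) / K                       # aggregation step
    return w - lr * avg_grad

# Toy data for the target function y = 3x.
batch = [(x, 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = train_step(w, batch, K=4)
print(round(w, 2))  # converges toward 3.0
```

Because the per-worker gradients are averaged, the update is equivalent to a single-node step over the whole global batch, which is why the global batch size can stay unchanged.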
  • the training method of the AI model is described by taking data-parallel distributed training as an example.
  • Model-parallel distributed training divides the model across multiple computing nodes; the training data does not need to be divided.
  • There are multiple segmentation methods for model-parallel distributed training. For example, a layered neural network model can be segmented by layers, that is, each layer or group of consecutive layers is placed on one computing node.
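Layer-wise segmentation can be sketched as follows (the layer functions and the contiguous-grouping policy are illustrative only; a real deployment would transfer activations between nodes over a network):

```python
# Sketch of layer-wise model parallelism: each "node" holds a contiguous
# group of layers and forwards its activation to the next node.

def assign_layers(layers, num_nodes):
    """Split a list of layers into num_nodes contiguous groups."""
    per_node = -(-len(layers) // num_nodes)  # ceiling division
    return [layers[i:i + per_node] for i in range(0, len(layers), per_node)]

def forward(groups, x):
    # Each node applies its layer group, then "sends" the result onward.
    for group in groups:
        for layer in group:
            x = layer(x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 3, lambda v: v * 4]
groups = assign_layers(layers, num_nodes=2)
print(len(groups), forward(groups, 1))  # 2 groups; ((1+1)*2+3)*4 = 28
```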
  • the AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users. There are various pre-trained AI models or AI sub-models built into the AI platform to solve different problems.
  • The AI platform can search for and build suitable AI models according to the needs of users. Users only need to specify their needs in the AI platform and follow the prompts.
  • The AI platform can then train an AI model for the user that can be used to achieve the user's needs.
  • Alternatively, the user prepares his or her own algorithm (also known as the initial AI model) and training data set according to the prompts, and uploads them to the AI platform.
  • Users can use the trained AI model to complete their specific tasks.
  • the AI model before being trained by the AI platform (for example, the algorithm uploaded by the user, the algorithm preset by the AI platform, or the pre-training model) is referred to as the initial AI model.
  • The embodiments of the present application provide an AI platform in which multiple training modes are introduced, and each training mode represents an allocation strategy for the computing nodes required to train the initial AI model.
  • The term "AI model" used above is a general term; AI models include deep learning models, machine learning models, and the like.
  • FIG. 1 is a schematic structural diagram of an AI platform 100 in an embodiment of the present application. It should be understood that FIG. 1 is only an exemplary structural diagram of the AI platform 100, and the present application does not limit the division of modules in the AI platform 100.
  • the AI platform 100 includes an algorithm management module 101 , a training configuration module 102 , a task management module 103 and a data storage module 104 .
  • the AI platform is associated with a computing resource pool.
  • the computing resource pool includes multiple computing nodes for model training.
  • the AI platform can schedule computing nodes in the computing resource pool for model training.
  • Algorithm management module 101: provides an initial AI model management interface for users to upload initial AI models created based on their own training objectives, or to obtain existing initial AI models from the initial AI model library. Alternatively, the algorithm management module 101 may also be used to obtain an initial AI model preset on the AI platform according to the task objective input by the user.
  • the initial AI model created by users based on their own training goals can be written based on the framework provided by the AI platform.
  • the initial AI model may include an AI model that has not been trained, and an AI model that has been trained but not fully trained.
  • An untrained AI model means that the built AI model has not been trained using the training data set, and the parameters in the built AI model are all preset values.
  • Training configuration module 102 provides a training configuration interface for the user.
  • the user may select a training mode in the training configuration interface, and the training mode may include a normal mode, a first mode and a second mode.
  • The first mode may also be referred to as a turbo mode or performance mode, and the second mode may also be referred to as an economic mode or shared mode.
  • For ease of description, in the following, the first mode is referred to as the performance mode and the second mode is referred to as the shared mode.
  • Regular mode is the existing distributed training mode.
  • The performance mode refers to dynamically adjusting the resources used to train the initial AI model during the AI model training process.
  • In the shared mode, during the AI model training process, the training of different AI models can share the resources of the same computing node, or different training tasks of the same AI model can share the resources of the same computing node.
  • In the regular mode, during the AI model training process, the training of each AI model occupies all the resources of one or more computing nodes, and the resources are not dynamically adjusted.
  • When the user selects the shared mode as the training mode, the user can also select, in the training configuration interface, the resource usage of the container running the training task.
  • When the user selects the performance mode as the training mode, the user can also select, in the training configuration interface, the number of containers that can run the training task.
  • the user can also select the initial AI model and configure the input and output object storage service (OBS) path on the training configuration interface.
  • the user can also select the specification of the computing node used for training the initial AI model in the training configuration interface, such as the graphics processing unit (GPU) size and the amount of video memory of the computing node required to train the initial AI model.
  • the user can also input a training data set for training the initial AI model in the training configuration interface.
  • The data in the training data set may be labeled or unlabeled. Specifically, the input may be the access address of the training data set.
  • the user can also input the expected effect of the AI model for completing the task objective and the expected completion time of training in the training configuration interface.
  • For example, the user may input or select that the resulting AI model for face recognition should have an accuracy of more than 99%, and that training is expected to be completed within 24 hours.
  • the training configuration module 102 can communicate with the algorithm management module 101 for obtaining the access address of the initial AI model from the algorithm management module 101 .
  • the training configuration module 102 is further configured to package the training job based on the access address of the initial AI model and some content input or selected by the user in the training configuration interface.
  • the configuration module 102 can also communicate with the task management module 103 to submit a training job to the task management module 103 .
  • Task management module 103: a core module that manages the process of training the AI model.
  • the task management module 103 can communicate with the algorithm management module 101 , the training configuration module 102 and the data storage module 104 .
  • the specific processing is:
  • The task management module 103 pulls the corresponding training image and the initial AI model based on the training mode, the number of containers, the container resource usage, the access address of the initial AI model, and other information in the training job provided by the training configuration module 102, and generates at least one container for running the training task(s).
  • the container of at least one training task is delivered to the computing node of the computing resource pool for running.
  • The task management module 103 is further configured to monitor whether the at least one training task satisfies the expansion and contraction conditions, and to dynamically adjust the at least one training task and its containers when those conditions are met.
  • the task management module 103 is further configured to configure shared resources of each container. For example, schedule container 1 and container 2 to one computing node in the computing resource pool.
  • Data storage module 104 (for example, it may be a data storage resource corresponding to OBS provided by a cloud service provider): used to store the training data set uploaded by the user, the initial AI model uploaded by the user, initial AI models uploaded by other users, some configuration items of the training modes, and the like.
  • In some embodiments, the AI platform further includes a display module 105 (not shown in FIG. 1 ). The display module 105 communicates with the task management module 103 to obtain status information of the training process, the trained AI model, and the like, and provides them to users.
  • the AI platform in this application may be a system that can interact with users.
  • This system may be a software system, a hardware system, or a system combining software and hardware, which is not limited in this application.
  • the AI platform provided by the embodiments of the present application can provide users with flexible distributed training services, so that the AI platform can balance users' training requirements and resource utilization requirements.
  • FIG. 2 is a schematic diagram of an application scenario of an AI platform 100 provided by an embodiment of the present application.
  • the AI platform 100 may be fully deployed in a cloud environment.
  • Cloud environment is an entity that utilizes basic resources to provide cloud services to users under the cloud computing model.
  • the cloud environment includes cloud data centers and cloud service platforms.
  • Cloud data centers include a large number of basic resources (including computing resource pools, storage resources, and network resources) owned by cloud service providers.
  • The computing resource pool included in a cloud data center can consist of a large number of computing nodes (e.g., servers).
  • The AI platform 100 can be independently deployed on a server or virtual machine in the cloud data center; it can also be deployed in a distributed manner across multiple servers in the cloud data center, across multiple virtual machines in the cloud data center, or across both servers and virtual machines in the cloud data center.
  • The AI platform 100 is abstracted by the cloud service provider into an AI cloud service on the cloud service platform and provided to the user (for example, the user purchases the service on the cloud service platform, where settlement is performed); the cloud environment then uses the AI platform 100 deployed in the cloud data center to provide the AI platform cloud service to users.
  • The user can determine the tasks to be completed by the AI model, upload the training data set to the cloud environment, and so on, through an application program interface (API) or a graphical user interface (GUI).
  • the AI platform 100 in the environment receives the user's task information and training data sets, and performs data preprocessing and AI model training.
  • the AI platform returns the status information of the training process of the AI model to the user through API or GUI.
  • the trained AI model can be downloaded by users or used online to complete specific tasks.
  • When the AI platform in the cloud environment is abstracted into an AI cloud service and provided to the user, if the user selects the shared mode, the user can purchase usage time for a container with a fixed amount of resources; for a fixed resource usage, the longer the usage time, the higher the cost, and vice versa. During this usage period, the AI platform trains the AI model.
  • Alternatively, the user can pre-charge, and after the training is completed, settlement is made according to the number of GPUs finally used and the usage time.
  • When the AI platform 100 in the cloud environment is abstracted into AI cloud services and provided to users, it can be divided into two parts: a basic AI cloud service and an AI elastic training cloud service. Users can purchase only the basic AI cloud service on the cloud service platform, and purchase the AI elastic training cloud service when they need it. After the purchase, the cloud service provider provides the AI elastic training cloud service API, and the AI elastic training cloud service is additionally billed according to the number of API calls.
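The pay-per-use settlement described above amounts to simple interval arithmetic. A hedged sketch (the rate and usage intervals below are made-up figures, not actual pricing):

```python
# Illustrative post-training settlement: total cost is the sum over usage
# intervals of (GPUs in use) x (hours used), multiplied by a per-GPU-hour rate.

def settle(usage_intervals, rate_per_gpu_hour):
    """usage_intervals: list of (gpu_count, hours) tuples."""
    return sum(gpus * hours for gpus, hours in usage_intervals) * rate_per_gpu_hour

# Elastic training: 8 GPUs for 2 hours, then scaled down to 2 GPUs for 4 hours.
total = settle([(8, 2.0), (2, 4.0)], rate_per_gpu_hour=1.5)
print(total)  # (16 + 8) * 1.5 = 36.0
```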
  • the deployment of the AI platform 100 provided by the present application is relatively flexible. As shown in FIG. 3 , in another embodiment, the AI platform 100 provided by the present application may also be deployed in different environments in a distributed manner.
  • the AI platform 100 provided by this application can be logically divided into multiple parts, each part having different functions.
  • the AI platform 100 includes an algorithm management module 101 , a training configuration module 102 , a task management module 103 and a data storage module 104 .
  • Each part of the AI platform 100 may be deployed in any two or three environments of the terminal computing device, the edge environment and the cloud environment, respectively.
  • Terminal computing equipment includes: terminal server, smart phone, notebook computer, tablet computer, personal desktop computer, smart camera, etc.
  • the edge environment is an environment including a set of edge computing devices close to the terminal computing device, and the edge computing devices include: edge servers, edge small stations with computing capabilities, and the like.
  • Various parts of the AI platform 100 deployed in different environments or devices cooperate to provide users with functions such as training AI models.
  • the algorithm management module 101, the training configuration module 102 and the data storage module 104 in the AI platform 100 are deployed in the terminal computing device, and the task management module 103 in the AI platform 100 is deployed in the edge computing device in the edge environment .
  • the user sends the initial AI model to the algorithm management module 101 in the terminal computing device, and the terminal computing device stores the initial AI model in the data storage module 104 .
  • the user selects the training mode through the training configuration module 102 .
  • The task management module 103 in the edge computing device generates at least one training task and executes the at least one training task. It should be understood that this application does not restrictively specify which parts of the AI platform 100 are deployed in which environment; in actual application, deployment can be adapted according to the computing power of the terminal computing device, the resource occupation of the edge environment and the cloud environment, or specific application requirements.
  • FIG. 4 is a schematic diagram of a hardware structure of a computing device 400 in which the AI platform 100 is deployed.
  • the computing device 400 shown in FIG. 4 includes a memory 401 , a processor 402 , a communication interface 403 and a bus 404 .
  • the memory 401 , the processor 402 , and the communication interface 403 are connected to each other through the bus 404 for communication.
  • the memory 401 may be read only memory (ROM), random access memory (RAM), hard disk, flash memory or any combination thereof.
  • The memory 401 can store programs. When the programs stored in the memory 401 are executed by the processor 402, the processor 402 and the communication interface 403 are used to execute the method by which the AI platform 100 trains an AI model for the user.
  • the memory may also store training data sets. For example, a part of the storage resources in the memory 401 is divided into a data storage module 104 for storing data required by the AI platform 100 .
  • The processor 402 may adopt a central processing unit (CPU), an application-specific integrated circuit (ASIC), a GPU, or any combination thereof.
  • Processor 402 may include one or more chips.
  • Processor 402 may include an AI accelerator, such as a neural processing unit (NPU).
  • Communication interface 403 uses a transceiver module, such as a transceiver, to enable communication between computing device 400 and other devices or a communication network. For example, data may be acquired through the communication interface 403 .
  • Bus 404 may include pathways for communicating information between various components of computing device 400 (eg, memory 401, processor 402, communication interface 403).
  • Step 501 the AI platform provides a training configuration interface to the user, wherein the training configuration interface includes multiple training modes for the user to select, and each training mode represents an allocation strategy for computing nodes required to train the initial AI model.
  • the user can open the training configuration interface in the AI platform.
  • the training configuration interface may include multiple training modes for the user to select, and each training mode may represent an allocation strategy for computing nodes required for training the initial AI model.
  • The training configuration interface not only displays multiple training modes, but also displays a selection option corresponding to each training mode, as well as an introduction to each training mode. Through these options, the user selects the training mode for training the AI model.
  • Step 502 the AI platform generates at least one training task according to the user's selection on the training configuration interface.
  • the AI platform may acquire the user's selection on the training configuration interface, and generate at least one training task (task) according to the user's selection on the training configuration interface and the initial AI model.
  • the at least one training task is used to train an initial AI model.
  • Executing the training of the initial AI model may be referred to as executing a training job (job), that is, a training job includes at least one training task.
  • Step 503 the AI platform performs at least one training task to train the initial AI model, obtains the AI model, and the obtained AI model is downloaded or used by the user for a specific application.
  • the AI platform is associated with a computing resource pool, and the computing resource pool includes computing nodes used for model training. At least one training task is performed by the computing node to train the initial AI model to obtain the AI model.
  • The computing node feeds the trained AI model back to the AI platform.
  • the AI platform can provide users with an interface for downloading AI models, through which users can download AI models and use the AI models to perform corresponding tasks.
  • the user can upload the inference data set on the AI platform, and the AI model performs the inference process on the inference data set.
  • the user can select an appropriate training mode for generating at least one training task, so that the distributed training can be performed flexibly, thereby balancing the user's training requirements and resource utilization.
  • the multiple training modes may include a performance mode and a shared mode.
  • the performance mode indicates that the number of training tasks is automatically adjusted based on a certain strategy in the process of training the initial AI model.
  • Shared mode means that different training tasks share the resources of the same computing node.
  • the resources may include GPU resources and/or video memory.
  • different training tasks may belong to the same training job or may belong to different training jobs.
  • the training job A of user A and the training job B of user B are respectively executed in the AI platform, wherein training job A includes training tasks a and b, and training job B includes training tasks c and d.
  • The AI platform can determine the remaining resources of the computing node corresponding to the container running training task a according to the resource usage of that container. If the AI platform determines that the remaining resources are greater than the resource usage of the container of training task c, the AI platform can schedule the container of training task c to the computing node corresponding to the container of training task a.
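The shared-mode placement above can be sketched as a first-fit check over node resources (node capacities, container requests, and the first-fit policy are all illustrative assumptions, not the platform's actual scheduler):

```python
# Sketch of shared-mode placement: schedule a new task's container onto a
# node whose remaining GPU / video-memory resources cover the request.

def remaining(node):
    used_gpu = sum(c["gpu"] for c in node["containers"])
    used_mem = sum(c["mem_gb"] for c in node["containers"])
    return node["gpu"] - used_gpu, node["mem_gb"] - used_mem

def schedule(nodes, container):
    for node in nodes:
        free_gpu, free_mem = remaining(node)
        if free_gpu >= container["gpu"] and free_mem >= container["mem_gb"]:
            node["containers"].append(container)  # share this node
            return node["name"]
    return None  # no node can currently host the container

# Node already runs task a's container; task c fits into the leftover.
nodes = [{"name": "node-1", "gpu": 8, "mem_gb": 64,
          "containers": [{"gpu": 6, "mem_gb": 48}]}]
task_c = {"gpu": 2, "mem_gb": 16}
print(schedule(nodes, task_c))  # node-1
```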
  • At least one training task runs on different containers, and each container contains a complete runtime environment: a training task, all dependencies required to execute the training task, and so on.
  • the runtime here refers to a program's running dependencies.
  • the AI platform may deliver the containers in which the at least one training task runs respectively to the computing nodes in the computing resource pool, and start the delivered containers. At least one training task is performed by the container to train the initial AI model to obtain the AI model.
  • the AI platform can also provide users with status information of the training process, which can include the number of containers performing training tasks and the resource usage of each container.
  • status information of the training process can include the number of containers performing training tasks and the resource usage of each container.
  • The number of containers performing training tasks at each time point is displayed, which can be represented by a curve of container count over time, together with the resource usage of each container. In this way, the number of containers performing training tasks is displayed to the user in real time, and the training performance is intuitively shown.
  • the status information may further include the number of computing nodes that perform the training task, or the resource usage of the computing nodes that perform the training task.
  • the interface for displaying status information may also include the name of the initial AI model (such as AA), the specification of the computing node used in the training mode (such as the performance mode) (such as 8 cores), the training input, the start running time (such as 2020/9/27/10:38) and other information.
  • FIG. 7 provides a schematic diagram of the AI model training process when the user selects only the performance mode:
  • Step 701 the AI platform provides a training configuration interface to the user, wherein the training configuration interface includes multiple training modes for the user to select.
  • the training configuration interface also allows the user to input or select the number of containers that can run the training task.
  • the AI platform can provide the user with a training configuration interface, and the training configuration interface includes multiple training modes for the user to select.
  • the selection interface of the training mode includes a performance mode, a sharing mode, a normal mode, etc., and the user selects the performance mode.
  • When the training mode selected by the user is the performance mode, the training configuration interface also allows the user to input or select the number of containers that can run training tasks.
  • The purpose of specifying the number of containers that can run training tasks is to constrain the number of containers that each training job can use when scaling up or down.
  • the user can input or select the number of containers that can run training tasks in the training configuration interface.
  • the training configuration interface displays the number of containers that can be selected by the user, and the user can select the number of containers that can run the training task from the number of containers.
  • the number of containers that can be selected by the user is 1, 2, 4, and 8, and the user inputs or selects 1, 2, and 4 as the number of containers that can run the training task.
  • the range of the number of containers is displayed in the training configuration interface, and the user can select the number of containers that can run the training task in the range of the number of containers.
  • the range of the number of containers is [1, 8], and the user inputs or selects 1, 2, and 4 as the number of containers that can run training tasks.
  • The maximum number of containers that can run training tasks is the upper bound on the number of containers used to run training tasks, and the minimum number is the corresponding lower bound. Limiting the number of containers limits the range of elastic scaling when executing at least one training task in the performance mode.
  • The number of containers used to run training tasks during expansion and contraction can take values of 2^n, where n is greater than or equal to 0 and less than or equal to a target value.
  • the target value can be 4.
  • the training configuration interface also displays the source of the dataset, and the user can select the training dataset and version.
  • the training configuration interface also displays the resource usage of the container.
  • the training configuration interface also displays a billing method for prompting the user.
  • the training configuration interface also displays the source of the initial AI model, which is used to display the selected initial AI model.
  • options of a public resource pool and an exclusive resource pool are also displayed corresponding to the computing resource pool.
  • the computing nodes in the common resource pool can be used by multiple training jobs.
  • the computing nodes in the exclusive resource pool are only used for the user's training work.
  • Each computing node in the exclusive resource pool can execute multiple training tasks, achieving resource sharing among multiple training tasks and improving resource utilization.
  • the charging can be performed according to the foregoing method.
  • the billing is based on the number of computing nodes used and the duration of use.
  • Step 702 the AI platform generates at least one training task according to the training mode selected by the user on the training configuration interface and the number of containers that can run the training task input or selected by the user.
  • When the AI platform obtains that the training mode selected by the user on the training configuration interface is the performance mode, the AI platform can obtain the resource usage of each container in the performance mode. When the training mode selected by the user includes only the performance mode, the resource usage of the container running the training task is a preset value.
  • For example, the resource usage of a container may be all GPU resources and all video memory on a single computing node, or two GPUs and the corresponding video memory of a single computing node, and so on.
  • the AI platform can generate at least one training task based on the idle computing nodes in the current computing resource pool, the number of containers that can run training tasks entered or selected by the user, the resource usage of the containers, and the initial AI model.
  • For example, if the resource usage of a container is all GPU resources and all video memory on a single computing node, and the maximum number of containers that can run training tasks is 8, the AI platform can generate 8 training tasks; each training task runs in one container, and each container occupies one computing node.
  • When the AI platform generates training tasks for the first time, the AI platform obtains the maximum number of containers that can run training tasks and the resource usage of each container. If the AI platform determines, according to the resource usage of each container, that the current idle resources in the computing resource pool can run the maximum number of containers, the maximum number of containers is created and the maximum number of training tasks is generated; each training task runs in one container, and different training tasks run in different containers. If the AI platform determines, according to the resource usage of each container, that the current idle resources in the computing resource pool cannot run the maximum number of containers, it determines the number of containers that can be run and creates that number of containers. Since this number is less than the maximum, multiple training tasks may run in one container.
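This first-time generation step can be sketched as follows. The fallback to the largest feasible power-of-two container count and the round-robin task placement are illustrative assumptions consistent with the 2^n constraint described earlier, not the platform's exact algorithm:

```python
# Sketch of step 702's first-time generation: try the maximum container
# count; if idle resources cannot host it, fall back through the allowed
# power-of-two counts (..., 4, 2, 1). All resource figures are made up.

def plan_containers(max_containers, gpus_per_container, idle_gpus):
    count = max_containers
    while count > 1 and count * gpus_per_container > idle_gpus:
        count //= 2  # shrink within the allowed 2^n counts
    return count

def generate_tasks(num_tasks, num_containers):
    # Distribute training tasks over containers (several per container
    # when fewer containers than tasks are available).
    placement = {c: [] for c in range(num_containers)}
    for t in range(num_tasks):
        placement[t % num_containers].append(f"task-{t}")
    return placement

count = plan_containers(max_containers=8, gpus_per_container=8, idle_gpus=32)
print(count)                     # only 4 full-node containers fit
print(generate_tasks(8, count))  # 8 tasks spread over 4 containers
```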
  • Step 703 the AI platform performs at least one training task to train the initial AI model to obtain the AI model.
  • the AI platform can deliver the container to the computing node of the computing resource pool, and the computing node runs the container to implement at least one training task to train the initial AI model and obtain the AI model.
  • the AI platform determines 8 training tasks, determines 8 containers for running different training tasks, and each container runs on 8 different computing nodes respectively.
  • the initial AI model is trained using 8 computing nodes.
  • the AI platform in the process of training the AI model, can dynamically adjust the number of containers, and the processing can be as follows:
  • The idle amount of computing resources in the computing resource pool is obtained; according to the idle amount of computing resources in the computing resource pool, the number of the at least one training task and the number of containers running the adjusted training tasks are adjusted; and the adjusted training tasks are run in the adjusted containers to train the initial AI model.
  • When the AI platform performs at least one training task to train the initial AI model, the AI platform can periodically determine whether the ratio of the idle computing resources in the computing resource pool to all computing resources in the pool is higher than a target value. If it is higher than the target value, the AI platform can further obtain the operation information of each training job in the computing resource pool; the operation information includes the running time and the current running phase, where the running phase may include a training data set loading phase and a training phase.
  • the AI platform can determine the ratio of the remaining running time to the running time of each training job in the computing resource pool, and determine the speedup ratio of each training job.
  • The speedup ratio can be reflected by the ratio of the maximum number of containers for the training job to the number of containers currently used.
  • the training work with an acceleration ratio of 1 indicates that the number of containers is already the maximum, and the number of containers will not be adjusted.
  • The AI platform can determine, for each training job, a weighted value combining the ratio of the elapsed running time to the remaining running time and the speedup ratio.
  • The AI platform sorts the training jobs by weighted value in descending order.
  • Based on the idle amount of computing resources in the computing resource pool, the AI platform determines the number of containers that each training job can run.
  • Training jobs for which scale-up is feasible are taken as objects of container adjustment.
  • A training job includes one or more training tasks.
  • If the AI platform takes the at least one training task mentioned in step 701 as the container adjustment object, that is, if it is determined that the at least one training task satisfies the scaling condition, the AI platform can take the at least one training task as the container adjustment object.
  • The maximum number of containers for a training task is used as the adjusted number of containers.
  • The AI platform can adjust the number of training tasks to match the adjusted number of containers. The AI platform then delivers the newly added containers to the computing nodes, and the newly added containers run the training tasks adjusted out of the existing containers.
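The scale-up selection described above can be sketched as follows. The weighted-value formula and the job fields are assumptions reconstructed from the description (elapsed/remaining ratio weighted by the speedup ratio, sorted in descending order), not the platform's exact implementation:

```python
from dataclasses import dataclass

@dataclass
class TrainingJob:
    name: str
    elapsed: float        # running time so far
    remaining: float      # estimated remaining running time
    containers: int       # containers currently in use
    max_containers: int   # upper bound of containers for this job

def pick_scale_up_candidates(jobs, idle_containers):
    """Rank jobs by a weighted value combining the elapsed/remaining
    time ratio and the speedup ratio (max containers / current
    containers), descending, then expand jobs in that order while
    idle containers remain."""
    candidates = []
    for job in jobs:
        speedup = job.max_containers / job.containers
        if speedup == 1:                  # already at the maximum
            continue
        weight = (job.elapsed / job.remaining) * speedup
        candidates.append((weight, job))
    candidates.sort(key=lambda pair: pair[0], reverse=True)

    chosen = []
    for _, job in candidates:
        extra = job.max_containers - job.containers
        if extra <= idle_containers:      # enough idle resources to expand
            chosen.append(job)
            idle_containers -= extra
    return chosen
```

A job already at its maximum container count (speedup ratio 1) is skipped, matching the rule that its container count is not adjusted.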
  • For example, suppose the allowed number of containers for running the at least one training task is 1, 2, or 4.
  • Suppose the at least one training task is a single training task A.
  • Training task A includes four training processes (training process 1, training process 2, training process 3, and training process 4) and uses one container.
  • Training processes 1 to 4 run in one container, the container occupies exactly the resources of one computing node, and one computing node is currently occupied.
  • On scale-up, training task A can be divided into four training tasks (training task i, training task j, training task k, and training task o).
  • The four training tasks include training process 1, training process 2, training process 3, and training process 4, respectively.
  • The four training tasks run in four containers, and each container is located on a computing node. After the adjustment, this is equivalent to using 4 containers that occupy 4 computing nodes.
  • The AI platform can determine whether there is a new training job. When there is a new training job, it determines whether the computing resources of the computing nodes in the computing resource pool are sufficient to perform it. If the training job can be performed, the AI platform simply delivers the containers running the training tasks of the training job to the computing nodes. If the training job cannot be performed, the AI platform obtains the running information of each training job in the computing resource pool.
  • The running information includes information such as the running time, and the running phase may include a training data set loading phase and a training phase.
  • The AI platform can determine, for each training job in the computing resource pool, the ratio of the elapsed running time to the remaining running time, and determine the speedup ratio of each training job.
  • Here the speedup ratio can be expressed as the ratio of the number of containers currently used by the training job to the minimum number of containers for the job.
  • A training job with a speedup ratio of 1 is already using its minimum number of containers, so its container count is not adjusted.
  • The AI platform can determine, for each training job, a weighted value combining the ratio of the remaining running time to the elapsed running time and the speedup ratio.
  • The AI platform sorts the training jobs by weighted value in ascending order.
  • Based on the idle amount of computing resources in the computing resource pool, the AI platform determines the number of containers that can be run for each training job.
  • Training jobs for which scale-down is feasible are taken as objects of container adjustment.
  • If the AI platform takes the at least one training task mentioned in step 701 as the container adjustment object, it is determined that the at least one training task satisfies the scaling condition.
  • The AI platform can lower the number of containers of the at least one training task by one level to obtain the adjusted number of containers.
  • The AI platform can adjust the number of training tasks to match the adjusted number of containers. The AI platform then deletes the removed containers, and the training tasks that were on those containers are adjusted to run on the other containers of the at least one training task.
  • For example, suppose the allowed number of containers for running the at least one training task is 1, 2, or 4.
  • Suppose the at least one training task consists of 4 training tasks (training task 1 includes training process 1, training task 2 includes training process 2, training task 3 includes training process 3, and training task 4 includes training process 4).
  • The 4 training tasks use 4 containers, each training task runs in one container, and each container occupies exactly the resources of one computing node, so 4 computing nodes are currently occupied. When the 4 training tasks are scaled down, every two training processes can be merged into one training task (training process 1 and training process 3 belong to training task a, and training process 2 and training process 4 belong to training task b), each of which runs in one container, and each container is located on a computing node. After the adjustment, two containers are used, occupying two computing nodes.
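The regrouping in the scale-up and scale-down examples above amounts to a round-robin assignment of training processes to containers. A minimal sketch (the helper name is hypothetical, not part of the platform):

```python
def redistribute(processes, num_containers):
    """Assign training processes to containers round-robin; each
    container's list of processes becomes one training task that
    runs in that container."""
    containers = [[] for _ in range(num_containers)]
    for i, process in enumerate(processes):
        containers[i % num_containers].append(process)
    return containers
```

With 4 processes and 4 containers, each container holds one process; with 2 containers, processes 1 and 3 share one container and processes 2 and 4 the other, matching the scale-down example above.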
  • The ultimate goal of the performance mode is to minimize the overall expected running time of the at least one training task.
  • To ensure that the training accuracy does not decrease after scaling (that is, scale-up or scale-down), the processing can be as follows:
  • When scale-down is performed, the number of containers is reduced. Some of the at least one training task were running on the removed containers; those training tasks are added to a target container that is already running a training task of the at least one training task. Since the target container is already running a training task, after the added tasks arrive the target container runs multiple training tasks, and these are run serially in the target container. The AI platform uses the average of the model parameters obtained by serially running the multiple training tasks as the updated value of the model parameters. Because the target container runs the multiple training tasks serially, this is equivalent to the multiple training tasks being executed in a distributed manner, the same as before scaling down, so the training accuracy of the AI model is not reduced.
  • The above processing can be called batch approximation: it simulates running the tasks on N distributed containers, which is equivalent to imitating distributed training on an integer multiple of containers when scaling down, ensuring that the accuracy does not decrease.
  • training process 1 and training process 3 belong to a training task a after adjustment
  • training process 2 and training process 4 belong to a training task b after adjustment
  • training task a runs on container a
  • Training task b runs on container b
  • container a runs training process 1 and training process 3 serially
  • container b runs training process 2 and training process 4 serially.
  • For example, each of 16 containers trains the AI model with 64 data samples, and the model parameters obtained by each of the 16 containers are averaged to obtain the AI model.
  • After scaling down, the 16 sets of data are used to train the AI model serially, and the resulting model parameters are finally averaged to obtain the final AI model, so the training accuracy of the AI model is not reduced.
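The accuracy-preserving argument can be illustrated with a toy one-parameter example (the quadratic loss, learning rate, and function names are illustrative assumptions, not the platform's actual training code): stepping each batch from the same starting parameter and averaging gives the same result whether the per-batch steps run in parallel on several containers or serially in one container.

```python
def sgd_step(param, batch, lr=0.1):
    # one SGD step on the squared-error loss 0.5 * (param - x)**2,
    # with the gradient averaged over the batch (illustrative only)
    g = sum(param - x for x in batch) / len(batch)
    return param - lr * g

def averaged_update(param, batches, lr=0.1):
    """Step each batch from the SAME starting parameter and average
    the results. Running the per-batch steps serially in one container
    (after scale-down) yields exactly the same average as running them
    in parallel on len(batches) containers (before scale-down)."""
    return sum(sgd_step(param, b, lr) for b in batches) / len(batches)
```

Because every per-batch step starts from the same parameter value, the order and placement of the steps cannot change the averaged result.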
  • To ensure that the training accuracy does not decrease after scaling, the AI platform can also perform adaptive parameter adjustment: using historical training experience, offline-tested parameter sets, and the like, the AI platform adaptively adjusts the corresponding hyperparameters while scaling up or down so that the training accuracy remains unchanged.
  • Step 1101: The AI platform provides a training configuration interface to the user, where the training configuration interface includes multiple training modes for the user to select from.
  • the training configuration interface also allows the user to input or select the resource usage of the container running the training task.
  • the AI platform can provide the user with a training configuration interface, and the training configuration interface includes multiple training modes for the user to select.
  • The training configuration interface displays selectable resource usages, from which the user can select, or into which the user can input, the resource usage of the containers running the training tasks.
  • Alternatively, the training configuration interface displays a resource usage range, and the user can select or input, within that range, the resource usage of the containers running the training tasks. For example, if the resource usage range is 0.1 GPU to 1 GPU, the user can choose 0.5 GPU.
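A minimal sketch of how such a fractional-GPU request might be checked against the displayed range (the function name and default bounds are assumptions for illustration, based on the 0.1 GPU to 1 GPU example above):

```python
def validate_gpu_request(requested, min_frac=0.1, max_frac=1.0):
    """Accept a fractional GPU request only if it lies within the
    range shown on the training configuration interface."""
    if not (min_frac <= requested <= max_frac):
        raise ValueError(
            f"GPU request {requested} outside allowed range "
            f"[{min_frac}, {max_frac}]")
    return requested
```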
  • Step 1102: The AI platform generates at least one training task according to the training mode selected by the user on the training configuration interface and the resource usage, input or selected by the user, of the containers running the training tasks.
  • The AI platform can learn that the training mode selected by the user on the training configuration interface is the shared mode, and can obtain the resource usage of each container in the shared mode. When the training mode selected by the user includes only the shared mode, the resource usage of the containers running the training tasks is a preset value.
  • the AI platform can generate at least one training task based on the idle computing nodes in the current computing resource pool, the preset number of containers, the resource usage of the containers, and the initial AI model.
  • the preset number of containers here may be the number of available containers specified by the AI platform for the at least one training task, or may be the number of containers for the at least one training task specified by the user.
  • The resource usage of a container running a training task includes a GPU resource usage smaller than a single GPU and/or a video memory usage smaller than a single GPU's video memory.
  • In this way, the computing resources on the computing nodes can be divided more finely, and the resource utilization rate can be higher.
  • Step 1103: The AI platform performs the at least one training task to train the initial AI model to obtain the AI model.
  • step 1103 may be:
  • According to the resource usage of each container running the at least one training task in the shared mode and the remaining resources of each computing node in the computing resource pool, the computing node on which each training task's container runs is determined; the containers of the at least one training task are then started on the determined computing nodes to train the initial AI model.
  • The AI platform can count the remaining resources of each computing node in the computing resource pool and obtain the resource usage of each container running the at least one training task. If a computing node that already has some resources occupied has remaining resources greater than a container's resource usage, the AI platform can deliver that container to the computing node. If the remaining resources of all computing nodes with partially occupied resources are less than the container's resource usage, the AI platform can deliver the container running the training task to a computing node whose resources are unoccupied. In this way, the AI platform delivers the containers to the computing nodes, then starts the containers on the computing nodes, and the initial AI model can be trained.
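The placement rule just described can be sketched as a first-fit pass that prefers partially occupied nodes and falls back to free ones (the node bookkeeping, names, and units are illustrative assumptions):

```python
def place_container(nodes, demand):
    """First-fit placement that prefers partially occupied nodes.
    `nodes` maps node name -> (capacity, used), in the same unit as
    `demand` (e.g. fractional GPUs). Returns the chosen node name and
    updates its bookkeeping, or returns None if nothing fits."""
    free, partial = [], []
    for name, (capacity, used) in nodes.items():
        if capacity - used < demand:
            continue                      # not enough remaining resources
        (partial if used > 0 else free).append(name)
    for name in partial + free:           # partially used nodes first
        capacity, used = nodes[name]
        nodes[name] = (capacity, used + demand)
        return name
    return None
```

Filling partially occupied nodes first is what reduces resource fragmentation and raises overall utilization.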
  • step 1103 may be:
  • The AI platform may use the resource usage of the containers running the at least one training task in the second mode to determine the remaining resources of the computing nodes corresponding to those containers.
  • When the AI platform performs other training tasks, if the remaining resources of a computing node used by the at least one training task are sufficient to execute one or more other training tasks, the one or more other training tasks can be run with the remaining resources of that computing node, achieving resource sharing on the same computing node.
  • In this way, computing nodes that already have some resources occupied are used as much as possible, which can reduce resource fragmentation and improve the overall utilization of resources.
  • FIG. 12 provides a schematic diagram of the training process of the AI model when the user selects both the performance mode and the shared mode:
  • Step 1201: The AI platform provides a training configuration interface to the user, where the training configuration interface includes multiple training modes for the user to select from.
  • the training configuration interface also allows the user to input or select the number of containers that can run the training task and the resource usage of the containers running the training task.
  • step 1201 is the processing procedure of the combination of step 701 and step 1101, and reference may be made to the description of step 701 and step 1101, which will not be repeated here.
  • Step 1202: The AI platform generates at least one training task according to the training mode selected by the user on the training configuration interface, the number of containers that can run the training tasks, input or selected by the user, and the resource usage, input or selected by the user, of the containers running the training tasks.
  • The AI platform can learn that the training mode selected by the user on the training configuration interface includes the shared mode, and can obtain the resource usage of each container in the shared mode. When the training mode selected by the user includes only the shared mode, the resource usage of the containers running the training tasks is a preset value.
  • the AI platform can generate at least one training task based on the idle computing nodes in the current computing resource pool, the number of containers that can run training tasks entered or selected by the user, the resource usage of the containers, and the initial AI model.
  • The number of containers determined by the AI platform is among the container counts, input or selected by the user, that can run the training tasks.
  • The resource usage of a container running a training task includes a GPU resource usage smaller than a single GPU and/or a video memory usage smaller than a single GPU's video memory.
  • In this way, the computing resources on the computing nodes can be divided more finely, and the resource utilization rate can be higher.
  • Step 1203: The AI platform performs the at least one training task to train the initial AI model to obtain the AI model.
  • In step 1203, the dynamic scaling of the process in FIG. 7 and the resource sharing of the process in FIG. 11 may be combined.
  • In this way, computing nodes that already have some resources occupied can be used as much as possible, which can reduce resource fragmentation, improve the overall utilization of resources, and reduce a single user's cost of training AI models.
  • In the performance mode, by dynamically adjusting the number of containers, training can be accelerated as much as possible and the efficiency of training AI models can be improved.
  • The training data set is stored in OBS.
  • The computing node may use solid state storage (SSS).
  • Each computing node that subsequently performs the training job can read data directly from this storage space. For example, a newly added container created during scale-up can read data directly from the storage space, reducing the time required to re-download the training data set from OBS.
  • The container is generated by the AI platform by pulling the image and the initial AI model.
  • The training tasks running in different containers do not interfere with each other.
  • The AI platform provides a variety of training modes for the user to choose from, and the user can select an appropriate training mode so that distributed training can be performed flexibly, thereby balancing the user's training needs and resource utilization.
  • FIG. 1 is a structural diagram of an AI model training apparatus provided by an embodiment of the present application.
  • the apparatus is applied to an AI platform.
  • the AI platform is associated with a computing resource pool, and the computing resource pool includes computing nodes for model training.
  • the apparatus can be implemented by software, hardware or a combination of the two to become a part or all of the apparatus.
  • the apparatus provided in the embodiment of the present application can implement the processes described in FIG. 7 , FIG. 11 , and FIG. 12 in the embodiment of the present application.
  • the apparatus includes: a training configuration module 102 , a task management module 103 and a presentation module 105 , wherein:
  • The training configuration module 102 is configured to provide a training configuration interface to the user, where the training configuration interface includes multiple training modes for the user to select from, and each training mode represents an allocation strategy for the computing nodes required to train the initial AI model. The module can specifically be used to realize the training configuration function of step 701 and execute the implicit steps included in step 701.
  • the task management module 103 is used for:
  • the multiple training modes include a first mode and/or a second mode
  • the first mode indicates that the number of training tasks is automatically adjusted in the process of training the initial AI model
  • the second mode indicates that different training tasks share the resources of the same computing node.
  • the at least one training task runs on the container, and the apparatus further includes:
  • The presentation module 105 is configured to provide the user with state information of the training process during training of the initial AI model, where the state information includes at least one of the following: the number of containers performing training tasks, the resource usage of each container, the number of computing nodes performing training tasks, and the resource usage of the computing nodes performing training tasks.
  • the multiple training modes include a first mode and a second mode
  • the task management module 103 is configured to:
  • At least one training task is generated according to the first mode and the second mode selected by the user in the training configuration interface.
  • the training configuration interface is further for the user to input or select the number of containers that can run the training task;
  • the task management module 103 is used for:
  • At least one training task is generated according to the training mode selected by the user on the training configuration interface and the number of containers that can run the training tasks, input or selected by the user.
  • the training configuration interface is further for the user to input or select the resource usage of the containers running the training tasks;
  • the task management module 103 is used for:
  • At least one training task is generated according to the training mode selected by the user in the training configuration interface and the resource usage of the container input or selected by the user to run the training task.
  • the resource usage of the container running the training task includes a GPU resource usage smaller than a single GPU and/or a video memory usage smaller than a single video memory.
  • When the first mode is selected, the task management module 103 is configured to:
  • the adjusted training task is run in the adjusted container to train the initial AI model.
  • the task management module 103 is used for:
  • Some of the at least one training task are added to a target container that is already running a training task of the at least one training task, and multiple training tasks are run serially in the target container.
  • The average of the model parameters obtained by serially running the multiple training tasks is used as the updated value of the model parameters.
  • the task management module 103 is further configured to:
  • One or more other training tasks are run using the remaining resources of the computing node corresponding to each container.
  • The division of modules in the embodiments of this application is schematic and is only a logical function division; in actual implementation, there may be other division methods.
  • The functional modules in the various embodiments of this application may be integrated into one processing module, each module may also exist physically alone, or two or more modules may be integrated into one module.
  • The above integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • the present application also provides a computing device 400 as shown in FIG. 4 .
  • The processor 402 in the computing device 400 reads the programs and data sets stored in the memory 401 to execute the method performed by the aforementioned AI platform.
  • Since the modules in the AI platform 100 provided by this application can be distributed on multiple computers in the same environment or in different environments, this application also provides a computing device as shown in FIG. 13. The computing device includes a plurality of computers 1300, and each computer 1300 includes a memory 1301, a processor 1302, a communication interface 1303, and a bus 1304.
  • the memory 1301 , the processor 1302 , and the communication interface 1303 are connected to each other through the bus 1304 for communication.
  • the memory 1301 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory.
  • The memory 1301 can store programs; when the programs stored in the memory 1301 are executed by the processor 1302, the processor 1302 and the communication interface 1303 are used to execute part of the method by which the AI platform trains the AI model.
  • the memory may also store training data sets. For example, a part of the storage resources in the memory 1301 is divided into a data set storage module for storing the training data sets required by the AI platform.
  • the processor 1302 can be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit, a graphics processor, or one or more integrated circuits.
  • The communication interface 1303 uses a transceiver module, such as but not limited to a transceiver, to implement communication between the computer 1300 and other devices or a communication network.
  • the training data set can be obtained through the communication interface 1303 .
  • The bus 1304 may include a pathway for communicating information between the components of the computer 1300 (e.g., the memory 1301, the processor 1302, and the communication interface 1303).
  • a communication path is established between each of the above computers 1300 through a communication network.
  • Each computer 1300 runs any one or more of the algorithm management module 101 , the training configuration module 102 , the task management module 103 , the data storage module 104 and the presentation module 105 .
  • Any computer 1300 may be a computer (eg, a server) in a cloud data center, or a computer in an edge data center, or a terminal computing device.
  • All or part of the above embodiments may be implemented by software, hardware, or a combination thereof.
  • When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
  • The computer program product that provides the AI platform includes one or more computer instructions.
  • When these computer program instructions are loaded and executed on a computer, the processes or functions described in FIG. 7, FIG. 11, or FIG. 12 of the embodiments of this application are generated in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or twisted pair) or wirelessly (e.g., infrared, radio, or microwave).
  • The computer-readable storage medium can be any medium accessible by a computer, or a data storage device, such as a server or data center, that integrates one or more media.
  • The medium can be a magnetic medium (e.g., floppy disk, hard disk, or magnetic tape), an optical medium (e.g., optical disc), or a semiconductor medium (e.g., solid-state drive).


Abstract

This application provides a training method and apparatus for an AI model, a computing device, and a storage medium, and belongs to the technical field of artificial intelligence. The method is applied to an AI platform; the AI platform is associated with a computing resource pool, and the computing resource pool includes computing nodes used for model training. The method includes: providing a training configuration interface to a user, where the training configuration interface includes multiple training modes for the user to select from, and each training mode represents an allocation strategy for the computing nodes required to train an initial AI model; generating at least one training task according to the user's selection on the training configuration interface; and performing the at least one training task to train the initial AI model to obtain an AI model, the obtained AI model being available for the user to download or use. With this application, distributed training can be performed more flexibly.

Description

Training method and apparatus for an AI model, computing device, and storage medium
This application claims priority to the Chinese patent application No. 202010926721.0, entitled "Elastic training method and system", filed with the China National Intellectual Property Administration on September 7, 2020, and to the Chinese patent application No. 202011053283.8, entitled "Training method and apparatus for AI model, computing device and storage medium", filed with the China National Intellectual Property Administration on September 29, 2020, both of which are incorporated herein by reference in their entireties.
Technical Field
This application relates to the technical field of artificial intelligence (AI), and in particular to a training method and apparatus for an AI model, a computing device, and a storage medium.
Background
With the development of artificial intelligence technology, AI models represented by deep learning are widely used in various fields, such as image classification, object detection, and natural language processing. Training an initial AI model is a key process. Training refers to the process of feeding the data of a training data set into the initial AI model, having the initial AI model perform computation, updating the parameters of the initial AI model according to the computation results, and finally obtaining an AI model with a certain capability (for example, image classification, object detection, or natural language recognition).
Because the training process is complex and consumes enormous computing resources, performing distributed training of the initial AI model on multiple computing nodes has become an effective way to meet training-efficiency requirements. However, how to perform distributed training more flexibly, so as to balance the user's training needs against resource utilization, is an urgent problem to be solved.
Summary
This application provides a training method and apparatus for an AI model, a computing device, and a storage medium, so as to perform distributed training more flexibly.
In a first aspect, this application provides a training method for an AI model. The method is applied to an AI platform, the AI platform is associated with a computing resource pool, and the computing resource pool includes computing nodes used for model training. The method includes: providing a training configuration interface to a user, where the training configuration interface includes multiple training modes for the user to select from, and each training mode represents an allocation strategy for the computing nodes required to train an initial AI model; generating at least one training task according to the user's selection on the training configuration interface; and performing the at least one training task to train the initial AI model to obtain an AI model, the obtained AI model being available for the user to download or use.
In the solution shown in this application, the AI platform provides the user with the ability to select a training mode. The user can select a suitable training mode for generating the at least one training task, instead of using conventional distributed training, so distributed training can be performed flexibly, thereby balancing the user's training needs and resource utilization.
In a possible implementation, the multiple training modes include a first mode and/or a second mode. The first mode indicates that the number of training tasks is automatically adjusted during training of the initial AI model, and the second mode indicates that different training tasks share the resources of the same computing node. The multiple training modes may further include a third mode, which is a conventional mode indicating that distributed training is performed using preset or preselected computing nodes.
In the solution shown in this application, the first mode may also be called the performance mode or turbo mode, and the second mode may also be called the shared mode or economic mode. The first mode indicates that the number of training tasks of a training job is automatically adjusted during training of the initial AI model, and the second mode indicates that different training tasks share the resources of the same computing node. Here, the different training tasks may belong to the same training job or to different training jobs. In this way, when at least the first mode is used, the number of training tasks can be adjusted dynamically to speed up training; when at least the second mode is used, training resources can be shared with other training jobs to improve resource utilization.
In a possible implementation, the at least one training task runs on containers, and the method further includes: during training of the initial AI model, providing the user with state information of the training process, where the state information includes at least one of the following: the number of containers performing training tasks, the resource usage of each container, the number of computing nodes performing training tasks, and the resource usage of the computing nodes performing training tasks.
In the solution shown in this application, the at least one training task runs on containers, and each container contains a complete runtime environment: one training task and all the dependencies required to perform it. During training of the initial AI model, the AI platform can also provide the user with state information of the training process, presenting the training process to the user more intuitively.
In a possible implementation, the multiple training modes include a first mode and a second mode, and generating at least one training task according to the user's selection on the training configuration interface includes: generating at least one training task according to the first mode and the second mode selected by the user on the training configuration interface.
In the solution shown in this application, the multiple training modes include the first mode and the second mode, and the AI platform can generate the at least one training task according to the first mode and the second mode selected by the user on the training configuration interface. When the first mode and the second mode are used together, training is accelerated because the first mode can dynamically adjust the number of training tasks, and resource utilization is improved because the second mode allows the resources of a computing node to be shared with other training jobs.
In a possible implementation, when the user selects the first mode on the training configuration interface, the training configuration interface further allows the user to input or select the number of containers that can run the training tasks; and generating at least one training task according to the user's selection on the training configuration interface includes: generating at least one training task according to the training mode selected by the user on the training configuration interface and the number of containers that can run the training tasks, input or selected by the user.
In the solution shown in this application, when the user selects the first mode on the training configuration interface, the training configuration interface may further allow the user to input or select the number of containers that can run the training tasks. The AI platform can generate the at least one training task according to the training mode selected by the user on the training configuration interface and that number of containers. Because the number of containers that can run the training tasks can be chosen by the user, training becomes more intelligent.
In a possible implementation, when the user selects the second mode on the training configuration interface, the training configuration interface further allows the user to input or select the resource usage of the containers running the training tasks; and generating at least one training task according to the user's selection on the training configuration interface includes: generating at least one training task according to the training mode selected by the user on the training configuration interface and the resource usage of the containers running the training tasks, input or selected by the user.
In the solution shown in this application, when the user selects the second mode on the training configuration interface, the training configuration interface may further allow the user to input or select the resource usage of the containers running the training tasks. The AI platform can generate the at least one training task according to the training mode selected by the user on the training configuration interface and that resource usage. Because the resource usage of the containers running the training tasks can be chosen by the user, training becomes more intelligent.
In a possible implementation, the resource usage of a container running a training task includes a GPU resource usage smaller than a single graphics processing unit (GPU) and/or a video memory usage smaller than a single GPU's video memory. Because the resource usage of a single container is relatively small, resource utilization can be higher.
In a possible implementation, when the first mode is selected, performing the at least one training task to train the initial AI model includes: during performance of the at least one training task to train the initial AI model, when it is detected that an elastic scaling condition is satisfied, obtaining the idle amount of computing resources in the computing resource pool; adjusting the number of the at least one training task and the number of containers used to run the training tasks according to the idle amount of computing resources in the computing resource pool; and running the adjusted training tasks in the adjusted containers to train the initial AI model.
In the solution shown in this application, when the first mode is selected, while performing the at least one training task to train the initial AI model, the AI platform can detect whether the at least one training task satisfies the elastic scaling condition. When the condition is satisfied, the AI platform can obtain the idle amount of computing resources in the computing resource pool, use that idle amount to adjust the number of the at least one training task and the number of containers running the training tasks, and then run the adjusted training tasks in the adjusted containers to train the initial AI model. Because the training can be elastically scaled up and down, the training speed can be increased.
In a possible implementation, adjusting the number of the at least one training task and the number of containers used to run the training tasks, and running the adjusted training tasks in the adjusted containers to train the initial AI model, includes: adding some of the at least one training task to a target container that is already running a training task of the at least one training task, running multiple training tasks serially in the target container, and, during training, using the average of the model parameters obtained by serially running the multiple training tasks as the updated value of the model parameters.
In the solution shown in this application, when scaling down, the number of containers is reduced. The training tasks that were running on the removed containers are added to a target container that is already running a training task of the at least one training task. Since the target container is already running a training task, after some training tasks are added to it, the target container runs multiple training tasks, which it runs serially. The average of the model parameters obtained by serially running the multiple training tasks is used as the updated value of the model parameters. Because the target container runs the multiple training tasks serially, this is equivalent to the multiple training tasks being executed in a distributed manner, the same as before scaling down, so the training accuracy of the AI model is not reduced.
In a possible implementation, when the second mode is selected, the method includes: determining the remaining resources of the computing node corresponding to each container according to the resource usage of the containers running the at least one training task in the second mode; and running one or more other training tasks using the remaining resources of the computing node corresponding to each container.
In the solution shown in this application, when the second mode is selected, the AI platform can further subtract the used resources from the total resources of the computing node corresponding to each container, according to the resource usage of the containers running the at least one training task in the second mode, to obtain the remaining resources of the computing node corresponding to each container. The AI platform can run one or more other training tasks using those remaining resources. In this way, the remaining resources on each computing node are put to use, improving resource utilization.
In a second aspect, this application provides a training apparatus for an AI model. The apparatus is applied to an AI platform, the AI platform is associated with a computing resource pool, and the computing resource pool includes computing nodes used for model training. The apparatus includes: a training configuration module, configured to provide a training configuration interface to a user, where the training configuration interface includes multiple training modes for the user to select from, and each training mode represents an allocation strategy for the computing nodes required to train an initial AI model; and a task management module, configured to: generate at least one training task according to the user's selection on the training configuration interface; and perform the at least one training task to train the initial AI model to obtain an AI model, the obtained AI model being available for the user to download or use. In this way, the AI platform provides the user with the ability to select a training mode; the user can select a suitable training mode for generating the at least one training task, instead of using conventional distributed training, so distributed training can be performed flexibly, thereby balancing the user's training needs and resource utilization.
In a possible implementation, the multiple training modes include a first mode and/or a second mode, where the first mode indicates that the number of training tasks is automatically adjusted during training of the initial AI model, and the second mode indicates that different training tasks share the resources of the same computing node.
In a possible implementation, the at least one training task runs on containers, and the apparatus further includes:
a presentation module, configured to provide the user with state information of the training process during training of the initial AI model, where the state information includes at least one of the following: the number of containers performing training tasks, the resource usage of each container, the number of computing nodes performing training tasks, and the resource usage of the computing nodes performing training tasks.
In a possible implementation, the multiple training modes include a first mode and a second mode, and the task management module is configured to:
generate at least one training task according to the first mode and the second mode selected by the user on the training configuration interface.
In a possible implementation, when the user selects the first mode on the training configuration interface, the training configuration interface further allows the user to input or select the number of containers that can run the training tasks;
the task management module is configured to:
generate at least one training task according to the training mode selected by the user on the training configuration interface and the number of containers that can run the training tasks, input or selected by the user.
In a possible implementation, when the user selects the second mode on the training configuration interface, the training configuration interface further allows the user to input or select the resource usage of the containers running the training tasks;
the task management module is configured to:
generate at least one training task according to the training mode selected by the user on the training configuration interface and the resource usage of the containers running the training tasks, input or selected by the user.
In a possible implementation, the resource usage of a container running a training task includes a GPU resource usage smaller than a single graphics processing unit (GPU) and/or a video memory usage smaller than a single GPU's video memory.
In a possible implementation, when the first mode is selected, the task management module is configured to:
during performance of the at least one training task to train the initial AI model, when it is detected that an elastic scaling condition is satisfied, obtain the idle amount of computing resources in the computing resource pool;
adjust the number of the at least one training task and the number of containers used to run the training tasks according to the idle amount of computing resources in the computing resource pool; and
run the adjusted training tasks in the adjusted containers to train the initial AI model.
In a possible implementation, the task management module is configured to:
add some of the at least one training task to a target container that is already running a training task of the at least one training task, run multiple training tasks serially in the target container, and, during training, use the average of the model parameters obtained by serially running the multiple training tasks as the updated value of the model parameters.
In a possible implementation, when the second mode is selected, the task management module is further configured to:
determine the remaining resources of the computing node corresponding to each container according to the resource usage of the containers running the at least one training task in the second mode; and
run one or more other training tasks using the remaining resources of the computing node corresponding to each container.
In a third aspect, a computing device is provided. The computing device includes a processor and a memory, where the memory stores computer instructions, and the processor executes the computer instructions to implement the method of the first aspect and its possible implementations.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer instructions which, when executed by a computing device, cause the computing device to perform the method of the first aspect and its possible implementations, or cause the computing device to implement the functions of the apparatus of the second aspect and its possible implementations.
In a fifth aspect, a computer program product containing instructions is provided which, when run on a computing device, causes the computing device to perform the method of the first aspect and its possible implementations, or causes the computing device to implement the functions of the apparatus of the second aspect and its possible implementations.
附图说明
图1为本申请一个示例性实施例提供的AI平台100的结构示意图;
图2为本申请一个示例性实施例提供的AI平台100的应用场景示意图;
图3为本申请一个示例性实施提供的AI平台100的部署示意图;
图4为本申请一个示例性实施提供的部署AI平台100的计算设备400的结构示意图;
图5为本申请一个示例性实施提供的AI模型的训练方法的流程示意图;
图6为本申请一个示例性实施提供的训练过程的状态信息的示意图;
图7为本申请一个示例性实施提供的AI模型的训练方法的流程示意图;
图8为本申请一个示例性实施提供的训练配置界面的示意图;
图9为本申请一个示例性实施提供的扩容的示意图;
图10为本申请一个示例性实施提供的缩容的示意图;
图11为本申请一个示例性实施提供的AI模型的训练方法的流程示意图;
图12为本申请一个示例性实施提供的AI模型的训练方法的流程示意图;
图13为本申请一个示例性实施提供的计算设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
目前,人工智能热潮不断,机器学习是一种实现AI的核心手段,机器学习已渗透至医学、交通、教育、金融等各个行业。不仅仅是专业技术人员,各行业的非AI技术专业人员也期盼用AI、机器学习完成特定任务。
为了便于理解本申请提供的技术方案和实施例,下面对AI模型、AI模型的训练、分布式训练、AI平台等概念进行详细说明:
AI模型,是一类用机器学习思想解决实际问题的数学算法模型,AI模型中包括大量的参数和计算公式(或计算规则),AI模型中的参数是可以通过训练数据集对AI模型进行训练获得的数值,例如,AI模型的参数是AI模型中的计算公式或计算因子的权重。AI模型还包含一些超(hyper)参数,超参数是无法通过训练数据集对AI模型进行训练获得的参数,超参数可用于指导AI模型的构建或者AI模型的训练,超参数有多种。例如,AI模型训练的迭代(iteration)次数、学习率(learning rate)、批尺寸(batch size)、AI模型的层数、每层神经元的个数。换而言之,AI模型的超参数与参数的区别在于:AI模型的超参数的值无法通过对训练数据集进行分析获得,而AI模型的参数的值可根据在训练过程中对训练数据集进行分析进行修改和确定。
AI模型多种多样,使用较为广泛的一类AI模型为神经网络模型,神经网络模型是一类模仿生物神经网络(动物的中枢神经系统)的结构和功能的数学算法模型。一个神经网络模型可以包括多种不同功能的神经网络层,每层包括参数和计算公式。根据计算公式的不同或功能的不同,神经网络模型中不同的层有不同的名称。例如,进行卷积计算的层称为卷积层,卷积层常用于对输入信号(如图像)进行特征提取。一个神经网络模型也可以由多个已有的神经网络模型组合构成。不同结构的神经网络模型可用于不同的场景(如分类、识别等)或在用于同一场景时提供不同的效果。神经网络模型结构不同具体包括以下一项或多项:神经网络模型中网络层的层数不同、各个网络层的顺序不同、每个网络层中的权重、参数或计算公式不同。业界已存在多种不同的用于识别或分类等应用场景的具有较高准确率的神经网络模型,其中,一些神经网络模型可以被特定的训练数据集进行训练后单独用于完成一项任务或与其他神经网络模型(或其他功能模块)组合完成一项任务。
一般的AI模型在被用于完成一项任务前都需要被训练。
训练AI模型,是指利用已有的数据通过一定方法使AI模型拟合已有数据的规律,确定AI模型中的参数。训练一个AI模型需要准备一个训练数据集,根据训练数据集中的训练数据是否有标注(即:数据是否对应有特定的标签信息,例如,类型、名称、数据中包含的标注框),可以将AI模型的训练分为监督训练(supervised training)和无监督训练(unsupervised training)。对AI模型进行监督训练时,用于训练的训练数据集中的训练数据带有标注(label)。训练AI模型时,将训练数据集中的训练数据作为AI模型的输入,由AI模型对输入的训练数据进行计算,获得AI模型输出值,将训练数据对应的标注作为AI模型的输出值的参考,利用损失函数(loss function)计算AI模型输出值与训练数据对应的标注的损失(loss)值,根据损失值调整AI模型中的参数。用训练数据集中的每个训练数据迭代地对AI模型进行训练,AI模型的参数不断调整,直到AI模型可以根据输入的训练数据准确度较高地输出与训练数据对应的标注相同或相似的输出值。对AI模型进行无监督训练,则用于训练的数据集中的训练数据没有标注,训练数据集中的训练数据依次输入至AI模型,由AI模型逐步识别训练数据集中的训练数据之间的关联和潜在规则,直到AI模 型可以用于判断或识别输入的数据的类型或特征。例如,聚类,用于聚类的AI模型接收到大量的训练数据后,可学习到各个训练数据的特征以及训练数据之间的关联和区别,将训练数据自动地分为多个类型。不同的任务类型可采用不同的AI模型,一些AI模型仅可以用监督学习的方式训练,一些AI模型仅可以用无监督学习的方式训练,还有一些AI模型既可以用监督学习的方式训练又可以用无监督学习的方式训练。经过训练完成的AI模型可以用于完成一项特定的任务。通常而言,机器学习中的AI模型都需要采用有监督学习的方式进行训练,有监督学习的方式对AI模型进行训练可使AI模型在带有标注的训练数据集中更有针对性地学习到训练数据集中训练数据与对应标注的关联,使训练完成的AI模型用于预测其他输入推理数据时准确率较高。
损失函数,是用于衡量AI模型被训练的程度(也就是用于计算AI模型预测的结果与真实目标之间的差异)的函数。在训练AI模型的过程中,因为希望AI模型的输出尽可能的接近真正想要预测的值,所以可以通过比较当前AI模型根据输入数据的预测值和真正想要的目标值(即输入数据的标注),再根据两者之间的差异情况来更新AI模型中的参数(当然,在第一次更新之前通常会有初始化的过程,即为AI模型中的参数预先配置初始值)。每次训练都通过损失函数判断一下当前的AI模型预测的值与真实目标值之间的差异,更新AI模型的参数,直到AI模型能够预测出真正想要的目标值或与真正想要的目标值非常接近的值,则认为AI模型被训练完成。
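上述"通过损失函数计算预测值与标注的差异,再据此更新参数"的训练过程,可以用下面一段示意性的Python代码概括(仅为便于理解的极简示例:一维线性模型加均方误差损失,并非本申请AI平台的实际实现):

```python
# 示意性代码:用一维线性模型 y = w * x 演示"损失驱动参数更新"的监督训练过程。
# w 为可训练参数,先预先配置初始值(对应文中的初始化过程),再迭代更新。
def train(data, lr=0.1, epochs=50):
    w = 0.0  # 参数初始化
    for _ in range(epochs):
        for x, label in data:
            pred = w * x                          # 模型根据输入的训练数据计算输出值
            loss_grad = 2 * (pred - label) * x    # MSE 损失对 w 的梯度
            w -= lr * loss_grad                   # 根据损失调整模型中的参数
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # 训练数据的真实规律为 y = 2x
w = train(data)
print(round(w, 3))  # 迭代训练后 w 接近 2.0
```

迭代若干轮后,参数收敛到使损失最小的取值,即认为训练完成,这与上文描述的训练终止条件一致。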
分布式训练,分布式训练是AI模型训练过程中常用的加速手段之一。分布式训练指:将训练拆分在多个独立的计算节点当中进行独立计算,再将结果进行周期性的汇总和重新分发,由此加速AI模型的训练过程。当前主流的分布式计算拓扑结构包括ps-worker和all-reduce等。分布式训练可以包括数据并行的分布式训练和模型并行的分布式训练。
数据并行的分布式训练,将训练数据集中的训练数据分布到多个计算节点上同时进行计算,在每个计算节点上执行对AI模型的训练,并将每个计算节点上产生的模型参数的梯度聚合后,再更新模型参数,具体的,将训练数据集切分到K个计算节点上时有两种选择,1、K个计算节点中每个计算节点上的批大小与使用单个计算节点进行计算时的批大小相同,批大小指每次调整参数前在训练数据集所选取的训练数据的数目。2、每个计算节点上的批大小是使用单个计算节点进行计算时的批大小除以K,这样聚合后的全局批大小保持不变。在本申请实施例的后续描述中,以数据并行的分布式训练为例描述AI模型的训练方法。
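上述数据并行中"各计算节点的梯度聚合后再更新模型参数"的做法,其与单节点使用全局批计算梯度的等价性可以用下面的示意性Python代码验证(极简示例,假设各节点批大小相同,并非实际实现):

```python
# 示意性代码:数据并行时把训练数据切分到 K 个"节点"上分别求梯度再求平均,
# 其结果与单节点直接在全局批上求平均梯度相同(损失为各样本损失的平均)。
def grad(w, batch):
    # 线性模型 y = w*x 的 MSE 平均梯度
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# 单节点:在全局批上计算梯度
g_single = grad(w, data)

# 两个"节点":各自在本地批上计算梯度,再聚合(求平均)
shard1, shard2 = data[:2], data[2:]
g_agg = (grad(w, shard1) + grad(w, shard2)) / 2

assert abs(g_single - g_agg) < 1e-12  # 聚合后的全局批大小保持不变时两者等价
```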
模型并行的分布式训练,是将模型切分到多个计算节点上,而数据不需要被切分,对于大规模的深度学习或者机器学习模型,内存或者显存的消耗非常大,所以可以将模型切分。模型并行的分布式训练也存在多种切分方式。例如,对于神经网络模型这种分层模型,可以按照层切分,即每一层或者多层放在一个计算节点上。
AI平台,是一种为AI开发者和用户提供便捷的AI开发环境以及便利的开发工具的平台。AI平台中内置有各种解决不同问题的预训练AI模型或者AI子模型,AI平台可以根据用户的需求搜索并且建立适用的AI模型,用户只需在AI平台中确定自己的需求,且按照提示准备好训练数据集上传至AI平台,AI平台就能为用户训练出一个可用于实现用户需要的AI模型。或者,用户按照提示准备好自己的算法(也称为初始AI模型)和训练数 据集,上传至AI平台,AI平台基于用户自己的算法和训练数据集,可以训练出一个可用于实现用户需要的AI模型。用户可利用训练完成的AI模型完成自己的特定任务。应理解,本申请中在被AI平台训练前的AI模型(例如,用户上传的算法、AI平台预置的算法或者预训练模型),称为初始AI模型。
为了更灵活地执行分布式训练,以及平衡用户的训练需求和资源利用率的需求,本申请实施例提供了一种AI平台,该AI平台中引入多种训练模式,每种训练模式用于表示对初始AI模型所需的计算节点的分配策略。
需要说明的是,上文中提到的AI模型是一种泛指,AI模型包括深度学习模型、机器学习模型等。
图1为本申请实施例中的AI平台100的结构示意图,应理解,图1仅是示例性地展示了AI平台100的一种结构化示意图,本申请并不限定对AI平台100中的模块的划分。如图1所示,AI平台100包括算法管理模块101、训练配置模块102、任务管理模块103和数据存储模块104。AI平台与计算资源池相关联,计算资源池中包括多个用于模型训练的计算节点,AI平台可以调度计算资源池中的计算节点,用于模型训练。
下面简要地描述AI平台100中的各个模块的功能:
算法管理模块101:提供初始AI模型管理界面,用于用户上传基于自己的训练目标创建的初始AI模型;或者,用户在初始AI模型库中,获取已有的初始AI模型。或者,算法管理模块101还可以用于根据用户输入的任务目标,获取AI平台上预置的初始AI模型。用户基于自己的训练目标创建的初始AI模型可以基于AI平台提供的框架进行编写。初始AI模型可以包括未进行训练的AI模型、进行训练但是未完全训练完成的AI模型。未进行训练的AI模型指构建的AI模型还未使用训练数据集进行训练,构建的AI模型中的参数均是预设的数值。
训练配置模块102:为用户提供了训练配置界面。用户可以在训练配置界面中选取训练模式,训练模式可以包括常规模式、第一模式和第二模式。第一模式也可以称为是turbo模式、性能模式,第二模式也可以称为是经济模式、共享模式,在后文中描述时,将第一模式称为是性能模式,将第二模式称为是共享模式。常规模式是现有的分布式训练的模式。
其中,性能模式:指在AI模型的训练过程中,动态的调整初始AI模型使用的资源。
共享模式:指在AI模型的训练过程中,不同AI模型的训练可以共享同一个计算节点的资源,或者同一AI模型的不同训练任务共享同一个计算节点的资源。
常规模式:指在AI模型的训练过程中,每个AI模型的训练占用一个或多个计算节点的全部资源,且不会动态调整。
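上述三种训练模式及其对应的配置项,可以用如下示意性的数据结构表示(字段名与取值均为便于理解的示例假设,并非本申请的实际接口):

```python
# 示意性数据结构:每种训练模式对应一种计算节点分配策略及其配置项。
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TrainingConfig:
    mode: str                                     # "normal"(常规)/ "turbo"(性能)/ "shared"(共享)
    container_counts: Optional[List[int]] = None  # 性能模式:可运行训练任务的容器个数(弹性扩缩容范围)
    gpu_per_container: Optional[float] = None     # 共享模式:单个容器的GPU资源使用量,可小于1个GPU

turbo_cfg = TrainingConfig(mode="turbo", container_counts=[1, 2, 4])
shared_cfg = TrainingConfig(mode="shared", gpu_per_container=0.5)
assert shared_cfg.gpu_per_container < 1  # 共享模式下多个容器可共享同一GPU
```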
可选的,在用户选取训练模式为共享模式的情况下,用户还可以在训练配置界面中选取运行训练任务的容器的资源使用量。
可选的,在用户选取训练模式为性能模式的情况下,用户还可以在训练配置界面中选取可运行训练任务的容器个数。
可选的,用户还可以在训练配置界面选取初始AI模型、配置输入输出对象存储服务(object storage service,OBS)路径。
可选的,用户还可以在训练配置界面中选取用于训练初始AI模型的计算节点的规格,如要求训练初始AI模型的计算节点的图形处理器(graphics processing unit,GPU)大小、显存量。
可选的,用户还可以在训练配置界面中输入用于训练初始AI模型的训练数据集。该训练数据集中的数据可以是已标注的数据,也可以是未标注的数据。具体的,可以是输入训练数据集的访问地址。
可选的,用户还可以在训练配置界面中输入对完成任务目标的AI模型的效果期望和训练期望完成时间。例如,输入或选择最终获得的AI模型用于人脸识别的准确率要高于99%,期望在24小时内完成训练。
训练配置模块102与算法管理模块101可以通信,用于从算法管理模块101获取初始AI模型的访问地址。训练配置模块102还用于基于初始AI模型的访问地址、用户在训练配置界面输入或者选择的一些内容打包训练作业。
训练配置模块102还可以与任务管理模块103通信,将训练作业(training job)提交给任务管理模块103。
任务管理模块103:管理训练AI模型过程的核心模块,任务管理模块103与算法管理模块101、训练配置模块102和数据存储模块104均可以通信。具体处理为:
任务管理模块103基于训练配置模块102提供的训练作业中的训练模式、容器个数、容器资源使用量、初始AI模型的访问地址等信息,拉取相应的训练镜像和初始AI模型,生成运行至少一个训练任务的容器。将至少一个训练任务的容器下发给计算资源池的计算节点上进行运行。
可选的,任务管理模块103还用于监控至少一个训练任务是否满足扩缩容条件,在满足扩缩容条件的情况下,动态调整至少一个训练任务以及至少一个训练任务的容器。
可选的,任务管理模块103还用于配置各个容器的共享资源。例如,将容器1和容器2调度至计算资源池的一个计算节点上。
数据存储模块104(如可以是云服务提供商提供的OBS对应的数据存储资源):用于存储用户上传的训练数据集、用户上传的初始AI模型、其他用户上传的初始AI模型以及训练模式的一些配置项等。
可选的,AI平台还包括展示模块105(图1中未示出),展示模块105与任务管理模块103通信,获取训练过程的状态信息、训练完成的AI模型等,将该状态信息、该AI模型提供给用户。
需要说明的是,本申请中的AI平台可以是一个可以与用户交互的系统,这个系统可以是软件系统也可以是硬件系统,也可以是软硬结合的系统,本申请中不进行限定。
通过上述各模块的功能,本申请实施例提供的AI平台可向用户提供灵活的分布式训练的业务,使得AI平台可以平衡用户的训练需求和资源利用率的需求。
图2为本申请实施例提供的一种AI平台100的应用场景示意图,如图2所示,在一种实施例中,AI平台100可全部部署在云环境中。云环境是云计算模式下利用基础资源向用户提供云服务的实体。云环境包括云数据中心和云服务平台,云数据中心包括云服务提供 商拥有的大量基础资源(包括计算资源池、存储资源和网络资源),云数据中心包括的计算资源池可以是大量的计算节点(例如服务器)。AI平台100可以独立地部署在云数据中心中的服务器或虚拟机上,AI平台100也可以分布式地部署在云数据中心中的多台服务器上、或者分布式地部署在云数据中心中的多台虚拟机上、再或者分布式地部署在云数据中心中的服务器和虚拟机上。如图2所示,AI平台100由云服务提供商在云服务平台抽象成一种AI云服务提供给用户,用户在云服务平台购买该云服务后(可预充值再根据最终资源的使用情况进行结算),云环境利用部署在云数据中心的AI平台100向用户提供AI平台云服务。在使用AI平台云服务时,用户可以通过应用程序接口(application program interface,API)或者图形用户界面(graphical user interface,GUI)确定要AI模型完成的任务、上传训练数据集至云环境等,云环境中的AI平台100接收用户的任务信息、训练数据集,执行数据预处理、AI模型训练。AI平台通过API或者GUI向用户返回AI模型的训练过程的状态信息等内容。训练完成的AI模型可被用户下载或者在线使用,用于完成特定的任务。
在本申请的另一种实施例中,云环境下的AI平台抽象成一种AI云服务向用户提供时,在用户选择共享模式的情况下,用户可以购买固定资源使用量的容器的使用时长,在资源使用量固定的情况下,使用时长越长,需要的费用越高,反之越低。在该使用时长内,AI平台训练AI模型。或者,在用户选择共享模式的情况下,用户可以预充值,在训练完成后再根据最终使用的GPU的数量和使用时长进行结算。
在用户选择性能模式的情况下,用户可以预充值,在训练完成后再根据最终使用的GPU的数量和使用时长进行结算。
在本申请的另一种实施例中,云环境下的AI平台100抽象成AI云服务向用户提供时,可分为两部分,即:基础AI云服务和AI弹性训练云服务。用户在云服务平台可先仅购买基础AI云服务,在需要使用AI弹性训练云服务时再进行购买,购买后由云服务提供商提供AI弹性训练云服务API,最终按照调用API的次数对AI弹性训练云服务进行额外计费。
本申请提供的AI平台100的部署较为灵活,如图3所示,在另一种实施例中,本申请提供的AI平台100还可以分布式地部署在不同的环境中。本申请提供的AI平台100可以在逻辑上分成多个部分,每个部分具有不同的功能。例如,在一种实施例中AI平台100包括算法管理模块101、训练配置模块102、任务管理模块103和数据存储模块104。AI平台100中的各部分可以分别部署在终端计算设备、边缘环境和云环境中的任意两个或三个环境中。终端计算设备包括:终端服务器、智能手机、笔记本电脑、平板电脑、个人台式电脑、智能摄像机等。边缘环境为包括距离终端计算设备较近的边缘计算设备集合的环境,边缘计算设备包括:边缘服务器、拥有计算能力的边缘小站等。部署在不同环境或设备的AI平台100的各个部分协同实现为用户提供训练AI模型等功能。例如,在一种场景中,终端计算设备中部署AI平台100中的算法管理模块101、训练配置模块102和数据存储模块104,边缘环境的边缘计算设备中部署AI平台100中的任务管理模块103。用户将初始AI模型发送至终端计算设备中的算法管理模块101,终端计算设备将初始AI模型存储至数据存储模块104。用户通过训练配置模块102选取训练模式。边缘计算设备中任务管理模块103生成至少一个训练任务,执行该至少一个训练任务。应理解,本申请不对AI平台100的各部分具体部署在什么环境进行限制性的划分,实际应用时可根据终端计算设备的计算能力、边缘环境和云环境的资源占有情况或具体应用需求进行适应性的部署。
AI平台100也可以单独部署在任意环境中的一个计算设备上(如单独部署在边缘环境的一个边缘服务器上)。图4为部署有AI平台100的计算设备400的硬件结构示意图,图4所示的计算设备400包括存储器401、处理器402、通信接口403以及总线404。其中,存储器401、处理器402、通信接口403通过总线404实现彼此之间的通信连接。
存储器401可以是只读存储器(read only memory,ROM),随机存取存储器(random access memory,RAM),硬盘,快闪存储器或其任意组合。存储器401可以存储程序,当存储器401中存储的程序被处理器402执行时,处理器402和通信接口403用于执行AI平台100为用户训练AI模型。存储器还可以存储训练数据集。例如,存储器401中的一部分存储资源被划分成一个数据存储模块104,用于存储AI平台100所需的数据。
处理器402可以采用中央处理器(central processing unit,CPU),应用专用集成电路(application specific integrated circuit,ASIC),GPU或其任意组合。处理器402可以包括一个或多个芯片。处理器402可以包括AI加速器,例如神经网络处理器(neural processing unit,NPU)。
通信接口403使用例如收发器一类的收发模块,来实现计算设备400与其他设备或通信网络之间的通信。例如,可以通过通信接口403获取数据。
总线404可包括在计算设备400各个部件(例如,存储器401、处理器402、通信接口403)之间传送信息的通路。
下面结合图5描述在一种实施例中AI模型的训练方法的具体流程,以该方法由AI平台执行为例进行说明:
步骤501,AI平台向用户提供训练配置界面,其中,训练配置界面包括供用户选择的多种训练模式,每种训练模式表示对训练初始AI模型所需的计算节点的一种分配策略。
在本实施例中,用户想要使用AI平台训练AI模型,可以在AI平台中,打开训练配置界面。该训练配置界面中可以包括供用户选择的多种训练模式,每种训练模式可以表示对训练初始AI模型所需的计算节点的一种分配策略。具体的,训练配置界面不仅显示有多种训练模式,还对应每种训练模式显示有选择选项,以及针对每种训练模式的介绍,用户可以通过各种训练模式的选择选项以及针对每种训练模式的介绍,选择训练AI模型的训练模式。
步骤502,AI平台根据用户在训练配置界面的选择,生成至少一个训练任务。
在本实施例中,AI平台可以获取到用户在训练配置界面的选择,根据用户在训练配置界面的选择,以及初始AI模型,生成至少一个训练任务(task)。该至少一个训练任务用于训练初始AI模型。执行训练初始AI模型可以称为是执行一个训练工作(job),也就是说一个训练工作包括至少一个训练任务。
步骤503,AI平台执行至少一个训练任务以对初始AI模型进行训练,获得AI模型,获得的AI模型供用户下载或使用,以用于特定应用。
在本实施例中,AI平台与计算资源池相关联,计算资源池中包括用于模型训练的计算节点。由计算节点执行至少一个训练任务,以对初始AI模型进行训练,获得AI模型。计算节点将AI模型反馈给AI平台。AI平台可以为用户提供下载AI模型的界面,用户可以通过该界面,下载AI模型,使用该AI模型执行相应的任务。或者用户可以在AI平台上上传推理数据集,由该AI模型执行对推理数据集的推理过程。
这样,通过本申请实施例,用户可以选择合适的训练模式,用于生成至少一个训练任务,使得分布式训练可以灵活的执行,进而可以平衡用户的训练需求和资源利用率。
以下针对图5的流程进行补充说明:
在一种可能的实现方式中,多种训练模式可以包括性能模式和共享模式。性能模式表示对初始AI模型进行训练的过程中基于一定的策略自动调整训练任务的个数。共享模式表示不同训练任务共享同一计算节点的资源。该资源可以包括GPU资源和/或显存。此处不同训练任务可以属于同一个训练工作,也可以属于不同训练工作。例如,AI平台中分别执行用户A的训练作业A和用户B的训练作业B,其中,训练作业A包括训练任务a、b,训练作业B包括训练任务c、d,用户A在训练配置界面选择了训练模式为共享模式,则AI平台可以根据训练作业A的训练任务a的容器的资源使用量,确定运行训练任务a的容器对应的计算节点的剩余资源,AI平台确定该剩余资源大于运行训练任务c的容器的资源使用量,AI平台可以将训练任务c的容器调度到运行训练任务a的容器对应的计算节点。
在一种可能的实现方式中,至少一个训练任务分别运行在不同的容器上,每个容器包含了完整的运行时环境:一个训练任务,执行这个训练任务所需的全部依赖等。此处运行时是指一个程序在运行的依赖。具体的,在步骤503中,AI平台可以将至少一个训练任务分别运行的容器下发到计算资源池中的计算节点,并启动下发的容器。由容器执行至少一个训练任务,以对初始AI模型进行训练,获得AI模型。
在对初始AI模型进行训练的过程中,AI平台还可以向用户提供训练过程的状态信息,该状态信息可以包括执行训练任务的容器个数、每个容器的资源使用量。如图6显示,在对初始AI模型进行训练的过程中,显示了各个时间点执行训练任务的容器个数,可以使用时间和容器个数的曲线表示。并且显示了每个容器的资源使用量。这样,实时的为用户展现执行训练任务的容器个数,直观的展现了训练的性能。
可选的,状态信息中还可以包括执行训练任务的计算节点的个数,或者执行训练任务的计算节点的资源使用量。
可选的,在显示状态信息的界面中还可以包括初始AI模型的名称(如AA)、训练模式(如性能模式)使用的计算节点的规格(如8核)、训练输入、开始运行时间(如2020/9/27/10:38)等信息。
如图7所示,提供了用户仅选择性能模式的情况下,AI模型的训练流程示意图:
步骤701,AI平台向用户提供训练配置界面,其中,训练配置界面包括供用户选择的多种训练模式。用户在训练配置界面中选择性能模式时,训练配置界面还供用户输入或选择可运行训练任务的容器个数。
在本实施例中,AI平台可以向用户提供训练配置界面,训练配置界面包括供用户选择的多种训练模式。例如,如图8所示,训练模式的选择界面包括性能模式和共享模式、常规模式等,用户选择了性能模式。
在用户选择的训练模式为性能模式的情况下,该训练配置界面还提供了供用户输入或选择可运行训练任务的容器个数。可运行训练任务的容器个数是为了约束在扩缩容时,每个训练工作所能使用的容器的个数。
用户可以在训练配置界面中输入或者选择可运行训练任务的容器个数。具体的,训练配置界面中显示可供用户选择的容器个数,用户可以在该容器个数中,选择出可运行训练任务的容器个数。例如,可供用户选择的容器个数为1、2、4、8,用户输入或者选取1、2、4作为可运行训练任务的容器个数。或者,训练配置界面中显示容器个数范围,用户可以在该容器个数范围中,选择可运行训练任务的容器个数。例如,容器个数范围为[1,8],用户输入或者选取1、2、4作为可运行训练任务的容器个数。
此处需要说明的是,可运行训练任务的容器个数的最大个数是用于运行训练任务的容器的最大数目,可运行训练任务的容器个数的最小个数是用于运行训练任务的容器的最小数目。限制容器个数是为了在性能模式下执行至少一个训练任务时限定弹性扩缩容的范围。
可选的,为了更方便扩缩容处理,扩缩容时运行训练任务的容器个数的取值可以是2^n,且n大于或等于0,且小于或等于目标数值,例如,目标数值可以是4。
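按2的幂取容器个数的约束,可以用如下示意性代码表示(目标数值取4仅为沿用上文的示例):

```python
# 示意性代码:性能模式下扩缩容时可取的容器个数为 2**n(0 <= n <= 目标数值)。
def allowed_container_counts(max_n=4):
    return [2 ** n for n in range(max_n + 1)]

print(allowed_container_counts())   # [1, 2, 4, 8, 16]
print(allowed_container_counts(2))  # [1, 2, 4]
```

限定为2的幂可以让扩容、缩容时训练任务在容器间成倍合并或拆分,便于保持分布式训练的对称性。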
可选的,如图8所示,训练配置界面还显示有数据集来源,用户可以选择训练数据集以及版本。
可选的,如图8所示,训练配置界面还显示有容器的资源使用量。
可选的,如图8所示,训练配置界面还显示有计费方式,用于提示用户。
可选的,如图8所示,训练配置界面还显示有初始AI模型来源,用于显示已经选择的初始AI模型。
可选的,如图8所示,在训练配置界面中,还对应计算资源池显示有公共资源池和专属资源池的选项。在用户选择公共资源池的情况下,公共资源池中的计算节点可以供多个训练工作使用。在用户选择专属资源池的情况下,专属资源池中的计算节点仅供用户的训练工作使用,专属资源池中每个计算节点执行多个训练任务,实现多个训练任务的资源共享,提升资源利用率。
在用户选择公共资源池的情况下,可以按照前文中的方式进行计费。在用户选择专属资源池的情况下,是按照使用的计算节点的数目以及使用时长进行计费。
步骤702,AI平台根据用户在训练配置界面选择的训练模式和用户输入或选择的可运行训练任务的容器个数,生成至少一个训练任务。
在本实施例中,AI平台获取到用户在训练配置界面选择的训练模式为性能模式,AI平台可以获取在性能模式下,每个容器的资源使用量,在用户选取的训练模式仅包括性能模式的情况下,运行训练任务的容器的资源使用量是预设数值,如容器的资源使用量是单个计算节点上的所有GPU资源以及所有显存使用量、容器的资源使用量是单个计算节点上的两个GPU资源以及两个显存使用量等。AI平台可以基于当前计算资源池中空闲的计算节点、用户输入或选择的可运行训练任务的容器个数、容器的资源使用量以及初始AI模型,生成至少一个训练任务。例如,容器的资源使用量是单个计算节点上的所有GPU资源以及所有显存使用量,当前资源池中空闲16个计算节点,可运行训练任务的容器个数最大为8,AI平台可以生成8个训练任务,每个训练任务运行在一个容器上,每个容器占用一个计算节点。
此处需要说明的是,在AI平台第一次生成训练任务时,AI平台获取可运行训练任务的容器个数的最大值,以及每个容器的资源使用量。AI平台生成该最大值个训练任务。若AI平台根据每个容器的资源使用量,确定计算资源池中当前的空闲资源可以供最大值个容器运行,则创建该最大值个容器。每个训练任务运行于一个容器上,且不同的训练任务运行于不同的容器上。若AI平台根据每个容器的资源使用量,确定计算资源池中当前的空闲资源不可以供最大值个容器运行,则确定所能运行的容器的数目,创建该数目个训练任务。由于该数目小于最大值,所以多个训练任务运行于一个容器上。
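上述"首次生成训练任务时根据空闲资源确定容器个数"的逻辑,可以用如下示意性Python代码概括(函数名与参数均为示例假设,并非本申请的实际实现):

```python
# 示意性代码:首次生成训练任务时确定容器个数——
# 空闲资源足够则取用户配置的最大容器个数,否则取当前空闲资源所能运行的容器数
# (此时多个训练任务运行于同一个容器上)。
def initial_container_count(max_containers, free_gpus, gpus_per_container):
    affordable = int(free_gpus // gpus_per_container)  # 空闲资源可供运行的容器数
    return min(max_containers, max(affordable, 1))

assert initial_container_count(8, 16, 1) == 8  # 资源充足:按最大值创建容器
assert initial_container_count(8, 4, 1) == 4   # 资源不足:只能创建4个容器
```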
步骤703,AI平台执行至少一个训练任务以对初始AI模型进行训练,获得AI模型。
在本实施例中,AI平台可以将容器下发到计算资源池的计算节点上,由计算节点运行容器,实现执行至少一个训练任务,以对初始AI模型进行训练,获得AI模型。例如,在步骤702中,AI平台确定出8个训练任务,确定出8个容器用于运行不同的训练任务,每个容器分别运行于8个不同的计算节点上。利用8个计算节点实现对初始AI模型进行训练。
在选择性能模式的情况下,在对AI模型进行训练的过程中,AI平台可以动态调整容器个数,处理可以为:
在执行至少一个训练任务以对初始AI模型进行训练的过程中,当检测到满足弹性扩缩容的条件时,获取计算资源池中计算资源的空闲量;根据计算资源池中计算资源的空闲量,调整至少一个训练任务的个数以及调整运行调整后的训练任务的容器的个数;在调整后的容器中运行调整后的训练任务以对初始AI模型进行训练。
在本实施例中,AI平台在执行至少一个训练任务以对初始AI模型进行训练的过程中,AI平台可以周期性确定计算资源池的计算资源的空闲量占计算资源池中所有计算资源的比例是否高于目标数值,在高于目标数值的情况下,可以进一步获取计算资源池中各训练工作的运行信息,该运行信息包括运行时间、运行阶段等信息,运行阶段可以包括训练数据集加载阶段和训练阶段。AI平台可以确定计算资源池中各训练工作的剩余运行时间与已运行时间的比值,并且确定各训练工作的加速比,对于一个训练工作,加速比可以使用该训练工作最大容器个数与当前使用的容器个数的比值体现,加速比为1的训练工作说明已经是最大的容器个数,不进行调整容器个数处理。
AI平台可以确定各训练工作的剩余运行时间与已运行时间的比值、及加速比的加权值。AI平台按照加权值从大到小的顺序进行排序,AI平台根据计算资源池中计算资源的空闲量,以及各训练工作可运行的容器个数,在顺序排列的训练工作中,确定出这些空闲量所能实现扩容的训练工作,作为容器调整对象。在前文中有说明,训练工作包括训练任务,AI平台将在步骤701中提到的至少一个训练任务作为容器调整对象时,即确定该至少一个训练任务满足扩缩容条件,AI平台可以将至少一个训练任务的最大容器个数作为调整后的容器个数。AI平台可以将训练任务的个数,调整为与调整后的容器个数相匹配。然后AI平台将新增的容器下发至计算节点,该新增的容器中运行已有的容器中调整出的训练任务。
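上述按加权值排序选取扩容对象的过程,可以用如下示意性Python代码概括(权重取值、字段名均为示例假设,并非本申请的实际实现):

```python
# 示意性代码:扩容候选排序——按"剩余运行时间/已运行时间"与加速比的加权值从大到小排序,
# 优先为加权值大的训练工作扩容;加速比为1(已达最大容器个数)的训练工作不参与扩容。
def expand_priority(jobs, w1=0.5, w2=0.5):
    def score(j):
        ratio = j["remaining_time"] / max(j["elapsed_time"], 1e-9)
        speedup = j["max_containers"] / j["containers"]  # 加速比
        return w1 * ratio + w2 * speedup
    candidates = [j for j in jobs if j["max_containers"] > j["containers"]]
    return sorted(candidates, key=score, reverse=True)

jobs = [
    {"name": "A", "remaining_time": 60, "elapsed_time": 30, "containers": 1, "max_containers": 4},
    {"name": "B", "remaining_time": 10, "elapsed_time": 90, "containers": 2, "max_containers": 4},
    {"name": "C", "remaining_time": 50, "elapsed_time": 50, "containers": 4, "max_containers": 4},
]
assert [j["name"] for j in expand_priority(jobs)] == ["A", "B"]  # C 已达上限被排除
```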
例如,可运行至少一个训练任务的容器个数为1、2、4。如图9所示,至少一个训练任务为1个训练任务A,该训练任务A包括4个训练进程(训练进程1、训练进程2、训练进程3和训练进程4),使用了1个容器。训练进程1、训练进程2、训练进程3和训练进程4运行于一个容器上,且该容器恰好占用一个计算节点的资源,当前占用了1个计算节点。若当前计算资源池中仅存在该1个训练任务A,该1个训练任务A可以被拆分为4个训练任务(训练任务i、训练任务j、训练任务k和训练任务o),此时每个训练任务分别包括训练进程1、训练进程2、训练进程3和训练进程4,4个训练任务分别运行于4个容器上,且每个容器位于一个计算节点,调整后相当于使用了4个容器,4个容器占用了4个计算节点。
AI平台可以判断是否有新的训练工作,在确定存在新的训练工作时,判断计算资源池中的计算节点的计算资源是否能够执行该训练工作,在能够执行该训练工作的情况下,直接下发该训练工作的训练任务所运行的容器至计算节点即可。在不能执行该训练工作的情况下,AI平台获取计算资源池中各训练工作的运行信息,该运行信息包括运行时间、运行阶段等信息,运行阶段可以包括训练数据集加载阶段和训练阶段。AI平台可以确定计算资源池中各训练工作的剩余运行时间与已运行时间的比值,并且确定各训练工作的加速比,对于一个训练工作,加速比可以使用该训练工作当前使用的容器个数与最小容器个数的比值体现,加速比为1的训练工作说明已经是最小的容器个数,不进行调整容器个数处理。AI平台可以确定各训练工作的剩余运行时间与已运行时间的比值、及加速比的加权值。AI平台按照加权值从小到大的顺序进行排序,AI平台根据计算资源池中计算资源的空闲量,以及各训练工作可运行的容器个数,在顺序排列的训练工作中,确定出这些空闲量所能实现缩容的训练工作,作为容器调整对象。
AI平台在步骤701中提到的至少一个训练任务作为容器调整对象时,即确定该至少一个训练任务满足扩缩容条件,为了保证AI平台上的训练任务可以快速的被执行完毕,AI平台可以将该至少一个训练任务的容器个数下调一个等级,作为调整后的容器个数。AI平台可以将训练任务的个数,调整为与调整后的容器个数相匹配。然后AI平台删除容器,将该容器上的训练任务调整至至少一个训练任务的其他容器上运行。
例如,可运行至少一个训练任务的容器个数为1、2、4。如图10所示,至少一个训练任务为4个训练任务(训练任务1包括训练进程1、训练任务2包括训练进程2、训练任务3包括训练进程3和训练任务4包括训练进程4),4个训练任务使用了4个容器,每个训练任务运行于一个容器上,且每个容器恰好占用一个计算节点的资源,当前占用了4个节点,对该4个训练任务进行缩容处理,每两个训练任务(训练进程1和训练进程3属于一个训练任务a、以及训练进程2和训练进程4属于一个训练任务b)可以分别运行于1个容器上,且每个容器位于一个计算节点,调整后相当于使用了两个容器,两个容器占用了两个计算节点。
此处需要说明的是,不管是扩容,还是缩容,性能模式的最终想要达到的目标是使得至少一个训练任务的整体预期运行时间最小。
这样,在计算资源池中的空闲资源比较多时,对还在运行的至少一个训练任务进行扩容,加速其运行,使得其尽可能在最短的时间内完成训练,尽量不占用下一个忙时间段的计算资源,所以可以尽可能的快速完成训练。
在一种可能的实现方式中,为了保证在扩缩容(即包括扩容和缩容)后训练精度不下降,处理可以为:
将至少一个训练任务中的部分训练任务添加到已运行至少一个训练任务中的训练任务的目标容器中,在目标容器中串行运行多个训练任务,在训练过程中,将串行运行多个训练任务获得的模型参数的平均值作为模型参数的更新值。
在本实施例中,在进行缩容时,容器的个数会减少。缩减掉的容器上运行至少一个训练任务中的部分训练任务,将该部分训练任务,添加到已运行至少一个训练任务中的训练任务的目标容器中。由于目标容器中本身运行有训练任务,再将部分训练任务添加至目标容器,目标容器运行多个训练任务。目标容器中串行运行该多个训练任务。AI平台将串行运行多个训练任务获取的模型参数的平均值,作为模型参数的更新值。这样,由于在目标容器运行多个训练任务时,是串行运行,所以还相当于多个训练任务是分布式执行,和原来未缩容前的执行方式相同,不会导致AI模型的训练精度降低。
上述处理过程可以称为是批处理近似,用于模拟分布式N个容器运行任务,相当于在缩容时,通过模拟的方法模仿整数倍个容器的分布式训练,可以保证精度不下降。例如,在图10的示例中,训练进程1和训练进程3在调整后属于一个训练任务a,训练进程2和训练进程4在调整后属于一个训练任务b,训练任务a运行在容器a上,训练任务b运行在容器b上,容器a串行运行训练进程1和训练进程3,容器b串行运行训练进程2和训练进程4。
再例如,在16个容器上,每个容器使用64个数据训练AI模型,将16个容器中每个容器训练获得的模型参数平均,获得AI模型。在缩容至一个容器后,串行使用16组数据(每组数据为64个数据)训练AI模型,最终将获得的模型参数平均,获得最终的AI模型,所以AI模型的训练精度不会降低。
这样,按照串行运行每个调整前的训练任务,可以保证在缩容后训练精度不下降。
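上述批处理近似的等价性,可以用如下示意性Python代码验证(使用一维线性模型的极简示例,并非本申请的实际实现):

```python
# 示意性代码:缩容后的"批处理近似"——在单个容器中串行运行原本分布在多个容器上的
# 训练任务,再对得到的模型参数求平均,其结果与缩容前的分布式参数平均一致。
def local_update(w, batch, lr=0.1):
    # 单个训练任务基于一批数据计算一次参数更新(线性模型 y=w*x,MSE 损失)
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * g

w0 = 0.0
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]  # 原来4个容器各自的一批数据

# 缩容前:4个容器并行,各自从同一参数出发更新,再取模型参数的平均值
parallel = sum(local_update(w0, b) for b in batches) / len(batches)

# 缩容后:1个容器串行依次运行这4个训练任务(同样从同一参数出发),再取平均
serial_results = []
for b in batches:
    serial_results.append(local_update(w0, b))
serial = sum(serial_results) / len(serial_results)

assert abs(parallel - serial) < 1e-12  # 串行模拟与并行分布式的结果一致
```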
在一种可能的实现方式中,为了保证在扩缩容后训练精度不下降,AI平台可以进行自适应的参数调整,AI平台可以使用历史的训练经验、离线测试的参数组等方式,在进行扩缩容的同时,自适应的调整相应的超参数,使得训练精度保持不变。
如图11所示,提供了用户仅选择共享模式的情况下,AI模型的训练流程示意图:
步骤1101,AI平台向用户提供训练配置界面,其中,训练配置界面包括供用户选择的多种训练模式。用户在训练配置界面中选择共享模式时,训练配置界面还供用户输入或选择运行训练任务的容器的资源使用量。
在本实施例中,AI平台可以向用户提供训练配置界面,训练配置界面包括供用户选择的多种训练模式。训练配置界面显示了用户可选择的资源使用量,用户可以在该资源使用量中,选择或输入可运行训练任务对应的容器的资源使用量。或者,该训练配置界面显示了资源使用量范围,用户可以在该资源使用量范围中,选择或输入可运行训练任务对应的容器的资源使用量。例如,资源使用量范围为0.1个GPU至1个GPU,用户可以选择0.5个GPU。
步骤1102,AI平台根据用户在训练配置界面选择的训练模式和用户输入或选择运行训练任务的容器的资源使用量,生成至少一个训练任务。
在本实施例中,AI平台可以获取到用户在训练配置界面选择的训练模式为共享模式,以及用户在训练配置界面输入或选择的运行训练任务的容器的资源使用量;在用户未输入或选择资源使用量的情况下,运行训练任务的容器的资源使用量是预设数值。AI平台可以基于当前计算资源池中空闲的计算节点、预设的容器个数、容器的资源使用量以及初始AI模型,生成至少一个训练任务。此处预设的容器个数,可以是AI平台为该至少一个训练任务规定的可使用容器个数,也可以是用户指定的该至少一个训练任务的容器个数。
可选的,运行训练任务的容器的资源使用量包括小于单个GPU的GPU资源使用量和/或小于单个显存的显存使用量。这样,可以更细化的划分计算节点上的计算资源,可以使资源利用率更高。
步骤1103,AI平台执行至少一个训练任务以对初始AI模型进行训练,获得AI模型。
可选的,为了减少资源碎片化,步骤1103的处理可以为:
根据共享模式下至少一个训练任务运行的每个容器的资源使用量和计算资源池中各计算节点的剩余资源,确定每个训练任务的容器运行的计算节点;在确定的计算节点上启动至少一个训练任务的容器,对初始AI模型进行训练。
在本实施例中,AI平台可以统计计算资源池中各计算节点的剩余资源,并且获取至少一个训练任务运行的每个容器的资源使用量。若某个已被占用部分资源的计算节点的剩余资源大于该每个容器的资源使用量,则AI平台可以将某个容器下发至该计算节点。若所有已被占用部分资源的计算节点的剩余资源均小于该每个容器的资源使用量,则AI平台可以将至少一个训练任务运行的容器下发至未被占用资源的计算节点。按照这种方式,AI平台将容器下发至计算节点。然后AI平台在计算节点上启动容器,即能实现对初始AI模型进行训练。
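上述"优先使用已被占用部分资源的计算节点"的调度逻辑,可以用如下示意性Python代码概括(first-fit策略,字段名均为示例假设,并非本申请的实际实现):

```python
# 示意性代码:共享模式下的容器调度——优先放入已被占用部分资源且剩余资源足够的节点,
# 不够时再使用空闲节点,以减少资源碎片化。
def schedule(container_gpu, nodes):
    # nodes: [{"name": 节点名, "free": 剩余GPU量}];按剩余资源从小到大遍历,
    # 使已被占用部分资源的节点优先被填满
    for node in sorted(nodes, key=lambda n: n["free"]):
        if node["free"] >= container_gpu:
            node["free"] -= container_gpu
            return node["name"]
    return None  # 无可用节点

nodes = [{"name": "n1", "free": 0.3}, {"name": "n2", "free": 0.6}, {"name": "n3", "free": 1.0}]
placed = schedule(0.5, nodes)
assert placed == "n2"  # n1 剩余不足;优先选部分占用的 n2,而非完全空闲的 n3
assert abs(nodes[1]["free"] - 0.1) < 1e-9
```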
可选的,为了减少资源碎片化,步骤1103的另一种处理方式可以为:
根据第二模式下至少一个训练任务运行的容器的资源使用量确定每个容器对应的计算节点的剩余资源;利用每个容器对应的计算节点的剩余资源,运行一个或多个其他训练任务。
在本实施例中,AI平台可以使用第二模式下至少一个训练任务运行的容器的资源使用量,确定出该至少一个训练任务运行的容器对应的计算节点的剩余资源。AI平台在进行其它训练工作时,该至少一个训练任务使用的某个计算节点的剩余资源,够执行一个或多个其它训练任务,则可以该计算节点的剩余资源,运行一个或多个其它训练任务,实现同一计算节点的资源共享。
这样,在共享模式下,在保证容器的资源需求的情况下,尽可能的使用已经被占用部分资源的计算节点,可以减少资源碎片化,且可以提升资源的整体利用率。
需要说明的是,在图11的流程中,由于共享模式是多个容器共享计算节点的资源,所以在初始AI模型中,需要加入显存限制的功能,以避免多个容器共享计算节点时,由于单个任务超用显存,而导致其他容器出现显存不足的错误。
如图12所示,提供了用户选择性能模式和共享模式的情况下,AI模型的训练流程示意图:
步骤1201,AI平台向用户提供训练配置界面,其中,训练配置界面包括供用户选择的多种训练模式。用户在训练配置界面中选择性能模式和共享模式时,训练配置界面还供用户输入或选择可运行训练任务的容器个数和运行训练任务的容器的资源使用量。
步骤1201的处理过程,是步骤701和步骤1101相结合后的处理过程,可参见步骤701和步骤1101的描述,此处不再赘述。
步骤1202,AI平台根据用户在训练配置界面选择的训练模式、用户输入或选择的可运行训练任务的容器个数、和用户输入或选择运行训练任务的容器的资源使用量,生成至少一个训练任务。
在本实施例中,AI平台可以获取到用户在训练配置界面选择的训练模式为性能模式和共享模式,AI平台可以获取用户在训练配置界面输入或选择的运行训练任务的容器的资源使用量。AI平台可以基于当前计算资源池中空闲的计算节点、用户输入或选择的可运行训练任务的容器个数、容器的资源使用量以及初始AI模型,生成至少一个训练任务。此处,AI平台所确定出的容器个数属于用户输入或选择的可运行训练任务的容器个数。
可选的,运行训练任务的容器的资源使用量包括小于单个GPU的GPU资源使用量和/或小于单个显存的显存使用量。这样,可以更细化的划分计算节点上的计算资源,可以使资源利用率更高。
步骤1203,AI平台执行至少一个训练任务以对初始AI模型进行训练,获得AI模型。
在步骤1203中,可以结合图7的流程中动态扩缩容的处理,以及图11的流程中的共享资源。具体描述参见图7和图11中的描述,此处不再赘述。
这样,在共享模式下,在保证容器的资源需求的情况下,尽可能的使用已经被占用部分资源的计算节点,可以减少资源碎片化,且提升资源的整体利用率,降低单个用户训练AI模型的成本。而且在性能模式下,通过动态的调整容器的个数,可以尽可能加快训练,提升训练AI模型的效率。
另外,针对图7、图11和图12的流程,由于训练数据集存储在OBS中,为了减少每个容器从OBS中下载训练数据集,在单个容器首先加载训练工作的训练数据时,将整个训练数据集从OBS下载到计算节点加载的存储空间中,该存储空间可以是固态存储(solid state storage,SSS)。这样,后续执行该训练工作的每个计算节点可以直接从该存储空间中读取数据,如通过扩容新增的容器可以直接从该存储空间中读取数据,减少从OBS中重新下载训练数据集所需的时间。
需要说明的是,本申请实施例中,容器是AI平台通过拉取镜像及初始AI模型生成的。另外,通过在容器上运行训练任务,由于容器的隔离性比较好,所以即使在同一个节点上部署多个容器,各个容器上运行的训练任务也不会相互干扰。
通过本申请实施例,在AI平台中提供了多种训练模式供用户选择,用户可以通过选择合适的训练模式,使得分布式训练可以灵活的执行,进而可以平衡用户的训练需求和资源利用率。
图1是本申请实施例提供的AI模型的训练装置的结构图,该装置应用于AI平台,该AI平台与计算资源池相关联,计算资源池包括用于模型训练的计算节点。该装置可以通过软件、硬件或者两者的结合实现成为装置中的部分或者全部。本申请实施例提供的装置可以实现本申请实施例图7、图11、图12所述的流程,该装置包括:训练配置模块102、任务管理模块103和展示模块105,其中:
训练配置模块102,用于向用户提供训练配置界面,其中,所述训练配置界面包括供所述用户选择的多种训练模式,每种训练模式表示对训练初始AI模型所需的计算节点的一种分配策略,具体可以用于实现步骤701的训练配置功能以及执行步骤701包含的隐含步骤;
任务管理模块103,用于:
根据所述用户在所述训练配置界面的选择,生成至少一个训练任务;
执行所述至少一个训练任务以对所述初始AI模型进行训练,获得AI模型,获得的所述AI模型供所述用户下载或使用,具体可以用于实现步骤702和步骤703的任务管理功能以及执行步骤702和步骤703包含的隐含步骤。
在一种可能的实现方式中,所述多种训练模式包括第一模式和/或第二模式,所述第一模式表示对所述初始AI模型进行训练的过程中自动调整训练任务的个数,所述第二模式表示不同训练任务共享同一计算节点的资源。
在一种可能的实现方式中,所述至少一个训练任务运行在容器上,所述装置还包括:
展示模块105,用于在对所述初始AI模型进行训练的过程中,向所述用户提供训练过程的状态信息,其中,所述状态信息包括以下信息中的至少一种信息:执行训练任务的容器个数,每个容器的资源使用量,执行训练任务的计算节点的个数,和执行训练任务的计算节点的资源使用量。
在一种可能的实现方式中,所述多种训练模式包括第一模式和第二模式,所述任务管理模块103,用于:
根据所述用户在所述训练配置界面选择的第一模式和第二模式,生成至少一个训练任务。
在一种可能的实现方式中,当所述用户在所述训练配置界面中选择所述第一模式时,所述训练配置界面还供所述用户输入或选择可运行训练任务的容器个数;
所述任务管理模块103,用于:
根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择的可运行训练任务的容器个数,生成至少一个训练任务。
在一种可能的实现方式中,当所述用户在所述训练配置界面中选择所述第二模式时,所述训练配置界面还供所述用户输入或选择运行训练任务的容器的资源使用量;
所述任务管理模块103,用于:
根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择运行训练任务的容器的资源使用量,生成至少一个训练任务。
在一种可能的实现方式中,所述运行训练任务的容器的资源使用量包括小于单个GPU的GPU资源使用量和/或小于单个显存的显存使用量。
在一种可能的实现方式中,在选择所述第一模式的情况下,所述任务管理模块103,用于:
在执行所述至少一个训练任务以对所述初始AI模型进行训练的过程中,当检测到满足弹性扩缩容的条件时,获取所述计算资源池中计算资源的空闲量;
根据所述计算资源池中计算资源的空闲量,调整所述至少一个训练任务的个数以及调整用于运行训练任务的容器的个数;
在调整后的容器中运行调整后的训练任务以对所述初始AI模型进行训练。
在一种可能的实现方式中,所述任务管理模块103,用于:
将所述至少一个训练任务中的部分训练任务添加到已运行所述至少一个训练任务中的训练任务的目标容器中,在所述目标容器中串行运行多个训练任务,在训练过程中,将串行运行所述多个训练任务获得的模型参数的平均值作为模型参数的更新值。
在一种可能的实现方式中,在选择所述第二模式的情况下,所述任务管理模块103,还用于:
根据所述第二模式下所述至少一个训练任务运行的容器的资源使用量确定每个容器对应的计算节点的剩余资源;
利用所述每个容器对应的计算节点的剩余资源,运行一个或多个其他训练任务。
本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时也可以有另外的划分方式,另外,在本申请各个实施例中的各功能模块可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上模块集成为一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
本申请还提供一种如图4所示的计算设备400,计算设备400中的处理器402读取存储器401存储的程序和数据集合以执行前述AI平台执行的方法。
由于本申请提供的AI平台100中的各个模块可以分布式地部署在同一环境或不同环境中的多个计算机上,因此,本申请还提供一种如图13所示的计算设备,该计算设备包括多个计算机1300,每个计算机1300包括存储器1301、处理器1302、通信接口1303以及总线1304。其中,存储器1301、处理器1302、通信接口1303通过总线1304实现彼此之间的通信连接。
存储器1301可以是只读存储器,静态存储设备,动态存储设备或者随机存取存储器。存储器1301可以存储程序,当存储器1301中存储的程序被处理器1302执行时,处理器1302和通信接口1303用于执行AI平台训练AI模型的部分方法。存储器还可以存储训练数据集,例如,存储器1301中的一部分存储资源被划分成一个数据集存储模块,用于存储AI平台所需的训练数据集。
处理器1302可以采用通用的中央处理器,微处理器,应用专用集成电路,图形处理器或者一个或多个集成电路。
通信接口1303使用例如但不限于收发器一类的收发模块,来实现计算机1300与其他设备或通信网络之间的通信。例如,可以通过通信接口1303获取训练数据集。
总线1304可包括在计算机1300各个部件(例如,存储器1301、处理器1302、通信接口1303)之间传送信息的通路。
上述每个计算机1300间通过通信网络建立通信通路。每个计算机1300上运行算法管理模块101、训练配置模块102、任务管理模块103、数据存储模块104和展示模块105中的任意一个或多个。任一计算机1300可以为云数据中心中的计算机(例如,服务器),或边缘数据中心中的计算机,或终端计算设备。
上述各个附图对应的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
在上述实施例中,可以全部或部分地通过软件、硬件或者其组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。提供AI平台的计算机程序产品包括一个或多个计算机指令,在计算机上加载和执行这些计算机指令时,全部或部分地产生按照本申请实施例图7、图11或图12所述的流程或功能。
所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、双绞线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质存储有提供AI平台的计算机程序指令。所述计算机可读存储介质可以是计算机能够存取的任何介质或者是包含一个或多个介质集成的服务器、数据中心等数据存储设备。所述介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,光盘)、或者半导体介质(如固态硬盘)。

Claims (22)

  1. 一种人工智能AI模型的训练方法,其特征在于,所述方法应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于模型训练的计算节点,包括:
    向用户提供训练配置界面,其中,所述训练配置界面包括供所述用户选择的多种训练模式,每种训练模式表示对训练初始AI模型所需的计算节点的一种分配策略;
    根据所述用户在所述训练配置界面的选择,生成至少一个训练任务;
    执行所述至少一个训练任务以对所述初始AI模型进行训练,获得AI模型,获得的所述AI模型供所述用户下载或使用。
  2. 根据权利要求1所述的方法,其特征在于,所述多种训练模式包括第一模式和/或第二模式,所述第一模式表示对所述初始AI模型进行训练的过程中自动调整训练任务的个数,所述第二模式表示不同训练任务共享同一计算节点的资源。
  3. 根据权利要求1或2所述的方法,其特征在于,所述至少一个训练任务运行在容器上,所述方法还包括:
    在对所述初始AI模型进行训练的过程中,向所述用户提供训练过程的状态信息,其中,所述状态信息包括以下信息中的至少一种信息:执行训练任务的容器个数,每个容器的资源使用量,执行训练任务的计算节点的个数,和执行训练任务的计算节点的资源使用量。
  4. 根据权利要求2或3所述的方法,其特征在于,所述多种训练模式包括第一模式和第二模式,根据所述用户在所述训练配置界面的选择,生成至少一个训练任务,包括:
    根据所述用户在所述训练配置界面选择的第一模式和第二模式,生成至少一个训练任务。
  5. 根据权利要求2-4任一项所述的方法,其特征在于,当所述用户在所述训练配置界面中选择所述第一模式时,所述训练配置界面还供所述用户输入或选择可运行训练任务的容器个数;
    所述根据所述用户在所述训练配置界面的选择,生成至少一个训练任务,包括:
    根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择的可运行训练任务的容器个数,生成至少一个训练任务。
  6. 根据权利要求2-5任一项所述的方法,其特征在于,当所述用户在所述训练配置界面中选择所述第二模式时,所述训练配置界面还供所述用户输入或选择运行训练任务的容器的资源使用量;
    所述根据所述用户在所述训练配置界面的选择,生成至少一个训练任务,包括:
    根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择运行训练任务的容器的资源使用量,生成至少一个训练任务。
  7. 根据权利要求5或6所述的方法,其特征在于,所述运行训练任务的容器的资源使用量包括小于单个图形处理器GPU的GPU资源使用量和/或小于单个显存的显存使用量。
  8. 根据权利要求2-7任一项所述的方法,其特征在于,在选择所述第一模式的情况下,所述执行所述至少一个训练任务以对所述初始AI模型进行训练,包括:
    在执行所述至少一个训练任务以对所述初始AI模型进行训练的过程中,当检测到满足弹性扩缩容的条件时,获取所述计算资源池中计算资源的空闲量;
    根据所述计算资源池中计算资源的空闲量,调整所述至少一个训练任务的个数以及调整用于运行训练任务的容器的个数;
    在调整后的容器中运行调整后的训练任务以对所述初始AI模型进行训练。
  9. 根据权利要求8所述的方法,其特征在于,所述调整所述至少一个训练任务的个数以及调整用于运行训练任务的容器的个数,在调整后的容器中运行所述调整后的训练任务以对所述初始AI模型进行训练,包括:
    将所述至少一个训练任务中的部分训练任务添加到已运行所述至少一个训练任务中的训练任务的目标容器中,在所述目标容器中串行运行多个训练任务,在训练过程中,将串行运行所述多个训练任务获得的模型参数的平均值作为模型参数的更新值。
  10. 根据权利要求2-9任一项所述的方法,其特征在于,在选择所述第二模式的情况下,所述方法包括:
    根据所述第二模式下所述至少一个训练任务运行的容器的资源使用量确定每个容器对应的计算节点的剩余资源;
    利用所述每个容器对应的计算节点的剩余资源,运行一个或多个其他训练任务。
  11. 一种人工智能AI模型的训练装置,其特征在于,所述装置应用于AI平台,所述AI平台与计算资源池相关联,所述计算资源池包括用于模型训练的计算节点,包括:
    训练配置模块,用于向用户提供训练配置界面,其中,所述训练配置界面包括供所述用户选择的多种训练模式,每种训练模式表示对训练初始AI模型所需的计算节点的一种分配策略;
    任务管理模块,用于:
    根据所述用户在所述训练配置界面的选择,生成至少一个训练任务;
    执行所述至少一个训练任务以对所述初始AI模型进行训练,获得AI模型,获得的所述AI模型供所述用户下载或使用。
  12. 根据权利要求11所述的装置,其特征在于,所述多种训练模式包括第一模式和/或第二模式,所述第一模式表示对所述初始AI模型进行训练的过程中自动调整训练任务的个数,所述第二模式表示不同训练任务共享同一计算节点的资源。
  13. 根据权利要求11或12所述的装置,其特征在于,所述至少一个训练任务运行在容器上,所述装置还包括:
    展示模块,用于在对所述初始AI模型进行训练的过程中,向所述用户提供训练过程的状态信息,其中,所述状态信息包括以下信息中的至少一种信息:执行训练任务的容器个数,每个容器的资源使用量,执行训练任务的计算节点的个数,和执行训练任务的计算节点的资源使用量。
  14. 根据权利要求12或13所述的装置,其特征在于,所述多种训练模式包括第一模式和第二模式,所述任务管理模块,用于:
    根据所述用户在所述训练配置界面选择的第一模式和第二模式,生成至少一个训练任务。
  15. 根据权利要求12-14任一项所述的装置,其特征在于,当所述用户在所述训练配置界面中选择所述第一模式时,所述训练配置界面还供所述用户输入或选择可运行训练任务的容器个数;
    所述任务管理模块,用于:
    根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择的可运行训练任务的容器个数,生成至少一个训练任务。
  16. 根据权利要求12-15任一项所述的装置,其特征在于,当所述用户在所述训练配置界面中选择所述第二模式时,所述训练配置界面还供所述用户输入或选择运行训练任务的容器的资源使用量;
    所述任务管理模块,用于:
    根据所述用户在所述训练配置界面选择的训练模式和所述用户输入或选择运行训练任务的容器的资源使用量,生成至少一个训练任务。
  17. 根据权利要求15或16所述的装置,其特征在于,所述运行训练任务的容器的资源使用量包括小于单个图形处理器GPU的GPU资源使用量和/或小于单个显存的显存使用量。
  18. 根据权利要求12-17任一项所述的装置,其特征在于,在选择所述第一模式的情况下,所述任务管理模块,用于:
    在执行所述至少一个训练任务以对所述初始AI模型进行训练的过程中,当检测到满足弹性扩缩容的条件时,获取所述计算资源池中计算资源的空闲量;
    根据所述计算资源池中计算资源的空闲量,调整所述至少一个训练任务的个数以及调整用于运行训练任务的容器的个数;
    在调整后的容器中运行调整后的训练任务以对所述初始AI模型进行训练。
  19. 根据权利要求18所述的装置,其特征在于,所述任务管理模块,用于:
    将所述至少一个训练任务中的部分训练任务添加到已运行所述至少一个训练任务中的训练任务的目标容器中,在所述目标容器中串行运行多个训练任务,在训练过程中,将串行运行所述多个训练任务获得的模型参数的平均值作为模型参数的更新值。
  20. 根据权利要求12-19任一项所述的装置,其特征在于,在选择所述第二模式的情况下,所述任务管理模块,还用于:
    根据所述第二模式下所述至少一个训练任务运行的容器的资源使用量确定每个容器对应的计算节点的剩余资源;
    利用所述每个容器对应的计算节点的剩余资源,运行一个或多个其他训练任务。
  21. 一种计算设备,其特征在于,所述计算设备包括存储器和处理器,所述存储器用于存储计算机指令;
    所述处理器执行所述存储器存储的计算机指令,以执行上述权利要求1-10中任一项所述的方法。
  22. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序代码,当所述计算机程序代码被计算设备执行时,所述计算设备执行上述权利要求1-10中任一项所述的方法。
PCT/CN2021/115881 2020-09-07 2021-09-01 Ai模型的训练方法、装置、计算设备和存储介质 WO2022048557A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21863621.5A EP4209972A4 (en) 2020-09-07 2021-09-01 AI MODEL LEARNING METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIUM
US18/179,661 US20230206132A1 (en) 2020-09-07 2023-03-07 Method and Apparatus for Training AI Model, Computing Device, and Storage Medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010926721 2020-09-07
CN202010926721.0 2020-09-07
CN202011053283.8 2020-09-29
CN202011053283.8A CN114154641A (zh) 2020-09-07 2020-09-29 Ai模型的训练方法、装置、计算设备和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/179,661 Continuation US20230206132A1 (en) 2020-09-07 2023-03-07 Method and Apparatus for Training AI Model, Computing Device, and Storage Medium

Publications (1)

Publication Number Publication Date
WO2022048557A1 true WO2022048557A1 (zh) 2022-03-10

Family

ID=80462178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115881 WO2022048557A1 (zh) 2020-09-07 2021-09-01 Ai模型的训练方法、装置、计算设备和存储介质

Country Status (4)

Country Link
US (1) US20230206132A1 (zh)
EP (1) EP4209972A4 (zh)
CN (1) CN114154641A (zh)
WO (1) WO2022048557A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418456A (zh) * 2022-03-11 2022-04-29 希望知舟技术(深圳)有限公司 一种基于工况的机器学习进度管控方法及相关装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116963099A (zh) * 2022-04-14 2023-10-27 维沃移动通信有限公司 模型输入的确定方法及通信设备
CN117217292A (zh) * 2022-05-30 2023-12-12 华为云计算技术有限公司 一种模型训练方法及装置
CN116450486B (zh) * 2023-06-16 2023-09-05 浪潮电子信息产业股份有限公司 多元异构计算系统内节点的建模方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109714400A (zh) * 2018-12-12 2019-05-03 华南理工大学 一种面向容器集群的能耗优化资源调度系统及其方法
CN110389834A (zh) * 2019-06-28 2019-10-29 苏州浪潮智能科技有限公司 一种用于提交深度学习训练任务的方法和装置
US20200151617A1 (en) * 2018-11-09 2020-05-14 Citrix Systems, Inc. Systems and methods for machine generated training and imitation learning
CN111160569A (zh) * 2019-12-30 2020-05-15 第四范式(北京)技术有限公司 基于机器学习模型的应用开发方法、装置及电子设备
CN111274036A (zh) * 2020-01-21 2020-06-12 南京大学 一种基于速度预测的深度学习任务的调度方法


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4209972A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418456A (zh) * 2022-03-11 2022-04-29 希望知舟技术(深圳)有限公司 一种基于工况的机器学习进度管控方法及相关装置
CN114418456B (zh) * 2022-03-11 2022-07-26 希望知舟技术(深圳)有限公司 一种基于工况的机器学习进度管控方法及相关装置

Also Published As

Publication number Publication date
EP4209972A1 (en) 2023-07-12
EP4209972A4 (en) 2024-03-06
CN114154641A (zh) 2022-03-08
US20230206132A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
WO2022048557A1 (zh) Ai模型的训练方法、装置、计算设备和存储介质
US11769061B2 (en) Processing computational graphs
US20210295161A1 (en) Training neural networks represented as computational graphs
Wang et al. Distributed machine learning with a serverless architecture
EP3754495B1 (en) Data processing method and related products
US11847554B2 (en) Data processing method and related products
EP3446260B1 (en) Memory-efficient backpropagation through time
JP2022003576A (ja) 制御パルス生成方法、装置、システム、電子デバイス、記憶媒体及びプログラム
JP2021505993A (ja) 深層学習アプリケーションのための堅牢な勾配重み圧縮方式
CN110826708B (zh) 一种用多核处理器实现神经网络模型拆分方法及相关产品
CN112764893B (zh) 数据处理方法和数据处理系统
CN111738488A (zh) 一种任务调度方法及其装置
CN111966361A (zh) 用于确定待部署模型的方法、装置、设备及其存储介质
CN116194934A (zh) 模块化模型交互系统和方法
Hosny et al. Characterizing and optimizing EDA flows for the cloud
CN115827225A (zh) 异构运算的分配方法、模型训练方法、装置、芯片、设备及介质
CN113961765B (zh) 基于神经网络模型的搜索方法、装置、设备和介质
WO2021051920A1 (zh) 模型优化方法、装置、存储介质及设备
CN110377769A (zh) 基于图数据结构的建模平台系统、方法、服务器及介质
CN117827619B (zh) 异构算力的耗时预测仿真方法、装置、设备、介质及系统
CN117764179A (zh) 机器学习模型的全局解释优化方法、系统、介质及设备
CN115688893A (zh) 内存调度方法及装置、电子设备和存储介质
CN115904422A (zh) 眼底筛查设备以及硬件、软件及设备升级方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863621

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021863621

Country of ref document: EP

Effective date: 20230402

NENP Non-entry into the national phase

Ref country code: DE