CN117407713A - Training management method and related device for distributed model training


Info

Publication number
CN117407713A
Authority
CN
China
Prior art keywords
training
configuration information
target
tasks
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311346413.0A
Other languages
Chinese (zh)
Inventor
张吉
章海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311346413.0A
Publication of CN117407713A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/098 Distributed learning, e.g. federated learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a training management method, apparatus, system, device, and storage medium for distributed model training. A management program manages a plurality of training tasks that perform distributed training, each task performing distributed training for a target model based on initial training configuration information configured by a user. During the distributed training of the target model, the management program acquires the training states of the plurality of training tasks and decides, based on those states, whether the initial training configuration information needs to be updated. If so, it acquires target training configuration information specified by the user and issues it to the plurality of training tasks, triggering each task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model.

Description

Training management method and related device for distributed model training
Technical Field
The present disclosure relates to the technical field of machine learning, and in particular to a training management method, apparatus, system, device, and storage medium for distributed model training.
Background
A machine learning model must be trained before it is put into service. For some complex machine learning models, training is a large undertaking that requires substantial computational power and a significant amount of time. In view of this, distributed model training comprising a plurality of training tasks is currently adopted to improve training efficiency. During training, however, the user must discover any problem, stop the current training tasks, manually modify the configuration, and restart the training tasks, and this process consumes a certain amount of time.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a training management method, apparatus, system, device, and storage medium for distributed model training.
According to a first aspect of embodiments of the present disclosure, there is provided a training management method for distributed model training. The method is applied to a management program configured to manage a plurality of training tasks that perform distributed training, where the plurality of training tasks respectively perform distributed training for a target model based on initial training configuration information configured by a user. The method comprises the following steps:
in the process of performing distributed training on the target model, acquiring training states of the plurality of training tasks for the distributed training of the target model, and deciding, based on the training states, whether the initial training configuration information needs to be updated;
if so, acquiring target training configuration information specified by the user, and issuing the target training configuration information to the plurality of training tasks respectively, so as to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, thereby continuing the distributed training of the target model.
According to a second aspect of embodiments of the present specification, there is provided a training management apparatus for distributed model training. The apparatus is applied to a management program for managing a plurality of training tasks that perform distributed training, where the plurality of training tasks respectively perform distributed training for a target model based on initial training configuration information configured by a user. The apparatus comprises:
a decision module, configured to acquire, in the process of performing distributed training on the target model, training states of the plurality of training tasks for the distributed training of the target model, and to decide, based on the training states, whether the initial training configuration information needs to be updated;
a triggering module, configured to acquire, when it is decided that the initial training configuration information needs to be updated, target training configuration information specified by the user, and to issue the target training configuration information to the plurality of training tasks respectively, so as to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, thereby continuing the distributed training of the target model.
According to a third aspect of embodiments of the present specification, there is provided a distributed training system comprising a management program for managing a plurality of training tasks that perform distributed training, and the plurality of training tasks, which respectively perform distributed training for a target model based on initial training configuration information configured by a user; the management program, when executed by a processor, implements the steps of the training management method for distributed model training of the first aspect.
According to a fourth aspect of embodiments of the present specification, there is provided a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the training management method for distributed model training of the first aspect.
According to a fifth aspect of embodiments of the present specification, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training management method for distributed model training of the first aspect.
The technical solutions provided by the embodiments of the present specification may include the following beneficial effects:
In the embodiments of the present disclosure, a management program is configured for a plurality of training tasks that perform distributed training. While the plurality of training tasks perform distributed training for a target model based on initial training configuration information configured by a user, the management program may acquire the training states of the plurality of training tasks and decide, based on those states, whether to update the initial training configuration information. If an update is needed, the management program may acquire target training configuration information specified by the user and issue it to the plurality of training tasks, triggering each task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model. The management program thus enables dynamic task updates during distributed training: it automatically monitors the training state of the model, automatically decides whether an update is needed, and issues new target configuration information to the training tasks when it is, without stopping and restarting the training tasks, thereby improving the training efficiency of the model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a model training process in the related art, according to an exemplary embodiment of the present specification.
FIG. 2A is a schematic diagram of a model training process in this embodiment, according to an exemplary embodiment of the present specification.
FIG. 2B is a schematic diagram of a distributed training scenario, according to an exemplary embodiment of the present specification.
FIG. 2C is a flowchart of a training management method for distributed model training, according to an exemplary embodiment of the present specification.
FIG. 3 is a hardware configuration diagram of a computer device on which a training management apparatus for distributed model training is located, according to an exemplary embodiment of the present specification.
FIG. 4 is a block diagram of a training management apparatus for distributed model training, according to an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination," depending on the context.
User information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this disclosure are authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation portals are provided for the user to grant or deny authorization.
Conventional machine learning training can utilize the computing resources of only a single computer, which, for large-scale data sets and complex models, faces problems such as insufficient computing resources and excessively long training times. With the development of deep learning and related technologies, models have become increasingly complex, and distributed training is increasingly adopted. Distributed training breaks the training of a machine learning model into multiple training tasks that run simultaneously on multiple processing nodes (e.g., containers or virtual machines). Each processing node may run one training task, and each task may be responsible for processing a portion of the training data set and may independently compute gradients, update model parameters, and so on. Parallel processing and cooperative work accelerate the training speed and improve the performance of the model. Communication between these training tasks is also typically required to share model parameters, synchronize gradients, or exchange data.
During model training, if problems such as poor convergence or poor training performance are found, the work of each training task often needs to be adjusted. In the related art, such adjustments require ending each currently running training task, manually modifying the configuration, and then creating and starting a new set of training tasks so that they rerun on the machines. This process takes time; for large-scale distributed training of large models it may take several minutes to tens of minutes. Another approach is for the user to determine adjustment requirements in advance and hard-code preset adjustment logic into the training code, but this is inflexible and cannot satisfy new adjustment requirements. For example, FIG. 1 shows a schematic diagram of a model training flow in the related art: after model training is started, a training loop is entered; if training is to stop, training ends and the model is saved; otherwise, the steps of fetching training data from the training data set and optimizing the model according to the hyper-parameters are executed in a loop. The flow is entirely one main loop, with no change path during training: if a change is needed, the whole flow must be terminated, and training resumes only after the training tasks are restarted.
On this basis, the present embodiment provides a training management scheme for distributed model training. As shown in FIG. 2A, a schematic diagram of the model training flow in this embodiment, the flow can automatically decide, according to the training state, whether to update the training configuration information; if so, one or more of the model parameters, hyper-parameters, optimization strategy, or training data may be updated, and in the distributed case each training task can also communicate to synchronize state. This embodiment therefore supports dynamic updates without terminating the execution flow.
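The dynamically updatable training loop described above can be sketched in a few lines of Python. All names here are hypothetical and purely illustrative; the point is only that the decision and the configuration update happen inside the main loop, so the loop never has to be terminated.

```python
def train_with_dynamic_update(steps, get_state, decide_update, fetch_new_config):
    """Toy training loop: each iteration checks the training state and,
    if an update is decided, applies the user's new target configuration
    in place, without leaving the loop."""
    config = {"lr": 0.1}                       # initial training configuration
    history = []
    for step in range(steps):
        history.append(config["lr"])           # stand-in for one training step
        state = get_state(step)                # e.g. loss trend, gradient norms
        if decide_update(state):               # management program's decision point
            config.update(fetch_new_config())  # apply user-specified target config
    return history
```

For example, updating the learning rate after the state at step 1 flags a problem changes later iterations while earlier ones are untouched, mirroring the "continue training, do not restart" behaviour of FIG. 2A.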
FIG. 2B is a schematic diagram of an application scenario of distributed model training according to an embodiment of the present disclosure. The scenario includes a management program and a plurality of training tasks managed by it: the management program manages the execution of the distributed training, and the plurality of training tasks respectively perform distributed training on the target model based on initial training configuration information configured by a user. With the management program configured for the plurality of training tasks, while the tasks perform distributed training for the target model, the management program may acquire their training states and determine, based on those states, whether the initial training configuration information needs to be updated. If so, the management program may acquire target training configuration information specified by the user and issue it to the plurality of training tasks, triggering each task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model. Dynamic task updates during distributed training are thus realized through the management program, and training efficiency is improved because the running training tasks need not be stopped and new ones restarted.
As shown in fig. 2C, which is a flowchart of a training management method for distributed model training according to an embodiment of the present disclosure, the method may include:
step 202, in the process of performing distributed training on the target model, acquiring training states of the multiple training tasks for performing distributed training on the target model, and deciding whether the initial training configuration information needs to be updated based on the training states.
And 204, if so, acquiring target training configuration information appointed by a user, respectively issuing the target training configuration information to the plurality of training tasks to trigger the plurality of training tasks to update initial training configuration information corresponding to the target training configuration information based on the target training configuration information respectively and then rerun the initial training configuration information so as to continue to perform distributed training on the target model.
In some examples, the management program of this embodiment may be a program running in a machine learning model training system. A machine learning model training system is a software platform for training machine learning models and typically includes a series of functions such as data preprocessing, model construction, training optimization, and model evaluation. Such platforms allow users to train various types of machine learning models, and may also provide visual interfaces and various APIs (Application Programming Interfaces) that allow users to conveniently build, train, and evaluate machine learning models. The system may also incorporate common machine learning algorithms, model structures, model optimization methods, etc., so that a user may quickly select different models and algorithms.
In some examples, the initial training configuration may be supplied by the user. For example, the machine learning model training system may provide an interactive interface for the initial training configuration information, through which the user may submit various training configuration information. The interactive interface may be provided by the management program, i.e., the management program obtains the initial training configuration information configured by the user through the interactive interface; it may also be provided by other programs in the machine learning model training system, which is not limited in this embodiment.
As an example, the initial configuration information configured by the user may include one or more of the following: a training data set, model parameters, an optimization strategy, and hyper-parameters. In addition to this task information related to running the training task, it may optionally include, depending on the actual application scenario, various other information such as the distributed training framework, the optimizer, and the model save path, which is not limited in this embodiment.
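An illustrative shape of such user-configured initial training configuration information is sketched below. All keys and values are assumptions chosen to match the examples in the paragraph above (data set, model parameters, optimization strategy, hyper-parameters, plus optional fields), not a format defined by the patent.

```python
# Hypothetical initial training configuration, as a plain dictionary.
initial_config = {
    "dataset": "train_data_v1",        # training data set
    "model_params": "init_ckpt.pt",    # initial model parameters / checkpoint
    "optimizer": "adam",               # optimization strategy
    "hyperparams": {"lr": 1e-3, "batch_size": 64},  # hyper-parameters
    "framework": "pytorch-ddp",        # optional: distributed training framework
    "save_path": "/models/target",     # optional: model save path
}
```

A target configuration issued later by the management program could then be a partial dictionary of the same shape, merged into each task's copy.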
This embodiment does not limit the number of training tasks; the number may be specified by the user, or determined automatically from factors such as the actual hardware resources. As an example, the training of the model may be divided into a plurality of training tasks, each processing different training data; the division may be achieved by partitioning the training data set, i.e., the number of training tasks follows from the number of training data subsets into which the training data set is divided. A corresponding number of training processes may then be started, with each training task corresponding to a training process running on a machine (e.g., a container or a virtual machine).
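The partitioning just described, where the number of tasks equals the number of subsets, can be sketched as follows (the round-robin split is one possible choice, not prescribed by the patent):

```python
def split_dataset(samples, num_tasks):
    """Divide the training data set into num_tasks subsets; one training
    task is then started per subset (illustrative round-robin split)."""
    return [samples[i::num_tasks] for i in range(num_tasks)]
```

Each resulting subset would be handed to one training process, so every sample is processed by exactly one task.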
Optionally, in each training task, model parameters may be initialized according to the configuration information. These initial model parameters may be user-configured, random, or loaded from a pre-trained model, depending on factors such as the target model being trained, the actual application scenario, or the needs of the user, which is not limited in this embodiment. As an example, the target model of this embodiment may be a neural-network-based deep learning model, a pre-trained model, or the like.
In each training task, a training cycle of the model may be performed, including steps such as forward propagation, computing the loss function, backward-propagating gradients, and updating parameters. Each training task may process only the training data for which it is responsible.
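A toy, single-task version of that cycle, fitting a one-parameter linear model by gradient descent, makes the four steps concrete. This is only a sketch; a real task would use a deep learning framework rather than hand-written gradients.

```python
def training_cycle(w, data, lr, epochs):
    """Minimal training cycle: forward pass, loss, gradient, update."""
    for _ in range(epochs):
        for x, y in data:                  # each task sees only its own shard
            pred = w * x                   # forward propagation
            loss = (pred - y) ** 2         # loss function (squared error)
            grad = 2 * (pred - y) * x      # backward-propagated gradient
            w -= lr * grad                 # parameter update
    return w
```

With a single sample (x=1, y=2), the parameter converges toward 2, illustrating the loop's effect.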
In distributed training, different training tasks can communicate with each other to achieve gradient transfer and parameter updating. For example, the gradient computed by the current task may be transferred to other tasks through a communication mechanism to implement a global parameter update. The communication between training tasks may rely on a communication library or framework, such as NVIDIA NCCL or TensorFlow's distributed communication components. As an example, the NVIDIA Collective Communications Library (NCCL) is a high-performance GPU-to-GPU communication library optimized for multi-GPU (Graphics Processing Unit) systems. In distributed training, a communication group containing every training task can be established through NCCL before training starts; once the communication connections are established and training begins, computed gradients can be transferred to other tasks over those connections. As an example, the process may be:
Creating a communication group: before distributed training begins, each training task creates a communication group using NCCL. A communication group is a logical concept that organizes the tasks participating in the communication.
Initializing communication: after creating the communication group, each training task initializes the NCCL communication environment in preparation for subsequent communication operations.
Exchanging data: during training, whenever gradient transfer and parameter updating are required, each task sends its computed gradient data to the other tasks in the communication group through the NCCL interfaces.
Synchronizing: after the gradient data transfer completes, each task performs a synchronization operation to ensure that all tasks have finished the transfer before the next parameter update proceeds.
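The data-exchange step above is typically an all-reduce: every task in the communication group ends up holding the average of all local gradients. NCCL provides this as a GPU collective; the sketch below only simulates its arithmetic effect in plain Python to show what the exchange achieves, and is in no way a substitute for the library.

```python
def all_reduce_mean(local_grads):
    """Simulated all-reduce (mean) over one scalar gradient per task:
    after the collective, every task holds the group-wide average."""
    group_size = len(local_grads)           # size of the communication group
    mean = sum(local_grads) / group_size
    return [mean] * group_size              # same value on every task
```

After this step, each task applies the identical averaged gradient, so the model replicas stay synchronized without any task ever restarting.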
If the training configuration information needs to be adjusted during training, the related art, as described above, ends each currently running training task, manually modifies the configuration, and restarts a new set of training tasks on the machines. Ending the running training tasks means that the communication group they established is cleared, so each new training task must reestablish its communication connections when restarted, which brings additional time loss. In the embodiments of the present disclosure, because a management program managing the plurality of training tasks is provided, the management program can interact with each training task and send the new configuration to it; the training tasks need not be terminated and restarted, so the communication connections between them are not cleared and need not be reestablished, and training efficiency is therefore improved.
According to the method of this embodiment, training configuration information can be updated dynamically during the distributed training of the model: the management program may acquire the training states of the plurality of training tasks for the distributed training of the target model and decide, based on those states, whether the initial training configuration information needs to be updated.
Optionally, the management program may communicate with each training task and obtain the training states of the plurality of training tasks for the distributed training of the target model from any of them. The training state may include various indicators, such as loss-decrease information and gradient information, and is used to indicate whether the current training state of the model is normal. In practice, the indicators representing the training state can be configured flexibly as needed, and this embodiment is not limited in this respect.
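An illustrative check of "whether the current training state is normal", built from the two indicator kinds named above (loss trend and gradient information), might look like the following. The thresholds and window size are assumptions for the sketch, not values from the patent.

```python
def state_is_abnormal(loss_history, grad_norm, window=5, grad_limit=1e3):
    """Flag an abnormal state if the gradient norm explodes or the loss
    has stopped decreasing over the last `window` recorded values."""
    exploding = grad_norm > grad_limit                       # gradient blow-up
    recent = loss_history[-window:]
    stalled = len(recent) == window and recent[-1] >= recent[0]  # loss not falling
    return exploding or stalled
```

The management program could run such a check on each reported state and, when it returns true, decide that the initial training configuration information needs updating.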
Optionally, the plurality of training tasks may include a main task; obtaining the training states of the plurality of training tasks for the distributed training of the target model may then comprise: acquiring, from the main task, the training states of the plurality of training tasks for the distributed training of the target model. In practice, the main task can be configured flexibly according to the actual scenario: it may be chosen randomly from the plurality of training tasks, specified by the user, or selected according to one or more rules. As an example, each training task has an identifier, and the main task is determined based on the identifiers, e.g., the task with the smallest identifier among the plurality of training tasks serves as the main task; this embodiment is not limited in this respect. On this basis, the management program need only communicate with the main task for data interaction when needed, rather than with all of the training tasks, which improves efficiency.
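One of the selection rules mentioned above can be sketched as follows; the function names are hypothetical, and the "smallest identifier" rule is just the example the text gives.

```python
def elect_main_task(task_ids):
    """Select the task with the smallest identifier as the main task."""
    return min(task_ids)

def report_state(task_id, main_id, local_states):
    """Only the elected main task aggregates and reports the group's
    training state to the management program; other tasks report nothing."""
    if task_id != main_id:
        return None
    return {"mean_loss": sum(local_states) / len(local_states)}
```

With this rule, the management program exchanges state with exactly one endpoint regardless of how many training tasks are running.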
In addition, the management program may acquire the training states of the plurality of training tasks periodically according to a set rule, for example at set time intervals (e.g., every hour). The distributed training may also include multiple rounds, and the states may be acquired once per round, etc.; this embodiment is not limited in this respect.
There are many ways in which the management program can decide whether an update is needed. As one example, the management program may provide a change interface for changing the training configuration information.
In that case, acquiring the training states of the plurality of training tasks for the distributed training of the target model and deciding, based on the training states, whether to update the initial training configuration information may include:
acquiring the training states of the plurality of training tasks for the distributed training of the target model, outputting the training states to the user, and determining that the initial training configuration information needs to be updated in response to receiving a call request for the change interface, initiated by the user based on the output training states.
Acquiring the target training configuration information specified by the user may include:
acquiring the target training configuration information, specified by the user, contained in the call request.
In this embodiment, the management program may output the training state to the user in various manners. For example, the training state may be written to a log file so that the user can view it when needed, or it may be output directly to the user's terminal for review. Where needed, the training state may also be presented to the user in the form of charts, curves, and the like, in combination with other visualization tools, so that the training process and results are displayed more intuitively. E-mail or instant messaging are also options; the manner of output can be chosen flexibly according to the user's needs and the actual scene, and this embodiment is not limited in this respect.
The interactive change interface that the management program provides for changing the training configuration information can likewise be implemented in a variety of ways. For example, a graphical user interface may be provided through which the user changes the training configuration information: the user enters new configuration parameters via interactive elements on the interface (e.g., text boxes, drop-down menus, check boxes) and triggers a save or submit operation to update the configuration information. Alternatively, the user may enter specific commands and parameters through a command line interface to submit the target training configuration information. An API may also be provided, which the user calls programmatically to pass new target configuration parameters to the interface and thereby update the configuration information. As yet another alternative, the target training configuration information may be stored in a file using a predefined configuration file format; when the configuration information needs to be changed, the user directly edits the target configuration file and saves the modified file to update the configuration information. One or more of the above interface implementations may be used in combination according to actual needs, which is not limited in this embodiment. Through this embodiment, the user can dynamically update the configuration through the change interface at any time during the training process.
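As a minimal in-process sketch of such a change interface (class and method names are assumptions for illustration): a call to the interface both signals that the initial configuration must be updated and carries the user-specified target configuration, matching the two steps described above.

```python
class ChangeInterface:
    """Minimal sketch of a change interface held by the management program."""

    def __init__(self, initial_config):
        self.config = dict(initial_config)   # initial training configuration
        self.update_requested = False
        self.target_config = None

    def request_change(self, target_config):
        # Invoked when the user calls the change interface; the call itself
        # is the signal that the initial configuration needs updating, and
        # the call request carries the target training configuration.
        self.update_requested = True
        self.target_config = dict(target_config)

    def apply(self):
        # Merge the user-specified target configuration into the current one.
        if self.update_requested:
            self.config.update(self.target_config)
            self.update_requested = False
        return self.config
```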
In other examples, the management program may maintain decision rules pre-configured by the user;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model, and deciding whether to update the initial training configuration information based on the training states may include:
acquiring training states of the plurality of training tasks for distributed training of the target model, executing a maintained decision rule pre-configured by the user, and determining whether the training states of the target model are matched with training states designated by the user and contained in the decision rule;
if the training state of the target model is matched with the training state specified by the user and contained in the decision rule, determining that the initial training configuration information needs to be updated;
the obtaining the target training configuration information specified by the user may include:
and acquiring target training configuration information which is contained in the decision rule and designated by a user.
In practical applications, the decision rules may be configured by the user according to his own needs and experience, for example by specifying a particular training state, performance index, or other trigger condition. As an example, a decision rule may be: the computation time of one iteration exceeds a threshold, with the corresponding target training configuration information being to reduce the current batch size by a set proportion. Here, the batch size is the number of samples used in one iteration, i.e., the number of samples processed each time the model parameters are updated, and the set proportion is merely an example that can be configured as needed in practical applications. Another decision rule may be: the gradient descent speed falls below a threshold, with the corresponding target training configuration information being to increase the current learning rate by a set proportion; again, the set proportion is merely an example. In practical applications, a plurality of different decision rules may be configured as needed, and this embodiment is not limited in this respect. When the training state of the target model matches a user-specified decision rule, the corresponding operation, such as updating the initial training configuration information, can be performed. As can be seen from the above, the user can submit decision rules in advance, so that the management program monitors the training state of the model during training and automatically decides whether an update is needed. The user therefore does not need to continuously watch the training state: the management program monitors, decides, and updates automatically, which is convenient for the user and improves training efficiency.
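The two sample rules above can be sketched as condition/target-configuration pairs evaluated against the training state; the thresholds, scaling factors, and state field names below are placeholders, not values from the embodiment:

```python
def evaluate_rules(state, rules):
    """Match the training state against user-configured decision rules.

    Each rule pairs a trigger `condition` with the `target_config`
    it specifies; all matching rules contribute to the update.
    """
    updates = {}
    for rule in rules:
        if rule["condition"](state):
            updates.update(rule["target_config"](state))
    return updates

rules = [
    {   # one iteration takes longer than a threshold -> shrink batch size
        "condition": lambda s: s["iter_seconds"] > 2.0,
        "target_config": lambda s: {"batch_size": int(s["batch_size"] * 0.5)},
    },
    {   # gradient descends more slowly than a threshold -> raise learning rate
        "condition": lambda s: s["grad_speed"] < 1e-3,
        "target_config": lambda s: {"learning_rate": s["learning_rate"] * 1.5},
    },
]
```

An empty result means no rule matched, i.e., the management program decides that no update is needed.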
In some examples, the management program may provide a configuration interface for configuring the decision rules; the method may further include:
and responding to a received call request initiated by a user and aiming at the configuration interface, acquiring a decision rule configured by the user and contained in the call request, and locally maintaining the decision rule to complete the configuration of the decision rule.
In this embodiment, the configuration interface for decision rules provided by the management program may likewise be implemented in various ways: for example, the graphical user interface described above may be provided for the user to configure decision rules, the user may enter specific commands and parameters through a command line interface to submit decision rules, or an API may be provided that the user calls programmatically to pass decision rules to the interface. One or more of these interface implementations may be used in combination according to actual needs, which is not limited in this embodiment. The management program obtains the decision rules configured by the user and maintains them locally to complete the configuration. Through this embodiment, a user can conveniently submit decision rules to the management program.
In some examples, the training configuration information may include a combination of one or more of the following: training data sets, model parameters, optimization strategies, and super parameters.
The training data set is the data set used to train the machine learning model. It contains training samples, each of which contains data and, for supervised learning tasks, a corresponding label. The choice of training data set has an important impact on the training and performance of the model: a reasonable choice helps ensure that the model generalizes well and is representative of the expected application scenes. Optionally, if the target training configuration information specified by the user is a target training data set, samples may be added to or deleted from the initial training data set, the consumption order of the initial training data set may be adjusted, or part of the initial training data set may be skipped, and this embodiment is not limited in this respect.
Model parameters refer to adjustable quantities inside the model, which determine the learning and representation capabilities of the model. During training, the model minimizes the loss function by adjusting these parameters so that the model can better fit the training data and make predictions. Alternatively, if the target training configuration information specified by the user is a model parameter, the target training configuration information may be obtained from a target model parameter file of the target turn specified by the user.
The optimization strategy refers to the methods and algorithms used during training to update the model parameters so as to minimize the loss function. For example, the optimization strategies may include one or more of gradient descent, stochastic gradient descent, Adam, parallelization methods, GPU memory optimization strategies, or operator optimization strategies. The selection and adjustment of the optimization strategy has a large influence on the convergence speed, stability, and performance of the model, and different optimization strategies may be suitable for different types of models and problems. Optionally, the management program of this embodiment may offer the user multiple optimization strategies and receive the one or more strategies the user selects.
Hyper-parameters refer to parameters that need to be manually set prior to model training, which control some important aspects of the model training process, such as learning rate, regularization parameters, batch size (batch size), etc. The choice of superparameter may be determined empirically, experimentally, etc., and different superparameter settings may result in different training results and performance. Adjusting the hyper-parameters may improve the generalization ability and robustness of the model. Alternatively, the user may configure a new hyper-parameter.
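The four kinds of training configuration information just described can be grouped into a single structure; the field names and default values below are assumptions for illustration, not part of the embodiment:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrainingConfig:
    """Sketch of the four kinds of training configuration information."""
    dataset_path: str = "data/train"        # training data set
    checkpoint_round: Optional[int] = None  # model parameters, chosen by round
    optimizer: str = "adam"                 # optimization strategy
    hyperparams: dict = field(default_factory=lambda: {
        "learning_rate": 1e-3,
        "batch_size": 256,
    })

    def updated(self, **target):
        """Return a copy with user-specified target fields overridden.

        Shallow copy: the original configuration is left untouched, which
        matches updating initial configuration to target configuration.
        """
        merged = {**self.__dict__, **target}
        return TrainingConfig(**merged)
```

A user would then specify any one kind, or a combination, as the target training configuration information, e.g. `cfg.updated(optimizer="sgd")`.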
The embodiment provides the four training configuration information, so that a user can designate any one or a combination of multiple kinds as target training configuration information to be adjusted according to the need in the distributed training process of the model, thereby meeting the adjustment requirement of the user.
In practical applications, the user may configure the target training configuration information to include one or more of the above as needed, which is not limited in this embodiment. After acquiring the target training configuration information, the management program can notify each training task, triggering each training task to update its own initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model. In some examples, each training task may implement a module for updating its own training configuration information; building on the foregoing embodiments of the various kinds of training configuration information, each training task may further include a module per kind, for example a module for updating model parameters, a module for updating training data, a module for updating the optimization strategy, and a module for updating hyper-parameters, as well as other modules such as a synchronization module for communication between training tasks, which is not limited in this embodiment.
In practical applications, the method of this embodiment can be executed multiple times during model training; that is, the initial training configuration information can be updated multiple times. For example, the target model starts training based on the user's configuration, and at some point the management program decides an update is needed: the management program acquires the specified target training configuration information for the first time, the training tasks pause, update their initial training configuration information to the target training configuration information, and then rerun with the current training configuration information to continue the distributed training. After a period of time, the management program may decide that another update is needed; for this second update the management program again acquires the target training configuration information specified for it, the training tasks again pause, rerun with the updated training configuration information, and continue the distributed training, and so on.
As an example, the management program may broadcast a message characterizing the target training configuration information, or may send the message to the main task, which forwards it to the other training tasks; on receiving the message, each training task can determine that the training configuration information needs to be updated. For example, each training task may pause after receiving the message, then acquire the target training configuration information and rerun. In some examples, the message sent by the management program contains the target training configuration information itself, and each training task obtains it directly from the message; for example, optimization strategies and hyper-parameters have a small data size and may be carried directly in the message. In other examples, the message contains a storage address of the target training configuration information; after receiving the message, a training task determines that the configuration needs to be updated (i.e., the message triggers the task to pause), reads the storage address from the message, and obtains the target training configuration information from that address before rerunning. For example, training data sets and model parameters have a large data size, so passing a storage address is appropriate for them. The two approaches may be used separately or in combination and can be flexibly configured as needed in practical applications, which is not limited in this embodiment.
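The inline-versus-address choice above can be sketched as a size check when building the message; the threshold and the `store` helper below are hypothetical, introduced only to make the example self-contained:

```python
import json

def store(payload):
    # Hypothetical helper: persist the payload to shared storage and
    # return its location; a real system would write to a shared file
    # system or object store.
    return "/shared/configs/update.json"

def build_update_message(target_config, inline_threshold=1024):
    """Build the update message characterizing the target configuration.

    Small payloads (e.g. optimization strategy, hyper-parameters) are
    embedded directly; large ones (e.g. data sets, model parameters)
    are referenced by storage address.
    """
    payload = json.dumps(target_config)
    if len(payload) <= inline_threshold:
        return {"type": "inline", "config": target_config}
    return {"type": "by_address", "address": store(payload)}
```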
In some examples, the plurality of training tasks may include a main task;
if the target training configuration information is a training data set of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun may include:
transmitting the data storage location of the training data set to the main task, the main task acquiring the training data set based on the data storage location and issuing it to the plurality of training tasks respectively, so as to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun.
In this embodiment, when the target training configuration information includes a training data set of the target model, the management program may acquire the data storage location of the training data set and issue it to the main task; the main task may acquire the training data set based on the data storage location and then issue it to each training task. Here, the main task issuing the training data set to the training tasks covers both the case in which the main task sends the training data set itself directly to each training task and the case in which the main task sends the data storage location of the training data set, each training task then acquiring the training data set based on the received location.
On receiving the training data set, the training tasks can determine that an adjustment is currently required, update their corresponding initial training configuration information based on the target training configuration information, and then rerun; for example, each training task may take the training data subset it needs from the received training data set. Optionally, the main task may instead determine the training data subset corresponding to each training task (its own subset and those of the other training tasks) and issue the subsets to the training tasks; this includes the case in which the main task sends each training task the data identifier of its training data subset, and the training task then obtains its subset from the data storage location based on the received identifier. The subsets may be determined in various ways: for example, the target training configuration information may specify a division method for the training data set, or the main task may use a division method set in the initial training configuration information, which is not limited in this embodiment. In this way, each training task can access the storage location of the training data set and acquire only the subset it needs, which improves the data acquisition efficiency of the training tasks.
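One simple division method of the kind mentioned above is a round-robin split of sample identifiers across tasks; this is only an illustration, since the embodiment leaves the division method to the training configuration:

```python
def partition_dataset(sample_ids, num_tasks):
    """Divide a training data set into one subset per training task.

    Round-robin assignment keeps the subsets balanced; the actual
    division method may instead be specified in the target or initial
    training configuration information.
    """
    subsets = [[] for _ in range(num_tasks)]
    for i, sample_id in enumerate(sample_ids):
        subsets[i % num_tasks].append(sample_id)
    return subsets
```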
In some examples, the plurality of training tasks includes a main task; the distributed training may include multiple rounds of training;
if the target training configuration information is a model parameter of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun may include:
acquiring the storage location of the target model parameter file corresponding to a target round specified by the user from among the plurality of rounds;
and transmitting the file storage location of the target model parameter file to the main task, the main task accessing the target model parameter file based on the file storage location, reading the target model parameters recorded therein, and then issuing the target model parameters to each of the training tasks, so that each training task, in response to receiving the target model parameters, updates the parameters of the target model it has configured to the received target model parameters and then reruns.
In the machine learning field, a model parameter file is a file that stores the parameters of a model in training and the state of its optimizer, for example a checkpoint file. During model training, the current state of the model can be saved periodically into a model parameter file; the parameters include weights, bias values, and the like, and the optimizer state includes the learning rate, momentum, and so on. In this way, when training is interrupted unexpectedly or needs to be paused, the model parameter file can be used to restore the state of the model and continue training from the saved position rather than restarting. A model parameter file typically contains the structure and parameter values of the model, as well as other information related to training, such as the history of the loss function, the number of training steps, and validation results. By loading the model parameter file, the state of the whole model can be restored so that training, inference, or other operations can continue.
During model training, the state of the current model is saved periodically to a model parameter file; the specific implementation may differ across deep learning frameworks. As an example, the automatic saving function provided by a machine learning framework may be used, or the saving frequency and manner of the model parameter file may be customized in code; that is, the storage frequency of the model parameter file should be adjusted according to the actual situation. Saving is typically performed per training round (epoch): at the end of each round, the current state of the model can be stored as a model parameter file.
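The per-round save/restore cycle can be sketched as below. Real frameworks store binary tensors; JSON is used here only to keep the sketch self-contained, and the file-naming scheme is an assumption:

```python
import json
import os

def save_checkpoint(round_idx, model_params, optimizer_state, out_dir):
    """Save one round's model parameter file (minimal checkpoint sketch).

    Stores the round number, model parameters, and optimizer state so
    training can later resume from this round.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"checkpoint_round_{round_idx}.json")
    with open(path, "w") as f:
        json.dump({"round": round_idx,
                   "params": model_params,
                   "optimizer": optimizer_state}, f)
    return path

def load_checkpoint(path):
    """Restore the saved state from a model parameter file."""
    with open(path) as f:
        return json.load(f)
```

A main task handed the file storage location of a target round's checkpoint would call `load_checkpoint` and distribute the `params` entry to the training tasks.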
In this embodiment, a user may specify a target round from the multiple rounds, the target model parameter file of that round being the target training configuration information. The management program can acquire the file storage location of the target model parameter file and send it to the main task; only the main task accesses the target model parameter file based on the storage location and reads the target model parameters recorded in the file, after which it issues the acquired target model parameters to each training task. Here, the main task issuing the target model parameters to the training tasks covers both the case in which the main task sends the target model parameters directly to each training task and the case in which the main task stores the acquired target model parameters and sends their storage location to each training task, each training task then acquiring the target model parameters based on the received location.
Therefore, this embodiment does not require every training task to access the target model parameter file, which improves data reading efficiency; at the same time, it avoids the conflicts that could arise from multiple training tasks accessing one file simultaneously.
As can be seen from the above embodiments, because a management program is provided to manage the training tasks, it can support dynamic updating of the training configuration information by the user during distributed training, without restarting the training process when an update is needed; this facilitates and accelerates the development and training of models, reduces the idle time of training resources, and improves training efficiency.
Corresponding to the foregoing embodiments of the training management method for distributed model training, this specification also provides embodiments of a training management apparatus for distributed model training and of a computer device to which the training management apparatus is applied.
Embodiments of the training management apparatus for distributed model training of this specification may be applied to a computer device, such as a server or a terminal device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the computer device where it is located reading corresponding computer program instructions from a nonvolatile memory into memory. In terms of hardware, fig. 3 shows a hardware structure diagram of the computer device where the training management apparatus 331 for distributed model training is located; in addition to the processor 310, memory 330, network interface 320, and nonvolatile memory 340 shown in fig. 3, the computer device may generally include other hardware according to its actual function, which is not described here again.
As shown in fig. 4, fig. 4 is a block diagram of a training management apparatus for distributed model training according to an exemplary embodiment of this specification; the apparatus is applied to a management program for managing a plurality of training tasks that execute distributed training, the plurality of training tasks each performing distributed training on a target model based on initial training configuration information configured by a user. The apparatus includes:
the decision module 41, which acquires, during the distributed training of the target model, the training states of the plurality of training tasks performing distributed training on the target model, and decides based on the training states whether the initial training configuration information needs to be updated;
and the triggering module 42, which, when it is decided that the initial training configuration information needs to be updated, acquires target training configuration information specified by the user and issues the target training configuration information to the plurality of training tasks respectively, so as to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model.
In some examples, the plurality of training tasks includes a main task;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model includes:
and acquiring training states of the plurality of training tasks sent by the main task for distributed training aiming at the target model.
In some examples, the management program provides a change interface for changing the training configuration information;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model, and deciding whether the initial training configuration information needs to be updated based on the training states, includes:
acquiring training states of the plurality of training tasks for distributed training of the target model, outputting the training states to a user, and determining that the initial training configuration information needs to be updated in response to a received call request of the user for the change interface initiated based on the output training states;
the obtaining the target training configuration information specified by the user comprises the following steps:
and acquiring target training configuration information which is contained in the call request and designated by a user.
In some examples, the management program maintains decision rules pre-configured by the user;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model, and deciding whether the initial training configuration information needs to be updated based on the training states, includes:
acquiring training states of the plurality of training tasks for distributed training of the target model, executing a maintained decision rule pre-configured by the user, and determining whether the training states of the target model are matched with training states designated by the user and contained in the decision rule;
if the training state of the target model is matched with the training state specified by the user and contained in the decision rule, determining that the initial training configuration information needs to be updated;
the obtaining the target training configuration information specified by the user comprises the following steps:
and acquiring target training configuration information which is contained in the decision rule and designated by a user.
In some examples, the management program provides a configuration interface for configuring the decision rules;
the method further comprises the steps of:
and responding to a received call request initiated by a user and aiming at the configuration interface, acquiring a decision rule configured by the user and contained in the call request, and locally maintaining the decision rule to complete the configuration of the decision rule.
In some examples, the training configuration information includes a combination of one or more of the following: training data sets, model parameters, optimization strategies, and super parameters.
In some examples, the plurality of training tasks includes a main task;
if the target training configuration information is a training data set of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun includes:
transmitting the data storage location of the training data set to the main task, the main task acquiring the training data set based on the data storage location and issuing it to the plurality of training tasks respectively, so as to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun.
In some examples, the plurality of training tasks includes a main task; the distributed training includes multiple rounds of training;
if the target training configuration information is a model parameter of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger the plurality of training tasks to each update their corresponding initial training configuration information based on the target training configuration information and then rerun includes:
acquiring the storage location of the target model parameter file corresponding to a target round specified by the user from among the plurality of rounds;
and transmitting the file storage location of the target model parameter file to the main task, the main task accessing the target model parameter file based on the file storage location, reading the target model parameters recorded therein, and then issuing the target model parameters to each of the training tasks, so that each training task, in response to receiving the target model parameters, updates the parameters of the target model it has configured to the received target model parameters and then reruns.
In some examples, the target model includes: a neural-network-based deep learning model, or a pre-trained model.
The implementation process of the functions and roles of each module in the above training management apparatus for distributed model training is detailed in the implementation process of the corresponding steps in the above training management method for distributed model training, and is not described here again.
Correspondingly, the embodiments of the present specification also provide a distributed training system, comprising a management program and a plurality of training tasks, wherein the management program is configured to manage the plurality of training tasks for performing distributed training, and the plurality of training tasks each perform distributed training on a target model based on initial training configuration information configured by a user;
when executed by a processor, the management program implements the steps of the foregoing training management method embodiments for distributed model training.
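The interaction the system implies — observe training states, decide whether the configuration needs updating, and issue new configuration if so — can be sketched as a single management-loop step. This is a minimal sketch under assumed names (`manage_step`, the `(predicate, config)` rule tuple); the patent does not prescribe this API.

```python
def manage_step(training_states, decision_rule, issue_config):
    """One pass of the management program.

    training_states -- latest states reported by the training tasks
    decision_rule   -- (matches(states) -> bool, target_config) pair
    issue_config    -- callback delivering new config to the tasks
    """
    matches, target_config = decision_rule
    if matches(training_states):
        # Decision: the initial configuration needs updating; issue the
        # user-specified target configuration to the training tasks.
        issue_config(target_config)
        return True
    # Otherwise training continues with the current configuration.
    return False


# Example rule: if the loss improvement has stalled, lower the learning rate.
rule = (lambda s: s["loss_delta"] < 1e-4, {"learning_rate": 1e-4})
issued = []
changed = manage_step({"loss_delta": 5e-5}, rule, issued.append)
assert changed and issued == [{"learning_rate": 1e-4}]
```

Running this step periodically, or on each state report from the master task, gives the monitor-decide-issue loop described by the method claims.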
Accordingly, embodiments of the present specification also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the training management method embodiments of distributed model training described above.
Accordingly, the embodiments of the present specification further provide a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the training management method embodiment of the distributed model training when the processor executes the program.
Accordingly, the present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a training management method embodiment of distributed model training.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art can understand and implement them without undue effort.
The above-described embodiments may be applied to one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer device may be any electronic product that can interact with a user in a human-computer manner, such as a personal computer, tablet computer, smart phone, personal digital assistant (Personal Digital Assistant, PDA), game console, interactive internet protocol television (Internet Protocol Television, IPTV), smart wearable device, etc.
The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing (Cloud Computing).
The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above division of method steps is made only for clarity of description; when implemented, the steps may be combined into one step or split into multiple steps, and as long as the same logical relationship is preserved, they fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or flow without altering its core design likewise falls within the protection scope of this application.
Reference to "a specific example", "some examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present description. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (13)

1. A training management method for distributed model training, applied to a management program, wherein the management program is configured to manage a plurality of training tasks for performing distributed training, and the plurality of training tasks each perform distributed training on a target model based on initial training configuration information configured by a user; the method comprises the following steps:
in the process of carrying out distributed training on the target model, acquiring training states of the plurality of training tasks for carrying out distributed training on the target model, and deciding whether the initial training configuration information needs to be updated or not based on the training states;
if so, acquiring target training configuration information specified by the user, and issuing the target training configuration information to each of the plurality of training tasks, to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model.
2. The method of claim 1, the plurality of training tasks comprising a master task;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model includes:
and acquiring training states of the plurality of training tasks sent by the main task for distributed training aiming at the target model.
3. The method of claim 1, the management program providing a change interface for changing the training configuration information;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model, and deciding whether the initial training configuration information needs to be updated based on the training states, includes:
acquiring the training states of the plurality of training tasks for the distributed training of the target model, outputting the training states to a user, and, in response to receiving a call request for the change interface initiated by the user based on the output training states, determining that the initial training configuration information needs to be updated;
the obtaining the target training configuration information specified by the user comprises the following steps:
and acquiring target training configuration information which is contained in the call request and designated by a user.
4. The method of claim 1, the management program maintaining decision rules pre-configured by a user;
the obtaining the training states of the plurality of training tasks for the distributed training of the target model, and deciding whether the initial training configuration information needs to be updated based on the training states, includes:
acquiring the training states of the plurality of training tasks for the distributed training of the target model, executing the maintained decision rule pre-configured by the user, and determining whether the training state of the target model matches the user-designated training state contained in the decision rule;
if the training state of the target model matches the user-designated training state contained in the decision rule, determining that the initial training configuration information needs to be updated;
the obtaining the target training configuration information specified by the user comprises the following steps:
and acquiring target training configuration information which is contained in the decision rule and designated by a user.
5. The method of claim 4, the management program providing a configuration interface for configuring the decision rule;
the method further comprises the steps of:
in response to receiving a call request initiated by a user for the configuration interface, acquiring the user-configured decision rule contained in the call request, and maintaining the decision rule locally to complete the configuration of the decision rule.
6. The method of claim 1, the training configuration information comprising a combination of one or more of the following: training data sets, model parameters, optimization strategies, and super parameters.
7. The method of claim 6, the plurality of training tasks comprising a master task;
if the target training configuration information is a training data set of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun comprises:
sending the data storage location of the training data set to the master task; acquiring, by the master task, the training data set based on the data storage location, and sending the training data set to each of the plurality of training tasks, to trigger each training task to update its initial training configuration information based on the target training configuration information and then rerun.
8. The method of claim 6, the plurality of training tasks comprising a master task, and the distributed training comprising a plurality of training rounds;
if the target training configuration information is the model parameters of the target model, the issuing the target training configuration information to the plurality of training tasks to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun comprises:
acquiring the storage location of the target model parameter file corresponding to a user-designated target round among the plurality of rounds;
sending the file storage location of the target model parameter file to the master task; accessing, by the master task, the target model parameter file based on the file storage location, and acquiring the target model parameters recorded in the file; and sending the target model parameters to each of the training tasks, so that each training task, in response to receiving the target model parameters, updates the parameters of the target model it has configured to the received target model parameters and then reruns.
9. The method of claim 1, the target model comprising: a neural-network-based deep learning model, or a pre-trained model.
10. A training management device for distributed model training, the device being applied to a management program, the management program being configured to manage a plurality of training tasks for performing distributed training, the plurality of training tasks each performing distributed training on a target model based on initial training configuration information configured by a user; the device comprises:
a decision module, configured to, during the distributed training of the target model, acquire the training states of the plurality of training tasks for the distributed training of the target model, and decide, based on the training states, whether the initial training configuration information needs to be updated;
a triggering module, configured to, when it is decided that the initial training configuration information needs to be updated, acquire target training configuration information specified by the user and issue the target training configuration information to each of the plurality of training tasks, to trigger each training task to update its corresponding initial training configuration information based on the target training configuration information and then rerun, so as to continue the distributed training of the target model.
11. A distributed training system, comprising a management program and a plurality of training tasks, wherein the management program is configured to manage the plurality of training tasks for performing distributed training, and the plurality of training tasks each perform distributed training on a target model based on initial training configuration information configured by a user;
the management program, when executed by a processor, implements the steps of the method of any one of claims 1 to 9.
12. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
13. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of claims 1 to 9.
CN202311346413.0A 2023-10-17 2023-10-17 Training management method and related device for distributed model training Pending CN117407713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311346413.0A CN117407713A (en) 2023-10-17 2023-10-17 Training management method and related device for distributed model training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311346413.0A CN117407713A (en) 2023-10-17 2023-10-17 Training management method and related device for distributed model training

Publications (1)

Publication Number Publication Date
CN117407713A true CN117407713A (en) 2024-01-16

Family

ID=89497417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311346413.0A Pending CN117407713A (en) 2023-10-17 2023-10-17 Training management method and related device for distributed model training

Country Status (1)

Country Link
CN (1) CN117407713A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051779A (en) * 2024-04-12 2024-05-17 清华大学 Automatic parameter searching method and device for large model training and electronic equipment


Similar Documents

Publication Publication Date Title
CN109816116B (en) Method and device for optimizing hyper-parameters in machine learning model
CN117407713A (en) Training management method and related device for distributed model training
CN112051993B (en) Method, device, medium and equipment for generating state machine template and processing task
CN108564164B (en) Parallel deep learning method based on SPARK platform
US20180018178A1 (en) Mobile game data processing method and apparatus
CN111427675B (en) Data processing method and device and computer readable storage medium
US20170047069A1 (en) Voice processing method and device
US20180095801A1 (en) Background Job Processing Framework
CN111143039B (en) Scheduling method and device of virtual machine and computer storage medium
EP4104424A1 (en) Preloading of applications and in-application content in user devices
CN103399787B (en) A kind of MapReduce operation streaming dispatching method and dispatching patcher calculating platform based on Hadoop cloud
CN111209077A (en) Deep learning framework design method
CN110782004A (en) Model training method, model calling equipment and readable storage medium
US20220280867A1 (en) Server load prediction and advanced performance measures
US8694998B2 (en) Cancellable command application programming interface (API) framework
AU2023208220A1 (en) Controlling peer-tournament client operation with segmentation of clients
CN114979029B (en) Control method, device, equipment and storage medium of virtual robot
CN117762591A (en) Task control method, task control device, computer device, and storage medium
CN109062650A (en) Barrage message distributing method, device, equipment and storage medium
CN111381976B (en) Method and device for updating message prompt data, storage medium and computer equipment
CN116795524A (en) Task processing method, device, computer equipment, storage medium and program product
CN115026831A (en) Robot operation parameter updating method, device, equipment and storage medium
CN115002495A (en) Animation processing method and device
CN113848752A (en) Distributed real-time simulation method
CN113254200A (en) Resource arrangement method and intelligent agent

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination