CN114780225A - Distributed model training system, method and device

Distributed model training system, method and device

Info

Publication number
CN114780225A
Authority
CN
China
Prior art keywords
node
model training
working
resource
training task
Prior art date
Legal status
Granted
Application number
CN202210668952.5A
Other languages
Chinese (zh)
Other versions
CN114780225B (en)
Inventor
王勤龙
桑波
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210668952.5A priority Critical patent/CN114780225B/en
Publication of CN114780225A publication Critical patent/CN114780225A/en
Application granted granted Critical
Publication of CN114780225B publication Critical patent/CN114780225B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — Physics; G06 — Computing; Calculating or Counting; G06F — Electric digital data processing; G06F9/46 — Multiprogramming arrangements
    • G06F9/4843 — Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/505 — Allocation of resources to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the load
    • G06F9/5083 — Techniques for rebalancing the load in a distributed system


Abstract

The distributed model training system comprises a node allocation unit, a resource prediction unit, working nodes and a parameter server. The node allocation unit sends a node resource prediction request to the resource prediction unit according to a model training task; the resource prediction unit, in response to the request, predicts the resources required to execute the model training task according to historical tasks and determines the number of nodes from the predicted resources. The node allocation unit then determines the working nodes according to the number of nodes and distributes the model training task to each of them, so that the parameter server cooperates with the working nodes to execute the task. In this way, the resource prediction unit automatically determines the number of working nodes and the node allocation unit automatically determines the working nodes so that they begin executing the model training task; the user does not need to manually configure working nodes before model training starts, and the speed of model training is improved.

Description

Distributed model training system, method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a distributed model training system, method, and apparatus.
Background
With the development of information technology, deep learning, which automatically learns effective feature representations from data and thereby improves the accuracy of prediction models, has been widely applied in fields such as speech recognition, image recognition and object detection. To further improve the accuracy of trained models, the number of training samples keeps increasing, which lengthens model training. To address this, multiple working nodes can train the same model simultaneously, reducing the training time and increasing the speed of model training.
However, because different model structures and different training data sets place very different resource demands on the working nodes, it is difficult to accurately allocate the resources the working nodes require before model training starts, which in turn reduces the speed of model training.
Disclosure of Invention
The present specification provides a distributed model training system, method and apparatus that at least partially solve the above problems in the prior art.
The technical solution adopted by the present specification is as follows:
The present specification provides a distributed model training system, the system comprising a node allocation unit, a resource prediction unit, working nodes and a parameter server;
the node allocation unit is configured to send a node resource prediction request to the resource prediction unit according to a model training task; determine, from a distributed system and according to the number of nodes sent by the resource prediction unit, that number of working nodes for executing the model training task; and distribute the model training task to each determined working node;
the resource prediction unit is configured to, in response to the node resource prediction request, predict the resources required to execute the model training task according to historical tasks, determine the number of nodes according to the predicted resources, and send the number of nodes to the node allocation unit;
each working node is configured to receive the model training task distributed by the node allocation unit and the model parameters sent by the parameter server; determine a model gradient according to a pre-stored model structure, the model parameters and the distributed model training task; and send the model gradient to the parameter server;
the parameter server is configured to receive the model gradients sent by the working nodes, update the model parameters it stores according to those gradients, and return the updated model parameters to the working nodes.
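For illustration only, the following minimal Python sketch outlines how the responsibilities of these four components could be expressed; every class, method and field name here is an assumption introduced for this sketch, not an identifier from this specification.

```python
from dataclasses import dataclass

@dataclass
class NodePlan:
    num_workers: int        # number of working nodes to start
    worker_resources: dict  # e.g. {"cpu": 4, "memory_gb": 8}

class ResourcePredictionUnit:
    def predict(self, task) -> NodePlan:
        """Predict required resources from historical tasks, then derive a node count."""
        raise NotImplementedError

class NodeAllocationUnit:
    def __init__(self, predictor: ResourcePredictionUnit, cluster):
        self.predictor = predictor
        self.cluster = cluster

    def launch(self, task):
        plan = self.predictor.predict(task)         # node resource prediction request
        workers = self.cluster.start_workers(plan)  # determine the working nodes
        for worker in workers:
            worker.assign(task)                     # distribute the model training task
        return workers
```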
Optionally, the resource prediction unit is configured to: in response to the resource prediction request sent by the node allocation unit, predict a first resource usage amount required by a master working node to execute the model training task, according to the historical resource usage corresponding to historical tasks looked up in a historical database, and send the first resource usage amount to the node allocation unit; and receive the specified resource parameters fed back by the master working node as the resources required to execute the model training task;
and the node allocation unit is configured to determine the master working node in the distributed system according to the first resource usage amount and allocate the model training task to it, so that the master working node executes the model training task and feeds the specified resource parameters back to the resource prediction unit.
Optionally, the resource prediction unit is configured to determine the time consumed by the master working node according to the specified resource parameters it feeds back; determine the number of slave working nodes according to that time consumption; and send the number of slave working nodes to the node allocation unit;
and the node allocation unit is configured to determine, in the distributed system and according to the received number, the slave working nodes that execute the model training task, and distribute the model training task to each slave working node.
Optionally, the parameter server is configured to monitor, as a first load, its load in processing the model gradient sent by the master working node, and send the first load to the resource prediction unit;
and the resource prediction unit is configured to receive the first load sent by the parameter server, and determine the number of slave working nodes according to the master working node's time consumption and the first load.
Optionally, the resource prediction unit is configured to determine the resource usage amount of the slave working nodes according to the specified resource parameters fed back by the master working node, and send that resource usage amount to the node allocation unit;
the node allocation unit is configured to distribute the model training task and the slave working nodes' resource usage amount to each determined working node;
and each working node is configured to determine the resources allocated for executing the model training task according to the slave working nodes' resource usage amount.
This specification provides a distributed model training method, including:
in response to a node resource prediction request sent by a node allocation unit, predicting the resources required to execute the model training task according to historical tasks;
determining the number of nodes according to the predicted resources;
and sending the number of nodes to the node allocation unit, so that the node allocation unit determines, from a distributed system and according to that number, the working nodes for executing the model training task and distributes the model training task to each determined working node, wherein the working nodes perform distributed model training together with a parameter server according to the model training task.
Optionally, predicting the resources required to execute the model training task according to historical tasks specifically includes:
searching a historical database for the historical resource usage corresponding to historical tasks;
in response to a first node resource prediction request sent by the node allocation unit, predicting a first resource usage amount required by a master working node to execute the model training task according to the historical resource usage;
and sending the first resource usage amount to the node allocation unit, so that the node allocation unit determines, in the distributed system and according to the first resource usage amount, a master working node for executing the model training task and allocates the model training task to it, so that the master working node executes the task and feeds back specified resource parameters.
Optionally, determining the number of nodes according to the predicted resource includes:
receiving the specified resource parameters fed back by the master working node as the resources required to execute the model training task, wherein the specified resource parameters include the actual resource usage and the time consumption of the master working node when executing the model training task;
acquiring a first load sent by the parameter server, the first load representing the actual resources used by the parameter server to process the model gradient sent by the master working node;
in response to a second node resource prediction request sent by the node allocation unit, determining the number of slave working nodes according to the time consumption of the master working node and the first load;
and determining the resource usage of each slave working node according to the actual resource usage of the master working node.
Optionally, the method further comprises:
acquiring a second load sent by the parameter server, the second load representing the resources used by the parameter server to process the model gradients respectively sent by the master working node and each slave working node;
and in response to a third node resource prediction request sent by the node allocation unit, determining an adjusted number of slave working nodes according to the second load and sending it to the node allocation unit, so that the node allocation unit adjusts the current number of slave working nodes executing the model training task according to the adjusted number.
This specification provides a distributed model training method, including:
sending a node resource prediction request to a resource prediction unit according to a model training task, so that the resource prediction unit, in response to the request, predicts the resources required to execute the model training task according to historical tasks and determines the number of nodes according to the predicted resources;
and determining, from a distributed system and according to the number of nodes sent by the resource prediction unit, that number of working nodes for executing the model training task, and distributing the model training task to each determined working node, wherein the working nodes perform distributed model training together with a parameter server according to the model training task.
Optionally, determining, from the distributed system and according to the number of nodes sent by the resource prediction unit, the working nodes for executing the model training task includes:
starting a master working node for executing the model training task in the distributed system, according to the resource usage of the master working node sent by the resource prediction unit;
determining whether the master working node executes the model training task;
and if not, adjusting the resource usage of the master working node and restarting it with the adjusted resource usage, until the master working node executes the model training task.
Optionally, determining, from the distributed system and according to the number of nodes sent by the resource prediction unit, the working nodes for executing the model training task specifically includes:
receiving the adjusted number of slave working nodes sent by the resource prediction unit;
obtaining the current number of slave working nodes executing the model training task;
determining whether the current number of slave working nodes equals the adjusted number;
and if not, adjusting the current number of slave working nodes according to the adjusted number, as illustrated by the sketch below.
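For illustration only, a minimal Python sketch of this reconciliation step; the cluster API names are assumptions, not part of this specification.

```python
def reconcile_slave_workers(cluster, target_count: int):
    """Scale the running slave workers toward the count sent by the
    resource prediction unit (illustrative sketch)."""
    current = cluster.running_slave_workers()
    if len(current) == target_count:
        return
    if len(current) < target_count:
        cluster.start_slave_workers(target_count - len(current))
    else:
        for worker in current[target_count:]:
            cluster.stop_worker(worker)  # release surplus resources
```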
This specification provides a distributed model training device, including:
the resource prediction module is used for responding to a node resource prediction request sent by the node allocation unit and predicting resources required by executing the model training task according to historical tasks;
the node quantity determining module is used for determining the quantity of the nodes according to the predicted resources;
and the sending module is used for sending the number of nodes to the node allocation unit, so that the node allocation unit determines, from a distributed system and according to that number, the working nodes for executing the model training task and distributes the model training task to each determined working node, the working nodes performing distributed model training together with a parameter server according to the model training task.
This specification provides a distributed model training device, including:
the request module is used for sending a node resource prediction request to a resource prediction unit according to a model training task so that the resource prediction unit responds to the node resource prediction request, predicts resources required by executing the model training task according to historical tasks, and determines the number of nodes according to the predicted resources;
and the working node determining module is used for determining each working node of the node number, which is used for executing the model training task, from the distributed system according to the node number sent by the resource prediction unit, and distributing the model training task to each determined working node, wherein each working node performs distributed model training together with the parameter server according to the model training task.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described distributed model training method.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the distributed model training method described above when executing the program.
The technical solution adopted by the present specification can achieve the following beneficial effects:
The distributed model training system comprises a node allocation unit, a resource prediction unit, working nodes and a parameter server. The node allocation unit sends a node resource prediction request to the resource prediction unit according to a model training task; the resource prediction unit, in response to the request, predicts the resources required to execute the model training task according to historical tasks and then determines the number of nodes from the predicted resources. The node allocation unit determines the working nodes according to the number of nodes and distributes the model training task to each of them, so that the parameter server cooperates with the working nodes to execute the task. In this way, the resource prediction unit automatically determines the number of working nodes, and the node allocation unit automatically determines the working nodes so that they begin executing the model training task; the user does not need to manually configure working nodes before model training starts, and the speed of model training is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of it, illustrate embodiments of the specification and together with the description serve to explain its principles; they are not intended to limit the specification. In the drawings:
FIG. 1A is a schematic diagram of a distributed deep learning system according to the present disclosure;
FIG. 1B is a schematic diagram of a distributed deep learning system according to the present disclosure;
FIG. 2 is a schematic diagram of a distributed model training system according to the present disclosure;
FIG. 3 is a schematic flow chart of a distributed model training method in the present specification;
FIG. 4 is a schematic flow chart diagram illustrating a distributed model training method according to the present disclosure;
FIG. 5 is a schematic flow chart diagram illustrating a distributed model training method according to the present disclosure;
FIG. 6 is a schematic diagram of a distributed model training apparatus provided herein;
FIG. 7 is a schematic diagram of a distributed model training apparatus provided herein;
FIG. 8 is a schematic diagram of an electronic device corresponding to FIG. 3 provided in the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present specification clearer, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments and the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the protection scope of this specification.
In addition, it should be noted that all actions of acquiring signals, information or data in this specification are performed in compliance with the applicable data protection laws and policies of the relevant jurisdiction, and with authorization from the owner of the corresponding device.
Deep learning is an important branch of computer science and artificial intelligence and a further extension of neural networks. By automatically learning effective feature representations from data, it improves the accuracy of prediction models, and it has been widely applied in fields such as speech recognition, image recognition and object detection.
As the scale of deep learning data rapidly expands, traditional model training — in which iterative computation can only use the hardware resources of the host where the current process runs — hits the limited extensibility of a single machine. Facing massive data and huge models, a single machine no longer suffices: the data or the model must be partitioned into multiple parts so that training can be accelerated across multiple machines using the hardware resources of different hosts. For this reason, executing model training jobs in parallel on multiple working nodes is widely used, and the way multiple working nodes are configured in a distributed model training system, as shown in fig. 1A, is of particular interest to those skilled in the relevant art. The distributed deep learning system deploys deep learning tasks of enormous computation and data volume onto multiple working nodes for parallel execution, improving the computational efficiency of deep learning. As shown in fig. 1B, the specific implementation steps are as follows:
S100: And configuring a plurality of working nodes and a parameter server.
Specifically, in order to increase the execution speed of the model training task, a plurality of working nodes are configured in the distributed deep learning system to execute the model training task in parallel. The working nodes can be deployed on a number of different machines and use those machines' hardware resources to execute the same model training task, so as to handle model training tasks with massive training samples and huge model structures.
In a model training task performed in a distributed deep learning system, a plurality of worker nodes deployed on a plurality of machines perform the model training task in parallel. The distributed mechanism of each working node may be data parallel or model parallel, which is not limited in this specification.
For the data parallel mechanism: data parallelism means that a single working node can hold the complete neural network structure of the model, but the model's training samples are so numerous that they must be partitioned and trained in parallel on multiple working nodes; that is, each working node receives only a preset number of training samples and trains the model on the samples allocated to it, as sketched below.
For the model parallel mechanism: model parallelism means that a single working node cannot hold the complete model structure, so the structure is partitioned and stored across multiple working nodes. Computation of a later part of the model must wait for the preceding computation to finish, so computation across the working nodes is effectively serial, although the working nodes do not otherwise interfere with one another.
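For illustration of the data-parallel mechanism, a minimal Python sketch of one possible sample split; the round-robin sharding here is an assumption, not a scheme fixed by this specification.

```python
def shard_samples(samples, num_workers: int, worker_rank: int):
    """Return the slice of the training set assigned to one worker
    (simple round-robin sharding, for illustration only)."""
    return samples[worker_rank::num_workers]

# Example: 3 workers each train the full model on a third of the data.
samples = list(range(9))
for rank in range(3):
    print(rank, shard_samples(samples, 3, rank))
# 0 [0, 3, 6] / 1 [1, 4, 7] / 2 [2, 5, 8]
```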
S102: and the parameter server sends the model parameters stored by the parameter server to each working node.
In practical applications, the distributed deep learning platform may adopt a parameter server architecture, which includes a number of parameter servers and a number of working nodes. The model parameters can be stored across several parameter servers in a sharded manner, each parameter server storing only part of the model parameters. When the parameter servers process the gradients sent by the working nodes and update the model parameters, the update is likewise sharded: each parameter server is responsible only for updating the part of the model parameters it stores. Of course, for a model training task with a smaller parameter scale, a single parameter server may be configured to store and update the model parameters; this specification does not limit this.
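For illustration of the sharded storage just described, a minimal Python sketch in which each parameter server owns a fixed slice of the parameters; the hash-based assignment is an assumption for this sketch.

```python
import zlib

def owner_server(param_name: str, num_servers: int) -> int:
    """Map a parameter to the server that stores and updates it
    (deterministic CRC32-based assignment, for illustration only)."""
    return zlib.crc32(param_name.encode()) % num_servers

# With 2 parameter servers, each holds and updates only its own shard.
params = ["embedding.w", "dense1.w", "dense1.b", "out.w"]
shards = {s: [p for p in params if owner_server(p, 2) == s] for s in range(2)}
```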
S104: and each working node performs model training according to the model structure stored in the working node, the training samples distributed from the training sample set and the model parameters sent by the parameter server, and determines the gradient of the model.
Each working node stores the network structure of the model to be trained and performs forward computation and backpropagation of the model on its training samples and the model parameters to obtain the model gradient. Since the model gradient is used to update the model parameters, each working node sends its determined gradient to all parameter servers in the distributed deep learning system so that the model parameters are updated.
S106: and each working node respectively sends the model gradient to a parameter server.
S108: and the parameter server updates the model parameters according to all the received model gradients.
S110: and the parameter server sends the updated model parameters to the working nodes respectively. And repeating the steps S104 to S108 until the model training is finished.
Specifically, each working node in the distributed deep learning system determines the model gradient under the current model parameters using the training samples it has obtained and sends the determined gradient to the parameter server. The parameter server processes the gradients sent by all working nodes by aggregation or other existing means, updates the model parameters according to those gradients, and sends the updated parameters back to each working node. Each working node then performs the next iteration with the updated parameters, and this iterative process repeats until model training finishes.
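Viewed from a single working node, the iteration of steps S104 to S110 can be sketched as follows; the `worker`, `ps` and `shard` objects and their methods are assumptions for illustration.

```python
def training_loop(worker, ps, shard, max_steps: int):
    """One worker's view of the parameter-server iteration (illustrative)."""
    for _ in range(max_steps):
        params = ps.pull_parameters()                    # S102/S110: fetch latest parameters
        grads = worker.compute_gradients(params, shard.next_batch())  # S104
        ps.push_gradients(grads)                         # S106: send the model gradient
        # S108 happens on the server side: the parameter server aggregates
        # the gradients from all workers and updates the parameters it stores.
```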
It can be seen that in an existing distributed deep learning system, multiple working nodes can be deployed on multiple different machines, and the hardware computing resources of those machines are used to increase the speed of model training. The number of working nodes and the hardware computing resources available to them are usually configured manually by the user before model training begins.
However, because model structures and training data sets differ, the number of working nodes required and the hardware computing resources they need vary greatly, and manual configuration is often inaccurate. If node resources are over-allocated, resources are wasted; if they are under-allocated, the model training speed suffers. It is therefore necessary to configure the working nodes automatically, so that distributed deep learning jobs finish training faster under limited hardware resources, the utilization of computing resources improves, and users can concentrate on algorithm design instead of managing the resources of the distributed system.
The distributed model training system provided in the embodiments of the present specification implements automatic configuration of the working nodes in step S100 of fig. 1B. The distributed model training system includes a node management system and a distributed system, as shown in fig. 2.
The node management system comprises a resource prediction unit and a node allocation unit, and the distributed system comprises a plurality of working nodes and a plurality of parameter servers. The node management system allocates resources for distributed deep learning jobs, i.e., model training tasks, performed on the distributed deep learning platform; the allocated resources include at least one of the number of working nodes and the memory, CPU and GPU occupied by the working nodes. The distributed system uses the working nodes to carry out the specific model training task, while the parameter server stores the latest model parameters and processes the gradients sent by the working nodes to update those parameters. The resource prediction unit and the node allocation unit may be actual hardware units or may be understood as processes, and they may reside on different machines or on the same machine; this specification does not limit this.
It is understood that the distributed system may include a plurality of parameter servers to store the model parameters in a distributed manner, allowing model parameters of very large scale to be stored. Training in the embodiments of this specification refers to the process of iteratively optimizing the model parameters by computing gradient descent on pre-acquired training samples, finally obtaining a trained model.
In addition, each parameter server and each working node in the distributed model training system may be an actual hardware machine or may be understood as a process. This specification is not limited in this respect; it merely takes the case where the parameter servers and working nodes are configured on specific hardware machines as an example to explain the solution.
In the embodiment of the present specification, the distributed mechanism of each working node is not limited, and a specific implementation is described by taking a data parallel mechanism as an example.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 3 is a schematic flow diagram of a distributed model training system in this specification, which specifically includes the following steps:
S200: And the node allocation unit sends a node resource prediction request to the resource prediction unit according to the model training task.
Distributed model training requires multiple working nodes deployed on different machines to train the model simultaneously. Each working node contains a preconfigured model structure. A parameter server in the distributed system sends the current model parameters to each working node; each working node performs forward computation of the model on its acquired training samples to obtain a gradient and sends the determined gradient to the parameter server. The parameter server processes the gradients and updates the model parameters, realizing gradient descent, and the next iteration proceeds until model training finishes.
Therefore, at the beginning of model training, working nodes need to be started on the distributed system. In the distributed model training system provided in the embodiments of the present specification, the user does not need to manually configure the working nodes; instead, the node allocation unit sends a node resource prediction request to the resource prediction unit, so that the resource prediction unit determines the number of working nodes required for this model training and the hardware resources each working node occupies.
Optionally, the working nodes include a master working node and at least one slave working node; that is, in the schematic diagram of the distributed model training system shown in fig. 2, a working node may be a master working node or a slave working node. To serve two purposes — starting the working nodes to execute the model training task when model training begins, and dynamically adjusting the working nodes during training to improve resource utilization — in the embodiments of the present specification the node allocation unit may send a node resource prediction request to the resource prediction unit at the start of model training (before any working node has been started), after only the master working node has been started to execute the model training task, and while the working nodes are executing the model training task.
The first working node to execute the model training task serves as the master working node, and the working nodes started afterwards serve as the slave working nodes. The master and slave working nodes are similar in structure and resource configuration; the working nodes are divided into a master working node and slave working nodes only according to when they begin executing the model training task.
S202: The resource prediction unit, in response to the node resource prediction request sent by the node allocation unit, predicts the resources required to execute the model training task according to historical tasks.
Specifically, in response to the node resource prediction request, the resource prediction unit may look up several historical tasks in the historical database and predict the resources required to execute the current model training task from the historical resource usage the historical working nodes consumed to execute those tasks, as stored in the database. The predicted resources required to execute the model training task may refer to the time a single node would need to execute the task and the corresponding resource usage.
S204: And the resource prediction unit determines the number of nodes according to the predicted resources.
Specifically, after determining from historical tasks the resources required to execute the model training task, the resource prediction unit needs to determine the number of working nodes required to execute it. To increase the speed of model training, multiple working nodes are typically deployed in the distributed system so that they execute the model training task in parallel. The resource prediction unit may determine the number of working nodes required and the resource usage of each working node directly from the predicted resources.
In this way, the resource prediction unit can predict the number of working nodes executing the model training task and the resource usage each of them requires in response to a single resource prediction request from the node allocation unit. The node allocation unit can then directly deploy the working nodes in the distributed system according to the returned number and per-node resource usage, so that the working nodes execute the model training task in parallel. Because all node resource parameters required for executing the model training task are obtained with a single request from the node allocation unit to the resource prediction unit, the interaction cost between the two units is reduced.
S206: and the resource prediction unit sends the node resource parameters to the node allocation unit.
S208: and the node allocation unit determines each working node for executing the model training task from the distributed system according to the node resource parameters sent by the resource prediction unit.
Specifically, the node allocation unit starts working nodes in the distributed system, or dynamically adjusts them, according to the specific type of node resource parameters sent by the resource prediction unit. This automatically configures the number of working nodes in the distributed system and the hardware resources each occupies, without manual configuration by the user, improving the efficiency of model training.
S210: And the node allocation unit distributes the model training task to each determined working node.
In the embodiments of the present specification, the distributed mechanism of the working nodes shown in step S100 of fig. 1B is taken to be data parallelism when explaining the specific technical solution. The node allocation unit can allocate training samples to the working nodes in the distributed system and distribute the model training task to them, so that each working node trains the model to be trained with the samples allocated to it.
S212: And the working node determines the model gradient according to a pre-stored model structure, the model parameters and the model training task distributed by the node allocation unit.
S214: And the working node sends the model gradient to the parameter server.
S216: and the parameter server updates the model parameters stored by the parameter server according to the model gradient.
S218: and the parameter server returns the updated model parameters to each working node, so that each working node performs the next iterative training according to the updated model parameters.
Specifically, steps S212 to S218 are similar to steps S102 to S110 shown in fig. 1B, and are not repeated herein.
Based on the distributed model training system shown in fig. 3: the system comprises a node allocation unit, a resource prediction unit, working nodes and a parameter server. The node allocation unit sends a node resource prediction request to the resource prediction unit according to the model training task; the resource prediction unit, in response to the request, predicts the resources required to execute the model training task according to historical tasks and determines the number of nodes from the predicted resources. The node allocation unit determines the working nodes according to the number of nodes and distributes the model training task to each of them, so that the parameter server cooperates with the working nodes to execute the task. In this way, the resource prediction unit automatically determines the number of working nodes and the node allocation unit automatically determines the working nodes so that they begin executing the model training task; the user does not need to manually configure working nodes before model training starts, and the speed of model training is improved.
In the embodiments of the present specification, as shown in steps S200 to S202 of fig. 3, the node allocation unit sends a node resource prediction request to the resource prediction unit according to the model training task, and the resource prediction unit responds to that request; that is, the node resource parameters are determined directly by the resource prediction unit in response to a single resource prediction request from the node allocation unit.
In an optional embodiment of this specification, still as in steps S200 to S202 of fig. 3, the node allocation unit may send different types of resource prediction requests to the resource prediction unit at different execution phases of the model training task. According to the type and timing of the received node resource prediction request, the resource prediction unit may first predict the resource usage of the master working node; the node allocation unit then starts the master working node with that resource usage to execute the model training task, and the resource prediction unit determines at least one of the number of slave working nodes and the resource usage of each slave working node according to the actual resources the master working node uses to execute the task. The predicted number of slave working nodes and/or the resource usage of each slave working node is returned to the node allocation unit, so that the node allocation unit determines the slave working nodes in the distributed system to execute the model training task in parallel.
To serve both purposes — starting the working nodes to execute the model training task when model training begins, and dynamically adjusting them during training to improve resource utilization — the node allocation unit may send a node resource prediction request to the resource prediction unit at the start of model training (before any working node has been started), after only the master working node has been started to execute the model training task, and while the working nodes are executing the model training task.
Depending on when the node allocation unit sends the node resource prediction request to the resource prediction unit, two tasks can be distinguished: starting the working nodes when model training begins, and dynamically adjusting the working nodes during model training.
For starting the working nodes when model training begins, the specific implementation steps are shown in fig. 4 and proceed as follows:
S300: The node allocation unit sends a first node resource prediction request to the resource prediction unit.
Specifically, when model training starts, the node allocation unit sends a first node resource prediction request to the resource prediction unit. The first node resource prediction request asks the resource prediction unit to predict the hardware computing resources, such as at least one of memory, CPU and GPU, that the master working node requires in the model training job.
S302: In response to the first node resource prediction request sent by the node allocation unit, the resource prediction unit searches the historical training database for the historical resource usage corresponding to historical tasks.
Because no working node for this model training job has been started yet, the resource prediction unit predicts the hardware computing resources the master working node requires by searching the historical training database for the historical working nodes used by the historical training tasks within a specified time period and the resource usage of each of those nodes.
The historical training database may store the number of working nodes in all model training tasks executed by the distributed deep learning platform and the hardware computing resource usage of each working node.
S304: The resource prediction unit predicts the first resource usage required by the master working node to execute the model training task according to the historical resource usage.
The distributed model training system may never have executed the current model training task; that is, the model structure in the current task may differ from the historical model structures stored in the historical training database. To ensure that the master working node can start and execute the model training task, the maximum hardware computing resource usage among all model training jobs within a specified historical time period may be selected as a target resource usage, and a preset margin added on top of it as the resource usage of the master working node. In this way, even if the distributed deep learning platform has never executed this model training task, the resource usage of the master working node can still be predicted by reference to the historical resource usage of historical working nodes in historical model training tasks.
It should be noted that at this step the master working node has not been started in the distributed system; the predicted first resource usage required by the master working node to execute the model training task is the resource usage required to deploy the master working node in the distributed system for that purpose.
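A sketch of the prediction described in this step, assuming history records with an age and a memory figure and an illustrative margin; none of these field names or values are fixed by this specification.

```python
def predict_master_resources(history, window_days: int = 30, margin: float = 0.5):
    """Predict the master worker's first resource usage: maximum historical
    usage within the window, plus a preset margin (values illustrative)."""
    recent = [job for job in history if job["age_days"] <= window_days]
    target = max(job["memory_gb"] for job in recent)  # target resource usage
    return target * (1 + margin)                      # add the preset margin

# e.g. predict_master_resources([{"age_days": 3, "memory_gb": 8},
#                                {"age_days": 12, "memory_gb": 12}])  -> 18.0
```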
S306: the resource prediction unit transmits the first resource usage to the node allocation unit.
S308: the node allocation unit receives the first resource usage amount transmitted by the resource prediction unit.
S310: The node allocation unit starts the master working node in the distributed system according to the first resource usage, and assigns the model training task to the master working node.
In this step, after obtaining the master working node's resource usage from the resource prediction unit, the node allocation unit starts the master working node in the distributed system and allocates to it the hardware computing resources of its host machine according to that resource usage.
Optionally, since the distributed deep learning platform may never have executed the current model training task and the historical training database stores no historical data about the hardware computing resources it requires, the resource usage predicted by the resource prediction unit from historical data may be insufficient to execute the current task. Consequently, when the node allocation unit starts the master working node in the distributed system with the predicted resource usage, the master working node may fail to start normally. In that case, the resource usage of the master working node may be readjusted until it can start normally in the distributed system.
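The start-adjust-restart behavior can be sketched as a retry loop that enlarges the resource request after each failed start; the growth factor and cluster API are assumptions, since this specification fixes no particular adjustment policy.

```python
def start_master_with_retry(cluster, task, mem_gb: float,
                            grow: float = 1.5, max_attempts: int = 5):
    """Start the master worker, enlarging its resource request until it
    runs (illustrative sketch)."""
    for _ in range(max_attempts):
        master = cluster.start_worker(memory_gb=mem_gb)
        if master.is_running():
            master.assign(task)
            return master
        cluster.stop_worker(master)  # failed start: release and retry
        mem_gb *= grow               # readjust the resource usage
    raise RuntimeError("master worker failed to start within resource limits")
```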
Further, after the master working node has started in the distributed system, the node allocation unit may allocate a preset number of training samples from the user-configured training sample set to the master working node, so that it executes the model training task on the allocated samples.
S312: And the parameter server sends the initialized model parameters to the master working node.
In particular, the several parameter servers in the distributed system store the model parameters in a distributed manner. When the master working node has started normally in the distributed system, it requests the model parameters from the parameter server, and the parameter server, in response to the request, sends the initialized model parameters to the master working node so that it executes the model training task based on them.
S314: And the master working node determines the gradient according to a pre-stored model structure, the model parameters sent by the parameter server and the assigned model training task.
S316: And the master working node sends the determined gradient to the parameter server.
S318: The parameter server processes the received gradient and updates the model parameters, so that the next iteration of model training can proceed.
In the model training process, the specific iterative process is as follows:
each working node prestores a complete model network structure, and the gradient of the current model is determined according to model parameters respectively sent by all parameter servers and training samples distributed by the node distribution unit. Then, each working node sends the gradient of the current model to all the parameter servers, and after receiving the gradients sent by all the working nodes, the parameter servers process the received gradients and update part of the model parameters stored by the parameter servers. And finally, each working node requests to acquire the latest model parameters from all the parameter servers, and performs the next iterative training by using the updated model parameters.
S320: And the master working node feeds the specified resource parameters back to the resource prediction unit.
The specified resource parameters include the actual resource usage and the time consumption of the master working node when executing the model training task. The time the master working node consumes to execute the model training task can be determined from specified model parameters, which include the number of samples in the data set used by the model training task, the number of samples used in each iteration, the number of training epochs, and the current training speed. The specified model parameters are configured in advance by the user, and the master working node automatically acquires them after starting in the distributed system. The master working node may send the specified model parameters to the resource prediction unit so that the resource prediction unit determines the time the master working node consumes to execute the model training task; alternatively, the master working node may itself compute that time from the specified model parameters and feed it back to the resource prediction unit. Specifically, the time for the master working node to complete the current model training task alone is first determined by the formula:
$$t = \frac{\text{sample\_count} \times \text{epoch}}{\text{batch\_size} \times \text{speed}}$$
where t is the time consumed by the master working node to execute the model training task alone, sample_count is the number of samples in the training job's data set, batch_size is the number of samples used in each iteration, epoch is the number of training epochs (cycles through the samples), and speed is the training speed of the master working node.
The training speed of the master working node performing the model training task alone is determined, after the master working node completes one iteration of the task, from the time consumed to complete that iteration.
For example, if the master working node takes 10 seconds to train the model in one iteration, its training speed for executing the model training task is 0.1 iterations per second; if the condition for finishing model training is completing 10,000 iterations, the master working node needs 100,000 seconds to finish the training alone. The training speed thus characterizes the time consumed by the master working node to perform the model training task alone.
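Implemented directly from the formula above (a sketch; the parameter names follow this specification's notation):

```python
def estimated_solo_time(sample_count: int, batch_size: int,
                        epoch: int, speed: float) -> float:
    """Seconds for the master worker to finish the training task alone;
    speed is iterations per second, measured from one completed iteration."""
    total_iterations = sample_count / batch_size * epoch
    return total_iterations / speed

# Worked example from the text: 10,000 total iterations at 0.1 iterations/s
# -> 100,000 seconds.
```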
As for the actual resource usage of the master working node: after the master working node completes one iteration of the model training task, it reports the actual resource usage of that iteration to the resource prediction unit.
S322: and the parameter server reports the first load to the resource prediction unit.
The first load of the parameter server represents the resources the parameter server uses when processing the gradient sent by the master working node and updating the model parameters. Processing the master working node's gradient and updating the initialized model parameters occupies computing resources on the machine hosting the parameter server. The parameter server may also contain a resource monitoring unit, which sends the resource occupancy incurred in processing the master working node's gradient to the resource prediction unit; the resource prediction unit can then determine the number of slave working nodes according to the resource occupancy of the parameter server's host machine.
S324: the node allocation unit transmits a second node resource prediction request to the resource prediction unit.
After the master working node in the distributed system has started and begun the model training task, several slave working nodes need to be started to execute the model training task in parallel together with the master working node. Executing the task in parallel on multiple working nodes deployed on multiple machines improves the model training speed.
S326: And the resource prediction unit, in response to the second node resource prediction request sent by the node allocation unit, determines the number of slave working nodes according to the time consumed by the master working node to execute the model training task (contained in the specified resource parameters) and the first load sent by the parameter server.
The number of slave working nodes is first configured according to the time the master working node would consume to execute the task alone, so that the task completes within a specified time:

$$N_0 = \frac{t}{T_0}$$

where $N_0$ is the first predicted number of slave working nodes, $t$ is the time consumption of the master working node determined above, and $T_0$ is a parameter the user can configure to adjust the number of slave working nodes, e.g., 3600.
In addition, the resource prediction unit may also determine to limit the number of slave working nodes according to the load of the current parameter server. Because the load of the parameter server represents the resource usage amount of the parameter server when processing the gradient, and the computing resource of the machine where the parameter server is located is limited, if the configuration of the working nodes is increased, the gradient that the parameter server needs to process is also greatly increased, and further the resource usage amount of the parameter server is also increased, once the computing resource of the machine where the parameter server is located is all used for processing the gradient sent by the working nodes, even if the working nodes are continuously increased, the parameter server has no redundant computing resource to process the gradient sent by too many working nodes, and further the bottleneck of the model training speed is reached, therefore, the load of the parameter server needs to be used for limiting the number of the working nodes, and the formula is as follows:
$N_1 = \lfloor \alpha / L_1 \rfloor$

where $N_1$ is the second predicted number of slave working nodes, $\alpha$ is a parameter pre-configured by the user, e.g., 0.8, and $L_1$ is the first load of the parameter server. The first load is obtained by the parameter server monitoring its own load while processing the model gradient sent by the main working node, and is sent by the parameter server to the resource prediction unit. It can be used to bound the number of slave working nodes, avoiding the deployment of so many working nodes in the distributed model training system that the parameter server is overloaded and the model training speed is limited.
Optionally, the number of slave working nodes may be determined from the smaller of the first predicted number $N_0$ and the second predicted number $N_1$ obtained as described above. That is, the number of slave working nodes may be configured as $N_0$, from the time the master working node takes to execute the specified task; as $N_1$, from the first load of the parameter server; or as $N = \min(N_0, N_1)$.
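As a sketch of how the two estimates combine, assuming the reconstructed formulas above (the originals appear only as images in the filing) and illustrative parameter names:

import math


def predict_slave_count(master_task_seconds: float,
                        target_seconds: float,
                        first_load: float,
                        load_ceiling: float = 0.8) -> int:
    """Return N = min(N0, N1).

    target_seconds is the user parameter T (e.g., 3600), first_load is the
    parameter server's first load L1 as a fraction of its machine's
    resources, and load_ceiling is the user parameter alpha (e.g., 0.8)."""
    n0 = math.ceil(master_task_seconds / target_seconds)  # N0: time-based estimate
    n1 = math.floor(load_ceiling / first_load)            # N1: load-based cap
    return max(1, min(n0, n1))

For example, a task the master alone would run for ten hours with T = 3600 gives N0 = 10, and a first load of 0.05 with alpha = 0.8 gives N1 = 16, so N = 10.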
S328: and the resource prediction unit determines the resource usage of each slave working node according to the actual resource usage of the master working node contained in the specified resource parameter.
The resource prediction unit configures the resource usage of each slave working node from the actual resource usage of the master working node and a parameter preset by the user. The actual resource usage of the master working node is the amount of hardware resources it used to complete a model training iteration after being successfully started in the distributed system. The resources here are the hardware computing resources of each machine in the distributed system, such as at least one of memory, CPU, and GPU. It can be understood that the preset parameter may be positive or negative; that is, depending on the specific application, the resource usage of a slave working node may be configured slightly higher or slightly lower than the master working node's actual resource usage.
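A sketch of this configuration step, assuming the preset parameter acts as a signed fractional offset on each resource dimension (the specification states only that it may be positive or negative):

def slave_resource_usage(master_actual: dict, preset: float) -> dict:
    """Derive each slave working node's resource quota from the master's
    measured usage; preset > 0 yields slightly more than the master used,
    preset < 0 slightly less."""
    return {res: amount * (1.0 + preset) for res, amount in master_actual.items()}

# e.g. slave_resource_usage({"memory_gb": 16.0, "cpu_cores": 4.0, "gpu": 1.0}, preset=-0.05)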
S330: the resource prediction unit transmits the number of slave work nodes and the resource usage amount of each slave work node to the node allocation unit.
S332: the node allocation unit receives the number of slave working nodes and the resource usage amount of each slave working node transmitted by the resource prediction unit.
S334: the node allocation unit starts each slave work node on the distributed system based on the number of slave work nodes and the resource usage amount of each slave work node transmitted by the resource prediction unit. And distributing training samples to each slave work node.
In this step, the node allocating unit starts the slave work nodes in accordance with the number of the slave work nodes in the distributed system after obtaining the number of the slave work nodes and the resource usage amount of each slave work node transmitted by the resource predicting unit, and allocates the hardware calculation resources in the machine in which each slave work node is respectively located to each slave work node in accordance with the resource usage amount of each slave work node.
S336: each slave worker node determines and sends a gradient. The parameter server processes the gradient sent by each working node, updates the model parameters, and distributes the updated model parameters to the main working node and each working node so as to carry out the iterative process of the next model training.
In this step, the process of executing the model training task by each slave work node and the interaction process with the parameter server are similar to the above-mentioned steps S102 to S110 shown in fig. 1B, and are not described again here.
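For concreteness, a minimal sketch of the parameter server's side of this loop follows; plain SGD is an assumed update rule, since the specification does not fix one:

import numpy as np


class ParameterServer:
    """Sketch of S336: receive a gradient from any working node, update the
    model parameters, and return them for the next iteration."""

    def __init__(self, params: dict, lr: float = 0.01):
        self.params = params  # parameter name -> np.ndarray
        self.lr = lr

    def apply_gradient(self, grads: dict) -> dict:
        for name, g in grads.items():
            self.params[name] -= self.lr * g  # assumed SGD step
        return self.params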
In an optional embodiment of this specification, as shown in step S310 of fig. 4, when allocating the model training task to each determined working node, the node allocation unit further determines the training speed of each working node by monitoring the amount of the model training task allocated to it, and sends the training speed to the resource prediction unit, which stores it in the history database. The concrete steps are as follows:
First, the node allocation unit monitors the allocation amount of training samples of each working node.

In this step, the node allocation unit may allocate training samples to each working node; when a working node has used up its currently allocated samples in the model training process, it requests more from the node allocation unit, which then reallocates samples to it. In this way, the node allocation unit can monitor the allocation amount of training samples of each working node.
Second, the node allocation unit determines the training speed of each working node according to the monitored allocation amount of training samples of each working node.

The node allocation unit can determine each working node's training speed from the amount of training samples allocated to it per unit time.
Then, the training speed of each working node is sent to the resource prediction unit.
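A sketch of this speed estimate, assuming the node allocation unit keeps a log of (timestamp, sample-count) pairs, one entry per allocation, and measuring over an illustrative sliding window:

def training_speed(allocation_log: list, window_s: float = 60.0) -> float:
    """Samples allocated per second over the most recent window; each log
    entry is a (timestamp, n_samples) pair."""
    if not allocation_log:
        return 0.0
    now = allocation_log[-1][0]
    recent = sum(n for t, n in allocation_log if now - t <= window_s)
    return recent / window_s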
The resource prediction unit may send the number of the working nodes executing the current model training task and the resource usage of each working node to the historical training database, so that the historical training database stores the relevant data of the working nodes in the current model training task.
It can be seen that in the distributed model training method provided in this specification, the resource prediction unit first determines the first resource usage of the master working node so that the node allocation unit can deploy the master working node to execute the model training task; it then obtains the specified resource parameters fed back by the master working node and uses them to predict the number of slave working nodes and the resource usage of each. The node allocation unit starts the master working node and the slave working nodes according to the data sent by the resource prediction unit, so that the master working node and every slave working node execute the model training task using the hardware computing resources of multiple machines. The user does not need to configure the number of working nodes or their resource usage manually: the distributed model training system predicts both automatically and deploys the working nodes across the machines, improving the speed of model training.
Further, once a current distributed deep learning system has started model training, its configured working nodes cannot be adjusted to the actual conditions of the training; the only option is to terminate the current training, reconfigure the working nodes, and restart, which delays the progress of model training. Therefore, in an optional embodiment of this specification, the distributed model training method provided here can further adjust the number of working nodes in real time while the working nodes execute the model training task as in steps S104 to S110 shown in fig. 1B, avoiding a re-run of the training because the working nodes no longer meet its requirements. This can be implemented by the following steps, as shown in fig. 5:

S400: The parameter server reports the second load to the resource prediction unit.
The second load of the parameter server represents the amount of hardware resources the parameter server occupies while processing the gradients sent by the master working node and each slave working node and updating the model parameters.

Specifically, the parameter server receives gradient data from multiple working nodes; processing that data and updating the model parameters both occupy hardware computing resources of the machine where the parameter server runs. If the amount of computing resources the parameter server uses to process gradient data approaches the machine's total, the parameter server cannot process more gradient data; increasing the number of working nodes at that point is useless, because the parameter server cannot handle the additional gradients it would receive. The number of working nodes therefore needs to be adjusted dynamically according to the second load reported by the parameter server.

Accordingly, the resource prediction unit dynamically adjusts the number of working nodes based on the training speed and the second load reported by the parameter server, so that the training speed can be increased while the parameter server remains able to process the gradient data sent by the working nodes, avoiding the situation where too many working nodes leave the parameter server unable to process the gradient data.
S402: the node allocation unit sends a third node resource prediction request to the resource prediction unit.
In the process of executing model training on each working node, the node allocation unit periodically sends a third node resource prediction request to the resource prediction unit, and the resource prediction unit determines the adjustment number of the slave working nodes, so that the number of the working nodes is dynamically adjusted in the model training process, and the training speed is improved.
It can be understood that the node allocation unit does not need to determine whether the number of slave working nodes actually requires adjustment. It simply sends a third node resource prediction request to the resource prediction unit periodically, receives the adjusted number of slave working nodes in return, and updates the slave working nodes deployed in the distributed system to match that number.

S404: In response to the received third node resource prediction request, the resource prediction unit determines the adjusted number of slave working nodes according to the second load sent by the parameter server and the training speed of each working node sent by the node allocation unit.

The resource prediction unit periodically determines the adjustment of the slave working nodes in response to the third node resource prediction requests sent by the node allocation unit. Specifically, it may first judge the load condition of the machine where the parameter server runs from the second load. If that machine runs at full load, increasing the number of slave working nodes will not increase the model training speed, because the parameter server has no spare computing resources to process additional gradient data from slave working nodes. If the machine runs at light load, the number of slave working nodes to add can be determined from the machine's remaining hardware computing resources. If the machine is overloaded, the number of slave working nodes to terminate can be determined and the count reduced appropriately, avoiding, without affecting the training speed, the situation of too many working nodes and a parameter server with no spare computing resources to process gradient data.
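This three-way decision can be sketched as follows; the light-load and overload thresholds and the step size are assumptions, since the specification states only the qualitative rule:

def adjusted_slave_count(current: int, second_load: float,
                         light: float = 0.6, overload: float = 0.9,
                         step: int = 1) -> int:
    """Map the parameter server's second load to an adjusted slave count."""
    if second_load < light:      # spare capacity: scale out
        return current + step
    if second_load > overload:   # overloaded: scale in, keep at least one
        return max(1, current - step)
    return current               # near full load: adding workers gains nothing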
S406: the adjusted number of slave work nodes is sent to the node allocation unit.
S408: the current number of slave worker nodes is adjusted based on a difference between the current number of slave worker nodes and the received adjusted number of slave worker nodes.
Specifically, the node allocation unit receives the adjusted number of slave working nodes sent by the resource prediction unit; acquires the current number of slave working nodes; and judges whether the current number is the same as the adjusted number.

If the current number of slave working nodes equals the adjusted number, the slave working nodes deployed in the distributed system need no adjustment.

If the current number differs from the adjusted number, the node allocation unit adjusts the current number of slave working nodes according to the received adjusted number.
Specifically, if the current number of slave working nodes is greater than the adjusted number of slave working nodes, the node allocation unit may terminate a portion of the slave working nodes in the distributed system such that the current number of slave working nodes is the same as the adjusted number of slave working nodes predicted by the resource prediction unit. If the current number of the slave working nodes is less than the adjusted number of the slave working nodes, the node allocation unit may start the slave working nodes in the distributed system so that the current number of the slave working nodes is the same as the adjusted number of the slave working nodes predicted by the resource prediction unit. The determination method of the resource usage amount of the newly added slave working node in the model training process is similar to that in step S328 in fig. 4, and is not repeated here.
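Reconciling the deployed count with the adjusted count might look like the following sketch, where start_worker and stop_worker stand in for the node allocation unit's actual deployment mechanism, which this specification does not fix:

def reconcile_slave_workers(worker_ids: list, adjusted: int,
                            start_worker, stop_worker) -> None:
    """Start or terminate slave working nodes until the deployed count
    equals the adjusted count returned by the resource prediction unit."""
    diff = adjusted - len(worker_ids)
    if diff > 0:
        for _ in range(diff):
            start_worker()                 # launch one new slave working node
    else:
        for wid in worker_ids[:abs(diff)]:
            stop_worker(wid)               # terminate a surplus node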
Based on the distributed model training method shown in fig. 5, the node allocation unit may periodically ask the resource prediction unit to adjust the number of slave working nodes during model training. The resource prediction unit determines the adjusted number from the current training speed of each slave working node sent by the node allocation unit and the second load sent by the parameter server, and returns the predicted adjusted number to the node allocation unit, which then updates the slave working nodes deployed in the distributed system accordingly. In this way, the number of slave working nodes is adjusted dynamically during model training: enough slave working nodes are kept to execute the model training task in parallel, while the load capacity of the parameter server is respected, avoiding the training bottleneck that arises when a fully loaded parameter server cannot process the gradient data sent by the working nodes in time. This further improves the model training speed.
Based on the same idea, this specification further provides distributed model training apparatuses corresponding to the distributed model training system and methods provided above, as shown in fig. 6 and fig. 7.
Fig. 6 is a schematic diagram of a distributed model training apparatus provided in this specification, specifically including:
a resource prediction module 500, configured to predict, according to a historical task, a resource required to execute the model training task in response to a node resource prediction request sent by a node allocation unit;
a node number determining module 502, configured to determine the number of nodes according to the predicted resources;
a sending module 504, configured to send the number of nodes to the node allocation unit, so that the node allocation unit determines, according to the number of nodes, each working node that is used for executing the model training task and is in the number of nodes in a distributed system, and allocates the model training task to each determined working node, where each working node performs distributed model training together with a parameter server according to the model training task.
Optionally, the resource prediction module 500 is specifically configured to search a history database for a history resource usage amount corresponding to a history task; in response to a first node resource prediction request sent by the node allocation unit, predicting a first resource usage amount required by a main working node to execute the model training task according to the historical resource usage amount; and sending the first resource usage amount to the node allocation unit, so that the node allocation unit determines a main working node for executing the model training task in a distributed system according to the first resource usage amount, and allocates the model training task to the main working node, so that the main working node can execute the model training task and feed back specified resource parameters.
Optionally, the node number determining module 502 is specifically configured to receive a specified resource parameter fed back by the main working node as the resource required for executing the model training task, where the specified resource parameter includes the actual resource usage and the time consumption of the main working node when executing the model training task; acquire a first load sent by a parameter server, the first load representing the actual resource usage the parameter server uses to process the model gradient sent by the main working node; in response to a second node resource prediction request sent by the node allocation unit, determine the number of slave working nodes according to the time consumption of the master working node and the first load; and determine the resource usage of each slave working node according to the actual resource usage of the master working node.
Optionally, the apparatus further comprises:
an adjusting module 506, specifically configured to obtain the second load sent by the parameter server; the second load is used for representing the resource usage amount of the parameter server for processing the model gradients respectively sent by the main working node and each slave working node; and responding to a third node resource prediction request sent by the node allocation unit, determining the adjustment quantity of the slave working nodes according to the second load, and sending the adjustment quantity to the node allocation unit so that the node allocation unit adjusts the current quantity of the slave working nodes for executing the model training task according to the adjustment quantity of the slave working nodes.
Fig. 7 is a schematic diagram of a distributed model training apparatus provided in this specification, specifically including:
a request module 600, configured to send a node resource prediction request to a resource prediction unit according to a model training task, so that the resource prediction unit predicts, in response to the node resource prediction request, resources required for executing the model training task according to a historical task, and determines the number of nodes according to the predicted resources;
a working node determining module 602, configured to determine, according to the number of nodes sent by the resource predicting unit, each working node in the number of nodes in a distributed system, where the number of nodes is used to execute the model training task, and allocate the model training task to each determined working node, where each working node performs distributed model training together with a parameter server according to the model training task.
Optionally, the working node determining module 602 is specifically configured to start, according to the resource usage amount of the main working node sent by the resource predicting unit, the main working node executing the model training task from the distributed system; determining whether the primary work node performs the model training task; if not, adjusting the resource usage amount of the main working node, and restarting the main working node according to the adjusted resource usage amount of the main working node until the main working node executes the model training task.
Optionally, the working node determining module 602 is specifically configured to receive the adjusted number of the slave working nodes sent by the resource predicting unit; obtaining the current number of slave working nodes executing the model training task; judging whether the current number of the slave working nodes is the same as the adjusted number of the slave working nodes or not; if not, adjusting the current number of the slave working nodes according to the adjusted number of the slave working nodes.
The present specification also provides a computer-readable storage medium storing a computer program, which can be used to execute the distributed model training method shown in fig. 3.
This specification also provides a schematic block diagram of the electronic device shown in fig. 8. As shown in fig. 8, at the hardware level the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it to implement the distributed model training method shown in fig. 3. Of course, besides the software implementation, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
In the 1990s, an improvement of a technology could be clearly distinguished as an improvement in hardware (for example, an improvement of a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement of a method flow). However, as technology develops, many of today's improvements of method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by that (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the memory's control logic. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for realizing various functions may also be regarded as structures within the hardware component. Or even the means for realizing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functionality of the various elements may be implemented in the same one or more pieces of software and/or hardware in the practice of this description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the system embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (16)

1. A distributed model training system, the system comprising: the system comprises a node allocation unit, a resource prediction unit, a working node and a parameter server;
the node allocation unit is used for sending a node resource prediction request to the resource prediction unit according to the model training task; determining each working node of the node quantity used for executing the model training task from a distributed system according to the node quantity sent by the resource prediction unit, and distributing the model training task to each determined working node;
the resource prediction unit is used for responding to the resource prediction request and predicting resources required by executing the model training task according to historical tasks; determining the number of nodes according to the predicted resources; sending the number of nodes to the node allocation unit;
the working node is used for receiving the model training task distributed by the node distribution unit and the model parameters sent by the parameter server; determining a model gradient according to a pre-stored model structure, the model parameters and the model training task distributed by the node distribution unit; sending the model gradient to the parameter server;
the parameter server is used for receiving the model gradient sent by the working node; and updating the model parameters stored by the parameter server according to the model gradient, and returning the model parameters to the working node.
2. The system according to claim 1, wherein the resource prediction unit is configured to predict a first resource usage amount required by the master working node to execute the model training task according to a historical resource usage amount corresponding to a historical task searched from a historical database in response to the resource prediction request sent by the node allocation unit, and send the first resource usage amount to the node allocation unit; receiving a specified resource parameter fed back by the main working node as a resource required for executing the model training task;
and the node allocation unit is used for determining the main working node in a distributed system according to the first resource usage amount, allocating the model training task to the main working node, and enabling the main working node to execute the model training task and feed back the specified resource parameters to the resource prediction unit.
3. The system of claim 2, wherein the resource prediction unit is configured to determine a time consumption of the primary working node according to the specified resource parameter fed back by the primary working node; determining the number of slave working nodes according to the time consumption of the master working node; sending the number of the slave working nodes to a node distribution unit;
and the node distribution unit is used for determining each slave working node executing the model training task in the distributed system according to the number of the received slave working nodes and distributing the model training task to each slave working node.
4. The system of claim 2, the parameter server for monitoring the load of the parameter server itself processing the model gradient sent by the primary work node as a first load; sending the first load to the resource prediction unit;
the resource prediction unit is used for receiving a first load sent by the parameter server; and determining the number of the slave working nodes according to the time consumption of the master working node and the first load.
5. The system of claim 2, the resource prediction unit is configured to determine the resource usage amount of the slave working node according to the specified resource parameter fed back by the master working node; sending the resource usage amount of the slave working node to the node allocation unit;
the node allocation unit is used for allocating the model training task and the resource usage amount of the slave working node to each determined working node;
and the working node is used for determining resources allocated for executing the model training task according to the resource usage amount of the slave working node.
6. A distributed model training method, comprising:
predicting resources required for executing the model training task according to the historical task in response to a node resource prediction request sent by a node allocation unit;
determining the number of nodes according to the predicted resources;
and sending the node number to the node distribution unit, so that the node distribution unit determines each working node of the node number, which is used for executing the model training task, from a distributed system according to the node number, and distributes the model training task to each determined working node, and each working node performs distributed model training together with a parameter server according to the model training task.
7. The method according to claim 6, predicting resources required for executing the model training task based on historical tasks, specifically comprising:
searching historical resource usage corresponding to the historical tasks from a historical database;
in response to a first node resource prediction request sent by the node allocation unit, predicting a first resource usage amount required by a main working node to execute the model training task according to the historical resource usage amount;
and sending the first resource usage amount to the node allocation unit, so that the node allocation unit determines a main working node for executing the model training task in a distributed system according to the first resource usage amount, and allocates the model training task to the main working node, so that the main working node executes the model training task and feeds back specified resource parameters.
8. The method according to claim 7, wherein determining the number of nodes according to the predicted resources specifically comprises:
receiving specified resource parameters fed back by the main working node as resources required for executing the model training task, wherein the specified resource parameters comprise actual resource usage amount and time consumption of the main working node when the model training task is executed;
acquiring a first load sent by the parameter server; the first load is used for representing the actual resource usage amount used by the parameter server for processing the model gradient sent by the main working node;
responding to a second node resource prediction request sent by the node allocation unit, and determining the number of slave working nodes according to the time consumption of the master working node and the first load;
and determining the resource usage of each slave working node according to the actual resource usage of the master working node.
9. The method of claim 8, further comprising:
acquiring a second load sent by the parameter server; the second load is used for representing the resource usage amount used by the parameter server for processing the model gradients respectively sent by the main working node and each slave working node;
and in response to a third node resource prediction request sent by the node allocation unit, determining the adjusted number of the slave working nodes according to the second load, and sending the adjusted number to the node allocation unit, so that the node allocation unit adjusts the current number of the slave working nodes executing the model training task according to the adjusted number of the slave working nodes.
10. A distributed model training method, comprising:
according to a model training task, sending a node resource prediction request to a resource prediction unit so that the resource prediction unit responds to the node resource prediction request, predicts resources required by execution of the model training task according to historical tasks, and determines the number of nodes according to the predicted resources;
and determining each working node for executing the model training task from the distributed system according to the number of the nodes sent by the resource prediction unit, and distributing the model training task to each determined working node, wherein each working node performs distributed model training together with a parameter server according to the model training task.
11. The method according to claim 10, wherein determining each working node performing the model training task from the distributed system according to the number of nodes sent by the resource prediction unit specifically comprises:
starting a main working node executing the model training task from a distributed system according to the resource usage amount of the main working node sent by the resource prediction unit;
determining whether the primary work node performs the model training task;
if not, adjusting the resource usage amount of the main working node, and restarting the main working node according to the adjusted resource usage amount of the main working node until the main working node executes the model training task.
12. The method according to claim 10, wherein determining each working node performing the model training task from the distributed system according to the number of nodes sent by the resource prediction unit specifically comprises:
receiving an adjusted number of slave working nodes sent by the resource prediction unit;
obtaining the current number of slave working nodes executing the model training task;
judging whether the current number of the slave working nodes is the same as the adjusted number of the slave working nodes or not;
if not, adjusting the current number of the slave working nodes according to the adjusted number of the slave working nodes.
13. A distributed model training apparatus comprising:
the resource prediction module is used for responding to a node resource prediction request sent by the node allocation unit and predicting resources required by executing the model training task according to the historical task;
the node number determining module is used for determining the number of nodes according to the resources obtained by prediction;
and the sending module is used for sending the node number to the node distribution unit so that the node distribution unit determines each working node of the node number, which is used for executing the model training task, from a distributed system according to the node number, and distributes the model training task to each determined working node, and each working node performs distributed model training together with a parameter server according to the model training task.
14. A distributed model training apparatus comprising:
the request module is used for sending a node resource prediction request to a resource prediction unit according to a model training task so that the resource prediction unit responds to the node resource prediction request, predicts resources required by executing the model training task according to historical tasks, and determines the number of nodes according to the predicted resources;
and the working node determining module is used for determining each working node of the node number, which is used for executing the model training task, from the distributed system according to the node number sent by the resource prediction unit, and distributing the model training task to each determined working node, wherein each working node performs distributed model training together with the parameter server according to the model training task.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of the preceding claims 6 to 12.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 6 to 12 when the program is executed by the processor.
CN202210668952.5A 2022-06-14 2022-06-14 Distributed model training system, method and device Active CN114780225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210668952.5A CN114780225B (en) 2022-06-14 2022-06-14 Distributed model training system, method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210668952.5A CN114780225B (en) 2022-06-14 2022-06-14 Distributed model training system, method and device

Publications (2)

Publication Number Publication Date
CN114780225A true CN114780225A (en) 2022-07-22
CN114780225B CN114780225B (en) 2022-09-23

Family

ID=82421033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210668952.5A Active CN114780225B (en) 2022-06-14 2022-06-14 Distributed model training system, method and device

Country Status (1)

Country Link
CN (1) CN114780225B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115562877A (en) * 2022-11-15 2023-01-03 北京阿丘科技有限公司 Arrangement method, device and equipment of distributed computing power resources and storage medium
CN116167463A (en) * 2023-04-26 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment
CN116680060A (en) * 2023-08-02 2023-09-01 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN117555697A (en) * 2024-01-11 2024-02-13 之江实验室 Distributed training-oriented cache loading system, method, device and equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165093A (en) * 2018-07-31 2019-01-08 宁波积幂信息科技有限公司 A kind of calculate node cluster elasticity distribution system and method
CN109558248A (en) * 2018-12-11 2019-04-02 中国海洋大学 A kind of method and system for the determining resource allocation parameters calculated towards ocean model
CN110597626A (en) * 2019-08-23 2019-12-20 第四范式(北京)技术有限公司 Method, device and system for allocating resources and tasks in distributed system
CN111507474A (en) * 2020-06-18 2020-08-07 四川大学 Neural network distributed training method for dynamically adjusting Batch-size
US20210132994A1 (en) * 2019-11-06 2021-05-06 Centurylink Intellectual Property Llc Predictive resource allocation for network growth in an edge computing network
CN112799850A (en) * 2021-02-26 2021-05-14 重庆度小满优扬科技有限公司 Model training method, model prediction method, and model control system
CN113032116A (en) * 2021-03-05 2021-06-25 广州虎牙科技有限公司 Training method of task time prediction model, task scheduling method and related device
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113485833A (en) * 2021-07-09 2021-10-08 支付宝(杭州)信息技术有限公司 Resource prediction method and device
CN113656175A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Method, apparatus and program product for training models based on distributed systems
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114139723A (en) * 2021-11-30 2022-03-04 支付宝(杭州)信息技术有限公司 Method, device and system for deep learning model training

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165093A (en) * 2018-07-31 2019-01-08 宁波积幂信息科技有限公司 A kind of calculate node cluster elasticity distribution system and method
CN109558248A (en) * 2018-12-11 2019-04-02 中国海洋大学 A kind of method and system for the determining resource allocation parameters calculated towards ocean model
CN110597626A (en) * 2019-08-23 2019-12-20 第四范式(北京)技术有限公司 Method, device and system for allocating resources and tasks in distributed system
US20210132994A1 (en) * 2019-11-06 2021-05-06 Centurylink Intellectual Property Llc Predictive resource allocation for network growth in an edge computing network
CN111507474A (en) * 2020-06-18 2020-08-07 四川大学 Neural network distributed training method for dynamically adjusting Batch-size
CN112799850A (en) * 2021-02-26 2021-05-14 重庆度小满优扬科技有限公司 Model training method, model prediction method, and model control system
CN113032116A (en) * 2021-03-05 2021-06-25 广州虎牙科技有限公司 Training method of task time prediction model, task scheduling method and related device
CN113157413A (en) * 2021-04-16 2021-07-23 上海交通大学 Deep learning task resource optimization configuration method and system based on service quality requirement
CN113485833A (en) * 2021-07-09 2021-10-08 支付宝(杭州)信息技术有限公司 Resource prediction method and device
CN113656175A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Method, apparatus and program product for training models based on distributed systems
CN114035937A (en) * 2021-10-15 2022-02-11 北京潞晨科技有限公司 Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence
CN114139723A (en) * 2021-11-30 2022-03-04 支付宝(杭州)信息技术有限公司 Method, device and system for deep learning model training

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dang Xiaochao et al., "Network traffic prediction based on improved Elman neural network," Journal of Computer Applications (计算机应用) *
Liu Yongbo et al., "A cloud computing resource scheduling method for distributed machine learning," Computer & Digital Engineering (计算机与数字工程) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115562877A (en) * 2022-11-15 2023-01-03 北京阿丘科技有限公司 Arrangement method, device and equipment of distributed computing power resources and storage medium
CN115562877B (en) * 2022-11-15 2023-03-24 北京阿丘科技有限公司 Arranging method, device and equipment of distributed computing power resources and storage medium
CN116167463A (en) * 2023-04-26 2023-05-26 之江实验室 Model training method and device, storage medium and electronic equipment
CN116680060A (en) * 2023-08-02 2023-09-01 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system
CN116680060B (en) * 2023-08-02 2023-11-03 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system
CN116954873A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN116954873B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
CN117555697A (en) * 2024-01-11 2024-02-13 之江实验室 Distributed training-oriented cache loading system, method, device and equipment
CN117555697B (en) * 2024-01-11 2024-04-05 之江实验室 Distributed training-oriented cache loading system, method, device and equipment

Also Published As

Publication number Publication date
CN114780225B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN114780225B (en) Distributed model training system, method and device
CN113037794B (en) Method, device and system for computing resource allocation scheduling
CN111861412B (en) Completion time optimization-oriented scientific workflow scheduling method and system
CN110851285B (en) Resource multiplexing method, device and equipment based on GPU virtualization
CN107704310B (en) Method, device and equipment for realizing container cluster management
CN103902353A (en) Virtual machine deployment method and device
US20230136661A1 (en) Task scheduling for machine-learning workloads
JP2017211928A (en) Distribution processing control system and distribution processing control method
CN116185629A (en) Task execution method and device, storage medium and electronic equipment
CN114625500A (en) Method and application for scheduling micro-service application based on topology perception in cloud environment
CN113849260A (en) Instance processing core allocation method and device
CN114035947A (en) Method, device, equipment and system for dynamically allocating resources
CN116107728B (en) Task execution method and device, storage medium and electronic equipment
CN111400032B (en) Resource allocation method and device
CN115964181B (en) Data processing method and device, storage medium and electronic equipment
CN116157778A (en) System and method for hybrid centralized and distributed scheduling on shared physical hosts
CN116010093A (en) Data processing method, apparatus, computer device and readable storage medium
CN113535346B (en) Method, device, equipment and computer storage medium for adjusting thread number
CN115063282A (en) GPU resource scheduling method, device, equipment and storage medium
CN116541018B (en) Distributed model compiling system, method, device, medium and equipment
CN116151137B (en) Simulation system, method and device
CN111475998A (en) Hybrid execution of Electronic Design Automation (EDA) processes to delay acquisition of remote resources
CN116755893B (en) Job scheduling method and device of deep learning-oriented distributed computing system
CN113254180B (en) Data matching method and device, electronic equipment and storage medium
CN116361037B (en) Distributed communication system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant