CN114492787A - Adaptive neural network training method, electronic device, medium, and program product - Google Patents

Adaptive neural network training method, electronic device, medium, and program product Download PDF

Info

Publication number
CN114492787A
Authority
CN
China
Prior art keywords
training
script
adaptive
neural network
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111673099.8A
Other languages
Chinese (zh)
Inventor
高嘉欣
廖名学
晁永越
吕品
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111673099.8A priority Critical patent/CN114492787A/en
Publication of CN114492787A publication Critical patent/CN114492787A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an adaptive neural network training method, an electronic device, a medium and a program product, wherein the method comprises the following steps: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round. According to the method, the electronic device, the medium and the program product, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under a heterogeneous cluster can be improved; at the same time, a distributed training mode with proportional task allocation is realized, which further improves the training efficiency of deep learning tasks under the heterogeneous cluster.

Description

Adaptive neural network training method, electronic device, medium, and program product
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to an adaptive neural network training method, an electronic device, a medium, and a program product.
Background
With the rapid development of deep learning technology, deep learning has been widely applied in the fields of image recognition, natural language processing, speech recognition, reinforcement learning, and the like. In order to obtain a better effect in practical application, the deep learning structures represented by neural networks have become more and more complex, that is, the number of network layers and parameters of a neural network keeps increasing; at the same time, the data sets used to train neural networks grow larger and larger, so the training process of a deep learning task consumes a large amount of computing resources. Since the computing power of a single machine is limited, the training time becomes excessively long.
At present, in order to save training time, distributed training is mostly adopted. Distributed training means that a plurality of GPU servers are connected through a high-performance network to jointly train a deep learning task, that is, the training task is completed jointly by a plurality of training nodes.
In the process of distributed training, each training node is typically assigned an equal number of training tasks. However, the computing power of each training node is usually different, and the training nodes with poor computing power will slow down the whole training process, resulting in the reduction of the training efficiency of the deep learning task under the heterogeneous cluster.
Disclosure of Invention
The invention provides a self-adaptive neural network training method, electronic equipment, a medium and a program product, which are used for solving the defect of low training efficiency of deep learning tasks under heterogeneous clusters in the prior art.
The invention provides a self-adaptive neural network training method, which comprises the following steps:
training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
According to the adaptive neural network training method provided by the invention, the training of the target neural network based on the adaptive parameters of the current training round comprises the following steps:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
According to the self-adaptive neural network training method provided by the invention, any training node trains a target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
According to the adaptive neural network training method provided by the invention, the adjusting of the adaptive parameters of the current training round based on the training time of each training node in the current training round comprises the following steps:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node.
According to the adaptive neural network training method provided by the invention, before the training of the target neural network based on the adaptive parameters of the current training round, the method further comprises the following steps:
acquiring a training script representing a training task, and verifying the training script based on a preset training script writing standard;
and if the training script conforms to the training script writing standard, performing adaptive packaging on the training script, and starting the packaged training script, wherein the adaptive packaging is used for adding adaptive parameters of the first training round to the training script.
According to the adaptive neural network training method provided by the invention, after the training script representing the training task is acquired and verified based on the preset training script writing standard, the method further comprises the following steps:
if the training script does not accord with the training script writing standard, sending error prompt information, wherein the error prompt information is used for prompting the modification of the training script;
acquiring a modified training script, and verifying the modified training script based on the training script writing standard;
if the modified training script conforms to the training script writing standard, performing self-adaptive packaging on the modified training script, and starting the packaged training script;
and if the modified training script does not accord with the training script writing standard, sending error prompt information, returning to the step of obtaining the modified training script, and checking the modified training script based on the training script writing standard until the modified training script accords with the training script writing standard.
According to the adaptive neural network training method provided by the invention, the training script writing specification comprises the following steps: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the adaptive neural network training method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the adaptive neural network training method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the adaptive neural network training method as described in any one of the above.
According to the adaptive neural network training method, the electronic device, the medium and the program product, the target neural network is trained based on the adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; on this basis, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under heterogeneous clusters can be improved. Based on the training time of each training node in the current training round, the adaptive parameters of the current training round are adjusted and the adjusted adaptive parameters are determined as the adaptive parameters of the next training round, so that the adaptive parameters can be iteratively updated from the training times of every training round and the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an adaptive neural network training method according to the present invention;
FIG. 2 is a schematic flow chart of node training provided by the present invention;
FIG. 3 is a second schematic flow chart of the adaptive neural network training method provided in the present invention;
FIG. 4 is a third schematic flow chart of the adaptive neural network training method provided in the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In recent years, deep learning algorithms have been widely used in the fields of image recognition, natural language processing, speech recognition, reinforcement learning, and the like, because they have better effects than conventional algorithms in various tasks. In order to obtain a better training effect in practical application, on one hand, a deep learning structure represented by a neural network is more and more complex, namely the number of network layers and parameters of the neural network is continuously increased, and on the other hand, a deep learning training data set is larger and larger, for example, a training data set in some professional fields can reach TB and even PB levels. The increase of the neural network parameters and the increase of the training data set size will cause the training process of deep learning to consume a large amount of computing resources.
The training time is too long due to the limited computing power of a single machine. In order to save training time, in recent years, training methods based on distributed training are increasingly applied to the field of deep learning. Distributed training refers to the fact that a plurality of GPU servers are connected through a high-performance network to jointly train a task. The distributed training mode breaks through the original calculation limit, so that the calculation scale is expanded, and the training time is further saved.
At present, distributed deep learning has already made certain progress: common deep learning frameworks all support distributed training tasks and have achieved a remarkable acceleration effect in some large-scale training tasks. However, due to the large number of training nodes, the complex structure of neural networks and other factors, distributed training still faces many problems and challenges, and one of the prominent problems is distributed training on heterogeneous computing resources.
When the computing performance of each training node is the same, the training efficiency of distributed training is the highest. However, in an actual production environment, since machines of training nodes are generally purchased in batches, a machine room often has multiple models of training nodes, and the computing performance, the storage performance, the data transmission performance, and the like of the training nodes are greatly different.
At present, in the process of distributed training, an equal amount of training tasks are usually distributed to each training node, although the training tasks can also be completed in this way, because of the barrel effect, the whole training process is slowed down by the training node with the worst computation performance, so the training efficiency of this way is not high, and further the training efficiency of the deep learning task under the heterogeneous cluster is reduced.
Based on the above problems, the present invention considers the difference in performance between different training nodes in the training process of heterogeneous clusters. On this basis, the invention adopts a proportionally allocated training mode, that is, training nodes with stronger performance are assigned more training tasks and training nodes with weaker performance are assigned fewer training tasks, thereby reducing the idle time of the stronger training nodes and helping to improve the efficiency of the whole distributed training.
The invention also considers that the performance difference between different training nodes is difficult to quantify: although each training node has performance indexes, these indexes cannot provide good guidance for a specific training task and become less and less accurate as a machine ages. In addition, manually implementing such task distribution requires a considerable amount of code, which undoubtedly places a great burden on the algorithm engineer.
Based on the above, the present invention provides a self-adaptive neural network training method, and fig. 1 is one of the flow diagrams of the self-adaptive neural network training method provided by the present invention, as shown in fig. 1, the method includes:
and step 110, training a target neural network based on adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network.
Specifically, if the current training round is the first training round, the adaptive parameter of the current training round may be a random value, or may be an adaptive parameter finally adjusted for the previous training task. The random value can be obtained by random initialization of a random algorithm or can be determined by preset initial parameters. And if the current training round is not the first training round, the adaptive parameter of the current training round is the adaptive parameter adjusted by the last training round.
Here, the adaptive parameter is used to represent the training task amount of each training node, and may be represented as the size of the training data set to be allocated to each training node. The adaptive parameter may be embodied as a task weight or a task proportion, and the like, which is not limited in the embodiment of the present invention. The embodiment of the present invention is described by taking an adaptive parameter, specifically, a task weight as an example.
For example, the task weights assigned to the n training nodes in the k-th training round can be written as w^k = (w_1^k, w_2^k, ..., w_n^k). When k is equal to 1, the task weight of each training node may be a random value; for example, each training node may be assigned an equal weight of 1/n.
It should be noted that the whole training task may include one training round or multiple training rounds, i.e. epoch may be 1 or an integer greater than 1.
In each training round, the specific steps of training the target neural network are as follows: all training samples in the total training data set are trained in the target neural network once, that is, all training samples in the total training data set are subjected to forward propagation and backward propagation. Further, each training node trains all training samples in the respective sub-training data set once in the target neural network.
In addition, it should be further noted that the adaptive neural network training method according to the embodiment of the present invention may be applied to any training node in each training node, and may also be applied to another computing node independent from each training node, which is not specifically limited in this embodiment of the present invention.
And step 120, adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
Specifically, after each training node completes the training of the target neural network for the current round, it counts its training time for that round. If a training node completes its training task early, it can count the training time of the current training round first and then wait synchronously until the other training nodes complete their respective training tasks, after which the adaptive parameters of the current training round are adjusted based on the training times of all the training nodes.
For example, the task weights assigned to the n training nodes in the current (k-th) training round are w^k = (w_1^k, w_2^k, ..., w_n^k), and the task weights of the next training round are w^{k+1} = (w_1^{k+1}, w_2^{k+1}, ..., w_n^{k+1}).
Here, the training time may be a gradient calculation time of the training node, or may be an overall training time for the training node to complete forward propagation and backward propagation, and the embodiment of the present invention is not limited in particular.
In one embodiment, each training node may broadcast its training time to other training nodes and receive the training times of other training nodes. At this time, each training node includes training time of all the training nodes, and based on this, each training node can adjust adaptive parameters of the current training round based on the training time of each training node in the current training round.
In another embodiment, each training node may broadcast its training time to a computing node, and the computing node may adjust adaptive parameters of the current training round based on the training time of each training node in the current training round. The certain computing node may be any one of the training nodes, or may be another computing node independent from the training nodes, which is not specifically limited in this embodiment of the present invention.
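The following is an illustrative sketch of this broadcast-and-collect step, assuming a PyTorch-style distributed setup with an already initialized process group; the function name gather_training_times is an assumption and not part of the disclosure.

import torch
import torch.distributed as dist

def gather_training_times(local_time_seconds):
    # Each rank contributes its own training time of the current round and
    # receives the times of all other training nodes via all_gather.
    world_size = dist.get_world_size()
    local = torch.tensor([local_time_seconds], dtype=torch.float32)
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return [t.item() for t in gathered]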
It can be understood that the adjusted adaptive parameter may affect the distribution of the training tasks of each training node in the next training round, thereby affecting the training time of each training node in the next training round, that is, the embodiment of the present invention finds an appropriate adaptive parameter in an iteration manner, so that each training node can exert performance to the maximum extent, that is, each training node has no idle time, thereby improving the efficiency of the overall training. Based on this, after the above step 120, the method further comprises:
when the training tasks of the next training round are started, the adaptive parameters of the next training round are used as the adaptive parameters of the current training round, and the step 110 is returned until the training tasks of all the training rounds are completed.
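A minimal sketch of this round-by-round iteration is given below; the helper names train_one_round() and adjust_weights() are assumptions standing in for step 110 and step 120, and the equal initial split is only one of the initialization options mentioned above.

def adaptive_training(num_rounds, num_nodes, train_one_round, adjust_weights):
    # Adaptive parameters of the first round; an equal split is one possible choice.
    weights = [1.0 / num_nodes] * num_nodes
    for k in range(num_rounds):
        # Step 110: train the target network for one round with the current weights;
        # the helper is assumed to return the per-node training times of this round.
        times = train_one_round(weights)
        # Step 120: adjust the weights from the measured times; the adjusted
        # values become the adaptive parameters of the next round.
        weights = adjust_weights(weights, times)
    return weights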
In some embodiments, if the adjusted adaptive parameter is substantially the same as the adaptive parameter before adjustment (i.e., the adaptive parameter of the current training round) in a certain training round, the adjusted adaptive parameter may be used as the optimal adaptive parameter of the current cluster.
In other embodiments, if the current training round is the last training round, the adjusted adaptive parameter is determined as the adaptive parameter of the next training task, i.e. as the adaptive parameter of the first training round of the next training task. The adjusted adaptive parameter in the last training round can be used as the optimal adaptive parameter of the current cluster (the cluster to which each training node belongs).
The self-adaptive neural network training method provided by the embodiment of the invention trains a target neural network based on the self-adaptive parameters of the current training round, wherein the self-adaptive parameters are used for determining the training task amount of each training node for training the target neural network, and on the basis, the training tasks are not required to be manually distributed to the training nodes with different performances, so that the labor cost can be reduced, and the training efficiency of deep learning tasks under heterogeneous clusters is improved; based on the training time of each training node in the current training round, the adaptive parameter of the current training round is adjusted, the adjusted adaptive parameter is determined as the adaptive parameter of the next training round, and based on the adaptive parameter, the adaptive parameter can be continuously updated in an iterative manner based on the training time of each training node in any training round, and the optimal adaptive parameter of each training node is finally obtained, namely the optimal adaptive parameter of the current heterogeneous cluster is finally obtained.
Based on the above embodiment, in the method, the step 110 includes:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
Specifically, each training node reads in a task weight of a current training round, and obtains a corresponding sub-training data set in a total training data set according to the task weight, and then trains a target neural network based on training samples in the sub-training data set, for example, performs gradient accumulation training based on sample data in the sub-training data set and labels thereof, and then updates network parameters of the target neural network based on gradient accumulation parameters.
For any training node, determining a sub-training data set of any training node from the total training data set based on the task weight of any training node; and training the target neural network based on the sub-training data sets.
The adaptive neural network training method provided by the embodiment of the invention determines the task weight of each training node based on the adaptive parameters of the current training round, so that each training node can respectively train the target neural network based on the corresponding task weight. Through the mode, the embodiment of the invention can automatically determine the distribution of the training tasks of the training nodes with different performances through the self-adaptive parameters, does not need to manually distribute the training tasks to the training nodes with different performances, can reduce the labor cost and improve the training efficiency of the deep learning task under the heterogeneous cluster.
Based on any of the above embodiments, fig. 2 is a schematic flow chart of node training provided by the present invention, and as shown in fig. 2, in the method, any training node trains a target neural network based on the following steps:
step 210, determining a sub-training data set of any training node from a total training data set based on the task weight of any training node.
Here, the size of the sub-training data set may be obtained by multiplying the task weight by the size of the total training data set. After the size of the sub-training data set is determined, training data may be selected from the total training data set to form the sub-training data set; the selection may be random or may follow a preset rule.
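For illustration only, one simple way to realize this partitioning is to split the total training data set into contiguous, disjoint slices whose sizes are proportional to the task weights; the function name split_by_task_weights is an assumption.

def split_by_task_weights(total_dataset, task_weights):
    # Size of each node's sub-training data set = task weight x total size.
    sizes = [round(w * len(total_dataset)) for w in task_weights]
    sub_datasets, start = [], 0
    for size in sizes:
        sub_datasets.append(total_dataset[start:start + size])
        start += size
    # Rounding may leave a few samples unassigned; a real implementation
    # would distribute the remainder according to a preset rule.
    return sub_datasets

For example, with task weights (0.5, 0.3, 0.2) and 1000 training samples, the three nodes would receive 500, 300 and 200 samples respectively.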
And step 220, performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter.
Specifically, if the training task of the target neural network is supervised training, gradient accumulation training is performed based on sample data and labels thereof in the sub-training data set of any training node. And if the training task of the target neural network is unsupervised training, performing gradient accumulation training based on sample data in the sub-training data set of any training node.
Step 230, updating the network parameters of the target neural network based on the gradient accumulation parameters.
Specifically, after each training node obtains its gradient accumulation parameter, the network parameters of the target neural network are updated based on the gradient accumulation parameters of all the training nodes. Further, any training node that completes its gradient accumulation training early enters synchronous waiting until all training nodes complete the gradient accumulation training; that is, after all training nodes reach the synchronization point, an all_reduce operation is performed to update the network parameters of the target neural network. The all_reduce operation is a conventional means of distributed deep learning and is not described herein again.
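The following PyTorch-style sketch illustrates steps 220 and 230 under stated assumptions: an initialized process group, supervised training, and averaging of the accumulated gradients across nodes (the disclosure only states that an all_reduce operation is performed).

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

def train_one_round_on_node(model, optimizer, loss_fn, sub_dataset, batch_size):
    model.train()
    optimizer.zero_grad()
    # Step 220: accumulate gradients over every batch of this node's sub-dataset.
    for inputs, labels in DataLoader(sub_dataset, batch_size=batch_size):
        loss = loss_fn(model(inputs), labels)
        loss.backward()  # gradients accumulate across batches
    # Step 230: synchronize with the other nodes and update the network parameters.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
    optimizer.step()
    optimizer.zero_grad()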
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that a sub-training data set of any training node is determined from a total training data set based on the task weight of the training node; performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter; and updating the network parameters of the target neural network based on the gradient accumulation parameters. Through the mode, the sub-training data set of any training node can be automatically determined through the task weight, namely the training task amount of any training node is automatically determined, training tasks are not required to be manually distributed to the training nodes with different performances, the labor cost can be reduced, and the training efficiency of deep learning tasks under heterogeneous clusters is improved; meanwhile, a training mode of gradient accumulation is adopted, so that the training effect of the target neural network can be improved.
Based on any of the above embodiments, fig. 3 is a second flowchart of the adaptive neural network training method provided by the present invention, as shown in fig. 3, in the method, in step 120, adjusting adaptive parameters of the current training round based on the training time of each training node in the current training round includes:
and step 121, determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameter of the current training round.
In one embodiment, the task variation of any training node is calculated from the task weights and training times of the current training round, where u_i is the task variation of the i-th training node, w_i^k is the task weight of the i-th training node in the k-th (current) training round (the adaptive parameter being embodied as the task weight), t_i^k is the training time of the i-th training node in the current training round, and n is the number of training nodes.
In another embodiment, the task variation u_i of the i-th training node is calculated from the same quantities w_i^k, t_i^k and n, with a rounding function round() additionally applied to the result.
And step 122, adjusting the adaptive parameters of the current training round based on the task variation of each training node.
Specifically, the adjusted adaptive parameter is obtained by adding the task variation to the adaptive parameter of the current training round.
The adjustment formula of the adaptive parameters is as follows:
w_i^{k+1} = w_i^k + u_i
wherein w_i^{k+1} is the adjusted task weight of the i-th training node (the adaptive parameter being embodied as the task weight), namely the task weight of the i-th training node in the next, (k+1)-th, training round; w_i^k is the task weight of the i-th training node in the current (k-th) training round; and u_i is the task variation of the i-th training node.
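For illustration, the sketch below applies the adjustment w_i^{k+1} = w_i^k + u_i. Because the exact expression for the task variation is not reproduced here, the variation rule used — moving each weight toward a value proportional to the node's measured throughput w_i^k / t_i^k — is an assumption chosen so that faster nodes receive more work, not the formula of the disclosure.

def adjust_weights(weights, times):
    # Assumed task-variation rule: the "ideal" weight of node i is proportional
    # to its throughput w_i^k / t_i^k, and u_i is the gap to that ideal value.
    throughput = [w / t for w, t in zip(weights, times)]
    total = sum(throughput)
    ideal = [tp / total for tp in throughput]
    variations = [ide - w for ide, w in zip(ideal, weights)]   # u_i
    # w_i^{k+1} = w_i^k + u_i  (the adjustment stated in the description)
    return [w + u for w, u in zip(weights, variations)]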
In a specific embodiment, if there is only one training round in total, i.e. if epoch is 1, the adaptive parameters of the current training round may not be adjusted.
The self-adaptive neural network training method provided by the embodiment of the invention determines the task variation of each training node based on the training time of each training node in the current training round and the self-adaptive parameters of the current training round; and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node. Through the method, the adaptive parameters can be continuously updated in an iterative manner based on the training time of each training node in any training turn, and the optimal adaptive parameters of each training node are finally obtained, namely the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Based on any of the above embodiments, fig. 4 is a third schematic flowchart of the adaptive neural network training method provided in the present invention, as shown in fig. 4, before the step 110, the method further includes:
and step 410, acquiring a training script representing the training task, compiling the standard based on the preset training script, and verifying the training script.
Here, the training script is used to encapsulate a training task, i.e., an overall training task for encapsulating a target neural network, i.e., for encapsulating a deep learning task. The training script can be written by a user, and the user needs to write according to the writing specification of the training script.
The training script writing standard is used for assisting and restricting a user to write a corresponding training script, and further the user writes the corresponding training script according to the training script writing standard and the requirements of a training task.
In addition, after the training script writing standard is imported, it can be used to verify the input training script; specifically, whether the training script meets the coding requirements and the relevant constraints defined by the standard is checked against the training script writing standard.
The training script writing specification includes, but is not limited to, one or more of the following: file format specification, file name specification, calling specification of a training framework, naming specification of key variables, and the like, which are not specifically limited in the embodiment of the present invention.
In one embodiment, the training script writing specification comprises: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
Wherein the file format specification is used to define a file format of the training script. The name specification is used to define the file name of the training script. The calling specification of the training frame is used for limiting the calling mode of the training frame in the training script. The naming convention for key variables is used to define the naming of key variables (important elements) in the training script.
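As a purely illustrative sketch, such a writing standard can be checked with a few regular-expression rules; the concrete rules below (file name pattern, required framework call, key-variable names) are assumptions for demonstration and not the actual standard of the disclosure.

import re
from pathlib import Path

SPEC_RULES = {
    "file_name": re.compile(r"^train_\w+\.py$"),  # file format / file name specification
    "framework_call": re.compile(r"torch\.distributed\.init_process_group"),  # framework calling specification
    "key_variables": [re.compile(r"\b%s\s*=" % name) for name in ("model", "optimizer", "total_dataset")],
}

def check_training_script(path):
    # Returns a list of violations; an empty list means the script conforms.
    errors = []
    script = Path(path)
    if not SPEC_RULES["file_name"].match(script.name):
        errors.append("file name does not match the naming specification")
    text = script.read_text(encoding="utf-8")
    if not SPEC_RULES["framework_call"].search(text):
        errors.append("required training-framework call is missing")
    for pattern in SPEC_RULES["key_variables"]:
        if not pattern.search(text):
            errors.append("key variable missing: " + pattern.pattern)
    return errors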
It can be understood that whether the input training script has problems is judged according to the training script writing standard; on this basis, the writing of otherwise widely varying training scripts becomes standardized and valid.
And 420, if the training script conforms to the training script writing standard, performing adaptive packaging on the training script, and starting the packaged training script, wherein the adaptive packaging is used for adding adaptive parameters of the first training round to the training script.
In particular, the training script may be packaged using an adaptive packaging procedure. The self-adaptive packaging means that self-adaptive parameters are added to the training script on the basis of keeping the original training function of the training script, so that the training script has the function of self-adaptive distributed training, and the training task has the function of self-adaptive distributed training.
Further, the adaptive packaging is also used to package the code for adjusting the adaptive parameters and the code for selecting the corresponding sub-training data set according to the task weight.
The self-adaptive packaging program can perform self-defined packaging on the training script which accords with the training script writing standard, and specifically, the self-adaptive packaging program can be imported to package the training script.
In addition, the adaptive packaging program may be implemented by a combination of a python interpreter, regular expressions and other tools, which is not specifically limited in this embodiment of the present invention. On this basis, the training script input by the user can be packaged for adaptive training.
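A minimal sketch of what such packaging could look like is given below — prepending the adaptive parameters of the first training round to a conforming script; the injected variable name ADAPTIVE_TASK_WEIGHTS and the file layout are assumptions for illustration.

def package_training_script(script_path, num_nodes, output_path):
    with open(script_path, encoding="utf-8") as f:
        original = f.read()
    # Adaptive parameters of the first training round, here an equal split.
    first_round_weights = [1.0 / num_nodes] * num_nodes
    header = "# --- adaptive packaging: adaptive parameters of the first training round ---\n"
    header += f"ADAPTIVE_TASK_WEIGHTS = {first_round_weights}\n"
    header += "# --------------------------------------------------------------------------\n"
    # The original training function of the script is preserved unchanged.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(header + original)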
In another embodiment, the adaptive packaging means that adaptive parameters are added to the training script and/or the training logic of the training task is finely adjusted and/or a training frame used in the packaging script is packaged on the basis of keeping the original training function of the training script, so that the training script has the function of adaptive distributed training, and the training task has the function of adaptive distributed training.
It should be noted that starting the packaged training script means starting the training task and beginning training, so that each training node enters the training phase.
It can be understood that the difficulty in designing the adaptive distributed training method is how to enable the user to efficiently and conveniently use the method in the heterogeneous cluster, and the training can be accelerated at low cost. Based on this, the embodiment of the invention restricts the writing of the training script by making a set of training script writing specifications, namely the original training script is written according to the specifications. The writing specification of the training script has universality and can meet the requirements of most training tasks; meanwhile, the writing specification of the training script is simple and convenient, and the coding work of an algorithm engineer can be reduced as much as possible. Therefore, the training script designed according to the training script writing specification can meet the development requirements of most deep learning tasks.
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that a training script representing a training task is obtained, the training script is verified based on a preset training script compiling standard, if the training script accords with the training script compiling standard, the training script is subjected to self-adaptive packaging, the packaged training script is started, and the self-adaptive packaging is used for adding self-adaptive parameters of a first training round for the training script. By the method, the training script is verified based on the training script writing standard, and the training script which accords with the training script writing standard is ensured to be subjected to self-adaptive packaging, so that the training task can be normally executed after the training script is started; meanwhile, the training script is packaged into the training script capable of being adaptively trained through adaptive packaging, so that the adaptive parameters can be continuously updated in an iterative manner in the follow-up process, and the optimal adaptive parameters of all training nodes are finally obtained, namely the optimal adaptive parameters of the current heterogeneous cluster are finally obtained, and the training efficiency of the deep learning task under the heterogeneous cluster is further improved.
Based on any of the above embodiments, in this method, after the step 410, the method further includes:
if the training script does not accord with the training script writing standard, sending error prompt information, wherein the error prompt information is used for prompting the modification of the training script;
acquiring a modified training script, and verifying the modified training script based on the training script writing standard;
if the modified training script conforms to the training script writing standard, performing self-adaptive packaging on the modified training script, and starting the packaged training script;
and if the modified training script does not accord with the training script writing standard, sending error prompt information, returning to the step of obtaining the modified training script, and checking the modified training script based on the training script writing standard until the modified training script accords with the training script writing standard.
Here, the error prompt information is used to prompt the user to modify the training script, that is, the user can modify the training script according to the error prompt information. The error prompt message may include a location where the training script does not meet the specification, so that the user may quickly determine where the script is wrongly written, thereby speeding up the script modification. In addition, after the user finishes modifying, the modified training script is input again.
For ease of understanding, a specific embodiment is described as an example. In this embodiment, the training script writing standard is first imported, then a training script representing a training task is obtained, and the training script is checked against the imported writing standard. If the training script does not meet the standard, the system reports an error and the user must modify the training script until it conforms; if the script conforms, an adaptive packaging program is imported and the system packages the training script with it. Next, the system starts the training script and adjusts the relevant parameters according to the adaptive algorithm, where the adaptive algorithm consists of the steps executed in step 110 and step 120 and is not described again. After the adaptive parameters become stable, the finally adjusted adaptive parameters are the optimal adaptive parameters of the current cluster; the stable adaptive parameters are output at this point, and training continues using them.
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that the training script is verified based on the training script compiling standard, if the training script does not accord with the training script compiling standard, error prompt information is output until the training script which accords with the training script compiling standard is input, and the training script which accords with the training script compiling standard is ensured to be subjected to self-adaptive packaging, so that the training task can be normally executed after the training script is started.
The following describes the neural network training device provided by the present invention, and the neural network training device described below and the adaptive neural network training method described above may be referred to correspondingly.
In this embodiment, the neural network training apparatus includes:
the training module is used for training a target neural network based on adaptive parameters of the current training round, and the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and the adjusting module is used for adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
The neural network training device provided by the embodiment of the invention trains a target neural network based on the adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; on this basis, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under heterogeneous clusters is improved. Based on the training time of each training node in the current training round, the adaptive parameters of the current training round are adjusted and the adjusted adaptive parameters are determined as the adaptive parameters of the next training round, so that the adaptive parameters can be iteratively updated in every training round and the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Based on any embodiment above, the training module is further configured to:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
Based on any of the above embodiments, any training node trains the target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
Based on any embodiment above, the adjusting module is further configured to:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node.
Based on any one of the above embodiments, the neural network training device further includes:
the first verification module is used for acquiring a training script representing a training task and verifying the training script based on a preset training script writing standard;
and the first packaging module is used for carrying out self-adaptive packaging on the training script and starting the packaged training script if the training script accords with the training script compiling standard, and the self-adaptive packaging is used for adding self-adaptive parameters of the first training round for the training script.
Based on any one of the above embodiments, the neural network training device further includes:
the information sending module is used for sending error prompt information if the training script is not in accordance with the training script compiling standard, wherein the error prompt information is used for prompting the modification of the training script;
the second check module is used for acquiring the modified training script and verifying the modified training script based on the training script writing standard;
the second packaging module is used for adaptively packaging the modified training script and starting the packaged training script if the modified training script conforms to the training script writing standard;
and the step returning module is used for sending out error prompt information if the modified training script does not accord with the training script compiling standard, returning to the step of obtaining the modified training script and verifying the modified training script based on the training script compiling standard until the modified training script accords with the training script compiling standard.
Based on any one of the above embodiments, the training script writing specification includes: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform an adaptive neural network training method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
In addition, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the adaptive neural network training method provided by the above methods, the method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements an adaptive neural network training method provided by the above methods, the method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An adaptive neural network training method, comprising:
training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
2. The adaptive neural network training method of claim 1, wherein training the target neural network based on the adaptive parameters of the current training round comprises:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
3. The adaptive neural network training method of claim 2, wherein any one of the training nodes trains the target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
4. The adaptive neural network training method of claim 1, wherein the adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round comprises:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the adaptive parameters of the current training round based on the task variation of each training node.
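One plausible reading of claim 4, sketched below under the assumption that a node's task variation is derived from how its speed (task amount per unit training time) in the current round compares with the average across nodes; the specific update rule and the `smoothing` factor are assumptions, not details fixed by the claim.

```python
def adjust_adaptive_params(adaptive_params, round_times, smoothing=0.5):
    """Turn per-node training times into task variations and apply them (claim 4 sketch)."""
    # Per-node speed in the current round: task amount per unit of training time.
    speeds = {n: adaptive_params[n] / round_times[n] for n in adaptive_params}
    mean_speed = sum(speeds.values()) / len(speeds)

    next_params = {}
    for node, param in adaptive_params.items():
        target = param * speeds[node] / mean_speed      # faster node -> larger target
        variation = smoothing * (target - param)        # task variation of this node
        next_params[node] = param + variation
    return next_params


# Example: node2 took the longest this round, so its parameter shrinks for the next round.
next_round_params = adjust_adaptive_params(
    adaptive_params={"node0": 1.0, "node1": 2.0, "node2": 1.5},
    round_times={"node0": 10.0, "node1": 21.0, "node2": 30.0},
)
```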
5. The adaptive neural network training method of any one of claims 1-4, wherein training the target neural network based on the adaptive parameters of the current training round further comprises:
acquiring a training script characterizing a training task, and verifying the training script based on a preset training script writing specification;
and if the training script conforms to the training script writing specification, performing adaptive packaging on the training script and starting the packaged training script, wherein the adaptive packaging is used to add the adaptive parameters of the first training round to the training script.
6. The adaptive neural network training method of claim 5, wherein the acquiring of the training script characterizing the training task and the verifying of the training script based on the preset training script writing specification further comprise:
if the training script does not conform to the training script writing specification, sending error prompt information, wherein the error prompt information is used to prompt modification of the training script;
acquiring the modified training script, and verifying the modified training script based on the training script writing specification;
if the modified training script conforms to the training script writing specification, performing adaptive packaging on the modified training script and starting the packaged training script;
and if the modified training script does not conform to the training script writing specification, sending the error prompt information and returning to the step of acquiring the modified training script and verifying the modified training script based on the training script writing specification, until the modified training script conforms to the training script writing specification.
7. The adaptive neural network training method of claim 5, wherein the training script writing specification comprises at least one of: a file format specification, a file name specification, a training framework invocation specification, and a key variable naming specification.
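Claims 5 through 7 together describe a pre-training pipeline: check the submitted training script against a writing specification, ask for a modified script on failure, and, once the script conforms, package it with the first round's adaptive parameters and start it. The sketch below ties those steps together; every concrete rule in `SPEC_CHECKS`, the `wrapped_` file naming, and the way parameters are injected are illustrative assumptions rather than the claimed specification.

```python
import re
from pathlib import Path

# Hypothetical stand-ins for the training script writing specification of claim 7
# (file name rule, training framework invocation rule, key variable naming rule).
SPEC_CHECKS = [
    ("file name specification", lambda p, src: bool(re.match(r"^train_\w+\.py$", p.name))),
    ("training framework invocation specification", lambda p, src: "import torch" in src),
    ("key variable naming specification", lambda p, src: "adaptive_params" in src),
]


def verify_script(path: Path) -> list:
    """Return the names of violated rules; an empty list means the script conforms."""
    source = path.read_text(encoding="utf-8")
    return [name for name, check in SPEC_CHECKS if not check(path, source)]


def adaptive_package(path: Path, first_round_params: dict) -> Path:
    """Adaptive packaging sketch: prepend the first round's adaptive parameters."""
    wrapped = f"adaptive_params = {first_round_params!r}\n" + path.read_text(encoding="utf-8")
    wrapped_path = path.with_name("wrapped_" + path.name)
    wrapped_path.write_text(wrapped, encoding="utf-8")
    return wrapped_path


def submit_training_script(path: Path, first_round_params: dict, get_modified_script):
    """Claims 5-6 flow: verify, prompt for modification on failure, repeat until conforming."""
    while True:
        violations = verify_script(path)
        if not violations:
            return adaptive_package(path, first_round_params)  # then start the packaged script
        print(f"Error: the script violates {violations}; please modify and resubmit.")
        path = get_modified_script()            # caller supplies the modified training script
```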
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the adaptive neural network training method according to any one of claims 1 to 7 are implemented when the program is executed by the processor.
9. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the adaptive neural network training method according to any one of claims 1 to 7.
10. A computer program product comprising a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the adaptive neural network training method according to any one of claims 1 to 7.
CN202111673099.8A 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product Pending CN114492787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111673099.8A CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111673099.8A CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Publications (1)

Publication Number Publication Date
CN114492787A true CN114492787A (en) 2022-05-13

Family

ID=81508207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111673099.8A Pending CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Country Status (1)

Country Link
CN (1) CN114492787A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026990A1 (en) * 2022-08-04 2024-02-08 上海扩博智能技术有限公司 Automatic iterative training method, system and device for recognition model, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination