CN114492787A - Adaptive neural network training method, electronic device, medium, and program product - Google Patents

Adaptive neural network training method, electronic device, medium, and program product Download PDF

Info

Publication number
CN114492787A
Authority
CN
China
Prior art keywords
training
script
adaptive
neural network
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111673099.8A
Other languages
Chinese (zh)
Inventor
高嘉欣
廖名学
晁永越
吕品
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111673099.8A priority Critical patent/CN114492787A/en
Publication of CN114492787A publication Critical patent/CN114492787A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an adaptive neural network training method, an electronic device, a medium and a program product, wherein the method comprises the following steps: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round. According to the method, the electronic device, the medium and the program product, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under a heterogeneous cluster can be improved; at the same time, a distributed training mode with proportional task allocation is realized, which further improves the training efficiency of deep learning tasks under the heterogeneous cluster.

Description

Adaptive neural network training method, electronic device, medium, and program product
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to an adaptive neural network training method, an electronic device, a medium, and a program product.
Background
With the rapid development of deep learning technology, deep learning has been widely applied in the fields of image recognition, natural language processing, speech recognition, reinforcement learning, and the like. In order to obtain a better effect in practical application, the deep learning structures represented by neural networks have become more and more complex, that is, the number of network layers and parameters of a neural network keeps increasing; at the same time, the data sets used to train neural networks grow larger and larger, so the training process of a deep learning task consumes a large amount of computing resources. Since the computing power of a single machine is limited, the training time becomes excessively long.
At present, in order to save training time, distributed training is mostly adopted. Distributed training means that a plurality of GPU servers are connected through a high-performance network to jointly train a deep learning task, that is, the training task is completed jointly by a plurality of training nodes.
In the process of distributed training, each training node is typically assigned an equal number of training tasks. However, the computing power of each training node is usually different, and the training nodes with poor computing power will slow down the whole training process, resulting in the reduction of the training efficiency of the deep learning task under the heterogeneous cluster.
Disclosure of Invention
The invention provides a self-adaptive neural network training method, electronic equipment, a medium and a program product, which are used for solving the defect of low training efficiency of deep learning tasks under heterogeneous clusters in the prior art.
The invention provides a self-adaptive neural network training method, which comprises the following steps:
training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
According to the adaptive neural network training method provided by the invention, the training of the target neural network based on the adaptive parameters of the current training round comprises the following steps:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
According to the self-adaptive neural network training method provided by the invention, any training node trains a target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
According to the adaptive neural network training method provided by the invention, the adjusting of the adaptive parameters of the current training round based on the training time of each training node in the current training round comprises the following steps:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node.
According to the adaptive neural network training method provided by the invention, before the training of the target neural network based on the adaptive parameters of the current training round, the method further comprises the following steps:
acquiring a training script representing a training task, and verifying the training script based on a preset training script writing standard;
and if the training script conforms to the training script writing standard, performing adaptive packaging on the training script, and starting the packaged training script, wherein the adaptive packaging is used for adding adaptive parameters of the first training round to the training script.
According to the adaptive neural network training method provided by the invention, after the training script representing the training task is acquired and verified based on the preset training script writing standard, the method further comprises the following steps:
if the training script does not accord with the training script writing standard, sending error prompt information, wherein the error prompt information is used for prompting the modification of the training script;
acquiring a modified training script, and verifying the modified training script based on the training script writing standard;
if the modified training script conforms to the training script writing standard, performing self-adaptive packaging on the modified training script, and starting the packaged training script;
and if the modified training script does not accord with the training script writing standard, sending error prompt information, returning to the step of obtaining the modified training script, and checking the modified training script based on the training script writing standard until the modified training script accords with the training script writing standard.
According to the adaptive neural network training method provided by the invention, the training script writing specification comprises the following steps: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the adaptive neural network training method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the adaptive neural network training method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the adaptive neural network training method as described in any one of the above.
According to the adaptive neural network training method, the electronic device, the medium and the program product, the target neural network is trained based on the adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; on this basis, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under heterogeneous clusters can be improved. Based on the training time of each training node in the current training round, the adaptive parameters of the current training round are adjusted and the adjusted adaptive parameters are determined as the adaptive parameters of the next training round, so that the adaptive parameters can be iteratively updated from the training times of every training round and the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an adaptive neural network training method according to the present invention;
FIG. 2 is a schematic flow chart of node training provided by the present invention;
FIG. 3 is a second schematic flow chart of the adaptive neural network training method provided in the present invention;
FIG. 4 is a third schematic flow chart of the adaptive neural network training method provided in the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In recent years, deep learning algorithms have been widely used in the fields of image recognition, natural language processing, speech recognition, reinforcement learning, and the like, because they have better effects than conventional algorithms in various tasks. In order to obtain a better training effect in practical application, on one hand, a deep learning structure represented by a neural network is more and more complex, namely the number of network layers and parameters of the neural network is continuously increased, and on the other hand, a deep learning training data set is larger and larger, for example, a training data set in some professional fields can reach TB and even PB levels. The increase of the neural network parameters and the increase of the training data set size will cause the training process of deep learning to consume a large amount of computing resources.
The training time is too long due to the limited computing power of a single machine. In order to save training time, in recent years, training methods based on distributed training are increasingly applied to the field of deep learning. Distributed training refers to the fact that a plurality of GPU servers are connected through a high-performance network to jointly train a task. The distributed training mode breaks through the original calculation limit, so that the calculation scale is expanded, and the training time is further saved.
At present, distributed deep learning has already made certain progress: common deep learning frameworks all support distributed training tasks and have achieved a remarkable acceleration effect in some large-scale training tasks. However, due to the large number of training nodes, the complex structure of neural networks and other factors, distributed training still faces many problems and challenges, and one of the prominent problems is distributed training on heterogeneous computing resources.
When the computing performance of each training node is the same, the training efficiency of distributed training is the highest. However, in an actual production environment, since machines of training nodes are generally purchased in batches, a machine room often has multiple models of training nodes, and the computing performance, the storage performance, the data transmission performance, and the like of the training nodes are greatly different.
At present, in the process of distributed training, an equal amount of training tasks are usually distributed to each training node, although the training tasks can also be completed in this way, because of the barrel effect, the whole training process is slowed down by the training node with the worst computation performance, so the training efficiency of this way is not high, and further the training efficiency of the deep learning task under the heterogeneous cluster is reduced.
Based on the above problems, the present invention considers the difference in performance between different training nodes in the training process of heterogeneous clusters. On this basis, the invention adopts a proportionally allocated training mode, that is, training nodes with stronger performance are assigned more training tasks and training nodes with weaker performance are assigned fewer training tasks, thereby reducing the idle time of the stronger training nodes and helping to improve the efficiency of the whole distributed training.
The invention also considers that the performance difference between different training nodes is difficult to quantify: although each training node has performance indexes, these indexes cannot provide good guidance for a specific training task and become less and less accurate as a machine ages. In addition, manually implementing such task distribution requires a considerable amount of code, which undoubtedly places a great burden on the algorithm engineer.
Based on the above, the present invention provides a self-adaptive neural network training method, and fig. 1 is one of the flow diagrams of the self-adaptive neural network training method provided by the present invention, as shown in fig. 1, the method includes:
and step 110, training a target neural network based on adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network.
Specifically, if the current training round is the first training round, the adaptive parameter of the current training round may be a random value, or may be an adaptive parameter finally adjusted for the previous training task. The random value can be obtained by random initialization of a random algorithm or can be determined by preset initial parameters. And if the current training round is not the first training round, the adaptive parameter of the current training round is the adaptive parameter adjusted by the last training round.
Here, the adaptive parameter is used to represent the training task amount of each training node, and may be represented as the size of the training data set to be allocated to each training node. The adaptive parameter may be embodied as a task weight or a task proportion, and the like, which is not limited in the embodiment of the present invention. The embodiment of the present invention is described by taking an adaptive parameter, specifically, a task weight as an example.
For example, the task weights assigned to the n training nodes in the k-th training round can be written as w^k = (w_1^k, w_2^k, ..., w_n^k). When k is equal to 1, the task weight of each training node may be a random value; for example, each training node may be assigned an equal weight of 1/n.
It should be noted that the whole training task may include one training round or multiple training rounds, i.e. epoch may be 1 or an integer greater than 1.
In each training round, the specific steps of training the target neural network are as follows: all training samples in the total training data set are trained in the target neural network once, that is, all training samples in the total training data set are subjected to forward propagation and backward propagation. Further, each training node trains all training samples in the respective sub-training data set once in the target neural network.
In addition, it should be further noted that the adaptive neural network training method according to the embodiment of the present invention may be applied to any training node in each training node, and may also be applied to another computing node independent from each training node, which is not specifically limited in this embodiment of the present invention.
And step 120, adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
Specifically, after each training node completes the training of the target neural network for the current round, it counts its training time for that round. If a training node completes its training task early, it can count the training time of the current training round first and then wait synchronously until the other training nodes complete their respective training tasks, after which the adaptive parameters of the current training round are adjusted based on the training times of all the training nodes.
For example, the task weights assigned to the n training nodes in the current (k-th) training round are w^k = (w_1^k, w_2^k, ..., w_n^k), and the task weights of the next training round are w^{k+1} = (w_1^{k+1}, w_2^{k+1}, ..., w_n^{k+1}).
Here, the training time may be a gradient calculation time of the training node, or may be an overall training time for the training node to complete forward propagation and backward propagation, and the embodiment of the present invention is not limited in particular.
In one embodiment, each training node may broadcast its training time to other training nodes and receive the training times of other training nodes. At this time, each training node includes training time of all the training nodes, and based on this, each training node can adjust adaptive parameters of the current training round based on the training time of each training node in the current training round.
In another embodiment, each training node may broadcast its training time to a computing node, and the computing node may adjust adaptive parameters of the current training round based on the training time of each training node in the current training round. The certain computing node may be any one of the training nodes, or may be another computing node independent from the training nodes, which is not specifically limited in this embodiment of the present invention.
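The following is an illustrative sketch of this broadcast-and-collect step, assuming a PyTorch-style distributed setup with an already initialized process group; the function name gather_training_times is an assumption and not part of the disclosure.

import torch
import torch.distributed as dist

def gather_training_times(local_time_seconds):
    # Each rank contributes its own training time of the current round and
    # receives the times of all other training nodes via all_gather.
    world_size = dist.get_world_size()
    local = torch.tensor([local_time_seconds], dtype=torch.float32)
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return [t.item() for t in gathered]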
It can be understood that the adjusted adaptive parameter may affect the distribution of the training tasks of each training node in the next training round, thereby affecting the training time of each training node in the next training round, that is, the embodiment of the present invention finds an appropriate adaptive parameter in an iteration manner, so that each training node can exert performance to the maximum extent, that is, each training node has no idle time, thereby improving the efficiency of the overall training. Based on this, after the above step 120, the method further comprises:
when the training tasks of the next training round are started, the adaptive parameters of the next training round are used as the adaptive parameters of the current training round, and the step 110 is returned until the training tasks of all the training rounds are completed.
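A minimal sketch of this round-by-round iteration is given below; the helper names train_one_round() and adjust_weights() are assumptions standing in for step 110 and step 120, and the equal initial split is only one of the initialization options mentioned above.

def adaptive_training(num_rounds, num_nodes, train_one_round, adjust_weights):
    # Adaptive parameters of the first round; an equal split is one possible choice.
    weights = [1.0 / num_nodes] * num_nodes
    for k in range(num_rounds):
        # Step 110: train the target network for one round with the current weights;
        # the helper is assumed to return the per-node training times of this round.
        times = train_one_round(weights)
        # Step 120: adjust the weights from the measured times; the adjusted
        # values become the adaptive parameters of the next round.
        weights = adjust_weights(weights, times)
    return weights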
In some embodiments, if the adjusted adaptive parameter is substantially the same as the adaptive parameter before adjustment (i.e., the adaptive parameter of the current training round) in a certain training round, the adjusted adaptive parameter may be used as the optimal adaptive parameter of the current cluster.
In other embodiments, if the current training round is the last training round, the adjusted adaptive parameter is determined as the adaptive parameter of the next training task, i.e. as the adaptive parameter of the first training round of the next training task. The adjusted adaptive parameter in the last training round can be used as the optimal adaptive parameter of the current cluster (the cluster to which each training node belongs).
The self-adaptive neural network training method provided by the embodiment of the invention trains a target neural network based on the self-adaptive parameters of the current training round, wherein the self-adaptive parameters are used for determining the training task amount of each training node for training the target neural network, and on the basis, the training tasks are not required to be manually distributed to the training nodes with different performances, so that the labor cost can be reduced, and the training efficiency of deep learning tasks under heterogeneous clusters is improved; based on the training time of each training node in the current training round, the adaptive parameter of the current training round is adjusted, the adjusted adaptive parameter is determined as the adaptive parameter of the next training round, and based on the adaptive parameter, the adaptive parameter can be continuously updated in an iterative manner based on the training time of each training node in any training round, and the optimal adaptive parameter of each training node is finally obtained, namely the optimal adaptive parameter of the current heterogeneous cluster is finally obtained.
Based on the above embodiment, in the method, the step 110 includes:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
Specifically, each training node reads in a task weight of a current training round, and obtains a corresponding sub-training data set in a total training data set according to the task weight, and then trains a target neural network based on training samples in the sub-training data set, for example, performs gradient accumulation training based on sample data in the sub-training data set and labels thereof, and then updates network parameters of the target neural network based on gradient accumulation parameters.
For any training node, determining a sub-training data set of any training node from the total training data set based on the task weight of any training node; and training the target neural network based on the sub-training data sets.
The adaptive neural network training method provided by the embodiment of the invention determines the task weight of each training node based on the adaptive parameters of the current training round, so that each training node can respectively train the target neural network based on the corresponding task weight. Through the mode, the embodiment of the invention can automatically determine the distribution of the training tasks of the training nodes with different performances through the self-adaptive parameters, does not need to manually distribute the training tasks to the training nodes with different performances, can reduce the labor cost and improve the training efficiency of the deep learning task under the heterogeneous cluster.
Based on any of the above embodiments, fig. 2 is a schematic flow chart of node training provided by the present invention, and as shown in fig. 2, in the method, any training node trains a target neural network based on the following steps:
step 210, determining a sub-training data set of any training node from a total training data set based on the task weight of any training node.
Here, the size of the sub-training data set may be obtained by multiplying the task weight by the size of the total training data set. After the size of the sub-training data set is determined, training data may be selected from the total training data set to form the sub-training data set; the selection may be random or may follow a preset rule.
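For illustration only, one simple way to realize this partitioning is to split the total training data set into contiguous, disjoint slices whose sizes are proportional to the task weights; the function name split_by_task_weights is an assumption.

def split_by_task_weights(total_dataset, task_weights):
    # Size of each node's sub-training data set = task weight x total size.
    sizes = [round(w * len(total_dataset)) for w in task_weights]
    sub_datasets, start = [], 0
    for size in sizes:
        sub_datasets.append(total_dataset[start:start + size])
        start += size
    # Rounding may leave a few samples unassigned; a real implementation
    # would distribute the remainder according to a preset rule.
    return sub_datasets

For example, with task weights (0.5, 0.3, 0.2) and 1000 training samples, the three nodes would receive 500, 300 and 200 samples respectively.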
And step 220, performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter.
Specifically, if the training task of the target neural network is supervised training, gradient accumulation training is performed based on sample data and labels thereof in the sub-training data set of any training node. And if the training task of the target neural network is unsupervised training, performing gradient accumulation training based on sample data in the sub-training data set of any training node.
Step 230, updating the network parameters of the target neural network based on the gradient accumulation parameters.
Specifically, after each training node obtains its gradient accumulation parameter, the network parameters of the target neural network are updated based on the gradient accumulation parameters of all the training nodes. Further, any training node that completes its gradient accumulation training early enters synchronous waiting until all training nodes complete the gradient accumulation training; that is, after all training nodes reach the synchronization point, an all_reduce operation is performed to update the network parameters of the target neural network. The all_reduce operation is a conventional means of distributed deep learning and is not described herein again.
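The following PyTorch-style sketch illustrates steps 220 and 230 under stated assumptions: an initialized process group, supervised training, and averaging of the accumulated gradients across nodes (the disclosure only states that an all_reduce operation is performed).

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

def train_one_round_on_node(model, optimizer, loss_fn, sub_dataset, batch_size):
    model.train()
    optimizer.zero_grad()
    # Step 220: accumulate gradients over every batch of this node's sub-dataset.
    for inputs, labels in DataLoader(sub_dataset, batch_size=batch_size):
        loss = loss_fn(model(inputs), labels)
        loss.backward()  # gradients accumulate across batches
    # Step 230: synchronize with the other nodes and update the network parameters.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()
    optimizer.step()
    optimizer.zero_grad()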
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that a sub-training data set of any training node is determined from a total training data set based on the task weight of the training node; performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter; and updating the network parameters of the target neural network based on the gradient accumulation parameters. Through the mode, the sub-training data set of any training node can be automatically determined through the task weight, namely the training task amount of any training node is automatically determined, training tasks are not required to be manually distributed to the training nodes with different performances, the labor cost can be reduced, and the training efficiency of deep learning tasks under heterogeneous clusters is improved; meanwhile, a training mode of gradient accumulation is adopted, so that the training effect of the target neural network can be improved.
Based on any of the above embodiments, fig. 3 is a second flowchart of the adaptive neural network training method provided by the present invention, as shown in fig. 3, in the method, in step 120, adjusting adaptive parameters of the current training round based on the training time of each training node in the current training round includes:
and step 121, determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameter of the current training round.
In one embodiment, the task variation of any training node is calculated from the task weights and training times of the current training round, where u_i is the task variation of the i-th training node, w_i^k is the task weight of the i-th training node in the k-th (current) training round (the adaptive parameter being embodied as the task weight), t_i^k is the training time of the i-th training node in the current training round, and n is the number of training nodes.
In another embodiment, the task variation u_i of the i-th training node is calculated from the same quantities w_i^k, t_i^k and n, with a rounding function round() additionally applied to the result.
And step 122, adjusting the adaptive parameters of the current training round based on the task variation of each training node.
Specifically, the adjusted adaptive parameter is obtained by adding the task variation to the adaptive parameter of the current training round.
The adjustment formula of the adaptive parameters is as follows:
w_i^{k+1} = w_i^k + u_i
wherein w_i^{k+1} is the adjusted task weight of the i-th training node (the adaptive parameter being embodied as the task weight), namely the task weight of the i-th training node in the next, (k+1)-th, training round; w_i^k is the task weight of the i-th training node in the current (k-th) training round; and u_i is the task variation of the i-th training node.
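For illustration, the sketch below applies the adjustment w_i^{k+1} = w_i^k + u_i. Because the exact expression for the task variation is not reproduced here, the variation rule used — moving each weight toward a value proportional to the node's measured throughput w_i^k / t_i^k — is an assumption chosen so that faster nodes receive more work, not the formula of the disclosure.

def adjust_weights(weights, times):
    # Assumed task-variation rule: the "ideal" weight of node i is proportional
    # to its throughput w_i^k / t_i^k, and u_i is the gap to that ideal value.
    throughput = [w / t for w, t in zip(weights, times)]
    total = sum(throughput)
    ideal = [tp / total for tp in throughput]
    variations = [ide - w for ide, w in zip(ideal, weights)]   # u_i
    # w_i^{k+1} = w_i^k + u_i  (the adjustment stated in the description)
    return [w + u for w, u in zip(weights, variations)]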
In a specific embodiment, if there is only one training round in total, i.e. if epoch is 1, the adaptive parameters of the current training round may not be adjusted.
The self-adaptive neural network training method provided by the embodiment of the invention determines the task variation of each training node based on the training time of each training node in the current training round and the self-adaptive parameters of the current training round; and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node. Through the method, the adaptive parameters can be continuously updated in an iterative manner based on the training time of each training node in any training turn, and the optimal adaptive parameters of each training node are finally obtained, namely the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Based on any of the above embodiments, fig. 4 is a third schematic flowchart of the adaptive neural network training method provided in the present invention, as shown in fig. 4, before the step 110, the method further includes:
and step 410, acquiring a training script representing the training task, compiling the standard based on the preset training script, and verifying the training script.
Here, the training script is used to encapsulate a training task, i.e., an overall training task for encapsulating a target neural network, i.e., for encapsulating a deep learning task. The training script can be written by a user, and the user needs to write according to the writing specification of the training script.
The training script writing standard is used for assisting and restricting a user to write a corresponding training script, and further the user writes the corresponding training script according to the training script writing standard and the requirements of a training task.
In addition, after the training script writing standard is imported, it can be used to verify the input training script; specifically, whether the training script meets the coding requirements and the relevant constraints defined by the standard is checked against the training script writing standard.
The training script writing specification includes, but is not limited to, one or more of the following: file format specification, file name specification, calling specification of a training framework, naming specification of key variables, and the like, which are not specifically limited in the embodiment of the present invention.
In one embodiment, the training script writing specification comprises: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
Wherein the file format specification is used to define a file format of the training script. The name specification is used to define the file name of the training script. The calling specification of the training frame is used for limiting the calling mode of the training frame in the training script. The naming convention for key variables is used to define the naming of key variables (important elements) in the training script.
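As a purely illustrative sketch, such a writing standard can be checked with a few regular-expression rules; the concrete rules below (file name pattern, required framework call, key-variable names) are assumptions for demonstration and not the actual standard of the disclosure.

import re
from pathlib import Path

SPEC_RULES = {
    "file_name": re.compile(r"^train_\w+\.py$"),  # file format / file name specification
    "framework_call": re.compile(r"torch\.distributed\.init_process_group"),  # framework calling specification
    "key_variables": [re.compile(r"\b%s\s*=" % name) for name in ("model", "optimizer", "total_dataset")],
}

def check_training_script(path):
    # Returns a list of violations; an empty list means the script conforms.
    errors = []
    script = Path(path)
    if not SPEC_RULES["file_name"].match(script.name):
        errors.append("file name does not match the naming specification")
    text = script.read_text(encoding="utf-8")
    if not SPEC_RULES["framework_call"].search(text):
        errors.append("required training-framework call is missing")
    for pattern in SPEC_RULES["key_variables"]:
        if not pattern.search(text):
            errors.append("key variable missing: " + pattern.pattern)
    return errors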
It can be understood that whether the input training script has problems is judged according to the training script writing standard; on this basis, the writing of otherwise widely varying training scripts becomes standardized and valid.
And 420, if the training script conforms to the training script writing standard, performing adaptive packaging on the training script, and starting the packaged training script, wherein the adaptive packaging is used for adding adaptive parameters of the first training round to the training script.
In particular, the training script may be packaged using an adaptive packaging procedure. The self-adaptive packaging means that self-adaptive parameters are added to the training script on the basis of keeping the original training function of the training script, so that the training script has the function of self-adaptive distributed training, and the training task has the function of self-adaptive distributed training.
Further, the adaptive packaging is also used to package the code for adjusting the adaptive parameters and the code for selecting the corresponding sub-training data set according to the task weight.
The self-adaptive packaging program can perform self-defined packaging on the training script which accords with the training script writing standard, and specifically, the self-adaptive packaging program can be imported to package the training script.
In addition, the adaptive packaging program may be implemented by a combination of a python interpreter, regular expressions and other tools, which is not specifically limited in this embodiment of the present invention. On this basis, the training script input by the user can be packaged for adaptive training.
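A minimal sketch of what such packaging could look like is given below — prepending the adaptive parameters of the first training round to a conforming script; the injected variable name ADAPTIVE_TASK_WEIGHTS and the file layout are assumptions for illustration.

def package_training_script(script_path, num_nodes, output_path):
    with open(script_path, encoding="utf-8") as f:
        original = f.read()
    # Adaptive parameters of the first training round, here an equal split.
    first_round_weights = [1.0 / num_nodes] * num_nodes
    header = "# --- adaptive packaging: adaptive parameters of the first training round ---\n"
    header += f"ADAPTIVE_TASK_WEIGHTS = {first_round_weights}\n"
    header += "# --------------------------------------------------------------------------\n"
    # The original training function of the script is preserved unchanged.
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(header + original)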
In another embodiment, the adaptive packaging means that adaptive parameters are added to the training script and/or the training logic of the training task is finely adjusted and/or a training frame used in the packaging script is packaged on the basis of keeping the original training function of the training script, so that the training script has the function of adaptive distributed training, and the training task has the function of adaptive distributed training.
It should be noted that starting the packaged training script means starting the training task and beginning training, so that each training node enters the training phase.
It can be understood that the difficulty in designing the adaptive distributed training method is how to enable the user to efficiently and conveniently use the method in the heterogeneous cluster, and the training can be accelerated at low cost. Based on this, the embodiment of the invention restricts the writing of the training script by making a set of training script writing specifications, namely the original training script is written according to the specifications. The writing specification of the training script has universality and can meet the requirements of most training tasks; meanwhile, the writing specification of the training script is simple and convenient, and the coding work of an algorithm engineer can be reduced as much as possible. Therefore, the training script designed according to the training script writing specification can meet the development requirements of most deep learning tasks.
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that a training script representing a training task is obtained, the training script is verified based on a preset training script compiling standard, if the training script accords with the training script compiling standard, the training script is subjected to self-adaptive packaging, the packaged training script is started, and the self-adaptive packaging is used for adding self-adaptive parameters of a first training round for the training script. By the method, the training script is verified based on the training script writing standard, and the training script which accords with the training script writing standard is ensured to be subjected to self-adaptive packaging, so that the training task can be normally executed after the training script is started; meanwhile, the training script is packaged into the training script capable of being adaptively trained through adaptive packaging, so that the adaptive parameters can be continuously updated in an iterative manner in the follow-up process, and the optimal adaptive parameters of all training nodes are finally obtained, namely the optimal adaptive parameters of the current heterogeneous cluster are finally obtained, and the training efficiency of the deep learning task under the heterogeneous cluster is further improved.
Based on any of the above embodiments, in this method, after the step 410, the method further includes:
if the training script does not accord with the training script writing standard, sending error prompt information, wherein the error prompt information is used for prompting the modification of the training script;
acquiring a modified training script, and verifying the modified training script based on the training script writing standard;
if the modified training script conforms to the training script writing standard, performing self-adaptive packaging on the modified training script, and starting the packaged training script;
and if the modified training script does not accord with the training script writing standard, sending error prompt information, returning to the step of obtaining the modified training script, and checking the modified training script based on the training script writing standard until the modified training script accords with the training script writing standard.
Here, the error prompt information is used to prompt the user to modify the training script, that is, the user can modify the training script according to the error prompt information. The error prompt message may include a location where the training script does not meet the specification, so that the user may quickly determine where the script is wrongly written, thereby speeding up the script modification. In addition, after the user finishes modifying, the modified training script is input again.
For ease of understanding, a specific embodiment is described as an example. In this embodiment, the training script writing standard is first imported, then a training script representing a training task is obtained, and the training script is checked against the imported writing standard. If the training script does not meet the standard, the system reports an error and the user must modify the training script until it conforms; if the script conforms, an adaptive packaging program is imported and the system packages the training script with it. Next, the system starts the training script and adjusts the relevant parameters according to the adaptive algorithm, where the adaptive algorithm consists of the steps executed in step 110 and step 120 and is not described again. After the adaptive parameters become stable, the finally adjusted adaptive parameters are the optimal adaptive parameters of the current cluster; the stable adaptive parameters are output at this point, and training continues using them.
The self-adaptive neural network training method provided by the embodiment of the invention is characterized in that the training script is verified based on the training script compiling standard, if the training script does not accord with the training script compiling standard, error prompt information is output until the training script which accords with the training script compiling standard is input, and the training script which accords with the training script compiling standard is ensured to be subjected to self-adaptive packaging, so that the training task can be normally executed after the training script is started.
The following describes the neural network training device provided by the present invention, and the neural network training device described below and the adaptive neural network training method described above may be referred to correspondingly.
In this embodiment, the neural network training apparatus includes:
the training module is used for training a target neural network based on adaptive parameters of the current training round, and the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and the adjusting module is used for adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
The neural network training device provided by the embodiment of the invention trains a target neural network based on the adaptive parameters of the current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; on this basis, training tasks do not need to be manually distributed to training nodes of different performance, so that labor cost can be reduced and the training efficiency of deep learning tasks under heterogeneous clusters is improved. Based on the training time of each training node in the current training round, the adaptive parameters of the current training round are adjusted and the adjusted adaptive parameters are determined as the adaptive parameters of the next training round, so that the adaptive parameters can be iteratively updated in every training round and the optimal adaptive parameters of the current heterogeneous cluster are finally obtained.
Based on any embodiment above, the training module is further configured to:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
Based on any of the above embodiments, any training node trains the target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
Based on any embodiment above, the adjusting module is further configured to:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the self-adaptive parameters of the current training round based on the task variable quantity of each training node.
Based on any one of the above embodiments, the neural network training device further includes:
the first verification module is used for acquiring a training script representing a training task and verifying the training script based on a preset training script writing standard;
and the first packaging module is used for carrying out self-adaptive packaging on the training script and starting the packaged training script if the training script accords with the training script compiling standard, and the self-adaptive packaging is used for adding self-adaptive parameters of the first training round for the training script.
Based on any one of the above embodiments, the neural network training device further includes:
the information sending module is used for sending error prompt information if the training script is not in accordance with the training script compiling standard, wherein the error prompt information is used for prompting the modification of the training script;
the second check module is used for acquiring the modified training script and verifying the modified training script based on the training script writing standard;
the second packaging module is used for adaptively packaging the modified training script and starting the packaged training script if the modified training script conforms to the training script writing standard;
and the step returning module is used for sending out error prompt information if the modified training script does not accord with the training script compiling standard, returning to the step of obtaining the modified training script and verifying the modified training script based on the training script compiling standard until the modified training script accords with the training script compiling standard.
Based on any one of the above embodiments, the training script writing specification includes: at least one of file format specification, file name specification, calling specification of training framework, and naming specification of key variable.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform an adaptive neural network training method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
In addition, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the adaptive neural network training method provided by the above methods, the method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements an adaptive neural network training method provided by the above methods, the method comprising: training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network; and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An adaptive neural network training method, comprising:
training a target neural network based on adaptive parameters of a current training round, wherein the adaptive parameters are used for determining the training task amount of each training node for training the target neural network;
and adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round, and determining the adjusted adaptive parameters as the adaptive parameters of the next training round.
2. The adaptive neural network training method of claim 1, wherein training the target neural network based on the adaptive parameters of the current training round comprises:
determining the task weight of each training node based on the adaptive parameters of the current training round;
and training the target neural network based on the task weight of each training node.
3. The adaptive neural network training method of claim 2, wherein any one of the training nodes trains the target neural network based on the following steps:
determining a sub-training data set of any training node from a total training data set based on the task weight of any training node;
performing gradient accumulation training on the target neural network based on the sub-training data set of any training node to obtain a gradient accumulation parameter;
updating the network parameters of the target neural network based on the gradient accumulation parameters.
4. The adaptive neural network training method of claim 1, wherein the adjusting the adaptive parameters of the current training round based on the training time of each training node in the current training round comprises:
determining the task variation of each training node based on the training time of each training node in the current training round and the adaptive parameters of the current training round;
and adjusting the adaptive parameters of the current training round based on the task variation of each training node.
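One plausible reading of claim 4, sketched below under the assumption that a node's task variation is derived from how its speed (task amount per unit training time) in the current round compares with the average across nodes; the specific update rule and the `smoothing` factor are assumptions, not details fixed by the claim.

```python
def adjust_adaptive_params(adaptive_params, round_times, smoothing=0.5):
    """Turn per-node training times into task variations and apply them (claim 4 sketch)."""
    # Per-node speed in the current round: task amount per unit of training time.
    speeds = {n: adaptive_params[n] / round_times[n] for n in adaptive_params}
    mean_speed = sum(speeds.values()) / len(speeds)

    next_params = {}
    for node, param in adaptive_params.items():
        target = param * speeds[node] / mean_speed      # faster node -> larger target
        variation = smoothing * (target - param)        # task variation of this node
        next_params[node] = param + variation
    return next_params


# Example: node2 took the longest this round, so its parameter shrinks for the next round.
next_round_params = adjust_adaptive_params(
    adaptive_params={"node0": 1.0, "node1": 2.0, "node2": 1.5},
    round_times={"node0": 10.0, "node1": 21.0, "node2": 30.0},
)
```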
5. The adaptive neural network training method of any one of claims 1-4, wherein training the target neural network based on the adaptive parameters of the current training round further comprises:
acquiring a training script characterizing a training task, and verifying the training script based on a preset training script writing specification;
and if the training script conforms to the training script writing specification, performing adaptive packaging on the training script and starting the packaged training script, wherein the adaptive packaging is used to add the adaptive parameters of the first training round to the training script.
6. The adaptive neural network training method of claim 5, wherein the acquiring of the training script characterizing the training task and the verifying of the training script based on the preset training script writing specification further comprise:
if the training script does not conform to the training script writing specification, sending error prompt information, wherein the error prompt information is used to prompt modification of the training script;
acquiring the modified training script, and verifying the modified training script based on the training script writing specification;
if the modified training script conforms to the training script writing specification, performing adaptive packaging on the modified training script and starting the packaged training script;
and if the modified training script does not conform to the training script writing specification, sending the error prompt information and returning to the step of acquiring the modified training script and verifying the modified training script based on the training script writing specification, until the modified training script conforms to the training script writing specification.
7. The adaptive neural network training method of claim 5, wherein the training script writing specification comprises at least one of: a file format specification, a file name specification, a training framework invocation specification, and a key variable naming specification.
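Claims 5 through 7 together describe a pre-training pipeline: check the submitted training script against a writing specification, ask for a modified script on failure, and, once the script conforms, package it with the first round's adaptive parameters and start it. The sketch below ties those steps together; every concrete rule in `SPEC_CHECKS`, the `wrapped_` file naming, and the way parameters are injected are illustrative assumptions rather than the claimed specification.

```python
import re
from pathlib import Path

# Hypothetical stand-ins for the training script writing specification of claim 7
# (file name rule, training framework invocation rule, key variable naming rule).
SPEC_CHECKS = [
    ("file name specification", lambda p, src: bool(re.match(r"^train_\w+\.py$", p.name))),
    ("training framework invocation specification", lambda p, src: "import torch" in src),
    ("key variable naming specification", lambda p, src: "adaptive_params" in src),
]


def verify_script(path: Path) -> list:
    """Return the names of violated rules; an empty list means the script conforms."""
    source = path.read_text(encoding="utf-8")
    return [name for name, check in SPEC_CHECKS if not check(path, source)]


def adaptive_package(path: Path, first_round_params: dict) -> Path:
    """Adaptive packaging sketch: prepend the first round's adaptive parameters."""
    wrapped = f"adaptive_params = {first_round_params!r}\n" + path.read_text(encoding="utf-8")
    wrapped_path = path.with_name("wrapped_" + path.name)
    wrapped_path.write_text(wrapped, encoding="utf-8")
    return wrapped_path


def submit_training_script(path: Path, first_round_params: dict, get_modified_script):
    """Claims 5-6 flow: verify, prompt for modification on failure, repeat until conforming."""
    while True:
        violations = verify_script(path)
        if not violations:
            return adaptive_package(path, first_round_params)  # then start the packaged script
        print(f"Error: the script violates {violations}; please modify and resubmit.")
        path = get_modified_script()            # caller supplies the modified training script
```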
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the adaptive neural network training method according to any one of claims 1 to 7 are implemented when the program is executed by the processor.
9. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the adaptive neural network training method according to any one of claims 1 to 7.
10. A computer program product comprising a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the adaptive neural network training method according to any one of claims 1 to 7.
CN202111673099.8A 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product Pending CN114492787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111673099.8A CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111673099.8A CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Publications (1)

Publication Number Publication Date
CN114492787A true CN114492787A (en) 2022-05-13

Family

ID=81508207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111673099.8A Pending CN114492787A (en) 2021-12-31 2021-12-31 Adaptive neural network training method, electronic device, medium, and program product

Country Status (1)

Country Link
CN (1) CN114492787A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024026990A1 (en) * 2022-08-04 2024-02-08 上海扩博智能技术有限公司 Automatic iterative training method, system and device for recognition model, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination