CN111047016A - Model training method and device - Google Patents

Model training method and device

Info

Publication number
CN111047016A
Authority
CN
China
Prior art keywords
model
training
parameters
check point
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911284996.2A
Other languages
Chinese (zh)
Inventor
王洪伟 (Wang Hongwei)
李长亮 (Li Changliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Software Co Ltd
Kingsoft Corp Ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co ltd, Beijing Kingsoft Software Co Ltd filed Critical Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Priority to CN201911284996.2A priority Critical patent/CN111047016A/en
Publication of CN111047016A publication Critical patent/CN111047016A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a model training method and device. The method comprises: monitoring the training progress of a model and the loss value of the model corresponding to that progress; saving the model parameters of the model corresponding to a preset check point when the training progress reaches the check point; and, when the model training is determined to be abnormal according to the loss value, setting the model parameters of the model directly to the model parameters saved at the check point. The training parameters of the model are then adjusted directly, so that the training process gains greater randomness and training continues. This ensures that the training process is not interrupted and accelerates the training of the model.

Description

Model training method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for model training, a computing device, and a computer-readable storage medium.
Background
Neural network models are an active research topic. In recent years, with the continuous improvement of computing power, deep neural networks with increasingly complex structures have been developed, and technical schemes based on deep neural networks have achieved significant performance improvements in fields such as speech and image recognition.
Generally, the more complex the topological structure of a model, the more model parameters it has. A complex model improves simulation accuracy, but the scale of its parameters is often huge, so the training process is slow and cannot meet the application requirements of system products. For example, in speech recognition applications, the training data often consists of thousands of hours or more of speech, and training an ordinary-scale deep neural network on such data can take months or even years, which is completely unacceptable for product upgrades.
When training a neural network model, especially a complex one, gradient mutation (an abrupt change in the gradient) during the training stage occurs frequently. When it occurs, training stops and can only be resumed through human intervention, so the training process is interrupted. Complex models already train slowly because of the large scale of their model parameters; such interruptions slow training down further and directly affect training efficiency.
Disclosure of Invention
In view of the above, embodiments of the present application provide a method and an apparatus for model training, a computing device, and a computer-readable storage medium, so as to solve the technical defects in the prior art.
The embodiment of the application discloses a model training method, which comprises the following steps:
monitoring the training progress of the model and the loss value of the model corresponding to the training progress;
under the condition that the training progress of the model reaches a preset check point, saving model parameters corresponding to the check point of the model;
according to the loss value of the model, setting the model parameters of the model to the model parameters corresponding to the check point under the condition that the model training is determined to be abnormal;
adjusting training parameters of the model to continue training the model.
The embodiment of the application also discloses a device for model training, which comprises:
the monitoring module is configured to monitor the training progress of the model and the loss value of the model corresponding to the training progress;
the saving module is configured to save model parameters corresponding to the model at a preset check point when the training progress of the model reaches the preset check point;
the setting module is configured to set the model parameters of the model as the model parameters corresponding to the model at the check point under the condition that the model training is determined to be abnormal according to the loss value of the model;
an adjustment module configured to adjust training parameters of the model to continue training the model.
The embodiment of the application discloses a computing device, which comprises a memory, a processor and computer instructions stored on the memory and capable of running on the processor, wherein the processor executes the instructions to realize the steps of the model training method.
Embodiments of the present application disclose a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of model training as described above.
The model training method and device provided by the application monitor the training progress of the model and the loss value corresponding to that progress. When the training progress reaches a preset check point, the model parameters corresponding to the check point are saved. When the model training is determined to be abnormal according to the loss value, the model parameters are set to the parameters saved at the check point, and the training parameters of the model are adjusted directly so that the training process gains greater randomness. Training can therefore continue without human intervention, the training process is guaranteed not to stop, and the training speed is increased; in particular, when training a complex model, training efficiency is significantly improved.
In addition, when the model training is determined to be abnormal, setting the model parameters to the parameters saved at the check point greatly reduces repeated training; in other words, the model does not need to relearn the parameter updates it has already performed, which saves a large amount of computing resources and further increases the training speed.
Drawings
FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for model training according to a first embodiment of the present application;
FIG. 3 is a schematic flow chart of a method for model training according to a second embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of a method for model training according to a third embodiment of the present application;
fig. 5 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
Loss value: the value of the loss function. The loss value measures the degree of inconsistency between the predicted value and the true value of the model; the smaller the loss value, the better the robustness of the model.
Gradient: for a univariate real-valued function, the gradient can simply be understood as the derivative; for a linear function, it is the slope of the line.
Training step: each time training data is input into the model and one round of training is completed, the model has completed one training step.
Model parameters: in a neural network model, the model parameters include weights (weight) and biases (bias), which are updated automatically, that is, learned, during training. Model parameters are configuration variables inside the model and are updated and adjusted at every training step. For example, each neuron in a layer of the model is connected to each neuron in the previous layer; every connection carries a weight and a bias, and these are updated automatically as the model trains.
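To make the distinction concrete, the following is a minimal sketch (an illustration added here, not code from the patent) of a single fully-connected layer whose weight matrix and bias vector are updated by one step of gradient descent on a squared-error loss; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))     # weights: one per connection to the previous layer
b = np.zeros(2)                 # biases: one per neuron in this layer

x = rng.normal(size=3)          # activations from the previous layer
y_true = np.array([1.0, 0.0])   # toy target

y_pred = x @ W + b              # forward pass through the layer
grad_y = 2 * (y_pred - y_true)  # d(loss)/d(y_pred) for squared error

# The update below is what "learned automatically during training" refers to;
# the learning rate itself is a hyper-parameter and is never learned.
lr = 0.1
W -= lr * np.outer(x, grad_y)
b -= lr * grad_y
```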
In the present application, a method and apparatus for model training, a computing device and a computer readable storage medium are provided, which are described in detail in the following embodiments one by one.
Fig. 1 is a block diagram illustrating a configuration of a computing device 100 according to an embodiment of the present specification. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, which enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 100 and other components not shown in FIG. 1 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the method shown in fig. 2. Fig. 2 is a schematic flow chart diagram illustrating a method of model training according to a first embodiment of the present application, comprising steps 202 to 208.
Step 202: Monitoring the training progress of the model and the loss value of the model corresponding to the training progress.
The model is the model to be trained. After training starts, the training progress of the model and the corresponding loss value are monitored in real time. The training progress may be measured by the number of training steps of the model or by the training time of the model.
For example, the number of training steps and the corresponding loss value are monitored: when the model reaches its 50th training step, the loss value corresponding to the 50th step is monitored. Alternatively, the training time and the corresponding loss value are monitored: when the model has been training for ten minutes, the loss value corresponding to ten minutes of training is monitored.
The loss value of the model is used to estimate the degree of inconsistency between the predicted value and the true value of the model; each time the model completes a training step, the loss value corresponding to that step is calculated.
Step 204: Under the condition that the training progress of the model reaches a preset check point, saving the model parameters of the model corresponding to the check point.
During training, the model is generally trained on training data, learning and updating its model parameters at every training step until the output of the model meets expectations; whether training is complete is judged from the output of the model. The number of training steps is not limited in advance, and a fixed interval of steps can be used to set the preset check points during training.
For example, a check point is set every 100 steps during training; in other words, the node at which the model completes each 100 steps of training is taken as a preset check point. That is, the node after the 100th training step is the first check point, the node after the 200th training step is the second check point, and so on; multiple check points exist during training, until the model completes its training.
It should be noted that the total number of training steps of the model may also be set in advance. A fixed number of training steps is then determined from the total, and the node at which the model completes each fixed number of steps is taken as a check point. In this way, the number of check points during training is determined, and the model parameters corresponding to each check point are saved.
The model parameters are updated and adjusted at every training step. When the number of training steps reaches the preset first check point, the model parameters at the node where the model completed the 100th step are saved; when it reaches the preset second check point, the parameters at the node where the model completed the 200th step are saved; and so on, until the training end condition is reached.
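As an illustration only (the patent provides no code), the following sketch shows one way the periodic saving described above could be implemented; `train_step` is not defined here, and the parameter dictionary is a hypothetical stand-in for whatever structure holds the model parameters.

```python
import copy

CHECKPOINT_INTERVAL = 100      # assumed interval between check points
checkpoints = {}               # training step -> saved model parameters

def maybe_save_checkpoint(step, params):
    """Save a snapshot of the parameters when a check point is reached."""
    if step % CHECKPOINT_INTERVAL == 0:
        # A deep copy matters: the live parameters keep changing every
        # step, so the snapshot must be independent of them.
        checkpoints[step] = copy.deepcopy(params)
```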
In a neural network model, the model parameters include weights and biases. The model parameters are configuration variables inside the model, and the weights and biases are updated automatically, that is, learned by the model itself, during training. The model parameters are illustrated by the following example.
For example, the convolution layers of a model have several different convolution kernels, each with m × n weights; in a model that performs image recognition, the convolution kernels are used to extract and enhance image features. Each neuron in a layer of the model is connected to each neuron in the previous layer, each connection carries a weight and a bias, and these are updated automatically during training.
Step 206: According to the loss value of the model, setting the model parameters of the model to the model parameters corresponding to the check point under the condition that the model training is determined to be abnormal.
By monitoring the training progress of the model and the corresponding loss value, whether the model training is abnormal is determined from the degree to which the loss value deviates from the mean of the historical loss values. When the training is determined to be abnormal, training stops, and the model parameters of the model are set to the model parameters saved at the check point.
Step 208: adjusting training parameters of the model to continue training the model.
The training parameters are not learned by the model; they are values that participate randomly in the training process. Adjusting the training parameters of the model gives the training process greater randomness, and the training of the model then continues. The training parameters may include the input order of the training data and the initial values of the hyper-parameters of the model.
Adjusting an input order of training data of the model and/or an initial value of a hyper-parameter of the model to continue training the model.
Gradient mutation during the training stage occurs frequently. When the model training is determined to be abnormal according to the loss value, the model parameters are set directly to the parameters saved at the check point, which greatly reduces repeated training; in other words, the model does not relearn parameter updates it has already made, saving a large amount of computing resources. After the model parameters are set to the parameters saved at the check point, the training parameters are adjusted directly so that the training process gains greater randomness and training continues. Training can thus proceed without human intervention, the training process is guaranteed not to stop, and the training speed is increased; in particular, when training a complex model, training efficiency is significantly improved.
Fig. 3 is a schematic flow chart diagram illustrating a method of model training according to a second embodiment of the present application, comprising steps 302 to 316.
Step 302: Monitoring the training progress of the model and the loss value of the model corresponding to the training progress.
Step 304: Under the condition that the training progress of the model reaches a preset check point, saving the model parameters of the model corresponding to the check point.
The steps 302 to 304 are the same as the steps 202 to 204 in the first embodiment, and for specific description, refer to the steps 202 to 204 in the first embodiment, which are not repeated herein.
Step 306: Calculating the mean and the standard deviation of the loss values of the model.
The mean and the standard deviation of the loss values are each calculated from the historical loss values of the model.
Step 308: Acquiring the current loss value of the model, and taking the difference between the current loss value and the mean as the deviation value.
The current loss value of the model, that is, the loss value to be judged for abnormality, is acquired; the difference between the current loss value and the mean is taken as the deviation value, and whether the model training is abnormal is judged according to this deviation value.
Step 310: Judging whether the ratio of the deviation value to the standard deviation exceeds the preset multiple; if so, executing step 312; if not, executing step 314.
The preset multiple may be two, three, or four, and is set according to the actual conditions of model training. For example, if the preset multiple is three and the deviation value is 3.5 times the standard deviation, then 3.5 exceeds the preset multiple of three, and the model training is determined to be abnormal.
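A minimal sketch of this check, assuming the historical loss values are kept in a plain list and that the deviation is taken as an absolute difference (the names and the threshold of three are illustrative):

```python
import statistics

PRESET_MULTIPLE = 3.0   # assumed threshold, matching the example above

def is_training_abnormal(history, current_loss):
    """Return True if current_loss deviates from the mean of the
    historical losses by more than PRESET_MULTIPLE standard deviations."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)   # needs at least two history values
    deviation = abs(current_loss - mean)
    return deviation > PRESET_MULTIPLE * std
```

For example, `is_training_abnormal([0.9, 1.0, 1.1], 5.0)` returns True, while a loss close to the historical mean returns False.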
Step 312: Determining that the model training is abnormal, setting the model parameters of the model to the model parameters saved at the check point, and executing step 316.
Setting the model parameters of the model directly to the parameters saved at the check point greatly reduces repeated training; in other words, the model does not relearn parameter updates it has already made, which saves a large amount of computing resources.
Step 314: Carrying out the next training step on the model.
If the ratio of the deviation value to the standard deviation does not exceed the preset multiple, the model training is determined to be normal, and training continues with the next step.
Step 316: adjusting training parameters of the model to continue training the model.
After the model parameters are set to the parameters saved at the check point, the training parameters of the model are adjusted directly so that the training process gains greater randomness. The model can then continue training without human intervention, the training process is guaranteed not to stop, and the training speed is increased; in particular, when training a complex model, training efficiency is significantly improved.
Fig. 4 is a schematic flow chart diagram illustrating a method of model training according to a third embodiment of the present application, comprising steps 402 to 412.
Step 402: Setting the total number of training steps of the model.
Step 404: Determining a fixed number of training steps according to the total number of training steps of the model, and taking the node at which the model completes each fixed number of training steps as a check point of the model.
Before training the model, the total number of training steps is set; for example, the total is 2000 steps. According to this total of 2000 steps, the fixed number of training steps is determined to be 100, and the node at which the model completes each 100 steps is taken as a preset check point: the node after the 100th training step is the first check point, the node after the 200th step is the second check point, the node after the 300th step is the third check point, and so on; multiple check points exist during training.
Generally, the more complex the topological structure of a model, the more model parameters it has, so a certain volume of parameter data must be processed and stored. Determining the fixed number of training steps from the total number of steps, and taking the node at which the model completes each fixed number of steps as a check point, makes it possible to choose a reasonable distribution and number of preset check points within the total number of steps. On the one hand, this avoids having so many check points that a large volume of saved parameters wastes system resources; on the other hand, it avoids having so few check points that the parameters of suitable nodes cannot be saved. When the training progress later reaches a preset check point, the model parameters corresponding to that check point can be saved, which improves training efficiency.
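For illustration, a check point schedule of this kind could be derived as follows; the simple even-interval rule is an assumption, since the patent only requires some fixed number of steps determined from the total.

```python
def checkpoint_steps(total_steps, fixed_steps):
    """Return the training steps at which check points fall,
    e.g. checkpoint_steps(2000, 100) -> [100, 200, ..., 2000]."""
    return list(range(fixed_steps, total_steps + 1, fixed_steps))
```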
Step 406: Monitoring the number of training steps of the model and the loss value of the model corresponding to the training steps.
A loss value is produced each time the model completes a training step. For example, when the number of training steps is monitored to be step 1, the loss value after the model completes the 1st step is monitored; when the number of steps is step 2, the loss value after the model completes the 2nd step is monitored; and so on.
Step 408: Under the condition that the training progress of the model reaches a preset check point, saving the model parameters of the model corresponding to the check point.
Following the above example, the node at which the model completes the 100th training step serves as the first check point, the node at the 200th step as the second check point, and so on; multiple check points exist during training.
The model parameters are updated and adjusted at every training step. When the number of training steps reaches the preset first check point, the parameters at the node where the model completed the 100th step are saved; when it reaches the preset second check point, the parameters at the node where the model completed the 200th step are saved.
Step 410: When the model training is determined to be abnormal according to the loss value of the model, setting the model parameters of the model to the model parameters saved at the check point preceding the training step whose loss value is abnormal.
Setting the model parameters directly to the parameters saved at the check point preceding the training step with the abnormal loss value greatly reduces repeated training; in other words, the model does not relearn parameter updates it has already made, which saves a large amount of computing resources.
Step 412: adjusting an input order of training data of the model and/or an initial value of a hyper-parameter of the model to continue training the model.
The training data of the model is sample data input into the model for training.
A model typically involves two kinds of parameters: model parameters and hyper-parameters. Hyper-parameters are parameters used to control the behavior of the model; they are not learned by the model itself. For example, in a polynomial regression model, the degree of the polynomial and the learning rate are hyper-parameters. These hyper-parameters cannot be trained by the model itself.
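As a concrete illustration (added here, not taken from the patent), in a polynomial regression model the degree and the learning rate steer training but are never updated by it, while the coefficients are the learned model parameters:

```python
import numpy as np

# Hyper-parameters: chosen before training, never learned by the model.
DEGREE = 3
LEARNING_RATE = 1e-3

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x)                  # toy targets

# Model parameters: the polynomial coefficients, learned during training.
coeffs = np.zeros(DEGREE + 1)
X = np.vander(x, DEGREE + 1)           # feature matrix [x^3, x^2, x, 1]

for _ in range(1000):                  # plain gradient descent on MSE
    grad = 2 * X.T @ (X @ coeffs - y) / len(x)
    coeffs -= LEARNING_RATE * grad     # only coeffs change; DEGREE and
                                       # LEARNING_RATE stay fixed
```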
Initial values may be reset for the hyper-parameters of the model, for example the initial values used to initialize the weights.
The model continues training after the input order of its training data is adjusted, or after the initial values of its hyper-parameters are adjusted.
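A sketch of both adjustments, under the assumption that the training data is a list of samples and that the learning rate is the hyper-parameter being re-initialized (both choices are illustrative, not prescribed by the patent):

```python
import random

def randomize_training(data, hyperparams, seed):
    """Shuffle the input order of the training data and reset a
    hyper-parameter's initial value before resuming training."""
    rng = random.Random(seed)
    rng.shuffle(data)                  # new input order for the training data
    hyperparams["learning_rate"] = rng.choice([1e-2, 5e-3, 1e-3])
    return data, hyperparams
```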
In this embodiment, the model continues training after the input order of its training data and/or the initial values of its hyper-parameters are adjusted, so the training process gains greater randomness and training continues. Training can proceed without human intervention, the training process is guaranteed not to stop, and the training speed is increased; in particular, when training a complex model, training efficiency is significantly improved.
The present embodiment will be schematically described below by way of example.
Suppose a complex model needs to be trained and the total number of training steps is set to ten thousand. According to this total, the fixed number of training steps is determined to be 200, and the node at which the model completes each 200 steps is taken as a preset check point: the node after the 200th training step is the first check point, the node after the 400th step is the second check point, and so on; multiple check points exist during training.
The number of training steps is monitored: after the model completes its first step, the loss value is a; after it completes its second step, the loss value is b; and so on, the training progress of the model and the corresponding loss values are monitored.
When the model has trained for 200 steps, the preset first check point is reached, and the model parameters corresponding to that check point are saved; in other words, the parameters after the model completes 200 steps of training are saved. When the model has trained for 400 steps, the preset second check point is reached, and the parameters after 400 steps are saved. When the model has trained for 600 steps, the preset third check point is reached, and the parameters after 600 steps are saved; the model parameters corresponding to each check point are saved in this way.
Suppose that when the model completes its 4500th training step, the corresponding loss value is obtained, and the difference between this current loss value and the mean is taken as the deviation value. If the deviation value is 3.5 times the standard deviation, then 3.5 exceeds the preset multiple of three, and the model training is determined to be abnormal.
The model parameters of the model are then set directly to the parameters saved at the check point preceding the training step with the abnormal loss value. That is, when the abnormality occurs after the model completes its 4500th step, the loss value corresponding to the 4500th step is abnormal; the check point preceding the 4500th step holds the parameters saved after the model completed its 4400th step, so the model parameters are reset directly to the parameters saved after the 4400th step.
The above is the first training abnormality. For the training of a complex model, abnormalities are frequent. Suppose a second abnormality occurs after the model completes its 5300th step: the loss value corresponding to the 5300th step is abnormal, the check point preceding it holds the parameters saved after the model completed its 5200th step, and the model parameters are reset directly to the parameters saved after the 5200th step. Suppose a third abnormality occurs after the model completes its 6625th step: the loss value corresponding to the 6625th step is abnormal, the check point preceding it holds the parameters saved after the model completed its 6600th step, and the model parameters are reset directly to the parameters saved after the 6600th step.
Resetting the model parameters in this way, to the parameters saved at the check point preceding the training step with the abnormal loss value, greatly reduces repeated training; in other words, the model avoids relearning parameter updates it has already made, saving a large amount of computing resources.
The input order of the training data and the initial values of the hyper-parameters of the model are adjusted, and training continues until the model completes the set ten thousand training steps. The training process thus gains greater randomness and continues without human intervention; the training process is guaranteed not to stop, the training speed is increased, and, in particular, when training a complex model, training efficiency is significantly improved.
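Putting the worked example together, a compact end-to-end sketch of the whole procedure might look as follows; `train_one_step`, `get_params`, and `set_params` are hypothetical stand-ins for the training framework in use, and the numbers mirror the example above.

```python
import copy
import statistics

TOTAL_STEPS = 10_000
FIXED_STEPS = 200          # check point every 200 steps, as in the example
PRESET_MULTIPLE = 3.0

def train_with_checkpoints(model, data, hyperparams, rng):
    checkpoints = {}       # training step -> saved model parameters
    losses = []            # historical loss values
    step = 1
    while step <= TOTAL_STEPS:
        loss = train_one_step(model, data, hyperparams)   # hypothetical helper
        abnormal = (
            len(losses) >= 2
            and abs(loss - statistics.mean(losses))
                > PRESET_MULTIPLE * statistics.stdev(losses)
        )
        if abnormal:
            # Roll back to the check point preceding the abnormal step.
            prev = ((step - 1) // FIXED_STEPS) * FIXED_STEPS
            if prev in checkpoints:
                set_params(model, copy.deepcopy(checkpoints[prev]))
                step = prev
            # Re-randomize the training parameters, then continue training.
            rng.shuffle(data)
            hyperparams["learning_rate"] = rng.choice([1e-2, 5e-3, 1e-3])
        else:
            losses.append(loss)
            if step % FIXED_STEPS == 0:       # a check point is reached
                checkpoints[step] = copy.deepcopy(get_params(model))
        step += 1
```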
Fig. 5 is a schematic structural diagram of an apparatus for model training according to a fourth embodiment of the present application, including:
a monitoring module 502 configured to monitor a training progress of the model and a loss value of the model corresponding to the training progress;
the saving module 504 is configured to save the model parameters corresponding to the model at the check point when the training progress of the model reaches the preset check point;
a setting module 506, configured to set the model parameters of the model to the model parameters saved at the check point when the model training is determined to be abnormal according to the loss value of the model;
an adjustment module 508 configured to adjust training parameters of the model to continue training the model.
The monitoring module 502 is further configured to monitor the training steps of the model and the loss value of the model corresponding to the training steps.
The model training device further comprises:
a step number setting module configured to set a total number of training steps of the model;
a processing module configured to determine the fixed number of training steps according to the total number of training steps of the model, and take the node at which the model completes each fixed number of steps as a check point of the model.
The setting module 506 is further configured to: calculating the mean and standard deviation of the loss values of the model;
obtaining the current loss value of the model, and taking the difference value between the current loss value of the model and the mean value as a deviation value;
judging whether the ratio of the deviation value to the standard deviation exceeds a preset multiple;
if so, determining that the model training is abnormal, and setting the model parameters of the model to the model parameters corresponding to the check point;
if not, carrying out the next training step on the model.
The setting module 506 is further configured to set the model parameters of the model to the model parameters saved at the check point preceding the training step whose loss value is abnormal.
The training parameters comprise an input sequence of training data and an initial value of a hyper-parameter of the model;
the adjustment module 508 is further configured to adjust an input order of training data of the model and/or an initial value of a hyper-parameter of the model to continue training the model.
In the above embodiment, the training progress of the model and the corresponding loss value are monitored; when the training progress reaches a preset check point, the model parameters corresponding to the check point are saved; and when the training is determined to be abnormal according to the loss value, the model parameters are set directly to the parameters saved at the check point. This greatly reduces repeated training; in other words, the model does not relearn parameter updates it has already made, saving a large amount of computing resources. After the model parameters are set to the parameters saved at the check point, the training parameters are adjusted directly so that the training process gains greater randomness and training continues without human intervention. The training process is guaranteed not to stop, the training speed is increased, and, in particular, when training a complex model, training efficiency is significantly improved.
An embodiment of the present application also provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor executes the instructions to implement the steps of the method for model training as described above.
An embodiment of the present application also provides a computer readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the method of model training as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the model training method belong to the same concept, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the model training method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (10)

1. A method of model training, comprising:
monitoring the training progress of the model and the loss value of the model corresponding to the training progress;
under the condition that the training progress of the model reaches a preset check point, saving model parameters corresponding to the check point of the model;
according to the loss value of the model, setting the model parameters of the model to the model parameters corresponding to the check point under the condition that the model training is determined to be abnormal;
adjusting training parameters of the model to continue training the model.
2. The method of claim 1, wherein monitoring a training progress of the model and a loss value of the model corresponding to the training progress comprises:
and monitoring the training steps of the model and the loss value of the model corresponding to the training steps.
3. The method of claim 1, wherein, in a case that the training progress of the model reaches a preset check point, before saving the model parameters corresponding to the check point the method further comprises:
setting the total training step number of the model;
and determining the fixed training steps according to the total training steps of the model, and taking the node of the model which finishes each fixed training step as a check point of the model.
4. The method according to claim 1, wherein, in the case that it is determined that the model training is abnormal according to the loss value of the model, setting the model parameters of the model to the model parameters corresponding to the check point comprises:
calculating the mean and standard deviation of the loss values of the model;
obtaining the current loss value of the model, and taking the difference value between the current loss value of the model and the mean value as a deviation value;
judging whether the ratio of the deviation value to the standard deviation exceeds a preset multiple;
if so, determining that the model training is abnormal, and setting the model parameters of the model to the model parameters corresponding to the check point;
if not, carrying out the next training step on the model.
5. The method according to any one of claims 1 to 4, wherein setting the model parameters of the model to the model parameters corresponding to the check point comprises:
setting the model parameters of the model to the model parameters corresponding to the check point preceding the training step whose loss value is abnormal.
6. The method of claim 1, wherein the training parameters include an input order of training data and an initial value of a hyper-parameter of the model;
adjusting training parameters of the model to continue training the model, including:
adjusting an input order of training data of the model and/or an initial value of a hyper-parameter of the model to continue training the model.
7. An apparatus for model training, comprising:
the monitoring module is configured to monitor the training progress of the model and the loss value of the model corresponding to the training progress;
the saving module is configured to save model parameters corresponding to the model at a preset check point when the training progress of the model reaches the preset check point;
the setting module is configured to set the model parameters of the model as the model parameters corresponding to the model at the check point under the condition that the model training is determined to be abnormal according to the loss value of the model;
an adjustment module configured to adjust training parameters of the model to continue training the model.
8. The apparatus of claim 7, wherein the setting module is further configured to:
calculating the mean and standard deviation of the loss values of the model;
obtaining the current loss value of the model, and taking the difference value between the current loss value of the model and the mean value as a deviation value;
judging whether the ratio of the deviation value to the standard deviation exceeds a preset multiple;
if so, determining that the model training is abnormal, and setting the model parameters of the model to the model parameters corresponding to the check point;
if not, carrying out the next training step on the model.
9. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-6 when executing the instructions.
10. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 6.
CN201911284996.2A 2019-12-13 2019-12-13 Model training method and device Pending CN111047016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911284996.2A CN111047016A (en) 2019-12-13 2019-12-13 Model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911284996.2A CN111047016A (en) 2019-12-13 2019-12-13 Model training method and device

Publications (1)

Publication Number Publication Date
CN111047016A true CN111047016A (en) 2020-04-21

Family

ID=70236261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911284996.2A Pending CN111047016A (en) 2019-12-13 2019-12-13 Model training method and device

Country Status (1)

Country Link
CN (1) CN111047016A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780583A (en) * 2021-09-18 2021-12-10 中国平安人寿保险股份有限公司 Model training monitoring method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200421)