WO2020087281A1 - Hyper-parameter optimization method and apparatus - Google Patents

Hyper-parameter optimization method and apparatus

Info

Publication number
WO2020087281A1
Authority
WO
WIPO (PCT)
Prior art keywords
hyperparameters
optimization
machine learning
value
loss
Prior art date
Application number
PCT/CN2018/112712
Other languages
French (fr)
Chinese (zh)
Inventor
蒋阳
赵丛
张李亮
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2018/112712 priority Critical patent/WO2020087281A1/en
Priority to CN201880038686.XA priority patent/CN110770764A/en
Publication of WO2020087281A1 publication Critical patent/WO2020087281A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This application relates to the field of computer technology, and in particular, to a hyperparameter optimization method and device.
  • The parameters of machine learning algorithms mainly fall into two classes: hyper-parameters and ordinary parameters. Ordinary parameters can be learned and estimated from the data; hyperparameters cannot be estimated from the data and can only be specified through human experience and design. Hyperparameters are parameters that need to be set before the learning process starts, and they define higher-level concepts about the machine learning model, such as its complexity or learning ability. For example, the hyperparameters may include, but are not limited to, regularization coefficients, the learning rate, the network structure, and the width and depth of convolution kernels.
  • the adjustment of hyperparameters has a very large impact on the performance of machine learning algorithms.
  • However, the adjustment of hyperparameters is a black-box operation that usually requires a large amount of debugging by algorithm designers, who need a relatively deep accumulation of experience in the field. It costs a great deal of time and effort, often still fails to reach the optimal result, and the optimization efficiency is low.
  • the desired hyperparameters can be obtained by modeling the unknown function and searching for its global optimal solution.
  • The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function, and it has therefore been proposed for adjusting the hyperparameters of machine learning models.
  • However, in some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, which makes solving the global optimal solution of the unknown function in a high-dimensional space very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
  • the present application provides a hyperparameter optimization method and device, which can realize a dimensionality reduction search for hyperparameters, and at the same time can weaken the assumption of limiting the solution space, so as to obtain better hyperparameter optimization results.
  • In a first aspect, a hyperparameter optimization method is provided. The method includes: dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In a second aspect, a hyperparameter optimization device is provided. The device includes: a division unit, which divides the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and an optimization unit, which performs Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In a third aspect, an apparatus for processing video images is provided, which includes a memory and a processor.
  • the memory is used to store instructions
  • the processor is used to execute instructions stored in the memory.
  • Execution of the instructions stored in the memory causes the processor to perform the optimization method provided in the first aspect.
  • a chip is provided.
  • the chip includes a processing module and a communication interface.
  • the processing module is used to control the communication interface to communicate with the outside.
  • the processing module is also used to implement the optimization method provided in the first aspect.
  • a computer-readable storage medium on which a computer program is stored, which when executed by a computer causes the computer to implement the optimization method provided in the first aspect.
  • a computer program product containing instructions which when executed by a computer causes the computer to implement the optimization method provided in the first aspect.
  • The solution provided by this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize; on the one hand, this realizes a dimensionality-reduction search for the hyperparameters, and on the other hand, it weakens the limitation of the dimensionality-reduction assumption.
  • Figure 1 is a schematic diagram of the basic principle of the Bayesian optimization algorithm.
  • FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • FIG. 3 is another schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
  • FIG. 5 is another schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
  • Bayesian optimization algorithm (Bayesian Optimization Algorithm, BOA) is an algorithm for solving the global optimal solution of unknown functions.
  • D is the candidate set of s.
  • the goal of Bayesian optimization is to select an s from D, so that the value of the unknown function f (s) is the smallest (or largest).
  • the unknown function f (s) can be called the objective function.
  • The first step is to make a prior assumption (prior belief) about the function space distribution of the objective function f(s); that is, the function space distribution of f(s) is assumed to follow a prior distribution.
  • The prior assumption usually uses a Gaussian process prior. For example, the function space distribution of f(s) may be assumed to be a Gaussian distribution.
  • the first step also includes obtaining at least two sample values and obtaining at least two observation values corresponding to these sample values.
  • Assuming the sampling values are s_0 and s_1, the observed values are f(s_0) and f(s_1).
  • For example, the sampling values s_0 and s_1 can be selected from the candidate set D by sampling or similar means.
  • the first step also includes using at least two observations to update the average and variance of the prior distribution to obtain a posterior distribution.
  • the modified Gaussian distribution model is the posterior distribution of f (s).
  • the acquisition function is constructed using the posterior distribution, and the acquisition function is used to calculate the next sample value.
  • Taking the case where the function space distribution of f(s) is Gaussian as an example, the second step specifically selects the next sampling value s_i from the modified Gaussian distribution model. The selection criterion is that, relative to the other sampling values in the candidate set D, inputting (s_i, f(s_i)) into the Gaussian distribution model would make the model approach the true distribution of the objective function f(s) faster and more accurately; therefore places with a smaller mean and a larger variance are sought.
  • The acquisition function mentioned in the second step recommends the next sampling value after jointly considering these two factors: a smaller mean (smaller when f(s) is a loss function, larger when f(s) represents the accuracy of the model) and a larger variance.
  • In the third step, the observation value corresponding to the sampling value obtained in the second step is obtained, and whether this sampling value is the optimal solution is judged according to the observation value. If it is, the Bayesian optimization process ends; if not, the process goes to the fourth step.
  • The sampled value can be substituted into the objective function f(s) to calculate the observation value.
  • the observation value obtained in the third step is used to continue to modify the posterior distribution, and the process goes to the second step. That is, the second step, the third step, and the fourth step are repeatedly executed until convergence (that is, the optimal solution is obtained in the third step).
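  • The following is a minimal, illustrative sketch of the four-step loop described above; it is not code from the application. It assumes a Gaussian-process surrogate from scikit-learn, a lower-confidence-bound acquisition, and a finite candidate set D, all of which are assumptions made for illustration.

```python
# Illustrative sketch only: a simple Bayesian optimization loop over a finite candidate set D.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimize(f, D, n_iter=30, kappa=2.0, seed=0):
    """Minimize an unknown function f over a finite candidate set D (array of shape [m, d])."""
    rng = np.random.default_rng(seed)
    # Step 1: prior assumption (Gaussian process prior) plus at least two initial samples.
    idx = rng.choice(len(D), size=2, replace=False)
    S = D[idx]                                   # sampled values s_0, s_1
    y = np.array([f(s) for s in S])              # observations f(s_0), f(s_1)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(S, y)                             # posterior distribution given the samples so far
        mu, sigma = gp.predict(D, return_std=True)
        # Step 2: the acquisition trades off a small mean (for a loss) against a large variance.
        s_next = D[np.argmin(mu - kappa * sigma)]
        # Step 3: observe the objective at the proposed sampling value.
        y_next = f(s_next)
        # Step 4: fold the new observation back in and repeat until the budget is spent.
        S = np.vstack([S, s_next])
        y = np.append(y, y_next)
    best = int(np.argmin(y))
    return S[best], y[best]
```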
  • the Bayesian optimization algorithm can be used to adjust the hyperparameters of the machine learning model (also called optimization).
  • The hyperparameter adjustment process of machine learning is regarded as solving the extremum problem in the Bayesian optimization algorithm: the hyperparameters to be optimized are regarded as s, and the candidate values of the hyperparameters to be optimized constitute the candidate set D. The Bayesian optimization process shown in Figure 1 then searches for the global optimal solution of the objective function, yielding the optimized hyperparameters.
  • the loss function is generally used as the objective function.
  • The loss function is used to estimate the degree of inconsistency between the predicted value and the true value of the machine learning model, and it can be a non-negative real-valued function. Assuming that the independent variable of the machine learning model g() is X and the dependent variable is Y, and taking the sample (X_i, Y_i) as an example, the predicted value of the machine learning model is g(X_i) and the true value is Y_i.
  • There are many common loss functions, for example, the log loss function, the square loss function (also called the least-squares loss function), the exponential loss function, and other loss functions.
  • Taking the square loss function as an example, its standard form is L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))², where n is the number of samples, g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
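  • As a concrete numerical illustration of the square loss above (a sketch only; the predictions stand in for a hypothetical model g):

```python
import numpy as np

def square_loss(y_true, y_pred):
    """L(Y, g(X)): the sum of squared residuals over the n samples."""
    residual = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sum(residual ** 2))

# Example: true values Y_i and predictions g(X_i) for n = 3 samples.
print(square_loss([1.0, 2.0, 3.0], [0.9, 2.2, 2.7]))  # 0.01 + 0.04 + 0.09 ≈ 0.14
```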
  • If the square loss function is used as the objective function in the Bayesian optimization algorithm, the purpose of Bayesian optimization is to minimize the value of the square loss function, thereby obtaining the optimized hyperparameters.
  • the hyperparameters to be optimized are usually defined as a multi-dimensional vector S.
  • the process of Bayesian optimization is the process of searching the optimal value of the vector S.
  • In some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, resulting in a very high dimension of the vector S. Solving the global optimal solution of the unknown function in such a high-dimensional space is very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
  • Existing solutions deal with high-dimensional hyperparameters by assuming that the solution space of the global optimal solution of the unknown function is a relatively low-dimensional solution space, and then performing Bayesian optimization directly in the assumed low-dimensional solution space. As a result, the strategy used to map the solution space of the global optimal solution to the relatively low-dimensional solution space has a great influence on the Bayesian optimization results; if the assumption strategy is unreasonable, the optimization results will be poor, and the algorithm is therefore not robust enough.
  • This application proposes a hyperparameter optimization scheme, which can realize the dimensionality reduction search for hyperparameters, and at the same time can weaken the assumption of limiting the solution space, so as to obtain better hyperparameter optimization results.
  • FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • the optimization method includes the following steps.
  • the hyperparameters to be optimized include N sets of hyperparameters, and N is an integer greater than 1.
  • the hyperparameters to be optimized for machine learning may be divided into N groups in advance.
  • the hyperparameters that need to be optimized for machine learning may be divided into N groups in real time when optimization is needed.
  • the grouping strategy for the hyperparameters that need to be optimized may be different.
  • the number of hyperparameters included in each group of hyperparameters in the N sets of hyperparameters is less than the total number of hyperparameters that need to be optimized in machine learning.
  • S220: Perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • The Bayesian optimization of each group of hyperparameters can be implemented using the Bayesian optimization algorithm shown in Figure 1; during the Bayesian optimization of each group, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • the values of the remaining groups of hyperparameters can be determined by sampling.
  • In each Bayesian optimization process, Bayesian optimization is performed on the solution space corresponding to one group of hyperparameters. Because the dimension of each group of hyperparameters is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimensionality-reduction search for the hyperparameters is realized and the search can avoid getting stuck in a local optimal solution.
  • In this way, a dimensionality-reduction search can be performed on the hyperparameters.
  • each set of hyperparameters in the N sets of hyperparameters that need to be optimized by machine learning includes at least one hyperparameter.
  • the number of hyperparameters included in each group in the N groups of hyperparameters may be the same, that is, the dimensions of each group of hyperparameters may be the same.
  • the number of hyperparameters included in different groups among the N sets of hyperparameters may also be different, that is, the dimensions of different groups of hyperparameters may not be completely the same.
  • the N sets of hyperparameters are obtained by randomly grouping the hyperparameters that need to be optimized.
  • the N sets of hyperparameters are obtained by grouping the hyperparameters that need to be optimized through experience.
  • the N sets of hyperparameters are divided according to the type of hyperparameters in machine learning.
  • The hyperparameters may include at least two of the following: convolution kernel size, number of convolution kernels, convolution stride, shortcut connection scheme, the choice between an addition (add) operation and a concatenation (concat) operation, number of branches, number of layers, number of iterations (epochs), initialization parameters (such as MSRA initialization and Xavier initialization), regularization coefficients, learning rate, neural network structure, and the number of layers of the neural network.
  • the hyperparameter types of different groups of hyperparameters in the N groups of hyperparameters may not be completely the same.
  • Different hyperparameters may have different hyperparameter types. Grouping the hyperparameters to be optimized according to hyperparameter type and then optimizing each group separately can improve the optimization efficiency of the hyperparameters to a certain extent.
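  • Purely for illustration, one hypothetical way of grouping hyperparameters by type is shown below; the group names and membership are assumptions, not a grouping prescribed by the application.

```python
# Hypothetical grouping of the hyperparameters to be optimized into N = 3 groups by type.
hyperparameter_groups = {
    "structure": ["kernel_size", "kernel_number", "stride", "shortcut_connection", "num_layers"],
    "training":  ["learning_rate", "epochs", "regularization_coefficient"],
    "init":      ["initialization"],   # e.g. MSRA or Xavier initialization
}
N = len(hyperparameter_groups)         # N = 3 groups, each smaller than the full set
```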
  • the grouping strategy for the hyperparameters to be optimized is fixed.
  • the grouping strategy for the hyperparameters that need to be optimized may be different or the same, which is not limited in this application and can be determined according to actual needs.
  • One implementation of step S220 is to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round of Bayesian optimization operations includes performing Bayesian optimization on the i-th group of the N groups of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • In each round of Bayesian optimization operations, Bayesian optimization is performed on all N groups of hyperparameters; in other words, in the process of obtaining the optimized hyperparameters, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, and therefore the limitation of the dimensionality-reduction assumption can be weakened.
  • Thus, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • Optionally, in step S220, two, three, or more rounds of Bayesian optimization operations are performed to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes performing Bayesian optimization on the i-th group of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • The method of performing Bayesian optimization on the N groups of hyperparameters in each round of Bayesian optimization operations can be referred to as Bayesian optimization with alternating optimization.
  • The embodiments of the present application introduce the idea of alternating optimization into the Bayesian optimization process, which achieves effective dimensionality reduction of the high-dimensional search space, weakens the assumption limitations of the existing technology, and helps to search for the optimal hyperparameters.
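  • Below is a minimal sketch of this alternating, grouped Bayesian optimization, reusing the bayesian_optimize helper sketched earlier; the way a full hyperparameter setting is assembled from per-group values is an assumption made for illustration, not the application's implementation.

```python
def alternating_group_bo(objective, group_candidates, init_values, n_rounds=3):
    """
    objective(values): loss of the machine learning model for a full dict of hyperparameter values.
    group_candidates:  {group_name: candidate array (shape [m_i, d_i]) for that group}.
    init_values:       an initial value for every group (e.g. drawn by sampling).
    """
    current = dict(init_values)
    for _ in range(n_rounds):                        # each round traverses i = 1, 2, ..., N
        for name, D_i in group_candidates.items():
            # Optimize group i while the remaining groups stay fixed at their latest values.
            def f_i(s, name=name):
                trial = dict(current)
                trial[name] = s
                return objective(trial)
            best_s, _ = bayesian_optimize(f_i, D_i)  # per-group Bayesian optimization (see sketch above)
            current[name] = best_s                   # this latest value is used for the later groups
    return current
```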
  • the entire process of optimizing hyperparameters in the embodiments of the present application is as follows.
  • the hyperparameter adjustment process of machine learning is regarded as the objective function f (S).
  • S represents the hyperparameters that need to be optimized.
  • S ∈ D, where D represents the sample space of the hyperparameters S that need to be optimized.
  • the objective function f (S) may be a loss function.
  • The process of sampling a value from D_i and obtaining the observation value may be that the sampled value is substituted into the objective function f(S) to obtain the observation value corresponding to that sampled value.
  • the objective function of Bayesian optimization is a loss function.
  • the objective function of Bayesian optimization may be any of the following: log loss function, square loss function (also called least squares loss function), and exponential loss function.
  • a loss function can be selected as the objective function of Bayesian optimization according to the needs of the actual application.
  • For example, the objective function f(S) of Bayesian optimization may be the square loss function L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))², where:
  • (X, Y) is a sample, g(X) represents the machine learning model, X represents the independent variable of the machine learning model, and Y represents the dependent variable of the machine learning model.
  • n represents the number of samples; the samples here refer to (X, Y) samples.
  • g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
  • the samples used in the Bayesian optimized objective function may be training set samples, or test set samples, or training set samples and test set samples.
  • When the sample space is the sample space of the training set, n represents the number of samples in the training set; when the sample space is the sample space of the test set, n represents the number of samples in the test set; and when the sample space is composed of the training set and the test set, n represents the total number of samples in the training set and the test set.
  • each value of the hyperparameter corresponds to a machine learning model.
  • Different hyperparameter values correspond to different machine learning models. Therefore, in the Bayesian optimization process of the hyperparameters, each time the value of a hyperparameter is updated, the machine learning model used in the objective function should also be updated.
  • the machine learning model corresponding to the value of each hyperparameter can be obtained through training.
  • any existing feasible model training method may be used to train the machine learning model corresponding to each hyperparameter value, which is not limited in this application.
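  • A sketch of how such an objective might be assembled: each hyperparameter value gets its own freshly built and trained model, whose loss is returned as the observation. The build_model, train, and evaluate_loss callables are hypothetical placeholders, not APIs from the application.

```python
def make_objective(build_model, train, evaluate_loss):
    """Return an objective that trains a fresh model for every hyperparameter setting."""
    def objective(hyperparams):
        model = build_model(hyperparams)   # a new machine learning model for this value
        train(model)                       # any feasible training method may be used here
        return evaluate_loss(model)        # loss value used as the Bayesian optimization observation
    return objective
```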
  • the observation value in the Bayesian optimization process is determined according to the loss function used by the machine learning model in the training process.
  • For example, the observation value corresponding to a sampled value of the i-th group of hyperparameters is determined by the following formula:
  • T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, and w_1 and w_2 are the weights of T_loss(j) and V_loss(j), where w_1 and w_2 are not simultaneously zero.
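  • Purely as an assumption for illustration, one plausible weighted combination of the variables listed above is sketched below; it is not the formula from the application.

```python
def observation_loss(T_loss, V_loss, w1=0.5, w2=0.5):
    """Hypothetical combination of per-round training and test losses (indexed j = 1..epoch)."""
    assert not (w1 == 0 and w2 == 0), "w1 and w2 must not be simultaneously zero"
    epoch = len(T_loss)
    # Assumed form: average of the weighted per-round losses; the true formula may differ.
    return sum(w1 * T_loss[j] + w2 * V_loss[j] for j in range(epoch)) / epoch
```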
  • During the Bayesian optimization of each group of hyperparameters, the number of training rounds of the machine learning model is controlled to be less than a preset value. For example, the number of training rounds of the machine learning model is controlled to be less than 20.
  • the convergence time of the machine learning model or the number of training times of the machine learning model directly affects the optimization speed of the hyperparameter.
  • the embodiment of the present application can increase the optimization speed of the hyperparameters by limiting the training times of the machine learning model to less than a preset value.
  • The final performance of the model is related to its performance in the initial stage of training: if the model converges monotonically in the initial stage of training, its final performance will also converge monotonically; if the model no longer converges monotonically (i.e., diverges) in the initial stage of training, its final performance will no longer converge monotonically either. Therefore, the number of training rounds can be controlled within the preset value.
  • Controlling the number of training rounds of the machine learning model corresponding to each updated value of the i-th group of hyperparameters to be less than the preset value includes: adopting an early-stop strategy for the machine learning model corresponding to each updated value of the i-th group of hyperparameters, so that its number of training rounds is less than the preset value.
  • For example, the preset value is 20. The machine learning model corresponding to each hyperparameter value is trained for at most 20 rounds; if, within fewer than 20 rounds, the machine learning model no longer converges monotonically, training stops early, and if the machine learning model is still converging monotonically at 20 rounds, training is also stopped.
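  • A sketch of the early-stop control described above, assuming a generic per-round training callable and using the evaluated loss as the monotonic-convergence signal (both are assumptions for illustration):

```python
def train_with_early_stop(model, train_one_round, evaluate_loss, preset=20):
    """Train for at most `preset` rounds; stop early once the loss stops decreasing monotonically."""
    previous = float("inf")
    for _ in range(preset):            # the number of training rounds never exceeds the preset value
        train_one_round(model)
        loss = evaluate_loss(model)
        if loss >= previous:           # no longer converging monotonically -> stop early
            break
        previous = loss
    return model
```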
  • the solution of the embodiment of the present application may be applied to the hyper-parameter adjustment process of deep learning.
  • Step S220 is mainly described below by taking the example shown in FIG. 3; however, the implementation of step S220 includes, but is not limited to, the method shown in FIG. 3. As long as Bayesian optimization is performed on the N groups of hyperparameters during the process of obtaining the optimized hyperparameters, such schemes fall within the protection scope of the present application.
  • Each round of Bayesian optimization operations on the first N1 groups of hyperparameters includes performing Bayesian optimization on the i-th group of the N1 groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups of hyperparameters are fixed to their latest values and i traverses 1, 2, ..., N1.
  • Each round of Bayesian optimization operations on the last N2 groups of hyperparameters includes performing Bayesian optimization on the i-th group of the N2 groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups of hyperparameters are fixed to their latest values and i traverses 1, 2, ..., N2.
  • For example, the first group of hyperparameters and the second group of hyperparameters are alternately optimized as follows to obtain the optimized first and second groups of hyperparameters: at least one round of Bayesian optimization operations is performed, and each round includes performing Bayesian optimization on the first group of hyperparameters while the values of the remaining group of hyperparameters are fixed to their latest values, and then performing Bayesian optimization on the second group of hyperparameters while the values of the remaining group of hyperparameters are fixed to their latest values.
  • Similarly, the third, fourth, and fifth groups of hyperparameters are alternately optimized as follows to obtain the optimized third, fourth, and fifth groups of hyperparameters: at least one round of Bayesian optimization operations is performed, and each round includes performing Bayesian optimization on the third group of hyperparameters while the values of the remaining groups are fixed to their latest values, then performing Bayesian optimization on the fourth group of hyperparameters while the values of the remaining groups are fixed to their latest values, and then performing Bayesian optimization on the fifth group of hyperparameters while the values of the remaining groups are fixed to their latest values.
  • In this way, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • The solution provided by the present application may be applied to, but is not limited to, the optimization of hyperparameters in machine learning, and may also be applied to other scenarios in which the global optimal solution of an unknown function needs to be solved.
  • FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus 400 provided by an embodiment of the present application.
  • the device 400 includes the following units.
  • the dividing unit 410 divides the hyper-parameters to be optimized for machine learning into N sets of hyper-parameters, where N is an integer greater than 1;
  • The optimization unit 420 is used to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters; in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In each Bayesian optimization process, Bayesian optimization is performed on the solution space corresponding to one group of hyperparameters. Because the dimension of each group of hyperparameters is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimensionality-reduction search for the hyperparameters is realized and the search can avoid getting stuck in a local optimal solution.
  • Thus, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • The optimization unit 420 is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round includes performing Bayesian optimization on the i-th group of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • When the N groups of hyperparameters are optimized separately, the order of optimization may cause the optimization results of the individual hyperparameter groups to differ. Performing multiple rounds of Bayesian optimization operations can weaken this difference to a certain extent, thereby further weakening the limitation of the dimensionality-reduction assumption.
  • In this way, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • the number of hyperparameters included in each group of the N groups of hyperparameters may be the same, that is, the dimension of each group of hyperparameters may be the same.
  • the number of hyperparameters included in different groups among the N sets of hyperparameters may also be different, that is, the dimensions of different groups of hyperparameters may not be completely the same.
  • the N sets of hyperparameters are divided according to the type of hyperparameters in machine learning.
  • Optionally, the hyperparameters may include at least two of the following: convolution kernel size, number of convolution kernels, convolution stride, shortcut connection scheme, the choice between an addition (add) operation and a concatenation (concat) operation, number of branches, number of layers, number of iterations (epochs), initialization parameters (such as MSRA initialization and Xavier initialization), regularization coefficients, learning rate, neural network structure, and the number of layers of the neural network.
  • the hyperparameter types of different groups of hyperparameters in the N groups of hyperparameters may not be completely the same.
  • Different hyperparameters may have different hyperparameter types. Grouping the hyperparameters to be optimized according to hyperparameter type and then optimizing each group separately can improve the optimization efficiency of the hyperparameters to a certain extent.
  • the objective function of Bayesian optimization is a loss function
  • The samples used in the loss function are training set samples and/or test set samples.
  • The observation values used by Bayesian optimization are determined based on the loss values obtained in model training by the machine learning model corresponding to each group of hyperparameters.
  • The observation value Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula, where epoch is the number of training rounds of the machine learning model corresponding to the current value of that group of hyperparameters, T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, and w_1 and w_2 are the weights of T_loss(j) and V_loss(j), respectively, with w_1 and w_2 not both zero.
  • the optimization unit 420 is configured to control the number of trainings of the machine learning model to be less than a preset value during Bayesian optimization of each set of hyperparameters.
  • the optimization unit 420 is configured to adopt an early stop strategy so that the number of trainings of the machine learning model is less than a preset value.
  • the dividing unit 410 is used to divide the hyperparameters that need to be optimized for machine learning into N sets of hyperparameters according to the application scenario of machine learning, and N is an integer greater than 1.
  • the machine learning model is a deep learning model.
  • an embodiment of the present application further provides a hyperparameter optimization apparatus 500, which includes a processor 510 and a memory 520.
  • the memory 520 is used to store instructions
  • the processor 510 is used to execute instructions stored in the memory 520.
  • Execution of the instructions stored in the memory 520 causes the processor 510 to perform the optimization method in the foregoing method embodiments.
  • Execution of the instructions stored in the memory 520 causes the processor 510 to be used to perform the actions performed by the dividing unit 410 and the optimization unit 420 in the above-described embodiments.
  • the apparatus 500 may further include a communication interface 530 for exchanging signals with external devices.
  • The processor 510 is used to control the communication interface 530 to receive and/or send signals.
  • Embodiments of the present application also provide a computer storage medium on which a computer program is stored.
  • the computer program executes the optimization method in the foregoing method embodiments.
  • Embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to execute the optimization method in the foregoing method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device including a server, a data center, and the like integrated with one or more available media.
  • The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (DVD)), or semiconductor media (e.g., solid state disk (SSD)), etc.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • The division of the units is only a division of logical functions; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Feedback Control In General (AREA)

Abstract

Provided are a hyper-parameter optimization method and apparatus. The method comprises: dividing hyper-parameters that need to be optimized for machine learning into N hyper-parameter groups; and separately performing Bayesian optimization on the N hyper-parameter groups to obtain optimized hyper-parameters, wherein during the Bayesian optimization of each hyper-parameter group, the values of the remaining hyper-parameter groups are fixed to their latest values. Performing Bayesian optimization on the grouped hyper-parameters that need to be optimized for machine learning can implement a dimensionality-reduction search for the hyper-parameters and can also weaken the limitations of the dimensionality-reduction assumption.

Description

Hyper-parameter optimization method and apparatus
Copyright statement
The content disclosed in this patent document contains material protected by copyright. The copyright is owned by the copyright owner. The copyright owner has no objection to the reproduction by anyone of this patent document or the patent disclosure as it appears in the official records and archives of the Patent and Trademark Office.
Technical field
This application relates to the field of computer technology, and in particular, to a hyperparameter optimization method and apparatus.
Background
The parameters of machine learning algorithms mainly fall into two classes: hyper-parameters and ordinary parameters. Ordinary parameters can be learned and estimated from the data; hyperparameters cannot be estimated from the data and can only be specified through human experience and design. Hyperparameters are parameters that need to be set before the learning process starts, and they define higher-level concepts about the machine learning model, such as its complexity or learning ability. For example, the hyperparameters may include, but are not limited to, regularization coefficients, the learning rate, the network structure, and the width and depth of convolution kernels.
The adjustment of hyperparameters has a very large impact on the performance of machine learning algorithms. However, hyperparameter adjustment is a black-box operation that usually requires a large amount of debugging by algorithm designers, who need a relatively deep accumulation of experience in the field. It costs a great deal of time and effort, often still fails to reach the optimal result, and the optimization efficiency is low.
If the hyperparameter adjustment process of machine learning is regarded as an unknown function, the desired hyperparameters can be obtained by modeling the unknown function and searching for its global optimal solution. The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function. Therefore, the Bayesian optimization algorithm has been proposed for adjusting the hyperparameters of machine learning models.
However, in some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, which makes it very difficult to solve the global optimal solution of the unknown function in a high-dimensional space; the search often gets stuck in a local optimal solution and cannot obtain good results.
Summary of the invention
The present application provides a hyperparameter optimization method and apparatus, which can realize a dimensionality-reduction search for hyperparameters and at the same time weaken the assumption that limits the solution space, so that better hyperparameter optimization results can be obtained.
In a first aspect, a hyperparameter optimization method is provided. The method includes: dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
In a second aspect, a hyperparameter optimization device is provided. The device includes: a division unit, which divides the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and an optimization unit, which is used to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
In a third aspect, an apparatus for processing video images is provided. The apparatus includes a memory and a processor; the memory is used to store instructions, the processor is used to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to perform the optimization method provided in the first aspect.
In a fourth aspect, a chip is provided. The chip includes a processing module and a communication interface; the processing module is used to control the communication interface to communicate with the outside, and the processing module is further used to implement the optimization method provided in the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a computer, the computer program causes the computer to implement the optimization method provided in the first aspect.
In a sixth aspect, a computer program product containing instructions is provided; when executed by a computer, the instructions cause the computer to implement the optimization method provided in the first aspect.
The solution provided by this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize; on the one hand, this realizes a dimensionality-reduction search for the hyperparameters, and on the other hand, it weakens the limitation of the dimensionality-reduction assumption.
Brief description of the drawings
Figure 1 is a schematic diagram of the basic principle of the Bayesian optimization algorithm.
FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
FIG. 3 is another schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
FIG. 5 is another schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terminology used in the specification of the present application is for the purpose of describing specific embodiments only and is not intended to limit the present application.
First, the related technologies and concepts involved in the embodiments of the present application are introduced.
The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function.
The problem scenario that the Bayesian optimization algorithm mainly addresses can be described by the following formula:
S* = arg max_{s ∈ D} f(s),
where D is the candidate set of s. The goal of Bayesian optimization is to select an s from D so that the value of the unknown function f(s) is the smallest (or largest). The unknown function f(s) can be called the objective function.
The general flow of the Bayesian optimization algorithm is shown in Figure 1 and includes the following steps.
The first step is to make a prior assumption (prior belief) about the function space distribution of the objective function f(s); that is, the function space distribution of f(s) is assumed to follow a prior distribution.
The prior assumption usually uses a Gaussian process prior. For example, the function space distribution of f(s) may be assumed to be a Gaussian distribution.
It should be understood that, since an s satisfying the condition needs to be found, if the function curve of f(s) were known, the s satisfying the condition could be calculated directly. However, the function curve of f(s) is unknown, that is, the characteristics of the function space distribution of f(s) are unknown. It is therefore necessary to make an assumption about the function space distribution of f(s); a common assumption is that the function space distribution of f(s) satisfies a Gaussian distribution, that is, a normal distribution.
Besides the Gaussian distribution, the function space distribution of f(s) can also be assumed to satisfy other probability distributions. In practical applications, a suitable probability distribution assumption can be chosen for f(s) for different problems.
The first step also includes obtaining at least two sampling values and obtaining at least two observation values corresponding to these sampling values.
Assuming that the sampling values are s_0 and s_1, the observation values are f(s_0) and f(s_1).
For example, the sampling values s_0 and s_1 can be selected from the candidate set D by sampling or similar means.
The first step also includes using the at least two observation values to update the mean and variance of the prior distribution to obtain a posterior distribution.
Taking the case where the prior distribution of f(s) is a Gaussian distribution as an example, the sampling values and observation values are input into the Gaussian distribution model, and the mean and variance of the Gaussian distribution model are corrected so that it approaches the true function space distribution of the objective function f(s). The corrected Gaussian distribution model is the posterior distribution of f(s).
In the second step, an acquisition function is constructed using the posterior distribution, and the acquisition function is used to calculate the next sampling value.
Taking the case where the function space distribution of f(s) is a Gaussian distribution as an example, the second step specifically selects the next sampling value s_i from the corrected Gaussian distribution model. The selection criterion is that, relative to the other sampling values in the candidate set D, inputting (s_i, f(s_i)) into the Gaussian distribution model would make the model approach the true distribution of the objective function f(s) faster and more accurately; therefore places with a smaller mean and a larger variance are sought.
The acquisition function mentioned in the second step recommends the next sampling value after jointly considering these two factors: a smaller mean (smaller when f(s) is a loss function, larger when f(s) represents the accuracy of the model) and a larger variance. It should be understood that the design of the acquisition function is prior art and is not described in detail herein.
In the third step, the observation value corresponding to the sampling value obtained in the second step is obtained, and whether this sampling value is the optimal solution is judged according to the observation value. If it is, the Bayesian optimization process ends; if not, the process goes to the fourth step.
The sampling value can be substituted into the objective function f(s) to calculate the observation value.
In the fourth step, the observation value obtained in the third step is used to continue correcting the posterior distribution, and the process goes to the second step. That is, the second, third, and fourth steps are executed repeatedly until convergence (that is, until the optimal solution is obtained in the third step).
As mentioned above, the Bayesian optimization algorithm can be used to adjust (also called optimize) the hyperparameters of a machine learning model. The hyperparameter adjustment process of machine learning is regarded as solving the extremum problem in the Bayesian optimization algorithm: the hyperparameters to be optimized are regarded as s, the candidate values of the hyperparameters to be optimized constitute the candidate set D, and the Bayesian optimization process shown in Figure 1 then searches for the global optimal solution of the objective function, yielding the optimized hyperparameters.
In machine learning, a loss function is generally used as the objective function.
The loss function is used to estimate the degree of inconsistency between the predicted value and the true value of the machine learning model, and it can be a non-negative real-valued function. Assuming that the independent variable of the machine learning model g() is X and the dependent variable is Y, and taking the sample (X_i, Y_i) as an example, the predicted value of the machine learning model is g(X_i) and the true value is Y_i.
There are many common loss functions, for example, the log loss function, the square loss function (also called the least-squares loss function), the exponential loss function, and other loss functions.
Taking the square loss function as an example, the standard form of the square loss function is as follows:
L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))²,
where n is the number of samples, g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value of the machine learning model, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
If the square loss function is used as the objective function in the Bayesian optimization algorithm, the purpose of Bayesian optimization is to minimize the value of the square loss function, thereby obtaining the optimized hyperparameters.
In the process of adjusting hyperparameters using the Bayesian optimization algorithm, the hyperparameters to be optimized are usually defined as a multi-dimensional vector S, and the Bayesian optimization process is the process of searching for the optimal value of the vector S. In some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, resulting in a very high dimension of the vector S. Solving the global optimal solution of the unknown function in such a high-dimensional space is very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
Existing solutions deal with high-dimensional hyperparameters by assuming that the solution space of the global optimal solution of the unknown function is a relatively low-dimensional solution space, and then performing Bayesian optimization directly in the assumed low-dimensional solution space. As a result, the strategy used to map the solution space of the global optimal solution of the unknown function to the relatively low-dimensional solution space has a great influence on the Bayesian optimization results; if the assumption strategy is unreasonable, the optimization results will be poor, and the algorithm is therefore not robust enough.
This application proposes a hyperparameter optimization scheme, which can realize a dimensionality-reduction search for hyperparameters and at the same time weaken the assumption that limits the solution space, so that better hyperparameter optimization results can be obtained.
FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of this application. The optimization method includes the following steps.

S210: Obtain the hyperparameters that machine learning needs to optimize, where the hyperparameters to be optimized include N groups of hyperparameters and N is an integer greater than 1.

Optionally, the hyperparameters to be optimized may be divided into N groups in advance.

Optionally, the hyperparameters to be optimized may be divided into N groups in real time when optimization is needed.

For example, the grouping strategy for the hyperparameters to be optimized may differ between different hyperparameter optimization tasks.

It should be understood that the number of hyperparameters included in each of the N groups is smaller than the total number of hyperparameters to be optimized in machine learning.
S220: Perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups are fixed at their latest values.

The Bayesian optimization of each group of hyperparameters can be implemented with the Bayesian optimization algorithm shown in FIG. 1; during the Bayesian optimization of each group, the values of the remaining groups of hyperparameters are fixed at their latest values.

Take the Bayesian optimization of the i-th group of hyperparameters as an example, and let the remaining group be the j-th (j ≠ i) group. Suppose that before the Bayesian optimization of the i-th group starts, the value of the j-th group is Z; then, during the Bayesian optimization of the i-th group, the value of the j-th group is fixed at its latest value Z.

During the Bayesian optimization of the first group of hyperparameters, the values of the remaining groups can be determined by sampling.

How many Bayesian optimization passes are needed before the optimized hyperparameters are obtained is determined by the convergence condition, which is not detailed in this document.
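A minimal sketch of how the objective seen by the Bayesian optimizer for one group keeps the remaining groups fixed at their latest values; the names objective_for_group, latest, and full_objective are illustrative assumptions rather than terms from the original disclosure:

def objective_for_group(i, latest, full_objective):
    # latest: the most recent value of every group; full_objective: loss of a complete setting.
    def f_i(candidate_group_i):
        params = dict(latest)            # all groups start at their latest values
        params[i] = candidate_group_i    # only group i is varied by the optimizer
        return full_objective(params)
    return f_i

# Toy usage: three scalar groups, full objective = sum of squares.
latest = {0: 1.0, 1: -2.0, 2: 0.5}
f_1 = objective_for_group(1, latest, lambda p: sum(v * v for v in p.values()))
print(f_1(0.0))   # groups 0 and 2 stay fixed, so the value is 1.0 + 0.0 + 0.25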
In this application, each Bayesian optimization pass is performed on the solution space corresponding to a single group of hyperparameters. Since the dimension of each group is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimension-reduced search over the hyperparameters is achieved, and the search can avoid getting stuck in a local optimum.

In addition, in this application, every one of the N groups of hyperparameters undergoes Bayesian optimization in the process of obtaining the optimized hyperparameters. In other words, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction imposed by the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand, and weaken the restriction of the dimension-reduction assumption on the other hand.

In this application, each of the N groups of hyperparameters to be optimized includes at least one hyperparameter.

Optionally, the number of hyperparameters included in each of the N groups may be the same, that is, every group may have the same dimension.

Optionally, different groups among the N groups may include different numbers of hyperparameters, that is, the dimensions of different groups may not be exactly the same.

It should be understood that when the dimensions of different groups among the N groups of hyperparameters are not exactly the same, the posterior distribution involved in the Bayesian optimization process needs to be split into multiple sub-posterior distributions.

In this application, there may be multiple grouping strategies for the N groups of hyperparameters.
Optionally, in some embodiments, the N groups of hyperparameters are obtained by randomly grouping the hyperparameters to be optimized.

Optionally, in some embodiments, the N groups of hyperparameters are obtained by grouping the hyperparameters to be optimized based on experience.

Optionally, in some embodiments, the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

The hyperparameters may include at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition (add) and concatenation (concat) operations, number of branches, number of layers (layer num), number of iterations (epoch), initialization parameters (for example, MSRA initialization and Xavier initialization), regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

The hyperparameter types of different groups among the N groups of hyperparameters may not be exactly the same.

Optionally, different groups of hyperparameters have different hyperparameter types.

It should be understood that performing Bayesian optimization on a group of hyperparameters of the same type can, to a certain extent, increase the convergence speed and thereby improve the efficiency of hyperparameter optimization.

Therefore, in the embodiments of this application, the hyperparameters to be optimized are grouped according to hyperparameter type and each group is then optimized separately, which can improve the efficiency of hyperparameter optimization to a certain extent.

Within a single hyperparameter optimization task, the grouping strategy for the hyperparameters to be optimized is fixed.

For different hyperparameter optimization tasks, for example, hyperparameter optimization tasks in different application scenarios, the grouping strategies for the hyperparameters to be optimized may be different or the same. This is not limited in this application and may be determined according to actual needs.
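As one illustration of a type-based grouping strategy, the sketch below groups hyperparameters described as (name, type) pairs; the type labels are assumptions chosen for the example:

from collections import defaultdict

def group_by_type(hyperparams):
    groups = defaultdict(list)
    for name, hp_type in hyperparams:
        groups[hp_type].append(name)
    return list(groups.values())   # N groups, one per hyperparameter type

hyperparams = [
    ("kernel_size", "convolution"), ("kernel_num", "convolution"), ("stride", "convolution"),
    ("learning_rate", "training"), ("epoch", "training"),
    ("layer_num", "structure"), ("branch_num", "structure"),
]
print(group_by_type(hyperparams))   # three groups: convolution / training / structure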
Optionally, as shown in FIG. 3, one implementation of step S220 is: using at least one round of Bayesian optimization operations to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

In this embodiment of the application, Bayesian optimization is performed on each of the N groups of hyperparameters in every round of Bayesian optimization operations. In other words, in the process of obtaining the optimized hyperparameters, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction of the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, this embodiment achieves a dimension-reduced search over the hyperparameters on the one hand and weakens the restriction of the dimension-reduction assumption on the other hand.

It should be understood that the number of rounds of Bayesian optimization operations needed to obtain the optimized hyperparameters can be determined according to the convergence condition.

For example, in step S220, two, three, or more rounds of Bayesian optimization operations are performed to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

It should be understood that, in each round of Bayesian optimization operations, the N groups of hyperparameters are optimized one after another, and the optimization order may cause differences between the optimizations of the individual groups. By performing multiple rounds (that is, at least two rounds) of Bayesian optimization operations, the embodiments of this application can weaken such differences to a certain extent and thereby further weaken the restriction of the dimension-reduction assumption.

The manner of performing Bayesian optimization on each of the N groups of hyperparameters in every round of Bayesian optimization operations may be referred to as Bayesian optimization with alternating optimization.

The embodiments of this application introduce the idea of alternating optimization into the Bayesian optimization process, which achieves effective dimension reduction for a high-dimensional search space, weakens the assumption restrictions of existing research techniques, and helps find the hyperparameters corresponding to the optimal solution.
As an example, the entire process of optimizing hyperparameters in the embodiments of this application is as follows.

The hyperparameter adjustment process of machine learning is regarded as an objective function f(S). The objective function f(S) is given a Gaussian-process prior, that is, p(f) = GP(f; μ; cov), where μ denotes the expectation, cov denotes the variance, and GP denotes a Gaussian process. S denotes the hyperparameters to be optimized, with S ∈ D, where D denotes the sample space of the hyperparameters S to be optimized.

The hyperparameters S to be optimized are divided into N groups: S_i ∈ D_i, i = 1, 2, ..., N, where N is an integer greater than 1.

The following procedure is executed until the optimal S is obtained.
[Algorithm listing — reproduced as an image in the original publication.]
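Since the listing itself is not reproduced here, the following is a minimal sketch of the alternating procedure as described in the surrounding text, not of the published listing. It reuses bayes_opt from the earlier sketch; in each round, group i is optimized while the other groups stay at their latest values, and the rounds repeat until S stops changing or the round budget runs out.

import numpy as np

def alternating_bayes_opt(f_S, groups_D, init_S, n_rounds=5, n_iter=20):
    # f_S: objective over the full hyperparameter setting S (a list with one entry per group).
    # groups_D: list of N candidate arrays, one per group; init_S: initial values, e.g. sampled.
    S = list(init_S)
    for _ in range(n_rounds):
        prev = list(S)
        for i, D_i in enumerate(groups_D):            # i traverses 1, 2, ..., N
            def f_i(s_i, i=i):
                trial = list(S)
                trial[i] = s_i                        # the remaining groups stay fixed
                return f_S(trial)
            best_i, _ = bayes_opt(f_i, D_i, n_iter=n_iter)
            S[i] = best_i                             # latest value of group i
        if all(np.array_equal(a, b) for a, b in zip(S, prev)):
            break                                     # S no longer changes: stop
    return S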
In the above procedure, the objective function f(S) may be a loss function.

The process of obtaining an observation from a sampled value drawn from D_i may be: substitute the sampled value into the objective function f(S) to obtain the observation corresponding to that sampled value.

Optionally, in this application, the objective function of Bayesian optimization is a loss function.

For example, the objective function of Bayesian optimization may be any of the following: the log loss function, the square loss function (also called the least-squares loss function), or the exponential loss function.

It should be understood that, in this application, the objective function of Bayesian optimization may also be another type of loss function; these are not enumerated here.

It should be understood that, in practical applications, a loss function can be selected as the objective function of Bayesian optimization according to the needs of the actual application.
Taking the square loss function as the objective function as an example, in some embodiments of this application, the objective function f(S) of Bayesian optimization is given by the following formula:
$$f(S) = L(Y, g(X)) = \sum_{i=1}^{n} \bigl( Y_i - g(X_i) \bigr)^2$$
where (X, Y) denotes a sample, g(X) denotes the machine learning model, X denotes the independent variable of the machine learning model, and Y denotes its dependent variable. n denotes the number of samples, where a sample here refers to a sample (X, Y). g(X_i) denotes the predicted value of the machine learning model, Y_i denotes the true value, Y_i - g(X_i) denotes the residual between the predicted value and the true value, and L(Y, g(X)) denotes the sum of squared residuals over the sample space.

Optionally, in some embodiments of this application, the samples used in the objective function of Bayesian optimization may be training-set samples, test-set samples, or both training-set and test-set samples.

For example, take the objective function to be the square loss function shown below:
$$L(Y, g(X)) = \sum_{i=1}^{n} \bigl( Y_i - g(X_i) \bigr)^2$$
where (X, Y) denotes a sample, g(X) denotes the machine learning model, X denotes its independent variable, and Y denotes its dependent variable. g(X_i) denotes the predicted value of the machine learning model, Y_i denotes the true value, Y_i - g(X_i) denotes the residual between the predicted value and the true value, and L(Y, g(X)) denotes the sum of squared residuals over the sample space. The sample space may be the training-set sample space, in which case n denotes the number of samples in the training set; or the test-set sample space, in which case n denotes the number of samples in the test set; or the sample space formed by the training set and the test set together, in which case n denotes the total number of samples in the training set and the test set.

It should be understood that each value of the hyperparameters corresponds to one machine learning model. In other words, different hyperparameter values correspond to different machine learning models. Therefore, during the Bayesian optimization of the hyperparameters, every time the hyperparameter value is updated, the machine learning model used in the objective function is updated as well.

It should also be understood that the machine learning model corresponding to each hyperparameter value can be obtained through training. For example, any existing feasible model training method may be used to train the machine learning model corresponding to each hyperparameter value, which is not limited in this application.
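A minimal, concrete sketch of this point, using scikit-learn's Ridge regression purely as an illustrative stand-in (it is not the model of the original disclosure): each candidate value of the hyperparameter, here the regular term coefficient alpha, determines its own model, which is trained before the loss is measured.

import numpy as np
from sklearn.linear_model import Ridge

def objective(alpha, X_train, Y_train, X_test, Y_test):
    model = Ridge(alpha=alpha)           # the model is determined by the hyperparameter value
    model.fit(X_train, Y_train)          # train the model corresponding to this value
    residual = Y_test - model.predict(X_test)
    return float(np.sum(residual ** 2))  # square loss on the test-set samples

# Two different hyperparameter values give two different trained models and two losses.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(objective(0.1, X[:80], Y[:80], X[80:], Y[80:]),
      objective(100.0, X[:80], Y[:80], X[80:], Y[80:]))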
Optionally, in some embodiments of this application, the observation in the Bayesian optimization process is determined according to the loss function used by the machine learning model during training.

For example, during the Bayesian optimization of the i-th group of hyperparameters, the observation corresponding to one sampled value of the i-th group is determined by the following formula:
$$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
where epoch is the number of training rounds of the machine learning model corresponding to the current value of the i-th group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

When w_1 is zero and w_2 is not zero, the observation Loss is related only to the loss value of the machine learning model on the test set.

When w_2 is zero and w_1 is not zero, the observation Loss is related only to the loss value of the machine learning model on the training set.

When neither w_1 nor w_2 is zero, the observation Loss is related both to the loss value of the machine learning model on the test set and to its loss value on the training set.
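A minimal sketch of this observation, assuming the per-epoch loss histories T_loss and V_loss were recorded during the short training run of the model for the current hyperparameter value:

def observation_loss(T_loss, V_loss, w1, w2):
    # Weighted combination of the training-set and test-set losses over all recorded epochs.
    assert len(T_loss) == len(V_loss) and (w1 != 0 or w2 != 0)
    epoch = len(T_loss)
    return sum(w1 * t + w2 * v for t, v in zip(T_loss, V_loss)) / epoch

# w1 = 0 uses only the test-set losses; w2 = 0 uses only the training-set losses.
print(observation_loss([1.0, 0.8, 0.6], [1.2, 0.9, 0.7], w1=0.5, w2=0.5))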
Optionally, in some embodiments, during the Bayesian optimization of the i-th group of hyperparameters, the number of training iterations of the machine learning model is controlled to be smaller than a preset value.

For example, the number of training iterations of the machine learning model is controlled to be smaller than 20.

It should be understood that, in the hyperparameter optimization process, the convergence time of the machine learning model, or the number of times it is trained, directly affects the speed of hyperparameter optimization. By limiting the number of training iterations of the machine learning model to be smaller than a preset value, the embodiments of this application can increase the speed of hyperparameter optimization.

In this application, it is assumed that the final performance of a model is correlated with its performance in the early stage of training. In other words, if the model converges monotonically at the beginning of training, its final behavior is also monotonic convergence; if the model stops converging monotonically (that is, diverges) at the beginning of training, its final behavior is also no longer monotonic convergence.

Based on this assumption, for the machine learning model corresponding to each hyperparameter value, the number of training rounds is kept within the preset value.

Optionally, in some embodiments, controlling the number of training iterations of the machine learning model corresponding to each updated value of the i-th group of hyperparameters to be smaller than the preset value includes: adopting an early-stop strategy during the training of the machine learning model corresponding to each updated value of the i-th group of hyperparameters, so that the number of training iterations of the machine learning model is smaller than the preset value.

For example, if the preset value is 20, the machine learning model corresponding to each hyperparameter value is trained for at most 20 rounds and then stopped. If the machine learning model stops converging monotonically before the 20 rounds are completed, training is stopped early.

If the number of training rounds reaches 20 and the machine learning model is still converging monotonically, training is also stopped.
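A minimal sketch of this early-stop control; the per-epoch losses are fed in as a plain list (a made-up stand-in for one real training step per epoch) so that the sketch runs end to end:

def train_with_early_stop(epoch_losses, max_epochs=20):
    history = []
    for epoch, loss in enumerate(epoch_losses):
        if epoch >= max_epochs:              # never train more than the preset value
            break
        history.append(loss)
        if epoch > 0 and history[-1] > history[-2]:
            break                            # no longer monotonically converging: stop early
    return history

fake_epoch_losses = [1.0, 0.8, 0.7, 0.9, 0.6, 0.5]   # stops converging at the fourth epoch
print(train_with_early_stop(fake_epoch_losses))       # -> [1.0, 0.8, 0.7, 0.9]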
The solution of the embodiments of this application can be applied to the hyperparameter adjustment process of deep learning.

It should be understood that using Bayesian optimization to search for the hyperparameters of a deep learning model generally requires waiting until the deep learning model has fully converged before an observation can be obtained, which makes the hyperparameter optimization time long. With the solution provided by the embodiments of this application, the time required to optimize the hyperparameters can be reduced effectively.

The above description mainly takes the implementation of step S220 shown in FIG. 3 as an example. In this application, the implementation of step S220 includes but is not limited to the manner shown in FIG. 3. Any scheme in which Bayesian optimization is performed on the N groups of hyperparameters separately in the process of obtaining the optimized hyperparameters falls within the protection scope of this application.
Optionally, another implementation of step S220 is: first perform at least one round of Bayesian optimization operations on the first N1 groups among the N groups of hyperparameters to obtain the optimized first N1 groups; then perform at least one round of Bayesian optimization operations on the remaining N2 (N1 + N2 = N) groups to obtain the optimized last N2 groups. Each round of Bayesian optimization operations on the first N1 groups includes: performing Bayesian optimization on the i-th group among these N1 groups, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N1. Each round of Bayesian optimization operations on the last N2 groups includes: performing Bayesian optimization on the i-th group among these N2 groups, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N2.

As an example, assume that N equals 5. First, the first and second groups of hyperparameters are alternately optimized as follows to obtain the optimized first and second groups: at least one round of Bayesian optimization operations is executed, and each round includes performing Bayesian optimization on the first group while fixing the values of the remaining groups at their latest values, and performing Bayesian optimization on the second group while fixing the values of the remaining groups at their latest values. After the optimization of the first and second groups is completed, the third, fourth, and fifth groups are alternately optimized as follows to obtain the optimized third, fourth, and fifth groups: at least one round of Bayesian optimization operations is executed, and each round includes performing Bayesian optimization on the third group while fixing the values of the remaining groups at their latest values, performing Bayesian optimization on the fourth group while fixing the values of the remaining groups at their latest values, and performing Bayesian optimization on the fifth group while fixing the values of the remaining groups at their latest values.
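A minimal sketch of this staged variant, reusing alternating_bayes_opt from the earlier sketch; the stage split ((0, 1), (2, 3, 4)) mirrors the N = 5 example above, and all names remain illustrative assumptions:

def staged_alternating_opt(f_S, groups_D, init_S, stages=((0, 1), (2, 3, 4))):
    S = list(init_S)
    for stage in stages:                      # first groups 1-2, then groups 3-5
        sub_D = [groups_D[i] for i in stage]
        def f_stage(sub_S, stage=stage):
            trial = list(S)                   # groups outside the stage keep their latest values
            for i, s_i in zip(stage, sub_S):
                trial[i] = s_i
            return f_S(trial)
        sub_best = alternating_bayes_opt(f_stage, sub_D, [S[i] for i in stage])
        for i, s_i in zip(stage, sub_best):
            S[i] = s_i
    return S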
Therefore, the solution provided by the embodiments of this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize, which achieves a dimension-reduced search over the hyperparameters on the one hand and weakens the restriction of the dimension-reduction assumption on the other hand.

It should be understood that the solution provided by this application can be applied to scenarios in which the optimization object is high-dimensional, and equally to scenarios in which the optimization object is low-dimensional.

It should also be understood that the solution provided by this application can be applied to, but is not limited to, the optimization of hyperparameters in machine learning, and can also be applied to other scenarios in which the global optimum of an unknown function needs to be solved.

It should also be understood that the application scenarios of the solution provided by this application include but are not limited to image detection, target tracking, and automated machine learning.
The method embodiments of this application have been described above; the device embodiments corresponding to the above method embodiments are described below. It should be understood that the descriptions of the device embodiments and of the method embodiments correspond to each other; therefore, for content that is not described in detail, reference may be made to the foregoing method embodiments, which is not repeated here for brevity.

FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus 400 provided by an embodiment of this application. The apparatus 400 includes the following units.

A dividing unit 410, configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1.

An optimization unit 420, configured to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups are fixed at their latest values.
In this application, each Bayesian optimization pass is performed on the solution space corresponding to a single group of hyperparameters. Since the dimension of each group is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimension-reduced search over the hyperparameters is achieved and the search can avoid getting stuck in a local optimum.

In addition, in this application, every one of the N groups of hyperparameters undergoes Bayesian optimization in the process of obtaining the optimized hyperparameters. In other words, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction of the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand and weaken the restriction of the dimension-reduction assumption on the other hand.

Optionally, as an embodiment, the optimization unit 420 is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

It should be understood that, in each round of Bayesian optimization operations, the N groups of hyperparameters are optimized one after another, and the optimization order may cause differences between the optimizations of the individual groups. By performing multiple rounds of Bayesian optimization operations, the embodiments of this application can weaken such differences to a certain extent and thereby further weaken the restriction of the dimension-reduction assumption.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand and weaken the restriction of the dimension-reduction assumption on the other hand.

Optionally, the number of hyperparameters included in each of the N groups may be the same, that is, every group may have the same dimension.

Optionally, different groups among the N groups may include different numbers of hyperparameters, that is, the dimensions of different groups may not be exactly the same.

It should be understood that when the dimensions of different groups among the N groups of hyperparameters are not exactly the same, the posterior distribution involved in the Bayesian optimization process needs to be split into multiple sub-posterior distributions.

Optionally, as an embodiment, the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.
Optionally, as an embodiment, the hyperparameters may include at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition (add) and concatenation (concat) operations, number of branches, number of layers (layer num), number of iterations (epoch), initialization parameters (for example, MSRA initialization and Xavier initialization), regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

The hyperparameter types of different groups among the N groups of hyperparameters may not be exactly the same.

Optionally, different groups of hyperparameters have different hyperparameter types.

It should be understood that performing Bayesian optimization on a group of hyperparameters of the same type can, to a certain extent, increase the convergence speed and thereby improve the efficiency of hyperparameter optimization.

Therefore, in the embodiments of this application, the hyperparameters to be optimized are grouped according to hyperparameter type and each group is then optimized separately, which can improve the efficiency of hyperparameter optimization to a certain extent.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
$$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
where epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

Optionally, as an embodiment, the optimization unit 420 is configured to control the number of training iterations of the machine learning model to be smaller than a preset value during the Bayesian optimization of each group of hyperparameters.

Optionally, as an embodiment, the optimization unit 420 is configured to adopt an early-stop strategy so that the number of training iterations of the machine learning model is smaller than the preset value.

Optionally, as an embodiment, the dividing unit 410 is configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters according to the application scenario of machine learning, where N is an integer greater than 1.

Optionally, as an embodiment, the machine learning model is a deep learning model.

As shown in FIG. 5, an embodiment of this application further provides a hyperparameter optimization apparatus 500. The apparatus includes a processor 510 and a memory 520. The memory 520 is configured to store instructions, and the processor 510 is configured to execute the instructions stored in the memory 520; execution of the instructions stored in the memory 520 causes the processor 510 to perform the optimization method in the above method embodiments.

Execution of the instructions stored in the memory 520 causes the processor 510 to perform the actions performed by the dividing unit 410 and the optimization unit 420 in the above embodiments.

Optionally, as shown in FIG. 5, the apparatus 500 may further include a communication interface 530, configured to exchange signals with external devices. For example, the processor 510 is configured to control the interface 530 to receive and/or send signals.
An embodiment of this application further provides a computer storage medium on which a computer program is stored. When the computer program is executed by a computer, the computer performs the optimization method in the above method embodiments.

An embodiment of this application further provides a computer program product containing instructions. When the instructions are executed by a computer, the computer performs the optimization method in the above method embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transferred from one computer-readable storage medium to another; for example, the computer instructions may be transferred from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a division of logical functions, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

The above is only the specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of changes or replacements within the technical scope disclosed in this application, and such changes or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

1. A hyperparameter optimization method, characterized by comprising:

    dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1;

    performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, wherein, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values.

2. The optimization method according to claim 1, characterized in that the performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters comprises:

    using at least one round of Bayesian optimization operations to obtain the optimized hyperparameters, wherein each round of Bayesian optimization operations comprises:

    performing Bayesian optimization on the i-th group of hyperparameters among the N groups of hyperparameters, wherein, during the Bayesian optimization of the i-th group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values, and i traverses 1, 2, ..., N.

3. The optimization method according to claim 1 or 2, characterized in that the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

4. The optimization method according to claim 3, characterized in that the hyperparameters comprise at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition and concatenation operations, number of branches, number of layers, number of iterations, initialization parameters, regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

5. The optimization method according to any one of claims 1 to 4, characterized in that, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

6. The optimization method according to any one of claims 1 to 5, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

7. The optimization method according to claim 6, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
    $$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
    where epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

8. The optimization method according to any one of claims 1 to 7, characterized in that, during the Bayesian optimization of each group of hyperparameters, the number of training iterations of the machine learning model is controlled to be smaller than a preset value.

9. The optimization method according to claim 8, characterized in that the controlling the number of training iterations of the machine learning model to be smaller than a preset value comprises:

    adopting an early-stop strategy so that the number of training iterations of the machine learning model is smaller than the preset value.

10. The optimization method according to any one of claims 1 to 9, characterized in that the dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters comprises:

    dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters according to the application scenario of machine learning.

11. The optimization method according to claim 8 or 9, characterized in that the machine learning model is a deep learning model.

12. A hyperparameter optimization apparatus, characterized by comprising:

    a dividing unit, configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and

    an optimization unit, configured to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, wherein, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values.

13. The optimization apparatus according to claim 12, characterized in that the optimization unit is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, wherein each round of Bayesian optimization operations comprises:

    performing Bayesian optimization on the i-th group of hyperparameters among the N groups of hyperparameters, wherein, during the Bayesian optimization of the i-th group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values, and i traverses 1, 2, ..., N.

14. The optimization apparatus according to claim 12 or 13, characterized in that the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

15. The optimization apparatus according to claim 14, characterized in that the hyperparameters comprise at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition and concatenation operations, number of branches, number of layers, number of iterations, initialization parameters, regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

16. The optimization apparatus according to any one of claims 12 to 15, characterized in that, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

17. The optimization apparatus according to any one of claims 12 to 16, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

18. The optimization apparatus according to claim 17, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
    $$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
    wherein epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, w1 and w2 are the weights of T_loss(j) and V_loss(j) respectively, and w1 and w2 are not both zero.
  19. The optimization device according to any one of claims 12 to 18, wherein the optimization unit is configured to, in the process of performing Bayesian optimization on each group of hyperparameters, control the number of training iterations of the machine learning model to be less than a preset value.
  20. The optimization device according to claim 19, wherein the optimization unit is configured to adopt an early stopping strategy so that the number of training iterations of the machine learning model is less than the preset value.
  21. The optimization device according to any one of claims 12 to 20, wherein the dividing unit is configured to divide the hyperparameters to be optimized in machine learning into N groups of hyperparameters according to the application scenario of the machine learning.
  22. The optimization device according to claim 19 or 20, wherein the machine learning model is a deep learning model.
  23. A hyperparameter optimization device, comprising: a memory and a processor, wherein the memory is configured to store instructions, the processor is configured to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to perform the optimization method according to any one of claims 1 to 11.
  24. A computer storage medium having a computer program stored thereon, wherein, when the computer program is executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 11.
  25. A computer program product comprising instructions, wherein, when the instructions are executed by a computer, the computer is caused to perform the optimization method according to any one of claims 1 to 11.
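The sketch below illustrates one possible reading of the grouped, coordinate-wise Bayesian optimization of claims 12 and 13: the hyperparameters are split into N groups (here grouped by type, as in claim 14), and each group is optimized in turn while all other groups are frozen at their latest values. It is an illustration only, not the claimed implementation; the use of scikit-optimize's gp_minimize as the per-group Bayesian optimizer and the names groups, latest, train_and_evaluate and optimize_group are assumptions introduced here.

from skopt import gp_minimize
from skopt.space import Real, Integer

# Example grouping by hyperparameter type (claim 14): structural vs. training hyperparameters.
groups = {
    "structure": {"num_layers": Integer(2, 10), "kernel_size": Integer(1, 7)},
    "training":  {"learning_rate": Real(1e-5, 1e-1, prior="log-uniform"),
                  "weight_decay":  Real(1e-6, 1e-2, prior="log-uniform")},
}

# Latest values of every hyperparameter; start from arbitrary initial settings.
latest = {"num_layers": 4, "kernel_size": 3, "learning_rate": 1e-3, "weight_decay": 1e-4}

def train_and_evaluate(params):
    # Hypothetical stand-in for "train the model and return its observation Loss";
    # replaced here by a synthetic objective so the sketch runs end to end.
    return (params["num_layers"] - 6) ** 2 + abs(params["learning_rate"] - 1e-2)

def optimize_group(name, n_calls=20):
    # Bayesian-optimize one group while the remaining groups stay fixed at their latest values.
    names = list(groups[name])
    dims = [groups[name][n] for n in names]

    def objective(x):
        trial = dict(latest)               # remaining groups fixed at their latest values
        trial.update(dict(zip(names, x)))  # only this group's hyperparameters vary
        return train_and_evaluate(trial)

    res = gp_minimize(objective, dims, n_calls=n_calls)
    latest.update(dict(zip(names, res.x)))  # best values found become the new "latest" values
    return res.fun

# At least one round of optimization (claim 13): i traverses the N groups in order.
for _ in range(2):                          # two outer rounds, as an example
    for group_name in groups:
        optimize_group(group_name)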
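Claims 17 to 20 tie the optimizer's observation value to the per-round training and test losses and cap the training budget with an early stopping strategy. Because the formula of claim 18 is published only as an image, the combination used below (the minimum over training rounds of w1*T_loss(j) + w2*V_loss(j)) is only one plausible reading consistent with the symbols defined in claim 18, not the verbatim formula; observation_loss, train_with_budget and model_step are hypothetical names.

def observation_loss(t_loss, v_loss, w1=0.5, w2=0.5):
    # t_loss[j], v_loss[j]: losses on the training / test set after round j+1 (claim 18).
    assert len(t_loss) == len(v_loss) and (w1 != 0 or w2 != 0)
    # Assumed combination: best weighted loss observed over all training rounds.
    return min(w1 * t + w2 * v for t, v in zip(t_loss, v_loss))

def train_with_budget(model_step, max_epochs=20, patience=3):
    # Run at most `max_epochs` training rounds (claim 19) and stop early once the test
    # loss has not improved for `patience` rounds (claim 20). `model_step(epoch)` is a
    # hypothetical callback that runs one training round and returns (train_loss, test_loss).
    t_hist, v_hist = [], []
    best_v, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        t, v = model_step(epoch)
        t_hist.append(t)
        v_hist.append(v)
        if v < best_v:
            best_v, since_best = v, 0
        else:
            since_best += 1
            if since_best >= patience:   # early stop: fewer rounds than the preset cap
                break
    return observation_loss(t_hist, v_hist)

# Example usage with a synthetic training curve (test loss improves, then degrades):
loss = train_with_budget(lambda e: (1.0 / (e + 1), 0.5 + abs(e - 4) * 0.05))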
PCT/CN2018/112712 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus WO2020087281A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus
CN201880038686.XA CN110770764A (en) 2018-10-30 2018-10-30 Method and device for optimizing hyper-parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus

Publications (1)

Publication Number Publication Date
WO2020087281A1 (en)

Family

ID=69328799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus

Country Status (2)

Country Link
CN (1) CN110770764A (en)
WO (1) WO2020087281A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368931B (en) * 2020-03-09 2023-11-17 第四范式(北京)技术有限公司 Method for determining learning rate of image classification model
US11823076B2 (en) 2020-07-27 2023-11-21 International Business Machines Corporation Tuning classification hyperparameters
CN112232508A (en) * 2020-09-18 2021-01-15 苏州浪潮智能科技有限公司 Model training method, system, device and medium
CN112883331B (en) * 2021-02-24 2024-03-01 东南大学 Target tracking method based on multi-output Gaussian process
CN113312855B (en) * 2021-07-28 2021-12-10 北京大学 Search space decomposition-based machine learning optimization method, electronic device, and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042322A (en) * 2014-08-19 2016-03-31 日本電気株式会社 Data analysis device, analysis method, and program thereof
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN108062587A (en) * 2017-12-15 2018-05-22 清华大学 The hyper parameter automatic optimization method and system of a kind of unsupervised machine learning
CN108573281A (en) * 2018-04-11 2018-09-25 中科弘云科技(北京)有限公司 A kind of tuning improved method of the deep learning hyper parameter based on Bayes's optimization
WO2018189279A1 (en) * 2017-04-12 2018-10-18 Deepmind Technologies Limited Black-box optimization using neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102638802B (en) * 2012-03-26 2014-09-03 哈尔滨工业大学 Hierarchical cooperative combined spectrum sensing algorithm
US20140156231A1 (en) * 2012-11-30 2014-06-05 Xerox Corporation Probabilistic relational data analysis
CN108470210A (en) * 2018-04-02 2018-08-31 中科弘云科技(北京)有限公司 A kind of optimum option method of hyper parameter in deep learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598133A (en) * 2020-12-16 2021-04-02 联合汽车电子有限公司 Vehicle data processing method, device, equipment and storage medium
CN112598133B (en) * 2020-12-16 2023-07-28 联合汽车电子有限公司 Method, device, equipment and storage medium for processing vehicle data
CN112990480A (en) * 2021-03-10 2021-06-18 北京嘀嘀无限科技发展有限公司 Method and device for building model, electronic equipment and storage medium
WO2022211179A1 (en) * 2021-03-30 2022-10-06 주식회사 솔리드웨어 Optimal model seeking method, and device therefor
CN113052252A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113052252B (en) * 2021-03-31 2024-03-26 北京字节跳动网络技术有限公司 Super-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN115796346A (en) * 2022-11-22 2023-03-14 烟台国工智能科技有限公司 Yield optimization method and system and non-transitory computer readable storage medium
CN115796346B (en) * 2022-11-22 2023-07-21 烟台国工智能科技有限公司 Yield optimization method, system and non-transitory computer readable storage medium

Also Published As

Publication number Publication date
CN110770764A (en) 2020-02-07

Similar Documents

Publication Publication Date Title
WO2020087281A1 (en) Hyper-parameter optimization method and apparatus
CN110503192B (en) Resource efficient neural architecture
JP6620439B2 (en) Learning method, program, and learning apparatus
US10460230B2 (en) Reducing computations in a neural network
US20210142181A1 (en) Adversarial training of machine learning models
EP3711000B1 (en) Regularized neural network architecture search
US11853882B2 (en) Methods, apparatus, and storage medium for classifying graph nodes
KR20210032521A (en) Determining the fit of machine learning models to data sets
US20170147921A1 (en) Learning apparatus, recording medium, and learning method
CN113692594A (en) Fairness improvement through reinforcement learning
US11562250B2 (en) Information processing apparatus and method
US20210215818A1 (en) Generative adversarial network-based target identification
KR101828215B1 (en) A method and apparatus for learning cyclic state transition model on long short term memory network
WO2019045802A1 (en) Distance metric learning using proxies
Bohdal et al. Meta-calibration: Learning of model calibration using differentiable expected calibration error
US20210110298A1 (en) Interactive machine learning
US20210110299A1 (en) Interactive machine learning
US20170176956A1 (en) Control system using input-aware stacker
US20210397948A1 (en) Learning method and information processing apparatus
TWI758223B (en) Computing method with dynamic minibatch sizes and computing system and computer-readable storage media for performing the same
US20230325717A1 (en) Systems and methods for repurposing a machine learning model
WO2020173270A1 (en) Method and device used for parsing data and computer storage medium
US20230186150A1 (en) Hyperparameter selection using budget-aware bayesian optimization
EP3742354A1 (en) Information processing apparatus, information processing method, and program
WO2021061798A1 (en) Methods and apparatus to train a machine learning model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18938904

Country of ref document: EP

Kind code of ref document: A1