WO2020087281A1 - Hyper-parameter optimization method and apparatus - Google Patents

Hyper-parameter optimization method and apparatus

Info

Publication number
WO2020087281A1
Authority
WO
WIPO (PCT)
Prior art keywords
hyperparameters
optimization
machine learning
value
loss
Prior art date
Application number
PCT/CN2018/112712
Other languages
French (fr)
Chinese (zh)
Inventor
蒋阳
赵丛
张李亮
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2018/112712 priority Critical patent/WO2020087281A1/en
Priority to CN201880038686.XA priority patent/CN110770764A/en
Publication of WO2020087281A1 publication Critical patent/WO2020087281A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This application relates to the field of computer technology, and in particular, to a hyperparameter optimization method and device.
  • The parameters of machine learning algorithms mainly fall into two classes: hyper-parameters and ordinary parameters. Ordinary parameters can be learned and estimated from the data; hyperparameters cannot be estimated from the data and can only be specified through human experience and design. Hyperparameters are parameters that need to be set before the learning process starts, and they define higher-level concepts about the machine learning model, such as its complexity or learning ability. For example, the hyperparameters may include, but are not limited to, regularization coefficients, the learning rate, the network structure, and the width and depth of convolution kernels.
  • the adjustment of hyperparameters has a very large impact on the performance of machine learning algorithms.
  • However, the adjustment of hyperparameters is a black-box operation that usually requires a large amount of debugging by algorithm designers, who need a relatively deep accumulation of experience in the field. It costs a great deal of time and effort, often still fails to reach the optimal result, and the optimization efficiency is low.
  • the desired hyperparameters can be obtained by modeling the unknown function and searching for its global optimal solution.
  • The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function, and it has therefore been proposed for adjusting the hyperparameters of machine learning models.
  • However, in some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, which makes solving the global optimal solution of the unknown function in a high-dimensional space very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
  • the present application provides a hyperparameter optimization method and device, which can realize a dimensionality reduction search for hyperparameters, and at the same time can weaken the assumption of limiting the solution space, so as to obtain better hyperparameter optimization results.
  • In a first aspect, a hyperparameter optimization method is provided. The method includes: dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In a second aspect, a hyperparameter optimization device is provided. The device includes: a division unit, which divides the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and an optimization unit, which performs Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In a third aspect, an apparatus for processing video images is provided, which includes a memory and a processor.
  • the memory is used to store instructions
  • the processor is used to execute instructions stored in the memory.
  • Execution of the instructions stored in the memory causes the processor to perform the optimization method provided in the first aspect.
  • a chip is provided.
  • the chip includes a processing module and a communication interface.
  • the processing module is used to control the communication interface to communicate with the outside.
  • the processing module is also used to implement the optimization method provided in the first aspect.
  • a computer-readable storage medium on which a computer program is stored, which when executed by a computer causes the computer to implement the optimization method provided in the first aspect.
  • a computer program product containing instructions which when executed by a computer causes the computer to implement the optimization method provided in the first aspect.
  • The solution provided by this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize; on the one hand, this realizes a dimensionality-reduction search for the hyperparameters, and on the other hand, it weakens the limitation of the dimensionality-reduction assumption.
  • Figure 1 is a schematic diagram of the basic principle of the Bayesian optimization algorithm.
  • FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • FIG. 3 is another schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
  • FIG. 5 is another schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
  • Bayesian optimization algorithm (Bayesian Optimization Algorithm, BOA) is an algorithm for solving the global optimal solution of unknown functions.
  • D is the candidate set of s.
  • the goal of Bayesian optimization is to select an s from D, so that the value of the unknown function f (s) is the smallest (or largest).
  • the unknown function f (s) can be called the objective function.
  • The first step is to make a prior assumption (prior belief) about the function space distribution of the objective function f(s); that is, the function space distribution of f(s) is assumed to follow a prior distribution.
  • The prior assumption usually uses a Gaussian process prior. For example, the function space distribution of f(s) may be assumed to be a Gaussian distribution.
  • the first step also includes obtaining at least two sample values and obtaining at least two observation values corresponding to these sample values.
  • Assuming the sampling values are s_0 and s_1, the observed values are f(s_0) and f(s_1).
  • For example, the sampling values s_0 and s_1 can be selected from the candidate set D by sampling or similar means.
  • the first step also includes using at least two observations to update the average and variance of the prior distribution to obtain a posterior distribution.
  • the modified Gaussian distribution model is the posterior distribution of f (s).
  • the acquisition function is constructed using the posterior distribution, and the acquisition function is used to calculate the next sample value.
  • Taking the case where the function space distribution of f(s) is Gaussian as an example, the second step specifically selects the next sampling value s_i from the modified Gaussian distribution model. The selection criterion is that, relative to the other sampling values in the candidate set D, inputting (s_i, f(s_i)) into the Gaussian distribution model would make the model approach the true distribution of the objective function f(s) faster and more accurately; therefore places with a smaller mean and a larger variance are sought.
  • The acquisition function mentioned in the second step recommends the next sampling value after jointly considering these two factors: a smaller mean (smaller when f(s) is a loss function, larger when f(s) represents the accuracy of the model) and a larger variance.
  • In the third step, the observation value corresponding to the sampling value obtained in the second step is obtained, and whether this sampling value is the optimal solution is judged according to the observation value. If it is, the Bayesian optimization process ends; if not, the process goes to the fourth step.
  • The sampled value can be substituted into the objective function f(s) to calculate the observation value.
  • the observation value obtained in the third step is used to continue to modify the posterior distribution, and the process goes to the second step. That is, the second step, the third step, and the fourth step are repeatedly executed until convergence (that is, the optimal solution is obtained in the third step).
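  • The following is a minimal, illustrative sketch of the four-step loop described above; it is not code from the application. It assumes a Gaussian-process surrogate from scikit-learn, a lower-confidence-bound acquisition, and a finite candidate set D, all of which are assumptions made for illustration.

```python
# Illustrative sketch only: a simple Bayesian optimization loop over a finite candidate set D.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimize(f, D, n_iter=30, kappa=2.0, seed=0):
    """Minimize an unknown function f over a finite candidate set D (array of shape [m, d])."""
    rng = np.random.default_rng(seed)
    # Step 1: prior assumption (Gaussian process prior) plus at least two initial samples.
    idx = rng.choice(len(D), size=2, replace=False)
    S = D[idx]                                   # sampled values s_0, s_1
    y = np.array([f(s) for s in S])              # observations f(s_0), f(s_1)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(S, y)                             # posterior distribution given the samples so far
        mu, sigma = gp.predict(D, return_std=True)
        # Step 2: the acquisition trades off a small mean (for a loss) against a large variance.
        s_next = D[np.argmin(mu - kappa * sigma)]
        # Step 3: observe the objective at the proposed sampling value.
        y_next = f(s_next)
        # Step 4: fold the new observation back in and repeat until the budget is spent.
        S = np.vstack([S, s_next])
        y = np.append(y, y_next)
    best = int(np.argmin(y))
    return S[best], y[best]
```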
  • the Bayesian optimization algorithm can be used to adjust the hyperparameters of the machine learning model (also called optimization).
  • The hyperparameter adjustment process of machine learning is regarded as solving the extremum problem in the Bayesian optimization algorithm: the hyperparameters to be optimized are regarded as s, and the candidate values of the hyperparameters to be optimized constitute the candidate set D. The Bayesian optimization process shown in Figure 1 then searches for the global optimal solution of the objective function, yielding the optimized hyperparameters.
  • the loss function is generally used as the objective function.
  • The loss function is used to estimate the degree of inconsistency between the predicted value and the true value of the machine learning model, and it can be a non-negative real-valued function. Assuming that the independent variable of the machine learning model g() is X and the dependent variable is Y, and taking the sample (X_i, Y_i) as an example, the predicted value of the machine learning model is g(X_i) and the true value is Y_i.
  • There are many common loss functions, for example, the log loss function, the square loss function (also called the least-squares loss function), the exponential loss function, and other loss functions.
  • Taking the square loss function as an example, its standard form is L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))², where n is the number of samples, g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
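  • As a concrete numerical illustration of the square loss above (a sketch only; the predictions stand in for a hypothetical model g):

```python
import numpy as np

def square_loss(y_true, y_pred):
    """L(Y, g(X)): the sum of squared residuals over the n samples."""
    residual = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sum(residual ** 2))

# Example: true values Y_i and predictions g(X_i) for n = 3 samples.
print(square_loss([1.0, 2.0, 3.0], [0.9, 2.2, 2.7]))  # 0.01 + 0.04 + 0.09 ≈ 0.14
```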
  • If the square loss function is used as the objective function in the Bayesian optimization algorithm, the purpose of Bayesian optimization is to minimize the value of the square loss function, thereby obtaining the optimized hyperparameters.
  • the hyperparameters to be optimized are usually defined as a multi-dimensional vector S.
  • the process of Bayesian optimization is the process of searching the optimal value of the vector S.
  • In some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, resulting in a very high dimension of the vector S. Solving the global optimal solution of the unknown function in such a high-dimensional space is very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
  • Existing solutions deal with high-dimensional hyperparameters by assuming that the solution space of the global optimal solution of the unknown function is a relatively low-dimensional solution space, and then performing Bayesian optimization directly in the assumed low-dimensional solution space. As a result, the strategy used to map the solution space of the global optimal solution to the relatively low-dimensional solution space has a great influence on the Bayesian optimization results; if the assumption strategy is unreasonable, the optimization results will be poor, and the algorithm is therefore not robust enough.
  • This application proposes a hyperparameter optimization scheme, which can realize the dimensionality reduction search for hyperparameters, and at the same time can weaken the assumption of limiting the solution space, so as to obtain better hyperparameter optimization results.
  • FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
  • the optimization method includes the following steps.
  • the hyperparameters to be optimized include N sets of hyperparameters, and N is an integer greater than 1.
  • the hyperparameters to be optimized for machine learning may be divided into N groups in advance.
  • the hyperparameters that need to be optimized for machine learning may be divided into N groups in real time when optimization is needed.
  • the grouping strategy for the hyperparameters that need to be optimized may be different.
  • the number of hyperparameters included in each group of hyperparameters in the N sets of hyperparameters is less than the total number of hyperparameters that need to be optimized in machine learning.
  • S220: Perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • The Bayesian optimization of each group of hyperparameters can be implemented using the Bayesian optimization algorithm shown in Figure 1; during the Bayesian optimization of each group, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • the values of the remaining groups of hyperparameters can be determined by sampling.
  • In each Bayesian optimization process, Bayesian optimization is performed on the solution space corresponding to one group of hyperparameters. Because the dimension of each group of hyperparameters is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimensionality-reduction search for the hyperparameters is realized and the search can avoid getting stuck in a local optimal solution.
  • In this way, a dimensionality-reduction search can be performed on the hyperparameters.
  • each set of hyperparameters in the N sets of hyperparameters that need to be optimized by machine learning includes at least one hyperparameter.
  • the number of hyperparameters included in each group in the N groups of hyperparameters may be the same, that is, the dimensions of each group of hyperparameters may be the same.
  • the number of hyperparameters included in different groups among the N sets of hyperparameters may also be different, that is, the dimensions of different groups of hyperparameters may not be completely the same.
  • the N sets of hyperparameters are obtained by randomly grouping the hyperparameters that need to be optimized.
  • the N sets of hyperparameters are obtained by grouping the hyperparameters that need to be optimized through experience.
  • the N sets of hyperparameters are divided according to the type of hyperparameters in machine learning.
  • The hyperparameters may include at least two of the following: convolution kernel size, number of convolution kernels, convolution stride, shortcut connection scheme, the choice between an addition (add) operation and a concatenation (concat) operation, number of branches, number of layers, number of iterations (epochs), initialization parameters (such as MSRA initialization and Xavier initialization), regularization coefficients, learning rate, neural network structure, and the number of layers of the neural network.
  • the hyperparameter types of different groups of hyperparameters in the N groups of hyperparameters may not be completely the same.
  • Different hyperparameters may have different hyperparameter types. Grouping the hyperparameters to be optimized according to hyperparameter type and then optimizing each group separately can improve the optimization efficiency of the hyperparameters to a certain extent.
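  • Purely for illustration, one hypothetical way of grouping hyperparameters by type is shown below; the group names and membership are assumptions, not a grouping prescribed by the application.

```python
# Hypothetical grouping of the hyperparameters to be optimized into N = 3 groups by type.
hyperparameter_groups = {
    "structure": ["kernel_size", "kernel_number", "stride", "shortcut_connection", "num_layers"],
    "training":  ["learning_rate", "epochs", "regularization_coefficient"],
    "init":      ["initialization"],   # e.g. MSRA or Xavier initialization
}
N = len(hyperparameter_groups)         # N = 3 groups, each smaller than the full set
```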
  • the grouping strategy for the hyperparameters to be optimized is fixed.
  • the grouping strategy for the hyperparameters that need to be optimized may be different or the same, which is not limited in this application and can be determined according to actual needs.
  • One implementation of step S220 is to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round of Bayesian optimization operations includes performing Bayesian optimization on the i-th group of the N groups of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • In each round of Bayesian optimization operations, Bayesian optimization is performed on all N groups of hyperparameters; in other words, in the process of obtaining the optimized hyperparameters, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, and therefore the limitation of the dimensionality-reduction assumption can be weakened.
  • Thus, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • Optionally, in step S220, two, three, or more rounds of Bayesian optimization operations are performed to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes performing Bayesian optimization on the i-th group of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • The method of performing Bayesian optimization on the N groups of hyperparameters in each round of Bayesian optimization operations can be referred to as Bayesian optimization with alternating optimization.
  • The embodiments of the present application introduce the idea of alternating optimization into the Bayesian optimization process, which achieves effective dimensionality reduction of the high-dimensional search space, weakens the assumption limitations of the existing technology, and helps to search for the optimal hyperparameters.
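  • Below is a minimal sketch of this alternating, grouped Bayesian optimization, reusing the bayesian_optimize helper sketched earlier; the way a full hyperparameter setting is assembled from per-group values is an assumption made for illustration, not the application's implementation.

```python
def alternating_group_bo(objective, group_candidates, init_values, n_rounds=3):
    """
    objective(values): loss of the machine learning model for a full dict of hyperparameter values.
    group_candidates:  {group_name: candidate array (shape [m_i, d_i]) for that group}.
    init_values:       an initial value for every group (e.g. drawn by sampling).
    """
    current = dict(init_values)
    for _ in range(n_rounds):                        # each round traverses i = 1, 2, ..., N
        for name, D_i in group_candidates.items():
            # Optimize group i while the remaining groups stay fixed at their latest values.
            def f_i(s, name=name):
                trial = dict(current)
                trial[name] = s
                return objective(trial)
            best_s, _ = bayesian_optimize(f_i, D_i)  # per-group Bayesian optimization (see sketch above)
            current[name] = best_s                   # this latest value is used for the later groups
    return current
```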
  • the entire process of optimizing hyperparameters in the embodiments of the present application is as follows.
  • the hyperparameter adjustment process of machine learning is regarded as the objective function f (S).
  • S represents the hyperparameters that need to be optimized.
  • S ∈ D, where D represents the sample space of the hyperparameters S that need to be optimized.
  • the objective function f (S) may be a loss function.
  • The process of sampling a value from D_i and obtaining the observation value may be that the sampled value is substituted into the objective function f(S) to obtain the observation value corresponding to that sampled value.
  • the objective function of Bayesian optimization is a loss function.
  • the objective function of Bayesian optimization may be any of the following: log loss function, square loss function (also called least squares loss function), and exponential loss function.
  • a loss function can be selected as the objective function of Bayesian optimization according to the needs of the actual application.
  • For example, the objective function f(S) of Bayesian optimization may be the square loss function L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))², where:
  • (X, Y) is a sample, g(X) represents the machine learning model, X represents the independent variable of the machine learning model, and Y represents the dependent variable of the machine learning model.
  • n represents the number of samples; the samples here refer to (X, Y) samples.
  • g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
  • the samples used in the Bayesian optimized objective function may be training set samples, or test set samples, or training set samples and test set samples.
  • When the sample space is the sample space of the training set, n represents the number of samples in the training set; when the sample space is the sample space of the test set, n represents the number of samples in the test set; and when the sample space is composed of the training set and the test set, n represents the total number of samples in the training set and the test set.
  • each value of the hyperparameter corresponds to a machine learning model.
  • Different hyperparameter values correspond to different machine learning models. Therefore, in the Bayesian optimization process of the hyperparameters, each time the value of a hyperparameter is updated, the machine learning model used in the objective function should also be updated.
  • the machine learning model corresponding to the value of each hyperparameter can be obtained through training.
  • any existing feasible model training method may be used to train the machine learning model corresponding to each hyperparameter value, which is not limited in this application.
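  • A sketch of how such an objective might be assembled: each hyperparameter value gets its own freshly built and trained model, whose loss is returned as the observation. The build_model, train, and evaluate_loss callables are hypothetical placeholders, not APIs from the application.

```python
def make_objective(build_model, train, evaluate_loss):
    """Return an objective that trains a fresh model for every hyperparameter setting."""
    def objective(hyperparams):
        model = build_model(hyperparams)   # a new machine learning model for this value
        train(model)                       # any feasible training method may be used here
        return evaluate_loss(model)        # loss value used as the Bayesian optimization observation
    return objective
```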
  • the observation value in the Bayesian optimization process is determined according to the loss function used by the machine learning model in the training process.
  • For example, the observation value corresponding to a sampled value of the i-th group of hyperparameters is determined by the following formula:
  • T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, and w_1 and w_2 are the weights of T_loss(j) and V_loss(j), where w_1 and w_2 are not simultaneously zero.
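  • Purely as an assumption for illustration, one plausible weighted combination of the variables listed above is sketched below; it is not the formula from the application.

```python
def observation_loss(T_loss, V_loss, w1=0.5, w2=0.5):
    """Hypothetical combination of per-round training and test losses (indexed j = 1..epoch)."""
    assert not (w1 == 0 and w2 == 0), "w1 and w2 must not be simultaneously zero"
    epoch = len(T_loss)
    # Assumed form: average of the weighted per-round losses; the true formula may differ.
    return sum(w1 * T_loss[j] + w2 * V_loss[j] for j in range(epoch)) / epoch
```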
  • During the Bayesian optimization of each group of hyperparameters, the number of training rounds of the machine learning model is controlled to be less than a preset value. For example, the number of training rounds of the machine learning model is controlled to be less than 20.
  • the convergence time of the machine learning model or the number of training times of the machine learning model directly affects the optimization speed of the hyperparameter.
  • the embodiment of the present application can increase the optimization speed of the hyperparameters by limiting the training times of the machine learning model to less than a preset value.
  • The final performance of the model is related to its performance in the initial stage of training: if the model converges monotonically in the initial stage of training, its final performance will also converge monotonically; if the model no longer converges monotonically (i.e., diverges) in the initial stage of training, its final performance will no longer converge monotonically either. Therefore, the number of training rounds can be controlled within the preset value.
  • Controlling the number of training rounds of the machine learning model corresponding to each updated value of the i-th group of hyperparameters to be less than the preset value includes: adopting an early-stop strategy for the machine learning model corresponding to each updated value of the i-th group of hyperparameters, so that its number of training rounds is less than the preset value.
  • For example, the preset value is 20. The machine learning model corresponding to each hyperparameter value is trained for at most 20 rounds; if, within fewer than 20 rounds, the machine learning model no longer converges monotonically, training stops early, and if the machine learning model is still converging monotonically at 20 rounds, training is also stopped.
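  • A sketch of the early-stop control described above, assuming a generic per-round training callable and using the evaluated loss as the monotonic-convergence signal (both are assumptions for illustration):

```python
def train_with_early_stop(model, train_one_round, evaluate_loss, preset=20):
    """Train for at most `preset` rounds; stop early once the loss stops decreasing monotonically."""
    previous = float("inf")
    for _ in range(preset):            # the number of training rounds never exceeds the preset value
        train_one_round(model)
        loss = evaluate_loss(model)
        if loss >= previous:           # no longer converging monotonically -> stop early
            break
        previous = loss
    return model
```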
  • the solution of the embodiment of the present application may be applied to the hyper-parameter adjustment process of deep learning.
  • Step S220 is mainly described below by taking the example shown in FIG. 3; however, the implementation of step S220 includes, but is not limited to, the method shown in FIG. 3. As long as Bayesian optimization is performed on the N groups of hyperparameters during the process of obtaining the optimized hyperparameters, such schemes fall within the protection scope of the present application.
  • Each round of Bayesian optimization operations on the first N1 groups of hyperparameters includes performing Bayesian optimization on the i-th group of the N1 groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups of hyperparameters are fixed to their latest values and i traverses 1, 2, ..., N1.
  • Each round of Bayesian optimization operations on the last N2 groups of hyperparameters includes performing Bayesian optimization on the i-th group of the N2 groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups of hyperparameters are fixed to their latest values and i traverses 1, 2, ..., N2.
  • For example, the first group of hyperparameters and the second group of hyperparameters are alternately optimized as follows to obtain the optimized first and second groups of hyperparameters: at least one round of Bayesian optimization operations is performed, and each round includes performing Bayesian optimization on the first group of hyperparameters while the values of the remaining group of hyperparameters are fixed to their latest values, and then performing Bayesian optimization on the second group of hyperparameters while the values of the remaining group of hyperparameters are fixed to their latest values.
  • Similarly, the third, fourth, and fifth groups of hyperparameters are alternately optimized as follows to obtain the optimized third, fourth, and fifth groups of hyperparameters: at least one round of Bayesian optimization operations is performed, and each round includes performing Bayesian optimization on the third group of hyperparameters while the values of the remaining groups are fixed to their latest values, then performing Bayesian optimization on the fourth group of hyperparameters while the values of the remaining groups are fixed to their latest values, and then performing Bayesian optimization on the fifth group of hyperparameters while the values of the remaining groups are fixed to their latest values.
  • In this way, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • The solution provided by the present application may be applied to, but is not limited to, the optimization of hyperparameters in machine learning, and may also be applied to other scenarios in which the global optimal solution of an unknown function needs to be solved.
  • FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus 400 provided by an embodiment of the present application.
  • the device 400 includes the following units.
  • the dividing unit 410 divides the hyper-parameters to be optimized for machine learning into N sets of hyper-parameters, where N is an integer greater than 1;
  • The optimization unit 420 is used to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters; in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
  • In each Bayesian optimization process, Bayesian optimization is performed on the solution space corresponding to one group of hyperparameters. Because the dimension of each group of hyperparameters is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimensionality-reduction search for the hyperparameters is realized and the search can avoid getting stuck in a local optimal solution.
  • Thus, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • The optimization unit 420 is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round includes performing Bayesian optimization on the i-th group of hyperparameters while the values of the remaining groups of hyperparameters are fixed to their latest values, with i traversing 1, 2, ..., N.
  • When the N groups of hyperparameters are optimized separately, the order of optimization may cause the optimization results of the individual hyperparameter groups to differ. Performing multiple rounds of Bayesian optimization operations can weaken this difference to a certain extent, thereby further weakening the limitation of the dimensionality-reduction assumption.
  • In this way, on the one hand, a dimensionality-reduction search can be performed on the hyperparameters, and on the other hand, the limitation of the dimensionality-reduction assumption can be weakened.
  • the number of hyperparameters included in each group of the N groups of hyperparameters may be the same, that is, the dimension of each group of hyperparameters may be the same.
  • the number of hyperparameters included in different groups among the N sets of hyperparameters may also be different, that is, the dimensions of different groups of hyperparameters may not be completely the same.
  • the N sets of hyperparameters are divided according to the type of hyperparameters in machine learning.
  • Optionally, the hyperparameters may include at least two of the following: convolution kernel size, number of convolution kernels, convolution stride, shortcut connection scheme, the choice between an addition (add) operation and a concatenation (concat) operation, number of branches, number of layers, number of iterations (epochs), initialization parameters (such as MSRA initialization and Xavier initialization), regularization coefficients, learning rate, neural network structure, and the number of layers of the neural network.
  • the hyperparameter types of different groups of hyperparameters in the N groups of hyperparameters may not be completely the same.
  • Different hyperparameters may have different hyperparameter types. Grouping the hyperparameters to be optimized according to hyperparameter type and then optimizing each group separately can improve the optimization efficiency of the hyperparameters to a certain extent.
  • the objective function of Bayesian optimization is a loss function
  • The samples used in the loss function are training set samples and/or test set samples.
  • The observation values used by Bayesian optimization are determined based on the loss values obtained in model training by the machine learning model corresponding to each group of hyperparameters.
  • The observation value Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula, where epoch is the number of training rounds of the machine learning model corresponding to the current value of that group of hyperparameters, T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, and w_1 and w_2 are the weights of T_loss(j) and V_loss(j), respectively, with w_1 and w_2 not both zero.
  • the optimization unit 420 is configured to control the number of trainings of the machine learning model to be less than a preset value during Bayesian optimization of each set of hyperparameters.
  • the optimization unit 420 is configured to adopt an early stop strategy so that the number of trainings of the machine learning model is less than a preset value.
  • the dividing unit 410 is used to divide the hyperparameters that need to be optimized for machine learning into N sets of hyperparameters according to the application scenario of machine learning, and N is an integer greater than 1.
  • the machine learning model is a deep learning model.
  • an embodiment of the present application further provides a hyperparameter optimization apparatus 500, which includes a processor 510 and a memory 520.
  • the memory 520 is used to store instructions
  • the processor 510 is used to execute instructions stored in the memory 520.
  • Execution of the instructions stored in the memory 520 causes the processor 510 to perform the optimization method in the foregoing method embodiments.
  • Execution of the instructions stored in the memory 520 causes the processor 510 to be used to perform the actions performed by the dividing unit 410 and the optimization unit 420 in the above-described embodiments.
  • the apparatus 500 may further include a communication interface 530 for exchanging signals with external devices.
  • The processor 510 is used to control the communication interface 530 to receive and/or send signals.
  • Embodiments of the present application also provide a computer storage medium on which a computer program is stored.
  • the computer program executes the optimization method in the foregoing method embodiments.
  • Embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to execute the optimization method in the foregoing method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device including a server, a data center, and the like integrated with one or more available media.
  • The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital video disc (DVD)), or semiconductor media (e.g., solid state disk (SSD)), etc.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • The division of the units is only a division of logical functions; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Feedback Control In General (AREA)

Abstract

Provided are a hyper-parameter optimization method and apparatus. The method comprises: dividing hyper-parameters that need to be optimized for machine learning into N hyper-parameter groups; and separately performing Bayesian optimization on the N hyper-parameter groups to obtain optimized hyper-parameters, wherein during the Bayesian optimization of each hyper-parameter group, the values of the remaining hyper-parameter groups are fixed to their latest values. Performing Bayesian optimization on the grouped hyper-parameters that need to be optimized for machine learning can implement a dimensionality-reduction search for the hyper-parameters and can also weaken the limitations of the dimensionality-reduction assumption.

Description

Hyper-parameter optimization method and apparatus
Copyright statement
The content disclosed in this patent document contains material protected by copyright. The copyright is owned by the copyright owner. The copyright owner has no objection to the reproduction by anyone of this patent document or the patent disclosure as it appears in the official records and archives of the Patent and Trademark Office.
Technical field
This application relates to the field of computer technology, and in particular, to a hyperparameter optimization method and apparatus.
Background
The parameters of machine learning algorithms mainly fall into two classes: hyper-parameters and ordinary parameters. Ordinary parameters can be learned and estimated from the data; hyperparameters cannot be estimated from the data and can only be specified through human experience and design. Hyperparameters are parameters that need to be set before the learning process starts, and they define higher-level concepts about the machine learning model, such as its complexity or learning ability. For example, the hyperparameters may include, but are not limited to, regularization coefficients, the learning rate, the network structure, and the width and depth of convolution kernels.
The adjustment of hyperparameters has a very large impact on the performance of machine learning algorithms. However, hyperparameter adjustment is a black-box operation that usually requires a large amount of debugging by algorithm designers, who need a relatively deep accumulation of experience in the field. It costs a great deal of time and effort, often still fails to reach the optimal result, and the optimization efficiency is low.
If the hyperparameter adjustment process of machine learning is regarded as an unknown function, the desired hyperparameters can be obtained by modeling the unknown function and searching for its global optimal solution. The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function. Therefore, the Bayesian optimization algorithm has been proposed for adjusting the hyperparameters of machine learning models.
However, in some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, which makes it very difficult to solve the global optimal solution of the unknown function in a high-dimensional space; the search often gets stuck in a local optimal solution and cannot obtain good results.
Summary of the invention
The present application provides a hyperparameter optimization method and apparatus, which can realize a dimensionality-reduction search for hyperparameters and at the same time weaken the assumption that limits the solution space, so that better hyperparameter optimization results can be obtained.
In a first aspect, a hyperparameter optimization method is provided. The method includes: dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
In a second aspect, a hyperparameter optimization device is provided. The device includes: a division unit, which divides the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and an optimization unit, which is used to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, in the process of performing Bayesian optimization on each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed to their latest values.
In a third aspect, an apparatus for processing video images is provided. The apparatus includes a memory and a processor; the memory is used to store instructions, the processor is used to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to perform the optimization method provided in the first aspect.
In a fourth aspect, a chip is provided. The chip includes a processing module and a communication interface; the processing module is used to control the communication interface to communicate with the outside, and the processing module is further used to implement the optimization method provided in the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a computer, the computer program causes the computer to implement the optimization method provided in the first aspect.
In a sixth aspect, a computer program product containing instructions is provided; when executed by a computer, the instructions cause the computer to implement the optimization method provided in the first aspect.
The solution provided by this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize; on the one hand, this realizes a dimensionality-reduction search for the hyperparameters, and on the other hand, it weakens the limitation of the dimensionality-reduction assumption.
Brief description of the drawings
Figure 1 is a schematic diagram of the basic principle of the Bayesian optimization algorithm.
FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
FIG. 3 is another schematic flowchart of a hyperparameter optimization method provided by an embodiment of the present application.
FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
FIG. 5 is another schematic block diagram of a hyperparameter optimization apparatus provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terminology used in the specification of the present application is for the purpose of describing specific embodiments only and is not intended to limit the present application.
First, the related technologies and concepts involved in the embodiments of the present application are introduced.
The Bayesian Optimization Algorithm (BOA) is an algorithm for solving the global optimal solution of an unknown function.
The problem scenario that the Bayesian optimization algorithm mainly addresses can be described by the following formula:
S* = arg max_{s ∈ D} f(s),
where D is the candidate set of s. The goal of Bayesian optimization is to select an s from D so that the value of the unknown function f(s) is the smallest (or largest). The unknown function f(s) can be called the objective function.
The general flow of the Bayesian optimization algorithm is shown in Figure 1 and includes the following steps.
The first step is to make a prior assumption (prior belief) about the function space distribution of the objective function f(s); that is, the function space distribution of f(s) is assumed to follow a prior distribution.
The prior assumption usually uses a Gaussian process prior. For example, the function space distribution of f(s) may be assumed to be a Gaussian distribution.
It should be understood that, since an s satisfying the condition needs to be found, if the function curve of f(s) were known, the s satisfying the condition could be calculated directly. However, the function curve of f(s) is unknown, that is, the characteristics of the function space distribution of f(s) are unknown. It is therefore necessary to make an assumption about the function space distribution of f(s); a common assumption is that the function space distribution of f(s) satisfies a Gaussian distribution, that is, a normal distribution.
Besides the Gaussian distribution, the function space distribution of f(s) can also be assumed to satisfy other probability distributions. In practical applications, a suitable probability distribution assumption can be chosen for f(s) for different problems.
The first step also includes obtaining at least two sampling values and obtaining at least two observation values corresponding to these sampling values.
Assuming that the sampling values are s_0 and s_1, the observation values are f(s_0) and f(s_1).
For example, the sampling values s_0 and s_1 can be selected from the candidate set D by sampling or similar means.
The first step also includes using the at least two observation values to update the mean and variance of the prior distribution to obtain a posterior distribution.
Taking the case where the prior distribution of f(s) is a Gaussian distribution as an example, the sampling values and observation values are input into the Gaussian distribution model, and the mean and variance of the Gaussian distribution model are corrected so that it approaches the true function space distribution of the objective function f(s). The corrected Gaussian distribution model is the posterior distribution of f(s).
In the second step, an acquisition function is constructed using the posterior distribution, and the acquisition function is used to calculate the next sampling value.
Taking the case where the function space distribution of f(s) is a Gaussian distribution as an example, the second step specifically selects the next sampling value s_i from the corrected Gaussian distribution model. The selection criterion is that, relative to the other sampling values in the candidate set D, inputting (s_i, f(s_i)) into the Gaussian distribution model would make the model approach the true distribution of the objective function f(s) faster and more accurately; therefore places with a smaller mean and a larger variance are sought.
The acquisition function mentioned in the second step recommends the next sampling value after jointly considering these two factors: a smaller mean (smaller when f(s) is a loss function, larger when f(s) represents the accuracy of the model) and a larger variance. It should be understood that the design of the acquisition function is prior art and is not described in detail herein.
In the third step, the observation value corresponding to the sampling value obtained in the second step is obtained, and whether this sampling value is the optimal solution is judged according to the observation value. If it is, the Bayesian optimization process ends; if not, the process goes to the fourth step.
The sampling value can be substituted into the objective function f(s) to calculate the observation value.
In the fourth step, the observation value obtained in the third step is used to continue correcting the posterior distribution, and the process goes to the second step. That is, the second, third, and fourth steps are executed repeatedly until convergence (that is, until the optimal solution is obtained in the third step).
As mentioned above, the Bayesian optimization algorithm can be used to adjust (also called optimize) the hyperparameters of a machine learning model. The hyperparameter adjustment process of machine learning is regarded as solving the extremum problem in the Bayesian optimization algorithm: the hyperparameters to be optimized are regarded as s, the candidate values of the hyperparameters to be optimized constitute the candidate set D, and the Bayesian optimization process shown in Figure 1 then searches for the global optimal solution of the objective function, yielding the optimized hyperparameters.
In machine learning, a loss function is generally used as the objective function.
The loss function is used to estimate the degree of inconsistency between the predicted value and the true value of the machine learning model, and it can be a non-negative real-valued function. Assuming that the independent variable of the machine learning model g() is X and the dependent variable is Y, and taking the sample (X_i, Y_i) as an example, the predicted value of the machine learning model is g(X_i) and the true value is Y_i.
There are many common loss functions, for example, the log loss function, the square loss function (also called the least-squares loss function), the exponential loss function, and other loss functions.
Taking the square loss function as an example, the standard form of the square loss function is as follows:
L(Y, g(X)) = Σ_{i=1}^{n} (Y_i − g(X_i))²,
where n is the number of samples, g(X_i) represents the predicted value of the machine learning model, Y_i represents the true value of the machine learning model, Y_i − g(X_i) represents the residual between the predicted value and the true value of the machine learning model, and L(Y, g(X)) represents the sum of squared residuals over the sample space.
If the square loss function is used as the objective function in the Bayesian optimization algorithm, the purpose of Bayesian optimization is to minimize the value of the square loss function, thereby obtaining the optimized hyperparameters.
In the process of adjusting hyperparameters using the Bayesian optimization algorithm, the hyperparameters to be optimized are usually defined as a multi-dimensional vector S, and the Bayesian optimization process is the process of searching for the optimal value of the vector S. In some machine learning application scenarios, the number of hyperparameters that need to be optimized may be very large, resulting in a very high dimension of the vector S. Solving the global optimal solution of the unknown function in such a high-dimensional space is very difficult; the search often gets stuck in a local optimal solution and cannot obtain good results.
Existing solutions deal with high-dimensional hyperparameters by assuming that the solution space of the global optimal solution of the unknown function is a relatively low-dimensional solution space, and then performing Bayesian optimization directly in the assumed low-dimensional solution space. As a result, the strategy used to map the solution space of the global optimal solution of the unknown function to the relatively low-dimensional solution space has a great influence on the Bayesian optimization results; if the assumption strategy is unreasonable, the optimization results will be poor, and the algorithm is therefore not robust enough.
This application proposes a hyperparameter optimization scheme, which can realize a dimensionality-reduction search for hyperparameters and at the same time weaken the assumption that limits the solution space, so that better hyperparameter optimization results can be obtained.
FIG. 2 is a schematic flowchart of a hyperparameter optimization method provided by an embodiment of this application. The optimization method includes the following steps.

S210: Obtain the hyperparameters that machine learning needs to optimize, where the hyperparameters to be optimized include N groups of hyperparameters and N is an integer greater than 1.

Optionally, the hyperparameters to be optimized may be divided into N groups in advance.

Optionally, the hyperparameters to be optimized may be divided into N groups in real time when optimization is needed.

For example, the grouping strategy for the hyperparameters to be optimized may differ between different hyperparameter optimization tasks.

It should be understood that the number of hyperparameters included in each of the N groups is smaller than the total number of hyperparameters to be optimized in machine learning.
S220: Perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups are fixed at their latest values.

The Bayesian optimization of each group of hyperparameters can be implemented with the Bayesian optimization algorithm shown in FIG. 1; during the Bayesian optimization of each group, the values of the remaining groups of hyperparameters are fixed at their latest values.

Take the Bayesian optimization of the i-th group of hyperparameters as an example, and let the remaining group be the j-th (j ≠ i) group. Suppose that before the Bayesian optimization of the i-th group starts, the value of the j-th group is Z; then, during the Bayesian optimization of the i-th group, the value of the j-th group is fixed at its latest value Z.

During the Bayesian optimization of the first group of hyperparameters, the values of the remaining groups can be determined by sampling.

How many Bayesian optimization passes are needed before the optimized hyperparameters are obtained is determined by the convergence condition, which is not detailed in this document.
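A minimal sketch of how the objective seen by the Bayesian optimizer for one group keeps the remaining groups fixed at their latest values; the names objective_for_group, latest, and full_objective are illustrative assumptions rather than terms from the original disclosure:

def objective_for_group(i, latest, full_objective):
    # latest: the most recent value of every group; full_objective: loss of a complete setting.
    def f_i(candidate_group_i):
        params = dict(latest)            # all groups start at their latest values
        params[i] = candidate_group_i    # only group i is varied by the optimizer
        return full_objective(params)
    return f_i

# Toy usage: three scalar groups, full objective = sum of squares.
latest = {0: 1.0, 1: -2.0, 2: 0.5}
f_1 = objective_for_group(1, latest, lambda p: sum(v * v for v in p.values()))
print(f_1(0.0))   # groups 0 and 2 stay fixed, so the value is 1.0 + 0.0 + 0.25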
In this application, each Bayesian optimization pass is performed on the solution space corresponding to a single group of hyperparameters. Since the dimension of each group is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimension-reduced search over the hyperparameters is achieved, and the search can avoid getting stuck in a local optimum.

In addition, in this application, every one of the N groups of hyperparameters undergoes Bayesian optimization in the process of obtaining the optimized hyperparameters. In other words, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction imposed by the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand, and weaken the restriction of the dimension-reduction assumption on the other hand.

In this application, each of the N groups of hyperparameters to be optimized includes at least one hyperparameter.

Optionally, the number of hyperparameters included in each of the N groups may be the same, that is, every group may have the same dimension.

Optionally, different groups among the N groups may include different numbers of hyperparameters, that is, the dimensions of different groups may not be exactly the same.

It should be understood that when the dimensions of different groups among the N groups of hyperparameters are not exactly the same, the posterior distribution involved in the Bayesian optimization process needs to be split into multiple sub-posterior distributions.

In this application, there may be multiple grouping strategies for the N groups of hyperparameters.
Optionally, in some embodiments, the N groups of hyperparameters are obtained by randomly grouping the hyperparameters to be optimized.

Optionally, in some embodiments, the N groups of hyperparameters are obtained by grouping the hyperparameters to be optimized based on experience.

Optionally, in some embodiments, the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

The hyperparameters may include at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition (add) and concatenation (concat) operations, number of branches, number of layers (layer num), number of iterations (epoch), initialization parameters (for example, MSRA initialization and Xavier initialization), regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

The hyperparameter types of different groups among the N groups of hyperparameters may not be exactly the same.

Optionally, different groups of hyperparameters have different hyperparameter types.

It should be understood that performing Bayesian optimization on a group of hyperparameters of the same type can, to a certain extent, increase the convergence speed and thereby improve the efficiency of hyperparameter optimization.

Therefore, in the embodiments of this application, the hyperparameters to be optimized are grouped according to hyperparameter type and each group is then optimized separately, which can improve the efficiency of hyperparameter optimization to a certain extent.

Within a single hyperparameter optimization task, the grouping strategy for the hyperparameters to be optimized is fixed.

For different hyperparameter optimization tasks, for example, hyperparameter optimization tasks in different application scenarios, the grouping strategies for the hyperparameters to be optimized may be different or the same. This is not limited in this application and may be determined according to actual needs.
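As one illustration of a type-based grouping strategy, the sketch below groups hyperparameters described as (name, type) pairs; the type labels are assumptions chosen for the example:

from collections import defaultdict

def group_by_type(hyperparams):
    groups = defaultdict(list)
    for name, hp_type in hyperparams:
        groups[hp_type].append(name)
    return list(groups.values())   # N groups, one per hyperparameter type

hyperparams = [
    ("kernel_size", "convolution"), ("kernel_num", "convolution"), ("stride", "convolution"),
    ("learning_rate", "training"), ("epoch", "training"),
    ("layer_num", "structure"), ("branch_num", "structure"),
]
print(group_by_type(hyperparams))   # three groups: convolution / training / structure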
Optionally, as shown in FIG. 3, one implementation of step S220 is: using at least one round of Bayesian optimization operations to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

In this embodiment of the application, Bayesian optimization is performed on each of the N groups of hyperparameters in every round of Bayesian optimization operations. In other words, in the process of obtaining the optimized hyperparameters, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction of the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, this embodiment achieves a dimension-reduced search over the hyperparameters on the one hand and weakens the restriction of the dimension-reduction assumption on the other hand.

It should be understood that the number of rounds of Bayesian optimization operations needed to obtain the optimized hyperparameters can be determined according to the convergence condition.

For example, in step S220, two, three, or more rounds of Bayesian optimization operations are performed to obtain the optimized hyperparameters, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

It should be understood that, in each round of Bayesian optimization operations, the N groups of hyperparameters are optimized one after another, and the optimization order may cause differences between the optimizations of the individual groups. By performing multiple rounds (that is, at least two rounds) of Bayesian optimization operations, the embodiments of this application can weaken such differences to a certain extent and thereby further weaken the restriction of the dimension-reduction assumption.

The manner of performing Bayesian optimization on each of the N groups of hyperparameters in every round of Bayesian optimization operations may be referred to as Bayesian optimization with alternating optimization.

The embodiments of this application introduce the idea of alternating optimization into the Bayesian optimization process, which achieves effective dimension reduction for a high-dimensional search space, weakens the assumption restrictions of existing research techniques, and helps find the hyperparameters corresponding to the optimal solution.
As an example, the entire process of optimizing hyperparameters in the embodiments of this application is as follows.

The hyperparameter adjustment process of machine learning is regarded as an objective function f(S). The objective function f(S) is given a Gaussian-process prior, that is, p(f) = GP(f; μ; cov), where μ denotes the expectation, cov denotes the variance, and GP denotes a Gaussian process. S denotes the hyperparameters to be optimized, with S ∈ D, where D denotes the sample space of the hyperparameters S to be optimized.

The hyperparameters S to be optimized are divided into N groups: S_i ∈ D_i, i = 1, 2, ..., N, where N is an integer greater than 1.

The following procedure is executed until the optimal S is obtained.
[Algorithm listing — reproduced as an image in the original publication.]
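Since the listing itself is not reproduced here, the following is a minimal sketch of the alternating procedure as described in the surrounding text, not of the published listing. It reuses bayes_opt from the earlier sketch; in each round, group i is optimized while the other groups stay at their latest values, and the rounds repeat until S stops changing or the round budget runs out.

import numpy as np

def alternating_bayes_opt(f_S, groups_D, init_S, n_rounds=5, n_iter=20):
    # f_S: objective over the full hyperparameter setting S (a list with one entry per group).
    # groups_D: list of N candidate arrays, one per group; init_S: initial values, e.g. sampled.
    S = list(init_S)
    for _ in range(n_rounds):
        prev = list(S)
        for i, D_i in enumerate(groups_D):            # i traverses 1, 2, ..., N
            def f_i(s_i, i=i):
                trial = list(S)
                trial[i] = s_i                        # the remaining groups stay fixed
                return f_S(trial)
            best_i, _ = bayes_opt(f_i, D_i, n_iter=n_iter)
            S[i] = best_i                             # latest value of group i
        if all(np.array_equal(a, b) for a, b in zip(S, prev)):
            break                                     # S no longer changes: stop
    return S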
In the above procedure, the objective function f(S) may be a loss function.

The process of obtaining an observation from a sampled value drawn from D_i may be: substitute the sampled value into the objective function f(S) to obtain the observation corresponding to that sampled value.

Optionally, in this application, the objective function of Bayesian optimization is a loss function.

For example, the objective function of Bayesian optimization may be any of the following: the log loss function, the square loss function (also called the least-squares loss function), or the exponential loss function.

It should be understood that, in this application, the objective function of Bayesian optimization may also be another type of loss function; these are not enumerated here.

It should be understood that, in practical applications, a loss function can be selected as the objective function of Bayesian optimization according to the needs of the actual application.
Taking the square loss function as the objective function as an example, in some embodiments of this application, the objective function f(S) of Bayesian optimization is given by the following formula:
$$f(S) = L(Y, g(X)) = \sum_{i=1}^{n} \bigl( Y_i - g(X_i) \bigr)^2$$
where (X, Y) denotes a sample, g(X) denotes the machine learning model, X denotes the independent variable of the machine learning model, and Y denotes its dependent variable. n denotes the number of samples, where a sample here refers to a sample (X, Y). g(X_i) denotes the predicted value of the machine learning model, Y_i denotes the true value, Y_i - g(X_i) denotes the residual between the predicted value and the true value, and L(Y, g(X)) denotes the sum of squared residuals over the sample space.

Optionally, in some embodiments of this application, the samples used in the objective function of Bayesian optimization may be training-set samples, test-set samples, or both training-set and test-set samples.

For example, take the objective function to be the square loss function shown below:
$$L(Y, g(X)) = \sum_{i=1}^{n} \bigl( Y_i - g(X_i) \bigr)^2$$
where (X, Y) denotes a sample, g(X) denotes the machine learning model, X denotes its independent variable, and Y denotes its dependent variable. g(X_i) denotes the predicted value of the machine learning model, Y_i denotes the true value, Y_i - g(X_i) denotes the residual between the predicted value and the true value, and L(Y, g(X)) denotes the sum of squared residuals over the sample space. The sample space may be the training-set sample space, in which case n denotes the number of samples in the training set; or the test-set sample space, in which case n denotes the number of samples in the test set; or the sample space formed by the training set and the test set together, in which case n denotes the total number of samples in the training set and the test set.

It should be understood that each value of the hyperparameters corresponds to one machine learning model. In other words, different hyperparameter values correspond to different machine learning models. Therefore, during the Bayesian optimization of the hyperparameters, every time the hyperparameter value is updated, the machine learning model used in the objective function is updated as well.

It should also be understood that the machine learning model corresponding to each hyperparameter value can be obtained through training. For example, any existing feasible model training method may be used to train the machine learning model corresponding to each hyperparameter value, which is not limited in this application.
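A minimal, concrete sketch of this point, using scikit-learn's Ridge regression purely as an illustrative stand-in (it is not the model of the original disclosure): each candidate value of the hyperparameter, here the regular term coefficient alpha, determines its own model, which is trained before the loss is measured.

import numpy as np
from sklearn.linear_model import Ridge

def objective(alpha, X_train, Y_train, X_test, Y_test):
    model = Ridge(alpha=alpha)           # the model is determined by the hyperparameter value
    model.fit(X_train, Y_train)          # train the model corresponding to this value
    residual = Y_test - model.predict(X_test)
    return float(np.sum(residual ** 2))  # square loss on the test-set samples

# Two different hyperparameter values give two different trained models and two losses.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(objective(0.1, X[:80], Y[:80], X[80:], Y[80:]),
      objective(100.0, X[:80], Y[:80], X[80:], Y[80:]))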
Optionally, in some embodiments of this application, the observation in the Bayesian optimization process is determined according to the loss function used by the machine learning model during training.

For example, during the Bayesian optimization of the i-th group of hyperparameters, the observation corresponding to one sampled value of the i-th group is determined by the following formula:
$$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
where epoch is the number of training rounds of the machine learning model corresponding to the current value of the i-th group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

When w_1 is zero and w_2 is not zero, the observation Loss is related only to the loss value of the machine learning model on the test set.

When w_2 is zero and w_1 is not zero, the observation Loss is related only to the loss value of the machine learning model on the training set.

When neither w_1 nor w_2 is zero, the observation Loss is related both to the loss value of the machine learning model on the test set and to its loss value on the training set.
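A minimal sketch of this observation, assuming the per-epoch loss histories T_loss and V_loss were recorded during the short training run of the model for the current hyperparameter value:

def observation_loss(T_loss, V_loss, w1, w2):
    # Weighted combination of the training-set and test-set losses over all recorded epochs.
    assert len(T_loss) == len(V_loss) and (w1 != 0 or w2 != 0)
    epoch = len(T_loss)
    return sum(w1 * t + w2 * v for t, v in zip(T_loss, V_loss)) / epoch

# w1 = 0 uses only the test-set losses; w2 = 0 uses only the training-set losses.
print(observation_loss([1.0, 0.8, 0.6], [1.2, 0.9, 0.7], w1=0.5, w2=0.5))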
Optionally, in some embodiments, during the Bayesian optimization of the i-th group of hyperparameters, the number of training iterations of the machine learning model is controlled to be smaller than a preset value.

For example, the number of training iterations of the machine learning model is controlled to be smaller than 20.

It should be understood that, in the hyperparameter optimization process, the convergence time of the machine learning model, or the number of times it is trained, directly affects the speed of hyperparameter optimization. By limiting the number of training iterations of the machine learning model to be smaller than a preset value, the embodiments of this application can increase the speed of hyperparameter optimization.

In this application, it is assumed that the final performance of a model is correlated with its performance in the early stage of training. In other words, if the model converges monotonically at the beginning of training, its final behavior is also monotonic convergence; if the model stops converging monotonically (that is, diverges) at the beginning of training, its final behavior is also no longer monotonic convergence.

Based on this assumption, for the machine learning model corresponding to each hyperparameter value, the number of training rounds is kept within the preset value.

Optionally, in some embodiments, controlling the number of training iterations of the machine learning model corresponding to each updated value of the i-th group of hyperparameters to be smaller than the preset value includes: adopting an early-stop strategy during the training of the machine learning model corresponding to each updated value of the i-th group of hyperparameters, so that the number of training iterations of the machine learning model is smaller than the preset value.

For example, if the preset value is 20, the machine learning model corresponding to each hyperparameter value is trained for at most 20 rounds and then stopped. If the machine learning model stops converging monotonically before the 20 rounds are completed, training is stopped early.

If the number of training rounds reaches 20 and the machine learning model is still converging monotonically, training is also stopped.
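A minimal sketch of this early-stop control; the per-epoch losses are fed in as a plain list (a made-up stand-in for one real training step per epoch) so that the sketch runs end to end:

def train_with_early_stop(epoch_losses, max_epochs=20):
    history = []
    for epoch, loss in enumerate(epoch_losses):
        if epoch >= max_epochs:              # never train more than the preset value
            break
        history.append(loss)
        if epoch > 0 and history[-1] > history[-2]:
            break                            # no longer monotonically converging: stop early
    return history

fake_epoch_losses = [1.0, 0.8, 0.7, 0.9, 0.6, 0.5]   # stops converging at the fourth epoch
print(train_with_early_stop(fake_epoch_losses))       # -> [1.0, 0.8, 0.7, 0.9]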
The solution of the embodiments of this application can be applied to the hyperparameter adjustment process of deep learning.

It should be understood that using Bayesian optimization to search for the hyperparameters of a deep learning model generally requires waiting until the deep learning model has fully converged before an observation can be obtained, which makes the hyperparameter optimization time long. With the solution provided by the embodiments of this application, the time required to optimize the hyperparameters can be reduced effectively.

The above description mainly takes the implementation of step S220 shown in FIG. 3 as an example. In this application, the implementation of step S220 includes but is not limited to the manner shown in FIG. 3. Any scheme in which Bayesian optimization is performed on the N groups of hyperparameters separately in the process of obtaining the optimized hyperparameters falls within the protection scope of this application.
Optionally, another implementation of step S220 is: first perform at least one round of Bayesian optimization operations on the first N1 groups among the N groups of hyperparameters to obtain the optimized first N1 groups; then perform at least one round of Bayesian optimization operations on the remaining N2 (N1 + N2 = N) groups to obtain the optimized last N2 groups. Each round of Bayesian optimization operations on the first N1 groups includes: performing Bayesian optimization on the i-th group among these N1 groups, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N1. Each round of Bayesian optimization operations on the last N2 groups includes: performing Bayesian optimization on the i-th group among these N2 groups, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N2.

As an example, assume that N equals 5. First, the first and second groups of hyperparameters are alternately optimized as follows to obtain the optimized first and second groups: at least one round of Bayesian optimization operations is executed, and each round includes performing Bayesian optimization on the first group while fixing the values of the remaining groups at their latest values, and performing Bayesian optimization on the second group while fixing the values of the remaining groups at their latest values. After the optimization of the first and second groups is completed, the third, fourth, and fifth groups are alternately optimized as follows to obtain the optimized third, fourth, and fifth groups: at least one round of Bayesian optimization operations is executed, and each round includes performing Bayesian optimization on the third group while fixing the values of the remaining groups at their latest values, performing Bayesian optimization on the fourth group while fixing the values of the remaining groups at their latest values, and performing Bayesian optimization on the fifth group while fixing the values of the remaining groups at their latest values.
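A minimal sketch of this staged variant, reusing alternating_bayes_opt from the earlier sketch; the stage split ((0, 1), (2, 3, 4)) mirrors the N = 5 example above, and all names remain illustrative assumptions:

def staged_alternating_opt(f_S, groups_D, init_S, stages=((0, 1), (2, 3, 4))):
    S = list(init_S)
    for stage in stages:                      # first groups 1-2, then groups 3-5
        sub_D = [groups_D[i] for i in stage]
        def f_stage(sub_S, stage=stage):
            trial = list(S)                   # groups outside the stage keep their latest values
            for i, s_i in zip(stage, sub_S):
                trial[i] = s_i
            return f_S(trial)
        sub_best = alternating_bayes_opt(f_stage, sub_D, [S[i] for i in stage])
        for i, s_i in zip(stage, sub_best):
            S[i] = s_i
    return S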
Therefore, the solution provided by the embodiments of this application performs Bayesian optimization on groups of the hyperparameters that machine learning needs to optimize, which achieves a dimension-reduced search over the hyperparameters on the one hand and weakens the restriction of the dimension-reduction assumption on the other hand.

It should be understood that the solution provided by this application can be applied to scenarios in which the optimization object is high-dimensional, and equally to scenarios in which the optimization object is low-dimensional.

It should also be understood that the solution provided by this application can be applied to, but is not limited to, the optimization of hyperparameters in machine learning, and can also be applied to other scenarios in which the global optimum of an unknown function needs to be solved.

It should also be understood that the application scenarios of the solution provided by this application include but are not limited to image detection, target tracking, and automated machine learning.
The method embodiments of this application have been described above; the device embodiments corresponding to the above method embodiments are described below. It should be understood that the descriptions of the device embodiments and of the method embodiments correspond to each other; therefore, for content that is not described in detail, reference may be made to the foregoing method embodiments, which is not repeated here for brevity.

FIG. 4 is a schematic block diagram of a hyperparameter optimization apparatus 400 provided by an embodiment of this application. The apparatus 400 includes the following units.

A dividing unit 410, configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1.

An optimization unit 420, configured to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, where, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups are fixed at their latest values.
In this application, each Bayesian optimization pass is performed on the solution space corresponding to a single group of hyperparameters. Since the dimension of each group is smaller than the total dimension of the hyperparameters that machine learning needs to optimize, a dimension-reduced search over the hyperparameters is achieved and the search can avoid getting stuck in a local optimum.

In addition, in this application, every one of the N groups of hyperparameters undergoes Bayesian optimization in the process of obtaining the optimized hyperparameters. In other words, every hyperparameter that machine learning needs to optimize is optimized by the Bayesian optimization algorithm, so the restriction of the dimension-reduction assumption can be weakened.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand and weaken the restriction of the dimension-reduction assumption on the other hand.

Optionally, as an embodiment, the optimization unit 420 is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, where each round of Bayesian optimization operations includes: performing Bayesian optimization on the i-th group among the N groups of hyperparameters, where, during the Bayesian optimization of the i-th group, the values of the remaining groups are fixed at their latest values, and i traverses 1, 2, ..., N.

It should be understood that, in each round of Bayesian optimization operations, the N groups of hyperparameters are optimized one after another, and the optimization order may cause differences between the optimizations of the individual groups. By performing multiple rounds of Bayesian optimization operations, the embodiments of this application can weaken such differences to a certain extent and thereby further weaken the restriction of the dimension-reduction assumption.

Therefore, by performing Bayesian optimization on the N groups of hyperparameters separately, the embodiments of this application achieve a dimension-reduced search over the hyperparameters on the one hand and weaken the restriction of the dimension-reduction assumption on the other hand.

Optionally, the number of hyperparameters included in each of the N groups may be the same, that is, every group may have the same dimension.

Optionally, different groups among the N groups may include different numbers of hyperparameters, that is, the dimensions of different groups may not be exactly the same.

It should be understood that when the dimensions of different groups among the N groups of hyperparameters are not exactly the same, the posterior distribution involved in the Bayesian optimization process needs to be split into multiple sub-posterior distributions.

Optionally, as an embodiment, the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.
Optionally, as an embodiment, the hyperparameters may include at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition (add) and concatenation (concat) operations, number of branches, number of layers (layer num), number of iterations (epoch), initialization parameters (for example, MSRA initialization and Xavier initialization), regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

The hyperparameter types of different groups among the N groups of hyperparameters may not be exactly the same.

Optionally, different groups of hyperparameters have different hyperparameter types.

It should be understood that performing Bayesian optimization on a group of hyperparameters of the same type can, to a certain extent, increase the convergence speed and thereby improve the efficiency of hyperparameter optimization.

Therefore, in the embodiments of this application, the hyperparameters to be optimized are grouped according to hyperparameter type and each group is then optimized separately, which can improve the efficiency of hyperparameter optimization to a certain extent.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

Optionally, as an embodiment, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
$$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
where epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

Optionally, as an embodiment, the optimization unit 420 is configured to control the number of training iterations of the machine learning model to be smaller than a preset value during the Bayesian optimization of each group of hyperparameters.

Optionally, as an embodiment, the optimization unit 420 is configured to adopt an early-stop strategy so that the number of training iterations of the machine learning model is smaller than the preset value.

Optionally, as an embodiment, the dividing unit 410 is configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters according to the application scenario of machine learning, where N is an integer greater than 1.

Optionally, as an embodiment, the machine learning model is a deep learning model.

As shown in FIG. 5, an embodiment of this application further provides a hyperparameter optimization apparatus 500. The apparatus includes a processor 510 and a memory 520. The memory 520 is configured to store instructions, and the processor 510 is configured to execute the instructions stored in the memory 520; execution of the instructions stored in the memory 520 causes the processor 510 to perform the optimization method in the above method embodiments.

Execution of the instructions stored in the memory 520 causes the processor 510 to perform the actions performed by the dividing unit 410 and the optimization unit 420 in the above embodiments.

Optionally, as shown in FIG. 5, the apparatus 500 may further include a communication interface 530, configured to exchange signals with external devices. For example, the processor 510 is configured to control the interface 530 to receive and/or send signals.
An embodiment of this application further provides a computer storage medium on which a computer program is stored. When the computer program is executed by a computer, the computer performs the optimization method in the above method embodiments.

An embodiment of this application further provides a computer program product containing instructions. When the instructions are executed by a computer, the computer performs the optimization method in the above method embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transferred from one computer-readable storage medium to another; for example, the computer instructions may be transferred from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a division of logical functions, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

The above is only the specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of changes or replacements within the technical scope disclosed in this application, and such changes or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

1. A hyperparameter optimization method, characterized by comprising:

    dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1;

    performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, wherein, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values.

2. The optimization method according to claim 1, characterized in that the performing Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters comprises:

    using at least one round of Bayesian optimization operations to obtain the optimized hyperparameters, wherein each round of Bayesian optimization operations comprises:

    performing Bayesian optimization on the i-th group of hyperparameters among the N groups of hyperparameters, wherein, during the Bayesian optimization of the i-th group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values, and i traverses 1, 2, ..., N.

3. The optimization method according to claim 1 or 2, characterized in that the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

4. The optimization method according to claim 3, characterized in that the hyperparameters comprise at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition and concatenation operations, number of branches, number of layers, number of iterations, initialization parameters, regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

5. The optimization method according to any one of claims 1 to 4, characterized in that, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

6. The optimization method according to any one of claims 1 to 5, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

7. The optimization method according to claim 6, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
    $$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
    where epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of that machine learning model on the training-set samples after the j-th round of training, V_loss(j) is the loss value of that machine learning model on the test-set samples after the j-th round of training, w_1 and w_2 are the weights of T_loss(j) and V_loss(j) respectively, and w_1 and w_2 are not both zero.

8. The optimization method according to any one of claims 1 to 7, characterized in that, during the Bayesian optimization of each group of hyperparameters, the number of training iterations of the machine learning model is controlled to be smaller than a preset value.

9. The optimization method according to claim 8, characterized in that the controlling the number of training iterations of the machine learning model to be smaller than a preset value comprises:

    adopting an early-stop strategy so that the number of training iterations of the machine learning model is smaller than the preset value.

10. The optimization method according to any one of claims 1 to 9, characterized in that the dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters comprises:

    dividing the hyperparameters that machine learning needs to optimize into N groups of hyperparameters according to the application scenario of machine learning.

11. The optimization method according to claim 8 or 9, characterized in that the machine learning model is a deep learning model.

12. A hyperparameter optimization apparatus, characterized by comprising:

    a dividing unit, configured to divide the hyperparameters that machine learning needs to optimize into N groups of hyperparameters, where N is an integer greater than 1; and

    an optimization unit, configured to perform Bayesian optimization on the N groups of hyperparameters respectively to obtain optimized hyperparameters, wherein, during the Bayesian optimization of each group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values.

13. The optimization apparatus according to claim 12, characterized in that the optimization unit is configured to obtain the optimized hyperparameters using at least one round of Bayesian optimization operations, wherein each round of Bayesian optimization operations comprises:

    performing Bayesian optimization on the i-th group of hyperparameters among the N groups of hyperparameters, wherein, during the Bayesian optimization of the i-th group of hyperparameters, the values of the remaining groups of hyperparameters are fixed at their latest values, and i traverses 1, 2, ..., N.

14. The optimization apparatus according to claim 12 or 13, characterized in that the N groups of hyperparameters are divided according to the types of hyperparameters in machine learning.

15. The optimization apparatus according to claim 14, characterized in that the hyperparameters comprise at least two of the following: kernel size, kernel number, convolution stride, shortcut connections, the choice between addition and concatenation operations, number of branches, number of layers, number of iterations, initialization parameters, regular term coefficients, learning rate, neural network structure, and the number of layers of the neural network.

16. The optimization apparatus according to any one of claims 12 to 15, characterized in that, during the Bayesian optimization of each group of hyperparameters, the objective function of Bayesian optimization is a loss function, and the samples used by the loss function are training-set samples and/or test-set samples.

17. The optimization apparatus according to any one of claims 12 to 16, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation used by Bayesian optimization is determined according to the loss values used in model training by the machine learning model corresponding to each group of hyperparameters.

18. The optimization apparatus according to claim 17, characterized in that, during the Bayesian optimization of each group of hyperparameters, the observation Loss corresponding to one sampled value of each group of hyperparameters is determined by the following formula:
    $$\mathrm{Loss} = \frac{1}{\mathrm{epoch}} \sum_{j=1}^{\mathrm{epoch}} \bigl( w_1 \, \mathrm{T\_loss}(j) + w_2 \, \mathrm{V\_loss}(j) \bigr)$$
    wherein epoch is the number of training rounds of the machine learning model corresponding to the current value of each group of hyperparameters, T_loss(j) is the loss value of the machine learning model on the training set samples after the j-th round of training, V_loss(j) is the loss value of the machine learning model on the test set samples after the j-th round of training, w1 and w2 are the weights of T_loss(j) and V_loss(j) respectively, and w1 and w2 are not both zero.
  19. The optimization device according to any one of claims 12 to 18, wherein the optimization unit is configured to, in the process of performing Bayesian optimization on each group of hyperparameters, control the number of training iterations of the machine learning model to be less than a preset value.
  20. The optimization device according to claim 19, wherein the optimization unit is configured to adopt an early stopping strategy so that the number of training iterations of the machine learning model is less than the preset value.
  21. The optimization device according to any one of claims 12 to 20, wherein the dividing unit is configured to divide the hyperparameters to be optimized in machine learning into N groups of hyperparameters according to the application scenario of the machine learning.
  22. The optimization device according to claim 19 or 20, wherein the machine learning model is a deep learning model.
  23. A hyperparameter optimization device, comprising: a memory and a processor, wherein the memory is configured to store instructions, the processor is configured to execute the instructions stored in the memory, and execution of the instructions stored in the memory causes the processor to perform the optimization method according to any one of claims 1 to 11.
  24. A computer storage medium having a computer program stored thereon, wherein, when the computer program is executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 11.
  25. A computer program product comprising instructions, wherein, when the instructions are executed by a computer, the computer is caused to perform the optimization method according to any one of claims 1 to 11.
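The sketch below illustrates one possible reading of the grouped, coordinate-wise Bayesian optimization of claims 12 and 13: the hyperparameters are split into N groups (here grouped by type, as in claim 14), and each group is optimized in turn while all other groups are frozen at their latest values. It is an illustration only, not the claimed implementation; the use of scikit-optimize's gp_minimize as the per-group Bayesian optimizer and the names groups, latest, train_and_evaluate and optimize_group are assumptions introduced here.

from skopt import gp_minimize
from skopt.space import Real, Integer

# Example grouping by hyperparameter type (claim 14): structural vs. training hyperparameters.
groups = {
    "structure": {"num_layers": Integer(2, 10), "kernel_size": Integer(1, 7)},
    "training":  {"learning_rate": Real(1e-5, 1e-1, prior="log-uniform"),
                  "weight_decay":  Real(1e-6, 1e-2, prior="log-uniform")},
}

# Latest values of every hyperparameter; start from arbitrary initial settings.
latest = {"num_layers": 4, "kernel_size": 3, "learning_rate": 1e-3, "weight_decay": 1e-4}

def train_and_evaluate(params):
    # Hypothetical stand-in for "train the model and return its observation Loss";
    # replaced here by a synthetic objective so the sketch runs end to end.
    return (params["num_layers"] - 6) ** 2 + abs(params["learning_rate"] - 1e-2)

def optimize_group(name, n_calls=20):
    # Bayesian-optimize one group while the remaining groups stay fixed at their latest values.
    names = list(groups[name])
    dims = [groups[name][n] for n in names]

    def objective(x):
        trial = dict(latest)               # remaining groups fixed at their latest values
        trial.update(dict(zip(names, x)))  # only this group's hyperparameters vary
        return train_and_evaluate(trial)

    res = gp_minimize(objective, dims, n_calls=n_calls)
    latest.update(dict(zip(names, res.x)))  # best values found become the new "latest" values
    return res.fun

# At least one round of optimization (claim 13): i traverses the N groups in order.
for _ in range(2):                          # two outer rounds, as an example
    for group_name in groups:
        optimize_group(group_name)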
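Claims 17 to 20 tie the optimizer's observation value to the per-round training and test losses and cap the training budget with an early stopping strategy. Because the formula of claim 18 is published only as an image, the combination used below (the minimum over training rounds of w1*T_loss(j) + w2*V_loss(j)) is only one plausible reading consistent with the symbols defined in claim 18, not the verbatim formula; observation_loss, train_with_budget and model_step are hypothetical names.

def observation_loss(t_loss, v_loss, w1=0.5, w2=0.5):
    # t_loss[j], v_loss[j]: losses on the training / test set after round j+1 (claim 18).
    assert len(t_loss) == len(v_loss) and (w1 != 0 or w2 != 0)
    # Assumed combination: best weighted loss observed over all training rounds.
    return min(w1 * t + w2 * v for t, v in zip(t_loss, v_loss))

def train_with_budget(model_step, max_epochs=20, patience=3):
    # Run at most `max_epochs` training rounds (claim 19) and stop early once the test
    # loss has not improved for `patience` rounds (claim 20). `model_step(epoch)` is a
    # hypothetical callback that runs one training round and returns (train_loss, test_loss).
    t_hist, v_hist = [], []
    best_v, since_best = float("inf"), 0
    for epoch in range(max_epochs):
        t, v = model_step(epoch)
        t_hist.append(t)
        v_hist.append(v)
        if v < best_v:
            best_v, since_best = v, 0
        else:
            since_best += 1
            if since_best >= patience:   # early stop: fewer rounds than the preset cap
                break
    return observation_loss(t_hist, v_hist)

# Example usage with a synthetic training curve (test loss improves, then degrades):
loss = train_with_budget(lambda e: (1.0 / (e + 1), 0.5 + abs(e - 4) * 0.05))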
PCT/CN2018/112712 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus WO2020087281A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus
CN201880038686.XA CN110770764A (en) 2018-10-30 2018-10-30 Method and device for optimizing hyper-parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus

Publications (1)

Publication Number Publication Date
WO2020087281A1 (en)

Family

ID=69328799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112712 WO2020087281A1 (en) 2018-10-30 2018-10-30 Hyper-parameter optimization method and apparatus

Country Status (2)

Country Link
CN (1) CN110770764A (en)
WO (1) WO2020087281A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368931B (en) * 2020-03-09 2023-11-17 第四范式(北京)技术有限公司 Method for determining learning rate of image classification model
US11823076B2 (en) 2020-07-27 2023-11-21 International Business Machines Corporation Tuning classification hyperparameters
CN112232508A (en) * 2020-09-18 2021-01-15 苏州浪潮智能科技有限公司 Model training method, system, device and medium
CN112883331B (en) * 2021-02-24 2024-03-01 东南大学 Target tracking method based on multi-output Gaussian process
CN113312855B (en) * 2021-07-28 2021-12-10 北京大学 Search space decomposition-based machine learning optimization method, electronic device, and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042322A (en) * 2014-08-19 2016-03-31 日本電気株式会社 Data analysis device, analysis method, and program thereof
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN108062587A (en) * 2017-12-15 2018-05-22 清华大学 The hyper parameter automatic optimization method and system of a kind of unsupervised machine learning
CN108573281A (en) * 2018-04-11 2018-09-25 中科弘云科技(北京)有限公司 A kind of tuning improved method of the deep learning hyper parameter based on Bayes's optimization
WO2018189279A1 (en) * 2017-04-12 2018-10-18 Deepmind Technologies Limited Black-box optimization using neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102638802B (en) * 2012-03-26 2014-09-03 哈尔滨工业大学 Hierarchical cooperative combined spectrum sensing algorithm
US20140156231A1 (en) * 2012-11-30 2014-06-05 Xerox Corporation Probabilistic relational data analysis
CN108470210A (en) * 2018-04-02 2018-08-31 中科弘云科技(北京)有限公司 A kind of optimum option method of hyper parameter in deep learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598133A (en) * 2020-12-16 2021-04-02 联合汽车电子有限公司 Vehicle data processing method, device, equipment and storage medium
CN112598133B (en) * 2020-12-16 2023-07-28 联合汽车电子有限公司 Method, device, equipment and storage medium for processing vehicle data
CN112990480A (en) * 2021-03-10 2021-06-18 北京嘀嘀无限科技发展有限公司 Method and device for building model, electronic equipment and storage medium
WO2022211179A1 (en) * 2021-03-30 2022-10-06 주식회사 솔리드웨어 Optimal model seeking method, and device therefor
CN113052252A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113052252B (en) * 2021-03-31 2024-03-26 北京字节跳动网络技术有限公司 Super-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN115796346A (en) * 2022-11-22 2023-03-14 烟台国工智能科技有限公司 Yield optimization method and system and non-transitory computer readable storage medium
CN115796346B (en) * 2022-11-22 2023-07-21 烟台国工智能科技有限公司 Yield optimization method, system and non-transitory computer readable storage medium

Also Published As

Publication number Publication date
CN110770764A (en) 2020-02-07

Similar Documents

Publication Publication Date Title
WO2020087281A1 (en) Hyper-parameter optimization method and apparatus
CN110503192B (en) Resource efficient neural architecture
JP6620439B2 (en) Learning method, program, and learning apparatus
US10460230B2 (en) Reducing computations in a neural network
US20210142181A1 (en) Adversarial training of machine learning models
EP3711000B1 (en) Regularized neural network architecture search
US11853882B2 (en) Methods, apparatus, and storage medium for classifying graph nodes
KR20210032521A (en) Determining the fit of machine learning models to data sets
US20170147921A1 (en) Learning apparatus, recording medium, and learning method
CN113692594A (en) Fairness improvement through reinforcement learning
US11562250B2 (en) Information processing apparatus and method
US20210215818A1 (en) Generative adversarial network-based target identification
KR101828215B1 (en) A method and apparatus for learning cyclic state transition model on long short term memory network
WO2019045802A1 (en) Distance metric learning using proxies
Bohdal et al. Meta-calibration: Learning of model calibration using differentiable expected calibration error
US20210110298A1 (en) Interactive machine learning
US20210110299A1 (en) Interactive machine learning
US20170176956A1 (en) Control system using input-aware stacker
US20210397948A1 (en) Learning method and information processing apparatus
TWI758223B (en) Computing method with dynamic minibatch sizes and computing system and computer-readable storage media for performing the same
US20230325717A1 (en) Systems and methods for repurposing a machine learning model
WO2020173270A1 (en) Method and device used for parsing data and computer storage medium
US20230186150A1 (en) Hyperparameter selection using budget-aware bayesian optimization
EP3742354A1 (en) Information processing apparatus, information processing method, and program
WO2021061798A1 (en) Methods and apparatus to train a machine learning model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18938904

Country of ref document: EP

Kind code of ref document: A1