KR20220151257A

KR20220151257A - Hyper-parameter optimization based on reinforcement learning and gaussian process regression

Info

Publication number: KR20220151257A
Application number: KR1020210058260A
Authority: KR
Inventors: 조인휘; 마지흔
Original assignee: 한양대학교 산학협력단
Priority date: 2021-05-06
Filing date: 2021-05-06
Publication date: 2022-11-15
Also published as: KR102640009B1

Abstract

Disclosed is a hyperparameter optimization technique based on reinforcement learning and Gaussian process regression. A hyperparameter optimization method performed by a hyperparameter optimization system according to one embodiment may comprise: updating a machine learning model using a training set generated through a combination of hyperparameters selected by an agent based on reinforcement learning; and predicting a reward for the updated machine learning model using Gaussian process regression. The invention can thereby reduce the cost of evaluation.

Description

Reinforcement Learning and Gaussian Process Regression Based Hyperparameter Optimization {HYPER-PARAMETER OPTIMIZATION BASED ON REINFORCEMENT LEARNING AND GAUSSIAN PROCESS REGRESSION}

The description below relates to techniques for optimizing hyperparameters.

Commonly used hyperparameter optimization methods include grid search (GS), random search (RS), Bayesian optimization (BO), gradient-based optimization (GO), evolutionary optimization, and population-based optimization.

Grid search finds a suitable hyperparameter configuration by trying every combination of hyperparameter values. Assume there are K hyperparameters in total (K is a natural number) and that the k-th hyperparameter can take mk values. The total number of configurations is then m1 × m2 × ... × mK. A continuous hyperparameter can be discretized; for the learning rate, for example, a few empirically chosen values can be selected. Continuous hyperparameters generally cannot be discretized at equal intervals and must instead be discretized according to their own characteristics. Grid search trains one model for each combination of hyperparameter values, measures its performance, and selects the best-performing configuration. However, grid search suffers from the curse of dimensionality, because the number of required function evaluations grows exponentially with the dimensionality of the configuration space. Another problem with grid search is that increasing the resolution of the discretization substantially increases the number of required function evaluations.
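The growth in the number of configurations described above can be illustrated with a minimal Python sketch. The search space, its hyperparameter names, and its candidate values are hypothetical and are not taken from this document.

```python
import itertools
import math

# Hypothetical search space: hyperparameter k has m_k candidate values.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],  # m_1 = 3, not equally spaced
    "batch_size": [32, 64],               # m_2 = 2
    "num_layers": [1, 2, 3, 4],           # m_3 = 4
}

def grid_size(space):
    """Number of configurations: m_1 * m_2 * ... * m_K."""
    return math.prod(len(values) for values in space.values())

def grid_configurations(space):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))

configs = list(grid_configurations(search_space))
```

Adding one more hyperparameter with five candidate values would multiply the number of models to train by five, which is the exponential growth the paragraph refers to.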

The influence of hyperparameters on model performance can differ greatly: some hyperparameters (e.g., the regularization coefficient) have a limited effect, while others (e.g., the learning rate) have a relatively large effect. In this case, grid search spends evaluations on unimportant hyperparameters, so random search is more suitable. Random search evaluates random combinations of hyperparameter values and selects the best-performing configuration; in other words, it picks the best point from a randomly drawn set of points. Random search has been shown both empirically and theoretically to be more practical and efficient than grid search, but it does not guarantee finding the optimum. A longer search is more likely to find the optimal hyperparameters, but it consumes more resources.
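Random search as described above might be sketched as follows. The sampling ranges, the hyperparameter names, and the `toy_score` stand-in for validation accuracy are assumptions made only for illustration.

```python
import math
import random

def sample_configuration(rng):
    """Draw one random configuration (ranges are illustrative assumptions)."""
    return {
        # Log-uniform sampling spreads trials across orders of magnitude.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "l2_coefficient": 10 ** rng.uniform(-6, -2),
    }

def random_search(evaluate, num_trials, seed=0):
    """Return the best (score, configuration) pair over num_trials draws."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(num_trials):
        config = sample_configuration(rng)
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

def toy_score(config):
    """Hypothetical stand-in for validation accuracy; peaks at learning_rate = 0.01."""
    return -abs(math.log10(config["learning_rate"]) + 2)
```

With a fixed seed, a longer run repeats the draws of a shorter run and then continues, so its best score can only stay the same or improve, at the price of more evaluations.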

Bayesian optimization is an efficient method for global optimization of expensive black-box functions. It is a surrogate-model-based optimization (SMBO) method that builds a probabilistic model mapping hyperparameters to the target metric evaluated on a validation set. It balances exploration (evaluating as many hyperparameter sets as possible) against exploitation (allocating more resources to promising hyperparameters). The main caveat is that, once Bayesian optimization finds a local maximum or minimum, it keeps sampling in that region and therefore easily becomes trapped in the local optimum. Bayesian optimization also consumes a large amount of resources and time. For high-dimensional, non-convex functions with unknown smoothness and noise, Bayesian optimization algorithms are often difficult to fit and optimize. In general, they rely on very strong assumptions that are difficult to satisfy in practice. In addition, some papers report that Bayesian optimization is not significantly better than random search.

Unlike black-box hyperparameter optimization (HPO) methods (e.g., grid search, random search, and Bayesian optimization), gradient-based optimization uses gradient information to optimize hyperparameters and greatly improves the efficiency of HPO. However, the hyperparameter space usually consists of many discrete decisions and is therefore neither continuous nor differentiable. For this reason, gradient descent usually cannot be applied in the hyperparameter space.

A method and system for optimizing hyperparameters based on reinforcement learning and Gaussian process regression can be provided.

A method and system can be provided in which an agent sequentially selects hyperparameters, the combinations of hyperparameters and their rewards are supplied as training data to accelerate the training process, and Gaussian process regression is used to predict the reward.

A hyperparameter optimization method performed by a hyperparameter optimization system may include: updating a machine learning model using a training set generated through a combination of hyperparameters selected by an agent based on reinforcement learning; and predicting a reward for the updated machine learning model using Gaussian process regression.

The updating may include initializing a plurality of policies including a first policy and a second policy of the machine learning model, sampling data using the first policy, training the second policy with the sampled data, and receiving, at the machine learning model, a plurality of hyperparameters selected by the agent for hyperparameter optimization.

The updating may include training the machine learning model using a training set generated according to the combination of the selected plurality of hyperparameters, obtaining the accuracy of the trained machine learning model using a validation set, and using the obtained accuracy as the reward.

The updating may include updating the first policy through a PPO algorithm for the machine learning model using the selected hyperparameter combination and the obtained reward, and using the selected hyperparameter combination and the obtained reward as a training set for Gaussian prediction.

The predicting may include, when the KL distance between the plurality of policies configured in the machine learning model is less than or equal to a threshold, using Gaussian process regression, predicting the reward by using the selected hyperparameter combination as the input of the Gaussian process regression, and updating the first policy through the PPO algorithm using the selected hyperparameter combination and the predicted reward.

The predicting may include predicting the reward using the Gaussian process regression and, when the KL distance between the plurality of policies configured in the machine learning model is greater than the threshold, repeating the process of training the machine learning model using another hyperparameter combination selected from the training set.

A hyperparameter optimization system may include: a reinforcement learning unit that updates a machine learning model using a training set generated through a combination of hyperparameters selected by an agent based on reinforcement learning; and a reward prediction unit that predicts a reward for the updated machine learning model using Gaussian process regression.

By using Gaussian process regression to predict the reward function of the reinforcement-learning-based hyperparameter selection process, the evaluation cost of hyperparameter optimization can be reduced.

A machine learning model is trained using a training set generated according to the selected hyperparameters, and its accuracy is verified on a validation set. The hyperparameters and the accuracy are then used as a training set for Gaussian process regression, and the accuracy can be predicted by Gaussian process regression according to a specific control method.

Rewards can be accurately predicted through Gaussian process regression, and even data that does not follow a normal distribution can be predicted by Gaussian process regression.

FIG. 1 is a flowchart illustrating the operation of a hyperparameter optimization system according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of a hyperparameter optimization system according to an embodiment.
FIG. 3 is a diagram illustrating a hyperparameter optimization method in a hyperparameter optimization system according to an embodiment.

Hereinafter, an embodiment will be described in detail with reference to the accompanying drawings.

The embodiment describes an operation of optimizing hyperparameters based on reinforcement learning and Gaussian process regression. An agent sequentially selects hyperparameters, the combinations of hyperparameters and their rewards are supplied as training data to accelerate the training process, and Gaussian process regression can be used to predict the reward.

FIG. 1 is a flowchart illustrating the operation of a hyperparameter optimization system according to an embodiment.

In the embodiment, hyperparameter selection is abstracted as a sequential decision process. Because the next hyperparameter is selected according to the previous selections, the process can be modeled as a Markov decision process (MDP), and reinforcement learning can therefore be used to select hyperparameters. The agent may be modeled with a long short-term memory (LSTM) network and may select hyperparameters sequentially. Using the accuracy obtained on the validation set as the reward, the next policy may be selected through Proximal Policy Optimization (PPO). In intermediate steps, Gaussian process regression (GPR) may be used to predict the reward. It must therefore be controlled when the true reward is used and when the reward is predicted by Gaussian process regression.

The hyperparameter system may update the machine learning model using a training set generated through a combination of hyperparameters selected by the agent based on reinforcement learning.

In detail, the hyperparameter system may initialize a plurality of policies including a first policy and a second policy. The hyperparameter system may update the machine learning model by sampling data using the first policy and training the second policy with the sampled data.

First, the first policy and the second policy are initialized, data is sampled using the first policy, and the second policy is trained with the sampled data. The number n of hyperparameters to be optimized may be selected. The agent may be trained with an LSTM.

The hyperparameter system may receive the plurality of hyperparameters selected by the agent. In other words, the agent may select the plurality of hyperparameters one by one. The machine learning model may be trained with a training set generated using the selected hyperparameters. The accuracy of the trained machine learning model (the evaluation of its performance) may be obtained using the validation set, and this validation accuracy may be used as the reward. The first policy may be updated through the PPO algorithm using the hyperparameter combination and the reward. The hyperparameter combination and the reward may also be used as a training set for Gaussian prediction; this training set (hyperparameter combinations and rewards) may be stored.
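One round of the real evaluation described above might look like the following minimal sketch. The names `select_hyperparameters`, `train_and_validate`, and `gp_training_set` are hypothetical; the LSTM agent and the actual model training are replaced by simple stand-ins, and the PPO update of the policy is omitted.

```python
import random

gp_training_set = []  # stored (hyperparameter combination, reward) pairs

def select_hyperparameters(rng, n):
    """Stand-in for the LSTM agent: picks n hyperparameters one at a time."""
    combination = []
    for _ in range(n):
        combination.append(rng.uniform(0.0, 1.0))  # each value lies in [0, 1]
    return tuple(combination)

def train_and_validate(combination):
    """Stand-in for training the model and measuring validation accuracy."""
    penalty = sum((value - 0.5) ** 2 for value in combination) / len(combination)
    return 1.0 - penalty

def real_evaluation_round(rng, n):
    """Select a combination, measure its reward, and store the pair."""
    combination = select_hyperparameters(rng, n)
    reward = train_and_validate(combination)       # validation accuracy as reward
    gp_training_set.append((combination, reward))  # kept for Gaussian process regression
    return combination, reward
```

Every stored pair costs one full training run, which is why the accumulated pairs are worth reusing as regression data.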

The PPO method can be defined as follows.

L can be defined as follows.
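The formulas referred to here are not reproduced in this text. Purely as a point of reference, the sketch below computes the clipped surrogate objective that is widely used with PPO. It is an assumption that this variant, rather than the KL-penalized variant, corresponds to the definition intended here, and the function name, argument names, and `clip_eps` default are illustrative.

```python
import math

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies assign the same probabilities, the ratio is 1 and the objective reduces to the mean advantage; clipping only takes effect once the policies diverge.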

The hyperparameter system may use Gaussian process regression when the KL distance between the last two policies is less than or equal to a threshold. The agent still selects the plurality of hyperparameters one by one, the selected hyperparameter combination is used as the input of the Gaussian process regression to predict the reward, and the first policy is then updated through the PPO algorithm using the hyperparameter combination and the predicted reward. The agent can then select hyperparameters again. The distance between the policies before and after the model is used may be measured in the action space.
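The threshold test on the KL distance might be sketched as follows, assuming each policy is represented by a discrete probability distribution over the same set of actions and that the second distribution is nonzero wherever the first is. The function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def use_gaussian_process(previous_policy, current_policy, threshold):
    """True when the last two policies are close enough to rely on predicted rewards."""
    return kl_divergence(previous_policy, current_policy) <= threshold
```

For example, two identical policies have a KL distance of zero and pass the test for any non-negative threshold, whereas `[0.9, 0.1]` against `[0.1, 0.9]` yields a distance of about 1.76.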

The hyperparameter system may predict the reward by Gaussian process regression. Assuming that the agent has selected an m-dimensional hyperparameter combination, the training data for the Gaussian process prediction can be expressed as follows.

Because observations are usually noisy, each observation y can be modeled as the latent function f(x) plus Gaussian noise. That is, y = f(x) + ε, where ε is zero-mean Gaussian noise.

f(x) may be assumed to have a Gaussian process prior.

That is, f(x) ~ GP(0, K).

The squared exponential may be chosen as the covariance function k(x, x').

Accordingly, the noise term is added to k(x, x') to obtain the covariance of the noisy observations.

This completes the basic modeling, independently of any other computation. For a new x* (a new hyperparameter combination selected by the agent), the corresponding y* must be obtained from the joint distribution of the observations y and y*.

Given the joint distribution, the conditional distribution p(y*|y) of y* is easily obtained. This conditional distribution is also Gaussian.

The mean of this distribution can be used as the estimate of y*.
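The expressions for this derivation are not reproduced in this text. As a point of reference, the textbook form of Gaussian process regression with a squared exponential covariance is given below. The symbols σ_f (signal scale), ℓ (length scale), and σ_n (noise standard deviation) are assumptions introduced here, as is the vector notation; they are not notation from this document.

```latex
\begin{aligned}
y &= f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \\
k(x, x') &= \sigma_f^2 \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right), \\
\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix}
&\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix}
K + \sigma_n^2 I & \mathbf{k}_* \\
\mathbf{k}_*^{\top} & k(x_*, x_*) + \sigma_n^2
\end{bmatrix}\right), \\
y_* \mid \mathbf{y} &\sim \mathcal{N}\!\left(
\mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y},\;
k(x_*, x_*) + \sigma_n^2 - \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_*
\right).
\end{aligned}
```

Here K is the matrix with entries K_ij = k(x_i, x_j) over the stored hyperparameter combinations, and k_* is the vector with entries k(x_i, x_*). The first argument of the last line is the mean that serves as the estimate of y*.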

Through this procedure, the reward can be predicted.
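The prediction procedure can be sketched in plain Python under the usual textbook formulation of Gaussian process regression. The kernel parameters `signal_var` and `length_scale`, the `noise_var` value, and all function names are assumptions; a production system would more likely rely on a numerical library than on the small linear solver used here.

```python
import math

def squared_exponential(x1, x2, signal_var=1.0, length_scale=1.0):
    """Squared exponential covariance between two hyperparameter combinations."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return signal_var * math.exp(-sq_dist / (2.0 * length_scale ** 2))

def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    z = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][j] * z[j] for j in range(row + 1, n))
        z[row] = (aug[row][n] - tail) / aug[row][row]
    return z

def predict_reward(train_x, train_y, new_x, noise_var=1e-4):
    """Posterior mean and variance of the reward at new_x."""
    n = len(train_x)
    # Covariance of the noisy observations: K + noise_var * I.
    k_noisy = [
        [squared_exponential(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    k_star = [squared_exponential(x, new_x) for x in train_x]
    alpha = solve(k_noisy, train_y)  # (K + noise_var * I)^-1 y
    beta = solve(k_noisy, k_star)    # (K + noise_var * I)^-1 k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = (squared_exponential(new_x, new_x) + noise_var
                - sum(k * b for k, b in zip(k_star, beta)))
    return mean, variance
```

Near a stored combination the predicted mean stays close to the measured reward and the variance is small; far from all stored combinations the mean falls back to the zero prior and the variance approaches the prior variance, which signals that the prediction carries little information there.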

When the KL distance between the last two policies for the machine learning model is greater than the threshold, the error begins to increase. The hyperparameter system may then train the machine learning model using the next hyperparameter combination selected from the training set. In other words, the process described above may be repeated.
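The switch between measured and predicted rewards might be sketched as follows. The function and argument names are hypothetical, and falling back to a real evaluation while no training pairs have been stored is an added assumption, since a regression needs data before it can predict.

```python
def choose_reward(kl_distance, threshold, combination,
                  evaluate_model, predict_with_gp, gp_training_set):
    """Return (reward, source) for one selected hyperparameter combination."""
    if kl_distance <= threshold and gp_training_set:
        # Policies changed little: use the cheap Gaussian process prediction.
        return predict_with_gp(gp_training_set, combination), "predicted"
    # Policies drifted apart (or no data yet): train and measure validation accuracy.
    reward = evaluate_model(combination)
    gp_training_set.append((combination, reward))
    return reward, "measured"
```

Only measured rewards are appended to the stored pairs in this sketch, so predictions are never fed back into the regression as if they were observations.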

According to an embodiment, by using Gaussian process regression to predict the reward function of the reinforcement-learning-based hyperparameter selection process, the evaluation cost of hyperparameter optimization can be reduced.

According to an embodiment, a machine learning model is trained using a training set generated according to the selected hyperparameters, and its accuracy is verified on a validation set. The hyperparameters and the accuracy are then used as a training set for Gaussian process regression, and the accuracy can be predicted by Gaussian process regression according to a specific control method (e.g., the distance between two policies of the machine learning model).

According to an embodiment, Gaussian process regression is highly advantageous for prediction, and even data that does not follow a normal distribution can be predicted by Gaussian process regression.

FIG. 2 is a block diagram illustrating the configuration of a hyperparameter optimization system according to an embodiment, and FIG. 3 is a diagram illustrating a hyperparameter optimization method in the hyperparameter optimization system according to an embodiment.

The processor of the hyperparameter optimization system 100 may include a reinforcement learning unit 210 and a reward prediction unit 220. These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the hyperparameter optimization system. The processor and its components may control the hyperparameter optimization system to perform steps 310 to 320 of the hyperparameter optimization method of FIG. 3. The processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program included in a memory.

The processor may load, into the memory, program code stored in a file of a program for the hyperparameter optimization method. For example, when the program is executed in the hyperparameter optimization system, the processor may control the hyperparameter optimization system to load the program code from the file of the program into the memory under the control of the operating system. The reinforcement learning unit 210 and the reward prediction unit 220 may each be a different functional representation of the processor that executes the corresponding portion of the program code loaded into the memory in order to perform the subsequent steps 310 to 320.

In step 310, the reinforcement learning unit 210 may update the machine learning model using a training set generated through a combination of hyperparameters selected by the agent based on reinforcement learning. The reinforcement learning unit 210 may initialize a plurality of policies including a first policy and a second policy of the machine learning model, sample data using the first policy, train the second policy with the sampled data, and receive, at the machine learning model, a plurality of hyperparameters selected by the agent for hyperparameter optimization. The reinforcement learning unit 210 may train the machine learning model using a training set generated according to the combination of the selected plurality of hyperparameters, obtain the accuracy of the trained machine learning model using a validation set, and use the obtained accuracy as the reward. The reinforcement learning unit 210 may update the first policy through the PPO algorithm for the machine learning model using the selected hyperparameter combination and the obtained reward, and may use the selected hyperparameter combination and the obtained reward as a training set for Gaussian prediction.

In step 320, the reward prediction unit 220 may predict the reward for the updated machine learning model using Gaussian process regression. When the KL distance between the plurality of policies configured in the machine learning model is less than or equal to the threshold, the reward prediction unit 220 may use Gaussian process regression, predict the reward by using the selected hyperparameter combination as the input of the Gaussian process regression, and update the first policy through the PPO algorithm using the selected hyperparameter combination and the predicted reward. The reward prediction unit 220 may predict the reward using Gaussian process regression and, when the KL distance between the plurality of policies configured in the machine learning model is greater than the threshold, repeat the process of training the machine learning model using another hyperparameter combination selected from the training set.

이상에서 설명된 장치는 하드웨어 구성요소, 소프트웨어 구성요소, 및/또는 하드웨어 구성요소 및 소프트웨어 구성요소의 조합으로 구현될 수 있다. 예를 들어, 실시예들에서 설명된 장치 및 구성요소는, 예를 들어, 프로세서, 콘트롤러, ALU(arithmetic logic unit), 디지털 신호 프로세서(digital signal processor), 마이크로컴퓨터, FPGA(field programmable gate array), PLU(programmable logic unit), 마이크로프로세서, 또는 명령(instruction)을 실행하고 응답할 수 있는 다른 어떠한 장치와 같이, 하나 이상의 범용 컴퓨터 또는 특수 목적 컴퓨터를 이용하여 구현될 수 있다. 처리 장치는 운영 체제(OS) 및 상기 운영 체제 상에서 수행되는 하나 이상의 소프트웨어 애플리케이션을 수행할 수 있다. 또한, 처리 장치는 소프트웨어의 실행에 응답하여, 데이터를 접근, 저장, 조작, 처리 및 생성할 수도 있다. 이해의 편의를 위하여, 처리 장치는 하나가 사용되는 것으로 설명된 경우도 있지만, 해당 기술분야에서 통상의 지식을 가진 자는, 처리 장치가 복수 개의 처리 요소(processing element) 및/또는 복수 유형의 처리 요소를 포함할 수 있음을 알 수 있다. 예를 들어, 처리 장치는 복수 개의 프로세서 또는 하나의 프로세서 및 하나의 콘트롤러를 포함할 수 있다. 또한, 병렬 프로세서(parallel processor)와 같은, 다른 처리 구성(processing configuration)도 가능하다.The devices described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, devices and components described in the embodiments may include, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA) , a programmable logic unit (PLU), microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications running on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of software. For convenience of understanding, there are cases in which one processing device is used, but those skilled in the art will understand that the processing device includes a plurality of processing elements and/or a plurality of types of processing elements. It can be seen that it can include. For example, a processing device may include a plurality of processors or a processor and a controller. 
Other processing configurations are also possible, such as parallel processors.

소프트웨어는 컴퓨터 프로그램(computer program), 코드(code), 명령(instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(collectively) 처리 장치를 명령할 수 있다. 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성요소(component), 물리적 장치, 가상 장치(virtual equipment), 컴퓨터 저장 매체 또는 장치에 구체화(embody)될 수 있다. 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.Software may include a computer program, code, instructions, or a combination of one or more of the foregoing, which configures a processing device to operate as desired or processes independently or collectively. The device can be commanded. Software and/or data may be any tangible machine, component, physical device, virtual equipment, computer storage medium or device, intended to be interpreted by or provide instructions or data to a processing device. can be embodied in Software may be distributed on networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer readable media.

실시예에 따른 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다. 상기 컴퓨터 판독 가능 매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 상기 매체에 기록되는 프로그램 명령은 실시예를 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 당업자에게 공지되어 사용 가능한 것일 수도 있다. 컴퓨터 판독 가능 기록 매체의 예에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(magnetic media), CD-ROM, DVD와 같은 광기록 매체(optical media), 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media), 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. 프로그램 명령의 예에는 컴파일러에 의해 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드를 포함한다. The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium. The computer readable medium may include program instructions, data files, data structures, etc. alone or in combination. Program commands recorded on the medium may be specially designed and configured for the embodiment or may be known and usable to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks and magnetic tapes, optical media such as CD-ROMs and DVDs, and magnetic media such as floptical disks. - includes hardware devices specially configured to store and execute program instructions, such as magneto-optical media, and ROM, RAM, flash memory, and the like. Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes such as those produced by a compiler.

Although the embodiments have been described above with reference to a limited set of examples and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A hyperparameter optimization method performed by a hyperparameter optimization system, the method comprising:
updating a machine learning model using a training set generated through a combination of hyperparameters selected by a reinforcement learning-based agent; and
predicting a reward for the updated machine learning model using Gaussian process regression.

The method of claim 1, wherein the updating step comprises:
initializing a plurality of policies of the machine learning model, including a first policy and a second policy, sampling data using the first policy, training the second policy with the sampled data, and receiving, from the agent, a selection of a plurality of hyperparameters for hyperparameter optimization of the machine learning model.

The method of claim 2, wherein the updating step comprises:
training the machine learning model using a training set generated according to the combination of the selected plurality of hyperparameters, obtaining an accuracy of the trained machine learning model using a validation set, and using the obtained accuracy as a reward.
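Illustrative note (not part of the claims): the step of using validation accuracy as the reward can be sketched as follows. This is a minimal sketch under stated assumptions: the toy 1-D k-nearest-neighbour classifier, the single hyperparameter `k`, the data, and the names `knn_predict` and `reward_for` are all hypothetical and do not come from the patent.

```python
from collections import Counter

def knn_predict(train_set, x, k):
    """Classify x by majority vote among its k nearest 1-D training points."""
    neighbours = sorted(train_set, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def reward_for(hparams, train_set, validation_set):
    """Return the validation accuracy of the model configured by hparams.

    The accuracy is what the agent would receive as its reward.
    """
    k = hparams["k"]  # hypothetical hyperparameter chosen by the agent
    correct = sum(knn_predict(train_set, x, k) == y for x, y in validation_set)
    return correct / len(validation_set)

# Toy data (assumed): points below 0.5 are class 0, points above are class 1.
train_set = [(0.0, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (1.0, 1)]
validation_set = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]

reward = reward_for({"k": 3}, train_set, validation_set)
print(f"reward (validation accuracy): {reward:.2f}")
```

In a real setting, `reward_for` would wrap a full training run, which is the expensive evaluation that the reward prediction step is meant to reduce.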

The method of claim 3, wherein the updating step comprises:
updating the first policy through a PPO algorithm for the machine learning model using the selected combination of hyperparameters and the obtained reward, and using the selected combination of hyperparameters and the obtained reward as a training set for Gaussian prediction.
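Illustrative note (not part of the claims): collecting (hyperparameter combination, reward) pairs as a training set for Gaussian process regression can be sketched as below. This is a minimal pure-Python sketch, not the patented implementation. The RBF kernel, the noise level, the mean-centring of rewards, the encoding of hyperparameters as numeric vectors, and the names `RewardGP`, `rbf_kernel`, and `solve_linear` are all assumptions made for illustration.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two hyperparameter vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

class RewardGP:
    """GP regressor (posterior mean only) from hyperparameter vectors to rewards."""

    def __init__(self, noise=1e-6, length_scale=1.0):
        self.noise = noise
        self.length_scale = length_scale

    def fit(self, hparams, rewards):
        self.hparams = [tuple(h) for h in hparams]
        self.mean = sum(rewards) / len(rewards)  # centre rewards on their mean
        gram = [[rbf_kernel(a, b, self.length_scale) + (self.noise if i == j else 0.0)
                 for j, b in enumerate(self.hparams)]
                for i, a in enumerate(self.hparams)]
        self.alpha = solve_linear(gram, [r - self.mean for r in rewards])
        return self

    def predict(self, query):
        """Posterior mean reward for an unseen hyperparameter vector."""
        k_star = [rbf_kernel(query, h, self.length_scale) for h in self.hparams]
        return self.mean + sum(k * a for k, a in zip(k_star, self.alpha))

# Assumed encoding: (log10 of learning rate, normalised batch size).
observed_hparams = [(-3.0, 0.25), (-2.0, 0.5), (-1.0, 1.0)]
observed_rewards = [0.80, 0.92, 0.65]  # made-up validation accuracies

gp = RewardGP().fit(observed_hparams, observed_rewards)
predicted = gp.predict((-2.5, 0.4))
print(f"predicted reward: {predicted:.3f}")
```

A production system would more likely rely on an established GP library and would also use the predictive variance; the sketch keeps only the posterior mean to stay short.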

The method of claim 1, wherein the predicting step comprises:
using Gaussian process regression when the KL distances of the plurality of policies configured in the machine learning model are less than or equal to a threshold, predicting a reward by using the selected combination of hyperparameters as an input to the Gaussian process regression, and updating a first policy through a PPO algorithm using the selected combination of hyperparameters and the predicted reward.

The method of claim 5, wherein the predicting step comprises:
predicting a reward using the Gaussian process regression and, when the KL distances of the plurality of policies configured in the machine learning model are greater than the threshold, repeating the step of training the machine learning model using another combination of hyperparameters selected from the training set.
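Illustrative note (not part of the claims): the threshold test on the KL distance can be sketched as a simple gate that decides whether the reward comes from the Gaussian process prediction or from an actual training run. This is a minimal sketch under stated assumptions: the policies are treated as discrete probability distributions over the same actions, the KL distance is taken between a first and a second policy, and the names `kl_divergence`, `choose_reward`, `predict_reward`, and `evaluate_reward` are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def choose_reward(first_policy, second_policy, hparams,
                  predict_reward, evaluate_reward, threshold):
    """Return (reward, source) according to the KL threshold test.

    At or below the threshold, the cheap surrogate prediction is used;
    above it, the model is actually trained and evaluated.
    """
    if kl_divergence(first_policy, second_policy) <= threshold:
        return predict_reward(hparams), "predicted"
    return evaluate_reward(hparams), "evaluated"

# Stand-ins for the GP surrogate and for a real training run (both assumed).
predict_reward = lambda hparams: 0.90
evaluate_reward = lambda hparams: 0.88

hparams = {"learning_rate": 0.01}
close = choose_reward([0.5, 0.5], [0.5, 0.5], hparams,
                      predict_reward, evaluate_reward, threshold=0.01)
far = choose_reward([0.9, 0.1], [0.1, 0.9], hparams,
                    predict_reward, evaluate_reward, threshold=0.01)
print(close, far)
```

The threshold value of 0.01 is arbitrary here; the source does not state a specific value in this section.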

A hyperparameter optimization system comprising:
a reinforcement learning unit that updates a machine learning model using a training set generated through a combination of hyperparameters selected by a reinforcement learning-based agent; and
a reward prediction unit that predicts a reward for the updated machine learning model using Gaussian process regression.