KR20200125029A

KR20200125029A - Method and apparatus for regression analysis

Info

Publication number: KR20200125029A
Application number: KR1020190048680A
Authority: KR
Inventors: 김철호; 이경채; 김태우; 백옥기; 양은주; 윤찬현
Original assignee: 한국전자통신연구원
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2020-11-04

Abstract

Provided is a method for performing regression analysis on data. The regression analysis method comprises the following steps of: generating a missing value replacement learning model based on training data including a first missing value; replacing a second missing value by using the missing value replacement learning model when the second missing value is included in input data; and predicting target data based on input data in which the second missing value is replaced. Therefore, when performing regression analysis and classification on high-dimensional large scale big data including a large number of missing values, it is possible to prevent performance degradation due to missing values of a gradient boosting based prediction model and ensure low latency and high throughput.

Description

Method and apparatus for regression analysis

The present disclosure relates to a method and apparatus for performing regression analysis on data.

Regression analyzers and classifiers for big data must predict accurately on high-volume, high-dimensional data containing many missing values. Existing regression analyzers and classifiers often discard missing data or handle it inefficiently, and a large amount of missing data degrades the performance of the prediction model.

One embodiment provides a method of performing regression analysis on data.

One embodiment provides an apparatus for performing regression analysis on data.

According to one embodiment, a method of performing regression analysis on data is provided. The regression analysis method includes generating a missing value replacement learning model based on training data including a first missing value, replacing a second missing value using the missing value replacement learning model when the second missing value is included in input data, and predicting target data based on the input data in which the second missing value has been replaced.

Generating the missing value replacement learning model may include, after training the autoencoder, repeating the training while adjusting a hyperparameter of the autoencoder, thereby generating a plurality of candidate missing value replacement learning models.

Generating the missing value replacement learning model may include, after generating the candidate missing value replacement learning models, selecting one missing value replacement learning model from among the plurality of candidates through cross-validation.

Training the autoencoder may include training the autoencoder, using an optimization technique, so as to reduce the cross-entropy between the training data including the first missing value and the reconstructed data.

The optimization technique may be gradient descent.

The hyperparameter may be one of the number of hidden layers, the number of hidden units, and the dropout rate.

According to one embodiment, an apparatus for performing regression analysis on data is provided. The regression analysis apparatus includes a training unit that generates a missing value replacement learning model based on training data including missing values, a missing value replacement unit that, when input data contains a missing value, replaces that missing value with a replacement value using the missing value replacement learning model, and a prediction unit that predicts target data based on the input data including the replacement value.

The training unit may generate the training data and train an autoencoder using the training data.

The training unit may repeat the training while adjusting a hyperparameter of the autoencoder, and may generate a plurality of candidate missing value replacement learning models.

The training unit may select one missing value replacement learning model from among the plurality of candidate missing value replacement learning models through cross-validation.

The training unit may train the autoencoder, using an optimization technique, so as to reduce the cross-entropy between the training data and the reconstructed data.

According to one embodiment, an apparatus for performing regression analysis on data is provided. The regression analysis apparatus includes a processor and a memory, and the processor executes a program stored in the memory to perform: generating a missing value replacement learning model based on training data including a first missing value; replacing a second missing value using the missing value replacement learning model when the second missing value is included in input data; and predicting target data based on the input data in which the second missing value has been replaced.

When regression analysis and classification are performed on high-dimensional, large-scale big data containing many missing values, degradation of a gradient boosting-based prediction model caused by the missing values can be prevented, and low latency and high throughput can be ensured.

The ability to replace missing values in high-dimensional data can be improved.

FIG. 1 is a block diagram of a regression analysis apparatus according to an embodiment.
FIGS. 2 and 3 are flowcharts of a regression analysis method according to an embodiment.
FIGS. 4 and 5 are flowcharts of a regression analysis method according to another embodiment.
FIG. 6 is a flowchart of a method of performing regression analysis using data in which missing values have been replaced.
FIG. 7 is a block diagram illustrating a regression analysis apparatus according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice them. The present invention may, however, be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram of a regression analysis apparatus according to an embodiment. FIGS. 2 and 3 are flowcharts of a regression analysis method according to an embodiment.

Referring to FIGS. 1 and 2, the regression analysis apparatus according to an embodiment includes a training unit 100, a missing value replacement unit 210, and a prediction unit 220.

The training unit 100 generates a missing value replacement learning model based on training data including missing values (S100). Specifically, the training unit 100 generates the training data including missing values (S110). To minimize the effect of the missing values and to extract key features effectively from the data that does not contain missing values, a missing matrix indicating the positions of the missing values may be defined in advance by the designer. The cross-entropy between the input data, which is the training data including the missing matrix, and the output data, which is the reconstructed data, may be defined in advance by the designer as the training cost function (C), as in Equation 1.

x denotes the input data, z denotes the input data as reconstructed by the autoencoder 212, and m denotes a missing matrix with the same dimensions as the input data. An element of m is 0 where the input data has a missing value and 1 where it does not.
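Equation 1 itself is not reproduced in the text. As a minimal sketch, one common masked form consistent with the definitions of x, z, and m is $C = -\sum_i m_i\,[x_i \log z_i + (1 - x_i)\log(1 - z_i)]$. This form is an assumption rather than the patent's stated equation, and it presumes features scaled to the range [0, 1]. The function name `masked_cross_entropy` and the clamping constant `eps` are illustrative.

```python
import math

def masked_cross_entropy(x, z, m, eps=1e-12):
    """Cross-entropy between input x and reconstruction z, counted only where m == 1.

    Assumed form (not taken from the patent text):
        C = -sum_i m_i * (x_i * log z_i + (1 - x_i) * log(1 - z_i))
    """
    total = 0.0
    for xi, zi, mi in zip(x, z, m):
        zi = min(max(zi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= mi * (xi * math.log(zi) + (1.0 - xi) * math.log(1.0 - zi))
    return total

x = [0.9, 0.0, 0.2]   # the 0.0 at index 1 is only a placeholder for a missing entry
m = [1, 0, 1]         # 0 marks the missing position, 1 marks observed positions
z = [0.8, 0.5, 0.3]   # a made-up reconstruction
cost = masked_cross_entropy(x, z, m)
```

Because the masked position is multiplied by zero, whatever the autoencoder outputs there does not affect the cost, so training is driven only by observed entries.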

The training unit 100 trains the autoencoder 212 using the training data (S120). Specifically, in one embodiment, the training unit 100 trains the autoencoder 212 with an optimization technique such as gradient descent so that the cost function (C) of Equation 1 is minimized. That is, the training unit 100 may train the autoencoder 212, using the optimization technique, so as to reduce the cross-entropy between the training data including missing values and the reconstructed data.

The training unit 100 repeats the training while adjusting a hyperparameter of the autoencoder 212, and generates a plurality of candidate missing value replacement learning models (S130). Specifically, the training unit 100 may repeatedly train the autoencoder 212 while adjusting its hyperparameters so that the cross-validation cost is minimized. Here, in one embodiment, the hyperparameter may be one of the number of hidden layers, the number of hidden units, the dropout rate, and the early termination condition. In one embodiment, cross-validation data may be used to determine the optimal hyperparameters.
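The candidate set of hyperparameters can be enumerated as a simple grid. The sketch below is illustrative only: the key names and all listed values are assumptions chosen for the example and do not come from the patent.

```python
from itertools import product

# Hypothetical search space over the hyperparameters named in the text.
grid = {
    "n_hidden_layers": [1, 2, 3],
    "n_hidden_units": [32, 64],
    "dropout_rate": [0.0, 0.2],
}

# One dict per combination; each would configure one candidate autoencoder.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
```

Each entry of `candidates` corresponds to one training run, and therefore to one candidate missing value replacement learning model.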

The training unit 100 selects one missing value replacement learning model from among the plurality of candidate missing value replacement learning models through cross-validation (S140). Specifically, in one embodiment, the training unit 100 may use an existing cross-validation technique such as k-fold cross-validation to select the candidate that shows the best performance.
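A minimal sketch of k-fold selection follows. The helper names (`k_fold_indices`, `select_best`, `toy_fit_and_score`) are hypothetical, and the scoring function is a stand-in: it scales the training mean by a candidate factor instead of training an autoencoder, purely so the example runs on its own.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def select_best(candidates, data, k, fit_and_score):
    """Return the candidate with the lowest mean validation cost over k folds."""
    folds = k_fold_indices(len(data), k)
    best, best_cost = None, float("inf")
    for cand in candidates:
        costs = []
        for val_idx in folds:
            held_out = set(val_idx)
            train = [row for i, row in enumerate(data) if i not in held_out]
            val = [data[i] for i in val_idx]
            costs.append(fit_and_score(cand, train, val))
        mean_cost = sum(costs) / k
        if mean_cost < best_cost:
            best, best_cost = cand, mean_cost
    return best, best_cost

def toy_fit_and_score(scale, train, val):
    """Stand-in model: predict scale * mean(train); score by mean squared error."""
    pred = scale * sum(train) / len(train)
    return sum((v - pred) ** 2 for v in val) / len(val)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best, best_cost = select_best([0.5, 1.0, 1.5], data, k=3, fit_and_score=toy_fit_and_score)
```

In the setting described in the text, `fit_and_score` would train a candidate autoencoder on the training folds and return its reconstruction cost on the held-out fold.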

When input data contains a missing value, the missing value replacement unit 210 replaces the missing value with a replacement value using the missing value replacement learning model (S200). Specifically, when data containing missing values is input from the data generator 300 to the missing value replacement unit 210, the missing value replacement unit 210 may replace the missing values using the missing value replacement learning model generated by the training unit 100. In one embodiment, the missing value replacement unit 210 may include the autoencoder 212.
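The replacement step can be sketched as merging observed entries with reconstructed ones: observed positions keep their values, and missing positions take the model output. In the sketch below, `reconstruct` is a hypothetical stand-in for a trained autoencoder, missing entries are represented as `None`, and the placeholder value fed to the model is an assumption.

```python
def impute(row, reconstruct, placeholder=0.0):
    """Fill missing entries (None) of row with the model's reconstruction."""
    m = [0 if v is None else 1 for v in row]            # missing matrix: 0 = missing
    x = [placeholder if v is None else v for v in row]  # model input with placeholders
    z = reconstruct(x)                                  # reconstructed input
    # Keep observed values, take the reconstruction only where m == 0.
    return [mi * xi + (1 - mi) * zi for mi, xi, zi in zip(m, x, z)]

def stand_in_model(x):
    """Stand-in for a trained autoencoder: returns 0.5 for every feature."""
    return [0.5] * len(x)

filled = impute([0.9, None, 0.2], stand_in_model)
```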

The data generator 300 generates high-dimensional, large-volume data regularly or irregularly. The data generated by the data generator 300 may include a plurality of missing values. Missing values can arise from various causes, such as accidental loss during data generation and transmission or intentional omission.

From the data generated by the data generator 300, the training unit 100 may group into training data those records whose prediction target value for the regression or classification model has already been identified through some route, and may use this training data to generate the missing value replacement learning model.

The prediction unit 220 predicts target data based on the input data including the replacement value (S300).

FIGS. 4 and 5 are flowcharts of a regression analysis method according to another embodiment.

Referring to FIGS. 1 and 4, a regression analysis method according to another embodiment may include: determining whether a missing value replacement learning model trained by the training unit 100 exists (S410); generating training data including missing values (S420); training the autoencoder 212, using an optimization technique, so as to reduce the cross-entropy between the training data including missing values and the reconstructed data, and generating a plurality of candidate missing value replacement learning models (S430); performing prediction on the data in which the missing values have been replaced (S440); determining whether training has been performed for all hyperparameter candidates of the autoencoder 212 (S450); adjusting the hyperparameters when training has not been performed for all hyperparameter candidates (S460); selecting, when training has been performed for all hyperparameter candidates, one missing value replacement learning model from among the plurality of candidates through cross-validation (S470); receiving data from the data generator 300 (S480); determining whether the input data contains missing values (S490); replacing the missing values, when the input data contains them, using the missing value replacement learning model generated by the training unit 100 (S500); and performing regression analysis on the input data in which the missing values have been replaced (S510).
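The control flow of steps S410 and S480 to S510 can be sketched as follows. The function and argument names are hypothetical. The column-mean imputer and the sum-based regressor are simple stand-ins for the trained autoencoder and the gradient boosting model, used only to keep the example self-contained.

```python
def regression_flow(stream, training_rows, train_imputer, regressor, imputer=None):
    """Train an imputer if none exists, then impute and predict each incoming row."""
    if imputer is None:                          # S410: does a trained model exist?
        imputer = train_imputer(training_rows)   # S420 to S470, collapsed into one call
    predictions = []
    for row in stream:                           # S480: data arrives
        if any(v is None for v in row):          # S490: any missing values?
            row = imputer(row)                   # S500: replace them
        predictions.append(regressor(row))       # S510: regression on complete data
    return predictions

def train_mean_imputer(rows):
    """Stand-in imputer: fill each missing entry with its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return lambda row: [means[j] if v is None else v for j, v in enumerate(row)]

training_rows = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
stream = [[2.0, None], [1.0, 1.0]]
preds = regression_flow(stream, training_rows, train_mean_imputer, regressor=sum)
```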

Referring to FIG. 5, a method of training the autoencoder through the training unit 100 according to an embodiment may include: initializing the parameters of the autoencoder 212 (S431); inputting the training data to the autoencoder 212 in batches (S432); generating a missing matrix corresponding to the input data (S433); performing forward propagation using the preset parameters (S434); calculating the cost function from the input data and the result of the forward propagation (S435); determining whether a preset termination condition (for example, reaching the iteration count or meeting an early termination condition) is satisfied (S436); and, when the termination condition is not satisfied, updating the parameters by backpropagation in the direction that minimizes the cost function (S437). Unlike hyperparameters, the parameters are updated during training and have no candidate set.
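The loop of steps S431 to S437 can be sketched with a very small autoencoder written in plain Python. Everything specific here is an assumption made for illustration: a single hidden layer, sigmoid activations, a batch size of one, a masked cross-entropy cost, the learning rate, and a stop check performed once per pass over the data rather than once per batch.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2):
    """One hidden layer with sigmoid activations; returns hidden h and reconstruction z."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    z = [sigmoid(sum(w * v for w, v in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, z

def train_autoencoder(rows, n_hidden=2, lr=0.5, max_iters=200, tol=1e-4, seed=0):
    rng = random.Random(seed)
    d = len(rows[0])
    # S431: initialise the parameters (weights and biases).
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(d)]
    b2 = [0.0] * d
    history = []
    for _ in range(max_iters):
        cost = 0.0
        for raw in rows:                                  # S432: batch size 1 for brevity
            m = [0 if v is None else 1 for v in raw]      # S433: missing matrix
            x = [0.0 if v is None else v for v in raw]
            h, z = forward(x, W1, b1, W2, b2)             # S434: forward propagation
            cost -= sum(mi * (xi * math.log(zi) + (1 - xi) * math.log(1 - zi))
                        for mi, xi, zi in zip(m, x, z))   # S435: masked cross-entropy
            # S437: backpropagation. For a sigmoid output with cross-entropy,
            # the gradient at the output pre-activation is m * (z - x).
            dz = [mi * (zi - xi) for mi, xi, zi in zip(m, x, z)]
            dh = [hj * (1 - hj) * sum(dz[i] * W2[i][j] for i in range(d))
                  for j, hj in enumerate(h)]
            for i in range(d):
                for j in range(n_hidden):
                    W2[i][j] -= lr * dz[i] * h[j]
                b2[i] -= lr * dz[i]
            for j in range(n_hidden):
                for k in range(d):
                    W1[j][k] -= lr * dh[j] * x[k]
                b1[j] -= lr * dh[j]
        history.append(cost)
        # S436: stop at the iteration limit (loop bound) or when the cost stops improving.
        if len(history) > 1 and history[-2] - history[-1] < tol:
            break
    return (W1, b1, W2, b2), history

rows = [[0.9, 0.2, 0.8], [0.8, None, 0.9], [0.9, 0.1, None], [0.7, 0.2, 0.8]]
params, history = train_autoencoder(rows)
```

The weights and biases in `params` are the parameters that training updates; `n_hidden`, `lr`, `max_iters`, and `tol` play the role of hyperparameters and stay fixed within one run.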

FIG. 6 is a flowchart of a method by which the prediction unit 220 performs regression analysis using data in which missing values have been replaced.

Referring to FIG. 6, the method by which the prediction unit 220 performs regression analysis using data in which missing values have been replaced includes setting input values (S610) and performing an algorithmic operation (S620).

The step of setting the input values (S610) sets N ordered pairs of input data, the number of iterations M, the loss function ψ, and the base learning model.

The step of performing the algorithmic operation (S620) initializes f to an arbitrary constant and repeats processes 1 to 4 from t=1 to M.

Process 1 computes the negative gradient. Process 2 sets a new learning model for the current estimate. Process 3 computes the optimal step coefficient for the gradient descent update using Equation 2. Process 4 updates the predicted value using Equation 3.

f denotes a gradient boosting-based regression analyzer with weak performance, and M denotes the number of iterations.

The training unit 100 can produce a strong regression analyzer by training and updating an analyzer such as f over M iterations so that the cost function ψ is minimized.
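Equations 2 and 3 are not reproduced in the text. In the standard formulation of gradient boosting, which is assumed here, the step coefficient is $\rho_t = \arg\min_\rho \sum_i \psi(y_i, f_{t-1}(x_i) + \rho\, h_t(x_i))$ and the update is $f_t = f_{t-1} + \rho_t h_t$, where $h_t$ is the base learner fitted at iteration t. The sketch below further assumes a squared-error loss, for which the negative gradient is the residual and ρ has a closed form, and a one-feature decision stump as the base learner. All function names are illustrative.

```python
def fit_stump(xs, targets):
    """Least-squares decision stump on one feature: returns (threshold, left, right)."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1:]

def stump_predict(stump, x):
    thr, left, right = stump
    return left if x <= thr else right

def gradient_boost(xs, ys, n_iters):
    f0 = sum(ys) / len(ys)                     # initialise f to a constant
    preds = [f0] * len(ys)
    learners = []
    for _ in range(n_iters):                   # t = 1 .. M
        neg_grad = [y - p for y, p in zip(ys, preds)]   # process 1 (squared-error loss)
        stump = fit_stump(xs, neg_grad)                 # process 2: fit base learner
        h = [stump_predict(stump, x) for x in xs]
        denom = sum(v * v for v in h)
        # Process 3: closed-form line search for squared error.
        rho = sum(g * v for g, v in zip(neg_grad, h)) / denom if denom > 0 else 0.0
        preds = [p + rho * v for p, v in zip(preds, h)]  # process 4: update estimate
        learners.append((rho, stump))
    return f0, learners

def predict(f0, learners, x):
    return f0 + sum(rho * stump_predict(stump, x) for rho, stump in learners)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
f0, learners = gradient_boost(xs, ys, n_iters=5)
```

With a different loss ψ, process 1 would use that loss's gradient and process 3 would typically need a numerical line search instead of the closed form.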

The input data may be data in which missing values have been replaced by the missing value replacement unit 210.

FIG. 7 is a block diagram illustrating a regression analysis apparatus according to an embodiment.

Referring to FIG. 7, the regression analysis apparatus according to an embodiment may be implemented as a computer system, for example, a computer-readable medium. The computer system 700 may include at least one of a processor 710, a memory 730, a user interface input device 760, a user interface output device 770, and a storage device 780, which communicate through a bus 720. The computer system 700 may also include a network interface 790 coupled to a network. The processor 710 may be a central processing unit (CPU) or a semiconductor device that executes instructions stored in the memory 730 or the storage device 780. The memory 730 and the storage device 780 may include various forms of volatile or nonvolatile storage media. For example, the memory may include a read only memory (ROM) 731 and a random access memory (RAM) 732. The embodiments of the present disclosure may be implemented as a computer-implemented method or as a non-transitory computer-readable medium storing computer-executable instructions. In one embodiment, when executed by a processor, the computer-readable instructions may perform a method according to at least one aspect of the present disclosure.

The regression analysis apparatus according to an embodiment includes a processor 710 and a memory 730, and the processor 710 executes a program stored in the memory 730 to perform: generating a missing value replacement learning model based on training data including a first missing value; replacing a second missing value using the missing value replacement learning model when the second missing value is included in input data; and predicting target data based on the input data in which the second missing value has been replaced.

By performing the steps of generating the missing value replacement learning model based on the training data including the first missing value, replacing the second missing value using the missing value replacement learning model when the second missing value is included in the input data, and predicting the target data based on the input data in which the second missing value has been replaced, the regression analysis performance of the processor 710 of the regression analysis apparatus can be improved.

The regression analysis apparatus according to the present invention is a gradient boosting-based regression analysis and classification apparatus. It can build a single strong prediction model by ensembling weak classifiers and analyzers, using decision trees and other techniques, and can train the model through functional gradient descent.

The autoencoder-based missing value replacement method of the regression analysis apparatus according to the present invention can effectively predict the value that an entry had before it went missing, based on the deeply trained autoencoder's multidimensional understanding of data containing missing values. With this method, missing values in big data that contains high-dimensional values and many missing values can be replaced effectively.

The deeply trained denoising autoencoder 212 of the regression analysis apparatus according to the present invention can effectively extract the key features from high-dimensional data. Because a deep neural network learns to be more accurate as the volume of data grows, the deeply trained denoising autoencoder 212 can effectively replace missing values in big data.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within the scope of the present invention.

Claims

A method of performing regression analysis on data, the method comprising:
generating a missing value replacement learning model based on training data including a first missing value;
replacing a second missing value using the missing value replacement learning model when the second missing value is included in input data; and
predicting target data based on the input data in which the second missing value has been replaced.

The method of claim 1,
wherein generating the missing value replacement learning model comprises:
generating the training data including the first missing value; and
training an autoencoder using the training data.

The method of claim 2,
wherein generating the missing value replacement learning model further comprises,
after training the autoencoder,
repeating the training while adjusting a hyperparameter of the autoencoder, and generating a plurality of candidate missing value replacement learning models.

The method of claim 3,
wherein generating the missing value replacement learning model further comprises,
after generating the candidate missing value replacement learning models,
selecting one missing value replacement learning model from among the plurality of candidate missing value replacement learning models through cross-validation.

The method of claim 2,
wherein training the autoencoder comprises
training the autoencoder, using an optimization technique, so as to reduce the cross-entropy between the training data including the first missing value and reconstructed data.

The method of claim 5,
wherein the optimization technique is gradient descent.

The method of claim 3,
wherein the hyperparameter is one of the number of hidden layers, the number of hidden units, and the dropout rate.

An apparatus for performing regression analysis on data, the apparatus comprising:
a training unit that generates a missing value replacement learning model based on training data including missing values;
a missing value replacement unit that, when input data contains a missing value, replaces the missing value with a replacement value using the missing value replacement learning model; and
a prediction unit that predicts target data based on the input data including the replacement value.

The apparatus of claim 8,
wherein the training unit
generates the training data and trains an autoencoder using the training data.

The apparatus of claim 9,
wherein the training unit
repeats the training while adjusting a hyperparameter of the autoencoder, and generates a plurality of candidate missing value replacement learning models.

The apparatus of claim 10,
wherein the training unit
selects one missing value replacement learning model from among the plurality of candidate missing value replacement learning models through cross-validation.

The apparatus of claim 9,
wherein the training unit
trains the autoencoder, using an optimization technique, so as to reduce the cross-entropy between the training data and reconstructed data.

The apparatus of claim 12,
wherein the optimization technique is gradient descent.

The apparatus of claim 10,
wherein the hyperparameter is one of the number of hidden layers, the number of hidden units, and the dropout rate.

An apparatus for performing regression analysis on data, the apparatus comprising
a processor and a memory,
wherein the processor executes a program stored in the memory to perform:
generating a missing value replacement learning model based on training data including a first missing value;
replacing a second missing value using the missing value replacement learning model when the second missing value is included in input data; and
predicting target data based on the input data in which the second missing value has been replaced.