KR20230108598A - Method and system for compensating missing value in real-time collected data based on multi-time series prediction model

Info

Publication number: KR20230108598A
Application number: KR1020220004211A
Authority: KR
Inventors: 이준기; 고석갑; 김낙우; 손승철; 이병탁
Original assignee: 한국전자통신연구원
Priority date: 2022-01-11
Filing date: 2022-01-11
Publication date: 2023-07-18

Abstract

A method for correcting missing values in real-time collected data based on a multi-time series prediction model is provided. The method includes the steps of: determining whether there are missing values in predetermined input time series data; reading previously stored time series data when a missing value exists as a result of the determination; outputting a predicted value of the read time series data through an optimal prediction model selected according to predetermined conditions among a plurality of time series prediction models; and correcting and storing missing values based on the output predicted value.

Description

METHOD AND SYSTEM FOR COMPENSATING MISSING VALUE IN REAL-TIME COLLECTED DATA BASED ON MULTI-TIME SERIES PREDICTION MODEL

The present invention relates to a method and system for correcting missing values in real-time collected data based on a multi-time series prediction model.

In an environment where real-time data is collected periodically, data may fail to be collected because of a communication error, a sensor malfunction, or similar causes. Such missing values degrade performance when the data is used in other fields such as machine learning. To prevent this, many algorithms for correcting missing values in time series data have been studied.

Existing methods for handling missing values in time series data include averaging the preceding and following data points and K-NN-based missing value correction. These techniques may perform well on a particular data set but, conversely, may handle missing values poorly on others.

To address this, missing value processing methods using deep learning also exist, but they require a large amount of computation and a large amount of training data.

Korean Laid-Open Patent Publication No. 10-2020-0030303 (2020.03.20)

The problem to be solved by the present invention is to provide a method and system for correcting missing values in real-time collected data based on a multi-time series prediction model, in which, in an environment where time series data is collected, a plurality of prediction models capable of predicting that time series data are built and any missing values that occur are corrected using the optimal prediction model.

However, the problem to be solved by the present invention is not limited to the problem described above, and other problems may exist.

A method for correcting missing values in real-time collected data based on a multi-time series prediction model according to a first aspect of the present invention for solving the above problem includes: determining whether input time series data contains a missing value; reading previously stored time series data when the determination indicates that a missing value exists; outputting a predicted value for the read time series data through an optimal prediction model selected, according to a predetermined condition, from among a plurality of time series prediction models; and correcting the missing value based on the output predicted value and storing the result.

Some embodiments of the present invention may further include: training a plurality of time series prediction models based on the input time series data when the determination indicates that no missing value exists; comparing each predicted value of the trained time series prediction models with the value of the time series data; and storing, based on the comparison result, the time series prediction model with the smallest error among the plurality of time series prediction models as the optimal prediction model.

In some embodiments of the present invention, the step of training a plurality of time series prediction models based on the input time series data may train, in parallel, a plurality of time series prediction models that predict future values of the input time series data.

Some embodiments of the present invention may further include updating the weights of the plurality of time series prediction models based on the error obtained from the comparison result.

In some embodiments of the present invention, the step of storing the optimal prediction model may be repeated each time time series data is input in real time.

Some embodiments of the present invention may further include adding the input time series data to previously stored time series data, and the step of training a plurality of time series prediction models based on the input time series data may train the plurality of time series prediction models based on the time series data after the addition.

A computer program according to another aspect of the present invention for solving the above problem is combined with a computer, which is hardware, to execute the method for correcting missing values in real-time collected data based on a multi-time series prediction model, and is stored in a computer-readable recording medium.

Other specific details of the present invention are included in the detailed description and the drawings.

According to the embodiment of the present invention described above, machine learning models based on a plurality of time series prediction algorithms can be used to build a machine learning model with optimal missing value processing performance for different time series data.

In addition, missing values that occur in data collected in real time can be expected to be corrected in real time.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a flowchart of a missing value correction method according to an embodiment of the present invention.
FIG. 2 is a diagram showing input and output values of a time series prediction model in an embodiment of the present invention.
FIG. 3 is a block diagram of a system for correcting missing values in real-time collected data based on a multi-time series prediction model according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully inform those skilled in the art to which the present invention belongs of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those recited. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the recited elements and every combination of one or more of them. Although "first", "second", and so on are used to describe various elements, these elements are of course not limited by these terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those skilled in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

The present invention relates to a method and system (100) for correcting missing values in real-time collected data based on a multi-time series prediction model.

In the prior art, missing values could not be compensated in real time, and because the machine learning algorithm that best predicts the data differs by data type and situation, an optimal model could not be applied.

To solve the problems of conventional missing value correction algorithms, the present invention uses several time series prediction algorithms to build a machine learning model based on the algorithm best suited to the type of time series data, corrects missing values as they occur in real time, and continuously updates the weights of the prediction models.

In addition, the present invention applies an online learning technique that overcomes the drawback of batch-based learning, which reflects only the characteristics of an existing training data set, and can thereby predict missing values in time series data collected in real time with high performance.

Such an embodiment of the present invention can perform high-performance missing value correction with light computing resources, without using computation-heavy techniques such as neural network models.

FIG. 1 is a flowchart of a missing value correction method according to an embodiment of the present invention.

Referring to FIG. 1, predetermined time series data is first received as input (S110). For example, the time series data may be periodically collected data such as photovoltaic power generation data or highway traffic volume data.

The input time series data may or may not contain missing values.

Next, it is determined whether the input time series data contains a missing value (S115).
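The check in S115 can be sketched as below. The specification does not say how a missing sample is represented, so treating `None` and NaN as missing is an assumption made only for illustration, and the function name `is_missing` is hypothetical.

```python
import math
from typing import Optional


def is_missing(value: Optional[float]) -> bool:
    """Return True when a sample should be treated as missing (S115).

    Assumption: a missing sample arrives either as None or as NaN.
    """
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)
```

A collector that encodes gaps differently (for example with a sentinel such as -9999) would need its own test in place of this one.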

If the determination indicates that no missing value exists, the input time series data is appended to the previously stored time series data (S120).

Next, a plurality of time series prediction models are trained based on the input time series data (S125). If previously stored time series data exists and the newly input data has been appended to it, training may be performed on the time series data after the addition. Online-learning-based algorithms such as Online ARIMA and an online Kalman filter may be used as the learning algorithms for the time series prediction models.

Each learning algorithm may be applied to the corresponding type of time series prediction model, and the plurality of time series prediction models may be trained in parallel.
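As a rough illustration of parallel training, the sketch below updates several models concurrently with a thread pool. The two model classes are deliberately trivial, hypothetical stand-ins (they are not Online ARIMA or a Kalman filter), and the `predict`/`update` interface is an assumption rather than anything the specification defines.

```python
from concurrent.futures import ThreadPoolExecutor


class LastValueModel:
    """Hypothetical stand-in model: forecasts the most recent observation."""

    def __init__(self):
        self.last = 0.0

    def predict(self):
        return self.last

    def update(self, y):
        self.last = y


class RunningMeanModel:
    """Hypothetical stand-in model: forecasts the mean of all observations."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def update(self, y):
        self.total += y
        self.count += 1


def update_all(models, y):
    """Feed one new observation to every model concurrently (S125)."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # list() forces completion and surfaces any exception from a worker.
        list(pool.map(lambda m: m.update(y), models))
```

For pure-Python models like these, threads bring no real speedup; a process pool or a long-lived executor may suit heavier models better. The point here is only that each model is updated independently of the others.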

When time series data without missing values is input, each predicted value of the time series prediction models trained with the various learning algorithms is compared with the value of the actually input time series data (S130).

In this process, the weights set in the plurality of time series prediction models may be updated based on the error between the predicted value and the measured value (S135).
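The specification does not give the update rule, so the following is only one possible reading of S135: an autoregressive model whose weights take a gradient step on the squared prediction error each time a real sample arrives. It is a simplified stand-in, not Online ARIMA itself, and the class name `OnlineAR`, the order, and the learning rate are all assumptions.

```python
class OnlineAR:
    """Hypothetical online autoregressive model trained by gradient descent."""

    def __init__(self, order=2, lr=0.01):
        self.order = order
        self.lr = lr
        self.w = [0.0] * order   # one weight per lagged observation
        self.history = []        # the most recent `order` observations

    def predict(self):
        # Until enough history exists, fall back to the last observation.
        if len(self.history) < self.order:
            return self.history[-1] if self.history else 0.0
        return sum(w * x for w, x in zip(self.w, self.history))

    def update(self, y):
        """Update weights from the error between forecast and measured value."""
        if len(self.history) == self.order:
            err = self.predict() - y
            # Gradient of 0.5 * err**2 with respect to w[i] is err * x[i].
            self.w = [w - self.lr * err * x
                      for w, x in zip(self.w, self.history)]
        self.history = (self.history + [y])[-self.order:]
```

On a constant series the weights converge so that the forecast matches the series value, which is a quick sanity check that the error-driven update moves in the right direction.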

Next, based on the comparison result, the time series prediction model with the smallest error among the plurality of time series prediction models is stored as the optimal prediction model (S140).
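The selection in S140 can be sketched as picking the model whose forecast was closest to the measured value. Using the absolute error of the latest step is an assumption; the specification only says "smallest error" and could equally mean an accumulated or windowed error. The function name `select_best` is hypothetical.

```python
from typing import Dict


def select_best(predictions: Dict[str, float], actual: float) -> str:
    """Return the name of the model with the smallest absolute error (S140).

    `predictions` maps a model name to that model's forecast for this step.
    Ties go to the model listed first.
    """
    errors = {name: abs(pred - actual) for name, pred in predictions.items()}
    return min(errors, key=errors.get)
```

For instance, `select_best({"arima": 1.0, "kalman": 2.5}, 2.4)` picks `"kalman"` because its forecast is 0.1 away from the measurement while the other is 1.4 away; the model names in that call are placeholders.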

Steps S110 to S140 may be repeated each time time series data is input in real time, and in each iteration the time series prediction model with the smallest error is identified and stored. Because an online learning technique is applied to the time series prediction models, a prediction is made and the weights of each model are updated every time new time series data is input.

When the above process is complete, the input time series data is stored in a predetermined storage space (S145).

While the above process is repeated, if the input time series data contains a missing value, the predicted values of the time series prediction models cannot be compared with an actual value. In this case, the previously stored time series data is read (S150), and a predicted value for the read time series data is output through the optimal prediction model selected, according to a predetermined condition, from among the plurality of time series prediction models (S155). The predetermined condition may be the selection condition used when no missing value exists.

The missing value is then corrected based on the output predicted value and stored (S160, S165).
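Putting both branches of FIG. 1 together, one processing step might look like the sketch below. The two model classes are trivial hypothetical stand-ins, and `process_sample`, the `state` dictionary, and the `store` list are names invented for illustration. One further assumption is that the models also advance on an imputed value so that their internal state stays aligned with the stored series; the specification does not say whether they do.

```python
import math


class LastValue:
    """Hypothetical stand-in model: forecasts the most recent value."""

    def __init__(self):
        self.last = 0.0

    def predict(self):
        return self.last

    def update(self, y):
        self.last = y


class RunningMean:
    """Hypothetical stand-in model: forecasts the mean of all values so far."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def update(self, y):
        self.total += y
        self.count += 1


def process_sample(value, models, state, store):
    """Handle one incoming sample, following the two branches of FIG. 1."""
    missing = value is None or (isinstance(value, float) and math.isnan(value))
    if missing:
        # S150-S160: fill the gap with the stored best model's forecast.
        value = models[state["best"]].predict()
    else:
        # S130-S140: compare forecasts with the measurement, keep the best.
        errors = {n: abs(m.predict() - value) for n, m in models.items()}
        state["best"] = min(errors, key=errors.get)
    for m in models.values():
        m.update(value)      # S125/S135 (assumed to run on imputed values too)
    store.append(value)      # S145/S165
    return value
```

Feeding the samples 10.0, 12.0, 14.0 and then a gap leaves the last-value model selected, so the gap is filled with 14.0 and the stored series contains no missing entries.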

In this way, in an embodiment of the present invention, each time time series data is input in real time, the plurality of time series prediction models output predicted values in parallel, and the weights of each model can be updated from the difference between its predicted value and the actually input time series data.

In addition, the time series prediction model with the best performance is identified at each data input step, so that the moment a missing value occurs it is corrected using the optimal prediction model stored in the previous step, yielding optimal missing value correction performance.

FIG. 2 is a diagram showing input and output values of a time series prediction model in an embodiment of the present invention.

FIG. 2 shows an example of the result of missing value processing by the method according to the present invention. Missing values occurring in the input time series data are corrected using an online-learning-based time series prediction model, and the corrected time series data is stored, so that the data is saved in a form without missing values.

In the above description, steps S110 to S165 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as needed, and the order of the steps may be changed. The description of the method of FIGS. 1 and 2 for correcting missing values in real-time collected data based on a multi-time series prediction model also applies to FIG. 3.

FIG. 3 is a block diagram of a system (100) for correcting missing values in real-time collected data based on a multi-time series prediction model according to an embodiment of the present invention.

The missing value correction system (100) according to an embodiment of the present invention includes a time series prediction model learning and selection unit (110) and a missing value processing unit (120).

The time series prediction model learning and selection unit (110) receives specific time series data in real time and trains a plurality of time series prediction models based on several learning algorithms. It then compares each predicted value of the trained models with the value of the time series data and, based on the comparison result, stores the model with the smallest error as the optimal prediction model.

The time series prediction model learning and selection unit (110) identifies the optimal prediction model for the time series data every time data is collected, so that the model can be applied to missing value correction when a missing value later occurs in the time series data.

In addition, each time time series data is input, the time series prediction model learning and selection unit (110) may make predictions with the plurality of time series prediction models and update the weights of those models based on the predictions.

The missing value processing unit (120) corrects missing values occurring in the time series data using the optimal prediction model selected by the time series prediction model learning and selection unit (110). At each missing value prediction step, the optimal prediction model with the best predictive performance is applied, and because the weights of the plurality of time series prediction models are updated at every step, prediction performance can improve progressively.
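The division of labour between units 110 and 120 can be sketched as two small classes. Here each forecaster is modelled as a plain function from the stored history to a next-value forecast, which is an assumption chosen to keep the example short; the class and method names are hypothetical and are not taken from the specification.

```python
from typing import Callable, Dict, List

# Assumed interface: stored history in, forecast of the next value out.
Forecaster = Callable[[List[float]], float]


class ModelLearningSelectionUnit:
    """Rough counterpart of unit 110: tracks which forecaster is best."""

    def __init__(self, forecasters: Dict[str, Forecaster]):
        self.forecasters = forecasters
        # Arbitrary default until the first comparison has been made.
        self.best_name = next(iter(forecasters))

    def observe(self, history: List[float], actual: float) -> None:
        """Compare each forecast from `history` with the measured value."""
        if not history:
            return
        errors = {name: abs(f(history) - actual)
                  for name, f in self.forecasters.items()}
        self.best_name = min(errors, key=errors.get)

    def best(self) -> Forecaster:
        return self.forecasters[self.best_name]


class MissingValueProcessingUnit:
    """Rough counterpart of unit 120: fills gaps with the selected model."""

    def __init__(self, selector: ModelLearningSelectionUnit):
        self.selector = selector

    def fill(self, history: List[float]) -> float:
        return self.selector.best()(history)
```

Keeping selection and imputation in separate objects mirrors the block diagram: unit 120 never evaluates models itself and simply asks unit 110 which one is currently best.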

The method for correcting missing values in real-time collected data based on a multi-time series prediction model according to the embodiment of the present invention described above may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware.

So that the computer can read the program and execute the methods implemented as the program, the program may include code written in a computer language such as C, C++, JAVA, Ruby, or machine language that the processor (CPU) of the computer can read through the device interface of the computer. Such code may include functional code related to functions that define the operations needed to execute the methods, and control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. The code may further include memory-reference code indicating the location (address) in the internal or external memory of the computer at which additional information or media needed by the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer and what information or media to transmit and receive during communication.

The storage medium is not a medium that stores data for a short moment, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored in various recording media on various servers that the computer can access, or in various recording media on the user's computer. The medium may also be distributed across computer systems connected by a network, with computer-readable code stored in a distributed manner.

The above description of the present invention is for illustrative purposes, and those skilled in the art to which the present invention belongs will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: missing value correction system
110: time series prediction model learning and selection unit
120: missing value processing unit

Claims

A method, performed by a computer, for correcting missing values in real-time collected data based on a multi-time series prediction model, the method comprising:
determining whether input predetermined time series data contains a missing value;
reading previously stored time series data when the determination indicates that a missing value exists;
outputting a predicted value for the read time series data through an optimal prediction model selected, according to a predetermined condition, from among a plurality of time series prediction models; and
correcting the missing value based on the output predicted value and storing the result.