
KR20230109935A - Service providing apparatus and method to support augmentation of data for epidemic analysis

Info

Publication number: KR20230109935A
Application number: KR1020220005784A
Authority: KR
Inventors: 이상민; 강민정; 김준혁
Original assignee: 광운대학교 산학협력단
Priority date: 2022-01-14
Filing date: 2022-01-14
Publication date: 2023-07-21
Also published as: KR102635099B1

Abstract

The present invention relates to a service providing apparatus and method that support augmentation of data for infectious disease analysis. More specifically, it augments the training data needed to train a prediction model that forecasts changes in confirmed cases of an infectious disease, using measured time-series data as the basis, so that enough training data is secured to train the model and its prediction accuracy can be improved. The invention augments original data, which is measured time-series data on the number of confirmed cases collected from an external server, in various ways to generate a large number of separate augmented data; compares each augmented data with the original data for pattern-based similarity; selects the augmented data that are highly similar to the original data; and uses them to train a prediction model composed of a neural network. As a result, even when measured data for a newly spreading infectious disease is too scarce to train a prediction model, the model can be trained on many augmented data that resemble the actual data, which increases the reliability and accuracy of the number of confirmed cases it predicts for a future period.

Description

Service providing apparatus and method to support augmentation of data for epidemic analysis

The present invention relates to a service providing apparatus and method that support augmentation of data for infectious disease analysis. More specifically, it relates to an apparatus and method that augment the training data needed to train a prediction model for forecasting changes in confirmed cases of an infectious disease, using measured time-series data as the basis, so that enough training data is secured to train the prediction model and its prediction accuracy can be improved.

[Description of Government-Supported Research and Development]

This research was supported by the National Research Foundation of Korea, funded by the Korean government in 2021 (No. 2021R1G1A1012888).

Various infectious diseases such as MERS and coronavirus are currently emerging. Because they are highly transmissible, they produce large numbers of confirmed cases and cause considerable social and economic damage.

To minimize this damage, government agencies monitor how the number of confirmed cases changes and establish countermeasures accordingly: when confirmed cases increase, they apply measures that restrict gatherings, and because prolonged restrictions aggravate public discontent, they ease the existing restrictions when confirmed cases decrease.

However, various factors cause the number of confirmed cases to rise and fall repeatedly, and because such countermeasures are established only after the number has already changed, it is difficult to act preemptively before confirmed cases increase. This in turn aggravates social anxiety and discontent.

With the recent development of artificial neural networks, attempts have been made to predict changes in the number of confirmed cases by training prediction models on how such diseases spread. For an infectious disease that has not been spreading for long, however, it is difficult to secure enough data to train the model, so the accuracy of such prediction models is low.

Korean Laid-Open Patent Publication No. 10-2017-0053145

An object of the present invention is to augment, in various ways, time-series data obtained by actually measuring the number of confirmed cases, so that enough training data can be secured to train a model that predicts changes in the number of confirmed cases caused by an outbreak of an infectious disease. A further object is to reflect, during augmentation of the measured data, the patterns of change in the number of confirmed cases analyzed for the factors that cause such changes, so that training data highly similar to the measured data corresponding to each factor can be generated, thereby increasing the reliability and accuracy of the results produced by a prediction model trained on that data.

A service providing apparatus that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention may include: a data collection unit that collects and stores original data, which is time-series data measured for the number of confirmed cases of an infectious disease over time; a data augmentation unit that distorts the original data according to a preset augmentation algorithm to generate a plurality of augmented data and stores the plurality of augmented data; a linear comparison unit that compares the differences between the values of each variable of the augmented data and the original data on the same time axis to calculate an R2 (R-squared) value for the linear similarity between the augmented data and the original data; a nonlinear comparison unit that calculates a DTW value, which is the nonlinear similarity between the augmented data and the original data, through a DTW (Dynamic Time Warping) algorithm; and a data selection unit that, for each of the plurality of augmented data, extracts as candidate data one or more augmented data satisfying a preset condition based on the R2 value and DTW value calculated by the linear comparison unit and the nonlinear comparison unit, sets a threshold for each of the R2 value and the DTW value based on the R2 and DTW values of the extracted candidate data, and selects one or more augmented data whose calculated R2 and DTW values are equal to or greater than the thresholds as training data for training a prediction model that predicts the number of confirmed cases.
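As an illustration of the linear and nonlinear comparison units, the following minimal Python sketch computes an R2 value and a DTW value for one augmented series against the original. It is not taken from the specification: the function names `r_squared` and `dtw_distance`, the absolute-difference cost, and the requirement that both series have equal length for R2 are assumptions. The classic DTW computation yields a distance, so a smaller value indicates a closer match; how the apparatus maps this to a similarity score is not stated here.

```python
from typing import Sequence


def r_squared(original: Sequence[float], augmented: Sequence[float]) -> float:
    """R2 of `augmented` against `original`, compared point by point in time."""
    if len(original) != len(augmented):
        raise ValueError("series must have the same length")  # assumption
    mean = sum(original) / len(original)
    ss_tot = sum((y - mean) ** 2 for y in original)  # total variation
    ss_res = sum((y - a) ** 2 for y, a in zip(original, augmented))  # residual
    if ss_tot == 0:
        raise ValueError("original series is constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance with an absolute-difference cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cumulative-cost table
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]
```

Because DTW allows one point to align with several points of the other series, a series that merely repeats a day, such as `[1, 1, 2, 3]` against `[1, 2, 3]`, has a DTW distance of zero even though a point-by-point comparison would not be possible.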

As an example related to the present invention, the data augmentation unit may distort the original data by performing at least one of cropping, quantizing, and drift, where drift randomly changes at least one of the one or more variable values constituting the time-series data.
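The three augmentation methods named above can be sketched as follows. This is a hedged illustration rather than the apparatus's actual algorithm: the function names, the parameters (`start`, `length`, `step`, `max_step`), the choice of a bounded random walk for drift, and the clamping of values at zero are all assumptions made for the example.

```python
import random
from typing import List, Sequence


def crop(series: Sequence[float], start: int, length: int) -> List[float]:
    """Cropping: keep one contiguous window of the series."""
    if start < 0 or length <= 0 or start + length > len(series):
        raise ValueError("window must lie inside the series")
    return list(series[start:start + length])


def quantize(series: Sequence[float], step: float) -> List[float]:
    """Quantizing: snap every value to the nearest multiple of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [round(v / step) * step for v in series]


def drift(series: Sequence[float], max_step: float,
          rng: random.Random) -> List[float]:
    """Drift (assumed form): add a bounded random walk, clamped at zero."""
    offset = 0.0
    out = []
    for v in series:
        offset += rng.uniform(-max_step, max_step)
        out.append(max(0.0, v + offset))  # case counts cannot be negative
    return out


def augment_many(original: Sequence[float], count: int,
                 seed: int = 0) -> List[List[float]]:
    """Generate `count` augmented series, one randomly chosen method each."""
    if not original:
        raise ValueError("original series is empty")
    rng = random.Random(seed)
    augmented = []
    for _ in range(count):
        method = rng.choice(("crop", "quantize", "drift"))
        if method == "crop":
            length = rng.randint(max(1, len(original) // 2), len(original))
            start = rng.randint(0, len(original) - length)
            augmented.append(crop(original, start, length))
        elif method == "quantize":
            augmented.append(quantize(original, step=rng.choice((5, 10, 20))))
        else:
            augmented.append(drift(original, max_step=3.0, rng=rng))
    return augmented
```

Cropped series are shorter than the original, so a point-by-point comparison on the same time axis would need the matching window of the original; that alignment is left out of this sketch.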

As an example related to the present invention, the data selection unit may extract, as the candidate data, one or more augmented data whose calculated R2 and DTW values are within a preset top α% among the plurality of augmented data.
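A minimal sketch of the top-α% candidate extraction is given here. The function name `top_alpha_candidates` and the input layout (one `(r2, dtw)` pair per augmented series) are hypothetical. The sketch also assumes that the DTW value is a distance, so the "top" α% for DTW means the smallest values; if the DTW value is instead expressed as a similarity, the second sort would be reversed.

```python
import math
from typing import List, Tuple


def top_alpha_candidates(scores: List[Tuple[float, float]],
                         alpha: float) -> List[int]:
    """Indices of series ranked within the best alpha percent on both metrics.

    `scores[i]` is the (R2, DTW distance) pair of the i-th augmented series.
    """
    if not 0 < alpha <= 100:
        raise ValueError("alpha must be in (0, 100]")
    k = max(1, math.ceil(len(scores) * alpha / 100))
    indices = range(len(scores))
    # Higher R2 is better; lower DTW distance is better (assumption).
    best_r2 = sorted(indices, key=lambda i: scores[i][0], reverse=True)[:k]
    best_dtw = sorted(indices, key=lambda i: scores[i][1])[:k]
    return sorted(set(best_r2) & set(best_dtw))
```

Requiring a series to rank well on both metrics means the candidate set can be smaller than α% of the input, which is one possible reading of the condition.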

As an example related to the present invention, the data selection unit may bootstrap the one or more augmented data whose calculated R2 and DTW values are equal to or greater than the thresholds to obtain n augmented data, and may select each of the n augmented data as the training data.
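Bootstrapping in its usual statistical sense means resampling with replacement. Assuming that is the intended meaning, obtaining n augmented data from the selected pool could look like the sketch below; the function name `bootstrap` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence


def bootstrap(selected: Sequence[Sequence[float]], n: int,
              rng: random.Random) -> List[List[float]]:
    """Draw n series from `selected` uniformly at random, with replacement."""
    if not selected:
        raise ValueError("nothing to resample")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Copy each drawn series so later edits do not alias the pool.
    return [list(rng.choice(selected)) for _ in range(n)]
```

Because sampling is with replacement, n may exceed the size of the selected pool, and the same series may appear more than once in the result.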

As an example related to the present invention, the data selection unit may train a preset prediction model on one or more of the training data and the original data so that the model outputs the expected number of confirmed cases for a prediction target period.

As an example related to the present invention, the service providing apparatus may further include a pattern application unit that, in conjunction with the data collection unit, collects one or more original data corresponding to each of a plurality of unit periods obtained by dividing a specific period, designated by user input, at a preset interval, so that the pattern of increase in the number of confirmed cases during the specific period is reflected in the augmented data, and that calculates the increase pattern by averaging the number of confirmed cases by date based on the collected original data. The data augmentation unit may, in conjunction with the pattern application unit, apply the increase pattern to the original data and then distort the result, so that the pattern is applied before the distortion that produces the augmented data.
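The date-wise averaging used to obtain the increase pattern can be illustrated with a short sketch. It assumes that every unit period contains the same number of days and that averaging is done across unit periods at the same day offset; the function name `daily_average_pattern` is hypothetical, and how the resulting pattern is then applied to the original data is not specified in this passage and is therefore not shown.

```python
from typing import List, Sequence


def daily_average_pattern(
        unit_periods: Sequence[Sequence[float]]) -> List[float]:
    """Average confirmed cases across unit periods, day offset by day offset."""
    if not unit_periods:
        raise ValueError("at least one unit period is required")
    length = len(unit_periods[0])
    if any(len(period) != length for period in unit_periods):
        raise ValueError("unit periods must have equal length")  # assumption
    # zip(*...) groups the values that share the same day offset.
    return [sum(day) / len(unit_periods) for day in zip(*unit_periods)]
```

For example, two three-day unit periods `[1, 2, 3]` and `[3, 4, 5]` give the pattern `[2.0, 3.0, 4.0]`.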

As an example related to the present invention, the service providing apparatus may further include an external factor collection unit that, in conjunction with the data collection unit, collects one or more original data from the time when the search volume increases for a keyword preset to correspond to an external factor that increases the number of confirmed cases; sets the amount of increase in the search volume for the keyword as the value of an independent variable applied to a preset Newton Method; sets the change in the number of confirmed cases according to the one or more original data during the period over which the increase in search volume was calculated as another independent variable applied to the Newton Method; and calculates, through the Newton Method, a slope and an acceleration for changing the amount of increase in the number of confirmed cases and the acceleration of its surge. The data augmentation unit may apply the slope and acceleration calculated by the external factor collection unit to the original data and then generate the augmented data based on the original data to which they have been applied, or may apply the slope and acceleration to the augmented data.

As an example related to the present invention, one or more keywords may be preset in the external factor collection unit for each different type of factor, and the external factor collection unit may calculate the slope and acceleration for each keyword.

A service providing method by which a service providing apparatus supports augmentation of data for infectious disease analysis according to an embodiment of the present invention may include: a data collection step of collecting and storing original data, which is time-series data measured for the number of confirmed cases of an infectious disease over time; a data augmentation step of distorting the original data according to a preset augmentation algorithm to generate a plurality of augmented data and storing the plurality of augmented data; a linear comparison step of comparing the differences between the values of each variable of the augmented data and the original data on the same time axis to calculate an R2 value representing the linear similarity between the augmented data and the original data; a nonlinear comparison step of calculating a DTW value, which is the nonlinear similarity between the augmented data and the original data, through a DTW (Dynamic Time Warping) algorithm; and a data selection step of extracting as candidate data one or more augmented data satisfying a preset condition based on the R2 value and DTW value calculated for each of the plurality of augmented data, setting a threshold for each of the R2 value and the DTW value based on the R2 and DTW values of the extracted candidate data, and selecting one or more augmented data whose calculated R2 and DTW values are equal to or greater than the thresholds as training data for training a prediction model that predicts the number of confirmed cases.

As an example related to the present invention, the data augmentation step may distort the original data by performing at least one of cropping, quantizing, and drift, where drift randomly changes at least one of the one or more variable values constituting the time-series data.

As an example related to the present invention, the data selection step may extract, as the candidate data, one or more augmented data whose calculated R2 and DTW values are within a preset top α% among the plurality of augmented data.

As an example related to the present invention, the data selection step may bootstrap the one or more augmented data whose calculated R2 and DTW values are equal to or greater than the thresholds to obtain n augmented data, and may select each of the n augmented data as the training data.

As an example related to the present invention, the data selection step may train a preset prediction model on one or more of the training data and the original data so that the model outputs the expected number of confirmed cases for a prediction target period.

As an example related to the present invention, the data collection step may further include collecting one or more original data corresponding to each of a plurality of unit periods obtained by dividing a specific period, designated by user input, at a preset interval, so that the pattern of increase in the number of confirmed cases during the specific period is reflected in the augmented data. The data augmentation step may calculate the increase pattern by averaging the number of confirmed cases by date based on the one or more original data collected for the specific period, apply the increase pattern to the original data, and then distort the result, so that the pattern is applied before the distortion that produces the augmented data.

As an example related to the present invention, the data collection step may further include collecting one or more original data belonging to a period of interest in which the search volume increases for a keyword preset to correspond to an external factor that increases the number of confirmed cases. The data augmentation step may set the amount of increase in the search volume for the keyword as the value of an independent variable applied to a preset Newton Method; set the change in the number of confirmed cases according to the one or more original data during the period of interest over which the increase in search volume was calculated as another independent variable applied to the Newton Method; calculate, through the Newton Method, a slope and an acceleration for changing the amount of increase in the number of confirmed cases and the acceleration of its surge; and either apply the calculated slope and acceleration to the original data and then generate the augmented data based on the original data to which they have been applied, or apply the slope and acceleration to the augmented data.

As an example related to the present invention, one or more keywords may be preset in the service providing apparatus for each different type of factor, and the data augmentation step may calculate the slope and acceleration for each keyword.

The present invention augments original data, which is measured time-series data on the number of confirmed cases of an infectious disease collected from an external server, in various ways to generate a large number of separate augmented data; compares each augmented data with the original data for pattern-based similarity; selects the augmented data that are highly similar to the original data; and uses them to train a prediction model composed of a neural network. As a result, even when measured data for a newly spreading infectious disease is too scarce to train a prediction model, the model can be trained on many augmented data that resemble the actual data, which increases the reliability and accuracy of the number of confirmed cases it predicts for a future period.

In addition, in the process of generating the augmented data used to train the prediction model, the present invention calculates patterns arising from seasonal elements, or slopes and accelerations associated with changes in the number of confirmed cases caused by various external factors that increase confirmed cases, and reflects them in the augmented data. From the augmented data reflecting these seasonal elements or external factors, it selects data similar to the original data for use in training the prediction model. The influence of seasonal elements and external factors is thereby reflected, which helps the model predict changes in the number of confirmed cases during the prediction target period accurately.

FIG. 1 is a configuration environment diagram of a service providing apparatus that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention.
FIG. 2 is a block diagram of the service providing apparatus that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example operation of the service providing apparatus that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention.
FIGS. 4 and 5 are diagrams illustrating examples of the seasonal increase pattern reflected in the augmented data generation process of the service providing apparatus that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention.

Hereinafter, detailed embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a configuration environment diagram of a service providing apparatus (hereinafter, the service providing apparatus) that supports augmentation of data for infectious disease analysis according to an embodiment of the present invention.

As shown, the service providing apparatus 100 may communicate over a communication network with an external server that provides time-series data on the number of confirmed cases measured over time for a specific infectious disease, and may collect the time-series data from the external server as original data.

The external server may be a server associated with the Korea Disease Control and Prevention Agency (KDCA).

The communication network described in the present invention may include wired and wireless communication networks. Examples of the wireless communication network include wireless LAN (WLAN), DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), EV-DO (Evolution-Data Optimized or Evolution-Data Only), WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), IEEE 802.16, Long Term Evolution (LTE), LTE-A (Long Term Evolution-Advanced), Wireless Mobile Broadband Service (WMBS), 5G mobile communication service, Bluetooth, LoRa (Long Range), RFID (Radio Frequency Identification), infrared communication (Infrared Data Association: IrDA), UWB (Ultra Wideband), ZigBee, Near Field Communication (NFC), Ultra Sound Communication (USC), Visible Light Communication (VLC), Wi-Fi, and Wi-Fi Direct. Examples of the wired communication network include wired LAN (Local Area Network), wired WAN (Wide Area Network), Power Line Communication (PLC), USB communication, Ethernet, serial communication, and optical/coaxial cable.

The service providing apparatus 100 may also collect a plurality of different original data from the external server, and may train a prediction model preset in the service providing apparatus 100 on the plurality of original data so that the model predicts the number of confirmed cases for a future period.

However, for infectious diseases caused by viruses and the like, it is difficult to secure a sufficient amount of data in the early stage of spread. Moreover, to learn the pattern of increase in confirmed cases caused by the various external factors (or environmental factors) that make an epidemic surge, such as travel, variant viruses, religious activities, and rallies, those external factors must occur a sufficient number of times. In practice they occur rarely and the interval until the next occurrence is long, so it is also difficult to secure data from which the increase pattern caused by external factors can be learned. Consequently, when a prediction model is trained only on original data, that is, plain measured data, it is difficult to guarantee the accuracy of its predictions of the number of confirmed cases.

Therefore, the service providing apparatus 100 according to the present invention distorts (that is, augments) the original data, which is measured time-series data on the number of confirmed cases of an infectious disease collected from an external server, in various ways to generate a large number of separate augmented data. It then compares each augmented data with the original data for pattern-based similarity, selects the augmented data that are similar to the original data, and uses the selected augmented data together with the original data to train the prediction model. Training the prediction model on many augmented data that resemble the actual data in this way helps increase the reliability and accuracy of the number of confirmed cases the model predicts for a future period.

The detailed configuration of the service providing apparatus 100 of the present invention is described below with reference to the block diagram of the service providing apparatus 100 in FIG. 2 and the example operation diagram of the service providing apparatus 100 in FIG. 3.

First, as shown in FIG. 2, the service providing apparatus 100 according to an embodiment of the present invention may include a communication unit 110, a control unit 130, and a storage unit 120. It is not limited to these and may further include various other components.

The communication unit 110 may communicate with various devices, such as the external server, through the communication network, and the control unit 130 may communicate with those devices through the communication unit 110.

In the following, description of how the control unit 130 communicates through the communication unit 110 is omitted.

또한, 상기 저장부(120)는, 각종 정보를 저장할 수 있으며, HDD(Hard Disk Drive), SSD(Solid State Drive) 등과 같은 다양한 형태로 구성될 수 있고, DB로 구성되거나 DB를 포함하여 구성될 수 있다.In addition, the storage unit 120 may store various types of information, and may be configured in various forms such as a hard disk drive (HDD) and a solid state drive (SSD), and may be configured as a DB or including a DB.

The control unit 130 may perform the overall control functions of the service providing apparatus 100 based on the various information stored in the storage unit 120. The control unit 130 may include RAM, ROM, a CPU, a GPU, and a bus, with the RAM, ROM, CPU, GPU, and so on connected to one another through the bus.

The communication unit 110 and the storage unit 120 may also be included in the control unit 130.

The service providing apparatus 100 may additionally include various components, such as a user input unit for receiving user input and a display unit for displaying various information, and these additional components may be controlled by the control unit 130.

The control unit 130 may include a plurality of components, such as a data collection unit 131, a data augmentation unit 132, a linear comparison unit 135, a nonlinear comparison unit 136, and a data selection unit 137.

The components of the control unit 130 may be implemented by a processor or similar device capable of data processing. They may be implemented separately on different processors, or they may be functionally separated within a single processor.

The detailed operation of the control unit 130 is described with reference to FIG. 3.

First, the data collection unit 131 may collect original data, which is measured time-series data on the number of confirmed cases of the infectious disease over time, from the external server and store it in the DB included in the storage unit 120.

The data collection unit 131 may collect one or more original data for each of a plurality of different time periods and store them in the DB.

The data augmentation unit 132 may generate a plurality of augmented data by applying distortion correction, according to a preset augmentation algorithm, to the original data stored in the DB or provided by the data collection unit 131.

As an example, the data augmentation unit 132 may apply distortion correction to the original data by performing at least one of cropping, quantizing, and drift, where drift randomly changes at least one of the variable values that make up the time-series data. Through this distortion correction it may generate a plurality of mutually different augmented data.
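The three operations named above can be sketched as follows. This is a minimal illustration in plain Python, not the augmentation algorithm of the embodiment: the function names, the step sizes, and the choice to model drift as an accumulating random offset are all assumptions made for the example.

```python
import random

def crop(series, start, length):
    """Cropping: keep a contiguous window of the series (shortens it)."""
    return list(series[start:start + length])

def quantize(series, step):
    """Quantizing: round each value to a multiple of `step`."""
    return [round(v / step) * step for v in series]

def drift(series, max_step, rng):
    """Drift (assumed form): add a slowly accumulating random offset.
    Values are clipped at zero because case counts cannot be negative."""
    offset = 0.0
    out = []
    for v in series:
        offset += rng.uniform(-max_step, max_step)
        out.append(max(0.0, v + offset))
    return out

def augment(series, n_copies, seed=0):
    """Generate `n_copies` augmented series of the same length as the input.
    Cropping is left out here so that every copy keeps the original time axis."""
    rng = random.Random(seed)
    ops = [
        lambda s: quantize(s, step=10),
        lambda s: drift(s, max_step=5.0, rng=rng),
        lambda s: drift(quantize(s, step=5), max_step=2.0, rng=rng),
    ]
    return [rng.choice(ops)(series) for _ in range(n_copies)]

original = [100, 120, 150, 170, 160, 180, 210]  # hypothetical daily counts
augmented = augment(original, n_copies=5, seed=1)
```

Keeping the copies the same length as the original makes a point-by-point comparison on a shared time axis possible; cropped copies would need to be compared against the matching window of the original.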

The data augmentation unit 132 may store the plurality of augmented data in the DB, matched with the original data used to generate them.

Alternatively, each time augmented data is generated, the data augmentation unit 132 may provide the generated augmented data and the original data used to generate it to the linear comparison unit 135 and the nonlinear comparison unit 136.

The linear comparison unit 135 may compare, on the same time axis, the difference in the value of each variable between the augmented data provided by the data augmentation unit 132 and the original data, and thereby calculate an R2 (R-squared, coefficient of determination) value for the linear similarity between the augmented data and the original data.

The linear comparison unit 135 may calculate the R2 value by applying the augmented data and the original data to a preset regression model, and the variable it compares may be the number of confirmed cases.
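One simple way to obtain such a value is sketched below. The text does not specify the regression model, so this example assumes the augmented series is treated as a point-by-point prediction of the original and uses the standard definition R2 = 1 - SS_res / SS_tot; the function name is hypothetical.

```python
def r_squared(original, augmented):
    """Coefficient of determination of `augmented` against `original`,
    compared value by value on the same time axis."""
    if len(original) != len(augmented):
        raise ValueError("series must share the same time axis")
    mean = sum(original) / len(original)
    ss_tot = sum((y - mean) ** 2 for y in original)            # total variation
    ss_res = sum((y - a) ** 2 for y, a in zip(original, augmented))  # residual
    if ss_tot == 0:
        raise ValueError("original series is constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot
```

An identical copy scores 1.0, and a copy that is no better than the mean of the original scores 0.0 or below.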

On receiving the augmented data and the original data from the data augmentation unit 132, the nonlinear comparison unit 136 may calculate, through a preset DTW (Dynamic Time Warping) algorithm, a DTW value for the nonlinear similarity between the augmented data and the original data.

Here, the DTW value may mean the similarity of the two data as a whole, obtained through the DTW algorithm by comparing differences in variable values between the original data and the augmented data not only at the same point on the time axis but also at different points on the time axis.

As an example, the nonlinear comparison unit 136 may compare the pattern of change in the number of confirmed cases in a specific time period of the augmented data with the pattern of change in the remaining time periods of the original data. In this way, the augmented data is divided into a plurality of preset time periods, and the pattern of change in each period is compared not only with the pattern in the same period of the original data but also with the patterns in one or more other periods, yielding the DTW value, which is the nonlinear similarity between the augmented data and the original data.
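The classic dynamic-programming form of DTW is sketched below as an illustration of comparing values across different time points. Note that textbook DTW returns a distance (smaller means more similar); the `dtw_similarity` helper, which maps that distance to a score where larger means more similar, is an assumption added for the example and is not taken from the text.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def dtw_similarity(a, b):
    """Hypothetical score in (0, 1]: 1.0 for a perfect warped match."""
    return 1.0 / (1.0 + dtw_distance(a, b))
```

Because the alignment may stretch or compress the time axis, a series that repeats a value of the original, such as `[1, 1, 2, 3]` against `[1, 2, 3]`, still has distance 0.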

In the configuration described above, the linear comparison unit 135 may store the R2 value calculated for an augmented data in the DB, matched with that augmented data, and the nonlinear comparison unit 136 may likewise store the DTW value calculated for an augmented data in the DB, matched with that augmented data.

As a result, an R2 value and a DTW value are calculated and stored in the DB for each of the plurality of augmented data generated from the original data.

Meanwhile, the data selection unit 137 may, in conjunction with the linear comparison unit 135 and the nonlinear comparison unit 136, extract as candidate data one or more augmented data that satisfy a preset condition, based on the R2 value and the DTW value calculated for each of the plurality of augmented data.

As an example, the data selection unit 137 may extract as candidate data the one or more augmented data whose calculated R2 value and DTW value both fall within the top α% (or n%) preset by the condition.

Here, the data selection unit 137 may look up the R2 value and the DTW value of each augmented data in the DB and compare them with the condition to extract the candidate data. Alternatively, it may compare the R2 and DTW values received per augmented data from the linear comparison unit 135 and the nonlinear comparison unit 136 with the condition, and extract (or select) the candidate data from the augmented data received from either of the two units.

The data selection unit 137 may mark each augmented data extracted (or selected) in this way as candidate data and store it in the DB.

The data selection unit 137 may then set a threshold for the R2 value and a threshold for the DTW value, based on the R2 and DTW values of the extracted candidate data.

As an example, the data selection unit 137 may average the R2 values of the extracted candidate data to set a first threshold for the R2 value (linear similarity), and may average the DTW values of the extracted candidate data to set a second threshold for the DTW value (nonlinear similarity).

That is, the data selection unit 137 sets the first threshold as the linear similarity required of augmented data usable as training data for the prediction model, and sets the second threshold as the nonlinear similarity required of such augmented data.

After setting the thresholds, the data selection unit 137 may examine the plurality of augmented data stored in the DB and select (extract), as training data for the prediction model that predicts the number of confirmed cases, the one or more augmented data whose R2 value is at or above the first threshold and whose DTW value is at or above the second threshold.

As an example, if the R2 value calculated for a first augmented data among the plurality of augmented data stored in the DB is below the first threshold, the data selection unit 137 does not select the first augmented data as training data, regardless of its DTW value. If the R2 value calculated for a second augmented data is at or above the first threshold and its DTW value is at or above the second threshold, the data selection unit 137 may select the second augmented data as training data.
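The two-stage selection described above (candidates in the top α%, thresholds as candidate averages, then a pass over all augmented data) can be sketched as follows. The tuple layout, the function name, and the default `alpha` are hypothetical. The example also assumes that the DTW value has already been expressed as a score where larger means more similar, which matches the "at or above the threshold" rule in the text but differs from a raw DTW distance.

```python
import math

def select_training_data(scored, alpha=0.5):
    """scored: list of (name, r2, dtw_score) tuples for one original series.
    Returns (selected_names, first_threshold, second_threshold)."""
    k = max(1, math.ceil(len(scored) * alpha))
    # Stage 1: candidates must be in the top alpha fraction on BOTH measures.
    top_r2 = {s[0] for s in sorted(scored, key=lambda s: s[1], reverse=True)[:k]}
    top_dtw = {s[0] for s in sorted(scored, key=lambda s: s[2], reverse=True)[:k]}
    candidates = [s for s in scored if s[0] in top_r2 and s[0] in top_dtw]
    if not candidates:
        return [], None, None
    # Stage 2: thresholds are the candidate averages.
    thr_r2 = sum(s[1] for s in candidates) / len(candidates)
    thr_dtw = sum(s[2] for s in candidates) / len(candidates)
    # Stage 3: keep every augmented series that meets both thresholds.
    selected = [s[0] for s in scored if s[1] >= thr_r2 and s[2] >= thr_dtw]
    return selected, thr_r2, thr_dtw

scored = [("a", 0.75, 0.75), ("b", 0.5, 0.5), ("c", 0.25, 0.25), ("d", 0.125, 0.125)]
selected, thr_r2, thr_dtw = select_training_data(scored, alpha=0.5)
```

Because the thresholds are averages of the candidates, roughly half of the candidates fall below each one, so the final set is usually smaller than the candidate set.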

Here, the thresholds calculated from the plurality of augmented data matched with a specific original data are applied only to those augmented data when selecting training data. For a plurality of augmented data matched with a different original data, the data selection unit 137 may calculate separate thresholds through the same process and apply those thresholds to that group when selecting training data.

In the configuration described above, to increase the reliability of the training data, the data selection unit 137 may also bootstrap the one or more augmented data whose R2 and DTW values are at or above the thresholds to obtain n augmented data, and select each of the n augmented data as training data.
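Bootstrapping is commonly implemented as resampling with replacement, and the sketch below assumes that meaning; the text does not spell out the procedure, and the function name and seed parameter are hypothetical.

```python
import random

def bootstrap(selected, n, seed=0):
    """Draw n items from `selected` with replacement.
    The same augmented series may therefore appear more than once."""
    if not selected:
        raise ValueError("nothing to resample")
    rng = random.Random(seed)
    return rng.choices(selected, k=n)

pool = ["aug_03", "aug_07", "aug_11"]   # hypothetical names of selected series
training_set = bootstrap(pool, n=10, seed=42)
```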

The data selection unit 137 may train a prediction model preset in the service providing apparatus 100 on the selected training data and the original data, so that the model outputs the expected number of confirmed cases for a prediction target period.

The prediction model may be a deep learning model, and the deep learning model may consist of one or more neural network models.

A neural network model (or neural network) may consist of an input layer, one or more hidden layers, and an output layer, and various kinds of neural networks may be used, such as a DNN (Deep Neural Network), RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), or LSTM (Long Short-Term Memory).

The prediction model may be set up in the control unit 130 or may reside in a separate external device other than the service providing apparatus 100. In the latter case, the data selection unit 137 may work with the external device through the communication unit 110 to train the prediction model as described above.

Alternatively, the control unit 130 may further include a prediction unit 138 that contains the prediction model, and the prediction unit 138 may, in conjunction with the data selection unit 137, train the prediction model on the one or more training data selected by the data selection unit 137 and on the original data.

Once training of the prediction model is complete, the prediction unit 138 may apply input information on a prediction target period, received through the user input unit, to the prediction model, and generate and provide prediction information on the time-series change in the number of confirmed cases expected during that period.

Meanwhile, so that the augmented data is generated to resemble the original data as closely as possible, the control unit 130 may further include components that allow the distortion correction performed by the data augmentation unit 132 to reflect a periodic increase pattern in the number of confirmed cases, or an increase pattern caused by external factors (environmental factors that make the number of confirmed cases surge), so that augmented data reflecting at least one of these two patterns is generated. These components are described in detail below.

First, the control unit 130 may further include a pattern application unit 133 that works with the data collection unit 131 and the data augmentation unit 132.

So that the increase pattern of the number of confirmed cases during a specific period, specified by user input through the user input unit, is reflected in the augmented data, the pattern application unit 133 may, in conjunction with the data collection unit 131, collect one or more original data corresponding to each of a plurality of unit periods obtained by dividing the specific period at a preset cycle. Based on the collected original data, it may average the number of confirmed cases by day to calculate the increase pattern corresponding to the specific period.

As an example, the pattern application unit 133 may calculate the increase pattern from a first average value, obtained by averaging the one or more original data corresponding to a first unit period, and a second average value, obtained by averaging the one or more original data corresponding to a second unit period.

The data augmentation unit 132 may, in conjunction with the pattern application unit 133, apply the increase pattern to the original data before the distortion correction and then perform the distortion correction, thereby generating augmented data that reflects the increase pattern corresponding to the specific period.

As an example, as shown in FIG. 4, the seasonality pattern in the number of confirmed cases during characteristic periods in which cases increase, such as weekends or holidays, is first reflected in the original data, and the data is then augmented (augmented data is generated). Data augmented in this way allows rises and surges in confirmed cases to be predicted with greater sensitivity.

The graph of FIG. 4 covers the specific period from 2021/07/12 to 2021/09/12, in which a pattern due to the seasonality component is clearly visible; many external variables are excluded in this period.

As shown, 2021/07/12 is a Monday, and the number of confirmed cases in the specific period follows a day-of-week pattern.

Accordingly, the pattern application unit 133 may calculate the average number of confirmed cases for each day of the week over the specific period and obtain the increase pattern (or seasonal increase pattern) corresponding to the specific period, as shown in FIG. 5.
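The day-of-week averaging can be sketched as follows. The function name and the sample counts are hypothetical; only the start date 2021/07/12 (a Monday) comes from the text. How the resulting pattern is then applied to the original data is not specified in the text, so the sketch stops at computing the pattern.

```python
from datetime import date, timedelta

def weekday_pattern(start, counts):
    """Average daily case counts per weekday (index 0 = Monday ... 6 = Sunday).
    `counts[i]` is the count on the day `start + i days`."""
    sums = [0.0] * 7
    days = [0] * 7
    for offset, c in enumerate(counts):
        wd = (start + timedelta(days=offset)).weekday()
        sums[wd] += c
        days[wd] += 1
    # None marks a weekday that never occurs in the given period.
    return [sums[i] / days[i] if days[i] else None for i in range(7)]

start = date(2021, 7, 12)                      # a Monday
counts = [10, 20, 30, 40, 50, 60, 70,          # hypothetical week 1
          30, 40, 50, 60, 70, 80, 90]          # hypothetical week 2
pattern = weekday_pattern(start, counts)
```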

The data augmentation unit 132 may then apply the increase pattern (seasonal increase pattern) corresponding to the specific period to the original data before generating the augmented data, and perform the distortion correction afterwards. When augmented data generated in this way is used to train the prediction model, the model can take the seasonality component into account and predict increases and decreases in the number of confirmed cases with greater sensitivity.

In addition, so that the influence of external factors that increase the number of confirmed cases is reflected when augmented data is generated from the original data, the control unit 130 may further include an external factor collection unit 134 that works with the data collection unit 131 and the data augmentation unit 132.

The external factor collection unit 134 may, in conjunction with the data collection unit 131, collect one or more original data from the time when the search volume rises for a preset keyword corresponding to an external factor that increases the number of confirmed cases. It sets the increase in search volume for the keyword as the value of one independent variable of a preset Newton method, and sets the change in the number of confirmed cases, according to the original data for the period over which the increase in search volume was calculated, as another independent variable of the Newton method. Through the Newton method it may then calculate a slope and an acceleration for adjusting the increase in the number of confirmed cases and the acceleration of its surge.

The data augmentation unit 132 may then apply the slope and acceleration calculated by the external factor collection unit 134 to the original data and generate the augmented data from the original data so modified, or it may apply the slope and acceleration to the augmented data.

In the external factor collection unit 134, one or more keywords are preset for each different kind of factor, and the slope and acceleration may be calculated for each keyword.

The external factors may include various factors such as domestic travel, international travel, virus variants, religious activity, and assemblies.

As an example, in order to reflect, when generating augmented data, a pattern in which the number of confirmed cases rises sharply for a specific reason such as increased urban or international movement, virus mutation, religious activity, or assemblies, the external factor collection unit 134 may, in conjunction with the data augmentation unit 132, apply the following technique to the original data or the augmented data: when the value of a particular independent variable (x) is strongly expressed, the learning rate is raised along the search direction, and the Newton method is applied selectively when that specific cause is expressed, so that intervals rising more steeply than before are reflected in the loss function.

That is, the external factor collection unit 134 denotes each of the one or more independent variables corresponding to external factors, such as increased urban or international movement, virus mutation, religious activity, and assemblies, as x⁺, and denotes each of the other independent variables as x⁻.

The external factor collection unit 134 may also, in conjunction with the data collection unit 131, collect the search volume of a specific keyword corresponding to an external factor as a quantified value through domestic and foreign search engines.

As an example, when the search volume of a keyword preset to correspond to religious activity increases, the external factor collection unit 134 may set the numerical value of that search volume as the value of the variable x_r.

In addition, as in Equation 1 below, for regions where the values of the variables corresponding to x⁺ change sharply, the external factor collection unit 134 may assign a larger gradient through a weight based on the Hessian matrix (the matrix of second derivatives of the variables, H(f)), so that the fit follows those regions.

To do this, a mask matrix M(f) may be introduced that assigns the value 1 only to elements containing x⁺. Because M(f) is 1 only for elements that include a variable corresponding to x⁺, it leads only those values of H(f) to be activated.
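The role of the mask can be illustrated with a small sketch. How M(f) is combined with H(f) in Equation 1 is not recoverable from this text, so the element-wise product used here is an assumption, as are the function names and the convention that an element "contains" x⁺ when its row or column index belongs to an x⁺ variable.

```python
def mask_matrix(n_vars, boosted):
    """M[i][j] = 1 if Hessian element (i, j) involves an x+ variable, else 0.
    `boosted` is the set of indices of the x+ variables."""
    return [[1 if (i in boosted or j in boosted) else 0 for j in range(n_vars)]
            for i in range(n_vars)]

def masked_hessian(hessian, mask):
    """Element-wise product of mask and Hessian (assumed combination):
    only entries involving x+ variables stay active."""
    return [[h * m for h, m in zip(h_row, m_row)]
            for h_row, m_row in zip(hessian, mask)]

H = [[2, 1, 0],
     [1, 3, 4],
     [0, 4, 5]]                      # hypothetical Hessian for three variables
M = mask_matrix(3, boosted={0})      # variable 0 plays the role of x+
active = masked_hessian(H, M)
```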

Because the present invention is based on a multivariate regression model, the external factor collection unit 134 may proceed as follows. In conjunction with the data collection unit 131, it collects one or more original data from the time when the search volume rises for a preset keyword corresponding to an external factor that increases the number of confirmed cases. It sets the increase in search volume for the keyword as the value of an independent variable of the Newton method according to Equation 2 below, and sets the change in the number of confirmed cases, according to the original data for the period over which the increase in search volume was calculated, as another independent variable of the Newton method. It then applies the increase in search volume and the corresponding change in the number of confirmed cases, together with the values of H(f) and M(f) calculated through Equation 1, to the Newton method, and calculates the slope and acceleration so that the fit adapts quickly to situations in which a particular variable is expressed more strongly.

As described above, the present invention augments the original data, which is measured time-series data on the number of confirmed cases of an infectious disease collected from an external server, in various ways to generate a plurality of separate augmented data, compares each augmented data with the original data for pattern-based similarity, selects the augmented data most similar to the original (actual) data, and uses it to train a prediction model built from a neural network. Even when an infectious disease has only begun to spread and there is too little measured data to train a prediction model, the invention can generate a large body of augmented data similar to the actual data and train the prediction model on it, raising the reliability and accuracy of the number of confirmed cases predicted for a future period.

In addition, in generating the augmented data used to train the prediction model, the present invention calculates a pattern due to the seasonality component, or a slope and acceleration associated with changes in the number of confirmed cases caused by various external factors, and reflects them in the augmented data. It then selects, from the augmented data reflecting the seasonality component or the external factors, the data similar to the original data and uses it to train the prediction model. This supports accurate prediction of changes in the number of confirmed cases during the prediction target period, with the influence of the seasonality component and external factors taken into account.

The components described in the embodiments of the present invention may be implemented using one or more general-purpose or special-purpose computers, for example hardware such as a storage unit 120 (such as a memory), a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPGA (field programmable gate array), a PLU (programmable logic unit), or a microprocessor; software including an instruction set; a combination of these; or any other device capable of executing and responding to instructions.

Those of ordinary skill in the art to which the present invention pertains will be able to modify and vary the foregoing without departing from the essential characteristics of the present invention. Therefore, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

100: service providing device 110: communication unit
120: storage unit 130: control unit
131: data collection unit 132: data augmentation unit
133: pattern application unit 134: external factor collection unit
135: linear comparison unit 136: non-linear comparison unit
137: data selection unit 138: prediction unit

Claims

A service providing device that supports augmentation of data for epidemic analysis, comprising:
a data collection unit that collects and stores original data, which is time-series data actually measured for the number of confirmed cases of an infectious disease by time;
a data augmentation unit that generates a plurality of augmented data by performing distortion correction on the original data according to a preset augmentation algorithm, and stores the plurality of augmented data;
a linear comparison unit that calculates an R-Squared (R2) value, which is a linear similarity between the augmented data and the original data, by comparing differences between the values of each variable on the same time axis in the augmented data and the original data;
a non-linear comparison unit that calculates a DTW value, which is a non-linear similarity between the augmented data and the original data, through a dynamic time warping (DTW) algorithm; and
a data selection unit that, for the plurality of augmented data, extracts as candidate data one or more augmented data that satisfy a preset condition based on the R2 value and the DTW value calculated by the linear comparison unit and the non-linear comparison unit, sets a threshold for each of the R2 value and the DTW value based on the R2 value and the DTW value of each of the extracted candidate data, and selects, as learning data for training a prediction model that predicts the number of confirmed cases, each of one or more augmented data for which an R2 value and a DTW value equal to or greater than the respective thresholds were calculated.

The service providing device of claim 1,
wherein the data augmentation unit performs, as the distortion correction of the original data, at least one of cropping, quantizing, and drift that randomly changes at least one of the one or more variable values constituting the time-series data.
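
For illustration only, the three operations named in this claim can be sketched as below. The source does not define them further, so the sketch assumes their common meanings in time-series augmentation: cropping keeps a contiguous window, quantizing snaps values to a fixed step, and drift adds a slowly wandering random offset. All function names and parameters are hypothetical.

```python
import random


def crop(series, start, length):
    """Keep a contiguous window of `length` points beginning at `start`."""
    return list(series[start:start + length])


def quantize(series, step):
    """Snap each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in series]


def drift(series, max_step, rng):
    """Add a random-walk offset whose increment per point is within +/- max_step."""
    offset = 0.0
    out = []
    for v in series:
        offset += rng.uniform(-max_step, max_step)
        out.append(v + offset)
    return out


rng = random.Random(0)  # seeded so the illustration is repeatable
original = [12, 17, 23, 31, 40]  # hypothetical daily counts

print(crop(original, 1, 3))      # [17, 23, 31]
print(quantize([12, 17, 23], 5))  # [10, 15, 25]
print(drift(original, 1.0, rng))
```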

The service providing device of claim 1,
wherein the data selection unit extracts, as the candidate data, one or more augmented data whose calculated R2 value and DTW value each fall within a preset top α% among the plurality of augmented data.
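
For illustration only, the candidate extraction and threshold-based selection can be sketched as below. Several points are assumptions not stated in the source: the DTW value is taken to be already expressed as a score in which higher means more similar (for example 1 / (1 + distance)), each threshold is set to the lowest score found among the candidates, and the names `top_alpha` and `select_training` are hypothetical.

```python
import math


def top_alpha(scores, alpha):
    """Indices of the scores ranked in the top `alpha` percent (at least one)."""
    k = max(1, math.ceil(len(scores) * alpha / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_training(r2, dtw_sim, alpha):
    """Return indices of augmented series selected as training data."""
    # Candidates must be in the top alpha percent on both measures.
    candidates = top_alpha(r2, alpha) & top_alpha(dtw_sim, alpha)
    if not candidates:
        return []
    # Assumption: each threshold is the weakest score among the candidates.
    r2_thr = min(r2[i] for i in candidates)
    dtw_thr = min(dtw_sim[i] for i in candidates)
    return [i for i in range(len(r2)) if r2[i] >= r2_thr and dtw_sim[i] >= dtw_thr]


# Hypothetical scores for five augmented series.
r2 = [0.95, 0.90, 0.40, 0.85, 0.10]
dtw_sim = [0.80, 0.90, 0.95, 0.70, 0.20]
print(select_training(r2, dtw_sim, alpha=60))  # [0, 1]
```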

The service providing device of claim 1,
wherein the data selection unit obtains n augmented data by bootstrapping the one or more augmented data for which the R2 value and the DTW value equal to or greater than the thresholds were calculated, and selects each of the n augmented data as the learning data.
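
For illustration only, bootstrapping is taken here in its usual statistical sense of resampling with replacement; the source does not specify the procedure further. The function name `bootstrap` and the sample series are hypothetical.

```python
import random


def bootstrap(selected, n, rng):
    """Draw n items from `selected` with replacement."""
    if not selected:
        raise ValueError("nothing to resample")
    return [rng.choice(selected) for _ in range(n)]


rng = random.Random(42)  # seeded so the illustration is repeatable
selected = [[10, 12, 15], [11, 12, 16], [9, 13, 15]]  # hypothetical augmented series
training = bootstrap(selected, 5, rng)
print(len(training))  # 5
```

Because the draw is with replacement, n may exceed the number of selected series, which is how a small selected set can still yield a larger training set.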

The service providing device of claim 1,
wherein the data selection unit trains a preset prediction model on one or more of the learning data and the original data so that a predicted number of confirmed cases corresponding to a prediction target period is calculated.

The service providing device of claim 1,
further comprising a pattern application unit that, so that an increase pattern of the number of confirmed cases during a specific period according to user input is reflected in the augmented data, works in conjunction with the data collection unit to collect one or more original data corresponding to each of a plurality of unit periods obtained by dividing the specific period by a preset cycle, and calculates the increase pattern by averaging the number of confirmed cases by date based on the collected one or more original data,
wherein the data augmentation unit, in conjunction with the pattern application unit, applies the increase pattern to the original data before the distortion correction that produces the augmented data, and then performs the distortion correction.
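
For illustration only, the averaging step of this claim can be sketched as below, assuming every unit period covers the same number of days so that counts can be averaged position by position. How the resulting pattern is then applied to the original data is not specified in this section and is left out. The name `increase_pattern` and the sample counts are hypothetical.

```python
def increase_pattern(unit_periods):
    """Average the confirmed-case counts day by day across unit periods."""
    if not unit_periods:
        raise ValueError("no unit periods given")
    length = len(unit_periods[0])
    if any(len(p) != length for p in unit_periods):
        raise ValueError("unit periods must have the same number of days")
    # zip(*...) groups the counts that fall on the same day of each period.
    return [sum(day) / len(unit_periods) for day in zip(*unit_periods)]


# Hypothetical counts for the same three days in two unit periods.
periods = [
    [10, 20, 30],
    [20, 40, 60],
]
print(increase_pattern(periods))  # [15.0, 30.0, 45.0]
```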

The service providing device of claim 1,
further comprising an external factor collection unit that works in conjunction with the data collection unit to collect one or more original data from a period in which the search volume for a preset keyword corresponding to an external factor that increases the number of confirmed cases increased, sets the increase in the search volume for the keyword as the value of an independent variable applied to a preset Newton Method, sets the change in the number of confirmed cases according to the one or more original data during the period over which the increase in the search volume was calculated as another independent variable applied to the Newton Method, and calculates, through the Newton Method, a slope and an acceleration that change the increase in the number of confirmed cases and how rapidly that increase accelerates,
wherein the data augmentation unit either applies the slope and acceleration calculated by the external factor collection unit to the original data and then generates the augmented data based on the original data to which the slope and acceleration were applied, or applies the slope and acceleration to the augmented data.

The service providing device of claim 7,
wherein the external factor collection unit presets one or more keywords for each of different types of factors, and calculates the slope and acceleration for each keyword.

A service providing method, performed by a service providing device, that supports augmentation of data for epidemic analysis, the method comprising:
a data collection step of collecting and storing original data, which is time-series data actually measured for the number of confirmed cases of an infectious disease by time;
a data augmentation step of generating a plurality of augmented data by performing distortion correction on the original data according to a preset augmentation algorithm, and storing the plurality of augmented data;
a linear comparison step of calculating an R2 value, which is a linear similarity between the augmented data and the original data, by comparing differences between the values of each variable on the same time axis in the augmented data and the original data;
a non-linear comparison step of calculating a DTW value, which is a non-linear similarity between the augmented data and the original data, through a dynamic time warping (DTW) algorithm; and
a data selection step of extracting, as candidate data, one or more augmented data that satisfy a preset condition based on the R2 value and the DTW value calculated for each of the plurality of augmented data, setting a threshold for each of the R2 value and the DTW value based on the R2 value and the DTW value of each of the extracted candidate data, and selecting, as learning data for training a prediction model that predicts the number of confirmed cases, each of one or more augmented data for which an R2 value and a DTW value equal to or greater than the respective thresholds were calculated.

The method of claim 9,
wherein the data augmentation step performs, as the distortion correction of the original data, at least one of cropping, quantizing, and drift that randomly changes at least one of the one or more variable values constituting the time-series data.

The method of claim 9,
wherein the data selection step extracts, as the candidate data, one or more augmented data whose calculated R2 value and DTW value each fall within a preset top α% among the plurality of augmented data.

The method of claim 9,
wherein the data selection step obtains n augmented data by bootstrapping the one or more augmented data for which the R2 value and the DTW value equal to or greater than the thresholds were calculated, and selects each of the n augmented data as the learning data.

The method of claim 9,
wherein the data selection step trains a preset prediction model on one or more of the learning data and the original data so that a predicted number of confirmed cases corresponding to a prediction target period is calculated.

The method of claim 9,
wherein the data collection step further comprises collecting one or more original data corresponding to each of a plurality of unit periods obtained by dividing a specific period by a preset cycle, so that an increase pattern of the number of confirmed cases during the specific period according to user input is reflected in the augmented data, and
wherein the data augmentation step calculates the increase pattern by averaging the number of confirmed cases by date based on the one or more original data collected for the specific period, applies the increase pattern to the original data before the distortion correction that produces the augmented data, and then performs the distortion correction.

The method of claim 9,
wherein the data collection step further comprises collecting one or more original data belonging to a period of interest in which the search volume for a preset keyword corresponding to an external factor that increases the number of confirmed cases increased, and
wherein the data augmentation step sets the increase in the search volume for the keyword as the value of an independent variable applied to a preset Newton Method, sets the change in the number of confirmed cases according to the one or more original data during the period of interest over which the increase in the search volume was calculated as another independent variable applied to the Newton Method, calculates, through the Newton Method, a slope and an acceleration that change the increase in the number of confirmed cases and how rapidly that increase accelerates, and either applies the calculated slope and acceleration to the original data and then generates the augmented data based on the original data to which the slope and acceleration were applied, or applies the slope and acceleration to the augmented data.

The method of claim 15,
wherein one or more keywords are preset in the service providing device for each of different types of factors, and
wherein the data augmentation step calculates the slope and acceleration for each keyword.