KR20140034994A

KR20140034994A - Apparatus and method for estimation of disease transmission situation using social network service data

Info

Publication number: KR20140034994A
Application number: KR1020120100643A
Authority: KR
Inventors: 김경현; 이형우; 김의기
Original assignee: 고려대학교 산학협력단
Priority date: 2012-09-11
Filing date: 2012-09-11
Publication date: 2014-03-21
Also published as: KR101405309B1

Abstract

To predict how a disease is spreading, suspected-patient fraction data for a predetermined disease is collected and stored, and original SNS data is collected and stored. Valid SNS data containing at least one of the predetermined reference words associated with the disease is extracted from the stored SNS data. Markers are determined from the words in the valid SNS data, based on the correlation between the usage frequency of each word and the suspected-patient fraction data. The usage ratio of the markers in the stored SNS data is calculated. A disease spread prediction value corresponding to a prediction target date is then calculated from the marker usage ratio and the suspected-patient fraction data. [Reference numerals] (S110) Collect SNS data and disease statistics data; (S120) Extract valid SNS data from the SNS data; (S130) Select markers for the disease spread prediction analysis, based on the valid SNS data; (S140) Calculate the usage ratio of the markers in the SNS data and perform linear regression analysis against the disease statistics data; (S150) Obtain a disease spread prediction value; (S160) Provide disease spread information

Description

APPARATUS AND METHOD FOR ESTIMATION OF DISEASE TRANSMISSION SITUATION USING SOCIAL NETWORK SERVICE DATA

The present invention relates to an apparatus and method for predicting the spread of a disease, which estimate how an infectious disease is spreading based on social network service (SNS) data.

The emergence of epidemics (i.e., infectious diseases) is a serious public health problem and, together with climate and environmental problems, a global issue. As the recent cases of the novel influenza, SARS, foot-and-mouth disease, and avian influenza show, advances in transportation and globalization make it very likely that an infectious disease will cross borders and grow into a worldwide disaster within a short time.

Biological vaccines have long been under active development in preparation for epidemics, but in practice early prevention is the most important countermeasure. Producing a vaccine takes a long time, about 4 to 6 months in the case of influenza (i.e., the flu), so when an epidemic breaks out the disease is likely to have spread considerably by the time the vaccine is ready.

Taking influenza as an example of an epidemic, institutions such as the Korea Centers for Disease Control and Prevention (질병 관리 본부) run an influenza sentinel surveillance program for early prevention of disease spread and announce the level of influenza activity on a weekly basis. However, pathogens mutate, evolve, and spread far faster than the time needed to identify and confirm the causative pathogen of an epidemic, so it is difficult to take preventive measures against disease spread in this way. In practice, collecting and analyzing influenza-like illness (ILI) information takes a long time, so an influenza sentinel surveillance report contains information on patients who fell ill roughly two weeks before the report date.

Therefore, a method is needed that continuously monitors the occurrence of pathogens such as viruses, regardless of any particular epidemic season, and that identifies and predicts how they spread.

In this regard, Korean Patent Laid-open Publication No. 2011-0056800 (Epidemiological Simulation System and Method) discloses an epidemiological simulation system and method that predict the transmission routes, regions, and the like of an influenza outbreak through a patch model. The patch model calculates the direction and extent of disease propagation and the number of patients by applying the person-to-person contact rate within each region of the country, the demographic characteristics of each region, and traffic volume that reflects the inter-regional network.

An embodiment of the present invention aims to provide an apparatus and method for predicting the spread of a disease that analyze and predict the spread of the disease using SNS data and provide the resulting prediction information.

As a technical means for achieving the above object, a disease spread prediction apparatus according to one aspect of the present invention includes: a disease statistics collection unit that collects and stores suspected-patient fraction data for a predetermined disease provided by a disease statistics server; an SNS data collection unit that collects and stores original SNS data provided by a social network service (SNS) server; a valid data extraction unit that extracts, from the stored SNS data, valid SNS data containing at least one of the predetermined reference words related to the disease; a marker selection unit that determines markers from the words of the valid SNS data based on the correlation between the usage frequency of each word in the valid SNS data and the suspected-patient fraction data; a marker matrix generation unit that generates, from the stored SNS data of a predetermined period, a marker matrix containing the number of times the markers were used and the total number of items of stored SNS data in that period; a disease spread prediction unit that uses the marker matrix to calculate the usage ratio of the markers in the stored SNS data and calculates a disease spread prediction value corresponding to a prediction target date based on the marker usage ratio and the suspected-patient fraction data; and a disease spread information providing unit that generates and provides disease spread information based on the calculated disease spread prediction value.

A disease spread prediction method of the disease spread prediction apparatus according to another aspect of the present invention includes the steps of: (a) collecting and storing original SNS data provided by a social network service (SNS) server and suspected-patient fraction data for a predetermined disease provided by a disease statistics server; (b) extracting, from the stored SNS data, valid SNS data containing at least one of the predetermined reference words related to the disease; (c) determining markers from the words of the valid SNS data based on the correlation between the usage frequency of each word in the valid SNS data and the suspected-patient fraction data; (d) calculating the usage ratio of the markers in the stored SNS data within a predetermined period unit; (e) performing linear regression analysis on the marker usage ratio values and the suspected-patient fraction data to calculate a disease spread prediction value corresponding to a prediction target date; and (f) generating disease spread information based on the calculated disease spread prediction value and providing it to a user terminal.

According to any one of the above means, SNS data makes it possible to monitor and predictively analyze the outbreak and spread of an epidemic in real time, so that early preventive measures against the disease can be taken.

In addition, according to any one of the above means, using SNS data as the source material for predicting disease spread makes it simple to run a separate spread prediction analysis for each of various diseases and to provide the resulting prediction information.

Furthermore, according to any one of the above means, the SNS data is processed in a way suited to each specific language (in particular, Korean), and the SNS data that correlates strongly with actual disease statistics is used as the input to the spread prediction analysis, which improves prediction accuracy.

FIG. 1 is a block diagram showing the configuration of a disease spread prediction apparatus according to an embodiment of the present invention.
FIG. 2 is a view for explaining how markers are selected according to the result of a LASSO (Least-Absolute Shrinkage And Selection Operator) analysis in an embodiment of the present invention.
FIG. 3 is an example showing the correlation between the markers of the SNS data and the ILI fraction data in an embodiment of the present invention.
FIG. 4 is an example showing the result of disease spread prediction based on SNS data and ILI fraction data in an embodiment of the present invention.
FIG. 5 is an example showing spread prediction results for each period of the data used in the disease spread prediction analysis in an embodiment of the present invention.
FIG. 6 is an example showing the validity of the prediction analysis results for each period of the data used in the disease spread prediction analysis in an embodiment of the present invention.
FIG. 7 is a graph showing an example of the disease spread information provided in an embodiment of the present invention.
FIG. 8 is an example showing, for another kind of disease, the correlation between SNS data and disease statistics data and the result of the disease spread prediction analysis when the analysis according to an embodiment of the present invention is applied.
FIG. 9 is a view showing an example of a disease spread information screen provided through a smartphone in an embodiment of the present invention.
FIG. 10 is a flowchart for explaining a disease spread prediction method according to an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains can readily carry them out. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

FIG. 1 is a block diagram showing the configuration of a disease spread prediction apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the disease spread prediction apparatus (100) according to an embodiment of the present invention includes an SNS data collection unit (110), a disease statistics collection unit (120), a database (130), a valid data extraction unit (140), a marker selection unit (150), a marker matrix generation unit (160), a disease spread prediction unit (170), and a disease spread information providing unit (180).

The SNS data collection unit (110) collects the SNS data provided by the social network service server (10) (hereinafter 'original SNS data') and stores it in the database (130) (hereinafter 'stored SNS data'), updating the daily stored SNS data already held there. The SNS data provided by the SNS server (10) is unprocessed, that is, raw data. SNS data here means the text data that the users of multiple accounts upload to the SNS server (10).

In the embodiments of the present invention, the basic unit of text data that a user writes (i.e., uploads) at one time with the writing tool each SNS provides (for example, the unit referred to as a 'mention' or a 'tweet') is called individual SNS data. Each item of individual SNS data contains at least one of a syllable, a word, a phrase, a clause, and a sentence.

In this embodiment the SNS server (10) is described, by way of example, as an open SNS server that provides SNS data at random to an unspecified number of recipients. However, the SNS data collection unit (110) may also collect SNS data as it is published in real time through an SNS data collection device or application commonly available on the Internet.

To retrieve and supply SNS data efficiently for the disease spread prediction analysis, the SNS data collection unit (110) may store the original SNS data in a relational database. For example, the original SNS data may be classified by user account and stored in table form.
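
The storage scheme described above could look roughly like the following sketch, which uses Python's built-in `sqlite3` module. The table name `sns_post`, its columns, and the helper functions are hypothetical; the patent only states that the data is kept in a relational database and classified by user account.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) per-post table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sns_post ("
        " id INTEGER PRIMARY KEY,"
        " account TEXT NOT NULL,"
        " posted_on TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    # An index on the account column keeps per-account lookups cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_post_account ON sns_post(account)"
    )
    return conn


def store_posts(conn, posts):
    """Insert (account, posted_on, body) tuples."""
    conn.executemany(
        "INSERT INTO sns_post (account, posted_on, body) VALUES (?, ?, ?)",
        posts,
    )
    conn.commit()


def posts_by_account(conn, account):
    """Return (posted_on, body) rows for one account, oldest first."""
    rows = conn.execute(
        "SELECT posted_on, body FROM sns_post"
        " WHERE account = ? ORDER BY posted_on, id",
        (account,),
    )
    return rows.fetchall()
```

Storing the posting date as an ISO-formatted string is an assumption made here so that the daily and weekly grouping used later in the analysis can be done with plain string comparison.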

The SNS data collection unit (110) may also build the stored SNS data by selecting original SNS data according to preset extraction conditions and storing only the selected data in the database (130).

Specifically, the SNS data collection unit (110) may select the SNS data to store by applying at least one of the following: keeping only the original SNS data written in a preset language; removing SNS data that contains a hyperlink; removing SNS data that contains a preset spam word; and, for each SNS account, removing SNS data written by someone other than the account's user (for example, a retweet (RT)). These selection steps may be applied to the original SNS data jointly or separately, and simultaneously or in stages.
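
As a rough illustration of these selection conditions, the sketch below filters a list of raw posts. The spam vocabulary, the regular expressions, and the 50% Hangul threshold used as a language check are all assumptions; the patent does not specify how each condition is detected.

```python
import re

# Hypothetical spam vocabulary; the patent only says the list is preset.
SPAM_WORDS = {"광고", "이벤트", "할인"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
HANGUL_PATTERN = re.compile(r"[가-힣]")
RETWEET_PATTERN = re.compile(r"(^|\s)RT\s*@", re.IGNORECASE)


def is_korean(text, min_ratio=0.5):
    """Treat a post as Korean when enough of its letters are Hangul syllables."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hangul = sum(1 for ch in letters if HANGUL_PATTERN.match(ch))
    return hangul / len(letters) >= min_ratio


def keep_post(text):
    """Return True when a raw post passes all four selection conditions."""
    if URL_PATTERN.search(text):
        return False  # contains a hyperlink
    if RETWEET_PATTERN.search(text):
        return False  # written by someone other than the account's user
    if any(word in text for word in SPAM_WORDS):
        return False  # contains a preset spam word
    return is_korean(text)


def select_posts(raw_posts):
    """Keep only the posts that should become stored SNS data."""
    return [post for post in raw_posts if keep_post(post)]
```

Here all four conditions are applied together in one pass; since the text allows them to be applied separately or in stages, each check could equally be run as its own filter.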

This embodiment describes the case in which the SNS data collection unit (110) selects SNS data written in Korean (Hangul), then applies the remaining extraction conditions to the selected Korean SNS data and uses the result for the disease spread prediction analysis. The purpose is to explain how to efficiently handle analysis that must account for characteristics of Korean such as the way its characters are composed, its varied word endings, numerous homonyms, varied conjugated forms, neologisms, and word spacing. Preprocessing suited to each language can likewise be applied to other languages.

The disease statistics collection unit (120) collects the disease statistics data for a predetermined disease provided by the disease statistics server (20) (in this embodiment, suspected-patient fraction data) and stores it in the database (130), updating the disease statistics data already stored there by statistical report date.

The disease statistics server (20) used in the embodiments is either the server of an institution that publishes statistics on diseases, such as a disease control agency, or a server linked to such an institution, and the disease statistics data may be suspected-patient fraction data compiled for each type of disease. In the embodiments, the suspected-patient fraction data contains information on the suspected-patient fraction from at least one day before the day of the update.

The following description assumes that the disease spread prediction apparatus (100) according to an embodiment of the present invention is configured with 'influenza (the flu)' as the disease to be predicted and performs the spread prediction analysis for influenza. In this case, influenza-like illness (ILI) fraction data is collected as the suspected-patient fraction data.

The database (130) stores the reference data needed for the disease spread prediction analysis, including the SNS data collected by the SNS data collection unit (110) and the ILI fraction data collected by the disease statistics collection unit (120). The database (130) also retrieves and supplies the data requested by each unit of the disease spread prediction apparatus (100).

Although FIG. 1 shows the database (130) that stores all of this data as a single entity, it may also be implemented as several separate databases, divided according to function and the type of data stored.

For example, the database (130) according to an embodiment of the present invention may be provided as separate databases (not shown) that respectively store: SNS data; disease statistics data; reference word data, which is the criterion for extracting valid SNS data; spam word data, which is the criterion for removing spam-type SNS data; word conjugation data for finding the base form of each word in the SNS data; and homonym data and well-wishing sentence/phrase/word data, which are used to check that a reference word is used in the relevant sense when selecting SNS data that is useful for the prediction analysis. The databases for reference word data, spam word data, word conjugation data, homonym data, and well-wishing sentence/phrase/word data may each take the form of a 'language ontology', which is a kind of dictionary made up of words and the relationships between them.

The valid data extraction unit (140) extracts, from the original SNS data, valid SNS data containing at least one of the predetermined reference words related to the disease. It may extract valid SNS data from the original SNS data for the prediction target date requested by the user, or it may automatically extract valid SNS data from the original SNS data on a weekly, daily, or hourly basis.

Specifically, from the SNS data that the SNS data collection unit (110) selected out of the original SNS data (i.e., the Korean SNS data satisfying the extraction conditions above), the valid data extraction unit (140) extracts as valid SNS data the items that contain reference words set in advance for the disease (i.e., influenza). For example, 'influenza' (인플루엔자), 'flu' (독감), 'cold' (감기), 'cough' (기침), and 'flu' (플루, the loanword) may be used as reference words.

In extracting the valid SNS data, the valid data extraction unit (140) may also remove, from the SNS data containing a reference word, at least one of the following: SNS data that contains a preset well-wishing sentence, phrase, or word associated with the reference word; and SNS data that contains a homonym of the reference word.

Specifically, the valid data extraction unit (140) analyzes each item of individual SNS data against the well-wishing sentence/phrase/word data stored in advance and can remove SNS data containing a well-wishing sentence that has no bearing on predicting the actual spread of the disease, such as '일교차가 심하네요. 감기 조심하세요.' ('The temperature swings are big these days. Take care not to catch a cold.').

The valid data extraction unit (140) also analyzes each item of individual SNS data against the homonym data stored in advance and can remove SNS data containing a homonym of a reference word (here, '감기'), such as '눈은 감기고 시험은 일주일도 안 남았고……' ('My eyes keep closing and the exam is less than a week away...'), where '감기' is part of the verb 'to close' rather than the noun 'cold'.

For example, the valid data extraction unit (140) may remove an item of SNS data when it contains text matching a stored well-wishing sentence, phrase, or word, or when it contains at least one of the words that, for each homonym, are customarily used together with that homonym of the reference word.
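
A minimal sketch of this extraction step follows. The reference words come from the description; the well-wishing phrases and the co-occurring words for the homonym of '감기' are hypothetical stand-ins for the contents of the stored databases, and plain substring matching is assumed.

```python
REFERENCE_WORDS = ("인플루엔자", "독감", "감기", "기침", "플루")

# Hypothetical entries standing in for the well-wishing phrase database.
WELL_WISHING_PHRASES = ("감기 조심", "독감 조심")

# Hypothetical entries standing in for the homonym database: for each
# reference word, words customarily used with its homonym.
HOMONYM_COMPANIONS = {"감기": ("눈은", "눈이", "눈꺼풀")}


def is_valid_post(text):
    """Return True when a post mentions the disease in a usable way."""
    matched = [word for word in REFERENCE_WORDS if word in text]
    if not matched:
        return False  # no reference word at all
    if any(phrase in text for phrase in WELL_WISHING_PHRASES):
        return False  # a greeting, not a report of illness
    for word in matched:
        companions = HOMONYM_COMPANIONS.get(word, ())
        if any(companion in text for companion in companions):
            return False  # the reference word is probably a homonym
    return True


def extract_valid_posts(stored_posts):
    """Keep only the posts that count as valid SNS data."""
    return [post for post in stored_posts if is_valid_post(post)]
```

Substring matching is a deliberately crude choice here; a production system would more likely rely on morphological analysis, which the patent's 'language ontology' databases hint at but do not detail.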

The marker selection unit (150) calculates the usage frequency of each word in the valid SNS data, analyzes the correlation between those word frequencies and the suspected-patient fraction data, and determines which words are markers based on the result of the analysis.

Here, the marker selection unit (150) performs a LASSO analysis on the per-word usage frequencies in the valid SNS data and the suspected-patient fraction data. It may then set as markers either the words whose correlation coefficient from the LASSO analysis is at or above a preset threshold, or the words that fall within a preset rank range when ordered from the highest coefficient down.

Specifically, the marker selection unit (150) splits each item of individual SNS data in the valid SNS data into words at the blank spaces and counts how many times each word is used within the valid SNS data. Using the word conjugation data stored in advance, it then finds the base form matching each word and sums the usage counts of all words sharing the same base form to obtain the usage frequency per word.

The marker selection unit (150) then ranks the words from the most frequently used down and selects the words within a preset rank range as marker candidate words.
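
The counting and ranking just described might be sketched as follows. The `BASE_FORMS` mapping is a hypothetical excerpt of the word conjugation data, and whitespace tokenization is taken directly from the description.

```python
from collections import Counter

# Hypothetical excerpt of the word conjugation data: surface form -> base form.
BASE_FORMS = {
    "걸렸다": "걸리다",
    "걸려서": "걸리다",
    "아파서": "아프다",
}


def word_frequencies(valid_posts):
    """Count words by base form across all valid posts."""
    counts = Counter()
    for post in valid_posts:
        for token in post.split():  # split at blank spaces
            counts[BASE_FORMS.get(token, token)] += 1
    return counts


def marker_candidates(valid_posts, top_n):
    """Return the top_n most frequently used base forms."""
    ranked = word_frequencies(valid_posts).most_common(top_n)
    return [word for word, _ in ranked]
```

Words missing from the mapping are counted under their surface form, which is one possible fallback; the patent does not say how unknown forms are handled.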

For example, Table 1 below shows the marker candidate words (i.e., the words used most often in the valid SNS data) that the marker selection unit (150) selected from the valid SNS data for 'influenza'.

[Table 1]

The result of the LASSO analysis that the marker selection unit (150) performs on the usage frequency of each marker candidate word and the suspected-patient fraction data can be presented as in FIG. 2.

FIG. 2 is a view for explaining how markers are selected according to the result of the LASSO analysis in an embodiment of the present invention.

FIG. 2 shows, for each marker candidate word, the correlation coefficient (weighting coefficient) obtained from the LASSO analysis against the suspected-patient fraction data. Each word in the valid SNS data is thus given a positive ('+') or negative ('-') coefficient according to its relationship with the suspected-patient fraction data, and the marker selection unit (150) determines as markers the words whose coefficient is at or above a preset threshold, or the words within a preset rank range when ordered from the highest coefficient down.
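
To make the selection rule concrete, the sketch below fits LASSO coefficients by cyclic coordinate descent in plain Python and keeps the words whose coefficient reaches a threshold. The penalty strength `alpha`, the iteration count, and the function names are assumptions; the patent names LASSO but does not describe the solver or its settings.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coefficients(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + alpha*||b||_1 on centred data.

    X is a list of rows (one row per period, one column per candidate word);
    y holds the suspected-patient fraction for the same periods.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [value - y_mean for value in y]  # beta starts at zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant column carries no information
            rho = sum(
                Xc[i][j] * (residual[i] + Xc[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, alpha) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                beta[j] = new_beta
    return beta


def select_markers(words, coefficients, threshold):
    """Keep the words whose coefficient is at or above the threshold."""
    return [w for w, c in zip(words, coefficients) if c >= threshold]
```

Because the L1 penalty drives the coefficients of uninformative words exactly to zero, the threshold rule and the top-rank rule both end up operating on a short list of surviving words.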

Returning to FIG. 1, the marker matrix generation unit (160) generates a marker matrix by extracting, from the stored SNS data, the number of times the markers were used in each preset period unit (for example, per day or per week) and the total number of items of stored SNS data in that period unit.

Specifically, the marker matrix generation unit (160) generates a marker matrix containing, for each period unit, the total number of times the markers were used in the stored SNS data and the total number of items of stored SNS data in the same period. This matrix serves as the input data for the disease spread prediction unit (170).
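
One way to picture the marker matrix is as a table keyed by period, as in the sketch below. The `(period, text)` input format, the dictionary layout, and exact token matching are assumptions made for illustration; in particular, the posts are assumed to have been normalised to base forms already.

```python
from collections import defaultdict


def build_marker_matrix(stored_posts, markers):
    """Aggregate marker usage and post totals per period.

    stored_posts: iterable of (period, text) pairs, e.g. ("2012-W01", "...").
    markers: set of marker words.
    Returns {period: {"marker_count": int, "total_posts": int}}.
    """
    matrix = defaultdict(lambda: {"marker_count": 0, "total_posts": 0})
    for period, text in stored_posts:
        row = matrix[period]
        row["total_posts"] += 1
        row["marker_count"] += sum(1 for token in text.split() if token in markers)
    return dict(matrix)
```

Note that the totals count every stored post in the period, not only the valid ones, which matches the description of the denominator used for the usage ratio.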

The disease spread prediction unit (170) uses the marker matrix generated by the marker matrix generation unit (160) to calculate the usage ratio of the markers in the stored SNS data, and calculates a disease spread prediction value corresponding to the prediction target date based on the marker usage ratio and the suspected-patient fraction data.

Specifically, using the total number of marker uses in the stored SNS data and the total number of items of stored SNS data in the same period, both taken from the marker matrix, the disease spread prediction unit (170) calculates the marker usage ratio as the total number of marker uses divided by the total number of items of stored SNS data. It then performs linear regression analysis on the marker usage ratio values and the suspected-patient fraction data to calculate the disease spread prediction value.
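
The ratio and regression steps can be sketched with ordinary least squares on a single predictor, as below. Treating the combined marker ratio as the only explanatory variable is an assumption; the patent says only that linear regression is applied to the usage ratio values and the suspected-patient fraction data.

```python
def usage_ratio(marker_count, total_posts):
    """Marker uses divided by the number of stored posts in the same period."""
    return marker_count / total_posts if total_posts else 0.0


def fit_line(ratios, fractions):
    """Ordinary least squares for fraction = intercept + slope * ratio."""
    n = len(ratios)
    mean_x = sum(ratios) / n
    mean_y = sum(fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in ratios)
    if sxx == 0.0:
        raise ValueError("usage ratios must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, fractions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_fraction(intercept, slope, ratio):
    """Predicted suspected-patient fraction for a given marker usage ratio."""
    return intercept + slope * ratio
```

In use, the line would be fitted on periods for which the official statistics are already available and then evaluated at the marker usage ratio observed for the prediction target date.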

Meanwhile, FIG. 3 is an example showing the correlation between the markers of the SNS data and the ILI fraction data in an embodiment of the present invention.

FIG. 3 shows the correlation between the result of a linear regression prediction analysis using a plurality of markers and the suspected-patient fraction data for the same period. By performing a linear regression analysis against the suspected-patient fraction data based on the number of uses of disease-related words in the SNS data, an SNS-based disease spread prediction pattern that is nearly identical to the actual disease spread pattern can be obtained. For reference, the more markers are used, the closer the correlation between the calculated disease spread prediction value and the suspected-patient fraction data.

Meanwhile, it can be assumed that the occurrence pattern of influenza is similar from one epidemic season to the next, and that the writing habits of SNS users also remain constant.

Accordingly, when data for a previous epidemic season of the disease (for example, in units of years or months) is not available, the disease spread prediction unit 170 may reuse (that is, substitute) as previous-season data at least one of the following: the suspected-patient fraction data and SNS data of the current epidemic season, and the marker data extracted from the suspected-patient fraction data and SNS data of the current epidemic season. This is because, as shown in FIGS. 4 to 6 below, data from previous epidemic seasons is important for predicting the disease spread situation accurately.

For example, FIG. 4 is an example showing the result of a disease spread prediction analysis based on SNS data and ILI fraction data in an embodiment of the present invention.

FIG. 4(a) shows the result of a linear regression analysis using only data from the current epidemic season, and FIG. 4(b) shows the result of a linear regression analysis using data that includes the previous epidemic season. For each period, FIG. 4 plots the disease spread prediction value by date against the influenza-like illness (ILI) fraction data. As FIG. 4 shows, the result of the linear regression analysis that includes previous-season data correlates more closely with the ILI fraction data than the result of the analysis using only current-season data.

In addition, FIG. 5 is an example showing disease spread prediction results by period of the data used in the disease spread prediction analysis in an embodiment of the present invention.

FIG. 5 shows the results of calculating the disease spread prediction value using data from (n-1) to (n-21), where n is the prediction target date. FIG. 5(a), (b), (c), and (d) are graphs comparing the disease prediction results for day (n-1), day (n-7), day (n-14), and day (n-21), respectively, with the influenza-like illness (ILI) fraction data.

FIG. 6 is an example showing the validity of the prediction analysis results by period of the data used in the disease spread prediction analysis in an embodiment of the present invention.

FIG. 6 shows the R² value (the explanatory power of the regression equation) for the disease spread prediction values calculated by linear regression analysis using data from day (n-1) to day (n-21), as in FIG. 5. As FIG. 6 shows, the R² values of the disease spread prediction values calculated using data up to (n-21) are each 0.8 or higher, which indicates that the predictions are valid (that is, meaningful).
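The R² check can be illustrated with the standard coefficient of determination. The sketch below assumes the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares; the function name `r_squared` and the sample values are hypothetical, and only the 0.8 threshold is taken from the description.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical reported fraction values and model predictions.
observed = [5.0, 9.0, 13.0, 17.0]
predicted = [6.0, 8.0, 14.0, 16.0]

score = r_squared(observed, predicted)
# Threshold of 0.8 as stated for FIG. 6.
is_valid = score >= 0.8
```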

Returning to FIG. 1, the disease spread information providing unit 180 generates and provides SNS-based disease spread information from the disease spread prediction value calculated by the disease spread prediction unit 170. For example, the disease spread information providing unit 180 may output the change in the calculated disease spread prediction value by date in graph form.

The disease spread information providing unit 180 may provide the disease spread information in a format suitable for output through a user terminal (not shown) or through its own built-in output unit. For reference, the user terminal (not shown) may be a personal terminal or the terminal of any of various users, such as a national disease control agency or a vaccine development company.

FIG. 7 is a graph showing an example of the disease spread information provided in an embodiment of the present invention.

FIG. 7 shows the predicted disease spread values for the case in which the prediction target date is a day for which no influenza-like illness (ILI) fraction data has been published. That is, as in FIG. 7, the disease spread situation can be predicted even for dates on which the Korea Centers for Disease Control and Prevention has not published an ILI fraction data report. For reference, even when suspected-patient fraction data is collected on the same date, that data in fact reports the disease spread situation from roughly two weeks before the update date. Calculating the disease spread prediction value from SNS data, as in the embodiment of the present invention, therefore allows the disease spread situation to be predicted in real time.

The description above explains how the disease spread situation prediction apparatus 100 according to an embodiment of the present invention predicts the disease spread situation using disease statistics data and SNS data for influenza, but the spread of other diseases can be predicted in the same way.

For example, FIG. 8 is an example showing, for other types of disease, the correlation between SNS data and disease statistics data and the results of the disease spread prediction analysis obtained by applying the analysis according to an embodiment of the present invention.

FIG. 8 shows the results of predicting the disease spread situation for 'conjunctivitis' and 'eye disease'.

FIG. 8(a) shows the result of a linear regression analysis of the marker use ratio value, using a single marker each for conjunctivitis and eye disease, against the average epidemic conjunctivitis patient data (that is, the disease statistics data). FIG. 8(b) shows the result of a linear regression analysis of the marker use ratio value using a plurality of markers against the average epidemic conjunctivitis patient data.

In addition, the disease spread information provided by the disease spread information providing unit 180 includes graph-form information comparing the changes in the disease spread prediction value and the suspected-patient fraction data, and each user terminal can use the disease spread information according to its own purpose.

As one example, a disease control agency may define national infectious disease crisis alert levels and issue the applicable alert level based on the disease spread prediction value received from the disease spread information providing unit 180. It may also use the prediction as a basis for national disease policy decisions and output the response measures that correspond to each alert level.

As another example, an individual user's smartphone may request a disease situation prediction through a pre-installed application, and may then receive from the disease spread situation prediction apparatus 100, and output, disease spread information that includes the disease spread prediction value for the prediction target date.

FIG. 9 is a view showing an example of a disease spread information screen provided through a smartphone in an embodiment of the present invention.

FIG. 9 shows a case in which the disease spread information providing unit 180 has provided a user terminal (a smartphone) with disease spread information that includes the disease spread prediction value for the prediction target date. As in FIG. 9, the screen of the user terminal (smartphone) may display the influenza-like illness (ILI) fraction data and the influenza spread prediction data by period in graph form so that they can be compared. In addition, the disease spread situation by region, derived from GPS information included in the SNS data, may be mapped onto a map and displayed, and information on influenza self-diagnosis and preventive measures may be output.

Hereinafter, a disease spread situation prediction method according to an embodiment of the present invention is described in detail with reference to FIG. 10.

FIG. 10 is a flowchart illustrating a disease spread situation prediction method according to an embodiment of the present invention.

First, the original SNS data provided by a social network service server and the disease statistics data for a preset disease provided by a disease statistics server are collected and stored (S110).

The original SNS data (that is, the 'stored SNS data' that is collected and stored) is updated daily (for example, hourly or once a day), and the disease statistics data may be updated from the disease statistics server on preset reporting dates, on a monthly, weekly, or daily basis.

In addition, among the original SNS data, SNS data that satisfies at least one of the following conditions may be collected and stored: use of a preset specific language, absence of a hyperlink, absence of preset spam words, and absence of data written by a user other than the account holder for each SNS account.
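A collection filter along these lines might look like the sketch below. It is only an illustration under several assumptions: the target language is taken to be Korean and detected by the presence of Hangul syllables, the spam word list is invented, and each post is assumed to carry hypothetical `account` and `author` fields so that content written by someone other than the account holder can be detected. The source requires at least one of the conditions; this sketch applies all four together.

```python
import re

# Assumed target language: Korean, detected via Hangul syllables.
HANGUL = re.compile(r"[\uac00-\ud7a3]")
HYPERLINK = re.compile(r"https?://|www\.", re.IGNORECASE)
# Hypothetical spam word list ("advertisement", "free coupon").
SPAM_WORDS = ("광고", "무료 쿠폰")


def passes_collection_filter(post):
    """Return True if the post satisfies all four collection conditions."""
    text = post["text"]
    in_target_language = HANGUL.search(text) is not None
    has_hyperlink = HYPERLINK.search(text) is not None
    has_spam_word = any(word in text for word in SPAM_WORDS)
    written_by_account_holder = post["author"] == post["account"]
    return (in_target_language and not has_hyperlink
            and not has_spam_word and written_by_account_holder)


raw_posts = [
    {"account": "a", "author": "a", "text": "감기 걸려서 열이 나요"},          # kept
    {"account": "b", "author": "b", "text": "독감 정보 http://example.com"},  # hyperlink
    {"account": "c", "author": "c", "text": "독감 예방 광고"},                # spam word
    {"account": "d", "author": "x", "text": "독감 유행이라네요"},             # other author
    {"account": "e", "author": "e", "text": "I have a cold"},                # other language
]
stored_posts = [post for post in raw_posts if passes_collection_filter(post)]
```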

Next, valid SNS data containing at least one preset reference word related to the disease is extracted from the collected original SNS data (S120).

Here, the valid SNS data may be extracted by removing, from the SNS data that contains a reference word, SNS data that contains a preset well-wishing (greeting-type) sentence, phrase, or word associated with the reference word, or by removing SNS data that contains a homonym of the reference word.
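The extraction rule can be sketched as a simple substring filter. The reference words, greeting phrases, and homonym phrases below are hypothetical English stand-ins chosen for readability; the source does not list the actual entries, and a real system would likely need proper tokenization rather than substring matching.

```python
# Hypothetical lists; the source does not enumerate them.
REFERENCE_WORDS = ["flu", "cold"]
GREETING_PHRASES = ["don't catch a cold", "stay safe from the flu"]
HOMONYM_PHRASES = ["cold weather", "cold beer"]


def is_valid_post(text):
    """Keep posts with a reference word, minus greetings and homonym uses."""
    lowered = text.lower()
    if not any(word in lowered for word in REFERENCE_WORDS):
        return False
    if any(phrase in lowered for phrase in GREETING_PHRASES):
        return False  # well-wishing message, not a report of illness
    if any(phrase in lowered for phrase in HOMONYM_PHRASES):
        return False  # reference word used with a different meaning
    return True


posts = [
    "I have the flu and a high fever",
    "Don't catch a cold this weekend!",
    "Nothing beats a cold beer",
    "Great lunch today",
]
valid_posts = [post for post in posts if is_valid_post(post)]
```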

Next, based on the number of uses of each word in the valid SNS data and the disease statistics data, markers for the disease spread prediction analysis are selected from among the words in the valid SNS data (S130).

Specifically, the frequency of use of each word in the valid SNS data is calculated, a LASSO analysis is performed on the per-word frequencies of use and the disease statistics data, and the markers are determined based on the per-word correlation coefficients obtained from the LASSO analysis.
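The marker selection step can be illustrated with a small LASSO solver. The sketch below uses cyclic coordinate descent on centered data and minimizes (1/2n)·||y − Xb||² + alpha·||b||₁; it is a generic textbook formulation, not necessarily the procedure used in the source. The function names, the `alpha` value, the sample frequencies, and the rule that any word with a non-zero coefficient becomes a marker are all assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coefficients(columns, target, alpha, iterations=200):
    """Cyclic coordinate descent for LASSO on centered columns and target."""
    n = len(target)
    mean_y = sum(target) / n
    residual = [v - mean_y for v in target]  # all coefficients start at 0
    centered = []
    for col in columns:
        mean_col = sum(col) / n
        centered.append([v - mean_col for v in col])
    coefs = [0.0] * len(centered)
    for _ in range(iterations):
        for j, col in enumerate(centered):
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the partial residual.
            rho = sum(v * (r + v * coefs[j]) for v, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, alpha) / z
            delta = new_coef - coefs[j]
            if delta:
                residual = [r - v * delta for r, v in zip(residual, col)]
                coefs[j] = new_coef
    return coefs


# Hypothetical per-period word frequencies and disease statistics.
words = ["fever", "cough", "lunch"]
frequencies = [
    [1, 2, 3, 4, 5, 6],   # fever
    [0, 1, 1, 2, 2, 3],   # cough
    [3, 1, 4, 1, 5, 2],   # lunch
]
statistics = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

coefs = lasso_coefficients(frequencies, statistics, alpha=0.5)
# Assumed rule: words with a non-zero coefficient are kept as markers.
markers = [word for word, coef in zip(words, coefs) if coef != 0.0]
```

In this made-up data the L1 penalty keeps only the word that explains the statistics best and drives the redundant and unrelated words to exactly zero, which is the property that makes LASSO usable as a selection step.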

Then, a linear regression analysis is performed on the marker use ratio in the stored SNS data and the disease statistics data (S140), and the disease spread prediction value for the prediction target date is calculated (S150).

Specifically, the marker use ratio value is calculated from the total number of marker uses in the stored SNS data and the total number of stored SNS data items, and a linear regression analysis is then performed on the marker use ratio value and the suspected-patient fraction data to calculate the disease spread prediction value.

Next, disease spread information that includes the calculated disease spread prediction value is provided (S160).

For example, a graph comparing the changes in the disease spread prediction value and the suspected-patient fraction data by date may be provided as the disease spread information.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or another transport mechanism, and include any information delivery media.

In addition, although the methods and systems of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The scope of the present invention is indicated by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and from their equivalents should be construed as falling within the scope of the present invention.

10: SNS server
20: Disease statistics server
100: Disease spread situation prediction apparatus
110: SNS data collection unit
120: Disease statistics data collection unit
130: Database
140: Valid data extraction unit
150: Marker selection unit
160: Marker matrix generation unit
170: Disease spread prediction unit
180: Disease spread information providing unit

Claims

1. An apparatus for predicting a disease spread situation, comprising:
a disease statistics data collection unit that collects and stores suspected-patient fraction data for a preset disease provided by a disease statistics server;
an SNS data collection unit that collects and stores original SNS data provided by a social network service (SNS) server;
a valid data extraction unit that extracts, from the stored SNS data, valid SNS data containing at least one preset reference word related to the disease;
a marker selection unit that determines markers from among the words of the valid SNS data based on the degree of correlation between the frequency of use of each word in the valid SNS data and the suspected-patient fraction data;
a marker matrix generation unit that generates a marker matrix containing the number of uses of the markers in the stored SNS data of a preset period and the total number of stored SNS data items of the preset period;
a disease spread prediction unit that calculates the use ratio of the markers in the stored SNS data using the marker matrix, and calculates a disease spread prediction value corresponding to a prediction target date based on the use ratio of the markers and the suspected-patient fraction data; and
a disease spread information providing unit that generates and provides disease spread information based on the calculated disease spread prediction value.

2. The apparatus of claim 1,
wherein the marker selection unit
performs a least absolute shrinkage and selection operator (LASSO) analysis on the frequency of use of each word in the valid SNS data and the suspected-patient fraction data, and
determines the markers based on the per-word correlation coefficients of the valid SNS data calculated through the LASSO analysis.

3. The apparatus of claim 1,
wherein the marker selection unit
determines the base form of each word in the valid SNS data and calculates the total frequency of use per base form as the per-word frequency of use.

4. The apparatus of claim 1,
wherein the disease spread prediction unit
calculates a marker use ratio value based on the total number of uses of the markers in the stored SNS data of the preset period and the total number of stored SNS data items of the preset period, and
calculates the disease spread prediction value by performing a linear regression analysis on the marker use ratio value and the suspected-patient fraction data.

5. The apparatus of claim 4,
wherein the disease spread prediction unit
performs the linear regression analysis by using the stored SNS data and the suspected-patient fraction data of a preset first period as data of a second period preceding the first period.

6. The apparatus of claim 1,
wherein the SNS data collection unit
selects and collects, from among the original SNS data, SNS data that satisfies preset extraction conditions, and
wherein the extraction conditions include at least one of:
absence of a hyperlink, absence of a preset spam word, and absence of data written by a user other than the account holder for each SNS account.

7. The apparatus of claim 1,
wherein the valid data extraction unit
performs at least one of: removing, from among the original SNS data, SNS data that contains a sentence, phrase, or word preset in association with the reference word; and removing SNS data that contains a homonym of the reference word.

8. The apparatus of claim 1,
wherein the disease spread information providing unit
outputs the change in the disease spread prediction value by date in graph form.

9. A method for predicting a disease spread situation by means of a disease spread situation prediction apparatus, the method comprising:
(a) collecting and storing original SNS data provided by a social network service (SNS) server and suspected-patient fraction data for a preset disease provided by a disease statistics server;
(b) extracting, from the stored SNS data, valid SNS data containing at least one preset reference word related to the disease;
(c) determining markers from among the words of the valid SNS data based on the degree of correlation between the frequency of use of each word in the valid SNS data and the suspected-patient fraction data;
(d) calculating a use ratio value of the markers in the stored SNS data within a preset period unit;
(e) performing a linear regression analysis on the use ratio value of the markers and the suspected-patient fraction data to calculate a disease spread prediction value corresponding to a prediction target date; and
(f) generating disease spread information based on the calculated disease spread prediction value and providing it to a user terminal.

10. The method of claim 9,
wherein step (c) comprises:
(c-1) calculating the frequency of use of each word in the valid SNS data;
(c-2) performing a least absolute shrinkage and selection operator (LASSO) analysis on the per-word frequencies of use and the suspected-patient fraction data; and
(c-3) determining the markers based on the per-word correlation coefficients of the valid SNS data calculated as a result of the LASSO analysis.

11. The method of claim 9,
wherein step (d)
calculates the marker use ratio value based on the total number of uses of the markers in the stored SNS data within the preset period unit and the total number of stored SNS data items.

12. The method of claim 9,
further comprising, before step (d),
generating a marker matrix containing the number of uses of the markers in the stored SNS data of the preset period and the total number of stored SNS data items of the preset period,
wherein step (d)
calculates the use ratio value of the markers using the marker matrix.

13. The method of claim 9,
wherein step (a)
collects, from among the original SNS data, SNS data that satisfies at least one of the following conditions: use of a preset specific language, absence of a hyperlink, absence of a preset spam word, and absence of data written by a user other than the account holder for each SNS account.

14. The method of claim 9,
wherein step (b)
extracts the valid SNS data by performing at least one of: removing, from among the original SNS data, SNS data that contains a sentence, phrase, or word preset in association with the reference word; and removing SNS data that contains a homonym of the reference word.

15. The method of claim 9,
wherein step (f)
outputs the change in the disease spread prediction value by date in graph form.