KR20190070430A

KR20190070430A - Estimation method and apparatus for name of disease based on big data analysis

Info

Publication number: KR20190070430A
Application number: KR1020170170917A
Authority: KR
Inventors: 심재희; 김태형; 유지석; 김태경
Original assignee: (주)엔텔스
Priority date: 2017-12-13
Filing date: 2017-12-13
Publication date: 2019-06-21

Abstract

A method for estimating a disease diagnosis based on a big data analysis comprises the steps of: generating structured data by performing text-mining on unstructured data of a target electronic medical record (EMR) by using a computer apparatus; extracting a keyword from the structured data according to a correlation between a specific disease and the keyword by using the computer apparatus; and classifying a target diagnosis corresponding to the keyword by inputting the keyword to a previously provided naive Bayes classifier by using the computer apparatus.

Description

TECHNICAL FIELD [0001] The present invention relates to a method and an apparatus for estimating a disease diagnosis name based on big data analysis.

The technique described below relates to deriving a disease diagnosis name from electronic medical records based on big data analysis techniques.

Many hospitals currently use electronic medical records (EMR). At present, an EMR is little more than a digitized version of what used to be written by hand, and it is used mainly for patient management. An EMR is entered by its author on a computer or smart device, so parts of its content may contain errors.

In particular, in the defense medical system used by military units, data accuracy is low and various errors occur when diagnosis names are entered or updated, which makes accurate management of diagnostic records difficult. The classification accuracy of infectious disease diagnosis names is especially hard to verify with conventional structured data alone, such as a patient's physical examination records, rank, and medical history, because such diseases arise suddenly regardless of the patient's history or have symptoms that are difficult to tell apart.

US Patent No. 9,477,756

An EMR consists mainly of unstructured data. It is therefore difficult to use an EMR for purposes other than the course of treatment itself. Furthermore, as noted above, an EMR may contain inaccurate information such as a misdiagnosis.

The technique described below aims to provide a method for deriving a disease diagnosis name from an EMR based on big data analysis.

A big data analysis-based method for estimating a disease diagnosis name includes: generating, by a computer device, structured data by text mining the unstructured data of a target electronic medical record (EMR); extracting, by the computer device, keywords from the structured data according to a correlation between a specific disease and keywords; and inputting, by the computer device, the keywords into a naive Bayes classifier prepared in advance to classify a target diagnosis name corresponding to the keywords. The naive Bayes classifier includes information defining the relationship between keywords and diseases.

The technique described below improves diagnostic accuracy by deriving an accurate disease diagnosis name from an EMR. It also analyzes the EMR to provide keywords or patterns associated with a disease, thereby providing new medical information.

FIG. 1 is an example of a big data analysis-based disease diagnosis name estimation system.
FIG. 2 is an example of a flowchart for a big data analysis-based disease diagnosis name estimation method.
FIG. 3 is an example of a process of generating a big data analysis model from electronic medical records.
FIG. 4 is an example of a process of estimating a disease diagnosis name from an input electronic medical record.
FIG. 5 is an example of a process of structuring the unstructured data of an electronic medical record.
FIG. 6 is an example of visualizing disease information estimated from electronic medical records.

The technique described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the technique to particular embodiments, and the technique should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of the technique described below, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

As used herein, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "comprises" indicate the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not exclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Before the drawings are described in detail, it should be made clear that the division of components in this specification is merely a division according to the main function each component performs. That is, two or more of the components described below may be combined into one, or one component may be divided into two or more according to more finely subdivided functions. In addition, each component described below may perform some or all of the functions of other components besides its own main function, and some of the main functions of a component may be carried out exclusively by another component.

In addition, in performing a method or an operating method, the steps of the method may occur in an order different from the stated order unless a specific order is clearly stated in context. That is, the steps may be performed in the stated order, substantially simultaneously, or in the reverse order.

The technique described below estimates the disease diagnosis name indicated by a particular medical record, based on the information recorded in the electronic medical record (EMR). In this way, it can update a diagnosis name that was entered incorrectly in the EMR. It can also estimate a patient's current disease in advance from the EMR. The technique makes use of techniques used in big data analysis.

FIG. 1 is an example of a big data analysis-based disease diagnosis name estimation system 100. The system 100 includes a client device 110, an analysis server 120, and an EMR DB 130.

The client device 110 sends the analysis server 120 a command to estimate a disease diagnosis name based on EMR data. The client device 110 can also receive the diagnosis name resulting from the analysis, disease-related information, and the like. The client device 110 corresponds to a user device such as a PC or a smart device.

The EMR DB 130 is a device that stores EMRs. An EMR stores the information generated during the diagnosis and treatment of a particular patient. The EMR DB 130 therefore stores a very large amount of information for each patient.

Meanwhile, the client device 110 may transmit to the analysis server 120 information about a specific EMR DB (identifier, IP address, etc.) and about specific EMR data stored in that EMR DB (data identifier, patient identifier, etc.).

The analysis server 120 accesses specific EMR data in a specific EMR DB 130 based on the command and information received from the client device 110. The analysis server 120 estimates the corresponding disease diagnosis name based on the received EMR data, and may use a model for disease diagnosis name estimation to do so. The analysis model DB 125 holds the model that takes an EMR as input and estimates a disease diagnosis name. Although FIG. 1 shows the analysis model DB 125 as a separate object, it may be included in the analysis server 120.

Furthermore, the technique of estimating a disease diagnosis name based on an EMR may run not only on a networked system but also on an individual computer device such as a PC. That is, a computer device may execute a program for disease diagnosis name estimation, receive a given EMR as input, and estimate the diagnosis name. For convenience, however, the following description is based on the system 100 described with reference to FIG. 1. The analysis server 120 is itself a kind of computer device, so it can be said that a computer device estimates the disease diagnosis name based on the EMR.

FIG. 2 is an example of a flowchart for a big data analysis-based disease diagnosis name estimation method 200. The computer device first generates a big data-based analysis model (210). Here, the computer device encompasses the analysis server 120 described above, an individual computer device, and the like. The analysis model is the model later used to estimate a disease diagnosis name from EMR data. For example, the naive Bayes classification model used for text classification may be used.

With the analysis model in place, the computer device receives the EMR data for which a diagnosis name is to be estimated. This EMR data is hereinafter called the target EMR. The target EMR consists basically of unstructured data. For analysis, the computer device therefore first performs text mining on the target EMR to generate structured data (220).

The computer device extracts disease-related keywords from the structured EMR data (230). The computer device inputs the extracted keywords into the analysis model to estimate the disease diagnosis name (240). The computer device may further process the estimated disease diagnosis name and the information on the diagnosed disease (250). For example, it may visualize and output related information together with the disease diagnosis name.

FIG. 3 is an example of a process 300 for generating a big data analysis model from electronic medical records. The EMR DB 130 stores various information related to examination and treatment. The EMR DB 130 includes admission/discharge records, medical/nursing records, surgical/nursing records, emergency records, and the like. An admission record is written about the patient at admission; it records the circumstances leading to admission, the patient's condition, and so on. A discharge record is written about the patient at discharge; it records the patient's condition at discharge, the subsequent course of treatment, and so on. A medical/nursing record is written about the patient during care; it records the patient's condition over time, measured vital signs (blood pressure, body temperature, pulse, etc.), prescribed medication, and so on. A surgical/nursing record stores information generated during surgery and during the patient's treatment after surgery. An emergency record stores information about emergency care given to the patient. The types of EMR shown in FIG. 3 are only one example; various information about a patient may be stored under various classifications.

A sample EMR is input data used to build the analysis model. FIG. 3 illustrates an example in which the analysis server 120 builds the analysis model, although a separate computer device other than the analysis server 120 may build it in advance. The analysis server 120 receives a sample EMR (310). The part enclosed by the dotted rectangle on the right of FIG. 3 corresponds to the process performed by the analysis server 120. The analysis server 120 preprocesses the unstructured data in the sample EMR (320), thereby converting it into structured data. The preprocessing is described later. The analysis server 120 then extracts disease-related keywords from the structured EMR data (330) and matches the extracted keywords with the disease diagnosis name recorded in that EMR (340). For example, the analysis server 120 may generate a table that matches keywords with the associated disease diagnosis names. The analysis server 120 may store the generated model in the analysis model DB. The analysis model then uses the generated table to estimate disease diagnosis names. The analysis model may also be generated by another route without using EMR data. For example, it may be generated using information widely known in the medical field.

FIG. 4 is an example of a process 400 for estimating a disease diagnosis name from an input electronic medical record. The part enclosed by the dotted rectangle on the right of FIG. 4 corresponds to the process performed by the analysis server 120. The target EMR is the input data for disease diagnosis name estimation. The analysis server 120 receives the target EMR from the EMR DB 130 (410). The analysis server 120 preprocesses the unstructured data in the target EMR (420), thereby converting it into structured data. The analysis server 120 then extracts disease-related keywords from the structured EMR data (430).

FIG. 5 is an example of a process 500 for structuring the unstructured data of an electronic medical record, that is, for preprocessing unstructured data to generate structured data. The analysis server 120 may preprocess the unstructured data using a natural language processing program. For example, if the text of the EMR data is Korean, the unstructured data may be preprocessed using KoNLP (Korean Natural Language Processing), a natural language analysis tool.

The analysis server 120 removes stopwords from the input text data (510). The analysis server 120 then splits the input text data into sentence units (520). For each sentence unit, the analysis server 120 extracts the root of the sentence (word) or converts the sentence (word) to its base form (530). Finally, the analysis server 120 may assign a weight to the extracted keywords. For example, it may assign a term frequency-inverse document frequency (TF-IDF) weight to each keyword. TF-IDF is a weight used in text mining: given a set of documents, it is a statistic that indicates how important a word is within a particular document. TF (term frequency) indicates how often a particular word appears in a document. If the word is used frequently across the document set, however, the word is simply common. This is measured by DF (document frequency), and the reciprocal of DF is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF.
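
The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration, not the implementation used by the analysis server 120. It assumes each EMR has already been reduced to a list of keywords, and it uses the common log-scaled form of IDF, log(N / DF), whereas the description only says that IDF is the reciprocal of DF. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # DF: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of the term in this document
        # Assumption: log-scaled IDF, a common variant of the reciprocal of DF.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical keyword lists extracted from three EMRs.
docs = [
    ["fever", "headache", "diarrhea"],
    ["fever", "cough"],
    ["fever", "headache"],
]
weights = tf_idf(docs)
```

In this toy corpus "fever" appears in every record, so its weight is zero, while "diarrhea", which appears in only one record, receives the largest weight in the first document.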

FIG. 5 illustrates, as an example, the EMR text "Visited the hospital after severe fever, headache, and diarrhea symptoms persisted for 7 days during reward leave." Stopwords such as grammatical particles are removed from the input text data. In FIG. 5 the removed stopwords are marked with rectangular boxes. Splitting the text with stopwords removed into units gives "reward leave / severe / fever / headache / diarrhea / symptom / 7 days / persisting". Each unit is then converted to its root or base form, which gives "reward leave, be severe, fever, headache, diarrhea, symptom, 7 days, persist". Finally, the frequency with which each keyword appears is computed and a TF-IDF value is assigned to each keyword.
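
The preprocessing steps of FIG. 5 (stopword removal, splitting into units, conversion to base form) can be sketched on an English rendering of the example sentence. This is a toy illustration only: the description uses a Korean morphological analyzer (KoNLP), while the `STOPWORDS` set, the `BASE_FORMS` table, and the `preprocess` function below are hypothetical stand-ins written by hand for this one sentence.

```python
import re

# Hypothetical resources; a real system would obtain them from a
# morphological analyzer rather than from hand-written tables.
STOPWORDS = {"during", "and", "for", "the", "a", "so", "visited", "hospital"}
BASE_FORMS = {"symptoms": "symptom", "persisted": "persist"}

def preprocess(text):
    """Tokenize, drop stopwords, and map each remaining token to its base form."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # split into word units
    kept = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [BASE_FORMS.get(t, t) for t in kept]       # convert to base form

record = ("During reward leave, severe fever, headache and diarrhea "
          "symptoms persisted for 7 days, so visited the hospital")
tokens = preprocess(record)
print(tokens)
```

The resulting token list is what a TF-IDF weighting step would then consume.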

The analysis server 120 generates structured data through the text mining process shown in FIG. 5. Although not shown in FIG. 4, the analysis server 120 need not use every word in the structured data and may instead select the keywords related to a specific disease. The analysis server 120 may select keywords from the structured data using the correlation between a specific disease and keywords. The correlation may be a function that defines the degree of correlation between a specific disease and a specific keyword, based on how frequently the keyword occurs in EMRs for that disease and on the pattern in which the keyword appears for that disease. For example, the correlation may be prepared in advance by determining how frequently a particular word (keyword) appears in EMRs related to a specific disease and/or the pattern in which the keyword appears across stages such as diagnosis, treatment, and hospitalization (for example, whether it is concentrated in the initial diagnosis, during hospitalization, or late in treatment).
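
One simple way to realize the frequency part of such a correlation is sketched below. It is an assumption-laden illustration: the description does not give a formula, so the score used here (the fraction of a disease's EMRs that contain the keyword), the threshold, and the names `keyword_correlation` and `select_keywords` are all hypothetical. The stage-dependent appearance pattern mentioned above is not modeled.

```python
from collections import Counter

def keyword_correlation(records, disease):
    """Map each keyword to the fraction of the disease's records containing it.

    records: list of (diagnosis_name, keyword_list) pairs.
    """
    matching = [set(kws) for dx, kws in records if dx == disease]
    counts = Counter(k for kws in matching for k in kws)
    return {k: c / len(matching) for k, c in counts.items()}

def select_keywords(candidates, correlation, threshold=0.5):
    """Keep only candidates whose correlation score reaches the threshold."""
    return [k for k in candidates if correlation.get(k, 0.0) >= threshold]

# Hypothetical sample EMRs, already reduced to keywords.
records = [
    ("cold", ["fever", "cough"]),
    ("cold", ["fever", "headache"]),
    ("enteritis", ["diarrhea", "fever"]),
]
corr = keyword_correlation(records, "cold")
selected = select_keywords(["reward", "fever", "7", "headache"], corr, threshold=0.6)
```

With these sample records only "fever" survives the 0.6 threshold, since it occurs in every "cold" record while "headache" occurs in half of them.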

Returning to FIG. 4, the analysis server 120 estimates the disease diagnosis name based on the extracted keywords (440). Using the table described with reference to FIG. 3, the analysis server 120 can classify the diagnosis name based on the keywords that appear in the currently input EMR. For example, the analysis server 120 may estimate the diagnosis name corresponding to the input EMR using the keyword frequencies and the table that matches keywords with disease diagnosis names. The analysis server 120 may use the naive Bayes classification technique to derive the most probable diagnosis name given the keywords. The lower part of FIG. 4 shows an example of the diagnosis names estimated by naive Bayes classification from the keywords extracted from the input EMR. According to FIG. 4, the disease most strongly associated with the input keywords is the common cold (91%). The analysis server 120 therefore derives the common cold as the disease diagnosis name for the current EMR.
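
The classification step can be sketched with a small multinomial naive Bayes classifier written from scratch. This is a minimal sketch under stated assumptions, not the classifier of the analysis server 120: the keyword counts per diagnosis play the role of the keyword-to-diagnosis table, add-one (Laplace) smoothing is an added assumption, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Build the counts table from (diagnosis_name, keyword_list) pairs."""
    class_counts = Counter()
    keyword_counts = defaultdict(Counter)
    for dx, kws in samples:
        class_counts[dx] += 1
        keyword_counts[dx].update(kws)
    return class_counts, keyword_counts

def classify(keywords, class_counts, keyword_counts):
    """Return the posterior probability of each diagnosis given the keywords."""
    vocab = {k for counts in keyword_counts.values() for k in counts}
    total = sum(class_counts.values())
    log_scores = {}
    for dx in class_counts:
        n = sum(keyword_counts[dx].values())
        score = math.log(class_counts[dx] / total)  # log prior
        for k in keywords:
            # Add-one smoothing keeps unseen keywords from zeroing the product.
            score += math.log((keyword_counts[dx][k] + 1) / (n + len(vocab)))
        log_scores[dx] = score
    # Normalize the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    unnorm = {dx: math.exp(s - top) for dx, s in log_scores.items()}
    z = sum(unnorm.values())
    return {dx: v / z for dx, v in unnorm.items()}

# Hypothetical sample EMRs used to build the table.
samples = [
    ("cold", ["fever", "headache", "cough"]),
    ("cold", ["fever", "cough"]),
    ("enteritis", ["diarrhea", "fever"]),
]
class_counts, keyword_counts = train(samples)
posterior = classify(["fever", "cough"], class_counts, keyword_counts)
best = max(posterior, key=posterior.get)
```

For the keywords "fever" and "cough" this toy model assigns "cold" a posterior of 0.8, so `best` is "cold"; the 91% figure in FIG. 4 comes from the patent's own example and is not reproduced by this data.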

The naive Bayes classifier is a supervised learning classifier used for spam filtering and keyword search. Its basic principle is to apply Bayes' theorem to conditional probabilities and to classify the input vector under the assumption that the elements making up a document or data item occur independently of one another. Conditional probability is the probability that event B occurs given that event A has occurred; it equals the probability that A and B occur together divided by the probability that A occurs. Because the operation of naive Bayes classification is well known in the field, a detailed description is omitted. In summary, the analysis server 120 uses a table prepared in advance that matches keywords with disease diagnosis names, converts the EMR into structured data, and then estimates the disease diagnosis name most relevant to that EMR based on the extracted keywords.
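
For reference, the conditional probability and the naive Bayes decision rule described above can be written as follows. The symbols are notation introduced here for illustration and do not appear in the original text: d denotes a candidate diagnosis name and k_1, ..., k_n denote the keywords extracted from the EMR.

```latex
P(B \mid A) = \frac{P(A \cap B)}{P(A)}

P(d \mid k_1, \dots, k_n) \;\propto\; P(d) \prod_{i=1}^{n} P(k_i \mid d)

\hat{d} = \arg\max_{d} \; P(d) \prod_{i=1}^{n} P(k_i \mid d)
```

The second line follows from Bayes' theorem together with the independence assumption on the keywords; the third selects the diagnosis name with the highest resulting probability.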

Meanwhile, the analysis server 120 may process the diagnosis name or disease-related information to generate new information. For example, the analysis server 120 may generate data that visualizes disease-related information and provide the visualization data to the client device. The client device 110 can display the visualization data on a screen. FIG. 6 is an example of visualizing disease information estimated from electronic medical records; it shows a kind of word network. In text mining, a network form is often used to present results visually, and words that appear frequently can be drawn larger. The analysis server 120 may display the keywords extracted from the EMR that are highly relevant to the disease closer to the disease diagnosis name, or larger.

Furthermore, if the disease diagnosis name currently recorded in the EMR differs from the diagnosis name estimated by the analysis model, the analysis server 120 may update the record with the estimated diagnosis name. In this case, the analysis server 120 transmits the estimated diagnosis name to the EMR DB 130, and the EMR DB 130 may record the new disease diagnosis name in the target EMR.
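
The update step can be sketched as a comparison between the recorded and the estimated diagnosis name. This is a hypothetical illustration: the in-memory dictionary standing in for the EMR DB 130, the field names, the function `reconcile_diagnosis`, and in particular the minimum-probability guard are assumptions that the description does not mention.

```python
def reconcile_diagnosis(emr_db, record_id, estimated, probability,
                        min_probability=0.9):
    """Overwrite the recorded diagnosis when the estimate differs.

    Returns True if the record was updated. The probability guard is an
    assumption added here; the description only compares the two names.
    """
    record = emr_db[record_id]
    if record["diagnosis"] != estimated and probability >= min_probability:
        record["previous_diagnosis"] = record["diagnosis"]  # keep the old value
        record["diagnosis"] = estimated
        return True
    return False

# Hypothetical stand-in for the EMR DB.
emr_db = {"rec-1": {"diagnosis": "enteritis"}}
changed = reconcile_diagnosis(emr_db, "rec-1", "cold", 0.91)
```

Keeping the previous value alongside the new one is a design choice made for this sketch so that an overwritten diagnosis name remains traceable.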

With the big data-based disease diagnosis name estimation method described above, a disease can be diagnosed for a specific patient based on the currently recorded EMR. In addition, an error in the disease diagnosis name currently recorded in the EMR can be detected and the diagnosis name updated.

In addition, the big data-based disease diagnosis name estimation method described above may be implemented as a program (or application) containing an executable algorithm that can be run on a computer. The program may be stored and provided on a non-transitory computer readable medium.

A non-transitory readable medium is not a medium that stores data only briefly, such as a register, cache, or memory, but a medium that stores data semi-permanently and can be read by a device. Specifically, the various applications or programs described above may be stored and provided on a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, or ROM.

The present embodiments and the accompanying drawings merely illustrate part of the technical idea included in the technique described above. It is evident that all modifications and specific embodiments that a person skilled in the art could readily derive within the scope of the technical idea contained in the specification and drawings fall within the scope of rights of the technique described above.

100: Big data analysis-based disease diagnosis name estimation system
110: Client device
120: Analysis server
125: Model DB
130: EMR DB

Claims

Generating structured data by text mining unstructured data of a target electronic medical record (EMR) using the natural language analysis tool KoNLP (Korean Natural Language Processing);
Extracting a keyword from the structured data according to a correlation between a specific disease and a keyword;
Inputting the keyword into a Naive Bayes classifier provided in advance and classifying a target diagnosis name corresponding to the keyword; and
Visualizing the target diagnosis name classified for the target electronic medical record and outputting it to a screen,
wherein the Naive Bayes classifier includes information defining a relationship between a keyword and a disease,
wherein the correlation is information that defines a correlation between the specific disease and a specific keyword based on the frequency with which the specific keyword occurs for the specific disease and the pattern in which the specific keyword appears in the electronic medical record, and
wherein the step of generating the structured data includes removing stop words from the unstructured data, identifying sentences in the unstructured data, converting words included in the unstructured data into a stem or base form, and assigning a TF-IDF (Term Frequency - Inverse Document Frequency) weight to the converted words.

The computer device generating structured data by text mining unstructured data of a target electronic medical record (EMR);
Extracting a keyword from the structured data according to a correlation between a specific disease and a keyword; and
Inputting the keyword into a Naive Bayes classifier provided in advance and classifying a target diagnosis name corresponding to the keyword,
wherein the Naive Bayes classifier comprises information defining a relationship between a keyword and a disease.

The method of claim 2,
wherein the electronic medical record comprises at least one of an admission record, a discharge record, a medical record, an operation record, a nursing record, and an emergency record.

The method of claim 2,
wherein the step of generating the structured data comprises:
Removing stop words from the unstructured data;
Identifying sentences in the unstructured data;
Converting words included in the unstructured data into a stem or base form; and
Assigning a TF-IDF (Term Frequency - Inverse Document Frequency) weight to the words converted into the stem or base form.

The method of claim 2,
wherein the computer device generates the structured data using KoNLP (Korean Natural Language Processing), a natural language analysis tool.

The method of claim 2,
wherein the correlation is determined based on the frequency with which a specific keyword occurs for a specific disease in the electronic medical record and the pattern in which the specific keyword appears for the specific disease.

The method of claim 2,
further comprising the computer device visualizing the target diagnosis name classified for the target electronic medical record and outputting the visualized name to a screen.

The method of claim 2,
further comprising the computer device changing the diagnosis name stored in the target electronic medical record to the target diagnosis name.

A computer-readable recording medium on which is recorded a program for executing, on a computer, the big data analysis-based disease diagnosis name estimation method according to any one of claims 2 to 8.

A database for storing electronic medical records (EMR);
A client device for transmitting a data analysis command for a target electronic medical record among the electronic medical records stored in the database; and
An analysis server that receives the analysis command, accesses the database to receive the target electronic medical record, converts the unstructured data included in the received target electronic medical record into structured data, analyzes the structured data according to a correlation between a specific disease and a keyword to extract a keyword, inputs the keyword into a Naive Bayes classifier prepared in advance, and derives a target diagnosis name corresponding to the keyword.

The system of claim 10,
wherein the analysis server uses natural language analysis software to remove stop words from the unstructured data, convert words included in the unstructured data into a stem or base form, and assign TF-IDF (Term Frequency - Inverse Document Frequency) weights, thereby generating the structured data.

The system of claim 10,
wherein the analysis server transmits to the client device data obtained by visualizing the target diagnosis name classified for the target electronic medical record.

The system of claim 10,
wherein the analysis server transmits the target diagnosis name to the database.