KR102359652B1

KR102359652B1 - System and Method for Classifying Disease Using Class Association Rules

Info

Publication number: KR102359652B1
Application number: KR1020190153057A
Authority: KR
Inventors: 전광길
Original assignee: 인천대학교 산학협력단
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2022-02-08
Also published as: KR20210064630A

Abstract

A disease classification system and method using class association rules selects the highest-priority rule set so that strong association rules between symptoms and syndromes can be applied, and establishes a classification model of Tibetan medicine syndromes from it. This supports the diagnosis and treatment of Tibetan medicine syndromes and improves the accuracy of general clinical diagnosis.

Description

System and Method for Classifying Disease Using Class Association Rules

The present invention relates to a disease classification system, and more particularly to a disease classification system and method using class association rules that diagnose and treat Tibetan medicine syndromes by selecting a high-priority rule set, so as to apply strong association rules between symptoms and syndromes, and establishing a classification model of Tibetan medicine syndromes from it.

Tibetan medicine is a traditional domestic medicine with a long history. An important part of Chinese medicine, it has a distinctive theoretical system and clinical efficacy.

Tibetan medicine provides medical advice to farmers in northwestern China, but because of the complexity of the cultural background and the local environment, its clinical research base is weak and the description of clinical characteristics has not been standardized.

Because Tibet lies on a high plateau, the incidence of altitude sickness is high, and repeated altitude sickness increases the risk of stomach cancer.

Because Tibet has no clinical principles grounded in medical theory, the subjective judgment of the physician weighs heavily in the diagnosis and treatment of disease.

Korean Patent No. 10-1950112

To solve these problems, an object of the present invention is to provide a disease classification system and method using class association rules that diagnose and treat Tibetan medicine syndromes by selecting a high-priority rule set, so as to apply strong association rules between symptoms and syndromes, and establishing a classification model of Tibetan medicine syndromes from it.

To achieve the above object, a disease classification system using class association rules according to a feature of the present invention includes:

an input unit that receives Tibetan medical record documents from a medical information database unit;

a preprocessing unit that removes stopwords irrelevant to association rule generation from the medical record documents, extracts from the words in the documents a plurality of medical words that are the targets of association rule generation, and uses the extracted medical words to generate a data set suitable for data mining;

a data mining unit that sets disease symptoms as condition attributes and medical syndromes as decision attributes in the data set, and generates class association rules between the disease symptoms and the medical syndromes using an association rule mining algorithm; and

a medical disease classification unit that determines rule priorities by sorting the class association rules either by confidence and support or by Lift, and sets the rule set with the highest priority as the classification model.

A disease classification method using class association rules according to a feature of the present invention includes:

receiving Tibetan medical record documents from a medical information database unit;

removing stopwords irrelevant to association rule generation from the medical record documents, extracting from the words in the documents a plurality of medical words that are the targets of association rule generation, and using the extracted medical words to generate a data set suitable for data mining;

setting disease symptoms as condition attributes and medical syndromes as decision attributes in the data set, and generating class association rules between the disease symptoms and the medical syndromes using an association rule mining algorithm; and

determining rule priorities by sorting the class association rules either by confidence and support or by Lift, and setting the rule set with the highest priority as the classification model.

With the configuration described above, the present invention selects a high-priority rule set so that strong association rules between symptoms and syndromes can be applied, and establishes a classification model of Tibetan medicine syndromes from it, which supports the diagnosis and treatment of Tibetan medicine syndromes and improves the accuracy of general clinical diagnosis.

FIG. 1 is a diagram schematically showing the configuration of a disease classification system using class association rules according to an embodiment of the present invention.
FIG. 2 is a diagram showing the concept of the connection between symptoms and syndromes according to an embodiment of the present invention.
FIG. 3 is a diagram showing a disease classification method using class association rules according to an embodiment of the present invention.
FIG. 4 is a diagram comparing classification models based on the Conf1Sup2 and Lift sorting methods according to an embodiment of the present invention.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a diagram schematically showing the configuration of a disease classification system using class association rules according to an embodiment of the present invention, FIG. 2 is a diagram showing the concept of the connection between symptoms and syndromes according to an embodiment of the present invention, and FIG. 3 is a diagram showing a disease classification method using class association rules according to an embodiment of the present invention.

The disease classification system 100 using class association rules according to an embodiment of the present invention includes an input unit 110, a preprocessing unit 120, a data mining unit 130, and a medical disease classification unit 140.

The input unit 110 receives Tibetan medical record documents (clinical treatment documents, prescription records, medical opinion documents, and medical laboratory data) from the medical information database unit 101.

The preprocessing unit 120 removes stopwords irrelevant to association rule generation from the medical record documents, extracts from the words in the documents the medical words that are the targets of association rule generation, and uses the extracted medical words to generate a plurality of data sets suitable for data mining (S100, S101).
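The preprocessing step can be sketched in Python as follows. This is a minimal illustration and not the patent's implementation: the stopword list, the medical vocabulary, the sample records, and the function names are all hypothetical, and plain whitespace tokenization stands in for whatever tokenizer real Tibetan medical records would need.

```python
# Hypothetical stopword list and medical vocabulary; the patent lists neither.
STOPWORDS = {"the", "patient", "has", "and", "with", "a", "of"}
MEDICAL_VOCAB = {"cough", "sputum", "headache", "nausea"}


def extract_medical_words(text: str) -> set[str]:
    """Tokenize on whitespace, drop stopwords, keep only known medical words."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    return {t for t in tokens if t and t not in STOPWORDS and t in MEDICAL_VOCAB}


def build_dataset(records: list[tuple[str, str]]) -> list[tuple[frozenset[str], str]]:
    """Turn (free text, syndrome label) records into (symptom set, syndrome) cases."""
    return [(frozenset(extract_medical_words(text)), label) for text, label in records]


# Made-up records for illustration only.
records = [
    ("The patient has cough and sputum.", "cold"),
    ("Headache with nausea", "altitude sickness"),
]
dataset = build_dataset(records)
```

Each resulting case pairs a set of symptoms (the condition attributes) with one syndrome label (the decision attribute), which is the shape the mining step expects.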

Unlike a statistical method that assigns classification categories simply from the frequency of keywords appearing in a document, the data mining unit 130 applies the Apriori algorithm, an association rule analysis technique, to the extracted medical words to extract sets of keywords that are associated across documents, and can thereby generate classification rules from keywords that are semantically representative of each category.

The data mining unit 130 sets the symptoms of altitude sickness as condition attributes and the Tibetan medicine syndromes as decision attributes in the data set. This yields rules of the form Symptoms => Syndrome (if the symptoms are present, then the syndrome).

The data mining unit 130 generates class association rules between the symptoms of altitude sickness and the Tibetan medicine syndromes using the Apriori association rule mining algorithm (S102). The thresholds for support and confidence are set in advance.

Association rules are useful patterns, extracted from transactions such as sets of medical words, that are expressed as condition-conclusion statements of the form "X => Y (support, confidence)".

Here, X is an item making up a transaction, and "X => Y" means that when item X (the condition) occurs, item Y (the conclusion) occurs with it.

The goal of association rule mining is to find the rules that satisfy preset support and confidence thresholds.

Support is the proportion of all transactions in which items X and Y occur together, and confidence is the proportion of the transactions containing item X that also contain item Y.
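As an illustration of these two measures, a short Python sketch follows. It assumes each transaction is a set of items; the function names and the sample transactions are made up for the example.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain every item of x and of y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Among the transactions containing x, the fraction that also contain y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0  # x never occurs, so the rule has no evidence
    return sum(1 for t in with_x if y <= t) / len(with_x)


# Illustrative transactions only.
transactions = [
    frozenset({"cough", "sputum", "cold"}),
    frozenset({"cough", "cold"}),
    frozenset({"cough", "pneumonia"}),
    frozenset({"headache", "altitude sickness"}),
]
x, y = frozenset({"cough"}), frozenset({"cold"})
sup = support(transactions, x, y)
conf = confidence(transactions, x, y)
```

Here cough and cold occur together in 2 of the 4 transactions, so the support is 0.5, and 2 of the 3 transactions containing cough also contain cold, so the confidence is 2/3.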

Class association rules are a subset of association rules in which the consequent is fixed to the classification attribute, so they can be used to distinguish the category of a data object.

Class association rules reflect not only internal relevance but also the predictive power of the rules; to predict data objects, rules that satisfy the minimum support and minimum confidence are mined.

Given a transaction data set D, the data mining unit 130 defines a class association rule (CAR) on the data set as in Equation 1 below.

A rule r has the form X => Y, where I is the set of all items (symptoms) in the data set D and C is the set of all class (syndrome) labels in D. For example, the elements of I include cough, sputum, and runny nose, and the elements of C include pneumonia and the common cold. D is the full set of all cases.
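The image for Equation 1 is not reproduced in this text. A form consistent with the definitions of X, Y, I, and C given here, offered as a reconstruction and not as the patent's exact notation, would be:

```latex
r : X \Rightarrow Y, \qquad X \subseteq I, \quad Y \in C
```

That is, the antecedent X is a set of symptoms drawn from I and the consequent Y is a single syndrome label from C.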

The data mining unit 130 calculates the support of a class association rule using Equation 2 below.

The support Sup(r) of a class association rule is the ratio of the number of cases in the data set D that match both the antecedent X and the classification attribute Y of the rule to the total number of cases in D.

The support of rule r is defined as in Equation 2: a case counts as 1 when the rule's symptom and syndrome co-occur (are linked) in it. All such cases are counted, and the count is divided by the size of the full set D.

For example, given cough -> cold, runny nose -> cold, and sputum -> undetermined, Sup is 2/3.
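The image for Equation 2 is not reproduced in this text. Based on the verbal definition, and borrowing the count(r.x) notation that the description uses for Equation 3, a plausible reconstruction (an assumption, not the patent's verbatim formula) is:

```latex
\mathrm{Sup}(r) = \frac{\mathrm{count}(r.x \cap r.y)}{|D|}
```

Here count(r.x ∩ r.y) is the number of cases in D that match both the antecedent and the class of rule r, and |D| is the total number of cases. In the example, 2 of the 3 cases link a symptom to the cold, which gives 2/3.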

The data mining unit 130 calculates the confidence of a class association rule using Equation 3 below.

The confidence Conf(r) of a class association rule is the ratio of the number of cases in D that match both the antecedent X and the classification attribute Y of the rule to the number of cases in D that match the antecedent X.

Equation 3 is Equation 2 with D replaced by count(r.x).

For example, given cough -> cold, runny nose -> cold, and sputum -> undetermined, Conf is 2/3.
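The image for Equation 3 is not reproduced in this text. Following the statement that it is the support formula with D replaced by count(r.x), a plausible reconstruction (an assumption, not the patent's verbatim formula) is:

```latex
\mathrm{Conf}(r) = \frac{\mathrm{count}(r.x \cap r.y)}{\mathrm{count}(r.x)}
```

Here count(r.x) is the number of cases in D that match the antecedent of rule r, and count(r.x ∩ r.y) is the number that match both the antecedent and the class.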

A class association rule is a rule that satisfies the conditions of Equations 1, 2, and 3, where minSup and minConf are the preset minimum support threshold and minimum confidence threshold (Equation 4).
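The image for Equation 4 is not reproduced in this text. From the description of minSup and minConf as minimum thresholds, the condition presumably takes the following form (a reconstruction, not the patent's verbatim formula):

```latex
\mathrm{Sup}(r) \ge \mathit{minSup}, \qquad \mathrm{Conf}(r) \ge \mathit{minConf}
```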

To calculate the support of each class, the data mining unit 130 uses a breadth-first iterative search, compares the result with the minimum support threshold, and then finds the frequent 1-itemsets of the class association rules (the item sets that occur most frequently).

Next, the data mining unit 130 uses the frequent 1-itemsets of the class association rules to determine the candidate 2-itemsets and the frequent 2-itemsets. This process continues until no further k-itemsets of class association rules can be generated. In this way, the candidate set of class association rules, of the form "if these symptoms, then this syndrome", is determined on the basis of Tibetan medical diagnosis and treatment.

In other words, the data mining unit 130 searches the unit transactions of a large data set for frequently occurring event types: it finds the frequent item sets that meet or exceed the minimum support threshold, generates candidate sets from the frequent items, and repeats until no new frequent item set is produced. From the items in the resulting item sets, it then generates association rules between items that meet or exceed the minimum confidence threshold.
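The level-wise search described above can be sketched in Python as follows. This is a simplified, unoptimized illustration of Apriori-style mining restricted to rules whose consequent is a class label; the function name, the data layout (a list of (symptom set, syndrome) cases), and the sample data are assumptions made for the example, not the patent's implementation.

```python
from itertools import combinations


def mine_cars(dataset, min_sup, min_conf):
    """Return (antecedent, syndrome, support, confidence) for every rule that
    meets both thresholds, searching antecedents of size 1, 2, ... in turn."""
    n = len(dataset)
    labels = {label for _, label in dataset}
    items = sorted({i for symptoms, _ in dataset for i in symptoms})

    def counts(x, y):
        cover = sum(1 for symptoms, _ in dataset if x <= symptoms)
        hit = sum(1 for symptoms, label in dataset if x <= symptoms and label == y)
        return cover, hit

    rules = []
    # Level 1 candidates: every single symptom paired with every syndrome.
    level = {(frozenset([i]), y) for i in items for y in labels}
    k = 1
    while level:
        frequent = set()
        for x, y in level:
            cover, hit = counts(x, y)
            if hit / n >= min_sup:
                frequent.add((x, y))
                if cover and hit / cover >= min_conf:
                    rules.append((x, y, hit / n, hit / cover))
        # Join frequent k-item antecedents with the same label into (k+1)-item
        # candidates, keeping only those whose k-item subsets are all frequent.
        k += 1
        level = set()
        for (x1, y1), (x2, y2) in combinations(frequent, 2):
            if y1 != y2:
                continue
            cand = x1 | x2
            if len(cand) == k and all(
                (frozenset(sub), y1) in frequent for sub in combinations(cand, k - 1)
            ):
                level.add((cand, y1))
    return rules


# Illustrative cases only.
dataset = [
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"cough"}), "pneumonia"),
    (frozenset({"headache"}), "altitude sickness"),
]
rules = mine_cars(dataset, min_sup=0.5, min_conf=0.6)
```

With these thresholds the sketch keeps three rules, all predicting the cold: from cough, from sputum, and from cough together with sputum. The loop stops once a level produces no new candidates, which mirrors the "until no new frequent item set is produced" condition.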

Because the rules in the candidate CAR set are redundant and differ in priority, redundant rules must be removed to improve the accuracy of the classification model, and the CARs with high priority are selected for the classification model.

The medical disease classification unit 140 removes class association rules according to their ability to classify samples correctly: for each rule r in the candidate CAR set, it checks whether the training set contains a case covered by rule r.

If rule r correctly classifies at least one sample, the medical disease classification unit 140 marks it as a potentially useful class association rule.

Once all rules in the candidate CAR set or all cases in the training set have been traversed, the medical disease classification unit 140 discards the rules that cannot correctly classify any case.
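A minimal Python sketch of this pruning step follows. It assumes a rule is an (antecedent, syndrome) pair and a training case is a (symptom set, syndrome) pair; the function name and the sample data are hypothetical. It implements only what the text states, namely keeping rules that correctly classify at least one training case.

```python
def prune_rules(rules, training_set):
    """Keep the rules that correctly classify at least one training case."""
    kept = []
    for antecedent, syndrome in rules:
        useful = any(
            antecedent <= symptoms and label == syndrome
            for symptoms, label in training_set
        )
        if useful:
            kept.append((antecedent, syndrome))  # potentially useful rule
    return kept


# Illustrative data only.
candidate_rules = [
    (frozenset({"cough"}), "cold"),
    (frozenset({"headache"}), "cold"),
]
training_set = [
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"headache"}), "altitude sickness"),
]
kept = prune_rules(candidate_rules, training_set)
```

The second rule matches a training case by its symptoms but predicts the wrong syndrome, so it is dropped.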

The medical disease classification unit 140 prioritizes the CARs by each of two methods, Conf1Sup2, which uses confidence as the first sort key and support as the second, and Lift, and selects a high-priority rule set to establish the classification model (S103). The two sorting methods are described below.

(1) Sorting class association rules by confidence and support is called Conf1Sup2.

When the confidence of r1 (rule 1) is greater than the confidence of r2 (rule 2);

When the confidence of r1 (rule 1) equals the confidence of r2 (rule 2), and the support of r1 is greater than the support of r2;

When the confidence of r1 (rule 1) equals the confidence of r2 (rule 2), the support of r1 equals the support of r2, and the scope of r1 is smaller than that of r2 (r2 contains r1).

Here, Sup denotes the support of a rule and Conf its confidence. If r1 and r2 satisfy any one of the above conditions, r1 takes priority, written r1 > r2.
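The Conf1Sup2 ordering can be expressed as a sort key, sketched here in Python. The Rule type, its field names, and the sample values are hypothetical. The third tie-breaker is an assumed reading of "r2 contains r1": the sketch prefers the rule with the smaller antecedent, which approximates the containment test with a simple size comparison.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset
    syndrome: str
    sup: float
    conf: float


def conf1sup2_key(r: Rule):
    # Higher confidence first, then higher support, then the smaller
    # antecedent (assumed interpretation of the third condition).
    return (-r.conf, -r.sup, len(r.antecedent))


# Illustrative rules only.
rules = [
    Rule(frozenset({"cough"}), "cold", 0.50, 0.67),
    Rule(frozenset({"cough", "sputum"}), "cold", 0.50, 1.00),
    Rule(frozenset({"sputum"}), "cold", 0.50, 1.00),
    Rule(frozenset({"headache"}), "altitude sickness", 0.25, 1.00),
]
ranked = sorted(rules, key=conf1sup2_key)
```

In the ranked list the sputum rule comes first, followed by the cough-and-sputum rule (same confidence and support, larger antecedent), then the headache rule (lower support), then the cough rule (lower confidence).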

(2) Sorting class association rules by Lift.

The condition is that the lift of r1 (rule 1) > the lift of r2 (rule 2). If r1 and r2 satisfy this condition, r1 has higher priority than r2, written r1 > r2.
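The text does not define how lift is computed. The sketch below assumes the standard definition from association rule mining, the confidence of the rule divided by the overall frequency of its class, and ranks rules by it. The function name, data layout, and sample data are made up for illustration.

```python
def lift(dataset, antecedent, syndrome):
    """Lift of antecedent -> syndrome: confidence / relative frequency of syndrome."""
    n = len(dataset)
    cover = sum(1 for symptoms, _ in dataset if antecedent <= symptoms)
    hit = sum(
        1 for symptoms, label in dataset if antecedent <= symptoms and label == syndrome
    )
    class_freq = sum(1 for _, label in dataset if label == syndrome) / n
    if cover == 0 or class_freq == 0:
        return 0.0  # no evidence for the rule
    return (hit / cover) / class_freq


# Illustrative cases and rules only.
dataset = [
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"cough"}), "pneumonia"),
    (frozenset({"headache"}), "altitude sickness"),
]
rules = [
    (frozenset({"cough"}), "cold"),
    (frozenset({"sputum"}), "cold"),
    (frozenset({"headache"}), "altitude sickness"),
]
ranked = sorted(rules, key=lambda r: lift(dataset, *r), reverse=True)
```

A lift above 1 means the symptoms make the syndrome more likely than its base rate. The headache rule ranks first here because its syndrome is rare in the sample yet always follows the symptom.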

The medical disease classification unit 140 sorts the class association rules by confidence and support (Conf1Sup2) or by Lift, and selects the sorting method that gives the more accurate classification results for the classification model.

As shown in Table 1 below and FIG. 4, when the class association rules are sorted by Conf1Sup2, accuracy increases with the number of rules over the range 40 to 400, and when the number of selected rules is 80 it is at most 80.78%.

However, when the number of rules is between 400 and 800, accuracy does not change as the number of rules increases.

When the class association rules are sorted by Lift, accuracy increases with the number of rules over the range 40 to 400 and peaks at 82.72% with 400 rules. However, when the number of rules is between 400 and 800, accuracy falls as the number of rules increases.

Comparing the two sorting methods, the classification model built with Lift ordering is more accurate than the one built with Conf1Sup2 ordering at every rule count except 200.

Comparing the two sorting methods, accuracy is highest, at 82.72%, when the class association rules are sorted by Lift and the number of rules is 400. The medical disease classification unit 140 can therefore establish the classification model by choosing, for a given number of rules, whichever of Conf1Sup2 and Lift gives the higher accuracy.

That is, as shown in FIG. 4, the medical disease classification unit 140 sorts the class association rules by Conf1Sup2 when the number of rules is 200, and by Lift when the number of rules is 40, 80, 400, 800, or 1000.

In addition, the medical disease classification unit 140 can sort the class association rules by Conf1Sup2 or Lift, select rule sets of different sizes (40, 80, 200, 400, 800, or 1000 rules) in descending order of priority, and establish a classification model from each.

Using the classification model based on class association rules, the medical disease classification unit 140 runs a verification process that checks whether cases are correctly classified, that is, whether disease symptoms are mapped to the associated Tibetan medicine syndromes, according to the rule priority of the classification model (S104).
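The selection of a rule-set size and the verification step can be sketched together in Python. The sketch assumes the model is an ordered rule list in which the first matching rule decides the class, a common design for rule-based classifiers that the text implies but does not spell out. The function names and the tiny data set are hypothetical, and the rule counts 1 to 3 stand in for the 40 to 1000 used in the description.

```python
def classify(ranked_rules, symptoms, default=None):
    """Return the syndrome of the highest-priority rule whose antecedent matches."""
    for antecedent, syndrome in ranked_rules:
        if antecedent <= symptoms:
            return syndrome
    return default  # no rule fires


def accuracy(ranked_rules, cases, k):
    """Accuracy on labelled cases of the model built from the top-k rules."""
    model = ranked_rules[:k]
    correct = sum(1 for symptoms, label in cases if classify(model, symptoms) == label)
    return correct / len(cases)


# Illustrative rules (already sorted by priority) and verification cases.
ranked_rules = [
    (frozenset({"headache"}), "altitude sickness"),
    (frozenset({"sputum"}), "cold"),
    (frozenset({"cough"}), "pneumonia"),
]
cases = [
    (frozenset({"cough", "sputum"}), "cold"),
    (frozenset({"cough"}), "pneumonia"),
    (frozenset({"headache"}), "altitude sickness"),
    (frozenset({"cough", "sputum"}), "cold"),
]
scores = {k: accuracy(ranked_rules, cases, k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
```

Running the same comparison once per sorting method, and keeping the ordering and rule count with the best score, corresponds to the choice between Conf1Sup2 and Lift described above.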

The embodiments of the present invention described above are not implemented only through an apparatus and/or method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which the program is recorded, and such an implementation can readily be carried out by a person skilled in the art from the description of the embodiments above.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

100: disease classification system
101: medical information database unit
110: input unit
120: preprocessing unit
130: data mining unit
140: medical disease classification unit

Claims

A disease classification system using class association rules, comprising:
an input unit that receives Tibetan medical record documents from a medical information database unit;
a preprocessing unit that removes stopwords irrelevant to association rule generation from the medical record documents, extracts from the words in the documents a plurality of medical words that are the targets of association rule generation, and uses the extracted medical words to generate a data set suitable for data mining;
a data mining unit that sets disease symptoms as condition attributes and medical syndromes as decision attributes in the data set, and generates class association rules between the disease symptoms and the medical syndromes using an association rule mining algorithm; and
a medical disease classification unit that determines rule priorities by sorting the class association rules either by confidence and support or by Lift, and sets the rule set with the highest priority as the classification model,
wherein the medical disease classification unit determines the priority of the class association rules by sorting them by Lift according to Equation 2 below, and when r1 and r2 satisfy the condition of Equation 2, r1 has higher priority than r2, written r1 > r2.
[Equation 2]

The lift of r1 (rule 1) > the lift of r2 (rule 2).

The disease classification system using class association rules of claim 1,
wherein the medical disease classification unit determines priority by sorting the class association rules by confidence and support according to Equation 1 below, and if r1 and r2 satisfy any one of the conditions of Equation 1, r1 takes priority, written r1 > r2.
[Equation 1]

When the confidence of r1 (rule 1) is greater than the confidence of r2 (rule 2);
When the confidence of r1 (rule 1) equals the confidence of r2 (rule 2), and the support of r1 is greater than the support of r2;
When the confidence of r1 (rule 1) equals the confidence of r2 (rule 2), the support of r1 equals the support of r2, and the scope of r1 is smaller than that of r2 (r2 contains r1).
Sup denotes the support of a rule and Conf its confidence.

delete

The disease classification system using class association rules of claim 1,
wherein the data mining unit defines the class association rule on the data set as in Equation 3 below; defines the support of rule r as in Equation 4 below, counting 1 for each case in which the rule's symptom and syndrome co-occur (are linked), summing all such cases, and dividing by the full set D; and defines the confidence of rule r as in Equation 5 below.
[Equation 3]

X is the antecedent of the rule, Y is the classification attribute, I is the set of all items (symptoms) in the data set D, C is the set of all class (syndrome) labels in D, and D is the full set of all cases.
[Equation 4]

[Equation 5]

A disease classification method using class association rules, comprising:
receiving Tibetan medical record documents from a medical information database unit;
removing stopwords irrelevant to association rule generation from the medical record documents, extracting from the words in the documents a plurality of medical words that are the targets of association rule generation, and using the extracted medical words to generate a data set suitable for data mining;
setting disease symptoms as condition attributes and medical syndromes as decision attributes in the data set, and generating class association rules between the disease symptoms and the medical syndromes using an association rule mining algorithm; and
determining rule priorities by sorting the class association rules either by confidence and support or by Lift, and setting the rule set with the highest priority as the classification model,
wherein, in the step of setting the classification model,
the priority of the class association rules is determined by sorting them by Lift according to Equation 2 below, and when r1 and r2 satisfy the condition of Equation 2, r1 has higher priority than r2, written r1 > r2.
[Equation 2]

The lift of r1 (rule 1) > the lift of r2 (rule 2).

6. The method of claim 5,
wherein, in the step of setting the classification model,
priority is determined by sorting the class association rules by confidence and support according to Equation 1 below, and if r1 and r2 satisfy any one of the conditions of Equation 1, r1 takes priority, written r1 > r2.
[Equation 1]

delete