KR20230124536A - Device for digitizing checkup data based on data volume growth, method and program

Info

Publication number: KR20230124536A
Application number: KR1020230107923A
Authority: KR
Inventors: 임진환; 김하영; 신동엽
Original assignee: 주식회사 에임메드; 연세대학교 산학협력단
Priority date: 2021-01-12
Filing date: 2023-08-17
Publication date: 2023-08-25
Also published as: KR102575752B1; KR102686984B1; KR20220101822A; KR20240116432A

Abstract

The present invention provides a method for classifying checkup data based on a plurality of findings. The method includes: receiving checkup data; preprocessing the checkup data; generating an ensemble classification model based on the preprocessed checkup data; and obtaining a classification result for the checkup data using the trained ensemble classification model. In the generating step, static embedding and contextualized embedding are performed on the preprocessed checkup data, and the ensemble classification model is generated from the resulting data. The present invention can thereby improve checkup data classification performance.

Description

Apparatus, method, and program for digitizing examination data based on augmentation of the data volume

The present invention relates to an apparatus, method, and program for digitizing examination data based on augmentation of the data volume.

As machine learning technology has advanced, a variety of machine learning techniques have been developed and applied across many fields.

In natural language processing as well, various machine learning techniques have been developed, and various methods are being tried to classify text data more accurately.

Republic of Korea Patent Registration No. 10-1713487, registered on February 28, 2017

An object of the present invention is to provide an examination data classification device and classification method that classify examination data through an ensemble natural language processing model reflecting both word-level and context-level features.

Another object of the present invention is to provide an examination data classification device and classification method that provide an ensemble natural language processing model whose classification performance is improved by using a pretrained language model.

Another object of the present invention is to provide an examination data classification device and classification method that extract, from the examination data, sentences that contain a keyword corresponding to a diagnosis name and whose token count is no greater than the maximum token count of the corpus used for pretraining, and that provide an ensemble natural language processing model trained on the extracted sentences.

Another object of the present invention is to provide an examination data classification device and classification method that provide an ensemble natural language processing model trained through a preprocessing process tailored to examination data.

The problems to be solved by the present invention are not limited to those mentioned above; other problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above problems, a method for classifying examination data according to an embodiment of the present invention is a method, performed by an apparatus, for classifying examination data based on a plurality of findings, and includes: receiving examination data; preprocessing the examination data; generating an ensemble classification model based on the preprocessed examination data; and obtaining a classification result for the examination data using the trained ensemble classification model.

In the generating step, static embedding and contextualized embedding are performed on the preprocessed examination data, and the ensemble classification model is generated based on the data resulting from the static embedding and the contextualized embedding.

The generating step includes: generating at least one first classification model through training on the statically embedded data; generating at least one second classification model through training on the contextually embedded data; and generating the ensemble classification model based on the at least one first classification model and the at least one second classification model.

Generating the at least one second classification model includes: obtaining initial weights through training on a previously obtained corpus set; and generating the at least one second classification model by fine-tuning the initial weights through training on the contextually embedded data.

At least one of a CNN (Convolutional Neural Network) and a DCNN (Deep Convolutional Neural Network) may be used to train the at least one first classification model, and at least one of LSTM (Long Short-Term Memory), KoBERT (Korean Bidirectional Encoder Representations from Transformers), KoELECTRA (Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately), BERT (Bidirectional Encoder Representations from Transformers), and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) may be used to train the at least one second classification model.

The received examination data includes text data for which at least one of the maximum and the average number of tokens produced by tokenizing is greater than a preset reference number, and the reference number corresponds to the maximum number of tokens of a corpus included in the previously obtained corpus set.
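The token-count condition above can be illustrated with a minimal sketch. It assumes whitespace tokenization (`str.split`) as a stand-in for whatever tokenizer the pretrained model uses; the function names `token_stats` and `exceeds_reference` are hypothetical and do not come from the specification.

```python
from statistics import mean


def token_stats(texts, tokenize=str.split):
    """Return (maximum, mean) token counts over a list of texts."""
    counts = [len(tokenize(t)) for t in texts]
    return max(counts), mean(counts)


def exceeds_reference(texts, reference_count, tokenize=str.split):
    """True if the maximum or the mean token count is greater than the
    reference count (assumed here to be the pretraining corpus maximum)."""
    longest, average = token_stats(texts, tokenize)
    return longest > reference_count or average > reference_count
```

In practice the `tokenize` argument would be replaced by the tokenizer of the pretrained model, since token counts depend on the tokenization scheme.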

The examination data is produced for each of a plurality of diagnosis names, and the preprocessing step includes: searching the text data for a keyword corresponding to each of the plurality of diagnosis names; extracting sentences that contain a found keyword and whose token count is no greater than the reference number; and performing preprocessing based on the extracted sentences.
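A minimal sketch of the keyword-based sentence extraction described above follows. The punctuation-based sentence splitter, the whitespace tokenizer, and the names `split_sentences` and `extract_sentences` are assumptions made for illustration; the specification does not state how sentences are delimited or tokenized.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_sentences(text, keywords, reference_count, tokenize=str.split):
    """Keep sentences that contain at least one keyword and whose token
    count does not exceed the reference count."""
    selected = []
    for sentence in split_sentences(text):
        has_keyword = any(k in sentence for k in keywords)
        if has_keyword and len(tokenize(sentence)) <= reference_count:
            selected.append(sentence)
    return selected
```

For example, with the keywords `["Thyroid", "thyroid"]` and a reference count of 5, a short sentence mentioning the thyroid is kept, while an unrelated sentence or an over-long one is dropped.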

The examination data includes individual-findings text, numerical test results, and comprehensive-findings text, and is produced for each of a plurality of diagnosis names.

The numerical test results are preset structured data corresponding to each of the plurality of diagnosis names. For the comprehensive-findings text, at least one of the maximum and the average number of tokens produced by tokenizing is greater than the reference number, and the reference number may correspond to the maximum number of tokens of a corpus included in the previously obtained corpus set.

The preprocessing step includes: deleting from the examination data all special characters other than preset reference special characters; standardizing the spacing of the examination data from which the special characters have been deleted, based on a preset reference grammar; digitizing the standardized examination data; and padding the digitized data.
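The four preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the set `ALLOWED_SPECIALS`, the whitespace collapse used in place of the unspecified "reference grammar", and the vocabulary-index encoding are all hypothetical choices.

```python
import re

# Hypothetical set of reference special characters that are kept.
ALLOWED_SPECIALS = ".,%/()-"


def clean(text, allowed=ALLOWED_SPECIALS):
    # Keep word characters, whitespace, and allowed specials; delete the rest.
    pattern = r"[^\w\s" + re.escape(allowed) + r"]"
    text = re.sub(pattern, "", text)
    # Collapse whitespace runs as a simple stand-in for spacing standardization.
    return re.sub(r"\s+", " ", text).strip()


def build_vocab(token_lists):
    # Index 0 is reserved for padding and index 1 for unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode_and_pad(token_lists, vocab, length):
    """Map tokens to integer ids, then truncate or right-pad to a fixed length."""
    rows = []
    for tokens in token_lists:
        ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:length]
        rows.append(ids + [vocab["<pad>"]] * (length - len(ids)))
    return rows
```

Padding to a fixed length lets records of different token counts be stacked into one batch for the embedding and classification stages.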

The digitizing step includes: tokenizing the standardized examination data; counting the number of tokens generated by the tokenizing; and digitizing the standardized examination data based on the number of tokens.

The examination data is classified into a plurality of groups, each corresponding to one of the plurality of findings, and the digitizing step further includes augmenting the data of the smallest group, that is, the group with the least data among the plurality of groups.

The augmenting step includes: generating a plurality of sentences by splitting the data in the smallest group into sentence units; and increasing the amount of data in the smallest group by concatenating the sentences with one another.
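The augmentation of the smallest group can be sketched as below. The specification does not say how sentences are paired, so concatenating ordered pairs of sentences is an assumption, as are the naive sentence splitter, the optional `limit` parameter, and the function names.

```python
import re
from itertools import permutations


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def augment_smallest_group(groups, limit=None):
    """groups maps a finding label to a list of text records.
    Returns a copy in which the smallest group is extended with new
    records built by concatenating pairs of its sentences."""
    smallest = min(groups, key=lambda label: len(groups[label]))
    sentences = []
    for record in groups[smallest]:
        sentences.extend(split_sentences(record))
    existing = set(groups[smallest])
    combined = []
    for a, b in permutations(sentences, 2):
        candidate = a + " " + b
        if candidate not in existing:  # skip records that already exist
            combined.append(candidate)
            existing.add(candidate)
    if limit is not None:
        combined = combined[:limit]
    augmented = dict(groups)
    augmented[smallest] = groups[smallest] + combined
    return augmented
```

The number of generated records grows quadratically with the number of sentences, so the hypothetical `limit` parameter is one way to keep the augmented group from outgrowing the other groups.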

An examination data classification device according to an embodiment of the present invention includes: a communication unit; a database unit; and a control unit that classifies examination data stored in the database unit, or received from an external server through the communication unit, into a plurality of preset findings.

The control unit preprocesses the examination data, performs static embedding and contextualized embedding on the preprocessed examination data, generates the ensemble classification model based on the data resulting from the static embedding and the contextualized embedding, and obtains a classification result for the examination data using the trained ensemble classification model.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may be further provided.

According to the present invention as described above, the following effects can be obtained.

First, because examination data is classified through an ensemble natural language processing model reflecting both word-level and context-level features, classification performance on examination data can be improved.

Second, because examination data is classified through an ensemble natural language processing model that uses a pretrained language model, classification performance on examination data can be improved.

Third, because a natural language processing model reflecting word-level features is used together with a natural language processing model reflecting context-level features, accurate classification is possible even when the number of tokens in the examination data is equal to or greater than the maximum number of tokens of the corpus used for pretraining on context-level features.

Fourth, sentences that contain a keyword corresponding to a diagnosis name and whose token count is no greater than the maximum token count of the corpus used for pretraining are extracted from the examination data, and the examination data is classified through an ensemble natural language processing model trained on the extracted sentences. As a result, accurate classification is possible even when the number of tokens in the examination data is equal to or greater than the maximum number of tokens of the corpus used for pretraining.

Fifth, because examination data is classified through an ensemble natural language processing model trained through a preprocessing process tailored to examination data, classification performance on examination data can be improved.

The effects of the present invention are not limited to those mentioned above; other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is an exemplary diagram showing the configuration of an examination data classification system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of an examination data classification device according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the control unit of FIG. 2.
FIG. 4 is a flowchart illustrating an examination data classification method according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating, by way of example, the process of the examination data classification method according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating one embodiment of step S10 of FIG. 4.
FIG. 7 is a flowchart illustrating step S13 of FIG. 6.
FIG. 8 is a conceptual diagram illustrating, by way of example, data digitized in step S133 of FIG. 7.
FIG. 9 is a flowchart illustrating step S20 of FIG. 4.
FIG. 10 is a conceptual diagram illustrating, by way of example, the first classification model of FIG. 9.
FIG. 11 is a conceptual diagram illustrating, by way of example, the second classification model of FIG. 9.
FIG. 12 is a flowchart illustrating step S23 of FIG. 9.
FIG. 13 is a conceptual diagram illustrating, by way of example, the ensemble model of FIG. 4.
FIG. 14 is a flowchart illustrating another embodiment of step S10 of FIG. 4.
FIG. 15 is a conceptual diagram illustrating, by way of example, sentences extracted in step S102 of FIG. 14.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those skilled in the art to which the present invention belongs; the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms include plural forms unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those recited. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the recited elements and every combination of one or more of them. Although "first", "second", and so on are used to describe various elements, these elements are not limited by these terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those skilled in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are explicitly and specifically defined.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Before the description, the meanings of terms used in this specification are briefly explained. Note, however, that the explanation of terms is intended to aid understanding of this specification; unless a term is explicitly described as limiting the present invention, it is not used in a sense that limits the technical spirit of the present invention.

In some embodiments, however, the examination data classification device 10 may include fewer or more components than those shown in FIGS. 2 and 3.

The examination data input terminal 2 described in this specification may include, for example, a mobile phone, a smartphone, a laptop computer, a digital broadcasting terminal, a PDA (personal digital assistant), a PMP (portable multimedia player), a navigation device, a slate PC, a tablet PC, an ultrabook, or a wearable device (for example, a watch-type terminal (smartwatch), a glass-type terminal (smart glasses), or an HMD (head-mounted display)).

However, those skilled in the art will readily appreciate that the configurations according to the embodiments described in this specification may also be applied to fixed terminals such as digital TVs, desktop computers, and digital signage, except where a configuration is applicable only to the terminals described above.

Referring to FIG. 1, in the examination data classification system according to an embodiment of the present invention, a server 1 and a plurality of examination data input terminals 2 may be connected to one another through a network 3 so that they can communicate.

The server 1 preprocesses the examination data received from the plurality of examination data input terminals 2, generates an ensemble classification model based on the preprocessed examination data, and obtains a classification result for the examination data using the generated ensemble classification model.

Referring to FIG. 2, the server 1 may include an examination data classification device 10. The present invention is not limited thereto, however, and the examination data classification device 10 may be provided separately from the server 1.

The examination data classification device 10 includes a communication unit 11, a database unit 12, and a control unit 13. The communication unit 11, the database unit 12, and the control unit 13 may be connected to one another by a system bus.

Information exchange between examination data classification devices 10, between the examination data classification device 10 and an examination data input terminal 2, or between the examination data classification device 10 and an external server may be performed through the communication unit 11.

The communication unit may be implemented through at least one of a wired communication module, a wireless communication module, and a short-range communication module. A wireless Internet module is a module for wireless Internet access and may be built into each device or attached externally. Wireless Internet technologies that may be used include WLAN (Wireless LAN, Wi-Fi), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), LTE (Long Term Evolution), and LTE-A (Long Term Evolution-Advanced).

The database unit 12 stores various information received through the communication unit 11, various information needed for the control unit 13 to perform its functions, and various information computed by the control unit 13.

The database unit 12 may include at least one type of storage medium among flash memory, a hard disk, a multimedia card micro type, card-type memory (for example, SD or XD memory), RAM (random access memory), SRAM (static random access memory), ROM (read-only memory), EEPROM (electrically erasable programmable read-only memory), PROM (programmable read-only memory), magnetic memory, a magnetic disk, and an optical disk. The database unit 12 may also be implemented in the form of web storage.

The control unit 13 includes a preprocessing module 14 and a classification module 15.

The preprocessing module 14 performs a preprocessing process that converts the received examination data into input data for the ensemble classification model.

The preprocessing process includes standardization, digitization, padding, and data augmentation.

To this end, the preprocessing module 14 includes a standardization unit 141, a digitization unit 142, a padding unit 143, and a data augmentation unit 144. The preprocessing process is described in detail later, together with the examination data classification method according to the present invention.

The classification module 15 includes an embedding unit 151 that embeds the preprocessed data.

The classification module 15 also includes a classification unit 152 that performs training using the embedded data or classifies the embedded data using a trained model.

The classification unit 152 trains an ensemble classification model using an ensemble technique, or classifies the embedded data using the trained ensemble classification model.

In one embodiment, voting, bagging, and boosting methods may be used for ensemble learning. The present invention is not limited thereto, however.
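Of the ensemble methods named above, voting is the simplest to illustrate. The sketch below shows soft voting (averaging class probabilities) and hard voting (majority label) over the outputs of several already-trained models. It is a generic illustration under the assumption that each model emits either a probability vector or a label; it does not describe how the first and second classification models of this invention are actually combined, and the function names are hypothetical.

```python
from collections import Counter


def soft_vote(probabilities):
    """probabilities: one list of class probabilities per model, for one sample.
    Returns (index of the winning class, averaged probabilities)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(p[c] for p in probabilities) / n_models for c in range(n_classes)
    ]
    best = max(range(n_classes), key=averaged.__getitem__)
    return best, averaged


def hard_vote(labels):
    """labels: one predicted label per model. Returns the most common label."""
    return Counter(labels).most_common(1)[0][0]
```

Soft voting uses each model's confidence, whereas hard voting uses only the predicted labels, so the two can disagree when one model is much more confident than the others.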

That is, the classification unit 152 classifies the examination data through an ensemble technique that uses various algorithms.

The ensemble technique may use at least one algorithm among CNN (Convolutional Neural Network), DCNN (Deep Convolutional Neural Network), LSTM (Long Short-Term Memory), KoBERT (Korean Bidirectional Encoder Representations from Transformers), KoELECTRA (Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately), BERT (Bidirectional Encoder Representations from Transformers), and ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately).

In one embodiment, the classification unit 152 may generate at least one first classification model using at least one algorithm among CNN and DCNN.

In one embodiment, the classification unit 152 may generate at least one second classification model using at least one algorithm among LSTM, KoBERT, KoELECTRA, BERT, and ELECTRA.

In one embodiment, the classification unit 152 may generate the ensemble classification model using the first classification model and the second classification model.

제어부(13)는, 하드웨어적으로, ASICs(applicationspecific integrated circuits), DSPs(digital signal processors), DSPDs(digital signal processing devices), PLDs(programmable logic devices), FPGAs(field programmable gate arrays), 프로세서(processors), 제어기(controllers), 마이크로컨트롤러(micro-controllers), 마이크로 프로세서(microprocessors), 기타 기능 수행을 위한 전기적인 유닛 중 적어도 하나를 이용하여 구현될 수 있다.In terms of hardware, the control unit 13 includes application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors ), controllers, micro-controllers, microprocessors, and electrical units for performing other functions.

In terms of software, embodiments such as the procedures and functions described in this specification may be implemented as separate software modules. Each software module may perform one or more of the functions and operations described herein. The software code may be implemented as a software application written in a suitable programming language, stored in a memory, and executed by the control unit 13.

Hereinafter, an examination data classification method S1 according to an embodiment of the present invention is described in detail with reference to FIGS. 4 to 15.

Referring to FIG. 4, the examination data classification method S1 includes preprocessing the examination data (S10), training an ensemble classification model based on the preprocessed examination data (S20), and classifying the preprocessed examination data using the trained ensemble classification model (S30).

Referring to FIG. 5, a process in which the examination data classification device 10 generates a classification model for thyroid examination data and uses the generated model to derive a classification result according to a plurality of preset findings is conceptually illustrated.

In the illustrated embodiment, the examination data may be derived from health examination data. The health examination data may consist of individual finding texts for the various diagnoses that a health examination can identify, numerical test results related to various abnormalities, and a comprehensive finding text.

The examination data may consist of an individual finding text for a specific diagnosis, numerical test results related to that diagnosis, and a comprehensive finding text. In an embodiment, the examination data may be a natural-language sequence formed by concatenating the individual finding text, the numerical test results, and the comprehensive finding text.

In the illustrated embodiment, the examination data relates to the thyroid and consists of a thyroid individual finding text, hormone level test results, and a comprehensive finding text. In an embodiment not shown, the examination data may relate to a diagnosis other than the thyroid.

When the examination data is received by the examination data classification device 10, the preprocessing module 14 of the examination data classification device 10 preprocesses the examination data (S10).

When the preprocessing is complete, the classification module 15 of the examination data classification device 10 trains an ensemble classification model based on the preprocessed examination data (S20).

When the training is complete, the classification module 15 of the examination data classification device 10 obtains a classification result for the preprocessed examination data using the trained ensemble classification model (S30).

The classification result assigns the data to one of a plurality of preset findings. In one embodiment, the findings may be normal (Class 0) and abnormal finding (Class 1). In another embodiment, the findings may be normal (Class 0), abnormal finding (Class 1), and severe abnormal finding (Class 2).

Below, the step of preprocessing the examination data (S10) is described in detail with reference to FIGS. 6 to 8.

(1) Description of the step of preprocessing the examination data (S10)

Referring to FIG. 6, the step of preprocessing the examination data (S10) includes deleting all special characters other than preset special characters (S11), standardizing spacing rules (S12), digitizing the data (S13), and padding the data (S14).

First, the examination data classification device 10 deletes all special characters other than the preset special characters from the received examination data (S11). The deletion may be performed by the formalization unit 141 of the preprocessing module 14.

Because examination data contains many special characters, such as units of measurement, the sequence length of the examination data can grow excessively. Therefore, to shorten the sequence, all special characters other than the preset special characters are deleted.

In an embodiment, the preset special characters may be the following eight characters: , . & ' / ~ ² -
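Step S11 can be illustrated with a minimal sketch. It assumes that Hangul syllables, ASCII letters, digits, and whitespace are always kept (the description does not define what counts as a special character), and the names `KEPT_SPECIALS` and `strip_special_chars` are hypothetical.

```python
import re

# The eight special characters the embodiment lists as preserved.
KEPT_SPECIALS = ",.&'/~²-"

# Assumption: Hangul syllables, ASCII letters, digits and whitespace are never deleted.
_DROP = re.compile(r"[^0-9A-Za-z가-힣\s" + re.escape(KEPT_SPECIALS) + r"]")

def strip_special_chars(text: str) -> str:
    """Delete every character that is neither ordinary text nor a preset special character."""
    return _DROP.sub("", text)
```

For instance, under these assumptions the brackets, colon, and exclamation mark in "TSH: 4.2 [mIU/L] 정상!" are removed while the period and slash survive.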

In addition, the numerical test results included in the examination data are structured numerical data matched to a diagnosis, and the formalization unit 141 may convert this structured numerical data into text data.

For example, when the examination data relates to the thyroid, the numerical result data may be hormone level data. The formalization unit 141 may convert the hormone level data into text such as "Hormone test levels are normal."
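As a rough illustration of this conversion, the sketch below turns one numeric result into a sentence. The function name, the reference-range parameters, and the wording of the abnormal sentence are all hypothetical; the description only gives the "normal" sentence and does not say how normality is decided.

```python
def hormone_result_to_text(value: float, low: float, high: float) -> str:
    """Render a numeric hormone result as a sentence.

    `low` and `high` are a caller-supplied reference range (an assumption;
    the source does not specify how normality is determined).
    """
    if low <= value <= high:
        return "Hormone test levels are normal."
    # Hypothetical wording for the out-of-range case.
    return "Hormone test levels are abnormal."
```

The resulting sentence can then be concatenated with the finding texts so that every part of the examination data passes through the same text pipeline.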

When the deletion of special characters is complete, the examination data classification device 10 standardizes the spacing of the text included in the examination data based on preset grammar rules (S12). The standardization of spacing may be performed by the formalization unit 141.

Depending on who wrote it, the text included in the examination data may contain spacing errors or follow different spacing conventions. For example, "할수 있다", "할수있다", and "할 수 있다" (all meaning "can do") may appear together. Because these strings express the same natural-language content, they can be unified according to a preset spacing rule. For example, "할수 있다" and "할수있다" can both be normalized to "할 수 있다".
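The normalization in this example can be sketched with a small rule table. This is only an illustration under the assumption that rules are regular-expression rewrites; `SPACING_RULES` and `unify_spacing` are hypothetical names, and the description itself relies on tokenizers rather than hand-written rules.

```python
import re

# Hypothetical rule table: each pattern maps spacing variants to one canonical form.
SPACING_RULES = [
    (re.compile(r"할\s*수\s*있다"), "할 수 있다"),
]

def unify_spacing(text: str) -> str:
    """Rewrite known spacing variants to their canonical form."""
    for pattern, canonical in SPACING_RULES:
        text = pattern.sub(canonical, text)
    # Collapse any remaining runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

All three variants from the example map to the same string, so later steps see a single token sequence for them.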

In an embodiment, grammar rules other than spacing may also be standardized.

Various tokenizers may be used to standardize spacing. In an embodiment, at least one of the following tokenizers may be used: Mecab-Ko, WordPiece from KoBERT, WordPiece from KoELECTRA, and Mecab-Ko & WordPiece. The tokenizer splits the text included in the examination data into a plurality of tokens.

When the standardization of spacing is complete, the examination data classification device 10 digitizes the standardized examination data (S13).

Referring to FIG. 7, the detailed process of digitizing the examination data (S13) is illustrated.

First, the digitization unit 142 tokenizes the examination data (S131). In an embodiment not shown, the digitization unit 142 may reuse the tokenization result produced by the formalization unit 141 instead of tokenizing separately.

When tokenization has split the examination data into a plurality of tokens, the digitization unit 142 counts the tokens (S132).

When the counting is complete, the digitization unit 142 digitizes the data based on the token counts (S133).

Referring to FIG. 8, an example of the result digitized in step S133 is shown.

The "hormone" token is the 13023rd token among the tokens included in the examination data, and the "test" token is the 6911th. That is, each token is converted to a number according to its position in the token ordering.
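Steps S132 and S133 can be sketched as building a token-to-index vocabulary. The description does not state how the ordering is derived from the counts, so the sketch assumes descending frequency with ties broken by first appearance, and it reserves index 0 for padding; `build_vocab` and `encode` are hypothetical names.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Count tokens over all documents and number them by descending frequency."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Index 0 is reserved for padding; most_common keeps first-seen order among ties.
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(doc, vocab):
    """Replace each token with its integer index."""
    return [vocab[tok] for tok in doc]
```

With the two toy documents `["hormone", "test", "level"]` and `["test", "result"]`, "test" is the most frequent token and receives index 1.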

Referring back to FIG. 6, once digitization is complete, the padding unit 143 pads the digitized examination data (S14).

The padding unit 143 compares the sequence lengths of the plurality of digitized examination data items and identifies the maximum sequence length among them.

Once the maximum sequence length is identified, the padding unit 143 extends the digitized examination data items to that length. That is, all of the digitized examination data items are brought to the same length.
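The padding step can be sketched in a few lines. The padding value 0 and the choice to append it on the right are assumptions; the description only states that all sequences are extended to the maximum length.

```python
def pad_sequences(seqs, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest sequence length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]
```

For example, `pad_sequences([[2, 1, 3], [1, 4]])` leaves the first sequence unchanged and appends one padding value to the second.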

The preprocessing of the examination data is completed through the process described above.

When the preprocessing is complete, the data augmentation unit 144 of the preprocessing module 14 may additionally perform data augmentation.

The examination data used to train the ensemble classification model may be divided into a plurality of groups, each corresponding to one of the preset findings. That is, the training examination data may be labeled by group.

For example, the examination data used for training may be classified as normal (Class 0), abnormal finding (Class 1), or severe abnormal finding (Class 2).

However, if a particular group contains too little data, the characteristics of that group may not be properly reflected in the trained model. Accordingly, the data augmentation unit 144 may augment the data of a group whose data volume is excessively small.

The data augmentation unit 144 may split the data in the smallest group, that is, the group with the least data, into sentences to generate a plurality of sentences, and may concatenate the generated sentences with one another to increase the amount of data in the smallest group.

For example, when the data is split into a first, a second, and a third sentence, the amount of data can be increased by combining the first and second sentences, the second and third sentences, and the first, second, and third sentences.
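The three-sentence example can be reproduced with a short sketch. It assumes that only runs of adjacent sentences in their original order are combined, which matches the example but is not stated as a rule, and it uses a naive punctuation-based splitter; both function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def augment_by_joining(sentences):
    """Return every run of two or more adjacent sentences, joined by a space."""
    n = len(sentences)
    return [" ".join(sentences[i:j]) for i in range(n) for j in range(i + 2, n + 1)]
```

For three sentences this yields exactly three new samples: sentences 1 and 2, sentences 1 to 3, and sentences 2 and 3. Each new sample would carry the label of the minority group it came from.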

Because the ensemble classification model is generated through a preprocessing pipeline tailored to examination data, including special-character removal, spacing standardization, data digitization, and data padding, the classification accuracy of the resulting ensemble classification model can be improved.

(2) Description of the step of training the ensemble classification model based on the preprocessed examination data (S20)

Referring to FIG. 9, the detailed process of training the ensemble classification model based on the preprocessed examination data (S20) is illustrated.

First, the examination data classification device 10 performs static embedding and contextualized embedding on the preprocessed examination data. The embedding may be performed by the embedding unit 151 of the control unit 13.

Referring to FIG. 10, static embedding is illustrated by way of example.

In the illustrated embodiment, static embedding based on a preset window size is performed on some of the tokens included in the examination data.

In the illustrated embodiment, in the model shown at the top, static embedding is performed on the tokens "혈압" (blood pressure), "이" (subject particle), and "평균" (average), and on the tokens "약간" (slightly), "높" (high), "습니다" (polite sentence ending), and ".".

In the illustrated embodiment, in the model shown at the bottom, static embedding is performed on the tokens "혈압" (blood pressure), "이" (subject particle), and "평균" (average), and on the tokens "수치" (level), "검사" (test), and "결과" (result).

In an embodiment, a window of a fixed size may be used for static embedding, or windows of several sizes may be used in combination.
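How windows of several sizes select groups of tokens can be sketched as follows. Treating each window as the span seen by one convolution filter is an assumption about the figure, and `sliding_windows` is a hypothetical helper; the token list reuses the tokens named in the description.

```python
def sliding_windows(tokens, sizes=(3, 4)):
    """For each window size k, list every run of k consecutive tokens."""
    return {
        k: [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
        for k in sizes
    }

tokens = ["혈압", "이", "평균", "약간", "높", "습니다", "."]
windows = sliding_windows(tokens)
```

With these seven tokens, the size-3 windows start with ("혈압", "이", "평균") and the size-4 windows end with ("약간", "높", "습니다", "."), the two groups mentioned for the upper model.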

Referring to FIG. 11, contextualized embedding is illustrated by way of example.

In the illustrated embodiment, contextualized embedding is performed on at least a preset number of tokens so that context-related features are taken into account.

In an embodiment, contextualized embedding may include token embedding, segment embedding, and position embedding.

Token embedding embeds the text character by character and merges the longest frequently occurring sub-word into a single unit. Handling rarely occurring words as out-of-vocabulary (OOV) items prevents degradation of modeling performance.

Segment embedding reassembles the tokenized words into sentences. The separator [SEP] is placed between two sentences, and the two sentences are designated as one segment and input together.

Position embedding encodes the sequential order of the tokens.
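In BERT-style models the three embeddings are combined by element-wise addition, which the following sketch illustrates with small randomly initialized tables. The table sizes, the initialization range, and the helper names are illustrative assumptions, not values from the description.

```python
import random

def make_table(rows, dim, rng):
    """Create a rows x dim lookup table of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def input_embeddings(token_ids, segment_ids, tok_tab, seg_tab, pos_tab):
    """Sum token, segment and position vectors for each input position."""
    out = []
    for pos, (t, s) in enumerate(zip(token_ids, segment_ids)):
        out.append([a + b + c for a, b, c in zip(tok_tab[t], seg_tab[s], pos_tab[pos])])
    return out
```

The summed vectors form the input to the encoder; in a trained model the three tables are learned parameters rather than random values.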

Referring back to FIG. 9, the examination data classification device 10 generates a first classification model by training on the data resulting from static embedding (S22).

Referring to FIG. 10, the process of generating the first classification model with the CNN and DCNN algorithms is conceptually illustrated.

Referring back to FIG. 9, the examination data classification device 10 generates a second classification model by training on the data resulting from contextualized embedding (S23).

Referring to FIG. 11, the process of generating the second classification model with the Bidirectional Encoder Representations from Transformers (BERT) algorithm is conceptually illustrated.

Because the second classification model, which reflects contextual features, is used together with the first classification model, the classification performance of the ensemble classification model can be improved.

Referring to FIG. 12, the detailed process of one embodiment of the step of generating the second classification model (S23) is illustrated.

First, the examination data classification device 10 obtains initial weights by training on a previously acquired corpus set (S231).

When the initial weights have been obtained, the examination data classification device 10 generates the second classification model by fine-tuning the initial weights through training on the data resulting from contextualized embedding (S232).

In an embodiment, at least one of Korean Bidirectional Encoder Representations from Transformers (KoBERT), Korean Efficiently Learning an Encoder that Classifies Token Replacements Accurately (KoELECTRA), Bidirectional Encoder Representations from Transformers (BERT), and Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) may be used.

Because pre-training is performed on the previously acquired corpus set, a classification model with high classification accuracy can be generated from a relatively small amount of examination data.

However, a model pre-trained on a previously acquired corpus set has a limitation: during fine-tuning or classification, the data input to the second classification model must be shorter than the maximum number of tokens of a corpus included in that corpus set.

Therefore, when the examination data contains data longer than the maximum number of tokens of a corpus included in the previously acquired corpus set, the data must be split before use, and a split piece may not contain the desired keyword. In particular, the comprehensive finding text of the examination data was frequently longer than this maximum number of tokens.

According to the present invention, a first classification model that is not affected by the maximum number of tokens of the corpus is used alongside the second, so accurate classification can be achieved regardless of the length of the examination data.

In an embodiment, the examination data includes text data for which at least one of the maximum and the average number of tokens produced by tokenizing exceeds a preset reference number, and the reference number may correspond to the maximum number of tokens of a corpus included in the previously acquired corpus set. For example, at least one of the maximum and the average number of tokens of the comprehensive finding text may exceed the reference number.

Referring back to FIG. 9, the examination data classification device 10 generates the ensemble classification model based on the at least one first classification model and the at least one second classification model that were generated (S24).
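The description does not state how the member models are combined in step S24. One common option, shown here purely as an assumption, is soft voting: the class probabilities of the member models are averaged (optionally with weights) and the class with the highest average is chosen. `soft_vote` is a hypothetical name.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-class probabilities across models and return (label, averages)."""
    n_models = len(prob_lists)
    # Default to equal weights when none are supplied.
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    avg = [sum(w * p[c] for w, p in zip(weights, prob_lists)) for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg
```

For example, if a first model outputs `[0.6, 0.3, 0.1]` and a second model outputs `[0.2, 0.7, 0.1]`, equal weighting selects Class 1.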

(3) Description of the step of classifying the preprocessed examination data using the trained ensemble classification model (S30)

Referring back to FIG. 4, once the ensemble classification model has been generated, the examination data classification device 10 classifies the preprocessed examination data (S30).

Referring to FIG. 13, test results are shown for an ensemble classification model that combines a first classification model generated with the DCNN algorithm and a second classification model generated with the pre-trained KoELECTRA algorithm.

In the illustrated embodiment, the precision (Macro P), the recall (Macro R), and the F1 score (Macro F1) calculated from them are shown for a plurality of first classification models, a plurality of second classification models, and the ensemble classification model. The precision, recall, and F1 score of the ensemble classification model built from the DCNN and KoELECTRA algorithms are shown to be higher than those obtained using a first classification model or a second classification model alone.
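The macro-averaged metrics can be computed as in the sketch below. It follows the common convention in which Macro F1 is the mean of the per-class F1 scores; the description does not say whether this convention or the harmonic mean of Macro P and Macro R was used, so that choice is an assumption, as is the name `macro_prf`.

```python
def macro_prf(y_true, y_pred, labels):
    """Return (macro precision, macro recall, macro F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights every class equally, which matters here because the severe-finding class is typically much smaller than the normal class.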

In an embodiment, the ensemble classification model may combine first and second classification models based on algorithms such as CNN, DCNN, and LSTM with a second classification model trained using pre-training (KoELECTRA, etc.).

As a result, using a second classification model trained with pre-training (KoELECTRA, etc.) improves classification accuracy, while accurate classification remains possible without being constrained by the length of the examination data.

(4) Description of the step of preprocessing the examination data (S10) according to another embodiment

Referring to FIG. 14, the process of preprocessing the examination data (S10) according to another embodiment is illustrated.

First, the examination data classification device 10 may reprocess the examination data based on a reference number, which is the maximum number of tokens of a corpus included in the corpus set, and may then preprocess the reprocessed examination data.

When training on and classifying examination data corresponding to a specific diagnosis, the examination data classification device 10 may extract sentences that satisfy predetermined requirements from the comprehensive finding text, and may preprocess the individual finding text, the numerical test results, and the extracted sentences.

First, the examination data classification device 10 searches the comprehensive finding text for keywords corresponding to the diagnosis (S101).

For example, when the individual finding text relates to the thyroid, the search may use preset keywords related to the thyroid.

When a keyword is found, the examination data classification device 10 extracts the sentences that contain the keyword and whose number of tokens does not exceed the reference number (S102).
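Steps S101 and S102 can be sketched as a filter over the sentences of the comprehensive finding text. The sentence splitter, the whitespace tokenizer used by default, the case-insensitive matching, and the name `extract_keyword_sentences` are assumptions for illustration; a real system would count tokens with the same tokenizer as the pre-trained model.

```python
import re

def extract_keyword_sentences(text, keywords, max_tokens, tokenize=str.split):
    """Keep sentences that mention a keyword and have at most max_tokens tokens."""
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        s for s in sentences
        if any(k.lower() in s.lower() for k in keywords)
        and len(tokenize(s)) <= max_tokens
    ]
```

Sentences without a keyword are dropped, and so are keyword sentences that would still exceed the reference number of tokens.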

Referring to FIG. 15, an example of the extracted sentences is shown.

Referring back to FIG. 14, the examination data classification device 10 performs preprocessing based on the extracted sentences (S103).

In an embodiment, the examination data classification device 10 preprocesses the individual finding text, the numerical test results, and the extracted sentences.

As a result, accurate training and classification are possible even when the number of tokens in the examination data is greater than or equal to the maximum number of tokens of the corpus used for pre-training.

The examination data classification method according to the embodiment of the present invention described above differs from the examination data classification system and device described with reference to FIGS. 1 to 3 only in the category of invention and is otherwise identical in content, so redundant descriptions and examples are omitted.

The method according to the embodiment of the present invention described above may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a server, which is hardware.

The above-described program may include code written in a computer language, such as C, C++, JAVA, or machine language, that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented as the program. The code may include functional code related to functions that define what is needed to execute the methods, and may include control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. The code may further include database-unit reference code indicating at which location (address) in the internal or external database unit of the computer the additional information or media needed for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium is not a medium that stores data for a short moment, such as a register, cache, or database unit, but a medium that stores data semi-permanently and can be read by a device. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. That is, the program may be stored in various recording media on various servers that the computer can access, or in various recording media on the user's computer. In addition, the medium may be distributed over computer systems connected through a network, so that computer-readable code is stored in a distributed manner.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

1: server
2: examination data input terminal
3: network
10: examination data classification device
11: communication unit
12: database unit
13: control unit
14: preprocessing module
141: standardization unit
142: digitization unit
143: padding unit
144: data proliferation unit
15: classification module
151: embedding unit
152: classification unit

Claims

A method of digitizing examination data based on data amount proliferation, performed by a device, the method comprising:
receiving examination data;
preprocessing the examination data;
generating an ensemble classification model based on the preprocessed examination data; and
obtaining a classification result of the examination data using the trained ensemble classification model,
wherein the examination data is classified into a plurality of groups respectively corresponding to a plurality of findings,
wherein the preprocessing comprises:
deleting from the examination data all special characters other than predetermined reference special characters;
unifying spacing rules of the examination data from which the special characters have been deleted, based on a preset standard grammar;
digitizing the unified examination data; and
padding the digitized data,
wherein the digitizing
further comprises increasing the amount of data of a minimum group having the smallest amount of data among the plurality of groups, and
wherein the increasing comprises:
generating a plurality of sentences by dividing the data included in the minimum group into sentence units; and
increasing the amount of data of the minimum group by connecting the plurality of sentences to one another.
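
The preprocessing and data-increasing steps recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the set of retained special characters (`KEEP`), the whitespace-collapsing rule standing in for the "standard grammar", the pairing of two random sentences, and all function names are hypothetical.

```python
import random
import re

# Assumption: the "reference special characters" to keep are basic punctuation.
KEEP = ".,?!%"


def clean(text):
    """Delete special characters other than those in KEEP, then unify spacing."""
    # Keep word characters (including Hangul), whitespace, and the KEEP set.
    text = re.sub(rf"[^\w\s{re.escape(KEEP)}]", "", text)
    # Stand-in for the standard-grammar spacing rule: collapse whitespace runs.
    return re.sub(r"\s+", " ", text).strip()


def pad(ids, length, pad_id=0):
    """Truncate or right-pad an integer sequence to a fixed length."""
    return (ids + [pad_id] * length)[:length]


def augment_minimum_group(groups, seed=0):
    """Grow the smallest group by connecting its sentences into new records."""
    rng = random.Random(seed)
    smallest = min(groups, key=lambda k: len(groups[k]))
    target = max(len(v) for v in groups.values())
    # Divide every record of the smallest group into sentence units.
    sentences = [
        s.strip()
        for doc in groups[smallest]
        for s in re.split(r"(?<=[.?!])\s+", doc)
        if s.strip()
    ]
    out = {k: list(v) for k, v in groups.items()}
    while len(out[smallest]) < target and len(sentences) >= 2:
        # Connect two randomly chosen sentences into one synthetic record.
        out[smallest].append(" ".join(rng.sample(sentences, 2)))
    return out


groups = {
    "normal": ["No abnormality.", "Within range.", "Nothing to report."],
    "abnormal": ["Mild fatty liver. Follow up in six months."],
}
balanced = augment_minimum_group(groups)
```

Here the group with the fewest records is grown until it matches the largest group; any other stopping rule would fit the claim language equally well.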

The method of claim 1,
wherein the generating comprises:
performing static embedding and contextualized embedding on the preprocessed examination data; and
generating the ensemble classification model based on result data obtained by performing the static embedding and the contextualized embedding.

The method of claim 2,
wherein the generating comprises:
generating at least one first classification model through training based on the statically embedded data;
generating at least one second classification model through training based on the contextualized-embedded data; and
generating the ensemble classification model based on the at least one first classification model and the at least one second classification model.
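
One way to combine the first and second classification models is sketched below. The claims do not state how the models are combined, so soft voting (a weighted average of class probabilities) is an assumption made here for illustration, and `soft_vote` and the sample probabilities are hypothetical.

```python
def soft_vote(prob_sets, weights=None):
    """Average per-model class probabilities; return (class index, averages)."""
    weights = weights or [1.0] * len(prob_sets)
    total = sum(weights)
    n_classes = len(prob_sets[0])
    avg = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest averaged probability.
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical outputs for one record over three findings.
static_model_probs = [0.6, 0.3, 0.1]      # model trained on static embeddings
contextual_model_probs = [0.2, 0.7, 0.1]  # model trained on contextualized embeddings
label, averaged = soft_vote([static_model_probs, contextual_model_probs])
```

With equal weights, the averaged probabilities are approximately 0.4, 0.5, and 0.1, so the second finding is selected even though the static model alone preferred the first.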

The method of claim 3,
wherein the generating of the at least one second classification model comprises:
obtaining initial weights through training on a previously obtained corpus set; and
generating the at least one second classification model by fine-tuning the initial weights through training based on the contextualized-embedded data.

The method of claim 4,
wherein the examination data includes text data for which at least one of the maximum and the average number of tokens, calculated through tokenizing, is greater than a predetermined reference number, and
wherein the reference number corresponds to the maximum number of tokens of a corpus included in the previously obtained corpus set.

The method of claim 5,
wherein the examination data is produced for each of a plurality of diagnosis names, and
wherein the preprocessing comprises:
searching the text data for a keyword corresponding to each of the plurality of diagnosis names;
extracting a sentence that includes the found keyword and has a number of tokens less than or equal to the reference number; and
performing preprocessing based on the extracted sentence.
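
The keyword search and sentence extraction recited above can be sketched as follows. Whitespace tokenization stands in for the tokenizer of the pretrained model, and the function name, the sentence-splitting rule, and the sample text are assumptions for illustration.

```python
import re


def extract_keyword_sentences(text, keywords, max_tokens, tokenize=str.split):
    """Per keyword, return sentences that mention it and fit within max_tokens."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    return {
        kw: [s for s in sentences if kw in s and len(tokenize(s)) <= max_tokens]
        for kw in keywords
    }


report = (
    "Blood pressure is slightly elevated. "
    "Liver enzymes are within the normal range. "
    "Blood pressure should be rechecked after two weeks of reduced salt "
    "intake and regular exercise."
)
selected = extract_keyword_sentences(report, ["Blood pressure", "Liver"], max_tokens=8)
```

The third sentence mentions "Blood pressure" but has 15 whitespace tokens, so it is dropped under the limit of 8; in practice the limit would be the reference number tied to the pretraining corpus.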

The method of claim 6,
wherein the examination data includes individual findings text, numerical test results, and comprehensive findings text,
wherein the examination data is produced for each of a plurality of diagnosis names,
wherein the numerical test results are standardized data corresponding to each of the plurality of diagnosis names,
wherein, for the comprehensive findings text, at least one of the maximum and the average number of tokens, calculated through tokenizing, is greater than the reference number, and
wherein the reference number corresponds to the maximum number of tokens of a corpus included in the previously obtained corpus set.

The method of claim 7,
wherein the digitizing comprises:
tokenizing the unified examination data;
counting the number of tokens generated through the tokenizing; and
digitizing the unified examination data based on the number of tokens.
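
A possible reading of the tokenize, count, and digitize steps is sketched below. The claim does not specify how the token counts determine the numeric values; assigning smaller integer indices to more frequent tokens is an assumption, as are the reserved indices 0 (padding) and 1 (unknown) and the names `build_vocab` and `encode`.

```python
from collections import Counter


def build_vocab(texts, tokenize=str.split):
    """Count tokens and map each to an integer index ranked by frequency."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Ties are broken alphabetically so the mapping is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Index 0 is reserved for padding and 1 for unknown tokens.
    return {tok: i + 2 for i, tok in enumerate(ranked)}


def encode(text, vocab, tokenize=str.split, unk_id=1):
    """Convert text to a list of integer indices using the vocabulary."""
    return [vocab.get(tok, unk_id) for tok in tokenize(text)]


texts = ["liver normal", "liver mild fatty", "kidney normal"]
vocab = build_vocab(texts)
ids = encode("liver fatty unknown", vocab)
```

The resulting integer sequences are what a padding step would then bring to a common length.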

An apparatus for digitizing examination data based on data amount proliferation, the apparatus comprising:
a communication unit;
a database unit; and
a control unit that classifies examination data stored in the database unit, or examination data received from an external server through the communication unit, into a plurality of preset findings,
wherein the examination data is classified into a plurality of groups respectively corresponding to the plurality of findings, and
wherein the control unit:
preprocesses the examination data;
performs static embedding and contextualized embedding on the preprocessed examination data;
generates an ensemble classification model based on result data obtained by performing the static embedding and the contextualized embedding;
obtains a classification result of the examination data using the trained ensemble classification model;
deletes from the examination data all special characters other than preset reference special characters;
unifies spacing rules of the examination data from which the special characters have been deleted, based on a preset standard grammar;
digitizes the unified examination data;
pads the digitized data;
increases the amount of data of a minimum group having the smallest amount of data among the plurality of groups;
generates a plurality of sentences by dividing the data included in the minimum group into sentence units; and
increases the amount of data of the minimum group by connecting the plurality of sentences to one another.

A program stored in a medium and combined with a computer, which is hardware, to execute the method of any one of claims 1 to 8.