KR102461295B1

KR102461295B1 - A method for normalizing biomedical names

Info

Publication number: KR102461295B1
Application number: KR1020170083844A
Authority: KR
Inventors: 이현주; 조혜진
Original assignee: 광주과학기술원
Priority date: 2017-06-30
Filing date: 2017-06-30
Publication date: 2022-11-01
Also published as: KR20190003231A

Abstract

A method for normalizing biomedical names includes: building training data by processing a known database of biomedical terms together with unlabeled data; constructing a normalization model by applying a deep-learning-based word representation to the built training data; and assigning an identifier by applying test data, to which an identifier is to be assigned, to the constructed normalization model.

Description

{A method for normalizing biomedical names}

The present invention relates to a method for normalizing the names of biomedical entities.

In biomedical literature, the technique of identifying entity names, known as named entity recognition (NER), is essential for extracting desired knowledge from a wide range of documents. In text mining of biomedical literature, it is important to normalize each identified name to a standardized identifier after NER has been applied.

Many entity name normalization methods rely on domain-specific dictionaries to handle synonyms and abbreviations. However, except for a few entity types such as genes, dictionaries do not cover the variety of entity names. Because biomedical literature is now accumulating rapidly, neural network-based algorithms that integrate large amounts of data are being applied to various text mining techniques.

(Prior Art Document 1) McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co., New York (2010)
(Prior Art Document 2) Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
(Prior Art Document 3) Leaman, R., Dogan, R.I., Lu, Z.: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 474 (2013)

The invention proposes a normalization method that is more accurate than existing entity name normalization methods across a variety of entity names.

The method includes: building training data by processing an existing database of biomedical terms together with unlabeled data; constructing a normalization model by applying a deep-learning-based word representation to the built training data; and assigning an identifier by applying test data, to which an identifier is to be assigned, to the constructed normalization model.

According to an embodiment of the present invention, entity names can be normalized more accurately than with existing entity name normalization methods.

FIG. 1 schematically shows a method for normalizing the identifier of a biomedical entity name according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an algorithm according to an embodiment of the present invention in more detail.
FIG. 3 compares the performance of the models proposed in the present invention with that of DNorm, an existing entity name normalization model, with and without an abbreviation resolution step.
FIG. 4 compares the performance of the models proposed in the present invention for plant name normalization.
FIG. 5 is a flowchart of a biomedical term identification method according to an embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "part" used for components in the following description are assigned or used interchangeably only for ease of drafting and do not by themselves have distinct meanings or roles. Where a detailed description of related known technology could obscure the gist of the embodiments disclosed in this specification, that description is omitted. The accompanying drawings are provided only to aid understanding of the embodiments; they do not limit the technical idea disclosed herein, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, no intervening component is present.

A singular expression includes the plural unless the context clearly indicates otherwise.

In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 schematically shows a method for normalizing the identifier of a biomedical entity name according to an embodiment of the present invention.

In general, a natural language is a language that arises naturally, such as the languages people use to communicate. The opposite concept is an artificial language. For example, Korean and English are natural languages, while a programming language is an artificial language.

To computerize today's vast amounts of data, technology is needed that processes natural language into a form a computer can understand and converts it back again. In other words, natural language must be matched to specific symbols and digitized. The digitized data is stored on a server or the like, and by reversing the conversion it can again be expressed in a language that humans understand.

Before the normalization method according to an embodiment of the present invention is described, several terms are explained.

Named entity recognition is a technique for recognizing, extracting, and classifying words (entity names) in a document that correspond to predefined categories such as person, company, place, or time. For example, in disease named entity recognition, if a sentence contains the words "cancer", "tumor", or "prostate cancer", these words can be recognized as belonging to the predefined entity type disease.

Named entity normalization means assigning an appropriate unique identifier or identification code to a recognized entity name. For example, if the word "cancer" has been recognized as a disease name by disease name recognition, normalization can assign it "MESH:D009369", an identifier from the existing "MESH" database. The same identifier can be assigned even when the same entity is expressed in different surface forms. For example, a full name and its abbreviation mean the same thing, so both can be assigned the same identifier.

General NER recognizes predefined categories of nouns (person, place, time, and so on), so individual IDs often need not be assigned. However, when a wide variety of biomedical entity names must be distinguished, NER is difficult to apply without individual identifiers.

Dictionary-based approaches have been proposed to overcome this. However, dictionary-based normalization uses a dictionary for a specific domain and therefore cannot be applied to words that are not in the dictionary.

Normalization using machine learning has been proposed as another way to overcome this. Rule-based methods, feature-based methods, TF-IDF, and similar methods have been proposed. However, machine-learning-based normalization cannot normalize newly coined terms such as novel abbreviations.

Accordingly, the biomedical entity name normalization method of the present invention is carried out in two main steps, as shown in FIG. 1.

The first step is the word representation step. Biomedical literature and an existing database are turned into a normalization model through a deep-learning-based processing method. The deep-learning-based word representation method used here is preferably Word2Vec, which rests on the premise that words with similar distributions have similar meanings.

The second step is the normalization step, in which the words to be identified are applied to the model constructed in the first step. Specifically, candidate words from biomedical literature that have not yet been normalized are applied to the normalization model built in the first step, and each word is assigned an identifier. The assigned identifier has a form that is easy to convert into a computer-readable representation, so it can readily be used to identify biomedical terms.

FIG. 2 is a block diagram showing an algorithm according to an embodiment of the present invention in more detail.

First, in the training stage, a normalization model is constructed using an existing database and biomedical literature.

1) Integration of information in the training data set

Before word vectors are constructed for all words in the training corpus and the unlabeled data, this section describes how the information in the entity dictionary and the training corpus is integrated. Hereinafter, the name of a biomedical entity appearing in a sentence is referred to as a mention.

In the present invention, mentions in sentences of the training corpus and the unlabeled data are replaced with their concepts from the training corpus and with synonyms from the dictionary. For example, when "cancer" is mentioned in a sentence, new sentences are created in which "cancer" is replaced with synonyms such as "neoplasms", "tumor", "tumors", "tumour", or "tumours".

In addition, stems and lexical variants of disease names are added. Lexical variants are obtained with the stemming analyzer of Apache Lucene, which implements the Porter stemming algorithm (Prior Art Document 1). For example, when "metabolism" is mentioned in a sentence, it is replaced with its common variants, including "metabolic", "metabolite", and "metabolize", and with its stem ("metabole") to create new sentences.

When a mention consists of several words, the words are joined with underscores ("_") to form a single word. For example, when the mention "breast cancer" is identified in a sentence, a new sentence is created in which "breast cancer" is replaced with the single word "breast_cancer". To increase the coverage of entities represented in the vector space, disease or plant names that appear in the entity dictionary but not in the training data are added to the training data.

2) Word representations (Word2Vec)

Mikolov et al. developed Word2Vec (Prior Art Document 2), a neural network approach for computing vector representations of words. The vectors can be constructed with either of two algorithms: the continuous bag-of-words (CBOW) model and the skip-gram model. The CBOW model trains word representations by predicting a word in a sentence from its surrounding words, whereas the skip-gram model trains them by predicting the surrounding words from the word at the input layer. In Word2Vec, each word is represented as a vector of several hundred dimensions, and words with related meanings are likely to have similar values in the vector space. The vector w_t for the word at the t-th position in a sentence is computed by maximizing the average log probability, as in Equation 1 (CBOW) and Equation 2 (skip-gram).
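The equations themselves are not reproduced in this text. As a reference sketch only, the objectives as they are commonly written for Word2Vec (Prior Art Document 2) are shown below; the window half-width c, the offset index j, and the exact layout are assumptions and may differ from the original Equations 1 and 2.

```latex
% Equation 1 (CBOW): predict the word at position t from its surrounding words
\frac{1}{T}\sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\right)

% Equation 2 (skip-gram): predict each surrounding word from the word at position t
\frac{1}{T}\sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p\left(w_{t+j} \mid w_t\right)
```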

Here, the context terms in Equations 1 and 2 are the word vectors surrounding the c-th word in the sentence, and T is the number of tokens. To train the model, the present invention tries several options for the window size around a word and for the word vector dimension in the CBOW and skip-gram algorithms, and then selects the best option using the development set.
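To make the difference between the two training schemes concrete, the following minimal Python sketch lists the prediction examples each scheme would draw from one sentence for a given window size. The function names and the sample sentence are hypothetical, and the sketch only enumerates training pairs; it does not train a Word2Vec model.

```python
from typing import List, Tuple


def context_window(tokens: List[str], t: int, c: int) -> List[str]:
    """Return the tokens within c positions of index t, excluding tokens[t]."""
    left = tokens[max(0, t - c):t]
    right = tokens[t + 1:t + 1 + c]
    return left + right


def cbow_examples(tokens: List[str], c: int) -> List[Tuple[List[str], str]]:
    """CBOW: each example is (surrounding words, word to predict)."""
    return [(context_window(tokens, t, c), tokens[t]) for t in range(len(tokens))]


def skipgram_examples(tokens: List[str], c: int) -> List[Tuple[str, str]]:
    """Skip-gram: each example is (input word, one surrounding word to predict)."""
    return [
        (tokens[t], neighbour)
        for t in range(len(tokens))
        for neighbour in context_window(tokens, t, c)
    ]


# Hypothetical sentence in which a multi-word mention was already joined by "_".
sentence = "mutations cause breast_cancer".split()
for window in (1, 2):  # candidate window sizes, to be compared on a development set
    print(window, cbow_examples(sentence, window))
    print(window, skipgram_examples(sentence, window))
```

A larger window gives each word more context but also more loosely related neighbours, which is one reason the window size would be tuned on the development set rather than fixed in advance.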

Because the volume of unlabeled data is too large for entity names to be annotated manually, the present invention uses an existing NER system to construct the unlabeled data. Modified evidence sentences are constructed, as described above, by replacing biomedical entity mentions with their concepts from the training set and with synonyms from the dictionary.

For example, an abbreviation dictionary shows that "VWS" is an abbreviation of "van der woude syndrome", and the term is known to have synonyms such as "lip pits". In the training data, a sentence such as "affected males and females are equally likely to transmit VWS." appears.

By rule, the word "VWS" in the sentence is replaced with "van der woude syndrome", "van_der_woude_syndrome", "lip pits", or "lip_pits". The following sentences are thus obtained in addition to the baseline sentence.

(1) "affected males and females are equally likely to transmit van der woude syndrome"

(2) "affected males and females are equally likely to transmit van_der_woude_syndrome"

(3) "affected males and females are equally likely to transmit lip pits"

(4) "affected males and females are equally likely to transmit lip_pits"
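The substitution rule illustrated by sentences (1) to (4) can be sketched in Python as follows. The two lookup tables are hypothetical stand-ins for the abbreviation dictionary and the synonym dictionary and hold only the entries from the VWS example; the plain substring replacement is a simplification, since a real system would replace only the annotated mention span.

```python
from typing import Dict, List

# Hypothetical lookup tables holding only the entries from the VWS example.
ABBREVIATIONS: Dict[str, str] = {"VWS": "van der woude syndrome"}
SYNONYMS: Dict[str, List[str]] = {"van der woude syndrome": ["lip pits"]}


def mention_variants(mention: str) -> List[str]:
    """Return the long form and synonyms of a mention, each with and without underscores."""
    long_form = ABBREVIATIONS.get(mention, mention)
    names = [long_form] + SYNONYMS.get(long_form, [])
    variants: List[str] = []
    for name in names:
        variants.append(name)
        joined = name.replace(" ", "_")  # multi-word name as a single token
        if joined != name:
            variants.append(joined)
    # Drop the original surface form so only new sentences are produced.
    return [v for v in variants if v != mention]


def augment(sentence: str, mention: str) -> List[str]:
    """Create one new sentence per variant of the mention (simple substring replacement)."""
    return [sentence.replace(mention, variant) for variant in mention_variants(mention)]


baseline = "affected males and females are equally likely to transmit VWS."
for new_sentence in augment(baseline, "VWS"):
    print(new_sentence)
```

Keeping both the space-separated and the underscore-joined forms means the vector space contains vectors for the individual tokens as well as for the whole mention.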

The present invention proposes four semi-supervised learning models. Each model constructs a vector set (V) that represents words in a vector space by applying Word2Vec to the training corpus and an unlabeled data set. The learning models are as follows.

(1) semi-supervised learning with unlabeled data of "all abstracts" (hereafter referred to as "SSL-all abstracts"),

(2) semi-supervised learning with unlabeled data of "entity-specific abstracts" ("SSL-entity abstracts"),

(3) semi-supervised learning with unlabeled data of "evidence sentences" ("SSL-evidences"),

(4) semi-supervised learning with unlabeled data of "modified evidence sentences" ("SSL-modified evidences").

In addition to the four models above, the present invention constructs two further models.

(5) semi-supervised model that uses only modified evidence sentences without the training corpus ("SSL-only modified evidences"),

(6) supervised learning model with the training corpus ("SL-only training data").

Then, in the test stage, the terms to be identified are applied to the model constructed in the training stage.

1) Prediction for normalizing biological entities

As shown in FIG. 2, in the test stage the abstracts of, for example, the NCBI disease corpus and a plant corpus are used to test the normalization model. Biomedical mentions are extracted from the abstracts. If an extracted mention exactly matches a concept name, it is assigned the corresponding concept ID and no further normalization is performed. Otherwise, an abbreviation resolution step is applied, in which an abbreviation is expanded to its original long form using an abbreviation dictionary or an abbreviation resolution system.
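The order of these two checks can be sketched in Python as below. The dictionaries and the function name `preprocess` are hypothetical; the "cancer" and "VWS" entries come from the examples in this description, and the exact-match test is taken literally as a case-sensitive lookup, which is an assumption.

```python
from typing import Dict, Optional, Tuple

# Hypothetical dictionaries populated with the examples used in this description.
CONCEPT_IDS: Dict[str, str] = {"cancer": "MESH:D009369"}
ABBREVIATIONS: Dict[str, str] = {"VWS": "van der woude syndrome"}


def preprocess(mention: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (concept_id, mention_to_normalize).

    An exact match yields a concept ID and nothing left to normalize.
    Otherwise the mention is passed on, with abbreviations expanded.
    """
    concept_id = CONCEPT_IDS.get(mention)  # exact match against concept names
    if concept_id is not None:
        return concept_id, None
    long_form = ABBREVIATIONS.get(mention, mention)  # abbreviation resolution
    return None, long_form


for m in ("cancer", "VWS", "tumor"):
    print(m, preprocess(m))
```

Only mentions for which the first element is `None` would proceed to the vector-based ranking step.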

For normalization, the cosine similarity between the vector of the test mention and the vectors of all possible concepts in the entity dictionary is computed in order to map the test mention to a concept. Names with high cosine similarity are regarded as candidate concepts. A mention m and a candidate concept c are represented by vectors v_m and v_c, respectively. If the mention m consists of a single token, such as "cancer" or "tumor", the vector for that token in the vector set V is assigned to v_m. If the mention m contains multiple tokens, v_m is assigned the average of the token vectors in the mention, as in Equation 3: $v_m = \frac{1}{n}\sum_{i=1}^{n} v_{mi}$.

Here, v_mi is the vector of the i-th token in the mention and n is the number of tokens. If v_mi is not included in the vector set V, v_mi is set to zero and the average vector v_m is computed with Equation 3. Multi-token concepts are converted to single tokens with underscores in the training stage. Once the mention of a biomedical entity has been represented as a vector, the concepts whose word vectors v_c in the vector set V have high cosine similarity to the mention vector v_m are recommended as normalized concepts.
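The averaging and ranking just described can be sketched in plain Python as follows. The two-dimensional toy vectors, the candidate names, and the function names are hypothetical and serve only to show the mechanics; real Word2Vec vectors have several hundred dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mention_vector(tokens: Sequence[str], vectors: Dict[str, Vector], dim: int) -> Vector:
    """Average the token vectors; tokens missing from V contribute a zero vector."""
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token, [0.0] * dim)
        total = [a + b for a, b in zip(total, vec)]
    n = len(tokens)
    return [x / n for x in total] if n else total


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(tokens: Sequence[str], concept_names: Sequence[str],
                    vectors: Dict[str, Vector], dim: int,
                    top_k: int = 10) -> List[Tuple[str, float]]:
    """Rank concept names by cosine similarity to the mention vector."""
    v_m = mention_vector(tokens, vectors, dim)
    scored = [(name, cosine(v_m, vectors[name])) for name in concept_names if name in vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy vector set V.
V = {
    "c7": [1.0, 0.0],
    "deficiency": [0.0, 1.0],
    "c7_deficiency": [0.7, 0.7],
    "immunologic_deficiency_syndromes": [0.1, 0.9],
}
print(rank_candidates(["c7", "deficiency"],
                      ["c7_deficiency", "immunologic_deficiency_syndromes"], V, dim=2))
```

Note that a missing token still counts toward n, so an out-of-vocabulary token shortens the averaged vector without changing the direction contributed by the known tokens.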

2) Evaluation metric of the normalization tool

To measure the performance of the disease name normalization tool, the concepts manually mapped in the test corpus are compared with the concepts predicted by the algorithm according to an embodiment of the present invention. Table 1 shows an example of normalized disease names from the NCBI test set.

     Rank   Candidate name                        Cosine similarity
*    1      COMPLEMENT_COMPONENT_7_DEFICIENCY     0.559244
*    2      complement_component_7_defici         0.554464
*    3      c7_defici                             0.549911
*    4      complement_component_7_deficiency     0.540654
*    5      C7_DEFICIENCY                         0.533657
*    6      c7_deficiency                         0.525014
     7      antibodi_defici_syndrom               0.510718
     8      Immunologic_Deficiency_Synmdromes     0.499981
     9      immunolog_defici_syndrom              0.492753
*    10     c7d                                   0.491925

"C7 defect" is a disease mention in the NCBI disease test corpus and a synonym of "COMPLEMENT COMPONENT 7 DEFICIENCY"; the corresponding concept identifier is "OMIM:610102". For the given mention, the other names are ranked by their cosine similarity to the mention in the vector representation.

Because a concept identifier covers several disease synonyms, an asterisk in the first column marks a name that is a synonym under the concept identifier, that is, a correctly recommended answer. In Table 1, the candidates ranked first, second, third, fourth, fifth, sixth, and tenth are correct.

The performance of the normalization model is measured over all mentions in the test set for each rank threshold. For a given rank threshold, predicted names (or their concept IDs) ranked above the threshold are regarded as positive predictions. A true positive (TP) is a correct positive prediction, a false positive (FP) is an incorrect positive prediction, and a false negative (FN) is a mention that is not positively predicted. When an extracted mention exactly matches a concept name, only a single concept ID is assigned and the mention is judged to be correctly normalized; such exact matches are therefore counted as TPs when performance is computed for each rank threshold. FIG. 3 shows a candidate list and examples of TP, FP, and FN. Precision (p), recall (r), and F-score (f) are given by Equation 4: $p = \frac{TP}{TP+FP}$, $r = \frac{TP}{TP+FN}$, $f = \frac{2pr}{p+r}$.
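A minimal Python sketch of this evaluation is given below. The counting convention is an assumption, since the description does not spell it out: for each mention, the distinct concept IDs among the top-k candidates are treated as positive predictions, and a mention whose gold ID is absent from them counts as one FN. Apart from "OMIM:610102", the identifiers are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def count_at_threshold(predictions: Dict[str, List[str]], gold: Dict[str, str],
                       k: int) -> Tuple[int, int, int]:
    """Count TP, FP, FN over all mentions for rank threshold k (assumed convention)."""
    tp = fp = fn = 0
    for mention, gold_id in gold.items():
        # Distinct concept IDs among the top-k ranked candidates, order preserved.
        top_ids = list(dict.fromkeys(predictions.get(mention, [])[:k]))
        hits = sum(1 for concept_id in top_ids if concept_id == gold_id)
        tp += hits
        fp += len(top_ids) - hits
        if hits == 0:
            fn += 1
    return tp, fp, fn


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-score, each set to 0.0 when its denominator is zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical ranked concept IDs per mention; "HYPO:*" IDs are placeholders.
predictions = {
    "C7 defect": ["OMIM:610102", "OMIM:610102", "HYPO:1"],
    "mention 2": ["HYPO:2", "HYPO:3"],
}
gold = {"C7 defect": "OMIM:610102", "mention 2": "HYPO:9"}

tp, fp, fn = count_at_threshold(predictions, gold, k=3)
print(tp, fp, fn, precision_recall_f(tp, fp, fn))
```

Raising k generally raises recall, because the gold concept is more likely to appear among the candidates, while lowering precision, because more incorrect candidates are counted as positive predictions.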

FIG. 3 compares the performance of the models proposed in the present invention with that of DNorm (Prior Art Document 3), an existing normalization model, with and without an abbreviation resolution step.

In FIG. 3, "SSL-modified evidences" shows the best performance in both cases, and "SL-only training data" performs better than "SSL-only modified evidences".

FIG. 3 also shows that normalization accuracy improves when unlabeled data is combined with the training data. "SSL-modified evidences" shows the best performance.

FIG. 4 compares the performance of the models proposed in the present invention for plant name normalization.

As shown in FIG. 4, "SSL-modified plant evidence" shows the best performance. Unlike the disease name normalization results, "SSL-only modified evidences" performs better than "SL-only training data". No abbreviation dictionary is available for plant names, and plant names are expressed in multiple forms depending on context, region, or language. Accordingly, plant name normalization shows lower accuracy than disease name normalization.

도 5는 본 발명의 일 실시 예에 따른 생의학적 용어 식별 방법에 관한 흐름도를 나타낸다.5 is a flowchart illustrating a biomedical term identification method according to an embodiment of the present invention.

도 5에서 설명하는 알고리즘은 프로그램화 될 수 있으며, 해당 프로그램은 개체명 정규화 시스템에서 구현될 수 있다.The algorithm described in FIG. 5 may be programmable, and the corresponding program may be implemented in an entity name normalization system.

개체명 정규화 시스템은 트레이닝 데이터를 구축한다(S101).The entity name normalization system constructs training data (S101).

In this step, the entity name normalization system may construct the training data by processing an existing database and unclassified data. The system may construct additional data by replacing unclassified terms with synonyms found in a dictionary. It may also construct additional data by extracting and transforming the roots (stems) of unclassified terms. In addition, when a mention contains multiple words, the system may connect the words with underscores to form a single word.
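The three augmentation steps described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the toy synonym dictionary, and the crude suffix-stripping rule (a stand-in for a real stemmer) are all assumptions made for the example.

```python
# Illustrative sketch only: names, dictionary contents, and the suffix rule
# are hypothetical and not taken from the patent.

# Hypothetical synonym dictionary: term -> known synonyms.
SYNONYMS = {
    "tumor": ["neoplasm"],
    "kidney": ["renal"],
}

# Crude suffix stripping used as a stand-in for a real stemmer.
SUFFIXES = ("es", "s")


def substitute_synonyms(mention):
    """Return variants of the mention with one word replaced by a synonym."""
    words = mention.lower().split()
    variants = []
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            variants.append(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return variants


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_mention(mention):
    """Apply the crude stemmer to every word of the mention."""
    return " ".join(strip_suffix(w) for w in mention.lower().split())


def join_with_underscores(mention):
    """Turn a multi-word mention into a single token."""
    return "_".join(mention.split())


def augment(mention):
    """Collect the original mention and its variants as single tokens."""
    variants = [mention.lower()] + substitute_synonyms(mention)
    variants += [stem_mention(v) for v in list(variants)]
    # Deduplicate while preserving order.
    seen = []
    for v in variants:
        token = join_with_underscores(v)
        if token not in seen:
            seen.append(token)
    return seen
```

Calling `augment("Kidney tumors")` on this toy dictionary produces the original mention, its synonym-substituted form, and the stemmed forms of both, each as one underscore-joined token.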

The entity name normalization system builds a normalization model by applying a deep learning-based word representation to the constructed training data (S103). The Word2Vec algorithm is the preferred deep learning-based algorithm for this step.

More than one normalization model may be built in this step.
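As a rough illustration of what Word2Vec consumes in step S103, the sketch below generates skip-gram (center, context) pairs from tokenized training sentences. The source does not state whether the skip-gram or CBOW variant is used, nor the window size, and the identifier token `ID_0001` placed next to the mention is invented for the example. A real system would more likely pass the sentences to an existing Word2Vec implementation than build pairs by hand.

```python
def skipgram_pairs(sentences, window=2):
    """Return (center, context) token pairs within a symmetric window."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs


# Toy training sentence: an underscore-joined mention next to a
# hypothetical identifier token (the ID is invented for this example).
sentences = [["patients", "with", "breast_cancer", "ID_0001"]]
pairs = skipgram_pairs(sentences, window=1)
```

Because the multi-word mention is a single token, it appears in the pairs as one unit and would receive a single vector during training.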

The entity name normalization system identifies biomedical terms by applying test data to the constructed normalization model (S105).

Here, test data refers to biomedical data to which an identifier is to be assigned. The entity name normalization system assigns identifiers, according to the normalization model, to data that has none, which makes it easier to store the corresponding terms in a digital database.
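The text above says only that identifiers are assigned according to the normalization model. One plausible reading, shown here purely as an assumption, is a nearest-neighbor lookup: the mention's vector is compared with each identifier's vector by cosine similarity and the closest identifier is chosen. The vectors, the identifier names `ID_A` and `ID_B`, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_identifier(mention_vector, id_vectors):
    """Return the identifier whose vector is most similar to the mention."""
    return max(id_vectors, key=lambda ident: cosine(mention_vector, id_vectors[ident]))


# Hypothetical identifier vectors; real ones would come from the trained model.
id_vectors = {
    "ID_A": [1.0, 0.0, 0.0],
    "ID_B": [0.0, 1.0, 0.0],
}
mention_vector = [0.9, 0.1, 0.0]  # made-up vector for an unlabeled mention
best = assign_identifier(mention_vector, id_vectors)
```

With these toy values the mention lies closest to `ID_A`, so that identifier is returned.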

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. Computer-readable media include all kinds of recording devices that store data readable by a computer system. Examples of computer-readable media include an HDD (Hard Disk Drive), an SSD (Solid State Disk), an SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

An entity name normalization method in which each step is performed by a computing device, the method comprising:
processing a known database and unclassified data on biomedical terms to construct training data;
constructing a normalization model by applying a deep learning-based word representation to the constructed training data; and
assigning an identifier by applying test data, to which the identifier is to be assigned, to the constructed normalization model, and identifying the biomedical term with the normalization model to which the identifier is assigned,
wherein the step of constructing the training data includes
a method of constructing additional data using synonyms in a known dictionary, a method of constructing additional data using root extraction and transformation, and a method of connecting the words of a multi-word mention with underscores to form a single word.

delete

The entity name normalization method according to claim 1,
wherein the deep learning-based word representation is
the Word2Vec algorithm.