KR102346136B1

KR102346136B1 - A relevant document extraction system for gene chemical disease

Info

Publication number: KR102346136B1
Application number: KR1020200041324A
Authority: KR
Inventors: 이현주; 김정균; 김정재
Original assignee: 광주과학기술원
Priority date: 2020-04-06
Filing date: 2020-04-06
Publication date: 2022-01-03
Also published as: KR20210123756A

Abstract

The system for extracting literature related to genes, compounds, and diseases according to the present invention includes: an input unit through which a user query is entered in natural language; an entity name recognition unit that recognizes entities in the entered natural language; an artificial intelligence learning model that provides documents related to genes, compounds, and diseases using the entities recognized by the entity name recognition unit; a post-processing unit that post-processes the documents extracted by the artificial intelligence learning model; and an output unit that outputs the documents processed and filtered by the post-processing unit.

Description

A relevant document extraction system for gene chemical disease

The present invention relates to a system for extracting literature related to genes, compounds, and diseases.

In drug development and gene research targeting specific diseases, it is important to consider the relationships among genes, compounds, and diseases together. In addition, the interactions between genes and compounds in the onset and progression of disease, and the mechanisms of those interactions, have been established through extensive experiments and are published in a number of literature databases (for example, PubMed).

It is important for researchers to be able to extract the documents they want from such literature databases and consult them in their research.

Most conventional literature extraction systems can only extract and display, from a literature database, documents that mention two of the three entity types (gene, compound, and disease). For example, such a technique is disclosed in Application No. 10-2014-0015578, 'Sentence search engine including the relationship between genes and diseases'.

As a technique for extracting documents using the relationships among all three of genes, compounds, and diseases, Application No. 10-2010-0105299, 'Method for extracting gene disease compound relationship from large-scale biotechnology literature using multidimensional index', has been disclosed. However, that technique relies on indexing, and it is not highly effective.

Application No. 10-2014-0015578, 'Sentence search engine including the relationship between genes and diseases'
Application No. 10-2010-0105299, 'Method for extracting gene disease compound relationship from large-scale biotechnology literature using multidimensional index'

The present invention is proposed against the background described above, and provides a system for extracting literature related to genes, compounds, and diseases that extracts documents clearly showing the relationship among a gene, a compound, and a disease.

According to the present invention, the three-way relationship (triplet relationship) among a specific gene, a specific compound, and a specific disease can be put to better use.

FIG. 1 is a block diagram of a system for extracting literature related to genes, compounds, and diseases according to an embodiment.
FIG. 2 is a view showing a learning device for the artificial intelligence learning model.
FIG. 3 is a view for explaining the effect of the embodiment.

Hereinafter, specific embodiments of the present invention will be described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the following embodiments. Those skilled in the art who understand the spirit of the present invention will readily be able to propose other embodiments within the scope of the same spirit by adding, modifying, or deleting components, and such embodiments also fall within the scope of the spirit of the present invention.

FIG. 1 is a block diagram of a system for extracting literature related to genes, compounds, and diseases according to an embodiment.

Referring to FIG. 1, the system for extracting literature related to genes, compounds, and diseases according to the embodiment includes an input unit (1) through which a user enters a query, an entity name recognition unit (2) that recognizes the entered natural language, an artificial intelligence learning model (3) that provides documents related to genes, compounds, and diseases using the entities recognized by the entity name recognition unit (2), a post-processing unit (4) that post-processes the documents extracted by the artificial intelligence learning model (3), and an output unit (5) that outputs the documents processed and filtered by the post-processing unit (4).

The output unit (5) may extract documents in which the gene and the compound are included in one single sentence and the disease is included in another sentence. Here, a sentence refers to one sentence ending with a period. The term has the same meaning hereinafter.

By searching for documents in which the gene and the compound are included in one sentence and the disease is included in another sentence, the most suitable documents can be retrieved.

The system for extracting literature related to genes, compounds, and diseases according to the embodiment will now be described in more detail.

Through the input unit (1), the user enters the names of a gene, a compound, and a disease using whatever words they know. The entered words may be recognized in the entity name recognition unit (2) by a separate named entity recognition tool (NER tool) for each entity type.

GNormPlus may be used to recognize gene entity names, tmChem may be used to recognize compound entity names, and DNorm may be used to recognize disease entity names.
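
The per-type recognition step described above can be sketched as follows. This is only an illustration under stated assumptions: the real tools (GNormPlus, tmChem, DNorm) are separate programs and are not called here, and the lexicon, the identifier strings, and the function names are hypothetical stand-ins.

```python
# Which tool the source names for each entity type (for reference only;
# none of these tools is invoked in this sketch).
NER_TOOL_BY_TYPE = {"gene": "GNormPlus", "compound": "tmChem", "disease": "DNorm"}

# Hypothetical toy lexicon mapping surface forms to normalized identifiers.
TOY_LEXICON = {
    "gene": {"brca1": "GENE:BRCA1", "breast cancer 1": "GENE:BRCA1"},
    "compound": {"tamoxifen": "CHEM:tamoxifen"},
    "disease": {
        "breast cancer": "DIS:breast_cancer",
        "breast carcinoma": "DIS:breast_cancer",
    },
}


def recognize(term, entity_type):
    """Normalize one query term to an entity identifier, or None if unknown."""
    return TOY_LEXICON[entity_type].get(term.strip().lower())


def recognize_query(gene, compound, disease):
    """Recognize the three query terms, one per entity type."""
    return {
        "gene": recognize(gene, "gene"),
        "compound": recognize(compound, "compound"),
        "disease": recognize(disease, "disease"),
    }
```

In this sketch a user who types a synonym such as "Breast carcinoma" is mapped to the same identifier as "breast cancer", which is the effect that normalization by an NER tool is meant to achieve.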

The recognized entity names of the gene, compound, and disease are input to the artificial intelligence learning model (3), and the artificial intelligence learning model (3) can output suitable documents.

The learning device for the artificial intelligence learning model (3) is shown in FIG. 2.

Referring to FIG. 2, the artificial intelligence learning device (30) using K-learning includes a pre-processing unit (31) that pre-processes the names of genes, compounds, and diseases sentence by sentence and inputs them, an embedding unit (32) that converts the pre-processed input information into dense vectors using a word vector and an entity type vector, and a learning unit (33) exemplified by a Bi-LSTM layer. A learning model (34) trained through the learning unit (33) may further be provided. The learning model (34) may be stored in the artificial intelligence learning model (3).

For any one document, the pre-processing unit (31) aligns a gene/compound sentence part (311), in which a gene and a compound are included in a single sentence, with a disease sentence part (312), in which a disease is included, and outputs them to the embedding unit. PubMed's MEDLINE may be used as the literature database.
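
The alignment performed by the pre-processing unit can be sketched as follows. This is a minimal illustration, assuming plain substring matching in place of real entity recognition and a naive period-based sentence splitter; the function names are hypothetical and do not come from the patent.

```python
import re


def split_sentences(text):
    """Split after periods, following the source's period-based definition of a sentence."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def contains(sentence, term):
    """Case-insensitive substring match (a stand-in for NER output)."""
    return term.lower() in sentence.lower()


def align_sentences(text, gene, compound, disease):
    """Pair every gene/compound sentence with every disease sentence of one document."""
    sentences = split_sentences(text)
    gene_compound_part = [
        s for s in sentences if contains(s, gene) and contains(s, compound)
    ]
    disease_part = [s for s in sentences if contains(s, disease)]
    return [(gc, d) for gc in gene_compound_part for d in disease_part]
```

If the disease happens to appear in the gene/compound sentence itself, this sketch pairs that sentence with itself. A real pipeline would also need a sentence splitter that copes with abbreviations and decimal numbers.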

The embedding unit (32) may use two vectors. The first is a word representation vector (word vector): word vectors of size 200, obtained by pre-processing the MEDLINE database with a predetermined technique, may be used. As the pre-processing technique, 'Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013' may be used. The second is an entity type representation vector (entity type vector), which may indicate the entity type of each word as obtained from the entity name recognition unit (2).
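
One way to picture how the two vectors combine into a single dense input per token is sketched below. It is an assumption-laden illustration: the source does not say how the entity type vector is encoded or how the two vectors are combined, so the one-hot encoding, the concatenation, the zero vector for unknown words, and all names here are hypothetical.

```python
# Entity types assumed for illustration; "none" marks ordinary words.
ENTITY_TYPES = ["none", "gene", "compound", "disease"]
WORD_DIM = 200  # word vector size stated in the source


def entity_type_vector(entity_type):
    """One-hot encoding of the entity type (an assumed encoding)."""
    return [1.0 if t == entity_type else 0.0 for t in ENTITY_TYPES]


def embed_token(token, entity_type, word_vectors):
    """Concatenate the word vector with the entity type vector."""
    # Unknown words fall back to a zero vector in this sketch.
    word_vec = word_vectors.get(token.lower(), [0.0] * WORD_DIM)
    return word_vec + entity_type_vector(entity_type)


def embed_sentence(tagged_tokens, word_vectors):
    """Embed a sentence given as (token, entity_type) pairs."""
    return [embed_token(tok, typ, word_vectors) for tok, typ in tagged_tokens]
```

The resulting sequence of per-token vectors is the kind of input a Bi-LSTM layer, as named in the source, would consume.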

The dense vectors produced by the embedding unit (32) are learned by the learning unit (33), which can then provide the learning model (34). The learning model (34) is supplied as the artificial intelligence learning model (3) of the system for extracting literature related to genes, compounds, and diseases according to the embodiment, so that results can be extracted in response to a user's query.

The learning model (34) may extract positive (11) and negative (12) documents. A positive may mean a document in which the gene and the compound are included in one single sentence and the disease is included in that same sentence or in another sentence. A negative may mean a document that is not a positive but includes all of the gene, compound, and disease entered in the query.
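
The two definitions above can be written out as a simple rule. The sketch below illustrates the definitions only; it is not the trained learning model, and it assumes substring matching and a naive period-based sentence splitter. The function name is hypothetical.

```python
import re


def label_document(text, gene, compound, disease):
    """Return "positive", "negative", or None according to the definitions above."""
    sentences = [s.lower() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]
    g, c, d = gene.lower(), compound.lower(), disease.lower()

    # Both labels require all three query entities somewhere in the document.
    has_all = (
        any(g in s for s in sentences)
        and any(c in s for s in sentences)
        and any(d in s for s in sentences)
    )
    if not has_all:
        return None

    # Positive: the gene and the compound co-occur in at least one sentence;
    # the disease may be in that sentence or in any other one.
    if any(g in s and c in s for s in sentences):
        return "positive"
    return "negative"
```

For example, a document whose gene and compound are mentioned only in separate sentences is labeled negative even though all three entities are present.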

Returning to FIG. 1, the artificial intelligence learning model (3) may output documents in response to the user's query. The output documents may include false positives, that is, documents retrieved as positive that are not actually positive. A post-processing unit (4) that additionally filters out such false positive documents may further be included.

The rules by which the post-processing unit (4) performs this additional filtering are as follows.

First: when a recognized mention (a mention may mean the actual word recognized as an entity in the document; the same applies hereinafter) is not included in the dictionary as a synonym of the entity after the mention has been normalized to an entity name.

Second: when a mention is recognized as two or more entity types. For example, when the same word is recognized by the entity name recognition unit as both a gene and a compound.

Third: when a sentence containing a gene, compound, or disease contains a hyponym of 'study' from WordNet. In this case, the sentence may state the purpose of the research rather than describe an experiment.

Fourth: when the sentence contains a negation word, for example, not or never.

Fifth: when the gene name and the compound name are connected by a conjunction according to the parse tree. In most such documents, the gene and the compound do not interact with each other.

Documents that match the five rules above may be removed from the positive (11). Alternatively, such documents may be assigned to the negative (12), or their positive score may be lowered.
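
Part of this filtering can be sketched as follows. The sketch covers only the first, second, and fourth rules; the third and fifth are omitted because they need WordNet and a syntactic parser. It assumes a document is flagged when any one covered rule matches, and the record layout (`text`, `types`, `entity_id`, `sentence`, `mentions`), the synonym dictionary, and the function names are all hypothetical.

```python
import re

# Negation words named in the source; a real list would be longer.
NEGATION_WORDS = {"not", "never"}


def violates_rule1(mention, synonym_dict):
    """Rule 1: the mention text is not a known synonym of its normalized entity."""
    synonyms = synonym_dict.get(mention["entity_id"], set())
    return mention["text"].lower() not in synonyms


def violates_rule2(mention):
    """Rule 2: the mention was recognized as more than one entity type."""
    return len(mention["types"]) > 1


def violates_rule4(sentence):
    """Rule 4: the sentence contains a negation word."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return bool(words & NEGATION_WORDS)


def is_false_positive(doc, synonym_dict):
    """Flag a document when any covered rule matches (an assumed reading)."""
    return violates_rule4(doc["sentence"]) or any(
        violates_rule2(m) or violates_rule1(m, synonym_dict)
        for m in doc["mentions"]
    )


def postprocess(positives, synonym_dict):
    """Split retrieved positives into kept documents and demoted ones."""
    kept, demoted = [], []
    for doc in positives:
        (demoted if is_false_positive(doc, synonym_dict) else kept).append(doc)
    return kept, demoted
```

Here the demoted list corresponds to the option of reassigning a document to the negative; dropping the document or lowering its score would be equally consistent with the text.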

The result processed by the post-processing unit (4) may be output through the output unit (5). As already described, the output may be given as positive (11) and negative (12). Each document included in the positive and the negative may have a score. The score may be given by the learning model.

FIG. 3 is a view for explaining the effect of the embodiment.

FIG. 3 shows, for Alzheimer's disease (A), hypertension (B), breast cancer (C), and prostate cancer (D), the number of documents retrieved by DigChem, which is the system of the embodiment, and by the previously known CTD, WDD, and DrugBank.

According to FIG. 3, DrugBank, which is classified manually by human operators, may show the best results. By comparison, the embodiment shows better results than CTD and WDD.

The present invention can be of great help to researchers studying the relationships among genes, compounds, and diseases, and can furthermore contribute to the development of the medical industry.

3: Artificial intelligence learning model

Claims

1. A system for extracting literature related to genes, compounds, and diseases, the system comprising:
an input unit for inputting a user query in natural language;
an entity name recognition unit for recognizing entities in the input natural language;
an artificial intelligence learning model that provides documents related to genes, compounds, and diseases using the entities recognized by the entity name recognition unit;
a post-processing unit for post-processing the documents extracted by the artificial intelligence learning model; and
an output unit for outputting the documents processed and filtered by the post-processing unit,
wherein the output unit distinguishes and displays:
positives, which are documents in which the gene and the compound are included in one single sentence and the disease is included in that sentence or in another sentence; and
negatives, which are documents that include all of the gene, compound, and disease entered in the query but are not positives.

2. The system of claim 1,
wherein the artificial intelligence learning model is a result output from an artificial intelligence learning device,
the artificial intelligence learning device comprising:
a pre-processing unit for pre-processing and inputting the names of genes, compounds, and diseases for each sentence;
an embedding unit that converts the pre-processed input information into dense vectors using a word vector and an entity type vector; and
a learning unit that learns from the output of the embedding unit and outputs the artificial intelligence learning model.

3. The system of claim 2,
wherein the pre-processing unit, for any one document, aligns
a gene/compound sentence part, in which the gene and the compound are included together in a single sentence, and
a disease sentence part, which is a sentence containing the disease,
with each other and outputs them to the embedding unit.

4. (Deleted)

5. The system of claim 1,
wherein the post-processing unit excludes a document retrieved as positive from the positives in the following cases:
- when a recognized mention is not included in the dictionary as a synonym of the entity after the mention has been normalized to an entity name;
- when a mention is recognized as more than one entity type;
- when a sentence containing a gene, compound, or disease contains a hyponym of 'study';
- when the sentence contains a word with a negative meaning; and
- when the gene name and the compound name are connected by a conjunction according to the parse tree.