KR102606267B1

KR102606267B1 - Target Prediction Method and System using correction based on prediction reliability

Info

Publication number: KR102606267B1
Application number: KR1020230056484A
Authority: KR
Inventors: 한석진; 김태용
Original assignee: 주식회사 스탠다임
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2023-11-29

Abstract

The present invention relates to a target prediction method and system that use a correction technique based on prediction reliability. Specifically, the invention relates to a method and system that train on learning data so as to derive relevance information for each of a plurality of second entities related to a queried first entity, together with the prediction reliability of that relevance information, thereby addressing the problems caused by a biased distribution of the learning data.

Description

Target prediction method and system using correction technology based on prediction reliability {Target Prediction Method and System using correction based on prediction reliability}

The present invention relates to a target prediction method and system that use a correction technique based on prediction reliability.

Attempts to use artificial intelligence to make new drug development more efficient are increasing. Much effort has gone into developing AI-based prediction models that search for new drugs effective against specific diseases, and technology that takes the name of a disease as a query and derives drugs related to that disease is becoming a core technology in new drug development.

Drugs derived by a prediction model can be regarded as related to the queried disease through a specific relationship, and each may have a different relevance score. The relevance score, however, differs from one prediction model to another depending on the learning data the model was trained on. Learning data may be concentrated in certain areas depending on the state of the art at the time of training, and the distribution of prediction reliability differs between the areas where the learning data is concentrated and the areas where it is not.

Although prediction reliability differs for each drug derived as related to the queried disease, existing approaches interpret the results as if all of them were equally reliable, which is inappropriate.

Accordingly, the present inventors developed the present invention, which provides more accurate results by taking as its final output a corrected result value, that is, the result value derived from a trained prediction model after a correction based on the prediction reliability of that value has been applied.

Korean Registered Patent No. 10-2225278 (2021.03.10.)

To solve the above problem, an object of the present invention is to provide a method and system that train on learning data so as to derive relevance information for each of a plurality of second entities related to a queried first entity, together with the prediction reliability of that relevance information, thereby addressing the problems caused by a biased distribution of the learning data.

Another object is to provide a method and system in which the accuracy of the generated prediction model is improved by training on the learning data so that the step of deriving relevance information and the step of deriving prediction reliability are performed simultaneously.

Another object is to provide a method and system that are more reliable than the prior art by correcting the relevance information (relevance score) to reflect the derived prediction reliability and providing the resulting corrected relevance information to the user.

Another object is to provide a method and system in which the accuracy of the generated prediction model is improved by constraining the weight matrix of the artificial neural network model to be an orthogonal matrix while training on the learning data to derive prediction reliability.

Another object is to provide a method and system that resolve the overconfidence issue: because prediction reliability is derived, the reliability assigned to inputs with no training history is lowered.

To achieve the above objects, one embodiment of the present invention provides a target prediction method comprising: a first learning step in which a learning processor trains an artificial neural network to derive, when a first entity is queried, a plurality of second entities related to the queried first entity and relevance information between the queried first entity and each second entity; and a second learning step in which the learning processor trains the artificial neural network to derive, when a first entity is queried, a plurality of second entities related to the queried first entity and a prediction reliability for the queried first entity and each second entity. The artificial neural network learns learning data that includes node data, in which disease-related data is defined as a first node, gene-related data as a second node, and drug-related data as a third node, and a preset path for each node pair.

In one embodiment, the first learning step and the second learning step may be performed simultaneously.

In one embodiment, the second learning step may further include training so that the prediction reliability is maximized for the learning data, while different prediction reliabilities are derived depending on the similarity between the pair (queried first entity, derived second entity) and the learning data.

In one embodiment, the second learning step may further include training so that the prediction reliability is maximized for the learning data, while different prediction reliabilities are derived depending on whether the pair (queried first entity, derived second entity) is included in the learning data.
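The behavior described in the two embodiments above can be illustrated with a minimal sketch. It is not the learned reliability output described in this document: the function `toy_reliability`, the cosine-similarity rule, and the vector representation of a pair are hypothetical stand-ins, chosen only to show the intended effect that pairs resembling the learning data receive high reliability and dissimilar pairs receive low reliability.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_reliability(pair_vec, training_vecs):
    """Hypothetical stand-in for a learned reliability.

    Returns the best cosine similarity between the queried pair and any
    training pair, clipped to the range [0, 1].
    """
    best = max((cosine(pair_vec, t) for t in training_vecs), default=0.0)
    return min(1.0, max(0.0, best))


training = [[1.0, 0.0], [0.0, 2.0]]          # assumed embeddings of training pairs
seen = toy_reliability([1.0, 0.0], training)     # identical to a training pair
unseen = toy_reliability([-1.0, 0.0], training)  # unlike any training pair
```

In this sketch a pair that appears in the training set gets a reliability of about 1, and a pair pointing away from every training pair gets 0.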

In one embodiment, the second learning step may include training on the learning data so that the weight matrix of the artificial neural network becomes an orthogonal matrix.

In one embodiment, the weight matrix (Q) satisfies [equation missing] when [equation missing], where I is the identity matrix and [symbols missing] are the column vectors of a parameter matrix that is a lower triangular matrix.
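The formula for Q did not survive in this text, so it cannot be quoted. One well-known construction that fits the surviving description (an identity matrix I and the column vectors of a lower triangular parameter matrix) is a product of Householder reflections, Q = H_1 H_2 ... H_n with H_i = I - 2 v_i v_i^T / (v_i^T v_i). The sketch below assumes that construction; it is an assumption about the missing formula, not a reconstruction of it, and the names `matmul` and `householder_orthogonal` are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def householder_orthogonal(param):
    """Build Q = H_1 H_2 ... H_n from the columns of a square parameter matrix.

    Each H_i = I - 2 v_i v_i^T / (v_i^T v_i), where v_i is the i-th column of
    `param` (assumed lower triangular). Assumed construction, see lead-in.
    """
    n = len(param)
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        v = [param[row][col] for row in range(n)]
        norm_sq = sum(x * x for x in v)
        if norm_sq == 0.0:
            continue  # a zero column contributes the identity
        h = [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
              for j in range(n)] for i in range(n)]
        q = matmul(q, h)
    return q


# Lower triangular parameter matrix with arbitrary illustrative values.
param = [[1.0, 0.0, 0.0],
         [2.0, 1.0, 0.0],
         [-1.0, 3.0, 2.0]]
q = householder_orthogonal(param)
```

Because every Householder reflection is orthogonal, their product is orthogonal for any parameter values, which is why a parameterization of this kind can enforce the constraint throughout training rather than only penalizing deviations from it.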

In one embodiment, the arbitrary first entity is a term belonging to one of the types disease, gene, and drug, and the prediction model may output second entities belonging to a type different from that of the input first entity.

In one embodiment, the method may further include: an input step in which an arbitrary first entity is input to an input device; a derivation step in which a processor queries the input layer of the artificial neural network with the input first entity, and the plurality of second entities, the relevance information, and the prediction reliability are derived at the output layer of the artificial neural network; and an output step in which an output device outputs the plurality of second entities, the relevance information, and the prediction reliability derived in the derivation step.

In one embodiment, the method may further include, after the derivation step and before the output step, a correction step in which the processor corrects the relevance information for each of the plurality of second entities based on the derived prediction reliability to derive corrected relevance information, and the output step may further include outputting the corrected relevance information derived in the correction step.

In one embodiment, the relevance information includes a relevance score proportional to the relevance to the queried first entity, and the correction step may further include correcting the relevance score so that the lower the derived prediction reliability, the lower the relevance score becomes.
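The correction step can be sketched as follows. The text only requires that a lower prediction reliability yields a lower corrected score; multiplying the score by the reliability is one simple rule with that property and is an assumption here, as are the names `Prediction` and `correct_relevance` and the placeholder entity names.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    entity: str         # derived second entity
    relevance: float    # relevance score, assumed to lie in [0, 1]
    reliability: float  # prediction reliability, assumed to lie in [0, 1]


def correct_relevance(predictions):
    """Scale each relevance score by its reliability and re-rank.

    Multiplication is an assumed correction rule, not the one in this document.
    Returns (entity, corrected_score) tuples, highest corrected score first.
    """
    corrected = [(p.entity, p.relevance * p.reliability) for p in predictions]
    return sorted(corrected, key=lambda item: item[1], reverse=True)


predictions = [
    Prediction("drug_A", relevance=0.9, reliability=0.5),
    Prediction("drug_B", relevance=0.8, reliability=1.0),
]
ranked = correct_relevance(predictions)
```

Here `drug_A` has the higher raw score, but after correction `drug_B` ranks first because its score comes with a higher reliability.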

In one embodiment, the learning data may include the embedding vectors of the first to third nodes and, among the plurality of paths belonging to a preset path type (metapath) for each node pair, a subset of paths extracted in descending order of a path score computed by a preset method.
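Selecting the highest-scoring paths can be sketched as a top-k selection. The scoring method itself is not specified at this point in the text, so the scores below are arbitrary illustrative numbers, and `extract_top_paths` and the placeholder node names are hypothetical.

```python
import heapq


def extract_top_paths(scored_paths, k):
    """Return the k (path, score) items with the highest scores, best first."""
    return heapq.nlargest(k, scored_paths, key=lambda item: item[1])


# Candidate paths for one node pair, each with an assumed precomputed score.
scored_paths = [
    (("disease_A", "gene_B", "drug_C"), 0.2),
    (("disease_A", "gene_D", "drug_C"), 0.9),
    (("disease_A", "drug_C"), 0.5),
]
top_paths = extract_top_paths(scored_paths, k=2)
```

Keeping only a subset of the paths bounds the amount of path data per node pair, which matters because the number of paths between two nodes can grow quickly with path length.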

In one embodiment, the learning data may include an embedding vector output from a query-path encoding model, where the inputs to the query-path encoding model are (i) the embedding vector that a query encoding model outputs when an arbitrary entity pair is input to it and (ii) the embedding vector that a path encoding model outputs when the extracted subset of the paths connecting that entity pair is input to it.

In one embodiment, the arbitrary entity pair may be input to the query encoding model in the form of a query embedding vector that includes entity embedding vectors identifying each entity of the pair and positional embedding vectors identifying the position of each entity embedding vector. A path connecting the entity pair may be input to the path encoding model in the form of a path embedding vector that includes embedding vectors identifying each of the entities on the path between the pair and the edge type of each edge connecting those entities, together with positional embedding vectors identifying the position of each of those embedding vectors.
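How entity embeddings and positional embeddings might be combined into a query embedding can be sketched as below. The text says only that the query embedding vector includes both kinds of vectors; adding them element by element, as is common in Transformer-style encoders, is an assumption, and `build_query_embedding` and all numeric values are hypothetical.

```python
def build_query_embedding(entity_vectors, positional_vectors):
    """Combine entity and positional embeddings position by position.

    Element-wise addition is an assumed combination rule. Returns one
    combined vector per position in the entity pair.
    """
    if len(entity_vectors) != len(positional_vectors):
        raise ValueError("need one positional vector per entity vector")
    return [[e + p for e, p in zip(entity_vec, pos_vec)]
            for entity_vec, pos_vec in zip(entity_vectors, positional_vectors)]


# A pair of entities, each with a 2-dimensional embedding (illustrative values).
entity_vectors = [[1.0, 2.0], [3.0, 4.0]]
positional_vectors = [[0.5, 0.5], [1.0, 1.0]]  # position 0 and position 1
query_embedding = build_query_embedding(entity_vectors, positional_vectors)
```

The positional vectors let the encoder distinguish the pair (disease, drug) from (drug, disease) even though the same two entity embeddings are present. A path embedding could be built the same way from the sequence of entity and edge-type embeddings along the path.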

The present invention also provides a target prediction system that uses the pre-trained artificial neural network described above, comprising: an input device configured to receive an arbitrary first entity; a processor configured to query the input layer of the artificial neural network with the first entity input through the input device and to derive a plurality of second entities, relevance information, and prediction reliability; and an output device configured to output the plurality of second entities, the relevance information, and the prediction reliability derived by the processor.

In one embodiment, the processor may correct the relevance information based on the derived prediction reliability, and the corrected relevance information may be output through the output device.

The present invention also provides a computer program stored on a computer-readable recording medium to execute the method described above.

According to the present invention, the learning data is learned so as to derive relevance information for each of a plurality of second entities related to the queried first entity, together with the prediction reliability of that relevance information, which addresses the problems caused by a biased distribution of the learning data.

In addition, the accuracy of the generated prediction model is improved because the learning data is learned so that the step of deriving relevance information and the step of deriving prediction reliability are performed simultaneously.

In addition, reliability is higher than in the prior art because the relevance information (relevance score) is corrected to reflect the derived prediction reliability and the resulting corrected relevance information is provided to the user.

In addition, the accuracy of the generated prediction model is improved because the weight matrix of the artificial neural network model is constrained to be an orthogonal matrix while the learning data is learned to derive prediction reliability.

In addition, because prediction reliability is derived, the reliability assigned to inputs with no training history is lowered, which resolves the overconfidence issue.

FIG. 1 is a schematic block diagram illustrating a system according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram illustrating data processing by the first processor in the system of FIG. 1.
FIG. 3 is a diagram illustrating a knowledge graph constructed in a system according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the nodes and edges used in constructing a knowledge graph according to the present invention.
FIG. 5 is a diagram illustrating data processing by the first processor in the system of FIG. 1.
FIG. 6 is a diagram illustrating data processing by the second processor in the system of FIG. 1.
FIG. 7 is a diagram illustrating the embedding vector input to the query encoding model of FIG. 6 and the embedding vector output accordingly.
FIG. 8 is a diagram illustrating the embedding vector input to the path encoding model of FIG. 6 and the embedding vector output accordingly.
FIG. 9 is a diagram showing the embedding vectors input to the query-path encoding model of FIG. 6, the embedding vectors output accordingly, and how they are learned.
FIG. 10 is a diagram illustrating how the learning processor trains the artificial neural network model.
FIG. 11 is a diagram illustrating the process of querying the trained artificial neural network model and deriving results.
FIG. 12 is a diagram illustrating the form of the output information according to an embodiment of the present invention.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

Hereinafter, the term "entity" means a word, symbol, or the like, and may include the name of a disease, the name of a gene (or protein), and the name of a drug. Likewise, "entity-pair" means data consisting of a pair of entities of different types (disease-gene, gene-disease, disease-drug, drug-disease, gene-drug, drug-gene, etc.).

Hereinafter, the term "gene" refers to an individual unit of genetic information consisting of a specific base sequence in a genome made of DNA or RNA, and is a concept that also covers an individual unit of genetic information consisting of a specific amino acid sequence in a genome made of protein as well as DNA and RNA. A protein, which is the result of translating the genome, is likewise included in the term "gene" as used in the present invention.

Hereinafter, the term "target" refers to a second entity (the entity to be predicted) related to the queried first entity. The first entity and the second entity each belong to one of the types disease, gene, and drug, but their types differ from each other (for example, if the first entity is a disease, the second entity is a gene or a drug).
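The definitions above can be summarized as a small data model: an entity has a name and one of three types, and a valid target has a type different from that of the queried entity. This is a sketch for illustration only; `EntityType`, `Entity`, `is_valid_target`, and the placeholder names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    DISEASE = "disease"
    GENE = "gene"  # per the definition above, this also covers proteins
    DRUG = "drug"


@dataclass(frozen=True)
class Entity:
    name: str
    type: EntityType


def is_valid_target(first, second):
    """A second entity is a valid target only if its type differs from the first's."""
    return first.type != second.type


disease = Entity("disease_A", EntityType.DISEASE)
gene = Entity("gene_B", EntityType.GENE)
other_disease = Entity("disease_C", EntityType.DISEASE)
```

With these definitions, `(disease, gene)` is a valid entity-pair, while `(disease, other_disease)` is not because both entities share a type.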

1. Description of the system

A system according to an embodiment of the present invention is described in detail with reference to FIG. 1.

A system according to an embodiment of the present invention includes a model generation terminal (100) and a prediction terminal (200).

The model generation terminal (100) is a terminal configured to generate the prediction model used in the present invention.

Referring to FIG. 1, the model generation terminal (100) includes a communication unit (110), an input unit (120), a first processor (130), a second processor (140), a control unit (150), an output unit (160), a memory (170), and a learning processor (180). That is, the model generation terminal (100) may be a computing device with communication, input, computation, and output functions, and may be implemented as any of various electronic devices such as a server, a desktop PC, a laptop PC, or a tablet PC.

In one embodiment, the communication unit (110) is a component for communicating with external devices, and data may be transmitted to and received from external devices through the communication unit (110).

In one embodiment, the input unit (120) may receive, from outside the model generation terminal (100) (for example, from a user), commands or data to be used by components of the model generation terminal (100) (for example, a processor or the learning processor). The input unit (120) may include, for example, a microphone, a mouse, a keyboard, keys (for example, buttons), or a digital pen (for example, a stylus pen).

In one embodiment, the control unit (150) may, for example, execute software to control one or more other components (for example, hardware or software components) of the model generation terminal (100) that are connected to the processor.

In one embodiment, the first processor (130), the second processor (140), and the learning processor (180) may perform data processing or computation. As at least part of that processing or computation, they may store commands or data received from other components in volatile memory, process the commands or data stored in volatile memory, and store the resulting data in non-volatile memory. The learning processor (180) may also include a hardware structure specialized for processing artificial intelligence models.

The output unit (160) may output information processed or computed in the model generation terminal (100) to the outside, and may include a display, a speaker, and the like.

The memory (170) may store various data used in the model generation terminal (100). The data may include, for example, software and input data or output data for commands related to the software.

The prediction terminal (200), described later, may also be implemented as a computing device and may include all or at least some of the components of the model generation terminal (100).

The learning processor (180) of the model generation terminal (100) learns the learning data to generate a prediction model that, when an arbitrary first entity is queried, derives a plurality of second entities related to the queried first entity, derives relevance information between the queried first entity and each second entity, and derives a prediction reliability for each second entity. The following describes in detail how the learning data used to generate the prediction model is constructed.

The learning data is preprocessed and generated by the first processor (130) and the second processor (140) of the model generation terminal (100).

The first processor (130) is configured to define nodes, edges, and paths, generate a knowledge graph consisting of the defined nodes and edges, embed the generated knowledge graph, and use the embedding result to extract, from the paths of an arbitrary entity-pair, a subset of paths of high importance.
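The graph-building and path-enumeration part of this pipeline can be sketched as follows. The sketch covers only how candidate paths between an entity-pair might be enumerated from typed edges; the embedding and importance scoring are not shown. Treating edges as undirected, the hop limit, the edge-type names, and the functions `build_graph` and `candidate_paths` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency list from (head, edge_type, tail) triples.

    Edges are stored in both directions (assumed undirected).
    """
    graph = defaultdict(list)
    for head, edge_type, tail in triples:
        graph[head].append((edge_type, tail))
        graph[tail].append((edge_type, head))
    return graph


def candidate_paths(graph, start, goal, max_hops):
    """Enumerate simple paths from start to goal with at most max_hops edges.

    Each path is a tuple alternating nodes and edge types:
    (node, edge_type, node, edge_type, node, ...).
    """
    paths = []

    def walk(node, path, visited):
        if node == goal:
            paths.append(tuple(path))
            return
        if (len(path) - 1) // 2 >= max_hops:
            return  # hop budget used up
        for edge_type, neighbor in graph[node]:
            if neighbor not in visited:
                walk(neighbor, path + [edge_type, neighbor], visited | {neighbor})

    walk(start, [start], {start})
    return paths


# Placeholder nodes and hypothetical edge types.
triples = [
    ("disease_A", "associates", "gene_B"),
    ("drug_C", "binds", "gene_B"),
    ("drug_C", "treats", "disease_A"),
]
graph = build_graph(triples)
paths = candidate_paths(graph, "disease_A", "drug_C", max_hops=2)
```

For the pair (`disease_A`, `drug_C`) this yields the direct `treats` path and the two-hop path through `gene_B`; a scoring step would then rank such candidates and keep the most important ones.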

Referring to FIG. 2, the first processor (130) includes a data collection unit (131), a natural language processing unit (132), a node definition unit (133), an ID assignment unit (134), an edge definition unit (135), a path definition unit (136), an embedding unit (137), a path score calculation unit (138), and a path extraction unit (139).

The data collection unit (131) is configured to collect data from a plurality of databases (D1, D2, … Dn). The collected data may be, for example, gene expression data, drug-protein binding data, itemized data of information described in papers, or document data. The data is not limited to these forms, however; any format is acceptable as long as the data includes disease-related data, gene-related data, and drug-related data.

To this end, the system according to an embodiment of the present invention may be communicatively connected to the plurality of databases (D1, D2, … Dn) (the connection may be made through the communication unit of the model generation terminal). The databases (D1, D2, … Dn) may be public databases but are not limited thereto and may also be private databases, and they may include paper databases, medical information databases, pharmaceutical information databases, and search portal databases.

The data collection unit (131) may collect, from each of the databases (D1, D2, … Dn), first data related to diseases, second data related to genes, and third data related to drugs (compounds). The data collection unit (131) may also collect pathways as fourth data, distinct from the first to third data, and may collect as a further type, fifth data, the data that identifies the documents (for example, papers) that served as the basis for judging that the above data are related (for example, the ID assigned to each such paper in a specific database).

The first data is disease-related data and may include disease name data, anatomical data of the disease (for example, anatomical data of the part of the body where the disease occurs; for liver cancer, this may be the liver), and symptom data of the disease. That is, the concept covers not only terms that refer to the disease itself but also all terms needed to provide information related to the disease.

The second data is gene-related data and may include gene name data, gene ontology data of the gene, anatomical data of the gene (for example, information on the body tissue where the gene is expressed; if genes highly expressed in the liver are considered first when looking for genes related to liver cancer, this may be the liver), and biological pathway data of the gene. The gene ontology data may include biological process data, cellular component data (the gene's location within the cell), and molecular function data of the gene. That is, the concept covers not only terms that refer to the gene itself but also all terms needed to provide information related to the gene.

Anatomical data may belong to either the first data or the second data. For example, if the data states that gene A is expressed in tissue B, tissue B may be collected as second data (gene-related data); if the data states that disease C occurs in tissue D, tissue D may be collected as first data (disease-related data).

The third data is drug-related data and may include drug name data, pharmacologic class data of the drug, and side effect data of the drug. That is, the concept covers not only terms that refer to the drug itself but also all terms needed to provide information related to the drug.

The data is not limited to the types above, however, and may include any data related to diseases, genes, or drugs, as well as any data needed to predict the relationships among diseases, genes, and proteins.

The natural language processing unit (132) is configured to apply a preset natural language processing algorithm to the document data collected by the data collection unit (131), extract entities from the text contained in the document data, and derive the relationships between the entities.

The entities extracted by the natural language processing unit 132 and the derived relationships between entities may be defined as nodes and edges, respectively; a detailed description is given later.

That is, the natural language processing unit 132 is configured to recognize and extract disease-related terms in the document data as first entities, gene-related terms as second entities, drug (compound)-related terms as third entities, and terms describing the relationships among the first to third entities as fourth entities. The natural language processing unit 132 may also define and extract other entities needed for prediction, for example by recognizing paths contained in the document data as fifth entities and data identifying the document on which a relevance judgment was based as sixth entities.

In addition, the natural language processing unit 132 is configured to derive the relationships between the extracted entities using a preset method.

The extraction of entities and the derivation of relationships between entities by the natural language processing unit 132 according to the present invention may be performed using a pre-trained neural network model. For example, the neural network model may be trained on training data in which the first to sixth entities are each labeled, so that it extracts the first to sixth entities from queried document data and derives the relationships between them.

In the prior art, the terms to be extracted are stored in advance in an index dictionary, and only those pre-stored terms are extracted from the text. If the text contains a term that is not in the index dictionary, that term cannot be extracted, so the system can only be built within the range of what is already known.

In the present invention, by contrast, terms are not looked up in an index dictionary. Instead, for example, the neural network model is trained on data labeled to indicate which part of the text corresponds to which of the defined entities, so even terms not seen during training can be extracted as entities by considering the form of the term itself and its surrounding context. Accordingly, entities can be extracted and relationships between entities derived not only in categories already known from the existing literature but also in new categories.

The node defining unit 133, the edge defining unit 135, and the path defining unit 136 are configured to define, respectively, the nodes and edges that make up the knowledge graph and, further, the paths.

The node defining unit 133 may group the first data collected by the data collection unit 131 into disease name data, disease anatomical data, disease symptom data, and so on; may group the collected second data into gene name data, gene biological process data, gene anatomical data, gene cellular component data, gene molecular function data, gene biological pathway data, and so on; and may group the collected third data into drug name data, drug pharmacologic class data, and drug side effect data, thereby classifying the data into a total of 11 groups (see FIG. 4). However, the number of groups is not limited to this, and groups of various other types (for example, fifth data and sixth data) may be added.

In another embodiment, the node defining unit 113 may group the entities extracted by the natural language processing unit 112 according to a predetermined method and define each of them as a node.

That is, the node defining unit 133 defines the first to third entities extracted by the natural language processing unit 132, together with the first to third data collected from multiple databases, as first to third nodes, respectively (see FIG. 3). In another embodiment, the node defining unit 133 may define the first to third entities, the fifth entity, and the sixth entity, together with the first to third data, the fifth data, and the sixth data, as first to fifth nodes, respectively. As described later, the edge defining unit 135 defines the relationships among the first to third entities derived by the natural language processing unit 132, and the relationships among the first to third data, as edges. The edge defining unit 135 may also define as edges the relationships among the first to third entities, the fifth entity, and the sixth entity, and the relationships among the first to third data, the fifth data, and the sixth data. Examples of nodes defined according to the present invention, and of the edges connecting them, are shown in FIG. 3.

In addition, the node defining unit 133 defines the data contained in each group as individual nodes according to their kind.

That is, the node defining unit 133 defines a separate node for each kind of first data (entity), for each kind of second data (entity), for each kind of third data (entity), and for each kind of fifth data and sixth data.

The ID assigning unit 134 is configured to assign a unique ID to each of the nodes defined by the node defining unit 133.

That is, the ID assigning unit 134 according to the present invention assigns a unique ID to the term that represents each node, and is configured to assign that same ID to any terms that can be regarded as identical to it, such as its synonyms and abbreviations.

Meanwhile, a single term may be assigned two or more IDs. For example, alpha-fetoprotein is also referred to by the abbreviation AFP, and both alpha-fetoprotein and AFP may be assigned the ID 174.

AFP is also a synonym of the gene TRIM26, so AFP may also be assigned the ID 7726, the same ID as TRIM26.

That is, AFP is given two IDs, 174 and 7726. In this case, the ID assigning unit 134 assigns to AFP the ID (174) matched to the full name of AFP rather than the ID (7726) matched to the abbreviation.

The memory 170 stores a unique ID mapped to each node, and the ID assigning unit 134 uses the IDs stored in the memory 170 to assign a unique ID to each node.
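The ID resolution rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the table layout, the `MATCH_RANK` ordering, and the function name `assign_id` are assumptions introduced here, and only the terms and IDs (alpha-fetoprotein/AFP = 174, TRIM26 = 7726) come from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical alias table standing in for the mapping held in the memory 170:
# term -> [(candidate ID, how the term reaches that ID)].
ALIAS_TABLE: Dict[str, List[Tuple[int, str]]] = {
    "alpha-fetoprotein": [(174, "full_name")],
    "TRIM26": [(7726, "full_name")],
    # AFP abbreviates alpha-fetoprotein (174) and is also a synonym of TRIM26 (7726).
    "AFP": [(7726, "synonym"), (174, "full_name")],
}

# Assumed priority: an ID reached through the term's full name beats a synonym match.
MATCH_RANK: Dict[str, int] = {"full_name": 0, "synonym": 1}


def assign_id(term: str) -> int:
    """Return one ID for a term, preferring the full-name match when IDs conflict."""
    candidates = ALIAS_TABLE[term]
    best_id, _ = min(candidates, key=lambda c: MATCH_RANK[c[1]])
    return best_id


print(assign_id("AFP"))
```

With this ordering, a term that has only one candidate simply receives that candidate, so the rule only matters for ambiguous terms such as AFP.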

The edge defining unit 135 defines the relationships between the nodes defined by the node defining unit 133 as edges.

An edge is a connection between two nodes. The edge defining unit 135 defines each node-to-node relationship found in the collected data as an edge connecting the corresponding node pair.

For example, if the document data contains the text "Breast cancer patients may develop lumps, and treatment may be carried out using the hormonal drug tamoxifen," one edge may be defined connecting the node "breast cancer" and the node "lump," and another edge may be defined connecting the node "breast cancer" and the node "tamoxifen (hormonal drug)."

In this way, the edge defining unit 135 can define relationships between nodes as edges using the first, second, and third data collected by the data collection unit 131, and, like the node defining unit 133, can group the edges it has defined.

FIGS. 3 and 4 show the edges defined, grouped, and typed by the edge defining unit 135.

Referring to FIG. 3, the edges defined by the edge defining unit 135 may be divided into disease-gene relationship edges (Disease-Target), gene-drug relationship edges (Target-Compound), disease-drug relationship edges (Disease-Compound), gene-related edges (Target-related), disease-related edges (Disease-related), and drug-related edges (Compound-related).

FIG. 4 shows the edge types (metaedges) into which the edges are categorized.

Specifically, the disease-gene relationship edges (Disease-Target) include the gene-disease association edge type (associated) and the gene-disease regulation edge types (downregulated_in, upregulated_in).

The gene-drug relationship edges (Target-Compound) include the drug-gene binding edge type (binds_to) and the drug-gene regulation edge types (downregulated_by, upregulated_by).

The disease-drug relationship edges (Disease-Compound) include the drug-disease treatment edge type (treats).

The gene-related edges (Target-related) include the gene-anatomical data regulation/expression edge types (expressed_low, expressed_in, expressed_high), the gene covariation edge type (covaries), the gene participation edge types (biological_process, cellular_component, molecular_function, involved_in), the gene or protein interaction edge types (PPI, PDI), and the genetic interference-gene regulation edge type (regulates).

The disease-related edges (Disease-related) include the disease-anatomical data edge type (occurs_in), the disease-symptom edge type (presents), and the disease co-occurrence similarity edge type (mentioned_with).

The drug-related edges (Compound-related) include the drug-side effect edge type (causes), the drug structural similarity edge type (similar_to), and the drug-pharmacologic class edge type (categorized_in).

That is, the edge defining unit 135 can classify the edges into 24 types. It should be understood, however, that the number is not limited to this and that groups of various other types may be added.
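As a compact restatement of the taxonomy above, the six edge groups and their metaedges can be written as a lookup table. The dictionary layout and the helper name `edge_group` are illustrative assumptions; the group names and edge type names are taken from the description of FIGS. 3 and 4.

```python
from typing import Dict, List

# Edge groups (FIG. 3) mapped to their edge types, or metaedges (FIG. 4).
METAEDGES: Dict[str, List[str]] = {
    "Disease-Target": ["associated", "downregulated_in", "upregulated_in"],
    "Target-Compound": ["binds_to", "downregulated_by", "upregulated_by"],
    "Disease-Compound": ["treats"],
    "Target-related": [
        "expressed_low", "expressed_in", "expressed_high",
        "covaries",
        "biological_process", "cellular_component", "molecular_function", "involved_in",
        "PPI", "PDI",
        "regulates",
    ],
    "Disease-related": ["occurs_in", "presents", "mentioned_with"],
    "Compound-related": ["causes", "similar_to", "categorized_in"],
}


def edge_group(metaedge: str) -> str:
    """Return the edge group that a given edge type belongs to."""
    for group, types in METAEDGES.items():
        if metaedge in types:
            return group
    raise KeyError(metaedge)


total_types = sum(len(types) for types in METAEDGES.values())
print(total_types)  # 24, matching the count stated in the text
```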

The path defining unit 136 defines as a path a set of one or more, more specifically two or more, edges defined by the edge defining unit 135 in which the included edges are connected to one another.

More specifically, for each node pair, the path defining unit 136 defines as a path a chain of nodes defined by the node defining unit 133 and edges defined by the edge defining unit 135 that are connected to one another, with the two nodes of the pair as its endpoints.

That is, the smallest path defined by the path defining unit 136 may consist of node - edge - node - edge - node, and in other examples a path consisting of node - edge - node - edge - node - edge - node, and so on, may be defined.

More specifically, a path may be defined as a connection between a node pair made through two to five edges, and even more specifically as a connection made through two to three edges. Node pairs connected through four or more edges may be excluded from the valid paths, because when nodes are connected only through many steps, their relevance can be considered weak.
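A minimal sketch of enumerating such paths is given below, assuming the knowledge graph is stored as an adjacency list. The graph contents, the alternating node/edge list representation, and the function name `find_paths` are assumptions made for illustration; the bounds of two to three edges follow the narrower range stated above.

```python
from typing import Dict, List, Set, Tuple

# Adjacency list: node -> [(edge type, neighbouring node)].
Graph = Dict[str, List[Tuple[str, str]]]


def find_paths(graph: Graph, source: str, target: str,
               min_edges: int = 2, max_edges: int = 3) -> List[List[str]]:
    """Enumerate simple paths source -> target as [node, edge, node, ...] lists."""
    results: List[List[str]] = []

    def walk(node: str, path: List[str], visited: Set[str]) -> None:
        n_edges = len(path) // 2
        if node == target:
            # The target is always an endpoint; keep the path only if long enough.
            if n_edges >= min_edges:
                results.append(path)
            return
        if n_edges == max_edges:
            return
        for edge_type, nxt in graph.get(node, []):
            if nxt not in visited:
                walk(nxt, path + [edge_type, nxt], visited | {nxt})

    walk(source, [source], {source})
    return results


# Hypothetical toy graph; the direct one-edge link is not a valid path.
toy_graph: Graph = {
    "Alzheimer's disease": [("mentioned_with", "Parkinson's disease"),
                            ("associated", "AKT1")],
    "Parkinson's disease": [("associated", "AKT1")],
}
paths = find_paths(toy_graph, "Alzheimer's disease", "AKT1")
print(paths)
```

Bounding the depth keeps the enumeration tractable on a large graph, which is consistent with the text's remark that long chains carry weak relevance.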

For the paths defined by the path defining unit 136, a number of path types may be determined according to the possible combinations of the number, order, and type (shown in FIG. 4) of the edges that make up a path.

For example, the path "AKT1 - associates - Alzheimer's disease - resembles - Parkinson's disease" has the path type "Gene - associates - Disease - resembles - Disease". In other words, a path consisting of edge A (of type a) followed by edge B (of type b) may be defined as type (a,b), and a path consisting of edge A (of type a), edge B (of type b), and edge C (of type c) may be defined as type (a,b,c); the two are treated as different types.

In addition, the path defining unit 136 may designate some of the many path types as preset path types (metapaths). As described later, path types that are not among the preset path types are excluded from the training process according to the present invention.

For example, among the many path types, the path defining unit 136 may set as a preset path type the path type whose edge types follow the order Disease - mentioned_with - Disease - associates_with - Gene, and in another example the path type whose edge types follow the order Disease - treated_by - Compound - binds_to - Gene - interacts_with - Gene. The present invention is not limited to these; the preset path types may be set by a system administrator. By training only on biologically meaningful paths among the paths connecting a given node pair, the efficiency and accuracy of training can be improved.

Conversely, among the many path types determined by the possible combinations of the number, order, and type of edges, the path defining unit 136 may decline to set as preset path types the path type whose edge types follow the order Disease - treated by - Compound - downregulates - Gene - regulated by - Gene and the path type whose edge types follow the order Disease - downregulates - Gene - upregulated by - Compound - binds to - Gene. Here too, the path types to be excluded from the preset path types may be set by a system administrator. Paths that are biologically meaningless or of low importance among those connecting a given node pair are excluded from the training process, which can improve training efficiency and computational accuracy.
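The derivation of a path type and the filtering against preset metapaths can be sketched as follows. The node-type lookup, the single-entry whitelist, and the function names are assumptions for illustration; the two example paths reuse node and edge names that appear in the text.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical node-type lookup for the toy example.
NODE_TYPE: Dict[str, str] = {
    "AKT1": "Gene",
    "Alzheimer's disease": "Disease",
    "Parkinson's disease": "Disease",
}

# Preset metapaths; this single entry follows the first example in the text.
PRESET_METAPATHS: Set[Tuple[str, ...]] = {
    ("Disease", "mentioned_with", "Disease", "associates_with", "Gene"),
}


def path_type(path: List[str]) -> Tuple[str, ...]:
    """Replace each node (even positions) by its node type; keep edge types as-is."""
    return tuple(NODE_TYPE[item] if i % 2 == 0 else item
                 for i, item in enumerate(path))


def keep_for_training(path: List[str]) -> bool:
    """A path is kept only if its path type is one of the preset metapaths."""
    return path_type(path) in PRESET_METAPATHS


kept = ["Parkinson's disease", "mentioned_with", "Alzheimer's disease",
        "associates_with", "AKT1"]
dropped = ["AKT1", "associates", "Alzheimer's disease",
           "resembles", "Parkinson's disease"]
print(path_type(dropped), keep_for_training(kept), keep_for_training(dropped))
```

Because the type is a tuple that records both order and length, paths with the same edge types in a different order, or with a different number of edges, fall into different types, as the text requires.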

The embedding unit 137 performs embedding on one or more of the following: the nodes defined by the node defining unit 133, the edges and edge types (metaedges) defined by the edge defining unit 135, and the paths and preset path types (metapaths) defined by the path defining unit 116.

More specifically, the embedding unit 137 performs embedding on each of the nodes defined by the node defining unit 133 and each of the edge types defined by the edge defining unit 135.

An example of the embedding method used by the embedding unit 137 is described below.

First, the embedding unit 137 initializes every node defined by the node defining unit 133 as a real-valued vector of k random values. Here k may be 128, but it is not limited to this; the vectors may be initialized with other sizes such as 64, 256, 512, or 1024.

Next, every edge type defined by the edge defining unit 135 is initialized as a real-valued vector of k random values. Here k may be 128, but it is not limited to this; the vectors may be initialized with other sizes such as 64, 256, 512, or 1024.

Next, it is determined whether a given node pair is connected by an edge having an edge type defined by the edge defining unit 135, and the result is fed in as supervised-learning label data. If a given node pair (source node, target node) is connected by an edge of that edge type, a label of 1 is fed in; if the pair is not connected, a label of 0 is fed in.

The k-dimensional vectors are then adjusted so that the output of a prediction function that takes three k-dimensional vectors (source node, target node, edge type) as input agrees with whether the connection actually exists. The prediction function may be a model such as TransE, HolE, or DistMult, but it is not limited to these, and various prediction function models may be applied to the present invention.

Once the adjustment is complete, the resulting k-dimensional real-valued vectors are output as the embedding results for the corresponding nodes and edge types.
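The steps above (random initialization, 1/0 labels, and adjusting the vectors until the prediction function matches the labels) can be sketched with a DistMult-style score. This is only one possible instance: the sigmoid and cross-entropy loss, the learning rate, the small k, and all function names are assumptions introduced here, not details given in the text.

```python
import math
import random
from typing import List

Vector = List[float]


def init_vector(k: int, rng: random.Random) -> Vector:
    """Random real-valued vector of length k (the text suggests k = 128)."""
    return [rng.uniform(-0.5, 0.5) for _ in range(k)]


def distmult_score(h: Vector, r: Vector, t: Vector) -> float:
    """DistMult prediction function: sum_i h_i * r_i * t_i."""
    return sum(a * b * c for a, b, c in zip(h, r, t))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_step(h: Vector, r: Vector, t: Vector, label: float, lr: float = 0.1) -> None:
    """One gradient step on binary cross-entropy; updates the vectors in place."""
    err = sigmoid(distmult_score(h, r, t)) - label  # derivative of loss w.r.t. score
    for i in range(len(h)):
        # Gradients are computed from the old values before any update is applied.
        grad_h, grad_r, grad_t = err * r[i] * t[i], err * h[i] * t[i], err * h[i] * r[i]
        h[i] -= lr * grad_h
        r[i] -= lr * grad_r
        t[i] -= lr * grad_t


rng = random.Random(0)
k = 8  # kept small for the demonstration
source, edge_type, target = (init_vector(k, rng) for _ in range(3))
for _ in range(100):
    train_step(source, edge_type, target, label=1.0)  # this pair is connected
print(sigmoid(distmult_score(source, edge_type, target)))
```

A real training loop would iterate over many positive and negative (source, edge type, target) triples; the single triple here only shows the direction of the update.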

Various embedding methods other than the one described above may be used. As a result of the embedding by the embedding unit 137, each node is mapped to a point in k-dimensional space. Moreover, not only are the first to third nodes each mapped into the k-dimensional space, but the edge types may also be embedded in the same k-dimensional space.

The path score calculation unit 138 is configured to calculate scores for the edges included in a path according to a preset method, and to calculate the score of each path from those edge scores.

The score of each edge in a path is calculated from the node and edge type embeddings produced by the embedding unit 137. That is, each edge in a path has the k-dimensional real-valued vector (mapping) of its edge type and the k-dimensional real-valued vectors of its start and end nodes, and the edge score can be calculated from these vectors. As examples of the specific calculation, the prediction function used in the embedding unit 137 may be applied, or the similarity between the node mappings may be used.

In the calculation based on the similarity of node mappings, the more similar two nodes are in the k-dimensional space, the higher the score given to the edge connecting them. The similarity may be computed from the angle between the two vectors (more specifically, from the cosine of the angle between them). This is only an example; any of various methods for computing the similarity between vectors may be applied.

For a path containing n edges (where n is an integer of 1 or more), the path score may be calculated by summing the edge scores of the n edges; likewise, for a path containing n+1 edges, the path score may be calculated by summing the scores of the n+1 edges.
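The similarity-based scoring can be illustrated as below, taking the cosine of the two endpoint embeddings as the edge score and summing edge scores along the path. The two-dimensional toy embeddings and the function names are assumptions; in the text the vectors are k-dimensional and may instead be scored with the prediction function.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def path_score(nodes: List[str], embedding: Dict[str, Vector]) -> float:
    """Sum of edge scores, where each edge score is the cosine of its end nodes."""
    return sum(cosine(embedding[a], embedding[b])
               for a, b in zip(nodes, nodes[1:]))


# Hypothetical 2-dimensional embeddings for a three-node (two-edge) path.
toy_embedding: Dict[str, Vector] = {"A": [1.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}
score = path_score(["A", "B", "C"], toy_embedding)
print(score)
```

Because the score is a plain sum, a path with more edges can accumulate a higher total; comparing paths only within the same path type, as the extraction step does, sidesteps that effect.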

The path extraction unit 139 is configured to extract some of the paths defined by the path defining unit 136, based on the path scores calculated by the path score calculation unit 138.

Specifically, from among the paths of a given node pair (entity pair), the path extraction unit 139 extracts some paths for each preset path type (metapath).

As described above, path types may be classified according to the number, order, and type of the edges in a path. For example, a path consisting of edge A (of type a) followed by edge B (of type b) may be defined as type (a,b), and a path consisting of edge A (of type a), edge B (of type b), and edge C (of type c) may be defined as type (a,b,c); the two are treated as different types.

More specifically, for the paths of a given node pair that have a preset path type, some paths may be extracted for each path type in descending order of the path score calculated by the path score calculation unit 138; as one example, five paths may be extracted per path type. It should be understood, however, that the number is not limited to five, and fewer or more than five paths may be extracted.
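Selecting the highest-scoring paths per path type can be sketched as follows. The input format (path, path type, score) and the function name `top_paths_per_type` are assumptions for illustration; the default of five paths per type follows the example in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ScoredPath = Tuple[List[str], str, float]  # (path, path type label, path score)


def top_paths_per_type(scored_paths: List[ScoredPath], k: int = 5) -> Dict[str, List[List[str]]]:
    """Group paths by path type and keep the k highest-scoring paths in each group."""
    by_type: Dict[str, List[Tuple[float, List[str]]]] = defaultdict(list)
    for path, ptype, score in scored_paths:
        by_type[ptype].append((score, path))
    return {
        ptype: [path for _, path in sorted(items, key=lambda x: x[0], reverse=True)[:k]]
        for ptype, items in by_type.items()
    }


# Hypothetical scored paths for one node pair, with placeholder type labels.
example = [(["a"], "T1", 0.2), (["b"], "T1", 0.9), (["c"], "T1", 0.5), (["d"], "T2", 0.1)]
top = top_paths_per_type(example, k=2)
print(top)
```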

The second processor 140 is configured to generate the training data on which the learning processor 180 is to be trained, using the embedding vectors calculated and the paths extracted by the first processor 130.

This is described in detail with reference to FIGS. 6 to 9.

The second processor 140 is characterized in that it generates the training data using a model that encodes time-series information.

A model that encodes time-series information is a model that compresses an input sequence into a single vector representation (context vector) while encoding the sequence so that its time-series characteristics remain visible in the compressed representation. To expose the time-series characteristics of the input sequence, the present invention uses a positional vector.

The present invention uses two main encoding models, a first encoding model and a second encoding model, arranged in a two-layer structure in which the output of the first encoding model is fed as input to the second encoding model (that is, a stacked encoding model). The first encoding model encodes time-series information, whereas the second encoding model does not. A detailed description is given later.

The first encoding model can in turn be divided into two encoding models.

The first of these is the query encoding model, which is used to specify which entity pair the paths used for training connect.

Referring to FIG. 7, a given pair of entities is input to the query encoding model in the form of query embedding vectors.

The query embedding vectors are arranged in the following order: a query vector (first query embedding vector), an embedding vector specifying one entity of the pair (second query embedding vector), and an embedding vector specifying the other entity of the pair (third query embedding vector).

The first query embedding vector is calculated from a query type embedding vector and a first positional embedding vector that identifies the position of the query type embedding vector; specifically, it may be the sum of the query type embedding vector and the first positional embedding vector. Each embedding vector can be expressed as a k-dimensional vector. The query type embedding vector and the first positional embedding vector may have the same dimension, and the first query embedding vector can be generated simply by adding them.

The second query embedding vector is calculated from a first node type embedding vector, which indicates the type (among types including disease, gene, and drug) to which one entity of the pair belongs; an entity embedding vector, which identifies that entity; and a second positional embedding vector, which identifies the position of the entity embedding vector. Specifically, it may be the sum of the first node type embedding vector, the entity embedding vector, and the second positional embedding vector. Taking FIG. 6 as an example, when the entity is Alzheimer's disease, the node type embedding vector denoting "disease" corresponds to the first node type embedding vector.

The entity embedding vector may be computed from a first entity embedding vector and a second entity embedding vector; more specifically, it may be the concatenation of the two. Here, "concatenation" differs from "summation," in which two k-dimensional embedding vectors are added to yield a result that remains k-dimensional. Concatenation joins a k/2-dimensional embedding vector with another k/2-dimensional embedding vector to yield a k-dimensional result (that is, the dimensions add up). In the present invention, the embedding vectors generated by "summation" and those generated by "concatenation" preferably have the same dimension.
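The difference between summation and concatenation described above can be illustrated with a minimal sketch. The dimension `k = 8`, the random values, and all variable names are assumptions chosen for illustration; the patent does not specify them.

```python
import numpy as np

k = 8  # hypothetical embedding dimension
rng = np.random.default_rng(0)

# Summation: both operands are k-dimensional and the result stays k-dimensional.
query_type_emb = rng.normal(size=k)
positional_emb = rng.normal(size=k)
first_query_emb = query_type_emb + positional_emb

# Concatenation: two k/2-dimensional halves are joined into one k-dimensional vector.
first_entity_emb = rng.normal(size=k // 2)
second_entity_emb = rng.normal(size=k // 2)
entity_emb = np.concatenate([first_entity_emb, second_entity_emb])

# Because the concatenated entity embedding is k-dimensional, it can be summed
# with a k-dimensional node type embedding and positional embedding.
node_type_emb = rng.normal(size=k)
second_positional_emb = rng.normal(size=k)
second_query_emb = node_type_emb + entity_emb + second_positional_emb
```

Keeping both operations at the same output dimension is what allows the concatenated entity embedding to take part in the later sum.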

The first entity embedding vector is the embedding vector that refers to (identifies) the one entity among the embedding results generated by the embedding unit 137 of the first processor 130. That is, among the embedding results for nodes and edge types, it is the embedding vector that refers to the one entity, and its value stays fixed (does not change) even when training is subsequently performed by the learning processor 180.

Like the first entity embedding vector, the second entity embedding vector refers to the one entity. However, its value changes when training is performed by the learning processor 180 (the embedding vector value changes as the weights are adjusted to maximize prediction performance), and a randomly selected embedding vector value may be applied to the second entity embedding vector.

The present invention uses both the first entity embedding vector, which is obtained from existing knowledge (information based on already known knowledge) and held fixed, and the second entity embedding vector, whose value can be adjusted during training to maximize prediction performance. It thereby gains both the advantage of information based on existing knowledge and the advantage obtained through artificial intelligence. Accordingly, prediction performance is significantly improved over the prior art.
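As a rough sketch of the fixed-plus-trainable scheme described above, the following shows a single gradient step that updates only the trainable half of a concatenated entity embedding. The function names, the plain gradient-descent update, the learning rate, and the placeholder gradient are all assumptions for illustration; the patent does not state which optimizer is used.

```python
import numpy as np

def entity_embedding(fixed_half, trainable_half):
    """Concatenate the fixed (knowledge-based) and trainable halves."""
    return np.concatenate([fixed_half, trainable_half])

def sgd_step(fixed_half, trainable_half, grad_wrt_embedding, lr=0.1):
    """Update only the trainable half; the fixed half is returned unchanged."""
    half = fixed_half.shape[0]
    grad_trainable = grad_wrt_embedding[half:]
    return fixed_half, trainable_half - lr * grad_trainable

rng = np.random.default_rng(0)
fixed = np.array([0.5, -0.2, 0.1, 0.9])  # stands in for a precomputed graph embedding
trainable = rng.normal(size=4)           # randomly selected starting values
grad = np.ones(8)                        # placeholder gradient of some loss

new_fixed, new_trainable = sgd_step(fixed, trainable, grad)
combined = entity_embedding(new_fixed, new_trainable)
```

Only the second half of `combined` moves during training, so the knowledge-based half keeps contributing the same information at every step.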

The third query embedding vector is similar in form to the second query embedding vector. The only difference is that the second query embedding vector is used to specify one entity of the arbitrary entity pair, whereas the third query embedding vector is used to specify the other entity of the pair.

That is, the third query embedding vector may be computed from a second node type embedding vector indicating the type of the other entity, an entity embedding vector identifying the other entity, and a third positional embedding vector identifying the position of the entity embedding vector; more specifically, it may be the sum of the second node type embedding vector, the entity embedding vector, and the third positional embedding vector.

In this way, the query embedding vector consists of the first, second, and third query embedding vectors, and when these are input to the query encoding model, a context vector, which is a single compressed representation, is output. That is, the context vector (embedding vector) output by the query encoding model specifies the arbitrary entity pair and also distinguishes the order in which the pair is arranged (for a pair of entities A and B, it can be distinguished whether the pair is A-B or B-A).

Next, the path embedding vector input to the path encoding model is described in detail with reference to FIG. 8.

Whereas the query embedding vector specifies an arbitrary entity pair, the path embedding vector specifies the paths of the arbitrary entity pair; the form of the embedding vector itself is similar.

The path embedding vector may include a path type embedding vector (first path embedding vector), an embedding vector specifying one entity of the arbitrary entity pair (second path embedding vector), an embedding vector specifying an edge type (third path embedding vector), an embedding vector specifying the entity connected to the one entity by an edge (fourth path embedding vector), an embedding vector specifying an edge type (fifth path embedding vector), and an embedding vector specifying the other entity (sixth path embedding vector).

Because the path length between an arbitrary entity pair can vary (some paths consist of only two edges, while others consist of three or more), the length of the path embedding vector can also vary; at minimum, however, it contains the first through sixth path embedding vectors. (The first through sixth path embedding vectors correspond to a path with two edges and one intermediate node between the entity pair, running node (the one entity) - edge - node - edge - node (the other entity).)
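The layout of the path input described above (a path type token followed by alternating nodes and edges) can be sketched as follows. The function name, the token tuples, and the example entity and edge names are hypothetical; the sketch only shows the ordering and how the sequence length grows with the number of edges.

```python
def path_tokens(path_type, nodes, edge_types):
    """Return the token sequence: path type, then node - edge - ... - node."""
    if len(nodes) != len(edge_types) + 1:
        raise ValueError("a path with n edges must have n + 1 nodes")
    if len(edge_types) < 2:
        raise ValueError("the minimum path described here has two edges")
    tokens = [("path_type", path_type)]
    # zip stops after the last edge, so the final node is appended separately.
    for node, edge in zip(nodes, edge_types):
        tokens.append(("node", node))
        tokens.append(("edge", edge))
    tokens.append(("node", nodes[-1]))
    return tokens

# Minimum case: two edges and one intermediate node give six tokens.
minimal = path_tokens(
    "disease-gene-drug",
    ["disease_A", "gene_B", "drug_C"],
    ["associated_with", "targeted_by"],
)

# A three-edge path yields eight tokens.
longer = path_tokens(
    "disease-gene-gene-drug",
    ["disease_A", "gene_B", "gene_D", "drug_C"],
    ["associated_with", "interacts_with", "targeted_by"],
)
```

In general a path with n edges produces 2n + 2 tokens, each of which would then receive its own positional embedding.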

The first path embedding vector is computed from the path type embedding vector and a fourth positional embedding vector that identifies the position of the path type embedding vector; specifically, it may be the sum of the two.

The second path embedding vector is computed from the first node type embedding vector indicating the type of the one entity, an entity embedding vector specifying the one entity, and a fifth positional embedding vector identifying the position of the entity embedding vector; specifically, it may be the sum of these three vectors. As described above, the entity embedding vector of the second path embedding vector is computed from the first entity embedding vector and the second entity embedding vector, and specifically may be their concatenation.

The third path embedding vector is computed from a first edge type embedding vector, which specifies the edge type of the edge that is included in the path and connected to the one entity, and a sixth positional embedding vector that identifies the position of the first edge type embedding vector; specifically, it may be the sum of the two.

The fourth path embedding vector is computed from a second node type embedding vector indicating the type of the entity connected to the one entity by an edge, an entity embedding vector specifying that entity, and a seventh positional embedding vector identifying the position of the entity embedding vector; specifically, it may be the sum of these three vectors.

The fifth path embedding vector specifies the edge connected to the entity specified by the fourth path embedding vector, and has a form similar to the third path embedding vector.

The sixth path embedding vector specifies the other entity, and has a form similar to the second and fourth path embedding vectors.

In this way, the path embedding vector includes the first through sixth path embedding vectors (and may include more, depending on the number of edges in the path), and when these are input to the path encoding model, a context vector, which is a single compressed representation, is output. That is, the context vector (embedding vector) output by the path encoding model specifies the path between the arbitrary entity pair and also which nodes and edges make up the path and in what order.

For each arbitrary entity pair, the embedding vector output by the query encoding model and the embedding vectors output by the path encoding model (since there are multiple paths, one embedding vector can be output per path) are all input to the query-path encoding model, and a context vector, which is a single compressed representation, is output from the encoder.

Referring to FIG. 6, the embedding vector output by the query encoding model and the embedding vectors output by the path encoding model are each input as embeddings of the query-path encoding model (similar to how the query embedding consisting of the first through third query embeddings is input to the query encoding model, and the path embedding consisting of the first through sixth path embeddings is input to the path encoding model).

Meanwhile, whereas the embedding vectors input to the query encoding model and the path encoding model each include a positional embedding, the embedding vectors input to the query-path encoding model do not. This is because the embedding vectors input to the query-path encoding model serve to specify the high-importance paths between the arbitrary entity pair, and are therefore independent of one another, with no sequential relationship among them.
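One consequence of omitting positional embeddings is that the encoder output need not depend on the order in which the path vectors are supplied. The sketch below demonstrates this for a single self-attention layer followed by mean pooling. The attention-plus-pooling architecture, the weight shapes, and the random inputs are assumptions for illustration only; the patent does not specify the internal structure of the query-path encoding model.

```python
import numpy as np

def self_attention_pool(x, wq, wk, wv):
    """Single-head self-attention without positional embeddings, then mean pooling."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ v).mean(axis=0)

rng = np.random.default_rng(0)
dim = 8
path_vectors = rng.normal(size=(5, dim))  # five hypothetical path context vectors
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))

pooled = self_attention_pool(path_vectors, wq, wk, wv)
pooled_reversed = self_attention_pool(path_vectors[::-1], wq, wk, wv)
```

Here `pooled` and `pooled_reversed` agree up to floating-point error, which matches the statement that the inputs are treated as an unordered set.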

That is, the context vector (embedding vector) output by the query-path encoding model is an embedding vector that reflects only the high-importance paths between the arbitrary entity pair.

Meanwhile, multiple biologically meaningful paths exist for each arbitrary entity pair. In the present invention, among the preset path types for each entity pair, only the paths extracted by the path extraction unit 139 are input to the query-path encoding model.

The embedding vector output by the query-path encoding model contains information on both i) the arbitrary entity pair and ii) the high-importance paths among the paths between that entity pair. That is, in the present invention, the training data on which the learning processor 180 trains an artificial neural network model includes i) arbitrary entity pairs and ii) the high-importance paths among the paths between each pair. The model is trained on this data so that, when an arbitrary first entity is queried, it derives the second entities related to the queried entity, the relevance information between the queried first entity and each second entity, and the prediction reliability for each second entity.

Here, the artificial neural network model may be a model based on a DNN (Deep Neural Network), CNN (Convolutional Neural Network), DCNN (Deep Convolutional Neural Network), RNN (Recurrent Neural Network), RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), SSD (Single Shot Detector), MLP (Multi-layer Perceptron), or an attention mechanism, but is not limited thereto, and various artificial neural network models may be applied to the present invention.

The learning processor 180 is configured to generate a prediction model by training an artificial neural network on the training data preprocessed and extracted by the first processor 130 and the second processor 140.

The learning processor 180 trains the artificial neural network model on the embedding vector output by the query-path encoding model. This embedding vector contains information on the paths that are important from the viewpoint of the preset path types (metapaths) of the entity pair. More specifically, the learning processor 180 trains the artificial neural network model so that, when a first entity is queried, it derives a plurality of second entities related to the queried first entity, together with the relevance information and prediction reliability for each second entity.

Here, the relevance information may take the form of a score, more specifically a decimal score between 0 and 1. A score closer to 1 means a higher probability that the second entity is related to the queried first entity, and a score closer to 0 means a higher probability that it is not.

In other words, the artificial neural network model computes a score for each second entity in the context of the relevance and importance of the "queried entity (first entity)" - "prediction target entity (second entity)" pair. That is, among the possible paths between the first entity and the second entity (for example, disease - target), it finds the paths that belong to a preset path type (metapath), and for each path it assesses the relevance between the first and second entities and computes a weight. A path relevant to the first entity - second entity pair may be given a high weight, and an irrelevant path a low weight. Here, the first entity and the second entity are of different types; for example, if the first entity is of the "disease" type, the second entity is of the "gene" or "drug" type.

Next, the multiple paths are merged into a single real-valued vector based on the computed weights.

Next, the relevance information can be computed using a multilayer perceptron (MLP) that takes as input the merged real-valued vector, the queried first entity embedding, and the second entity embedding.
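The weighting, merging, and scoring steps described above can be sketched as follows. The softmax normalization of path weights, the single hidden layer with ReLU, the sigmoid output, and all shapes and names are assumptions for illustration; the patent states only that paths are merged by weight and that an MLP computes the relevance information from the merged vector and the two entity embeddings.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def merge_paths(path_vectors, path_logits):
    """Weight each path vector and merge them into one real-valued vector."""
    weights = softmax(path_logits)
    return weights @ path_vectors, weights

def relevance_score(merged, first_emb, second_emb, w1, b1, w2, b2):
    """One-hidden-layer MLP mapping the concatenated inputs to a score in (0, 1)."""
    x = np.concatenate([merged, first_emb, second_emb])
    hidden = np.maximum(0.0, w1 @ x + b1)
    logit = w2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
d, hidden_dim, n_paths = 4, 5, 3
path_vectors = rng.normal(size=(n_paths, d))
path_logits = rng.normal(size=n_paths)  # higher logit = more relevant path
first_emb, second_emb = rng.normal(size=d), rng.normal(size=d)
w1, b1 = rng.normal(size=(hidden_dim, 3 * d)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=hidden_dim), 0.0

merged, weights = merge_paths(path_vectors, path_logits)
score = relevance_score(merged, first_emb, second_emb, w1, b1, w2, b2)
```

The sigmoid keeps the score between 0 and 1, in line with the decimal score format described for the relevance information.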

Meanwhile, the training data preprocessed and extracted by the first processor 130 and the second processor 140 is based on data collected from the database D. More specifically, the distribution of training data across areas may differ depending on the state of the art at the time of collection and the kind of database. In other words, because the data depends on what was known at the time of collection, areas that were not well known at that time (for example, disease areas with very few clinical cases) inevitably have less training data than well-known areas (for example, diabetes). As a result, even after the artificial neural network is trained, prediction reliability is low in some areas.

To overcome this technical limitation arising from the uneven distribution of training data, the prior art introduced a process of forcibly equalizing the amount of training data, for example by matching the number of training samples in data-rich areas to that in data-poor areas. However, forcibly matching the number of training samples degrades the prediction reliability of the results derived from the trained model.

The present invention, by contrast, does not use the prior-art preprocessing that forcibly removes the imbalance in the training data distribution. Instead, it trains on the data so as to derive the prediction reliability, corrects the relevance information according to the derived prediction reliability, and provides the corrected relevance information to the user, thereby overcoming the technical limitation. This is described in detail below.

As described above, the learning processor 180 according to an embodiment of the present invention trains the artificial neural network model so that, when a first entity is queried, it derives a plurality of second entities related to the queried first entity, together with the relevance information and prediction reliability for each second entity.

Here, the prediction reliability is a measure of how much the relevance information derived for a second entity by the trained artificial neural network model can be trusted. The more trustworthy the relevance information, the higher the prediction reliability (for example, a value close to 1 after normalization); the less trustworthy, the lower the prediction reliability (for example, a value close to 0 after normalization). An unreliability, the opposite of the prediction reliability, may also be introduced and calculated as unreliability = (1 - prediction reliability). A higher unreliability means the relevance information is less trustworthy, and a lower unreliability means it is more trustworthy.

The learning processor 180 trains on the training data so that the prediction reliability takes its maximum value, that is, approaches 1; specifically, it may train using the embedding vectors obtained by inputting the training data into the query-path encoding model.

However, if training drives the prediction reliability to its maximum value for every possible input, it becomes difficult to distinguish data by prediction reliability. In addition, conventionally trained models can suffer from overconfidence: a derived entity is output with a high relevance score even though its actual relevance to the queried entity is low.

To solve these problems, in an embodiment of the present invention, training may be performed so that the prediction reliability differs depending on whether a given input (a first entity - second entity pair) is included in the training data. More specifically, the prediction reliability may be trained to take the value 1 when the given input is included in the training data, and a value of at least 0 and less than 1 when it is not. Still more specifically, the prediction reliability may be trained to vary with the similarity between the given input and the training data, taking a value proportional to that similarity.
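The target-assignment rule described above can be sketched as follows. The similarity measure `shared_fraction`, the use of the maximum similarity over the training pairs, and the example entity names are hypothetical; the patent says only that the target is 1 for inputs in the training data and a value in [0, 1) proportional to similarity otherwise. The sketch assumes the similarity function returns a value below 1 for any pair not in the training data.

```python
def shared_fraction(pair_a, pair_b):
    """Hypothetical similarity: fraction of positions where two pairs agree."""
    return sum(a == b for a, b in zip(pair_a, pair_b)) / len(pair_a)

def reliability_target(pair, training_pairs, similarity=shared_fraction):
    """Return 1.0 for a pair seen in training, else its best similarity to any seen pair."""
    if pair in training_pairs:
        return 1.0
    return max((similarity(pair, seen) for seen in training_pairs), default=0.0)

training_pairs = {("disease_A", "gene_X"), ("disease_B", "gene_Y")}

seen_target = reliability_target(("disease_A", "gene_X"), training_pairs)
near_target = reliability_target(("disease_A", "gene_Z"), training_pairs)
far_target = reliability_target(("disease_C", "gene_Z"), training_pairs)
```

With these toy inputs, the seen pair gets the target 1, the pair sharing one entity with the training data gets an intermediate target, and the unrelated pair gets 0.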

Through this training process, the overconfidence problem of the trained model can be resolved, and because the prediction reliability is output together with the relevance information, the trustworthiness of the information output by the prediction model can be checked at the same time.

Meanwhile, in the present invention, training is performed under the constraint that the weight matrix of the artificial neural network model be an orthogonal matrix.

An artificial neural network generally includes one or more hidden layers between an input layer and an output layer, and during training the connection strengths (weights) between the nodes in the input, hidden, and output layers are adjusted. (These are nodes of the neural network structure, a different concept from the nodes defined by the node definition unit 133 described above.) Constraining these weights to be mutually orthogonal can improve the stability and performance of the model.

The output layer of the artificial neural network according to an embodiment of the present invention includes a first output node associated with the relevance information and a second output node associated with the prediction reliability.

In the present invention, training is performed so that the matrix of weights leading from the input nodes to the second output node (the weight matrix) is an orthogonal matrix, which can further improve the accuracy with which the prediction reliability is derived.

Here, the weight matrix (Q) of the artificial neural network can be expressed as follows.

[Formula 1]

(The equation is not recoverable from the extracted text; only its symbol definitions remain.)

Here, I is the identity matrix, the vectors appearing in the formula are the first through last column vectors (real vectors) of a parameter matrix, and Q is the weight matrix of the artificial neural network. The parameter matrix may be a lower triangular matrix.
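The surviving definitions (an identity matrix I, column vectors of a lower triangular parameter matrix, and a resulting weight matrix Q) are consistent with a well-known way of building an orthogonal matrix: a product of Householder reflections, H_i = I - 2 v_i v_i^T / (v_i^T v_i), with Q = H_1 H_2 ... H_n. Whether this is what Formula 1 states cannot be confirmed from the text, so the sketch below is an assumption offered only to show how such a constraint can be satisfied by construction. The function name and the 4 x 4 size are hypothetical.

```python
import numpy as np

def householder_orthogonal(params):
    """Build Q = H_1 H_2 ... H_n from the columns of a lower triangular matrix.

    Assumed parametrization: H_i = I - 2 v_i v_i^T / (v_i^T v_i), where v_i is
    the i-th column of the lower triangular part of `params`.
    """
    n = params.shape[0]
    lower = np.tril(params)
    q = np.eye(n)
    for i in range(n):
        v = lower[:, i]
        norm_sq = v @ v
        if norm_sq == 0.0:
            continue  # a zero column contributes the identity
        q = q @ (np.eye(n) - 2.0 * np.outer(v, v) / norm_sq)
    return q

rng = np.random.default_rng(0)
q = householder_orthogonal(rng.normal(size=(4, 4)))
```

Because each reflection is orthogonal, their product is orthogonal for any parameter values, so ordinary gradient-based training of the parameter matrix would keep Q orthogonal without a separate projection step.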

Meanwhile, in the present invention, the step of training on the training data to derive the relevance information (first training step) and the step of training to derive the prediction reliability (second training step) are performed simultaneously, which improves the accuracy with which both the relevance information and the prediction reliability are derived.

The artificial neural network in the present invention may be a model based on a DNN (Deep Neural Network), CNN (Convolutional Neural Network), DCNN (Deep Convolutional Neural Network), RNN (Recurrent Neural Network), RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), SSD (Single Shot Detector), MLP (Multi-layer Perceptron), or an attention mechanism, but is not limited thereto, and various artificial neural network models may be applied to the present invention.

The prediction terminal 200 receives the artificial neural network model generated by the model generation terminal 100. When an arbitrary first entity is input from outside through the input unit 220, the processor 250 inputs it to the artificial neural network model, and a plurality of second entities related to the first entity, together with the relevance information and prediction reliability for each second entity, are derived through the output layer of the model. The information derived through the output layer may be output through the output unit 260 of the prediction terminal 200 in a form that is visible from outside.

In one example, the processor 250 of the prediction terminal 200 may correct the relevance information based on the prediction reliability derived by the artificial neural network model, and derive corrected relevance information as the result.

The relevance information may be expressed as a numerical relevance score, and the processor 250 may correct it so that the lower the derived prediction reliability, the lower the final relevance score. For example, the corrected relevance information may be computed as (relevance information × prediction reliability); in another example, it may be computed as (relevance information - unreliability). The invention is not limited to these examples, however: any formula under which the corrected relevance information approaches the (initial) relevance information as the prediction reliability increases may be applied.
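The two example formulas can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, both scores are assumed to lie in [0, 1], and "unreliability" is assumed to mean (1 - prediction reliability), which the text does not define.

```python
def correct_multiplicative(relevance: float, reliability: float) -> float:
    """Corrected relevance = relevance x prediction reliability."""
    return relevance * reliability


def correct_subtractive(relevance: float, reliability: float) -> float:
    """Corrected relevance = relevance - unreliability.

    Assumption: unreliability is taken to be (1 - reliability).
    """
    return relevance - (1.0 - reliability)
```

Under these assumptions both functions return the initial relevance score unchanged when the reliability is 1, and a lower score as the reliability falls, which is the property the paragraph above requires.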

FIG. 12 shows an example in which the correction technique proposed by the present invention is applied.

Alzheimer's disease was queried to the artificial neural network model, and the second entities GRIN2A, GRIN2B, PPARG, ADRB3, and PTGS2 were derived. Relevance information and a prediction reliability were derived together for each second entity, and the entities are output ranked by the magnitude of their relevance information.

Referring to FIG. 12, corrected relevance information, which is the result of correcting the relevance information based on the derived prediction reliability, is derived again in the form of a score. The ranking is re-sorted according to the corrected relevance information and output, and it differs from the ranking before correction (before correction: GRIN2A, GRIN2B, PPARG, ADRB3, PTGS2; after correction: GRIN2A, PTGS2, PPARG, GRIN2B, ADRB3). This lets the system user identify the second entities whose predictions are more reliable.
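The re-ranking can be illustrated with a short sketch. The numeric scores below are hypothetical values chosen only so that the two orderings match those stated above; they are not the values shown in FIG. 12. The multiplicative correction (relevance × reliability) is used as one of the example formulas.

```python
# Hypothetical (relevance, reliability) pairs; NOT the values from FIG. 12.
candidates = {
    "GRIN2A": (0.95, 0.90),
    "GRIN2B": (0.90, 0.50),
    "PPARG": (0.85, 0.70),
    "ADRB3": (0.80, 0.50),
    "PTGS2": (0.75, 0.95),
}


def rank_by_relevance(cands):
    """Rank second entities by the uncorrected relevance score, highest first."""
    return sorted(cands, key=lambda name: cands[name][0], reverse=True)


def rank_by_corrected(cands):
    """Rank by corrected relevance (relevance x reliability), highest first."""
    return sorted(cands, key=lambda name: cands[name][0] * cands[name][1], reverse=True)


before = rank_by_relevance(candidates)
after = rank_by_corrected(candidates)
```

With these made-up inputs, PTGS2 moves from last to second place because its prediction is highly reliable, while GRIN2B and ADRB3 drop because theirs are not.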

2. Description of the Method

The prediction method according to an embodiment of the present invention is described in detail with reference to FIGS. 5, 10, and 11. Because the training of the artificial neural network model used in the prediction method has been described above, only the key points of each step are summarized below.

First, the method by which the model generation terminal 100 generates the prediction model is described with reference to FIGS. 5 and 10.

The data collection unit 131 collects one or more of first data related to diseases, second data related to genes or proteins, and third data related to drugs from a plurality of databases (D1 to Dn) (S51).

Next, the node defining unit 133, the edge defining unit 135, and the path defining unit 136 define nodes, edges, and paths using the collected first to third data (S52).

Next, the embedding unit 137 performs embedding so that the defined nodes and edge types are represented as real-valued vectors in a k-dimensional space (S53).

Next, the path score calculation unit 138 uses the embedding results to calculate a score for each edge included in a path, and uses the calculated edge scores to calculate a path score for each path (S54).

Next, based on the path scores calculated by the path score calculation unit 138, the path extraction unit 139 extracts a plurality of paths for each preset path type of a node pair (S55). The paths extracted by the path extraction unit 139 are used for training, and paths that are not extracted (paths of low importance) are excluded from training.
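The scoring and extraction steps can be sketched as below. This is only an illustration under stated assumptions: the text does not say here how edge scores are combined into a path score, so a simple product is assumed; the function names, the tuple layout, and the `top_n` parameter are all hypothetical.

```python
from collections import defaultdict
from math import prod


def path_score(edge_scores):
    """Combine per-edge scores into one path score (assumption: product)."""
    return prod(edge_scores)


def extract_top_paths(paths, top_n):
    """Keep the top_n highest-scoring paths for each path type.

    paths: iterable of (path_type, path_id, edge_scores) tuples.
    Returns {path_type: [path_id, ...]} ordered from highest score down.
    """
    by_type = defaultdict(list)
    for path_type, path_id, edge_scores in paths:
        by_type[path_type].append((path_score(edge_scores), path_id))
    return {
        path_type: [pid for _, pid in sorted(scored, reverse=True)[:top_n]]
        for path_type, scored in by_type.items()
    }
```

Paths that do not make the per-type cut are simply absent from the result, which corresponds to excluding low-importance paths from training.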

Next, the second processor 140 generates training data that includes the embedding vectors of the first to third nodes generated in the steps of FIG. 5 and the paths extracted by the path extraction unit 139 (S101). Specifically, data including the embedding vectors of the first to third nodes and the extracted paths may be input into a query-path encoding model, and the resulting embedding vectors may be used as the training data.

Next, the learning processor 180 trains the artificial neural network on the training data so that, when a first entity is queried, the network derives a plurality of second entities related to the queried first entity and relevance information (a relevance score) between the queried first entity and each second entity (S102). This training step produces a relevance derivation model.

Next, the learning processor 180 trains the artificial neural network on the training data so that, when a first entity is queried, the network derives a prediction reliability for each second entity related to the queried first entity (S103). This training step produces a prediction reliability derivation model.

In one embodiment of the present invention, steps S102 and S103 may be performed simultaneously. That is, the learning processor 180 trains the artificial neural network on the training data so that it simultaneously derives the plurality of second entities related to the queried first entity, the relevance information between the queried first entity and each second entity, and the prediction reliability for the queried first entity and each second entity. In this case, the relevance derivation model and the prediction reliability derivation model are generated not as separate models but as a single prediction model (S104).
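One common way to have a single model produce both outputs is a shared trunk with two output heads. The sketch below shows only a forward pass in plain Python and is an assumption for illustration: the class name, the layer shapes, and the tanh and sigmoid activations are hypothetical and are not taken from the specification, and no training procedure or loss is shown.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


class TwoHeadModel:
    """Shared hidden layer feeding a relevance head and a reliability head."""

    def __init__(self, trunk_weights, relevance_weights, reliability_weights):
        self.trunk_weights = trunk_weights              # one row per hidden unit
        self.relevance_weights = relevance_weights      # one weight per hidden unit
        self.reliability_weights = reliability_weights  # one weight per hidden unit

    def forward(self, features):
        """Return (relevance score, prediction reliability) for one entity pair."""
        hidden = [math.tanh(dot(row, features)) for row in self.trunk_weights]
        relevance = sigmoid(dot(self.relevance_weights, hidden))
        reliability = sigmoid(dot(self.reliability_weights, hidden))
        return relevance, reliability
```

Because both heads read the same hidden representation, one training run can update the shared weights for both outputs, which is the sense in which the two derivation models become a single prediction model.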

Referring to FIG. 11, an arbitrary first entity (a disease, drug, or gene) is queried to the trained prediction model (artificial neural network model) (S111), and a plurality of second entities related to the queried first entity, together with relevance information and a prediction reliability for each second entity, are derived (S112).

Next, the relevance information is corrected based on the derived prediction reliability (S113), and the corrected relevance information may be output in a form visible to the user (S114).

The prediction method according to an embodiment of the present invention described above may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The present invention has been described above with reference to the embodiments shown in the drawings so that those skilled in the art can readily understand and reproduce it. These embodiments are merely illustrative, and those skilled in the art will understand that various modifications and equivalent other embodiments are possible. Accordingly, the scope of protection of the present invention should be determined by the claims.

100: Model generation terminal
110: Communication unit
120: Input unit
130: First processor
131: Data collection unit
132: Natural language processing unit
133: Node defining unit
134: ID assignment unit
135: Edge defining unit
136: Path defining unit
137: Embedding unit
138: Path score calculation unit
139: Path extraction unit
140: Second processor
150: Control unit
160: Output unit
170: Memory
180: Learning processor
200: Prediction terminal
210: Communication unit
220: Input unit
230: Control unit
240: Memory
250: Processor
260: Output unit

Claims

A target prediction method comprising:
a derivation step in which, as an arbitrary first entity is input into a target prediction model, a processor derives a plurality of second entities related to the input first entity, and relevance information and a prediction reliability between the first entity and each second entity; and
a correction step in which the processor derives corrected relevance information by correcting the relevance information of each of the plurality of second entities related to the input first entity, based on the respective derived prediction reliability,
wherein the target prediction model is generated by:
a first learning step in which a learning processor trains an artificial neural network so that, when a first entity is queried, it derives a plurality of second entities related to the queried first entity and relevance information between the queried first entity and each second entity; and
a second learning step in which the learning processor trains the artificial neural network so that, when a first entity is queried, it derives a plurality of second entities related to the queried first entity and a prediction reliability for the queried first entity and each second entity, and
wherein the artificial neural network learns training data including node data, in which disease-related data is defined as a first node, gene-related data as a second node, and drug-related data as a third node, and preset paths for each node pair.

The target prediction method of claim 1, wherein
the first learning step and the second learning step are performed simultaneously.

The target prediction method of claim 1, wherein
the second learning step comprises
training so that the prediction reliability is maximized for the training data, and further comprises training so that different prediction reliabilities are derived depending on the similarity between a pair of a queried first entity and a derived second entity and the training data.

The target prediction method of claim 1, wherein
the second learning step comprises
training so that the prediction reliability is maximized for the training data, and further comprises training so that different prediction reliabilities are derived depending on whether a pair of a queried first entity and a derived second entity is included in the training data.

The target prediction method of claim 4, wherein
the second learning step comprises
training on the training data so that a weight matrix of the artificial neural network becomes an orthogonal matrix.

The target prediction method of claim 5, wherein
the weight matrix (Q) is given by
[formula not reproduced in the source text],
where I is the identity matrix and [symbol not reproduced in the source text] is a column vector of a parameter matrix that is a lower triangular matrix.

The target prediction method of claim 1, wherein
the first entity is a term belonging to any one of the types disease, gene, and drug, and
the prediction model outputs second entities belonging to a type different from the type of the input first entity.

The target prediction method of claim 1,
further comprising an output step of outputting, through an output device, the plurality of second entities, the relevance information, and the prediction reliability derived in the derivation step.

The target prediction method of claim 8, wherein
the correction step is performed after the derivation step and before the output step, and
the output step further comprises outputting the corrected relevance information derived in the correction step.

The target prediction method of claim 1, wherein
the relevance information includes a relevance score proportional to the relevance to the queried first entity, and
the correction step further comprises correcting the relevance score so that the lower the derived prediction reliability, the lower the relevance score.

The target prediction method of claim 1, wherein
the training data includes:
embedding vectors of the first to third nodes; and
some paths extracted, in descending order of a path score calculated by a preset method, from among the plurality of paths belonging to a preset path type (metapath) for each node pair.

The target prediction method of claim 11, wherein
the training data includes
an embedding vector output from a query-path encoding model when the following are input into the query-path encoding model: an embedding vector output from a query encoding model as an arbitrary entity pair is input into the query encoding model, and an embedding vector output from a path encoding model as some paths extracted from among the paths connecting the arbitrary entity pair are input into the path encoding model.

The target prediction method of claim 12, wherein
the arbitrary entity pair is
input into the query encoding model in the form of a query embedding vector that includes entity embedding vectors identifying each entity of the arbitrary entity pair and positional embedding vectors identifying the position of each entity embedding vector, and
the path connecting the arbitrary entity pair is
input into the path encoding model in the form of a path embedding vector that includes embedding vectors identifying each of the entities included in the path between the arbitrary entity pair and each of the edge types of the edges connecting those entities, and positional embedding vectors identifying the position of each of those embedding vectors.

A target prediction system using an artificial neural network pre-trained through the method of any one of claims 1 to 13, the system comprising:
an input device configured to receive an arbitrary first entity as input;
a processor configured to query the input layer of the artificial neural network with the first entity received through the input device so as to derive a plurality of second entities, relevance information, and a prediction reliability, and to correct the relevance information based on the derived prediction reliability; and
an output device configured to output the plurality of second entities, the relevance information, and the prediction reliability derived by the processor.

The target prediction system of claim 14, wherein
the corrected relevance information is output through the output device.

A computer program stored on a computer-readable recording medium to execute the method according to any one of claims 1 to 13.