KR20220160407A - Device and method for predicting biomedical association

Info

Publication number: KR20220160407A
Application number: KR1020210068615A
Authority: KR
Inventors: 김영학; 전태준; 김윤하; 정연욱; 안임진; 권한슬; 강희준; 조승주
Original assignee: 재단법인 아산사회복지재단; 울산대학교 산학협력단
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-12-06
Also published as: KR102519848B1

Abstract

According to one embodiment of the present invention, a method for predicting biomedical association performed by a computing device may include: extracting, from a real world database, real world data including a diagnosis code and a prescription code linked to patient information; extracting, from a biomedical database, biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier; constructing a new biological network by combining the real world data and the biomedical data; training a machine learning model using random walk data generated by random walks on the constructed new biological network; and predicting a link between entity nodes based on embedding vectors of entity nodes of the new biological network, obtained from the trained machine learning model. According to one embodiment of the present invention, the computing device may construct a database that can be expanded as biomedical databases are added.

Description

Device and method for predicting biomedical association {DEVICE AND METHOD FOR PREDICTING BIOMEDICAL ASSOCIATION}

The following provides technology relating to a method and device for predicting biomedical association.

With advances in electronic devices and the spread of systematic electronic record keeping, patient clinical data is accumulating in the form of electronic medical records (EMR). EMR data is not clinical data from a trial environment with controlled variables; it is data from the real world, in which many variables are entangled, and therefore corresponds to real world data (RWD). Clinical evidence about the use, potential benefits, and potential risks of medicines and the like that is obtained by analyzing RWD is called real world evidence (RWE). Recent clinical research has explored, from many angles, attempts to predict or explain patients' biological phenomena and actual clinical events on the basis of RWE.

Research is under way to explain the complex biological activity underlying disease and to predict relationships between diseases and drugs. In fields such as molecular biology and genomics, open biomedical databases have been built for the relevant substances and are used in research aimed at understanding complex biological phenomena. Relationship research based on biomedical databases can be a new approach to understanding the interactions among related elements such as genes, proteins, organs, drugs, and diseases, and the single large context that they form.

The background art described above was possessed or acquired by the inventors in the course of deriving the present disclosure, and is not necessarily publicly known art disclosed to the general public before the filing of the present application.

A computing device according to an embodiment may combine the biological and chemical knowledge accumulated in biomedical databases with the clinical knowledge in electronic medical records.

A computing device according to an embodiment may search for biomedical associations, using graph embedding technology, in a database constructed as a combined graph structure.

A computing device according to an embodiment may build a database that accommodates the addition of patients and tables from electronic medical records.

A computing device according to an embodiment may build a database that accommodates the addition of biomedical databases.

However, the technical problems are not limited to those described above, and other technical problems may exist.

A method for predicting biomedical association performed by a computing device according to an embodiment may include: extracting, from a real world database, real world data including a diagnosis code and a prescription code linked to patient information; extracting, from a biomedical database, biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier; constructing a new biological network by combining the real world data and the biomedical data; training a machine learning model from the constructed new biological network; and predicting a link between entity nodes based on embedding vectors of entity nodes of the new biological network, obtained from the trained machine learning model.

The extracting of the real world data may include generating the real world data in a graph structure by connecting, through edges, a patient node indicating patient information to a prescription node indicating a prescription code extracted for the patient information and to a diagnosis node indicating a diagnosis code.

The extracting of the biomedical data may include: connecting, through an edge, a gene node indicating genetic information to a disease node indicating a disease identifier corresponding to the genetic information; connecting, through an edge, the disease node to a chemical node indicating a chemical identifier corresponding to the disease identifier; connecting, through an edge, the chemical node to a parent structure node indicating a parent structure identifier corresponding to the chemical identifier; and generating biomedical data in a graph structure including the gene node, the disease node, the chemical node, and the parent structure node connected through edges.

In the biomedical data, the disease node may be connected to the parent structure node by edges via the chemical node, and the chemical node may be connected to the gene node by edges via the disease node.

The constructing of the new biological network may include constructing the new biological network by connecting a diagnosis node of the real world data to a disease node of the biomedical data and connecting a prescription node of the real world data to a chemical node of the biomedical data.

The constructing of the new biological network may include: mapping to each other, based on a common data model for biomedical terminology, the diagnosis node and the disease node that correspond to the same concept unique identifier; and connecting to each other, based on the common data model, the prescription node and the chemical node that correspond to the same concept unique identifier.

The constructing of the new biological network may include: extracting a first string corresponding to the diagnosis code indicated by the diagnosis node; retrieving, from a common data model, a first concept unique identifier having the extracted first string; and mapping the diagnosis node to the disease node indicating a disease identifier corresponding to the retrieved first concept unique identifier.

The constructing of the new biological network may include: extracting a second string corresponding to the prescription code indicated by the prescription node; retrieving, from a common data model, a second concept unique identifier corresponding to the extracted second string; and mapping the prescription node to the chemical node indicating a chemical identifier corresponding to the retrieved second concept unique identifier.
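The string-based mapping described above can be sketched in Python as follows. This is a minimal illustration only: the function names, the exact-match lookup after normalization, and all identifiers and strings in the example data are hypothetical placeholders, and the specification does not state how strings are compared against the common data model.

```python
def normalize(text):
    """Lower-case and collapse whitespace so equivalent strings compare equal."""
    return " ".join(text.lower().split())

def map_by_concept(source_strings, concept_table, target_by_cui):
    """Map source nodes to target nodes that share a concept unique identifier.

    source_strings: source node id -> string extracted for its code
    concept_table:  normalized string -> concept unique identifier (assumed lookup)
    target_by_cui:  concept unique identifier -> target node id
    """
    mapping = {}
    for node_id, text in source_strings.items():
        cui = concept_table.get(normalize(text))
        if cui is not None and cui in target_by_cui:
            mapping[node_id] = target_by_cui[cui]
    return mapping

# Placeholder data; none of these codes or names come from a real vocabulary.
diagnosis_strings = {"DX_A": "Example  Disease Alpha", "DX_B": "Unlisted condition"}
concept_table = {"example disease alpha": "CUI_0001"}
disease_by_cui = {"CUI_0001": "DISEASE_7"}

diagnosis_to_disease = map_by_concept(diagnosis_strings, concept_table, disease_by_cui)
```

The same function could be applied to prescription strings and chemical nodes; a diagnosis with no matching concept (here `DX_B`) is simply left unmapped.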

The training of the machine learning model may include: selecting an initial entity node in the constructed new biological network; selecting an entity node for each individual random walk by sequentially performing random walks from the selected initial entity node up to a predetermined random walk length; and generating random walk data indicating a random walk set that includes the initial entity node and the entity nodes selected by the random walks.
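The random walk generation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each step picks a neighbor uniformly at random, every node is used as an initial entity node, and the names `random_walk`, `generate_walks`, and `walks_per_node` are hypothetical. The specification does not fix the transition rule or how the initial node is selected.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Walk `walk_length` steps from `start`; the result has walk_length + 1 nodes."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = sorted(adjacency[walk[-1]])
        if not neighbors:          # isolated node: stop early
            break
        walk.append(rng.choice(neighbors))  # assumed uniform transition
    return walk

def generate_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Build a list of random walk sets, starting from every node in turn."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        starts = sorted(adjacency)
        rng.shuffle(starts)
        for start in starts:
            walks.append(random_walk(adjacency, start, walk_length, rng))
    return walks

# Toy undirected graph: a - b - c
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
walks = generate_walks(adjacency, walks_per_node=2, walk_length=4, seed=0)
```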

The training of the machine learning model may further include: calculating temporary embedding data by applying the machine learning model to input data based on the random walk data; and updating parameters of the machine learning model, based on the calculated temporary embedding data, so that the connection probability between entity nodes in the random walk set is maximized.
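One way to realize such an update is a skip-gram style objective, in which nodes that appear near each other in a walk are assigned a high co-occurrence probability. The pure-Python sketch below is an assumption for illustration, not the model of this specification: the full-softmax loss, the window size, the learning rate, and the function name `train_skipgram` are all hypothetical choices.

```python
import math
import random

def train_skipgram(walks, num_nodes, dim=4, window=1, lr=0.1, epochs=100, seed=0):
    """Learn node embeddings by maximizing P(context | center) over walk pairs."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_nodes)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_nodes)]
    # (center, context) pairs for nodes within `window` positions in a walk
    pairs = [(walk[i], walk[j])
             for walk in walks
             for i in range(len(walk))
             for j in range(max(0, i - window), min(len(walk), i + window + 1))
             if i != j]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            h = list(w_in[center])               # temporary embedding of the center
            scores = [sum(o * x for o, x in zip(row, h)) for row in w_out]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]        # softmax over all nodes
            total -= math.log(probs[context])    # negative log-likelihood
            grad_h = [0.0] * dim
            for node, row in enumerate(w_out):
                g = probs[node] - (1.0 if node == context else 0.0)
                for k in range(dim):
                    grad_h[k] += g * row[k]      # uses the pre-update value
                    row[k] -= lr * g * h[k]
            for k in range(dim):
                w_in[center][k] -= lr * grad_h[k]
        losses.append(total / len(pairs))
    return w_in, losses

# Two disconnected toy chains, with nodes encoded as integers 0..5
walks = [[0, 1, 2, 1, 0], [3, 4, 5, 4, 3]]
embeddings, losses = train_skipgram(walks, num_nodes=6)
```

A full softmax is used here only because the toy graph is tiny; for a large network, an approximation such as negative sampling would usually be substituted.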

The predicting of the link between the entity nodes may include: extracting, from the machine learning model, a first embedding vector corresponding to a first node among the entity nodes of the new biological network; extracting, from the machine learning model, a second embedding vector corresponding to a second node among the entity nodes of the new biological network; and determining whether an edge exists between the first node and the second node based on the extracted first embedding vector and the extracted second embedding vector.

The determining of whether the edge exists may include determining that an edge exists between the first node and the second node in response to the vector distance between the extracted first embedding vector and the second embedding vector being less than a threshold distance value.
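The threshold test described above can be sketched in Python as follows. Euclidean distance is assumed here purely for illustration, since the specification refers only to a vector distance; the function name `predict_edge` and the example vectors are hypothetical.

```python
import math

def predict_edge(first_embedding, second_embedding, threshold):
    """Return True when the two embedding vectors are closer than the threshold."""
    distance = math.dist(first_embedding, second_embedding)  # Euclidean (assumed)
    return distance < threshold

# Placeholder embedding vectors for two entity nodes
drug_vec = [0.0, 0.0]
disease_vec = [3.0, 4.0]

close_enough = predict_edge(drug_vec, disease_vec, threshold=6.0)
too_far = predict_edge(drug_vec, disease_vec, threshold=4.0)
```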

A computing device according to an embodiment may include: a memory storing a machine learning model for biomedical association prediction; and a processor configured to extract, from a real world database, real world data including a diagnosis code and a prescription code linked to patient information, extract, from a biomedical database, biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier, construct a new biological network by combining the real world data and the biomedical data, train the machine learning model using random walk data generated by random walks on the constructed new biological network, and predict a link between entity nodes based on embedding vectors of entity nodes of the new biological network, obtained from the trained machine learning model.

A computing device according to an embodiment may effectively combine the biological and chemical knowledge accumulated in biomedical databases with the clinical knowledge in electronic medical records.

A computing device according to an embodiment may effectively search for associations between biological entities, such as drug indications, adverse drug reactions, and drug treatment effects, using graph embedding technology in a database constructed as a combined graph structure.

A computing device according to an embodiment may build a scalable database that accommodates the addition of patients and tables from electronic medical records.

A computing device according to an embodiment may build a scalable database that accommodates the addition of biomedical databases.

FIG. 1 shows a block diagram of a device for predicting biomedical association according to an embodiment.
FIG. 2 is a flowchart illustrating a method for predicting biomedical association according to an embodiment.
FIG. 3 shows real world data according to an embodiment.
FIG. 4 shows biomedical data according to an embodiment.
FIGS. 5 to 7 illustrate the mapping between real world data and biomedical data according to an embodiment.
FIG. 8 illustrates a mapping result between real world data and biomedical data according to an embodiment.
FIGS. 9 and 10 illustrate a new merged network generated according to an embodiment.
FIG. 11 illustrates training of a machine learning model and biomedical association prediction using the trained machine learning model according to an embodiment.
FIG. 12 shows, by way of example, embedding representation vectors corresponding to individual entity nodes embedded in a latent space according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and alternatives that fall within the technical idea described through the embodiments.

Although terms such as first and second may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is described as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that another component may exist between them.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted.

Recently, research has been under way to explain the complex biological activity underlying disease and to predict relationships between diseases and related substances. In fields such as molecular biology and genomics, open-source databases have been built for the relevant substances and are used in research aimed at understanding complex biological phenomena.

Real world data (RWD) refers to information from the natural setting of routine practice, as opposed to an experimental setting in which variables are controlled. Because the workings of whole-life phenomena have not been fully revealed or understood, tests that measure factors and phenomena may not be well enough defined to design a controlled setting in a laboratory.

The use of real world data in biomedical research reports may grow with the growth of information systems. With the recent popularization of electronic devices and the spread of systematic data collection, biomedical data may accumulate in the form of electronic medical records (EMR). This allows a reasonable amount of data to be collected computationally. In clinical research, EMR data may be used to demonstrate prediction of clinical events, medical prognosis, or biomedical association prediction (e.g., drug repositioning), and to support clinical regulatory surveillance decisions.

The following describes an effective combination of real world data and biomedical data, and biomedical association prediction using a biological network newly constructed from that combination. A biomedical association is an association between entities in a biological database and may indicate whether biological characteristics are related to one another; examples of associations between entities include drug repositioning, adverse drug reaction prediction, drug treatment effect prediction, and disease onset prediction. Predicting a previously unknown association among drugs, diseases, prescriptions, and/or diagnoses may be referred to as drug repositioning. This specification mainly describes drug repositioning as an example of biomedical association prediction, but is not limited thereto.

FIG. 1 shows a block diagram of a device for predicting biomedical association according to an embodiment.

A computing device 100 according to an embodiment may include a processor 110 and a memory 120.

The processor 110 may extract, from a real world database, real world data including a diagnosis code and a prescription code linked to patient information. The processor 110 may extract, from a biomedical database, biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier. The processor 110 may construct a new biological network by combining the real world data and the biomedical data. The processor 110 may train a machine learning model using random walk data generated by random walks on the constructed new biological network, and may predict a link between entity nodes based on embedding vectors of entity nodes of the new biological network, obtained from the trained machine learning model. Detailed operations of the processor 110 are described below with reference to FIGS. 2 to 12.

The memory 120 may store a machine learning model for biomedical association prediction. The memory 120 may temporarily or permanently store data required for biomedical association prediction and for training the machine learning model.

The computing device 100 may further include an output unit (e.g., a display). The output unit may visually output the embedding vectors mapped into the latent space and/or the link prediction results between entity nodes.

The computing device 100 according to an embodiment may map terminology used in common by electronic medical records and biomedical databases, and may build a database in graph form. An effective graph structure for exploring the potential benefits and risks of drugs is constructed in the graph database, and the computing device 100 may explore drug effects using graph embedding technology. The computing device 100 may explore drug indications by using electronic medical records and biomedical databases together.

The computing device 100 may map terminology used in common by electronic medical records and biomedical databases. When the electronic medical records and the biomedical database are combined into a graph, the computing device 100 may effectively embed a latent representation of the graph.

In this specification, a biological network refers to any network applied to data about a biological system that can provide a mathematical representation of the connections among the data. A biological network may be represented as graph-structured data. A graph is a data structure composed of vertices and edges, and in this specification a vertex may also be referred to as an entity node. For example, the biological data included in a biological network may include one or more entity nodes. An entity node is defined as one of various entity types according to the entities included in the biological network database, and indicates information corresponding to that entity type (for example, a node of the disease type indicates information identifying a kind of disease). The entity types described in this specification by way of example are the diagnosis type, prescription type, and patient type of the real world database, and the gene type, disease type, chemical type, and parent structure type of the biomedical database. The entity nodes of each entity type are described with reference to FIGS. 3 and 4 below. Each entity node may be connected to one or more other entity nodes through edges. This specification mainly describes edges that have neither direction nor weight, but is not limited thereto.
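As a rough illustration of the graph structure described above, the following Python sketch stores typed entity nodes and undirected, unweighted edges. The class name `BioGraph`, its methods, the type labels, and the node identifiers are hypothetical and are not taken from this specification.

```python
from collections import defaultdict

# Entity types named in this specification; the string labels are illustrative.
ENTITY_TYPES = {
    "patient", "diagnosis", "prescription",             # real world database
    "gene", "disease", "chemical", "parent_structure",  # biomedical database
}

class BioGraph:
    """Undirected, unweighted graph of typed entity nodes (hypothetical helper)."""

    def __init__(self):
        self.node_type = {}                # node id -> entity type
        self.adjacency = defaultdict(set)  # node id -> neighboring node ids

    def add_node(self, node_id, entity_type):
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.node_type[node_id] = entity_type
        self.adjacency[node_id]            # create an entry even for isolated nodes

    def add_edge(self, a, b):
        if a not in self.node_type or b not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.adjacency[a].add(b)           # store both directions: undirected edge
        self.adjacency[b].add(a)

    def neighbors(self, node_id):
        return sorted(self.adjacency[node_id])

graph = BioGraph()
graph.add_node("patient:P1", "patient")
graph.add_node("diagnosis:DX_A", "diagnosis")
graph.add_edge("patient:P1", "diagnosis:DX_A")
```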

For reference, in this specification a biological network may include a real world data network included in a real world database and a biomedical network included in a pre-built biomedical database. Data extracted from the real world data network may be referred to as real world data (RWD), and data extracted from the biomedical network may be referred to as biomedical data.

FIG. 2 is a flowchart illustrating a method for predicting biomedical association according to an embodiment.

First, in step 210, the computing device may extract, from a real world database, real world data including a diagnosis code and a prescription code linked to patient information. For example, the computing device may generate the real world data in a graph structure by connecting, through edges, a patient node indicating patient information to a prescription node indicating a prescription code extracted for the patient information and to a diagnosis node indicating a diagnosis code.
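Step 210 can be sketched in Python as follows. This is a minimal sketch that assumes the real world database has already been flattened into (patient, diagnosis code, prescription code) rows; the function name `build_rwd_graph`, the row layout, and the placeholder codes are hypothetical.

```python
from collections import defaultdict

def build_rwd_graph(records):
    """Connect each patient node to its diagnosis and prescription nodes.

    records: iterable of (patient_id, diagnosis_code, prescription_code) rows.
    Nodes are (entity_type, identifier) tuples; edges are undirected.
    """
    adjacency = defaultdict(set)
    for patient_id, diagnosis_code, prescription_code in records:
        patient = ("patient", patient_id)
        for node in (("diagnosis", diagnosis_code),
                     ("prescription", prescription_code)):
            adjacency[patient].add(node)
            adjacency[node].add(patient)
    return adjacency

# Placeholder rows; the codes are not real diagnosis or prescription codes.
records = [
    ("P1", "DX_A", "RX_1"),
    ("P1", "DX_B", "RX_2"),
    ("P2", "DX_A", "RX_2"),
]
rwd = build_rwd_graph(records)
```

In this layout, two patients who share a diagnosis code are joined through the same diagnosis node, which lets later random walks pass between patients with similar records.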

Next, in step 220, the computing device may extract, from a biomedical database, biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier. For example, the computing device may connect, through an edge, a gene node indicating genetic information to a disease node indicating a disease identifier corresponding to the genetic information. The computing device may connect, through an edge, the disease node to a chemical node indicating a chemical identifier corresponding to the disease identifier. The computing device may connect, through an edge, the chemical node to a parent structure node indicating a parent structure identifier corresponding to the chemical identifier. The computing device may generate biomedical data in a graph structure including the gene node, the disease node, the chemical node, and the parent structure node connected through edges.

For reference, in the biomedical data, a disease node may be connected through edges to a parent structure node by way of a chemical node, and a chemical node may be connected through edges to a gene node by way of a disease node.

Subsequently, in step 230, the computing device may construct a new biological network by combining the real world data and the biomedical data. For example, the computing device may construct the new biological network by connecting a diagnosis node of the real world data to a disease node of the biomedical data, and by connecting a prescription node of the real world data to a chemical node of the biomedical data.

In step 240, the computing device may train a machine learning model using random walk data generated by performing random walks on the constructed new biological network.
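
The random walk data of step 240 can be pictured with a minimal sketch. The description does not specify the walk strategy, so the sketch assumes uniform random walks over an undirected, unweighted adjacency map; the function name `random_walks`, its parameters, and the toy node names are hypothetical.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks over an undirected, unweighted graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(adjacency[walk[-1]])
                if not neighbors:
                    break  # dead end: stop this walk early
                # Every neighbor is chosen with equal probability.
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy network; node names are illustrative only.
toy_graph = {
    "patient:1": {"diagnosis:A", "prescription:X"},
    "diagnosis:A": {"patient:1", "disease:D1"},
    "prescription:X": {"patient:1", "chemical:C1"},
    "disease:D1": {"diagnosis:A", "chemical:C1"},
    "chemical:C1": {"prescription:X", "disease:D1"},
}
walks = random_walks(toy_graph, walk_length=5, walks_per_node=2)
```

Each walk is a sequence of node names that could serve as one training "sentence" for an embedding model; the choice of model is left open here.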

Subsequently, in step 250, the computing device may use the trained machine learning model to predict a link between entity nodes based on the embedding vectors of the entity nodes of the new biological network.
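
As a rough illustration of step 250, candidate links can be ranked by comparing embedding vectors. The description does not state the scoring function, so cosine similarity is an assumption here, and the embedding values, node names, and the helper `rank_candidate_links` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidate_links(embeddings, source, candidates):
    """Sort candidate target nodes by similarity to the source node."""
    scored = [
        (node, cosine_similarity(embeddings[source], embeddings[node]))
        for node in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical two-dimensional embeddings, for illustration only.
embeddings = {
    "chemical:C1": [1.0, 0.0],
    "disease:D1": [0.9, 0.1],
    "disease:D2": [0.0, 1.0],
}
ranked = rank_candidate_links(
    embeddings, "chemical:C1", ["disease:D1", "disease:D2"]
)
```

A trained classifier over pairs of embedding vectors would be an equally plausible reading of the step; the similarity ranking is only the simplest stand-in.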

FIG. 3 illustrates real world data according to an embodiment.

The computing device may extract real world data 300 from a real world data network. The real world data network is, by way of example, a network composed of data accumulated in the form of electronic medical records (EMR), and may also be referred to as a real world database. The real world data 300 represents data from the real world, in which many variables are present, rather than clinical data from an experimental setting in which certain variables are restricted. Because the variables that affect biological phenomena are highly diverse, results obtained from restricted experiments may differ from results obtained in the real world. Since the real world data 300 reflects the full, unrestricted range of real variables, it can serve as clinical evidence. Accumulated real world data 300, such as electronic medical record data, may offer an opportunity to better explain biological phenomena or actual clinical events through new approaches.

The electronic medical record data according to an embodiment may be extracted from the CardioNet database of Asan Medical Center (AMC). Structured entities of the real world data 300 may be extracted from a patient demographics table, a diagnosis table 310, and a medication table 320. Patients in the real world database may be anonymized and numbered by a patient identifier (PAID). A patient encounter number (EN_NO) may be assigned at each initial visit for a different kind of diagnosis. A key formed by string-concatenating the patient identifier (PAID) and the patient encounter number (EN_NO) may be used to look up the patient's entries in the diagnosis table 310 and the medication table 320. The patient encounter number (EN_NO) may also be replaced by this key. Patient encounter information may include the patient encounter number (EN_NO) and/or the key formed by string-concatenating the patient identifier (PAID) and the patient encounter number (EN_NO). For example, an entity node of the patient type (hereinafter, a 'patient node') may include, as patient information, a patient identifier node indicating the patient identifier (PAID) and a patient encounter node indicating the patient encounter number (EN_NO) or the key.
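
The key-based lookup can be sketched as follows. The description says only that PAID and EN_NO are string-concatenated, so the underscore separator is an assumption, and the table rows, code values, and helper names are hypothetical placeholders.

```python
def make_encounter_key(paid, en_no, sep="_"):
    """Join the patient identifier and encounter number into one lookup key."""
    return f"{paid}{sep}{en_no}"

def index_by_key(rows, code_column):
    """Collect the codes in `code_column` under each PAID + EN_NO key."""
    index = {}
    for row in rows:
        key = make_encounter_key(row["PAID"], row["EN_NO"])
        index.setdefault(key, []).append(row[code_column])
    return index

# Hypothetical rows; identifiers and codes are placeholders.
diagnosis_table = [
    {"PAID": "P001", "EN_NO": "1", "DICD": "DX_A"},
    {"PAID": "P001", "EN_NO": "2", "DICD": "DX_B"},
]
medication_table = [
    {"PAID": "P001", "EN_NO": "1", "ODCD": "RX_X"},
]

diagnoses = index_by_key(diagnosis_table, "DICD")
prescriptions = index_by_key(medication_table, "ODCD")
```

A separator avoids ambiguous keys (for example, PAID "1" with EN_NO "23" versus PAID "12" with EN_NO "3"), which is why one is assumed here.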

From the diagnosis table 310, the computing device may extract a diagnosis code (DICD), the in-date (INDT) of each key, and the date of initial diagnosis (T_INDT) of each diagnosis code (DICD). The diagnosis table 310 may include diagnosis codes (DICD) assigned based on the Korean Classification of Diseases (KCD). A diagnosis code (DICD) may be a code indicating the list of diseases and symptoms that a patient presented with at each visit. The diagnosis code (DICD) (for example, the institutional diagnosis code of Asan Medical Center) may be a code extended from the KCD, which is itself based on the International Statistical Classification of Diseases and Related Health Problems (ICD). Diagnosis codes (DICD) mostly originate from source codes and may include codes driven by institutional needs, such as health insurance claims or service details (for example, emergency visits). For example, an entity node of the diagnosis type (hereinafter, a 'diagnosis node') may be an entity node indicating a diagnosis code (DICD) that indicates the content of a diagnosis.

From the medication table 320, the computing device may extract, for each key, a prescription code (ODCD) according to an element column (ELEM). The medication table 320 may include prescription codes (order codes, ODCD) assigned based on RxNorm and RxNorm Extension, together with ingredient names and the like. RxNorm and RxNorm Extension are drug terminologies; RxNorm is a US-specific terminology covering the drugs available on the US market and can also be used in personal health record applications. RxNorm is part of the Unified Medical Language System (UMLS) terminology and is maintained by the US National Library of Medicine. A prescription code (ODCD) may be a code indicating information about a drug administered to a patient who visited the hospital. For example, an entity node of the prescription type (hereinafter, a 'prescription node') may be an entity node indicating a prescription code (ODCD) that indicates the content of a prescription.

As described above, the prescription codes (ODCD) and diagnosis codes (DICD) may capture the patient's disease history together with the types of drugs administered in response to changes in the patient's condition during hospitalization or to the patient's characteristics.

According to an embodiment, the computing device may extract real world data 300 that includes, as entities, the prescription codes (ODCD) and diagnosis codes (DICD) linked to patient information (for example, the patient identifier (PAID) and patient encounter number (EN_NO) described above). The computing device may thus extract real world data 300 that reflects patient characteristics. For example, the computing device may generate graph-structured real world data 300 by connecting, through edges, a patient node indicating the extracted patient information to a prescription node indicating the prescription code (ODCD) extracted for that patient node and to a diagnosis node indicating the diagnosis code (DICD). In addition, the computing device may further connect, through an edge, an outcome node indicating a defined outcome 330 to the patient node.
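
The graph-structured real world data can be sketched as an undirected adjacency map. The record layout, the `(type, value)` node tuples, and the function name `build_rwd_graph` are assumptions made for illustration, not the format used in the embodiment.

```python
from collections import defaultdict

def build_rwd_graph(records):
    """Connect each patient node to its diagnosis, prescription, and outcome nodes."""
    graph = defaultdict(set)

    def add_edge(a, b):
        # Undirected edge: store both directions.
        graph[a].add(b)
        graph[b].add(a)

    for record in records:
        patient = ("patient", record["key"])
        for code in record.get("DICD", []):
            add_edge(patient, ("diagnosis", code))
        for code in record.get("ODCD", []):
            add_edge(patient, ("prescription", code))
        if record.get("outcome") is not None:
            add_edge(patient, ("outcome", record["outcome"]))
    return dict(graph)

# Hypothetical records; keys and codes are placeholders.
records = [
    {"key": "P001_1", "DICD": ["DX_A"], "ODCD": ["RX_X"], "outcome": "positive"},
    {"key": "P002_1", "DICD": ["DX_A"], "ODCD": []},
]
rwd_graph = build_rwd_graph(records)
```

Because both patients share the diagnosis node `("diagnosis", "DX_A")`, a random walk can move from one patient to the other through that node.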

In addition, for a particular use of the structure, the computing device may extract a column or table representing a defined outcome 330. For example, in FIG. 3, the defined outcome may include a result indicating whether the patient's condition improved (positive) and/or deteriorated (negative) following the prescription of the drug corresponding to the prescription code (ODCD). As another example, as a defined outcome for patient adjustment with respect to an anticancer drug, the computing device may extract, from the patient demographics table, the death date of a cancer registration patient (CDTH) for each patient identifier (PAID).

FIG. 4 illustrates biomedical data according to an embodiment.

According to an embodiment, the computing device may extract biomedical data 400 from a pre-built biomedical database. The biomedical database may include data whose relationships have been manually curated by experts. For example, the biomedical database may be the Comparative Toxicogenomics Database (CTD).

The CTD database is a comparative database in which interactions among toxicogenomic relationships are manually curated. The CTD database provides data on the associations between the chemical mechanisms of chemicals and genes, and on the relationships between drugs and diseases. The relationships between entity types that appear in each dataset of the CTD database may be relationships curated by experts. The CTD database may have been extended to cover not only the chemical toxicity of environmental chemicals but also the pharmacological activity of toxicogenomic substances. Furthermore, the curation of gene-related entities has been broadened to phenotypes, gene ontology, and pathways, which opens the possibility of extending to interdisciplinary research involving genes, proteins, metabolites, and their related reactions.

The disease identifier (Disease_ID) may be determined according to the "Diseases" branch of the Medical Subject Headings (MeSH) of the National Library of Medicine (NLM) under the US National Institutes of Health. For some genetic disorders not covered by MeSH, the disease identifier (Disease_ID) may be determined according to Online Mendelian Inheritance in Man (OMIM). For example, an entity node of the disease type (hereinafter, a 'disease node') may be an entity node indicating a disease identifier (Disease_ID) that indicates the kind of disease.

The gene information (Gene_Symbol) may include information on toxicologically important gene symbols or protein symbols from the gene database of the National Center for Biotechnology Information (NCBI). For example, an entity node of the gene type (hereinafter, a 'gene node') may be an entity node indicating gene information (Gene_Symbol) that indicates the kind of toxicologically important gene symbol and/or protein symbol.

The chemical identifier (Chemical_ID) may be determined based on the "Chemicals and Drugs" branch of NLM MeSH. The chemical identifier (Chemical_ID) may also be determined according to molecular reagents, environmental chemicals, and clinical drugs. For example, an entity node of the chemical type (hereinafter, a 'chemical node') may be an entity node indicating a chemical identifier (Chemical_ID) that indicates the kind of chemical.

The computing device may generate and extract the biomedical data 400 based on relationship data of the CTD database. The relationship data may include, for example, a chemical-disease dataset 410, a disease-gene dataset 420, and a chemicals dataset 430, and the biomedical data 400 may be generated and extracted based on these datasets. The interactions (for example, relationships) between chemicals and diseases in the chemical-disease dataset 410, between diseases and genes in the disease-gene dataset 420, and among chemicals in the chemicals dataset 430 may have been manually curated by experts based on published literature.

The chemical-disease dataset 410 is a dataset on interactions between chemicals and diseases. It may be data from which unnecessary environmental chemicals have been removed and from which the drug ingredients corresponding to the remaining chemicals, the diseases, and the chemical-disease relationships have been extracted. For reference, the ingredients of clinical drugs may be sifted by mapping between the chemical identifier (Chemical_ID) and the prescription information of the electronic medical record database.

The disease-gene dataset 420 may include, among the disease-gene interactions that reflect the mechanism of drugs acting through a reaction with a target protein or target gene, the curated disease-gene relationships for diseases present in the electronic medical record database. The computing device may extract, from the biomedical database, a predetermined number of pieces of gene information from among the plurality of pieces of gene information related to a given disease. Based on the inference score provided by the biomedical database, the computing device may extract the gene information related to the disease in descending order, starting from the gene information with the highest inference score for that disease, up to the predetermined number. For example, the computing device may extract the top 10 pieces of gene information based on the inference score. In a graph-structured network without node weights, every association (for example, relationship) is evaluated with the same probability during a random walk regardless of differences in importance, so the number of edges indicating associations between entities (for example, relationships between entity nodes) in the biomedical data 400 needs to be normalized. Accordingly, for each disease, the gene information connected by the 10 most important associations may be extracted in order to normalize importance, so that the probability of walking to important genes is standardized across the network.
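
The top-N selection by inference score can be sketched as below. The row layout `(disease_id, gene_symbol, inference_score)`, the identifiers, and the scores are hypothetical; only the idea of keeping the highest-scoring genes per disease comes from the description.

```python
import heapq
from collections import defaultdict

def top_genes_per_disease(rows, k=10):
    """Keep the k genes with the highest inference score for each disease."""
    by_disease = defaultdict(list)
    for disease_id, gene_symbol, score in rows:
        by_disease[disease_id].append((score, gene_symbol))
    return {
        disease_id: [gene for _, gene in heapq.nlargest(k, pairs)]
        for disease_id, pairs in by_disease.items()
    }

# Hypothetical rows; identifiers and scores are placeholders.
rows = [
    ("D1", "G1", 5.0),
    ("D1", "G2", 9.0),
    ("D1", "G3", 7.0),
    ("D2", "G1", 1.0),
]
selected = top_genes_per_disease(rows, k=2)
```

Capping every disease at the same number of gene edges is what keeps a uniform random walk from being dominated by diseases with very many low-importance gene associations.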

The inference score may be used to statistically evaluate the functional relationships between proteins in protein-protein interaction (PPI). The inference score may be used to pick the top 10 genes associated with each disease or chemical in chemical-gene and disease-gene associations. The inference score may be derived from the notion that biological networks inherently tend to form scale-free random networks. It may represent the similarity between the chemical-gene-disease network and a comparable scale-free random network, in order to assess the degree of atypical connectivity. Furthermore, a common neighbors approach may be used to compute the inference score, taking into account, for example, the number of related diseases or chemicals, the connectivity of the chemical or disease to genes, and the connectivity of the inferred chemical or disease itself.
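
The common neighbors idea can be illustrated in its simplest form: the genes linked to both a chemical and a disease. This is not the CTD inference score formula, which the description does not give; the helper `common_neighbors` and the node names are hypothetical.

```python
def common_neighbors(adjacency, a, b):
    """Return the nodes adjacent to both a and b."""
    return adjacency.get(a, set()) & adjacency.get(b, set())

# Hypothetical gene links for one chemical and one disease.
gene_links = {
    "chemical:C1": {"gene:G1", "gene:G2", "gene:G3"},
    "disease:D1": {"gene:G2", "gene:G3", "gene:G4"},
}
shared = common_neighbors(gene_links, "chemical:C1", "disease:D1")
```

A larger shared set suggests a stronger inferred chemical-disease association; an actual score would additionally weigh how connected each node is overall.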

The chemicals dataset 430 may also be referred to as a chemical vocabulary dataset, and may be used to reflect how a drug's response varies with differences in the functional groups attached to a parent structure. From the chemicals dataset 430, the computing device may extract the relationship between a chemical and its parent structure, for example, the parent structure identifier (Parent_ID) linked to a given chemical identifier (Chemical_ID). The chemical-parent structure relationship may include parent structure information that is linked to all descendants. Accordingly, by extracting the parent structure of each chemical and thereby connecting chemicals that differ in functional groups, the computing device can reflect in the biomedical data 400 the property that drugs with similar chemical structures have different effects because of functional group differences. For example, an entity node of the parent structure type (hereinafter, a 'parent structure node') may be an entity node indicating a parent structure identifier (Parent_ID) that indicates the kind of parent structure.
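
How a shared parent structure connects related chemicals can be sketched as follows. The `(Chemical_ID, Parent_ID)` pair layout, the placeholder identifiers, and both helper names are assumptions for illustration.

```python
from collections import defaultdict

def chemicals_sharing_parent(chemical_parent_rows):
    """Group chemical identifiers under each parent structure identifier."""
    by_parent = defaultdict(set)
    for chemical_id, parent_id in chemical_parent_rows:
        by_parent[parent_id].add(chemical_id)
    return dict(by_parent)

def parent_edges(chemical_parent_rows):
    """Build one chemical-to-parent-structure edge per row."""
    return {
        (("chemical", chemical_id), ("parent", parent_id))
        for chemical_id, parent_id in chemical_parent_rows
    }

# Hypothetical identifiers, for illustration only.
rows = [("CHEM_1", "PARENT_A"), ("CHEM_2", "PARENT_A"), ("CHEM_3", "PARENT_B")]
groups = chemicals_sharing_parent(rows)
edges = parent_edges(rows)
```

Here `CHEM_1` and `CHEM_2` become two hops apart through `PARENT_A`, which is how structurally similar drugs end up close together in the graph.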

In addition, the disease-gene dataset 420 and a chemical-gene dataset may be used to combine toxicogenomic relationships.

In the example shown in FIG. 4, the computing device may extract, from the biomedical database, biomedical data 400 that includes gene information (Gene_Symbol), a disease identifier (Disease_ID), a chemical identifier (Chemical_ID), and a parent structure identifier (Parent_ID) as entities. The extracted biomedical data 400 may include mutually corresponding pairs of gene information (Gene_Symbol) and disease identifier (Disease_ID), disease identifier (Disease_ID) and chemical identifier (Chemical_ID), and chemical identifier (Chemical_ID) and parent structure identifier (Parent_ID). For example, the computing device may connect, through an edge, a gene node indicating the extracted gene information (Gene_Symbol) to a disease node indicating the disease identifier (Disease_ID) corresponding to that gene information. Similarly, the computing device may connect, through edges, the disease node to a chemical node indicating the chemical identifier (Chemical_ID) corresponding to that disease identifier (Disease_ID), and the chemical node to a parent structure node indicating the parent structure identifier (Parent_ID) corresponding to that chemical identifier (Chemical_ID). Through these edge connections between nodes, the computing device may extract biomedical data 400 including the gene information (Gene_Symbol), disease identifier (Disease_ID), chemical identifier (Chemical_ID), and parent structure identifier (Parent_ID). In other words, the computing device may generate graph-structured biomedical data including gene nodes, disease nodes, chemical nodes, and parent structure nodes connected through edges. By way of example, a disease node may be connected through edges to a parent structure node by way of a chemical node, and a chemical node may be connected through edges to a gene node by way of a disease node. The connection process has been described in a particular order only for convenience of explanation, and the order is not limited to the one described. Depending on the design, the connections between nodes described above may be made in the reverse order or in any other order.

Because a patient's genomic data is absent from the EMR, the corresponding edge may not be created, and the related information may not be added to the biomedical database. Nevertheless, in the CTD database, a disease may be correlated with another disease, or a chemical with another chemical, through CTD's own statistical inference score derived from a statistical evaluation using the common neighbors approach.

FIGS. 5 to 7 illustrate the mapping between real world data and biomedical data according to an embodiment.

According to an embodiment, the computing device may construct a new biological network by connecting the diagnosis node 512 of the real world data 510 to a disease node of the biomedical data 530, and by connecting the prescription node 511 of the real world data 510 to a chemical node of the biomedical data 530. For example, the computing device may combine the real world data 510 and the biomedical data 530 by mapping the diagnosis node 512 and the disease node that correspond to the same concept unique identifier, based on a common data model (CDM) for biomedical terminology. Likewise, based on the common data model, the computing device may connect the prescription node 511 and the chemical node that correspond to the same concept unique identifier. In a real world database, for example a clinical dataset, the in-hospital codes (for example, the diagnosis code (DICD) and prescription code (ODCD) described above) are generated according to each hospital's own coding system. To apply in-hospital codes across various sources, the computing device may therefore perform mapping based on the common data model.
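
The joining step can be sketched as matching nodes from the two graphs that resolve to the same concept identifier. The node names, the placeholder identifiers such as `CUI_1`, and the function `bridge_edges` are hypothetical; how each node obtains its concept identifier is outside this sketch.

```python
from collections import defaultdict

def bridge_edges(rwd_node_to_cui, bio_node_to_cui):
    """Pair real world nodes with biomedical nodes that share a concept identifier."""
    bio_by_cui = defaultdict(list)
    for node, cui in sorted(bio_node_to_cui.items()):
        bio_by_cui[cui].append(node)

    edges = []
    for node, cui in sorted(rwd_node_to_cui.items()):
        # Nodes whose concept identifier has no counterpart stay unlinked.
        for bio_node in bio_by_cui.get(cui, []):
            edges.append((node, bio_node))
    return edges

# Hypothetical mappings; identifiers are placeholders, not real CUIs.
rwd_node_to_cui = {
    "diagnosis:DX_A": "CUI_1",
    "diagnosis:DX_B": "CUI_9",
    "prescription:RX_X": "CUI_2",
}
bio_node_to_cui = {"disease:D1": "CUI_1", "chemical:C1": "CUI_2"}
bridges = bridge_edges(rwd_node_to_cui, bio_node_to_cui)
```

The sketch does not itself enforce that diagnosis nodes pair only with disease nodes and prescription nodes only with chemical nodes; in practice the two mappings would be built per node type so that this holds.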

The common data model is a model concerning the thesaurus, ontology, and informatics of biomedical terminology, and may include, for example, the Unified Medical Language System (UMLS) 520, the Mondo Disease Ontology (Mondo), and Observational Health Data Sciences and Informatics (OHDSI).

The UMLS 520 may be a thesaurus spanning all biomedical domains, built to integrate vocabulary sources in an interoperable manner. The UMLS 520 assumes that the terminology string of a biomedical concept itself expresses a meaning and that synonyms exist for the same meaning. In the UMLS 520, different terms with the same meaning may be grouped under a single concept unique identifier (CUI). For example, within the same concept, synonyms and lexical variations, each assigned a different lexical unique identifier (LUI), may be grouped under the same concept unique identifier. Even within the same synonym, string variations (for example, uppercase versus lowercase) are given different string unique identifiers (SUI), and these string variations may also be grouped under the same concept unique identifier. The UMLS 520 Metathesaurus may be a large-scale thesaurus of biomedical concepts, consisting of the names of biomedical concepts organized under categories defined by its columns.

Mondo is an ontology that merges multiple disease terminology domains and aims to provide precise mappings across vocabulary sources. The UMLS 520, which provides concepts and their related synonyms, may allow one-to-many mappings. Mondo, in contrast, may provide one-to-one mappings using "K-means Bayesian-ontology inference" (kBoom). In Mondo, a probabilistic ontology is generated from source-provided mappings, and the mappings may be merged into the ontology by prioritizing the probabilities computed by kBoom. This high-quality one-to-one mapping may be useful in a network for random-walk-based graph embedding, because the number of links at a node can affect the quality of the embedding.

OHDSI may provide standardized observational research that enables the analysis of health data across organizations at a global level. The Observational Medical Outcomes Partnership (OMOP) CDM is a data model with a uniform structure and specification that can be applied to the differently structured medical data held by each medical institution, and it may be used to transform data from multi-institutional sources around the world. Because each country has its own national health authority that approves healthcare products, clinical drug products may differ from country to country. To standardize drugs globally, the OMOP CDM may provide an interoperable, standardized drug vocabulary called RxNorm Extension. In other words, OHDSI may be used to classify the drugs prescribed in individual countries into global concepts. RxNorm Extension may be a vocabulary that extends the concepts derived from RxNorm, the US-specific standard terminology. Using the OMOP CDM, it may be possible to convert drug data expressed in local institutional terms into global drug concepts.

According to an embodiment, a concept unique identifier (CUI) corresponding to the prescription node 511 of the real world data 510 may be obtained based on string mapping and element mapping. A concept unique identifier (CUI) corresponding to the diagnosis node 512 of the real world data 510 may be obtained based on string mapping and source mapping.

Hereinafter, an example using UMLS 520 as the common data model is mainly described, but embodiments are not limited thereto.

FIG. 6 illustrates an example of a mapping operation, using UMLS, between a diagnosis node indicating a diagnosis code (DICD) and a disease node indicating a disease identifier (Disease_ID).

According to an embodiment, the codes of the real world data (for example, the diagnosis code (DICD) and the prescription code (ODCD)) may be generated based on concept names of the common data model. When identifiers (for example, a synonym unique identifier (LUI) and a string unique identifier (SUI)) that correspond to different strings in the common data model refer to the same concept, those identifiers may be grouped under the same concept unique identifier (CUI). A concept identified by a single CUI may include strings denoting various lexically equivalent synonyms. A synonym unique identifier (LUI) may be assigned to each lexical variation, and a string unique identifier (SUI) may be assigned to each string variation of each lexical variation. For example, FIG. 6 shows a list of various strings having "C0026919" as the concept unique identifier (CUI). Accordingly, the computing device may perform concept mapping using the string unique identifiers in the string field of the common data model.

The computing device may extract a string corresponding to a code (for example, a diagnosis code (DICD) or a prescription code (ODCD)) indicated by an entity node (for example, a diagnosis node) of the real world data. The computing device may extract, from the common data model (for example, UMLS), a concept unique identifier (CUI) having a string that matches the string corresponding to the code of the real world data. For example, because the string is the smallest unit of UMLS, the computing device may by default search for the CUI on a per-string basis, but embodiments are not limited thereto. The computing device may search for a CUI that has a code matching the code of the real world data, or a string matching the string corresponding to that code. The computing device may determine that a string corresponding to a code of the real world data and/or an identifier of the biomedical data matches a string of the common data model only when the two strings are exactly identical. FIG. 6 mainly illustrates the mapping between the diagnosis code (DICD) and the disease identifier (Disease_ID).
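The exact-match rule described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the table layout, the name `find_cui`, and the second row of the table are assumptions, and only the CUI 'C0026919' and its string are taken from the description of FIG. 6.

```python
from typing import List, Optional, Tuple

# Hypothetical excerpt of a UMLS-like string table: (CUI, string).
# The first row uses the CUI and string cited for FIG. 6; the second row is a
# made-up entry that exists only to show a non-matching concept.
CONCEPT_STRINGS: List[Tuple[str, str]] = [
    ("C0026919", "Atypical mycobacterial infection NOS"),
    ("C_PLACEHOLDER", "placeholder string for another concept"),
]


def find_cui(code_string: str,
             table: List[Tuple[str, str]] = CONCEPT_STRINGS) -> Optional[str]:
    """Return the CUI whose string is exactly identical to code_string, else None."""
    for cui, string in table:
        if string == code_string:  # exact comparison only, no fuzzy matching
            return cui
    return None
```

Under this sketch a string that differs only in letter case is treated as a non-match, which mirrors the "exactly identical" condition stated above.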

The diagnosis code (DICD) may be based on the International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10), which is organized around a 'core' classification. ICD-10 may be a list of three-character categories, each of which may be further divided into subdivisions denoted by a numeric character after a decimal point. Epidemic diseases and severe diseases were selected as titles of the core classification and are mandatory for reporting, and up to ten subdivisions may be expressed by the digit after the decimal point. Subcategories may be used to specify the detailed type and variety of different sites or of a group of conditions. For example, the code 'A01.0' may be used for typhoid fever and 'A20.0' for bubonic plague. The code 'A31.9' shown in FIG. 6 may denote atypical mycobacterial infection, NOS (Not Otherwise Specified). Because physicians often do not record the details of a disease, a diagnosis code (DICD) may be recorded with only three characters; the digit after the decimal point may hierarchically represent an extended meaning for use by an institution (for example, a hospital).
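As a rough illustration of the code structure described above, the following sketch splits a diagnosis code into its three-character category and its optional subdivision digit. The pattern is a simplifying assumption (one letter, two digits, and at most one digit after the decimal point), and `split_icd10` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Simplified assumption: one letter + two digits, optionally '.' + one digit.
ICD10_PATTERN = re.compile(r"^([A-Z][0-9]{2})(?:\.([0-9]))?$")


def split_icd10(code: str) -> Tuple[str, Optional[str]]:
    """Split a code such as 'A31.9' into ('A31', '9'); 'A31' gives ('A31', None)."""
    match = ICD10_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised ICD-10 style code: {code!r}")
    return match.group(1), match.group(2)
```

A code recorded without a subdivision, as in the three-character case mentioned above, simply yields `None` for the second element.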

According to an embodiment, the computing device may extract a first string corresponding to the diagnosis code (DICD) indicated by the diagnosis node. The computing device may search the common data model for a first concept unique identifier (CUI) having the extracted first string. In the example shown in FIG. 6, the computing device may retrieve the CUI 'C0026919' from the column 631 of UMLS that contains the string 'Atypical mycobacterial infection NOS', which is identical to the string of the diagnosis code (DICD) 'A31.9' of the diagnosis node 610. The computing device may obtain various representations of the diagnosis code (DICD) by extracting other codes and strings having the same CUI. For example, the computing device may obtain information 632 corresponding to other string unique identifiers (SUIs) (for example, 'S0016794' and 'S0016795' shown in FIG. 6) and other synonym unique identifiers (LUIs) (for example, 'L0026919', 'L10114762', and 'L6525836' shown in FIG. 6) belonging to the same CUI. Based on UMLS MeSH codes, the computing device may determine the disease identifier (Disease_ID) corresponding to the SUIs and LUIs belonging to the same CUI. The computing device may map the diagnosis node to the disease node indicating the disease identifier (Disease_ID) corresponding to the retrieved first CUI. Accordingly, the computing device may map the disease nodes corresponding to synonyms and near-synonyms to a single diagnosis node. In addition, when the brand name of a prescribed drug is used as the concept name, the disease identifier (Disease_ID) corresponding to the LUIs and SUIs included in the CUI that corresponds to the brand name may be mapped.
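The expansion from one diagnosis code to every entry sharing its CUI, and from there to disease identifiers, might look like the sketch below. All rows are hypothetical: only the CUI 'C0026919', the code 'A31.9', and the first string come from the description of FIG. 6, while the LUI/SUI values, the source labels, and the MeSH code are placeholders. The sketch also matches on the code rather than on the string, which is one of the two options the description allows.

```python
from collections import defaultdict
from typing import List, Tuple

# Hypothetical rows: (CUI, LUI, SUI, source vocabulary, source code, string).
Row = Tuple[str, str, str, str, str, str]
ROWS: List[Row] = [
    ("C0026919", "L_a", "S_a", "ICD10", "A31.9", "Atypical mycobacterial infection NOS"),
    ("C0026919", "L_a", "S_b", "MSH", "D_PLACEHOLDER", "placeholder synonym 1"),
    ("C0026919", "L_b", "S_c", "MSH", "D_PLACEHOLDER", "placeholder synonym 2"),
    ("C_OTHER", "L_c", "S_d", "MSH", "D_OTHER", "unrelated concept"),
]


def disease_ids_for_diagnosis(dicd: str, rows: List[Row] = ROWS) -> List[str]:
    """Collect the MeSH codes of every row that shares a CUI with the diagnosis code."""
    by_cui = defaultdict(list)
    for row in rows:
        by_cui[row[0]].append(row)
    # CUIs that contain the diagnosis code itself.
    cuis = {row[0] for row in rows if row[3] == "ICD10" and row[4] == dicd}
    # MeSH codes attached to any lexical or string variant under those CUIs.
    mesh = {row[4] for cui in cuis for row in by_cui[cui] if row[3] == "MSH"}
    return sorted(mesh)
```

Because the lookup is keyed on the CUI, every synonym grouped under that concept contributes to the same diagnosis node.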

About 61% of the diagnosis codes (DICD) are mapped by MeSH terms; the remaining unmapped diagnosis codes may be mapped to disease identifiers (Disease_ID) based on source mapping. For example, for the diagnosis codes that remain after MeSH code mapping, the computing device may map the disease identifiers (Disease_ID) that are pre-mapped in fields defined by the Disease Ontology and Mondo. For example, string mapping may be insufficient for Bell's palsy; when source mapping is applied, the computing device may map it to codes carrying the concepts of facial paralysis, which is its symptom, and familial inheritance.
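The two-stage order described above (string mapping first, source mapping for whatever remains) can be sketched as a simple fallback. The two lookup tables and the identifier values in them are placeholders introduced for illustration; 'G51.0' is the ICD-10 code for Bell's palsy, but its target identifier here is not real data.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup tables: diagnosis code -> Disease_ID.
STRING_MAP: Dict[str, str] = {"A31.9": "DISEASE_ID_FROM_MESH"}
SOURCE_MAP: Dict[str, str] = {"G51.0": "DISEASE_ID_FROM_ONTOLOGY"}


def map_diagnosis(dicd: str,
                  string_map: Dict[str, str] = STRING_MAP,
                  source_map: Dict[str, str] = SOURCE_MAP) -> Tuple[Optional[str], str]:
    """Return (Disease_ID, method); string mapping takes priority over source mapping."""
    if dicd in string_map:
        return string_map[dicd], "string"
    if dicd in source_map:
        return source_map[dicd], "source"
    return None, "unmapped"
```

Returning the method alongside the identifier makes it easy to measure how much of the code set each stage covers.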

In RxNorm, on which the real world data is based, prescriptions are classified by dose, usage, ingredient, and the like, whereas in MeSH, on which the biomedical data is based, drugs with the same ingredient may share the same MeSH concept code. Because ingredient names therefore need to be converted, the computing device may use an OHDSI data source to obtain ingredient name information corresponding to the prescription code (ODCD) of the real world data and/or the chemical identifier (Chemical_ID) of the biomedical data.

According to an embodiment, the computing device may perform the mapping between the prescription code (ODCD) of the real world data and the chemical identifier (Chemical_ID) of the biomedical data in a manner similar to that described above. The prescription code (ODCD) may be generated based on RxNorm/RxNorm Extension. The RxNorm vocabulary of the prescription code (ODCD) may contain complex concepts because it is classified by dose, usage, ingredient, and the like. In MeSH, on the other hand, drugs with the same ingredient are indicated by the same MeSH concept code, so ingredient name conversion may be required. For reference, as shown in FIG. 7, the computing device may extract the ingredient names of RxNorm/RxNorm Extension through element mapping using the 'has_ingredient' relationship 700 of OHDSI.
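Element mapping through the 'has_ingredient' relationship can be pictured as a filter over relationship triples. The triple layout and every concept name below are hypothetical and chosen only to show that products differing in dose or form resolve to the same ingredient; the 'has_dose_form' relation is likewise an invented example of a relation that is ignored.

```python
from typing import List, Tuple

# Hypothetical (drug concept, relationship, target concept) triples.
Triple = Tuple[str, str, str]
RELATIONSHIPS: List[Triple] = [
    ("drug_a_500mg_tablet", "has_ingredient", "ingredient_a"),
    ("drug_a_250mg_capsule", "has_ingredient", "ingredient_a"),
    ("drug_b_10mg_tablet", "has_ingredient", "ingredient_b"),
    ("drug_b_10mg_tablet", "has_dose_form", "tablet"),
]


def ingredients_of(drug: str, relationships: List[Triple] = RELATIONSHIPS) -> List[str]:
    """Return the ingredient concepts linked to a drug by 'has_ingredient'."""
    return sorted({
        target
        for head, relation, target in relationships
        if head == drug and relation == "has_ingredient"
    })
```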

According to an embodiment, the computing device may extract a second string corresponding to the prescription code (ODCD) indicated by the prescription node. For example, the computing device may extract the ingredient name and the drug name of each drug according to its prescription code (ODCD). The computing device may preprocess the ingredient name and/or drug name obtained for the prescription code (for example, simple natural language processing (NLP), case conversion, repositioning of prepositions, whitespace correction, and handling of parentheses that contain explanations) to extract a second string converted to lowercase.
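A minimal sketch of part of this preprocessing is shown below, assuming that parenthesised explanations are simply dropped. It covers case conversion, whitespace correction, and parenthesis handling; the repositioning of prepositions is not implemented. The function name and the sample drug string are illustrative only.

```python
import re


def normalize_drug_name(name: str) -> str:
    """Lowercase a drug or ingredient name after basic clean-up."""
    text = re.sub(r"\([^)]*\)", " ", name)    # drop parenthesised explanations
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    return text.lower()                       # case conversion


print(normalize_drug_name("  Acetaminophen  (oral  tablet) 500 MG "))
# prints: acetaminophen 500 mg
```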

The computing device may search the common data model for a second concept unique identifier (CUI) corresponding to the extracted second string. For example, for ingredient names, the computing device may assign RxNorm codes to OHDSI RxNorm Extension entries based on the ingredient name and then map them to UMLS MeSH codes of the common data model. In other words, to an RxNorm Extension entry that has the same ingredient as an RxNorm entry but a different usage and dosage, the computing device may map the same MeSH code as that of the RxNorm entry. The computing device may obtain the second CUI of UMLS for each ingredient name (for example, the second string) by string-mapping the ingredient names of the prescription code (ODCD) to UMLS.

The computing device may map the prescription node to the chemical node indicating the chemical identifier corresponding to the retrieved second CUI. To find other representations of the ingredient name corresponding to the prescription code (ODCD), the computing device may extract, across the entire vocabulary in UMLS, the entity nodes having the same second CUI. In a manner similar to that described above, the computing device may combine the biomedical data and the real world data by mapping, to the prescription node, the chemical nodes corresponding to the LUIs and SUIs belonging to the same second CUI.

FIG. 8 illustrates a result of mapping between real world data and biomedical data according to an embodiment.

FIG. 8 shows new biomedical data, including real world data 810 and biomedical data 820, combined by the computing device according to an embodiment through the mapping described above with reference to FIGS. 5 to 7. The set of new biomedical data may constitute a new biological network, as described below with reference to FIGS. 9 and 10. The real world data 810, the biomedical data 820, the new biomedical data, and the new biological network may be implemented as graph structures.

As described above, the real world data 810 may include an outcome node 815, a patient identifier node 811, a patient encounter node 812, a prescription node 813, and a diagnosis node 814. The real world data 810 may include an edge connecting the outcome node 815 and the patient identifier node 811, an edge connecting the patient identifier node 811 and the patient encounter node 812, an edge connecting the patient encounter node 812 and the prescription node 813, and an edge connecting the patient encounter node 812 and the diagnosis node 814.

As described above, the biomedical data 820 may include a chemical node 821, a parent structure node 822, a disease node 823, and a gene node 824. The biomedical data 820 may include an edge connecting the chemical node 821 and the parent structure node 822, an edge connecting the chemical node 821 and the disease node 823, and an edge connecting the disease node 823 and the gene node 824.

The new biomedical data may include all of the entity nodes and edges of the real world data 810 and the biomedical data 820 described above. In addition, the new biomedical data may further include edges newly created by the mapping described above with reference to FIGS. 5 to 7. For example, the new biomedical data may include an edge connecting the prescription node 813 and the chemical node 821 and an edge connecting the diagnosis node 814 and the disease node 823. These two edges may prevent cyclic portions when the random walk data described below with reference to FIG. 11 is generated. A cycle refers to the case in which the start node of a random walk is the same as a node it passes through; if cycles occur frequently in random walks, the performance of the vector embedding model described below may degrade, so cycles need to be prevented.
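At the level of node types, the merge described above can be sketched with plain adjacency sets. The node labels are shorthand for the nodes named in this description (for example, "encounter" for the patient encounter node), and the helper names are assumptions. The sketch only demonstrates that the mapping edges are what connect the two data sets; it does not model individual patients or codes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]
RWD_EDGES: List[Edge] = [("outcome", "patient_id"), ("patient_id", "encounter"),
                         ("encounter", "prescription"), ("encounter", "diagnosis")]
BIO_EDGES: List[Edge] = [("chemical", "parent_structure"), ("chemical", "disease"),
                         ("disease", "gene")]
MAPPING_EDGES: List[Edge] = [("prescription", "chemical"), ("diagnosis", "disease")]


def build_graph(*edge_lists: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected, unweighted adjacency map from several edge lists."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for edges in edge_lists:
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def reachable(adj: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return every node type that can be reached from start."""
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adj[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen
```

Without `MAPPING_EDGES`, a walk that starts at a real world node can never reach a gene or chemical node; with them, all nine node types fall into one connected network.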

However, the new biological data is not limited to that shown in FIG. 8; the patient nodes (for example, the patient identifier node 811 and the patient encounter node 812) may be excluded after the vector embedding described below with reference to FIG. 11.

FIGS. 9 and 10 illustrate a new merged network generated according to an embodiment.

FIG. 8 described the new biomedical data generated by combining heterogeneous data, and FIG. 9 describes the connections between biomedical data. As shown in FIG. 9, first real world data 911 may be connected to first biomedical data 921. The first biomedical data 921 may also be connected to second real world data 912 through a disease identifier (Disease_ID) and a chemical identifier (Chemical_ID), and to third biomedical data 923 through a parent structure. The third biomedical data 923 may be connected to third real world data 913 through a disease identifier (Disease_ID) and a chemical identifier (Chemical_ID). As shown in FIG. 9, the computing device may construct a new biological network by connecting a plurality of entity nodes to a single entity node through edges.

FIG. 10 illustrates an example of the newly constructed biological network 1000. For example, the numbers of entity nodes and edges of the newly constructed biological network 1000 may be as shown in Table 1 below.

Table 1

Datasets   Entities and relationships   Name                       Count
EMR        Entities                     PAID                       360409
EMR        Entities                     KEY                        14967169
EMR        Entities                     DICD                       1025
EMR        Entities                     ODCD                       4798
EMR        Relationships (edge)         PAID - KEY                 14967169
EMR        Relationships (edge)         KEY - ODCD                 41730466
EMR        Relationships (edge)         KEY - DICD                 22219689
EMR        Relationships (edge)         KEY - CDTH                 360409
CTD        Entities                     DiseaseID                  7643
CTD        Entities                     ChemicalID                 1805
CTD        Entities                     GeneSymbol                 1496
CTD        Entities                     ParentID                   733
CTD        Relationships (edge)         DiseaseID - ChemicalID     28580
CTD        Relationships (edge)         DiseaseID - GeneSymbol     4361
CTD        Relationships (edge)         ChemicalID - ParentID      1805
CTD        Relationships (edge)         ChemicalID - GeneSymbol    282409
Mapping    Relationships (edge)         ODCD - ChemicalID          6633
Mapping    Relationships (edge)         DICD - DiseaseID           7989

FIG. 11 illustrates training of a machine learning model and prediction of biomedical association using the trained machine learning model according to an embodiment.

According to an embodiment, the computing device may train a machine learning model using random walk data 1120 generated based on random walks from the constructed new biological network.

The computing device may generate the random walk data 1120 based on random walks from the new biological network constructed in FIG. 10. In this specification, the graph of the new biological network integrated with real world data (for example, electronic medical record data) may be defined as G = (ν, ε). Here, the node set ν, as the new biological data, may represent a set of nine heterogeneous types of entity nodes 1110 in total: entity nodes 1110 of the five entity types of the real world data and entity nodes 1110 of the four entity types of the biomedical data. For example, the entity node set of the first entity type may be expressed as ν_1 = {ν_11, ..., ν_1n}, and so on up to the entity node set of the ninth entity type, ν_9 = {ν_91, ..., ν_9n}. The node set ν may include a total of V entity nodes, where n and V may be integers greater than or equal to 1. The entity node sets may be generated from an adjacency matrix representing the adjacency between entities. For example, the rows and columns of the adjacency matrix represent the i-th entity and the j-th entity, respectively, and the value of the element in the i-th row and j-th column may indicate the adjacency (for example, whether they are adjacent) between the i-th entity and the j-th entity. For example, ν_ij may denote the entity node 1110 of the j-th instance of the i-th entity type. An edge in ε connects entity nodes 1110 ν_ij and may represent an association and/or relationship in the data set without weight or direction. The edge set may be expressed as ε ⊆ ν × ν.
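The adjacency matrix described above, for an undirected and unweighted graph, can be sketched as follows. The node labels and the helper name `adjacency_matrix` are placeholders for illustration.

```python
from typing import List, Sequence, Tuple


def adjacency_matrix(nodes: Sequence[str],
                     edges: Sequence[Tuple[str, str]]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix; entry [i][j] is 1 when nodes i and j share an edge."""
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # no direction: mirror the entry
    return matrix
```

Because the edges carry neither weight nor direction, the matrix is symmetric and contains only zeros and ones.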

The random walk data 1120 may be data representing a random sequence based on random walks between neighboring entity nodes. A random walk on a graph-structured data network may denote the operation of starting from an arbitrary entity node 1110 of the data network and selecting and extracting a randomly chosen entity node 1110 from among the one or more entity nodes 1110 connected to it by an edge. The length of the random walk data 1120 may be determined by the random walk length, that is, the number of random walk steps performed. The direction and probability of a random walk may be determined by the direction and weight of the edges; as described above, for convenience of description this specification assumes that the edges have no direction and that all weights are equal. The random walk data 1120 may also be referred to as a random walk input.

The random walk input is data generated to capture topological information of the graph network and may also be referred to as second-order random walk data. The similarity between entity nodes included in the new biological network may be defined as the probability of occurrence of a neighboring entity node 1110 given a feature representation, for example, an embedding vector. The probability of occurrence of a neighboring entity node 1110 may also be interpreted as the probability that an edge exists between neighboring entity nodes 1110. The computing device may generate a random walk set N_S(u) for graph embedding. Starting from an initial node u, where u ∈ ν, the computing device may generate a walk to neighboring nodes over a predetermined random walk length using a random walk strategy S. As described below, within the random walk set N_S(u) represented by the random walk input, the probability of occurrence of neighboring nodes may be estimated based on provisionally output embedding vectors, and the machine learning model 1140 described below may be trained by optimizing for similarity in the embedding space.

For example, the computing device may select an initial entity node u in the constructed new biological network. The computing device may sequentially perform random walks from the selected initial entity node u up to the predetermined random walk length, selecting an entity node at each individual random walk. The computing device may generate random walk data 1120 indicating the random walk set N_S(u), which includes the initial entity node u and the entity nodes selected by the random walks.
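The walk generation described above can be sketched as below, assuming undirected edges with equal weights so that each neighbor is chosen uniformly. Here the walk length is taken to be the number of nodes in the resulting sequence, including the initial node u; the small graph `ADJ` is a made-up example.

```python
import random
from typing import Dict, List, Set

# Hypothetical toy graph: node -> set of neighbors.
ADJ: Dict[str, Set[str]] = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}


def random_walk(adj: Dict[str, Set[str]], start: str, walk_length: int,
                rng: random.Random) -> List[str]:
    """Return a walk of up to walk_length nodes, beginning with start."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = sorted(adj[walk[-1]])  # sorted for reproducibility with a seed
        if not neighbors:
            break  # isolated node: the walk cannot continue
        walk.append(rng.choice(neighbors))  # uniform choice among neighbors
    return walk


walk = random_walk(ADJ, "a", 10, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the walks reproducible, which is convenient when the same walk set has to be regenerated for training.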

The random walk length may be greater than the total number of entity types of the entity nodes constituting the new biological network. For example, FIG. 11 describes an example in which the random walk length is 10. In generating random walks, the walk length may be set to 10, and 380 different combinations of entity types may be generated. For example, random walk sets N_S(u) ranging from [ν_1n, ν_2n, ν_1n, ν_2n, ν_1n, ν_2n, ν_1n, ν_2n, ν_1n, ν_2n] to [ν_9n, ν_8n, ν_9n, ν_8n, ν_9n, ν_8n, ν_5n, ν_7n, ν_9n, ν_8n] may be generated. As another example, in generating random walks, approximately four entity types and 40 combinations may be generated for each entity node 1110.

For reference, the return parameter p and the in-out parameter q may be determined to guide the search for sampling strategies such as breadth-first sampling (BFS) and depth-first sampling (DFS). To add a search bias to the transition probability of an edge, the previous node t is remembered, and the search bias may be computed depending on the position relative to node t. For a next node x, the shortest-path distance d_tx between node t and node x is restricted to one of {0, 1, 2}, and the search bias α_pq(t, x) on the edge (v, x), where v is the current node, may be computed. The search bias α_pq(t, x) may be determined as in Equation 1 below.
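Equation 1 itself is not reproduced in this text. Assuming it follows the standard node2vec definition of the search bias, which agrees with the roles of p and q and the values of d_tx given here, it could be written as:

```latex
\alpha_{pq}(t, x) =
\begin{cases}
\dfrac{1}{p} & \text{if } d_{tx} = 0 \\[4pt]
1            & \text{if } d_{tx} = 1 \\[4pt]
\dfrac{1}{q} & \text{if } d_{tx} = 2
\end{cases}
```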

The reciprocal of p may be applied when the walk moves closer to the previous node, and the reciprocal of q when it moves away from the previous node. In Equation 1 above, the new biological network has, by way of example, edges with the return parameter p = 1 and the in-out parameter q = 1, so the search bias applied to the transition probability may be set to α_pq(t, x) = 1 regardless of d_tx.
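The effect of p = q = 1 can be checked with a short sketch. It assumes the standard node2vec form of the bias (1/p for d_tx = 0, 1 for d_tx = 1, 1/q for d_tx = 2) and normalises the biases of the candidate next nodes into transition probabilities; the function names and the candidate labels are illustrative.

```python
from typing import Dict


def search_bias(d_tx: int, p: float = 1.0, q: float = 1.0) -> float:
    """Search bias for a candidate next node at shortest-path distance d_tx from t."""
    if d_tx == 0:
        return 1.0 / p  # returning to the previous node
    if d_tx == 1:
        return 1.0      # staying at the same distance from the previous node
    if d_tx == 2:
        return 1.0 / q  # moving away from the previous node
    raise ValueError("d_tx must be 0, 1, or 2")


def transition_probabilities(candidate_distances: Dict[str, int],
                             p: float = 1.0, q: float = 1.0) -> Dict[str, float]:
    """Normalise the biases of all candidate next nodes (equal edge weights assumed)."""
    weights = {x: search_bias(d, p, q) for x, d in candidate_distances.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}
```

With p = q = 1 every candidate receives the same probability, so the walk reduces to a uniform random walk over the neighbors of the current node.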

The computing device may train the machine learning model 1140 using the random walk input described above. The machine learning model 1140 is a model designed and trained to project each entity node 1110 indicated by the random walk input onto an embedding vector in a latent space and may include, for example, a neural network model. This specification mainly describes a shallow neural network including one hidden layer, but embodiments are not limited thereto. The hidden layer of the machine learning model 1140 performs an operation called a lookup table and may also be referred to as a projection layer. The machine learning model 1140 may be designed with a skip-gram structure using negative sampling. However, the structure of the machine learning model 1140 is not limited thereto, and it may be implemented with various structures such as a convolutional neural network (CNN) and a recurrent neural network (RNN). In addition, the machine learning model 1140 may include a graph embedding model that embeds graph-structured data, for example, DeepWalk, node2vec, or GCN.
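For orientation, the objective commonly used with a skip-gram structure and negative sampling can be sketched for a single (center, context) pair as follows. This is the generic textbook form, given here as an assumption; the description does not specify the exact loss, and the vectors and function names are illustrative.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center: Sequence[float], context: Sequence[float],
              negatives: List[Sequence[float]]) -> float:
    """Negative-sampling loss: pull the true context closer, push sampled negatives away."""
    loss = -math.log(sigmoid(dot(center, context)))
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(center, negative)))
    return loss
```

The loss is small when the center embedding is similar to the embedding of a node that actually appears near it in a walk and dissimilar to the embeddings of the sampled negative nodes.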

As shown in FIG. 11, the computing device may input, into the machine learning model 1140, input data 1130 corresponding to each entity node 1110 of the random walk input. The computing device may input a one-hot vector corresponding to each of the plurality of entity nodes included in the random walk input into the machine learning model 1140 as the input data 1130. FIG. 11 is described using, as an example, a random walk input whose random walk set N_S(u) is [ν_9n, ν_8n, ν_9n, ν_8n, ν_9n, ν_8n, ν_5n, ν_7n, ν_9n, ν_8n]. Illustratively, the dimension of the input data 1130 (for example, a one-hot vector) input into the machine learning model 1140 may equal the number of entity nodes of the new biological network 1000 generated in FIG. 10 described above. When the total number of entity nodes constituting the new biological network 1000 is V, the dimension of the one-hot vector serving as the input data 1130 may also be V. Here, V may be an integer of 2 or greater. Based on the random walk input, the computing device may generate a one-hot vector indicating each entity node 1110 included in the random walk input. A one-hot vector indicating one entity node 1110 is a vector in which the element corresponding to that entity node 1110 has the value '1' and the elements corresponding to all other nodes have the value '0'. In FIG. 11, for the random walk set N_S(u) of the random walk input, the computing device may generate, as the input data 1130, a one-hot vector in which the element corresponding to entity node ν_9n is '1' and the remaining elements are '0', a one-hot vector in which only the element corresponding to entity node ν_8n is '1', a one-hot vector in which only the element corresponding to entity node ν_5n is '1', and so on. The computing device may train the machine learning model 1140 using the generated one-hot vectors.
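The one-hot encoding of a random walk can be sketched as follows. This is a minimal illustration, assuming a toy network with only four entity nodes (V = 4); the node labels, the `node_index` mapping, and the `one_hot` helper are hypothetical names introduced for the example.

```python
from typing import List


def one_hot(index: int, num_nodes: int) -> List[int]:
    """Return a V-dimensional vector with a single 1 at the given node index."""
    if not 0 <= index < num_nodes:
        raise IndexError("node index out of range")
    vec = [0] * num_nodes
    vec[index] = 1
    return vec


# Hypothetical toy network: only the nodes that appear in the FIG. 11 walk.
node_ids = ["v5", "v7", "v8", "v9"]
node_index = {node: i for i, node in enumerate(node_ids)}

# The random walk set used as the example in FIG. 11.
walk = ["v9", "v8", "v9", "v8", "v9", "v8", "v5", "v7", "v9", "v8"]

# One V-dimensional one-hot vector per visited entity node.
inputs = [one_hot(node_index[node], len(node_ids)) for node in walk]
```

In a real network V would be the total number of entity nodes, so in practice an integer index is usually stored instead of materializing the full vector.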

When the machine learning model 1140 includes a projection layer f, a parameter of the model, for example the weight matrix W between the input layer and the projection layer f, may be a V×d matrix. Here, d may be an integer of 1 or greater. In other words, the computing device may generate a d-dimensional embedding vector by projecting a V-dimensional one-hot vector onto d dimensions through the projection layer f of the machine learning model 1140. For example, the computing device may use the machine learning model 1140 to generate output data f(v_9n), f(v_8n), and f(v_5n) from the one-hot vectors corresponding to the random walk set N_S(u). As described above, the output data f(v_9n), f(v_8n), and f(v_5n) are d-dimensional embedding vectors and may represent the vector coordinates of each entity node 1110 in the d-dimensional latent space.
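The following sketch shows why the projection layer acts as a lookup table: multiplying a V-dimensional one-hot vector by a V×d weight matrix W simply selects one row of W. The helper names `init_weights`, `project`, and `lookup` and the random initialization range are assumptions made for illustration.

```python
import random


def init_weights(num_nodes, dim, seed=0):
    """Create a V x d weight matrix with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_nodes)]


def project(one_hot_vec, W):
    """Multiply a 1 x V one-hot vector by the V x d matrix W."""
    dim = len(W[0])
    return [
        sum(one_hot_vec[k] * W[k][j] for k in range(len(W)))
        for j in range(dim)
    ]


def lookup(index, W):
    """Equivalent shortcut: read the row of W for the given node index."""
    return list(W[index])


W = init_weights(num_nodes=4, dim=3)
embedding = project([0, 0, 1, 0], W)  # same values as lookup(2, W)
```

Because only one element of the input is non-zero, the full matrix product is never needed; reading row k of W gives the embedding of the k-th entity node directly.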

As described above, the machine learning model 1140 is a skip-gram model and may learn sequences based on random walks so as to output, for each entity node 1110, an embedded representation for calculating relationships between entity nodes. Learning and/or training of the machine learning model 1140 may be performed based on an objective function value 1109 described below. The graph-structured data G corresponding to the biological network 1000 described above with reference to FIG. 10 may be expressed as G = (ν, ε). The projection layer f is a mapping function and may be expressed as f: v → R^d. Here, f may be represented by the parameters of a V×d weight matrix that maps each node v into a d-dimensional vector space (for example, a latent space and/or an embedding space). The probability of observing a node, in other words the probability that an edge connecting two nodes exists, may be calculated by solving the maximization problem of Equation 2 below.

Equation 2 above may be interpreted as the objective function value 1109 that maximizes the connection probability between the entity nodes included in the random walk set N_S(u). To simplify Equation 2, it may be assumed that the occurrence probability of each node is independent of the occurrence probabilities of the other nodes, and that the two nodes of a node pair have symmetric, equal effects on each other. Under these assumptions and an approximation using negative sampling, the objective function value 1109 of Equation 2 may be simplified into a dot product of features, as in Equation 3 below.
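The equation images are not reproduced in this text. Assuming that Equations 2 and 3 follow the standard node2vec skip-gram formulation, which is consistent with the independence and symmetry assumptions stated here, they might be written as follows. The partition term Z_u and the symbols u and n_i are taken from that standard formulation and are assumptions rather than a transcription of the original figures.

```latex
% Equation 2 (assumed form): maximize the log-probability of observing
% the random walk neighborhood N_S(u) of each node u given its embedding f(u).
\max_{f} \sum_{u \in V} \log \Pr\bigl(N_S(u) \mid f(u)\bigr)

% Equation 3 (assumed form): with conditional independence and a symmetric
% softmax over dot products of features, the objective simplifies to
\max_{f} \sum_{u \in V} \Bigl[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Bigr],
\qquad Z_u = \sum_{v \in V} \exp\bigl(f(u) \cdot f(v)\bigr)
```

In this form the partition term Z_u is expensive to compute over all V nodes, which is why it is approximated with negative sampling.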

Equation 3 above may be used to optimize the parameters of f by stochastic gradient ascent, which maximizes Equation 2 representing the similarity of nodes in the embedding space and/or the probability that an edge exists between nodes.

As described above, the computing device may calculate temporary embedding data by applying the machine learning model 1140 to the input data 1130 based on the random walk data 1120. The temporary embedding data may represent an embedding vector extracted from the input data 1130 (for example, a one-hot vector indicating an arbitrary entity node 1110) using the machine learning model 1140 before training is complete, in other words a temporary model. Based on the calculated temporary embedding data, the computing device may update the parameters of the machine learning model 1140 so that the connection probability between the entity nodes in the random walk set N_S(u) is maximized. Accordingly, by training the machine learning model 1140 through random walks on the edge relationships established between entity nodes in the new biological network 1000, the computing device may update the parameters of the machine learning model 1140 so that the vector coordinates in the latent space of entity nodes connected to each other through edges are output adjacent to each other. Here, the projection layer f of the machine learning model 1140 may be a V×d weight matrix, and the row vector corresponding to each row of the weight matrix may be the embedding vector representing the corresponding entity node 1110. In other words, the k-th row vector among the V row vectors of the weight matrix may be the embedding vector representing the k-th entity node 1110 among the V entity nodes of the new biological network 1000. Here, k may be an integer from 1 to V inclusive.
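The parameter update can be sketched as one skip-gram step with negative sampling. This is a simplified illustration and not the patented training procedure: the separate context matrix `C`, the uniform sampling of negatives, the window size, and the learning rate are all assumptions borrowed from common word2vec-style implementations.

```python
import math
import random


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_step(W, C, center, context, negatives, lr=0.05):
    """One gradient-ascent step that raises the score of (center, context)
    and lowers the score of (center, negative) pairs."""
    w = W[center]
    grad_w = [0.0] * len(w)
    pairs = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    for node, label in pairs:
        c = C[node]
        g = lr * (label - sigmoid(dot(w, c)))
        for j in range(len(w)):
            grad_w[j] += g * c[j]   # accumulate the update for the center row
            c[j] += g * w[j]        # update the context row immediately
    for j in range(len(w)):
        w[j] += grad_w[j]


def train(walks, V, d=8, window=2, num_neg=3, epochs=5, seed=0):
    """Train on walks given as lists of integer node indices; return W (V x d)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) / d for _ in range(d)] for _ in range(V)]
    C = [[0.0] * d for _ in range(V)]
    for _ in range(epochs):
        for walk in walks:
            for i, center in enumerate(walk):
                lo, hi = max(0, i - window), min(len(walk), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    negatives = [rng.randrange(V) for _ in range(num_neg)]
                    sgns_step(W, C, center, walk[j], negatives)
    return W
```

After training, row k of the returned matrix plays the role of the embedding vector of the k-th entity node, so nodes that co-occur in walks tend to end up with similar rows.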

The computing device may verify the performance of the machine learning model 1140 by using it to predict links on a version of the new biological network 1000 from which a certain proportion (for example, 10%) of the edges has been randomly removed.
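A hold-out check of this kind can be sketched as follows. The source does not state which evaluation metric is used, so the recall over held-out edges computed here is an assumption, as are the helper names `split_edges` and `recall_on_heldout` and the `predict_edge` callback.

```python
import random


def split_edges(edges, holdout_ratio=0.1, seed=0):
    """Randomly remove a proportion of edges; return (kept, held_out)."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_ratio))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def recall_on_heldout(held_out, predict_edge):
    """Fraction of removed edges that the model predicts as existing."""
    hits = sum(1 for u, v in held_out if predict_edge(u, v))
    return hits / len(held_out)


# Toy example: a chain of 21 nodes with 20 edges.
edges = [(i, i + 1) for i in range(20)]
kept, held_out = split_edges(edges, holdout_ratio=0.1)
```

The model would be trained on random walks over `kept` only, and `predict_edge` would then be the trained model's link decision for each pair in `held_out`. A fuller evaluation would also score node pairs that were never connected, to measure false positives.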

FIG. 12 illustrates embedding representation vectors corresponding to individual entity nodes embedded in a latent space, according to an embodiment.

Searching for a new indication for a drug may be referred to as drug repositioning or new drug repositioning. A new indication for a drug may mean an indication (for example, a disease) that differs from the existing indication of a marketed drug, or the selection of a new target patient group for a drug that failed in a clinical trial. Compared with the conventional new drug development process, drug repositioning has the advantages of lower cost, a lower failure rate, and a faster path to market.

In the new biological network 1000 constructed in FIG. 10 described above, entity nodes may be connected through edges according to their relevance. In particular, a chemical node connected to a disease node may be interpreted as meaning that the chemical corresponding to the chemical node has an indication for the disease corresponding to the disease node. According to an embodiment, in addition to direct edges between chemical nodes and disease nodes, the new biological network 1000 includes edges representing direct and indirect connections with other entity nodes such as parent structure nodes, prescription nodes, and diagnosis nodes. Accordingly, the machine learning model 1140 described above with reference to FIG. 11 may be a model designed and trained to predict undiscovered missing edges in the new biological network 1000.

According to an embodiment, the embedding vector corresponding to each of the entity nodes of the new biological network 1000 may be mapped to vector coordinates in the latent space 1200. An edge may be determined to exist between entity nodes that are adjacent to each other in the latent space 1200. For example, in FIG. 12, when the distance between a first entity node 1210 and a second entity node 1220 in the latent space 1200 is less than a threshold distance, the computing device may determine that an edge 1290 exists between the first entity node 1210 and the second entity node 1220. Accordingly, in addition to the explicit edges identified from the pre-built databases in the new biological network 1000, the computing device may search for potential edges according to the vector distance between the embedding vectors corresponding to entity nodes in the latent space 1200.

For example, the computing device may extract, from the machine learning model, a first embedding vector corresponding to a first node (for example, a target disease node) among the entity nodes of the new biological network. The computing device may extract, from the machine learning model, a second embedding vector corresponding to a second node (for example, a target chemical node) among the entity nodes of the new biological network. Based on the extracted first embedding vector and second embedding vector, the computing device may determine whether an edge exists between the first node and the second node. When the difference between the embedding vectors is small, that is, when the embedding vectors are similar, a potential edge and/or link may be predicted to exist between the two corresponding nodes. In response to the vector distance between the extracted first embedding vector and second embedding vector being less than a threshold distance value, the computing device may determine that an edge exists between the first node and the second node. Here, the vector distance between the two embedding vectors may be a Euclidean distance, such as the L2 norm. As described above, even when no link exists in the pre-built biological network, the computing device may predict a biomedical association by predicting a potential link based on the distance between embedding vectors in the latent space.
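The threshold decision described above can be sketched in a few lines. The function names and the example vectors are hypothetical, and the threshold value would have to be chosen empirically; the source does not specify one.

```python
import math


def euclidean_distance(a, b):
    """L2 distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edge_exists(first_embedding, second_embedding, threshold):
    """Predict a link when the embeddings are closer than the threshold."""
    return euclidean_distance(first_embedding, second_embedding) < threshold


# Hypothetical embeddings of a disease node and a chemical node.
disease_vec = [0.0, 0.0]
chemical_vec = [3.0, 4.0]
```

For these two vectors the distance is 5, so a threshold of 5 predicts no edge (the comparison is strict) while any larger threshold predicts an edge.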

For reference, in the example above, when the first node and the second node are a disease and a drug, respectively, the computing device performs drug repositioning. When the first node and the second node are both drugs, the computing device performs drug-drug interaction prediction. When the first node and the second node are both diseases, the computing device performs a disease-disease association search. When the first node and the second node are a patient and a disease, respectively, the computing device predicts the onset of a new disease in the target patient.

For example, the computing device may search for chemical nodes whose distance from the first embedding vector, which corresponds to the disease node indicating an arbitrary target disease, is less than the threshold distance value. In this way, drugs having an effect on the target disease may be found. As another example, the computing device may search for disease nodes whose distance from the second embedding vector, which corresponds to the chemical node indicating an arbitrary target chemical, is less than the threshold distance value. In this way, diseases corresponding to new indications for the target chemical may be found.
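Such a search can be sketched as a filtered nearest-neighbor query over the embedding table. The node labels, the `node_types` dictionary, the function name `nearest_of_type`, and the toy two-dimensional embeddings are all hypothetical and serve only to illustrate the idea.

```python
import math


def nearest_of_type(target, embeddings, node_types, wanted_type, threshold):
    """Return (node, distance) pairs of the wanted type that lie closer to the
    target node than the threshold, sorted from nearest to farthest."""
    origin = embeddings[target]
    hits = []
    for node, vec in embeddings.items():
        if node == target or node_types[node] != wanted_type:
            continue
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(origin, vec)))
        if dist < threshold:
            hits.append((node, dist))
    return sorted(hits, key=lambda item: item[1])


embeddings = {
    "disease:D1": [0.0, 0.0],
    "disease:D2": [0.05, 0.0],
    "chemical:C1": [0.1, 0.0],
    "chemical:C2": [2.0, 2.0],
}
node_types = {
    "disease:D1": "disease",
    "disease:D2": "disease",
    "chemical:C1": "chemical",
    "chemical:C2": "chemical",
}

# Candidate drugs for disease D1, then candidate indications for chemical C1.
drug_candidates = nearest_of_type("disease:D1", embeddings, node_types, "chemical", 0.5)
indication_candidates = nearest_of_type("chemical:C1", embeddings, node_types, "disease", 0.5)
```

The same function covers both directions of the search: passing a disease node with `wanted_type="chemical"` looks for candidate drugs, and passing a chemical node with `wanted_type="disease"` looks for candidate indications.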

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method for predicting biomedical association, performed by a computing device, the method comprising:
extracting real world data including a diagnosis code and a prescription code linked to patient information from a real world database;
extracting biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier from a biomedical database;
constructing a new biological network by combining the real world data and the biomedical data;
training a machine learning model from the constructed new biological network; and
predicting a link between entity nodes based on embedding vectors of the entity nodes of the new biological network, using the trained machine learning model.

The method of claim 1, wherein extracting the real world data comprises:
generating the real world data in a graph structure by connecting, through edges, a patient node indicating the patient information to a prescription node indicating a prescription code extracted for the patient information and to a diagnosis node indicating a diagnosis code.

The method of claim 1, wherein extracting the biomedical data comprises:
connecting, through an edge, a gene node indicating the genetic information to a disease node indicating a disease identifier corresponding to the genetic information;
connecting, through an edge, the disease node to a chemical node indicating a chemical identifier corresponding to the disease identifier;
connecting, through an edge, the chemical node to a parent structure node indicating a parent structure identifier corresponding to the chemical identifier; and
generating graph-structured biomedical data including the gene node, the disease node, the chemical node, and the parent structure node connected through the edges.

The method of claim 3, wherein, in the biomedical data, the disease node is connected through edges to the parent structure node via the chemical node, and the chemical node is connected through edges to the gene node via the disease node.

The method of claim 1, wherein constructing the new biological network comprises:
constructing the new biological network by connecting a diagnosis node of the real world data to a disease node of the biomedical data, and connecting a prescription node of the real world data to a chemical node of the biomedical data.

The method of claim 5, wherein constructing the new biological network comprises:
mapping the diagnosis node and the disease node that correspond to the same concept unique identifier based on a common data model of biomedical terminology; and
connecting the prescription node and the chemical node that correspond to the same concept unique identifier based on the common data model.

The method of claim 5, wherein constructing the new biological network comprises:
extracting a first character string corresponding to the diagnosis code indicated by the diagnosis node;
retrieving, from a common data model, a first concept unique identifier having the extracted first character string; and
mapping the diagnosis node to the disease node indicating a disease identifier corresponding to the retrieved first concept unique identifier.

The method of claim 5, wherein constructing the new biological network comprises:
extracting a second character string corresponding to the prescription code indicated by the prescription node;
retrieving, from a common data model, a second concept unique identifier corresponding to the extracted second character string; and
mapping the prescription node to the chemical node indicating a chemical identifier corresponding to the retrieved second concept unique identifier.

The method of claim 1, wherein training the machine learning model comprises:
selecting an initial entity node from the constructed new biological network;
selecting an entity node for each individual random walk step by sequentially performing random walks from the selected initial entity node up to a predetermined random walk length; and
generating random walk data indicating a random walk set that includes the initial entity node and the entity nodes selected based on the random walks.

The method of claim 9, wherein training the machine learning model further comprises:
calculating temporary embedding data by applying the machine learning model to input data based on the random walk data; and
updating parameters of the machine learning model, based on the calculated temporary embedding data, so that a connection probability between the entity nodes in the random walk set is maximized.

The method of claim 1, wherein predicting the link between the entity nodes comprises:
extracting, from the machine learning model, a first embedding vector corresponding to a first node among the entity nodes of the new biological network;
extracting, from the machine learning model, a second embedding vector corresponding to a second node among the entity nodes of the new biological network; and
determining whether an edge exists between the first node and the second node based on the extracted first embedding vector and the extracted second embedding vector.

The method of claim 11, wherein determining whether the edge exists comprises:
determining that an edge exists between the first node and the second node in response to a vector distance between the extracted first embedding vector and the extracted second embedding vector being less than a threshold distance value.

A computer program stored in a computer-readable recording medium to execute, in combination with hardware, the method of any one of claims 1 to 12.

A computing device comprising:
a memory storing a machine learning model for predicting biomedical association; and
a processor configured to extract real world data including a diagnosis code and a prescription code linked to patient information from a real world database, extract biomedical data including genetic information, a disease identifier, a chemical identifier, and a parent structure identifier from a biomedical database, construct a new biological network by combining the real world data and the biomedical data, train the machine learning model from the constructed new biological network, and predict a link between entity nodes based on embedding vectors of the entity nodes of the new biological network, using the trained machine learning model.