KR102405848B1

KR102405848B1 - Method and system for predicting personalized therapeutic information

Info

Publication number: KR102405848B1
Application number: KR1020220000258A
Authority: KR
Inventors: 강민근; 이기원; 양수희; 우인희; 황경조
Original assignee: 주식회사 스파이더코어
Priority date: 2022-01-03
Filing date: 2022-01-03
Publication date: 2022-06-07

Abstract

In a method for predicting user-customized treatment information, a data collection unit collects data from at least one pharmaceutical database and at least one gene polymorphism database. An information extraction unit extracts diseases, genes, drugs, and genotypes of polymorphic genes as entities from the collected data, derives relationships between the entities, and stores those relationships in a structured form in an internal database. A clustering unit uses the internal database and a user genetic database to cluster a plurality of individuals into groups of people with similar genetic information, classifying them into first to k-th clusters. A graph generation unit uses the data stored in the internal database to generate first to k-th graphs, each representing the relationships between the entities in the corresponding cluster. A deep learning unit trains an artificial neural network using the first to k-th graphs as training data. A prediction unit inputs a query entity and query genetic variation information into the artificial neural network and outputs the entity produced by the network as a target entity that is highly associated with the query entity for the user.

Description

User-customized treatment information prediction method and system

The present invention relates to a method and system for predicting treatment information suitable for a user and, more particularly, to a method and system for predicting, in a manner customized to the user, the diseases, genes, and drugs associated with the user on the basis of the user's genetic variation information.

In new drug development, the most basic step is to detect a gene that causes a specific disease and to select a drug that can effectively inhibit the activity of the detected gene.

To this end, methods for predicting new drug candidates on the basis of biological networks have recently been studied.

Here, a biological network means a network made up of interactions between entities such as diseases, genes, and drugs.

A conventional method for predicting new drug candidates using such a biological network analyzes the associations between drugs and genes and between genes and diseases, extracts valid paths leading from a drug through a gene to a disease, and predicts new drug candidates for a specific disease on the basis of the extracted paths.

However, even a drug that acts effectively on a specific gene may become less effective, or may even cause side effects, when that gene carries a variation.

Although different people carry different genetic variations, conventional methods predict new drug candidates without taking genetic variation into account, which makes it difficult to recommend appropriate new drug candidates for each user.

Korean Patent Registration No. 10-1878924 (2018.07.10)

One object of the present invention is to provide a method capable of predicting treatment information suitable for a user in a user-customized manner.

Another object of the present invention is to provide a system capable of predicting treatment information suitable for a user in a user-customized manner.

To achieve the above object of the present invention, in a method for predicting user-customized treatment information according to an embodiment of the present invention, a data collection unit collects data in text form from at least one pharmaceutical database that stores disease-, gene-, and drug-related information and from at least one gene polymorphism database that stores information related to gene polymorphism. An information extraction unit uses a natural language processing algorithm to extract diseases, genes, drugs, and genotypes of polymorphic genes as entities from the data collected by the data collection unit, derives relationships between the entities, and stores the relationships in a structured form in an internal database. A clustering unit uses the internal database and a user genetic database, which stores genetic variation information for each of a plurality of individuals, to cluster the plurality of individuals into groups of people with similar genetic information, classifying them into first to k-th clusters (k being an integer of 2 or more). A graph generation unit uses the data stored in the internal database to generate first to k-th graphs representing the relationships between the entities in each of the first to k-th clusters. A deep learning unit uses the first to k-th graphs as training data to train an artificial neural network so that, on the basis of a query entity and query genetic variation information supplied to its input layer, the network outputs an entity associated with the query entity for a user having the query genetic variation information. A prediction unit inputs into the artificial neural network a query entity corresponding to one of a disease, a gene, and a drug, together with query genetic variation information corresponding to the user's genetic variation information, and outputs the entity produced by the network as a target entity that is highly associated with the query entity for the user.

In an embodiment, the internal database includes a disease field, a gene field, a drug field, a single nucleotide polymorphism (SNP) index field, and a plurality of genotype fields. The step in which the information extraction unit extracts diseases, genes, drugs, and genotypes of polymorphic genes as entities from the collected data, derives the relationships between the extracted entities, and stores those relationships in a structured form in the internal database may include two steps. In the first, the information extraction unit extracts, from the data collected by the data collection unit, a disease, a gene, and a drug that are associated with one another, extracts an SNP index corresponding to an SNP of the gene together with the genotypes of that SNP, and extracts the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP. In the second, the information extraction unit stores the extracted disease name, gene name, drug name, SNP index, and per-genotype degree of effect of the drug on the disease in the disease field, the gene field, the drug field, the SNP index field, and the corresponding genotype fields of the internal database, respectively.
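As a rough illustration of the row layout described above, the following sketch models one row of the internal database as a Python dataclass. The class name `Record`, the helper `store`, and all sample values (including the SNP identifier) are hypothetical and chosen only for illustration; the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    """One row of the internal database (field names are illustrative)."""
    disease: str
    gene: str
    drug: str
    snp_index: Optional[str] = None  # None when no SNP is recorded for the row
    # One entry per genotype field: genotype -> degree of effect of the drug
    genotype_effects: Dict[str, float] = field(default_factory=dict)


def store(db: List[Record], disease: str, gene: str, drug: str,
          snp_index: Optional[str] = None,
          genotype_effects: Optional[Dict[str, float]] = None) -> Record:
    """Append one extracted disease/gene/drug relationship to the database."""
    row = Record(disease, gene, drug, snp_index, dict(genotype_effects or {}))
    db.append(row)
    return row


internal_db: List[Record] = []
# Made-up sample rows: one with an SNP and per-genotype effects, one without.
store(internal_db, "disease_A", "GENE1", "drug_X", "rs0000001",
      {"AA": 1.0, "AG": 0.5, "GG": 0.0})
store(internal_db, "disease_A", "GENE2", "drug_Y")
```

Keeping the per-genotype effects in a mapping rather than fixed columns is a design choice of this sketch; it lets a row carry any number of genotype fields.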

The step in which the clustering unit classifies the plurality of individuals into the first to k-th clusters may include: reading out all SNP indices stored in the internal database; reading out, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the read SNP indices; generating, for each of the plurality of individuals, an SNP vector containing that individual's SNP genotypes for the read SNP indices; and applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals to classify the individuals into the first to k-th clusters.
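The clustering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not say how a genotype is turned into a number, so the sketch assumes each genotype is encoded as the count of an alternate allele (0, 1, or 2). The function names, the `alt_allele` table, and the sample data are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def snp_vector(genotypes: Dict[str, str], snp_indices: Sequence[str],
               alt_allele: Dict[str, str]) -> List[float]:
    """Encode one individual's genotypes as alternate-allele counts (assumed encoding)."""
    return [float(genotypes[s].count(alt_allele[s])) for s in snp_indices]


def kmeans(vectors: List[List[float]], k: int, iters: int = 100, seed: int = 0) -> List[int]:
    """Plain K-means (Lloyd's algorithm); returns a cluster label per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up example: three SNP indices read from the internal database,
# and four individuals read from the user genetic database.
snp_indices = ["rs1", "rs2", "rs3"]
alt_allele = {"rs1": "G", "rs2": "G", "rs3": "G"}
user_db = {
    "p1": {"rs1": "AA", "rs2": "AA", "rs3": "AA"},
    "p2": {"rs1": "AA", "rs2": "AG", "rs3": "AA"},
    "p3": {"rs1": "GG", "rs2": "GG", "rs3": "GG"},
    "p4": {"rs1": "GG", "rs2": "AG", "rs3": "GG"},
}
vectors = [snp_vector(user_db[p], snp_indices, alt_allele) for p in user_db]
labels = kmeans(vectors, k=2)
```

With this data, `p1` and `p2` end up in one cluster and `p3` and `p4` in the other; which cluster receives label 0 depends on the random initialization.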

The step in which the clustering unit classifies the plurality of individuals into the first to k-th clusters may further include: when the dimension of the SNP vector generated for each of the plurality of individuals is higher than a predetermined dimension p (p being a positive integer), reducing the dimension of the SNP vectors corresponding to the plurality of individuals to a dimension q lower than p (q being a positive integer smaller than p); and applying a K-means algorithm to the plurality of reduced SNP vectors corresponding to the plurality of individuals to classify the individuals into the first to k-th clusters.

When the dimension of the SNP vector generated for each of the plurality of individuals is higher than p, the clustering unit may generate the reduced SNP vectors by applying a principal component analysis (PCA) dimension reduction algorithm to the SNP vectors corresponding to the plurality of individuals, thereby reducing their dimension to the dimension q, which is lower than p.
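The conditional PCA reduction can be illustrated with the sketch below. It is only one way to compute PCA, assumed here for the sake of a dependency-free example: the top q principal components are found by power iteration with deflation on the sample covariance matrix. The function names `pca_reduce` and `maybe_reduce` and the sample vectors are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def pca_reduce(vectors: List[Vector], q: int, iters: int = 200, seed: int = 0) -> List[Vector]:
    """Project vectors onto their top-q principal components."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(col) / n for col in zip(*vectors)]
    centered = [[x - m for x, m in zip(row, means)] for row in vectors]
    # Sample covariance matrix (dim x dim); requires at least two vectors.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    rng = random.Random(seed)
    components: List[Vector] = []
    for _ in range(q):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(dim) for j in range(dim))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(a * b for a, b in zip(row, comp)) for comp in components]
            for row in centered]


def maybe_reduce(vectors: List[Vector], p: int, q: int) -> List[Vector]:
    """Reduce to q dimensions only when the vector dimension exceeds p."""
    if len(vectors[0]) <= p:
        return vectors
    return pca_reduce(vectors, q)


# Made-up 3-dimensional SNP vectors that vary along a single direction.
sample = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0]]
reduced = maybe_reduce(sample, p=2, q=1)     # dimension 3 > p, so PCA is applied
unchanged = maybe_reduce(sample, p=3, q=1)   # dimension 3 <= p, so vectors pass through
```

The reduced vectors can then be passed to K-means in place of the original SNP vectors. The sign of each projected coordinate is arbitrary, as is usual for PCA.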

When the dimension of the SNP vector generated for each of the plurality of individuals is higher than p, the clustering unit may generate the reduced SNP vectors by performing unsupervised learning with an autoencoder, thereby reducing the dimension of the SNP vectors corresponding to the plurality of individuals to the dimension q, which is lower than p.

The step in which the clustering unit classifies the plurality of individuals into the first to k-th clusters may include: determining, as effective SNPs, the t SNPs (t being an integer of 2 or more) for which the degree of drug effect differs most strongly by genotype, from among the SNPs corresponding to the SNP indices stored in the internal database; reading out, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs; and classifying the plurality of individuals into the first to k-th clusters by grouping together people who have the same genotypes for the effective SNPs.
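A minimal sketch of this variant follows. The source does not define how the "difference in the degree of drug effect by genotype" is measured, so the sketch assumes the largest gap between the highest and lowest effect score across genotypes, taken over all drugs recorded for the SNP. The function names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# snp -> one {genotype: effect score} mapping per drug recorded for that SNP
Effects = Dict[str, List[Dict[str, float]]]


def effective_snps(effects: Effects, t: int) -> List[str]:
    """Pick the t SNPs whose effect scores vary most across genotypes (assumed measure)."""
    def spread(snp: str) -> float:
        return max(max(e.values()) - min(e.values()) for e in effects[snp])
    # Sort by descending spread; ties are broken by SNP identifier.
    return sorted(effects, key=lambda s: (-spread(s), s))[:t]


def group_by_genotype(user_db: Dict[str, Dict[str, str]],
                      snps: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Group individuals who share the same genotype at every effective SNP."""
    clusters = defaultdict(list)
    for person, genotypes in user_db.items():
        clusters[tuple(genotypes[s] for s in snps)].append(person)
    return dict(clusters)


# Made-up effect scores and genotypes.
effects = {
    "rs1": [{"AA": 1.0, "AG": 0.5, "GG": 0.0}],
    "rs2": [{"CC": 0.5, "CT": 0.5, "TT": 0.5}],
    "rs3": [{"AA": 1.0, "AG": 1.0, "GG": 0.5}],
}
user_db = {
    "p1": {"rs1": "AA", "rs2": "CC", "rs3": "AG"},
    "p2": {"rs1": "AA", "rs2": "TT", "rs3": "AG"},
    "p3": {"rs1": "GG", "rs2": "CC", "rs3": "AG"},
}
selected = effective_snps(effects, t=2)
clusters = group_by_genotype(user_db, selected)
```

Here `rs2` is dropped because its effect does not vary by genotype, so `p1` and `p2` fall into the same cluster even though they differ at `rs2`.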

The step in which the clustering unit classifies the plurality of individuals into the first to k-th clusters may include: determining, as effective SNPs, the t SNPs (t being an integer of 2 or more) for which the degree of drug effect differs most strongly by genotype, from among the SNPs corresponding to the SNP indices stored in the internal database; reading out, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs; generating, for each of the plurality of individuals, an SNP vector containing that individual's SNP genotypes for the effective SNPs; and applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals to classify the individuals into the first to k-th clusters.

The information extraction unit may grade the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP into levels ranging from a first level, at which the effect is lowest, to a d-th level (d being an integer of 2 or more), at which the effect is highest, and may store the effect score predetermined for each of the first to d-th levels in the corresponding genotype field of the internal database.
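As a small worked example of this grading, the sketch below maps d levels to effect scores and fills the genotype fields for one row. The source only says the scores are predetermined per level; the evenly spaced values in [0, 1] used here, the function name `level_scores`, and the sample levels are assumptions.

```python
from typing import Dict


def level_scores(d: int) -> Dict[int, float]:
    """Assumed scoring: level 1 (lowest effect) -> 0.0, level d (highest) -> 1.0."""
    if d < 2:
        raise ValueError("d must be an integer of 2 or more")
    return {level: (level - 1) / (d - 1) for level in range(1, d + 1)}


scores = level_scores(3)
# Made-up levels extracted for the three genotypes of one SNP and one drug.
genotype_levels = {"AA": 3, "AG": 2, "GG": 1}
genotype_fields = {g: scores[level] for g, level in genotype_levels.items()}
```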

The step in which the clustering unit classifies the plurality of individuals into the first to k-th clusters may include: determining an information content score for each of the SNPs corresponding to all SNP indices stored in the internal database by applying the following equation to the data stored in the internal database

(where score_i denotes the information content score of the i-th SNP, w_i denotes the allele frequency of the i-th SNP, d denotes the number of drugs stored in the internal database, AA_ij denotes the effect score indicating the degree of effect observed when the first genotype of the i-th SNP responds to the j-th drug, AG_ij denotes the effect score indicating the degree of effect observed when the second genotype of the i-th SNP responds to the j-th drug, and GG_ij denotes the effect score indicating the degree of effect observed when the third genotype of the i-th SNP responds to the j-th drug); determining, as effective SNPs, the t SNPs having the largest information content scores from among the SNPs corresponding to the SNP indices stored in the internal database; reading out, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs; and classifying the plurality of individuals into the first to k-th clusters by grouping together people who have the same genotypes for the effective SNPs.

Alternatively, after the information content score of each SNP has been determined by the same equation (with score_i, w_i, d, AA_ij, AG_ij, and GG_ij defined as above), the step may include: determining, as effective SNPs, the t SNPs having the largest information content scores from among the SNPs corresponding to the SNP indices stored in the internal database; reading out, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs; generating, for each of the plurality of individuals, an SNP vector containing that individual's SNP genotypes for the effective SNPs; and applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals to classify the individuals into the first to k-th clusters.

The step in which the graph generation unit uses the data stored in the internal database to generate the first to k-th graphs representing the relationships between the entities in each of the first to k-th clusters may include: defining each disease name stored in the disease field of the internal database as a disease node; for a row in which no SNP index is stored in the SNP index field, defining the gene name stored in the gene field of that row as a gene node, and for a row in which an SNP index is stored in the SNP index field, defining a plurality of gene nodes by associating the gene name stored in the gene field of that row and the SNP index stored in the SNP index field of that row with each of the genotypes of the SNP corresponding to that SNP index; defining each drug name stored in the drug field of the internal database as a drug node; defining, as an edge, the connection between each pair of mutually associated nodes among the disease nodes, the gene nodes, and the drug nodes; for each edge connecting a gene node and a drug node that correspond to the same row of the internal database, determining the effect score stored in the genotype field associated with that gene node in the same row as the weight of the edge; defining, as a path, a set of two or more nodes connected to one another through edges together with the edges connecting them; and, for each of the first to k-th clusters, generating an a-th graph (a being a positive integer not greater than k) by extracting only the nodes, edges, and paths associated with the SNP genotypes corresponding to the a-th cluster.
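The graph construction described above can be sketched for a single cluster as follows. Several points are assumptions made for illustration: rows are plain dictionaries, a cluster is described by one genotype per SNP (as in the identical-genotype clustering variant), edges other than genotype-specific gene-to-drug edges carry no weight (`None`), and a row whose SNP has no matching genotype in the cluster contributes nothing. The function name `build_cluster_graph` and all data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Node = Tuple[str, ...]
Edge = Tuple[Node, Node]


def build_cluster_graph(rows: List[dict], cluster_genotypes: Dict[str, str]
                        ) -> Tuple[Set[Node], Dict[Edge, Optional[float]]]:
    """Build the graph for one cluster from internal-database rows."""
    nodes: Set[Node] = set()
    edges: Dict[Edge, Optional[float]] = {}
    for r in rows:
        disease: Node = ("disease", r["disease"])
        drug: Node = ("drug", r["drug"])
        snp = r.get("snp_index")
        if snp is None:
            # Row without an SNP index: a single gene node named after the gene.
            gene_nodes = [(("gene", r["gene"]), None)]
        else:
            # Row with an SNP index: one gene node per genotype, of which only
            # the genotype held by this cluster is kept.
            keep = cluster_genotypes.get(snp)
            gene_nodes = [(("gene", r["gene"], snp, g), score)
                          for g, score in r["genotype_effects"].items() if g == keep]
        for gene, score in gene_nodes:
            nodes.update([disease, gene, drug])
            edges[(disease, gene)] = None   # disease-gene edge, unweighted here
            edges[(gene, drug)] = score     # gene-drug edge weighted by effect score
    return nodes, edges


# Made-up rows and a made-up cluster whose members carry genotype AG at rs1.
rows = [
    {"disease": "disease_A", "gene": "GENE1", "drug": "drug_X",
     "snp_index": "rs1", "genotype_effects": {"AA": 1.0, "AG": 0.5, "GG": 0.0}},
    {"disease": "disease_A", "gene": "GENE2", "drug": "drug_Y", "snp_index": None},
]
nodes, edges = build_cluster_graph(rows, {"rs1": "AG"})
```

Calling the function once per cluster, each time with that cluster's genotypes, yields the first to k-th graphs.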

The step in which the deep learning unit trains the artificial neural network may include: performing embedding on the nodes included in the first to k-th graphs such that the greater the weight of an edge connecting a gene node and a drug node, the closer together the gene node and the drug node connected by that edge are mapped; and training the artificial neural network on the paths included in the first to k-th graphs using the vectors generated for each of the nodes as a result of the embedding.

The step in which the deep learning unit trains the artificial neural network may include: performing embedding on the nodes included in the first to k-th graphs such that the greater the weight of an edge connecting a gene node and a drug node, the closer together the gene node and the drug node connected by that edge are mapped; determining, for each of the paths included in the first to k-th graphs, the sum of the weights of the edges included in the path as a path weight; determining, as important paths, the paths having relatively high path weights from among the paths included in the first to k-th graphs; and training the artificial neural network on only the important paths from among the paths included in the first to k-th graphs, using the vectors generated for each of the nodes as a result of the embedding.
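The path-weight step can be sketched as below. The source does not say how many paths count as "relatively high", so this illustration simply keeps the `top_n` heaviest paths; it also treats edges without a stored weight as contributing zero. Both choices, the function names, and the sample data are assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Edge = Tuple[str, str]


def path_weight(path: Sequence[str], edge_weights: Dict[Edge, Optional[float]]) -> float:
    """Sum of the weights of the edges along a path (unweighted edges count as 0)."""
    return sum(edge_weights.get((a, b)) or 0.0 for a, b in zip(path, path[1:]))


def important_paths(paths: List[Sequence[str]],
                    edge_weights: Dict[Edge, Optional[float]],
                    top_n: int) -> List[Sequence[str]]:
    """Keep the top_n paths with the highest path weight (assumed selection rule)."""
    ranked = sorted(paths, key=lambda p: path_weight(p, edge_weights), reverse=True)
    return ranked[:top_n]


# Made-up disease -> gene -> drug paths; node names are plain strings here.
edge_weights = {
    ("disease_A", "GENE1:rs1:AG"): None,
    ("GENE1:rs1:AG", "drug_X"): 0.5,
    ("disease_A", "GENE3:rs7:TT"): None,
    ("GENE3:rs7:TT", "drug_Z"): 1.0,
    ("disease_A", "GENE2"): None,
    ("GENE2", "drug_Y"): None,
}
paths = [
    ["disease_A", "GENE1:rs1:AG", "drug_X"],
    ["disease_A", "GENE3:rs7:TT", "drug_Z"],
    ["disease_A", "GENE2", "drug_Y"],
]
selected_paths = important_paths(paths, edge_weights, top_n=2)
```

Only the selected paths would then be used as training sequences for the network, together with the node embeddings.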

To achieve the above object of the present invention, a user-customized treatment information prediction system according to an embodiment of the present invention includes a data collection unit, an information extraction unit, a clustering unit, a graph generation unit, a deep learning unit, and a prediction unit. The data collection unit collects data in text form from at least one pharmaceutical database that stores disease-, gene-, and drug-related information and from at least one gene polymorphism database that stores information related to gene polymorphism. The information extraction unit uses a natural language processing algorithm to extract diseases, genes, drugs, and genotypes of polymorphic genes as entities from the data collected by the data collection unit, derives relationships between the entities, and stores the relationships in a structured form in an internal database. The clustering unit uses the internal database and a user genetic database, which contains genetic variation information for each of a plurality of individuals, to cluster the plurality of individuals into groups of people with similar genetic information, classifying them into first to k-th clusters (k being an integer of 2 or more). The graph generation unit uses the data stored in the internal database to generate first to k-th graphs representing the relationships between the entities in each of the first to k-th clusters. The deep learning unit uses the first to k-th graphs as training data to train an artificial neural network so that, on the basis of a query entity and query genetic variation information supplied to its input layer, the network outputs an entity associated with the query entity for a user having the query genetic variation information. The prediction unit inputs into the artificial neural network a query entity corresponding to one of a disease, a gene, and a drug, together with query genetic variation information corresponding to the user's genetic variation information, and outputs the entity produced by the network as a target entity that is highly associated with the query entity for the user.

The user-customized treatment information prediction system and method according to embodiments of the present invention can effectively predict, in a manner customized to the user, the diseases, genes, and drugs associated with the user on the basis of the user's genetic variation information.

FIG. 1 is a diagram illustrating a user-customized treatment information prediction system according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for predicting user-customized treatment information according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the operation of the information extraction unit included in the user-customized treatment information prediction system of FIG. 1.
FIG. 4 is a diagram for explaining an example of the operation of the clustering unit included in the user-customized treatment information prediction system of FIG. 1.
FIG. 5 is a diagram for explaining an example of the process by which the information extraction unit included in the user-customized treatment information prediction system of FIG. 1 quantifies the degree of effect of a drug on a disease for each genotype of an SNP.
FIG. 6 is a diagram illustrating an example of a graph generated by the graph generation unit included in the user-customized treatment information prediction system of FIG. 1.

본문에 개시되어 있는 본 발명의 실시예들에 대해서, 특정한 구조적 내지 기능적 설명들은 단지 본 발명의 실시예를 설명하기 위한 목적으로 예시된 것으로, 본 발명의 실시예들은 다양한 형태로 실시될 수 있으며 본문에 설명된 실시예들에 한정되는 것으로 해석되어서는 아니 된다.With respect to the embodiments of the present invention disclosed in the text, specific structural or functional descriptions are only exemplified for the purpose of describing the embodiments of the present invention, and the embodiments of the present invention may be embodied in various forms and the text It should not be construed as being limited to the embodiments described in .

본 발명은 다양한 변경을 가할 수 있고 여러 가지 형태를 가질 수 있는 바, 특정 실시예들을 도면에 예시하고 본문에 상세하게 설명하고자 한다. 그러나 이는 본 발명을 특정한 개시 형태에 대해 한정하려는 것이 아니며, 본 발명의 사상 및 기술 범위에 포함되는 모든 변경, 균등물 내지 대체물을 포함하는 것으로 이해되어야 한다.Since the present invention can have various changes and can have various forms, specific embodiments are illustrated in the drawings and described in detail in the text. However, this is not intended to limit the present invention to the specific disclosed form, it should be understood to include all modifications, equivalents and substitutes included in the spirit and scope of the present invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, or intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, no intervening components are present. Other expressions describing relationships between components, such as "between" versus "directly between" and "adjacent to" versus "directly adjacent to", should be interpreted in the same way.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" and "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a diagram illustrating a user-customized treatment information prediction system according to an embodiment of the present invention.

The user-customized treatment information prediction system 10 shown in FIG. 1 is a system that can predict, in a manner customized to a user, the diseases, genes, and drugs associated with the user based on the user's genetic variation information.

FIG. 2 is a flowchart illustrating a user-customized treatment information prediction method according to an embodiment of the present invention.

The user-customized treatment information prediction method shown in FIG. 2 may be performed by the user-customized treatment information prediction system 10 of FIG. 1.

Hereinafter, the configuration and operation of the user-customized treatment information prediction system 10, and the user-customized treatment information prediction method performed by the system 10, are described in detail with reference to FIGS. 1 and 2.

Referring to FIG. 1, the user-customized treatment information prediction system 10 includes a data collection unit 100, an information extraction unit 200, an internal database (IN_DB) 300, a user genetic database (UG_DB) 400, a clustering unit 500, a graph generation unit 600, a deep learning unit 700, and a prediction unit 800.

The data collection unit 100 collects data in text form from at least one drug database (MD_DB) 20 and at least one gene polymorphism database (GP_DB) 30 (step S100).

The drug database 20 may store information on the associations among various types of diseases, genes, and drugs.

The drug database 20 may be a database that stores information on the associations among diseases, genes, and drugs in a structured format, or a database that stores unstructured document data, such as papers or journal articles related to diseases, genes, and drugs.

The drug database 20 according to embodiments of the present invention is not limited to a specific database and may be any database that stores information related to diseases, genes, and drugs.

The gene polymorphism database 30 may be any database that stores information related to gene polymorphism and information related to pharmacogenomics. For example, the gene polymorphism database 30 may store information on differences in biological response to a drug according to the genotype of a polymorphic gene.

The gene polymorphism database 30 may be a database that stores gene polymorphism information and pharmacogenomics information in a structured format, or a database that stores unstructured document data, such as papers or journal articles related to gene polymorphism and pharmacogenomics.

The gene polymorphism database 30 according to embodiments of the present invention is not limited to a specific database and may be any database that stores information related to gene polymorphism and pharmacogenomics.

For example, the gene polymorphism database 30 may be PharmGKB (The Pharmacogenomics Knowledge Base), PharmaADME (Pharmacogenetics of Absorption, Distribution, Metabolism and Excretion genes), the CYP-allele database (The Human Cytochrome P450 (CYP) Allele Nomenclature Website), dbSNP (The NCBI database of genetic variation), or the like.

Depending on the embodiment, the at least one drug database 20 and the at least one gene polymorphism database 30 may exist outside the user-customized treatment information prediction system 10 or may be built in advance inside the system 10.

Using a natural language processing algorithm, the information extraction unit 200 extracts diseases, genes, drugs, and genotypes of polymorphic genes as entities from the data collected by the data collection unit 100, derives the relationships among the entities, and stores those relationships in a structured form in the internal database 300 (step S200).

The natural language processing algorithm used by the information extraction unit 200 may be any conventionally known natural language processing algorithm. A detailed description of the natural language processing performed by the information extraction unit 200 is therefore omitted.

FIG. 3 is a diagram for explaining the operation of the information extraction unit included in the user-customized treatment information prediction system of FIG. 1.

FIG. 3 shows an example of the internal database 300.

Referring to FIG. 3, the internal database 300 may include a disease field (DISEASE_F), a gene field (GENE_F), a drug field (DRUG_F), a single nucleotide polymorphism (SNP) index field (SNP_INDEX_F), and a plurality of genotype fields (AA_F, AG_F, GG_F).

In this case, the information extraction unit 200 may extract mutually related diseases, genes, and drugs from the data collected by the data collection unit 100, extract the SNP index corresponding to an SNP of the gene together with the genotypes of that SNP, and extract the degree of effect of the drug on the disease in patients carrying each genotype of the SNP.

The information extraction unit 200 may then store the name of the disease, the name of the gene, the name of the drug, the SNP index, and the degree of effect of the drug on the disease in patients carrying each genotype of the SNP in the disease field (DISEASE_F), the gene field (GENE_F), the drug field (DRUG_F), the SNP index field (SNP_INDEX_F), and the corresponding genotype fields (AA_F, AG_F, GG_F) of the internal database 300, respectively.

For example, as shown in FIG. 3, when information such as the first row R1 is stored in the gene polymorphism database 30, the information extraction unit 200 may use the natural language processing algorithm to analyze the text data corresponding to the first row R1 collected by the data collection unit 100. From this text it may extract the disease name "Rheumatoid arthritis", the gene name "IL6R", the drug name "tocilizumab", and the SNP index "rs12083537", together with the information that, in rheumatoid arthritis patients, tocilizumab is more effective when the genotype of the SNP corresponding to rs12083537 is AA than when it is AG or GG.

Thereafter, as in the second row R2 of the internal database 300 shown in FIG. 3, the information extraction unit 200 may store the disease name "Rheumatoid arthritis" in the disease field (DISEASE_F), the gene name "IL6R" in the gene field (GENE_F), the drug name "tocilizumab" in the drug field (DRUG_F), and the SNP index "rs12083537" in the SNP index field (SNP_INDEX_F) of the internal database 300. It may also store "Much Efficacy" in the genotype field (AA_F) corresponding to the AA genotype, "Less Efficacy" in the genotype field (AG_F) corresponding to the AG genotype, and "Less Efficacy" in the genotype field (GG_F) corresponding to the GG genotype.
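As an illustration only, one row of the internal database 300 could be modeled as a simple record. The sketch below assumes a Python dataclass whose class name and attribute names are hypothetical; only the field labels in the comments and the example values of row R2 come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InternalDbRow:
    """Hypothetical in-memory form of one row of the internal database 300."""
    disease: str    # DISEASE_F
    gene: str       # GENE_F
    drug: str       # DRUG_F
    snp_index: str  # SNP_INDEX_F
    aa: str         # AA_F: degree of effect for the AA genotype
    ag: str         # AG_F: degree of effect for the AG genotype
    gg: str         # GG_F: degree of effect for the GG genotype


# Row R2 as described for FIG. 3.
row_r2 = InternalDbRow(
    disease="Rheumatoid arthritis",
    gene="IL6R",
    drug="tocilizumab",
    snp_index="rs12083537",
    aa="Much Efficacy",
    ag="Less Efficacy",
    gg="Less Efficacy",
)
```

Keeping one genotype field per column mirrors the row layout of FIG. 3; a real implementation might equally use a relational table with the same columns.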

As described above with reference to FIG. 3, the information extraction unit 200 may analyze the text data collected by the data collection unit 100 from the at least one drug database 20 and the at least one gene polymorphism database 30, extract mutually related diseases, genes, and drugs, the SNP index corresponding to an SNP of the gene, and the degree of effect of the drug on the disease for each genotype of the SNP, and store them in the internal database 300 row by row.

Referring again to FIGS. 1 and 2, the user-customized treatment information prediction system 10 may include a user genetic database 400 that stores in advance the genetic variation information of each of a plurality of individuals, obtained by performing a genetic test on each of them.

For example, when W SNPs are known, the user genetic database 400 may store, for each of the plurality of individuals, that individual's genotypes for the W SNPs.

The clustering unit 500 uses the internal database 300 and the user genetic database 400 to cluster the plurality of individuals into groups of people having similar genetic information, thereby classifying them into first to k-th clusters (GR1 to GRk), where k is an integer of 2 or more (step S300).

Hereinafter, a first embodiment in which the clustering unit 500 uses the internal database 300 and the user genetic database 400 to classify the plurality of individuals into the first to k-th clusters (GR1 to GRk) is described in detail.

In one embodiment, the clustering unit 500 may use the K-means algorithm to cluster the plurality of individuals into groups of people having similar genetic information and classify them into the first to k-th clusters (GR1 to GRk).

Specifically, the clustering unit 500 may read all SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300 and, from the user genetic database 400, read each individual's SNP genotypes corresponding to the read SNP indices.

The clustering unit 500 may then generate, for each of the plurality of individuals, an SNP vector containing that individual's SNP genotypes corresponding to the read SNP indices.

The clustering unit 500 may then apply the K-means algorithm to the SNP vectors corresponding to the plurality of individuals to cluster the SNP vectors into k clusters, and classify the individuals whose SNP vectors fall in the same cluster into the same cluster, thereby classifying the plurality of individuals into the first to k-th clusters (GR1 to GRk).

FIG. 4 is a diagram for explaining an example of the operation of the clustering unit included in the user-customized treatment information prediction system of FIG. 1.

FIG. 4 illustrates the operation of the clustering unit 500 in a case where N SNP indices (rs_1, rs_2, ..., rs_(N-1), rs_N), N being an integer of 2 or more, are stored in the SNP index field (SNP_INDEX_F) of the internal database 300, and the genetic variation information of each of M individuals, M being an integer of 2 or more, is stored in the user genetic database 400.

Referring to FIG. 4, when the clustering unit 500 reads the N SNP indices (rs_1, rs_2, ..., rs_(N-1), rs_N) from the SNP index field (SNP_INDEX_F) of the internal database 300, it may read from the user genetic database 400 the SNP genotypes corresponding to the N SNP indices for each of the M individuals, and then generate, for each of the M individuals, an SNP vector (V_1, V_2, ..., V_M) containing the SNP genotypes corresponding to the N SNP indices.

For example, as shown in FIG. 4, the first SNP vector (V_1) generated for the first individual may be [AA, AA, ..., AG, GG], the second SNP vector (V_2) generated for the second individual may be [AG, GG, ..., AG, AA], and the M-th SNP vector (V_M) generated for the M-th individual may be [AA, AG, ..., AG, GG].
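The construction of the SNP vectors can be sketched as follows. This is a minimal illustration that assumes the user genetic database 400 is available as a nested dictionary; the function name, the dictionary layout, the person identifiers, and the placeholder SNP indices are all hypothetical.

```python
def build_snp_vectors(snp_indices, user_genotypes):
    """Return one SNP vector per individual, ordered by snp_indices.

    snp_indices: SNP indices read from SNP_INDEX_F (assumed to be a list).
    user_genotypes: assumed mapping {person: {snp_index: genotype}}.
    """
    vectors = {}
    for person, genotypes in user_genotypes.items():
        # Every vector uses the same SNP order, so all have dimension N.
        vectors[person] = [genotypes[rs] for rs in snp_indices]
    return vectors


# Placeholder data with N = 3 SNPs and M = 2 individuals.
snp_indices = ["rs_1", "rs_2", "rs_3"]
user_genotypes = {
    "person_1": {"rs_1": "AA", "rs_2": "AA", "rs_3": "GG"},
    "person_2": {"rs_1": "AG", "rs_2": "GG", "rs_3": "AA"},
}
snp_vectors = build_snp_vectors(snp_indices, user_genotypes)
```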

Accordingly, each of the first to M-th SNP vectors (V_1, V_2, ..., V_M) may be an N-dimensional vector.

Meanwhile, the clustering unit 500 may convert the first to M-th SNP vectors (V_1, V_2, ..., V_M) into numeric vectors by replacing each SNP genotype in the vectors with a unique value predefined for that genotype.

For example, when the AA genotype is defined as -1, the AG genotype as 0, and the GG genotype as 1, the clustering unit 500 may convert the first SNP vector (V_1) into [-1, -1, ..., 0, 1], the second SNP vector (V_2) into [0, 1, ..., 0, -1], and the M-th SNP vector (V_M) into [-1, 0, ..., 0, 1].
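The genotype-to-number substitution can be sketched in a few lines. The mapping values -1, 0, and 1 are the ones given in the example above; the constant and function names are hypothetical.

```python
# Values taken from the example in the description; other embodiments
# may define different values.
GENOTYPE_CODES = {"AA": -1, "AG": 0, "GG": 1}


def encode_snp_vector(snp_vector):
    """Replace each genotype in an SNP vector with its predefined number."""
    return [GENOTYPE_CODES[genotype] for genotype in snp_vector]


encoded = encode_snp_vector(["AA", "AA", "AG", "GG"])
```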

However, the present invention is not limited thereto, and the unique value predefined for each genotype may be defined differently depending on the embodiment.

The clustering unit 500 may then apply the K-means algorithm to the first to M-th SNP vectors (V_1, V_2, ..., V_M) corresponding to the M individuals to cluster them into k clusters.

Depending on the embodiment, the number of clusters formed by the clustering unit 500 may be predefined or may be determined according to an administrator's input.

Because the K-means algorithm is a well-known clustering algorithm, a detailed description of how the clustering unit 500 uses it to cluster the first to M-th SNP vectors (V_1, V_2, ..., V_M) into k clusters is omitted here.

The clustering unit 500 may then classify the M individuals into the first to k-th clusters (GR1 to GRk) by assigning the individuals whose SNP vectors fall in the same cluster to the same cluster.
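The description leaves the K-means details to the prior art. For orientation, the following is a minimal sketch of standard K-means (Lloyd's iteration) on numerically encoded SNP vectors, written with the standard library only. The function names, the seeded random initialization, and the iteration cap are assumptions made for the example and are not specified in the source.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, iterations=100, seed=0):
    """Return a cluster label (0..k-1) for each vector."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four encoded SNP vectors forming two obvious groups.
encoded_vectors = [[-1, -1, -1], [-1, -1, 0], [1, 1, 1], [1, 1, 0]]
labels = k_means(encoded_vectors, k=2)
```

Individuals whose vectors receive the same label belong to the same cluster. The label numbers themselves are arbitrary, so only label equality is meaningful.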

Meanwhile, when the dimension of the SNP vector generated for each of the plurality of individuals is much larger than the number of individuals, that is, when N is much larger than M in the example of FIG. 4, the efficiency and accuracy of the clustering performed by the clustering unit 500 may decrease.

Therefore, to improve the efficiency and accuracy of the clustering, when the dimension of the SNP vector generated for each individual is higher than a predetermined dimension p (p being a positive integer), the clustering unit 500 may reduce the SNP vectors corresponding to the plurality of individuals to a dimension q lower than p (q being a positive integer smaller than p), and then apply the K-means algorithm to the reduced SNP vectors to classify the plurality of individuals into the first to k-th clusters (GR1 to GRk).

In one embodiment, when the dimension of the SNP vector generated for each individual is higher than p, the clustering unit 500 may apply the PCA (Principal Component Analysis) dimension reduction algorithm to the SNP vectors corresponding to the plurality of individuals to reduce them to the dimension q lower than p, thereby generating the reduced SNP vectors.

Because PCA is a well-known dimension reduction algorithm, a detailed description of how the clustering unit 500 uses it to reduce the SNP vectors to the dimension q and generate the reduced SNP vectors is omitted here.
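The conditional reduction step can be sketched as below. This is an illustrative stand-in rather than the patented implementation: it computes the leading principal components by power iteration with deflation so that it needs only the standard library, whereas a production system would more likely call a linear-algebra library. All function names, the iteration count, and the seed are assumptions.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(rows):
    """Subtract the per-column mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def _top_component(rows, iterations, rng):
    """Power iteration on X^T X to estimate the leading principal axis."""
    dim = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(_dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        projections = [_dot(row, v) for row in rows]
        w = [sum(p * row[c] for p, row in zip(projections, rows))
             for c in range(dim)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def reduce_if_needed(vectors, p, q, iterations=200, seed=0):
    """Project onto q principal axes only when the dimension exceeds p."""
    if len(vectors[0]) <= p:
        return [list(v) for v in vectors]
    rng = random.Random(seed)
    centered = _center(vectors)
    residual = [row[:] for row in centered]
    components = []
    for _ in range(q):
        v = _top_component(residual, iterations, rng)
        components.append(v)
        # Deflation: remove the found direction before the next search.
        residual = [[x - _dot(row, v) * vc for x, vc in zip(row, v)]
                    for row in residual]
    return [[_dot(row, v) for v in components] for row in centered]


reduced = reduce_if_needed([[-1, -1, 0], [0, 0, 0], [1, 1, 0]], p=2, q=1)
```

The sign of each principal axis is arbitrary, so the reduced coordinates are determined only up to a sign flip per component. This does not affect distance-based clustering such as K-means.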

In another embodiment, when the dimension of the SNP vector generated for each individual is higher than p, the clustering unit 500 may perform unsupervised learning using an autoencoder to reduce the SNP vectors corresponding to the plurality of individuals to the dimension q lower than p, thereby generating the reduced SNP vectors.

Various methods of reducing the dimension of a given vector through unsupervised learning with an autoencoder are widely known, and the clustering unit 500 may use any of these conventionally known methods to reduce the SNP vectors to the dimension q and generate the reduced SNP vectors. A detailed description of how the clustering unit 500 performs unsupervised learning with an autoencoder to reduce the SNP vectors is therefore omitted here.

Hereinafter, a second embodiment in which the clustering unit 500 uses the internal database 300 and the user genetic database 400 to classify the plurality of individuals into the first to k-th clusters (GR1 to GRk) is described in detail.

As described above with reference to FIG. 3, the information extraction unit 200 may analyze the text data collected by the data collection unit 100, extract mutually related diseases, genes, and drugs, the SNP index corresponding to an SNP of the gene, and the degree of effect of the drug on the disease for each genotype of the SNP, and store them in the internal database 300 row by row.

Accordingly, the clustering unit 500 may select, from among the SNPs stored in the internal database 300, those SNPs for which the biological response to a drug differs relatively greatly depending on the genotype, and then cluster the plurality of individuals based on the genotypes of the selected SNPs to classify them into the first to k-th clusters (GR1 to GRk).

Specifically, based on the degree of effect of a drug on a disease for each SNP genotype stored in the genotype fields (AA_F, AG_F, GG_F) of the internal database 300, the clustering unit 500 may determine as effective SNPs the t SNPs (t being an integer of 2 or more) that show a relatively large difference in the degree of drug effect depending on the genotype, from among the SNPs corresponding to the SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300.

The clustering unit 500 may then read from the user genetic database 400 each individual's SNP genotypes corresponding to the effective SNPs, and cluster together the individuals who have the same genotypes for the effective SNPs, thereby classifying the plurality of individuals into the first to k-th clusters (GR1 to GRk).
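Grouping individuals who share identical genotypes on the effective SNPs amounts to keying each individual by a genotype tuple. The sketch below assumes the effective SNPs have already been chosen and that the user genetic database 400 is available as a nested dictionary; the function name, data layout, and placeholder identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_effective_snps(effective_snps, user_genotypes):
    """Group individuals whose genotypes match on every effective SNP.

    effective_snps: list of the t selected SNP indices.
    user_genotypes: assumed mapping {person: {snp_index: genotype}}.
    """
    groups = defaultdict(list)
    for person, genotypes in user_genotypes.items():
        # SNPs outside the effective set are ignored on purpose.
        key = tuple(genotypes[rs] for rs in effective_snps)
        groups[key].append(person)
    return dict(groups)


effective_snps = ["rs_1", "rs_2"]
user_genotypes = {
    "person_1": {"rs_1": "AA", "rs_2": "AG", "rs_3": "GG"},
    "person_2": {"rs_1": "AA", "rs_2": "AG", "rs_3": "AA"},
    "person_3": {"rs_1": "GG", "rs_2": "AG", "rs_3": "AA"},
}
clusters = group_by_effective_snps(effective_snps, user_genotypes)
```

Here person_1 and person_2 fall into the same cluster even though they differ on rs_3, because rs_3 is not among the effective SNPs in this example.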

In one embodiment, when the information extraction unit 200 analyzes the text data collected by the data collection unit 100, extracts mutually related diseases, genes, and drugs, the SNP index corresponding to an SNP of the gene, and the degree of effect of the drug on the disease for each genotype of the SNP, and stores them in the internal database 300 row by row, it may quantify the degree of effect of the drug on the disease for each genotype of the SNP and store the quantified degree of effect in the corresponding genotype fields (AA_F, AG_F, GG_F).

FIG. 5 is a diagram for explaining an example of a process in which the information extraction unit included in the user-customized treatment information prediction system of FIG. 1 quantifies the degree of effect of a drug on a disease for each SNP genotype.

일 실시예에 있어서, 정보 추출부(200)는 데이터 취합부(100)에 의해 취합된 데이터로부터 서로 연관이 있는 질병, 유전자, 및 약물을 추출하고, 상기 유전자가 갖는 SNP에 상응하는 SNP 인덱스 및 상기 SNP의 유전자형들을 추출하고, 상기 SNP의 유전자형들 각각을 보유한 환자에서 상기 질병에 대한 상기 약물의 효과 정도를 추출한 후, 상기 SNP의 유전자형들 각각을 보유한 환자에서 상기 질병에 대한 상기 약물의 효과 정도를 효과가 가장 낮은 제1 레벨에서 효과가 가장 높은 제d(d는 2 이상의 정수) 레벨로 구분하고, 상기 제1 내지 제d 레벨들 각각에 대해 미리 정해진 효과 점수(score)를 내부 데이터베이스(300)의 상응하는 유전자형 필드들(AA_F, AG_F, GG_F) 각각에 저장할 수 있다.In one embodiment, the information extraction unit 200 extracts diseases, genes, and drugs that are related to each other from the data collected by the data collection unit 100, and the SNP index corresponding to the SNP of the gene and After extracting the genotypes of the SNP, and extracting the degree of effect of the drug on the disease in patients carrying each of the genotypes of the SNP, the effect of the drug on the disease in patients carrying each of the genotypes of the SNP is divided into a d-th level (d is an integer greater than or equal to 2) having the highest effect from the first level with the lowest effect, and a predetermined effect score for each of the first to d-th levels is stored in the internal database 300 ) in each of the corresponding genotype fields (AA_F, AG_F, GG_F).

FIG. 5 shows the information extraction unit 200 classifying the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP into one of first to fourth levels (Toxic, Less Toxic, Efficacy, Much Efficacy).

In this case, as shown in FIG. 5, the information extraction unit 200 may assign s1 as the effect score for the first level, s2 greater than s1 as the effect score for the second level, s3 greater than s2 as the effect score for the third level, and s4 greater than s3 as the effect score for the fourth level.
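
The level-to-score mapping of FIG. 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the numeric values 1.0 to 4.0 are assumptions (the text only requires s1 < s2 < s3 < s4), and the names `EFFECT_SCORES` and `score_genotype_fields` are hypothetical.

```python
# Hypothetical effect scores; the source only requires s1 < s2 < s3 < s4.
EFFECT_SCORES = {
    "Toxic": 1.0,          # first level (lowest effect), s1
    "Less Toxic": 2.0,     # second level, s2
    "Efficacy": 3.0,       # third level, s3
    "Much Efficacy": 4.0,  # fourth level (highest effect), s4
}


def score_genotype_fields(levels_by_genotype):
    """Map each genotype's effect level to the score stored in its genotype field."""
    return {
        f"{genotype}_F": EFFECT_SCORES[level]
        for genotype, level in levels_by_genotype.items()
    }


# Example row: effect levels extracted for the three genotypes of one SNP.
row_scores = score_genotype_fields(
    {"AA": "Toxic", "AG": "Efficacy", "GG": "Much Efficacy"}
)
```

The resulting dictionary uses the field names AA_F, AG_F, and GG_F from the internal database layout, so it could be merged directly into a row.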

In this case, the clustering unit 500 may apply [Equation 1] below to the data stored in the internal database 300 to determine an information content score for each of the SNPs corresponding to all SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300.

[Equation 1]

Here, score_i denotes the information content score of the i-th SNP, w_i denotes the allele frequency of the i-th SNP, d denotes the number of drugs stored in the drug field (DRUG_F) of the internal database 300, AA_ij denotes the effect score indicating the degree of effect that appears when the first genotype of the i-th SNP responds to the j-th drug, AG_ij denotes the effect score indicating the degree of effect that appears when the second genotype of the i-th SNP responds to the j-th drug, and GG_ij denotes the effect score indicating the degree of effect that appears when the third genotype of the i-th SNP responds to the j-th drug.

The allele frequency (w_i) of each well-known SNP may be stored in advance in the clustering unit 500.

Accordingly, a larger information content score for a specific SNP may mean that the biological response to a drug differs relatively more depending on the genotype of that SNP.

The clustering unit 500 may therefore determine, as the effective SNPs, the t SNPs having relatively large information content scores among the SNPs corresponding to the plurality of SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300.
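
Selecting the t effective SNPs can be sketched as below, assuming the information content scores have already been computed with [Equation 1]. The score values and the entry `rs_placeholder` are hypothetical, as is the function name `select_effective_snps`; this sketch does not attempt to reproduce the equation itself.

```python
def select_effective_snps(info_scores, t):
    """Return the t SNP indices with the largest information content scores."""
    ranked = sorted(info_scores, key=info_scores.get, reverse=True)
    return ranked[:t]


# Hypothetical precomputed information content scores, keyed by SNP index.
info_scores = {
    "rs1800629": 0.9,
    "rs12083537": 0.4,
    "rs_placeholder": 0.1,  # made-up entry for illustration only
}
effective = select_effective_snps(info_scores, t=2)
```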

Hereinafter, a third embodiment, in which the clustering unit 500 uses the internal database 300 and the user genetic database 400 to cluster the plurality of individuals into people having similar genetic information and classify them into the first to k-th clusters GR1 to GRk, is described in detail.

The third embodiment may correspond to a combination of the first embodiment and the second embodiment described above. For example, the clustering unit 500 may determine the effective SNPs by performing the operation according to the second embodiment, and then classify the plurality of individuals into the first to k-th clusters by performing the operation according to the first embodiment using only the effective SNPs.

In one embodiment, when the information extraction unit 200 quantifies the degree of effect of the drug on the disease for each genotype of the SNP and stores the quantified degree of effect in the corresponding genotype fields (AA_F, AG_F, GG_F), as described above with reference to FIG. 5, the clustering unit 500 may apply [Equation 1] to the data stored in the internal database 300 to determine an information content score for each of the SNPs corresponding to all SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300. It may then determine, as the effective SNPs, the t SNPs having relatively large information content scores among those SNPs.

Thereafter, the clustering unit 500 may read, from the user genetic database 400, the SNP genotypes that each of the plurality of individuals has for the effective SNPs, and may generate, for each of the plurality of individuals, an SNP vector containing those SNP genotypes.
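
Building an SNP vector per individual from the effective SNPs can be sketched as follows. The numeric encoding of genotypes (AA=0, AG=1, GG=2) is an assumption, since the text does not specify how genotypes are represented in the vector; `user_db`, `GENOTYPE_CODE`, and `build_snp_vector` are hypothetical names, and the genotype data are made up.

```python
# Assumed numeric encoding of genotypes; the source does not specify one.
GENOTYPE_CODE = {"AA": 0, "AG": 1, "GG": 2}


def build_snp_vector(person_genotypes, effective_snps):
    """Encode one individual's genotypes for the effective SNPs, in a fixed order."""
    return [GENOTYPE_CODE[person_genotypes[snp]] for snp in effective_snps]


# Hypothetical contents of the user genetic database.
user_db = {
    "person_1": {"rs1800629": "AA", "rs12083537": "GG"},
    "person_2": {"rs1800629": "GG", "rs12083537": "AG"},
}
effective_snps = ["rs1800629", "rs12083537"]

snp_vectors = {
    person: build_snp_vector(genotypes, effective_snps)
    for person, genotypes in user_db.items()
}
```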

As described above, according to the third embodiment, the clustering unit 500 determines, as the effective SNPs, the t SNPs for which the degree of drug effect differs relatively greatly depending on the genotype, and then classifies the plurality of individuals into the first to k-th clusters GR1 to GRk using only the SNP genotypes that each individual has for the effective SNPs. The accuracy and speed of the classification can therefore be effectively improved.
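
The clustering step on the SNP vectors can be illustrated with a plain K-means loop. This is a sketch under stated assumptions rather than the system's actual procedure: the initial centroids are simply the first k vectors, the iteration count is fixed, and the input vectors are made-up numeric encodings of genotypes.

```python
def kmeans(vectors, k, iterations=10):
    """Plain K-means; initial centroids are the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical SNP vectors for four individuals over two effective SNPs.
vectors = [[0, 0], [2, 2], [0, 1], [2, 1]]
labels = kmeans(vectors, k=2)
```

In practice a library implementation with better initialization would likely be preferred; the loop above only shows the assignment and update steps.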

Referring again to FIGS. 1 and 2, the graph generation unit 600 uses the data stored in the internal database 300 to generate first to k-th graphs GP1 to GPk representing the relationships between entities, such as diseases, genes, drugs, and genotypes of polymorphic genes, in each of the first to k-th clusters GR1 to GRk (step S400).

In one embodiment, the graph generation unit 600 may define each of the disease names stored in the disease field (DISEASE_F) of the internal database 300 as a disease node.

In addition, for a row in which no SNP index is stored in the SNP index field (SNP_INDEX_F) of the internal database 300, the graph generation unit 600 may define the gene name stored in the gene field (GENE_F) of that row as a gene node.

In contrast, for a row in which an SNP index is stored in the SNP index field (SNP_INDEX_F) of the internal database 300, the graph generation unit 600 may define a plurality of gene nodes by associating the gene name stored in the gene field (GENE_F) of that row, the SNP index stored in the SNP index field (SNP_INDEX_F) of that row, and each of the genotypes of the SNP corresponding to the SNP index.

For example, for the second row R2 shown in FIG. 3, the graph generation unit 600 may define one gene node by associating IL6R, rs12083537, and AA, another gene node by associating IL6R, rs12083537, and AG, and yet another gene node by associating IL6R, rs12083537, and GG.
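
The gene-node rule described above can be sketched as follows. Representing a node as a tuple and a database row as a dictionary keyed by field name are assumptions made for illustration; `gene_nodes_for_row` is a hypothetical name, and the row without an SNP index is made up.

```python
GENOTYPE_FIELDS = ("AA_F", "AG_F", "GG_F")


def gene_nodes_for_row(row):
    """Return the gene nodes defined by one row of the internal database."""
    gene = row["GENE_F"]
    snp_index = row.get("SNP_INDEX_F")
    if not snp_index:
        # No SNP index stored: the gene name alone is the gene node.
        return [(gene,)]
    # SNP index stored: one node per genotype, e.g. ("IL6R", "rs12083537", "AA").
    return [(gene, snp_index, field[:-2]) for field in GENOTYPE_FIELDS]


nodes_with_snp = gene_nodes_for_row({"GENE_F": "IL6R", "SNP_INDEX_F": "rs12083537"})
nodes_without_snp = gene_nodes_for_row({"GENE_F": "IL6R", "SNP_INDEX_F": None})
```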

In addition, the graph generation unit 600 may define each of the drug names stored in the drug field (DRUG_F) of the internal database 300 as a drug node.

The graph generation unit 600 may define, as an edge, the connection between each pair of mutually related nodes among the disease nodes, the gene nodes, and the drug nodes.

For example, a connection between nodes corresponding to entities stored in the same row of the internal database 300 may be defined as an edge.

In addition, for an edge connecting a gene node and a drug node that correspond to the same row of the internal database 300, the graph generation unit 600 may determine, as the weight of the edge, the effect score stored in the genotype field (AA_F, AG_F, GG_F) of that row that relates to the gene node.

Meanwhile, the graph generation unit 600 may assign no weight to an edge connecting a drug node and a gene node that is not associated with a specific genotype.
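
The edge-weighting rule can be sketched as below. The row layout and the tuple representation of nodes are assumptions, the effect-score values are hypothetical, and `None` is used here to stand for "no weight assigned"; `gene_drug_edges` is a hypothetical name.

```python
def gene_drug_edges(row):
    """Return (gene_node, drug_node, weight) edges for one internal-database row."""
    gene, drug = row["GENE_F"], row["DRUG_F"]
    snp_index = row.get("SNP_INDEX_F")
    if not snp_index:
        # Gene node without a genotype: the edge carries no weight.
        return [((gene,), drug, None)]
    # One edge per genotype node, weighted by the score in the matching field.
    return [
        ((gene, snp_index, genotype), drug, row[f"{genotype}_F"])
        for genotype in ("AA", "AG", "GG")
    ]


# Hypothetical row; the scores are illustrative values only.
row = {
    "DISEASE_F": "Rheumatoid arthritis", "GENE_F": "TNF", "DRUG_F": "etanercept",
    "SNP_INDEX_F": "rs1800629", "AA_F": 1.0, "AG_F": 2.0, "GG_F": 4.0,
}
edges = gene_drug_edges(row)
unweighted = gene_drug_edges({"GENE_F": "IL6R", "DRUG_F": "tocilizumab"})
```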

The graph generation unit 600 may define, as a path, a set consisting of two or more nodes connected to one another through edges together with the edges connecting those nodes.

Accordingly, the two or more nodes included in a path are closely related to one another, and the degree of their relatedness may be defined by the edges connecting them.

Thereafter, for each of the first to k-th clusters GR1 to GRk, the graph generation unit 600 may generate an a-th graph GPa by extracting only the nodes, edges, and paths associated with the SNP genotypes corresponding to the a-th cluster (a is a positive integer less than or equal to k).

For example, from among the plurality of disease nodes, the plurality of gene nodes, the plurality of drug nodes, the plurality of edges, and the plurality of paths, the graph generation unit 600 may generate the a-th graph GPa by extracting only the nodes, edges, and paths associated with the SNP genotypes corresponding to the a-th cluster GRa.
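
Extracting a cluster-specific graph can be sketched as a filter over the edges. Two points are assumptions here: each cluster is described by a mapping from SNP index to the genotype its members share, and edges whose nodes carry no genotype are kept in every cluster's graph. The function names and edge data are hypothetical.

```python
def node_matches_cluster(node, cluster_genotypes):
    """True if the node carries no genotype or carries the cluster's genotype."""
    if isinstance(node, tuple) and len(node) == 3:
        _, snp_index, genotype = node
        return cluster_genotypes.get(snp_index) == genotype
    return True


def extract_cluster_graph(edges, cluster_genotypes):
    """Keep only the edges whose endpoints are compatible with the cluster."""
    return [
        edge for edge in edges
        if node_matches_cluster(edge[0], cluster_genotypes)
        and node_matches_cluster(edge[1], cluster_genotypes)
    ]


# Hypothetical edges of the form (node, node, weight).
edges = [
    (("TNF", "rs1800629", "AA"), "etanercept", 1.0),
    (("TNF", "rs1800629", "GG"), "etanercept", 4.0),
    (("IL6R",), "tocilizumab", None),
]
gp1 = extract_cluster_graph(edges, {"rs1800629": "AA"})
gp2 = extract_cluster_graph(edges, {"rs1800629": "GG"})
```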

As described above, the clustering unit 500 determines, as the effective SNPs, the SNPs for which the degree of drug effect differs relatively greatly depending on the genotype, from among the SNPs corresponding to the plurality of SNP indices stored in the SNP index field (SNP_INDEX_F) of the internal database 300, and clusters the plurality of individuals based on the genotypes of the effective SNPs to classify them into the first to k-th clusters GR1 to GRk. The graph generation unit 600 then generates, for each of the first to k-th clusters GR1 to GRk, the a-th graph GPa by extracting only the nodes, edges, and paths associated with the SNP genotypes corresponding to the a-th cluster GRa. Accordingly, each of the first to k-th graphs GP1 to GPk generated by the graph generation unit 600 may have different connection relationships between entities such as diseases, genes, drugs, and SNP genotypes.

FIG. 6 is a diagram illustrating an example of graphs generated by the graph generation unit included in the user-customized treatment information prediction system of FIG. 1.

FIG. 6 illustrates, by way of example, a first graph GP1 generated for a first group GR1 and a second graph GP2 generated for a second group GR2.

In FIG. 6, the first group GR1 is a group of individuals whose genotype for SNP index rs1800629 of the TNF gene is AA, and the second group GR2 is a group of individuals whose genotype for SNP index rs1800629 of the TNF gene is GG.

As shown in FIG. 6, the first graph GP1 and the second graph GP2 both show that the manifestation of rheumatoid arthritis is associated with the IL6R gene and the TNF gene, and that it can be suppressed by using the drug tocilizumab to induce a biological response at the IL6R gene.

In contrast, the first graph GP1, for the first group GR1 whose genotype for SNP index rs1800629 of the TNF gene is AA, shows the effect of the drug etanercept as low, whereas the second graph GP2, for the second group GR2 whose genotype for SNP index rs1800629 of the TNF gene is GG, shows the effect of etanercept as high.

In this way, each of the first to k-th graphs GP1 to GPk generated by the graph generation unit 600 may have different connection relationships between entities such as diseases, genes, drugs, and SNP genotypes.

Referring again to FIGS. 1 and 2, the deep learning unit 700 uses the first to k-th graphs GP1 to GPk generated by the graph generation unit 600 as training data to train the artificial neural network so that, based on a query entity and query genetic variation information input to the input layer of the artificial neural network, the network outputs an entity associated with the query entity for a user having the query genetic variation information (step S500).

In one embodiment, the deep learning unit 700 may perform embedding on the nodes included in the first to k-th graphs GP1 to GPk such that the larger the weight of an edge connecting a gene node and a drug node, the closer together the gene node and the drug node connected by that edge are mapped.

For example, the deep learning unit 700 may initialize each of the disease nodes, the gene nodes, and the drug nodes as an f-dimensional real-valued vector (f is a positive integer greater than or equal to 2). Then, for any node pair, it may assign the pair a label having a default value if the pair is not connected by an edge, and a label having a value corresponding to the weight of the edge if the pair is connected by an edge.
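
The initialization and labeling of node pairs can be sketched as follows. The default label 0.0, the uniform initialization range, and the string node names are assumptions; the text only says that unconnected pairs get a default value and connected pairs get a value corresponding to the edge weight. The function names are hypothetical.

```python
import random
from itertools import combinations

DEFAULT_LABEL = 0.0  # assumed default label for node pairs without an edge


def init_embeddings(nodes, f, seed=0):
    """Initialize every node as an f-dimensional real-valued vector."""
    rng = random.Random(seed)
    return {node: [rng.uniform(-1.0, 1.0) for _ in range(f)] for node in nodes}


def label_node_pairs(nodes, weighted_edges):
    """Label each node pair with its edge weight, or the default if unconnected."""
    weights = {frozenset((u, v)): w for u, v, w in weighted_edges}
    return {
        (u, v): weights.get(frozenset((u, v)), DEFAULT_LABEL)
        for u, v in combinations(nodes, 2)
    }


# Hypothetical nodes and one weighted gene-drug edge.
nodes = ["TNF|rs1800629|GG", "etanercept", "tocilizumab"]
weighted_edges = [("TNF|rs1800629|GG", "etanercept", 4.0)]

embeddings = init_embeddings(nodes, f=8)
pair_labels = label_node_pairs(nodes, weighted_edges)
```

The labeled pairs could then serve as training targets for an embedding model that pulls strongly weighted pairs closer together.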

Thereafter, the deep learning unit 700 may perform embedding on the nodes included in the first to k-th graphs GP1 to GPk using any of various known embedding models.

The deep learning unit 700 may train the artificial neural network on the paths included in the first to k-th graphs GP1 to GPk using the f-dimensional vectors generated for each of the nodes as a result of the embedding.

In one embodiment, the deep learning unit 700 may select only the paths of high importance from among all the paths included in the first to k-th graphs GP1 to GPk and train the artificial neural network on them.

For example, for each of the paths included in the first to k-th graphs GP1 to GPk, the deep learning unit 700 may determine the sum of the weights of the edges included in the path as a path weight, and may determine, as important paths, the paths having relatively high path weights among the paths included in the first to k-th graphs GP1 to GPk.
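
The path-weight rule can be sketched as below. Treating an unweighted edge as contributing 0 to the sum and selecting a fixed number `top_n` of paths are assumptions, since the text only says that paths with relatively high path weights are chosen. The function names and path data are hypothetical.

```python
def path_weight(path_edges):
    """Sum the edge weights on a path; unweighted edges count as 0 (assumption)."""
    return sum(weight or 0.0 for _, _, weight in path_edges)


def important_paths(paths, top_n):
    """Return the top_n paths with the highest path weight."""
    return sorted(paths, key=path_weight, reverse=True)[:top_n]


# Hypothetical paths, each a list of (node, node, weight) edges.
path_aa = [
    ("Rheumatoid arthritis", "TNF|rs1800629|AA", None),
    ("TNF|rs1800629|AA", "etanercept", 1.0),
]
path_gg = [
    ("Rheumatoid arthritis", "TNF|rs1800629|GG", None),
    ("TNF|rs1800629|GG", "etanercept", 4.0),
]
selected = important_paths([path_aa, path_gg], top_n=1)
```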

Thereafter, the deep learning unit 700 may train the artificial neural network on only the important paths among the paths included in the first to k-th graphs GP1 to GPk, using the f-dimensional vectors generated for each of the nodes as a result of the embedding.

Once training of the artificial neural network is completed through the operations described above, the artificial neural network can output, based on a query entity and query genetic variation information input to the input layer, an entity associated with the query entity for a user having the query genetic variation information.

Referring again to FIGS. 1 and 2, the prediction unit 800 receives, from outside, a query entity QE corresponding to one of a disease, a gene, and a drug, and query genetic variation information QGM corresponding to the genetic variation information of a user, inputs them to the artificial neural network, and outputs the entity output from the artificial neural network as a target entity TE that is highly associated with the query entity QE for that user (step S600).

As described above with reference to FIGS. 1 to 6, the user-customized treatment information prediction system 10 and the user-customized treatment information prediction method according to embodiments of the present invention cluster the plurality of individuals into people having similar genetic information according to SNP genotype and classify them into the first to k-th clusters GR1 to GRk, generate the first to k-th graphs GP1 to GPk representing the relationships between entities such as diseases, genes, drugs, and SNP genotypes in each of the first to k-th clusters GR1 to GRk, and then train the artificial neural network on the paths included in the first to k-th graphs GP1 to GPk, thereby generating a user-customized treatment information prediction model.

Accordingly, the user-customized treatment information prediction system 10 and the user-customized treatment information prediction method according to embodiments of the present invention can effectively predict, in a manner customized to the user, the diseases, genes, and drugs associated with the user based on the user's genetic variation information.

In addition, the user-customized treatment information prediction system 10 and the user-customized treatment information prediction method according to embodiments of the present invention can be usefully applied to developing new drugs suited to particular genetic variations.

The present invention can be usefully employed to predict, in a user-customized manner, treatment information suitable for a user based on the user's genetic variation information.

Although the present invention has been described above with reference to preferred embodiments, those of ordinary skill in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

10: User-customized treatment information prediction system
20: Drug database 30: Gene polymorphism database
100: Data collection unit 200: Information extraction unit
300: Internal database 400: User genetic database
500: Clustering unit 600: Graph generation unit
700: Deep learning unit 800: Prediction unit

Claims

A method for predicting user-customized treatment information, the method comprising:
collecting, by a data collection unit, data in text form from at least one drug database storing information on diseases, genes, and drugs and from at least one gene polymorphism database storing information on gene polymorphisms;
extracting, by an information extraction unit, diseases, genes, drugs, and genotypes of polymorphic genes as entities from the data collected by the data collection unit using a natural language processing algorithm, deriving relationships between the entities, and storing the relationships between the entities in a standardized form in an internal database;
classifying, by a clustering unit, a plurality of individuals into first to k-th (k is an integer greater than or equal to 2) clusters by clustering the plurality of individuals into people having similar genetic information, using the internal database and a user genetic database that stores genetic variation information of each of the plurality of individuals;
generating, by a graph generation unit, first to k-th graphs representing the relationships between the entities in each of the first to k-th clusters using the data stored in the internal database;
training, by a deep learning unit, an artificial neural network using the first to k-th graphs as training data so that, based on a query entity and query genetic variation information input to an input layer of the artificial neural network, the artificial neural network outputs an entity associated with the query entity for a user having the query genetic variation information; and
inputting, by a prediction unit, the query entity corresponding to one of a disease, a gene, and a drug and the query genetic variation information corresponding to genetic variation information of a user into the artificial neural network, and outputting the entity output from the artificial neural network as a target entity that is highly associated with the query entity for the user.

The method of claim 1, wherein the internal database includes a disease field, a gene field, a drug field, a single nucleotide polymorphism (SNP) index field, and a plurality of genotype fields, and
wherein the extracting of the entities, the deriving of the relationships, and the storing of the relationships in a standardized form in the internal database by the information extraction unit comprises:
extracting, by the information extraction unit, mutually related diseases, genes, and drugs from the data collected by the data collection unit, extracting an SNP index corresponding to an SNP of the gene and genotypes of the SNP, and extracting the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP; and
storing, by the information extraction unit, the name of the disease, the name of the gene, the name of the drug, the SNP index, and the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP in the disease field, the gene field, the drug field, the SNP index field, and the corresponding genotype fields of the internal database, respectively.

The method of claim 2, wherein the classifying of the plurality of individuals into the first to k-th clusters by the clustering unit comprises:
reading all SNP indices stored in the internal database;
reading, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the read SNP indices;
generating, for each of the plurality of individuals, an SNP vector including the SNP genotypes that the individual has for the read SNP indices; and
classifying the plurality of individuals into the first to k-th clusters by applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals.

The method of claim 3, wherein the classifying of the plurality of individuals into the first to k-th clusters by the clustering unit further comprises:
reducing, when the dimension of the SNP vector generated for each of the plurality of individuals is higher than a predetermined p dimension (p is a positive integer), the dimension of the SNP vectors corresponding to the plurality of individuals to a q dimension (q is a positive integer less than p) lower than the p dimension; and
classifying the plurality of individuals into the first to k-th clusters by applying a K-means algorithm to the plurality of reduced SNP vectors corresponding to the plurality of individuals.

The method of claim 4, wherein, when the dimension of the SNP vector generated for each of the plurality of individuals is higher than the p dimension, the clustering unit generates the reduced SNP vectors by applying a principal component analysis (PCA) dimension reduction algorithm to the SNP vectors corresponding to the plurality of individuals, thereby reducing their dimension to the q dimension lower than the p dimension.

The method of claim 4, wherein, when the dimension of the SNP vector generated for each of the plurality of individuals is higher than the p dimension, the clustering unit generates the reduced SNP vectors by performing unsupervised learning using an autoencoder, thereby reducing the dimension of the SNP vectors corresponding to the plurality of individuals to the q dimension lower than the p dimension.

The method of claim 2, wherein the classifying of the plurality of individuals into the first to k-th clusters by the clustering unit comprises:
determining, as effective SNPs, t (t is an integer greater than or equal to 2) SNPs for which the degree of drug effect differs relatively greatly depending on the genotype, from among the SNPs corresponding to the SNP indices stored in the internal database;
reading, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs; and
classifying the plurality of individuals into the first to k-th clusters by grouping together the individuals having the same genotypes for the effective SNPs.

The method of claim 2, wherein the classifying of the plurality of individuals into the first to k-th clusters by the clustering unit comprises:
determining, as effective SNPs, t (t is an integer greater than or equal to 2) SNPs for which the degree of drug effect differs relatively greatly depending on the genotype, from among the SNPs corresponding to the SNP indices stored in the internal database;
reading, from the user genetic database, the SNP genotypes that each of the plurality of individuals has for the effective SNPs;
generating, for each of the plurality of individuals, an SNP vector including the SNP genotypes that the individual has for the effective SNPs; and
classifying the plurality of individuals into the first to k-th clusters by applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals.

The method of claim 2, wherein the information extraction unit classifies the degree of effect of the drug on the disease in patients having each of the genotypes of the SNP into levels ranging from a first level with the lowest effect to a d-th level (d is an integer greater than or equal to 2) with the highest effect, and stores a predetermined effect score for each of the first to d-th levels in each of the corresponding genotype fields of the internal database.

The method of claim 9, wherein the step of classifying the plurality of individuals into the first to kth clusters by the clustering unit comprises:
determining an information amount score for each of the SNPs corresponding to all SNP indices stored in the internal database by applying the following equation to the data stored in the internal database:

(here, score_i represents the information amount score of the i-th SNP; w_i represents the allele frequency of the i-th SNP; d represents the number of drugs stored in the internal database; AA_ij represents the effect score indicating the degree of the effect with which the first genotype of the i-th SNP responds to the j-th drug; AG_ij represents the effect score indicating the degree of the effect with which the second genotype of the i-th SNP responds to the j-th drug; and GG_ij represents the effect score indicating the degree of the effect with which the third genotype of the i-th SNP responds to the j-th drug);
determining, as effective SNPs, t SNPs having relatively large information amount scores from among the SNPs corresponding to the SNP indices stored in the internal database;
reading SNP genotypes corresponding to the effective SNPs of each of the plurality of individuals from the user gene database; and
classifying the plurality of individuals into the first to kth clusters by grouping together individuals having the same genotypes for the effective SNPs.

(here, score_i represents the information amount score of the i-th SNP; w_i represents the allele frequency of the i-th SNP; d represents the number of drugs stored in the internal database; AA_ij represents the effect score indicating the degree of the effect with which the first genotype of the i-th SNP responds to the j-th drug; AG_ij represents the effect score indicating the degree of the effect with which the second genotype of the i-th SNP responds to the j-th drug; and GG_ij represents the effect score indicating the degree of the effect with which the third genotype of the i-th SNP responds to the j-th drug);
determining, as effective SNPs, t SNPs having relatively large information amount scores from among the SNPs corresponding to the SNP indices stored in the internal database;
reading SNP genotypes corresponding to the effective SNPs of each of the plurality of individuals from the user gene database;
generating, for each of the plurality of individuals, a SNP vector comprising SNP genotypes corresponding to the effective SNPs of each of the plurality of individuals; and
classifying the plurality of individuals into the first to kth clusters by applying a K-means algorithm to the plurality of SNP vectors corresponding to the plurality of individuals.

The method according to claim 9, wherein the generating of the first to kth graphs indicating the relationship between the entities in each of the first to kth clusters by the graph generating unit using data stored in the internal database comprises:
defining each of the names of the disease stored in the disease field of the internal database as a disease node;
for a row in which no SNP index is stored in the SNP index field of the internal database, defining the name of the gene stored in the gene field of the row as a gene node, and, for a row in which an SNP index is stored in the SNP index field of the internal database, defining a plurality of gene nodes, each associating the name of the gene stored in the gene field of the row, the SNP index stored in the SNP index field of the row, and one of the genotypes of the SNP corresponding to the SNP index;
defining each of the names of the drugs stored in the drug field of the internal database as a drug node;
defining, as an edge, a connection between each pair of mutually correlated nodes among the disease nodes, the gene nodes, and the drug nodes;
determining, for each edge connecting a drug node and a gene node that correspond to the same row of the internal database, the effect score stored in the genotype field of that row related to the gene node as the weight of the edge;
defining, as a path, two or more nodes connected to each other through the edges together with the set of edges connecting the two or more nodes; and
generating, for each of the first to k-th clusters, an a-th graph by extracting only the nodes, edges, and paths associated with the SNP genotypes corresponding to the a-th (a is a positive integer less than or equal to k) cluster.
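
The graph-generation steps recited above can be illustrated with the following Python sketch. It is a sketch under stated assumptions, not the claimed implementation: the row layout (`disease`, `gene`, `snp`, `drug`, `scores`), the function name, and all sample values are hypothetical. The claim assigns weights only to drug-gene edges, so edges with no recited weight are stored here with `None`.

```python
def build_cluster_graph(rows, cluster_genotypes):
    """Build the graph of one cluster from internal-database rows.

    rows: list of dicts with hypothetical keys "disease", "gene",
          "snp" (SNP index or None), "drug", and "scores"
          (genotype -> effect score).
    cluster_genotypes: {SNP index: genotype} characterising the cluster.
    Returns (nodes, edges), where edges maps (node, node) -> weight.
    """
    nodes, edges = set(), {}
    for row in rows:
        disease = ("disease", row["disease"])
        drug = ("drug", row["drug"])
        if row["snp"] is None:
            # No SNP index: the gene name alone is the gene node.
            gene_nodes = [(("gene", row["gene"]), None)]
        else:
            # SNP index present: one gene node per genotype, but only the
            # genotype that belongs to this cluster is kept.
            wanted = cluster_genotypes.get(row["snp"])
            gene_nodes = [
                (("gene", row["gene"], row["snp"], genotype), score)
                for genotype, score in row["scores"].items()
                if genotype == wanted
            ]
        for gene, score in gene_nodes:
            nodes.update([disease, gene, drug])
            edges[(disease, gene)] = None   # no weight recited for this edge
            edges[(gene, drug)] = score     # effect score as the edge weight
    return nodes, edges

rows = [
    {"disease": "disease_X", "gene": "GENE1", "snp": "rs0001",
     "drug": "drug_A", "scores": {"AA": 1, "AG": 2, "GG": 3}},
    {"disease": "disease_X", "gene": "GENE2", "snp": None,
     "drug": "drug_B", "scores": {}},
]
nodes, edges = build_cluster_graph(rows, {"rs0001": "AG"})
```

Calling the function once per cluster, each time with that cluster's genotypes, yields the first to k-th graphs.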

The method of claim 12, wherein the step of learning the artificial neural network by the deep learning unit comprises:
performing embedding on the nodes included in the first to k-th graphs such that the greater the weight of the edge connecting a gene node and a drug node, the closer to each other the gene node and the drug node connected by the edge are mapped; and
training the artificial neural network on the paths included in the first to kth graphs using the vectors generated for each of the nodes as a result of the embedding.

The method of claim 12, wherein the step of learning the artificial neural network by the deep learning unit comprises:
performing embedding on the nodes included in the first to k-th graphs such that the greater the weight of the edge connecting a gene node and a drug node, the closer to each other the gene node and the drug node connected by the edge are mapped;
determining, as a path weight, a sum of the weights of the edges included in the path for each of the paths included in the first to k-th graphs;
determining paths having a relatively high path weight among the paths included in the first to kth graphs as important paths; and
training the artificial neural network on only the important paths among the paths included in the first to k-th graphs using the vectors generated for each of the nodes as a result of the embedding.
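
The path-weight and important-path steps recited above can be sketched in Python as follows. The claim says only that important paths have a "relatively high" path weight, so selecting the top `top_n` paths is an assumption, as are the function names and the sample graph. Every edge is assumed to carry a numeric weight; the disease-gene weights below are placeholders, since the claims define weights only for drug-gene edges.

```python
def path_weight(path, edge_weights):
    """Sum the weights of the consecutive edges along a path of nodes."""
    return sum(edge_weights[(a, b)] for a, b in zip(path, path[1:]))

def important_paths(paths, edge_weights, top_n):
    """Return the top_n paths with the highest path weight (assumed rule)."""
    ranked = sorted(paths, key=lambda p: path_weight(p, edge_weights),
                    reverse=True)
    return ranked[:top_n]

# Made-up weighted edges; node labels are hypothetical.
edge_weights = {
    ("disease_X", "GENE1:AG"): 1,
    ("GENE1:AG", "drug_A"): 3,
    ("disease_X", "GENE1:GG"): 1,
    ("GENE1:GG", "drug_A"): 1,
}
paths = [
    ["disease_X", "GENE1:GG", "drug_A"],
    ["disease_X", "GENE1:AG", "drug_A"],
]
selected = important_paths(paths, edge_weights, top_n=1)
```

Only the selected paths would then be passed to training, which reduces the amount of training data while keeping the gene-drug relationships with the strongest recorded effects.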

A user-customized treatment information prediction system comprising:
a data collection unit that collects data in text form from at least one drug database storing disease, gene, and drug-related information and from at least one gene polymorphism database storing gene polymorphism-related information;
an information extraction unit that extracts diseases, genes, drugs, and genotypes of polymorphic genes as entities from the data collected by the data collection unit using a natural language processing algorithm, derives relationships between the entities, and stores the relationships in a standardized form in an internal database;
a clustering unit that classifies a plurality of individuals into first to k-th (k is an integer of 2 or more) clusters by grouping together individuals having similar genetic information, using the internal database and a user gene database including genetic variation information of each of the plurality of individuals;
a graph generating unit that generates first to k-th graphs representing the relationships between the entities in each of the first to k-th clusters using data stored in the internal database;
a deep learning unit that trains an artificial neural network, using the first to k-th graphs as learning data, so that, based on a query entity and query genetic variation information input to an input layer of the artificial neural network, the artificial neural network outputs an entity associated with the query entity for a user having the query genetic variation information; and
a prediction unit that inputs, to the artificial neural network, a query entity corresponding to one of a disease, a gene, and a drug, together with query genetic variation information corresponding to the user's genetic variation information, and outputs the entity output from the artificial neural network as a target entity having a high degree of relevance to the query entity for the user.