KR101693504B1

KR101693504B1 - Discovery system for disease cause by genetic variants using individual whole genome sequencing data

Info

Publication number: KR101693504B1
Application number: KR1020150187554A
Authority: KR
Inventors: 정종선
Original assignee: (주)신테카바이오
Priority date: 2015-12-28
Filing date: 2015-12-28
Publication date: 2017-01-17
Also published as: WO2017116123A1

Abstract

The present invention relates to a system for discovering the cause of a disease by comparing an individual's whole-genome information with multiple whole-genome databases built by genome projects and analyzing the genetic information in the personal genome. The invention comprises: an analysis data input unit that receives analysis data including personal genome information; a search control unit that compares the genetic information stored in the database with the genome information to be analyzed, calculates analysis results including at least one of the genotype of each gene, genotypes associated with phenotypes, rare variants, disease variants, and bioactive variants, and generates a result report from the analysis results; and a storage unit that stores genetic information of a control group for comparison with the analysis data. By effectively comparing the variant information stored in the control database with the personal genome under analysis, the invention provides a genetic analysis platform with improved efficiency in determining the genotypes of a personal genome and in detecting significant variants.

Description

[0001] The present invention relates to a system for discovering the cause of a disease using genetic variant information from an individual's whole genome.

The present invention relates to a system that analyzes and provides genetic information from a personal genome by comparing input personal whole-genome information against multiple whole-genome databases built by genome projects.

The IT market is currently trending from Google to cloud computing to ubiquitous computing. In parallel, the biomedical, bioinformatics, and genomics fields are shifting from "bio-Google" to systems biology to personalized medicine. In particular, in the post-Human Genome Project era, next-generation sequencing technology has advanced rapidly, and efforts to make personalized medicine a reality are actively under way.

Current next-generation technology is reported to take about two weeks to sequence the genome of one person (30x). About 20,000 next-generation sequencers have reportedly been supplied worldwide, and about KRW 500 billion was reportedly invested recently in the major developers of third-generation sequencers (Ion Torrent: 2.5 generation; Pacific BioScience: third generation).

In addition, more than KRW 10 trillion has been invested in the field worldwide, and development is ongoing. If this trend continues, the cost of sequencing one person's whole genome is expected to fall to about $1,000 within two to three years. The most useful and most immediately practical applications of this next-generation technology are expected to be clinical genomics, pharmacogenomics, and translational medicine.

However, before commercial personalized-medicine services based on personal genetic analysis can be provided, the following requirements must be addressed.

First, the speed of analysis between the control group and a personal genome must improve. This requirement grows as the control group is enlarged to increase analysis accuracy. Meeting it requires an analysis system with higher analysis speed and an analysis platform with a database schema that improves analysis efficiency.

Second, an efficient method is needed for detecting variant markers between the control group and a personal genome, together with an effective method for classifying the detected markers by analysis purpose and deriving analysis results.

That is, the human whole genome contains an astronomical number of bases, and the number of variant markers detected is typically vast as well. To provide a commercial personalized-medicine service, analysis results must therefore be derived from the detected variant markers through a streamlined classification and verification scheme.

Third, to improve the quality of personalized-medicine services, a reporting module is needed that visualizes analysis results so that users can readily understand them.

That is, even when variant markers in a personal genome are detected quickly and accurately, a user who is given a plain list of a vast number of markers cannot grasp the meaning or degree of each variant. A reporting module that aids user comprehension must therefore be provided.

Finally, to provide more accurate and more varied analysis results, information such as the characteristics and mechanisms of action of more variants must be collected and made available.

Genetic variation is an area of continuous, active research at universities and research institutes around the world. A substantial body of validated information already exists, and research results continue to accumulate.

For example, laws such as the Bioethics and Safety Act define about 138 disease entries associated with genetic variants; overseas databases such as PheWAS-GWAS and eMEREG contain phenotype information for about 1,700 genes; and the drug bank and the metabolite bank contain resistance and susceptibility information for drugs and metabolites and the proteins expressed by about 6,000 and about 12,000 genes, respectively.

Such variant-related information grows every year, and the analysis accuracy and scope of the present invention will continue to expand as the verification data expands.

Meanwhile, the applicant has continued to develop technology to address the technical requirements of the genetic analysis field described above.

As a result of these efforts, the applicant developed a GPU-based analysis system to improve genome analysis speed (Patent Registration 10-0996443); developed an information retrieval method based on RVR files as a technique for speeding up data comparison (Patent Registrations 10-0880531, 10-1035959, and 10-1117603); and developed the ADISCAN method for efficiently determining the degree of variation between the control group and a personal genome (Patent Registrations 10-1400717 and 10-1460520; Patent Applications 10-2014-0020738 and 10-2014-0020736).

(001) Korean Registered Patent No. 10-0996443
(002) Korean Registered Patent No. 10-1035959
(003) Korean Registered Patent No. 10-1117603
(004) Korean Registered Patent No. 10-1400717
(005) Korean Registered Patent No. 10-1460520
(006) Korean Patent Publication No. 10-20120053623
(007) Korean Patent Publication No. 10-20150024232
(008) Korean Patent Publication No. 10-20150024231

The present invention was devised to address the above requirements for realizing commercial personalized genomic medicine. An object of the present invention is to provide a genetic analysis platform with a database schema that improves the speed and efficiency of variant detection in a personal genome.

Another object of the present invention is to provide a genetic analysis platform that includes a reporting module presenting the detected variant information in a form that users can readily understand.

According to a feature of the present invention for achieving the above objects, the present invention comprises: an analysis data input unit that receives analysis data including personal genome information; a search control unit that compares the genetic information stored in a database with the genome information to be analyzed, calculates analysis results for the rare variants and disease variants of each gene, and generates a result report from the analysis results; and a storage unit that stores genetic information of a control group for comparison with the analysis data.
The storage unit includes an ADISCAN DB in which the whole-genome information of the control population is stored partitioned according to classification criteria including race, and the search control unit includes an ADISCAN engine that compares each base in the analysis data with the ADISCAN DB and calculates its rarity relative to the control population.
The storage unit further includes an IDA DB that stores, for each of a plurality of diseases, known genetic variant information related to that disease, and the search control unit includes an IDA search engine that compares the analysis data with the IDA DB to detect known gene-related disease variants in the analysis data.
The storage unit further includes a clinical information DB that stores the environmental predisposition information of the subject, which must be considered together with genetic characteristics to derive a clinical-information-based disease cause prediction. The search control unit calculates a disease cause relationship (Πx) using a formula obtained by logistic regression and derives the disease cause prediction from it.
The formula for the disease cause relationship is not reproduced in this text. In that formula, the variable β is a parameter according to the personal health records (PHR) stored in the clinical information DB, including the subject's age, sex, or body mass index (BMI), and the variable χ is a parameter according to the disease-related genotypes in the analysis data as calculated by the search control unit.
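
Because the formula itself does not appear in the text, the following is only a rough illustration of the general shape such a calculation could take. It assumes the standard logistic form, 1 / (1 + exp(-z)), with z a weighted sum of health-record terms and genotype terms. The function name, the intercept, and every coefficient and input value are hypothetical and are not taken from the patent.

```python
import math


def disease_cause_relationship(intercept, phr_terms, genotype_terms):
    """Return a value in (0, 1) from the standard logistic function.

    phr_terms and genotype_terms are lists of (coefficient, value) pairs.
    This is a generic logistic-regression form, assumed for illustration;
    the patent's actual formula is not reproduced in the source text.
    """
    z = intercept
    for coefficient, value in list(phr_terms) + list(genotype_terms):
        z += coefficient * value
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: age, sex (coded 0/1), and BMI from the health record,
# plus one disease-related genotype coded as present (1) or absent (0).
phr = [(0.03, 52), (0.4, 1), (0.05, 27.5)]
without_variant = disease_cause_relationship(-5.0, phr, [(0.9, 0)])
with_variant = disease_cause_relationship(-5.0, phr, [(0.9, 1)])
print(without_variant, with_variant)
```

With a positive genotype coefficient, the value for a subject carrying the variant is higher than for the same subject without it, which is the behavior one would expect from a model combining environmental and genetic terms.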
The storage unit may include a HaploScan DB that stores genotype information of control-group genes for comparison with the analysis data, and the search control unit may include a HaploScan engine that compares the analysis data with the HaploScan DB to determine the genotypes in the analysis data.
The HaploScan DB may include a single-gene information database that stores genotype information for single genes, and a multi-gene information database that stores genotype information for multiple genes by phenotype.
The single-gene information database may include a single-gene Haplo map, in which the haplotypes and trait frequencies by race for each single gene of the control group are clustered and stored by share, and single-gene Haplo frequency information, which stores variant information for the variants that distinguish the genotypes of the single genes stored in the single-gene Haplo map.
The multi-gene information database may include a multi-gene Haplo map, in which the variant distribution of genotype-associated bases across the multiple genes of the control group is clustered by race for each phenotype and stored by share, and multi-gene Haplo frequency information, which stores variant information for the variants that distinguish the genotypes for the phenotypes stored in the multi-gene Haplo map.
The search control unit may also generate a result report in which the characteristics of the detected variant genes are shown on a Manhattan plot, where values classified by genotype and accumulated over the whole genome are visualized as points.
The Manhattan plot may display a cut-off value that indicates whether a variant gene is significant.
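
As an illustration of how points and a cut-off might be prepared for such a plot, the sketch below lays variants out along a single genome-wide axis and flags those at or above the cut-off. The `VariantScore` type, the `manhattan_points` function, the chromosome lengths, and the scores are all hypothetical; the source does not specify how the plotted values are computed.

```python
from dataclasses import dataclass


@dataclass
class VariantScore:
    chrom: int    # chromosome number
    pos: int      # position within the chromosome
    score: float  # significance value to plot on the y axis


def manhattan_points(variants, chrom_lengths, cutoff):
    """Return (x, y, is_significant) tuples ordered along the genome.

    x is the position offset by the total length of all preceding
    chromosomes, so every chromosome occupies its own band on the x axis.
    """
    offsets = {}
    running = 0
    for chrom in sorted(chrom_lengths):
        offsets[chrom] = running
        running += chrom_lengths[chrom]

    points = []
    for v in sorted(variants, key=lambda v: (v.chrom, v.pos)):
        points.append((offsets[v.chrom] + v.pos, v.score, v.score >= cutoff))
    return points


# Toy data: two short "chromosomes" and three scored variants.
toy = [VariantScore(2, 50, 7.2), VariantScore(1, 900, 1.1), VariantScore(1, 10, 5.0)]
for point in manhattan_points(toy, {1: 1000, 2: 800}, cutoff=5.0):
    print(point)
```

The resulting tuples can be passed to any plotting library; the cut-off would be drawn as a horizontal line at the chosen value.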

(Deleted)

As described above, the system for discovering the cause of a disease using genetic variant information from an individual's whole genome according to the present invention effectively compares the variant information stored in the control database with the personal genome under analysis, and thereby provides a genetic analysis platform with improved efficiency in determining the genotypes of a personal genome and in detecting significant variants.

FIG. 1 is an exemplary diagram showing the configuration of a genetic analysis service to which the present invention is applied.
FIG. 2 is a block diagram showing the main configuration of a disease cause discovery system according to a specific embodiment of the present invention.
FIG. 3 is an exemplary diagram showing the configuration of the main databases constituting the disease cause discovery system according to the present invention.
FIG. 4 is an exemplary diagram showing a configuration example of the HaploScan DB constituting a specific embodiment of the present invention.
FIG. 5 is an exemplary diagram showing a configuration example of the ADISCAN DB constituting a specific embodiment of the present invention.
FIG. 6 is an exemplary diagram showing a configuration example of the IDA DB according to a specific embodiment of the present invention.
FIG. 7 is an exemplary diagram showing a configuration example of the BAV DB according to a specific embodiment of the present invention.
FIG. 8 is a flowchart showing a genetic information analysis method according to a specific embodiment of the present invention.
FIG. 9 is an exemplary diagram showing an example of a method for generating DNA sequencing data according to a specific embodiment of the present invention.
FIG. 10 is an exemplary diagram showing an example of a result report generated by a specific embodiment of the present invention.
FIG. 11 is an exemplary diagram showing another example of a result report generated by a specific embodiment of the present invention.
FIG. 12 is an exemplary diagram showing an example of bioactive variant verification according to a specific embodiment of the present invention.
FIG. 13 is an exemplary diagram showing an example of calculating a clinical-information-based disease cause prediction according to a specific embodiment of the present invention.

Hereinafter, a system for discovering the cause of a disease using genetic variant information from an individual's whole genome according to a specific embodiment of the present invention is described in detail with reference to the accompanying drawings.

First, the configuration of a genetic analysis service to which the disease cause discovery system according to the present invention is applied is briefly described.

As shown in FIG. 1, in the genetic analysis service, samples such as blood are collected at a personal gene collection institution such as a hospital, and the samples are sent to a DNA diagnostics company for diagnosis.

The DNA diagnostics company then produces a custom DNA chip from the collected sample or performs DNA sequencing (NGS, next-generation sequencing). Because technological progress now allows DNA sequencing data to be generated in many ways, the generation method may vary with the technology available to the DNA diagnostics company.

The DNA sequencing data generated in this way is analyzed by a disease cause discovery system such as that of the present invention to extract the genetic information contained in the personal genome, and the resulting analysis information is delivered to a diagnostic institution such as a hospital or to the requester.

When raw DNA data is provided by the DNA diagnostics company, the disease cause discovery system according to the present invention converts it into a highly compact indexing file and then analyzes the genome sequence, which is big data.

This is described again later with reference to FIG. 9.

That is, the present invention relates to a disease cause discovery system that analyzes the genetic information contained in a personal genome from DNA sequencing information. The system is described in detail below.

FIG. 2 is a block diagram showing the main configuration of a disease cause discovery system according to a specific embodiment of the present invention; FIG. 3 is an exemplary diagram showing the configuration of the main databases constituting the system; FIG. 4 is an exemplary diagram showing a configuration example of the HaploScan DB; FIG. 5 is an exemplary diagram showing a configuration example of the ADISCAN DB; FIG. 6 is an exemplary diagram showing a configuration example of the IDA DB; and FIG. 7 is an exemplary diagram showing a configuration example of the bioactive variant (BAV) DB.

As shown in FIG. 2, the disease cause discovery system according to the present invention comprises an analysis data input unit 100, a search control unit 200, a result report providing unit 300, a HaploScan DB 400, an ADISCAN DB 500, an IDA DB 600, a bioactivity (BAV) DB 700, and a Reference DB 800.

The analysis data input unit 100 receives personal genome information in the form of DNA sequencing data.

The search control unit 200 detects, from the input DNA sequencing data, the genotype of each gene, genotypes associated with phenotypes, rare variants, disease variants, and bioactive variants. To this end, the search control unit 200 comprises a HaploScan engine 210, an ADISCAN engine 220, an IDA search engine 230, and a bioactive variant search engine 240.

The HaploScan engine 210 determines genotypes by comparing the analysis data (the input DNA sequencing data) with the Haplo maps 414 and 424 stored in the HaploScan DB 400, described below.

The structure of the HaploScan DB 400 and the search method of the HaploScan engine 210 are described in detail later.

The ADISCAN engine 220 compares each base in the input analysis data with the ADISCAN DB 500 using the ADISCAN method and calculates its rarity relative to the control population.

The IDA search engine 230 detects known gene-related disease variants by comparing the analysis data with the IDA DB 600, in which known disease variants are stored.

The bioactive variant search engine 240 detects genetic variants related to protein metabolism. Broadly, it determines whether variants are present in the amino acids involved in protein-drug, protein-DNA, and protein-protein binding.

To do so, the bioactive variant search engine 240 compares the analysis data with the BAV DB 700 and determines whether the bases in the analysis data that correspond to the binding-related amino acids stored in the BAV DB 700 are mutated.

Meanwhile, the search control unit 200 generates a result report using a Manhattan plot and a radial variant-significance chart so that the diagnostician (or user) can readily see the genotypes determined by the HaploScan engine 210 and the significance (rarity) of each base determined by the ADISCAN engine 220.

The generated result report is provided to the user through the result report providing unit 300.

The database structure of the disease cause discovery system according to the present invention is described below.

The disease cause discovery system according to the present invention broadly comprises the HaploScan DB 400, the ADISCAN DB 500, the IDA DB 600, the BAV DB 700, and the Reference DB 800.

As shown in FIG. 3, the HaploScan DB 400 is a database that organizes the genotypes of control-group genes in order to derive genotypes from the personal genome information under analysis. As shown in FIG. 2, the HaploScan DB 400 comprises a single-gene information database 410 and a multi-gene information database 420.

The single-gene information database 410 stores the genotypes of single genes and comprises a single-gene Haplo map 414 and single-gene Haplo frequency information 412.

As shown in FIG. 4, the single-gene Haplo map 414 stores, for each gene across the entire control group, the variant distribution clustered by share. It summarizes, for each gene, the haplotypes computed across 26 racial groups worldwide, the frequency of specific traits, and the frequency within each sub-group.

The single-gene Haplo frequency information 412 stores information about each of these variants. It may store the variant information directly, or it may consist of identifiers that point to the location of information stored in the Reference DB 800, described below. That is, the single-gene Haplo frequency information 412 provides, for 39,000 human genes and 5,000 individuals from racial groups worldwide, the frequency in each gene together with various disease-association annotations.

The multi-gene information database 420 provides variant distributions and information for multiple genes and comprises a multi-gene Haplo map 424 and multi-gene Haplo frequency information 422.

For genetic traits whose phenotype is determined by multiple genes, the multi-gene Haplo map 424 stores, for each phenotype, the variant distribution at the relevant bases across the entire control group, clustered by share. It summarizes the haplotypes computed from the causal variants of each phenotype across 26 racial groups worldwide, the frequency of specific traits, and the frequency within each sub-group.

The multi-gene Haplo frequency information 422 stores information about each of these variants. It too may store the variant information directly, or it may consist of identifiers that point to the location of information stored in the Reference DB 800, described below.

That is, the multi-gene Haplo frequency information 422 provides, for 39,000 human genes and 5,000 individuals from racial groups worldwide, the frequencies of phenotype-associated gene sets together with various disease-association annotations.

이를 도 4에 도시된 예를 통해 설명하면, HaploScan DB(400)의 X축은 30억 염기서열이고, 상기 염기서열에서 유전자는 39,000개가 있다. 이의 스키마에서 특정 유전자(i)에서 변이가 N(개) 발견이 되었다면, 상기 변이를 Y축: 5,000명에서 haplotype 및 genotype 모두를 사용하여 군집화를 할 수 있고, 군집화가 된 형태가 HaploMap이된다. Referring to the example shown in FIG. 4, the X-axis of the HaploScan DB 400 is 3 billion nucleotides and the nucleotide sequence has 39,000 genes. If the mutation is found in the specific gene (i) in its schema, the mutation can be clustered using both the haplotype and the genotype in the Y axis: 5,000, and the clustered form becomes the HaploMap.

이때, 각 군집은 각 유전형을 의미하는데 이들의 내용을 살펴보면, 첫 번째 GP*47*0 는 그 유전형이 세계인에서 47%를 차지하고, 세계인의 평균과 비교해서 0 bit 다르고(동일하고), 두 번째 유전형 GP*25*1은 세계인에서 25%를 차지함을 의미하며, 세계인의 평균과 비교해서 1 bit 다르다는 것을 의미한다. The first GP * 47 * 0 represents 47% of the global population, 0 bits different from the global average (same), and the second group The genotype GP * 25 * 1 means 25% of the world population, which means that it is 1 bit different from the global average.

The multi-gene HaploMap is classified in the same way.

The ADISCAN DB 500 stores the genome information of the control population. Specifically, genome information published by global genome projects can be used as the population genome.

As shown in FIGS. 3 and 5, the ADISCAN DB 500 stores the whole-genome information of the control population, and may store it partitioned by criteria that define genotype groups, such as race.

The partition by race may use 5 major categories or 26 subcategories, so that race-specific genetic characteristics are taken into account when determining and detecting variant genes.

The IDA DB 600 stores known diseases and the genetic variants associated with them. As shown in FIGS. 3 and 6, it organizes, for each disease, the associated gene variant information and the literature supporting that information.

The BAV DB 700 stores the gene information that determines the amino acids at the binding sites of various proteins.

Specifically, for protein-drug, protein-DNA, and protein-protein binding, it stores the amino acids that affect the binding and the gene information that affects those amino acids.

Accordingly, when many variants occur in the bases encoding the amino acids that govern the binding of a particular metabolite, the subject of the analysis data is more likely to have difficulty processing that metabolite normally in the body.

That is, as shown in FIG. 7, the BAV DB 700 stores variants, including known disease variants, for which an effect on protein activity is predicted at a protein's drug-binding site, at a promoter site, or in the bound state.

The BAV DB 700 is a database of biological-activity-related gene information, storing information on the resistance and susceptibility of genes to drugs, metabolites, and foods. The BAV DB 700 can also be built by linking to well-established public data: for example, about 6,000 drug entries published in a drug bank (interacting proteins, binding-region information, and so on), about 12,000 metabolite entries published in a metabolite bank (interacting proteins, binding-region information, and so on), and the positions of drug-metabolism-related variants in the roughly 200 DMET (drug metabolizing enzyme and transporter) genes.

The Reference DB 800 stores information on known genome variants, and can be built by linking to public databases as well as to the literature.

For example, PheWAS-GWAS (genome-wide association study) data and eMERGE (Electronic Medical Records and Genomics) data can be applied to the Reference DB.

Although not shown in the drawings, the system may further include a clinical information DB that stores the subject's environmental predisposition information, which the search control unit 200 must consider together with the genetic characteristics in order to derive a clinical-information-based disease cause prediction.

The clinical information DB stores the individual's environmental-factor outcome data together with population averages and reference information.

The individual's environmental-factor outcome data may be clinical data such as comprehensive health checkup results, and the population averages and reference information may draw on the community cohort study results provided by the Korea Centers for Disease Control and Prevention.

The method for analyzing genetic information using a personal whole genome according to the present invention is described in detail below with reference to the accompanying drawings.

FIG. 8 is a flowchart of the genetic information analysis method according to a specific embodiment of the present invention; FIG. 9 shows an example of the DNA sequencing data generation method; FIG. 10 shows an example of a generated result report; FIG. 11 shows another example of a generated result report; FIG. 12 shows an example of verifying a biologically active variant; and FIG. 13 shows an example of computing a clinical-information-based disease cause prediction, each according to a specific embodiment of the present invention.

First, as shown in FIG. 8, the method begins when the analysis data input unit receives the analysis data (DNA sequencing data) to be analyzed (S100).

The analysis data may be supplied in Dumy form, consisting of DNA fragments. In that case, as shown in FIG. 9, the present invention applies high-density indexing to the supplied Dumy data to generate and store the DNA sequencing data as an RVR file.

The method then performs four main analyses, according to the analysis target.

That is, the genetic information analysis using a personal whole genome according to the present invention performs four analyses: 1) genotype determination (S200), 2) rare variant calculation (S300), 3) disease variant calculation (S400), and 4) biologically active variant calculation (S500). Each is described in detail below.

[Genotype determination]

The HaploScan engine 210 compares the DNA sequencing data against the Haplo Frequency 412 and Haplo MAP 414 stored in the HaploScan DB 400 to detect, for single genes and for phenotypes, the cluster to which the genotype belongs and the information associated with it.

Specifically, the HaploScan engine 210 compares the i-th gene of the DNA sequencing data with the i-th gene information in the single-gene Haplo Frequency 412 (S211), and determines which of the single-gene clusters classified in the single-gene Haplo MAP 414 the i-th gene of the analyzed personal genome belongs to (S213, S215).

The HaploScan engine 210 then repeats this from i = 1 to the last gene (approximately i = 39,000) to determine the genotype of every gene in the analysis data (S217, S219).
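The per-gene loop (S211 to S219) might look like the following Python sketch. The text only says that the engine determines which cluster a gene belongs to; matching by smallest number of differing positions is an assumption made here for illustration, and the names `assign_cluster`, `haploscan`, and the gene identifiers are hypothetical.

```python
def assign_cluster(observed, haplo_map):
    """Return the label of the cluster pattern closest to `observed`.

    haplo_map: dict mapping a cluster label to its variant pattern
    (tuple of 0/1 flags). Closeness is the count of differing positions,
    which is an assumed matching rule.
    """
    def distance(label):
        return sum(a != b for a, b in zip(observed, haplo_map[label]))
    return min(haplo_map, key=distance)

def haploscan(sample, single_gene_maps):
    """Genotype every gene present in both the sample and the map store."""
    return {
        gene: assign_cluster(sample[gene], clusters)
        for gene, clusters in single_gene_maps.items()
        if gene in sample
    }

# Hypothetical two-gene example; the real loop would cover about 39,000 genes.
maps = {
    "GENE_A": {"GP*47*0": (0, 0, 0), "GP*25*1": (1, 0, 0)},
    "GENE_B": {"GP*60*0": (0, 0), "GP*40*1": (0, 1)},
}
sample = {"GENE_A": (1, 0, 0), "GENE_B": (0, 0)}
print(haploscan(sample, maps))
```

The same loop shape would apply to the multi-gene case, with phenotypes as keys instead of gene identifiers.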

The HaploScan engine 210 also compares the DNA sequencing data with the multi-gene Haplo Frequency 422 (S221), and determines, for each phenotype, which of the multi-gene combination clusters classified in the multi-gene Haplo MAP 424 the analyzed genome's combination of genes belongs to (S223, S225).

Here too, the HaploScan engine 210 repeats the comparison for every phenotype stored in the multi-gene information database 420 to determine the genotypes of the analysis data (S227, S229).

Through this HaploScan process, the genotypes arising from the single-gene and multi-gene variants in the analyzed genome can be defined.

[Rare variant calculation]

A rare variant is a base change caused by a highly unusual genetic variation, and is often associated with a rare disease. Detecting the presence of, or difference in, a variant at a particular base therefore makes it possible to assess, for example, the likelihood of developing a rare disease.

To this end, as shown in FIG. 8, the ADISCAN engine 220 first selects a control group (S310).

The control group is the reference population against which the rarity of a variant is judged; it may be restricted to a particular race or to a particular country.

The ADISCAN engine 220 then calculates a variation index for the base at a given locus against the corresponding base in the control DB using the ADISCAN method, and repeats this over the whole genome (n = 1 to approximately n = 3 billion) (S320, S330, S340).

In this way, the rarity of each base is calculated over the entire sequence (S350).
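As a rough illustration of steps S310 to S350, the sketch below selects a control group and scores each locus. The text does not give the ADISCAN index formulas, so a simple stand-in is used here: rarity is taken as one minus the frequency of the sample's base among the controls. The record layout, the population labels, and the function names are all hypothetical.

```python
def select_controls(control_db, population=None):
    """Pick the control group, optionally restricted to one population (S310)."""
    if population is None:
        return list(control_db)
    return [c for c in control_db if c["population"] == population]

def rarity_by_locus(sample, controls):
    """Return {locus: rarity} for every locus in the sample (S320 to S350).

    rarity = 1 - frequency of the sample's base among controls genotyped at
    that locus. This is a stand-in index, not the ADISCAN metric itself.
    """
    scores = {}
    for locus, base in sample.items():
        observed = [c["bases"][locus] for c in controls if locus in c["bases"]]
        if observed:
            scores[locus] = 1.0 - observed.count(base) / len(observed)
    return scores

# Toy control DB with two hypothetical population labels.
control_db = [
    {"population": "POP_A", "bases": {101: "A"}},
    {"population": "POP_A", "bases": {101: "A"}},
    {"population": "POP_A", "bases": {101: "A"}},
    {"population": "POP_A", "bases": {101: "G"}},
    {"population": "POP_B", "bases": {101: "G"}},
]
sample = {101: "G"}
print(rarity_by_locus(sample, select_controls(control_db, "POP_A")))  # {101: 0.75}
```

Restricting the control group changes the score: against all five controls the same base has a frequency of 2/5, so the choice of reference population directly affects what counts as rare.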

ADISCAN (allelic depth and imbalance scanning), used for the rare variant calculation, is a technique for screening markers that distinguish normal from abnormal genes. The determination is based on the allelic depth product tangent difference, the allelic squared difference, the allelic absolute difference, the geometric allelic difference, the statistical allelic difference, or the allelic imbalance ratio.

[Disease variant calculation]

To detect disease variants, the IDA search engine 230 compares the analysis data with the variant information in the IDA DB 600 and determines the risk of the corresponding disease (S410).

In this way, the analysis data is checked against every disease in the IDA DB (S420), and the diseases associated with significant variants are then determined (S430).
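The disease lookup (S410 to S430) amounts to intersecting the sample's variants with each disease's known variants. A minimal sketch follows, assuming a hypothetical in-memory layout for the IDA DB and a hypothetical `min_hits` threshold as the significance criterion; the text does not state how risk or significance is actually computed.

```python
def disease_hits(sample_variants, ida_db, min_hits=1):
    """Return {disease: matched variant ids} for diseases with enough matches.

    sample_variants: set of variant identifiers found in the analysis data.
    ida_db: dict mapping a disease name to the set of known variant ids.
    min_hits: assumed threshold standing in for the significance criterion.
    """
    report = {}
    for disease, known in ida_db.items():          # loop over all diseases (S420)
        matched = sample_variants & known          # compare with the DB (S410)
        if len(matched) >= min_hits:
            report[disease] = sorted(matched)      # keep significant ones (S430)
    return report

# Placeholder identifiers only; no real variant or disease data is implied.
ida_db = {
    "disease_A": {"var_001", "var_002", "var_003"},
    "disease_B": {"var_010"},
}
sample_variants = {"var_002", "var_003", "var_777"}
print(disease_hits(sample_variants, ida_db, min_hits=2))
```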

[Biologically active variant calculation]

To detect biologically active variants, the biologically active variant search engine 240 searches the BAV DB (biologically active variant DB) (S510) and retrieves information on the amino acids involved in protein binding (S520).

Protein binding here includes protein-drug, protein-DNA, and protein-protein binding, and the amino acid information includes information on the bases related to each amino acid.

The biologically active variant search engine 240 then compares the bases in the amino acid information with the analysis data to detect the amino acids that are altered in the analysis data, along with related information such as the associated metabolites (S530, S540).

The biologically active variant search engine 240 repeats the variant detection for all amino acids and consolidates the detected information to produce the biologically active variant information (S550, S560).
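Steps S510 to S560 can be sketched as a scan over binding-residue records. The record fields, the `BindingResidue` and `bav_scan` names, and the example entries are hypothetical. The sketch also flags any base change at a binding residue without checking whether the encoded amino acid actually changes, which is a simplification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BindingResidue:
    protein: str       # protein name
    residue: int       # amino acid position within the protein
    partner: str       # drug, metabolite, DNA or protein it binds
    positions: tuple   # genomic positions of the bases encoding the residue
    reference: str     # reference bases at those positions

def bav_scan(sample_bases, bav_db):
    """Return (protein, residue, partner, observed bases) for altered residues.

    sample_bases: dict mapping a genomic position to the observed base;
    positions missing from the dict are treated as matching the reference.
    """
    hits = []
    for entry in bav_db:                                   # all residues (S550)
        observed = "".join(
            sample_bases.get(pos, ref)
            for pos, ref in zip(entry.positions, entry.reference)
        )
        if observed != entry.reference:                    # variant found (S530)
            hits.append((entry.protein, entry.residue, entry.partner, observed))
    return hits                                            # consolidated (S560)

bav_db = [
    BindingResidue("PROT1", 42, "drug_X", (1000, 1001, 1002), "GCT"),
    BindingResidue("PROT2", 7, "metabolite_Y", (2000, 2001, 2002), "AAA"),
]
print(bav_scan({1001: "T"}, bav_db))  # [('PROT1', 42, 'drug_X', 'GTT')]
```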

FIG. 10 shows an example of verifying, by simulation, the effect of a detected amino acid variant of a protein on metabolism.

The search control unit 200 then combines the determined or calculated genotypes, rare variants, disease variants, and biologically active variants to generate the result report provided to the user (S600).

If the subject's clinical information has been provided, the search control unit 200 can use it to calculate and provide a clinical-information-based disease cause.

Specifically, predicting the cause of a disease requires personal health records (PHR) that include the current environmental-factor outcomes (comprehensive health checkup data and clinical information). In particular, population averages and reference information are needed for the environmental factors (in the present invention, the reference information uses the results of the second-phase community cohort study provided by the Korea Centers for Disease Control and Prevention). The link between these environmental-factor outcomes and a genotype is called a PHR-trait.

As shown in FIG. 11, the formula for the disease causation relation (Π) uses logistic regression analysis. The variable χ is a value determined by the genotypes, rare variants, disease variants, and biologically active variants calculated as described above, and the variable β is a value determined from the PHR.

That is, the disease causation relation makes it possible to calculate the association between the Gene, Disease, or Drug genotype (a group or cluster of genotypes) and the PHR (BMI, age, sex, and so on).

The whole-genome-based disease cause is thus calculated from the association between the current clinical condition (normal, disease, or phenotype) and the Gene, Disease, and Drug genotypes calculated from the 39,000 genes.
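The exact expression in FIG. 11 is not reproduced in this text. Assuming it has the standard logistic form, with χ as the genotype-derived inputs and β as the PHR-derived parameters described above, a minimal sketch is given below. The function name `disease_causation`, the optional intercept, and the example numbers are hypothetical.

```python
import math

def disease_causation(chi, beta, intercept=0.0):
    """Standard logistic function: 1 / (1 + exp(-(intercept + sum(beta_i * chi_i)))).

    chi:  genotype-derived values (genotype, rare, disease and BAV results).
    beta: parameters determined from the PHR (for example age, sex, BMI).
    The mapping of chi and beta follows the description; the functional form
    itself is an assumption.
    """
    z = intercept + sum(b * x for b, x in zip(beta, chi))
    return 1.0 / (1.0 + math.exp(-z))

# With all inputs at zero the relation is 0.5, the logistic midpoint.
print(disease_causation([0.0, 0.0], [1.5, -2.0]))  # 0.5
# A positive weighted sum pushes the value toward 1.
print(disease_causation([1.0, 0.0], [1.5, -2.0]) > 0.5)  # True
```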

The disease cause discovery system according to the present invention generates reporting data from the calculated gene variant information.

The result reports differ somewhat depending on the output, but they basically use a Manhattan plot and a radial variation chart to visualize the variant genes.

FIG. 12 shows an example of a Manhattan plot generated according to a specific embodiment of the present invention.

As shown in FIG. 12, the Manhattan plot is a graph in which, for the 39,000 genes, the standard genes of the genome project are classified by genotype on the basis of the non-synonymous variants of all known SNPs, and the accumulated values are visualized as points.

Marking the genes of the analyzed genome on this plot makes it easy to see how distinctive their variants are relative to the control group.

Such a Manhattan plot makes it easy to identify not only the variant loci but also the degree of variation.
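The data behind such a plot can be prepared as in the sketch below, which orders genes along the genome and flags those at or above a cut-off. It only computes the points and does not draw anything. The tuple layout, the score values, and the `manhattan_points` name are hypothetical, and how the accumulated per-gene value is actually derived is not specified here.

```python
def manhattan_points(gene_scores, cutoff):
    """Turn per-gene scores into ordered plot points.

    gene_scores: list of (chromosome, position, gene, score) tuples, with the
    chromosome given as an integer.
    Returns one dict per gene: x is the rank along the genome, y is the score,
    and `significant` marks scores at or above the cut-off.
    """
    ordered = sorted(gene_scores, key=lambda g: (g[0], g[1]))
    return [
        {"x": x, "chrom": chrom, "gene": gene, "y": score,
         "significant": score >= cutoff}
        for x, (chrom, pos, gene, score) in enumerate(ordered)
    ]

scores = [
    (2, 500, "GENE_C", 1.2),
    (1, 900, "GENE_B", 7.5),
    (1, 100, "GENE_A", 0.4),
]
for point in manhattan_points(scores, cutoff=5.0):
    print(point)
```

The resulting list can be passed to any plotting library, using `x` and `y` as coordinates and `significant` to highlight points above the cut-off line.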

The significant variants shown on the Manhattan plot can also be displayed as a radial variation chart, as shown in FIG. 13, according to their degree of variation and genetic characteristics.

By showing the degree of genetic variation of the analyzed genome alongside the control group average, the chart displays the analyzed genome's degree of variation clearly, and the result report can additionally include genetic characteristic information.

The result report generated as described above is provided through the result report providing unit.

The scope of the present invention is not limited to the embodiments described above but is defined by the claims, and it is evident that a person of ordinary skill in the art can make various modifications and adaptations within the scope of the claims.

The present invention relates to a system that analyzes and provides genetic information from a personal genome by comparing the input personal whole-genome information with multiple whole-genome DBs constructed by genome projects. The present invention has the effect of providing a genetic analysis platform with improved variant detection efficiency.

Claims

An analysis data input unit for inputting analysis data including personal genome information;
A search control unit for calculating an analysis result of rare mutation and disease variation of each gene by comparing the gene information stored in the database with the individual genome information of the subject of analysis and generating a result report through the analysis result; And
And a storage unit for storing the gene information of the control group for comparison with the analysis data,
Wherein,
The ADISCAN DB, which is stored as a separate classification according to the classification including race,
The search control unit,
And an ADISCAN engine that compares each base contained in the analysis data with the ADISCAN DB to calculate the rarity with respect to the group of the control:
Wherein,
An IDA DB for storing known gene mutation information related to each disease for each of a plurality of diseases;
The search control unit,
And an IDA search engine for comparing the analysis data with the IDA DB to detect a known gene-related disease variation included in the analysis data,
Wherein,
In order to derive the prediction result of the disease cause based on the clinical information, the clinical information DB storing the environmental test information of the testee to be considered together with the genetic characteristic is further comprised:
The search control unit,
Using the arithmetic formula obtained by logistic regression, we derive the disease causation relation (Πx) and derive the disease cause prediction result:
The disease causative relationship arithmetic expression,

Where,
The variable β is a parameter according to the personal health records (PHR), including the age, sex, or body mass index (BMI), of the subject to be examined stored in the clinical information DB;
And the variable χ is a parameter according to the disease-related genotype included in the analysis data calculated by the search control unit.

The method according to claim 1,
Wherein,
And a HaploScan DB storing the genotype information of the control gene to compare with the analysis data;
The search control unit,
And a HaploScan engine for comparing the analysis data with the HaploScan DB to identify the genotype of the analysis data. Excavation system.

3. The method of claim 2,
In the HaploScan DB,
A single gene information database for storing genomic information on a single gene;
And a genetic information database for storing the genotypic information of the multiple genotypes of the phenotypes.

The method of claim 3,
The single gene information database comprises:
A single gene Haplo map for a single gene in the control group, which is classified by the occupation ratio (population) of haplotypes and trait frequencies by race;
And single-gene Haplo Frequency information storing the variant information used to discriminate the genotype of a single gene stored in the single gene Haplo map, in the disease cause discovery system using genetic variant information.

The method of claim 3,
Wherein the multiple gene information database comprises:
A multi-gene Haplo map which is obtained by dividing the variation distribution of the genotypically related bases into multiple genes of the control group by phenotype and clustering them according to occupancy rate;
And multi-gene Haplo Frequency information storing the variant information used to discriminate the genotype of each phenotype stored in the multiple gene Haplo map, in the disease cause discovery system using genetic variant information.

delete

6. The method according to any one of claims 1 to 5,
The search control unit,
A mutant full-length genome is generated by classifying the detected mutant gene characteristics according to genotypes of a full-length gene, and displaying a cumulative value on a Manhattan plot visualized as a point. Detection system of disease cause using genetic mutation information.

12. The method of claim 11,
The Manhattan plot,
Wherein the genetic mutation information is indicative of a cut-off for guiding the significance of the mutation gene.