KR102078200B1

KR102078200B1 - Analysis platform for personalized medicine based personal genome map and Analysis method using thereof

Info

Publication number: KR102078200B1
Application number: KR1020180166476A
Authority: KR
Inventors: 정종선; 이선호; 가소정; 홍종희; 조양래
Original assignee: (주)신테카바이오
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2020-02-17
Also published as: KR20190000342A

Abstract

The present invention was devised to address the requirements for realizing precision medicine based on a commercialized "personal genome map-based personalized medicine platform". It provides a method for increasing the precision and the recall (reproducibility) of variant calls in a personal genome, a method of using public GWAS (genome-wide association study) markers to perform genotype-based patient stratification, and methods for resolving problems in interworking with the MAHA supercomputer developed by the Electronics and Telecommunications Research Institute (ETRI). It also presents the results of research on the improvements needed to commercialize a personal genome map-based personalized medicine platform, usable for personalized medicine and as a diagnostic device, that includes a reporting database analysis module through which results can be readily provided to multiple users in a genomic big data and supercomputing environment. Genome data-based personalized medicine currently remains largely in academia across countries and is only partially feasible; once commercialized, this technology could provide the first personalized medicine platform that is commercially usable across the whole domain.

Description

Analysis platform for personalized medicine based on a personal genome map, and analysis method using the same

The present invention relates to a personal genome map-based personalized medicine analysis platform using a supercomputing system. The platform compares the input whole-genome information of an individual against multiple whole-genome (or exome, or targeted) databases built by genome projects, and analyzes and provides genetic information from the personal genome.

The IT market is currently shifting in the order of Google, Facebook, Amazon, cloud computing, and ubiquitous computing. In parallel, the biomedical, bioinformatics, and genomics fields are following new trends in the order of bio-Google, systems biology, personalized medicine, and precision medicine. In the post-Human Genome Project era in particular, next-generation sequencing technology has advanced rapidly, and efforts to make personalized medicine a reality are actively under way.

Current next-generation sequencing technology is known to take about one week to sequence and analyze the whole genome of one person (30x coverage). More than 100,000 next-generation sequencers are reported to have been supplied worldwide, and substantial funding is reported to have been invested in the main developers of third-generation sequencers (Ion Torrent: 2.5 generation; Pacific Biosciences: third generation).

Beyond this, the field is among the fastest developing of all industries worldwide. If the trend continues, the cost of sequencing and analyzing one person's whole genome is expected to fall below about $1,000 within the next two to three years. The most useful and most immediately practical applications of this next-generation technology are clinical genomics, pharmacogenomics, and translational medicine. Clinical genomics has recently been evolving into medical genomics, and medical genomics, together with patient stratification technology, has given rise to the new discipline and new term "precision medicine" referred to by the President of the United States.

Genomic variant information of this kind grows every year, and as validation data expand, the range over which the present invention delivers accurate analysis will continue to widen.

Meanwhile, the applicant has been continuously developing technology to meet the technical requirements of the genetic analysis field described above.

As a result of these efforts, the applicant developed methods for precision medicine covering bio big data, clinical information, proteome and genome information, and the construction of analysis systems that speed up their analysis. In particular, the applicant developed a GPU (graphics processing unit)-based analysis system to increase analysis speed (Registered Patent 10-0996443), and file-based information retrieval methods that characterize the RVR (records virtual rack) analysis tool, a technique for speeding up data comparison (Registered Patents 10-0880531, 10-1035959, and 10-1117603).

In addition, the applicant applied RVR and GPU-based analysis to the proteome (Registered Patent 10-1400717), and developed the allele depth-based ADISCAN analysis tool for variant calling and for efficiently determining the degree of rare variation between a control group and a personal genome (patent registration: 10-1460520, 10-1542529, and 10-2014-0020738).

The applicant also developed methods for creating an integrated genome DB for efficient management of genomic information, for discovering disease-causing variants, and for computing genotypes for patient stratification (patent registration: 10-2015-0187554, 10-2015-0187556, and 10-2015-0187559), as well as a method for computing human haplotypes from genomic information (patent application: 10-2016-0096996).

In addition, as middleware specialized for operating storage for big data such as the integrated genome DB, the MAHA supercomputing system was developed by the Electronics and Telecommunications Research Institute (ETRI). It enables simultaneous analysis of thousands of bulk genome datasets in a parallel distributed environment (patent registration: 10-1460520, 10-1010219, 10-0956637, 10-093623, 10-2013-0005685, 10-2012-0146892, and 10-2013-0004519).

The applicant received the MAHA system from ETRI, built an environment optimized for bio big data for use in clinical settings, and developed Korea's first supercomputing system linked to an integrated genome analysis system for implementing precision medicine.

In particular, MAHA-Fs (a storage system for ultra-high-speed I/O on bulk data such as genomes) was tailored to general cloud computing environments. The applicant clearly defined its reproducibility, precision, and system limits and developed MAHA-FsDx, which can be used for diagnostics in clinical settings, that is, in hospitals.

(001) Korean Registered Patent No. 10-0880531
(002) Korean Registered Patent No. 10-0996443
(003) Korean Registered Patent No. 10-1035959
(004) Korean Registered Patent No. 10-1117603
(005) Korean Registered Patent No. 10-1400717
(006) Korean Registered Patent No. 10-1460520
(007) Korean Registered Patent No. 10-1542529
(008) Korean Patent Application No. 10-2015-0187554
(009) Korean Patent Application No. 10-2015-0187556
(010) Korean Patent Application No. 10-2015-0187559
(011) Korean Patent Application No. 10-2016-0096996
(012) Korean Registered Patent No. 10-0834574
(013) Korean Registered Patent No. 10-1010219
(014) Korean Registered Patent No. 10-0956637
(015) Korean Registered Patent No. 10-0936238
(016) Korean Patent Application No. 10-2013-0005685
(017) Korean Patent Application No. 10-2012-0146892
(018) Korean Patent Application No. 10-2013-0004519

The present invention was devised to address the requirements for realizing commercialized personal genome map-based precision medicine as described above. It is designed to improve the integrated genome DB so that variants in a personal genome can be detected faster and more efficiently; to improve patient stratification by basing it on public GWAS (genome-wide association study) markers; to raise the precision and the recall (reproducibility) of the algorithm that calls the variants making up the integrated genome DB, so that variants are defined accurately; and to resolve problems in interworking between the integrated genome DB and the MAHA supercomputer.

The present invention also aims to provide a personal genome map-based personalized medicine platform, built on the integrated genome DB and MAHA supercomputing and usable for personalized medicine and as a diagnostic device, that includes a reporting database analysis module through which detected variant information can be readily provided to multiple users in a big data and supercomputing environment.

According to a feature of the present invention for achieving the above objects, the invention provides features for improving analysis efficiency in an analysis platform composed of a subject's integrated genome DB, biomedical information, a proteome database, and the system technologies for analyzing them.

In addition, to operate the integrated genome database, the present invention interworks with the MAHA supercomputing system disclosed in Korean Registered Patents No. 10-0834574, No. 10-1010219, No. 10-0956637, and No. 10-0936238 and Korean Patent Applications No. 10-2013-0005685, No. 10-2012-0146892, and No. 10-2013-0004519, and has the following features for improving the user environment, functions, and operating efficiency of MAHA-FsDx as a single diagnostic device.

Providing a commercial personalized medicine platform-based service for personal genome analysis requires optimization of variant calling, patient stratification, and the integrated genome system. To this end, the present invention applies technical Features 1 to 9 to variant calling, patient stratification, and construction of the integrated genome system.

First, the present invention applies three technical features to variant calling.

Variant calling is the baseline for all analysis results and must therefore be standardized without bias.

The variant calling models known to date include GATK, STRELKA, EBcall, MuTect, and SomaticSniper, which are based on the Bayesian model; jointSNVMix, VarScan, and Fermi, which are based on probabilistic prediction and graph models (Fisher's exact test and string graphs); and GNUmap, GATK, SOAPsnp, SAMTools, and SNVer, which are based on prior knowledge and statistical models.

However, each of these three kinds of model carries a bias built into the model itself and therefore produces biased results.

That is, tools developed on the Bayesian model (GATK, STRELKA, EBcall, MuTect, and SomaticSniper) are fitted to prior knowledge, so they predict variants already known from prior knowledge well but may predict novel variants poorly.

For the probabilistic-prediction and graph model-based algorithms (jointSNVMix, VarScan, and Fermi), standard statistics compute how far a particular variant distribution deviates from a normal distribution. Real variants, however, do not follow such normality and can be sensitive to specific genes, specific diseases, and specific environmental factors.

Tools based on prior knowledge and statistical models (GNUmap, GATK, SOAPsnp, SAMTools, and SNVer), meanwhile, use a variety of prior knowledge-based statistics in a way that differs from the Bayesian model.

Accordingly, the present invention applies ADISCAN to exclude bias from variant calling.

The basic concept of ADISCAN is as follows. Every human gene is diploid (one copy from the mother and one from the father). When the genome is fragmented and read back, the two different alleles at a position are assumed to be detected in the sequenced fragments with 50:50 probability. On that assumption, a position is called heterozygous if the observed allele ratio is 50:50, reference homozygous if it is 100:0 (only the reference allele is observed), and alternative homozygous if it is 0:100 (only the alternative allele is observed).

This can be expressed as an unbiased ratio (50/50 = 1; 0/100 or 100/0 = 0), and all such values can be expressed with a tangent function or various other functions (the tangent function being well suited to expressing this characteristic).

The basic concept of ADISCAN has been disclosed in the applicant's Registered Patents No. 10-1400717 and No. 10-1542529.
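The allele-ratio idea described above can be sketched in a few lines of Python. This is only an illustration, not the ADISCAN implementation: taking the minor/major read-depth ratio is one reading of "50/50 = 1; 0/100 or 100/0 = 0", and the function names and the thresholds `het_min` and `hom_max` are hypothetical values chosen for the example.

```python
def allele_ratio(ref_depth: int, alt_depth: int) -> float:
    """Minor/major allele-depth ratio: 1.0 when balanced, 0.0 when one allele only."""
    major = max(ref_depth, alt_depth)
    if major == 0:
        raise ValueError("no reads cover this position")
    return min(ref_depth, alt_depth) / major


def call_genotype(ref_depth: int, alt_depth: int,
                  het_min: float = 0.5, hom_max: float = 0.1) -> str:
    """Classify a position from its allele depths.

    het_min and hom_max are illustrative thresholds (assumptions), not
    values taken from the patent. Ratios between them are left uncalled.
    """
    ratio = allele_ratio(ref_depth, alt_depth)
    if ratio >= het_min:
        return "het"
    if ratio <= hom_max:
        return "hom_ref" if ref_depth > alt_depth else "hom_alt"
    return "no_call"


if __name__ == "__main__":
    for ref, alt in [(50, 50), (100, 0), (0, 100), (80, 20)]:
        print(ref, alt, allele_ratio(ref, alt), call_genotype(ref, alt))
```

The band between the two thresholds is kept as an explicit `no_call` so that ambiguous positions are reported rather than forced into a genotype.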

For variant calling with ADISCAN, the present invention adds the following technical features.

Specifically, in performing variant calling with ADISCAN, the present invention normalizes the maximum and minimum of the variant sensitivity score as a function of allele depth to the range 0 to 1, defines the range of the region in which variant calling is not possible (the Twilight Zone), and fixes these as constant parameters (Feature 1).

In performing variant calling with ADISCAN, the present invention also designates a reference material (NA12878), performs machine learning with the reference material as the ground truth to determine the constant parameters described above, and detects errors by comparing every variant that is a candidate false positive against the reference material (NA12878) (Feature 2).

In addition, in performing variant calling with ADISCAN, the present invention performs haplotype-based correction to improve the precision of variant calling on the human genome (Feature 3).
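The comparison against a reference material in Feature 2 amounts to scoring a call set against a truth set. The sketch below shows how precision and recall follow from that comparison; the variant tuples are invented toy data, and `precision_recall` is a hypothetical helper, not part of ADISCAN or of any NA12878 distribution.

```python
from typing import List, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall(called: Set[Variant],
                     truth: Set[Variant]) -> Tuple[float, float, List[Variant]]:
    """Score a call set against a truth set.

    Returns (precision, recall, false_positives), where false_positives
    lists the called variants that are absent from the truth set.
    """
    true_pos = len(called & truth)
    false_pos = called - truth
    false_neg = truth - called
    precision = true_pos / (true_pos + len(false_pos)) if called else 0.0
    recall = true_pos / (true_pos + len(false_neg)) if truth else 0.0
    return precision, recall, sorted(false_pos)


if __name__ == "__main__":
    # Toy data for illustration only.
    truth_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
    call_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
    print(precision_recall(call_set, truth_set))
```

In a parameter-fitting loop of the kind the text describes, thresholds would be adjusted until these two scores are acceptable on the reference sample; that loop is not shown here.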

Next, the present invention applies two technical features to patient stratification.

Patient stratification must be usable in real clinical practice and requires standardization at the international level.

The present invention performs patient stratification from genetic variation using haplotype sequences derived from the integrated genome data. The basic principle of this patient stratification has been disclosed in the applicant's prior patent documents 8, 9, and 10.

This patient stratification method clusters single genes and multiple genes on the basis of gene haplotypes. It is therefore highly useful when sufficient population genome information is available, but of limited use when international data are hard to obtain because of differences between ethnic groups.

Computing patient stratification at the gene level on the basis of public GWAS (genome-wide association study) markers established by international consortia improves efficiency, so a gene-level patient stratification method based on public GWAS markers is needed.

However, public GWAS markers are sparsely distributed: many markers lie between genes or within introns, so they cannot be linked to protein function.

GWAS markers are currently available as public markers from the GWAS Catalog (https://www.ebi.ac.uk/gwas/, European Bioinformatics Institute, EBI). The underlying disease association studies, each computed from thousands to hundreds of thousands of individuals, are the product of international consortia and are an important asset for patient stratification research.

However, because 500,000 variants had to represent 3 billion bases, the markers were used so that each variant stands for a 12,000 bp window: 6,000 base pairs upstream and 6,000 base pairs downstream of the marker base.

A further problem is that most of these markers are non-functional variants.
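The marker-density figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers given in the text; the uniform spacing of markers is an idealizing assumption made for the example, and `marker_window` is a hypothetical helper.

```python
GENOME_BASES = 3_000_000_000  # approximate genome length cited in the text
N_MARKERS = 500_000           # number of public markers cited in the text
FLANK_BP = 6_000              # flank on each side of a marker, per the text

# Average distance between neighbouring markers if they were evenly spaced
# (an assumption for illustration; real markers are not uniformly spaced).
mean_spacing_bp = GENOME_BASES // N_MARKERS

# Span of sequence that one marker is taken to represent.
window_bp = 2 * FLANK_BP


def marker_window(position: int) -> tuple:
    """Return the (start, end) interval represented by a marker at `position`."""
    return max(0, position - FLANK_BP), position + FLANK_BP


if __name__ == "__main__":
    print(mean_spacing_bp, window_bp, marker_window(1_000_000))
```

Under the even-spacing assumption the average marker spacing is 6,000 bp while each window spans 12,000 bp, so neighbouring windows would overlap; how the windows relate to the actual marker positions is not stated in the text.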

To solve these problems, the present invention applies a technique that performs gene-level patient stratification on the basis of linkage disequilibrium (LD) (Feature 4).

The present invention also applies a technique that compensates for the sparsity of public markers by linking bio-data sources to establish associations (Feature 5).
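Feature 4 relies on linkage disequilibrium between a sparse marker and nearby variants. As background, the sketch below computes the standard r² statistic for two biallelic sites from phased haplotypes; it illustrates the LD measure itself, not the patent's stratification procedure, and the function name and the 0/1 allele coding are assumptions made for the example.

```python
from typing import Sequence, Tuple


def ld_r2(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Squared correlation r^2 between two biallelic sites.

    Each haplotype is a pair (a, b), where a and b are 0 (reference)
    or 1 (alternative) at the first and second site respectively.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n           # alt frequency, site 1
    p_b = sum(b for _, b in haplotypes) / n           # alt frequency, site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]       # alleles always travel together
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]     # alleles combine freely
    print(ld_r2(linked), ld_r2(unlinked))
```

A marker in an intron that shows high r² with a coding variant in the same gene can stand in for that variant, which is the general reason LD helps connect sparse markers to gene-level function.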

Finally, the present invention applies four technical features to the configuration of the integrated genome system.

The basic configuration of the integrated genome system applied in the present invention has been disclosed in Korean Registered Patent No. 10-1460520 and Korean Patent Application No. 10-2015-0187554.

However, such an integrated genome system must compute and integrate many human genome datasets simultaneously on many CPUs, in a supercomputing environment that connects 100 to 1,000 PC-cluster machines (server-class nodes). To build the PC cluster at a reasonable cost in such an environment, a partitioning structure for the analysis data is needed so that standardized whole-genome analysis can be performed.

To provide such a standardized partitioning structure, the present invention divides the 3 billion bases of the whole genome into a defined number of chunks and allocates an optimized memory size per node (multiple CPUs) (Feature 6).
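The chunking idea in Feature 6 can be illustrated with a small partitioning routine. This is a minimal sketch under stated assumptions: the chunk count of 1,000 is a placeholder (this section does not give the defined number), chromosome boundaries are ignored, and `split_into_chunks` is a hypothetical name.

```python
from typing import List, Tuple


def split_into_chunks(genome_length: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split [0, genome_length) into n_chunks contiguous half-open intervals.

    Chunk sizes differ by at most one base, so the work per chunk is balanced.
    """
    if n_chunks <= 0 or genome_length < n_chunks:
        raise ValueError("need 1 <= n_chunks <= genome_length")
    base, extra = divmod(genome_length, n_chunks)
    chunks = []
    start = 0
    for index in range(n_chunks):
        size = base + (1 if index < extra else 0)  # spread the remainder evenly
        chunks.append((start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    # 1,000 chunks is an illustrative value, not the number defined in the patent.
    chunks = split_into_chunks(3_000_000_000, 1_000)
    print(len(chunks), chunks[0], chunks[-1])
```

Fixed, predictable chunk boundaries make it possible to size the memory per node in advance, since every node then handles intervals of a known maximum length.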

In addition, to build a genome-analysis supercomputer with improved whole-genome analysis efficiency, the present invention provides a standardized configuration of memory-based CPUs, GPGPU-based GPUs, I/O performance (able to sustain 1,000 simultaneous computations), a memory cache server, and parallel distributed storage (Feature 7).

Furthermore, to allow this hardware to be used as a clinical diagnostic device, the present invention provides the results of limit and reproducibility tests (Feature 8).

Finally, so that general researchers and developers can use a large system such as the personal genome map-based personalized medicine platform, the present invention provides a standardized database equipped with precomputed results and common tools (Feature 9).

With the personal genome map-based personalized medicine analysis platform of the present invention as described above, a personalized medicine platform can be commercialized, genome variant calling can be performed precisely, patient stratification results can be raised from an academic level to a commercial level, and the integrated genome DB system can be moved from a semi-automated environment to an automated one.

Fig. 1 is a schematic overview of the core technical components of the personal genome map-based personalized medicine analysis platform and the analysis method using it according to the present invention.
Fig. 2 is a conceptual diagram of the machine learning concept of the genome variant calling algorithm (ADISCAN) according to the present invention.
Fig. 3 is a conceptual diagram of reference material-based filtering of the genome variant calling algorithm and haplotyping-based error correction according to the present invention.
Fig. 4 is an example illustrating the basic concept of haplotyping as applied to the platform and the analysis method according to the present invention.
Fig. 5 is a conceptual diagram summarizing the history and concepts of genome technology for patient stratification according to the present invention.
Fig. 6 is an example of the use of public GWAS markers and the Bio-map for patient stratification in the integrated genome DB according to the present invention.
Fig. 7 is an example of the GWAS marker-based patient stratification generation method according to the present invention.
Fig. 8 is an example of GWAS marker-based patient stratification generated according to the present invention.
Fig. 9 is a conceptual diagram of the chunks into which the 3 billion bases of the human genome are divided for operation on the PC cluster applied in the present invention.
Fig. 10 is a conceptual diagram of the computing equipment (a1-a6), the storage configuration (b1-b3), and the MAHA-FsDx configuration for genome data analysis according to the present invention.
Figs. 11a to 11f are examples showing the results of evaluating stability, limits, and reproducibility after converting MAHA-Fs into MAHA-FsDx according to the present invention.
Figs. 12a and 12b are examples verifying, with respect to limits, peak performance relative to the scale of the MAHA-FsDx configuration.
Fig. 13 is a conceptual diagram of the configuration of the multi-researcher database based on the integrated genome DB platform according to the present invention.
Fig. 14 is a conceptual diagram of the configuration of the allele depth, genotype, and haplotype DBs of the integrated genome DB according to the present invention.
Fig. 15 is a conceptual diagram of the principle for detecting SNVs, INDELs, and CNVs in each chunk-level matrix according to the present invention.
Fig. 16 is a conceptual diagram of patient stratification through gene-based and multigene-based genotype computation in the integrated genome DB according to the present invention.

Hereinafter, a personal genome map-based personalized medicine analysis platform according to a specific embodiment of the present invention is described in detail with reference to the accompanying drawings.

Before describing each feature, the configuration of the genetic analysis service to which the invention applies is briefly described. The genetic analysis service collects samples such as blood from institutions that collect personal genetic material, such as hospitals, and commissions a DNA sequencing company to sequence them.

The DNA sequencing company then sequences the DNA from the collected samples using NGS (next-generation sequencing). Because DNA sequences can now be generated by a variety of methods as technology advances, sequencing may be performed by whichever method fits the technical level of the sequencing company. The resulting sequence data are analyzed by a system such as the present invention to extract the genetic information contained in the personal genome, and the analysis results are delivered to a diagnostic institution such as a hospital or to the end user.

물론, 상기 DNA 시퀀싱 회사로부터 DNA 벌크(bulk, dummy) 데이터가 제공되는 경우, 이로부터 고집적 인덱싱 파일로 형성하여 빅데이터인 유전체 염기서열을 분석할 수도 있다.Of course, when the DNA bulk (bulk, dummy) data is provided from the DNA sequencing company, it is also possible to form a high-density indexing file from it to analyze the genome sequences that are big data.

즉, 본 발명은 DNA 해독 정보로부터 개인 유전체에 포함된 유전적 정보를 분석하는 분석 및 진단 시스템에 관한 것으로, 이하에서 본 발명에 의한 통합유전체 빅데이터를 활용한 개인 유전체 맵 기반 맞춤의학 분석 플랫폼 및 이를 이용한 분석 방법에 있어, 기술적 특징 1 내지 9를 중심으로, 본 발명의 기술 구성을 상세히 설명하기로 한다.That is, the present invention relates to an analysis and diagnosis system for analyzing the genetic information contained in the individual genome from the DNA decoded information, hereinafter, personal genomic map-based customized medical analysis platform using integrated genome big data and In the analysis method using the same, with reference to the technical features 1 to 9, the technical configuration of the present invention will be described in detail.

FIG. 1 is a schematic diagram of the core technical configuration of the personal genome map-based personalized medicine analysis platform and the analysis method using it according to the present invention. As shown, technical features 1 to 9 are applied to variant calling, patient stratification, and construction of the integrated genome system in order to provide commercial personalized medicine platform services for personal genome analysis.

FIG. 2 is a conceptual diagram illustrating the machine learning concept of the genome variant calling algorithm (ADISCAN) according to the present invention, and shows the concept of technical feature 1.

As shown in FIG. 2, feature 1 of the present invention concerns variant calling with ADISCAN: the maximum and minimum of the variant sensitivity score according to allele depth are normalized to the range 0 to 1, the extent of the region in which variant calling is impossible (the Twilight Zone) is defined, and these are expressed as constant parameters.

That is, a score function for machine learning is applied. The score function performs variant calling on the genome using a tangent function with three constant parameters (a, b, and c): constant parameter (a) sets the variant sensitivity according to allele depth, constant parameter (b) expresses the maximum and minimum of the score on a 0 to 1 scale, and constant parameter (c) sets the extent of the region in which variant calling is impossible (the Twilight Zone).

That is, the score function for ADISCAN according to the present invention is calculated as

$$S(i) = a \cdot \tan(D - 0.5) - \log_{b}\bigl(\max(b, \min(\text{depth}))\bigr) + c$$

In this formula, D is the allele depth ratio. The logarithm term takes, with base b (a constant), the larger of b and the smaller of the two allele depths (the depth of the first allele, the reference depth, and the depth of the second allele, the alternative depth). The constant a is the weight on the tangent term, and the constant c sets the 0 to 1 scale. Together these terms express the variant calling value.
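The score function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the allele depth ratio D is computed, so the sketch takes it to be the alternative-allele share of the total depth, and the function name `adiscan_score` and its argument names are hypothetical rather than part of ADISCAN.

```python
import math

def adiscan_score(ref_depth, alt_depth, a, b, c):
    """Evaluate S = a*tan(D - 0.5) - log_b(max(b, min(depth))) + c."""
    total = ref_depth + alt_depth
    if total <= 0:
        raise ValueError("total allele depth must be positive")
    if b <= 0 or b == 1:
        raise ValueError("log base b must be positive and different from 1")
    # Assumption: D is the alternative-allele fraction of the total depth.
    d_ratio = alt_depth / total
    # Smaller of the two allele depths, floored at b so the log is defined.
    min_depth = min(ref_depth, alt_depth)
    log_term = math.log(max(b, min_depth), b)
    return a * math.tan(d_ratio - 0.5) - log_term + c
```

With balanced depths (D = 0.5) the tangent term vanishes and the score is governed by the depth term and c; the constants a, b, and c would come from the training step described in the text, not from this sketch.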

Thus, once the constants a, b, and c are known, a variant can be called as homo, hetero, or alternative homo without bias.

To this end, the constants a, b, and c were computed by machine learning against known correct answers. Constant parameter (a) sets the variant sensitivity according to allele depth, constant parameter (b) expresses the maximum and minimum of the score on a 0 to 1 scale, and constant parameter (c) optimizes the extent of the region in which variant calling is impossible (the Twilight Zone).

Accordingly, feature 1 of the present invention keeps the reproducibility and accuracy of the algorithm consistent, which supports its general-purpose use.

Feature 2 of the present invention fixes the constant parameters of the score function by machine learning on the data provided for the reference material (NA12878).

FIG. 3 is a conceptual diagram illustrating reference-material-based filtering for the genome variant calling algorithm and haplotyping-based error correction according to the present invention.

As shown in FIG. 3, feature 2 of the present invention additionally corrects errors in genome variant calling by comparison with the reference material (NA12878).

Over the course of evolution the human genome has acquired countless repeated sequences, so a read from a repeated sequence may align to several different genes.

The reads aligned to the reference genome are stored as a BAM (binary alignment map), and when variant calling is performed on such a BAM, every multiply aligned read is a potential source of false positives.

Therefore, filtering out in advance and correcting the genes that contain such multiply aligned reads makes variant calling of the human genome more precise.

The reference material (NA12878) is the genome and specimen information most widely cited and used for comparison by researchers worldwide. It is described in the Nature Biotechnology paper prepared by Justin Zook et al. with the support of the US National Institute of Standards and Technology (Zook et al., Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls, Nature Biotechnology 32, 246-251, 2014).

Here, a reference material is a shared specimen: an immortalized cell line is established from one person's sample, and all variants present in that person are detected on a variety of platforms so that the specimen can be used as a standard.

For NA12878, about 3.5 million truth variants have been generated using a variety of next-generation sequencing platforms (four or more).

The stage that precedes the variant detection tool is the BAM (binary alignment map). Given a BAM file, the ADISCAN variant detection tool is used to detect variants, and comparing the detected variants with the truth set identifies the incorrect calls.

Incorrect calls arise here not because the variant detection tool itself is wrong but because the BAM file was generated incorrectly. The 3.5 million truth variants of NA12878 can therefore be used to correct errors in the BAM file.
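The comparison of called variants against the truth set can be illustrated as a set operation. The sketch below is a minimal illustration, assuming each variant is keyed by a `(chrom, pos, ref, alt)` tuple; the function name `compare_to_truth` and that key layout are hypothetical and are not taken from ADISCAN or from the NA12878 release.

```python
def compare_to_truth(called, truth):
    """Split variant calls into agreement and disagreement with a truth set."""
    called, truth = set(called), set(truth)
    return {
        # Calls confirmed by the truth set.
        "true_positive": called & truth,
        # Calls absent from the truth set: candidates for BAM-level errors.
        "false_positive": called - truth,
        # Truth variants that were not called.
        "false_negative": truth - called,
    }
```

The `false_positive` set is the input one would carry forward to the per-gene cause analysis described in the following steps.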

Specifically, error correction of the BAM file is carried out in the following three steps.

First, the calculation is run with the most standard genome analysis pipeline, making use of the NA12878 truth variants. The most standard genome analysis pipeline has stages 1 to 8, and the variant analysis of the present invention is likewise performed in the following eight stages.

1) Trimming, 2) Mapping, 3) Sort BAMs, 4) Merge BAMs, 5) Remove Duplication, 6) Realign InDEL, 7) Recalibrate Scores, 8) Variant Calling

Using this pipeline and a standard analysis tool (BWA-MEM), a new BAM (binary alignment map) is generated as the base data for variant detection.

Second, variants are detected with the ADISCAN variant detection tool, and the incorrectly called variants are accumulated per gene to give a cumulative count for each gene.

Genes with a cumulative count of one or more are selected, and the cause is analyzed for each of them. That is, it is checked whether the cause is redundancy in the reference itself, such that reads align to duplicated locations, or duplication in a specific exon or intron.
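The per-gene accumulation in the second step can be sketched as a simple count. This is an illustrative sketch only: the mapping `gene_of` from a variant key to a gene name is an assumed input (in practice it would come from a gene annotation), and the function name is hypothetical.

```python
from collections import Counter

def genes_with_false_calls(false_calls, gene_of, min_count=1):
    """Count incorrect calls per gene and keep genes at or above min_count."""
    # Variants without a gene annotation are skipped in this sketch.
    counts = Counter(gene_of[v] for v in false_calls if v in gene_of)
    return {gene: n for gene, n in counts.items() if n >= min_count}
```

The default `min_count=1` mirrors the "one or more" criterion in the text; the selected genes are the ones passed on to cause analysis and haplotyping.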

Third, for the gene structure regions (exon, intron, UTR, or intergenic) found to be redundant, haplotyping is performed using the 1000 Genomes Project data (5,008 haplotypes).

FIG. 4 illustrates the basic concept of haplotyping as applied to the personal genome map-based personalized medicine analysis platform and the analysis method using it according to the present invention.

The basic concept of the haplotyping method is disclosed in Patent Application No. 10-2016-0096996. In summary, as shown in FIG. 4, (A) the collection of sequence reads (fastq) is (B) aligned to the human reference genome (GRCh38); the aligned reads are extracted again (fastq) and (C) aligned to a separately prepared haplotype database (IMGT/HLA: a database built specifically for the HLA genes that contains many kinds of haplotypes). Then (D) ranks are assigned while false positives are successively removed, and (E) finally the two haplotypes, one from the father and one from the mother, are separated. Once the two haplotype strands resulting from step (D) are obtained, they are combined into a genotype and the variants are reported.

In the present invention, however, haplotyping uses the 1000 Genomes Project haplotype database of 5,008 haplotypes instead of the IMGT/HLA database.

Feature 3 of the present invention is a new, efficient approach that combines a method of computing whole-genome variants with a method of correcting the variant calling errors caused by sequenced reads aligning to duplicated regions, that is, a haplotyping method that corrects the result of feature 2.

The haplotyping applied in the present invention includes the HaplotypeDB for 26 populations generated by the 1000 Genomes Project as well as other established haplotype DBs. Haplotyping can therefore be performed in this way, followed by variant calling. The basic concept of haplotyping is restated below.

The reads aligned to the reference genome are extracted (fastq); the extracted reads are aligned to the 1000 Genomes haplotype DB mentioned above using an alignment tool (bwa-mem, etc.); two alleles are selected by the haplotyping method; and the two alleles A and B shown in FIG. 3 are combined, with variant calling performed on the combined result.

Features 4 and 5 of the present invention, which relate to patient stratification, are described below.

FIG. 5 is a conceptual diagram summarizing the history and concepts of the genome technologies used for patient stratification according to the present invention.

As shown, with regard to patient stratification, SNP-chip-based disease association technology has advanced actively since the human genome was completed in 2002. In particular, international consortia for GWAS-based marker discovery were active, and eQTL-related international consortia followed in succession.

As a result, about 40,000 GWAS markers in the GWAS Catalog and about 6 million eQTL markers have been published. The US NCBI began building a database of genetic variant markers related to rare diseases, cancers, and common diseases as ClinVar (clinical variants), and about 100,000 rare disease markers have been released to date.

In addition, around 2007, practical NGS-based sequencing instruments from Illumina and Proton (PGM-ION) were released, and diagnostic clinical genomics based on novel variants or cancer variants emerged. The International Cancer Genome Consortium and the 1000 Genomes Project went on to release, respectively, the whole genomes of 2,100 pairs of cancer patient samples and the genomes of 2,600 normal individuals. Syntekabio was the first in the world to combine these two big data sets into one large matrix of 3 billion bases × 6,800 whole genomes. This matrix is called the integrated genome DB.

As shown in FIG. 5, comparing the whole genome with the spacing of SNP chip markers, there is roughly one SNP chip marker per 12,000 bases of the whole genome; that is, 12,000/2 × 500,000 markers ≈ 3 billion bases. Therefore, to attach functional information to a GWAS marker found on an SNP chip, it is necessary to link it to markers that carry functional information, namely eQTL markers, BAVs (bioactive variants), and ClinVar markers, on the basis that two loci in linkage disequilibrium (LD) are inherited together. In addition, the penetrance of a variant (its disease association) is a clinically very important measure for linking a variant to disease, yet little is known about the role of regions that are rare but show low penetrance. Such information is expected to bear on patient stratification through rare variants or through the activity of specific proteins in individuals.
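The marker spacing arithmetic above can be checked directly. The snippet below only evaluates the expression as it is written in the text; the variable names are illustrative, and no interpretation of the division by 2 is assumed because the text does not give one.

```python
MARKER_SPACING_BASES = 12_000   # one SNP chip marker per 12,000 bases (from the text)
MARKER_COUNT = 500_000          # number of SNP chip markers (from the text)

# 12,000 / 2 * 500,000, evaluated with integer arithmetic.
covered_bases = MARKER_SPACING_BASES // 2 * MARKER_COUNT
print(f"{covered_bases:,} bases")
```

The result equals 3,000,000,000, which matches the 3 billion bases stated for the whole genome.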

FIG. 6 is an exemplary diagram showing the use of common GWAS markers and the Bio-map for patient stratification in the integrated genome DB according to the present invention; FIG. 14 is a conceptual diagram showing the structure of the allele depth, genotype, and haplotype DBs of the integrated genome DB according to the present invention; and FIG. 16 is a conceptual diagram showing patient stratification through gene-based and multigene-based genotype calculation in the integrated genome DB according to the present invention.

As shown in FIG. 6, feature 4 of the personal genome map-based personalized medicine analysis platform and the analysis method using it performs patient stratification using common markers on the basis of linkage disequilibrium.

That is, as shown in FIGS. 14 and 16, patient stratification according to the present invention detects common variant (SNP: single nucleotide polymorphism) markers and rare variant markers that differ between a control group and the personal genome, at the level of single genes and of multiple genes, computes genotypes from them, and performs patient stratification on that basis.

The basic concept of such patient stratification is as disclosed in Korean Patent Application Nos. 10-2015-0187554, 10-2015-0187556, and 10-2015-0187559.

However, this patient stratification approach (a genotype calculation that clusters single genes and multiple genes on the basis of gene haplotypes) is highly useful when sufficient population information is available but becomes less efficient when international data are difficult to obtain.

Because computing patient stratification at the gene level from the common GWAS (genome-wide association study) markers established by international consortia improves efficiency, gene-level patient stratification based on common GWAS markers is called for.

However, most common GWAS markers are sparsely distributed over wide regions, so a marker often cannot explain function with respect to a specific disease.

Therefore, the present invention uses the linkage disequilibrium (LD) index R^2 with a threshold of 0.7 (R^2 > 0.7) to link non-functional variants to functional variants, and performs gene-level patient stratification based on common markers (GWAS, etc.).

The linkage disequilibrium index can be calculated by the following formula.

Here, the r^2 calculation of the linkage disequilibrium (D ~= LD) index measures the strength of the association between two single nucleotide polymorphisms.

The LD index is calculated from the difference between the haplotype frequency computed from the alleles observed at each single nucleotide polymorphism and the haplotype frequency expected under random association.

D' is currently the most widely used LD index; in general, |D'| > 0.8 is taken to indicate a strong association between two single nucleotide polymorphisms.

Likewise, r^2 greater than 0.7 is taken to indicate a strong association between two single nucleotide polymorphisms.

Here, PA is the frequency of haplotypes carrying the A allele at the first locus, Pa the frequency of those carrying the a allele at the first locus, PB the frequency of those carrying the B allele at the second locus, and Pb the frequency of those carrying the b allele at the second locus.
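The formula itself is not reproduced in the text, so the sketch below uses the standard population-genetics definitions, which match the description given: D = P_AB - P_A·P_B, r^2 = D^2 / (P_A·P_a·P_B·P_b), and D' = D / D_max. This is an assumption about which definitions are intended; the function name `ld_stats`, its argument names, and the sign convention for D' (it keeps the sign of D) are illustrative.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the A-B haplotype
    p_a:  frequency of allele A at the first locus (P_a-allele = 1 - p_a)
    p_b:  frequency of allele B at the second locus (P_b-allele = 1 - p_b)
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # Observed haplotype frequency minus the frequency expected at random.
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D_max depends on the sign of D (standard normalisation).
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime, r2
```

A marker pair would then be treated as linked when the returned r^2 exceeds the 0.7 threshold, or when |D'| exceeds 0.8, as stated in the text.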

FIG. 7 illustrates an example of the GWAS marker-based patient stratification generation method according to the present invention.

As shown, in patient stratification according to the present invention a GWAS marker genotype has a major allele and a minor allele, and the LOF marker and eQTL marker that are in LD with the GWAS marker must be on the same allele.

A GWAS marker itself does not carry functional information, so the markers linked to it through LD must always carry functional information.

Patient stratification generation rules can thus be defined, and the relationship between genetic variation and expression can be shown as a simple diagram relating the GWAS marker, the eQTL, the LOF marker, and the expressed gene.

FIG. 8 shows an example of GWAS marker-based patient stratification generation according to the present invention.

As shown in FIG. 8, a GWAS marker (brown) is selected; cases that carry the minor allele of the GWAS marker and in which the eQTL marker (red) is also the minor allele are selected; and then cases that carry the minor allele of the GWAS marker (brown) together with the LOF marker (green) are selected.

Disease strata 1 to N can then be selected, and a non-disease stratum can be classified.
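The selection rule described for FIG. 8 can be sketched as a small classifier. Everything concrete here is an assumption made for illustration: a patient is represented as the set of marker IDs at which the minor (or LOF) allele is carried, the stratum labels are hypothetical, and a patient with the GWAS minor allele but no linked functional marker is placed in the non-disease group because the text does not say how such cases are handled.

```python
def assign_stratum(carried_markers, gwas_marker, eqtl_marker, lof_marker):
    """Assign a patient to a stratum from GWAS, eQTL, and LOF marker status."""
    carried = set(carried_markers)
    # Without the GWAS minor allele the patient is not stratified as disease.
    if gwas_marker not in carried:
        return "non-disease"
    has_eqtl = eqtl_marker in carried
    has_lof = lof_marker in carried
    if has_eqtl and has_lof:
        return "GWAS+eQTL+LOF"
    if has_eqtl:
        return "GWAS+eQTL"
    if has_lof:
        return "GWAS+LOF"
    # Assumption: no linked functional marker means no disease stratum.
    return "non-disease"
```

Applying this per GWAS marker across a cohort would yield groups analogous to disease strata 1 to N and a remaining non-disease group.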

Furthermore, if patients who later respond to an actual drug (responders) fall within a particular disease stratum i, the drug can be said to be associated with the patient stratification markers of that stratum.

It then becomes possible to work out the molecular biological mechanism connecting the patient stratification markers with functional association information. Here, an eQTL marker identifies the mRNA-coding gene that is expressed, and an LOF (loss of function) marker directly explains the function of the GWAS marker.

As described above, common GWAS markers are linked to eQTLs and gene variants for functional information on the basis of LD, such as R^2 > 0.7, but the eQTLs and gene variants may themselves fail to provide functional information.

This is because GWAS markers, at one per 12,000 bp, are so sparsely distributed that they may still fail to be linked to function.

To solve this problem, the present invention applies a technique (feature 5) that explains the function of sparsely distributed common GWAS markers by establishing associations through the techniques used in the Bio-map (bmap, cmap, and vmap; Patent Registration Nos. 10-0880531, 10-0996443, 10-1035959, 10-1117603, and 10-1400717): biodata indexing (rvr), maps of biological entities built on homology (bmap), a method of linking maps to one another (cmap), and simulation of protein-drug-variant complexes (vmap).

That is, in the bio-map-based use of biological information according to the present invention, as in feature 4, a GWAS marker in the integrated genome DB of the personalized medicine platform is linked to an eQTL marker through linkage disequilibrium (LD) as described above. When the gene expressed under that eQTL does not explain function in relation to the disease mechanism, then, as in feature 5, function is explained through information from the bio map (bmap), map × map (cmap), and complexes (variant-drug-protein, vmap), and these functions are networked and searched quickly on the basis of RVR technology.

The concepts of rvr, bmap, cmap, and vmap are disclosed in Patent Registration Nos. 10-0880531, 10-0996443, 10-1035959, 10-1117603, and 10-1400717, and are summarized briefly as follows.

First, rvr (records virtual rack) is a method of creating files for single-data retrieval and of searching a single data file, and it uses a rat (record allocation table) file for single-file retrieval. rvr adapts the inverted indexing technique used in computer engineering for application to biodata.

The key point is that large volumes of biodata are bulk data or big data, so generating a file that records the address information of the data file externally (rat: record allocation table) makes it easier to handle large-volume biodata, including batch processing.
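The idea of keeping record addresses outside the data file can be illustrated with a byte-offset table. This is a minimal sketch of the general technique, not the rvr or rat format of the cited patents: it assumes newline-terminated records, keeps the table in memory rather than in a separate file, and uses hypothetical function names.

```python
import io

def build_rat(stream):
    """Scan a binary stream once and return [(offset, length), ...] per record."""
    table = []
    offset = stream.tell()
    # readline returns b"" at end of stream, which stops the iteration.
    for line in iter(stream.readline, b""):
        table.append((offset, len(line)))
        offset += len(line)
    return table

def read_record(stream, table, index):
    """Fetch one record by seeking straight to its stored address."""
    offset, length = table[index]
    stream.seek(offset)
    return stream.read(length)

# Usage with an in-memory stand-in for a large sequence file.
data = io.BytesIO(b"ACGT\nTTGA\nCC\n")
rat = build_rat(data)
second = read_record(data, rat, 1)  # reads only the second record
```

Because each lookup is a single seek and read, batch jobs can fetch arbitrary records without rescanning the bulk file, which is the benefit the text attributes to the external address table.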

This method can process a wide variety of structured and unstructured bio-related big data, including nucleotide sequences, protein sequences, multiple alignment files, text, protein 3D information, variant information, images, and audio.

Second, bmap (bio map) is a method of searching integrated biomedical information based on cluster and backbone databases, disclosed in Korean Patent Registration No. 10-1035959.

To annotate a gene the user wants to know about, bmap collects all information related to that gene from public DBs on the basis of homology and searches it, in the manner of a "bio-Google", using rvr-based indexing.

From the public data, gene map results can be built around DBs covering Gene expression, Pathway, Disease, Nucleotide, Protein, Regulation, Interaction, Metabolite, SNP & mRNA SNPChip data, Drug and ProteinChemical, and Interaction.

Drug, variant, interaction, and FDA drug maps can be built in the same way.

Third, cmap (cell map) concerns a system that provides biomedical function-association information by generating multiple interlinkable maps, disclosed in Korean Patent Registration No. 10-1117603.

cmap relates to a method of linking the keywords or IDs used in a bio map to that bio map, and to a method of linking one bio map to another.

The characteristics of cmap can be divided into the following three points.

1) 쿼리(Query)는 유전자 ID, 화합물 ID, 질병 ID 및 중요한 keyword 등 또는 이들의 조합이 될 수 있다.1) The query may be a gene ID, a compound ID, a disease ID, an important keyword, or a combination thereof.

상동성(유사성) Query는 Query와 상동성(유사성)을 주는 모든 컨텐츠를 의미하고, 소문자 query로 사용한다. Homology (similarity) Query means all contents that give homology (similarity) to Query, and it is used as lower case query.

데이터베이스(DB)는 레코드들을 포함하는 집단을 의미하고 레코드는 하나의 어브젝트의 단위이고 어브젝트는 각각의 논문, 바이오 원시정보(예 : 염기서열, 아미노산서열, 분자의 구성물인 원자의 좌표, 화합물 좌표,화학기호, 아미노산, 등)를 포함한다. A database is a group of records, a record is a unit of an object, and an object is each article, bio primitive information (e.g., base sequence, amino acid sequence, atomic coordinates, compounds, etc.). Coordinates, chemical symbols, amino acids, etc.).

2) The result of searching a database DB with a query Q is written Q x DB, and the query together with its homologous queries is written (q1, q2, ..., qN). The set of data retrieved with Q and its homologous queries q, namely (q1.db1, ..., qN.db1, ..., q1.dbN, ..., qN.dbN), is defined as a map (Map).

3) A comparison between map collections can be expressed in the same way as one between map M1 and map M2. That is, the comparison between a single map and a collection of maps gathering the genes associated with a disease-related gene set (a set of genes with the same function, a set of drugs with similar function, a set of disease-resistance genes, a genome, etc.) is written M1 x M2.
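The map definition in 2) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `build_map` helper, the substring-matching rule, and the toy databases are hypothetical and are not taken from the cited patent.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: a database is a list of record strings.
Database = List[str]


def build_map(queries: List[str],
              databases: Dict[str, Database]) -> Dict[Tuple[str, str], List[str]]:
    """Return {(q_i, db_j): matching records} for every query/database pair.

    Matching is a case-insensitive substring test, chosen only to keep the
    sketch self-contained; it stands in for whatever search the system uses.
    """
    result: Dict[Tuple[str, str], List[str]] = {}
    for q in queries:                              # Q and homologous queries q1..qN
        for name, records in databases.items():    # db1..dbN
            result[(q, name)] = [r for r in records if q.lower() in r.lower()]
    return result


databases = {
    "disease": ["geneA linked to diseaseX", "geneB linked to diseaseY"],
    "drug": ["drug1 binds geneA product", "drug2 binds geneB product"],
}
toy_map = build_map(["geneA", "GENEC"], databases)
```

Each key of `toy_map` corresponds to one qi.dbj entry of the map, so a map with N queries and N databases has N x N entries, some of which may be empty.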

Fourth, vmap (virtual map) concerns an all-atom simulation method for macromolecular complexes, disclosed in Korean Patent Registration No. 10-1400717.

The vmap simulates a protein-drug-variant complex to calculate the difference in free energy, from which the resistance of the drug target caused by the variant can be calculated. When the simulation is run repeatedly on a protein and a drug, fixing specific atoms or amino acids and applying strong stress has the effect of amplifying the energy in a weighted manner, which makes energy-based detection easier.

In particular, for vaccines, this method first prepares several types of MHC (immune receptor protein) structures as templates. Amino acid sequences obtained by pattern analysis of various gene sequences related to immune function are then substituted into the known peptide-binding site of the MHC, the peptide is fixed, and a simulation is run. By comparing the energy change of each peptide, peptides that bind stably to the MHC can be predicted through GPGPU-based screening.

The technical features of the present invention relating to the integrated genome system are described below.

FIG. 9 is a conceptual diagram showing the 3 billion bases of the human genome divided into chunks for operation on a PC cluster as applied to the present invention, and FIG. 14 is a conceptual diagram showing the structure of the allele depth, genotype, and haplotype DBs of the integrated genome DB according to the present invention.

As shown in FIG. 9, feature 6 of the present invention concerns whole-genome analysis and the integrated genome DB, the basic structure of which is disclosed in Korean Patent Registration No. 10-1460520 and Korean Patent Application No. 10-2015-0187554.

In practice, however, in a PC cluster (a supercomputing environment in which 100 to 1,000 server-class nodes are connected), it is desirable to specify the processing capacity per node (node: multiple CPUs), so that large numbers of human genome data sets can be processed and integrated simultaneously on many CPUs while the cluster is still built at a reasonable cost.

Accordingly, the present invention preferably allocates about 64 GB of memory in a distributed manner for human genome analysis.

Therefore, to integrate human whole genomes (3 billion base pairs x N people) in such a PC-cluster environment, the present invention applies a technique (feature 6) that divides the whole genome into chunks of a fixed size (about 50,000,000 base pairs x N people = 1 chunk), producing and operating on 600 chunks, which improves the efficiency of PC-cluster-based analysis.

This is because building a multiple-alignment environment for 10,000 to 20,000 integrated samples in one chunk requires about 10 to 20 GB of memory, so an analysis platform configured this way is markedly more cost-effective.
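The chunking arithmetic above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, and the memory model simply scales the stated figure of roughly 10 to 20 GB for 10,000 to 20,000 samples linearly. Note that 3 billion bases at 50,000,000 bases per chunk gives 60 chunks, while 600 chunks corresponds to about 5,000,000 bases per chunk, so the chunk size is left as a parameter.

```python
import math
from typing import Iterator, Tuple

GENOME_BASES = 3_000_000_000  # human genome size used in the text


def chunk_ranges(genome_bases: int, chunk_bases: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based half-open (start, end) ranges that cover the genome."""
    for start in range(0, genome_bases, chunk_bases):
        yield start, min(start + chunk_bases, genome_bases)


def chunk_count(genome_bases: int, chunk_bases: int) -> int:
    """Number of chunks needed; the last chunk may be shorter."""
    return math.ceil(genome_bases / chunk_bases)


def chunk_memory_gb(samples: int, gb_per_1000_samples: float = 1.0) -> float:
    """Assumed linear model: about 1 GB per 1,000 samples in one chunk."""
    return samples / 1000 * gb_per_1000_samples
```

Under this assumed model, a node with 64 GB of memory could hold a few chunks of 10,000 samples at once, which is the kind of per-node sizing the text asks to be specified.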

FIG. 10 is a conceptual diagram showing the compute equipment (a1-a6), storage (b1-b3), and MAHA-FsDx configuration for genomic data analysis according to the present invention.

As shown in FIG. 10, feature 7 of the present invention concerns optimizing the supercomputer environment for handling bulk genomic data, testing and placing hardware and network cards in the pipeline according to data I/O speed, and configuring a middleware-class parallel distributed storage environment (e.g., cloud computing) that can be used when hundreds or thousands of computations run simultaneously.

A supercomputer for genome analysis needs the following flows: a memory-based ultra-high-speed compute flow with relatively little I/O (a1); a flow for cases that need large amounts of data after ultra-high-speed computation (additional I/O performance + an additional ultra-high-speed memory cache server + large storage) (a2); a flow for large amounts of data after ultra-high-speed GPGPU computation (additional I/O performance + an additional ultra-high-speed memory cache server + large storage) (a3); a flow that runs thousands of computations simultaneously on thousands of general-purpose compute servers (a4); a flow for intermediate speeds (a5); and a data backup flow (a6).

From the storage side as well, a memory disk driver (b1), parallel distributed storage for ultra-high-speed, high-volume I/O (b2), and general storage (b3) must be provided, along with additional high-speed CPUs and storage to deliver I/O performance.

Feature 8 of the present invention is that, for such large equipment (MAHA-FsDx) to be used clinically, reproducibility and consistency must be guaranteed, and the system limits must therefore be clearly defined. Accordingly, work in a clinical environment is needed to turn the current MAHA-Fs into MAHA-FsDx (diagnostic equipment and a medical device).

An example of the evaluation follows.

FIGS. 11A to 11F show the results of evaluating stability, limits, and reproducibility after converting MAHA-Fs into MAHA-FsDx according to the present invention.

The verification test environment was as follows:

1. MAHA-FsDx Master Node: MDS (CPU L5520 4Core x2ea / 64GB Memory / 128GB Disk)

2. MAHA-FsDx Data Node: DS1~5 (CPU E5630 2.53GHz x1EA / 24GB Memory / 2TB x5ea)

3. MAHA-FsDx capacity: Physical 50TB / usable 25TB (MAHA Replica 2 option (stability))

4. Compute Node: CPU Xeon E5-2630 v2 2.60GHz / 64GB Memory / RAID 11TB

5. Sample data: CHA-NS161103-0001_R1.fastq.gz, CHA-NS161103-0001_R2.fastq.gz, Genome TS (Target Sequence) sample (420MB x2ea)
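The capacity figures in item 3 follow from the data-node layout in item 2. The sketch below reproduces that arithmetic; the function names are hypothetical, and it assumes that the "Replica 2" option stores two full copies of every block, which is how the halving from 50TB to 25TB is read here.

```python
def raw_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float) -> float:
    """Physical capacity: every disk on every data node counts once."""
    return nodes * disks_per_node * disk_tb


def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Usable capacity when each block is stored `replicas` times (assumed model)."""
    return raw_tb / replicas


# Five data nodes (DS1~5), each with five 2TB disks, replica factor 2.
physical = raw_capacity_tb(nodes=5, disks_per_node=5, disk_tb=2)
usable = usable_capacity_tb(physical, replicas=2)
```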

The test method was as follows:

1. Reproducibility test: the same test was run on the local disk /awork01 and on /maha4.

2. Stability test: a failure was induced on one of the five MAHA DS nodes, and the existing data was checked for problems.

3. Limit test: tested with 2 jobs per compute node.

A. no caching: results for 30 / 40 / 60 jobs

B. caching: results for 60 / 200 / 240 / 360 jobs

C. no caching, large capacity 1.4PB: results for 392 / 582 / 776 / 784 jobs (additional)

4. Genome analysis pipeline and raw data used

As shown, MAHA-FsDx with features 7 and 8 of the present invention applied is superior to MAHA-Fs in reproducibility, stability, and limits.

FIGS. 12A and 12B are examples verifying, with respect to limits, the maximum performance relative to the scale of the MAHA-FsDx configuration.

Three configurations were used: five general servers, each with five 2TB HDDs, configured as MAHA-FsDx; five general servers with a memory cache (SDRAM caching) added; and ten general high-capacity servers, each with twenty-four 6TB HDDs, configured as MAHA-FsDx.

These experimental results confirm that the limits of MAHA-FsDx were markedly improved.

FIG. 13 is a conceptual diagram showing the structure of the multi-researcher database built on the integrated genome DB platform according to the present invention, and FIG. 15 is a conceptual diagram showing the principle of detecting SNVs, INDELs, and CNVs in each chunk-level matrix according to the present invention.

As shown in FIG. 13, feature 9 of the present invention provides a user environment, based on standardized databases (mySQL and mongoDB) on middleware, in which many researchers and developers can access a large system such as the personal-genome-map-based personalized medicine platform and analyze it by integrating precomputed data and tools.

Using the various tools provided by the integrated genome DB together with public tools, multiple researchers compute the precomputed data they need and extract result data. In particular, the functional information of SNVs, INDELs, and CNVs, especially LOF (loss of function), is computed with polyphen and SNPeffecter and the results are stored. Various public statistical analysis resources such as GWAS markers, LD calculation information, eQTL, MAF, and phenotype are also provided, so that gene-level patient stratification and toxicity marker calculation can be performed.

FIG. 14 is an example of the allele, genotype, and haplotype DBs of the integrated genome DB, in which the 3 billion bases of each of 5,000 genomes are divided into chunks (50 million bases each) and arranged as matrices over the 5,000 individuals.

FIG. 15 is a schematic diagram of extracting, from each chunk, the precomputed data (SNVs, INDELs, and CNVs) in a form accessible to researchers and developers according to feature 9 of the present invention.

FIG. 16 shows patient stratification through single-gene and multi-gene genotype calculation in the integrated genome DB according to the present invention.

The rights of the present invention are not limited to the embodiments described above but are defined by the claims, and it is evident that a person of ordinary skill in the art can make various modifications and adaptations within the scope of the claims.

An overview of some of the underlying technologies for implementing the present invention is summarized briefly below.

Technical summary of Korean Patent Application No. 10-2015-0187554 (Patent Document 8), Korean Patent Application No. 10-2015-0187556 (Patent Document 9), and Korean Patent Application No. 10-2015-0187559 (Patent Document 10)

The disease-cause discovery system according to Patent Documents 8, 9, and 10 comprises an analysis data input unit (100), a search control unit (200), a result report providing unit (300), a HaploScan DB (400), an ADISCAN DB (500), an IDA DB (600), a bioactivity DB (700), and a reference DB (800).

The analysis data input unit (100) receives personal genome information in the form of DNA sequencing data.

The search control unit (200) detects, from the input DNA sequencing data, the genotype of each gene, genotypes for phenotypes, rare variants, disease variants, and bioactive variants. To this end, it comprises a HaploScan engine (210), an ADISCAN engine (220), an IDA search engine (230), and a bioactive variant search engine (240).

The HaploScan engine (210) determines genotypes by comparing the analysis data (the input DNA sequencing data) with the Haplo MAPs (414, 424) stored in the HaploScan DB (400), described later.

The structure of the HaploScan DB (400) and the search method of the HaploScan engine (210) are described in detail later.

The ADISCAN engine (220) compares each base in the input analysis data against the ADISCAN DB (500) using the ADISCAN method and calculates its rarity relative to the control population.

The IDA search engine (230) detects known gene-related disease variants by comparing the analysis data with the IDA DB (600), in which known disease variants are stored.

The bioactive variant search engine (240) detects genetic variants related to protein metabolism. Broadly, it determines whether there are genetic variants affecting the amino acids involved in protein-drug, protein-DNA, and protein-protein binding.

The bioactive variant search engine (240) compares the analysis data with the BAV DB (700) to determine whether the bases in the analysis data that correspond to the protein-binding amino acids stored in the BAV DB (700) are mutated.
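The comparison performed by the bioactive variant search engine (240) can be pictured as a lookup of each variant against stored binding-site positions. The sketch below is illustrative only: the `Variant` type, the dictionary layout standing in for the BAV DB (700), and the toy gene names are all assumptions.

```python
from typing import Dict, List, NamedTuple, Set


class Variant(NamedTuple):
    gene: str
    aa_position: int  # 1-based amino-acid position changed by the variant


def binding_site_hits(variants: List[Variant],
                      bav_sites: Dict[str, Set[int]]) -> List[Variant]:
    """Return the variants that fall on a stored binding-site amino acid."""
    return [v for v in variants if v.aa_position in bav_sites.get(v.gene, set())]


# Hypothetical stand-in for the BAV DB: gene -> binding-site positions.
bav_sites = {"GENE1": {45, 46, 102}, "GENE2": {10}}
sample_variants = [Variant("GENE1", 46), Variant("GENE1", 300), Variant("GENE3", 10)]
hits = binding_site_hits(sample_variants, bav_sites)
```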

The search control unit (200) generates a result report using a Manhattan plot and a radial variant significance chart, so that the diagnostician (or user) can readily see the genotypes and the significance (rarity) of each base determined by the HaploScan engine (210) and the ADISCAN engine (220).

The generated result report is provided to the user through the result report providing unit (300).

The database structure of the disease-cause discovery system according to Patent Documents 8, 9, and 10 is described below.

The disease-cause discovery system according to Patent Documents 8, 9, and 10 broadly comprises the HaploScan DB (400), the ADISCAN DB (500), the IDA DB (600), the BAV DB (700), and the Reference DB (800).

As shown in FIG. 3, the HaploScan DB (400) organizes the genotypes of control-group genes so that genotypes can be derived from the personal genome information under analysis. As shown in FIG. 2, it comprises a single-gene information database (410) and a multi-gene information database (420).

The single-gene information database (410) stores genotypes for single genes and comprises a single-gene Haplo map (414) and single-gene Haplo Frequency information (412).

As shown in FIG. 6, the single-gene Haplo map (414) stores, for the same gene across the entire control group, the distribution of variants grouped (clustered) by share of the population. It summarizes, for each gene, the haplotype calculation for 26 world populations, the frequency of specific traits, and the frequency in each sub-population.

The single-gene Haplo Frequency information (412) stores information about each of these variants. It may be data that stores the variant information directly, or it may consist of identifiers indicating the location of information stored in the Reference DB (800), described later. That is, the single-gene Haplo Frequency information (412) provides, for 39,000 human genes and 5,000 individuals from world populations, the frequency in each gene together with various disease-association annotations.

The multi-gene information database (420) provides variant distributions and information for multiple genes and comprises a multi-gene Haplo map (424) and multi-gene Haplo Frequency information (422).

For genetic characteristics whose phenotype is determined by multiple genes, the multi-gene Haplo map (424) stores, for each phenotype, the distribution of variants at the relevant bases across the entire control group, clustered by share of the population. It summarizes the haplotype calculation for 26 world populations using the causal variants of each phenotype, the frequency of specific traits, and the frequency in each sub-population.

The multi-gene Haplo Frequency information (422) stores information about each of these variants. It too may be data that stores the variant information directly, or it may consist of identifiers indicating the location of information stored in the Reference DB (800), described later.

That is, the multi-gene Haplo Frequency information (422) provides, for 39,000 human genes and 5,000 individuals from world populations, the frequencies of phenotype-associated gene sets together with various disease-association annotations.

The X axis of the HaploScan DB (400) is the 3-billion-base sequence, which contains 39,000 genes. In this schema, if N variants have been found in a particular gene (i), those variants can be clustered across the Y axis (5,000 individuals) using both haplotype and genotype, and the clustered form is the HaploMap.

Each cluster represents one genotype. For example, the first, GP*47*0, is a genotype that accounts for 47% of the world population and differs from the world average by 0 bits (that is, it is identical to it). The second, GP*25*1, accounts for 25% of the world population and differs from the world average by 1 bit.
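The cluster labels described above can be reproduced with a small sketch. It assumes, purely for illustration, that each individual's variants in a gene are encoded as a bit string, that identical bit strings form one cluster, and that the most frequent pattern stands in for the "world average"; the function name `haplo_cluster_labels` is hypothetical.

```python
from collections import Counter
from typing import List


def haplo_cluster_labels(patterns: List[str]) -> List[str]:
    """Group identical variant bit-patterns and label each group GP*<percent>*<bits>.

    <percent> is the cluster's share of all individuals; <bits> is the Hamming
    distance to the most frequent pattern (an assumed proxy for the average).
    """
    counts = Counter(patterns)
    reference, _ = counts.most_common(1)[0]
    labels = []
    for pattern, n in counts.most_common():
        percent = round(100 * n / len(patterns))
        bits = sum(a != b for a, b in zip(pattern, reference))
        labels.append(f"GP*{percent}*{bits}")
    return labels


# Toy cohort of 100 individuals with three variant sites in one gene.
cohort = ["000"] * 47 + ["010"] * 25 + ["011"] * 18 + ["111"] * 10
labels = haplo_cluster_labels(cohort)
```

With this toy cohort the first two labels match the examples in the text, GP*47*0 and GP*25*1.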

Multi-gene HaploMaps are classified in the same way.

The ADISCAN DB (500) stores the genome information of the control population. Specifically, genome information made public through global genome projects can be used as the population genome.

The ADISCAN DB (500) stores whole-genome information for the control population, and may store it partitioned by criteria that form genotype groups, such as ethnicity.

The partition by ethnicity may use 5 major categories or 26 subcategories, so that ethnicity-specific genetic characteristics are reflected when determining or detecting variant genes.

The IDA DB (600) stores known diseases and their associated genetic variants. For each disease, the related gene variant information and the literature supporting it are organized and stored.

The BAV DB (700) stores the genetic information that determines the amino acids at the binding sites of various proteins.

Specifically, for protein-drug, protein-DNA, and protein-protein binding, it stores the amino acids that affect the binding and the genetic information that affects those amino acids.

Accordingly, if many variants occur in the bases encoding the amino acids that govern the binding of a particular metabolite, the subject of the analysis data is more likely to have difficulty processing that metabolite normally in the body.

That is, the BAV DB (700) stores variants, including known disease variants, that are predicted to affect a protein's drug-binding site, promoter site, or activity in the bound state.

The BAV DB (700) is a database of bioactivity-related genetic information and stores information on resistance and susceptibility of genes to drugs, metabolites, and foods. The BAV DB (700) can also be built by linking to published data of established reliability. For example, it can use the roughly 6,000 drug entries published in the drug bank (interacting proteins, binding region information, etc.), the roughly 12,000 metabolite entries published in the metabolite bank (interacting proteins, binding region information, etc.), and information on the drug-metabolism-related variant positions of about 200 genes in DMET (drug metabolizing enzyme and transporter gene).

The reference DB (800) stores information on known genome variants and can be built by linking to public information databases as well as to the literature.

For example, PheWAS-GWAS (genome-wide association study) data and eMERGE (Electronic Medical Records and Genomics) data may be applied to the reference DB.

Although not shown, the system may further comprise a clinical information DB that stores the environmental predisposition information of the test subject, which should be considered together with genetic characteristics so that the search control unit (200) can derive clinical-information-based predictions of disease causes.

The clinical information DB stores the individual's environmental factor result data together with population averages and reference information.

The individual's environmental factor result data may be clinical information such as the individual's comprehensive health checkup data, and the population averages and reference information may draw on the results of the community cohort study provided by the Korea Centers for Disease Control and Prevention.

The genetic information analysis method using an individual's whole genome according to Patent Documents 8, 9, and 10 is described in detail below.

The genetic information analysis method using an individual's whole genome according to Patent Documents 8, 9, and 10 begins with the analysis data input unit receiving the analysis data (DNA sequencing data) to be analyzed (S100).

The analysis data may be provided in Dumy form, consisting of DNA fragments. In that case, as shown in FIG. 15, the present invention applies high-density indexing to the provided Dumy data to generate and store the DNA sequencing data as an RVR file.

The genetic information analysis method using an individual's whole genome according to the present invention then performs four broad types of analysis, depending on the analysis target.

That is, genetic information analysis using an individual's whole genome according to Patent Documents 8, 9, and 10 performs four analyses: 1) genotype determination (S200), 2) rare variant calculation (S300), 3) disease variant calculation (S400), and 4) bioactive variant calculation (S500). Each is described in detail below.

[Genotype determination]

The HaploScan engine (210) compares the DNA sequencing data with the Haplo Frequency (412) and MAP (414) stored in the HaploScan DB (400) and detects, for single genes and for phenotypes, the cluster to which the genotype belongs and the information associated with it.

Specifically, the HaploScan engine (210) compares the i-th gene of the DNA sequencing data with the i-th gene information in the single-gene Haplo Frequency (412) (S211) and determines which cluster of the single-gene classification in the single-gene Haplo MAP (414) the i-th gene of the personal genome under analysis belongs to (S213, S215).

이후, 상기 HaploScan 엔진(210)은 i=1 부터 마지막까지(약 i=39,000) 반복하여 분석데이터의 전체 유전자에 대한 유전형을 판별한다(S217, S219).Thereafter, the HaploScan engine 210 repeats i = 1 to the end (about i = 39,000) to determine genotypes for all genes in the analysis data (S217 and S219).

또한, 상기 HaploScan 엔진(210)은 상기 DNA sequencying을 상기 다중유전자 Haplo Frequency(422)와 대비하여(S221), 각 표현형에 대한 분석 대상 유전체의 다수 유전자의 조합이 다중유전자 Haplo MAP(424)에 분류된 다중 유전자 조합의 분류중 어느 군집에 포함되는지 여부를 판별한다(S223, S225).In addition, the HaploScan engine 210 compares the DNA sequencying with the multigene Haplo Frequency 422 (S221), and a combination of multiple genes of the genome to be analyzed for each phenotype is classified into the multigene Haplo MAP 424. It is determined in which cluster among the classification of the multiple gene combinations included (S223, S225).

이에서도 역시, 상기 HaploScan 엔진(210)은 다중유전자정보 데이터베이스(420)에 저장된 모든 표현형에 대하여 반복하여 분석데이터의 유전형을 판별한다(S227, S229).In this case, the HaploScan engine 210 repeatedly determines all genotypes of the analysis data for all phenotypes stored in the multigene information database 420 (S227 and S229).

이와 같은 HaploScaning 과정을 통해 분석 대상 유전체에 포함된 단일 유전자 변이 및 다중 유전자 변이에 따른 유전형을 정의할 수 있다.Through this HaploScaning process, it is possible to define genotypes according to single gene mutations and multiple gene mutations included in the target genome.

[Rare variant calculation]

A rare variant is a base change caused by a highly unusual specific genetic variation and is often associated with a rare disease. Detecting the presence or absence of a variant at a specific base, or a difference there, makes it possible to assess, for example, the likelihood of developing a rare disease.

To this end, as shown in FIG. 14, the ADISCAN engine 220 first selects a control group (S310).

The control group is the reference population against which the rarity of a variant is judged; it may be restricted to a specific ethnic group or to a specific country.

The ADISCAN engine 220 then calculates a variation index for the base at a given locus against the corresponding base in the control DB using the ADISCAN method, and repeats this over the whole genome (from n = 1 to approximately n = 3 billion) (S320, S330, S340).

The rarity of the bases is thereby calculated for the entire base sequence (S350).

ADISCAN (allelic depth and imbalance scanning), used for the rare variant calculation, is a technique for screening markers that distinguish normal from abnormal genes. The judgment is based on the allelic depth-product tangent difference, the allelic squared difference, the allelic absolute-value difference, the geometric allelic difference, the statistical allelic difference, or the allelic imbalance ratio.

[Disease variant calculation]

For disease variant detection, the IDA search engine 230 compares the analysis data with the variant information in the IDA DB 600 and determines the risk of the corresponding disease (S410).

After the analysis data has been checked in this way against every disease in the IDA DB (S420), the diseases associated with significant variants are output (S430).

[Bioactive variant calculation]

For bioactive variant detection, the bioactive variant search engine 240 searches the BAV DB (bioactive variant DB) (S510) and retrieves information on the amino acids involved in protein binding (S520).

Here, protein binding includes protein-drug, protein-DNA, and protein-protein binding, and the amino acid information includes information on the bases associated with each amino acid.

The bioactive variant search engine 240 then compares the bases in the amino acid information with the analysis data and detects the amino acids that carry a variant in the analysis data, along with related metabolite information and the like (S530, S540).

The bioactive variant search engine 240 repeats variant detection over all amino acids and integrates the detected information to produce the bioactive variant information (S550, S560).

The search control unit 200 then integrates the determined or calculated genotypes, rare variants, disease variants, and bioactive variants to generate a result report for the user (S600).

If the subject's clinical information has been provided, the search control unit 200 can use it to calculate and provide a clinical-information-based disease cause.

Specifically, predicting the cause of a disease requires personal health records (PHR), which contain the outcomes of current environmental factors (comprehensive health checkup data and clinical information). In particular, population averages and reference information are needed for the environmental factors (in the present invention, the reference information draws on the results of the second-phase community cohort study provided by the Korea Centers for Disease Control and Prevention). The linkage between these environmental-factor outcomes and the genotype is called a PHR-trait.

The detection formula for the disease cause relationship (Π) uses logistic regression analysis: the variable χ is a value determined by the genotype, rare variants, disease variants, and bioactive variants calculated as described above, and the variable β is a value determined from the PHR.

That is, the disease cause relationship makes it possible to calculate the association between Gene, Disease, or Drug genotypes (groups or clusters of genotypes) and the PHR (BMI, AGE, SEX, etc.).

Accordingly, the whole-gene-based disease cause is calculated from the association between the current clinical condition (normal, disease, or phenotype) and the Gene, Disease, and Drug genotypes calculated from the 39,000 genes.
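The detection formula itself is not reproduced in this text. As a reading aid only, a generic logistic regression model consistent with the description above can be written as follows. The index k, the number of predictors K, and the intercept β_0 are assumptions introduced for illustration and do not come from the patent.

```latex
\Pi = \frac{\exp\left(\beta_0 + \sum_{k=1}^{K} \beta_k \chi_k\right)}
           {1 + \exp\left(\beta_0 + \sum_{k=1}^{K} \beta_k \chi_k\right)},
\qquad
\log\frac{\Pi}{1-\Pi} = \beta_0 + \sum_{k=1}^{K} \beta_k \chi_k
```

Read this way, each χ_k would be a genotype-derived value and each β_k a coefficient tied to the PHR, which matches the roles the text assigns to χ and β; the patent's actual formula may differ.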

Meanwhile, the disease cause discovery system according to Patent Documents 8, 9, and 10 generates reporting data from the calculated genetic variant information.

The result reports differ somewhat depending on the output, but they basically use a Manhattan plot and a radial variation chart to visualize variant genes.

The Manhattan plot is a graph in which, for 39,000 genes, the reference genes of the genome project are classified by genotype on the basis of the non-synonymous (non-sym) variants of all known SNPs, and the accumulated values are visualized as points.

Marking the genes of the genome under analysis on this plot makes it easy to see how its variants stand out relative to the control group.

Such a Manhattan plot makes it easy to identify not only the variant loci but also the degree of variation.

The significant variants shown on the Manhattan plot can in turn be displayed on a radial variation chart according to their degree of variation and genetic characteristics.

Displaying the degree of genetic variation of the genome under analysis together with the control-group average shows that degree of variation clearly, and genetic characteristic information can additionally be included when generating the result report.

The result report generated as described above is provided through the result report providing unit.

Technical summary of Korean Patent Application No. 10-2016-0096996 (Patent Document 11)

The haplotyping system and method according to Patent Document 11 can be applied to haplotyping of the human whole genome or to the SNPs of a specific region.

Here, a specific region means a gene (or combination of genes) region involved in performing a specific function. Representative examples are the human leukocyte antigen (HLA) gene region, which regulates the human immune system; the DMET gene region, which handles drug metabolism; the KIR gene region, which is involved in immune cell expression; and the ABO gene region, which relates to blood characteristics.

Patent Document 11 can therefore be applied not only to haplotyping of the human whole genome but also to haplotyping of specific regions, such as HLA typing, DMET typing, KIR typing, and ABO typing.

DMET (Drug Metabolizing Enzymes and Transporters) refers to the protein enzymes and transporters involved in drug absorption, disposition, and drug action.

Examples include the cytochrome P450 enzyme family (CYPs), uptake transporters, and efflux transporters. Each family contains several genes whose sequences are similar to one another yet polymorphic.

Differences in DMET gene sequences between individuals affect drug response, side effects, and disease susceptibility, and can also serve as a basis for choosing an appropriate drug, which is why they have recently drawn attention in pharmacogenetics.

KIRs (Killer-cell Immunoglobulin-like Receptors) are proteins expressed on the surface of certain immune cells such as natural killer (NK) cells and T cells.

By interacting with major histocompatibility complex (MHC) class I molecules on the surface of other cells, KIRs regulate the cell-killing ability of NK cells and T cells.

This function of KIRs is therefore related to susceptibility and response to infection, autoimmune disease, cancer, and the like.

KIRs are highly polymorphic: the gene sequences differ greatly between individuals, and the number and types of genes each individual carries also differ.

ABO is the gene that plays the main role in determining ABO blood type and transfusion compatibility. It is located at chromosome 9q34, and conventional serological methods can distinguish three alleles (types A, B, and O).

Each of the A, B, and O alleles also has subgroups, and in rare cases transfusion is not possible between subgroups even within the same blood type.

In the following, HLA typing is described as a representative embodiment to keep the description concrete.

The major histocompatibility complex (MHC) region is one of the most complex regions of the human genome and is responsible for regulating the immune system. Within it, the human leukocyte antigen (HLA) genes lie in a stretch of about 3 Mbp on chromosome 6 and play a major role in the adaptive immune response, which suppresses and eliminates pathogens.

From a clinical standpoint, the risk of rejection in organ transplantation can be reduced when the HLA genes of the donor and the recipient are similar. Accurate HLA typing is therefore very important.

However, accurate HLA typing is very difficult because HLA genes are highly polymorphic, are in linkage disequilibrium, and share sequence similarity with one another.

For example, thousands of alleles have been reported in the IMGT/HLA database for exons 2-4 of the class I HLA-A gene, and the alleles of the HLA-A, HLA-B, and HLA-C genes are very similar to one another.

Even antigen peptides that are identical at low resolution (2 digits) can provoke an allogeneic response because of amino acid differences, so HLA typing is needed at high resolution (4 digits), that is, at the amino acid level.
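The difference between 2-digit and 4-digit resolution can be illustrated with a small sketch. It assumes allele names in the colon-separated IMGT/HLA style (for example `A*02:01:01:01`); the function name `hla_resolution` is hypothetical and not part of the patent.

```python
def hla_resolution(allele: str, digits: int) -> str:
    """Truncate an HLA allele name to 2-digit or 4-digit resolution.

    Assumes colon-separated names such as 'A*02:01:01:01', where each
    field is conventionally counted as two digits.
    """
    if digits not in (2, 4):
        raise ValueError("digits must be 2 or 4")
    gene, _, fields = allele.partition("*")
    parts = fields.split(":")
    n_fields = digits // 2
    if not gene or len(parts) < n_fields or not all(parts[:n_fields]):
        raise ValueError(f"cannot reduce {allele!r} to {digits}-digit resolution")
    return gene + "*" + ":".join(parts[:n_fields])


# Two alleles that agree at 2-digit resolution but differ at 4-digit resolution.
a, b = "A*02:01:01:01", "A*02:06:01"
same_low = hla_resolution(a, 2) == hla_resolution(b, 2)
same_high = hla_resolution(a, 4) == hla_resolution(b, 4)
```

Here the two names collapse to the same 2-digit type but remain distinct at 4 digits, which is the situation the paragraph above describes.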

Existing methods for high-resolution HLA typing include PCR with sequence-specific oligonucleotides (PCR-SSO) and sequence-based typing (SBT), but these methods depend on manual labor, so low throughput and high cost are problems.

The targeted amplicon sequencing (TAS) approach, by contrast, has relatively high throughput compared with the PCR method, so it can generate long reads of several hundred bases at low cost and enables highly accurate HLA typing.

However, for reasons of efficiency and cost, most data generated recently are genome-wide sequence, whole genome sequence (WGS), or whole exome sequence (WES) data, and such data consist of short sequence reads (~101 bp) rather than long reads. There is therefore a growing need for HLA typing that uses such short sequence reads and matches (or exceeds) the accuracy and economy of the TAS approach.

That is, HLA typing according to Patent Document 11 is intended to provide accurate and efficient HLA typing using short reads.

HLA typing methods that use short sequence reads fall into two broad categories.

One assembles the short reads into long contigs to determine the full HLA type. The other aligns the short sequence reads to known allele sequences used as references and then determines the actual alleles from the alignment information.

With short reads, assembly-based methods have difficulty resolving false positive allele calls caused by the phasing issue, and they also take longer.

Alignment-based methods, on the other hand, struggle to determine the actual alleles because the high polymorphism of the HLA gene region means the known alleles are very similar to one another.

Despite these problems, the topic has drawn considerable research interest: HLAreporter has been introduced as an assembly-based method and PHLAT, among others, as an alignment-based method. The recently published HLAreporter and PHLAT give more accurate HLA typing results than earlier HLA typing methods.

HLA typing according to Patent Document 11 performs highly accurate HLA typing on genome-wide short sequencing data and is referred to below as HLAscan.

Prior art such as PHLAT selects candidate alleles using aligned reads and depth of coverage, whereas HLAscan according to the present invention uses the distribution of the reads aligned to each allele.

In addition, Patent Document 11 applies an algorithm (hereinafter the 'unique-read calculation algorithm') for removing false positive alleles that are selected because of the phase issue.

When tested, HLAscan according to Patent Document 11 showed markedly better accuracy than the prior art on 11 1000 Genomes samples, 51 HapMap samples, and 5 in-house samples.

The specific configuration of HLAscan according to Patent Document 11 is described below.

HLAscan according to Patent Document 11:

1) provides a score function that considers the distribution of aligned reads for selecting candidate alleles; and

2) provides a unique-read calculation algorithm for detecting false positive alleles produced by the phase issue.

HLAscan according to Patent Document 11 is fundamentally an alignment-based approach and, as shown in FIG. 2, is divided into two main stages.

In the first stage (Tier 1), raw sequence reads generated by the NGS instrument are aligned to a whole genome reference to produce a binary alignment/map (BAM), and the sequence reads that fall in the HLA gene region are then selected.
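The read selection at the end of Tier 1 amounts to an interval-overlap filter. The sketch below works on plain tuples rather than a real BAM file; the `Read` type, the function name, and the region coordinates are all hypothetical placeholders, not values from the patent.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def select_reads_in_region(reads: List[Read], chrom: str,
                           region_start: int, region_end: int) -> List[Read]:
    """Keep reads that overlap the half-open interval [region_start, region_end)."""
    return [
        r for r in reads
        if r.chrom == chrom and r.start < region_end and r.end > region_start
    ]


# Toy data; the region below is an arbitrary placeholder, not real HLA coordinates.
reads = [
    Read("r1", "chr6", 100, 201),
    Read("r2", "chr6", 950, 1051),
    Read("r3", "chr6", 5000, 5101),
    Read("r4", "chr1", 1000, 1101),
]
selected = select_reads_in_region(reads, "chr6", 1000, 2000)
```

A read that only partly overlaps the region (such as `r2`) is kept, on the assumption that partial overlaps are still informative for the later allele-level alignment.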

The second stage (Tier 2) first aligns those sequence reads to each of the alleles in the IMGT/HLA database.

Taking HLA-A as an example, the IMGT/HLA database contains 3,182 known alleles of the HLA-A gene, and HLAscan aligns the collected sequence reads to each of these alleles as a reference.

The final alleles are then determined from the alignment information.

In determining the final alleles from the alignment information, Patent Document 11 provides a score function that considers the aligned-read distribution for selecting candidate alleles, and a unique-read calculation algorithm for resolving the phase issue when selecting the final alleles.

The score function that considers the aligned-read distribution and the unique-read calculation algorithm provided by HLAscan according to Patent Document 11 are described in detail below.

Score function considering the distribution of aligned reads

To select the true alleles from the thousands (about 8,000) of alleles stored in the IMGT/HLA database, HLAscan uses a score function that considers the distribution of aligned reads and removes the false alleles.

The alleles that remain after this step are called candidate alleles.

The score function considering the distribution of aligned reads judges the allele of a reference to be a false allele when the aligned reads are not evenly spread over that reference.

For example, suppose reads read_i (1 ≤ i ≤ n) are given, each aligned to positions s_i through e_i of a reference sequence ref. Then read_i is defined as being aligned at position (s_i + e_i)/2 of ref. The runs of consecutive reference positions to which no read is aligned are denoted noread_j (1 ≤ j ≤ m).
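The two quantities defined above, the alignment position of each read and the noread runs, can be computed as in the following sketch. It assumes 0-based inclusive coordinates and a reference of known length; the function names are hypothetical. The score function itself is not shown because its formula is not available in this text.

```python
from typing import List, Tuple


def read_positions(reads: List[Tuple[int, int]]) -> List[float]:
    """Alignment position of each read: the midpoint (s_i + e_i) / 2."""
    return [(s + e) / 2 for s, e in reads]


def noread_runs(reads: List[Tuple[int, int]], ref_len: int) -> List[Tuple[int, int]]:
    """Maximal runs of consecutive reference positions covered by no read.

    Reads are (s, e) pairs with 0-based inclusive coordinates.
    Returns (start, end) pairs, also inclusive.
    """
    covered = [False] * ref_len
    for s, e in reads:
        for pos in range(max(s, 0), min(e, ref_len - 1) + 1):
            covered[pos] = True

    runs = []
    run_start = None
    for pos, is_covered in enumerate(covered):
        if not is_covered and run_start is None:
            run_start = pos
        elif is_covered and run_start is not None:
            runs.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, ref_len - 1))
    return runs


reads = [(0, 4), (3, 9), (15, 19)]
positions = read_positions(reads)
gaps = noread_runs(reads, ref_len=25)
```

Long or numerous noread runs are what an uneven read distribution looks like in practice, which is the signal the text says the score function penalizes.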

In this case, the score function can be calculated by

[equation not included in the source text] (c is a constant).

A reference whose calculated score exceeds the threshold is judged to be the reference for a false allele, and that allele can be excluded from the candidates.

Unique-read calculation algorithm

The unique-read calculation algorithm according to Patent Document 11 includes 1) an algorithm that removes false positive candidate alleles when many candidate alleles exist, and 2) an algorithm that detects and removes candidate alleles selected because of the phase issue when three or fewer candidate alleles exist.

Based on these determinations, the unique-read calculation algorithm according to Patent Document 11 can also decide whether the final alleles are those of a homozygote or of a heterozygote.

For example, assume that sequence reads have been collected from the gene to be typed and that there are no sequencing errors.

Suppose that, among t candidate alleles allele_i (1 ≤ i ≤ t), two reads A and B with different sequences each map with a 100% match to positions x through y of different alleles allele_p and allele_q (1 ≤ p, q ≤ t), and map nowhere else. The sample's actual gene is then heterozygous, with one strand containing the sequence of read A and one strand containing the sequence of read B.

Any candidate allele to which neither read A nor read B maps with a 100% match is therefore a false positive allele and is removed.
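The filtering rule in this example can be sketched with exact substring matching standing in for "mapping with a 100% match". This is a simplification: a real implementation would use aligner output, and the function name and toy sequences are hypothetical.

```python
from typing import Dict, List


def remove_unsupported_candidates(candidates: Dict[str, str],
                                  reads: List[str]) -> Dict[str, str]:
    """Drop candidate alleles to which none of the given reads maps exactly.

    candidates maps allele name -> allele sequence. A read counts as
    mapped with a 100% match if it occurs as an exact substring.
    """
    return {
        name: seq for name, seq in candidates.items()
        if any(read in seq for read in reads)
    }


# Toy sequences: read A matches only allele_p, read B matches only allele_q.
candidates = {
    "allele_p": "ACGTACGTAA",
    "allele_q": "ACGTTCGTAA",
    "allele_r": "ACGTGCGTAA",
}
read_a, read_b = "GTACG", "GTTCG"
kept = remove_unsupported_candidates(candidates, [read_a, read_b])
```

In this toy case `allele_r` differs from both reads at the distinguishing position, so it is the candidate that gets removed.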

In addition, when three or fewer candidate alleles exist, the unique-read calculation algorithm according to Patent Document 11 counts, for each candidate allele, the sequence reads that are aligned to that allele alone, and selects the first candidate allele and the second candidate allele in order of that count. The same process is then repeated for the two selected candidate alleles.

If both candidate alleles have uniquely aligned reads, both alleles are output as the final result. This means the alleles are those of a heterozygote.

If only one candidate allele has uniquely aligned reads (that is, the reads aligned to one allele include all the reads aligned to the other), only the allele with uniquely aligned reads is output as the final result. This indicates that the allele is that of a homozygote.
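The final selection step can be sketched as follows, given for each candidate allele the set of reads aligned to it. The input format and function names are assumptions, and the sketch covers only the counting and the homozygous/heterozygous decision described above.

```python
from typing import Dict, List, Set, Tuple


def unique_read_counts(aligned: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, per allele, the reads aligned to that allele and to no other."""
    counts = {}
    for allele, reads in aligned.items():
        others = set()
        for other, other_reads in aligned.items():
            if other != allele:
                others |= other_reads
        counts[allele] = len(reads - others)
    return counts


def call_alleles(aligned: Dict[str, Set[str]]) -> Tuple[List[str], str]:
    """Pick the top two candidates by unique-read count, then recount
    within that pair and decide between heterozygous and homozygous."""
    counts = unique_read_counts(aligned)
    top_two = sorted(counts, key=counts.get, reverse=True)[:2]
    pair_counts = unique_read_counts({a: aligned[a] for a in top_two})
    with_unique = [a for a in top_two if pair_counts[a] > 0]
    if len(with_unique) == 2:
        return with_unique, "heterozygous"
    if len(with_unique) == 1:
        return with_unique, "homozygous"
    return [], "undetermined"  # case not covered by the source text


het = call_alleles({
    "allele_1": {"r1", "r2", "r3"},
    "allele_2": {"r3", "r4", "r5"},
    "allele_3": {"r3"},
})
hom = call_alleles({
    "allele_1": {"r1", "r2", "r3"},
    "allele_2": {"r2", "r3"},
})
```

The "undetermined" branch is an addition for completeness; the text does not say what happens when neither of the two candidates has a uniquely aligned read.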

The present invention is a core technology for personalized medicine platforms, a field that has recently been developing at an accelerating pace. It goes beyond the current level of partially detecting the degree of toxicity risk: through a personal genome map-based personalized medicine platform that applies standardized, reproducible, and automated technology, a commercial personalized medicine platform becomes possible, which can contribute greatly to the advancement of medicine.

Claims

To analyze a supercomputer-based integrated genome map and genomic data,
parallel distributed storage is configured differently, at a level that takes into account heterogeneous computing systems and data I/O depending on computer speed;
precomputed results and statistical tools are provided for use by multiple users of the integrated genome DB;
a MySQL database is built for multiple users by creating input files for external statistical tools;
the integrated genome DB comprises
a human genome of 3 billion bases divided into chunks of a set size, so that it can run on a supercomputer (PC cluster) and genetic variation can be analyzed by comparing genome sequences;
each chunk is
formed as a two-dimensional matrix in which genome sequences extracted from a plurality of different subjects are divided in the same order and size so that they can be compared with one another, and are arranged in subject order; and
the size of the base sequence included in each chunk is increased or decreased according to the number of subjects so that the data size is set to 10 to 20 Gbytes: a personal genome map-based personalized medicine analysis platform characterized by the above.

The platform of claim 1,
wherein the parallel distributed storage is set by performing the steps of:
(a) setting a limit on I/O; and
(b) confirming the consistency, reproducibility, and limits of the data analysis results computed on multiple nodes (multiple CPU sets).

delete